Code Your Videos With Remotion
Building a JSON-driven video production pipeline with Remotion, React, and Kokoro TTS for automated narrated content creation

Creating professional narrated videos can be time-consuming and expensive. But what if you could write your entire video in JSON, quickly generate high-quality speech audio with open-source AI, and render it programmatically with React components? That's exactly what I built for DappJak Labs.
The High-Level Workflow
The entire pipeline is driven by a single project.json file that serves as the source of truth for narration, visuals, and metadata:
```
project.json ──┬──> pnpm audio:full ──> narration.wav + scene_timings.txt
               │                                      │
               │                                      v
               │                             pnpm gen:durations
               │                                      │
               │                                      v
               │                            generatedDurations.ts
               │                                      │
               └──> pnpm dev ─────────────────────────┴──> Remotion Studio (preview)
                                                                   │
                                                                   v
                                                        pnpm render ──> video.mp4
                                                                   │
                                                                   v
                                        project.json metadata ──> Tokotube upload API
```
Key Scripts
| Command | Purpose |
|---|---|
| `pnpm audio:full+durations` | Generate speech audio + sync scene durations |
| `pnpm dev` | Launch Remotion Studio on port 3011 for live preview |
| `pnpm render` | Render the final video to the `out/` directory |
| `pnpm build` | Bundle the project for deployment |
The Challenge
I needed to create a promotional video showcasing DappJak Labs' ecosystem of decentralized applications. Traditional video editing would mean:
- Recording voiceovers (or hiring a voice actor)
- Manually syncing audio with visual scenes
- Frame-by-frame adjustments in video editing software
- Re-rendering everything for small text changes
Instead, I built a fully automated pipeline where the entire video is defined in a single JSON file, narration is generated with AI, and React components handle all the visuals.
The Architecture
The system has four main components:
1. The Project JSON - Single Source of Truth
Everything starts with project.json, which defines all 14 scenes of the video:
```json
{
  "meta": {
    "projectId": "dappjak-labs-your-unstoppable-ecosystem",
    "title": "Dappjak Labs – Your Unstoppable Ecosystem",
    "tags": ["internet-computer", "dapps", "decentralization", "icp"],
    "upload": {
      "description": "What would you build if no one could shut you down?",
      "chapters": [
        { "time": "0:00", "title": "Opening Hook" },
        { "time": "0:22", "title": "The Problem" },
        { "time": "0:43", "title": "The Journey" }
      ]
    }
  },
  "pronunciationMap": {
    "IPFS": "I P F S",
    "icOS": "I see oh S",
    "Tokotube": "toko tube",
    "DAO": "dow"
  },
  "scenes": [
    {
      "id": "hook",
      "title": "Opening Hook",
      "narrationText": [],
      "silenceDuration": 4.83,
      "onScreenText": [
        "What would you build if no one could shut it down?"
      ]
    },
    {
      "id": "intro",
      "title": "Dappjak Labs Introduction",
      "narrationText": [
        "This is Dappjak Labs",
        "an indie dev studio accelerated by AI, owned by its community, and impossible to shut down.",
        "No gatekeepers. No takedowns. Unstoppable by design."
      ],
      "onScreenText": [
        "Dappjak Labs",
        "Your Unstoppable Ecosystem"
      ]
    },
    {
      "id": "the_problem",
      "title": "The Problem",
      "narrationText": [
        "You spend years building your platform, your community, your livelihood. Then one morning—it's gone.",
        "Your hosting provider can pull the plug. Your payment processor can freeze your funds.",
        "You're building on rented land."
      ],
      "onScreenText": [
        "Your code. Their servers.",
        "Your voice. Their platform."
      ]
    }
  ]
}
```
The key innovation here is the dual-text strategy:
- `narrationText` array: the full narration to be spoken. Each element becomes a separate audio segment.
- `onScreenText` array: concise, punchy text optimized for on-screen display.
This separation allows you to optimize for both mediums independently. Spoken narration can be verbose and flowing, while visual text can be short, formatted, and impactful.
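A minimal TypeScript sketch of the scene shape this implies (field names are taken from the project.json above; the real project's types may differ, and the optionality of `silenceDuration` is my assumption):

```typescript
// Sketch of the scene schema implied by project.json.
type Scene = {
  id: string;
  title: string;
  narrationText: string[];   // full spoken script; each entry becomes one audio segment
  onScreenText: string[];    // short, punchy display text, independent of the narration
  silenceDuration?: number;  // seconds of silence for narration-free scenes
};

const hook: Scene = {
  id: 'hook',
  title: 'Opening Hook',
  narrationText: [],         // nothing is spoken here...
  onScreenText: ['What would you build if no one could shut it down?'],
  silenceDuration: 4.83,     // ...so the scene holds 4.83s of silence instead
};
```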
2. The Audio Pipeline - Kokoro TTS
The audio generation happens in Python using Kokoro, an 82M-parameter open-weight TTS model licensed under Apache 2.0.
Pronunciation Control
The pronunciationMap in project.json ensures technical terms are spoken correctly:
```json
{
  "pronunciationMap": {
    "IPFS": "I P F S",
    "Storj": "Storge",
    "icOS": "I see oh S",
    "Tokotube": "toko tube",
    "DAO": "dow",
    "jannies": "jan Ees"
  }
}
```
Audio Generation
Run the full audio pipeline with a single command:
```shell
pnpm audio:full+durations
```
This produces:
- `assets/audio/narration.wav` - Combined narration track
- `assets/audio/scene_timings.txt` - Timing metadata for Remotion
- `public/audio/narration.mp3` - Compressed version for web
- `src/scenes/generatedDurations.ts` - Auto-generated scene durations
Scene Timings
The scene_timings.txt file provides frame-accurate timing metadata:
```
Scene: 01
Files: 01_hook.wav
Start: 0.00s
End: 4.83s
Duration: 4.83s

Scene: 02
Files: 02_intro_01.wav, 02_intro_02.wav, 02_intro_03.wav
Start: 6.33s
End: 20.68s
Duration: 14.35s

Scene: 07
Files: 07_dapp_tokotube_01.wav through 07_dapp_tokotube_06.wav
Start: 117.38s
End: 161.73s
Duration: 44.35s

Total Duration: 366.00s
```
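The actual `gen:durations` script isn't shown in this post, but the parsing step it implies can be sketched like this (consuming only the `Duration:` lines is my assumption):

```typescript
// Sketch: pull per-scene durations out of scene_timings.txt, the way
// `pnpm gen:durations` presumably does before emitting generatedDurations.ts.
function parseDurations(timings: string): number[] {
  const durations: number[] = [];
  for (const line of timings.split('\n')) {
    // Matches "Duration: 4.83s" but not the trailing "Total Duration: 366.00s"
    const m = line.match(/^Duration:\s*([\d.]+)s/);
    if (m) durations.push(parseFloat(m[1]));
  }
  return durations;
}

const sample = [
  'Scene: 01',
  'Duration: 4.83s',
  'Scene: 02',
  'Duration: 14.35s',
  'Total Duration: 366.00s',
].join('\n');

parseDurations(sample); // → [4.83, 14.35]
```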
3. The Remotion Integration - React Meets Video
Now comes the magic: Remotion lets you write videos using React components.
The Scene Registry
Each scene is defined in src/scenes/index.tsx:
```typescript
export type SceneEntry = {
  id: string;
  label: string;
  durationInSeconds: number;
  Component: React.FC<SceneComponentProps>;
};

export const sceneEntries: SceneEntry[] = [
  { id: 'hook', label: 'Hook', durationInSeconds: 4.83, Component: Hook },
  { id: 'intro', label: 'Intro', durationInSeconds: 14.35, Component: Intro },
  { id: 'the_problem', label: 'Problem', durationInSeconds: 19.90, Component: TheProblem },
  // ... 11 more scenes
];
```
The scene durations are auto-generated from scene_timings.txt via pnpm gen:durations, which outputs generatedDurations.ts.
Scene Components Use Project Data
Individual scene components import project.json directly:
```tsx
import { AbsoluteFill } from 'remotion';
import project from '../../project.json';
import { getSceneTextInfoFromScene } from './sceneText';

export const TheProblem: React.FC<SceneComponentProps> = ({ textOpacity }) => {
  const scene = project.scenes.find((s) => s.id === 'the_problem');
  const { onScreenText } = getSceneTextInfoFromScene(scene, 'The Problem');
  return (
    <AbsoluteFill style={{ ...baseBackgroundStyle, opacity: textOpacity }}>
      <div style={{ fontSize: 72, fontWeight: 'bold' }}>
        {onScreenText.map((text, i) => (
          <div key={i}>{text}</div>
        ))}
      </div>
    </AbsoluteFill>
  );
};
```
QR Code Integration for dApp Spotlights
Scene definitions can include QR code configurations for interactive elements:
```json
{
  "id": "dapp_tokotube",
  "qrConfig": {
    "qrImagePath": "assets/dapps/tokotube/qrcode.png",
    "logoImagePath": "assets/dapps/tokotube/tokotube.gif",
    "title": "TokoTube",
    "subtitle": "YouTube without the gatekeepers",
    "linkText": "Scan the QR code to open Tokotube."
  }
}
```
4. Tokotube Platform Integration
The rich metadata in project.json doesn't just drive video rendering—it can also automate platform uploads:
```json
{
  "meta": {
    "upload": {
      "description": "🚀 What would you build if no one could shut you down?",
      "shortDescription": "Dappjak Labs: Your unstoppable ecosystem",
      "chapters": [
        { "time": "0:00", "title": "Opening Hook" },
        { "time": "0:22", "title": "The Problem" },
        { "time": "1:08", "title": "The Discovery" }
      ],
      "links": {
        "website": "https://dappjak.com",
        "tokotube": "https://tokotube.com",
        "github": "https://github.com/dappjak-labs"
      },
      "creator": {
        "name": "Dappjak Labs",
        "revenueShare": [
          { "principal": "aaaaa-aa", "username": "dappjaklabs", "percentage": 70 },
          { "principal": "bbbbb-bb", "username": "collaborator1", "percentage": 20 }
        ]
      }
    },
    "publishing": {
      "visibility": "public",
      "category": "Science & Technology",
      "madeForKids": false
    }
  }
}
```
This enables a fully automated upload pipeline:
| Metadata Field | Tokotube API Mapping |
|---|---|
| `upload.description` | Video description |
| `upload.chapters` | Chapter markers |
| `upload.links` | Link cards / end screen |
| `upload.creator.revenueShare` | On-chain revenue splits |
| `publishing.category` | Platform category |
| `tags` | Searchable tags |
The vision: `pnpm render && pnpm upload` — from JSON script to published video in minutes.
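To make the mapping concrete, here is a hypothetical sketch of shaping that metadata into an upload payload. The payload type is my invention; only the field mapping follows the table above, and the real Tokotube API may differ:

```typescript
// Hypothetical payload shape for a Tokotube upload; the real API may differ.
type UploadPayload = {
  description: string;
  chapters: { time: string; title: string }[];
  tags: string[];
  category: string;
};

// `project` is the parsed project.json; typed loosely here for brevity.
function toUploadPayload(project: any): UploadPayload {
  return {
    description: project.meta.upload.description,
    chapters: project.meta.upload.chapters,
    tags: project.meta.tags,
    category: project.meta.publishing.category,
  };
}
```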
The Complete Workflow
Here's the end-to-end process:
1. Write Your Video in JSON
```shell
# Edit project.json
# Define all scenes with both spoken and visual text
# Add pronunciation hints for technical terms
# Configure metadata for platform upload
```
2. Generate Audio
```shell
pnpm audio:full+durations
```
This single command:
- Generates individual WAV files for each narration segment
- Concatenates them with silence padding between segments
- Outputs `scene_timings.txt`
- Generates `generatedDurations.ts` for Remotion
- Creates a compressed MP3 for web playback
3. Preview in Remotion Studio
```shell
pnpm dev
```
Opens Remotion Studio with:
- Live preview with scrubbing
- Clickable timeline markers for each scene
- Hot reload when you change components
- Frame-accurate playback
4. Render Final Video
```shell
pnpm render
```
Outputs to the `out/` directory with these specifications:
- Resolution: 1920x1080 (Full HD)
- Frame rate: 30 fps
- Total duration: ~6 minutes (14 scenes)
- Codec: H.264
Key Technical Details
Video Specifications
- Resolution: 1920x1080 (Full HD)
- Frame rate: 30 fps
- Total duration: ~6 minutes (366 seconds)
- Audio: 24kHz WAV (Kokoro TTS standard)
- 14 scenes, 53 audio segments
Frame-Based Synchronization
The system uses purely computational frame counting - no state management needed:
```typescript
// For frame 1000 at 30fps:
// Scene 1 (hook):        frames 0-145    (4.83s)
// Scene 2 (intro):       frames 145-576  (14.35s)
// Scene 3 (the_problem): frames 576-1173 (19.90s)
//
// Frame 1000 falls in Scene 3.
// Scene progress: (1000 - 576) / (1173 - 576) ≈ 0.71 (71% through the scene)
```
This is deterministic and reproducible - the same frame always produces the same output.
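That lookup is a few lines of arithmetic. A self-contained sketch, with durations hardcoded from the first three scenes above (the project's actual helper isn't shown in this post):

```typescript
// Sketch of the frame-to-scene math described above.
const FPS = 30;
const durationsInSeconds = [4.83, 14.35, 19.9]; // hook, intro, the_problem

function sceneAtFrame(frame: number): { sceneIndex: number; progress: number } | null {
  let start = 0;
  for (let i = 0; i < durationsInSeconds.length; i++) {
    const end = start + Math.round(durationsInSeconds[i] * FPS);
    if (frame < end) {
      return { sceneIndex: i, progress: (frame - start) / (end - start) };
    }
    start = end;
  }
  return null; // past the last scene
}

sceneAtFrame(1000); // scene index 2 (the_problem), ~71% through
```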
Animation Patterns
All scenes use a consistent fade pattern via interpolate:
```typescript
const textOpacity = interpolate(
  sceneProgress,
  [0, 0.1, 0.9, 1], // input range
  [0, 1, 1, 0],     // output range
  { extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
```
This creates a professional "fade in → hold → fade out" effect.
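Written out as plain math, the curve looks like this (a minimal stand-in for Remotion's `interpolate`, just to show the shape, not the library's implementation):

```typescript
// Minimal stand-in for the clamped interpolate() call above,
// mapping scene progress [0..1] to opacity. For illustration only.
function fade(progress: number): number {
  const p = Math.min(1, Math.max(0, progress)); // clamp, like extrapolate 'clamp'
  if (p < 0.1) return p / 0.1;       // fade in over the first 10%
  if (p > 0.9) return (1 - p) / 0.1; // fade out over the last 10%
  return 1;                          // hold at full opacity in between
}

fade(0.05); // ≈ 0.5 (halfway through the fade-in)
fade(0.5);  // → 1
fade(0.95); // ≈ 0.5 (halfway through the fade-out)
```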
Why This Approach Works
1. Content as Data
The entire video is defined in JSON. Need to fix a typo? Edit the JSON, regenerate audio (10 seconds), re-render (5 minutes). No manual video editing required.
2. Open-Source AI
Kokoro TTS is Apache 2.0 licensed with surprisingly good quality for an 82M parameter model. No API costs, no rate limits, runs on CPU.
3. React Component Reusability
Common patterns (card layouts, icon grids, text animations) are extracted into reusable components. The 14 scenes share a consistent visual language.
4. Frame-Accurate Synchronization
Remotion's frame-based architecture ensures perfect audio-visual sync. No drift, no manual alignment.
5. Platform-Ready Metadata
The same JSON that defines your video also contains everything needed for automated platform uploads—chapters, descriptions, tags, revenue splits.
Results
The final video:
- Duration: ~6 minutes (366 seconds)
- Scenes: 14 distinct sections
- Audio segments: 53 narration clips
- Resolution: 1920x1080
- Total render time: ~3 minutes on a modern CPU
- Edit-to-render time: ~10 seconds (JSON edit) + 3 minutes (render)
Compare to traditional video production:
- Record voiceover: 1-2 hours (multiple takes, editing)
- Create visuals: 4-8 hours (motion graphics, transitions)
- Edit and sync: 2-4 hours (timeline editing, fine-tuning)
- Total: 7-14 hours
With this pipeline:
- Write narration JSON: 1-2 hours
- Generate audio: 10 seconds
- Build React components: 3-4 hours (reusable for future videos)
- Render: 3 minutes
- Total: ~4-5 hours (and future videos even faster)
Future: Fully Automated Publishing
The next evolution is connecting the metadata directly to the Tokotube upload API:
```shell
# Future workflow
pnpm render && pnpm upload:tokotube

# This would:
# 1. Upload the rendered video
# 2. Set the description from meta.upload.description
# 3. Create chapter markers from meta.upload.chapters
# 4. Configure revenue splits from meta.upload.creator.revenueShare
# 5. Set visibility, category, and tags from meta.publishing
```
Same JSON. Same pipeline. From script to published video with two commands.
Conclusion
Building this JSON-driven video pipeline transformed video production from a manual, time-intensive process into an automated, reproducible workflow. The combination of Remotion (React for video), Kokoro TTS (open-weight speech synthesis), and a data-first architecture creates a system that's:
- Fast: 10-second edits instead of hours of re-editing
- Scalable: Reusable components for future videos
- Open: No proprietary formats or vendor lock-in
- Cost-effective: Zero API costs, runs on commodity hardware
- Developer-friendly: TypeScript, Git, hot reload
- Platform-ready: Rich metadata for automated uploads
If you're creating educational content, product demos, or promotional videos with narration, this architecture is worth exploring. The initial setup investment pays off quickly as you build your component library and perfect your workflow.
The future of video production is code. And it's surprisingly accessible.
Resources
- Remotion: https://www.remotion.dev/
- Kokoro TTS: https://github.com/hexgrad/kokoro
- Tokotube: https://tokotube.com
- DappJak Labs: https://dappjak.com
Happy coding... your videos!

