
Code Your Videos With Remotion

Building a JSON-driven video production pipeline with Remotion, React, and Kokoro TTS for automated narrated content creation

Creating professional narrated videos can be time-consuming and expensive. But what if you could write your entire video in JSON, quickly generate high-quality speech audio with open-source AI, and render it programmatically with React components? That's exactly what I built for DappJak Labs.

The Challenge

I needed to create a 10-minute promotional video showcasing DappJak Labs' ecosystem of decentralized applications. Traditional video editing would mean:

  • Recording voiceovers (or hiring a voice actor)
  • Manually syncing audio with visual scenes
  • Frame-by-frame adjustments in video editing software
  • Re-rendering everything for small text changes

Instead, I built a fully automated pipeline where the entire video is defined in a single JSON file, narration is generated with AI, and React components handle all the visuals.

The Architecture

The system has three main components:

1. The Narration JSON - Single Source of Truth

Everything starts with narration.json, which defines all 17 scenes of the video:

		{
  "scenes": [
    {
      "id": "intro",
      "title": "Introduction",
      "text": [
        "Welcome to Dappjak Labs, crafting dapps on the Unstoppable Internet."
      ],
      "renderedText": [
        "Welcome to Dappjak Labs",
        "dApp crafting on the Unstoppable Internet."
      ]
    },
    {
      "id": "about_dappjak",
      "title": "About DappJak",
      "text": [
        "When big-tech platforms can deplatform you at will, where do you turn?",
        "This question sent DappJak searching for something better.",
        "His journey took him through various decentralized technologies...",
        "Then he discovered ICP, the Internet Computer Protocol.",
        "It's a blockchain network that can host inexpensive, full-stack...",
        "The vision crystallized: build tools, dapps, and infrastructure..."
      ],
      "renderedText": [
        "Deplatformed by big tech. Content deleted. Accounts suspended...",
        "Experimented with IPFS Storj Sia FileCoin BTFS...",
        "ICP Discovery: A Sveltekit frontend, Motoko smart contract...",
        "The vision: Build unstoppable tools, dApps, and infrastructure..."
      ]
    }
  ]
}
	

The key innovation here is the dual-text strategy:

  • text array: Full narration that will be spoken. Each element becomes a separate audio segment.
  • renderedText array: Concise, punchy text optimized for on-screen display.

This separation lets you optimize each medium independently: spoken narration can be verbose and flowing, while on-screen text stays short, formatted, and impactful.
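Because both the audio pipeline and the Remotion components read this one file, a quick schema check before generating anything catches broken scenes early. Here is a minimal sketch (the validate_narration.py name is hypothetical; the field names follow the example above):

# validate_narration.py - quick schema check before generating audio
import json
import sys

def validate(path: str) -> None:
    with open(path) as f:
        scenes = json.load(f)["scenes"]

    for i, scene in enumerate(scenes, start=1):
        # Every scene needs an id plus parallel spoken/visual text arrays
        assert scene.get("id"), f"Scene {i} is missing an id"
        assert isinstance(scene.get("text"), list) and scene["text"], \
            f"Scene {scene['id']} has no spoken text"
        assert isinstance(scene.get("renderedText"), list) and scene["renderedText"], \
            f"Scene {scene['id']} has no rendered text"

    print(f"{len(scenes)} scenes look valid")

if __name__ == "__main__":
    validate(sys.argv[1] if len(sys.argv) > 1 else "narration.json")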

2. The Audio Pipeline - Kokoro TTS

The audio generation happens in Python using Kokoro, an 82M-parameter open-weight TTS model licensed under Apache 2.0.

Step 1: Text-to-Speech Generation

The json_to_speech.py script converts narration text into individual WAV files:

from kokoro import KPipeline
import soundfile as sf
import json
import sys
from pathlib import Path

# Load the scenes from narration.json (path passed on the command line)
narration_path = sys.argv[1] if len(sys.argv) > 1 else 'narration.json'
with open(narration_path) as f:
    narrations = json.load(f)['scenes']

# Write the individual WAV files into speech_output/
output_dir = Path('speech_output')
output_dir.mkdir(exist_ok=True)

# Initialize with American English voice
pipeline = KPipeline(lang_code='a', repo_id='hexgrad/Kokoro-82M', device='cpu')

# Generate speech for each text segment
for i, narration in enumerate(narrations):
    text_segments = narration.get('text', [])
 
    for segment_idx, text in enumerate(text_segments):
        # Use scene index for ordering
        scene_prefix = f"{i+1:02d}"
        narration_id = narration.get('id', f'narration_{i+1:03d}')
 
        # Generate filename with proper ordering
        if len(text_segments) == 1:
            filename = f"{scene_prefix}_{narration_id}.wav"
        else:
            filename = f"{scene_prefix}_{narration_id}_{segment_idx + 1:02d}.wav"
 
        # Generate speech using af_heart voice
        generator = pipeline(text, voice='af_heart')
 
        # Extract audio from generator
        for graphemes, phonemes, audio in generator:
            if audio is not None:
                sf.write(output_dir / filename, audio, 24000)  # 24 kHz sample rate
	

This produces 67 individual audio files like:

  • 01_intro.wav (single segment scene)
  • 02_about_dappjak_01.wav through 02_about_dappjak_06.wav (multi-segment scene)
  • 06_team_01.wav through 06_team_10.wav (10 segments for the team scene)

Step 2: Audio Concatenation

The concatenate_audio.py script merges all audio files into a single track with professional silence padding:

		import numpy as np
import soundfile as sf
 
def concatenate_audio_files(audio_dir='speech_output', silence_duration=1.5, segment_silence=0.5):
    # Auto-discover audio files sorted lexicographically
    all_audio_files = get_all_audio_files_sorted(audio_dir)
    scene_groups = group_audio_files_by_scene(all_audio_files)
    sorted_scene_names = sorted(scene_groups.keys())

    all_audio_data = []
    current_time = 0.0  # running offset, used when writing scene_timings.txt

    for scene_name in sorted_scene_names:
        scene_files = scene_groups[scene_name]
        scene_audio_data = []
 
        # Process each segment of the scene
        for j, file_path in enumerate(scene_files):
            data, sample_rate = sf.read(file_path)
            scene_audio_data.append(data)
 
            # Add silence between segments (0.5s)
            if j < len(scene_files) - 1:
                silence = np.zeros(int(segment_silence * sample_rate))
                scene_audio_data.append(silence)
 
        # Combine all segments for this scene
        scene_combined = np.concatenate(scene_audio_data)
        all_audio_data.append(scene_combined)
 
        # Add silence between scenes (1.5s)
        if scene_name != sorted_scene_names[-1]:
            silence = np.zeros(int(silence_duration * sample_rate))
            all_audio_data.append(silence)
 
    # Concatenate all scenes into final audio
    combined_audio = np.concatenate(all_audio_data)
    sf.write('combined_narration.wav', combined_audio, sample_rate)
	

The script also generates scene_timings.txt with frame-accurate timing metadata:

		Scene: 01
  Files: 01_intro.wav
  Start: 0.00s
  End: 2.25s
  Duration: 2.25s

Scene: 02
  Files: 02_description_01.wav, 02_description_02.wav
  Start: 3.75s
  End: 25.80s
  Duration: 22.05s

	

3. The Remotion Integration - React Meets Video

Now comes the magic: Remotion lets you write videos using React components.

The Video Engine: NarrationVideo.tsx

This component is the brain of the operation. It:

  1. Tracks the current frame
  2. Determines which scene should be active
  3. Calculates scene progress for animations
  4. Renders the appropriate scene component

import { AbsoluteFill, Html5Audio, interpolate, staticFile, useCurrentFrame, useVideoConfig } from 'remotion';
import { sceneEntries } from './scenes';

// Combined narration track - assumes combined_narration.wav was copied into public/
const narrationAudio = staticFile('combined_narration.wav');
 
export const NarrationVideo: React.FC = () => {
  const frame = useCurrentFrame();
  const { fps } = useVideoConfig();
 
  // Convert scene durations to frames
  const sceneDurationsInFrames = sceneEntries.map((scene) =>
    Math.floor(scene.durationInSeconds * fps)
  );
 
  // Find which scene we're in based on accumulated frames
  let currentSceneIndex = 0;
  let currentSceneStartFrame = 0;
  let accumulatedFrames = 0;
 
  for (let i = 0; i < sceneDurationsInFrames.length; i++) {
    const sceneFrames = sceneDurationsInFrames[i];
    if (frame < accumulatedFrames + sceneFrames) {
      currentSceneIndex = i;
      currentSceneStartFrame = accumulatedFrames;
      break;
    }
    accumulatedFrames += sceneFrames;
  }
 
  // Calculate progress within current scene (0.0 to 1.0)
  const currentSceneFrames = sceneDurationsInFrames[currentSceneIndex] ?? 1;
  const sceneProgress = (frame - currentSceneStartFrame) / currentSceneFrames;
 
  // Animate text opacity: fade in 0-10%, hold 10-90%, fade out 90-100%
  const textOpacity = interpolate(
    sceneProgress,
    [0, 0.1, 0.9, 1],
    [0, 1, 1, 0],
    { extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
  );
 
  // Render the current scene component
  const CurrentSceneComponent = sceneEntries[currentSceneIndex]?.Component;
 
  return (
    <AbsoluteFill>
      <Html5Audio src={narrationAudio} />
      {CurrentSceneComponent && (
        <CurrentSceneComponent
          textOpacity={textOpacity}
          sceneProgress={sceneProgress}
        />
      )}
    </AbsoluteFill>
  );
};
	

The Scene Registry

Each scene is defined in scenes/index.tsx:

		export type SceneEntry = {
  id: string;
  label: string;
  durationInSeconds: number;
  Component: React.FC<SceneComponentProps>;
};
 
export const sceneEntries: SceneEntry[] = [
  {
    id: 'intro',
    label: 'Intro',
    durationInSeconds: 6,
    Component: Intro,
  },
  {
    id: 'about_dappjak',
    label: 'About DappJak',
    durationInSeconds: 36,
    Component: AboutDappJak,
  },
  // ... 15 more scenes
];
	

The scene durations are manually set to match the audio timing from scene_timings.txt. This is the one manual synchronization step in the entire pipeline.

Scene Components Use Narration Data

Individual scene components import narration.json directly and render the renderedText:

import { AbsoluteFill } from 'remotion';
import narrationData from '../../narration.json';
import { baseBackgroundStyle, cardContainerStyle } from './styles'; // shared look-and-feel from styles.ts
 
export const AboutDappJak: React.FC<SceneComponentProps> = ({
  textOpacity = 1,
  titleScale = 1,
}) => {
  // Find our scene in the narration data
  const aboutDappJakScene = narrationData.scenes.find(
    (scene) => scene.id === 'about_dappjak'
  );
  const renderedText = aboutDappJakScene?.renderedText || [];
 
  return (
    <AbsoluteFill style={baseBackgroundStyle}>
      <AbsoluteFill
        style={{
          display: 'flex',
          flexDirection: 'column',
          alignItems: 'center',
          justifyContent: 'flex-start',
          padding: '6% 6% 4%',
          gap: 48,
          opacity: textOpacity,  // Animated by NarrationVideo
        }}
      >
        <div style={{ fontSize: 72, fontWeight: 'bold' }}>
          About DappJak
        </div>
 
        {/* Render each text segment as a card */}
        {renderedText.map((text, i) => (
          <div key={i} style={cardContainerStyle}>
            {text}
          </div>
        ))}
      </AbsoluteFill>
    </AbsoluteFill>
  );
};
	

Inline Icon Mapping

One clever pattern I used was inline icon replacement. When certain keywords appear in text, they're automatically replaced with icons:

import { Img } from 'remotion';

// Icon assets (ipfsIcon, storjIcon, filecoinIcon) are imported image files
type InlineIconEntry = { keyword: string; icon: string; alt: string };

// Escape regex metacharacters so keywords can be used safely inside a RegExp
const escapeRegExp = (value: string) => value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

const researchInlineIconMap: InlineIconEntry[] = [
  { keyword: 'IPFS', icon: ipfsIcon, alt: 'IPFS logo' },
  { keyword: 'Storj', icon: storjIcon, alt: 'Storj logo' },
  { keyword: 'FileCoin', icon: filecoinIcon, alt: 'FileCoin logo' },
];
 
const buildInlineIconNodes = (text: string, mapping: InlineIconEntry[]) => {
  const keywordsPattern = mapping.map(({ keyword }) => escapeRegExp(keyword)).join('|');
  const regex = new RegExp(`(${keywordsPattern})`, 'g');
  const segments = text.split(regex);
 
  return segments.map((segment, index) => {
    const replacement = mapping.find(({ keyword }) => keyword === segment);
 
    if (replacement) {
      return (
        <span key={index} style={{ display: 'inline-flex', alignItems: 'center' }}>
          <Img src={replacement.icon} style={{ width: 28, height: 28 }} />
          <span>{replacement.keyword}</span>
        </span>
      );
    }
    return segment;
  });
};
 
// Usage
const researchTextNodes = buildInlineIconNodes(
  "Experimented with IPFS Storj FileCoin...",
  researchInlineIconMap
);
	

Now when the text mentions "IPFS", it automatically renders with the IPFS logo inline.

The Complete Workflow

Here's the end-to-end process:

1. Write Your Video in JSON

		# Edit narration.json
# Define all scenes with both spoken and visual text
	

2. Generate Audio

		cd audio/
uv venv && source .venv/bin/activate
uv add kokoro soundfile
 
# Generate individual audio files
python json_to_speech.py ../narration.json
 
# Concatenate into single audio track
python concatenate_audio.py 1.5 0.5
# Arguments: scene_silence=1.5s, segment_silence=0.5s
	

This produces:

  • speech_output/ directory with 67 WAV files (49MB)
  • combined_narration.wav (26MB single file)
  • scene_timings.txt (timing metadata)

3. Update Scene Timings

Copy the scene durations from scene_timings.txt into scenes/index.tsx:

		export const sceneEntries: SceneEntry[] = [
  { id: 'intro', durationInSeconds: 2.25, Component: Intro },
  { id: 'description', durationInSeconds: 22.05, Component: Description },
  // ...
];
	

4. Build Scene Components

Create React components for each scene in src/scenes/:

		export const MyScene: React.FC<SceneComponentProps> = ({ textOpacity }) => {
  const sceneData = narrationData.scenes.find((s) => s.id === 'my_scene');
 
  return (
    <AbsoluteFill style={{ ...baseBackgroundStyle, opacity: textOpacity }}>
      {/* Your visual content here */}
    </AbsoluteFill>
  );
};
	

5. Preview and Render

		npm install
npm run dev  # Opens Remotion Studio for live preview
npm run render  # Renders final video to out/ directory
	

The Remotion Studio provides:

  • Live preview with scrubbing
  • Clickable timeline markers for each scene
  • Hot reload when you change components
  • Frame-accurate playback

Key Technical Details

Video Specifications

  • Resolution: 1920x1080 (Full HD)
  • Frame rate: 30 fps
  • Total duration: ~10 minutes (18,000 frames)
  • Audio: 24kHz WAV (Kokoro TTS standard)
  • Image format: JPEG (configured in remotion.config.ts)

File Naming Convention

The audio file naming encodes scene order and segment information:

		{scene_number:02d}_{scene_id}_{segment_number:02d}.wav

Examples:
01_intro.wav
02_about_dappjak_01.wav
02_about_dappjak_06.wav
06_team_01.wav through 06_team_10.wav

	

This lexicographic sorting ensures correct concatenation order.
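The two helpers referenced in concatenate_audio.py can be as simple as a glob plus a split on the zero-padded prefix. A possible implementation, assuming the naming convention above (the real helpers may differ in detail):

from pathlib import Path
from collections import defaultdict

def get_all_audio_files_sorted(audio_dir):
    # Lexicographic sort works because of the zero-padded scene/segment prefixes
    return sorted(Path(audio_dir).glob("*.wav"))

def group_audio_files_by_scene(audio_files):
    # "02_about_dappjak_03.wav" -> scene key "02"
    groups = defaultdict(list)
    for path in audio_files:
        scene_key = path.name.split("_", 1)[0]
        groups[scene_key].append(path)
    return groups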

Frame-Based Synchronization

The system uses purely computational frame counting - no state management needed:

		// For frame 1000 at 30fps:
// Scene 1: 0-180 frames (6s)
// Scene 2: 180-1260 frames (36s)
// Scene 3: 1260-2220 frames (32s)
// Frame 1000 falls in Scene 2
// Scene progress: (1000 - 180) / (1260 - 180) = 0.76 (76% through scene)
	

This is deterministic and reproducible - the same frame always produces the same output.
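The same lookup can be written as a pure function of the frame number and the scene durations. Here is an illustrative Python version that reproduces the worked example above:

def scene_for_frame(frame, durations_in_seconds, fps=30):
    """Return (scene_index, progress_within_scene) for a given frame."""
    start = 0
    for index, seconds in enumerate(durations_in_seconds):
        frames = int(seconds * fps)
        if frame < start + frames:
            return index, (frame - start) / frames
        start += frames
    return len(durations_in_seconds) - 1, 1.0  # past the end: clamp to the last scene

# Frame 1000 with scenes of 6s, 36s, 32s at 30 fps -> scene index 1, ~76% progress
print(scene_for_frame(1000, [6, 36, 32]))  # (1, 0.7592...)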

Animation Patterns

All scenes use a consistent fade pattern via interpolate:

		const textOpacity = interpolate(
  sceneProgress,
  [0, 0.1, 0.9, 1],     // Input range
  [0, 1, 1, 0],         // Output range
  { extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
	

This creates a professional "fade in → hold → fade out" effect:

  • 0-10% of scene: fade in from 0 to 1
  • 10-90% of scene: hold at 1
  • 90-100% of scene: fade out from 1 to 0

Why This Approach Works

1. Content as Data

The entire video is defined in JSON. Need to fix a typo? Edit the JSON, regenerate audio (10 seconds), re-render (5 minutes). No manual video editing required.

2. Open-Source AI

Kokoro TTS is Apache 2.0 licensed with surprisingly good quality for an 82M parameter model. No API costs, no rate limits, runs on CPU.

3. React Component Reusability

Common patterns (card layouts, icon grids, text animations) are extracted into reusable components. The 17 scenes share a consistent visual language through styles.ts.

4. Frame-Accurate Synchronization

Remotion's frame-based architecture ensures perfect audio-visual sync. No drift, no manual alignment.

5. Developer-Friendly Workflow

  • TypeScript for type safety
  • Hot reload during development
  • Git-friendly (all text files)
  • No proprietary formats

Challenges and Solutions

Challenge 1: TTS Pronunciation

Problem: Kokoro occasionally mispronounces technical terms - for example, reading "ICP" as a single word instead of spelling out the letters, or stressing "Motoko" in the wrong place.

Solution: Phonetic spelling in the text array:

		{
  "text": ["The Internet Computer Protocol, or I-C-P, uses Motoko..."],
  "renderedText": ["ICP uses Motoko"]
}
	

Challenge 2: Audio Timing Sync

Problem: The concatenated audio duration must exactly match the sum of scene durations in Remotion.

Solution: Use scene_timings.txt as the single source of truth. The Python script calculates exact timings including silence padding, which are then copied into scenes/index.tsx.
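The bookkeeping behind those timings is a running offset that advances by each segment's length plus the silence padding. A sketch of how the values in scene_timings.txt might be computed (helper names follow the concatenation script above; the real implementation may differ):

import soundfile as sf

def build_scene_timings(scene_groups, silence_duration=1.5, segment_silence=0.5):
    """Return (scene_name, start, end) tuples in concatenation order."""
    timings = []
    current_time = 0.0
    scene_names = sorted(scene_groups.keys())

    for idx, scene_name in enumerate(scene_names):
        start = current_time
        files = scene_groups[scene_name]
        for j, path in enumerate(files):
            current_time += sf.info(path).duration   # segment length without loading samples
            if j < len(files) - 1:
                current_time += segment_silence       # 0.5 s between segments
        timings.append((scene_name, start, current_time))
        if idx < len(scene_names) - 1:
            current_time += silence_duration          # 1.5 s between scenes
    return timings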

Challenge 3: Large Asset Files

Problem: 67 audio files + logos + screenshots = large repository.

Solution: Keep audio in separate directory, use Git LFS for large binaries, only commit combined_narration.wav for Remotion (not individual segments).

Challenge 4: Scene Duration Adjustments

Problem: If you regenerate audio with different silence padding, all scene timings change.

Solution: Standardize silence durations (1.5s between scenes, 0.5s between segments) and use them consistently. Document in concatenate_audio.py defaults.
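One way to keep those values from drifting is to define them once as module-level defaults in concatenate_audio.py and only override them explicitly from the command line, as sketched below (argument handling in the real script may differ):

import sys

# Documented defaults - changing these shifts every scene timing downstream
SCENE_SILENCE = 1.5     # seconds of silence between scenes
SEGMENT_SILENCE = 0.5   # seconds of silence between segments within a scene

if __name__ == "__main__":
    scene_silence = float(sys.argv[1]) if len(sys.argv) > 1 else SCENE_SILENCE
    segment_silence = float(sys.argv[2]) if len(sys.argv) > 2 else SEGMENT_SILENCE
    concatenate_audio_files(silence_duration=scene_silence,
                            segment_silence=segment_silence)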

Results

The final video:

  • Duration: 10 minutes
  • Scenes: 17 distinct sections
  • Audio segments: 67 narration clips
  • Resolution: 1920x1080
  • Total render time: ~5 minutes on a modern CPU
  • Turnaround for a change: ~10 seconds (regenerate audio) + ~5 minutes (re-render)

Compare to traditional video production:

  • Record voiceover: 1-2 hours (multiple takes, editing)
  • Create visuals: 4-8 hours (motion graphics, transitions)
  • Edit and sync: 2-4 hours (timeline editing, fine-tuning)
  • Total: 7-14 hours

With this pipeline:

  • Write narration JSON: 2 hours
  • Generate audio: 10 seconds
  • Build React components: 4 hours (reusable for future videos)
  • Render: 5 minutes
  • Total: ~6 hours (and future videos will be even faster)

Lessons Learned

  1. Separation of Concerns Works: Keeping speech text separate from visual text was crucial. They have different requirements and evolve independently.

  2. Open-Source AI Is Production-Ready: Kokoro TTS quality rivals commercial services for narration use cases.

  3. Programmatic Video Scales: Once the component library is built, creating new videos is just writing JSON and rendering.

  4. Frame Counting > State Management: For video, pure functional frame-to-output calculations beat stateful animations.

  5. Audio-First Design: Let audio timing drive visual duration, not the other way around. Human speech sets the natural pace.

Future Improvements

Automatic Timing Sync

Generate scenes/index.tsx automatically from scene_timings.txt:

		# generate_scene_config.py
timings = parse_scene_timings('scene_timings.txt')
generate_typescript_config(timings, 'src/scenes/index.tsx')
	
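A fuller sketch of what that generator could look like, parsing the scene_timings.txt format shown earlier (parse_scene_timings and generate_typescript_config come from the snippet above; everything else is an assumption):

import re

def parse_scene_timings(path):
    """Return (scene_number, duration_in_seconds) pairs from scene_timings.txt."""
    scenes = []
    current = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("Scene:"):
                current = line.split(":", 1)[1].strip()
            elif line.startswith("Duration:") and current is not None:
                duration = float(re.sub(r"[^\d.]", "", line.split(":", 1)[1]))
                scenes.append((current, duration))
                current = None
    return scenes

def generate_typescript_config(timings, out_path):
    # Placeholder stands in for real component imports; the actual generator
    # would have to map scene ids to their React components
    entries = ",\n".join(
        f"  {{ id: 'scene_{num}', durationInSeconds: {dur:.2f}, Component: Placeholder }}"
        for num, dur in timings
    )
    with open(out_path, "w") as f:
        f.write(f"export const sceneEntries: SceneEntry[] = [\n{entries}\n];\n")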

Voice Cloning

Replace Kokoro with a cloned voice using RVC or Bark.

Dynamic Content

Pull narration from a CMS or database instead of static JSON:

const narrationData = await fetch('/api/video-script/dappjak-labs').then((res) => res.json());
	

A/B Testing

Generate multiple narration variants and render different versions:

		python json_to_speech.py narration_v1.json --output audio_v1/
python json_to_speech.py narration_v2.json --output audio_v2/
npm run render -- --props='{"audioVersion": "v1"}'
	

Multilingual Support

Kokoro supports 9 languages. Generate the same video in multiple languages by translating narration.json:

		{
  "scenes": [
    {
      "id": "intro",
      "text": {
        "en": "Welcome to DappJak Labs",
        "es": "Bienvenido a DappJak Labs",
        "zh": "欢迎来到DappJak实验室"
      }
    }
  ]
}
	

Then render with --props='{"language": "es"}'.
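On the audio side, json_to_speech.py would only need to pick the per-language string before synthesis. A sketch against the schema above (the language-to-voice mapping is a placeholder; check Kokoro's documentation for the actual lang codes and voice names):

def texts_for_language(narration, language="en"):
    """Pull the spoken segments for one language from the multilingual schema above."""
    segments = narration.get("text", [])
    if isinstance(segments, dict):            # {"en": "...", "es": "...", "zh": "..."}
        segments = segments.get(language, segments.get("en", ""))
    if isinstance(segments, str):             # a single string becomes a one-segment list
        segments = [segments]
    return segments

# Placeholder mapping - verify lang codes and voice names against the Kokoro docs
LANGUAGE_SETTINGS = {
    "en": {"lang_code": "a", "voice": "af_heart"},       # matches the script above
    "es": {"lang_code": "e", "voice": "spanish_voice"},  # hypothetical voice name
}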

Conclusion

Building this JSON-driven video pipeline transformed video production from a manual, time-intensive process into an automated, reproducible workflow. The combination of Remotion (React for video), Kokoro TTS (open-weight speech synthesis), and a data-first architecture creates a system that's:

  • Fast: 10-second edits instead of hours of re-editing
  • Scalable: Reusable components for future videos
  • Open: No proprietary formats or vendor lock-in
  • Cost-effective: Zero API costs, runs on commodity hardware
  • Developer-friendly: TypeScript, Git, hot reload

If you're creating educational content, product demos, or promotional videos with narration, this architecture is worth exploring. The initial setup investment pays off quickly as you build your component library and perfect your workflow.

The future of video production is code. And it's surprisingly accessible.

Happy coding... your videos!