
Code Your Videos With Remotion

Building a JSON-driven video production pipeline with Remotion, React, and Kokoro TTS for automated narrated content creation

Creating professional narrated videos can be time-consuming and expensive. But what if you could write your entire video in JSON, quickly generate high-quality speech audio with open-source AI, and render it programmatically with React components? That's exactly what I built for DappJak Labs.

The Challenge

I needed to create a 10-minute promotional video showcasing DappJak Labs' ecosystem of decentralized applications. Traditional video editing would mean:

  • Recording voiceovers (or hiring a voice actor)
  • Manually syncing audio with visual scenes
  • Frame-by-frame adjustments in video editing software
  • Re-rendering everything for small text changes

Instead, I built a fully automated pipeline where the entire video is defined in a single JSON file, narration is generated with AI, and React components handle all the visuals.

The Architecture

The system has three main components:

1. The Narration JSON - Single Source of Truth

Everything starts with narration.json, which defines all 17 scenes of the video:

		{
  "scenes": [
    {
      "id": "intro",
      "title": "Introduction",
      "text": [
        "Welcome to Dappjak Labs, crafting dapps on the Unstoppable Internet."
      ],
      "renderedText": [
        "Welcome to Dappjak Labs",
        "dApp crafting on the Unstoppable Internet."
      ]
    },
    {
      "id": "about_dappjak",
      "title": "About DappJak",
      "text": [
        "When big-tech platforms can deplatform you at will, where do you turn?",
        "This question sent DappJak searching for something better.",
        "His journey took him through various decentralized technologies...",
        "Then he discovered ICP, the Internet Computer Protocol.",
        "It's a blockchain network that can host inexpensive, full-stack...",
        "The vision crystallized: build tools, dapps, and infrastructure..."
      ],
      "renderedText": [
        "Deplatformed by big tech. Content deleted. Accounts suspended...",
        "Experimented with IPFS Storj Sia FileCoin BTFS...",
        "ICP Discovery: A Sveltekit frontend, Motoko smart contract...",
        "The vision: Build unstoppable tools, dApps, and infrastructure..."
      ]
    }
  ]
}
	

The key innovation here is the dual-text strategy:

  • text array: Full narration that will be spoken. Each element becomes a separate audio segment.
  • renderedText array: Concise, punchy text optimized for on-screen display.

This separation lets you optimize each medium independently: spoken narration can be verbose and flowing, while on-screen text stays short, formatted, and impactful.
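Because both the audio pipeline and the Remotion components read this one file, a quick schema check before generating anything catches broken scenes early. Here is a minimal sketch (the validate_narration.py name is hypothetical; the field names follow the example above):

# validate_narration.py - quick schema check before generating audio
import json
import sys

def validate(path: str) -> None:
    with open(path) as f:
        scenes = json.load(f)["scenes"]

    for i, scene in enumerate(scenes, start=1):
        # Every scene needs an id plus parallel spoken/visual text arrays
        assert scene.get("id"), f"Scene {i} is missing an id"
        assert isinstance(scene.get("text"), list) and scene["text"], \
            f"Scene {scene['id']} has no spoken text"
        assert isinstance(scene.get("renderedText"), list) and scene["renderedText"], \
            f"Scene {scene['id']} has no rendered text"

    print(f"{len(scenes)} scenes look valid")

if __name__ == "__main__":
    validate(sys.argv[1] if len(sys.argv) > 1 else "narration.json")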

2. The Audio Pipeline - Kokoro TTS

The audio generation happens in Python using Kokoro, an 82M-parameter open-weight TTS model licensed under Apache 2.0.

Step 1: Text-to-Speech Generation

The json_to_speech.py script converts narration text into individual WAV files:

from kokoro import KPipeline
import soundfile as sf
import json
import sys
from pathlib import Path

# Load the scenes from narration.json (path passed on the command line)
narration_path = sys.argv[1] if len(sys.argv) > 1 else 'narration.json'
with open(narration_path) as f:
    narrations = json.load(f)['scenes']

# Write the individual WAV files into speech_output/
output_dir = Path('speech_output')
output_dir.mkdir(exist_ok=True)

# Initialize with American English voice
pipeline = KPipeline(lang_code='a', repo_id='hexgrad/Kokoro-82M', device='cpu')

# Generate speech for each text segment
for i, narration in enumerate(narrations):
    text_segments = narration.get('text', [])
 
    for segment_idx, text in enumerate(text_segments):
        # Use scene index for ordering
        scene_prefix = f"{i+1:02d}"
        narration_id = narration.get('id', f'narration_{i+1:03d}')
 
        # Generate filename with proper ordering
        if len(text_segments) == 1:
            filename = f"{scene_prefix}_{narration_id}.wav"
        else:
            filename = f"{scene_prefix}_{narration_id}_{segment_idx + 1:02d}.wav"
 
        # Generate speech using af_heart voice
        generator = pipeline(text, voice='af_heart')
 
        # Extract audio from generator
        for graphemes, phonemes, audio in generator:
            if audio is not None:
                sf.write(output_dir / filename, audio, 24000)  # 24 kHz sample rate
	

This produces 67 individual audio files like:

  • 01_intro.wav (single segment scene)
  • 02_about_dappjak_01.wav through 02_about_dappjak_06.wav (multi-segment scene)
  • 06_team_01.wav through 06_team_10.wav (10 segments for the team scene)

Step 2: Audio Concatenation

The concatenate_audio.py script merges all audio files into a single track with professional silence padding:

		import numpy as np
import soundfile as sf
 
def concatenate_audio_files(audio_dir='speech_output', silence_duration=1.5, segment_silence=0.5):
    # Auto-discover audio files sorted lexicographically
    all_audio_files = get_all_audio_files_sorted(audio_dir)
    scene_groups = group_audio_files_by_scene(all_audio_files)
    sorted_scene_names = sorted(scene_groups.keys())

    all_audio_data = []
    current_time = 0.0  # running offset, used when writing scene_timings.txt

    for scene_name in sorted_scene_names:
        scene_files = scene_groups[scene_name]
        scene_audio_data = []
 
        # Process each segment of the scene
        for j, file_path in enumerate(scene_files):
            data, sample_rate = sf.read(file_path)
            scene_audio_data.append(data)
 
            # Add silence between segments (0.5s)
            if j < len(scene_files) - 1:
                silence = np.zeros(int(segment_silence * sample_rate))
                scene_audio_data.append(silence)
 
        # Combine all segments for this scene
        scene_combined = np.concatenate(scene_audio_data)
        all_audio_data.append(scene_combined)
 
        # Add silence between scenes (1.5s)
        if scene_name != sorted_scene_names[-1]:
            silence = np.zeros(int(silence_duration * sample_rate))
            all_audio_data.append(silence)
 
    # Concatenate all scenes into final audio
    combined_audio = np.concatenate(all_audio_data)
    sf.write('combined_narration.wav', combined_audio, sample_rate)
	

The script also generates scene_timings.txt with frame-accurate timing metadata:

		Scene: 01
  Files: 01_intro.wav
  Start: 0.00s
  End: 2.25s
  Duration: 2.25s

Scene: 02
  Files: 02_description_01.wav, 02_description_02.wav
  Start: 3.75s
  End: 25.80s
  Duration: 22.05s

	

3. The Remotion Integration - React Meets Video

Now comes the magic: Remotion lets you write videos using React components.

The Video Engine: NarrationVideo.tsx

This component is the brain of the operation. It:

  1. Tracks the current frame
  2. Determines which scene should be active
  3. Calculates scene progress for animations
  4. Renders the appropriate scene component

import { AbsoluteFill, Html5Audio, interpolate, staticFile, useCurrentFrame, useVideoConfig } from 'remotion';
import { sceneEntries } from './scenes';

// Combined narration track - assumes combined_narration.wav was copied into public/
const narrationAudio = staticFile('combined_narration.wav');
 
export const NarrationVideo: React.FC = () => {
  const frame = useCurrentFrame();
  const { fps } = useVideoConfig();
 
  // Convert scene durations to frames
  const sceneDurationsInFrames = sceneEntries.map((scene) =>
    Math.floor(scene.durationInSeconds * fps)
  );
 
  // Find which scene we're in based on accumulated frames
  let currentSceneIndex = 0;
  let currentSceneStartFrame = 0;
  let accumulatedFrames = 0;
 
  for (let i = 0; i < sceneDurationsInFrames.length; i++) {
    const sceneFrames = sceneDurationsInFrames[i];
    if (frame < accumulatedFrames + sceneFrames) {
      currentSceneIndex = i;
      currentSceneStartFrame = accumulatedFrames;
      break;
    }
    accumulatedFrames += sceneFrames;
  }
 
  // Calculate progress within current scene (0.0 to 1.0)
  const currentSceneFrames = sceneDurationsInFrames[currentSceneIndex] ?? 1;
  const sceneProgress = (frame - currentSceneStartFrame) / currentSceneFrames;
 
  // Animate text opacity: fade in 0-10%, hold 10-90%, fade out 90-100%
  const textOpacity = interpolate(
    sceneProgress,
    [0, 0.1, 0.9, 1],
    [0, 1, 1, 0],
    { extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
  );
 
  // Render the current scene component
  const CurrentSceneComponent = sceneEntries[currentSceneIndex]?.Component;
 
  return (
    <AbsoluteFill>
      <Html5Audio src={narrationAudio} />
      {CurrentSceneComponent && (
        <CurrentSceneComponent
          textOpacity={textOpacity}
          sceneProgress={sceneProgress}
        />
      )}
    </AbsoluteFill>
  );
};
	

The Scene Registry

Each scene is defined in scenes/index.tsx:

		export type SceneEntry = {
  id: string;
  label: string;
  durationInSeconds: number;
  Component: React.FC<SceneComponentProps>;
};
 
export const sceneEntries: SceneEntry[] = [
  {
    id: 'intro',
    label: 'Intro',
    durationInSeconds: 6,
    Component: Intro,
  },
  {
    id: 'about_dappjak',
    label: 'About DappJak',
    durationInSeconds: 36,
    Component: AboutDappJak,
  },
  // ... 15 more scenes
];
	

The scene durations are manually set to match the audio timing from scene_timings.txt. This is the one manual synchronization step in the entire pipeline.

Scene Components Use Narration Data

Individual scene components import narration.json directly and render the renderedText:

import { AbsoluteFill } from 'remotion';
import narrationData from '../../narration.json';
import { baseBackgroundStyle, cardContainerStyle } from './styles'; // shared look-and-feel from styles.ts
 
export const AboutDappJak: React.FC<SceneComponentProps> = ({
  textOpacity = 1,
  titleScale = 1,
}) => {
  // Find our scene in the narration data
  const aboutDappJakScene = narrationData.scenes.find(
    (scene) => scene.id === 'about_dappjak'
  );
  const renderedText = aboutDappJakScene?.renderedText || [];
 
  return (
    <AbsoluteFill style={baseBackgroundStyle}>
      <AbsoluteFill
        style={{
          display: 'flex',
          flexDirection: 'column',
          alignItems: 'center',
          justifyContent: 'flex-start',
          padding: '6% 6% 4%',
          gap: 48,
          opacity: textOpacity,  // Animated by NarrationVideo
        }}
      >
        <div style={{ fontSize: 72, fontWeight: 'bold' }}>
          About DappJak
        </div>
 
        {/* Render each text segment as a card */}
        {renderedText.map((text, i) => (
          <div key={i} style={cardContainerStyle}>
            {text}
          </div>
        ))}
      </AbsoluteFill>
    </AbsoluteFill>
  );
};
	

Inline Icon Mapping

One clever pattern I used was inline icon replacement. When certain keywords appear in text, they're automatically replaced with icons:

import { Img } from 'remotion';

// Icon assets (ipfsIcon, storjIcon, filecoinIcon) are imported image files
type InlineIconEntry = { keyword: string; icon: string; alt: string };

// Escape regex metacharacters so keywords can be used safely inside a RegExp
const escapeRegExp = (value: string) => value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

const researchInlineIconMap: InlineIconEntry[] = [
  { keyword: 'IPFS', icon: ipfsIcon, alt: 'IPFS logo' },
  { keyword: 'Storj', icon: storjIcon, alt: 'Storj logo' },
  { keyword: 'FileCoin', icon: filecoinIcon, alt: 'FileCoin logo' },
];
 
const buildInlineIconNodes = (text: string, mapping: InlineIconEntry[]) => {
  const keywordsPattern = mapping.map(({ keyword }) => escapeRegExp(keyword)).join('|');
  const regex = new RegExp(`(${keywordsPattern})`, 'g');
  const segments = text.split(regex);
 
  return segments.map((segment, index) => {
    const replacement = mapping.find(({ keyword }) => keyword === segment);
 
    if (replacement) {
      return (
        <span key={index} style={{ display: 'inline-flex', alignItems: 'center' }}>
          <Img src={replacement.icon} style={{ width: 28, height: 28 }} />
          <span>{replacement.keyword}</span>
        </span>
      );
    }
    return segment;
  });
};
 
// Usage
const researchTextNodes = buildInlineIconNodes(
  "Experimented with IPFS Storj FileCoin...",
  researchInlineIconMap
);
	

Now when the text mentions "IPFS", it automatically renders with the IPFS logo inline.

The Complete Workflow

Here's the end-to-end process:

1. Write Your Video in JSON

		# Edit narration.json
# Define all scenes with both spoken and visual text
	

2. Generate Audio

		cd audio/
uv venv && source .venv/bin/activate
uv add kokoro soundfile
 
# Generate individual audio files
python json_to_speech.py ../narration.json
 
# Concatenate into single audio track
python concatenate_audio.py 1.5 0.5
# Arguments: scene_silence=1.5s, segment_silence=0.5s
	

This produces:

  • speech_output/ directory with 67 WAV files (49MB)
  • combined_narration.wav (26MB single file)
  • scene_timings.txt (timing metadata)

3. Update Scene Timings

Copy the scene durations from scene_timings.txt into scenes/index.tsx:

		export const sceneEntries: SceneEntry[] = [
  { id: 'intro', durationInSeconds: 2.25, Component: Intro },
  { id: 'description', durationInSeconds: 22.05, Component: Description },
  // ...
];
	

4. Build Scene Components

Create React components for each scene in src/scenes/:

		export const MyScene: React.FC<SceneComponentProps> = ({ textOpacity }) => {
  const sceneData = narrationData.scenes.find((s) => s.id === 'my_scene');
 
  return (
    <AbsoluteFill style={{ ...baseBackgroundStyle, opacity: textOpacity }}>
      {/* Your visual content here */}
    </AbsoluteFill>
  );
};
	

5. Preview and Render

		npm install
npm run dev  # Opens Remotion Studio for live preview
npm run render  # Renders final video to out/ directory
	

The Remotion Studio provides:

  • Live preview with scrubbing
  • Clickable timeline markers for each scene
  • Hot reload when you change components
  • Frame-accurate playback

Key Technical Details

Video Specifications

  • Resolution: 1920x1080 (Full HD)
  • Frame rate: 30 fps
  • Total duration: ~10 minutes (18,000 frames)
  • Audio: 24kHz WAV (Kokoro TTS standard)
  • Image format: JPEG (configured in remotion.config.ts)

File Naming Convention

The audio file naming encodes scene order and segment information:

		{scene_number:02d}_{scene_id}_{segment_number:02d}.wav

Examples:
01_intro.wav
02_about_dappjak_01.wav
02_about_dappjak_06.wav
06_team_01.wav through 06_team_10.wav

	

This lexicographic sorting ensures correct concatenation order.
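The two helpers referenced in concatenate_audio.py can be as simple as a glob plus a split on the zero-padded prefix. A possible implementation, assuming the naming convention above (the real helpers may differ in detail):

from pathlib import Path
from collections import defaultdict

def get_all_audio_files_sorted(audio_dir):
    # Lexicographic sort works because of the zero-padded scene/segment prefixes
    return sorted(Path(audio_dir).glob("*.wav"))

def group_audio_files_by_scene(audio_files):
    # "02_about_dappjak_03.wav" -> scene key "02"
    groups = defaultdict(list)
    for path in audio_files:
        scene_key = path.name.split("_", 1)[0]
        groups[scene_key].append(path)
    return groups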

Frame-Based Synchronization

The system uses purely computational frame counting - no state management needed:

		// For frame 1000 at 30fps:
// Scene 1: 0-180 frames (6s)
// Scene 2: 180-1260 frames (36s)
// Scene 3: 1260-2220 frames (32s)
// Frame 1000 falls in Scene 2
// Scene progress: (1000 - 180) / (1260 - 180) = 0.76 (76% through scene)
	

This is deterministic and reproducible - the same frame always produces the same output.
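The same lookup can be written as a pure function of the frame number and the scene durations. Here is an illustrative Python version that reproduces the worked example above:

def scene_for_frame(frame, durations_in_seconds, fps=30):
    """Return (scene_index, progress_within_scene) for a given frame."""
    start = 0
    for index, seconds in enumerate(durations_in_seconds):
        frames = int(seconds * fps)
        if frame < start + frames:
            return index, (frame - start) / frames
        start += frames
    return len(durations_in_seconds) - 1, 1.0  # past the end: clamp to the last scene

# Frame 1000 with scenes of 6s, 36s, 32s at 30 fps -> scene index 1, ~76% progress
print(scene_for_frame(1000, [6, 36, 32]))  # (1, 0.7592...)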

Animation Patterns

All scenes use a consistent fade pattern via interpolate:

		const textOpacity = interpolate(
  sceneProgress,
  [0, 0.1, 0.9, 1],     // Input range
  [0, 1, 1, 0],         // Output range
  { extrapolateLeft: 'clamp', extrapolateRight: 'clamp' }
);
	

This creates a professional "fade in → hold → fade out" effect:

  • 0-10% of scene: fade in from 0 to 1
  • 10-90% of scene: hold at 1
  • 90-100% of scene: fade out from 1 to 0

Why This Approach Works

1. Content as Data

The entire video is defined in JSON. Need to fix a typo? Edit the JSON, regenerate audio (10 seconds), re-render (5 minutes). No manual video editing required.

2. Open-Source AI

Kokoro TTS is Apache 2.0 licensed with surprisingly good quality for an 82M parameter model. No API costs, no rate limits, runs on CPU.

3. React Component Reusability

Common patterns (card layouts, icon grids, text animations) are extracted into reusable components. The 17 scenes share a consistent visual language through styles.ts.

4. Frame-Accurate Synchronization

Remotion's frame-based architecture ensures perfect audio-visual sync. No drift, no manual alignment.

5. Developer-Friendly Workflow

  • TypeScript for type safety
  • Hot reload during development
  • Git-friendly (all text files)
  • No proprietary formats

Challenges and Solutions

Challenge 1: TTS Pronunciation

Problem: Kokoro occasionally mispronounces technical terms - for example, reading "ICP" as a single word instead of spelling out the letters, or stressing "Motoko" in the wrong place.

Solution: Phonetic spelling in the text array:

		{
  "text": ["The Internet Computer Protocol, or I-C-P, uses Motoko..."],
  "renderedText": ["ICP uses Motoko"]
}
	

Challenge 2: Audio Timing Sync

Problem: The concatenated audio duration must exactly match the sum of scene durations in Remotion.

Solution: Use scene_timings.txt as the single source of truth. The Python script calculates exact timings including silence padding, which are then copied into scenes/index.tsx.
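The bookkeeping behind those timings is a running offset that advances by each segment's length plus the silence padding. A sketch of how the values in scene_timings.txt might be computed (helper names follow the concatenation script above; the real implementation may differ):

import soundfile as sf

def build_scene_timings(scene_groups, silence_duration=1.5, segment_silence=0.5):
    """Return (scene_name, start, end) tuples in concatenation order."""
    timings = []
    current_time = 0.0
    scene_names = sorted(scene_groups.keys())

    for idx, scene_name in enumerate(scene_names):
        start = current_time
        files = scene_groups[scene_name]
        for j, path in enumerate(files):
            current_time += sf.info(path).duration   # segment length without loading samples
            if j < len(files) - 1:
                current_time += segment_silence       # 0.5 s between segments
        timings.append((scene_name, start, current_time))
        if idx < len(scene_names) - 1:
            current_time += silence_duration          # 1.5 s between scenes
    return timings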

Challenge 3: Large Asset Files

Problem: 67 audio files + logos + screenshots = large repository.

Solution: Keep audio in separate directory, use Git LFS for large binaries, only commit combined_narration.wav for Remotion (not individual segments).

Challenge 4: Scene Duration Adjustments

Problem: If you regenerate audio with different silence padding, all scene timings change.

Solution: Standardize silence durations (1.5s between scenes, 0.5s between segments) and use them consistently. Document in concatenate_audio.py defaults.
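One way to keep those values from drifting is to define them once as module-level defaults in concatenate_audio.py and only override them explicitly from the command line, as sketched below (argument handling in the real script may differ):

import sys

# Documented defaults - changing these shifts every scene timing downstream
SCENE_SILENCE = 1.5     # seconds of silence between scenes
SEGMENT_SILENCE = 0.5   # seconds of silence between segments within a scene

if __name__ == "__main__":
    scene_silence = float(sys.argv[1]) if len(sys.argv) > 1 else SCENE_SILENCE
    segment_silence = float(sys.argv[2]) if len(sys.argv) > 2 else SEGMENT_SILENCE
    concatenate_audio_files(silence_duration=scene_silence,
                            segment_silence=segment_silence)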

Results

The final video:

  • Duration: 10 minutes
  • Scenes: 17 distinct sections
  • Audio segments: 67 narration clips
  • Resolution: 1920x1080
  • Total render time: ~5 minutes on a modern CPU
  • Turnaround for a change: ~10 seconds (regenerate audio) + ~5 minutes (re-render)

Compare to traditional video production:

  • Record voiceover: 1-2 hours (multiple takes, editing)
  • Create visuals: 4-8 hours (motion graphics, transitions)
  • Edit and sync: 2-4 hours (timeline editing, fine-tuning)
  • Total: 7-14 hours

With this pipeline:

  • Write narration JSON: 2 hours
  • Generate audio: 10 seconds
  • Build React components: 4 hours (reusable for future videos)
  • Render: 5 minutes
  • Total: ~6 hours (and future videos will be even faster)

Lessons Learned

  1. Separation of Concerns Works: Keeping speech text separate from visual text was crucial. They have different requirements and evolve independently.

  2. Open-Source AI Is Production-Ready: Kokoro TTS quality rivals commercial services for narration use cases.

  3. Programmatic Video Scales: Once the component library is built, creating new videos is just writing JSON and rendering.

  4. Frame Counting > State Management: For video, pure functional frame-to-output calculations beat stateful animations.

  5. Audio-First Design: Let audio timing drive visual duration, not the other way around. Human speech sets the natural pace.

Future Improvements

Automatic Timing Sync

Generate scenes/index.tsx automatically from scene_timings.txt:

		# generate_scene_config.py
timings = parse_scene_timings('scene_timings.txt')
generate_typescript_config(timings, 'src/scenes/index.tsx')
	
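A fuller sketch of what that generator could look like, parsing the scene_timings.txt format shown earlier (parse_scene_timings and generate_typescript_config come from the snippet above; everything else is an assumption):

import re

def parse_scene_timings(path):
    """Return (scene_number, duration_in_seconds) pairs from scene_timings.txt."""
    scenes = []
    current = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("Scene:"):
                current = line.split(":", 1)[1].strip()
            elif line.startswith("Duration:") and current is not None:
                duration = float(re.sub(r"[^\d.]", "", line.split(":", 1)[1]))
                scenes.append((current, duration))
                current = None
    return scenes

def generate_typescript_config(timings, out_path):
    # Placeholder stands in for real component imports; the actual generator
    # would have to map scene ids to their React components
    entries = ",\n".join(
        f"  {{ id: 'scene_{num}', durationInSeconds: {dur:.2f}, Component: Placeholder }}"
        for num, dur in timings
    )
    with open(out_path, "w") as f:
        f.write(f"export const sceneEntries: SceneEntry[] = [\n{entries}\n];\n")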

Voice Cloning

Replace Kokoro with a cloned voice using RVC or Bark.

Dynamic Content

Pull narration from a CMS or database instead of static JSON:

const narrationData = await fetch('/api/video-script/dappjak-labs').then((res) => res.json());
	

A/B Testing

Generate multiple narration variants and render different versions:

		python json_to_speech.py narration_v1.json --output audio_v1/
python json_to_speech.py narration_v2.json --output audio_v2/
npm run render -- --props='{"audioVersion": "v1"}'
	

Multilingual Support

Kokoro supports 9 languages. Generate the same video in multiple languages by translating narration.json:

		{
  "scenes": [
    {
      "id": "intro",
      "text": {
        "en": "Welcome to DappJak Labs",
        "es": "Bienvenido a DappJak Labs",
        "zh": "欢迎来到DappJak实验室"
      }
    }
  ]
}
	

Then render with --props='{"language": "es"}'.
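On the audio side, json_to_speech.py would only need to pick the per-language string before synthesis. A sketch against the schema above (the language-to-voice mapping is a placeholder; check Kokoro's documentation for the actual lang codes and voice names):

def texts_for_language(narration, language="en"):
    """Pull the spoken segments for one language from the multilingual schema above."""
    segments = narration.get("text", [])
    if isinstance(segments, dict):            # {"en": "...", "es": "...", "zh": "..."}
        segments = segments.get(language, segments.get("en", ""))
    if isinstance(segments, str):             # a single string becomes a one-segment list
        segments = [segments]
    return segments

# Placeholder mapping - verify lang codes and voice names against the Kokoro docs
LANGUAGE_SETTINGS = {
    "en": {"lang_code": "a", "voice": "af_heart"},       # matches the script above
    "es": {"lang_code": "e", "voice": "spanish_voice"},  # hypothetical voice name
}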

Conclusion

Building this JSON-driven video pipeline transformed video production from a manual, time-intensive process into an automated, reproducible workflow. The combination of Remotion (React for video), Kokoro TTS (open-weight speech synthesis), and a data-first architecture creates a system that's:

  • Fast: 10-second edits instead of hours of re-editing
  • Scalable: Reusable components for future videos
  • Open: No proprietary formats or vendor lock-in
  • Cost-effective: Zero API costs, runs on commodity hardware
  • Developer-friendly: TypeScript, Git, hot reload

If you're creating educational content, product demos, or promotional videos with narration, this architecture is worth exploring. The initial setup investment pays off quickly as you build your component library and perfect your workflow.

The future of video production is code. And it's surprisingly accessible.

Happy coding... your videos!