AI Text-to-Speech on Linux - Complete Guide

Comprehensive guide to setting up AI-powered text-to-speech on Linux using Kokoro TTS, Piper, Coqui TTS, and other tools for natural voice synthesis

Introduction

Text-to-speech (TTS) technology has evolved dramatically with modern AI models, producing natural-sounding voices that rival human speech. With tools like Kokoro TTS, Piper, and Coqui TTS, you can generate high-quality speech synthesis entirely on your Linux machine. This guide covers multiple solutions from lightweight local models to advanced neural TTS systems.

Why Use AI Text-to-Speech on Linux?

  • Privacy - Generate speech locally without sending text to cloud services
  • Offline capability - Work without internet connection
  • Cost-effective - No API fees or subscription costs
  • Customization - Fine-tune voices and adjust speech parameters
  • Integration - Easy integration with Linux workflows, screen readers, and applications

Option 1: Kokoro TTS (Natural & High Quality)

Kokoro TTS is a modern, high-quality text-to-speech engine with natural-sounding voices and excellent performance.

Installation

		# Install dependencies
sudo pacman -S python python-pip  # Arch
sudo apt install python3 python3-pip  # Ubuntu/Debian
 
# Install required audio libraries
sudo pacman -S espeak-ng  # Arch
sudo apt install espeak-ng  # Ubuntu/Debian
 
# Install Kokoro TTS
pip install kokoro-tts
	

Basic Usage

		# Generate speech from text
kokoro-tts "Hello, this is a test of Kokoro text-to-speech."
 
# Specify output file
kokoro-tts "Welcome to Linux" -o output.wav
 
# Use different voice
kokoro-tts "Testing different voices" --voice female
 
# Adjust speech rate
kokoro-tts "Faster speech" --rate 1.5
 
# Adjust pitch
kokoro-tts "Higher pitch" --pitch 1.2
	

Python API

		from kokoro_tts import KokoroTTS
 
# Initialize TTS engine
tts = KokoroTTS()
 
# Generate speech
audio = tts.synthesize("Hello from Python!")
 
# Save to file
tts.save(audio, "output.wav")
 
# Use different voice
audio = tts.synthesize("Different voice test", voice="male")
 
# Adjust parameters
audio = tts.synthesize(
    "Custom parameters",
    rate=1.2,
    pitch=0.9,
    volume=0.8
)
	

Advanced Configuration

		from kokoro_tts import KokoroTTS, VoiceConfig
 
# Create custom voice configuration
config = VoiceConfig(
    voice="female",
    rate=1.1,
    pitch=1.0,
    volume=0.9,
    language="en-US"
)
 
tts = KokoroTTS(config=config)
 
# Generate with SSML support
ssml_text = """
<speak>
    <prosody rate="slow">This is slow,</prosody>
    <prosody rate="fast">this is fast,</prosody>
    and this is normal.
</speak>
"""
 
audio = tts.synthesize_ssml(ssml_text)
	

Option 2: Piper TTS (Fast & Lightweight)

Piper is an extremely fast, local neural TTS system perfect for real-time applications.

Installation

		# Download Piper binary
wget https://github.com/rhasspy/piper/releases/download/v1.2.0/piper_amd64.tar.gz
tar xzf piper_amd64.tar.gz
cd piper  # the archive extracts into a piper/ directory
 
# Download a voice model
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json
	

Basic Usage

		# Generate speech
echo "Hello from Piper TTS" | ./piper 
  --model en_US-lessac-medium.onnx 
  --output_file output.wav
 
# From text file
cat article.txt | ./piper 
  --model en_US-lessac-medium.onnx 
  --output_file article.wav
 
# Play directly
echo "Testing audio output" | ./piper 
  --model en_US-lessac-medium.onnx 
  --output_raw | aplay -r 22050 -f S16_LE -t raw -
	

Available Voice Models

		# List all available voices
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/voices.json
 
# Popular English voices:
# - en_US-lessac-medium (high quality, natural)
# - en_US-amy-medium (clear, professional)
# - en_GB-alan-medium (British English)
# - en_US-libritts-high (very high quality, slower)
	
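
To browse the catalog programmatically, a small sketch like this works, assuming only that voices.json is a JSON object keyed by voice id (the exact metadata fields vary per voice):

		import json
 
# Parse the voices.json catalog downloaded above
with open("voices.json", encoding="utf-8") as f:
    catalog = json.load(f)
 
# Print every US English voice id
for voice_id in sorted(catalog):
    if voice_id.startswith("en_US"):
        print(voice_id)
	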

Python Integration

		import subprocess
import json
 
def piper_tts(text, model_path="en_US-lessac-medium.onnx", output_file="output.wav"):
    """Generate speech using Piper TTS"""
    process = subprocess.Popen(
        ['./piper', '--model', model_path, '--output_file', output_file],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE
    )
 
    process.communicate(input=text.encode('utf-8'))
    return output_file
 
# Usage
piper_tts("Hello from Python with Piper!", output_file="test.wav")
	

Option 3: Coqui TTS (Advanced Neural TTS)

Coqui TTS is a professional-grade TTS engine with state-of-the-art voice quality and cloning capabilities.

Installation

		# Install Coqui TTS
pip install TTS
 
# Install with CUDA support (for GPU acceleration)
pip install "TTS[cuda]"
	

Basic Usage

		# List available models
tts --list_models
 
# Generate speech with default model
tts --text "Hello from Coqui TTS" --out_path output.wav
 
# Use specific model
tts --model_name "tts_models/en/ljspeech/tacotron2-DDC" \
    --text "Testing Tacotron model" \
    --out_path output.wav
 
# Multi-speaker model
tts --model_name "tts_models/en/vctk/vits" \
    --text "Different speakers available" \
    --speaker_idx 5 \
    --out_path output.wav
	

Python API

		from TTS.api import TTS
 
# Load model
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", progress_bar=True)
 
# Generate speech
tts.tts_to_file(text="Hello from Coqui TTS in Python", file_path="output.wav")
 
# Multi-speaker model
tts = TTS("tts_models/en/vctk/vits")
tts.tts_to_file(
    text="Speaking with different voice",
    speaker="p225",
    file_path="output.wav"
)
 
# Voice cloning (if supported by model)
tts = TTS("tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="Clone this voice",
    speaker_wav="reference_voice.wav",
    language="en",
    file_path="cloned.wav"
)
	

Voice Cloning

		from TTS.api import TTS
 
# Load a model that supports voice cloning
tts = TTS("tts_models/multilingual/multi-dataset/your_tts")
 
# Clone voice from reference audio
tts.tts_to_file(
    text="This should sound like the reference speaker",
    speaker_wav="path/to/reference/audio.wav",
    language="en",
    file_path="cloned_output.wav"
)
	

Option 4: eSpeak NG (Lightweight & Fast)

eSpeak NG is a compact formant-synthesis TTS engine: less natural than the neural options above, but extremely fast and lightweight.

		# Install
sudo pacman -S espeak-ng  # Arch
sudo apt install espeak-ng  # Ubuntu/Debian
 
# Basic usage
espeak-ng "Hello from eSpeak NG"
 
# Save to file
espeak-ng "Save this audio" -w output.wav
 
# Adjust speed (words per minute)
espeak-ng -s 150 "Faster speech"
 
# Adjust pitch (0-99)
espeak-ng -p 50 "Higher pitch"
 
# Different voice
espeak-ng -v en-us+f3 "Female voice"
 
# Different language
espeak-ng -v es "Hola mundo"
	

Option 5: Festival

Festival is a classic TTS system with multiple synthesis techniques.

		# Install
sudo pacman -S festival festival-us  # Arch
sudo apt install festival festvox-kallpc16k  # Ubuntu/Debian
 
# Basic usage
echo "Hello from Festival" | festival --tts
 
# From file
festival --tts input.txt
 
# Save to file
text2wave input.txt -o output.wav
	

System-wide Integration

Creating a Global TTS Command

		#!/bin/bash
# ~/bin/speak
 
if [ -z "$1" ]; then
    # Read from stdin
    text=$(cat)
else
    text="$1"
fi
 
# Use Piper for high-quality speech (adjust paths to where you extracted Piper)
echo "$text" | "$HOME/piper/piper" \
    --model "$HOME/piper/en_US-lessac-medium.onnx" \
    --output_raw | aplay -r 22050 -f S16_LE -t raw -
	
		# Make executable
chmod +x ~/bin/speak
 
# Usage
speak "Hello world"
echo "Test from stdin" | speak
cat article.txt | speak
	

Clipboard to Speech

		#!/bin/bash
# ~/bin/speak-clipboard
 
# Read clipboard content
text=$(xclip -o -selection clipboard)
 
# Speak it with the global speak command defined above
echo "$text" | speak
	

Bind to a keyboard shortcut for instant clipboard reading.
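
One way to wire this up is through your desktop environment's shortcut settings. Alternatively, a small Python daemon can own the binding, as in this sketch using the pynput library (an assumption; any global-hotkey tool works):

		import os
import subprocess
from pynput import keyboard
 
def speak_clipboard():
    """Run the ~/bin/speak-clipboard script from above."""
    subprocess.run([os.path.expanduser('~/bin/speak-clipboard')])
 
# Ctrl+Alt+S reads the clipboard aloud; runs until interrupted
with keyboard.GlobalHotKeys({'<ctrl>+<alt>+s': speak_clipboard}) as hotkeys:
    hotkeys.join()
	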

PDF to Audio

		#!/bin/bash
# Convert PDF to speech
 
if [ -z "$1" ]; then
    echo "Usage: pdf2speech <pdf-file>"
    exit 1
fi
 
# Extract text from PDF
text=$(pdftotext "$1" -)
 
# Generate speech
echo "$text" | piper 
    --model en_US-lessac-medium.onnx 
    --output_file "${1%.pdf}.wav"
 
echo "Audio saved to ${1%.pdf}.wav"
	

Screen Reader Integration

		#!/usr/bin/env python3
# Read the focused window's title aloud (a simple screen reader building block)

import subprocess
from kokoro_tts import KokoroTTS

tts = KokoroTTS()

def read_focused_window():
    """Fetch the focused window's title via xdotool and speak it."""
    title = subprocess.check_output(['xdotool', 'getwindowfocus', 'getwindowname'])
    title = title.decode('utf-8').strip()

    audio = tts.synthesize(title)
    tts.play(audio)

if __name__ == '__main__':
    # Bind this script to a keyboard shortcut to trigger it on demand
    read_focused_window()
	

Notification Reader

		#!/bin/bash
# Read notifications aloud
 
# Monitor notifications with dunst
dunstctl subscribe | while read -r line; do
    if [[ $line == *"summary"* ]]; then
        notification=$(echo "$line" | sed 's/.*summary: //')
        echo "$notification" | speak
    fi
done
	

Real-time Text-to-Speech Server

Create a local TTS server for multiple applications:

		#!/usr/bin/env python3
from flask import Flask, request, send_file
from kokoro_tts import KokoroTTS
import io
 
app = Flask(__name__)
tts = KokoroTTS()
 
@app.route('/tts', methods=['POST'])
def text_to_speech():
    """
    POST /tts
    Body: {"text": "Hello world", "voice": "female", "rate": 1.0}
    """
    data = request.json
    text = data.get('text', '')
    voice = data.get('voice', 'default')
    rate = data.get('rate', 1.0)
 
    # Generate audio
    audio = tts.synthesize(text, voice=voice, rate=rate)
 
    # Return as WAV file
    audio_io = io.BytesIO()
    tts.save(audio, audio_io)
    audio_io.seek(0)
 
    return send_file(audio_io, mimetype='audio/wav')
 
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
	
		# Usage
curl -X POST http://localhost:5000/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from TTS server"}' \
  --output speech.wav
	

GPU Acceleration

For faster processing with neural TTS models:

		# Install PyTorch with CUDA
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
 
# Coqui TTS will automatically use GPU
tts --text "GPU accelerated speech" 
    --model_name "tts_models/en/ljspeech/tacotron2-DDC" 
    --out_path output.wav
	
		# Force GPU usage in Python
from TTS.api import TTS
 
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC", gpu=True)
tts.tts_to_file(text="Using GPU for synthesis", file_path="output.wav")
	

Performance Optimization Tips

  1. Choose the right engine - Piper for speed, Coqui for quality
  2. Use appropriate models - Smaller models for real-time, larger for quality
  3. Enable GPU - 5-10x speedup with neural models
  4. Cache common phrases - Pre-generate frequently used audio (see the caching sketch after this list)
  5. Use streaming - For long-form content, stream audio output
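
Caching is straightforward to bolt onto any engine in this guide. Below is a minimal sketch of a file-based phrase cache; synthesize_to_wav() is a hypothetical wrapper, shown here around a Piper subprocess call as in the earlier Python integration:

		import hashlib
import subprocess
from pathlib import Path
 
CACHE_DIR = Path.home() / ".cache" / "tts-phrases"
CACHE_DIR.mkdir(parents=True, exist_ok=True)
 
def synthesize_to_wav(text, wav_path):
    """Hypothetical engine wrapper; here, a Piper subprocess."""
    subprocess.run(
        ['./piper', '--model', 'en_US-lessac-medium.onnx',
         '--output_file', str(wav_path)],
        input=text.encode('utf-8'),
        check=True,
    )
 
def cached_tts(text):
    """Return a WAV path for the text, synthesizing only on a cache miss."""
    key = hashlib.sha256(text.encode('utf-8')).hexdigest()
    wav_path = CACHE_DIR / f"{key}.wav"
    if not wav_path.exists():
        synthesize_to_wav(text, wav_path)
    return wav_path
 
# Repeated phrases cost one synthesis, then become instant file lookups
print(cached_tts("Build finished successfully."))
print(cached_tts("Build finished successfully."))  # cache hit
	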

Use Cases

Audiobook Generation

		#!/bin/bash
# Convert epub to audiobook
 
# Install calibre for conversion
sudo pacman -S calibre  # Arch
sudo apt install calibre  # Ubuntu/Debian
 
# Convert epub to txt
ebook-convert book.epub book.txt
 
# Generate audio chapters
split -l 1000 book.txt chapter_
 
for chapter in chapter_*; do
    cat "$chapter" | piper 
        --model en_US-lessac-medium.onnx 
        --output_file "${chapter}.wav"
done
 
# Combine chapters
sox chapter_*.wav audiobook.wav
	

Podcast Generation

		#!/usr/bin/env python3
from kokoro_tts import KokoroTTS
from pydub import AudioSegment
 
tts = KokoroTTS()
 
# Podcast script with multiple speakers
script = [
    {"speaker": "male", "text": "Welcome to the Linux podcast!"},
    {"speaker": "female", "text": "Today we're discussing text-to-speech."},
    {"speaker": "male", "text": "It's amazing what AI can do now."},
]
 
# Generate audio for each line, saving each clip so pydub can load it
podcast = AudioSegment.empty()
pause = AudioSegment.silent(duration=500)  # 500 ms between speakers
 
for i, line in enumerate(script):
    audio = tts.synthesize(line["text"], voice=line["speaker"])
    clip_path = f"line_{i}.wav"
    tts.save(audio, clip_path)
    podcast += AudioSegment.from_wav(clip_path) + pause
 
podcast.export("podcast_episode.mp3", format="mp3")
	

Interactive Voice Assistant

		#!/usr/bin/env python3
import speech_recognition as sr
from kokoro_tts import KokoroTTS
 
recognizer = sr.Recognizer()
tts = KokoroTTS()
 
def listen():
    """Listen for voice command"""
    with sr.Microphone() as source:
        print("Listening...")
        audio = recognizer.listen(source)
        try:
            text = recognizer.recognize_whisper(audio)
            return text
        except sr.UnknownValueError:
            return None
 
def respond(text):
    """Speak response"""
    audio = tts.synthesize(text)
    tts.play(audio)
 
# Main loop
while True:
    command = listen()
    if command:
        print(f"You said: {command}")
 
        # Process command and generate response
        if "weather" in command.lower():
            respond("I don't have weather data, but it's probably nice outside!")
        elif "quit" in command.lower():
            respond("Goodbye!")
            break
        else:
            respond(f"You said: {command}")
	

Email Reader

		#!/usr/bin/env python3
import imaplib
import email
from kokoro_tts import KokoroTTS
 
tts = KokoroTTS()
 
# Connect to email
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('your_email@gmail.com', 'your_password')  # for Gmail, use an app password
mail.select('inbox')
 
# Get unread emails
_, messages = mail.search(None, 'UNSEEN')
 
for msg_id in messages[0].split():
    _, msg_data = mail.fetch(msg_id, '(RFC822)')
    email_body = email.message_from_bytes(msg_data[0][1])
 
    subject = email_body['subject']
    sender = email_body['from']
 
    # Read email aloud
    speech_text = f"New email from {sender}. Subject: {subject}"
    audio = tts.synthesize(speech_text)
    tts.play(audio)
	

Video Narration Pipeline (Production Framework)

For video content creation, you need a robust pipeline to generate synchronized narration. Here's a complete framework using Kokoro TTS for programmatic video narration.

Project Structure:

		video-project/
├── narration.json             # Scene narration script
├── audio/
│   ├── main.py                # Workflow orchestrator
│   ├── json_to_speech.py      # JSON → Audio generator
│   └── concatenate_audio.py   # Audio concatenation
└── speech_output/
    ├── 01_intro.wav           # Generated scene audio
    ├── 02_scene2.wav
    ├── combined_narration.wav # Final audio track
    └── scene_timings.txt      # Timing info for video sync

	

Step 1: Define Your Narration Script

Create narration.json with scene-based narration:

		{
  "scenes": [
    {
      "id": "intro",
      "title": "Introduction",
      "text": [
        "Welcome to our tutorial on Linux text-to-speech.",
        "In this video, we'll explore how to generate high-quality voiceovers programmatically."
      ]
    },
    {
      "id": "kokoro_intro",
      "title": "Kokoro TTS Overview",
      "text": [
        "Kokoro TTS is a modern text-to-speech engine with natural-sounding voices.",
        "It's perfect for generating narration for videos, podcasts, and audiobooks."
      ]
    },
    {
      "id": "tutorial",
      "title": "Tutorial",
      "text": [
        "Let's start by installing the required dependencies.",
        "You'll need Python 3.8 or higher and the Kokoro library."
      ]
    }
  ]
}
	

Step 2: JSON to Speech Converter

Create json_to_speech.py:

		#!/usr/bin/env python3
"""Generate speech files from narration JSON using Kokoro TTS."""
 
import json
import sys
from pathlib import Path
from kokoro import KPipeline
import soundfile as sf
import numpy as np
 
def load_narration_json(json_path):
    """Load and parse narration JSON."""
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
 
    # Support different JSON structures
    if isinstance(data, dict) and 'scenes' in data:
        return data['scenes']
    elif isinstance(data, list):
        return data
    else:
        return [data]
 
def sanitize_filename(filename):
    """Remove invalid filename characters."""
    invalid_chars = '<>:"/\\|?*'
    for char in invalid_chars:
        filename = filename.replace(char, '_')
    return filename.strip()
 
def generate_speech_files(narrations, output_dir, pipeline, voice='af_heart'):
    """Generate speech files for all narrations."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
 
    print(f"Processing {len(narrations)} narrations...\n")
 
    for i, narration in enumerate(narrations):
        # Get text (can be string or array of strings)
        text_data = narration.get('text', '')
        text_segments = [text_data] if isinstance(text_data, str) else text_data
 
        narration_id = narration.get('id', f'narration_{i+1:03d}')
        title = narration.get('title', '')
 
        print(f"Scene: {narration_id} - {title}")
 
        # Generate audio for each text segment
        for segment_idx, text in enumerate(text_segments):
            if not text.strip():
                continue
 
            # Create filename with scene ordering
            scene_prefix = f"{i+1:02d}"
 
            if len(text_segments) == 1:
                filename = f"{scene_prefix}_{sanitize_filename(narration_id)}.wav"
            else:
                filename = f"{scene_prefix}_{sanitize_filename(narration_id)}_{segment_idx+1:02d}.wav"
 
            output_path = output_dir / filename
 
            print(f"  Generating: {filename}")
            print(f"  Text: {text[:70]}{'...' if len(text) > 70 else ''}")
 
            # Generate speech with Kokoro TTS
            generator = pipeline(text, voice=voice)  # default af_heart: American female voice
 
            # Collect audio chunks
            audio_chunks = []
            for graphemes, phonemes, audio in generator:
                if audio is not None and len(audio) > 0:
                    audio_chunks.append(audio)
 
            if audio_chunks:
                # Join all generated chunks rather than keeping only the first
                final_audio = np.concatenate(audio_chunks)
 
                # Save audio file (Kokoro outputs at 24kHz)
                sf.write(str(output_path), final_audio, 24000)
 
                duration = len(final_audio) / 24000
                print(f"  ✓ Saved: {filename} ({duration:.1f}s)\n")
 
if __name__ == "__main__":
    import argparse
 
    parser = argparse.ArgumentParser(description="Generate speech from narration JSON")
    parser.add_argument("json_file", help="Path to narration JSON file")
    parser.add_argument("--output", default="./speech_output",
                       help="Output directory (default: ./speech_output)")
    parser.add_argument("--voice", default="af_heart",
                       help="Voice to use (default: af_heart)")
 
    args = parser.parse_args()
 
    # Load narration script
    narrations = load_narration_json(args.json_file)
    print(f"Found {len(narrations)} scene(s)\n")
 
    # Initialize Kokoro TTS (American English)
    print("Initializing Kokoro TTS...")
    pipeline = KPipeline(lang_code='a', repo_id='hexgrad/Kokoro-82M', device='cpu')
    print("โœ“ Ready\n")
 
    # Generate all speech files
    generate_speech_files(narrations, args.output, pipeline, voice=args.voice)
 
    print(f"\n๐ŸŽ‰ Complete! Audio files in '{args.output}'")
	

Step 3: Audio Concatenation Script

Create concatenate_audio.py to combine scene audio with proper timing:

		#!/usr/bin/env python3
"""Concatenate scene audio files into a single synchronized track."""
 
import numpy as np
import soundfile as sf
from pathlib import Path
import glob
 
def get_audio_files_sorted(audio_dir):
    """Get all WAV files, sorted by filename."""
    pattern = str(audio_dir / "*.wav")
    return sorted([Path(f) for f in glob.glob(pattern)])
 
def concatenate_audio_files(silence_between_scenes=1.0, silence_between_segments=0.5):
    """Concatenate audio with configurable silence.

    Note: this version inserts scene-level silence between every file;
    per-segment spacing is accepted but not yet applied.
    """
 
    audio_dir = Path("speech_output")
    audio_files = get_audio_files_sorted(audio_dir)
 
    if not audio_files:
        print("No audio files found!")
        return
 
    all_audio = []
    timings = []
    current_time = 0.0
    sample_rate = None
 
    print(f"Concatenating {len(audio_files)} audio files...")
    print(f"Silence between scenes: {silence_between_scenes}s")
    print(f"Silence between segments: {silence_between_segments}s\n")
 
    for i, file_path in enumerate(audio_files):
        # Load audio file
        data, file_sample_rate = sf.read(file_path)
        duration = len(data) / file_sample_rate
 
        if sample_rate is None:
            sample_rate = file_sample_rate
 
        print(f"{file_path.name}: {duration:.2f}s at {current_time:.2f}s")
 
        # Add audio
        all_audio.append(data)
 
        # Record timing
        timings.append({
            'file': file_path.name,
            'start': current_time,
            'duration': duration,
            'end': current_time + duration
        })
 
        current_time += duration
 
        # Add silence (except after last file)
        if i < len(audio_files) - 1:
            silence_samples = int(silence_between_scenes * sample_rate)
            silence = np.zeros(silence_samples)
            all_audio.append(silence)
            current_time += silence_between_scenes
 
    # Combine all audio
    combined_audio = np.concatenate(all_audio)
 
    # Save combined audio
    output_path = audio_dir / "combined_narration.wav"
    sf.write(str(output_path), combined_audio, sample_rate)
 
    total_duration = len(combined_audio) / sample_rate
 
    print(f"\nโœ… Saved: {output_path}")
    print(f"Total duration: {total_duration:.2f}s")
 
    # Save timing information for video synchronization
    timings_path = audio_dir / "scene_timings.txt"
    with open(timings_path, "w") as f:
        f.write("Scene Audio Timings\n")
        f.write("=" * 50 + "\n\n")
 
        for timing in timings:
            f.write(f"{timing['file']}\n")
            f.write(f"  Start: {timing['start']:.2f}s\n")
            f.write(f"  End: {timing['end']:.2f}s\n")
            f.write(f"  Duration: {timing['duration']:.2f}s\n\n")
 
        f.write(f"Total Duration: {total_duration:.2f}s\n")
 
    print(f"Timing info: {timings_path}")
 
    return output_path, timings
 
if __name__ == "__main__":
    import sys
 
    silence_between_scenes = float(sys.argv[1]) if len(sys.argv) > 1 else 1.5
    silence_between_segments = float(sys.argv[2]) if len(sys.argv) > 2 else 0.5
 
    concatenate_audio_files(silence_between_scenes, silence_between_segments)
	

Step 4: Workflow Orchestrator

Create main.py to automate the entire pipeline:

		#!/usr/bin/env python3
"""Automated workflow for video narration generation."""
 
import subprocess
import sys
from pathlib import Path
import shutil
 
def main():
    script_dir = Path(__file__).parent
 
    # Clear old output
    speech_output = script_dir / "speech_output"
    if speech_output.exists():
        print("๐Ÿงน Clearing old audio files...")
        shutil.rmtree(speech_output)
    speech_output.mkdir(parents=True, exist_ok=True)
 
    # Step 1: Generate speech from JSON
    print("๐Ÿ”Š Step 1: Generating speech from narration.json...\n")
    result = subprocess.run(
        ["python", "json_to_speech.py", "../narration.json"],
        cwd=script_dir,
        capture_output=True,
        text=True
    )
 
    if result.returncode != 0:
        print(f"โŒ Error: {result.stderr}")
        sys.exit(1)
 
    print(result.stdout)
    print("โœ… Speech generation complete\n")
 
    # Step 2: Concatenate audio
    print("๐Ÿ”— Step 2: Concatenating audio files...\n")
    result = subprocess.run(
        ["python", "concatenate_audio.py"],
        cwd=script_dir,
        capture_output=True,
        text=True
    )
 
    if result.returncode != 0:
        print(f"โŒ Error: {result.stderr}")
        sys.exit(1)
 
    print(result.stdout)
    print("โœ… Concatenation complete\n")
 
    print("๐ŸŽ‰ Workflow complete!")
    print(f"   Combined audio: speech_output/combined_narration.wav")
    print(f"   Timing data: speech_output/scene_timings.txt")
 
if __name__ == "__main__":
    main()
	

Usage:

		# Install dependencies
pip install kokoro soundfile numpy
 
# Run complete workflow
python audio/main.py
 
# Or run steps individually:
python audio/json_to_speech.py narration.json
python audio/concatenate_audio.py
	

Output:

		speech_output/
├── 01_intro.wav                # Scene 1 audio
├── 02_kokoro_intro.wav         # Scene 2 audio
├── 03_tutorial.wav             # Scene 3 audio
├── combined_narration.wav      # Complete narration track
└── scene_timings.txt           # Timing for video sync

	

scene_timings.txt (for video editors):

		Scene Audio Timings
==================================================

01_intro.wav
  Start: 0.00s
  End: 8.50s
  Duration: 8.50s

02_kokoro_intro.wav
  Start: 10.00s
  End: 17.25s
  Duration: 7.25s

03_tutorial.wav
  Start: 18.75s
  End: 25.10s
  Duration: 6.35s

Total Duration: 25.10s

	

Integration with Video Editors:

For Remotion (React-based video):

		import React from 'react';
import { Audio, staticFile } from 'remotion';
 
export const VideoNarration: React.FC = () => {
  return <Audio src={staticFile('audio/combined_narration.wav')} />;
};
	

For Motion Canvas:

		// Motion Canvas attaches narration at the project level (src/project.ts)
import { makeProject } from '@motion-canvas/core';
import example from './scenes/example?scene';
 
export default makeProject({
  scenes: [example],
  audio: '/audio/combined_narration.wav',
  // Use scene_timings.txt to time animations against the narration
});
	

For FFmpeg (manual sync):

		# Combine video with generated narration
ffmpeg -i video.mp4 -i speech_output/combined_narration.wav \
  -c:v copy -c:a aac -map 0:v:0 -map 1:a:0 \
  output_with_narration.mp4
	

Benefits of This Approach:

  • Reproducible - Version control your narration script
  • Editable - Update text and regenerate audio instantly
  • Scalable - Generate hours of narration programmatically
  • Synchronized - Timing data for perfect video alignment
  • Professional - High-quality AI voices at no cost
  • Automated - Complete pipeline from script to final audio

This framework is production-ready for YouTube videos, tutorials, documentaries, and any content requiring programmatic narration generation.

Quality Comparison

Based on naturalness and clarity:

  • Coqui TTS (Neural): Excellent (near-human quality)
  • Kokoro TTS: Very Good (natural with good intonation)
  • Piper: Good (fast, clean, natural enough for most uses)
  • eSpeak NG: Fair (robotic but intelligible)
  • Festival: Fair (dated but functional)
  • Cloud Services (Google/Azure): Excellent (requires internet)
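
To check these trade-offs on your own hardware, a rough timing harness like this sketch helps (assuming the piper binary, espeak-ng, and the Coqui tts CLI are installed as shown above):

		import subprocess
import time
 
SENTENCE = "The quick brown fox jumps over the lazy dog."
 
def time_engine(label, cmd, stdin_text=None):
    """Time a single synthesis run of an engine's CLI."""
    start = time.perf_counter()
    subprocess.run(cmd, input=stdin_text, text=True, check=True)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
 
time_engine("piper", ['./piper', '--model', 'en_US-lessac-medium.onnx',
                      '--output_file', 'piper.wav'], stdin_text=SENTENCE)
time_engine("espeak-ng", ['espeak-ng', SENTENCE, '-w', 'espeak.wav'])
time_engine("coqui", ['tts', '--text', SENTENCE, '--out_path', 'coqui.wav'])
	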

Troubleshooting

Poor Audio Quality

		# Use higher quality model
tts --model_name "tts_models/en/ljspeech/glow-tts" 
    --text "Better quality" 
    --out_path output.wav
 
# Increase sample rate
kokoro-tts "High quality audio" --sample-rate 44100 -o output.wav
	

Slow Generation

		# Use faster engine (Piper)
echo "Fast speech" | piper --model en_US-lessac-low.onnx
 
# Use smaller Coqui model
tts --model_name "tts_models/en/ljspeech/speedy-speech" 
    --text "Faster generation"
	

Unnatural Pronunciation

		# Use phonetic hints
from kokoro_tts import KokoroTTS
 
tts = KokoroTTS()
 
# Add pronunciation hints
text_with_hints = """
The SQL (sequel) database uses APIs (ay pee eyes) for access.
"""
 
audio = tts.synthesize(text_with_hints)
	

Out of Memory

		# Use CPU instead of GPU
export CUDA_VISIBLE_DEVICES=""
 
# Use smaller model
tts --model_name "tts_models/en/ljspeech/speedy-speech" 
    --text "Lower memory usage"
	

Conclusion

AI text-to-speech on Linux has matured into a powerful ecosystem of tools offering everything from lightweight formant synthesis to state-of-the-art neural voices. Whether you need real-time speech for accessibility, audiobook generation, or voice assistant applications, these open-source solutions provide professional results with complete privacy and offline capability.

Start with Piper for a great balance of quality and speed, explore Kokoro for natural-sounding voices, and dive into Coqui TTS for advanced features like voice cloning. The future of voice synthesis is open source and running on your Linux machine.