AI Text-to-Speech on Linux - Complete Guide

Comprehensive guide to setting up AI-powered text-to-speech on Linux using Kokoro TTS, Piper, Coqui TTS, and other tools for natural voice synthesis

Introduction

Text-to-speech (TTS) technology has evolved dramatically with modern AI models, producing natural-sounding voices that rival human speech. With tools like Kokoro TTS, Piper, and Coqui TTS, you can generate high-quality speech synthesis entirely on your Linux machine. This guide covers multiple solutions from lightweight local models to advanced neural TTS systems.

Why Use AI Text-to-Speech on Linux?

  • Privacy - Generate speech locally without sending text to cloud services
  • Offline capability - Work without internet connection
  • Cost-effective - No API fees or subscription costs
  • Customization - Fine-tune voices and adjust speech parameters
  • Integration - Easy integration with Linux workflows, screen readers, and applications

Option 1: Kokoro TTS (Natural & High Quality)

Kokoro TTS is a modern, high-quality text-to-speech engine with natural-sounding voices and excellent performance.

Installation

		# Install dependencies
sudo pacman -S python python-pip  # Arch
sudo apt install python3 python3-pip  # Ubuntu/Debian
 
# Install required audio libraries
sudo pacman -S espeak-ng  # Arch
sudo apt install espeak-ng  # Ubuntu/Debian
 
# Install Kokoro TTS
pip install kokoro-tts
	

Basic Usage

		# Generate speech from text
kokoro-tts "Hello, this is a test of Kokoro text-to-speech."
 
# Specify output file
kokoro-tts "Welcome to Linux" -o output.wav
 
# Use different voice
kokoro-tts "Testing different voices" --voice female
 
# Adjust speech rate
kokoro-tts "Faster speech" --rate 1.5
 
# Adjust pitch
kokoro-tts "Higher pitch" --pitch 1.2
	

Python API

		from kokoro_tts import KokoroTTS
 
# Initialize TTS engine
tts = KokoroTTS()
 
# Generate speech
audio = tts.synthesize("Hello from Python!")
 
# Save to file
tts.save(audio, "output.wav")
 
# Use different voice
audio = tts.synthesize("Different voice test", voice="male")
 
# Adjust parameters
audio = tts.synthesize(
    "Custom parameters",
    rate=1.2,
    pitch=0.9,
    volume=0.8
)
	

Advanced Configuration

		from kokoro_tts import KokoroTTS, VoiceConfig
 
# Create custom voice configuration
config = VoiceConfig(
    voice="female",
    rate=1.1,
    pitch=1.0,
    volume=0.9,
    language="en-US"
)
 
tts = KokoroTTS(config=config)
 
# Generate with SSML support
ssml_text = """
<speak>
    <prosody rate="slow">This is slow,</prosody>
    <prosody rate="fast">this is fast,</prosody>
    and this is normal.
</speak>
"""
 
audio = tts.synthesize_ssml(ssml_text)
	

Option 2: Piper TTS (Fast & Lightweight)

Piper is an extremely fast, local neural TTS system perfect for real-time applications.

Installation

		# Download Piper binary
wget https://github.com/rhasspy/piper/releases/download/v1.2.0/piper_amd64.tar.gz
tar xzf piper_amd64.tar.gz
cd piper  # the archive extracts into a piper/ directory
 
# Download a voice model
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json
	

Basic Usage

		# Generate speech
echo "Hello from Piper TTS" | ./piper 
  --model en_US-lessac-medium.onnx 
  --output_file output.wav
 
# From text file
cat article.txt | ./piper 
  --model en_US-lessac-medium.onnx 
  --output_file article.wav
 
# Play directly
echo "Testing audio output" | ./piper 
  --model en_US-lessac-medium.onnx 
  --output_raw | aplay -r 22050 -f S16_LE -t raw -
	

Available Voice Models

		# List all available voices
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/voices.json
 
# Popular English voices:
# - en_US-lessac-medium (high quality, natural)
# - en_US-amy-medium (clear, professional)
# - en_GB-alan-medium (British English)
# - en_US-libritts-high (very high quality, slower)
	
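
To browse the catalog programmatically, a small sketch like this works, assuming only that voices.json is a JSON object keyed by voice id (the exact metadata fields vary per voice):

		import json
 
# Parse the voices.json catalog downloaded above
with open("voices.json", encoding="utf-8") as f:
    catalog = json.load(f)
 
# Print every US English voice id
for voice_id in sorted(catalog):
    if voice_id.startswith("en_US"):
        print(voice_id)
	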

Python Integration

		import subprocess
import json
 
def piper_tts(text, model_path="en_US-lessac-medium.onnx", output_file="output.wav"):
    """Generate speech using Piper TTS"""
    process = subprocess.Popen(
        ['./piper', '--model', model_path, '--output_file', output_file],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE
    )
 
    process.communicate(input=text.encode('utf-8'))
    return output_file
 
# Usage
piper_tts("Hello from Python with Piper!", output_file="test.wav")
	

Option 3: Coqui TTS (Advanced Neural TTS)

Coqui TTS is a professional-grade TTS engine with state-of-the-art voice quality and cloning capabilities.

Installation

		# Install Coqui TTS
pip install TTS
 
# Install with CUDA support (for GPU acceleration)
pip install "TTS[cuda]"
	

Basic Usage

		# List available models
tts --list_models
 
# Generate speech with default model
tts --text "Hello from Coqui TTS" --out_path output.wav
 
# Use specific model
tts --model_name "tts_models/en/ljspeech/tacotron2-DDC" \
    --text "Testing Tacotron model" \
    --out_path output.wav
 
# Multi-speaker model
tts --model_name "tts_models/en/vctk/vits" \
    --text "Different speakers available" \
    --speaker_idx 5 \
    --out_path output.wav
	

Python API

		from TTS.api import TTS
 
# Load model
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", progress_bar=True)
 
# Generate speech
tts.tts_to_file(text="Hello from Coqui TTS in Python", file_path="output.wav")
 
# Multi-speaker model
tts = TTS("tts_models/en/vctk/vits")
tts.tts_to_file(
    text="Speaking with different voice",
    speaker="p225",
    file_path="output.wav"
)
 
# Voice cloning (if supported by model)
tts = TTS("tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="Clone this voice",
    speaker_wav="reference_voice.wav",
    language="en",
    file_path="cloned.wav"
)
	

Voice Cloning

		from TTS.api import TTS
 
# Load a model that supports voice cloning
tts = TTS("tts_models/multilingual/multi-dataset/your_tts")
 
# Clone voice from reference audio
tts.tts_to_file(
    text="This should sound like the reference speaker",
    speaker_wav="path/to/reference/audio.wav",
    language="en",
    file_path="cloned_output.wav"
)
	

Option 4: eSpeak NG (Lightweight & Fast)

eSpeak NG is a compact formant-synthesis TTS engine: less natural than the neural options above, but extremely fast and lightweight.

		# Install
sudo pacman -S espeak-ng  # Arch
sudo apt install espeak-ng  # Ubuntu/Debian
 
# Basic usage
espeak-ng "Hello from eSpeak NG"
 
# Save to file
espeak-ng "Save this audio" -w output.wav
 
# Adjust speed (words per minute)
espeak-ng -s 150 "Faster speech"
 
# Adjust pitch (0-99)
espeak-ng -p 50 "Higher pitch"
 
# Different voice
espeak-ng -v en-us+f3 "Female voice"
 
# Different language
espeak-ng -v es "Hola mundo"
	

Option 5: Festival

Festival is a classic TTS system with multiple synthesis techniques.

		# Install
sudo pacman -S festival festival-us  # Arch
sudo apt install festival festvox-kallpc16k  # Ubuntu/Debian
 
# Basic usage
echo "Hello from Festival" | festival --tts
 
# From file
festival --tts input.txt
 
# Save to file
text2wave input.txt -o output.wav
	

System-wide Integration

Creating a Global TTS Command

		#!/bin/bash
# ~/bin/speak
 
if [ -z "$1" ]; then
    # Read from stdin
    text=$(cat)
else
    text="$1"
fi
 
# Use Piper for high-quality speech (adjust paths to where you extracted Piper)
echo "$text" | "$HOME/piper/piper" \
    --model "$HOME/piper/en_US-lessac-medium.onnx" \
    --output_raw | aplay -r 22050 -f S16_LE -t raw -
	
		# Make executable
chmod +x ~/bin/speak
 
# Usage
speak "Hello world"
echo "Test from stdin" | speak
cat article.txt | speak
	

Clipboard to Speech

		#!/bin/bash
# ~/bin/speak-clipboard
 
# Read clipboard content
text=$(xclip -o -selection clipboard)
 
# Speak it with the global speak command defined above
echo "$text" | speak
	

Bind to a keyboard shortcut for instant clipboard reading.
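
One way to wire this up is through your desktop environment's shortcut settings. Alternatively, a small Python daemon can own the binding, as in this sketch using the pynput library (an assumption; any global-hotkey tool works):

		import os
import subprocess
from pynput import keyboard
 
def speak_clipboard():
    """Run the ~/bin/speak-clipboard script from above."""
    subprocess.run([os.path.expanduser('~/bin/speak-clipboard')])
 
# Ctrl+Alt+S reads the clipboard aloud; runs until interrupted
with keyboard.GlobalHotKeys({'<ctrl>+<alt>+s': speak_clipboard}) as hotkeys:
    hotkeys.join()
	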

PDF to Audio

		#!/bin/bash
# Convert PDF to speech
 
if [ -z "$1" ]; then
    echo "Usage: pdf2speech <pdf-file>"
    exit 1
fi
 
# Extract text from PDF
text=$(pdftotext "$1" -)
 
# Generate speech
echo "$text" | piper 
    --model en_US-lessac-medium.onnx 
    --output_file "${1%.pdf}.wav"
 
echo "Audio saved to ${1%.pdf}.wav"
	

Screen Reader Integration

		#!/usr/bin/env python3
# Read the focused window's title aloud (a simple screen reader building block)

import subprocess
from kokoro_tts import KokoroTTS

tts = KokoroTTS()

def read_focused_window():
    """Fetch the focused window's title via xdotool and speak it."""
    title = subprocess.check_output(['xdotool', 'getwindowfocus', 'getwindowname'])
    title = title.decode('utf-8').strip()

    audio = tts.synthesize(title)
    tts.play(audio)

if __name__ == '__main__':
    # Bind this script to a keyboard shortcut to trigger it on demand
    read_focused_window()
	

Notification Reader

		#!/bin/bash
# Read notifications aloud
 
# Monitor notifications with dunst
dunstctl subscribe | while read -r line; do
    if [[ $line == *"summary"* ]]; then
        notification=$(echo "$line" | sed 's/.*summary: //')
        echo "$notification" | speak
    fi
done
	

Real-time Text-to-Speech Server

Create a local TTS server for multiple applications:

		#!/usr/bin/env python3
from flask import Flask, request, send_file
from kokoro_tts import KokoroTTS
import io
 
app = Flask(__name__)
tts = KokoroTTS()
 
@app.route('/tts', methods=['POST'])
def text_to_speech():
    """
    POST /tts
    Body: {"text": "Hello world", "voice": "female", "rate": 1.0}
    """
    data = request.json
    text = data.get('text', '')
    voice = data.get('voice', 'default')
    rate = data.get('rate', 1.0)
 
    # Generate audio
    audio = tts.synthesize(text, voice=voice, rate=rate)
 
    # Return as WAV file
    audio_io = io.BytesIO()
    tts.save(audio, audio_io)
    audio_io.seek(0)
 
    return send_file(audio_io, mimetype='audio/wav')
 
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
	
		# Usage
curl -X POST http://localhost:5000/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from TTS server"}' \
  --output speech.wav
	

GPU Acceleration

For faster processing with neural TTS models:

		# Install PyTorch with CUDA
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
 
# Coqui TTS will automatically use GPU
tts --text "GPU accelerated speech" 
    --model_name "tts_models/en/ljspeech/tacotron2-DDC" 
    --out_path output.wav
	
		# Force GPU usage in Python
from TTS.api import TTS
 
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC", gpu=True)
tts.tts_to_file(text="Using GPU for synthesis", file_path="output.wav")
	

Performance Optimization Tips

  1. Choose the right engine - Piper for speed, Coqui for quality
  2. Use appropriate models - Smaller models for real-time, larger for quality
  3. Enable GPU - 5-10x speedup with neural models
  4. Cache common phrases - Pre-generate frequently used audio (see the caching sketch after this list)
  5. Use streaming - For long-form content, stream audio output
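
Caching is straightforward to bolt onto any engine in this guide. Below is a minimal sketch of a file-based phrase cache; synthesize_to_wav() is a hypothetical wrapper, shown here around a Piper subprocess call as in the earlier Python integration:

		import hashlib
import subprocess
from pathlib import Path
 
CACHE_DIR = Path.home() / ".cache" / "tts-phrases"
CACHE_DIR.mkdir(parents=True, exist_ok=True)
 
def synthesize_to_wav(text, wav_path):
    """Hypothetical engine wrapper; here, a Piper subprocess."""
    subprocess.run(
        ['./piper', '--model', 'en_US-lessac-medium.onnx',
         '--output_file', str(wav_path)],
        input=text.encode('utf-8'),
        check=True,
    )
 
def cached_tts(text):
    """Return a WAV path for the text, synthesizing only on a cache miss."""
    key = hashlib.sha256(text.encode('utf-8')).hexdigest()
    wav_path = CACHE_DIR / f"{key}.wav"
    if not wav_path.exists():
        synthesize_to_wav(text, wav_path)
    return wav_path
 
# Repeated phrases cost one synthesis, then become instant file lookups
print(cached_tts("Build finished successfully."))
print(cached_tts("Build finished successfully."))  # cache hit
	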

Use Cases

Audiobook Generation

		#!/bin/bash
# Convert epub to audiobook
 
# Install calibre for conversion
sudo pacman -S calibre  # Arch
sudo apt install calibre  # Ubuntu/Debian
 
# Convert epub to txt
ebook-convert book.epub book.txt
 
# Generate audio chapters
split -l 1000 book.txt chapter_
 
for chapter in chapter_*; do
    cat "$chapter" | piper 
        --model en_US-lessac-medium.onnx 
        --output_file "${chapter}.wav"
done
 
# Combine chapters
sox chapter_*.wav audiobook.wav
	

Podcast Generation

		#!/usr/bin/env python3
from kokoro_tts import KokoroTTS
from pydub import AudioSegment
 
tts = KokoroTTS()
 
# Podcast script with multiple speakers
script = [
    {"speaker": "male", "text": "Welcome to the Linux podcast!"},
    {"speaker": "female", "text": "Today we're discussing text-to-speech."},
    {"speaker": "male", "text": "It's amazing what AI can do now."},
]
 
# Generate audio for each line, saving each clip so pydub can load it
podcast = AudioSegment.empty()
pause = AudioSegment.silent(duration=500)  # 500 ms between speakers
 
for i, line in enumerate(script):
    audio = tts.synthesize(line["text"], voice=line["speaker"])
    clip_path = f"line_{i}.wav"
    tts.save(audio, clip_path)
    podcast += AudioSegment.from_wav(clip_path) + pause
 
podcast.export("podcast_episode.mp3", format="mp3")
	

Interactive Voice Assistant

		#!/usr/bin/env python3
import speech_recognition as sr
from kokoro_tts import KokoroTTS
 
recognizer = sr.Recognizer()
tts = KokoroTTS()
 
def listen():
    """Listen for voice command"""
    with sr.Microphone() as source:
        print("Listening...")
        audio = recognizer.listen(source)
        try:
            text = recognizer.recognize_whisper(audio)
            return text
        except sr.UnknownValueError:
            return None
 
def respond(text):
    """Speak response"""
    audio = tts.synthesize(text)
    tts.play(audio)
 
# Main loop
while True:
    command = listen()
    if command:
        print(f"You said: {command}")
 
        # Process command and generate response
        if "weather" in command.lower():
            respond("I don't have weather data, but it's probably nice outside!")
        elif "quit" in command.lower():
            respond("Goodbye!")
            break
        else:
            respond(f"You said: {command}")
	

Email Reader

		#!/usr/bin/env python3
import imaplib
import email
from kokoro_tts import KokoroTTS
 
tts = KokoroTTS()
 
# Connect to email
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('your_email@gmail.com', 'your_password')  # for Gmail, use an app password
mail.select('inbox')
 
# Get unread emails
_, messages = mail.search(None, 'UNSEEN')
 
for msg_id in messages[0].split():
    _, msg_data = mail.fetch(msg_id, '(RFC822)')
    email_body = email.message_from_bytes(msg_data[0][1])
 
    subject = email_body['subject']
    sender = email_body['from']
 
    # Read email aloud
    speech_text = f"New email from {sender}. Subject: {subject}"
    audio = tts.synthesize(speech_text)
    tts.play(audio)
	

Video Narration Pipeline (Production Framework)

For video content creation, you need a robust pipeline to generate synchronized narration. Here's a complete framework using Kokoro TTS for programmatic video narration.

Project Structure:

		video-project/
├── narration.json             # Scene narration script
├── audio/
│   ├── main.py                # Workflow orchestrator
│   ├── json_to_speech.py      # JSON → Audio generator
│   └── concatenate_audio.py   # Audio concatenation
└── speech_output/
    ├── 01_intro.wav           # Generated scene audio
    ├── 02_scene2.wav
    ├── combined_narration.wav # Final audio track
    └── scene_timings.txt      # Timing info for video sync

	

Step 1: Define Your Narration Script

Create narration.json with scene-based narration:

		{
  "scenes": [
    {
      "id": "intro",
      "title": "Introduction",
      "text": [
        "Welcome to our tutorial on Linux text-to-speech.",
        "In this video, we'll explore how to generate high-quality voiceovers programmatically."
      ]
    },
    {
      "id": "kokoro_intro",
      "title": "Kokoro TTS Overview",
      "text": [
        "Kokoro TTS is a modern text-to-speech engine with natural-sounding voices.",
        "It's perfect for generating narration for videos, podcasts, and audiobooks."
      ]
    },
    {
      "id": "tutorial",
      "title": "Tutorial",
      "text": [
        "Let's start by installing the required dependencies.",
        "You'll need Python 3.8 or higher and the Kokoro library."
      ]
    }
  ]
}
	

Step 2: JSON to Speech Converter

Create json_to_speech.py:

		#!/usr/bin/env python3
"""Generate speech files from narration JSON using Kokoro TTS."""
 
import json
import sys
from pathlib import Path
from kokoro import KPipeline
import soundfile as sf
import numpy as np
 
def load_narration_json(json_path):
    """Load and parse narration JSON."""
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
 
    # Support different JSON structures
    if isinstance(data, dict) and 'scenes' in data:
        return data['scenes']
    elif isinstance(data, list):
        return data
    else:
        return [data]
 
def sanitize_filename(filename):
    """Remove invalid filename characters."""
    invalid_chars = '<>:"/\\|?*'
    for char in invalid_chars:
        filename = filename.replace(char, '_')
    return filename.strip()
 
def generate_speech_files(narrations, output_dir, pipeline, voice='af_heart'):
    """Generate speech files for all narrations."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
 
    print(f"Processing {len(narrations)} narrations...\n")
 
    for i, narration in enumerate(narrations):
        # Get text (can be string or array of strings)
        text_data = narration.get('text', '')
        text_segments = [text_data] if isinstance(text_data, str) else text_data
 
        narration_id = narration.get('id', f'narration_{i+1:03d}')
        title = narration.get('title', '')
 
        print(f"Scene: {narration_id} - {title}")
 
        # Generate audio for each text segment
        for segment_idx, text in enumerate(text_segments):
            if not text.strip():
                continue
 
            # Create filename with scene ordering
            scene_prefix = f"{i+1:02d}"
 
            if len(text_segments) == 1:
                filename = f"{scene_prefix}_{sanitize_filename(narration_id)}.wav"
            else:
                filename = f"{scene_prefix}_{sanitize_filename(narration_id)}_{segment_idx+1:02d}.wav"
 
            output_path = output_dir / filename
 
            print(f"  Generating: {filename}")
            print(f"  Text: {text[:70]}{'...' if len(text) > 70 else ''}")
 
            # Generate speech with Kokoro TTS
            generator = pipeline(text, voice=voice)  # default af_heart: American female voice
 
            # Collect audio chunks
            audio_chunks = []
            for graphemes, phonemes, audio in generator:
                if audio is not None and len(audio) > 0:
                    audio_chunks.append(audio)
 
            if audio_chunks:
                # Join all generated chunks rather than keeping only the first
                final_audio = np.concatenate(audio_chunks)
 
                # Save audio file (Kokoro outputs at 24kHz)
                sf.write(str(output_path), final_audio, 24000)
 
                duration = len(final_audio) / 24000
                print(f"  ✓ Saved: {filename} ({duration:.1f}s)\n")
 
if __name__ == "__main__":
    import argparse
 
    parser = argparse.ArgumentParser(description="Generate speech from narration JSON")
    parser.add_argument("json_file", help="Path to narration JSON file")
    parser.add_argument("--output", default="./speech_output",
                       help="Output directory (default: ./speech_output)")
    parser.add_argument("--voice", default="af_heart",
                       help="Voice to use (default: af_heart)")
 
    args = parser.parse_args()
 
    # Load narration script
    narrations = load_narration_json(args.json_file)
    print(f"Found {len(narrations)} scene(s)\n")
 
    # Initialize Kokoro TTS (American English)
    print("Initializing Kokoro TTS...")
    pipeline = KPipeline(lang_code='a', repo_id='hexgrad/Kokoro-82M', device='cpu')
    print("โœ“ Ready\n")
 
    # Generate all speech files
    generate_speech_files(narrations, args.output, pipeline, voice=args.voice)
 
    print(f"\n๐ŸŽ‰ Complete! Audio files in '{args.output}'")
	

Step 3: Audio Concatenation Script

Create concatenate_audio.py to combine scene audio with proper timing:

		#!/usr/bin/env python3
"""Concatenate scene audio files into a single synchronized track."""
 
import numpy as np
import soundfile as sf
from pathlib import Path
import glob
 
def get_audio_files_sorted(audio_dir):
    """Get all WAV files, sorted by filename."""
    pattern = str(audio_dir / "*.wav")
    return sorted([Path(f) for f in glob.glob(pattern)])
 
def concatenate_audio_files(silence_between_scenes=1.0, silence_between_segments=0.5):
    """Concatenate audio with configurable silence.

    Note: this version inserts scene-level silence between every file;
    per-segment spacing is accepted but not yet applied.
    """
 
    audio_dir = Path("speech_output")
    audio_files = get_audio_files_sorted(audio_dir)
 
    if not audio_files:
        print("No audio files found!")
        return
 
    all_audio = []
    timings = []
    current_time = 0.0
    sample_rate = None
 
    print(f"Concatenating {len(audio_files)} audio files...")
    print(f"Silence between scenes: {silence_between_scenes}s")
    print(f"Silence between segments: {silence_between_segments}s\n")
 
    for i, file_path in enumerate(audio_files):
        # Load audio file
        data, file_sample_rate = sf.read(file_path)
        duration = len(data) / file_sample_rate
 
        if sample_rate is None:
            sample_rate = file_sample_rate
 
        print(f"{file_path.name}: {duration:.2f}s at {current_time:.2f}s")
 
        # Add audio
        all_audio.append(data)
 
        # Record timing
        timings.append({
            'file': file_path.name,
            'start': current_time,
            'duration': duration,
            'end': current_time + duration
        })
 
        current_time += duration
 
        # Add silence (except after last file)
        if i < len(audio_files) - 1:
            silence_samples = int(silence_between_scenes * sample_rate)
            silence = np.zeros(silence_samples)
            all_audio.append(silence)
            current_time += silence_between_scenes
 
    # Combine all audio
    combined_audio = np.concatenate(all_audio)
 
    # Save combined audio
    output_path = audio_dir / "combined_narration.wav"
    sf.write(str(output_path), combined_audio, sample_rate)
 
    total_duration = len(combined_audio) / sample_rate
 
    print(f"\nโœ… Saved: {output_path}")
    print(f"Total duration: {total_duration:.2f}s")
 
    # Save timing information for video synchronization
    timings_path = audio_dir / "scene_timings.txt"
    with open(timings_path, "w") as f:
        f.write("Scene Audio Timings\n")
        f.write("=" * 50 + "\n\n")
 
        for timing in timings:
            f.write(f"{timing['file']}\n")
            f.write(f"  Start: {timing['start']:.2f}s\n")
            f.write(f"  End: {timing['end']:.2f}s\n")
            f.write(f"  Duration: {timing['duration']:.2f}s\n\n")
 
        f.write(f"Total Duration: {total_duration:.2f}s\n")
 
    print(f"Timing info: {timings_path}")
 
    return output_path, timings
 
if __name__ == "__main__":
    import sys
 
    silence_between_scenes = float(sys.argv[1]) if len(sys.argv) > 1 else 1.5
    silence_between_segments = float(sys.argv[2]) if len(sys.argv) > 2 else 0.5
 
    concatenate_audio_files(silence_between_scenes, silence_between_segments)
	

Step 4: Workflow Orchestrator

Create main.py to automate the entire pipeline:

		#!/usr/bin/env python3
"""Automated workflow for video narration generation."""
 
import subprocess
import sys
from pathlib import Path
import shutil
 
def main():
    script_dir = Path(__file__).parent
 
    # Clear old output
    speech_output = script_dir / "speech_output"
    if speech_output.exists():
        print("๐Ÿงน Clearing old audio files...")
        shutil.rmtree(speech_output)
    speech_output.mkdir(parents=True, exist_ok=True)
 
    # Step 1: Generate speech from JSON
    print("๐Ÿ”Š Step 1: Generating speech from narration.json...\n")
    result = subprocess.run(
        ["python", "json_to_speech.py", "../narration.json"],
        cwd=script_dir,
        capture_output=True,
        text=True
    )
 
    if result.returncode != 0:
        print(f"โŒ Error: {result.stderr}")
        sys.exit(1)
 
    print(result.stdout)
    print("โœ… Speech generation complete\n")
 
    # Step 2: Concatenate audio
    print("๐Ÿ”— Step 2: Concatenating audio files...\n")
    result = subprocess.run(
        ["python", "concatenate_audio.py"],
        cwd=script_dir,
        capture_output=True,
        text=True
    )
 
    if result.returncode != 0:
        print(f"โŒ Error: {result.stderr}")
        sys.exit(1)
 
    print(result.stdout)
    print("โœ… Concatenation complete\n")
 
    print("๐ŸŽ‰ Workflow complete!")
    print(f"   Combined audio: speech_output/combined_narration.wav")
    print(f"   Timing data: speech_output/scene_timings.txt")
 
if __name__ == "__main__":
    main()
	

Usage:

		# Install dependencies
pip install kokoro soundfile numpy
 
# Run complete workflow
python audio/main.py
 
# Or run steps individually:
python audio/json_to_speech.py narration.json
python audio/concatenate_audio.py
	

Output:

		speech_output/
├── 01_intro.wav                # Scene 1 audio
├── 02_kokoro_intro.wav         # Scene 2 audio
├── 03_tutorial.wav             # Scene 3 audio
├── combined_narration.wav      # Complete narration track
└── scene_timings.txt           # Timing for video sync

	

scene_timings.txt (for video editors):

		Scene Audio Timings
==================================================

01_intro.wav
  Start: 0.00s
  End: 8.50s
  Duration: 8.50s

02_kokoro_intro.wav
  Start: 10.00s
  End: 17.25s
  Duration: 7.25s

03_tutorial.wav
  Start: 18.75s
  End: 25.10s
  Duration: 6.35s

Total Duration: 25.10s

	

Integration with Video Editors:

For Remotion (React-based video):

		import React from 'react';
import { Audio, staticFile } from 'remotion';
 
export const VideoNarration: React.FC = () => {
  return <Audio src={staticFile('audio/combined_narration.wav')} />;
};
	

For Motion Canvas:

		// Motion Canvas attaches narration at the project level (src/project.ts)
import { makeProject } from '@motion-canvas/core';
import example from './scenes/example?scene';
 
export default makeProject({
  scenes: [example],
  audio: '/audio/combined_narration.wav',
  // Use scene_timings.txt to time animations against the narration
});
	

For FFmpeg (manual sync):

		# Combine video with generated narration
ffmpeg -i video.mp4 -i speech_output/combined_narration.wav \
  -c:v copy -c:a aac -map 0:v:0 -map 1:a:0 \
  output_with_narration.mp4
	

Benefits of This Approach:

  • Reproducible - Version control your narration script
  • Editable - Update text and regenerate audio instantly
  • Scalable - Generate hours of narration programmatically
  • Synchronized - Timing data for perfect video alignment
  • Professional - High-quality AI voices at no cost
  • Automated - Complete pipeline from script to final audio

This framework is production-ready for YouTube videos, tutorials, documentaries, and any content requiring programmatic narration generation.

Quality Comparison

Based on naturalness and clarity:

  • Coqui TTS (Neural): Excellent (near-human quality)
  • Kokoro TTS: Very Good (natural with good intonation)
  • Piper: Good (fast, clean, natural enough for most uses)
  • eSpeak NG: Fair (robotic but intelligible)
  • Festival: Fair (dated but functional)
  • Cloud Services (Google/Azure): Excellent (requires internet)
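
To check these trade-offs on your own hardware, a rough timing harness like this sketch helps (assuming the piper binary, espeak-ng, and the Coqui tts CLI are installed as shown above):

		import subprocess
import time
 
SENTENCE = "The quick brown fox jumps over the lazy dog."
 
def time_engine(label, cmd, stdin_text=None):
    """Time a single synthesis run of an engine's CLI."""
    start = time.perf_counter()
    subprocess.run(cmd, input=stdin_text, text=True, check=True)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
 
time_engine("piper", ['./piper', '--model', 'en_US-lessac-medium.onnx',
                      '--output_file', 'piper.wav'], stdin_text=SENTENCE)
time_engine("espeak-ng", ['espeak-ng', SENTENCE, '-w', 'espeak.wav'])
time_engine("coqui", ['tts', '--text', SENTENCE, '--out_path', 'coqui.wav'])
	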

Troubleshooting

Poor Audio Quality

		# Use higher quality model
tts --model_name "tts_models/en/ljspeech/glow-tts" 
    --text "Better quality" 
    --out_path output.wav
 
# Increase sample rate
kokoro-tts "High quality audio" --sample-rate 44100 -o output.wav
	

Slow Generation

		# Use faster engine (Piper)
echo "Fast speech" | piper --model en_US-lessac-low.onnx
 
# Use smaller Coqui model
tts --model_name "tts_models/en/ljspeech/speedy-speech" 
    --text "Faster generation"
	

Unnatural Pronunciation

		# Use phonetic hints
from kokoro_tts import KokoroTTS
 
tts = KokoroTTS()
 
# Add pronunciation hints
text_with_hints = """
The SQL (sequel) database uses APIs (ay pee eyes) for access.
"""
 
audio = tts.synthesize(text_with_hints)
	

Out of Memory

		# Use CPU instead of GPU
export CUDA_VISIBLE_DEVICES=""
 
# Use smaller model
tts --model_name "tts_models/en/ljspeech/speedy-speech" 
    --text "Lower memory usage"
	

Conclusion

AI text-to-speech on Linux has matured into a powerful ecosystem of tools offering everything from lightweight formant synthesis to state-of-the-art neural voices. Whether you need real-time speech for accessibility, audiobook generation, or voice assistant applications, these open-source solutions provide professional results with complete privacy and offline capability.

Start with Piper for a great balance of quality and speed, explore Kokoro for natural-sounding voices, and dive into Coqui TTS for advanced features like voice cloning. The future of voice synthesis is open source and running on your Linux machine.