AI Text-to-Speech on Linux - Complete Guide
Comprehensive guide to setting up AI-powered text-to-speech on Linux using Kokoro TTS, Piper, Coqui TTS, and other tools for natural voice synthesis

Introduction
Text-to-speech (TTS) technology has evolved dramatically with modern AI models, producing natural-sounding voices that rival human speech. With tools like Kokoro TTS, Piper, and Coqui TTS, you can generate high-quality speech synthesis entirely on your Linux machine. This guide covers multiple solutions from lightweight local models to advanced neural TTS systems.
Why Use AI Text-to-Speech on Linux?
- Privacy - Generate speech locally without sending text to cloud services
- Offline capability - Work without internet connection
- Cost-effective - No API fees or subscription costs
- Customization - Fine-tune voices and adjust speech parameters
- Integration - Easy integration with Linux workflows, screen readers, and applications
Option 1: Kokoro TTS (Recommended)
Kokoro TTS is a modern, high-quality text-to-speech engine with natural-sounding voices and excellent performance.
Installation
# Install dependencies
sudo pacman -S python python-pip # Arch
sudo apt install python3 python3-pip # Ubuntu/Debian
# Install required audio libraries
sudo pacman -S espeak-ng # Arch
sudo apt install espeak-ng # Ubuntu/Debian
# Install Kokoro TTS
pip install kokoro-tts
Basic Usage
# Generate speech from text
kokoro-tts "Hello, this is a test of Kokoro text-to-speech."
# Specify output file
kokoro-tts "Welcome to Linux" -o output.wav
# Use different voice
kokoro-tts "Testing different voices" --voice female
# Adjust speech rate
kokoro-tts "Faster speech" --rate 1.5
# Adjust pitch
kokoro-tts "Higher pitch" --pitch 1.2
Python API
from kokoro_tts import KokoroTTS
# Initialize TTS engine
tts = KokoroTTS()
# Generate speech
audio = tts.synthesize("Hello from Python!")
# Save to file
tts.save(audio, "output.wav")
# Use different voice
audio = tts.synthesize("Different voice test", voice="male")
# Adjust parameters
audio = tts.synthesize(
    "Custom parameters",
    rate=1.2,
    pitch=0.9,
    volume=0.8
)
Advanced Configuration
from kokoro_tts import KokoroTTS, VoiceConfig
# Create custom voice configuration
config = VoiceConfig(
    voice="female",
    rate=1.1,
    pitch=1.0,
    volume=0.9,
    language="en-US"
)
tts = KokoroTTS(config=config)
# Generate with SSML support
ssml_text = """
<speak>
<prosody rate="slow">This is slow,</prosody>
<prosody rate="fast">this is fast,</prosody>
and this is normal.
</speak>
"""
audio = tts.synthesize_ssml(ssml_text)
Option 2: Piper TTS (Fast & Lightweight)
Piper is an extremely fast, local neural TTS system perfect for real-time applications.
Installation
# Download Piper binary
wget https://github.com/rhasspy/piper/releases/download/v1.2.0/piper_amd64.tar.gz
tar xzf piper_amd64.tar.gz
# Download a voice model
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json
Basic Usage
# Generate speech
echo "Hello from Piper TTS" | ./piper
--model en_US-lessac-medium.onnx
--output_file output.wav
# From text file
cat article.txt | ./piper \
  --model en_US-lessac-medium.onnx \
  --output_file article.wav
# Play directly
echo "Testing audio output" | ./piper
--model en_US-lessac-medium.onnx
--output_raw | aplay -r 22050 -f S16_LE -t raw -
Available Voice Models
# List all available voices
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/voices.json
# Popular English voices:
# - en_US-lessac-medium (high quality, natural)
# - en_US-amy-medium (clear, professional)
# - en_GB-alan-medium (British English)
# - en_US-libritts-high (very high quality, slower)
Python Integration
import subprocess
import json
def piper_tts(text, model_path="en_US-lessac-medium.onnx", output_file="output.wav"):
    """Generate speech using Piper TTS."""
    process = subprocess.Popen(
        ['./piper', '--model', model_path, '--output_file', output_file],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE
    )
    process.communicate(input=text.encode('utf-8'))
    return output_file
# Usage
piper_tts("Hello from Python with Piper!", output_file="test.wav")
Option 3: Coqui TTS (Advanced Neural TTS)
Coqui TTS is a professional-grade TTS engine with state-of-the-art voice quality and cloning capabilities.
Installation
# Install Coqui TTS
pip install TTS
# Install with CUDA support (for GPU acceleration)
pip install TTS[cuda]
Basic Usage
# List available models
tts --list_models
# Generate speech with default model
tts --text "Hello from Coqui TTS" --out_path output.wav
# Use specific model
tts --model_name "tts_models/en/ljspeech/tacotron2-DDC"
--text "Testing Tacotron model"
--out_path output.wav
# Multi-speaker model
tts --model_name "tts_models/en/vctk/vits"
--text "Different speakers available"
--speaker_idx 5
--out_path output.wav
Python API
from TTS.api import TTS
# Load model
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", progress_bar=True)
# Generate speech
tts.tts_to_file(text="Hello from Coqui TTS in Python", file_path="output.wav")
# Multi-speaker model
tts = TTS("tts_models/en/vctk/vits")
tts.tts_to_file(
    text="Speaking with different voice",
    speaker="p225",
    file_path="output.wav"
)
# Voice cloning (if supported by model)
tts = TTS("tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="Clone this voice",
    speaker_wav="reference_voice.wav",
    language="en",
    file_path="cloned.wav"
)
Voice Cloning
from TTS.api import TTS
# Load a model that supports voice cloning
tts = TTS("tts_models/multilingual/multi-dataset/your_tts")
# Clone voice from reference audio
tts.tts_to_file(
    text="This should sound like the reference speaker",
    speaker_wav="path/to/reference/audio.wav",
    language="en",
    file_path="cloned_output.wav"
)
Option 4: eSpeak NG (Lightweight & Fast)
eSpeak NG is a compact, formant synthesis TTS engine - less natural but extremely fast and lightweight.
# Install
sudo pacman -S espeak-ng # Arch
sudo apt install espeak-ng # Ubuntu/Debian
# Basic usage
espeak-ng "Hello from eSpeak NG"
# Save to file
espeak-ng "Save this audio" -w output.wav
# Adjust speed (words per minute)
espeak-ng -s 150 "Faster speech"
# Adjust pitch (0-99)
espeak-ng -p 50 "Higher pitch"
# Different voice
espeak-ng -v en-us+f3 "Female voice"
# Different language
espeak-ng -v es "Hola mundo"
Option 5: Festival
Festival is a classic TTS system with multiple synthesis techniques.
# Install
sudo pacman -S festival festival-us # Arch
sudo apt install festival festvox-kallpc16k # Ubuntu/Debian
# Basic usage
echo "Hello from Festival" | festival --tts
# From file
festival --tts input.txt
# Save to file
text2wave input.txt -o output.wav
System-wide Integration
Creating a Global TTS Command
#!/bin/bash
# ~/bin/speak
if [ -z "$1" ]; then
# Read from stdin
text=$(cat)
else
text="$1"
fi
# Use Piper for high-quality speech
echo "$text" | ./piper
--model en_US-lessac-medium.onnx
--output_raw | aplay -r 22050 -f S16_LE -t raw -
# Make executable
chmod +x ~/bin/speak
# Usage
speak "Hello world"
echo "Test from stdin" | speak
cat article.txt | speak
Clipboard to Speech
#!/bin/bash
# ~/bin/speak-clipboard
# Read clipboard content
text=$(xclip -o -selection clipboard)
# Speak it with Kokoro TTS and play the result
kokoro-tts "$text" -o /tmp/clipboard_speech.wav && aplay /tmp/clipboard_speech.wav
Bind to a keyboard shortcut for instant clipboard reading.
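If you would rather handle the binding in code than in your desktop environment's keyboard settings, the sketch below is one option. It assumes the pynput library (pip install pynput) and an X11 session, neither of which is part of the setup above, and simply runs the speak-clipboard script on Ctrl+Alt+S:
#!/usr/bin/env python3
# Hypothetical global-hotkey listener: Ctrl+Alt+S speaks the clipboard.
# Assumes pynput is installed and ~/bin/speak-clipboard exists (see above).
import os
import subprocess
from pynput import keyboard

def speak_clipboard():
    # Delegate to the shell script defined earlier
    subprocess.Popen([os.path.expanduser("~/bin/speak-clipboard")])

with keyboard.GlobalHotKeys({"<ctrl>+<alt>+s": speak_clipboard}) as hotkeys:
    hotkeys.join()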
PDF to Audio
#!/bin/bash
# Convert PDF to speech
if [ -z "$1" ]; then
echo "Usage: pdf2speech <pdf-file>"
exit 1
fi
# Extract text from PDF
text=$(pdftotext "$1" -)
# Generate speech
echo "$text" | piper
--model en_US-lessac-medium.onnx
--output_file "${1%.pdf}.wav"
echo "Audio saved to ${1%.pdf}.wav"
Screen Reader Integration
#!/usr/bin/env python3
# Simple screen reader for Linux
import subprocess
from kokoro_tts import KokoroTTS
import pyautogui
tts = KokoroTTS()
def read_screen_text():
    """Read the focused window's title aloud."""
    # Get the title of the focused window
    text = subprocess.check_output(['xdotool', 'getwindowfocus', 'getwindowname'])
    text = text.decode('utf-8').strip()
    # Speak it
    audio = tts.synthesize(text)
    tts.play(audio)

# Bind to a keyboard shortcut
while True:
    # Wait for trigger (implement with a keybinding library)
    read_screen_text()
Notification Reader
#!/bin/bash
# Read notifications aloud
# Monitor notifications with dunst
dunstctl subscribe | while read -r line; do
    if [[ $line == *"summary"* ]]; then
        notification=$(echo "$line" | sed 's/.*summary: //')
        echo "$notification" | speak
    fi
done
Real-time Text-to-Speech Server
Create a local TTS server for multiple applications:
#!/usr/bin/env python3
from flask import Flask, request, send_file
from kokoro_tts import KokoroTTS
import io
app = Flask(__name__)
tts = KokoroTTS()
@app.route('/tts', methods=['POST'])
def text_to_speech():
    """
    POST /tts
    Body: {"text": "Hello world", "voice": "female", "rate": 1.0}
    """
    data = request.json
    text = data.get('text', '')
    voice = data.get('voice', 'default')
    rate = data.get('rate', 1.0)
    # Generate audio
    audio = tts.synthesize(text, voice=voice, rate=rate)
    # Return as a WAV file
    audio_io = io.BytesIO()
    tts.save(audio, audio_io)
    audio_io.seek(0)
    return send_file(audio_io, mimetype='audio/wav')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
# Usage
curl -X POST http://localhost:5000/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from TTS server"}' \
  --output speech.wav
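Other programs can talk to the server over plain HTTP. A minimal Python client, assuming the requests library is installed and the Flask server above is running on localhost:5000, might look like this:
#!/usr/bin/env python3
# Minimal client for the local TTS server defined above.
import requests

response = requests.post(
    "http://localhost:5000/tts",
    json={"text": "Hello from the TTS client", "voice": "female", "rate": 1.0},
    timeout=60,
)
response.raise_for_status()

# The server responds with a WAV payload
with open("speech.wav", "wb") as f:
    f.write(response.content)
print("Saved speech.wav")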
GPU Acceleration
For faster processing with neural TTS models:
# Install PyTorch with CUDA
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Coqui TTS will automatically use GPU
tts --text "GPU accelerated speech"
--model_name "tts_models/en/ljspeech/tacotron2-DDC"
--out_path output.wav
# Force GPU usage in Python
from TTS.api import TTS
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC", gpu=True)
tts.tts_to_file(text="Using GPU for synthesis", file_path="output.wav")
Performance Optimization Tips
- Choose the right engine - Piper for speed, Coqui for quality
- Use appropriate models - Smaller models for real-time, larger for quality
- Enable GPU - 5-10x speedup with neural models
- Cache common phrases - Pre-generate frequently used audio (see the sketch after this list)
- Use streaming - For long-form content, stream audio output
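As a rough illustration of the caching idea, here is a minimal sketch that hashes each phrase and reuses the generated WAV on later calls. It assumes the piper binary and the en_US-lessac-medium.onnx model from Option 2 are in the working directory:
#!/usr/bin/env python3
# Minimal phrase cache: synthesize each unique phrase only once with Piper.
# Assumes ./piper and en_US-lessac-medium.onnx are in the current directory.
import hashlib
import subprocess
from pathlib import Path

CACHE_DIR = Path("tts_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_speech(text, model="en_US-lessac-medium.onnx"):
    """Return a WAV path for `text`, generating it only if not cached."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    wav_path = CACHE_DIR / f"{key}.wav"
    if not wav_path.exists():
        subprocess.run(
            ["./piper", "--model", model, "--output_file", str(wav_path)],
            input=text.encode("utf-8"),
            check=True,
        )
    return wav_path

print(cached_speech("Welcome back."))  # the second call returns the cached file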
Use Cases
Audiobook Generation
#!/bin/bash
# Convert epub to audiobook
# Install calibre for conversion
sudo pacman -S calibre
# Convert epub to txt
ebook-convert book.epub book.txt
# Split the text into ~1000-line chunks
split -l 1000 book.txt chapter_
# Generate audio for each chunk
for chapter in chapter_*; do
    cat "$chapter" | piper \
        --model en_US-lessac-medium.onnx \
        --output_file "${chapter}.wav"
done
# Combine chapters
sox chapter_*.wav audiobook.wav
Podcast Generation
#!/usr/bin/env python3
from kokoro_tts import KokoroTTS
from pydub import AudioSegment
import os
tts = KokoroTTS()
# Podcast script with multiple speakers
script = [
    {"speaker": "male", "text": "Welcome to the Linux podcast!"},
    {"speaker": "female", "text": "Today we're discussing text-to-speech."},
    {"speaker": "male", "text": "It's amazing what AI can do now."},
]
# Generate audio for each line, with a pause between speakers
segments = []
for line in script:
    audio = tts.synthesize(line["text"], voice=line["speaker"])
    silence = AudioSegment.silent(duration=500)  # 500 ms pause
    segments.extend([audio, silence])
# Combine all segments into one track
podcast = sum(segments, AudioSegment.empty())
podcast.export("podcast_episode.mp3", format="mp3")
Interactive Voice Assistant
#!/usr/bin/env python3
import speech_recognition as sr
from kokoro_tts import KokoroTTS
recognizer = sr.Recognizer()
tts = KokoroTTS()
def listen():
    """Listen for a voice command."""
    with sr.Microphone() as source:
        print("Listening...")
        audio = recognizer.listen(source)
    try:
        text = recognizer.recognize_whisper(audio)
        return text
    except Exception:
        # Recognition failed
        return None

def respond(text):
    """Speak a response."""
    audio = tts.synthesize(text)
    tts.play(audio)

# Main loop
while True:
    command = listen()
    if command:
        print(f"You said: {command}")
        # Process the command and generate a response
        if "weather" in command.lower():
            respond("I don't have weather data, but it's probably nice outside!")
        elif "quit" in command.lower():
            respond("Goodbye!")
            break
        else:
            respond(f"You said: {command}")
Email Reader
#!/usr/bin/env python3
import imaplib
import email
from kokoro_tts import KokoroTTS
tts = KokoroTTS()
# Connect to email
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('your_email@gmail.com', 'your_password')
mail.select('inbox')
# Get unread emails
_, messages = mail.search(None, 'UNSEEN')
for msg_id in messages[0].split():
    _, msg_data = mail.fetch(msg_id, '(RFC822)')
    email_body = email.message_from_bytes(msg_data[0][1])
    subject = email_body['subject']
    sender = email_body['from']
    # Read the email aloud
    speech_text = f"New email from {sender}. Subject: {subject}"
    audio = tts.synthesize(speech_text)
    tts.play(audio)
Video Narration Pipeline (Production Framework)
For video content creation, you need a robust pipeline to generate synchronized narration. Here's a complete framework using Kokoro TTS for programmatic video narration.
Project Structure:
video-project/
├── narration.json            # Scene narration script
├── audio/
│   ├── main.py               # Workflow orchestrator
│   ├── json_to_speech.py     # JSON → audio generator
│   └── concatenate_audio.py  # Audio concatenation
└── speech_output/
    ├── 01_intro.wav          # Generated scene audio
    ├── 02_scene2.wav
    ├── combined_narration.wav # Final audio track
    └── scene_timings.txt     # Timing info for video sync
Step 1: Define Your Narration Script
Create narration.json with scene-based narration:
{
  "scenes": [
    {
      "id": "intro",
      "title": "Introduction",
      "text": [
        "Welcome to our tutorial on Linux text-to-speech.",
        "In this video, we'll explore how to generate high-quality voiceovers programmatically."
      ]
    },
    {
      "id": "kokoro_intro",
      "title": "Kokoro TTS Overview",
      "text": [
        "Kokoro TTS is a modern text-to-speech engine with natural-sounding voices.",
        "It's perfect for generating narration for videos, podcasts, and audiobooks."
      ]
    },
    {
      "id": "tutorial",
      "title": "Tutorial",
      "text": [
        "Let's start by installing the required dependencies.",
        "You'll need Python 3.8 or higher and the Kokoro library."
      ]
    }
  ]
}
Step 2: JSON to Speech Converter
Create json_to_speech.py:
#!/usr/bin/env python3
"""Generate speech files from narration JSON using Kokoro TTS."""
import json
from pathlib import Path

import numpy as np
import soundfile as sf
from kokoro import KPipeline


def load_narration_json(json_path):
    """Load and parse narration JSON."""
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    # Support different JSON structures
    if isinstance(data, dict) and 'scenes' in data:
        return data['scenes']
    elif isinstance(data, list):
        return data
    else:
        return [data]


def sanitize_filename(filename):
    """Remove invalid filename characters."""
    invalid_chars = '<>:"/\\|?*'
    for char in invalid_chars:
        filename = filename.replace(char, '_')
    return filename.strip()


def generate_speech_files(narrations, output_dir, pipeline, voice='af_heart'):
    """Generate speech files for all narrations (af_heart is an American female voice)."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    print(f"Processing {len(narrations)} narrations...\n")

    for i, narration in enumerate(narrations):
        # Get text (can be a string or an array of strings)
        text_data = narration.get('text', '')
        text_segments = [text_data] if isinstance(text_data, str) else text_data
        narration_id = narration.get('id', f'narration_{i+1:03d}')
        title = narration.get('title', '')
        print(f"Scene: {narration_id} - {title}")

        # Generate audio for each text segment
        for segment_idx, text in enumerate(text_segments):
            if not text.strip():
                continue
            # Create filename with scene ordering
            scene_prefix = f"{i+1:02d}"
            if len(text_segments) == 1:
                filename = f"{scene_prefix}_{sanitize_filename(narration_id)}.wav"
            else:
                filename = f"{scene_prefix}_{sanitize_filename(narration_id)}_{segment_idx+1:02d}.wav"
            output_path = output_dir / filename
            print(f"  Generating: {filename}")
            print(f"  Text: {text[:70]}{'...' if len(text) > 70 else ''}")

            # Generate speech with Kokoro TTS
            generator = pipeline(text, voice=voice)

            # Collect audio chunks
            audio_chunks = []
            for graphemes, phonemes, audio in generator:
                if audio is not None and len(audio) > 0:
                    audio_chunks.append(audio)

            if audio_chunks:
                final_audio = np.concatenate(audio_chunks)
                # Save audio file (Kokoro outputs at 24 kHz)
                sf.write(str(output_path), final_audio, 24000)
                duration = len(final_audio) / 24000
                print(f"  Saved: {filename} ({duration:.1f}s)\n")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Generate speech from narration JSON")
    parser.add_argument("json_file", help="Path to narration JSON file")
    parser.add_argument("--output", default="./speech_output",
                        help="Output directory (default: ./speech_output)")
    parser.add_argument("--voice", default="af_heart",
                        help="Voice to use (default: af_heart)")
    args = parser.parse_args()

    # Load narration script
    narrations = load_narration_json(args.json_file)
    print(f"Found {len(narrations)} scene(s)\n")

    # Initialize Kokoro TTS (American English)
    print("Initializing Kokoro TTS...")
    pipeline = KPipeline(lang_code='a', repo_id='hexgrad/Kokoro-82M', device='cpu')
    print("Ready\n")

    # Generate all speech files
    generate_speech_files(narrations, args.output, pipeline, voice=args.voice)
    print(f"\nComplete! Audio files in '{args.output}'")
Step 3: Audio Concatenation Script
Create concatenate_audio.py to combine scene audio with proper timing:
#!/usr/bin/env python3
"""Concatenate scene audio files into a single synchronized track."""
import glob
from pathlib import Path

import numpy as np
import soundfile as sf


def get_audio_files_sorted(audio_dir):
    """Get all WAV files, sorted by filename."""
    pattern = str(audio_dir / "*.wav")
    return sorted([Path(f) for f in glob.glob(pattern)])


def concatenate_audio_files(silence_between_scenes=1.0, silence_between_segments=0.5):
    """Concatenate audio with configurable silence."""
    audio_dir = Path("speech_output")
    audio_files = get_audio_files_sorted(audio_dir)
    if not audio_files:
        print("No audio files found!")
        return

    all_audio = []
    timings = []
    current_time = 0.0
    sample_rate = None

    print(f"Concatenating {len(audio_files)} audio files...")
    print(f"Silence between scenes: {silence_between_scenes}s")
    print(f"Silence between segments: {silence_between_segments}s\n")

    for i, file_path in enumerate(audio_files):
        # Load audio file
        data, file_sample_rate = sf.read(file_path)
        duration = len(data) / file_sample_rate
        if sample_rate is None:
            sample_rate = file_sample_rate

        print(f"{file_path.name}: {duration:.2f}s at {current_time:.2f}s")

        # Add audio
        all_audio.append(data)

        # Record timing
        timings.append({
            'file': file_path.name,
            'start': current_time,
            'duration': duration,
            'end': current_time + duration
        })
        current_time += duration

        # Add silence (except after the last file)
        if i < len(audio_files) - 1:
            silence_samples = int(silence_between_scenes * sample_rate)
            silence = np.zeros(silence_samples)
            all_audio.append(silence)
            current_time += silence_between_scenes

    # Combine all audio
    combined_audio = np.concatenate(all_audio)

    # Save combined audio
    output_path = audio_dir / "combined_narration.wav"
    sf.write(str(output_path), combined_audio, sample_rate)
    total_duration = len(combined_audio) / sample_rate
    print(f"\nSaved: {output_path}")
    print(f"Total duration: {total_duration:.2f}s")

    # Save timing information for video synchronization
    timings_path = audio_dir / "scene_timings.txt"
    with open(timings_path, "w") as f:
        f.write("Scene Audio Timings\n")
        f.write("=" * 50 + "\n\n")
        for timing in timings:
            f.write(f"{timing['file']}\n")
            f.write(f"  Start: {timing['start']:.2f}s\n")
            f.write(f"  End: {timing['end']:.2f}s\n")
            f.write(f"  Duration: {timing['duration']:.2f}s\n\n")
        f.write(f"Total Duration: {total_duration:.2f}s\n")
    print(f"Timing info: {timings_path}")
    return output_path, timings


if __name__ == "__main__":
    import sys

    silence_between_scenes = float(sys.argv[1]) if len(sys.argv) > 1 else 1.5
    silence_between_segments = float(sys.argv[2]) if len(sys.argv) > 2 else 0.5
    concatenate_audio_files(silence_between_scenes, silence_between_segments)
Step 4: Workflow Orchestrator
Create main.py to automate the entire pipeline:
#!/usr/bin/env python3
"""Automated workflow for video narration generation."""
import shutil
import subprocess
import sys
from pathlib import Path


def main():
    script_dir = Path(__file__).parent

    # Clear old output
    speech_output = script_dir / "speech_output"
    if speech_output.exists():
        print("Clearing old audio files...")
        shutil.rmtree(speech_output)
    speech_output.mkdir(parents=True, exist_ok=True)

    # Step 1: Generate speech from JSON
    print("Step 1: Generating speech from narration.json...\n")
    result = subprocess.run(
        ["python", "json_to_speech.py", "../narration.json"],
        cwd=script_dir,
        capture_output=True,
        text=True
    )
    if result.returncode != 0:
        print(f"Error: {result.stderr}")
        sys.exit(1)
    print(result.stdout)
    print("Speech generation complete\n")

    # Step 2: Concatenate audio
    print("Step 2: Concatenating audio files...\n")
    result = subprocess.run(
        ["python", "concatenate_audio.py"],
        cwd=script_dir,
        capture_output=True,
        text=True
    )
    if result.returncode != 0:
        print(f"Error: {result.stderr}")
        sys.exit(1)
    print(result.stdout)
    print("Concatenation complete\n")

    print("Workflow complete!")
    print("  Combined audio: speech_output/combined_narration.wav")
    print("  Timing data: speech_output/scene_timings.txt")


if __name__ == "__main__":
    main()
Usage:
# Install dependencies
pip install kokoro soundfile numpy
# Run complete workflow
python audio/main.py
# Or run steps individually:
python audio/json_to_speech.py narration.json
python audio/concatenate_audio.py
Output:
speech_output/
├── 01_intro.wav             # Scene 1 audio
├── 02_kokoro_intro.wav      # Scene 2 audio
├── 03_tutorial.wav          # Scene 3 audio
├── combined_narration.wav   # Complete narration track
└── scene_timings.txt        # Timing for video sync
scene_timings.txt (for video editors):
Scene Audio Timings
==================================================

01_intro.wav
  Start: 0.00s
  End: 8.50s
  Duration: 8.50s

02_kokoro_intro.wav
  Start: 10.00s
  End: 17.25s
  Duration: 7.25s

03_tutorial.wav
  Start: 18.75s
  End: 25.10s
  Duration: 6.35s

Total Duration: 25.10s
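Because the file has a fixed layout, a video script can parse it instead of copying numbers by hand. The helper below is a hypothetical example that turns scene_timings.txt into a list of dictionaries:
#!/usr/bin/env python3
# Hypothetical parser for scene_timings.txt (format shown above).
def parse_scene_timings(path="speech_output/scene_timings.txt"):
    scenes, current = [], None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.endswith(".wav"):
                # A new scene entry starts with its WAV filename
                current = {"file": line}
                scenes.append(current)
            elif current and ":" in line:
                key, _, value = line.partition(":")
                if key.strip() in ("Start", "End", "Duration"):
                    current[key.strip().lower()] = float(value.strip().rstrip("s"))
    return scenes

for scene in parse_scene_timings():
    print(scene["file"], scene["start"], scene["end"])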
Integration with Video Editors:
For Remotion (React-based video):
import React from 'react';
import { Audio } from 'remotion';

export const VideoNarration: React.FC = () => {
  return <Audio src="/audio/combined_narration.wav" />;
};
For Motion Canvas:
import { Audio } from '@motion-canvas/core';
export default makeScene2D(function* (view) {
  const audio = new Audio('/audio/combined_narration.wav');
  yield* audio.play();
  yield* waitFor(25.1); // Total duration from scene_timings.txt
});
For FFmpeg (manual sync):
# Combine video with generated narration
ffmpeg -i video.mp4 -i speech_output/combined_narration.wav \
  -c:v copy -c:a aac -map 0:v:0 -map 1:a:0 \
  output_with_narration.mp4
Benefits of This Approach:
- Reproducible - Version control your narration script
- Editable - Update text and regenerate audio instantly
- Scalable - Generate hours of narration programmatically
- Synchronized - Timing data for perfect video alignment
- Professional - High-quality AI voices at no cost
- Automated - Complete pipeline from script to final audio
This framework is production-ready for YouTube videos, tutorials, documentaries, and any content requiring programmatic narration generation.
Quality Comparison
Based on naturalness and clarity:
- Coqui TTS (Neural): Excellent (near-human quality)
- Kokoro TTS: Very Good (natural with good intonation)
- Piper: Good (fast, clean, natural enough for most uses)
- eSpeak NG: Fair (robotic but intelligible)
- Festival: Fair (dated but functional)
- Cloud Services (Google/Azure): Excellent (requires internet)
Troubleshooting
Poor Audio Quality
# Use higher quality model
tts --model_name "tts_models/en/ljspeech/glow-tts"
--text "Better quality"
--out_path output.wav
# Increase sample rate
kokoro-tts "High quality audio" --sample-rate 44100 -o output.wav
Slow Generation
# Use faster engine (Piper)
echo "Fast speech" | piper --model en_US-lessac-low.onnx
# Use smaller Coqui model
tts --model_name "tts_models/en/ljspeech/speedy-speech"
--text "Faster generation"
Unnatural Pronunciation
# Use phonetic hints
from kokoro_tts import KokoroTTS
tts = KokoroTTS()
# Add pronunciation hints
text_with_hints = """
The SQL (sequel) database uses APIs (ay pee eyes) for access.
"""
audio = tts.synthesize(text_with_hints)
Out of Memory
# Use CPU instead of GPU
export CUDA_VISIBLE_DEVICES=""
# Use smaller model
tts --model_name "tts_models/en/ljspeech/speedy-speech"
--text "Lower memory usage"
Conclusion
AI text-to-speech on Linux has matured into a powerful ecosystem of tools offering everything from lightweight formant synthesis to state-of-the-art neural voices. Whether you need real-time speech for accessibility, audiobook generation, or voice assistant applications, these open-source solutions provide professional results with complete privacy and offline capability.
Start with Piper for a great balance of quality and speed, explore Kokoro for natural-sounding voices, and dive into Coqui TTS for advanced features like voice cloning. The future of voice synthesis is open source and running on your Linux machine.