AI Speech-to-Text on Linux - Complete Guide
Comprehensive guide to setting up AI-powered speech-to-text on Linux using OpenAI Whisper, Vosk, and other tools for accurate audio transcription

Introduction
Speech-to-text technology has revolutionized how we interact with computers. With modern AI models like OpenAI Whisper, you can achieve near-human accuracy for transcription on your Linux machine. This guide covers multiple solutions from local AI models to cloud services, all running on Linux.
Why Use AI Speech-to-Text on Linux?
- Privacy - Process audio locally without sending to cloud services
- Offline capability - Work without internet connection
- Cost-effective - No subscription fees for local models
- Customization - Fine-tune models for specific domains
- Integration - Easy integration with Linux workflows and scripts
Option 1: OpenAI Whisper (Recommended)
Whisper is OpenAI's open-source speech recognition model with exceptional accuracy across multiple languages. It's trained on 680,000 hours of multilingual data, making it robust to accents, background noise, and technical language.
Key Features
- Multilingual Support: 99 languages with automatic language detection
- Multitask Model: Speech recognition, translation, and language identification
- High Accuracy: Makes roughly 50% fewer errors than many specialized models when evaluated zero-shot across diverse datasets
- Robust Performance: Works well with background noise and technical language
- Translation: Can translate non-English speech to English
Installation
# Install Python and pip if not already installed
sudo pacman -S python python-pip # Arch
sudo apt install python3 python3-pip # Ubuntu/Debian
# Install ffmpeg for audio processing
sudo pacman -S ffmpeg # Arch
sudo apt install ffmpeg # Ubuntu/Debian
# Install Whisper (on newer distros, use a virtual environment or pipx to avoid pip's "externally-managed-environment" error)
pip install -U openai-whisper
Available Models
| Model | Parameters | English-only | Multilingual | Required VRAM | Relative Speed |
|---|---|---|---|---|---|
| tiny | 39 M | ✓ | ✓ | ~1 GB | ~32x |
| base | 74 M | ✓ | ✓ | ~1 GB | ~16x |
| small | 244 M | ✓ | ✓ | ~2 GB | ~6x |
| medium | 769 M | ✓ | ✓ | ~5 GB | ~2x |
| large | 1550 M | ✗ | ✓ | ~10 GB | 1x |
| turbo | 809 M | ✗ | ✓ | ~6 GB | ~8x |
The turbo model is an optimized version of large-v3 offering faster transcription with minimal accuracy loss.
Basic Usage
# Transcribe an audio file
whisper audio.mp3
# Specify model size (tiny, base, small, medium, large, turbo)
whisper audio.mp3 --model medium
# Output to specific format
whisper audio.mp3 --output_format txt
# Transcribe with timestamps
whisper audio.mp3 --output_format srt
# Specify language for better accuracy
whisper audio.mp3 --language English
# Translate to English
whisper audio.mp3 --task translate
Python API
import whisper
# Load model
model = whisper.load_model("turbo")
# Transcribe
result = model.transcribe("audio.mp3")
# Print result
print(result["text"])
# Get detailed segments with timestamps
for segment in result["segments"]:
print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
Hybrid Whisper-Vosk Real-Time Transcription
For applications requiring both speed and accuracy, consider a hybrid approach combining Whisper and Vosk. This method uses Vosk for fast real-time transcription with Whisper running in the background to correct errors.
How It Works
- Vosk provides real-time transcription via WebSocket for immediate feedback
- Whisper processes the same audio in the background with a short delay
- Compare outputs using Levenshtein distance to detect significant differences
- Automatically correct VOSK's output when Whisper disagrees
Implementation Example
import asyncio
import json
import Levenshtein
import vosk
import whisper
from vosk import KaldiRecognizer

class HybridTranscriber:
    def __init__(self):
        # Initialize Vosk for real-time transcription
        self.vosk_model = vosk.Model("vosk-model-small-en-us")
        self.recognizer = KaldiRecognizer(self.vosk_model, 16000)
        # Initialize Whisper for accuracy checking
        self.whisper_model = whisper.load_model("base")
        # Audio buffer for Whisper
        self.audio_buffer = []
        self.correction_delay = 2.0  # seconds

    async def transcribe_with_corrections(self, audio_stream):
        vosk_text = ""
        while True:
            # Run both transcription passes concurrently on the incoming stream
            vosk_task = asyncio.create_task(self._vosk_transcribe(audio_stream))
            whisper_task = asyncio.create_task(self._whisper_correct(audio_stream))
            vosk_result = await vosk_task
            whisper_result = await whisper_task
            if vosk_result:
                vosk_text += vosk_result
                print(f"VOSK: {vosk_result}")
            if vosk_result and whisper_result:
                # Check whether a correction is needed (30% difference threshold)
                distance = Levenshtein.distance(vosk_text[-len(whisper_result):], whisper_result)
                if distance > len(whisper_result) * 0.3:
                    print(f"WHISPER CORRECTION: {vosk_result} -> {whisper_result}")
                    vosk_text = vosk_text[:-len(vosk_result)] + whisper_result

    async def _vosk_transcribe(self, audio_stream):
        # Real-time Vosk transcription: return the next finalized utterance
        while True:
            data = await audio_stream.read(4000)
            if self.recognizer.AcceptWaveform(data):
                result = json.loads(self.recognizer.Result())
                return result["text"]

    async def _whisper_correct(self, audio_stream):
        # Background Whisper pass over the buffered audio after a short delay
        await asyncio.sleep(self.correction_delay)
        result = self.whisper_model.transcribe("temp_audio.wav")
        return result["text"]
This hybrid approach provides:
- Immediate feedback from Vosk (real-time)
- High accuracy corrections from Whisper (1-2 second delay)
- Visual indicators when corrections are applied
- Trust scoring to weigh model confidence
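Neither library prescribes a trust score; one simple option, sketched below as an assumption for illustration, uses only fields the APIs actually expose (Vosk's per-word "conf" values when SetWords(True) is enabled, and Whisper's per-segment avg_logprob) and applies a correction only when Vosk looks uncertain while Whisper looks confident:
import json

def vosk_confidence(vosk_result_json):
    # Average per-word confidence from a Vosk result (requires rec.SetWords(True))
    words = json.loads(vosk_result_json).get("result", [])
    return sum(w["conf"] for w in words) / len(words) if words else 0.0

def whisper_confidence(whisper_result):
    # Average segment log-probability from a Whisper transcribe() result
    segments = whisper_result.get("segments", [])
    return sum(s["avg_logprob"] for s in segments) / len(segments) if segments else float("-inf")

def should_correct(vosk_conf, whisper_logprob, vosk_threshold=0.85, whisper_threshold=-0.5):
    # Thresholds are illustrative; tune them on your own audio
    return vosk_conf < vosk_threshold and whisper_logprob > whisper_threshold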
Model Comparison: Whisper vs Alternatives
Based on comprehensive benchmarks, here's how Whisper compares to other open-source transcription models:
Accuracy Comparison
| Model | Word Error Rate | Strengths | Limitations |
|---|---|---|---|
| Whisper Large | ~5-10% | State-of-the-art accuracy, multilingual, robust to noise | High resource requirements |
| Whisper Medium | ~10-15% | Good balance of accuracy/speed | Still resource-intensive |
| Whisper Small | ~15-25% | Fast, good for most applications | Lower accuracy on complex audio |
| Vosk | ~15-30% | Fast, lightweight, real-time capable | Limited language support |
| Kaldi | ~10-20% | Highly customizable, accurate | Complex setup, steep learning curve |
| Coqui STT | ~15-25% | Community-driven, multilingual | Maintenance mode, limited updates |
Setup Complexity
- Whisper: Simple pip install, works out-of-the-box
- Vosk: Easy download + pip install, minimal setup
- Kaldi: Complex installation, requires technical expertise
- Coqui STT: Moderate setup complexity
Performance Metrics
- Whisper Turbo: 8x faster than Large with minimal accuracy loss
- Vosk: Real-time performance on low-end hardware
- Faster-Whisper: Up to 4x faster than original Whisper
- Whisper.cpp: Optimized for CPU inference
Real-time Microphone Transcription
import whisper
import pyaudio
import wave
import tempfile
import os
model = whisper.load_model("base")
# Audio recording parameters
CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
RECORD_SECONDS = 5
def record_audio():
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    print("Recording...")
    frames = []
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)
    print("Finished recording.")
    stream.stop_stream()
    stream.close()
    p.terminate()
    return frames

def transcribe_audio(frames):
    # Save to temporary file
    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_audio:
        wf = wave.open(temp_audio.name, 'wb')
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(pyaudio.PyAudio().get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))
        wf.close()
    # Transcribe
    result = model.transcribe(temp_audio.name)
    os.unlink(temp_audio.name)
    return result["text"]

# Main loop
while True:
    input("Press Enter to start recording...")
    frames = record_audio()
    text = transcribe_audio(frames)
    print(f"Transcription: {text}\n")
Option 2: Faster Whisper
Faster implementation of Whisper with CTranslate2 - up to 4x faster with lower memory usage.
# Install
pip install faster-whisper
# Usage
from faster_whisper import WhisperModel
model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
Option 3: Whisper.cpp
C++ implementation of Whisper for maximum performance.
# Clone and build
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make
# Download models
bash ./models/download-ggml-model.sh base
# Transcribe
./main -m models/ggml-base.bin -f audio.wav
Option 4: Vosk (Lightweight Alternative)
Vosk is a lightweight offline speech recognition toolkit, great for resource-constrained systems.
# Install
pip install vosk
# Download models from https://alphacephei.com/vosk/models
wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
from vosk import Model, KaldiRecognizer
import wave
import json
model = Model("vosk-model-small-en-us-0.15")
wf = wave.open("audio.wav", "rb")
rec = KaldiRecognizer(model, wf.getframerate())
while True:
data = wf.readframes(4000)
if len(data) == 0:
break
if rec.AcceptWaveform(data):
result = json.loads(rec.Result())
print(result["text"])
# Final result
print(json.loads(rec.FinalResult())["text"])
Option 5: Using Hugging Face Transformers
Direct access to Whisper models via Hugging Face.
pip install transformers torch
from transformers import pipeline
# Load Whisper model
transcriber = pipeline("automatic-speech-recognition",
                       model="openai/whisper-large-v3")
# Transcribe
result = transcriber("audio.mp3")
print(result["text"])
# With language specification
result = transcriber("audio.mp3",
generate_kwargs={"language": "english"})
System-wide Integration
Creating a Global Transcribe Command
# Create script at ~/bin/transcribe
#!/bin/bash
if [ -z "$1" ]; then
echo "Usage: transcribe <audio-file>"
exit 1
fi
whisper "$1" --model base --output_format txt --output_dir "$(dirname "$1")"
# Make executable
chmod +x ~/bin/transcribe
# Add to PATH (add to ~/.bashrc or ~/.zshrc)
export PATH="$HOME/bin:$PATH"
# Usage
transcribe recording.mp3
Dmenu/Rofi Integration for Quick Recording
#!/bin/bash
# ~/bin/voice-note
RECORDINGS_DIR="$HOME/voice-notes"
mkdir -p "$RECORDINGS_DIR"
FILENAME="$RECORDINGS_DIR/note-$(date +%Y%m%d-%H%M%S).wav"
# Record audio
arecord -f cd -d 10 "$FILENAME"
# Transcribe
whisper "$FILENAME" --model base --output_format txt
# Show notification
notify-send "Voice Note" "Transcription complete!"
Bind to a hotkey in your window manager for quick voice notes.
GPU Acceleration
For NVIDIA GPUs, install CUDA support for significant speedup:
# Install PyTorch with CUDA
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Whisper will automatically use GPU if available
whisper audio.mp3 --model large # Will use GPU
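To confirm from Python that the GPU is actually being used, you can check CUDA availability and pass the device explicitly (optional; load_model defaults to CUDA when it is available):
import torch
import whisper

print("CUDA available:", torch.cuda.is_available())
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("large", device=device)
print(model.transcribe("audio.mp3")["text"])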
Performance Analysis
Word Error Rates (WER) by Language
Whisper Large-v3 Performance:
- English: ~5-8% WER
- Spanish: ~8-12% WER
- French: ~6-10% WER
- German: ~8-15% WER
- Japanese: ~10-20% WER
- Chinese: ~15-25% WER
Comparative Analysis:
- Whisper Large: ~95% accuracy (English), state-of-the-art multilingual
- Whisper Medium: ~90% accuracy, good balance for most applications
- Whisper Small: ~85% accuracy, fast and lightweight
- Vosk: ~80% accuracy, excellent for real-time applications
- Google Cloud Speech: ~95% accuracy (requires internet)
- Azure Speech Services: ~90-95% accuracy (cloud-based)
Performance Benchmarks
Inference Speed (relative to Large model):
- Tiny: ~32x faster (39M parameters)
- Base: ~16x faster (74M parameters)
- Small: ~6x faster (244M parameters)
- Medium: ~2x faster (769M parameters)
- Turbo: ~8x faster (809M parameters, optimized)
Memory Requirements:
- Tiny/Base: ~1GB VRAM
- Small: ~2GB VRAM
- Medium: ~5GB VRAM
- Large: ~10GB VRAM
- Turbo: ~6GB VRAM
Multilingual Performance
Whisper's multilingual capabilities stem from training on 680,000 hours of data:
- English: about 65% of the training audio (highest accuracy)
- Non-English: the remaining ~35%, split between non-English transcription (~17%) and speech translation to English (~18%)
- Supported Languages: 99 languages total
- Translation: Direct translation from any supported language to English
Performance Optimization Tips
- Choose the right model - base or small for most use cases, turbo for speed
- Use Faster-Whisper - For production applications (up to 4x faster)
- Enable GPU - 10-20x speedup on NVIDIA GPUs with CUDA
- Batch processing - Process multiple files at once (see the sketch after this list)
- Use int8 quantization - With faster-whisper for lower memory usage
- Specify language - Explicit language setting improves accuracy
- Use turbo model - Optimized performance with minimal accuracy loss
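For the batch-processing tip above, loading the model once and reusing it across files avoids repeated startup cost; a minimal sketch with the Python API (file paths are illustrative):
import glob
import whisper

model = whisper.load_model("small")  # load once, reuse for every file
for path in sorted(glob.glob("recordings/*.mp3")):
    result = model.transcribe(path)
    with open(path.rsplit(".", 1)[0] + ".txt", "w") as out:
        out.write(result["text"])
    print(f"Transcribed {path}")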
Cost Analysis (Cloud Infrastructure)
For transcribing 1,000 hours of audio on GCP with A100 GPU:
| Model | Batch Size 1 | Batch Size 4 | Batch Size 16 |
|---|---|---|---|
| Tiny | $15.60 | $12.50 | $11.70 |
| Base | $23.40 | $18.80 | $17.50 |
| Small | $54.70 | $43.80 | $40.90 |
| Medium | $140.60 | $112.50 | $104.70 |
| Large | $281.30 | $225.00 | $209.40 |
| Turbo | $171.90 | $137.50 | $128.10 |
Costs are based on late 2022 GCP pricing and exclude headcount and infrastructure setup costs.
Use Cases
Meeting Transcription
# Record meeting
arecord -f cd -d 3600 meeting.wav
# Transcribe with timestamps
whisper meeting.wav --output_format srt --model medium
YouTube Video Transcription
# Download audio with yt-dlp
yt-dlp -x --audio-format mp3 "VIDEO_URL"
# Transcribe
whisper "video.mp3" --model base
Podcast Processing
# Batch transcribe all podcast episodes
for file in podcasts/*.mp3; do
whisper "$file" --model small --output_format txt
done
Live Captioning
Create a simple live captioning system for accessibility.
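A minimal sketch (not production-ready): reuse the pyaudio pattern from earlier to transcribe short microphone chunks in a loop and print each caption as it completes. Expect a few seconds of latency; the tiny model keeps the loop closer to real time.
import tempfile
import wave
import pyaudio
import whisper

model = whisper.load_model("tiny")  # small model keeps latency low
CHUNK, RATE, SECONDS = 1024, 16000, 5
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)
try:
    while True:
        # Capture a short chunk of microphone audio (ignore overflows during the transcribe gap)
        frames = [stream.read(CHUNK, exception_on_overflow=False)
                  for _ in range(int(RATE / CHUNK * SECONDS))]
        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            wf = wave.open(tmp.name, "wb")
            wf.setnchannels(1)
            wf.setsampwidth(p.get_sample_size(pyaudio.paInt16))
            wf.setframerate(RATE)
            wf.writeframes(b"".join(frames))
            wf.close()
            # Print the caption for this chunk
            print(model.transcribe(tmp.name, language="en")["text"].strip())
finally:
    stream.stop_stream()
    stream.close()
    p.terminate()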
Troubleshooting
Out of Memory Errors
# Use smaller model
whisper audio.mp3 --model tiny
# Or use faster-whisper with int8
Poor Quality Transcription
# Specify language
whisper audio.mp3 --language English --model medium
# Use larger model
whisper audio.mp3 --model large
Slow Performance
# Use whisper.cpp or faster-whisper
# Enable GPU acceleration
# Use smaller model
Advanced Features and Techniques
Custom Model Fine-tuning
For domain-specific applications, fine-tune Whisper on your own dataset:
# Install required packages
pip install datasets transformers accelerate
# Prepare your dataset (audio-text pairs)
# Then fine-tune
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset
# Load pre-trained model
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
processor = WhisperProcessor.from_pretrained("openai/whisper-base")
# Fine-tune on custom dataset
# (Implementation details depend on your specific use case)
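As a starting point, here is a minimal preprocessing sketch following the common Hugging Face recipe (it assumes a dataset whose rows contain a 16 kHz "audio" column and a "text" transcript column; the trainer setup itself is omitted):
def prepare_example(batch):
    audio = batch["audio"]
    # Log-Mel input features for the Whisper encoder
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Transcript token IDs as decoder labels
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

# dataset = load_dataset("your/audio-dataset")  # hypothetical dataset name
# dataset = dataset.map(prepare_example, remove_columns=dataset["train"].column_names)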
Speaker Diarization
Combine Whisper with speaker identification for multi-speaker transcripts:
# Install pyannote.audio for speaker diarization
pip install pyannote.audio
# Usage example
from pyannote.audio import Pipeline
from pyannote.core import Segment, Annotation
# Initialize speaker diarization pipeline
# (the pretrained pipeline is gated on Hugging Face; pass use_auth_token=<your HF token> if prompted)
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
# Process audio
diarization = pipeline("meeting.wav")
# Combine with Whisper transcription
# (Implementation would merge speaker segments with transcribed text)
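A rough merging sketch, assuming Whisper has transcribed the same file: label each Whisper segment with the speaker whose diarization turn overlaps it the most.
import whisper

result = whisper.load_model("base").transcribe("meeting.wav")

def overlap(a_start, a_end, b_start, b_end):
    # Length of the time overlap between two intervals
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

turns = [(turn.start, turn.end, speaker)
         for turn, _, speaker in diarization.itertracks(yield_label=True)]

for seg in result["segments"]:
    # Pick the speaker whose turn overlaps this segment the most
    speaker = max(turns, key=lambda t: overlap(seg["start"], seg["end"], t[0], t[1]))[2]
    print(f"[{speaker}] {seg['text'].strip()}")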
Real-time Streaming
For live audio streams, implement streaming transcription:
import asyncio
import websockets
import json
from faster_whisper import WhisperModel
class StreamingTranscriber:
    def __init__(self):
        self.model = WhisperModel("base", device="cpu", compute_type="int8")
        self.audio_buffer = []
        self.buffer_duration = 30  # seconds

    async def transcribe_stream(self, websocket):
        async for message in websocket:
            # Receive audio chunks
            audio_chunk = json.loads(message)["audio"]
            self.audio_buffer.append(audio_chunk)
            # Process when buffer is full
            if len(self.audio_buffer) >= self.buffer_duration * 16:  # 16 kHz sample rate
                # _buffer_to_audio() is a placeholder for converting the buffered
                # chunks into an array or file that faster-whisper can consume
                segments, _ = self.model.transcribe(
                    self._buffer_to_audio(),
                    language="en",
                    beam_size=5,
                    vad_filter=True
                )
                # Send transcription back
                for segment in segments:
                    await websocket.send(json.dumps({
                        "text": segment.text,
                        "start": segment.start,
                        "end": segment.end
                    }))
                self.audio_buffer = []  # Clear buffer
Integration with Linux Tools
Pipe Audio Through SoX for Preprocessing
# Normalize audio levels before transcription
sox input.wav output.wav norm
# Remove silence
sox input.wav output.wav silence 1 0.1 1% reverse silence 1 0.1 1% reverse
# Convert sample rate
sox input.wav -r 16000 output.wav
# Chain preprocessing with Whisper (the whisper CLI expects a file path rather than stdin)
sox input.wav -r 16000 -c 1 processed.wav
whisper processed.wav --model base
Cron Job for Automated Transcription
# Add to crontab for daily transcription of recorded files
# crontab -e
# 0 2 * * * /home/user/transcribe_daily.sh
#!/bin/bash
# transcribe_daily.sh
RECORDINGS_DIR="/home/user/recordings"
OUTPUT_DIR="/home/user/transcripts"
for file in "$RECORDINGS_DIR"/*.wav; do
if [ -f "$file" ]; then
filename=$(basename "$file" .wav)
whisper "$file" --model base --output_dir "$OUTPUT_DIR" --output_format txt
mv "$file" "${RECORDINGS_DIR}/processed/"
fi
done
Resources
- OpenAI Whisper GitHub - Official repository with detailed documentation
- Whisper Model Card - Hugging Face - Model details and performance metrics
- Faster Whisper - Optimized Whisper implementation
- Whisper.cpp - C++ implementation for maximum performance
- Vosk Models - Pre-trained Vosk models
- Hugging Face Transformers - Alternative Whisper integration
- Whisper-Vosk Hybrid Approach - Real-time correction techniques
- OpenAI Whisper Performance Analysis - Detailed performance benchmarks
- Whisper vs Open-Source Alternatives - Comprehensive model comparison
- OpenAI Whisper Glossary - Concise overview of Whisper's capabilities and architecture
- Whisper AI Benefits and Risks - Balanced analysis of Whisper's advantages and potential harms
Conclusion
AI speech-to-text on Linux has never been more accessible. With Whisper, you get state-of-the-art accuracy running completely offline on your machine. Whether you're transcribing meetings, processing podcasts, or building voice-controlled applications, these tools provide powerful capabilities with complete privacy and control.
Start with Whisper's base model for general use, and scale up to larger models or GPU acceleration as needed. The future of voice computing is open source and running on your Linux box.