# nur-dev/nemo-fast — Multilingual Streaming STT
FastConformer Hybrid CTC+Transducer fine-tuned for Kazakh, Russian, Uzbek, and English. Supports real-time streaming inference via sherpa-onnx or batch inference via NeMo.
## Model Description
| Property | Value |
|---|---|
| Architecture | FastConformer Hybrid CTC+Transducer |
| Framework | NVIDIA NeMo |
| Parameters | ~120M |
| Tokenizer | SentencePiece BPE, 4096 vocab |
| Sample rate | 16 kHz mono |
| Languages | kk · ru · uz · en |
| Streaming | Yes (160 ms chunks) |
## WER Results

Evaluated with RNNT beam search (beam=16) plus per-language KenLM 4-gram rescoring (ru α=0.4, uz α=0.7, kk/en α=0, i.e. no rescoring).
| Language | WER (in-domain) | WER (FLEURS) |
|---|---|---|
| English | 17.84% | 22.38% |
| Russian | 33.21% | 57.51% |
| Uzbek | 23.74% | 45.31% |
| Kazakh | 38.78% | 31.31% |
Note on Kazakh FLEURS: the FLEURS WER (31.31%) is lower than the in-domain WER (38.78%) because the in-domain validation set includes conversational speech, which is harder to recognize than FLEURS' read speech.
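The scores above come from the authors' own evaluation pipeline. For sanity-checking your own decodes, a minimal WER computation with the jiwer package (an assumption; any standard WER tool works) looks like this:

```python
import jiwer  # pip install jiwer

refs = ["reference transcript one", "reference transcript two"]  # ground truth
hyps = ["hypothesis one", "hypothesis transcript two"]           # model output

print(f"WER: {jiwer.wer(refs, hyps):.2%}")  # corpus-level WER over all pairs
```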
## Repository Contents

```text
fastconformer_v6.nemo     # Full NeMo model (weights + tokenizer + config)
onnx/
  encoder.onnx            # FastConformer encoder for streaming inference
  decoder_joint.onnx      # Fused RNN-T decoder+joiner for streaming inference
```
## Inference

### Option A — NeMo (batch, GPU recommended)
```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(
    "fastconformer_v6.nemo",
    map_location="cuda",  # use "cpu" if no GPU is available
)
model.eval()

# Transcribe one or more audio files (16 kHz WAV/FLAC)
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0])
```
For longer files, CTC decoding is faster with slightly lower accuracy; switch the hybrid model's decoder before transcribing:

```python
model.change_decoding_strategy(decoder_type="ctc")
transcriptions = model.transcribe(["audio.wav"])
```
### Option B — sherpa-onnx (streaming, CPU or GPU)

#### Install

```bash
pip install sherpa-onnx soundfile numpy
```
#### Download ONNX files

```python
# Using huggingface_hub
from huggingface_hub import hf_hub_download

encoder = hf_hub_download("nur-dev/nemo-fast", "onnx/encoder.onnx")
decoder = hf_hub_download("nur-dev/nemo-fast", "onnx/decoder_joint.onnx")
```
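The tokenizer extraction in the next step needs the `.nemo` archive on disk; it can be fetched the same way (a sketch based on the repository layout above):

```python
# fetch the full .nemo archive, which bundles the tokenizer artifacts
nemo_path = hf_hub_download("nur-dev/nemo-fast", "fastconformer_v6.nemo")
```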
You also need the tokenizer vocabulary. Extract it from the `.nemo` archive:

```bash
# .nemo files are tar archives; list the contents first, since
# artifact names inside may carry a hash prefix
tar -tf fastconformer_v6.nemo

# extract the SentencePiece model and the vocab (adjust names to the listing)
tar -xOf fastconformer_v6.nemo tokenizer.model > tokenizer.model
tar -xOf fastconformer_v6.nemo vocab.txt > vocab.txt
```
#### Transcribe a file (non-streaming)

```python
import sherpa_onnx
import soundfile as sf

recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder="onnx/encoder.onnx",
    decoder="onnx/decoder_joint.onnx",
    joiner="onnx/decoder_joint.onnx",  # fused model: same file for both
    tokens="vocab.txt",
    num_threads=4,
    sample_rate=16000,
    feature_dim=80,
)

audio, sr = sf.read("audio.wav", dtype="float32")
assert sr == 16000, "Resample to 16 kHz first"

stream = recognizer.create_stream()
stream.accept_waveform(sr, audio)
recognizer.decode_stream(stream)
print(stream.result.text)
```
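If the input is not already 16 kHz mono, resample and downmix first. A minimal sketch with the soxr package (an assumption; librosa or torchaudio resampling works just as well):

```python
import soundfile as sf
import soxr  # pip install soxr (assumed; any high-quality resampler works)

audio, sr = sf.read("audio.wav", dtype="float32")
if audio.ndim > 1:   # downmix stereo to mono
    audio = audio.mean(axis=1)
if sr != 16000:      # resample to the model's expected rate
    audio = soxr.resample(audio, sr, 16000)
    sr = 16000
```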
#### Real-time streaming transcription
```python
import sherpa_onnx
import sounddevice as sd
import numpy as np

SAMPLE_RATE = 16000
CHUNK_MS = 160  # 160 ms per chunk
CHUNK_SAMPLES = int(SAMPLE_RATE * CHUNK_MS / 1000)

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    encoder="onnx/encoder.onnx",
    decoder="onnx/decoder_joint.onnx",
    joiner="onnx/decoder_joint.onnx",
    tokens="vocab.txt",
    num_threads=4,
    sample_rate=SAMPLE_RATE,
    feature_dim=80,
    decoding_method="modified_beam_search",
    max_active_paths=4,
    enable_endpoint_detection=True,
    rule1_min_trailing_silence=2.4,   # endpoint after 2.4 s silence, even with no text
    rule2_min_trailing_silence=1.2,   # endpoint after 1.2 s silence once text exists
    rule3_min_utterance_length=20.0,  # force an endpoint after 20 s of speech
)

stream = recognizer.create_stream()

def callback(indata, frames, time, status):
    audio = indata[:, 0].astype(np.float32)
    stream.accept_waveform(SAMPLE_RATE, audio)
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    text = recognizer.get_result(stream).strip()  # get_result returns a str
    if text:
        print(f"\r{text}", end="", flush=True)
    if recognizer.is_endpoint(stream):
        print()
        recognizer.reset(stream)

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                    blocksize=CHUNK_SAMPLES, callback=callback):
    print("Listening — press Ctrl+C to stop")
    while True:
        sd.sleep(100)
```
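Without a microphone, the same recognizer can be driven from a file by feeding it in 160 ms slices (a sketch reusing the recognizer and constants defined above):

```python
import soundfile as sf

audio, sr = sf.read("audio.wav", dtype="float32")
assert sr == SAMPLE_RATE, "Resample to 16 kHz first"

stream = recognizer.create_stream()
for start in range(0, len(audio), CHUNK_SAMPLES):
    stream.accept_waveform(SAMPLE_RATE, audio[start:start + CHUNK_SAMPLES])
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)

stream.input_finished()  # flush: no more audio is coming
while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)
print(recognizer.get_result(stream))
```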
### Option C — WebSocket / REST server

The full server is in the audio-STT repository. Quick start:

```bash
pip install sherpa-onnx fastapi uvicorn websockets soundfile numpy

python serving/serve_streaming.py \
  --encoder onnx/encoder.onnx \
  --decoder onnx/decoder_joint.onnx \
  --joiner onnx/decoder_joint.onnx \
  --tokens vocab.txt \
  --host 0.0.0.0 \
  --port 8001
```
REST endpoint:

```bash
curl -X POST http://localhost:8001/transcribe \
  -F "file=@audio.wav" | jq .
# {"text": "транскрипция аудио"}   (sample output; Russian for "audio transcription")
```
WebSocket (streaming):

```js
const ws = new WebSocket("ws://localhost:8001/ws/transcribe");
ws.onmessage = (e) => console.log(JSON.parse(e.data));

// Send raw 16-bit PCM at 16 kHz in 160 ms chunks. Note that MediaRecorder
// emits compressed containers (e.g. webm/opus), not raw PCM; capture PCM
// frames with the Web Audio API (an AudioWorklet) and send those instead:
audioWorkletNode.port.onmessage = (e) => ws.send(e.data); // Int16Array buffer
```
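From Python, a streaming client can be sketched with the websockets package (the protocol, raw int16 PCM in and JSON results out, is an assumption based on the comment above):

```python
import asyncio
import json

import numpy as np
import soundfile as sf
import websockets

async def stream_file(path: str):
    audio, sr = sf.read(path, dtype="float32")
    assert sr == 16000, "Resample to 16 kHz first"
    pcm = (audio * 32767).astype(np.int16)  # float32 [-1, 1] -> int16 PCM
    chunk = int(sr * 0.160)                 # 160 ms per chunk

    async with websockets.connect("ws://localhost:8001/ws/transcribe") as ws:
        for start in range(0, len(pcm), chunk):
            await ws.send(pcm[start:start + chunk].tobytes())
            try:  # drain interim results without blocking the send loop
                msg = await asyncio.wait_for(ws.recv(), timeout=0.01)
                print(json.loads(msg))
            except asyncio.TimeoutError:
                pass

asyncio.run(stream_file("audio.wav"))
```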
## Limitations
- Kazakh (38.78% WER): Training data is predominantly formal/read speech. Conversational Kazakh (call center, spontaneous) will have higher WER.
- Russian/Uzbek out-of-domain: FLEURS WER is significantly higher than in-domain (ru: 57.51%, uz: 45.31%), indicating sensitivity to recording conditions and speaking style.
- No language identification: The model does not auto-detect language. Accuracy on mixed-language audio is not characterized.
- 16 kHz mono only: Audio must be resampled (and downmixed to mono) before inference.
## License

Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

This model may not be used for commercial purposes without explicit written permission.