# nur-dev/nemo-fast — Multilingual Streaming STT
FastConformer Hybrid CTC+Transducer fine-tuned for Kazakh, Russian, Uzbek, and English. Supports real-time streaming inference via sherpa-onnx or batch inference via NeMo.
## Model Description
| Property | Value |
|---|---|
| Architecture | FastConformer Hybrid CTC+Transducer |
| Framework | NVIDIA NeMo |
| Parameters | ~120M |
| Tokenizer | SentencePiece BPE, 4096 vocab |
| Sample rate | 16 kHz mono |
| Languages | kk · ru · uz · en |
| Streaming | Yes (160 ms chunks) |
## WER Results

Evaluated with RNNT beam search (beam=16) plus per-language KenLM 4-gram rescoring (ru α=0.4, uz α=0.7, kk/en α=0, i.e. no rescoring).
| Language | WER (in-domain) | WER (FLEURS) |
|---|---|---|
| English | 17.84% | 22.38% |
| Russian | 33.21% | 57.51% |
| Uzbek | 23.74% | 45.31% |
| Kazakh | 38.78% | 31.31% |
Note on Kazakh FLEURS: the FLEURS WER (31.31%) is lower than the in-domain WER (38.78%) because the in-domain validation set includes conversational speech, which is harder to recognize than FLEURS' read speech.
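The scores above come from the authors' own evaluation pipeline. For sanity-checking your own decodes, a minimal WER computation with the jiwer package (an assumption; any standard WER tool works) looks like this:

```python
import jiwer  # pip install jiwer

refs = ["reference transcript one", "reference transcript two"]  # ground truth
hyps = ["hypothesis one", "hypothesis transcript two"]           # model output

print(f"WER: {jiwer.wer(refs, hyps):.2%}")  # corpus-level WER over all pairs
```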
## Repository Contents

```text
fastconformer_v6.nemo     # Full NeMo model (weights + tokenizer + config)
onnx/
  encoder.onnx            # FastConformer encoder for streaming inference
  decoder_joint.onnx      # Fused RNN-T decoder+joiner for streaming inference
```
## Inference

### Option A — NeMo (batch, GPU recommended)
```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(
    "fastconformer_v6.nemo",
    map_location="cuda",  # use "cpu" if no GPU is available
)
model.eval()

# Transcribe one or more audio files (16 kHz WAV/FLAC)
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0])
```
For longer files, CTC decoding is faster with slightly lower accuracy; switch the hybrid model's decoder before transcribing:

```python
model.change_decoding_strategy(decoder_type="ctc")
transcriptions = model.transcribe(["audio.wav"])
```
### Option B — sherpa-onnx (streaming, CPU or GPU)

#### Install

```bash
pip install sherpa-onnx soundfile numpy
```
#### Download ONNX files

```python
# Using huggingface_hub
from huggingface_hub import hf_hub_download

encoder = hf_hub_download("nur-dev/nemo-fast", "onnx/encoder.onnx")
decoder = hf_hub_download("nur-dev/nemo-fast", "onnx/decoder_joint.onnx")
```
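The tokenizer extraction in the next step needs the `.nemo` archive on disk; it can be fetched the same way (a sketch based on the repository layout above):

```python
# fetch the full .nemo archive, which bundles the tokenizer artifacts
nemo_path = hf_hub_download("nur-dev/nemo-fast", "fastconformer_v6.nemo")
```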
You also need the tokenizer vocabulary. Extract it from the `.nemo` archive:

```bash
# .nemo files are tar archives; list the contents first, since
# artifact names inside may carry a hash prefix
tar -tf fastconformer_v6.nemo

# extract the SentencePiece model and the vocab (adjust names to the listing)
tar -xOf fastconformer_v6.nemo tokenizer.model > tokenizer.model
tar -xOf fastconformer_v6.nemo vocab.txt > vocab.txt
```
#### Transcribe a file (non-streaming)

```python
import sherpa_onnx
import soundfile as sf

recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder="onnx/encoder.onnx",
    decoder="onnx/decoder_joint.onnx",
    joiner="onnx/decoder_joint.onnx",  # fused model: same file for both
    tokens="vocab.txt",
    num_threads=4,
    sample_rate=16000,
    feature_dim=80,
)

audio, sr = sf.read("audio.wav", dtype="float32")
assert sr == 16000, "Resample to 16 kHz first"

stream = recognizer.create_stream()
stream.accept_waveform(sr, audio)
recognizer.decode_stream(stream)
print(stream.result.text)
```
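If the input is not already 16 kHz mono, resample and downmix first. A minimal sketch with the soxr package (an assumption; librosa or torchaudio resampling works just as well):

```python
import soundfile as sf
import soxr  # pip install soxr (assumed; any high-quality resampler works)

audio, sr = sf.read("audio.wav", dtype="float32")
if audio.ndim > 1:   # downmix stereo to mono
    audio = audio.mean(axis=1)
if sr != 16000:      # resample to the model's expected rate
    audio = soxr.resample(audio, sr, 16000)
    sr = 16000
```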
#### Real-time streaming transcription
```python
import sherpa_onnx
import sounddevice as sd
import numpy as np

SAMPLE_RATE = 16000
CHUNK_MS = 160  # 160 ms per chunk
CHUNK_SAMPLES = int(SAMPLE_RATE * CHUNK_MS / 1000)

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    encoder="onnx/encoder.onnx",
    decoder="onnx/decoder_joint.onnx",
    joiner="onnx/decoder_joint.onnx",
    tokens="vocab.txt",
    num_threads=4,
    sample_rate=SAMPLE_RATE,
    feature_dim=80,
    decoding_method="modified_beam_search",
    max_active_paths=4,
    enable_endpoint_detection=True,
    rule1_min_trailing_silence=2.4,   # endpoint after 2.4 s silence, even with no text
    rule2_min_trailing_silence=1.2,   # endpoint after 1.2 s silence once text exists
    rule3_min_utterance_length=20.0,  # force an endpoint after 20 s of speech
)

stream = recognizer.create_stream()

def callback(indata, frames, time, status):
    audio = indata[:, 0].astype(np.float32)
    stream.accept_waveform(SAMPLE_RATE, audio)
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    text = recognizer.get_result(stream).strip()  # get_result returns a str
    if text:
        print(f"\r{text}", end="", flush=True)
    if recognizer.is_endpoint(stream):
        print()
        recognizer.reset(stream)

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                    blocksize=CHUNK_SAMPLES, callback=callback):
    print("Listening — press Ctrl+C to stop")
    while True:
        sd.sleep(100)
```
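Without a microphone, the same recognizer can be driven from a file by feeding it in 160 ms slices (a sketch reusing the recognizer and constants defined above):

```python
import soundfile as sf

audio, sr = sf.read("audio.wav", dtype="float32")
assert sr == SAMPLE_RATE, "Resample to 16 kHz first"

stream = recognizer.create_stream()
for start in range(0, len(audio), CHUNK_SAMPLES):
    stream.accept_waveform(SAMPLE_RATE, audio[start:start + CHUNK_SAMPLES])
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)

stream.input_finished()  # flush: no more audio is coming
while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)
print(recognizer.get_result(stream))
```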
### Option C — WebSocket / REST server

The full server is in the audio-STT repository. Quick start:

```bash
pip install sherpa-onnx fastapi uvicorn websockets soundfile numpy

python serving/serve_streaming.py \
  --encoder onnx/encoder.onnx \
  --decoder onnx/decoder_joint.onnx \
  --joiner onnx/decoder_joint.onnx \
  --tokens vocab.txt \
  --host 0.0.0.0 \
  --port 8001
```
REST endpoint:

```bash
curl -X POST http://localhost:8001/transcribe \
  -F "file=@audio.wav" | jq .
# {"text": "транскрипция аудио"}   (sample output; Russian for "audio transcription")
```
WebSocket (streaming):

```js
const ws = new WebSocket("ws://localhost:8001/ws/transcribe");
ws.onmessage = (e) => console.log(JSON.parse(e.data));

// Send raw 16-bit PCM at 16 kHz in 160 ms chunks. Note that MediaRecorder
// emits compressed containers (e.g. webm/opus), not raw PCM; capture PCM
// frames with the Web Audio API (an AudioWorklet) and send those instead:
audioWorkletNode.port.onmessage = (e) => ws.send(e.data); // Int16Array buffer
```
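From Python, a streaming client can be sketched with the websockets package (the protocol, raw int16 PCM in and JSON results out, is an assumption based on the comment above):

```python
import asyncio
import json

import numpy as np
import soundfile as sf
import websockets

async def stream_file(path: str):
    audio, sr = sf.read(path, dtype="float32")
    assert sr == 16000, "Resample to 16 kHz first"
    pcm = (audio * 32767).astype(np.int16)  # float32 [-1, 1] -> int16 PCM
    chunk = int(sr * 0.160)                 # 160 ms per chunk

    async with websockets.connect("ws://localhost:8001/ws/transcribe") as ws:
        for start in range(0, len(pcm), chunk):
            await ws.send(pcm[start:start + chunk].tobytes())
            try:  # drain interim results without blocking the send loop
                msg = await asyncio.wait_for(ws.recv(), timeout=0.01)
                print(json.loads(msg))
            except asyncio.TimeoutError:
                pass

asyncio.run(stream_file("audio.wav"))
```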
## Limitations
- Kazakh (38.78% WER): Training data is predominantly formal/read speech. Conversational Kazakh (call center, spontaneous) will have higher WER.
- Russian/Uzbek out-of-domain: FLEURS WER is significantly higher than in-domain (ru: 57.51%, uz: 45.31%), indicating sensitivity to recording conditions and speaking style.
- No language identification: The model does not auto-detect language. Accuracy on mixed-language audio is not characterized.
- 16 kHz mono only: Audio must be resampled (and downmixed to mono) before inference.
## License

Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

This model may not be used for commercial purposes without explicit written permission.