stt_ar_fastconformer_hybrid_large_streaming_pcd_v1.0

Cache-aware streaming variant of NVIDIA's stt_ar_fastconformer_hybrid_large_pcd_v1.0, fine-tuned with chunked attention masks so it can run in real time on a phone via the sherpa-onnx OnlineRecognizer.

Streaming config: att_context_size = [70, 13]

Left context: 5.6 s (70 frames of past audio visible to attention)
Right context (lookahead): 1.04 s (13 frames of future audio)
Constant 1.04 s lag between audio input and the corresponding output token
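
The context figures above follow from the encoder frame duration. A minimal sketch of the arithmetic, assuming an 80 ms encoder frame (10 ms feature hop with 8x subsampling, the usual FastConformer setup; this card does not state it explicitly). The helper name context_ms is illustrative only:

```python
# Assumption: one encoder frame spans 80 ms (10 ms hop x 8x subsampling).
FRAME_MS = 80


def context_ms(att_context_size, frame_ms=FRAME_MS):
    """Convert [left, right] attention context (in frames) to milliseconds."""
    left_frames, right_frames = att_context_size
    return left_frames * frame_ms, right_frames * frame_ms


left_ms, right_ms = context_ms([70, 13])
print(left_ms, right_ms)  # 5600 1040
```

Under that assumption, 70 frames give the 5.6 s left context and 13 frames give the 1040 ms lookahead quoted above.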

Performance

metric	value
Best val_wer (10h train + 5h continue)	TBD (computed at upload time)
Tashkeel-stripped WER	TBD
Latency (end-to-end on phone)	~1.19 s (~1.04 s lookahead + ~150 ms pipeline)
Model size	459 MB (.nemo), ~470 MB (ONNX with cache I/O)
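
The two WER rows differ only in text normalization. The sketch below shows one way a tashkeel-stripped WER could be computed; it is not necessarily the scoring script used for this model. It assumes that "tashkeel" covers the combining marks U+064B to U+0652 plus the superscript alef U+0670, and the function names are illustrative:

```python
import re

# Assumption: tashkeel = Arabic combining marks U+064B..U+0652
# (fathatan through sukun) plus the superscript alef U+0670.
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def strip_tashkeel(text):
    """Remove diacritics so only the base letters are compared."""
    return TASHKEEL.sub("", text)


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev holds the previous row of the edit-distance table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def tashkeel_stripped_wer(reference, hypothesis):
    """WER after removing diacritics from both sides."""
    return wer(strip_tashkeel(reference), strip_tashkeel(hypothesis))


# Fully vowelled reference vs. an unvowelled hypothesis (two words each).
ref = "\u0628\u0650\u0633\u0652\u0645\u0650 \u0627\u0644\u0644\u0651\u064E\u0647\u0650"
hyp = "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647"
print(wer(ref, hyp), tashkeel_stripped_wer(ref, hyp))  # 1.0 0.0
```

The example pair shows why the two metrics can diverge sharply: a hypothesis with correct letters but no diacritics scores 100% WER on raw text and 0% once tashkeel is stripped.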

Training data (~10h)

8.31h Quran tartil (5 reciters: abdulsamad, abdullah_basfar, abdullah_matroud, abdurrahmaan_as-sudais, alhusary) from tarteel-ai/everyayah
2.0h FLEURS Egyptian Arabic from google/fleurs ar_eg
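
As a quick check of the mix, the hours listed above can be turned into shares. This is plain arithmetic on the two figures given; the dictionary keys are illustrative labels, not dataset identifiers:

```python
# Hours per source, as listed above.
hours = {"quran_tartil": 8.31, "fleurs_ar_eg": 2.0}

total = sum(hours.values())
shares = {name: h / total for name, h in hours.items()}

print(f"total: {total:.2f} h")  # total: 10.31 h
for name, share in shares.items():
    print(f"{name}: {share:.1%}")  # quran_tartil: 80.6%, then fleurs_ar_eg: 19.4%
```

This matches the "~10h" heading and the roughly 80/20 split quoted under Limitations.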

Usage

Server (NeMo PyTorch)

import nemo.collections.asr as nemo_asr
m = nemo_asr.models.ASRModel.from_pretrained(
    'dev-ahmedhany/stt_ar_fastconformer_hybrid_large_streaming_pcd_v1.0'
)
# Offline transcription: the whole file is processed in one pass.
# With the chunked attention mask, the result should match streaming output.
hyp = m.transcribe(['audio.wav'])

Streaming inference (chunk-by-chunk)

See examples/streaming_inference.py in the NeMo cache-aware streaming notebook.
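
For orientation, a cache-aware encoder with att_context_size = [70, 13] typically consumes audio in chunks of right context + 1 frames. The sketch below only illustrates that chunk bookkeeping and does not call NeMo. It assumes 16 kHz mono audio and 80 ms encoder frames (neither is stated in this card), and iter_chunks is a hypothetical helper:

```python
SAMPLE_RATE = 16_000  # assumption: 16 kHz mono input
FRAME_MS = 80         # assumption: 80 ms per encoder frame
RIGHT_CONTEXT = 13    # from att_context_size = [70, 13]

# Chunk = current frame + lookahead frames = 14 frames = 1120 ms.
CHUNK_FRAMES = RIGHT_CONTEXT + 1
CHUNK_SAMPLES = CHUNK_FRAMES * FRAME_MS * SAMPLE_RATE // 1000


def iter_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(samples), chunk_samples):
        yield samples[start:start + chunk_samples]


audio = [0.0] * (3 * SAMPLE_RATE)  # 3 s of silence as a stand-in
chunks = list(iter_chunks(audio))
print(CHUNK_SAMPLES, len(chunks))  # 17920 3
```

In a real streaming loop, each chunk would be passed to the encoder together with the cache state returned by the previous step.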

Mobile (sherpa-onnx)

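// Abbreviated: the actual constructor takes an OnlineRecognizerConfig
// that references this model file and its tokens file.
final recognizer = sherpa.OnlineRecognizer(
  model: 'streaming_70_13.onnx',
);
final stream = recognizer.createStream();
// per audio chunk:
stream.acceptWaveform(samples: samples, sampleRate: sampleRate);
// decode only while enough audio is buffered for the next step
while (recognizer.isReady(stream)) {
  recognizer.decode(stream);
}
final partialText = recognizer.getResult(stream).text;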

Limitations

Trained on tartil/murattal Quran style; mujawwad style may underperform
1040 ms lookahead: not suitable for use cases that need sub-500 ms latency
Training data was roughly 80% Quran and 20% Egyptian Arabic. Broader MSA coverage would need data such as MASC, CV-17, or SADA22; these failed to load during this training run and could be added in a v1.1 retrain

Training recipe

1× NVIDIA L4 GPU, ~30 min wall clock for 15 epochs total
Optimizer: AdamW, lr 5e-4 (initial), 1e-4 (continuation)
Precision: bf16-mixed
Batch size: 8
max_duration: 30s
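
The recipe above could be expressed as Hydra command-line overrides for a NeMo training script. This is a sketch only: it assumes the run used NeMo's standard ASR config layout, so the key names are an assumption and not taken from the actual config. The epoch split between the initial and continuation runs is not stated and is left out. to_overrides is a hypothetical helper:

```python
# Recipe values from the list above, mapped onto NeMo-style Hydra keys.
# Assumption: key names follow NeMo's standard ASR config layout;
# this card does not show the actual config file.
recipe = {
    "trainer.devices": 1,
    "trainer.precision": "bf16-mixed",
    "model.optim.name": "adamw",
    "model.optim.lr": 5e-4,  # 1e-4 for the continuation run
    "model.train_ds.batch_size": 8,
    "model.train_ds.max_duration": 30,
    "model.encoder.att_context_size": [70, 13],
}


def to_overrides(cfg):
    """Render a flat dict as Hydra command-line overrides."""
    return [f"{key}={str(value).replace(' ', '')}" for key, value in cfg.items()]


print(" ".join(to_overrides(recipe)))
```

Keeping the values in one dictionary makes it easy to rerun the continuation stage by changing only the learning rate.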
