stt_ar_fastconformer_hybrid_large_streaming_pcd_v1.0

Cache-aware streaming variant of NVIDIA's stt_ar_fastconformer_hybrid_large_pcd_v1.0, fine-tuned with chunked attention masks so that it can run in real time on a phone via the sherpa-onnx OnlineRecognizer.

Streaming config: att_context_size = [70, 13]

  • Left context: 70 frames, 5.6 s (past audio visible to attention)
  • Right context (lookahead): 13 frames, 1040 ms (future audio peek)
  • Constant 1.04 s lag between audio input and the corresponding output token
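
The figures above follow from the frame counts in att_context_size. A minimal sketch of the conversion, assuming an 80 ms encoder frame (this value is not stated in the card; it is implied by 70 frames = 5.6 s), with context_ms as a hypothetical helper name:

```python
# Assumed encoder frame duration in milliseconds.
# Not stated in the card; implied by 70 frames = 5.6 s.
FRAME_MS = 80

def context_ms(att_context_size, frame_ms=FRAME_MS):
    """Convert [left, right] context sizes from frames to milliseconds."""
    left, right = att_context_size
    return left * frame_ms, right * frame_ms

left_ms, right_ms = context_ms([70, 13])
print(left_ms, right_ms)  # 5600 1040
```

Working in integer milliseconds avoids floating-point rounding when comparing against the 5.6 s and 1040 ms figures.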

Performance

Metric                                    Value
Best val_wer (10h train + 5h continue)    TBD (computed at upload time)
Tashkeel-stripped WER                     TBD
Latency (end-to-end on phone)             ~1.04 s lookahead + ~150 ms pipeline
Model size                                459 MB (.nemo), ~470 MB (ONNX with cache I/O)
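
The card does not say how the tashkeel-stripped WER is computed. The sketch below shows one plausible way: remove Arabic diacritics from both reference and hypothesis, then compute word error rate with a word-level edit distance. The diacritic range (U+064B to U+0652 plus U+0670) and the helper names strip_tashkeel and wer are assumptions for illustration, not the author's evaluation code.

```python
import re

# Assumed definition of tashkeel: fathatan..sukun (U+064B-U+0652)
# plus superscript alef (U+0670). The card does not specify the exact set.
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")

def strip_tashkeel(text: str) -> str:
    """Remove Arabic diacritic marks, leaving the base letters."""
    return TASHKEEL.sub("", text)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)

def tashkeel_stripped_wer(reference: str, hypothesis: str) -> float:
    """WER after removing diacritics from both sides."""
    return wer(strip_tashkeel(reference), strip_tashkeel(hypothesis))
```

Comparing wer and tashkeel_stripped_wer on the same pair separates errors in the base letters from errors that affect only the diacritics.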

Training data (~10h)

  • 8.31h Quran tartil (5 reciters: abdulsamad, abdullah_basfar, abdullah_matroud, abdurrahmaan_as-sudais, alhusary) from tarteel-ai/everyayah
  • 2.0h FLEURS Egyptian Arabic from google/fleurs ar_eg
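
As a quick check on the mix, the hours listed above work out to roughly 80% Quran and 20% Egyptian Arabic. A small sketch of the arithmetic (the dictionary keys are illustrative labels, not dataset identifiers):

```python
# Hours per source, taken from the list above.
hours = {"quran_tartil": 8.31, "fleurs_ar_eg": 2.0}

total = sum(hours.values())
# Share of each source as a percentage of total training hours.
shares = {name: round(100 * h / total, 1) for name, h in hours.items()}

print(round(total, 2))  # 10.31
print(shares)           # {'quran_tartil': 80.6, 'fleurs_ar_eg': 19.4}
```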

Usage

Server (NeMo PyTorch)

import nemo.collections.asr as nemo_asr
m = nemo_asr.models.ASRModel.from_pretrained(
    'dev-ahmedhany/stt_ar_fastconformer_hybrid_large_streaming_pcd_v1.0'
)
# Offline transcribe: all chunks at once, equivalent to streaming
hyp = m.transcribe(['audio.wav'])

Streaming inference (chunk-by-chunk)

See examples/streaming_inference.py in the NeMo cache-aware streaming notebook.
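
Independently of the NeMo script, the feeding side of a chunk-by-chunk loop can be sketched in plain Python. The example below only slices a waveform into fixed-size chunks and does not call the model. The 16 kHz sample rate and the 1040 ms chunk duration are assumptions chosen for illustration (the real chunk size is determined by the model's streaming configuration), and iter_chunks is a hypothetical helper.

```python
from typing import Iterator, List

def iter_chunks(samples: List[float],
                sample_rate: int = 16000,   # assumed input sample rate
                chunk_ms: int = 1040        # illustrative chunk duration
                ) -> Iterator[List[float]]:
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    chunk_len = sample_rate * chunk_ms // 1000
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]

# Three seconds of silence as a stand-in for real audio.
audio = [0.0] * (3 * 16000)
for chunk in iter_chunks(audio):
    # A streaming recognizer would consume `chunk` here.
    print(len(chunk))
```

A real streaming loop would pass each chunk to the recognizer together with the cache state carried over from the previous step.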

Mobile (sherpa-onnx)

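// Abbreviated constructor; consult the sherpa-onnx docs for the full
// recognizer config (model files, tokens, threads).
final recognizer = sherpa.OnlineRecognizer(
  model: 'streaming_70_13.onnx',
);
final stream = recognizer.createStream();
// per audio chunk:
stream.acceptWaveform(samples: samples, sampleRate: sampleRate);
// decode only while enough audio is buffered for another step
while (recognizer.isReady(stream)) {
  recognizer.decode(stream);
}
final partialText = recognizer.getResult(stream).text;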

Limitations

  • Trained on tartil/murattal Quran style; mujawwad style may underperform
  • 1040 ms lookahead: not suitable for sub-500 ms latency use cases
  • Training data was roughly 80% Quran and 20% Egyptian Arabic. Broader MSA coverage would need additional data such as MASC, CV-17, or SADA22; these failed to load during this training run and could be added in a v1.1 retrain

Training recipe

  • 1× NVIDIA L4 GPU, ~30 min wall clock for 15 epochs total
  • Optimizer: AdamW, lr 5e-4 (initial), 1e-4 (continuation)
  • Precision: bf16-mixed
  • Batch size: 8
  • max_duration: 30s