stt_ar_fastconformer_hybrid_large_streaming_pcd_v1.0
Cache-aware streaming variant of NVIDIA's stt_ar_fastconformer_hybrid_large_pcd_v1.0,
fine-tuned with chunked attention masks so it can run in real time on a phone via
the sherpa-onnx OnlineRecognizer.
Streaming config: att_context_size = [70, 13]
- Left context: 5.6 seconds (past audio in attention)
- Right context (lookahead): 1040ms (future audio peek)
- Constant 1.04s lag between audio input and corresponding output token
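The context figures above follow from the attention window, which is measured in encoder frames. A minimal sketch of the arithmetic, assuming each encoder frame covers 80 ms (a 10 ms feature hop with 8x subsampling; this is an assumption consistent with the numbers listed here, not something the card states). `FRAME_MS` and `context_ms` are illustrative names:

```python
# Assumption: one encoder frame spans 80 ms (10 ms hop x 8x subsampling).
FRAME_MS = 80


def context_ms(att_context_size, frame_ms=FRAME_MS):
    """Convert a [left, right] attention context in frames to milliseconds."""
    left_frames, right_frames = att_context_size
    return left_frames * frame_ms, right_frames * frame_ms


left_ms, right_ms = context_ms([70, 13])
print(f"left context: {left_ms / 1000:.1f} s")  # 5.6 s
print(f"lookahead:    {right_ms} ms")           # 1040 ms
```

Under the same assumption, a different `att_context_size` can be passed to see how the lookahead, and therefore the output lag, would change.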
Performance
| metric | value |
|---|---|
| Best val_wer (10h train + 5h continue) | TBD (computed at upload time) |
| Tashkeel-stripped WER | TBD |
| Latency (end-to-end on phone) | ~1.04s lookahead + ~150ms pipeline |
| Model size | 459 MB (.nemo), ~470 MB (ONNX with cache I/O) |
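The "Tashkeel-stripped WER" row refers to scoring after diacritics are removed from both reference and hypothesis, so that vowel-mark differences do not count as word errors. The sketch below shows one way such a metric could be computed. The set of stripped marks and the helper names `strip_tashkeel` and `wer` are assumptions for illustration; the card does not say which normalization was used.

```python
import re

# Arabic harakat fathatan..sukun (U+064B-U+0652), superscript alef (U+0670)
# and tatweel (U+0640). Assumption: the exact set used for the card's metric
# is not documented.
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670\u0640]")


def strip_tashkeel(text):
    """Remove diacritics so only the base letters are compared."""
    return TASHKEEL.sub("", text)


def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


# "bismi allahi" with full diacritics vs. the same two words without them.
ref = "\u0628\u0650\u0633\u0652\u0645\u0650 \u0627\u0644\u0644\u0651\u064E\u0647\u0650"
hyp = "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647"

print(wer(ref, hyp))                                  # 1.0
print(wer(strip_tashkeel(ref), strip_tashkeel(hyp)))  # 0.0
```

The gap between the two printed values illustrates why both numbers are worth reporting: the raw WER penalizes missing diacritics, while the stripped WER measures only the base-letter transcription.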
Training data (~10h)
- 8.31h Quran tartil (5 reciters: abdulsamad, abdullah_basfar, abdullah_matroud,
abdurrahmaan_as-sudais, alhusary) from tarteel-ai/everyayah
- 2.0h FLEURS Egyptian Arabic from google/fleurs (ar_eg)
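As a quick check of the data mix, the hours listed above can be summed and turned into shares. This is plain arithmetic on the card's own figures; the dictionary keys are illustrative labels only.

```python
# Hours per source, as listed above.
hours = {"quran_tartil": 8.31, "fleurs_ar_eg": 2.0}

total = sum(hours.values())
shares = {name: h / total for name, h in hours.items()}

print(f"total: {total:.2f} h")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The result is roughly an 80/20 split between Quran recitation and Egyptian Arabic speech.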
Usage
Server (NeMo PyTorch)
import nemo.collections.asr as nemo_asr
m = nemo_asr.models.ASRModel.from_pretrained(
'dev-ahmedhany/stt_ar_fastconformer_hybrid_large_streaming_pcd_v1.0'
)
# Offline transcribe: all chunks at once, equivalent to streaming
hyp = m.transcribe(['audio.wav'])
Streaming inference (chunk-by-chunk)
See examples/streaming_inference.py in the
NeMo cache-aware streaming notebook.
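Chunk-by-chunk inference means the recognizer receives the waveform in fixed-size pieces rather than all at once. The following is a minimal, library-independent sketch of that feeding loop. It assumes 16 kHz mono samples and a hypothetical chunk duration; `iter_chunks` and `chunk_ms` are illustrative names, not NeMo API, and the real chunk size is determined by the model's streaming configuration.

```python
def iter_chunks(samples, sample_rate=16000, chunk_ms=1040):
    """Yield consecutive fixed-length slices of a waveform.

    The final slice may be shorter than the others.
    """
    chunk_len = sample_rate * chunk_ms // 1000
    if chunk_len <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


# Three seconds of silence stands in for real audio.
audio = [0.0] * (3 * 16000)
for i, chunk in enumerate(iter_chunks(audio)):
    # A streaming recognizer would consume `chunk` here and emit partial text.
    print(i, len(chunk))
```

In a real pipeline the loop body would pass each chunk, together with the cached encoder state, to the streaming model; that part is left out here because it depends on the NeMo version in use.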
Mobile (sherpa-onnx)
final recognizer = sherpa.OnlineRecognizer(
model: 'streaming_70_13.onnx',
);
final stream = recognizer.createStream();
// per audio chunk:
stream.acceptWaveform(samples, sampleRate);
recognizer.decode(stream);
final partialText = recognizer.getResult(stream).text;
Limitations
- Trained on tartil/murattal Quran style; mujawwad style may underperform
- 1040ms lookahead, so not suitable for sub-500ms-latency use cases
- Training data was 80% Quran + 20% Egyptian Arabic; broader MSA coverage would benefit from MASC/CV-17/SADA22 data (these failed to load during this training run; could be added in a v1.1 retrain)
Training recipe
- 1× NVIDIA L4 GPU, ~30 min wall clock for 15 epochs total
- Optimizer: AdamW, lr 5e-4 (initial), 1e-4 (continuation)
- Precision: bf16-mixed
- Batch size: 8
- max_duration: 30s