granite-speech-4.1-2b-nar – GGUF

GGUF conversion of ibm-granite/granite-speech-4.1-2b-nar for use with CrispASR.

The NAR variant replaces the autoregressive Granite decoder with a non-autoregressive one: the LLM runs once over the full sequence with is_causal=False everywhere instead of sampling token by token. The slot positions in the LLM input absorb edit votes; a per-row argmax + unique_consecutive + EOS drop gives the final transcript. Throughput is several times higher than the autoregressive variants on the same audio.
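The decode rule is small enough to state in full. A minimal sketch, assuming the slot logits arrive row-major as [n_slots × n_vocab]; the function and buffer names are illustrative, not the actual granite_nle internals:

```c
#include <stddef.h>

// Sketch of the NAR slot decode: per-row argmax, collapse consecutive
// repeats (unique_consecutive), drop EOS. Caller provides an
// n_slots-sized `out` buffer; returns the number of tokens written.
static int nar_decode(const float *logits, int n_slots, int n_vocab,
                      int eos_id, int *out) {
    int n = 0, prev = -1;
    for (int s = 0; s < n_slots; ++s) {
        const float *row = logits + (size_t)s * n_vocab;
        int best = 0;
        for (int v = 1; v < n_vocab; ++v)
            if (row[v] > row[best]) best = v;   // per-row argmax
        if (best == prev) continue;             // unique_consecutive
        prev = best;
        if (best == eos_id) continue;           // drop EOS
        out[n++] = best;
    }
    return n;
}
```

Because this replaces the sampling loop with one pass and a cheap reduction, the speedup over autoregressive decoding scales with output length.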

Architecturally, NAR differs from the base 4.1-2b in three places (the encoder-side mechanics are sketched after the list):

  • Encoder self-conditioning at layer 8 β€” the layer-8 CTC softmax is fed back into the hidden stream as a 1024-dim residual. The per-frame blank probability captured here also drives the posterior-weighted pool of the BPE auxiliary head.
  • BPE auxiliary CTC head (100353-vocab) on a posterior-pooled (window=4) view of the final hidden states. Its greedy decode initialises the LLM input text.
  • 4-layer encoder hidden-state concatenation for the projector input ([layer 4, 8, 12, last] β†’ 4Γ—1024 = 4096-dim), vs. base's single layer and PLUS's 2-layer concat.

Files

| File | Quantisation | Size | Notes |
|------|--------------|------|-------|
| granite-speech-4.1-2b-nar-f16.gguf | F16 | ~5.4 GB | Encoder + projector in F32, LLM weights in F16; full-parity reference |
| granite-speech-4.1-2b-nar-q4_k.gguf | Q4_K | ~3.2 GB | Recommended. LLM layers Q4_K; encoder + projector kept F32 (precision-sensitive), so those stages are bit-identical to the F16 file |
| granite-speech-4.1-2b-nar-q4_k-f16enc.gguf | Q4_K + F16 encoder | ~2.4 GB | LLM Q4_K; encoder + projector F16 (norms / biases / BN stats stay F32). ~800 MB smaller than the recommended Q4_K with no measurable parity loss on this clip |
| granite-speech-4.1-2b-nar-q4_k-mini.gguf | Q4_K (aggressive) | ~1.5 GB | Encoder, projector and LLM all Q4_K; smallest and fastest to download. Cosine parity is noticeably worse on NAR than on base or PLUS mini because the 4-layer hidden-state concat (the architectural delta in NAR) quadruples the surface for Q4_K rounding error. JFK still transcribes correctly because the LLM's argmax recovers the right token; harder material is more likely to regress. Use Q4_K or Q4_K-f16enc unless disk size is the binding constraint |

Cosine parity (vs PyTorch BF16 reference, JFK 11 s clip)

| Stage | F16 cos_min | Q4_K cos_min | Q4_K-f16enc cos_min | Q4_K-mini cos_min |
|-------|-------------|--------------|---------------------|-------------------|
| mel_spectrogram | 0.999997 | 0.999997 | 0.999997 | 0.999997 |
| encoder_out | 0.999852 | 0.999852 | 0.999852 | 0.104 |
| encoder_logits (CTC) | 0.999675 | 0.999675 | 0.999675 | 0.864 |
| projector_out | 0.999999 | 0.999999 | 0.999999 | 0.965 |
| editing_logits | 0.999999 | 0.956 | 0.956 | 0.956 |
| editing_logits_top1 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| transcribe == ref | ✅ | ✅ | ✅ | ✅ |

encoder_out is the 4096-dim concatenation of layers 4, 8, 12 and the final encoder layer (the NAR architectural delta). The recommended and -f16enc files keep the encoder in F32/F16, so parity matches the F16 reference exactly. On -mini the encoder is fully Q4_K: rounding error compounds across the 16-layer Conformer, and the 4-way concat amplifies the worst-frame divergence to cos_min ≈ 0.10.

editing_logits raw cosine sits at ~0.956 on every Q4_K variant; that is the expected LLM-quantisation noise across 100 K-vocab logits. The argmax (editing_logits_top1) and the resulting transcript are unchanged: all four files reproduce the reference final text exactly.
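The metric itself is just the worst per-frame cosine between the GGUF activations and the PyTorch reference dump. A minimal sketch of how a cos_min cell is assumed to be computed (the epsilon guard is an assumption):

```c
#include <math.h>
#include <stddef.h>

// Per-frame cosine similarity between two activation dumps (each
// n_frames x dim, row-major), reduced by min over frames.
static float cos_min(const float *a, const float *b, int n_frames, int dim) {
    float worst = 1.0f;
    for (int f = 0; f < n_frames; ++f) {
        const float *x = a + (size_t)f * dim;
        const float *y = b + (size_t)f * dim;
        double dot = 0.0, nx = 0.0, ny = 0.0;
        for (int i = 0; i < dim; ++i) {
            dot += (double)x[i] * y[i];
            nx  += (double)x[i] * x[i];
            ny  += (double)y[i] * y[i];
        }
        double c = dot / (sqrt(nx * ny) + 1e-12);  // cosine of this frame
        if ((float)c < worst) worst = (float)c;
    }
    return worst;
}
```

Taking the min rather than the mean is what makes the -mini encoder_out value so dramatic: a single badly rounded frame dominates the score even when most frames stay close.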

Reference transcript: "and so, my fellow americans, ask not what your country can do for you. ask what you can do for your country."

Tested with `crispasr-diff granite-nle <model.gguf> <ref.gguf> samples/jfk.wav`.

Usage with CrispASR

NAR uses a separate runtime from the autoregressive granite variants. Today it is reachable via the crispasr-diff harness and the granite_nle library directly; a granite-4.1-nar backend in the main crispasr CLI is the next step (see TODO.md).

```bash
# Bit-exact end-to-end transcribe via the diff harness
crispasr-diff granite-nle \
  granite-speech-4.1-2b-nar-q4_k.gguf \
  /path/to/ref.gguf \
  samples/jfk.wav
```

The library entry point is granite_nle_transcribe(ctx, samples, n_samples) in src/granite_nle.h; it returns a malloc'd UTF-8 string with the final transcript. There are also fine-grained accessors (compute_mel, run_encoder, run_projector, run_llm_editing) for partial-pipeline use.
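A minimal calling sketch: granite_nle_transcribe and the header are as documented above, while the context type and the init/free calls are hypothetical placeholders for whatever CrispASR actually exposes, and float PCM input is an assumption:

```c
#include <stdio.h>
#include <stdlib.h>
#include "granite_nle.h"

int main(void) {
    // hypothetical: load model + build context (not confirmed API)
    granite_nle_context *ctx =
        granite_nle_init("granite-speech-4.1-2b-nar-q4_k.gguf");
    if (!ctx) return 1;

    int n_samples = 16000;                              // 1 s of 16 kHz mono PCM
    float *samples = calloc(n_samples, sizeof(float));  // silence placeholder

    // documented entry point: returns a malloc'd UTF-8 transcript
    char *text = granite_nle_transcribe(ctx, samples, n_samples);
    if (text) {
        printf("%s\n", text);
        free(text);                                     // caller owns the string
    }

    free(samples);
    granite_nle_free(ctx);                              // hypothetical
    return 0;
}
```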

Supported languages: English, French, German, Spanish, Portuguese.

Architecture

| Stage | Description |
|-------|-------------|
| Encoder | 16-layer Macaron Conformer (1024-dim, 8 heads, 15-tap depthwise conv). Self-conditioning at layer 8 (CTC softmax → 1024-dim residual back into the hidden stream). BPE auxiliary CTC head (100,353-token vocab) on a posterior-pooled (window=4) view of the final hidden states, weighted by 1 - blank_prob_mid from the layer-8 self-conditioning softmax. 4-layer hidden-state concatenation [layer 4, 8, 12, last] → 4096-dim projector input |
| Projector | 2-layer simplified Q-Former with 32 attention heads. Block size 15, downsample rate 5 → 3 audio tokens per 15-frame window (token arithmetic sketched below). Projects the 4096-dim concatenated encoder feature down to the 2048-dim LLM input |
| LLM (NAR) | Granite 4.0-1B (40 layers, 2048 hidden, GQA 16/4, SwiGLU, RoPE θ=10000, μP multipliers). Every self_attn layer runs with is_causal=False. Single forward pass over [audio_embs, text_with_insertion_slots]; the slot logits decode into the final transcript via argmax + unique_consecutive + EOS drop |
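The projector's audio-token count follows directly from the block size and downsample rate; how a trailing partial block is handled is an assumption (padding assumed here):

```c
// Audio-token count from encoder frames: 15-frame blocks, downsample
// rate 5, so 15 / 5 = 3 tokens per block.
static int n_audio_tokens(int n_enc_frames) {
    const int block = 15, rate = 5;
    int n_blocks = (n_enc_frames + block - 1) / block;  // ceil-divide into windows
    return n_blocks * (block / rate);                   // 3 tokens per window
}
```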

Total ~2.2 B parameters. NAR throughput is several times higher than the autoregressive variants because there is no token-by-token sampling loop.

Conversion

```bash
# Convert HF safetensors → GGUF F16
python models/convert-granite-nle-to-gguf.py \
  --input /path/to/granite-speech-4.1-2b-nar \
  --output granite-speech-4.1-2b-nar-f16.gguf

# Quantise F16 → Q4_K (encoder + projector preserved F32, LLM Q4_K)
crispasr-quantize granite-speech-4.1-2b-nar-f16.gguf \
                  granite-speech-4.1-2b-nar-q4_k.gguf q4_k

# Q4_K with F16 encoder/projector (smaller, no measurable parity loss)
CRISPASR_GRANITE_ENC_F16=1 \
crispasr-quantize granite-speech-4.1-2b-nar-f16.gguf \
                  granite-speech-4.1-2b-nar-q4_k-f16enc.gguf q4_k

# Aggressive Q4_K everywhere (encoder + projector + LLM)
CRISPASR_GRANITE_QUANT_ALL=1 \
crispasr-quantize granite-speech-4.1-2b-nar-f16.gguf \
                  granite-speech-4.1-2b-nar-q4_k-mini.gguf q4_k
```

The NAR converter is a separate script (convert-granite-nle-to-gguf.py) because the GGUF arch (granite_nle), tensor naming (BPE auxiliary head, 4-layer hidden capture indices) and self-conditioning metadata all differ from the autoregressive variants. The crispasr-quantize binary recognises both granite_speech and granite_nle archs and applies identical encoder/projector skip rules to both.
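The shape of that skip rule, as a sketch; the actual tensor-name prefixes used by crispasr-quantize are assumptions here:

```c
#include <stdbool.h>
#include <string.h>

static bool starts_with(const char *s, const char *prefix) {
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

// true => keep this tensor in F32/F16 instead of converting to Q4_K.
// The prefixes are illustrative; CRISPASR_GRANITE_QUANT_ALL=1 disables
// the skip entirely (the -mini build).
static bool skip_quantize(const char *tensor_name, bool quant_all) {
    if (quant_all) return false;
    return starts_with(tensor_name, "encoder.") ||    // Conformer encoder
           starts_with(tensor_name, "projector.");    // Q-Former projector
}
```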

Licence

Apache 2.0, the same as the original ibm-granite/granite-speech-4.1-2b-nar.
