# granite-speech-4.1-2b-plus – GGUF

GGUF conversion of ibm-granite/granite-speech-4.1-2b-plus for use with CrispASR.

The PLUS variant adds two capabilities over the base 4.1-2b:

- Punctuated and capitalised transcripts by default, with no special prompt required.
- Speaker labels and word-level timestamps in the model's structured output (full output parsing in CrispASR is the next step; raw text works today).

Architecturally, PLUS is the base 4.1-2b with a single change: the encoder's layer-3 hidden state is concatenated with the final-layer output (config: `cat_hidden_layers: [3]`), producing a 2048-dim projector input instead of 1024-dim. The Q-Former cross-attention K/V projection weights are correspondingly shaped (1024, 2048).
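For intuition, here is a minimal PyTorch-style sketch of that concat (the function and variable names are illustrative, not the converter's or CrispASR's actual identifiers, and the concat order is an assumption):

```python
import torch

def encode_with_cat(encoder_layers, x, cat_hidden_layers=(3,)):
    """Run the encoder stack, capturing the hidden states listed in
    cat_hidden_layers and concatenating them with the final output.

    encoder_layers: modules mapping (T, 1024) -> (T, 1024)
    x:              (T, 1024) input features
    returns:        (T, 2048) projector input on PLUS (1024 * (1 + len(cat)))
    """
    captured = []
    for i, layer in enumerate(encoder_layers, start=1):
        x = layer(x)
        if i in cat_hidden_layers:
            captured.append(x)
    # PLUS: [layer-3 hidden state | final layer output] per frame
    return torch.cat(captured + [x], dim=-1)
```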

## Files

| File | Quantisation | Size | Notes |
|---|---|---|---|
| granite-speech-4.1-2b-plus-f16.gguf | F16 | ~5.6 GB | Encoder + projector in F32, LLM weights in F16; full-parity reference |
| granite-speech-4.1-2b-plus-q4_k.gguf | Q4_K | ~2.96 GB | **Recommended.** LLM layers Q4_K; encoder + projector kept F32 (precision-sensitive), so encoder + projector quality is bit-identical to the F16 file |
| granite-speech-4.1-2b-plus-q4_k-f16enc.gguf | Q4_K + F16 encoder | ~2.28 GB | LLM Q4_K; encoder + projector F16 (norms, biases and BN stats stay F32). ~700 MB smaller than the recommended Q4_K with no measurable parity loss on this clip |
| granite-speech-4.1-2b-plus-q4_k-mini.gguf | Q4_K (aggressive) | ~1.66 GB | Encoder, projector and LLM all Q4_K; smallest and fastest to download. Cosine parity is noticeably worse on PLUS than on base-4.1 mini because the layer-3 + final hidden-state concat (the architectural delta in PLUS) doubles the surface for Q4_K rounding error. JFK still transcribes correctly with light punctuation drift, but harder material is more likely to regress than on base-4.1 mini. Use Q4_K or Q4_K-f16enc unless disk size is the binding constraint |

## Cosine parity (vs PyTorch BF16 reference, JFK 11 s clip)

| Stage | F16 cos_min | Q4_K cos_min | Q4_K-f16enc cos_min | Q4_K-mini cos_min |
|---|---|---|---|---|
| mel_spectrogram | 0.999997 | 0.999997 | 0.999997 | 0.999997 |
| encoder_out | 0.999938 | 0.999938 | 0.999938 | 0.622 |
| projector_out | 0.999995 | 0.999995 | 0.999995 | 0.960 |

`encoder_out` is the 2048-dim concatenation of the layer-3 hidden state and the final encoder layer (the PLUS architectural delta). On the recommended and -f16enc files the encoder weights stay in F32/F16, so parity is essentially indistinguishable from the F16 reference. On the -mini file the encoder weights are Q4_K: rounding error compounds across the 16-layer Conformer and shows up amplified after the concat, which is why the `encoder_out` cos_min drops to ~0.62 on PLUS where base-4.1 mini sits at ~0.93. End-to-end JFK transcription is still correct.
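For reference, a cos_min like the ones reported above can be computed as the worst per-frame cosine similarity between two stage outputs. A minimal NumPy sketch (the function name and exact reduction are assumptions, not CrispASR's internals):

```python
import numpy as np

def cos_min(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    """Worst-case per-frame cosine similarity between two (T, D) activations,
    e.g. encoder_out from the GGUF path vs the PyTorch reference. Taking the
    minimum over frames makes a single badly rounded frame visible."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return float((num / den).min())
```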

Tested with `crispasr-diff granite-4.1 <model.gguf> <ref.gguf> samples/jfk.wav`.

## Usage with CrispASR

```bash
# auto-download and transcribe
crispasr --backend granite-4.1-plus -m auto samples/audio.wav

# or with an explicit path
crispasr --backend granite-4.1-plus \
  -m granite-speech-4.1-2b-plus-f16.gguf \
  samples/audio.wav
```

End-to-end example on the JFK 11 s clip:

```console
$ crispasr --backend granite-4.1-plus -m auto samples/jfk.wav
And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
```

(Note the punctuation and capitalisation that the base 4.1-2b produces only with an explicit `--ask "transcribe with proper punctuation..."` prompt.)

## Architecture

| Stage | Description |
|---|---|
| Encoder | 16-layer Macaron Conformer (1024-dim, 8 heads, 15-tap depthwise conv). The hidden state at layer 3 is captured and concatenated with the final-layer output → 2048-dim projector input. |
| Projector | 2-layer BLIP-2 Q-Former. Cross-attention K/V weights are shaped (1024, 2048) to consume the wider concatenated encoder feature. 3 learned query tokens per 15-frame window. |
| LLM | Granite 4.0-1B (40 layers, 2048 hidden, GQA 16/4, SwiGLU, RoPE θ = 10000, μP multipliers). |

Total ~2.2 B parameters. The "+" capability is encoded entirely in training; the architectural delta from base is just the layer concatenation.
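As a worked example of what "3 learned query tokens per 15-frame window" means for the LLM's audio-token budget (the window and query counts are from the table above; the 1100-frame figure and the ceiling on a trailing partial window are assumptions):

```python
import math

def audio_token_count(num_encoder_frames: int,
                      window: int = 15, queries_per_window: int = 3) -> int:
    """Audio tokens the Q-Former hands to the LLM: 3 learned query tokens
    per 15-frame encoder window (counts from the architecture table)."""
    return math.ceil(num_encoder_frames / window) * queries_per_window

# A hypothetical 1100-frame encoder output -> ceil(1100 / 15) = 74 windows,
# 74 * 3 = 222 audio tokens fed to the LLM.
print(audio_token_count(1100))  # 222
```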

## Conversion

```bash
# Convert HF safetensors → GGUF F16
python models/convert-granite-speech-to-gguf.py \
  --input /path/to/granite-speech-4.1-2b-plus \
  --output granite-speech-4.1-2b-plus-f16.gguf

# Quantise F16 → Q4_K (encoder + projector preserved in F32, LLM Q4_K)
crispasr-quantize granite-speech-4.1-2b-plus-f16.gguf \
                  granite-speech-4.1-2b-plus-q4_k.gguf q4_k

# Q4_K with F16 encoder/projector (smaller, no measurable parity loss)
CRISPASR_GRANITE_ENC_F16=1 \
crispasr-quantize granite-speech-4.1-2b-plus-f16.gguf \
                  granite-speech-4.1-2b-plus-q4_k-f16enc.gguf q4_k

# Aggressive Q4_K everywhere (encoder + projector + LLM)
CRISPASR_GRANITE_QUANT_ALL=1 \
crispasr-quantize granite-speech-4.1-2b-plus-f16.gguf \
                  granite-speech-4.1-2b-plus-q4_k-mini.gguf q4_k
```

The same converter handles base, 4.1-2b, and 4.1-2b-plus from a single script; variant detection happens via `config.json` keys (`cat_hidden_layers`, `encoder_hidden_size`).
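A minimal sketch of what that detection could look like (the key names come from the sentence above; the function, the nesting of the keys inside config.json, and the defaults are assumptions, not the converter's actual code):

```python
import json
from pathlib import Path

def detect_variant(model_dir: str) -> str:
    """Guess the Granite Speech variant from config.json.

    PLUS sets cat_hidden_layers (e.g. [3]), which widens the projector input
    beyond encoder_hidden_size; base-style configs leave it unset or empty.
    """
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    enc = cfg.get("encoder_config", cfg)  # nesting of the keys is assumed
    cat = enc.get("cat_hidden_layers") or []
    hidden = enc.get("encoder_hidden_size", 1024)
    if cat:
        return f"plus: projector input {hidden * (1 + len(cat))}-dim"
    return "base"
```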

## Licence

Apache 2.0, same as the original ibm-granite/granite-speech-4.1-2b-plus.
