# granite-speech-4.1-2b-plus – GGUF
GGUF conversion of ibm-granite/granite-speech-4.1-2b-plus for use with CrispASR.
The PLUS variant adds two capabilities over the base 4.1-2b:
- Punctuated and capitalised transcripts by default; no special prompt required.
- Speaker labels and word-level timestamps in the model's structured output (full output parsing in CrispASR is the next step; raw text works today).
Architecturally, PLUS is the base 4.1-2b with a single change: the
encoder's layer-3 hidden state is concatenated with the final layer
output (config: `cat_hidden_layers: [3]`), producing a 2048-dim
projector input instead of 1024. The Q-Former cross-attention K/V
projection weights are correspondingly (1024, 2048).
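A minimal sketch of that delta, assuming a PyTorch-style encoder loop; the function name, the 1-based layer index, and the concatenation order are illustrative, not the actual ibm-granite implementation:

```python
import torch
from typing import Sequence

def encode_with_concat(layers: Sequence[torch.nn.Module],
                       feats: torch.Tensor,
                       cat_hidden_layers: Sequence[int] = (3,)) -> torch.Tensor:
    """Conformer pass that also captures intermediate hidden states.

    feats: (T, 1024) frame features entering the 16-layer stack."""
    captured = []
    hidden = feats
    for idx, layer in enumerate(layers, start=1):   # 1-based indexing is an assumption
        hidden = layer(hidden)
        if idx in cat_hidden_layers:                # capture the layer-3 hidden state
            captured.append(hidden)
    # final layer output plus captured state(s): (T, 1024) and (T, 1024) -> (T, 2048)
    return torch.cat([hidden, *captured], dim=-1)
```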
## Files
| File | Quantisation | Size | Notes |
|---|---|---|---|
| `granite-speech-4.1-2b-plus-f16.gguf` | F16 | ~5.6 GB | Encoder + projector in F32, LLM weights in F16 (full parity reference) |
| `granite-speech-4.1-2b-plus-q4_k.gguf` | Q4_K | ~2.96 GB | Recommended. LLM layers Q4_K; encoder + projector kept F32 (precision-sensitive), so parity on those stages matches the F16 reference |
| `granite-speech-4.1-2b-plus-q4_k-f16enc.gguf` | Q4_K + F16 encoder | ~2.28 GB | LLM Q4_K, encoder + projector F16 (norms / biases / BN stats stay F32). ~700 MB smaller than the recommended Q4_K with no measurable parity loss on this clip |
| `granite-speech-4.1-2b-plus-q4_k-mini.gguf` | Q4_K (aggressive) | ~1.66 GB | Encoder, projector and LLM all Q4_K. Smallest / fastest to download. Cosine parity is noticeably worse on PLUS than on the base 4.1 mini because the layer-3 + final hidden-state concat (the architectural delta in PLUS) doubles the surface for Q4_K rounding error. JFK still transcribes correctly with light punctuation drift, but harder material is more likely to regress than on base-4.1 mini. Use Q4_K or Q4_K-f16enc unless disk size is the binding constraint |
## Cosine parity (vs PyTorch BF16 reference, JFK 11 s clip)
| Stage | F16 cos_min | Q4_K cos_min | Q4_K-f16enc cos_min | Q4_K-mini cos_min |
|---|---|---|---|---|
| mel_spectrogram | 0.999997 | 0.999997 | 0.999997 | 0.999997 |
| encoder_out | 0.999938 | 0.999938 | 0.999938 | 0.622 |
| projector_out | 0.999995 | 0.999995 | 0.999995 | 0.960 |
`encoder_out` is the 2048-dim concatenation of the layer-3 hidden state and the final encoder layer (the PLUS architectural delta). On the recommended and -f16enc files the encoder weights stay in F32/F16, so parity is essentially indistinguishable from the F16 reference. On the -mini file the encoder weights are Q4_K: rounding error compounds across the 16-layer Conformer and shows up amplified after the concat, which is why `encoder_out` cos_min drops to ~0.62 on PLUS where base-4.1 mini sits at ~0.93. End-to-end JFK transcription is still correct.
Tested with `crispasr-diff granite-4.1 <model.gguf> <ref.gguf> samples/jfk.wav`.
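The cos_min values compare per-stage activation dumps from the GGUF runtime against the PyTorch BF16 reference; crispasr-diff is what actually produces them. A rough sketch of the metric itself, assuming (frames, dim) arrays:

```python
import numpy as np

def cos_min(ref: np.ndarray, out: np.ndarray, eps: float = 1e-8) -> float:
    """Minimum per-frame cosine similarity between two (frames, dim) dumps.

    Values near 1.0 mean every frame of the GGUF run points in almost the
    same direction as the BF16 reference; a low value flags at least one
    badly rounded frame."""
    r = ref.reshape(ref.shape[0], -1).astype(np.float64)
    o = out.reshape(out.shape[0], -1).astype(np.float64)
    num = (r * o).sum(axis=-1)
    den = np.linalg.norm(r, axis=-1) * np.linalg.norm(o, axis=-1) + eps
    return float((num / den).min())
```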
## Usage with CrispASR
```bash
# auto-download and transcribe
crispasr --backend granite-4.1-plus -m auto samples/audio.wav

# or with explicit path
crispasr --backend granite-4.1-plus \
  -m granite-speech-4.1-2b-plus-f16.gguf \
  samples/audio.wav
```
End-to-end example on the JFK 11 s clip:

```
$ crispasr --backend granite-4.1-plus -m auto samples/jfk.wav
And so my fellow Americans, ask not what your country can do for
you, ask what you can do for your country.
```

(Note the punctuation and capitalisation that the base 4.1-2b only produces with an explicit `--ask "transcribe with proper punctuation..."` prompt.)
## Architecture
| Stage | Description |
|---|---|
| Encoder | 16-layer Macaron Conformer (1024 dim, 8 heads, 15-tap depthwise conv). Hidden state at layer 3 is captured and concatenated with the final layer output, giving a 2048-dim projector input. |
| Projector | 2-layer BLIP-2 Q-Former. Cross-attention K/V weights are (1024, 2048) to consume the wider concatenated encoder feature. 3 learned query tokens per 15-frame window. |
| LLM | Granite 4.0-1B (40 layers, 2048 hidden, GQA 16/4, SwiGLU, RoPE θ=10000, μP multipliers). |
Total ~2.2 B parameters. The "+" capability is encoded entirely in training; the architectural delta from base is just the layer concatenation.
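To make the projector numbers concrete, here is a small sketch of the audio-token budget and shape flow; rounding up the last partial window and the final projection width are assumptions:

```python
import math

def audio_token_count(encoder_frames: int,
                      window: int = 15,
                      queries_per_window: int = 3) -> int:
    """Audio embeddings handed to the LLM: 3 learned query tokens per
    15-frame encoder window (rounding up the last partial window is assumed)."""
    return math.ceil(encoder_frames / window) * queries_per_window

# Shape flow for the PLUS variant (widths from the table above):
#   encoder output      (T, 2048)    1024-dim final layer + 1024-dim layer-3 state
#   Q-Former K/V proj   (1024, 2048) weights consume that concatenated feature
#   per window          3 query tokens, projected to the LLM hidden size (2048)
```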
## Conversion
```bash
# Convert HF safetensors -> GGUF F16
python models/convert-granite-speech-to-gguf.py \
  --input /path/to/granite-speech-4.1-2b-plus \
  --output granite-speech-4.1-2b-plus-f16.gguf

# Quantise F16 -> Q4_K (encoder + projector preserved F32, LLM Q4_K)
crispasr-quantize granite-speech-4.1-2b-plus-f16.gguf \
  granite-speech-4.1-2b-plus-q4_k.gguf q4_k

# Q4_K with F16 encoder/projector (smaller, no measurable parity loss)
CRISPASR_GRANITE_ENC_F16=1 \
  crispasr-quantize granite-speech-4.1-2b-plus-f16.gguf \
  granite-speech-4.1-2b-plus-q4_k-f16enc.gguf q4_k

# Aggressive Q4_K everywhere (encoder + projector + LLM)
CRISPASR_GRANITE_QUANT_ALL=1 \
  crispasr-quantize granite-speech-4.1-2b-plus-f16.gguf \
  granite-speech-4.1-2b-plus-q4_k-mini.gguf q4_k
```
The same converter handles base / 4.1-2b / 4.1-2b-plus from a single
script; variant detection happens via `config.json` keys
(`cat_hidden_layers`, `encoder_hidden_size`).
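A hedged sketch of what that detection can look like when reading config.json directly; the key nesting and the returned fields are illustrative, not the converter's actual interface:

```python
import json
from pathlib import Path

def detect_variant(model_dir: str) -> dict:
    """Illustrative variant detection from config.json.

    The converter keys off cat_hidden_layers and encoder_hidden_size; the
    key nesting and returned fields here are assumptions."""
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    enc = cfg.get("encoder_config", cfg)            # nesting is an assumption
    cat_layers = enc.get("cat_hidden_layers", [])
    width = enc.get("encoder_hidden_size", enc.get("hidden_size"))
    return {
        "is_plus": bool(cat_layers),                # [3] marks the PLUS concat
        # [3] + final layer -> 2 * 1024 = 2048-dim projector input on PLUS
        "projector_input_dim": (len(cat_layers) + 1) * width if width else None,
    }
```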
## Licence
Apache 2.0, the same as the original ibm-granite/granite-speech-4.1-2b-plus.