gemma-4-E2B-it-assistant — GGUF — noimatrix (ik_llama.cpp only)

GGUF quantizations of google/gemma-4-E2B-it-assistant, the Multi-Token-Prediction (MTP) drafter for gemma-4-E2B-it. Intended for use as the draft model in speculative decoding paired with a quantized gemma-4-E2B-it verifier. Realised speedup depends on hardware and verifier choice — see §Performance for measured numbers and the PR's reference benchmarks.

ik_llama.cpp only — will NOT load in mainline llama.cpp

These quants use the gemma4_mtp architecture, which is currently only supported in ikawrakow/ik_llama.cpp on (or after) the feat/gemma-4-mtp branch / PR #1744. Mainline ggml-org/llama.cpp does not know about gemma4_assistant / gemma4_mtp and will refuse to load these files. Downstream tools that wrap mainline (ollama, LM Studio, jan.ai, llama-cpp-python, …) will not work either until they follow ik_llama.cpp.

Tested with ik_llama.cpp commit a703033607ed3edbeab0205d8c9ad75cc1b5759f.

Benchmarks in progress

Real measured throughput + acceptance-rate numbers for these drafters are being collected on a consumer laptop GPU (NVIDIA 4060 Laptop, 8 GiB VRAM) — full --draft-max × --draft-p-min matrix across multiple prompts, driven against llama-server with acceptance read from its per-request statistics mtp: stderr lines.

Results land in the smaller drafters' model cards first (E2B → E4B → 26B-A4B → 31B), since smaller models cycle through the bench faster. Check the E2B card first if you're shopping for performance numbers — even if you plan to deploy a different size, the relative draft-max curves and acceptance trends carry across sizes within the family.

Until then, §Performance below cites the upstream PR's reference benchmarks (data-center GPU + 31B verifier) — treat them as a ceiling, not a target.

Honest limitations of this build

Things that should be on the model card and aren't faked:

  • No imatrix calibration. PR #1744 builds the gemma4_mtp drafter graph with a hardcoded GGML_ASSERT(has_target_ctx), meaning standalone-drafter llama-imatrix runs abort in llama_decode before producing anything. Same as every other community-published Gemma 4 drafter quant today (Radamanthys11's, etc.), this build quantizes from f16 directly. At ≥4 bits this is fine; the precision benefit of imatrix-guided quantization at Q5+ is single-digit %. At Q3 and below it matters more — see the per-quant warnings in the table below.
  • No acceptance-rate validation in this build. The benchmark numbers in §"Performance" come from the upstream PR thread on a 31B verifier, not from runs against this drafter quant. Smoke-testing on the build host (CPU-only) was disabled because per-token MTP cost is many seconds on CPU, making proper benchmark-quality runs impractical mid-pipeline. Treat the numbers as expectations, not measurements of these specific files.
  • No IK-only IQK quants (IQ4_KS / IQ5_KS / IQ4_KSS). These are normally what "ik_llama.cpp build" gets you over mainline, but their precision-per-bit advantage comes from imatrix-guided scale selection — without imatrix they collapse to roughly K-quant quality at the same bit budget, so shipping them would just be misleading row-count padding. They'll come back in a sibling experiment when upstream supports standalone-drafter imatrix.
  • MXFP4 has narrower runtime support. Loadable in current ik_llama.cpp and mainline llama.cpp; older ggml-based runtimes may not support it yet. Use a K-quant if you need to load these in something older.

Pairing — required

A drafter is not a standalone language model. To use these quants you also need a base-model GGUF, with matching vocab (262144 tokens, which is the whole Gemma 4 family default). Recommended pairings:

Drafter quant (this repo) Verifier quant (suggested) Source
Q8_0 gemma-4-E2B-it-Q8_0.gguf unsloth/gemma-4-E2B-it-GGUF
Q6_K gemma-4-E2B-it-Q8_0.gguf or Q6_K unsloth/gemma-4-E2B-it-GGUF
Q5_K_M / Q5_K_S gemma-4-E2B-it-Q5_K_M.gguf or higher unsloth/gemma-4-E2B-it-GGUF
Q4_K_M / Q4_K_S gemma-4-E2B-it-Q4_K_M.gguf or higher unsloth/gemma-4-E2B-it-GGUF

Verifier: unsloth/gemma-4-E2B-it-GGUF — ~4.6B total (2.3B effective per Google's "E" naming) parameters. Heads-up on Google's naming: the "E" in E2B/E4B means effective (active inference) parameters via Per-Layer Embeddings, not total weight count — full weights still have to fit in VRAM. The "A" in 26B A4B is the same trick for the MoE variant: 3.8B active out of 25.2B total. The 31B is plain dense, no naming games.

Pairing precision: matching is generally optimal, but mismatched pairings work too with some acceptance-rate penalty. Both bf16 and f16 drafters in this repo are valid pairing targets for a bf16/f16/Q8_0 verifier — bf16 is preferred when your runtime supports it (matches the source-tensor format exactly; see the bf16 row in §Quants). Going above the verifier's precision on the drafter has no benefit.

Empirical note (single-data-point, no warranty): on a 4060 Laptop GPU paired against unsloth's Gemma-4-E4B-it-Q4_K_M.gguf verifier, the Q4_K_M drafter outperformed the Q8_0 drafter by ~13% in throughput at --draft-max 3. The smaller drafter's faster draft step appears to outweigh the acceptance-rate cost from more aggressive quantization. Contradicts the common "always pick the highest-bit drafter" heuristic. Bench your own hardware before assuming.

Preserves structured-output tokens

The drafter's vocabulary is identical to the verifier's (262144 tokens, the Gemma 4 family default). Notably, that includes Gemma 4's reserved tokens for structured output formats which the drafter speculates correctly:

Token pair Used for
<|tool_call> / <tool_call|> Tool / function calling — agent invokes a tool
<|tool_response> / <tool_response|> Tool / function calling — tool result back to model
<|channel> / <channel|> Multi-channel output (e.g. <|channel>thought for chain-of-thought reasoning vs user-facing channel)
<|"|> Structured-string delimiter

If you're running tool-calling agents, multi-step reasoning, or any structured-generation workflow on top of Gemma 4, this drafter will speculate those tokens just like any other — meaning the MTP speedup applies to the whole response, not just the natural-language parts. Most published drafter quants don't talk about this because it Just Works mechanically (vocabulary is a separate GGUF section that's never quantized), but it's worth saying out loud: pairing this drafter with a tool-calling-finetuned verifier preserves the tool-call grammar end-to-end.

Quants

78M parameter MTP head, 1536-dim backbone projection (must match the verifier's hidden_size).

12 files spanning bf16 → Q3, in approximate order of decreasing precision.

Quantization Approx. bpw Size Notes
gemma-4-E2B-it-assistant-bf16.gguf 16 165 MB Faithful to source. Gemma 4's safetensors are bfloat16 (8-bit exponent, 7-bit mantissa); this preserves them exactly. Prefer this over f16 if your runtime speaks bf16 (recent mainline llama.cpp, ik_llama.cpp, ollama).
gemma-4-E2B-it-assistant-f16.gguf 16 165 MB Conventional reference. f16 has more mantissa precision than bf16 (10 vs 7 bits) but a smaller exponent range (5 vs 8 bits), so it can over/underflow on activations bf16 handles fine. For weights converted from a bf16 source, going to f16 effectively quantizes the dynamic range to fit f16's narrower exponent — small but real loss vs bf16. Use bf16 when your runtime supports it.
gemma-4-E2B-it-assistant-Q8_0.gguf 8.5 95 MB Near-lossless quantization. Recommended pairing target — drafters' acceptance rate suffers most from quantization, so the highest-bit quant is the best choice if your verifier is also Q8_0+.
gemma-4-E2B-it-assistant-Q6_K.gguf 6.5 77 MB K-quant, very high quality. Good balance for Q6_K verifiers.
gemma-4-E2B-it-assistant-Q5_K_M.gguf 5.7 76 MB High-precision K-quant.
gemma-4-E2B-it-assistant-Q5_K_S.gguf 5.5 76 MB Smaller Q5 variant.
gemma-4-E2B-it-assistant-Q4_K_M.gguf 4.85 75 MB Community sweet-spot for verifier pairings.
gemma-4-E2B-it-assistant-Q4_K_S.gguf 4.6 75 MB Smaller Q4 K-quant.
gemma-4-E2B-it-assistant-IQ4_NL.gguf 4.5 75 MB Non-linear i-quant. Mainline-loadable (unlike the IK-only IQ4_KS). Doesn't require imatrix to be useful.
gemma-4-E2B-it-assistant-IQ4_XS.gguf 4.25 66 MB Smaller non-K i-quant. Mainline-loadable.
gemma-4-E2B-it-assistant-MXFP4.gguf 4.25 74 MB OCP microscaling 4-bit float format. Loadable in current ik_llama.cpp and mainline llama.cpp; older ggml-based runtimes may not support it yet.
gemma-4-E2B-it-assistant-Q3_K_L.gguf 3.4 74 MB Untested for drafter use; pair with caution. Without imatrix, Q3 loses more accuracy than higher quants — and drafters are particularly acceptance-sensitive (a misprediction is wasted work). Included for users who absolutely need the smallest footprint, but be aware MTP speedup could degrade or invert vs. running the verifier alone. Benchmark before deploying.

Deliberately omitted quants (and why, briefly):

  • F32 — zero-padded bf16, no information gain, double the disk.
  • Q4_0 / Q5_0 / Q4_1 / Q5_1 — legacy non-K quants. K-quants strictly dominate them at the same bit budget.
  • Q3_K_M / Q3_K_S / Q2_K — without imatrix, drafter acceptance drops sharply below Q3_K_L. Re-add when imatrix is available.
  • IQ2_* / IQ1_* — too noisy at any bit budget for drafter use, even with imatrix. Verifier rejects most drafted tokens, paired generation goes net negative vs. baseline.
  • IQ4_KS / IQ4_KSS / IQ5_KS / IQ3_KT / IQ4_KT — IK-fork-only quants whose precision advantage requires imatrix. Coming in a future imatrix-capable sibling experiment.

Usage

ik_llama.cpp's llama-server (or llama-cli for one-shot generation):

# Build / install ik_llama.cpp first; see
# https://github.com/ikawrakow/ik_llama.cpp

llama-server \
    --model gemma-4-E2B-it-Q8_0.gguf \
    --model-draft gemma-4-E2B-it-assistant-Q8_0.gguf \
    --spec-type mtp \
    --draft-max 3 \
    --draft-p-min 0.0 \
    -ngld 99 \
    --n-gpu-layers 99 \
    --ctx-size 32768 \
    -ctk q8_0 -ctv q8_0 \
    -b 1024 -ub 1024 \
    --jinja \
    --host 127.0.0.1 --port 18080

Flag reference:

Flag What it does
--spec-type mtp Enables MTP-style speculative decoding (this is the path PR #1744 plumbs).
--model-draft (-md) The drafter GGUF.
--draft-max N Maximum draft length per step. 3 is a good default; 1–4 are all reasonable; tune per workload with --spec-autotune.
--draft-p-min Minimum draft-token probability to bother drafting. 0.0 accepts all drafts; raising it shortens speculative chains.
-ngld 99 Push the drafter onto GPU layers (no-op on CPU-only hosts). The drafter is small enough to fully fit on any consumer GPU.
-ctk q8_0 / -ctv q8_0 Quantize KV cache. Reduces VRAM pressure for long contexts.
--jinja Use the model's Jinja chat template (Gemma 4's tool-call format etc.).

--spec-autotune (per the PR #1744 description) will probe several --draft-max values during inference and pick the best-fitting one for your workload — useful if you don't want to tune by hand.

Performance

Reproducing the upstream benchmark on a 31B verifier + this drafter at Q8_0 on Q8_0 (per the PR #1744 description):

Run Throughput Acceptance
Baseline (no MTP) ~21 t/s
MTP --draft-max 1 ~35 t/s ~89%
MTP --draft-max 2 ~44 t/s ~83%
MTP --draft-max 3 ~49 t/s ~74%
MTP --draft-max 4 ~49 t/s ~64%

Smaller verifiers (E2B/E4B) get less absolute t/s benefit because the verifier itself is faster, so there's less time-budget for the drafter to fill in. The percentage uplift is similar.

Compatibility notes

A few cosmetic / non-blocking quirks you may see in normal use:

  • transformers warning during conversion (only relevant if you re-convert from source rather than using these prebuilts):

    You are using a model of type `gemma4_assistant` to instantiate a
    model of type ``. This may be expected if you are loading a
    checkpoint that shares a subset of the architecture …
    

    The IK fork's convert_hf_to_gguf.py patches in gemma4_assistant arch support on the GGUF side but does not patch the Hugging Face transformers library itself. So transformers (which the converter uses to read the source safetensors) sees the unfamiliar model_type and falls back to generic loading. Generic loading reads the raw weights correctly, so the conversion still produces a valid GGUF — the warning is cosmetic.

  • Oops: tensor with strange name per_layer_* at runtime (visible if you pair against certain non-google-flavored Gemma 4 base GGUFs, e.g. unsloth's). These warnings come from the verifier loader, not the drafter — they're the verifier model's per-layer projection tensors which ik_llama.cpp's gemma4 base implementation may not fully recognize on third-party-quantized GGUFs. Inference still works but may fall back to slower code paths for those tensors. If absolute throughput seems too low vs. the PR's reference benchmarks, try a different verifier (google's own f16, bartowski's quants, or any other community source) and compare.

  • mtp_pre_proj.weight / mtp_post_proj.weight "strange name" warnings at drafter load — see PR #1744 review thread; these are the drafter's MTP projection tensors which the size- accounting iteration in src/llama.cpp doesn't special-case. Cosmetic; the MTP runtime loads them correctly via create_gemma4_mtp_tensors.

Provenance

  • Source: google/gemma-4-E2B-it-assistant, Apache 2.0 + Gemma terms of use.
  • Architecture: gemma4_mtp (the GGUF-side name for Gemma4AssistantForCausalLM).
  • Converter / runtime: ik_llama.cpp feat/gemma-4-mtp branch, i.e. PR #1744 by @SamuelOliveirads.
  • Calibration corpus for imatrix: none used in this build (see "Honest limitations" above for why).
  • Build host: a CPU-only Linux box.

Comparable existing community quants: Radamanthys11/Gemma-4-E2B-it-assistant-GGUF and the rest of @Radamanthys11's collection (the same person who wrote PR #1744). Those repos ship F16 + Q8_0 only.

This repo ships every quant variant of this drafter that made sense to produce: 12 files spanning bf16 reference down to Q3_K_L, including K-quants, non-K i-quants (IQ4_NL, IQ4_XS), and OCP MXFP4. The omitted quants (F32, legacy Q4_0/Q5_0 etc., Q2_K, IQ2_*, IQ1_*, the imatrix-dependent IQ4_KS family) are documented above the table with the reason each was left out.

License

Gemma Terms of Use, inherited from the source model. By downloading or using these quants you agree to Google's Gemma terms — same as if you'd downloaded the upstream weights directly.

Issues / questions

Open a discussion on this repo (cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix) for anything quant-specific (a particular file refusing to load, a quant variant behaving worse than expected, sizes-table corrections, etc.).

For ik_llama.cpp runtime bugs (gemma4_mtp arch issues, MTP acceptance-rate quirks, --spec-type mtp plumbing) the canonical place is the upstream PR #1744 thread or the ikawrakow/ik_llama.cpp issue tracker. For upstream weights / chat-template / tokenizer questions, file against google/gemma-4-E2B-it-assistant — but please filter quant-format problems out before going there; Google does not maintain the GGUF tooling.

Downloads last month
2,320
GGUF
Model size
78M params
Architecture
gemma4_mtp
Hardware compatibility
Log In to add your hardware

3-bit

4-bit

5-bit

6-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix

Quantized
(3)
this model

Collection including cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix