Instructions to use cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix", filename="gemma-4-E2B-it-assistant-IQ4_NL.gguf", )
output = llm( "Once upon a time,", max_tokens=512, echo=True ) print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix:Q4_K_M # Run inference directly in the terminal: llama-cli -hf cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix:Q4_K_M # Run inference directly in the terminal: llama-cli -hf cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix:Q4_K_M
Use Docker
docker model run hf.co/cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix with Ollama:
ollama run hf.co/cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix:Q4_K_M
- Unsloth Studio new
How to use cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix to start chatting
- Docker Model Runner
How to use cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix with Docker Model Runner:
docker model run hf.co/cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix:Q4_K_M
- Lemonade
How to use cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix:Q4_K_M
Run and chat with the model
lemonade run user.gemma-4-E2B-it-assistant-GGUF-noimatrix-Q4_K_M
List all available models
lemonade list
gemma-4-E2B-it-assistant — GGUF — noimatrix (ik_llama.cpp only)
GGUF quantizations of google/gemma-4-E2B-it-assistant,
the Multi-Token-Prediction (MTP) drafter for gemma-4-E2B-it. Intended
for use as the draft model in speculative decoding paired with a
quantized gemma-4-E2B-it verifier. Realised speedup depends on
hardware and verifier choice — see §Performance for measured
numbers and the PR's reference benchmarks.
ik_llama.cpp only — will NOT load in mainline llama.cpp
These quants use the
gemma4_mtparchitecture, which is currently only supported inikawrakow/ik_llama.cppon (or after) thefeat/gemma-4-mtpbranch / PR #1744. Mainlineggml-org/llama.cppdoes not know aboutgemma4_assistant/gemma4_mtpand will refuse to load these files. Downstream tools that wrap mainline (ollama, LM Studio, jan.ai, llama-cpp-python, …) will not work either until they follow ik_llama.cpp.Tested with ik_llama.cpp commit
a703033607ed3edbeab0205d8c9ad75cc1b5759f.
Benchmarks in progress
Real measured throughput + acceptance-rate numbers for these drafters are being collected on a consumer laptop GPU (NVIDIA 4060 Laptop, 8 GiB VRAM) — full
--draft-max×--draft-p-minmatrix across multiple prompts, driven againstllama-serverwith acceptance read from its per-requeststatistics mtp:stderr lines.Results land in the smaller drafters' model cards first (E2B → E4B → 26B-A4B → 31B), since smaller models cycle through the bench faster. Check the E2B card first if you're shopping for performance numbers — even if you plan to deploy a different size, the relative draft-max curves and acceptance trends carry across sizes within the family.
Until then, §Performance below cites the upstream PR's reference benchmarks (data-center GPU + 31B verifier) — treat them as a ceiling, not a target.
Honest limitations of this build
Things that should be on the model card and aren't faked:
- No imatrix calibration. PR #1744 builds the gemma4_mtp drafter graph with a hardcoded
GGML_ASSERT(has_target_ctx), meaning standalone-drafterllama-imatrixruns abort inllama_decodebefore producing anything. Same as every other community-published Gemma 4 drafter quant today (Radamanthys11's, etc.), this build quantizes from f16 directly. At ≥4 bits this is fine; the precision benefit of imatrix-guided quantization at Q5+ is single-digit %. At Q3 and below it matters more — see the per-quant warnings in the table below.- No acceptance-rate validation in this build. The benchmark numbers in §"Performance" come from the upstream PR thread on a 31B verifier, not from runs against this drafter quant. Smoke-testing on the build host (CPU-only) was disabled because per-token MTP cost is many seconds on CPU, making proper benchmark-quality runs impractical mid-pipeline. Treat the numbers as expectations, not measurements of these specific files.
- No IK-only IQK quants (IQ4_KS / IQ5_KS / IQ4_KSS). These are normally what "ik_llama.cpp build" gets you over mainline, but their precision-per-bit advantage comes from imatrix-guided scale selection — without imatrix they collapse to roughly K-quant quality at the same bit budget, so shipping them would just be misleading row-count padding. They'll come back in a sibling experiment when upstream supports standalone-drafter imatrix.
- MXFP4 has narrower runtime support. Loadable in current ik_llama.cpp and mainline llama.cpp; older ggml-based runtimes may not support it yet. Use a K-quant if you need to load these in something older.
Pairing — required
A drafter is not a standalone language model. To use these quants you also need a base-model GGUF, with matching vocab (262144 tokens, which is the whole Gemma 4 family default). Recommended pairings:
| Drafter quant (this repo) | Verifier quant (suggested) | Source |
|---|---|---|
| Q8_0 | gemma-4-E2B-it-Q8_0.gguf |
unsloth/gemma-4-E2B-it-GGUF |
| Q6_K | gemma-4-E2B-it-Q8_0.gguf or Q6_K |
unsloth/gemma-4-E2B-it-GGUF |
| Q5_K_M / Q5_K_S | gemma-4-E2B-it-Q5_K_M.gguf or higher |
unsloth/gemma-4-E2B-it-GGUF |
| Q4_K_M / Q4_K_S | gemma-4-E2B-it-Q4_K_M.gguf or higher |
unsloth/gemma-4-E2B-it-GGUF |
Verifier: unsloth/gemma-4-E2B-it-GGUF — ~4.6B total (2.3B effective per Google's "E" naming) parameters. Heads-up on Google's naming: the "E" in E2B/E4B means effective (active inference) parameters via Per-Layer Embeddings, not total weight count — full weights still have to fit in VRAM. The "A" in 26B A4B is the same trick for the MoE variant: 3.8B active out of 25.2B total. The 31B is plain dense, no naming games.
Pairing precision: matching is generally optimal, but mismatched
pairings work too with some acceptance-rate penalty. Both bf16
and f16 drafters in this repo are valid pairing targets for a
bf16/f16/Q8_0 verifier — bf16 is preferred when your runtime
supports it (matches the source-tensor format exactly; see the bf16
row in §Quants). Going above the verifier's precision on the
drafter has no benefit.
Empirical note (single-data-point, no warranty): on a 4060 Laptop GPU paired against unsloth's
Gemma-4-E4B-it-Q4_K_M.ggufverifier, the Q4_K_M drafter outperformed the Q8_0 drafter by ~13% in throughput at--draft-max 3. The smaller drafter's faster draft step appears to outweigh the acceptance-rate cost from more aggressive quantization. Contradicts the common "always pick the highest-bit drafter" heuristic. Bench your own hardware before assuming.
Preserves structured-output tokens
The drafter's vocabulary is identical to the verifier's (262144 tokens, the Gemma 4 family default). Notably, that includes Gemma 4's reserved tokens for structured output formats which the drafter speculates correctly:
| Token pair | Used for |
|---|---|
<|tool_call> / <tool_call|> |
Tool / function calling — agent invokes a tool |
<|tool_response> / <tool_response|> |
Tool / function calling — tool result back to model |
<|channel> / <channel|> |
Multi-channel output (e.g. <|channel>thought for chain-of-thought reasoning vs user-facing channel) |
<|"|> |
Structured-string delimiter |
If you're running tool-calling agents, multi-step reasoning, or any structured-generation workflow on top of Gemma 4, this drafter will speculate those tokens just like any other — meaning the MTP speedup applies to the whole response, not just the natural-language parts. Most published drafter quants don't talk about this because it Just Works mechanically (vocabulary is a separate GGUF section that's never quantized), but it's worth saying out loud: pairing this drafter with a tool-calling-finetuned verifier preserves the tool-call grammar end-to-end.
Quants
78M parameter MTP head, 1536-dim
backbone projection (must match the verifier's hidden_size).
12 files spanning bf16 → Q3, in approximate order of decreasing precision.
| Quantization | Approx. bpw | Size | Notes |
|---|---|---|---|
gemma-4-E2B-it-assistant-bf16.gguf |
16 | 165 MB | Faithful to source. Gemma 4's safetensors are bfloat16 (8-bit exponent, 7-bit mantissa); this preserves them exactly. Prefer this over f16 if your runtime speaks bf16 (recent mainline llama.cpp, ik_llama.cpp, ollama). |
gemma-4-E2B-it-assistant-f16.gguf |
16 | 165 MB | Conventional reference. f16 has more mantissa precision than bf16 (10 vs 7 bits) but a smaller exponent range (5 vs 8 bits), so it can over/underflow on activations bf16 handles fine. For weights converted from a bf16 source, going to f16 effectively quantizes the dynamic range to fit f16's narrower exponent — small but real loss vs bf16. Use bf16 when your runtime supports it. |
gemma-4-E2B-it-assistant-Q8_0.gguf |
8.5 | 95 MB | Near-lossless quantization. Recommended pairing target — drafters' acceptance rate suffers most from quantization, so the highest-bit quant is the best choice if your verifier is also Q8_0+. |
gemma-4-E2B-it-assistant-Q6_K.gguf |
6.5 | 77 MB | K-quant, very high quality. Good balance for Q6_K verifiers. |
gemma-4-E2B-it-assistant-Q5_K_M.gguf |
5.7 | 76 MB | High-precision K-quant. |
gemma-4-E2B-it-assistant-Q5_K_S.gguf |
5.5 | 76 MB | Smaller Q5 variant. |
gemma-4-E2B-it-assistant-Q4_K_M.gguf |
4.85 | 75 MB | Community sweet-spot for verifier pairings. |
gemma-4-E2B-it-assistant-Q4_K_S.gguf |
4.6 | 75 MB | Smaller Q4 K-quant. |
gemma-4-E2B-it-assistant-IQ4_NL.gguf |
4.5 | 75 MB | Non-linear i-quant. Mainline-loadable (unlike the IK-only IQ4_KS). Doesn't require imatrix to be useful. |
gemma-4-E2B-it-assistant-IQ4_XS.gguf |
4.25 | 66 MB | Smaller non-K i-quant. Mainline-loadable. |
gemma-4-E2B-it-assistant-MXFP4.gguf |
4.25 | 74 MB | OCP microscaling 4-bit float format. Loadable in current ik_llama.cpp and mainline llama.cpp; older ggml-based runtimes may not support it yet. |
gemma-4-E2B-it-assistant-Q3_K_L.gguf |
3.4 | 74 MB | Untested for drafter use; pair with caution. Without imatrix, Q3 loses more accuracy than higher quants — and drafters are particularly acceptance-sensitive (a misprediction is wasted work). Included for users who absolutely need the smallest footprint, but be aware MTP speedup could degrade or invert vs. running the verifier alone. Benchmark before deploying. |
Deliberately omitted quants (and why, briefly):
F32— zero-padded bf16, no information gain, double the disk.Q4_0 / Q5_0 / Q4_1 / Q5_1— legacy non-K quants. K-quants strictly dominate them at the same bit budget.Q3_K_M / Q3_K_S / Q2_K— without imatrix, drafter acceptance drops sharply below Q3_K_L. Re-add when imatrix is available.IQ2_* / IQ1_*— too noisy at any bit budget for drafter use, even with imatrix. Verifier rejects most drafted tokens, paired generation goes net negative vs. baseline.IQ4_KS / IQ4_KSS / IQ5_KS / IQ3_KT / IQ4_KT— IK-fork-only quants whose precision advantage requires imatrix. Coming in a future imatrix-capable sibling experiment.
Usage
ik_llama.cpp's llama-server (or llama-cli for one-shot
generation):
# Build / install ik_llama.cpp first; see
# https://github.com/ikawrakow/ik_llama.cpp
llama-server \
--model gemma-4-E2B-it-Q8_0.gguf \
--model-draft gemma-4-E2B-it-assistant-Q8_0.gguf \
--spec-type mtp \
--draft-max 3 \
--draft-p-min 0.0 \
-ngld 99 \
--n-gpu-layers 99 \
--ctx-size 32768 \
-ctk q8_0 -ctv q8_0 \
-b 1024 -ub 1024 \
--jinja \
--host 127.0.0.1 --port 18080
Flag reference:
| Flag | What it does |
|---|---|
--spec-type mtp |
Enables MTP-style speculative decoding (this is the path PR #1744 plumbs). |
--model-draft (-md) |
The drafter GGUF. |
--draft-max N |
Maximum draft length per step. 3 is a good default; 1–4 are all reasonable; tune per workload with --spec-autotune. |
--draft-p-min |
Minimum draft-token probability to bother drafting. 0.0 accepts all drafts; raising it shortens speculative chains. |
-ngld 99 |
Push the drafter onto GPU layers (no-op on CPU-only hosts). The drafter is small enough to fully fit on any consumer GPU. |
-ctk q8_0 / -ctv q8_0 |
Quantize KV cache. Reduces VRAM pressure for long contexts. |
--jinja |
Use the model's Jinja chat template (Gemma 4's tool-call format etc.). |
--spec-autotune (per the PR #1744 description) will probe several
--draft-max values during inference and pick the best-fitting one
for your workload — useful if you don't want to tune by hand.
Performance
Reproducing the upstream benchmark on a 31B verifier + this drafter at Q8_0 on Q8_0 (per the PR #1744 description):
| Run | Throughput | Acceptance |
|---|---|---|
| Baseline (no MTP) | ~21 t/s | — |
MTP --draft-max 1 |
~35 t/s | ~89% |
MTP --draft-max 2 |
~44 t/s | ~83% |
MTP --draft-max 3 |
~49 t/s | ~74% |
MTP --draft-max 4 |
~49 t/s | ~64% |
Smaller verifiers (E2B/E4B) get less absolute t/s benefit because the verifier itself is faster, so there's less time-budget for the drafter to fill in. The percentage uplift is similar.
Compatibility notes
A few cosmetic / non-blocking quirks you may see in normal use:
transformerswarning during conversion (only relevant if you re-convert from source rather than using these prebuilts):You are using a model of type `gemma4_assistant` to instantiate a model of type ``. This may be expected if you are loading a checkpoint that shares a subset of the architecture …The IK fork's
convert_hf_to_gguf.pypatches ingemma4_assistantarch support on the GGUF side but does not patch the Hugging Facetransformerslibrary itself. Sotransformers(which the converter uses to read the source safetensors) sees the unfamiliarmodel_typeand falls back to generic loading. Generic loading reads the raw weights correctly, so the conversion still produces a valid GGUF — the warning is cosmetic.Oops: tensor with strange name per_layer_*at runtime (visible if you pair against certain non-google-flavored Gemma 4 base GGUFs, e.g. unsloth's). These warnings come from the verifier loader, not the drafter — they're the verifier model's per-layer projection tensors which ik_llama.cpp's gemma4 base implementation may not fully recognize on third-party-quantized GGUFs. Inference still works but may fall back to slower code paths for those tensors. If absolute throughput seems too low vs. the PR's reference benchmarks, try a different verifier (google's own f16, bartowski's quants, or any other community source) and compare.mtp_pre_proj.weight/mtp_post_proj.weight"strange name" warnings at drafter load — see PR #1744 review thread; these are the drafter's MTP projection tensors which the size- accounting iteration insrc/llama.cppdoesn't special-case. Cosmetic; the MTP runtime loads them correctly viacreate_gemma4_mtp_tensors.
Provenance
- Source: google/gemma-4-E2B-it-assistant, Apache 2.0 + Gemma terms of use.
- Architecture:
gemma4_mtp(the GGUF-side name forGemma4AssistantForCausalLM). - Converter / runtime: ik_llama.cpp
feat/gemma-4-mtpbranch, i.e. PR #1744 by @SamuelOliveirads. - Calibration corpus for imatrix: none used in this build (see "Honest limitations" above for why).
- Build host: a CPU-only Linux box.
Comparable existing community quants:
Radamanthys11/Gemma-4-E2B-it-assistant-GGUF
and the rest of @Radamanthys11's
collection (the same person who wrote PR #1744). Those repos ship
F16 + Q8_0 only.
This repo ships every quant variant of this drafter that made sense to produce: 12 files spanning bf16 reference down to Q3_K_L, including K-quants, non-K i-quants (IQ4_NL, IQ4_XS), and OCP MXFP4. The omitted quants (F32, legacy Q4_0/Q5_0 etc., Q2_K, IQ2_*, IQ1_*, the imatrix-dependent IQ4_KS family) are documented above the table with the reason each was left out.
License
Gemma Terms of Use, inherited from the source model. By downloading or using these quants you agree to Google's Gemma terms — same as if you'd downloaded the upstream weights directly.
Issues / questions
Open a discussion on this repo (cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix) for anything quant-specific (a particular file refusing to load, a quant variant behaving worse than expected, sizes-table corrections, etc.).
For ik_llama.cpp runtime bugs (gemma4_mtp arch issues, MTP
acceptance-rate quirks, --spec-type mtp plumbing) the canonical
place is the upstream
PR #1744 thread
or the ikawrakow/ik_llama.cpp issue tracker.
For upstream weights / chat-template / tokenizer questions, file
against google/gemma-4-E2B-it-assistant — but please
filter quant-format problems out before going there; Google does not
maintain the GGUF tooling.
- Downloads last month
- 2,320
Model tree for cafkafk/gemma-4-E2B-it-assistant-GGUF-noimatrix
Base model
google/gemma-4-E2B-it-assistant