gemma-4-31B-it-speculator.eagle3-NVFP4 (NVFP4)

NVFP4 quantization of RedHatAI/gemma-4-31B-it-speculator.eagle3 — RedHat / Neural Magic's official EAGLE-3 speculator drafter for Gemma 4 31B (dense).

What this is

Drop-in replacement for the BF16 drafter, 3× smaller (4.5 GB → ~1.5 GB) and ~1.5× faster per draft step on Blackwell with native FP4 tensor cores. Targets the same verifier model as the BF16 source.
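As a rough sanity check on the size figure, the arithmetic below estimates NVFP4 storage per weight from the format described in the recipe (4-bit values, one FP8 scale per 16-element block). It is a back-of-the-envelope sketch, not a measurement: the per-tensor FP32 scales are ignored as negligible, and the function name is illustrative.

```python
BF16_BITS = 16  # bits per weight in the BF16 source drafter


def nvfp4_bits_per_weight(block_size: int = 16, scale_bits: int = 8) -> float:
    """Effective bits per weight: a 4-bit E2M1 value plus one FP8 scale
    shared by every `block_size` weights (per-tensor FP32 scale ignored)."""
    return 4 + scale_bits / block_size


bits = nvfp4_bits_per_weight()
ideal_ratio = BF16_BITS / bits   # compression if every tensor were quantized
observed_ratio = 4.5 / 1.5       # from the sizes quoted above (4.5 GB -> ~1.5 GB)

print(f"NVFP4 bits per weight: {bits}")
print(f"ideal compression:     {ideal_ratio:.2f}x")
print(f"observed compression:  {observed_ratio:.2f}x")
```

The observed ratio comes out below the ideal one. A plausible explanation is that `lm_head`, `embed_tokens`, and `d2t` are excluded from quantization and stay at their original precision.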

Use it with vLLM

vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --tensor-parallel-size 1 \
  --speculative-config '{
    "model": "AEON-7/gemma-4-31B-it-speculator.eagle3-NVFP4",
    "num_speculative_tokens": 3,
    "method": "eagle3"
  }' \
  --max-num-seqs 8 \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --enable-prefix-caching

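The verifier can be any Gemma 4 31B (dense) variant: base, instruct, NVFP4, or fine-tunes (abliterated, domain-tuned, etc.). EAGLE drafters are bound to the verifier's architecture rather than to its exact weights, and the accept/reject step of speculative decoding keeps the output distribution identical to the verifier's. Acceptance rates may still drop as a fine-tune drifts away from the verifier the drafter was trained against.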
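Quoting the JSON passed to `--speculative-config` by hand is easy to get wrong when swapping verifiers. A minimal sketch that builds the same command programmatically is shown below; `build_vllm_serve_command` is a hypothetical helper written for this example, not part of vLLM, and it only reuses the flags from the command above.

```python
import json
import shlex


def build_vllm_serve_command(verifier: str, drafter: str,
                             num_speculative_tokens: int = 3) -> list[str]:
    """Return the argv for `vllm serve` with an EAGLE-3 drafter attached."""
    spec_config = {
        "model": drafter,
        "num_speculative_tokens": num_speculative_tokens,
        "method": "eagle3",
    }
    return [
        "vllm", "serve", verifier,
        "--tensor-parallel-size", "1",
        "--speculative-config", json.dumps(spec_config),
        "--max-num-seqs", "8",
        "--kv-cache-dtype", "fp8",
        "--enable-chunked-prefill",
        "--enable-prefix-caching",
    ]


argv = build_vllm_serve_command(
    verifier="RedHatAI/gemma-4-31B-it-NVFP4",
    drafter="AEON-7/gemma-4-31B-it-speculator.eagle3-NVFP4",
)
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

The resulting list can also be passed directly to `subprocess.run(argv)`, which avoids shell quoting altogether.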

Quantization recipe

| Field | Value |
| --- | --- |
| Algorithm | NVIDIA ModelOpt `NVFP4_DEFAULT_CFG` (max calibration, no AWQ) |
| Block size | 16 (NVFP4 standard) |
| Excluded from quantization | `lm_head`, `embed_tokens`, `d2t` (vocab map) |
| Calibration data | 256 conversations from `HuggingFaceH4/ultrachat_200k` (`train_sft` split) |
| Calibration mode | Realistic: ran the target NVFP4 verifier first, captured aux hidden states at the layers listed in `eagle_aux_hidden_state_layer_ids`, and fed them to the drafter alongside `input_ids` |
| Hardware | 1× NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| Output dtype | NVFP4 (FP4 E2M1 + per-block FP8 scales + per-tensor FP32 scales) |
| ModelOpt version | 0.43.0rc2.dev (main, with merged PRs #1264 + #1265) |
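To make the block-wise format concrete, here is a simplified pure-Python "fake quantization" sketch of FP4 E2M1 with max calibration and a block size of 16. It is an illustration under stated assumptions, not ModelOpt's implementation: block scales are kept as ordinary Python floats rather than FP8, the per-tensor FP32 scale is omitted, and ties are rounded toward the smaller magnitude.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX = 6.0


def fake_quantize_block(block):
    """Quantize and dequantize one block using max calibration."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / E2M1_MAX  # maps the block's largest magnitude onto 6.0
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def fake_quantize(weights, block_size=16):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(fake_quantize_block(weights[start:start + block_size]))
    return out


weights = [0.02 * i - 0.15 for i in range(32)]  # toy data, two blocks of 16
dequantized = fake_quantize(weights)
max_error = max(abs(w - q) for w, q in zip(weights, dequantized))
print(f"max abs round-trip error: {max_error:.4f}")
```

Because each block of 16 gets its own scale, a single outlier only degrades the resolution of its own block rather than the whole tensor.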

Performance expectations

Acceptance lengths at k=5. The BF16 column is RedHat's published numbers for the source drafter; the NVFP4 column is an estimate, not a measurement:

| Dataset | BF16 (k=5) | NVFP4 (estimate, k=5) |
| --- | --- | --- |
| HumanEval | 3.80 | ~3.40 |
| math_reasoning | 3.93 | ~3.50 |
| qa | 2.38 | ~2.20 |
| MT-bench | 2.83 | ~2.60 |
| RAG | 2.80 | ~2.60 |
| summarization | 2.20 | ~2.05 |
| translation | 2.68 | ~2.45 |

From the table, that is roughly a 7-11% estimated acceptance loss vs BF16, which the ~1.5× faster draft step on Blackwell's native FP4 hardware is expected to more than offset.
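The per-dataset loss can be recomputed directly from the table. The short script below does only that arithmetic; the NVFP4 inputs are the card's estimates, so the outputs are estimates as well.

```python
# (BF16, NVFP4 estimate) acceptance lengths at k=5, copied from the table above.
ACCEPTANCE = {
    "HumanEval": (3.80, 3.40),
    "math_reasoning": (3.93, 3.50),
    "qa": (2.38, 2.20),
    "MT-bench": (2.83, 2.60),
    "RAG": (2.80, 2.60),
    "summarization": (2.20, 2.05),
    "translation": (2.68, 2.45),
}


def relative_loss(bf16: float, nvfp4: float) -> float:
    """Fraction of acceptance length lost when moving from BF16 to NVFP4."""
    return (bf16 - nvfp4) / bf16


losses = {name: relative_loss(*pair) for name, pair in ACCEPTANCE.items()}
for name, loss in losses.items():
    print(f"{name:15s} {loss:6.1%}")
print(f"range: {min(losses.values()):.1%} to {max(losses.values()):.1%}")
```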

Spark / DGX Spark deployment

Tested on NVIDIA DGX Spark (GB10, sm 12.1, 128 GB unified memory) using the ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4:latest image (eugr nightly with sm_120-compiled FlashInfer CUTLASS + VLLM_CUTLASS NVFP4 kernels).

Single-stream wall-clock: 2.0-2.5× speedup over no spec decode on chat workloads.
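Acceptance length and wall-clock speedup are linked by how expensive a draft step is relative to a verifier step. The sketch below uses a common first-order cost model that is not taken from this card: each round costs `k` draft steps plus one verifier step and yields `acceptance_length` tokens, assuming verifying the drafted tokens costs about the same as one normal decode step. The `draft_cost_ratio` value is a hypothetical placeholder, not a measured figure.

```python
def speculative_speedup(acceptance_length: float, k: int,
                        draft_cost_ratio: float) -> float:
    """Tokens per round divided by round cost, in verifier-step units."""
    return acceptance_length / (1.0 + k * draft_cost_ratio)


def break_even_draft_cost(tau_bf16: float, tau_nvfp4: float, k: int,
                          draft_speedup: float) -> float:
    """BF16 draft cost ratio above which the faster, less accurate drafter wins.

    Solves tau_bf16 / (1 + k*c) == tau_nvfp4 / (1 + k*c / draft_speedup) for c.
    Only meaningful when tau_nvfp4 > tau_bf16 / draft_speedup.
    """
    return (tau_bf16 - tau_nvfp4) / (k * (tau_nvfp4 - tau_bf16 / draft_speedup))


K = 5                     # matches the k=5 acceptance-length table
DRAFT_SPEEDUP = 1.5       # per-draft-step speedup quoted for Blackwell
BF16_COST_RATIO = 0.15    # hypothetical: BF16 draft step = 15% of a verifier step

bf16 = speculative_speedup(3.80, K, BF16_COST_RATIO)
nvfp4 = speculative_speedup(3.40, K, BF16_COST_RATIO / DRAFT_SPEEDUP)
threshold = break_even_draft_cost(3.80, 3.40, K, DRAFT_SPEEDUP)

print(f"BF16 drafter:  {bf16:.2f}x")
print(f"NVFP4 drafter: {nvfp4:.2f}x")
print(f"NVFP4 wins once the BF16 draft cost ratio exceeds {threshold:.3f}")
```

Under this model the NVFP4 drafter only comes out ahead when drafting is a non-trivial share of each round, so it is worth measuring both drafters on the target hardware rather than relying on the estimate.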

Files

model.safetensors — NVFP4 quantized drafter weights (~1.5 GB)
config.json — Eagle3 speculator config (carries verifier reference)
config.py — Custom Eagle3SpeculatorConfig class (custom_code, required for trust_remote_code=True)
tokenizer.json, tokenizer_config.json — Verifier tokenizer (Gemma 4)
hf_quant_config.json — ModelOpt NVFP4 quantization metadata
modelopt_state.pt — Full modelopt state for re-export
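After downloading the repository, a quick check that every file in the list above is present can save a confusing load error later. The helper below is a small illustrative sketch using only the standard library; `missing_files` is a name made up for this example, and the demo runs against a temporary directory rather than a real download.

```python
import tempfile
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = (
    "model.safetensors",
    "config.json",
    "config.py",
    "tokenizer.json",
    "tokenizer_config.json",
    "hf_quant_config.json",
    "modelopt_state.pt",
)


def missing_files(model_dir) -> list[str]:
    """Return the expected file names that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demo: a directory that contains only config.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    demo_missing = missing_files(tmp)

print("missing:", demo_missing)
```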

License

Apache 2.0 (matches base model). NVFP4 quantization is a derivative work contributed under the same terms.

Provenance

Created by quantizing RedHatAI/gemma-4-31B-it-speculator.eagle3 with NVFP4_DEFAULT_CFG calibrated against RedHatAI/gemma-4-31B-it-NVFP4 outputs. Methodology adapted from RedHat AI's published Gemma 4 NVFP4 target recipe + standard EAGLE-3 calibration practice.

☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

₿ Bitcoin (BTC): bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4
Ξ Ethereum (ETH): 0x1512667F6D61454ad531d2E45C0a5d1fd82D0500
◎ Solana (SOL): DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t
ⓜ Monero (XMR): 836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd