gemma-4-31B-it-speculator.eagle3-NVFP4 (NVFP4)
NVFP4 quantization of RedHatAI/gemma-4-31B-it-speculator.eagle3, RedHat / Neural Magic's official EAGLE-3 speculator drafter for Gemma 4 31B (dense).
What this is
Drop-in replacement for the BF16 drafter, 3× smaller (4.5 GB → ~1.5 GB) and ~1.5× faster per draft step on Blackwell with native FP4 tensor cores. Targets the same verifier model as the BF16 source.
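As a rough sanity check on the size figure, the sketch below estimates the NVFP4 footprint from the BF16 one. It assumes every weight is stored as 16 bits in the source and as a 4-bit value plus one 8-bit scale per 16-element block in NVFP4. The function names and the "every tensor is quantized" simplification are illustrative assumptions, not taken from the export code.

```python
BF16_BITS = 16
FP4_BITS = 4
BLOCK_SCALE_BITS = 8   # one FP8 scale per block
BLOCK_SIZE = 16        # NVFP4 block size


def nvfp4_bits_per_weight():
    """Average storage per weight: 4-bit value plus its share of the block scale."""
    return FP4_BITS + BLOCK_SCALE_BITS / BLOCK_SIZE


def estimate_nvfp4_gb(bf16_gb):
    """Lower-bound estimate that pretends every tensor is quantized (an assumption)."""
    return bf16_gb * nvfp4_bits_per_weight() / BF16_BITS


print(nvfp4_bits_per_weight())   # 4.5
print(estimate_nvfp4_gb(4.5))    # 1.265625
```

The all-quantized lower bound of about 1.27 GB sits below the ~1.5 GB quoted above, which is consistent with some tensors being left unquantized.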
Use it with vLLM
vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
--tensor-parallel-size 1 \
--speculative-config '{
"model": "AEON-7/gemma-4-31B-it-speculator.eagle3-NVFP4",
"num_speculative_tokens": 3,
"method": "eagle3"
}' \
--max-num-seqs 8 \
--kv-cache-dtype fp8 \
--enable-chunked-prefill \
--enable-prefix-caching
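When the server is launched from a script, building the --speculative-config JSON with a serializer avoids shell-quoting mistakes. The sketch below only reassembles the flags shown above; the helper name build_vllm_command is hypothetical, and nothing here is part of vLLM's Python API.

```python
import json
import shlex


def build_vllm_command(verifier, drafter, num_speculative_tokens=3):
    """Return the argv list for the `vllm serve` invocation shown above."""
    speculative_config = {
        "model": drafter,
        "num_speculative_tokens": num_speculative_tokens,
        "method": "eagle3",
    }
    return [
        "vllm", "serve", verifier,
        "--tensor-parallel-size", "1",
        "--speculative-config", json.dumps(speculative_config),
        "--max-num-seqs", "8",
        "--kv-cache-dtype", "fp8",
        "--enable-chunked-prefill",
        "--enable-prefix-caching",
    ]


argv = build_vllm_command(
    verifier="RedHatAI/gemma-4-31B-it-NVFP4",
    drafter="AEON-7/gemma-4-31B-it-speculator.eagle3-NVFP4",
)
# shlex.join gives a shell-safe string for logging; pass argv itself to subprocess.run.
print(shlex.join(argv))
```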
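The verifier can be any Gemma 4 31B (dense) variant: base, instruct, NVFP4, or fine-tunes (abliterated, domain-tuned, etc.). EAGLE drafters are architecture-bound, not weights-bound: the accept/reject step of speculative decoding guarantees that the output distribution is the verifier's, so a mismatched drafter costs acceptance rate, not correctness.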
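The distribution guarantee comes from the accept/reject rule of speculative sampling rather than from the drafter's quality. The sketch below checks this numerically for a single token position with made-up probabilities, where p stands for the verifier's distribution and q for the drafter's. It illustrates the general rule under those assumptions, not vLLM's implementation.

```python
def speculative_output_distribution(p, q):
    """Exact output distribution of one speculative-sampling step.

    A token x drawn from q is accepted with probability min(1, p[x] / q[x]);
    on rejection, a token is resampled from the normalized residual max(0, p - q).
    """
    # Probability of proposing x and accepting it: q[x] * min(1, p[x] / q[x]) = min(p[x], q[x]).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accepted)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    residual_mass = sum(residual)
    if residual_mass == 0.0:  # p == q, so nothing is ever rejected
        return accepted
    return [a + reject_prob * r / residual_mass for a, r in zip(accepted, residual)]


verifier = [0.60, 0.25, 0.10, 0.05]   # hypothetical p
drafter = [0.30, 0.40, 0.20, 0.10]    # hypothetical q, deliberately a poor match
out = speculative_output_distribution(verifier, drafter)
acceptance_rate = sum(min(pi, qi) for pi, qi in zip(verifier, drafter))
print([round(x, 6) for x in out])     # equals the verifier's distribution
print(round(acceptance_rate, 6))      # fraction of drafted tokens that are accepted
```

A poorly matched drafter lowers the acceptance rate, which costs speed but leaves the sampled distribution unchanged.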
Quantization recipe
| Field | Value |
|---|---|
| Algorithm | NVIDIA ModelOpt NVFP4_DEFAULT_CFG (max calibration, no AWQ) |
| Block size | 16 (NVFP4 standard) |
| Excluded from quantization | lm_head, embed_tokens, d2t (vocab map) |
| Calibration data | 256 conversations from HuggingFaceH4/ultrachat_200k (train_sft) |
| Calibration mode | Realistic: ran the target NVFP4 verifier first, captured aux hidden states at the layers listed in eagle_aux_hidden_state_layer_ids, and fed them to the drafter alongside input_ids |
| Hardware | 1× NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| Output dtype | NVFP4 (FP4 E2M1 + per-block FP8 scales + per-tensor FP32 scales) |
| ModelOpt version | 0.43.0rc2.dev (main, with merged PRs #1264 + #1265) |
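To make the block-scaled format in the table concrete, the sketch below quantizes a flat list of weights to the FP4 E2M1 grid in 16-element blocks using max calibration, then dequantizes it. It is a simplified illustration, not ModelOpt's code: block scales stay as Python floats instead of being rounded to FP8, the per-tensor FP32 scale is omitted, and ties round toward the smaller magnitude. All function names are hypothetical.

```python
# Magnitudes representable in FP4 E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX = 6.0
BLOCK_SIZE = 16


def round_to_e2m1(value):
    """Round a pre-scaled value to the nearest signed E2M1 grid point."""
    magnitude = min(E2M1_GRID, key=lambda g: abs(g - abs(value)))
    return -magnitude if value < 0 else magnitude


def quantize_blocks(weights):
    """Return (codes, block_scales) using max calibration per 16-element block."""
    assert len(weights) % BLOCK_SIZE == 0, "pad the tensor to a multiple of 16"
    codes, scales = [], []
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        amax = max(abs(w) for w in block)
        # Max calibration: map the block's largest magnitude onto E2M1's largest value.
        scale = amax / E2M1_MAX if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(round_to_e2m1(w / scale) for w in block)
    return codes, scales


def dequantize_blocks(codes, scales):
    """Multiply each 4-bit code by the scale of the block it belongs to."""
    return [c * scales[i // BLOCK_SIZE] for i, c in enumerate(codes)]


weights = [((i * 37) % 101 - 50) / 1000 for i in range(32)]  # deterministic toy data
codes, scales = quantize_blocks(weights)
restored = dequantize_blocks(codes, scales)
print(max(abs(a - b) for a, b in zip(weights, restored)))    # worst-case round-trip error
```

Because the widest gap on the E2M1 grid is between 4 and 6, the round-trip error of any element in this sketch is bounded by one block scale, which is why small blocks with their own scales matter.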
Performance expectations
Acceptance lengths compared with the BF16 source (BF16 figures are RedHat's published numbers; NVFP4 figures are estimates):
| Dataset | BF16 (k=5) | NVFP4 (estimate, k=5) |
|---|---|---|
| HumanEval | 3.80 | ~3.40 |
| math_reasoning | 3.93 | ~3.50 |
| qa | 2.38 | ~2.20 |
| MT-bench | 2.83 | ~2.60 |
| RAG | 2.80 | ~2.60 |
| summarization | 2.20 | ~2.05 |
| translation | 2.68 | ~2.45 |
The estimates above correspond to roughly 7-11% acceptance loss vs BF16, which the faster draft step on Blackwell's native FP4 hardware is expected to more than offset.
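The relative loss per dataset can be recomputed directly from the table. The snippet below is plain arithmetic on the listed values; it does not measure anything, and the NVFP4 inputs are the table's estimates.

```python
# (BF16, NVFP4 estimate) acceptance lengths at k=5, copied from the table above.
ACCEPTANCE = {
    "HumanEval": (3.80, 3.40),
    "math_reasoning": (3.93, 3.50),
    "qa": (2.38, 2.20),
    "MT-bench": (2.83, 2.60),
    "RAG": (2.80, 2.60),
    "summarization": (2.20, 2.05),
    "translation": (2.68, 2.45),
}


def relative_loss_percent(bf16, nvfp4):
    """Acceptance-length drop as a percentage of the BF16 value."""
    return 100.0 * (bf16 - nvfp4) / bf16


losses = {name: relative_loss_percent(*pair) for name, pair in ACCEPTANCE.items()}
for name, loss in losses.items():
    print(f"{name:15s} {loss:5.1f}%")
print(f"range: {min(losses.values()):.1f}% to {max(losses.values()):.1f}%")
```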
Spark / DGX Spark deployment
Tested on NVIDIA DGX Spark (GB10, sm 12.1, 128 GB unified memory) using the
ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4:latest image (eugr nightly with
sm_120-compiled FlashInfer CUTLASS + VLLM_CUTLASS NVFP4 kernels).
Single-stream wall-clock: 2.0-2.5× speedup over no spec decode on chat workloads.
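For intuition about how a speedup of this size can arise, a common back-of-envelope model treats one speculation cycle as k drafter steps plus one verifier step that together emit the acceptance length in tokens. The sketch below encodes that model only. The inputs are hypothetical rather than measurements from this release, the acceptance length is assumed to include the verifier's own token, and scheduling and sampling overheads are ignored.

```python
def estimated_speedup(acceptance_length, num_speculative_tokens, draft_cost_ratio):
    """Speedup over plain decoding under a simple per-cycle cost model.

    acceptance_length: mean tokens emitted per verifier step (assumed to include
        the token the verifier itself contributes).
    draft_cost_ratio: time of one drafter step divided by one verifier step.
    """
    cycle_cost = 1.0 + num_speculative_tokens * draft_cost_ratio
    return acceptance_length / cycle_cost


# Hypothetical inputs chosen only to show the shape of the trade-off.
for ratio in (0.02, 0.05, 0.10):
    print(ratio, round(estimated_speedup(2.5, 3, ratio), 2))
```

Under this model a cheaper draft step raises the speedup at a fixed acceptance length, which is the lever the NVFP4 drafter is meant to pull.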
Files
- model.safetensors: NVFP4 quantized drafter weights (~1.5 GB)
- config.json: Eagle3 speculator config (carries verifier reference)
- config.py: Custom Eagle3SpeculatorConfig class (custom_code, required for trust_remote_code=True)
- tokenizer.json, tokenizer_config.json: Verifier tokenizer (Gemma 4)
- hf_quant_config.json: ModelOpt NVFP4 quantization metadata
- modelopt_state.pt: Full modelopt state for re-export
License
Apache 2.0 (matches base model). NVFP4 quantization is a derivative work contributed under the same terms.
Provenance
Created by quantizing RedHatAI/gemma-4-31B-it-speculator.eagle3 with NVFP4_DEFAULT_CFG calibrated against
RedHatAI/gemma-4-31B-it-NVFP4 outputs. Methodology adapted from RedHat AI's published
Gemma 4 NVFP4 target recipe + standard EAGLE-3 calibration practice.
See also
- BF16 source: RedHatAI/gemma-4-31B-it-speculator.eagle3
- Verifier (NVFP4): RedHatAI/gemma-4-31B-it-NVFP4
- EAGLE-3 paper: arXiv:2503.01840
- Speculators library: vllm-project/speculators
Support the work
If this release has been useful, tips are deeply appreciated; they go directly toward more compute, more models, and more open releases.
- Bitcoin (BTC): bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4
- Ethereum (ETH): 0x1512667F6D61454ad531d2E45C0a5d1fd82D0500
- Solana (SOL): DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t
- Monero (XMR): 836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd
Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.