# Eagle3-Qwen3-8B-zh
## Introduction
Eagle3-Qwen3-8B-zh is an EAGLE-3 draft model trained for the open-source Qwen3-8B, designed for the EAGLE-3 speculative decoding algorithm. It accelerates the decoding phase of large language model inference for both Chinese and English text. Because it was trained on a bilingual (Chinese and English) dataset, it also performs well on mixed-language input.
To reduce training costs, we built a low-cost EAGLE-3 training tool on top of the Eagle and SpecForge frameworks, optimized for consumer-grade GPUs and All-in-one AI server production environments. End-to-end training completed in about 3 days on an All-in-one AI server (NVIDIA RTX 4090).
Notably, although trained on only ~135k samples, the draft model outperforms other open-source draft models in acceleration (a slight edge in English and a significant improvement in Chinese). In testing on a single NVIDIA RTX 4090 GPU, it achieved up to 🚀 2.5x decoding speedup for both English and Chinese tasks on the MT-bench(-zh) datasets.
Note: The configuration files in this project follow the SGLang framework. To use another framework such as vLLM, adjust the relevant configuration accordingly.
## Training Configuration
With limited hardware resources, we built a low-cost EAGLE-3 weight-training tool for consumer-grade GPUs based on the open-source Eagle and SpecForge projects. The tool supports efficient EAGLE-3 training for common LLM sizes on All-in-one AI servers.
- **Dataset:** We constructed a 135K-sample training corpus from ShareGPT-68K (English) and ShareGPT-Chinese-English-90k (Chinese); prompts retained the original content, while outputs were regenerated with Qwen3-8B.
- **Training Environment:** Training ran on a 4-GPU setup (NVIDIA RTX 4090, 24 GB VRAM each) for a total duration of approximately 3 days.
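The output-regeneration step described above can be sketched as follows. This is a minimal illustration, not the actual training pipeline; the record field names and the `generate` callable are assumptions:

```python
def regenerate_outputs(records, generate):
    """Keep each ShareGPT prompt but replace its response.

    records: list of {"prompt": str, "output": str} dicts (field names assumed).
    generate: callable that maps a prompt to a fresh Qwen3-8B response.
    """
    return [
        {"prompt": r["prompt"], "output": generate(r["prompt"])}
        for r in records
    ]

# Example with a stand-in generator:
data = [{"prompt": "你好", "output": "stale reply"}]
print(regenerate_outputs(data, lambda p: f"<qwen3-8b reply to {p}>"))
```

Regenerating outputs with the target model is what aligns the draft model's training distribution with the tokens Qwen3-8B actually emits at inference time.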
## Inference Launch Command
To launch the EAGLE-3 speculative decoding service with SGLang, run:
```shell
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --speculative-algo EAGLE3 \
  --speculative-draft Zjcxy-SmartAI/Eagle3-Qwen3-8B-zh \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 16 \
  --mem-fraction 0.7 \
  --dtype float16 \
  --port 30000 \
  --cuda-graph-max-bs 8 \
  --cuda-graph-bs {1,2,3,4}
```
To launch the original model service (for comparative experiments) with SGLang, run:
```shell
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --dtype float16 \
  --port 30000 \
  --mem-fraction 0.7 \
  --cuda-graph-max-bs 8 \
  --cuda-graph-bs {1,2,3,4}
```
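With both servers running (on different ports, or one at a time on port 30000), the speedup can be reproduced by timing decode throughput against each endpoint. A minimal sketch using SGLang's OpenAI-compatible chat API; the endpoint path and the `usage` fields follow the OpenAI schema, and their availability in your deployment is an assumption:

```python
import json
import time
import urllib.request

def chat_payload(prompt, max_tokens=512, temperature=0.0):
    """OpenAI-style chat request body for the servers launched above."""
    return {
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def tokens_per_second(host, prompt):
    """Time one non-streaming completion and return decode throughput."""
    data = json.dumps(chat_payload(prompt)).encode()
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        out = json.load(resp)
    return out["usage"]["completion_tokens"] / (time.perf_counter() - start)

# speedup = tokens_per_second(eagle3_host, p) / tokens_per_second(baseline_host, p)
```

Averaging over many prompts (as the benchmarks below do) gives a more stable estimate than a single request.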
## Performance Evaluation
On a single NVIDIA RTX 4090, we systematically evaluated the open-source EAGLE-3 draft weights on the following datasets:
- **MT-bench:** 80 high-quality English multi-turn dialogues across various domains.
- **MT-bench-zh:** A Chinese version of MT-bench, translated with the Tongyi Qianwen service and then manually corrected.
- **C-Eval:** 2 samples from each category of the C-Eval dataset, 104 samples in total.
- **GSM8K:** 100 randomly selected samples from the GSM8K dataset.
Note: All evaluations were performed with 4-way request concurrency. The maximum generation length was 512 tokens for the MT-bench benchmarks and 2048 tokens for C-Eval and GSM8K.
To ensure the comprehensiveness of the results, we conducted numerous experiments with different speculative decoding hyperparameters and request sampling configurations.
### Experiment 1
This configuration runs 5 speculative decoding steps per forward pass, keeps the top-4 tokens by probability at each step, and verifies 16 draft tokens in parallel:
```shell
--speculative-num-steps 5 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 16 \
```
| Temperature | Model | MT-bench-zh Speedup | τ | MT-bench Speedup | τ | C-Eval Speedup | τ | GSM8K Speedup | τ |
|---|---|---|---|---|---|---|---|---|---|
| T=0 | Zjcxy-SmartAI/Eagle3-Qwen3-8B-zh | 2.16x | 3.44 | 2.27x | 3.62 | 2.25x | 3.38 | 2.81x | 4.18 |
| T=0 | AngelSlim/Qwen3-8B_eagle3 | 0.79x | 1.25 | 2.18x | 3.52 | 0.83x | 1.21 | 2.27x | 3.32 |
| T=0 | Tengyunw/qwen3_8b_eagle3B | 0.72x | 1.12 | 2.23x | 3.51 | 0.77x | 1.13 | 2.67x | 3.95 |
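Here τ is the average acceptance length, i.e. the average number of draft tokens accepted per verification step. As a rough sanity check, speedup and τ can be related through a simplified cost model; the per-step overhead factor below is an assumption chosen for illustration, not a measured value:

```python
def estimated_speedup(tau, step_cost_ratio):
    """Simplified speculative-decoding speedup model.

    tau: average accepted tokens per verification step (the τ column above).
    step_cost_ratio: cost of one draft-plus-verify step relative to one
        plain autoregressive decode step (> 1 due to drafting overhead).
    """
    return tau / step_cost_ratio

# With τ = 3.44 and an assumed ~59% per-step overhead, this simple model
# lands near the 2.16x reported above for MT-bench-zh at T=0.
print(round(estimated_speedup(3.44, 1.59), 2))  # 2.16
```

This also explains the sub-1x rows: when τ is close to 1, the drafting overhead outweighs the few tokens accepted per step.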
### Experiment 2
This configuration runs 6 speculative decoding steps per forward pass, keeps the top-10 tokens by probability at each step, and verifies 32 draft tokens in parallel:
```shell
--speculative-num-steps 6 \
--speculative-eagle-topk 10 \
--speculative-num-draft-tokens 32 \
```
| Temperature | Model | MT-bench-zh Speedup | τ | MT-bench Speedup | τ | C-Eval Speedup | τ | GSM8K Speedup | τ |
|---|---|---|---|---|---|---|---|---|---|
| T=0 | Zjcxy-SmartAI/Eagle3-Qwen3-8B-zh | 2.13x | 3.9 | 2.25x | 4.11 | 2.26x | 3.86 | 2.85x | 4.8 |
| T=0 | AngelSlim/Qwen3-8B_eagle3 | - | - | 2.16x | 4.01 | - | - | 2.21x | 3.72 |
| T=0 | Tengyunw/qwen3_8b_eagle3B | - | - | 2.21x | 4 | - | - | 2.66x | 4.5 |
| T=1 | Zjcxy-SmartAI/Eagle3-Qwen3-8B-zh | 1.46x | 3.57 | 1.58x | 3.89 | 1.5x | 3.42 | 1.99x | 4.52 |
| T=1 | AngelSlim/Qwen3-8B_eagle3 | - | - | 1.53x | 3.84 | - | - | 1.55x | 3.54 |
| T=1 | Tengyunw/qwen3_8b_eagle3B | - | - | 1.56x | 3.51 | - | - | 1.85x | 4.26 |
### Experiment 3
Additional testing was conducted on the common configuration using the officially recommended Qwen3-8B sampling parameters: Temperature=0.6, TopP=0.95, TopK=20, and MinP=0.
| Temperature | Model | MT-bench-zh Speedup | τ | MT-bench Speedup | τ | C-Eval Speedup | τ | GSM8K Speedup | τ |
|---|---|---|---|---|---|---|---|---|---|
| T=0.6 | Zjcxy-SmartAI/Eagle3-Qwen3-8B-zh | 1.73x | 3.82 | 1.83x | 4.06 | 1.79x | 3.69 | 2.32x | 4.73 |
| T=0.6 | AngelSlim/Qwen3-8B_eagle3 | - | - | 1.78x | 3.97 | - | - | 1.79x | 3.67 |
| T=0.6 | Tengyunw/qwen3_8b_eagle3B | - | - | 1.81x | 3.97 | - | - | 2.17x | 4.45 |
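These recommended sampling parameters map directly onto request fields. A minimal sketch; `top_k` and `min_p` are extensions beyond the core OpenAI schema, and passing them this way is an assumption about the serving framework:

```python
def qwen3_recommended_payload(prompt):
    """Chat request body using the recommended Qwen3-8B sampling parameters."""
    return {
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,   # extension field; assumed accepted by the server
        "min_p": 0.0,  # extension field; assumed accepted by the server
        "max_tokens": 512,
    }

print(qwen3_recommended_payload("你好")["temperature"])  # 0.6
```

Sampling at higher temperature spreads probability mass over more tokens, which lowers draft acceptance rates and hence the speedups relative to the T=0 experiments above.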
## Testing Results
Through comparative analysis, we draw the following conclusion:
- Trained on only 135k samples, the model performs consistently across diverse benchmarks: it maintains English acceleration on par with similar-scale open-source draft models while significantly improving Chinese task performance and acceleration. On an NVIDIA RTX 4090, we observe up to 2.5× end-to-end inference speedup for both languages.
## Relevant Links
- Qwen3-8B open-source weights: https://huggingface.co/Qwen/Qwen3-8B
- EAGLE open-source repository: https://github.com/SafeAILab/EAGLE
- SpecForge framework: https://github.com/sgl-project/SpecForge