# Model Card for ealexeev/TheDrummer-Rocinante-X-12B-v1-NVFP4
This is an NVFP4 quantization of TheDrummer/Rocinante-X-12B-v1.
## Quantization Details
Quantized using the script from https://github.com/ealexeev/llm-quantization.
Calibration dataset size: 512

Calibration data:
- HuggingFaceH4/ultrachat_200k
- allenai/c4_en
- mrcedric98/fiction_books_v8
These were shuffled and mixed at a ratio of 3:2:3.
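The mixing arithmetic can be sketched as follows. `split_samples` is an illustrative helper, not the actual `quantize_nvfp4.py` internals, and it uses the 512-sample budget from the command below:

```python
# Sketch: split a total calibration budget across the three datasets
# at the stated 3:2:3 ratio. Names and helper are hypothetical.
def split_samples(total: int, weights: dict) -> dict:
    denom = sum(weights.values())
    counts = {name: total * w // denom for name, w in weights.items()}
    # Assign any integer-division remainder to the first dataset.
    first = next(iter(counts))
    counts[first] += total - sum(counts.values())
    return counts

mix = split_samples(512, {"ultrachat_200k": 3, "c4_en": 2, "fiction_books_v8": 3})
# 512 samples at 3:2:3 -> 192 / 128 / 192
```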
### Procedure
```shell
python ./quantize_nvfp4.py --model TheDrummer/Rocinante-X-12B-v1 \
  --output ./TheDrummer/Rocinante-X-12B-v1 \
  --size 512 --seed 42 \
  --ultra_chat 3 --c4_en 2 --fiction_v8 3
```
The vLLM docs note that NVFP4 quantization needs very few calibration samples. I ran multiple quants at 128, 256, 512, 1024, and 2048 samples; this 512-sample version hit the sweet spot on these particular evals.
## Quantization Evals
| Benchmark | Metric | Base (BF16) | NVFP4 (512S) | Delta (%) |
|---|---|---|---|---|
| ARC Challenge | acc_norm (↑) | 0.5922 | 0.5802 | -2.03% |
| HellaSwag | acc_norm (↑) | 0.8230 | 0.8122 | -1.31% |
| IFEval | prompt_strict (↑) | 0.2255 | 0.2348 | +4.12% |
| Lambada | accuracy (↑) | 0.7192 | 0.7176 | -0.22% |
| WikiText | word_ppl (↓) | 8.8535 | 9.6191 | +8.65% |
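The Delta column can be recomputed directly from the base and quantized scores (relative change, in percent):

```python
# Recompute the table's relative deltas: (quant - base) / base * 100.
# For perplexity, a positive delta means the quantized model got worse.
scores = {
    "arc_challenge": (0.5922, 0.5802),
    "hellaswag":     (0.8230, 0.8122),
    "ifeval":        (0.2255, 0.2348),
    "lambada":       (0.7192, 0.7176),
    "wikitext_ppl":  (8.8535, 9.6191),
}
deltas = {k: round((q - b) / b * 100, 2) for k, (b, q) in scores.items()}
# e.g. deltas["arc_challenge"] -> -2.03
```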
### Performance Breakdown
- Logic & Reasoning: The model shows strong stability, retaining roughly 98.7% of its baseline HellaSwag score and 98.0% of its ARC-Challenge score.
- Instruction Following: Surprisingly, the 512-sample quantization run showed a roughly 4% relative improvement in IFEval strict accuracy, suggesting that FP4 quantization noise did not degrade the model's adherence to constraints.
- Predictive Modeling: Perplexity on WikiText saw a moderate increase of about 8.7%, a typical trade-off when moving to 4-bit precision formats.
## Bias, Risks, and Limitations
This is a creative fine-tune, and it was quantized with that use case in mind. It's probably not going to pass any leet-coder challenges.
## How To Use
```shell
# --tensor-parallel-size 1: single GPU
# --gpu-memory-utilization 0.8: otherwise vLLM reserves all free VRAM for the KV cache
vllm serve ealexeev/TheDrummer-Rocinante-X-12B-v1-NVFP4 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.8
```
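Once the server is running, it exposes an OpenAI-compatible API (port 8000 by default). The sketch below only builds the chat-completion request body; `build_request` is a hypothetical helper, not part of vLLM:

```python
import json

# Hypothetical helper: build a chat-completion request body for the
# OpenAI-compatible endpoint that `vllm serve` exposes.
def build_request(prompt: str, temperature: float = 0.8) -> str:
    body = {
        "model": "ealexeev/TheDrummer-Rocinante-X-12B-v1-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return json.dumps(body)

# POST this payload to http://localhost:8000/v1/chat/completions
payload = build_request("Write the opening line of a noir story.")
```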
## Model Tree

Base model: mistralai/Mistral-Nemo-Base-2407