Model Card for ealexeev/TheDrummer-Rocinante-X-12B-v1-NVFP4

This is an NVFP4 quantization of TheDrummer/Rocinante-X-12B-v1.

Quantization Details

Quantized with the script from https://github.com/ealexeev/llm-quantization.

Calibration dataset size: 512

Calibration data:

  • HuggingFaceH4/ultrachat_200k
  • allenai/c4_en
  • mrcedric98/fiction_books_v8

These were shuffled and mixed at a ratio of 3:2:3, for a final calibration set of 512 samples.
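The mixing step can be sketched in plain Python. This is an illustrative stand-in, not the author's quantize_nvfp4.py: `mix_calibration` is a hypothetical helper, and the three lists below are placeholders for the actual HF datasets.

```python
import random

def mix_calibration(sources, weights, size, seed=42):
    """Draw from each source proportionally to its weight, then shuffle.

    Hypothetical sketch of a 3:2:3 calibration mix; the real script
    may differ in how it samples and truncates.
    """
    rng = random.Random(seed)
    mixed = []
    for samples, weight in zip(sources, weights):
        # Each source contributes a share of `size` proportional to its weight.
        share = round(size * weight / sum(weights))
        mixed.extend(rng.sample(samples, share))
    rng.shuffle(mixed)
    return mixed[:size]

# Placeholder stand-ins for ultrachat_200k : c4_en : fiction_books_v8 at 3:2:3.
ultra = [f"ultra-{i}" for i in range(4000)]
c4 = [f"c4-{i}" for i in range(4000)]
fiction = [f"fic-{i}" for i in range(4000)]

calib = mix_calibration([ultra, c4, fiction], [3, 2, 3], size=512)
print(len(calib))  # 512
```

At a 3:2:3 ratio and size 512, the shares work out to 192, 128, and 192 samples.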

Procedure

python ./quantize_nvfp4.py --model TheDrummer/Rocinante-X-12B-v1 --output ./TheDrummer/Rocinante-X-12B-v1 --size 512 --seed 42 --ultra_chat 3 --c4_en 2 --fiction_v8 3

I had read in the vLLM docs that NVFP4 quantization needs very few calibration samples. I ran quants at 128, 256, 512, 1024, and 2048 samples; this 512-sample version hit the sweet spot in these particular evals.

Quantization Evals

Evaluation Results

Benchmark       Metric              Base (BF16)   NVFP4 (512S)   Delta (%)
ARC Challenge   acc_norm (↑)        0.5922        0.5802         -2.03%
HellaSwag       acc_norm (↑)        0.8230        0.8122         -1.31%
IFEval          prompt_strict (↑)   0.2255        0.2348         +4.12%
Lambada         accuracy (↑)        0.7192        0.7176         -0.22%
WikiText        word_ppl (↓)        8.8535        9.6191         +8.65%
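The relative deltas are just (quantized − base) / base. A quick sanity check reproduces the ARC-Challenge and WikiText rows:

```python
def rel_delta(base, quant):
    """Relative change of the quantized score vs. the BF16 baseline, in percent."""
    return (quant - base) / base * 100

print(f"{rel_delta(0.5922, 0.5802):+.2f}%")  # ARC Challenge acc_norm
print(f"{rel_delta(8.8535, 9.6191):+.2f}%")  # WikiText word_ppl
```

Note that for perplexity (↓) a positive delta is a regression, while for the accuracy metrics (↑) it is an improvement.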

Performance Breakdown

  • Logic & Reasoning: The model is highly stable, retaining about 98.7% of its baseline HellaSwag score and 98.0% of its ARC-Challenge score after quantization.
  • Instruction Following: Surprisingly, the 512S run showed a small improvement in IFEval strict accuracy (+0.93 points absolute, roughly +4% relative), suggesting that FP4 quantization noise did not degrade the model's adherence to constraints.
  • Predictive Modeling: WikiText perplexity increased by about 8.6%, a typical trade-off when moving to 4-bit precision formats.

Bias, Risks, and Limitations

The base model is already a creative-writing fine-tune, and it was quantized with that use case in mind. Don't expect it to pass any leet-coder challenges.

How To Use

bash
# Single GPU; cap memory utilization or the KV cache will claim it all
vllm serve ealexeev/TheDrummer-Rocinante-X-12B-v1-NVFP4 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.8
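Once the server is up, it speaks the OpenAI-compatible chat API. A minimal sketch of the request body (this only builds the JSON; the endpoint `http://localhost:8000/v1/chat/completions` is vLLM's default, and the prompt is just an example):

```python
import json

# Chat-completion payload for the local vLLM server; the model name
# matches the `vllm serve` command above.
payload = {
    "model": "ealexeev/TheDrummer-Rocinante-X-12B-v1-NVFP4",
    "messages": [
        {"role": "user", "content": "Write the opening line of a noir story."}
    ],
    "max_tokens": 128,
    "temperature": 0.8,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions
# with header "Content-Type: application/json".
print(body[:40])
```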