Model Card for ealexeev/TheDrummer-Rocinante-X-12B-v1-NVFP4

This is an NVFP4 quantization of TheDrummer/Rocinante-X-12B-v1.

Quantization Details

Quantized with the script from https://github.com/ealexeev/llm-quantization.

Calibration dataset size: 512

Calibration data:

  • HuggingFaceH4/ultrachat_200k
  • allenai/c4_en
  • mrcedric98/fiction_books_v8

These were shuffled and mixed at a ratio of 3:2:3, for a final calibration set of 512 samples.
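The mixing step can be sketched in plain Python. This is an illustrative stand-in, not the author's quantize_nvfp4.py: `mix_calibration` is a hypothetical helper, and the three lists below are placeholders for the actual HF datasets.

```python
import random

def mix_calibration(sources, weights, size, seed=42):
    """Draw from each source proportionally to its weight, then shuffle.

    Hypothetical sketch of a 3:2:3 calibration mix; the real script
    may differ in how it samples and truncates.
    """
    rng = random.Random(seed)
    mixed = []
    for samples, weight in zip(sources, weights):
        # Each source contributes a share of `size` proportional to its weight.
        share = round(size * weight / sum(weights))
        mixed.extend(rng.sample(samples, share))
    rng.shuffle(mixed)
    return mixed[:size]

# Placeholder stand-ins for ultrachat_200k : c4_en : fiction_books_v8 at 3:2:3.
ultra = [f"ultra-{i}" for i in range(4000)]
c4 = [f"c4-{i}" for i in range(4000)]
fiction = [f"fic-{i}" for i in range(4000)]

calib = mix_calibration([ultra, c4, fiction], [3, 2, 3], size=512)
print(len(calib))  # 512
```

At a 3:2:3 ratio and size 512, the shares work out to 192, 128, and 192 samples.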

Procedure

python ./quantize_nvfp4.py --model TheDrummer/Rocinante-X-12B-v1 --output ./TheDrummer/Rocinante-X-12B-v1 --size 512 --seed 42 --ultra_chat 3 --c4_en 2 --fiction_v8 3

I had read in the vLLM docs that NVFP4 quantization needs very few calibration samples. I ran quants at 128, 256, 512, 1024, and 2048 samples; this 512-sample version hit the sweet spot in these particular evals.

Quantization Evals

Evaluation Results

Benchmark       Metric              Base (BF16)   NVFP4 (512S)   Delta (%)
ARC Challenge   acc_norm (↑)        0.5922        0.5802         -2.03%
HellaSwag       acc_norm (↑)        0.8230        0.8122         -1.31%
IFEval          prompt_strict (↑)   0.2255        0.2348         +4.12%
Lambada         accuracy (↑)        0.7192        0.7176         -0.22%
WikiText        word_ppl (↓)        8.8535        9.6191         +8.65%
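The relative deltas are just (quantized − base) / base. A quick sanity check reproduces the ARC-Challenge and WikiText rows:

```python
def rel_delta(base, quant):
    """Relative change of the quantized score vs. the BF16 baseline, in percent."""
    return (quant - base) / base * 100

print(f"{rel_delta(0.5922, 0.5802):+.2f}%")  # ARC Challenge acc_norm
print(f"{rel_delta(8.8535, 9.6191):+.2f}%")  # WikiText word_ppl
```

Note that for perplexity (↓) a positive delta is a regression, while for the accuracy metrics (↑) it is an improvement.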

Performance Breakdown

  • Logic & Reasoning: The model is highly stable, retaining about 98.7% of its baseline HellaSwag score and 98.0% of its ARC-Challenge score after quantization.
  • Instruction Following: Surprisingly, the 512S run showed a small improvement in IFEval strict accuracy (+0.93 points absolute, roughly +4% relative), suggesting that FP4 quantization noise did not degrade the model's adherence to constraints.
  • Predictive Modeling: WikiText perplexity increased by about 8.6%, a typical trade-off when moving to 4-bit precision formats.

Bias, Risks, and Limitations

The base model is already a creative-writing fine-tune, and it was quantized with that use case in mind. Don't expect it to pass any leet-coder challenges.

How To Use

bash
# Single GPU; cap memory utilization or the KV cache will claim it all
vllm serve ealexeev/TheDrummer-Rocinante-X-12B-v1-NVFP4 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.8
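Once the server is up, it speaks the OpenAI-compatible chat API. A minimal sketch of the request body (this only builds the JSON; the endpoint `http://localhost:8000/v1/chat/completions` is vLLM's default, and the prompt is just an example):

```python
import json

# Chat-completion payload for the local vLLM server; the model name
# matches the `vllm serve` command above.
payload = {
    "model": "ealexeev/TheDrummer-Rocinante-X-12B-v1-NVFP4",
    "messages": [
        {"role": "user", "content": "Write the opening line of a noir story."}
    ],
    "max_tokens": 128,
    "temperature": 0.8,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions
# with header "Content-Type: application/json".
print(body[:40])
```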