Update readme.md with benchmarks
Working GGUF of [Qwen/Qwen3-Reranker-0.6B](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B).
> **Other sizes:** [0.6B (this)](https://huggingface.co/Voodisss/Qwen3-Reranker-0.6B-GGUF-llama_cpp) · [4B](https://huggingface.co/Voodisss/Qwen3-Reranker-4B-GGUF-llama_cpp) · [8B](https://huggingface.co/Voodisss/Qwen3-Reranker-8B-GGUF-llama_cpp)
## Quantization quality comparison (Qwen3-Reranker-0.6B)
Benchmarked on [MTEB AskUbuntuDupQuestions](https://huggingface.co/datasets/mteb/AskUbuntuDupQuestions) (361 queries) via llama-server `/v1/rerank` on RTX 3090. All quants produced from the same F16 source using `llama-quantize`.
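As a rough sketch of how queries can be scored through that endpoint (the localhost URL and helper names are assumptions, not part of the benchmark harness; the request/response shape follows llama-server's rerank API):

```python
import json
from urllib import request

RERANK_URL = "http://localhost:8080/v1/rerank"  # assumed llama-server address

def order_by_score(documents, results):
    """Map rerank results (each with "index" and "relevance_score") back
    onto the input documents, best match first."""
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [documents[r["index"]] for r in ranked]

def rerank(query, documents, url=RERANK_URL):
    """POST a query/documents pair to /v1/rerank and return ranked documents."""
    payload = json.dumps({"query": query, "documents": documents}).encode()
    req = request.Request(url, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return order_by_score(documents, json.load(resp)["results"])
```

The server returns one score per candidate document; sorting by `relevance_score` descending gives the final ranking that the metrics below are computed over.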
| Quant  | Size    | NDCG@10 | MAP@10 | MRR@10 | Δ NDCG@10  |
| ------ | ------- | ------- | ------ | ------ | ---------- |
| F16    | 1.12 GB | 0.6688  | 0.5143 | 0.7317 | baseline   |
| Q8_0   | 0.60 GB | 0.6677  | 0.5143 | 0.7329 | -0.2%      |
| Q6_K   | 0.46 GB | 0.6691  | 0.5156 | 0.7353 | +0.0%      |
| Q5_K_M | 0.41 GB | 0.6671  | 0.5138 | 0.7377 | -0.3%      |
| Q5_0   | 0.41 GB | 0.6678  | 0.5118 | 0.7423 | -0.2%      |
| Q4_K_M | 0.37 GB | 0.6669  | 0.5120 | 0.7345 | -0.3%      |
| Q4_0   | 0.36 GB | 0.6556  | 0.5010 | 0.7211 | -2.0%      |
| Q3_K_M | 0.32 GB | 0.6551  | 0.5004 | 0.7354 | -2.1%      |
| Q2_K   | 0.28 GB | 0.4770  | 0.3104 | 0.5668 | **-28.7%** |
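For reference, NDCG@10 (the headline metric above) rewards relevant documents near the top of the ranking, discounting each hit by the log of its rank and normalizing by the ideal ordering. A minimal sketch of the metric (not the exact MTEB implementation):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: relevance at rank i weighted by 1/log2(i+2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG = DCG of the predicted order / DCG of the ideal (sorted) order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; pushing a relevant document further down the list lowers the score, which is why even small quantization-induced score shifts show up in the table.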
**Takeaway:** Q4_K_M (0.37 GB) is the sweet spot for 0.6B — 3x smaller than F16 with only 0.3% quality loss. Below Q4_K_M, quality starts to degrade: Q4_0 and Q3_K_M drop ~2%, and Q2_K is unusable (-28.7%). Smaller models are more sensitive to quantization than larger ones.
## Does it work?