Update readme.md with benchmarks
Working GGUF of [Qwen/Qwen3-Reranker-0.6B](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B).
> **Other sizes:** [0.6B (this)](https://huggingface.co/Voodisss/Qwen3-Reranker-0.6B-GGUF-llama_cpp) · [4B](https://huggingface.co/Voodisss/Qwen3-Reranker-4B-GGUF-llama_cpp) · [8B](https://huggingface.co/Voodisss/Qwen3-Reranker-8B-GGUF-llama_cpp)
## Quantization quality comparison (Qwen3-Reranker-0.6B)
Benchmarked on [MTEB AskUbuntuDupQuestions](https://huggingface.co/datasets/mteb/AskUbuntuDupQuestions) (361 queries) via llama-server `/v1/rerank` on RTX 3090. All quants produced from the same F16 source using `llama-quantize`.
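As a rough sketch of how queries can be scored through that endpoint (the localhost URL and helper names are assumptions, not part of the benchmark harness; the request/response shape follows llama-server's rerank API):

```python
import json
from urllib import request

RERANK_URL = "http://localhost:8080/v1/rerank"  # assumed llama-server address

def order_by_score(documents, results):
    """Map rerank results (each with "index" and "relevance_score") back
    onto the input documents, best match first."""
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [documents[r["index"]] for r in ranked]

def rerank(query, documents, url=RERANK_URL):
    """POST a query/documents pair to /v1/rerank and return ranked documents."""
    payload = json.dumps({"query": query, "documents": documents}).encode()
    req = request.Request(url, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return order_by_score(documents, json.load(resp)["results"])
```

The server returns one score per candidate document; sorting by `relevance_score` descending gives the final ranking that the metrics below are computed over.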
| Quant  | Size    | NDCG@10 | MAP@10 | MRR@10 | Δ NDCG@10  |
| ------ | ------- | ------- | ------ | ------ | ---------- |
| F16    | 1.12 GB | 0.6688  | 0.5143 | 0.7317 | baseline   |
| Q8_0   | 0.60 GB | 0.6677  | 0.5143 | 0.7329 | -0.2%      |
| Q6_K   | 0.46 GB | 0.6691  | 0.5156 | 0.7353 | +0.0%      |
| Q5_K_M | 0.41 GB | 0.6671  | 0.5138 | 0.7377 | -0.3%      |
| Q5_0   | 0.41 GB | 0.6678  | 0.5118 | 0.7423 | -0.2%      |
| Q4_K_M | 0.37 GB | 0.6669  | 0.5120 | 0.7345 | -0.3%      |
| Q4_0   | 0.36 GB | 0.6556  | 0.5010 | 0.7211 | -2.0%      |
| Q3_K_M | 0.32 GB | 0.6551  | 0.5004 | 0.7354 | -2.1%      |
| Q2_K   | 0.28 GB | 0.4770  | 0.3104 | 0.5668 | **-28.7%** |
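For reference, NDCG@10 (the headline metric above) rewards relevant documents near the top of the ranking, discounting each hit by the log of its rank and normalizing by the ideal ordering. A minimal sketch of the metric (not the exact MTEB implementation):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: relevance at rank i weighted by 1/log2(i+2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG = DCG of the predicted order / DCG of the ideal (sorted) order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; pushing a relevant document further down the list lowers the score, which is why even small quantization-induced score shifts show up in the table.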
**Takeaway:** Q4_K_M (0.37 GB) is the sweet spot for 0.6B — 3x smaller than F16 with only 0.3% quality loss. Below Q4_K_M, quality starts to degrade: Q4_0 and Q3_K_M drop ~2%, and Q2_K is unusable (-28.7%). Smaller models are more sensitive to quantization than larger ones.
## Does it work?