Qwen3-Reranker-0.6B – GGUF (llama.cpp)
Working GGUF of Qwen/Qwen3-Reranker-0.6B for llama.cpp. Converted 2025-03-09 with the official convert_hf_to_gguf.py.
Other sizes: 0.6B (this) · 4B · 8B
Quantization quality comparison (Qwen3-Reranker-0.6B)
Benchmarked on MTEB AskUbuntuDupQuestions (361 queries) via llama-server /v1/rerank on RTX 3090. All quants produced from the same F16 source using llama-quantize.
| Quant | Size | NDCG@10 | MAP@10 | MRR@10 | Δ NDCG@10 |
|---|---|---|---|---|---|
| F16 | 1.12 GB | 0.6688 | 0.5143 | 0.7317 | baseline |
| Q8_0 | 0.60 GB | 0.6677 | 0.5143 | 0.7329 | -0.2% |
| Q6_K | 0.46 GB | 0.6691 | 0.5156 | 0.7353 | +0.0% |
| Q5_K_M | 0.41 GB | 0.6671 | 0.5138 | 0.7377 | -0.3% |
| Q5_0 | 0.41 GB | 0.6678 | 0.5118 | 0.7423 | -0.2% |
| Q4_K_M | 0.37 GB | 0.6669 | 0.5120 | 0.7345 | -0.3% |
| Q4_0 | 0.36 GB | 0.6556 | 0.5010 | 0.7211 | -2.0% |
| Q3_K_M | 0.32 GB | 0.6551 | 0.5004 | 0.7354 | -2.1% |
| Q2_K | 0.28 GB | 0.4770 | 0.3104 | 0.5668 | -28.7% |
Takeaway: Q4_K_M (0.37 GB) is the sweet spot for the 0.6B model: 3x smaller than F16 with only 0.3% quality loss. Below Q4_K_M, quality starts to degrade: Q4_0 and Q3_K_M drop ~2%, and Q2_K is unusable (-28.7%). Small models are generally more sensitive to quantization than larger ones.
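The table above can be reproduced from the F16 file with llama.cpp's llama-quantize tool. A sketch (binary and file paths are illustrative and assume a standard llama.cpp build):

```shell
# Produce each quantized variant from the same F16 source.
for q in Q8_0 Q6_K Q5_K_M Q5_0 Q4_K_M Q4_0 Q3_K_M Q2_K; do
  ./llama-quantize Qwen3-Reranker-0.6B-f16.gguf \
                   "Qwen3-Reranker-0.6B-${q}.gguf" "${q}"
done
```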
Does it work?
Yes. Most community GGUFs of Qwen3-Reranker produce garbage scores (on the order of 4.5e-23) because they are missing reranker-specific tensors; see llama.cpp #16407. This one works:
Doc 0 (relevant): relevance_score = 0.98XX
Doc 1 (irrelevant): relevance_score = 0.00XX
Quick start
llama-server -m Qwen3-Reranker-0.6B-f16.gguf --reranking --pooling rank --embedding --port 8081
curl http://localhost:8081/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"query": "employment termination notice period",
"documents": [
"The Labour Code requires 30 calendar days written notice.",
"Corporate tax rates for small enterprises."
]
}'
Use /v1/rerank, not /v1/embeddings. The embeddings endpoint returns zeros for reranker models.
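The curl call above can also be scripted. A minimal Python sketch, assuming the quick-start server is listening on port 8081 and that the response follows llama-server's Jina-style rerank shape (`results` with `index` and `relevance_score`):

```python
# Minimal /v1/rerank client sketch for llama-server (stdlib only).
import json
from urllib.request import Request, urlopen

def build_rerank_payload(query, documents):
    """Build the JSON body expected by /v1/rerank."""
    return {"query": query, "documents": documents}

def rerank(query, documents, url="http://localhost:8081/v1/rerank"):
    """POST to /v1/rerank and return the scored results list."""
    body = json.dumps(build_rerank_payload(query, documents)).encode()
    req = Request(url, data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)["results"]

# Usage (requires a running server):
#   docs = ["The Labour Code requires 30 calendar days written notice.",
#           "Corporate tax rates for small enterprises."]
#   for r in rerank("employment termination notice period", docs):
#       print(r["index"], r["relevance_score"])
```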
What's different about this GGUF?
The official convert_hf_to_gguf.py detects Qwen3-Reranker and does things naive converters skip:
- Extracts `cls.output.weight` (the yes/no classifier) from `lm_head`
- Sets `pooling_type = RANK` metadata
- Bakes in the rerank chat template
- Sets `classifier.output_labels = ["yes", "no"]`
Without these, llama-server has nothing to compute scores from.
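You can check whether a given GGUF carries this metadata before serving it. A sketch: the key names below are assumptions based on llama.cpp's `<arch>.<key>` metadata scheme, so confirm against your file with the `gguf-dump` tool from the `gguf` pip package.

```python
# Sketch: check a GGUF's metadata keys for the reranker-specific fields.
# Key name fragments are assumptions; verify with `gguf-dump <file>`.
REQUIRED_KEY_PARTS = ["pooling_type", "classifier.output_labels"]

def missing_reranker_fields(field_names):
    """Return required metadata key parts absent from field_names."""
    return [part for part in REQUIRED_KEY_PARTS
            if not any(part in name for name in field_names)]

# Usage (requires `pip install gguf`):
#   from gguf import GGUFReader
#   reader = GGUFReader("Qwen3-Reranker-0.6B-f16.gguf")
#   print(missing_reranker_fields(reader.fields.keys()))  # [] if complete
```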
models.ini example
[Qwen3-Reranker-0.6B-f16]
model = /path/to/Qwen3-Reranker-0.6B-f16.gguf
reranking = true
pooling = rank
embedding = true
ctx-size = 32768
For a full multi-model setup guide (embedding + reranking + chat on one server), see the llama-server Qwen3 guide.
Convert it yourself
pip install huggingface_hub gguf torch safetensors sentencepiece
python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-Reranker-0.6B', local_dir='Qwen3-Reranker-0.6B-src')"
python convert_hf_to_gguf.py --outtype f16 --outfile Qwen3-Reranker-0.6B-f16.gguf Qwen3-Reranker-0.6B-src/
License
Apache 2.0 – same as the original model.