Mar 5 - 'Final' Update: iMatrix + Benchmarks + New quant algo
All GGUFs now use our new imatrix data, with improvements in chat, coding, long-context, and tool-calling use cases.
GGUFs updated with an improved quantization algorithm.
Updated with a fixed chat template for improved tool-calling and coding performance!
Replaced BF16 layers with F16 for faster inference on devices without native BF16 support.
See our new benchmarks for 122B-A10B here.

Think toggle for Qwen3.5 now in LM Studio. See our guide for instructions.
Please follow the correct instructions / settings in our guide here.
Fine-tuning and RL for Qwen3.5
- You can also fine-tune and perform reinforcement learning (RL) on all Qwen3.5 models with Unsloth via our free Colab notebooks.
- Read our Qwen3.5 fine-tuning guide for tips, VRAM requirements, code and more here.
For Qwen3.5-35B-A3B, we primarily reduced the maximum KLD.
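For reference, per-token KL divergence (including the maximum KLD figure above) can be measured with llama.cpp's `llama-perplexity` tool. A minimal sketch — the model and text file names are placeholders, not the files used for the published numbers:

```shell
# 1. Record reference logits from the full-precision model (placeholder names).
./llama-perplexity -m Qwen3.5-35B-A3B-BF16.gguf -f wiki.test.raw \
    --kl-divergence-base logits.bin

# 2. Replay the same text through a quantized model and compare against the
#    saved logits; this reports mean/median/maximum KLD and related stats.
./llama-perplexity -m Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf \
    --kl-divergence-base logits.bin --kl-divergence
```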
If I'm reading this right then Q3_K_M is on par with UD-Q3_K_XL? What does UD signify then?
Hi, thanks for the GGUFs. Small observation: the file sizes are a bit off in the graph.
Why don't you name the other quants? These graphs make little sense otherwise: you have to already know that Noctrex is an MXFP4 quant, and with Ubergarm you can't even tell which quant it is. The same goes for every quant except Unsloth's; these two just illustrate the problem best.
I'm getting cheap ad vibes with everything you do lately. "Ours is good, theirs is bad"
It's also funny how you finally point out that "Benjamin’s recent MiniMax‑M2.5 analysis shows a case how perplexity and KLD can be very misleading. Unsloth Dynamic IQ2_XXS performs better than AesSedai’s IQ3_S on real world evals (LiveCodeBench v6, MMLU Pro) despite being 11GB smaller. Yet, AesSedai’s perplexity and KLD benchmarks suggest the opposite." ( source: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks )
Yet you go on with these graphs. Can't your perfect final quants be outperformed in real-life scenarios by the quants you didn't even bother naming? You step onto that territory deliberately, so do it right.
Running Q8_0 on dual RTX PRO 6000 Blackwell (96GB each) with llama.cpp b8192:
Setup:
- tensor-split across 2x PRO 6000, 5090 handles embeddings separately
- 3 parallel slots × 131K context
- `--no-jinja` flag needed for this model
- CUDA arch `120a` (not `120`) required for Blackwell native
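The setup above roughly corresponds to a `llama-server` invocation like the following. The model path, split ratio, and port are placeholders, and the separate 5090 handling embeddings is not shown:

```shell
# Sketch only: adjust paths and split ratio for your hardware.
./llama-server \
  -m Qwen3.5-122B-A10B-Q8_0.gguf \
  --tensor-split 1,1 \
  -ngl 99 \
  -np 3 \
  -c 393216 \
  --no-jinja \
  --port 8080
# -np 3 gives three parallel slots; -c 393216 = 3 x 131072 tokens of context.
# --no-jinja disables the built-in chat template, as noted above.
```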
Throughput (Q8_0):
- PP: ~2900 tok/s
- TG: ~71 tok/s per slot (single user ~71, 3 concurrent ~210 aggregate)
KV cache: q4_0 and q8_0 both work at 131K context. q4_0 saves ~30% VRAM with minimal quality loss for most tasks.
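The KV-cache settings described above map to llama.cpp's cache-type flags; a quantized V cache requires flash attention to be enabled. A minimal sketch (model path is a placeholder):

```shell
# q8_0 KV cache: near-lossless, moderate VRAM savings.
./llama-server -m model.gguf -fa on --cache-type-k q8_0 --cache-type-v q8_0

# q4_0 KV cache: ~30% less KV VRAM per the numbers above.
./llama-server -m model.gguf -fa on --cache-type-k q4_0 --cache-type-v q4_0
```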
Key gotcha: if you're on Blackwell, make sure you compile with `CMAKE_CUDA_ARCHITECTURES=120a` and `GGML_CUDA_FA_ALL_QUANTS=ON` (`LLAMA_CUDA_FA_ALL_QUANTS` is the older name for the same option; both are build-time CMake settings, not runtime variables). Pre-built binaries don't include SM120 yet.
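Under those assumptions, a Blackwell-native build looks roughly like:

```shell
# Build llama.cpp from source with SM120a kernels and all FA KV-quant combos.
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120a \
  -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release -j
```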
Excellent quant quality at Q8_0 — practically indistinguishable from FP16 for coding/agentic tasks.
