Feb 27: GGUF Update + Tool-calling fixes + Benchmarks
Qwen3.5 is now updated with improved tool-calling & coding performance! See the improvements via Claude Code and Codex.
We also benchmarked GGUFs & removed MXFP4 layers from 3 quants.
We will be re-uploading the non-UD ones in a few hours with the chat template fixes, if all goes well.
Analysis + MXFP4 investigation: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
coooooool!
Thanks!
If KLD and PPL are not directly correlated to real-world performance, then will you be posting real-world performance for the quants going forward?
Possibly, but maybe only very little: they're too time-consuming and expensive to run. Look up Benjamin Marie if you want those benchmarks: https://x.com/bnjmn_marie/status/2027043753484021810
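For anyone unfamiliar with the metric being debated: KLD here is the KL divergence between the quantized and full-precision models' next-token distributions, with lower meaning the quant tracks the original model more closely. A minimal sketch of the metric itself (toy numbers, not from the actual benchmarks):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as lists of
    probabilities over the same token vocabulary. Lower means the
    quantized model's output distribution is closer to the original's."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy example: full-precision vs. quantized next-token probabilities
full_precision = [0.70, 0.20, 0.10]
quantized      = [0.65, 0.25, 0.10]
print(round(kl_divergence(full_precision, quantized), 6))
```

The debate above is that a low average KLD over a test corpus does not guarantee the quant behaves identically on the specific tasks you care about.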
Can I suggest a similar test for the 27B model? Considering it is also partially state space, it might be interesting to understand the difference between this sparse MoE hybrid family and its dense counterparts.
Conventional wisdom is, after all, based mainly on pure-transformer dense models, so such a test was well overdue. I think some dedicated insight into the dense counterpart would give us a complete enough picture going forward, beyond the current lack of insight into long-tail and medium-tail factual retention/expression ability and benchmark scores.
I'd really like if there were labels for the other quant lines on that chart, personally. Hard to compare, as is.
If KLD and PPL are not directly correlated to real world performance
Mildly interesting anecdotal experience, to me: I was using the UD-Q4-K-XL quant with the wrong MX4 weights. When the news came out yesterday, I downloaded half a dozen Qwen 3.5 35B quants from all the major players. After putting them through my own "real world performance" evaluation - STEM and agentic coding/tasks based on real work I do - Unsloth's Q4 still came out on top.
It's a small test, I fully admit that. Adding more evals takes a considerable amount of time since these are not multiple choice nor LLM judged but use actual code/unit test/answer regex evaluation. It's 24 evals across a decent range of topics based on real world use - so each hit or miss corresponds directly to something I'd be using it for.
Unsloth Qwen3.5-35B-A3B-UD-Q4_K_XL (MX4 issue): 22 out of 24
Ubergarm Qwen3.5-35B-A3B-Q4_0: 21 out of 24
Bartowski Qwen_Qwen3.5-35B-A3B-Q4_K_M: 19 out of 24
temp = 0.6, top-p = 0.95, top-k = 20, min-p = 0.00, seed = 1.
Maybe with different settings or a different seed I'd have different results, but I found the results interesting so far.
Models that score the best I simply try out in actual use for awhile to see how it feels in practice.
Downloading the fixed weights now, to see how it changes.
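The kind of regex-based answer evaluation described above can be sketched roughly like this (the prompts, patterns, and function names are purely illustrative, not the commenter's actual harness):

```python
import re

# Hypothetical evals: (prompt sent to the model, regex the output must match).
# Real versions would cover STEM and agentic coding tasks, plus unit tests.
EVALS = [
    ("What is 12 * 12? Answer with just the number.", r"\b144\b"),
    ("Name the chemical symbol for iron.", r"\bFe\b"),
]

def score(model_answers):
    """model_answers: list of raw model outputs, one per eval.
    Returns (hits, total) -- each hit corresponds to one real task."""
    hits = 0
    for (_, pattern), answer in zip(EVALS, model_answers):
        if re.search(pattern, answer):
            hits += 1
    return hits, len(EVALS)

hits, total = score(["The answer is 144.", "Iron's symbol is Fe."])
print(f"{hits} out of {total}")  # both toy answers match here
```

The appeal of this style over multiple choice or LLM-judged scoring is that each hit or miss is deterministic and directly tied to a concrete task.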
We unfortunately couldn't label the other lines, otherwise the graph would be unreadable; there are hundreds of plots.
And very interesting, thanks for sharing your results! 0.0
Unsloth Qwen3.5-35B-A3B-UD-Q4_K_M (fixed): 20 / 24
Failed 3 out of 4 agent coding tests.
Unsloth Qwen3.5-35B-A3B-UD-Q4_K_XL (fixed): 19 / 24
Failed 3 out of 4 agent coding tests, and an additional math question.
The results are surprising, so I tried two additional seeds - same results. 3 out of 4 failed.
The old weights passed 3 out of 4 agent tests.
Edit: to note, it's not a tool-calling issue. They complete the task; the answer just doesn't fully solve the issue or satisfy the unit-test conditions.
For reference:
GLM-4.7-Flash-PRISM-Q4_K_M: 20 / 24
Devstral-Small-2-24B-Instruct-2512-Q4_K_M: 20 / 24
Qwen3.5-27B-UD-Q4_K_XL: 19 / 24
Qwen3-Coder-30B-A3B-Instruct-Q4_K_M: 17 / 24
Somehow, for me, both perform worse than the old weights and the _XL performs worse than the _M.
For now I'll be going with either the old pseudo-MX4 Unsloth or Ubergarm, depending on how they feel in practice.
I often have results that don't align much with what I see on others' benchmarks. E.g., I'm still looking for a 27B quant / settings that outperform the 35B-A3B model quants, and so far haven't found one. Maybe my results are a fluke, don't read too much into this. The quants may be objectively better but subjectively worse for my particular use.
But real world testing is the only reliable way I know to judge, rather than reproducing Flappy Bird, or solving random meme strawberry / car wash riddles.
Does this fix vision?
I have very good results with IQ2_M and haven't noticed any tool-calling issues, but I don't see this quant in the list of quants anymore. Any reason why it's not available for download and also not in the benchmarks?
You haven't benchmarked IQ4_NL and removed the quant; is there any particular finding on it?
Also, your prompt processing and token generation stats are very odd. Which hardware and backend are you using?
I have a Vega 8 with the Vulkan backend (poor-GPU club). In my past testing, Q8 always had better PP than Q4, and Q4 always had better TG than Q8.
Your tests show that all quants have almost the same TG and PP, and PP is even marginally higher at Q4 than Q8, the reverse of what I measured before.
Great to hear it's working for you. ATM it's still converting, will update y'all once it's updated
Sorry if I am missing something, but is there a comparison of speed for UD-Q4-K-XL and MXFP4_MOE? Should this update affect it?
Previous versions of MXFP4 were slower for me (AMD iGPU with unified memory, llama.cpp Vulkan on Linux). Does it make sense to test again, or should it be about the same?
"It's better to use Q4_K than MXFP4 when choosing between them."
https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-1-some-tensors-are-very-sensitive-to-quantization
Please feel free to correct me if I am wrong here, this is purely based off some diffing I did.
I was curious what the "tool calling" fixes entailed since I couldn't seem to find any details. Posting here in case anyone else is curious.
I pulled the chat template from the old quant and new one, and the difference seems to be how they are validating tool call args.
Old:
{%- if tool_call.arguments is defined %}
{%- for args_name, args_value in tool_call.arguments|items %}
{{- '<parameter=' + args_name + '>\n' }}
{%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
{{- args_value }}
{{- '\n</parameter>\n' }}
{%- endfor %}
{%- endif %}
New:
{%- if tool_call.arguments is mapping %}
{%- for args_name in tool_call.arguments %}
{%- set args_value = tool_call.arguments[args_name] %}
{{- '<parameter=' + args_name + '>\n' }}
{%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
{{- args_value }}
{{- '\n</parameter>\n' }}
{%- endfor %}
{%- endif %}
Effectively there's an extra guard to make sure the tool call arguments are actually a valid dict. If they are, the formatted output is identical.
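For anyone who doesn't read Jinja, the behavioral difference can be sketched in Python (a hypothetical helper, not actual template-engine code): under the old `is defined` check, a tool call whose `arguments` field came through as a raw JSON string would still reach the `|items` filter and break rendering, while the new `is mapping` check skips it cleanly.

```python
import json

def render_args(arguments):
    """Format tool-call arguments as <parameter=...> blocks.
    Non-dict arguments (e.g. a raw JSON string the model emitted)
    are skipped instead of crashing the template."""
    if not isinstance(arguments, dict):   # new guard: `is mapping`
        return ""
    out = []
    for name, value in arguments.items():
        # mirror the template: serialize nested dicts/lists, stringify the rest
        if isinstance(value, (dict, list)):
            value = json.dumps(value)
        out.append(f"<parameter={name}>\n{value}\n</parameter>\n")
    return "".join(out)

print(render_args({"path": "a.txt", "flags": ["-r"]}))
print(repr(render_args('{"path": "a.txt"}')))  # string, not dict -> ''
```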
Hi Folks,
Just sharing some results here — impressive performance and very fast!
Running Qwen3.5-35B-A3B-UD-IQ2_XXS with llama.cpp (server-cuda) locally with claude code for complex and large HTML transformations and structured refactoring tasks.
llama.cpp settings I used:
docker run -d \
  --gpus all \
  --name llama-server \
  --restart unless-stopped \
  -p : \
  -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m "/models/Qwen3.5-35B-A3B-UD-IQ2_XXS.gguf" \
  -c 128000 \
  --batch-size 2048 \
  --host 0.0.0.0 \
  --port \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --chat-template-kwargs '{"enable_thinking": false}' \
  --jinja
Hardware: RTX 4090 class GPU (24GB VRAM).
Very stable so far under large-context editing workloads.
HELL YEAH!!
Thank you this was so helpful
Question about the benchmark evaluation?
right.
New update to the 35B mainly to reduce maximum KLD! https://www.reddit.com/r/LocalLLaMA/comments/1rlkptk/final_qwen35_unsloth_gguf_update/
Somehow, whenever the prompt changes a little bit [switching Roo/Kilo Code modes] on llama.cpp, it forces the model to re-process the full prompt: [54319] slot update_slots: id 0 | task 5379 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
Is this specific to the Qwen3.5 architecture not yet having support in current llama.cpp, or is there some flag I might be missing?