Feb 27: GGUF Update + Tool-calling fixes + Benchmarks
Qwen3.5 is now updated with improved tool-calling & coding performance! See the improvements via Claude Code and Codex.
We also benchmarked GGUFs & removed MXFP4 layers from 3 quants.
We will be re-uploading the non-UD ones in a few hours with the chat template fixes, if all goes well.
Analysis + MXFP4 investigation: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
coooooool!
Thanks!
If KLD and PPL are not directly correlated to real-world performance, then will you be posting real-world performance for the quants going forward?
Possibly, but maybe only very little: they're too time-consuming and expensive to run. Look up Benjamin Marie if you want those benchmarks: https://x.com/bnjmn_marie/status/2027043753484021810
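For anyone unfamiliar with the metric being debated: KLD here is the KL divergence between the quantized and full-precision models' next-token distributions, with lower meaning the quant tracks the original model more closely. A minimal sketch of the metric itself (toy numbers, not from the actual benchmarks):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as lists of
    probabilities over the same token vocabulary. Lower means the
    quantized model's output distribution is closer to the original's."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy example: full-precision vs. quantized next-token probabilities
full_precision = [0.70, 0.20, 0.10]
quantized      = [0.65, 0.25, 0.10]
print(round(kl_divergence(full_precision, quantized), 6))
```

The debate above is that a low average KLD over a test corpus does not guarantee the quant behaves identically on the specific tasks you care about.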
Can I suggest a similar test for the 27B model? Considering it is also partially state space, it might be interesting to understand the difference between this sparse MoE hybrid family and its dense counterparts.
Conventional wisdom is, after all, based mainly on pure-transformer dense models, so such a test was well overdue. I think some dedicated insight into the dense counterpart would give us a complete enough picture going forward, beyond the current lack of insight into long-tail and medium-tail factual retention/expression ability and benchmark scores.
I'd really like if there were labels for the other quant lines on that chart, personally. Hard to compare, as is.
If KLD and PPL are not directly correlated to real world performance
Mildly interesting anecdotal experience, to me: I was using the UD-Q4-K-XL quant with the wrong MX4 weights. When the news came out yesterday, I downloaded half a dozen Qwen 3.5 35B quants from all the major players. After putting them through my own "real world performance" evaluation - STEM and agentic coding/tasks based on real work I do - Unsloth's Q4 still came out on top.
It's a small test, I fully admit that. Adding more evals takes a considerable amount of time since these are not multiple choice nor LLM judged but use actual code/unit test/answer regex evaluation. It's 24 evals across a decent range of topics based on real world use - so each hit or miss corresponds directly to something I'd be using it for.
Unsloth Qwen3.5-35B-A3B-UD-Q4_K_XL (MX4 issue): 22 out of 24
Ubergarm Qwen3.5-35B-A3B-Q4_0: 21 out of 24
Bartowski Qwen_Qwen3.5-35B-A3B-Q4_K_M: 19 out of 24
temp = 0.6, top-p = 0.95, top-k = 20, min-p = 0.00, seed = 1.
Maybe with different settings or a different seed I'd have different results, but I found the results interesting so far.
Models that score the best I simply try out in actual use for awhile to see how it feels in practice.
Downloading the fixed weights now, to see how it changes.
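The kind of regex-based answer evaluation described above can be sketched roughly like this (the prompts, patterns, and function names are purely illustrative, not the commenter's actual harness):

```python
import re

# Hypothetical evals: (prompt sent to the model, regex the output must match).
# Real versions would cover STEM and agentic coding tasks, plus unit tests.
EVALS = [
    ("What is 12 * 12? Answer with just the number.", r"\b144\b"),
    ("Name the chemical symbol for iron.", r"\bFe\b"),
]

def score(model_answers):
    """model_answers: list of raw model outputs, one per eval.
    Returns (hits, total) -- each hit corresponds to one real task."""
    hits = 0
    for (_, pattern), answer in zip(EVALS, model_answers):
        if re.search(pattern, answer):
            hits += 1
    return hits, len(EVALS)

hits, total = score(["The answer is 144.", "Iron's symbol is Fe."])
print(f"{hits} out of {total}")  # both toy answers match here
```

The appeal of this style over multiple choice or LLM-judged scoring is that each hit or miss is deterministic and directly tied to a concrete task.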
We unfortunately couldn't label the other lines, otherwise the graph would be unreadable; there are hundreds of plots.
And very interesting, thanks for sharing your results! 0.0
Unsloth Qwen3.5-35B-A3B-UD-Q4_K_M (fixed): 20 / 24
Failed 3 out of 4 agent coding tests.
Unsloth Qwen3.5-35B-A3B-UD-Q4_K_XL (fixed): 19 / 24
Failed 3 out of 4 agent coding tests, and an additional math question.
The results are surprising, so I tried two additional seeds - same results. 3 out of 4 failed.
The old weights passed 3 out of 4 agent tests.
Edit: to note, it's not a tool-calling issue. They complete the task; the answer just doesn't fully solve the issue or satisfy the unit-test conditions.
For reference:
GLM-4.7-Flash-PRISM-Q4_K_M: 20 / 24
Devstral-Small-2-24B-Instruct-2512-Q4_K_M: 20 / 24
Qwen3.5-27B-UD-Q4_K_XL: 19 / 24
Qwen3-Coder-30B-A3B-Instruct-Q4_K_M: 17 / 24
Somehow, for me, both perform worse than the old weights and the _XL performs worse than the _M.
For now I'll be going with either the old pseudo-MX4 Unsloth or Ubergarm, depending on how they feel in practice.
I often have results that don't align much with what I see on others' benchmarks. E.g., I'm still looking for a 27B quant / settings that outperform the 35B-A3B model quants, and so far haven't found one. Maybe my results are a fluke, don't read too much into this. The quants may be objectively better but subjectively worse for my particular use.
But real world testing is the only reliable way I know to judge, rather than reproducing Flappy Bird, or solving random meme strawberry / car wash riddles.
Does this fix vision?
I have very good results with IQ2_M and haven't noticed any tool-calling issues, but I don't see this quant in the list of quants anymore. Any reason why it's not available for download and also not in the benchmarks?
You haven't benchmarked IQ4_NL and removed the quant; is there any particular finding on it?
Also, your prompt processing and token generation stats are very odd. Which hardware and backend are you using?
I have a Vega 8 with the Vulkan backend (poor-GPU club). In my past testing, Q8 always had better PP than Q4, and Q4 always had better TG than Q8.
Your tests show that all quants have almost the same TG and PP, and PP is even marginally higher at Q4 than Q8, the reverse of what I measured before.
Great to hear it's working for you. ATM it's still converting, will update y'all once it's updated
Sorry if I am missing something, but is there a comparison of speed for UD-Q4-K-XL and MXFP4_MOE? Should this update affect it?
Previous versions of MXFP4 were slower for me (AMD iGPU with unified memory, llama.cpp Vulkan on Linux). Does it make sense to test again, or should it be about the same?
"It's better to use Q4_K than MXFP4 when choosing between them."
https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-1-some-tensors-are-very-sensitive-to-quantization
Please feel free to correct me if I am wrong here, this is purely based off some diffing I did.
I was curious what the "tool calling" fixes entailed since I couldn't seem to find any details. Posting here in case anyone else is curious.
I pulled the chat template from the old quant and new one, and the difference seems to be how they are validating tool call args.
Old:
{%- if tool_call.arguments is defined %}
{%- for args_name, args_value in tool_call.arguments|items %}
{{- '<parameter=' + args_name + '>\n' }}
{%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
{{- args_value }}
{{- '\n</parameter>\n' }}
{%- endfor %}
{%- endif %}
New:
{%- if tool_call.arguments is mapping %}
{%- for args_name in tool_call.arguments %}
{%- set args_value = tool_call.arguments[args_name] %}
{{- '<parameter=' + args_name + '>\n' }}
{%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
{{- args_value }}
{{- '\n</parameter>\n' }}
{%- endfor %}
{%- endif %}
Effectively there's an extra guard to make sure the tool call arguments are actually a valid dict. If they are, the formatted output is identical.
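For anyone who doesn't read Jinja, the behavioral difference can be sketched in Python (a hypothetical helper, not actual template-engine code): under the old `is defined` check, a tool call whose `arguments` field came through as a raw JSON string would still reach the `|items` filter and break rendering, while the new `is mapping` check skips it cleanly.

```python
import json

def render_args(arguments):
    """Format tool-call arguments as <parameter=...> blocks.
    Non-dict arguments (e.g. a raw JSON string the model emitted)
    are skipped instead of crashing the template."""
    if not isinstance(arguments, dict):   # new guard: `is mapping`
        return ""
    out = []
    for name, value in arguments.items():
        # mirror the template: serialize nested dicts/lists, stringify the rest
        if isinstance(value, (dict, list)):
            value = json.dumps(value)
        out.append(f"<parameter={name}>\n{value}\n</parameter>\n")
    return "".join(out)

print(render_args({"path": "a.txt", "flags": ["-r"]}))
print(repr(render_args('{"path": "a.txt"}')))  # string, not dict -> ''
```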
Hi Folks,
Just sharing some results here — impressive performance and very fast!
Running Qwen3.5-35B-A3B-UD-IQ2_XXS with llama.cpp (server-cuda) locally with claude code for complex and large HTML transformations and structured refactoring tasks.
llama.cpp settings I used:
docker run -d \
  --gpus all \
  --name llama-server \
  --restart unless-stopped \
  -p : \
  -v /path/to/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m "/models/Qwen3.5-35B-A3B-UD-IQ2_XXS.gguf" \
  -c 128000 \
  --batch-size 2048 \
  --host 0.0.0.0 \
  --port \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --chat-template-kwargs '{"enable_thinking": false}' \
  --jinja
Hardware: RTX 4090 class GPU (24GB VRAM).
Very stable so far under large-context editing workloads.
HELL YEAH!!
Thank you this was so helpful
Question about the benchmark evaluation?
right.
New update to the 35B mainly to reduce maximum KLD! https://www.reddit.com/r/LocalLLaMA/comments/1rlkptk/final_qwen35_unsloth_gguf_update/
Somehow, whenever the prompt changes a little bit [switching Roo/Kilo Code modes] on llama.cpp, it forces the model to re-process the full prompt: [54319] slot update_slots: id 0 | task 5379 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
Is this specific to the Qwen3.5 architecture not yet having support in current llama.cpp, or is there some flag I might be missing?