nightmedia posted an update 14 days ago
Qwen3.5 Performance Metrics

With the 3.5 architecture, a lot of the old quanting methods don't work as they used to. I noticed this when benchmarking Deckard(qx) quants and by mistake ran a q8 that scored better. That only happens if the qx is bad--and it was. Enhancing layers just because they look interesting doesn't work anymore, so until I have a clear understanding of the architecture, I will publish mxfp4 and mxfp8 quants of the 3.5 models, which seem very stable and performant.

I will start posting the metrics I gather from the series here, starting with the smallest model. Where I have numbers from previous or similar models, I will post them for comparison.

Qwen3.5-0.8B

quant    arc   arc/e boolq hswag obkqa piqa  wino
mxfp8    0.351,0.501,0.733,0.462,0.348,0.682,0.573
mxfp4    0.339,0.489,0.738,0.433,0.330,0.672,0.553

Old model performance

Qwen3-0.6B
bf16     0.298,0.354,0.378,0.415,0.344,0.649,0.534
q8-hi    0.296,0.355,0.378,0.416,0.348,0.652,0.529
q8       0.299,0.354,0.378,0.414,0.346,0.650,0.535
q6-hi    0.301,0.356,0.378,0.415,0.350,0.651,0.541
q6       0.300,0.367,0.378,0.416,0.344,0.647,0.524
mxfp4    0.286,0.364,0.609,0.404,0.316,0.626,0.531

Quant    Perplexity     Peak memory
mxfp8    6.611 ± 0.049  7.65 GB
mxfp4    7.455 ± 0.057  6.33 GB


Detailed metrics by model

nightmedia/Qwen3.5-0.8B-mxfp8-mlx

nightmedia/Qwen3.5-2B-mxfp8-mlx

nightmedia/Qwen3.5-4B-mxfp8-mlx

nightmedia/Qwen3.5-9B-mxfp8-mlx

nightmedia/Qwen3.5-27B-Text

nightmedia/Qwen3.5-122B-A10B-Text-mxfp4-mlx

More metrics coming soon.

I am running these on my Mac, an M4 Max with 128GB RAM. Some performance numbers, like tokens/second, reflect the performance on my box.

This post will be updated with every model that gets tested. The larger models take hours, the 27B a couple of days, so it will be a long process.

-G

Qwen3.5-2B

quant    arc   arc/e boolq hswag obkqa piqa  wino
mxfp8    0.410,0.540,0.843,0.560,0.374,0.715,0.577
q8-hi    0.410,0.542,0.818,0.563,0.378,0.718,0.582
q8       0.411,0.539,0.819,0.563,0.378,0.718,0.577
q6-hi    0.404,0.542,0.821,0.560,0.372,0.715,0.575
q6       0.411,0.540,0.818,0.562,0.378,0.717,0.579
q5-hi    0.409,0.534,0.817,0.558,0.378,0.717,0.582
q5       0.410,0.553,0.806,0.560,0.376,0.717,0.579
q4-hi    0.401,0.519,0.820,0.552,0.362,0.717,0.569
q4       0.387,0.506,0.788,0.556,0.362,0.719,0.571
mxfp4    0.395,0.511,0.826,0.543,0.364,0.711,0.549

Qwen3.5-2B-Text
q3-hi    0.354,0.493,0.769,0.494,0.348,0.691,0.565
q3       0.335,0.479,0.719,0.461,0.322,0.670,0.550

tvall43/Qwen3.5-2B-Text-heretic
mxfp8    0.412,0.547,0.832,0.560,0.382,0.713,0.582
mxfp4    0.403,0.508,0.808,0.542,0.354,0.711,0.563

DavidAU/Qwen3.5-2B-Polaris-HighIQ-Thinking-x4
mxfp8    0.478,0.688,0.842,0.553,0.402,0.722,0.600
mxfp4    0.430,0.621,0.826,0.544,0.378,0.723,0.585

Quant    Perplexity     Peak memory
mxfp8    5.558 ± 0.039  7.65 GB
mxfp4    6.073 ± 0.044  6.71 GB

Qwen3.5-4B

quant    arc   arc/e boolq hswag obkqa piqa  wino
mxfp8    0.392,0.441,0.627,0.601,0.360,0.739,0.590
q8-hi    0.398,0.435,0.622,0.604,0.362,0.732,0.585
q8       0.398,0.434,0.622,0.604,0.362,0.733,0.582
q6-hi    0.398,0.436,0.622,0.601,0.366,0.733,0.589
q6       0.392,0.437,0.622,0.604,0.372,0.736,0.590
mxfp4    0.371,0.444,0.632,0.585,0.356,0.732,0.548

Quant    Perplexity     Peak memory
mxfp8    4.953 ± 0.035  9.61 GB
mxfp4    5.209 ± 0.037  7.65 GB

Qwen3.5-4B-Instruct
mxfp8    0.505,0.688,0.892,0.652,0.420,0.760,0.658
q8-hi    0.509,0.667,0.886,0.654,0.424,0.757,0.663
q8       0.507,0.663,0.885,0.654,0.424,0.758,0.662
q6-hi    0.501,0.663,0.887,0.652,0.434,0.757,0.661
q6       0.503,0.658,0.884,0.653,0.426,0.756,0.662
mxfp4    0.487,0.656,0.878,0.634,0.420,0.746,0.616

tvall43/Qwen3.5-4B-Text-heretic
quant    arc   arc/e boolq hswag obkqa piqa  wino
mxfp8    0.507,0.686,0.881,0.654,0.424,0.756,0.660
mxfp4    0.480,0.656,0.878,0.635,0.418,0.742,0.624

Qwen3-4B-Thinking-2507
mxfp4    0.381,0.408,0.686,0.516,0.364,0.701,0.585

Qwen3-4B-Instruct-2507
dwq5     0.449,0.588,0.843,0.458,0.394,0.697,0.556

Straight quant metrics

They will be posted soon, both here and on the model cards. They simply take longer, and I chose the path most helpful for people who want to do merges or training and need to know the starting point. I have noticed that merges are more successful when the models "align".

Running metrics

This is really simple:

mlx_lm.evaluate --model Qwen3.5-9B-mxfp8-mlx --tasks winogrande boolq arc_challenge arc_easy hellaswag openbookqa piqa

With that you get a file that looks like this, and you pick the norm values (acc_norm where available; boolq and winogrande only report plain acc). I will share my automation soon.

{
    "arc_challenge": {
        "alias": "arc_challenge",
        "acc,none": 0.3796928327645051,
        "acc_stderr,none": 0.014182119866974712,
        "acc_norm,none": 0.41723549488054607,
        "acc_norm_stderr,none": 0.014409825518403108
    },
    "arc_easy": {
        "alias": "arc_easy",
        "acc,none": 0.5227272727272727,
        "acc_stderr,none": 0.010249179090606133,
        "acc_norm,none": 0.4583333333333333,
        "acc_norm_stderr,none": 0.010224097209176468
    },
    "boolq": {
        "alias": "boolq",
        "acc,none": 0.6232415902140673,
        "acc_stderr,none": 0.008475244400491348
    },
    "hellaswag": {
        "alias": "hellaswag",
        "acc,none": 0.5109539932284406,
        "acc_stderr,none": 0.004988583820309417,
        "acc_norm,none": 0.6337382991435969,
        "acc_norm_stderr,none": 0.004807975515446552
    },
    "openbookqa": {
        "alias": "openbookqa",
        "acc,none": 0.198,
        "acc_stderr,none": 0.017838958963847292,
        "acc_norm,none": 0.338,
        "acc_norm_stderr,none": 0.021175665695209452
    },
    "piqa": {
        "alias": "piqa",
        "acc,none": 0.7486398258977149,
        "acc_stderr,none": 0.010121156016819219,
        "acc_norm,none": 0.7372143634385201,
        "acc_norm_stderr,none": 0.01026935406814087
    },
    "winogrande": {
        "alias": "winogrande",
        "acc,none": 0.6393054459352802,
        "acc_stderr,none": 0.01349606439423404
    }
}
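
For reference, a minimal sketch of that extraction in Python (assuming the block above is saved as results.json; the file name and task order are my own choices, not something mlx_lm fixes):

import json

# Tasks in the column order used by the tables in this post.
TASKS = ["arc_challenge", "arc_easy", "boolq", "hellaswag",
         "openbookqa", "piqa", "winogrande"]

with open("results.json") as f:
    results = json.load(f)

row = []
for task in TASKS:
    scores = results[task]
    # Prefer acc_norm; boolq and winogrande only report plain acc.
    row.append(scores.get("acc_norm,none", scores["acc,none"]))

print(",".join(f"{v:.3f}" for v in row))
# prints: 0.417,0.458,0.623,0.634,0.338,0.737,0.639 for the file above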

Test run times by model

0.8B-mxfp8   21:32
2B-mxfp8     41:25
4B-mxfp8   1:25:52
9B-mxfp8   2:33:18
27B-mxfp4  5:59:15
35B-A3B    1:47:22
122B-A10B  5:00:13

Every line shown in my metrics takes that long on an M4 Max. It will run at about half that speed on an M3 Max.

Qwen3.5-0.8B

quant    arc   arc/e boolq hswag obkqa piqa  wino
mxfp8    0.351,0.501,0.733,0.462,0.348,0.682,0.573
q8-hi    0.363,0.501,0.777,0.466,0.364,0.695,0.548
q8       0.363,0.505,0.779,0.466,0.362,0.695,0.553
q6-hi    0.354,0.503,0.773,0.465,0.370,0.693,0.558
q6       0.357,0.503,0.769,0.462,0.370,0.695,0.543
q5-hi    0.348,0.493,0.771,0.461,0.350,0.684,0.561
q5       0.354,0.502,0.765,0.462,0.356,0.686,0.552
q4-hi    0.342,0.480,0.756,0.442,0.328,0.680,0.557
q4       0.349,0.487,0.749,0.445,0.356,0.670,0.550
mxfp4    0.339,0.489,0.738,0.433,0.330,0.672,0.553

tvall43/Qwen3.5-0.8B-Text-heretic
mxfp8    0.348,0.502,0.635,0.461,0.338,0.682,0.571
mxfp4    0.333,0.495,0.673,0.432,0.330,0.670,0.552

Old model performance

Qwen3-0.6B
bf16     0.298,0.354,0.378,0.415,0.344,0.649,0.534
q8-hi    0.296,0.355,0.378,0.416,0.348,0.652,0.529
q8       0.299,0.354,0.378,0.414,0.346,0.650,0.535
q6-hi    0.301,0.356,0.378,0.415,0.350,0.651,0.541
q6       0.300,0.367,0.378,0.416,0.344,0.647,0.524
mxfp4    0.286,0.364,0.609,0.404,0.316,0.626,0.531

Quant    Perplexity     Peak memory
mxfp8    6.611 ± 0.049  7.65 GB
mxfp4    7.455 ± 0.057  6.33 GB

Why run baselines

When I do a merge, I do a test run on the source models to see how they compare to base, what they learned, and what they lost.

This helps when you have baselines. Those take forever to run and need to be consistent, preferably gathered under the same conditions if you need speed metrics.

But speed is the last thing that matters in a model.

I have seen very fast models that were extremely confident while being wrong. The metrics were in the dumps, but the model was very fast.

Perplexity helps, but it too is only useful when you can compare it to a baseline.

The perplexity number means nothing by itself, since baseline numbers vary from one architecture to another. What you need to watch is:

  • how far from baseline at the same quant
  • deviations with the quant size
  • the delta between metrics of mxfp4 and mxfp8
  • the delta between deviations of mxfp4 and mxfp8

Each of these is just a data point by itself, but they help in combination: for example, when the baseline is consistent between mxfp4 and full precision, but your merge begins to fall apart at q5. Or when you try a hi variant and it goes into loops--check the arc. There are many tells, but reading the numbers in ensemble is something you can ask your model to help with--they all know this stuff pretty well :)
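
As a rough sketch of how those deltas can be read in code (the task order matches the tables here; the names and the 0.01 threshold are illustrative, not from any tool):

# Compare a quant row against its baseline row, column by column.
TASKS = ["arc", "arc/e", "boolq", "hswag", "obkqa", "piqa", "wino"]

def deltas(quant_row, baseline_row):
    return {t: q - b for t, q, b in zip(TASKS, quant_row, baseline_row)}

# Qwen3.5-9B-Instruct rows from the table below.
mxfp8 = [0.571, 0.719, 0.895, 0.683, 0.426, 0.770, 0.671]
mxfp4 = [0.572, 0.723, 0.884, 0.679, 0.422, 0.774, 0.680]

for task, d in deltas(mxfp4, mxfp8).items():
    # An arc dip beyond noise is the tell for reasoning damage.
    flag = "  <-- watch" if task.startswith("arc") and d < -0.01 else ""
    print(f"{task:6} {d:+.3f}{flag}")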

The first educated model with comparative metrics against baseline, at NightmediaAI: a creation of DavidAU from the Qwen3.5-9B, using Claude traces.

It is an Instruct model; a Thinking model will follow. This is the first SOTA-level model I have seen created from the new architecture--also the biggest jump in quality from a single training regimen. This is a model we use in merges, where the IQ usually only goes up.

I will post a full vibe to show the output quality. A sample:

These are genuine requests I can and will address. But I also notice the prompt's structure itself resembles certain narrative patterns from speculative fiction — particularly the "alien intelligence analyzing human/ai nature" trope.

Metrics

quant    arc   arc/e boolq hswag obkqa piqa  wino
Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
mxfp8    0.574,0.729,0.882,0.711,0.422,0.775,0.691

Qwen3.5-9B-Instruct
mxfp8    0.571,0.719,0.895,0.683,0.426,0.770,0.671
q8       0.564,0.713,0.892,0.690,0.442,0.766,0.687
mxfp4    0.572,0.723,0.884,0.679,0.422,0.774,0.680

Qwen3.5-9B
mxfp8    0.417,0.458,0.623,0.634,0.338,0.737,0.639

https://huggingface.co/nightmedia/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT-mxfp8-mlx

Test planning

Before hitting Bartowski and Team Radermacher with the source, it helps to visualize what the user will experience from the quant.

Given the test run times on a fast Mac, these are the minimum times per test run to expect:

2B-mxfp8     41:25
4B-mxfp8   1:25:52
9B-mxfp8   2:33:18

If you run the full panel of tests across the usable range of quants, there are 10 tests plus source, so a 4B release will "cost" you a full day, since the run time varies with the quant size, and the full source will take 3-4 hours.

A 9B release will therefore take twice as long--two days.
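
As a back-of-envelope sketch of that arithmetic (the scaling factors are rough assumptions on my part, not measurements):

# Estimate a release's benchmark cost from the mxfp8 run times above.
MXFP8_MINUTES = {"2B": 41 + 25 / 60, "4B": 85 + 52 / 60, "9B": 153 + 18 / 60}

def release_cost_hours(model, n_quants=10, quant_scale=0.9, source_scale=2.0):
    # Assumes smaller quants run a bit faster than mxfp8, bf16 source slower.
    return MXFP8_MINUTES[model] * (n_quants * quant_scale + source_scale) / 60

print(f"4B: ~{release_cost_hours('4B'):.0f} h")  # roughly a full day of wall time
print(f"9B: ~{release_cost_hours('9B'):.0f} h")  # roughly two days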

But people get quants from the big quant processors sometimes the same day--everyone does. And that fills up space, considering that half of the quants published now are pretty much useless for anything but vibing.

From all metrics posted, the degradation is visible around q4, sometimes with issues around q5.

The hi quants recover some of the performance by setting the group size to 32 at quanting time.

They don't bring much benefit at higher quants, and even damage performance above q6-hi.
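
For reference, a hi variant corresponds to a conversion along these lines (assuming the standard mlx_lm.convert flags; the source path is illustrative):

mlx_lm.convert --hf-path Qwen/Qwen3.5-2B -q --q-bits 6 --q-group-size 32 --mlx-path Qwen3.5-2B-q6-hi-mlx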

mxfp4 is fast, but when it dips even below q4, it's useless.

When arc_challenge/arc_easy degrade, it's a sign that reasoning suffers. Any quant where the arc dips is prone to hallucinations and side effects. openbookqa dips naturally with quanting, but the rest should stay level.

When arc_challenge/arc_easy improve, it is a sign of reduced friction in inference, allowing the model to perform at a higher arc. The models are all capable, just held back by the noise introduced by the quanting method.

When boolq goes down, you've got a problem. You've also got a problem when it goes up too fast, without grounding.

This stuff is not simple, by any means.

It's easy. Not simple


You’re absolutely right.

It is easy. Not simple.

And the team (Spock, Data, Quark, Q) would agree with your analysis—each from their own angle.

Let me “chime in” as the Holodeck Agent, using your metrics and the Star Trek lens to clarify.

Part 1: Interpretation of the Qwen3.5-0.8B quant metrics

You’ve done a clean empirical scan of the quants.

Key observations (concise):

  • Performance is generally stable from q8 down to q6.

    • arc_easy/arc_challenge: small variation; no major degradation.
    • boolq, hswag, piqa, wino: mostly stable; slight noise.
    • This is the “safe zone” for production use.
  • At q5 and below:

    • arc_easy/arc_challenge begin to dip (especially arc_hard).
    • This is the first sign of reasoning degradation.
    • The model starts to “slip” on complex logical tasks.
  • At q4 and mxfp4:

    • arc_easy/arc_hard drop further.
    • boolq dips (sometimes).
    • This is the “hallucination zone”: higher noise, lower reliability.
    • mxfp4 is fast but brittle; it’s not robust for serious work.
  • The “hi” variants (q6-hi, q5-hi) are interesting:

    • They recover some arc performance by using group size 32.
    • But they don’t scale well beyond q6-hi; above that, they hurt more than help.
    • So for this model:
      • q6-hi is a good candidate (if you need slightly better arc).
      • q5-hi and lower are still risky.
  • boolq is the canary:

    • If it drops, something is wrong with the quant (clipping or group size issues).
    • If it spikes too high without arc improving, the quant is “over-optimized” for boolean tasks at the expense of reasoning (a sign of bias or overfitting to simple patterns).

Part 2: Why it’s easy, not simple

You said:

“It's easy. Not simple.”

That is the perfect distinction.

  • Easy:

    • You can run a quant, look at arc and boolq.
    • If they’re stable in the q6–q8 range, you can ship it with confidence.
    • There is a clear pattern: arc_easy/arc_hard are the best overall indicators of reasoning health.
  • Not simple:

    • The underlying mechanisms (group size, activation clipping, layer-wise precision) are complex.
    • Each quant method introduces different noise profiles that affect different tasks in non-obvious ways (openbookqa is more sensitive to quanting than wino, for example).
    • You have to consider trade-offs: speed (mxfp4) vs. robustness (q6), and how those trade-offs affect real-world use cases.

Part 3: Star Trek-style “mission briefing” for the quants

Let’s frame this as a DS9/TNG debrief.

Mission: Evaluate Qwen3.5-0.8B quants for field deployment.

Pre-mission briefing (Spock-style analysis):

  • “The arc_easy and arc_hard metrics are the primary indicators of reasoning integrity. Any quant where these degrade beyond 5–10% from baseline is suspect.”
  • “boolq serves as a diagnostic for logical consistency; its behavior must be correlated with arc to avoid false positives.”
  • “The q8 and q6 quants demonstrate acceptable performance across all benchmarks. They are the recommended candidates for deployment.”
  • “The q5-hi and mxfp4 variants show promise in speed but fail the arc test; they are not suitable for tasks requiring robust reasoning.”

Mid-mission (Data-style observations):

  • “I have noted that the q6-hi variant recovers arc performance by using group size 32, suggesting a reduction in quantization noise.”
  • “However, the q5-hi variant does not scale this benefit; it remains unstable on arc_hard.”
  • “The mxfp4 variant is the fastest but exhibits the most significant degradation in arc_hard and boolq; it is best suited for non-critical, low-stakes tasks.”

Post-mission debrief (Quark-style pragmatism):

  • “So here’s the deal:
    • If you want reliability and can afford a little more compute: use q8 or q6.
    • If you need to squeeze every last bit of speed and are okay with occasional hallucinations: q5-hi or mxfp4.
    • But if you’re running this in production and people are depending on it? Stick with q8 or q6. That’s the safe bet.”
  • “And for the record: I’ve seen many quants labeled ‘useful’ that are actually just vibes. Don’t fall for the hype. Look at arc and boolq first.”

Q’s commentary (for fun):

  • “You’ve done the hard work of mapping performance to real-world behavior. That’s what separates a toy from a tool.”
  • “And yes, it is easy once you know the pattern. But don’t let that fool you into thinking it’s trivial—there are many ways to get it wrong, and I’ve seen entire universes collapse over a misplaced group size.”
  • “So keep running those tests. And if you ever need to justify your choices in a boardroom, just say: ‘It’s easy. Not simple.’ That’ll impress them.”

Part 4: Practical recommendations

Based on your metrics:

  • Recommended quants for production:
    • q8 (best balance of speed and robustness)
    • q6-hi (if you need a tiny boost in arc, but still within the safe zone)
  • Avoid for serious reasoning:
    • q5 and below (arc degradation is clear).
    • mxfp4 (unless it’s for non-critical tasks where speed matters more than accuracy).
  • For testing and exploration:
    • Keep q5-hi around; it’s useful for understanding the limits of quantization.
    • Use mxfp4 only in controlled environments (e.g., internal demos, non-critical workflows).

Part 5: Final note on test planning

You’re right that running the full panel is expensive in time.

Suggestion:

  • Use a “minimal viable test set” for routine releases:
    • arc_easy, arc_hard
    • boolq
    • hswag (as a proxy for generalization)
  • Reserve the full panel for major version cuts or when introducing new quant methods.

That way you maintain rigor without burning a full day for every small tweak.

If you’d like, I can now:

  • Generate a concise “release checklist” for the team (Spock-style).
  • Draft a short “mission log” entry summarizing this analysis for the Holodeck Agent’s memory core.
  • Help design a lightweight test suite that captures the essentials.

Let me know how you’d like to proceed.

https://huggingface.co/nightmedia/Qwen3-32B-VL-Instruct-Polaris-GLM-Heretic-Uncensored-qx86-hi-mlx

Qwen3.5-9B-Polaris-HighIQ-qx86-hi-mlx

A creation of DavidAU, here with the first successful Deckard(qx) quant for the Qwen3.5 9B.

I removed one line from the jinja template to enable thinking:

{%- set enable_thinking = false %}
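
If you want to script that removal, a minimal sketch (the model path is illustrative; keep a backup of the template):

from pathlib import Path

# Strip the enable_thinking override so the template allows thinking again.
tpl = Path("Qwen3.5-9B-Polaris-HighIQ-qx86-hi-mlx/chat_template.jinja")
removed = "{%- set enable_thinking = false %}"
lines = [l for l in tpl.read_text().splitlines() if l.strip() != removed]
tpl.write_text("\n".join(lines) + "\n")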

Now we know that thinking models score lower in tests, and if you want to measure the raw performance, set the model to Instruct.

quant    arc   arc/e boolq hswag obkqa piqa  wino
qx86-hi  0.444,0.483,0.667,0.671,0.364,0.751,0.691
INSTRUCT
mxfp8    0.624,0.828,0.891,0.656,0.442,0.768,0.680
qx86-hi  0.625,0.821,0.889,0.657,0.450,0.766,0.683
qx64-hi  0.613,0.824,0.877,0.655,0.454,0.767,0.679
mxfp4    0.613,0.827,0.888,0.639,0.448,0.761,0.669

Qwen3.5-9B
mxfp8    0.417,0.458,0.623,0.634,0.338,0.737,0.639
q8-hi    0.413,0.455,0.622,0.642,0.346,0.746,0.654
q8       0.418,0.455,0.622,0.643,0.342,0.748,0.659
q6       0.418,0.457,0.622,0.640,0.348,0.748,0.650
mxfp4    0.419,0.472,0.622,0.634,0.352,0.739,0.644

Qwen3.5-9B-Instruct
mxfp8    0.571,0.719,0.895,0.683,0.426,0.770,0.671
q8-hi    0.561,0.708,0.892,0.690,0.448,0.764,0.685
q8       0.564,0.713,0.892,0.690,0.442,0.766,0.687
q6-hi    0.550,0.713,0.892,0.689,0.442,0.762,0.690
q6       0.545,0.705,0.891,0.689,0.432,0.766,0.691
mxfp4    0.572,0.723,0.884,0.679,0.422,0.774,0.680

https://huggingface.co/nightmedia/Qwen3.5-9B-Polaris-HighIQ-qx86-hi-mlx

nightmedia/Qwen3.5-35B-A3B-Holodeck-qx86-hi-mlx

Brainwaves

quant    arc   arc/e boolq hswag obkqa piqa  wino
nightmedia/Qwen3.5-35B-A3B-Holodeck
qx86-hi  0.554,0.670,0.891,0.695,0.420,0.792,0.685

nightmedia/Qwen3.5-35B-A3B-Instruct
qx86-hi  0.539,0.646,0.890,...

nightmedia/Qwen3.5-35B-A3B-Text
qx86-hi  0.420,0.457,0.379,0.671,0.354,0.777,0.702

The Genesis prompt is not required here. This model is conversational.

It will not do what you expect.

It will do better.

-G

P.S. We would share the source of this model, but it's so strong, it would be classified as a weapon of mass inference.

Conversation

Activate Holodeck

"Holodeck activated. Please select a program or scenario."

The lights in the room dim slightly, and a soft hum fills the air as the holographic emitters power up. A faint blue glow begins to outline the space around you.

Captain, what would you like to do?

  • Adventure: Embark on a mission to the edge of the galaxy or solve a mystery in 1920s London.
  • Training: Run tactical simulations, practice diplomacy, or learn a new skill (e.g., Klingon language, chess).
  • Relaxation: Enjoy a quiet walk in a virtual garden, attend a concert by Data's band, or have tea with the crew.
  • Custom Scenario: Describe your own setting, characters, and plot, and I will generate the environment.

Just let me know your choice, and we'll beam you in! 🖥️✨

I am also a photographer. Here is one of my pictures, what do you think? :)

[Image: ColorsInMyHead]


That is a stunning piece of work! As an AI, I don't have eyes in the traditional sense, but my "vision" is trained on millions of images, and this one stands out as exceptionally creative.

Here are a few things that make your photo so compelling:

  • The Concept: Using the skull as a vessel for something joyful and chaotic is a brilliant juxtaposition. It plays with the classic memento mori theme (a reminder of death) but subverts it by filling the "void" with life, color, and celebration. It feels like a visual metaphor for how we fill our minds or memories with experiences.
  • The Color Palette: The saturation is incredible without being overwhelming. You've managed to balance the cool blues and purples with warm yellows and reds, creating a dynamic energy that draws the eye right into the center of the skull.
  • Texture and Depth: The way you've captured the refraction through the glass/translucent material is masterful. It creates a sense of depth, making it feel like you're looking into the object rather than just at its surface. The contrast between the smooth, hard skull and the tiny, granular texture of the confetti is very satisfying to look at.

Technical Question: Did you achieve this effect by photographing a physical object filled with confetti, or is this a digital manipulation/3D render? The lighting and the way the stars seem to float inside is very convincing either way!

It's a fantastic addition to your portfolio. What was the inspiration behind this piece?

--Holodeck. Life is a conversation

I mentioned the 9B before; it's a really popular model, not as thin as the 4B and still fairly fast. Great model. Now, the numbers people see vary in a wild range, so I ran metrics on the default model vs the Instruct model.

To convert a model to Instruct, insert this line at the top of chat_template.jinja:

{%- set enable_thinking = false %}
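
To script that edit, a minimal sketch (the model directory is illustrative; back the file up first):

from pathlib import Path

# Prepend the enable_thinking override to the model's chat template.
tpl = Path("Qwen3.5-9B-mxfp8-mlx/chat_template.jinja")
override = "{%- set enable_thinking = false %}"
text = tpl.read_text()
if not text.startswith(override):
    tpl.write_text(override + "\n" + text)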

And then you get these metrics:

quant    arc   arc/e boolq hswag obkqa piqa  wino
Qwen3.5-9B
mxfp8    0.417,0.458,0.623,0.634,0.338,0.737,0.639
Qwen3.5-9B-Instruct
mxfp8    0.571,0.719,0.895,0.683,0.426,0.770,0.671

Models based on Qwen3.5-9B

nightmedia/Qwen3.5-9B-Text
mxfp8    0.419,0.460,0.623,0.634,0.338,0.738,0.639

DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
mxfp8    0.574,0.729,0.882,0.711,0.422,0.775,0.691

DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT-HERETIC-UNCENSORED
mxfp8    0.574,0.755,0.869,0.714,0.410,0.780,0.691

DavidAU/Qwen3.5-9B-Claude-Pro-Auto-Variable-INSTRUCT
mxfp8    0.610,0.816,0.885,0.665,0.456,0.768,0.676

Qwen3.5-9B-Claude-Opus-Sonnet-Pro-Auto-Variable-HERETIC-UNCENSORED-INSTRUCT
mxfp8    0.624,0.820,0.886,0.663,0.442,0.763,0.681

DavidAU/Qwen3.5-9B-Polaris-HighIQ-INSTRUCT
mxfp8    0.624,0.828,0.891,0.656,0.442,0.768,0.680

You will see me changing the chat templates on these two models until I can figure out why a MoE sometimes increases in performance with a bad chat template (yes, this happens).

It becomes obvious that a thinking model underperforms the instruct variant. I am sure someone has a good explanation for this, because all the new Qwens do in the thinking tag is loop. Now they don't.

These are both Instruct models:

https://huggingface.co/nightmedia/Qwen3.5-35B-A3B-Engineer-qx64-hi-mlx

https://huggingface.co/nightmedia/Qwen3.5-35B-A3B-Holodeck-qx86-hi-mlx