DeepSeek-V4-Flash-MLX-Q4Q8: Build & Requantization Plan

End-to-end recipe for producing a working DeepSeek-V4-Flash-MLX-Q4Q8 bundle that vMLX 1.3.97 (and its bundled jang_tools / mlx_lm) can serve correctly. Two non-obvious bugs in the stock toolchain must be fixed first; otherwise the routed experts produce zero/NaN logits and the model emits BOS-token loops at inference.

Problem summary (why you can't just run the upstream converter)

jang_tools.dsv4.convert_dsv4_jangtq --profile 4 --format jang writes the routed expert tensors to disk by direct-copying the FP4 source (int8 packed nibbles + UE8M0 scales, no biases) instead of running them through mx.quantize. The code path comment claims this is "BIT-EXACT preservation" relying on vMLX's MXFP4 mode — but vMLX's MXFP4 dispatch is broken at 4-bit (gibberish output, mentioned in build_mlx_q4q8.sh:14).

Loading the resulting bundle with vMLX also exposes two latent bugs, one in jang_tools.load_jangtq and one in the community MXFP4 → affine converter:

  1. _patch_quant_config_inplace corrupts a correct config. It infers per-module quantization overrides from raw safetensors keys (DSV4 source format like model.layers.N.ffn.experts.E.w1), and if those don't match the user's existing post-sanitize overrides (model.layers.N.mlp.switch_mlp.gate_proj), it overwrites the whole quantization dict with disk-keyed entries. After overwrite, mlx_lm's class_predicate (mlx_lm/utils.py:349) does an exact-key lookup against the model's post-sanitize module paths, finds nothing, and falls through to the top-level bits=8. Routed experts get wrapped as 8-bit QuantizedSwitchLinear; the on-disk 4-bit packed weights (in/8 columns) silently fail to load into the 8-bit module (in/4 columns) under strict=False. Modules retain zero-init weights → zero expert outputs → garbage logits.

  2. The MXFP4 → affine community converter (mxfp4_to_affine.py) uses the wrong affine formula. It encodes as scale = (max-min)/15, bias = min. MLX's affine formula (Metal kernel quantized.h:2387) is:

    scale = max((w_max - w_min) / 15, eps)
    side  = abs(w_min) > abs(w_max)
    scale = side ? scale : -scale
    edge  = side ? w_min : w_max
    q0    = round(edge / scale)
    scale = (q0 != 0) ? edge / q0 : scale
    bias  = (q0 != 0) ? edge      : 0
    

    i.e. MLX picks the larger-magnitude endpoint as bias and adjusts scale so that endpoint maps to an exact integer level. With the wrong formula every weight is off by ~5–10% relative; error compounds across 43 transformer layers → activations explode by layer ~20, NaN by layer ~29, sampler tied at token 0 (BOS) → BOS loops in reasoning_content.
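The parameter selection quoted above can be transcribed into NumPy to compare it with the naive min/max encoding for a single group. This is a minimal sketch, not MLX's kernel: the function names are hypothetical, the `eps` value is an assumption, and real use should call mx.quantize rather than this code.

```python
import numpy as np

def naive_affine_params(w, bits=4):
    """scale = (max - min) / 15, bias = min, as in the deprecated converter."""
    n_bins = (1 << bits) - 1
    w_min, w_max = float(np.min(w)), float(np.max(w))
    return (w_max - w_min) / n_bins, w_min

def mlx_style_affine_params(w, bits=4, eps=1e-7):
    """Line-by-line transcription of the formula quoted above (eps is assumed)."""
    n_bins = (1 << bits) - 1
    w_min, w_max = float(np.min(w)), float(np.max(w))
    scale = max((w_max - w_min) / n_bins, eps)
    side = abs(w_min) > abs(w_max)
    scale = scale if side else -scale
    edge = w_min if side else w_max
    q0 = float(np.round(edge / scale))
    if q0 != 0:
        # Snap the scale so the larger-magnitude endpoint sits q0 steps from zero.
        return edge / q0, edge
    return scale, 0.0

def quantize_dequantize(w, scale, bias, bits=4):
    """Round-trip w through integer levels 0..15 using w ~ q * scale + bias."""
    n_bins = (1 << bits) - 1
    q = np.clip(np.round((w - bias) / scale), 0, n_bins)
    return q * scale + bias

# Illustrative group values only; not taken from the model.
w = np.array([-0.6, -0.1, 0.0, 0.3, 1.0])
naive_scale, naive_bias = naive_affine_params(w)
mlx_scale, mlx_bias = mlx_style_affine_params(w)
roundtrip = quantize_dequantize(w, mlx_scale, mlx_bias)
```

For this group the two encodings store different (scale, bias) pairs, so packed integers written under one convention should not be assumed interchangeable with parameters computed under the other. A side effect visible in the transcription: when the group spans zero, zero itself lands on the quantization grid, because bias + (-q0) * scale = 0.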

Fix overview

| # | Fix | Where | One-time? |
|---|-----|-------|-----------|
| 1 | Patch load_jangtq.py so _patch_quant_config_inplace skips when the user's config already has post-sanitize (model-path) overrides. | /Applications/vMLX.app/.../jang_tools/load_jangtq.py | yes (per vMLX install) |
| 2 | After convert_dsv4_jangtq, re-quantize all 33,024 routed expert tensors from the FP4 source using mx.quantize (MLX's correct affine formula), and rebuild model.safetensors.index.json to include the new .biases keys. | refix_routed_experts.py (this repo) | every build |

Fix #1 is independent of the bundle and persists across builds (until the vMLX app is reinstalled). Fix #2 is part of the build pipeline and runs whenever the bundle is rebuilt.
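The failure mode behind fix #1, and the shape of the guard, can be illustrated with plain dictionaries. This is a sketch under stated assumptions, not the code in load_jangtq.py or mlx_lm: the helper names, the ".switch_mlp." marker test, and the example sizes are all hypothetical.

```python
def has_post_sanitize_overrides(quant_cfg):
    """Guard condition: does the config already carry model-path overrides?

    Assumption: post-sanitize routed-expert paths contain '.switch_mlp.'.
    """
    return any(
        isinstance(value, dict) and ".switch_mlp." in key
        for key, value in quant_cfg.items()
    )

def lookup_bits(quant_cfg, module_path):
    """Stand-in for an exact-key lookup that falls through to top-level bits."""
    override = quant_cfg.get(module_path)
    if isinstance(override, dict):
        return override["bits"]
    return quant_cfg["bits"]

def packed_cols(in_features, bits):
    """Columns of a packed weight when values are stored in 32-bit words."""
    return in_features * bits // 32

module = "model.layers.3.mlp.switch_mlp.gate_proj"

# A correct config: the override is keyed by the post-sanitize module path.
good_cfg = {"group_size": 32, "bits": 8, module: {"group_size": 32, "bits": 4}}

# After the overwrite: the override is keyed by the raw on-disk tensor name.
clobbered_cfg = {
    "group_size": 32,
    "bits": 8,
    "model.layers.3.ffn.experts.0.w1": {"group_size": 32, "bits": 4},
}

good_bits = lookup_bits(good_cfg, module)            # override found
clobbered_bits = lookup_bits(clobbered_cfg, module)  # falls through to default
```

With the clobbered config the lookup returns the top-level 8 bits, so the module expects in/4 packed columns while the file holds in/8, which is the shape mismatch described above. A guard of this kind would return early from _patch_quant_config_inplace whenever `has_post_sanitize_overrides` is true.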

Step-by-step recipe (from scratch)

Prereqs:

  • vMLX.app installed at /Applications/vMLX.app (1.3.97+)
  • /Volumes/Backup mounted with ≥100 GB free for the FP4 source
  • ~/.cache/huggingface/hub/Deviad with ≥200 GB free for the output bundle
  • huggingface-cli authenticated for deepseek-ai/DeepSeek-V4-Flash access
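The two disk-space prerequisites can be checked up front with the standard library. This is a rough sketch of what a preflight check might look like, not the script's actual check subcommand; the helper names are hypothetical and "GB" is assumed to mean decimal gigabytes.

```python
import os
import shutil

GB = 10**9  # assumption: decimal gigabytes

# (path, minimum free GB) pairs taken from the prerequisite list above.
PREREQS = [
    ("/Volumes/Backup", 100),
    ("~/.cache/huggingface/hub/Deviad", 200),
]

def free_gb(path):
    """Free space in GB at path, or None if the path cannot be inspected."""
    try:
        return shutil.disk_usage(os.path.expanduser(path)).free / GB
    except OSError:
        return None

def preflight(prereqs=PREREQS):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    for path, need in prereqs:
        have = free_gb(path)
        if have is None:
            problems.append(f"{path}: not found (volume not mounted?)")
        elif have < need:
            problems.append(f"{path}: {have:.0f} GB free, need {need} GB")
    return problems
```

Calling `preflight()` before a build and aborting on a non-empty result avoids discovering a full disk partway through the download or conversion.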

Steps (matching build_mlx_q4q8.sh subcommands):

  1. check — verify volumes, free space, bundled python, jang_tools.
  2. patch_loader — apply the _patch_quant_config_inplace fast-skip guard to load_jangtq.py. Idempotent.
  3. download — hf download deepseek-ai/DeepSeek-V4-Flash into /Volumes/Backup/DeepSeek-V4/source (~50 GB, FP4 routed experts + FP8 attention/shared/embed/lm_head, BF16 norms/router/HC).
  4. convert — jang_tools.dsv4.convert_dsv4_jangtq --profile 4 --format jang produces the bundle at ~/.cache/huggingface/hub/Deviad/DeepSeek-V4-Flash-MLX-Q4Q8. After this, routed experts are still in MXFP4-direct-copy form (uint8 E8M0 scales, no biases); attention/shared/embed/lm_head are correctly quantized via mx.quantize already.
  5. requantize — run refix_routed_experts.py. For each of the 33,024 routed expert tensors, read the FP4 weight + UE8M0 scale from the source bundle, decode through FP4_LUT[nibbles] * 2^(scale-127), re-quantize via mx.quantize(group_size=32, bits=4, mode="affine"), and replace .weight, .scales, .biases in the destination shards. Rebuild model.safetensors.index.json to include the added .biases keys. Takes ~30–35 min on M3 Ultra; peak RAM ~15 GB.
  6. finalize — copy tokenizer / encoding files from the source or a reference JANG_2L bundle.
  7. patch — apply EOS / chat-template fixes to tokenizer_config.json and generation_config.json.
  8. verify — sanity-check the bundle (file presence, EOS, shard count, encoding dir).
  9. serve — launch vmlx_engine.cli serve.
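The decode in step 5, FP4_LUT[nibbles] * 2^(scale-127), can be sketched in NumPy as follows. This is an illustration, not the code in refix_routed_experts.py: the table is the standard E2M1 value set, while the low-nibble-first packing order and the array layout are assumptions that should be checked against the source checkpoint.

```python
import numpy as np

# E2M1 values indexed by 4-bit code; the high bit of the code is the sign.
FP4_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def decode_fp4(packed, scales, group_size=32):
    """Decode packed FP4 weights to float32.

    packed: int8/uint8 array [rows, cols / 2], two 4-bit codes per byte
            (assumed order: low nibble first).
    scales: uint8 array [rows, cols / group_size] of UE8M0 exponents.
    """
    packed = packed.view(np.uint8)          # reinterpret int8 storage as raw bytes
    low = packed & 0x0F
    high = packed >> 4
    nibbles = np.stack([low, high], axis=-1).reshape(packed.shape[0], -1)
    values = FP4_LUT[nibbles]
    exponent = scales.astype(np.int32) - 127
    group_scale = np.exp2(exponent.astype(np.float32))
    return values * np.repeat(group_scale, group_size, axis=-1)

# Tiny synthetic example (group_size=4 for brevity; the recipe uses 32).
packed = np.array([[0x21, 0xF7]], dtype=np.uint8)   # codes 1, 2, 7, 15
scales = np.array([[128]], dtype=np.uint8)          # 2^(128-127) = 2
decoded = decode_fp4(packed, scales, group_size=4)
```

In the recipe, the decoded float tensor is then passed to mx.quantize(group_size=32, bits=4, mode="affine"), which yields the .weight, .scales and .biases arrays written back to the destination shards.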

The all target runs check → patch_loader → (download if needed) → convert → requantize → finalize → patch → verify.
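The index rebuild at the end of the requantize step can be done by reading tensor names from each shard header rather than editing the old index by hand. The sketch below relies only on the public safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function names are hypothetical, and it is not the code embedded in build_mlx_q4q8.sh.

```python
import json
import struct
from pathlib import Path

def read_safetensors_keys(path):
    """Return the tensor names stored in one .safetensors shard."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return [name for name in header if name != "__metadata__"]

def rebuild_index(bundle_dir):
    """Regenerate model.safetensors.index.json from the shards on disk."""
    bundle_dir = Path(bundle_dir)
    index_path = bundle_dir / "model.safetensors.index.json"
    old = json.loads(index_path.read_text()) if index_path.exists() else {}

    weight_map = {}
    for shard in sorted(bundle_dir.glob("*.safetensors")):
        for name in read_safetensors_keys(shard):
            weight_map[name] = shard.name

    # Existing metadata is carried over unchanged; any size field in it
    # may be stale after requantization.
    index = {"metadata": old.get("metadata", {}), "weight_map": weight_map}
    index_path.write_text(json.dumps(index, indent=2))
    return index
```

Because the map is derived from what each shard actually contains, newly added .biases keys are picked up wherever they were written, including when a tensor's parts end up in different shards.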

Verifying it works

After serve, hit the chat completion endpoint:

    curl -s http://127.0.0.1:8010/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model":"deepseek-v4-flash-mlx-q4q8",
           "messages":[{"role":"user","content":"Write one sentence about the moon."}],
           "max_tokens":120}'

Expected: a coherent English sentence with finish_reason: stop. Bug-state: the completion is nothing but BOS tokens (<|begin▁of▁sentence|> repeated) up to the max_tokens limit, with finish_reason: length.

A second sanity test that distinguishes correctness from coherence:

    curl -s http://127.0.0.1:8010/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model":"deepseek-v4-flash-mlx-q4q8",
           "messages":[{"role":"user","content":"What is 17+28?"}],
           "max_tokens":80}'

Should produce 45 somewhere in reasoning_content. (The _hc_pre function in dsv4/mlx_model.py has an explicit comment about this exact prompt regressing to "17 plus plus plus" if any numerical step in the mHC mechanism is even slightly off.)
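Both checks can be automated with a small classifier over the JSON response. This is a sketch that assumes an OpenAI-style response shape (choices[0].message with content and reasoning_content, plus finish_reason); the function names are hypothetical, and `query` is defined but never called here because it needs the server from the serve step to be running.

```python
import json
import urllib.request

BOS = "<|begin▁of▁sentence|>"
URL = "http://127.0.0.1:8010/v1/chat/completions"

def query(prompt, max_tokens, model="deepseek-v4-flash-mlx-q4q8"):
    """POST one chat request and return the parsed JSON response."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def response_text(resp):
    """Concatenate content and reasoning_content of the first choice."""
    msg = resp["choices"][0]["message"]
    return (msg.get("content") or "") + (msg.get("reasoning_content") or "")

def classify(resp):
    """Return 'bos_loop', 'ok', or 'inconclusive' for one response."""
    text = response_text(resp)
    if BOS in text and text.replace(BOS, "").strip() == "":
        return "bos_loop"
    if resp["choices"][0].get("finish_reason") == "stop":
        return "ok"
    return "inconclusive"

def mentions_sum(resp, expected="45"):
    """Second sanity test: the expected sum appears somewhere in the output."""
    return expected in response_text(resp)
```

A typical use would be `classify(query("Write one sentence about the moon.", 120))` followed by `mentions_sum(query("What is 17+28?", 80))`; a "bos_loop" result points back at the routed-expert quantization rather than at sampling settings.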

Out of scope / future work

  • Repacking the bundle into TurboQuant .tq_packed/.tq_norms/.tq_bits format would unlock the fused-gate+up Metal kernel (probably 2–3× tok/s). Not required for correctness.
  • Upstream the _patch_quant_config_inplace fast-skip into jang_tools so the patch_loader step (recipe step 2) isn't needed.
  • Upstream a --no-fp4-passthrough flag to convert_dsv4_jangtq so the routed experts go through mx.quantize directly during conversion, removing the need for the requantize step.

Files

  • build_mlx_q4q8.sh — orchestrator script. Self-contained: emits refix_routed_experts.py from an embedded heredoc each time the requantize step runs, with SRC_DIR and OUT_DIR substituted from the variables at the top of the script.
  • refix_routed_experts.py — auto-generated. Hand-edits get clobbered on the next requantize run; edit the heredoc in build_mlx_q4q8.sh instead.
  • requantization-plan.md — this file.
  • mxfp4_to_affine.py — deprecated; uses the wrong affine formula (kept as historical reference). Do NOT run it on a fresh bundle.
  • fix_crossshard_orphans.py — deprecated; was the second-pass cleanup for the wrong-formula converter. No longer needed: refix_routed_experts.py handles cross-shard cases natively.