# DeepSeek-V4-Flash-MLX-Q4Q8: Build & Requantization Plan

End-to-end recipe for producing a working `DeepSeek-V4-Flash-MLX-Q4Q8` bundle that vMLX 1.3.97 (and its bundled `jang_tools` / `mlx_lm`) can serve correctly. Two non-obvious bugs in the stock toolchain need fixing or routed experts produce zero/NaN logits and the model emits BOS-token loops at inference.

## Problem summary (why you can't just run the upstream converter)

`jang_tools.dsv4.convert_dsv4_jangtq --profile 4 --format jang` writes the routed expert tensors to disk by **direct-copying the FP4 source** (int8 packed nibbles + UE8M0 scales, no biases) instead of running them through `mx.quantize`. The code path comment claims this is "BIT-EXACT preservation" relying on vMLX's MXFP4 mode, but vMLX's MXFP4 dispatch is broken at 4-bit (gibberish output, mentioned in `build_mlx_q4q8.sh:14`).

When you load the resulting bundle with vMLX you also hit two latent bugs in `jang_tools.load_jangtq`:

1. **`_patch_quant_config_inplace` corrupts a correct config**. It infers per-module quantization overrides from raw safetensors keys (DSV4 source format like `model.layers.N.ffn.experts.E.w1`), and if those don't match the user's existing post-sanitize overrides (`model.layers.N.mlp.switch_mlp.gate_proj`), it overwrites the whole `quantization` dict with disk-keyed entries. After overwrite, `mlx_lm`'s `class_predicate` (`mlx_lm/utils.py:349`) does an exact-key lookup against the model's post-sanitize module paths, finds nothing, and falls through to the top-level `bits=8`. Routed experts get wrapped as 8-bit `QuantizedSwitchLinear`; the on-disk 4-bit packed weights (in/8 columns) silently fail to load into the 8-bit module (in/4 columns) under `strict=False`. Modules retain zero-init weights → zero expert outputs → garbage logits.
2. **The MXFP4 → affine community converter (`mxfp4_to_affine.py`) uses the wrong affine formula**. It encodes as `scale = (max-min)/15, bias = min`.
   MLX's affine formula (Metal kernel `quantized.h:2387`) is:

   ```
   scale = max((w_max - w_min) / 15, eps)
   side  = abs(w_min) > abs(w_max)
   scale = side ? scale : -scale
   edge  = side ? w_min : w_max
   q0    = round(edge / scale)
   scale = (q0 != 0) ? edge / q0 : scale
   bias  = (q0 != 0) ? edge : 0
   ```

   i.e. MLX picks the larger-magnitude endpoint as `bias` and adjusts `scale` so that `edge / scale` is an exact integer (`q0`). With the wrong formula every weight is off by ~5–10% relative; error compounds across 43 transformer layers → activations explode by layer ~20, NaN by layer ~29, sampler tied at token 0 (BOS) → BOS loops in `reasoning_content`.

## Fix overview

| # | Fix | Where | One-time? |
|---|-----|-------|-----------|
| 1 | Patch `load_jangtq.py` so `_patch_quant_config_inplace` skips when the user's config already has post-sanitize (model-path) overrides. | `/Applications/vMLX.app/.../jang_tools/load_jangtq.py` | yes (per vMLX install) |
| 2 | After `convert_dsv4_jangtq`, re-quantize all 33,024 routed expert tensors from the FP4 source using `mx.quantize` (MLX's correct affine formula), and rebuild `model.safetensors.index.json` to include the new `.biases` keys. | `refix_routed_experts.py` (this repo) | every build |

Fix #1 is independent of the bundle and persists across builds (until the vMLX app is reinstalled). Fix #2 is part of the build pipeline and runs whenever the bundle is rebuilt.

## Step-by-step recipe (from scratch)

Prereqs:

- vMLX.app installed at `/Applications/vMLX.app` (1.3.97+)
- `/Volumes/Backup` mounted with ≥100 GB free for the FP4 source
- `~/.cache/huggingface/hub/Deviad` with ≥200 GB free for the output bundle
- `huggingface-cli` authenticated for `deepseek-ai/DeepSeek-V4-Flash` access

Steps (matching `build_mlx_q4q8.sh` subcommands):

1. **`check`**: verify volumes, free space, bundled python, jang_tools.
2. **`patch_loader`**: apply the `_patch_quant_config_inplace` fast-skip guard to `load_jangtq.py`. Idempotent.
3. **`download`**: `hf download deepseek-ai/DeepSeek-V4-Flash` into `/Volumes/Backup/DeepSeek-V4/source` (~50 GB, FP4 routed experts + FP8 attention/shared/embed/lm_head, BF16 norms/router/HC).
4. **`convert`**: `jang_tools.dsv4.convert_dsv4_jangtq --profile 4 --format jang` produces the bundle at `~/.cache/huggingface/hub/Deviad/DeepSeek-V4-Flash-MLX-Q4Q8`. After this, **routed experts are still in MXFP4-direct-copy form** (uint8 E8M0 scales, no biases); attention/shared/embed/lm_head are correctly quantized via `mx.quantize` already.
5. **`requantize`**: run `refix_routed_experts.py`. For each of the 33,024 routed expert tensors, read the FP4 weight + UE8M0 scale from the source bundle, decode through `FP4_LUT[nibbles] * 2^(scale-127)`, re-quantize via `mx.quantize(group_size=32, bits=4, mode="affine")`, and replace `.weight, .scales, .biases` in the destination shards. Rebuild `model.safetensors.index.json` to include the added `.biases` keys. Takes ~30–35 min on M3 Ultra; peak RAM ~15 GB.
6. **`finalize`**: copy tokenizer / encoding files from the source or a reference JANG_2L bundle.
7. **`patch`**: apply EOS / chat-template fixes to `tokenizer_config.json` and `generation_config.json`.
8. **`verify`**: sanity-check the bundle (file presence, EOS, shard count, encoding dir).
9. **`serve`**: launch `vmlx_engine.cli serve`.

The `all` target runs `check → patch_loader → (download if needed) → convert → requantize → finalize → patch → verify`.

## Verifying it works

After `serve`, hit the chat completion endpoint:

```
curl -s http://127.0.0.1:8010/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-v4-flash-mlx-q4q8",
       "messages":[{"role":"user","content":"Write one sentence about the moon."}],
       "max_tokens":120}'
```

Expected: a coherent English sentence with `finish_reason: stop`.
**Bug-state**: 400 BOS tokens (`<|begin▁of▁sentence|>` repeated) with `finish_reason: length`.

A second sanity test that distinguishes correctness from coherence:

```
curl -s http://127.0.0.1:8010/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-v4-flash-mlx-q4q8",
       "messages":[{"role":"user","content":"What is 17+28?"}],
       "max_tokens":80}'
```

Should produce `45` somewhere in `reasoning_content`. (The `_hc_pre` function in `dsv4/mlx_model.py` has an explicit comment about this exact prompt regressing to `"17 plus plus plus"` if any numerical step in the mHC mechanism is even slightly off.)

## Out of scope / future work

- Repacking the bundle into TurboQuant `.tq_packed/.tq_norms/.tq_bits` format would unlock the fused-gate+up Metal kernel (probably 2–3× tok/s). Not required for correctness.
- Upstream the `_patch_quant_config_inplace` fast-skip into `jang_tools` so the `patch_loader` step (recipe step 2) isn't needed.
- Upstream a `--no-fp4-passthrough` flag to `convert_dsv4_jangtq` so the routed experts go through `mx.quantize` directly during conversion, removing the need for the `requantize` step.

## Files

- `build_mlx_q4q8.sh`: orchestrator script. Self-contained: emits `refix_routed_experts.py` from an embedded heredoc each time the `requantize` step runs, with `SRC_DIR` and `OUT_DIR` substituted from the variables at the top of the script.
- `refix_routed_experts.py`: auto-generated. Hand-edits get clobbered on the next `requantize` run; edit the heredoc in `build_mlx_q4q8.sh` instead.
- `requantization-plan.md`: this file.
- `mxfp4_to_affine.py`: **deprecated**; uses the wrong affine formula (kept as historical reference). Do NOT run it on a fresh bundle.
- `fix_crossshard_orphans.py`: **deprecated**; was the second-pass cleanup for the wrong-formula converter. Not needed any more, since `refix_routed_experts.py` handles cross-shard cases natively.
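
For reference, the per-group arithmetic behind the `requantize` step (decode `FP4_LUT[nibbles] * 2^(scale-127)`, then pick affine `scale`/`bias` as in the MLX formula quoted in the problem summary) can be sketched in plain Python. This is an illustration, not the contents of `refix_routed_experts.py`, which calls `mx.quantize` instead. The function names are hypothetical, the `FP4_LUT` values are assumed to be the standard E2M1 code points, the low-nibble-first byte order is an assumption, and Python's `round` (half-to-even) may differ from the kernel's rounding on exact ties.

```python
# Illustrative sketch only; every name except FP4_LUT is hypothetical.

# Assumed E2M1 (FP4) value table: index is the 4-bit code, bit 3 is the sign.
FP4_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
           -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]


def decode_fp4_group(packed, ue8m0_scale):
    """Decode packed FP4 nibbles that share one UE8M0 exponent byte.

    Assumption: the low nibble of each byte holds the earlier element.
    """
    factor = 2.0 ** (ue8m0_scale - 127)
    values = []
    for byte in packed:
        values.append(FP4_LUT[byte & 0x0F] * factor)
        values.append(FP4_LUT[byte >> 4] * factor)
    return values


def affine_params(group, bits=4, eps=1e-7):
    """Follow the quoted formula: bias is the larger-magnitude endpoint."""
    n_levels = (1 << bits) - 1
    w_min, w_max = min(group), max(group)
    scale = max((w_max - w_min) / n_levels, eps)
    side = abs(w_min) > abs(w_max)
    scale = scale if side else -scale
    edge = w_min if side else w_max
    q0 = round(edge / scale)
    if q0 != 0:
        # edge / scale becomes the exact integer q0.
        return edge / q0, edge
    return scale, 0.0


def quantize_group(group, bits=4):
    """Return integer codes plus the (scale, bias) used to produce them."""
    scale, bias = affine_params(group, bits)
    n_levels = (1 << bits) - 1
    codes = [min(max(round((w - bias) / scale), 0), n_levels) for w in group]
    return codes, scale, bias


def dequantize_group(codes, scale, bias):
    """Affine dequantization: w_hat = q * scale + bias."""
    return [q * scale + bias for q in codes]


# Tiny 4-element group for illustration; the real pipeline uses group_size=32.
group = decode_fp4_group(bytes([0x21, 0xF4]), 127)
codes, scale, bias = quantize_group(group)
print(group, codes, scale, bias, dequantize_group(codes, scale, bias))
```

Pure Python keeps the sketch dependency-free; it is far too slow for 33,024 tensors and is only meant for checking a handful of groups by hand against what `mx.quantize` writes.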