# DeepSeek-V4-Flash-MLX-Q4Q8: Build & Requantization Plan

End-to-end recipe for producing a working `DeepSeek-V4-Flash-MLX-Q4Q8`
bundle that vMLX 1.3.97 (and its bundled `jang_tools` / `mlx_lm`) can
serve correctly. Two non-obvious bugs in the stock toolchain must be
fixed first; otherwise routed experts produce zero/NaN logits and the
model emits BOS-token loops at inference.

## Problem summary (why you can't just run the upstream converter)

`jang_tools.dsv4.convert_dsv4_jangtq --profile 4 --format jang` writes
the routed expert tensors to disk by **direct-copying the FP4 source**
(int8 packed nibbles + UE8M0 scales, no biases) instead of running them
through `mx.quantize`. The code path comment claims this is "BIT-EXACT
preservation" relying on vMLX's MXFP4 mode — but vMLX's MXFP4 dispatch
is broken at 4-bit (gibberish output, mentioned in
`build_mlx_q4q8.sh:14`).

Getting from that bundle to one vMLX serves correctly then runs into two
latent bugs, one in `jang_tools.load_jangtq` and one in the community
MXFP4 → affine converter:

1. **`_patch_quant_config_inplace` corrupts a correct config**.
   It infers per-module quantization overrides from raw safetensors keys
   (DSV4 source format like `model.layers.N.ffn.experts.E.w1`), and if
   those don't match the user's existing post-sanitize overrides
   (`model.layers.N.mlp.switch_mlp.gate_proj`), it overwrites the whole
   `quantization` dict with disk-keyed entries. After overwrite,
   `mlx_lm`'s `class_predicate` (`mlx_lm/utils.py:349`) does an exact-key
   lookup against the model's post-sanitize module paths, finds nothing,
   and falls through to the top-level `bits=8`. Routed experts get
   wrapped as 8-bit `QuantizedSwitchLinear`; the on-disk 4-bit packed
   weights (in/8 columns) silently fail to load into the 8-bit module
   (in/4 columns) under `strict=False`. Modules retain zero-init
   weights → zero expert outputs → garbage logits.

2. **The MXFP4 → affine community converter
   (`mxfp4_to_affine.py`) uses the wrong affine formula**.
   It encodes as `scale = (max-min)/15, bias = min`. MLX's affine
   formula (Metal kernel `quantized.h:2387`) is:
   ```
   scale = max((w_max - w_min) / 15, eps)
   side  = abs(w_min) > abs(w_max)
   scale = side ? scale : -scale
   edge  = side ? w_min : w_max
   q0    = round(edge / scale)
   scale = (q0 != 0) ? edge / q0 : scale
   bias  = (q0 != 0) ? edge      : 0
   ```
   i.e. MLX takes the larger-magnitude endpoint as `bias` and, unless
   `q0` is zero, adjusts `scale` so that this endpoint is an exact
   integer multiple of `scale`; a negative `scale` is used when the
   endpoint is `w_max`. With the wrong formula every weight is off by
   ~5–10% relative, and the error compounds across 43 transformer
   layers: activations explode by layer ~20, turn NaN by layer ~29, and
   the sampler ties at token 0 (BOS), which produces the BOS loops in
   `reasoning_content`.
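
To make the difference between the two encodings concrete, here is a minimal pure-Python sketch of both for a single group. The function names are hypothetical, `eps` is an assumed value, and Python's `round` breaks ties differently from the Metal kernel, so treat this as an illustration of the steps listed above rather than a bit-exact port.

```python
def naive_affine(group, bits=4):
    """Min/max encoding: scale = (max - min) / levels, bias = min."""
    levels = (1 << bits) - 1
    w_min, w_max = min(group), max(group)
    return (w_max - w_min) / levels, w_min


def mlx_style_affine(group, bits=4, eps=1e-7):
    """Transcription of the kernel steps quoted above (eps is an assumption)."""
    levels = (1 << bits) - 1
    w_min, w_max = min(group), max(group)
    scale = max((w_max - w_min) / levels, eps)
    side = abs(w_min) > abs(w_max)
    scale = scale if side else -scale      # negative scale when edge is w_max
    edge = w_min if side else w_max
    q0 = round(edge / scale)               # note: Python rounds ties to even
    if q0 != 0:
        return edge / q0, edge             # edge becomes an exact multiple of scale
    return scale, 0.0


def roundtrip(group, scale, bias, bits=4):
    """Quantize to integer levels, then dequantize as q * scale + bias."""
    levels = (1 << bits) - 1
    out = []
    for w in group:
        q = min(max(round((w - bias) / scale), 0), levels)
        out.append(q * scale + bias)
    return out


group = [-1.0, 0.0, 0.5, 2.0]
print("naive:", naive_affine(group))
print("mlx  :", mlx_style_affine(group))
print("mlx roundtrip:", roundtrip(group, *mlx_style_affine(group)))
```

For this group the MLX-style encoding chooses `w_max` as the bias with a negative scale, whereas the naive encoding stores `w_min` with a positive scale, so the two `(scale, bias)` pairs differ even though the input is the same.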

## Fix overview

| # | Fix                                                                                  | Where                                         | One-time? |
|---|--------------------------------------------------------------------------------------|-----------------------------------------------|-----------|
| 1 | Patch `load_jangtq.py` so `_patch_quant_config_inplace` skips when the user's config already has post-sanitize (model-path) overrides. | `/Applications/vMLX.app/.../jang_tools/load_jangtq.py` | yes (per vMLX install) |
| 2 | After `convert_dsv4_jangtq`, re-quantize all 33,024 routed expert tensors from the FP4 source using `mx.quantize` (MLX's correct affine formula), and rebuild `model.safetensors.index.json` to include the new `.biases` keys. | `refix_routed_experts.py` (this repo) | every build |

Fix #1 is independent of the bundle and persists across builds (until
the vMLX app is reinstalled). Fix #2 is part of the build pipeline
and runs whenever the bundle is rebuilt.
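
The idea behind fix #1 can be sketched in a few lines of Python. This is not the actual patch to `load_jangtq.py`: the function names and the regular expression are hypothetical, and the key shape is inferred only from the `model.layers.N.mlp.switch_mlp.gate_proj` example above. The sketch also shows why a disk-keyed config falls through to the top-level `bits=8` and why the packed widths then disagree.

```python
import re

# Assumed shape of a post-sanitize (model-path) override key.
SANITIZED_KEY = re.compile(r"^model\.layers\.\d+\.mlp\.switch_mlp\.\w+$")


def has_model_path_overrides(quantization: dict) -> bool:
    """True when at least one per-module override is keyed by a model path."""
    return any(
        isinstance(value, dict) and SANITIZED_KEY.match(key)
        for key, value in quantization.items()
    )


def resolve_bits(quantization: dict, module_path: str) -> int:
    """Exact-key lookup, falling through to the top-level default."""
    override = quantization.get(module_path)
    if isinstance(override, dict):
        return override["bits"]
    return quantization["bits"]


def packed_columns(in_features: int, bits: int) -> int:
    """Columns of a uint32-packed weight: in/8 at 4-bit, in/4 at 8-bit."""
    return in_features * bits // 32


path = "model.layers.0.mlp.switch_mlp.gate_proj"
good = {"group_size": 32, "bits": 8, path: {"group_size": 32, "bits": 4}}
clobbered = {
    "group_size": 32,
    "bits": 8,
    "model.layers.0.ffn.experts.0.w1": {"group_size": 32, "bits": 4},
}

for name, cfg in (("good", good), ("clobbered", clobbered)):
    bits = resolve_bits(cfg, path)
    print(name, has_model_path_overrides(cfg), bits, packed_columns(4096, bits))
```

A guard of this kind would return early when `has_model_path_overrides` is true, leaving the user's config untouched.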

## Step-by-step recipe (from scratch)

Prereqs:
- vMLX.app installed at `/Applications/vMLX.app` (1.3.97+)
- `/Volumes/Backup` mounted with ≥100 GB free for the FP4 source
- `~/.cache/huggingface/hub/Deviad` with ≥200 GB free for the output bundle
- `huggingface-cli` authenticated for `deepseek-ai/DeepSeek-V4-Flash` access
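
The free-space prerequisites can be checked up front with the standard library. This is a sketch only: the function names and the `REQUIREMENTS` mapping are hypothetical, and the real `check` subcommand in `build_mlx_q4q8.sh` may verify these differently.

```python
import os
import shutil

GB = 10**9

# Thresholds from the prerequisites list above (decimal GB is an assumption).
REQUIREMENTS = {
    "/Volumes/Backup": 100,
    "~/.cache/huggingface/hub/Deviad": 200,
}


def free_gb(path: str) -> float:
    """Free space in GB on the filesystem that holds `path`."""
    return shutil.disk_usage(os.path.expanduser(path)).free / GB


def check_free_space(requirements: dict) -> list:
    """Return a list of problems; an empty list means every path passes."""
    problems = []
    for path, needed_gb in requirements.items():
        try:
            available = free_gb(path)
        except OSError:
            problems.append(f"{path}: not mounted or missing")
            continue
        if available < needed_gb:
            problems.append(
                f"{path}: {available:.0f} GB free, need {needed_gb} GB"
            )
    return problems
```

Calling `check_free_space(REQUIREMENTS)` before `download` gives a list of human-readable problems to print; the output directory has to exist first, otherwise it is reported as missing.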

Steps (matching `build_mlx_q4q8.sh` subcommands):

1. **`check`** — verify volumes, free space, bundled python, jang_tools.
2. **`patch_loader`** — apply the `_patch_quant_config_inplace` fast-skip
   guard to `load_jangtq.py`. Idempotent.
3. **`download`** — `hf download deepseek-ai/DeepSeek-V4-Flash` into
   `/Volumes/Backup/DeepSeek-V4/source` (~50 GB, FP4 routed experts +
   FP8 attention/shared/embed/lm_head, BF16 norms/router/HC).
4. **`convert`** — `jang_tools.dsv4.convert_dsv4_jangtq --profile 4
   --format jang` produces the bundle at
   `~/.cache/huggingface/hub/Deviad/DeepSeek-V4-Flash-MLX-Q4Q8`. After
   this, **routed experts are still in MXFP4-direct-copy form** (uint8
   UE8M0 scales, no biases); attention/shared/embed/lm_head are already
   correctly quantized via `mx.quantize`.
5. **`requantize`** — run `refix_routed_experts.py`. For each of the
   33,024 routed expert tensors, read the FP4 weight + UE8M0 scale from
   the source bundle, decode through `FP4_LUT[nibbles] * 2^(scale-127)`,
   re-quantize via `mx.quantize(group_size=32, bits=4, mode="affine")`,
   and replace `.weight, .scales, .biases` in the destination shards.
   Rebuild `model.safetensors.index.json` to include the added `.biases`
   keys. Takes ~30–35 min on M3 Ultra; peak RAM ~15 GB.
6. **`finalize`** — copy tokenizer / encoding files from the source or a
   reference JANG_2L bundle.
7. **`patch`** — apply EOS / chat-template fixes to
   `tokenizer_config.json` and `generation_config.json`.
8. **`verify`** — sanity-check the bundle (file presence, EOS, shard
   count, encoding dir).
9. **`serve`** — launch `vmlx_engine.cli serve`.
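
The decode in step 5, `FP4_LUT[nibbles] * 2^(scale-127)`, can be illustrated in pure Python for one group. The table below is the standard E2M1 value set and is assumed to match the `FP4_LUT` in `refix_routed_experts.py`; the low-nibble-first order and the function name are also assumptions.

```python
# E2M1 magnitudes for codes 0..7; codes 8..15 are the same values negated.
FP4_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_LUT += [-v for v in FP4_LUT]


def decode_fp4_group(packed: bytes, scale_byte: int) -> list:
    """Decode packed FP4 nibbles that share one UE8M0 scale byte."""
    factor = 2.0 ** (scale_byte - 127)   # UE8M0: pure power of two, bias 127
    out = []
    for byte in packed:
        out.append(FP4_LUT[byte & 0x0F] * factor)  # low nibble first (assumed)
        out.append(FP4_LUT[byte >> 4] * factor)    # then the high nibble
    return out


# 0x21 holds codes 1 and 2; scale byte 127 means a factor of 2**0.
print(decode_fp4_group(bytes([0x21]), 127))
```

If the source tensor is stored as signed int8, each element needs masking with `& 0xFF` before it is passed in as a byte. The decoded floats are what then go through `mx.quantize`.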

The `all` target runs `check → patch_loader → (download if needed) →
convert → requantize → finalize → patch → verify`.
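
Step 5 also rebuilds `model.safetensors.index.json`. A minimal sketch of the index side of that, assuming the usual `weight_map` layout (tensor name to shard file) and assuming each new `.biases` tensor is written to the same shard as its `.scales`; the function name and the predicate argument are hypothetical.

```python
def add_bias_entries(index: dict, is_routed_expert) -> int:
    """Add a `.biases` entry next to every routed-expert `.scales` entry.

    `index` is the parsed index JSON; `is_routed_expert` is a caller-supplied
    predicate on the module path. Returns the number of entries added.
    """
    weight_map = index["weight_map"]
    added = 0
    for name in list(weight_map):          # copy keys: the map is mutated below
        if not name.endswith(".scales"):
            continue
        module = name[: -len(".scales")]
        if not is_routed_expert(module):
            continue
        bias_key = module + ".biases"
        if bias_key not in weight_map:
            weight_map[bias_key] = weight_map[name]  # same shard as the scales
            added += 1
    return added


index = {
    "metadata": {},
    "weight_map": {
        "model.layers.0.mlp.switch_mlp.gate_proj.weight": "a.safetensors",
        "model.layers.0.mlp.switch_mlp.gate_proj.scales": "b.safetensors",
        "model.embed_tokens.scales": "a.safetensors",
    },
}
print(add_bias_entries(index, lambda m: ".switch_mlp." in m))
```

The sketch leaves `metadata` alone; a real rebuild would presumably also refresh any size totals recorded there once the shards have been rewritten.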

## Verifying it works

After `serve`, hit the chat completion endpoint:

```
curl -s http://127.0.0.1:8010/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-v4-flash-mlx-q4q8",
       "messages":[{"role":"user","content":"Write one sentence about the moon."}],
       "max_tokens":120}'
```

Expected: a coherent English sentence with `finish_reason: stop`.
**Bug state**: nothing but repeated BOS tokens
(`<｜begin▁of▁sentence｜>`), up to the `max_tokens` limit, with
`finish_reason: length`.

A second sanity test that distinguishes correctness from coherence:

```
curl -s http://127.0.0.1:8010/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-v4-flash-mlx-q4q8",
       "messages":[{"role":"user","content":"What is 17+28?"}],
       "max_tokens":80}'
```

Should produce `45` somewhere in `reasoning_content`. (The `_hc_pre`
function in `dsv4/mlx_model.py` has an explicit comment about this exact
prompt regressing to `"17 plus plus plus"` if any numerical step in the
mHC mechanism is even slightly off.)
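
Both checks can be scripted once the JSON response has been parsed. The sketch below assumes the OpenAI-style response layout (`choices[0].message` with `content` and `reasoning_content`, plus `finish_reason`); the function name and the returned labels are hypothetical.

```python
BOS = "<｜begin▁of▁sentence｜>"


def classify_response(response: dict, expected: str = "") -> str:
    """Label a chat-completion response as ok, bos_loop, missing, or truncated."""
    choice = response["choices"][0]
    message = choice.get("message", {})
    text = (message.get("reasoning_content") or "") + (message.get("content") or "")

    # Bug state: the output is non-empty but contains nothing except BOS tokens.
    if text and not text.replace(BOS, "").strip():
        return "bos_loop"
    if expected:
        return "ok" if expected in text else "missing"
    return "ok" if choice.get("finish_reason") == "stop" else "truncated"


broken = {"choices": [{"finish_reason": "length", "message": {"content": BOS * 5}}]}
moon = {"choices": [{"finish_reason": "stop",
                     "message": {"content": "The moon orbits Earth."}}]}
math = {"choices": [{"finish_reason": "length",
                     "message": {"reasoning_content": "17 + 28 = 45", "content": None}}]}

print(classify_response(broken), classify_response(moon), classify_response(math, "45"))
```

When `expected` is given, `finish_reason` is ignored, because the arithmetic prompt only needs `45` to appear somewhere in the output.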

## Out of scope / future work

- Repacking the bundle into TurboQuant `.tq_packed/.tq_norms/.tq_bits`
  format would unlock the fused-gate+up Metal kernel (an estimated 2–3×
  gain in tok/s). Not required for correctness.
- Upstream the `_patch_quant_config_inplace` fast-skip into `jang_tools`
  so the `patch_loader` step (step 2 of the recipe) isn't needed.
- Upstream a `--no-fp4-passthrough` flag to `convert_dsv4_jangtq` so the
  routed experts go through `mx.quantize` directly during conversion,
  removing the need for the requantize step.

## Files

- `build_mlx_q4q8.sh` — orchestrator script. Self-contained: emits
  `refix_routed_experts.py` from an embedded heredoc each time the
  `requantize` step runs, with `SRC_DIR` and `OUT_DIR` substituted from
  the variables at the top of the script.
- `refix_routed_experts.py` — auto-generated. Hand-edits get clobbered
  on the next `requantize` run; edit the heredoc in `build_mlx_q4q8.sh`
  instead.
- `requantization-plan.md` — this file.
- `mxfp4_to_affine.py` — **deprecated**; uses the wrong affine formula
  (kept as historical reference). Do NOT run it on a fresh bundle.
- `fix_crossshard_orphans.py` — **deprecated**; was the second-pass
  cleanup for the wrong-formula converter. Not needed any more —
  `refix_routed_experts.py` handles cross-shard cases natively.