DeepSeek-V4-Flash-MLX-Q4Q8: Build & Requantization Plan
End-to-end recipe for producing a working DeepSeek-V4-Flash-MLX-Q4Q8
bundle that vMLX 1.3.97 (and its bundled jang_tools / mlx_lm) can
serve correctly. Two non-obvious bugs in the stock toolchain need fixing
or routed experts produce zero/NaN logits and the model emits BOS-token
loops at inference.
Problem summary (why you can't just run the upstream converter)
jang_tools.dsv4.convert_dsv4_jangtq --profile 4 --format jang writes
the routed expert tensors to disk by direct-copying the FP4 source
(int8 packed nibbles + UE8M0 scales, no biases) instead of running them
through mx.quantize. The code path comment claims this is "BIT-EXACT
preservation" relying on vMLX's MXFP4 mode — but vMLX's MXFP4 dispatch
is broken at 4-bit (gibberish output, mentioned in
build_mlx_q4q8.sh:14).
When you load the resulting bundle with vMLX you also hit two latent
bugs in jang_tools.load_jangtq:
1. _patch_quant_config_inplace corrupts a correct config. It infers per-module quantization overrides from raw safetensors keys (DSV4 source format like model.layers.N.ffn.experts.E.w1), and if those don't match the user's existing post-sanitize overrides (model.layers.N.mlp.switch_mlp.gate_proj), it overwrites the whole quantization dict with disk-keyed entries. After the overwrite, mlx_lm's class_predicate (mlx_lm/utils.py:349) does an exact-key lookup against the model's post-sanitize module paths, finds nothing, and falls through to the top-level bits=8. Routed experts get wrapped as 8-bit QuantizedSwitchLinear; the on-disk 4-bit packed weights (in/8 columns) silently fail to load into the 8-bit module (in/4 columns) under strict=False. Modules retain zero-init weights → zero expert outputs → garbage logits.

2. The MXFP4 → affine community converter (mxfp4_to_affine.py) uses the wrong affine formula. It encodes as scale = (max-min)/15, bias = min. MLX's affine formula (Metal kernel quantized.h:2387) is:

        scale = max((w_max - w_min) / 15, eps)
        side  = abs(w_min) > abs(w_max)
        scale = side ? scale : -scale
        edge  = side ? w_min : w_max
        q0    = round(edge / scale)
        scale = (q0 != 0) ? edge / q0 : scale
        bias  = (q0 != 0) ? edge : 0

   i.e. MLX picks the larger-magnitude endpoint as bias and adjusts scale so that endpoint maps to an exact integer level. With the wrong formula every weight is off by ~5–10% relative; error compounds across 43 transformer layers → activations explode by layer ~20, NaN by layer ~29, sampler tied at token 0 (BOS) → BOS loops in reasoning_content.
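The MLX formula quoted above can be tried on a single group in plain Python. This is a minimal sketch, not MLX's implementation: the function names, the eps default, and the sample group are assumptions for illustration, and rounding is written out by hand on the assumption that the kernel's round breaks ties away from zero (Python's built-in round breaks ties to even).

```python
import math

def round_half_away(x: float) -> int:
    """C-style rounding (ties away from zero); Python's round() rounds ties to even."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))

def mlx_style_params(group, bits=4, eps=1e-7):
    """Scale and bias for one group, transcribed from the formula quoted above."""
    n_bins = (1 << bits) - 1                    # 15 for 4-bit
    w_min, w_max = min(group), max(group)
    scale = max((w_max - w_min) / n_bins, eps)
    side = abs(w_min) > abs(w_max)              # which endpoint has larger magnitude
    scale = scale if side else -scale
    edge = w_min if side else w_max
    q0 = round_half_away(edge / scale)
    if q0 != 0:
        return edge / q0, edge                  # endpoint lands on an integer level
    return scale, 0.0

def minmax_params(group, bits=4):
    """The (max-min)/15, bias=min encoding attributed to mxfp4_to_affine.py."""
    n_bins = (1 << bits) - 1
    w_min, w_max = min(group), max(group)
    return (w_max - w_min) / n_bins, w_min

group = [-1.0, 0.0, 0.5, 3.0]                   # made-up values, not real expert weights
mlx_scale, mlx_bias = mlx_style_params(group)
mm_scale, mm_bias = minmax_params(group)
print("mlx-style:", mlx_scale, mlx_bias)
print("min/max:  ", mm_scale, mm_bias)
```

For this group the two encodings disagree: the MLX-style result has a negative scale (-3/11) and a bias of 3.0, while the min/max form gives 4/15 and -1.0. A converter using the min/max form therefore does not reproduce what mx.quantize would write for the same weights.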
Fix overview
| # | Fix | Where | One-time? |
|---|---|---|---|
| 1 | Patch load_jangtq.py so _patch_quant_config_inplace skips when the user's config already has post-sanitize (model-path) overrides. | /Applications/vMLX.app/.../jang_tools/load_jangtq.py | yes (per vMLX install) |
| 2 | After convert_dsv4_jangtq, re-quantize all 33,024 routed expert tensors from the FP4 source using mx.quantize (MLX's correct affine formula), and rebuild model.safetensors.index.json to include the new .biases keys. | refix_routed_experts.py (this repo) | every build |
Patch #1 is independent of the bundle and persists across builds (until the vMLX app is reinstalled). Step #2 is part of the build pipeline and runs whenever the bundle is rebuilt.
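The effect of patch #1 can be illustrated with a small sketch. Both functions below are hypothetical stand-ins written for this note, not code from mlx_lm or jang_tools, and the config values are made up: resolve_bits mimics an exact-key lookup that falls back to the top-level default, and has_model_path_overrides is one plausible form of the fast-skip test.

```python
def resolve_bits(module_path: str, quantization: dict) -> int:
    """Exact-key lookup with fallback to the top-level default bits."""
    override = quantization.get(module_path)
    if isinstance(override, dict):
        return override["bits"]
    return quantization["bits"]

def has_model_path_overrides(quantization: dict, module_paths: set) -> bool:
    """Hypothetical guard: does any per-module override already name a real module path?"""
    return any(
        isinstance(value, dict) and key in module_paths
        for key, value in quantization.items()
    )

# One post-sanitize module path, as the loaded model would expose it.
module_paths = {"model.layers.0.mlp.switch_mlp.gate_proj"}

# A correct config keyed by model path, and the same config after being
# overwritten with disk-keyed (source-format) entries.
good = {
    "bits": 8, "group_size": 32,
    "model.layers.0.mlp.switch_mlp.gate_proj": {"bits": 4, "group_size": 32},
}
clobbered = {
    "bits": 8, "group_size": 32,
    "model.layers.0.ffn.experts.0.w1": {"bits": 4, "group_size": 32},
}

path = "model.layers.0.mlp.switch_mlp.gate_proj"
print(resolve_bits(path, good))        # override found
print(resolve_bits(path, clobbered))   # falls through to the top-level default
print(has_model_path_overrides(good, module_paths))
```

With the guard in place, a config like good is left untouched, so the lookup keeps returning the 4-bit override instead of the 8-bit default.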
Step-by-step recipe (from scratch)
Prereqs:
- vMLX.app installed at /Applications/vMLX.app (1.3.97+)
- /Volumes/Backup mounted with ≥100 GB free for the FP4 source
- ~/.cache/huggingface/hub/Deviad with ≥200 GB free for the output bundle
- huggingface-cli authenticated for deepseek-ai/DeepSeek-V4-Flash access
Steps (matching build_mlx_q4q8.sh subcommands):
1. check: verify volumes, free space, bundled python, jang_tools.
2. patch_loader: apply the _patch_quant_config_inplace fast-skip guard to load_jangtq.py. Idempotent.
3. download: hf download deepseek-ai/DeepSeek-V4-Flash into /Volumes/Backup/DeepSeek-V4/source (~50 GB, FP4 routed experts + FP8 attention/shared/embed/lm_head, BF16 norms/router/HC).
4. convert: jang_tools.dsv4.convert_dsv4_jangtq --profile 4 --format jang produces the bundle at ~/.cache/huggingface/hub/Deviad/DeepSeek-V4-Flash-MLX-Q4Q8. After this, routed experts are still in MXFP4-direct-copy form (uint8 E8M0 scales, no biases); attention/shared/embed/lm_head are already correctly quantized via mx.quantize.
5. requantize: run refix_routed_experts.py. For each of the 33,024 routed expert tensors, read the FP4 weight + UE8M0 scale from the source bundle, decode through FP4_LUT[nibbles] * 2^(scale-127), re-quantize via mx.quantize(group_size=32, bits=4, mode="affine"), and replace .weight, .scales, .biases in the destination shards. Rebuild model.safetensors.index.json to include the added .biases keys. Takes ~30–35 min on M3 Ultra; peak RAM ~15 GB.
6. finalize: copy tokenizer / encoding files from the source or a reference JANG_2L bundle.
7. patch: apply EOS / chat-template fixes to tokenizer_config.json and generation_config.json.
8. verify: sanity-check the bundle (file presence, EOS, shard count, encoding dir).
9. serve: launch vmlx_engine.cli serve.
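The decode half of the requantize step can be sketched in plain Python. The table contents (the standard E2M1 value set) and the low-nibble-first packing order are assumptions here, since the recipe only names FP4_LUT and the 2^(scale-127) factor; decode_fp4_group is a hypothetical helper, not a function taken from refix_routed_experts.py.

```python
# Assumed E2M1 lookup table: index is the 4-bit code, top bit is the sign.
FP4_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
           -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def decode_fp4_group(packed: bytes, scale_byte: int) -> list:
    """Decode packed FP4 nibbles that share one UE8M0 scale byte.

    Assumes two codes per byte with the low nibble first.
    """
    factor = 2.0 ** (scale_byte - 127)          # UE8M0: power-of-two scale, bias 127
    values = []
    for byte in packed:
        values.append(FP4_LUT[byte & 0x0F] * factor)
        values.append(FP4_LUT[byte >> 4] * factor)
    return values

print(decode_fp4_group(bytes([0x21]), 127))     # codes 1 and 2 at scale 2^0
print(decode_fp4_group(bytes([0xF7]), 126))     # codes 7 and 15 at scale 2^-1
```

The real step would hand each decoded group to mx.quantize with the parameters listed in the recipe; that call is left out so the sketch runs without MLX installed.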
The all target runs check → patch_loader → (download if needed) → convert → requantize → finalize → patch → verify.
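The index rebuild in the requantize step amounts to registering each new .biases tensor in the shard index. The sketch below assumes the usual sharded-safetensors index layout (a weight_map from tensor name to shard file) and, as a simplification, that each .biases tensor is written to the same shard as its .scales; add_biases_to_index and the shard names are hypothetical.

```python
import json

def add_biases_to_index(index: dict, requantized_prefixes: list) -> dict:
    """Return a copy of the index with a .biases entry for each requantized module."""
    weight_map = dict(index["weight_map"])
    for prefix in requantized_prefixes:
        shard = weight_map[prefix + ".scales"]  # assumed: biases live next to scales
        weight_map[prefix + ".biases"] = shard
    return {**index, "weight_map": dict(sorted(weight_map.items()))}

# Tiny made-up index with one routed expert projection.
prefix = "model.layers.0.mlp.switch_mlp.gate_proj"
index = {
    "metadata": {},
    "weight_map": {
        prefix + ".weight": "model-00001.safetensors",
        prefix + ".scales": "model-00001.safetensors",
    },
}
updated = add_biases_to_index(index, [prefix])
print(json.dumps(updated["weight_map"], indent=2))
```

A loader that resolves tensors through the index will not see a .biases tensor that exists in a shard but is missing from weight_map, which is why the rebuild is part of the step.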
Verifying it works
After serve, hit the chat completion endpoint:
curl -s http://127.0.0.1:8010/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"deepseek-v4-flash-mlx-q4q8",
"messages":[{"role":"user","content":"Write one sentence about the moon."}],
"max_tokens":120}'
Expected: a coherent English sentence with finish_reason: stop.
Bug state: a run of BOS tokens (<|begin▁of▁sentence|> repeated) up to the
max_tokens limit, with finish_reason: length.
A second sanity test that distinguishes correctness from coherence:
curl -s http://127.0.0.1:8010/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"deepseek-v4-flash-mlx-q4q8",
"messages":[{"role":"user","content":"What is 17+28?"}],
"max_tokens":80}'
Should produce 45 somewhere in reasoning_content. (The _hc_pre
function in dsv4/mlx_model.py has an explicit comment about this exact
prompt regressing to "17 plus plus plus" if any numerical step in the
mHC mechanism is even slightly off.)
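Both checks can be automated once the JSON response has been parsed. The sketch below assumes an OpenAI-style response shape (choices[0].finish_reason, message.content, message.reasoning_content); check_response is a hypothetical helper, and the sample responses are hand-written, not captured server output.

```python
BOS = "<|begin▁of▁sentence|>"

def check_response(response: dict, must_contain: str = "", require_stop: bool = True) -> list:
    """Return a list of problems found in a chat-completion response (empty means healthy)."""
    choice = response["choices"][0]
    message = choice.get("message") or {}
    text = (message.get("reasoning_content") or "") + (message.get("content") or "")
    problems = []
    if text.count(BOS) >= 2:                    # the BOS-loop symptom
        problems.append("repeated BOS token")
    finish = choice.get("finish_reason")
    if require_stop and finish != "stop":
        problems.append(f"finish_reason is {finish!r}")
    if must_contain and must_contain not in text:
        problems.append(f"missing {must_contain!r}")
    return problems

healthy = {"choices": [{"finish_reason": "stop",
                        "message": {"content": "The moon orbits Earth."}}]}
looping = {"choices": [{"finish_reason": "length",
                        "message": {"reasoning_content": BOS * 5, "content": ""}}]}
arith = {"choices": [{"finish_reason": "length",
                      "message": {"reasoning_content": "17 + 28 = 45", "content": None}}]}

print(check_response(healthy))
print(check_response(looping))
print(check_response(arith, must_contain="45", require_stop=False))
```

The arithmetic check passes require_stop=False because the text above only asks for 45 to appear in reasoning_content, which can happen even when the reply is cut off by max_tokens.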
Out of scope / future work
- Repacking the bundle into TurboQuant .tq_packed / .tq_norms / .tq_bits format would unlock the fused-gate+up Metal kernel (probably 2–3× tok/s). Not required for correctness.
- Upstream the _patch_quant_config_inplace fast-skip into jang_tools so step 2 isn't needed.
- Upstream a --no-fp4-passthrough flag to convert_dsv4_jangtq so the routed experts go through mx.quantize directly during conversion, removing the need for the requantize step.
Files
- build_mlx_q4q8.sh: orchestrator script. Self-contained: emits refix_routed_experts.py from an embedded heredoc each time the requantize step runs, with SRC_DIR and OUT_DIR substituted from the variables at the top of the script.
- refix_routed_experts.py: auto-generated. Hand-edits get clobbered on the next requantize run; edit the heredoc in build_mlx_q4q8.sh instead.
- requantization-plan.md: this file.
- mxfp4_to_affine.py: deprecated; uses the wrong affine formula (kept as historical reference). Do NOT run it on a fresh bundle.
- fix_crossshard_orphans.py: deprecated; was the second-pass cleanup for the wrong-formula converter. No longer needed; refix_routed_experts.py handles cross-shard cases natively.