ZAYA1-8B-JANGTQ_K

Zyphra/ZAYA1-8B, 3.4 GB on disk. Mixed-bit JANGTQ_K quantization that keeps ZAYA coherent past the 2-3k cumulative-token ceiling where the prior JANGTQ2 tier collapsed.

Source: Zyphra/ZAYA1-8B (80 layers alternating CCA attention + top-1 MoE, 16 routed experts + MOD skip route, 8.4 B total / 760 M active, hybrid cache)
Quantization: mixed-bit MXTQ on routed experts:
- down_proj: 4-bit (output enters residual stream — most sensitive)
- gate_proj: 2-bit (gated through SwiGLU)
- up_proj: 2-bit (multiplied with gate)
- attention / embed / lm_head: 8-bit affine
- norms / router / conv_qk / biases: fp16 / fp32 passthrough
Routed-expert layout: pre-stacked along axis 0 under zaya_block.experts.switch_mlp.{gate_proj, up_proj, down_proj} per the JANGTQ-PRESTACK standard. Sidecar jangtq_runtime.safetensors (~8 KB) ships both (in=2048, bits=2) and (in=2048, bits=4) codebooks + sign-flip vector for Swift runtimes.
Bundle size: ~3.4 GB on-disk (~2.67 bits avg routed weight)
Runs on: M3 Max 32 GB+ / M4 / M5 / Mac Studio
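
The per-tensor policy above can be sketched as a name-to-bit-width lookup. This is a minimal illustration, not the converter's actual code: `bits_for`, the substring matching rules, and the example tensor names are hypothetical; only the name fragments and bit widths come from the list above.

```python
from typing import Optional

# Bit widths for routed-expert projections, taken from the list above.
ROUTED_EXPERT_BITS = {"gate_proj": 2, "up_proj": 2, "down_proj": 4}
# Name fragments assumed to mark fp16 / fp32 passthrough tensors.
PASSTHROUGH_MARKERS = ("norm", "router", "conv_qk", "bias")


def bits_for(name: str) -> Optional[int]:
    """Hypothetical lookup: bit width for a tensor, or None for passthrough."""
    if any(marker in name for marker in PASSTHROUGH_MARKERS):
        return None  # kept in fp16 / fp32
    if "experts.switch_mlp." in name:
        for proj, bits in ROUTED_EXPERT_BITS.items():
            if f".{proj}" in name:
                return bits
    return 8  # simplification: everything else is treated as 8-bit affine


print(bits_for("layers.3.zaya_block.experts.switch_mlp.down_proj.weight"))
```

The final fallback is a simplification for the sketch; a real converter would match attention, embedding, and lm_head tensors explicitly.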

Why mixed-bit?

ZAYA1-8B is top-1 MoE with MOD passthrough: every routed token rides ONE expert's quantization error, with no top-k averaging to smooth out the noise. At plain 2-bit (JANGTQ2) the residual stream accumulates codebook noise and collapses into short-phrase loops past ~2-3 k cumulative output tokens (documented at `~/osaurus-staging/docs/JANGTQ2_QUALITY_LIMITS.md`).
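
To make the averaging argument concrete, here is a minimal sketch under a simplifying assumption that is not part of the original analysis: each expert's quantization error is independent and zero-mean, with the same standard deviation. The name `mixed_noise_std` is illustrative only.

```python
import math


def mixed_noise_std(sigma: float, weights: list[float]) -> float:
    """Std of a weighted sum of independent errors, each with std `sigma`."""
    return sigma * math.sqrt(sum(w * w for w in weights))


top1 = mixed_noise_std(1.0, [1.0])       # one expert carries all the error
top2 = mixed_noise_std(1.0, [0.5, 0.5])  # two equally weighted experts
print(top1, top2)
```

Under this assumption, equally weighted top-2 routing would cut the noise standard deviation by a factor of √2, while top-1 routing gets no reduction at all.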

JANGTQ_K spends 4 bits on down_proj (the projection whose output feeds the residual stream) and keeps 2 bits on gate_proj / up_proj (which pass through SwiGLU's multiplicative path and are much less sensitive). The average stays at ~2.67 bits per routed weight, yet quality is close to 4-bit on the matmul whose noise actually matters.
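
The ~2.67-bit average follows from simple arithmetic, assuming the three projections of a SwiGLU expert hold the same number of weights (gate and up map hidden to intermediate, down maps intermediate back to hidden). The sketch ignores codebook and sidecar overhead, which is also an assumption.

```python
bits = {"gate_proj": 2, "up_proj": 2, "down_proj": 4}

# Equal element counts per projection, so the plain mean is the weighted mean.
avg_bits = sum(bits.values()) / len(bits)
print(f"{avg_bits:.2f}")  # 2.67
```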

Loading (Python)

pip install jang-tools mlx-lm

from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("JANGQ-AI/ZAYA1-8B-JANGTQ_K")

chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    tokenize=False,
    add_generation_prompt=True,
)

load_jangtq_model auto-registers model_type=zaya via jang_tools.zaya before building the MLX skeleton.

Validated runtime contract

80 layers materialize; 40 sparse-MoE layers hydrate routed experts via TurboQuantLinear with per-projection bit widths (gate=2 / up=2 / down=4).
Capabilities: family=zaya, reasoning_parser=qwen3, tool_parser=zaya_xml, supports_thinking=True, think_in_template=False, cache_type=hybrid.
Single-prompt smoke: "2+2=4", "Paris", recursive fibonacci(n) — short, on-topic, fast.
Multi-turn smoke: 3-turn code+tests+README run → 6,177 chars cumulative, well past the 2-3 k JANGTQ2 ceiling, no loops / no repetition / no off-topic collapse.
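
A collapse into short-phrase loops can also be checked mechanically. The sketch below is one possible heuristic, not the check used for the smoke runs above: `repeated_ngram_ratio` and the n-gram size of 4 are hypothetical choices.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that repeat an earlier n-gram in `text`."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


print(repeated_ngram_ratio("the cat sat down " * 10))
```

A ratio near 0 suggests varied output; a ratio near 1 suggests the text is cycling through the same phrases. Any pass/fail threshold would need tuning on real transcripts.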

Runtime support matrix

| Surface | Status |
| --- | --- |
| `jang-tools` Python (`load_jangtq_model`) | ✅ working: this README's load snippet |
| `vmlx-swift-lm` Swift | ✅ working: `Libraries/MLXLLM/Models/Zaya.swift` + JANGTQ codebook dispatch |

Reasoning + tools

Reasoning parser: qwen3 (extracts <think>...</think> blocks)
Tool parser: zaya_xml (Zyphra wrapper around standard XML tool calls — see Tool/Parsers/ZayaXMLToolCallParser.swift)
Cache: hybrid (CCA + standard KV; convolution state preserved per CCA layer + previous-hidden-state side-channel)
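
For illustration, separating a `<think>...</think>` block from the final answer can be sketched with a regular expression. This is not the qwen3 parser itself: `split_reasoning` is a hypothetical helper, and it assumes well-formed, non-nested tags.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no block is found."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2 + 2 is 4</think>\nThe answer is 4.")
print(reasoning)
print(answer)
```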

Credits

Quantization + MLX runtime: Jinho Jang (eric@jangq.ai)
Source model: Zyphra ZAYA1 team
License: Apache-2.0, inherited from upstream
