ZAYA1-8B-JANGTQ_K

Zyphra/ZAYA1-8B at 3.4 GB on disk: a mixed-bit JANGTQ_K quantization that recovers ZAYA's quality past the 2-3k cumulative-token coherence ceiling at which the prior JANGTQ2 tier collapsed.

  • Source: Zyphra/ZAYA1-8B (80 layers alternating CCA attention + top-1 MoE, 16 routed experts + MOD skip route, 8.4 B total / 760 M active, hybrid cache)
  • Quantization: mixed-bit MXTQ on routed experts:
    • down_proj: 4-bit (output enters residual stream — most sensitive)
    • gate_proj: 2-bit (gated through SwiGLU)
    • up_proj: 2-bit (multiplied with gate)
    • attention / embed / lm_head: 8-bit affine
    • norms / router / conv_qk / biases: fp16 / fp32 passthrough
  • Routed-expert layout: pre-stacked along axis 0 under zaya_block.experts.switch_mlp.{gate_proj, up_proj, down_proj} per the JANGTQ-PRESTACK standard. Sidecar jangtq_runtime.safetensors (~8 KB) ships both (in=2048, bits=2) and (in=2048, bits=4) codebooks + sign-flip vector for Swift runtimes.
  • Bundle size: ~3.4 GB on-disk (~2.67 bits avg routed weight)
  • Runs on: M3 Max 32 GB+ / M4 / M5 / Mac Studio
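
The ~2.67-bit average follows from the per-projection widths listed above. The sketch below reproduces that arithmetic under two assumptions that are not stated in this card: that gate_proj, up_proj and down_proj each hold the same number of weights (as in a standard SwiGLU MLP), and that codebook and scale overhead is ignored. The `hidden` and `intermediate` values are illustrative placeholders, and `average_bits` is a hypothetical helper, not part of jang-tools.

```python
# Bit widths per routed-expert projection, as listed above.
BITS = {"gate_proj": 2, "up_proj": 2, "down_proj": 4}

def average_bits(bits, hidden, intermediate):
    """Parameter-weighted mean bit width across the three projections."""
    # Assumption: each projection is a hidden x intermediate matrix
    # (in either orientation), so all three have the same element count.
    sizes = {name: hidden * intermediate for name in bits}
    total_bits = sum(bits[name] * sizes[name] for name in bits)
    return total_bits / sum(sizes.values())

# Placeholder dimensions; with equal-sized projections the result
# does not depend on either value.
avg = average_bits(BITS, hidden=2048, intermediate=2048)
print(f"{avg:.2f} bits per routed weight")
```

With equal-sized projections this reduces to (2 + 2 + 4) / 3, which is the figure quoted in the bundle-size line.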

Why mixed-bit?

ZAYA1-8B is a top-1 MoE with MOD passthrough, so every routed token carries a single expert's quantization error, with no top-k averaging to smooth out the noise. At plain 2-bit (JANGTQ2) the residual stream accumulates codebook noise and collapses into short-phrase loops past 2-3k cumulative output tokens (documented at `/osaurus-staging/docs/JANGTQ2_QUALITY_LIMITS.md`).
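
To illustrate why top-1 routing is less forgiving, the following sketch models each expert's quantization error as independent zero-mean Gaussian noise and compares the spread of the error for one expert against an equal-weight mix of four. This is a toy model under stated assumptions, not a measurement of ZAYA: real routers use learned, unequal gate weights, and real quantization errors are not independent. `mixed_error_std` is a hypothetical helper.

```python
import random
import statistics

def mixed_error_std(k, trials=20000, sigma=1.0, seed=0):
    """Std of the output error when k experts are mixed with equal weights.

    Assumption: each expert contributes independent N(0, sigma^2) error.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        # Equal-weight mixture of k independent per-expert errors.
        mixed = sum(rng.gauss(0.0, sigma) for _ in range(k)) / k
        errors.append(mixed)
    return statistics.pstdev(errors)

top1 = mixed_error_std(k=1)
top4 = mixed_error_std(k=4)
print(f"top-1 error std: {top1:.3f}")
print(f"top-4 error std: {top4:.3f}")
```

Under these assumptions the error spread shrinks roughly as 1/sqrt(k), so a top-1 model sees the full per-expert noise on every routed token.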

JANGTQ_K spends 4 bits on down_proj (the projection whose output feeds the residual stream) and keeps 2 bits on gate_proj / up_proj (which pass through SwiGLU's multiplicative path and are much less sensitive). The routed experts average ~2.67 bits per weight, yet the one matmul whose noise matters most runs at 4-bit precision.
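
As a rough intuition for what the extra two bits buy on down_proj, the sketch below quantizes synthetic Gaussian weights with a plain uniform quantizer at 2 and 4 bits and compares the RMS error. This is deliberately not the MXTQ codebook scheme used in this bundle (its details are not described here); uniform rounding is swapped in only as a simple stand-in, and the weights are random, not ZAYA's.

```python
import random

def quantize_uniform(values, bits):
    """Round each value to the nearest of 2**bits evenly spaced levels."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    return [lo + round((v - lo) / step) * step for v in values]

def rms_error(original, approx):
    """Root-mean-square difference between two equal-length lists."""
    n = len(original)
    return (sum((a - b) ** 2 for a, b in zip(original, approx)) / n) ** 0.5

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic weights

err2 = rms_error(weights, quantize_uniform(weights, bits=2))
err4 = rms_error(weights, quantize_uniform(weights, bits=4))
print(f"2-bit RMS error: {err2:.4f}")
print(f"4-bit RMS error: {err4:.4f}")
```

Going from 4 to 16 levels shrinks the step size by a factor of five in this toy setup, which is the kind of gap the mixed-bit layout targets on the residual-facing projection.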

Loading (Python)

pip install jang-tools mlx-lm

from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("JANGQ-AI/ZAYA1-8B-JANGTQ_K")

chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    tokenize=False,
    add_generation_prompt=True,
)

load_jangtq_model auto-registers model_type=zaya via jang_tools.zaya before building the MLX skeleton.

Validated runtime contract

  • 80 layers materialize; 40 sparse-MoE layers hydrate routed experts via TurboQuantLinear with per-projection bit widths (gate=2 / up=2 / down=4).
  • Capabilities: family=zaya, reasoning_parser=qwen3, tool_parser=zaya_xml, supports_thinking=True, think_in_template=False, cache_type=hybrid.
  • Single-prompt smoke: "2+2=4", "Paris", recursive fibonacci(n) — short, on-topic, fast.
  • Multi-turn smoke: 3-turn code+tests+README run → 6,177 chars cumulative, well past the 2-3k JANGTQ2 ceiling, no loops / no repetition / no off-topic collapse.

Runtime support matrix

| Surface | Status |
| --- | --- |
| jang-tools Python (load_jangtq_model) | ✅ working (this README's load snippet) |
| vmlx-swift-lm Swift | ✅ working (Libraries/MLXLLM/Models/Zaya.swift + JANGTQ codebook dispatch) |

Reasoning + tools

  • Reasoning parser: qwen3 (extracts <think>...</think> blocks)
  • Tool parser: zaya_xml (Zyphra wrapper around standard XML tool calls — see Tool/Parsers/ZayaXMLToolCallParser.swift)
  • Cache: hybrid (CCA + standard KV; convolution state preserved per CCA layer + previous-hidden-state side-channel)
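
For readers consuming raw model output without a runtime-side parser, the sketch below shows one way to separate `<think>...</think>` blocks from the final answer. It is an illustrative stand-in, not the qwen3 reasoning parser itself; `split_reasoning` is a hypothetical helper, and it does not handle unclosed tags or streaming output.

```python
import re

# Non-greedy match so multiple <think> blocks are captured separately;
# DOTALL lets "." span newlines inside a block.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Return (reasoning, answer) from raw model output."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer

raw = "<think>2 + 2 is basic addition.</think>\nThe answer is 4."
reasoning, answer = split_reasoning(raw)
print(reasoning)
print(answer)
```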

Credits

  • Quantization + MLX runtime: Jinho Jang (eric@jangq.ai)
  • Source model: Zyphra ZAYA1 team
  • License: Apache-2.0, inherited from upstream