Instructions for using OsaurusAI/ZAYA1-8B-MXFP4 with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- MLX
How to use OsaurusAI/ZAYA1-8B-MXFP4 with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/ZAYA1-8B-MXFP4")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use OsaurusAI/ZAYA1-8B-MXFP4 with Pi:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/ZAYA1-8B-MXFP4"
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to `~/.pi/agent/models.json`:
```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "OsaurusAI/ZAYA1-8B-MXFP4" }
      ]
    }
  }
}
```
Run Pi
```bash
# Start Pi in your project directory:
pi
```
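If Pi cannot reach the model, it helps to confirm the local server is actually serving the bundle before debugging the agent config. A minimal sketch using only the Python standard library, assuming the default port 8080 used above:
```python
# Quick health check for the local mlx_lm server before wiring up Pi.
# Assumes the server from the previous step is running on port 8080.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "model": "OsaurusAI/ZAYA1-8B-MXFP4",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Generation can be slow on first load, so allow a generous timeout.
with urllib.request.urlopen(req, timeout=120) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])
```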
- Hermes Agent
How to use OsaurusAI/ZAYA1-8B-MXFP4 with Hermes Agent:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/ZAYA1-8B-MXFP4"
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default OsaurusAI/ZAYA1-8B-MXFP4
```
Run Hermes
```bash
hermes
```
- MLX LM
How to use OsaurusAI/ZAYA1-8B-MXFP4 with MLX LM:
Generate or start a chat session
```bash
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "OsaurusAI/ZAYA1-8B-MXFP4"
```
Run an OpenAI-compatible server
```bash
# Install MLX LM
uv tool install mlx-lm

# Start the server
mlx_lm.server --model "OsaurusAI/ZAYA1-8B-MXFP4"

# Call the OpenAI-compatible server with curl (default port 8080)
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "OsaurusAI/ZAYA1-8B-MXFP4",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
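The same request can also be issued from Python. A sketch assuming the `openai` client package is installed; the API key is unused by the local server but required by the client:
```python
# Equivalent of the curl call above, using the OpenAI Python client
# against the local mlx_lm server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
response = client.chat.completions.create(
    model="OsaurusAI/ZAYA1-8B-MXFP4",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```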

ZAYA1-8B-MXFP4
Quantized Zyphra/ZAYA1-8B for Apple Silicon runtimes.
| Property | Value |
| --- | --- |
| Source | Zyphra/ZAYA1-8B |
| License | Apache-2.0, inherited from upstream |
| Format | MXFP4 |
| Modality | text |
| Bundle size | 5.48 GiB |
| Tensor keys | 1965 |
| Expert layout | Pre-stacked `zaya_block.experts.switch_mlp` |
| Runtime status | Generation coherence: NOT INDEPENDENTLY PASSED for the quantized runtime bundle (missing coherence report); published as a format/runtime bundle pending downstream ZAYA runtime validation. |
Important Runtime Note
ZAYA is not a stock mlx_lm architecture: it alternates CCA attention layers with top-1 MoE layers. Use this bundle only with a ZAYA-aware MLX/JANG runtime that implements the CCA attention state contract and the converted pre-stacked expert layout.
Runtime Pin Required
Use a vmlx-swift-lm build that includes the ZAYA Swift runtime (`Libraries/MLXLLM/Models/Zaya.swift`, `MLXLMCommon/Cache/ZayaCCACache.swift`, and `BatchEngine/BatchZayaCCACache.swift`). The first verified pin is commit b9da180 or newer.
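A local checkout can be checked against this pin with plain git. A minimal sketch, assuming the repository is cloned at `./vmlx-swift-lm` (the path is illustrative; the file paths are copied from the list above, so adjust them to the actual repository layout):
```python
# Verify a local vmlx-swift-lm checkout satisfies the runtime pin:
# commit b9da180 must be an ancestor of HEAD, and the ZAYA files present.
import pathlib
import subprocess

repo = pathlib.Path("vmlx-swift-lm")  # illustrative path to your checkout

pin_ok = subprocess.run(
    ["git", "-C", str(repo), "merge-base", "--is-ancestor", "b9da180", "HEAD"],
    capture_output=True,
).returncode == 0
print("commit pin satisfied:", pin_ok)

for rel in [
    "Libraries/MLXLLM/Models/Zaya.swift",
    "MLXLMCommon/Cache/ZayaCCACache.swift",
    "BatchEngine/BatchZayaCCACache.swift",
]:
    print(rel, "present:", (repo / rel).exists())
```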
Architecture Summary
- 80 decoder layers: alternating CCA attention and top-1 MoE
- Hidden size 2048, 16 query heads, 2 KV heads, head dim ?
- CCA state per attention layer: standard KV plus `conv_state [B,1280,2]` and `prev_hs [B,2048]` (sketched below)
- 16 routed experts per MoE layer, top-1 routing with MOD skip route
- Context length 131072, `rope_theta=5000000`
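To make the cache contract concrete, here is a hypothetical per-attention-layer state record built from the shapes above. Only the shapes come from this list; the field names and the 128 head dim are assumptions:
```python
# Hypothetical per-layer CCA cache record; shapes taken from the list above.
from dataclasses import dataclass

import numpy as np

B = 1            # batch size
HEAD_DIM = 128   # assumption: hidden 2048 / 16 query heads; not stated upstream

@dataclass
class ZayaCCALayerState:
    """Illustrative cache record; field names are not the runtime's."""
    keys: np.ndarray        # [B, n_kv_heads, T, HEAD_DIM] standard KV cache
    values: np.ndarray      # [B, n_kv_heads, T, HEAD_DIM]
    conv_state: np.ndarray  # [B, 1280, 2] CCA convolution state
    prev_hs: np.ndarray     # [B, 2048] previous hidden state

def init_state(batch: int = B) -> ZayaCCALayerState:
    # KV starts empty along the sequence axis and grows during decoding.
    return ZayaCCALayerState(
        keys=np.zeros((batch, 2, 0, HEAD_DIM), dtype=np.float32),
        values=np.zeros((batch, 2, 0, HEAD_DIM), dtype=np.float32),
        conv_state=np.zeros((batch, 1280, 2), dtype=np.float32),
        prev_hs=np.zeros((batch, 2048), dtype=np.float32),
    )

print(init_state().conv_state.shape)  # (1, 1280, 2)
```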
Quantization
4-bit affine linears + 8-bit embeddings + passthrough router/CCA state tensors.
Passthrough floor for the first release prep:
- `conv_qk.*`, `temp`, norms, residual scaling, the router path, biases, and balancing biases are preserved as float tensors.
- Embeddings and `lm_head` use 8-bit affine quantization in the prepared bundles.
- Text-only ZAYA1-8B has no vision_tower or LoRA tensors.
- `jangtq_runtime.safetensors` is not applicable to MXFP4.
- `mxtq_bits`: null
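One way to picture this policy is as a key filter applied during conversion. A sketch in which the pattern fragments are modeled on the names above and are not guaranteed to match the real tensor keys:
```python
# Illustrative key filter for the quantization policy described above.
# Pattern fragments mimic the names in this section; real keys may differ.
import re

PASSTHROUGH = re.compile(r"(conv_qk\.|(^|\.)temp$|norm|residual|router|bias)")
EIGHT_BIT = re.compile(r"(embed|lm_head)")

def quant_mode(key: str) -> str:
    """Return the quantization mode for a tensor key."""
    if PASSTHROUGH.search(key):
        return "float-passthrough"
    if EIGHT_BIT.search(key):
        return "affine-8bit"
    return "affine-4bit"  # default for linear weights in the MXFP4 bundle

for k in [
    "model.layers.0.zaya_block.experts.switch_mlp.up_proj.weight",
    "model.layers.0.router.weight",
    "lm_head.weight",
]:
    print(k, "->", quant_mode(k))
```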
Bundle Verification
- Safetensor headers scanned.
- Source tensor coverage checked.
- Converted bundles checked for `local_experts` removal.
- Converted expert tensors checked for pre-stacked `switch_mlp` layout.
- JANGTQ sidecars checked for the Swift runtime contract.
- Capabilities verified: `family=zaya`, `supports_thinking=False`, `tool_parser=zaya_xml`.
- Runtime coherence status recorded above.
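The two expert-layout checks can be reproduced locally by scanning the safetensors header, which is a little-endian u64 length followed by a JSON table of tensor keys. A sketch assuming a single shard named `model.safetensors` (real bundles may be sharded):
```python
# Scan a safetensors header for the expert-layout checks, stdlib only.
import json
import pathlib
import struct

path = pathlib.Path("model.safetensors")  # illustrative single-shard name
with path.open("rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]  # u64 header length
    header = json.loads(f.read(header_len))         # JSON key table

keys = [k for k in header if k != "__metadata__"]
assert not any("local_experts" in k for k in keys), "split expert layout leaked through"
assert any("switch_mlp" in k for k in keys), "pre-stacked expert tensors missing"
print(f"{len(keys)} tensor keys; expert layout looks pre-stacked")
```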
Runtime Smoke Tests
Before production use, run short deterministic prompts through the exact target runtime (a minimal harness is sketched after this list):
- "What is 2+2? Answer with only the number."
- "What is the capital of France? Answer with one word."
- One chat-template prompt with thinking disabled.
- One chat-template prompt with thinking enabled and enough output budget for the final answer.
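A sketch of the first two checks with `mlx_lm`, reusing the snippet API from the usage section; this assumes your `mlx_lm` build includes the ZAYA runtime support described above:
```python
# Minimal deterministic smoke test against the quantized bundle.
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/ZAYA1-8B-MXFP4")

checks = [
    ("What is 2+2? Answer with only the number.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]
for question, expected in checks:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}], add_generation_prompt=True
    )
    answer = generate(model, tokenizer, prompt=prompt, max_tokens=16)
    print(f"{question!r} -> {answer!r} (expected {expected!r})")
```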
The first public bundle release records bundle integrity and runtime contract checks. Full generation quality depends on a ZAYA-aware runtime implementation.
Summary
This bundle is Zyphra/ZAYA1-8B quantized for Apple Silicon MLX/JANG runtimes. Use it only with a runtime that correctly implements ZAYA's CCA attention state and MoE routing.
Files
- `config.json` carries `weight_format=mxfp4` and `zaya_expert_layout=split_switch_mlp`.
- `jang_config.json` carries `cache_subtype=zaya_cca`.
- Tokenizer files and chat template are preserved from the upstream source snapshot.
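These fields can be verified programmatically before loading. A minimal sketch over a downloaded bundle directory (the local path is illustrative):
```python
# Sanity-check the bundle metadata described above before loading.
import json
import pathlib

bundle = pathlib.Path("ZAYA1-8B-MXFP4")  # illustrative local bundle path

config = json.loads((bundle / "config.json").read_text())
assert config.get("weight_format") == "mxfp4"
assert config.get("zaya_expert_layout") == "split_switch_mlp"

jang = json.loads((bundle / "jang_config.json").read_text())
assert jang.get("cache_subtype") == "zaya_cca"
print("bundle metadata OK")
```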