Instructions for using OsaurusAI/ZAYA1-8B-MXFP4 with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- MLX
How to use OsaurusAI/ZAYA1-8B-MXFP4 with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/ZAYA1-8B-MXFP4")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use OsaurusAI/ZAYA1-8B-MXFP4 with Pi:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/ZAYA1-8B-MXFP4"
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to `~/.pi/agent/models.json`:
```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "OsaurusAI/ZAYA1-8B-MXFP4" }
      ]
    }
  }
}
```
Run Pi
```bash
# Start Pi in your project directory:
pi
```
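If Pi cannot reach the model, it helps to confirm the local server is actually serving the bundle before debugging the agent config. A minimal sketch using only the Python standard library, assuming the default port 8080 used above:
```python
# Quick health check for the local mlx_lm server before wiring up Pi.
# Assumes the server from the previous step is running on port 8080.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "model": "OsaurusAI/ZAYA1-8B-MXFP4",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Generation can be slow on first load, so allow a generous timeout.
with urllib.request.urlopen(req, timeout=120) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])
```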
- Hermes Agent
How to use OsaurusAI/ZAYA1-8B-MXFP4 with Hermes Agent:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/ZAYA1-8B-MXFP4"
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default OsaurusAI/ZAYA1-8B-MXFP4
```
Run Hermes
```bash
hermes
```
- MLX LM
How to use OsaurusAI/ZAYA1-8B-MXFP4 with MLX LM:
Generate or start a chat session
```bash
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "OsaurusAI/ZAYA1-8B-MXFP4"
```
Run an OpenAI-compatible server
```bash
# Install MLX LM
uv tool install mlx-lm

# Start the server
mlx_lm.server --model "OsaurusAI/ZAYA1-8B-MXFP4"

# Call the OpenAI-compatible server with curl (default port 8080)
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "OsaurusAI/ZAYA1-8B-MXFP4",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
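The same request can also be issued from Python. A sketch assuming the `openai` client package is installed; the API key is unused by the local server but required by the client:
```python
# Equivalent of the curl call above, using the OpenAI Python client
# against the local mlx_lm server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
response = client.chat.completions.create(
    model="OsaurusAI/ZAYA1-8B-MXFP4",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```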

ZAYA1-8B-MXFP4
Quantized Zyphra/ZAYA1-8B for Apple Silicon runtimes.
| Property | Value |
| --- | --- |
| Source | Zyphra/ZAYA1-8B |
| License | Apache-2.0, inherited from upstream |
| Format | MXFP4 |
| Modality | text |
| Bundle size | 5.48 GiB |
| Tensor keys | 1965 |
| Expert layout | Pre-stacked `zaya_block.experts.switch_mlp` |
| Runtime status | Generation coherence: NOT INDEPENDENTLY PASSED for the quantized runtime bundle (missing coherence report); published as a format/runtime bundle pending downstream ZAYA runtime validation. |
Important Runtime Note
ZAYA is not a stock mlx_lm architecture: it alternates CCA attention layers with top-1 MoE layers. Use this bundle only with a ZAYA-aware MLX/JANG runtime that implements the CCA attention state contract and the converted pre-stacked expert layout.
Runtime Pin Required
Use a vmlx-swift-lm build that includes the ZAYA Swift runtime (`Libraries/MLXLLM/Models/Zaya.swift`, `MLXLMCommon/Cache/ZayaCCACache.swift`, and `BatchEngine/BatchZayaCCACache.swift`). The first verified pin is commit b9da180 or newer.
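A local checkout can be checked against this pin with plain git. A minimal sketch, assuming the repository is cloned at `./vmlx-swift-lm` (the path is illustrative; the file paths are copied from the list above, so adjust them to the actual repository layout):
```python
# Verify a local vmlx-swift-lm checkout satisfies the runtime pin:
# commit b9da180 must be an ancestor of HEAD, and the ZAYA files present.
import pathlib
import subprocess

repo = pathlib.Path("vmlx-swift-lm")  # illustrative path to your checkout

pin_ok = subprocess.run(
    ["git", "-C", str(repo), "merge-base", "--is-ancestor", "b9da180", "HEAD"],
    capture_output=True,
).returncode == 0
print("commit pin satisfied:", pin_ok)

for rel in [
    "Libraries/MLXLLM/Models/Zaya.swift",
    "MLXLMCommon/Cache/ZayaCCACache.swift",
    "BatchEngine/BatchZayaCCACache.swift",
]:
    print(rel, "present:", (repo / rel).exists())
```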
Architecture Summary
- 80 decoder layers: alternating CCA attention and top-1 MoE
- Hidden size 2048, 16 query heads, 2 KV heads, head dim ?
- CCA state per attention layer: standard KV plus `conv_state [B,1280,2]` and `prev_hs [B,2048]` (sketched below)
- 16 routed experts per MoE layer, top-1 routing with MOD skip route
- Context length 131072, `rope_theta=5000000`
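To make the cache contract concrete, here is a hypothetical per-attention-layer state record built from the shapes above. Only the shapes come from this list; the field names and the 128 head dim are assumptions:
```python
# Hypothetical per-layer CCA cache record; shapes taken from the list above.
from dataclasses import dataclass

import numpy as np

B = 1            # batch size
HEAD_DIM = 128   # assumption: hidden 2048 / 16 query heads; not stated upstream

@dataclass
class ZayaCCALayerState:
    """Illustrative cache record; field names are not the runtime's."""
    keys: np.ndarray        # [B, n_kv_heads, T, HEAD_DIM] standard KV cache
    values: np.ndarray      # [B, n_kv_heads, T, HEAD_DIM]
    conv_state: np.ndarray  # [B, 1280, 2] CCA convolution state
    prev_hs: np.ndarray     # [B, 2048] previous hidden state

def init_state(batch: int = B) -> ZayaCCALayerState:
    # KV starts empty along the sequence axis and grows during decoding.
    return ZayaCCALayerState(
        keys=np.zeros((batch, 2, 0, HEAD_DIM), dtype=np.float32),
        values=np.zeros((batch, 2, 0, HEAD_DIM), dtype=np.float32),
        conv_state=np.zeros((batch, 1280, 2), dtype=np.float32),
        prev_hs=np.zeros((batch, 2048), dtype=np.float32),
    )

print(init_state().conv_state.shape)  # (1, 1280, 2)
```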
Quantization
4-bit affine linears + 8-bit embeddings + passthrough router/CCA state tensors.
Passthrough floor for the first release prep:
- `conv_qk.*`, `temp`, norms, residual scaling, the router path, biases, and balancing biases are preserved as float tensors.
- Embeddings and `lm_head` use 8-bit affine quantization in the prepared bundles.
- Text-only ZAYA1-8B has no vision_tower or LoRA tensors.
- `jangtq_runtime.safetensors` is not applicable to MXFP4.
- `mxtq_bits`: null
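One way to picture this policy is as a key filter applied during conversion. A sketch in which the pattern fragments are modeled on the names above and are not guaranteed to match the real tensor keys:
```python
# Illustrative key filter for the quantization policy described above.
# Pattern fragments mimic the names in this section; real keys may differ.
import re

PASSTHROUGH = re.compile(r"(conv_qk\.|(^|\.)temp$|norm|residual|router|bias)")
EIGHT_BIT = re.compile(r"(embed|lm_head)")

def quant_mode(key: str) -> str:
    """Return the quantization mode for a tensor key."""
    if PASSTHROUGH.search(key):
        return "float-passthrough"
    if EIGHT_BIT.search(key):
        return "affine-8bit"
    return "affine-4bit"  # default for linear weights in the MXFP4 bundle

for k in [
    "model.layers.0.zaya_block.experts.switch_mlp.up_proj.weight",
    "model.layers.0.router.weight",
    "lm_head.weight",
]:
    print(k, "->", quant_mode(k))
```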
Bundle Verification
- Safetensor headers scanned.
- Source tensor coverage checked.
- Converted bundles checked for `local_experts` removal.
- Converted expert tensors checked for pre-stacked `switch_mlp` layout.
- JANGTQ sidecars checked for the Swift runtime contract.
- Capabilities verified: `family=zaya`, `supports_thinking=False`, `tool_parser=zaya_xml`.
- Runtime coherence status recorded above.
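The two expert-layout checks can be reproduced locally by scanning the safetensors header, which is a little-endian u64 length followed by a JSON table of tensor keys. A sketch assuming a single shard named `model.safetensors` (real bundles may be sharded):
```python
# Scan a safetensors header for the expert-layout checks, stdlib only.
import json
import pathlib
import struct

path = pathlib.Path("model.safetensors")  # illustrative single-shard name
with path.open("rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]  # u64 header length
    header = json.loads(f.read(header_len))         # JSON key table

keys = [k for k in header if k != "__metadata__"]
assert not any("local_experts" in k for k in keys), "split expert layout leaked through"
assert any("switch_mlp" in k for k in keys), "pre-stacked expert tensors missing"
print(f"{len(keys)} tensor keys; expert layout looks pre-stacked")
```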
Runtime Smoke Tests
Before production use, run short deterministic prompts through the exact target runtime (a minimal harness is sketched after this list):
- "What is 2+2? Answer with only the number."
- "What is the capital of France? Answer with one word."
- One chat-template prompt with thinking disabled.
- One chat-template prompt with thinking enabled and enough output budget for the final answer.
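A sketch of the first two checks with `mlx_lm`, reusing the snippet API from the usage section; this assumes your `mlx_lm` build includes the ZAYA runtime support described above:
```python
# Minimal deterministic smoke test against the quantized bundle.
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/ZAYA1-8B-MXFP4")

checks = [
    ("What is 2+2? Answer with only the number.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]
for question, expected in checks:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}], add_generation_prompt=True
    )
    answer = generate(model, tokenizer, prompt=prompt, max_tokens=16)
    print(f"{question!r} -> {answer!r} (expected {expected!r})")
```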
The first public bundle release records bundle integrity and runtime contract checks. Full generation quality depends on a ZAYA-aware runtime implementation.
Summary
This bundle is Zyphra/ZAYA1-8B quantized for Apple Silicon MLX/JANG runtimes. Use it only with a runtime that correctly implements ZAYA's CCA attention state and MoE routing.
Files
- `config.json` carries `weight_format=mxfp4` and `zaya_expert_layout=split_switch_mlp`.
- `jang_config.json` carries `cache_subtype=zaya_cca`.
- Tokenizer files and chat template are preserved from the upstream source snapshot.
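These fields can be verified programmatically before loading. A minimal sketch over a downloaded bundle directory (the local path is illustrative):
```python
# Sanity-check the bundle metadata described above before loading.
import json
import pathlib

bundle = pathlib.Path("ZAYA1-8B-MXFP4")  # illustrative local bundle path

config = json.loads((bundle / "config.json").read_text())
assert config.get("weight_format") == "mxfp4"
assert config.get("zaya_expert_layout") == "split_switch_mlp"

jang = json.loads((bundle / "jang_config.json").read_text())
assert jang.get("cache_subtype") == "zaya_cca"
print("bundle metadata OK")
```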