Qwen3.6-35B-A3B-REAM-192

REAM-merged variant of Qwen/Qwen3.6-35B-A3B. The number of routed experts in each MoE layer was reduced from 256 to 192 via REAM (Router-weighted Expert Activation Merging), Samsung SAIL Montreal's method that clusters and merges experts rather than discarding them (as REAP does). The vision encoder is preserved unchanged; the full multimodal pipeline remains intact.

35.11B → 27.05B params (−23%) | 192 routed + 1 shared experts/layer | ~3B active per token | VL preserved | bf16

Status

Pre-evaluation release. The model loads, generates coherent text on smoke-test prompts, and the multimodal stack is structurally intact. lm-eval runs (MMLU, GSM8K, HumanEval) are pending; this repo will be updated once evaluation is complete.

Known issue: MTP weights missing from safetensors

text_config.mtp_num_hidden_layers: 1 is declared in config.json, but the MTP block weights themselves were not included in this push (merge_args.mtp_safe_tensors was unset during the REAM merge run, so the Phase-3 skip kept the config flag but dropped the tensors). The architecture table below has been corrected.

This will be addressed in a future update: the original (un-merged) MTP weights from the base Qwen/Qwen3.6-35B-A3B will be transplanted into a follow-up safetensors shard. The MTP head's expert routing is self-contained (its own 256-expert MoE block, separate from the body's experts), so the un-merged head sitting on the merged body is mathematically valid. Whether it still produces useful draft tokens after the body's expert merge is an open empirical question, to be answered with an MTP-aware GGUF re-quant and a speculative-decoding benchmark.

For text/vision use of this model without speculative decoding, the missing MTP weights have no effect: AutoModelForCausalLM and AutoModelForImageTextToText ignore the MTP block entirely. The discrepancy only matters if you plan to use MTP-based speculative decoding.

Method

Expert reduction

REAM merges experts by:

  1. Computing per-expert REAP saliency (softmax(router_logits) × ‖expert_output‖₂) over a calibration set.
  2. Clustering low-saliency experts into groups around high-saliency centroids (group_size=32).
  3. Merging each cluster into a single expert via a permutation-aligned weighted average, guided by both weight and activation similarity.

Output: bf16 safetensors, structurally identical to the base model except num_experts=192. No quantization applied.
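
For intuition, a minimal sketch of the step-1 saliency score (tensor names and layouts are hypothetical, not the REAM repository's actual code):

import torch

def reap_saliency(router_logits, expert_outputs):
    # router_logits:  [tokens, num_experts]
    # expert_outputs: [tokens, num_experts, hidden]  (assumed layout)
    gate = torch.softmax(router_logits, dim=-1)   # router weight per expert
    norms = expert_outputs.norm(dim=-1)           # ||expert output||_2 per token/expert
    return (gate * norms).mean(dim=0)             # mean saliency per expert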

Calibration

Composite mix tilted toward agentic coding, adapted from atbender's REAP recipe:

| Source | Samples | Why |
| --- | --- | --- |
| SWE-bench/SWE-smith-trajectories (tool) | 1024 | Agentic multi-turn with tool calls |
| Salesforce/xlam-function-calling-60k | 1024 | Single-turn function calling |
| theblackcat102/evol-codealpaca-v1 | 683 | General coding instruction-following |
| open-r1/Mixture-of-Thoughts (code) | 454 | Code reasoning chains |
| open-r1/Mixture-of-Thoughts (math) | 454 | Math reasoning |
| open-r1/Mixture-of-Thoughts (science) | 457 | Science reasoning |
| Total | 4096 | 4096 sequences × 2048 tokens ≈ 8.4M tokens |

Sequences are token-packed to fill the 2048-token context, then shuffled with seed 42.
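
One plausible reading of that packing step (a sketch, not the recipe's actual code):

import random

def pack_sequences(token_streams, ctx_len=2048, seed=42):
    # Concatenate tokenized samples and slice into fixed-length
    # calibration sequences, then shuffle with a fixed seed.
    buf, packed = [], []
    for toks in token_streams:
        buf.extend(toks)
        while len(buf) >= ctx_len:
            packed.append(buf[:ctx_len])
            buf = buf[ctx_len:]
    random.Random(seed).shuffle(packed)
    return packed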

Hyperparameters

  • merge_size: 192 (out of 256)
  • group_size: 32 (REAM cluster centroid count)
  • saliency: REAP
  • merging: logits+weights (uses both expert output activations and weight similarity for permutation alignment)
  • grouping: ream
  • gated_sim: True (softmax over router logits before clustering)
  • Seed: 42
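
The same settings as a config dict (keys mirror the bullet names above; the repo's exact argument spelling is not verified here):

merge_args = dict(
    merge_size=192,            # target routed-expert count (out of 256)
    group_size=32,             # REAM cluster centroid count
    saliency="reap",
    merging="logits+weights",
    grouping="ream",
    gated_sim=True,
    seed=42,
)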

Vision tower & MTP

  • Vision encoder: unmodified from the Qwen3.6-35B-A3B base. The encoder is stripped on AutoModelForCausalLM load, so post-merge we re-attach it by loading the original VLM, swapping model.language_model for the merged version, and saving the combined state (a sketch follows this list).
  • MTP layer: ⚠️ config-declared but weights missing from this push (see Known Issue above). The intent was to keep the original (un-merged) MTP layer, because REAM's MTP-merge code path only supports Qwen3-style ModuleList experts while Qwen3.6 stores MTP experts in packed format; the keys mismatch and the merge would zero out experts. However, the merge run did not pass mtp_safe_tensors, so the original weights were never copied into the output. A follow-up update will transplant them from base.
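
A hedged sketch of that re-attach step (paths are placeholders; attribute paths follow the note above but are otherwise assumptions):

import torch
from transformers import AutoModelForCausalLM, AutoModelForImageTextToText

# Merged text-only body produced by REAM (hypothetical local path).
merged = AutoModelForCausalLM.from_pretrained(
    "out/ream192-text", dtype=torch.bfloat16, trust_remote_code=True)
# Original VLM: vision tower + un-merged body.
vlm = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B", dtype=torch.bfloat16, trust_remote_code=True)
# Swap in the merged language model and save the combined checkpoint.
vlm.model.language_model = merged.model   # attribute path per the note above
vlm.save_pretrained("out/ream192-vl")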

Architecture

| Property | Original (Qwen3.6-35B-A3B) | This model |
| --- | --- | --- |
| Total Parameters | ~35.11B | ~27.05B |
| Active Parameters | ~3B | ~3B |
| Routed Experts per Layer | 256 | 192 |
| Routed per Token | 8 | 8 |
| Shared Expert | 1/layer | 1/layer (preserved) |
| MoE Layers | 40 | 40 |
| Vision Encoder | Yes | Yes (unmodified) |
| MTP Layer | Yes | Config-declared; weights pending transplant from base (see Known Issue) |
| Precision | BF16 | BF16 |
| Disk Size | ~67 GB | ~52 GB |
| Context | 262K | 262K |

Critical: dtype must be bfloat16

Qwen3.6's GDN (Gated Delta Network) linear attention layers overflow float16 (max 65504), producing silent NaNs. Always use dtype=torch.bfloat16. This applies to:

  • HF inference: dtype=torch.bfloat16
  • vLLM: --dtype bfloat16
  • llama.cpp GGUF conversion: --outtype bf16, never f16
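
A quick illustration of the overflow (the value is illustrative):

import torch

x = torch.tensor(70000.0)                # fp32 value above fp16's representable max
print(torch.finfo(torch.float16).max)    # 65504.0
print(x.to(torch.float16))               # inf -> silent NaNs downstream
print(x.to(torch.bfloat16))              # 70144., bf16 keeps fp32's exponent range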

Usage

Text-only (CausalLM)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

m = AutoModelForCausalLM.from_pretrained(
    "keithnull/Qwen3.6-35B-A3B-REAM-192",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="eager",
)
t = AutoTokenizer.from_pretrained("keithnull/Qwen3.6-35B-A3B-REAM-192", trust_remote_code=True)

messages = [{"role": "user", "content": "Solve: if 3x + 7 = 22, what is x?"}]
inputs = t.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(m.device)
outputs = m.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(t.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Vision-language

The model's architecture is Qwen3_5MoeForConditionalGeneration. Use AutoModelForImageTextToText to load the full VL wrapper:

from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("keithnull/Qwen3.6-35B-A3B-REAM-192", trust_remote_code=True)
m = AutoModelForImageTextToText.from_pretrained(
    "keithnull/Qwen3.6-35B-A3B-REAM-192",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("path/to/image.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(m.device)
outputs = m.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(processor.tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Prerequisites

pip install "git+https://github.com/huggingface/transformers.git@main"
# Qwen3.6's qwen3_5_moe arch landed in transformers main after the last tagged release.
pip install accelerate torchvision

For a ~10× faster GDN forward pass (optional):

pip install flash-linear-attention causal-conv1d einops

(The kernels' ABI must match your installed torch build; use pip install --no-build-isolation if they fail to load at runtime.)

Hardware

  • Built on: 1× NVIDIA H100 80GB on RunPod. Phase 3a (the actual REAM merge) took 3h 22m wall-clock with bf16 weights staged on CPU (device_map='cpu') and per-layer chunks shuttled to the GPU. GPU memory held at ~56 GB throughout. Calibration used 4096 sequences × 2048 tokens ≈ 8.4M tokens.
  • Runs on: BF16 weights are ~52 GB. With KV cache for a 32K context, expect ~60 GB VRAM minimum (H100 80GB, A100 80GB, 2× 48 GB cards with tensor parallelism, or RTX Pro 6000 96GB). For consumer-tier GPUs, downstream quantization (Q4_K_M GGUF or W4A16) is recommended.
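
Back-of-envelope check on the weight footprint (assumes weights dominate; KV/state cache not modeled):

params = 27.05e9                  # merged parameter count
bytes_bf16 = params * 2           # 2 bytes per bf16 param ≈ 54.1e9 bytes
print(bytes_bf16 / 2**30)         # ≈ 50.4 GiB, consistent with the ~52 GB shard total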

Evaluation

TODO: eval results (MMLU, GSM8K, HumanEval) will be added once the runs complete.

License

License inherited from the base model (Apache 2.0).

Reproducibility

Run via github.com/SamsungSAILMontreal/ream with the following local patches, required for Qwen3.6 + transformers v5 compatibility:

  1. ream/moe_utils.py — chunk-embed in get_moe_input to avoid a 192 GB GPU OOM at calibration scale; build the mask via a 1-batch dummy for broadcasting; move accumulated hidden states to CPU (.cpu()) between layers. (A sketch of the chunking idea follows this list.)
  2. ream/qwen3_mtp.py — set cfg['mlp_only_layers'] = [] for Qwen3_5MoeTextConfig (lacks this attr; Qwen3MoeDecoderLayer reads it).
  3. merge.py — load model with attn_implementation='eager' (transformers v5's create_causal_mask returns None for SDPA/flash; REAM expects a tensor).
  4. merge.py — expose --calibration_data_size and --calibration_data_seq_len CLI flags.
  5. Calibration data must be re-tokenized with Qwen3.6's tokenizer (vocab 248,320; the shipped qwen3_seed42 files use Qwen3 vocab 151,936).
  6. MTP weights: must explicitly pass --mtp_safe_tensors /path/to/base/mtp/shards to preserve the original MTP block from base. This was missed in the original run; a follow-up will transplant the weights post-hoc rather than re-running the full 3h merge.

The MTP layer was not merged by design: REAM's MTP merge path expects ModuleList experts, while Qwen3.6 stores them in packed format.
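
The spirit of patch 1 (hypothetical function; the real get_moe_input differs):

import torch

def embed_in_chunks(embed_fn, input_ids, chunk=64, device="cuda"):
    # Embed calibration batches chunk-by-chunk on the GPU, accumulating
    # results on CPU, so the full 4096×2048 activation tensor never
    # materializes in GPU memory at once.
    outs = []
    for i in range(0, input_ids.size(0), chunk):
        ids = input_ids[i:i + chunk].to(device)
        outs.append(embed_fn(ids).cpu())
    return torch.cat(outs, dim=0)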
