# Qwen3.6-35B-A3B-REAM-192
REAM-merged variant of Qwen/Qwen3.6-35B-A3B. The number of routed experts in each MoE layer was reduced from 256 → 192 via REAM (Router-weighted Expert Activation Merging), Samsung SAIL Montreal's method that clusters and merges experts (rather than discarding them, like REAP). Vision encoder is preserved unchanged — full multimodal pipeline intact.
35.11B → 27.05B params (−23%) | 192 routed + 1 shared experts/layer | ~3B active per token | VL preserved | bf16
## Status

Pre-evaluation release. The model loads, generates coherent text on smoke-test prompts, and the multimodal stack is structurally intact. lm-eval runs (MMLU, GSM8K, HumanEval) are pending; this repo will be updated once evaluation is complete.
### Known issue: MTP weights missing from safetensors

`text_config.mtp_num_hidden_layers: 1` is declared in `config.json`, but the MTP block weights themselves were not included in this push (`merge_args.mtp_safe_tensors` was unset during the REAM merge run, so the Phase 3 skip path kept the config flag but dropped the tensors). The architecture table below has been corrected.
This will be addressed in a future update: the original (un-merged-against-body) MTP weights from base Qwen/Qwen3.6-35B-A3B will be transplanted into a follow-up safetensors shard. The MTP head's expert routing is self-contained (its own 256-expert MoE block, separate from the body's experts), so the un-merged head sitting on the merged body is mathematically valid; whether it still produces useful draft tokens after the body's expert merge is an open empirical question, to be answered with an MTP-aware GGUF re-quant plus a speculative-decode benchmark.
For text/vision use of this model without speculative decoding, the missing MTP weights have no effect: `AutoModelForCausalLM` and `AutoModelForImageTextToText` ignore the MTP block entirely. The discrepancy only matters for users planning to use MTP speculative decoding.
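The transplant itself is mechanical once the MTP tensor names are known. Below is a minimal sketch, assuming MTP tensors can be identified by a `.mtp.` substring in their names; the real key pattern must be confirmed against the base checkpoint's `model.safetensors.index.json` before running.

```python
import json
from safetensors import safe_open
from safetensors.torch import save_file

BASE = "path/to/Qwen3.6-35B-A3B"  # local snapshot of the base checkpoint
index = json.load(open(f"{BASE}/model.safetensors.index.json"))

# Map each MTP tensor name to the shard that stores it.
# ".mtp." is an assumed key pattern -- verify against the real index.
mtp_keys = {k: v for k, v in index["weight_map"].items() if ".mtp." in k}

tensors = {}
for shard in sorted(set(mtp_keys.values())):
    with safe_open(f"{BASE}/{shard}", framework="pt", device="cpu") as f:
        for key in (k for k, s in mtp_keys.items() if s == shard):
            tensors[key] = f.get_tensor(key)

# Write the untouched MTP block as an extra shard; the merged repo's
# index then needs these weight_map entries appended.
save_file(tensors, "mtp-transplant.safetensors")
```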
## Method

### Expert reduction
REAM merges experts by:
- Computing per-expert REAP saliency, `softmax(router_logits) × ‖expert_output‖₂`, over a calibration set.
- Clustering low-saliency experts into groups around high-saliency centroids (`group_size=32`).
- Merging each cluster into one expert via a permutation-aligned weighted average over weight + activation similarity.

Output: bf16 safetensors, structurally identical to the base model except `num_experts=192`. No quantization applied.
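For concreteness, a minimal sketch of the saliency score from the first step, assuming per-token router logits and per-expert output norms have already been collected during a calibration forward pass (tensor names here are hypothetical):

```python
import torch

def reap_saliency(router_logits: torch.Tensor,
                  expert_out_norms: torch.Tensor) -> torch.Tensor:
    """router_logits: [tokens, num_experts] raw gate logits.
    expert_out_norms: [tokens, num_experts] L2 norm of each expert's
    output per token (zero where the expert was not routed)."""
    gate = torch.softmax(router_logits, dim=-1)   # router weight per expert
    return (gate * expert_out_norms).mean(dim=0)  # average over calibration tokens
```

Experts with the highest saliency become cluster centroids; the remaining experts are merged into their nearest centroid.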
### Calibration
Composite mix tilted toward agentic coding, adapted from atbender's REAP recipe:
| Source | Samples | Why |
|---|---|---|
| SWE-bench/SWE-smith-trajectories (tool) | 1024 | Agentic multi-turn with tool calls |
| Salesforce/xlam-function-calling-60k | 1024 | Single-turn function calling |
| theblackcat102/evol-codealpaca-v1 | 683 | General coding instruction-following |
| open-r1/Mixture-of-Thoughts (code) | 454 | Code reasoning chains |
| open-r1/Mixture-of-Thoughts (math) | 454 | Math reasoning |
| open-r1/Mixture-of-Thoughts (science) | 457 | Science reasoning |
| **Total** | 4096 | 4096 sequences × 2048 tokens ≈ 8.4M tokens |
Sequences token-packed to fill 2048-token context; shuffled with seed 42.
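A minimal sketch of that packing step (a hypothetical helper, not the REAM repo's actual code):

```python
import random

def pack_sequences(token_lists, seq_len=2048, seed=42):
    """token_lists: list of token-id lists, one per calibration sample."""
    random.seed(seed)
    random.shuffle(token_lists)
    flat = [tok for sample in token_lists for tok in sample]
    # Slice into full seq_len windows; the trailing remainder is dropped.
    return [flat[i:i + seq_len] for i in range(0, len(flat) - seq_len + 1, seq_len)]
```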
### Hyperparameters

- `merge_size`: 192 (out of 256)
- `group_size`: 32 (REAM cluster centroid count)
- `saliency`: REAP
- `merging`: `logits+weights` (uses both expert output activations and weight similarity for permutation alignment)
- `grouping`: `ream`
- `gated_sim`: True (softmax over router logits before clustering)
- Seed: 42
### Vision tower & MTP

- Vision encoder: unmodified from the Qwen3.6-35B-A3B base. The encoder is stripped on `AutoModelForCausalLM` load, so post-merge we re-attach it by loading the original VLM, swapping `model.language_model` for the merged version, and saving the combined state (see the sketch after this list).
- MTP layer: ⚠️ config-declared but weights missing from this push (see Known issue above). The intent was to keep the original (un-merged) MTP layer, because REAM's MTP-merge code path only supports Qwen3-style `ModuleList` experts while Qwen3.6 stores MTP experts in a packed format; the keys mismatch, and a merge would zero out the experts. However, the merge run did not pass `mtp_safe_tensors`, so the original weights were never copied into the output. A follow-up update will transplant them from base.
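A minimal sketch of that re-attachment step, assuming the VL wrapper exposes its text backbone at `model.language_model`; the exact attribute paths should be verified against the loaded module tree.

```python
import torch
from transformers import AutoModelForCausalLM, AutoModelForImageTextToText

vlm = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B", dtype=torch.bfloat16, trust_remote_code=True)
merged = AutoModelForCausalLM.from_pretrained(
    "path/to/merged-lm", dtype=torch.bfloat16, trust_remote_code=True)

vlm.model.language_model = merged.model  # swap in the merged text backbone
vlm.config.text_config = merged.config   # carry num_experts=192 into the saved config
vlm.save_pretrained("Qwen3.6-35B-A3B-REAM-192")
```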
## Architecture
| Property | Original (Qwen3.6-35B-A3B) | This model |
|---|---|---|
| Total Parameters | ~35.11B | ~27.05B |
| Active Parameters | ~3B | ~3B |
| Routed Experts per Layer | 256 | 192 |
| Routed per Token | 8 | 8 |
| Shared Expert | 1/layer | 1/layer (preserved) |
| MoE Layers | 40 | 40 |
| Vision Encoder | Yes | Yes (unmodified) |
| MTP Layer | Yes | Config-declared; weights pending transplant from base (see Known Issue) |
| Precision | BF16 | BF16 |
| Disk Size | ~67 GB | ~52 GB |
| Context | 262K | 262K |
## Critical: dtype must be bfloat16

Qwen3.6's GDN (Gated Delta Network) linear attention layers overflow float16 (max 65504), producing silent NaNs. Always use `dtype=torch.bfloat16`. This applies to:

- HF inference: `dtype=torch.bfloat16`
- vLLM: `--dtype bfloat16`
- llama.cpp GGUF conversion: `--outtype bf16`, never `f16`
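A quick demonstration of why f16 is unsafe here (standard PyTorch, nothing model-specific):

```python
import torch

print(torch.finfo(torch.float16).max)   # 65504.0 -- easily exceeded by GDN activations
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38 -- same exponent range as float32
print(torch.tensor([70000.0]).to(torch.float16))  # tensor([inf]) -> NaN downstream
```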
## Usage

### Text-only (CausalLM)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

m = AutoModelForCausalLM.from_pretrained(
    "keithnull/Qwen3.6-35B-A3B-REAM-192",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="eager",
)
t = AutoTokenizer.from_pretrained("keithnull/Qwen3.6-35B-A3B-REAM-192", trust_remote_code=True)

messages = [{"role": "user", "content": "Solve: if 3x + 7 = 22, what is x?"}]
inputs = t.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(m.device)
outputs = m.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(t.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
### Vision-language

The model's architecture is `Qwen3_5MoeForConditionalGeneration`. Use `AutoModelForImageTextToText` to load the full VL wrapper:
```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

processor = AutoProcessor.from_pretrained("keithnull/Qwen3.6-35B-A3B-REAM-192", trust_remote_code=True)
m = AutoModelForImageTextToText.from_pretrained(
    "keithnull/Qwen3.6-35B-A3B-REAM-192",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("path/to/image.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(m.device)
outputs = m.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(processor.tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
### Prerequisites

```bash
pip install "git+https://github.com/huggingface/transformers.git@main"
# Qwen3.6's qwen3_5_moe arch landed in transformers main after the last tagged release.
pip install accelerate torchvision
```

For a ~10× faster GDN forward pass (optional):

```bash
pip install flash-linear-attention causal-conv1d einops
```

(ABI must match your installed torch; use `pip install --no-build-isolation` if the kernels fail to load at runtime.)
## Hardware

- Built on: 1× NVIDIA H100 80GB on RunPod. Phase 3a (the actual REAM merge) took 3h 22m wall-clock with bf16 weights staged on CPU (`device_map='cpu'`) and per-layer chunks shuttled to the GPU; GPU memory held at ~56 GB throughout. Calibration was 4096 sequences × 2048 tokens = 8.4M tokens. A sketch of the layer-shuttling pattern follows this list.
- Runs on: BF16 weights are ~52 GB. With KV cache for 32K context, expect ~60 GB VRAM minimum (H100 80GB, A100 80GB, 2× 48 GB cards with tensor parallelism, or RTX Pro 6000 96GB). For consumer-tier GPUs, downstream quantization (Q4_K_M GGUF or W4A16) is recommended.
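A minimal sketch of that CPU-staging pattern (a hypothetical loop, not REAM's actual merge code):

```python
import torch

def merge_layerwise(model, merge_fn, device="cuda"):
    """Keep the bf16 model on CPU; merge one decoder layer at a time on GPU."""
    for layer in model.model.layers:  # attribute path is indicative
        layer.to(device)              # a single layer fits comfortably in 80 GB
        merge_fn(layer)               # cluster + merge this layer's experts
        layer.to("cpu")               # move the merged layer back to CPU
        torch.cuda.empty_cache()      # free GPU scratch before the next layer
```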
## Evaluation

Todo: eval results (MMLU, GSM8K, HumanEval) will be added once the runs complete.
## Acknowledgements
- Base model: Qwen/Qwen3.6-35B-A3B (Alibaba Qwen team)
- REAM method: Samsung SAIL Montreal — arXiv 2604.04356
- REAP saliency formula (used by REAM for centroid selection): Cerebras Research
- Calibration recipe: adapted from atbender's REAP work (composite agentic-coding mix)
License inherited from the base model (Apache 2.0).
## Reproducibility
Run via github.com/SamsungSAILMontreal/ream with the following local patches, required for Qwen3.6 + transformers v5 compat:

- `ream/moe_utils.py` — chunk the embedding work in `get_moe_input` to avoid a 192 GB GPU OOM at calibration scale; build the mask via a 1-batch dummy for broadcasting; `.cpu()` the accumulated hidden states between layers.
- `ream/qwen3_mtp.py` — set `cfg['mlp_only_layers'] = []` for `Qwen3_5MoeTextConfig` (it lacks this attr; `Qwen3MoeDecoderLayer` reads it).
- `merge.py` — load the model with `attn_implementation='eager'` (transformers v5's `create_causal_mask` returns `None` for SDPA/flash; REAM expects a tensor).
- `merge.py` — expose `--calibration_data_size` and `--calibration_data_seq_len` CLI flags.
- Calibration data must be re-tokenized with Qwen3.6's tokenizer (vocab 248,320; the shipped `qwen3_seed42` files use the Qwen3 vocab of 151,936). See the sketch after this list.
- MTP weights: explicitly pass `--mtp_safe_tensors /path/to/base/mtp/shards` to preserve the original MTP block from base. This was missed in the original run; a follow-up will transplant the weights post-hoc rather than re-running the full 3h merge. The MTP layer is not merged by design (REAM's MTP path expects `ModuleList` experts; Qwen3.6 uses a packed format).
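A minimal sketch of the re-tokenization step (dataset loading elided; the assert guards against accidentally reusing the old vocab):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B", trust_remote_code=True)
assert len(tok) >= 248_000, "wrong tokenizer: looks like the Qwen3 151,936-token vocab"

texts = ["example calibration document"]  # stand-in for the 4096-sample composite mix
token_lists = [tok(t, add_special_tokens=False)["input_ids"] for t in texts]
```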