# Qwen3.6-35B-A3B-REAM-192
REAM-merged variant of Qwen/Qwen3.6-35B-A3B. The number of routed experts in each MoE layer was reduced from 256 → 192 via REAM (Router-weighted Expert Activation Merging), Samsung SAIL Montreal's method that clusters and merges experts (rather than discarding them, like REAP). Vision encoder is preserved unchanged — full multimodal pipeline intact.
35.11B → 27.05B params (−23%) | 192 routed + 1 shared experts/layer | ~3B active per token | VL preserved | bf16
## Status

Pre-evaluation release. The model loads, generates coherent text on smoke-test prompts, and the multimodal stack is structurally intact. lm-eval runs (MMLU, GSM8K, HumanEval) are pending; this repo will be updated once evaluation is complete.
### Known issue: MTP weights missing from safetensors

`text_config.mtp_num_hidden_layers: 1` is declared in `config.json`, but the MTP block weights themselves were not included in this push (`merge_args.mtp_safe_tensors` was unset during the REAM merge run, so the Phase 3 skip path kept the config flag but dropped the tensors). The architecture table below has been corrected.
This will be addressed in a future update: the original (un-merged-against-body) MTP weights from base Qwen/Qwen3.6-35B-A3B will be transplanted into a follow-up safetensors shard. The MTP head's expert routing is self-contained (its own 256-expert MoE block, separate from the body's experts), so the un-merged head sitting on the merged body is mathematically valid; whether it still produces useful draft tokens after the body's expert merge is an open empirical question, to be answered with an MTP-aware GGUF re-quant plus a speculative-decode benchmark.
For text/vision use of this model without speculative decoding, the missing MTP weights have no effect: `AutoModelForCausalLM` and `AutoModelForImageTextToText` ignore the MTP block entirely. The discrepancy only matters for users planning to use MTP speculative decoding.
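The transplant itself is mechanical once the MTP tensor names are known. Below is a minimal sketch, assuming MTP tensors can be identified by a `.mtp.` substring in their names; the real key pattern must be confirmed against the base checkpoint's `model.safetensors.index.json` before running.

```python
import json
from safetensors import safe_open
from safetensors.torch import save_file

BASE = "path/to/Qwen3.6-35B-A3B"  # local snapshot of the base checkpoint
index = json.load(open(f"{BASE}/model.safetensors.index.json"))

# Map each MTP tensor name to the shard that stores it.
# ".mtp." is an assumed key pattern -- verify against the real index.
mtp_keys = {k: v for k, v in index["weight_map"].items() if ".mtp." in k}

tensors = {}
for shard in sorted(set(mtp_keys.values())):
    with safe_open(f"{BASE}/{shard}", framework="pt", device="cpu") as f:
        for key in (k for k, s in mtp_keys.items() if s == shard):
            tensors[key] = f.get_tensor(key)

# Write the untouched MTP block as an extra shard; the merged repo's
# index then needs these weight_map entries appended.
save_file(tensors, "mtp-transplant.safetensors")
```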
## Method

### Expert reduction
REAM merges experts by:
- Computing per-expert REAP saliency, `softmax(router_logits) × ‖expert_output‖₂`, over a calibration set.
- Clustering low-saliency experts into groups around high-saliency centroids (`group_size=32`).
- Merging each cluster into one expert via a permutation-aligned weighted average over weight + activation similarity.

Output: bf16 safetensors, structurally identical to the base model except `num_experts=192`. No quantization applied.
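For concreteness, a minimal sketch of the saliency score from the first step, assuming per-token router logits and per-expert output norms have already been collected during a calibration forward pass (tensor names here are hypothetical):

```python
import torch

def reap_saliency(router_logits: torch.Tensor,
                  expert_out_norms: torch.Tensor) -> torch.Tensor:
    """router_logits: [tokens, num_experts] raw gate logits.
    expert_out_norms: [tokens, num_experts] L2 norm of each expert's
    output per token (zero where the expert was not routed)."""
    gate = torch.softmax(router_logits, dim=-1)   # router weight per expert
    return (gate * expert_out_norms).mean(dim=0)  # average over calibration tokens
```

Experts with the highest saliency become cluster centroids; the remaining experts are merged into their nearest centroid.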
### Calibration
Composite mix tilted toward agentic coding, adapted from atbender's REAP recipe:
| Source | Samples | Why |
|---|---|---|
| SWE-bench/SWE-smith-trajectories (tool) | 1024 | Agentic multi-turn with tool calls |
| Salesforce/xlam-function-calling-60k | 1024 | Single-turn function calling |
| theblackcat102/evol-codealpaca-v1 | 683 | General coding instruction-following |
| open-r1/Mixture-of-Thoughts (code) | 454 | Code reasoning chains |
| open-r1/Mixture-of-Thoughts (math) | 454 | Math reasoning |
| open-r1/Mixture-of-Thoughts (science) | 457 | Science reasoning |
| **Total** | 4096 | 4096 sequences × 2048 tokens ≈ 8.4M tokens |
Sequences token-packed to fill 2048-token context; shuffled with seed 42.
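A minimal sketch of that packing step (a hypothetical helper, not the REAM repo's actual code):

```python
import random

def pack_sequences(token_lists, seq_len=2048, seed=42):
    """token_lists: list of token-id lists, one per calibration sample."""
    random.seed(seed)
    random.shuffle(token_lists)
    flat = [tok for sample in token_lists for tok in sample]
    # Slice into full seq_len windows; the trailing remainder is dropped.
    return [flat[i:i + seq_len] for i in range(0, len(flat) - seq_len + 1, seq_len)]
```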
### Hyperparameters

- `merge_size`: 192 (out of 256)
- `group_size`: 32 (REAM cluster centroid count)
- `saliency`: REAP
- `merging`: `logits+weights` (uses both expert output activations and weight similarity for permutation alignment)
- `grouping`: `ream`
- `gated_sim`: True (softmax over router logits before clustering)
- Seed: 42
### Vision tower & MTP

- Vision encoder: unmodified from the Qwen3.6-35B-A3B base. The encoder is stripped on `AutoModelForCausalLM` load, so post-merge we re-attach it by loading the original VLM, swapping `model.language_model` for the merged version, and saving the combined state (see the sketch after this list).
- MTP layer: ⚠️ config-declared but weights missing from this push (see Known issue above). The intent was to keep the original (un-merged) MTP layer, because REAM's MTP-merge code path only supports Qwen3-style `ModuleList` experts while Qwen3.6 stores MTP experts in a packed format; the keys mismatch, and a merge would zero out the experts. However, the merge run did not pass `mtp_safe_tensors`, so the original weights were never copied into the output. A follow-up update will transplant them from base.
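A minimal sketch of that re-attachment step, assuming the VL wrapper exposes its text backbone at `model.language_model`; the exact attribute paths should be verified against the loaded module tree.

```python
import torch
from transformers import AutoModelForCausalLM, AutoModelForImageTextToText

vlm = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B", dtype=torch.bfloat16, trust_remote_code=True)
merged = AutoModelForCausalLM.from_pretrained(
    "path/to/merged-lm", dtype=torch.bfloat16, trust_remote_code=True)

vlm.model.language_model = merged.model  # swap in the merged text backbone
vlm.config.text_config = merged.config   # carry num_experts=192 into the saved config
vlm.save_pretrained("Qwen3.6-35B-A3B-REAM-192")
```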
## Architecture
| Property | Original (Qwen3.6-35B-A3B) | This model |
|---|---|---|
| Total Parameters | ~35.11B | ~27.05B |
| Active Parameters | ~3B | ~3B |
| Routed Experts per Layer | 256 | 192 |
| Routed per Token | 8 | 8 |
| Shared Expert | 1/layer | 1/layer (preserved) |
| MoE Layers | 40 | 40 |
| Vision Encoder | Yes | Yes (unmodified) |
| MTP Layer | Yes | Config-declared; weights pending transplant from base (see Known Issue) |
| Precision | BF16 | BF16 |
| Disk Size | ~67 GB | ~52 GB |
| Context | 262K | 262K |
## Critical: dtype must be bfloat16

Qwen3.6's GDN (Gated Delta Network) linear attention layers overflow float16 (max 65504), producing silent NaNs. Always use `dtype=torch.bfloat16`. This applies to:

- HF inference: `dtype=torch.bfloat16`
- vLLM: `--dtype bfloat16`
- llama.cpp GGUF conversion: `--outtype bf16`, never `f16`
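A quick demonstration of why f16 is unsafe here (standard PyTorch, nothing model-specific):

```python
import torch

print(torch.finfo(torch.float16).max)   # 65504.0 -- easily exceeded by GDN activations
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38 -- same exponent range as float32
print(torch.tensor([70000.0]).to(torch.float16))  # tensor([inf]) -> NaN downstream
```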
## Usage

### Text-only (CausalLM)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

m = AutoModelForCausalLM.from_pretrained(
    "keithnull/Qwen3.6-35B-A3B-REAM-192",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="eager",
)
t = AutoTokenizer.from_pretrained("keithnull/Qwen3.6-35B-A3B-REAM-192", trust_remote_code=True)

messages = [{"role": "user", "content": "Solve: if 3x + 7 = 22, what is x?"}]
inputs = t.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(m.device)
outputs = m.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(t.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
### Vision-language

The model's architecture is `Qwen3_5MoeForConditionalGeneration`. Use `AutoModelForImageTextToText` to load the full VL wrapper:
```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

processor = AutoProcessor.from_pretrained("keithnull/Qwen3.6-35B-A3B-REAM-192", trust_remote_code=True)
m = AutoModelForImageTextToText.from_pretrained(
    "keithnull/Qwen3.6-35B-A3B-REAM-192",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("path/to/image.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(m.device)
outputs = m.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(processor.tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
### Prerequisites

```bash
pip install "git+https://github.com/huggingface/transformers.git@main"
# Qwen3.6's qwen3_5_moe arch landed in transformers main after the last tagged release.
pip install accelerate torchvision
```

For a ~10× faster GDN forward pass (optional):

```bash
pip install flash-linear-attention causal-conv1d einops
```

(ABI must match your installed torch; use `pip install --no-build-isolation` if the kernels fail to load at runtime.)
## Hardware

- Built on: 1× NVIDIA H100 80GB on RunPod. Phase 3a (the actual REAM merge) took 3h 22m wall-clock with bf16 weights staged on CPU (`device_map='cpu'`) and per-layer chunks shuttled to the GPU; GPU memory held at ~56 GB throughout. Calibration was 4096 sequences × 2048 tokens = 8.4M tokens. A sketch of the layer-shuttling pattern follows this list.
- Runs on: BF16 weights are ~52 GB. With KV cache for 32K context, expect ~60 GB VRAM minimum (H100 80GB, A100 80GB, 2× 48 GB cards with tensor parallelism, or RTX Pro 6000 96GB). For consumer-tier GPUs, downstream quantization (Q4_K_M GGUF or W4A16) is recommended.
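A minimal sketch of that CPU-staging pattern (a hypothetical loop, not REAM's actual merge code):

```python
import torch

def merge_layerwise(model, merge_fn, device="cuda"):
    """Keep the bf16 model on CPU; merge one decoder layer at a time on GPU."""
    for layer in model.model.layers:  # attribute path is indicative
        layer.to(device)              # a single layer fits comfortably in 80 GB
        merge_fn(layer)               # cluster + merge this layer's experts
        layer.to("cpu")               # move the merged layer back to CPU
        torch.cuda.empty_cache()      # free GPU scratch before the next layer
```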
## Evaluation

Todo: eval results (MMLU, GSM8K, HumanEval) will be added once the runs complete.
## Acknowledgements
- Base model: Qwen/Qwen3.6-35B-A3B (Alibaba Qwen team)
- REAM method: Samsung SAIL Montreal — arXiv 2604.04356
- REAP saliency formula (used by REAM for centroid selection): Cerebras Research
- Calibration recipe: adapted from atbender's REAP work (composite agentic-coding mix)
License inherited from the base model (Apache 2.0).
## Reproducibility
Run via github.com/SamsungSAILMontreal/ream with the following local patches, required for Qwen3.6 + transformers v5 compat:

- `ream/moe_utils.py` — chunk the embedding work in `get_moe_input` to avoid a 192 GB GPU OOM at calibration scale; build the mask via a 1-batch dummy for broadcasting; `.cpu()` the accumulated hidden states between layers.
- `ream/qwen3_mtp.py` — set `cfg['mlp_only_layers'] = []` for `Qwen3_5MoeTextConfig` (it lacks this attr; `Qwen3MoeDecoderLayer` reads it).
- `merge.py` — load the model with `attn_implementation='eager'` (transformers v5's `create_causal_mask` returns `None` for SDPA/flash; REAM expects a tensor).
- `merge.py` — expose `--calibration_data_size` and `--calibration_data_seq_len` CLI flags.
- Calibration data must be re-tokenized with Qwen3.6's tokenizer (vocab 248,320; the shipped `qwen3_seed42` files use the Qwen3 vocab of 151,936). See the sketch after this list.
- MTP weights: explicitly pass `--mtp_safe_tensors /path/to/base/mtp/shards` to preserve the original MTP block from base. This was missed in the original run; a follow-up will transplant the weights post-hoc rather than re-running the full 3h merge. The MTP layer is not merged by design (REAM's MTP path expects `ModuleList` experts; Qwen3.6 uses a packed format).
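A minimal sketch of the re-tokenization step (dataset loading elided; the assert guards against accidentally reusing the old vocab):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B", trust_remote_code=True)
assert len(tok) >= 248_000, "wrong tokenizer: looks like the Qwen3 151,936-token vocab"

texts = ["example calibration document"]  # stand-in for the 4096-sample composite mix
token_lists = [tok(t, add_special_tokens=False)["input_ids"] for t in texts]
```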