Instructions for using XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1 with libraries and local serving apps.

Libraries

How to use XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1 with Transformers:

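# Use a pipeline as a high-level helper
from transformers import pipeline

# The chat messages below mix image and text parts, so the pipeline task is "image-text-to-text"
pipe = pipeline("image-text-to-text", model="XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1")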
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1")
model = AutoModelForImageTextToText.from_pretrained("XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1

SGLang

How to use XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1 with Docker Model Runner:
```
docker model run hf.co/XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1
```

Qwopus3.6-27B-v2 GPTQ-Pro FOEM 4-bit g128 ns256 v2

⚙️ Reproducibility note for 131K vLLM serving

The RTX 3090 131K-context validation reported below depends on a known-good vLLM nightly runtime. Treat this as a serving-runtime compatibility constraint, not necessarily a limitation of the model artifact.

Docker image: docker.io/vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8

vLLM version: 0.20.1rc1.dev16+g7a1eb8ac2

Newer stable or nightly vLLM builds may reject the same --max-model-len 131072 configuration because of CUDA graph memory profiling / admission changes. To reproduce the advertised RTX 3090 long-context setup, start from this runtime and the serving flags below before changing vLLM versions.

This is a GPTQ-Pro 4-bit quantization of Jackrong/Qwopus3.6-27B-v2, built to make this excellent Qwopus/Qwen3.6 model practical to run in vLLM with GPTQ-Marlin kernels and long-context inference.

The goal is simple: preserve as much of the original model's character and capability as possible while making it efficient enough for single-GPU RTX 3090-class vLLM deployments.

This is not a new fine-tune. It is a quantized derivative of the original Qwopus3.6-27B-v2 model.

Source and credits

Source model:

Jackrong/Qwopus3.6-27B-v2

Quantization methodology and reference recipe:

groxaxo/GPTQ-Pro
groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit

Thanks to Jackrong for the original Qwopus3.6 model, and to groxaxo for GPTQ-Pro and the Qwen3.6 GPTQ-Pro recipe this quantization was aligned with.

Quantization recipe

Setting	Value
Method	GPTQ-Pro / GPTQModel
Bits	`4`
Group size	`128`
Symmetric quantization	`true`
Desc act	`false`
True sequential	`true`
Calibration dataset	WikiText-2 raw train
Calibration samples	`256`
Sequence length	`2048`
MSE	`2.0`
Damp percent	`0.05`
Damp auto increment	`0.01`
FOEM alpha	`0.25`
FOEM beta	`0.2`
Batch size	`1`

Preserved modules include vision, lm_head, embeddings, and norms.

Validation showed that this artifact preserves MTP-related configuration metadata, but does not include actual mtp.* tensors in model.safetensors.index.json, so this release should be treated as non-MTP for vLLM speculative decoding.

Post-save compatibility patch:

pad_token_id=248055
tokenizer class patched to Qwen2TokenizerFast when needed for vLLM compatibility
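
A minimal sketch of how such a post-save patch could be applied to a local checkpoint directory. The function name `patch_saved_checkpoint` is hypothetical, and storing `pad_token_id` at the top level of `config.json` is an assumption; the card does not say which file the author edited.

```python
import json
from pathlib import Path


def patch_saved_checkpoint(model_dir: str, pad_token_id: int = 248055) -> None:
    """Apply the two post-save fixes to a local checkpoint directory."""
    root = Path(model_dir)

    # Assumption: pad_token_id lives at the top level of config.json.
    config_path = root / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["pad_token_id"] = pad_token_id
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")

    # tokenizer_config.json names the tokenizer class under "tokenizer_class".
    tok_path = root / "tokenizer_config.json"
    tok = json.loads(tok_path.read_text(encoding="utf-8"))
    if tok.get("tokenizer_class") != "Qwen2TokenizerFast":
        tok["tokenizer_class"] = "Qwen2TokenizerFast"
        tok_path.write_text(json.dumps(tok, indent=2), encoding="utf-8")
```

The tokenizer file is rewritten only when the class differs, which mirrors the "when needed" wording above.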

Intended serving setup

This checkpoint is intended for text-only vLLM serving on RTX 3090-class hardware.

Recommended vLLM options:

vllm serve XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-FOEM-4bit-g128-ns256-v2 \
  --served-model-name qwopus3.6-27b-v2-gptq-pro-foem-4bit-g128-ns256-v2 \
  --language-model-only \
  --dtype float16 \
  --quantization gptq_marlin \
  --disable-custom-all-reduce \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --kv-cache-dtype fp8_e5m2 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --max-cudagraph-capture-size 32 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code

Reasoning / thinking mode

This model preserves Qwen3-style reasoning behavior. The validation workload below was run with thinking enabled.
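
When the server is started with --reasoning-parser qwen3, vLLM separates the reasoning from the final answer itself. For raw text obtained without a reasoning parser, a small helper can do the split on the client side. This sketch assumes the Qwen3 convention of wrapping reasoning in `<think>...</think>` tags, which has not been verified for this checkpoint; `split_thinking` is a hypothetical name.

```python
from typing import Tuple


def split_thinking(raw: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from raw model text.

    Assumes reasoning ends at the first "</think>" tag. The opening
    "<think>" tag is optional because some chat templates place it in
    the prompt rather than in the generated text.
    """
    head, sep, tail = raw.partition("</think>")
    if not sep:
        # No closing tag: treat the whole text as the answer.
        return "", raw.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_thinking("<think>2 + 2 = 4</think>\nThe answer is 4.")
print(answer)
```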

MTP / speculative decoding status

This ns256-v2 artifact should be considered text-only and non-MTP for vLLM speculative decoding as published. config.json advertises mtp_num_hidden_layers=1, but the weight index does not contain source mtp.* tensors. Enabling vLLM MTP against this unpatched artifact produced essentially zero accepted draft tokens and poor throughput.

A separate experimental follow-up artifact restores real MTP tensors and quantizes the large MTP linears:

XReyRobert/Qwopus3.6-27B-v2-MTP-GPTQ-Pro-v1

That MTP-GPTQ artifact works and reaches good draft acceptance, but it was still slower than this non-MTP baseline on a single RTX 3090. For practical 100k-131k serving on 1x RTX 3090, this ns256-v2 non-MTP artifact remains the preferred choice.
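
The missing-tensor check described above can be reproduced against a local copy of model.safetensors.index.json. This is a sketch under two assumptions: the index follows the standard sharded-safetensors layout with a `weight_map` dictionary, and MTP weights would be named with an `mtp.` prefix or contain `.mtp.` somewhere in the name. `find_mtp_tensors` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import List


def find_mtp_tensors(index_path: str) -> List[str]:
    """List tensor names in a safetensors index that look like MTP weights."""
    index = json.loads(Path(index_path).read_text(encoding="utf-8"))
    weight_map = index.get("weight_map", {})
    # Assumption: MTP tensors start with "mtp." or contain ".mtp." in their name.
    return sorted(
        name for name in weight_map
        if name.startswith("mtp.") or ".mtp." in name
    )
```

An empty list for a checkpoint whose config.json sets mtp_num_hidden_layers=1 matches the situation described in this section: the metadata advertises MTP, but there are no weights to back it.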

RTX 3090 validation status

This checkpoint was validated on an RTX 3090 24GB with vLLM, max_model_len=131072, kv_cache_dtype=fp8_e5m2, prefix caching enabled, and thinking enabled.

Observed vLLM multi-turn agent workload metrics:

Metric	Observed value	Notes
Requests observed	`15`	Multi-turn agent session calls
vLLM request success count	`15/15`	No vLLM errors observed during the sample
Average prompt size	`33,172` tokens	Real multi-turn workload
Average output size	`322` tokens	Real generated responses
Average time to first token	`5.70s`	Prometheus TTFT summary
Average end-to-end request latency	`13.07s`	Includes prefill, decode, and serving overhead
Average time per output token	`0.0230s/token`	vLLM TPOT summary
Decode throughput from TPOT	about `43.5 tok/s`	Decode-only estimate
Prefix cache hit ratio	`83.2%` cumulative	vLLM prefix-cache counters
Live 60s prompt throughput	about `1,917 prompt tok/s`	Aggregate observed window
Live 60s generation throughput	about `19.1 generated tok/s`	Aggregate over full window, including prefill and idle mix
Live 60s prefix-cache hit ratio	`78.9%`	Delta over the observed window

These are practical multi-turn serving metrics, not a synthetic benchmark. They are useful for RTX 3090-class long-context serving expectations, especially multi-turn usage with prefix caching.
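
As a rough consistency check on the table above, the decode throughput follows directly from the time per output token, and the average end-to-end latency is close to TTFT plus the decode time for an average-sized response. The sketch below uses only the reported averages; averages of per-request values do not compose exactly, so a small gap is expected.

```python
# Averages reported in the metrics table above.
TTFT_S = 5.70               # average time to first token
TPOT_S = 0.0230             # average time per output token
AVG_OUTPUT_TOKENS = 322     # average output size
OBSERVED_LATENCY_S = 13.07  # average end-to-end request latency


def decode_tokens_per_second(tpot_s: float) -> float:
    """Decode-only throughput implied by the time per output token."""
    return 1.0 / tpot_s


def estimated_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """First-order latency model: prefill (TTFT) plus per-token decode time."""
    return ttft_s + tpot_s * output_tokens


decode_tps = decode_tokens_per_second(TPOT_S)
estimate = estimated_latency_s(TTFT_S, TPOT_S, AVG_OUTPUT_TOKENS)
print(f"decode throughput: {decode_tps:.1f} tok/s")
print(f"estimated latency: {estimate:.2f}s vs observed {OBSERVED_LATENCY_S:.2f}s")
```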

📊 4. Evaluation & Benchmarks

25 May update: single-pass MMLU-Pro selected subset run for the GPTQ-Pro ns256 quantization, plus RTX 3090 vLLM serving metrics.

Metric	Result	Detail
📚 MMLU-Pro subset	90.00%	315 / 350, single-pass unrestricted
🧠 vs Qwopus BF16	+2.57 pp	reference CSV: 306 / 350
⚖️ vs Qwen3.6 BF16	+5.14 pp	reference CSV: 297 / 350
⚡ RTX 3090 vLLM	43.24 tok/s	completion throughput, request wall-time

📚 4.1 MMLU-Pro Selected Subset - 25 May single-pass run

Evaluation format: This uses the same 350-question MMLU-Pro selected subset published in the test_data directory of Jackrong/Qwopus3.6-27B-v2: 7 categories, 50 questions per category. This is not a full MMLU-Pro leaderboard run.

Protocol note: The primary score below is a single-pass local vLLM run over all 350 selected questions. Generation used temperature=1.0, top_p=0.95, no thinking_token_budget, and no explicit max_tokens. Prediction extraction was strict and deterministic: only answers matching The answer is X were counted, with no random fallback.
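
A minimal sketch of what strict, deterministic extraction can look like. The exact pattern used for this run is not published here, so the regex below is an assumption: it accepts a single option letter A to J (MMLU-Pro questions have up to ten options), optionally preceded by an opening parenthesis, and takes the last match in the completion. `extract_answer` is a hypothetical name.

```python
import re
from typing import Optional

# Assumption: a single option letter A-J directly after "The answer is".
ANSWER_PATTERN = re.compile(r"The answer is \(?([A-J])\b")


def extract_answer(completion: str) -> Optional[str]:
    """Return the last explicitly stated option letter, or None.

    There is no random fallback: a completion without the exact
    phrase is scored as unanswered.
    """
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1] if matches else None


print(extract_answer("Comparing the options... The answer is (C)."))
```

Returning None rather than guessing is what makes a runaway generation count as a miss instead of a one-in-ten lucky hit.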

Model / run	Correct / Total	Accuracy	Δ vs Qwen	Δ vs Qwopus
This GPTQ-Pro ns256 artifact	315 / 350	90.00%	+5.14 pp	+2.57 pp
Qwopus3.6-27B-v2 reference CSV	306 / 350	87.43%	+2.57 pp	baseline
Qwen3.6-27B reference CSV	297 / 350	84.86%	baseline	-2.57 pp

Category	Qwen3.6-27B	Qwopus3.6-27B-v2	This GPTQ-Pro ns256
Biology	96%	96%	96%
Business	88%	94%	90%
Computer Science	82%	84%	80%
Mathematics	90%	88%	96%
Physics	76%	86%	94%
Chemistry	74%	80%	90%
Health	88%	84%	84%
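
The per-category table can be cross-checked against the overall scores, since each of the 7 categories contains 50 questions. The short script below recomputes the totals and deltas from the percentages listed above, in the same category order.

```python
QUESTIONS_PER_CATEGORY = 50

# Per-category accuracy (%) in table order: Biology, Business,
# Computer Science, Mathematics, Physics, Chemistry, Health.
category_accuracy_pct = {
    "Qwen3.6-27B": [96, 88, 82, 90, 76, 74, 88],
    "Qwopus3.6-27B-v2": [96, 94, 84, 88, 86, 80, 84],
    "This GPTQ-Pro ns256": [96, 90, 80, 96, 94, 90, 84],
}


def total_correct(percentages):
    """Convert per-category percentages back to a count of correct answers."""
    return sum(round(p * QUESTIONS_PER_CATEGORY / 100) for p in percentages)


totals = {name: total_correct(p) for name, p in category_accuracy_pct.items()}
n_questions = QUESTIONS_PER_CATEGORY * 7
accuracy = {name: 100 * c / n_questions for name, c in totals.items()}

for name in totals:
    print(f"{name}: {totals[name]} / {n_questions} = {accuracy[name]:.2f}%")

delta_vs_qwen = accuracy["This GPTQ-Pro ns256"] - accuracy["Qwen3.6-27B"]
delta_vs_qwopus = accuracy["This GPTQ-Pro ns256"] - accuracy["Qwopus3.6-27B-v2"]
print(f"delta vs Qwen: {delta_vs_qwen:+.2f} pp, vs Qwopus: {delta_vs_qwopus:+.2f} pp")
```

The gain over both references comes mainly from Mathematics, Physics, and Chemistry, while Business and Computer Science are slightly below the Qwopus reference.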

Summary: On this selected 350-question MMLU-Pro evaluation set, this GPTQ-Pro ns256 artifact reached 90.00% accuracy in a single-pass unrestricted local vLLM run. This is above the published Qwopus3.6-27B-v2 reference CSV at 87.43% and the Qwen3.6-27B reference CSV at 84.86%. Because this is a selected subset rather than the full MMLU-Pro benchmark, treat the result as a focused regression/quality check rather than a leaderboard claim.

Scope note: single-pass unrestricted generation exposed one pathological runaway: 129,581 completion tokens, finish_reason=length, and no parsed answer. For practical serving, bounded generation remains recommended even though this unrestricted run is cleaner statistically.
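
For serving, a client can bound generation by setting max_tokens on each request to the OpenAI-compatible endpoint. The sketch below only builds the request, using the Python standard library. The cap of 8192 tokens is an illustrative value, not a recommendation from this card, and the base URL assumes a local vLLM server on port 8000; the default model name is the --served-model-name from the serving flags.

```python
import json
import urllib.request


def build_bounded_chat_request(
    prompt: str,
    max_tokens: int = 8192,  # illustrative cap, chosen for this sketch only
    base_url: str = "http://localhost:8000",  # assumed local vLLM server
    model: str = "qwopus3.6-27b-v2-gptq-pro-foem-4bit-g128-ns256-v2",
) -> urllib.request.Request:
    """Build a chat-completions request with an explicit output-token limit."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_bounded_chat_request("What is the capital of France?")
# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
```

With thinking enabled, the limit covers reasoning tokens as well, so a cap that is too low can cut a response off before the final answer.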

⚡ 4.2 Single-pass Runtime Notes (RTX 3090 / vLLM)

Metric	Observed value
Completion tokens	925,577
Request elapsed sum	21,408.0s / 5h 56m 48s
Completion throughput	43.24 tok/s
Finish reasons	349 stop / 1 length
No parsed answer	1 / 350
Completion tokens p50 / p95 / max	1,131 / 8,989 / 129,581
Long outputs	43 >4096, 20 >8192, 7 >16384
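
The derived figures in this table follow from the raw counts. The short calculation below reproduces the throughput and the elapsed-time formatting, and shows how much of the total output the single runaway generation accounts for.

```python
COMPLETION_TOKENS = 925_577
ELAPSED_S = 21_408.0
RUNAWAY_TOKENS = 129_581  # the single finish_reason=length completion
N_QUESTIONS = 350


def format_hms(seconds: float) -> str:
    """Format a duration in seconds as 'Hh Mm Ss'."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


throughput = COMPLETION_TOKENS / ELAPSED_S
runaway_share = RUNAWAY_TOKENS / COMPLETION_TOKENS
mean_tokens = COMPLETION_TOKENS / N_QUESTIONS

print(f"elapsed: {format_hms(ELAPSED_S)}")
print(f"throughput: {throughput:.2f} tok/s")
print(f"runaway share of all completion tokens: {runaway_share:.1%}")
print(f"mean completion tokens per question: {mean_tokens:.0f}")
```

One request out of 350 produced about 14% of all completion tokens, and the mean is more than twice the p50 of 1,131 tokens, which illustrates why bounded generation is recommended for serving.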

Compatibility notes

This artifact was built and validated for text-only vLLM serving without speculative decoding. Do not enable MTP on this artifact as published; the mtp.* tensors are absent from the weight index. Vision-related modules were not validated for vision use in this release.

Limitations

Experimental quantization.
MTP/speculative decoding is not supported by this published artifact because mtp.* tensors are missing.
Quality has been checked on Jackrong's 350-question MMLU-Pro selected subset only; this is not a full MMLU-Pro evaluation or an official leaderboard submission.
The single-pass unrestricted run used no explicit max_tokens and exposed one pathological long generation; bounded output limits are recommended for practical serving.
RTX 3090 metrics above are observed workload numbers, not a controlled benchmark suite.
Long-context and tool-calling workflows were validated on the described local vLLM/Hermes setup; behavior may vary on other serving stacks, hardware, or generation settings.

References

Source model: Jackrong/Qwopus3.6-27B-v2
GPTQ-Pro tooling: groxaxo/GPTQ-Pro
Reference GPTQ-Pro recipe: groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit
MMLU-Pro benchmark repository: TIGER-AI-Lab/MMLU-Pro
MMLU-Pro HF Space / leaderboard: TIGER-Lab/MMLU-Pro

Individual project notice

This repository is an individual research project. It is not affiliated with, sponsored by, or endorsed by any employer or organization.

🧪 Runtime note: vLLM v0.21.0 stable vs the 131K nightly setup

Validated stable-runtime fallback for RTX 3090-class 24 GB GPUs.

The advertised 131K long-context setup for this GPTQ-Pro checkpoint depends on the known-working vLLM nightly image listed above. With stable vLLM v0.21.0, CUDA graph memory is accounted for by default, so the validated context length on the same GPU class is lower.

Stable v0.21.0 serving flags

--max-model-len 105056 \
--gpu-memory-utilization 0.989 \
--kv-cache-dtype fp8_e5m2 \
--max-num-seqs 1 \
--max-num-batched-tokens 2096 \
--max-cudagraph-capture-size 4

Metric	Observed value
GPU KV cache tokens	105,056
Full-context concurrency	1.00x
Codex-style decode after warmup	40-42 tok/s
Observed prefix-cache hit rate	80%+

If you need the full 131K context window, use the pinned nightly above. If you prefer a current stable vLLM release, start from the 105056 context configuration and validate on your exact GPU/runtime.
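
The choice between the two runtimes can be written as a small lookup. This is only a sketch of the decision rule stated above, using the two context lengths validated on RTX 3090-class hardware; `pick_runtime` and the dictionary layout are hypothetical, and other GPUs or vLLM versions need their own validation.

```python
from typing import Dict, Union

Runtime = Dict[str, Union[str, int]]

# The two configurations described in this card.
STABLE: Runtime = {"runtime": "vLLM v0.21.0 stable", "max_model_len": 105056}
NIGHTLY: Runtime = {
    "runtime": "docker.io/vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8",
    "max_model_len": 131072,
}


def pick_runtime(required_context_tokens: int) -> Runtime:
    """Prefer the stable release unless the full 131K window is needed."""
    if required_context_tokens <= int(STABLE["max_model_len"]):
        return STABLE
    if required_context_tokens <= int(NIGHTLY["max_model_len"]):
        return NIGHTLY
    raise ValueError("No validated configuration supports this context length.")


print(pick_runtime(120_000)["runtime"])
```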
