Qwopus3.6-27B-v2 GPTQ-Pro ns256 v2 RTX 3090 benchmark

Qwopus3.6-27B-v2 GPTQ-Pro FOEM 4-bit g128 ns256 v2

⚙️ Reproducibility note for 131K vLLM serving

Known-good runtime required for the RTX 3090 131K validation below.
The RTX 3090 131K-context validation reported below depends on a known-good vLLM nightly runtime. Treat this as a serving-runtime compatibility constraint, not necessarily a model artifact limitation.
docker.io/vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8
vLLM 0.20.1rc1.dev16+g7a1eb8ac2
Newer stable or nightly vLLM builds may reject the same --max-model-len 131072 configuration because of CUDA graph memory profiling / admission changes. To reproduce the advertised RTX 3090 long-context setup, start from this runtime and the serving flags below before changing vLLM versions.

This is a GPTQ-Pro 4-bit quantization of Jackrong/Qwopus3.6-27B-v2, built to make this excellent Qwopus/Qwen3.6 model practical to run in vLLM with GPTQ-Marlin kernels and long-context inference.

The goal is simple: preserve as much of the original model's character and capability as possible while making it efficient enough for single-GPU RTX 3090-class vLLM deployments.

This is not a new fine-tune. It is a quantized derivative of the original Qwopus3.6-27B-v2 model.

Source and credits

Source model:

Quantization methodology and reference recipe:

Thanks to Jackrong for the original Qwopus3.6 model, and to groxaxo for GPTQ-Pro and the Qwen3.6 GPTQ-Pro recipe this quantization was aligned with.

Quantization recipe

Setting Value
Method GPTQ-Pro / GPTQModel
Bits 4
Group size 128
Symmetric quantization true
Desc act false
True sequential true
Calibration dataset WikiText-2 raw train
Calibration samples 256
Sequence length 2048
MSE 2.0
Damp percent 0.05
Damp auto increment 0.01
FOEM alpha 0.25
FOEM beta 0.2
Batch size 1

Preserved modules include vision, lm_head, embeddings, and norms.

Validation showed that this artifact preserves MTP-related configuration metadata, but does not include actual mtp.* tensors in model.safetensors.index.json, so this release should be treated as non-MTP for vLLM speculative decoding.

Post-save compatibility patch:

  • pad_token_id=248055
  • tokenizer class patched to Qwen2TokenizerFast when needed for vLLM compatibility

Intended serving setup

This checkpoint is intended for text-only vLLM serving on RTX 3090-class hardware.

Recommended vLLM options:

vllm serve XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-FOEM-4bit-g128-ns256-v2 \
  --served-model-name qwopus3.6-27b-v2-gptq-pro-foem-4bit-g128-ns256-v2 \
  --language-model-only \
  --dtype float16 \
  --quantization gptq_marlin \
  --disable-custom-all-reduce \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --kv-cache-dtype fp8_e5m2 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --max-cudagraph-capture-size 32 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code

Reasoning / thinking mode

This model preserves Qwen3-style reasoning behavior. The validation workload below was run with thinking enabled.

MTP / speculative decoding status

This ns256-v2 artifact should be considered text-only and non-MTP for vLLM speculative decoding as published. config.json advertises mtp_num_hidden_layers=1, but the weight index does not contain source mtp.* tensors. Enabling vLLM MTP against this unpatched artifact produced essentially zero accepted draft tokens and poor throughput.

A separate experimental follow-up artifact restores real MTP tensors and quantizes the large MTP linears:

XReyRobert/Qwopus3.6-27B-v2-MTP-GPTQ-Pro-v1

That MTP-GPTQ artifact works and reaches good draft acceptance, but it was still slower than this non-MTP baseline on a single RTX 3090. For practical 100k-131k serving on 1x RTX 3090, this ns256-v2 non-MTP artifact remains the preferred choice.

RTX 3090 validation status

This checkpoint was validated on an RTX 3090 24GB with vLLM, max_model_len=131072, kv_cache_dtype=fp8_e5m2, prefix caching enabled, and thinking enabled.

Observed vLLM multi-turn agent workload metrics:

Metric Observed value Notes
Requests observed 15 Multi-turn agent session calls
vLLM request success count 15/15 No vLLM errors observed during the sample
Average prompt size 33,172 tokens Real multi-turn workload
Average output size 322 tokens Real generated responses
Average time to first token 5.70s Prometheus TTFT summary
Average end-to-end request latency 13.07s Includes prefill, decode, and serving overhead
Average time per output token 0.0230s/token vLLM TPOT summary
Decode throughput from TPOT about 43.5 tok/s Decode-only estimate
Prefix cache hit ratio 83.2% cumulative vLLM prefix-cache counters
Live 60s prompt throughput about 1,917 prompt tok/s Aggregate observed window
Live 60s generation throughput about 19.1 generated tok/s Aggregate over full window, including prefill and idle mix
Live 60s prefix-cache hit ratio 78.9% Delta over the observed window

These are practical multi-turn serving metrics, not a synthetic benchmark. They are useful for RTX 3090-class long-context serving expectations, especially multi-turn usage with prefix caching.

📊 4. Evaluation & Benchmarks

📊 Evaluation & Performance Metrics

25 May update: single-pass MMLU-Pro selected subset run for the GPTQ-Pro ns256 quantization, plus RTX 3090 vLLM serving metrics.

📚 MMLU-Pro Subset 90.00% 315 / 350 single-pass unrestricted
🧠 vs Qwopus BF16 +2.57 pp 306 / 350 reference CSV
⚖️ vs Qwen3.6 BF16 +5.14 pp 297 / 350 reference CSV
⚡ RTX 3090 vLLM 43.24 completion tok/s, request wall-time
📚 4.1 MMLU-Pro Selected Subset - 25 May single-pass run
Evaluation format: This uses the same 350-question MMLU-Pro selected subset published in the test_data directory of Jackrong/Qwopus3.6-27B-v2: 7 categories, 50 questions per category. This is not a full MMLU-Pro leaderboard run.
Protocol note: The primary score below is a single-pass local vLLM run over all 350 selected questions. Generation used temperature=1.0, top_p=0.95, no thinking_token_budget, and no explicit max_tokens. Prediction extraction was strict and deterministic: only answers matching The answer is X were counted, with no random fallback.
Model / run Correct / Total Accuracy Δ vs Qwen Δ vs Qwopus
This GPTQ-Pro ns256 artifact 315 / 350 90.00% +5.14 pp +2.57 pp
Qwopus3.6-27B-v2 reference CSV 306 / 350 87.43% +2.57 pp baseline
Qwen3.6-27B-v2 reference CSV 297 / 350 84.86% baseline -2.57 pp
Category Qwen3.6-27B Qwopus3.6-27B-v2 This GPTQ-Pro ns256
Biology96%96%96%
Business88%94%90%
Computer Science82%84%80%
Mathematics90%88%96%
Physics76%86%94%
Chemistry74%80%90%
Health88%84%84%

Summary: On this selected 350-question MMLU-Pro evaluation set, this GPTQ-Pro ns256 artifact reached 90.00% accuracy in a single-pass unrestricted local vLLM run. This is above the published Qwopus3.6-27B-v2 reference CSV at 87.43% and Qwen3.6-27B-v2 at 84.86%. Because this is a selected subset rather than the full MMLU-Pro benchmark, treat the result as a focused regression/quality check rather than a leaderboard claim.

Scope note: single-pass unrestricted generation exposed one pathological runaway: 129,581 completion tokens, finish_reason=length, and no parsed answer. For practical serving, bounded generation remains recommended even though this unrestricted run is cleaner statistically.
4.2 Single-pass Runtime Notes (RTX 3090 / vLLM)
Metric Observed value
Completion tokens925,577
Request elapsed sum21,408.0s / 5h 56m 48s
Completion throughput43.24 tok/s
Finish reasons349 stop / 1 length
No parsed answer1 / 350
Completion tokens p50 / p95 / max1,131 / 8,989 / 129,581
Long outputs43 >4096, 20 >8192, 7 >16384

Compatibility notes

This artifact was built and validated for text-only vLLM serving without speculative decoding. Do not enable MTP on this artifact as published; the mtp.* tensors are absent from the weight index. Vision-related modules were not validated for vision use in this release.

Limitations

  • Experimental quantization.
  • MTP/speculative decoding is not supported by this published artifact because mtp.* tensors are missing.
  • Quality has been checked on Jackrong's 350-question MMLU-Pro selected subset only; this is not a full MMLU-Pro evaluation or an official leaderboard submission.
  • The single-pass unrestricted run used no explicit max_tokens and exposed one pathological long generation; bounded output limits are recommended for practical serving.
  • RTX 3090 metrics above are observed workload numbers, not a controlled benchmark suite.
  • Long-context and tool-calling workflows were validated on the described local vLLM/Hermes setup; behavior may vary on other serving stacks, hardware, or generation settings.

References

Individual project notice

This repository is an individual research project. It is not affiliated with, sponsored by, or endorsed by any employer or organization.


🧪 Runtime note: vLLM v0.21.0 stable vs the 131K nightly setup

Validated stable-runtime fallback for RTX 3090-class 24 GB GPUs.
The advertised 131K long-context setup for this GPTQ-Pro checkpoint depends on the known-working vLLM nightly image listed above. With stable vLLM v0.21.0, CUDA graph memory is accounted for by default, so the validated context length on the same GPU class is lower.
Stable v0.21.0 serving flags
--max-model-len 105056 \
--gpu-memory-utilization 0.989 \
--kv-cache-dtype fp8_e5m2 \
--max-num-seqs 1 \
--max-num-batched-tokens 2096 \
--max-cudagraph-capture-size 4
105,056
GPU KV cache tokens
1.00x
full-context concurrency
40-42 tok/s
Codex-style decode after warmup
80%+
observed prefix-cache hit rate
If you need the full 131K context window, use the pinned nightly above. If you prefer a current stable vLLM release, start from the 105056 context configuration and validate on your exact GPU/runtime.
Downloads last month
1,102
Safetensors
Model size
27B params
Tensor type
BF16
·
I32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-v1

Quantized
(42)
this model