Qwopus3.6-27B-v2 GPTQ-Pro FOEM 4-bit g128 ns256 v2
This is a GPTQ-Pro 4-bit quantization of Jackrong/Qwopus3.6-27B-v2, built to make this excellent Qwopus/Qwen3.6 model practical to run in vLLM with GPTQ-Marlin kernels and long-context inference.
The goal is simple: preserve as much of the original model's character and capability as possible while making it efficient enough for single-GPU RTX 3090-class vLLM deployments.
This is not a new fine-tune. It is a quantized derivative of the original Qwopus3.6-27B-v2 model.
Source and credits
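Source model: Jackrong/Qwopus3.6-27B-v2
Quantization methodology and reference recipe: groxaxo/GPTQ-Pro (tooling) and groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit (reference recipe)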
Thanks to Jackrong for the original Qwopus3.6 model, and to groxaxo for GPTQ-Pro and the Qwen3.6 GPTQ-Pro recipe this quantization was aligned with.
Quantization recipe
| Setting | Value |
|---|---|
| Method | GPTQ-Pro / GPTQModel |
| Bits | 4 |
| Group size | 128 |
| Symmetric quantization | true |
| Desc act | false |
| True sequential | true |
| Calibration dataset | WikiText-2 raw train |
| Calibration samples | 256 |
| Sequence length | 2048 |
| MSE | 2.0 |
| Damp percent | 0.05 |
| Damp auto increment | 0.01 |
| FOEM alpha | 0.25 |
| FOEM beta | 0.2 |
| Batch size | 1 |
Preserved modules include vision, lm_head, embeddings, and norms.
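As a rough illustration of what the 4-bit, group-size-128 settings mean for memory, the sketch below estimates the effective bits per quantized weight and the resulting weight footprint. It assumes one fp16 scale and one 4-bit zero point stored per group, and it treats a nominal 27 billion parameters as fully quantized. Both are simplifying assumptions rather than details read from this artifact, and the preserved modules (vision, lm_head, embeddings, norms) will push the real footprint higher.

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Bits per weight including per-group scale and zero-point overhead.

    scale_bits and zero_bits are assumed storage widths, not values read
    from this checkpoint.
    """
    return bits + (scale_bits + zero_bits) / group_size


def weight_footprint_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for num_params quantized weights."""
    return num_params * bits_per_weight / 8 / 2**30


bpw = effective_bits_per_weight(bits=4, group_size=128)
# 27e9 is a nominal parameter count taken from the model name (assumption).
footprint = weight_footprint_gib(27e9, bpw)
print(f"{bpw:.3f} bits/weight, about {footprint:.1f} GiB of quantized weights")
```

Whatever remains of the 24 GB card after weights and runtime overhead is what the KV cache has to fit into, which is one reason the serving setup relies on an fp8 KV cache and a single sequence at a time.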
Validation showed that this artifact preserves MTP-related configuration metadata, but does not include actual mtp.* tensors in model.safetensors.index.json, so this release should be treated as non-MTP for vLLM speculative decoding.
Post-save compatibility patch:
- pad_token_id=248055
- tokenizer class patched to Qwen2TokenizerFast when needed for vLLM compatibility
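A minimal sketch of how such a patch could be applied to a local copy of the checkpoint is shown below. It assumes that pad_token_id lives at the top level of config.json and that the tokenizer class is stored under tokenizer_class in tokenizer_config.json; the actual patch script is not part of this card, and the function names here are hypothetical.

```python
import json
from pathlib import Path

PAD_TOKEN_ID = 248055
TOKENIZER_CLASS = "Qwen2TokenizerFast"


def patch_config(config: dict, tokenizer_config: dict) -> tuple[dict, dict]:
    """Return patched copies of config.json and tokenizer_config.json contents."""
    # Assumption: pad_token_id is a top-level key in config.json.
    patched_config = {**config, "pad_token_id": PAD_TOKEN_ID}
    patched_tokenizer = dict(tokenizer_config)
    # "When needed": only rewrite the class if it differs from the target.
    if patched_tokenizer.get("tokenizer_class") != TOKENIZER_CLASS:
        patched_tokenizer["tokenizer_class"] = TOKENIZER_CLASS
    return patched_config, patched_tokenizer


def patch_checkpoint_dir(model_dir: str) -> None:
    """Apply the patch in place to a local checkpoint directory."""
    root = Path(model_dir)
    config_path = root / "config.json"
    tokenizer_path = root / "tokenizer_config.json"
    config = json.loads(config_path.read_text())
    tokenizer_config = json.loads(tokenizer_path.read_text())
    new_config, new_tokenizer = patch_config(config, tokenizer_config)
    config_path.write_text(json.dumps(new_config, indent=2))
    tokenizer_path.write_text(json.dumps(new_tokenizer, indent=2))
```

Keeping the pure dictionary transformation separate from the file I/O makes the patch easy to check before touching a downloaded checkpoint.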
Intended serving setup
This checkpoint is intended for text-only vLLM serving on RTX 3090-class hardware.
Recommended vLLM options:
vllm serve XReyRobert/Qwopus3.6-27B-v2-GPTQ-Pro-FOEM-4bit-g128-ns256-v2 \
--served-model-name qwopus3.6-27b-v2-gptq-pro-foem-4bit-g128-ns256-v2 \
--language-model-only \
--dtype float16 \
--quantization gptq_marlin \
--disable-custom-all-reduce \
--tensor-parallel-size 1 \
--max-model-len 131072 \
--max-num-seqs 1 \
--kv-cache-dtype fp8_e5m2 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--enable-prefix-caching \
--max-cudagraph-capture-size 32 \
--gpu-memory-utilization 0.95 \
--trust-remote-code
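Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below assumes the server listens on vLLM's default port 8000 on localhost and that the model name matches the --served-model-name value above. The max_tokens default of 1024 is an illustrative bound, not a recommended setting from this card.

```python
import json
import urllib.request

# Assumptions: default vLLM port on localhost; model name as served above.
BASE_URL = "http://localhost:8000/v1/chat/completions"
SERVED_MODEL = "qwopus3.6-27b-v2-gptq-pro-foem-4bit-g128-ns256-v2"


def build_chat_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Build an OpenAI-style chat completion payload with a bounded output length."""
    return {
        "model": SERVED_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, url: str = BASE_URL) -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (requires a running server):
# reply = send_chat_request(build_chat_request("What is the capital of France?"))
# print(reply["choices"][0]["message"]["content"])
```

Setting max_tokens explicitly on every request keeps a single runaway generation from occupying the only sequence slot for an extended time.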
Reasoning / thinking mode
This model preserves Qwen3-style reasoning behavior. The validation workload below was run with thinking enabled.
MTP / speculative decoding status
This ns256-v2 artifact should be considered text-only and non-MTP for vLLM speculative decoding as published. config.json advertises mtp_num_hidden_layers=1, but the weight index does not contain source mtp.* tensors. Enabling vLLM MTP against this unpatched artifact produced essentially zero accepted draft tokens and poor throughput.
A separate experimental follow-up artifact restores real MTP tensors and quantizes the large MTP linears:
XReyRobert/Qwopus3.6-27B-v2-MTP-GPTQ-Pro-v1
That MTP-GPTQ artifact works and reaches good draft acceptance, but it was still slower than this non-MTP baseline on a single RTX 3090. For practical 100k-131k serving on 1x RTX 3090, this ns256-v2 non-MTP artifact remains the preferred choice.
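The missing-tensor check described above can be reproduced against any local checkpoint by inspecting the weight index. The following sketch assumes the standard model.safetensors.index.json layout with a weight_map dictionary and assumes that MTP tensor names begin with the prefix mtp.; other checkpoints may nest them under a different prefix. The tensor names in the example index are made up for illustration.

```python
import json
from pathlib import Path


def mtp_tensor_names(index: dict, prefix: str = "mtp.") -> list[str]:
    """Return tensor names in a safetensors index that start with the prefix."""
    weight_map = index.get("weight_map", {})
    return sorted(name for name in weight_map if name.startswith(prefix))


def load_index(model_dir: str) -> dict:
    """Load model.safetensors.index.json from a local checkpoint directory."""
    return json.loads((Path(model_dir) / "model.safetensors.index.json").read_text())


# Tiny hand-made index for illustration; the tensor names are hypothetical.
example_index = {
    "metadata": {},
    "weight_map": {
        "model.layers.0.mlp.down_proj.qweight": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    },
}
print("MTP tensors present:", bool(mtp_tensor_names(example_index)))
```

An empty result alongside a non-zero mtp_num_hidden_layers in config.json is the mismatch this section describes.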
RTX 3090 validation status
This checkpoint was validated on an RTX 3090 24GB with vLLM, max_model_len=131072, kv_cache_dtype=fp8_e5m2, prefix caching enabled, and thinking enabled.
Observed vLLM multi-turn agent workload metrics:
| Metric | Observed value | Notes |
|---|---|---|
| Requests observed | 15 | Multi-turn agent session calls |
| vLLM request success count | 15/15 | No vLLM errors observed during the sample |
| Average prompt size | 33,172 tokens | Real multi-turn workload |
| Average output size | 322 tokens | Real generated responses |
| Average time to first token | 5.70s | Prometheus TTFT summary |
| Average end-to-end request latency | 13.07s | Includes prefill, decode, and serving overhead |
| Average time per output token | 0.0230s/token | vLLM TPOT summary |
| Decode throughput from TPOT | about 43.5 tok/s | Decode-only estimate |
| Prefix cache hit ratio | 83.2% cumulative | vLLM prefix-cache counters |
| Live 60s prompt throughput | about 1,917 prompt tok/s | Aggregate observed window |
| Live 60s generation throughput | about 19.1 generated tok/s | Aggregate over full window, including prefill and idle mix |
| Live 60s prefix-cache hit ratio | 78.9% | Delta over the observed window |
These are practical multi-turn serving metrics, not a synthetic benchmark. They are useful for RTX 3090-class long-context serving expectations, especially multi-turn usage with prefix caching.
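The relationship between the reported averages can be checked with a short calculation. The sketch below derives decode throughput from the time per output token and compares a simple latency model against the reported end-to-end figure. The model (first token after TTFT, then one TPOT interval per remaining token) is an assumption, and averages taken over different requests will not compose exactly.

```python
# Averages copied from the observed workload metrics.
ttft_s = 5.70            # average time to first token
tpot_s = 0.0230          # average time per output token
avg_output_tokens = 322  # average output size
reported_e2e_s = 13.07   # average end-to-end request latency

# Decode throughput is the reciprocal of time per output token.
decode_tok_per_s = 1.0 / tpot_s

# Simple latency model (an assumption): first token after TTFT, then one
# TPOT interval for each remaining token.
estimated_e2e_s = ttft_s + (avg_output_tokens - 1) * tpot_s

print(f"decode throughput: {decode_tok_per_s:.1f} tok/s")
print(f"estimated end-to-end latency: {estimated_e2e_s:.2f} s "
      f"(reported {reported_e2e_s:.2f} s)")
```

The estimate lands within a few hundredths of a second of the reported latency, which suggests the TTFT, TPOT, and end-to-end averages are mutually consistent for this sample.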
Evaluation and benchmarks
Compatibility notes
This artifact was built and validated for text-only vLLM serving without speculative decoding. Do not enable MTP on this artifact as published; the mtp.* tensors are absent from the weight index. Vision-related modules were not validated for vision use in this release.
Limitations
- Experimental quantization.
- MTP/speculative decoding is not supported by this published artifact because mtp.* tensors are missing.
- Quality has been checked on Jackrong's 350-question MMLU-Pro selected subset only; this is not a full MMLU-Pro evaluation or an official leaderboard submission.
- The single-pass unrestricted run used no explicit max_tokens and exposed one pathological long generation; bounded output limits are recommended for practical serving.
- RTX 3090 metrics above are observed workload numbers, not a controlled benchmark suite.
- Long-context and tool-calling workflows were validated on the described local vLLM/Hermes setup; behavior may vary on other serving stacks, hardware, or generation settings.
References
- Source model: Jackrong/Qwopus3.6-27B-v2
- GPTQ-Pro tooling: groxaxo/GPTQ-Pro
- Reference GPTQ-Pro recipe: groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit
- MMLU-Pro benchmark repository: TIGER-AI-Lab/MMLU-Pro
- MMLU-Pro HF Space / leaderboard: TIGER-Lab/MMLU-Pro
Individual project notice
This repository is an individual research project. It is not affiliated with, sponsored by, or endorsed by any employer or organization.