Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name changes; the algorithm and the weights in this repository are unchanged.
The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).
Qwen3.5-27B — HLWQ INT4
Native vLLM. Marlin kernel. Zero plugin.
HLWQ Q5 preprocessing yields INT4 weights with lower perplexity than direct INT4 quantization (6.56 vs. 6.68). The weights are stored in CompressedTensors format, so vLLM loads them natively.
Quick Start — vLLM (one command)
pip install vllm
vllm serve caiovicentino1/Qwen3.5-27B-HLWQ-Q5 --language-model-only --enforce-eager
That's it. No plugin, no pip install polarquant, no custom code.
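Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server runs on localhost with vLLM's default port 8000; `build_chat_request` and `chat` are illustrative helper names, not part of any library.

```python
import json
import urllib.request

MODEL_ID = "caiovicentino1/Qwen3.5-27B-HLWQ-Q5"
# Assumption: the server runs locally on vLLM's default port.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt, max_tokens=128):
    """Build the JSON payload for a single-turn chat completion."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url=BASE_URL, timeout=120):
    """Send one prompt to the server and return the reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("What is the capital of France?"))` prints the model's reply.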
Tested results:
| GPU | 9B model | 27B model |
|---|---|---|
| A100 80GB | 168 tok/s | not reported |
| RTX PRO 6000 96GB | 44 tok/s | 18 tok/s |
Quick Start — HuggingFace Transformers
pip install polarquant
import polarengine_vllm # auto-registers with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("caiovicentino1/Qwen3.5-27B-HLWQ-Q5", device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Qwen3.5-27B-HLWQ-Q5", trust_remote_code=True)
inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
Consumer GPU Compatibility
| GPU | VRAM | Works? | Expected tok/s |
|---|---|---|---|
| RTX 4090 | 24 GB | YES (tight) | ~10 |
| A100 / H100 | 80 GB | YES | ~18-50 |
| RTX PRO 6000 | 96 GB | YES | ~18 |
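A back-of-the-envelope weight-size estimate helps explain why 24 GB is a tight fit. The sketch below assumes all 27 billion parameters are stored at 4 bits with one 16-bit scale per group of 128 weights; the group size and scale width are assumptions for illustration, not values taken from this repository.

```python
def int4_weight_gib(num_params, group_size=128, scale_bits=16):
    """Approximate size in GiB of group-quantized INT4 weights."""
    # Each weight costs 4 bits plus its share of the per-group scale.
    bits_per_weight = 4 + scale_bits / group_size
    return num_params * bits_per_weight / 8 / 2**30


weights = int4_weight_gib(27e9)
print(f"INT4 weights: ~{weights:.1f} GiB")
print(f"Left on a 24 GiB card: ~{24 - weights:.1f} GiB")
```

Under these assumptions the weights alone come to about 13 GiB. The real footprint is higher, because any layers kept in 16-bit, the KV cache, and activations all add to it.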
Why HLWQ INT4 is Better
Standard INT4 (GPTQ/AWQ) quantizes weights directly, so outlier weights stretch the quantization range and increase rounding error.
HLWQ adds a preprocessing step:
- Hadamard rotation: spreads weight energy uniformly across coordinates, which suppresses outliers
- Lloyd-Max Q5: MSE-optimal 5-bit scalar quantization for the resulting approximately Gaussian distribution
- Dequant → INT4: the cleaned weights are dequantized and re-quantized to INT4, giving better INT4 weights than direct quantization
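The three steps above can be sketched on toy data in pure Python. This is an illustration under stated assumptions, not the repository's implementation: the codebook is fitted to the pooled rotated values rather than to a fixed Gaussian, "dequant" is assumed to include undoing the rotation so the INT4 tensor lives in the original weight space, and INT4 is simulated as symmetric round-to-nearest with one scale per block. All function names are hypothetical.

```python
import math
import random


def hadamard_rotate(block):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    x = list(block)
    n = len(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x[i], x[i + h] = x[i] + x[i + h], x[i] - x[i + h]
        h *= 2
    return [v / math.sqrt(n) for v in x]  # self-inverse: rotating twice is the identity


def nearest(codebook, value):
    """Index of the codebook level closest to value."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - value))


def lloyd_max(samples, levels=32, iters=20):
    """Fit a scalar codebook with Lloyd's algorithm (locally MSE-optimal)."""
    ordered = sorted(samples)
    # Start from evenly spaced quantiles of the data.
    codebook = [ordered[(2 * k + 1) * len(ordered) // (2 * levels)] for k in range(levels)]
    for _ in range(iters):
        sums, counts = [0.0] * levels, [0] * levels
        for v in samples:
            k = nearest(codebook, v)
            sums[k] += v
            counts[k] += 1
        # Move each level to the mean of the samples assigned to it.
        codebook = [sums[k] / counts[k] if counts[k] else codebook[k] for k in range(levels)]
    return codebook


def int4_round_trip(block):
    """Simulated symmetric round-to-nearest INT4 with one scale per block."""
    scale = max(abs(v) for v in block) / 7 or 1.0
    return [max(-8, min(7, round(v / scale))) * scale for v in block]


def hlwq_style_int4(blocks):
    """Rotate, quantize to 5 bits, dequantize, undo the rotation, then quantize to INT4."""
    rotated = [hadamard_rotate(b) for b in blocks]
    codebook = lloyd_max([v for b in rotated for v in b])  # 32 levels = 5 bits
    q5 = [[codebook[nearest(codebook, v)] for v in b] for b in rotated]
    cleaned = [hadamard_rotate(b) for b in q5]  # assumed inverse rotation
    return [int4_round_trip(b) for b in cleaned]


random.seed(0)
# Toy weights: Gaussian noise with one large outlier per block.
blocks = [[random.gauss(0, 0.02) for _ in range(64)] for _ in range(16)]
for b in blocks:
    b[0] = 0.5
int4_weights = hlwq_style_int4(blocks)
```

The outlier-spreading effect is visible on a tiny input: `hadamard_rotate([1, 0, 0, 0])` returns `[0.5, 0.5, 0.5, 0.5]`, so a single large entry becomes four equal, smaller ones.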
| Method | PPL (lower = better) |
|---|---|
| BF16 baseline | 6.37 |
| HLWQ → INT4 | 6.56 |
| Direct INT4 | 6.68 |
Inference speed matches GPTQ/AWQ INT4, while perplexity is 0.12 lower than direct INT4 (6.56 vs. 6.68) and 0.19 above the BF16 baseline.
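To put the perplexity figures in relative terms, the short calculation below expresses each INT4 result as a percentage increase over the BF16 baseline. It uses only the three numbers from the table; `relative_increase` is an illustrative helper name.

```python
baseline, hlwq_int4, direct_int4 = 6.37, 6.56, 6.68


def relative_increase(ppl, reference=baseline):
    """Perplexity increase over the reference, in percent."""
    return 100 * (ppl - reference) / reference


print(f"HLWQ -> INT4: +{relative_increase(hlwq_int4):.1f}%")
print(f"Direct INT4:  +{relative_increase(direct_int4):.1f}%")
```

This works out to roughly a 3.0% increase for HLWQ → INT4 versus 4.9% for direct INT4.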
Important Flags
| Flag | Why |
|---|---|
| `--language-model-only` | Qwen3.5 is multimodal; this flag skips the vision encoder (only the text weights were quantized) |
| `--enforce-eager` | Required on Blackwell GPUs (compute capability 12.0). Optional on A100/H100, where omitting it is faster |
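The flag choice can be scripted when the same launch script runs on different GPUs. The sketch below is a hypothetical helper that adds `--enforce-eager` only for compute capability 12.0, exactly as the table states; how other architectures behave is not covered by this model card, so the helper leaves the flag off for them.

```python
import shlex

MODEL_ID = "caiovicentino1/Qwen3.5-27B-HLWQ-Q5"


def vllm_serve_command(compute_capability, model_id=MODEL_ID):
    """Build the `vllm serve` argument list for a (major, minor) compute capability."""
    args = ["vllm", "serve", model_id, "--language-model-only"]
    # Per the table: eager mode is required on Blackwell (12.0), optional elsewhere.
    if tuple(compute_capability) == (12, 0):
        args.append("--enforce-eager")
    return args


print(shlex.join(vllm_serve_command((12, 0))))  # Blackwell
print(shlex.join(vllm_serve_command((8, 0))))   # A100
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the current GPU and can be passed in directly.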
Links
- Paper: arxiv.org/abs/2603.29078
- GitHub: github.com/caiovicentino/polarengine-vllm
- PyPI: `pip install polarquant`
- Base model: Qwen/Qwen3.5-27B