Instructions to use caiovicentino1/Qwen3.5-27B-HLWQ-Q5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use caiovicentino1/Qwen3.5-27B-HLWQ-Q5 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="caiovicentino1/Qwen3.5-27B-HLWQ-Q5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("caiovicentino1/Qwen3.5-27B-HLWQ-Q5")
model = AutoModelForImageTextToText.from_pretrained("caiovicentino1/Qwen3.5-27B-HLWQ-Q5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use caiovicentino1/Qwen3.5-27B-HLWQ-Q5 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "caiovicentino1/Qwen3.5-27B-HLWQ-Q5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "caiovicentino1/Qwen3.5-27B-HLWQ-Q5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/caiovicentino1/Qwen3.5-27B-HLWQ-Q5

SGLang

How to use caiovicentino1/Qwen3.5-27B-HLWQ-Q5 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "caiovicentino1/Qwen3.5-27B-HLWQ-Q5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "caiovicentino1/Qwen3.5-27B-HLWQ-Q5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "caiovicentino1/Qwen3.5-27B-HLWQ-Q5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "caiovicentino1/Qwen3.5-27B-HLWQ-Q5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use caiovicentino1/Qwen3.5-27B-HLWQ-Q5 with Docker Model Runner:
```
docker model run hf.co/caiovicentino1/Qwen3.5-27B-HLWQ-Q5
```

Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name changes; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

Qwen3.5-27B — HLWQ INT4

Native vLLM. Marlin kernel. Zero plugin.

HLWQ Q5 preprocessing produces better INT4 weights than direct quantization (perplexity 6.56 vs. 6.68 in the comparison below). The result is stored in CompressedTensors format, so vLLM loads it natively.

Quick Start — vLLM (one command)

pip install vllm
vllm serve caiovicentino1/Qwen3.5-27B-HLWQ-Q5 --language-model-only --enforce-eager

That's it. No plugin, no pip install polarquant, no custom code.
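Once the server is up, it can be queried from Python as well as with curl. The following is a minimal sketch using only the standard library; it assumes the server listens on port 8000 as in the curl example earlier, and the helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "caiovicentino1/Qwen3.5-27B-HLWQ-Q5"
BASE_URL = "http://localhost:8000/v1"  # same port as the curl example


def build_chat_request(prompt, max_tokens=64):
    """Build the JSON body for an OpenAI-compatible chat completion call."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url=BASE_URL, timeout=120):
    """POST the request to a running server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# With the server running:
# print(chat("What is the capital of France?"))
```

The network call is left commented out so the snippet can be imported without a live server.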

Tested results:

| GPU | 9B model | 27B model |
|---|---|---|
| A100 80GB | 168 tok/s | not reported |
| RTX PRO 6000 96GB | 44 tok/s | 18 tok/s |

Quick Start — HuggingFace Transformers

pip install polarquant

import polarengine_vllm  # auto-registers with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("caiovicentino1/Qwen3.5-27B-HLWQ-Q5", device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Qwen3.5-27B-HLWQ-Q5", trust_remote_code=True)

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Consumer GPU Compatibility

| GPU | VRAM | Works? | Expected tok/s |
|---|---|---|---|
| RTX 4090 | 24 GB | Yes (tight) | ~10 |
| A100 / H100 | 80 GB | Yes | ~18-50 |
| RTX PRO 6000 | 96 GB | Yes | ~18 |
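A back-of-the-envelope estimate helps explain why 24 GB is "tight". The sketch below assumes a nominal 27e9 parameters, all stored at 4 bits, with one 16-bit scale per group of 128 weights. The group size and scale width are hypothetical values chosen for illustration, not read from this repository, and the estimate ignores the KV cache, activations, and any layers kept in higher precision.

```python
def weight_memory_gb(n_params, bits, group_size=128, scale_bits=16):
    """Rough weight-only footprint in GB (1 GB = 1e9 bytes).

    group_size and scale_bits are illustrative assumptions.
    """
    bits_per_weight = bits + scale_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9


estimate = weight_memory_gb(27e9, bits=4)
print(f"Estimated INT4 weight footprint: {estimate:.1f} GB")
```

Under these assumptions the weights alone take roughly 14 GB, which leaves limited headroom on a 24 GB card for the KV cache and runtime overhead.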

Why HLWQ INT4 is Better

Standard INT4 pipelines (GPTQ/AWQ) quantize the original weights directly, so outlier weights stretch the quantization range and increase rounding error.

HLWQ adds a preprocessing step:

1. Hadamard rotation: distributes weight energy uniformly across coordinates, which suppresses outliers.
2. Lloyd-Max Q5: MSE-optimal quantization for the resulting near-Gaussian distribution.
3. Dequant → INT4: the cleaned weights produce better INT4 than direct quantization.
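The first two steps can be illustrated with a small NumPy sketch. This is not the HLWQ implementation: the block size of 128, the per-row RMS scaling, the pooled codebook, and the synthetic outliers are all assumptions made for the demo, and step 3 (requantizing the cleaned weights to INT4) is not shown.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction); n is a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def lloyd_max(samples, levels, iters=50):
    """1-D Lloyd-Max codebook: alternate nearest-level assignment and mean update."""
    # Initialise the levels at evenly spaced quantiles of the data.
    codebook = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (codebook[:-1] + codebook[1:]) / 2  # decision boundaries
        cell = np.searchsorted(edges, samples)
        for k in range(levels):
            members = samples[cell == k]
            if members.size:
                codebook[k] = members.mean()
    return codebook


def quantize(x, codebook):
    """Snap every value to its nearest codebook level."""
    edges = (codebook[:-1] + codebook[1:]) / 2
    return codebook[np.searchsorted(edges, x)]


def rotate_quantize_roundtrip(w, bits=5):
    """Rotate rows, quantize with a 2**bits-level codebook, rotate back."""
    h = hadamard(w.shape[1])
    rotated = w @ h  # step 1: spread each row's energy
    scale = np.sqrt((rotated ** 2).mean(axis=1, keepdims=True))  # per-row RMS
    codebook = lloyd_max((rotated / scale).ravel(), 2 ** bits)
    dequant = quantize(rotated / scale, codebook) * scale  # step 2: Q5
    return rotated, dequant @ h.T  # h is orthonormal, so h.T undoes it


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))
w[::8, 5] = 25.0  # synthetic outliers, purely for illustration

w_rot, w_hat = rotate_quantize_roundtrip(w)
rel_err = np.mean((w - w_hat) ** 2) / np.mean(w ** 2)
print(f"max |w| before rotation: {np.abs(w).max():.2f}")
print(f"max |w| after rotation:  {np.abs(w_rot).max():.2f}")
print(f"relative MSE after Q5 round trip: {rel_err:.4f}")
```

Because the rotation is orthonormal, it preserves each row's norm while spreading a single large entry over all 128 coordinates, so the peak magnitude the quantizer has to cover drops sharply.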

| Method | PPL (lower = better) |
|---|---|
| BF16 baseline | 6.37 |
| HLWQ → INT4 | 6.56 |
| Direct INT4 | 6.68 |

Same speed as GPTQ/AWQ, since the stored weights are ordinary INT4, with better quality: perplexity is 0.12 lower than direct INT4.
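To put the perplexity numbers in relative terms, the short calculation below uses only the three values from the table. The helper name `relative_increase` is illustrative.

```python
PPL = {"BF16 baseline": 6.37, "HLWQ -> INT4": 6.56, "Direct INT4": 6.68}


def relative_increase(ppl, baseline):
    """Perplexity increase over the baseline, in percent."""
    return (ppl - baseline) / baseline * 100.0


base = PPL["BF16 baseline"]
for name, ppl in PPL.items():
    print(f"{name:14s} PPL={ppl:.2f}  +{relative_increase(ppl, base):.2f}% vs BF16")

# Share of the BF16-to-direct-INT4 gap that the preprocessing recovers.
gap_closed = (PPL["Direct INT4"] - PPL["HLWQ -> INT4"]) / (PPL["Direct INT4"] - base)
print(f"gap closed: {gap_closed:.0%}")
```

By these figures, HLWQ → INT4 is about 3.0% above the BF16 baseline and direct INT4 about 4.9% above it, so the preprocessing recovers a bit under 40% of the quantization loss.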

Important Flags

| Flag | Why |
|---|---|
| `--language-model-only` | Qwen3.5 is multimodal; this flag skips the vision encoder (only the language model was quantized) |
| `--enforce-eager` | Required on Blackwell GPUs (cc 12.0). Optional on A100/H100 (faster without it) |
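For scripted launches, the flag choice can be derived from the GPU's compute capability. The helper below is a hypothetical convenience wrapper, not part of vLLM. It treats compute capability 12.0 and above as needing `--enforce-eager`, which is an assumption, since the table names only cc 12.0.

```python
def vllm_serve_command(model_id, compute_capability, text_only=True):
    """Build the `vllm serve` argument list from the Quick Start.

    compute_capability is a (major, minor) tuple, e.g. (8, 0) for A100.
    """
    cmd = ["vllm", "serve", model_id]
    if text_only:
        cmd.append("--language-model-only")
    # Assumption: the table names only cc 12.0; newer values are treated the same way.
    if tuple(compute_capability) >= (12, 0):
        cmd.append("--enforce-eager")
    return cmd


print(" ".join(vllm_serve_command("caiovicentino1/Qwen3.5-27B-HLWQ-Q5", (12, 0))))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a tuple of this form for the current device.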

Model tree for caiovicentino1/Qwen3.5-27B-HLWQ-Q5

Base model: Qwen/Qwen3.5-27B (this model is a quantized derivative)

Papers for caiovicentino1/Qwen3.5-27B-HLWQ-Q5

PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression. arXiv:2603.29078, published Mar 30, 2026.

PolarQuant: Quantizing KV Caches with Polar Transformation. arXiv:2502.02617, published Feb 4, 2025.
