Use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="LibertAIDAI/Qwen3.6-27B-NVFP4-GGUF",
	filename="Qwen3.6-27B-NVFP4-Q4_K_M.gguf",  # one of the GGUF files listed under Files
)
llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Qwen3.6-27B NVFP4 GGUF

NVFP4 GGUF quantizations of Qwen/Qwen3.6-27B, produced for use with llama.cpp.

About LibertAI

LibertAI is a decentralized AI platform — private inference, an OpenAI-compatible API, and a chat UI, all running on community GPUs over Aleph Cloud instead of a single company's servers. No accounts required to chat, no logs sent home, and the same models you'd self-host are available behind a sovereign endpoint.

If you want to put this model (or any other) to work as an autonomous agent without running your own infrastructure, check out LiberClaw — Hermes-style agents hosted on Aleph Cloud with LibertAI inference. Free tier: 2 agents, no credit card, 5 minutes to deploy. Open source.

The FFN tensors are quantized to NVFP4 (NVIDIA's 4-bit float with E4M3 block scale), repacked from mmangkad/Qwen3.6-27B-NVFP4 (NVIDIA ModelOpt calibration). The remaining tensors (attention projections, SSM linear_attn blocks, embeddings, output) use a conventional GGUF quant — three variants are provided.
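To make the format concrete, here is a minimal sketch of block-scaled 4-bit float quantization in the NVFP4 style. It assumes 16-element blocks and the E2M1 value grid (0, 0.5, 1, 1.5, 2, 3, 4, 6 plus a sign bit), which is how NVIDIA describes NVFP4. As a simplification, the per-block scale is kept as a Python float instead of being rounded to E4M3. The function names are hypothetical, and this is not llama.cpp's storage layout or kernel code.

```python
# Illustrative only: NVFP4-style block quantization in pure Python.
# Assumptions: 16 values per block, E2M1 magnitudes, scale left unrounded
# (real NVFP4 stores the block scale as FP8 E4M3).

E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes for codes 0..7
BLOCK_SIZE = 16


def quantize_block(block):
    """Return (scale, codes): one shared scale plus a 4-bit code per value."""
    if len(block) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(block)}")
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto 6.0, the largest E2M1 value.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        # Nearest representable magnitude (ties go to the smaller one).
        idx = min(range(len(E2M1_VALUES)), key=lambda i: abs(E2M1_VALUES[i] - mag))
        sign = 1 if x < 0 else 0
        codes.append((sign << 3) | idx)  # bit 3 = sign, bits 0..2 = magnitude index
    return scale, codes


def dequantize_block(scale, codes):
    """Invert quantize_block: look up each magnitude, apply sign and scale."""
    out = []
    for code in codes:
        mag = E2M1_VALUES[code & 0x7] * scale
        out.append(-mag if code & 0x8 else mag)
    return out
```

Because the grid is coarser near the top of its range (the step from 4 to 6), the worst-case rounding error within a block is one unit of the block scale, which is why a small per-block scale matters for accuracy.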

Why NVFP4? On NVIDIA Blackwell GPUs (RTX 50-series, B100/B200), llama.cpp uses native NVFP4 tensor-core MMA kernels (added in llama.cpp #22196) for the FFN matmul, the dominant compute cost during inference. On older GPUs the path falls back to dp4a/MMQ kernels, where these GGUFs still run but offer no performance advantage over standard K-quants.

Files

| File | Size | FFN | Other tensors | When to pick |
| --- | --- | --- | --- | --- |
| Qwen3.6-27B-NVFP4-Q4_K_M.gguf | 15 GB | NVFP4 | Q4_K_M | Recommended. Fastest serving throughput on Blackwell + smallest VRAM footprint |
| Qwen3.6-27B-NVFP4-Q8_0.gguf | 19 GB | NVFP4 | Q8_0 | Higher precision attention/embeddings if you have the VRAM |
| Qwen3.6-27B-NVFP4-BF16.gguf | 28 GB | NVFP4 | BF16 | Max quality (preserves source precision for non-FFN tensors); slower in practice, so only pick it if you need bit-for-bit source fidelity |
| mmproj-Qwen3.6-27B-F16.gguf | 889 MB | n/a | F16 vision tower | Required for image/video input; reusable with any Qwen3.6-27B GGUF, not NVFP4-specific |
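As a rough sketch of how the file list maps to downloads, the helper below returns the file names needed for a given variant and builds direct download URLs. The file names and repository ID come from this card. The `resolve/<revision>/` URL pattern and the `main` revision are assumptions based on the usual Hugging Face layout, and the function names are hypothetical.

```python
# Illustrative helper: which files to fetch for a variant, and from where.
# Assumption: files live on the "main" revision of the repository.

REPO_ID = "LibertAIDAI/Qwen3.6-27B-NVFP4-GGUF"
VARIANTS = ("Q4_K_M", "Q8_0", "BF16")
MMPROJ_FILE = "mmproj-Qwen3.6-27B-F16.gguf"


def files_for(variant="Q4_K_M", vision=False):
    """Return the GGUF file names needed for one variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    names = [f"Qwen3.6-27B-NVFP4-{variant}.gguf"]
    if vision:
        # The vision projector is a separate file shared by all variants.
        names.append(MMPROJ_FILE)
    return names


def download_url(filename, revision="main"):
    """Build a direct download URL for a file in the repository."""
    return f"https://huggingface.co/{REPO_ID}/resolve/{revision}/{filename}"


# Example: URLs for the recommended variant with image input enabled.
urls = [download_url(name) for name in files_for("Q4_K_M", vision=True)]
```

Text-only use needs just the model file; the mmproj file is only required when images or video are sent to the model.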

Performance

Measured on an NVIDIA RTX 5090 (32 GB, Blackwell, sm_120), llama.cpp build c84e6d6db.

Batched serving (llama-batched-bench, 512 in / 128 out per request)

NVFP4-Q4_K_M vs stock Q4_K_M on RTX 5090

NVFP4-Q4_K_M matches or beats stock Q4_K_M on total serving throughput at every parallel batch size we tested (+9% / +0% / +8% / +2% at 1 / 4 / 8 / 16 sequences), with the largest token-generation gains at single stream (+12%) and 8 parallel sequences (+14%). It also uses less VRAM (14.7 vs 16.3 GiB), leaving more room for KV cache.

Variant comparison (same hardware)

| Variant | Size | PP512 (tok/s) | TG64 (tok/s) |
| --- | --- | --- | --- |
| NVFP4-Q4_K_M | 14.72 GiB | 2865 | 64 |
| NVFP4-Q8_0 | 18.65 GiB | 3346 | 64 |
| NVFP4-BF16 | 27.19 GiB | 1403 | 49 |

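The Q4_K_M variant is the best overall pick: it matches Q8_0 on token generation (64 tok/s) in the smallest footprint, although Q8_0 posts the higher prompt-processing number in this table. The BF16 variant is included for completeness but pays a real bandwidth cost; pick it only if you need maximum precision on the non-FFN tensors and throughput does not matter.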
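One way to read the variant comparison is to turn the two rates into a rough single-request time. The sketch below uses the figures reported above and assumes that prompt processing and token generation run back to back at constant rates. That ignores startup overhead and the way rates change with context length, so treat the results as a relative comparison rather than a prediction. The names are hypothetical.

```python
# Rough single-request estimate from the reported PP512 / TG64 rates.
# Assumption: time = prompt_tokens / PP rate + new_tokens / TG rate.

BENCH = {
    "NVFP4-Q4_K_M": {"size_gib": 14.72, "pp_tok_s": 2865, "tg_tok_s": 64},
    "NVFP4-Q8_0": {"size_gib": 18.65, "pp_tok_s": 3346, "tg_tok_s": 64},
    "NVFP4-BF16": {"size_gib": 27.19, "pp_tok_s": 1403, "tg_tok_s": 49},
}


def estimate_seconds(variant, prompt_tokens, new_tokens):
    """Estimate seconds to ingest a prompt and then generate new tokens."""
    row = BENCH[variant]
    return prompt_tokens / row["pp_tok_s"] + new_tokens / row["tg_tok_s"]


# Compare the variants on the benchmark's own shape: 512 in, 64 out.
for name in BENCH:
    seconds = estimate_seconds(name, prompt_tokens=512, new_tokens=64)
    print(f"{name}: ~{seconds:.2f} s, {BENCH[name]['size_gib']} GiB")
```

For short prompts the generation term dominates, so the two quantized variants land close together while BF16 trails on both terms.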

Usage

Text-only (CLI)

llama-cli -m Qwen3.6-27B-NVFP4-Q8_0.gguf -ngl 999 -c 8192 -p "Your prompt here"

Multimodal (server, vision + text)

llama-server \
  -m Qwen3.6-27B-NVFP4-Q8_0.gguf \
  --mmproj mmproj-Qwen3.6-27B-F16.gguf \
  -ngl 999 -c 32768 \
  --host 0.0.0.0 --port 8080

Then POST to /v1/chat/completions with image content blocks — see the llama.cpp multimodal docs.
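A minimal sketch of such a request from Python, using only the standard library. It assumes the server from the command above is reachable at http://localhost:8080 and accepts OpenAI-style `image_url` content blocks, as in the llama-cpp-python example at the top of this card. The function names and the `max_tokens` default are illustrative.

```python
# Illustrative client for the llama-server started above.
# Assumptions: server on localhost:8080, OpenAI-style chat completions schema.
import json
import urllib.request


def build_vision_request(prompt, image_url, max_tokens=128):
    """Build a chat completion body with one text block and one image block."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def post_chat(payload, base_url="http://localhost:8080"):
    """POST the payload to /v1/chat/completions and return the parsed JSON."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (requires a running server):
#   payload = build_vision_request("Describe this image in one sentence.",
#                                  "https://example.com/photo.jpg")
#   reply = post_chat(payload)
#   print(reply["choices"][0]["message"]["content"])
```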

Recommended sampler

Qwen3.6 is a thinking model, and the default chat template enables <think> blocks. For non-thinking usage, pass --reasoning off (in llama-cli) or set chat_template_kwargs.enable_thinking=false in the API.
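For the API route, a small sketch of how the setting could be attached to a request body. It assumes `chat_template_kwargs` is passed as a top-level field of the chat completion request, which is how the setting above reads. The helper name is hypothetical.

```python
# Illustrative helper: turn thinking off for one chat completion request.
# Assumption: chat_template_kwargs is a top-level field of the request body.
import json


def without_thinking(payload):
    """Return a copy of the request body with enable_thinking set to False."""
    patched = dict(payload)
    # Copy any existing template kwargs so the caller's dict is not mutated.
    kwargs = dict(patched.get("chat_template_kwargs", {}))
    kwargs["enable_thinking"] = False
    patched["chat_template_kwargs"] = kwargs
    return patched


request_body = without_thinking(
    {"messages": [{"role": "user", "content": "Summarize NVFP4 in one line."}]}
)
# Python's False serializes to JSON false, matching enable_thinking=false.
print(json.dumps(request_body, indent=2))
```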

About the architecture

Qwen3.6-27B is a hybrid attention + SSM dense model: every 4th layer is conventional attention; the remaining 48 of 64 layers use Mamba-style linear_attn blocks. The NVFP4 source from mmangkad keeps the SSM in_proj_* projections and standard attention projections at higher precision — only the FFN matmul (192 tensors) is NVFP4. The variants above differ only in how those non-FFN tensors are stored.
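The layer and tensor counts above can be checked with a few lines of arithmetic. This sketch assumes the attention layers sit at positions 4, 8, 12 and so on, and that each layer has three FFN matrices (gate, up, down). The second assumption is inferred from 192 tensors over 64 layers and is not stated in the source.

```python
# Sanity check of the architecture numbers quoted above.
# Assumptions: attention at every 4th position; 3 FFN tensors per layer.

N_LAYERS = 64
ATTN_EVERY = 4
FFN_TENSORS_PER_LAYER = 3  # assumed: gate, up, down projections


def layer_kinds(n_layers=N_LAYERS, attn_every=ATTN_EVERY):
    """Label each layer as conventional attention or linear_attn (SSM)."""
    return [
        "attention" if (index + 1) % attn_every == 0 else "linear_attn"
        for index in range(n_layers)
    ]


kinds = layer_kinds()
n_attention = kinds.count("attention")        # 16
n_linear_attn = kinds.count("linear_attn")    # 48
n_ffn_tensors = N_LAYERS * FFN_TENSORS_PER_LAYER  # 192

print(n_attention, n_linear_attn, n_ffn_tensors)
```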

Sources & credits

  • Base model: Qwen/Qwen3.6-27B by Alibaba Qwen team — Apache 2.0
  • NVFP4 calibration source: mmangkad/Qwen3.6-27B-NVFP4 (NVIDIA ModelOpt v0.42.0)
  • mmproj source: official BF16 weights from Qwen/Qwen3.6-27B
  • Tooling: llama.cpp convert_hf_to_gguf.py and llama-quantize

License

Apache 2.0, inherited from the upstream model.
