Qwen3.6-27B NVFP4 GGUF
NVFP4 GGUF quantizations of Qwen/Qwen3.6-27B, produced for use with llama.cpp.
About LibertAI
LibertAI is a decentralized AI platform — private inference, an OpenAI-compatible API, and a chat UI, all running on community GPUs over Aleph Cloud instead of a single company's servers. No accounts required to chat, no logs sent home, and the same models you'd self-host are available behind a sovereign endpoint.
If you want to put this model (or any other) to work as an autonomous agent without running your own infrastructure, check out LiberClaw — Hermes-style agents hosted on Aleph Cloud with LibertAI inference. Free tier: 2 agents, no credit card, 5 minutes to deploy. Open source.
The FFN tensors are quantized to NVFP4 (NVIDIA's 4-bit float with E4M3 block scale), repacked from mmangkad/Qwen3.6-27B-NVFP4 (NVIDIA ModelOpt calibration). The remaining tensors (attention projections, SSM linear_attn blocks, embeddings, output) use a conventional GGUF quant — three variants are provided.
Why NVFP4? On NVIDIA Blackwell GPUs (RTX 50-series, B100/B200), llama.cpp uses native NVFP4 tensor-core MMA kernels (added in llama.cpp #22196) for the FFN matmul, which is the dominant compute cost during inference. On older GPUs the path falls back to dp4a/MMQ kernels, where these GGUFs still run but offer no performance advantage over standard K-quants.
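To make the "4-bit float with E4M3 block scale" idea concrete, here is a minimal conceptual sketch of block-scaled 4-bit quantization. It assumes the commonly described NVFP4 layout (E2M1 element values, one shared scale per 16-element block). It keeps the scale as an ordinary Python float instead of rounding it to FP8 E4M3, and it leaves out any per-tensor scale. The function names are illustrative; this is not llama.cpp's storage layout or kernel code.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumption: one shared scale per 16 elements


def quantize_block(values):
    """Return (scale, codes) for one block; each code is a (is_negative, level_index) pair."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude onto the top level (6.0).
    # Real NVFP4 would round this scale to FP8 E4M3; the sketch keeps a plain float.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        # Pick the nearest representable magnitude.
        idx = min(range(len(E2M1_LEVELS)), key=lambda i: abs(E2M1_LEVELS[i] - target))
        codes.append((v < 0, idx))
    return scale, codes


def dequantize_block(scale, codes):
    """Rebuild approximate values from the shared scale and the 4-bit codes."""
    return [(-1.0 if neg else 1.0) * E2M1_LEVELS[idx] * scale for neg, idx in codes]


if __name__ == "__main__":
    block = [float(i - 8) for i in range(BLOCK_SIZE)]
    scale, codes = quantize_block(block)
    print(dequantize_block(scale, codes))
```

Because the levels are unevenly spaced, small values in a block are represented more finely than large ones, and the worst-case rounding error is bounded by the block scale.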
Files
| File | Size | FFN | Other tensors | When to pick |
|---|---|---|---|---|
| `Qwen3.6-27B-NVFP4-Q4_K_M.gguf` | 15 GB | NVFP4 | Q4_K_M | Recommended. Fastest serving throughput on Blackwell + smallest VRAM footprint |
| `Qwen3.6-27B-NVFP4-Q8_0.gguf` | 19 GB | NVFP4 | Q8_0 | Higher precision attention/embeddings if you have the VRAM |
| `Qwen3.6-27B-NVFP4-BF16.gguf` | 28 GB | NVFP4 | BF16 | Max quality (preserves source precision for non-FFN tensors); slower in practice, so only pick it if you need bit-for-bit source fidelity |
| `mmproj-Qwen3.6-27B-F16.gguf` | 889 MB | n/a | F16 vision tower | Required for image/video input; reusable with any Qwen3.6-27B GGUF, not NVFP4-specific |
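One way to fetch a single file from the table is with the `huggingface_hub` package. The sketch below assumes that package is installed and that the repository id is `LibertAIDAI/Qwen3.6-27B-NVFP4-GGUF`, as shown on the model page. The `fetch` helper and the `FILES` list are illustrative names, not part of any library.

```python
REPO_ID = "LibertAIDAI/Qwen3.6-27B-NVFP4-GGUF"  # assumption: repo id from the model page

# File names copied from the table above.
FILES = [
    "Qwen3.6-27B-NVFP4-Q4_K_M.gguf",
    "Qwen3.6-27B-NVFP4-Q8_0.gguf",
    "Qwen3.6-27B-NVFP4-BF16.gguf",
    "mmproj-Qwen3.6-27B-F16.gguf",
]


def fetch(filename, repo_id=REPO_ID):
    """Download one file into the local Hugging Face cache and return its path."""
    if filename not in FILES:
        raise ValueError(f"unknown file: {filename}")
    # Imported lazily so the module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


if __name__ == "__main__":
    model_path = fetch("Qwen3.6-27B-NVFP4-Q4_K_M.gguf")
    print(model_path)
```

The returned path can be passed to `llama-cli -m` or `llama-server -m`. For image input, fetch the mmproj file the same way.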
Performance
Measured on an NVIDIA RTX 5090 (32 GB, Blackwell, sm_120), llama.cpp build c84e6d6db.
Batched serving (llama-batched-bench, 512 in / 128 out per request)
NVFP4-Q4_K_M matches or beats stock Q4_K_M on total serving throughput at every parallel batch size we tested (+9% / +0% / +8% / +2% at 1 / 4 / 8 / 16 sequences), with the largest token-generation gains at single stream (+12%) and at 8 parallel sequences (+14%). It also uses less VRAM (14.7 vs 16.3 GiB), leaving more room for KV cache.
Variant comparison (same hardware)
| Variant | Size | PP512 (tok/s) | TG64 (tok/s) |
|---|---|---|---|
| NVFP4-Q4_K_M | 14.72 GiB | 2865 | 64 |
| NVFP4-Q8_0 | 18.65 GiB | 3346 | 64 |
| NVFP4-BF16 | 27.19 GiB | 1403 | 49 |
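The Q4_K_M variant is the efficiency winner: it matches NVFP4-Q8_0 on token generation (64 tok/s) while using about 3.9 GiB less memory, although NVFP4-Q8_0 posts the higher prompt-processing rate in this table (3346 vs 2865 tok/s). The BF16 variant is included for completeness but pays a real bandwidth cost. Only pick it if you need maximum precision on the non-FFN tensors and don't care about throughput.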
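The comparison can be restated as ratios against a chosen baseline. The sketch below only re-computes figures from the table above; the helper name `relative_to` is illustrative.

```python
# Figures copied from the variant comparison table (RTX 5090, same build).
RESULTS = {
    "NVFP4-Q4_K_M": {"size_gib": 14.72, "pp512": 2865, "tg64": 64},
    "NVFP4-Q8_0": {"size_gib": 18.65, "pp512": 3346, "tg64": 64},
    "NVFP4-BF16": {"size_gib": 27.19, "pp512": 1403, "tg64": 49},
}


def relative_to(baseline, metric):
    """Return each variant's value for `metric` divided by the baseline variant's value."""
    base = RESULTS[baseline][metric]
    return {name: row[metric] / base for name, row in RESULTS.items()}


if __name__ == "__main__":
    for metric in ("pp512", "tg64", "size_gib"):
        ratios = relative_to("NVFP4-BF16", metric)
        print(metric, {name: round(r, 2) for name, r in ratios.items()})
```

Against the BF16 variant, NVFP4-Q4_K_M comes out at roughly 2.04x the prompt-processing rate and roughly 1.31x the token-generation rate, at a little over half the size.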
Usage
Text-only (CLI)
llama-cli -m Qwen3.6-27B-NVFP4-Q8_0.gguf -ngl 999 -c 8192 -p "Your prompt here"
Multimodal (server, vision + text)
llama-server \
-m Qwen3.6-27B-NVFP4-Q8_0.gguf \
--mmproj mmproj-Qwen3.6-27B-F16.gguf \
-ngl 999 -c 32768 \
--host 0.0.0.0 --port 8080
Then POST to /v1/chat/completions with image content blocks — see the llama.cpp multimodal docs.
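As a minimal sketch of such a request, the following client uses only the Python standard library. It assumes the server command above is running on `localhost:8080` and accepts OpenAI-style `image_url` content blocks carrying a base64 data URL. The helper names and the `photo.jpg` path are hypothetical.

```python
import base64
import json
import urllib.request


def build_vision_request(prompt, image_bytes, mime="image/jpeg"):
    """Build an OpenAI-style chat payload with one text block and one image block."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ]
    }


def post_chat(payload, base_url="http://localhost:8080"):
    """POST the payload to /v1/chat/completions and return the decoded JSON reply."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    with open("photo.jpg", "rb") as f:  # hypothetical local image
        payload = build_vision_request("Describe this image in one sentence.", f.read())
    reply = post_chat(payload)
    print(reply["choices"][0]["message"]["content"])
```

Sending the image inline as a data URL avoids requiring the server to fetch remote URLs.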
Recommended sampler
Qwen3.6 is a thinking model, and the default chat template enables <think> blocks. For non-thinking usage, pass --reasoning off (in llama-cli) or set chat_template_kwargs.enable_thinking=false in the API request.
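A minimal sketch of the API option, assuming the server reads `chat_template_kwargs` from the JSON request body. The helper names are illustrative. `strip_think` is only a fallback for replies where the reasoning arrives inline in the message text, which depends on how the server is configured.

```python
import re


def build_chat_request(prompt, enable_thinking=True):
    """Build a chat payload; add chat_template_kwargs only when thinking is disabled."""
    body = {"messages": [{"role": "user", "content": prompt}]}
    if not enable_thinking:
        body["chat_template_kwargs"] = {"enable_thinking": False}
    return body


# Matches one <think>...</think> block, including newlines, plus trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(text):
    """Remove inline <think> blocks from a reply, leaving only the final answer."""
    return THINK_BLOCK.sub("", text)


if __name__ == "__main__":
    print(build_chat_request("Summarize NVFP4 in one line.", enable_thinking=False))
    print(strip_think("<think>working it out</think>\n\nNVFP4 is a 4-bit float format."))
```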
About the architecture
Qwen3.6-27B is a hybrid attention + SSM dense model: every 4th layer is conventional attention; the remaining 48 of 64 layers use Mamba-style linear_attn blocks. The NVFP4 source from mmangkad keeps the SSM in_proj_* projections and standard attention projections at higher precision — only the FFN matmul (192 tensors) is NVFP4. The variants above differ only in how those non-FFN tensors are stored.
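The layer and tensor counts above can be checked with a few lines of arithmetic. The sketch assumes that the 192 NVFP4 tensors correspond to three FFN projections per layer (gate, up, down) and that "every 4th layer" means layers 4, 8, 12, and so on. Neither detail is stated explicitly in this card.

```python
NUM_LAYERS = 64
ATTENTION_EVERY = 4          # every 4th layer is conventional attention
FFN_TENSORS_PER_LAYER = 3    # assumption: gate, up, and down projections

# Assumption: 1-based layers 4, 8, 12, ... are the attention layers.
layer_types = [
    "attention" if (i + 1) % ATTENTION_EVERY == 0 else "linear_attn"
    for i in range(NUM_LAYERS)
]

attention_layers = layer_types.count("attention")
linear_attn_layers = layer_types.count("linear_attn")
nvfp4_tensors = NUM_LAYERS * FFN_TENSORS_PER_LAYER

if __name__ == "__main__":
    print(attention_layers, linear_attn_layers, nvfp4_tensors)
```

Under these assumptions the counts come out to 16 attention layers, 48 linear_attn layers, and 192 NVFP4 FFN tensors, matching the figures quoted above.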
Sources & credits
- Base model: Qwen/Qwen3.6-27B by Alibaba Qwen team — Apache 2.0
- NVFP4 calibration source: mmangkad/Qwen3.6-27B-NVFP4 (NVIDIA ModelOpt v0.42.0)
- mmproj source: official BF16 weights from Qwen/Qwen3.6-27B
- Tooling: llama.cpp `convert_hf_to_gguf.py` and `llama-quantize`
License
Apache 2.0, inherited from the upstream model.