---
language: en
tags:
- gguf
- quantization
- llama.cpp
- qwen
license: apache-2.0
base_model: Qwen/Qwen3Guard-Gen-8B
library_name: llama.cpp
---

# Qwen3Guard-Gen-8B - GGUF Quantized Versions

This repository provides **GGUF quantized versions** of [Qwen3Guard-Gen-8B](https://huggingface.co/Qwen/Qwen3Guard-Gen-8B), converted with [llama.cpp](https://github.com/ggerganov/llama.cpp).

The base model was first exported from Hugging Face format to GGUF (FP16) and then quantized into multiple formats. These variants offer different trade-offs between **model size, inference speed, and output quality**.

---

## 🔧 Model Details

- **Base model:** [Qwen/Qwen3Guard-Gen-8B](https://huggingface.co/Qwen/Qwen3Guard-Gen-8B)
- **Architecture:** Qwen3 (8B parameters)
- **Format:** GGUF
- **Intended use:** Guardrail / safety-aligned text generation
- **Conversion tool:** `convert_hf_to_gguf.py` (from llama.cpp)
- **Quantization tool:** `llama-quantize`

A sketch of the conversion and quantization commands is given at the end of this card.

---

## 📊 Quantized Versions

| Quantization | Filename | Size (MiB) | Notes |
|--------------|----------|------------|-------|
| **FP16** | `Qwen3Guard-Gen-8B-FP16.gguf` | ~15623 | Full precision (baseline) |
| **Q2_K** | `Qwen3Guard-Gen-8B-Q2_K.gguf` | ~3204 | Smallest, lowest accuracy |
| **Q3_K_M** | `Qwen3Guard-Gen-8B-Q3_K_M.gguf` | ~4027 | Balanced small size |
| **Q4_0** | `Qwen3Guard-Gen-8B-Q4_0.gguf` | ~4662 | Good balance, faster |
| **Q4_K_M** | `Qwen3Guard-Gen-8B-Q4_K_M.gguf` | ~4909 | Standard, widely used |
| **Q5_K_M** | `Qwen3Guard-Gen-8B-Q5_K_M.gguf` | ~5713 | Better accuracy |
| **Q6_K** | `Qwen3Guard-Gen-8B-Q6_K.gguf` | ~6568 | High accuracy |
| **Q8_0** | `Qwen3Guard-Gen-8B-Q8_0.gguf` | ~8505 | Near-FP16 quality |

---

## 🚀 Usage

### 🖥️ llama.cpp

Download a quantized file and run it with the llama.cpp CLI (called `llama-cli` in current builds; older builds name the same binary `main`):

```bash
./llama-cli -m Qwen3Guard-Gen-8B-Q4_K_M.gguf -p "Hello, Qwen!"
```

### 🐍 Python

Download the file directly from the Hub and use it with llama-cpp-python:

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Fetch the quantized file from the Hub (cached locally after the first download)
model_path = hf_hub_download(
    repo_id="ShahzebKhoso/Qwen3Guard-Gen-8B-GGUF",
    filename="Qwen3Guard-Gen-8B-Q4_K_M.gguf"
)

# Load the GGUF model and run a chat completion
llm = Llama(model_path=model_path)
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, Qwen!"}
    ],
    max_tokens=100
)
print(output["choices"][0]["message"]["content"])
```

These GGUF versions are optimized for **fast inference** with CPU/GPU runtimes such as `llama.cpp`, `Ollama`, and `LM Studio`; an Ollama sketch follows below.
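### 🦙 Ollama

A minimal sketch for running one of these files with Ollama, assuming Ollama is installed and the GGUF file has already been downloaded to the working directory; the model name `qwen3guard-gen-8b` is arbitrary and can be anything you like:

```bash
# Write a Modelfile that points Ollama at the local GGUF file
cat > Modelfile <<'EOF'
FROM ./Qwen3Guard-Gen-8B-Q4_K_M.gguf
EOF

# Register the model locally, then run it with a test prompt
ollama create qwen3guard-gen-8b -f Modelfile
ollama run qwen3guard-gen-8b "Hello, Qwen!"
```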
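---

## 🔁 Reproducing the Conversion

For reference, a minimal sketch of the export-and-quantize pipeline described in Model Details, assuming a local llama.cpp checkout with its binaries built in the working directory (script names, paths, and flags can differ between llama.cpp versions):

```bash
# 1) Export the Hugging Face checkpoint to GGUF at FP16 (the baseline file)
#    /path/to/Qwen3Guard-Gen-8B is a placeholder for the local checkpoint directory
python convert_hf_to_gguf.py /path/to/Qwen3Guard-Gen-8B \
    --outtype f16 \
    --outfile Qwen3Guard-Gen-8B-FP16.gguf

# 2) Quantize the FP16 baseline into a smaller variant, e.g. Q4_K_M
./llama-quantize Qwen3Guard-Gen-8B-FP16.gguf Qwen3Guard-Gen-8B-Q4_K_M.gguf Q4_K_M
```

The same `llama-quantize` invocation with a different type argument (`Q2_K`, `Q5_K_M`, `Q8_0`, etc.) produces the other variants listed in the table above.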