---
language: en
tags:
- gguf
- quantization
- llama.cpp
- qwen
license: apache-2.0
base_model: Qwen/Qwen3Guard-Gen-8B
library_name: llama.cpp
---

# Qwen3Guard-Gen-8B - GGUF Quantized Versions

This repository provides **GGUF quantized versions** of [Qwen3Guard-Gen-8B](https://huggingface.co/Qwen/Qwen3Guard-Gen-8B), converted with [llama.cpp](https://github.com/ggerganov/llama.cpp).

The base model was first exported from Hugging Face format to GGUF (FP16) and then quantized into multiple formats. These variants offer different trade-offs between **model size, inference speed, and output quality**.

---

## 🔧 Model Details

- **Base model:** [Qwen/Qwen3Guard-Gen-8B](https://huggingface.co/Qwen/Qwen3Guard-Gen-8B)
- **Architecture:** Qwen3 (8B parameters)
- **Format:** GGUF
- **Intended use:** Guardrail / safety-aligned text generation
- **Conversion tool:** `convert_hf_to_gguf.py` (from llama.cpp)
- **Quantization tool:** `llama-quantize`

A sketch of the conversion and quantization commands is given at the end of this card.

---

## 📊 Quantized Versions

| Quantization | Filename | Size (MiB) | Notes |
|--------------|----------|------------|-------|
| **FP16** | `Qwen3Guard-Gen-8B-FP16.gguf` | ~15623 | Full precision (baseline) |
| **Q2_K** | `Qwen3Guard-Gen-8B-Q2_K.gguf` | ~3204 | Smallest, lowest accuracy |
| **Q3_K_M** | `Qwen3Guard-Gen-8B-Q3_K_M.gguf` | ~4027 | Balanced small size |
| **Q4_0** | `Qwen3Guard-Gen-8B-Q4_0.gguf` | ~4662 | Good balance, faster |
| **Q4_K_M** | `Qwen3Guard-Gen-8B-Q4_K_M.gguf` | ~4909 | Standard, widely used |
| **Q5_K_M** | `Qwen3Guard-Gen-8B-Q5_K_M.gguf` | ~5713 | Better accuracy |
| **Q6_K** | `Qwen3Guard-Gen-8B-Q6_K.gguf` | ~6568 | High accuracy |
| **Q8_0** | `Qwen3Guard-Gen-8B-Q8_0.gguf` | ~8505 | Near-FP16 quality |

---

## 🚀 Usage

### 🖥️ llama.cpp

Download a quantized file and run it with the llama.cpp CLI (called `llama-cli` in current builds; older builds name the same binary `main`):

```bash
./llama-cli -m Qwen3Guard-Gen-8B-Q4_K_M.gguf -p "Hello, Qwen!"
```

### 🐍 Python

Download the file directly from the Hub and use it with llama-cpp-python:

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Fetch the quantized file from the Hub (cached locally after the first download)
model_path = hf_hub_download(
    repo_id="ShahzebKhoso/Qwen3Guard-Gen-8B-GGUF",
    filename="Qwen3Guard-Gen-8B-Q4_K_M.gguf"
)

# Load the GGUF model and run a chat completion
llm = Llama(model_path=model_path)
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, Qwen!"}
    ],
    max_tokens=100
)
print(output["choices"][0]["message"]["content"])
```

These GGUF versions are optimized for **fast inference** with CPU/GPU runtimes such as `llama.cpp`, `Ollama`, and `LM Studio`; an Ollama sketch follows below.
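### 🦙 Ollama

A minimal sketch for running one of these files with Ollama, assuming Ollama is installed and the GGUF file has already been downloaded to the working directory; the model name `qwen3guard-gen-8b` is arbitrary and can be anything you like:

```bash
# Write a Modelfile that points Ollama at the local GGUF file
cat > Modelfile <<'EOF'
FROM ./Qwen3Guard-Gen-8B-Q4_K_M.gguf
EOF

# Register the model locally, then run it with a test prompt
ollama create qwen3guard-gen-8b -f Modelfile
ollama run qwen3guard-gen-8b "Hello, Qwen!"
```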
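---

## 🔁 Reproducing the Conversion

For reference, a minimal sketch of the export-and-quantize pipeline described in Model Details, assuming a local llama.cpp checkout with its binaries built in the working directory (script names, paths, and flags can differ between llama.cpp versions):

```bash
# 1) Export the Hugging Face checkpoint to GGUF at FP16 (the baseline file)
#    /path/to/Qwen3Guard-Gen-8B is a placeholder for the local checkpoint directory
python convert_hf_to_gguf.py /path/to/Qwen3Guard-Gen-8B \
    --outtype f16 \
    --outfile Qwen3Guard-Gen-8B-FP16.gguf

# 2) Quantize the FP16 baseline into a smaller variant, e.g. Q4_K_M
./llama-quantize Qwen3Guard-Gen-8B-FP16.gguf Qwen3Guard-Gen-8B-Q4_K_M.gguf Q4_K_M
```

The same `llama-quantize` invocation with a different type argument (`Q2_K`, `Q5_K_M`, `Q8_0`, etc.) produces the other variants listed in the table above.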