Qwen3-32B-Uncensored-Autoround-int4

High-performance 4-bit quantized version of Qwen3-32B using Auto-Round quantization (GPTQ-compatible).

Model Description

This is a 4-bit quantized version of Qwen3-32B, optimized for efficient deployment on consumer GPUs while maintaining excellent quality. The model uses Auto-Round quantization with a group size of 128, providing superior quality retention compared to standard GPTQ.

  • Base Model: Qwen3-32B
  • Quantization: 4-bit Auto-Round (GPTQ-compatible)
  • Group Size: 128
  • Model Size: ~18GB (vs ~65GB for the FP16 weights)
  • Context Length: Up to 40,960 tokens (12,288 tested)
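
The exact quantization recipe for this checkpoint is not published here. As a rough illustration, the sketch below shows how a comparable 4-bit, group-size-128 export can be produced with the auto-round library; the calibration defaults, output path, and export format are assumptions, not the script used for this model.

# Minimal Auto-Round sketch (4-bit, group size 128); assumes `pip install auto-round`.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

base = "Qwen/Qwen3-32B"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# bits=4 and group_size=128 match the settings listed above.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()

# Export in a GPTQ-compatible layout so vLLM and Transformers can load it.
autoround.save_quantized("./Qwen3-32B-int4-gs128", format="auto_gptq")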

🚀 Performance Benchmarks

Speed Performance (RTX 3090 24GB)

Test Type            | Tokens Generated | Speed      | Time
Short (100 tokens)   | 100              | 34.1 tok/s | 2.93s
Medium (500 tokens)  | 500              | 33.7 tok/s | 14.8s
Long (12,000 tokens) | 12,000           | 30.7 tok/s | 6m 30s

Average Speed: 30-34 tokens/second on a single RTX 3090

Quality Metrics - Perplexity Analysis

Comprehensive testing across 16 diverse domains (3,015 tokens analyzed):

Overall Results:

  • Mean Perplexity: 3.28 ⭐ (Excellent)
  • Median Perplexity: 3.15
  • High Confidence Predictions: 60.9%

Domain-Specific Performance (selected domains):

Domain          | Perplexity | Rating
Scientific Text | 2.10       | Outstanding
Historical Text | 2.33       | Excellent
Medical Text    | 2.32       | Excellent
Technical Text  | 3.07       | Very Good
Literary Text   | 3.43       | Good
News Articles   | 5.11       | Good

Quality Rating: ⭐⭐⭐⭐⭐ (5/5) - Excellent

Comparison to Official Benchmarks

vs Official Qwen3 4-bit GPTQ:

  • Auto-Round shows better perplexity retention than standard GPTQ rounding
  • The 32B parameter scale is comparatively robust to 4-bit quantization
  • Overall quality is comparable to the official quantized variants

vs Community Deployments:

  • Roughly 50% faster than typically reported 32B deployments (15-25 tok/s)
  • Exceeds reported Qwen3-32B throughput on a Quadro GV100 (~20 tok/s)
  • Production-ready performance

Hardware Requirements

Minimum

  • GPU: 24GB VRAM (RTX 3090, RTX 4090, A5000, etc.)
  • RAM: 32GB system RAM
  • Storage: 20GB

Recommended

  • GPU: RTX 3090/4090 or better
  • RAM: 64GB system RAM
  • Fast SSD storage

Quick Start

Using vLLM (Recommended)

pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model groxaxo/Qwen3-32B-Uncensored-Autoround-int4 \
  --dtype bfloat16 \
  --max-model-len 12288 \
  --gpu-memory-utilization 0.95 \
  --port 8000
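
Once the server is running, any OpenAI-compatible client can be used. A minimal sketch with the openai Python package is shown below; the model name must match the --model value above, and the API key is a placeholder since vLLM does not verify it by default.

from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="groxaxo/Qwen3-32B-Uncensored-Autoround-int4",
    messages=[{"role": "user", "content": "Explain quantum computing in three sentences."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)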

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading the 4-bit GPTQ-compatible checkpoint typically requires a GPTQ kernel
# backend (e.g. auto-gptq or gptqmodel) installed alongside transformers.
model = AutoModelForCausalLM.from_pretrained(
    "groxaxo/Qwen3-32B-Uncensored-Autoround-int4",
    device_map="auto",  # place the quantized layers on the available GPU(s)
    trust_remote_code=False
)

tokenizer = AutoTokenizer.from_pretrained("groxaxo/Qwen3-32B-Uncensored-Autoround-int4")

prompt = "Write a detailed explanation of quantum computing:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
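
Since Qwen3 is an instruction-tuned chat model, the chat template usually gives better results than a raw prompt. A minimal sketch, reusing the model and tokenizer loaded above:

# Build a chat-formatted prompt instead of a raw string.
messages = [{"role": "user", "content": "Write a detailed explanation of quantum computing."}]
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(chat_inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))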

Performance Optimization Tips

For Maximum Speed

  • Use vLLM with --enforce-eager for consistent performance
  • Set --gpu-memory-utilization 0.95 to maximize throughput
  • Use --dtype bfloat16 for optimal speed/quality balance

For Maximum Context

  • Max context tested: 12,288 tokens on 24GB GPU
  • For 40K context: Use 2 GPUs with tensor parallelism (see the sketch below)
  • FP8 KV cache: Experimental, may be unstable
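
A sketch of a two-GPU launch for the full 40,960-token context window is shown below. This configuration has not been benchmarked here; flags other than --tensor-parallel-size and --max-model-len mirror the quick-start command, and memory headroom may need tuning.

# Untested two-GPU tensor-parallel sketch for the full context window.
python -m vllm.entrypoints.openai.api_server \
  --model groxaxo/Qwen3-32B-Uncensored-Autoround-int4 \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max-model-len 40960 \
  --gpu-memory-utilization 0.95 \
  --port 8000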

GPU Utilization

  • Achieved: 97% compute, 84% memory during inference
  • Memory usage: ~23.6GB on RTX 3090
  • Optimal batch size: 1-4 sequences

Benchmark Details

Confidence Distribution

  • Very High Confidence (log-prob > -0.1): 43.3%
  • High Confidence (log-prob -0.5 to -0.1): 17.6%
  • Medium Confidence (log-prob -1.5 to -0.5): 15.3%
  • Low Confidence (log-prob -3.0 to -1.5): 12.3%
  • Very Low Confidence (log-prob < -3.0): 11.5%

Statistical Analysis

  • Standard Deviation: 1.34
  • 25th Percentile: 2.22
  • 75th Percentile: 4.51
  • Range: 1.34 - 5.53
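
The evaluation harness itself is not included in this repository. The sketch below shows one common way to measure per-sample perplexity with transformers, reusing the model and tokenizer from the Transformers example above; the function name and sample text are illustrative only.

import torch

def perplexity(model, tokenizer, text, max_length=2048):
    # Score the sample with the model; no sampling is involved.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length).to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood per token; exp(loss) is the perplexity.
    return torch.exp(out.loss).item()

print(perplexity(model, tokenizer, "Photosynthesis converts light energy into chemical energy."))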

Use Cases

Excellent for:

  • ✅ Scientific and technical content generation
  • ✅ Medical and academic writing
  • ✅ Historical and factual text
  • ✅ Code generation and analysis
  • ✅ Long-form content creation
  • ✅ Multi-turn conversations

Good for:

  • ✅ Creative writing and literature
  • ✅ News article generation
  • ✅ Business and legal documents
  • ✅ Multilingual tasks (119 languages)

Limitations

  • Quantization may introduce minor quality degradation compared to FP16
  • Single GPU deployment limited to ~12K context
  • Requires GPU with at least 24GB VRAM
  • Not suitable for real-time applications requiring <50ms latency

Citation

If you use this model, please cite:

@misc{qwen3-32b-autoround-int4,
  title={Qwen3-32B-Uncensored-Autoround-int4},
  author={groxaxo},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/groxaxo/Qwen3-32B-Uncensored-Autoround-int4}}
}

License

Apache 2.0 - Same as base Qwen3 model

Acknowledgments

  • Base model: Qwen3-32B by Alibaba Cloud
  • Quantization: Auto-Round technique
  • Testing framework: vLLM 0.11.0

Model Card Contact

For questions or issues, please open an issue on the model repository.


Last Updated: October 2025
Quantization Method: Auto-Round (GPTQ-compatible)
Tested On: RTX 3090 24GB, vLLM 0.11.0
