Qwen3-32B-Uncensored-Autoround-int4

High-performance 4-bit quantized version of Qwen3-32B using Auto-Round quantization (GPTQ-compatible).

Model Description

This is a 4-bit quantized version of Qwen3-32B, optimized for efficient deployment on consumer GPUs while maintaining excellent quality. The model uses Auto-Round quantization with a group size of 128, providing superior quality retention compared to standard GPTQ.

  • Base Model: Qwen3-32B
  • Quantization: 4-bit Auto-Round (GPTQ-compatible)
  • Group Size: 128
  • Model Size: ~18GB (vs ~65GB for the FP16 weights)
  • Context Length: Up to 40,960 tokens (12,288 tested)
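
The exact quantization recipe for this checkpoint is not published here. As a rough illustration, the sketch below shows how a comparable 4-bit, group-size-128 export can be produced with the auto-round library; the calibration defaults, output path, and export format are assumptions, not the script used for this model.

# Minimal Auto-Round sketch (4-bit, group size 128); assumes `pip install auto-round`.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

base = "Qwen/Qwen3-32B"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# bits=4 and group_size=128 match the settings listed above.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()

# Export in a GPTQ-compatible layout so vLLM and Transformers can load it.
autoround.save_quantized("./Qwen3-32B-int4-gs128", format="auto_gptq")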

🚀 Performance Benchmarks

Speed Performance (RTX 3090 24GB)

Test Type            | Tokens Generated | Speed      | Time
Short (100 tokens)   | 100              | 34.1 tok/s | 2.93s
Medium (500 tokens)  | 500              | 33.7 tok/s | 14.8s
Long (12,000 tokens) | 12,000           | 30.7 tok/s | 6m 30s

Average Speed: 30-34 tokens/second on a single RTX 3090

Quality Metrics - Perplexity Analysis

Comprehensive testing across 16 diverse domains (3,015 tokens analyzed):

Overall Results:

  • Mean Perplexity: 3.28 ⭐ (Excellent)
  • Median Perplexity: 3.15
  • High Confidence Predictions: 60.9%

Domain-Specific Performance (selected domains):

Domain          | Perplexity | Rating
Scientific Text | 2.10       | Outstanding
Historical Text | 2.33       | Excellent
Medical Text    | 2.32       | Excellent
Technical Text  | 3.07       | Very Good
Literary Text   | 3.43       | Good
News Articles   | 5.11       | Good

Quality Rating: ⭐⭐⭐⭐⭐ (5/5) - Excellent

Comparison to Official Benchmarks

vs Official Qwen3 4-bit GPTQ:

  • Auto-Round shows better perplexity retention than standard GPTQ rounding
  • The 32B parameter scale is comparatively robust to 4-bit quantization
  • Overall quality is comparable to the official quantized variants

vs Community Deployments:

  • Roughly 50% faster than typically reported 32B deployments (15-25 tok/s)
  • Exceeds reported Qwen3-32B throughput on a Quadro GV100 (~20 tok/s)
  • Production-ready performance

Hardware Requirements

Minimum

  • GPU: 24GB VRAM (RTX 3090, RTX 4090, A5000, etc.)
  • RAM: 32GB system RAM
  • Storage: 20GB

Recommended

  • GPU: RTX 3090/4090 or better
  • RAM: 64GB system RAM
  • Fast SSD storage

Quick Start

Using vLLM (Recommended)

pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model groxaxo/Qwen3-32B-Uncensored-Autoround-int4 \
  --dtype bfloat16 \
  --max-model-len 12288 \
  --gpu-memory-utilization 0.95 \
  --port 8000
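
Once the server is running, any OpenAI-compatible client can be used. A minimal sketch with the openai Python package is shown below; the model name must match the --model value above, and the API key is a placeholder since vLLM does not verify it by default.

from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="groxaxo/Qwen3-32B-Uncensored-Autoround-int4",
    messages=[{"role": "user", "content": "Explain quantum computing in three sentences."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)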

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading the 4-bit GPTQ-compatible checkpoint typically requires a GPTQ kernel
# backend (e.g. auto-gptq or gptqmodel) installed alongside transformers.
model = AutoModelForCausalLM.from_pretrained(
    "groxaxo/Qwen3-32B-Uncensored-Autoround-int4",
    device_map="auto",  # place the quantized layers on the available GPU(s)
    trust_remote_code=False
)

tokenizer = AutoTokenizer.from_pretrained("groxaxo/Qwen3-32B-Uncensored-Autoround-int4")

prompt = "Write a detailed explanation of quantum computing:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
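
Since Qwen3 is an instruction-tuned chat model, the chat template usually gives better results than a raw prompt. A minimal sketch, reusing the model and tokenizer loaded above:

# Build a chat-formatted prompt instead of a raw string.
messages = [{"role": "user", "content": "Write a detailed explanation of quantum computing."}]
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(chat_inputs, max_new_tokens=500)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))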

Performance Optimization Tips

For Maximum Speed

  • Use vLLM with --enforce-eager for consistent performance
  • Set --gpu-memory-utilization 0.95 to maximize throughput
  • Use --dtype bfloat16 for optimal speed/quality balance

For Maximum Context

  • Max context tested: 12,288 tokens on 24GB GPU
  • For 40K context: Use 2 GPUs with tensor parallelism (see the sketch below)
  • FP8 KV cache: Experimental, may be unstable
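
A sketch of a two-GPU launch for the full 40,960-token context window is shown below. This configuration has not been benchmarked here; flags other than --tensor-parallel-size and --max-model-len mirror the quick-start command, and memory headroom may need tuning.

# Untested two-GPU tensor-parallel sketch for the full context window.
python -m vllm.entrypoints.openai.api_server \
  --model groxaxo/Qwen3-32B-Uncensored-Autoround-int4 \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max-model-len 40960 \
  --gpu-memory-utilization 0.95 \
  --port 8000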

GPU Utilization

  • Achieved: 97% compute, 84% memory during inference
  • Memory usage: ~23.6GB on RTX 3090
  • Optimal batch size: 1-4 sequences

Benchmark Details

Confidence Distribution

  • Very High Confidence (log-prob > -0.1): 43.3%
  • High Confidence (log-prob -0.5 to -0.1): 17.6%
  • Medium Confidence (log-prob -1.5 to -0.5): 15.3%
  • Low Confidence (log-prob -3.0 to -1.5): 12.3%
  • Very Low Confidence (log-prob < -3.0): 11.5%

Statistical Analysis

  • Standard Deviation: 1.34
  • 25th Percentile: 2.22
  • 75th Percentile: 4.51
  • Range: 1.34 - 5.53
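
The evaluation harness itself is not included in this repository. The sketch below shows one common way to measure per-sample perplexity with transformers, reusing the model and tokenizer from the Transformers example above; the function name and sample text are illustrative only.

import torch

def perplexity(model, tokenizer, text, max_length=2048):
    # Score the sample with the model; no sampling is involved.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length).to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood per token; exp(loss) is the perplexity.
    return torch.exp(out.loss).item()

print(perplexity(model, tokenizer, "Photosynthesis converts light energy into chemical energy."))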

Use Cases

Excellent for:

  • ✅ Scientific and technical content generation
  • ✅ Medical and academic writing
  • ✅ Historical and factual text
  • ✅ Code generation and analysis
  • ✅ Long-form content creation
  • ✅ Multi-turn conversations

Good for:

  • ✅ Creative writing and literature
  • ✅ News article generation
  • ✅ Business and legal documents
  • ✅ Multilingual tasks (119 languages)

Limitations

  • Quantization may introduce minor quality degradation compared to FP16
  • Single GPU deployment limited to ~12K context
  • Requires GPU with at least 24GB VRAM
  • Not suitable for real-time applications requiring <50ms latency

Citation

If you use this model, please cite:

@misc{qwen3-32b-autoround-int4,
  title={Qwen3-32B-Uncensored-Autoround-int4},
  author={groxaxo},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/groxaxo/Qwen3-32B-Uncensored-Autoround-int4}}
}

License

Apache 2.0 - Same as base Qwen3 model

Acknowledgments

  • Base model: Qwen3-32B by Alibaba Cloud
  • Quantization: Auto-Round technique
  • Testing framework: vLLM 0.11.0

Model Card Contact

For questions or issues, please open an issue on the model repository.


Last Updated: October 2025
Quantization Method: Auto-Round (GPTQ-compatible)
Tested On: RTX 3090 24GB, vLLM 0.11.0
