---
base_model:
- Qwen/Qwen3.5-397B-A17B
tags:
- qwen
- fp8
- vllm
- compressed-tensors
name: RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic
---

# FP8 Quantized Qwen3.5-397B-A17B

This is a preliminary (and subject to change) FP8-quantized version of the [Qwen/Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B) model. Both weights and activations are quantized to FP8 with [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor).
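
The "dynamic" in the model name refers to activation scales that are computed at inference time, per token, rather than calibrated offline. A minimal NumPy sketch of the idea (symmetric per-row quantization into the FP8 E4M3 range; function names are illustrative, not llm-compressor's API, and real kernels cast to an actual float8 dtype, which NumPy lacks):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 E4M3

def quantize_fp8_dynamic(x: np.ndarray):
    """Per-row (per-token) symmetric quantization into the FP8 E4M3 range."""
    # One scale per row, chosen so the row's absolute maximum maps to FP8_E4M3_MAX.
    absmax = np.abs(x).max(axis=-1, keepdims=True)
    scale = np.where(absmax == 0, 1.0, absmax / FP8_E4M3_MAX)  # avoid divide-by-zero
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # A real implementation would now cast q to float8; we keep float32 here.
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q * scale

x = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_fp8_dynamic(x)
x_hat = dequantize(q, s)
```

Because each row gets its own runtime scale, no calibration dataset is needed, which is why dynamic FP8 checkpoints can be produced directly from the original weights.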

It is compatible with, and tested against, vLLM `main`. Deploy it with: `vllm serve RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic`.
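
Once the server is up, it exposes vLLM's OpenAI-compatible API. A stdlib-only sketch of a single chat request (the port 8000 and endpoint path are vLLM's defaults, assumed here; helper names are illustrative):

```python
import json
import urllib.request

MODEL = "RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic"

def build_payload(prompt: str) -> dict:
    """Request body for the OpenAI-compatible chat completions endpoint."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # greedy decoding
    }

def chat(prompt: str, url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """Send one chat request to a running `vllm serve` instance."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Any OpenAI SDK pointed at the same base URL works equally well; the raw-HTTP version above just avoids extra dependencies.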

# Preliminary Evaluations

1) GSM8k via vLLM's `tests/evals/gsm8k/gsm8k_eval.py` shows almost no degradation in accuracy:

|          | Qwen/Qwen3.5-397B-A17B | RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic<br>(this model) |
| -------- | :--------------------: | :----------------------------------------------------: |
| Accuracy |          89.5          |                          89.4                          |
| Recovery |           \-           |                          99.9%                         |
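
Recovery here is simply the quantized score expressed as a percentage of the baseline score:

```python
baseline = 89.5   # GSM8k accuracy of Qwen/Qwen3.5-397B-A17B
quantized = 89.4  # GSM8k accuracy of this FP8-dynamic checkpoint

recovery = quantized / baseline * 100
print(f"{recovery:.1f}%")  # → 99.9%
```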

2) Under greedy sampling, the model generates almost identical text to the unquantized baseline (`Qwen/Qwen3.5-397B-A17B` on the left, `RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic` on the right):

**Note**: More rigorous evaluations are currently in progress and will be available soon.