---
base_model:
- Qwen/Qwen3.5-397B-A17B
tags:
- qwen
- fp8
- vllm
- compressed-tensors
name: RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic
---

# FP8 Quantized Qwen3.5-397B-A17B

This is a preliminary (and subject to change) FP8-quantized version of the [Qwen/Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B) model. Both weights and activations are quantized to FP8 with [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor).
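
The "dynamic" in the model name refers to activation scales that are computed at inference time, per token, rather than calibrated offline. A minimal NumPy sketch of the idea (symmetric per-row quantization into the FP8 E4M3 range; function names are illustrative, not llm-compressor's API, and real kernels cast to an actual float8 dtype, which NumPy lacks):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 E4M3

def quantize_fp8_dynamic(x: np.ndarray):
    """Per-row (per-token) symmetric quantization into the FP8 E4M3 range."""
    # One scale per row, chosen so the row's absolute maximum maps to FP8_E4M3_MAX.
    absmax = np.abs(x).max(axis=-1, keepdims=True)
    scale = np.where(absmax == 0, 1.0, absmax / FP8_E4M3_MAX)  # avoid divide-by-zero
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # A real implementation would now cast q to float8; we keep float32 here.
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q * scale

x = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_fp8_dynamic(x)
x_hat = dequantize(q, s)
```

Because each row gets its own runtime scale, no calibration dataset is needed, which is why dynamic FP8 checkpoints can be produced directly from the original weights.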

It is compatible with, and tested against, vLLM `main`. Deploy it with: `vllm serve RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic`.
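
Once the server is up, it exposes vLLM's OpenAI-compatible API. A stdlib-only sketch of a single chat request (the port 8000 and endpoint path are vLLM's defaults, assumed here; helper names are illustrative):

```python
import json
import urllib.request

MODEL = "RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic"

def build_payload(prompt: str) -> dict:
    """Request body for the OpenAI-compatible chat completions endpoint."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # greedy decoding
    }

def chat(prompt: str, url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """Send one chat request to a running `vllm serve` instance."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Any OpenAI SDK pointed at the same base URL works equally well; the raw-HTTP version above just avoids extra dependencies.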

# Preliminary Evaluations

1) GSM8k via vLLM's `tests/evals/gsm8k/gsm8k_eval.py` shows almost no degradation in accuracy:

|          | Qwen/Qwen3.5-397B-A17B | RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic<br>(this model) |
| -------- | :--------------------: | :----------------------------------------------------: |
| Accuracy |          89.5          |                          89.4                          |
| Recovery |           \-           |                          99.9%                         |
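
Recovery here is simply the quantized score expressed as a percentage of the baseline score:

```python
baseline = 89.5   # GSM8k accuracy of Qwen/Qwen3.5-397B-A17B
quantized = 89.4  # GSM8k accuracy of this FP8-dynamic checkpoint

recovery = quantized / baseline * 100
print(f"{recovery:.1f}%")  # → 99.9%
```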

2) Under greedy sampling, the model generates almost identical text to the unquantized baseline (`Qwen/Qwen3.5-397B-A17B` on the left, `RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic` on the right):

**Note**: More rigorous evaluations are currently in progress and will be available soon.