Anyone else getting only exclamation marks?
First of all, thank you for your work.
I'm trying to test the quant with the latest vllm nightly and a manually updated transformers.
Seems like I'm getting the same error as in this quant: https://huggingface.co/Sehyo/Qwen3.5-397B-A17B-NVFP4/discussions/1
Forcing the dtype to bfloat16 broke a few other things.
Could there be something wrong with the quant?
You mean it's only producing "!!!"?
Have you tried rm -rf ~/.cache/flashinfer and/or rebooting your machine?
Do you have a minimal example?
I'm experiencing the same issue as @Halbin when trying to test this quantized model with the latest vllm nightly version and manually updated transformers. Below are my reproduction steps:
Environment
- vllm version: vllm/vllm-openai:nightly (version 0.16.0rc2.dev496+g4a9c07a0a)
- transformers: git+https://github.com/huggingface/transformers.git@f2ba019
- Model: Qwen3.5-397B-A17B-AWQ (AWQ quantized)
- Hardware: 4 GPUs (A100 80G)
Steps to Reproduce
- Enter the container:
docker run -it --privileged --runtime nvidia --gpus '"device=0,1,2,3"' -v /dev/shm:/dev/shm -e NCCL_DEBUG=INFO -p 18001:8000 --entrypoint=/bin/bash vllm/vllm-openai:nightly
- Modify vllm environment:
rm -rf ~/.cache/flashinfer
pip install -U "transformers @ git+https://github.com/huggingface/transformers.git@f2ba019"
# Fix a simple bug in modeling_rope_utils.py line 651
TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py"
NEW_LINE=' ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}'
perl -i.bak -pe 'if ($. == 651) { $_ = $ENV{NEW_LINE} . "\n" }' "$TF_FILE"
- Start the service:
vllm serve /root/.cache/huggingface/Qwen3.5-397B-A17B-AWQ \
--served-model-name QuantTrio/Qwen3.5-397B-A17B-AWQ \
--tensor-parallel-size 4 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
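To exercise the server started above, here is a minimal request sketch (the host port 18001 comes from the -p 18001:8000 mapping in the docker command above; run it from the host, and adjust the prompt as needed):

```shell
# Minimal chat-completions smoke test against the vllm OpenAI-compatible API
curl -s http://localhost:18001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "QuantTrio/Qwen3.5-397B-A17B-AWQ",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 128
      }'
```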
I get only exclamation marks in the response when I send requests.
got you
Same result here, A40s with -tp 8.
I briefly looked at the docker image; it ships with CUDA 12.9, which probably won't work and could cause illegal memory accesses.
For now it's best to install CUDA 12.8 or CUDA 13.0.
Then force-reinstall vllm and transformers manually (use pip or uv); it shouldn't take too much effort. Have a try.
I still need a day to download this repo and the vllm image, which is kinda slow (network issue).
With Python 3.12 and CUDA 12.8, this series has been tested on 3080/3090/4090/H200/A6000 Pro.
They shouldn't output exclamation marks.
My setup is more or less the same as JoeyHwong's: same vllm version, 4 x H100, host uses CUDA 12.9.
FROM vllm/vllm-openai:nightly-4a9c07a0a2b8308a045476b48be29e37c349
RUN apt-get update && apt-get install -y --no-install-recommends git && \
rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip
RUN git clone --depth 1 https://github.com/huggingface/transformers.git /tmp/transformers && \
pip install --no-cache-dir --editable /tmp/transformers
# Shouldn't be required in the docker image
RUN rm -rf ~/.cache/flashinfer
# This part might no longer be necessary as of the most recent state of transformers#main
RUN TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py" && \
echo "patching $TF_FILE" && \
NEW_LINE=' ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}' \
perl -i.bak -pe 'if ($. == 651) { $_ = $ENV{NEW_LINE} . "\n" }' "$TF_FILE"
Yes, please refrain from using cuda 12.9 on this model, for now.
I've tested with Python 3.12 and CUDA 12.8, on 8xA40 in a fresh docker container; still nothing but !!!!!!!!!!!!!!!!!
{
  "id": "chatcmpl-a6501c68a3dbe578",
  "object": "chat.completion",
  "created": 1772207990,
  "model": "QuantTrio/Qwen3.5-397B-A17B-AWQ",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! [truncated: 512 tokens of '!']"
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 17,
    "total_tokens": 529,
    "completion_tokens": 512,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
Which container / system / graphics cards were you using?
What are your NVIDIA driver versions?
You might be on to something with the driver; mine's a bit old. I can't update to test today, though.
driver: 570.195.03
os: NixOS
container: nvidia/cuda:12.8.1-devel-ubuntu24.04
gpus: A40 x 8
Okay, I think I may have found the culprit: we need to change float16 in config.json to bfloat16.
The config.json in this repo has been updated accordingly; please have a try.
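For anyone who already downloaded the weights and wants to apply the fix locally instead of re-downloading, a minimal sketch (this assumes the dtype lives under the standard `torch_dtype` key in `config.json`; verify against your copy before patching):

```python
import json
from pathlib import Path

def force_bfloat16(config_path: str) -> str:
    """Rewrite torch_dtype from float16 to bfloat16 in a HF config.json.

    Returns the dtype found in the file after patching.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    if config.get("torch_dtype") == "float16":
        config["torch_dtype"] = "bfloat16"
        path.write_text(json.dumps(config, indent=2) + "\n")
    return config["torch_dtype"]
```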
it works!
amazing
Can confirm everything works now with this custom image:
FROM vllm/vllm-openai:nightly
RUN apt-get update && apt-get install -y --no-install-recommends git && \
rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip
RUN git clone --depth 1 https://github.com/huggingface/transformers.git /tmp/transformers && \
pip install --no-cache-dir --editable /tmp/transformers
Plus --disable-custom-all-reduce added to the vllm startup command.
MTP (num_speculative_tokens: 1) works as well, well done!
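For reference, a sketch of the full launch command combining the serve flags from earlier in the thread with the all-reduce workaround (model path and served name are from the posts above; adjust for your setup):

```shell
vllm serve /root/.cache/huggingface/Qwen3.5-397B-A17B-AWQ \
  --served-model-name QuantTrio/Qwen3.5-397B-A17B-AWQ \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
```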