Anyone else getting only exclamation marks?
First of all, thank you for your work.
I'm trying to test the quant with the latest vllm nightly and a manually updated transformers.
Seems like I'm getting the same error as in this quant: https://huggingface.co/Sehyo/Qwen3.5-397B-A17B-NVFP4/discussions/1
Forcing the dtype to bfloat16 broke a few other things.
Could there be something wrong with the quant?
You mean it's only producing "!!!"?
Have you tried rm -rf ~/.cache/flashinfer and/or rebooting your machine?
Do you have a minimal example?
I'm experiencing the same issue as @Halbin when trying to test this quantized model with the latest vllm nightly version and manually updated transformers. Below are my reproduction steps:
Environment
- vllm version: vllm/vllm-openai:nightly (version 0.16.0rc2.dev496+g4a9c07a0a)
- transformers: git+https://github.com/huggingface/transformers.git@f2ba019
- Model: Qwen3.5-397B-A17B-AWQ (AWQ quantized)
- Hardware: 4 GPUs (A100 80G)
Steps to Reproduce
- Enter the container:
docker run -it --privileged --runtime nvidia --gpus '"device=0,1,2,3"' -v /dev/shm:/dev/shm -e NCCL_DEBUG=INFO -p 18001:8000 --entrypoint=/bin/bash vllm/vllm-openai:nightly
- Modify vllm environment:
rm -rf ~/.cache/flashinfer
pip install -U "transformers @ git+https://github.com/huggingface/transformers.git@f2ba019"
# Fix a simple bug in modeling_rope_utils.py line 651
TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py"
NEW_LINE=' ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}'
perl -i.bak -pe 'if ($. == 651) { $_ = $ENV{NEW_LINE} . "\n" }' "$TF_FILE"
- Start the service:
vllm serve /root/.cache/huggingface/Qwen3.5-397B-A17B-AWQ \
--served-model-name QuantTrio/Qwen3.5-397B-A17B-AWQ \
--tensor-parallel-size 4 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
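To exercise the server started above, here is a minimal request sketch (the host port 18001 comes from the -p 18001:8000 mapping in the docker command above; run it from the host, and adjust the prompt as needed):

```shell
# Minimal chat-completions smoke test against the vllm OpenAI-compatible API
curl -s http://localhost:18001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "QuantTrio/Qwen3.5-397B-A17B-AWQ",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 128
      }'
```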
I get only exclamation marks in the response when I send requests.
got you
Same result here, A40s with -tp 8.
I briefly looked at the docker image; it ships with CUDA 12.9, which probably won't work and could cause illegal memory accesses.
For now it's best to install CUDA 12.8 or CUDA 13.0.
Then force-reinstall vllm and transformers manually (use pip or uv); it shouldn't take too much effort. Have a try.
I still need a day to download this repo and the vllm image, which is kinda slow (network issue).
With Python 3.12 and CUDA 12.8, this series has been tested on 3080/3090/4090/H200/A6000 Pro.
They shouldn't output exclamation marks.
My setup is more or less the same as JoeyHwong's: same vllm version, 4 x H100, host uses CUDA 12.9.
FROM vllm/vllm-openai:nightly-4a9c07a0a2b8308a045476b48be29e37c349
RUN apt-get update && apt-get install -y --no-install-recommends git && \
rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip
RUN git clone --depth 1 https://github.com/huggingface/transformers.git /tmp/transformers && \
pip install --no-cache-dir --editable /tmp/transformers
# Shouldn't be required in the docker image
RUN rm -rf ~/.cache/flashinfer
# This part might no longer be necessary as of the most recent state of transformers#main
RUN TF_FILE="$(python -m pip show transformers | awk -F': ' '/^Location:/{print $2}')/transformers/modeling_rope_utils.py" && \
echo "patching $TF_FILE" && \
NEW_LINE=' ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation) | {"partial_rotary_factor"}' \
perl -i.bak -pe 'if ($. == 651) { $_ = $ENV{NEW_LINE} . "\n" }' "$TF_FILE"
Yes, please refrain from using cuda 12.9 on this model, for now.
I've tested with Python 3.12 and CUDA 12.8, on 8xA40 in a fresh docker container; still nothing but !!!!!!!!!!!!!!!!!
{
  "id": "chatcmpl-a6501c68a3dbe578",
  "object": "chat.completion",
  "created": 1772207990,
  "model": "QuantTrio/Qwen3.5-397B-A17B-AWQ",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! [truncated: 512 tokens of '!']"
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 17,
    "total_tokens": 529,
    "completion_tokens": 512,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
Which container / system / graphics cards were you using?
What are your NVIDIA driver versions?
You might be on to something with the driver; mine's a bit old. I can't update to test today, though.
driver: 570.195.03
os: NixOS
container: nvidia/cuda:12.8.1-devel-ubuntu24.04
gpus: A40 x 8
Okay, I think I may have found the culprit: we need to change float16 in config.json to bfloat16.
The config.json in this repo has been updated accordingly; please have a try.
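For anyone who already downloaded the weights and wants to apply the fix locally instead of re-downloading, a minimal sketch (this assumes the dtype lives under the standard `torch_dtype` key in `config.json`; verify against your copy before patching):

```python
import json
from pathlib import Path

def force_bfloat16(config_path: str) -> str:
    """Rewrite torch_dtype from float16 to bfloat16 in a HF config.json.

    Returns the dtype found in the file after patching.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    if config.get("torch_dtype") == "float16":
        config["torch_dtype"] = "bfloat16"
        path.write_text(json.dumps(config, indent=2) + "\n")
    return config["torch_dtype"]
```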
it works!
amazing
Can confirm everything works now with this custom image:
FROM vllm/vllm-openai:nightly
RUN apt-get update && apt-get install -y --no-install-recommends git && \
rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip
RUN git clone --depth 1 https://github.com/huggingface/transformers.git /tmp/transformers && \
pip install --no-cache-dir --editable /tmp/transformers
Plus --disable-custom-all-reduce added to the vllm startup command.
MTP (num_speculative_tokens: 1) works as well, well done!
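For reference, a sketch of the full launch command combining the serve flags from earlier in the thread with the all-reduce workaround (model path and served name are from the posts above; adjust for your setup):

```shell
vllm serve /root/.cache/huggingface/Qwen3.5-397B-A17B-AWQ \
  --served-model-name QuantTrio/Qwen3.5-397B-A17B-AWQ \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000
```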