Successfully Running Qwen3-Next-80B-A3B-Instruct-AWQ-4bit on 3x RTX 3090s

#9
by 8055izham - opened

Just to share, I managed to get cpatonn's AWQ model working on 3x RTX 3090s (72GB VRAM total) with usable performance (~66 tokens/sec). Here's how:

Hardware / OS

  • Ubuntu 24.04 LTS
  • 3x NVIDIA RTX 3090 (or similar 24GB cards)
  • Docker with NVIDIA Container Toolkit installed
  1. Install NVIDIA Container Toolkit:
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
    sudo systemctl restart docker

  2. Launch the Docker container:
    docker run --rm -it --gpus all --ipc=host \
    -e TRANSFORMERS_OFFLINE=1 \
    -e HF_HOME=/root/.cache/huggingface \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:0eecb3166365a29db117c2aff6ca441b484b514d \
    bash

  3. Inside the container, serve the model:
    vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
    --pipeline-parallel-size 3 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --no-enable-chunked-prefill \
    --trust-remote-code \
    --port 8000
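Once the server is up, a quick way to sanity-check it is the OpenAI-compatible chat endpoint (host and port here are assumptions matching the serve command above):

```shell
# Send a one-shot chat request to the vLLM OpenAI-compatible endpoint.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```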

Performance Results
VRAM usage: ~23GB per GPU
Prompt processing: 400-500 tokens/sec
Generation: 60-70 tokens/sec for sustained sequences

Hope this helps others with similar hardware setups. Cheers..

Thanks. So I take it that we still need to use the nightlies and that the v0.11 release is still broken for this quant?

Could I ask where that docker image is from?

EDIT: Never mind, seems to be working now with the official docker image too. Ta!
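For anyone following along, a launch with the official image would look something like this (the `latest` tag is an assumption; pin a release that includes Qwen3-Next support, and note the `vllm/vllm-openai` entrypoint takes the serve flags directly after the image name):

```shell
# Launch with the official vLLM OpenAI server image instead of the CI build.
docker run --rm -it --gpus all --ipc=host \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
  --pipeline-parallel-size 3 \
  --max-model-len 8192 \
  --trust-remote-code
```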

THANKS A LOT MY BRO, 3x 3090 is a bit of a pain sometimes, aye. So glad we have this model up and running now. How do you find it against the 30B Thinking 2507 model? PS: have you maxed out context yet? I'm gonna try this right now and post updates.

Bloody hell!! Running perfectly on 3x 3090 at 160k context, speeds between 30 tk/s and 65 tk/s. My script:
vllm_qwen3_80b_starter.sh:
#!/bin/bash

# vLLM Qwen3 Model Starter Script
#
# This script starts the vLLM server with the Qwen3 model.

echo "Starting vLLM server with Qwen3 model..."

# Activate the vllm conda environment (as that's where the running instance is)

source /home/op/miniconda3/etc/profile.d/conda.sh
conda activate vllm

# Check if conda activate was successful
if [ $? -ne 0 ]; then
    echo "Error: Failed to activate conda environment 'vllm'"
    exit 1
fi

echo "Conda environment 'vllm' activated successfully"

# Start the vLLM server with the Qwen3 model

echo "Starting vLLM server with the following parameters:"
echo " Model: cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit"
echo " Pipeline parallel size: 3"
echo " Tensor parallel size: 1"
echo " Dtype: float16"
echo " KV cache dtype: auto"
echo " GPU memory utilization: 0.92"
echo " Max num seqs: 1"
echo " Max model length: 160000"
echo " Port: 8030"
echo ""

vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
    --pipeline-parallel-size 3 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 1 \
    --max-model-len 160000 \
    --trust-remote-code \
    --port 8030

echo "vLLM server stopped"
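If you want the server to survive a closed SSH session, one option is to launch the script detached and watch the log (the log file name is just an example):

```shell
# Run the starter script in the background, detached from the terminal,
# and capture all of its output in a log file.
chmod +x vllm_qwen3_80b_starter.sh
nohup ./vllm_qwen3_80b_starter.sh > vllm_qwen3.log 2>&1 &
echo "Server PID: $!"
tail -f vllm_qwen3.log
```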

Glad it worked out for you, groxaxo. I didn't try the 'thinking' model; the instruct model is performing well for my usage (I'm in healthcare). The max context length I needed is 64k, so I didn't see the need to push further, but your 160k is amazing. I'll keep that in mind if the need arises. Cheers.

(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] WorkerProc failed to start.
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] Traceback (most recent call last):
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 715, in worker_main
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] worker = WorkerProc(*args, **kwargs)
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 555, in __init__
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] self.worker.load_model()
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 273, in load_model
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3276, in load_model
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] self.model = model_loader.load_model(
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] model = initialize_model(
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 55, in initialize_model
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1218, in __init__
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] self.set_moe_parameters()
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1158, in set_moe_parameters
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] raise RuntimeError("No Qwen3Next layer found in the model.layers.")
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] RuntimeError: No Qwen3Next layer found in the model.layers.
(Worker_PP2 pid=11879 fails with the identical traceback.)

cyankiwi org

Hi @marutichintan ,
Thank you for using my model.
I assume you're running this model with pipeline parallelism? Could you install the latest vLLM commit from source, which should solve this problem?
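A source install along those lines might look like this (the `VLLM_USE_PRECOMPILED=1` shortcut is from the vLLM build-from-source docs and reuses prebuilt kernel wheels; drop it if you need a full local compile, which takes much longer):

```shell
# Clone vLLM and install the latest commit in editable mode.
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Python-only install that reuses precompiled kernels (assumes a recent
# matching wheel exists for your platform); omit the env var for a full build.
VLLM_USE_PRECOMPILED=1 pip install -e .
```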
Thanks,
Ton.

Will it work on 2x3090 with 30-60k context?
