Successfully Running Qwen3-Next-80B-A3B-Instruct-AWQ-4bit on 3x RTX 3090s

#9
by 8055izham - opened

Just to share, I managed to get cpatonn's AWQ model working on 3x RTX 3090s (72GB VRAM total) with usable performance (~66 tokens/sec). Here's how:

Hardware / OS

  • Ubuntu 24.04 LTS
  • 3x NVIDIA RTX 3090 (or similar 24GB cards)
  • Docker with NVIDIA Container Toolkit installed
  1. Install NVIDIA Container Toolkit:
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
    sudo systemctl restart docker

  2. Launch the Docker container:
    docker run --rm -it --gpus all --ipc=host \
    -e TRANSFORMERS_OFFLINE=1 \
    -e HF_HOME=/root/.cache/huggingface \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:0eecb3166365a29db117c2aff6ca441b484b514d \
    bash

  3. Inside the container, serve the model:
    vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
    --pipeline-parallel-size 3 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --no-enable-chunked-prefill \
    --trust-remote-code \
    --port 8000
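Once the server is up, a quick way to sanity-check it is the OpenAI-compatible chat endpoint (host and port here are assumptions matching the serve command above):

```shell
# Send a one-shot chat request to the vLLM OpenAI-compatible endpoint.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```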

Performance Results
VRAM usage: ~23GB per GPU
Prompt processing: 400-500 tokens/sec
Generation: 60-70 tokens/sec for sustained sequences

Hope this helps others with similar hardware setups. Cheers..

Thanks. So I take it that we still need to use the nightlies and that the v0.11 release is still broken for this quant?

Could I ask where that docker image is from?

EDIT: Never mind, seems to be working now with the official docker image too. Ta!
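For anyone following along, a launch with the official image would look something like this (the `latest` tag is an assumption; pin a release that includes Qwen3-Next support, and note the `vllm/vllm-openai` entrypoint takes the serve flags directly after the image name):

```shell
# Launch with the official vLLM OpenAI server image instead of the CI build.
docker run --rm -it --gpus all --ipc=host \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
  --pipeline-parallel-size 3 \
  --max-model-len 8192 \
  --trust-remote-code
```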

THANKS A LOT MY BRO, 3x 3090 is a bit of a pain sometimes, aye. So glad we have this model up and running now. How do you find it against the 30B Thinking 2507 model? PS: have you maxed out context yet? I'm gonna try this right now and post updates.

Bloody hell!! Running perfectly on 3x 3090 at 160k context, speeds between 30 tk/s and 65 tk/s. My script:
vllm_qwen3_80b_starter.sh:
#!/bin/bash

# vLLM Qwen3 Model Starter Script
#
# This script starts the vLLM server with the Qwen3 model.

echo "Starting vLLM server with Qwen3 model..."

# Activate the vllm conda environment (as that's where the running instance is)

source /home/op/miniconda3/etc/profile.d/conda.sh
conda activate vllm

# Check if conda activate was successful
if [ $? -ne 0 ]; then
    echo "Error: Failed to activate conda environment 'vllm'"
    exit 1
fi

echo "Conda environment 'vllm' activated successfully"

# Start the vLLM server with the Qwen3 model

echo "Starting vLLM server with the following parameters:"
echo " Model: cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit"
echo " Pipeline parallel size: 3"
echo " Tensor parallel size: 1"
echo " Dtype: float16"
echo " KV cache dtype: auto"
echo " GPU memory utilization: 0.92"
echo " Max num seqs: 1"
echo " Max model length: 160000"
echo " Port: 8030"
echo ""

vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
    --pipeline-parallel-size 3 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 1 \
    --max-model-len 160000 \
    --trust-remote-code \
    --port 8030

echo "vLLM server stopped"
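If you want the server to survive a closed SSH session, one option is to launch the script detached and watch the log (the log file name is just an example):

```shell
# Run the starter script in the background, detached from the terminal,
# and capture all of its output in a log file.
chmod +x vllm_qwen3_80b_starter.sh
nohup ./vllm_qwen3_80b_starter.sh > vllm_qwen3.log 2>&1 &
echo "Server PID: $!"
tail -f vllm_qwen3.log
```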

Glad it worked out for you, groxaxo. I didn't try the 'thinking' model; the instruct model is performing well for my usage (I'm in healthcare). The max context length I needed is 64k, so I didn't see the need to push further, but your 160k is amazing. I'll keep that in mind if the need arises. Cheers.

(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] WorkerProc failed to start.
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] Traceback (most recent call last):
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 715, in worker_main
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] worker = WorkerProc(*args, **kwargs)
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 555, in __init__
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] self.worker.load_model()
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 273, in load_model
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3276, in load_model
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] self.model = model_loader.load_model(
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] model = initialize_model(
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 55, in initialize_model
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1218, in __init__
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] self.set_moe_parameters()
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] File "/home/doubleslashai/ai/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1158, in set_moe_parameters
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] raise RuntimeError("No Qwen3Next layer found in the model.layers.")
(Worker_PP1 pid=11878) ERROR 11-24 16:18:50 [multiproc_executor.py:743] RuntimeError: No Qwen3Next layer found in the model.layers.
(Worker_PP2 pid=11879 fails with the identical traceback.)

cyankiwi org

Hi @marutichintan ,
Thank you for using my model.
I assume you're running this model with pipeline parallelism? Could you install the latest vLLM commit from source, which should solve this problem?
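A source install along those lines might look like this (the `VLLM_USE_PRECOMPILED=1` shortcut is from the vLLM build-from-source docs and reuses prebuilt kernel wheels; drop it if you need a full local compile, which takes much longer):

```shell
# Clone vLLM and install the latest commit in editable mode.
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Python-only install that reuses precompiled kernels (assumes a recent
# matching wheel exists for your platform); omit the env var for a full build.
VLLM_USE_PRECOMPILED=1 pip install -e .
```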
Thanks,
Ton.

Will it work on 2x3090 with 30-60k context?
