Running on 4 GPUs with TP=4
#11 · opened by nephepritou
Maybe it's just me, but I didn't know it would run with -tp 4, so I was using -tp 2 with acceptable performance. It turns out -tp 4 doesn't crash as long as you also pass --enable-expert-parallel.
So you can run it on 4× RTX 3090 at 100+ tok/s with the following command:
python -m vllm.entrypoints.openai.api_server \
--model ./cpatonn/GLM-4.5-Air-AWQ-4bit \
--served-model-name "glm-air-4.5" \
--dtype float16 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--max-model-len 131072 \
--gpu-memory-utilization 0.93 \
--max-num-seqs 2 \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45
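
Once the server is up, a quick way to confirm it's working is to send a request to the OpenAI-compatible endpoint. This is just a minimal sketch assuming vLLM's default host and port (localhost:8000) and the openai Python client; adjust base_url if you passed --host or --port.

from openai import OpenAI

# vLLM ignores the API key unless the server is started with --api-key,
# but the client requires some value, so "EMPTY" is the usual placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="glm-air-4.5",  # must match --served-model-name above
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)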