Running on 4 GPUs with TP=4

#11 by nephepritou

Maybe it's just me, but I didn't know it would run with -tp 4, so I had been using -tp 2 with acceptable performance. It turns out -tp 4 will not crash as long as you also pass --enable-expert-parallel.

So, you can run it on 4x RTX 3090 at 100+ tok/s with the following command:

python -m vllm.entrypoints.openai.api_server \
      --model ./cpatonn/GLM-4.5-Air-AWQ-4bit \
      --served-model-name "glm-air-4.5" \
      --dtype float16 \
      --tensor-parallel-size 4 \
      --enable-expert-parallel \
      --max-model-len 131072 \
      --gpu-memory-utilization 0.93 \
      --max-num-seqs 2 \
      --enable-auto-tool-choice \
      --tool-call-parser glm45 \
      --reasoning-parser glm45
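
Once the server is up, you can sanity-check the endpoint with the OpenAI Python client. A minimal sketch, assuming vLLM's default address of http://localhost:8000 and the openai package installed; the model name must match --served-model-name above:

      # Minimal sketch: query the vLLM OpenAI-compatible endpoint.
      # Assumes the server is on vLLM's default http://localhost:8000.
      from openai import OpenAI

      client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

      response = client.chat.completions.create(
          model="glm-air-4.5",  # must match --served-model-name
          messages=[{"role": "user", "content": "Say hello in one sentence."}],
          max_tokens=64,
      )
      print(response.choices[0].message.content)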
