feihu.hf committed
Commit · 1c8249c · 1 Parent(s): d199b9e
update README

README.md CHANGED
@@ -201,10 +201,15 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.
 
 ### How to Enable 1M Token Context
 
+> [!NOTE]
+> To effectively process a 1 million token context, users will require approximately **1000 GB** of total GPU memory. This accounts for model weights, KV-cache storage, and peak activation memory demands.
+
 #### Step 1: Update Configuration File
 
 Replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.
 
+#### Step 2: Start Model Server
+
 After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
 
 #### Option 1: Using vLLM
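As an editorial aside on Step 1 above (not from the commit itself), the sketch below shows one way to apply the config swap to a local snapshot of the model. The `huggingface-cli` download step, the directory name, and the backup file are illustrative assumptions; only `config.json` and `config_1m.json` come from the README.

```bash
# Illustrative sketch of Step 1; adjust MODEL_DIR to wherever your copy lives.
MODEL_DIR=./Qwen3-235B-A22B-Instruct-2507
huggingface-cli download Qwen/Qwen3-235B-A22B-Instruct-2507 --local-dir "$MODEL_DIR"

cp "$MODEL_DIR/config.json" "$MODEL_DIR/config.json.bak"   # keep the original config around
cp "$MODEL_DIR/config_1m.json" "$MODEL_DIR/config.json"    # switch to the length-extrapolation + sparse-attention config
```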
@@ -212,7 +217,9 @@ After updating the config, proceed with either **vLLM** or **SGLang** for servin
 To run Qwen with 1M context support:
 
 ```bash
-
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+pip install -e .
 ```
 
 Then launch the server with Dual Chunk Flash Attention enabled:
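The hunk above switches the vLLM instructions to a from-source build. As an illustrative sketch (not from the commit), one way to keep that build isolated and confirm it loads is shown below; the virtual-environment name is an arbitrary assumption.

```bash
# Build the cloned vLLM in a clean virtual environment so it cannot clash
# with a previously installed vLLM.
python3 -m venv vllm-1m-env
source vllm-1m-env/bin/activate
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .                                    # same editable install as in the diff
python -c "import vllm; print(vllm.__version__)"    # confirm the source build imports
```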
@@ -225,7 +232,8 @@ vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
     --enable-chunked-prefill \
     --max-num-batched-tokens 131072 \
     --enforce-eager \
-    --max-num-seqs 1
+    --max-num-seqs 1 \
+    --gpu-memory-utilization 0.85
 ```
 
 ##### Key Parameters
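Once the `vllm serve` command in this hunk is running, it exposes an OpenAI-compatible HTTP API. The request below is an illustrative smoke test (not from the commit), assuming the default port 8000 and a hypothetical local file `long_input.txt` standing in for a long document; `jq` is used only to JSON-escape the file contents.

```bash
# Send one long prompt to the server started above (default port 8000).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(jq -n --rawfile doc long_input.txt '{
        model: "Qwen/Qwen3-235B-A22B-Instruct-2507",
        messages: [{role: "user", content: ($doc + "\n\nSummarize the document above.")}],
        max_tokens: 512
      }')"
```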
@@ -238,28 +246,14 @@ vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
 | `--max-num-batched-tokens 131072` | Controls batch size during prefill; balances throughput and memory |
 | `--enforce-eager` | Disables CUDA graph capture (required for dual chunk attention) |
 | `--max-num-seqs 1` | Limits concurrent sequences due to extreme memory usage |
-
-##### Troubleshooting:
-
-1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."
-
-   The VRAM reserved for the KV cache is insufficient. Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
-
-2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
-
-   The VRAM reserved for activation weights is insufficient. You can try setting ``gpu_memory_utilization`` to 0.85 or lower, but be aware that this might reduce the VRAM available for the KV cache.
-
-3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."
-
-   The input is too lengthy. Consider using a shorter sequence or increasing the ``max_model_len``.
-
+| `--gpu-memory-utilization 0.85` | Sets the fraction of GPU memory used for the model executor |
 
 #### Option 2: Using SGLang
 
 First, clone and install the specialized branch:
 
 ```bash
-git clone
+git clone https://github.com/sgl-project/sglang.git
 cd sglang
 pip install -e "python[all]"
 ```
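The memory-oriented flags in this hunk (`--max-num-seqs 1`, `--gpu-memory-utilization 0.85`) follow from how large the KV cache becomes at this context length. The back-of-envelope estimate below is an illustrative sketch, not from the commit; the layer, head, and precision values are assumptions to verify against the model's `config.json`, and only the 1,010,000-token context length comes from the README.

```bash
# Rough KV-cache sizing; the architecture values are assumed, not documented here.
LAYERS=94 KV_HEADS=4 HEAD_DIM=128 BYTES_PER_VALUE=2   # assumed bf16 keys and values
TOKENS=1010000                                        # max context length used above
KV_BYTES=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * TOKENS ))
echo "KV cache for one ${TOKENS}-token sequence: ~$(( KV_BYTES / 1024**3 )) GiB"
```

Model weights and activation peaks come on top of this, which is consistent with the ~1000 GB figure in the note near the top of the diff and with leaving head-room via `--gpu-memory-utilization` (vLLM) or `--mem-frac` (SGLang).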
@@ -282,10 +276,26 @@ python3 -m sglang.launch_server \
 |---------|--------|
 | `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
 | `--context-length 1010000` | Defines max input length |
-| `--mem-frac 0.75` |
+| `--mem-frac 0.75` | The fraction of memory used for static allocation (model weights and the KV cache memory pool). Use a smaller value if you see out-of-memory errors. |
 | `--tp 8` | Tensor parallelism size (matches model sharding) |
 | `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |
 
+#### Troubleshooting:
+
+1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."
+
+   The VRAM reserved for the KV cache is insufficient.
+   - vLLM: Consider reducing the ``max_model_len`` or increasing the ``tensor_parallel_size``. Alternatively, you can reduce ``max_num_batched_tokens``, although this may significantly slow down inference.
+   - SGLang: Consider reducing the ``context-length`` or increasing the ``tp``. Alternatively, you can reduce ``chunked-prefill-size``, although this may significantly slow down inference.
+
+2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."
+
+   The VRAM reserved for activation weights is insufficient. You can try lowering ``gpu_memory_utilization`` or ``mem-frac``, but be aware that this might reduce the VRAM available for the KV cache.
+
+3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."
+
+   The input is too lengthy. Consider using a shorter sequence or increasing the ``max_model_len`` or ``context-length``.
+
 #### Long-Context Performance
 
 We test the model on a 1M version of the [RULER](https://arxiv.org/abs/2404.06654) benchmark.
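For symmetry with the vLLM example earlier, a short editorial liveness check against the SGLang server from this hunk is sketched below (not from the commit). It assumes SGLang's OpenAI-compatible API on its default port 30000; confirming the server responds to a tiny request is cheaper than debugging a failed 1M-token prompt.

```bash
# Quick check that the SGLang server launched above is serving requests.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-235B-A22B-Instruct-2507",
        "messages": [{"role": "user", "content": "Reply with OK if you are ready."}],
        "max_tokens": 8
      }'
```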