Transformers documentation


Continuous batching

Continuous batching maximizes GPU utilization. It increases throughput and reduces latency by using dynamic scheduling to rearrange the batch at each step. The system removes completed requests and adds new requests immediately to prevent GPU idling. Chunked prefill prevents expensive prefill work from stalling the batch while still allowing new requests to join.

Continuous batching works with transformers serve, a server for deploying local models, and generate_batch().

generate_batch

The generate_batch() method works with all autoregressive text models. It accepts a list of tokenized inputs and a GenerationConfig to configure generation settings.

import datasets
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    attn_implementation="sdpa_paged",
    device_map="cuda",
    dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507", padding_side="left")

dataset = datasets.load_dataset("openai/gsm8k", "socratic", split="test")
dataset = dataset.select(range(16))  # keep a small subset of samples for the example
tokenized_datasets = dataset.map(lambda x: tokenizer(x["question"]), batched=True)
simple_batch_inputs = [item["input_ids"] for item in tokenized_datasets]

generation_config = GenerationConfig(
    max_new_tokens=32,
    use_cuda_graph=False,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
    do_sample=False,
    max_batch_tokens=512,
)

batch_outputs = model.generate_batch(
    inputs=simple_batch_inputs,
    generation_config=generation_config,
)

for request_id, output in batch_outputs.items():
    generated_text = tokenizer.decode(output.generated_tokens, skip_special_tokens=True)
    print(f"Request {request_id} output: {generated_text}")

ContinuousBatchingManager

The ContinuousBatchingManager orchestrates the background thread by pulling requests from the queue and filling the GPU to capacity. Every iteration checks for finished requests and schedules new ones to join the batch. Use this manager to customize request scheduling.

Call init_continuous_batching() to initialize the manager with a GenerationConfig and start() the background thread.

from transformers.generation.continuous_batching import RequestStatus

manager = model.init_continuous_batching(generation_config=generation_config)
manager.start()

Use add_request() to asynchronously submit individual requests. Provide a specific request id or the manager generates one automatically.

for i, input_ids in enumerate(simple_batch_inputs):
    request_id = manager.add_request(input_ids=input_ids, request_id=f"request_{i}")

Retrieve all results as they arrive with get_result().

for request_id, request in manager.get_result():
    generated_text = tokenizer.decode(request.generated_tokens, skip_special_tokens=True)
    print(f"Request {request_id} output: {generated_text}")

Use the request_id of a specific request to get its results. This is a blocking operation that waits until the result is ready.

result = manager.get_result(request_id="request_5")

Stream partial results for a specific request with request_id_iter().

manager.add_request(
    input_ids=input_ids,
    request_id="streaming_request",
    stream=True,
)
for chunk in manager.request_id_iter(request_id="streaming_request"):
    generated_text = tokenizer.decode(chunk.generated_tokens, skip_special_tokens=True)
    print(generated_text)
    # FIXME: stop iteration in `request_id_iter` when finished instead of doing it externally
    if chunk.status == RequestStatus.FINISHED:
        break

Call stop() to terminate the manager.

manager.stop()
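
The calls above can be combined into one lifecycle. The following sketch reuses model, tokenizer, generation_config, and simple_batch_inputs from the generate_batch() example and wraps the manager in try/finally so the background thread is always stopped, even if an error occurs mid-generation.

manager = model.init_continuous_batching(generation_config=generation_config)
manager.start()
try:
    # Submit all requests asynchronously.
    for i, input_ids in enumerate(simple_batch_inputs):
        manager.add_request(input_ids=input_ids, request_id=f"request_{i}")

    # Collect results as they complete.
    for request_id, request in manager.get_result():
        generated_text = tokenizer.decode(request.generated_tokens, skip_special_tokens=True)
        print(f"Request {request_id} output: {generated_text}")
finally:
    # Always stop the background generation thread.
    manager.stop()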

PagedAttention

PagedAttention breaks large key-value caches into smaller, non-contiguous fixed-size pages to avoid GPU memory fragmentation and support variable-length requests. Transformers automatically enables PagedAttention when using continuous batching.

You can also explicitly enable PagedAttention when instantiating a model rather than waiting for generate_batch() to enable it dynamically.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507",
    attn_implementation="paged|flash_attention_2",
    device_map="cuda",
    dtype=torch.bfloat16,
)

Sliding window attention

Sliding window attention limits a token's backward context to a fixed window, so generation cost stays proportional to the window size rather than the full sequence length. This reduces compute per step and simplifies continuous batching.

Transformers models like Mistral and Gemma 2 natively support sliding window attention. Manually enable it in the model config if the architecture supports it. This helps with fine-tuning or running custom experiments.

import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("google/gemma-2-2b")
config.sliding_window = 4096

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",
    config=config,
    attn_implementation="paged|flash_attention_2",
    device_map="cuda",
    dtype=torch.bfloat16,
)

Usage remains the same with generate_batch().
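
For example, here is a minimal sketch of batched generation with the Gemma 2 model loaded above. The prompts are placeholders, and the GenerationConfig mirrors the earlier generate_batch() example.

from transformers import AutoTokenizer
from transformers.generation import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b", padding_side="left")
prompts = ["What is continuous batching?", "Explain sliding window attention."]
batch_inputs = [tokenizer(p)["input_ids"] for p in prompts]

generation_config = GenerationConfig(
    max_new_tokens=32,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
    do_sample=False,
)

batch_outputs = model.generate_batch(inputs=batch_inputs, generation_config=generation_config)
for request_id, output in batch_outputs.items():
    print(f"Request {request_id} output: {tokenizer.decode(output.generated_tokens, skip_special_tokens=True)}")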

How it works

The ContinuousMixin class serves as the main interface for continuous batching through generate_batch(). This method internally creates a ContinuousBatchingManager.

ContinuousBatchingManager manages requests by creating a background thread for the generation loop and adding requests to the queue. The manager is thread-safe, allowing asynchronous request additions while the model generates.

The Scheduler selects requests for processing at each step based on the token budget. FIFOScheduler is the default scheduler. It prioritizes decoding requests over prefilling requests and assigns them to specific memory blocks. PrefillFirstScheduler prioritizes prefill requests instead.

ContinuousBatchingManager runs the model forward pass for the scheduled requests. It then collects and returns the results.

ContinuousBatchingConfig stores all of the parameters needed for continuous batching that are not already covered by GenerationConfig.
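
As a minimal sketch, it can be instantiated directly with the parameters documented below; in the generate_batch() example above, parameters like max_batch_tokens and use_cuda_graph were instead passed through GenerationConfig.

from transformers import ContinuousBatchingConfig

cb_config = ContinuousBatchingConfig(
    block_size=256,          # size of each KV cache block in tokens
    max_batch_tokens=512,    # token budget per scheduling step
    max_memory_percent=0.8,  # fraction of free GPU memory used for the KV cache
    scheduler="fifo",        # default FIFOScheduler; prioritizes decoding over prefilling
)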

Continuous batching config

class transformers.ContinuousBatchingConfig


( block_size: int = 256, num_blocks: int | None = None, max_batch_tokens: int | None = None, max_memory_percent: float = 0.8, max_blocks_per_request: int | None = 0, allow_block_sharing: bool = True, use_async_batching: bool | None = None, use_cuda_graph: bool | None = None, q_padding_interval_size: int = 0, kv_padding_interval_size: int = 0, max_cached_graphs: int = 0, scheduler: str = 'fifo', max_queue_size: int = 0 )

Parameters

  • block_size (int, optional, defaults to 256) — Size of each KV cache block in tokens.
  • num_blocks (int, optional) — Number of blocks in the KV cache. Auto-inferred from GPU memory when None.
  • max_batch_tokens (int, optional) — Maximum number of tokens in a batch. Auto-inferred from GPU memory when None.
  • max_memory_percent (float, optional, defaults to 0.8) — Maximum percentage of free GPU memory (after the model is loaded) to use for the KV cache.
  • max_blocks_per_request (int, optional, defaults to 0) — Maximum blocks per request, used in the flash_attn_with_kvcache fast decode path to dimension the block table. Setting this to 0 disables the fast decode path.
  • allow_block_sharing (bool, optional, defaults to True) — Whether to allow block sharing for prefix caching. Block sharing can only be allowed, never forced, as some models do not support it. Disable if you have few short prompts but long generation lengths.
  • use_async_batching (bool, optional) — Whether to enable async double-buffering, which removes CPU overhead from the continuous batching loop at the cost of doubled VRAM usage. Auto-detected when None.
  • use_cuda_graph (bool, optional) — Whether to enable CUDA graphs. Auto-inferred when None.
  • q_padding_interval_size (int, optional, defaults to 0) — Query padding granularity in tokens for CUDA graphs. Uses a preset from continuous_api.py when set to 0.
  • kv_padding_interval_size (int, optional, defaults to 0) — KV padding granularity in tokens for CUDA graphs. Uses a preset from continuous_api.py when set to 0.
  • max_cached_graphs (int, optional, defaults to 0) — Maximum number of cached CUDA graphs. Uses a preset from continuous_api.py when set to 0.
  • scheduler (str, optional, defaults to "fifo") — Scheduler type to use.
  • max_queue_size (int, optional, defaults to 0) — Maximum request queue size for serving. 0 means unlimited.

Class that holds the arguments related to continuous batching, whether it is used through the generate_batch method or the continuous_batching_context_manager context manager.

__call__

( *args **kwargs )

Call self as a function.

Resources

The Continuous batching blog post explains KV caching, chunked prefill, and ragged batching with dynamic scheduling in more detail.
