vllm.v1.attention.backends.utils
AttentionMetadataBuilder
Source code in vllm/v1/attention/backends/utils.py
__init__ (abstractmethod)
__init__(
    kv_cache_spec: AttentionSpec,
    vllm_config: VllmConfig,
    device: device,
)
build (abstractmethod)
build(
    common_prefix_len: int,
    common_attn_metadata: CommonAttentionMetadata,
    fast_build: bool = False,
) -> M
Central method that builds attention metadata. Some builders (MLA) require reorder_batch to be called prior to build.
Parameters:

Name | Type | Description | Default
---|---|---|---
common_prefix_len | int | The length of the common prefix of the batch. | required
common_attn_metadata | CommonAttentionMetadata | The common attention metadata. | required
fast_build | bool | Prioritize speed of building the metadata over speed at execution. Can be used for spec-decode, where the result of a build call may only be used for a few layers/iterations. | False
Source code in vllm/v1/attention/backends/utils.py
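To make the `build` contract concrete, here is a minimal, self-contained sketch; the `Toy*` dataclasses are hypothetical stand-ins for vLLM's real `CommonAttentionMetadata` and backend metadata types, not the actual API:

```python
from dataclasses import dataclass

@dataclass
class ToyCommonMetadata:
    # Stand-in for CommonAttentionMetadata: per-batch data shared by all layers.
    query_start_loc: list[int]  # (batch_size + 1,) prefix sums of query lengths
    seq_lens: list[int]         # (batch_size,) total sequence length per request

@dataclass
class ToyBackendMetadata:
    # Stand-in for a backend-specific metadata object (the `M` type parameter).
    max_query_len: int
    max_seq_len: int

class ToyMetadataBuilder:
    def build(self, common_prefix_len: int,
              common: ToyCommonMetadata,
              fast_build: bool = False) -> ToyBackendMetadata:
        # Derive per-request query lengths from the prefix-sum array, then
        # reduce them into the backend-specific metadata.
        qsl = common.query_start_loc
        query_lens = [qsl[i + 1] - qsl[i] for i in range(len(qsl) - 1)]
        return ToyBackendMetadata(max_query_len=max(query_lens),
                                  max_seq_len=max(common.seq_lens))

meta = ToyMetadataBuilder().build(
    common_prefix_len=0,
    common=ToyCommonMetadata(query_start_loc=[0, 1, 3, 7], seq_lens=[5, 9, 7]),
)
print(meta.max_query_len, meta.max_seq_len)  # → 4 9
```

A real builder would additionally move tensors to the target device and honor `fast_build` by skipping expensive precomputation.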
build_for_cudagraph_capture
build_for_cudagraph_capture(
    common_attn_metadata: CommonAttentionMetadata,
) -> M
Build attention metadata for CUDA graph capture. Uses `build` by default. Subclasses that override this method should call `self.build` or `super().build_for_cudagraph_capture`.
Source code in vllm/v1/attention/backends/utils.py
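The delegation requirement can be sketched with toy classes (hypothetical stand-ins, not vLLM's actual builder hierarchy): an override does its capture-specific fixups and then delegates as required:

```python
class BaseBuilder:
    def build(self, common_prefix_len, common_attn_metadata, fast_build=False):
        return {"prefix": common_prefix_len, "meta": common_attn_metadata}

    def build_for_cudagraph_capture(self, common_attn_metadata):
        # Default behavior: same code path as a regular build.
        return self.build(0, common_attn_metadata)

class CapturingBuilder(BaseBuilder):
    def build_for_cudagraph_capture(self, common_attn_metadata):
        # Backend-specific fixups for capture would go here (e.g. padding
        # batch dimensions to a fixed capture size), then delegate as the
        # docstring requires.
        return super().build_for_cudagraph_capture(common_attn_metadata)

out = CapturingBuilder().build_for_cudagraph_capture({"batch": 8})
```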
can_run_in_cudagraph
can_run_in_cudagraph(
    common_attn_metadata: CommonAttentionMetadata,
) -> bool
Returns whether this batch (with the given metadata) can use CUDA Graphs for attention.
reorder_batch
reorder_batch(
    input_batch: InputBatch,
    scheduler_output: SchedulerOutput,
) -> bool
This method can reorder the batch if desired by the backend. Returns whether the batch has been reordered (default False).
Source code in vllm/v1/attention/backends/utils.py
CommonAttentionMetadata (dataclass)
Per-batch attention metadata, shared across layers and backends. AttentionMetadataBuilder instances use it to construct per-layer metadata.
For many of the tensors we keep both GPU and CPU versions.
Source code in vllm/v1/attention/backends/utils.py
num_computed_tokens_cpu (instance-attribute)
num_computed_tokens_cpu: Tensor
(batch_size,), the number of computed tokens for each request
query_start_loc_cpu (instance-attribute)
query_start_loc_cpu: Tensor
(batch_size + 1,), the start location of each request in the query tensor
seq_lens_cpu (instance-attribute)
seq_lens_cpu: Tensor
(batch_size,), the length of each request including both computed tokens and newly scheduled tokens
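The relationship between these fields can be illustrated with NumPy arrays standing in for the CPU tensors (toy values, not real vLLM state): per-request query lengths are the first differences of `query_start_loc`, and `seq_lens` is computed tokens plus newly scheduled tokens:

```python
import numpy as np

# Toy batch of 3 requests.
query_start_loc_cpu = np.array([0, 1, 3, 7])   # (batch_size + 1,)
num_computed_tokens_cpu = np.array([4, 7, 3])  # (batch_size,)

# Newly scheduled tokens per request are the diffs of query_start_loc.
num_scheduled = np.diff(query_start_loc_cpu)   # → [1, 2, 4]

# seq_lens = already-computed tokens + newly scheduled tokens.
seq_lens_cpu = num_computed_tokens_cpu + num_scheduled
print(seq_lens_cpu)  # → [5 9 7]
```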
PerLayerParameters (dataclass)
Currently, the FlashInfer backend only supports models in which all layers share the same values for the following hyperparameters: window_left, logits_soft_cap, and sm_scale.
Source code in vllm/v1/attention/backends/utils.py
get_kv_cache_layout (cached)
Source code in vllm/v1/attention/backends/utils.py
get_per_layer_parameters
get_per_layer_parameters(
    vllm_config: VllmConfig, cls_: type[AttentionImpl]
) -> dict[str, PerLayerParameters]
Scan all attention layers and determine some hyperparameters to use during `plan`.
Source code in vllm/v1/attention/backends/utils.py
infer_global_hyperparameters
infer_global_hyperparameters(
    per_layer_params: dict[str, PerLayerParameters],
) -> PerLayerParameters
Currently, the FlashInfer backend only supports models in which all layers share the same values for the following hyperparameters:
- window_left
- logits_soft_cap
- sm_scale

This function asserts that all layers share the same values for these hyperparameters and returns the global values.
Source code in vllm/v1/attention/backends/utils.py
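The assertion logic can be sketched with a plain dataclass standing in for `PerLayerParameters` (a simplified illustration, not the actual implementation):

```python
from dataclasses import dataclass

@dataclass
class ToyParams:
    # Stand-in for PerLayerParameters with the three shared hyperparameters.
    window_left: int
    logits_soft_cap: float
    sm_scale: float

def infer_global(per_layer: dict[str, ToyParams]) -> ToyParams:
    params = list(per_layer.values())
    global_params = params[0]
    for p in params[1:]:
        # All layers must agree; otherwise the backend cannot plan globally.
        assert p == global_params, "layers disagree on attention hyperparameters"
    return global_params

g = infer_global({
    "layers.0.attn": ToyParams(-1, 0.0, 0.125),
    "layers.1.attn": ToyParams(-1, 0.0, 0.125),
})
print(g.sm_scale)  # → 0.125
```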
make_local_attention_virtual_batches
make_local_attention_virtual_batches(
    attn_chunk_size: int,
    query_start_loc_np: ndarray,
    seq_lens_np: ndarray,
    block_table: Tensor,
    block_size: int = 0,
) -> tuple[ndarray, ndarray, ndarray, Tensor]
Source code in vllm/v1/attention/backends/utils.py
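As a rough illustration of the idea (a simplified sketch, not the actual implementation, which also remaps the block table), chunked local attention splits each request into virtual batches along `attn_chunk_size` boundaries so that no query token attends further back than its own chunk:

```python
import numpy as np

def local_chunk_seq_lens(seq_len: int, attn_chunk_size: int) -> np.ndarray:
    # Sequence positions are grouped into chunks of attn_chunk_size; each
    # virtual batch covers one chunk, and the final chunk may be partial.
    n_full, rem = divmod(seq_len, attn_chunk_size)
    lens = [attn_chunk_size] * n_full
    if rem:
        lens.append(rem)
    return np.array(lens)

# A 10-token request with chunk size 4 becomes 3 virtual batches.
print(local_chunk_seq_lens(10, 4))  # → [4 4 2]
```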
reorder_batch_to_split_decodes_and_prefills
reorder_batch_to_split_decodes_and_prefills(
    input_batch: InputBatch,
    scheduler_output: SchedulerOutput,
    decode_threshold: int = 1,
) -> bool
Reorders the batch to split into prefill and decode requests; places all requests with <= decode_threshold tokens at the front of the batch.
Returns:

Type | Description
---|---
bool | True if the batch was modified, False otherwise.
Source code in vllm/v1/attention/backends/utils.py
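The partitioning rule can be sketched on plain query lengths; the real function swaps rows of the batch in place and reports whether anything moved, while this hypothetical helper just computes the target order:

```python
def reorder_indices(query_lens: list[int], decode_threshold: int = 1) -> list[int]:
    # Stable partition: decode requests (queries of at most decode_threshold
    # tokens) come first, prefill requests follow.
    decodes = [i for i, q in enumerate(query_lens) if q <= decode_threshold]
    prefills = [i for i, q in enumerate(query_lens) if q > decode_threshold]
    return decodes + prefills

order = reorder_indices([5, 1, 3, 1])
print(order)  # → [1, 3, 0, 2]
```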
split_decodes_and_prefills
split_decodes_and_prefills(
    common_attn_metadata: CommonAttentionMetadata,
    decode_threshold: int = 1,
) -> tuple[int, int, int, int]
Assuming a reordered batch, finds the boundary between prefill and decode requests.
Parameters:

Name | Type | Description | Default
---|---|---|---
common_attn_metadata | CommonAttentionMetadata | CommonAttentionMetadata object containing the batch metadata. | required
decode_threshold | int | The maximum query length to be considered a decode. | 1
Returns:

Name | Type | Description
---|---|---
num_decodes | int | The number of decode requests.
num_prefills | int | The number of prefill requests.
num_decode_tokens | int | The number of tokens in the decode requests.
num_prefill_tokens | int | The number of tokens in the prefill requests.
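The boundary search can be sketched on plain query lengths (a simplified stand-in: the real function derives query lengths from `CommonAttentionMetadata` tensors):

```python
def split_counts(query_lens: list[int],
                 decode_threshold: int = 1) -> tuple[int, int, int, int]:
    # Assumes the batch is already reordered: decodes first, prefills after.
    # Scan until the first request whose query exceeds the decode threshold.
    num_decodes = 0
    for q in query_lens:
        if q > decode_threshold:
            break
        num_decodes += 1
    num_prefills = len(query_lens) - num_decodes
    num_decode_tokens = sum(query_lens[:num_decodes])
    num_prefill_tokens = sum(query_lens[num_decodes:])
    return num_decodes, num_prefills, num_decode_tokens, num_prefill_tokens

print(split_counts([1, 1, 4, 7]))  # → (2, 2, 2, 11)
```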