LLM serving is bottlenecked not by compute — but by memory. Before vLLM, 60–80% of GPU memory was wasted. Understanding why requires understanding the KV cache.
1.7 GB
KV cache per request (LLaMA-13B)
60–80%
Memory wasted by prior systems
<4%
Waste with PagedAttention
24×
Max throughput gain vs HuggingFace
Click a request to see its KV cache memory footprint · Dark = wasted space
What is the KV Cache?
During autoregressive generation, the attention mechanism must attend over all previously generated tokens. To avoid recomputing key (K) and value (V) vectors at every step, they are cached — this is the KV cache. For a large model like LLaMA-13B, a single request can consume up to 1.7 GB of KV cache. At batch sizes of 32+, this exhausts GPU memory entirely.
Why Cache at All?
Without caching, each new token requires recomputing K and V for every previous token — O(n²) work. With KV cache, each step is O(n) — a massive practical speedup. The cache is non-negotiable for serving at scale.
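The equivalence above can be checked directly: with a cache, each decode step projects only the new token's K and V and appends them; without one, every step re-projects all previous tokens. A minimal numpy sketch (random weights, single head, illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def attend(q, K, V):
    """Single-query attention over the cached keys/values."""
    s = q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V

tokens = rng.normal(size=(6, d))                 # hidden states, one per token
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for t in range(1, 7):
    x_new = tokens[t - 1]
    # With cache: project ONLY the new token (constant projection work per step).
    K_cache = np.vstack([K_cache, x_new @ Wk])
    V_cache = np.vstack([V_cache, x_new @ Wv])
    # Without cache: re-project EVERY token seen so far (O(n) per step, O(n^2) total).
    K_full, V_full = tokens[:t] @ Wk, tokens[:t] @ Wv
    q = x_new @ Wk                               # stand-in query vector
    assert np.allclose(attend(q, K_cache, V_cache), attend(q, K_full, V_full))
print("cached and recomputed attention outputs match")
```

The outputs are identical at every step; the cache only removes redundant projection work.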
The Size Problem
KV cache size scales with: model size (hidden dim × layers), sequence length, and batch size. For LLaMA-65B at 2048 tokens, a single request requires ~3.2 GB. A batch of 10 requests needs 32 GB — nearly the full 40 GB of an A100, before even counting model weights.
The Waste Problem
Prior systems allocated memory for the maximum possible sequence length upfront. If a request generates 200 tokens instead of 2048, 90% of allocated KV cache is unused. This reservation waste is why GPU utilisation was so poor.
The OS Analogy — Borrowing from Decades of Systems Research
The insight behind PagedAttention was not new to computer science — it came from operating systems. Virtual memory and paging solved the exact same problem for RAM in the 1960s. Click each concept to see the parallel.
Click a row to explore the OS ↔ LLM serving analogy
Virtual Memory ↔ KV Cache
In an OS, every process gets a virtual address space — a contiguous view of memory that is actually backed by scattered physical pages. The process doesn’t know or care about the physical layout. PagedAttention does the same for KV cache: each request sees a logical sequence of KV blocks, but the physical GPU memory backing them can be anywhere on the device.
Why This Matters
Virtual memory let OS researchers pack many more processes into RAM by eliminating external fragmentation and enabling demand paging. PagedAttention does the same for GPU memory — eliminating the need to reserve contiguous space and enabling dynamic allocation of KV cache as tokens are generated.
What Didn’t Transfer
OS swapping to disk is feasible because CPUs are slow relative to disk latency. GPU inference cannot swap to CPU RAM without destroying latency. vLLM addresses this with preemption — if memory runs out, lower-priority requests are paused and their KV cache is either swapped or recomputed.
PagedAttention modifies the attention computation to work over non-contiguous blocks of KV cache. The key insight: attention doesn’t require contiguous memory — it only requires that the right K and V vectors are gathered at attention time.
Step 1: Prefill Phase
During prefill, the entire prompt is processed in one forward pass. All prompt token K and V vectors are computed and stored into KV cache blocks. Blocks are allocated on demand — only as many blocks as needed for the prompt length are allocated. For a 512-token prompt with block size 16, this allocates 32 blocks immediately.
PagedAttention block structure:
block_size = 16 tokens (configurable, power of 2)
num_blocks = ceil(seq_len / block_size)
Physical block layout:
block[i].keys = float16[block_size, num_heads, head_dim]
block[i].values = float16[block_size, num_heads, head_dim]
Attention gather (decode step):
for each logical block b in block_table[request_id]:
    physical_block = block_table[request_id][b]
    K_slice = gpu_mem[physical_block].keys
    V_slice = gpu_mem[physical_block].values
    accumulate scores Q · K_slice.T (online softmax running across blocks)
attn_out = softmax(scores over all tokens) · gathered V slices
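The gather step can be sketched end-to-end in numpy. This is a toy model of the kernel, not vLLM's CUDA code: block size, pool size, and block-table entries are illustrative, and the real kernel fuses the gather with an online softmax rather than concatenating slices.

```python
import numpy as np

rng = np.random.default_rng(1)
BLOCK, HEADS, HDIM = 16, 4, 32
NUM_PHYS = 64
# Physical pool: K/V for many requests live in arbitrary, non-contiguous blocks.
pool_k = rng.normal(size=(NUM_PHYS, BLOCK, HEADS, HDIM)).astype(np.float32)
pool_v = rng.normal(size=(NUM_PHYS, BLOCK, HEADS, HDIM)).astype(np.float32)

# A 40-token sequence needs ceil(40/16) = 3 blocks; the block table maps
# logical blocks 0, 1, 2 to scattered physical blocks.
seq_len = 40
block_table = [17, 3, 52]

def gathered_attention(q):
    """Gather K/V through the block table, then run standard attention."""
    K = np.concatenate([pool_k[p] for p in block_table])[:seq_len]  # (seq, heads, hdim)
    V = np.concatenate([pool_v[p] for p in block_table])[:seq_len]
    out = np.empty((HEADS, HDIM), dtype=np.float32)
    for h in range(HEADS):
        s = K[:, h] @ q[h] / np.sqrt(HDIM)
        w = np.exp(s - s.max()); w /= w.sum()    # softmax over ALL gathered tokens
        out[h] = w @ V[:, h]
    return out

q = rng.normal(size=(HEADS, HDIM)).astype(np.float32)
out = gathered_attention(q)
assert out.shape == (HEADS, HDIM)
```

The key point survives the simplification: attention only needs the right K/V rows at compute time, regardless of where they physically live.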
Prior LLM serving systems wasted memory in three distinct ways. PagedAttention addresses each one. Click each waste type to see the before/after.
Internal Fragmentation
Traditional systems allocate KV cache in large contiguous chunks — one per request. When a request ends mid-chunk, the remaining space is lost. With block-based allocation, blocks are exactly block_size tokens. Only the last block of a sequence has any waste, and that waste is at most (block_size - 1) token slots. For block_size=16, maximum waste per request is 15 token slots — far less than wasting hundreds of tokens in a pre-allocated buffer.
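The waste bound above is easy to verify numerically. A small sketch comparing the two allocation strategies (the 2048-token reservation mirrors the example earlier in this page):

```python
import math

def preallocated_waste(gen_len, max_len=2048):
    """Prior systems: reserve max_len slots up front; unused slots are wasted."""
    return max_len - gen_len

def block_waste(gen_len, block_size=16):
    """Block allocation: only the last block can contain unused slots."""
    return math.ceil(gen_len / block_size) * block_size - gen_len

# A request that stops after 200 tokens:
assert preallocated_waste(200) == 1848             # ~90% of the 2048-slot reservation
assert block_waste(200) == 8                       # 200 tokens -> 13 blocks = 208 slots
# The bound holds for every length: waste never exceeds block_size - 1 slots.
assert all(block_waste(n) <= 15 for n in range(1, 4097))
```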
Each request maintains a block table — a mapping from logical block indices (0, 1, 2, …) to physical block indices in GPU memory. Click "Add Token" to simulate generation and watch the block table grow.
Green = used token slot · Grey = free slot in last block · Blue = newly allocated block
How the Block Table Works
Each entry in the block table is a (logical_block_idx → physical_block_idx) pair. When generating a new token, the system checks the last logical block. If it has free slots, the K and V vectors are written there. If the block is full, a new physical block is allocated from the free block pool, and a new entry is appended to the block table. Physical blocks can be anywhere in GPU memory — they don’t need to be adjacent.
Free Block Pool
The memory manager maintains a global pool of free physical blocks. When a request needs a new block, one is taken from the pool. When a request completes, all its blocks are returned to the pool — immediately available for new requests. No fragmented dead zones, no reserved-but-unused space.
Reference Counting
Physical blocks have reference counts. A block shared between multiple requests (e.g., via copy-on-write for beam search) has ref_count > 1. The block is only freed when all references are released. This enables safe memory sharing without copying.
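The block table, free pool, and reference counts fit together as sketched below. This is a toy model under stated assumptions: class and method names are illustrative, not vLLM's actual API, and it tracks only allocation bookkeeping, not the K/V data itself.

```python
class BlockManager:
    """Toy sketch of a paged KV block manager: global free pool,
    per-request block tables, and reference counts."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))      # global free physical-block pool
        self.ref_count = {}                      # physical block -> reference count
        self.tables = {}                         # request_id -> [physical block ids]
        self.lengths = {}                        # request_id -> tokens written

    def add_token(self, req):
        table = self.tables.setdefault(req, [])
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:             # last block full (or no block yet)
            phys = self.free.pop()               # allocate from the pool
            self.ref_count[phys] = 1
            table.append(phys)                   # append a new table entry
        self.lengths[req] = n + 1                # token K/V would be written here

    def free_request(self, req):
        for phys in self.tables.pop(req):
            self.ref_count[phys] -= 1
            if self.ref_count[phys] == 0:        # freed only when unreferenced
                self.free.append(phys)
        del self.lengths[req]

mgr = BlockManager(num_blocks=8)
for _ in range(20):                              # 20 tokens -> ceil(20/16) = 2 blocks
    mgr.add_token("r1")
assert len(mgr.tables["r1"]) == 2
mgr.free_request("r1")
assert len(mgr.free) == 8                        # all blocks back in the pool
```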
When multiple requests share the same prompt prefix (e.g., beam search or parallel sampling), their KV cache blocks can be physically shared — no copying. A fork only happens when a request writes a diverging token.
Gold = shared blocks (ref_count > 1) · Green = diverged private blocks · Click "Step" to animate
Parallel Sampling
When sampling N outputs from the same prompt (e.g., generating 5 alternative responses), all N sequences start from identical KV cache. With copy-on-write, all N block tables point to the same physical blocks for the prompt. Only when a sequence generates its first diverging token is a private copy of the affected block created. Memory savings scale with how much of the cache is shared: modest for parallel sampling, and up to 55% for beam search, where sharing delivers up to 2.2× throughput improvement over systems that copy KV cache upfront.
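The fork-then-copy-on-write mechanic can be sketched in a few lines. Assumptions are labelled: the function names and data layout (a dict of block contents standing in for GPU memory) are illustrative only.

```python
def fork(block_table, ref_count):
    """Fork a sequence: the child shares ALL physical blocks (no copying)."""
    for phys in block_table:
        ref_count[phys] += 1
    return list(block_table)                # child gets its own table, same blocks

def write_last_block(table, ref_count, free_pool, pool_data):
    """Copy-on-write: a shared block is copied before this sequence writes
    to it; an exclusively owned block is written in place."""
    phys = table[-1]
    if ref_count[phys] > 1:                 # shared -> make a private copy first
        new = free_pool.pop()
        pool_data[new] = list(pool_data[phys])
        ref_count[phys] -= 1
        ref_count[new] = 1
        table[-1] = new
    return table[-1]

pool_data = {0: ["the", "cat"]}             # stand-in for block contents
ref_count = {0: 1}
free_pool = [1, 2, 3]
parent = [0]
child = fork(parent, ref_count)             # both tables point at block 0
assert ref_count[0] == 2
dst = write_last_block(child, ref_count, free_pool, pool_data)
assert dst != 0 and parent == [0]           # child diverged into a private block
assert ref_count[0] == 1                    # parent now owns block 0 alone
```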
Static batching holds the GPU idle until every request in a batch finishes. Continuous batching inserts new requests mid-flight — the moment a slot frees up. Combined with PagedAttention, this is what makes vLLM’s throughput gains possible.
Click "Run" to animate · Blue = active token generation · Dark = GPU idle · Gold = waiting queue
Static Batching (Old)
Collect N requests. Run all N through prefill. Generate tokens for all N until every single one is done. Only then start the next batch. Short requests wait for long ones — GPU throughput suffers. Common in early HuggingFace serving and FasterTransformer.
Continuous Batching (vLLM)
Each iteration, the scheduler checks for completed requests and immediately queues new ones. The GPU never idles waiting for slow requests. First introduced by Orca (OSDI 2022) — vLLM combines it with PagedAttention for memory-efficient continuous batching at scale.
What is an "iteration" in vLLM’s scheduler? ▼
Each scheduler iteration processes one decode step across all active requests — generating one token per request. After each iteration, the scheduler: (1) collects newly finished requests, (2) frees their blocks back to the pool, (3) checks the waiting queue, (4) if free blocks are available, promotes waiting requests to running. This happens hundreds of times per second.
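The four bookkeeping steps of an iteration can be sketched as a single function. This is a deliberately simplified model: real scheduling also handles preemption and priorities, and the request shape (a dict with `done` and `blocks` keys) is illustrative.

```python
def scheduler_iteration(running, waiting, free_blocks, blocks_needed=1):
    """One continuous-batching iteration: (1) reap finished requests,
    (2) return their blocks to the pool, (3)+(4) admit waiting requests
    while free blocks remain."""
    still_running = []
    for req in running:
        if req["done"]:
            free_blocks += req["blocks"]      # blocks go straight back to the pool
        else:
            still_running.append(req)
    while waiting and free_blocks >= blocks_needed:
        req = waiting.pop(0)                  # promote a waiting request to running
        req["blocks"] = blocks_needed
        free_blocks -= blocks_needed
        still_running.append(req)
    return still_running, waiting, free_blocks

running = [{"id": "a", "done": True, "blocks": 4},
           {"id": "b", "done": False, "blocks": 2}]
waiting = [{"id": "c", "done": False, "blocks": 0}]
running, waiting, free = scheduler_iteration(running, waiting, free_blocks=0)
assert [r["id"] for r in running] == ["b", "c"]   # "a" reaped, "c" admitted
assert free == 3                                   # 4 freed, 1 re-allocated to "c"
```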
What happens when memory runs out mid-batch? ▼
vLLM uses preemption. If the scheduler cannot allocate blocks for the next step, it can: (a) swap a running request’s KV cache to CPU RAM and resume it later, or (b) recompute — drop the KV cache and re-run prefill when GPU memory is freed. Recompute spends extra compute but avoids the PCIe bandwidth bottleneck of swap; which option is cheaper depends on block size, sequence length, and interconnect bandwidth.
A single long prompt monopolises the GPU during prefill — all decode steps for other requests are frozen. Chunked prefill slices the prompt into fixed-size chunks and interleaves them with ongoing decode steps, dramatically cutting P99 TTFT.
~5 s
TTFT stall for a 4096-token prompt (unchunked, 70B)
512
Default chunk size (tokens per scheduler step)
6×
P99 TTFT reduction vs unchunked prefill
Timeline view — each row is a concurrent request; orange = prefill, green = decode, grey = stalled
The Stall Problem
A 4096-token prefill takes ~5s on an A100 for a 70B model. During this time, all other requests in the batch are frozen. P99 TTFT spikes badly under heavy load with diverse prompt lengths.
Chunked Solution
Chunked prefill splits the 4096-token prompt into 8 chunks of 512 tokens. Each chunk occupies one scheduler step alongside decode batches. The GPU serves all requests fairly — no monopolisation.
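The chunking itself is a simple schedule computation; a sketch using the 512-token default chunk size from above:

```python
import math

def chunk_schedule(prompt_len, chunk=512):
    """Split a prompt into fixed-size prefill chunks. Each chunk then
    shares one scheduler step with ongoing decode work, instead of a
    single prefill monopolising the GPU."""
    n = math.ceil(prompt_len / chunk)
    return [min(chunk, prompt_len - i * chunk) for i in range(n)]

chunks = chunk_schedule(4096)
assert chunks == [512] * 8                  # the 4096-token example: 8 steps of 512
# Uneven prompts get a short final chunk; no tokens are dropped.
assert chunk_schedule(1300) == [512, 512, 276]
```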
Memory Benefit
Chunked prefill also lowers peak memory pressure: partial KV cache written incrementally. Larger batch sizes become possible alongside long-context requests. Enabled by default in vLLM v0.3+.
vLLM was benchmarked on A10G and A100 GPUs against HuggingFace Transformers, Text Generation Inference (TGI), Orca, and FasterTransformer. Results from SOSP 2023.
vLLM vs HuggingFace Transformers
On LLaMA-7B (A10G GPU), vLLM achieves up to 24× higher throughput than HuggingFace Transformers. On LLaMA-13B, the gains are 15×. The gap is largest at high request rates — HuggingFace stalls on memory allocation while vLLM continues serving with its dynamic block allocator. Even at low request rates (where memory isn’t the bottleneck), vLLM still shows 4–8× advantage from continuous batching alone.
Production Impact
LMSYS Chatbot Arena uses vLLM in production. Compared to their previous HuggingFace backend, they observed up to 30× throughput improvement, enabling them to handle 30K–60K daily requests with 50% fewer GPUs.
Beam Search Gains
For beam search (width 4), copy-on-write reduces memory usage by up to 55% and delivers 2.2× throughput improvement vs systems that copy KV cache per beam. The longer the shared prefix, the larger the gain.
Memory Utilisation
Prior systems waste 60–80% of KV cache memory to fragmentation and over-reservation. PagedAttention achieves less than 4% waste — near the theoretical minimum given fixed block sizes. This directly translates to larger effective batch sizes.
TTFT, TPOT, and Throughput — The Three Metrics That Matter
LLM serving has three distinct performance dimensions that are often in tension. Optimising for one can hurt another. Operators must understand all three to tune vLLM correctly for their workload.
TTFT (Time To First Token)
Latency from request submission to first generated token. Dominated by prefill cost. Critical for interactive chat — users perceive it as "responsiveness". Chunked prefill and smaller batch sizes reduce TTFT.
TPOT (Time Per Output Token)
Latency per token during the decode phase (after the first). Dominated by memory bandwidth — reading KV cache blocks from HBM. PagedAttention and FlashAttention both reduce TPOT. Target: <50ms for real-time streaming.
Throughput (req/s or tok/s)
Total system capacity: requests per second or output tokens per second. Higher batch sizes improve throughput but increase TTFT and TPOT. Continuous batching in vLLM maximises throughput without a fixed batching window.
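The three metrics are mechanical to compute from per-token timestamps. A small sketch (timestamp values are illustrative):

```python
def ttft_tpot(submit_ts, token_ts):
    """TTFT = delay until the first token; TPOT = mean inter-token delay
    after the first. Timestamps are in seconds."""
    ttft = token_ts[0] - submit_ts
    if len(token_ts) < 2:
        return ttft, None
    tpot = (token_ts[-1] - token_ts[0]) / (len(token_ts) - 1)
    return ttft, tpot

# Request submitted at t=0.0; first token at 0.8 s, then one every 40 ms.
ts = [0.8 + 0.04 * i for i in range(101)]
ttft, tpot = ttft_tpot(0.0, ts)
assert abs(ttft - 0.8) < 1e-6
assert abs(tpot - 0.04) < 1e-6          # meets the <50 ms streaming target above
```

Throughput is then tokens (or completed requests) per wall-clock second across all concurrent streams, which is why it rises with batch size even as TTFT and TPOT degrade.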
vLLM has a clean separation between the scheduling/memory layer and the compute layer. Click a component to explore its role.
Click a component to explore its responsibilities
LLM Engine
The LLM Engine is the central orchestrator. It receives incoming requests from the API server, passes them to the Scheduler for batching decisions, dispatches batches to Workers via Ray (for multi-GPU), and returns completed outputs to the caller. The Engine maintains the global view of all active, waiting, and preempted requests.
Scheduler
Decides each iteration: which waiting requests to admit, which running requests to preempt, and which preempted requests to resume. Implements the continuous batching policy and coordinates with the Block Manager on memory availability.
Block Manager
Owns the global free-block pool. Allocates and frees physical KV cache blocks. Maintains block tables for each request. Handles copy-on-write when forking shared blocks for beam search or parallel sampling.
Worker (GPU)
Executes forward passes on the model. Contains the PagedAttention CUDA kernel — a custom attention implementation that gathers K/V from non-contiguous physical blocks using the block table. One worker per GPU; coordinated via Ray for tensor parallelism.
Prefix Caching — Never Recompute the Same Prompt Twice
Many API deployments use a long system prompt for every request. Without prefix caching, each request recomputes and stores that prompt independently. With prefix caching, the KV cache for any shared prefix is computed once and reused — even across separate API calls.
Gold = system prompt tokens (recomputed or shared) · Blue = user query tokens
Without Prefix Caching
Every incoming request triggers a full prefill of the system prompt + user query. For a 2000-token system prompt with 32 concurrent requests, that is 32 separate prefill computations of the same 2000 tokens — 64,000 token computations, all but the first 2,000 redundant. The GPU spends most of its time recomputing identical KV vectors.
How It Works
vLLM hashes the token IDs of each block. On a new request, if a block hash matches an existing cached block, that physical block is reused (ref_count++). The prefill only processes the new (uncached) suffix of the prompt.
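The block-hash matching can be sketched as follows. One assumption to flag: chaining the previous block's hash into each block's key (so a block's identity depends on its whole prefix) reflects how prefix caching must behave, but the exact hashing scheme here is illustrative, not vLLM's implementation.

```python
def block_hashes(token_ids, block_size=16):
    """Hash each FULL block of token IDs, chaining in the previous block's
    hash so a match implies the entire prefix matches."""
    hashes, prev = [], None
    full = len(token_ids) - len(token_ids) % block_size
    for i in range(0, full, block_size):
        prev = hash((prev, tuple(token_ids[i:i + block_size])))
        hashes.append(prev)
    return hashes

system_prompt = list(range(2000))             # same system prompt on every call
req_a = system_prompt + [9001, 9002]          # different user queries
req_b = system_prompt + [7777]
shared = sum(ha == hb for ha, hb in zip(block_hashes(req_a), block_hashes(req_b)))
assert shared == 125                          # all 2000/16 system-prompt blocks reused
```

Matching blocks get their ref_count bumped and are reused; only the uncached suffix (the user query) goes through prefill.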
Cache Hit Rate
For chatbot deployments with a fixed system prompt, cache hit rate approaches 100% for the prompt portion. Only the user-specific query portion is computed fresh. For RAG systems, frequently retrieved document chunks also get cached.
Eviction Policy
Prefix cache uses LRU eviction. When memory pressure rises, least-recently-used cached blocks are freed first. Blocks referenced by active requests are never evicted. The cache is purely opportunistic — correctness is never compromised.
Speculative Decoding — Generating Multiple Tokens Per Step
Standard autoregressive decoding generates one token per forward pass. Speculative decoding uses a small draft model to propose several tokens at once — the large model verifies all of them in a single pass. Click "Step" to animate the accept/reject process.
Green = accepted tokens · Red = rejected (large model resamples) · Click Step to advance
Why It Works
LLM inference is memory-bandwidth-bound, not compute-bound. The GPU can verify K tokens in parallel almost as fast as it can generate 1. So if the draft model is accurate, you get K tokens in ~1 forward pass latency. Typical speedup: 2–3× for latency-sensitive workloads where the draft model acceptance rate is high.
When It Helps (and When It Doesn't)
Works best: low temperature (near-deterministic) tasks like code generation, summarisation, translation. Works poorly: high temperature creative tasks where the large model disagrees with the draft model frequently. vLLM supports both ngram-based and model-based drafting.
Does speculative decoding change output quality? ▼
No — speculative decoding is mathematically equivalent to standard sampling from the large model. The rejection sampling scheme ensures the final token distribution is identical to what the large model would have produced directly. Accepted tokens are exactly what the large model would have chosen; rejected tokens are resampled from a corrected distribution.
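The accept/reject rule can be sketched concretely. This is a simplified single-sequence model with made-up toy distributions: accept draft token x with probability min(1, p(x)/q(x)), where p is the large model's distribution and q the draft's; on rejection, resample from the normalised residual max(p − q, 0) and stop.

```python
import numpy as np

def verify(draft_tokens, p_large, q_draft, rng):
    """Rejection-sampling verification of a run of draft tokens. The
    accepted stream is distributed exactly as sampling from p alone."""
    out = []
    for t, p, q in zip(draft_tokens, p_large, q_draft):
        if rng.random() < min(1.0, p[t] / q[t]):
            out.append(t)                      # draft token accepted
        else:
            r = np.maximum(p - q, 0.0)         # corrected residual distribution
            out.append(int(rng.choice(len(p), p=r / r.sum())))
            break                              # everything after a rejection is void
    return out

rng = np.random.default_rng(0)
p = [np.array([0.7, 0.1, 0.1, 0.1])] * 3       # large-model distributions per step
q = [np.array([0.25, 0.25, 0.25, 0.25])] * 3   # draft-model distributions per step
tokens = verify([0, 0, 1], p, q, rng)
# First two drafts have p/q >= 1, so they are always accepted; the third is
# either accepted or replaced by a resample: three tokens from one verify pass.
assert len(tokens) == 3
```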
What draft strategies does vLLM support? ▼
vLLM supports: (1) Model-based drafting — a small model (e.g. 68M params) generates proposals; (2) ngram drafting — prompt ngrams that match the generation so far are proposed as continuations (zero extra memory, works well for long-form repetitive text); (3) Medusa — multiple decoding heads on the same model propose tokens in parallel without a separate draft model.
FlashAttention vs PagedAttention — Two Solutions, Two Different Problems
These two are constantly confused. They solve different problems, work at different levels, and are not alternatives — vLLM uses both simultaneously. Click each dimension to compare.
What Problem Does Each Solve?
FlashAttention solves the IO-efficiency problem: the standard attention computation materialises the full N×N attention matrix in GPU HBM (slow global memory), which is bandwidth-bound. FlashAttention tiles the computation to stay in SRAM (fast on-chip memory), avoiding HBM round-trips. PagedAttention solves the memory management problem: KV cache for concurrent requests is scattered and wasteful. It has nothing to do with how the attention math is computed — only where the KV tensors live.
FlashAttention — IO Efficiency
Fuses the QK^T matmul, softmax, and V matmul into one CUDA kernel using tiling, so the N×N attention matrix is never materialised in HBM — sharply reducing memory reads/writes. Does not change the attention output — mathematically identical. Speeds up training and inference by 2–4× purely through better hardware utilisation.
PagedAttention — Memory Management
Changes how KV cache blocks are allocated and located in GPU memory. Does not change the attention math. Allows non-contiguous physical storage of KV cache. Enables sharing (copy-on-write), dynamic allocation, and near-zero waste. Complementary to FlashAttention — vLLM uses a PagedAttention kernel built on top of FlashAttention principles.
How much GPU memory does the KV cache actually consume? Adjust the sliders to see live calculations for any model, dtype, context length, and batch size — with and without PagedAttention.
Purple = usable KV cache · Red = wasted (prior art) · Green = model weights
How the Calculation Works
KV cache per request = 2 (K+V) × num_layers × num_heads × head_dim × seq_len × dtype_bytes. The "2×" is for both key and value tensors. For LLaMA-7B (32 layers, 32 heads, 128 head_dim, fp16): 2 × 32 × 32 × 128 × seq_len × 2 bytes = 524,288 × seq_len bytes. At 2048 tokens: ~1.07 GB per request.
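The calculation above, as a reusable function (the LLaMA-7B figures are the ones quoted in the text; note 1.07 GB decimal = exactly 1.0 GiB binary):

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, dtype_bytes=2):
    """Per-request KV cache: 2 (K and V) x layers x heads x head_dim
    x sequence length x bytes per element."""
    return 2 * num_layers * num_heads * head_dim * seq_len * dtype_bytes

# LLaMA-7B: 32 layers, 32 heads, head_dim 128, fp16 (2 bytes)
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
assert per_token == 524_288                        # 512 KiB per token, as stated

total = kv_cache_bytes(32, 32, 128, seq_len=2048)
assert total == 2**30                              # ~1.07 GB per 2048-token request
```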
Tensor Parallelism — Sharding the Model Across GPUs
A single A100 (80 GB) cannot hold a 70B parameter model (140 GB in fp16). Tensor parallelism splits the model’s weight matrices column-wise and row-wise across multiple GPUs. vLLM uses Megatron-LM style tensor parallelism with an all-reduce after each layer.
140 GB
LLaMA-70B in fp16 (single-GPU impossible on A100)
8×
GPUs needed to serve LLaMA-70B (tp=8)
~linear
Throughput scaling up to tp=4 on NVLink systems
Each GPU holds a partition of Q/K/V and FFN weight matrices — all-reduce synchronises activations after each layer
Column Parallel (QKV Proj)
Q, K, V projection matrices are split column-wise across GPUs. Each GPU computes its subset of attention heads independently. No communication needed during the forward pass for attention computation itself.
Row Parallel (O Proj + FFN)
Output projection and FFN down-projection are split row-wise. Each GPU computes a partial sum; an all-reduce across GPUs combines them. On NVLink (A100/H100), all-reduce at tp=8 takes ~100μs per layer.
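Both sharding patterns reduce to simple matrix identities, which a numpy sketch can verify (tiny illustrative dimensions; real systems shard head-aligned fp16 weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d, tp = 8, 2
x = rng.normal(size=(1, d))
W_qkv = rng.normal(size=(d, d))        # stand-in for a QKV projection matrix
W_o = rng.normal(size=(d, d))          # stand-in for the output projection

# Column parallel: each GPU holds d/tp columns and computes its own heads.
col_shards = np.split(W_qkv, tp, axis=1)
partial_heads = [x @ w for w in col_shards]            # no communication needed
assert np.allclose(np.concatenate(partial_heads, axis=1), x @ W_qkv)

# Row parallel: each GPU holds d/tp rows; partial outputs are SUMMED,
# which is exactly what the per-layer all-reduce implements.
h = np.concatenate(partial_heads, axis=1)
row_shards = np.split(W_o, tp, axis=0)
h_shards = np.split(h, tp, axis=1)
partials = [hs @ ws for hs, ws in zip(h_shards, row_shards)]
assert np.allclose(sum(partials), h @ W_o)             # all-reduce = elementwise sum
```

The column-then-row pairing is the Megatron-LM trick: it leaves only one all-reduce per attention block and one per FFN block.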
Pipeline Parallelism
For very large models, pipeline parallelism (pp) splits transformer layers across GPUs. Layers 0-39 on GPU 0-3, layers 40-79 on GPU 4-7. Combined with tensor parallelism: tp=4, pp=2 = 8 GPUs for a 70B model.
Disaggregated Prefill/Decode — Separating Two Different Problems
Prefill (processing the prompt) is compute-bound and parallelisable. Decode (generating tokens one-by-one) is memory-bandwidth-bound. Mixing them on the same GPU means neither is optimal. Disaggregated serving runs them on separate GPU pools, with KV cache transferred over NVLink/RDMA.
Compute-bound
Prefill — GPU utilisation ~95% on long prompts
Memory-bound
Decode — bottleneck is HBM bandwidth, not FLOPS
2-3×
Throughput improvement with disaggregation (DistServe)
Coupled System Problem
When prefill and decode share GPUs, long prefills cause "prefill-decode interference." Decode batches are paused while the GPU does prefill. Even with continuous batching, the two workloads compete for the same memory bandwidth and compute units.
Disaggregated Design
Prefill instances: fewer, larger GPUs optimised for compute throughput. Decode instances: many smaller GPUs optimised for memory bandwidth. After prefill, KV cache is transferred (via NVLink, RDMA, or PCIe) to the decode instance.
When to Use
Disaggregation shines at scale with heterogeneous GPU fleets or when SLAs require strict TTFT/TPOT separation. At small scale, coupled serving with chunked prefill is simpler. vLLM supports PD disaggregation via its disaggregated serving API (v0.6+).
Fine-tuning creates thousands of specialised model variants. Serving each as a separate GPU instance is prohibitively expensive. LoRA adapters are tiny (MBs) compared to the base model (GBs). vLLM’s multi-LoRA support serves dozens of LoRA adapters simultaneously on one base model using the Punica CUDA kernel for batched LoRA computation.
~10 MB
Typical LoRA adapter size (rank 16, LLaMA-7B)
13 GB
LLaMA-7B base model (fp16) — shared by all adapters
64+
Concurrent LoRA adapters supported per vLLM instance
LoRA Mathematics
LoRA decomposes weight updates as W’ = W + ΔW = W + BA, where B ∈ ℝ^{d×r} and A ∈ ℝ^{r×k} with rank r ≪ min(d, k). For LLaMA-7B with r=16, each 4096×4096 attention projection adds only ~0.13M parameters (16 × (4096 + 4096)) vs 16.8M for the full matrix — a 128× reduction.
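The parameter arithmetic, checked directly (LLaMA-7B's 4096-dim attention projections, r=16):

```python
import numpy as np

def lora_params(d, k, r):
    """Parameters in a LoRA update BA: B is d x r, A is r x k."""
    return d * r + r * k

d = k = 4096                        # LLaMA-7B attention projection dimensions
full = d * k                        # ~16.8M params in the full weight matrix
lora = lora_params(d, k, r=16)
assert lora == 131_072              # ~0.13M params per adapted projection
assert full // lora == 128          # 128x fewer parameters than the full matrix

# W' = W + B @ A: the low-rank product has the same shape as W.
rng = np.random.default_rng(0)
B, A = rng.normal(size=(d, 16)), rng.normal(size=(16, k))
assert (B @ A).shape == (d, k)
```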
Punica CUDA Kernel
Naively, each request with a different adapter requires a separate GPU kernel launch. Punica’s BGMV (Batched Gather Matrix-Vector) kernel processes a batch of requests with different LoRA adapters in a single kernel call, achieving near-native throughput.
Operational Model
Adapters are loaded into GPU memory on first use (LRU eviction when memory is tight). The base model stays resident. Routing is per-request: each API call can specify a different lora_id. No separate servers, no duplicated base model weights.
Follow a single request through every stage of vLLM processing. Click each stage to see what happens internally — which component is responsible and what data structures are touched.
Click a stage to explore · Arrows show data flow · Purple = active stage
Stage 1: API Ingestion
The client sends an HTTP POST to /v1/completions or /v1/chat/completions. The FastAPI server parses the JSON body, validates parameters (max_tokens, temperature, stop sequences), and creates an internal Request object with a unique request_id. The raw prompt text is passed to the tokenizer.
Multiple LLM serving frameworks have emerged, each with different trade-offs. Understanding the ecosystem helps operators choose the right tool for their deployment context.
vLLM
Best-in-class throughput via PagedAttention + continuous batching. OpenAI-compatible API. Broad model support (HuggingFace hub). Active open-source community. Ideal for research and production at scale. First choice for most teams.
TGI (Text Generation Inference)
HuggingFace’s production server. Strong integration with HF Hub. Supports tensor parallelism and flash attention. Slightly less throughput than vLLM in benchmarks, but excellent ops tooling (Prometheus, Docker, Kubernetes). Good for HF-native stacks.
TensorRT-LLM
NVIDIA’s inference engine. Best raw throughput on NVIDIA hardware (H100). Complex to set up and maintain — requires model compilation per target GPU. Best choice when absolute max throughput on specific NVIDIA hardware is the only goal.
Ollama
Developer-first local inference. One-line install, runs on CPU+GPU. Not designed for high-throughput multi-user serving — single request at a time. Excellent for local prototyping and personal use. Not a production serving framework.
DeepSpeed-MII
Microsoft’s serving layer on top of DeepSpeed. Strong at very large models (175B+) with ZeRO-Inference. Less community momentum recently as vLLM has caught up on multi-GPU support. Niche use: extremely large models on Azure/AML infrastructure.
Decision Guide
Local dev → Ollama. Research/production → vLLM. HuggingFace-native stack → TGI. Max NVIDIA performance, ops team available → TensorRT-LLM. 175B+ on Azure → DeepSpeed-MII. vLLM is the correct default for 90% of use cases.
vLLM is now one of the most widely deployed LLM serving frameworks. It supports OpenAI-compatible APIs, tensor parallelism, quantisation, and a growing ecosystem of integrations.
OpenAI-Compatible API
vLLM exposes a drop-in OpenAI-compatible REST API. Any application that calls the OpenAI completions or chat completions endpoint can switch to a self-hosted vLLM instance with zero code changes. This has driven rapid adoption for cost-sensitive deployments.
Tensor Parallelism
For models too large for a single GPU (e.g., LLaMA-70B requires 140+ GB fp16), vLLM shards the model across multiple GPUs using tensor parallelism via Megatron-LM’s approach. The PagedAttention kernel and block tables work correctly across shards — each GPU holds a portion of the KV cache blocks.
Quantisation Support
AWQ, GPTQ, SqueezeLLM — vLLM supports quantised models out of the box. 4-bit quantisation reduces model weights by 4× and KV cache by ~2×, enabling larger batch sizes on the same hardware.
Prefix Caching
v0.2+ adds automatic prefix caching: if multiple requests share a common system prompt, the prompt’s KV cache blocks are computed once and reused across all subsequent requests — even across separate API calls.
Speculative Decoding
vLLM implements speculative decoding — a small draft model proposes multiple tokens; the large model verifies in parallel. Reduces end-to-end latency for latency-sensitive applications without changing output distribution.
How does vLLM handle multi-modal models? ▼
Later vLLM versions added support for multi-modal inputs (images, video). Image features are embedded into tokens and processed through the same PagedAttention mechanism. The block manager allocates KV cache for vision tokens alongside text tokens transparently.
What is the overhead of PagedAttention vs standard attention? ▼
The custom CUDA kernel adds a gather overhead compared to attention over contiguous memory — the vLLM paper measured 20–26% higher latency for the attention kernel alone. But because PagedAttention enables much larger batch sizes (more requests in parallel), the total throughput is far higher — the per-request cost is amortised across more concurrent work.