Prefill is the first phase of every LLM inference call. Before the model generates a single token of output, it must process your entire input prompt in one shot.
Think of it as the model "reading and understanding" everything you said before it starts "writing" a response.
You type: "Explain how a CPU works to a 5-year-old"
┌───────────────────────────────────────────────────┐
│                  PREFILL PHASE                    │
│                                                   │
│  Input:      all your tokens processed IN PARALLEL│
│  Output:     KV cache + first generated token     │
│  Duration:   one-time, milliseconds               │
│  Bottleneck: GPU compute (FLOPS)                  │
└───────────────────────────────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│                   DECODE PHASE                    │
│                                                   │
│  Input:      one token at a time (autoregressive) │
│  Output:     next token, then next, then next...  │
│  Duration:   ongoing, seconds                     │
│  Bottleneck: memory bandwidth (GB/s)              │
└───────────────────────────────────────────────────┘
Let's trace a real prompt through prefill. We'll use:
Prompt: "Explain how a CPU works to a 5-year-old"
The tokenizer (e.g., Llama's SentencePiece / tiktoken) breaks your text into tokens. Tokens are NOT always full words:
Raw text: "Explain how a CPU works to a 5-year-old" Tokenized: ββββββββββ¬βββββββββββ¬ββββββββββββββββββββββββββββββββββ β Index β Token ID β Token Text β ββββββββββΌβββββββββββΌββββββββββββββββββββββββββββββββββ€ β 0 β 849 β "Explain" β β 1 β 1268 β " how" β β 2 β 264 β " a" β β 3 β 18622 β " CPU" β β 4 β 4375 β " works" β β 5 β 311 β " to" β β 6 β 264 β " a" β β 7 β 220 β " 5" β β 8 β - β "-" β β 9 β 3236 β "year" β β 10 β - β "-old" β ββββββββββ΄βββββββββββ΄ββββββββββββββββββββββββββββββββββ Total: 11 tokens
Each token ID is looked up in an embedding table to get a high-dimensional vector:
Token "Explain" (ID 849) β [0.023, -0.156, 0.891, 0.042, ..., -0.334]
                          └──────────── 8192 dimensions ────────────┘
(for Llama-3-70B; 4096 for smaller models)
Token " how" (ID 1268) β [0.178, 0.045, -0.223, 0.567, ..., 0.112]
Token " a" (ID 264) β [-0.089, 0.334, 0.012, -0.445, ..., 0.267]
... and so on for all 11 tokens
This creates an embedding matrix of shape [11, 8192] — 11 tokens, each with 8192 dimensions.
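A minimal sketch of this lookup in PyTorch, with randomly initialized weights just to show the shapes. The two IDs of 100 are made-up placeholders for the "-" entries in the table above, not real Llama token IDs:

```python
import torch

vocab_size, hidden_dim = 128256, 8192  # Llama-3-70B sizes

# Randomly initialized stand-in for the real embedding table (~2 GB in FP16)
embedding_table = torch.nn.Embedding(vocab_size, hidden_dim, dtype=torch.float16)

# Token IDs from the table above; 100 is a placeholder for the "-" entries
token_ids = torch.tensor([849, 1268, 264, 18622, 4375, 311, 264, 220, 100, 3236, 100])

hidden_states = embedding_table(token_ids)
print(hidden_states.shape)  # torch.Size([11, 8192])
```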
This embedding matrix now passes through every single transformer layer in the model. For Llama-3-70B, that's 80 layers. At each layer:
Layer 1 of 80:

  SELF-ATTENTION
    For each of the 11 tokens, compute:
      Q (Query) = token_embedding × W_Q
      K (Key)   = token_embedding × W_K   → STORED in KV cache
      V (Value) = token_embedding × W_V   → STORED in KV cache

    Then: Attention = softmax(Q × K^T / √d) × V

    Every token attends to every other token (11 × 11).
    That's 121 attention scores computed simultaneously.

  FEED-FORWARD NETWORK (MLP)
    Each token goes through:
      hidden = SiLU(token × W_gate) × (token × W_up)   (up-projection)
      output = hidden × W_down                          (down-projection)

    For 70B: 8192 → 28672 → 8192 dimensions

  Output: updated embeddings [11, 8192] + KV cache for layer 1

... repeat for layers 2, 3, 4, ... all the way to layer 80
Here's what the 11×11 attention matrix looks like for our prompt at one attention head:
            Explain  how    a     CPU   works  to     a     5     -     year  -old
Explain   [  1.00   0.0    0.0   0.0   0.0    0.0    0.0   0.0   0.0   0.0   0.0  ]
how       [  0.35   0.65   0.0   0.0   0.0    0.0    0.0   0.0   0.0   0.0   0.0  ]
a         [  0.15   0.25   0.60  0.0   0.0    0.0    0.0   0.0   0.0   0.0   0.0  ]
CPU       [  0.30   0.10   0.05  0.55  0.0    0.0    0.0   0.0   0.0   0.0   0.0  ]
works     [  0.10   0.15   0.05  0.40  0.30   0.0    0.0   0.0   0.0   0.0   0.0  ]
to        [  0.20   0.05   0.05  0.10  0.15   0.45   0.0   0.0   0.0   0.0   0.0  ]
a         [  0.05   0.05   0.10  0.05  0.05   0.30   0.40  0.0   0.0   0.0   0.0  ]
5         [  0.10   0.05   0.05  0.05  0.05   0.10   0.10  0.50  0.0   0.0   0.0  ]
-         [  0.05   0.03   0.02  0.03  0.02   0.05   0.05  0.35  0.40  0.0   0.0  ]
year      [  0.05   0.05   0.05  0.05  0.05   0.05   0.05  0.15  0.10  0.40  0.0  ]
-old      [  0.10   0.05   0.05  0.05  0.05   0.10   0.05  0.15  0.05  0.15  0.20 ]

• Each row = how much each token "attends to" every previous token (and itself).
• CAUSAL attention — tokens only see tokens before them (upper triangle = 0).
• "CPU" attends strongly to "Explain" (0.30) — it knows this is about explaining CPUs.
• "-old" attends to "5" (0.15) and "year" (0.15) — understands "5-year-old" as one concept.
The critical point: During prefill, this ENTIRE 11×11 matrix is computed in ONE shot. All 121 attention scores calculated simultaneously using GPU parallelism. This is why prefill is compute-heavy but fast.
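A minimal sketch of single-head causal attention in PyTorch, with random Q/K/V just to show the shapes and the masking (not the model's real weights):

```python
import torch
import torch.nn.functional as F

seq_len, head_dim = 11, 128
Q = torch.randn(seq_len, head_dim)
K = torch.randn(seq_len, head_dim)
V = torch.randn(seq_len, head_dim)

# All 11×11 scores in one matmul — this is the parallelism that makes prefill fast
scores = (Q @ K.T) / head_dim ** 0.5  # [11, 11]

# Causal mask: token i may not attend to tokens j > i
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = F.softmax(scores, dim=-1)  # each row sums to 1; upper triangle is 0
output = weights @ V                 # [11, 128]
print(weights.shape, output.shape)
```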
After all 80 layers, the final hidden state of the LAST token goes through a linear layer to produce logits over the entire vocabulary (~128,000 tokens for Llama-3):
Final hidden state of "-old" → Linear projection → Logits [128,256]

Top predictions:
  "A"        → probability 0.12
  "Imagine"  → probability 0.09
  "Think"    → probability 0.08
  "So"       → probability 0.07
  "OK"       → probability 0.06
  ...

Sampled (temperature=0.7): "Imagine"

→ This is the FIRST generated token (TTFT)
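A minimal sketch of temperature sampling over the final logits (random logits here, just to show the mechanics):

```python
import torch

vocab_size = 128256
logits = torch.randn(vocab_size)  # in the real model: last_hidden @ W_vocab

temperature = 0.7
probs = torch.softmax(logits / temperature, dim=-1)

# Draw one token ID from the temperature-scaled distribution
next_token_id = torch.multinomial(probs, num_samples=1).item()
print(next_token_id)
```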
Simultaneously, all Key and Value vectors from every layer are stored as the KV cache. This allows decode to skip reprocessing the entire prompt.
🍽️ THE LLM RESTAURANT

  CUSTOMER (User) walks in and places an order:
    "I'd like a detailed Italian pasta dish, gluten-free,
     with a side salad, no onions, for two people"

  This order = YOUR PROMPT (many specific requirements)

  STEP 1: ORDER TICKET (Tokenization)
    The waiter breaks the order into individual items:
      [Italian] [pasta] [gluten-free] [side salad] [no onions] [for two]
    = 6 "tokens" on the ticket

  STEP 2: KITCHEN READS FULL ORDER AT ONCE (Self-Attention)
    The head chef reads the ENTIRE ticket simultaneously and
    understands the RELATIONSHIPS between items:
      "Italian" + "pasta"            → ah, we're making Italian pasta
      "gluten-free" modifies "pasta" → use rice noodles
      "side salad" + "no onions"     → salad without onions
      "for two"                      → double everything
    This cross-referencing of every item with every other item
    = the ATTENTION MATRIX.

  STEP 3: PREP STATIONS SET UP (KV Cache Generation)
    Based on understanding the full order, the kitchen sets up
    PREP STATIONS — pre-measured ingredients, sauces heated, pans selected:
      Station 1: Rice noodles (measured for 2)
      Station 2: Italian sauce (no-onion variant)
      Station 3: Salad greens (no onions, for 2)
      Station 4: Garnishes ready
    These prep stations = the KV CACHE.

  STEP 4: FIRST DISH ELEMENT PRODUCED (First Token)
    With everything prepped, the kitchen produces the first component:
    the boiling water hits the noodles.
    Time from order to first action = TTFT (Time To First Token).

  NOW: the kitchen moves to the DECODE phase — producing each dish
  component one at a time, using the prep stations (KV cache) instead
  of re-reading the order ticket every time.
| Restaurant | LLM Prefill |
|---|---|
| Customer's order | Your prompt (input text) |
| Breaking order into items | Tokenization |
| Chef reads full order at once | Self-attention (all tokens in parallel) |
| "gluten-free" modifies "pasta" | Attention scores between related tokens |
| Setting up prep stations | Generating the KV cache |
| Pre-measured ingredients | Stored Key/Value vectors per token per layer |
| Time from order to first action | TTFT (Time To First Token) |
| Many chefs working simultaneously | GPU parallel processing (CUDA cores) |
| Bigger order = more prep time | Longer prompts = longer prefill (O(N²)) |
def prefill(prompt: str, model: TransformerModel) -> Tuple[Token, KVCache]:
    """
    Process the entire prompt and return the first generated token + KV cache.
    This runs ONCE per request.
    """
    # ───────────────────────────────────────────────────────
    # STEP 1: Tokenize
    # ───────────────────────────────────────────────────────
    token_ids = tokenizer.encode(prompt)
    # "Explain how a CPU works to a 5-year-old"
    # → [849, 1268, 264, 18622, 4375, 311, 264, 220, -, 3236, -]
    seq_len = len(token_ids)  # 11

    # ───────────────────────────────────────────────────────
    # STEP 2: Embed
    # ───────────────────────────────────────────────────────
    hidden_states = embedding_table[token_ids]
    # shape: [11, 8192] (seq_len × hidden_dim)

    # ───────────────────────────────────────────────────────
    # STEP 3: Process through all transformer layers
    # ───────────────────────────────────────────────────────
    kv_cache = {}
    for layer_idx in range(80):  # 80 layers for Llama-3-70B
        # --- Self-Attention ---
        Q = hidden_states @ W_Q[layer_idx]  # [11, 8192] × [8192, 8192] → [11, 8192]
        K = hidden_states @ W_K[layer_idx]  # [11, 8192] × [8192, 1024] → [11, 1024]
        V = hidden_states @ W_V[layer_idx]  # [11, 8192] × [8192, 1024] → [11, 1024]

        # ← STORE K and V — this IS the KV cache
        kv_cache[layer_idx] = (K, V)

        attention_scores = (Q @ K.transpose()) / sqrt(head_dim)  # [11, 11] per head (GQA head grouping omitted)
        attention_scores = apply_causal_mask(attention_scores)
        attention_weights = softmax(attention_scores)             # [11, 11]
        attention_output = attention_weights @ V                  # [11, 1024]
        attention_output = attention_output @ W_O[layer_idx]      # output projection → [11, 8192]
        hidden_states = hidden_states + attention_output          # residual (norms omitted for clarity)

        # --- Feed-Forward Network (SwiGLU) ---
        gate = SiLU(hidden_states @ W_gate[layer_idx])  # [11, 28672]
        up = hidden_states @ W_up[layer_idx]            # [11, 28672]
        ffn_output = (gate * up) @ W_down[layer_idx]    # [11, 8192]
        hidden_states = hidden_states + ffn_output      # residual

    # ───────────────────────────────────────────────────────
    # STEP 4: Predict first token
    # ───────────────────────────────────────────────────────
    last_hidden = hidden_states[-1]                 # [8192]
    logits = last_hidden @ W_vocab                  # [128256]
    first_token = sample(logits, temperature=0.7)   # → "Imagine"

    return first_token, kv_cache
Key points about the code above:

- hidden_states is always [11, 8192] — the whole prompt is processed as one big matrix multiplication per layer.
- kv_cache[layer_idx] = (K, V) — 80 layers × 2 matrices × 11 tokens.

KV Cache for "Explain how a CPU works to a 5-year-old" (11 tokens):

  Layer 0:   K: [11, 8, 128]   V: [11, 8, 128]
  Layer 1:   K: [11, 8, 128]   V: [11, 8, 128]
  ... (80 layers total) ...
  Layer 79:  K: [11, 8, 128]   V: [11, 8, 128]

  Total: 80 × 2 × 11 × 8 × 128 = 1,802,240 FP16 values ≈ 3.6 MB
For a 4096-token prompt:
  80 × 2 × 4096 × 8 × 128 × 2 bytes (FP16) = ~1.34 GB per request
  50 concurrent users = 67 GB = almost an entire H100's memory
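A minimal sketch of this KV-cache arithmetic, using the Llama-3-70B shapes from this section:

```python
def kv_cache_bytes(seq_len: int,
                   n_layers: int = 80,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """KV cache size for one request: K and V, every layer, FP16 by default."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value

print(kv_cache_bytes(11) / 1e6)         # ≈ 3.6 MB  (our 11-token prompt)
print(kv_cache_bytes(4096) / 1e9)       # ≈ 1.34 GB (4096-token prompt)
print(50 * kv_cache_bytes(4096) / 1e9)  # ≈ 67 GB   (50 concurrent users)
```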
Q × K^T → shape: [N, N]

  For N = 11:     11 × 11      = 121 multiply-adds           → trivial
  For N = 512:    512 × 512    = 262,144 multiply-adds       → easy
  For N = 4096:   4096 × 4096  = 16,777,216 multiply-adds    → heavy
  For N = 32768:  32K × 32K    = 1,073,741,824 multiply-adds → MASSIVE

Arithmetic Intensity = FLOPs / Bytes Loaded

  Prefill: ~N² FLOPs / ~N bytes = ~N  (HIGH → compute-bound)
  Decode:  ~N  FLOPs / ~N bytes = ~1  (LOW → memory-bound)
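A quick sanity check of the quadratic growth in attention-score count:

```python
# Attention scores per head per layer grow as N²
for n in (11, 512, 4096, 32768):
    print(f"N = {n:>6}: {n * n:>13,} scores per head per layer")
```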
GPU Utilization During Prefill:

  CUDA Cores: █████████████████████████░░░░░  ~85% utilized (doing matmuls)
  Memory Bus: ██████████░░░░░░░░░░░░░░░░░░░░  ~35% utilized (not the bottleneck)

This is GOOD — the GPU is doing what it's designed for.
| Metric | Value |
|---|---|
| Parameters | 70 billion |
| Layers | 80 |
| Hidden dimension | 8,192 |
| Attention heads (Q) | 64 |
| KV heads (GQA) | 8 |
| Head dimension | 128 |
| Vocabulary size | 128,256 |
| Model weights (FP16) | ~140 GB (needs 2× H100s min) |
| Prompt Length | Attention FLOPs | Prefill Time (1× H100) | KV Cache Size |
|---|---|---|---|
| 128 tokens | ~0.5 TFLOP | ~8 ms | 42 MB |
| 512 tokens | ~8 TFLOP | ~25 ms | 167 MB |
| 2,048 tokens | ~128 TFLOP | ~85 ms | 670 MB |
| 4,096 tokens | ~512 TFLOP | ~180 ms | 1.34 GB |
| 8,192 tokens | ~2048 TFLOP | ~450 ms | 2.68 GB |
| 32,768 tokens | ~32768 TFLOP | ~3500 ms | 10.7 GB |
Where the time goes (4096-token prefill, Llama-3-70B):

  Self-Attention (QKV + attention + output):  ~55% of time
  Feed-Forward Network (gate + up + down):    ~40% of time
  Layer norms, residuals, embedding:           ~5% of time

  Total: ~180 ms on 1× H100, ~45 ms with 4-way tensor parallelism
| PREFILL | DECODE |
|---|---|
| Processes ALL tokens at once | Generates ONE token per step |
| Attention: N×N matrix (121 ops) | Attention: 1×N vector (11 ops) |
| Bottleneck: COMPUTE | Bottleneck: MEMORY BANDWIDTH |
| GPU cores: ~85% utilized | GPU cores: ~12% utilized |
| Memory BW: ~35% utilized | Memory BW: ~88% utilized |
| Runs ONCE per request | Runs N times (once per token) |
| Duration: milliseconds | Duration: seconds |
| Output: KV cache + 1st token | Output: all remaining tokens |
| Metric: TTFT | Metric: ITL, throughput |
| Wants: more FLOPS, higher TP | Wants: more bandwidth, batching |
| Scaling: O(N²) with prompt length | Scaling: O(N) per step |
BEFORE (Aggregated — same GPU does both):

  GPU 0: ██ PREFILL ████ DECODE █████ PREFILL ████ DECODE ███
         compute-heavy   BW-heavy     blocks decode  wastes compute

AFTER (Disaggregated — dedicated GPUs):

  Prefill GPU: ██ REQ1 ██ REQ2 ██ REQ3 ██ REQ4 ██   (always computing)
  Decode GPU:  ██ REQ1 ██████████████████████████   (batches users)

  KV Transfer: Prefill GPU ──NIXL──▶ Decode GPU     (~1 ms NVLink)
| Metric | Aggregated | Disaggregated | Improvement |
|---|---|---|---|
| Prefill GPU compute | ~60% | ~90% | 1.5× |
| Decode GPU bandwidth | ~40% | ~85% | 2.1× |
| Requests per GPU | baseline | ~2× | 2× |
| TTFT consistency | variable | consistent | much better |
Decode is the second phase of every LLM inference call. After prefill has processed your entire prompt and built the KV cache, decode takes over and generates the response one token at a time.
Prefill already happened:
  Input:  "Explain how a CPU works to a 5-year-old"
  Output: KV cache (11 tokens cached) + first token "Imagine"

Now DECODE begins:
  Step 1: "Imagine"  → model predicts → "a"
  Step 2: "a"        → model predicts → "tiny"
  Step 3: "tiny"     → model predicts → "factory"
  Step 4: "factory"  → model predicts → "inside"
  Step 5: "inside"   → model predicts → "your"
  Step 6: "your"     → model predicts → "computer"
  Step 7: "computer" → model predicts → "."

┌───────────────────────────────────────────────────┐
│                   DECODE PHASE                    │
│  Input:      ONE new token per step               │
│  Reads:      entire KV cache + all model weights  │
│  Output:     ONE next token per step              │
│  Bottleneck: memory bandwidth (GB/s)              │
└───────────────────────────────────────────────────┘
DECODE STEP 1

  Input token: "Imagine" (just this ONE token)
  KV Cache: 11 entries from prefill
    [Explain] [how] [a] [CPU] [works] [to] [a] [5] [-] [year] [-old]

  What the GPU does at each of 80 layers:
    a. Compute Q, K, V for "Imagine"
    b. ← READ all 11 cached K vectors from GPU memory
    c. ← READ all 11 cached V vectors from GPU memory
    d. Attention = softmax(Q_new × [K_cached; K_new]^T) × V_all
       Shape: [1] × [12]^T → [1, 12] attention scores
    e. Append K_new and V_new to cache (11 → 12 entries)

  Output: "a"  |  KV Cache now: 12 entries  |  Time: ~40 ms
DECODE STEP 2
  Input: "a"  |  Cache: 12 entries  |  READ 12 K + 12 V from memory
  Attention: [1, 13] scores  |  Output: "tiny"  |  Cache: 13 entries
Step 3: "tiny"     | Reads: 13 entries | → "factory"  | Cache: 14
Step 4: "factory"  | Reads: 14 entries | → "inside"   | Cache: 15
Step 5: "inside"   | Reads: 15 entries | → "your"     | Cache: 16
Step 6: "your"     | Reads: 16 entries | → "computer" | Cache: 17
Step 7: "computer" | Reads: 17 entries | → "."        | Cache: 18
Decode Step 4 — generating "inside" after "Imagine a tiny factory"
The single query token "factory" attends to all 14 cached tokens:
           Explain  how    a    CPU  works   to    a     5     -   year  -old  Imagine   a   tiny
factory  [  0.02   0.03  0.01  0.15  0.08   0.03  0.01  0.02  0.01  0.01  0.01   0.18  0.04  0.40 ]
                                ▲                                                  ▲           ▲
                          "CPU" gets                                       "Imagine" gets   "tiny" gets
                        high attention                                     high attention    highest
Just 14 attention scores. Trivial compute.
But reading 14 K + 14 V vectors across 80 layers from GPU memory = SLOW.
🍽️ THE LLM RESTAURANT — DECODE PHASE

  Prep stations (KV cache) ready from prefill:
    Station 1: Rice noodles    Station 2: Italian sauce
    Station 3: Salad greens    Station 4: Garnishes

  Action 1: Boil the noodles
    Chef walks to ALL 4 stations to check context → decides action
    Time: mostly WALKING between stations, not actual cooking

  Action 2: Start the sauce
    Chef walks to ALL 5 stations (4 + 1 new from action 1)
    Even MORE walking now

  Action 3: Prepare the salad
    Chef walks to ALL 6 stations... and so on

  KEY INSIGHT:
    Chef's HANDS barely doing work (low compute).
    Chef's LEGS exhausted from walking (high memory BW).
    The actual cooking is simple. CHECKING every station
    before each action is what takes all the time.
    This is decode: tiny compute, massive memory reads.

  WHAT MAKES IT WORSE:
    If a NEW customer (prefill) arrives mid-cooking, the chef
    STOPS, reads the new order, sets up new stations, then resumes.
    The first customer's food gets cold = "token stuttering".

  SOLUTION: One chef ONLY preps (prefill GPU).
    Another ONLY cooks (decode GPU). Stations passed via NIXL.
| Restaurant (Decode) | LLM Decode |
|---|---|
| One cooking action at a time | One token generated per step |
| Checking every station before each action | Reading entire KV cache from GPU memory |
| More stations = more walking | Longer sequences = more data reads |
| Chef's hands barely working | GPU compute ~12% utilized |
| Chef's legs exhausted | GPU memory bandwidth ~88% utilized |
| New customer interrupting cooking | Prefill interrupting decode |
| Separate prep chef and cooking chef | Disaggregated serving |
def decode(first_token, kv_cache, model, max_tokens=200):
    """Generate tokens one at a time using the KV cache from prefill."""
    generated = []
    current = first_token  # "Imagine"

    for step in range(max_tokens):
        hidden = embedding_table[current]  # [1, 8192] ← ONE token

        for layer_idx in range(80):
            Q_new = hidden @ W_Q[layer_idx]  # [1, 8192]
            K_new = hidden @ W_K[layer_idx]  # [1, 1024]
            V_new = hidden @ W_V[layer_idx]  # [1, 1024]

            # ← THE EXPENSIVE PART: read cached K, V from HBM
            K_cached = kv_cache[layer_idx].K  # [seq_len, 1024] ← BOTTLENECK
            V_cached = kv_cache[layer_idx].V  # [seq_len, 1024] ← BOTTLENECK
            K_all = concat(K_cached, K_new)   # [seq_len+1, 1024]
            V_all = concat(V_cached, V_new)

            # Attention is a VECTOR, not a matrix
            scores = Q_new @ K_all.T / sqrt(128)   # [1, seq_len+1]
            weights = softmax(scores)
            attn_out = weights @ V_all             # [1, 1024]
            attn_out = attn_out @ W_O[layer_idx]   # output projection → [1, 8192]

            kv_cache[layer_idx].K = K_all  # cache grows by 1
            kv_cache[layer_idx].V = V_all

            hidden = hidden + attn_out  # residual (norms omitted for clarity)

            # Feed-Forward (same ops, but for 1 token)
            gate = SiLU(hidden @ W_gate[layer_idx])  # [1, 28672]
            up = hidden @ W_up[layer_idx]
            ffn = (gate * up) @ W_down[layer_idx]    # [1, 8192]
            hidden = hidden + ffn                    # residual

        logits = hidden @ W_vocab  # [1, 128256]
        next_token = sample(logits, temperature=0.7)
        generated.append(next_token)
        if next_token == EOS:
            break
        current = next_token

    return generated
Key points about the decode loop:

- hidden is always [1, 8192] — just one token per step.
- Attention scores are [1, seq_len+1], not [seq_len, seq_len].

KV cache growth as decode proceeds (Llama-3-70B, FP16):

| Token generated | Cache entries | Cache size | % of H100 80 GB |
|---|---|---|---|
| Prefill done | 11 | 3.6 MB | 0.005% |
| After 50 tokens | 61 | 20 MB | 0.025% |
| After 200 tokens | 211 | 69 MB | 0.086% |
| After 4K tokens | 4,107 | 1.34 GB | 1.68% |
| After 32K tokens | 32,779 | 10.72 GB | 13.4% |
| After 128K tokens | 128,011 | 41.9 GB | 52.3% ← uh oh |
Concurrent users (4K context each):

| Users | KV per user | Total KV | Remaining for model |
|---|---|---|---|
| 1 | 1.34 GB | 1.34 GB | 78.66 GB ✓ |
| 10 | 1.34 GB | 13.4 GB | 66.6 GB ✓ |
| 30 | 1.34 GB | 40.2 GB | 39.8 GB ⚠️ tight |
| 50 | 1.34 GB | 67.0 GB | 13.0 GB ✗ OOM |
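A minimal sketch of this capacity math; free_gpu_bytes is a hypothetical budget left on one GPU after its share of the weights is loaded:

```python
def max_concurrent_users(free_gpu_bytes: float, context_len: int = 4096) -> int:
    """How many full KV caches of the given context length fit in free GPU memory."""
    # Same arithmetic as kv_cache_bytes() above: 2 × layers × seq × kv_heads × head_dim × 2 bytes
    per_user = 2 * 80 * context_len * 8 * 128 * 2
    return int(free_gpu_bytes // per_user)

# Hypothetical: ~45 GB left on one 80 GB H100 after its tensor-parallel weight shard
print(max_concurrent_users(45e9))  # ≈ 33 users at 4K context
```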
KVBM Memory Hierarchy (Dynamo's solution):

  ┌────────────────────────────────────────┐
  │ G1: GPU HBM   (~80 GB,  ~3.35 TB/s)    │  Hot cache (active decode)
  ├────────────────────────────────────────┤
  │ G2: CPU RAM   (~2 TB,   ~200 GB/s)     │  Warm cache (recently used)
  ├────────────────────────────────────────┤
  │ G3: Local SSD (~8 TB,   ~12 GB/s)      │  Cold cache (idle sessions)
  ├────────────────────────────────────────┤
  │ G4: Remote    (unlimited, variable)    │  Archive (persistent)
  └────────────────────────────────────────┘

  Result: 10× more concurrent users than GPU memory alone allows
Data READ per token (Llama-3-70B):

  Model weights (per layer):  ~1.63 GB × 80 layers = ~130 GB
  KV cache (at 4K tokens):    ~16 MB  × 80 layers = ~1.3 GB
  TOTAL PER TOKEN:            ~131 GB

  Compute per token: ~54 billion FLOPs (54 GFLOP)

  Arithmetic Intensity = 54 GFLOP / 131 GB      = 0.41 FLOPs/byte
  H100 balance point   = 990 TFLOPS / 3.35 TB/s = 295 FLOPs/byte

  Decode is 720× below the balance point.
  GPU compute cores are ~99.86% IDLE during decode.
GPU Utilization During Decode (1 user):

  CUDA Cores: █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  ~3% (almost idle!)
  Memory Bus: ██████████████████████████░░░░  ~88% (reading as fast as possible)

  87% of time = reading model weights from memory.
   3% of time = actual computation.
Key insight: model weights are the SAME for every user.
Read them once, apply them to ALL users' tokens simultaneously.

  1 user:     130 GB weights →   1 token  → 0.41 FLOPs/byte (wasteful)
  8 users:    130 GB weights →   8 tokens → 3.3  FLOPs/byte (better)
  32 users:   130 GB weights →  32 tokens → 13.1 FLOPs/byte (good)
  128 users:  130 GB weights → 128 tokens → 52   FLOPs/byte (great)
But each user has their own KV cache:

| Users | Weight reads | KV cache reads | Total | Intensity (FLOPs/byte) |
|---|---|---|---|---|
| 1 | 130 GB | 1.3 GB | 131 GB | 0.41 |
| 8 | 130 GB | 10.4 GB | 140 GB | 3.1 |
| 32 | 130 GB | 41.6 GB | 172 GB | 10.1 |
| 64 | 130 GB | 83.2 GB | 213 GB | 16.2 |

Sweet spot: 32-64 users per GPU (balancing compute vs memory)
| Scenario | Data read/step | Time/token | Tokens/sec |
|---|---|---|---|
| 1 user, 4K context | ~131 GB | ~40 ms | ~25 |
| 8 users, 4K context | ~140 GB | ~42 ms | ~190 |
| 32 users, 4K context | ~172 GB | ~51 ms | ~627 |
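These figures follow roughly from bytes read per step divided by memory bandwidth. A minimal sketch of that estimate, assuming ~3.35 TB/s of HBM bandwidth and the byte counts above:

```python
def decode_throughput(users: int,
                      weight_bytes: float = 130e9,
                      kv_bytes_per_user: float = 1.3e9,
                      hbm_bandwidth: float = 3.35e12):
    """Rough bandwidth-bound estimate: (seconds per decode step, total tokens/sec)."""
    bytes_per_step = weight_bytes + users * kv_bytes_per_user
    time_per_step = bytes_per_step / hbm_bandwidth
    return time_per_step, users / time_per_step

for users in (1, 8, 32):
    t, tps = decode_throughput(users)
    print(f"{users:>2} users: ~{t * 1000:.0f} ms/token, ~{tps:.0f} tokens/sec")
```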
Where the time goes (1 user, 4K context):

  Weight loading from HBM:     ~87%  (QKV: 22%, FFN: 55%, output: 10%)
  KV cache reading from HBM:    ~9%
  Actual computation:           ~3%
  Other:                        ~1%

87% of decode = reading model weights. The GPU is a memory-reading machine.
WITHOUT disaggregation:

  User A (decode):  tok tok tok ────BLOCKED──── tok tok tok
  User B (prefill):             ────PREFILL────

  User A sees: tokens... FREEZE 200ms... tokens (stuttering)

WITH disaggregation (Dynamo):

  Prefill GPU: ──── B ──── ──── C ──── ──── D ────
  Decode GPU:  tok tok tok tok tok tok tok tok tok

  User A sees: smooth, consistent flow. No interruptions.
WITHOUT KVBM:

  GPU 80GB: [Model 35GB] [KV User1 15GB] [KV User2 15GB] [KV U3 15GB] ← OOM
  User 4 → REJECTED

WITH KVBM:

  GPU 80GB:  [Model 35GB] [KV Active 30GB]
  CPU 512GB: [KV warm users 200GB]
  SSD 4TB:   [KV cold users 2TB]
  User 4 → KV fetched from SSD in ~10ms
A long generation adds up: 10K tokens × 40 ms = 400 seconds (6.7 minutes).

Dynamo solutions:
- Speculative decoding: predict multiple tokens per step (2-4× speedup)
- Request migration: move requests to less-loaded GPUs
- KV offloading: spill the growing cache via KVBM
| PREFILL | DECODE |
|---|---|
| Processes ALL tokens at once | Generates ONE token per step |
| Runs ONCE per request | Runs HUNDREDS of times |
| Attention: N×N matrix | Attention: 1×N vector |
| Reads weights once | Reads weights EVERY step |
| GPU compute: ~85% | GPU compute: ~3-12% |
| GPU memory BW: ~35% | GPU memory BW: ~88% |
| Bottleneck: FLOPS | Bottleneck: MEMORY BANDWIDTH |
| KV cache: WRITTEN | KV cache: READ + APPENDED |
| Duration: ms | Duration: seconds |
| Metric: TTFT | Metric: ITL, throughput |
| Restaurant: read order + prep stations at once | Restaurant: cook one step, check all stations each time |
| Sprinter: explosive burst | Marathon: steady endurance |
The fundamental insight:

  PREFILL needs: ████████████████ COMPUTE
  DECODE  needs: ████████████████ BANDWIDTH

  Same GPU      → both compromised. Neither at peak.
  Separate GPUs → both at peak. Maximum efficiency.

  PREFILL GPU: ████████████████ COMPUTE   (100% focused)
  DECODE GPU:  ████████████████ BANDWIDTH (100% focused)
  KV TRANSFER: ──NIXL──▶ (~1 ms via NVLink)

This is NVIDIA Dynamo's core design principle.