
Prefill in LLM Inference

How does an LLM process your prompt before generating a single word? This document simulates the prefill phase step by step — with real tokens, real math, and a restaurant analogy that makes it click.
Contents
  1. What is Prefill?
  2. Real-World Token Simulation
  3. The Restaurant Analogy
  4. Step-by-Step Technical Walkthrough
  5. KV Cache: What Gets Stored
  6. Why Prefill is Compute-Bound
  7. Real Numbers: Llama-3-70B
  8. Prefill vs Decode Side-by-Side

1. What is Prefill?

Prefill is the first phase of every LLM inference call. Before the model generates a single token of output, it must process your entire input prompt in one shot.

Think of it as the model "reading and understanding" everything you said before it starts "writing" a response.

You type: "Explain how a CPU works to a 5-year-old"

  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚                   PREFILL PHASE                  β”‚
  β”‚                                                  β”‚
  β”‚  Input: all your tokens processed IN PARALLEL    β”‚
  β”‚  Output: KV cache + first generated token        β”‚
  β”‚  Duration: one-time, milliseconds                β”‚
  β”‚  Bottleneck: GPU compute (FLOPS)                 β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚                   DECODE PHASE                   β”‚
  β”‚                                                  β”‚
  β”‚  Input: one token at a time (autoregressive)     β”‚
  β”‚  Output: next token, then next, then next...     β”‚
  β”‚  Duration: ongoing, seconds                      β”‚
  β”‚  Bottleneck: memory bandwidth (GB/s)             β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2. Real-World Token Simulation

Let's trace a real prompt through prefill. We'll use:

Prompt: "Explain how a CPU works to a 5-year-old"

Step 1: Tokenization

The tokenizer (e.g., SentencePiece for Llama 2, a tiktoken-style BPE for Llama 3) breaks your text into tokens. Tokens are NOT always full words:

Raw text:  "Explain how a CPU works to a 5-year-old"

Tokenized:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Index  β”‚ Token ID β”‚ Token Text                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   0    β”‚  849     β”‚ "Explain"                       β”‚
β”‚   1    β”‚  1268    β”‚ " how"                          β”‚
β”‚   2    β”‚  264     β”‚ " a"                            β”‚
β”‚   3    β”‚  18622   β”‚ " CPU"                          β”‚
β”‚   4    β”‚  4375    β”‚ " works"                        β”‚
β”‚   5    β”‚  311     β”‚ " to"                           β”‚
β”‚   6    β”‚  264     β”‚ " a"                            β”‚
β”‚   7    β”‚  220     β”‚ " 5"                            β”‚
β”‚   8    β”‚  -       β”‚ "-"                             β”‚
β”‚   9    β”‚  3236    β”‚ "year"                          β”‚
β”‚  10    β”‚  -       β”‚ "-old"                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Total: 11 tokens
Note: a leading space is usually part of the token itself, so the model sees " how" rather than "how". Subwords like "year" and "-old" get their own tokens. The exact split (and the token IDs) depends on the tokenizer vocabulary.
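
If you want to see how a real tokenizer actually splits this prompt (the IDs above are illustrative), here is a minimal sketch using Hugging Face's AutoTokenizer. It assumes the transformers package is installed; swap in any tokenizer you have access to, and expect the exact IDs and splits to differ:

# Minimal sketch: inspect a real tokenizer's split of the prompt.
# Assumes `pip install transformers`; the repo name below is just one possible choice.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")
prompt = "Explain how a CPU works to a 5-year-old"

ids = tokenizer.encode(prompt, add_special_tokens=False)
for idx, (tok_id, piece) in enumerate(zip(ids, tokenizer.convert_ids_to_tokens(ids))):
    print(f"{idx:>3}  {tok_id:>7}  {piece!r}")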

Step 2: Embedding

Each token ID is looked up in an embedding table to get a high-dimensional vector:

Token "Explain" (ID 849)  β†’  [0.023, -0.156, 0.891, 0.042, ..., -0.334]
                               ╰──────────── 8192 dimensions ────────────╯
                               (for Llama-3-70B; 4096 for smaller models)

Token " how"    (ID 1268) β†’  [0.178, 0.045, -0.223, 0.567, ..., 0.112]
Token " a"      (ID 264)  β†’  [-0.089, 0.334, 0.012, -0.445, ..., 0.267]
  ... and so on for all 11 tokens

This creates an embedding matrix of shape [11, 8192] — 11 tokens, each with 8192 dimensions.
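
As a quick shape check, here is a tiny PyTorch sketch of that lookup. The table is randomly initialized (the real one is a trained weight matrix), and the IDs are the illustrative ones from the table above, with 0 standing in for the two elided IDs:

# Shape check for the embedding lookup (random table, illustrative token IDs).
import torch

vocab_size, hidden_dim = 128256, 8192
embedding_table = torch.nn.Embedding(vocab_size, hidden_dim)   # note: ~4 GB of random float32

token_ids = torch.tensor([849, 1268, 264, 18622, 4375, 311, 264, 220, 0, 3236, 0])  # 0 = elided IDs above
hidden_states = embedding_table(token_ids)
print(hidden_states.shape)   # torch.Size([11, 8192])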

Step 3: Through Every Transformer Layer (The Big Part)

This embedding matrix now passes through every single transformer layer in the model. For Llama-3-70B, that's 80 layers. At each layer:

Layer 1 of 80:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚              SELF-ATTENTION                              β”‚    β”‚
β”‚  β”‚                                                          β”‚    β”‚
β”‚  β”‚  For each of the 11 tokens, compute:                     β”‚    β”‚
β”‚  β”‚                                                          β”‚    β”‚
β”‚  β”‚    Q (Query)  = token_embedding Γ— W_Q                    β”‚    β”‚
β”‚  β”‚    K (Key)    = token_embedding Γ— W_K    ← STORED in KV  β”‚    β”‚
β”‚  β”‚    V (Value)  = token_embedding Γ— W_V    ← STORED in KV  β”‚    β”‚
β”‚  β”‚                                                          β”‚    β”‚
β”‚  β”‚  Then: Attention = softmax(Q Γ— K^T / √d) Γ— V            β”‚    β”‚
β”‚  β”‚                                                          β”‚    β”‚
β”‚  β”‚  Every token attends to every other token (11 Γ— 11)      β”‚    β”‚
β”‚  β”‚  That's 121 attention scores computed simultaneously      β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                           β”‚                                      β”‚
β”‚                           β–Ό                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚              FEED-FORWARD NETWORK (MLP)                  β”‚    β”‚
β”‚  β”‚                                                          β”‚    β”‚
β”‚  β”‚  Each token goes through:                                β”‚    β”‚
β”‚  β”‚    hidden = token Γ— W_gate  (upproject)                  β”‚    β”‚
β”‚  β”‚    hidden = SiLU(hidden) Γ— (token Γ— W_up)                β”‚    β”‚
β”‚  β”‚    output = hidden Γ— W_down (downproject)                β”‚    β”‚
β”‚  β”‚                                                          β”‚    β”‚
β”‚  β”‚  For 70B: 8192 β†’ 28672 β†’ 8192 dimensions                β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                                                  β”‚
β”‚  Output: Updated embeddings [11, 8192] + KV cache for layer 1   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

... repeat for layers 2, 3, 4, ... all the way to layer 80

Step 4: The Attention Matrix (Visualized)

Here's what the 11×11 attention matrix looks like for our prompt at one attention head:

                Explain  how    a    CPU  works   to    a     5     -   year  -old
  Explain      [  1.0   0.0   0.0   0.0   0.0   0.0  0.0   0.0  0.0  0.0   0.0 ]
  how          [ 0.35   0.65  0.0   0.0   0.0   0.0  0.0   0.0  0.0  0.0   0.0 ]
  a            [ 0.15   0.25  0.60  0.0   0.0   0.0  0.0   0.0  0.0  0.0   0.0 ]
  CPU          [ 0.30   0.10  0.05  0.55  0.0   0.0  0.0   0.0  0.0  0.0   0.0 ]
  works        [ 0.10   0.15  0.05  0.40  0.30  0.0  0.0   0.0  0.0  0.0   0.0 ]
  to           [ 0.20   0.05  0.05  0.10  0.15  0.45 0.0   0.0  0.0  0.0   0.0 ]
  a            [ 0.05   0.05  0.10  0.05  0.05  0.30 0.40  0.0  0.0  0.0   0.0 ]
  5            [ 0.10   0.05  0.05  0.05  0.05  0.10 0.10  0.50 0.0  0.0   0.0 ]
  -            [ 0.05   0.03  0.02  0.03  0.02  0.05 0.05  0.35 0.40 0.0   0.0 ]
  year         [ 0.05   0.05  0.05  0.05  0.05  0.05 0.05  0.15 0.10 0.40  0.0 ]
  -old         [ 0.10   0.05  0.05  0.05  0.05  0.10 0.05  0.15 0.05 0.15  0.20]

  ↑ Each row = how much each token "attends to" every previous token.
  ↑ CAUSAL attention β€” tokens only see tokens before them (upper triangle = 0).
  ↑ "CPU" attends strongly to "Explain" (0.30) β€” it knows this is about explaining CPUs.
  ↑ "-old" attends to "5" (0.15) and "year" (0.15) β€” understands "5-year-old" as one concept.

The critical point: During prefill, this ENTIRE 11×11 matrix is computed in ONE shot. All 121 attention scores calculated simultaneously using GPU parallelism. This is why prefill is compute-heavy but fast.
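
Here is that one-shot computation as a PyTorch sketch for a single attention head. Random tensors stand in for the real Q/K/V projections; only the shapes and the causal masking matter here:

# One-shot causal attention for a single head (random stand-in tensors).
import torch

seq_len, head_dim = 11, 128
Q = torch.randn(seq_len, head_dim)
K = torch.randn(seq_len, head_dim)
V = torch.randn(seq_len, head_dim)

scores = Q @ K.T / head_dim**0.5                                  # [11, 11], all 121 scores at once
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))           # upper triangle becomes 0 after softmax
weights = torch.softmax(scores, dim=-1)                           # each row sums to 1, like the matrix above
output = weights @ V                                              # [11, 128]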

Step 5: Output — First Token + KV Cache

After all 80 layers, the final hidden state of the LAST token goes through a linear layer to produce logits over the entire vocabulary (~128,000 tokens for Llama-3):

Final hidden state of "-old" β†’ Linear projection β†’ Logits [128,256]

Top predictions:
  "A"        β†’ probability 0.12
  "Imagine"  β†’ probability 0.09
  "Think"    β†’ probability 0.08
  "So"       β†’ probability 0.07
  "OK"       β†’ probability 0.06
  ...

Sampled (temperature=0.7): "Imagine"   ← This is the FIRST generated token (TTFT)

Simultaneously, all Key and Value vectors from every layer are stored as the KV cache. This allows decode to skip reprocessing the entire prompt.
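
That last step, temperature-scaled sampling over the vocabulary, looks roughly like this, with random logits standing in for the model's real output:

# Temperature sampling from the final-token logits (random logits for illustration).
import torch

vocab_size = 128256
logits = torch.randn(vocab_size)          # the real logits come from last_hidden @ W_vocab

temperature = 0.7
probs = torch.softmax(logits / temperature, dim=-1)
first_token_id = torch.multinomial(probs, num_samples=1).item()
print(first_token_id)                     # run through the tokenizer to get text like "Imagine"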


3. The Restaurant Analogy (Prefill Edition)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     🍽️  THE LLM RESTAURANT                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                     β”‚
β”‚  CUSTOMER (User) walks in and places an order:                      β”‚
β”‚  "I'd like a detailed Italian pasta dish, gluten-free,              β”‚
β”‚   with a side salad, no onions, for two people"                     β”‚
β”‚                                                                     β”‚
β”‚  This order = YOUR PROMPT (many specific requirements)              β”‚
β”‚                                                                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                     β”‚
β”‚  STEP 1: ORDER TICKET (Tokenization)                                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚ The waiter breaks the order into individual items:        β”‚      β”‚
β”‚  β”‚                                                           β”‚      β”‚
β”‚  β”‚  [Italian] [pasta] [gluten-free] [side salad]            β”‚      β”‚
β”‚  β”‚  [no onions] [for two]                                   β”‚      β”‚
β”‚  β”‚                                                           β”‚      β”‚
β”‚  β”‚ = 6 "tokens" on the ticket                               β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                                                                     β”‚
β”‚  STEP 2: KITCHEN READS FULL ORDER AT ONCE (Self-Attention)          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚ The head chef reads the ENTIRE ticket simultaneously.     β”‚      β”‚
β”‚  β”‚ They understand RELATIONSHIPS between items:              β”‚      β”‚
β”‚  β”‚                                                           β”‚      β”‚
β”‚  β”‚  "Italian" + "pasta" β†’ ah, we're making Italian pasta     β”‚      β”‚
β”‚  β”‚  "gluten-free" modifies "pasta" β†’ use rice noodles        β”‚      β”‚
β”‚  β”‚  "side salad" + "no onions" β†’ salad without onions        β”‚      β”‚
β”‚  β”‚  "for two" β†’ double everything                            β”‚      β”‚
β”‚  β”‚                                                           β”‚      β”‚
β”‚  β”‚ This cross-referencing of every item with every other     β”‚      β”‚
β”‚  β”‚ item = the ATTENTION MATRIX.                              β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                                                                     β”‚
β”‚  STEP 3: PREP STATIONS SET UP (KV Cache Generation)                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚ Based on understanding the full order, the kitchen sets   β”‚      β”‚
β”‚  β”‚ up PREP STATIONS β€” pre-measured ingredients, sauces       β”‚      β”‚
β”‚  β”‚ heated, pans selected:                                    β”‚      β”‚
β”‚  β”‚                                                           β”‚      β”‚
β”‚  β”‚  Station 1: Rice noodles (measured for 2)                 β”‚      β”‚
β”‚  β”‚  Station 2: Italian sauce (no onion variant)              β”‚      β”‚
β”‚  β”‚  Station 3: Salad greens (no onions, for 2)              β”‚      β”‚
β”‚  β”‚  Station 4: Garnishes ready                               β”‚      β”‚
β”‚  β”‚                                                           β”‚      β”‚
β”‚  β”‚ These prep stations = KV CACHE                            β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                                                                     β”‚
β”‚  STEP 4: FIRST DISH ELEMENT PRODUCED (First Token)                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚ With everything prepped, the kitchen produces the first   β”‚      β”‚
β”‚  β”‚ component: the boiling water hits the noodles.            β”‚      β”‚
β”‚  β”‚                                                           β”‚      β”‚
β”‚  β”‚ Time from order to first action = TTFT                    β”‚      β”‚
β”‚  β”‚ (Time To First Token)                                     β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                                                                     β”‚
β”‚  NOW: The kitchen moves to DECODE phase β€” producing each            β”‚
β”‚  dish component one at a time, using the prep stations (KV cache)   β”‚
β”‚  instead of re-reading the order ticket every time.                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Restaurant                           β”‚ LLM Prefill
─────────────────────────────────────┼──────────────────────────────────────────────
Customer's order                     β”‚ Your prompt (input text)
Breaking order into items            β”‚ Tokenization
Chef reads full order at once        β”‚ Self-attention (all tokens in parallel)
"gluten-free" modifies "pasta"       β”‚ Attention scores between related tokens
Setting up prep stations             β”‚ Generating the KV cache
Pre-measured ingredients             β”‚ Stored Key/Value vectors per token per layer
Time from order to first action      β”‚ TTFT (Time To First Token)
Many chefs working simultaneously    β”‚ GPU parallel processing (CUDA cores)
Bigger order = more prep time        β”‚ Longer prompts = longer prefill (O(NΒ²))

4. Step-by-Step Technical Walkthrough

def prefill(prompt: str, model: TransformerModel) -> Tuple[Token, KVCache]:
    """
    Process the entire prompt and return the first generated token + KV cache.
    This runs ONCE per request.
    """

    # ═══════════════════════════════════════════════════════
    # STEP 1: Tokenize
    # ═══════════════════════════════════════════════════════
    token_ids = tokenizer.encode(prompt)
    # "Explain how a CPU works to a 5-year-old"
    # β†’ [849, 1268, 264, 18622, 4375, 311, 264, 220, -, 3236, -]
    seq_len = len(token_ids)  # 11

    # ═══════════════════════════════════════════════════════
    # STEP 2: Embed
    # ═══════════════════════════════════════════════════════
    hidden_states = embedding_table[token_ids]
    # shape: [11, 8192]  (seq_len Γ— hidden_dim)

    # ═══════════════════════════════════════════════════════
    # STEP 3: Process through all transformer layers
    # ═══════════════════════════════════════════════════════
    kv_cache = {}

    for layer_idx in range(80):  # 80 layers for Llama-3-70B

        # --- Self-Attention ---
        # (per-head split, GQA head grouping, and the output projection W_O
        #  are omitted here to keep the shapes readable)
        Q = hidden_states @ W_Q[layer_idx]  # [11, 8192] Γ— [8192, 8192] β†’ [11, 8192]
        K = hidden_states @ W_K[layer_idx]  # [11, 8192] Γ— [8192, 1024] β†’ [11, 1024]
        V = hidden_states @ W_V[layer_idx]  # [11, 8192] Γ— [8192, 1024] β†’ [11, 1024]

        # β˜… STORE K and V β€” this IS the KV cache
        kv_cache[layer_idx] = (K, V)

        attention_scores = (Q @ K.transpose()) / sqrt(head_dim)   # [11, 11] per head
        attention_scores = apply_causal_mask(attention_scores)
        attention_weights = softmax(attention_scores)              # [11, 11]
        attention_output = attention_weights @ V                   # [11, 8192] after heads recombined
        hidden_states = hidden_states + attention_output           # residual connection

        # --- Feed-Forward Network (SwiGLU) ---
        gate = SiLU(hidden_states @ W_gate[layer_idx])  # [11, 28672]
        up   = hidden_states @ W_up[layer_idx]          # [11, 28672]
        ffn_output = (gate * up) @ W_down[layer_idx]    # [11, 8192]

        hidden_states = layer_norm(hidden_states + ffn_output)  # residual + norm (Llama uses pre-norm RMSNorm; simplified here)

    # ═══════════════════════════════════════════════════════
    # STEP 4: Predict first token
    # ═══════════════════════════════════════════════════════
    last_hidden = hidden_states[-1]                    # [8192]
    logits = last_hidden @ W_vocab                     # [128256]
    first_token = sample(logits, temperature=0.7)      # β†’ "Imagine"

    return first_token, kv_cache

Key observations

  1. All 11 tokens processed at once — hidden_states is always [11, 8192]. One big matrix multiplication.
  2. KV cache = K and V saved per layer — kv_cache[layer_idx] = (K, V). 80 layers × 2 matrices × 11 tokens.
  3. Attention matrix is N×N — 11×11 = 121 scores. For 4096 tokens: 16.7 million scores. This is why prefill is compute-bound.
  4. Only the LAST token predicts the next token — the other 10 outputs are discarded (their K,V stay in cache).

5. KV Cache: What Gets Stored

KV Cache for "Explain how a CPU works to a 5-year-old" (11 tokens):

Layer 0:  K: [11, 8, 128]  V: [11, 8, 128]
Layer 1:  K: [11, 8, 128]  V: [11, 8, 128]
... (80 layers total) ...
Layer 79: K: [11, 8, 128]  V: [11, 8, 128]

Total: 80 Γ— 2 Γ— 11 Γ— 8 Γ— 128 = 1,802,240 float16 values = ~3.6 MB

For a 4096-token prompt:

80 Γ— 2 Γ— 4096 Γ— 8 Γ— 128 Γ— 2 bytes (FP16) = ~1.34 GB per request
50 concurrent users = 67 GB = almost an entire H100's memory
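
That arithmetic is easy to script. A back-of-envelope helper using the constants quoted in this document (80 layers, 8 KV heads, head dim 128, FP16):

# KV cache sizing, using the Llama-3-70B shape constants quoted in this document.
def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 80,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:              # FP16
    return 2 * num_layers * num_tokens * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V

print(kv_cache_bytes(11) / 1e6)           # ~3.6 MB : our 11-token prompt
print(kv_cache_bytes(4096) / 1e9)         # ~1.34 GB: one 4096-token request
print(50 * kv_cache_bytes(4096) / 1e9)    # ~67 GB  : 50 concurrent 4K-token users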

6. Why Prefill is Compute-Bound

Q Γ— K^T  β†’  shape: [N, N]

For N = 11:     11 Γ— 11   = 121 multiply-adds           β†’ trivial
For N = 512:    512 Γ— 512 = 262,144 multiply-adds       β†’ easy
For N = 4096:   4096 Γ— 4096 = 16,777,216 multiply-adds  β†’ heavy
For N = 32768:  32K Γ— 32K = 1,073,741,824 multiply-adds β†’ MASSIVE
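
A short loop reproduces that growth. It is nothing but counting, but it is exactly the NΒ² term that dominates long-prompt prefill:

# The NΓ—N attention-score count that makes long-prompt prefill compute-heavy.
for n in (11, 512, 4096, 32768):
    print(f"N = {n:>6}: {n * n:>13,} attention scores")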

Arithmetic Intensity = FLOPs / Bytes Loaded
Prefill:  ~NΒ² FLOPs / ~N bytes  =  ~N  (HIGH β€” compute-bound)
Decode:   ~N FLOPs  / ~N bytes  =  ~1  (LOW β€” memory-bound)
GPU Utilization During Prefill:

CUDA Cores: β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘  ~85% utilized (doing matmuls)
Memory Bus:  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘  ~35% utilized (not the bottleneck)

This is GOOD β€” the GPU is doing what it's designed for.

7. Real Numbers: Llama-3-70B

Metric                 β”‚ Value
───────────────────────┼──────────────────────────────────
Parameters             β”‚ 70 billion
Layers                 β”‚ 80
Hidden dimension       β”‚ 8,192
Attention heads (Q)    β”‚ 64
KV heads (GQA)         β”‚ 8
Head dimension         β”‚ 128
Vocabulary size        β”‚ 128,256
Model weights (FP16)   β”‚ ~140 GB (needs at least 2Γ— H100s)

Prompt Length  β”‚ Attention FLOPs β”‚ Prefill Time (1Γ— H100) β”‚ KV Cache Size
───────────────┼─────────────────┼────────────────────────┼──────────────
128 tokens     β”‚ ~0.5 TFLOP      β”‚ ~8 ms                  β”‚ 42 MB
512 tokens     β”‚ ~8 TFLOP        β”‚ ~25 ms                 β”‚ 167 MB
2,048 tokens   β”‚ ~128 TFLOP      β”‚ ~85 ms                 β”‚ 670 MB
4,096 tokens   β”‚ ~512 TFLOP      β”‚ ~180 ms                β”‚ 1.34 GB
8,192 tokens   β”‚ ~2,048 TFLOP    β”‚ ~450 ms                β”‚ 2.68 GB
32,768 tokens  β”‚ ~32,768 TFLOP   β”‚ ~3,500 ms              β”‚ 10.7 GB

Where the time goes (4096-token prefill, Llama-3-70B):

Self-Attention (QKV + attention + output):   ~55% of time
Feed-Forward Network (gate + up + down):     ~40% of time
Layer norms, residuals, embedding:            ~5% of time

Total: ~180ms on 1Γ— H100, ~45ms with 4-way tensor parallelism

8. Prefill vs Decode Side-by-Side

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           PREFILL                β”‚           DECODE                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Processes ALL tokens at once     β”‚ Generates ONE token per step     β”‚
β”‚ Attention: NΓ—N matrix (121 ops)  β”‚ Attention: 1Γ—N vector (11 ops)   β”‚
β”‚ Bottleneck: COMPUTE              β”‚ Bottleneck: MEMORY BANDWIDTH    β”‚
β”‚ GPU cores: ~85% utilized         β”‚ GPU cores: ~12% utilized         β”‚
β”‚ Memory BW: ~35% utilized         β”‚ Memory BW: ~88% utilized         β”‚
β”‚ Runs ONCE per request            β”‚ Runs N times (once per token)    β”‚
β”‚ Duration: milliseconds           β”‚ Duration: seconds                β”‚
β”‚ Output: KV cache + 1st token     β”‚ Output: all remaining tokens     β”‚
β”‚ Metric: TTFT                     β”‚ Metric: ITL, throughput          β”‚
β”‚ Wants: More FLOPS, higher TP     β”‚ Wants: More bandwidth, batching  β”‚
β”‚ Scaling: O(NΒ²) with prompt len   β”‚ Scaling: O(N) per step           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
BEFORE (Aggregated β€” same GPU does both):

  GPU 0: β–ˆβ–ˆ PREFILL β–ˆβ–ˆβ–‘β–‘β–‘ DECODE β–‘β–‘β–‘β–‘β–ˆβ–ˆ PREFILL β–ˆβ–ˆβ–‘β–‘β–‘ DECODE β–‘β–‘β–‘β–‘
         compute-heavy  BW-heavy      blocks decode   wastes compute

AFTER (Disaggregated β€” dedicated GPUs):

  Prefill GPU:  β–ˆβ–ˆ REQ1 β–ˆβ–ˆ REQ2 β–ˆβ–ˆ REQ3 β–ˆβ–ˆ REQ4 β–ˆβ–ˆ  (always computing)
  Decode GPU:   β–‘β–‘ REQ1 β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘    (batches users)
  KV Transfer:  Prefill GPU ──NIXL──→ Decode GPU (~1ms NVLink)

Metric                β”‚ Aggregated β”‚ Disaggregated β”‚ Improvement
──────────────────────┼────────────┼───────────────┼─────────────
Prefill GPU compute   β”‚ ~60%       β”‚ ~90%          β”‚ 1.5Γ—
Decode GPU bandwidth  β”‚ ~40%       β”‚ ~85%          β”‚ 2.1Γ—
Requests per GPU      β”‚ baseline   β”‚ ~2Γ—           β”‚
TTFT consistency      β”‚ variable   β”‚ consistent    β”‚ much better

Sources

  1. Attention Is All You Need — Vaswani et al., 2017 → arxiv.org/abs/1706.03762
  2. PagedAttention / vLLM — Kwon et al., 2023 → arxiv.org/abs/2309.06180
  3. Splitwise — Patel et al., 2024 → arxiv.org/abs/2311.18677
  4. DistServe — Zhong et al., 2024 → arxiv.org/abs/2401.09670
  5. NVIDIA Dynamo β†’ github.com/ai-dynamo/dynamo
  6. FlashAttention — Dao et al., 2022 → arxiv.org/abs/2205.14135
  7. Llama 3 Model Card — Meta, 2024 → github.com/meta-llama/llama3

Decode in LLM Inference

After prefill processes your prompt, decode takes over and generates the response — one token at a time. This document simulates the decode phase step by step, with real tokens, real math, and a restaurant analogy that makes it click.
Contents
  1. What is Decode?
  2. Real-World Token Simulation
  3. The Restaurant Analogy (Decode Edition)
  4. Step-by-Step Technical Walkthrough
  5. The KV Cache Growth Problem
  6. Why Decode is Memory-Bandwidth-Bound
  7. Batching: How Decode Becomes Efficient
  8. Real Numbers: Llama-3-70B
  9. Decode Failure Modes
  10. Prefill vs Decode — The Full Picture

1. What is Decode?

Decode is the second phase of every LLM inference call. After prefill has processed your entire prompt and built the KV cache, decode takes over and generates the response one token at a time.

Prefill already happened:
  Input: "Explain how a CPU works to a 5-year-old"
  Output: KV cache (11 tokens cached) + first token "Imagine"

Now DECODE begins:

  Step 1:  "Imagine"  β†’ model predicts β†’ "a"
  Step 2:  "a"        β†’ model predicts β†’ "tiny"
  Step 3:  "tiny"     β†’ model predicts β†’ "factory"
  Step 4:  "factory"  β†’ model predicts β†’ "inside"
  Step 5:  "inside"   β†’ model predicts β†’ "your"
  Step 6:  "your"     β†’ model predicts β†’ "computer"
  Step 7:  "computer" β†’ model predicts β†’ "."

  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚                  DECODE PHASE                    β”‚
  β”‚  Input: ONE new token per step                   β”‚
  β”‚  Reads: entire KV cache + all model weights      β”‚
  β”‚  Output: ONE next token per step                 β”‚
  β”‚  Bottleneck: memory bandwidth (GB/s)             β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2. Real-World Token Simulation

Decode Step 1: Generating "a"

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ DECODE STEP 1                                                       β”‚
β”‚                                                                     β”‚
β”‚ Input token: "Imagine" (just this ONE token)                        β”‚
β”‚ KV Cache: 11 entries from prefill                                   β”‚
β”‚   [Explain] [how] [a] [CPU] [works] [to] [a] [5] [-] [year] [-old]β”‚
β”‚                                                                     β”‚
β”‚ What the GPU does at each of 80 layers:                             β”‚
β”‚   a. Compute Q, K, V for "Imagine"                                  β”‚
β”‚   b. β˜… READ all 11 cached K vectors from GPU memory                β”‚
β”‚   c. β˜… READ all 11 cached V vectors from GPU memory                β”‚
β”‚   d. Attention = softmax(Q_new Γ— [K_cached; K_new]^T) Γ— V_all     β”‚
β”‚      Shape: [1] Γ— [12]^T β†’ [1, 12] attention scores                β”‚
β”‚   e. Append K_new and V_new to cache (11 β†’ 12 entries)             β”‚
β”‚                                                                     β”‚
β”‚ Output: "a"    |    KV Cache now: 12 entries    |    Time: ~40ms    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Decode Step 2: Generating "tiny"

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ DECODE STEP 2                                                       β”‚
β”‚ Input: "a"  |  Cache: 12 entries  |  READ 12 K + 12 V from memory  β”‚
β”‚ Attention: [1, 13] scores  |  Output: "tiny"  |  Cache: 13 entries β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Steps 3–7: The pattern

Step 3:  "tiny"     | Reads: 13 entries | β†’ "factory"  | Cache: 14
Step 4:  "factory"  | Reads: 14 entries | β†’ "inside"   | Cache: 15
Step 5:  "inside"   | Reads: 15 entries | β†’ "your"     | Cache: 16
Step 6:  "your"     | Reads: 16 entries | β†’ "computer" | Cache: 17
Step 7:  "computer" | Reads: 17 entries | β†’ "."        | Cache: 18

The attention vector (not matrix!)

Decode Step 4 β€” generating "inside" after "Imagine a tiny factory"

The single query token "factory" attends to all 14 cached tokens:

            Explain  how    a    CPU  works   to    a     5     -   year  -old  Imagine  a    tiny
  factory  [ 0.02   0.03  0.01  0.15  0.08  0.03  0.01  0.02  0.01  0.01  0.01  0.18   0.04  0.40 ]
                                  ↑                                                ↑            ↑
                              "CPU" gets                                    "Imagine" gets   "tiny" gets
                              high attention                                high attention   highest

  14 multiply-adds. Trivial compute.
  But reading 14 K + 14 V vectors across 80 layers from GPU memory = SLOW.
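
The same single-head sketch as in the prefill document, but for one decode step: a single query vector against the cached keys and values (random tensors, cache length 14 as above):

# One decode step of single-head attention: a [1, d] query against the KV cache.
import torch

cache_len, head_dim = 14, 128
q_new   = torch.randn(1, head_dim)            # query for the new token ("factory")
K_cache = torch.randn(cache_len, head_dim)    # stands in for the 14 cached keys
V_cache = torch.randn(cache_len, head_dim)    # stands in for the 14 cached values

scores  = q_new @ K_cache.T / head_dim**0.5   # [1, 14], a vector, not a matrix
weights = torch.softmax(scores, dim=-1)
output  = weights @ V_cache                   # [1, 128], trivial math; the expensive part
                                              # is fetching K_cache/V_cache from HBM at every layer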

3. The Restaurant Analogy (Decode Edition)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 🍽️  THE LLM RESTAURANT β€” DECODE PHASE               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                     β”‚
β”‚  Prep stations (KV cache) ready from prefill:                       β”‚
β”‚    Station 1: Rice noodles    Station 2: Italian sauce              β”‚
β”‚    Station 3: Salad greens    Station 4: Garnishes                  β”‚
β”‚                                                                     β”‚
β”‚  Action 1: Boil the noodles                                         β”‚
β”‚    Chef walks to ALL 4 stations to check context β†’ decides action   β”‚
β”‚    Time: mostly WALKING between stations, not actual cooking        β”‚
β”‚                                                                     β”‚
β”‚  Action 2: Start the sauce                                          β”‚
β”‚    Chef walks to ALL 5 stations (4 + 1 new from action 1)          β”‚
β”‚    Even MORE walking now                                            β”‚
β”‚                                                                     β”‚
β”‚  Action 3: Prepare the salad                                        β”‚
β”‚    Chef walks to ALL 6 stations... and so on                        β”‚
β”‚                                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
β”‚  β”‚ KEY INSIGHT:                                               β”‚     β”‚
β”‚  β”‚ Chef's HANDS barely doing work (low compute).              β”‚     β”‚
β”‚  β”‚ Chef's LEGS exhausted from walking (high memory BW).       β”‚     β”‚
β”‚  β”‚ The actual cooking is simple. CHECKING every station       β”‚     β”‚
β”‚  β”‚ before each action is what takes all the time.             β”‚     β”‚
β”‚  β”‚ This is decode: tiny compute, massive memory reads.        β”‚     β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β”‚                                                                     β”‚
β”‚  WHAT MAKES IT WORSE:                                               β”‚
β”‚  If a NEW customer (prefill) arrives mid-cooking, the chef          β”‚
β”‚  STOPS, reads new order, sets up new stations, then resumes.        β”‚
β”‚  First customer's food gets cold = "token stuttering".              β”‚
β”‚                                                                     β”‚
β”‚  SOLUTION: One chef ONLY preps (prefill GPU).                       β”‚
β”‚  Another ONLY cooks (decode GPU). Stations passed via NIXL.         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Restaurant (Decode)                          β”‚ LLM Decode
─────────────────────────────────────────────┼───────────────────────────────────────────
One cooking action at a time                 β”‚ One token generated per step
Checking every station before each action    β”‚ Reading entire KV cache from GPU memory
More stations = more walking                 β”‚ Longer sequences = more data reads
Chef's hands barely working                  β”‚ GPU compute ~12% utilized
Chef's legs exhausted                        β”‚ GPU memory bandwidth ~88% utilized
New customer interrupting cooking            β”‚ Prefill interrupting decode
Separate prep chef and cooking chef          β”‚ Disaggregated serving

4. Step-by-Step Technical Walkthrough

def decode(first_token, kv_cache, model, max_tokens=200):
    """Generate tokens one at a time using the KV cache from prefill."""
    generated = []
    current = first_token  # "Imagine"

    for step in range(max_tokens):
        hidden = embedding_table[current]        # [1, 8192] β€” ONE token

        for layer_idx in range(80):
            # (per-head split, GQA grouping, and W_O omitted, as in prefill)
            Q_new = hidden @ W_Q[layer_idx]      # [1, 8192]
            K_new = hidden @ W_K[layer_idx]      # [1, 1024]
            V_new = hidden @ W_V[layer_idx]      # [1, 1024]

            # β˜… THE EXPENSIVE PART: read cached K, V from HBM
            K_cached = kv_cache[layer_idx].K     # [seq_len, 1024] ← BOTTLENECK
            V_cached = kv_cache[layer_idx].V     # [seq_len, 1024] ← BOTTLENECK

            K_all = concat(K_cached, K_new)      # [seq_len+1, 1024]
            V_all = concat(V_cached, V_new)

            # Attention is a VECTOR, not a matrix
            scores = Q_new @ K_all.T / sqrt(128) # [1, seq_len+1]
            weights = softmax(scores)
            attn_out = weights @ V_all           # [1, 8192] after heads recombined
            hidden = hidden + attn_out           # residual connection

            kv_cache[layer_idx].K = K_all        # cache grows by 1
            kv_cache[layer_idx].V = V_all

            # Feed-Forward (SwiGLU; same ops as prefill, but for 1 token)
            gate = SiLU(hidden @ W_gate[layer_idx])    # [1, 28672]
            up   = hidden @ W_up[layer_idx]            # [1, 28672]
            ffn  = (gate * up) @ W_down[layer_idx]     # [1, 8192]
            hidden = layer_norm(hidden + ffn)          # residual + norm (pre-norm RMSNorm in real Llama)

        logits = hidden @ W_vocab                # [1, 128256]
        next_token = sample(logits, temperature=0.7)
        generated.append(next_token)
        if next_token == EOS: break
        current = next_token

    return generated

Key observations

  1. Only 1 token goes through the model — hidden is always [1, 8192]
  2. KV cache READ every step — 80 times per step (once per layer). This is the bottleneck.
  3. Attention is a vector dot product — [1, seq_len+1], not [seq_len, seq_len]
  4. Cache grows by 1 each step — after 200 steps: 211 entries
  5. Full model weights read EVERY step — ~140 GB read per token

5. The KV Cache Growth Problem

Token generated  β”‚ Cache entries β”‚ Cache size (70B, FP16) β”‚ % of H100 80GB
─────────────────┼───────────────┼────────────────────────┼────────────────
Prefill done     β”‚      11       β”‚     3.6 MB             β”‚    0.005%
After 50 tokens  β”‚      61       β”‚    20 MB               β”‚    0.025%
After 200 tokens β”‚     211       β”‚    69 MB               β”‚    0.086%
After 4K tokens  β”‚   4,107       β”‚  1.34 GB               β”‚    1.68%
After 32K tokens β”‚  32,779       β”‚ 10.72 GB               β”‚   13.4%
After 128K tokensβ”‚ 128,011       β”‚ 41.9 GB                β”‚   52.3%  ← uh oh
Concurrent users:

Users β”‚ KV per user β”‚ Total KV  β”‚ Remaining for model
──────┼─────────────┼───────────┼────────────────────
  1   β”‚  1.34 GB    β”‚  1.34 GB  β”‚ 78.66 GB  βœ“
  10  β”‚  1.34 GB    β”‚ 13.4 GB   β”‚ 66.6 GB   βœ“
  30  β”‚  1.34 GB    β”‚ 40.2 GB   β”‚ 39.8 GB   ⚠️ tight
  50  β”‚  1.34 GB    β”‚ 67.0 GB   β”‚ 13.0 GB   βœ— OOM
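
The capacity math behind that table takes a few lines, assuming an 80 GB GPU and the ~1.34 GB-per-user figure from above (model weights and activations are left out, as in the table):

# KV-cache capacity back-of-envelope for an 80 GB GPU (weights/activations ignored).
GPU_MEMORY_GB  = 80.0
KV_PER_USER_GB = 1.34          # one user at ~4K-token context (see above)

for users in (1, 10, 30, 50):
    total_kv  = users * KV_PER_USER_GB
    remaining = GPU_MEMORY_GB - total_kv
    print(f"{users:>3} users: {total_kv:5.1f} GB of KV cache, {remaining:5.1f} GB left over")
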
KVBM Memory Hierarchy (Dynamo's solution):

  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ G1: GPU HBM  (~80 GB, ~3.35 TB/s)   β”‚ ← Hot cache (active decode)
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
  β”‚ G2: CPU RAM  (~2 TB, ~200 GB/s)     β”‚ ← Warm cache (recently used)
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
  β”‚ G3: Local SSD (~8 TB, ~12 GB/s)     β”‚ ← Cold cache (idle sessions)
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
  β”‚ G4: Remote   (unlimited, variable)   β”‚ ← Archive (persistent)
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  Result: 10Γ— more concurrent users than GPU memory alone allows

6. Why Decode is Memory-Bandwidth-Bound

Data READ per token (Llama-3-70B):

  Model weights (per layer): ~1.63 GB Γ— 80 layers = ~130 GB
  KV cache (at 4K tokens):   ~16 MB Γ— 80 layers  = ~1.3 GB
  TOTAL PER TOKEN: ~131 GB

Compute per token:
  ~54 billion FLOPs (54 GFLOP)

Arithmetic Intensity = 54 GFLOP / 131 GB = 0.41 FLOPs/byte
H100 balance point  = 990 TFLOPS / 3.35 TB/s = 295 FLOPs/byte

Decode is 720Γ— below the balance point.
GPU compute cores are ~99.86% IDLE during decode.
GPU Utilization During Decode (1 user):

CUDA Cores: β–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘  ~3%  (almost idle!)
Memory Bus:  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘  ~88% (reading as fast as possible)

87% of time = reading model weights from memory.
3% of time = actual computation.
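
The two ratios above take only a few lines to reproduce, using the figures quoted in this section:

# Arithmetic intensity vs. the H100 balance point, using this section's figures.
flops_per_token = 54e9       # ~54 GFLOP of compute per generated token
bytes_per_token = 131e9      # ~131 GB read per token (weights + KV cache)
peak_flops      = 990e12     # H100 FP16 dense, ~990 TFLOPS
peak_bandwidth  = 3.35e12    # H100 HBM3, ~3.35 TB/s

intensity     = flops_per_token / bytes_per_token    # about 0.41 FLOPs/byte
balance_point = peak_flops / peak_bandwidth          # about 295 FLOPs/byte
print(intensity, balance_point, balance_point / intensity)   # decode sits ~720x below the knee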

7. Batching: How Decode Becomes Efficient

Key insight: Model weights are the SAME for every user.
Read once, apply to ALL users' tokens simultaneously.

1 user:    130 GB weights β†’   1 token  β†’  0.41 FLOPs/byte  (wasteful)
8 users:   130 GB weights β†’   8 tokens β†’  3.3 FLOPs/byte   (better)
32 users:  130 GB weights β†’  32 tokens β†’ 13.1 FLOPs/byte   (good)
128 users: 130 GB weights β†’ 128 tokens β†’ 52 FLOPs/byte     (great)
But each user has their own KV cache:

Users β”‚ Weight reads β”‚ KV cache reads β”‚ Total     β”‚ Intensity
──────┼──────────────┼────────────────┼───────────┼──────────
  1   β”‚   130 GB     β”‚    1.3 GB      β”‚  131 GB   β”‚  0.41
  8   β”‚   130 GB     β”‚   10.4 GB      β”‚  140 GB   β”‚  3.1
  32  β”‚   130 GB     β”‚   41.6 GB      β”‚  172 GB   β”‚  10.1
  64  β”‚   130 GB     β”‚   83.2 GB      β”‚  213 GB   β”‚  16.2

Sweet spot: 32-64 users per GPU (balancing compute vs memory)
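
The table above follows directly from the same numbers: weights are read once per step no matter how many users are batched, while KV cache reads scale with the batch. A sketch:

# Batching raises arithmetic intensity: weight reads amortize, KV reads do not.
WEIGHT_READ_GB  = 130      # read once per decode step, shared by the whole batch
KV_PER_USER_GB  = 1.3      # per user at ~4K-token context
FLOPS_PER_TOKEN = 54e9     # ~54 GFLOP per generated token

for users in (1, 8, 32, 64):
    total_bytes = (WEIGHT_READ_GB + users * KV_PER_USER_GB) * 1e9
    intensity   = users * FLOPS_PER_TOKEN / total_bytes
    print(f"{users:>3} users: {total_bytes / 1e9:6.1f} GB read, {intensity:5.1f} FLOPs/byte")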

8. Real Numbers: Llama-3-70B

Scenario              β”‚ Data read/step β”‚ Time/token β”‚ Tokens/sec
──────────────────────┼────────────────┼────────────┼───────────
1 user, 4K context    β”‚ ~131 GB        β”‚ ~40 ms     β”‚ ~25
8 users, 4K context   β”‚ ~140 GB        β”‚ ~42 ms     β”‚ ~190
32 users, 4K context  β”‚ ~172 GB        β”‚ ~51 ms     β”‚ ~627

Where the time goes (1 user, 4K context):

Weight loading from HBM:    ~87%   (QKV: 22%, FFN: 55%, output: 10%)
KV cache reading from HBM:   ~9%
Actual computation:           ~3%
Other:                        ~1%

87% of decode = reading model weights. The GPU is a memory-reading machine.
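
Because decode is bandwidth-bound, time per step is roughly bytes read divided by memory bandwidth. A sketch that reproduces the table above:

# Bandwidth-bound latency estimate: time/step is roughly bytes read / HBM bandwidth.
HBM_BANDWIDTH = 3.35e12    # H100, bytes per second

for users, gb_read in ((1, 131), (8, 140), (32, 172)):
    step_time = gb_read * 1e9 / HBM_BANDWIDTH              # seconds per decode step
    print(f"{users:>2} users: {step_time * 1e3:4.1f} ms/token, "
          f"{users / step_time:4.0f} tokens/sec aggregate")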

9. Decode Failure Modes

Failure Mode 1: Prefill Interference

WITHOUT disaggregation:
  User A (decode): tok tok tok β–ˆβ–ˆβ–ˆβ–ˆBLOCKEDβ–ˆβ–ˆβ–ˆβ–ˆ tok tok tok
  User B (prefill):             β–ˆβ–ˆβ–ˆβ–ˆPREFILLβ–ˆβ–ˆβ–ˆβ–ˆ
  User A sees: tokens... FREEZE 200ms... tokens (stuttering)

WITH disaggregation (Dynamo):
  Prefill GPU: β–ˆβ–ˆβ–ˆβ–ˆ B β–ˆβ–ˆβ–ˆβ–ˆ β–ˆβ–ˆβ–ˆβ–ˆ C β–ˆβ–ˆβ–ˆβ–ˆ β–ˆβ–ˆβ–ˆβ–ˆ D β–ˆβ–ˆβ–ˆβ–ˆ
  Decode GPU:  tok tok tok tok tok tok tok tok tok
  User A sees: smooth, consistent flow. No interruptions.

Failure Mode 2: KV Cache OOM

WITHOUT KVBM:
  GPU 80GB: [Model 35GB] [KV User1 15GB] [KV User2 15GB] [KV U3 15GB] ← OOM
  User 4 β†’ REJECTED

WITH KVBM:
  GPU 80GB: [Model 35GB] [KV Active 30GB]
  CPU 512GB: [KV warm users 200GB]
  SSD 4TB: [KV cold users 2TB]
  User 4 β†’ KV fetched from SSD in ~10ms

Failure Mode 3: Long outputs

10K tokens Γ— 40ms = 400 seconds (6.7 minutes)

Dynamo solutions:
  - Speculative decoding: predict multiple tokens (2-4Γ— speedup)
  - Request migration: move to less-loaded GPUs
  - KV offloading: spill growing cache via KVBM

10. Prefill vs Decode — The Full Picture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           PREFILL                β”‚           DECODE                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Processes ALL tokens at once     β”‚ Generates ONE token per step     β”‚
β”‚ Runs ONCE per request            β”‚ Runs HUNDREDS of times           β”‚
β”‚ Attention: NΓ—N matrix            β”‚ Attention: 1Γ—N vector            β”‚
β”‚ Reads weights once               β”‚ Reads weights EVERY step         β”‚
β”‚ GPU compute: ~85%                β”‚ GPU compute: ~3-12%              β”‚
β”‚ GPU memory BW: ~35%              β”‚ GPU memory BW: ~88%              β”‚
β”‚ Bottleneck: FLOPS                β”‚ Bottleneck: MEMORY BANDWIDTH    β”‚
β”‚ KV cache: WRITTEN                β”‚ KV cache: READ + APPENDED       β”‚
β”‚ Duration: ms                     β”‚ Duration: seconds                β”‚
β”‚ Metric: TTFT                     β”‚ Metric: ITL, throughput          β”‚
β”‚ Restaurant: read order + prep    β”‚ Restaurant: cook one step,       β”‚
β”‚   stations at once               β”‚   check all stations each time   β”‚
β”‚ Sprinter: explosive burst        β”‚ Marathon: steady endurance       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
The fundamental insight:

  PREFILL needs: β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ COMPUTE
  DECODE needs:  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ BANDWIDTH

  Same GPU β†’ both compromised. Neither at peak.
  Separate GPUs β†’ both at peak. Maximum efficiency.

  PREFILL GPU: β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ COMPUTE (100% focused)
  DECODE GPU:  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ BANDWIDTH (100% focused)
  KV TRANSFER: ──NIXL──→ (~1ms via NVLink)

  This is NVIDIA Dynamo's core design principle.

Sources

  1. Attention Is All You Need — Vaswani et al., 2017 → arxiv.org/abs/1706.03762
  2. PagedAttention / vLLM — Kwon et al., 2023 → arxiv.org/abs/2309.06180
  3. Splitwise — Patel et al., 2024 → arxiv.org/abs/2311.18677
  4. DistServe — Zhong et al., 2024 → arxiv.org/abs/2401.09670
  5. Sarathi — Agrawal et al., 2023 → arxiv.org/abs/2308.16369
  6. NVIDIA Dynamo β†’ github.com/ai-dynamo/dynamo
  7. FlashAttention — Dao et al., 2022 → arxiv.org/abs/2205.14135
  8. Llama 3 — Meta, 2024 → github.com/meta-llama/llama3
  9. H100 Datasheet — NVIDIA → nvidia.com/h100
  10. Mooncake — Qin et al., 2024 → arxiv.org/abs/2407.00079