Prefill is the first phase of every LLM inference call. Before the model generates a single token of output, it must process your entire input prompt in one shot.
Think of it as the model "reading and understanding" everything you said before it starts "writing" a response.
You type: "Explain how a CPU works to a 5-year-old"
┌───────────────────────────────────────────────────┐
│                  PREFILL PHASE                    │
│                                                   │
│  Input:      all your tokens processed IN PARALLEL│
│  Output:     KV cache + first generated token     │
│  Duration:   one-time, milliseconds               │
│  Bottleneck: GPU compute (FLOPS)                  │
└───────────────────────────────────────────────────┘
                         │
                         ▼
┌───────────────────────────────────────────────────┐
│                   DECODE PHASE                    │
│                                                   │
│  Input:      one token at a time (autoregressive) │
│  Output:     next token, then next, then next...  │
│  Duration:   ongoing, seconds                     │
│  Bottleneck: memory bandwidth (GB/s)              │
└───────────────────────────────────────────────────┘
Let's trace a real prompt through prefill. We'll use:
Prompt: "Explain how a CPU works to a 5-year-old"
The tokenizer (e.g., Llama's SentencePiece / tiktoken) breaks your text into tokens. Tokens are NOT always full words:
Raw text: "Explain how a CPU works to a 5-year-old" Tokenized: ββββββββββ¬βββββββββββ¬ββββββββββββββββββββββββββββββββββ β Index β Token ID β Token Text β ββββββββββΌβββββββββββΌββββββββββββββββββββββββββββββββββ€ β 0 β 849 β "Explain" β β 1 β 1268 β " how" β β 2 β 264 β " a" β β 3 β 18622 β " CPU" β β 4 β 4375 β " works" β β 5 β 311 β " to" β β 6 β 264 β " a" β β 7 β 220 β " 5" β β 8 β - β "-" β β 9 β 3236 β "year" β β 10 β - β "-old" β ββββββββββ΄βββββββββββ΄ββββββββββββββββββββββββββββββββββ Total: 11 tokens
Each token ID is looked up in an embedding table to get a high-dimensional vector:
Token "Explain" (ID 849) β [0.023, -0.156, 0.891, 0.042, ..., -0.334]
                          └──────────── 8192 dimensions ────────────┘
(for Llama-3-70B; 4096 for smaller models)
Token " how" (ID 1268) β [0.178, 0.045, -0.223, 0.567, ..., 0.112]
Token " a" (ID 264) β [-0.089, 0.334, 0.012, -0.445, ..., 0.267]
... and so on for all 11 tokens
This creates an embedding matrix of shape [11, 8192] — 11 tokens, each with 8192 dimensions.
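A minimal sketch of this lookup in PyTorch, with randomly initialized weights just to show the shapes. The two IDs of 100 are made-up placeholders for the "-" entries in the table above, not real Llama token IDs:

```python
import torch

vocab_size, hidden_dim = 128256, 8192  # Llama-3-70B sizes

# Randomly initialized stand-in for the real embedding table (~2 GB in FP16)
embedding_table = torch.nn.Embedding(vocab_size, hidden_dim, dtype=torch.float16)

# Token IDs from the table above; 100 is a placeholder for the "-" entries
token_ids = torch.tensor([849, 1268, 264, 18622, 4375, 311, 264, 220, 100, 3236, 100])

hidden_states = embedding_table(token_ids)
print(hidden_states.shape)  # torch.Size([11, 8192])
```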
This embedding matrix now passes through every single transformer layer in the model. For Llama-3-70B, that's 80 layers. At each layer:
Layer 1 of 80:

  SELF-ATTENTION
    For each of the 11 tokens, compute:
      Q (Query) = token_embedding × W_Q
      K (Key)   = token_embedding × W_K   → STORED in KV cache
      V (Value) = token_embedding × W_V   → STORED in KV cache

    Then: Attention = softmax(Q × K^T / √d) × V

    Every token attends to every other token (11 × 11).
    That's 121 attention scores computed simultaneously.

  FEED-FORWARD NETWORK (MLP)
    Each token goes through:
      hidden = SiLU(token × W_gate) × (token × W_up)   (up-projection)
      output = hidden × W_down                          (down-projection)

    For 70B: 8192 → 28672 → 8192 dimensions

  Output: updated embeddings [11, 8192] + KV cache for layer 1

... repeat for layers 2, 3, 4, ... all the way to layer 80
Here's what the 11×11 attention matrix looks like for our prompt at one attention head:
            Explain  how    a     CPU   works  to     a     5     -     year  -old
Explain   [  1.00   0.0    0.0   0.0   0.0    0.0    0.0   0.0   0.0   0.0   0.0  ]
how       [  0.35   0.65   0.0   0.0   0.0    0.0    0.0   0.0   0.0   0.0   0.0  ]
a         [  0.15   0.25   0.60  0.0   0.0    0.0    0.0   0.0   0.0   0.0   0.0  ]
CPU       [  0.30   0.10   0.05  0.55  0.0    0.0    0.0   0.0   0.0   0.0   0.0  ]
works     [  0.10   0.15   0.05  0.40  0.30   0.0    0.0   0.0   0.0   0.0   0.0  ]
to        [  0.20   0.05   0.05  0.10  0.15   0.45   0.0   0.0   0.0   0.0   0.0  ]
a         [  0.05   0.05   0.10  0.05  0.05   0.30   0.40  0.0   0.0   0.0   0.0  ]
5         [  0.10   0.05   0.05  0.05  0.05   0.10   0.10  0.50  0.0   0.0   0.0  ]
-         [  0.05   0.03   0.02  0.03  0.02   0.05   0.05  0.35  0.40  0.0   0.0  ]
year      [  0.05   0.05   0.05  0.05  0.05   0.05   0.05  0.15  0.10  0.40  0.0  ]
-old      [  0.10   0.05   0.05  0.05  0.05   0.10   0.05  0.15  0.05  0.15  0.20 ]

• Each row = how much each token "attends to" every previous token (and itself).
• CAUSAL attention — tokens only see tokens before them (upper triangle = 0).
• "CPU" attends strongly to "Explain" (0.30) — it knows this is about explaining CPUs.
• "-old" attends to "5" (0.15) and "year" (0.15) — understands "5-year-old" as one concept.
The critical point: During prefill, this ENTIRE 11×11 matrix is computed in ONE shot. All 121 attention scores calculated simultaneously using GPU parallelism. This is why prefill is compute-heavy but fast.
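A minimal sketch of single-head causal attention in PyTorch, with random Q/K/V just to show the shapes and the masking (not the model's real weights):

```python
import torch
import torch.nn.functional as F

seq_len, head_dim = 11, 128
Q = torch.randn(seq_len, head_dim)
K = torch.randn(seq_len, head_dim)
V = torch.randn(seq_len, head_dim)

# All 11×11 scores in one matmul — this is the parallelism that makes prefill fast
scores = (Q @ K.T) / head_dim ** 0.5  # [11, 11]

# Causal mask: token i may not attend to tokens j > i
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = F.softmax(scores, dim=-1)  # each row sums to 1; upper triangle is 0
output = weights @ V                 # [11, 128]
print(weights.shape, output.shape)
```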
After all 80 layers, the final hidden state of the LAST token goes through a linear layer to produce logits over the entire vocabulary (~128,000 tokens for Llama-3):
Final hidden state of "-old" → Linear projection → Logits [128,256]

Top predictions:
  "A"        → probability 0.12
  "Imagine"  → probability 0.09
  "Think"    → probability 0.08
  "So"       → probability 0.07
  "OK"       → probability 0.06
  ...

Sampled (temperature=0.7): "Imagine"

→ This is the FIRST generated token (TTFT)
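A minimal sketch of temperature sampling over the final logits (random logits here, just to show the mechanics):

```python
import torch

vocab_size = 128256
logits = torch.randn(vocab_size)  # in the real model: last_hidden @ W_vocab

temperature = 0.7
probs = torch.softmax(logits / temperature, dim=-1)

# Draw one token ID from the temperature-scaled distribution
next_token_id = torch.multinomial(probs, num_samples=1).item()
print(next_token_id)
```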
Simultaneously, all Key and Value vectors from every layer are stored as the KV cache. This allows decode to skip reprocessing the entire prompt.
🍽️ THE LLM RESTAURANT

  CUSTOMER (User) walks in and places an order:
    "I'd like a detailed Italian pasta dish, gluten-free,
     with a side salad, no onions, for two people"

  This order = YOUR PROMPT (many specific requirements)

  STEP 1: ORDER TICKET (Tokenization)
    The waiter breaks the order into individual items:
      [Italian] [pasta] [gluten-free] [side salad] [no onions] [for two]
    = 6 "tokens" on the ticket

  STEP 2: KITCHEN READS FULL ORDER AT ONCE (Self-Attention)
    The head chef reads the ENTIRE ticket simultaneously and
    understands the RELATIONSHIPS between items:
      "Italian" + "pasta"            → ah, we're making Italian pasta
      "gluten-free" modifies "pasta" → use rice noodles
      "side salad" + "no onions"     → salad without onions
      "for two"                      → double everything
    This cross-referencing of every item with every other item
    = the ATTENTION MATRIX.

  STEP 3: PREP STATIONS SET UP (KV Cache Generation)
    Based on understanding the full order, the kitchen sets up
    PREP STATIONS — pre-measured ingredients, sauces heated, pans selected:
      Station 1: Rice noodles (measured for 2)
      Station 2: Italian sauce (no-onion variant)
      Station 3: Salad greens (no onions, for 2)
      Station 4: Garnishes ready
    These prep stations = the KV CACHE.

  STEP 4: FIRST DISH ELEMENT PRODUCED (First Token)
    With everything prepped, the kitchen produces the first component:
    the boiling water hits the noodles.
    Time from order to first action = TTFT (Time To First Token).

  NOW: the kitchen moves to the DECODE phase — producing each dish
  component one at a time, using the prep stations (KV cache) instead
  of re-reading the order ticket every time.
| Restaurant | LLM Prefill |
|---|---|
| Customer's order | Your prompt (input text) |
| Breaking order into items | Tokenization |
| Chef reads full order at once | Self-attention (all tokens in parallel) |
| "gluten-free" modifies "pasta" | Attention scores between related tokens |
| Setting up prep stations | Generating the KV cache |
| Pre-measured ingredients | Stored Key/Value vectors per token per layer |
| Time from order to first action | TTFT (Time To First Token) |
| Many chefs working simultaneously | GPU parallel processing (CUDA cores) |
| Bigger order = more prep time | Longer prompts = longer prefill (O(N²)) |
def prefill(prompt: str, model: TransformerModel) -> Tuple[Token, KVCache]:
    """
    Process the entire prompt and return the first generated token + KV cache.
    This runs ONCE per request.
    """
    # ───────────────────────────────────────────────────────
    # STEP 1: Tokenize
    # ───────────────────────────────────────────────────────
    token_ids = tokenizer.encode(prompt)
    # "Explain how a CPU works to a 5-year-old"
    # → [849, 1268, 264, 18622, 4375, 311, 264, 220, -, 3236, -]
    seq_len = len(token_ids)  # 11

    # ───────────────────────────────────────────────────────
    # STEP 2: Embed
    # ───────────────────────────────────────────────────────
    hidden_states = embedding_table[token_ids]
    # shape: [11, 8192] (seq_len × hidden_dim)

    # ───────────────────────────────────────────────────────
    # STEP 3: Process through all transformer layers
    # ───────────────────────────────────────────────────────
    kv_cache = {}
    for layer_idx in range(80):  # 80 layers for Llama-3-70B
        # --- Self-Attention ---
        Q = hidden_states @ W_Q[layer_idx]  # [11, 8192] × [8192, 8192] → [11, 8192]
        K = hidden_states @ W_K[layer_idx]  # [11, 8192] × [8192, 1024] → [11, 1024]
        V = hidden_states @ W_V[layer_idx]  # [11, 8192] × [8192, 1024] → [11, 1024]

        # ← STORE K and V — this IS the KV cache
        kv_cache[layer_idx] = (K, V)

        attention_scores = (Q @ K.transpose()) / sqrt(head_dim)  # [11, 11] per head (GQA head grouping omitted)
        attention_scores = apply_causal_mask(attention_scores)
        attention_weights = softmax(attention_scores)             # [11, 11]
        attention_output = attention_weights @ V                  # [11, 1024]
        attention_output = attention_output @ W_O[layer_idx]      # output projection → [11, 8192]
        hidden_states = hidden_states + attention_output          # residual (norms omitted for clarity)

        # --- Feed-Forward Network (SwiGLU) ---
        gate = SiLU(hidden_states @ W_gate[layer_idx])  # [11, 28672]
        up = hidden_states @ W_up[layer_idx]            # [11, 28672]
        ffn_output = (gate * up) @ W_down[layer_idx]    # [11, 8192]
        hidden_states = hidden_states + ffn_output      # residual

    # ───────────────────────────────────────────────────────
    # STEP 4: Predict first token
    # ───────────────────────────────────────────────────────
    last_hidden = hidden_states[-1]                 # [8192]
    logits = last_hidden @ W_vocab                  # [128256]
    first_token = sample(logits, temperature=0.7)   # → "Imagine"

    return first_token, kv_cache
Key points about the code above:

- hidden_states is always [11, 8192] — the whole prompt is processed as one big matrix multiplication per layer.
- kv_cache[layer_idx] = (K, V) — 80 layers × 2 matrices × 11 tokens.

KV Cache for "Explain how a CPU works to a 5-year-old" (11 tokens):

  Layer 0:   K: [11, 8, 128]   V: [11, 8, 128]
  Layer 1:   K: [11, 8, 128]   V: [11, 8, 128]
  ... (80 layers total) ...
  Layer 79:  K: [11, 8, 128]   V: [11, 8, 128]

  Total: 80 × 2 × 11 × 8 × 128 = 1,802,240 FP16 values ≈ 3.6 MB
For a 4096-token prompt:
  80 × 2 × 4096 × 8 × 128 × 2 bytes (FP16) = ~1.34 GB per request
  50 concurrent users = 67 GB = almost an entire H100's memory
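A minimal sketch of this KV-cache arithmetic, using the Llama-3-70B shapes from this section:

```python
def kv_cache_bytes(seq_len: int,
                   n_layers: int = 80,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """KV cache size for one request: K and V, every layer, FP16 by default."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value

print(kv_cache_bytes(11) / 1e6)         # ≈ 3.6 MB  (our 11-token prompt)
print(kv_cache_bytes(4096) / 1e9)       # ≈ 1.34 GB (4096-token prompt)
print(50 * kv_cache_bytes(4096) / 1e9)  # ≈ 67 GB   (50 concurrent users)
```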
Q × K^T → shape: [N, N]

  For N = 11:     11 × 11      = 121 multiply-adds           → trivial
  For N = 512:    512 × 512    = 262,144 multiply-adds       → easy
  For N = 4096:   4096 × 4096  = 16,777,216 multiply-adds    → heavy
  For N = 32768:  32K × 32K    = 1,073,741,824 multiply-adds → MASSIVE

Arithmetic Intensity = FLOPs / Bytes Loaded

  Prefill: ~N² FLOPs / ~N bytes = ~N  (HIGH → compute-bound)
  Decode:  ~N  FLOPs / ~N bytes = ~1  (LOW → memory-bound)
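A quick sanity check of the quadratic growth in attention-score count:

```python
# Attention scores per head per layer grow as N²
for n in (11, 512, 4096, 32768):
    print(f"N = {n:>6}: {n * n:>13,} scores per head per layer")
```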
GPU Utilization During Prefill:

  CUDA Cores: █████████████████████████░░░░░  ~85% utilized (doing matmuls)
  Memory Bus: ██████████░░░░░░░░░░░░░░░░░░░░  ~35% utilized (not the bottleneck)

This is GOOD — the GPU is doing what it's designed for.
| Metric | Value |
|---|---|
| Parameters | 70 billion |
| Layers | 80 |
| Hidden dimension | 8,192 |
| Attention heads (Q) | 64 |
| KV heads (GQA) | 8 |
| Head dimension | 128 |
| Vocabulary size | 128,256 |
| Model weights (FP16) | ~140 GB (needs 2× H100s min) |
| Prompt Length | Attention FLOPs | Prefill Time (1× H100) | KV Cache Size |
|---|---|---|---|
| 128 tokens | ~0.5 TFLOP | ~8 ms | 42 MB |
| 512 tokens | ~8 TFLOP | ~25 ms | 167 MB |
| 2,048 tokens | ~128 TFLOP | ~85 ms | 670 MB |
| 4,096 tokens | ~512 TFLOP | ~180 ms | 1.34 GB |
| 8,192 tokens | ~2048 TFLOP | ~450 ms | 2.68 GB |
| 32,768 tokens | ~32768 TFLOP | ~3500 ms | 10.7 GB |
Where the time goes (4096-token prefill, Llama-3-70B):

  Self-Attention (QKV + attention + output):  ~55% of time
  Feed-Forward Network (gate + up + down):    ~40% of time
  Layer norms, residuals, embedding:           ~5% of time

  Total: ~180 ms on 1× H100, ~45 ms with 4-way tensor parallelism
| PREFILL | DECODE |
|---|---|
| Processes ALL tokens at once | Generates ONE token per step |
| Attention: N×N matrix (121 ops) | Attention: 1×N vector (11 ops) |
| Bottleneck: COMPUTE | Bottleneck: MEMORY BANDWIDTH |
| GPU cores: ~85% utilized | GPU cores: ~12% utilized |
| Memory BW: ~35% utilized | Memory BW: ~88% utilized |
| Runs ONCE per request | Runs N times (once per token) |
| Duration: milliseconds | Duration: seconds |
| Output: KV cache + 1st token | Output: all remaining tokens |
| Metric: TTFT | Metric: ITL, throughput |
| Wants: more FLOPS, higher TP | Wants: more bandwidth, batching |
| Scaling: O(N²) with prompt length | Scaling: O(N) per step |
BEFORE (Aggregated — same GPU does both):

  GPU 0: ██ PREFILL ████ DECODE █████ PREFILL ████ DECODE ███
         compute-heavy   BW-heavy     blocks decode  wastes compute

AFTER (Disaggregated — dedicated GPUs):

  Prefill GPU: ██ REQ1 ██ REQ2 ██ REQ3 ██ REQ4 ██   (always computing)
  Decode GPU:  ██ REQ1 ██████████████████████████   (batches users)

  KV Transfer: Prefill GPU ──NIXL──▶ Decode GPU     (~1 ms NVLink)
| Metric | Aggregated | Disaggregated | Improvement |
|---|---|---|---|
| Prefill GPU compute | ~60% | ~90% | 1.5× |
| Decode GPU bandwidth | ~40% | ~85% | 2.1× |
| Requests per GPU | baseline | ~2× | 2× |
| TTFT consistency | variable | consistent | much better |
Decode is the second phase of every LLM inference call. After prefill has processed your entire prompt and built the KV cache, decode takes over and generates the response one token at a time.
Prefill already happened:
  Input:  "Explain how a CPU works to a 5-year-old"
  Output: KV cache (11 tokens cached) + first token "Imagine"

Now DECODE begins:
  Step 1: "Imagine"  → model predicts → "a"
  Step 2: "a"        → model predicts → "tiny"
  Step 3: "tiny"     → model predicts → "factory"
  Step 4: "factory"  → model predicts → "inside"
  Step 5: "inside"   → model predicts → "your"
  Step 6: "your"     → model predicts → "computer"
  Step 7: "computer" → model predicts → "."

┌───────────────────────────────────────────────────┐
│                   DECODE PHASE                    │
│  Input:      ONE new token per step               │
│  Reads:      entire KV cache + all model weights  │
│  Output:     ONE next token per step              │
│  Bottleneck: memory bandwidth (GB/s)              │
└───────────────────────────────────────────────────┘
DECODE STEP 1

  Input token: "Imagine" (just this ONE token)
  KV Cache: 11 entries from prefill
    [Explain] [how] [a] [CPU] [works] [to] [a] [5] [-] [year] [-old]

  What the GPU does at each of 80 layers:
    a. Compute Q, K, V for "Imagine"
    b. ← READ all 11 cached K vectors from GPU memory
    c. ← READ all 11 cached V vectors from GPU memory
    d. Attention = softmax(Q_new × [K_cached; K_new]^T) × V_all
       Shape: [1] × [12]^T → [1, 12] attention scores
    e. Append K_new and V_new to cache (11 → 12 entries)

  Output: "a"  |  KV Cache now: 12 entries  |  Time: ~40 ms
DECODE STEP 2
  Input: "a"  |  Cache: 12 entries  |  READ 12 K + 12 V from memory
  Attention: [1, 13] scores  |  Output: "tiny"  |  Cache: 13 entries
Step 3: "tiny"     | Reads: 13 entries | → "factory"  | Cache: 14
Step 4: "factory"  | Reads: 14 entries | → "inside"   | Cache: 15
Step 5: "inside"   | Reads: 15 entries | → "your"     | Cache: 16
Step 6: "your"     | Reads: 16 entries | → "computer" | Cache: 17
Step 7: "computer" | Reads: 17 entries | → "."        | Cache: 18
Decode Step 4 — generating "inside" after "Imagine a tiny factory"
The single query token "factory" attends to all 14 cached tokens:
           Explain  how    a    CPU  works   to    a     5     -   year  -old  Imagine   a   tiny
factory  [  0.02   0.03  0.01  0.15  0.08   0.03  0.01  0.02  0.01  0.01  0.01   0.18  0.04  0.40 ]
                                ▲                                                  ▲           ▲
                          "CPU" gets                                       "Imagine" gets   "tiny" gets
                        high attention                                     high attention    highest
Just 14 attention scores. Trivial compute.
But reading 14 K + 14 V vectors across 80 layers from GPU memory = SLOW.
🍽️ THE LLM RESTAURANT — DECODE PHASE

  Prep stations (KV cache) ready from prefill:
    Station 1: Rice noodles    Station 2: Italian sauce
    Station 3: Salad greens    Station 4: Garnishes

  Action 1: Boil the noodles
    Chef walks to ALL 4 stations to check context → decides action
    Time: mostly WALKING between stations, not actual cooking

  Action 2: Start the sauce
    Chef walks to ALL 5 stations (4 + 1 new from action 1)
    Even MORE walking now

  Action 3: Prepare the salad
    Chef walks to ALL 6 stations... and so on

  KEY INSIGHT:
    Chef's HANDS barely doing work (low compute).
    Chef's LEGS exhausted from walking (high memory BW).
    The actual cooking is simple. CHECKING every station
    before each action is what takes all the time.
    This is decode: tiny compute, massive memory reads.

  WHAT MAKES IT WORSE:
    If a NEW customer (prefill) arrives mid-cooking, the chef
    STOPS, reads the new order, sets up new stations, then resumes.
    The first customer's food gets cold = "token stuttering".

  SOLUTION: One chef ONLY preps (prefill GPU).
    Another ONLY cooks (decode GPU). Stations passed via NIXL.
| Restaurant (Decode) | LLM Decode |
|---|---|
| One cooking action at a time | One token generated per step |
| Checking every station before each action | Reading entire KV cache from GPU memory |
| More stations = more walking | Longer sequences = more data reads |
| Chef's hands barely working | GPU compute ~12% utilized |
| Chef's legs exhausted | GPU memory bandwidth ~88% utilized |
| New customer interrupting cooking | Prefill interrupting decode |
| Separate prep chef and cooking chef | Disaggregated serving |
def decode(first_token, kv_cache, model, max_tokens=200):
    """Generate tokens one at a time using the KV cache from prefill."""
    generated = []
    current = first_token  # "Imagine"

    for step in range(max_tokens):
        hidden = embedding_table[current]  # [1, 8192] ← ONE token

        for layer_idx in range(80):
            Q_new = hidden @ W_Q[layer_idx]  # [1, 8192]
            K_new = hidden @ W_K[layer_idx]  # [1, 1024]
            V_new = hidden @ W_V[layer_idx]  # [1, 1024]

            # ← THE EXPENSIVE PART: read cached K, V from HBM
            K_cached = kv_cache[layer_idx].K  # [seq_len, 1024] ← BOTTLENECK
            V_cached = kv_cache[layer_idx].V  # [seq_len, 1024] ← BOTTLENECK
            K_all = concat(K_cached, K_new)   # [seq_len+1, 1024]
            V_all = concat(V_cached, V_new)

            # Attention is a VECTOR, not a matrix
            scores = Q_new @ K_all.T / sqrt(128)   # [1, seq_len+1]
            weights = softmax(scores)
            attn_out = weights @ V_all             # [1, 1024]
            attn_out = attn_out @ W_O[layer_idx]   # output projection → [1, 8192]

            kv_cache[layer_idx].K = K_all  # cache grows by 1
            kv_cache[layer_idx].V = V_all

            hidden = hidden + attn_out  # residual (norms omitted for clarity)

            # Feed-Forward (same ops, but for 1 token)
            gate = SiLU(hidden @ W_gate[layer_idx])  # [1, 28672]
            up = hidden @ W_up[layer_idx]
            ffn = (gate * up) @ W_down[layer_idx]    # [1, 8192]
            hidden = hidden + ffn                    # residual

        logits = hidden @ W_vocab  # [1, 128256]
        next_token = sample(logits, temperature=0.7)
        generated.append(next_token)
        if next_token == EOS:
            break
        current = next_token

    return generated
Key points about the decode loop:

- hidden is always [1, 8192] — just one token per step.
- Attention scores are [1, seq_len+1], not [seq_len, seq_len].

KV cache growth as decode proceeds (Llama-3-70B, FP16):

| Token generated | Cache entries | Cache size | % of H100 80 GB |
|---|---|---|---|
| Prefill done | 11 | 3.6 MB | 0.005% |
| After 50 tokens | 61 | 20 MB | 0.025% |
| After 200 tokens | 211 | 69 MB | 0.086% |
| After 4K tokens | 4,107 | 1.34 GB | 1.68% |
| After 32K tokens | 32,779 | 10.72 GB | 13.4% |
| After 128K tokens | 128,011 | 41.9 GB | 52.3% ← uh oh |
Concurrent users (4K context each):

| Users | KV per user | Total KV | Remaining for model |
|---|---|---|---|
| 1 | 1.34 GB | 1.34 GB | 78.66 GB ✓ |
| 10 | 1.34 GB | 13.4 GB | 66.6 GB ✓ |
| 30 | 1.34 GB | 40.2 GB | 39.8 GB ⚠️ tight |
| 50 | 1.34 GB | 67.0 GB | 13.0 GB ✗ OOM |
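A minimal sketch of this capacity math; free_gpu_bytes is a hypothetical budget left on one GPU after its share of the weights is loaded:

```python
def max_concurrent_users(free_gpu_bytes: float, context_len: int = 4096) -> int:
    """How many full KV caches of the given context length fit in free GPU memory."""
    # Same arithmetic as kv_cache_bytes() above: 2 × layers × seq × kv_heads × head_dim × 2 bytes
    per_user = 2 * 80 * context_len * 8 * 128 * 2
    return int(free_gpu_bytes // per_user)

# Hypothetical: ~45 GB left on one 80 GB H100 after its tensor-parallel weight shard
print(max_concurrent_users(45e9))  # ≈ 33 users at 4K context
```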
KVBM Memory Hierarchy (Dynamo's solution):

  ┌────────────────────────────────────────┐
  │ G1: GPU HBM   (~80 GB,  ~3.35 TB/s)    │  Hot cache (active decode)
  ├────────────────────────────────────────┤
  │ G2: CPU RAM   (~2 TB,   ~200 GB/s)     │  Warm cache (recently used)
  ├────────────────────────────────────────┤
  │ G3: Local SSD (~8 TB,   ~12 GB/s)      │  Cold cache (idle sessions)
  ├────────────────────────────────────────┤
  │ G4: Remote    (unlimited, variable)    │  Archive (persistent)
  └────────────────────────────────────────┘

  Result: 10× more concurrent users than GPU memory alone allows
Data READ per token (Llama-3-70B):

  Model weights (per layer):  ~1.63 GB × 80 layers = ~130 GB
  KV cache (at 4K tokens):    ~16 MB  × 80 layers = ~1.3 GB
  TOTAL PER TOKEN:            ~131 GB

  Compute per token: ~54 billion FLOPs (54 GFLOP)

  Arithmetic Intensity = 54 GFLOP / 131 GB      = 0.41 FLOPs/byte
  H100 balance point   = 990 TFLOPS / 3.35 TB/s = 295 FLOPs/byte

  Decode is 720× below the balance point.
  GPU compute cores are ~99.86% IDLE during decode.
GPU Utilization During Decode (1 user):

  CUDA Cores: █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  ~3% (almost idle!)
  Memory Bus: ██████████████████████████░░░░  ~88% (reading as fast as possible)

  87% of time = reading model weights from memory.
   3% of time = actual computation.
Key insight: model weights are the SAME for every user.
Read them once, apply them to ALL users' tokens simultaneously.

  1 user:     130 GB weights →   1 token  → 0.41 FLOPs/byte (wasteful)
  8 users:    130 GB weights →   8 tokens → 3.3  FLOPs/byte (better)
  32 users:   130 GB weights →  32 tokens → 13.1 FLOPs/byte (good)
  128 users:  130 GB weights → 128 tokens → 52   FLOPs/byte (great)
But each user has their own KV cache:

| Users | Weight reads | KV cache reads | Total | Intensity (FLOPs/byte) |
|---|---|---|---|---|
| 1 | 130 GB | 1.3 GB | 131 GB | 0.41 |
| 8 | 130 GB | 10.4 GB | 140 GB | 3.1 |
| 32 | 130 GB | 41.6 GB | 172 GB | 10.1 |
| 64 | 130 GB | 83.2 GB | 213 GB | 16.2 |

Sweet spot: 32-64 users per GPU (balancing compute vs memory)
| Scenario | Data read/step | Time/token | Tokens/sec |
|---|---|---|---|
| 1 user, 4K context | ~131 GB | ~40 ms | ~25 |
| 8 users, 4K context | ~140 GB | ~42 ms | ~190 |
| 32 users, 4K context | ~172 GB | ~51 ms | ~627 |
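These figures follow roughly from bytes read per step divided by memory bandwidth. A minimal sketch of that estimate, assuming ~3.35 TB/s of HBM bandwidth and the byte counts above:

```python
def decode_throughput(users: int,
                      weight_bytes: float = 130e9,
                      kv_bytes_per_user: float = 1.3e9,
                      hbm_bandwidth: float = 3.35e12):
    """Rough bandwidth-bound estimate: (seconds per decode step, total tokens/sec)."""
    bytes_per_step = weight_bytes + users * kv_bytes_per_user
    time_per_step = bytes_per_step / hbm_bandwidth
    return time_per_step, users / time_per_step

for users in (1, 8, 32):
    t, tps = decode_throughput(users)
    print(f"{users:>2} users: ~{t * 1000:.0f} ms/token, ~{tps:.0f} tokens/sec")
```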
Where the time goes (1 user, 4K context):

  Weight loading from HBM:     ~87%  (QKV: 22%, FFN: 55%, output: 10%)
  KV cache reading from HBM:    ~9%
  Actual computation:           ~3%
  Other:                        ~1%

87% of decode = reading model weights. The GPU is a memory-reading machine.
WITHOUT disaggregation:

  User A (decode):  tok tok tok ────BLOCKED──── tok tok tok
  User B (prefill):             ────PREFILL────

  User A sees: tokens... FREEZE 200ms... tokens (stuttering)

WITH disaggregation (Dynamo):

  Prefill GPU: ──── B ──── ──── C ──── ──── D ────
  Decode GPU:  tok tok tok tok tok tok tok tok tok

  User A sees: smooth, consistent flow. No interruptions.
WITHOUT KVBM:

  GPU 80GB: [Model 35GB] [KV User1 15GB] [KV User2 15GB] [KV U3 15GB] ← OOM
  User 4 → REJECTED

WITH KVBM:

  GPU 80GB:  [Model 35GB] [KV Active 30GB]
  CPU 512GB: [KV warm users 200GB]
  SSD 4TB:   [KV cold users 2TB]
  User 4 → KV fetched from SSD in ~10ms
A long generation adds up: 10K tokens × 40 ms = 400 seconds (6.7 minutes).

Dynamo solutions:
- Speculative decoding: predict multiple tokens per step (2-4× speedup)
- Request migration: move requests to less-loaded GPUs
- KV offloading: spill the growing cache via KVBM
| PREFILL | DECODE |
|---|---|
| Processes ALL tokens at once | Generates ONE token per step |
| Runs ONCE per request | Runs HUNDREDS of times |
| Attention: N×N matrix | Attention: 1×N vector |
| Reads weights once | Reads weights EVERY step |
| GPU compute: ~85% | GPU compute: ~3-12% |
| GPU memory BW: ~35% | GPU memory BW: ~88% |
| Bottleneck: FLOPS | Bottleneck: MEMORY BANDWIDTH |
| KV cache: WRITTEN | KV cache: READ + APPENDED |
| Duration: ms | Duration: seconds |
| Metric: TTFT | Metric: ITL, throughput |
| Restaurant: read order + prep stations at once | Restaurant: cook one step, check all stations each time |
| Sprinter: explosive burst | Marathon: steady endurance |
The fundamental insight:

  PREFILL needs: ████████████████ COMPUTE
  DECODE  needs: ████████████████ BANDWIDTH

  Same GPU      → both compromised. Neither at peak.
  Separate GPUs → both at peak. Maximum efficiency.

  PREFILL GPU: ████████████████ COMPUTE   (100% focused)
  DECODE GPU:  ████████████████ BANDWIDTH (100% focused)
  KV TRANSFER: ──NIXL──▶ (~1 ms via NVLink)

This is NVIDIA Dynamo's core design principle.