Prefill & Decode (2)

1. Prefill Phase (Input Analysis & Parallel Processing)

  • Processing Method: It processes the user’s lengthy prompts or documents all at once using Parallel Computing.
  • Bottleneck (Compute-bound): Since it needs to process a massive amount of data simultaneously, computational power is the most critical factor. This phase generates the KV (Key-Value) cache which is used in the next step.
  • Requirements: Because it requires large-scale parallel computation, massive throughput and large-capacity memory are essential.
  • Hardware Characteristics: The image describes this as a “One-Hit & Big Size” style, explaining that a GPU-based HBM (High Bandwidth Memory) architecture is highly suitable for handling such large datasets.

2. Decode Phase (Sequential Token Generation)

  • Processing Method: Using the KV cache generated during the Prefill phase, this is a Sequential Computing process that generates the response tokens one by one.
  • Bottleneck (Memory-bound): The computation itself is light, but the system must constantly access the memory (KV cache) to fetch and generate the next word. Therefore, memory access speed (bandwidth) becomes the limiting factor.
  • Requirements: Because it needs to provide small and immediate responses to the user, ultra-low latency and deterministic execution speed are crucial.
  • Hardware Characteristics: Described as a “Small-Hit & Fast Size” style, an LPU (Language Processing Unit)-based SRAM architecture is highly advantageous to minimize latency.

๐Ÿ’กSummary

  1. Prefill is a compute-bound phase that processes user input in parallel all at once to create a KV cache, making GPU and HBM architectures highly suitable.
  2. Decode is a memory-bound phase that sequentially generates words one by one by referencing the KV cache, where LPU and SRAM architectures are advantageous for achieving ultra-low latency.
  3. Ultimately, an LLM operates by grasping the context through large-scale computation (Prefill) and then generating responses in real-time through fast memory access (Decode).

#LLM #Prefill #Decode #GPU_HBM #LPU_SRAM #AIArchitecture #ParallelComputing #UltraLowLatencyAI

With Gemini

Parallelism (2) – Pipeline, Tensor

Parallelism (2) – Pipeline vs Tensor Parallelism

This image compares two parallel processing techniques: Pipeline Parallelism and Tensor Parallelism.

Pipeline Parallelism

Core Concept:

  • Sequential work is divided into multiple stages
  • Each GPU is responsible for a specific task (a โ†’ b โ†’ c)

Characteristics:

  • Axis: Depth-wise – splits by layers
  • Pattern: Pipeline/conveyor belt with micro-batches
  • Communication: Only at stage boundaries
  • Cost: Bubbles (idle time), requires pipeline tuning

How it works: Data flows sequentially like waves, with each GPU processing its assigned stage before passing to the next GPU.


Tensor Parallelism

Core Concept:

  • Matrix pool is prepared and split in advance
  • All GPUs simultaneously process different parts of the same data

Characteristics:

  • Axis: Width-wise – splits inside layers
  • Pattern: Width-wise sharding – splits matrix/attention across GPUs
  • Communication: Occurs at every Transformer layer (forward/backward)
  • Cost: High communication overhead, requires strong NVLink/NVSwitch

How it works: Large matrices are divided into chunks, with each GPU processing simultaneously while continuously communicating via NVLink/NVSwitch.


Key Differences

AspectPipelineTensor
Split MethodLayer-wise (vertical)Within-layer (horizontal)
GPU RoleDifferent tasksParts of same task
CommunicationLow (stage boundaries)High (every layer)
Hardware NeedsStandardHigh-speed interconnect required

Summary

Pipeline Parallelism splits models vertically by layers with sequential processing and low communication cost, while Tensor Parallelism splits horizontally within layers for parallel processing but requires high-speed interconnects. These two techniques are often combined in training large-scale AI models to maximize efficiency.

#ParallelComputing #DistributedTraining #DeepLearning #GPUOptimization #MachineLearning #ModelParallelism #AIInfrastructure #NeuralNetworks #ScalableAI #HPC

With Claude