Parallelism (2) – Pipeline, Tensor

This post compares two model-parallel techniques: Pipeline Parallelism and Tensor Parallelism.

Pipeline Parallelism

Core Concept:

  • The model's layers are divided into sequential stages
  • Each GPU is responsible for one stage (stage a → stage b → stage c)

Characteristics:

  • Axis: Depth-wise – splits by layers
  • Pattern: Pipeline/conveyor belt with micro-batches
  • Communication: Only at stage boundaries
  • Cost: Bubbles (idle time), requires pipeline tuning

How it works: Micro-batches flow through the stages in waves, like a conveyor belt; each GPU processes its assigned stage, then passes the activations to the next GPU.
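
As a rough illustration of this fill-and-drain behaviour, here is a minimal Python sketch (not any framework's API) that simulates a GPipe-style forward schedule; the stage count and micro-batch count are assumptions chosen purely for illustration.

```python
# Minimal simulation of a GPipe-style fill-and-drain pipeline schedule.
# Illustrative only: micro-batch m reaches stage s at time step s + m.

NUM_STAGES = 4        # model split depth-wise into 4 stages (one per GPU)
NUM_MICROBATCHES = 8  # the global batch is split into 8 micro-batches

total_steps = NUM_STAGES + NUM_MICROBATCHES - 1
for step in range(total_steps):
    row = []
    for stage in range(NUM_STAGES):
        mb = step - stage
        if 0 <= mb < NUM_MICROBATCHES:
            row.append(f"GPU{stage}: mb{mb}")   # stage is busy
        else:
            row.append(f"GPU{stage}: idle")     # pipeline bubble
    print(f"t={step:2d} | " + " | ".join(row))

# Idle share for this simple schedule: (S - 1) / (M + S - 1).
# More micro-batches shrink the bubbles, which is why pipeline tuning matters.
print("bubble fraction:",
      (NUM_STAGES - 1) / (NUM_MICROBATCHES + NUM_STAGES - 1))
```

Running it prints a step-by-step occupancy grid: the idle cells at the start and end are exactly the bubbles listed under Cost above.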


Tensor Parallelism

Core Concept:

  • Each layer's weight matrices are partitioned across GPUs in advance
  • All GPUs work simultaneously on different shards of the same layer's computation

Characteristics:

  • Axis: Width-wise – splits inside layers
  • Pattern: Width-wise sharding – splits matrix/attention across GPUs
  • Communication: Occurs at every Transformer layer (forward/backward)
  • Cost: High communication overhead, requires strong NVLink/NVSwitch

How it works: Each layer's large weight matrices are divided into shards; every GPU computes its shard at the same time, exchanging partial results over NVLink/NVSwitch at each layer.
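
To make the width-wise split concrete, here is a minimal single-process NumPy sketch (shapes and world size are assumed for illustration) of a column-parallel linear layer, the building block behind Megatron-style tensor parallelism. A real implementation runs each shard on its own GPU and replaces the final concatenation with an all-gather over NVLink/NVSwitch.

```python
# Column-parallel split of one linear layer, simulated in a single process.
import numpy as np

world_size = 4                      # number of tensor-parallel "GPUs"
x = np.random.randn(2, 1024)        # activations, replicated on every rank
W = np.random.randn(1024, 4096)     # full weight matrix of one layer

# Each rank holds only a column slice of W and computes a partial output.
shards = np.split(W, world_size, axis=1)
partial_outputs = [x @ W_shard for W_shard in shards]

# Communication step: gather the partial outputs back into the full result.
y_parallel = np.concatenate(partial_outputs, axis=1)

# Check against the unsharded computation.
y_reference = x @ W
print("max error:", np.abs(y_parallel - y_reference).max())
```

The check at the end confirms that sharding the weight columns and gathering the partial outputs reproduces the unsharded matrix multiply.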


Key Differences

Aspect         | Pipeline                    | Tensor
Split Method   | Layer-wise (vertical)       | Within-layer (horizontal)
GPU Role       | Different tasks (one stage each) | Parts of the same task
Communication  | Low (stage boundaries only) | High (every layer)
Hardware Needs | Standard interconnect       | High-speed interconnect (NVLink/NVSwitch) required

Summary

Pipeline Parallelism splits models vertically by layers with sequential processing and low communication cost, while Tensor Parallelism splits horizontally within layers for parallel processing but requires high-speed interconnects. These two techniques are often combined in training large-scale AI models to maximize efficiency.
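
As a minimal sketch of how the two axes combine (the GPU count and the 2 × 4 layout are assumed purely for illustration), each GPU can be indexed by a pipeline stage and a tensor-parallel rank:

```python
# Illustrative mapping of 8 GPUs onto a 2 (pipeline) x 4 (tensor) grid.
PIPELINE_STAGES = 2
TENSOR_RANKS = 4

for gpu in range(PIPELINE_STAGES * TENSOR_RANKS):
    stage = gpu // TENSOR_RANKS   # which depth-wise stage this GPU serves
    tp_rank = gpu % TENSOR_RANKS  # its width-wise slice within that stage
    print(f"GPU{gpu}: pipeline stage {stage}, tensor-parallel rank {tp_rank}")
```

In practice, the tensor-parallel group is usually kept inside a single node where NVLink/NVSwitch is available, while pipeline stages can span nodes because they only communicate at stage boundaries.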

#ParallelComputing #DistributedTraining #DeepLearning #GPUOptimization #MachineLearning #ModelParallelism #AIInfrastructure #NeuralNetworks #ScalableAI #HPC

With Claude