Parallelism (1) – Data, Expert

Parallelism Comparison: Data Parallelism vs Expert Parallelism

This image compares two major parallelization strategies used for training large language models (LLMs).

Left: Data Parallelism

Structure:

  • Data is divided into multiple batches from the database
  • Same complete model is replicated on each GPU
  • Each GPU independently processes different data batches
  • Results are aggregated to generate final output

Characteristics:

  • Scaling axis: Number of batches/samples
  • Pattern: Full model copy on each GPU, dense training
  • Communication: Gradient All-Reduce synchronization once per step
  • Advantages: Simple and intuitive implementation
  • Disadvantages: Model size must fit in single GPU memory
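The once-per-step gradient All-Reduce described above can be sketched in a few lines of plain Python. This is a toy single-process simulation, not a real framework API: each "GPU" holds the full model, computes gradients on its own batch, the gradients are averaged across replicas, and every copy applies the identical update.

```python
# Toy simulation of data parallelism (illustrative, not a real framework API).

def local_gradients(weights, batch):
    # Stand-in for backprop: gradient of mean squared error for y = w * x.
    return [sum(2 * (w * x - y) * x for x, y in batch) / len(batch) for w in weights]

def all_reduce_mean(grads_per_gpu):
    # All-reduce: every replica ends up with the element-wise mean gradient.
    n = len(grads_per_gpu)
    return [sum(g[i] for g in grads_per_gpu) / n for i in range(len(grads_per_gpu[0]))]

weights = [0.5]                                       # same full model on every GPU
batches = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)]]  # different batch per GPU
grads = [local_gradients(weights, b) for b in batches]
avg = all_reduce_mean(grads)                          # one sync per training step
weights = [w - 0.1 * g for w, g in zip(weights, avg)] # identical update everywhere
```

Because every replica applies the same averaged gradient, all model copies stay bit-identical, which is why the communication pattern is a single All-Reduce per step.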

Right: Expert Parallelism

Structure:

  • The model's feed-forward layers are divided into experts rather than replicated whole
  • Tokens are distributed to appropriate experts through All-to-All network and router
  • Different expert models (A, B, C) are placed on each GPU
  • Parallel processing at block/thread level in GPU pool

Characteristics:

  • Scaling axis: Number of experts
  • Pattern: Sparse structure – only a few experts are activated per token
  • Goal: Maintain large capacity while limiting FLOPs per token
  • Communication: All-to-All token routing
  • Advantages: Can scale model capacity significantly (MoE – Mixture of Experts architecture)
  • Disadvantages: High communication overhead and complex load balancing
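The All-to-All routing pattern above can be sketched as a toy single-process simulation. The length-based router and the string "experts" are illustrative stand-ins for a learned gating network and FFN experts; the two bucketing passes mimic the dispatch and return All-to-All exchanges.

```python
# Toy sketch of expert-parallel token routing (illustrative stand-ins only).

experts = {0: lambda t: t.upper(),   # "expert A" on GPU 0
           1: lambda t: t[::-1],     # "expert B" on GPU 1
           2: lambda t: t * 2}       # "expert C" on GPU 2

def route(token):
    # Toy deterministic router; real routers are learned gating networks.
    return len(token) % len(experts)

def all_to_all(tokens):
    # First all-to-all: bucket tokens by their destination expert/GPU.
    buckets = {e: [] for e in experts}
    for i, t in enumerate(tokens):
        buckets[route(t)].append((i, t))
    # Each GPU processes its bucket; second all-to-all restores token order.
    out = [None] * len(tokens)
    for e, items in buckets.items():
        for i, t in items:
            out[i] = experts[e](t)
    return out
```

Note that each token visits only one expert here, so compute per token stays constant no matter how many experts exist; the price is the two All-to-All exchanges and the risk of unbalanced buckets.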

Key Differences

For each aspect, Data Parallelism vs. Expert Parallelism:

  • Model Division: full model replication vs. model divided into experts
  • Data Division: batch-wise vs. layer/token-wise
  • Communication Pattern: gradient All-Reduce vs. token All-to-All
  • Scalability: proportional to data size vs. proportional to expert count
  • Efficiency: dense computation vs. sparse computation (conditional activation)

These two approaches are often used together in practice, enabling ultra-large-scale model training through hybrid parallelization strategies.


Summary

Data Parallelism replicates the entire model across GPUs and divides the training data, synchronizing gradients after each step – simple but memory-limited. Expert Parallelism divides the model into specialized experts and routes tokens dynamically, enabling massive scale through sparse activation. Modern systems combine both strategies to train trillion-parameter models efficiently.

#MachineLearning #DeepLearning #LLM #Parallelism #DistributedTraining #DataParallelism #ExpertParallelism #MixtureOfExperts #MoE #GPU #ModelTraining #AIInfrastructure #ScalableAI #NeuralNetworks #HPC

Mixture-of-Experts (MoE) DeepSeek-v3

Image Interpretation: DeepSeek-v3 Mixture-of-Experts (MoE)

This image outlines the key technologies and performance efficiency of the DeepSeek-v3 model, which utilizes the Mixture-of-Experts (MoE) architecture. It is divided into the architecture diagram/cost table on the left and four key technical features on the right.

1. DeepSeekMoE Architecture (Left Diagram)

The diagram illustrates how the model processes data:

  • Separation of Experts: Unlike traditional MoEs, it distinguishes between Shared Experts (Green) and Routed Experts (Blue).
    • Shared Experts: Always active to handle common knowledge.
    • Routed Experts: Selectively activated by the Router to handle specific, specialized features.
  • Workflow: When an input token ($u_t$) arrives, the Router selects the top-$K_r$ routed experts. The system processes the input through both the shared and the selected routed experts in parallel and combines the results.
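The workflow can be sketched with toy scalar "experts". The gate normalization over the selected experts and the residual connection follow the description above, but the exact DeepSeek-v3 gating formula is simplified here, and all numeric values are illustrative.

```python
# Toy sketch of the DeepSeekMoE forward pass (scalar "FFNs", simplified gating).

def deepseek_moe(u_t, shared, routed, scores, k):
    # Router: pick the top-k routed experts by affinity score.
    top_k = sorted(range(len(routed)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top_k)
    gates = {i: scores[i] / total for i in top_k}  # normalize over selected experts
    out = u_t                                      # residual connection
    out += sum(f(u_t) for f in shared)             # shared experts: always active
    out += sum(gates[i] * routed[i](u_t) for i in top_k)  # selected routed experts
    return out

shared = [lambda x: 0.1 * x]                       # one always-on generalist
routed = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x]
y = deepseek_moe(1.0, shared, routed, scores=[0.2, 0.5, 0.3], k=2)
```

The key structural point is visible in the three `out +=` terms: shared experts run unconditionally, while routed experts contribute only when selected and are weighted by their gates.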

2. Four Key Technical Features (Right Panel)

This section explains how DeepSeek-v3 overcomes the limitations of existing MoE models:

  • Load Balancing without Auxiliary Loss:
    • Problem: Standard MoEs often use “auxiliary loss” to balance expert usage, which can degrade performance.
    • Solution: It uses learnable bias terms in the router to ensure balance. This bias only affects “dispatching” (where data goes) and not the actual “weights” (calculation values), preserving model quality.
  • Shared Expert Design:
    • Concept: Keeping one or a few experts always active for general tasks allows the routed experts to focus purely on complex, specialized tasks.
    • Benefit: Reduces redundancy and improves the capacity utilization of experts.
  • Hardware-Aware Dual-Pipe Parallelism:
    • Efficiency: It fully overlaps All-to-All communication with computation, minimizing idle time.
    • Optimization: “Node-local expert routing” is used to minimize slow data transfers between different nodes.
  • FP8 Mixed-Precision Training:
    • Speed & Cost: Utilizes the tensor cores of modern GPUs (Hopper/Blackwell) for full FP8 (8-bit floating point) training. This drastically lowers both training and inference costs.
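The auxiliary-loss-free load balancing described above can be sketched as follows. The core idea from the text is that a per-expert bias is added to the affinity score only when *selecting* (dispatching) experts, while the gate weight that scales each expert's output still uses the raw score; the simple nudge-based bias update here is an illustrative stand-in for the learnable bias terms.

```python
# Sketch of load balancing without auxiliary loss (bias update rule is a
# simplified stand-in; all numbers are illustrative).

def select_and_gate(scores, bias, k):
    biased = [s + b for s, b in zip(scores, bias)]
    # The bias influences WHICH experts are chosen (dispatching)...
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]
    total = sum(scores[i] for i in chosen)
    # ...but the output weights come from the raw scores, preserving quality.
    return chosen, {i: scores[i] / total for i in chosen}

def update_bias(bias, load, target, step=0.01):
    # Nudge overloaded experts down and underloaded experts up.
    return [b - step if l > target else b + step for b, l in zip(bias, load)]

scores = [0.9, 0.8, 0.1]
bias = [-0.5, 0.0, 0.6]          # expert 0 is overloaded, expert 2 is starved
chosen, gates = select_and_gate(scores, bias, k=2)
```

Here the bias steers a token away from the overloaded expert 0, yet the surviving experts are still mixed by their true affinities, which is why quality is preserved.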

3. Cost Efficiency Comparison (Table 2)

The comparison highlights the massive efficiency gain over dense models:

  • DeepSeek-V3 MoE (671B parameters): Despite having the largest parameter count, its per-token training cost is extremely low at 250 GFLOPs/token.
  • LLaMa-405B Dense (405B parameters): Although smaller in size, it requires roughly 10x the cost (2448 GFLOPs/token) of DeepSeek-v3.
  • Conclusion: DeepSeek-v3 achieves “high performance at low cost” by massively scaling the model size (671B) while keeping the actual computation equivalent to a much smaller model.
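The "~10x" figure follows directly from the two per-token costs quoted in the table:

```python
# Ratio of the two per-token training costs quoted above (Table 2 figures).
llama_gflops_per_token = 2448     # LLaMa-405B dense
deepseek_gflops_per_token = 250   # DeepSeek-V3 MoE
ratio = llama_gflops_per_token / deepseek_gflops_per_token  # ~9.8x
```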

Summary

  1. Hybrid Structure: DeepSeek-v3 separates “Shared Experts” for general knowledge and “Routed Experts” for specialized tasks to maximize efficiency.
  2. Optimized Training: It achieves high speed and balance using “Load Balancing without Auxiliary Loss” and “FP8 Mixed-Precision Training.”
  3. Extreme Efficiency: Despite a massive 671B parameter size, it offers roughly 10x lower training costs per token compared to similar dense models (like LLaMa-405B).

#DeepSeek #AI #MachineLearning #MoE #MixtureOfExperts #LLM #DeepLearning #TechTrends #ArtificialIntelligence #ModelArchitecture

With Gemini

MoE & More

MoE & More – Architecture Interpretation

This diagram illustrates an advanced Mixture of Experts (MoE) model architecture.

Core Structure

1. Two Types of Experts

  • Shared Expert (Generalist)
    • Handles common knowledge: basic language structure, context understanding, general common sense
    • Applied universally to all tokens
  • Routed Expert (Specialist)
    • Handles specialized knowledge: coding, math, translation, etc.
    • Router selects the K most suitable experts for each token

2. Router (Gateway) Role

For each token, the router determines “Who’s best to handle this word?” by:

  • Selecting K experts out of N available specialists
  • Using Top-K selection mechanism

Key Optimization Techniques

Select Top-K 🎯

  • Chooses K most suitable routed experts
  • Distributes work evenly and occasionally tries new experts

Stabilize ⚖️

  • Prevents work from piling up on specific experts
  • Sets capacity limits and adds slight randomness
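The "Stabilize" ideas above can be sketched as a per-expert capacity cap plus small random noise on the routing scores so tokens don't all pile onto one expert. The capacity value, noise scale, and overflow handling here are illustrative assumptions.

```python
# Illustrative sketch of capacity limits + slight randomness (assumed values).
import random

def stable_assign(scores_per_token, capacity, noise=0.01, seed=0):
    rng = random.Random(seed)
    load = {}
    assignment = []
    for scores in scores_per_token:
        # Slight randomness breaks ties and spreads near-equal preferences.
        jittered = [s + rng.uniform(-noise, noise) for s in scores]
        # Pick the best expert that still has capacity; overflow falls through.
        for e in sorted(range(len(scores)), key=lambda i: jittered[i], reverse=True):
            if load.get(e, 0) < capacity:
                load[e] = load.get(e, 0) + 1
                assignment.append(e)
                break
        else:
            assignment.append(None)   # overflow token: no expert had room
    return assignment
```

With three tokens all preferring expert 0 and a capacity of 2, the third token spills over to the next-best expert instead of overloading the favorite.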

2-Stage Decouple

  • Creates a shortlist of candidate experts
  • Evaluates two criteria separately: “Are they available now?” and “Are they good at this?”
  • Mixes the two scores into a final ranking before the selection is made
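The two-stage decoupled selection can be sketched as follows. The weighting scheme, the `alpha` mixing parameter, and the function names are illustrative assumptions, not a published formula.

```python
# Toy sketch of 2-stage decoupled expert selection (assumed scoring scheme).

def two_stage_select(affinity, availability, shortlist_size, k, alpha=0.7):
    # Stage 1: shortlist candidates purely by "are they good at this?"
    shortlist = sorted(range(len(affinity)), key=lambda i: affinity[i],
                       reverse=True)[:shortlist_size]
    # Stage 2: mix skill with "are they available now?", then pick the final K.
    mixed = {i: alpha * affinity[i] + (1 - alpha) * availability[i]
             for i in shortlist}
    return sorted(shortlist, key=lambda i: mixed[i], reverse=True)[:k]

# Expert 0 is the most skilled but currently busy; expert 1 wins the final pick.
picked = two_stage_select(affinity=[0.9, 0.8, 0.3, 0.2],
                          availability=[0.0, 1.0, 1.0, 1.0],
                          shortlist_size=3, k=1)
```

Decoupling the two criteria means a busy but skilled expert can lose to a slightly less skilled one that is free right now, trading a little quality for throughput.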

Systems ⚡

  • Positions experts close together (reduces network delay)
  • Groups tokens for batch processing
  • Improves communication efficiency

Adaptive & Safety Loop 🔄

  • Adjusts K value in real-time (uses more/fewer experts as needed)
  • Redirects to backup path if experts are busy
  • Continuously monitors load, overflow, and performance
  • Auto-adjusts when issues arise
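The adaptive part of the loop above, adjusting K in real time, can be sketched as a simple feedback rule. The thresholds, bounds, and step size are illustrative assumptions; a real system would drive this from the monitored load, overflow, and performance signals.

```python
# Sketch of real-time K adjustment (thresholds and bounds are assumed values).

def adapt_k(k, load, k_min=1, k_max=8, high=0.9, low=0.5):
    if load > high and k > k_min:
        return k - 1          # overloaded: consult fewer experts per token
    if load < low and k < k_max:
        return k + 1          # headroom: use more experts for quality
    return k                  # in the healthy band: leave K unchanged

k = 4
for load in [0.95, 0.92, 0.4, 0.6]:   # simulated load readings over time
    k = adapt_k(k, load)
```

The clamps at `k_min` and `k_max` act as the safety part of the loop: no matter how noisy the load signal gets, K never collapses to zero or grows without bound.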

Purpose

This system enhances both efficiency and performance through:

  • Optimized expert placement
  • Accelerated batch processing
  • Real-time monitoring with immediate problem response

Summary

MoE & More combines generalist experts (common knowledge) with specialist experts (domain-specific skills), using an intelligent router to dynamically select the best K experts for each token. Advanced techniques like 2-stage decoupling, stabilization, and adaptive safety loops ensure optimal load balancing, prevent bottlenecks, and enable real-time adjustments for maximum efficiency. The result is a faster, more efficient, and more reliable AI system that scales intelligently.

#MixtureOfExperts #MoE #AIArchitecture #MachineLearning #DeepLearning #LLM #NeuralNetworks #AIOptimization #ScalableAI #RouterMechanism #ExpertSystems #AIEfficiency #LoadBalancing #AdaptiveAI #MLOps

With Claude