Large Scale Network Driven Design (2): Multi-Plane Fat-Tree & Low Latency

Comprehensive Interpretation: Large Scale Network Driven Design

This document outlines the technical blueprint for “Network Co-Design,” explaining how network architecture must evolve to support massive AI workloads (specifically LLMs) by balancing Bandwidth and Latency.

Here is the breakdown from an architect’s perspective:

1. Structural Efficiency: MPFT & MRFT (The “Green” Section)

This section answers the question: “How do we efficiently cluster thousands of GPUs?”

  • Massive Scalability: It proposes a Multi-Plane Fat-Tree (MPFT) structure using 400G InfiniBand (IB) switches, theoretically capable of scaling up to 16,384 GPUs. This mirrors the scale of NVIDIA SuperPods.
  • Multi-Rail Fat-Tree (MRFT): MRFT uses two distinct network planes; think of it as adding a second deck to a highway to double the lanes. It achieves higher bandwidth efficiency than a traditional 2-tier Fat-Tree design.
  • Software Optimization (NCCL): The hardware (MRFT) is fully utilized by NCCL (NVIDIA Collective Communications Library). NCCL acts as the traffic controller, ensuring the expanded physical bandwidth is saturated efficiently.
  • Latency Reduction (Packet Striping): The QP & Priority section highlights a critical mechanism where a single Queue Pair (QP) stripes packets across multiple ports simultaneously. This parallel transmission significantly reduces latency (a minimal sketch follows this list).
  • Current Bottlenecks: The design acknowledges limitations, such as InfiniBand’s lack of native support for out-of-order packet placement and the time overhead incurred during inter-plane communication (requires internal forwarding).
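
To make the packet-striping idea concrete, here is a minimal Python sketch. All numbers (MTU, per-packet overhead) and the latency model are illustrative assumptions, not figures from the document; it only shows how splitting one message round-robin across ports shortens completion time.

```python
# Illustrative sketch: striping one message across multiple ports.
# Port rate matches the 400G IB mentioned above; other numbers are assumed.

MTU_BYTES = 4096             # packet payload size (assumed)
PORT_RATE_GBPS = 400         # 400G link
PER_PACKET_OVERHEAD_US = 1   # fixed per-packet cost (assumed)

def stripe(message_bytes, num_ports):
    """Split a message into MTU-sized packets and assign them
    round-robin to ports, the way a QP can stripe across rails."""
    packets = list(range((message_bytes + MTU_BYTES - 1) // MTU_BYTES))
    return [packets[p::num_ports] for p in range(num_ports)]

def transfer_time_us(message_bytes, num_ports):
    """Estimate completion time: ports transmit their shares in parallel,
    so the largest share determines the latency."""
    per_port_us = []
    for packets in stripe(message_bytes, num_ports):
        payload_bits = len(packets) * MTU_BYTES * 8
        wire_us = payload_bits / (PORT_RATE_GBPS * 1e3)  # Gbps -> bits/us
        per_port_us.append(wire_us + len(packets) * PER_PACKET_OVERHEAD_US)
    return max(per_port_us) if per_port_us else 0.0

if __name__ == "__main__":
    msg = 1 << 20  # 1 MiB message
    for ports in (1, 2, 4, 8):
        print(f"{ports} port(s): ~{transfer_time_us(msg, ports):.1f} us")
```

Note that this toy model ignores packet reordering entirely, which is exactly the gap the "Current Bottlenecks" bullet points at: InfiniBand lacks native out-of-order placement, so real striping has to manage arrival order itself.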

2. The Core of Performance: Low Latency & MoE (The “Blue” Section)

This section answers the question: “Why is low latency (and InfiniBand) non-negotiable?”

  • Sensitivity of MoE Models: Modern Mixture of Experts (MoE) models rely heavily on “Expert Parallelism,” which triggers frequent All-to-All communication (see the sketch after this list). If the network lags by even hundreds of microseconds, overall system performance degrades sharply.
  • RoCE vs. InfiniBand: The document draws a clear line. While RoCE (RDMA over Converged Ethernet) is cost-effective, InfiniBand (IB) is the superior choice for low-latency environments essential for AI training/inference.
  • Surprising Latency Metrics: For small transfers (e.g., 64 bytes) during cross-leaf communication, InfiniBand can even be faster than intra-node NVLink, underscoring IB’s strength in minimizing end-to-end latency.
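
The all-to-all pattern the MoE bullet refers to can be sketched with PyTorch’s NCCL backend. This is a minimal illustration only: it assumes a torchrun launch with one GPU per rank, and the tensor shapes and the “expert” computation are placeholders, not DeepSeek’s actual dispatch code.

```python
# Minimal expert-parallel all-to-all sketch (illustrative only).
# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
import torch
import torch.distributed as dist

def moe_all_to_all_step(tokens_per_rank=1024, hidden=512):
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    device = torch.device("cuda", rank % torch.cuda.device_count())

    # Each rank holds tokens routed to every other rank's experts.
    # Shape: (world * tokens_per_rank, hidden), contiguous per destination.
    send = torch.randn(world * tokens_per_rank, hidden, device=device)
    recv = torch.empty_like(send)

    # Dispatch: every rank exchanges its per-destination slices.
    dist.all_to_all_single(recv, send)

    # Placeholder "expert" computation on the tokens we received.
    recv = recv * 2.0

    # Combine: send results back to the ranks that own the tokens.
    combined = torch.empty_like(recv)
    dist.all_to_all_single(combined, recv)

    if rank == 0:
        print("all-to-all round trip complete:", tuple(combined.shape))
    dist.destroy_process_group()

if __name__ == "__main__":
    moe_all_to_all_step()
```

Because this dispatch/combine pair runs in every MoE layer of every step, any per-message latency is multiplied thousands of times over a training run, which is why the section treats low latency as non-negotiable.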

Summary

  1. Scalable Architecture: The Multi-Plane (MPFT) and Multi-Rail (MRFT) Fat-Tree designs, optimized with NCCL, maximize bandwidth efficiency to support massive clusters of up to 16k GPUs.
  2. Latency Criticality: Modern AI workloads like Mixture of Experts (MoE) are hypersensitive to delay, making InfiniBand the preferred choice over RoCE due to its superior handling of All-to-all communication.
  3. Co-Design Strategy: Achieving peak AI performance requires a “Network Co-Design” approach where high-speed hardware (400G IB) and software protocols (Packet Striping) are tightly integrated to minimize end-to-end latency.

#AINetworking #DataCenterArchitecture #InfiniBand #NCCL #LowLatency #HPC #GPUScaling #MoE #NetworkDesign #AIInfrastructure #DeepseekV3

With Gemini

Large Scale Network Driven Design (DeepSeek-V3)

DeepSeek-V3 Large-Scale Network Architecture Analysis

This image explains the Multi-Plane Fat-Tree network structure of DeepSeek-V3.

Core Architecture

1. 8-Plane Architecture

  • Consists of eight independent network channels (highways)
  • Maximizes network bandwidth and distributes traffic for enhanced scalability

2. Fat-Tree Topology

  • Two-layer switch structure:
    • Leaf SW (Leaf Switches): Directly connected to GPUs
    • Spine SW (Spine Switches): Interconnect leaf switches
  • Enables high-speed communication among all nodes (GPUs) while minimizing switch contention
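
The 16,384-GPU scale quoted in this post drops out of simple port arithmetic for this two-layer structure. The sketch below assumes 64-port switches (an assumption on my part; the post only states 400G IB) and the standard non-blocking split of leaf ports between downlinks and uplinks.

```python
# Back-of-envelope capacity of an 8-plane, 2-tier fat-tree.
# 64 ports per switch is an assumption; the post only states 400G IB.

PORTS = 64          # ports per switch (assumed)
PLANES = 8          # independent network planes

# In a non-blocking 2-tier fat-tree, each leaf splits its ports:
# half go down to NICs, half go up to spines.
down_per_leaf = PORTS // 2            # 32 NICs per leaf
max_leaves_per_plane = PORTS          # each spine can reach 64 leaves

gpus_per_plane = max_leaves_per_plane * down_per_leaf   # 64 * 32 = 2048
total_gpus = gpus_per_plane * PLANES                    # 2048 * 8 = 16384

print(f"GPUs per plane : {gpus_per_plane}")
print(f"Total GPUs     : {total_gpus}")
```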

3. GPU/IB NIC Pair

  • Each GPU is paired with a dedicated Network Interface Card (NIC)
  • Each pair is exclusively assigned to one of the eight planes to initiate communication
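
A minimal sketch of the pairing rule above, assuming 8 GPUs and 8 NICs per node (the per-node counts are my assumption, chosen to match the 8 planes): GPU i is paired with NIC i, and that pair lives on plane i.

```python
# Illustrative GPU/NIC-to-plane assignment (per-node counts are assumed).
GPUS_PER_NODE = 8
PLANES = 8

def plane_of(gpu_local_rank):
    """Each GPU/NIC pair is pinned to exactly one plane."""
    return gpu_local_rank % PLANES

for gpu in range(GPUS_PER_NODE):
    print(f"node-local GPU {gpu} -> NIC {gpu} -> plane {plane_of(gpu)}")
```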

Communication Methods

NVLink

  • Ultra-high-speed connection between GPUs within the same node
  • Fast data transfer path used for intra-node communication

Cross-plane Traffic

  • Occurs when communication happens between different planes
  • Requires intra-node forwarding through another NIC, PCIe, or NVLink
  • Primary factor that increases latency
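
The forwarding rule can be written as a small decision function. This is a conceptual sketch, not DeepSeek’s code: if the source and destination NICs sit on the same plane, traffic goes straight out the NIC; otherwise it must first hop inside the node (NVLink or PCIe) to a NIC on the destination plane, which is where the extra latency comes from.

```python
# Conceptual cross-plane routing decision (not actual DeepSeek code).
from dataclasses import dataclass

@dataclass
class Endpoint:
    node: int
    gpu: int          # node-local GPU index
    plane: int        # plane of this GPU's dedicated NIC

def route(src, dst):
    """Return the hop sequence for one message."""
    if src.node == dst.node:
        return ["NVLink (intra-node)"]
    hops = []
    if src.plane != dst.plane:
        # Cross-plane: forward inside the node to a NIC on dst's plane first.
        hops.append("NVLink/PCIe to peer NIC on destination plane")
    hops.append(f"IB plane {dst.plane}: leaf -> spine -> leaf")
    return hops

print(route(Endpoint(0, 0, 0), Endpoint(3, 0, 0)))  # same plane
print(route(Endpoint(0, 0, 0), Endpoint(3, 5, 5)))  # cross-plane
```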

Network Optimization Process

The workflow below minimizes latency and prevents network congestion:

  1. Workload Analysis
  2. All to All (analyzing all-to-all communication patterns)
  3. Plane & Layer Set (plane and layer assignment)
  4. Profiling (hot-path optimization)
  5. Static Routing (Hybrid) (hybrid static routing approach)

Goal: low latency and no congestion
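
A hedged sketch of what the "profiling + hybrid static routing" steps above could look like in code: flow pairs identified as hot during profiling get a pinned spine, and everything else falls back to deterministic hash-based routing. The data structures and policy here are illustrative assumptions, not taken from the post.

```python
# Illustrative hybrid static routing: pinned routes for profiled hot paths,
# deterministic hashing for everything else. Policy details are assumed.
NUM_SPINES = 64

# Filled in from profiling: (src_leaf, dst_leaf) pairs that carry the most
# all-to-all traffic get an explicitly chosen spine to avoid collisions.
pinned_routes = {
    (0, 1): 0,
    (2, 3): 1,
}

def choose_spine(src_leaf, dst_leaf):
    key = (src_leaf, dst_leaf)
    if key in pinned_routes:              # hot path: static, profiled choice
        return pinned_routes[key]
    return hash(key) % NUM_SPINES         # cold path: deterministic spread

for pair in [(0, 1), (2, 3), (5, 9)]:
    print(pair, "-> spine", choose_spine(*pair))
```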

Scalability

This design is a scale-out network for large-scale distributed training supporting 16,384+ GPUs. Each plane operates independently to maximize overall system throughput.


3-Line Summary

DeepSeek-V3 uses an 8-plane fat-tree network architecture that connects 16,384+ GPUs through independent communication channels, minimizing contention and maximizing bandwidth. The two-layer switch topology (Spine and Leaf) combined with dedicated GPU-NIC pairs enables efficient traffic distribution across planes. Cross-plane traffic management and hot-path optimization ensure low-latency, high-throughput communication for large-scale AI training.

#DeepseekV3 #FatTreeNetwork #MultiPlane #NetworkArchitecture #ScaleOut #DistributedTraining #AIInfrastructure #GPUCluster #HighPerformanceComputing #NVLink #DataCenterNetworking #LargeScaleAI

With Claude

Insights into DeepSeek-V3

This image presents an insights overview of DeepSeek-V3, highlighting its key technical innovations and architectural features.

Core Technical Components

1. MLA (Multi-Head Latent Attention)

  • Focuses on memory efficiency
  • Processes attention mechanisms through latent representations to reduce memory footprint
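
A rough PyTorch sketch of the latent-compression idea: keys and values are kept as one low-dimensional latent vector per token and expanded only when attention needs them. Dimensions and layer names are illustrative, not DeepSeek’s actual MLA module.

```python
# Toy illustration of KV-cache compression into a latent space (not actual MLA).
import torch
import torch.nn as nn

HIDDEN, LATENT, HEADS, HEAD_DIM = 4096, 512, 32, 128   # illustrative sizes

down = nn.Linear(HIDDEN, LATENT, bias=False)             # compress per token
up_k = nn.Linear(LATENT, HEADS * HEAD_DIM, bias=False)   # expand to keys
up_v = nn.Linear(LATENT, HEADS * HEAD_DIM, bias=False)   # expand to values

tokens = torch.randn(1, 2048, HIDDEN)   # (batch, seq, hidden)

latent = down(tokens)        # this compact tensor is what the cache keeps
k = up_k(latent)             # keys/values reconstructed when attention runs
v = up_v(latent)

print("cache floats per token, latent    :", LATENT)                 # 512
print("cache floats per token, plain K+V :", 2 * HEADS * HEAD_DIM)   # 8192
```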

2. MoE (Mixture-of-Experts)

  • Enables cost-effective scaling
  • Activates only relevant experts for each input, reducing computational overhead while maintaining performance
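
A compact sketch of the "activate only relevant experts" idea: a router scores every expert per token, keeps the top-k, and only those experts run. Sizes, k, and the naive dispatch loop are illustrative, not DeepSeek’s routing code.

```python
# Toy top-k expert routing (sizes and k are illustrative).
import torch
import torch.nn as nn

EXPERTS, TOP_K, HIDDEN = 64, 4, 1024

router = nn.Linear(HIDDEN, EXPERTS, bias=False)
experts = nn.ModuleList([nn.Linear(HIDDEN, HIDDEN) for _ in range(EXPERTS)])

x = torch.randn(8, HIDDEN)                      # 8 tokens
scores = router(x).softmax(dim=-1)              # routing probabilities
weights, chosen = scores.topk(TOP_K, dim=-1)    # keep only TOP_K experts/token

out = torch.zeros_like(x)
for t in range(x.size(0)):                      # naive per-token dispatch
    for w, e in zip(weights[t], chosen[t]):
        out[t] += w * experts[int(e)](x[t])     # only 4 of 64 experts run

print(f"fraction of expert FLOPs used per token: {TOP_K / EXPERTS:.1%}")
```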

3. FP8 Mixed-Precision Training

  • Achieves efficient computation
  • Combines FP8 and FP32 precision levels strategically
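
A hedged numerical sketch of "compute in lightweight units, sum precisely". It needs PyTorch 2.1+ for the float8_e4m3fn dtype, and real FP8 training uses scaled, fused GPU matmul kernels; this only emulates the storage/accumulation split, not the actual kernels.

```python
# Emulated FP8 storage with FP32 accumulation (requires torch >= 2.1).
# Real FP8 training uses scaled, fused matmul kernels; this only shows the
# numerics: values live in 1-byte FP8, sums are kept in FP32.
import torch

a = torch.randn(256, 256)
b = torch.randn(256, 256)

a8 = a.to(torch.float8_e4m3fn)     # 1 byte per element in memory/traffic
b8 = b.to(torch.float8_e4m3fn)

# Upcast just-in-time and accumulate the matmul in FP32.
c_fp8_path = a8.to(torch.float32) @ b8.to(torch.float32)
c_ref = a @ b

err = (c_fp8_path - c_ref).abs().mean() / c_ref.abs().mean()
print("bytes per element:", a8.element_size(), "vs", a.element_size())
print(f"relative error from FP8 storage: {err.item():.3%}")
```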

4. MTP (Multi-Token Prediction)

  • Enables faster autoregressive inference
  • Predicts multiple tokens simultaneously (“look ahead two or three letters instead of one at a time”)
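
A toy version of multi-token prediction: alongside the usual next-token head, extra heads predict the tokens at offsets +2 and +3 from the same hidden state. The head structure and sizes are illustrative; the reported MTP design is more elaborate than a few parallel linear heads.

```python
# Toy multi-token prediction heads (illustrative, not DeepSeek's MTP modules).
import torch
import torch.nn as nn

HIDDEN, VOCAB, EXTRA_DEPTHS = 1024, 32000, 2   # also predict t+2 and t+3

main_head = nn.Linear(HIDDEN, VOCAB)           # standard next token (t+1)
mtp_heads = nn.ModuleList([nn.Linear(HIDDEN, VOCAB) for _ in range(EXTRA_DEPTHS)])

hidden = torch.randn(4, 128, HIDDEN)   # (batch, seq, hidden) from the trunk

logits = [main_head(hidden)] + [head(hidden) for head in mtp_heads]
# During training every head gets its own shifted target; at inference the
# extra predictions can seed speculative decoding ("look ahead").
for offset, lg in enumerate(logits, start=1):
    print(f"t+{offset} logits:", tuple(lg.shape))
```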

5. Multi-Plane Network Topology

  • Provides scalable, efficient cluster networking
  • Acts like a multi-lane highway to prevent bottlenecks

Right Panel Technical Details

KV Cache Compression (latent space)

  • Handles long contexts with low memory and fast decoding

Aux-loss-free Load Balancing + Expert Parallel (All-to-All)

  • Reduces FLOPs/costs while maintaining training/inference performance
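
A hedged sketch of one aux-loss-free balancing scheme (my reading of DeepSeek’s reported approach, so treat the details as an assumption): each expert carries a bias that is added to its routing score only for top-k selection, and the bias is nudged up for underloaded experts and down for overloaded ones, with no auxiliary loss term.

```python
# Sketch of bias-based, auxiliary-loss-free load balancing (details assumed).
import torch

EXPERTS, TOP_K, TOKENS, GAMMA = 16, 2, 4096, 0.001   # GAMMA = bias update speed

bias = torch.zeros(EXPERTS)                  # per-expert balancing bias

def route(scores):
    """Select experts using score + bias; the gate weights themselves
    would still come from the raw scores (not shown here)."""
    _, chosen = (scores + bias).topk(TOP_K, dim=-1)
    return chosen

for _ in range(100):                         # a few routing steps
    scores = torch.randn(TOKENS, EXPERTS).softmax(dim=-1)
    chosen = route(scores)
    load = torch.bincount(chosen.flatten(), minlength=EXPERTS).float()
    # Underloaded experts get a higher bias, overloaded ones a lower bias.
    bias += GAMMA * torch.sign(load.mean() - load)

print("expert load after balancing:", load.int().tolist())
```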

Weights/Matmul in FP8 + FP32 Accumulation

  • Computes in lightweight units but sums precisely for critical totals (lower memory, bandwidth, compute, stable accuracy)

Predict Multiple Tokens at Once During Training

  • Delivers higher speed and accuracy boosts in benchmarks

2-tier Fat-Tree × Multiple Planes (separated per RDMA-NIC pair)

  • Provides inter-plane congestion isolation, resilience, and reduced cost/latency

Summary

DeepSeek-V3 represents a comprehensive optimization of large language models through innovations in attention mechanisms, expert routing, mixed-precision training, multi-token prediction, and network architecture. These techniques collectively address the three critical bottlenecks: memory, computation, and communication. The result is a highly efficient model capable of scaling to massive sizes while maintaining cost-effectiveness and performance.

#DeepSeekV3 #LLM #MixtureOfExperts #EfficientAI #ModelOptimization #MultiTokenPrediction #FP8Training #LatentAttention #ScalableAI #AIInfrastructure

With Claude