Interconnection Driven Design (DeepSeek-V3)

This image outlines a technical approach to resolving bottlenecks in High-Performance Computing (HPC) and AI/LLM infrastructure. It is organized into three rows, each progressing from a Problem to a Solution and then to a Final Optimization.

1. Convergence of Scale-Up and Scale-Out

Focuses on resolving inefficiencies between server communication and GPU computation.

  • Problem (IB Communication): The speed of inter-server connections (e.g., InfiniBand) creates a bottleneck for total system performance.
  • Inefficiency (Streaming Multiprocessor): The GPU’s core computational units (SMs) waste resources handling network overhead instead of focusing on actual calculations.
  • Solution (SM Offload): Communication tasks are delegated (offloaded) to dedicated coprocessors, allowing SMs to focus exclusively on computation.
  • Final Optimization (Unified Network Adapter): Physically integrating intra-node and inter-node communication into a single Network Interface Card (NIC) to minimize data movement paths.

2. Bandwidth Contention & Latency

Addresses the limitations of data bandwidth and processing delays.

  • Problem (KV Cache): Reusable token data for LLM inference frequently travels between the CPU and GPU, consuming significant bandwidth.
  • Bottleneck (PCIe): The primary interconnect has limited bandwidth, leading to contention and performance degradation during traffic spikes.
  • Solution (Traffic Class – TC): A prioritization mechanism (QoS) ensures urgent, latency-sensitive traffic is processed before less critical data.
  • Final Optimization (I/O Die Chiplet Integration): Integrating network I/O directly alongside the GPU die takes PCIe out of the network path, which avoids this contention and reduces latency.
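
The Traffic Class idea can be illustrated with a strict-priority queue. This is a minimal sketch only, not how PCIe or a NIC actually arbitrates between traffic classes; the class names and priority values are hypothetical.

```python
import heapq
from itertools import count

# Hypothetical traffic classes: a lower number is served first.
PRIORITY = {"latency_sensitive": 0, "bulk": 1}


class TrafficClassQueue:
    """Strict-priority queue that keeps FIFO order within one class."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that preserves arrival order

    def push(self, traffic_class, payload):
        entry = (PRIORITY[traffic_class], next(self._seq), payload)
        heapq.heappush(self._heap, entry)

    def pop(self):
        _, _, payload = heapq.heappop(self._heap)
        return payload

    def __len__(self):
        return len(self._heap)


def drain(queue):
    """Return payloads in the order the shared link would serve them."""
    order = []
    while len(queue):
        order.append(queue.pop())
    return order
```

In this sketch, bulk transfers such as KV cache movement wait whenever latency-sensitive messages are queued. A real arbiter would normally also guard against starving the bulk class.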

3. Node-Limited Routing

Optimizes data routing strategies for distributed neural networks.

  • Key Tech (NVLink): A high-speed, intra-node GPU interconnect strategically used to maximize local data transfer.
  • Context (Experts): Neural network modules (MoE – Mixture of Experts) are distributed across various nodes, requiring activation for specific tokens.
  • Solution/Strategy (Minimize IB Cost): Each token crosses the slower inter-node link (InfiniBand) at most once per target node and is then distributed to the GPUs inside that node over fast NVLink.
  • Final Optimization (Node-Limited): The routing algorithm restricts each token’s selected “Experts” (modules) to a limited group of nodes, which bounds inter-node traffic and keeps communication efficient.
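
Node-limited selection can be sketched as an ordinary top-k restricted to the best few nodes. The sketch below makes several assumptions that are not in the diagram: experts are laid out in contiguous, equally sized blocks per node, a node is ranked by the sum of its strongest expert scores, and the function and parameter names are hypothetical.

```python
def node_limited_topk(scores, num_nodes, max_nodes, top_k):
    """Pick top_k experts for one token, drawn from at most max_nodes nodes.

    scores: one affinity score per expert; experts are assumed to be stored
    in contiguous, equally sized blocks, one block per node.
    """
    experts_per_node = len(scores) // num_nodes
    per_node_k = max(1, top_k // max_nodes)  # scores counted when ranking a node

    def node_of(expert):
        return expert // experts_per_node

    # Rank each node by the sum of its strongest per_node_k expert scores.
    node_score = []
    for node in range(num_nodes):
        block = scores[node * experts_per_node:(node + 1) * experts_per_node]
        node_score.append(sum(sorted(block, reverse=True)[:per_node_k]))

    ranked = sorted(range(num_nodes), key=lambda n: node_score[n], reverse=True)
    allowed = set(ranked[:max_nodes])

    # Ordinary top-k, but only over experts hosted on the allowed nodes.
    candidates = [e for e in range(len(scores)) if node_of(e) in allowed]
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]
    return sorted(chosen), sorted(allowed)
```

The trade-off is visible in small cases: a high-scoring expert on a node outside the allowed group is skipped in favor of a weaker expert on an allowed node, in exchange for a hard cap on how many nodes each token must reach over InfiniBand.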

Summary

  1. Integration: The design overcomes system bottlenecks by physically unifying network adapters and integrating I/O dies directly with GPUs to bypass slow connections like PCIe.
  2. Offloading & Prioritization: It improves efficiency by offloading network tasks from GPU cores (SMs) and prioritizing urgent traffic (Traffic Class) to reduce latency.
  3. Routing Optimization: It utilizes “Node-Limited” routing strategies to maximize high-speed local connections (NVLink) and minimize slower inter-server communication in distributed AI models.

#InterconnectionDrivenDesign #AIInfrastructure #GPUOptimization #HPC #ChipletIntegration #NVLink #LatencyReduction #LLMHardware #infiniband

With Gemini

Multi-Plane Network Topology (DeepSeek-V3)

Multi-Plane Network Topology for Scalable AI Clusters

Core Architecture (Left – Green Sections)

Topology Structure

  • Adopts a 2-tier Fat-Tree (FT2) architecture, which has fewer hops and lower latency than a 3-tier design
  • Connects large numbers of endpoints at much lower cost than 3-tier architectures

Multi-Plane Design

  • 8-Plane Architecture: Each node contains 8 GPUs and 8 IB NICs
  • 1:1 Mapping: Dedicates specific GPU-NIC pairs to separate planes

NIC Specifications

  • Hardware: 400G InfiniBand (ConnectX-7)
  • Resilience: Multi-port connectivity ensures robustness against single-port failures

Maximum Scalability

  • Theoretically supports up to 16,384 GPUs within the 2-tier structure
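
The 16,384 figure is consistent with standard fat-tree arithmetic if each plane is built from 64-port switches. The port count is an assumption here, since this summary does not state it. In a non-blocking two-tier fat tree, each leaf switch uses half of its ports for endpoints and half for uplinks, and each spine port connects to one leaf.

```python
def two_tier_fat_tree_endpoints(radix):
    """Endpoints in a non-blocking 2-tier fat tree built from radix-port switches."""
    down_ports_per_leaf = radix // 2  # the other half are uplinks to spines
    max_leaves = radix                # each spine port connects to one leaf
    return down_ports_per_leaf * max_leaves


def multi_plane_gpus(radix, planes):
    """Each plane is an independent fat tree; each GPU-NIC pair sits in one plane."""
    return two_tier_fat_tree_endpoints(radix) * planes
```

With an assumed radix of 64 and the 8 planes described above, this gives 2,048 endpoints per plane and 16,384 GPUs in total.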

Advantages (Center – Purple Sections)

Cost Efficiency: Connects massive scale at much lower cost compared to 3-tier architectures

Ultra-Low Latency: Fewer network hops ensure rapid data transfer, ideal for latency-sensitive AI models like MoE

Traffic Isolation: Independent communication lanes (planes) prevent congestion or faults in one lane from affecting others

Proven Performance: Validated in large-scale tests with 2048 GPUs, delivering stable and high-speed communication

Challenges (Right – Orange Sections)

Packet Ordering Issues: Current hardware (ConnectX-7) has limitations in handling out-of-order data packets

Cross-Plane Delays: Moving data between different network planes requires extra internal forwarding, causing higher latency during AI inference

Smarter Routing Needed: Standard load balancing (ECMP, Equal-Cost Multi-Path) is inefficient for AI traffic; adaptive routing that selects paths based on current network load is needed.

Hardware Integration: Future hardware should build network components directly into main chips to remove bottlenecks and speed up communication


Summary

This document presents a multi-plane network topology using 2-tier Fat-Tree architecture that scales AI clusters up to 16,384 GPUs cost-effectively with ultra-low latency. The 8-plane design with 1:1 GPU-NIC mapping provides traffic isolation and resilience, though challenges remain in packet ordering and cross-plane communication. Future improvements require smarter routing algorithms and deeper hardware-network integration to optimize AI workload performance.

#AIInfrastructure #DataCenterNetworking #HPC #InfiniBand #GPUCluster #NetworkTopology #FatTree #ScalableComputing #MLOps #AIHardware #DistributedComputing #CloudInfrastructure #NetworkArchitecture #DeepLearning #AIatScale

AI Data Center: Critical Bottlenecks and Technological Solutions


This chart analyzes the major challenges facing modern AI Data Centers across six key domains. It outlines the [Domain] → [Bottleneck/Problem] → [Solution] flow, indicating the severity of each bottleneck with a score out of 100.

1. Generative AI

  • Bottleneck (45/100): Redundant Computation
    • Dense large models run every parameter for every input, so much of the computation is redundant.
  • Solutions:
    • MoE (Mixture of Experts): Uses only relevant sub-models (experts) for specific tasks to reduce computation.
    • Quantization (FP16 → INT8/FP4): Reduces data precision to speed up processing and save memory.
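
As an illustration of the quantization step, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is one simple scheme among many and is not taken from the chart; the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range used here.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [x * scale for x in q]
```

Each value is stored in 8 bits instead of 16, at the cost of a rounding error of at most half a quantization step per value.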

2. OS for AI Works

  • Bottleneck (55/100): Low MFU (Model FLOPs Utilization)
    • Issues with resource fragmentation and idle time result in underutilization of hardware.
  • Solutions:
    • Dynamic Checkpointing: Efficiently saves model states during training.
    • AI-Native Scheduler: Optimizes task distribution based on network topology.
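
MFU is the ratio of the FLOPs a model usefully performs per second to the hardware's aggregate peak. The sketch below uses the common estimate of about 6 FLOPs per parameter per token for a dense transformer's forward and backward pass; that estimate and all parameter names are assumptions, not values from the chart.

```python
def model_flops_utilization(params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """MFU = useful model FLOPs per second / aggregate peak FLOPs per second.

    Uses the common ~6 FLOPs per parameter per token estimate for a dense
    transformer's forward plus backward pass (attention FLOPs are ignored).
    """
    achieved = 6 * params * tokens_per_second
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak
```

Idle time and fragmentation lower `tokens_per_second` without changing the peak, which is why they show up directly as lower MFU.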

3. Computing / AI Engine (Most Critical)

  • Bottleneck (85/100): Memory Wall
    • Marked as the most severe bottleneck, where memory bandwidth cannot keep up with the speed of logic processors.
  • Solutions:
    • HBM3e/HBM4: Next-generation High Bandwidth Memory.
    • PIM (Processing In Memory): Performs calculations directly within memory to reduce data movement.
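
The memory wall is often reasoned about with the roofline model: attainable throughput is capped either by peak compute or by memory bandwidth multiplied by the kernel's arithmetic intensity (FLOPs per byte moved). A minimal sketch, with hypothetical function names:

```python
def attainable_flops(peak_flops, mem_bandwidth, flops_per_byte):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, mem_bandwidth * flops_per_byte)


def is_memory_bound(peak_flops, mem_bandwidth, flops_per_byte):
    """True when the kernel's arithmetic intensity is below the ridge point."""
    return flops_per_byte < peak_flops / mem_bandwidth
```

Higher-bandwidth memory raises the memory-side limit, while PIM reduces the bytes that have to move at all.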

4. Network

  • Bottleneck (75/100): Communication Overhead
    • Latency issues arise during synchronization between multiple GPUs.
  • Solutions:
    • UEC-based RDMA: Ultra Ethernet Consortium standards for faster Remote Direct Memory Access over Ethernet.
    • CPO / LPO: Advanced optics (Co-Packaged Optics / Linear-drive Pluggable Optics) to improve data transmission efficiency.

5. Power

  • Bottleneck (65/100): Density Cap
    • Physical limits on how much power can be supplied per server rack.
  • Solutions:
    • 400V HVDC: High Voltage Direct Current for efficient power delivery.
    • BESS Peak Shaving: Using Battery Energy Storage Systems to manage peak power loads.
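
Peak shaving can be sketched as a battery that discharges whenever load exceeds a grid cap and recharges when there is headroom. This is a simplified model under stated assumptions: it ignores conversion losses and battery power limits, starts with a full battery, and uses hypothetical names and units.

```python
def peak_shave(load_kw, grid_cap_kw, battery_kwh, step_hours=1.0):
    """Return grid draw per time step when a battery covers load above the cap."""
    capacity = battery_kwh
    stored = battery_kwh  # assume the battery starts fully charged
    grid = []
    for load in load_kw:
        if load > grid_cap_kw:
            # Discharge to cover the excess, limited by the stored energy.
            discharge_kw = min(load - grid_cap_kw, stored / step_hours)
            stored -= discharge_kw * step_hours
            grid.append(load - discharge_kw)
        else:
            # Recharge using the headroom under the cap.
            charge_kw = min(grid_cap_kw - load, (capacity - stored) / step_hours)
            stored += charge_kw * step_hours
            grid.append(load + charge_kw)
    return grid
```

When the battery runs out during a long peak, the grid draw rises above the cap again, which is the sizing question a real BESS design has to answer.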

6. Cooling

  • Bottleneck (70/100): Thermal Throttling Limit
    • Performance drops (throttling) caused by excessive heat in high-density racks.
  • Solutions:
    • DTC Liquid Cooling: Direct-to-Chip liquid cooling technologies.
    • CDU: Coolant Distribution Units for effective heat management.

Summary

  1. The “Memory Wall” (85/100) is identified as the most critical bottleneck in AI Data Centers, meaning memory bandwidth is the primary constraint on performance.
  2. To overcome these limits, the industry is adopting advanced hardware like HBM and Liquid Cooling, alongside software optimizations like MoE and Quantization.
  3. Scaling AI infrastructure requires a holistic approach that addresses computing, networking, power efficiency, and thermal management simultaneously.

#AIDataCenter #ArtificialIntelligence #MemoryWall #HBM #LiquidCooling #GenerativeAI #TechTrends #AIInfrastructure #Semiconductor #CloudComputing

With Gemini

AI Operation: All Connected

AI Operation: All Connected – Image Analysis

This diagram explains the operational paradigm shift in AI Data Centers (AI DC).

Top Section: New Challenges

AI DC Characteristics:

  • Paradigm shift: Fundamental change in operations for the AI era
  • High Cost: Massive investment required for GPUs, infrastructure, etc.
  • High Risk: Greater impact during outages and increased complexity

Five Core Components of AI DC (left→right):

  1. Software: AI models, application development
  2. Computing: GPUs, servers, and computational resources
  3. Network: Data transmission and communication infrastructure
  4. Power: High-density power supply and management (highlighted in orange)
  5. Cooling: Heat management and cooling systems

→ These five elements are interconnected through the “All Connected Metric”

Bottom Section: Integrated Operations Solution

Core Concept:

📦 Tightly Fused Rubik’s Cube

  • The five core components (Software, Computing, Network, Power, Cooling) are intricately intertwined like a Rubik’s cube
  • Changes or issues in one element affect all other elements due to tight coupling

🎯 All Connected Data-Driven Operations

  • Data-driven integrated operations: Collecting and analyzing data from all connected elements
  • “For AI, With AI”: Operating the data center itself using AI technology for AI workloads

Continuous Stability & Optimization

  • Ensuring continuous stability
  • Real-time monitoring and optimization

Key Message

AI data centers have five core components—Software, Computing, Network, Power, and Cooling—that are tightly fused together. To effectively manage this complex system, a data-centric approach that integrates and analyzes data from all components is essential, enabling continuous stability and optimization.


Summary

AI data centers are characterized by tightly coupled components (software, computing, network, power, cooling) that create high complexity, cost, and risk. This interconnected system requires data-driven operations that leverage AI to monitor and optimize all elements simultaneously. The goal is achieving continuous stability and optimization through integrated, real-time management of all connected metrics.

#AIDataCenter #DataDrivenOps #AIInfrastructure #DataCenterOptimization #TightlyFused #AIOperations #HybridInfrastructure #IntelligentOps #AIforAI #DataCenterManagement #MLOps #AIOps #PowerManagement #CoolingOptimization #NetworkInfrastructure

Large Scale Network Driven Design (DeepSeek-V3)

DeepSeek-V3 Large-Scale Network Architecture Analysis

This image explains the Multi-Plane Fat-Tree network structure of DeepSeek-V3.

Core Architecture

1. 8-Plane Architecture

  • Consists of eight independent network channels (highways)
  • Maximizes network bandwidth and distributes traffic for enhanced scalability

2. Fat-Tree Topology

  • Two-layer switch structure:
    • Leaf SW (Leaf Switches): Directly connected to GPUs
    • Spine SW (Spine Switches): Interconnect leaf switches
  • Enables high-speed communication among all nodes (GPUs) while minimizing switch contention

3. GPU/IB NIC Pair

  • Each GPU is paired with a dedicated Network Interface Card (NIC)
  • Each pair is exclusively assigned to one of the eight planes to initiate communication

Communication Methods

NVLink

  • Ultra-high-speed connection between GPUs within the same node
  • Fast data transfer path used for intra-node communication

Cross-plane Traffic

  • Occurs when communication happens between different planes
  • Requires intra-node forwarding through another NIC, PCIe, or NVLink
  • Primary factor that increases latency
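
The three communication cases above can be sketched as a small classifier. It assumes GPUs are numbered contiguously per node and that a GPU's local index within its node selects its plane; both the numbering scheme and the function names are hypothetical.

```python
GPUS_PER_NODE = 8  # 8 GPU-NIC pairs per node, one per plane


def plane_of(global_gpu_rank):
    """Assumed mapping: the GPU's local index inside its node picks the plane."""
    return global_gpu_rank % GPUS_PER_NODE


def node_of(global_gpu_rank):
    """Assumed numbering: ranks are assigned contiguously, node by node."""
    return global_gpu_rank // GPUS_PER_NODE


def classify_transfer(src, dst):
    """Label a GPU-to-GPU transfer by the path it would take."""
    if node_of(src) == node_of(dst):
        return "intra-node (NVLink)"
    if plane_of(src) == plane_of(dst):
        return "same-plane (IB only)"
    return "cross-plane (IB plus intra-node forwarding)"
```

Under these assumptions, a transfer from rank 0 to rank 8 stays in plane 0, while a transfer from rank 0 to rank 9 needs an extra forwarding step inside one of the two nodes.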

Network Optimization Process

The workflow below minimizes latency and prevents network congestion:

  1. Workload Analysis
  2. All to All: analyze all-to-all communication patterns
  3. Plane & Layer Set: assign planes and layers
  4. Profiling (Hot-path opt K): hot-path optimization
  5. Static Routing (Hybrid): hybrid static routing approach

Goal: Low latency and no jamming (congestion)

Scalability

This design is a scale-out network for large-scale distributed training supporting 16,384+ GPUs. Each plane operates independently to maximize overall system throughput.


3-Line Summary

DeepSeek-V3 uses an 8-plane fat-tree network architecture that connects 16,384+ GPUs through independent communication channels, minimizing contention and maximizing bandwidth. The two-layer switch topology (Spine and Leaf) combined with dedicated GPU-NIC pairs enables efficient traffic distribution across planes. Cross-plane traffic management and hot-path optimization ensure low-latency, high-throughput communication for large-scale AI training.

#DeepseekV3 #FatTreeNetwork #MultiPlane #NetworkArchitecture #ScaleOut #DistributedTraining #AIInfrastructure #GPUCluster #HighPerformanceComputing #NVLink #DataCenterNetworking #LargeScaleAI

With Claude

Data Center Shift with AI

This diagram illustrates how data centers are transforming as they enter the AI era.

📅 Timeline of Technological Evolution

The top section shows major technology revolutions and their timelines:

  • Internet ’95 (Internet era)
  • Mobile ’07 (Mobile era)
  • Cloud ’10 (Cloud era)
  • Blockchain
  • AI(LLM) ’22 (Large Language Model-based AI era)

🏢 Traditional Data Center Components

Conventional data centers consisted of the following core components:

  • Software
  • Server
  • Network
  • Power
  • Cooling

These were designed as relatively independent layers.

🚀 New Requirements in the AI Era

With the introduction of AI (especially LLMs), data centers require specialized infrastructure:

  1. LLM Model – Operating large language models
  2. GPU – High-performance graphics processing units (essential for AI computations)
  3. High B/W – High-bandwidth networks (for processing large volumes of data)
  4. SMR/HVDC – Switched-Mode Rectifier/High-Voltage Direct Current power systems
  5. Liquid/CDU – Liquid cooling/Coolant Distribution Units (for cooling high-heat GPUs)

🔗 Key Characteristic of AI Data Centers: Integrated Design

The circular connection in the center of the diagram represents the most critical feature of AI data centers:

Tight Interdependency between SW/Computing/Network ↔ Power/Cooling

Unlike traditional data centers, in AI data centers:

  • GPU-based computing consumes enormous power and generates significant heat
  • High B/W networks consume additional power during massive data transfers between GPUs
  • Power systems (SMR/HVDC) must stably supply high power density
  • Liquid cooling (Liquid/CDU) must handle high-density GPU heat in real-time

These elements must be closely integrated in design, and optimizing just one element cannot guarantee overall system performance.

💡 Key Message

AI workloads require moving beyond the traditional layer-by-layer independent design approach of conventional data centers, demanding that computing-network-power-cooling be designed as one integrated system. This demonstrates that a holistic approach is essential when building AI data centers.


📝 Summary

AI data centers fundamentally differ from traditional data centers through the tight integration of computing, networking, power, and cooling systems. GPU-based AI workloads create unprecedented power density and heat generation, requiring liquid cooling and HVDC power systems. Success in AI infrastructure demands holistic design where all components are co-optimized rather than independently engineered.

#AIDataCenter #DataCenterEvolution #GPUInfrastructure #LiquidCooling #AIComputing #LLM #DataCenterDesign #HighPerformanceComputing #AIInfrastructure #HVDC #HolisticDesign #CloudComputing #DataCenterCooling #AIWorkloads #FutureOfDataCenters

With Claude

Network Issue in a GPU Workload

This diagram illustrates network bottleneck issues in large-scale AI/ML systems.

Key Components:

Left side:

  • Big Data and AI Model/Workload connected to the system via network

Center:

  • Large-scale GPU cluster (multiple GPUs arranged in a grid pattern)
  • Each GPU is interconnected for distributed processing

Right side:

  • Power supply and cooling systems

Core Problem:

The network interface specifications shown at the bottom reveal bandwidth mismatches:

  • Inter-GPU (NVLink): 600 GB/s
  • Inter-server (InfiniBand): 400 Gbps
  • CPU/RAM/disk (PCIe/NVLink): relatively lower bandwidth
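
The two quoted figures use different units (gigabytes versus gigabits per second), so the gap is easier to see after conversion. The sketch below is idealized arithmetic only: it ignores latency, protocol overhead, and whether each figure is per direction or aggregate, and the helper names are hypothetical.

```python
def gbps_to_gbytes_per_s(gigabits_per_second):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gigabits_per_second / 8


def transfer_seconds(size_gbytes, bandwidth_gbytes_per_s):
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return size_gbytes / bandwidth_gbytes_per_s


NVLINK_GBYTES_PER_S = 600.0                    # figure quoted in the diagram
IB_GBYTES_PER_S = gbps_to_gbytes_per_s(400.0)  # 400 Gbps is 50 GB/s
```

On these numbers the inter-server link is 12 times slower than the inter-GPU link, so any step that has to wait on InfiniBand sets the pace for the GPUs around it.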

“One Issue” – System-wide Propagation:

A network bottleneck or failure at a specific point (marked with red circle) “spreads throughout the entire system” as indicated by the yellow arrows.

This diagram warns that in large-scale AI training, a single network bottleneck can have catastrophic effects on overall system performance. It visualizes how bandwidth imbalances at various levels – GPU-to-GPU communication, server-to-server communication, and storage access – can compromise the efficiency of the entire system. The cascading effect demonstrates how network issues can quickly propagate and impact the performance of distributed AI workloads across the infrastructure.

With Claude