This design is a scale-out network for large-scale distributed training supporting 16,384+ GPUs. Each plane operates independently to maximize overall system throughput.
3-Line Summary
DeepSeek-V3 uses an 8-plane fat-tree network architecture that connects 16,384+ GPUs through independent communication channels, minimizing contention and maximizing bandwidth. The two-layer switch topology (Spine and Leaf) combined with dedicated GPU-NIC pairs enables efficient traffic distribution across planes. Cross-plane traffic management and hot-path optimization ensure low-latency, high-throughput communication for large-scale AI training.
These three elements must be organically combined – this is the core message of the diagram.
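To make the 16,384-GPU figure concrete, the sketch below sizes a two-layer, multi-plane fat-tree from the switch radix and maps each GPU/NIC pair to a plane. The 64-port radix, the plane-assignment rule, and the function names are illustrative assumptions, not details taken from the DeepSeek-V3 report.

```python
# Minimal sketch of a multi-plane, two-layer fat-tree sizing calculation.
# Assumptions (not from the DeepSeek-V3 report): 64-port switches, 8 planes,
# 8 GPUs per node with GPU i attached to the NIC on plane i.

SWITCH_PORTS = 64   # assumed switch radix
PLANES = 8          # number of independent network planes
GPUS_PER_NODE = 8   # one GPU-NIC pair per plane within a node

# In a two-layer (leaf-spine) fat-tree, each leaf uses half its ports for
# endpoints and half for uplinks, so one plane supports ports^2 / 2 endpoints.
endpoints_per_plane = SWITCH_PORTS * SWITCH_PORTS // 2   # 2,048
total_gpus = endpoints_per_plane * PLANES                # 16,384

def plane_of(node_id: int, local_gpu: int) -> int:
    """Assumed mapping: GPU `local_gpu` in every node talks on plane `local_gpu`."""
    return local_gpu % PLANES

if __name__ == "__main__":
    print(f"endpoints per plane: {endpoints_per_plane}")       # 2048
    print(f"total GPUs across {PLANES} planes: {total_gpus}")  # 16384
    print(f"node 37, GPU 5 -> plane {plane_of(37, 5)}")
```

With these assumptions, each plane is an independent 2,048-endpoint leaf-spine tree, and eight planes together reach 16,384 GPUs. Traffic between GPUs attached to different planes presumably has to be bridged inside a node (e.g., over the intra-node interconnect) before entering the destination plane, which is one reason cross-plane traffic management matters.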
Summary
LLM optimization requires integrating traditional deterministic SW/HW optimization with new paradigms: probabilistic/statistical approaches that mirror human language understanding and learning, plus hardware architectures designed for massive parallel processing. This represents a fundamental shift from conventional optimization, where human-centric probabilistic thinking and large-scale parallelism are not optional but essential dimensions.
Sam Altman: “The cost of AI will converge to the cost of energy. The abundance of AI will be limited by the abundance of energy”
Power infrastructure (transmission lines, transformers) takes years to build
Data centers projected to consume 7.5% of US electricity by 2030
6. Cooling
Advanced technologies such as liquid cooling are required; infrastructure upgrades take a year or more
“What is the first wall?”
Critical Bottlenecks by Timeline:
Current (2025): Memory bandwidth + Data quality
Short-to-Mid term: Power infrastructure (5-10 years to build)
Long-term: Physical limit of the speed of light
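To see why the speed of light is the ultimate floor, the short sketch below estimates one-way propagation delay in optical fiber, where light travels at roughly c/1.5; the distances are illustrative assumptions.

```python
# Back-of-the-envelope propagation delay: why the speed of light becomes
# the long-term floor on communication latency, regardless of hardware.

C = 299_792_458       # speed of light in vacuum, m/s
FIBER_FACTOR = 1.5    # light in fiber travels at roughly c/1.5 (refractive index ~1.5)

def one_way_delay_us(distance_m: float) -> float:
    """One-way propagation delay in microseconds over optical fiber."""
    return distance_m / (C / FIBER_FACTOR) * 1e6

if __name__ == "__main__":
    for label, meters in [("rack to rack (100 m)", 100),
                          ("across a campus (2 km)", 2_000),
                          ("across a continent (4,000 km)", 4_000_000)]:
        print(f"{label}: {one_way_delay_us(meters):,.1f} µs")
```

Even with ideal hardware, a 4,000 km fiber path costs roughly 20 ms one way; no amount of engineering buys that back.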
Summary
The “first wall” in AI scaling is not a single barrier but a multi-layered constraint system that emerges sequentially over time. Today’s immediate challenges are memory bandwidth and data quality, followed by power infrastructure limitations in the mid-term, and ultimately the fundamental physical constraint of the speed of light. As Sam Altman emphasized, AI’s future abundance will be fundamentally limited by energy abundance, with all bottlenecks interconnected through the computing→heat→cooling→power chain.
This diagram illustrates how data centers are transforming as they enter the AI era.
📅 Timeline of Technological Evolution
The top section shows major technology revolutions and their timelines:
Internet ’95 (Internet era)
Mobile ’07 (Mobile era)
Cloud ’10 (Cloud era)
Blockchain
AI (LLM) ’22 (Large Language Model-based AI era)
🏢 Traditional Data Center Components
Conventional data centers consisted of the following core components:
Software
Server
Network
Power
Cooling
These were designed as relatively independent layers.
🚀 New Requirements in the AI Era
With the introduction of AI (especially LLMs), data centers require specialized infrastructure:
LLM Model – Operating large language models
GPU – High-performance graphics processing units (essential for AI computations)
High B/W – High-bandwidth networks (for processing large volumes of data)
SMR/HVDC – Small Modular Reactors / High-Voltage Direct Current power systems (for supplying high power densities reliably and efficiently)
Liquid/CDU – Liquid cooling/Cooling Distribution Units (for cooling high-heat GPUs)
🔗 Key Characteristic of AI Data Centers: Integrated Design
The circular connection in the center of the diagram represents the most critical feature of AI data centers:
Tight Interdependency between SW/Computing/Network ↔ Power/Cooling
Unlike traditional data centers, in AI data centers:
GPU-based computing consumes enormous power and generates significant heat
High B/W networks consume additional power during massive data transfers between GPUs
Power systems (SMR/HVDC) must stably supply high power density
Liquid cooling (Liquid/CDU) must handle high-density GPU heat in real-time
These elements must be closely integrated in design, and optimizing just one element cannot guarantee overall system performance.
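A rough way to quantify this chain: essentially every watt delivered to the GPUs comes back out as heat that the cooling plant must remove, and the cooling plant itself draws power. The figures below (per-GPU power, rack density, chiller COP, overhead share) are illustrative assumptions rather than vendor data.

```python
# Illustrative computing -> heat -> cooling -> power chain for one rack.
# All figures are assumptions for the sake of the example.

GPU_POWER_KW = 1.0       # assumed per-GPU power draw
GPUS_PER_RACK = 64       # assumed rack density
COOLING_COP = 4.0        # assumed chiller coefficient of performance
OTHER_OVERHEAD = 0.10    # assumed share for fans, power conversion losses, etc.

it_load_kw = GPU_POWER_KW * GPUS_PER_RACK       # electrical power into compute
heat_load_kw = it_load_kw                       # nearly all of it becomes heat
cooling_power_kw = heat_load_kw / COOLING_COP   # power needed to remove that heat
facility_power_kw = it_load_kw * (1 + OTHER_OVERHEAD) + cooling_power_kw

print(f"IT load:        {it_load_kw:.0f} kW")
print(f"Heat to remove: {heat_load_kw:.0f} kW")
print(f"Cooling power:  {cooling_power_kw:.0f} kW")
print(f"Facility total: {facility_power_kw:.1f} kW")
```

Under these assumptions a 64 kW rack forces the facility to deliver roughly 86 kW, which is exactly why power and cooling cannot be designed after the fact.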
💡 Key Message
AI workloads require moving beyond the traditional layer-by-layer independent design approach of conventional data centers, demanding that computing-network-power-cooling be designed as one integrated system. This demonstrates that a holistic approach is essential when building AI data centers.
📝 Summary
AI data centers fundamentally differ from traditional data centers through the tight integration of computing, networking, power, and cooling systems. GPU-based AI workloads create unprecedented power density and heat generation, requiring liquid cooling and HVDC power systems. Success in AI infrastructure demands holistic design where all components are co-optimized rather than independently engineered.
PUE Improvement: Power Usage Effectiveness, the ratio of total facility power to IT equipment power (overall power efficiency metric; lower is better, with 1.0 as the ideal)
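A minimal sketch of the metric, with assumed numbers:

```python
# PUE = total facility power / IT equipment power (lower is better, 1.0 is ideal).
# The example figures are assumptions, not measurements.

def pue(total_facility_kw: float, it_kw: float) -> float:
    return total_facility_kw / it_kw

it_kw = 1_000
legacy = pue(total_facility_kw=1_600, it_kw=it_kw)    # assumed air-cooled baseline
improved = pue(total_facility_kw=1_200, it_kw=it_kw)  # assumed after cooling/power upgrades

saved_kw = (legacy - improved) * it_kw
print(f"PUE {legacy:.2f} -> {improved:.2f}, saving {saved_kw:.0f} kW of overhead")
```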
Key Message
This diagram emphasizes that for successful AI implementation:
Technical Foundation: Both Data/Chips (Computing) and Power/Cooling (Infrastructure) are necessary
Tight Integration: These two axes are not separate but must be firmly connected like a chain and optimized simultaneously
Implementation Technologies: Specific advanced technologies for stability and optimization in each domain must provide support
The central link particularly visualizes the interdependent relationship where “increasing computing power requires strengthening energy and cooling in tandem, and computing performance cannot be realized without infrastructure support.”
Summary
AI systems require two inseparable pillars: Computing (Data/Chips) and Infrastructure (Power/Cooling), which must be tightly integrated and optimized together like links in a chain. Each pillar is supported by advanced technologies spanning from AI model optimization (FlashAttention, Quantization) to next-gen hardware (GB200, TPU) and sustainable infrastructure (SMR, Liquid Cooling, AI-driven optimization). The key insight is that scaling AI performance demands simultaneous advancement across all layers—more computing power is meaningless without proportional energy supply and cooling capacity.
AI Data Center Cooling System Architecture Analysis
This diagram illustrates the evolution of data center cooling systems designed for high-heat AI workloads.
Traditional Cooling System (Top Section)
Three-Stage Cooling Process:
Cooling Tower – Uses ambient air to cool water
Chiller – Further refrigerates the cooled water
CRAH (Computer Room Air Handler) – Distributes cold air to the server room
A Free Cooling option is also shown, which reduces chiller operation by leveraging low outdoor temperatures for energy savings.
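Free cooling is essentially an economizer decision: when outdoor conditions are cold enough for the cooling tower alone to produce the required water temperature, the chiller can be throttled or bypassed. A minimal decision sketch, with assumed setpoints and thresholds:

```python
# Simplified free-cooling (economizer) decision. Thresholds are assumptions;
# real plants use wet-bulb temperature, approach temperatures, and partial modes.

CHW_SUPPLY_C = 18.0       # assumed chilled-water supply setpoint
TOWER_APPROACH_C = 4.0    # assumed cooling-tower approach temperature

def cooling_mode(outdoor_wet_bulb_c: float) -> str:
    achievable_c = outdoor_wet_bulb_c + TOWER_APPROACH_C
    if achievable_c <= CHW_SUPPLY_C - 2.0:
        return "full free cooling (chiller off)"
    if achievable_c <= CHW_SUPPLY_C + 2.0:
        return "partial free cooling (chiller assists)"
    return "mechanical cooling (chiller carries the load)"

for wb in (5.0, 13.0, 24.0):
    print(f"wet-bulb {wb:>4.1f} °C -> {cooling_mode(wb)}")
```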
New Approach for AI DC: Liquid Cooling System (Bottom Section)
To address extreme heat generation from high-density AI chips, a CDU (Coolant Distribution Unit) based liquid cooling system has been introduced.
Key Components:
① Coolant Circulation and Distribution
Direct coolant circulation system to servers
② Heat Exchange (Two Methods)
Direct-to-Chip (D2C) Liquid Cooling: Cold plate with manifold distribution system directly contacting chips
Rear-Door Heat Exchanger (RDHx): Liquid-to-air heat exchanger mounted on the rack's rear door, cooling the servers' exhaust air
③ Pumping and Flow Control
Pumps and flow control for coolant circulation
④ Filtration and Coolant Quality Management
Maintains coolant quality and removes contaminants
⑤ Monitoring and Control
Real-time monitoring and cooling performance control
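Components ③ and ⑤ effectively form a closed control loop: the CDU measures coolant temperature and flow, modulates pump speed to hold the supply setpoint, and raises alerts when limits are exceeded. A heavily simplified sketch of that loop, with assumed setpoints and hypothetical control logic:

```python
# Heavily simplified CDU control loop (assumed setpoints, hypothetical logic):
# hold coolant supply temperature at setpoint by modulating pump speed.

SUPPLY_SETPOINT_C = 30.0   # assumed facility coolant supply setpoint
MAX_PUMP_PCT = 100.0
MIN_PUMP_PCT = 20.0
KP = 8.0                   # assumed proportional gain (% pump speed per °C of error)

def pump_speed_pct(supply_temp_c: float, current_pct: float) -> float:
    """Proportional adjustment: run the pumps harder when the supply runs warm."""
    error = supply_temp_c - SUPPLY_SETPOINT_C
    return max(MIN_PUMP_PCT, min(MAX_PUMP_PCT, current_pct + KP * error))

def check_alarms(supply_temp_c: float, flow_lpm: float, min_flow_lpm: float = 50.0):
    alarms = []
    if supply_temp_c > SUPPLY_SETPOINT_C + 5.0:
        alarms.append("coolant over-temperature")
    if flow_lpm < min_flow_lpm:
        alarms.append("low coolant flow (check filters/pumps)")
    return alarms

speed = 50.0
for temp, flow in [(29.0, 120.0), (33.5, 110.0), (36.0, 40.0)]:
    speed = pump_speed_pct(temp, speed)
    print(f"supply {temp:.1f} °C, flow {flow:.0f} L/min "
          f"-> pump {speed:.0f} %, alarms: {check_alarms(temp, flow) or 'none'}")
```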
Critical Differences
Traditional Method: Air cooling → Indirect, suitable for low-density workloads
AI DC Method: Liquid cooling → Direct, high-efficiency, capable of handling high TDP (Thermal Design Power) of AI chips
Liquid has approximately 25x better heat transfer efficiency than air, making it effective for cooling AI accelerators (GPUs, TPUs) that generate hundreds of watts to kilowatt-level heat.
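As a worked example of what kW-level heat means for a liquid loop, the sketch below estimates the water flow a cold-plate loop needs in order to absorb a given heat load at a chosen temperature rise (Q = ṁ·c_p·ΔT). The rack power and temperature rise are illustrative assumptions.

```python
# Required coolant flow for a given heat load: Q = m_dot * c_p * delta_T.
# Rack power and temperature rise below are illustrative assumptions.

WATER_CP_KJ_PER_KG_K = 4.18     # specific heat of water
WATER_DENSITY_KG_PER_L = 0.997  # density of water near room temperature

def flow_lpm(heat_kw: float, delta_t_k: float) -> float:
    """Water flow in litres per minute needed to absorb heat_kw at a delta_t_k rise."""
    mass_flow_kg_s = heat_kw / (WATER_CP_KJ_PER_KG_K * delta_t_k)
    return mass_flow_kg_s / WATER_DENSITY_KG_PER_L * 60.0

print(f"1 kW chip,  dT=10 K: {flow_lpm(1, 10):.1f} L/min")    # ~1.4 L/min
print(f"80 kW rack, dT=10 K: {flow_lpm(80, 10):.0f} L/min")   # ~115 L/min
```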
Summary:
Traditional data centers use air-based cooling (Cooling Tower → Chiller → CRAH), suitable for standard workloads.
AI data centers require liquid cooling with CDU systems due to extreme heat from high-density AI chips.
Liquid cooling offers direct-to-chip heat removal with 25x better thermal efficiency than air, supporting kW-level heat dissipation.