This diagram illustrates how data centers are transforming as they enter the AI era.
Timeline of Technological Evolution
The top section shows major technology revolutions and their timelines:
Internet ’95 (Internet era)
Mobile ’07 (Mobile era)
Cloud ’10 (Cloud era)
Blockchain
AI(LLM) ’22 (Large Language Model-based AI era)
Traditional Data Center Components
Conventional data centers consisted of the following core components:
Software
Server
Network
Power
Cooling
These were designed as relatively independent layers.
New Requirements in the AI Era
With the introduction of AI (especially LLMs), data centers require specialized infrastructure:
LLM Model – Operating large language models
GPU – High-performance graphics processing units (essential for AI computations)
High B/W – High-bandwidth networks (for processing large volumes of data)
SMR/HVDC – Switched-Mode Rectifier/High-Voltage Direct Current power systems
Liquid/CDU – Liquid cooling/Coolant Distribution Units (for cooling high-heat-density GPUs)
Key Characteristic of AI Data Centers: Integrated Design
The circular connection in the center of the diagram represents the most critical feature of AI data centers:
Tight Interdependency between SW/Computing/Network ↔ Power/Cooling
Unlike traditional data centers, in AI data centers:
GPU-based computing consumes enormous power and generates significant heat
High B/W networks consume additional power during massive data transfers between GPUs
Power systems (SMR/HVDC) must stably supply high power density
Liquid cooling (Liquid/CDU) must handle high-density GPU heat in real-time
These elements must be closely integrated in design, and optimizing just one element cannot guarantee overall system performance.
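To make the coupling concrete, here is a minimal first-order sizing sketch in Python. It only applies the heat-balance relation Q = ṁ·cp·ΔT; the per-GPU wattage, node count, and temperature rise are illustrative assumptions, not figures from the diagram.

```python
# A minimal, first-order sizing sketch (illustrative numbers are assumptions,
# not figures from the diagram): nearly all electrical power drawn by the
# GPUs and network ends up as heat that the liquid loop must remove.

WATER_SPECIFIC_HEAT = 4186.0   # J/(kg*K), water-like coolant
COOLANT_DENSITY = 1.0          # kg/L, water-like coolant

def required_coolant_flow_lpm(it_power_kw: float, delta_t_c: float) -> float:
    """Coolant flow (liters/minute) needed to absorb it_power_kw at a
    given supply-to-return temperature rise, from Q = m_dot * cp * dT."""
    heat_w = it_power_kw * 1000.0
    mass_flow_kg_s = heat_w / (WATER_SPECIFIC_HEAT * delta_t_c)
    return mass_flow_kg_s / COOLANT_DENSITY * 60.0

# Example: a hypothetical rack of 8 nodes x 8 GPUs x 700 W plus ~10% for
# networking; power, cooling, and compute sizing are directly coupled.
gpu_heat_kw = 8 * 8 * 700 / 1000.0    # 44.8 kW of GPU heat
rack_heat_kw = gpu_heat_kw * 1.10      # add network/overhead heat
print(f"Rack heat load: {rack_heat_kw:.1f} kW")
print(f"Coolant flow at dT=10 C: {required_coolant_flow_lpm(rack_heat_kw, 10):.0f} LPM")
```

Even this toy model shows why the layers cannot be sized in isolation: raising GPU power or network overhead immediately changes the coolant flow, and therefore the pump power, that the cooling loop must deliver.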
Key Message
AI workloads require moving beyond the traditional layer-by-layer independent design approach of conventional data centers, demanding that computing-network-power-cooling be designed as one integrated system. This demonstrates that a holistic approach is essential when building AI data centers.
Summary
AI data centers fundamentally differ from traditional data centers through the tight integration of computing, networking, power, and cooling systems. GPU-based AI workloads create unprecedented power density and heat generation, requiring liquid cooling and HVDC power systems. Success in AI infrastructure demands holistic design where all components are co-optimized rather than independently engineered.
AI Workload Cooling Systems: Bidirectional Physical-Software Optimization
This image summarizes four cutting-edge research studies demonstrating the bidirectional optimization relationship between AI LLMs and cooling systems. It shows that physical cooling infrastructure and software workloads are deeply interconnected.
Core Concept of Bidirectional Optimization
Direction 1: Physical Cooling → AI Performance Impact
Cooling methods directly affect LLM/VLM throughput and stability
Direction 2: AI Software → Cooling Control
LLMs themselves act as intelligent controllers for cooling systems (a minimal sketch of this control loop follows below)
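As a hedged illustration of Direction 2, the sketch below shows the general shape of such a loop: telemetry goes in, a model proposes setpoints, and a safety layer clamps them before they reach the cooling hardware. The function ask_llm_for_setpoints is a hypothetical placeholder (here just a trivial rule so the sketch runs), not an API from the cited studies, and all bounds are assumed values.

```python
# A hedged sketch of "Direction 2": an LLM (or any learned policy) proposes
# cooling setpoints from telemetry, and a safety layer clamps them before
# they reach the CDU. `ask_llm_for_setpoints` is a hypothetical placeholder,
# not an API from the cited studies.

def ask_llm_for_setpoints(telemetry: dict) -> dict:
    # Placeholder: in a real system this would prompt an LLM with the
    # telemetry and parse a structured recommendation. Here we return a
    # trivial rule-based stand-in so the sketch runs.
    hot = telemetry["gpu_temp_c"] > 60
    return {"pump_speed_pct": 90 if hot else 70,
            "supply_temp_c": 28 if hot else 32}

def clamp(value, low, high):
    return max(low, min(high, value))

def apply_safe_setpoints(raw: dict) -> dict:
    # Never let the model command values outside engineered safe bounds.
    return {"pump_speed_pct": clamp(raw["pump_speed_pct"], 30, 100),
            "supply_temp_c": clamp(raw["supply_temp_c"], 20, 40)}

telemetry = {"gpu_temp_c": 63, "coolant_return_c": 45, "pump_speed_pct": 70}
setpoints = apply_safe_setpoints(ask_llm_for_setpoints(telemetry))
print(setpoints)   # {'pump_speed_pct': 90, 'supply_temp_c': 28}
```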
Research Analysis
1. Physical Cooling Impact on AI Performance (2025 arXiv)
[Cooling HW → AI SW Performance]
Experiment: Liquid vs Air cooling comparison on H100 nodes
Physical Differences:
GPU Temperature: Liquid 41-50°C vs Air 54-72°C (up to 22°C difference)
GPU Power Consumption: 148-173W reduction
Node Power: ~1kW savings
Software Performance Impact:
Throughput: 54 vs 46 TFLOPs/GPU (+17% improvement)
Sustained and predictable performance through reduced throttling
Adaptive cooling strategies based on workload characteristics
3. Virtuous Cycle Effect
Better cooling → AI performance improvement → smarter cooling control
→ Energy savings → more AI jobs → advanced cooling optimization
→ Sustainable large-scale AI infrastructure
Practical Implications
These studies demonstrate:
Cooling is no longer passive infrastructure: It’s an active determinant of AI performance
AI optimizes its own environment: Meta-level self-optimizing systems
Hardware-software co-design is essential: Isolated optimization is suboptimal
Simultaneous achievement of sustainability and performance: Synergy, not trade-off
Summary
These four studies establish that next-generation AI data centers must evolve into integrated ecosystems where physical cooling and software workloads interact in real-time to self-optimize. The bidirectional relationship, in which better cooling enables superior AI performance and AI algorithms intelligently control cooling systems, creates a virtuous cycle that simultaneously achieves enhanced performance, energy efficiency, and sustainable scalability for large-scale AI infrastructure.
This image shows a CDU (Coolant Distribution Unit) Metrics & Control System diagram illustrating the overall structure. The system can be organized as follows:
System Structure
Upper Section: CDU Structure
First Loop: CDU (Coolant Distribution Unit) primary loop
Second Main Loop: Row Manifold and Rack Manifold configuration
Process Chill Water Supply/Return: Process chilled water circulation system
Lower Section: Data Collection & Control Devices (summarized in a telemetry sketch after this list)
Control Devices:
Pump (Pump RPM, Rate of max speed)
Valve (Valve Open %)
Sensor Configuration:
Temperature & Pressure Sensors on manifolds
Supply System:
Rack Water Supply/Return
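As a rough sketch, the metrics named above could be grouped into a single telemetry snapshot like the one below; the field names and units are assumptions for illustration, not labels from the diagram.

```python
# A sketch of the data-collection side of the diagram: one snapshot of the
# metrics it names (pump RPM / rate of max speed, valve open %, manifold
# temperature and pressure). Field names are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class CduTelemetry:
    pump_rpm: float                # Pump RPM
    pump_pct_max_speed: float      # rate of max speed, %
    valve_open_pct: float          # Valve Open %
    supply_temp_c: float           # rack/row manifold supply temperature
    return_temp_c: float           # rack/row manifold return temperature
    supply_pressure_kpa: float     # manifold supply pressure
    return_pressure_kpa: float     # manifold return pressure

    @property
    def pressure_drop_kpa(self) -> float:
        """Differential pressure across rack supply/return."""
        return self.supply_pressure_kpa - self.return_pressure_kpa
```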
Main Control Methods
1. Fixed Pressure Control (Fixed Pressure Drop)
Primary Method: Maintaining a fixed pressure drop between the rack supply and return (see the control-loop sketch after this list)
2. Approach Temperature Control
Primary Method: Maintaining a constant approach temperature
Alternatives: Fixed valve opening, fixed secondary supply temperature control
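Below is a minimal sketch of the primary method, assuming a simple proportional adjustment of pump speed toward a differential-pressure setpoint; the gain, setpoint, and bounds are illustrative assumptions, not values from the diagram.

```python
# A minimal sketch of the primary "fixed pressure drop" method: adjust pump
# speed so the measured rack supply-return differential pressure tracks a
# setpoint. The gain, setpoint, and bounds are assumptions, not from the slide.

def pump_speed_step(dp_setpoint_kpa: float,
                    dp_measured_kpa: float,
                    current_speed_pct: float,
                    gain: float = 0.5) -> float:
    """One proportional control step on pump speed (% of max)."""
    error = dp_setpoint_kpa - dp_measured_kpa
    new_speed = current_speed_pct + gain * error
    return max(20.0, min(100.0, new_speed))   # keep within safe bounds

# Example: pressure drop has sagged below the setpoint, so speed the pump up.
print(pump_speed_step(dp_setpoint_kpa=100.0, dp_measured_kpa=90.0,
                      current_speed_pct=70.0))   # 75.0
```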
Summary
This CDU system provides precise cooling control for data centers through dual management of pressure and temperature. The system integrates sensor feedback from manifolds with pump and valve control to maintain optimal cooling conditions across server racks.
The provided visual summarizes the key performance metrics of the CDU (Coolant Distribution Unit) that adheres to the OCP (Open Compute Project) ‘Project Deschutes’ specification. This CDU is designed for high-performance computing environments, particularly for massive-scale liquid cooling of AI/ML workloads.
Key Performance Indicators
System Availability: The primary target for system availability is 99.999%. This represents an extremely high level of reliability, corresponding to roughly 5 minutes and 15 seconds of downtime per year.
Thermal Load Capacity: The CDU is designed to handle a thermal load of up to 2,000 kW, which is among the highest thermal capacities in the industry.
Power Usage: The CDU itself consumes 74 kW of power.
IT Flow Rate: It supplies coolant to the servers at a rate of 500 GPM (approximately 1,900 LPM).
Operating Pressure: The overall system operating pressure is within a range of 0-130 psig (approximately 0-900 kPa).
IT Differential Pressure: The pressure difference required on the server side is 80-90 psi (approximately 550-620 kPa).
Approach Temperature: The approach temperature, a key indicator of heat exchange efficiency, is targeted at ≤3°C. A lower value is better, as it signifies more efficient heat removal. (These figures are sanity-checked numerically in the short calculation below.)
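The headline figures above can be checked with standard unit conversions; the spec values come from the list, and only the conversion factors are added here.

```python
# Quick sanity checks on the figures above, using standard unit conversions.

GPM_TO_LPM = 3.785411784      # 1 US gallon = 3.785... liters
PSI_TO_KPA = 6.894757         # 1 psi = 6.895 kPa

availability = 0.99999
downtime_min = (1 - availability) * 365 * 24 * 60
print(f"99.999% availability -> {downtime_min:.2f} min of downtime per year")  # ~5.26 min (~5 min 15 s)

print(f"500 GPM   -> {500 * GPM_TO_LPM:.0f} LPM")                       # ~1,893 LPM
print(f"130 psig  -> {130 * PSI_TO_KPA:.0f} kPa")                       # ~896 kPa
print(f"80-90 psi -> {80 * PSI_TO_KPA:.0f}-{90 * PSI_TO_KPA:.0f} kPa")  # ~552-620 kPa
```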
Why Cooling is Crucial for GPU Performance
Cooling has a direct and significant impact on GPU performance and stability. Because GPUs are highly sensitive to heat, if they are not maintained within an optimal temperature range, they will automatically reduce their performance through a process called thermal throttling to prevent damage.
The ‘Project Deschutes’ CDU is engineered to prevent this by handling a massive thermal load of 2,000 kW with a powerful 500 GPM flow rate and a low approach temperature of ≤3°C. This robust cooling capability ensures that GPUs can operate at their maximum potential without being limited by heat, which is essential for maximizing performance in demanding AI workloads.
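As a rough back-of-the-envelope check (my arithmetic, not a figure from the specification), the 2,000 kW capacity and 500 GPM flow together imply roughly a 15°C coolant temperature rise across the IT loop at full load, via Q = ṁ·cp·ΔT. Note that this loop temperature rise is a different quantity from the ≤3°C approach temperature, which describes the heat exchanger, not the IT loop.

```python
# A rough check (not from the spec sheet) of how the spec numbers relate: at
# the full 2,000 kW load and 500 GPM flow, the coolant temperature rise across
# the IT loop follows from Q = m_dot * cp * dT for a water-like coolant.

GPM_TO_LPS = 3.785411784 / 60.0    # gallons/minute -> liters/second
CP_WATER = 4186.0                  # J/(kg*K)
DENSITY = 1.0                      # kg/L

flow_lps = 500 * GPM_TO_LPS                   # ~31.5 L/s
mass_flow = flow_lps * DENSITY                # ~31.5 kg/s
delta_t = 2_000_000 / (mass_flow * CP_WATER)  # ~15 C rise at full load
print(f"Coolant temperature rise at 2,000 kW and 500 GPM: {delta_t:.1f} C")
```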
This illustration visualizes the evolution of data centers, contrasting the traditionally separated components with the modern AI data center where software, compute, network, and crucially, power and cooling systems are ‘tightly fused’ together. It emphasizes how power and advanced cooling are organically intertwined with GPU and memory, directly impacting AI performance and highlighting their inseparable role in meeting the demands of high-performance AI. This tight integration symbolizes a pivotal shift for the modern AI era.
This image demonstrates the critical impact of cooling stability on both LLM performance and energy efficiency in GPU servers through benchmark results.
Cascading Effects of Unstable Cooling
Problems with Unstable Air Cooling:
GPU Temperature: 54-72°C (high and unstable)
Thermal throttling occurs: GPUs automatically reduce clock speeds to prevent overheating, leading to significant performance degradation
Result: Double penalty of reduced performance + increased power consumption
Energy Efficiency Impact:
Power Consumption: 8.16kW (high)
Performance: 46 TFLOPS (degraded)
Energy Efficiency: 5.6 TFLOPS/kW (poor performance-to-power ratio)
Benefits of Stable Liquid Cooling
Temperature Stability Achievement:
GPU Temperature: 41-50°C (low and stable)
No thermal throttling → sustained optimal performance
Energy Efficiency Improvement:
Power Consumption: 6.99kW (14% reduction)
Performance: 54 TFLOPS (17% improvement)
Energy Efficiency: 7.7 TFLOPS/kW (38% improvement; recomputed in the sketch below)
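The efficiency figures can be recomputed directly from the reported node power and throughput numbers; only the arithmetic is added here.

```python
# Recomputing the efficiency figures above from the reported node power and
# throughput for the air-cooled vs. liquid-cooled configurations.

air    = {"power_kw": 8.16, "tflops": 46}
liquid = {"power_kw": 6.99, "tflops": 54}

def efficiency(node):
    return node["tflops"] / node["power_kw"]   # TFLOPS per kW

for name, node in (("air", air), ("liquid", liquid)):
    print(f"{name:>6}: {efficiency(node):.1f} TFLOPS/kW")   # 5.6 vs 7.7

print(f"throughput gain : {(liquid['tflops'] / air['tflops'] - 1) * 100:.0f}%")      # ~17%
print(f"power reduction : {(1 - liquid['power_kw'] / air['power_kw']) * 100:.0f}%")  # ~14%
print(f"efficiency gain : {(efficiency(liquid) / efficiency(air) - 1) * 100:.0f}%")  # ~37%, the ~38% above after rounding
```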
Core Mechanisms: How Cooling Affects Energy Efficiency
Power Efficiency Optimization: Eliminates inefficient power consumption caused by overheating
Performance Consistency: Unstable cooling can cause GPUs to use 50% of power budget while delivering only 25% performance
Advanced cooling systems can achieve energy savings ranging from 17% to 23% compared to traditional methods. This benchmark paradoxically shows that proper cooling investment dramatically improves overall energy efficiency.
Final Summary
Unstable cooling triggers thermal throttling that simultaneously degrades LLM performance while increasing power consumption, creating a dual efficiency loss. Stable liquid cooling achieves 17% performance gains and 14% power savings simultaneously, improving energy efficiency by 38%. In AI infrastructure, adequate cooling investment is essential for optimizing both performance and energy efficiency.
This infographic compares the evolution from servers to data centers, showing the progression of IT infrastructure complexity and operational requirements.
Left – Server
Shows individual hardware components: CPU, motherboard, power supply, cooling fans
Labeled “No Human Operation,” indicating basic automated functionality
Center – Modular DC
Represented by red cubes showing modular architecture
Labeled "More Bigger" and "modular," emphasizing increased scale and modular design
Represents an intermediate stage between single servers and full data centers
Right – Data Center
Displays multiple server racks and various infrastructure components (networking, power, cooling systems)
Marked as “Human & System Operation,” suggesting more complex management requirements
Additional Perspective on Automation Evolution:
While the image shows data centers requiring human intervention, the actual industry trend points toward increasing automation:
Advanced Automation: Large-scale data centers increasingly use AI-driven management systems, automated cooling controls, and predictive maintenance to minimize human intervention.
Lights-Out Operations Goal: Hyperscale data centers from companies like Google, Amazon, and Microsoft ultimately aim for complete automated operations with minimal human presence.
Paradoxical Development: As scale increases, complexity initially requires more human involvement, but advanced automation eventually enables a return toward unmanned operations.
Summary: This diagram illustrates the current transition from simple automated servers to complex data centers requiring human oversight, but the ultimate industry goal is achieving fully automated “lights-out” data center operations. The evolution shows increasing complexity followed by sophisticated automation that eventually reduces the need for human intervention.