Cooling for AI (a heavy heat load)

AI Data Center Cooling System Architecture Analysis

This diagram illustrates the evolution of data center cooling systems designed for high-heat AI workloads.

Traditional Cooling System (Top Section)

Three-Stage Cooling Process:

  1. Cooling Tower – Uses ambient air to cool water
  2. Chiller – Further refrigerates the cooled water
  3. CRAH (Computer Room Air Handler) – Distributes cold air to the server room

A Free Cooling option is also shown: when outside temperatures are low enough, the chiller can be partially or fully bypassed, saving energy.
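To make the free-cooling idea concrete, here is a minimal sketch (not from the diagram) of how a plant controller might pick a cooling mode from the outdoor temperature; the setpoint and thresholds are invented for illustration.

```python
# Minimal sketch of an economizer (free-cooling) decision. Thresholds are
# made up for illustration; real setpoints depend on the plant design.

def cooling_mode(outdoor_temp_c: float, supply_setpoint_c: float = 18.0) -> str:
    """Pick a cooling mode from the outdoor dry-bulb temperature."""
    if outdoor_temp_c <= supply_setpoint_c - 5.0:
        return "free cooling"            # cooling tower alone meets the setpoint
    elif outdoor_temp_c <= supply_setpoint_c:
        return "partial free cooling"    # tower pre-cools, chiller trims the rest
    else:
        return "mechanical cooling"      # chiller carries the full load

for t in (8.0, 16.0, 28.0):
    print(f"{t:>5.1f} C outside -> {cooling_mode(t)}")
```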

New Approach for AI DC: Liquid Cooling System (Bottom Section)

To address extreme heat generation from high-density AI chips, a CDU (Coolant Distribution Unit) based liquid cooling system has been introduced.

Key Components:

① Coolant Circulation and Distribution

  • Direct coolant circulation system to servers

② Heat Exchange (Two Methods)

  • Direct-to-Chip (D2C) Liquid Cooling: Cold plate with manifold distribution system directly contacting chips
  • Rear-Door Heat Exchanger (RDHx): Liquid-cooled heat exchanger mounted on the rack's rear door, removing heat from server exhaust air

③ Pumping and Flow Control

  • Pumps and flow control for coolant circulation (see the control-loop sketch after this list)

④ Filtration and Coolant Quality Management

  • Maintains coolant quality and removes contaminants

⑤ Monitoring and Control

  • Real-time monitoring and cooling performance control
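As a rough illustration of components ③ and ⑤ working together, here is a hypothetical sketch of a CDU-style control loop: pump speed rises as the coolant return temperature climbs above a setpoint. The setpoint, gain, and limits are assumptions, not values from any real CDU.

```python
# Hypothetical CDU flow-control sketch: map coolant return temperature to a
# pump speed command. All numbers below are invented for illustration.

def pump_speed_pct(return_temp_c: float,
                   setpoint_c: float = 45.0,
                   gain_pct_per_c: float = 8.0,
                   min_pct: float = 30.0,
                   max_pct: float = 100.0) -> float:
    """Simple proportional response: hotter return water -> faster pump."""
    error_c = return_temp_c - setpoint_c
    speed = min_pct + gain_pct_per_c * max(error_c, 0.0)
    return min(max(speed, min_pct), max_pct)

for temp in (42.0, 46.0, 50.0, 55.0):
    print(f"return {temp:.0f} C -> pump {pump_speed_pct(temp):.0f}%")
```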

Critical Differences

Traditional Method: Air cooling → Indirect, suitable for low-density workloads

AI DC Method: Liquid cooling → Direct, high-efficiency, capable of handling high TDP (Thermal Design Power) of AI chips

Water conducts heat roughly 25x better than air and carries far more heat per unit volume, making liquid cooling effective for AI accelerators (GPUs, TPUs) that dissipate hundreds of watts to over a kilowatt each.
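A back-of-the-envelope comparison (assumed fluid properties, not figures from the diagram) shows why: using Q = ṁ·cp·ΔT, moving 1 kW of heat at a 10 K coolant temperature rise takes roughly 300 m³/h of air but less than 0.1 m³/h of water.

```python
# Volumetric flow needed to carry away 1 kW at a 10 K temperature rise,
# from Q = m_dot * cp * dT. Fluid properties are standard textbook values.

def flow_m3_per_h(heat_w: float, delta_t_k: float, density: float, cp: float) -> float:
    mass_flow = heat_w / (cp * delta_t_k)   # kg/s needed to absorb the heat
    return mass_flow / density * 3600.0     # convert to m^3/h

heat_w, delta_t_k = 1000.0, 10.0
air = flow_m3_per_h(heat_w, delta_t_k, density=1.2, cp=1005.0)
water = flow_m3_per_h(heat_w, delta_t_k, density=997.0, cp=4186.0)
print(f"air:   {air:7.2f} m^3/h per kW")    # ~300 m^3/h
print(f"water: {water:7.4f} m^3/h per kW")  # ~0.09 m^3/h
```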


Summary:

  1. Traditional data centers use air-based cooling (Cooling Tower → Chiller → CRAH), suitable for standard workloads.
  2. AI data centers require liquid cooling with CDU systems due to extreme heat from high-density AI chips.
  3. Liquid cooling offers direct-to-chip heat removal with 25x better thermal efficiency than air, supporting kW-level heat dissipation.

#AIDataCenter #LiquidCooling #DataCenterInfrastructure #CDU #ThermalManagement #DirectToChip #AIInfrastructure #GreenDataCenter #HeatDissipation #HyperscaleComputing #AIWorkload #DataCenterCooling #ImmersionCooling #EnergyEfficiency #NextGenDataCenter

With Claude

Power for AI

AI Data Center Power Infrastructure: 3 Key Transformations

Traditional Data Center Power Structure (Baseline)

Power Grid → Transformer → UPS → Server (220V AC)

  • Single power grid connection
  • Standard UPS backup (10-15 minutes)
  • AC power distribution
  • 200-300W per server

3 Critical Changes for AI Data Centers

🔴 1. More Power (Massive Power Supply)

Key Changes:

  • Diversified power sources:
    • SMR (Small Modular Reactor) – Stable baseload power
    • Renewable energy integration
    • Natural gas turbines
    • Long-term backup generators + large fuel tanks

Why: AI accelerators (GPU/TPU) push per-server power into the kW to tens-of-kW range (rough arithmetic after this list)

  • Traditional server: 200-300W
  • AI server: 5-10 kW (25-50x increase)
  • Total data center power demand: Hundreds of MW scale
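As a rough sanity check on that scale, the arithmetic looks like this; the fleet size is a hypothetical number and the per-server figures are the rounded ranges quoted above.

```python
# Illustrative only: total IT load for an assumed 20,000-server fleet at
# traditional vs AI per-server draw (values taken from the ranges above).

servers = 20_000          # hypothetical fleet size
traditional_w = 250       # ~200-300 W per traditional server
ai_w = 7_500              # ~5-10 kW per AI server

trad_mw = servers * traditional_w / 1e6
ai_mw = servers * ai_w / 1e6
print(f"traditional: {trad_mw:5.1f} MW of IT load")  # ~5 MW
print(f"AI servers:  {ai_mw:5.1f} MW of IT load")    # ~150 MW, before cooling overhead
print(f"ratio: {ai_mw / trad_mw:.0f}x")
```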

🔴 2. Stable Power (Power Quality & Conditioning)

Key Changes:

  • 800V HVDC system – High-voltage DC transmission
  • ESS (Energy Storage System) – Large-scale battery storage
  • Peak Shaving – Peak load control and leveling (see the toy example at the end of this section)
  • UPS + Battery/Flywheel – Instantaneous outage protection
  • Power conditioning equipment – Voltage/frequency stabilization

Why: AI workload characteristics

  • Instantaneous power surges (during inference/training startup)
  • High power density (30-100 kW per rack)
  • Power fluctuation sensitivity – Training interruption = days of work lost
  • 24/7 uptime requirements
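For the peak-shaving idea specifically, here is a toy sketch of how an ESS flattens what the grid sees; the load profile and the grid-import cap are made-up numbers for the example.

```python
# Toy peak-shaving illustration: the battery (ESS) discharges whenever facility
# load exceeds an assumed grid-import cap, so the grid sees a flattened profile.

cap_mw = 80.0                                # assumed grid import limit
load_mw = [60, 70, 95, 110, 90, 75, 65]      # hypothetical hourly AI load

for hour, load in enumerate(load_mw):
    battery_mw = max(load - cap_mw, 0.0)     # ESS covers anything above the cap
    grid_mw = load - battery_mw
    print(f"h{hour}: load {load:5.1f} MW -> grid {grid_mw:5.1f} MW, ESS {battery_mw:5.1f} MW")
```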

🔴 3. Server Power (High-Efficiency Direct DC Delivery)

Key Changes:

  • Direct-to-Chip DC power delivery
  • Rack-level battery systems (Lithium/Supercapacitor)
  • High-density power distribution

Why: Maximize efficiency

  • Eliminate AC→DC conversion losses (5-15% efficiency gain; an illustrative comparison follows this list)
  • Direct chip-level power supply – Minimize conversion stages
  • Ultra-high rack density support (100+ kW/rack)
  • Even minor voltage fluctuations are critical – Chip-level stabilization needed
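To see where a figure in the 5-15% range can come from, here is an illustrative chained-efficiency comparison. The per-stage efficiencies are assumptions for the example, not measurements of any particular product.

```python
# Chained-efficiency sketch: conventional multi-stage AC delivery vs a shorter
# HVDC/direct-DC path. Stage efficiencies below are assumed round numbers.

from math import prod

ac_chain = {"UPS (double conversion)": 0.94, "PDU transformer": 0.98,
            "server PSU (AC->DC)": 0.94, "board VRM": 0.90}
dc_chain = {"rectifier to 800 V DC": 0.98, "rack DC-DC": 0.97, "board VRM": 0.90}

ac_eff = prod(ac_chain.values())
dc_eff = prod(dc_chain.values())
print(f"AC-centric path : {ac_eff:.1%} end-to-end")
print(f"HVDC/direct DC  : {dc_eff:.1%} end-to-end")
print(f"difference      : {dc_eff - ac_eff:+.1%}")  # several percentage points
```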

Key Differences Summary

| Category | Traditional DC | AI Data Center |
| --- | --- | --- |
| Power Scale | Few MW | Hundreds of MW |
| Rack Density | 5-10 kW/rack | 30-100+ kW/rack |
| Power Method | AC-centric | HVDC + Direct DC |
| Backup Power | UPS (10-15 min) | Multi-tier (Generator + ESS + UPS) |
| Power Stability | Standard | Extremely high reliability |
| Energy Sources | Single grid | Multiple sources (Nuclear + Renewable) |

Summary

AI data centers require 25-50x more power per server, demanding massive power infrastructure with diversified sources including SMRs and renewables

Extreme workload stability needs drive multi-tier backup systems (ESS+UPS+Generator) and advanced power conditioning with 800V HVDC

Direct-to-chip DC power delivery eliminates conversion losses, achieving 5-15% efficiency gains critical for 100+ kW/rack densities

#AIDataCenter #DataCenterPower #HVDC #DirectDC #EnergyStorageSystem #PeakShaving #SMR #PowerInfrastructure #HighDensityComputing #GPUPower #DataCenterDesign #EnergyEfficiency #UPS #BackupPower #AIInfrastructure #HyperscaleDataCenter #PowerConditioning #DCPower #GreenDataCenter #FutureOfComputing

With Claude

From Stability to Turbulence: Why Smart Operations Matter Most

History alternates between periods of stability and turbulence. In turbulent times, management and operations become critical, because small decisions can determine survival. The same shift is underway in data centers: from static, stability-focused maintenance to agile, data-driven, adaptive operations.

#PhilosophyShift #DataDriven #AdaptiveOps #AIDataCenter #ResilientManagement #StabilityToAgility

“Tightly Fused” in AI DC

This diagram illustrates a “Tightly Fused” AI datacenter architecture showing the interdependencies between system components and their failure points.

System Components

  • LLM SW: Large Language Model Software
  • GPU Server: Computing infrastructure with cooling fans
  • Power: Electrical power supply system
  • Cooling: Thermal management system

Critical Issues

1. Power Constraints

  • Insufficient power supply forces power-limited throttling (power capping) of GPU servers
  • Results in decreased TFLOPS/kW (less compute delivered per unit of power)

2. Cooling Limitations

  • Insufficient cooling causes thermal throttling
  • Increases risk of device errors and failures

3. Cost Escalation

  • Already high baseline costs
  • System bottlenecks drive costs even higher

Core Principle

The bottom equation demonstrates the fundamental relationship: Computing (→ Heat) = Power = Cooling

Essentially all of the power consumed by compute is rejected as heat, so power delivery and cooling capacity must be sized together to sustain full performance.
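A minimal sketch of that balance, with made-up capacities: treat essentially all IT power as heat and report which side of the Computing = Power = Cooling identity gives out first.

```python
# Toy balance check for Computing (-> Heat) = Power = Cooling. Capacities are
# invented for illustration; ~100% of IT power is assumed to become heat.

def bottleneck(it_load_kw: float, power_capacity_kw: float, cooling_capacity_kw: float) -> str:
    heat_kw = it_load_kw                      # compute load rejected as heat
    if it_load_kw > power_capacity_kw:
        return "power-limited throttling"     # supply cannot feed the racks
    if heat_kw > cooling_capacity_kw:
        return "thermal throttling"           # cooling cannot remove the heat
    return "balanced"

print(bottleneck(it_load_kw=900, power_capacity_kw=1000, cooling_capacity_kw=850))   # thermal throttling
print(bottleneck(it_load_kw=900, power_capacity_kw=1000, cooling_capacity_kw=1000))  # balanced
```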

Summary

This diagram highlights how AI datacenters require perfect balance between computing, power, and cooling systems – any bottleneck in one area cascades into performance degradation and cost increases across the entire infrastructure.

#AIDatacenter #MLInfrastructure #GPUComputing #DataCenterDesign #AIInfrastructure #ThermalManagement #PowerEfficiency #ScalableAI #HPC #CloudInfrastructure #AIHardware #SystemArchitecture

With Claude