Cooling Changes

The provided image illustrates the evolution of data center cooling methods and the corresponding rise in risk, namely the drastic shrinking of the available thermal buffer, across three stages.

Here is a breakdown of each cooling method shown:

1. Air Cooling

  • Method: The most traditional approach, providing room-level cooling with uncontained airflow.
  • Characteristics: The air volume of the server room itself absorbs heat, so the floor space provides an ample “Thermal Buffer.” If the cooling system fails, temperatures take a long time to reach critical levels.

2. Hot/Cold Aisle Containment

  • Method: Physically separates the cold intake air from the hot exhaust air to prevent them from mixing.
  • Characteristics: Focuses on Airflow Optimization. It significantly improves cooling efficiency by directing and controlling the airflow within enclosed spaces.

3. Direct Liquid Cooling (DLC)

  • Method: A high-density, chip-level cooling approach that brings liquid coolant directly to the primary heat-generating components (like CPUs or GPUs).
  • Characteristics: While cooling efficiency is maximized, there is Zero Thermal Buffer: no thermal margin remains from surrounding air or room volume.

💡 Core Implication (The Red Warning Box)

The ultimate takeaway of this slide is highlighted in the bottom right corner.

In a DLC environment, a loss of cooling triggers thermal runaway within 30 seconds. This speed fundamentally exceeds human response limits. It is no longer feasible for a facility manager to hear an alarm, diagnose the issue, and manually intervene before catastrophic failure occurs in modern, high-density servers.
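The 30-second figure can be sanity-checked with a lumped-capacitance back-of-envelope model. The sketch below is purely illustrative: the function and every mass, limit, and wattage are my own assumptions, chosen only to match the slide's order of magnitude.

```python
def seconds_to_critical(power_w, heat_capacity_j_per_k, t_start_c, t_critical_c):
    """Time to climb from t_start to t_critical after cooling is lost,
    assuming all dissipated power heats the local thermal buffer (worst case)."""
    return heat_capacity_j_per_k * (t_critical_c - t_start_c) / power_w

# Air-cooled room: air plus rack/structure mass (~2 MJ/K assumed) buffers 100 kW.
room_air = seconds_to_critical(100_000, 2_000_000, 25, 40)

# DLC cold plate: only a few hundred grams of copper and trapped coolant
# (~500 J/K assumed) buffer a 1 kW accelerator.
cold_plate = seconds_to_critical(1_000, 500, 45, 105)

print(f"air-cooled room : ~{room_air:.0f} s to critical")
print(f"DLC cold plate  : ~{cold_plate:.0f} s to critical")
```

With these assumed numbers the room buys minutes while the cold plate buys tens of seconds, which is the gap the slide is warning about.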


Summary

  • Evolution of Efficiency: Data center cooling is shifting from broad, room-level air cooling to highly efficient, chip-level Direct Liquid Cooling (DLC).
  • Loss of Thermal Buffer: This transition completely eliminates the physical thermal margin, meaning there is zero room for error if the cooling system fails.
  • Automation is Mandatory: Because DLC cooling loss causes thermal runaway in under 30 seconds—faster than humans can react—AI-driven, automated operational agents are now essential to protect infrastructure.

#DataCenter #DataCenterCooling #DirectLiquidCooling #ThermalRunaway #AIOps #InfrastructureManagement

With Gemini

LLMs Depend on the Computing-Power-Cooling Chain

LLM’s Computing-Power-Cooling Relationship

This diagram illustrates the technical architecture and potential issues that can occur when operating LLMs (Large Language Models).

Normal Operation (Top Left)

  1. Computing Demand – The LLM workload is delivered to the processor
  2. Power Demand – Power is supplied and regulated via DVFS (Dynamic Voltage and Frequency Scaling)
  3. Heat Generated – Heat is produced as a by-product of computation
  4. Cooling Demand – Temperature is managed by an adequately sized cooling system
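The four-step chain can be sketched as a toy steady-state model. The idle and peak wattages below are assumptions for illustration, not values from the diagram:

```python
def power_draw_w(utilization, idle_w=100.0, max_w=700.0):
    """DVFS-style approximation: power scales with utilization between
    an idle floor and the accelerator's TDP (both assumed here)."""
    return idle_w + utilization * (max_w - idle_w)

def required_cooling_w(power_w):
    """Essentially all electrical power ends up as heat to be removed."""
    return power_w

for util in (0.2, 0.8, 1.0):
    p = power_draw_w(util)
    print(f"util={util:.0%}  power={p:.0f} W  cooling={required_cooling_w(p):.0f} W")
```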

Problem Scenarios

Power Issue (Top Right)

  • Symptom: Insufficient power (kW & Quality)
  • Results:
    • Computing performance degradation
    • Power throttling or errors
    • LLM workload errors

Cooling Issue (Bottom Right)

  • Symptom: Insufficient cooling (Temperature & Density)
  • Results:
    • Abnormal heat generation
    • Thermal throttling or errors
    • Computing performance degradation
    • LLM workload errors
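Both failure paths end in throttling before outright errors. A minimal sketch of that decision logic follows; the thresholds and the linear thermal back-off are my own assumptions, not any vendor's actual behavior:

```python
def throttled_freq_ghz(base_ghz, power_w, power_cap_w, temp_c, temp_limit_c):
    """Clock scaling when either the power budget or the thermal limit is hit."""
    scale = 1.0
    if power_w > power_cap_w:                 # power throttling path
        scale = min(scale, power_cap_w / power_w)
    if temp_c > temp_limit_c:                 # thermal throttling path
        scale = min(scale, max(0.2, 1 - 0.02 * (temp_c - temp_limit_c)))
    return base_ghz * scale

print(throttled_freq_ghz(2.0, 650, 700, 70, 90))   # healthy: full 2.0 GHz
print(throttled_freq_ghz(2.0, 800, 700, 70, 90))   # power-limited
print(throttled_freq_ghz(2.0, 650, 700, 100, 90))  # thermally limited
```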

Key Message

For stable LLM operations, the three elements of Computing-Power-Cooling must be balanced. If any one element is insufficient, it leads to system-wide performance degradation or errors. This emphasizes that AI infrastructure design must consider not only computing power but also adequate power supply and cooling systems together.


Summary

  • LLM operation requires a critical balance between computing, power supply, and cooling infrastructure.
  • Insufficient power causes power throttling, while inadequate cooling leads to thermal throttling, both resulting in workload errors.
  • Successful AI infrastructure design must holistically address all three components rather than focusing solely on computational capacity.

#LLM #AIInfrastructure #DataCenter #ThermalManagement #PowerManagement #AIOperations #MachineLearning #HPC #DataCenterCooling #AIHardware #ComputeOptimization #MLOps #TechInfrastructure #AIatScale #GreenAI

With Claude

Cooling for AI (heavy heater)

AI Data Center Cooling System Architecture Analysis

This diagram illustrates the evolution of data center cooling systems designed for high-heat AI workloads.

Traditional Cooling System (Top Section)

Three-Stage Cooling Process:

  1. Cooling Tower – Uses ambient air to cool water
  2. Chiller – Chills the water further via a mechanical refrigeration cycle
  3. CRAH (Computer Room Air Handler) – Distributes cold air to the server room

A Free Cooling option is also shown; it reduces chiller operation by leveraging low outdoor temperatures for energy savings.

New Approach for AI DC: Liquid Cooling System (Bottom Section)

To address extreme heat generation from high-density AI chips, a CDU (Coolant Distribution Unit) based liquid cooling system has been introduced.

Key Components:

① Coolant Circulation and Distribution

  • Direct coolant circulation system to servers

② Heat Exchange (Two Methods)

  • Direct-to-Chip (D2C) Liquid Cooling: Cold plate with manifold distribution system directly contacting chips
  • Rear-Door Heat Exchanger (RDHx): An air-to-liquid heat exchanger mounted on the rack’s rear door

③ Pumping and Flow Control

  • Pumps and flow control for coolant circulation

④ Filtration and Coolant Quality Management

  • Maintains coolant quality and removes contaminants

⑤ Monitoring and Control

  • Real-time monitoring and cooling performance control
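Component ⑤ can be pictured as a simple feedback loop. The proportional controller below is a hypothetical sketch with assumed setpoints, not a real CDU interface:

```python
def next_pump_speed(speed_pct, supply_temp_c, setpoint_c=32.0, gain=5.0):
    """Simple proportional control: run the pump faster when the coolant
    supply is above setpoint, slower when below, clamped to 20-100%."""
    error = supply_temp_c - setpoint_c
    return max(20.0, min(100.0, speed_pct + gain * error))

speed = 50.0
for temp in (33.0, 35.0, 31.0):            # simulated sensor readings
    speed = next_pump_speed(speed, temp)
    print(f"supply={temp:.1f} C -> pump={speed:.1f} %")
```

Real CDUs also monitor flow, pressure, and leak sensors, but the closed-loop structure is the same.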

Critical Differences

Traditional Method: Air cooling → Indirect, suitable for low-density workloads

AI DC Method: Liquid cooling → Direct, high-efficiency, capable of handling high TDP (Thermal Design Power) of AI chips

Liquid has approximately 25x better heat transfer efficiency than air, making it effective for cooling AI accelerators (GPUs, TPUs) that generate hundreds of watts to kilowatt-level heat.
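That advantage can be checked with the basic heat-balance relation Q = m_dot * cp * dT. The sketch below uses textbook fluid properties and an assumed 10 K coolant temperature rise:

```python
def mass_flow_kg_s(heat_w, cp_j_per_kg_k, delta_t_k):
    # Q = m_dot * cp * dT  ->  m_dot = Q / (cp * dT)
    return heat_w / (cp_j_per_kg_k * delta_t_k)

HEAT_W, DELTA_T = 1_000.0, 10.0                   # 1 kW chip, 10 K rise (assumed)
water = mass_flow_kg_s(HEAT_W, 4186.0, DELTA_T)   # cp of water ~4186 J/(kg*K)
air   = mass_flow_kg_s(HEAT_W, 1005.0, DELTA_T)   # cp of air   ~1005 J/(kg*K)

# Volumetric flow shows the real gap (density: water ~998, air ~1.2 kg/m^3):
print(f"water: {water:.4f} kg/s ~= {water / 998 * 60_000:.1f} L/min")
print(f"air  : {air:.4f} kg/s ~= {air / 1.2 * 3600:.0f} m^3/h")
```

A trickle of water moves the same heat as hundreds of cubic meters of air per hour, which is why kW-class accelerators push designs toward liquid.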


Summary:

  1. Traditional data centers use air-based cooling (Cooling Tower → Chiller → CRAH), suitable for standard workloads.
  2. AI data centers require liquid cooling with CDU systems due to extreme heat from high-density AI chips.
  3. Liquid cooling offers direct-to-chip heat removal with 25x better thermal efficiency than air, supporting kW-level heat dissipation.

#AIDataCenter #LiquidCooling #DataCenterInfrastructure #CDU #ThermalManagement #DirectToChip #AIInfrastructure #GreenDataCenter #HeatDissipation #HyperscaleComputing #AIWorkload #DataCenterCooling #ImmersionCooling #EnergyEfficiency #NextGenDataCenter

With Claude

DC Cooling (R)

Data Center Cooling System Core Structure

This diagram illustrates an integrated data center cooling system centered on chilled water/cooling water circulation and heat exchange.

Core Cooling Circulation Structure

Primary Loop: Cooling Water Loop

Cooling Tower → Cooling Water → Chiller → (Heat Exchange) → Cooling Tower

  • Cooling Tower: Dissipates heat from cooling water to atmosphere using outdoor air
  • Pump/Header: Controls cooling water pressure and flow rate through circulation pipes
  • Heat Exchange in Chiller: Cooling water exchanges heat with refrigerant to cool the refrigerant

Secondary Loop: Chilled Water Loop

Chiller → Chilled Water → CRAH → (Heat Exchange) → Chiller

  • Chiller: Generates chilled water (7-12°C) through compressor and refrigerant cycle
  • Pump/Header: Circulates chilled water to CRAH units and returns it back
  • Heat Exchange in CRAH: Chilled water exchanges heat with air to cool the air

Tertiary Loop: Cooling Air Loop

CRAH → Cooling Air → Servers → Hot Air → CRAH

  • CRAH (Computer Room Air Handler): Generates cooling air through water-to-air heat exchanger
  • FAN: Forces circulation of cooling air throughout server room
  • Heat Absorption: Air absorbs server heat and returns to CRAH

Heat Exchange Critical Points

Heat Exchange #1: Inside Chiller

  • Cooling Water ↔ Refrigerant: Transfers refrigerant heat to cooling water in condenser
  • Refrigerant ↔ Chilled Water: Refrigerant absorbs heat from the chilled water in the evaporator

Heat Exchange #2: CRAH

  • Chilled Water ↔ Air: Transfers air heat to chilled water in water-to-air heat exchanger
  • Chilled water temperature rises → Returns to chiller

Heat Exchange #3: Server Room

  • Hot Air ↔ Servers: Air absorbs heat from servers
  • Temperature-increased air → Returns to CRAH

Energy Efficiency: Free Cooling

Low-Temperature Outdoor Air → Air-to-Water Heat Exchanger → Chilled Water Cooling → Reduced Chiller Load

  • Condition: When outdoor temperature is sufficiently low
  • Effect: Reduces chiller operation and compressor power consumption (typically 30–50%)
  • Method: Utilizes natural cooling through cooling tower or dedicated heat exchanger
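The "sufficiently low outdoor temperature" condition can be sketched as a simple economizer policy. The chilled-water setpoint and approach temperature below are assumed values for illustration:

```python
def free_cooling_fraction(outdoor_c, chw_supply_c=10.0, approach_k=3.0):
    """Fraction of the cooling load outdoor air can take over:
    full free cooling well below the chilled-water setpoint, none well
    above it, linear partial free cooling in between (assumed policy)."""
    if outdoor_c <= chw_supply_c - approach_k:      # e.g. <= 7 C: full
        return 1.0
    if outdoor_c >= chw_supply_c + approach_k:      # e.g. >= 13 C: none
        return 0.0
    return (chw_supply_c + approach_k - outdoor_c) / (2 * approach_k)

for t in (2, 10, 20):
    print(f"outdoor {t:>2} C -> free cooling covers {free_cooling_fraction(t):.0%} of load")
```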

Cooling System Control Elements

Cooling Basic Operations Components:

  • Cool Down: Controls water/air temperature reduction
  • Water Circulation: Adjusts flow rate through pump speed/pressure control
  • Heat Exchanges: Optimizes heat exchanger efficiency
  • Plumbing: Manages circulation paths and pressure loss

Heat Flow Summary

Server Heat → Air → CRAH (Heat Exchange) → Chilled Water → Chiller (Heat Exchange) → 
Cooling Water → Cooling Tower → Atmospheric Discharge
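Since the same heat crosses every loop in the cascade, each loop's required flow follows from Q = m_dot * cp * dT. A worked example with an assumed 1 MW IT load and typical temperature rises:

```python
CP_WATER, CP_AIR = 4186.0, 1005.0           # specific heats, J/(kg*K)
it_load_w = 1_000_000.0                     # assumed 1 MW of server heat

air_flow = it_load_w / (CP_AIR * 12.0)      # server air loop, ~12 K rise
chw_flow = it_load_w / (CP_WATER * 5.0)     # chilled water loop, ~5 K rise
cw_flow  = it_load_w * 1.2 / (CP_WATER * 5.0)  # cooling water also rejects
                                               # ~20% compressor work (assumed)

print(f"air          : {air_flow:.0f} kg/s")
print(f"chilled water: {chw_flow:.1f} kg/s")
print(f"cooling water: {cw_flow:.1f} kg/s")
```

Note the cooling-water loop carries slightly more heat than the IT load because the chiller's compressor work is rejected there too.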


Summary

This system efficiently removes server heat to the outdoor atmosphere through three cascading circulation loops (air → chilled water → cooling water) and three strategic heat exchange points (CRAH, Chiller, Cooling Tower). Free cooling optimization reduces energy consumption by up to 50% when outdoor conditions permit. The integrated pump/header network ensures precise flow control across all loops for maximum cooling efficiency.


#DataCenterCooling #ChilledWater #CRAH #FreeCooling #HeatExchange #CoolingTower #ThermalManagement #DataCenterInfrastructure #EnergyEfficiency #HVACSystem #CoolingLoop #WaterCirculation #ServerCooling #DataCenterDesign #GreenDataCenter

With Claude