Cooling with AI works

AI Workload Cooling Systems: Bidirectional Physical-Software Optimization

This image summarizes four recent research studies that illustrate a bidirectional optimization relationship between AI workloads (LLMs) and cooling systems. Together they indicate that physical cooling infrastructure and software workloads are closely interdependent.

🔄 Core Concept of Bidirectional Optimization

Direction 1: Physical Cooling → AI Performance Impact

  • Cooling methods directly affect LLM/VLM throughput and stability

Direction 2: AI Software → Cooling Control

  • LLMs themselves act as intelligent controllers for cooling systems

📊 Research Analysis

1. Physical Cooling Impact on AI Performance (2025 arXiv)

[Cooling HW → AI SW Performance]

  • Experiment: Liquid vs Air cooling comparison on H100 nodes
  • Physical Differences:
    • GPU Temperature: Liquid 41-50°C vs Air 54-72°C (up to 22°C difference)
    • GPU Power Consumption: 148-173W reduction
    • Node Power: ~1kW savings
  • Software Performance Impact:
    • Throughput: 54 vs 46 TFLOPs/GPU (+17% improvement)
    • Sustained and predictable performance through reduced throttling
    • Improved performance/watt (perf/W) ratio

→ Physical cooling improvements directly enhance AI workload real-time processing capabilities

2. AI Controls Cooling Systems (2025 arXiv)

[AI SW → Cooling HW Control]

  • Method: Offline Reinforcement Learning (RL) for automated data center cooling control
  • Results: 14-21% cooling energy reduction in 2000-hour real deployment
  • Bidirectional Effects:
    • AI algorithms optimally control physical cooling equipment (CRAC, pumps, etc.)
    • Saved energy → enables more LLM job execution
    • Secured more power headroom for AI computation expansion

→ AI software intelligently controls physical cooling to improve overall system efficiency
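To make the headroom argument concrete, here is a rough back-of-the-envelope sketch in Python. Only the 14-21% reduction range comes from the study summarized above. The 300 kW baseline cooling load and the 8 kW per-node draw are hypothetical values chosen for illustration, and the sketch assumes the energy reduction translates into an equal reduction in average power.

```python
# Assumption: the baseline cooling load and per-node draw are hypothetical,
# picked only to illustrate the arithmetic.
BASELINE_COOLING_KW = 300.0
NODE_POWER_KW = 8.0


def freed_power_kw(cooling_kw, reduction):
    """Power freed when cooling demand drops by the given fraction."""
    return cooling_kw * reduction


def extra_nodes(freed_kw, node_kw):
    """Whole GPU nodes that fit inside the freed power."""
    return int(freed_kw // node_kw)


# 14% and 21% are the bounds of the reduction reported in the study.
for reduction in (0.14, 0.21):
    freed = freed_power_kw(BASELINE_COOLING_KW, reduction)
    nodes = extra_nodes(freed, NODE_POWER_KW)
    print(f"{reduction:.0%} reduction frees {freed:.0f} kW, room for {nodes} more nodes")
```

With these assumed inputs the freed power is on the order of tens of kilowatts, which is the sense in which cooling savings become compute headroom. Real sites would substitute their own measured loads.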

3. LLM as Cooling Controller (2025 OpenReview)

[AI SW ↔ Cooling HW Interaction]

  • Innovative Approach: Using LLMs as interpretable controllers for liquid cooling systems
  • Simulation Results:
    • Temperature Stability: +10-18% improvement vs RL
    • Energy Efficiency: +12-14% improvement
  • Bidirectional Interaction Significance:
    • LLMs interpret real-time physical sensor data (temperature, flow rate, etc.)
    • Multi-objective trade-off optimization between cooling requirements and energy saving
    • Interpretability: LLM decision-making process is human-understandable
    • Result: Reduced throttling/interruptions → improved AI workload stability

→ A complete closed loop in which AI controls the physical systems and the results feed back into AI performance
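The closed loop described here (read sensors, decide, actuate, observe the effect) can be sketched in a few lines of Python. This is not the controller from the study: the decision function below is a plain proportional rule used as a stand-in for where an RL policy or LLM-based controller would sit, and the thermal model, gains, setpoint, and function names are all assumptions made up for illustration.

```python
def clamp(value, low, high):
    """Limit value to the range [low, high]."""
    return max(low, min(high, value))


def choose_pump_speed(temp_c, setpoint_c=45.0, gain=0.1):
    """Stand-in policy: proportional control on the temperature error.

    An RL policy or an LLM-based controller would replace this function.
    """
    return clamp(0.2 + gain * (temp_c - setpoint_c), 0.2, 1.0)


def step_temperature(temp_c, load, pump_speed, coolant_c=30.0):
    """Toy thermal model (assumed): heating from load, cooling from flow."""
    heating = 10.0 * load
    cooling = 0.3 * pump_speed * (temp_c - coolant_c)
    return temp_c + heating - cooling


def run_loop(initial_temp_c, load, steps):
    """Run the sense -> decide -> actuate cycle for a number of steps."""
    temp_c = initial_temp_c
    history = []
    for _ in range(steps):
        pump_speed = choose_pump_speed(temp_c)               # decide
        temp_c = step_temperature(temp_c, load, pump_speed)  # actuate
        history.append((round(temp_c, 2), round(pump_speed, 2)))
    return history


for temp_c, pump_speed in run_loop(60.0, 0.5, 10):
    print(f"temp={temp_c:.2f} C  pump={pump_speed:.2f}")
```

In this toy setup the temperature falls from its starting point and settles a few degrees above the setpoint, since a proportional-only rule leaves a steady-state offset. The point is the loop structure, not the control quality.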

4. Physical Cooling Innovation Enables AI Training (E-Energy’25 PolyU)

[Cooling HW → AI SW Training Stability]

  • Method: Immersion cooling applied to LLM training
  • Physical Benefits:
    • Dramatically reduced fan/CRAC overhead
    • Lower PUE (Power Usage Effectiveness) achieved
    • Uniform and stable heat removal
  • Impact on AI Training:
    • Enables stable long-duration training (eliminates thermal spikes)
    • Quantitative power-delay trade-off optimization per workload
    • Continuous training environment without interruptions

→ Advanced physical cooling technology secures the feasibility of large-scale LLM training
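For readers unfamiliar with PUE, it is the ratio of total facility power to the power drawn by the IT equipment itself, so a value closer to 1.0 means less overhead. The short Python sketch below shows how a lower cooling overhead moves the ratio. The load figures are hypothetical and are not taken from the paper.

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


# Hypothetical loads, chosen only to illustrate the ratio.
it_kw = 1000.0
air_overhead_kw = 500.0        # assumed fans, CRAC units, and other overhead
immersion_overhead_kw = 100.0  # assumed overhead with immersion cooling

print(f"air-cooled PUE:       {pue(it_kw + air_overhead_kw, it_kw):.2f}")
print(f"immersion-cooled PUE: {pue(it_kw + immersion_overhead_kw, it_kw):.2f}")
```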

Physical-Software Interdependency Map

┌────────────────────────────────────────────────────────────┐
│                  Physical Cooling Systems                  │
│     (Liquid cooling, Immersion, CRAC, Heat exchangers)     │
└──────────────┬────────────────────────┬────────────────────┘
               ↓                        ↑
        Temp↓ Power↓ Stability↑    AI-based Control
               ↓                   RL/LLM Controllers
┌──────────────┴────────────────────────┴────────────────────┐
│                   AI Workloads (LLM/VLM)                   │
│  Performance↑ Throughput↑ Throttling↓ Training Stability↑  │
└────────────────────────────────────────────────────────────┘

💡 Key Insights: Bidirectional Optimization Synergy

1. Bottom-Up Influence (Physical → Software)

  • Better cooling → maintains higher clock speeds/throughput
  • Temperature stability → predictable performance, no training interruptions
  • Power efficiency → enables simultaneous operation of more GPUs

2. Top-Down Influence (Software → Physical)

  • AI algorithms provide real-time optimal control of cooling equipment
  • LLM’s interpretable decision-making ensures operational transparency
  • Adaptive cooling strategies based on workload characteristics

3. Virtuous Cycle Effect

Better cooling → AI performance improvement → smarter cooling control
→ Energy savings → more AI jobs → advanced cooling optimization
→ Sustainable large-scale AI infrastructure

🎯 Practical Implications

These studies demonstrate:

  1. Cooling is no longer passive infrastructure: It’s an active determinant of AI performance
  2. AI optimizes its own environment: Meta-level self-optimizing systems
  3. Hardware-software co-design is essential: Isolated optimization is suboptimal
  4. Simultaneous achievement of sustainability and performance: Synergy, not trade-off

Summary

These four studies suggest that next-generation AI data centers should evolve into integrated ecosystems in which physical cooling and software workloads interact in real time to self-optimize. The bidirectional relationship, in which better cooling enables higher AI performance and AI algorithms intelligently control cooling systems, creates a virtuous cycle that improves performance, energy efficiency, and sustainable scalability together for large-scale AI infrastructure.

#EnergyEfficiency #GreenAI #SustainableAI #DataCenterOptimization #ReinforcementLearning #AIControl #SmartCooling

With Claude

New Era of Digitals

This image presents a diagram titled “New Era of Digitals” that illustrates the evolution of computing paradigms.

Overall Structure:

The diagram shows a progression from left to right, transitioning from being “limited by Humans” to achieving “Everything by Digitals.”

Key Stages:

  1. Human Desire: The process begins with humans’ fundamental need to “wanna know it clearly,” representing our desire for understanding and knowledge.
  2. Rule-Based Era (1000s):
    • Deterministic approach
    • Using Logics and Rules
    • Automation with Specific Rules
    • Record with a human recognizable format
  3. Data-Driven Era:
    • Probabilistic approach (Not 100% But OK)
    • Massive Computing (Energy Resource)
    • Neural network-like structures represented by interconnected nodes

Core Message:

The diagram illustrates how computing has evolved from early systems that relied on human-defined explicit rules and logic to modern data-driven, probabilistic approaches. This represents the shift toward AI and machine learning, where we achieve “Not 100% But OK” results through massive computational resources rather than perfect deterministic rules.

The transition shows how we’ve moved from systems that required everything to be “human recognizable” to systems that can process and understand patterns beyond direct human comprehension, marking the current digital revolution where algorithms and data-driven approaches can handle complexity that exceeds traditional rule-based systems.

With Claude

‘tightly fused’

This illustration visualizes the evolution of data centers, contrasting the traditionally separated components with the modern AI data center where software, compute, network, and crucially, power and cooling systems are ‘tightly fused’ together. It emphasizes how power and advanced cooling are organically intertwined with GPU and memory, directly impacting AI performance and highlighting their inseparable role in meeting the demands of high-performance AI. This tight integration symbolizes a pivotal shift for the modern AI era.

LLM Efficiency with a Cooling

This image demonstrates the critical impact of cooling stability on both LLM performance and energy efficiency in GPU servers through benchmark results.

Cascading Effects of Unstable Cooling

Problems with Unstable Air Cooling:

  • GPU Temperature: 54-72°C (high and unstable)
  • Thermal throttling occurs – where GPUs automatically reduce clock speeds to prevent overheating, leading to significant performance degradation
  • Result: Double penalty of reduced performance + increased power consumption

Energy Efficiency Impact:

  • Power Consumption: 8.16kW (high)
  • Performance: 46 TFLOPS (degraded)
  • Energy Efficiency: 5.6 TFLOPS/kW (poor performance-to-power ratio)

Benefits of Stable Liquid Cooling

Temperature Stability Achievement:

  • GPU Temperature: 41-50°C (low and stable)
  • No thermal throttling → sustained optimal performance

Energy Efficiency Improvement:

  • Power Consumption: 6.99kW (14% reduction)
  • Performance: 54 TFLOPS (17% improvement)
  • Energy Efficiency: 7.7 TFLOPS/kW (38% improvement)
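The percentages above follow directly from the four benchmark figures quoted in this section. As a quick check, the Python sketch below recomputes them. The dictionary layout and function names are illustrative choices; the numbers are the ones given in the text.

```python
# Benchmark figures quoted in the text above.
air = {"power_kw": 8.16, "tflops": 46.0}
liquid = {"power_kw": 6.99, "tflops": 54.0}


def efficiency(run):
    """Return throughput per unit of power in TFLOPS/kW."""
    return run["tflops"] / run["power_kw"]


def pct_change(new, old):
    """Return the relative change from old to new, in percent."""
    return (new - old) / old * 100.0


power_saving = -pct_change(liquid["power_kw"], air["power_kw"])
perf_gain = pct_change(liquid["tflops"], air["tflops"])
eff_gain = pct_change(efficiency(liquid), efficiency(air))

print(f"air:    {efficiency(air):.1f} TFLOPS/kW")
print(f"liquid: {efficiency(liquid):.1f} TFLOPS/kW")
print(f"power saving:     {power_saving:.0f}%")
print(f"performance gain: {perf_gain:.0f}%")
print(f"efficiency gain:  {eff_gain:.0f}%")
```

Computed from the unrounded ratios, the efficiency gain comes out at about 37%. The 38% figure corresponds to dividing the rounded values 7.7 and 5.6, so the two agree to within rounding.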

Core Mechanisms: How Cooling Affects Energy Efficiency

  1. Thermal Throttling Prevention: Stable cooling allows GPUs to maintain peak performance continuously
  2. Power Efficiency Optimization: Eliminates inefficient power consumption caused by overheating
  3. Performance Consistency: Unstable cooling can cause GPUs to use 50% of power budget while delivering only 25% performance

Advanced cooling systems can achieve energy savings ranging from 17% to 23% compared to traditional methods. This benchmark paradoxically shows that proper cooling investment dramatically improves overall energy efficiency.

Final Summary

Unstable cooling triggers thermal throttling that simultaneously degrades LLM performance while increasing power consumption, creating a dual efficiency loss. Stable liquid cooling achieves 17% performance gains and 14% power savings simultaneously, improving energy efficiency by 38%. In AI infrastructure, adequate cooling investment is essential for optimizing both performance and energy efficiency.

With Claude

HOPE OF THE NEXT

Hope to jump

This image visualizes humanity’s endless desire for ‘difference’ as the creative force behind ‘newness.’ The organic human brain fuses with the logical AI circuitry, and from their core, a burst of light emerges. This light symbolizes not just the expansion of knowledge, but the very moment of creation, transforming into unknown worlds and novel concepts.

Data Center Operations

Data center operations are shifting from experience-driven practices toward data-driven and AI-optimized systems.
However, a fundamental challenge persists: the lack of digital credibility.

  • Insufficient data quality: Incomplete monitoring data and unreliable hardware reduce trust.
  • Limited digital expertise of integrators: Many providers focus on traditional design/operations, lacking strong datafication and automation capabilities.
  • Absence of verification frameworks: No standardized process to validate or certify collected data and analytical outputs.

These gaps are amplified by the growing scale and complexity of data centers and the expansion of GPU adoption, making them urgent issues that must be addressed for the next phase of digital operations.