New Risk @ AI DC

Overview: New Risks at AI Data Centers

The image outlines the infrastructure challenges faced by modern AI Data Centers (AI DC), specifically focusing on the high demands placed on hardware like GPUs. It divides these challenges into two primary categories: Power Risk and Cooling Risk.

The central graphic illustrates that the core AI processing units (Brains/GPUs) are entirely dependent on these two foundational elements.


⚡ Power Risk

This section highlights issues related to power supply and infrastructure (such as Power Diversification, ESS (Energy Storage Systems), and 800V HVDC).

  • Power Supply Shortage (GPU Power Throttling): When the facility cannot provide enough power, GPUs slow down to compensate (a detection sketch follows this list).
    • Impacts: Delays in AI workloads, financial losses due to lost data checkpoints, and the collapse of synchronization across the entire computing cluster.
  • Rapid Power Fluctuations: Sudden spikes or drops in the power supply.
    • Impacts: Voltage sag, electrical resonance in external grids, and reduced lifespan or physical damage to backup power systems like generators and UPS (Uninterruptible Power Supplies).
  • Power Quality Degradation: When the provided electricity is “noisy” or unstable.
    • Impacts: Malfunctions in protective electrical relays, overheating of server Power Supply Units (PSUs), and unexplained network communication errors.
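
For teams that want to observe these power-related slowdowns directly from the GPUs, below is a minimal monitoring sketch. It assumes the pynvml package (the Python NVML bindings) is installed on a node with an NVIDIA driver; the device index, sampling interval, and the specific throttle-reason constants are illustrative choices, not a recommended monitoring design.

```python
# Minimal sketch: poll NVML throttle reasons to spot power-related GPU slowdowns.
# Assumes the pynvml package (Python NVML bindings) and an NVIDIA driver are present;
# constant names can vary slightly between pynvml versions.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; loop over all GPUs in practice

POWER_REASONS = (
    pynvml.nvmlClocksThrottleReasonSwPowerCap    # firmware power cap engaged
    | pynvml.nvmlClocksThrottleReasonHwSlowdown  # external HW slowdown (power brake / thermal)
)

try:
    for _ in range(10):  # illustrative: 10 samples, one second apart
        reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        if reasons & POWER_REASONS:
            print(f"power-related throttling active at {power_w:.0f} W (mask=0x{reasons:x})")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```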

❄️ Cooling Risk

This section focuses on the challenges of managing the massive heat generated by AI workloads, specifically looking at Liquid Cooling and changes in Cooling Distribution Unit (CDU) environments.

  • Cooling Supply Shortage (GPU Thermal Throttling): When the cooling system cannot remove heat fast enough, GPUs slow down to prevent melting (a quick flow-sizing example follows this list).
    • Impacts: Delays in AI workloads, reduced lifespan and increased defects in GPUs, and long-term damage to surrounding server equipment.
  • Leakage Occurrence: Physical leaks in the liquid cooling system.
    • Impacts: Immediate equipment burnout (short circuits), risk of electrical arc flashes and fires, and cascading system shutdowns due to a loss of pressure in the cooling loop.
  • Cooling Water Quality Deterioration: When the liquid used for cooling becomes contaminated or degrades.
    • Impacts: Formation of localized “hot-spots” where cooling fails, a sharp decline in overall cooling efficiency, and mechanical wear and tear on the CDU pumps.
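
To make the "cooling supply shortage" failure mode concrete, here is a back-of-the-envelope sizing sketch based on the standard heat-balance relation Q = ṁ·cp·ΔT for a water-based loop. The rack heat load and supply/return ΔT are assumed values for illustration only.

```python
# Back-of-the-envelope check: coolant flow needed to remove a given rack heat load.
# Uses the heat balance Q = m_dot * c_p * dT with water properties; all inputs are
# illustrative assumptions, not measurements.
RACK_HEAT_KW = 100.0   # assumed rack heat load to remove (kW)
DELTA_T_K = 10.0       # assumed supply/return temperature rise (K)
CP_WATER = 4.186       # specific heat of water, kJ/(kg*K)
RHO_WATER = 1.0        # density of water, ~1 kg per liter

mass_flow_kg_s = RACK_HEAT_KW / (CP_WATER * DELTA_T_K)   # kW = kJ/s
flow_lpm = mass_flow_kg_s / RHO_WATER * 60.0             # kg/s -> L/min

print(f"~{flow_lpm:.0f} LPM to hold a {DELTA_T_K:.0f} K rise at {RACK_HEAT_KW:.0f} kW")
# ~143 LPM: if the CDU cannot sustain this flow (or the water quality degrades),
# heat builds up and the GPUs fall back to thermal throttling.
```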

📝 Summary

  1. AI Data Centers face critical new infrastructure risks divided into two main categories: supplying massive amounts of power and managing extreme heat.
  2. Power-related risks (shortages, fluctuations, and poor quality) lead to severe workload delays, cluster synchronization failures, and damage to backup generators.
  3. Cooling-related risks (insufficient cooling, leaks, and poor water quality) cause thermal throttling, severe hardware damage, and potentially catastrophic fires.

#AIDataCenter #DataCenterInfrastructure #GPUPower #LiquidCooling #DataCenterRisk #ThermalThrottling #TechInfrastructure

With Gemini

Data Center Changes

The Evolution of Data Centers

This infographic, titled “Data Center Changes,” visually explains how data center requirements are skyrocketing due to the shift from traditional computing to AI-driven workloads.

The chart compares three stages of data centers across two main metrics: Rack Density (how much power a single server rack consumes, shown on the vertical axis) and the overall Total Power Capacity (represented by the size and labels of the circles).

  • Traditional DC (Data Center): In the past, data centers ran at a very low rack density of around 2kW. The total power capacity required for a facility was relatively small, at around 10 MW.
  • Cloud-native DC: As cloud computing took over, the demands increased. Rack densities jumped to about 10kW, and the overall facility size grew to require around 100 MW of power.
  • AI DC: This is where we see a massive leap. Driven by heavy GPU workloads, AI data centers push rack densities beyond 100kW+. The scale of these facilities is enormous, demanding up to 1GW of power. The red starburst shape also highlights a new challenge: “Ultra-high Volatility,” meaning the power draw isn’t stable; it spikes violently depending on what the AI is processing.

The Three Core Challenges (Bottom Panels)

The bottom three panels summarize the key takeaways of transitioning to AI Data Centers:

  1. Scale (Massive Investment): Building a 1GW “Campus-scale” AI data center requires astronomical capital expenditure (CAPEX). To put this into perspective, the chart notes that just 10MW costs roughly 200 billion KRW (South Korean Won). Scaling that to 1GW is a colossal financial undertaking (a rough calculation follows this list).
  2. Density (The Need for Liquid Cooling): Power density per rack is jumping from 2kW to 100kW—a 50x increase. Traditional air-conditioning cannot cool servers running this hot, meaning the industry must transition to advanced liquid cooling technologies.
  3. Volatility (Unpredictable Demands): Unlike traditional servers that run at a steady hum, AI GPU workloads change in real-time. A sudden surge in computing tasks instantly spikes both the electricity needed to run the GPUs and the cooling power needed to keep them from melting.
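
To put the "Scale" figure in context, the snippet below simply extrapolates the chart's 10MW ≈ 200 billion KRW reference linearly to a 1GW campus and restates the density jump. Linear cost scaling is an assumption made purely for illustration; real campus-scale builds do not scale that cleanly.

```python
# Rough restatement of the chart's figures; linear cost extrapolation is an assumption.
BASE_CAPACITY_MW = 10         # reference block from the chart
BASE_COST_BILLION_KRW = 200   # ~200 billion KRW per 10 MW (chart figure)
TARGET_CAPACITY_MW = 1000     # a 1 GW campus

scale = TARGET_CAPACITY_MW / BASE_CAPACITY_MW
capex_billion_krw = BASE_COST_BILLION_KRW * scale
print(f"{scale:.0f}x the 10 MW block -> ~{capex_billion_krw:,.0f} billion KRW "
      f"(~{capex_billion_krw / 1000:.0f} trillion KRW)")

# Rack density jump quoted in the chart: 2 kW -> 100 kW per rack
print(f"rack density multiple: {100 / 2:.0f}x")
```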

Summary

  • Data centers are undergoing a massive transformation from Traditional (10MW) and Cloud (100MW) models to gigantic AI Data Centers requiring up to 1 Gigawatt (1GW) of power.
  • Because AI servers use powerful GPUs, power density per rack is increasing 50-fold (up to 100kW+), forcing a shift from traditional air cooling to advanced liquid cooling.
  • This AI infrastructure requires staggering financial investments (CAPEX) and must be designed to handle extreme, real-time volatility in both power and cooling demands.

#DataCenter #AIDataCenter #LiquidCooling #GPU #CloudComputing #TechTrends #TechInfrastructure #CAPEX

With Gemini

Air Cooling for 30kW/Rack

Why Air Cooling Fails at 30kW+

  • Noise & Vibration: Achieving 6,000 CMH (cubic meters per hour) of airflow generates 90-100 dB of noise and vibration that can damage hardware.
  • Space Loss: Massive cooling fans displace GPUs/CPUs, drastically reducing compute density.
  • Power Waste: Fan power grows with roughly the cube of airflow (V^3, the fan affinity law), causing a significant spike in PUE (Power Usage Effectiveness); see the sketch after this list.
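
The cubic term in the last bullet is the fan affinity law: to a first approximation, fan power scales with the cube of airflow (or fan speed). The sketch below shows how quickly that blows up as airflow climbs toward the 6,000 CMH figure; the baseline airflow and fan power are assumed values, not vendor data.

```python
# Fan affinity law sketch: fan power scales roughly with the cube of airflow.
# The baseline airflow and fan power are assumed values, not vendor data.
BASE_AIRFLOW_CMH = 2000.0   # assumed airflow for a lower-density rack (m^3/h)
BASE_FAN_POWER_W = 300.0    # assumed fan power at that airflow (W)

def fan_power_w(airflow_cmh: float) -> float:
    """First-approximation fan power via the cube law: P ~ (flow ratio)^3."""
    return BASE_FAN_POWER_W * (airflow_cmh / BASE_AIRFLOW_CMH) ** 3

for cmh in (2000, 4000, 6000):
    print(f"{cmh} CMH -> ~{fan_power_w(cmh):,.0f} W of fan power")
# Tripling airflow (2,000 -> 6,000 CMH) costs roughly 27x the fan power,
# which is where the PUE penalty comes from.
```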

Conclusion: At 30kW/Rack, air cooling hits a physical and economic “wall”. Transitioning to Liquid Cooling is mandatory for next-generation AI Data Centers.


#AIDataCenter #LiquidCooling #ThermalManagement #30kWRack #DataCenterEfficiency #PUE #HighDensityComputing #GPUCooling

Legacy vs AI DC

Legacy DC vs. AI Factory

1. Legacy Data Center

  • Static Load: The flat line on the graph indicates that power and compute demands are stable, continuous, and highly predictable.
  • Air Cooling: Traditional fan-based air cooling systems are sufficient to manage the heat generated by standard, lower-density server racks.
  • Minutes-Level Work: System responses, resource provisioning, and facility adjustments generally occur on a scale of minutes.
  • IT & OT Silo Ops: Information Technology (servers, networking) and Operational Technology (power, cooling facilities) are managed independently in isolated silos, with no real-time data exchange.

2. AI Factory (DC)

  • Dynamic/High-Density: The volatile, jagged graph illustrates how AI workloads create extreme, rapid power spikes and demand highly dense computing resources.
  • Liquid Cooling: The immense heat output from high-performance AI chips necessitates advanced liquid cooling solutions (represented by the water drop and circulation arrows) to maintain thermal efficiency.
  • Seconds-Level Work: The physical infrastructure must be highly agile, detecting and responding to sudden dynamic workload changes and thermal shifts within seconds.
  • Workload Aware: The facility dynamically adapts its cooling and power based on real-time AI computing needs. Establishing this requires robust “IT/OT Data Convergence” and the utilization of “High-Fidelity Data” as key components of a broader “Digitalization” strategy.
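
As one concrete illustration of what "IT/OT Data Convergence" could look like in practice, the sketch below merges an IT-side sample (GPU utilization, server power) and an OT-side sample (coolant supply temperature, flow) into a single time-aligned record that a workload-aware controller could consume. All field names, sources, and values are hypothetical.

```python
# Minimal sketch of IT/OT data convergence: one time-aligned record per interval that
# merges IT telemetry (GPUs, servers) with OT telemetry (cooling plant).
# All field names, sources, and values are hypothetical placeholders.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ConvergedSample:
    ts: datetime          # shared timestamp for both domains
    gpu_util_pct: float   # IT: cluster-average GPU utilization
    it_power_kw: float    # IT: aggregate server power draw
    supply_temp_c: float  # OT: coolant supply temperature
    flow_lpm: float       # OT: coolant flow rate

def converge(it_sample: dict, ot_sample: dict) -> ConvergedSample:
    """Join one IT sample and one OT sample taken over the same interval."""
    return ConvergedSample(
        ts=datetime.now(timezone.utc),
        gpu_util_pct=it_sample["gpu_util_pct"],
        it_power_kw=it_sample["power_kw"],
        supply_temp_c=ot_sample["supply_temp_c"],
        flow_lpm=ot_sample["flow_lpm"],
    )

print(converge({"gpu_util_pct": 92.0, "power_kw": 1150.0},
               {"supply_temp_c": 24.5, "flow_lpm": 1400.0}))
```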

Summary

  1. Legacy data centers are designed for predictable, static loads using traditional air cooling, with IT and facility operations (OT) isolated from one another.
  2. AI Factories must handle highly volatile, high-density workloads, making liquid cooling and instantaneous, seconds-level infrastructure responses mandatory.
  3. Transitioning to a true “Workload Aware” facility requires a strong “Digitalization” strategy centered around “IT/OT Data Convergence” and “High-Fidelity Data.”

#AIFactory #DataCenter #LiquidCooling #WorkloadAware #ITOTConvergence #HighFidelityData #Digitalization #AIInfrastructure

With Gemini

AI DC: CAPEX to OPEX (2)


AI DC: The Chain Reaction from CAPEX to OPEX Risk

The image walks through, step by step, how the massive initial capital expenditure (CAPEX) of an AI Data Center (AI DC) translates into complex operational risks and rising operating expenses (OPEX).

1. HUGE CAPEX (Massive Initial Investment)

  • Context: Building an AI data center requires enormous capital expenditure (CAPEX) due to high-cost GPU servers, high-density racks, and specialized networking infrastructure.
  • Flow: However, the challenge does not end with high initial costs. Driven by the following three factors, this massive infrastructure investment inevitably cascades into severe operational risks.

2. LLM WORKLOAD (The Root Cause)

  • Characteristics: Unlike traditional IT workloads, AI (especially LLM) workloads are highly volatile and unpredictable.
  • Key Factors:
    • The continuous, heavy load of Training (steady 24/7) mixed with the bursty, erratic nature of Inference.
    • Demand-driven spikes and low predictability, which lead to poor scheduling determinism and system-wide rhythm disruption.

3. POWER SPIKES (Electrical Infrastructure Stress)

  • Characteristics: The extreme volatility of LLM workloads causes sudden, extreme fluctuations in server power consumption.
  • Key Factors:
    • Rapid power transients (ΔP) and high ramp rates (dP/dt) create sudden power spikes and idle drops.
    • These fluctuations cause significant grid stress, accelerate the aging of power distribution equipment (UPS/PDU stress & derating), degrade overall system reliability, and create major capacity planning uncertainty.
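
A simple way to quantify the ΔP and dP/dt stress described above is to compute ramp rates over a sampled power trace and flag anything beyond an equipment-specific limit, as in the sketch below. The trace and the 50 kW/s limit are made-up illustrative values, not real derating guidance.

```python
# Sketch: compute power ramp rate (dP/dt) from a sampled trace and flag spikes.
# The trace and the 50 kW/s limit are illustrative made-up values, not real data.
power_kw = [800, 820, 1450, 1500, 600, 640, 1480]  # one-second samples, kW
SAMPLE_PERIOD_S = 1.0
RAMP_LIMIT_KW_PER_S = 50.0  # hypothetical limit standing in for UPS/PDU derating guidance

for i in range(1, len(power_kw)):
    ramp = (power_kw[i] - power_kw[i - 1]) / SAMPLE_PERIOD_S  # dP/dt
    if abs(ramp) > RAMP_LIMIT_KW_PER_S:
        print(f"t={i}s: dP/dt = {ramp:+.0f} kW/s exceeds the {RAMP_LIMIT_KW_PER_S:.0f} kW/s limit")
```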

4. COOLING STRESS (Thermal System Stress)

  • Characteristics: Sudden surges in power consumption immediately translate into rapid temperature increases (Thermal transients, ΔT).
  • Key Factors:
    • Cooling lag / control latency: There is an inevitable delay between the sudden heat generation and the cooling system’s physical response.
    • Physical limits: Traditional air cooling hits its limits, forcing transitions to Liquid cooling (DLC/CDU) or Immersion cooling. Failure to manage this latency increases the risk of thermal runaway, triggers system throttling (performance degradation), and negatively impacts SLAs/SLOs.
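
The cost of cooling lag can be seen with a toy first-order model: heat rises instantly when the workload steps up, while the cooling loop only ramps toward the new load with a time constant, so rack temperature overshoots until the two meet. Every constant in the sketch below is an illustrative assumption.

```python
# Toy first-order model of cooling lag: heat load steps up instantly, cooling ramps
# toward it with time constant tau, and the mismatch accumulates as a temperature rise.
# Every constant here is an illustrative assumption.
import math

TAU_S = 120.0                  # assumed cooling response time constant (s)
THERMAL_MASS_KJ_PER_K = 500.0  # assumed lumped thermal mass of the rack (kJ/K)
STEP_HEAT_KW = 40.0            # sudden extra heat from a workload burst (kW)
STEP_S = 60.0                  # integration step (s)

temp_rise_k = 0.0
for minute in range(11):  # simulate 10 minutes
    t = minute * STEP_S
    cooling_kw = STEP_HEAT_KW * (1 - math.exp(-t / TAU_S))  # lagging cooling response
    print(f"t={t:5.0f}s  cooling={cooling_kw:5.1f} kW  transient temp rise={temp_rise_k:4.1f} K")
    # integrate the mismatch between heat generated and heat removed over the next minute
    temp_rise_k += (STEP_HEAT_KW - cooling_kw) * STEP_S / THERMAL_MASS_KJ_PER_K
# The transient rise before cooling catches up is what pushes GPUs toward throttling.
```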

5. OPEX RISK (The Final Operational Consequence)

  • Context: The combination of unpredictable LLM workloads, power infrastructure stress, and cooling system limitations culminates in severe OPEX Risk.
  • Conclusion: Ultimately, this chain reaction exponentially increases daily operational costs and uncertainties—ranging from accelerated equipment replacement costs and higher power bills (due to degraded PUE) to massive expenses related to frequent incident responses and infrastructure instability.

Summary:

The slide delivers a powerful message: While the physical construction of an AI data center is highly expensive (CAPEX), the true danger lies in the unique volatility of AI workloads. This volatility triggers extreme power (ΔP) and thermal (ΔT) spikes. If these physical transients are not strictly managed, the operational costs and risks (OPEX) will spiral completely out of control.

#AIDataCenter #AIDC #CAPEX #OPEX #LLMWorkload #PowerSpikes #CoolingStress #LiquidCooling #ThermalManagement #DataCenterInfrastructure #GPUInfrastructure #OPEXRisk

With Gemini

AI-Driven Proactive Cooling Architecture

The provided image illustrates an AI-Driven Proactive Cooling Architecture, detailing a sophisticated pipeline that transforms operational data into precise thermal management.


1. The Proactive Data Hierarchy

The architecture categorizes data sources along a spectrum, moving from “More Proactive” (predicting future heat) to “Reactive” (measuring existing heat).

  • LLM Job Schedule (Most Proactive): This layer looks at the job queue, node thermal headroom, and resource availability. It allows the system to prepare for heat before the first calculation even begins.
  • LLM Workload: Monitors real-time GPU utilization (%) and token throughput to understand the intensity of the current processing task.
  • GPU / HBM: Captures direct hardware telemetry, including GPU power draw (Watts) and High Bandwidth Memory (HBM) temperatures.
  • Server Internal Temperature: Measures the junction temperature, fan/pump speeds, and the ΔT (temperature difference) between server inlet and outlet.
  • Floor & Rack Temperature (Reactive): The traditional monitoring layer that identifies hot spots and rack density (kW) once heat has already entered the environment.

2. The Analysis and Response Loop

The bottom section of the diagram shows how this multi-layered data is converted into action:

  • Gathering Data: Telemetry from all five layers is aggregated into a central repository.
  • Analysis with ML: A Machine Learning engine processes this data to predict thermal trends. It doesn’t just look at where the temperature is now, but where it will be in the next few minutes based on the workload.
  • Cooling Response: The ML insights trigger physical adjustments in the cooling infrastructure, specifically controlling the ΔT (Supply/Return) and Flow Rate (LPM – Liters Per Minute) of the coolant.
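
A minimal version of this predict-then-act loop is sketched below. A linear extrapolation of recent rack power stands in for the ML model, and the predicted heat load is converted into a coolant flow setpoint for a chosen supply/return ΔT; the history, horizon, and thermal constants are all illustrative assumptions.

```python
# Minimal predict-then-act sketch: a linear extrapolation of recent rack power stands in
# for the ML model, and the predicted heat load is turned into a coolant flow setpoint.
# All constants and samples are illustrative assumptions.
TARGET_DELTA_T_K = 8.0  # desired supply/return temperature split
CP_WATER = 4.186        # kJ/(kg*K), water coolant assumed
HORIZON_S = 180         # predict three minutes ahead

def predicted_power_kw(history_kw: list[float], sample_period_s: float) -> float:
    """Project the trend of the last two samples HORIZON_S into the future."""
    slope = (history_kw[-1] - history_kw[-2]) / sample_period_s
    return history_kw[-1] + slope * HORIZON_S

def flow_setpoint_lpm(heat_kw: float) -> float:
    """Flow (L/min) needed to absorb heat_kw at the target delta-T."""
    return heat_kw / (CP_WATER * TARGET_DELTA_T_K) * 60.0

history = [620.0, 700.0]  # rack power samples taken 60 s apart (kW)
p_hat = predicted_power_kw(history, 60.0)
print(f"predicted load: {p_hat:.0f} kW -> flow setpoint ~{flow_setpoint_lpm(p_hat):.0f} LPM")
```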

3. Technical Significance

By shifting the control logic “left” (toward the LLM Job Schedule), data centers can eliminate the thermal lag inherent in traditional systems. This is particularly critical for AI infrastructure, where GPU power consumption can spike almost instantaneously, often faster than traditional mechanical cooling systems can ramp up.


Summary

  1. This architecture shifts cooling from a reactive sensor-based model to a proactive workload-aware model using AI/ML.
  2. It integrates data across the entire stack, from high-level LLM job queues down to chip-level GPU power draw and rack temperatures.
  3. The ML engine predicts thermal demand to dynamically adjust coolant flow rates and supply temperatures, significantly improving energy efficiency and hardware longevity.

#AICooling #DataCenterInfrastructure #ProactiveCooling #GPUManagement #LiquidCooling #LLMOps #ThermalManagement #EnergyEfficiency #SmartDC

With Gemini

AI Cost


Strategic Analysis of the AI Cost Chart

1. Hardware (IT Assets): “The Investment Core”

  • Icon: A chip embedded in a complex network web.
  • Key Message: The absolute dominant force, consuming ~70% of the total budget.
  • Details:
    • Compute (The Lead): Features GPU clusters (H100/B200, NVL72). These are not just servers; they represent “High Value Density.”
    • Network (The Hidden Lead): No longer just cabling. The cost of Interconnects (InfiniBand/RoCEv2) and Optics (800G/1.6T) has surged to 15~20%, acting as the critical nervous system of the cluster.

2. Power (Energy): “The Capacity War”

  • Icon: An electric grid secured by a heavy lock (representing capacity security).
  • Key Message: A “Ratio Illusion.” While the percentage (~20%) seems stable due to the skyrocketing hardware costs, the absolute electricity bill has exploded.
  • Details:
    • Load Characteristic: The IT Load (Chip power) dwarfs the cooling load.
    • Strategy: The battle is not just about Efficiency (PUE), but about Availability (Grid Capacity) and Tariff Negotiation.

3. Facility & Cooling: “The Insurance Policy”

  • Icon: A vault holding gold bars (Asset Protection).
  • Key Message: Accounting for ~10% of CapEx, this is not an area for cost-cutting, but for “Premium Insurance.”
  • Details:
    • Paradigm Shift: The facility exists to protect the multi-million dollar “Silicon Assets.”
    • Technology: Zero-Failure is the goal. High-density technologies like DLC (Direct Liquid Cooling) and Immersion Cooling are mandatory to prevent thermal throttling.

4. Fault Cost (Operational Efficiency): “The Invisible Loss”

  • Icon: A broken pipe leaking coins (burning money).
  • Key Message: A “Hidden Cost” that determines the actual success or failure of the business.
  • Details:
    • Metric: The core KPI is MFU (Model FLOPs Utilization).
    • Impact: Any bottleneck (network stall, storage wait) results in “Stranded Capacity.” If utilization drops to 50%, you are effectively engaging in a “Silent Burn” of 50% of your massive CapEx investment.
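
The "Silent Burn" point can be put into rough numbers: MFU is achieved model FLOPs divided by the cluster's peak FLOPs over the same window, and to a first approximation the share of CapEx that produces no output scales with 1 - MFU. Every figure in the sketch below is an assumed value for illustration.

```python
# Illustrative MFU and "stranded CapEx" arithmetic; every figure is an assumed value.
PEAK_FLOPS_PER_GPU = 1.0e15  # assumed peak throughput per accelerator (FLOP/s)
NUM_GPUS = 1024
ACHIEVED_FLOPS = 5.1e17      # assumed sustained model FLOP/s across the cluster
CAPEX_MUSD = 60.0            # assumed cluster CapEx, millions of USD

mfu = ACHIEVED_FLOPS / (PEAK_FLOPS_PER_GPU * NUM_GPUS)
stranded_musd = CAPEX_MUSD * (1 - mfu)  # first-order view of idle investment
print(f"MFU = {mfu:.0%}; ~${stranded_musd:.0f}M of the ${CAPEX_MUSD:.0f}M CapEx "
      f"is effectively not producing model FLOPs")
```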

💡 Architect’s Note

This chart perfectly illustrates “Why we need an AI DC Operating System.”

“Pillars 1, 2, and 3 (Hardware, Power, Facility) represent the massive capital burned during CONSTRUCTION.

Pillar 4 (Fault Cost) is the battleground for OPERATION.”

Your Operating System is the solution designed to plug the leak in Pillar 4, ensuring that the astronomical investments in Pillars 1, 2, and 3 translate into actual computational value.


Summary

The AI Data Center is a “High-Value Density Asset” where Hardware dominates CapEx (~70%), Power dominates OpEx dynamics, and Facility acts as Insurance. However, the Operating System (OS) is the critical differentiator that prevents Fault Cost—the silent killer of ROI—by maximizing MFU.

#AIDataCenter #AIInfrastructure #GPUUnitEconomics #MFU #FaultCost #DataCenterOS #LiquidCooling #CapExStrategy #TechArchitecture