Co-Work

This image, titled “Co-Work,” illustrates a strategic framework for Event-Centric AIOps. It demonstrates how raw telemetry from physical infrastructure is transformed into structured, actionable intelligence for an AI Agent, fundamentally driven by human expertise.

1. Data Generation and Extraction

  • Device to Metric: Physical infrastructure (Device) generates raw operational data.
  • The Role of Configurations: This data is extracted into quantitative Metric (Number) formats. This extraction is guided by Configurations & Topology, which represents the structural configurations and network topology. This ensures the system understands the physical and logical layout of the devices.

2. Contextualization

  • Metric to Context: Raw numerical data lacks operational meaning on its own. It is transformed into readable Context (text), effectively converting raw telemetry into event logs suitable for LLM-based analysis.
  • The Role of System: This conversion is executed by the System, which acts as the Data Processing Operating System. It defines the rules and logic for how raw numbers are processed, correlated, and translated into meaningful operational states.

3. AI Agent Integration

  • Context to AI Agent: The structured, contextualized text is delivered to the AI Agent for analysis, root cause identification, or predictive tasks.
  • The Role of Manual: The AI Agent’s understanding is heavily enriched by the Manual, which encompasses text-based operating manuals, standard operating procedures (SOPs), and historical troubleshooting data. This provides the AI with established guidelines for how to interpret and react to specific scenarios.

4. The Foundation: Human Intent

The green foundational layer, Human Intent, is the most critical aspect of this architecture. Configurations, System, and Manual are the three core elements and systems that are actively built and managed by humans. They dictate the rules, structural layout, and historical knowledge that guide the AI. This ensures that the AI Agent does not operate in a vacuum, but rather functions safely and effectively within the strict boundaries of human operational intent.

Summary

The “Co-Work” architecture visualizes a collaborative AIOps framework where raw device metrics are systematically transformed into contextualized text. By leveraging three key human-managed components—Configurations (topology), Systems (data processing), and Manuals (historical/procedural text)—the architecture bridges the gap between physical hardware and AI. It ensures the AI Agent receives highly structured, context-rich event data to perform accurate and reliable infrastructure management.

#AIOps #EventCentricAIOps #AIDataCenter #HumanInTheLoop #Telemetry #LLM #ITOperations

Sensors for AI DC Rack

Architecture Walkthrough: High-Density AI Rack Monitoring Topology

This diagram illustrates a comprehensive monitoring framework tailored for next-generation, high-density AI Data Centers. As rack power densities scale upward of 40kW to over 100kW, the integration of high-density power delivery and advanced liquid cooling demands a unified telemetry layer. The architecture symmetrically bifurcates these critical operations into two primary domains: Power Distribution & Electrical Infrastructure (left, in yellow) and Liquid Cooling & Thermal Management (right, in blue).

1. Power Infrastructure Telemetry (Left Domain)

  • Busbar (Top Left): Focuses on tracking surface temperatures at copper/aluminum busway joints using contact or non-contact infrared (IR) sensors. This mitigates the risk of thermal runaway caused by mechanical loosening or joint degradation.
  • Tap-off Box (Middle Left): Monitors the critical junction where power is tapped from the main busway to individual racks. Telemetry captures internal ambient temperatures and circuit breaker contact wear to prevent nuisance tripping under heavy GPU loads.
  • Rack PDU (Bottom Left): Delivers granular power quality (PQ) analytics. Beyond basic billing metrics, it utilizes high-speed sampling to capture transient events—such as voltage sags, swells, and total harmonic distortion (THD)—triggered by sudden LLM training state transitions.

2. Liquid Cooling & Thermal Management (Right Domain)

  • Cold Aisle / Rear (Top Right): Provides 3D micro-climate profiling of the rack enclosure. Using sensor grids (top, middle, bottom), it tracks cold air intake and maps exhaust air behavior to instantaneously flag localized hot spots or individual server fan failures.
  • QD (Quick Disconnect) Valve (Middle Right): Positions high-sensitivity leak detection ropes or optical fluid sensors directly at the fluid mating interfaces of individual GPU server blades. This safeguards expensive IT assets against coolant escape.
  • Manifold / CDU (Bottom Right): Serves as the central hydronic balancing hub. By cross-referencing volumetric flow rate (LPM), differential pressure (Delta P), and differential temperature ($\Delta T$) across supply and return lines, the system continuously calculates the exact real-time heat rejection load in kW.

Executive Summary: The Imperative of High-Fidelity Infrastructure Telemetry

In a modern AI Data Center, the sheer density of accelerated computing clusters renders traditional, coarse facility monitoring completely obsolete. To ensure maximum uptime and operational efficiency, telemetry must undergo a paradigm shift governed by two critical vectors:

1. High Precision & High Resolution

Because GPU workloads scale from idle to maximum power in microseconds, sensors must feature ultra-high sampling rates (millisecond-level resolution for electrical transients) and high precision (milli-degree sensitivity for liquid thermal loops). Coarse, averaged data masks dangerous micro-spikes that degrade hardware components over time. High-resolution telemetry is the baseline requirement for capturing the true, unvarnished physical state of the infrastructure.

2. From Phenomena to Precursors (Omens)

Traditional data center monitoring is reactive—it alerts operators to a phenomenon (e.g., “Rack temperature has exceeded $85^\circ\text{C}$”), which usually means the failure has already occurred.

Conversely, high-fidelity, continuous data allows an AIOps engine to identify precursors or omens—the microscopic anomalies that precede a disaster. For instance:

  • A fractional, steady rise in busbar temperature relative to a static workload implies micro-vibration joint loosening (Thermal Degradation Precursor).
  • A subtle drift in the dielectric constant near a fluid coupling signals a microscopic weep before it transforms into a catastrophic spray (Leak Precursor).
  • A minor, localized spike in differential pressure (Delta P) combined with a micro-drop in flow rate alerts the system to initial strainer clogging before fluid starvation throttles the GPUs.

By capturing these subtle “signs” rather than waiting for the “symptom,” data centers can transition from reactive firefighting to fully automated, self-healing predictive maintenance.

#AIDataCenter #LiquidCooling #DirectToChip #AIOps #InfrastructureTelemetry #HighDensityComputing #PredictiveMaintenance #DataCenterArchitecture #TechnicalVisualization #SmartInfrastructure

With Gemini

Sag & Swell

The image provides a clear, side-by-side comparison of two major power quality issues: Voltage Sag (or Dip) and Voltage Swell. It looks like a great summary graphic prepared for your tech blog at eeumee.net, particularly because it sharply highlights how these electrical phenomena specifically impact AI Data Centers (AI DC).

1. Voltage Sag / Dip

  • Definition: A sudden, momentary decrease in voltage.
  • System Impact: It causes immediate service and system disruption. If the voltage drops too low, servers can suddenly power off or reboot.
  • AI DC Relevance: Noted as “Very high on AI DC.” The risk and frequency are elevated in AI environments.
  • Root Cause: This is primarily driven by sudden load or workload changes. When thousands of GPUs simultaneously spin up for intensive AI training or inference tasks, they draw massive amounts of current in an instant, causing the voltage to dip.

2. Voltage Swell

  • Definition: A sudden, momentary increase in voltage.
  • System Impact: Unlike a sag, a swell might not cause an immediate outage, but it forces overvoltage through the components, leading to equipment stress and degradation.
  • AI DC Relevance: It carries a significant cumulative impact. The hardware damage builds up over time, eventually leading to premature component failure.
  • Root Cause: Typically triggered by power system or control abnormalities, or when a massive electrical load is suddenly dropped from the grid.

💡 Core Insight

This slide captures why power dynamics in AI Data Centers are vastly different from traditional IT environments. The extreme, dynamic power fluctuations inherent to AI workloads make rigorous power quality monitoring (via DCIM) and the implementation of highly responsive, advanced power architectures—such as Battery Energy Storage Systems (BESS)—absolutely critical to maintaining uptime and protecting expensive hardware.

#AIDataCenter #PowerQuality #VoltageSag #VoltageSwell #DataCenterInfrastructure #TechBlog #GPUWorkloads #ServerCooling

With Gemini

AI Data Center Operation Platform Layer

The provided image illustrates the architecture of an AI DataCenter Operation Platform, mapping it out in five distinct stages from the physical foundation layer up to the top-tier artificial intelligence application layer.

The upward-pointing arrows depict the flow of raw data collected from the infrastructure, demonstrating the system’s upward evolution and how the data is ultimately utilized intelligently by AI.

Here is the breakdown of the core roles and components of each layer:

  • Layer 1: Facility & Physical Edge
    • Role: The foundational layer responsible for collecting data and controlling the physical infrastructure equipment of the data center, such as power and cooling systems.
    • Key Elements: High-Frequency Data Sampling, Precision Time Synchronization (Precision NTP/PTP), Standard Interfaces, and Zero-Latency Control & Redundancy. This layer focuses on extracting data and issuing control commands to hardware with extreme speed and accuracy.
  • Layer 2: Network Fabric
    • Role: The neural network of the data center. It reliably and rapidly transmits the massive amounts of collected data to the upper platforms without bottlenecks.
    • Key Elements: Non-blocking Leaf-Spine Architecture, Ultra-High-Speed Telemetry, and Integrated Security & NMS (Network Management System) Monitoring. These elements work together to efficiently handle large-scale traffic.
  • Layer 3: Control & Management (Integrated Control)
    • Role: The layer that integrates and normalizes heterogeneous data streaming in from various facilities and solutions to execute practical operations and management.
    • Key Elements: Operational Solution Convergence, Heterogeneous Data Normalization, Traffic-based Anomaly Detection, and Monitoring-Based Commissioning (MBCx). It acts as a critical gateway to identify infrastructure issues early and improve overall operational efficiency.
  • Layer 4: Analysis Platform
    • Role: The stage where refined data is stored, analyzed, and visualized, allowing administrators to intuitively grasp the system’s status at a glance.
    • Key Elements: Utilizes a High-Performance Time-Series Database (TSDB) to record state changes over time and provides Customized Views/Dashboards for tailored monitoring.
  • Layer 5: Intelligent Expansion
    • Role: The ultimate destination of this platform. It is the highest layer where AI autonomously operates and optimizes the data center, leveraging the well-organized data provided by the lower layers.
    • Key Elements: Generative AI Agent (LLM+RAG), Digital Twin technology, ML-based Automated Power/Cooling Control, and Intelligent Report Generation.

This blueprint clearly demonstrates the overall solution architecture: precisely collecting and transmitting raw data from hardware facilities (Layers 1-2), standardizing, storing, and analyzing that data (Layers 3-4), and ultimately achieving advanced, autonomous operations through intelligent, automatic control of power and cooling systems via a Generative AI Agent (Layer 5).


#AIDataCenter #AIOps #DataCenterManagement #GenerativeAI #DigitalTwin #NetworkFabric #ITInfrastructure #SmartDataCenter #MachineLearning #TechArchitecture

With Gemini

Prerequisites for ML


Architecture Overview: Prerequisites for ML

1. Data Sources: Convergence of IT and OT (Top Layer)

The diagram outlines four core domains essential for machine learning-based control in an AI data center. The top layer illustrates the necessary integration of IT components (AI workloads and GPUs) and Operational Technology (Power/ESS and Cooling systems). It emphasizes that the first prerequisite for an AI data center agent is to aggregate status data from these historically siloed equipment groups into a unified pipeline.

2. Collection Phase: Ultra-High-Speed Telemetry

The subsequent layer focuses on data collection. Because power spikes unique to AI workloads occur in milliseconds, the architecture demands High-Frequency Data Sampling and a Low-Latency Network. Furthermore, Precision Time Synchronization is highlighted as a critical requirement; the timestamps of a sudden GPU load spike must perfectly align with temperature changes in the cooling system for the ML model to establish accurate causal relationships.

3. Processing Phase: Heterogeneous Data Processing

As incoming data points utilize varying communication protocols and polling intervals, the third layer addresses data refinement. It employs a Unified Standard Protocol to convert heterogeneous data, along with Normalization & Ontology mapping so the ML model can comprehend the physical relationships between IT servers and facility cooling units. Additionally, a Message Broker for Spikes Data is included as a buffer to prevent system bottlenecks or data loss during the massive influx of telemetry that occurs at the onset of large-scale distributed training.

4. Execution Phase: High-Performance Control Computing

Following data processing, the execution layer is designed to take direct action on the facility infrastructure. This phase requires Zero-Latency Facility Control computing power to enable immediate physical responses. To meet the zero-downtime demands of data center operations, this layer incorporates a comprehensive SW/HW Redundancy Architecture to guarantee absolute High Availability (HA).

5. Ultimate Goal: Securing Real-Time, High-Fidelity Data

The foundational layers culminate in the ultimate goal shown at the bottom: Securing Real-Time, High-Fidelity Data. This emphasizes that predictive control algorithms cannot function effectively with noisy or delayed inputs. A robust data infrastructure is the definitive prerequisite for enabling proactive pre-cooling and ESS optimization.


📝 Summary

  1. A successful ML-driven data center operation requires a robust, high-speed data foundation prior to deploying predictive algorithms.
  2. Bridging the gap between IT (GPUs) and OT (Power/Cooling) through synchronized, high-frequency telemetry forms the core of this architecture.
  3. Securing real-time, high-fidelity data enables the crucial transition from delayed reactive responses to proactive predictive cooling and energy optimization.

#AIDataCenter #MachineLearning #ITOTConvergence #DataPipeline #PredictiveControl #Telemetry

New Risk @ AI DC

Overview: New Risks at AI Data Centers

The image outlines the infrastructure challenges faced by modern AI Data Centers (AI DC), specifically focusing on the high demands placed on hardware like GPUs. It divides these challenges into two primary categories: Power Risk and Cooling Risk.

The central graphic illustrates that the core AI processing units (Brains/GPUs) are entirely dependent on these two foundational elements.


⚡ Power Risk

This section highlights issues related to power supply and infrastructure (such as Power Diversification, ESS, and 800V HVDC).

  • Power Supply Shortage (GPU Power Throttling): When the facility cannot provide enough power, GPUs slow down to compensate.
    • Impacts: Delays in AI workloads, financial losses due to lost data checkpoints, and the collapse of synchronization across the entire computing cluster.
  • Rapid Power Fluctuations: Sudden spikes or drops in the power supply.
    • Impacts: Voltage sag, electrical resonance in external grids, and reduced lifespan or physical damage to backup power systems like generators and UPS (Uninterruptible Power Supplies).
  • Power Quality Degradation: When the provided electricity is “noisy” or unstable.
    • Impacts: Malfunctions in protective electrical relays, overheating of server Power Supply Units (PSUs), and unexplained network communication errors.

❄️ Cooling Risk

This section focuses on the challenges of managing the massive heat generated by AI workloads, specifically looking at Liquid Cooling and changes in Cooling Distribution Unit (CDU) environments.

  • Cooling Supply Shortage (GPU Thermal Throttling): When the cooling system cannot remove heat fast enough, GPUs slow down to prevent melting.
    • Impacts: Delays in AI workloads, reduced lifespan and increased defects in GPUs, and long-term damage to surrounding server equipment.
  • Leakage Occurrence: Physical leaks in the liquid cooling system.
    • Impacts: Immediate equipment burnout (short circuits), risk of electrical arc flashes and fires, and cascading system shutdowns due to a loss of pressure in the cooling loop.
  • Cooling Water Quality Deterioration: When the liquid used for cooling becomes contaminated or degrades.
    • Impacts: Formation of localized “hot-spots” where cooling fails, a sharp decline in overall cooling efficiency, and mechanical wear and tear on the CDU pumps.

📝 Summary

  1. AI Data Centers face critical new infrastructure risks divided into two main categories: supplying massive amounts of power and managing extreme heat.
  2. Power-related risks (shortages, fluctuations, and poor quality) lead to severe workload delays, cluster synchronization failures, and damage to backup generators.
  3. Cooling-related risks (insufficient cooling, leaks, and poor water quality) cause thermal throttling, severe hardware damage, and potentially catastrophic fires.

#AIDataCenter #DataCenterInfrastructure #GPUPower #LiquidCooling #DataCenterRisk #ThermalThrottling #TechInfrastructure

With Gemini

Data Center Changes

The Evolution of Data Centers

This infographic, titled “Data Center Changes,” visually explains how data center requirements are skyrocketing due to the shift from traditional computing to AI-driven workloads.

The chart compares three stages of data centers across two main metrics: Rack Density (how much power a single server rack consumes, shown on the vertical axis) and the overall Total Power Capacity (represented by the size and labels of the circles).

  • Traditional DC (Data Center): In the past, data centers ran at a very low rack density of around 2kW. The total power capacity required for a facility was relatively small, at around 10 MW.
  • Cloud-native DC: As cloud computing took over, the demands increased. Rack densities jumped to about 10kW, and the overall facility size grew to require around 100 MW of power.
  • AI DC: This is where we see a massive leap. Driven by heavy GPU workloads, AI data centers push rack densities beyond 100kW+. The scale of these facilities is enormous, demanding up to 1GW of power. The red starburst shape also highlights a new challenge: “Ultra-high Volatility,” meaning the power draw isn’t stable; it spikes violently depending on what the AI is processing.

The Three Core Challenges (Bottom Panels)

The bottom three panels summarize the key takeaways of transitioning to AI Data Centers:

  1. Scale (Massive Investment): Building a 1GW “Campus-scale” AI data center requires astronomical capital expenditure (CAPEX). To put this into perspective, the chart notes that just 10MW costs roughly 200 billion KRW (South Korean Won). Scaling that to 1GW is a colossal financial undertaking.
  2. Density (The Need for Liquid Cooling): Power density per rack is jumping from 2kW to 100kW—a 50x increase. Traditional air-conditioning cannot cool servers running this hot, meaning the industry must transition to advanced liquid cooling technologies.
  3. Volatility (Unpredictable Demands): Unlike traditional servers that run at a steady hum, AI GPU workloads change in real-time. A sudden surge in computing tasks instantly spikes both the electricity needed to run the GPUs and the cooling power needed to keep them from melting.

Summary

  • Data centers are undergoing a massive transformation from Traditional (10MW) and Cloud (100MW) models to gigantic AI Data Centers requiring up to 1 Gigawatt (1GW) of power.
  • Because AI servers use powerful GPUs, power density per rack is increasing 50-fold (up to 100kW+), forcing a shift from traditional air cooling to advanced liquid cooling.
  • This AI infrastructure requires staggering financial investments (CAPEX) and must be designed to handle extreme, real-time volatility in both power and cooling demands.

#DataCenter #AIDataCenter #LiquidCooling #GPU #CloudComputing #TechTrends #TechInfrastructure #CAPEX

With Gemini