Cooling Works & Metrics

Data Center Cooling System Overview

Cooling System Operation Flow

  1. Cooling Tower: Rejects heat from the cooling (condenser) water to the outside air, so that cooler water returns to the chiller.
  2. Chiller: Removes heat from the chilled-water loop to produce chilled water and transfers that heat to the cooling water. The condenser plays a crucial role in this process.
  3. Air Handling Unit: Uses chilled water to cool air, producing cool supply air for the server room.
  4. Server Room: The cooled air is ultimately supplied to the server room to remove heat from IT equipment.

Key Control and Conversion Equipment

  • Pump: Regulates the pressure and speed of cooling and chilled water to maintain appropriate flow rates throughout the system.
  • Header: Handles the distribution and collection of cooling and chilled water, ensuring uniform distribution across the system.
  • Heat Exchanger/Condenser: Performs heat exchange processes at various stages, with the condenser playing a particularly important role in the chiller.
  • Fan: Circulates cooling air to the server room.

Core Measurement Metrics

  • Temperature: Monitors the temperature of cooling water, chilled water, and air at each stage to evaluate system efficiency.
  • Water Flow Rate: Measures the amount of cooling and chilled water circulating in the system to ensure adequate cooling capacity.
  • Supply/Return Temperature Differential: Measures the temperature difference before and after passing through each component to assess heat exchange efficiency.
  • Power Usage: Monitors the power consumption of pumps, chillers, fans, and other equipment to manage energy efficiency.

These metrics are monitored in detail for each pump and condenser so that the overall performance of the cooling system can be optimized and energy efficiency improved.
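
The flow rate and supply/return temperature differential listed above combine into an estimate of how much heat a water loop carries. The following is a minimal sketch, assuming plain water (about 4.186 kJ/kg·K and about 1 kg/L). The function names and sample values are illustrative and do not come from the diagram.

```python
# Approximate properties of liquid water (assumptions for this sketch).
WATER_SPECIFIC_HEAT_KJ_PER_KG_K = 4.186
WATER_DENSITY_KG_PER_L = 1.0


def heat_removed_kw(flow_lpm: float, supply_temp_c: float, return_temp_c: float) -> float:
    """Estimate heat carried by a water loop: Q = m_dot * c_p * delta_T."""
    mass_flow_kg_s = flow_lpm * WATER_DENSITY_KG_PER_L / 60.0
    delta_t = return_temp_c - supply_temp_c
    return mass_flow_kg_s * WATER_SPECIFIC_HEAT_KJ_PER_KG_K * delta_t


def cooling_per_unit_power(heat_kw: float, power_kw: float) -> float:
    """Ratio of heat removed to electrical power used (higher is better)."""
    if power_kw <= 0:
        raise ValueError("power_kw must be positive")
    return heat_kw / power_kw


if __name__ == "__main__":
    q = heat_removed_kw(flow_lpm=600, supply_temp_c=7.0, return_temp_c=12.0)
    print(f"Heat removed: {q:.1f} kW")
    print(f"Heat removed per kW of power: {cooling_per_unit_power(q, 50.0):.2f}")
```

With these sample values, 600 LPM and a 5 °C differential correspond to roughly 209 kW of heat. Dividing by the measured power usage gives a simple efficiency ratio that can be tracked per component.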

With Claude

AI in the data center

This diagram titled “AI in the Data Center” illustrates two key transformational elements that occur when AI technology is integrated into data centers:

1. Computing Infrastructure Changes

  • AI workloads powered by GPUs become central to operations
  • Transition from traditional server infrastructure to GPU-centric computing architecture
  • Fundamental changes in data center hardware configuration and network connectivity

2. Management Infrastructure Changes

  • Increased requirements for power (“More Power!!”) and cooling (“More Cooling!!”) to support GPU infrastructure
  • Implementation of data-driven management systems utilizing AI technology
  • AI-based analytics and management for maintaining stability and improving efficiency

These two changes are interconnected. The diagram shows that AI technology not only transforms the computing capabilities of data centers but also requires new management approaches to operate these systems effectively.

with Claude

Operation with system

Key Analysis of Operation Cost Diagram

This diagram illustrates the cost structure of system implementation and operation, highlighting the following key concepts:

  1. High Initial Deployment Cost: At the beginning of a system’s lifecycle, deployment costs are substantial. This represents a one-time investment but requires significant capital.
  2. Perpetual Nature of Operation Costs: Operation costs continue indefinitely as long as the system exists, making them a permanent expense factor.
  3. Components of Operation Cost: Operation costs consist of several key elements:
    • Energy Cost
    • Labor Cost
    • Disability Cost (the diagram's label; presumably the cost of failures or downtime)
    • Additional miscellaneous costs (+@)
  4. Role of Automation Systems: As shown on the right side of the diagram, implementing automation systems can significantly reduce operation costs over time.
  5. Timing of Automation Investment: While automation systems also require initial investment during the early phases, they deliver long-term operation cost reduction benefits, ultimately improving the overall cost structure.

This diagram effectively visualizes the relationship between initial costs and long-term operational expenses, as well as the cost optimization strategy through automation.
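
The trade-off between a one-time automation investment and recurring operation costs can be sketched with simple arithmetic. The example below is a minimal illustration with made-up numbers and hypothetical function names. The diagram gives no figures, and the linear cost model is an assumption.

```python
from typing import Optional


def cumulative_cost(initial_cost: float, yearly_operation_cost: float, years: float) -> float:
    """Total cost after `years`, assuming a constant yearly operation cost."""
    return initial_cost + yearly_operation_cost * years


def break_even_years(
    automation_cost: float, yearly_cost_before: float, yearly_cost_after: float
) -> Optional[float]:
    """Years until automation pays for itself, or None if it never does."""
    yearly_saving = yearly_cost_before - yearly_cost_after
    if yearly_saving <= 0:
        return None
    return automation_cost / yearly_saving


if __name__ == "__main__":
    # Hypothetical numbers in arbitrary currency units.
    deployment, automation = 1000.0, 300.0
    manual_yearly, automated_yearly = 200.0, 140.0

    print("Break-even (years):", break_even_years(automation, manual_yearly, automated_yearly))
    print("10-year cost, manual:   ", cumulative_cost(deployment, manual_yearly, 10))
    print("10-year cost, automated:", cumulative_cost(deployment + automation, automated_yearly, 10))
```

In this hypothetical case the automation investment is recovered after 5 years, and the 10-year total drops from 3000 to 2700.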

With Claude

Server Room Cooling Metrics

This dashboard monitors the overall performance of server room cooling by displaying temperature changes alongside server power consumption, together with water flow rate (Water LPM) and fan speed. Its main uses include:

  1. Integrated Data Visualization:
    • Enables simultaneous monitoring of temperature, power consumption, and cooling system parameters (flow rate, fan speed) in a single dashboard, facilitating the identification of correlations between systems.
    • Allows operators to immediately observe how increases in power consumption lead to temperature rises and the subsequent response of cooling systems.
  2. Benefits of Heat Map Implementation:
    • Represents data from multiple temperature sensors categorized as MAX/MIN/AVG with color differentiation, providing intuitive understanding of spatial temperature distribution.
    • Creates clear visual contrast between yellow (HOTZONE) and blue (COOLZONE) areas, making temperature gradients easily recognizable.
    • Enables quick identification of temperature anomalies for early detection of potential issues.
  3. Cooling Efficiency Monitoring:
    • Facilitates analysis of the relationship between Water LPM (water flow rate) and temperature changes to evaluate cooling water usage efficiency.
    • Allows assessment of air circulation system effectiveness by examining correlations between fan speed and COOLZONE/HOTZONE temperature changes.
    • Enables real-time monitoring of heat exchange efficiency through the difference between RETURN TEMP and SUPPLY TEMP.
  4. Event Detection and Analysis:
    • Features an “EVENT(Big Change?)” indicator that helps quickly identify significant changes or anomalies.
    • Displays data from the past 30 minutes in 5-minute intervals, enabling analysis of short-term trends and patterns.
  5. Operational Decision Support:
    • Provides immediate feedback on the effects of cooling system adjustments (changes in flow rate or fan speed) on temperature, enabling optimization of operational parameters.
    • Helps evaluate the response capability of cooling systems during increased server loads, supporting capacity planning.
    • Offers necessary data to balance energy efficiency with server stability.

This dashboard goes beyond a simple monitoring tool to serve as a comprehensive decision support system for optimizing thermal management in server rooms, improving energy efficiency, and ensuring equipment stability. The heat map visualization approach, in particular, makes complex temperature data intuitively interpretable, allowing operators to quickly assess situations and respond appropriately.
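
The MAX/MIN/AVG summary and the “EVENT(Big Change?)” indicator could be computed along the following lines. This is only a sketch under assumptions: the dashboard's actual detection rule is not described, so a fixed threshold on the change between consecutive 5-minute samples is used here, and all names and values are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def summarize(readings: List[float]) -> Dict[str, float]:
    """MAX/MIN/AVG over the temperature sensors of one zone."""
    return {"max": max(readings), "min": min(readings), "avg": mean(readings)}


def big_changes(samples: List[float], threshold: float) -> List[int]:
    """Indices where a sample differs from the previous one by at least `threshold`."""
    return [
        i
        for i in range(1, len(samples))
        if abs(samples[i] - samples[i - 1]) >= threshold
    ]


if __name__ == "__main__":
    # Seven hypothetical HOTZONE averages: 30 minutes at 5-minute intervals.
    hot_zone_avg = [30.0, 30.2, 30.1, 32.5, 32.8, 32.7, 32.6]
    print(summarize([24.0, 26.0, 31.0]))
    print("Event at sample index:", big_changes(hot_zone_avg, threshold=2.0))
```

A threshold on consecutive differences is the simplest possible rule. A production dashboard might instead compare against a rolling baseline.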

With Claude

Cooling(CRAH) Inside

This image shows a diagram of the cooling system structure inside a CRAH (Computer Room Air Handler).

  1. Cooling Process Flow:
    • COLD WATER enters the system
    • Flow is controlled through an OPEN valve (%)
    • Water flows at a specified Flux rate (LPM)
    • Passes through a heat exchanger (coil)
  2. Air Circulation:
    • Return Hot Air from servers enters the system
    • Air is cooled through the heat exchanger
    • Air is circulated by fans (FAN SPEED in RPM)
    • Air volume is controlled by a Damper (Open)
    • Cooled air is supplied to the servers
  3. Key Control Elements:
    • Valve opening percentage (%)
    • Fan speed (RPM)
    • Damper position (Open)

This system illustrates the basic operating principles of a cooling system used in data centers or server rooms to effectively control server heat generation. The main purpose is to maintain appropriate temperatures by continuously removing heat (Load/Heat) generated by the servers.

The diagram efficiently shows the complete cycle from cold water intake to the cooling of hot server air and its recirculation, demonstrating how CRAH systems maintain optimal operating temperatures in data center environments.
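
To illustrate how a control element such as the valve opening (%) might respond to temperature, here is a minimal proportional-control sketch. The diagram does not specify any control algorithm, so the proportional rule, the gain, the setpoint, and the function names are all assumptions made for illustration.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Limit `value` to the range [low, high]."""
    return max(low, min(high, value))


def next_valve_opening(
    current_open_pct: float,
    supply_air_temp_c: float,
    setpoint_c: float,
    gain_pct_per_c: float = 5.0,  # hypothetical tuning value
) -> float:
    """Open the chilled-water valve further when supply air is too warm.

    Simple proportional step: the adjustment is the gain multiplied by the
    temperature error, and the result is kept within 0-100 %.
    """
    error_c = supply_air_temp_c - setpoint_c
    return clamp(current_open_pct + gain_pct_per_c * error_c, 0.0, 100.0)


if __name__ == "__main__":
    # Supply air is 2 degrees C above the setpoint: valve opens from 50 % to 60 %.
    print(next_valve_opening(50.0, supply_air_temp_c=24.0, setpoint_c=22.0))
```

The same pattern could be applied to fan speed (RPM). Real CRAH controllers typically use more elaborate logic such as PID loops with rate limits.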

With Claude

Power Control

Power Control system diagram

  1. Power Source (Left Side)
    • High Power characteristics:
      • Very Dangerous
      • Very Difficult to Control
      • High Cost to Control
  2. Central Control/Distribution System (Center)
    • Distributor: Shares/distributes power
    • Transformer: Steps down power
    • Circuit Breaker: Stops power
    • UPS (Uninterruptible Power Supply): Saves power
    • Power Control (multi-step)
  3. Final Distribution (Right Side)
    • Low Power characteristics:
      • Power for computing
      • Complex Control Required
      • Reduced dangers

The diagram shows the complete process by which high-power electricity is safely and efficiently controlled and converted into low-power electricity suitable for computing systems. The power flow is illustrated through a “Delivery” phase, passing through various protective and control devices before being distributed to multiple servers or computing equipment.

The system emphasizes safety and control through multiple stages:

  • Initial high-power input is marked as dangerous and difficult to control
  • Multiple control mechanisms (transformer, circuit breaker, UPS) manage the power
  • The distributor splits the controlled power to multiple endpoints
  • Final output is appropriate for computing equipment

This setup ensures safe and reliable power distribution while reducing the risks associated with high-power electrical systems.

With Claude

Server Room Flow

With Claude
Comprehensive Analysis of Server Room HVAC System Configuration and Operation

  1. Physical Configuration
    • Multiple cooling units arranged in CRAC (Computer Room Air Conditioning) Zone
    • Three-tier structure: Cool Zone, Server Zone, Hot Zone
    • Upper and lower distribution structure for air circulation
  2. Temperature Monitoring System
    • Supply Temperature (S. Temp): Cooling unit output temperature
    • Cooling Zone Temperature (C. Temp): Pre-server intake temperature
    • Hot Zone Temperature (H. Temp): Server exhaust temperature
    • Return Temperature (R. Temp): CRAC intake temperature
  3. Efficiency Management Indicators
    • AVG. Imbalance monitoring for each section
    • CPU load and power consumption correlation analysis
    • CPU efficiency and heat generation relationship tracking
  4. Analysis Points
    • Delta T analysis between sections
    • Temperature variation patterns by time/season
    • Power efficiency and cooling efficiency correlation
    • System stability prediction indicators
  5. Operational Goals
    • Operating cost optimization
    • Provide stable server operating environment
    • Energy-efficient cooling system operation
    • Proactive problem detection and response
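
The Delta T analysis between sections can be sketched from the four monitored temperatures (S. Temp, C. Temp, H. Temp, R. Temp). The example below is illustrative only: the class, the field names, the choice of differences, and the sample readings are assumptions, not values taken from the diagram.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SectionTemps:
    supply_c: float     # S. Temp: cooling unit output
    cool_zone_c: float  # C. Temp: pre-server intake
    hot_zone_c: float   # H. Temp: server exhaust
    return_c: float     # R. Temp: CRAC intake


def delta_ts(t: SectionTemps) -> Dict[str, float]:
    """Temperature differences between adjacent sections of the air path."""
    return {
        "supply_to_cool_zone": t.cool_zone_c - t.supply_c,  # warming before the servers
        "server_rise": t.hot_zone_c - t.cool_zone_c,        # heat added by the servers
        "hot_zone_to_return": t.hot_zone_c - t.return_c,    # cooling/mixing on the way back
        "crac_delta_t": t.return_c - t.supply_c,            # heat removed by the cooling unit
    }


if __name__ == "__main__":
    readings = SectionTemps(supply_c=18.0, cool_zone_c=20.0, hot_zone_c=32.0, return_c=30.0)
    for name, value in delta_ts(readings).items():
        print(f"{name}: {value:.1f} °C")
```

A growing supply-to-cool-zone difference, for instance, may suggest that warm air is mixing into the cold side before it reaches the servers.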