Next AI

Posted on 2025-11-22 by lechuck park

This illustration contrasts an old approach of endlessly adding more GPU servers, burning money for little gain, with a new era where AI-driven optimization of software, network, cooling and power delivers smarter GPUs and a much better ROI.

CDU Metrics & Control

Posted on 2025-09-25 by lechuck park

This diagram shows the overall structure of a CDU (Coolant Distribution Unit) metrics and control system. The system is organized as follows:

System Structure

Upper Section: CDU Structure

First Loop: CPU with Coolant Distribution Unit
Second Main Loop: Row Manifold and Rack Manifold configuration
Process Chill Water Supply/Return: Process chilled water circulation system

Lower Section: Data Collection & Control Devices

Control Devices:
- Pump (Pump RPM, Rate of max speed)
- Valve (Valve Open %)
Sensor Configuration:
- Temperature & Pressure Sensors on manifolds
Supply System:
- Rack Water Supply/Return

Main Control Methods

1. Fixed Pressure Control (Fixed Pressure Drop)

Primary Method: Maintaining fixed pressure drop between rack supply-return
Alternatives: Fixed flow rate, fixed supply temperature, fixed return temperature, fixed speed control
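
As a rough illustration of the fixed pressure drop method, the sketch below adjusts pump speed with a simple PI loop so that the measured rack supply-return pressure drop tracks a setpoint. The class name, gains, speed limits and units are all assumptions made for the example; they do not come from the diagram or from any particular CDU product.

```python
from dataclasses import dataclass


@dataclass
class PumpPI:
    """Incremental PI loop for pump speed (all defaults are assumed values)."""
    setpoint_kpa: float        # target rack supply-return pressure drop
    speed_pct: float = 50.0    # current pump speed, % of max speed
    kp: float = 0.5            # proportional gain, % speed per kPa
    ki: float = 0.1            # integral gain, % speed per kPa-second
    min_pct: float = 20.0      # lowest allowed pump speed
    max_pct: float = 100.0     # highest allowed pump speed
    _prev_error: float = 0.0

    def update(self, supply_kpa: float, return_kpa: float, dt_s: float) -> float:
        """Return the new pump speed command as a percentage of max speed."""
        # Positive error means the pressure drop is too low, so speed up.
        error = self.setpoint_kpa - (supply_kpa - return_kpa)
        step = self.kp * (error - self._prev_error) + self.ki * error * dt_s
        # Clamping the stored speed keeps the loop from winding up at a limit.
        self.speed_pct = max(self.min_pct, min(self.max_pct, self.speed_pct + step))
        self._prev_error = error
        return self.speed_pct


pump = PumpPI(setpoint_kpa=100.0)
low_drop_cmd = pump.update(supply_kpa=290.0, return_kpa=200.0, dt_s=1.0)
high_drop_cmd = pump.update(supply_kpa=320.0, return_kpa=200.0, dt_s=1.0)
```

The incremental form is used here because it only ever stores the clamped speed, which avoids a separate anti-windup step. A real controller would also need sensor filtering and fault handling.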

2. Approach Temperature Control

Primary Method: Maintaining constant approach temperature
Alternatives: Fixed open, fixed secondary supply temperature control
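
The approach temperature method can be sketched in the same spirit. This example assumes the common definition of approach temperature as secondary supply temperature minus primary (process chilled water) supply temperature, and it assumes the valve sits on the primary side, so opening it further lowers the approach. The function names and the gain are hypothetical.

```python
def approach_temperature(secondary_supply_c: float, primary_supply_c: float) -> float:
    """Approach temperature in degrees C (assumed definition, see lead-in)."""
    return secondary_supply_c - primary_supply_c


def next_valve_open_pct(
    current_pct: float,
    approach_c: float,
    target_approach_c: float,
    gain_pct_per_c: float = 5.0,  # assumed proportional gain
) -> float:
    """Open the valve when the approach is above target, close it when below."""
    error = approach_c - target_approach_c
    return max(0.0, min(100.0, current_pct + gain_pct_per_c * error))


approach = approach_temperature(secondary_supply_c=32.0, primary_supply_c=28.0)
valve_cmd = next_valve_open_pct(current_pct=50.0, approach_c=6.0, target_approach_c=4.0)
```

A proportional step is the simplest possible choice; it is shown only to make the direction of the control action concrete.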

Summary

This CDU system provides precise cooling control for data centers through dual management of pressure and temperature. The system integrates sensor feedback from manifolds with pump and valve control to maintain optimal cooling conditions across server racks.

#CDU #CoolantDistribution #DataCenterCooling #TemperatureControl #PressureControl #ThermalManagement

with Claude

Multi-DCs Operation with a LLM(3)

Posted on 2025-09-12 by lechuck park

This diagram presents the 3 Core Expansion Strategies for an event message-based LLM data center operations system.

System Architecture Overview

Basic Structure:

Collects event messages from various event protocols (Log, Syslog, Trap, etc.)
3-stage processing pipeline: Collector → Integrator → Analyst
Final stage performs intelligent analysis using LLM and AI
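
The Collector → Integrator → Analyst flow above could look roughly like the sketch below. The `source|device|message` line format, the `Event` type and all function names are assumptions for illustration. The Analyst step only builds the prompt text, since the diagram does not say which LLM or API is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str   # protocol the event came from, e.g. "syslog" or "trap"
    device: str
    message: str


def collect(raw_lines):
    """Collector: parse 'source|device|message' lines, skipping malformed ones."""
    events = []
    for line in raw_lines:
        parts = line.strip().split("|", 2)
        if len(parts) == 3:
            events.append(Event(*parts))
    return events


def integrate(events):
    """Integrator: drop exact duplicates and group the rest by device."""
    grouped = {}
    for event in dict.fromkeys(events):  # keeps first occurrence, in order
        grouped.setdefault(event.device, []).append(event)
    return grouped


def build_prompt(grouped):
    """Analyst: build the text that would be handed to an LLM for analysis."""
    lines = ["Summarize likely root causes for these data center events:"]
    for device, events in grouped.items():
        lines.append(f"[{device}] {len(events)} event(s)")
        lines.extend(f"  - ({e.source}) {e.message}" for e in events)
    return "\n".join(lines)


raw = [
    "syslog|cdu-01|pump speed low",
    "syslog|cdu-01|pump speed low",
    "trap|pdu-07|breaker open",
    "not an event line",
]
prompt = build_prompt(integrate(collect(raw)))
```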

3 Core Expansion Strategies

1️⃣ Data Expansion (Data Add On)

Integration of additional data sources beyond Event Messages:

Metrics: Performance indicators and metric data
Manuals: Operational manuals and documentation
Configures: System settings and configuration information
Maintenance: Maintenance history and procedural data

2️⃣ System Extension

Infrastructure scalability and flexibility enhancement:

Scale Up/Out: Vertical/horizontal scaling for increased processing capacity
To Cloud: Cloud environment expansion and hybrid operations

3️⃣ LLM Model Enhancement (More Better Model)

Evolution toward DC Operations Specialized LLM:

Prompt Up: Data center operations-specialized prompt engineering
Nice & Self LLM Model: In-house construction and tuning of an LLM specialized for DC operations

Strategic Significance

These 3 expansion strategies present a roadmap for evolving from a simple event log analysis system to an Intelligent Autonomous Operations Data Center. In particular, by developing an in-house LLM specialized for DC operations, the goal is to build an AI system with domain expert-level capabilities tailored to data center operations, rather than relying on generic AI tools.

With Claude

Temperature Prediction in DC (II) – The start and The Target

Posted on 2025-07-29 by lechuck park

This image illustrates the purpose and outcomes of temperature prediction approaches in data centers, showing how each method serves different operational needs.

Purpose and Results Framework

CFD Approach – Validation and Design Purpose

Input:

Setup Data: Physical infrastructure definitions (100% RULES-based)
Pre-defined spatial, material, and boundary conditions

Process: Physics-based simulation through computational fluid dynamics

Results:

What-if (One Case) Simulation: Theoretical scenario testing
Checking a Limitation: Validates whether proposed configurations are “OK or not”
Used for design validation and capacity planning

ML Approach – Operational Monitoring Purpose

Input:

Relation (Extended) Data: Real-time operational data starting from workload metrics
Continuous data streams: Power, CPU, Temperature, LPM/RPM

Process: Data-driven pattern learning and prediction

Results:

Operating Data: Real-time operational insights
Anomaly Detection: Identifies unusual patterns or potential issues
Used for real-time monitoring and predictive maintenance

Key Distinction in Purpose

CFD: “Can we do this?” – Validates design feasibility and limits before implementation

Answers hypothetical scenarios
Provides go/no-go decisions for infrastructure changes
Design-time tool

ML: “What’s happening now?” – Monitors current operations and predicts immediate future

Provides real-time operational intelligence
Enables proactive issue detection
Runtime operational tool

The diagram shows these are complementary approaches: CFD for design validation and ML for operational excellence, each serving distinct phases of data center lifecycle management.

With Claude

DC Changes

Posted on 2025-07-02 by lechuck park

This image shows a diagram that matches 3 Environmental Changes in data centers with 3 Operational Response Changes.

Environmental Changes → Operational Response Changes

1. Hyper Scale

Environmental Change: Large-scale/Complexity

Systems becoming bigger and more complex
Increased management complexity

→ Operational Response: DevOps + Big Data/AI Prediction

Development-Operations integration through DevOps
Intelligent operations through big data analytics and AI prediction

2. New DC (New Data Center)

Environmental Change: New/Edge and various types of data centers

Proliferation of new edge data centers
Distributed infrastructure environment

→ Operational Response: Integrated Operations

Multi-center integrated management
Standardized operational processes
Role-based operational framework

3. AI DC (AI Data Center)

Environmental Change: GPU Large-scale Computing/Massive Power Requirements

GPU-intensive high-performance computing
Enormous power consumption

→ Operational Response: Digital Twin – Real-time Data View

Digital replication of actual configurations
High-quality data-based monitoring
Real-time predictive analytics including temperature prediction

This diagram systematically demonstrates that as data center environments undergo physical changes, operational approaches must also become more intelligent and integrated in response.

with Claude

800V HVDC

Posted on 2025-06-27 by lechuck park

AI Data Center: Server-Side Power Management Transition from AC to DC

Traditional AC Server Power Management (Upper Section)

AC Power Distribution Chain

6.6kV to 380V AC: Primary voltage step-down transformation
UPS (Outage Fast Recovery): Backup power for short-term outages
Distribution Cutoff, Regulation: Power distribution control and voltage regulation
AC to DC for Server: Final AC-DC conversion at server level
Output: AC 380V (KW level)

New DC Server Power Management Technology (Lower Section)

DC Power Distribution Chain

AC to DC Conv 800V HVDC: Direct high-voltage DC conversion
ESS (Energy Storage System): Integrated energy storage solution
Digital Control: Advanced digital power management
DC to DC Down for Server: DC-DC step-down conversion for servers
Output: HVDC 800V (MW level)

Key Technology Advantages of DC Transition

Power Quality Enhancement

PF Up, Harmonics Dn: Improved power factor and reduced harmonic distortion

Advanced Backup Capability

Long time Backup Peak Shaving: Extended backup duration with intelligent peak load management

Operational Efficiency

Lower Loss, High Density, Easy Control: Reduced conversion losses, compact footprint, simplified control architecture

Scalable Power Delivery

High Power Usage Available: Enhanced power capacity to meet AI server demands
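
One way to see why a higher distribution voltage helps with high power and lower loss: for the same delivered power, current falls as voltage rises, and resistive loss in a conductor scales with the square of the current. The sketch below compares the line current at 380 V three-phase AC with the current at 800 V DC. The 1 MW load and the 0.95 power factor are assumed example values, and the function names are hypothetical.

```python
import math


def dc_current_a(power_w: float, voltage_v: float) -> float:
    """DC current: I = P / V."""
    return power_w / voltage_v


def ac_three_phase_current_a(
    power_w: float, line_voltage_v: float, power_factor: float = 0.95
) -> float:
    """Balanced three-phase line current: I = P / (sqrt(3) * V_LL * PF)."""
    return power_w / (math.sqrt(3) * line_voltage_v * power_factor)


load_w = 1_000_000  # assumed 1 MW example load
ac_amps = ac_three_phase_current_a(load_w, 380.0)
dc_amps = dc_current_a(load_w, 800.0)
print(f"380 V AC: {ac_amps:.0f} A per phase, 800 V DC: {dc_amps:.0f} A")
```

This only covers conductor current. It says nothing about conversion-stage efficiency, which depends on the specific equipment.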

Server-Side Power Management Transformation

This diagram illustrates the technological shift in server-side power management from traditional AC distribution (KW-level) to advanced DC distribution (MW-level), specifically designed to address the high-power requirements and efficiency demands of AI data centers. The DC approach eliminates multiple AC-DC conversion stages, resulting in improved efficiency and better power management capabilities.

With Claude

Server Room Workload

Posted on 2025-06-24 by lechuck park

This diagram illustrates a server room thermal management system workflow.

System Architecture

Server Internal Components:

AI Workload, GPU Workload, and Power Workload are connected to the CPU, generating heat

Temperature Monitoring Points:

Supply Temp: Cold air supplied from the cooling system
CoolZone Temp: Temperature in the cooling zone
Inlet Temp: Server inlet temperature
Outlet Temp: Server outlet temperature
Hot Zone Temp: Temperature in the heat exhaust zone
Return Temp: Hot air return to the cooling system

Cooling System:

The Cooling Workload on the left manages overall cooling
Closed-loop cooling system that circulates back via Return Temp

Temperature Delta Monitoring

The bottom flowchart shows how each workload affects temperature changes (ΔT):

Delta temperature sensors (Δ1, Δ2, Δ3) measure temperature differences across each section
This data enables analysis of each workload’s thermal impact and optimization of cooling efficiency
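
The delta calculation can be sketched as follows. The post does not say which sections Δ1, Δ2 and Δ3 cover, so this example simply computes the temperature change between every pair of adjacent monitoring points listed above. The key names and readings are assumptions for illustration.

```python
# Monitoring points in airflow order, as listed in the post (key names assumed).
POINTS = ["supply", "cool_zone", "inlet", "outlet", "hot_zone", "return"]


def section_deltas(readings_c):
    """Temperature change between each pair of adjacent monitoring points."""
    return {
        f"{a}->{b}": readings_c[b] - readings_c[a]
        for a, b in zip(POINTS, POINTS[1:])
    }


# Illustrative readings in degrees C, not measured data.
readings = {
    "supply": 18.0,
    "cool_zone": 20.0,
    "inlet": 22.0,
    "outlet": 37.0,
    "hot_zone": 38.0,
    "return": 35.0,
}
deltas = section_deltas(readings)
server_rise = deltas["inlet->outlet"]
```

The `inlet->outlet` delta is the rise across the servers themselves, so it is the one most directly tied to workload heat.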

This system appears to be a data center thermal management solution designed to effectively handle high heat loads from AI and GPU-intensive workloads. The comprehensive temperature monitoring allows for precise control and optimization of the cooling infrastructure based on real-time workload demands.

With Claude