Temperature Prediction in DC (II) – The Start and the Target

This image illustrates the purpose and outcomes of temperature prediction approaches in data centers, showing how each method serves different operational needs.

Purpose and Results Framework

CFD Approach – Validation and Design Purpose

Input:

  • Setup Data: Physical infrastructure definitions (100% RULES-based)
  • Pre-defined spatial, material, and boundary conditions

Process: Physics-based simulation through computational fluid dynamics

Results:

  • What-if (One Case) Simulation: Theoretical scenario testing
  • Checking a Limitation: Validates whether proposed configurations are “OK or not”
  • Used for design validation and capacity planning

ML Approach – Operational Monitoring Purpose

Input:

  • Relation (Extended) Data: Real-time operational data starting from workload metrics
  • Continuous data streams: Power, CPU, Temperature, LPM/RPM

Process: Data-driven pattern learning and prediction

Results:

  • Operating Data: Real-time operational insights
  • Anomaly Detection: Identifies unusual patterns or potential issues
  • Used for real-time monitoring and predictive maintenance
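
The ML path above can be sketched as a simple regression from operational metrics to temperature. Everything here is a synthetic assumption for illustration: the data, the linear-model choice, and the coefficients are made up, not taken from the diagram.

```python
import numpy as np

# Hypothetical sketch: predict a rack's inlet temperature from operational
# metrics (power draw, CPU utilization) with a least-squares linear fit.
# Real deployments would use richer models and real telemetry streams.

rng = np.random.default_rng(0)

# Synthetic operational data: power (kW) and CPU (%)
power = rng.uniform(2.0, 8.0, 200)
cpu = rng.uniform(10.0, 95.0, 200)
# Assume temperature rises with both, plus sensor noise (made-up coefficients)
temp = 18.0 + 1.2 * power + 0.05 * cpu + rng.normal(0.0, 0.3, 200)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(power), power, cpu])
coef, *_ = np.linalg.lstsq(X, temp, rcond=None)

# Predict temperature for a new operating point: 5 kW at 60% CPU
new_point = np.array([1.0, 5.0, 60.0])
predicted = new_point @ coef
print(f"predicted temperature: {predicted:.1f} °C")
```

A model like this needs only the continuously collected operating data, which is exactly why the ML approach suits runtime monitoring rather than design-time validation.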

Key Distinction in Purpose

CFD: “Can we do this?” – Validates design feasibility and limits before implementation

  • Answers hypothetical scenarios
  • Provides go/no-go decisions for infrastructure changes
  • Design-time tool

ML: “What’s happening now?” – Monitors current operations and predicts immediate future

  • Provides real-time operational intelligence
  • Enables proactive issue detection
  • Runtime operational tool
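
The anomaly-detection role described above can be illustrated with a rolling z-score over a temperature stream; the window size, threshold, and readings are illustrative assumptions, not values from the source.

```python
import numpy as np

# Sketch of runtime anomaly detection on a temperature stream: flag any
# reading that deviates strongly from its recent history (rolling z-score).

def detect_anomalies(series, window=20, z_thresh=3.0):
    """Return indices where a reading deviates sharply from the prior window."""
    series = np.asarray(series, dtype=float)
    flagged = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mu, sigma = recent.mean(), recent.std()
        if sigma > 0 and abs(series[i] - mu) / sigma > z_thresh:
            flagged.append(i)
    return flagged

# Steady readings around 24 °C with one injected spike
readings = [24.0 + 0.1 * np.sin(i / 3) for i in range(60)]
readings[45] = 30.0  # simulated hot-spot event
print(detect_anomalies(readings))  # → [45]
```

This is the "What's happening now?" question in miniature: the detector knows nothing about the room's physics, only about what the sensors normally report.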

The diagram shows these are complementary approaches: CFD for design validation and ML for operational excellence, each serving distinct phases of data center lifecycle management.

With Claude

Temperature Prediction in DC

Overall Structure

Top: CFD (Computational Fluid Dynamics)-based approach
Bottom: ML (Machine Learning)-based approach

CFD Approach (Top)

  • Basic Setup:
    • Spatial Definition & Material Properties: Physical space definition of the data center and material characteristics (servers, walls, air, etc.)
    • Boundary Conditions: Setting boundary conditions (inlet/outlet temperatures, airflow rates, heat sources, etc.)
  • Processing:
    • Configuration + Physical Rules: Application of physical laws (heat transfer equations, fluid dynamics equations, etc.)
    • Heat Flow: Heat flow calculations based on defined conditions
  • Output: Heat + Air Flow Simulation (physics-based heat and airflow simulation)
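
To make the physics-based pipeline concrete, here is a deliberately minimal sketch: the 1D heat equation dT/dt = α·d²T/dx² solved by explicit finite differences with fixed boundary temperatures. Real data-center CFD solves coupled 3D heat and airflow equations; the diffusivity, grid, and boundary values below are all illustrative assumptions.

```python
import numpy as np

# Minimal CFD-style illustration: spatial definition (grid), material
# property (thermal diffusivity), and boundary conditions fully determine
# the simulated temperature field. All numbers are made up.

alpha = 1e-4          # thermal diffusivity (m^2/s), assumed
dx, dt = 0.01, 0.1    # grid spacing (m) and time step (s)
n_cells, n_steps = 50, 50_000

# Boundary conditions: cold-aisle side at 18 °C, hot-aisle side at 35 °C
T = np.full(n_cells, 24.0)
T[0], T[-1] = 18.0, 35.0

r = alpha * dt / dx**2  # stability requires r < 0.5 for this explicit scheme
for _ in range(n_steps):
    T[1:-1] = T[1:-1] + r * (T[2:] - 2 * T[1:-1] + T[:-2])

# After enough steps the interior relaxes to the linear steady-state profile
print(f"mid-point temperature: {T[n_cells // 2]:.2f} °C")
```

Note the contrast with the ML approach: no sensor data enters this computation at all; the result follows entirely from the pre-defined conditions and the physical law.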

ML Approach (Bottom)

  • Data Collection:
    • Real-time monitoring through Metrics/Data Sensing
    • Operational data: Power (kW), CPU (%), Workload, etc.
    • Actual temperature measurements through Temperature Sensing
  • Processing: Pattern learning through Machine Learning algorithms
  • Output: Heat (with Location) Prediction (location-specific heat prediction)

Key Differences

CFD Method: Theoretical calculation through physical laws, using physical space definitions, material properties, and boundary conditions as inputs

ML Method: Data-driven approach that learns from actual operational data and sensor information to make predictions

The key distinction is that CFD performs simulation from predefined physical conditions, while ML learns from actual operational data collected during runtime to make predictions.

AI Workload

This image visualizes the three major AI workload types and their characteristics in a comprehensive graph.

Graph Structure Analysis

Visualization Framework:

  • Y-axis: AI workload intensity (requests per hour, FLOPS, CPU/GPU utilization, etc.)
  • X-axis: Time progression
  • Stacked Area Chart: Shows the proportion and changes of three workload types within the total AI system load

Three AI Workload Characteristics

1. Learning – Blue Area

Properties: Steady, Controllable, Planning

  • Located at the bottom with a stable, wide area
  • Represents model training processes with predictable and plannable resource usage
  • Maintains consistent load over extended periods

2. Reasoning – Yellow Area

Properties: Fluctuating, Unpredictable, Optimizing!!!

  • Middle layer showing dramatic fluctuations
  • Involves complex decision-making and logical reasoning processes
  • Most unpredictable workload requiring critical optimization
  • Load varies significantly based on external environmental changes

3. Inference – Green Area

Properties: On-device Side, Low Latency

  • Top layer with irregular patterns
  • Executes on edge devices or user terminals
  • Service workload requiring real-time responses
  • Low latency is the core requirement

Key Implications

Differentiated Resource Management Strategies Required:

  • Learning: Stable long-term planning and infrastructure investment
  • Reasoning: Dynamic scaling and optimization technology focus
  • Inference: Edge optimization and response time improvement
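
The three strategies above can be sketched as a per-workload scaling policy. The function name, thresholds, and decision labels are hypothetical illustrations of the idea, not anything specified in the graph.

```python
# Hypothetical sketch: map each AI workload type to a scaling decision that
# matches its character (steady training, bursty reasoning, latency-bound
# inference). Thresholds and action names are made up for the example.

def scaling_decision(workload: str, utilization: float, latency_ms: float) -> str:
    if workload == "learning":
        # Steady and plannable: provision capacity ahead of time, no reactive scaling
        return "keep-reserved-capacity"
    if workload == "reasoning":
        # Fluctuating and unpredictable: scale out aggressively on load
        return "scale-out" if utilization > 0.7 else "hold"
    if workload == "inference":
        # Latency is the core requirement: react to response time, not utilization
        return "add-edge-replica" if latency_ms > 50.0 else "hold"
    raise ValueError(f"unknown workload: {workload}")

print(scaling_decision("reasoning", utilization=0.85, latency_ms=12.0))   # scale-out
print(scaling_decision("inference", utilization=0.30, latency_ms=80.0))   # add-edge-replica
```

The point of the dispatch is that a single shared policy (say, utilization-based autoscaling for everything) would mishandle at least one of the three layers.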

This graph shows that effective AI system operation requires resource allocation strategies tailored to the distinct characteristics of each workload type.

This visualization emphasizes that AI workloads are not monolithic but consist of distinct components with varying demands, requiring sophisticated resource management approaches to handle their collective and individual requirements effectively.

AI Platform eating all

This diagram illustrates the fundamental paradigm shift in service development across three platform evolution stages.

Platform Evolution:

  1. Cloud Platform
    • Server-Client separation with cloud infrastructure development
    • Developers directly build servers and databases to provide services
  2. SDK Platform
    • Client-side evolution based on specific OS/SDK ecosystems (iOS, Android, Windows)
    • Each platform provides development environments and tools
    • This stage generated “Vast and numerous internet services” – an explosive growth of diverse internet services
  3. AI Platform – “Eating ALL”
    • Fundamental paradigm shift: Instead of developers building individual services, the AI platform itself generates and provides services
    • “All Services by AI”: AI directly provides the diverse services that developers previously created
    • Multimodal capabilities: AI can understand and process all human senses and communication methods (language, vision, audio), enabling all functionalities through natural language conversation without specialized apps or services

Key Transformation:

  • Traditional: Developer → Platform → Service Development → User
  • AI Era: User → AI Platform → Instant Service Generation/Provision

This represents not just tool evolution, but a fundamental reorganization of the service ecosystem where countless specialized services converge into one unified AI platform due to AI’s universal cognitive abilities. The AI platform becomes a total service provider, essentially “eating” all existing service categories.

Components for AI Work

This diagram visualizes the core concept that all components must be organically connected and work together to successfully operate AI workloads.

Importance of Organic Interconnections

Continuity of Data Flow

  • The data pipeline from Big Data → AI Model → AI Workload must operate seamlessly
  • Bottlenecks at any stage directly impact overall system performance

Cooperative Computing Resource Operations

  • GPU/CPU computational power must be balanced with HBM memory bandwidth
  • SSD I/O performance must harmonize with memory-processor data transfer speeds
  • Performance degradation in one component limits the efficiency of the entire system

Integrated Software Control Management

  • Load balancing, integration, and synchronization coordinate optimal hardware resource utilization
  • Real-time optimization of workload distribution and resource allocation
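
The load-balancing idea above can be sketched as greedy least-loaded assignment: each incoming job goes to whichever device currently carries the least work. Device names and job costs are invented for the example.

```python
import heapq

# Illustrative load-balancing sketch: keep devices in a min-heap keyed on
# current load and always assign the next job to the least-loaded one.

def assign_jobs(devices, job_costs):
    """Greedy least-loaded assignment; returns {device: total load}."""
    heap = [(0.0, d) for d in devices]   # (current load, device name)
    heapq.heapify(heap)
    loads = {d: 0.0 for d in devices}
    for cost in job_costs:
        load, dev = heapq.heappop(heap)  # device with the least load so far
        load += cost
        loads[dev] = load
        heapq.heappush(heap, (load, dev))
    return loads

print(assign_jobs(["gpu0", "gpu1"], [4.0, 3.0, 2.0, 1.0]))
```

Real schedulers must also account for data locality, memory limits, and synchronization, but the core balancing loop has this shape.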

Infrastructure-based Stability Assurance

  • Stable power supply ensures continuous operation of all computing resources
  • Cooling systems prevent performance degradation through thermal management of high-performance hardware
  • Facility control maintains consistency of the overall operating environment

Key Insight

In AI systems, the weakest link determines overall performance. For example, no matter how powerful the GPU, if memory bandwidth is insufficient or cooling is inadequate, the entire system cannot achieve its full potential. Therefore, balanced design and integrated management of all components is crucial for AI workload success.
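
The weakest-link observation can be made quantitative with a roofline-style bound: achievable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. The GPU figures below are illustrative, not a specific product's specs.

```python
# Roofline-style sketch of the "weakest link" principle: attainable
# throughput is capped by whichever of compute or memory is slower.

def attainable_tflops(peak_tflops, mem_bw_tbs, arithmetic_intensity):
    """min(peak compute, memory bandwidth x FLOPs performed per byte moved)."""
    return min(peak_tflops, mem_bw_tbs * arithmetic_intensity)

# A hypothetical accelerator: 100 TFLOP/s peak, 2 TB/s HBM bandwidth
for ai in (10, 50, 100):  # arithmetic intensity in FLOPs per byte
    tflops = attainable_tflops(100.0, 2.0, ai)
    print(f"AI={ai:3d} FLOP/B -> {tflops:.0f} TFLOP/s")
```

At low arithmetic intensity the 100 TFLOP/s peak is unreachable because HBM bandwidth is the binding constraint, which is exactly the "powerful GPU, insufficient memory bandwidth" case described above.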

The diagram emphasizes that AI infrastructure is not just about having powerful individual components, but about creating a holistically optimized ecosystem where every element supports and enhances the others.
