Temperature Prediction in DC (II) – The Start and the Target

This image illustrates the purpose and outcomes of temperature prediction approaches in data centers, showing how each method serves different operational needs.

Purpose and Results Framework

CFD Approach – Validation and Design Purpose

Input:

  • Setup Data: Physical infrastructure definitions (100% RULES-based)
  • Pre-defined spatial, material, and boundary conditions

Process: Physics-based simulation through computational fluid dynamics

Results:

  • What-if (One Case) Simulation: Theoretical scenario testing
  • Checking a Limitation: Validates whether proposed configurations are “OK or not”
  • Used for design validation and capacity planning

ML Approach – Operational Monitoring Purpose

Input:

  • Relation (Extended) Data: Real-time operational data starting from workload metrics
  • Continuous data streams: Power, CPU, Temperature, LPM/RPM

Process: Data-driven pattern learning and prediction

Results:

  • Operating Data: Real-time operational insights
  • Anomaly Detection: Identifies unusual patterns or potential issues
  • Used for real-time monitoring and predictive maintenance
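
The ML path above can be sketched as a simple regression from operational metrics to temperature. Everything here is a synthetic assumption for illustration: the data, the linear-model choice, and the coefficients are made up, not taken from the diagram.

```python
import numpy as np

# Hypothetical sketch: predict a rack's inlet temperature from operational
# metrics (power draw, CPU utilization) with a least-squares linear fit.
# Real deployments would use richer models and real telemetry streams.

rng = np.random.default_rng(0)

# Synthetic operational data: power (kW) and CPU (%)
power = rng.uniform(2.0, 8.0, 200)
cpu = rng.uniform(10.0, 95.0, 200)
# Assume temperature rises with both, plus sensor noise (made-up coefficients)
temp = 18.0 + 1.2 * power + 0.05 * cpu + rng.normal(0.0, 0.3, 200)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(power), power, cpu])
coef, *_ = np.linalg.lstsq(X, temp, rcond=None)

# Predict temperature for a new operating point: 5 kW at 60% CPU
new_point = np.array([1.0, 5.0, 60.0])
predicted = new_point @ coef
print(f"predicted temperature: {predicted:.1f} °C")
```

A model like this needs only the continuously collected operating data, which is exactly why the ML approach suits runtime monitoring rather than design-time validation.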

Key Distinction in Purpose

CFD: “Can we do this?” – Validates design feasibility and limits before implementation

  • Answers hypothetical scenarios
  • Provides go/no-go decisions for infrastructure changes
  • Design-time tool

ML: “What’s happening now?” – Monitors current operations and predicts immediate future

  • Provides real-time operational intelligence
  • Enables proactive issue detection
  • Runtime operational tool
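
The anomaly-detection role described above can be illustrated with a rolling z-score over a temperature stream; the window size, threshold, and readings are illustrative assumptions, not values from the source.

```python
import numpy as np

# Sketch of runtime anomaly detection on a temperature stream: flag any
# reading that deviates strongly from its recent history (rolling z-score).

def detect_anomalies(series, window=20, z_thresh=3.0):
    """Return indices where a reading deviates sharply from the prior window."""
    series = np.asarray(series, dtype=float)
    flagged = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mu, sigma = recent.mean(), recent.std()
        if sigma > 0 and abs(series[i] - mu) / sigma > z_thresh:
            flagged.append(i)
    return flagged

# Steady readings around 24 °C with one injected spike
readings = [24.0 + 0.1 * np.sin(i / 3) for i in range(60)]
readings[45] = 30.0  # simulated hot-spot event
print(detect_anomalies(readings))  # → [45]
```

This is the "What's happening now?" question in miniature: the detector knows nothing about the room's physics, only about what the sensors normally report.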

The diagram shows these are complementary approaches: CFD for design validation and ML for operational excellence, each serving distinct phases of data center lifecycle management.

With Claude

Temperature Prediction in DC

Overall Structure

Top: CFD (Computational Fluid Dynamics)-based approach
Bottom: ML (Machine Learning)-based approach

CFD Approach (Top)

  • Basic Setup:
    • Spatial Definition & Material Properties: Physical space definition of the data center and material characteristics (servers, walls, air, etc.)
    • Boundary Conditions: Setting boundary conditions (inlet/outlet temperatures, airflow rates, heat sources, etc.)
  • Processing:
    • Configuration + Physical Rules: Application of physical laws (heat transfer equations, fluid dynamics equations, etc.)
    • Heat Flow: Heat flow calculations based on defined conditions
  • Output: Heat + Air Flow Simulation (physics-based heat and airflow simulation)
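
To make the physics-based pipeline concrete, here is a deliberately minimal sketch: the 1D heat equation dT/dt = α·d²T/dx² solved by explicit finite differences with fixed boundary temperatures. Real data-center CFD solves coupled 3D heat and airflow equations; the diffusivity, grid, and boundary values below are all illustrative assumptions.

```python
import numpy as np

# Minimal CFD-style illustration: spatial definition (grid), material
# property (thermal diffusivity), and boundary conditions fully determine
# the simulated temperature field. All numbers are made up.

alpha = 1e-4          # thermal diffusivity (m^2/s), assumed
dx, dt = 0.01, 0.1    # grid spacing (m) and time step (s)
n_cells, n_steps = 50, 50_000

# Boundary conditions: cold-aisle side at 18 °C, hot-aisle side at 35 °C
T = np.full(n_cells, 24.0)
T[0], T[-1] = 18.0, 35.0

r = alpha * dt / dx**2  # stability requires r < 0.5 for this explicit scheme
for _ in range(n_steps):
    T[1:-1] = T[1:-1] + r * (T[2:] - 2 * T[1:-1] + T[:-2])

# After enough steps the interior relaxes to the linear steady-state profile
print(f"mid-point temperature: {T[n_cells // 2]:.2f} °C")
```

Note the contrast with the ML approach: no sensor data enters this computation at all; the result follows entirely from the pre-defined conditions and the physical law.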

ML Approach (Bottom)

  • Data Collection:
    • Real-time monitoring through Metrics/Data Sensing
    • Operational data: Power (kW), CPU (%), Workload, etc.
    • Actual temperature measurements through Temperature Sensing
  • Processing: Pattern learning through Machine Learning algorithms
  • Output: Heat (with Location) Prediction (location-specific heat prediction)

Key Differences

CFD Method: Theoretical calculation through physical laws, using physical space definitions, material properties, and boundary conditions as inputs

ML Method: Data-driven approach that learns from actual operational data and sensor information to make predictions

The key distinction is that CFD performs simulation from predefined physical conditions, while ML learns from actual operational data collected during runtime to make predictions.

AI Workload

This image visualizes the three major AI workload types and their characteristics in a comprehensive graph.

Graph Structure Analysis

Visualization Framework:

  • Y-axis: AI workload intensity (requests per hour, FLOPS, CPU/GPU utilization, etc.)
  • X-axis: Time progression
  • Stacked Area Chart: Shows the proportion and changes of three workload types within the total AI system load

Three AI Workload Characteristics

1. Learning – Blue Area

Properties: Steady, Controllable, Planning

  • Located at the bottom with a stable, wide area
  • Represents model training processes with predictable and plannable resource usage
  • Maintains consistent load over extended periods

2. Reasoning – Yellow Area

Properties: Fluctuating, Unpredictable, Optimizing!!!

  • Middle layer showing dramatic fluctuations
  • Involves complex decision-making and logical reasoning processes
  • Most unpredictable workload requiring critical optimization
  • Load varies significantly based on external environmental changes

3. Inference – Green Area

Properties: On-device Side, Low Latency

  • Top layer with irregular patterns
  • Executes on edge devices or user terminals
  • Service workload requiring real-time responses
  • Low latency is the core requirement

Key Implications

Differentiated Resource Management Strategies Required:

  • Learning: Stable long-term planning and infrastructure investment
  • Reasoning: Dynamic scaling and optimization technology focus
  • Inference: Edge optimization and response time improvement
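
The three strategies above can be sketched as a per-workload scaling policy. The function name, thresholds, and decision labels are hypothetical illustrations of the idea, not anything specified in the graph.

```python
# Hypothetical sketch: map each AI workload type to a scaling decision that
# matches its character (steady training, bursty reasoning, latency-bound
# inference). Thresholds and action names are made up for the example.

def scaling_decision(workload: str, utilization: float, latency_ms: float) -> str:
    if workload == "learning":
        # Steady and plannable: provision capacity ahead of time, no reactive scaling
        return "keep-reserved-capacity"
    if workload == "reasoning":
        # Fluctuating and unpredictable: scale out aggressively on load
        return "scale-out" if utilization > 0.7 else "hold"
    if workload == "inference":
        # Latency is the core requirement: react to response time, not utilization
        return "add-edge-replica" if latency_ms > 50.0 else "hold"
    raise ValueError(f"unknown workload: {workload}")

print(scaling_decision("reasoning", utilization=0.85, latency_ms=12.0))   # scale-out
print(scaling_decision("inference", utilization=0.30, latency_ms=80.0))   # add-edge-replica
```

The point of the dispatch is that a single shared policy (say, utilization-based autoscaling for everything) would mishandle at least one of the three layers.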

This graph shows that effective AI system operation requires resource allocation strategies tailored to the distinct characteristics of each workload type.

This visualization emphasizes that AI workloads are not monolithic but consist of distinct components with varying demands, requiring sophisticated resource management approaches to handle their collective and individual requirements effectively.

AI Platform eating all

This diagram illustrates the fundamental paradigm shift in service development across three platform evolution stages.

Platform Evolution:

  1. Cloud Platform
    • Server-Client separation with cloud infrastructure development
    • Developers directly build servers and databases to provide services
  2. SDK Platform
    • Client-side evolution based on specific OS/SDK ecosystems (iOS, Android, Windows)
    • Each platform provides development environments and tools
    • This stage generated “Vast and numerous internet services” – an explosive growth of diverse internet services
  3. AI Platform – “Eating ALL”
    • Fundamental paradigm shift: Instead of developers building individual services, the AI platform itself generates and provides services
    • “All Services by AI”: AI directly provides the diverse services that developers previously created
    • Multimodal capabilities: AI can understand and process all human senses and communication methods (language, vision, audio), enabling all functionalities through natural language conversation without specialized apps or services

Key Transformation:

  • Traditional: Developer → Platform → Service Development → User
  • AI Era: User → AI Platform → Instant Service Generation/Provision

This represents not just tool evolution, but a fundamental reorganization of the service ecosystem where countless specialized services converge into one unified AI platform due to AI’s universal cognitive abilities. The AI platform becomes a total service provider, essentially “eating” all existing service categories.

Components for AI Work

This diagram visualizes the core concept that all components must be organically connected and work together to successfully operate AI workloads.

Importance of Organic Interconnections

Continuity of Data Flow

  • The data pipeline from Big Data → AI Model → AI Workload must operate seamlessly
  • Bottlenecks at any stage directly impact overall system performance

Cooperative Computing Resource Operations

  • GPU/CPU computational power must be balanced with HBM memory bandwidth
  • SSD I/O performance must harmonize with memory-processor data transfer speeds
  • Performance degradation in one component limits the efficiency of the entire system

Integrated Software Control Management

  • Load balancing, integration, and synchronization coordinate optimal hardware resource utilization
  • Real-time optimization of workload distribution and resource allocation
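
The load-balancing idea above can be sketched as greedy least-loaded assignment: each incoming job goes to whichever device currently carries the least work. Device names and job costs are invented for the example.

```python
import heapq

# Illustrative load-balancing sketch: keep devices in a min-heap keyed on
# current load and always assign the next job to the least-loaded one.

def assign_jobs(devices, job_costs):
    """Greedy least-loaded assignment; returns {device: total load}."""
    heap = [(0.0, d) for d in devices]   # (current load, device name)
    heapq.heapify(heap)
    loads = {d: 0.0 for d in devices}
    for cost in job_costs:
        load, dev = heapq.heappop(heap)  # device with the least load so far
        load += cost
        loads[dev] = load
        heapq.heappush(heap, (load, dev))
    return loads

print(assign_jobs(["gpu0", "gpu1"], [4.0, 3.0, 2.0, 1.0]))
```

Real schedulers must also account for data locality, memory limits, and synchronization, but the core balancing loop has this shape.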

Infrastructure-based Stability Assurance

  • Stable power supply ensures continuous operation of all computing resources
  • Cooling systems prevent performance degradation through thermal management of high-performance hardware
  • Facility control maintains consistency of the overall operating environment

Key Insight

In AI systems, the weakest link determines overall performance. For example, no matter how powerful the GPU, if memory bandwidth is insufficient or cooling is inadequate, the entire system cannot achieve its full potential. Therefore, balanced design and integrated management of all components is crucial for AI workload success.
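
The weakest-link observation can be made quantitative with a roofline-style bound: achievable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. The GPU figures below are illustrative, not a specific product's specs.

```python
# Roofline-style sketch of the "weakest link" principle: attainable
# throughput is capped by whichever of compute or memory is slower.

def attainable_tflops(peak_tflops, mem_bw_tbs, arithmetic_intensity):
    """min(peak compute, memory bandwidth x FLOPs performed per byte moved)."""
    return min(peak_tflops, mem_bw_tbs * arithmetic_intensity)

# A hypothetical accelerator: 100 TFLOP/s peak, 2 TB/s HBM bandwidth
for ai in (10, 50, 100):  # arithmetic intensity in FLOPs per byte
    tflops = attainable_tflops(100.0, 2.0, ai)
    print(f"AI={ai:3d} FLOP/B -> {tflops:.0f} TFLOP/s")
```

At low arithmetic intensity the 100 TFLOP/s peak is unreachable because HBM bandwidth is the binding constraint, which is exactly the "powerful GPU, insufficient memory bandwidth" case described above.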

The diagram emphasizes that AI infrastructure is not just about having powerful individual components, but about creating a holistically optimized ecosystem where every element supports and enhances the others.
