New Era of Digitals

This image presents a diagram titled “New Era of Digitals” that illustrates the evolution of computing paradigms.

Overall Structure:

The diagram shows a progression from left to right, transitioning from being “limited by Humans” to achieving “Everything by Digitals.”

Key Stages:

  1. Human Desire: The process begins with humans’ fundamental need to “wanna know it clearly,” representing our desire for understanding and knowledge.
  2. Rule-Based Era (1000s):
    • Deterministic approach
    • Using Logics and Rules
    • Automation with Specific Rules
    • Record with a human recognizable format
  3. Data-Driven Era:
    • Probabilistic approach (Not 100% But OK)
    • Massive Computing (Energy Resource)
    • Neural network-like structures represented by interconnected nodes

Core Message:

The diagram illustrates how computing has evolved from early systems that relied on human-defined explicit rules and logic to modern data-driven, probabilistic approaches. This represents the shift toward AI and machine learning, where we achieve “Not 100% But OK” results through massive computational resources rather than perfect deterministic rules.

The transition shows how we’ve moved from systems that required everything to be “human recognizable” to systems that can process and understand patterns beyond direct human comprehension. This marks the current digital revolution, in which algorithms and data-driven approaches handle complexity that exceeds what traditional rule-based systems can manage.

With Claude

CDU (OCP Project Deschutes) Numbers

OCP CDU (Deschutes) Standard Overview

The provided visual summarizes the key performance metrics of the CDU (Cooling Distribution Unit) that adheres to the OCP (Open Compute Project) ‘Project Deschutes’ specification. This CDU is designed for high-performance computing environments, particularly for massive-scale liquid cooling of AI/ML workloads.


Key Performance Indicators

  • System Availability: The primary target for system availability is 99.999%. This represents an extremely high level of reliability, equivalent to roughly 5 minutes and 15 seconds of downtime per year.
  • Thermal Load Capacity: The CDU is designed to handle a thermal load of up to 2,000 kW, which is among the highest thermal capacities in the industry.
  • Power Usage: The CDU itself consumes 74 kW of power.
  • IT Flow Rate: It supplies coolant to the servers at a rate of 500 GPM (approximately 1,900 LPM).
  • Operating Pressure: The overall system operating pressure is within a range of 0-130 psig (approximately 0-900 kPa).
  • IT Differential Pressure: The pressure difference required on the server side is 80-90 psi (approximately 550-620 kPa).
  • Approach Temperature: The approach temperature, a key indicator of heat exchange efficiency, is targeted at ≤3 °C. A lower value is better, as it signifies more efficient heat removal.
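
As a rough sanity check on how these figures fit together, the sketch below converts the 99.999% availability target into annual downtime and estimates the coolant temperature rise at full load from Q = ṁ·cp·ΔT. The water-like coolant properties (density ≈ 1 kg/L, cp ≈ 4186 J/kg·K) are assumptions for illustration; a glycol mixture would shift the result somewhat.

```python
# Back-of-the-envelope checks for the Project Deschutes CDU figures.
# Assumes a water-like coolant (rho ~ 1.0 kg/L, cp ~ 4186 J/kg.K); a
# propylene-glycol mix would change the temperature rise somewhat.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(availability: float) -> float:
    """Maximum annual downtime implied by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR

def coolant_delta_t(thermal_load_kw: float, flow_gpm: float,
                    rho_kg_per_l: float = 1.0,
                    cp_j_per_kg_k: float = 4186.0) -> float:
    """Coolant temperature rise (K) from Q = m_dot * cp * dT."""
    flow_l_per_s = flow_gpm * 3.785 / 60.0   # US gallons per minute -> litres per second
    m_dot_kg_per_s = flow_l_per_s * rho_kg_per_l
    return thermal_load_kw * 1000.0 / (m_dot_kg_per_s * cp_j_per_kg_k)

print(f"99.999% availability -> {downtime_minutes_per_year(0.99999):.2f} min/year of downtime")
print(f"2,000 kW at 500 GPM  -> ~{coolant_delta_t(2000, 500):.1f} K coolant temperature rise")
```

The result of roughly 5.3 minutes per year matches the availability figure above, and a coolant temperature rise in the mid-teens of kelvin at full load is in the range commonly seen in facility liquid-cooling loops.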

Why Cooling is Crucial for GPU Performance

Cooling has a direct and significant impact on GPU performance and stability. GPUs are highly sensitive to heat: if they are not kept within their optimal temperature range, they automatically reduce their performance through a process called thermal throttling to prevent damage.

The ‘Project Deschutes’ CDU is engineered to prevent this by handling a massive thermal load of 2,000 kW with a powerful 500 GPM flow rate and a low approach temperature of ≤3 °C. This robust cooling capability ensures that GPUs can operate at their maximum potential without being limited by heat, which is essential for maximizing performance in demanding AI workloads.

with Gemini

‘tightly fused’

This illustration visualizes the evolution of data centers, contrasting the traditionally separated components with the modern AI data center, where software, compute, network and, crucially, power and cooling systems are ‘tightly fused’ together. It emphasizes how power and advanced cooling are organically intertwined with the GPUs and memory, directly impacting AI performance, and highlights their inseparable role in meeting the demands of high-performance AI. This tight integration symbolizes a pivotal shift for the modern AI era.

NVIDIA DCGM (Data Center GPU Manager) Metrics

NVIDIA DCGM for GPU Stabilization and Optimization

Purpose and Overview

DCGM (Data Center GPU Manager) metrics provide comprehensive real-time monitoring for GPU cluster stability and performance optimization in data center environments. The system enables proactive issue detection and prevention through systematic metric categorization across utility states, performance profiling, and system identification. This integrated approach ensures uninterrupted high-performance operations while extending hardware lifespan and optimizing operational costs.

GPU Stabilization Through Metric Monitoring

Thermal Stability Management

  • GPU Temperature monitoring prevents overheating
  • Clock Throttle Reasons identifies performance degradation causes
  • Automatic workload redistribution when temperature thresholds are reached
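
DCGM exposes these temperature and throttle fields through its own API and the dcgmi tool. The sketch below performs the same check with the closely related NVML Python bindings (pynvml), which report equivalent counters; the 85 °C threshold and the reaction to it are illustrative assumptions, not DCGM defaults.

```python
# Minimal thermal check via NVML (pynvml); DCGM reports equivalent fields
# (GPU Temperature, Clock Throttle Reasons). The 85 degC threshold and the
# "redistribute" reaction are illustrative assumptions, not vendor defaults.
import pynvml

TEMP_LIMIT_C = 85

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
        if temp_c >= TEMP_LIMIT_C or reasons != 0:
            # This is where a fleet scheduler would drain the GPU or move
            # work to a cooler device.
            print(f"GPU {i}: {temp_c} C, throttle-reasons bitmask 0x{reasons:016x}")
finally:
    pynvml.nvmlShutdown()
```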

Power Management Optimization

  • Power Usage and Total Energy Consumption tracking
  • Priority-based job scheduling when power limits are approached
  • Energy efficiency-based resource allocation
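
A minimal power readout along the same lines, again using pynvml as a stand-in for the corresponding DCGM fields. Comparing the current draw against the enforced power limit gives the headroom a scheduler would use for priority-based decisions; the cumulative energy counter is only available on Volta-class and newer GPUs.

```python
# Power and cumulative-energy readout via NVML (pynvml); DCGM exposes the same
# data as its Power Usage and Total Energy Consumption fields. The cumulative
# energy counter requires a Volta-class or newer GPU.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0          # mW -> W
    limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0  # mW -> W
    energy_kwh = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle) / 3.6e9  # mJ -> kWh
    headroom_w = limit_w - power_w
    print(f"draw {power_w:.0f} W / limit {limit_w:.0f} W (headroom {headroom_w:.0f} W), "
          f"{energy_kwh:.2f} kWh since driver load")
finally:
    pynvml.nvmlShutdown()
```

The headroom figure is what a scheduler can act on as the power limit is approached, deferring or migrating low-priority jobs before throttling sets in.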

Memory Integrity Assurance

  • ECC Error Count monitoring for early hardware fault detection
  • Frame Buffer Memory utilization tracking prevents OOM scenarios
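
A corresponding sketch for the memory-integrity metrics, again via pynvml: it reads framebuffer utilization and the volatile (since last driver load) uncorrected ECC counter. On GPUs without ECC enabled the ECC query raises an NVML error, so the sketch guards for that.

```python
# Framebuffer and ECC checks via NVML (pynvml); DCGM's FB memory and ECC
# Error Count fields carry the same information.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    fb_used_pct = 100.0 * mem.used / mem.total
    try:
        ecc_uncorrected = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_VOLATILE_ECC,   # counts since the last driver load
        )
    except pynvml.NVMLError:
        ecc_uncorrected = None          # ECC not supported or not enabled
    print(f"FB used {fb_used_pct:.1f}%, uncorrected ECC errors: {ecc_uncorrected}")
finally:
    pynvml.nvmlShutdown()
```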

Clock Throttling-Based Optimization

The Clock Throttle Reasons bitmask provides real-time detection of GPU performance limitations. A value of 0x0000000000000000 means no throttling and peak performance; a set power-cap bit triggers workload distribution to alternate GPUs, while a set thermal-slowdown bit activates enhanced cooling and temporarily suspends heat-generating tasks. When several limitation bits are set at once, emergency workload migration and hardware diagnostics are used to maintain system stability.
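
Rather than hard-coding hex values, the bitmask can be decoded with the named constants that the nvidia-ml-py (pynvml) package exposes for the NVML clocks-throttle-reasons field. Only a few common reasons are mapped in this sketch, and the exact set of constants varies slightly across driver and binding versions.

```python
# Decode the clocks-throttle-reasons bitmask using pynvml's named constants.
# Only a few common reasons are listed here; NVML defines more.
import pynvml

THROTTLE_REASONS = {
    pynvml.nvmlClocksThrottleReasonGpuIdle:           "GPU idle",
    pynvml.nvmlClocksThrottleReasonSwPowerCap:        "software power cap",
    pynvml.nvmlClocksThrottleReasonHwSlowdown:        "hardware slowdown",
    pynvml.nvmlClocksThrottleReasonSwThermalSlowdown: "software thermal slowdown",
    pynvml.nvmlClocksThrottleReasonHwThermalSlowdown: "hardware thermal slowdown",
}

def decode_throttle_reasons(bitmask: int) -> list[str]:
    """Translate the raw bitmask into human-readable throttle reasons."""
    if bitmask == 0:
        return ["none (running at full clocks)"]
    return [name for bit, name in THROTTLE_REASONS.items() if bitmask & bit]

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
    print(f"0x{mask:016x}: {', '.join(decode_throttle_reasons(mask))}")
finally:
    pynvml.nvmlShutdown()
```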

Integrated Optimization Strategy

Predictive Management

  • Metric trend analysis for proactive issue prediction
  • Workload pattern learning for optimal resource pre-allocation

Dynamic Scaling

  • SM/DRAM Active Cycles Ratio enables real-time load balancing
  • PCIe/NVLink Throughput optimization for network efficiency
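
As a sketch of how the SM/DRAM activity ratio can drive load balancing, the helper below classifies a GPU as compute-bound, memory-bound, balanced, or idle from the two activity fractions that DCGM reports as profiling fields (e.g. DCGM_FI_PROF_SM_ACTIVE and DCGM_FI_PROF_DRAM_ACTIVE, both in the range 0.0–1.0). The thresholds are illustrative assumptions.

```python
# Classify a GPU's workload from its SM and DRAM activity fractions, e.g. the
# DCGM profiling fields DCGM_FI_PROF_SM_ACTIVE and DCGM_FI_PROF_DRAM_ACTIVE.
# The 0.10 idle cut-off and the 1.5x ratio are illustrative assumptions.

def classify_gpu_load(sm_active: float, dram_active: float) -> str:
    """Rough compute- vs memory-bound classification for placement decisions."""
    if sm_active < 0.10 and dram_active < 0.10:
        return "idle"            # candidate to receive new work
    if sm_active >= 1.5 * dram_active:
        return "compute-bound"   # pair with memory-heavy jobs
    if dram_active >= 1.5 * sm_active:
        return "memory-bound"    # pair with compute-heavy jobs
    return "balanced"

print(classify_gpu_load(0.92, 0.35))   # -> compute-bound
print(classify_gpu_load(0.20, 0.75))   # -> memory-bound
```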

Fault Prevention

  • Rising ECC Error Count triggers GPU isolation and replacement scheduling
  • Driver Version and Process Name tracking resolves compatibility issues
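
One possible shape for the ECC-driven isolation rule, assuming the monitoring loop samples the uncorrected ECC counter at a fixed interval; the one-error-per-hour threshold is an illustrative assumption, not an NVIDIA recommendation.

```python
# Decide whether a GPU should be cordoned for replacement based on the growth
# of its uncorrected ECC error count between two monitoring samples.

def should_isolate_gpu(prev_ecc: int, curr_ecc: int, window_s: float,
                       max_errors_per_hour: float = 1.0) -> bool:
    """Flag a GPU whose uncorrected ECC error rate exceeds the threshold."""
    new_errors = curr_ecc - prev_ecc
    if new_errors <= 0:          # no growth, or counter reset after a driver reload
        return False
    rate_per_hour = new_errors * 3600.0 / window_s
    return rate_per_hour > max_errors_per_hour

# Example: 3 new uncorrected errors in a 60 s window -> isolate and schedule a swap.
print(should_isolate_gpu(prev_ecc=12, curr_ecc=15, window_s=60.0))  # True
```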

With Claude

Multi-DC Operation with an LLM (4)

LLM-Based Multi-Datacenter Operation System

System Architecture

3-Stage Processing Pipeline: Collector → Integrator → Analyst

  • Event collection from various protocols
  • Data normalization through local integrators
  • Intelligent analysis via LLM/AI analyzers
  • RAG data expansion through the Data Add-On modules shown at the bottom of the diagram

Core Functions

1. Time-Based Event Aggregation Analysis

  • 60-second intervals (adjustable) for event bundling
  • Comprehensive situational analysis instead of individual alarms
  • LLM queries with predefined prompts

Effectiveness:

  • ✅ Resolves alarm fatigue and enables correlation analysis
  • ✅ Improves operational efficiency through periodic comprehensive reports
  • ⚠️ Potential delay in immediate response to critical issues (mitigated by retaining a legacy/local monitoring system for real-time alerts)
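
A minimal sketch of the Collector → Integrator → Analyst flow described above: events are bundled into fixed windows (60 s by default, adjustable) and each window is folded into one analysis prompt. The Event fields, the prompt wording, and query_llm are hypothetical placeholders for whatever schema and LLM backend the real system uses.

```python
# Sketch of time-based event aggregation: events from many datacenters are
# bundled into fixed windows and summarized into one prompt per window.
# `Event` fields, the prompt text, and `query_llm` are illustrative placeholders.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: float      # Unix seconds
    datacenter: str
    source: str           # e.g. "snmp", "syslog", "bms"
    message: str

def bundle_events(events: list[Event], window_s: int = 60) -> dict[int, list[Event]]:
    """Group events into fixed time windows keyed by window start time."""
    windows: dict[int, list[Event]] = defaultdict(list)
    for ev in events:
        window_start = int(ev.timestamp // window_s) * window_s
        windows[window_start].append(ev)
    return windows

def build_prompt(window_start: int, events: list[Event]) -> str:
    """Fold one window of events into a single comprehensive analysis prompt."""
    lines = [f"[{ev.datacenter}/{ev.source}] {ev.message}" for ev in events]
    return (
        f"Analyze the following {len(events)} datacenter events from the "
        f"window starting at t={window_start}. Identify correlated root "
        f"causes and recommend actions:\n" + "\n".join(lines)
    )

def query_llm(prompt: str) -> str:
    """Placeholder for the LLM/RAG analyzer call."""
    raise NotImplementedError("wire this to the actual LLM backend")
```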

2. RAG-Based Data Enhancement

  • Extension data: Metrics, manuals, configurations, maintenance records
  • Reuse of past analysis results as learning data
  • Improved accuracy through domain-specific knowledge accumulation

Effectiveness:

  • ✅ Continuous improvement of analysis quality and increased automation
  • ✅ Systematization of operational knowledge and organizational capability enhancement
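
A simplified stand-in for the RAG step: retrieve the past analysis reports most similar to the current event summary and prepend them to the prompt. TF-IDF cosine similarity (scikit-learn) is used here purely for illustration; the actual system would presumably use an embedding model and a vector store.

```python
# Illustrative retrieval step: find past analysis reports similar to the
# current event summary and prepend them to the LLM prompt as context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_context(summary: str, past_reports: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k past reports most similar to the current summary."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(past_reports + [summary])
    report_vecs = matrix[: len(past_reports)]
    summary_vec = matrix[len(past_reports)]
    scores = cosine_similarity(summary_vec, report_vecs).ravel()
    top = scores.argsort()[::-1][:top_k]
    return [past_reports[i] for i in top]

def augment_prompt(base_prompt: str, summary: str, past_reports: list[str]) -> str:
    """Prepend retrieved operational knowledge to the analysis prompt."""
    context = retrieve_context(summary, past_reports)
    return "Relevant past analyses:\n" + "\n---\n".join(context) + "\n\n" + base_prompt
```

This is also how past analysis results are “reused as learning data” here: at retrieval time, rather than through model fine-tuning.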

Innovative Value

  • Paradigm Shift: Reactive → Predictive/Contextual analysis
  • Operational Burden Reduction: Transform massive alarms into meaningful insights
  • Self-Evolution: Continuous learning system through RAG framework

Executive Summary: This system overcomes the limitations of the traditional individual-alarm approach and represents an innovative solution that brings intelligence to datacenter operations through time-based event aggregation and LLM analysis. As a self-evolving monitoring system that continuously learns and develops through RAG-based data enhancement, it is expected to dramatically improve operational efficiency and analysis accuracy.

With Claude

Switching of the power

This diagram illustrates two main power switching methods used in electrical systems: ATS (Automatic Transfer Switch) and STS (Static Transfer Switch).

System Configuration

  • Power Sources: Utility grid and Generator
  • Protection: UPS systems
  • Load: Server infrastructure

ATS (Automatic Transfer Switch)

Location: Switchgear Area (Power Distribution Board)

Characteristics:

  • Mechanism: Mechanical breakers/contacts
  • Transfer Time: Several seconds (including generator start-up)
  • Advantages: Relatively simple, lower cost
  • Application: Standard power transfer systems

STS (Static Transfer Switch)

Location: Panelboard Area (Distribution Panel)

Characteristics:

  • Mechanism: Semiconductor devices (SCR, IGBT)
  • Transfer Time: A few milliseconds (near seamless)
  • Advantages: Ensures high-quality power supply
  • Disadvantages: Expensive

Key Differences

  1. Transfer Speed: STS is significantly faster (milliseconds vs seconds)
  2. Technology: ATS uses mechanical switching, STS uses electronic switching
  3. Cost: ATS is more economical
  4. Power Quality: STS provides more stable power delivery
  5. Complexity: STS requires more sophisticated semiconductor control
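
To see why the milliseconds-versus-seconds gap matters, compare both transfer times with the hold-up time of a typical server power supply, commonly specified on the order of 10-20 ms at full load (the 16 ms used below is an illustrative assumption). An STS can complete its transfer inside that window, while loads behind an ATS need a UPS to ride through the gap, which is exactly how the two are combined in this diagram.

```python
# Compare ATS/STS transfer times with a typical server PSU hold-up time.
# The 16 ms hold-up figure and 10 s ATS transfer are illustrative assumptions.
PSU_HOLDUP_MS = 16.0

transfer_times_ms = {
    "STS (semiconductor switching)": 4.0,            # a few milliseconds
    "ATS (mechanical, incl. generator start)": 10_000.0,  # several seconds
}

for name, t_ms in transfer_times_ms.items():
    rides_through = t_ms <= PSU_HOLDUP_MS
    verdict = "load rides through on its PSU" if rides_through else "needs UPS ride-through"
    print(f"{name}: {t_ms:,.0f} ms -> {verdict}")
```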

Applications

  • ATS: Suitable for applications that can tolerate brief power interruptions
  • STS: Critical for sensitive equipment like servers, data centers, and medical facilities requiring uninterrupted power

Summary: This diagram shows a redundant power system where ATS provides cost-effective backup power switching while STS offers near-instantaneous transfer for critical loads. Both systems work together with UPS backup to ensure continuous power supply to servers and sensitive equipment.

With Claude