Data in AI DC

This image illustrates a data monitoring system for an AI data center server room. Titled “Data in AI DC Server Room,” it depicts the relationships between key elements being monitored in the data center.

The system consists of four main components, each with detailed metrics (a brief telemetry sketch follows the list):

  1. GPU Workload – Right center
    • Computing Load: GPU utilization rate (%) and type of computational tasks (training vs. inference)
    • Power Consumption: Real-time power consumption of each GPU (W) – Example: NVIDIA H100 GPU consumes up to 700W
    • Workload Pattern: Periodicity of workload (peak/off-peak times) and predictability
    • Memory Usage: GPU memory usage patterns (e.g., HBM3 memory bandwidth usage)
  2. Power Infrastructure – Left
    • Power Usage: Real-time power output and efficiency of UPS, PDU, and transformers
    • Power Quality: Voltage, frequency stability, and power loss rate
    • Power Capacity: Types and proportions of supplied energy, ensuring sufficient power availability for current workload operations
  3. Cooling System – Right
    • Cooling Device Status: Air-cooling fan speed (RPM), liquid cooling pump flow rate (LPM), and coolant temperature (°C)
    • Environmental Conditions: Data center internal temperature, humidity, air pressure, and hot/cold zone temperatures – critical for server operations
    • Cooling Efficiency: Power Usage Effectiveness (PUE) and proportion of power consumed by the cooling system
  4. Server/Rack – Top center
    • Rack Power Density: Power consumption per rack (kW) – Example: GPU server racks range from 30 to 120 kW
    • Temperature Profile: Temperature (°C) of GPUs, CPUs, memory modules, and heat distribution
    • Server Status: Operational state of servers (active/standby) and workload distribution status
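
The GPU-side metrics listed above (utilization, power draw, memory use) can be collected programmatically. Below is a minimal sketch using the NVIDIA Management Library bindings (`pynvml`, distributed as the `nvidia-ml-py` package); the printed layout is illustrative, and the 700 W figure cited earlier is only an upper bound for an H100-class board.

```python
# Minimal GPU telemetry sketch using NVML (pip install nvidia-ml-py).
# Reads utilization (%), power draw (W), and memory use for each visible GPU.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)        # .gpu / .memory in %
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)               # bytes
        print(f"GPU{i}: util={util.gpu}%  power={power_w:.0f} W  "
              f"mem={mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
finally:
    pynvml.nvmlShutdown()
```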

The workflow sequence indicated at the bottom of the diagram represents:

  1. ① GPU WORK: Initial execution of AI workloads – GPU computational tasks begin, generating system load
  2. ② POWER USE: Increased power supply for GPU operations – Power demand rises with the GPU workload, and the power infrastructure responds accordingly
  3. ③ COOLING WORK: Cooling processes activated in response to heat generation
    • Sensing: Temperature sensors detect server and rack thermal conditions, monitoring hot/cold zone temperature differentials
    • Analysis: Analysis of collected temperature data, determining cooling requirements
    • Action: Cooling equipment is adjusted automatically (fan speed, coolant flow rate, etc.); a minimal control-loop sketch follows this list
  4. ④ SERVER OK: Maintenance of normal server operation through proper power supply and cooling – Temperature and power remain stable, allowing GPU workloads to continue running under optimal conditions
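
The sense-analyze-act loop in step ③ can be pictured as a simple proportional controller. The sketch below is purely illustrative: `read_inlet_temp_c` and `set_fan_rpm` are hypothetical stand-ins for whatever BMS/DCIM interface a real facility exposes, and the setpoint and gain are made-up numbers.

```python
# Illustrative sense-analyze-act cooling loop (simple proportional control).
# Sensor and actuator functions are hypothetical placeholders; all constants
# are invented for illustration only.
import random
import time

SETPOINT_C = 27.0          # target cold-aisle inlet temperature (illustrative)
GAIN_RPM_PER_C = 400.0     # fan-speed change per degree of error (illustrative)
FAN_MIN_RPM, FAN_MAX_RPM = 2000, 12000

def read_inlet_temp_c() -> float:
    """Placeholder sensor read; replace with the real BMS/DCIM query."""
    return 27.0 + random.uniform(-2.0, 4.0)   # simulated inlet temperature

def set_fan_rpm(rpm: float) -> None:
    """Placeholder actuator command; replace with the real fan-wall API."""
    print(f"fan setpoint -> {rpm:.0f} RPM")

def control_step(current_rpm: float) -> float:
    temp = read_inlet_temp_c()                   # Sensing
    error = temp - SETPOINT_C                    # Analysis: distance from setpoint
    rpm = current_rpm + GAIN_RPM_PER_C * error   # Action: speed up when too hot
    rpm = min(FAN_MAX_RPM, max(FAN_MIN_RPM, rpm))
    set_fan_rpm(rpm)
    return rpm

rpm = FAN_MIN_RPM
for _ in range(5):          # a few iterations of the sense-analyze-act loop
    rpm = control_step(rpm)
    time.sleep(1)
```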

The arrows indicate data flow and interrelationships between systems, showing connections from power infrastructure to servers and from cooling systems to servers. This integrated system enables efficient and stable data center operation by detecting increased power demand and heat generation from GPU workloads, and adjusting cooling systems in real-time accordingly.

With Claude

DGX Inside

NVIDIA DGX is a specialized server system optimized for GPU-centric high-performance computing. This diagram illustrates the internal architecture of DGX, which maintains a server-like structure but is specifically designed for massive parallel processing.

The core of the DGX system consists of multiple high-performance GPUs interconnected not through conventional PCIe, but via NVIDIA’s proprietary NVLink and NVSwitch technologies. This configuration dramatically increases GPU-to-GPU communication bandwidth, maximizing parallel processing efficiency.

Key features:

  • Integration of multiple CPUs and eight GPUs through high-performance interconnects
  • All-to-all connectivity between GPUs via NVSwitch, minimizing communication bottlenecks (illustrated in the peer-access sketch below)
  • Hierarchical memory architecture combining High Bandwidth Memory (HBM) and DRAM
  • NVMe SSDs for high-speed storage
  • High-efficiency cooling system supporting dense computing environments
  • InfiniBand networking for high-speed connections between multiple DGX systems

This configuration is optimized for workloads requiring parallel processing such as deep learning, AI model training, and large-scale data analysis, enabling much more efficient GPU utilization compared to conventional servers.
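
As a small illustration of this GPU-to-GPU fabric, the sketch below (assuming PyTorch with CUDA devices available) checks which device pairs can reach each other directly via peer-to-peer access; on an NVSwitch-equipped DGX node, every pair would normally report yes.

```python
# Check direct GPU-to-GPU (peer-to-peer) reachability with PyTorch.
# On an NVSwitch-based DGX node, every GPU pair is typically P2P-capable.
import torch

n = torch.cuda.device_count()
print(f"{n} CUDA devices visible")
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: peer access {'yes' if ok else 'no'}")
```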

With Claude

NVLink, InfiniBand

This diagram compares two GPU networking technologies: NVLink and InfiniBand, both essential for parallel computing expansion.

On the left side, the “NVLink” section shows multiple GPUs connected vertically through purple interconnect bars. This represents the “Scale UP” approach, where GPUs are vertically scaled within a single system for tight integration.

On the right side, the “InfiniBand” section demonstrates how multiple server nodes connect through an InfiniBand network. This illustrates the “Scale Out” approach, where computing power expands horizontally across multiple independent systems.

Both technologies share the common goal of expanding parallel processing capability, but they take different architectural approaches. NVLink focuses on high-speed, direct connections between GPUs within a single system, while InfiniBand specializes in networking across multiple systems to support distributed computing environments.

The optimization of these expansion configurations is crucial for maximizing performance in high-performance computing, AI training, and other compute-intensive applications. System architects must carefully consider workload characteristics, data movement patterns, and scaling requirements when choosing between these technologies or determining how to best implement them together in hybrid configurations.
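
As a concrete example of the two working together, the sketch below (assuming PyTorch with the NCCL backend, launched via `torchrun` across several nodes) performs an all-reduce; NCCL typically routes intra-node traffic over NVLink/NVSwitch (scale up) and inter-node traffic over InfiniBand or Ethernet (scale out).

```python
# Multi-GPU / multi-node all-reduce sketch (PyTorch + NCCL).
# Launch example: torchrun --nnodes=2 --nproc-per-node=8 allreduce_demo.py
# NCCL usually uses NVLink/NVSwitch inside a node and InfiniBand between nodes.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")      # reads env vars set by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; all_reduce sums it across every GPU on
    # every node, exercising both the scale-up and scale-out paths.
    x = torch.ones(1024, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print("sum of ranks, replicated on every element:", x[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```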

With Claude

Connected in AI DC

This diagram titled “Data is Connected in AI DC” illustrates the relationships starting from workload scheduling in an AI data center.

Key aspects of the diagram:

  1. The entire system’s interconnected relationships begin with workload scheduling.
  2. The diagram divides the process into two major phases:
    • Deterministic phase: Primarily concerned with power requirements that operate in a predictable, planned manner.
    • Statistical phase: Focused on cooling requirements, where predictions vary based on external environmental conditions.
  3. The “Prophet Commander” at the workload scheduling stage can predict/direct future requirements, allowing the system to prepare power (1.1 Power Ready!!) and cooling (1.2 Cooling Ready!!) in advance.
  4. Process flow:
    • Job allocation from workload scheduling to GPU cluster
    • GPUs request and receive power
    • Temperature rises due to operations
    • Cooling system detects temperature and activates cooling

This diagram illustrates the interconnected workflow in AI data centers, beginning with workload scheduling that enables predictive resource management. The process flows from deterministic power requirements to statistical cooling needs, with the “Prophet Commander” enabling proactive preparation of power and cooling resources. This integrated approach demonstrates how workload prediction can drive efficient resource allocation throughout the entire AI data center ecosystem.
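
The "Prophet Commander" idea, predicting a job's resource needs at scheduling time and getting power and cooling ready before the GPUs spin up, might be sketched roughly as below. All of the job fields, coefficients, and print messages are hypothetical; they only illustrate the "1.1 Power Ready!! / 1.2 Cooling Ready!!" ordering shown in the diagram.

```python
# Hypothetical sketch of predictive scheduling: estimate a job's power and
# heat before dispatch, pre-provision power (deterministic) and cooling
# (statistical), then allocate the job to the GPU cluster.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus: int
    watts_per_gpu: float   # predicted per-GPU draw, e.g. from past runs

def predict_power_kw(job: Job) -> float:
    """Deterministic part: power scales roughly linearly with allocated GPUs."""
    return job.gpus * job.watts_per_gpu / 1000.0

def predict_cooling_kw(power_kw: float, ambient_c: float) -> float:
    """Statistical part: cooling need also depends on outside conditions.
    The 0.02-per-degree factor is a made-up illustrative coefficient."""
    return power_kw * (1.0 + 0.02 * max(0.0, ambient_c - 20.0))

def schedule(job: Job, ambient_c: float) -> None:
    p_kw = predict_power_kw(job)
    c_kw = predict_cooling_kw(p_kw, ambient_c)
    print(f"[1.1 Power Ready!!]   reserve {p_kw:.1f} kW for {job.name}")
    print(f"[1.2 Cooling Ready!!] stage {c_kw:.1f} kW of heat rejection")
    print(f"[2]  allocate {job.name} to GPU cluster ({job.gpus} GPUs)")

schedule(Job("llm-training", gpus=512, watts_per_gpu=700.0), ambient_c=30.0)
```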

With Claude

Analytical vs Empirical

Analytical vs Empirical Approaches

Analytical Approach

  1. Theory Driven: Based on mathematical theories and logical reasoning
  2. Programmable with Design: Implemented through explicit rules and algorithms
  3. Sequential by CPU: Tasks are processed one at a time in sequence
  4. Precise & Explainable: Results are accurate and decision-making processes are transparent

Empirical Approach

  1. Data Driven: Based on real data and observations
  2. Learned with Deep Learning: Neural networks automatically learn from data
  3. Parallel by GPU: Multiple tasks are processed simultaneously for improved efficiency
  4. Approximate & Unexplainable: Results are approximations and internal workings are difficult to explain

Summary

This diagram illustrates the key differences between traditional programming methods and modern machine learning approaches. The analytical approach follows clearly defined rules designed by humans and can precisely explain results, while the empirical approach learns patterns from data and improves efficiency through parallel processing but leaves decision-making processes as a black box.
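
The contrast can be made concrete with a toy example: the same decision ("is this inlet temperature abnormal?") implemented once as an explicit, explainable rule and once as a decision derived from labeled observations. A 1-nearest-neighbour classifier stands in here for a learned model, and both the threshold and the data points are invented for illustration.

```python
# Toy contrast: an explicit rule (analytical) vs. a decision derived from
# data (empirical). All numbers are invented purely for illustration.

# Analytical: a human-designed, fully explainable rule.
def is_abnormal_rule(inlet_temp_c: float) -> bool:
    return inlet_temp_c > 32.0   # the reason for any answer is transparent

# Empirical: labeled observations (temperature, 0 = normal / 1 = abnormal);
# the data, not a written rule, determines the boundary.
samples = [(24.0, 0), (26.0, 0), (29.0, 0), (31.0, 0),
           (33.0, 1), (36.0, 1), (40.0, 1)]

def is_abnormal_learned(inlet_temp_c: float) -> bool:
    _, label = min(samples, key=lambda s: abs(s[0] - inlet_temp_c))
    return label == 1

print(is_abnormal_rule(34.0), is_abnormal_learned(34.0))
```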

With Claude

GPU vs NPU on Deep learning

This diagram illustrates the differences between GPU and NPU from a deep learning perspective:

GPU (Graphics Processing Unit):

  • Originally developed for 3D game rendering
  • In deep learning, it’s utilized for parallel processing of vast amounts of data through complex calculations during the training process
  • Characterized by “More Computing = Bigger Memory = More Power,” requiring high computing power
  • Processes big data and vectorizes information using the “Everything to Vector” approach
  • Stores learning results in Vector Databases for future use

NPU (Neural Processing Unit):

  • Retrieves information from already trained Vector DBs or foundation models to generate answers to questions
  • This process is called “Inference”
  • While the training phase processes all data in parallel, the inference phase only searches/infers content related to specific questions to formulate answers
  • Performs parallel processing similar to how neurons function

In conclusion, GPUs are responsible for processing enormous amounts of data and storing learning results in vector form, while NPUs specialize in the inference process of generating actual answers to questions based on this stored information. This relationship can be summarized as “training creates and stores vast amounts of data, while inference utilizes this at the point of need.”
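
The "training creates and stores vectors, inference looks them up" idea can be illustrated with a toy vector store and a nearest-neighbour query (using NumPy). The corpus and the hash-based embedding function below are invented stand-ins, not a real foundation model, so the retrieval is structural only and carries no semantic meaning.

```python
# Toy illustration: a "training" phase produces vectors that are stored, and
# an "inference" phase answers a query by searching those stored vectors.
# The hash-based embed() is an invented stand-in for a real embedding model.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Fake embedding: hash the text into a reproducible unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# "Training" phase: process the corpus in bulk and persist the vectors.
corpus = ["GPU power draw", "liquid cooling pump", "rack power density"]
vector_db = np.stack([embed(doc) for doc in corpus])

# "Inference" phase: embed the question and retrieve the closest stored item.
query = embed("How much power does a rack use?")
scores = vector_db @ query                 # cosine similarity (unit vectors)
best = corpus[int(np.argmax(scores))]
print("closest stored item:", best)
```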

With Claude

AI in the data center

AI in the Data Center

This diagram titled “AI in the Data Center” illustrates two key transformational elements that occur when AI technology is integrated into data centers:

1. Computing Infrastructure Changes

  • AI workloads powered by GPUs become central to operations
  • Transition from traditional server infrastructure to GPU-centric computing architecture
  • Fundamental changes in data center hardware configuration and network connectivity

2. Management Infrastructure Changes

  • Increased requirements for power (“More Power!!”) and cooling (“More Cooling!!”) to support GPU infrastructure
  • Implementation of data-driven management systems utilizing AI technology
  • AI-based analytics and management for maintaining stability and improving efficiency

These two changes are interconnected, visually demonstrating how AI technology not only revolutionizes the computing capabilities of data centers but also necessitates innovation in management approaches to effectively operate these advanced systems.
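
A back-of-the-envelope sizing calculation shows why "More Power!!" and "More Cooling!!" follow directly from GPU-centric computing. The GPU count, per-server overhead factor, and PUE value below are assumptions chosen only for illustration; the 700 W per-GPU figure matches the H100-class upper bound mentioned earlier.

```python
# Rough sizing sketch: facility power and non-IT overhead for a GPU cluster.
# All inputs are illustrative assumptions, not vendor specifications.
GPUS = 1024
WATTS_PER_GPU = 700            # H100-class board power (upper bound)
SERVER_OVERHEAD = 0.35         # CPUs, memory, NICs, fans per server (assumed)
PUE = 1.3                      # assumed facility power usage effectiveness

it_load_kw = GPUS * WATTS_PER_GPU * (1 + SERVER_OVERHEAD) / 1000
facility_kw = it_load_kw * PUE             # total draw including cooling etc.
overhead_kw = facility_kw - it_load_kw     # power spent outside IT equipment

print(f"IT load:        {it_load_kw:,.0f} kW")
print(f"Facility power: {facility_kw:,.0f} kW (PUE {PUE})")
print(f"Non-IT (mostly cooling) overhead: {overhead_kw:,.0f} kW")
```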

With Claude