LLM goes with Computing-Power-Cooling

LLM’s Computing-Power-Cooling Relationship

This diagram illustrates the technical architecture and potential issues that can occur when operating LLMs (Large Language Models).

Normal Operation (Top Left)

  1. Computing Requires – the LLM workload is delivered to the processor
  2. Power Requires – power is supplied to the processor, with voltage and frequency managed via DVFS (Dynamic Voltage and Frequency Scaling)
  3. Heat Generated – heat is produced as a by-product of computation
  4. Cooling Requires – temperature is kept within limits by an adequate cooling system

Problem Scenarios

Power Issue (Top Right)

  • Symptom: Insufficient power (kW & Quality)
  • Results:
    • Computing performance degradation
    • Power throttling or errors
    • LLM workload errors

Cooling Issue (Bottom Right)

  • Symptom: Insufficient cooling (Temperature & Density)
  • Results:
    • Abnormal heat generation
    • Thermal throttling or errors
    • Computing performance degradation
    • LLM workload errors

Key Message

For stable LLM operations, the three elements of Computing-Power-Cooling must be balanced. If any one element is insufficient, it leads to system-wide performance degradation or errors. This emphasizes that AI infrastructure design must consider not only computing power but also adequate power supply and cooling systems together.
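
To make the two failure scenarios above observable in practice, here is a minimal monitoring sketch. It assumes the nvidia-ml-py (pynvml) bindings and an NVIDIA GPU; the reporting is illustrative, not a production health check.

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Read instantaneous power (NVML reports milliwatts) and core temperature.
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

# The throttle-reason bitmask distinguishes power capping from thermal slowdown,
# i.e. the "Power Issue" and "Cooling Issue" paths in the diagram.
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
if reasons & pynvml.nvmlClocksThrottleReasonSwPowerCap:
    print(f"Power throttling active at {power_w:.0f} W")
if reasons & (pynvml.nvmlClocksThrottleReasonSwThermalSlowdown
              | pynvml.nvmlClocksThrottleReasonHwThermalSlowdown):
    print(f"Thermal throttling active at {temp_c} C")

pynvml.nvmlShutdown()
```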


Summary

  • LLM operation requires a critical balance between computing, power supply, and cooling infrastructure.
  • Insufficient power causes power throttling, while inadequate cooling leads to thermal throttling, both resulting in workload errors.
  • Successful AI infrastructure design must holistically address all three components rather than focusing solely on computational capacity.

#LLM #AIInfrastructure #DataCenter #ThermalManagement #PowerManagement #AIOperations #MachineLearning #HPC #DataCenterCooling #AIHardware #ComputeOptimization #MLOps #TechInfrastructure #AIatScale #GreenAI

With Claude

Basic LLM Workflow

Basic LLM Workflow Interpretation

This diagram illustrates how data flows through various hardware components during the inference process of a Large Language Model (LLM).

Step-by-Step Breakdown

① Initialization Phase (Warm weights)

  • Model weights are loaded from SSD → DRAM → HBM (High Bandwidth Memory)
  • Weights are distributed and shared across multiple GPUs

② Input Processing (CPU tokenizes/batches)

  • The CPU tokenizes the input text and assembles batches
  • Data is transferred through a DRAM buffer to the GPU (a minimal tokenization sketch follows below)
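
As a rough illustration of this CPU-side step, the sketch below uses the Hugging Face transformers tokenizer API; the "gpt2" checkpoint is just a placeholder, not the model in the diagram.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder checkpoint
tokenizer.pad_token = tokenizer.eos_token           # gpt2 has no pad token by default

prompts = ["What is HBM?", "Explain KV cache offloading in one sentence."]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

# The tokenized batch sits in host DRAM; moving it to the GPU is the
# DRAM-buffer -> GPU transfer shown in the diagram.
device = "cuda" if torch.cuda.is_available() else "cpu"
batch = {k: v.to(device) for k, v in batch.items()}
print(batch["input_ids"].shape)                     # (batch_size, padded_seq_len)
```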

③ GPU Inference Execution

  • The GPU performs Attention and FFN (Feed-Forward Network) computations, reading weights and activations from HBM
  • The KV cache (Key-Value cache) is stored in HBM
  • If HBM capacity is tight, the KV cache can be offloaded to DRAM or SSD (see the sketch below)
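
The toy sketch below shows why the KV cache grows during decoding and what offloading means at the tensor level; shapes are illustrative, and real systems use paged/block cache managers rather than plain concatenation.

```python
import torch

n_heads, d_head = 8, 64
device = "cuda" if torch.cuda.is_available() else "cpu"

k_cache = torch.empty(0, n_heads, d_head, device=device)
v_cache = torch.empty(0, n_heads, d_head, device=device)

for step in range(16):                     # one decode step per generated token
    k_new = torch.randn(1, n_heads, d_head, device=device)
    v_new = torch.randn(1, n_heads, d_head, device=device)
    k_cache = torch.cat([k_cache, k_new])  # cache grows with sequence length
    v_cache = torch.cat([v_cache, v_new])

# If HBM runs short, older blocks can be parked in host DRAM and copied back on demand.
k_offloaded = k_cache[:8].to("cpu")
v_offloaded = v_cache[:8].to("cpu")
print(k_cache.shape, k_offloaded.shape)
```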

④ Distributed Communication (NVLink/InfiniBand)

  • Intra-node: High-speed communication between GPUs via NVLink (with NVSwitch if available)
  • Inter-node: Collective communication over InfiniBand, typically orchestrated by the NCCL library (see the sketch below)
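
A minimal all-reduce sketch using PyTorch's NCCL backend, which rides on NVLink within a node and InfiniBand (or Ethernet) across nodes; it assumes a multi-GPU host and a torchrun launch.

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> allreduce_demo.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # torchrun supplies rank / world size
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Each rank contributes a tensor; all_reduce sums the partial results -- the
# same collective pattern used for tensor-parallel and data-parallel exchange.
x = torch.ones(4, device="cuda") * (dist.get_rank() + 1)
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}: {x.tolist()}")

dist.destroy_process_group()
```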

⑤ Post-processing (CPU decoding/post)

  • CPU decodes generated tokens and performs post-processing
  • Logs and caches are saved to SSD

Key Characteristics

This architecture leverages a memory hierarchy to efficiently execute large-scale models:

  • SSD: Long-term storage (slowest, largest capacity)
  • DRAM: Intermediate buffer
  • HBM: GPU-dedicated high-speed memory (fastest, limited capacity)

When the model size exceeds a single GPU's memory, strategies include distributing the model across multiple GPUs or offloading data to the larger, slower tiers (DRAM or SSD).
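
One common way to realize this tiering in practice is the Hugging Face transformers + accelerate stack, sketched below; the checkpoint name is a placeholder, and device_map="auto" is just one possible offloading policy.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-large-model",   # placeholder checkpoint name
    torch_dtype=torch.float16,
    device_map="auto",             # fill GPU HBM first, then spill to CPU DRAM
    offload_folder="offload",      # last-resort tier: weights parked on SSD
)
```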


Summary

This diagram shows how LLMs process data through a memory hierarchy (SSD→DRAM→HBM) across CPU and GPU components. The workflow involves loading model weights, tokenizing inputs on the CPU, running inference on the GPU with HBM, and using distributed communication (NVLink/InfiniBand) for multi-GPU setups. Memory management strategies like KV cache offloading enable efficient execution of large models that exceed single-GPU capacity.

#LLM #DeepLearning #GPUComputing #MachineLearning #AIInfrastructure #NeuralNetworks #DistributedComputing #HPC #ModelOptimization #AIArchitecture #NvLink #Transformer #MLOps #AIEngineering #ComputerArchitecture

With Claude

High Cost & High Risk with AI

This image illustrates the high cost and high risk of AI/LLM (Large Language Model) training.

Key Analysis

Left: AI/LLM Growth Path

  • Evolution from Internet → Mobile & Cloud → AI/LLM (Transformer)
  • Each stage shows increasing fluctuations in the graph
  • Emphasizes “High Cost, High Risk” message

Center: Real Problem Visualization

The red graph shows dramatic spikes in the training curve recorded during actual training runs.

Top Right: Silent Data Corruption (SDC) Issues

Silent data corruption from hardware failures:

  • Power drops, thermal stress → hardware faults
  • Silent errors → training divergence
  • 6 SDC failures in a 54-day pretraining run

Bottom Right: Reliability Issues in Large-Scale ML Clusters (Meta Case)

Real failure cases:

  • 8-GPU job: MTTF (Mean Time To Failure) of 47.7 days on average
  • 1024-GPU job: MTTF of 7.9 hours
  • 16,384-GPU job: failure roughly every 1.8 hours (a rough scaling check follows below)
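
A back-of-the-envelope check of these numbers: if individual GPU failures are roughly independent, a job's MTTF shrinks in inverse proportion to how many GPUs it spans.

```python
mttf_8gpu_hours = 47.7 * 24                     # reported MTTF for an 8-GPU job

for n_gpus in (1024, 16384):
    predicted = mttf_8gpu_hours / (n_gpus / 8)  # inverse-proportional scaling
    print(f"{n_gpus:>6} GPUs -> predicted MTTF ~ {predicted:.1f} h")

# Predicts ~8.9 h for 1024 GPUs (reported: 7.9 h) and ~0.6 h for 16,384 GPUs
# (reported: ~1.8 h); the trend, not the exact figures, is the point.
```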

Summary

  1. As GPU count grows, the expected time between failures shrinks roughly in proportion to cluster size, making large-scale AI training extremely costly and technically risky.
  2. Hardware-induced silent data corruption causes training divergence, with 6 failures recorded in just 54 days of pretraining.
  3. Meta’s experience shows massive GPU clusters can fail in under 2 hours, highlighting infrastructure reliability as a critical challenge.

#AITraining #LLM #MachineLearning #DataCorruption #GPUCluster #MLOps #AIInfrastructure #HardwareReliability #TransformerModels #HighPerformanceComputing #AIRisk #MLEngineering #DeepLearning

AI approach

Legacy – The Era of Scale-Up

Traditional AI approach showing its limitations:

  • Simple Data: Starting with basic data
  • Simple Data & Logic: Combining data with logic
  • Better Data & Logic: Improving data and logic
  • Complex Data & Logic: Advancing to complex data and logic
  • Near The Limitation: Eventually hitting a fundamental ceiling

This approach gradually increases complexity, but no matter how much it improves, it inevitably runs into fundamental scalability limitations.

AI Works – The Era of Scale-Out

Modern AI transcending the limitations of the legacy approach through a new paradigm:

  • The left side shows the limitations of the old approach
  • The lightbulb icon in the middle represents a paradigm shift (Breakthrough)
  • The large purple box on the right demonstrates a completely different approach:
    • Massive parallel processing of countless “01/10” units (neural network neurons)
    • Horizontal scaling (Scale-Out) instead of sequential complexity increase
    • Fundamentally overcoming the legacy limitations

Key Message

No matter how much you improve the legacy approach, there’s a ceiling. AI breaks through that ceiling with a completely different architecture.


Summary

  • Legacy AI hits fundamental limits by sequentially increasing complexity (Scale-Up)
  • Modern AI uses massive parallel processing architecture to transcend these limitations (Scale-Out)
  • This represents a paradigm shift from incremental improvement to architectural revolution

#AI #MachineLearning #DeepLearning #NeuralNetworks #ScaleOut #Parallelization #AIRevolution #Paradigmshift #LegacyVsModern #AIArchitecture #TechEvolution #ArtificialIntelligence #ScalableAI #DistributedComputing #AIBreakthrough

Multi-Head Latent Attention – Changes

Multi-Head Latent Attention (MLA) Interpretation

This image is a technical diagram explaining the structure of Multi-Head Latent Attention (MLA).

🎯 Core Concept

MLA is a mechanism that improves the memory efficiency of traditional Multi-Head Attention.

Traditional Approach (Before) vs MLA

Traditional Approach:

  • Stores K, V vectors of all past tokens
  • Memory usage increases linearly with sequence length

MLA:

  • Compresses the Key-Value information of each token into a small latent vector (c^KV)
  • The cache still grows with sequence length, but with a far smaller per-token footprint (a rough footprint comparison follows below)
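
For intuition, here is a rough per-token cache comparison; all dimensions are made up for the example (they are not taken from the diagram), and the small decoupled RoPE key is ignored.

```python
n_layers, n_heads, d_head, d_latent = 32, 32, 128, 512
bytes_per_value = 2                                                  # BF16

mha_per_token = n_layers * 2 * n_heads * d_head * bytes_per_value    # full K and V
mla_per_token = n_layers * d_latent * bytes_per_value                # latent only

print(f"Standard MHA cache: {mha_per_token / 1024:.0f} KiB per token")  # ~512 KiB
print(f"MLA latent cache  : {mla_per_token / 1024:.0f} KiB per token")  # ~32 KiB
```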

📊 Architecture Explanation

1. Input Processing

  • Starts from Input Hidden State (h_t)

2. Latent Vector Generation

  • Latent c_t^Q: For Query of current token (compressed representation)
  • Latent c_t^KV: For Key-Value (cached and reused)

3. Query, Key, Value Generation

  • Query (q): Generated from current token (h_t)
  • Key-Value: Generated from Latent c_t^KV
    • The Compressed (C) parts are up-projected from c_t^KV, while the decoupled RoPE (R) parts carry rotary positional information
    • The two parts are concatenated before attention

4. Multi-Head Attention Execution

  • Performs attention computation with generated Q, K, V
  • Uses BF16 (Mixed Precision)

✅ Key Advantages

  1. Memory Efficiency: Compresses past information into fixed-size vectors
  2. Faster Inference: Reuses cached Latent vectors
  3. Information Preservation: Maintains performance by combining compressed content with decoupled positional (RoPE) information
  4. Mixed Precision Support: Utilizes FP8, FP32, BF16

🔑 Key Differences

  • There is no v_t^R component: unlike keys, values carry no decoupled RoPE part (the purple box on the right side of the diagram)
  • Values are reconstructed from the cached latent c_t^KV rather than having their own positional path
  • This enables an efficient combination of compressed past information and current information

This architecture is an innovative approach to solve the KV cache memory problem during LLM inference.
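
A minimal PyTorch-style sketch of the caching idea is given below. Dimensions are made up, the decoupled RoPE query/key path is omitted, and this illustrates the principle rather than any particular model's implementation: only the small latent c_t^KV persists between steps, and full keys/values are up-projected from it on the fly.

```python
import torch

d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64
W_dkv = torch.randn(d_model, d_latent) * 0.02           # down-projection to latent
W_uk = torch.randn(d_latent, n_heads * d_head) * 0.02   # up-projection to keys
W_uv = torch.randn(d_latent, n_heads * d_head) * 0.02   # up-projection to values

latent_cache = []  # the only thing that persists between decoding steps

def decode_step(h_t):
    """Process one token's hidden state h_t of shape (d_model,)."""
    c_kv = h_t @ W_dkv                        # compress to a small latent
    latent_cache.append(c_kv)                 # cache d_latent floats, not 2*n_heads*d_head
    C = torch.stack(latent_cache)             # (seq_len, d_latent)
    K = (C @ W_uk).view(-1, n_heads, d_head)  # reconstruct keys on the fly
    V = (C @ W_uv).view(-1, n_heads, d_head)  # reconstruct values on the fly
    return K, V

K, V = decode_step(torch.randn(d_model))
print(K.shape, V.shape)   # full K/V exist only transiently; the cache stays small
```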


Summary

MLA replaces the full per-token KV cache with compact latent vectors, dramatically reducing memory consumption during inference. It combines compressed past information with current token data through an efficient attention mechanism. This innovation enables faster and more memory-efficient LLM inference while largely maintaining model performance.

#MultiHeadLatentAttention #MLA #TransformerOptimization #LLMInference #KVCache #MemoryEfficiency #AttentionMechanism #DeepLearning #NeuralNetworks #AIArchitecture #ModelCompression #EfficientAI #MachineLearning #NLP #LargeLanguageModels

With Claude

Data Center Shift with AI

Data Center Shift with AI

This diagram illustrates how data centers are transforming as they enter the AI era.

📅 Timeline of Technological Evolution

The top section shows major technology revolutions and their timelines:

  • Internet ’95 (Internet era)
  • Mobile ’07 (Mobile era)
  • Cloud ’10 (Cloud era)
  • Blockchain
  • AI(LLM) ’22 (Large Language Model-based AI era)

🏢 Traditional Data Center Components

Conventional data centers consisted of the following core components:

  • Software
  • Server
  • Network
  • Power
  • Cooling

These were designed as relatively independent layers.

🚀 New Requirements in the AI Era

With the introduction of AI (especially LLMs), data centers require specialized infrastructure:

  1. LLM Model – Operating large language models
  2. GPU – High-performance graphics processing units (essential for AI computations)
  3. High B/W – High-bandwidth networks (for processing large volumes of data)
  4. SMR/HVDC – Switched-Mode Rectifier/High-Voltage Direct Current power systems
  5. Liquid/CDU – Liquid cooling/Coolant Distribution Units (for cooling high-heat GPUs)

🔗 Key Characteristic of AI Data Centers: Integrated Design

The circular connection in the center of the diagram represents the most critical feature of AI data centers:

Tight Interdependency between SW/Computing/Network ↔ Power/Cooling

Unlike traditional data centers, in AI data centers:

  • GPU-based computing consumes enormous power and generates significant heat
  • High B/W networks consume additional power during massive data transfers between GPUs
  • Power systems (SMR/HVDC) must stably supply high power density
  • Liquid cooling (Liquid/CDU) must handle high-density GPU heat in real-time

These elements must be closely integrated in design, and optimizing just one element cannot guarantee overall system performance.

💡 Key Message

AI workloads require moving beyond the traditional layer-by-layer independent design approach of conventional data centers, demanding that computing-network-power-cooling be designed as one integrated system. This demonstrates that a holistic approach is essential when building AI data centers.


📝 Summary

AI data centers fundamentally differ from traditional data centers through the tight integration of computing, networking, power, and cooling systems. GPU-based AI workloads create unprecedented power density and heat generation, requiring liquid cooling and HVDC power systems. Success in AI infrastructure demands holistic design where all components are co-optimized rather than independently engineered.

#AIDataCenter #DataCenterEvolution #GPUInfrastructure #LiquidCooling #AIComputing #LLM #DataCenterDesign #HighPerformanceComputing #AIInfrastructure #HVDC #HolisticDesign #CloudComputing #DataCenterCooling #AIWorkloads #FutureOfDataCenters

With Claude

MoE & More

MoE & More – Architecture Interpretation

This diagram illustrates an advanced Mixture of Experts (MoE) model architecture.

Core Structure

1. Two Types of Experts

  • Shared Expert (Generalist)
    • Handles common knowledge: basic language structure, context understanding, general common sense
    • Applied universally to all tokens
  • Routed Expert (Specialist)
    • Handles specialized knowledge: coding, math, translation, etc.
    • Router selects the K most suitable experts for each token

2. Router (Gateway) Role

For each token, determines “Who’s best for handling this word?” by:

  • Selecting K experts out of N available specialists
  • Using a Top-K selection mechanism (sketched below)
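
A minimal routing sketch, with illustrative hidden size and expert count (neither is taken from the diagram): the router scores every expert for each token and keeps only the top K.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, k=2):
    logits = hidden @ router_weight                      # (tokens, n_experts)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(k, dim=-1)
    gates = topk_probs / topk_probs.sum(dim=-1, keepdim=True)  # renormalize to 1
    return topk_idx, gates

tokens = torch.randn(4, 512)             # 4 tokens, hidden size 512
router_w = torch.randn(512, 8) * 0.02    # 8 routed experts
idx, gates = route_tokens(tokens, router_w, k=2)
print(idx)     # which 2 experts each token is sent to
print(gates)   # how much weight each selected expert gets
```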

Key Optimization Techniques

Select Top-K 🎯

  • Chooses K most suitable routed experts
  • Distributes work evenly and occasionally tries new experts

Stabilize ⚖️

  • Prevents work from piling up on specific experts
  • Sets capacity limits and adds slight randomness (see the capacity sketch below)
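
The bookkeeping behind a capacity limit looks roughly like the simplified sketch below; real systems typically reroute overflow tokens or add an auxiliary load-balancing loss rather than silently dropping assignments.

```python
import torch

def enforce_capacity(expert_idx, n_experts, capacity):
    """Mask out token->expert assignments beyond each expert's capacity."""
    keep = torch.zeros_like(expert_idx, dtype=torch.bool)
    counts = torch.zeros(n_experts, dtype=torch.long)
    flat_idx, flat_keep = expert_idx.view(-1), keep.view(-1)
    for i, e in enumerate(flat_idx.tolist()):
        if counts[e] < capacity:      # expert still has room for this token
            counts[e] += 1
            flat_keep[i] = True
    return keep, counts

assignments = torch.randint(0, 4, (16, 2))   # 16 tokens, top-2 of 4 experts (illustrative)
keep, counts = enforce_capacity(assignments, n_experts=4, capacity=10)
print(counts)              # tokens accepted per expert, capped at the capacity limit
print(keep.sum().item())   # how many token->expert assignments were kept overall
```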

2-Stage Decouple 🔍

  • Creates a shortlist of candidate experts
  • Separately checks “Are they available now?” + “Are they good at this?”
  • Calculates and mixes the two criteria separately before final decision
  • Validates availability and skill before selection (a toy sketch follows below)
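
One way to read this, sketched below as a toy interpretation (the scoring functions and mixing rule are assumptions, not taken from the diagram): compute an affinity score and an availability score separately, then mix them before the final Top-K decision.

```python
import torch

def two_stage_select(affinity_logits, current_load, k=2, load_weight=0.5):
    """affinity_logits: (tokens, experts); current_load: (experts,) in [0, 1]."""
    affinity = torch.softmax(affinity_logits, dim=-1)   # "are they good at this?"
    availability = 1.0 - current_load                   # "are they available now?"
    mixed = affinity * (availability ** load_weight)    # one simple way to combine
    return mixed.topk(k, dim=-1).indices

logits = torch.randn(4, 8)   # 4 tokens, 8 routed experts (illustrative)
load = torch.tensor([0.9, 0.1, 0.5, 0.2, 0.95, 0.3, 0.0, 0.6])
print(two_stage_select(logits, load))   # heavily loaded experts are de-prioritized
```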

Systems

  • Positions experts close together (reduces network delay)
  • Groups tokens for batch processing
  • Improves communication efficiency

Adaptive & Safety Loop 🔄

  • Adjusts K value in real-time (uses more/fewer experts as needed)
  • Redirects to backup path if experts are busy
  • Continuously monitors load, overflow, and performance
  • Auto-adjusts when issues arise

Purpose

This system enhances both efficiency and performance through:

  • Optimized expert placement
  • Accelerated batch processing
  • Real-time monitoring with immediate problem response

Summary

MoE & More combines generalist experts (common knowledge) with specialist experts (domain-specific skills), using an intelligent router to dynamically select the best K experts for each token. Advanced techniques like 2-stage decoupling, stabilization, and adaptive safety loops ensure optimal load balancing, prevent bottlenecks, and enable real-time adjustments for maximum efficiency. The result is a faster, more efficient, and more reliable AI system that scales intelligently.

#MixtureOfExperts #MoE #AIArchitecture #MachineLearning #DeepLearning #LLM #NeuralNetworks #AIOptimization #ScalableAI #RouterMechanism #ExpertSystems #AIEfficiency #LoadBalancing #AdaptiveAI #MLOps

With Claude