High Cost & High Risk with AI

Posted on 2025-11-04 (last modified 2025-11-02) by lechuck park

This image illustrates the high cost and high risk of AI/LLM (Large Language Model) training.

Key Analysis

Left: AI/LLM Growth Path

Evolution from Internet → Mobile & Cloud → AI/LLM (Transformer)
Each stage shows increasing fluctuations in the graph
Emphasizes “High Cost, High Risk” message

Center: Real Problem Visualization

The red graph shows sudden, dramatic spikes in the training curve that occurred during actual training runs.

Top Right: Silent Data Corruption (SDC) Issues

Silent data corruption from hardware failures:

Power drops, thermal stress → hardware faults
Silent errors → training divergence
6 SDC failures in a 54-day pretraining run

Bottom Right: Reliability Issues in Large-Scale ML Clusters (Meta Case)

Real failure cases:

8-GPU job: MTTF (Mean Time To Failure) averages 47.7 days
1024-GPU job: MTTF of 7.9 hours
16,384-GPU job: projected MTTF of approximately 1.8 hours
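
The trend in these figures can be approximated with a simple reliability model. The sketch below assumes failures are independent and exponentially distributed per GPU, so a job's MTTF is the per-GPU MTTF divided by the GPU count. This is an assumption for illustration, not the method behind the reported numbers, and the function names are hypothetical.

```python
import math

HOURS_PER_DAY = 24.0

def job_mttf_hours(per_gpu_mttf_hours: float, num_gpus: int) -> float:
    """MTTF of a job that fails when any one of its GPUs fails.

    Assumes independent, exponentially distributed failures (illustrative only).
    """
    return per_gpu_mttf_hours / num_gpus

def survival_probability(run_hours: float, mttf_hours: float) -> float:
    """Probability that a job runs for run_hours without any failure."""
    return math.exp(-run_hours / mttf_hours)

# Calibrate the per-GPU MTTF from the reported 8-GPU figure (47.7 days).
per_gpu = 47.7 * HOURS_PER_DAY * 8

for n in (8, 1024, 16384):
    print(f"{n:>6} GPUs: modeled MTTF = {job_mttf_hours(per_gpu, n):8.2f} h")

# Using the reported 1.8 h MTTF, chance that a 24 h run sees no failure:
print(f"P(no failure in 24 h) = {survival_probability(24.0, 1.8):.2e}")
```

Under this model the 1024-GPU job lands near 9 hours, close to the reported 7.9 hours, while the 16,384-GPU value comes out lower than the reported 1.8 hours. Treat the model as rough intuition for why MTTF shrinks with scale, not as a reproduction of the measured data.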

Summary

As GPU scale increases, mean time to failure drops sharply (from weeks at 8 GPUs to hours at thousands of GPUs), making large-scale AI training extremely costly and technically risky.
Hardware-induced silent data corruption causes training divergence, with 6 failures recorded in just 54 days of pretraining.
Meta’s data shows that jobs on very large GPU clusters can be expected to fail in under 2 hours, highlighting infrastructure reliability as a critical challenge.

#AITraining #LLM #MachineLearning #DataCorruption #GPUCluster #MLOps #AIInfrastructure #HardwareReliability #TransformerModels #HighPerformanceComputing #AIRisk #MLEngineering #DeepLearning

Large Scale Network Driven Design (DeepSeek-V3)

Posted on 2025-10-31 by lechuck park

DeepSeek-V3 Large-Scale Network Architecture Analysis

This image explains the Multi-Plane Fat-Tree network structure of DeepSeek-V3.

Core Architecture

1. 8-Plane Architecture

Consists of eight independent network channels (highways)
Maximizes network bandwidth and distributes traffic for enhanced scalability

2. Fat-Tree Topology

Two-layer switch structure:
- Leaf SW (Leaf Switches): Directly connected to GPUs
- Spine SW (Spine Switches): Interconnect leaf switches
Enables high-speed communication among all nodes (GPUs) while minimizing switch contention

3. GPU/IB NIC Pair

Each GPU is paired with a dedicated Network Interface Card (NIC)
Each pair is exclusively assigned to one of the eight planes to initiate communication

Communication Methods

NVLink

Ultra-high-speed connection between GPUs within the same node
Fast data transfer path used for intra-node communication

Cross-plane Traffic

Occurs when communication happens between different planes
Requires intra-node forwarding through another NIC, PCIe, or NVLink
Primary factor that increases latency
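
The distinction between NVLink, same-plane, and cross-plane traffic can be sketched as a small classifier. It assumes 8 GPUs per node and that the GPU at local index i in every node is attached to plane i. These numbers and all names are illustrative assumptions, not taken from DeepSeek's code.

```python
from dataclasses import dataclass

NUM_PLANES = 8  # assumed: one plane per GPU/NIC pair in a node

@dataclass(frozen=True)
class Gpu:
    node: int         # node (server) index
    local_index: int  # GPU index inside the node, 0..7

    @property
    def plane(self) -> int:
        # Assumed mapping: GPU i in every node uses plane i.
        return self.local_index % NUM_PLANES

def classify_path(src: Gpu, dst: Gpu) -> str:
    """Return which kind of path a message from src to dst would take."""
    if src.node == dst.node:
        return "nvlink"        # intra-node: stays on NVLink
    if src.plane == dst.plane:
        return "same-plane"    # inter-node, same plane: NIC -> leaf/spine -> NIC
    return "cross-plane"       # needs intra-node forwarding on one side of the hop

print(classify_path(Gpu(0, 3), Gpu(0, 5)))
print(classify_path(Gpu(0, 3), Gpu(7, 3)))
print(classify_path(Gpu(0, 3), Gpu(7, 5)))
```

Only the third case pays the extra forwarding step, which is why keeping traffic within a plane matters for latency.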

Network Optimization Process

The workflow below minimizes latency and prevents network congestion:

Workload Analysis
All to All (analyzing all-to-all communication patterns)
Plane & Layer Set (plane and layer assignment)
Profiling (Hot-path opt K) (hot-path optimization)
Static Routing (Hybrid) (hybrid static routing approach)

Goal: Low latency & no jamming

Scalability

This design is a scale-out network for large-scale distributed training supporting 16,384+ GPUs. Each plane operates independently to maximize overall system throughput.
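
One way to see where a figure like 16,384 can come from is the endpoint capacity of a two-layer fat-tree. The sketch assumes 64-port switches, with half of each leaf's ports facing GPUs and half facing spines. The port count is an assumption for illustration and is not stated in the diagram.

```python
def two_layer_fat_tree_endpoints(ports_per_switch: int) -> int:
    """Max endpoints in a non-blocking two-layer (leaf/spine) fat-tree.

    Each leaf uses half its ports for endpoints and half for uplinks;
    each spine port connects to one leaf, so there can be at most
    ports_per_switch leaves.
    """
    downlinks_per_leaf = ports_per_switch // 2
    max_leaves = ports_per_switch
    return downlinks_per_leaf * max_leaves

def multi_plane_gpus(ports_per_switch: int, planes: int) -> int:
    """GPUs supported when each plane is an independent fat-tree."""
    return two_layer_fat_tree_endpoints(ports_per_switch) * planes

print(multi_plane_gpus(ports_per_switch=64, planes=8))  # 16384
```

Each plane contributes 2,048 endpoints under these assumptions, so eight independent planes reach 16,384 without adding a third switch layer.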

3-Line Summary

DeepSeek-V3 uses an 8-plane fat-tree network architecture that connects 16,384+ GPUs through independent communication channels, minimizing contention and maximizing bandwidth.
The two-layer switch topology (Spine and Leaf) combined with dedicated GPU-NIC pairs enables efficient traffic distribution across planes.
Cross-plane traffic management and hot-path optimization ensure low-latency, high-throughput communication for large-scale AI training.

#DeepseekV3 #FatTreeNetwork #MultiPlane #NetworkArchitecture #ScaleOut #DistributedTraining #AIInfrastructure #GPUCluster #HighPerformanceComputing #NVLink #DataCenterNetworking #LargeScaleAI

With Claude

Optimize LLM

Posted on 2025-10-29 by lechuck park

LLM Optimization: Integration of Traditional Methods and New Paradigms

Core Message

LLM (Transformer) optimization requires more than just traditional optimization methodologies – new perspectives must be added.

1. Traditional Optimization Methodology (Left Side)

SW (Software) Optimization

Data Optimization
- Structure: Data structure design
- Copy: Data movement optimization
Logic Optimization
- Algorithm: Efficient algorithm selection
- Profiling: Performance analysis and bottleneck identification

Characteristics: Deterministic, logical approach

HW (Hardware) Optimization

Functions & Speed (B/W): Function and speed/bandwidth optimization
Fit For HW: Optimization for existing hardware
New HW implementation: New hardware design and implementation

Characteristics: Physical performance improvement focus

2. New Perspectives Required for LLM (Right Side)

SW Aspect: Human-Centric Probabilistic Approach

Human Language View / Human’s View
- Human language understanding methods
- Human thinking perspective
Human Learning
- Mimicking human learning processes

Key Point: Statistical and Probabilistic Methodology

Different from traditional deterministic optimization
Language patterns, probability distributions, and context understanding are crucial
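
To make the contrast with deterministic logic concrete, the sketch below turns raw scores into a probability distribution with softmax and then samples from it. The vocabulary, scores, and function names are made up for illustration.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, scores, temperature=1.0, rng=random):
    """Pick one token at random according to its softmax probability."""
    probs = softmax(scores, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["cat", "dog", "car"]       # hypothetical next-token candidates
scores = [2.0, 1.5, 0.1]            # hypothetical model scores (logits)

rng = random.Random(0)              # seeded so runs are repeatable
print(softmax(scores))
print([sample_token(vocab, scores, rng=rng) for _ in range(5)])
```

The same input can yield different tokens on different draws, which is the sense in which the output is a distribution to be shaped rather than a single value to be computed.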

HW Aspect: Massive Parallel Processing

Massive Simple Parallel
- Parallel processing of large-scale simple computations
- Hardware architecture capable of parallel processing (GPU/TPU) is essential

Key Point: Efficient parallel processing of large-scale matrix operations

3. Integrated Perspective

LLM Optimization = Traditional Optimization + New Paradigm

Domain	Traditional Method	Additional Elements for LLMs
SW	Algorithm, data structure optimization	+ Probabilistic/statistical approach (human language/learning perspective)
HW	Function/speed optimization	+ Massive parallel processing architecture

Conclusion

For effective LLM optimization:

Traditional optimization techniques (data, algorithms, hardware) as foundation
Probabilistic approach reflecting human language and learning methods
Hardware perspective supporting massive parallel processing

These three elements must be organically combined – this is the core message of the diagram.

Summary

LLM optimization requires integrating traditional deterministic SW/HW optimization with new paradigms: probabilistic/statistical approaches that mirror human language understanding and learning, plus hardware architectures designed for massive parallel processing. This represents a fundamental shift from conventional optimization, where human-centric probabilistic thinking and large-scale parallelism are not optional but essential dimensions.

#LLMOptimization #TransformerArchitecture #MachineLearningOptimization #ParallelProcessing #ProbabilisticAI #HumanLanguageView #GPUComputing #DeepLearningHardware #StatisticalML #AIInfrastructure #ModelOptimization #ScalableAI #NeuralNetworkOptimization #AIPerformance #ComputationalEfficiency

Who is the first wall?

Posted on 2025-10-23 (last modified 2025-10-22) by lechuck park

AI Scaling: The 6 Major Bottlenecks (2025)

1. Data

High-quality text data expected to be depleted by 2026
Solutions: Synthetic data (fraud detection in finance, medical data), Few-shot learning

2. LLM S/W (Algorithms)

Ilya Sutskever: “The era of simple scaling is over. Now it’s about scaling the right things”
Innovation directions: Test-time compute scaling (OpenAI o1), Mixture-of-Experts architecture, Hybrid AI

3. Computing → Heat

Training a GPT-3-scale model (175B parameters) has been estimated to take about a month on 1,024 A100 GPUs
By 2030, largest training runs projected at 2-45GW scale
GPU cluster heat generation makes cooling a critical challenge
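
For a sense of scale, training time can be estimated from the total floating-point work. The sketch uses the common approximation of about 8 FLOPs per parameter per token (forward, backward, and activation recomputation) and assumes 175B parameters, 300B tokens, and 140 TFLOP/s sustained per GPU. All of these inputs are assumptions for illustration rather than figures from this post.

```python
SECONDS_PER_DAY = 86_400

def training_days(params: float, tokens: float, num_gpus: int,
                  sustained_flops_per_gpu: float,
                  flops_per_param_token: float = 8.0) -> float:
    """Estimate wall-clock training time in days.

    total work ~= flops_per_param_token * params * tokens
    (6 for forward + backward, 8 if activations are recomputed).
    """
    total_flops = flops_per_param_token * params * tokens
    cluster_flops_per_second = num_gpus * sustained_flops_per_gpu
    return total_flops / cluster_flops_per_second / SECONDS_PER_DAY

days = training_days(params=175e9, tokens=300e9,
                     num_gpus=1024, sustained_flops_per_gpu=140e12)
print(f"{days:.1f} days")
```

Under these assumptions the estimate is roughly a month of continuous operation for 1,024 GPUs. Lower utilization, more tokens, or restarts after failures stretch it further, and all of that energy ends up as heat.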

4. Memory & Network ⚠️ Current Critical Bottleneck

Memory

LLM size grows about 410x/2yr and training compute demand about 750x/2yr, while DRAM bandwidth grows only about 2x/2yr
HBM3E completely sold out for 2024-2025. AI memory market projected to grow at 27.5% CAGR

Network

Speed of light limitation causes tens to hundreds of ms latency over distance. Critical for real-time applications (autonomous vehicles, AR)
Large-scale GPU clusters require 800Gbps+, microsecond-level ultra-low latency
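
The distance-driven latency can be checked with simple arithmetic. The sketch assumes light travels through optical fiber at about two thirds of its vacuum speed and ignores switching and queuing delay. The distances are illustrative.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458   # in vacuum
FIBER_VELOCITY_FACTOR = 2 / 3       # assumed typical value for optical fiber

def round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip propagation delay over fiber, in milliseconds."""
    speed_km_s = SPEED_OF_LIGHT_KM_S * FIBER_VELOCITY_FACTOR
    one_way_s = distance_km / speed_km_s
    return 2 * one_way_s * 1000

for km in (100, 1_000, 10_000):
    print(f"{km:>6} km: {round_trip_ms(km):6.1f} ms round trip")
```

A 1,000 km path already costs about 10 ms per round trip and an intercontinental 10,000 km path about 100 ms, before any equipment delay is added. No amount of hardware investment removes this floor.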

5. Power 💡 Long-term Core Constraint

Sam Altman: “The cost of AI will converge to the cost of energy. The abundance of AI will be limited by the abundance of energy”
Power infrastructure (transmission lines, transformers) takes years to build
Data centers projected to consume 7.5% of US electricity by 2030

6. Cooling

Advanced technologies like liquid cooling required. Infrastructure upgrades take 1+ year

“Who is the first wall?”

Critical Bottlenecks by Timeline:

Current (2025): Memory bandwidth + Data quality
Short-to-Mid term: Power infrastructure (5-10 years to build)
Long-term: Physical limit of the speed of light

Summary

The “first wall” in AI scaling is not a single barrier but a multi-layered constraint system that emerges sequentially over time. Today’s immediate challenges are memory bandwidth and data quality, followed by power infrastructure limitations in the mid-term, and ultimately the fundamental physical constraint of the speed of light. As Sam Altman emphasized, AI’s future abundance will be fundamentally limited by energy abundance, with all bottlenecks interconnected through the computing→heat→cooling→power chain.

#AIScaling #AIBottleneck #MemoryBandwidth #HBM #DataCenterPower #AIInfrastructure #SpeedOfLight #SyntheticData #EnergyConstraint #AIFuture #ComputingLimits #GPUCluster #TestTimeCompute #MixtureOfExperts #SamAltman #AIResearch #MachineLearning #DeepLearning #AIHardware #TechInfrastructure

With Claude

Data Center Shift with AI

Posted on 2025-10-22 by lechuck park

This diagram illustrates how data centers are transforming as they enter the AI era.

📅 Timeline of Technological Evolution

The top section shows major technology revolutions and their timelines:

Internet ’95 (Internet era)
Mobile ’07 (Mobile era)
Cloud ’10 (Cloud era)
Blockchain
AI(LLM) ’22 (Large Language Model-based AI era)

🏢 Traditional Data Center Components

Conventional data centers consisted of the following core components:

Software
Server
Network
Power
Cooling

These were designed as relatively independent layers.

🚀 New Requirements in the AI Era

With the introduction of AI (especially LLMs), data centers require specialized infrastructure:

LLM Model – Operating large language models
GPU – High-performance graphics processing units (essential for AI computations)
High B/W – High-bandwidth networks (for processing large volumes of data)
SMR/HVDC – Small Modular Reactor / High-Voltage Direct Current power systems
Liquid/CDU – Liquid cooling/Coolant Distribution Units (for cooling high-heat GPUs)

🔗 Key Characteristic of AI Data Centers: Integrated Design

The circular connection in the center of the diagram represents the most critical feature of AI data centers:

Tight Interdependency between SW/Computing/Network ↔ Power/Cooling

Unlike traditional data centers, in AI data centers:

GPU-based computing consumes enormous power and generates significant heat
High B/W networks consume additional power during massive data transfers between GPUs
Power systems (SMR/HVDC) must stably supply high power density
Liquid cooling (Liquid/CDU) must handle high-density GPU heat in real time

These elements must be closely integrated in design, and optimizing just one element cannot guarantee overall system performance.

💡 Key Message

AI workloads require moving beyond the traditional layer-by-layer independent design approach of conventional data centers, demanding that computing-network-power-cooling be designed as one integrated system. This demonstrates that a holistic approach is essential when building AI data centers.

📝 Summary

AI data centers fundamentally differ from traditional data centers through the tight integration of computing, networking, power, and cooling systems. GPU-based AI workloads create unprecedented power density and heat generation, requiring liquid cooling and HVDC power systems. Success in AI infrastructure demands holistic design where all components are co-optimized rather than independently engineered.

#AIDataCenter #DataCenterEvolution #GPUInfrastructure #LiquidCooling #AIComputing #LLM #DataCenterDesign #HighPerformanceComputing #AIInfrastructure #HVDC #HolisticDesign #CloudComputing #DataCenterCooling #AIWorkloads #FutureOfDataCenters

With Claude

New For AI

Posted on 2025-10-20 (last modified 2025-10-19) by lechuck park

Analysis of “New For AI” Diagram

This image, titled “New For AI,” systematically organizes the essential components required for building AI systems.

Structure Overview

Top Section: Fundamental Technical Requirements for AI (Two Pillars)

Left Domain – Computing Axis (Turquoise)

Massive Data
- Processing vast amounts of data that form the foundation for AI training and operations
Immense Computing
- Powerful computational capacity to process data and run AI models

Right Domain – Infrastructure Axis (Light Blue)

Enormous Energy
- Large-scale power supply to drive AI computing
High-Density Cooling
- Effective heat removal from high-performance computing operations

Central Link 🔗

Meaning of the Chain Link Icon:

For AI to achieve its performance, Computing (Data/Chips) and Infrastructure (Power/Cooling) don’t simply exist in parallel
They must be tightly integrated and optimized to work together
Symbolizes the interdependent relationship where strengthening only one side cannot unlock the full system’s potential

Bottom Section: Implementation Technologies (Stability & Optimization)

Learning & Inference/Reasoning (Learning and Inference Optimization)

Technologies to enhance AI model performance and efficiency:

Evals/Golden Set: Model evaluation and benchmarking
Safety Guardrails, RLHF-DPO: Safety assurance and human feedback-based learning
FlashAttention: Memory-efficient attention mechanism
Quant(INT8/FP8): Computational optimization through model quantization
Speculative/MTP Decoding: Inference speed enhancement techniques
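
As an illustration of the quantization item, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is a simplified scheme chosen for clarity, not the method of any particular framework, and the weight values are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.12, -0.50, 0.33, 0.01]      # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

print(q)
print([round(r, 3) for r in restored])
```

Each value is stored in one byte instead of four (for FP32), at the cost of a rounding error of at most half the scale per value. Production schemes usually refine this with per-channel or per-block scales.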

Massive Parallel Computing (Large-Scale Parallel Computing)

Hardware and network technologies enabling massive computation:

GB200/GB300 NVL72: NVIDIA rack-scale systems linking 72 GPUs over NVLink
HBM: High Bandwidth Memory
InfiniBand, NVLink: Ultra-high-speed interconnect technologies
AI factory: AI-dedicated data centers
TPU, MI3xx, NPU, DPU: Various AI-specialized chips
PIM, CXL, UALink: Memory-compute integration and next-gen interconnect interfaces
Silicon Photonics, UEC: Optical interconnects and Ultra Ethernet (Ultra Ethernet Consortium) networking

More Energy, Energy Efficiency (Energy Supply and Efficiency)

Technologies for stable and efficient power supply:

Smart Grid: Intelligent power grid
SMR: Small Modular Reactor (stable large-scale power source)
Renewable Energy: Renewable energy integration
ESS: Energy Storage System (power stabilization)
800V HVDC: High-voltage direct current power distribution (lower current, lower resistive loss)
Direct DC Supply: Direct DC supply (eliminating conversion losses)
Power Forecasting: AI-based power demand prediction and optimization
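
The loss argument for higher-voltage DC follows from resistive loss: P_loss = I²R with I = P/V. The sketch compares two distribution voltages for the same load. The 54 V baseline, the 100 kW load, and the cable resistance are illustrative assumptions.

```python
def resistive_loss_watts(load_watts: float, volts: float, ohms: float) -> float:
    """I^2 * R loss in a conductor delivering load_watts at the given voltage."""
    current = load_watts / volts
    return current ** 2 * ohms

LOAD_W = 100_000      # assumed rack load
CABLE_OHMS = 0.0001   # assumed conductor resistance

low = resistive_loss_watts(LOAD_W, 54, CABLE_OHMS)
high = resistive_loss_watts(LOAD_W, 800, CABLE_OHMS)

print(f"54 V:  {low:8.1f} W lost")
print(f"800 V: {high:8.1f} W lost")
print(f"ratio: {low / high:.0f}x")
```

Because loss scales with the square of current, raising the voltage by a factor of about 15 cuts resistive loss in the same conductor by a factor of over 200.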

High Heat Exchange & PUE (Heat Exchange and Power Efficiency)

Securing cooling system efficiency and stability:

Liquid Cooling: Liquid cooling (higher efficiency than air cooling)
CDU: Coolant Distribution Unit
D2C: Direct-to-Chip cooling
Immersing: Immersion cooling (complete liquid immersion)
100% Free Cooling: Utilizing external air (energy saving)
AI-Driven Cooling Optimization: AI-based cooling optimization
PUE Improvement: Power Usage Effectiveness (overall power efficiency metric)
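
PUE is the ratio of total facility power to IT equipment power, so values closer to 1.0 mean less overhead. The loads below are made up for illustration.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw

it_load_kw = 1_000                            # hypothetical IT load
air_cooled = pue(1_000 + 500, it_load_kw)     # 500 kW cooling/power overhead
liquid_cooled = pue(1_000 + 150, it_load_kw)  # 150 kW overhead

print(f"air-cooled PUE:    {air_cooled:.2f}")
print(f"liquid-cooled PUE: {liquid_cooled:.2f}")
```

In this hypothetical case, cutting overhead from 500 kW to 150 kW moves PUE from 1.50 to 1.15 for the same IT load.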

Key Message

This diagram emphasizes that for successful AI implementation:

Technical Foundation: Both Data/Chips (Computing) and Power/Cooling (Infrastructure) are necessary
Tight Integration: These two axes are not separate but must be firmly connected like a chain and optimized simultaneously
Implementation Technologies: Specific advanced technologies for stability and optimization in each domain must provide support

The central link particularly visualizes the interdependent relationship where “increasing computing power requires strengthening energy and cooling in tandem, and computing performance cannot be realized without infrastructure support.”

Summary

AI systems require two inseparable pillars: Computing (Data/Chips) and Infrastructure (Power/Cooling), which must be tightly integrated and optimized together like links in a chain. Each pillar is supported by advanced technologies spanning from AI model optimization (FlashAttention, Quantization) to next-gen hardware (GB200, TPU) and sustainable infrastructure (SMR, Liquid Cooling, AI-driven optimization). The key insight is that scaling AI performance demands simultaneous advancement across all layers: more computing power is meaningless without proportional energy supply and cooling capacity.

#AI #AIInfrastructure #AIComputing #DataCenter #AIChips #EnergyEfficiency #LiquidCooling #MachineLearning #AIOptimization #HighPerformanceComputing #HPC #GPUComputing #AIFactory #GreenAI #SustainableAI #AIHardware #DeepLearning #AIEnergy #DataCenterCooling #AITechnology #FutureOfAI #AIStack #MLOps #AIScale #ComputeInfrastructure

With Claude

Cooling for AI (heavy heater)

Posted on 2025-10-17 (last modified 2025-10-16) by lechuck park

AI Data Center Cooling System Architecture Analysis

This diagram illustrates the evolution of data center cooling systems designed for high-heat AI workloads.

Traditional Cooling System (Top Section)

Three-Stage Cooling Process:

Cooling Tower – Uses ambient air to cool water
Chiller – Further refrigerates the cooled water
CRAH (Computer Room Air Handler) – Distributes cold air to the server room

Free Cooling option is shown, which reduces chiller operation by leveraging low outside temperatures for energy savings.

New Approach for AI DC: Liquid Cooling System (Bottom Section)

To address extreme heat generation from high-density AI chips, a CDU (Coolant Distribution Unit) based liquid cooling system has been introduced.

Key Components:

① Coolant Circulation and Distribution

Direct coolant circulation system to servers

② Heat Exchanges (Two Methods)

Direct-to-Chip (D2C) Liquid Cooling: Cold plate with manifold distribution system directly contacting chips
Rear-Door Heat Exchanger (RDHx): Liquid-cooled heat exchanger mounted on the rack rear door that removes heat from server exhaust air (distinct from immersion cooling)

③ Pumping and Flow Control

Pumps and flow control for coolant circulation

④ Filtration and Coolant Quality Management

Maintains coolant quality and removes contaminants

⑤ Monitoring and Control

Real-time monitoring and cooling performance control

Critical Differences

Traditional Method: Air cooling → Indirect, suitable for low-density workloads

AI DC Method: Liquid cooling → Direct, high-efficiency, capable of handling high TDP (Thermal Design Power) of AI chips

Water conducts heat roughly 25x better than air, making liquid cooling effective for AI accelerators (GPUs, TPUs) that each generate hundreds of watts to kilowatt-level heat.
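
How much coolant a rack needs follows from the heat balance Q = m_dot × c_p × ΔT. The sketch assumes water as the coolant, a 100 kW rack, and a 10 K temperature rise across the rack. These values are illustrative and not taken from the diagram.

```python
WATER_SPECIFIC_HEAT = 4186.0   # J/(kg*K), approximate
WATER_DENSITY = 1.0            # kg per liter, approximate

def required_flow_l_per_min(heat_watts: float, delta_t_kelvin: float) -> float:
    """Coolant flow needed to carry heat_watts with a delta_t_kelvin rise."""
    mass_flow_kg_s = heat_watts / (WATER_SPECIFIC_HEAT * delta_t_kelvin)
    return mass_flow_kg_s / WATER_DENSITY * 60

print(f"{required_flow_l_per_min(100_000, 10):.0f} L/min")
```

Under these assumptions a 100 kW rack needs a little over 140 liters of water per minute, which is the kind of flow the CDU's pumps and controls have to sustain.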

Summary:

Traditional data centers use air-based cooling (Cooling Tower → Chiller → CRAH), suitable for standard workloads.
AI data centers require liquid cooling with CDU systems due to extreme heat from high-density AI chips.
Liquid cooling offers direct-to-chip heat removal, with roughly 25x the thermal conductivity of air, supporting kW-level heat dissipation.

#AIDataCenter #LiquidCooling #DataCenterInfrastructure #CDU #ThermalManagement #DirectToChip #AIInfrastructure #GreenDataCenter #HeatDissipation #HyperscaleComputing #AIWorkload #DataCenterCooling #ImmersionCooling #EnergyEfficiency #NextGenDataCenter

With Claude