This infographic illustrates how AI’s exponential growth triggers a cascading exponential expansion across all interconnected domains.
Core Concept: Exponential Chain Reaction
Top Process Chain: AI’s exponential growth creates proportionally exponential demands at each stage:
AI (LLM) ≈ Data ≈ Computing ≈ Power ≈ Cooling
The “≈” symbol indicates that each element grows exponentially in proportion to the others. When AI doubles, the required data, computing, power, and cooling all scale proportionally.
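As a rough illustration of the “≈” coupling, the short Python sketch below propagates a single growth factor through the chain; the one-to-one coupling is an assumption made for illustration, not a measured relationship.

```python
# Toy model of the chain AI ≈ Data ≈ Computing ≈ Power ≈ Cooling:
# a single growth factor is assumed to propagate 1:1 through every stage.
def project(years: int, growth_per_year: float = 2.0) -> dict:
    scale = growth_per_year ** years
    # Illustrative 1:1 coupling; real coupling ratios vary by facility and workload.
    return {stage: scale for stage in ("ai", "data", "computing", "power", "cooling")}

print(project(3))  # after 3 years of doubling, every stage is ~8x its baseline
```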
Evidence of Exponential Growth Across Domains
1. AI Networking & Global Data Generation (Top Left)
Exponential increase beginning in the 2010s
Vertical surge post-2020
2. Data Center Electricity Demand (Center Left)
Sharp increase projected between 2026 and 2030
Orange (AI workloads) overwhelms blue (traditional workloads)
AI is the primary driver of total power demand growth
3. Power Production Capacity (Center Right)
2005-2030 trends across various energy sources
Power generation must scale alongside AI demand
4. AI Computing Usage (Right)
Most dramatic exponential growth
Modern AI era begins in 2012
Doubling every 6 months (extremely rapid exponential growth)
Over 300,000x increase since 2012 (a rough arithmetic check follows this list)
Three exponential growth phases shown against a log-scale axis (1e+0 through 1e+6)
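A quick back-of-envelope check of the doubling-rate and ~300,000x figures; the exact start and end years are assumptions read off the chart.

```python
# Doubling every 6 months = 2 doublings per year.
years = 2021 - 2012            # assumed span covered by the chart
doublings = 2 * years
print(2 ** doublings)          # 262144, i.e. on the order of the quoted ~300,000x
```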
Key Message
This infographic demonstrates that AI development is not an isolated phenomenon but triggers exponential evolution across the entire ecosystem:
As AI models advance → Data requirements grow exponentially
As data increases → Computing power needs scale exponentially
As computing expands → Power consumption rises exponentially
As power consumption grows → Cooling systems must expand exponentially
All elements are tightly interconnected, creating a ‘cascading exponential effect’ where exponential growth in one domain simultaneously triggers exponential development and demand across all other domains.
This image explains the Multi-Head Latent Attention (MLA) compression technique from two perspectives.
Core Concepts
Left Panel: Matrix Perspective of Compression
Multiple attention heads (represented as cross-shaped matrices) are consolidated into a single compressed matrix
Multiple independent matrices are transformed into one compressed representation that retains their shared features
The original can be reconstructed from this compressed representation
Only minor loss occurs while achieving dramatic N-to-1 compression (a small low-rank sketch follows this list)
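A minimal numpy sketch of this matrix-perspective idea, using a plain truncated SVD as a stand-in for the learned compression: when the heads are redundant (share low-rank structure), many matrices collapse into one small representation that reconstructs them with only minor loss. The sizes and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "heads" that share structure: each is built from the same 16-dim core,
# so the stack is (nearly) low-rank -- the situation in which compression works well.
core = rng.normal(size=(64, 16))
heads = [core @ rng.normal(size=(16, 64)) + 0.01 * rng.normal(size=(64, 64))
         for _ in range(8)]
stacked = np.concatenate(heads, axis=1)          # 64 x 512: eight matrices side by side

# Truncated SVD keeps only the top-r directions: the single compressed representation.
r = 32
U, S, Vt = np.linalg.svd(stacked, full_matrices=False)
latent, basis = U[:, :r] * S[:r], Vt[:r]         # 64 x 32 and 32 x 512

# Reconstruct all eight heads from the compressed pair and measure the (minor) loss.
rel_err = np.linalg.norm(stacked - latent @ basis) / np.linalg.norm(stacked)
print(f"relative reconstruction error: {rel_err:.4f}")
```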
Right Panel: Vector (Directional) Perspective of Compression
Vectors extending in various directions from a central point
Each vector represents the directionality and features of different attention heads
Similar vectors are compressed while preserving directional information (vector features)
Original information can be recovered through vector features even after compression
Key Mechanism
Compression → Recovery Process:
Multiple heads are compressed into latent features
During storage, only the compressed representation is maintained, drastically reducing storage space
When needed, original head information can be recovered using stored features (vectors)
Loss is minimal while memory efficiency is maximized (a compress-and-recover sketch follows this list)
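A minimal numpy sketch of this compress-then-recover path, in the spirit of MLA's down-projection/up-projection: only the small latent vector is cached per token, and per-head keys and values are re-expanded from it when attention runs. The dimensions and random weight matrices are illustrative stand-ins, not values from any particular model.

```python
import numpy as np

d_model, n_heads, d_head, d_latent = 1024, 8, 128, 256
rng = np.random.default_rng(0)

# Learned projections (random stand-ins here): one shared down-projection,
# plus up-projections that re-expand the latent into per-head keys and values.
W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.normal(size=(d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_up_v = rng.normal(size=(d_latent, n_heads * d_head)) / np.sqrt(d_latent)

def cache_token(h):
    """Store only the small latent vector for this token (what the KV cache keeps)."""
    return h @ W_down                      # shape: (d_latent,)

def expand(c):
    """Recover per-head keys and values from the cached latent when attention runs."""
    k = (c @ W_up_k).reshape(n_heads, d_head)
    v = (c @ W_up_v).reshape(n_heads, d_head)
    return k, v

h = rng.normal(size=(d_model,))            # hidden state of one token
c = cache_token(h)
k, v = expand(c)
print(c.shape, k.shape, v.shape)           # (256,) (8, 128) (8, 128)
```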
Main Advantages (Bottom Boxes)
MLA Compression: Efficient compression of multi-head attention
Keep features (vector): Preserves vector features for reconstruction
Minor loss: Maintains performance with negligible information loss
Memory Efficiency: Dramatically reduces storage space
For K-V Cache: Optimizes Key-Value cache memory
Practical Significance
This technique transforms N attention heads into 1 compressed representation in large language models, dramatically reducing storage space while enabling recovery through feature vectors when needed – a lossy compression method. It significantly reduces the memory burden of K-V cache, maximizing inference efficiency.
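A rough sense of the memory saving, with all dimensions as illustrative assumptions (per cached token, fp16):

```python
# Per-token KV-cache footprint: full per-head K and V versus one shared latent.
n_layers, n_heads, d_head, d_latent, bytes_fp16 = 32, 32, 128, 512, 2

standard = n_layers * 2 * n_heads * d_head * bytes_fp16   # cache K and V for every head
mla      = n_layers * d_latent * bytes_fp16               # cache only the latent vector
print(standard, mla, f"-> {standard // mla}x smaller per cached token")
```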
Sends SIGKILL to the selected victim when free memory falls below the watermarks or a cgroup's memory.max limit is hit
Victim selection is based on each process's badness score, adjusted by oom_score_adj
Configurable via /proc/<pid>/oom_score_adj and vm.panic_on_oom (see the procfs sketch below)
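A small Python sketch of the procfs knobs mentioned above; the -500 value is arbitrary, and lowering the score below its current setting may require elevated privileges.

```python
import os

pid = os.getpid()

# The kernel's current "badness" score for this process (higher = more likely victim).
with open(f"/proc/{pid}/oom_score") as f:
    print("oom_score:", f.read().strip())

# Bias victim selection: -1000 means "never kill", +1000 means "kill first".
with open(f"/proc/{pid}/oom_score_adj", "w") as f:
    f.write("-500")
```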
Summary
When an app requests memory, Linux first reserves virtual address space (overcommit), then allocates physical memory on first use. If physical memory runs low, the kernel tries to reclaim pages from caches and swap out less-used pages; when all else fails, the OOM Killer terminates processes based on their oom_score to free up memory and keep the system running.
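A minimal Linux-only demonstration of the overcommit behavior described above, using an anonymous mmap and /proc/self/status; the mapping size is arbitrary.

```python
import mmap

def rss_kib() -> str:
    """Resident set size of this process, as reported by the kernel."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS"):
                return line.split()[1]
    return "?"

# Reserving virtual address space is cheap: no physical pages are allocated yet.
buf = mmap.mmap(-1, 256 * 1024 * 1024)        # 256 MiB anonymous mapping
print("RSS after reserving:", rss_kib(), "kB")

# Touching the pages forces the kernel to back them with physical memory.
for off in range(0, len(buf), mmap.PAGESIZE):
    buf[off] = 1
print("RSS after touching:  ", rss_kib(), "kB")
```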
AI Workload Cooling Systems: Bidirectional Physical-Software Optimization
This image summarizes four recent research studies that demonstrate the bidirectional optimization relationship between AI LLMs and cooling systems: physical cooling infrastructure and software workloads are deeply interconnected.
🔄 Core Concept of Bidirectional Optimization
Direction 1: Physical Cooling → AI Performance Impact
Cooling methods directly affect LLM/VLM throughput and stability
Direction 2: AI Software → Cooling Control
LLMs themselves act as intelligent controllers for cooling systems
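As a much-simplified stand-in for this idea (a plain proportional control loop rather than an LLM), the sketch below nudges coolant pump duty toward a target GPU temperature; read_gpu_temp and set_pump_duty are hypothetical placeholders for whatever telemetry and actuation interface a real facility exposes, and the constants are assumptions.

```python
import time

TARGET_C = 55.0   # assumed target steady-state GPU temperature
GAIN = 0.02       # assumed proportional gain per degree of error

def control_step(duty: float, temp_c: float) -> float:
    """Nudge the pump duty cycle toward the target temperature, clamped to [0.2, 1.0]."""
    duty += GAIN * (temp_c - TARGET_C)
    return min(1.0, max(0.2, duty))

def run(read_gpu_temp, set_pump_duty, duty: float = 0.5, period_s: float = 5.0):
    """Simple closed loop: read temperature, update duty, apply it, wait, repeat."""
    while True:
        duty = control_step(duty, read_gpu_temp())
        set_pump_duty(duty)
        time.sleep(period_s)
```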
📊 Research Analysis
1. Physical Cooling Impact on AI Performance (2025 arXiv)
[Cooling HW → AI SW Performance]
Experiment: Liquid vs Air cooling comparison on H100 nodes
Physical Differences:
GPU Temperature: Liquid 41-50°C vs Air 54-72°C (up to 22°C difference)
GPU Power Consumption: 148-173W reduction
Node Power: ~1kW savings
Software Performance Impact:
Throughput: 54 vs 46 TFLOPs/GPU (+17% improvement)
Sustained and predictable performance through reduced throttling
Adaptive cooling strategies based on workload characteristics
3. Virtuous Cycle Effect
Better cooling → AI performance improvement → smarter cooling control
→ energy savings → more AI jobs → advanced cooling optimization
→ sustainable large-scale AI infrastructure
🎯 Practical Implications
These studies demonstrate:
Cooling is no longer passive infrastructure: It’s an active determinant of AI performance
AI optimizes its own environment: Meta-level self-optimizing systems
Hardware-software co-design is essential: Isolated optimization is suboptimal
Simultaneous achievement of sustainability and performance: Synergy, not trade-off
📝 Summary
These four studies establish that next-generation AI data centers must evolve into integrated ecosystems where physical cooling and software workloads interact in real-time to self-optimize. The bidirectional relationship—where better cooling enables superior AI performance, and AI algorithms intelligently control cooling systems—creates a virtuous cycle that simultaneously achieves enhanced performance, energy efficiency, and sustainable scalability for large-scale AI infrastructure.