Reduces FLOPs/costs while maintaining training/inference performance
Weights/Matmul in FP8 + FP32 Accumulation
Computes in a lightweight format while accumulating the numerically critical sums in full precision (lower memory, bandwidth, and compute, with stable accuracy)
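To make the idea concrete, here is a minimal NumPy sketch. NumPy has no native FP8 type, so the quantize_fp8_like helper (a hypothetical name) only approximates E4M3-style rounding; the point it illustrates is that operands are stored coarsely while the dot-product reduction accumulates in FP32.

```python
import numpy as np

def quantize_fp8_like(x, n_mantissa_bits=3):
    # Crude emulation of FP8 (E4M3-style) rounding: keep the sign and
    # exponent, round the mantissa to 3 bits. Not bit-exact FP8.
    mantissa, exponent = np.frexp(x.astype(np.float32))
    scale = 2.0 ** n_mantissa_bits
    mantissa = np.round(mantissa * scale) / scale
    return np.ldexp(mantissa, exponent).astype(np.float32)

def matmul_fp8_fp32(a, b):
    # Operands are stored in low precision, but the matmul's partial sums
    # (the numerically sensitive "critical totals") accumulate in FP32.
    a_q, b_q = quantize_fp8_like(a), quantize_fp8_like(b)
    return a_q @ b_q  # NumPy's float32 GEMM accumulates in FP32

rng = np.random.default_rng(0)
a, b = rng.standard_normal((64, 128)), rng.standard_normal((128, 64))
ref = a @ b
approx = matmul_fp8_fp32(a, b)
print("max relative error:", np.abs(approx - ref).max() / np.abs(ref).max())
```

The error stays small because only the operand storage is coarse; the reduction itself never rounds to 8 bits.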
Predict Multiple Tokens at Once During Training
Delivers both speed and accuracy gains on benchmarks
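A minimal PyTorch sketch of the loss bookkeeping for multi-token prediction follows. DeepSeek-V3's actual MTP module chains extra transformer blocks per predicted depth; here the per-depth heads are assumed to already produce logits, so only the target alignment and loss averaging are shown.

```python
import torch
import torch.nn.functional as F

def multi_token_prediction_loss(logits, targets, depth=2):
    """Average cross-entropy over the next `depth` future tokens.

    logits:  (batch, seq, depth, vocab) -- one prediction head per offset
    targets: (batch, seq)               -- ground-truth token ids
    """
    batch, seq, _, vocab = logits.shape
    losses = []
    for d in range(depth):
        # Head d predicts the token d+1 positions ahead; trim so that
        # every prediction has a real target to compare against.
        pred = logits[:, : seq - (d + 1), d, :]
        tgt = targets[:, d + 1 :]
        losses.append(F.cross_entropy(pred.reshape(-1, vocab), tgt.reshape(-1)))
    return torch.stack(losses).mean()

# Toy usage with random tensors
logits = torch.randn(2, 16, 2, 100)
targets = torch.randint(0, 100, (2, 16))
print(multi_token_prediction_loss(logits, targets))
```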
2-tier Fat-Tree × Multiple Planes (separated per RDMA-NIC pair)
Provides inter-plane congestion isolation, resilience, and reduced cost/latency
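The plane separation can be sketched conceptually in Python; this is an illustration, not actual fabric-management code. Each node is assumed to have one RDMA NIC per plane, and a flow is pinned to a single plane, so congestion in one plane cannot spread to another.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Nic:
    node: int
    plane: int  # each NIC is wired into exactly one plane

def pick_plane(flow_id: int, num_planes: int) -> int:
    # Deterministic spreading of flows across planes; because planes share
    # no switches, congestion in one plane cannot leak into another.
    return flow_id % num_planes

NUM_PLANES = 4
for flow in range(6):
    plane = pick_plane(flow, NUM_PLANES)
    src, dst = Nic(node=0, plane=plane), Nic(node=1, plane=plane)
    # The flow enters and leaves on NICs in the same plane, so its whole
    # path stays inside that plane's own 2-tier fat-tree.
    print(f"flow {flow}: {src} -> {dst}")
```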
Summary
DeepSeek-V3 represents a comprehensive optimization of large language models through innovations in attention mechanisms, expert routing, mixed-precision training, multi-token prediction, and network architecture. These techniques collectively address the three critical bottlenecks: memory, computation, and communication. The result is a highly efficient model capable of scaling to massive sizes while maintaining cost-effectiveness and performance.
AI Workload Cooling Systems: Bidirectional Physical-Software Optimization
This image summarizes four cutting-edge research studies demonstrating the bidirectional optimization relationship between AI LLMs and cooling systems. Together they show that physical cooling infrastructure and software workloads are deeply interconnected.
Core Concept of Bidirectional Optimization
Direction 1: Physical Cooling → AI Performance Impact
Cooling methods directly affect LLM/VLM throughput and stability
Direction 2: AI Software → Cooling Control
LLMs themselves act as intelligent controllers for cooling systems
Research Analysis
1. Physical Cooling Impact on AI Performance (2025 arXiv)
[Cooling HW → AI SW Performance]
Experiment: Liquid vs Air cooling comparison on H100 nodes
Physical Differences:
GPU Temperature: Liquid 41-50°C vs Air 54-72°C (up to 22°C difference)
GPU Power Consumption: 148-173W reduction
Node Power: ~1kW savings
Software Performance Impact:
Throughput: 54 vs 46 TFLOPS/GPU (+17% improvement)
Sustained and predictable performance through reduced throttling
Adaptive cooling strategies based on workload characteristics
3. Virtuous Cycle Effect
Better cooling → AI performance improvement → smarter cooling control
→ Energy savings → more AI jobs → advanced cooling optimization
→ Sustainable large-scale AI infrastructure
Practical Implications
These studies demonstrate:
Cooling is no longer passive infrastructure: It’s an active determinant of AI performance
AI optimizes its own environment: Meta-level self-optimizing systems
Hardware-software co-design is essential: Isolated optimization is suboptimal
Simultaneous achievement of sustainability and performance: Synergy, not trade-off
Summary
These four studies establish that next-generation AI data centers must evolve into integrated ecosystems where physical cooling and software workloads interact in real time to self-optimize. The bidirectional relationship, in which better cooling enables superior AI performance and AI algorithms intelligently control cooling systems, creates a virtuous cycle that simultaneously achieves enhanced performance, energy efficiency, and sustainable scalability for large-scale AI infrastructure.
This illustration visualizes the evolution of data centers, contrasting the traditionally separated components with the modern AI data center where software, compute, network and, crucially, power and cooling systems are ‘tightly fused’ together. It emphasizes how power and advanced cooling are organically intertwined with GPU and memory, directly impacting AI performance and highlighting their inseparable role in meeting the demands of high-performance AI. This tight integration symbolizes a pivotal shift for the modern AI era.
Operational Burden Reduction: Transform massive alarms into meaningful insights
Self-Evolution: Continuous learning system through RAG framework
Executive Summary: This system overcomes the limitations of the traditional approach of handling alarms individually, using time-based event aggregation and LLM analysis to bring intelligence to datacenter operations. As a self-evolving monitoring system that continuously learns and improves through RAG-based data enhancement, it is expected to dramatically improve operational efficiency and analysis accuracy.
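A hedged sketch of the time-based aggregation step follows; the alarm records, source names, and 5-minute window below are hypothetical, and the LLM/RAG call is reduced to a placeholder print.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical alarm records: (timestamp, source, message)
alarms = [
    (datetime(2025, 1, 1, 9, 0, 5), "chiller-02", "flow rate low"),
    (datetime(2025, 1, 1, 9, 0, 40), "rack-17", "inlet temp high"),
    (datetime(2025, 1, 1, 9, 7, 2), "rack-17", "inlet temp high"),
]

def aggregate_by_window(alarms, window_minutes=5):
    """Group individual alarms into fixed time windows so the LLM sees one
    correlated event bundle instead of a flood of raw alarms."""
    window = timedelta(minutes=window_minutes)
    buckets = defaultdict(list)
    for ts, source, msg in alarms:
        bucket_start = datetime.min + ((ts - datetime.min) // window) * window
        buckets[bucket_start].append((source, msg))
    return dict(buckets)

for start, events in aggregate_by_window(alarms).items():
    # In the real system this bundle, plus RAG-retrieved context, would be
    # sent to the LLM for root-cause analysis.
    print(f"Events in window starting {start}: {events}")
```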
This image demonstrates the critical impact of cooling stability on both LLM performance and energy efficiency in GPU servers through benchmark results.
Cascading Effects of Unstable Cooling
Problems with Unstable Air Cooling:
GPU Temperature: 54-72°C (high and unstable)
Thermal throttling occurs: GPUs automatically reduce clock speeds to prevent overheating, causing significant performance degradation
Result: Double penalty of reduced performance + increased power consumption
Energy Efficiency Impact:
Power Consumption: 8.16kW (high)
Performance: 46 TFLOPS (degraded)
Energy Efficiency: 5.6 TFLOPS/kW (poor performance-to-power ratio)
Benefits of Stable Liquid Cooling
Temperature Stability Achievement:
GPU Temperature: 41-50°C (low and stable)
No thermal throttling → sustained optimal performance
Energy Efficiency Improvement:
Power Consumption: 6.99kW (14% reduction)
Performance: 54 TFLOPS (17% improvement)
Energy Efficiency: 7.7 TFLOPS/kW (38% improvement)
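The reported ratios can be checked directly from the raw numbers above with a quick Python calculation:

```python
# Values taken from the benchmark figures above
air_power, air_tflops = 8.16, 46        # kW, TFLOPS
liq_power, liq_tflops = 6.99, 54

print(f"power savings:    {1 - liq_power / air_power:.0%}")    # ~14%
print(f"performance gain: {liq_tflops / air_tflops - 1:.0%}")  # ~17%

air_eff = air_tflops / air_power        # 5.6 TFLOPS/kW
liq_eff = liq_tflops / liq_power        # 7.7 TFLOPS/kW
# Prints +37% from the unrounded values; the quoted 38% comes from
# computing the ratio of the rounded figures 7.7 / 5.6.
print(f"efficiency: {air_eff:.1f} -> {liq_eff:.1f} TFLOPS/kW "
      f"(+{liq_eff / air_eff - 1:.0%})")
```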
Core Mechanisms: How Cooling Affects Energy Efficiency
Power Efficiency Optimization: Eliminates inefficient power consumption caused by overheating
Performance Consistency: Unstable cooling can cause GPUs to use 50% of power budget while delivering only 25% performance
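A toy model makes the double penalty visible. The throttling curve below is invented purely for illustration (each degree over the threshold cuts performance 3% but power only 1%); real GPU DVFS behavior differs.

```python
def effective_output(temp_c, base_tflops=54.0, base_power_kw=0.7,
                     throttle_at=50.0):
    """Hypothetical throttling model: above the temperature threshold,
    useful throughput falls faster than power draw does."""
    if temp_c <= throttle_at:
        return base_tflops, base_power_kw
    over = temp_c - throttle_at
    tflops = base_tflops * max(0.25, 1 - 0.03 * over)  # perf drops fast
    power = base_power_kw * max(0.5, 1 - 0.01 * over)  # power barely drops
    return tflops, power

for temp in (45, 55, 65, 72):
    tflops, kw = effective_output(temp)
    print(f"{temp}C: {tflops:5.1f} TFLOPS at {kw:.2f} kW "
          f"-> {tflops / kw:5.1f} TFLOPS/kW")
```

Even in this crude model, efficiency (TFLOPS/kW) collapses as temperature rises, which is the double penalty the benchmark measures.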
Advanced cooling systems can achieve energy savings of 17% to 23% compared to traditional methods. The benchmark shows, perhaps counterintuitively, that investing in proper cooling dramatically improves overall energy efficiency.
Final Summary
Unstable cooling triggers thermal throttling that simultaneously degrades LLM performance while increasing power consumption, creating a dual efficiency loss. Stable liquid cooling achieves 17% performance gains and 14% power savings simultaneously, improving energy efficiency by 38%. In AI infrastructure, adequate cooling investment is essential for optimizing both performance and energy efficiency.
This diagram presents a unified framework consisting of three core structures, their interconnected relationships, and complementary utilization as the foundation for LLM advancement.
Three Core Structures
1. Corpus Structure
Token-based raw linguistic data
Provides statistical language patterns and usage frequency information
The final stage performs intelligent analysis using LLMs and AI
Three Core Expansion Strategies
1️⃣ Data Expansion (Data Add On)
Integration of additional data sources beyond Event Messages:
Metrics: Performance indicators and metric data
Manuals: Operational manuals and documentation
Configs: System settings and configuration information
Maintenance: Maintenance history and procedural data
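A sketch of how these additional sources might be folded into the RAG index; the Document schema and ingest function below are hypothetical names for illustration, not the system's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Document:
    source_type: str   # "event", "metric", "manual", "config", "maintenance"
    source_id: str
    text: str

def ingest(doc: Document, index: list[Document]) -> None:
    # A real pipeline would chunk, embed, and upsert into a vector store;
    # here we just tag and collect so retrieval can filter by source type.
    index.append(doc)

index: list[Document] = []
ingest(Document("metric", "rack17/inlet_temp", "inlet temp 34C at 09:00"), index)
ingest(Document("manual", "chiller-02/ops", "restart procedure: ..."), index)
ingest(Document("maintenance", "chiller-02", "pump replaced 2024-11-02"), index)

# Retrieval can then mix evidence types for a single LLM query:
relevant = [d for d in index if d.source_id.startswith("chiller-02")]
print([f"{d.source_type}:{d.source_id}" for d in relevant])
```

Tagging each document with its source type lets one retrieval pass combine live metrics, manuals, and maintenance history as context for the same incident.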
2️⃣ System Extension
Infrastructure scalability and flexibility enhancement:
Scale Up/Out: Vertical/horizontal scaling for increased processing capacity
To Cloud: Cloud environment expansion and hybrid operations
3️⃣ LLM Model Enhancement (Better Models)
Evolution toward DC Operations Specialized LLM:
Prompt Up: Data center operations-specialized prompt engineering
Self-Developed LLM Model: In-house construction and tuning of a DC operations-specialized LLM
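As an illustration of what the "Prompt Up" step could look like, here is a hypothetical DC-operations prompt template; the wording is invented for this sketch, not the system's actual prompt.

```python
# Hypothetical operations-specialized prompt template
PROMPT_TEMPLATE = """You are a data center operations analyst.

Aggregated events (last {window} min):
{events}

Retrieved context (manuals, configs, maintenance history):
{context}

Tasks:
1. Identify the most likely root cause.
2. Rate severity (P1-P4) and blast radius.
3. Recommend the next operator action, citing the retrieved context.
"""

prompt = PROMPT_TEMPLATE.format(
    window=5,
    events="- chiller-02: flow rate low\n- rack-17: inlet temp high (x2)",
    context="- chiller-02 manual: low flow often precedes pump failure",
)
print(prompt)
```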
Strategic Significance
These three expansion strategies present a roadmap for evolving from a simple event log analysis system into an Intelligent Autonomous Operations Data Center. In particular, through the in-house development of a DC operations-specialized LLM, the goal is to build an AI system with domain-expert-level capabilities tailored to data center operations, rather than relying on generic AI tools.