아프지마 Don’t get sick

Posted on 2025-12-13 by lechuck park

Interconnection Driven Design (Deepseek v3)

Posted on 2025-12-122025-12-11 by lechuck park

Interconnection Driven Design

This image outlines a technical approach to solving bottlenecks in High-Performance Computing (HPC) and AI/LLM infrastructure. It is categorized into three main rows, each progressing from a Problem to a Solution, and finally to a hardware-level Final Optimization.

1. Convergence of Scale-Up and Scale-Out

Focuses on resolving inefficiencies between server communication and GPU computation.

Problem (IB Communication): The speed of inter-server connections (e.g., InfiniBand) creates a bottleneck for total system performance.
Inefficiency (Streaming Multiprocessor): The GPU’s core computational units (SMs) waste resources handling network overhead instead of focusing on actual calculations.
Solution (SM Offload): Communication tasks are delegated (offloaded) to dedicated coprocessors, allowing SMs to focus exclusively on computation.
Final Optimization (Unified Network Adapter): Physically integrating intra-node and inter-node communication into a single Network Interface Card (NIC) to minimize data movement paths.

2. Bandwidth Contention & Latency

Addresses the limitations of data bandwidth and processing delays.

Problem (KV Cache): Reusable token data for LLM inference frequently travels between the CPU and GPU, consuming significant bandwidth.
Bottleneck (PCIe): The primary interconnect has limited bandwidth, leading to contention and performance degradation during traffic spikes.
Solution (Traffic Class – TC): A prioritization mechanism (QoS) ensures urgent, latency-sensitive traffic is processed before less critical data.
Final Optimization (I/O Die Chiplet Integration): Integrating network I/O directly alongside the GPU die bypasses PCIe entirely, eliminating contention and drastically reducing latency.

3. Node-Limited Routing

Optimizes data routing strategies for distributed neural networks.

Key Tech (NVLink): A high-speed, intra-node GPU interconnect strategically used to maximize local data transfer.
Context (Experts): Neural network modules (MoE – Mixture of Experts) are distributed across various nodes, requiring activation for specific tokens.
Solution/Strategy (Minimize IB Cost): Reducing overhead by restricting slow inter-node usage (InfiniBand) to a single hop while distributing data internally via fast NVLink.
Final Optimization (Node-Limited): Algorithms restrict the selection of “Experts” (modules) to a limited node group, reducing inter-node traffic and guaranteeing communication efficiency.

Summary

Integration: The design overcomes system bottlenecks by physically unifying network adapters and integrating I/O dies directly with GPUs to bypass slow connections like PCIe.
Offloading & Prioritization: It improves efficiency by offloading network tasks from GPU cores (SMs) and prioritizing urgent traffic (Traffic Class) to reduce latency.
Routing Optimization: It utilizes “Node-Limited” routing strategies to maximize high-speed local connections (NVLink) and minimize slower inter-server communication in distributed AI models.

#InterconnectionDrivenDesign #AIInfrastructure #GPUOptimization #HPC #ChipletIntegration #NVLink #LatencyReduction #LLMHardware #infiniband

With Gemini

vLLM Features

Posted on 2025-12-112025-12-10 by lechuck park

vLLM Features & Architecture Breakdown

This chart outlines the key components of vLLM (Virtual Large Language Model), a library designed to optimize the inference speed and memory efficiency of Large Language Models (LLMs).

1. Core Algorithm

PagedAttention
- Concept: Applies the operating system’s (OS) virtual memory paging mechanism to the attention mechanism.
- Benefit: It resolves memory fragmentation and enables the storage of the KV (Key-Value) cache in non-contiguous memory spaces, significantly reducing memory waste.

2. Data Unit

Block (Page)
- Concept: The minimum KV cache unit with a fixed token size (e.g., 16 tokens).
- Benefit: Increases management efficiency via fixed-size allocation and minimizes wasted space (internal fragmentation) within slots.
Block Table
- Concept: A mapping table that connects Logical Blocks to Physical Blocks.
- Benefit: Allows non-contiguous physical memory to be processed as if it were a continuous context.

3. Operation

Pre-allocation (Profiling)
- Concept: Reserves the maximum required VRAM at startup by running a dummy simulation.
- Benefit: Eliminates the overhead of runtime memory allocation/deallocation and prevents Out Of Memory (OOM) errors at the source.

4. Memory Handling

Swapping
- Concept: Offloads data to CPU RAM when GPU memory becomes full.
- Benefit: Handles traffic bursts without server downtime and preserves the context of suspended (waiting) requests.
Recomputation
- Concept: Recalculates data instead of swapping it when recalculation is more cost-effective.
- Benefit: Optimizes performance for short prompts or in environments with slow interconnects (e.g., PCIe limits).

5. Scheduling

Continuous Batching
- Concept: Iteration-level scheduling that fills idle slots immediately without waiting for other requests to finish.
- Benefit: Eliminates GPU idle time and maximizes overall throughput.

Summary

vLLM adapts OS memory management techniques (like Paging and Swapping) to optimize LLM serving, solving critical memory fragmentation issues.
Key technologies like PagedAttention and Continuous Batching minimize memory waste and eliminate GPU idle time to maximize throughput.
This architecture ensures high performance and stability by preventing memory crashes (OOM) and efficiently handling traffic bursts.

#vLLM #LLMInference #PagedAttention #AIArchitecture #GPUOptimization #MachineLearning #SystemDesign #AIInfrastructure

With Gemini

Time Constant(Delay of the sensor)

Posted on 2025-12-102025-12-07 by lechuck park

Image Interpretation: System Problems Due to Sensor Delay

This diagram explains system performance issues caused by the Time Constant (delay) of temperature sensors.

Top Section: Two Workload Scenarios

LLM Workload (AI Tasks)

Runs at 100% workload
Almost no delay (No Delay almost)
Result: Performance Down and Workload Cost waste

GPU Workload

Operating at 80°C
Thermal Throttling occurs
Transport Delay exists
Performance degradation starts at 60°C → Step down

Bottom Section: Core of the Sensor Delay Problem

Timeline:

Sensor UP start (Temperature Sensor activation)
- Big Delay due to Time Constant
TC63 (After 10-20 seconds)
- Sensor detects 63% temperature rise
- Actual temperature is already higher
After 30-40 seconds
- Sensor detects 86% rise
- Temperature Divergence, Late Cooling problem occurs

Key Issues

Due to the sensor’s Time Constant delay:

Takes too long to detect actual temperature rise
Cooling system activates too late
GPU already overheated, causing thermal throttling
Results in workload cost waste and performance degradation

Summary

Sensor delays create a critical gap between actual temperature and detected temperature, causing cooling systems to react too late. This results in GPU thermal throttling, performance degradation, and wasted computational resources. Real-time monitoring with fast-response sensors is essential for optimal system performance.

#ThermalManagement #SensorDelay #TimeConstant #GPUThrottling #DataCenter #PerformanceOptimization #CoolingSystem #AIWorkload #SystemMonitoring #HardwareEngineering #ThermalThrottling #LatencyChallenges #ComputeEfficiency #ITInfrastructure #TemperatureSensing

With Claude

2 Key Points For Digitalizations

Posted on 2025-12-092025-12-07 by lechuck park

2 Key Points For Digitalizations

This diagram illustrates two essential elements for successful digital transformation.

1️⃣ Data Quality

“High Precision & High Resolution”

The left section shows the data collection and quality management phase:

Facility/Device: Physical infrastructure including servers, networks, power systems, and cooling equipment
Data Generator: Generates data from various sources
3T Process:
- Performance: Data collection and measurement
- Transform: Data processing and standardization
- Transfer: Data movement and delivery

The key is to secure high-quality data with high precision and resolution.

2️⃣ Fast & Accurate Data Correlation

“Rapid Data Correlation Analysis with AI”

The right section represents the data utilization phase:

Data Storing: Systematic storage in various types of databases
Monitoring: Real-time system surveillance and alerts
Analysis: In-depth data analysis and insight extraction

The ultimate goal is to quickly and accurately identify correlations between data using AI.

Core Message

The keys to successful digitalization are:

Input Stage: Accurate and detailed data collection
Output Stage: Fast and precise AI-based analysis

True digital transformation becomes possible when these two elements work in harmony.

Summary

✅ Successful digitalization requires two pillars: high-quality data input (high precision & resolution) and intelligent output (AI-driven analysis).

✅ The process flows from facility infrastructure through data generation, the 3T transformation (Performance-Transform-Transfer), to storage, monitoring, and analysis.

✅ When quality data collection meets fast AI correlation analysis, organizations achieve meaningful digital transformation and actionable insights.

#DigitalTransformation #DataQuality #AIAnalysis #DataCorrelation #HighPrecisionData #BigData #DataDriven #Industry40 #SmartFactory #DataInfrastructure #DigitalStrategy #AIInsights #DataManagement #TechInnovation #EnterpriseIT

With Claude

GPU Throttling

Posted on 2025-12-082025-12-07 by lechuck park

GPU Throttling Architecture Analysis

This diagram illustrates the GPU’s power and thermal management system.

Key Components

1. Two Throttling Triggers

Power Throttling: Throttling triggered by power limits
Thermal Throttling: Throttling triggered by temperature limits

2. Different Control Approaches

Power Limit (Budget) Controller: Slow, Linear Step Down
Thermal Safety Controller: Fast, Hard Step Down
- This aggressive response is necessary because overheating can cause immediate hardware damage

3. Priority Gate

Receives signals from both controllers and determines which limitation to apply.

4. PMU/SMU/DVFS Controller

The Common Control Unit that manages:

PMU: Power Management Unit
SMU: System Management Unit
DVFS: Dynamic Voltage and Frequency Scaling

5. Actual Adjustment Mechanisms

Clock Domain Controller: Reduces GPU Frequency
Voltage Regulator: Reduces GPU Voltage

6. Final Result

Lower Power/Temp (Throttled): Reduced power consumption and temperature in throttled state

Core Principle

When the GPU reaches power budget or temperature limits, it automatically reduces performance to protect the system. By lowering both frequency and voltage simultaneously, it effectively reduces power consumption (P ∝ V²f).

Summary

GPU throttling uses two controllers—power (slow, linear) and thermal (fast, aggressive)—that feed into a shared PMU/SMU/DVFS system to dynamically reduce clock frequency and voltage. Thermal throttling responds more aggressively than power throttling because overheating poses immediate hardware damage risks. The end result is lower power consumption and temperature, sacrificing performance to maintain system safety and longevity.

#GPUThrottling #ThermalManagement #PowerManagement #DVFS #GPUArchitecture #HardwareOptimization #ThermalSafety #PerformanceVsPower #ComputerHardware #GPUDesign #SystemManagement #ClockSpeed #VoltageRegulation #TechExplained #HardwareEngineering

With Claude

Now, IT Begins

Posted on 2025-12-07 by lechuck park