With

Posted on 2025-11-01 by lechuck park

Large Scale Network Driven Design (DeepSeek V3)

Posted on 2025-10-31 by lechuck park

DeepSeek-V3 Large-Scale Network Architecture Analysis

This image explains the Multi-Plane Fat-Tree network structure of DeepSeek-V3.

Core Architecture

1. 8-Plane Architecture

Consists of eight independent network channels (highways)
Maximizes network bandwidth and distributes traffic for enhanced scalability

2. Fat-Tree Topology

Two-layer switch structure:
- Leaf SW (Leaf Switches): Directly connected to GPUs
- Spine SW (Spine Switches): Interconnect leaf switches
Enables high-speed communication among all nodes (GPUs) while minimizing switch contention

3. GPU/IB NIC Pair

Each GPU is paired with a dedicated Network Interface Card (NIC)
Each pair is exclusively assigned to one of the eight planes to initiate communication

Communication Methods

NVLink

Ultra-high-speed connection between GPUs within the same node
Fast data transfer path used for intra-node communication

Cross-plane Traffic

Occurs when communication happens between different planes
Requires intra-node forwarding through another NIC, over PCIe or NVLink
Adds latency compared with traffic that stays inside one plane
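
The difference between NVLink, same-plane, and cross-plane traffic can be sketched in a few lines. This is a minimal illustration, not DeepSeek's routing logic: the `Gpu` class, the `classify` function, and the rule "local GPU i is wired to plane i" are assumptions made for this example.

```python
from dataclasses import dataclass

NUM_PLANES = 8  # eight planes, as described above


@dataclass(frozen=True)
class Gpu:
    node: int   # index of the server that hosts the GPU
    local: int  # index of the GPU inside that server

    @property
    def plane(self) -> int:
        # Assumed mapping: local GPU i uses NIC i, which is wired to plane i.
        return self.local % NUM_PLANES


def classify(src: Gpu, dst: Gpu) -> str:
    """Return the kind of path a transfer from src to dst would take."""
    if src.node == dst.node:
        return "nvlink"       # intra-node, never touches the network fabric
    if src.plane == dst.plane:
        return "same-plane"   # stays inside one fat-tree plane
    return "cross-plane"      # needs an extra intra-node forwarding hop


print(classify(Gpu(0, 1), Gpu(0, 5)))
print(classify(Gpu(0, 3), Gpu(7, 3)))
print(classify(Gpu(0, 3), Gpu(7, 4)))
```

Under these assumptions, a scheduler that keeps inter-node transfers between GPUs with the same local index avoids the cross-plane case entirely.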

Network Optimization Process

The workflow below minimizes latency and prevents network congestion:

Workload Analysis
All to All: analyze all-to-all communication patterns
Plane & Layer Set: assign planes and layers
Profiling (Hot-path opt K): optimize hot paths
Static Routing (Hybrid): apply a hybrid static routing approach

Goal: Low latency & no jamming

Scalability

This design is a scale-out network for large-scale distributed training that can scale to 16,384 GPUs. Each plane operates independently, which maximizes overall system throughput.
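
The 16,384 figure can be reproduced with simple fat-tree arithmetic. The sketch below assumes 64-port switches and a non-blocking two-layer fat-tree in each plane; the switch size is an assumption for this example and is not stated in the post.

```python
def two_layer_fat_tree_endpoints(switch_ports: int) -> int:
    """Endpoints of a non-blocking two-layer fat-tree built from k-port switches.

    Each leaf uses k/2 ports for endpoints and k/2 ports for uplinks.
    Each spine has k ports, so it can connect to at most k leaves.
    """
    down_ports_per_leaf = switch_ports // 2
    max_leaves = switch_ports
    return down_ports_per_leaf * max_leaves


SWITCH_PORTS = 64  # assumption: 64-port switches
NUM_PLANES = 8     # eight planes, one GPU/NIC pair per plane in each node

per_plane = two_layer_fat_tree_endpoints(SWITCH_PORTS)
total = per_plane * NUM_PLANES
print(per_plane, total)  # 2048 16384
```

Because each GPU/NIC pair belongs to exactly one plane, the planes add capacity side by side without needing a third switch layer.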

3-Line Summary

DeepSeek-V3 uses an 8-plane fat-tree network architecture that can scale to 16,384 GPUs through independent communication channels, minimizing contention and maximizing bandwidth.
The two-layer switch topology (Spine and Leaf) combined with dedicated GPU-NIC pairs enables efficient traffic distribution across planes.
Cross-plane traffic management and hot-path optimization aim for low-latency, high-throughput communication in large-scale AI training.

#DeepseekV3 #FatTreeNetwork #MultiPlane #NetworkArchitecture #ScaleOut #DistributedTraining #AIInfrastructure #GPUCluster #HighPerformanceComputing #NVLink #DataCenterNetworking #LargeScaleAI

With Claude

AI approach

Posted on 2025-10-30 by lechuck park

Legacy – The Era of Scale-Up

Traditional AI approach showing its limitations:

Simple Data: Starting with basic data
Simple Data & Logic: Combining data with logic
Better Data & Logic: Improving data and logic
Complex Data & Logic: Advancing to complex data and logic
Near The Limitation: Eventually hitting a fundamental ceiling

This approach gradually increases complexity, but no matter how much it improves, it inevitably runs into fundamental scalability limitations.

AI Works – The Era of Scale-Out

Modern AI transcending the limitations of the legacy approach through a new paradigm:

The left side shows the limitations of the old approach
The lightbulb icon in the middle represents a paradigm shift (Breakthrough)
The large purple box on the right demonstrates a completely different approach:
- Massive parallel processing of countless “01/10” units (neural network neurons)
- Horizontal scaling (Scale-Out) instead of sequential complexity increase
- Fundamentally overcoming the legacy limitations

Key Message

No matter how much you improve the legacy approach, there’s a ceiling. AI breaks through that ceiling with a completely different architecture.

Summary

Legacy AI hits fundamental limits by sequentially increasing complexity (Scale-Up)
Modern AI uses massive parallel processing architecture to transcend these limitations (Scale-Out)
This represents a paradigm shift from incremental improvement to architectural revolution

#AI #MachineLearning #DeepLearning #NeuralNetworks #ScaleOut #Parallelization #AIRevolution #Paradigmshift #LegacyVsModern #AIArchitecture #TechEvolution #ArtificialIntelligence #ScalableAI #DistributedComputing #AIBreakthrough

Optimize LLM

Posted on 2025-10-29 by lechuck park

LLM Optimization: Integration of Traditional Methods and New Paradigms

Core Message

LLM (Transformer) optimization requires more than just traditional optimization methodologies – new perspectives must be added.

1. Traditional Optimization Methodology (Left Side)

SW (Software) Optimization

Data Optimization
- Structure: Data structure design
- Copy: Data movement optimization
Logic Optimization
- Algorithm: Efficient algorithm selection
- Profiling: Performance analysis and bottleneck identification

Characteristics: Deterministic, logical approach

HW (Hardware) Optimization

Functions & Speed (B/W): Function and speed/bandwidth optimization
Fit For HW: Optimization for existing hardware
New HW implementation: New hardware design and implementation

Characteristics: Physical performance improvement focus

2. New Perspectives Required for LLM (Right Side)

SW Aspect: Human-Centric Probabilistic Approach

Human Language View / Human’s View
- Human language understanding methods
- Human thinking perspective
Human Learning
- Mimicking human learning processes

Key Point: Statistical and Probabilistic Methodology

Different from traditional deterministic optimization
Language patterns, probability distributions, and context understanding are crucial
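
A small example may help show what "probabilistic" means in practice. The sketch below turns raw scores into a probability distribution with softmax, then contrasts a deterministic choice with a sampled one. The vocabulary and scores are made up for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and scores, for illustration only.
vocab = ["cat", "dog", "car"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
greedy = vocab[probs.index(max(probs))]            # deterministic: always the top score
sampled = random.choices(vocab, weights=probs)[0]  # probabilistic: varies between runs

print(probs, greedy, sampled)
```

The greedy pick never changes for the same input, while the sampled pick follows the distribution. That variability is why LLM behavior has to be tuned and evaluated statistically.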

HW Aspect: Massive Parallel Processing

Massive Simple Parallel
- Parallel processing of large-scale simple computations
- Hardware architecture capable of parallel processing (GPU/TPU) is essential

Key Point: Efficient parallel processing of large-scale matrix operations

3. Integrated Perspective

LLM Optimization = Traditional Optimization + New Paradigm

Domain	Traditional Method	LLM Additional Elements
SW	Algorithm, data structure optimization	+ Probabilistic/statistical approach (human language/learning perspective)
HW	Function/speed optimization	+ Massive parallel processing architecture

Conclusion

For effective LLM optimization:

Traditional optimization techniques (data, algorithms, hardware) as foundation
Probabilistic approach reflecting human language and learning methods
Hardware perspective supporting massive parallel processing

These three elements must be organically combined – this is the core message of the diagram.

Summary

LLM optimization requires integrating traditional deterministic SW/HW optimization with new paradigms: probabilistic/statistical approaches that mirror human language understanding and learning, plus hardware architectures designed for massive parallel processing. This represents a fundamental shift from conventional optimization, where human-centric probabilistic thinking and large-scale parallelism are not optional but essential dimensions.

#LLMOptimization #TransformerArchitecture #MachineLearningOptimization #ParallelProcessing #ProbabilisticAI #HumanLanguageView #GPUComputing #DeepLearningHardware #StatisticalML #AIInfrastructure #ModelOptimization #ScalableAI #NeuralNetworkOptimization #AIPerformance #ComputationalEfficiency

From Tokenization to Output

Posted on 2025-10-28 by lechuck park

From Tokenization to Output: Understanding NLP and Transformer Models

This image illustrates the complete process from tokenization to output in Natural Language Processing (NLP) and transformer models.

Top Section: Traditional Information Retrieval Process (Green Boxes)

Distinction (Difference) – Clear Boundary
- Cutting word pieces, attaching number tags, creating manageable units, generating receipt slips
Classification (Similarity)
- Placing in the same neighborhood, gathering similar meanings, classifying by topic on bookshelves, organizing by close proximity
Indexing
- Remembering position, assigning bookshelf numbers, creating a table of contents, organizing context
Retrieval (Fetching)
- Asking a question, searching the table of contents, retrieving content, finding necessary information
Processing → Result
- Analyzing information, synthesizing content, writing a report, generating the final answer

Bottom Section: Actual Transformer Model Implementation (Purple Boxes)

Tokenization
- String splitting, subword units, ID conversion, vocabulary mapping
Embedding Feature
- High-dimensional vector conversion, embedding matrix, semantic distance, placement in vector space
Positional Encoding + Context Building
- Positional information encoding, sine/cosine functions, context matrix, preserving sequence order
Attention Mechanism
- Query-Key-Value, attention scores, softmax weights, selective information extraction
Feed Forward + Output
- Non-linear transformation, 2-layer neural network, softmax probability distribution, next token prediction
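
The stages above can be traced end to end in a toy example. This is a minimal sketch in plain Python, not how GPT or BERT are implemented: the vocabulary, the fixed embedding values, and the use of the inputs themselves as Q, K and V are all assumptions, and the feed-forward layer is left out for brevity.

```python
import math

# Toy vocabulary; real models use learned subword vocabularies.
VOCAB = {"the": 0, "cat": 1, "sat": 2}
D_MODEL = 4


def tokenize(text):
    """Tokenization: split the string and map each piece to an ID."""
    return [VOCAB[word] for word in text.lower().split()]


def embed(token_ids):
    """Embedding: look up one vector per token (fixed toy values here)."""
    table = [[0.1 * (i + 1) * (j + 1) for j in range(D_MODEL)]
             for i in range(len(VOCAB))]
    return [list(table[t]) for t in token_ids]


def add_positions(vectors):
    """Positional encoding: add sine/cosine terms that depend on position."""
    out = []
    for pos, vec in enumerate(vectors):
        enc = []
        for j in range(D_MODEL):
            angle = pos / (10000 ** (2 * (j // 2) / D_MODEL))
            enc.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        out.append([v + e for v, e in zip(vec, enc)])
    return out


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Attention: each position mixes itself and the positions before it."""
    out = []
    for i, q in enumerate(x):
        seen = x[: i + 1]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D_MODEL)
                  for k in seen]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, seen))
                    for j in range(D_MODEL)])
    return out


def next_token_probs(hidden):
    """Output: score each vocabulary entry against the last position."""
    table = embed(range(len(VOCAB)))  # embedding table reused as output projection
    last = hidden[-1]
    logits = [sum(a * b for a, b in zip(last, row)) for row in table]
    return softmax(logits)


ids = tokenize("the cat sat")
probs = next_token_probs(self_attention(add_positions(embed(ids))))
print(ids, probs)
```

Each function corresponds to one purple box, so the call on the second-to-last line reads as the pipeline from left to right.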

Key Concept

This diagram maps traditional information retrieval concepts to modern transformer architecture implementations. It visualizes how abstract concepts in the top row are realized through concrete technical implementations in the bottom row, providing an educational resource for understanding how models like GPT and BERT work internally at each stage.

Summary

This diagram explains the end-to-end pipeline of transformer models by mapping traditional information retrieval concepts (distinction, classification, indexing, retrieval, processing) to technical implementations (tokenization, embedding, positional encoding, attention mechanism, feed-forward output). The top row shows abstract conceptual stages while the bottom row reveals the actual neural network components used in models like GPT and BERT. It serves as an educational bridge between high-level understanding and low-level technical architecture.

#NLP #TransformerModels #DeepLearning #Tokenization #AttentionMechanism #MachineLearning #AI #NeuralNetworks #GPT #BERT #PositionalEncoding #Embedding #InformationRetrieval #ArtificialIntelligence #DataScience

With Claude

Multi-Head Latent Attention – Changes

Posted on 2025-10-27 by lechuck park

Multi-Head Latent Attention (MLA) Interpretation

This image is a technical diagram explaining the structure of Multi-Head Latent Attention (MLA).

🎯 Core Concept

MLA is a mechanism that improves the memory efficiency of traditional Multi-Head Attention.

Traditional Approach (Before) vs MLA

Traditional Approach:

Stores K, V vectors of all past tokens
Memory usage increases linearly with sequence length

MLA:

Compresses each token's K and V into one small Latent vector (c^KV) and caches only that
The cache still grows with sequence length, but each token costs far less memory

📊 Architecture Explanation

1. Input Processing

Starts from Input Hidden State (h_t)

2. Latent Vector Generation

Latent c_t^Q: For Query of current token (compressed representation)
Latent c_t^KV: For Key-Value (cached and reused)

3. Query, Key, Value Generation

Query (q): Generated from the current token's h_t through the latent c_t^Q
Key: Built from two parts
- Compressed part (C): up-projected from c_t^KV
- RoPE part (R): carries rotary positional information and is computed from h_t
- Concatenates both for use
Value: Up-projected from c_t^KV (compressed part only)

4. Multi-Head Attention Execution

Performs attention computation with generated Q, K, V
Uses BF16 (Mixed Precision)
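
The caching idea in steps 2 to 4 can be sketched as a single-head decoding loop. This is a simplified illustration, not the DeepSeek implementation: the weights are random stand-ins for learned projections, the sizes are toy values, and both the RoPE part and the query compression are omitted.

```python
import math
import random

random.seed(0)
D_MODEL, D_LATENT, D_HEAD = 8, 2, 4  # toy sizes; D_LATENT is much smaller than D_MODEL


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


# Hypothetical weights standing in for learned projections.
W_DKV = rand_matrix(D_LATENT, D_MODEL)  # down-projection: h_t -> c_t^KV
W_UK = rand_matrix(D_HEAD, D_LATENT)    # up-projection:   c^KV -> key
W_UV = rand_matrix(D_HEAD, D_LATENT)    # up-projection:   c^KV -> value
W_Q = rand_matrix(D_HEAD, D_MODEL)      # query projection (compression omitted)

latent_cache = []  # the only per-token state kept between decoding steps


def decode_step(h_t):
    """Attend from the current token over all cached latents."""
    latent_cache.append(matvec(W_DKV, h_t))          # cache c_t^KV only
    q = matvec(W_Q, h_t)
    keys = [matvec(W_UK, c) for c in latent_cache]   # rebuilt on the fly
    values = [matvec(W_UV, c) for c in latent_cache]
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D_HEAD) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(D_HEAD)]


for _ in range(3):
    out = decode_step([random.uniform(-1, 1) for _ in range(D_MODEL)])

print(len(latent_cache), len(latent_cache[0]), len(out))
```

After three steps the cache holds three vectors of length `D_LATENT`, while keys and values of length `D_HEAD` are recreated only when needed.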

✅ Key Advantages

Memory Efficiency: Compresses each past token's keys and values into one small latent vector
Faster Inference: Reuses cached Latent vectors
Information Preservation: Maintains performance by combining the compressed content with RoPE positional information
Mixed Precision Support: Utilizes FP8, FP32, BF16

🔑 Key Differences

v_t^R from Latent c_t^KV is not used: values have no RoPE part (purple box on the right side of diagram)
The RoPE key (k_t^R) is generated directly from h_t rather than from c_t^KV
This enables efficient combination of compressed past information and positional information

This architecture is an innovative approach to solve the KV cache memory problem during LLM inference.
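
A rough count shows how large the saving can be. The sketch below compares cached numbers per layer for standard Multi-Head Attention and for MLA, assuming MLA stores one latent vector plus one shared RoPE key per past token. The dimensions are assumed values for illustration and are not stated in the post.

```python
def mha_cache_values(seq_len: int, n_heads: int, head_dim: int) -> int:
    """Cached numbers per layer when full keys and values are stored."""
    return seq_len * 2 * n_heads * head_dim


def mla_cache_values(seq_len: int, latent_dim: int, rope_dim: int) -> int:
    """Cached numbers per layer when only c^KV and a shared RoPE key are stored."""
    return seq_len * (latent_dim + rope_dim)


# Assumed sizes for illustration; they are not stated in the post.
N_HEADS, HEAD_DIM = 128, 128
LATENT_DIM, ROPE_DIM = 512, 64

for seq_len in (1_024, 32_768):
    mha = mha_cache_values(seq_len, N_HEADS, HEAD_DIM)
    mla = mla_cache_values(seq_len, LATENT_DIM, ROPE_DIM)
    print(seq_len, mha, mla, round(mha / mla, 1))
```

Both counts grow with the number of cached tokens. The saving comes from the much smaller per-token cost, which is 576 values instead of 32,768 with these assumed sizes.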

Summary

MLA replaces the full per-token KV cache with a much smaller latent vector per token, dramatically reducing memory consumption during inference. It combines compressed past information with current token data through an efficient attention mechanism. This innovation enables faster and more memory-efficient LLM inference while maintaining model performance.

#MultiHeadLatentAttention #MLA #TransformerOptimization #LLMInference #KVCache #MemoryEfficiency #AttentionMechanism #DeepLearning #NeuralNetworks #AIArchitecture #ModelCompression #EfficientAI #MachineLearning #NLP #LargeLanguageModels

With Claude