MPFT: Multi-Plane Fat-Tree for Massive Scale and Cost Efficiency



1. Architecture Overview (Blue Section)

The core innovation of MPFT lies in parallelizing network traffic across multiple independent “planes” to maximize bandwidth and minimize hardware overhead.

  • Multi-Plane Architecture: The network is split into 4 independent planes (channels).
  • Multiple Physical Ports per NIC: Each Network Interface Card (NIC) is equipped with multiple ports—one for each plane.
  • QP Parallel Utilization (Packet Striping): A single Queue Pair (QP) can utilize all available ports simultaneously. This allows for striped traffic, where data is spread across all paths at once.
  • Out-of-Order Placement: Because packets travel via different planes, they may arrive in a different order than they were sent. Therefore, the NIC must natively support out-of-order processing to reassemble the data correctly.
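The striping and reassembly described above can be sketched in a few lines of Python. This is a hypothetical illustration of the idea, not the actual NIC firmware: packets from one QP are sprayed round-robin across the 4 planes, arrive in arbitrary order, and are placed back at their sequence offsets.

```python
# Hypothetical sketch of MPFT-style packet striping: one QP sprays
# packets round-robin across 4 independent planes, and the receiver
# reassembles them by sequence number regardless of arrival order.
import random

NUM_PLANES = 4  # the 4 independent network planes

def stripe(packets):
    """Assign each packet a plane round-robin (spray across all ports)."""
    return [(seq, seq % NUM_PLANES, data) for seq, data in enumerate(packets)]

def deliver(striped):
    """Simulate per-plane latency jitter, so packets arrive out of order."""
    return sorted(striped, key=lambda p: random.random())

def reassemble(arrived):
    """Out-of-order placement: put each packet at its sequence offset."""
    buffer = [None] * len(arrived)
    for seq, _plane, data in arrived:
        buffer[seq] = data
    return buffer

message = ["A", "B", "C", "D", "E", "F", "G", "H"]
striped = stripe(message)
restored = reassemble(deliver(striped))
assert restored == message  # data is correct despite out-of-order arrival
```

The key point the sketch shows: because placement is by sequence number, no plane has to wait for another, which is what lets a single QP use all four ports at once.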

2. Performance & Cost Results (Purple Section)

The table compares MPFT against standard topologies like FT2/FT3 (Fat-Tree), SF (Slim Fly), and DF (Dragonfly).

Metric            | MPFT   | FT3    | Dragonfly (DF)
------------------|--------|--------|---------------
Endpoints         | 16,384 | 65,536 | 261,632
Switches          | 768    | 5,120  | 16,352
Total Cost        | $72M   | $491M  | $1,522M
Cost per Endpoint | $4.39k | $7.5k  | $5.8k
  • Scalability: MPFT supports 16,384 endpoints, which is significantly higher than a standard 2-tier Fat-Tree (FT2).
  • Resource Efficiency: It achieves high scalability while using far fewer switches (768) and links compared to the 3-tier Fat-Tree (FT3).
  • Economic Advantage: At $4.39k per endpoint, it is one of the most cost-efficient models for large-scale data centers, especially when compared to the $7.5k cost of FT3.
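The per-endpoint figures follow directly from the table's totals, which is easy to check:

```python
# Reproducing the cost-per-endpoint column from the totals in the
# table above (values taken directly from the table).
configs = {
    "MPFT": {"endpoints": 16_384,  "total_cost_usd": 72e6},
    "FT3":  {"endpoints": 65_536,  "total_cost_usd": 491e6},
    "DF":   {"endpoints": 261_632, "total_cost_usd": 1_522e6},
}
for name, c in configs.items():
    per_ep = c["total_cost_usd"] / c["endpoints"]
    print(f"{name}: ${per_ep / 1e3:.2f}k per endpoint")
# MPFT ≈ $4.39k, FT3 ≈ $7.49k, DF ≈ $5.82k (the table rounds FT3 and DF)
```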

Summary

MPFT is presented as a “sweet spot” solution for AI/HPC clusters. It provides the high-speed performance of complex 3-tier networks but keeps the cost and hardware complexity closer to simpler 2-tier systems by using multi-port NICs and traffic striping.


#NetworkArchitecture #DataCenter #HighPerformanceComputing #GPU #AITraining #MultiPlaneFatTree #MPFT #NetworkingTech #ClusterComputing #CloudInfrastructure

Parallelism (1) – Data, Expert

Parallelism Comparison: Data Parallelism vs Expert Parallelism

This image compares two major parallelization strategies used for training large language models (LLMs).

Left: Data Parallelism

Structure:

  • Data is divided into multiple batches from the database
  • Same complete model is replicated on each GPU
  • Each GPU independently processes different data batches
  • Results are aggregated to generate final output

Characteristics:

  • Scaling axis: Number of batches/samples
  • Pattern: Full model copy on each GPU, dense training
  • Communication: Gradient All-Reduce synchronization once per step
  • Advantages: Simple and intuitive implementation
  • Disadvantages: Model size must fit in single GPU memory
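The once-per-step All-Reduce above can be sketched without any real framework. This is a pure-Python illustration, assuming a placeholder gradient function; in real training the gradient comes from backprop and the All-Reduce is done by a collective library such as NCCL.

```python
# Minimal data-parallelism sketch: each "GPU" holds a full copy of the
# weights, computes gradients on its own batch, and an All-Reduce
# averages the gradients once per step. (Pure-Python illustration.)

def local_gradient(weights, batch):
    # Placeholder gradient: real training would run backprop on `batch`.
    return [sum(batch) / len(batch) for _ in weights]

def all_reduce_mean(grads_per_gpu):
    """Average the gradient vectors from every replica (All-Reduce)."""
    n = len(grads_per_gpu)
    return [sum(g[i] for g in grads_per_gpu) / n
            for i in range(len(grads_per_gpu[0]))]

weights = [0.5, -0.3]               # identical on every replica
batches = [[1, 2], [3, 4], [5, 6]]  # different data per "GPU"
grads = [local_gradient(weights, b) for b in batches]
avg = all_reduce_mean(grads)
# Every replica applies the same averaged gradient, so all model
# copies stay in sync after the update.
step = [w - 0.1 * g for w, g in zip(weights, avg)]
```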

Right: Expert Parallelism

Structure:

  • The model is divided into experts within its MoE layers
  • Tokens are dispatched to the appropriate experts by a router over an All-to-All network
  • Different expert models (A, B, C) are placed on each GPU
  • Parallel processing at block/thread level in GPU pool

Characteristics:

  • Scaling axis: Number of experts
  • Pattern: Sparse structure – only few experts activated per token
  • Goal: Maintain large capacity while limiting FLOPs per token
  • Communication: All-to-All token routing
  • Advantages: Can scale model capacity significantly (MoE – Mixture of Experts architecture)
  • Disadvantages: High communication overhead and complex load balancing
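The routing step can be sketched as follows. This is a toy top-1 gating function, not a real MoE router: the scoring rule and token values are invented for illustration, and in practice the buckets would be exchanged between GPUs with an All-to-All collective.

```python
# Sketch of expert-parallel token dispatch: a router scores each token
# against every expert and only the top-1 expert is activated per
# token (sparse MoE). Scores and tokens are purely illustrative.

def route(tokens, num_experts, score):
    """Group tokens by their highest-scoring expert (All-to-All dispatch)."""
    buckets = {e: [] for e in range(num_experts)}
    for tok in tokens:
        scores = [score(tok, e) for e in range(num_experts)]
        best = scores.index(max(scores))   # top-1 gating
        buckets[best].append(tok)
    return buckets

# Toy router: the expert whose id matches token value modulo 3 wins.
tokens = [10, 11, 12, 13, 14, 15]
buckets = route(tokens, num_experts=3, score=lambda t, e: -((t - e) % 3))
# Each bucket is sent to one GPU's expert; only that expert spends
# FLOPs on its tokens, which is what bounds FLOPs per token.
```

Note how uneven the buckets can become with a less convenient scoring rule; that imbalance is exactly the load-balancing problem listed as a disadvantage above.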

Key Differences

Aspect                | Data Parallelism           | Expert Parallelism
----------------------|----------------------------|---------------------------------------------
Model Division        | Full model replication     | Model divided into experts
Data Division         | Batch-wise                 | Layer/token-wise
Communication Pattern | Gradient All-Reduce        | Token All-to-All
Scalability           | Proportional to data size  | Proportional to expert count
Efficiency            | Dense computation          | Sparse computation (conditional activation)

These two approaches are often used together in practice, enabling ultra-large-scale model training through hybrid parallelization strategies.


Summary

Data Parallelism replicates the entire model across GPUs and divides the training data, synchronizing gradients after each step – simple but memory-limited. Expert Parallelism divides the model into specialized experts and routes tokens dynamically, enabling massive scale through sparse activation. Modern systems combine both strategies to train trillion-parameter models efficiently.

#MachineLearning #DeepLearning #LLM #Parallelism #DistributedTraining #DataParallelism #ExpertParallelism #MixtureOfExperts #MoE #GPU #ModelTraining #AIInfrastructure #ScalableAI #NeuralNetworks #HPC

LLM goes with Computing-Power-Cooling

LLM’s Computing-Power-Cooling Relationship

This diagram illustrates the technical architecture and potential issues that can occur when operating LLMs (Large Language Models).

Normal Operation (Top Left)

  1. Computing required – the LLM workload is delivered to the processor
  2. Power required – power is supplied and managed via DVFS (Dynamic Voltage and Frequency Scaling)
  3. Heat generated – heat is produced as a by-product of computation
  4. Cooling required – temperature is managed through an adequate cooling system

Problem Scenarios

Power Issue (Top Right)

  • Symptom: Insufficient power (kW & Quality)
  • Results:
    • Computing performance degradation
    • Power throttling or errors
    • LLM workload errors

Cooling Issue (Bottom Right)

  • Symptom: Insufficient cooling (Temperature & Density)
  • Results:
    • Abnormal heat generation
    • Thermal throttling or errors
    • Computing performance degradation
    • LLM workload errors
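Both failure scenarios reduce to the same control loop, which can be sketched as a toy model. All numbers here are illustrative, not vendor specifications: compute demand is clamped so that neither the power budget nor the cooling capacity is exceeded, mimicking DVFS-style throttling.

```python
# Toy Computing-Power-Cooling balance: if power or cooling headroom is
# insufficient, the effective compute rate is throttled and LLM
# throughput drops. Illustrative numbers only.

def effective_throughput(demand_gflops, power_budget_w, cooling_capacity_w,
                         watts_per_gflop=0.5):
    """Clamp compute so neither the power nor the cooling limit is exceeded."""
    power_needed = demand_gflops * watts_per_gflop
    limit_w = min(power_budget_w, cooling_capacity_w)
    if power_needed <= limit_w:
        return demand_gflops              # normal operation
    return limit_w / watts_per_gflop      # throttled

print(effective_throughput(1000, 600, 700))  # 1000  (balanced)
print(effective_throughput(1000, 400, 700))  # 800.0 (power-throttled)
print(effective_throughput(1000, 600, 300))  # 600.0 (thermal-throttled)
```

The `min()` over the two limits is the whole point: the weakest of the three elements sets the ceiling for the entire system.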

Key Message

For stable LLM operations, the three elements of Computing-Power-Cooling must be balanced. If any one element is insufficient, it leads to system-wide performance degradation or errors. This emphasizes that AI infrastructure design must consider not only computing power but also adequate power supply and cooling systems together.


Summary

  • LLM operation requires a critical balance between computing, power supply, and cooling infrastructure.
  • Insufficient power causes power throttling, while inadequate cooling leads to thermal throttling, both resulting in workload errors.
  • Successful AI infrastructure design must holistically address all three components rather than focusing solely on computational capacity.

#LLM #AIInfrastructure #DataCenter #ThermalManagement #PowerManagement #AIOperations #MachineLearning #HPC #DataCenterCooling #AIHardware #ComputeOptimization #MLOps #TechInfrastructure #AIatScale #GreenAI

With Claude

The Perfect Paradox

The Perfect Paradox – Analysis

This diagram illustrates “The Perfect Paradox”, explaining the relationship between effort and results. Here are the key concepts:

Graph Analysis

Axes:

  • X-axis: Effort
  • Y-axis: Result

Pattern:

  • Initially, results increase proportionally with effort
  • After the Inflection Point (green circle), dramatically increased effort yields minimal or even diminishing returns
  • “Perfect” exists in an unreachable zone

Core Message

“Good Enough (Satisfying)”

  • Located near the inflection point
  • Represents the optimal effort-to-result ratio

The Central Paradox:

“Before ‘perfect’ lies ‘infinite’.”

This means achieving perfection requires infinite effort.
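One way to make this concrete is to model the result as a saturating curve. The curve below is my own illustration, not taken from the diagram: each unit of effort halves the remaining gap to "perfect" (result = 1), so reaching 1.0 exactly would take infinite effort.

```python
# Illustrative diminishing-returns curve: result(e) = 1 - 2**(-e).
# Each unit of effort halves the remaining distance to "perfect",
# so result -> 1 only as effort -> infinity.
def result(effort):
    return 1 - 2 ** (-effort)

for e in [1, 2, 3, 10, 20]:
    print(e, round(result(e), 6))
# The marginal gain from effort e to e+1 is 2**(-(e+1)): it shrinks
# geometrically, which is exactly the post-inflection flattening.
```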

AI Connection

The bottom arrow shows the evolution of approaches:

  • Rule-based Approach → Data-Driven Approach

Key Insight:

“While data-driven AI is now far beyond ‘good enough’, it remains imperfect.”

This suggests that modern AI achieves high performance, but pursuing practical utility is more rational than chasing perfection.


Summary

The Perfect Paradox shows that after a certain inflection point, exponentially more effort produces minimal improvement, making “perfect” practically unreachable. The optimal strategy is achieving “good enough” – the sweet spot where effort and results are balanced. Modern data-driven AI has surpassed “good enough” but remains imperfect, demonstrating that practical excellence trumps impossible perfection.

#PerfectParadox #DiminishingReturns #GoodEnough #EffortVsResults #PracticalExcellence #AILimitations #DataDrivenAI #InflectionPoint #OptimizationStrategy #PerfectionismVsPragmatism #ProductivityInsights #SmartEffort #AIPhilosophy #EfficiencyMatters #RealisticGoals

From RNN to Transformer

Visual Analysis: RNN vs Transformer

Visual Structure Comparison

RNN (Top): Sequential Chain

  • Linear flow: Circular nodes connected left-to-right
  • Hidden states: Each node processes sequentially
  • Attention weights: Numbers (2,5,11,4,2) show token importance
  • Bottleneck: Must process one token at a time

Transformer (Bottom): Parallel Grid

  • Matrix layout: 5×5 grid of interconnected nodes
  • Self-attention: All tokens connect to all others simultaneously
  • Multi-head: 5 parallel attention heads working together
  • Position encoding: Separate blue boxes handle sequence order
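The "grid" structure above can be shown with a tiny self-attention sketch. This uses toy scalar embeddings and no learned projections, purely to illustrate the shape of the computation: every output position mixes all positions in one step, with no left-to-right dependency.

```python
# Minimal self-attention sketch: each position attends to every other
# position simultaneously, instead of waiting for a sequential chain.
# Toy scalar "embeddings"; no learned Q/K/V projections.
import math

def self_attention(x):
    """Each position mixes all positions, weighted by similarity."""
    out = []
    for qi in x:
        scores = [qi * kj for kj in x]               # dot products (scalars)
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]  # softmax
        z = sum(weights)
        weights = [w / z for w in weights]
        out.append(sum(w * vj for w, vj in zip(weights, x)))
    return out

tokens = [0.1, 0.5, -0.2, 0.9]   # one scalar "embedding" per token
mixed = self_attention(tokens)   # all 4 outputs are independent of each
                                 # other, so a GPU can compute them in parallel
```

Because each loop iteration depends only on the input `x`, the outer loop parallelizes trivially; an RNN's recurrence, where step t needs the hidden state of step t−1, cannot be split this way.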

Key Visual Insights

Processing Pattern

  • RNN: Linear chain → Sequential dependency
  • Transformer: Interconnected grid → Parallel freedom

Information Flow

  • RNN: Single path with accumulating states
  • Transformer: Multiple simultaneous pathways

Attention Mechanism

  • RNN: Weights applied to existing sequence
  • Transformer: Direct connections between all elements

Design Effectiveness

The diagram succeeds by using:

  • Contrasting layouts to show architectural differences
  • Color coding to highlight attention mechanisms
  • Clear labels (“Sequential” vs “Parallel Processing”)
  • Visual metaphors that make complex concepts intuitive

The grid vs chain visualization immediately conveys why Transformers enable faster, more scalable processing than RNNs.

Summary

This diagram effectively illustrates the fundamental shift from sequential to parallel processing in neural architecture. The visual contrast between RNN’s linear chain and Transformer’s interconnected grid clearly demonstrates why Transformers revolutionized AI by enabling massive parallelization and better long-range dependencies.

With Claude

Parallel Processing

Parallel Processing System Analysis

System Architecture

1. Input Stage – Independent Processing

  • Multiple tasks are simultaneously input into the system in parallel
  • Each task can be processed independently of others

2. Central Processing Network

Blue Nodes (Modification Work)

  • Processing units that perform actual data modifications or computations
  • Handle parallel incoming tasks simultaneously

Yellow Nodes (Propagation Work)

  • Responsible for propagating changes to other nodes
  • Handle system-wide state synchronization

3. Synchronization Stage

  • Objective: “Work & Wait To Make Same State”
  • Wait until all nodes reach identical state
  • Essential process for ensuring data consistency

Performance Characteristics

Advantage: Massive Parallel

  • Increased throughput through large-scale parallel processing
  • Reduced overall processing time by executing multiple tasks simultaneously

Disadvantage: Massive Wait Cost

  • Wait time overhead for synchronization
  • Entire system must wait for the slowest node
  • Performance degradation due to synchronization overhead
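The "wait for the slowest node" cost is easy to quantify with a sketch. The timings below are invented for illustration: with a barrier, the step time is the maximum of the worker times, not their mean, so a single straggler dominates.

```python
# Sketch of the "Massive Wait Cost": with a synchronization barrier,
# step time = max(worker times), not the mean, so every fast worker
# idles until the slowest one finishes. Illustrative timings.
worker_times = [1.0, 1.1, 0.9, 4.0]   # seconds per step; one straggler

ideal = sum(worker_times) / len(worker_times)   # average work per worker
barrier_step = max(worker_times)                # everyone waits for 4.0 s
wait_overhead = [barrier_step - t for t in worker_times]

print(f"mean work: {ideal:.2f}s, barrier step: {barrier_step:.2f}s")
print("idle time per worker:", wait_overhead)   # only the straggler idles 0
```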

Key Trade-off

Parallel processing systems must balance performance enhancement with data consistency:

  • More parallelism = Higher performance, but more complex synchronization
  • Strong consistency guarantee = Longer wait times, but stable data state

This concept is directly related to the CAP Theorem (Consistency, Availability, Partition tolerance), which is a fundamental consideration in distributed system design.

With Claude

“Vectors” rather than definitions.

This image visualizes the core philosophy that “In the AI era, vector-based thinking is needed rather than simplified definitions.”

Paradigm Shift in the Upper Flow:

  • Definitions: Traditional linear and fixed textual definitions
  • Vector: Transformation into multidimensional and flexible vector space
  • Context: Structure where clustering and contextual relationships emerge through vectorization

Modern Approach in the Lower Flow:

  1. Big Data: Complex and diverse forms of data
  2. Machine Learning: Processing through pattern recognition and learning
  3. Classification: Sophisticated vector-based classification
  4. Clustered: Clustering based on semantic similarity
  5. Labeling: Dynamic labeling considering context

Core Insight: In the AI era, we must move beyond simplistic definitional thinking like “an apple is a red fruit” and understand an apple as a multidimensional vector encompassing color, taste, texture, nutritional content, cultural meaning, and more. This vector-based thinking enables richer contextual understanding and flexible reasoning, allowing us to solve complex real-world problems more effectively.

Beyond simple classification or definition, this presents a new cognitive paradigm that emphasizes relationships and context. The image advocates for a fundamental shift from rigid categorical thinking to a nuanced, multidimensional understanding that better reflects how modern AI systems process and interpret information.

With Claude