Multi-Head Latent Attention – Latent KV-Cache (DeepSeek v3)

Multi-Head Latent Attention – Latent KV-Cache Interpretation

This image explains the Multi-Head Latent Attention (MLA) mechanism and Latent KV-Cache technique for efficient inference in transformer models.

Core Concepts

1. Latent and Residual Split

Q, K, V are decomposed into two components:

  • Latent (C): Per-head content components (q^C, k^C, v^C) reconstructed from a compressed latent that is shared across heads
  • Residual (R): A small per-token component (q^R, k^R) that carries the position-specific (RoPE) detail the compressed latent does not capture

2. KV Cache Compression

Instead of caching the full per-head K and V, the cache stores only two small tensors per token:

  • c^KV (compressed KV latent): the shared latent from which k^C and v^C are reconstructed
  • k^R (decoupled RoPE key): a small, head-shared key that carries position information
  • This yields a large reduction in KV-cache size compared to MHA and GQA models (a rough comparison is sketched below)
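
To make the saving concrete, here is a back-of-the-envelope comparison. The dimensions below are illustrative values loosely in the spirit of published MLA configurations, not an exact DeepSeek-V3 spec:

```python
# Rough per-token KV-cache comparison (illustrative numbers; treat them as assumptions).
n_heads     = 128   # attention heads
d_head      = 128   # per-head dimension
d_latent_kv = 512   # dimension of the compressed KV latent c^KV
d_rope_key  = 64    # dimension of the shared decoupled RoPE key k^R

# Standard multi-head attention: cache full K and V for every head.
mha_cache_per_token = 2 * n_heads * d_head        # 32768 elements

# MLA: cache only the compressed latent plus the small RoPE key.
mla_cache_per_token = d_latent_kv + d_rope_key    # 576 elements

print(f"MHA: {mha_cache_per_token} elements/token")
print(f"MLA: {mla_cache_per_token} elements/token")
print(f"Reduction: ~{mha_cache_per_token / mla_cache_per_token:.0f}x")
```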

3. Operation Flow

  1. Down-project the input hidden state h_t into the latents c_t^Q and c_t^KV (FP8 matmuls)
  2. Up-project c_t^Q into the per-head q_{t,i}^C and q_{t,i}^R
  3. Reconstruct k^C and v^C from c^KV; the full key is the concatenation [k^C ; k^R], which is fed to Multi-Head Attention together with v^C
  4. Caching during inference: only c^KV and k^R are stored (shown with the checkered icon)
  5. Apply RoPE (Rotary Position Embedding) to q^R and k^R to inject position information (a minimal sketch of this flow follows the list)
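
The flow can be pictured with a small PyTorch sketch. Everything below (dimensions, layer names, the SimpleMLA class) is an illustrative stand-in for the diagram, not DeepSeek-V3's actual implementation; the RoPE rotation and FP8 handling are omitted for brevity:

```python
import torch
import torch.nn as nn

class SimpleMLA(nn.Module):
    """Minimal sketch of MLA-style projections and latent KV caching.
    Dimensions and layer names are illustrative, not DeepSeek-V3's."""

    def __init__(self, d_model=512, n_heads=8, d_head=64,
                 d_latent_q=192, d_latent_kv=128, d_rope=32):
        super().__init__()
        self.n_heads, self.d_head, self.d_rope = n_heads, d_head, d_rope
        # Down-projections to the shared latents.
        self.w_dq  = nn.Linear(d_model, d_latent_q,  bias=False)            # h_t -> c^Q
        self.w_dkv = nn.Linear(d_model, d_latent_kv, bias=False)            # h_t -> c^KV
        # Up-projections from the latents (per-head content parts).
        self.w_uq  = nn.Linear(d_latent_q,  n_heads * d_head, bias=False)   # c^Q  -> q^C
        self.w_qr  = nn.Linear(d_latent_q,  n_heads * d_rope, bias=False)   # c^Q  -> q^R
        self.w_uk  = nn.Linear(d_latent_kv, n_heads * d_head, bias=False)   # c^KV -> k^C
        self.w_uv  = nn.Linear(d_latent_kv, n_heads * d_head, bias=False)   # c^KV -> v^C
        # Decoupled RoPE key, shared across heads.
        self.w_kr  = nn.Linear(d_model, d_rope, bias=False)                 # h_t  -> k^R
        self.w_o   = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, h_t, cache):
        """h_t: (batch, d_model) for the current token.
        cache: holds only c^KV and k^R for all previous tokens."""
        b = h_t.shape[0]
        c_q  = self.w_dq(h_t)                        # latent query
        c_kv = self.w_dkv(h_t)                       # latent KV (this is what gets cached)
        k_r  = self.w_kr(h_t)                        # decoupled RoPE key (also cached)
        # (RoPE rotation of q^R / k^R omitted here for brevity.)

        cache["c_kv"] = torch.cat([cache["c_kv"], c_kv[:, None]], dim=1)  # (b, T, d_latent_kv)
        cache["k_r"]  = torch.cat([cache["k_r"],  k_r[:, None]],  dim=1)  # (b, T, d_rope)
        T = cache["c_kv"].shape[1]

        # Reconstruct per-head keys/values from the cached latents.
        k_c = self.w_uk(cache["c_kv"]).view(b, T, self.n_heads, self.d_head)
        v_c = self.w_uv(cache["c_kv"]).view(b, T, self.n_heads, self.d_head)
        k_r_all = cache["k_r"][:, :, None, :].expand(b, T, self.n_heads, self.d_rope)
        k = torch.cat([k_c, k_r_all], dim=-1)        # full key = [k^C ; k^R]

        q_c = self.w_uq(c_q).view(b, 1, self.n_heads, self.d_head)
        q_r = self.w_qr(c_q).view(b, 1, self.n_heads, self.d_rope)
        q = torch.cat([q_c, q_r], dim=-1)            # full query = [q^C ; q^R]

        # Scaled dot-product attention over all cached positions.
        attn = torch.einsum("bqhd,bkhd->bhqk", q, k) / (self.d_head + self.d_rope) ** 0.5
        w = attn.softmax(dim=-1)
        out = torch.einsum("bhqk,bkhd->bqhd", w, v_c).reshape(b, -1)
        return self.w_o(out), cache

# Usage: start from an empty cache, then feed tokens one at a time.
mla = SimpleMLA()
cache = {"c_kv": torch.zeros(2, 0, 128), "k_r": torch.zeros(2, 0, 32)}
for _ in range(3):
    y, cache = mla(torch.randn(2, 512), cache)
print(y.shape, cache["c_kv"].shape)   # torch.Size([2, 512]) torch.Size([2, 3, 128])
```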

4. FP8/FP32 Mixed Precision

  • FP8: Applied to most matrix multiplications (increases computational efficiency)
  • FP32: Applied to critical operations like RoPE (maintains numerical stability)
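
A hedged way to illustrate this policy without specialized FP8 GEMM kernels is to round-trip tensors through PyTorch's torch.float8_e4m3fn dtype (available in recent PyTorch releases) around a matmul, while keeping a numerically sensitive step in FP32. This only simulates the precision loss; real FP8 kernels use per-tensor scaling:

```python
import torch

def fp8_sim(x: torch.Tensor) -> torch.Tensor:
    """Simulate FP8 (e4m3) precision with a round-trip cast.
    Real FP8 GEMMs use scaled kernels; this only models the rounding."""
    return x.to(torch.float8_e4m3fn).to(x.dtype)

h = torch.randn(4, 512, dtype=torch.bfloat16)
w = torch.randn(512, 512, dtype=torch.bfloat16)

# "FP8" projection: inputs and weights rounded to FP8 before the matmul.
proj = fp8_sim(h) @ fp8_sim(w)

# Numerically sensitive rotation step (a stand-in for the RoPE rotation) kept in FP32.
angles = torch.arange(proj.shape[-1] // 2, dtype=torch.float32)
cos, sin = torch.cos(angles), torch.sin(angles)
x1, x2 = proj.float().chunk(2, dim=-1)           # promote to FP32
rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

out = rotated.to(torch.bfloat16)                 # back to BF16 for the rest of the network
print(out.dtype, out.shape)
```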

Key Advantages

  • Memory Efficiency: Caches only compressed representations instead of full K, V
  • Computational Efficiency: Fast inference using FP8
  • Long Sequence Processing: Enables understanding of long contexts through relative position information

Residual & RoPE Explanation

  • Residual (R): here, not the statistical "predicted minus actual" residual, but the small per-token component kept alongside the compressed latent; it is the part that receives RoPE and preserves position-specific detail
  • RoPE: A technique that rotates the Q and K vectors by a position-dependent angle, so that their dot product (the attention score) depends only on the relative distance between tokens (demonstrated in the sketch below)
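
The relative-position property can be checked directly with a small, self-contained RoPE implementation (an illustrative version; production code caches the cos/sin tables and applies the rotation per attention head):

```python
import torch

def rope(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (..., seq, dim), dim even.
    Illustrative implementation for clarity, not an optimized kernel."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))   # (dim/2,)
    angles = pos[:, None].float() * inv_freq[None, :]                    # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                                  # interleaved pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                                 # rotate each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# The attention score q·k depends only on the offset between the two positions.
q = torch.randn(1, 1, 64)
k = torch.randn(1, 1, 64)
q_pos, k_pos   = torch.tensor([3]),   torch.tensor([7])     # offset 4
q_pos2, k_pos2 = torch.tensor([103]), torch.tensor([107])   # same offset, shifted by 100
score_a = (rope(q, q_pos)  * rope(k, k_pos)).sum()
score_b = (rope(q, q_pos2) * rope(k, k_pos2)).sum()
print(torch.allclose(score_a, score_b, atol=1e-3))  # True (up to float error)
```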

Summary

This technique represents a cutting-edge optimization for LLM inference that dramatically reduces memory footprint by storing only compressed latent representations in the KV cache while maintaining model quality. The combination of latent-residual decomposition and mixed precision (FP8/FP32) enables both faster computation and longer context handling. RoPE further enhances the model’s ability to understand relative positions in extended sequences.

#MultiHeadAttention #LatentAttention #KVCache #TransformerOptimization #LLMInference #ModelCompression #MixedPrecision #FP8 #RoPE #EfficientAI #DeepLearning #AttentionMechanism #ModelAcceleration #AIOptimization #NeuralNetworks

With Claude

FP8 Mixed-Precision Training

FP8 Mixed-Precision Training Interpretation

This image is a technical diagram showing FP8 (8-bit Floating Point) Mixed-Precision Training methodology.

Three Main Architectures

1. Mixture of Experts (MoE)

  • Input: Starts with BF16 precision
  • Calc (1): Router output & input hidden states → BF16
  • Calc (2): Expert FFN (Feed-Forward Network) → FP8 computation
  • Calc (3): Accumulation → FP32
  • Transmit (Dispatch): Token dispatch (All-to-All) → FP8
  • Transmit (Combine): Combine expert outputs → BF16
  • Output: BF16
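
The sketch below mimics this precision flow on a toy top-1 MoE layer. The routing scheme, expert sizes, and the fp8_sim round-trip cast are simplifications for illustration; DeepSeek-V3's actual MoE uses top-K routing with shared experts, all-to-all dispatch, and real scaled FP8 GEMMs:

```python
import torch
import torch.nn as nn

def fp8_sim(x):
    """Round-trip through float8_e4m3fn to mimic FP8 storage/compute precision."""
    return x.to(torch.float8_e4m3fn).to(torch.bfloat16)

class ToyMoE(nn.Module):
    """Toy top-1 MoE illustrating the per-operation precision policy in the diagram."""
    def __init__(self, d=256, n_experts=4, d_ff=512):
        super().__init__()
        self.router = nn.Linear(d, n_experts, dtype=torch.bfloat16)
        self.w_in  = nn.Parameter(torch.randn(n_experts, d, d_ff, dtype=torch.bfloat16) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_ff, d, dtype=torch.bfloat16) * 0.02)

    def forward(self, x):                        # x: (tokens, d) in BF16
        logits = self.router(x)                  # (1) router in BF16
        expert_id = logits.argmax(dim=-1)        # top-1 routing
        gate = logits.softmax(dim=-1).gather(1, expert_id[:, None])
        out = torch.zeros_like(x, dtype=torch.float32)      # (3) accumulate in FP32
        for e in range(self.w_in.shape[0]):
            sel = expert_id == e                 # "dispatch": tokens sent to expert e
            if not sel.any():
                continue
            t = fp8_sim(x[sel])                  # (2) expert FFN compute in (simulated) FP8
            h = torch.relu(t @ fp8_sim(self.w_in[e]))
            y = fp8_sim(h) @ fp8_sim(self.w_out[e])
            out[sel] += (gate[sel] * y).float()  # FP32 accumulation of expert outputs
        return out.to(torch.bfloat16)            # "combine"/output back in BF16

moe = ToyMoE()
with torch.no_grad():
    y = moe(torch.randn(8, 256, dtype=torch.bfloat16))
print(y.dtype, y.shape)   # torch.bfloat16 torch.Size([8, 256])
```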

2. Multi-head Latent Attention

  • Input: BF16
  • Calc (1): Input hidden states → BF16
  • Calc (2): Projection/Query/Key/Value → FP8
  • Calc (3): Key/Value compression → BF16
  • Stabilization: RMSNorm → FP32
  • Output: Output hidden states → BF16

3. Multi-Token Prediction

  • Input: BF16
  • Calc (1): Embedding layer output → BF16
  • Calc (2): Transformer block → FP8
  • Calc (3): RMSNorm → FP32
  • Calc (4): Linear projection → BF16
  • Output: Output hidden states → BF16

Precision Strategy (Bottom Boxes)

🟦 BF16 (Default)

  • Works for most tasks
  • Balanced speed/stability

🟪 FP8 (Fastest)

  • For large compute/data movement
  • Very energy-efficient

🟣 FP32 (Safest/Most Precise)

  • For accuracy-critical or sensitive math operations

Summary

FP8 mixed-precision training strategically uses different numerical precisions across model operations: FP8 for compute-intensive operations (FFN, attention, transformers) to maximize speed and efficiency, FP32 for sensitive operations like accumulation and normalization to maintain numerical stability, and BF16 for input/output and communication to balance performance. This approach enables faster training with lower energy consumption while preserving model accuracy, making it ideal for training large-scale AI models efficiently.


#FP8Training #MixedPrecision #AIOptimization #DeepLearning #ModelEfficiency #NeuralNetworks #ComputeOptimization #MLPerformance #TransformerTraining #EfficientAI #LowPrecisionTraining #AIInfrastructure #MachineLearning #GPUOptimization #ModelTraining

With Claude

Multi-Token Prediction (MTP) – Increasing Inference Speed

This image explains the Multi-Token Prediction (MTP) architecture that improves inference speed.

Overall Structure

Left: Main Model

  • Starts with an Embedding Layer that converts input tokens to vectors
  • Deep neural network composed of L Transformer Blocks
  • RMSNorm stabilizes the range of Transformer input/output values
  • Finally, the Output Head (BF16 precision) calculates the probability distribution for the next token

Right: MTP Module 1 (Speculative Decoding Module) + More MTP Modules

  • Maximizes efficiency by reusing the Main Model’s outputs
  • Two RMSNorms separately normalize the Main Model's hidden states and the embeddings of the shifted input tokens
  • Performs lightweight computation with a single Transformer Block using FP8 Mixed Precision
  • Concatenates the two normalized streams and applies a Linear Projection to produce the vector used for future-token prediction
  • Produces candidate tokens with BF16 precision
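
A compact sketch of such a module is shown below. The layer sizes, the use of nn.TransformerEncoderLayer as the "single Transformer Block", and the toy embedding/head are assumptions for illustration (nn.RMSNorm requires a recent PyTorch); the real module shares the embedding table and output head with the Main Model, which the usage lines imitate:

```python
import torch
import torch.nn as nn

class MTPModule(nn.Module):
    """Sketch of a single MTP module; sizes and block choice are placeholders."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm_hidden = nn.RMSNorm(d_model)       # normalizes the Main Model's hidden states
        self.norm_embed  = nn.RMSNorm(d_model)       # normalizes the shifted token embeddings
        self.proj = nn.Linear(2 * d_model, d_model)  # fuse the two streams
        self.block = nn.TransformerEncoderLayer(     # the single lightweight transformer block
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)       # (causal mask omitted for brevity)

    def forward(self, hidden, next_embed):
        """hidden:     (batch, seq, d_model) from the Main Model's last layer
        next_embed: (batch, seq, d_model) embeddings of tokens shifted one step ahead,
                    produced by the *shared* embedding layer."""
        fused = self.proj(torch.cat(
            [self.norm_hidden(hidden), self.norm_embed(next_embed)], dim=-1))
        return self.block(fused)   # fed to the shared output head to score the token after next

# Usage: reuse the Main Model's embedding table and output head (shared weights).
vocab, d_model = 1000, 512
embed = nn.Embedding(vocab, d_model)
head  = nn.Linear(d_model, vocab, bias=False)    # shared output head
mtp = MTPModule(d_model)

tokens = torch.randint(0, vocab, (2, 16))
main_hidden = torch.randn(2, 16, d_model)        # stand-in for the Main Model's output
logits_extra = head(mtp(main_hidden[:, :-1], embed(tokens[:, 1:])))
print(logits_extra.shape)                        # torch.Size([2, 15, 1000])
```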

Key Features

  1. Two-stage processing: The Main Model accurately predicts the next token, while the MTP Module generates additional candidate tokens in advance
  2. Efficiency:
    • Shares the Embedding Layer with the Main Model to avoid recalculation
    • Reduces computational load with FP8 Mixed Precision
    • Uses only a single Transformer Block
  3. Stability: RMSNorm ensures stable processing of outputs that haven’t passed through the Main Model’s deep layers
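
When the MTP head is reused for speculative decoding at inference time, the drafted tokens are verified by the main model in a single forward pass. The sketch below shows a simple greedy-verification loop; main_model and the draft_tokens tensor are hypothetical interfaces, and full schemes use rejection sampling rather than exact argmax matching:

```python
import torch

def speculative_step(main_model, draft_tokens, prefix):
    """Greedy verification sketch: accept drafted tokens while they match the
    main model's own argmax prediction, then append one corrected/bonus token.
    `main_model(ids)` is assumed to return logits of shape (1, len, vocab);
    `draft_tokens` is a (k,) tensor proposed by the MTP head. Both are
    hypothetical interfaces, not a specific library API."""
    candidate = torch.cat([prefix, draft_tokens.unsqueeze(0)], dim=1)  # (1, T + k)
    logits = main_model(candidate)                 # one full forward pass over the draft
    preds = logits.argmax(dim=-1)                  # main model's next-token choices
    T = prefix.shape[1]
    accepted = []
    for i, tok in enumerate(draft_tokens):
        expected = preds[0, T - 1 + i]             # what the main model would emit here
        if tok != expected:
            accepted.append(expected)              # replace the first mismatch...
            break                                  # ...and discard the rest of the draft
        accepted.append(tok)
    else:
        accepted.append(preds[0, T - 1 + len(draft_tokens)])  # bonus token if all matched
    return torch.cat([prefix, torch.stack(accepted).unsqueeze(0)], dim=1)
```

In the best case every drafted token plus one bonus token from the main model is committed per verification pass, which is where the latency reduction comes from.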

Summary

MTP architecture accelerates inference by using a lightweight module alongside the main model to speculatively generate multiple future tokens in parallel. It achieves efficiency through shared embeddings, mixed precision operations, and a single transformer block while maintaining stability through normalization layers. This approach significantly reduces latency in large language model generation.

#MultiTokenPrediction #MTP #SpeculativeDecoding #LLM #TransformerOptimization #InferenceAcceleration #MixedPrecision #AIEfficiency #NeuralNetworks #DeepLearning

With Claude