Multi-Token Prediction (MTP) – Increasing Inference Speed

This image explains the Multi-Token Prediction (MTP) architecture that improves inference speed.

Overall Structure

Left: Main Model

  • Starts with an Embedding Layer that converts input tokens to vectors
  • Deep neural network composed of L Transformer Blocks
  • RMSNorm stabilizes the range of Transformer input/output values
  • Finally, the Output Head (BF16 precision) calculates the probability distribution for the next token
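
Under assumed toy dimensions, the stack above could be sketched in PyTorch roughly as follows (`MainModel`, the layer count, and all sizes are illustrative, not the actual model's configuration):

```python
import torch
import torch.nn as nn

class MainModel(nn.Module):
    """Toy version of the stack above: Embedding -> L Transformer Blocks -> RMSNorm -> Output Head."""
    def __init__(self, vocab_size=32000, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)            # later shared with the MTP module
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True) for _ in range(n_layers)
        )
        self.norm = nn.RMSNorm(d_model)                           # nn.RMSNorm needs torch >= 2.4
        self.head = nn.Linear(d_model, vocab_size, bias=False)    # output head -> next-token logits

    def forward(self, tokens):                                    # tokens: (batch, seq) token ids
        h = self.embed(tokens)
        for block in self.blocks:                                 # causal masking omitted for brevity
            h = block(h)
        h = self.norm(h)
        return self.head(h), h                                    # logits plus the final hidden state
```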

Right: MTP Module 1 (Speculative Decoding Module) + More MTP Modules

  • Maximizes efficiency by reusing the Main Model’s outputs
  • Two RMSNorms normalize the Main Model’s hidden state and the incoming token embedding, respectively
  • Performs lightweight operations using a single Transformer Block with FP8 Mixed Precision
  • Generates vectors specialized for future-token prediction by concatenating the two normalized inputs and applying a Linear Projection (see the sketch after this list)
  • Produces candidate tokens with BF16 precision
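
A rough sketch of how such a module might be wired, reusing the Main Model's embedding and output head (all names and sizes below are illustrative, and the FP8 kernels are omitted since they depend on hardware-specific libraries):

```python
import torch
import torch.nn as nn

class MTPModule(nn.Module):
    """Drafts one extra future token from the Main Model's final hidden state plus the
    embedding of the newest input token (shared weights, single Transformer Block)."""
    def __init__(self, shared_embed: nn.Embedding, shared_head: nn.Linear, d_model=512, n_heads=8):
        super().__init__()
        self.embed = shared_embed                    # reused Embedding Layer (no recomputation)
        self.head = shared_head                      # reused Output Head
        self.norm_h = nn.RMSNorm(d_model)            # normalizes the Main Model's hidden state
        self.norm_e = nn.RMSNorm(d_model)            # normalizes the fresh token embedding
        self.proj = nn.Linear(2 * d_model, d_model, bias=False)   # Linear Projection after concat
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)  # single block

    def forward(self, hidden, tokens):
        # hidden: (batch, seq, d_model) from the Main Model; tokens: (batch, seq) input ids
        merged = torch.cat([self.norm_h(hidden), self.norm_e(self.embed(tokens))], dim=-1)
        return self.head(self.block(self.proj(merged)))           # candidate-token logits
```

Additional MTP modules would be chained the same way, each drafting a token one step further ahead from the previous module's output.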

Key Features

  1. Two-stage processing: The Main Model accurately predicts the next token, while the MTP Module drafts additional candidate tokens in advance (a toy accept/verify loop is sketched after this list)
  2. Efficiency:
    • Shares the Embedding Layer with the Main Model to avoid recalculation
    • Reduces computational load with FP8 Mixed Precision
    • Uses only a single Transformer Block
  3. Stability: RMSNorm ensures stable processing of outputs that haven’t passed through the Main Model’s deep layers
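
For intuition, a toy accept/verify loop in the spirit of speculative decoding is sketched below (greedy matching only; real systems verify drafts in a single batched forward pass and handle sampling rather than argmax):

```python
import torch

def accept_drafts(draft_tokens, verify_logits):
    """Keep each drafted token only while it matches the main model's argmax at that
    position; at the first disagreement, take the main model's token and stop."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        main_choice = int(verify_logits[i].argmax())
        accepted.append(main_choice)          # the main model's choice is always kept
        if main_choice != int(tok):           # mismatch: the remaining drafts are discarded
            break
    return accepted

# Toy check: drafts [5, 7, 9] against a main model that agrees on the first two positions only.
logits = torch.full((3, 10), -1.0)
logits[0, 5] = logits[1, 7] = logits[2, 2] = 1.0
print(accept_drafts([5, 7, 9], logits))       # -> [5, 7, 2]
```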

Summary

MTP architecture accelerates inference by using a lightweight module alongside the main model to speculatively generate multiple future tokens in parallel. It achieves efficiency through shared embeddings, mixed precision operations, and a single transformer block while maintaining stability through normalization layers. This approach significantly reduces latency in large language model generation.

#MultiTokenPrediction #MTP #SpeculativeDecoding #LLM #TransformerOptimization #InferenceAcceleration #MixedPrecision #AIEfficiency #NeuralNetworks #DeepLearning

With Claude

Multi-Head Latent Attention – Changes

Multi-Head Latent Attention (MLA) Interpretation

This image is a technical diagram explaining the structure of Multi-Head Latent Attention (MLA).

🎯 Core Concept

MLA is a mechanism that improves the memory efficiency of traditional Multi-Head Attention.

Traditional Approach (Before) vs MLA

Traditional Approach:

  • Stores K, V vectors of all past tokens
  • Memory usage increases linearly with sequence length

MLA:

  • Compresses each token’s Key/Value information into a small, fixed-size Latent vector (c^KV)
  • The cache still grows with sequence length, but each cached entry is far smaller, so memory usage drops sharply (a per-token comparison is sketched below)
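
A back-of-the-envelope comparison of the per-token cache footprint, using made-up dimensions purely for illustration:

```python
# Assumed, illustrative sizes only (not a real model configuration); BF16 = 2 bytes per element.
n_heads, head_dim, latent_dim, bytes_per_elem = 32, 128, 512, 2

mha_per_token = 2 * n_heads * head_dim * bytes_per_elem   # full K and V for every head
mla_per_token = latent_dim * bytes_per_elem               # one compressed latent c^KV

print(mha_per_token, "bytes/token for standard multi-head attention")   # 16384
print(mla_per_token, "bytes/token for MLA")                             # 1024 -> ~16x smaller here
```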

📊 Architecture Explanation

1. Input Processing

  • Starts from Input Hidden State (h_t)

2. Latent Vector Generation

  • Latent c_t^Q: For Query of current token (compressed representation)
  • Latent c_t^KV: For Key-Value (cached and reused)

3. Query, Key, Value Generation

  • Query (q): Generated from the current token’s hidden state h_t (via the query latent c_t^Q)
  • Key-Value: Generated from Latent c_t^KV
    • A Compressed part (superscript C) and a decoupled positional part (superscript R, carrying the RoPE rotary encoding) are created
    • The two parts are concatenated for use in attention

4. Multi-Head Attention Execution

  • Performs attention computation with generated Q, K, V
  • Uses BF16 (Mixed Precision)
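
A condensed, single-head sketch of this data flow (RoPE, the decoupled positional parts, multi-head splitting, and causal masking are all omitted; every dimension is an assumption for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMLA(nn.Module):
    """Single-head sketch: compress h_t into small latents, cache only the KV latent,
    and re-expand it into keys/values when attention runs."""
    def __init__(self, d_model=512, d_latent=64, d_head=64):
        super().__init__()
        self.down_q  = nn.Linear(d_model, d_latent, bias=False)   # h_t -> c_t^Q
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)   # h_t -> c_t^KV (the cached part)
        self.up_q = nn.Linear(d_latent, d_head, bias=False)       # c_t^Q  -> query
        self.up_k = nn.Linear(d_latent, d_head, bias=False)       # c_t^KV -> key
        self.up_v = nn.Linear(d_latent, d_head, bias=False)       # c_t^KV -> value
        self.out  = nn.Linear(d_head, d_model, bias=False)

    def forward(self, h, latent_cache=None):
        # h: (batch, new_seq, d_model) hidden states of the newly arrived token(s)
        c_q, c_kv = self.down_q(h), self.down_kv(h)
        if latent_cache is not None:                   # append the small latent, not full K/V
            c_kv = torch.cat([latent_cache, c_kv], dim=1)
        q = self.up_q(c_q)
        k, v = self.up_k(c_kv), self.up_v(c_kv)        # recover K/V from the latent on the fly
        attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)  # no causal mask here
        return self.out(attn @ v), c_kv                # output and the updated latent cache
```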

✅ Key Advantages

  1. Memory Efficiency: Compresses past information into fixed-size vectors
  2. Faster Inference: Reuses cached Latent vectors
  3. Information Preservation: Maintains performance by combining compressed and recent information
  4. Mixed Precision Support: Utilizes FP8, FP32, BF16

🔑 Key Differences

  • The value has no decoupled positional component: v_t^R is not used (the purple box on the right side of the diagram)
  • The Value of the current token is derived from h_t by way of the compressed latent
  • This enables efficient combination of compressed past information and current information

This architecture is an innovative approach to solving the KV cache memory problem during LLM inference.


Summary

MLA replaces each token’s bulky Key/Value cache entries with small, fixed-size latent vectors, dramatically reducing memory consumption during inference. It combines compressed past information with current token data through an efficient attention mechanism. This innovation enables faster and more memory-efficient LLM inference while maintaining model performance.

#MultiHeadLatentAttention #MLA #TransformerOptimization #LLMInference #KVCache #MemoryEfficiency #AttentionMechanism #DeepLearning #NeuralNetworks #AIArchitecture #ModelCompression #EfficientAI #MachineLearning #NLP #LargeLanguageModels

With Claude

Multi-Head Latent Attention (MLA) Compression

Multi-Head Latent Attention (MLA) Compression Interpretation

This image explains the Multi-Head Latent Attention (MLA) compression technique from two perspectives.

Core Concepts

Left Panel: Matrix Perspective of Compression

  • Multiple attention heads (represented as cross-shaped matrices) are consolidated into a single compressed matrix
  • Multiple independent matrices are transformed into one compressed representation that retains their essential features
  • The original can be reconstructed from this compressed representation
  • Only minor loss occurs while achieving dramatic N-to-1 compression

Right Panel: Vector (Directional) Perspective of Compression

  • Vectors extending in various directions from a central point
  • Each vector represents the directionality and features of different attention heads
  • Similar vectors are compressed while preserving directional information (vector features)
  • Original information can be recovered through vector features even after compression
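
The “compress but keep recoverable features” idea can be illustrated with a plain truncated SVD on synthetic data (MLA learns its projections end-to-end rather than computing an SVD; this is only a numerical analogy):

```python
import torch

torch.manual_seed(0)
# Pretend each of 32 rows holds one attention head's features; the rows are mostly
# explained by 8 shared directions plus a little noise, so they compress well.
heads = torch.randn(32, 8) @ torch.randn(8, 64) + 0.05 * torch.randn(32, 64)

U, S, Vh = torch.linalg.svd(heads, full_matrices=False)
r = 8                                    # number of latent features kept
compressed = U[:, :r] * S[:r]            # 32 x 8: the only thing that would be stored
recovered = compressed @ Vh[:r]          # 32 x 64: reconstruction from the kept features

rel_err = (heads - recovered).norm() / heads.norm()
print(f"kept {r}/{heads.shape[1]} features per head, relative error: {rel_err:.3f}")  # "minor loss"
```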

Key Mechanism

Compression โ†’ Recovery Process:

  • Multiple heads are compressed into latent features
  • During storage, only the compressed representation is maintained, drastically reducing storage space
  • When needed, original head information can be recovered using stored features (vectors)
  • Loss is minimal while memory efficiency is maximized
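
As a sketch of that store/recover cycle, a cache could hold only the small latent per token and re-expand it into full keys and values when attention needs them (the class name and projection shapes below are assumptions for illustration):

```python
import torch
import torch.nn as nn

class LatentKVCache:
    """Stores one small latent per token; full K/V are re-materialized only on demand."""
    def __init__(self, up_k: nn.Linear, up_v: nn.Linear):
        self.up_k, self.up_v = up_k, up_v       # learned "recovery" projections
        self.latents = []                       # list of (batch, 1, d_latent) tensors

    def append(self, c_kv):
        self.latents.append(c_kv)               # keep only the compressed representation

    def recover(self):
        c = torch.cat(self.latents, dim=1)      # (batch, seq, d_latent)
        return self.up_k(c), self.up_v(c)       # full keys and values, rebuilt when needed

# Usage sketch: the cache grows by d_latent numbers per token instead of 2 * n_heads * d_head.
d_latent, d_head = 64, 128
cache = LatentKVCache(nn.Linear(d_latent, d_head, bias=False),
                      nn.Linear(d_latent, d_head, bias=False))
cache.append(torch.randn(1, 1, d_latent))
keys, values = cache.recover()
print(keys.shape, values.shape)                 # torch.Size([1, 1, 128]) x 2
```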

Main Advantages (Bottom Boxes)

  1. MLA Compression: Efficient compression of multi-head attention
  2. Keep features (vector): Preserves vector features for reconstruction
  3. Minor loss: Maintains performance with negligible information loss
  4. Memory Efficiency: Dramatically reduces storage space
  5. For K-V Cache: Optimizes Key-Value cache memory

Practical Significance

This technique transforms N attention heads into 1 compressed representation in large language models, dramatically reducing storage space while enabling recovery through feature vectors when needed – a lossy compression method. It significantly reduces the memory burden of K-V cache, maximizing inference efficiency.

#MLACompression #MultiHeadAttention #LLMEfficiency #MemoryEfficiency #KVCache #TransformerOptimization #DeepLearning #AIResearch #ModelCompression

With Claude