Multi-Token Prediction (MTP) – Increasing Inference Speed

This diagram illustrates the Multi-Token Prediction (MTP) architecture, which improves inference speed.

Overall Structure

Left: Main Model

  • Starts with an Embedding Layer that converts input tokens to vectors
  • A deep stack of L Transformer Blocks forms the core of the network
  • RMSNorm stabilizes the range of Transformer input/output values
  • Finally, the Output Head (BF16 precision) computes the probability distribution for the next token (see the sketch after this list)
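
Below is a minimal PyTorch sketch of this pipeline. The class names, sizes, and the simplified TransformerBlock (causal masking and other details omitted) are illustrative assumptions, not the exact implementation shown in the diagram:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by the RMS, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class TransformerBlock(nn.Module):
    """Stand-in for a full pre-norm attention + MLP block (causal mask omitted for brevity)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, _ = self.attn(self.norm1(x), self.norm1(x), self.norm1(x), need_weights=False)
        x = x + a
        return x + self.mlp(self.norm2(x))

class MainModel(nn.Module):
    """Embedding -> L Transformer Blocks -> RMSNorm -> Output Head."""
    def __init__(self, vocab_size: int, dim: int, num_layers: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # shared with the MTP modules
        self.blocks = nn.ModuleList(TransformerBlock(dim) for _ in range(num_layers))
        self.norm = RMSNorm(dim)
        self.head = nn.Linear(dim, vocab_size)       # run in BF16 in the diagram

    def forward(self, tokens: torch.Tensor):
        h = self.embed(tokens)
        for block in self.blocks:
            h = block(h)
        h = self.norm(h)
        return self.head(h), h   # next-token logits + hidden states reused by the MTP module
```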

Right: MTP Module 1 (Speculative Decoding Module), followed by further MTP Modules

  • Maximizes efficiency by reusing the Main Model’s outputs instead of recomputing them
  • Two RMSNorms separately normalize the hidden state carried over from the Main Model and the embedding of the incoming token
  • Performs lightweight computation using a single Transformer Block run in FP8 Mixed Precision
  • The two normalized vectors are concatenated and passed through a Linear Projection to form the representation used to predict a future token
  • An Output Head in BF16 precision produces the candidate token’s distribution (see the sketch after this list)
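
Continuing the sketch above, here is a hedged outline of one MTP module. It reuses RMSNorm and TransformerBlock from the previous snippet; sharing the output head with the Main Model is my assumption (the diagram only states that the head runs in BF16), and FP8 execution is noted in a comment since it depends on the runtime:

```python
import torch
import torch.nn as nn

class MTPModule(nn.Module):
    """One speculative-decoding module; reuses RMSNorm / TransformerBlock from the sketch above."""
    def __init__(self, dim: int, shared_embed: nn.Embedding, shared_head: nn.Linear):
        super().__init__()
        self.embed = shared_embed            # reused from the Main Model, not recomputed
        self.norm_hidden = RMSNorm(dim)      # normalizes the hidden state handed over
        self.norm_embed = RMSNorm(dim)       # normalizes the incoming token embedding
        self.proj = nn.Linear(2 * dim, dim)  # linear projection after concatenation
        self.block = TransformerBlock(dim)   # single block; FP8 mixed precision in the diagram
        self.head = shared_head              # BF16 output head (sharing is an assumption)

    def forward(self, hidden: torch.Tensor, tokens: torch.Tensor):
        e = self.embed(tokens)
        x = torch.cat([self.norm_hidden(hidden), self.norm_embed(e)], dim=-1)
        h = self.block(self.proj(x))
        return self.head(h), h   # candidate-token logits + hidden state for the next MTP module
```

Chaining several such modules, each feeding its hidden state to the next, yields one additional drafted token per module.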

Key Features

  1. Two-stage processing: The Main Model accurately predicts the next token, while the MTP Module drafts additional candidate tokens in advance that the Main Model can then verify (see the verification sketch after this list)
  2. Efficiency:
    • Shares the Embedding Layer with the Main Model to avoid recalculation
    • Reduces computational load with FP8 Mixed Precision
    • Uses only a single Transformer Block
  3. Stability: RMSNorm keeps the module’s inputs on a consistent scale, since the raw token embeddings have not passed through the Main Model’s deep layers
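
To make the two-stage idea concrete, here is a small sketch of greedy draft verification: the MTP modules propose k tokens, the Main Model scores those positions on its next forward pass, and only the leading run of matching tokens is accepted. The exact acceptance rule varies between implementations; this shows the simplest greedy variant with illustrative names:

```python
import torch

def count_accepted(draft_tokens: torch.Tensor, main_predictions: torch.Tensor) -> int:
    """Greedy verification: accept drafted tokens up to the first mismatch.

    draft_tokens:     k tokens proposed by the MTP modules, shape (k,)
    main_predictions: argmax of the Main Model's logits at the same positions, shape (k,)
    """
    matches = (draft_tokens == main_predictions).long()
    return int(matches.cumprod(dim=0).sum().item())  # length of the matching prefix

# Toy example: 3 drafted tokens, the first two agree with the Main Model,
# so one Main Model step advances generation by 2 accepted drafts + 1 verified token.
drafts = torch.tensor([11, 42, 7])
checks = torch.tensor([11, 42, 9])
print(count_accepted(drafts, checks))  # -> 2
```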

Summary

The MTP architecture accelerates inference by pairing the main model with lightweight modules that speculatively draft multiple future tokens; the main model can then verify those drafts in a single parallel pass. Efficiency comes from sharing the embedding layer, using mixed-precision (FP8/BF16) operations, and running only a single transformer block per module, while RMSNorm layers keep the extra prediction path numerically stable. The result is significantly lower generation latency for large language models.

#MultiTokenPrediction #MTP #SpeculativeDecoding #LLM #TransformerOptimization #InferenceAcceleration #MixedPrecision #AIEfficiency #NeuralNetworks #DeepLearning

With Claude