Multi-Token Prediction (MTP) – Increasing Inference Speed

This diagram illustrates the Multi-Token Prediction (MTP) architecture, which improves inference speed.

Overall Structure

Left: Main Model

  • Starts with an Embedding Layer that converts input tokens to vectors
  • A deep stack of L Transformer Blocks forms the core of the network
  • RMSNorm stabilizes the range of Transformer input/output values
  • Finally, the Output Head (BF16 precision) computes the probability distribution for the next token (see the sketch after this list)
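
Below is a minimal PyTorch sketch of this pipeline. The class names, sizes, and the simplified TransformerBlock (causal masking and other details omitted) are illustrative assumptions, not the exact implementation shown in the diagram:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by the RMS, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class TransformerBlock(nn.Module):
    """Stand-in for a full pre-norm attention + MLP block (causal mask omitted for brevity)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, _ = self.attn(self.norm1(x), self.norm1(x), self.norm1(x), need_weights=False)
        x = x + a
        return x + self.mlp(self.norm2(x))

class MainModel(nn.Module):
    """Embedding -> L Transformer Blocks -> RMSNorm -> Output Head."""
    def __init__(self, vocab_size: int, dim: int, num_layers: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # shared with the MTP modules
        self.blocks = nn.ModuleList(TransformerBlock(dim) for _ in range(num_layers))
        self.norm = RMSNorm(dim)
        self.head = nn.Linear(dim, vocab_size)       # run in BF16 in the diagram

    def forward(self, tokens: torch.Tensor):
        h = self.embed(tokens)
        for block in self.blocks:
            h = block(h)
        h = self.norm(h)
        return self.head(h), h   # next-token logits + hidden states reused by the MTP module
```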

Right: MTP Module 1 (Speculative Decoding Module), followed by further MTP Modules

  • Maximizes efficiency by reusing the Main Model’s outputs instead of recomputing them
  • Two RMSNorms separately normalize the hidden state carried over from the Main Model and the embedding of the incoming token
  • Performs lightweight computation using a single Transformer Block run in FP8 Mixed Precision
  • The two normalized vectors are concatenated and passed through a Linear Projection to form the representation used to predict a future token
  • An Output Head in BF16 precision produces the candidate token’s distribution (see the sketch after this list)
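
Continuing the sketch above, here is a hedged outline of one MTP module. It reuses RMSNorm and TransformerBlock from the previous snippet; sharing the output head with the Main Model is my assumption (the diagram only states that the head runs in BF16), and FP8 execution is noted in a comment since it depends on the runtime:

```python
import torch
import torch.nn as nn

class MTPModule(nn.Module):
    """One speculative-decoding module; reuses RMSNorm / TransformerBlock from the sketch above."""
    def __init__(self, dim: int, shared_embed: nn.Embedding, shared_head: nn.Linear):
        super().__init__()
        self.embed = shared_embed            # reused from the Main Model, not recomputed
        self.norm_hidden = RMSNorm(dim)      # normalizes the hidden state handed over
        self.norm_embed = RMSNorm(dim)       # normalizes the incoming token embedding
        self.proj = nn.Linear(2 * dim, dim)  # linear projection after concatenation
        self.block = TransformerBlock(dim)   # single block; FP8 mixed precision in the diagram
        self.head = shared_head              # BF16 output head (sharing is an assumption)

    def forward(self, hidden: torch.Tensor, tokens: torch.Tensor):
        e = self.embed(tokens)
        x = torch.cat([self.norm_hidden(hidden), self.norm_embed(e)], dim=-1)
        h = self.block(self.proj(x))
        return self.head(h), h   # candidate-token logits + hidden state for the next MTP module
```

Chaining several such modules, each feeding its hidden state to the next, yields one additional drafted token per module.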

Key Features

  1. Two-stage processing: The Main Model accurately predicts the next token, while the MTP Module drafts additional candidate tokens in advance that the Main Model can then verify (see the verification sketch after this list)
  2. Efficiency:
    • Shares the Embedding Layer with the Main Model to avoid recalculation
    • Reduces computational load with FP8 Mixed Precision
    • Uses only a single Transformer Block
  3. Stability: RMSNorm keeps the module’s inputs on a consistent scale, since the raw token embeddings have not passed through the Main Model’s deep layers
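
To make the two-stage idea concrete, here is a small sketch of greedy draft verification: the MTP modules propose k tokens, the Main Model scores those positions on its next forward pass, and only the leading run of matching tokens is accepted. The exact acceptance rule varies between implementations; this shows the simplest greedy variant with illustrative names:

```python
import torch

def count_accepted(draft_tokens: torch.Tensor, main_predictions: torch.Tensor) -> int:
    """Greedy verification: accept drafted tokens up to the first mismatch.

    draft_tokens:     k tokens proposed by the MTP modules, shape (k,)
    main_predictions: argmax of the Main Model's logits at the same positions, shape (k,)
    """
    matches = (draft_tokens == main_predictions).long()
    return int(matches.cumprod(dim=0).sum().item())  # length of the matching prefix

# Toy example: 3 drafted tokens, the first two agree with the Main Model,
# so one Main Model step advances generation by 2 accepted drafts + 1 verified token.
drafts = torch.tensor([11, 42, 7])
checks = torch.tensor([11, 42, 9])
print(count_accepted(drafts, checks))  # -> 2
```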

Summary

The MTP architecture accelerates inference by pairing the main model with lightweight modules that speculatively draft multiple future tokens; the main model can then verify those drafts in a single parallel pass. Efficiency comes from sharing the embedding layer, using mixed-precision (FP8/BF16) operations, and running only a single transformer block per module, while RMSNorm layers keep the extra prediction path numerically stable. The result is significantly lower generation latency for large language models.

#MultiTokenPrediction #MTP #SpeculativeDecoding #LLM #TransformerOptimization #InferenceAcceleration #MixedPrecision #AIEfficiency #NeuralNetworks #DeepLearning

With Claude