3 Computing in AI

AI Computing Architecture

Three Processing Types

1. Sequential Processing

  • Hardware: General-purpose CPU (Intel/ARM)
  • Function: Control flow, I/O, scheduling, data preparation
  • Workload Share: Training 5%, Inference 5%

2. Parallel Stream Processing

  • Hardware: CUDA cores (stream processors)
  • Function: FP32/FP16 vector/scalar operations, memory management (see the sketch after this list)
  • Workload Share: Training 10%, Inference 30%

3. Matrix Processing

  • Hardware: Tensor cores (matrix cores)
  • Function: Mixed-precision (FP8/FP16) matrix multiply-accumulate (MMA), sparse matrix operations
  • Workload Share: Training 85%+, Inference 65%+
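The division of labor between the first two tiers can be seen in a minimal CUDA sketch (all names and sizes here are illustrative, not from the original material): the host CPU does the sequential work of setup and launch, while the `vec_add` kernel spreads one instruction across many CUDA cores.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// GPU side (parallel stream processing): each thread adds one element,
// so the same instruction runs across many CUDA cores at once (SIMT).
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;

    // CPU side (sequential processing): control flow, memory setup,
    // data preparation, kernel launch -- the small workload share above.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // 4096 blocks x 256 threads
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```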

Key Insight

The majority of AI workloads are concentrated in matrix processing because matrix multiplication is the core operation of deep learning. Tensor cores are therefore the key component for AI performance improvement, as the sketch below illustrates.
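To ground that insight, here is a minimal sketch of a Tensor core matrix multiply-accumulate using CUDA's warp-level WMMA API (`mma.h`). The 16×16×16 tile size, FP16 inputs with FP32 accumulation, and the kernel/variable names are illustrative assumptions; production frameworks typically reach Tensor cores through libraries such as cuBLAS/cuDNN rather than hand-written WMMA.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of D = A*B + 0 on Tensor cores
// via warp-level matrix multiply-accumulate (requires sm_70 or newer).
__global__ void wmma_tile(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);    // start from a zero accumulator
    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // the Tensor core MMA op
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```

Launched with a single warp, e.g. `wmma_tile<<<1, 32>>>(dA, dB, dD);`, all 32 threads cooperatively issue the MMA instruction. This warp-wide matrix operation is what dominates the training workload share in the table above.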

With Claude

CUDA

From Claude with some prompting
This image illustrates the architecture of CUDA (Compute Unified Device Architecture), a parallel computing platform and programming model developed by NVIDIA. The main components are as follows:

  1. OS, CPU, USER: Concentric circles on the left represent the operating system, CPU, and user level.
  2. CUDA Framework: The large central box, which includes API control.
  3. SIMT (Single Instruction, Multiple Threads): Indicated by the arrow connecting the CPU to the CUDA cores.
  4. CUDA Cores: Parallel processing units represented by many small squares.
  5. Warp: Labeled “a group of CUDA cores”, i.e., the unit of threads that execute the same instruction together.
  6. Standardized High-Bandwidth (HW Arch): Represents standardized high-bandwidth at the hardware architecture level.
  7. Dynamic Allocation (SW Arch): Indicates dynamic allocation at the software architecture level.
  8. Video Memory Block: Located in the upper right corner.
  9. API Control: Included within the CUDA Framework.

This diagram shows how tasks are allocated from the CPU to the CUDA cores, the parallel processing capability of those cores, and overall system control through memory and the API. CUDA is used to accelerate complex computations by leveraging the massive parallel processing capability of GPUs.
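As a small, hedged illustration of the warp concept from the diagram (the kernel name and launch shape are arbitrary), the following kernel prints how the threads of a block are grouped into warps of 32:

```cuda
#include <cstdio>

// A warp is a group of 32 threads that execute the same instruction in
// lockstep on a group of CUDA cores. Lane 0 of each warp reports where
// its warp begins, making the grouping visible.
__global__ void show_warps() {
    int warp_id = threadIdx.x / warpSize;  // warpSize is 32 on NVIDIA GPUs
    int lane_id = threadIdx.x % warpSize;
    if (lane_id == 0)
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warp_id, threadIdx.x);
}

int main() {
    show_warps<<<2, 128>>>();  // 2 blocks x 128 threads = 4 warps per block
    cudaDeviceSynchronize();
    return 0;
}
```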

The diagram simplifies the CUDA architecture to give an overview of its key components and their relationships, which makes it suitable for educational purposes or high-level explanations.