ML System Engineering

This image illustrates the core pillars of ML System Engineering, outlining the journey from raw data to a responsible, deployed model.


  1. Data Engineering: Data Quality & Skew Prevention
    • Focuses on building robust pipelines that ensure high-quality data and prevent “training-serving skew,” where a model that performed well during training fails in production because the data it sees at serving time no longer matches its training data (a minimal skew check is sketched after this list).
  2. Model Optimization: Accuracy vs. Efficiency
    • Involves balancing competing metrics such as model size, memory usage, latency, and accuracy. The goal is to optimize models to meet specific hardware constraints without sacrificing predictive performance.
  3. Training Infrastructure: Distributed Training & Convergence
    • Highlights the technical backbone required to scale AI. It focuses on the seamless integration of hardware, data, and algorithms through distributed systems to ensure models converge efficiently and quickly.
  4. Deployment & Operations: MLOps & Edge-to-Cloud
    • Covers the lifecycle of a model in production. MLOps ensures continuous adaptation and monitoring across various environments, from massive Cloud infrastructures to resource-constrained TinyML (edge) devices.
  5. Ethics & Governance: Fairness & Accountability
    • Treats non-functional requirements like fairness, privacy, and transparency as core engineering priorities. It includes “fairness audits” to ensure the AI operates responsibly and remains accountable to its users.
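
As a minimal sketch of the training-serving skew check referenced in item 1, the snippet below compares per-feature statistics between a training sample and a serving sample and flags large drifts. The feature names, threshold, and toy data are illustrative and not part of the original framework.

```python
# Minimal sketch: flag training-serving skew by comparing per-feature statistics
# between a training sample and a live serving sample. Feature names, the z-score
# threshold, and the toy data are illustrative.
import numpy as np

def skew_report(train: np.ndarray, serve: np.ndarray, names, z_threshold=3.0):
    """Flag features whose serving-time mean drifts far from the training mean."""
    report = {}
    for i, name in enumerate(names):
        mu, sigma = train[:, i].mean(), train[:, i].std() + 1e-9
        z = abs(serve[:, i].mean() - mu) / sigma
        report[name] = {"z": round(float(z), 2), "skewed": bool(z > z_threshold)}
    return report

# Toy example: the "latency_ms" feature drifts between training and serving.
rng = np.random.default_rng(0)
train = rng.normal(loc=[0.0, 100.0], scale=[1.0, 10.0], size=(10_000, 2))
serve = rng.normal(loc=[0.1, 180.0], scale=[1.0, 10.0], size=(1_000, 2))
print(skew_report(train, serve, ["ctr", "latency_ms"]))
```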

Summary

  • ML System Engineering bridges the gap between theoretical research and real-world production by focusing on data integrity and hardware-aware model optimization.
  • It utilizes MLOps and distributed infrastructure to ensure scalable, continuous deployment across diverse environments, from the Cloud to the Edge.
  • The framework establishes Ethics and Governance as fundamental engineering requirements to ensure AI systems are fair, transparent, and accountable.

#MLSystemEngineering #MLOps #ModelOptimization #DataEngineering #DistributedTraining #TinyML #ResponsibleAI #EdgeComputing #AIGovernance

With Gemini

Basic LLM Workflow

Basic LLM Workflow Interpretation

This diagram illustrates how data flows through various hardware components during the inference process of a Large Language Model (LLM).

Step-by-Step Breakdown

① Initialization Phase (Warm weights)

  • Model weights are loaded from SSD → DRAM → HBM (High Bandwidth Memory)
  • Weights are distributed (sharded or replicated) across multiple GPUs
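
A minimal sketch of this warm-up step, assuming the Hugging Face transformers and accelerate libraries; the checkpoint name is a placeholder. `device_map="auto"` asks accelerate to shard the layers across whatever GPUs are visible, and `float16` halves the HBM footprint relative to fp32.

```python
# Minimal sketch of the "warm weights" step: read weights from disk (SSD) through
# host memory (DRAM) and place the shards in GPU HBM. Checkpoint name is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"          # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,                   # halve the HBM footprint vs. fp32
    device_map="auto",                           # shard layers across available GPUs
)
```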

② Input Processing (CPU tokenizes/batches)

  • CPU tokenizes input text and processes batches
  • Data is transferred through DRAM buffer to GPU
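
Continuing the previous sketch (same `model_name`), the CPU-side tokenization and batching of this step might look like the following; the prompts are illustrative, and the pad-token/padding-side settings are a common requirement for decoder-only models.

```python
# Minimal sketch of step 2: tokenize and batch prompts on the CPU, then copy the
# resulting tensors from host DRAM into GPU HBM. Continues the previous sketch.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)   # same checkpoint as above
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token           # decoder-only models often lack a pad token
tokenizer.padding_side = "left"                         # left-pad so generation continues from real tokens

prompts = ["Explain the KV cache in one sentence.", "What is HBM used for?"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)   # CPU-side tokenization/batching
batch = {k: v.to("cuda") for k, v in batch.items()}             # DRAM -> HBM transfer
```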

③ GPU Inference Execution

  • The GPU performs attention and FFN (feed-forward network) computations, streaming weights and activations from HBM
  • The KV (key-value) cache is kept in HBM
  • If HBM capacity runs short, the KV cache can be offloaded to DRAM or SSD
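
Continuing the previous sketches, the decode loop in transformers keeps the KV cache resident on the GPU when `use_cache=True`; offloading variants exist (for example, accelerate's CPU/disk offload via `device_map`) but are not shown here.

```python
# Minimal sketch of step 3: run inference on the GPU. With use_cache=True the
# K/V tensors from earlier tokens stay in HBM, so each decode step only attends
# against the cache instead of re-encoding the whole prefix.
with torch.no_grad():
    output_ids = model.generate(
        **batch,
        max_new_tokens=64,
        use_cache=True,      # keep the KV cache in GPU memory between steps
        do_sample=False,
    )
```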

④ Distributed Communication (NVLink/InfiniBand)

  • Intra-node: high-speed GPU-to-GPU communication over NVLink (through NVSwitch where available)
  • Inter-node: communication between servers over InfiniBand, typically driven through NCCL collectives
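
As a minimal sketch of these collectives, the snippet below uses PyTorch's NCCL backend, which routes traffic over NVLink within a node and InfiniBand across nodes; the all_reduce stands in for the tensor-parallel synchronization an LLM would perform, and the script is meant to be launched with torchrun.

```python
# Minimal sketch of step 4: an NCCL all_reduce across GPUs.
# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # NCCL picks NVLink/InfiniBand transports
local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
torch.cuda.set_device(local_rank)

partial = torch.ones(4, device="cuda") * dist.get_rank()   # each GPU's partial result
dist.all_reduce(partial, op=dist.ReduceOp.SUM)             # summed across all GPUs
print(f"rank {dist.get_rank()}: {partial.tolist()}")
dist.destroy_process_group()
```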

⑤ Post-processing (CPU decoding/post)

  • CPU decodes generated tokens and performs post-processing
  • Logs and caches are saved to SSD
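
Continuing the single-node sketches above (the `tokenizer`, `prompts`, and `output_ids` from steps 2-3), post-processing might decode on the CPU and append a log record to disk; the file name is illustrative.

```python
# Minimal sketch of step 5: decode generated token IDs on the CPU and persist
# one log line per request to local storage (SSD). File path is illustrative.
import json
import time

texts = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
with open("inference_log.jsonl", "a") as f:
    for prompt, text in zip(prompts, texts):
        f.write(json.dumps({"ts": time.time(), "prompt": prompt, "output": text}) + "\n")
```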

Key Characteristics

This architecture leverages a memory hierarchy to efficiently execute large-scale models:

  • SSD: Long-term storage (slowest, largest capacity)
  • DRAM: Intermediate buffer
  • HBM: GPU-dedicated high-speed memory (fastest, limited capacity)

When a model does not fit in a single GPU's HBM, common strategies include sharding it across multiple GPUs or offloading weights and KV cache to the slower, larger tiers (DRAM or SSD).
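
A back-of-the-envelope check makes the capacity constraint concrete; the parameter count, precision, and HBM size below are illustrative.

```python
# Rough check of whether a model's weights alone fit in one GPU's HBM.
# Illustrative numbers: a 70B-parameter model in fp16 on an 80 GB accelerator.
params = 70e9
bytes_per_param = 2                      # fp16
hbm_bytes = 80e9                         # per-GPU HBM capacity
weight_bytes = params * bytes_per_param  # ignores activations and KV cache

print(f"weights: {weight_bytes / 1e9:.0f} GB vs HBM: {hbm_bytes / 1e9:.0f} GB")
print("fits on one GPU" if weight_bytes < hbm_bytes else "needs sharding or offloading")
```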


Summary

This diagram shows how LLMs move data through a memory hierarchy (SSD → DRAM → HBM) across CPU and GPU components. The workflow loads model weights, tokenizes inputs on the CPU, runs inference on the GPU out of HBM, and uses distributed communication (NVLink/InfiniBand) for multi-GPU setups. Memory-management strategies such as KV-cache offloading allow models that exceed a single GPU's capacity to run efficiently.

#LLM #DeepLearning #GPUComputing #MachineLearning #AIInfrastructure #NeuralNetworks #DistributedComputing #HPC #ModelOptimization #AIArchitecture #NvLink #Transformer #MLOps #AIEngineering #ComputerArchitecture

With Claude

Optimize LLM

LLM Optimization: Integration of Traditional Methods and New Paradigms

Core Message

Optimizing an LLM (Transformer) requires more than traditional optimization methodology; new perspectives must be added on top of it.


1. Traditional Optimization Methodology (Left Side)

SW (Software) Optimization

  • Data Optimization
    • Structure: Data structure design
    • Copy: Data movement optimization
  • Logics Optimization
    • Algorithm: Efficient algorithm selection
    • Profiling: Performance analysis and bottleneck identification (a minimal sketch follows below)

Characteristics: Deterministic, logical approach
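
As a minimal sketch of the profiling item above, the snippet below uses Python's built-in cProfile to surface a hot spot before touching data structures or algorithms; the workload is deliberately naive and illustrative.

```python
# Minimal sketch: profile a deliberately naive O(n^2) routine with cProfile to
# find the bottleneck before optimizing data layout or the algorithm itself.
import cProfile
import pstats

def naive_pairwise_distance(points):
    # quadratic loop; the profiler will show this dominating the runtime
    out = []
    for i, p in enumerate(points):
        for q in points[i + 1:]:
            out.append(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5)
    return out

points = [(i * 0.1, i * 0.2) for i in range(300)]
profiler = cProfile.Profile()
profiler.enable()
naive_pairwise_distance(points)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```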

HW (Hardware) Optimization

  • Functions & Speed (B/W): Function and speed/bandwidth optimization
  • Fit For HW: Optimization for existing hardware
  • New HW implementation: New hardware design and implementation

Characteristics: Physical performance improvement focus


2. New Perspectives Required for LLM (Right Side)

SW Aspect: Human-Centric Probabilistic Approach

  • Human Language View / Human’s View
    • How humans understand and use language
    • Framing the problem from a human perspective
  • Human Learning
    • Modeling how humans learn

Key Point: Statistical and Probabilistic Methodology

  • Different from traditional deterministic optimization
  • Language patterns, probability distributions, and context understanding are crucial
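
As a minimal sketch of this probabilistic view, the snippet below treats the model's output as a distribution over the next token and samples from it with a temperature; the vocabulary and logits are toy values, not from any real model.

```python
# Minimal sketch: next-token prediction as sampling from a probability distribution
# (temperature-scaled softmax). Vocabulary and logits are toy values.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "sat", "ran"]
logits = np.array([2.0, 1.5, 0.3, -1.0])         # unnormalized scores from "the model"

def sample_next_token(logits, temperature=0.8):
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())        # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

idx, probs = sample_next_token(logits)
print(dict(zip(vocab, probs.round(3))), "-> sampled:", vocab[idx])
```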

HW Aspect: Massive Parallel Processing

  • Massive Simple Parallel
    • Parallel processing of large-scale simple computations
    • Hardware architecture capable of parallel processing (GPU/TPU) is essential

Key Point: Efficient parallel processing of large-scale matrix operations
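
To make the parallel-processing point concrete, the sketch below runs a single transformer-style feed-forward projection as one large matrix multiply; the dimensions are illustrative, and the code falls back to CPU if no GPU is present.

```python
# Minimal sketch: a transformer FFN projection is one big matmul, i.e. hundreds of
# billions of simple, independent multiply-adds that a GPU executes in parallel.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
batch, seq, d_model, d_ff = 4, 512, 4096, 16384          # illustrative sizes

x = torch.randn(batch, seq, d_model, device=device, dtype=dtype)   # activations
w = torch.randn(d_model, d_ff, device=device, dtype=dtype)         # FFN weight matrix

y = x @ w                                                # the massively parallel step
flops = 2 * batch * seq * d_model * d_ff
print(y.shape, f"~{flops / 1e9:.0f} GFLOPs in this single matmul")
```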


3. Integrated Perspective

LLM Optimization = Traditional Optimization + New Paradigm

| Domain | Traditional Method | LLM Additional Elements |
| --- | --- | --- |
| SW | Algorithm and data-structure optimization | + Probabilistic/statistical approach (human language/learning perspective) |
| HW | Function/speed optimization | + Massive parallel-processing architecture |

Conclusion

For effective LLM optimization:

  1. Traditional optimization techniques (data, algorithms, hardware) as foundation
  2. Probabilistic approach reflecting human language and learning methods
  3. Hardware perspective supporting massive parallel processing

These three elements must be combined organically; that integration is the core message of the diagram.


Summary

LLM optimization requires integrating traditional deterministic SW/HW optimization with new paradigms: probabilistic/statistical approaches that mirror human language understanding and learning, plus hardware architectures designed for massive parallel processing. This represents a fundamental shift from conventional optimization, where human-centric probabilistic thinking and large-scale parallelism are not optional but essential dimensions.


#LLMOptimization #TransformerArchitecture #MachineLearningOptimization #ParallelProcessing #ProbabilisticAI #HumanLanguageView #GPUComputing #DeepLearningHardware #StatisticalML #AIInfrastructure #ModelOptimization #ScalableAI #NeuralNetworkOptimization #AIPerformance #ComputationalEfficiency