Linux Kernel for GPU Workloads

Linux Kernel GPU Workload Support Features

Goal: Maximize Memory Efficiency & Data Transfer

The core objective is to treat GPUs as first-class components alongside CPUs, reducing memory bottlenecks for large-scale AI workloads.

Key Features

1. Full CXL (Compute Express Link) Support

  • Standard interface for high-speed connections between CPUs, accelerators (GPU, FPGA), and memory expansion devices
  • Enables cache-coherent, low-latency data sharing between host and device memory

2. Enhanced HMM (Heterogeneous Memory Management)

  • Mirrors process address spaces between CPU and device
  • Allows device drivers to map system memory pages into GPU page tables
  • The GPU can then dereference the same pointers as the CPU, enabling seamless memory access
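
To make this concrete, here is a hedged sketch of the snapshot-and-retry pattern described in Documentation/mm/hmm.rst that a mirroring driver would follow. The function name is illustrative, and it assumes the driver has already registered an mmu_interval_notifier over the range:

```c
/* Hedged kernel-side sketch (not a complete driver): fault in and snapshot
 * user pages for device mirroring, following Documentation/mm/hmm.rst.
 * 'notifier' must already be registered over [addr, addr + 16 * PAGE_SIZE)
 * and 'mm' is the target process's mm_struct. */
#include <linux/hmm.h>
#include <linux/mmu_notifier.h>

static long snapshot_for_gpu(struct mmu_interval_notifier *notifier,
                             struct mm_struct *mm, unsigned long addr)
{
        unsigned long pfns[16];
        struct hmm_range range = {
                .notifier      = notifier,
                .start         = addr,
                .end           = addr + 16 * PAGE_SIZE,
                .hmm_pfns      = pfns,
                .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
        };
        long ret;

        do {
                range.notifier_seq = mmu_interval_read_begin(notifier);
                mmap_read_lock(mm);
                ret = hmm_range_fault(&range);  /* faults pages, fills pfns[] */
                mmap_read_unlock(mm);
        } while (ret == -EBUSY);                /* address space changed; retry */

        /* On success: program pfns[] into the GPU page table, then validate
         * with mmu_interval_read_retry() under the driver's page-table lock. */
        return ret;
}
```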

3. Enhanced P2P DMA & GPUDirect Support

  • Enables direct data exchange between GPUs
  • Direct communication with NVMe storage and network cards (GPUDirect RDMA)
  • Moves data without CPU involvement in the data path, reducing latency and freeing CPU cycles
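
On the kernel side, the pci_p2pdma API gives a flavor of how a driver exposes device memory for peer-to-peer transfers. A hedged sketch, assuming a device with P2P-capable memory behind BAR 4 (the BAR number and function name are illustrative):

```c
/* Sketch (requires CONFIG_PCI_P2PDMA): publish a BAR as P2P DMA memory so
 * peer devices can DMA to it without bouncing through system RAM. */
#include <linux/pci-p2pdma.h>

static int my_probe_p2p(struct pci_dev *pdev)
{
        /* Register the whole of BAR 4 (size 0 = entire BAR) as P2P memory. */
        int err = pci_p2pdma_add_resource(pdev, 4, 0, 0);
        if (err)
                return err;

        pci_p2pmem_publish(pdev, true);  /* let other drivers discover it */

        /* A peer driver could then allocate from it, e.g.:
         *   void *buf = pci_alloc_p2pmem(pdev, SZ_64K);
         *   ...DMA into buf directly from another device...
         *   pci_free_p2pmem(pdev, buf, SZ_64K);
         */
        return 0;
}
```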

4. DRM Scheduler & GPU Driver Improvements

  • Enhanced Direct Rendering Manager scheduling functionality
  • Active upstream integration of drivers from major vendors: AMD (amdgpu), Intel (i915/Xe for GPUs, habanalabs for Gaudi accelerators)
  • NVIDIA GPUs are still driven mainly by an out-of-tree proprietary driver (upstream support via nouveau remains partial)

5. Advanced Async I/O via io_uring

  • Exchanges I/O requests with the kernel through shared ring buffers (submission and completion queues)
  • Cuts per-operation syscall overhead, improving asynchronous I/O performance

Summary

The Linux kernel now enables GPUs to access memory (CXL, HMM), storage, and network resources (P2P DMA, GPUDirect) directly, without routing data through the CPU. Enhanced drivers from AMD and Intel, together with improved schedulers, optimize GPU workload management. These features collectively eliminate CPU bottlenecks, making the kernel highly efficient for large-scale AI and HPC workloads.

#LinuxKernel #GPU #AI #HPC #CXL #HMM #GPUDirect #P2PDMA #AMDGPU #IntelGPU #MachineLearning #HighPerformanceComputing #DRM #io_uring #HeterogeneousComputing #DataCenter #CloudComputing

With Claude

How OOM (Out-of-Memory) Works

OOM (Out-of-Memory) Mechanism Explained

This diagram illustrates how the Linux OOM (Out-of-Memory) Killer operates when the system runs out of memory.

Main Process Flow (Left Side)

  1. Request
    • An application requests memory from the system
  2. VM Commit (Reserve)
    • The system reserves virtual memory
    • Overcommit policy allows reservation beyond physical capacity
  3. First Use (HW mapping) → Page Fault
    • Hardware mapping occurs when memory is actually accessed
    • Triggers a page fault for physical allocation
  4. Reclaim/Compaction
    • System attempts to free memory by reclaiming page cache and SLAB objects, writing back dirty pages, and compacting fragmented memory
    • Can be throttled via cgroup memory.high settings
  5. Swap (if enabled)
    • Uses swap space if available and enabled
  6. OOM Killer
    • As a last resort, terminates processes to free memory

Detailed Decision Points (Center & Right Columns)

Memory Request

  • App asks for memory
  • Controlled via brk/sbrk, mmap/munmap, mremap, and prlimit(RLIMIT_AS)

Virtual Address Allocation

  • Overcommit policy allows reservation beyond physical limits
  • Uses mmap (e.g., MAP_PRIVATE) with madvise(MADV_WILLNEED) hints
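
A small user-space demo of the reserve-then-fault behavior described above (it also previews the physical-allocation step). The 64 GiB figure is arbitrary, and MAP_NORESERVE keeps the reservation cheap even under stricter overcommit policies:

```c
/* Minimal demo of overcommit: reserve far more virtual memory than most
 * machines have, then fault in a single page.
 * Build: cc -o overcommit overcommit.c */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64UL << 30;                     /* 64 GiB of address space */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Only the VM commit happened so far; no physical pages are in use. */
    p[0] = 1;          /* first touch: page fault, one physical page mapped */
    printf("reserved %zu GiB, touched one page at %p\n", len >> 30, (void *)p);

    munmap(p, len);
    return 0;
}
```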

Physical Memory Allocation

  • Checks if zone watermarks are OK
  • If yes, maps a physical page; if no, attempts reclamation
  • Optional: mlock/munlock, mprotect, mincore

Any Other Free Memory Space?

  • Attempts to free memory via cache/SLAB/writeback/compaction
  • May throttle on cgroup memory.high
  • Hints: madvise(MADV_DONTNEED)
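
As a concrete example of the MADV_DONTNEED hint, a process can hand pages back to the kernel itself; the sizes here are arbitrary:

```c
/* Sketch of the MADV_DONTNEED hint: after the call, the kernel may drop the
 * physical pages immediately; anonymous memory re-faults as zero pages.
 * Build: cc -o dontneed dontneed.c */
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4 * 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] = 42;                          /* fault in the first page         */
    madvise(p, len, MADV_DONTNEED);     /* give the physical pages back    */
    assert(p[0] == 0);                  /* re-faulted as a fresh zero page */
    puts("pages dropped and re-faulted as zeros");

    munmap(p, len);
    return 0;
}
```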

Swap Space?

  • Checks if swap space is available to offload anonymous pages
  • System: swapon/swapoff; App: mlock* (to avoid swap)

OOM Killer

  • Sends SIGKILL to a selected victim when reclaim fails below the zone watermarks or a cgroup hits its memory.max limit
  • Victim selected by badness score, exposed as oom_score and biased by oom_score_adj
  • Configurable via /proc/<pid>/oom_score_adj and the vm.panic_on_oom sysctl
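
A minimal sketch of tuning victim selection from user space via /proc/self/oom_score_adj; the -500 value is arbitrary, and lowering the score below its current value typically requires CAP_SYS_RESOURCE:

```c
/* Hedged sketch: bias OOM victim selection for the current process. Values
 * range from -1000 (never kill) to 1000 (kill first). */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (!f) { perror("fopen"); return 1; }
    fprintf(f, "-500\n");               /* deprioritize this process */
    fclose(f);

    int score;
    f = fopen("/proc/self/oom_score", "r");
    if (f && fscanf(f, "%d", &score) == 1)
        printf("current oom_score: %d\n", score);
    if (f) fclose(f);
    return 0;
}
```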

Summary

When an app requests memory, Linux first reserves virtual address space (overcommit), then allocates physical memory on first use. If physical memory runs low, the system tries to reclaim pages from caches and swap, but when all else fails, the OOM Killer terminates processes based on their oom_score to free up memory and keep the system running.


#Linux #OOM #MemoryManagement #KernelPanic #SystemAdministration #DevOps #OperatingSystem #Performance #MemoryOptimization #LinuxKernel

With Claude

OOM Killer

OOM (Out-of-Memory) Killer

This diagram explains the Linux OOM Killer mechanism:

  1. Memory Request Process:
    • A process requests memory allocation from the operating system.
    • It receives a handle (typically a pointer into its virtual address space) to the allocated memory.
  2. Memory Management System:
    • The operating system manages virtual memory.
    • Virtual memory utilizes physical memory and disk swap space.
    • Linux allows memory overcommitment.
  3. OOM Killer Operation:
    • When physical memory becomes scarce, the OOM Killer is initiated.
    • The OOM Killer selects and terminates “less important” processes based on factors such as memory usage and process priority.
    • This mechanism maintains the overall stability of the system.

The Linux OOM Killer is a mechanism that activates automatically when physical memory becomes scarce. It maintains system stability by selecting and terminating less important processes based on memory usage and priority.

With Claude

Control Flow Enforcement Tech.

This image is an illustrative diagram of Control Flow Enforcement Technology (CET). CET is a hardware-based security feature, primarily supported by Intel CPUs.

The diagram shows the two main mechanisms of CET:

  1. Shadow Stack:
  • Stores the return address on a separate, secure stack to prevent an attacker from changing it.
  • When a function is called, the hardware writes the return address to the shadow stack.
  • When the function returns, the return address on the regular stack is compared with the copy on the shadow stack, and a control-protection fault is raised if they don’t match.
  2. Indirect Branch Tracking:
  • Restricts indirect jumps and calls (e.g., through function pointers) so they cannot land on arbitrary code.
  • The hardware requires every valid indirect-branch target to begin with an ENDBRANCH (ENDBR64/ENDBR32) instruction; landing anywhere else raises a control-protection fault.

At the bottom of the diagram is a visual walkthrough of a function call and return under CET: the function entry begins with an ENDBR instruction, the return address is recorded (pushed) on the shadow stack at call time, and it is compared against the stored copy when the function exits.
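
To see what IBT constrains, consider this toy indirect call. Compiled with a CET-enabled toolchain (e.g., gcc -fcf-protection=full), every legitimate indirect-call target is emitted with a leading endbr64, and on CET-capable hardware an indirect branch to any other address faults:

```c
/* Toy indirect call showing what IBT checks.
 * Build: gcc -fcf-protection=full cet_demo.c -o cet_demo */
#include <stdio.h>

static void handler(void)        /* compiled prologue begins with endbr64 */
{
    puts("legitimate indirect-call target");
}

int main(void)
{
    void (*fp)(void) = handler;  /* a typical attack overwrites this pointer */
    fp();                        /* IBT validates the landing pad            */
    return 0;
}
```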

With Claude

io_uring

This image explains io_uring, an asynchronous I/O framework for Linux. Let me break down its key components and features:

  1. io_uring Main Use Cases:
  • High-Performance Databases
  • High-Speed Network Applications
  • File Processing Systems
  2. Core Components:
  • Submission Queue (SQ): Where user applications submit requests like “read this file” or “send this network packet”
  • Completion Queue (CQ): Where the kernel places the results after finishing a task
  • Shared Memory: A shared region between user space and kernel space
  3. Key Features:
  • Low latency: requests and completions are exchanged through shared memory, avoiding extra copies and syscalls
  • High Throughput
  • Efficient Communication with the Kernel
  4. How it Works:
  • Operates as an asynchronous I/O framework
  • User space communicates with kernel space through submission and completion queues
  • Uses shared memory to minimize data copying
  • Provides a modern interface for asynchronous I/O operations
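
A minimal example of the submit/complete cycle using the liburing helper library (the file path is arbitrary):

```c
/* Minimal liburing example: read the first 4 KiB of a file asynchronously.
 * Build: cc -o uring_read uring_read.c -luring */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[4096];

    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    io_uring_queue_init(8, &ring, 0);     /* SQ/CQ shared with the kernel  */

    sqe = io_uring_get_sqe(&ring);        /* grab a submission entry       */
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);               /* one syscall submits the batch */

    io_uring_wait_cqe(&ring, &cqe);       /* result arrives in the CQ      */
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```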

The diagram shows the flow between user space and kernel space, with shared memory acting as an intermediary. This design allows for efficient I/O handling, particularly beneficial for applications requiring high performance and low latency.

The framework represents a significant improvement in Linux I/O handling, providing a more efficient way to handle I/O operations compared to traditional methods. It’s particularly valuable for applications that need to handle multiple I/O operations simultaneously while maintaining high performance.

With Claude

Uretprobe

Here’s a summary of Uretprobe, a Linux kernel tracing/debugging tool:

  1. Overview:
  • Uretprobe (user-space return probe) is a kernel mechanism for monitoring the returns of user-space functions
  • It can track the execution flow from function start to end/return points
  2. Key Features:
  • Ability to intervene at the return point of user-space functions
  • Works by swapping the on-stack return address for a trampoline, so a handler can post-process just before the function actually returns
  • Supports debugging and performance analysis capabilities
  • Can trace specific function return values for dynamic analysis and performance monitoring
  3. Advantages:
  • Complements entry-point uprobes by capturing return values and exit timing that entry probes alone cannot see
  • Can be integrated with eBPF/BCC for high-performance profiling

The main benefit of Uretprobe lies in its ability to intercept user-space operations and perform additional code analysis, enabling deeper insights into program behavior and performance characteristics.
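
As a toy target, the program below exposes a function whose return value a uretprobe can capture. The probe commands in the comment use perf probe and are illustrative only (they assume perf is installed and run as root):

```c
/* Toy uretprobe target. One illustrative way to trace compute()'s return
 * value (as root, with perf installed):
 *   perf probe -x ./target 'compute%return $retval'
 *   perf record -e probe_target:compute__return ./target
 *   perf script
 * Build: cc -o target target.c */
#include <stdio.h>
#include <unistd.h>

int compute(int x)
{
    return x * 2;                 /* a uretprobe fires on this return */
}

int main(void)
{
    for (int i = 0; i < 5; i++) {
        printf("compute(%d) = %d\n", i, compute(i));
        sleep(1);
    }
    return 0;
}
```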

Similar tracing/debugging mechanisms include:

  • Kprobes (Kernel Probes)
  • Kretprobes (Kernel Return Probes)
  • DTrace
  • SystemTap
  • Ftrace
  • Perf
  • LTTng (Linux Trace Toolkit Next Generation)
  • BPF (Berkeley Packet Filter) based tools
  • Dynamic Probes (DProbes)
  • USDT (User Statically-Defined Tracing)

These tools form part of the Linux observability and performance analysis ecosystem, each offering unique capabilities for system and application monitoring.

Page (Memory) Replacement with AI

This image illustrates a Page (Memory) Replacement system using AI. Let me break down the key components:

  1. Top Structure:
  • Paging (Legacy & current): Basic paging system structure
  • Logical Memory: Organized in 4 KB pages across a 64-bit virtual address space (up to 2^64 bytes)
  • Physical Memory: Limited to the actual installed memory size
  2. Memory Allocation:
  • Shows Alloc() and Dealloc() functions
  • When physical memory can no longer satisfy an allocation, a replacement (eviction) strategy is needed (the two classic ones are compared in the sketch after this list):
    • FIFO (First In First Out): Evict the page that was brought in earliest
    • LRU (Least Recently Used): Evict the page that has gone unused the longest
  3. AI-based Page Replacement Process:
  • Data Collection: Gathers information about page access frequency, time intervals, and memory usage patterns
  • Feature Extraction: Analyzes page access time, access frequency, process ID, memory region, etc.
  • Model Training: Aims to predict the likelihood of specific pages being accessed in the future
  • Page Replacement Decision: Pages with a low likelihood of future access are prioritized for swapping
  • Real-Time Application & Evaluation: Applies the model in real-time to perform page replacement and evaluate system performance
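
To ground the FIFO/LRU comparison from earlier, here is a toy user-space simulation of both policies over a fixed access trace; a learned policy would simply replace the victim-selection step with a model's prediction. Entirely illustrative, not kernel code:

```c
/* Toy page-replacement simulation: count faults for FIFO vs. LRU over one
 * access trace with 3 physical frames. Build: cc -o sim sim.c */
#include <stdio.h>

#define FRAMES 3

static int run(const int *trace, int n, int lru)
{
    int frames[FRAMES], stamp[FRAMES];
    int used = 0, faults = 0, next = 0;

    for (int tick = 0; tick < n; tick++) {
        int hit = -1;
        for (int j = 0; j < used; j++)
            if (frames[j] == trace[tick]) hit = j;
        if (hit >= 0) { stamp[hit] = tick; continue; }   /* hit: refresh age */

        faults++;
        if (used < FRAMES) {                    /* free frame available */
            frames[used] = trace[tick]; stamp[used++] = tick; continue;
        }
        int victim = 0;
        if (lru) {                              /* evict least-recently-used */
            for (int j = 1; j < FRAMES; j++)
                if (stamp[j] < stamp[victim]) victim = j;
        } else {                                /* FIFO: evict in arrival order */
            victim = next; next = (next + 1) % FRAMES;
        }
        frames[victim] = trace[tick]; stamp[victim] = tick;
    }
    return faults;
}

int main(void)
{
    const int trace[] = {1, 2, 3, 1, 4, 1, 2, 5, 1, 2};
    int n = sizeof(trace) / sizeof(trace[0]);
    printf("FIFO faults: %d\nLRU  faults: %d\n", run(trace, n, 0), run(trace, n, 1));
    return 0;
}
```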

This system integrates traditional page replacement algorithms with AI technology to achieve more efficient memory management. The use of AI helps make more intelligent decisions about which pages to keep in memory and which to swap out, based on learned patterns and predictions.

With Claude