ARCHITECTURE OF A PRODUCTION-GRADE, 5.15 RCU-PROTECTED KERNEL REGISTRY


This program manages object lifecycles across several concurrent execution contexts. It connects lockless reader threads, a serialized writer thread, and an asynchronous memory-reclamation path made up of RCU callbacks and a shrinker.

                             WRITER THREAD (Process Context)
                                           │
                                    kobjx_alloc() / kobjx_delete()
                                           │
                                           ▼
                 ┌───────────────────┐       [Symmetric Unlink via hash_del_rcu]
                 │ Spinlock (_lock)  │ ───►  Removes tracking nodes from global structures
                 └───────────────────┘       while preserving reader pointer trails.
                                           │
                                           ▼
                   kobjx_put(node) ────► Decrements refcnt. If 0, invokes call_rcu().
                                           │
                                           └─────────────────────────┐
                                                                     ▼
                                                       ┌───────────────────────────┐
                                                       │   RCU Grace Period Pipe   │
                                                       └───────────────────────────┘
                                                                     │
                                                       [Waits for all pre-existing readers]
                                                                     │
                                                                     ▼
                                                       ┌───────────────────────────┐
                                                       │    rcu_free_callback()    │
                                                       └───────────────────────────┘
                                                                     │ (Asynchronous SLAB return)
                                                                     ▼
                                                       ┌───────────────────────────┐
                                                       │ kmem_cache_free(cache, k) │
                                                       └───────────────────────────┘
                                                                     ▲
                                                                     │ [Direct Cache Free bypassed]
 ────────────────────────────────────────────────────────────────────┼────────────────────────────────────
                                                                     │
     READER THREADS (Lockless)                                       │       SHRINKER SUBSYSTEM (Memory Pressure)
                                                                     │
     rcu_read_lock();                                                │       scan_objects()
     kobjx_lookup()                                                  │             │
           │                                                         │             ▼
   [Hash Traversal]                                                  │       Acquires spinlock, isolates 
           │                                                         │       expired nodes via list_del_init.
           ▼                                                         │             │
 refcount_inc_not_zero()                                             │             ▼
           │                                                         │       kobjx_put(node)
 [Increments reader protection token]                                │             │
           │                                                         │             ▼
     rcu_read_unlock();                                              └─────── Passes lifecycle ownership 
                                                                              safely back to the RCU pipeline.

1. CORE ARCHITECTURAL PARADIGM


In concurrent kernel development, building a data registry that combines lockless reads, asynchronous reclamation, and resilience under memory pressure is a significant engineering challenge.

The stability of this program relies on pairing Read-Copy Update (RCU) primitives with atomic reference counting. RCU lets readers traverse data structures without acquiring locks, while reference counting guarantees that an object's memory remains valid even after the object is unlinked from the global tracking structures.

By separating an object's structural visibility from the lifetime of its memory, the design removes lock acquisition from the read path: readers never wait on the writer's spinlock, and the only shared write they perform is the reference-count increment on the object they match.


2. DEEP TECHNICAL SUB-COMPONENTS


A. Lockless Reader Protection Token (Preventing Use-After-Free)

In a naive implementation, looking up an item under RCU and returning its raw address creates a Use-After-Free (UAF) window. The moment after rcu_read_unlock() returns, another CPU can call kobjx_delete(), wait out the grace period, and return the memory to the slab cache while the caller still holds the pointer.

This program closes that window by checking and incrementing the reference counter atomically while still inside the RCU read-side critical section:

====================================================================================================
[UNDER THE HOOD: RCU LOOKUP STEP-BY-STEP TRAVERSAL]

CPU 0 (Reader)                      CPU 1 (Writer)
--------------                      --------------
rcu_read_lock();
Traverses global_table...
Matches Node ID 1.
refcount_inc_not_zero(&k->refcnt)   
   └─► Success! (Refcount 1->2)     kobjx_delete(Node 1);
rcu_read_unlock();                  Acquires spin_lock(&lock);
Returns Safe Node 1 Pointer.        list_del_init(&Node 1);
                                    hash_del_rcu(&Node 1);
                                    spin_unlock(&lock);
                                    kobjx_put(Node 1); (Refcount 2->1, stays alive!)
====================================================================================================

Code implementation:

    static struct kobjx *kobjx_lookup(int id)
    {
        struct kobjx *k;

        rcu_read_lock();
        hash_for_each_possible_rcu(global_table, k, hnode, id) {
            if (k->id == id) {
                /*
                 * CRITICAL STEP: pin the item before dropping the RCU read lock.
                 * A refcount of 0 means the last reference is gone and the
                 * object is waiting for its RCU free; refcount_inc_not_zero()
                 * then fails, so we return NULL instead of a dying object.
                 */
                if (!refcount_inc_not_zero(&k->refcnt))
                    k = NULL;
                
                rcu_read_unlock();
                return k;
            }
        }
        rcu_read_unlock();
        return NULL;
    }
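
The "increment only if non-zero" step is what makes the lookup safe, so it may help to see it in isolation. The following is a minimal userspace sketch using C11 atomics, not the kernel's refcount_t implementation (which also saturates on overflow and emits warnings). The names model_ref, model_inc_not_zero, and model_dec_and_test are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's refcount_t. */
struct model_ref {
    atomic_uint count;
};

/*
 * Take a reference only if the count is currently non-zero.
 * Returns true on success, false if the object is already dead.
 */
static bool model_inc_not_zero(struct model_ref *r)
{
    unsigned int old = atomic_load_explicit(&r->count, memory_order_relaxed);

    while (old != 0) {
        /* On failure 'old' is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak_explicit(&r->count, &old, old + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;
    }
    return false; /* count was 0: never resurrect a dying object */
}

/* Drop a reference; returns true when the caller released the last one. */
static bool model_dec_and_test(struct model_ref *r)
{
    unsigned int old = atomic_fetch_sub_explicit(&r->count, 1,
                                                 memory_order_acq_rel);

    assert(old != 0); /* an underflow would mean an unbalanced put */
    return old == 1;
}
```

A plain atomic increment would not be enough here: it would bump a count of 0 back to 1 and hand out a pointer to an object that is already queued for freeing. The compare-and-swap loop refuses that transition.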

B. Memory-Safe Shrinker Reclaim Loop


The kernel's memory-management code (mm/vmscan.c) invokes shrinkers when memory runs low. A common mistake is to free objects inside the scan loop with kmem_cache_free() as soon as they are unlinked. Holding the registry lock does not help, because readers never take it: if a reader on another CPU is still dereferencing that object, the result is a use-after-free that may oops immediately or silently corrupt memory.

This program decouples structural unlinking from memory destruction. The object is removed from the tracking structures under a spinlock, but freeing its memory is deferred to the RCU callback:

    static unsigned long scan_objects(struct shrinker *s, struct shrink_control *sc)
    {
        struct kobjx *k, *tmp;
        unsigned long freed = 0;
        unsigned long age = msecs_to_jiffies(MAX_AGE_MS);

        spin_lock(&lock);
        list_for_each_entry_safe(k, tmp, &global_list, list) {
            if (time_before(jiffies, k->created_jiffies + age))
                continue;

            /* Structural isolation under spinlock protection */
            list_del_init(&k->list);
            hash_del_rcu(&k->hnode);
            atomic_long_dec(&active);

            /*
             * CRITICAL RECLAIM FIX: Hand ownership back to the atomic tracking system.
             * Memory is only returned to SLAB when all concurrent readers drop out.
             *
             * The lock stays held across the put: kobjx_put() only performs an
             * atomic decrement and, at zero, call_rcu(), and neither sleeps.
             * Dropping the lock here would let a concurrent writer unlink 'tmp'
             * and leave this traversal following a stale pointer.
             */
            kobjx_put(k);
            freed++;

            if (freed >= sc->nr_to_scan)
                break;
        }
        spin_unlock(&lock);
        return freed;
    }
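
The expiry test in the scan loop uses time_before() rather than a plain `<` because the jiffies counter wraps around. As a rough illustration, the comparison can be modeled in userspace as a signed difference. This is a sketch assuming two's-complement conversion from unsigned long to long (as on GCC and Clang); model_time_before and model_expired are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of time_before(a, b): true if tick 'a' is earlier than tick 'b'.
 * The unsigned subtraction wraps, and the sign of the result gives the
 * ordering as long as the two ticks are less than half the counter range apart.
 */
static bool model_time_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/*
 * Hypothetical helper mirroring the scan loop's test:
 * an object is expired once 'now' has reached created + max_age.
 */
static bool model_expired(unsigned long now, unsigned long created,
                          unsigned long max_age)
{
    return !model_time_before(now, created + max_age);
}
```

With a naive `now >= created + max_age`, an object created shortly before the counter wraps would have a small wrapped deadline and be treated as expired immediately. The signed-difference form keeps it alive until its real deadline passes.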

C. The Unified Reference Management Pipeline


All teardown paths (explicit deletion, shrinker eviction, and readers dropping their pins) converge on a single atomic decrement routine. When the reference count reaches 0, the node is handed to call_rcu(), which frees it asynchronously after a grace period, so no teardown path ever waits for readers.

    static void kobjx_put(struct kobjx *k)
    {
        /* Atomically decrement; execute branch ONLY when transitioning from 1 -> 0 */
        if (refcount_dec_and_test(&k->refcnt)) {
            call_rcu(&k->rcu, rcu_free_callback);
        }
    }

    static void rcu_free_callback(struct rcu_head *r)
    {
        struct kobjx *k = container_of(r, struct kobjx, rcu);
        
        /* Physical object memory reclamation occurs safely out of band */
        kmem_cache_free(cache, k);
        this_cpu_inc(cpu_stats.free);
    }
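
To make the ordering concrete, here is a single-threaded userspace model of the same pipeline: the last put does not free the object, it only queues it, and the free happens when a stand-in for "the grace period has elapsed" runs. Everything in it (model_obj, model_put, model_grace_period_elapsed, the deferred list) is hypothetical scaffolding for illustration; real RCU tracks reader quiescence per CPU and is not modeled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct model_obj {
    atomic_uint refcnt;
    struct model_obj *next_deferred; /* stand-in for struct rcu_head */
};

/* Stand-in for the RCU callback queue (single-threaded model only). */
static struct model_obj *deferred_head;
static unsigned int freed_count;

static struct model_obj *model_alloc(void)
{
    struct model_obj *o = malloc(sizeof(*o));

    assert(o != NULL);
    atomic_init(&o->refcnt, 1); /* the registry's own reference */
    o->next_deferred = NULL;
    return o;
}

/* Like kobjx_put(): the 1 -> 0 transition queues the object, it does not free it. */
static void model_put(struct model_obj *o)
{
    if (atomic_fetch_sub(&o->refcnt, 1) == 1) {
        o->next_deferred = deferred_head;
        deferred_head = o;
    }
}

/* Like the RCU callbacks running after a grace period: free everything queued. */
static void model_grace_period_elapsed(void)
{
    while (deferred_head) {
        struct model_obj *o = deferred_head;

        deferred_head = o->next_deferred;
        free(o);
        freed_count++;
    }
}
```

In this model a reader's pin (one extra count) keeps the object off the deferred list even after the registry drops its own reference; only the final put queues it, and only the grace-period step releases the memory.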

3. CORE PROBLEMS SOLVED

Deadlocks on Critical Paths: Traditional lookup tables protect state with reader/writer locks (rwlock_t). If a thread takes the write lock without disabling interrupts and is then interrupted by a handler or softirq that attempts a read lookup on the same CPU, that CPU deadlocks. RCU readers never block on a lock, which rules out this class of self-deadlock on the read path.

The "Cache Thrashing" Problem: On multi-core systems, spinlocks and reader/writer locks write to the lock word on every acquisition, even for read-only access. Each write invalidates that cache line on every other core, generating constant cache-coherency traffic. RCU traversal performs no writes to shared memory. In this design the only atomic write on the read path is the reference-count increment on the matched object, so contention is per object rather than on one global lock word.

Kernel Out-Of-Memory (OOM) Vulnerabilities: Registries that cache allocations dynamically can consume large amounts of memory under heavy traffic. By registering a shrinker, the registry sheds expired allocations when the kernel signals memory pressure, ideally before the system reaches an OOM condition, so node reclamation tracks the host's actual memory availability.

4. ADVANTAGES OF THIS ARCHITECTURE

Expected O(1) Lookup Latency: With lockless traversal, read latency depends mainly on hash bucket density rather than on how many threads are querying or modifying the table at the same time. The one remaining point of contention is the reference count of a heavily shared object.

Zero-Impact Asynchronous Teardown: Memory reclamation does not block active readers. Writers detach nodes immediately, while the actual free happens later in an RCU callback, which runs in softirq context (RCU_SOFTIRQ, handled by rcu_core) or in rcuo kthreads on kernels configured for callback offloading.

Hardware Cache Isolation via SLAB Constructors: A dedicated cache (kmem_cache_create) packs same-sized nodes together in slab pages. The constructor function (ctor) runs when a slab is populated rather than on every allocation, so stable elements like the embedded list_head are initialized once. Objects must be returned to the cache in that constructed state, which list_del_init() preserves here, so reused objects need less re-initialization.

5. REAL-WORLD APPLICATIONS

Networking Routing Tables & Connection Tracking (nf_conntrack): Firewalls look up connection state for millions of packets per second. They rely on RCU-protected hash tables to do so without lock contention. Expired connection entries are evicted in the background; in nf_conntrack this is done by a garbage-collection worker and by early drop when the table is full, which fills the role the shrinker plays in this registry.

VFS Directory Entry Cache (dcache): Resolving a path such as /usr/local/bin requires walking a chain of parent and child directory entries. The Virtual File System keeps dentries in an RCU-protected hash table so that thousands of processes can resolve paths in parallel without serializing behind a global lock.

Container Tracking & IPC Registry Systems: Container runtimes constantly create and destroy control groups (cgroups) and IPC identifiers. The kernel tracks such objects with a similar combination of RCU lookup and reference counting, so isolated resources can be cleaned up safely when namespaces are torn down.

Source Code

The complete source code is available on GitHub:

Production-Grade RCU-Protected Kernel Registry