
Memory Architecture

Purpose

This spec defines the memory architecture for Cobre: data ownership categories, per-rank memory budget derived from approved specifications, NUMA-aware allocation principles, and memory growth analysis across SDDP iterations. This spec aggregates memory-relevant requirements from Solver Workspaces, Shared Memory Aggregation, and Communication Patterns into a unified memory perspective.

1. Data Ownership Model

1.1 Ownership Categories

All runtime data falls into one of four ownership categories, each with distinct allocation and access patterns:

| Category | Owner | Access During Training | Allocation Strategy | Examples |
| --- | --- | --- | --- | --- |
| Shared read-only | Node (via SharedWindow) or rank | Read by all threads, never modified after init | SharedWindow leader allocation or per-rank replication | Opening tree, input case data, PAR parameters, spectral factors |
| Thread-local mutable | Rayon thread | Exclusive read/write by owning thread | First-touch on owning thread's NUMA node | Solver workspace, solution buffers, basis cache, cut accumulation buffer |
| Rank-local growing | MPI rank | Read by all threads, written at stage boundaries | Pre-allocated with growth capacity | Cut pool (grows each iteration) |
| Temporary | Rayon thread | Allocated/freed within a single solve | Pre-allocated in workspace, reused | RHS patch buffer, scratch arrays |

Single-process mode note: In single-process mode (used by cobre-python and cobre-mcp), the “Shared read-only” category uses regular per-process heap allocation instead of SharedWindow<T>. MPI windows are not available without MPI initialization. The ownership semantics are otherwise identical – data is still read-only during training, allocated once at initialization, and shared across all Rayon threads within the process. See Hybrid Parallelism §1.0a for single-process mode details.

Figure: Per-rank memory architecture — shared read-only (System, PAR, opening tree), thread-local mutable (solver workspaces), rank-local growing (cut pool), temporary (scratch buffers).

1.2 Concurrency Model

Cobre uses Rayon for intra-rank threading (see Hybrid Parallelism §5). The data sharing model follows Rust’s ownership and borrowing rules, enforced at compile time via Send and Sync trait bounds:

| Rust/Rayon Pattern | Data Category | Implication |
| --- | --- | --- |
| `&T` where `T: Sync` | Shared read-only | All threads see the same reference; no synchronization needed for reads |
| Indexed by `rayon::current_thread_index()` | Thread-local mutable | Each thread accesses its own workspace by thread index |
| Written between `par_iter` calls (single-threaded) | Rank-local growing | Written between parallel regions (single-threaded merge), read during parallel regions |

No Arc, RwLock, or Mutex is needed — Rust’s ownership model, Rayon’s implicit join synchronization, and thread-index-indexed arrays provide the necessary semantics. The compiler enforces that shared data is Sync and that mutable data is exclusively owned.
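The three access patterns can be sketched in a few lines. This is a minimal illustration, not Cobre code: `std::thread::scope` stands in for the Rayon scope (the borrowing rules are the same), and `Workspace` is a hypothetical stand-in for the real solver workspace.

```rust
// Hypothetical simplification of a per-thread workspace.
struct Workspace {
    scratch: Vec<f64>, // thread-local mutable
}

fn run_stage(shared: &[f64], workspaces: &mut [Workspace]) -> f64 {
    // Parallel region: shared data is read through `&`, and each thread
    // receives exclusive `&mut` access to exactly one workspace.
    std::thread::scope(|s| {
        for ws in workspaces.iter_mut() {
            s.spawn(move || {
                // No Mutex needed: exclusivity is enforced at compile time.
                for (dst, src) in ws.scratch.iter_mut().zip(shared) {
                    *dst = *src * 2.0;
                }
            });
        }
    }); // implicit join: all threads have finished here

    // Single-threaded merge between parallel regions, as for the
    // "rank-local growing" category.
    workspaces.iter().map(|w| w.scratch.iter().sum::<f64>()).sum()
}
```

If a spawned closure tried to keep two `&mut` references to the same workspace, or to mutate `shared`, the program would simply not compile; that is the compile-time enforcement the text describes.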

2. Per-Rank Memory Budget

2.1 Derivable Components

The following memory estimates are derived from production-scale dimensions in approved specs. Reference configuration: 160 hydros, 130 thermals, 60 stages, 192 forward passes, 10 openings, 15K cut capacity, 48 threads per rank, 4 ranks per node.

Under Strategy 2+3 (StageLpCache), the memory model is a two-tier structure: thread-local solver workspaces plus a shared StageLpCache via SharedRegion<T>.

Thread-local per rank (48 threads, NUMA-local, HiGHS):

| Component | Size | Derivation | Source |
| --- | --- | --- | --- |
| Solver workspaces (×48) | ~1,737 MB | 48 threads × ~36 MB per workspace (solver instance + buffers + per-stage basis cache) | Solver Workspaces §1.2 |
| Forward pass state | ~18 MB | 192 trajectories × ~9 KB state vector (1,120 doubles + metadata) × ~1 stage buffered | Training Loop §5.1 |
| MPI communication buffers | ~10 MB | Send/receive buffers for MPI_Allgatherv (cuts: ~3.2 MB, states: ~1.72 MB per stage) | Communication Patterns §2 |
| Thread-local per rank | ~1.8 GB | | |

SharedRegion (1 copy per node, NUMA-interleaved):

| Component | Size | Derivation | Source |
| --- | --- | --- | --- |
| StageLpCache | ~22,300 MB | 60 stages × ~378 MB per stage (structural CSC + 15K cut slots in CSC) | Solver Abstraction §11.4 |
| Cut pool metadata | ~12 MB | 60 stages × 15K cuts × ~200 bytes/cut (intercept + activity + metadata) | Cut Management Impl §1.3 |
| Opening tree | ~0.8 MB | 10 openings × 60 stages × 160 entities × 8 bytes | Scenario Generation §2.3 |
| Input case data | ~20 MB | System entities, PAR parameters, correlation factors, block/exchange factors | Internal Structures |
| SharedRegion total | ~22.3 GB | | |

Node total (4 ranks):

| Component | Size |
| --- | --- |
| Thread-local (4 ranks × 1.8 GB) | ~7.1 GB |
| SharedRegion (1 copy) | ~22.3 GB |
| Node total (HiGHS) | ~27.7 GB |
| Node total (CLP) | ~26.0 GB |

CLP workspaces are ~21 MB/thread (vs ~36 MB for HiGHS) due to 1-byte basis status codes, reducing thread-local memory by ~720 MB per rank.

2.2 SharedRegion Savings

Under Strategy 2+3, the dominant memory structure — the StageLpCache (~22.3 GB) — is shared across all ranks on the same node via SharedRegion<T> (see Shared Memory Aggregation §1). This is the primary memory optimization:

| Data | Per-Rank (replicated) | SharedRegion (1 copy/node) | Savings (4 ranks/node) |
| --- | --- | --- | --- |
| StageLpCache | ~22,300 MB | ~22,300 MB | ~66,900 MB |
| Cut pool metadata | ~12 MB | ~12 MB | ~36 MB |
| Opening tree | ~0.8 MB | ~0.8 MB | ~2.4 MB |
| Input case data | ~20 MB | ~20 MB | ~60 MB |
| Total shareable | ~22,333 MB | ~22,333 MB | ~67,000 MB |

Without SharedRegion, each rank would replicate the entire StageLpCache (~22.3 GB × 4 ranks = ~89.2 GB), making the node total ~96 GB — still within the 384 GB node capacity but significantly more than the ~27.7 GB with sharing.
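The savings arithmetic reduces to two one-line functions. A quick sketch (sizes in MB, matching the figures quoted above):

```rust
/// Total footprint if every rank kept its own copy of the shared data.
fn replicated_mb(shared_mb: u64, ranks: u64) -> u64 {
    shared_mb * ranks
}

/// SharedRegion keeps one copy per node, so the (ranks - 1) redundant
/// copies are the savings.
fn savings_mb(shared_mb: u64, ranks: u64) -> u64 {
    shared_mb * (ranks - 1)
}
```

With the reference configuration (~22,300 MB shareable, 4 ranks/node), `replicated_mb` gives ~89.2 GB and `savings_mb` gives ~66.9 GB, consistent with the table.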

Single-process mode note: The SharedRegion savings table above applies only to multi-rank MPI deployments. In single-process mode (used by cobre-python and cobre-mcp), all data resides in a single process. The per-process memory footprint includes one copy of the StageLpCache (~22.3 GB) plus thread-local workspaces. At production scale with 48 threads, expect approximately 24 GB total memory usage.

2.3 Memory Growth

The StageLpCache is the dominant memory structure. It is physically pre-allocated at initialization (15K cut slots × 60 stages), so no dynamic allocation occurs during training. As cuts accumulate, StageLpCache CSC slots are populated with coefficient data — this is logical growth within the pre-allocated structure.

| Iteration | Active Cuts (approx) | StageLpCache Utilization | Notes |
| --- | --- | --- | --- |
| 1 | 192 | ~1.3% | One cut per forward pass |
| 50 | ~5,000 | ~33% | Before cut selection starts pruning |
| 100 | ~10,000 | ~67% | Cut selection bounds active count |
| 200 | ~15,000 | ~100% | At capacity; dominated cuts deactivated |

The StageLpCache pre-allocates all 15K cut slots at initialization (per Solver Abstraction §11.4). Physical memory is committed up front via the SharedRegion allocation. The stage transition time improves at lower utilization (fewer cut rows → smaller CSC → faster passModel).
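Logical growth inside a fixed allocation can be sketched with a slot array plus an occupancy bitmap. This is a deliberately simplified illustration, not the real CSC slot layout: `CutSlot` is a hypothetical stand-in, and a plain `Vec<bool>` stands in for the bitmap.

```rust
// Hypothetical simplification of one stage's cut slots.
#[derive(Clone, Copy, Default)]
struct CutSlot {
    intercept: f64,
}

struct CutPool {
    slots: Vec<CutSlot>, // sized once at init; never reallocated
    active: Vec<bool>,   // occupancy bitmap (one bool per slot for clarity)
}

impl CutPool {
    fn with_capacity(cap: usize) -> Self {
        CutPool {
            slots: vec![CutSlot::default(); cap],
            active: vec![false; cap],
        }
    }

    /// Fill the first free slot; returns None at capacity (at which point
    /// dominated cuts would be deactivated to free slots).
    fn insert(&mut self, cut: CutSlot) -> Option<usize> {
        let idx = self.active.iter().position(|&a| !a)?;
        self.slots[idx] = cut;
        self.active[idx] = true;
        Some(idx)
    }

    /// Fraction of slots in use — the "utilization" column above.
    fn utilization(&self) -> f64 {
        let used = self.active.iter().filter(|&&a| a).count();
        used as f64 / self.slots.len() as f64
    }
}
```

Inserting or deactivating a cut only flips bitmap bits and overwrites slot contents; the backing allocation is untouched, which is exactly why no `malloc` occurs during training.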

2.4 Scaling with Problem Size

Memory scales primarily with the number of hydro plants (state dimension), the number of stages, and thread count:

| Dimension | Effect on Memory |
| --- | --- |
| Hydro count (×2) | State dimension doubles → StageLpCache nnz per cut doubles → StageLpCache approximately doubles |
| Stage count (×2) | Per-stage basis cache doubles. StageLpCache doubles (one per stage). SharedRegion doubles |
| Forward passes (×2) | Forward state buffers double. StageLpCache growth rate doubles (more cuts per iteration) |
| Threads (×2) | Thread-local workspaces double (linear). SharedRegion unchanged (shared, not per-thread) |
| Openings (×2) | Opening tree doubles. Modest impact (~0.8 MB → ~1.6 MB at production scale) |

3. NUMA-Aware Allocation

3.1 Principles

Decision DEC-011 (active): One MPI rank per NUMA domain is the recommended deployment model; it confines each Rayon thread pool to a single NUMA domain.

Modern HPC nodes have multiple NUMA domains. Memory access latency varies significantly between local and remote NUMA domains (typical: 1.5-3× slower for remote access). Cobre follows three NUMA principles:

Principle 1 — Thread-owns-workspace: Each solver workspace is allocated by the thread that will use it, ensuring first-touch allocation on the thread’s local NUMA node. The workspace is never accessed by other threads.

Principle 2 — One rank per NUMA domain: The recommended deployment is one MPI rank per NUMA domain (see Hybrid Parallelism §4.4 and SLURM Deployment). This ensures that all threads within a rank share the same NUMA domain, making shared read-only data (case data, opening tree) local to all threads.

Principle 3 — First-touch initialization: Large arrays (solution buffers, basis cache) are initialized by the owning thread within a Rayon parallel scope, not by the main thread. This ensures the OS places memory pages on the NUMA node where they will be accessed.

3.2 NUMA Initialization Sequence

The initialization of per-thread resources follows the sequence documented in Solver Workspaces §1.3:

  1. Main thread determines NUMA topology via cobre_comm::slurm helpers (see Hybrid Parallelism §1.2)
  2. Rayon parallel scope is entered (via rayon::scope or par_iter)
  3. Each thread creates its solver instance (first-touch allocates solver internals on local NUMA)
  4. Each thread initializes its per-stage basis cache, solution buffers, and scratch arrays
  5. Parallel scope completes (implicit join)
  6. Main thread verifies all workspaces are initialized
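Steps 2-6 can be sketched with scoped threads. This is an illustrative sketch only: `std::thread::scope` stands in for the Rayon scope, and `Workspace` is a hypothetical placeholder for the real per-thread solver workspace.

```rust
// Hypothetical stand-in for the per-thread solver workspace.
struct Workspace {
    solution: Vec<f64>,
}

fn init_workspaces(threads: usize, buf_len: usize) -> Vec<Workspace> {
    let mut workspaces = Vec::with_capacity(threads);
    std::thread::scope(|s| {
        // Step 2-4: each thread allocates AND first-writes its own buffers,
        // so the OS first-touch policy places the pages on that thread's
        // NUMA node (the main thread never touches them first).
        let handles: Vec<_> = (0..threads)
            .map(|_| {
                s.spawn(move || {
                    let mut ws = Workspace {
                        solution: Vec::with_capacity(buf_len),
                    };
                    ws.solution.resize(buf_len, 0.0); // touch every page
                    ws
                })
            })
            .collect();
        for h in handles {
            workspaces.push(h.join().expect("workspace init failed"));
        }
    }); // step 5: implicit join

    // Step 6: main thread verifies all workspaces are initialized.
    assert!(workspaces.iter().all(|w| w.solution.len() == buf_len));
    workspaces
}
```

The key property is that allocation and the first write both happen on the owning thread; merely allocating on the main thread and handing buffers out would defeat first-touch placement.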

3.3 Cache Line Alignment

All per-thread data structures must be padded to cache line boundaries (64 bytes) to prevent false sharing between adjacent threads’ workspaces. This applies to:

  • Solver workspace array entries (indexed by thread ID)
  • Cut accumulation buffers (see Synchronization §3.1)
  • Per-thread solve statistics counters
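In Rust, the padding rule is a one-attribute fix. A minimal sketch for the per-thread statistics counters (the struct name is illustrative):

```rust
/// Per-thread counter padded to one 64-byte cache line so adjacent
/// threads' entries in a contiguous array never share a line.
#[repr(align(64))]
#[derive(Default)]
struct PaddedCounter {
    solves: u64,
    // the compiler pads the remaining bytes up to the 64-byte alignment
}

fn per_thread_counters(threads: usize) -> Vec<PaddedCounter> {
    (0..threads).map(|_| PaddedCounter::default()).collect()
}
```

`#[repr(align(64))]` rounds both the alignment and the size of the struct up to 64 bytes, so element `i` of the vector starts exactly one cache line after element `i - 1`.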

3.4 NUMA Topology (AMD EPYC 9R14)

The production reference hardware (AMD EPYC 9R14, 192 cores) has the following NUMA characteristics:

| Property | Value |
| --- | --- |
| NUMA domains (NPS4) | 4 |
| CCDs per NUMA domain | 3 |
| Physical cores per domain | 24 (48 hardware threads with SMT) |
| L3 cache per CCD | 32 MB (not unified within the NUMA domain — 3 separate 32 MB caches, 96 MB total per domain) |
| Local bandwidth | ~60–80 GB/s sustained per NUMA domain (3 DDR5-4800 channels) |
| Cross-NUMA bandwidth | ~30–40 GB/s per direction (Infinity Fabric) |

3.5 NUMA Latency Reference

Representative NUMA latency characteristics:

| Access Pattern | Typical Latency | Impact on LP Solve |
| --- | --- | --- |
| Local NUMA | ~80 ns | Baseline |
| Adjacent NUMA | ~120 ns (1.5×) | Moderate slowdown |
| Remote NUMA (cross-socket) | ~200 ns (2.5×) | Significant slowdown |

These values are representative of AMD EPYC and Intel Xeon platforms. Actual latencies vary by hardware. The key insight is that remote NUMA access can slow LP solves by 2-3× when solver working data (LU factorization, pricing vectors) is allocated on the wrong NUMA node — motivating Principle 1 (thread-owns-workspace).

3.6 NUMA-Interleaved Allocation for SharedRegion

Decision DEC-010 (active): NUMA-interleaved allocation (mbind(MPOL_INTERLEAVE)) for the SharedRegion holding the StageLpCache; this distributes pages round-robin across all NUMA domains.

The StageLpCache (~22.3 GB) is shared across all ranks on the same node via SharedRegion<T>. If allocated entirely on the leader rank’s NUMA domain, all other ranks incur cross-NUMA latency for every passModel read — creating a memory controller bottleneck.

Solution: Allocate the SharedRegion with NUMA-interleaved page placement (mbind(MPOL_INTERLEAVE) or numactl --interleave=all). This distributes memory pages round-robin across all 4 NUMA domains.

| Approach | Effective Bandwidth | Stage Transition Time | Node Memory |
| --- | --- | --- | --- |
| Leader-only placement | ~15–20 GB/s | ~18.9–25 ms | ~27.7 GB |
| NUMA-interleaved (adopted) | ~44 GB/s | ~8.6 ms | ~27.7 GB |
| Per-rank replication | ~60–80 GB/s | ~5.4 ms | ~92.0 GB |

Recommendation: NUMA-interleaved SharedRegion is the default. It provides ~2.5× speedup over leader-only placement at no additional memory cost. Per-rank replication provides a further ~1.6× speedup but triples memory usage; this is a profile-guided optimization for memory-rich nodes.

L3 cache impact: The passModel operation reads ~378 MB per stage — far exceeding the 32 MB L3 per CCD (or even the 96 MB total per NUMA domain). This is a DRAM-bandwidth-bound streaming read, not a cache-friendly working-set operation. In contrast, the LP solve working set (~15 MB for solver internals + LU factorization) fits within a single CCD’s 32 MB L3. Strategy 2+3 improves cache behavior: the per-thread footprint (~36 MB) doesn’t pollute L3/TLB, whereas the per-thread CSR assembly approach required ~375 MB per thread for on-the-fly LP construction.

4. Hot-Path Allocation Avoidance

4.1 Requirement

No heap allocation (malloc/new/Vec::push beyond capacity) is permitted during the SDDP hot path — the forward pass LP solves, backward pass LP solves, and cut accumulation. All buffers are pre-allocated at initialization and reused.

4.2 Pre-Allocated Components

| Component | Pre-allocated At | Reused During | Reference |
| --- | --- | --- | --- |
| StageLpCache (CSC) | Initialization | Read every stage transition via passModel | Solver Abstraction §11.4 |
| Solver instance | Initialization | Every LP solve (all iterations) | Solver Workspaces §1.2 |
| Primal/dual buffers | Initialization | Solution extraction after each solve | Solver Workspaces §1.2 |
| RHS patch buffer | Initialization | Scenario patching before each solve | Solver Workspaces §1.2 |
| Cut pool metadata slots | Initialization | Cut metadata insertion via bitmap | Solver Abstraction §5 |
| Cut accumulation buffers | Initialization | Per-thread cut collection each stage | Synchronization §3.1 |
| MPI send/recv buffers | Initialization | MPI_Allgatherv each stage | Communication Patterns §2 |
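The reuse pattern for buffers like the RHS patch buffer is overwrite-in-place: the hot path writes into existing capacity and never grows it. A hedged sketch (`RhsPatch` is a hypothetical simplification, not the real type):

```rust
/// Hypothetical pre-allocated RHS patch buffer, sized once at init.
struct RhsPatch {
    values: Vec<f64>,
}

impl RhsPatch {
    fn new(rows: usize) -> Self {
        RhsPatch { values: vec![0.0; rows] }
    }

    /// Overwrite in place for the next scenario; a size mismatch is a
    /// logic error, so we panic rather than silently reallocating.
    fn load_scenario(&mut self, rhs: &[f64]) {
        assert_eq!(rhs.len(), self.values.len(), "RHS size changed at runtime");
        self.values.copy_from_slice(rhs);
    }
}
```

Because `copy_from_slice` never changes length or capacity, repeated calls touch no allocator path, satisfying the §4.1 requirement.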

4.3 Allocation Monitoring

In debug/test builds, an allocation tracker can detect unexpected hot-path allocations by hooking the global allocator and flagging any allocation between the start and end of an LP solve. This is a development tool, not a production feature.
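One way such a tracker can be built, sketched here as an assumption rather than Cobre's actual implementation, is a counting wrapper around the system allocator: reading the counter before and after a guarded region reveals any allocation inside it.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Counting wrapper around the system allocator. In a real debug build the
// guard would record a backtrace; here we only count allocations.
struct CountingAlloc;

static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Snapshot of the allocation counter; compare two snapshots around an
/// LP solve to flag hot-path allocations.
fn alloc_count() -> usize {
    ALLOCS.load(Ordering::Relaxed)
}
```

Pushing into a `Vec` within its pre-allocated capacity leaves the counter unchanged, while any `Box::new` or capacity growth moves it, which is exactly the property the hot-path requirement demands.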

Cross-References