Memory Architecture
Purpose
This spec defines the memory architecture for Cobre: data ownership categories, the per-rank memory budget derived from approved specifications, NUMA-aware allocation principles, and memory growth analysis across SDDP iterations. It aggregates memory-relevant requirements from Solver Workspaces, Shared Memory Aggregation, and Communication Patterns into a unified memory perspective.
1. Data Ownership Model
1.1 Ownership Categories
All runtime data falls into one of four ownership categories, each with distinct allocation and access patterns:
| Category | Owner | Access During Training | Allocation Strategy | Examples |
|---|---|---|---|---|
| Shared read-only | Node (via SharedWindow) or rank | Read by all threads, never modified after init | SharedWindow leader allocation or per-rank replication | Opening tree, input case data, PAR parameters, spectral factors |
| Thread-local mutable | Rayon thread | Exclusive read/write by owning thread | First-touch on owning thread’s NUMA node | Solver workspace, solution buffers, basis cache, cut accumulation buffer |
| Rank-local growing | MPI rank | Read by all threads, written at stage boundaries | Pre-allocated with growth capacity | Cut pool (grows each iteration) |
| Temporary | Rayon thread | Allocated/freed within a single solve | Pre-allocated in workspace, reused | RHS patch buffer, scratch arrays |
Single-process mode note: In single-process mode (used by cobre-python and cobre-mcp), the “Shared read-only” category uses regular per-process heap allocation instead of SharedWindow<T>. MPI windows are not available without MPI initialization. The ownership semantics are otherwise identical – data is still read-only during training, allocated once at initialization, and shared across all Rayon threads within the process. See Hybrid Parallelism §1.0a for single-process mode details.
1.2 Concurrency Model
Cobre uses Rayon for intra-rank threading (see Hybrid Parallelism §5). The data sharing model follows Rust’s ownership and borrowing rules, enforced at compile time via Send and Sync trait bounds:
| Rust/Rayon Pattern | Data Category | Implication |
|---|---|---|
| &T where T: Sync | Shared read-only | All threads see the same reference; no synchronization needed for reads |
| Indexed by rayon::current_thread_index() | Thread-local mutable | Each thread accesses its own workspace by thread index |
| Written between par_iter calls (single-threaded) | Rank-local growing | Written between parallel regions (single-threaded merge), read during parallel regions |
No Arc, RwLock, or Mutex is needed — Rust’s ownership model, Rayon’s implicit join synchronization, and thread-index-indexed arrays provide the necessary semantics. The compiler enforces that shared data is Sync and that mutable data is exclusively owned.
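A minimal sketch of this pattern (hypothetical types; not the actual Cobre API): shared read-only data is borrowed as &T, each thread finds its workspace by rayon::current_thread_index() in a slice sized to the pool's thread count, and the rank-local cut pool is extended single-threaded between parallel regions. A small interior-mutability wrapper makes the per-thread-slot invariant explicit here; the indexing scheme actually specified in Solver Workspaces may differ.

```rust
use std::cell::UnsafeCell;
use rayon::prelude::*;

// Hypothetical stand-ins for Cobre structures; names and fields are illustrative only.
struct CaseData;                     // shared read-only (opening tree, PAR parameters, ...)
struct Workspace { solves: u64 }     // thread-local mutable (solver instance, buffers, basis cache)
struct Cut { intercept: f64 }        // accumulated into the rank-local cut pool

// One slot per Rayon thread. Safety invariant: thread i only ever touches slot i,
// so no two threads alias the same Workspace; that invariant is what justifies Sync.
struct PerThread(Vec<UnsafeCell<Workspace>>);
unsafe impl Sync for PerThread {}

fn run_stage(case: &CaseData, per_thread: &PerThread, cut_pool: &mut Vec<Cut>, scenarios: usize) {
    // Parallel region: shared read-only `case` is read by every thread; each thread
    // mutates only its own workspace, located via its Rayon thread index.
    let new_cuts: Vec<Cut> = (0..scenarios)
        .into_par_iter()
        .map(|_scenario| {
            let tid = rayon::current_thread_index().unwrap_or(0);
            let ws = unsafe { &mut *per_thread.0[tid].get() };
            ws.solves += 1; // stand-in for patching the RHS from `case` and solving the LP
            let _ = case;
            Cut { intercept: 0.0 }
        })
        .collect();

    // Between parallel regions (single-threaded): the rank-local growing cut pool is extended.
    cut_pool.extend(new_cuts);
}
```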
2. Per-Rank Memory Budget
2.1 Derivable Components
The following memory estimates are derived from production-scale dimensions in approved specs. Reference configuration: 160 hydros, 130 thermals, 60 stages, 192 forward passes, 10 openings, 15K cut capacity, 48 threads per rank, 4 ranks per node.
Under Strategy 2+3 (StageLpCache), the memory model is a two-tier structure: thread-local solver workspaces plus a shared StageLpCache via SharedRegion<T>.
Thread-local per rank (48 threads, NUMA-local, HiGHS):
| Component | Size | Derivation | Source |
|---|---|---|---|
| Solver workspaces (×48) | ~1,737 MB | 48 threads × ~36 MB per workspace (solver instance + buffers + per-stage basis cache) | Solver Workspaces §1.2 |
| Forward pass state | ~18 MB | 192 trajectories × ~9 KB state vector (1,120 doubles + metadata) × ~1 stage buffered | Training Loop §5.1 |
| MPI communication buffers | ~10 MB | Send/receive buffers for MPI_Allgatherv (cuts: ~3.2 MB, states: ~1.72 MB per stage) | Communication Patterns §2 |
| Thread-local per rank (total) | ~1.8 GB | Sum of the above | |
SharedRegion (1 copy per node, NUMA-interleaved):
| Component | Size | Derivation | Source |
|---|---|---|---|
| StageLpCache | ~22,300 MB | 60 stages × ~378 MB per stage (structural CSC + 15K cut slots in CSC) | Solver Abstraction §11.4 |
| Cut pool metadata | ~12 MB | 60 stages × 15K cuts × (intercept + activity + metadata) ≈ ~200 bytes/cut | Cut Management Impl §1.3 |
| Opening tree | ~0.8 MB | 10 openings × 60 stages × 160 entities × 8 bytes | Scenario Generation §2.3 |
| Input case data | ~20 MB | System entities, PAR parameters, correlation factors, block/exchange factors | Internal Structures |
| SharedRegion total | ~22.3 GB | Sum of the above | |
Node total (4 ranks):
| Component | Size |
|---|---|
| Thread-local (4 ranks × 1.8 GB) | ~7.1 GB |
| SharedRegion (1 copy) | ~22.3 GB |
| Node total (HiGHS) | ~27.7 GB |
| Node total (CLP) | ~26.0 GB |
CLP workspaces are ~21 MB/thread (vs ~36 MB for HiGHS) due to 1-byte basis status codes, reducing thread-local memory by ~720 MB per rank.
2.2 SharedRegion Savings
Under Strategy 2+3, the dominant memory structure — the StageLpCache (~22.3 GB) — is shared across all ranks on the same node via SharedRegion<T> (see Shared Memory Aggregation §1). This is the primary memory optimization:
| Data | Per-Rank (replicated) | SharedRegion (1 copy/node) | Savings (4 ranks/node) |
|---|---|---|---|
| StageLpCache | ~22,300 MB | ~22,300 MB | ~66,900 MB |
| Cut pool metadata | ~12 MB | ~12 MB | ~36 MB |
| Opening tree | ~0.8 MB | ~0.8 MB | ~2.4 MB |
| Input case data | ~20 MB | ~20 MB | ~60 MB |
| Total shareable | ~22,333 MB | ~22,333 MB | ~67,000 MB |
Without SharedRegion, each rank would replicate the entire StageLpCache (~22.3 GB × 4 ranks = ~89.2 GB), making the node total ~96 GB — still within the 384 GB node capacity but significantly more than the ~27.7 GB with sharing.
Single-process mode note: The SharedRegion savings table above applies only to multi-rank MPI deployments. In single-process mode (used by cobre-python and cobre-mcp), all data resides in a single process. The per-process memory footprint includes one copy of the StageLpCache (~22.3 GB) plus thread-local workspaces. At production scale with 48 threads, expect approximately 24 GB total memory usage.
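The savings column is simple arithmetic over the shareable components; a minimal sketch reproducing it (constants taken from the tables above):

```rust
// Figures from §2.1-§2.2 (MB). Reproduces the "savings" column of the table above.
const STAGE_LP_CACHE_MB: f64 = 22_300.0;
const CUT_METADATA_MB: f64 = 12.0;
const OPENING_TREE_MB: f64 = 0.8;
const CASE_DATA_MB: f64 = 20.0;
const RANKS_PER_NODE: f64 = 4.0;

fn main() {
    let shareable = STAGE_LP_CACHE_MB + CUT_METADATA_MB + OPENING_TREE_MB + CASE_DATA_MB;
    let savings = (RANKS_PER_NODE - 1.0) * shareable;     // copies the other 3 ranks avoid holding
    let replicated = RANKS_PER_NODE * STAGE_LP_CACHE_MB;  // cache alone if every rank held a copy
    println!("shareable per node: ~{shareable:.0} MB");   // ~22,333 MB
    println!("savings (4 ranks):  ~{savings:.0} MB");     // ~67,000 MB
    println!("replicated cache:   ~{replicated:.0} MB");  // ~89,200 MB without SharedRegion
}
```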
2.3 Memory Growth
The StageLpCache is the dominant memory structure. It is physically pre-allocated at initialization (15K cut slots × 60 stages), so no dynamic allocation occurs during training. As cuts accumulate, StageLpCache CSC slots are populated with coefficient data — this is logical growth within the pre-allocated structure.
| Iteration | Active Cuts (approx) | StageLpCache Utilization | Notes |
|---|---|---|---|
| 1 | 192 | ~1.3% | One cut per forward pass |
| 50 | ~5,000 | ~33% | Before cut selection starts pruning |
| 100 | ~10,000 | ~67% | Cut selection bounds active count |
| 200 | ~15,000 | ~100% | At capacity; dominated cuts deactivated |
The StageLpCache pre-allocates all 15K cut slots at initialization (per Solver Abstraction §11.4). Physical memory is committed up front via the SharedRegion allocation. The stage transition time improves at lower utilization (fewer cut rows → smaller CSC → faster passModel).
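To make "logical growth within the pre-allocated structure" concrete, here is a deliberately simplified sketch (illustrative only; the real StageLpCache is CSC-based per Solver Abstraction §11.4). All slots exist from initialization, adding a cut writes into an existing slot, and deactivating a dominated cut flips a flag instead of freeing memory:

```rust
/// Illustrative fixed-capacity cut store; all memory is allocated once, at construction.
struct CutSlots {
    state_dim: usize,
    coeffs: Vec<f64>,     // capacity × state_dim values, never resized after init
    intercepts: Vec<f64>, // one per slot
    active: Vec<bool>,    // activation bitmap; "growth" is flipping entries to true
}

impl CutSlots {
    fn new(capacity: usize, state_dim: usize) -> Self {
        Self {
            state_dim,
            coeffs: vec![0.0; capacity * state_dim], // physical memory committed here
            intercepts: vec![0.0; capacity],
            active: vec![false; capacity],
        }
    }

    /// Populate the next free slot; no heap allocation on this path.
    fn add_cut(&mut self, coeffs: &[f64], intercept: f64) -> Option<usize> {
        debug_assert_eq!(coeffs.len(), self.state_dim);
        let slot = self.active.iter().position(|a| !*a)?;
        let base = slot * self.state_dim;
        self.coeffs[base..base + self.state_dim].copy_from_slice(coeffs);
        self.intercepts[slot] = intercept;
        self.active[slot] = true;
        Some(slot)
    }

    /// Dominated cuts are deactivated: the slot becomes reusable, nothing is freed.
    fn deactivate(&mut self, slot: usize) {
        self.active[slot] = false;
    }

    /// Utilization as reported in the table above (active slots / capacity).
    fn utilization(&self) -> f64 {
        self.active.iter().filter(|a| **a).count() as f64 / self.active.len() as f64
    }
}
```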
2.4 Scaling with Problem Size
Memory scales primarily with the number of hydro plants (state dimension), the number of stages, and thread count:
| Dimension | Effect on Memory |
|---|---|
| Hydro count (×2) | State dimension doubles → StageLpCache nnz per cut doubles → StageLpCache approximately doubles. |
| Stage count (×2) | Per-stage basis cache doubles. StageLpCache doubles (one per stage). SharedRegion doubles. |
| Forward passes (×2) | Forward state buffers double. StageLpCache growth rate doubles (more cuts per iteration). |
| Threads (×2) | Thread-local workspaces double (linear). SharedRegion unchanged (shared, not per-thread). |
| Openings (×2) | Opening tree doubles. Modest impact (~0.8 MB → ~1.6 MB at production scale). |
3. NUMA-Aware Allocation
3.1 Principles
Decision DEC-011 (active): One MPI rank per NUMA domain is the recommended deployment model; confines each Rayon thread pool to a single NUMA domain.
Modern HPC nodes have multiple NUMA domains. Memory access latency varies significantly between local and remote NUMA domains (typical: 1.5-3× slower for remote access). Cobre follows three NUMA principles:
Principle 1 — Thread-owns-workspace: Each solver workspace is allocated by the thread that will use it, ensuring first-touch allocation on the thread’s local NUMA node. The workspace is never accessed by other threads.
Principle 2 — One rank per NUMA domain: The recommended deployment is one MPI rank per NUMA domain (see Hybrid Parallelism §4.4 and SLURM Deployment). This ensures that all threads within a rank share the same NUMA domain, making shared read-only data (case data, opening tree) local to all threads.
Principle 3 — First-touch initialization: Large arrays (solution buffers, basis cache) are initialized by the owning thread within a Rayon parallel scope, not by the main thread. This ensures the OS places memory pages on the NUMA node where they will be accessed.
3.2 NUMA Initialization Sequence
The initialization of per-thread resources follows the sequence documented in Solver Workspaces §1.3:
- Main thread determines NUMA topology via cobre_comm::slurm helpers (see Hybrid Parallelism §1.2)
- Rayon parallel scope is entered (via rayon::scope or par_iter)
- Each thread creates its solver instance (first-touch allocates solver internals on local NUMA)
- Each thread initializes its per-stage basis cache, solution buffers, and scratch arrays
- Parallel scope completes (implicit join)
- Main thread verifies all workspaces are initialized
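A condensed sketch of this sequence (illustrative names, not the actual Cobre API). It uses rayon's broadcast, which runs a closure once on every pool thread, to guarantee that each thread constructs, and therefore first-touches, exactly one workspace; this is a slight variation on the scope/par_iter entry described above:

```rust
use rayon::ThreadPoolBuilder;

// Illustrative workspace; the real contents are specified in Solver Workspaces §1.2.
struct SolverWorkspace {
    primal: Vec<f64>,
    dual: Vec<f64>,
    rhs_patch: Vec<f64>,
}

impl SolverWorkspace {
    // Constructed on the owning thread so that first-touch places its pages on that
    // thread's local NUMA node (Principle 3).
    fn new_local(n_cols: usize, n_rows: usize) -> Self {
        Self {
            primal: vec![0.0; n_cols],   // written here => pages faulted on the local node
            dual: vec![0.0; n_rows],
            rhs_patch: vec![0.0; n_rows],
        }
    }
}

fn init_workspaces(n_threads: usize, n_cols: usize, n_rows: usize) -> Vec<SolverWorkspace> {
    let pool = ThreadPoolBuilder::new()
        .num_threads(n_threads)
        .build()
        .expect("rayon pool");
    // broadcast runs the closure once per pool thread; the returned Vec is in
    // thread-index order, so workspace i belongs to (and was first-touched by) thread i.
    let workspaces = pool.broadcast(|_ctx| SolverWorkspace::new_local(n_cols, n_rows));
    assert_eq!(workspaces.len(), n_threads); // main thread verifies initialization
    workspaces
}
```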
3.3 Cache Line Alignment
All per-thread data structures must be padded to cache line boundaries (64 bytes) to prevent false sharing between adjacent threads’ workspaces. This applies to:
- Solver workspace array entries (indexed by thread ID)
- Cut accumulation buffers (see Synchronization §3.1)
- Per-thread solve statistics counters
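A minimal way to express this in Rust is an alignment wrapper around each per-thread entry (the wrapper shown here is illustrative; crossbeam_utils::CachePadded provides the same thing off the shelf):

```rust
/// Aligns (and therefore pads) each per-thread entry to a 64-byte cache line,
/// so adjacent entries in the array can never share a line (no false sharing).
#[repr(align(64))]
struct CacheAligned<T>(T);

/// Per-thread solve statistics counters, one entry per Rayon thread.
struct SolveStats {
    solves: u64,
    simplex_iterations: u64,
}

fn make_stats(n_threads: usize) -> Vec<CacheAligned<SolveStats>> {
    (0..n_threads)
        .map(|_| CacheAligned(SolveStats { solves: 0, simplex_iterations: 0 }))
        .collect()
}
```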
3.4 NUMA Topology (AMD EPYC 9R14)
The production reference hardware (AMD EPYC 9R14, 192 cores) has the following NUMA characteristics:
| Property | Value |
|---|---|
| NUMA domains (NPS4) | 4 |
| CCDs per NUMA domain | 3 |
| Physical cores per domain | 24 (48 hardware threads with SMT) |
| L3 cache per CCD | 32 MB (NOT unified within NUMA — 3 separate 32 MB caches per domain, 96 MB total per NUMA) |
| Local bandwidth | ~60–80 GB/s sustained per NUMA domain (3 DDR5-4800 channels) |
| Cross-NUMA bandwidth | ~30–40 GB/s per direction (Infinity Fabric) |
3.5 NUMA Latency Reference
Representative NUMA latency characteristics:
| Access Pattern | Typical Latency | Impact on LP Solve |
|---|---|---|
| Local NUMA | ~80 ns | Baseline |
| Adjacent NUMA | ~120 ns (1.5×) | Moderate slowdown |
| Remote NUMA (cross-socket) | ~200 ns (2.5×) | Significant slowdown |
These values are representative of AMD EPYC and Intel Xeon platforms. Actual latencies vary by hardware. The key insight is that remote NUMA access can slow LP solves by 2-3× when solver working data (LU factorization, pricing vectors) is allocated on the wrong NUMA node — motivating Principle 1 (thread-owns-workspace).
3.6 NUMA-Interleaved Allocation for SharedRegion
Decision DEC-010 (active): NUMA-interleaved allocation (mbind(MPOL_INTERLEAVE)) for the SharedRegion holding the StageLpCache; distributes pages round-robin across all NUMA domains.
The StageLpCache (~22.3 GB) is shared across all ranks on the same node via SharedRegion<T>. If allocated entirely on the leader rank’s NUMA domain, all other ranks incur cross-NUMA latency for every passModel read — creating a memory controller bottleneck.
Solution: Allocate the SharedRegion with NUMA-interleaved page placement (mbind(MPOL_INTERLEAVE) or numactl --interleave=all). This distributes memory pages round-robin across all 4 NUMA domains.
| Approach | Effective Bandwidth | Stage Transition Time | Node Memory |
|---|---|---|---|
| Leader-only placement | ~15–20 GB/s | ~18.9–25 ms | ~27.7 GB |
| NUMA-interleaved (adopted) | ~44 GB/s | ~8.6 ms | ~27.7 GB |
| Per-rank replication | ~60–80 GB/s | ~5.4 ms | ~92.0 GB |
Recommendation: NUMA-interleaved SharedRegion is the default. It provides ~2.5× speedup over leader-only placement at no additional memory cost. Per-rank replication provides a further ~1.6× speedup but triples memory usage; this is a profile-guided optimization for memory-rich nodes.
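For reference, a sketch of requesting MPOL_INTERLEAVE for an already-mapped region from Rust via the libc crate (Linux-only, error handling trimmed; the policy constant is defined locally rather than assumed to exist in a particular libc version). This mirrors what numactl --interleave=all does process-wide; the actual SharedRegion allocation path is specified in Shared Memory Aggregation §1.

```rust
/// MPOL_INTERLEAVE as defined in <numaif.h>; defined locally for this sketch.
const MPOL_INTERLEAVE: libc::c_int = 3;

/// Ask the kernel to place pages of an already-mapped (ideally not yet touched)
/// region round-robin across NUMA nodes 0..n_nodes. Linux-only sketch.
#[cfg(target_os = "linux")]
unsafe fn interleave_region(
    addr: *mut libc::c_void,
    len: usize,
    n_nodes: u32,
) -> std::io::Result<()> {
    // One bit per NUMA node to interleave over, e.g. 0b1111 for an NPS4 node.
    let nodemask: libc::c_ulong = ((1u64 << n_nodes) - 1) as libc::c_ulong;
    let ret = unsafe {
        libc::mbind(
            addr,
            len as libc::c_ulong,
            MPOL_INTERLEAVE,
            &nodemask,
            // maxnode is the mask width in bits; the kernel expects one extra bit.
            libc::c_ulong::from(n_nodes) + 1,
            0, // no MPOL_MF_* flags: the policy applies to pages faulted in afterwards
        )
    };
    if ret == 0 {
        Ok(())
    } else {
        Err(std::io::Error::last_os_error())
    }
}
```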
L3 cache impact: The passModel operation reads ~378 MB per stage — far exceeding the 32 MB L3 per CCD (or even the 96 MB total per NUMA domain). This is a DRAM-bandwidth-bound streaming read, not a cache-friendly working-set operation. In contrast, the LP solve working set (~15 MB for solver internals + LU factorization) fits within a single CCD’s 32 MB L3. Strategy 2+3 improves cache behavior: the per-thread footprint (~36 MB) doesn’t pollute L3/TLB, whereas the per-thread CSR assembly approach required ~375 MB per thread for on-the-fly LP construction.
4. Hot-Path Allocation Avoidance
4.1 Requirement
No heap allocation (malloc/new/Vec::push beyond capacity) is permitted during the SDDP hot path — the forward pass LP solves, backward pass LP solves, and cut accumulation. All buffers are pre-allocated at initialization and reused.
4.2 Pre-Allocated Components
| Component | Pre-allocated At | Reused During | Reference |
|---|---|---|---|
| StageLpCache (CSC) | Initialization | Read every stage transition via passModel | Solver Abstraction §11.4 |
| Solver instance | Initialization | Every LP solve (all iterations) | Solver Workspaces §1.2 |
| Primal/dual buffers | Initialization | Solution extraction after each solve | Solver Workspaces §1.2 |
| RHS patch buffer | Initialization | Scenario patching before each solve | Solver Workspaces §1.2 |
| Cut pool metadata slots | Initialization | Cut metadata insertion via bitmap | Solver Abstraction §5 |
| Cut accumulation buffers | Initialization | Per-thread cut collection each stage | Synchronization §3.1 |
| MPI send/recv buffers | Initialization | MPI_Allgatherv each stage | Communication Patterns §2 |
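As a small illustration of the reuse discipline (hypothetical helper; the real buffers are specified in Solver Workspaces §1.2): refilling a pre-allocated buffer stays within its fixed capacity, so no allocation happens per solve.

```rust
/// Illustrative RHS patch buffer: allocated once, reused for every solve.
struct RhsPatch {
    values: Vec<f64>, // capacity fixed at initialization
}

impl RhsPatch {
    fn new(n_rows: usize) -> Self {
        Self { values: Vec::with_capacity(n_rows) }
    }

    /// Refill the buffer for the next scenario without allocating.
    fn fill_from(&mut self, scenario_rhs: &[f64]) {
        debug_assert!(scenario_rhs.len() <= self.values.capacity());
        self.values.clear();                          // keeps capacity, frees nothing
        self.values.extend_from_slice(scenario_rhs);  // within capacity => no reallocation
    }
}
```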
4.3 Allocation Monitoring
In debug/test builds, an allocation tracker can detect unexpected hot-path allocations by hooking the global allocator and flagging any allocation between the start and end of an LP solve. This is a development tool, not a production feature.
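A minimal sketch of such a tracker (illustrative, not the shipped tool): a wrapper around the system allocator counts allocations, and a test can assert that the count does not change across a solve.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

/// Counts every allocation that goes through the global allocator.
struct CountingAlloc;

static ALLOCATIONS: AtomicU64 = AtomicU64::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

// In Cobre this would be enabled only for debug/test builds.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Run a closure and report how many allocations it performed.
fn count_allocations<R>(f: impl FnOnce() -> R) -> (R, u64) {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    let result = f();
    (result, ALLOCATIONS.load(Ordering::Relaxed) - before)
}

fn main() {
    let mut buf: Vec<u64> = Vec::with_capacity(8);    // pre-allocated outside the hot path
    let ((), n) = count_allocations(|| buf.push(1));  // within capacity: no allocation
    println!("allocations during hot path: {n}");     // expected: 0
}
```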
Cross-References
- Solver Workspaces §1 — Detailed workspace contents, sizing, NUMA-aware initialization, basis cache
- Solver Abstraction §5 — Cut pool preallocation, slot assignment, capacity management
- Solver Abstraction §11.4 — StageLpCache design, sizing, ownership, update and read contracts
- Shared Memory Aggregation §1 — SharedRegion allocation, StageLpCache as primary candidate, NUMA-interleaved placement
- Communication Patterns §2 — Data payload sizes for MPI buffers
- Communication Patterns §5 — SharedWindow capabilities and shared data candidates
- Synchronization §3 — Cut accumulation buffers, cache line alignment
- Hybrid Parallelism §1.2 — ferrompi capabilities, SLURM/NUMA detection
- Hybrid Parallelism §4.4 — NUMA binding policy
- Scenario Generation §2.3 — Opening tree size and memory layout
- Cut Management Impl §4.2 — Per-cut wire size and cut pool sizing
- Training Loop §5.1 — State vector dimensions
- Internal Structures — In-memory data model for system entities