
Shared Memory and Aggregation

Purpose

This spec defines the intra-node shared memory usage patterns, reproducibility guarantees, and performance monitoring for Cobre. It extends Communication Patterns §5 with detailed SharedRegion<T> allocation and access patterns, consolidates reproducibility requirements from across the approved specs, and defines performance monitoring interpretation. For the collective operations and their semantics, see Communication Patterns and Synchronization.

1. Intra-Node Shared Memory

1.1 SharedRegion<T> Usage Model

The SharedMemoryProvider trait (Communicator Trait §4) allows ranks on the same physical node to share memory regions via SharedRegion<T> handles (Communicator Trait §4.2). Cobre uses a leader allocation pattern: one rank per node allocates the shared region, and the other ranks on the node access it through the region handle. The leader is identified by is_leader() on the SharedMemoryProvider trait, which returns true for rank 0 within the intra-node communicator produced by split_local() (Communicator Trait §4.1).

| Role | Allocation | Read Access | Write Access |
|---|---|---|---|
| Leader | Calls create_shared_region(count) – allocates full region | region.as_slice() (zero-copy) | region.as_mut_slice(), then region.fence() |
| Follower | Receives handle to leader’s region (size 0 local allocation) | region.as_slice() (zero-copy) | Not permitted (leader writes on behalf) |

The SharedRegion<T> type provides RAII semantics — Drop automatically frees the underlying shared memory resource (MPI window for ferrompi, OS shared segment for shm, Vec<T> for HeapFallback). See Communicator Trait §4.2.
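A minimal sketch of the leader allocation pattern, using heap-backed stand-ins for the spec'd API (the method names follow Communicator Trait §4; the generics, bounds, backing store, and error handling here are illustrative assumptions, not the Cobre definitions):

```rust
// Stand-ins for the spec'd API (Communicator Trait §4). Real providers
// back SharedRegion<T> with an MPI window (ferrompi) or an OS shared
// segment (shm); the Vec<T> here mimics the HeapFallback provider.
pub trait SharedMemoryProvider {
    fn is_leader(&self) -> bool;
    fn create_shared_region<T: Copy + Default>(&self, count: usize) -> SharedRegion<T>;
}

pub struct SharedRegion<T> {
    buf: Vec<T>, // Drop frees the backing resource (RAII)
}

impl<T: Copy> SharedRegion<T> {
    pub fn as_slice(&self) -> &[T] { &self.buf }
    pub fn as_mut_slice(&mut self) -> &mut [T] { &mut self.buf }
    pub fn fence(&self) {} // real providers publish writes node-wide here
}

/// Leader allocation pattern: the leader allocates the full region;
/// followers request size 0 and receive a handle onto the leader's memory.
pub fn allocate_shared<P: SharedMemoryProvider>(p: &P, count: usize) -> SharedRegion<f64> {
    let len = if p.is_leader() { count } else { 0 };
    let mut region = p.create_shared_region::<f64>(len);
    if p.is_leader() {
        region.as_mut_slice().fill(0.0); // leader-only initialization
    }
    region.fence(); // make leader writes visible before anyone reads
    region
}
```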

1.2 Shared Data: StageLpCache (Primary)

The StageLpCache (Solver Abstraction §11.4) is the primary candidate for SharedRegion<T> — it is the dominant memory structure at ~22.3 GB and provides the largest sharing benefit:

| Property | Value |
|---|---|
| Size | ~22.3 GB (60 stages × ~378 MB per stage at 15K cut capacity) |
| Access pattern | Read-only during forward/backward passes; updated between iterations by leader rank |
| Sharing benefit | Eliminates ~67 GB replication (4 ranks × 22.3 GB → 1 copy) |
| Write phase | Between iterations: leader rank writes new cuts and deactivation bounds, then region.fence() + barrier. Cut selection computation is distributed across all ranks (Cut Selection Strategy Trait §2.2a), but SharedRegion writes remain leader-only — the single-writer contract is preserved. |
| Memory layout | Per-stage CSC arrays (structural template + cut slots) — sequential access during each pass |

NUMA-interleaved allocation is mandatory for StageLpCache. At >5 GB, leader-only placement creates a memory controller bottleneck. mbind(MPOL_INTERLEAVE) distributes pages round-robin across all NUMA domains, providing ~44 GB/s effective read bandwidth (vs ~15–20 GB/s with leader-only). See Memory Architecture §3.6.
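For concreteness, the interleaving can be requested through libnuma before the leader first touches the region. A hedged sketch (linking against libnuma; where this hook lives inside the allocation path is an assumption, and it must run before first touch so the policy governs page placement):

```rust
use std::ffi::c_void;

// Minimal libnuma bindings (link with `-lnuma`); signatures per numa(3).
// numa_all_nodes_ptr masks only the nodes actually present, which avoids
// EINVAL for offline nodes that a hand-built all-ones mask would hit.
#[link(name = "numa")]
extern "C" {
    fn numa_available() -> i32;
    fn numa_interleave_memory(start: *mut c_void, size: usize, nodemask: *mut c_void);
    static mut numa_all_nodes_ptr: *mut c_void;
}

/// Ask the kernel to place the region's pages round-robin across all
/// NUMA domains (mbind(MPOL_INTERLEAVE) under the hood).
unsafe fn interleave_region(ptr: *mut u8, len: usize) {
    if numa_available() >= 0 {
        numa_interleave_memory(ptr.cast(), len, numa_all_nodes_ptr);
    }
}
```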

Allocation protocol (see the sketch after the steps):

  1. Create intra-node communicator via comm.split_local() (Communicator Trait §4.1)
  2. Leader calls create_shared_region::<u8>(total_bytes) with a NUMA-interleaved allocation hint; followers allocate size 0
  3. Leader assembles initial StageLpCache from StageTemplate CSC + empty cut slots for each stage
  4. region.fence() ensures initial data is visible to all ranks
  5. During training passes: all ranks read StageLpCache[t] via region.as_slice() (zero-copy)
  6. Between iterations: leader writes new cut data and deactivation bounds, then region.fence() + barrier
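The six steps as a driver sketch, reusing the stand-ins from §1.1 (Comm, LocalComm, barrier(), and the assemble/write helpers are hypothetical names; only split_local, create_shared_region, and fence are spec'd):

```rust
// Hypothetical driver for steps 1-6; not the Cobre implementation.
fn build_stage_lp_cache(comm: &Comm, total_bytes: usize) {
    let local: LocalComm = comm.split_local();               // step 1

    let len = if local.is_leader() { total_bytes } else { 0 };
    let mut region = local.create_shared_region::<u8>(len);  // step 2

    if local.is_leader() {
        assemble_initial_cache(region.as_mut_slice());       // step 3
    }
    region.fence();                                          // step 4

    for _iteration in 0.. {
        run_passes(region.as_slice());                       // step 5: zero-copy reads

        if local.is_leader() {
            write_cuts_and_bounds(region.as_mut_slice());    // step 6: leader-only
        }
        region.fence();
        local.barrier();
    }
}
```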

Decision DEC-016 (active): Cut selection uses deferred parallel execution — stages distributed across ranks and threads, with DeactivationSet allgatherv and leader-only SharedRegion write.

Update path (between iterations, leader-only writes; see the sketch after this list):

  • Write new cut coefficients into pre-allocated CSC slots (~300 MB/iteration at ~5 ms) — every iteration
  • Apply deactivations from the parallel cut selection phase: relax deactivated cut row bounds so the rows become non-binding (no structural change) — selection iterations only. The deactivation decisions are computed in parallel by all ranks (Cut Management Implementation §7.1a), but the SharedRegion writes are performed exclusively by the leader rank (Cut Management Implementation §7.1b)
  • region.fence() + barrier after all writes
  • Read path: reading StageLpCache[t] during a pass is a sequential bulk read at ~44 GB/s → ~8.6 ms per stage transition
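A sketch of the deactivation write under the single-writer contract. The spec fixes only that relaxed row bounds make the cut non-binding without structural change; the NEG_INFINITY sentinel below is an illustrative choice whose sign depends on the cut row sense:

```rust
/// Leader-only: relax the bounds of deactivated cut rows in place.
/// The CSC structure is untouched, so slots can be reused later.
fn apply_deactivations(row_lower: &mut [f64], deactivated: &[usize]) {
    for &row in deactivated {
        row_lower[row] = f64::NEG_INFINITY; // non-binding, slot retained
    }
}
```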

1.3 Shared Data: Cut Pool Metadata

Cut pool metadata (intercepts, activity bitmap, iteration/forward_pass indices, binding counts) is included in the SharedRegion alongside StageLpCache. Total: ~12 MB across 60 stages at 15K capacity — negligible relative to StageLpCache.

1.4 Shared Data: Opening Tree (Secondary)

The opening tree (fixed noise vectors for the backward pass — see Scenario Generation §2.3) is a secondary candidate for SharedRegion<T>:

| Property | Value |
|---|---|
| Size | ~0.8 MB at production scale |
| Access pattern | Read-only during training (generated once before first iteration) |
| Sharing benefit | Avoids replicating ~0.8 MB per rank on the same node (4 ranks = ~2.4 MB saved) |
| Write phase | Leader generates during initialization; region.fence() ensures visibility |
| Memory layout | Opening-major ordering for backward pass locality (see Scenario Generation §2.3) |

Decision DEC-017 (active): Communication-free parallel noise generation – every rank and thread independently derives identical noise via deterministic SipHash-1-3 seed derivation, eliminating MPI broadcast or gather for scenario noise.
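A sketch of the derivation under DEC-017, assuming the siphasher crate. The key split and field encoding below are illustrative; the spec'd inputs are (base_seed, opening_index, stage) per Scenario Generation §2.3:

```rust
use std::hash::Hasher;
use siphasher::sip::SipHasher13; // SipHash-1-3, as named by DEC-017

/// Derive the per-(opening, stage) seed identically on every rank and
/// thread: same inputs yield the same u64, so no broadcast is needed.
fn opening_seed(base_seed: u64, opening_index: u32, stage: u32) -> u64 {
    let mut h = SipHasher13::new_with_keys(base_seed, 0xC0B2E); // illustrative domain key
    h.write_u32(opening_index);
    h.write_u32(stage);
    h.finish()
}
```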

Generation protocol (see the sketch after the steps):

  1. Create intra-node communicator via comm.split_local() (Communicator Trait §4.1)
  2. Leader calls create_shared_region::<f64>(count) (Communicator Trait §4.1) with the total opening tree size; followers allocate size 0
  3. Distribute generation work across ranks within the node (each rank generates its assigned portion via contiguous block assignment)
  4. Each rank writes its portion directly to the shared region at the correct offset
  5. region.fence() ensures all writes are visible to all ranks on the node
  6. All ranks read any opening via region.as_slice() (zero-copy)
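Steps 3–5 in code, as a sketch (the block partitioning and noise_from_seed are illustrative assumptions; opening_seed is the derivation sketched above):

```rust
/// Each rank fills its contiguous block of the shared opening tree.
/// Block bounds depend only on the rank index, so writes never overlap.
fn generate_portion(tree: &mut [f64], local_rank: usize, local_size: usize, base_seed: u64) {
    let n = tree.len();
    let chunk = (n + local_size - 1) / local_size;  // contiguous block size
    let start = (local_rank * chunk).min(n);
    let end = (start + chunk).min(n);
    for i in start..end {
        // Hypothetical sampler mapping a derived seed to a noise value.
        tree[i] = noise_from_seed(opening_seed(base_seed, i as u32, 0));
    }
    // Step 5 happens in the caller: region.fence(), then zero-copy reads.
}
```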

1.5 Shared Data: Input Case Data (Secondary)

Static input data (hydro parameters, thermal parameters, system topology) loaded during initialization is read-only during training. Sharing via SharedRegion<T> avoids per-rank replication:

| Data | Approximate Size | Shareable? | Notes |
|---|---|---|---|
| System entity data | ~10 MB | Yes | Read-only after initialization |
| PAR model parameters | ~5 MB | Yes | Preprocessed contiguous arrays |
| Correlation factors | ~2 MB | Yes | Spectral factors, one per profile |
| Block/exchange factors | ~1 MB | Yes | Loaded once, read-only |

Whether these smaller data structures justify SharedRegion<T> overhead (allocation, fence, pointer indirection) depends on the number of ranks per node. With 4+ ranks per node, the savings compound, but the absolute benefit (~60 MB) is negligible relative to StageLpCache (~67 GB saved).

2. Intra-Node Cut Aggregation

2.1 Baseline: Flat allgatherv

The approved architecture uses flat allgatherv for cut distribution: every rank sends its cuts and every rank receives all cuts. There is no hierarchical aggregation, no root-based reduction, and no point-to-point messaging. See Synchronization §1.1 and Communication Patterns §1.2.

2.2 Optimization: Two-Level Aggregation

For deployments with many ranks per node (8+), a two-level aggregation can reduce inter-node allgatherv volume by first aggregating cuts within a node:

  1. Level 1 (Intra-node): Ranks on the same node contribute cuts to shared memory through SharedRegion<T> (using the SharedMemoryProvider trait). The node leader collects and deduplicates if applicable.
  2. Level 2 (Inter-node): Node leaders participate in allgatherv across nodes, exchanging the aggregated per-node cut sets.
  3. Distribute within node: Node leaders write received remote cuts to shared memory; region.fence() ensures visibility.

This reduces the allgatherv participant count from total ranks to node leaders, reducing collective latency for large rank counts. However, it adds intra-node synchronization overhead.

Design decision: Two-level aggregation is an optimization for large-scale deployments (64+ ranks). For the initial implementation with moderate rank counts (4-16 ranks), flat allgatherv is sufficient. Profiling should guide whether to enable this optimization. See Communication Patterns §3.2 for bandwidth analysis showing communication fraction < 2% even on Ethernet.
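For orientation only, a heavily hypothetical sketch of the three levels (helper names and communicator plumbing are assumptions; the flat allgatherv of §2.1 remains the approved default):

```rust
// Two-level cut exchange sketch; not an approved implementation.
fn exchange_cuts_two_level(
    node: &LocalComm,          // intra-node communicator
    leaders: Option<&Comm>,    // Some(..) only on node leaders
    region: &mut SharedRegion<u8>,
    my_cuts: &[u8],
) {
    // Level 1: every rank deposits its cuts at a reserved offset.
    deposit_at_rank_offset(region, node.rank(), my_cuts);
    region.fence();
    node.barrier();

    if let Some(inter) = leaders {
        // Level 2: leaders exchange per-node aggregates across nodes.
        let node_set = aggregate_node_cuts(region);
        let remote = inter.allgatherv(&node_set);
        // Distribute within node: leader writes remote cuts back.
        write_remote_cuts(region, &remote);
    }
    region.fence(); // remote cuts now visible to all ranks on the node
    node.barrier();
}
```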

3. Reproducibility Guarantees

3.1 Requirement

Given the same inputs and random seed, Cobre must produce bit-for-bit identical results regardless of:

| Must Be Independent Of | Mechanism |
|---|---|
| Number of MPI ranks | Deterministic seeding, contiguous block distribution, deterministic cut slots |
| Number of Rayon threads per rank | Thread-local accumulation, Rayon join, single-threaded merge |
| Execution timing/ordering | Identity-based seeding, deterministic allgatherv rank ordering |

3.2 Component-Level Reproducibility

Each component of the SDDP pipeline has specific reproducibility mechanisms:

| Component | Mechanism | Reference |
|---|---|---|
| Scenario noise | Deterministic seed from (base_seed, iteration, scenario, stage) | Scenario Generation §2.2 |
| Opening tree | Deterministic seed from (base_seed, opening_index, stage) | Scenario Generation §2.3 |
| Forward pass assignment | Contiguous block distribution by rank index | Work Distribution §3.1 |
| Backward pass assignment | Contiguous block distribution by rank index | Work Distribution §2.1 |
| Cut slot positions | Deterministic from (iteration, forward_pass_index) | Cut Management Implementation §4.3 |
| Cut synchronization | allgatherv receives in rank order (0, 1, …, R-1) | Communication Patterns §6.1 |
| Cut selection | Deterministic algorithm on identical data (all ranks compute independently) | Cut Management Implementation §6 |
| LP constraint order | Cuts added by slot index (iteration order) | Solver Abstraction §5 |

3.3 Floating-Point Considerations

Reductions: allreduce with ReduceOp::Sum may produce different results depending on the reduction tree shape (non-associativity of floating-point addition). For convergence statistics this is acceptable — the upper bound is a statistical estimate. See Communication Patterns §6.2.

Intra-rank reductions: Thread-local accumulation followed by single-threaded merge (see Synchronization §3) produces deterministic results because the merge order is fixed (thread 0, thread 1, …, thread N−1). Rayon’s par_iter with collect() or indexed par_chunks() preserves deterministic aggregation order, avoiding the non-determinism that would arise from unordered parallel reductions.
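A runnable illustration of the point: with indexed Rayon iterators, chunk boundaries and merge order depend only on indices, never on thread scheduling (the chunk size is an arbitrary example parameter):

```rust
use rayon::prelude::*;

/// Deterministic FP sum: sequential accumulation within each indexed
/// chunk, then a single-threaded merge in fixed chunk order. An
/// unordered parallel reduce would make the addition order (and thus
/// the rounding) depend on scheduling.
fn deterministic_sum(values: &[f64], chunk: usize) -> f64 {
    let partials: Vec<f64> = values
        .par_chunks(chunk.max(1))        // indexed: stable chunk boundaries
        .map(|c| c.iter().sum::<f64>())  // sequential within a chunk
        .collect();                      // collected in chunk (index) order
    partials.iter().sum()                // fixed-order merge
}
```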

Cut coefficient aggregation: For single-cut formulation, per-opening results are aggregated within each thread (sequential — deterministic), then merged across threads (fixed order — deterministic), then exchanged via allgatherv (rank order — deterministic). The final averaging is performed locally by each rank on the full set of results (identical data — deterministic). No floating-point non-determinism is introduced at any step.

3.4 Verification

Reproducibility can be verified by running the same case with different rank/thread configurations and comparing:

| Quantity | Comparison | Expected Result |
|---|---|---|
| Lower bound trace | Bit-for-bit across configs | Identical |
| Upper bound mean | Within FP tolerance | ~1e-12 relative error |
| Final cut pool | Bit-for-bit across configs | Identical coefficients |
| Policy output | Bit-for-bit across configs | Identical |

Upper bound mean may have small FP differences due to allreduce tree shape variation, but the lower bound and cut pool (which determine the policy) must be exact.
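A sketch of the comparison logic, assuming the traces have already been loaded from the Parquet outputs (reading elided; the function and its tolerance handling are illustrative):

```rust
/// Lower-bound trace must match bit-for-bit across configurations;
/// the upper-bound mean only within FP tolerance (§3.4).
fn verify_reproducibility(lb_a: &[f64], lb_b: &[f64], ub_a: f64, ub_b: f64) -> bool {
    let lb_exact = lb_a.len() == lb_b.len()
        && lb_a.iter().zip(lb_b).all(|(a, b)| a.to_bits() == b.to_bits());
    let ub_close = ((ub_a - ub_b) / ub_b.abs().max(f64::MIN_POSITIVE)).abs() <= 1e-12;
    lb_exact && ub_close
}
```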

4. Performance Monitoring

4.1 Approved Timing Outputs

Cobre writes per-iteration and per-rank timing data to Parquet files during training. The schemas are defined in Output Schemas §6.2-§6.3:

Per-iteration (training/timing/iterations.parquet): forward solve, backward solve, cut computation, cut selection, MPI communication, I/O, overhead.

Per-rank (training/timing/mpi_ranks.parquet): forward time, backward time, communication time, idle time, LP solve count, scenarios processed.

4.2 Diagnostic Interpretation

| Symptom | Metric to Check | Likely Cause | Mitigation |
|---|---|---|---|
| Idle time >> 0 on some ranks | idle_time_ms variance across ranks | Load imbalance | Check LP iteration count variance; consider NUMA placement |
| Communication time growing across iterations | communication_time_ms trend | Cut count growth increases allgatherv payload | Enable cut selection to bound active cut count |
| Forward time >> backward time | forward_time_ms vs backward_time_ms | Many forward passes, few trial points per rank | Expected when the forward/backward pass ratio is high |
| Backward time growing across iterations | backward_solve_ms trend | More cuts in FCF → larger LPs → longer solves | Enable cut selection; check for dominated cuts |
| Communication fraction > 10% | communication_time_ms / total | Insufficient compute per rank | Reduce rank count (fewer, larger ranks); check network bandwidth |

4.3 Load Balance Assessment

The primary load balance indicator is the ratio of maximum to minimum forward_time_ms (or backward_time_ms) across ranks within an iteration, computed as in the sketch after the table. With static contiguous block distribution and homogeneous LP structure, this ratio should be close to 1.0:

| Imbalance Ratio | Assessment | Action |
|---|---|---|
| < 1.1 | Excellent | None needed |
| 1.1 – 1.2 | Acceptable | Monitor — may improve with NUMA tuning |
| 1.2 – 1.5 | Notable | Check NUMA placement, check LP iteration counts |
| > 1.5 | Significant | Investigate — likely NUMA or hardware issue |
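The ratio itself is trivial to compute from the per-rank timing output; a sketch (reading mpi_ranks.parquet is elided; inputs are plain per-rank times):

```rust
/// Load-imbalance ratio for one iteration: max/min of per-rank
/// forward_time_ms (or backward_time_ms) values.
fn imbalance_ratio(times_ms: &[f64]) -> f64 {
    let max = times_ms.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let min = times_ms.iter().copied().fold(f64::INFINITY, f64::min);
    max / min
}
```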

Cross-References