Backend Testing and Conformance

Purpose

This spec defines the conformance test suite, interchangeability tests, and performance regression tests for the four Communicator backend implementations (local, ferrompi, TCP, shm). The conformance suite verifies that each backend satisfies the method contracts from Communicator Trait §2 and the SharedMemoryProvider lifecycle from Communicator Trait §4. The interchangeability tests verify that swapping backends produces equivalent SDDP results – the central claim of the backend abstraction architecture. The performance regression tests ensure that the trait abstraction layer introduces no measurable overhead relative to direct API calls.

1. Conformance Test Suite

Tests are parameterized by backend and rank count. Each test specifies its name, applicable backends, rank configurations, input scenario, and expected observable behavior (postcondition assertion).

Test naming convention: test_{backend}_{method}_{scenario} where {backend} is one of local, ferrompi, tcp, shm, and {method} is the Communicator or SharedMemoryProvider method under test.

Rank count configurations: Tests are run with 1, 2, and 4 ranks unless otherwise noted. The local backend always runs with size=1. The ferrompi, TCP, and shm backends run with size=2 and size=4. Size=1 tests on multi-rank backends verify degenerate-case correctness.
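
A sketch of how this parameterization might be driven in-process; `run_for_rank_counts` and `make_comms` are hypothetical harness helpers, and the `Communicator` trait (with `rank()`/`size()` returning `usize`) is assumed from Communicator Trait §2:

```rust
// Hypothetical harness sketch: iterates the rank-count configurations for
// one backend and runs the test body on every rank's communicator.
fn run_for_rank_counts<C: Communicator>(
    rank_counts: &[usize],                       // e.g. &[1, 2, 4] per §1
    make_comms: impl Fn(usize) -> Vec<C>,        // builds all ranks for a size
    test_body: impl Fn(&C),                      // the postcondition assertion
) {
    for &size in rank_counts {
        for comm in &make_comms(size) {
            test_body(comm); // each test asserts its observable behavior per rank
        }
    }
}
```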

1.1 allgatherv Conformance

allgatherv is the most performance-critical method (Communicator Trait §2.1). The conformance tests verify rank-ordered receive, data integrity, and correct handling of heterogeneous send sizes and boundary cases.

| Test Name | Ranks | Input Scenario | Expected Observable Behavior |
|---|---|---|---|
| test_{backend}_allgatherv_heterogeneous | 2, 4 | Rank 0 sends 3 elements [1.0, 2.0, 3.0]; rank 1 sends 5 elements [4.0, 5.0, 6.0, 7.0, 8.0]; remaining ranks send rank-specific element counts. counts and displs computed accordingly. | recv buffer on all ranks contains rank 0’s data at displs[0], rank 1’s data at displs[1], …, rank R-1’s data at displs[R-1]. All ranks have bit-for-bit identical recv contents. |
| test_{backend}_allgatherv_identity_size1 | 1 | Single rank sends [10.0, 20.0, 30.0]; counts=[3], displs=[0]. | recv[0..3] == send[0..3]. Identity copy semantics per Local Backend §2.1. |
| test_{backend}_allgatherv_empty_send | 2, 4 | One rank sends 0 elements (counts[r]=0 for that rank); other ranks send non-empty data. | recv buffer is populated correctly for all ranks. The zero-count rank’s region in recv is untouched. All ranks have identical recv contents. |
| test_{backend}_allgatherv_single_element | 2, 4 | Each rank sends exactly 1 element (its rank index as f64). counts is all 1s. | recv contains [0.0, 1.0, ..., (R-1) as f64] on all ranks. |
| test_{backend}_allgatherv_large_payload | 2 | Each rank sends 100,000 f64 elements initialized to rank as f64. | recv contains correct data for both ranks. Validates that shared buffer sizing and framing handle large payloads. |
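
The heterogeneous case from the first row, sketched for R = 2; the exact allgatherv parameter types are assumptions based on the contract in Communicator Trait §2.1:

```rust
// Sketch of test_{backend}_allgatherv_heterogeneous for two ranks: rank 0
// contributes 3 elements, rank 1 contributes 5, and every rank must observe
// the same rank-ordered concatenation.
fn allgatherv_heterogeneous<C: Communicator>(comm: &C) -> Result<(), CommError> {
    let sends: [&[f64]; 2] = [&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0, 7.0, 8.0]];
    let counts = [3usize, 5];
    let displs = [0usize, 3]; // rank 0 at offset 0, rank 1 at offset 3
    let mut recv = vec![0.0_f64; 8];
    comm.allgatherv(sends[comm.rank()], &mut recv, &counts, &displs)?;
    // Postcondition: bit-for-bit identical, rank-ordered receive buffer.
    assert_eq!(recv, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
    Ok(())
}
```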

1.2 allreduce Conformance

allreduce aggregates convergence statistics (Communicator Trait §2.2). The conformance tests verify ReduceOp::Sum, ReduceOp::Min, and ReduceOp::Max correctness, plus identity semantics for size=1.

| Test Name | Ranks | Input Scenario | Expected Observable Behavior |
|---|---|---|---|
| test_{backend}_allreduce_sum | 2, 4 | Each rank sends [1.0, 2.0, 3.0, 4.0] (the SDDP convergence statistics shape: 4 x f64). | recv == [R * 1.0, R * 2.0, R * 3.0, R * 4.0] on all ranks, where R is the number of ranks. |
| test_{backend}_allreduce_min | 2, 4 | Rank r sends [r as f64, (R - r) as f64]. | recv == [0.0, 1.0] on all ranks (element-wise minimum). |
| test_{backend}_allreduce_identity_size1 | 1 | Single rank sends [42.0, 99.0] with ReduceOp::Sum. | recv == [42.0, 99.0]. Reduction of a single operand is the identity. |
| test_{backend}_allreduce_single_element | 2, 4 | Each rank sends [rank as f64] (single-element buffer) with ReduceOp::Sum. | recv == [sum(0..R) as f64] on all ranks. |
| test_{backend}_allreduce_max | 2, 4 | Rank r sends [r as f64, (R - 1 - r) as f64]. ReduceOp::Max. | recv == [(R-1) as f64, (R-1) as f64] on all ranks. |
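
A sketch of the sum case from the first row; the allreduce signature is assumed from Communicator Trait §2.2, and the equality assertion is exact because sums of small integers are exactly representable in f64:

```rust
// Sketch of test_{backend}_allreduce_sum: each rank contributes the
// 4-element convergence-statistics shape; the result is the element-wise
// sum scaled by the rank count R, identical on every rank.
fn allreduce_sum<C: Communicator>(comm: &C) -> Result<(), CommError> {
    let send = [1.0_f64, 2.0, 3.0, 4.0];
    let mut recv = [0.0_f64; 4];
    comm.allreduce(&send, &mut recv, ReduceOp::Sum)?;
    let r = comm.size() as f64;
    assert_eq!(recv, [r * 1.0, r * 2.0, r * 3.0, r * 4.0]); // exact for integers
    Ok(())
}
```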

1.3 broadcast Conformance

broadcast distributes initialization data (Communicator Trait §2.3). The conformance tests verify data integrity with different root ranks.

| Test Name | Ranks | Input Scenario | Expected Observable Behavior |
|---|---|---|---|
| test_{backend}_broadcast_root0 | 2, 4 | Root=0. Rank 0’s buf contains [3.14, 2.72, 1.41]. Other ranks’ buf is zero-initialized. | After broadcast, all ranks have buf == [3.14, 2.72, 1.41]. |
| test_{backend}_broadcast_root_last | 2, 4 | Root=R-1. Root rank’s buf contains [100.0, 200.0]. Other ranks’ buf is zero-initialized. | After broadcast, all ranks have buf == [100.0, 200.0]. |
| test_{backend}_broadcast_data_integrity | 4 | Root=0. buf contains 10,000 f64 elements with a known pattern (buf[i] = i as f64). | All ranks have identical buf contents after broadcast. Element-by-element comparison confirms no corruption. |

1.4 barrier Conformance

barrier provides global synchronization (Communicator Trait §2.4). The conformance test verifies the all-ranks-must-enter semantics.

| Test Name | Ranks | Input Scenario | Expected Observable Behavior |
|---|---|---|---|
| test_{backend}_barrier_write_before_read_after | 2, 4 | Each rank writes its rank index to a shared location (shared memory buffer or file), then calls barrier(), then reads all other ranks’ locations. | After barrier returns, all ranks can observe writes from all other ranks. No rank reads a value that was not yet written. This is the canonical write-before-barrier, read-after-barrier verification pattern. |
| test_{backend}_barrier_repeated | 2, 4 | Ranks execute three consecutive barriers. | All three barriers complete without deadlock. Verifies correct barrier reset between consecutive calls (generation counter correctness per Shared Memory Backend §1.4). |
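
The write-before-barrier, read-after-barrier pattern from the first row, sketched with per-rank files as the shared location (one of the two options the table names); the Communicator trait is assumed from Communicator Trait §2, and the file naming is illustrative:

```rust
use std::fs;
use std::path::Path;

// Sketch of the canonical barrier visibility check: write own marker,
// synchronize, then observe every other rank's marker.
fn barrier_visibility_check<C: Communicator>(comm: &C, dir: &Path) {
    let me = comm.rank();
    fs::write(dir.join(format!("rank_{me}.txt")), me.to_string()).unwrap();
    comm.barrier().unwrap(); // all writes above happen-before this returns
    for r in 0..comm.size() {
        let v = fs::read_to_string(dir.join(format!("rank_{r}.txt"))).unwrap();
        assert_eq!(v.parse::<usize>().unwrap(), r); // no unwritten value observed
    }
}
```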

1.5 rank and size Conformance

rank() and size() are infallible accessors (Communicator Trait §2.5). The conformance tests verify the consistency properties across all ranks.

| Test Name | Ranks | Input Scenario | Expected Observable Behavior |
|---|---|---|---|
| test_{backend}_rank_in_range | 1, 2, 4 | Each rank calls rank() and size(). | rank() returns a value in 0..size() on every rank. |
| test_{backend}_size_consistent | 2, 4 | All ranks call size() and broadcast their result. | All ranks report the same size() value. |
| test_{backend}_rank_unique | 2, 4 | All ranks call rank() and gather results via allgatherv. | The collected rank values form a permutation of 0..size() with no duplicates. |
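
The uniqueness check from the last row reduces to a permutation test over the gathered values; a minimal helper (pure std, assuming only the f64-encoded ranks from the table):

```rust
// Asserts that the allgatherv-collected rank values are exactly a
// permutation of 0..size(): in range, no duplicates, none missing.
fn assert_rank_permutation(gathered: &[f64], size: usize) {
    let mut seen = vec![false; size];
    for &v in gathered {
        let r = v as usize;
        assert!(r < size && !seen[r], "duplicate or out-of-range rank {r}");
        seen[r] = true;
    }
    assert_eq!(gathered.len(), size); // nothing missing
}
```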

1.6 Compound Conformance: Collective Sequencing

This test verifies that consecutive collective operations do not interfere with each other – a critical property for the SDDP training loop which interleaves allgatherv, allreduce, and barrier calls across iterations.

| Test Name | Ranks | Input Scenario | Expected Observable Behavior |
|---|---|---|---|
| test_{backend}_collective_sequence | 2, 4 | Execute in order: allgatherv (each rank sends rank-specific data), allreduce with ReduceOp::Sum, barrier, second allgatherv (each rank sends different data). | Each operation returns correct results independently. The second allgatherv does not contain stale data from the first. Verifies collective sequencing (shm: collective_seq counter per Shared Memory Backend §3.5). |

1.7 SharedMemoryProvider Lifecycle Tests

These tests verify the SharedMemoryProvider trait methods and the SharedRegion<T> lifecycle defined in Communicator Trait §4. All four backends implement SharedMemoryProvider (ferrompi and shm with true shared memory; local and TCP with HeapFallback per Communicator Trait §4.4).

| Test Name | Ranks | Input Scenario | Expected Observable Behavior |
|---|---|---|---|
| test_{backend}_shared_region_lifecycle | 2, 4 | Leader calls create_shared_region::<f64>(100), writes [0.0, 1.0, ..., 99.0] via as_mut_slice(), calls fence(). Follower calls fence(), then reads via as_slice(). | Follower reads exactly [0.0, 1.0, ..., 99.0]. Verifies the allocation -> population -> fence -> read lifecycle from Communicator Trait §4.2. |
| test_{backend}_shared_region_is_leader | 2, 4 | All ranks call is_leader() and gather results. | For ferrompi and shm: exactly one rank (local rank 0) returns true; all others return false. For local and TCP (HeapFallback): all ranks return true per Communicator Trait §4.4. |
| test_{backend}_split_local_rank_size | 2, 4 | All ranks call split_local() and query rank() and size() on the returned communicator. | For ferrompi and shm: the returned communicator has size() equal to the number of co-located ranks and rank() in 0..size(). For local and TCP: the returned communicator has size() == 1 and rank() == 0 (per Communicator Trait §4.4). |
| test_{backend}_shared_region_drop | 2 | Create a SharedRegion<f64>, populate it, read it, then drop it. | No resource leaks. For ferrompi: MPI_Win_free called on drop. For shm: shm_unlink + munmap called on drop. For HeapFallback: Vec<T> dropped. Verified by running under a resource leak detector or by checking that a subsequent region with the same name can be created. |
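
A sketch of the lifecycle row, assuming the SharedMemoryProvider API from Communicator Trait §4; whether fence() lives on the region or the provider is an assumption here, as is the exact error handling:

```rust
// Sketch of allocation -> population -> fence -> read: the leader populates,
// the fence publishes, and every rank (leader and followers alike, or all
// ranks under HeapFallback where is_leader() is always true) validates.
fn shared_region_lifecycle<P: SharedMemoryProvider>(provider: &P) {
    let mut region = provider.create_shared_region::<f64>(100).unwrap();
    if provider.is_leader() {
        for (i, slot) in region.as_mut_slice().iter_mut().enumerate() {
            *slot = i as f64; // leader writes [0.0, 1.0, ..., 99.0]
        }
    }
    region.fence().unwrap(); // collective: publish writes before any read
    assert!(region
        .as_slice()
        .iter()
        .enumerate()
        .all(|(i, &v)| v == i as f64)); // follower reads exactly the pattern
}
```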

1.8 Error-Case Tests (by CommError Variant)

These tests verify that each backend correctly reports errors via the CommError variants defined in Communicator Trait §1.4 and Communicator Trait §4.6. Error-case behavior varies significantly by backend.

1.8.1 CommError::InvalidBufferSize

| Test Name | Ranks | Input Scenario | Expected Observable Behavior |
|---|---|---|---|
| test_{backend}_allreduce_buffer_mismatch | 2 | send.len() = 4, recv.len() = 3 (mismatch violating Communicator Trait §2.2 precondition). | Returns Err(CommError::InvalidBufferSize { operation: "allreduce", expected: 4, actual: 3 }) on all backends (local, ferrompi, tcp, shm). All backends validate buffer size preconditions (see §1.9.1). |
| test_{backend}_allgatherv_recv_too_small | 2 | recv.len() is smaller than displs[R-1] + counts[R-1] (receive buffer too small per Communicator Trait §2.1 precondition). | Returns Err(CommError::InvalidBufferSize { ... }) on all backends that validate buffer sizes (local, ferrompi, tcp, shm). |
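
A sketch of the buffer-mismatch assertion from the first row; the CommError variant shape follows the table, while the field types (and a Debug impl on the error) are assumptions:

```rust
// Sketch of test_{backend}_allreduce_buffer_mismatch: the call must fail
// with the exact InvalidBufferSize payload, never silently truncate.
fn allreduce_buffer_mismatch<C: Communicator>(comm: &C) {
    let send = [0.0_f64; 4];
    let mut recv = [0.0_f64; 3]; // deliberately violates send.len() == recv.len()
    match comm.allreduce(&send, &mut recv, ReduceOp::Sum) {
        Err(CommError::InvalidBufferSize { operation, expected, actual }) => {
            assert_eq!(operation, "allreduce");
            assert_eq!((expected, actual), (4, 3));
        }
        other => panic!("expected InvalidBufferSize, got {:?}", other),
    }
}
```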

1.8.2 CommError::InvalidRoot

| Test Name | Ranks | Input Scenario | Expected Observable Behavior |
|---|---|---|---|
| test_{backend}_broadcast_invalid_root | 2 | broadcast(buf, root=5) where size() == 2. | Returns Err(CommError::InvalidRoot { root: 5, size: 2 }) on all backends (local, ferrompi, tcp, shm). All backends validate root values (see §1.9.1). |

1.8.3 CommError::CollectiveFailed

| Test Name | Ranks | Input Scenario | Expected Observable Behavior |
|---|---|---|---|
| test_tcp_barrier_timeout | 2 | One rank calls barrier(); the other rank does not (simulating a crash). COBRE_TCP_TIMEOUT_SECS=2. | The calling rank receives Err(CommError::CollectiveFailed { operation: "barrier", mpi_error_code: 0, message: ... }) within approximately 2 seconds. The mpi_error_code is 0 per TCP Backend §5.3. |
| test_shm_barrier_timeout | 2 | One rank calls barrier(); the other rank never reaches the barrier (process exits). The shm backend’s barrier timeout is set to 2 seconds. | The waiting rank receives Err(CommError::CollectiveFailed { operation: "barrier", ... }) within approximately 2 seconds. Verifies that the generation counter mechanism (Shared Memory Backend §1.4) does not block indefinitely on rank failure. |

1.8.4 CommError::InvalidCommunicator

| Test Name | Ranks | Input Scenario | Expected Observable Behavior |
|---|---|---|---|
| test_ferrompi_finalized_communicator | 2 | Attempt to call allgatherv after MPI_Finalize has been called (communicator is in an invalid state). | Returns Err(CommError::InvalidCommunicator). This error is MPI-specific per Communicator Trait §1.4; the local, TCP, and shm backends do not produce this variant. |

1.8.5 CommError::AllocationFailed

| Test Name | Ranks | Input Scenario | Expected Observable Behavior |
|---|---|---|---|
| test_{backend}_shared_region_excessive_size | 1 | Call create_shared_region::<f64>(usize::MAX / 2) (request exceeding available memory). | Returns Err(CommError::AllocationFailed { requested_bytes: ..., message: ... }) for true shared memory backends (ferrompi, shm). HeapFallback backends (local, TCP) follow Rust’s standard allocation failure semantics (abort on OOM by default), per Communicator Trait §4.6. |

1.9 Backend-Specific Conformance Notes

1.9.1 Local Backend

The local backend validates preconditions and returns errors on violations: it returns Err(CommError::InvalidBufferSize) when buffer sizes do not match the operation’s requirements, and Err(CommError::InvalidRoot) when a root value other than 0 is passed. It cannot produce CommError::CollectiveFailed (no MPI calls) or CommError::InvalidCommunicator (no communicator state to invalidate).

Testing implication: Error-case tests (§1.8) for InvalidBufferSize and InvalidRoot apply to the local backend and must assert Err(...). Only CollectiveFailed and InvalidCommunicator error tests are not applicable to the local backend.

1.9.2 Ferrompi Backend

Precondition checks (buffer sizes, root validity) are delegated to the ferrompi layer per Ferrompi Backend §1.2. The FerrompiBackend does not duplicate these checks. Conformance tests for error cases must verify that the ferrompi-generated MPI error codes are correctly mapped to CommError variants via the error conversion function in Ferrompi Backend §5.2.

Testing implication: Ferrompi error-case tests depend on the MPI implementation’s error reporting behavior, which may vary across MPI vendors (OpenMPI, MPICH, Intel MPI). Tests should assert on the CommError variant (e.g., InvalidBufferSize, InvalidRoot) rather than on the specific mpi_error_code value.

1.9.3 TCP Backend

For CommError::CollectiveFailed, the mpi_error_code field is always 0 per TCP Backend §5.3, because there is no MPI error code to report. The message field contains the specific TCP error description.

Testing implication: TCP error-case tests must assert mpi_error_code == 0 for all CollectiveFailed errors. Timeout-based failure detection (§1.8.3) is controlled via the COBRE_TCP_TIMEOUT_SECS environment variable.

1.9.4 Shared Memory Backend

The shm backend uses a two-barrier protocol for allreduce (Shared Memory Backend §3.2): all ranks write to their per-rank slots, execute a first barrier, then rank 0 performs the reduction and writes to the output slot, and a second barrier ensures the result is visible to all ranks.

Testing implication: The conformance test test_shm_allreduce_sum (§1.2) implicitly validates the two-barrier protocol. An additional structural test should verify that rank 0 is the rank performing the reduction (not any other rank), by instrumenting or observing that non-zero ranks do not write to the output slot between the two barriers. This is validated indirectly: if a non-zero rank performed the reduction, the floating-point summation order could differ, or the visibility guarantee provided by the second barrier would be violated.
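
A minimal in-process sketch of the two-barrier protocol described above; the slot layout (slots[r] holds rank r's input, out is the shared result slot) is a stand-in for the real shm segment, and the Communicator signature is assumed from Communicator Trait §2:

```rust
// Sketch of the two-barrier allreduce: write own slot, barrier, rank 0
// reduces in rank order, barrier again so the result is visible everywhere.
fn two_barrier_sum<C: Communicator>(
    comm: &C,
    slots: &mut [Vec<f64>], // one pre-sized input slot per rank
    out: &mut [f64],        // shared output slot
    my_input: &[f64],
) -> Result<(), CommError> {
    slots[comm.rank()].copy_from_slice(my_input); // every rank writes its slot
    comm.barrier()?;                              // barrier 1: all inputs visible
    if comm.rank() == 0 {
        out.fill(0.0);
        for slot in slots.iter() {                // rank 0 reduces rank 0..R-1
            for (o, x) in out.iter_mut().zip(slot) {
                *o += x;
            }
        }
    }
    comm.barrier()?;                              // barrier 2: result visible to all
    Ok(())
}
```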

2. Interchangeability Tests

These tests verify that swapping communication backends produces equivalent SDDP results. The training loop is generic over C: Communicator (Communicator Trait §3), and any conformant backend must produce the same policy.

2.1 Reference Test Case

The interchangeability tests use a small SDDP problem defined by the following properties:

| Property | Value | Rationale |
|---|---|---|
| Stages | 2–3 | Minimum for meaningful cut generation (backward pass requires at least 2 stages) |
| Hydro plants | 2 | Minimum for reservoir coupling and non-trivial state space |
| Forward passes per iteration | 10 | Small enough for fast execution; large enough for meaningful upper bound statistics |
| Iterations | 5 | Sufficient to generate a non-trivial cut pool; small enough for sub-5-second execution |
| Openings per stage | 3–5 | Small scenario tree for fast backward pass |
| Execution time (single core) | < 5 seconds | Hard constraint: interchangeability tests run in CI without HPC resources |

The test case is defined by properties, not by a specific dataset. Any dataset satisfying the above properties is acceptable, provided it is deterministic (fixed random seed) and produces a non-trivial policy (at least one cut generated per stage).

2.2 Backend Comparison Matrix

The reference test case is executed on each backend configuration and the results are compared pairwise.

| Configuration | Backend | Size | Description |
|---|---|---|---|
| ref_local | local | 1 | Reference result. Single-process execution with identity collectives. |
| ref_shm | shm | 2 | Two-process execution on the same node via POSIX shared memory. |
| ref_tcp | tcp | 2 | Two-process execution via TCP coordinator pattern. |
| ref_ferrompi | ferrompi | 2 | Two-process execution via MPI collectives (CI environments with MPI only). |

2.3 Comparison Criteria

| Quantity | Comparison Method | Expected Result | Rationale |
|---|---|---|---|
| Lower bound trace | Bit-for-bit identical | All multi-rank backends match each other and ref_local | Lower bound is computed from the first-stage LP objective, which is deterministic given identical cut pools. The rank-ordered allgatherv ensures identical cut pool construction across backends. Per Shared Memory Aggregation §3.4. |
| Final cut pool | Bit-for-bit identical | All multi-rank backends match each other and ref_local | Cut coefficients are computed from LP dual multipliers (deterministic) and aggregated via rank-ordered allgatherv (deterministic). Cut slot assignment is deterministic from (iteration, forward_pass_index) per Shared Memory Aggregation §3.2. |
| Policy output | Bit-for-bit identical | All multi-rank backends match each other and ref_local | The policy is fully determined by the cut pool. Identical cut pools produce identical policies. |
| Upper bound mean | Within tolerance (~1e-12 relative) | Multi-rank backends match each other; may differ from ref_local by floating-point tolerance | allreduce with ReduceOp::Sum may produce results that vary with the reduction tree shape (Communicator Trait §2.2). TCP and shm use fixed rank-order sequential reduction (deterministic), so they match each other exactly. Ferrompi uses MPI’s implementation-defined reduction tree, which may diverge from TCP/shm by ~1e-12 relative error. The local backend (size=1) has no reduction tree, so its upper bound may differ from multi-rank results. |

See Shared Memory Aggregation §3.4 for the full reproducibility verification criteria.
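
Bit-for-bit comparison is stricter than floating-point equality; a minimal helper (pure std) implementing the comparison the §2.3 criteria imply:

```rust
/// Two f64 slices are bit-for-bit identical iff every element has the same
/// bit pattern. Unlike `==`, this distinguishes -0.0 from 0.0 and compares
/// NaNs by payload, which is what "bit-for-bit" requires.
fn bit_identical(a: &[f64], b: &[f64]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}
```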

2.4 Python Multi-Process Interchangeability

This test verifies that the Python multi-process execution path produces results equivalent to single-process execution. Unlike the Rust-level tests in §2.2–§2.3, this test exercises the full Python worker lifecycle (Python Bindings §2.1a), including process spawning, backend initialization, result collection, and WorkerInfo metadata.

Test procedure:

  1. Execute result_1 = cobre.train(case, num_workers=1) (single-process, local backend).
  2. Execute result_n = cobre.train(case, num_workers=N, backend="shm") with N >= 2 (multi-process, shm backend).
  3. Compare results using the criteria from §2.3:
    • Lower bound trace: bit-for-bit identical.
    • Final cut pool: bit-for-bit identical.
    • Policy output: bit-for-bit identical.
    • Upper bound mean: within ~1e-12 relative tolerance.

WorkerInfo metadata assertions:

  1. Verify that result_n.workers returns a list of exactly N WorkerInfo instances.
  2. Verify that the WorkerInfo.rank values form a permutation of 0..N (all distinct).
  3. Verify that all WorkerInfo.backend values are "shm" (matching the requested backend).
  4. Verify that result_1.workers returns None (single-process mode does not produce worker metadata, per Python Bindings §2.7).

Error handling assertion:

  1. If a worker process fails during test execution, verify that cobre.WorkerError is raised with the rank property identifying the failed worker (per Python Bindings §6.1a). This is tested by intentionally causing a worker to fail (e.g., invalid case data passed to one rank only) and verifying the exception structure.

3. Performance Regression Tests

These tests ensure that the Communicator trait abstraction layer introduces no measurable overhead and that backend performance remains within expected bounds across releases.

3.1 Ferrompi Overhead Baseline

The primary performance claim of the backend abstraction is zero-cost abstraction via monomorphization (Backend Selection §1.4): when the training loop is generic over C: Communicator and the binary is built with a single backend feature, the compiler resolves C to the concrete type at compile time, producing identical assembly to direct API calls.

Test methodology:

  1. Direct ferrompi calls. Execute the SDDP training loop calling ferrompi API functions directly (no Communicator trait indirection). Record wall-clock time per iteration at worst-case scale (the hypothetical maximum rank and stage counts; the DEC-009 baseline rank count, forward-pass count, and ~587 MB/iteration communication volume per Communication Patterns §3.1).
  2. Trait wrapper calls. Execute the same training loop using train::<FerrompiBackend>(comm, ...) with the Communicator trait. Record wall-clock time per iteration at the same scale.
  3. Compare. The trait wrapper must introduce < 1% overhead relative to direct calls.

Rationale: Monomorphization eliminates the trait abstraction at compile time (Ferrompi Backend §4.1). Each trait method call compiles to a direct call to the corresponding ferrompi function. The match on ReduceOp in allreduce is eliminated when the variant is known at the call site (constant propagation). The expected overhead is zero; the 1% threshold accounts for measurement noise.
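
A minimal sketch of the monomorphization argument, assuming the trait signatures from Communicator Trait §2; with a single backend feature enabled, this generic function resolves statically to the concrete backend:

```rust
// With C = FerrompiBackend fixed at compile time, the allreduce call below
// compiles to a direct call into ferrompi, and the ReduceOp match inside
// the backend is constant-folded because the variant is known at the call
// site. The function body is illustrative, not the real training loop.
fn reduce_convergence_stats<C: Communicator>(
    comm: &C,
    local_stats: &[f64; 4],
) -> Result<[f64; 4], CommError> {
    let mut global = [0.0_f64; 4];
    comm.allreduce(local_stats, &mut global, ReduceOp::Sum)?; // static dispatch
    Ok(global)
}
```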

Note: This test is only applicable to the ferrompi backend. The local, TCP, and shm backends do not have a “direct call” baseline – the Communicator trait is their only API. Their performance tests are integration-level (§3.2, §3.3).

3.2 Per-Operation Measurement

Per-operation timing measurements capture the communication cost of each collective operation in isolation, enabling regression detection at the operation level rather than the iteration level.

| Operation | Payload | Rank Count | Measurement | Regression Threshold |
|---|---|---|---|---|
| allgatherv (trial points) | ~206 MB total (16 ranks x ~12.9 MB per rank) | 16 | Median wall-clock time over 10 repetitions | < 5% regression from baseline |
| allgatherv (cuts per stage) | ~3.2 MB total (16 ranks x ~200 KB per rank) | 16 | Median wall-clock time over 10 repetitions | < 5% regression from baseline |
| allreduce (convergence stats) | 32 bytes (4 x f64) | 16 | Median wall-clock time over 100 repetitions | < 5% regression from baseline |
| barrier | 0 bytes | 16 | Median wall-clock time over 100 repetitions | < 10% regression from baseline |
| broadcast (initialization) | ~10 KB | 16 | Median wall-clock time over 10 repetitions | < 10% regression from baseline |

Notes:

  • allgatherv is the dominant communication cost (~587 MB/iteration), so its regression threshold is tighter (5% vs 10%).
  • allreduce is latency-sensitive and on the convergence checking critical path.
  • barrier and broadcast are off the hot path (checkpoint and initialization only).
  • Measurements are per-backend; the local backend is excluded because its operations are identity copies or no-ops.
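
A sketch of the median-of-N measurement named in the table; pure std, with the repetition count supplied by the caller (10 or 100 per the table, and assumed to be at least 1):

```rust
use std::time::{Duration, Instant};

/// Times `op` for `reps` repetitions and returns the median sample; the
/// median is robust to one-off scheduling spikes, which is why the table
/// specifies it rather than the mean.
fn median_time(reps: usize, mut op: impl FnMut()) -> Duration {
    let mut samples: Vec<Duration> = (0..reps)
        .map(|_| {
            let t = Instant::now();
            op();
            t.elapsed()
        })
        .collect();
    samples.sort();
    samples[reps / 2] // upper median for even reps
}
```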

3.3 Python Multi-Process Smoke Test

This test verifies that multi-process execution from Python achieves a meaningful speedup over single-process execution, confirming that the inter-process communication overhead does not negate the parallelism benefit.

Test procedure:

  1. Execute t1 = time(cobre.train(case, num_workers=1)) – single-process wall-clock time.
  2. Execute t2 = time(cobre.train(case, num_workers=2, backend="shm")) – two-process wall-clock time.
  3. Assert t2 < 0.70 * t1 – two workers complete in less than 70% of single-worker wall-clock time.

Rationale: With 2 workers, perfect scaling would achieve 50% of single-process time. The 70% threshold allows for:

  • Process spawn overhead (Python multiprocessing.Process with "spawn" start method).
  • Backend initialization overhead (shm segment creation and mapping).
  • Communication overhead (shm collectives, barriers).
  • Load imbalance between workers.

This is a smoke test, not a microbenchmark. It catches gross regressions (e.g., communication deadlocks, accidental serialization) rather than measuring fine-grained performance. The test case from §2.1 (< 5 seconds single-core) is used to keep CI execution time reasonable.

4. Determinism Verification

This section verifies a stronger property than interchangeability (§2): that the same backend produces bit-for-bit identical results across different rank counts, thread counts, and repeated executions. The invariant is defined in Shared Memory Aggregation §3.1.

4.1 Reproducibility Invariant

Given the same inputs and random seed, Cobre produces bit-for-bit identical results regardless of:

  • (a) the number of ranks,
  • (b) the number of OpenMP threads per rank,
  • (c) execution timing and OS scheduling.

Bit-for-bit identical quantities (all backends):

| Quantity | Determinism Guarantee | Rationale |
|---|---|---|
| Lower bound trace | Bit-for-bit identical | Computed from the first-stage LP objective, which is deterministic given identical cut pools. Cut pools are synchronized via rank-ordered allgatherv (deterministic on all backends). Per Shared Memory Aggregation §3.2. |
| Final cut pool coefficients | Bit-for-bit identical | Cut coefficients are LP dual multipliers (deterministic), aggregated via rank-ordered allgatherv (deterministic). Cut slot assignment is deterministic from (iteration, forward_pass_index). Per Shared Memory Aggregation §3.2. |
| Policy output (FlatBuffers file) | Bit-for-bit identical | The policy is fully determined by the cut pool. Identical cut pools produce identical policies. Serialization is deterministic. |
| Convergence termination iteration | Bit-for-bit identical | The stopping rule evaluates the lower bound (bit-for-bit identical) against deterministic thresholds. All backends terminate at the same iteration number. |

Exempt quantities:

| Quantity | Determinism Guarantee | Rationale |
|---|---|---|
| Upper bound mean/variance | Backend-dependent (see §4.2) | Upper bound is computed via allreduce with ReduceOp::Sum. For local, TCP, and shm backends, the reduction order is fixed (sequential rank 0..R-1), producing identical results. For ferrompi, the MPI reduction tree shape is implementation-defined, so the floating-point result may differ across rank counts or MPI libraries. Per Communicator Trait §2.2. |
| Wall-clock timing metrics | Non-deterministic | Timing depends on OS scheduling, system load, and hardware state. Not comparable across runs. |
| Per-rank timing breakdowns | Non-deterministic | Per-rank timing varies with OS scheduling, NUMA placement, and inter-process contention. Not comparable across runs. |

4.2 Per-Backend Determinism Properties

The following table classifies each collective operation’s determinism properties per backend. An operation is deterministic if it produces bit-for-bit identical results given identical inputs, regardless of timing or scheduling.

| Backend | allgatherv | allreduce Sum | allreduce Min | broadcast | barrier |
|---|---|---|---|---|---|
| local | Identity copy (deterministic) | Identity copy (deterministic) | Identity copy (deterministic) | No-op (deterministic) | No-op (deterministic) |
| tcp | Rank-ordered assembly by coordinator (deterministic) | Fixed sequential reduction rank 0..R-1 (deterministic) | Min is comparison-based (deterministic) | Deterministic | Deterministic |
| shm | Rank-ordered writes to shared buffer (deterministic) | Rank 0 reduces sequentially rank 0..R-1 (deterministic) | Min is comparison-based (deterministic) | Deterministic | Deterministic |
| ferrompi | Rank-ordered per MPI spec (deterministic) | Implementation-defined tree shape (non-deterministic across MPI libraries and rank counts) | Min is comparison-based (deterministic) | Deterministic | Deterministic |

All four backends produce deterministic allgatherv results – data is assembled in rank order (rank 0, rank 1, …, rank R-1), ensuring identical cut pools and trial point sets on all ranks. This is the foundation of the reproducibility invariant.

For allreduce with ReduceOp::Sum, the local, TCP, and shm backends are deterministic because their reduction order is fixed: the accumulation proceeds strictly left to right over rank contributions, ((x_0 + x_1) + x_2) + ... + x_{R-1}, so the floating-point association is identical on every run.
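
A minimal sketch of that fixed-order reduction (an in-process stand-in for illustration, not the actual backend code; assumes at least one rank):

```rust
/// Element-wise sum over per-rank buffers in strict rank order 0..R-1.
/// Because the fold order never varies, the IEEE 754 result is identical
/// across runs regardless of timing or scheduling.
fn sequential_sum(per_rank: &[Vec<f64>]) -> Vec<f64> {
    let mut acc = per_rank[0].clone();       // start from rank 0's contribution
    for contribution in &per_rank[1..] {     // fold ranks 1..R-1 in order
        for (a, x) in acc.iter_mut().zip(contribution) {
            *a += x;                         // fixed association: ((x0+x1)+x2)+...
        }
    }
    acc
}
```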

The ferrompi backend delegates to MPI_Allreduce, whose reduction tree shape is implementation-defined (Communicator Trait §2.2). Different MPI libraries (OpenMPI, MPICH, Intel MPI) and different rank counts may produce different floating-point summation results due to the non-associativity of IEEE 754 addition. This affects only the upper bound statistics – all other SDDP quantities are determined by allgatherv (deterministic) and LP solves (deterministic given identical inputs).

For allreduce with ReduceOp::Min, all backends are deterministic because minimum is a comparison-based operation with no floating-point arithmetic ambiguity. Given identical inputs, min(a, b) = min(b, a) regardless of evaluation order.

4.3 Determinism Test Matrix

The determinism tests use the same reference test case defined in §2.1 (2-3 stages, 2 hydro plants, 10 forward passes, 5 iterations, 3-5 openings per stage, < 5 seconds single-core). All tests use the same fixed random seed.

| Test ID | Backend | Ranks | Threads/Rank | Reference | Purpose |
|---|---|---|---|---|---|
| D1 | local | 1 | 1 |  | Baseline: single-rank, single-thread reference result |
| D2 | local | 1 | 4 | D1 | Thread independence: same result with different thread count |
| D3 | shm | 2 | 1 | D1 | Multi-rank: 2 ranks match single-rank result |
| D4 | shm | 4 | 1 | D1 | Rank scaling: 4 ranks match single-rank result |
| D5 | shm | 2 | 4 | D1 | Combined: multi-rank + multi-thread |
| D6 | tcp | 2 | 1 | D1 | TCP backend: 2 ranks match single-rank result |
| D7 | tcp | 4 | 1 | D1 | TCP rank scaling: 4 ranks match single-rank result |
| D8 | ferrompi | 2 | 1 | D1 | MPI backend: 2 ranks match single-rank on exact quantities |
| D9 | ferrompi | 4 | 2 | D1 | MPI combined: 4 ranks, 2 threads |
| D10 | ferrompi | 2 | 1 (run twice) | D10-self | MPI reproducibility: same config run twice, bit-for-bit match |

Comparison criteria per quantity:

| Quantity | D2-D7 (local/shm/tcp vs D1) | D8-D9 (ferrompi vs D1) | D10 (ferrompi run A vs run B) |
|---|---|---|---|
| Lower bound trace | Bit-for-bit identical | Bit-for-bit identical | Bit-for-bit identical |
| Final cut pool coefficients | Bit-for-bit identical | Bit-for-bit identical | Bit-for-bit identical |
| Policy output (FlatBuffers file) | Bit-for-bit identical | Bit-for-bit identical | Bit-for-bit identical |
| Convergence termination iteration | Bit-for-bit identical | Bit-for-bit identical | Bit-for-bit identical |
| Upper bound mean | Bit-for-bit identical | Unconstrained | Bit-for-bit identical |

Notes on comparison modes:

  • D2-D5 (local/shm vs D1): All quantities including upper bound are bit-for-bit identical. Neither the local backend’s identity copy nor the shm backend’s sequential reduction introduces floating-point non-determinism.

  • D6-D7 (tcp vs D1): All quantities including upper bound are bit-for-bit identical. The TCP coordinator reduces sequentially in rank order, producing the same floating-point results as the shm backend.

  • D8-D9 (ferrompi vs D1): Lower bound, cut pool, policy, and convergence iteration are bit-for-bit identical (dependent only on allgatherv and LP solves). Upper bound mean is unconstrained: MPI_Allreduce uses an implementation-defined reduction tree, so the ferrompi upper bound may be computed in a different floating-point summation order than D1.

  • D10 (ferrompi run-twice): Both runs must be bit-for-bit identical on all quantities including upper bound. The MPI reduction tree is deterministic for a fixed (library, rank count) configuration; the non-determinism in §4.2 applies only when comparing across different rank counts or MPI libraries.

4.4 Python Multi-Process Determinism Tests

These tests verify that the Python multi-process execution path preserves determinism and correctly handles error conditions. Unlike §4.3 which tests the Rust-level backends directly, these tests exercise the full Python worker lifecycle from Python Bindings §2.1a.

4.4.1 Fork Rejection

Test procedure:

  1. Set multiprocessing.set_start_method("fork") in the test process.
  2. Call cobre.train(case, num_workers=2).

Expected result: Raises cobre.CobreError with kind="IncompatibleSettings" and a message explaining that the "fork" start method is prohibited.

Reference: Python Bindings §2.1a step 1: “If it is already set to "fork", the library raises cobre.CobreError with kind="IncompatibleSettings".”

Rationale: The "fork" start method is unsafe for multi-threaded processes (OpenMP threads, solver threads) and can cause deadlocks or corrupted state. Cobre requires "spawn" for multi-process execution.

4.4.2 Worker Error Propagation

Test variant A – Worker raises CobreError:

  1. Call cobre.train(case_with_invalid_data, num_workers=2, backend="shm") where case_with_invalid_data is constructed to trigger a CobreError in the worker during training (e.g., an LP infeasibility that surfaces after the first iteration).
  2. The parent process raises cobre.WorkerError.
  3. Assert: WorkerError.inner is an instance of cobre.CobreError (or a subclass) containing the original error from the failed worker.
  4. Assert: WorkerError.rank identifies the failed worker (0 or 1).
  5. Assert: All remaining workers are terminated (no zombie processes).

Test variant B – Worker crashes at OS level:

  1. Call cobre.train(case, num_workers=2, backend="shm") with a test hook that causes one worker to crash at the OS level (e.g., by sending SIGKILL to the worker process after it starts).
  2. The parent process raises cobre.WorkerError.
  3. Assert: WorkerError.inner is None (the worker crashed without raising a Python-level exception).
  4. Assert: WorkerError.exit_code is signal-based (negative on Unix, reflecting the signal number).
  5. Assert: All remaining workers are terminated.

Reference: Python Bindings §6.1a: WorkerError with rank, exit_code, and inner properties.

4.4.3 Run-Twice Reproducibility

Test procedure:

  1. Execute result_a = cobre.train(case, config_overrides={"random_seed": 42}, num_workers=2, backend="shm").
  2. Execute result_b = cobre.train(case, config_overrides={"random_seed": 42}, num_workers=2, backend="shm").
  3. Compare lower bound traces: bit-for-bit identical.
  4. Compare final cut pool coefficients: bit-for-bit identical.
  5. Compare policy output: bit-for-bit identical.
  6. Compare convergence termination iteration: identical.

Expected result: All exact quantities are bit-for-bit identical across the two runs.

Rationale: This test verifies that the Python multiprocessing.Process spawn path, Communicator creation, worker coordination, and result collection introduce no non-determinism. The "spawn" start method creates fresh Python interpreters, and any state leakage between runs would manifest as differing results.

4.4.4 Progress Event Multiplexing

Test procedure:

  1. Define a progress_callback that appends each ProgressEvent to a list.
  2. Execute cobre.train(case, num_workers=2, backend="shm", progress_callback=progress_callback).
  3. Let N be the number of training iterations (convergence termination iteration from the result).

Assertions:

  • Both worker_id=0 and worker_id=1 appear in the collected events.
  • For each worker, events with that worker_id are in iteration order: if event A has iteration=i and event B has iteration=j with i < j, then A appears before B in the list.
  • Total events count equals 2 * N (one event per iteration per worker).

Reference: Python Bindings §2.9: ProgressEvent.worker_id disambiguates the source; “events from the same worker are in iteration order.”

4.4.5 TCP Auto-Detection

Test procedure:

  1. Set the environment variables COBRE_TCP_COORDINATOR=127.0.0.1 and COBRE_TCP_PORT=<port> where <port> is a free ephemeral port (see TCP Backend §8.1 for the split-variable convention).
  2. Execute result = cobre.train(case, num_workers=2, backend="auto").
  3. Inspect result.workers metadata.

Assertions:

  • result.workers is a list of 2 WorkerInfo instances.
  • WorkerInfo.backend == "tcp" for both workers (the auto-detection logic selected TCP because COBRE_TCP_COORDINATOR was set).

Reference: Python Bindings §7.5: “When backend="auto" and num_workers > 1: selects "tcp" if the COBRE_TCP_COORDINATOR environment variable is set.”

4.5 Scope of the Reproducibility Invariant

The invariant is defined in Shared Memory Aggregation §3.1 (requirement statement), with mechanisms in §3.2 (deterministic seeding, rank-ordered allgatherv, deterministic cut slots), floating-point considerations in §3.3, and quantity-by-quantity verification criteria in §3.4.

The test matrix in §4.3 and the Python multi-process tests in §4.4 provide the concrete configurations that exercise this invariant across all four backends and both execution modes (Rust-level and Python-level). No modifications to Shared Memory Aggregation are required – the invariant definition is backend-agnostic by design.

Cross-References