Checkpointing
Purpose
This spec defines how Cobre persists training state for fault tolerance (checkpointing) and supports resuming training or warm-starting from a previously trained policy. For the serialization format and policy directory structure, see Binary Formats §3. For output generation (simulation results, timing data), see Output Infrastructure and Output Schemas.
1. Checkpoint Strategy
1.1 Goals
| Goal | Requirement |
|---|---|
| Fault tolerance | Resume training after SLURM preemption, wall-time limit, or node failure |
| Checkpoint overhead | < 5% of iteration time (rank 0 writes while other ranks wait at barrier) |
| Warm-start | Start new training from a previously trained policy’s cuts |
| Reproducibility | Resume from checkpoint must produce bit-for-bit identical results to uninterrupted run |
1.2 Checkpoint Triggers
| Trigger | Condition |
|---|---|
| Periodic | Every checkpoint_interval iterations (disabled by default — must be explicitly enabled) |
| Signal | SIGTERM/SIGINT sets shutdown flag; checkpoint written from last completed iteration’s state |
| Convergence | Final checkpoint on training completion |
Signal handling follows the protocol in CLI and Lifecycle §7: the handler sets a global flag, checkpoints the last fully completed iteration (not the in-progress one), and exits. The training loop checks the flag at iteration boundaries.
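A minimal sketch of this protocol in C++, with hypothetical stand-ins (`Config`, `TrainingState`, `run_iteration`, `write_checkpoint`) for the real types described in Training Loop §3:

```cpp
#include <atomic>
#include <csignal>

// Hypothetical stand-ins for Cobre's real types (illustration only).
struct Config { int checkpoint_interval = 0; };  // 0 = disabled (the default)
struct TrainingState { long completed_iterations = 0; };

bool converged(const TrainingState&);
void run_iteration(TrainingState&);         // forward pass, backward pass, bounds
void write_checkpoint(const TrainingState&);

std::atomic<bool> g_shutdown_requested{false};

extern "C" void handle_shutdown(int /*signum*/) {
    // Async-signal-safe: the handler only sets a flag; all I/O happens later
    // in the training loop (CLI and Lifecycle §7).
    g_shutdown_requested.store(true, std::memory_order_relaxed);
}

void training_loop(TrainingState& state, const Config& cfg) {
    std::signal(SIGTERM, handle_shutdown);
    std::signal(SIGINT, handle_shutdown);

    while (!converged(state)) {
        run_iteration(state);
        ++state.completed_iterations;

        // Triggers are evaluated only at iteration boundaries, so the
        // serialized state always reflects a fully completed iteration.
        const bool shutdown = g_shutdown_requested.load(std::memory_order_relaxed);
        const bool periodic = cfg.checkpoint_interval > 0 &&
            state.completed_iterations % cfg.checkpoint_interval == 0;
        if (periodic || shutdown) write_checkpoint(state);
        if (shutdown) return;   // graceful shutdown, exit code 0
    }
    write_checkpoint(state);    // final checkpoint on convergence (§1.2)
}
```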
2. Checkpoint Contents
2.1 What Must Be Serialized
| Component | Serialization | Why Required |
|---|---|---|
| Cut pool (all stages) | FlatBuffers StageCuts per stage | Primary policy data — the trained cost-to-go approximation |
| Cut activity/slot state | is_active flags and slot indices per cut | LP row structure must be reconstructed identically |
| Solver basis (per stage) | FlatBuffers StageBasis per stage | Exact warm-start — avoids full re-solve on resume |
| Iteration counter | PolicyMetadata.completed_iterations | Resume continues from correct iteration |
| RNG state | PolicyMetadata.rng_state (full state vector) | Scenario reproducibility — next iteration generates same noise |
| Convergence history | Lower/upper bound traces | Convergence monitoring continues with correct history |
| Config hash | PolicyMetadata.config_hash | Detect config changes between runs |
| System hash | PolicyMetadata.system_hash | Detect input data changes between runs |
For the complete FlatBuffers schema (StageCuts, StageBasis, PolicyMetadata), see Binary Formats §3.1. For the reproducibility requirements on checkpoint/resume, see Binary Formats §4.1.
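Grouped as a plain C++ sketch, the serialized state from the table might look like the following. Field names mirror the FlatBuffers tables in Binary Formats §3.1, but the structs themselves (including the cut coefficient/intercept layout) are illustrative, not schema-generated code:

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: the authoritative schema is the FlatBuffers
// definition in Binary Formats §3.1.
struct CutSlot {
    std::vector<double> coefficients;   // cut gradient w.r.t. state variables
    double   intercept  = 0.0;
    bool     is_active  = false;        // needed to rebuild the LP row structure
    uint32_t slot_index = 0;            // must be preserved exactly on resume
};

struct StageCheckpoint {
    std::vector<CutSlot> cuts;      // StageCuts: only populated slots are written
    std::vector<int8_t>  basis;     // StageBasis: solver warm-start basis
};

struct PolicyMetadata {
    uint64_t completed_iterations = 0;
    std::vector<uint64_t> rng_state;    // full state vector, not just the seed
    std::vector<double> lower_bounds;   // convergence history
    std::vector<double> upper_bounds;
    uint64_t config_hash = 0;           // detect config changes between runs
    uint64_t system_hash = 0;           // detect input data changes between runs
};
```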
2.2 What Is NOT Serialized
| Component | Reason Not Serialized |
|---|---|
| Opening tree | Deterministically regenerable from rng_seed + opening indices (see Scenario Generation §2.3) |
| Solver workspaces | Thread-local, rebuilt at initialization (see Solver Workspaces §1.3) |
| MPI communication buffers | Allocated at initialization, not training state |
| Forward pass state | Ephemeral — consumed within each iteration |
2.3 Checkpoint Write Protocol
- Training loop completes an iteration and checks checkpoint triggers (§1.2)
- All ranks synchronize at an MPI barrier
- Rank 0 writes the checkpoint to the policy directory (FlatBuffers format per Binary Formats §3.2)
- Rank 0 updates the `latest` symlink to point to the new checkpoint
- Rank 0 removes old checkpoints beyond the retention limit (default: keep last 3)
- All ranks synchronize at an MPI barrier
- Training continues
Only rank 0 performs I/O. All other ranks wait at the barrier. At production scale, the checkpoint write takes a few seconds (cut pool is pre-allocated and contiguous — see Binary Formats §3.3 for memory layout).
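A sketch of the write step under these assumptions; `serialize_policy`, `update_latest_symlink`, and `prune_old_checkpoints` are hypothetical helper names:

```cpp
#include <mpi.h>

struct TrainingState;                          // stand-in for the real type
void serialize_policy(const TrainingState&);   // FlatBuffers write (Binary Formats §3.2)
void update_latest_symlink();                  // repoint `latest` to the new checkpoint
void prune_old_checkpoints(int keep_last);     // enforce the retention limit

void checkpoint_step(const TrainingState& state, MPI_Comm comm) {
    int rank = 0;
    MPI_Comm_rank(comm, &rank);

    // Barrier 1: every rank has finished the iteration before rank 0
    // serializes, so the cut pool is globally consistent.
    MPI_Barrier(comm);

    if (rank == 0) {
        serialize_policy(state);
        update_latest_symlink();
        prune_old_checkpoints(/*keep_last=*/3);
    }

    // Barrier 2: no rank resumes training until the checkpoint is on disk.
    MPI_Barrier(comm);
}
```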
2.4 Checkpoint Sizing
Checkpoint size is dominated by the cut pool. At production scale (per Binary Formats §4.3):
| Component | Size at capacity |
|---|---|
| Cut pool (all stages) | 60 stages × up to 15K cuts × ~17 KB per cut — up to ~14.3 GB at maximum capacity (cut coefficients absorbed into StageLpCache SharedRegion at ~22.3 GB node-wide) |
| Solver basis | 60 stages × ~87 KB per basis ≈ ~5 MB |
| Metadata + history | < 1 MB |
Early iterations produce much smaller checkpoints: the cut pool pre-allocates slots up front, but only populated slots are serialized.
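A back-of-envelope check of the cut-pool figure, assuming the per-cut size is decimal kilobytes and reading the table's ~14.3 GB total as binary GiB (which reconciles the two numbers):

```cpp
// 60 stages × 15,000 cuts × ~17,000 bytes ≈ 15.3e9 bytes,
// i.e. ~15.3 GB decimal, or ~14.3 GiB binary.
constexpr long long kStages       = 60;
constexpr long long kCutsPerStage = 15'000;
constexpr long long kBytesPerCut  = 17'000;
constexpr long long kCutPoolBytes = kStages * kCutsPerStage * kBytesPerCut;
static_assert(kCutPoolBytes == 15'300'000'000LL, "sizing arithmetic");
```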
3. Execution Modes
3.1 Mode Definitions
Three execution modes determine how training initializes (per Binary Formats §4.2):
| Mode | Cut Loading | RNG State | Cut Pool Capacity | Result Guarantee |
|---|---|---|---|---|
| `fresh` | None | From config seed | max_iter × fwd_passes | Deterministic from seed |
| `warm_start` | All cuts from policy | Fresh from config seed | loaded + new training | Different from original |
| `resume` | All cuts + exact state | Restored from checkpoint | Same as checkpoint | Bit-for-bit identical |
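One way the mode dispatch might look; `initialize` and the three `init_*`/`restore_*` helpers are assumed names, with the resume and warm-start bodies sketched in §3.2 and §3.3 below:

```cpp
enum class ExecutionMode { Fresh, WarmStart, Resume };

struct Config;         // hypothetical: rng_seed, max_iter, fwd_passes, policy_dir
struct TrainingState;  // hypothetical: cut pool, RNG, solvers, counters

void init_fresh(TrainingState&, const Config&);   // seed RNG; reserve max_iter × fwd_passes
void init_warm_start(TrainingState&, const Config&);          // §3.3 sketch below
void restore_from_checkpoint(TrainingState&, const Config&);  // §3.2 sketch below

void initialize(ExecutionMode mode, TrainingState& state, const Config& cfg) {
    switch (mode) {
        case ExecutionMode::Fresh:     init_fresh(state, cfg);               break;
        case ExecutionMode::WarmStart: init_warm_start(state, cfg);          break;
        case ExecutionMode::Resume:    restore_from_checkpoint(state, cfg);  break;
    }
}
```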
3.2 Resume Protocol
On resume, the following state must be restored exactly:
- Cut pool: All cuts (active and inactive) with their original slot indices — LP row structure must match
- RNG state: Full state vector, not just seed — ensures next iteration generates identical scenarios
- Convergence history: Lower/upper bound traces — convergence monitoring continues correctly
- Iteration counter: Resume from `completed_iterations + 1`
- Solver basis: Per-stage basis vectors — exact warm-start avoids different pivot sequences
After restoration, the resumed run must produce bit-for-bit identical results to an uninterrupted run. See Binary Formats §4.1 and Shared Memory Aggregation §3 for the reproducibility guarantee.
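A sketch of that restoration order, reusing the §2.1 structs; `CutPool`, `Rng`, and `StageSolver` are stand-ins local to the example, not Cobre's actual interfaces:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct CutPool     { void place(uint32_t slot, const CutSlot&); };
struct Rng         { void set_state(const std::vector<uint64_t>&); };
struct StageSolver { void load_basis(const std::vector<int8_t>&); };

struct TrainingState {
    std::vector<CutPool>     cut_pools;  // one per stage
    std::vector<StageSolver> solvers;    // one per stage
    Rng rng;
    std::vector<double> lower_bounds, upper_bounds;
    uint64_t completed_iterations = 0;
};

void restore_from_checkpoint(TrainingState& state,
                             const std::vector<StageCheckpoint>& stages,
                             const PolicyMetadata& meta) {
    for (std::size_t t = 0; t < stages.size(); ++t) {
        // Cuts return to their original slots, active and inactive alike,
        // so the LP row structure matches the interrupted run exactly.
        for (const CutSlot& cut : stages[t].cuts)
            state.cut_pools[t].place(cut.slot_index, cut);
        // Per-stage basis gives an exact solver warm-start (same pivot path).
        state.solvers[t].load_basis(stages[t].basis);
    }
    // Full RNG state vector, not just the seed: the next draw equals the
    // draw the interrupted run would have produced.
    state.rng.set_state(meta.rng_state);
    // Bound traces and iteration counter continue where they left off.
    state.lower_bounds = meta.lower_bounds;
    state.upper_bounds = meta.upper_bounds;
    state.completed_iterations = meta.completed_iterations;
}
```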
3.3 Warm-Start Protocol
Warm-start loads cuts from a previous policy but starts training with fresh RNG state:
- Load cuts from policy directory into the cut pool (these become the “warm-start” cuts)
- Initialize RNG from config seed (not restored — new scenario sequence)
- Allocate additional cut pool capacity for new training cuts
- Begin training from iteration 0 with pre-populated cost-to-go approximation
Warm-start produces different results from the original training because the scenario sequence differs.
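A sketch of the warm-start initialization, continuing the hypothetical names from the earlier sketches; `load_cuts` is an assumed helper returning the number of cuts read from the policy directory:

```cpp
#include <cstddef>
#include <string>

std::size_t load_cuts(TrainingState& state, const std::string& policy_dir);

void init_warm_start(TrainingState& state, const Config& cfg) {
    // 1. Pre-populate the cost-to-go approximation with the old policy's cuts.
    const std::size_t loaded = load_cuts(state, cfg.policy_dir);

    // 2. Fresh RNG from the config seed (not restored): a new scenario
    //    sequence, which is why results differ from the original run.
    state.rng.seed(cfg.rng_seed);

    // 3. Capacity for the loaded cuts plus those new training will add.
    state.reserve_cut_capacity(loaded + cfg.max_iter * cfg.fwd_passes);

    // 4. Training begins at iteration 0, not at the old policy's counter.
    state.completed_iterations = 0;
}
```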
3.4 Compatibility Validation
Before loading cuts (resume or warm-start), the system validates that the policy is compatible with the current input data:
| Validation Check | What Is Compared | Failure Mode |
|---|---|---|
| State dimension | Policy state_dimension vs. computed from input data | Hard error |
| Stage count | Policy num_stages vs. input stage count | Hard error |
| Config hash (resume) | Policy config_hash vs. current config hash | Hard error |
| System hash (resume) | Policy system_hash vs. current input hash | Hard error |
Note: Comprehensive policy compatibility validation (block modes, hydro counts, AR orders, cascade topology, penalty configuration) is deferred to Deferred Features §C.9. The checks above are the minimum required for initial implementation.
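A minimal sketch of the validation gate; `CompatKeys` is a hypothetical bundle of the compared values, one extracted from the policy metadata and one computed from the current config and input data, and throwing is one way to realize "hard error":

```cpp
#include <cstdint>
#include <stdexcept>

struct CompatKeys {
    uint32_t state_dimension = 0, num_stages = 0;
    uint64_t config_hash = 0, system_hash = 0;
};

void validate_policy(const CompatKeys& policy, const CompatKeys& run,
                     bool is_resume) {
    if (policy.state_dimension != run.state_dimension)
        throw std::runtime_error("state dimension mismatch");   // hard error
    if (policy.num_stages != run.num_stages)
        throw std::runtime_error("stage count mismatch");       // hard error
    if (is_resume) {
        // Resume demands an identical run: same config, same input data.
        if (policy.config_hash != run.config_hash)
            throw std::runtime_error("config changed since checkpoint");
        if (policy.system_hash != run.system_hash)
            throw std::runtime_error("input data changed since checkpoint");
    }
}
```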
4. Signal Handling Integration
4.1 SLURM Preemption
SLURM sends SIGTERM when a job approaches its wall-time limit. The graceful shutdown protocol (defined in CLI and Lifecycle §7) ensures checkpoint integrity:
| Step | Action |
|---|---|
| 1 | Signal handler sets global shutdown flag |
| 2 | Training loop detects flag at next iteration boundary |
| 3 | Checkpoint written from last fully completed iteration’s policy state |
| 4 | Training manifest updated with status: interrupted |
| 5 | Process exits with code 0 (clean shutdown) |
The checkpoint is written from the last completed iteration, not the in-progress one. This avoids serializing partially-updated state (e.g., cuts from an incomplete backward pass).
4.2 Resume After Preemption
The next SLURM job invocation detects the checkpoint via the `latest` symlink and resumes:
- Load checkpoint (§3.2 resume protocol)
- Regenerate opening tree from persisted `rng_seed` (deterministic — see Scenario Generation §2.3)
- Rebuild solver workspaces with first-touch NUMA allocation (see Solver Workspaces §1.3)
- Restore solver basis per stage for warm-start
- Continue training from `completed_iterations + 1`
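Checkpoint detection itself can be as simple as resolving the symlink. A sketch assuming the policy directory layout of Binary Formats §3; the helper name is hypothetical:

```cpp
#include <filesystem>
#include <optional>

namespace fs = std::filesystem;

// Returns the checkpoint path if a resumable checkpoint exists,
// otherwise nullopt (fresh start).
std::optional<fs::path> find_resume_checkpoint(const fs::path& policy_dir) {
    const fs::path latest = policy_dir / "latest";
    if (!fs::is_symlink(latest)) return std::nullopt;
    fs::path target = fs::read_symlink(latest);
    if (target.is_relative()) target = policy_dir / target;
    if (!fs::exists(target)) return std::nullopt;  // dangling symlink
    return target;
}
```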
Cross-References
- Binary Formats §3 — FlatBuffers schema (StageCuts, StageBasis, PolicyMetadata), policy directory structure, encoding guidelines
- Binary Formats §4 — Cut pool persistence: checkpoint reproducibility, execution modes, cut pool sizing
- Output Infrastructure §1.2 — Training manifest with status values (completed, interrupted)
- Output Schemas §6.2-§6.3 — Timing output schemas (iterations.parquet, mpi_ranks.parquet)
- CLI and Lifecycle §5 — Execution phases and exit codes
- CLI and Lifecycle §7 — Signal handling and graceful shutdown protocol
- Convergence Monitoring §1 — Convergence criteria, bound computation, history that must be checkpointed
- Scenario Generation §2.3 — Opening tree is deterministically regenerable from seed (not checkpointed)
- Solver Workspaces §1.3 — NUMA-aware workspace initialization on resume
- Training Loop §3 — Iteration structure, checkpoint integration points
- Shared Memory Aggregation §3 — Reproducibility guarantees (bit-for-bit identical results)
- Memory Architecture §4 — Pre-allocated components that are rebuilt (not checkpointed)
- SLURM Deployment — Job scripts with checkpoint/resume configuration
- Deferred Features §C.9 — Comprehensive policy compatibility validation (deferred)