CLI and Lifecycle

Purpose

This spec defines the Cobre program entrypoint, command-line interface, exit codes, execution phase lifecycle, conditional execution modes, configuration resolution hierarchy, and job scheduler integration. It covers everything from process invocation through phase orchestration to shutdown.

1. Design Philosophy

Cobre adopts a single-entrypoint design optimized for HPC batch execution. The program is always invoked via MPI launchers (mpiexec, mpirun, or SLURM’s srun) and all runtime behavior is controlled through configuration files rather than command-line arguments.

Rationale:

  • HPC job scripts benefit from stable command-line interfaces
  • Configuration files provide auditability and reproducibility
  • Complex nested options are better expressed in JSON than CLI flags
  • Reduces parsing complexity in the hot initialization path

1.1 Agent Composability Principle

Cobre serves two audiences with complementary interaction models:

  • HPC batch execution – The primary mode. MPI-launched, config-driven, human-readable output. Optimized for production-scale runs on cluster schedulers.
  • Agent-composable interfaces – Secondary modes (MCP server, Python bindings, TUI) that expose the same solver through programmatic APIs. These operate in single-process mode without MPI, producing structured output with stable schemas.

The agent composability principle states that every Cobre operation must be usable by a programmatic agent – an AI coding assistant, a CI/CD pipeline, or a Python orchestration script – without requiring human interpretation of output. This is achieved through structured JSON output (--output-format json), progress streaming (--output-format json-lines), and library-mode execution (no MPI, no signal handlers, no scheduler detection). See Design Principles §6 for the full agent-readability design rules and Structured Output for the JSON schema definitions.

2. Invocation Pattern

# Standard invocation
mpiexec -n 8 cobre /path/to/case_directory

# SLURM batch execution
srun cobre /path/to/case_directory

# Validation-only mode
mpiexec -n 1 cobre /path/to/case_directory --validate-only

2.1 Subcommand Invocation Patterns

Cobre supports subcommand-style invocations. The available subcommands are:

cobre init [OPTIONS] [DIRECTORY]           # Scaffold a new case directory
cobre run <CASE_DIR> [OPTIONS]             # Load, train, simulate, and write results
cobre validate <CASE_DIR>                  # Validate a case directory
cobre report <RESULTS_DIR>                 # Query results from a completed run
cobre summary <OUTPUT_DIR>                 # Display the post-run summary
cobre schema export [--output-dir <DIR>]   # Export JSON Schema files for all input types
cobre version                              # Print version and build information

For distributed execution, the run subcommand can be launched under MPI:

mpiexec -np 4 cobre run /path/to/case_directory

MPI requirements by subcommand:

| Subcommand | MPI Required | Rationale |
| --- | --- | --- |
| init | No | Template scaffolding; no computation |
| run | Yes (for distributed execution) or No (single-process) | Training/simulation can run with or without MPI |
| validate | No | Validation is rank-0-only; single-process is sufficient |
| report | No | Reads output files; no computation |
| summary | No | Reads output files; no computation |
| schema | No | Schema export; no computation |
| version | No | Information only |

3. Command-Line Interface

| Argument | Required | Description |
| --- | --- | --- |
| CASE_DIR | Yes | Path to case directory containing config.json |
| --validate-only | No | Run Startup and Validation phases only, then exit (see §5.3) |
| --version | No | Print version and exit |
| --help | No | Print usage and exit |

3.1 Global CLI Flags

The following flags apply to all subcommands:

| Flag | Values | Default | Description |
| --- | --- | --- | --- |
| --color <WHEN> | auto, always, never | auto | Control ANSI color output on stderr. always forces color on (useful under mpiexec, which pipes stderr through a non-TTY). Also honoured via the COBRE_COLOR env var. |
| --output-format <FORMAT> | human, json, json-lines | human | Select the presentation mode for results and progress (see §8.1). |

Color resolution order (highest to lowest priority):

  1. --color <WHEN> CLI flag
  2. COBRE_COLOR environment variable (always or never; invalid values ignored)
  3. FORCE_COLOR=1 environment variable (forces color on; see https://force-color.org)
  4. Console auto-detection (whether stderr is a TTY)
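The four-level resolution order above can be sketched as a small decision function. This is an illustrative sketch only; the function name resolve_color and its signature are hypothetical, not Cobre's actual API.

```python
import os

def resolve_color(cli_flag=None, env=None, stderr_is_tty=False):
    """Resolve whether ANSI color is enabled, highest priority first."""
    env = os.environ if env is None else env
    # 1. Explicit --color always/never wins outright; "auto" falls through.
    if cli_flag in ("always", "never"):
        return cli_flag == "always"
    # 2. COBRE_COLOR: only "always"/"never" are honoured, others ignored.
    cobre = env.get("COBRE_COLOR")
    if cobre in ("always", "never"):
        return cobre == "always"
    # 3. FORCE_COLOR=1 forces color on (see https://force-color.org).
    if env.get("FORCE_COLOR") == "1":
        return True
    # 4. Auto-detection: color only when stderr is a TTY.
    return stderr_is_tty
```

Note that an explicit --color never beats FORCE_COLOR=1, since the CLI flag sits at the top of the hierarchy.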

3.2 Subcommand Arguments

Each subcommand accepts specific positional and keyword arguments:

| Subcommand | Positional | Additional Flags | Description |
| --- | --- | --- | --- |
| init | [DIRECTORY] | --template <NAME>, --list, --force | Scaffold a new case directory |
| run | CASE_DIR | --output <DIR>, --threads <N>, --quiet | Execute training and/or simulation |
| validate | CASE_DIR | (none) | Validate input files only |
| report | RESULTS_DIR | (none) | Query output data as JSON |
| summary | OUTPUT_DIR | (none) | Display post-run summary |
| schema | (subcommands) | export [--output-dir <DIR>] | Manage JSON Schema files |
| version | (none) | (none) | Print version information |

Design Decision: All execution options (skip training, skip simulation, warm-start mode, etc.) are specified in config.json, not via CLI flags. This ensures:

  1. Job scripts remain stable across configuration changes
  2. Configuration is self-documenting and version-controlled
  3. No ambiguity between CLI and config file settings

4. Exit Codes

| Code | Category | Cause |
| --- | --- | --- |
| 0 | Success | The command completed without errors |
| 1 | Validation | Case directory failed the validation pipeline – schema errors, cross-reference errors, semantic constraint violations, or policy compatibility mismatches |
| 2 | I/O | File not found, permission denied, disk full, or write failure during loading or output |
| 3 | Solver | LP-infeasible subproblem or numerical solver failure during training or simulation |
| 4 | Internal | Communication failure, unexpected channel closure, or other software/environment problem |

Codes 1–2 indicate user-correctable input problems; codes 3–4 indicate case/environment problems. Error messages are printed to stderr with an error: prefix and hint lines.
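The exit-code contract can be captured as a small enumeration. This is a hedged sketch for consumers of the CLI; the names ExitCode and is_user_correctable are illustrative, not part of Cobre itself.

```python
from enum import IntEnum

class ExitCode(IntEnum):
    SUCCESS = 0     # command completed without errors
    VALIDATION = 1  # schema, cross-reference, or semantic constraint errors
    IO = 2          # file not found, permissions, disk full, write failure
    SOLVER = 3      # LP-infeasible subproblem or numerical solver failure
    INTERNAL = 4    # communication failure or other software/environment problem

def is_user_correctable(code: ExitCode) -> bool:
    # Codes 1-2 point at input problems the user can fix directly.
    return code in (ExitCode.VALIDATION, ExitCode.IO)
```

A job script can branch on these values, e.g. retrying only internal failures while surfacing validation errors to the case author.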

5. Execution Phases Overview

5.1 Phase Diagram

flowchart TB
    S["Startup<br/><i>MPI init, CLI parse · all ranks</i>"]
    V["Validation<br/><i>load, schema, refs · rank 0</i>"]
    I["Initialization<br/><i>broadcast, alloc, solver · all ranks</i>"]
    G["Scenario Gen<br/><i>PAR fit, opening tree · all ranks</i>"]
    M{"run mode?"}
    T["Training"]
    LP["Load policy"]
    SIM["Simulation"]
    F(["Finalize"])

    S --> V --> I --> G --> M
    M -->|Full Run / Training Only| T
    M -->|Simulation Only| LP
    M -->|Validation Only| F
    T -->|simulation.enabled| SIM
    T -->|training only| F
    LP --> SIM
    SIM --> F

5.2 Phase Responsibilities

| Phase | MPI Ranks | Key Operations |
| --- | --- | --- |
| Startup | All | MPI init, scheduler detection, CLI parsing |
| Validation | Rank 0 only | Load files, schema validation, cross-references |
| Initialization | All | Broadcast, memory allocation, solver setup |
| Scenario Gen | All (parallel) | PAR fitting, noise sampling, correlation |
| Training | All (parallel) | SDDP iterations |
| Simulation | All (parallel) | Policy evaluation |
| Finalize | All | Output writing, cleanup |

5.2a Phase-Training Loop Alignment

This subsection documents the correspondence between CLI execution phases (§5.2) and the spec sections that define each phase’s operations. The purpose is to ensure that every phase boundary assumed by the Training Loop is explicitly sequenced in the lifecycle, and that implementers can trace each phase to its authoritative specification.

| Phase | Operation | Authoritative Spec | Ordering Constraint |
| --- | --- | --- | --- |
| Startup | MPI backend initialization (create_communicator) | Hybrid Parallelism §6, Step 1 | Must be the first operation; precedes all file I/O and thread creation |
| Startup | Topology detection (rank, size, intra-node split) | Hybrid Parallelism §6, Steps 2–3 | Requires MPI backend initialized |
| Startup | Scheduler detection (SLURM, PBS, local) | CLI and Lifecycle §6.3 | Reads environment variables; no MPI dependency |
| Startup | CLI argument parsing and subcommand routing | CLI and Lifecycle §3 | Determines execution mode before Validation |
| Validation | Rank-0 file loading and validation (load_case) | Input Loading Pipeline §8.1 | Rank 0 only; produces the System struct |
| Initialization | postcard broadcast of System to worker ranks | Input Loading Pipeline §6 | Requires System from Validation; all ranks receive identical validated data |
| Initialization | OpenMP configuration and NUMA allocation policy | Hybrid Parallelism §6, Steps 4–6 | Must precede workspace allocation (first-touch policy) |
| Initialization | Solver workspace allocation (thread-local, NUMA-aware) | Solver Workspaces §1.3 | Each thread creates its own workspace on its NUMA node |
| Initialization | Stage LP template construction | Solver Abstraction §11.1 | Built from resolved System; shared read-only across threads |
| Initialization | Parallel policy loading (warm-start only) | Input Loading Pipeline §7 | All ranks load in parallel after System broadcast |
| Initialization | FPHA hyperplane fitting (computed source only) | Input Loading Pipeline §8 | Requires geometry and topology from System |
| Scenario Gen | PAR model preprocessing | Scenario Generation §1 | Transforms raw PAR parameters into a contiguous cache-friendly layout |
| Scenario Gen | Opening tree generation (fixed before training) | Scenario Generation §2.3 | Generated once; remains fixed throughout training |
| Scenario Gen | Spectral decomposition of correlation matrices | Scenario Generation §2.1 | Pre-decomposed during preprocessing; used at runtime |
| Training | SDDP iteration loop (forward/backward/convergence) | Training Loop §2.1 | Requires all preceding phases complete |
| Simulation | Policy evaluation on large scenario sets | Simulation Architecture §1 | Requires trained FCF from Training (or loaded policy from Initialization) |
| Finalize | Output writing (Parquet, policy FlatBuffers, manifest) | Output Infrastructure §1 | Rank 0 writes manifest; all ranks may write partitioned output |
| Finalize | MPI finalize and process exit | Hybrid Parallelism §6 | Must be the last MPI operation |

Key invariants enforced by phase ordering:

  1. MPI-first: MPI initialization is the first operation in Startup, before any file I/O or thread creation. This is required by the MPI standard when using MPI_THREAD_MULTIPLE (Hybrid Parallelism §6, Step 1).
  2. Rank-0 validation before broadcast: The load_case function (Input Loading Pipeline §8.1) executes on rank 0 only during the Validation phase. The resulting System struct is broadcast via postcard during Initialization. This ensures all ranks receive identical, validated data.
  3. Workspaces before training: Solver workspace allocation (Solver Workspaces §1.3) and stage template construction (Solver Abstraction §11.1) complete during Initialization, before the Training phase begins. The Training Loop assumes these are ready at entry.
  4. Scenarios before training: PAR preprocessing and opening tree generation complete during Scenario Gen. The backward pass requires the fixed opening tree (Scenario Generation §2.3) from iteration 1 onward.

5.3 Conditional Execution

The execution flow supports several modes. Which phases execute depends on the mode:

| Phase | Full Run | Training Only | Simulation Only | Validation Only |
| --- | --- | --- | --- | --- |
| Startup | Yes | Yes | Yes | Yes |
| Validation | Yes | Yes | Yes | Yes |
| Initialization | Yes | Yes | Yes | — |
| Scenario Gen | Yes | Yes | Yes | — |
| Training | Yes | Yes | — | — |
| Simulation | Yes | — | Yes | — |
| Finalize | Yes | Yes | Yes | — |

Mode selection:

  • Full Run — Default. Both training and simulation execute sequentially.
  • Training Only — Produces a policy (cuts) without evaluating it. Useful for convergence analysis or when simulation will be run separately.
  • Simulation Only — Evaluates an existing policy. Requires a policy/ directory from a prior training run. Scenario generation still executes because simulation forward passes need scenario realizations.
  • Validation Only — Validates all input files and configuration, then exits immediately after the Validation phase. No memory allocation, no solver setup, no outputs. Triggered by --validate-only on the command line (overrides config settings) or by disabling both training.enabled and simulation.enabled in config.json.

Training Only and Simulation Only are controlled by the training.enabled and simulation.enabled fields in config.json. See Configuration Reference.
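The mode selection described above reduces to a small decision over the CLI flag and two config fields. This is an illustrative sketch: the function select_mode and the mode names are hypothetical, though the config keys training.enabled and simulation.enabled come from the spec.

```python
def select_mode(config: dict, validate_only_flag: bool = False) -> str:
    """Map the --validate-only flag and config.json fields to a run mode."""
    training = config.get("training", {}).get("enabled", True)
    simulation = config.get("simulation", {}).get("enabled", True)
    # --validate-only on the CLI overrides whatever config.json says;
    # disabling both training and simulation has the same effect.
    if validate_only_flag or (not training and not simulation):
        return "validation_only"
    if training and simulation:
        return "full_run"   # default: both execute sequentially
    if training:
        return "training_only"
    return "simulation_only"  # requires a policy/ dir from a prior run
```

Both fields default to enabled, so an empty config selects Full Run.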

5.4 Subcommand Phase Mapping

Each subcommand participates in a subset of the execution phases:

| Subcommand | Startup | Validation | Initialization | Scenario Gen | Training | Simulation | Finalize |
| --- | --- | --- | --- | --- | --- | --- | --- |
| init | — | — | — | — | — | — | — |
| run | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| validate | Yes* | Yes | — | — | — | — | — |
| report | — | — | — | — | — | — | — |
| summary | — | — | — | — | — | — | — |
| schema | — | — | — | — | — | — | — |
| version | — | — | — | — | — | — | — |

* Startup for validate skips MPI initialization (single-process mode). The init, report, summary, schema, and version subcommands have no lifecycle phases – they perform their operation and exit immediately.

Library mode (used by cobre-mcp and cobre-python): When invoked as a library rather than via the CLI binary, the execution lifecycle skips MPI initialization, scheduler detection, and signal handler installation. The library caller provides the case path directly and receives structured results as Rust types. See Hybrid Parallelism §1 for single-process mode initialization and Python Bindings for the Python API surface.

6. Configuration Resolution

There are two distinct categories of runtime settings, and they follow different resolution rules:

6.1 Resource Allocations (read-only from environment)

Resource allocations are determined by the MPI launcher and job scheduler. The program reads them from the environment and must not override them. These values are never sourced from config.json or compiled defaults.

| Parameter | Source | Description |
| --- | --- | --- |
| MPI rank count | MPI launcher (mpiexec -n, srun) | Number of processes |
| CPUs per task | Scheduler (SLURM_CPUS_PER_TASK) or OMP_NUM_THREADS | Threads per rank |
| Memory per node | Scheduler (SLURM_MEM_PER_NODE) | Memory budget for pool sizing |
| Job ID | Scheduler (SLURM_JOB_ID) | Recorded in output metadata |

Rationale: Allowing config.json to override resource allocations would create dangerous mismatches — e.g., the program spawning 8 threads on a node where SLURM allocated 2 CPUs, causing oversubscription. Resource allocations are a contract between the job scheduler and the process; the program observes them, it does not negotiate.

If OMP_NUM_THREADS is not set and no scheduler is detected, the program defaults to 1 thread per rank.
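The thread-count resolution can be sketched as environment probing with a fallback of 1. This is a hedged sketch: the function name resolve_threads is hypothetical, and the assumption that the scheduler variable takes precedence over OMP_NUM_THREADS is inferred from the table above, not stated explicitly.

```python
import os

def resolve_threads(env=None) -> int:
    """Threads per rank, read-only from the environment (never config.json)."""
    env = os.environ if env is None else env
    # Assumed probe order: scheduler allocation first, then OpenMP.
    for var in ("SLURM_CPUS_PER_TASK", "OMP_NUM_THREADS"):
        value = env.get(var)
        if value and value.isdigit() and int(value) > 0:
            return int(value)
    return 1  # no scheduler detected and OMP_NUM_THREADS unset
```

The deliberate absence of a config.json path here mirrors the rationale in §6.1: the program observes the scheduler's allocation, it does not negotiate.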

6.2 Algorithm Parameters (config hierarchy)

Algorithm parameters (tolerances, buffer sizes, stopping rules, etc.) are resolved in priority order:

  1. config.json — Explicit user configuration. See Configuration Reference
  2. Compiled defaults — Internal constants for any parameter not specified in config.json

The resolved configuration is recorded in the training metadata file for reproducibility (see Output Infrastructure §2).
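The two-level hierarchy amounts to a key-by-key overlay of user configuration onto compiled defaults. The sketch below is illustrative: the parameter names and default values are hypothetical placeholders, not Cobre's actual constants.

```python
# Hypothetical compiled defaults, for illustration only.
COMPILED_DEFAULTS = {
    "convergence_tolerance": 1e-4,
    "max_iterations": 200,
}

def resolve_parameters(user_config: dict) -> dict:
    """config.json values override compiled defaults, key by key."""
    resolved = dict(COMPILED_DEFAULTS)   # start from level 2 (defaults)
    resolved.update(user_config)         # level 1 (config.json) wins
    return resolved
```

The fully resolved dictionary, not just the user-supplied subset, is what gets written to the training metadata for reproducibility.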

6.3 Scheduler Detection

Cobre detects the job scheduler environment at startup to read resource allocations.

Supported schedulers:

| Scheduler | Detection | Initial Support |
| --- | --- | --- |
| SLURM | SLURM_JOB_ID environment var | Yes |
| PBS/Torque | PBS_JOBID environment var | Future |
| LSF | LSB_JOBID environment var | Future |
| Local | No scheduler env vars detected | Yes (fallback) |

If no scheduler is detected, the program falls back to local defaults: 1 thread per rank, no memory budget constraint.

Scope note: SLURM is the primary target scheduler. PBS and LSF support is planned for future releases and listed here for completeness. The detection mechanism is the same (environment variable probing), so adding new schedulers is straightforward.
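Since detection is plain environment-variable probing, adding a scheduler is one table entry. A minimal sketch, assuming a first-match-wins probe order (detect_scheduler is a hypothetical name; the env var names come from the table above):

```python
import os

# Probe order mirrors the detection table; first match wins.
SCHEDULER_PROBES = [
    ("slurm", "SLURM_JOB_ID"),
    ("pbs", "PBS_JOBID"),   # future support
    ("lsf", "LSB_JOBID"),   # future support
]

def detect_scheduler(env=None) -> str:
    env = os.environ if env is None else env
    for name, var in SCHEDULER_PROBES:
        if var in env:
            return name
    # Fallback: local defaults (1 thread/rank, no memory budget).
    return "local"
```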

7. Signal Handling and Graceful Shutdown

Cobre installs signal handlers to support graceful shutdown during long-running training and simulation phases.

| Signal | Behavior |
| --- | --- |
| SIGTERM | Graceful shutdown: set shutdown flag, checkpoint last completed iteration, exit |
| SIGINT | Same as SIGTERM – checkpoint last completed iteration, exit with code 130 |
| SIGKILL | Immediate termination (cannot be caught). Recovery via crash protocol on next startup |

Graceful shutdown protocol:

  1. The signal handler sets a global shutdown flag.
  2. The program does not wait for the current iteration to finish — iterations at production scale can take minutes, and SIGTERM is expected to result in a fast exit.
  3. A checkpoint is written from the last fully completed iteration’s policy state. This state is always consistent and ready to serialize.
  4. All MPI ranks coordinate shutdown via a barrier before finalization.
  5. The training manifest is updated with status: "partial" and the last completed iteration number.

This ensures that a SIGTERM from SLURM (e.g., approaching wall-time limit) results in a prompt shutdown without corrupting policy or output files. The next invocation can detect the partial state via the manifest and resume from the checkpoint. See Output Infrastructure §1.2 for manifest status values.
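The flag-based protocol can be sketched as follows. This is a simplified illustration in Python rather than Cobre's Rust implementation: the handler only sets a flag, and the loop checks it at iteration boundaries (the real protocol checks more finely so it need not wait out a full iteration).

```python
import signal
import threading

shutdown_requested = threading.Event()

def _on_terminate(signum, frame):
    # Signal handlers must stay minimal: set the flag and return.
    # Checkpointing happens on the normal execution path, never here.
    shutdown_requested.set()

signal.signal(signal.SIGTERM, _on_terminate)
signal.signal(signal.SIGINT, _on_terminate)

def training_loop(run_iteration, max_iters):
    last_completed = 0
    for i in range(1, max_iters + 1):
        if shutdown_requested.is_set():
            break  # exit promptly; do not start another iteration
        run_iteration(i)
        last_completed = i
    # Checkpoint is taken from this consistent, fully completed state.
    return last_completed
```

Because only fully completed iterations are checkpointed, the policy state written at shutdown is always serializable and consistent.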

8. Structured Output Protocol

The structured output protocol defines how Cobre CLI responses are formatted for programmatic consumption. This section provides an overview and cross-references; the complete protocol specification is in the Structured Output spec.

8.1 Output Format Negotiation

The --output-format global flag (§3.1) selects between three presentation modes:

| Mode | Flag Value | Transport | Use Case |
| --- | --- | --- | --- |
| Human | human (default) | Text to stdout | Interactive terminal, HPC batch log files |
| JSON | json | Single JSON document to stdout | Programmatic result consumption, CI/CD pipelines |
| JSON-lines | json-lines | Newline-delimited JSON to stdout | Real-time progress monitoring by agents and TUI |

The output format affects only presentation. It does not change computation, output files on disk, or exit codes.

8.2 Response Envelope

All JSON responses use a common envelope schema. See Structured Output SS2 for the complete JSON Schema definition:

{
  "$schema": "urn:cobre:response:v1",
  "command": "<subcommand>",
  "success": true,
  "exit_code": 0,
  "cobre_version": "2.0.0",
  "errors": [],
  "warnings": [],
  "data": { ... },
  "summary": { ... }
}
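A consumer can validate the envelope fields shown above before touching data. A minimal sketch (parse_response is a hypothetical helper name, not part of any Cobre binding):

```python
import json

def parse_response(raw: str) -> dict:
    """Parse a Cobre response envelope and surface failures uniformly."""
    envelope = json.loads(raw)
    # Guard on the versioned schema URN before trusting the shape.
    assert envelope["$schema"].startswith("urn:cobre:response:")
    if not envelope["success"]:
        # errors[] carries structured diagnostics; exit_code mirrors §4.
        raise RuntimeError(f"{envelope['command']} failed "
                           f"(exit {envelope['exit_code']}): {envelope['errors']}")
    return envelope["data"]
```

Checking success rather than the presence of data keeps the consumer robust to commands whose payload is legitimately empty.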

8.3 JSON-Lines Streaming

For long-running operations (run), the JSON-lines format emits per-iteration progress events matching the fields defined in Convergence Monitoring §2.4. The streaming protocol uses four envelope types: started, progress, terminated, and result. See Structured Output §3 for the complete streaming protocol and Convergence Monitoring §4.1 for the JSON-lines schema.
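An agent consuming the stream reads one JSON document per line and dispatches on the envelope type. The sketch below assumes, for illustration, that the envelope carries its type in a "type" field; the actual field name is defined in the Structured Output spec.

```python
import json

def iter_progress(lines):
    """Yield (event_type, event) pairs from a newline-delimited JSON stream."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between events
        event = json.loads(line)
        # Envelope types: started, progress, terminated, result.
        yield event["type"], event
```

Reading line by line lets a TUI or agent render progress in real time, without waiting for the final result envelope.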

Cross-References