CLI and Lifecycle
Purpose
This spec defines the Cobre program entrypoint, command-line interface, exit codes, execution phase lifecycle, conditional execution modes, configuration resolution hierarchy, and job scheduler integration. It covers everything from process invocation through phase orchestration to shutdown.
1. Design Philosophy
Cobre adopts a single-entrypoint design optimized for HPC batch execution. The program is always invoked via MPI launchers (mpiexec, mpirun, or SLURM’s srun) and all runtime behavior is controlled through configuration files rather than command-line arguments.
Rationale:
- HPC job scripts benefit from stable command-line interfaces
- Configuration files provide auditability and reproducibility
- Complex nested options are better expressed in JSON than CLI flags
- Reduces parsing complexity in the hot initialization path
1.1 Agent Composability Principle
Cobre serves two audiences with complementary interaction models:
- HPC batch execution – The primary mode. MPI-launched, config-driven, human-readable output. Optimized for production-scale runs on cluster schedulers.
- Agent-composable interfaces – Secondary modes (MCP server, Python bindings, TUI) that expose the same solver through programmatic APIs. These operate in single-process mode without MPI, producing structured output with stable schemas.
The agent composability principle states that every Cobre operation must be usable by a programmatic agent – an AI coding assistant, a CI/CD pipeline, or a Python orchestration script – without requiring human interpretation of output. This is achieved through structured JSON output (`--output-format json`), progress streaming (`--output-format json-lines`), and library-mode execution (no MPI, no signal handlers, no scheduler detection). See Design Principles §6 for the full agent-readability design rules and Structured Output for the JSON schema definitions.
2. Invocation Pattern
```sh
# Standard invocation
mpiexec -n 8 cobre /path/to/case_directory

# SLURM batch execution
srun cobre /path/to/case_directory

# Validation-only mode
mpiexec -n 1 cobre /path/to/case_directory --validate-only
```
2.1 Subcommand Invocation Patterns
Cobre supports subcommand-style invocations. The available subcommands are:
```sh
cobre init [OPTIONS] [DIRECTORY]          # Scaffold a new case directory
cobre run <CASE_DIR> [OPTIONS]            # Load, train, simulate, and write results
cobre validate <CASE_DIR>                 # Validate a case directory
cobre report <RESULTS_DIR>                # Query results from a completed run
cobre summary <OUTPUT_DIR>                # Display the post-run summary
cobre schema export [--output-dir <DIR>]  # Export JSON Schema files for all input types
cobre version                             # Print version and build information
```
For distributed execution, the run subcommand can be launched under MPI:
```sh
mpiexec -np 4 cobre run /path/to/case_directory
```
MPI requirements by subcommand:
| Subcommand | MPI Required | Rationale |
|---|---|---|
| `init` | No | Template scaffolding; no computation |
| `run` | Yes (for distributed execution) or No (single-process) | Training/simulation can run with or without MPI |
| `validate` | No | Validation is rank-0-only; single-process is sufficient |
| `report` | No | Reads output files; no computation |
| `summary` | No | Reads output files; no computation |
| `schema` | No | Schema export; no computation |
| `version` | No | Information only |
3. Command-Line Interface
| Argument | Required | Description |
|---|---|---|
| `CASE_DIR` | Yes | Path to case directory containing `config.json` |
| `--validate-only` | No | Run Startup and Validation phases only, then exit (see §5.3) |
| `--version` | No | Print version and exit |
| `--help` | No | Print usage and exit |
3.1 Global CLI Flags
The following flags apply to all subcommands:
| Flag | Values | Default | Description |
|---|---|---|---|
| `--color <WHEN>` | auto, always, never | auto | Control ANSI color output on stderr. `always` forces color on (useful under mpiexec, which pipes stderr through a non-TTY). Also honoured via the `COBRE_COLOR` env var. |
| `--output-format <MODE>` | human, json, json-lines | human | Select the presentation mode for command output (see §8.1). |
Color resolution order (highest to lowest priority):
1. `--color <WHEN>` CLI flag
2. `COBRE_COLOR` environment variable (`always` or `never`; invalid values ignored)
3. `FORCE_COLOR=1` environment variable (forces color on; see https://force-color.org)
4. Console auto-detection (whether stderr is a TTY)
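The resolution order above can be sketched as a small pure function. The helper name `resolve_color` and its argument names are illustrative, not part of Cobre's actual API:

```rust
/// Decide whether to emit ANSI color on stderr, following the
/// precedence list above (illustrative sketch).
fn resolve_color(
    cli_flag: Option<&str>,    // --color <WHEN>, if given
    cobre_color: Option<&str>, // COBRE_COLOR env var
    force_color: bool,         // FORCE_COLOR=1 present
    stderr_is_tty: bool,       // console auto-detection
) -> bool {
    // 1. The CLI flag wins outright ("auto" falls through).
    match cli_flag {
        Some("always") => return true,
        Some("never") => return false,
        _ => {}
    }
    // 2. COBRE_COLOR; invalid values are ignored.
    match cobre_color {
        Some("always") => return true,
        Some("never") => return false,
        _ => {}
    }
    // 3. FORCE_COLOR=1 forces color on.
    if force_color {
        return true;
    }
    // 4. Fall back to TTY auto-detection.
    stderr_is_tty
}
```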
3.2 Subcommand Arguments
Each subcommand accepts specific positional and keyword arguments:
| Subcommand | Positional | Additional Flags | Description |
|---|---|---|---|
| `init` | [DIRECTORY] | `--template <NAME>`, `--list`, `--force` | Scaffold a new case directory |
| `run` | CASE_DIR | `--output <DIR>`, `--threads <N>`, `--quiet` | Execute training and/or simulation |
| `validate` | CASE_DIR | (none) | Validate input files only |
| `report` | RESULTS_DIR | (none) | Query output data as JSON |
| `summary` | OUTPUT_DIR | (none) | Display post-run summary |
| `schema` | (subcommands) | `export [--output-dir <DIR>]` | Manage JSON Schema files |
| `version` | (none) | (none) | Print version information |
Design Decision: All execution options (skip training, skip simulation, warm-start mode, etc.) are specified in config.json, not via CLI flags. This ensures:
- Job scripts remain stable across configuration changes
- Configuration is self-documenting and version-controlled
- No ambiguity between CLI and config file settings
4. Exit Codes
| Code | Category | Cause |
|---|---|---|
| 0 | Success | The command completed without errors |
| 1 | Validation | Case directory failed the validation pipeline – schema errors, cross-reference errors, semantic constraint violations, or policy compatibility mismatches |
| 2 | I/O | File not found, permission denied, disk full, or write failure during loading or output |
| 3 | Solver | LP infeasible subproblem or numerical solver failure during training or simulation |
| 4 | Internal | Communication failure, unexpected channel closure, or other software/environment problem |
Codes 1–2 indicate user-correctable input problems; codes 3–4 indicate case or environment problems. Error messages are printed to stderr with an `error:` prefix and hint lines.
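In a job script, the category table above can drive retry or alerting logic. A minimal sketch follows; the `classify_exit` helper is hypothetical, not shipped with Cobre:

```shell
#!/bin/sh
# Map a Cobre exit code to its category (see table above).
classify_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "validation" ;;
    2) echo "io" ;;
    3) echo "solver" ;;
    4) echo "internal" ;;
    *) echo "unknown" ;;
  esac
}

# Example usage in a batch script:
#   mpiexec -n 8 cobre run /path/to/case_directory
#   category=$(classify_exit $?)
#   [ "$category" = "success" ] || echo "cobre failed: $category" >&2
```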
5. Execution Phases Overview
5.1 Phase Diagram
```mermaid
flowchart TB
S["Startup<br/><i>MPI init, CLI parse · all ranks</i>"]
V["Validation<br/><i>load, schema, refs · rank 0</i>"]
I["Initialization<br/><i>broadcast, alloc, solver · all ranks</i>"]
G["Scenario Gen<br/><i>PAR fit, opening tree · all ranks</i>"]
M{"run mode?"}
T["Training"]
LP["Load policy"]
SIM["Simulation"]
F(["Finalize"])
S --> V --> I --> G --> M
M -->|Full Run / Training Only| T
M -->|Simulation Only| LP
M -->|Validation Only| F
T -->|simulation.enabled| SIM
T -->|training only| F
LP --> SIM
SIM --> F
```
5.2 Phase Responsibilities
| Phase | MPI Ranks | Key Operations |
|---|---|---|
| Startup | All | MPI init, scheduler detection, CLI parsing |
| Validation | Rank 0 only | Load files, schema validation, cross-references |
| Initialization | All | Broadcast, memory allocation, solver setup |
| Scenario Gen | All (parallel) | PAR fitting, noise sampling, correlation |
| Training | All (parallel) | SDDP iterations |
| Simulation | All (parallel) | Policy evaluation |
| Finalize | All | Output writing, cleanup |
5.2a Phase-Training Loop Alignment
This subsection documents the correspondence between CLI execution phases (§5.2) and the spec sections that define each phase’s operations. The purpose is to ensure that every phase boundary assumed by the Training Loop is explicitly sequenced in the lifecycle, and that implementers can trace each phase to its authoritative specification.
| Phase | Operation | Authoritative Spec | Ordering Constraint |
|---|---|---|---|
| Startup | MPI backend initialization (`create_communicator`) | Hybrid Parallelism §6, Step 1 | Must be the first operation; precedes all file I/O and thread creation |
| Startup | Topology detection (rank, size, intra-node split) | Hybrid Parallelism §6, Steps 2–3 | Requires MPI backend initialized |
| Startup | Scheduler detection (SLURM, PBS, local) | CLI and Lifecycle §6.3 | Reads environment variables; no MPI dependency |
| Startup | CLI argument parsing and subcommand routing | CLI and Lifecycle §3 | Determines execution mode before Validation |
| Validation | Rank-0 file loading and validation (`load_case`) | Input Loading Pipeline §8.1 | Rank 0 only; produces the `System` struct |
| Initialization | postcard broadcast of `System` to worker ranks | Input Loading Pipeline §6 | Requires `System` from Validation; all ranks receive identical validated data |
| Initialization | OpenMP configuration and NUMA allocation policy | Hybrid Parallelism §6, Steps 4–6 | Must precede workspace allocation (first-touch policy) |
| Initialization | Solver workspace allocation (thread-local, NUMA-aware) | Solver Workspaces §1.3 | Each thread creates its own workspace on its NUMA node |
| Initialization | Stage LP template construction | Solver Abstraction §11.1 | Built from resolved `System`; shared read-only across threads |
| Initialization | Parallel policy loading (warm-start only) | Input Loading Pipeline §7 | All ranks load in parallel after `System` broadcast |
| Initialization | FPHA hyperplane fitting (computed source only) | Input Loading Pipeline §8 | Requires geometry and topology from `System` |
| Scenario Gen | PAR model preprocessing | Scenario Generation §1 | Transforms raw PAR parameters into a contiguous cache-friendly layout |
| Scenario Gen | Opening tree generation (fixed before training) | Scenario Generation §2.3 | Generated once; remains fixed throughout training |
| Scenario Gen | Spectral decomposition of correlation matrices | Scenario Generation §2.1 | Pre-decomposed during preprocessing; used at runtime |
| Training | SDDP iteration loop (forward/backward/convergence) | Training Loop §2.1 | Requires all preceding phases complete |
| Simulation | Policy evaluation on large scenario sets | Simulation Architecture §1 | Requires trained FCF from Training (or loaded policy from Initialization) |
| Finalize | Output writing (Parquet, policy FlatBuffers, manifest) | Output Infrastructure §1 | Rank 0 writes manifest; all ranks may write partitioned output |
| Finalize | MPI finalize and process exit | Hybrid Parallelism §6 | Must be the last MPI operation |
Key invariants enforced by phase ordering:
- MPI-first: MPI initialization is the first operation in Startup, before any file I/O or thread creation. This is required by the MPI standard when using `MPI_THREAD_MULTIPLE` (Hybrid Parallelism §6, Step 1).
- Rank-0 validation before broadcast: The `load_case` function (Input Loading Pipeline §8.1) executes on rank 0 only during the Validation phase. The resulting `System` struct is broadcast via postcard during Initialization. This ensures all ranks receive identical, validated data.
- Workspaces before training: Solver workspace allocation (Solver Workspaces §1.3) and stage template construction (Solver Abstraction §11.1) complete during Initialization, before the Training phase begins. The Training Loop assumes these are ready at entry.
- Scenarios before training: PAR preprocessing and opening tree generation complete during Scenario Gen. The backward pass requires the fixed opening tree (Scenario Generation §2.3) from iteration 1 onward.
5.3 Conditional Execution
The execution flow supports several modes. Which phases execute depends on the mode:
| Phase | Full Run | Training Only | Simulation Only | Validation Only |
|---|---|---|---|---|
| Startup | Yes | Yes | Yes | Yes |
| Validation | Yes | Yes | Yes | Yes |
| Initialization | Yes | Yes | Yes | — |
| Scenario Gen | Yes | Yes | Yes | — |
| Training | Yes | Yes | — | — |
| Simulation | Yes | — | Yes | — |
| Finalize | Yes | Yes | Yes | — |
Mode selection:
- Full Run — Default. Both training and simulation execute sequentially.
- Training Only — Produces a policy (cuts) without evaluating it. Useful for convergence analysis or when simulation will be run separately.
- Simulation Only — Evaluates an existing policy. Requires a `policy/` directory from a prior training run. Scenario generation still executes because simulation forward passes need scenario realizations.
- Validation Only — Validates all input files and configuration, then exits immediately after the Validation phase. No memory allocation, no solver setup, no outputs. Triggered by `--validate-only` on the command line (overrides config settings) or by disabling both `training.enabled` and `simulation.enabled` in `config.json`.

Training Only and Simulation Only are controlled by the `training.enabled` and `simulation.enabled` fields in `config.json`. See Configuration Reference.
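Mode selection reduces to a pure function of the CLI flag and the two config fields. The enum and function names below are illustrative, not Cobre's internal types:

```rust
#[derive(Debug, PartialEq)]
enum RunMode {
    Full,
    TrainingOnly,
    SimulationOnly,
    ValidationOnly,
}

/// Derive the execution mode from `--validate-only` and the
/// `training.enabled` / `simulation.enabled` config fields.
fn select_mode(validate_only: bool, training: bool, simulation: bool) -> RunMode {
    if validate_only {
        return RunMode::ValidationOnly; // CLI flag overrides config
    }
    match (training, simulation) {
        (true, true) => RunMode::Full,
        (true, false) => RunMode::TrainingOnly,
        (false, true) => RunMode::SimulationOnly,
        (false, false) => RunMode::ValidationOnly, // both disabled
    }
}
```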
5.4 Subcommand Phase Mapping
Each subcommand participates in a subset of the execution phases:
| Subcommand | Startup | Validation | Initialization | Scenario Gen | Training | Simulation | Finalize |
|---|---|---|---|---|---|---|---|
| `init` | – | – | – | – | – | – | – |
| `run` | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `validate` | Yes* | Yes | – | – | – | – | – |
| `report` | – | – | – | – | – | – | – |
| `summary` | – | – | – | – | – | – | – |
| `schema` | – | – | – | – | – | – | – |
| `version` | – | – | – | – | – | – | – |
* Startup for validate skips MPI initialization (single-process mode). The init, report, summary, schema, and version subcommands have no lifecycle phases – they perform their operation and exit immediately.
Library mode (used by cobre-mcp and cobre-python): When invoked as a library rather than via the CLI binary, the execution lifecycle skips MPI initialization, scheduler detection, and signal handler installation. The library caller provides the case path directly and receives structured results as Rust types. See Hybrid Parallelism §1 for single-process mode initialization and Python Bindings for the Python API surface.
6. Configuration Resolution
There are two distinct categories of runtime settings, and they follow different resolution rules:
6.1 Resource Allocations (read-only from environment)
Resource allocations are determined by the MPI launcher and job scheduler. The program reads them from the environment and must not override them. These values are never sourced from config.json or compiled defaults.
| Parameter | Source | Description |
|---|---|---|
| MPI rank count | MPI launcher (mpiexec -n, srun) | Number of processes |
| CPUs per task | Scheduler (SLURM_CPUS_PER_TASK) or OMP_NUM_THREADS | Threads per rank |
| Memory per node | Scheduler (SLURM_MEM_PER_NODE) | Memory budget for pool sizing |
| Job ID | Scheduler (SLURM_JOB_ID) | Recorded in output metadata |
Rationale: Allowing config.json to override resource allocations would create dangerous mismatches — e.g., the program spawning 8 threads on a node where SLURM allocated 2 CPUs, causing oversubscription. Resource allocations are a contract between the job scheduler and the process; the program observes them, it does not negotiate.
If OMP_NUM_THREADS is not set and no scheduler is detected, the program defaults to 1 thread per rank.
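The thread-count fallback can be sketched as a pure function, keeping the environment lookups at the call site. The helper name and signature are illustrative:

```rust
/// Resolve threads per rank: scheduler allocation first, then
/// OMP_NUM_THREADS, then the default of 1 (illustrative sketch).
fn threads_per_rank(slurm_cpus_per_task: Option<&str>, omp_num_threads: Option<&str>) -> usize {
    slurm_cpus_per_task
        .or(omp_num_threads)
        .and_then(|v| v.trim().parse().ok())
        .unwrap_or(1)
}

// At the call site, the arguments would come from the environment:
//   threads_per_rank(
//       std::env::var("SLURM_CPUS_PER_TASK").ok().as_deref(),
//       std::env::var("OMP_NUM_THREADS").ok().as_deref(),
//   )
```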
6.2 Algorithm Parameters (config hierarchy)
Algorithm parameters (tolerances, buffer sizes, stopping rules, etc.) are resolved in priority order:
1. `config.json` — Explicit user configuration. See Configuration Reference.
2. Compiled defaults — Internal constants for any parameter not specified in `config.json`.
The resolved configuration is recorded in the training metadata file for reproducibility (see Output Infrastructure §2).
6.3 Scheduler Detection
Cobre detects the job scheduler environment at startup to read resource allocations.
Supported schedulers:
| Scheduler | Detection | Initial Support |
|---|---|---|
| SLURM | SLURM_JOB_ID environment var | Yes |
| PBS/Torque | PBS_JOBID environment var | Future |
| LSF | LSB_JOBID environment var | Future |
| Local | No scheduler env vars detected | Yes (fallback) |
If no scheduler is detected, the program falls back to local defaults: 1 thread per rank, no memory budget constraint.
Scope note: SLURM is the primary target scheduler. PBS and LSF support is planned for future releases and listed here for completeness. The detection mechanism is the same (environment variable probing), so adding new schedulers is straightforward.
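The probing order in the table above can be sketched as follows. `detect_scheduler` takes a lookup closure so it can be exercised without touching the real environment; the names are illustrative:

```rust
#[derive(Debug, PartialEq)]
enum Scheduler {
    Slurm,
    Pbs,
    Lsf,
    Local,
}

/// Probe scheduler environment variables in the table's order
/// (illustrative sketch).
fn detect_scheduler(env: impl Fn(&str) -> Option<String>) -> Scheduler {
    if env("SLURM_JOB_ID").is_some() {
        Scheduler::Slurm
    } else if env("PBS_JOBID").is_some() {
        Scheduler::Pbs
    } else if env("LSB_JOBID").is_some() {
        Scheduler::Lsf
    } else {
        Scheduler::Local // fallback: 1 thread per rank, no memory budget
    }
}
```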
7. Signal Handling and Graceful Shutdown
Cobre installs signal handlers to support graceful shutdown during long-running training and simulation phases.
| Signal | Behavior |
|---|---|
| `SIGTERM` | Graceful shutdown: set shutdown flag, checkpoint last completed iteration, exit |
| `SIGINT` | Same as SIGTERM — checkpoint last completed iteration, exit with code 130 |
| `SIGKILL` | Immediate termination (cannot be caught). Recovery via crash protocol on next startup |
Graceful shutdown protocol:
- The signal handler sets a global shutdown flag.
- The program does not wait for the current iteration to finish — iterations at production scale can take minutes, and SIGTERM is expected to result in a fast exit.
- A checkpoint is written from the last fully completed iteration’s policy state. This state is always consistent and ready to serialize.
- All MPI ranks coordinate shutdown via a barrier before finalization.
- The training manifest is updated with `status: "partial"` and the last completed iteration number.
This ensures that a SIGTERM from SLURM (e.g., approaching wall-time limit) results in a prompt shutdown without corrupting policy or output files. The next invocation can detect the partial state via the manifest and resume from the checkpoint. See Output Infrastructure §1.2 for manifest status values.
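The flag-and-poll pattern behind the protocol can be sketched with an atomic. This is a simplification: handler installation itself (e.g. via a crate such as signal-hook) is omitted, the loop body stands in for a real SDDP iteration, and a production implementation would also abort mid-iteration rather than only checking between iterations:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Global shutdown flag set by the signal handler and polled by the
// training loop (illustrative sketch).
static SHUTDOWN: AtomicBool = AtomicBool::new(false);

/// What a SIGTERM/SIGINT handler would do: only async-signal-safe
/// work, i.e. set the flag and return.
fn request_shutdown() {
    SHUTDOWN.store(true, Ordering::SeqCst);
}

/// Returns the last fully completed iteration; on shutdown, that state
/// is what gets checkpointed and recorded as `status: "partial"`.
fn training_loop(max_iters: u32) -> u32 {
    let mut last_completed = 0;
    for i in 1..=max_iters {
        if SHUTDOWN.load(Ordering::SeqCst) {
            break; // prompt exit; do not start another iteration
        }
        // ... forward/backward pass for iteration i would run here ...
        last_completed = i;
    }
    last_completed
}
```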
8. Structured Output Protocol
The structured output protocol defines how Cobre CLI responses are formatted for programmatic consumption. This section provides an overview and cross-references; the complete protocol specification is in the Structured Output spec.
8.1 Output Format Negotiation
The `--output-format` global flag (§3.1) selects between three presentation modes:
| Mode | Flag Value | Transport | Use Case |
|---|---|---|---|
| Human | human (default) | Text to stdout | Interactive terminal, HPC batch log files |
| JSON | json | Single JSON document to stdout | Programmatic result consumption, CI/CD pipelines |
| JSON-lines | json-lines | Newline-delimited JSON to stdout | Real-time progress monitoring by agents and TUI |
The output format affects only presentation. It does not change computation, output files on disk, or exit codes.
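Parsing the flag value reduces to a three-way match; the enum and function names are illustrative, not Cobre's internal types:

```rust
#[derive(Debug, PartialEq)]
enum OutputFormat {
    Human,
    Json,
    JsonLines,
}

/// Parse `--output-format`; unknown values are rejected so the CLI
/// can report a usage error. Defaults to Human when the flag is absent.
fn parse_output_format(value: Option<&str>) -> Option<OutputFormat> {
    match value {
        None => Some(OutputFormat::Human), // default
        Some("human") => Some(OutputFormat::Human),
        Some("json") => Some(OutputFormat::Json),
        Some("json-lines") => Some(OutputFormat::JsonLines),
        Some(_) => None, // usage error
    }
}
```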
8.2 Response Envelope
All JSON responses use a common envelope schema. See Structured Output §2 for the complete JSON Schema definition:
```json
{
  "$schema": "urn:cobre:response:v1",
  "command": "<subcommand>",
  "success": true,
  "exit_code": 0,
  "cobre_version": "2.0.0",
  "errors": [],
  "warnings": [],
  "data": { ... },
  "summary": { ... }
}
```
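As a Rust type, the envelope might look like the sketch below. This is illustrative only: the normative schema lives in the Structured Output spec, a real implementation would derive serde traits, and the error/warning entries are simplified to strings here:

```rust
/// Illustrative Rust shape of the response envelope. Field names
/// mirror the JSON keys; `schema` stands in for "$schema".
struct ResponseEnvelope {
    schema: String, // "urn:cobre:response:v1"
    command: String,
    success: bool,
    exit_code: i32,
    cobre_version: String,
    errors: Vec<String>,   // simplified; real entries are structured
    warnings: Vec<String>, // simplified; real entries are structured
}

/// Consistency check: success holds exactly when the exit code is 0
/// and no errors were recorded.
fn envelope_is_consistent(e: &ResponseEnvelope) -> bool {
    e.success == (e.exit_code == 0 && e.errors.is_empty())
}
```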
8.3 JSON-Lines Streaming
For long-running operations (`run`), the JSON-lines format emits per-iteration progress events matching the fields defined in Convergence Monitoring §2.4. The streaming protocol uses four envelope types: `started`, `progress`, `terminated`, and `result`. See Structured Output §3 for the complete streaming protocol and Convergence Monitoring §4.1 for the JSON-lines schema.
Cross-References
- Configuration Reference — Complete `config.json` schema and parameter documentation
- Input Loading Pipeline — How input files are loaded after CLI parsing and config resolution
- Validation Architecture — Multi-layer validation executed during the Validation phase
- Design Principles — Format selection and declaration order invariance governing input processing
- Production Scale Reference — Typical phase durations and resource requirements at production scale
- Output Infrastructure — Manifest files, metadata, crash recovery protocol
- SLURM Deployment — Job scripts and multi-node deployment patterns
- Structured Output — Full JSON schema definitions for CLI response envelope, error schema, and JSON-lines streaming protocol
- MCP Server — MCP tool, resource, and prompt definitions for agent interaction
- Python Bindings — PyO3 API surface, zero-copy data paths, GIL management
- Terminal UI — TUI event consumption, convergence plot, interactive features