TCP Backend
Status: Not Implemented. This specification describes planned behavior for a future backend. No implementation exists in the current codebase.
Purpose
The TCP backend enables multi-process SDDP execution using standard TCP sockets (std::net), eliminating the dependency on an MPI runtime. It uses a coordinator-based star topology where rank 0 runs a TCP server and ranks 1 through connect as clients. All collective operations flow through the coordinator, which is simpler than a peer-to-peer mesh and sufficient for SDDP’s collective-only communication pattern (Communication Patterns §1.2). This backend targets two deployment scenarios: container-friendly environments where MPI installation is impractical, and Python multi-process orchestration where the GIL is incompatible with MPI launchers (Python Bindings §7). The backend is gated behind the tcp Cargo feature flag as specified in Backend Registration and Selection §1.2.
1. Architecture
1.1 Coordinator Pattern
The TCP backend implements a star topology with a single coordinator (rank 0) and worker clients. The coordinator manages all collective operations by receiving data from workers, performing the required aggregation or synchronization, and distributing results back to all participants.
+-------------------+
| Coordinator |
| (rank 0) |
| |
| TCP listener |
| port: P |
+--------+----------+
|
+--------------+--------------+
| | |
+----+----+ +----+----+ +----+----+
| Worker | | Worker | | Worker |
| rank 1 | | rank 2 | | rank R-1|
| | | | | |
| TCP | | TCP | | TCP |
| client | | client | | client |
+---------+ +---------+ +---------+
Design rationale: SDDP uses only collective operations, no point-to-point messaging (Communication Patterns §1.2). A coordinator-based star topology maps naturally to collectives: every operation passes through a single hub that enforces ordering, performs reductions, and detects failures. A peer-to-peer mesh would add connection complexity without benefit.
1.2 Struct Definition
The TcpBackend struct holds the persistent TCP connections and rank metadata established during initialization. The coordinator holds accepted connections (one per worker), while each worker holds a single connection to the coordinator.
#![allow(unused)]
fn main() {
/// TCP-based communication backend for MPI-free multi-process SDDP.
///
/// Uses a coordinator pattern where rank 0 runs a TCP server and all
/// other ranks connect as clients. All collective operations are
/// mediated by the coordinator. Connections are persistent for the
/// lifetime of the training run (no per-operation connect/disconnect).
#[cfg(feature = "tcp")]
pub struct TcpBackend {
/// This process's rank index (0..size).
rank: usize,
/// Total number of ranks in the communicator.
size: usize,
/// Timeout for collective operations and connection attempts.
timeout: Duration,
/// Connection state, determined by whether this rank is the coordinator.
role: TcpRole,
}
/// The role of this rank in the TCP communicator.
#[cfg(feature = "tcp")]
enum TcpRole {
/// Rank 0: holds accepted connections from all R-1 workers,
/// indexed by worker rank (1..size).
Coordinator {
listener: TcpListener,
workers: Vec<TcpStream>, // workers[i] = connection to rank i+1
},
/// Ranks 1..R-1: holds a single connection to the coordinator.
Worker {
conn: TcpStream,
},
}
}
1.3 Communicator Trait Implementation
The TcpBackend implements Communicator by routing each collective operation through the coordinator using the protocols defined in SS3. The coordinator participates as both a server (mediating the collective) and a rank (contributing its own data).
#![allow(unused)]
fn main() {
#[cfg(feature = "tcp")]
impl Communicator for TcpBackend {
fn allgatherv<T: CommData>(
&self,
send: &[T],
recv: &mut [T],
counts: &[usize],
displs: &[usize],
) -> Result<(), CommError> {
// Protocol: SS3.1
// ...
}
fn allreduce<T: CommData>(
&self,
send: &[T],
recv: &mut [T],
op: ReduceOp,
) -> Result<(), CommError> {
// Protocol: SS3.2
// ...
}
fn broadcast<T: CommData>(
&self,
buf: &mut [T],
root: usize,
) -> Result<(), CommError> {
// Protocol: SS3.3
// ...
}
fn barrier(&self) -> Result<(), CommError> {
// Protocol: SS3.4
// ...
}
fn rank(&self) -> usize {
self.rank
}
fn size(&self) -> usize {
self.size
}
}
}
Error mapping: All TCP I/O errors are mapped to CommError variants from Communicator Trait §1.4. No new variants are introduced. See SS5 for the complete mapping.
2. Connection Establishment
2.1 Startup Sequence
Connection establishment occurs once during the Startup phase (CLI and Lifecycle §5), before any collective operation. The sequence differs for the coordinator and workers:
Coordinator (rank 0):
- Bind a
TcpListeneron0.0.0.0:COBRE_TCP_PORT(default: 29500). - Accept exactly connections, one from each worker rank.
- For each accepted connection, execute the handshake protocol (SS2.2) to verify the worker’s identity.
- Store the authenticated connections in rank order (
workers[0]= rank 1,workers[1]= rank 2, …). - The coordinator is ready for collective operations once all workers have completed the handshake.
Worker (rank , where ):
- Read
COBRE_TCP_COORDINATOR(host) andCOBRE_TCP_PORT(port) from the environment. - Connect to the coordinator at
COBRE_TCP_COORDINATOR:COBRE_TCP_PORTwith a configurable timeout (SS8). - Execute the handshake protocol (SS2.2) to register with the coordinator.
- The worker is ready for collective operations once the handshake acknowledgment is received.
2.2 Handshake Protocol
After a TCP connection is established, the worker and coordinator execute a handshake to verify rank identity and communicator size consistency.
Step 1 – Worker sends handshake:
The worker sends a Handshake message (SS4 framing) containing:
| Field | Type | Size | Description |
|---|---|---|---|
rank | u32 (BE) | 4 bytes | The worker’s rank index (COBRE_TCP_RANK) |
size | u32 (BE) | 4 bytes | The expected communicator size (COBRE_TCP_SIZE) |
Step 2 – Coordinator verifies:
The coordinator validates the received handshake:
rankmust be in the range1..size(rank 0 is the coordinator itself).rankmust not have been registered by a previous connection (no duplicate ranks).sizemust match the coordinator’s ownCOBRE_TCP_SIZEvalue.
If validation fails, the coordinator sends an error response and closes the connection.
Step 3 – Coordinator sends acknowledgment:
On successful validation, the coordinator sends an Ack message (SS4 framing) containing:
| Field | Type | Size | Description |
|---|---|---|---|
size | u32 (BE) | 4 bytes | The confirmed communicator size (echoed back) |
The worker verifies that the echoed size matches its own COBRE_TCP_SIZE. If the values disagree, the worker reports an initialization error and terminates.
2.3 Persistent Connections
All TCP connections established during the handshake phase remain open for the entire lifetime of the training run. There is no per-operation connect/disconnect cycle. Persistent connections avoid:
- TCP handshake latency (~1 ms per connection) on every collective operation.
- Port exhaustion from rapid connect/disconnect cycles during the ~100-200 iterations of SDDP training.
- TIME_WAIT socket state accumulation.
Connections are closed during the shutdown phase, after all collective operations have completed.
2.4 TCP Socket Configuration
All persistent connections are configured for low-latency communication:
| Socket Option | Value | Rationale |
|---|---|---|
TCP_NODELAY | true | Disable Nagle’s algorithm to avoid buffering small messages |
SO_KEEPALIVE | true | Detect broken connections from rank crashes |
| Read timeout | self.timeout | Configurable timeout for detecting unresponsive ranks (SS5) |
| Write timeout | self.timeout | Configurable timeout for detecting blocked writes (SS5) |
3. Collective Protocols
Each collective operation defined in Communicator Trait §1.1 is implemented as a coordinator-mediated message exchange. The coordinator handles both its own data (as rank 0) and the relay of worker data. All messages use the framing format defined in SS4.
3.1 allgatherv
Gathers variable-length data from all ranks and distributes the assembled result to all ranks, with data ordered by rank index. This is the most performance-critical operation, called twice per iteration (Communication Patterns §1.1).
Message exchange sequence:
-
Workers send data to coordinator. Each worker rank () sends an
AllgathervSendmessage containing its local data buffer (send[0..counts[r]]). Messages arrive asynchronously – workers do not wait for each other. -
Coordinator collects all contributions. The coordinator:
- Places its own data (rank 0’s
sendbuffer) into the result buffer at offsetdispls[0]. - Receives each worker’s
AllgathervSendmessage and places the received data at offsetdispls[r]in the result buffer, where is the sending worker’s rank. - Blocks until all worker contributions have been received (implicit barrier).
- Places its own data (rank 0’s
-
Coordinator sends assembled result to all. Once all contributions are collected, the coordinator sends an
AllgathervRecvmessage to each worker containing the complete assembled buffer (recv[0..sum(counts)]). -
Workers receive the assembled result. Each worker receives the
AllgathervRecvmessage and copies the contents into itsrecvbuffer.
Determinism guarantee: The coordinator assembles received data into the result buffer in rank order (rank 0 at displs[0], rank 1 at displs[1], …, rank at displs[R-1]), regardless of the order in which worker messages arrive on the network. This preserves the rank-ordered receive invariant from Communication Patterns §6.1 and ensures that all ranks construct identical cut pools and trial point sets after synchronization.
Data movement analysis: Per allgatherv call with total payload bytes across ranks:
| Direction | Data Volume | Description |
|---|---|---|
| Workers to coord | (rank 0’s data) | Each worker sends its contribution |
| Coord to workers | Coordinator sends the full assembled buffer | |
| Total on wire | Dominated by the coordinator’s fan-out to all |
3.2 allreduce
Reduces data element-wise across all ranks using the specified operation, with the result available on all ranks. The payload is minimal (32 bytes for convergence statistics), so coordinator overhead is negligible.
Message exchange sequence:
-
Workers send data to coordinator. Each worker rank sends an
AllreduceSendmessage containing its localsendbuffer and theReduceOptag (encoded in the operation tag byte, SS4). -
Coordinator reduces element-wise. The coordinator:
- Initializes the accumulator from its own
sendbuffer (rank 0’s contribution). - Receives each worker’s
AllreduceSendmessage and applies the reduction operation element-wise:ReduceOp::Sum:acc[i] += recv[i]for all .ReduceOp::Min:acc[i] = min(acc[i], recv[i])for all .ReduceOp::Max:acc[i] = max(acc[i], recv[i])for all .
- The reduction order is deterministic: rank 0 first, then ranks in ascending order.
- Initializes the accumulator from its own
-
Coordinator sends result to all. The coordinator sends an
AllreduceRecvmessage to each worker containing the reduced result buffer. -
Workers receive the result. Each worker copies the received result into its
recvbuffer.
Floating-point note: Because the coordinator reduces in a fixed sequential order (rank 0, 1, …, ), the floating-point result is deterministic for a given rank count and data. This is stricter than the MPI specification, which allows implementation-defined reduction tree shapes (Communicator Trait §2.2). The TCP backend’s sequential reduction eliminates one source of floating-point non-determinism compared to the ferrompi backend.
3.3 broadcast
Distributes data from a designated root rank to all other ranks. Used only during initialization for configuration and case data – not on the per-iteration hot path.
Message exchange sequence:
Case 1: root = 0 (coordinator is root).
- The coordinator sends a
Broadcastmessage to each worker containing thebufcontents. - Each worker receives the
Broadcastmessage and copies the contents intobuf.
Case 2: root 0 (a worker is root).
- The root worker sends a
Broadcastmessage to the coordinator containing itsbufcontents. - The coordinator receives the message and copies the contents into its own
buf. - The coordinator forwards the
Broadcastmessage to all other workers (ranks root, 0). - Each non-root worker receives the
Broadcastmessage and copies the contents intobuf.
3.4 barrier
Blocks until all ranks have entered the barrier. Used only for checkpoint synchronization – not on the per-iteration hot path.
Message exchange sequence:
-
Workers send ready signal. Each worker sends a
BarrierReadymessage (zero-byte payload) to the coordinator. -
Coordinator waits for all. The coordinator:
- Considers itself (rank 0) as implicitly ready.
- Blocks until all
BarrierReadymessages have been received.
-
Coordinator releases all. Once all ranks are accounted for, the coordinator sends a
BarrierGomessage (zero-byte payload) to each worker. -
Workers unblock. Each worker receives the
BarrierGomessage and returns from thebarrier()call.
4. Message Framing
4.1 Wire Format
All messages exchanged between the coordinator and workers use a length-prefixed binary format over TCP byte streams. The frame structure is:
+------+-----+---------+
| LEN | TAG | PAYLOAD |
+------+-----+---------+
4 B 1 B LEN - 1 B
| Field | Type | Size | Description |
|---|---|---|---|
LEN | u32 (BE) | 4 bytes | Length of TAG + PAYLOAD in bytes (big-endian unsigned 32-bit) |
TAG | u8 | 1 byte | Operation type identifier |
PAYLOAD | raw bytes | LEN - 1 bytes | Operation-specific data |
The LEN field encodes the number of bytes following the length header (i.e., TAG + PAYLOAD), not the total frame size. The maximum payload size is bytes ( 4 GB), which is far larger than any SDDP collective payload (~587 MB total across all operations per iteration).
4.2 Operation Tags
| Tag Value | Name | Direction | Payload Contents |
|---|---|---|---|
0x01 | AllgathervSend | Worker to coord | Raw bytes of the worker’s send buffer |
0x02 | AllgathervRecv | Coord to worker | Raw bytes of the assembled recv buffer |
0x03 | AllreduceSend | Worker to coord | 1-byte ReduceOp tag + raw bytes of send buffer |
0x04 | AllreduceRecv | Coord to worker | Raw bytes of the reduced recv buffer |
0x05 | Broadcast | Bidirectional | Raw bytes of buf |
0x06 | BarrierReady | Worker to coord | Empty (zero bytes) |
0x07 | BarrierGo | Coord to worker | Empty (zero bytes) |
0x08 | Handshake | Worker to coord | rank (u32 BE) + size (u32 BE) |
0x09 | Ack | Coord to worker | size (u32 BE) |
0x0A | Shutdown | Coord to worker | Empty (zero bytes) |
4.3 Byte Order and Payload Encoding
All multi-byte integers in the framing layer (LEN, rank, size in handshake) use big-endian byte order for unambiguous cross-platform interpretation.
Payload data (the CommData buffers transmitted during collectives) is transmitted as raw bytes in the sender’s native byte order. Because SDDP deployment targets use homogeneous architectures (all ranks on the same platform), byte order conversion for payload data is not required. The CommData trait bound (Copy + Send + Sync + 'static) from Communicator Trait §1.2 ensures that all payload types have a well-defined memory layout suitable for bitwise transmission.
4.4 Reading and Writing Frames
Frame I/O uses Rust’s std::io::Read and std::io::Write traits on TcpStream, with read_exact and write_all to handle partial reads/writes:
#![allow(unused)]
fn main() {
/// Read a complete framed message, returning the operation tag and payload.
fn read_frame(stream: &mut TcpStream) -> Result<(u8, Vec<u8>), CommError> {
let mut len_buf = [0u8; 4];
stream.read_exact(&mut len_buf)?;
let body_len = u32::from_be_bytes(len_buf) as usize;
let mut body = vec![0u8; body_len];
stream.read_exact(&mut body)?;
let tag = body[0];
let payload = body[1..].to_vec();
Ok((tag, payload))
}
/// Write a complete framed message to the stream.
fn write_frame(
stream: &mut TcpStream,
tag: u8,
payload: &[u8],
) -> Result<(), CommError> {
let body_len = (1 + payload.len()) as u32;
stream.write_all(&body_len.to_be_bytes())?;
stream.write_all(&[tag])?;
stream.write_all(payload)?;
stream.flush()?;
Ok(())
}
}
5. Failure Handling
5.1 Timeout-Based Detection
All TCP read and write operations are subject to a configurable timeout. When a read or write does not complete within the timeout period, the operation fails with CommError::CollectiveFailed. The timeout covers the following failure modes:
| Failure Mode | Detection Mechanism | Timeout Applies To |
|---|---|---|
| Rank process crash | TCP connection closed (read returns 0 bytes) or timeout | All collective operations |
| Rank process hang | Read timeout expires without receiving expected data | All collective operations |
| Network partition | Read/write timeout expires | All collective operations |
| Coordinator crash | Workers detect closed connection or timeout | All collective operations |
Default timeout: 60 seconds. This is configurable via the COBRE_TCP_TIMEOUT_SECS environment variable (SS8). The default is chosen to accommodate:
- Slow LP solves at production scale (~25 ms per solve, ~192 solves per stage, ~60 stages = ~288 seconds of computation per iteration) where ranks may take significantly different amounts of time to reach the next collective.
- Network transients that may delay message delivery by seconds.
The 60-second default exceeds the expected per-stage wall-clock time (~5 seconds at production scale) by a factor of 12, providing a generous margin while still detecting genuine failures within a reasonable timeframe.
5.2 Graceful Shutdown
When the training loop completes (convergence or iteration limit), the coordinator initiates a graceful shutdown:
- The coordinator sends a
Shutdownmessage (tag0x0A, zero-byte payload) to each worker. - Each worker receives the
Shutdownmessage and closes its connection. - The coordinator closes all worker connections and the listener socket.
If a worker terminates before receiving the Shutdown message (e.g., due to an error in the training loop), the coordinator detects the closed connection during the next collective operation and reports the failure.
5.3 Error Mapping
TCP backend failures are mapped to existing CommError variants from Communicator Trait §1.4. No new variants are introduced.
| TCP Failure | CommError Variant |
|---|---|
std::io::ErrorKind::TimedOut | CommError::CollectiveFailed { operation, mpi_error_code: 0, message } |
std::io::ErrorKind::ConnectionReset | CommError::CollectiveFailed { operation, mpi_error_code: 0, message } |
std::io::ErrorKind::UnexpectedEof (rank crash) | CommError::CollectiveFailed { operation, mpi_error_code: 0, message } |
std::io::ErrorKind::ConnectionRefused | CommError::CollectiveFailed { operation, mpi_error_code: 0, message } |
| Unexpected operation tag | CommError::CollectiveFailed { operation, mpi_error_code: 0, message } |
| Length mismatch (frame size vs. expected) | CommError::InvalidBufferSize { operation, expected, actual } |
| Handshake failure (wrong rank/size) | Reported as BackendError::InitializationFailed (initialization-time error) |
The mpi_error_code field is set to 0 for all TCP errors because there is no MPI error code to report. The message field contains the specific TCP error description for diagnostic logging.
6. Performance Analysis
6.1 Communication Volume
The per-iteration communication volume at worst-case scale ( ranks, stages — hypothetical maximum per DEC-009; production baseline is , forward passes, ) is approximately 587 MB (Communication Patterns §3.1):
| Operation | Per-Iteration Volume | Frequency |
|---|---|---|
Trial point allgatherv | ~206 MB | Once per iteration |
Cut allgatherv | ~381 MB | Once per stage (x 119) |
Convergence allreduce | 32 B | Once per iteration |
| Total | ~587 MB |
6.2 Coordinator Throughput Analysis
In the star topology, the coordinator is the bottleneck: it must receive data from all workers and then send the assembled result to all workers. The total data through the coordinator per iteration is:
Allgatherv (cuts + trial points):
where is the average per-rank contribution and is the assembled result size.
For trial point allgatherv ( MB, ):
- Receive: MB
- Send: MB
- Total coordinator I/O: MB
For cut allgatherv (119 stages at worst-case ; MB per stage):
- Per stage: receive MB + send MB = 51 MB
- All stages: MB
- Total coordinator I/O: MB
Combined coordinator I/O per iteration: MB.
6.3 Wire-Speed Analysis
On a 100 Gbps Ethernet link (12.5 GB/s per direction, full-duplex):
| Metric | Value |
|---|---|
| Coordinator I/O per iteration | ~9,352 MB |
| Wire-speed transfer time | s |
| Protocol overhead factor (est.) | ~50% |
| Estimated wall-clock transfer time | ~112 ms |
| Typical iteration wall-clock time | ~5 s |
| Communication overhead fraction | ~2.2% |
The 50% protocol overhead factor accounts for TCP/IP headers (~40 bytes per packet on a ~64 KB segment), kernel-to-user space copies, system call overhead, and Nagle/delayed-ACK interaction (mitigated by TCP_NODELAY). This factor is consistent with the estimate in Communication Patterns §3.2.
Note on full-duplex advantage: The coordinator’s receive phase (workers-to-coordinator) and send phase (coordinator-to-workers) can overlap at the TCP stack level; the serialized analysis above is conservative.
6.4 Comparison with MPI
| Metric | Ferrompi (InfiniBand HDR) | TCP (100 Gbps Ethernet) | Ratio |
|---|---|---|---|
| Link bandwidth | 25 GB/s | 12.5 GB/s | 2x |
| Protocol overhead | ~50% | ~50% | comparable |
| Topology | Tree/ring (MPI-optimized) | Star (coordinator) | MPI advantage |
| Est. overhead per iteration | ~46 ms | ~112 ms | ~2.4x |
| Fraction of 5s iteration | ~0.9% | ~2.2% | both viable |
The TCP backend is approximately 2-3x slower than MPI for pure communication, primarily due to the star topology requiring the coordinator to relay all data and the lower bandwidth of Ethernet versus InfiniBand. However, at < 200 ms/iteration overhead, which is < 5% of a typical 5-second iteration, the TCP backend is viable for production-scale SDDP. The LP solver dominates iteration time, and the communication overhead is well within acceptable bounds.
7. SharedMemoryProvider
The TCP backend implements SharedMemoryProvider using the HeapFallback strategy defined in Communicator Trait §4.4. TCP-connected processes run in separate address spaces (potentially on different physical nodes), so true intra-node shared memory is not available. Each rank allocates its own Vec<T> for shared memory regions.
7.1 Implementation
The TCP backend reuses the HeapRegion<T> type defined by the local backend, as both backends share the same HeapFallback semantics:
#![allow(unused)]
fn main() {
#[cfg(feature = "tcp")]
impl SharedMemoryProvider for TcpBackend {
type Region<T: CommData> = HeapRegion<T>;
fn create_shared_region<T: CommData>(
&self,
count: usize,
) -> Result<Self::Region<T>, CommError> {
Ok(HeapRegion {
data: vec![T::default(); count],
})
}
fn split_local(&self) -> Result<Box<dyn Communicator>, CommError> {
// Each TCP rank is its own "node" for shared memory purposes.
Ok(Box::new(LocalBackend))
}
fn is_leader(&self) -> bool {
// Always true: HeapFallback requires every rank to populate
// its own copy (see [Communicator Trait §4.4]).
true
}
}
}
The HeapFallback behavior summary for the TCP backend matches the canonical table in Communicator Trait §4.4:
| Method | HeapFallback Behavior |
|---|---|
create_shared_region | Allocates Vec<T> with count elements (per-process copy) |
is_leader | Always returns true (every rank is its own leader) |
split_local | Returns a single-rank communicator (rank 0 of size 1) |
as_slice | Returns &self.data[..] (local heap memory) |
as_mut_slice | Returns &mut self.data[..] (local heap memory) |
fence | No-op (returns Ok(())) |
Drop | Drops inner Vec<T> |
Memory footprint: With HeapFallback, every rank replicates shared data. For ranks with ~20.8 MB of shareable data (opening tree + case data), the total footprint is ~83.2 MB versus ~20.8 MB with true shared memory – a 4x overhead. This is acceptable for the TCP backend’s target deployment scenarios (containers, Python orchestration), where memory is less constrained than on shared-node HPC clusters. See Communicator Trait §4.4 for the full comparison.
8. Configuration
8.1 Environment Variables
The TCP backend is configured via environment variables following the COBRE_ prefix convention from Backend Registration and Selection §3.2.
| Variable | Required | Default | Description |
|---|---|---|---|
COBRE_TCP_COORDINATOR | Yes | (none) | Hostname or IP address of the coordinator (rank 0). Workers connect to this address during initialization. |
COBRE_TCP_PORT | No | 29500 | TCP port on which the coordinator listens. All ranks must use the same port value. |
COBRE_TCP_RANK | Yes | (none) | Rank index of this process (0..COBRE_TCP_SIZE). Rank 0 is the coordinator; all others are workers. |
COBRE_TCP_SIZE | Yes | (none) | Total number of ranks in the communicator. Must be consistent across all processes. |
COBRE_TCP_TIMEOUT_SECS | No | 60 | Timeout in seconds for collective operations and connection attempts. See SS5.1 for failure detection. |
Coordinator address resolution: The COBRE_TCP_COORDINATOR value is resolved via standard DNS or IP address parsing. For container deployments, this is typically the coordinator container’s hostname or service name (e.g., cobre-rank0, 10.0.0.1). The coordinator process itself does not read COBRE_TCP_COORDINATOR – it binds to 0.0.0.0:COBRE_TCP_PORT.
Rank assignment: Ranks are assigned externally by the orchestrator (Python script, container scheduler, or shell script). Each process must be started with a unique COBRE_TCP_RANK in 0..COBRE_TCP_SIZE. The coordinator must have COBRE_TCP_RANK=0.
8.2 Example Invocation
# Start coordinator (rank 0) on host cobre-coord
COBRE_COMM_BACKEND=tcp \
COBRE_TCP_RANK=0 \
COBRE_TCP_SIZE=4 \
COBRE_TCP_PORT=29500 \
cobre run /path/to/case &
# Start workers on other hosts
COBRE_COMM_BACKEND=tcp \
COBRE_TCP_COORDINATOR=cobre-coord \
COBRE_TCP_PORT=29500 \
COBRE_TCP_RANK=1 \
COBRE_TCP_SIZE=4 \
cobre run /path/to/case &
COBRE_COMM_BACKEND=tcp \
COBRE_TCP_COORDINATOR=cobre-coord \
COBRE_TCP_PORT=29500 \
COBRE_TCP_RANK=2 \
COBRE_TCP_SIZE=4 \
cobre run /path/to/case &
COBRE_COMM_BACKEND=tcp \
COBRE_TCP_COORDINATOR=cobre-coord \
COBRE_TCP_PORT=29500 \
COBRE_TCP_RANK=3 \
COBRE_TCP_SIZE=4 \
cobre run /path/to/case &
8.3 Feature Gating
The TCP backend is gated behind the tcp Cargo feature flag:
[features]
default = []
tcp = [] # No external dependencies -- uses std::net only
[dependencies]
# No tcp-specific dependencies
When tcp is not enabled, the TcpBackend type does not exist. The create_communicator() factory function (Backend Registration and Selection §4) does not include a TCP branch, and COBRE_COMM_BACKEND=tcp produces an error listing available backends.
When tcp is enabled, the TCP backend is available for selection via the factory function or COBRE_COMM_BACKEND=tcp. In the auto-detection priority chain (Backend Registration and Selection §2.2), the TCP backend is selected when COBRE_TCP_COORDINATOR is set and the mpi backend was not selected first.
8.4 Build Profile Integration
The TCP backend is included in the following build profiles from Backend Registration and Selection §1.3:
| Build Profile | Includes tcp? | Rationale |
|---|---|---|
| CLI / HPC | No | MPI is the production communication layer for cluster deployment |
| Python wheel | Yes | TCP enables multi-process Python orchestration without MPI |
| Test / CI | No | Only local backend; no external dependencies |
| Development | Yes | All backends compiled for testing |
Cross-References
- Communicator Trait §1 –
Communicatortrait definition,CommData,ReduceOp,CommErrortype definitions implemented by this backend - Communicator Trait §2 – Method contracts (preconditions, postconditions, determinism guarantees) that this backend preserves via the coordinator-mediated protocols in SS3
- Communicator Trait §3 – Generic parameterization pattern (
train<C: Communicator>) enabling compile-time monomorphization of this backend - Communicator Trait §4 –
SharedMemoryProvidertrait andHeapFallbacksemantics (§4.4) used by this backend for shared memory regions - Communicator Trait §4.4 –
HeapFallbackcanonical behavior table: per-processVec<T>,is_leaderalwaystrue,fenceis no-op - Communication Patterns §1.1 – Three collective operations per iteration (allgatherv x2, allreduce x1), plus initialization-only broadcast and barrier
- Communication Patterns §1.2 – No point-to-point messaging in SDDP; collectives only – the design constraint that makes the coordinator pattern sufficient
- Communication Patterns §2 – Data payloads for trial points (§2.1), cuts (§2.2), and convergence statistics (§2.3) transmitted through the TCP backend
- Communication Patterns §3 – Communication volume analysis (~587 MB/iteration) and bandwidth requirements on 100 Gbps Ethernet
- Communication Patterns §6.1 – Deterministic rank-ordered receive invariant preserved by the coordinator’s rank-ordered assembly in SS3.1
- Communication Patterns §6.2 – Floating-point reduction tolerance; the TCP backend’s sequential reduction is stricter than MPI’s implementation-defined tree
- Backend Registration and Selection §1.2 – Feature flag matrix;
tcpfeature gates this backend with no external dependencies - Backend Registration and Selection §2.2 – Auto-detection priority chain; TCP is selected when
COBRE_TCP_COORDINATORis set and MPI was not detected - Backend Registration and Selection §3.1 – Per-backend environment variable table for the TCP backend
- Backend Registration and Selection §4 – Factory pattern returning
TcpBackendin theCommBackendenum for multi-feature builds - Backend Registration and Selection §5.1 – Programmatic backend selection via
TcpConfigfor library-mode callers - Solver Abstraction §10 – Compile-time selection pattern via generic parameters and Cargo feature flags; the architectural precedent for this backend’s feature-gated design
- Python Bindings §7 – TCP backend as the enabler for distributed Python SDDP execution without MPI
- Hybrid Parallelism §1.0a – Single-process mode; the TCP backend provides a multi-process alternative for environments where MPI is unavailable