# Local Backend

## Purpose

The local backend is the single-process communication backend for Cobre. It provides a Communicator and SharedMemoryProvider implementation where all collective operations are identity operations (copy input to output) or no-ops (do nothing), formalizing the single-process mode previously described as prose exceptions in Hybrid Parallelism §1.0a. The local backend is always available – it requires no feature flag, no external dependencies, and no runtime configuration. It is the default fallback when no other backend is configured, as specified in the priority chain of Backend Registration and Selection §2.2. In single-feature builds with no communication features enabled, all collectives compile to zero instructions after inlining (Backend Registration and Selection §1.4).

## 1. Struct and Trait Implementation

### 1.1 Struct Definition

The LocalBackend struct is a zero-sized type (ZST). It holds no state because there is exactly one rank, no MPI communicator handles, no intra-node communicator, and no connection state. The ZST property means that LocalBackend occupies zero bytes at runtime and has no construction cost.

```rust
/// Single-process communication backend with identity collective semantics.
///
/// Zero-sized type with no runtime state. All collective operations are
/// identity copies or no-ops, compiling to zero instructions after inlining
/// in single-feature builds (see §1.2).
pub struct LocalBackend;
```
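Because the ZST property is load-bearing for the codegen claims in §1.2, it can be pinned with a compile-time assertion. A minimal sketch:

```rust
// Compile-time proof that LocalBackend occupies zero bytes. If the
// struct ever gains a field, this constant fails to evaluate and the
// build breaks.
const _: () = assert!(std::mem::size_of::<LocalBackend>() == 0);
```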

### 1.2 Communicator Trait Implementation

The impl Communicator for LocalBackend provides trivial implementations for all six trait methods. Each method satisfies the contracts defined in Communicator Trait §2 for the degenerate case of a single rank.

```rust
impl Communicator for LocalBackend {
    fn allgatherv<T: CommData>(
        &self,
        send: &[T],
        recv: &mut [T],
        counts: &[usize],
        displs: &[usize],
    ) -> Result<(), CommError> {
        // Validate: a single rank requires exactly one count/displ entry.
        // (`{ .. }` elides the error's diagnostic fields throughout.)
        if counts.len() != 1 { return Err(CommError::InvalidBufferSize { .. }); }
        if displs.len() != 1 { return Err(CommError::InvalidBufferSize { .. }); }
        if send.len() != counts[0] { return Err(CommError::InvalidBufferSize { .. }); }
        if recv.len() < displs[0] + counts[0] { return Err(CommError::InvalidBufferSize { .. }); }
        // Identity copy: with one rank, displs = [0] and counts = [send.len()].
        recv[displs[0]..displs[0] + counts[0]].copy_from_slice(send);
        Ok(())
    }

    fn allreduce<T: CommData>(
        &self,
        send: &[T],
        recv: &mut [T],
        _op: ReduceOp,
    ) -> Result<(), CommError> {
        // Validate: buffer lengths must match and be non-empty.
        if send.len() != recv.len() { return Err(CommError::InvalidBufferSize { .. }); }
        if send.is_empty() { return Err(CommError::InvalidBufferSize { .. }); }
        // Identity copy: Sum(x) = Min(x) = Max(x) = x for a single operand.
        recv.copy_from_slice(send);
        Ok(())
    }

    fn broadcast<T: CommData>(
        &self,
        _buf: &mut [T],
        root: usize,
    ) -> Result<(), CommError> {
        // Validate: only root 0 is valid for a single-rank communicator.
        if root >= 1 { return Err(CommError::InvalidRoot { root, size: 1 }); }
        // No-op: the single rank is both sender and receiver.
        Ok(())
    }

    fn barrier(&self) -> Result<(), CommError> {
        // No-op: nothing to synchronize.
        Ok(())
    }

    fn rank(&self) -> usize {
        0
    }

    fn size(&self) -> usize {
        1
    }
}
```

**Precondition validation:** Unlike the `ferrompi` backend, the local backend validates buffer-size preconditions and the root rank argument directly, returning the appropriate errors:

- `allgatherv` returns `CommError::InvalidBufferSize` if `counts.len() != 1`, `displs.len() != 1`, `send.len() != counts[0]`, or `recv.len() < displs[0] + counts[0]`.
- `allreduce` returns `CommError::InvalidBufferSize` if `send.len() != recv.len()` or `send.is_empty()`.
- `broadcast` returns `CommError::InvalidRoot` if `root >= 1` (the only valid root for a single-rank communicator is 0).
- `barrier` returns `Ok(())` unconditionally (no-op).

The local backend cannot produce `CommError::CollectiveFailed` (it makes no MPI calls) or `CommError::InvalidCommunicator` (there is no communicator state to invalidate).
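These error paths can be exercised without any communication infrastructure. A sketch (error fields are elided with `{ .. }`, matching the specification above; `f64` and `u32` are assumed to implement `CommData`):

```rust
let comm = LocalBackend;

// counts/displs have two entries, but the local backend has exactly one rank.
let send = [1.0f64, 2.0];
let mut recv = [0.0f64; 4];
let res = comm.allgatherv(&send, &mut recv, &[2, 2], &[0, 2]);
assert!(matches!(res, Err(CommError::InvalidBufferSize { .. })));

// Root 1 does not exist in a size-1 communicator.
let mut buf = [0u32; 8];
let res = comm.broadcast(&mut buf, 1);
assert!(matches!(res, Err(CommError::InvalidRoot { root: 1, size: 1 })));
```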

**Inlining and codegen:** Because `LocalBackend` is a ZST with trivial method bodies, the compiler inlines all trait methods at call sites when the concrete type is known. In a single-feature build (no `mpi`, `tcp`, or `shm` features), the generic parameter `C: Communicator` resolves to `LocalBackend`, and:

- `allgatherv` compiles to a single memcpy (or an equivalent element-wise loop, depending on codegen for the element type).
- `allreduce` compiles to a single memcpy.
- `broadcast` compiles to zero instructions.
- `barrier` compiles to zero instructions.
- `rank` compiles to the constant 0.
- `size` compiles to the constant 1.

This achieves the zero-cost abstraction guarantee stated in Backend Registration and Selection §1.4.
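To make the call-site picture concrete, here is an illustrative generic caller (the function and variable names are not part of the Cobre API). When `C` is `LocalBackend`, the `allreduce` call lowers to a memcpy into `out` and the `barrier` call vanishes:

```rust
/// Sums `local` element-wise across all ranks of `comm`. Monomorphized
/// with `C = LocalBackend`, the body reduces to allocating `out` and
/// copying `local` into it.
fn global_sum<C: Communicator>(comm: &C, local: &[f64]) -> Result<Vec<f64>, CommError> {
    let mut out = vec![0.0; local.len()];
    comm.allreduce(local, &mut out, ReduceOp::Sum)?;
    comm.barrier()?; // zero instructions on the local backend
    Ok(out)
}
```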

## 2. Identity Semantics

The local backend implements each Communicator method as either an identity operation (input copied to output) or a no-op (no action taken). The distinction is important: identity operations perform a memory copy that must not be elided, while no-ops can be entirely eliminated by the compiler.

### 2.1 Behavior Comparison Table

| Method | Multi-Rank Behavior | Local Backend Behavior | Classification |
|---|---|---|---|
| `allgatherv` | Gathers variable-length data from all ranks, ordered by rank index | Copies `send` to `recv[displs[0]..displs[0]+counts[0]]` | Identity copy |
| `allreduce` | Element-wise reduction (Sum, Min, Max) across all ranks | Copies `send` to `recv` (reduction of one value = identity) | Identity copy |
| `broadcast` | Sends data from the root rank to all other ranks | No-op (data is already in the buffer on the only rank) | No-op |
| `barrier` | Blocks until all ranks have entered the barrier | No-op (single rank, nothing to wait for) | No-op |
| `rank()` | Returns the calling rank's index in `0..size()` | Returns 0 | Constant |
| `size()` | Returns the total number of ranks in the communicator | Returns 1 | Constant |
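The identity rows of the table can be checked end to end with a short round trip (illustrative only; assumes a `Result`-returning context and `u64: CommData`):

```rust
let comm = LocalBackend;

// allgatherv with one rank: counts = [send.len()], displs = [0].
let send = [10u64, 20, 30];
let mut gathered = [0u64; 3];
comm.allgatherv(&send, &mut gathered, &[3], &[0])?;
assert_eq!(gathered, send);

// allreduce of a single operand: the result is the operand itself,
// for Sum, Min, and Max alike.
let mut reduced = [0u64; 3];
comm.allreduce(&send, &mut reduced, ReduceOp::Max)?;
assert_eq!(reduced, send);
```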

### 2.2 Postcondition Verification

Each method’s postconditions from Communicator Trait §2 are satisfied by the local backend:

| Postcondition | Method | Local Backend Satisfaction |
|---|---|---|
| Rank-ordered receive | `allgatherv` | Rank 0's data lands at `displs[0]` – the only rank contributes, and its data appears at position 0 |
| Identical across ranks | `allgatherv` | Trivially satisfied (only one rank) |
| Implicit barrier | `allgatherv` | Trivially satisfied (only one rank) |
| Element-wise reduction | `allreduce` | `op(x) = x` for any single operand – the identity copy is the correct reduction |
| Identical across ranks | `allreduce` | Trivially satisfied (only one rank) |
| Data from root | `broadcast` | Rank 0 is both root and sole receiver; data is already in place |
| Identical across ranks | `broadcast` | Trivially satisfied (only one rank) |
| Global synchronization (all ranks enter first) | `barrier` | Trivially satisfied (only one rank) |
| `rank()` in `0..size()` | `rank` | 0 is in `0..1` |
| Constant after initialization | `rank`/`size` | ZST with hardcoded values – always constant |

## 3. Shared Memory Fallback

The local backend implements SharedMemoryProvider using the HeapFallback strategy defined in Communicator Trait §4.4. Shared memory regions are regular heap-allocated Vec<T> instances. The semantics are fully specified in Communicator Trait §4.4; this section documents the local backend’s concrete realization of that specification.

### 3.1 SharedMemoryProvider Implementation

```rust
impl SharedMemoryProvider for LocalBackend {
    type Region<T: CommData> = HeapRegion<T>;

    fn create_shared_region<T: CommData>(
        &self,
        count: usize,
    ) -> Result<Self::Region<T>, CommError> {
        // Plain heap allocation: no shared memory segment is needed.
        Ok(HeapRegion {
            data: vec![T::default(); count],
        })
    }

    fn split_local(&self) -> Result<Box<dyn LocalCommunicator>, CommError> {
        // A single process is its own node; the intra-node communicator
        // is identical to the world communicator.
        Ok(Box::new(LocalBackend))
    }

    fn is_leader(&self) -> bool {
        // Always true: the single rank is its own leader.
        true
    }
}
```

### 3.2 HeapRegion

The HeapRegion<T> type wraps a Vec<T> and implements SharedRegion<T> with trivial semantics:

```rust
/// Shared memory region backed by a heap-allocated `Vec<T>`.
///
/// Used by backends without true intra-node shared memory (local, tcp).
/// Lifecycle phases from [Communicator Trait §4.2](./communicator-trait.md)
/// degenerate to simple `Vec` operations.
pub struct HeapRegion<T: CommData> {
    data: Vec<T>,
}

impl<T: CommData> SharedRegion<T> for HeapRegion<T> {
    fn as_slice(&self) -> &[T] {
        &self.data
    }

    fn as_mut_slice(&mut self) -> &mut [T] {
        &mut self.data
    }

    fn fence(&self) -> Result<(), CommError> {
        // No-op: all access is within a single process.
        Ok(())
    }
}
```
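Putting §3.1 and §3.2 together, the leader-writes lifecycle from Communicator Trait §4.2 degenerates to ordinary `Vec` access. A sketch (assumes `f32: CommData` and a `Result`-returning context):

```rust
let comm = LocalBackend;

// The single rank is always the leader, so it allocates and fills.
let mut region = comm.create_shared_region::<f32>(1024)?;
if comm.is_leader() {
    region.as_mut_slice().fill(1.0);
}

// fence() is a no-op: there are no other ranks to publish to.
region.fence()?;
assert!(region.as_slice().iter().all(|&x| x == 1.0));
```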

### 3.3 HeapFallback Behavior Summary

The local backend’s HeapFallback realization maps to the canonical behavior table in Communicator Trait §4.4:

| Method | HeapFallback Behavior (from §4.4) | Local Backend Realization |
|---|---|---|
| `create_shared_region` | Allocates `Vec<T>` with `count` elements (per-process copy) | `HeapRegion { data: vec![T::default(); count] }` |
| `is_leader` | Always returns `true` (every rank is its own leader) | Returns `true` (the single rank is the sole leader) |
| `split_local` | Returns a single-rank communicator (rank 0 of size 1) | Returns `Box::new(LocalBackend)` (rank 0, size 1) |
| `as_slice` | Returns `&self.vec[..]` (local heap memory) | Returns `&self.data[..]` |
| `as_mut_slice` | Returns `&mut self.vec[..]` (local heap memory) | Returns `&mut self.data[..]` |
| `fence` | No-op (returns `Ok(())`) | Returns `Ok(())` (no remote ranks to synchronize) |
| `Drop` | Drops the inner `Vec<T>` | `HeapRegion` drops its inner `data: Vec<T>` via the standard `Drop` |

**Memory footprint:** The HeapFallback replicates data per process. With a single process there is no replication overhead – the memory footprint equals the data size. The memory savings from true shared memory backends are irrelevant when there is only one process.

## 4. Use Cases

The local backend is used whenever Cobre operates without inter-process communication. It is the communication backend for all non-MPI execution modes:

### 4.1 Python Bindings

Python bindings (Python Bindings §1.2) operate in single-process mode because the GIL is incompatible with MPI launchers (Hybrid Parallelism §1.0a). The local backend is constructed directly by the binding layer before releasing the GIL and entering the Rust training function:

```rust
let comm = LocalBackend;
let result = train(&comm, &config)?;
```
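For orientation, a binding might look roughly like the following PyO3 sketch. The `train_py` wrapper, `TrainConfig`, and `TrainResult` are hypothetical names, not part of the Cobre API; `Python::allow_threads` is PyO3's standard mechanism for releasing the GIL around long-running Rust code:

```rust
use pyo3::exceptions::PyRuntimeError;
use pyo3::prelude::*;

/// Hypothetical binding entry point: release the GIL, then run the
/// training loop on the local backend. No MPI launcher is involved.
#[pyfunction]
fn train_py(py: Python<'_>, config: TrainConfig) -> PyResult<TrainResult> {
    py.allow_threads(move || {
        let comm = LocalBackend;
        train(&comm, &config).map_err(|e| PyRuntimeError::new_err(e.to_string()))
    })
}
```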

### 4.2 MCP Server

The MCP server (MCP Server §1.1) is a long-lived single-process server incompatible with MPI launcher lifecycle management. It constructs the local backend once at server startup and reuses it for all training invocations. See Backend Registration and Selection §5.4.

### 4.3 TUI (Terminal User Interface)

The TUI operates as an interactive single-process mode where the user monitors training progress in real time. The local backend provides the communication layer without requiring MPI infrastructure on the user’s workstation.

### 4.4 Testing and CI

When no communication features are enabled (the Test / CI build profile from Backend Registration and Selection §1.3), the local backend is the only available backend. It provides deterministic, dependency-free execution for unit tests, integration tests, and CI pipelines. The create_communicator() factory (Backend Registration and Selection §4.1) returns LocalBackend directly, with full monomorphization and zero dispatch overhead.

### 4.5 Always-Available Guarantee

The local backend requires no Cargo feature flag, no external libraries, no MPI runtime, no TCP coordinator, and no shared memory segments. It is unconditionally compiled into every Cobre binary, as specified in the feature flag matrix (Backend Registration and Selection §1.2). This makes it the guaranteed fallback at the bottom of the auto-detection priority chain (Backend Registration and Selection §2.2).

**No-feature build:** In a build with no communication features (`cargo test`, or `cargo build` with no `--features` flag), the local backend is the only communication backend. The factory function returns `LocalBackend` directly:

```rust
#[cfg(not(any(feature = "mpi", feature = "tcp", feature = "shm")))]
pub fn create_communicator() -> Result<LocalBackend, BackendError> {
    Ok(LocalBackend)
}
```
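A unit test of this path needs nothing beyond the crate itself (test name illustrative):

```rust
#[test]
fn no_feature_build_selects_local_backend() {
    let comm = create_communicator().expect("local backend construction cannot fail");
    assert_eq!(comm.rank(), 0);
    assert_eq!(comm.size(), 1);
}
```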

**No runtime configuration:** The local backend requires no environment variables (Backend Registration and Selection §3.1). No `COBRE_`-prefixed variables, no launcher-injected variables, no configuration of any kind.

## 5. Determinism

### 5.1 Communication Determinism

With a single rank, there is no inter-process communication and therefore no source of communication non-determinism. The reproducibility guarantees from Shared Memory Aggregation §3.1 are trivially satisfied:

| Reproducibility Requirement | Multi-Rank Mechanism | Local Backend |
|---|---|---|
| Independent of the number of MPI ranks | Deterministic seeding, contiguous block distribution, deterministic cut slots | N/A – always 1 rank |
| Independent of the number of OpenMP threads | Thread-local accumulation, fixed merge order | Same mechanism – OpenMP parallelism remains fully active |
| Independent of execution timing/ordering | Identity-based seeding, deterministic `MPI_Allgatherv` rank ordering | No communication timing; local operations are sequentially ordered |

### 5.2 Floating-Point Determinism

The floating-point non-determinism described in Communicator Trait §2.2 (the reduction tree's shape varies with rank count and MPI implementation) does not apply to the local backend. With a single rank:

- `allreduce` with `ReduceOp::Sum` performs an identity copy – no floating-point arithmetic, no reduction tree.
- The upper-bound statistics are computed from the single rank's local trajectories using the thread-local accumulation pattern (Shared Memory Aggregation §3.3), which produces deterministic results regardless of thread count.

**Guarantee:** Given the same inputs and random seed, the local backend produces bit-for-bit identical results regardless of the number of OpenMP threads, matching the determinism invariant from Shared Memory Aggregation §3.1.
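The guarantee is directly testable by running the same workload twice and comparing outputs bitwise. A sketch (the `train` signature and the `bitwise_equal` helper are illustrative; the OpenMP thread count would be varied outside the test via `OMP_NUM_THREADS`):

```rust
// Same config, same seed: outputs must agree bit for bit.
let comm = LocalBackend;
let run_a = train(&comm, &config)?;
let run_b = train(&comm, &config)?;
// Hypothetical helper comparing floating-point outputs by bit pattern
// rather than by `==`, so NaN payloads and signed zeros also match.
assert!(bitwise_equal(&run_a, &run_b));
```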

## Cross-References