Skip to content

Upper Bound Evaluation

This spec defines the upper bound evaluation mechanism in Cobre via inner approximation (SIDP): the vertex-based value function approximation, Lipschitz interpolation, the linearized upper bound LP, gap computation, and configuration. It complements the outer approximation (cuts) described in SDDP Algorithm by providing a convergence certificate through deterministic upper bounds.

For notation conventions (index sets, parameters, decision variables, dual variables), see Notation Conventions.

Standard SDDP provides only a lower bound (outer approximation) through cuts. Convergence verification requires an upper bound (inner approximation). This is especially important for:

  1. Risk-averse problems: CVaR objectives cannot be reliably estimated via Monte Carlo simulation of the policy
  2. Convergence certificates: Gap =zˉz= \bar{z} - \underline{z} provides a true optimality measure
  3. Conservative policies: Inner approximation gives “at most YY” guarantees

The inner approximation Vˉt(x)\bar{V}_t(x) is constructed from vertices (visited state-value pairs):

Vt={(x(1),vˉ(1)),(x(2),vˉ(2)),,(x(n),vˉ(n))}\mathcal{V}_t = \{(x^{(1)}, \bar{v}^{(1)}), (x^{(2)}, \bar{v}^{(2)}), \ldots, (x^{(n)}, \bar{v}^{(n)})\}

where each vertex stores:

  • x(i)x^{(i)}: State vector visited during forward passes
  • vˉ(i)\bar{v}^{(i)}: Upper bound on expected cost-to-go from that state (computed recursively)

For a new state xx not in Vt\mathcal{V}_t, the upper bound is computed via Lipschitz interpolation:

Vˉt(x)=min(x(i),vˉ(i))Vt{vˉ(i)+Ltxx(i)1}\bar{V}_t(x) = \min_{(x^{(i)}, \bar{v}^{(i)}) \in \mathcal{V}_t} \left\{ \bar{v}^{(i)} + L_t \cdot \|x - x^{(i)}\|_1 \right\}

where LtL_t is the Lipschitz constant for stage tt.

Interpretation: The upper bound at xx is the minimum over all vertices of “vertex value plus distance penalty.” This forms a concave piecewise-linear function — the inner (concave) counterpart to the outer (convex) cut approximation.

The Lipschitz constant bounds the maximum rate of change of the value function with respect to the state. For SDDP with penalty-based feasibility (relatively complete recourse):

Backward accumulation:

LT=cmaxpenaltyL_T = c_{max}^{penalty} Lt=dtt+1Lt+1+cmaxpenalty,tL_t = d_{t \to t+1} \cdot L_{t+1} + c_{max}^{penalty,t}

where:

  • cmaxpenalty,tc_{max}^{penalty,t} is the maximum penalty coefficient at stage tt (e.g., deficit penalty in $/MWh)
  • dtt+1d_{t \to t+1} is the discount factor for transition tt+1t \to t+1 (see Discount Rate)

Example: With deficit penalty 10001000 $/MWh over 5 stages, no discounting:

Stage ttLipschitz LtL_t
51,000
42,000
33,000
24,000
15,000

During the upper bound evaluation pass (a backward pass variant):

At terminal stage TT:

vˉ(i)=EωT[cT(x(i),ωT)](expected immediate cost only)\bar{v}^{(i)} = \mathbb{E}_{\omega_T}\left[c_T(x^{(i)}, \omega_T)\right] \quad \text{(expected immediate cost only)}

At stage t<Tt < T:

For each vertex (x(i),)Vt(x^{(i)}, \cdot) \in \mathcal{V}_t:

  1. For each scenario ωt\omega_t, solve the stage subproblem with incoming state x(i)x^{(i)} and realization ωt\omega_t
  2. Obtain the optimal next-stage state xt+1(ωt)x_{t+1}^*(\omega_t)
  3. Evaluate the inner approximation at the next stage: θˉ(ωt)=Vˉt+1(xt+1(ωt))\bar{\theta}(\omega_t) = \bar{V}_{t+1}(x_{t+1}^*(\omega_t))
  4. Set vertex value as the expected discounted cost-to-go:
vˉ(i)=Eωt[ct(x(i),ωt)+dtt+1θˉ(ωt)]\bar{v}^{(i)} = \mathbb{E}_{\omega_t}\left[c_t(x^{(i)}, \omega_t) + d_{t \to t+1} \cdot \bar{\theta}(\omega_t)\right]

For policy evaluation with inner approximation, the stage LP replaces the outer approximation (cut constraints on θ\theta) with inner approximation (vertex constraints on θˉ\bar{\theta}).

Standard LP (outer approximation — lower bound):

min  ctxt+dtt+1θ\min \; c_t^\top x_t + d_{t \to t+1} \cdot \theta s.t. θαk+πkxtk (cuts)\text{s.t. } \theta \geq \alpha_k + \pi_k^\top x_t \quad \forall k \text{ (cuts)}

Inner approximation LP (upper bound):

min  ctxt+dtt+1θˉ\min \; c_t^\top x_t + d_{t \to t+1} \cdot \bar{\theta} s.t. θˉvˉ(i)+Ltjxt,jxj(i)iVt (vertices)\text{s.t. } \bar{\theta} \leq \bar{v}^{(i)} + L_t \sum_j |x_{t,j} - x_j^{(i)}| \quad \forall i \in \mathcal{V}_t \text{ (vertices)}

The absolute value xjxj(i)|x_j - x_j^{(i)}| in the vertex constraints is linearized using standard splitting:

xjxj(i)=uj(i)++uj(i)|x_j - x_j^{(i)}| = u_j^{(i)+} + u_j^{(i)-} xjxj(i)=uj(i)+uj(i)x_j - x_j^{(i)} = u_j^{(i)+} - u_j^{(i)-} uj(i)+,uj(i)0u_j^{(i)+}, u_j^{(i)-} \geq 0

Additional variables (per vertex ii, per state component jj):

VariableDomainDescription
uj(i)+u_j^{(i)+}0\geq 0Positive deviation from vertex ii in dimension jj
uj(i)u_j^{(i)-}0\geq 0Negative deviation from vertex ii in dimension jj
θˉ\bar{\theta}freeUpper bound on future cost

Constraints (for each vertex iVti \in \mathcal{V}_t):

θˉvˉ(i)+Ltj(uj(i)++uj(i))\bar{\theta} \leq \bar{v}^{(i)} + L_t \sum_j (u_j^{(i)+} + u_j^{(i)-}) xjxj(i)=uj(i)+uj(i)jx_j - x_j^{(i)} = u_j^{(i)+} - u_j^{(i)-} \quad \forall j

At each iteration kk where the upper bound is evaluated:

Lower bound (from cuts at stage 1):

zk=c1(x^1)+d12V2(x^1)\underline{z}^k = c_1(\hat{x}_1) + d_{1 \to 2} \cdot \underline{V}_2(\hat{x}_1)

Upper bound (from vertices at stage 1):

zˉk=c1(x^1)+d12Vˉ2(x^1)\bar{z}^k = c_1(\hat{x}_1) + d_{1 \to 2} \cdot \bar{V}_2(\hat{x}_1)

Relative gap:

gapk=zˉkzkmax(1,zˉk)×100%\text{gap}^k = \frac{\bar{z}^k - \underline{z}^k}{\max(1, |\bar{z}^k|)} \times 100\%

Convergence: As kk \to \infty, gapk0\text{gap}^k \to 0 for convex problems with finitely many scenarios.

For stopping rules that use the gap, see Stopping Rules.

AspectImpact
Vertices per stageTypically O(iterations×forward_passes)\mathcal{O}(\text{iterations} \times \text{forward\_passes})
LP size increase2×nstate×nvertices2 \times n_{state} \times n_{vertices} additional variables
Evaluation frequencyTrade-off between gap accuracy and runtime
MemoryVertices stored separately from cuts

Recommendation: Enable upper bound evaluation every 5-10 iterations after an initial burn-in period (10+ iterations) for convergence monitoring without excessive overhead.

Cobre can also estimate an upper bound on expected total cost by running the trained policy on scenarios drawn independently of the training tree. This procedure produces a statistical upper-bound estimate; for the deterministic upper bound applicable to risk-averse policies, see sections 2–9 of this chapter.

The core methodological guarantee is that the noise used for the simulation forward pass is drawn independently of the noise used during training. Training forward passes sample from the opening tree (see Scenario Generation) to build the scenario tree that drives cut generation; any cost computed by re-running the policy on those same training scenarios would produce a biased estimator — the cuts were shaped to be tight at those states. The out-of-sample simulation avoids this by drawing a fresh set of NN independent noise realizations from the same opening tree via a separate seed. Because cuts have no dependence on these independent draws, the resulting cost sample is an unbiased estimator of the true expected cost under the current policy.

The simulation executes a complete forward pass for each of the NN independent replications, recording the total discounted cost CiC_i for scenario ii:

Ci=t=1Td1tct(i)C_i = \sum_{t=1}^{T} d_{1 \to t} \cdot c_t^{(i)}

where ct(i)c_t^{(i)} is the immediate cost at stage tt of replication ii, and d1td_{1 \to t} is the cumulative discount factor from stage 1 to stage tt (see Discount Rate).

The sample mean is the Monte Carlo estimator of expected total cost:

Cˉ=1Ni=1NCi\bar{C} = \frac{1}{N} \sum_{i=1}^{N} C_i

This estimator is unbiased under independent draws: E[Cˉ]=E[C]\mathbb{E}[\bar{C}] = \mathbb{E}[C]. The sample standard deviation is:

σC=1N1i=1N(CiCˉ)2\sigma_C = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (C_i - \bar{C})^2}

In addition to mean and standard deviation, the simulation computes the Conditional Value-at-Risk at a specified confidence level α\alpha from the sorted scenario costs. For the CVaR estimator and its interpretation under risk-averse policies, see Risk Measures.

Under the normal approximation, the 95% confidence interval for E[C]\mathbb{E}[C] has half-width:

Δ95=1.96σCN\Delta_{95} = 1.96 \cdot \frac{\sigma_C}{\sqrt{N}}

The approximation is reliable when N20N \geq 20 (central-limit-theorem regime). The reported interval is [CˉΔ95,  Cˉ+Δ95][\bar{C} - \Delta_{95},\; \bar{C} + \Delta_{95}].

Trade-off: every doubling of NN narrows the confidence interval by a factor of 2\sqrt{2}, but costs proportionally more LP solves — N×TN \times T stage subproblems per simulation check. The practical baseline of N=100N = 100 replications gives a half-width of roughly σC/10\sigma_C / 10, which is sufficient to distinguish a converged policy from one still improving.

The sole knob governing the out-of-sample procedure is the number of replications NN (the replications parameter of the simulation stopping rule). It controls the statistical resolution of the estimator and the compute cost of each stopping check simultaneously.

NNConfidence interval half-widthLP solves per check (T=120T = 120)
20σC/4.5\approx \sigma_C / 4.52,400
100σC/10\approx \sigma_C / 1012,000
500σC/22\approx \sigma_C / 2260,000

The baseline N=100N = 100 is practical for production runs: at the reference scale of T=120T = 120 stages and 16 MPI ranks, the simulation check costs approximately 1.7 seconds of wall-clock time per trigger. Raising NN narrows the interval linearly in 1/N1/\sqrt{N} at a proportional cost in simulation time.

10.5 Interaction with the Simulation-Based Stopping Rule

Section titled “10.5 Interaction with the Simulation-Based Stopping Rule”

The out-of-sample simulation described here is the per-iteration measurement on which the simulation-based stopping rule operates. Each time the stopping rule fires (every period iterations, when the outer-approximation bound is stable), one simulation check is executed, producing Cˉ\bar{C}, σC\sigma_C, and Δ95\Delta_{95} for the current iteration. The stopping rule then compares consecutive simulation estimates to test whether the policy has stabilized across iterations. The stopping criterion and the comparison metric are fully specified in Stopping Rules section 5; this chapter owns only the per-iteration estimator (mean, standard deviation, confidence interval) and delegates the convergence decision to that chapter.

For cyclic policy graphs (see Horizon Modes), the inner approximation operates on the same seasonal cut-pool structure: vertices are organized by season τ\tau, not by absolute stage ID. The Lipschitz constant must account for the cumulative discount around the cycle, which bounds the geometric series of future contributions.

The convergence guarantee still holds: with dcycle<1d_{cycle} < 1, both the outer (cut) and inner (vertex) approximations converge to the true value function at the fixed point.

Costa, B.F.P., & Leclere, V. (2023). “Duality of upper bounds in stochastic dynamic programming.” Optimization Online. https://optimization-online.org/?p=23738

Philpott, A.B., de Matos, V.L., & Finardi, E.C. (2013). “On solving multistage stochastic programs with coherent risk measures.” Operations Research, 61(4), 957-970. https://doi.org/10.1287/opre.2013.1175

  • SDDP Algorithm — Core algorithm providing the outer approximation (lower bound) that this spec complements
  • Notation Conventions — Standard symbols for state variables, value functions, and cost-to-go
  • Discount Rate — Discount factor dd used in vertex value computation (section 5) and Lipschitz accumulation (section 4)
  • Horizon Modes — Cyclic policy graphs and the season-indexed pool structure that the inner approximation mirrors
  • Cut Management — Outer approximation cuts that provide the lower bound counterpart
  • Stopping Rules — Convergence criteria that use the gap between inner and outer approximations; simulation-based stopping rule that consumes the per-iteration out-of-sample estimator (section 10.5)
  • Risk Measures — CVaR objectives where deterministic upper bounds are essential; CVaR estimator validity under risk-averse policies (section 10.1)
  • Scenario Generation — Opening-tree definition from which out-of-sample replications draw independent noise realizations (section 10.1)