# Rollout scorecard
Shadow runs prove behavioural equivalence; benchmarks prove
performance parity. The rollout scorecard is the single object
that combines the two into a structured `RolloutVerdict` (`Go`,
`NoGo`, or `Inconclusive`) and emits a self-contained evidence
bundle that operators can attach to a release decision.
File: `crates/ftui-harness/src/rollout_scorecard.rs`.
## Configuration
```rust
pub struct RolloutScorecardConfig {
    /// Minimum number of shadow-run scenarios required. Default: 1.
    pub min_shadow_scenarios: usize,
    /// Minimum frame match ratio across all shadow runs (0.0..=1.0). Default: 1.0.
    pub min_match_ratio: f64,
    /// Whether a passing benchmark gate is required for Go. Default: false.
    pub require_benchmark_pass: bool,
}

impl RolloutScorecardConfig {
    pub fn min_shadow_scenarios(self, n: usize) -> Self;
    pub fn min_match_ratio(self, ratio: f64) -> Self; // clamped to 0..=1
    pub fn require_benchmark_pass(self, required: bool) -> Self;
}
```

Defaults are conservative: one shadow run and a 100% frame match. For a production rollout, require several scenarios:
```rust
let cfg = RolloutScorecardConfig::default()
    .min_shadow_scenarios(5)
    .min_match_ratio(1.0)
    .require_benchmark_pass(true);
```

## Verdict
```rust
pub enum RolloutVerdict {
    Go,           // All evidence meets thresholds.
    NoGo,         // Determinism or performance regression detected.
    Inconclusive, // Not enough evidence to decide.
}
```

`Go` requires all of:
- `shadow_results.len() >= min_shadow_scenarios`
- `aggregate_match_ratio() >= min_match_ratio`
- No `ShadowVerdict::Diverged` present.
- If `require_benchmark_pass`, a `GateResult` must be attached and `gate.passed()` must hold.
Anything short of that returns Inconclusive (missing evidence) or
NoGo (evidence says “don’t”).
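The ordering of these rules (hard evidence against the rollout beats missing evidence, and missing evidence is never `Go`) can be sketched as a small decision function. Everything below is a simplified stand-in for the crate's real types, not its implementation; in particular, whether the aggregate ratio is frame-weighted or scenario-averaged is an assumption here:

```rust
/// Simplified stand-in for a shadow-run result.
struct Shadow {
    diverged: bool,
    match_ratio: f64,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Go,
    NoGo,
    Inconclusive,
}

/// Sketch of the Go / NoGo / Inconclusive decision described above.
fn evaluate(
    shadows: &[Shadow],
    gate_passed: Option<bool>, // None = no benchmark gate attached
    min_scenarios: usize,
    min_ratio: f64,
    require_gate: bool,
) -> Verdict {
    // Hard evidence against the rollout wins first: any divergence,
    // or a failed benchmark gate, is NoGo.
    if shadows.iter().any(|s| s.diverged) {
        return Verdict::NoGo;
    }
    if require_gate && gate_passed == Some(false) {
        return Verdict::NoGo;
    }
    // Missing evidence is Inconclusive, never Go.
    if shadows.len() < min_scenarios {
        return Verdict::Inconclusive;
    }
    if require_gate && gate_passed.is_none() {
        return Verdict::Inconclusive;
    }
    // Scenario-averaged ratio (an assumption; the crate may frame-weight).
    let ratio: f64 =
        shadows.iter().map(|s| s.match_ratio).sum::<f64>() / shadows.len() as f64;
    if ratio < min_ratio {
        return Verdict::NoGo;
    }
    Verdict::Go
}
```

Note that a below-threshold match ratio is `NoGo`, not `Inconclusive`: the evidence exists and it says "don't".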
## API
```rust
pub struct RolloutScorecard { /* ... */ }

impl RolloutScorecard {
    pub fn new(config: RolloutScorecardConfig) -> Self;
    pub fn add_shadow_result(&mut self, result: ShadowRunResult);
    pub fn set_benchmark_gate(&mut self, result: GateResult);
    pub fn shadow_scenario_count(&self) -> usize;
    pub fn shadow_match_count(&self) -> usize;
    pub fn aggregate_match_ratio(&self) -> f64;
    pub fn evaluate(&self) -> RolloutVerdict;
    pub fn summary(&self) -> RolloutSummary;
}
```

## Evidence bundle
`RolloutEvidenceBundle` is the release artefact: JSON that combines
the scorecard verdict with runtime-observed telemetry and lane
metadata:
```rust
pub struct RolloutEvidenceBundle {
    pub scorecard: RolloutSummary,
    pub queue_telemetry: Option<QueueTelemetry>,
    pub requested_lane: String,
    pub resolved_lane: String,
    pub rollout_policy: String,
}

impl RolloutEvidenceBundle {
    pub fn to_json(&self) -> String;
}
```

A shortened example of what `to_json()` produces:
```json
{
  "schema_version": "1.0.0",
  "scorecard": {
    "verdict": "GO",
    "shadow_scenarios": 5,
    "shadow_matches": 5,
    "aggregate_match_ratio": 1.0,
    "total_frames_compared": 4800,
    "benchmark_passed": "pass",
    "config": {
      "min_shadow_scenarios": 5,
      "min_match_ratio": 1.0,
      "benchmark_required": true
    }
  },
  "queue_telemetry": {
    "enqueued": 1234,
    "processed": 1234,
    "dropped": 0,
    "high_water": 12,
    "in_flight": 0
  },
  "runtime": {
    "requested_lane": "structured",
    "resolved_lane": "structured",
    "rollout_policy": "shadow"
  }
}
```

## Worked example
```rust
use ftui_harness::{
    rollout_scorecard::{RolloutScorecard, RolloutScorecardConfig, RolloutVerdict},
    shadow_run::{ShadowRun, ShadowRunConfig},
};

#[test]
fn structured_is_go_for_counter_scenarios() {
    let scenarios = [
        ("increment", 42),
        ("reset_on_zero", 43),
        ("burst_ticks", 44),
        ("resize", 45),
        ("quit_path", 46),
    ];
    let mut scorecard = RolloutScorecard::new(
        RolloutScorecardConfig::default()
            .min_shadow_scenarios(5)
            .min_match_ratio(1.0),
    );
    for (name, seed) in scenarios {
        let cfg = ShadowRunConfig::new("rollout/counter", name, seed)
            .viewport(40, 10);
        let result = ShadowRun::compare(cfg, || CounterModel::new(), |s| {
            s.init();
            for _ in 0..30 {
                s.tick();
                s.capture_frame();
            }
        });
        scorecard.add_shadow_result(result);
    }

    assert_eq!(scorecard.evaluate(), RolloutVerdict::Go);
    let summary = scorecard.summary();
    std::fs::write("rollout_summary.json", summary.to_json()).unwrap();
}
```

Attach a benchmark gate (see the `benchmark_gate` module) to require
performance parity:
```rust
scorecard.set_benchmark_gate(gate_result);
```

## Reading a verdict in CI
```sh
# Fail CI unless the scorecard said Go:
jq -r '.scorecard.verdict' rollout_summary.json | grep -qx 'GO'

# Fleet dashboard: count evidence bundles by resolved lane
jq -r '.runtime.resolved_lane' **/rollout_evidence.json | sort | uniq -c
```

## Pitfalls
**`min_match_ratio < 1.0` lets divergence slip through.** The ratio
helps for long-running scenarios with known-benign noise (e.g.
external timestamps baked into the UI), but it is not a substitute
for fixing the source. Prefer a deterministic harness first; lower
the ratio only after you know why frames diverge.
**Scorecard `Inconclusive` is not `Go`.** CI gates must distinguish
`Go` from `Inconclusive`. The default config accepts a single
scenario, which is rarely enough to declare parity; bump
`min_shadow_scenarios` for release builds.
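When the gate lives in code rather than a jq one-liner, giving `Inconclusive` an explicit branch keeps it from silently riding along with `Go`. A minimal sketch with a stand-in enum; the exit codes here are illustrative conventions, not part of the crate:

```rust
#[derive(Debug)]
enum Verdict {
    Go,
    NoGo,
    Inconclusive,
}

/// Map a verdict to a CI exit status: only Go succeeds.
fn exit_code(verdict: &Verdict) -> i32 {
    match verdict {
        Verdict::Go => 0,
        Verdict::NoGo => 1,
        // Distinct from NoGo so "collect more evidence" is visible in
        // CI logs, but still a failure: Inconclusive must never ship.
        Verdict::Inconclusive => 2,
    }
}
```

The exhaustive `match` is the point: adding a new verdict variant later forces every CI gate in code to decide what it means, instead of defaulting to success.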