SLO schema (slo.yaml)

slo.yaml at the repo root is FrankenTUI’s machine-readable service level objective definition. Every budget the kernel honours — frame render p99, layout compute p99, Bayesian posterior update latency, heap RSS — is named and bounded here. CI validates the schema on every push and runs a deterministic replay that exercises safe-mode when breaches are injected.

Source: /slo.yaml.

Why a single SLO file

The benchmark gate provides the enforcement mechanism. The SLO file provides the intention: what the kernel promises its users. Keeping the two aligned in one place means:

  • Budgets are reviewable in a single PR.
  • A CI breach maps one-to-one to a documented promise.
  • The runtime’s degradation cascade can key off the same metric names that appear in tests.

Top-level layout

    # Global thresholds
    regression_threshold: 0.10
    noise_tolerance: 0.05
    safe_mode_breach_count: 3
    safe_mode_error_rate: 0.10

    metrics:
      render_frame_p99_us:
        metric_type: latency
        max_value: 4000.0
        max_ratio: 1.25
        safe_mode_trigger: true
      # …many more metrics…

Global thresholds

| Field | Meaning |
| --- | --- |
| regression_threshold | Fractional overage above baseline that counts as a regression (default 10%). |
| noise_tolerance | Measurement variance absorbed before alerting (default 5%). |
| safe_mode_breach_count | How many safe_mode_trigger: true metrics must breach before the runtime enters safe mode. |
| safe_mode_error_rate | Error-rate metric above which safe mode engages independently of latency. |

Per-metric fields

| Field | Type | Meaning |
| --- | --- | --- |
| metric_type | one of latency, memory, error_rate | Unit family. latency is microseconds; memory is bytes or count; error_rate is a fraction 0–1. |
| max_value | f64 | Absolute ceiling. Exceeding it is a breach. |
| max_ratio | f64 | Max ratio vs baseline. Exceeding it is a breach. Optional. |
| safe_mode_trigger | bool | When true, breaching this metric counts toward safe_mode_breach_count. |

Breach semantics: a metric breaches if value > max_value or value / baseline > max_ratio.
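The rule above can be sketched in Rust as follows. This is a minimal illustration, not the runtime's own code: the function name classify is hypothetical, the BreachReason variants mirror the ones listed for BreachResult, and two behaviours are assumptions: the ratio check is skipped when max_ratio or the baseline is missing (or the baseline is not positive), and an absolute-ceiling breach is reported ahead of a ratio breach.

```rust
/// Why a metric breached. Variant names follow the BreachResult docs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BreachReason {
    OverMaxValue,
    OverMaxRatio,
    None,
}

/// Apply the documented rule: breach if `value > max_value`
/// or `value / baseline > max_ratio`.
///
/// Assumption: the ratio check only runs when both `max_ratio` and a
/// positive `baseline` are available.
pub fn classify(
    value: f64,
    max_value: f64,
    max_ratio: Option<f64>,
    baseline: Option<f64>,
) -> BreachReason {
    // Absolute ceiling first (ordering is an assumption of this sketch).
    if value > max_value {
        return BreachReason::OverMaxValue;
    }
    // Relative ceiling against the recorded baseline, if any.
    if let (Some(ratio), Some(base)) = (max_ratio, baseline) {
        if base > 0.0 && value / base > ratio {
            return BreachReason::OverMaxRatio;
        }
    }
    BreachReason::None
}
```

For example, with the render_frame_p99_us budget (max_value 4000.0, max_ratio 1.25) and a hypothetical baseline of 3000 µs, a reading of 3900 µs stays under the ceiling but is 1.3 times the baseline, so this sketch classifies it as OverMaxRatio.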

Metric categories

The schema groups metrics into two planes.

Data plane — frame rendering pipeline

Budgets on the hot path: every render has to meet these to hit the 60 Hz target.

    render_frame_p50_us: { max_value: 500.0, max_ratio: 1.15 }
    render_frame_p95_us: { max_value: 2000.0, max_ratio: 1.20 }
    render_frame_p99_us: { max_value: 4000.0, max_ratio: 1.25, safe_mode_trigger: true }
    render_frame_p999_us: { max_value: 8000.0, max_ratio: 1.50, safe_mode_trigger: true }
    layout_compute_p50_us: { max_value: 200.0 }
    layout_compute_p95_us: { max_value: 800.0 }
    layout_compute_p99_us: { max_value: 1500.0 }
    diff_strategy_p50_us: { max_value: 100.0 }
    diff_strategy_p95_us: { max_value: 500.0 }
    diff_strategy_p99_us: { max_value: 1000.0 }
    ansi_present_p50_us: { max_value: 150.0 }
    ansi_present_p95_us: { max_value: 600.0 }
    ansi_present_p99_us: { max_value: 1200.0 }
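To put these ceilings next to the 60 Hz target, it can help to express a budget as a share of the frame period. The helpers below are a small arithmetic sketch; the names frame_period_us and share_of_frame are hypothetical and do not come from the FrankenTUI codebase.

```rust
/// Length of one frame in microseconds at the given refresh rate.
pub fn frame_period_us(hz: f64) -> f64 {
    1_000_000.0 / hz
}

/// Fraction of one frame period that a latency budget consumes.
pub fn share_of_frame(budget_us: f64, hz: f64) -> f64 {
    budget_us / frame_period_us(hz)
}
```

At 60 Hz the frame period is about 16,667 µs, so the render_frame_p99_us ceiling of 4000 µs uses 24% of a frame and the render_frame_p999_us ceiling of 8000 µs uses 48%.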

Memory and error budgets on the same plane:

    heap_rss_bytes: { max_value: 104857600.0, max_ratio: 1.50, safe_mode_trigger: true }
    allocations_per_frame: { max_value: 500.0, max_ratio: 1.30 }
    false_positive_strategy_switch_rate: { max_value: 0.05, safe_mode_trigger: true }
    malformed_ansi_rate: { max_value: 0.01 }

Decision plane — intelligence layer

Budgets on the statistical kernels behind the runtime’s adaptive behaviour.

    posterior_update_p99_us: { max_value: 500.0, safe_mode_trigger: true }
    voi_computation_p99_us: { max_value: 400.0 }
    conformal_predict_p95_us: { max_value: 100.0 }
    eprocess_update_p50_us: { max_value: 10.0 }
    bocpd_update_p50_us: { max_value: 25.0 }
    cascade_decision_p99_us: { max_value: 100.0 }

The decision plane’s budgets are deliberately tight — a sluggish posterior-update hurts every diff decision that follows. See intelligence overview for what these kernels actually do.

BreachResult

When the runtime evaluates a metric against the SLO, it produces a BreachResult:

    pub struct BreachResult {
        pub metric_name: String,
        pub metric_type: MetricType,   // Latency | Memory | ErrorRate
        pub value: f64,
        pub max_value: f64,
        pub max_ratio: Option<f64>,
        pub baseline: Option<f64>,
        pub breached: bool,
        pub safe_mode_trigger: bool,
        pub reason: BreachReason,      // OverMaxValue | OverMaxRatio | None
    }

Breach results are emitted as events on the evidence sink and counted toward safe_mode_breach_count. Reaching that count flips the runtime into safe mode — see frame budget for the degradation cascade that kicks in.
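The counting step might look like the sketch below. It rests on several assumptions that the schema text does not settle: Breach is a trimmed-down stand-in for BreachResult, the count is taken over distinct breaching trigger metrics in one evaluation pass, "reaching" the count means >=, and the error rate compared against safe_mode_error_rate arrives as a separate observed_error_rate argument. All names other than the schema fields are hypothetical.

```rust
/// Trimmed-down view of a breach result: only the fields this sketch needs.
pub struct Breach {
    pub metric_name: String,
    pub breached: bool,
    pub safe_mode_trigger: bool,
}

/// The two safe-mode thresholds from the top of slo.yaml.
pub struct Thresholds {
    pub safe_mode_breach_count: usize,
    pub safe_mode_error_rate: f64,
}

/// Decide whether one evaluation pass should flip the runtime into safe mode.
pub fn should_enter_safe_mode(
    results: &[Breach],
    observed_error_rate: f64,
    thresholds: &Thresholds,
) -> bool {
    // Only breaches on metrics marked safe_mode_trigger: true are counted.
    let trigger_breaches = results
        .iter()
        .filter(|r| r.breached && r.safe_mode_trigger)
        .count();

    // The error-rate path engages independently of the breach count.
    trigger_breaches >= thresholds.safe_mode_breach_count
        || observed_error_rate > thresholds.safe_mode_error_rate
}
```

With the defaults shown earlier (safe_mode_breach_count: 3), two breaching trigger metrics plus one breaching non-trigger metric would not enter safe mode under this reading, while three breaching trigger metrics would.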

CI validation

CI runs two gates against this file on every push:

  1. Schema validation. Every metric declares a known metric_type, has numeric max_value, and safe_mode_trigger is a bool. Unknown keys fail.
  2. Deterministic safe-mode replay. A fixture injects breaches on the safe_mode_trigger: true metrics and asserts the runtime transitions into safe mode. If the cascade doesn’t fire, CI fails.
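The schema-validation rules can be illustrated with a std-only sketch. It is not the CI gate's implementation: the Raw enum stands in for whatever a YAML parser would return, validate_metric is a hypothetical name, and treating max_ratio and safe_mode_trigger as optional keys is an assumption drawn from the examples in this page.

```rust
use std::collections::BTreeMap;

/// A scalar as it might come out of a YAML parser (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum Raw {
    Num(f64),
    Bool(bool),
    Str(String),
}

const KNOWN_KEYS: [&str; 4] = ["metric_type", "max_value", "max_ratio", "safe_mode_trigger"];
const KNOWN_TYPES: [&str; 3] = ["latency", "memory", "error_rate"];

/// Check one metric entry against the rules listed for schema validation.
pub fn validate_metric(name: &str, fields: &BTreeMap<String, Raw>) -> Result<(), String> {
    // Unknown keys fail.
    for key in fields.keys() {
        if !KNOWN_KEYS.contains(&key.as_str()) {
            return Err(format!("{}: unknown key `{}`", name, key));
        }
    }
    // metric_type must be present and one of the known families.
    match fields.get("metric_type") {
        Some(Raw::Str(t)) if KNOWN_TYPES.contains(&t.as_str()) => {}
        _ => return Err(format!("{}: missing or unknown metric_type", name)),
    }
    // max_value must be present and numeric.
    match fields.get("max_value") {
        Some(Raw::Num(_)) => {}
        _ => return Err(format!("{}: max_value must be numeric", name)),
    }
    // Optional keys (assumption): if present, they must have the right type.
    match fields.get("max_ratio") {
        None | Some(Raw::Num(_)) => {}
        _ => return Err(format!("{}: max_ratio must be numeric", name)),
    }
    match fields.get("safe_mode_trigger") {
        None | Some(Raw::Bool(_)) => {}
        _ => return Err(format!("{}: safe_mode_trigger must be a bool", name)),
    }
    Ok(())
}
```

A misspelled key such as max_valeu is rejected by the first loop, which is the behaviour that keeps a typo from silently dropping a budget.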

Relationship to the benchmark gate

  • SLO (slo.yaml) is the promise — the outermost ceiling.
  • Benchmark gate (tests/baseline.json) is the enforcement — the per-benchmark measured baseline with tolerance.

The gate’s budgets should always be ≤ the SLO’s max_value. If a benchmark’s baseline creeps up past an SLO ceiling, either the SLO must widen (deliberate promise change) or the gate has to fail.
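The invariant in the previous paragraph is easy to check mechanically. The sketch below assumes both files have already been parsed into maps from metric name to a single number; the actual layout of tests/baseline.json is not described here, and gate_exceeds_slo is a hypothetical helper.

```rust
use std::collections::BTreeMap;

/// Return the names of metrics whose gate budget exceeds the SLO ceiling.
/// Benchmarks with no matching SLO entry are ignored in this sketch.
pub fn gate_exceeds_slo(
    slo_max: &BTreeMap<String, f64>,
    gate_budget: &BTreeMap<String, f64>,
) -> Vec<String> {
    gate_budget
        .iter()
        .filter(|(name, budget)| match slo_max.get(*name) {
            Some(ceiling) => **budget > *ceiling,
            None => false, // no SLO for this benchmark: nothing to compare
        })
        .map(|(name, _)| name.clone())
        .collect()
}
```

An empty result means every gate budget sits at or below its SLO ceiling. A non-empty result is the situation where either the SLO is deliberately widened or the gate has to fail.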

See benchmark gate for the mechanics and telemetry events for the metric names in their canonical runtime form.

Adding a new metric

Pick a name consistent with existing conventions

Latency metrics end in _p{50,95,99,999}_us. Memory metrics end in _bytes or _per_frame. Error-rate metrics end in _rate.
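A name can be checked against these suffix conventions before it is added. The function below is an illustrative sketch: infer_type is a hypothetical helper, and the MetricType variants follow the Latency | Memory | ErrorRate families used by BreachResult.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum MetricType {
    Latency,
    Memory,
    ErrorRate,
}

/// Infer the metric family from the naming convention, or None if the
/// name follows none of the documented suffixes.
pub fn infer_type(name: &str) -> Option<MetricType> {
    const LATENCY_SUFFIXES: [&str; 4] = ["_p50_us", "_p95_us", "_p99_us", "_p999_us"];

    if LATENCY_SUFFIXES.iter().any(|s| name.ends_with(*s)) {
        Some(MetricType::Latency)
    } else if name.ends_with("_bytes") || name.ends_with("_per_frame") {
        Some(MetricType::Memory)
    } else if name.ends_with("_rate") {
        Some(MetricType::ErrorRate)
    } else {
        None
    }
}
```

A None result is a hint that the proposed name does not match the existing conventions, for example a latency metric that omits its percentile.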

Decide whether it should trigger safe mode

A metric should set safe_mode_trigger: true only if breaching it means the kernel is genuinely unsafe for interactive use. A slow VOI computation is annoying; a p99 frame render above its 4 ms ceiling is user-visible.

Add the metric and budget

    metrics:
      my_new_metric_p99_us:
        metric_type: latency
        max_value: 750.0
        max_ratio: 1.30
        safe_mode_trigger: false

Wire a benchmark

Add a criterion benchmark that emits the same name and ensure the tests/baseline.json percentile budget stays within the SLO ceiling.

Run the gates

    ./scripts/perf_regression_gate.sh --check-only
    ./scripts/bench_budget.sh --check-only

Confirm both pass at the new budget.

Pitfalls

Don’t raise an SLO to hide a regression. The SLO is a promise. Document a relaxation in the PR description and in the commit history; reviewers should push back on silent widening.

safe_mode_trigger cascades. Flipping a metric to true without understanding the degradation cascade may cause the runtime to enter safe mode more eagerly than intended. Test with the deterministic safe-mode replay before landing.

Percentile choice is load-bearing. If the SLO promises p99 and the benchmark gate measures p95, the two numbers are not comparable and a passing gate says nothing about the promise. Keep the percentile consistent across SLO, gate, and telemetry.
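Because the percentile is encoded in the metric name, a mismatch between two names can be caught with a string check. The helpers below are a sketch under that assumption; percentile_of and same_percentile are hypothetical names, and only latency names ending in _p<digits>_us are recognised.

```rust
/// Extract the percentile digits from a latency metric name,
/// e.g. "render_frame_p99_us" yields Some("99").
pub fn percentile_of(name: &str) -> Option<&str> {
    let stem = name.strip_suffix("_us")?;
    // Split at the last "_p" so earlier words such as "_present" are ignored.
    let (_, tail) = stem.rsplit_once("_p")?;
    if !tail.is_empty() && tail.chars().all(|c| c.is_ascii_digit()) {
        Some(tail)
    } else {
        None
    }
}

/// True only when both names carry a percentile and the percentiles match.
pub fn same_percentile(a: &str, b: &str) -> bool {
    match (percentile_of(a), percentile_of(b)) {
        (Some(x), Some(y)) => x == y,
        _ => false,
    }
}
```

Comparing the SLO's metric name with the name a benchmark emits, for example render_frame_p99_us against render_frame_p95_us, returns false and flags the mismatch before the two budgets are ever compared.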