Replay triage

A failed doctor_frankentui run leaves behind a full JSONL ledger plus per-case captures. For humans that is too much. The replay triage script walks the ledger, extracts the top-N signals — structured failure pointers with expected/actual evidence — and emits either a compact summary or a JSON report you can paste into a PR or an issue.

Source: scripts/doctor_frankentui_replay_triage.py.

Running it


python3 ./scripts/doctor_frankentui_replay_triage.py \
    --run-root /tmp/doctor_frankentui/e2e/failure_20260423T000000Z \
    --max-signals 8

Full flags:

Flag	Default	Purpose
`--run-root PATH`	required	Path to a `doctor_frankentui` run root containing `meta/`.
`--output-json PATH`	empty	Write the structured report to `PATH` instead of stdout.
`--max-signals N`	`5`	Cap the number of signals in the compact output.
`--max-timeline N`	`40`	Cap the number of timeline entries.
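
For readers who want to wrap the script in their own tooling, the flag table above can be mirrored with `argparse`. This is a minimal sketch based only on the documented flags and defaults; `build_parser` is a hypothetical name and the real script's parser may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="doctor_frankentui_replay_triage.py")
    parser.add_argument("--run-root", required=True, metavar="PATH",
                        help="doctor_frankentui run root containing meta/")
    parser.add_argument("--output-json", default="", metavar="PATH",
                        help="write the structured report to PATH")
    parser.add_argument("--max-signals", type=int, default=5, metavar="N",
                        help="cap the number of signals in the compact output")
    parser.add_argument("--max-timeline", type=int, default=40, metavar="N",
                        help="cap the number of timeline entries")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--run-root", "/tmp/run", "--max-signals", "8"])
print(args.run_root, args.max_signals, args.max_timeline, repr(args.output_json))
```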

The signal model

Every signal is one Signal dataclass instance:


from dataclasses import dataclass
from typing import Any

@dataclass
class Signal:
    severity: int           # 0 = info, 1 = warn, 2 = error
    message: str            # human-readable summary
    event_index: int | None # line index into meta/events.jsonl
    event_type: str | None  # e.g. "frame_captured", "diff_decision"
    case_id: str | None     # failing case (from meta/case_results.json)
    step_id: str | None     # failing step (from meta/step_results.tsv)
    pointers: list[str]     # JSON pointers to evidence
    expected: dict[str, Any]
    actual: dict[str, Any]

Severity levels

0 (info) — contextual observation. Not a failure by itself but narrows the search.
1 (warn) — non-fatal divergence (e.g. a degradation_event followed by a clean recovery).
2 (error) — a hard failure. At least one signal at this level is present whenever the underlying run failed.
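
One way to turn these levels into a "top-N" list is to sort by severity and then by position in the ledger. The sketch below assumes that ordering; the page does not say how the script breaks ties. `MiniSignal` is a trimmed, hypothetical stand-in for `Signal`, and `top_signals` is an illustrative name.

```python
from dataclasses import dataclass
from typing import Optional

# Labels matching the [ERROR] / [WARN] tags shown in the compact output.
SEVERITY_LABELS = {0: "INFO", 1: "WARN", 2: "ERROR"}


@dataclass
class MiniSignal:
    """Trimmed stand-in for Signal with only the fields needed for ranking."""
    severity: int
    message: str
    event_index: Optional[int] = None


def top_signals(signals: list[MiniSignal], limit: int) -> list[MiniSignal]:
    """Highest severity first; ties go to the earliest event, unindexed last."""
    def sort_key(signal: MiniSignal) -> tuple[int, float]:
        index = signal.event_index if signal.event_index is not None else float("inf")
        return (-signal.severity, index)

    return sorted(signals, key=sort_key)[:limit]


signals = [
    MiniSignal(1, "degradation downgrade", 4740),
    MiniSignal(2, "frame checksum diverged", 4812),
    MiniSignal(0, "resize observed", 10),
    MiniSignal(2, "case_003 failed"),
]
for s in top_signals(signals, limit=2):
    print(f"[{SEVERITY_LABELS[s.severity]}] {s.message}")
```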

Pointer syntax

Pointers are paths relative to the run root and use a simple path#anchor syntax:

Form	Meaning
`meta/events.jsonl#L<N>`	Line `N` in `events.jsonl`.
`cases/<case_id>/<file>`	File under the case directory.
`meta/<file>#/<path>`	JSON Pointer into a structured file.
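
The three forms can be told apart with a small parser. This is a sketch rather than the script's own code: `Pointer` and `parse_pointer` are hypothetical names, and the `meta/case_results.json#/0/status` example is made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class Pointer(NamedTuple):
    path: str                    # path relative to the run root
    line: Optional[int]          # set for the #L<N> form
    json_pointer: Optional[str]  # set for the #/<path> form


_LINE_ANCHOR = re.compile(r"L(\d+)")


def parse_pointer(pointer: str) -> Pointer:
    """Split a path#anchor pointer into its file path and anchor kind."""
    path, sep, anchor = pointer.partition("#")
    if not sep:
        return Pointer(path, None, None)  # plain file under the run root
    match = _LINE_ANCHOR.fullmatch(anchor)
    if match:
        return Pointer(path, int(match.group(1)), None)
    if anchor.startswith("/"):
        return Pointer(path, None, anchor)
    raise ValueError(f"unrecognized anchor in pointer: {pointer!r}")


print(parse_pointer("meta/events.jsonl#L4812"))
print(parse_pointer("cases/case_003/frame_42.txt"))
print(parse_pointer("meta/case_results.json#/0/status"))
```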

What signals get extracted

The script scans meta/events.jsonl and classifies events in priority order.

Frame checksum mismatches


{
  "event_type": "frame_checksum_mismatch",
  "frame_idx": 42,
  "expected": "0xdeadbeef00000000",
  "actual":   "0xcafef00d00000000"
}

Severity 2. The first occurrence is usually the smoking gun for a render regression.

Contract violations

Events with event_type in {"contract_violation", "invariant_break", "validator_failure"}. Severity 2. The expected/actual fields contain the specific invariant that broke.

Degradation cascade

Events with event_type == "degradation_event" whose to_tier indicates a downgrade (Full → SimpleBorders / NoColors / TextOnly — see frame budget). Severity 1.

Guard breaches

Events with event_type == "conformal_frame_guard" whose verdict is "breach". Severity 1 (repeated breaches cluster into a single signal with a count).
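
Clustering repeated breaches could look roughly like the sketch below. It assumes events are already parsed into dicts and that line numbers are 1-based to match the `#L<N>` pointer form. The `count` key and the function name are hypothetical.

```python
from typing import Any, Optional


def cluster_guard_breaches(events: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Collapse every conformal_frame_guard breach into one severity-1 signal."""
    breach_lines = [
        line_no
        for line_no, event in enumerate(events, start=1)  # assumed 1-based lines
        if event.get("event_type") == "conformal_frame_guard"
        and event.get("verdict") == "breach"
    ]
    if not breach_lines:
        return None
    first = breach_lines[0]
    return {
        "severity": 1,
        "message": f"conformal_frame_guard breach x{len(breach_lines)}",
        "event_index": first,
        "count": len(breach_lines),  # hypothetical field name
        "pointers": [f"meta/events.jsonl#L{first}"],
    }


events = [
    {"event_type": "frame_captured"},
    {"event_type": "conformal_frame_guard", "verdict": "breach"},
    {"event_type": "conformal_frame_guard", "verdict": "ok"},
    {"event_type": "conformal_frame_guard", "verdict": "breach"},
]
print(cluster_guard_breaches(events))
```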

Resize storms

Events with event_type == "resize_decision" whose coalesce outcome indicates unhandled chaos. Severity 1.
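
The event rules described so far can be condensed into a single lookup. This is a simplified sketch under stated assumptions: it treats any `to_tier` other than `Full` as a downgrade, and it leaves out resize storms because this page does not name the field values that mark one. `classify` is a hypothetical name.

```python
from typing import Any, Optional

# Event types this page lists as severity 2.
HARD_FAILURES = {
    "frame_checksum_mismatch",
    "contract_violation",
    "invariant_break",
    "validator_failure",
}
# Tiers below Full; landing on one is treated as a downgrade (simplification).
DOWNGRADE_TIERS = {"SimpleBorders", "NoColors", "TextOnly"}


def classify(event: dict[str, Any]) -> Optional[int]:
    """Return the severity for an event, or None if it yields no signal."""
    kind = event.get("event_type")
    if kind in HARD_FAILURES:
        return 2
    if kind == "degradation_event" and event.get("to_tier") in DOWNGRADE_TIERS:
        return 1
    if kind == "conformal_frame_guard" and event.get("verdict") == "breach":
        return 1
    return None


print(classify({"event_type": "frame_checksum_mismatch", "frame_idx": 42}))
print(classify({"event_type": "degradation_event", "to_tier": "NoColors"}))
print(classify({"event_type": "frame_captured"}))
```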

Case-level failure

The script reads meta/case_results.json; every case with status: "failed" becomes a severity-2 signal, annotated with the case’s artifact_dir and the events_pointer into the JSONL.
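
A sketch of that case-level pass follows. It assumes `meta/case_results.json` parses to a list of objects with a `case_id` key alongside the `status`, `artifact_dir`, and `events_pointer` fields mentioned above. The list layout, the `case_id` key, and the sample values are assumptions, not the documented schema.

```python
from typing import Any


def failed_case_signals(cases: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Build one severity-2 signal per case whose status is "failed"."""
    signals = []
    for case in cases:
        if case.get("status") != "failed":
            continue
        # Keep whichever evidence locations the case record provides.
        pointers = [p for p in (case.get("events_pointer"), case.get("artifact_dir")) if p]
        signals.append({
            "severity": 2,
            "message": f"case {case.get('case_id')} failed",
            "case_id": case.get("case_id"),
            "pointers": pointers,
        })
    return signals


# Inline stand-in for json.load() on meta/case_results.json (assumed layout).
cases = [
    {"case_id": "case_001", "status": "passed"},
    {"case_id": "case_003", "status": "failed",
     "artifact_dir": "cases/case_003",
     "events_pointer": "meta/events.jsonl#L4812"},
]
print(failed_case_signals(cases))
```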

Output shapes

Compact (default)


Replay Triage: doctor_failure_seed0
Run root: /tmp/doctor_frankentui/e2e/failure_20260423T000000Z

Signals (top 8)
---------------
[ERROR] frame checksum diverged at index 42
        event_type=frame_captured  case=case_003  step=resize_storm
        expected={"checksum":"0xdeadbeef00000000"}
        actual  ={"checksum":"0xcafef00d00000000"}
        pointers:
          meta/events.jsonl#L4812
          cases/case_003/frame_42.txt

[WARN]  degradation downgrade Full -> NoColors (budget p99 breach)
        event_type=degradation_event  case=case_003  step=resize_storm
        expected={"tier":"Full"}
        actual  ={"tier":"NoColors","reason":"conformal_frame_guard_breach"}
        pointers:
          meta/events.jsonl#L4740

…

Timeline (last 40 events)
-------------------------
…

Structured (`--output-json`)

The JSON report mirrors the compact output but is trivially machine-readable. It is the same shape used for meta/replay_triage_report.json in a failure-path run — see artifacts.


python3 ./scripts/doctor_frankentui_replay_triage.py \
    --run-root /tmp/doctor_frankentui/e2e/failure_20260423T000000Z \
    --output-json /tmp/triage.json \
    --max-signals 8
jq '.signals[0]' /tmp/triage.json
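
The same report can be consumed from Python. The sketch below relies on the top-level `signals` array that the `jq` call reads, and assumes each entry carries `severity`, `message`, and `pointers` keys like the `Signal` dataclass. The inline sample stands in for the file contents.

```python
import json
from typing import Any


def hard_failures(report: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the severity-2 signals from a parsed triage report."""
    return [s for s in report.get("signals", []) if s.get("severity") == 2]


# Inline stand-in for the contents of /tmp/triage.json (assumed field names).
sample_report = json.loads("""
{"signals": [
  {"severity": 2, "message": "frame checksum diverged at index 42",
   "pointers": ["meta/events.jsonl#L4812", "cases/case_003/frame_42.txt"]},
  {"severity": 1, "message": "degradation downgrade Full -> NoColors",
   "pointers": ["meta/events.jsonl#L4740"]}
]}
""")

for signal in hard_failures(sample_report):
    first_pointer = signal["pointers"][0] if signal.get("pointers") else "(none)"
    print(f"{signal['message']} -> {first_pointer}")
```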

Typical triage loop

Run the failure flow


./scripts/doctor_frankentui_failure_e2e.sh

Artifacts land under /tmp/doctor_frankentui/e2e/failure_<ts>/.

Run triage


python3 ./scripts/doctor_frankentui_replay_triage.py \
    --run-root /tmp/doctor_frankentui/e2e/failure_<ts> \
    --max-signals 8

Follow the pointers

For each severity-2 signal, open its first pointers[] entry. That’s almost always the event where reality diverged from expectation.


# Line-anchor form: meta/events.jsonl#L4812 (run this from the run root)
sed -n '4805,4820p' meta/events.jsonl | jq .
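
If you prefer to stay in Python, the same window can be read with `itertools.islice`. This is a sketch: `events_around` is a hypothetical helper, and it accepts any iterable of JSONL lines (an open file handle works) so it can be tried without a real run root.

```python
import json
from itertools import islice
from typing import Any, Iterable, Iterator


def events_around(lines: Iterable[str], line: int,
                  before: int = 7, after: int = 8) -> Iterator[tuple[int, Any]]:
    """Yield (line_number, parsed_event) for a window around a 1-based line."""
    start = max(1, line - before)
    stop = line + after
    # islice is 0-based and half-open, so this covers lines start..stop inclusive.
    for number, raw in enumerate(islice(lines, start - 1, stop), start=start):
        if raw.strip():  # tolerate blank lines in the ledger
            yield number, json.loads(raw)


# Stand-in ledger; with a real run, pass open("meta/events.jsonl") instead.
ledger = [json.dumps({"seq": n}) for n in range(1, 21)]
for number, event in events_around(ledger, line=10, before=2, after=1):
    print(number, event)
```

With the defaults, `line=4812` covers lines 4805 through 4820, the same range as the `sed` call.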

Grep the broader context

Cross-reference with the grep recipes in evidence grep patterns — the event right before the failure is usually the cause.

Reproduce with a shadow-run

Once you have a hypothesis, reproduce it as a shadow-run scenario: two lanes (before / after your fix), same seed, same events, frame checksums must match. If they match, you have a regression test.
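
The comparison at the heart of that check can be sketched as follows. It assumes you have already collected a `frame_idx -> checksum` mapping for each lane; how the shadow-run harness exposes those checksums is not covered here. `first_divergence` is a hypothetical helper.

```python
from typing import Optional


def first_divergence(before: dict[int, str], after: dict[int, str]) -> Optional[int]:
    """Return the lowest frame index whose checksums differ, or None if all match."""
    for frame_idx in sorted(before.keys() | after.keys()):
        # A frame missing from one lane also counts as a divergence.
        if before.get(frame_idx) != after.get(frame_idx):
            return frame_idx
    return None


lane_before = {41: "0x1111", 42: "0xdeadbeef00000000", 43: "0x3333"}
lane_after = {41: "0x1111", 42: "0xcafef00d00000000", 43: "0x3333"}
print(first_divergence(lane_before, lane_after))
print(first_divergence(lane_before, dict(lane_before)))
```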

Pitfalls

Triage is not a fix. Signals point at where reality diverged from expectation — that’s the symptom, not always the root cause. The root cause is often several events earlier; scan the timeline backwards from the first severity-2 signal.
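
The backward scan can be sketched as follows, assuming `event_index` is a 1-based line number (matching the `#L<N>` pointer form) and that signals and events are plain dicts. Both helper names are hypothetical.

```python
from typing import Any, Optional


def first_error_index(signals: list[dict[str, Any]]) -> Optional[int]:
    """Earliest event_index among severity-2 signals that carry one."""
    indexed = [s["event_index"] for s in signals
               if s.get("severity") == 2 and s.get("event_index") is not None]
    return min(indexed) if indexed else None


def events_before(events: list[Any], event_index: int,
                  window: int = 10) -> list[tuple[int, Any]]:
    """Up to `window` events preceding the given 1-based line, nearest first."""
    failing_pos = event_index - 1  # 0-based position of the failing event
    start = max(0, failing_pos - window)
    return [(pos + 1, events[pos]) for pos in range(failing_pos - 1, start - 1, -1)]


events = [{"seq": n} for n in range(1, 8)]
signals = [{"severity": 1, "event_index": 2}, {"severity": 2, "event_index": 5}]
anchor = first_error_index(signals)
if anchor is not None:
    for line_no, event in events_before(events, anchor, window=2):
        print(line_no, event)
```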

Missing events.jsonl. If the child process crashed before emitting any event, triage has nothing to say. In that case start from logs/<step>.stderr.log and meta/command_manifest.txt — see artifacts troubleshooting.

--max-signals caps the compact output, not the JSON report. The JSON report always contains every signal the script found. Use it for archival; use the compact form for human scanning.
