Replay triage
A failed doctor_frankentui run leaves behind a full JSONL ledger
plus per-case captures. For humans that is too much. The replay triage
script walks the ledger, extracts the top-N signals — structured
failure pointers with expected/actual evidence — and emits either a
compact summary or a JSON report you can paste into a PR or an issue.
Source: scripts/doctor_frankentui_replay_triage.py.
Running it
python3 ./scripts/doctor_frankentui_replay_triage.py \
--run-root /tmp/doctor_frankentui/e2e/failure_20260423T000000Z \
--max-signals 8Full flags:
| Flag | Default | Purpose |
|---|---|---|
--run-root PATH | required | Path to a doctor_frankentui run root containing meta/. |
--output-json PATH | empty | Write the structured report to PATH instead of stdout. |
--max-signals N | 5 | Cap the number of signals in the compact output. |
--max-timeline N | 40 | Cap the number of timeline entries. |
The signal model
Every signal is one Signal dataclass instance:
@dataclass
class Signal:
severity: int # 0 = info, 1 = warn, 2 = error
message: str # human-readable summary
event_index: int | None # line index into meta/events.jsonl
event_type: str | None # e.g. "frame_captured", "diff_decision"
case_id: str | None # failing case (from meta/case_results.json)
step_id: str | None # failing step (from meta/step_results.tsv)
pointers: list[str] # JSON pointers to evidence
expected: dict[str, Any]
actual: dict[str, Any]Severity levels
0(info) — contextual observation. Not a failure by itself but narrows the search.1(warn) — non-fatal divergence (e.g. adegradation_eventfollowed by a clean recovery).2(error) — a hard failure. At least one signal at this level is present whenever the underlying run failed.
Pointer syntax
Pointers are absolute relative to the run root and use a simple
path#anchor syntax:
| Form | Meaning |
|---|---|
meta/events.jsonl#L<N> | Line N in events.jsonl. |
cases/<case_id>/<file> | File under the case directory. |
meta/<file>#/<path> | JSON Pointer into a structured file. |
What signals get extracted
The script scans meta/events.jsonl and classifies events in
priority order.
Frame checksum mismatches
{
"event_type": "frame_checksum_mismatch",
"frame_idx": 42,
"expected": "0xdeadbeef00000000",
"actual": "0xcafef00d00000000"
}Severity 2. The first occurrence is usually the smoking gun for a
render regression.
Contract violations
Events with event_type in
{"contract_violation", "invariant_break", "validator_failure"}.
Severity 2. The expected/actual fields contain the specific
invariant that broke.
Degradation cascade
Events with event_type == "degradation_event" whose to_tier
indicates a downgrade (Full → SimpleBorders / NoColors / TextOnly —
see frame budget). Severity 1.
Guard breaches
Events with event_type == "conformal_frame_guard" whose verdict is
"breach". Severity 1 (repeated breaches cluster into a single
signal with a count).
Resize storms
Events with event_type == "resize_decision" whose coalesce
outcome indicates unhandled chaos. Severity 1.
Case-level failure
The script reads meta/case_results.json; every case with
status: "failed" becomes a severity-2 signal, annotated with the
case’s artifact_dir and the events_pointer into the JSONL.
Output shapes
Compact (default)
Replay Triage: doctor_failure_seed0
Run root: /tmp/doctor_frankentui/e2e/failure_20260423T000000Z
Signals (top 8)
---------------
[ERROR] frame checksum diverged at index 42
event_type=frame_captured case=case_003 step=resize_storm
expected={"checksum":"0xdeadbeef00000000"}
actual ={"checksum":"0xcafef00d00000000"}
pointers:
meta/events.jsonl#L4812
cases/case_003/frame_42.txt
[WARN] degradation downgrade Full -> NoColors (budget p99 breach)
event_type=degradation_event case=case_003 step=resize_storm
expected={"tier":"Full"}
actual ={"tier":"NoColors","reason":"conformal_frame_guard_breach"}
pointers:
meta/events.jsonl#L4740
…
Timeline (last 40 events)
-------------------------
…Structured (--output-json)
The JSON report mirrors the compact output but is trivially
machine-readable. It is the same shape used for
meta/replay_triage_report.json in a failure-path run — see
artifacts.
python3 ./scripts/doctor_frankentui_replay_triage.py \
--run-root /tmp/doctor_frankentui/e2e/failure_20260423T000000Z \
--output-json /tmp/triage.json \
--max-signals 8
jq '.signals[0]' /tmp/triage.jsonTypical triage loop
Run the failure flow
./scripts/doctor_frankentui_failure_e2e.shArtifacts land under /tmp/doctor_frankentui/e2e/failure_<ts>/.
Run triage
python3 ./scripts/doctor_frankentui_replay_triage.py \
--run-root /tmp/doctor_frankentui/e2e/failure_<ts> \
--max-signals 8Follow the pointers
For each severity-2 signal, open its first pointers[] entry.
That’s almost always the event where reality diverged from expectation.
# JSON pointer form: meta/events.jsonl#L4812
sed -n '4805,4820p' meta/events.jsonl | jq .Grep the broader context
Cross-reference with the grep recipes in evidence grep patterns — the event right before the failure is usually the cause.
Reproduce with a shadow-run
Once you have a hypothesis, reproduce it as a shadow-run scenario: two lanes (before / after your fix), same seed, same events, frame checksums must match. If they match, you have a regression test.
Pitfalls
Triage is not a fix. Signals point at where reality diverged
from expectation — that’s the symptom, not always the root cause.
The root cause is often several events earlier; scan the timeline
backwards from the first severity-2 signal.
Missing events.jsonl. If the child process crashed before
emitting any event, triage has nothing to say. In that case start
from logs/<step>.stderr.log and meta/command_manifest.txt — see
artifacts troubleshooting.
--max-signals caps the compact output, not the JSON report.
The JSON report always contains every signal the script found. Use
it for archival; use the compact form for human scanning.