Conformal Rank Confidence
What goes wrong with a naive approach
The command-palette ledger produces log-posterior scores for every candidate. Two problems survive:
- Ties and near-ties. Two candidates whose log-posteriors differ by a tiny amount can swap on the next keystroke, so the top-k list shimmers.
- No notion of confidence. The palette can’t tell “these top three are clearly a group” from “there are seven candidates all essentially tied”. A good UI behaves differently in the two cases.
You cannot solve this with a deterministic comparator alone. You need a calibrated p-value for “is rank $i$ really distinct from rank $i+1$?” Gap-based conformal rank confidence gives it to you in one pass.
Mental model
After scoring, look at the gaps between consecutive scores in the sorted list:
$$g_i = s_i - s_{i+1}, \qquad s_1 \ge s_2 \ge \dots$$
A big gap $g_i$ means the upper candidate is clearly separated from the lower one, so rank $i$ is “confident”. A tiny gap means the two are essentially interchangeable.
Calibrate this intuition: compute a p-value from the gap relative to an empirical scale. If the p-value is above a threshold, the rank is confident and the score order is displayed as is; otherwise the gap is too small to trust, and the renderer uses a deterministic tie-break (e.g., alphabetical) to freeze the order.
Rank confidence is conformal prediction applied to the gap rather than the score. It doesn’t need a parametric model of what “close” means — the calibration set tells you what close looks like for this palette on this corpus.
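As a minimal sketch of the first step in this mental model, the gaps can be read off a score list that is already sorted in descending order. The function name `consecutive_gaps` is hypothetical and is not part of the crate's API.

```rust
/// Gaps between consecutive scores in a list sorted in descending order.
/// `gaps[i]` is the separation between the candidate at position i and the
/// one directly below it; a list of n scores yields n - 1 gaps.
/// (Hypothetical helper for illustration, not the crate's API.)
fn consecutive_gaps(sorted_scores: &[f64]) -> Vec<f64> {
    sorted_scores.windows(2).map(|pair| pair[0] - pair[1]).collect()
}
```

A single candidate produces no gaps, so there is nothing to calibrate and nothing to tie-break.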
The math
Gap-based p-value
Collect a gap history $h_1, \dots, h_N$ over the last $N$ keystrokes. For the gap $g_i$ at rank $i$:
$$p_i = \frac{\left|\{\, j : h_j \le g_i \,\}\right|}{N}$$
(an empirical CDF: the fraction of historical gaps that were no bigger than the current one). Small $p_i$ means the current gap is unusually small: low confidence, tie-break.
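The empirical-CDF form can be sketched with a bounded history buffer. This is an illustration under stated assumptions, not the repository's implementation: the `GapHistory` type is hypothetical, and returning 1.0 for an empty history (so that the score order is trusted until evidence accumulates) is a choice made here.

```rust
use std::collections::VecDeque;

/// Rolling window of recently observed gaps.
/// (Hypothetical type for illustration, not the crate's API.)
struct GapHistory {
    window: usize,
    gaps: VecDeque<f64>,
}

impl GapHistory {
    fn new(window: usize) -> Self {
        // A window of zero would never retain anything; clamp to at least 1.
        let window = window.max(1);
        Self { window, gaps: VecDeque::with_capacity(window) }
    }

    /// Record new gaps, evicting the oldest entries once the window is full.
    fn observe(&mut self, new_gaps: &[f64]) {
        for &g in new_gaps {
            if self.gaps.len() == self.window {
                self.gaps.pop_front();
            }
            self.gaps.push_back(g);
        }
    }

    /// Fraction of stored gaps that are no bigger than `gap`.
    /// Assumption: an empty history yields 1.0 (treat the rank as confident).
    fn p_value(&self, gap: f64) -> f64 {
        if self.gaps.is_empty() {
            return 1.0;
        }
        let not_larger = self.gaps.iter().filter(|&&h| h <= gap).count();
        not_larger as f64 / self.gaps.len() as f64
    }
}
```

Each p-value costs one linear scan of the window; keeping the window sorted would allow a binary search instead, at the price of more expensive eviction.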
Equivalent one-liner for the current ledger:
$$p_i = \min\!\left(1, \frac{g_i}{q}\right)$$
where $q$ is a conformal quantile of past gaps. Either form works; the repository uses the direct quantile-based variant for speed.
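The quantile-based variant might look like the sketch below. Both function names are hypothetical, the nearest-rank quantile rule is only one of several common conventions and is an assumption here, and the handling of a non-positive quantile is likewise a choice made for the example.

```rust
/// Empirical quantile of `history` at `level` in [0, 1], nearest-rank rule.
/// (Hypothetical helper; the quantile convention is an assumption.)
fn empirical_quantile(history: &[f64], level: f64) -> Option<f64> {
    if history.is_empty() {
        return None;
    }
    let mut sorted = history.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: smallest value with at least `level` of the sample at or below it.
    let rank = (level * sorted.len() as f64).ceil() as usize;
    let idx = rank.clamp(1, sorted.len()) - 1;
    Some(sorted[idx])
}

/// Ratio form of the p-value: the gap measured in units of the quantile `q`,
/// capped at 1.
fn ratio_p_value(gap: f64, q: f64) -> f64 {
    if q <= 0.0 {
        // Degenerate scale (assumption): any positive gap counts as confident.
        return if gap > 0.0 { 1.0 } else { 0.0 };
    }
    (gap / q).min(1.0)
}
```

Once `q` is known, each p-value is a single division, which is where the speed advantage over scanning the history per rank comes from.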
Deterministic tie-break
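When $p_i \le \tau$ (default $\tau = 0.2$), apply the tie-break: order the tied candidates by ascending candidate ID instead of by score.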
The ID ordering is fixed, so the displayed top-k is stable across keystrokes that don’t change the score structure.
Stability of top-k
Because tie-break is deterministic on IDs, flicker only happens when (a) a new candidate enters the top-k with a confidently better score, or (b) a previously confident rank loses confidence. Both are user-visible events that belong in the UI; shimmering micro-scores no longer cause motion.
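To make the stability claim concrete, here is a minimal sketch of the tie-zone logic, assuming the per-gap p-values have already been computed. The function `stable_order_sketch` is hypothetical and only mirrors the behaviour described in this section; it is not the crate's `stable_order`.

```rust
/// Re-order candidates so that runs of low-confidence adjacent ranks
/// ("tie zones") are sorted by ID, while confident boundaries stay in place.
/// `sorted` is in descending score order; `p_gaps[i]` is the p-value for the
/// gap between `sorted[i]` and `sorted[i + 1]`.
/// (Hypothetical helper for illustration, not the crate's API.)
fn stable_order_sketch(sorted: &[(&str, f64)], p_gaps: &[f64], tau: f64) -> Vec<String> {
    let mut out: Vec<String> = Vec::with_capacity(sorted.len());
    let mut zone_start = 0;
    for i in 0..sorted.len() {
        // A zone closes at the last candidate or at a confident gap (p > tau).
        let zone_ends = p_gaps.get(i).map_or(true, |&p| p > tau);
        if zone_ends {
            let mut zone: Vec<&str> = sorted[zone_start..=i].iter().map(|c| c.0).collect();
            // Lexicographic order on the ID string, independent of score jitter.
            zone.sort_unstable();
            out.extend(zone.into_iter().map(String::from));
            zone_start = i + 1;
        }
    }
    out
}
```

Because candidates inside a zone are ordered by ID alone, two score lists that differ only by swaps within the zone produce the same output, so those swaps never reach the screen.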
Worked example — typing g
Scores (log-posterior) for candidates matching g:
git.status +2.0
git.pull +1.99
git.push +1.98
goto.definition +0.50
Gaps: 0.01, 0.01, 1.48.
Assume the historical gap quantile at 0.2 is $q = 0.10$.
p(git.status vs git.pull) = min(1, 0.01/0.10) = 0.10 ≤ τ → tie-break
p(git.pull vs git.push) = min(1, 0.01/0.10) = 0.10 ≤ τ → tie-break
p(git.push vs goto.def) = min(1, 1.48/0.10) = 1.00 > τ → confident
The three git.* entries lock their order by ID (alphabetical: pull, push, status), while goto.definition stays distinctly fourth. No keystroke-level flicker between the top three.
Rust interface
use ftui_widgets::command_palette::rank_confidence::{RankConfidence, RankConfig};
let mut rc = RankConfidence::new(RankConfig {
tau: 0.2,
history_window: 256,
});
rc.observe(&gaps_for_this_query); // update history
let stable = rc.stable_order(&sorted_scores); // Vec<CandidateId>
stable_order returns the candidates re-sorted with tie-break applied. Tie zones are coalesced into groups that are internally ordered by ID; confident ranks keep their score-based positions.
How to debug
Tie-break outcomes travel alongside match-evidence lines; see command-palette-ledger for the base schema. An extra field records the confidence:
{"schema":"match-evidence","id":"git.pull","rank":2,
"log_posterior":1.99,"p_gap_above":0.10,"tied_with":"git.status"}
FTUI_EVIDENCE_SINK=/tmp/ftui.jsonl cargo run -p ftui-demo-showcase
# Find candidates that routinely land in tie zones:
jq -c 'select(.schema=="match-evidence" and (.p_gap_above // 1) <= 0.2)
| .id' /tmp/ftui.jsonl | sort | uniq -c | sort -rn
Pitfalls
Tie-break must be stable across sessions. Using hash order or insertion order means the “stable” display changes between runs. Use the canonical command ID (string), not a memory address.
Threshold trades flicker for responsiveness. A large $\tau$ (say 0.5) locks many confident ranks into tie-break, suppressing legitimate reorderings. Keep the default unless you see evidence-sink data showing flicker under that setting.
Cross-references
- Command-palette ledger — the score source whose gaps this layer analyses.
- Vanilla conformal — the quantile machinery reused here.
- /widgets/command-palette: the consumer widget’s API.