Skip to Content

Conformal Rank Confidence

What goes wrong with a naive approach

The command-palette ledger produces log-posterior scores for every candidate. Two problems survive:

  1. Ties and near-ties. Two candidates with log-posterior differing by 10610^{-6} swap on the next keystroke. The top-k list shimmers.
  2. No notion of confidence. The palette can’t tell “these top three are clearly a group” from “there are seven candidates all essentially tied”. A good UI behaves differently in the two cases.

You cannot solve this with a deterministic comparator alone — you need a calibrated p-value for “is rank ii really distinct from rank i+1i+1?” Gap-based conformal rank confidence gives it to you in one pass.

Mental model

After scoring, look at the gaps between consecutive scores in the sorted list:

gi=scoreiscorei+1g_i = \text{score}_i - \text{score}_{i+1}

A big gap means the upper candidate is separated from the lower — rank ii is “confident”. A tiny gap means they’re essentially interchangeable.

Calibrate this intuition: compute a p-value from the gap relative to an empirical scale. If the p-value is below a threshold, the rank is confident and the displayed order is stable; otherwise the renderer uses a deterministic tie-break (e.g., alphabetical) to freeze the order.

Rank confidence is conformal prediction applied to the gap rather than the score. It doesn’t need a parametric model of what “close” means — the calibration set tells you what close looks like for this palette on this corpus.

The math

Gap-based p-value

Collect gap history over the last NN keystrokes. For gap gig_i at rank ii:

pi=min ⁣(1,  #{j:gjgi}N)p_i = \min\!\left(1,\; \frac{\#\{j : g_j \le g_i\}}{N}\right)

(an empirical CDF — the fraction of historical gaps that were no bigger than the current one). Small pip_i means the current gap is unusually small — low confidence, tie-break.

Equivalent one-liner for the current ledger:

pi=min(1,gi/gscale)p_i = \min(1, g_i / g_{\text{scale}})

where gscaleg_{\text{scale}} is a conformal quantile of past gaps. Either form works; the repository uses the direct quantile-based variant for speed.

Deterministic tie-break

When pi>τp_i > \tau (default τ=0.2\tau = 0.2), apply tie-break:

break(a,b)=(ida<idb)\text{break}(a, b) = (\text{id}_a < \text{id}_b)

The ID ordering is fixed, so the displayed top-k is stable across keystrokes that don’t change the score structure.

Stability of top-k

Because tie-break is deterministic on IDs, flicker only happens when (a) a new candidate enters the top-k with a confidently better score, or (b) a previously confident rank loses confidence. Both are user-visible events that belong in the UI; shimmering micro-scores no longer cause motion.

Worked example — typing g

Scores (log-posterior) for candidates matching g:

git.status +2.0 git.pull +1.99 git.push +1.98 goto.definition +0.50

Gaps: 0.01, 0.01, 1.48.

Assume historical gap quantile at 0.2: gscale=0.10g_{\text{scale}} = 0.10.

p(git.status vs git.pull) = min(1, 0.01/0.10) = 0.10 ≤ τ → tie-break p(git.pull vs git.push) = min(1, 0.01/0.10) = 0.10 ≤ τ → tie-break p(git.push vs goto.def) = min(1, 1.48/0.10) = 1.00 > τ → confident

The three git.* entries lock their order by ID (alphabetical: pull, push, status), while goto.definition stays distinctly fourth. No keystroke-level flicker between the top three.

Rust interface

crates/ftui-widgets/src/command_palette/rank_confidence.rs
use ftui_widgets::command_palette::rank_confidence::{RankConfidence, RankConfig}; let mut rc = RankConfidence::new(RankConfig { tau: 0.2, history_window: 256, }); rc.observe(&gaps_for_this_query); // update history let stable = rc.stable_order(&sorted_scores); // Vec<CandidateId>

stable_order returns the candidates re-sorted with tie-break applied. Tie zones are coalesced into groups that are internally ordered by ID; confident ranks keep their score-based positions.

How to debug

Tie-break outcomes travel alongside match-evidence lines — see command-palette-ledger for the base schema. An extra field records the confidence:

{"schema":"match-evidence","id":"git.pull","rank":2, "log_posterior":1.99,"p_gap_above":0.10,"tied_with":"git.status"}
FTUI_EVIDENCE_SINK=/tmp/ftui.jsonl cargo run -p ftui-demo-showcase # Find candidates that routinely land in tie zones: jq -c 'select(.schema=="match-evidence" and (.p_gap_above // 1) <= 0.2) | .id' /tmp/ftui.jsonl | sort | uniq -c | sort -rn

Pitfalls

Tie-break must be stable across sessions. Using hash order or insertion order means the “stable” display changes between runs. Use the canonical command ID (string), not a memory address.

Threshold τ\tau trades flicker for responsiveness. Large τ\tau (say 0.5) locks many confident ranks into tie-break, suppressing legitimate reorderings. Keep τ0.2\tau \le 0.2 unless you see evidence-sink data showing flicker under that setting.

Cross-references

Where next