Conformal Rank Confidence

What goes wrong with a naive approach

The command-palette ledger produces log-posterior scores for every candidate. Two problems survive:

Ties and near-ties. Two candidates with log-posterior differing by $10^{-6}$ swap on the next keystroke. The top-k list shimmers.
No notion of confidence. The palette can’t tell “these top three are clearly a group” from “there are seven candidates all essentially tied”. A good UI behaves differently in the two cases.

You cannot solve this with a deterministic comparator alone — you need a calibrated p-value for “is rank $i$ really distinct from rank $i+1$ ?” Gap-based conformal rank confidence gives it to you in one pass.

Mental model

After scoring, look at the gaps between consecutive scores in the sorted list:

g_i = \text{score}_i - \text{score}_{i+1}

A big gap means the upper candidate is separated from the lower — rank $i$ is “confident”. A tiny gap means they’re essentially interchangeable.

Calibrate this intuition: compute a p-value from the gap relative to an empirical scale. If the p-value is below a threshold, the rank is confident and the displayed order is stable; otherwise the renderer uses a deterministic tie-break (e.g., alphabetical) to freeze the order.

Rank confidence is conformal prediction applied to the gap rather than the score. It doesn’t need a parametric model of what “close” means — the calibration set tells you what close looks like for this palette on this corpus.

The math

Gap-based p-value

Collect gap history over the last $N$ keystrokes. For gap $g_i$ at rank $i$ :

p_i = \min\!\left(1,\; \frac{\#\{j : g_j \le g_i\}}{N}\right)

(an empirical CDF — the fraction of historical gaps that were no bigger than the current one). Small $p_i$ means the current gap is unusually small — low confidence, tie-break.

Equivalent one-liner for the current ledger:

p_i = \min(1, g_i / g_{\text{scale}})

where $g_{\text{scale}}$ is a conformal quantile of past gaps. Either form works; the repository uses the direct quantile-based variant for speed.

Deterministic tie-break

When $p_i > \tau$ (default $\tau = 0.2$ ), apply tie-break:

\text{break}(a, b) = (\text{id}_a < \text{id}_b)

The ID ordering is fixed, so the displayed top-k is stable across keystrokes that don’t change the score structure.

Stability of top-k

Because tie-break is deterministic on IDs, flicker only happens when (a) a new candidate enters the top-k with a confidently better score, or (b) a previously confident rank loses confidence. Both are user-visible events that belong in the UI; shimmering micro-scores no longer cause motion.

Worked example — typing `g`

Scores (log-posterior) for candidates matching g:


git.status       +2.0
git.pull         +1.99
git.push         +1.98
goto.definition  +0.50

Gaps: 0.01, 0.01, 1.48.

Assume historical gap quantile at 0.2: $g_{\text{scale}} = 0.10$ .


p(git.status vs git.pull)  = min(1, 0.01/0.10) = 0.10   ≤ τ   → tie-break
p(git.pull  vs git.push)   = min(1, 0.01/0.10) = 0.10   ≤ τ   → tie-break
p(git.push  vs goto.def)   = min(1, 1.48/0.10) = 1.00   > τ   → confident

The three git.* entries lock their order by ID (alphabetical: pull, push, status), while goto.definition stays distinctly fourth. No keystroke-level flicker between the top three.

Rust interface

crates/ftui-widgets/src/command_palette/rank_confidence.rs


use ftui_widgets::command_palette::rank_confidence::{RankConfidence, RankConfig};
 
let mut rc = RankConfidence::new(RankConfig {
    tau: 0.2,
    history_window: 256,
});
 
rc.observe(&gaps_for_this_query);          // update history
let stable = rc.stable_order(&sorted_scores); // Vec<CandidateId>

stable_order returns the candidates re-sorted with tie-break applied. Tie zones are coalesced into groups that are internally ordered by ID; confident ranks keep their score-based positions.

How to debug

Tie-break outcomes travel alongside match-evidence lines — see command-palette-ledger for the base schema. An extra field records the confidence:


{"schema":"match-evidence","id":"git.pull","rank":2,
 "log_posterior":1.99,"p_gap_above":0.10,"tied_with":"git.status"}


FTUI_EVIDENCE_SINK=/tmp/ftui.jsonl cargo run -p ftui-demo-showcase
 
# Find candidates that routinely land in tie zones:
jq -c 'select(.schema=="match-evidence" and (.p_gap_above // 1) <= 0.2)
       | .id' /tmp/ftui.jsonl | sort | uniq -c | sort -rn

Pitfalls

Tie-break must be stable across sessions. Using hash order or insertion order means the “stable” display changes between runs. Use the canonical command ID (string), not a memory address.

Threshold $\tau$ trades flicker for responsiveness. Large $\tau$ (say 0.5) locks many confident ranks into tie-break, suppressing legitimate reorderings. Keep $\tau \le 0.2$ unless you see evidence-sink data showing flicker under that setting.

Cross-references

Command-palette ledger — the score source whose gaps this layer analyses.
Vanilla conformal — the quantile machinery reused here.
/widgets/command-palette — the consumer widget’s API.

Where next

How this piece fits in intelligence.

Intelligence overview

The quantile machinery reused here.

Vanilla conformal