Graphemes and Width Cache
Two caches sit on the text hot path: a grapheme pool that interns complex clusters into compact 32-bit IDs, and a width cache that maps graphemes to terminal cell widths. Together they turn “what does this piece of text occupy on screen?” from an expensive Unicode-table walk into an O(1) lookup.
Why graphemes are not characters
A grapheme is a user-perceived character. The family emoji
👨👩👧👦 is one grapheme composed of seven Unicode scalar values
joined by zero-width joiners. A hand-wave 🤚🏻 is one grapheme of two
scalars. Typing \u{65}\u{301} (e + combining acute) is one grapheme
of two scalars.
Cells on a terminal grid are grapheme-wide. Every piece of text math in FrankenTUI — cursor motion, line wrapping, hit testing, diff computation — operates on graphemes.
GraphemeId — a 32-bit reference
Complex graphemes don’t fit in the 4-byte CellContent inline storage,
so they’re interned in a GraphemePool and cells store a
GraphemeId(u32) instead of the string bytes.
pub struct GraphemeId(u32);
impl GraphemeId {
pub const MAX_SLOT: u32 = 0xFFFF; // 65,535 (16 bits)
pub const MAX_GENERATION: u16 = 2_047; // (11 bits)
pub const MAX_WIDTH: u8 = 15; // ( 4 bits)
pub const fn new(slot: u32, generation: u16, width: u8) -> Self;
pub const fn slot(self) -> usize;
pub const fn generation(self) -> u16;
pub const fn width(self) -> usize;
}The bit layout
31 30 29 28 27 26 25 24 ... 16 15 14 13 12 11 10 9 ... 0
┌──────────┬────────────────────┬──────────────────────────┐
│ width │ generation │ slot │
│ 4 bits │ 11 bits │ 16 bits │
└──────────┴────────────────────┴──────────────────────────┘
↑ ↑ ↑
width 0-15 2048 reuse versions 64K unique slots- Slot — index into the pool’s
Vec<Option<GraphemeSlot>>. 64K slots is enough for any single frame’s unique graphemes. - Generation — bumps when a slot is reused. Detects stale IDs (ABA) when a grapheme is released and a new one takes its slot.
- Width — cached display width (0–15). Avoids a pool lookup when all you want is how many cells the glyph takes.
Bit 31 of the enclosing CellContent is the type discriminator between
“direct char” (bit 31 = 0) and “GraphemeId” (bit 31 = 1).
The grapheme pool — reference-counted
pub struct GraphemePool { /* slots, generations, lookup, free-list */ }
impl GraphemePool {
pub fn intern(&mut self, text: &str, width: u8) -> GraphemeId;
pub fn get(&self, id: GraphemeId) -> Option<&str>;
pub fn retain(&mut self, id: GraphemeId);
pub fn release(&mut self, id: GraphemeId);
pub fn refcount(&self, id: GraphemeId) -> u32;
}intern— dedupes via a lookup map. The same👨👩👧👦interned twice returns the same ID with an incremented refcount.retain/release— when a cell is copied into another cell (buffer diff, scroll, partial redraw),retain. When a cell is overwritten,release. When refcount hits zero, the slot goes back on the free list and its generation is bumped.getvalidates generation before returning; a stale ID returnsNone.
The width cache
pub struct WidthCache { /* LruCache<u64, usize>, hits, misses */ }
impl WidthCache {
pub fn new(capacity: usize) -> Self;
pub fn with_default_capacity() -> Self; // 4096 entries
pub fn get_or_compute(&mut self, text: &str) -> usize;
pub fn contains(&self, text: &str) -> bool;
pub fn get(&mut self, text: &str) -> Option<usize>;
pub fn clear(&mut self);
pub fn reset_stats(&mut self);
pub fn stats(&self) -> CacheStats;
}
pub struct CacheStats {
pub hits: u64,
pub misses: u64,
pub size: usize,
pub capacity: usize,
}
impl CacheStats {
pub fn hit_rate(&self) -> f64;
}Keys are FxHash-hashed grapheme strings; values are the computed
display width. The cache is per-thread (there’s a thread_local!
accessor via cached_width) so it needs no locking.
The ~90 % hit-rate claim
A typical editing or reading surface shows the same graphemes over
and over — ASCII letters, a handful of symbols, a few emojis. The
measured hit rate in an editor-style workload stays above 90 %, often
above 98 %. You can confirm by reading cache.stats().hit_rate() in a
debug overlay.
ASCII fast-path
ASCII graphemes have width 1 (or 0 for controls). The cache short- circuits them without a hash lookup at all:
is ASCII? ─────yes────▶ return 1
│
no
▼
FxHash lookup ─ hit? ──yes──▶ return cached width
│
no
▼
compute + insertThe result: the hot path for the overwhelming majority of content is a branch and a return.
VS16 and emoji width
One environmental quirk: U+FE0F (Variation Selector-16) tells the
renderer “treat the preceding codepoint as emoji, not text.” Whether
that changes the width depends on the terminal. The width cache reads
the policy once at startup via OnceLock:
// Read at startup; subsequent env changes are ignored for determinism.
FTUI_EMOJI_VS16_WIDTH=2 // force width 2 for VS16 sequences
FTUI_EMOJI_VS16_WIDTH=1 // force width 1
FTUI_EMOJI_VS16_WIDTH=auto // default — let unicode-width decideThe OnceLock design is deliberate: changing the policy mid-process
would invalidate cached widths and produce layout drift.
Worked example — measure a styled line
use ftui_text::width_cache::WidthCache;
let mut cache = WidthCache::with_default_capacity();
let parts = ["Status: ", "OK ", "✅"];
let total: usize = parts.iter().map(|p| cache.get_or_compute(p)).sum();
// "Status: " = 8, "OK " = 3, "✅" = 2 -> 13
let stats = cache.stats();
println!("hit rate = {:.2}", stats.hit_rate());Repeated calls for the same graphemes hit the cache.
Pitfalls
Width is not char count. "a".len() == 1, "é".chars().count()
can be 1 or 2 (NFC vs NFD), and "✅".chars().count() == 1 but the
cell width is 2. Use the width cache, not .len().
GraphemeId generation wraps. 2047 reuses of the same slot and the
generation counter wraps to zero. This is fine because retain/
release discipline keeps live IDs in sync with the slot’s current
generation, but don’t store GraphemeIds across long periods
without the pool — they can alias after enough churn.
Don’t clear the width cache to force an update. The cache values are stable functions of the input. If you see wrong widths, fix the input (normalize first, or use the correct VS16 policy), don’t evict.