Graphemes and Width Cache

Two caches sit on the text hot path: a grapheme pool that interns complex clusters into compact 32-bit IDs, and a width cache that maps graphemes to terminal cell widths. Together they turn “what does this piece of text occupy on screen?” from an expensive Unicode-table walk into an O(1) lookup.

Why graphemes are not characters

A grapheme is a user-perceived character. The family emoji 👨‍👩‍👧‍👦 is one grapheme composed of seven Unicode scalar values joined by zero-width joiners. A hand-wave 🤚🏻 is one grapheme of two scalars. Typing \u{65}\u{301} (e + combining acute) is one grapheme of two scalars.

Cells on a terminal grid are grapheme-wide. Every piece of text math in FrankenTUI — cursor motion, line wrapping, hit testing, diff computation — operates on graphemes.

`GraphemeId` — a 32-bit reference

Complex graphemes don’t fit in the 4-byte CellContent inline storage, so they’re interned in a GraphemePool and cells store a GraphemeId(u32) instead of the string bytes.


pub struct GraphemeId(u32);
 
impl GraphemeId {
    pub const MAX_SLOT: u32     = 0xFFFF;    // 65,535  (16 bits)
    pub const MAX_GENERATION: u16 = 2_047;   //          (11 bits)
    pub const MAX_WIDTH: u8     = 15;        //          ( 4 bits)
 
    pub const fn new(slot: u32, generation: u16, width: u8) -> Self;
    pub const fn slot(self)       -> usize;
    pub const fn generation(self) -> u16;
    pub const fn width(self)      -> usize;
}

The bit layout


    31 30 29 28 27 26 25 24 ... 16 15 14 13 12 11 10  9 ...  0
   ┌──────────┬────────────────────┬──────────────────────────┐
   │  width   │     generation     │           slot           │
   │  4 bits  │      11 bits       │          16 bits         │
   └──────────┴────────────────────┴──────────────────────────┘
        ↑                ↑                        ↑
        width 0-15       2048 reuse versions      64K unique slots

Slot — index into the pool’s Vec<Option<GraphemeSlot>>. 64K slots is enough for any single frame’s unique graphemes.
Generation — bumps when a slot is reused. Detects stale IDs (ABA) when a grapheme is released and a new one takes its slot.
Width — cached display width (0–15). Avoids a pool lookup when all you want is how many cells the glyph takes.

Bit 31 of the enclosing CellContent is the type discriminator between “direct char” (bit 31 = 0) and “GraphemeId” (bit 31 = 1).

The grapheme pool — reference-counted


pub struct GraphemePool { /* slots, generations, lookup, free-list */ }
 
impl GraphemePool {
    pub fn intern(&mut self, text: &str, width: u8) -> GraphemeId;
    pub fn get(&self, id: GraphemeId) -> Option<&str>;
    pub fn retain(&mut self, id: GraphemeId);
    pub fn release(&mut self, id: GraphemeId);
    pub fn refcount(&self, id: GraphemeId) -> u32;
}

intern — dedupes via a lookup map. The same 👨‍👩‍👧‍👦 interned twice returns the same ID with an incremented refcount.
retain / release — when a cell is copied into another cell (buffer diff, scroll, partial redraw), retain. When a cell is overwritten, release. When refcount hits zero, the slot goes back on the free list and its generation is bumped.
get validates generation before returning; a stale ID returns None.

The width cache


pub struct WidthCache { /* LruCache<u64, usize>, hits, misses */ }
 
impl WidthCache {
    pub fn new(capacity: usize) -> Self;
    pub fn with_default_capacity() -> Self;          // 4096 entries
 
    pub fn get_or_compute(&mut self, text: &str) -> usize;
    pub fn contains(&self, text: &str) -> bool;
    pub fn get(&mut self, text: &str) -> Option<usize>;
 
    pub fn clear(&mut self);
    pub fn reset_stats(&mut self);
    pub fn stats(&self) -> CacheStats;
}
 
pub struct CacheStats {
    pub hits:   u64,
    pub misses: u64,
    pub size:   usize,
    pub capacity: usize,
}
 
impl CacheStats {
    pub fn hit_rate(&self) -> f64;
}

Keys are FxHash-hashed grapheme strings; values are the computed display width. The cache is per-thread (there’s a thread_local! accessor via cached_width) so it needs no locking.

The ~90 % hit-rate claim

A typical editing or reading surface shows the same graphemes over and over — ASCII letters, a handful of symbols, a few emojis. The measured hit rate in an editor-style workload stays above 90 %, often above 98 %. You can confirm by reading cache.stats().hit_rate() in a debug overlay.

ASCII fast-path

ASCII graphemes have width 1 (or 0 for controls). The cache short- circuits them without a hash lookup at all:


   is ASCII? ─────yes────▶ return 1
       │
       no
       ▼
   FxHash lookup ─ hit? ──yes──▶ return cached width
       │
       no
       ▼
   compute + insert

The result: the hot path for the overwhelming majority of content is a branch and a return.

VS16 and emoji width

One environmental quirk: U+FE0F (Variation Selector-16) tells the renderer “treat the preceding codepoint as emoji, not text.” Whether that changes the width depends on the terminal. The width cache reads the policy once at startup via OnceLock:


// Read at startup; subsequent env changes are ignored for determinism.
FTUI_EMOJI_VS16_WIDTH=2    // force width 2 for VS16 sequences
FTUI_EMOJI_VS16_WIDTH=1    // force width 1
FTUI_EMOJI_VS16_WIDTH=auto // default — let unicode-width decide

The OnceLock design is deliberate: changing the policy mid-process would invalidate cached widths and produce layout drift.

Worked example — measure a styled line

measure.rs


use ftui_text::width_cache::WidthCache;
 
let mut cache = WidthCache::with_default_capacity();
 
let parts = ["Status: ", "OK ", "✅"];
let total: usize = parts.iter().map(|p| cache.get_or_compute(p)).sum();
// "Status: " = 8, "OK " = 3, "✅" = 2  -> 13
 
let stats = cache.stats();
println!("hit rate = {:.2}", stats.hit_rate());

Repeated calls for the same graphemes hit the cache.

Pitfalls

Width is not char count. "a".len() == 1, "é".chars().count() can be 1 or 2 (NFC vs NFD), and "✅".chars().count() == 1 but the cell width is 2. Use the width cache, not .len().

GraphemeId generation wraps. 2047 reuses of the same slot and the generation counter wraps to zero. This is fine because retain/ release discipline keeps live IDs in sync with the slot’s current generation, but don’t store GraphemeIds across long periods without the pool — they can alias after enough churn.

Don’t clear the width cache to force an update. The cache values are stable functions of the input. If you see wrong widths, fix the input (normalize first, or use the correct VS16 policy), don’t evict.

Where to go next

Grapheme counts and iteration on top of stored text.

Rope

The other cache on the text hot path.

Shaping

The styled units whose cell_length consults this cache.

Spans and segments

Accessibility consumers of grapheme data.

A11y — BiDi & RTL support

How this piece fits in text.

Text overview