Unicode Normalization

Unicode is generous with ways to encode “the same” string. é can be one codepoint (U+00E9, precomposed) or two (U+0065 e + U+0301 COMBINING ACUTE ACCENT, decomposed). Both render identically. Neither compares equal to the other under ==. Normalization is the step that picks a canonical form and turns this problem into a non-problem.
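
The mismatch is easy to see with nothing but the standard library. The sketch below is illustrative only: the constant and function names are made up for this example and are not part of ftui_text.

```rust
/// "café" spelled with precomposed é (U+00E9).
const PRECOMPOSED: &str = "caf\u{00E9}";
/// "café" spelled with e followed by U+0301 COMBINING ACUTE ACCENT.
const DECOMPOSED: &str = "cafe\u{0301}";

/// Lists the code points of a string as u32 values so they can be inspected.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}
```

PRECOMPOSED == DECOMPOSED evaluates to false even though both display as café, because code_points returns four values for one and five for the other.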

ftui_text::normalization wraps the unicode-normalization crate with a small, opinionated surface.

The four forms


pub enum NormForm {
    Nfc,        // Canonical Composition   (default for storage)
    Nfd,        // Canonical Decomposition (default for processing)
    Nfkc,       // Compatibility Composition
    Nfkd,       // Compatibility Decomposition
}

Form	Semantics	Typical use
NFC	Compose into precomposed characters where possible	storage, display
NFD	Decompose into base + combining marks	text processing
NFKC	NFC + compatibility substitutions (e.g. ﬁ → fi)	search, indexing
NFKD	NFD + compatibility substitutions	case-folding pipelines

Visualized


  Input:        café  (with either precomposed é or e + ◌́)

  NFC:          c a f é           ← single-codepoint é
                     (U+00E9)

  NFD:          c a f e ◌́         ← base + combining mark
                     (U+0065 U+0301)

  NFKC/NFKD:    identical in this case; differences show up on ﬁ,
                ﬀ, ①, etc.

The API


pub fn normalize(s: &str, form: NormForm) -> String;
pub fn is_normalized(s: &str, form: NormForm) -> bool;
pub fn normalize_for_search(s: &str) -> String;       // NFKC + case-fold
pub fn eq_normalized(a: &str, b: &str, form: NormForm) -> bool;
 
pub fn nfc_iter(s: &str)  -> impl Iterator<Item = char> + '_;
pub fn nfd_iter(s: &str)  -> impl Iterator<Item = char> + '_;
pub fn nfkc_iter(s: &str) -> impl Iterator<Item = char> + '_;
pub fn nfkd_iter(s: &str) -> impl Iterator<Item = char> + '_;

normalize — allocate once and return the normalized string.
is_normalized — O(n) check without allocation; cheap fast path.
eq_normalized — normalize both sides then compare; the correct way to ask “are these the same user-perceived string?”.
normalize_for_search — NFKC + case-folding. Use this for search-index keys and fuzzy-search inputs.
*_iter — streaming normalization when you don’t want the String allocation.

When to normalize where

Storage → NFC

Store text in NFC when it lands in a rope, a persistent database, or a file. NFC is usually the most compact of the canonical forms, and it is the form most APIs, file formats, and other software expect to receive.

Processing → NFD

When you need to inspect combining marks, strip accents, or analyze scripts, decompose to NFD first. Then the base character and each mark are separate codepoints you can examine independently.
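
As a rough illustration of why decomposition helps, here is a minimal accent-stripping sketch. It assumes its input is already NFD and only recognizes the Combining Diacritical Marks block (U+0300 to U+036F). Both function names are hypothetical, and a real implementation would consult the Unicode mark categories instead of a single block.

```rust
/// Returns true for code points in the Combining Diacritical Marks block
/// (U+0300..=U+036F). Other combining blocks are deliberately ignored here.
fn is_basic_combining_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Removes basic combining marks. The input is assumed to be NFD already;
/// precomposed characters pass through untouched.
fn strip_marks(nfd: &str) -> String {
    nfd.chars().filter(|c| !is_basic_combining_mark(*c)).collect()
}
```

Given decomposed input such as "cafe\u{0301}", strip_marks returns "cafe". Given the precomposed "caf\u{00E9}", it returns the string unchanged, which is the reason to decompose first.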

Search → NFKC + case fold (via `normalize_for_search`)

Search indices should not distinguish café from cafe\u{0301} nor ﬁle from file. Canonical equivalence is too strict; compatibility equivalence plus case folding is the intuitive “same word” relation.

Shaping → NFC first

The shaping layer works best with composed forms because OpenType font tables are usually keyed on precomposed glyphs. Normalize to NFC before building TextRuns.

Worked example

search_index.rs


use std::collections::HashMap;

use ftui_text::normalization::{eq_normalized, normalize_for_search, NormForm};

fn main() {
    // Building a search index. "ﬁle" and "file" fold to the same key,
    // so the map keeps only the later of the two.
    let docs = ["cafe", "café", "ﬁle", "file"];
    let index: HashMap<String, usize> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (normalize_for_search(d), i))
        .collect();

    // User searches for "cafe"
    let query = normalize_for_search("cafe");
    assert!(index.contains_key(&query));

    // Canonical equivalence: precomposed é and e + U+0301 compare equal
    // once both sides are normalized to NFC.
    assert!(eq_normalized("café", "cafe\u{0301}", NormForm::Nfc));
}
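
Several documents can fold to the same search key (here ﬁle and file), so an index that maps each key to a single position silently drops all but one of them. One possible shape for a multi-valued index is sketched below. To stay self-contained it uses toy_search_key, a hypothetical stand-in that only expands the ﬁ ligature and lowercases; it is not NFKC plus case folding, and in real code normalize_for_search would take its place.

```rust
use std::collections::HashMap;

/// Toy stand-in for a search-key function (hypothetical): expands the ﬁ
/// ligature (U+FB01) and lowercases. Real NFKC + case folding covers far more.
fn toy_search_key(s: &str) -> String {
    s.replace('\u{FB01}', "fi").to_lowercase()
}

/// Maps each search key to every document position that produced it.
fn build_index(docs: &[&str]) -> HashMap<String, Vec<usize>> {
    let mut index: HashMap<String, Vec<usize>> = HashMap::new();
    for (i, doc) in docs.iter().enumerate() {
        // Colliding keys append to the same bucket instead of overwriting.
        index.entry(toy_search_key(doc)).or_default().push(i);
    }
    index
}
```

With this layout a query for "file" can return both the ligature spelling and the plain one.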

Fast-path: skip normalization when already normal

fast_path.rs


use ftui_text::normalization::{is_normalized, normalize, NormForm};
 
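fn store(s: &str) -> String {
    if is_normalized(s, NormForm::Nfc) {
        s.to_owned()                       // plain copy, no normalization pass
    } else {
        normalize(s, NormForm::Nfc)        // rebuild
    }
}

For mostly-ASCII content, is_normalized is extremely fast and lets you skip the normalization pass in the common case. Note that s.to_owned() still allocates: the saving is the normalization work, not the copy.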
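
If the caller can accept a borrowed result, the copy can be skipped as well by returning std::borrow::Cow. The sketch below is an assumption-laden illustration, not part of ftui_text: store_cow is a hypothetical name, the normalizer is passed in as a closure so the example has no dependencies, and the fast path is s.is_ascii(), which is safe because ASCII text is unchanged by every normalization form. With ftui_text available, is_normalized(s, NormForm::Nfc) would be the broader check.

```rust
use std::borrow::Cow;

/// Returns the input borrowed when it is pure ASCII (already normalized in
/// every Unicode normalization form). Otherwise defers to `normalize`, a
/// caller-supplied stand-in for a real NFC normalizer.
fn store_cow<'a, F>(s: &'a str, normalize: F) -> Cow<'a, str>
where
    F: Fn(&str) -> String,
{
    if s.is_ascii() {
        Cow::Borrowed(s) // no allocation, no normalization pass
    } else {
        Cow::Owned(normalize(s)) // allocate only when work may be needed
    }
}
```

The trade-off is that the borrowed variant ties the result to the lifetime of the input, so it suits read-mostly paths better than long-term storage.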

Pitfalls

Normalization is not case folding. normalize("Café", NormForm::Nfc) is still "Café". If you want case-insensitive equality, use normalize_for_search or explicitly fold case afterward.

NFKC loses information. ① becomes 1, ﬁ becomes fi. That’s great for search, terrible for display or storage. Don’t NFKC your rope.

Normalization changes length. After NFC, the char count, byte count, and grapheme count can all shift. If you cache indices into the string, recompute them after normalizing.
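
A small standard-library sketch of the effect on cached offsets. The two literals in the usage note are hand-written decomposed and precomposed spellings of the same text, and the helper names are hypothetical.

```rust
/// Byte offset of the first '!' in the string, if any.
fn bang_offset(s: &str) -> Option<usize> {
    s.find('!')
}

/// (byte length, char count) of a string; both can change under normalization.
fn sizes(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For the decomposed "cafe\u{0301}!" the '!' sits at byte offset 6 and sizes is (7, 6). For the precomposed "caf\u{00E9}!" the offset is 5 and sizes is (6, 5). An offset cached before normalization would point one byte past the intended character afterward.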

Where to go next

Rope: the storage layer; store NFC here.

Shaping: why NFC before shaping is the right default.

Grapheme / width: grapheme counts after normalization.

A11y — BiDi & RTL support: how normalization interacts with screen reader output.

Text overview: how this piece fits in text.