Unicode Normalization
Unicode is generous with ways to encode “the same” string. é can be
one codepoint (U+00E9, precomposed) or two (U+0065 e + U+0301 COMBINING
ACUTE ACCENT, decomposed). Both render identically. Neither compares
equal to the other under ==. Normalization is the step that picks a
canonical form and turns this problem into a non-problem.
ftui_text::normalization wraps the unicode-normalization crate with
a small, opinionated surface.
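To make the mismatch concrete, here is a minimal std-only sketch that does not touch ftui_text at all. The helpers code_points and spellings_equal are hypothetical names introduced for illustration; they only expose what plain == sees when given the two spellings of é.

```rust
/// Hypothetical helper: list the codepoints of `s` as "U+XXXX" strings.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// Hypothetical helper: compare the two spellings of é with plain `==`.
fn spellings_equal() -> bool {
    let precomposed = "\u{00E9}"; // é as one codepoint
    let decomposed = "e\u{0301}"; // e followed by COMBINING ACUTE ACCENT
    precomposed == decomposed // str equality is byte-wise, so this is false
}
```

Calling code_points on each spelling shows one codepoint for the precomposed form and two for the decomposed form, which is exactly the difference that == trips over.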
The four forms
pub enum NormForm {
    Nfc,  // Canonical Composition (default for storage)
    Nfd,  // Canonical Decomposition (default for processing)
    Nfkc, // Compatibility Composition
    Nfkd, // Compatibility Decomposition
}

| Form | Semantics | Typical use |
|---|---|---|
| NFC | Decompose, then recompose into precomposed characters where possible | storage, display |
| NFD | Decompose into base + combining marks | text processing |
| NFKC | NFC + compatibility substitutions (e.g. ﬁ → fi) | search, indexing |
| NFKD | NFD + compatibility substitutions | case-folding pipelines |
Visualized
Input: café (with either precomposed é or e + ◌́)

NFC:  c a f é      ← single-codepoint é (U+00E9)
NFD:  c a f e ◌́    ← base + combining mark (U+0065 U+0301)

NFKC/NFKD: identical in this case; differences show up on ﬁ, ﬀ, ①, etc.

The API
pub fn normalize(s: &str, form: NormForm) -> String;
pub fn is_normalized(s: &str, form: NormForm) -> bool;
pub fn normalize_for_search(s: &str) -> String; // NFKC + case-fold
pub fn eq_normalized(a: &str, b: &str, form: NormForm) -> bool;
pub fn nfc_iter(s: &str) -> impl Iterator<Item = char> + '_;
pub fn nfd_iter(s: &str) -> impl Iterator<Item = char> + '_;
pub fn nfkc_iter(s: &str) -> impl Iterator<Item = char> + '_;
pub fn nfkd_iter(s: &str) -> impl Iterator<Item = char> + '_;

- normalize: allocate once and return the normalized string.
- is_normalized: O(n) check without allocation; cheap fast path.
- eq_normalized: normalize both sides, then compare; the correct way to ask “are these the same user-perceived string?”.
- normalize_for_search: NFKC + case folding. Use this for search-index keys and fuzzy-search inputs.
- *_iter: streaming normalization when you don’t want the String allocation.
When to normalize where
Storage → NFC
Store text in NFC when it lands in a rope, a persistent database, or a file. It is usually the most compact of the four forms and the one most OS APIs and other software expect.
Processing → NFD
When you need to inspect combining marks, strip accents, or analyze scripts, decompose to NFD first. Then the base character and each mark are separate codepoints you can examine independently.
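As a rough illustration of why decomposing first helps, the sketch below strips accents from text that is assumed to be in NFD already. It is std-only and deliberately simplified: strip_accents_nfd is a hypothetical helper, and it treats only the Combining Diacritical Marks block (U+0300 to U+036F) as marks, whereas a real implementation would consult the Unicode General_Category or canonical combining class.

```rust
/// Hypothetical helper: remove combining marks from text that is ALREADY in NFD.
///
/// Assumption: only U+0300..=U+036F (Combining Diacritical Marks) counts as a
/// mark here. Marks from other blocks are left in place by this sketch.
fn strip_accents_nfd(nfd: &str) -> String {
    nfd.chars()
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        .collect()
}
```

Given decomposed input such as "cafe\u{0301}", the filter drops the combining accent and leaves "cafe". Given precomposed input such as "caf\u{00E9}", nothing is removed, because é is a single codepoint with no separate mark to filter out; that is the reason to normalize to NFD before running this kind of pass.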
Search → NFKC + case fold (via normalize_for_search)
Search indices should not distinguish café from cafe\u{0301} nor
ﬁle from file. Canonical equivalence is too strict; compatibility
equivalence plus case folding is the intuitive “same word” relation.
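The shape of such a search key can be sketched with the standard library alone. This is a toy, not real NFKC and not the implementation of normalize_for_search: toy_search_key is a hypothetical function that expands just two compatibility ligatures and then lowercases, and lowercasing only approximates full Unicode case folding.

```rust
/// Hypothetical toy search key: NOT real NFKC, NOT full case folding.
///
/// Assumptions: only the ligatures U+FB00 (ﬀ) and U+FB01 (ﬁ) are expanded,
/// and `char::to_lowercase` stands in for case folding.
fn toy_search_key(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '\u{FB00}' => out.push_str("ff"), // LATIN SMALL LIGATURE FF
            '\u{FB01}' => out.push_str("fi"), // LATIN SMALL LIGATURE FI
            _ => out.extend(c.to_lowercase()), // may yield more than one char
        }
    }
    out
}
```

With this toy, "\u{FB01}le", "file", and "FILE" all map to the key "file", which is the collapsing behavior a search index wants. Canonical equivalence (precomposed versus decomposed é) is not handled here; a real key function needs an actual normalization pass for that.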
Shaping → NFC first
The shaping layer works best with composed forms because OpenType font
tables are usually keyed on precomposed glyphs. Normalize to NFC
before building TextRuns.
Worked example
use ftui_text::normalization::{normalize_for_search, eq_normalized, NormForm};
use std::collections::HashMap;

// Building a search index ("ﬁle" starts with the ligature U+FB01)
let docs = ["cafe", "café", "ﬁle", "file"];
// "ﬁle" and "file" share one key; the later index wins in the map
let index: HashMap<String, usize> = docs
    .iter()
    .enumerate()
    .map(|(i, d)| (normalize_for_search(d), i))
    .collect();

// User searches for "cafe"
let query = normalize_for_search("cafe");
assert!(index.contains_key(&query));

// Canonical equality treats precomposed é and e + U+0301 as the same string
assert!(eq_normalized("café", "cafe\u{0301}", NormForm::Nfc));

Fast-path: skip normalization when already normal
use ftui_text::normalization::{is_normalized, normalize, NormForm};

fn store(s: &str) -> String {
    if is_normalized(s, NormForm::Nfc) {
        s.to_owned() // already NFC: plain copy, no normalization pass
    } else {
        normalize(s, NormForm::Nfc) // rebuild in NFC
    }
}

For mostly-ASCII content, is_normalized is extremely fast and lets
you skip the normalization pass in the common case. Note that store
still copies the input, because it returns an owned String.
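If the goal is to skip the allocation as well, one option is to return a Cow so that already-normal input is borrowed. The sketch below is std-only and makes two loud assumptions: store_cow is a hypothetical function, not part of ftui_text, and it swaps in str::is_ascii as the fast-path check (pure ASCII text is unchanged by all four normalization forms), with the actual normalizer passed in as a closure.

```rust
use std::borrow::Cow;

/// Hypothetical variant of `store` that borrows when no work is needed.
///
/// Assumption: `s.is_ascii()` stands in for a real "is already normalized"
/// check. It is conservative: non-ASCII text that is already normalized
/// still goes through `normalize`.
fn store_cow<'a, F>(s: &'a str, normalize: F) -> Cow<'a, str>
where
    F: FnOnce(&str) -> String,
{
    if s.is_ascii() {
        Cow::Borrowed(s) // no allocation, no normalization pass
    } else {
        Cow::Owned(normalize(s)) // allocate only when rewriting is possible
    }
}
```

In real use the closure would presumably be something like |s| normalize(s, NormForm::Nfc), and the ASCII test could be replaced by is_normalized(s, NormForm::Nfc) to borrow in more cases.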
Pitfalls
Normalization is not case folding. normalize("Café", NormForm::Nfc) is
still "Café". If you want case-insensitive equality, use
normalize_for_search or explicitly fold case afterward.
NFKC loses information. ① becomes 1, ﬁ becomes fi. That’s
great for search, terrible for display or storage. Don’t NFKC your
rope.
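Normalization changes length. After normalizing, the char count and byte count can both change. The canonical forms (NFC, NFD) are designed to leave grapheme cluster boundaries intact, but the compatibility forms (NFKC, NFKD) can change the grapheme count too, for example when ﬁ expands to fi. If you cache indices into the string, recompute them after normalizing.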
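The length shift is easy to observe with the standard library alone. In this minimal sketch, lengths and offset_of are hypothetical helpers; offset_of stands in for any cached byte index into the string.

```rust
/// Hypothetical helper: (byte length, char count) of `s`.
fn lengths(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Hypothetical helper: byte offset of `marker`, standing in for a cached index.
fn offset_of(s: &str, marker: char) -> Option<usize> {
    s.find(marker)
}
```

For the decomposed spelling "cafe\u{0301}" the result of lengths is (6, 5); for the precomposed spelling "caf\u{00E9}" it is (5, 4). Likewise, a trailing '!' sits at byte offset 6 in the decomposed form and at offset 5 in the precomposed one, so an index cached before normalizing would point at the wrong byte afterward.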