11. String Advanced Transformation¶
Overview¶
Normalization Forms¶
In Unicode text processing, "NFC", "NFD", "NFKC", and "NFKD" are the four normalization forms defined by the Unicode Standard. They ensure that text is stored in a consistent, comparable way, even when strings that look identical are encoded as different code point sequences.
Unicode Normalization Forms
| Form | Description |
|---|---|
| NFC | Canonical Composition – combines characters into composed forms. |
| NFD | Canonical Decomposition – breaks characters into base + diacritics. |
| NFKC | Compatibility Composition – like NFC but also replaces compatibility characters. |
| NFKD | Compatibility Decomposition – like NFD but also replaces compatibility characters. |
🧪 Example: Using All Forms
import unicodedata
text = "TÚI THÚ KHỦNG LONG BẠO CHÚA MỀM"
print("NFC:", unicodedata.normalize("NFC", text))
print("NFD:", unicodedata.normalize("NFD", text))
print("NFKC:", unicodedata.normalize("NFKC", text))
print("NFKD:", unicodedata.normalize("NFKD", text))
🧼 When to Use Each
- NFC: Best for display and storage (e.g., filenames, UI).
- NFD: Useful for accent stripping or character analysis.
- NFKC/NFKD: Ideal for search, comparison, or compatibility (e.g., turning “①” into “1”).
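As a sketch of the NFD-based accent stripping mentioned above (the helper name `strip_accents` is ours, not a standard API):

```python
import unicodedata

def strip_accents(text):
    # NFD splits each accented character into base + combining marks;
    # unicodedata.combining() > 0 identifies the marks, which we drop.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("café"))    # cafe
print(strip_accents("Cà phê"))  # Ca phe
```

Note that characters without a decomposition, such as Vietnamese "đ" (U+0111), keep their stroke: only combining marks are removed.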
🧩 1. NFD (Normalization Form D — Decomposed)¶
- D stands for Decomposition.
- It splits characters into their simplest combining forms.
Example:
- The single character “é” (U+00E9) is decomposed into “e” (U+0065) + “◌́” (U+0301).
- Used when you want to analyze or compare base characters and diacritics separately.
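The decomposition can be verified by listing the code points of the normalized string:

```python
import unicodedata

nfd = unicodedata.normalize("NFD", "\u00e9")   # "é" as one precomposed code point
print([f"U+{ord(ch):04X}" for ch in nfd])      # ['U+0065', 'U+0301']
```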
🧩 2. NFC (Normalization Form C — Composed)¶
- C stands for Composition.
- It's the canonical composition form: it combines decomposed characters into a single precomposed form whenever possible.
- Essentially, NFC is the “standard” normalized form most commonly used for storage and display.
Example:
- “e” + combining acute accent becomes a single character “é”.
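A quick check that NFC merges the two code points back into one:

```python
import unicodedata

decomposed = "e\u0301"                               # "e" + combining acute accent
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))                # 2 1
print(composed == "\u00e9")                          # True
```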
🧩 3. NFKD (Normalization Form KD — Compatibility Decomposition)¶
- K stands for Compatibility.
- Like NFD, but it also applies compatibility mappings, meaning that it may change the way certain symbols are represented for easier comparison.
- It decomposes characters and converts “visually similar but semantically different” characters into a common form.
Example:
- The ligature “ﬁ” (U+FB01) is left unchanged by NFD but becomes “f” + “i” under NFKD.
- NFKD is used when the appearance doesn't matter, only the semantic value.
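A small sketch contrasting canonical and compatibility decomposition on the fi ligature (U+FB01):

```python
import unicodedata

ligature = "\ufb01"  # "ﬁ" LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFD", ligature))   # unchanged: canonical forms keep it
print(unicodedata.normalize("NFKD", ligature))  # fi (compatibility mapping applied)
```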
🧩 4. NFKC (Normalization Form KC — Compatibility Composition)¶
- Like NFKD, but after compatibility decomposition, it re-composes characters when possible.
- Useful when you want to normalize for search, comparison, or user input, not preserving exact visual form.
Example:
- “①” (U+2460) becomes “1”, and “e” + combining acute accent becomes the single character “é”.
🔍 Summary Table¶
| Form | Meaning | Type | Example Input | Normalized Output |
|---|---|---|---|---|
| NFD | Canonical Decomposition | Decomposed | é | e + ́ |
| NFC | Canonical Composition | Composed | e + ́ | é |
| NFKD | Compatibility Decomposition | Decomposed (simplified form) | ① | 1 |
| NFKC | Compatibility Composition | Composed (simplified form) | ① | 1 |
💡 In Python¶
You can use the unicodedata module:
import unicodedata
text = "é"
print(unicodedata.normalize("NFD", text)) # e + ́
print(unicodedata.normalize("NFC", text)) # é
print(unicodedata.normalize("NFKD", "①")) # 1
print(unicodedata.normalize("NFKC", "①")) # 1
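Why this matters for comparison: two visually identical strings can compare unequal until both are normalized to the same form. A minimal sketch:

```python
import unicodedata

a = "caf\u00e9"   # "café" with composed é
b = "cafe\u0301"  # "café" with decomposed e + combining accent
print(a == b)  # False: different code point sequences
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
```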
The typographic (“curly”) punctuation characters below are instructive precisely because they are not the plain ASCII ones: they show that Unicode normalization, even NFKC/NFKD, does not convert every ASCII look-alike.
Let's analyze each one:
✦ Input characters¶
| Character | Description | Unicode code point | ASCII look-alike |
|---|---|---|---|
| ’ | Right single quotation mark | U+2019 | ' |
| – | En dash | U+2013 | - |
| “ | Left double quotation mark | U+201C | " |
| ” | Right double quotation mark | U+201D | " |
🧩 Normalization behavior¶
Let's see what happens under each normalization form:
| Character | NFD | NFC | NFKD | NFKC | Comment |
|---|---|---|---|---|---|
| ’ | ’ | ’ | ’ | ’ | Unchanged under every form: no canonical or compatibility mapping |
| – | – | – | – | – | Unchanged: the en dash has no compatibility decomposition |
| “ | “ | “ | “ | “ | Unchanged |
| ” | ” | ” | ” | ” | Unchanged |
🧠 Explanation¶
- NFD/NFC are canonical normalizations: they only affect characters with accents or composed forms, not punctuation, so the curly quotes and dashes remain unchanged.
- NFKD/NFKC are compatibility normalizations: they additionally simplify superscripts, circled numbers, ligatures, and other compatibility variants. Typographic punctuation, however, has no compatibility mapping in Unicode, so it also passes through untouched.
💻 Python Example¶
import unicodedata

text = "\u2019 \u2013 \u201c \u201d"  # ’ – “ ”
for ch in text.split():
    print(f"Original: {ch!r} {ord(ch):04X}")
    for form in ["NFD", "NFC", "NFKD", "NFKC"]:
        normalized = unicodedata.normalize(form, ch)
        print(f"  {form}: {normalized!r} {ord(normalized):04X}")
    print()
Output (simplified):
Original: '’' 2019
  NFD: '’' 2019
  NFC: '’' 2019
  NFKD: '’' 2019
  NFKC: '’' 2019

Original: '–' 2013
  NFD: '–' 2013
  NFC: '–' 2013
  NFKD: '–' 2013
  NFKC: '–' 2013

Original: '“' 201C
  NFD: '“' 201C
  NFC: '“' 201C
  NFKD: '“' 201C
  NFKC: '“' 201C
...
✅ Summary¶
| Form | Behavior on typographic punctuation |
|---|---|
| NFD/NFC | Keep curly quotes/dashes as they are (no change) |
| NFKD/NFKC | Also no change: these characters have no compatibility mapping, so converting them to plain ASCII ', ", - requires extra logic |
This result often surprises people: even NFKC/NFKD did not convert the curly quote (’, U+2019) into the ASCII ' (U+0027).
🧠 The short answer¶
👉 Because the Unicode Standard does not define a compatibility mapping for ’ (RIGHT SINGLE QUOTATION MARK, U+2019), normalization (even NFKC/NFKD) leaves it unchanged.
🔍 The deeper explanation¶
Unicode normalization forms (NFC, NFD, NFKC, NFKD) only use mappings defined in the Unicode Character Database (UCD), specifically:
- Canonical mappings → for NFC/NFD
- Compatibility mappings → for NFKC/NFKD
However, not every “visually similar” character is considered a compatibility equivalent. In particular, punctuation marks like ’, ‘, “, ”, and – are typographic symbols, not compatibility variants of ASCII ', ", or -.
That means there is no official compatibility decomposition for them, so normalization doesn't change them.
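This can be checked directly with `unicodedata.decomposition`, which returns the raw mapping from the Unicode Character Database (an empty string means no mapping exists):

```python
import unicodedata

# decomposition() returns the raw UCD mapping; "" means none exists.
print(unicodedata.decomposition("\u00e9"))  # 0065 0301      (canonical)
print(unicodedata.decomposition("\u2460"))  # <circle> 0031  (compatibility)
print(unicodedata.decomposition("\u2019"))  # (empty: normalization keeps the quote)
```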
🧩 Examples that do change under NFKC/NFKD¶
These do have compatibility mappings:
| Character | Unicode | Normalized NFKC | Notes |
|---|---|---|---|
① | U+2460 | 1 | circled digit one |
Å | U+212B | Å | Angstrom sign → Latin A with ring |
㎏ | U+338F | kg | compatibility ligature |
Ⅳ | U+2163 | IV | Roman numeral four |
…but punctuation marks like ’, –, “, ” stay the same.
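The table rows can be verified in a few lines:

```python
import unicodedata

for ch in "\u2460\u212b\u338f\u2163":  # ① Å ㎏ Ⅳ
    print(ch, "->", unicodedata.normalize("NFKC", ch))
# ① -> 1, Å -> Å (U+00C5), ㎏ -> kg, Ⅳ -> IV
```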
🧰 What to do if you want ASCII conversion¶
If your goal is to normalize text to plain ASCII, you need extra logic (beyond Unicode normalization). For example, you can use:
Option 1: unicodedata.normalize + manual mapping¶
import unicodedata

text = "\u201cHello \u2013 it\u2019s fine.\u201d"  # “Hello – it’s fine.”
normalized = unicodedata.normalize("NFKC", text)   # punctuation survives this step
# Then manually replace typographic punctuation
ascii_text = (
    normalized.replace("\u201c", '"')  # “
    .replace("\u201d", '"')            # ”
    .replace("\u2018", "'")            # ‘
    .replace("\u2019", "'")            # ’
    .replace("\u2013", "-")            # –
)
print(ascii_text)
# "Hello - it's fine."
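A slightly more reusable variant of the manual mapping above, using `str.translate` with a translation table (the names `PUNCT_MAP` and `asciify_punct` are illustrative, not a standard API):

```python
# PUNCT_MAP and asciify_punct are illustrative names, not a standard API.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",  # ‘ ’
    "\u201c": '"', "\u201d": '"',  # “ ”
    "\u2013": "-", "\u2014": "-",  # – —
})

def asciify_punct(text):
    return text.translate(PUNCT_MAP)

print(asciify_punct("\u201cHello \u2013 it\u2019s fine.\u201d"))
# "Hello - it's fine."
```

`str.translate` scans the string once, so it scales better than chained `replace` calls as the mapping grows.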
Option 2: Use a library like ftfy or unidecode¶
- `unidecode.unidecode(text)` transliterates the whole string to ASCII, and `ftfy.fix_text(text)` straightens curly quotes by default (its `uncurl_quotes` option); both are third-party packages.
✅ Summary¶
| Form | Changes ’‘–“”? | Why |
|---|---|---|
| NFC / NFD | ❌ No | They're not decomposable |
| NFKC / NFKD | ❌ No | No compatibility mapping in Unicode |
| ftfy / manual map | ✅ Yes | Explicit conversion to ASCII equivalents |
Understanding ASCII, Unicode, UTF-8, and Text Handling¶
1. ASCII, Unicode, and UTF-8 — Overview¶
ASCII¶
- Definition: American Standard Code for Information Interchange
- Range: 0–127 (7-bit)
- Purpose: Represents English letters, digits, and basic symbols
- Limitations: Cannot represent accented characters (e.g., á, ê) or characters from other languages.
Unicode¶
- Definition: A universal character set designed to cover all characters from all writing systems
- Range: Over 1.1 million code points (0–0x10FFFF)
- Encoding forms: Can be stored as UTF-8, UTF-16, UTF-32
- Purpose: Allows text from multiple languages to be stored and processed consistently.
UTF-8¶
- Definition: A variable-length encoding of Unicode characters
- Characteristics:
  - ASCII characters (0–127) use 1 byte
  - Other characters (e.g., Vietnamese á) use 2–4 bytes
- Advantages:
  - Backward compatible with ASCII
  - Efficient storage for mixed ASCII and non-ASCII text
  - Widely used on the web and in modern applications.
Key differences:
| Feature | ASCII | Unicode | UTF-8 |
|---|---|---|---|
| Characters | 128 | 1,112,064+ | Encoding scheme |
| Language support | English only | All languages | All languages |
| Storage | 1 byte | Depends on encoding (UTF-8,16,32) | 1–4 bytes per character |
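The per-character byte counts in the table are easy to confirm:

```python
# One code point each, but different UTF-8 byte counts.
for ch in ["A", "\u00e1", "\u4e2d", "\U0001f600"]:  # A, á, 中, 😀
    print(repr(ch), len(ch.encode("utf-8")), "byte(s)")
```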
2. Handling Non-ASCII Text (e.g., Vietnamese)¶
Vietnamese contains accented characters like á, à, ạ, â, etc. Handling them correctly requires:
a) Ensure proper encoding/decoding¶
- Always use UTF-8 for reading/writing files:
# Writing JSON with Vietnamese text
import json

data = {"unit": "Cái"}
with open("a.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
- Do not rely on system default encodings (like cp1252 on Windows), which may fail.
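A round-trip sketch (using a temporary file so nothing is left behind) confirming that explicit UTF-8 preserves the accented text:

```python
import json
import os
import tempfile

data = {"unit": "C\u00e1i"}  # {"unit": "Cái"}
path = os.path.join(tempfile.mkdtemp(), "a.json")

# Write and read back with an explicit encoding on both sides.
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

with open(path, "r", encoding="utf-8") as f:
    restored = json.load(f)

print(restored == data)  # True
```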
b) Use Unicode normalization¶
- Unicode allows characters to be represented in multiple ways:
  - Precomposed: `á` → single code point `U+00E1`
  - Decomposed: `a` + combining acute accent `U+0301`
- Normalize using Python's `unicodedata`:
import unicodedata

s = "a\u0301I"  # decomposed
print(unicodedata.normalize("NFC", s))  # precomposed: 'áI'
Normalization forms:
| Form | Effect |
|---|---|
| NFC | Compose to single code point where possible |
| NFD | Decompose into base + combining marks |
| NFKC/NFKD | Compatibility normalization (e.g., superscripts, ligatures) |
c) Fix mojibake / double-encoding¶
- Sometimes UTF-8 bytes are misinterpreted as Latin-1 or Windows-1252
- Recover by re-encoding the mis-decoded text: `text.encode("latin-1").decode("utf-8")`
3. General Rules to Detect Encoding Issues¶
- Check file metadata / BOM: UTF-8 files may start with `EF BB BF`
- Look for replacement characters: `�` indicates a decoding error
- Check for mojibake patterns: `CÃ¡I` instead of `CáI` → UTF-8 mis-decoded as Latin-1
- `C\u00e1I` → escaped Unicode sequence
- Python inspection:
import chardet
raw_bytes = open("a.json", "rb").read()
print(chardet.detect(raw_bytes)) # returns likely encoding
4. Examples: Methods and Differences¶
a) unicodedata.normalize¶
import unicodedata

s1 = "a\u0301I"  # a + combining accent
print(s1)  # prints: áI
print(unicodedata.normalize("NFC", s1))  # prints: áI
Effect: fixes different representations of the same character.
b) ensure_ascii=False in JSON¶
import json

data = {"unit": "Cái"}
json_str = json.dumps(data, ensure_ascii=True)
print(json_str) # {"unit": "C\u00e1i"}
json_str2 = json.dumps(data, ensure_ascii=False)
print(json_str2) # {"unit": "Cái"}
Effect:
- `ensure_ascii=True` → escapes non-ASCII as `\uXXXX`
- `ensure_ascii=False` → writes characters directly in UTF-8
c) Fixing mojibake¶
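A minimal sketch of the recovery step (re-encode with the wrong codec, then decode as UTF-8):

```python
# "Cái" was written as UTF-8 but decoded as Latin-1, yielding "CÃ¡i".
mojibake = "C\u00c3\u00a1i"
fixed = mojibake.encode("latin-1").decode("utf-8")
print(fixed)  # Cái
```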
Effect: recovers text that was double-encoded.
Summary¶
- ASCII → only English, 7-bit
- Unicode → universal code points for all languages
- UTF-8 → variable-length encoding, ASCII-compatible, stores all Unicode
- Handling Vietnamese:
  - Always use UTF-8
  - Normalize combining marks (`unicodedata.normalize`)
  - Fix mojibake if needed (`latin-1 → utf-8`)
- Detection rules:
  - Look for replacement characters or unexpected escapes
  - Use libraries like `chardet`
- Practical examples:
  - `unicodedata.normalize` → standardize accents
  - `ensure_ascii=False` → human-readable UTF-8 JSON
  - `latin-1 → utf-8` → recover corrupted text