
11. String Advanced Transformation

Overview

Normalization Forms

In Unicode text processing, "NFC", "NFD", "NFKC", and "NFKD" are four normalization forms defined by the Unicode Standard to ensure that text is stored in a consistent and comparable way — even if it visually looks the same but is encoded differently.

Unicode Normalization Forms

Form Description
NFC Canonical Composition – combines characters into composed forms.
NFD Canonical Decomposition – breaks characters into base + diacritics.
NFKC Compatibility Composition – like NFC but also replaces compatibility characters.
NFKD Compatibility Decomposition – like NFD but also replaces compatibility characters.

🧪 Example: Using All Forms

import unicodedata

text = "TÚI THÚ KHỦNG LONG BẠO CHÚA MỀM"

print("NFC:", unicodedata.normalize("NFC", text))
print("NFD:", unicodedata.normalize("NFD", text))
print("NFKC:", unicodedata.normalize("NFKC", text))
print("NFKD:", unicodedata.normalize("NFKD", text))

🧼 When to Use Each

  • NFC: Best for display and storage (e.g., filenames, UI).
  • NFD: Useful for accent stripping or character analysis.
  • NFKC/NFKD: Ideal for search, comparison, or compatibility (e.g., turning “①” into “1”).
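Before choosing a form, it helps to see the comparison problem that normalization solves. A minimal sketch, comparing a precomposed character with its decomposed equivalent:

```python
import unicodedata

composed = "\u00e9"     # é as one precomposed code point
decomposed = "e\u0301"  # e + combining acute accent

print(composed == decomposed)          # False: different code point sequences
print(len(composed), len(decomposed))  # 1 2
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Both strings render identically, yet naive `==` fails; normalizing both sides to the same form first makes the comparison reliable.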

🧩 1. NFD (Normalization Form D — Decomposed)

  • D stands for Decomposition.
  • It splits characters into their simplest combining forms.

Example:

é → e + ◌́
  • The single character “é” (U+00E9) is decomposed into “e” (U+0065) + “◌́” (U+0301).
  • Used when you want to analyze or compare base characters and diacritics separately.
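As a sketch of that use case, NFD plus a filter on combining marks strips diacritics (`strip_accents` is a name chosen here, not a standard function):

```python
import unicodedata

def strip_accents(s: str) -> str:
    """Decompose with NFD, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("C\u00e1i b\u00e0n"))  # "Cái bàn" -> "Cai ban"
```

Note that letters like đ are not base-plus-diacritic sequences, so they survive this filter; a full ASCII transliteration needs an extra mapping step.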

🧩 2. NFC (Normalization Form C — Composed)

  • C stands for Composition.
  • It's the canonical composition form: it tries to combine decomposed characters into one precomposed form when possible.
  • Essentially, NFC is the “standard” normalized form most commonly used for storage and display.

Example:

e + ◌́ → é
  • “e” + combining acute accent becomes a single character “é”.

🧩 3. NFKD (Normalization Form KD — Compatibility Decomposition)

  • K stands for Compatibility.
  • Like NFD, but it also applies compatibility mappings, meaning that it may change the way certain symbols are represented for easier comparison.
  • It decomposes characters and converts “visually similar but semantically different” characters into a common form.

Example:

  ① → 1
  Å (U+212B, Angstrom sign) → A + ◌̊ (A + combining ring above)
  • NFKD is used when the appearance doesn't matter, only the semantic value.

🧩 4. NFKC (Normalization Form KC — Compatibility Composition)

  • Like NFKD, but after compatibility decomposition, it re-composes characters when possible.
  • Useful when you want to normalize for search, comparison, or user input, not preserving exact visual form.

Example:

  ① → 1
  Ⅳ → IV
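The circled-digit and Roman-numeral cases are easy to verify directly (a minimal check):

```python
import unicodedata

print(unicodedata.normalize("NFKC", "\u2460"))  # ① -> 1
print(unicodedata.normalize("NFKC", "\u2163"))  # Ⅳ -> IV
# The result is plain ASCII, so it can even be parsed as a number:
print(int(unicodedata.normalize("NFKC", "\u2460")) + 1)  # 2
```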

🔍 Summary Table

Form  Meaning                      Type                          Example Input  Normalized Output
NFD   Canonical Decomposition      Decomposed                    é              e + ◌́
NFC   Canonical Composition        Composed                      e + ◌́          é
NFKD  Compatibility Decomposition  Decomposed (simplified form)  ①              1
NFKC  Compatibility Composition    Composed (simplified form)    ①              1

💡 In Python

You can use the unicodedata module:

import unicodedata

text = "é"
print(unicodedata.normalize("NFD", text))   # e + ́
print(unicodedata.normalize("NFC", text))   # é
print(unicodedata.normalize("NFKD", "①"))  # 1
print(unicodedata.normalize("NFKC", "①"))  # 1

Excellent — those are great examples, because they're "typographic" (curly) punctuation characters, not the plain ASCII ones. They show clearly how Unicode normalization (especially NFKC/NFKD) treats typographic punctuation.

Let's analyze each one:

✦ Input characters

Character  Description                  Unicode code point  ASCII equivalent
’          Right single quotation mark  U+2019              '
–          En dash                      U+2013              -
“          Left double quotation mark   U+201C              "
”          Right double quotation mark  U+201D              "

🧩 Normalization behavior

Let's see what happens under each normalization form:

Character  NFD  NFC  NFKD  NFKC  Comment
’          ’    ’    ’     ’     Unchanged – U+2019 has no decomposition mapping
–          –    –    –     –     Unchanged – the en dash has no decomposition mapping either
“          “    “    “     “     Unchanged under all four forms
”          ”    ”    ”     ”     Unchanged under all four forms

🧠 Explanation

  • NFD/NFC are canonical normalizations – they only affect characters with canonical decompositions (accented or composed forms), not punctuation. → So curly quotes and dashes remain unchanged.

  • NFKD/NFKC are compatibility normalizations – they additionally simplify superscripts, circled numbers, ligatures, etc. → But typographic punctuation has no compatibility mapping in Unicode, so it also remains unchanged.

💻 Python Example

import unicodedata

text = "\u2019 \u2013 \u201C \u201D"  # ’ – “ ”

for ch in text.split():
    print(f"Original: {ch!r} {ord(ch):04X}")
    for form in ["NFD", "NFC", "NFKD", "NFKC"]:
        normalized = unicodedata.normalize(form, ch)
        print(f"  {form}: {normalized!r} {ord(normalized):04X}")
    print()

Output (simplified):

Original: '’' 2019
  NFD: '’' 2019
  NFC: '’' 2019
  NFKD: '’' 2019
  NFKC: '’' 2019

Original: '–' 2013
  NFD: '–' 2013
  NFC: '–' 2013
  NFKD: '–' 2013
  NFKC: '–' 2013

Original: '“' 201C
  NFD: '“' 201C
  NFC: '“' 201C
  NFKD: '“' 201C
  NFKC: '“' 201C
...

✅ Summary

Form       Behavior on typographic punctuation
NFD/NFC    No change – canonical forms don't touch punctuation
NFKD/NFKC  Also no change – Unicode defines no compatibility mapping for these marks; converting them to plain ASCII ', ", - requires an explicit replacement step

Excellent observation 👍 — and yes, what you're seeing is expected but a bit subtle! Let's unpack why your result in Python shows:

>>> import unicodedata
>>> unicodedata.normalize("NFKC", "\u2019")
'’'

and did not convert the curly quote (’, U+2019) into the ASCII ' (U+0027).

🧠 The short answer

👉 Because the Unicode Standard does not define a compatibility mapping for ’ (RIGHT SINGLE QUOTATION MARK, U+2019). So normalization (even NFKC/NFKD) leaves it unchanged.

🔍 The deeper explanation

Unicode normalization forms (NFC, NFD, NFKC, NFKD) only use mappings defined in the Unicode Character Database (UCD), specifically:

  • Canonical mappings → for NFC/NFD
  • Compatibility mappings → for NFKC/NFKD

However, not every "visually similar" character is considered a compatibility equivalent. In particular, punctuation marks like ’, “, ”, and – are typographic symbols, not compatibility variants of ASCII ', ", or -.

That means:

There is no official compatibility decomposition for them – so normalization doesn't change them.
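You can query the Unicode Character Database directly to confirm which characters carry a mapping: `unicodedata.decomposition()` returns an empty string when no decomposition is defined (a quick probe):

```python
import unicodedata

# decomposition() returns "" for characters with no decomposition mapping.
for ch in ["\u2019", "\u2013", "\u2460", "\u338f"]:
    decomp = unicodedata.decomposition(ch) or "(none)"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {decomp}")
```

The punctuation marks print `(none)`, while ① reports `<circle> 0031` and ㎏ reports a `<square>` mapping: exactly the data NFKC/NFKD act on.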

🧩 Examples that do change under NFKC/NFKD

These do have compatibility mappings:

Character  Unicode  NFKC result  Notes
①          U+2460   1            circled digit one
Å          U+212B   Å            Angstrom sign → Latin A with ring
㎏         U+338F   kg           compatibility ligature (kilogram sign)
Ⅳ          U+2163   IV           Roman numeral four

…but punctuation marks like ’, “, ”, – stay the same.

🧰 What to do if you want ASCII conversion

If your goal is to normalize text to plain ASCII, you need extra logic (beyond Unicode normalization). For example, you can use:

Option 1: unicodedata.normalize + manual mapping

import unicodedata

text = "\u201CHello\u201D \u2013 it\u2019s fine."  # “Hello” – it’s fine.
normalized = unicodedata.normalize("NFKC", text)
# NFKC leaves typographic punctuation alone, so replace it manually
ascii_text = (
    normalized.replace("\u201C", '"')
              .replace("\u201D", '"')
              .replace("\u2019", "'")
              .replace("\u2018", "'")
              .replace("\u2013", "-")
)
print(ascii_text)
# "Hello" - it's fine.
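A `str.translate` table is a more scalable variant of the same idea: one pass over the string instead of chained `replace` calls. The table below is an assumption; extend it with whatever typographic characters your data actually contains:

```python
import unicodedata

# Map typographic punctuation to its closest ASCII form.
PUNCT_TO_ASCII = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # ‘ ’
    "\u201c": '"', "\u201d": '"',   # “ ”
    "\u2013": "-", "\u2014": "-",   # – —
})

def to_ascii_punct(s: str) -> str:
    # NFKC first handles circled digits, ligatures, etc.; the
    # translation table then covers the punctuation NFKC skips.
    return unicodedata.normalize("NFKC", s).translate(PUNCT_TO_ASCII)

print(to_ascii_punct("\u201cHello\u201d \u2013 it\u2019s fine."))
# "Hello" - it's fine.
```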

Option 2: Use a library like ftfy or unidecode

import ftfy
# fix_text() repairs mojibake and, by default, uncurls quotes to ASCII;
# it does not replace dashes, so the en dash remains
print(ftfy.fix_text("\u201CHello\u201D \u2013 it\u2019s fine."))

or

from unidecode import unidecode
# unidecode transliterates every character to its closest ASCII equivalent
print(unidecode("\u201CHello\u201D \u2013 it\u2019s fine."))
# "Hello" - it's fine.

Summary

Form               Changes ’ “ ” – ?  Why
NFC / NFD          ❌ No               They're not decomposable
NFKC / NFKD        ❌ No               No compatibility mapping in Unicode
ftfy / manual map  ✅ Yes              Explicit conversion to ASCII equivalents


Understanding ASCII, Unicode, UTF-8, and Text Handling

1. ASCII, Unicode, and UTF-8 — Overview

ASCII

  • Definition: American Standard Code for Information Interchange
  • Range: 0–127 (7-bit)
  • Purpose: Represents English letters, digits, and basic symbols
  • Limitations: Cannot represent accented characters (e.g., á, ê) or characters from other languages.

Unicode

  • Definition: A universal character set designed to cover all characters from all writing systems
  • Range: Over 1.1 million code points (0–0x10FFFF)
  • Encoding forms: Can be stored as UTF-8, UTF-16, UTF-32
  • Purpose: Allows text from multiple languages to be stored and processed consistently.

UTF-8

  • Definition: A variable-length encoding of Unicode characters
  • Characteristics:
  • ASCII characters (0–127) use 1 byte
  • Other characters (e.g., Vietnamese á) use 2–4 bytes

  • Advantages:

  • Backward compatible with ASCII
  • Efficient storage for mixed ASCII and non-ASCII text
  • Widely used on the web and in modern applications.

Key differences:

Feature           ASCII         Unicode                            UTF-8
Characters        128           1,114,112 code points              Encodes all Unicode code points
Language support  English only  All languages                      All languages
Storage           1 byte        Depends on encoding (UTF-8/16/32)  1–4 bytes per character
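The variable-length property is easy to observe by encoding a few characters (a minimal check):

```python
# Each character encodes to a different number of UTF-8 bytes.
for ch in ["A", "\u00e1", "\u6f22", "\U0001f600"]:  # A, á, 漢, 😀
    print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
```

ASCII `A` takes 1 byte, Vietnamese `á` takes 2, the CJK character takes 3, and the emoji takes 4, exactly the 1–4 byte range in the table above.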

2. Handling Non-ASCII Text (e.g., Vietnamese)

Vietnamese contains accented characters like á, à, â, etc. Handling them correctly requires:

a) Ensure proper encoding/decoding

  • Always use UTF-8 for reading/writing files:
# Writing JSON with Vietnamese text
import json

data = {"unit": "Cái"}
with open("a.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
  • Do not rely on system default encodings (like cp1252 on Windows), which may fail.

b) Use Unicode normalization

  • Unicode allows characters to be represented in multiple ways:
  • Precomposed: á → single code point U+00E1
  • Decomposed: a + combining acute accent U+0301

  • Normalize using Python's unicodedata:

import unicodedata

s = "a\u0301I"  # decomposed
print(unicodedata.normalize("NFC", s))  # precomposed: 'áI'

Normalization forms:

Form Effect
NFC Compose to single code point where possible
NFD Decompose into base + combining marks
NFKC/NFKD Compatibility normalization (e.g., superscripts, ligatures)

c) Fix mojibake / double-encoding

  • Sometimes UTF-8 bytes are misinterpreted as Latin-1 or Windows-1252
  • Recover with:
def fix_mojibake(s: str) -> str:
    try:
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s
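A usage sketch of that recovery step (the helper is repeated here so the snippet runs standalone): it recovers UTF-8 text mis-decoded as Latin-1 and passes already-clean text through untouched, because re-encoding clean text produces bytes that are not valid UTF-8.

```python
def fix_mojibake(s: str) -> str:
    # Undo UTF-8 bytes that were mis-decoded as Latin-1;
    # pass any other text through unchanged.
    try:
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s

print(fix_mojibake("C\u00c3\u00a1i"))  # "CÃ¡i" -> "Cái" (recovered)
print(fix_mojibake("C\u00e1i"))        # "Cái"  -> "Cái" (left unchanged)
```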

3. General Rules to Detect Encoding Issues

  1. Check file metadata / BOM: UTF-8 files may start with EF BB BF
  2. Look for replacement characters: � indicates a decoding error
  3. Check for mojibake patterns:
     • CÃ¡i instead of Cái → UTF-8 mis-decoded as Latin-1
     • C\u00e1i → escaped Unicode sequence
  4. Python inspection:

import chardet

raw_bytes = open("a.json", "rb").read()
print(chardet.detect(raw_bytes))  # returns likely encoding

4. Examples: Methods and Differences

a) unicodedata.normalize

import unicodedata

s1 = "a\u0301I"  # a + combining accent
print(s1)  # prints: áI
print(unicodedata.normalize("NFC", s1))  # prints: áI

Effect: fixes different representations of the same character.

b) ensure_ascii=False in JSON

import json

data = {"unit": "Cái"}
json_str = json.dumps(data, ensure_ascii=True)
print(json_str)  # {"unit": "C\u00e1i"}

json_str2 = json.dumps(data, ensure_ascii=False)
print(json_str2)  # {"unit": "Cái"}

Effect:

  • ensure_ascii=True → escapes non-ASCII as \uXXXX
  • ensure_ascii=False → writes characters directly in UTF-8
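Either setting round-trips losslessly through json.loads; the difference is only how readable the serialized form is. A quick check:

```python
import json

data = {"unit": "C\u00e1i"}  # {"unit": "Cái"}

escaped = json.dumps(data, ensure_ascii=True)    # {"unit": "C\u00e1i"}
readable = json.dumps(data, ensure_ascii=False)  # {"unit": "Cái"}

# The on-disk form differs, but both parse back to the same dict.
print(json.loads(escaped) == json.loads(readable) == data)  # True
```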

c) Fixing mojibake

s = "CÃ¡i"  # mojibake: UTF-8 bytes of "Cái" mis-decoded as Latin-1
fixed = s.encode("latin-1").decode("utf-8")
print(fixed)  # Cái

Effect: recovers text that was double-encoded.

Summary

  1. ASCII → only English, 7-bit
  2. Unicode → universal code points for all languages
  3. UTF-8 → variable-length encoding, ASCII-compatible, stores all Unicode
  4. Handling Vietnamese:
     • Always use UTF-8
     • Normalize combining marks (unicodedata.normalize)
     • Fix mojibake if needed (latin-1 → utf-8)
  5. Detection rules:
     • Look for replacement characters or unexpected escapes
     • Use libraries like chardet
  6. Practical examples:
     • unicodedata.normalize → standardize accents
     • ensure_ascii=False → human-readable UTF-8 JSON
     • latin-1 → utf-8 → recover corrupted text

# Sanity checks: NFC normalization of an already-NFC Vietnamese string leaves it unchanged
# unicodedata.normalize("NFC", "đơn bán hàng") == "đơn bán hàng"      # True
# unicodedata.normalize("NFC", "đơn bán hàng") in ("đơn bán hàng",)   # True
# "đơn bán hàng" in unicodedata.normalize("NFC", "đơn bán hàng")      # True