11. String Advanced Transformation¶
Overview¶
Normalization Forms¶
In Unicode text processing, "NFC", "NFD", "NFKC", and "NFKD" are the four normalization forms defined by the Unicode Standard. They ensure that text is stored in a consistent, comparable way, even when strings that look identical are encoded as different code point sequences.
Unicode Normalization Forms
| Form | Description |
|---|---|
| NFC | Canonical Composition – combines characters into composed forms. |
| NFD | Canonical Decomposition – breaks characters into base + diacritics. |
| NFKC | Compatibility Composition – like NFC but also replaces compatibility characters. |
| NFKD | Compatibility Decomposition – like NFD but also replaces compatibility characters. |
🧪 Example: Using All Forms
import unicodedata
text = "TÚI THÚ KHỦNG LONG BẠO CHÚA MỀM"
print("NFC:", unicodedata.normalize("NFC", text))
print("NFD:", unicodedata.normalize("NFD", text))
print("NFKC:", unicodedata.normalize("NFKC", text))
print("NFKD:", unicodedata.normalize("NFKD", text))
🧼 When to Use Each
- NFC: Best for display and storage (e.g., filenames, UI).
- NFD: Useful for accent stripping or character analysis.
- NFKC/NFKD: Ideal for search, comparison, or compatibility (e.g., turning “①” into “1”).
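As a sketch of the NFD-based accent stripping mentioned above (the helper name `strip_accents` is ours, not a standard API):

```python
import unicodedata

def strip_accents(text):
    # NFD splits each accented character into base + combining marks;
    # unicodedata.combining() > 0 identifies the marks, which we drop.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("café"))    # cafe
print(strip_accents("Cà phê"))  # Ca phe
```

Note that characters without a decomposition, such as Vietnamese "đ" (U+0111), keep their stroke: only combining marks are removed.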
🧩 1. NFD (Normalization Form D — Decomposed)¶
- D stands for Decomposition.
- It splits characters into their simplest combining forms.
Example:
- The single character “é” (U+00E9) is decomposed into “e” (U+0065) + “◌́” (U+0301).
- Used when you want to analyze or compare base characters and diacritics separately.
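The decomposition can be verified by listing the code points of the normalized string:

```python
import unicodedata

nfd = unicodedata.normalize("NFD", "\u00e9")   # "é" as one precomposed code point
print([f"U+{ord(ch):04X}" for ch in nfd])      # ['U+0065', 'U+0301']
```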
🧩 2. NFC (Normalization Form C — Composed)¶
- C stands for Composition.
- It's the canonical composition form: it combines decomposed characters into a single precomposed form whenever possible.
- Essentially, NFC is the “standard” normalized form most commonly used for storage and display.
Example:
- “e” + combining acute accent becomes a single character “é”.
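A quick check that NFC merges the two code points back into one:

```python
import unicodedata

decomposed = "e\u0301"                               # "e" + combining acute accent
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))                # 2 1
print(composed == "\u00e9")                          # True
```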
🧩 3. NFKD (Normalization Form KD — Compatibility Decomposition)¶
- K stands for Compatibility.
- Like NFD, but it also applies compatibility mappings, meaning that it may change the way certain symbols are represented for easier comparison.
- It decomposes characters and converts “visually similar but semantically different” characters into a common form.
Example:
- The ligature “ﬁ” (U+FB01) is left unchanged by NFD but becomes “f” + “i” under NFKD.
- NFKD is used when the appearance doesn't matter, only the semantic value.
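A small sketch contrasting canonical and compatibility decomposition on the fi ligature (U+FB01):

```python
import unicodedata

ligature = "\ufb01"  # "ﬁ" LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFD", ligature))   # unchanged: canonical forms keep it
print(unicodedata.normalize("NFKD", ligature))  # fi (compatibility mapping applied)
```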
🧩 4. NFKC (Normalization Form KC — Compatibility Composition)¶
- Like NFKD, but after compatibility decomposition, it re-composes characters when possible.
- Useful when you want to normalize for search, comparison, or user input, not preserving exact visual form.
Example:
- “①” (U+2460) becomes “1”, and “e” + combining acute accent becomes the single character “é”.
🔍 Summary Table¶
| Form | Meaning | Type | Example Input | Normalized Output |
|---|---|---|---|---|
| NFD | Canonical Decomposition | Decomposed | é | e + ́ |
| NFC | Canonical Composition | Composed | e + ́ | é |
| NFKD | Compatibility Decomposition | Decomposed (simplified form) | ① | 1 |
| NFKC | Compatibility Composition | Composed (simplified form) | ① | 1 |
💡 In Python¶
You can use the unicodedata module:
import unicodedata
text = "é"
print(unicodedata.normalize("NFD", text)) # e + ́
print(unicodedata.normalize("NFC", text)) # é
print(unicodedata.normalize("NFKD", "①")) # 1
print(unicodedata.normalize("NFKC", "①")) # 1
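Why this matters for comparison: two visually identical strings can compare unequal until both are normalized to the same form. A minimal sketch:

```python
import unicodedata

a = "caf\u00e9"   # "café" with composed é
b = "cafe\u0301"  # "café" with decomposed e + combining accent
print(a == b)  # False: different code point sequences
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
```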
The typographic (“curly”) punctuation characters below are instructive precisely because they are not the plain ASCII ones: they show that Unicode normalization, even NFKC/NFKD, does not convert every ASCII look-alike.
Let's analyze each one:
✦ Input characters¶
| Character | Description | Unicode code point | ASCII look-alike |
|---|---|---|---|
| ’ | Right single quotation mark | U+2019 | ' |
| – | En dash | U+2013 | - |
| “ | Left double quotation mark | U+201C | " |
| ” | Right double quotation mark | U+201D | " |
🧩 Normalization behavior¶
Let's see what happens under each normalization form:
| Character | NFD | NFC | NFKD | NFKC | Comment |
|---|---|---|---|---|---|
| ’ | ’ | ’ | ’ | ’ | Unchanged under every form: no canonical or compatibility mapping |
| – | – | – | – | – | Unchanged: the en dash has no compatibility decomposition |
| “ | “ | “ | “ | “ | Unchanged |
| ” | ” | ” | ” | ” | Unchanged |
🧠 Explanation¶
- NFD/NFC are canonical normalizations: they only affect characters with accents or composed forms, not punctuation, so the curly quotes and dashes remain unchanged.
- NFKD/NFKC are compatibility normalizations: they additionally simplify superscripts, circled numbers, ligatures, and other compatibility variants. Typographic punctuation, however, has no compatibility mapping in Unicode, so it also passes through untouched.
💻 Python Example¶
import unicodedata

text = "\u2019 \u2013 \u201c \u201d"  # ’ – “ ”
for ch in text.split():
    print(f"Original: {ch!r} {ord(ch):04X}")
    for form in ["NFD", "NFC", "NFKD", "NFKC"]:
        normalized = unicodedata.normalize(form, ch)
        print(f"  {form}: {normalized!r} {ord(normalized):04X}")
    print()
Output (simplified):
Original: '’' 2019
  NFD: '’' 2019
  NFC: '’' 2019
  NFKD: '’' 2019
  NFKC: '’' 2019

Original: '–' 2013
  NFD: '–' 2013
  NFC: '–' 2013
  NFKD: '–' 2013
  NFKC: '–' 2013

Original: '“' 201C
  NFD: '“' 201C
  NFC: '“' 201C
  NFKD: '“' 201C
  NFKC: '“' 201C
...
✅ Summary¶
| Form | Behavior on typographic punctuation |
|---|---|
| NFD/NFC | Keep curly quotes/dashes as they are (no change) |
| NFKD/NFKC | Also no change: these characters have no compatibility mapping, so converting them to plain ASCII ', ", - requires extra logic |
This result often surprises people: even NFKC/NFKD did not convert the curly quote (’, U+2019) into the ASCII ' (U+0027).
🧠 The short answer¶
👉 Because the Unicode Standard does not define a compatibility mapping for ’ (RIGHT SINGLE QUOTATION MARK, U+2019), normalization (even NFKC/NFKD) leaves it unchanged.
🔍 The deeper explanation¶
Unicode normalization forms (NFC, NFD, NFKC, NFKD) only use mappings defined in the Unicode Character Database (UCD), specifically:
- Canonical mappings → for NFC/NFD
- Compatibility mappings → for NFKC/NFKD
However, not every “visually similar” character is considered a compatibility equivalent. In particular, punctuation marks like ’, ‘, “, ”, and – are typographic symbols, not compatibility variants of ASCII ', ", or -.
That means there is no official compatibility decomposition for them, so normalization doesn't change them.
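This can be checked directly with `unicodedata.decomposition`, which returns the raw mapping from the Unicode Character Database (an empty string means no mapping exists):

```python
import unicodedata

# decomposition() returns the raw UCD mapping; "" means none exists.
print(unicodedata.decomposition("\u00e9"))  # 0065 0301      (canonical)
print(unicodedata.decomposition("\u2460"))  # <circle> 0031  (compatibility)
print(unicodedata.decomposition("\u2019"))  # (empty: normalization keeps the quote)
```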
🧩 Examples that do change under NFKC/NFKD¶
These do have compatibility mappings:
| Character | Unicode | Normalized NFKC | Notes |
|---|---|---|---|
① | U+2460 | 1 | circled digit one |
Å | U+212B | Å | Angstrom sign → Latin A with ring |
㎏ | U+338F | kg | compatibility ligature |
Ⅳ | U+2163 | IV | Roman numeral four |
…but punctuation marks like ’, –, “, ” stay the same.
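The table rows can be verified in a few lines:

```python
import unicodedata

for ch in "\u2460\u212b\u338f\u2163":  # ① Å ㎏ Ⅳ
    print(ch, "->", unicodedata.normalize("NFKC", ch))
# ① -> 1, Å -> Å (U+00C5), ㎏ -> kg, Ⅳ -> IV
```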
🧰 What to do if you want ASCII conversion¶
If your goal is to normalize text to plain ASCII, you need extra logic (beyond Unicode normalization). For example, you can use:
Option 1: unicodedata.normalize + manual mapping¶
import unicodedata

text = "\u201cHello \u2013 it\u2019s fine.\u201d"  # “Hello – it’s fine.”
normalized = unicodedata.normalize("NFKC", text)   # punctuation survives this step
# Then manually replace typographic punctuation
ascii_text = (
    normalized.replace("\u201c", '"')  # “
    .replace("\u201d", '"')            # ”
    .replace("\u2018", "'")            # ‘
    .replace("\u2019", "'")            # ’
    .replace("\u2013", "-")            # –
)
print(ascii_text)
# "Hello - it's fine."
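A slightly more reusable variant of the manual mapping above, using `str.translate` with a translation table (the names `PUNCT_MAP` and `asciify_punct` are illustrative, not a standard API):

```python
# PUNCT_MAP and asciify_punct are illustrative names, not a standard API.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",  # ‘ ’
    "\u201c": '"', "\u201d": '"',  # “ ”
    "\u2013": "-", "\u2014": "-",  # – —
})

def asciify_punct(text):
    return text.translate(PUNCT_MAP)

print(asciify_punct("\u201cHello \u2013 it\u2019s fine.\u201d"))
# "Hello - it's fine."
```

`str.translate` scans the string once, so it scales better than chained `replace` calls as the mapping grows.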
Option 2: Use a library like ftfy or unidecode¶
- `unidecode.unidecode(text)` transliterates the whole string to ASCII, and `ftfy.fix_text(text)` straightens curly quotes by default (its `uncurl_quotes` option); both are third-party packages.
✅ Summary¶
| Form | Changes ’‘–“”? | Why |
|---|---|---|
| NFC / NFD | ❌ No | They're not decomposable |
| NFKC / NFKD | ❌ No | No compatibility mapping in Unicode |
| ftfy / manual map | ✅ Yes | Explicit conversion to ASCII equivalents |
Understanding ASCII, Unicode, UTF-8, and Text Handling¶
1. ASCII, Unicode, and UTF-8 — Overview¶
ASCII¶
- Definition: American Standard Code for Information Interchange
- Range: 0–127 (7-bit)
- Purpose: Represents English letters, digits, and basic symbols
- Limitations: Cannot represent accented characters (e.g., á, ê) or characters from other languages.
Unicode¶
- Definition: A universal character set designed to cover all characters from all writing systems
- Range: Over 1.1 million code points (0–0x10FFFF)
- Encoding forms: Can be stored as UTF-8, UTF-16, UTF-32
- Purpose: Allows text from multiple languages to be stored and processed consistently.
UTF-8¶
- Definition: A variable-length encoding of Unicode characters
- Characteristics:
  - ASCII characters (0–127) use 1 byte
  - Other characters (e.g., Vietnamese á) use 2–4 bytes
- Advantages:
  - Backward compatible with ASCII
  - Efficient storage for mixed ASCII and non-ASCII text
  - Widely used on the web and in modern applications.
Key differences:
| Feature | ASCII | Unicode | UTF-8 |
|---|---|---|---|
| Characters | 128 | 1,112,064+ | Encoding scheme |
| Language support | English only | All languages | All languages |
| Storage | 1 byte | Depends on encoding (UTF-8,16,32) | 1–4 bytes per character |
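The per-character byte counts in the table are easy to confirm:

```python
# One code point each, but different UTF-8 byte counts.
for ch in ["A", "\u00e1", "\u4e2d", "\U0001f600"]:  # A, á, 中, 😀
    print(repr(ch), len(ch.encode("utf-8")), "byte(s)")
```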
2. Handling Non-ASCII Text (e.g., Vietnamese)¶
Vietnamese contains accented characters like á, à, ạ, â, etc. Handling them correctly requires:
a) Ensure proper encoding/decoding¶
- Always use UTF-8 for reading/writing files:
# Writing JSON with Vietnamese text
import json

data = {"unit": "Cái"}
with open("a.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
- Do not rely on system default encodings (like cp1252 on Windows), which may fail.
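A round-trip sketch (using a temporary file so nothing is left behind) confirming that explicit UTF-8 preserves the accented text:

```python
import json
import os
import tempfile

data = {"unit": "C\u00e1i"}  # {"unit": "Cái"}
path = os.path.join(tempfile.mkdtemp(), "a.json")

# Write and read back with an explicit encoding on both sides.
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

with open(path, "r", encoding="utf-8") as f:
    restored = json.load(f)

print(restored == data)  # True
```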
b) Use Unicode normalization¶
- Unicode allows characters to be represented in multiple ways:
  - Precomposed: `á` → single code point `U+00E1`
  - Decomposed: `a` + combining acute accent `U+0301`
- Normalize using Python's `unicodedata`:
import unicodedata

s = "a\u0301I"  # decomposed
print(unicodedata.normalize("NFC", s))  # precomposed: 'áI'
Normalization forms:
| Form | Effect |
|---|---|
| NFC | Compose to single code point where possible |
| NFD | Decompose into base + combining marks |
| NFKC/NFKD | Compatibility normalization (e.g., superscripts, ligatures) |
c) Fix mojibake / double-encoding¶
- Sometimes UTF-8 bytes are misinterpreted as Latin-1 or Windows-1252
- Recover by re-encoding the mis-decoded text: `text.encode("latin-1").decode("utf-8")`
3. General Rules to Detect Encoding Issues¶
- Check file metadata / BOM: UTF-8 files may start with `EF BB BF`
- Look for replacement characters: `�` indicates a decoding error
- Check for mojibake patterns: `CÃ¡I` instead of `CáI` → UTF-8 mis-decoded as Latin-1
- `C\u00e1I` → escaped Unicode sequence
- Python inspection:
import chardet
raw_bytes = open("a.json", "rb").read()
print(chardet.detect(raw_bytes)) # returns likely encoding
4. Examples: Methods and Differences¶
a) unicodedata.normalize¶
import unicodedata

s1 = "a\u0301I"  # a + combining accent
print(s1)  # prints: áI
print(unicodedata.normalize("NFC", s1))  # prints: áI
Effect: fixes different representations of the same character.
b) ensure_ascii=False in JSON¶
import json

data = {"unit": "Cái"}
json_str = json.dumps(data, ensure_ascii=True)
print(json_str) # {"unit": "C\u00e1i"}
json_str2 = json.dumps(data, ensure_ascii=False)
print(json_str2) # {"unit": "Cái"}
Effect:
- `ensure_ascii=True` → escapes non-ASCII as `\uXXXX`
- `ensure_ascii=False` → writes characters directly in UTF-8
c) Fixing mojibake¶
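A minimal sketch of the recovery step (re-encode with the wrong codec, then decode as UTF-8):

```python
# "Cái" was written as UTF-8 but decoded as Latin-1, yielding "CÃ¡i".
mojibake = "C\u00c3\u00a1i"
fixed = mojibake.encode("latin-1").decode("utf-8")
print(fixed)  # Cái
```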
Effect: recovers text that was double-encoded.
Summary¶
- ASCII → only English, 7-bit
- Unicode → universal code points for all languages
- UTF-8 → variable-length encoding, ASCII-compatible, stores all Unicode
- Handling Vietnamese:
  - Always use UTF-8
  - Normalize combining marks (`unicodedata.normalize`)
  - Fix mojibake if needed (`latin-1 → utf-8`)
- Detection rules:
  - Look for replacement characters or unexpected escapes
  - Use libraries like `chardet`
- Practical examples:
  - `unicodedata.normalize` → standardize accents
  - `ensure_ascii=False` → human-readable UTF-8 JSON
  - `latin-1 → utf-8` → recover corrupted text