
Text

Overview

A great deal of data exists as text. For example:

a) Text in legal documents

b) Text in investment analysis brochures

c) Text in transaction billing

d) Text in scanned trading registration documents

And so on.

It lives in many different kinds of storage, but it carries a lot of information that we can parse into structured data and mine for general insights. And if we can feed it into a system, it works like a charm.

Transformation

Transform types of sources

In the examples below, text appears in various forms, from messy whitespace to well-structured content.

E.g:

PDF to text

PNG (images) to text, e.g. via OCR

Online newspaper pages to text

Extract information from data

  1. Text to number

Example | Target | Information
--- | --- | ---
This is increased 40 percentage revenue | 40% | Positive, for revenue
There are has 3 types of flowers | 3 | Category, number of classes
So a regex pattern can extract these numbers:

import re

# \d+ matches a run of digits; (?:,\d+)* allows comma-grouped numbers like 1,234
pattern = re.compile(r"\d+(?:,\d+)*")
pattern.findall("This is increased 40 percentage revenue")  # ['40']

The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:

^((?!hede).)*$ The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.

And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):

/^((?!hede).)*$/s or use it inline:

/(?s)^((?!hede).)*$/ (where the /.../ are the regex delimiters, i.e., not part of the pattern)

If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:

/^((?!hede)[\s\S])*$/

Explanation

A string is just a list of n characters. Before and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":

    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘
       0       1       2       3       4       5       6       7   (index)

where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width assertions because they don't consume any characters; they only assert/validate something.

So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$

As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).


https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word
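The negated-lookahead pattern from the quoted answer can be checked directly with Python's `re` module ("hede" is the same placeholder word used above):

```python
import re

# Match only lines that do NOT contain the substring "hede"
pattern = re.compile(r"^((?!hede).)*$")

print(bool(pattern.match("ABCD")))      # True: no "hede" anywhere
print(bool(pattern.match("ABhedeCD")))  # False: the lookahead fails at e3

# Inline DOT-ALL flag (?s) so the dot also matches line breaks
print(bool(re.match(r"(?s)^((?!hede).)*$", "AB\nCD")))  # True
```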

import string
from string import Template
import datetime

# The `string` library has 4 main parts:
# a) Built-in constants
# b) Custom string formatting
# c) Template
# d) Helper functions

# A. BUILT-IN VARIABLE

# What is ASCII:
# Short for American Standard Code for Information Interchange.
# It is a character encoding standard for electronic communication.
# ASCII codes represent text in computers, telecommunications equipment, and other devices.
# Read more: [ASCII](https://en.wikipedia.org/wiki/ASCII)

# Built-in constants with self-explanatory names,
# separated into 4 groups: letters, digits, punctuation, and whitespace.
# A special case containing all 4 groups is `printable`.
# These save you from having to memorize the character sets yourself
# and are very helpful in text analysis.

# Group 1:
string.ascii_letters
string.ascii_lowercase
string.ascii_uppercase

# Group 2:
string.digits
string.hexdigits
string.octdigits

# Group 3:
string.punctuation

# Group 4:
string.whitespace

# Contain 4 groups:
string.printable
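As an aside, these constants make quick text-cleaning one-liners easy (a small sketch; the noisy sample string is made up):

```python
import string

# Keep only printable ASCII characters from a noisy string
noisy = "Rev\x00enue: \x07 40%\n"
clean = "".join(ch for ch in noisy if ch in string.printable)
print(repr(clean))  # 'Revenue:  40%\n'

# The groups compose: printable is exactly the union of the other four
assert set(string.printable) == set(
    string.digits + string.ascii_letters + string.punctuation + string.whitespace
)
```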

# B. CUSTOM FORMAT

# Type 1: Index-based, with explicit or automatic indexes
# (auto-numbering of '{}' requires Python 3.1+)
# Normal Case
'{0}, {1}, {2}'.format('a', 'b', 'c')
# Index Position
'{2}, {1}, {0}'.format('a', 'b', 'c')
# Auto index without using index
'{}, {}, {}'.format('a', 'b', 'c')
# Unpacking using *
'{0}, {1}, {2}'.format(*'abc')
# Repeat
'{0}, {1}, {0}'.format('F', 'S')

# Type 2: Naming arguments
# Normal case
'Coordinates: {lat}, {lon}'.format(lat = '24.7N', lon='-12.4E')
# Unpack dict using **
coord = {'lat': '24.7N', 'lon':'-12.4E'}
'Coordinates: {lat}, {lon}'.format(**coord)

# Standard Format Specifier:
# format_spec     ::=  [[fill]align][sign][#][0][width][grouping_option][.precision][type]
# fill            ::=  <any character>
# align           ::=  "<" | ">" | "=" | "^"
# sign            ::=  "+" | "-" | " "
# width           ::=  digit+
# grouping_option ::=  "_" | ","
# precision       ::=  digit+
# type            ::=  "b" | "c" | "d" | "e" | "E" | "f" | "F" | "g" | "G" | "n" | "o" | "s" | "x" | "X" | "%"

# Fill and Align
# Case 1: Fill with * and right-align with > (padding added on the left)
# E.g: '*****************************************************Tunnels'
'{:*>60}'.format('Tunnels')

# Case 2: Fill with ~ and center with ^ (padding added on both sides)
# E.g: '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Python Pathway~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
'{:~^80}'.format('Python Pathway')

# Case 3:
# E.g: Number with positive and negative number
'{:+f}; {:+f}'.format(4.6, -12.14)
'{: f}; {: f}'.format(3.14, -3.14)
'{:-f}; {:-f}'.format(5.94, -9.14)

# Case 4:
# E.g: Format Number in different alias
'int: {0:d};  hex: {0:x};  oct: {0:o};  bin: {0:b}'.format(93)

# Case 5: With 0x, 0o, or 0b as prefix
# E.g: Such as hex and oct type
'int: {0:d};  hex: {0:#x};  oct: {0:#o};  bin: {0:#b}'.format(47)

# Case 6: Using the comma as a thousands separator
# E.g: 1234 into 1,234
'{:,}'.format(1234)
'{:_}'.format(123456)

# Case 7: Percentage with number of precisions
'{:.3%}'.format(0.05821)

# C. TEMPLATE

# Template strings support $-based substitutions, using the following rules:
# ====
# $$ is an escape; it is replaced with a single $.
# $identifier names a substitution placeholder matching a mapping key of "identifier". By default, "identifier" is restricted to any case-insensitive ASCII alphanumeric string (including underscores) that starts with an underscore or ASCII letter. The first non-identifier character after the $ character terminates this placeholder specification.
# ${identifier} is equivalent to $identifier. It is required when valid identifier characters follow the placeholder but are not part of the placeholder, such as "${noun}ification".
# Any other appearance of $ in the string will result in a ValueError being raised.
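A tiny check of the escaping rules above ($$ and ${...}); the placeholder names and values here are made up:

```python
from string import Template

# "$$" collapses to a literal "$"; braces delimit ${user} from the "net" suffix
t = Template("$$${amount} charged to ${user}net account")
print(t.substitute(amount="42", user="pja"))  # $42 charged to pjanet account
```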

# Basic concept, in 2 steps:
# a) Define the template through Template
# b) Bind arguments with `substitute`

# Template
s = Template("$user has been reviewed by $reviewer at $time")

# Binding
s.substitute(user="Pja", reviewer="Sungri", time=datetime.datetime.now())

# KeyError when a placeholder is missing:
# e.g. missing $time
s.substitute(user="Pja", reviewer="Sungri")

# No error when using safe_substitute:
# it leaves missing placeholders in the output unchanged
s.safe_substitute(user="Pja", reviewer="Sungri")

# D. HELPFUL FUNCTIONS

# In my opinion, this one doesn't help much,
# but I like the idea: it combines split, capitalize, and join from `str`
string.capwords("capitalize word by separator", sep=" ")
  2. Text to Date

  3. Text Padding

Regex

Google's syntax reference for RE2: https://github.com/google/re2/wiki/Syntax

This is the regex syntax implemented by BigQuery.

Libraries

Reference

https://github.com/google/re2/wiki/Syntax

Encoding Detection

1. General Idea of Character Encoding Detection

When you have raw bytes (like from a file or network), you need to know the encoding to convert them into readable text. The general steps are:

  1. Check for BOM (Byte Order Mark): Some encodings (UTF-8, UTF-16, UTF-32) may include a BOM at the beginning. If present, it directly indicates the encoding.

  2. Byte pattern analysis: Certain byte sequences are valid in one encoding but invalid in another. For example, UTF-8 has strict rules: continuation bytes must be in 0x80–0xBF.

  3. Statistical analysis: Libraries like chardet analyze frequency distributions of bytes in the text and compare the patterns against profiles of known encodings (like Windows-1252, ISO-8859-1, Shift-JIS, etc.).

  4. Confidence scoring: Each possible encoding gets a confidence score, and the detector selects the most likely one.

  5. Fallbacks: If no encoding is certain, default to UTF-8 or a common locale encoding.
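Step 1, the BOM check, can be sketched with nothing but the standard library (the helper name `detect_bom` is made up for illustration):

```python
import codecs
from typing import Optional

# Longest signatures first: the UTF-32-LE BOM begins with the UTF-16-LE BOM bytes
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is no BOM."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None

print(detect_bom("hi".encode("utf-8-sig")))  # utf-8-sig
print(detect_bom(b"plain ascii"))            # None
```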

2. How chardet Works

chardet is a Python port of Mozilla's Universal Charset Detector. Its process can be summarized as:

  1. Input raw bytes.

  2. Run multiple "probers": each prober is a small detector for a specific charset family (UTF-8, ISO-8859-*, Shift-JIS, etc.), which analyzes byte sequences and calculates a confidence score.

  3. Select the encoding with the highest confidence.

  4. Return the result as a dictionary, e.g.:
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

Key Features:

  • Heuristic-based: Not perfect, but works well on reasonably long text.
  • Multi-byte aware: Can detect multi-byte encodings like UTF-8, Shift-JIS, Big5.
  • Language hints: Can give a hint of language when charset alone is ambiguous.

3. Mozilla's Universal Charset Detector

  • The original UCD (Universal Charset Detector) was developed by Mozilla for Firefox.
  • Goal: detect the encoding of arbitrary web content reliably.
  • chardet in Python is a port of this project.
  • Key ideas from Mozilla:
    • Multiple charset probers work in parallel.
    • Use byte sequence validation and statistical analysis.
    • Return a confidence score for each candidate.
    • Designed to handle HTML pages, emails, and other text.

So essentially:

Mozilla UCD → Heuristics + Byte Stats → Detect Charset
chardet → Python port of UCD → same heuristics and stats

4. Mermaid Diagram for the Process

flowchart TD
    A[Start: Raw Bytes] --> B[Check BOM]
    B -->|BOM Found| C[Return Encoding from BOM]
    B -->|No BOM| D[Run Charset Probers]
    D --> E[UTF-8 Prober]
    D --> F["Single-byte Probers (Latin1, Windows-1252...)"]
    D --> G["Multi-byte Probers (Shift-JIS, Big5, etc.)"]
    E --> H[Compute Confidence Score]
    F --> H
    G --> H
    H --> I[Compare Scores]
    I --> J[Select Most Likely Encoding]
    J --> K[Return Encoding and Confidence]

5. How This Relates to Real-World Usage

  • In Python:

import chardet

raw_bytes = "café".encode("utf-8")  # b'caf\xc3\xa9'
result = chardet.detect(raw_bytes)
print(result)
# e.g. {'encoding': 'utf-8', 'confidence': ..., 'language': ''}
# Note: inputs this short often come back with low confidence or a wrong guess.
  • Where it aligns with Mozilla's approach:
    • Both use multiple probers and statistics.
    • Both return a confidence score.
    • Both rely on heuristics for ambiguous cases.

  • Limitations:
    • Short text may give wrong results.
    • Some encodings (like ISO-8859-1 vs Windows-1252) are very similar → low confidence.
    • Rare or mixed encodings may confuse the detector.
A more detailed version of the flow:

flowchart TD
    A[Start: Raw Bytes Input] --> B["Check BOM (Byte Order Mark)"]
    B -->|BOM Found| C[Determine Encoding from BOM]
    B -->|No BOM| D[Run Charset Probers / Heuristics]

    D --> E[UTF-8 Prober: Validate Byte Sequences]
    D --> F[Single-byte Probers: Latin1, Windows-1252, ISO-8859-*]
    D --> G[Multi-byte Probers: Shift-JIS, Big5, EUC-JP, etc.]

    E --> H[Compute UTF-8 Confidence Score]
    F --> I[Compute Single-byte Confidence Scores]
    G --> J[Compute Multi-byte Confidence Scores]

    H --> K[Compare Confidence Scores]
    I --> K
    J --> K

    K --> L[Select Most Likely Encoding]
    L --> M[Return Encoding + Confidence Score]

    C --> M
    M --> N[Decode Bytes to Text using Detected Encoding]

    %% Optional notes
    classDef note fill:#f9f,stroke:#333,stroke-width:1px,color:#000;
    class N note;

Explanation of the Diagram

  1. Start: Receive raw bytes. Could be from a file, web page, or network.

  2. Check BOM: If a BOM exists → directly know the encoding (UTF-8, UTF-16LE/BE, UTF-32). If no BOM → fall back to heuristics.

  3. Charset probers:
    • UTF-8 prober: checks validity of multi-byte UTF-8 sequences.
    • Single-byte probers: detect encodings like Latin1, Windows-1252.
    • Multi-byte probers: for languages like Japanese or Chinese.

  4. Compute confidence scores: Each prober returns a score indicating likelihood.

  5. Compare scores: Choose the encoding with the highest confidence.

  6. Return encoding + confidence: Can be used to decode bytes reliably.

  7. Decode: Final step to get readable text.
1️⃣ Table: Encoding Families

Encoding Family | Examples / Notes | Type
--- | --- | ---
UTF-8 | Standard Unicode encoding, variable-length | Multi-byte
UTF-16 | UTF-16LE, UTF-16BE | Multi-byte
UTF-32 | UTF-32LE, UTF-32BE | Multi-byte
ISO-8859-* | ISO-8859-1 (Latin1), ISO-8859-2, ISO-8859-5… | Single-byte
Windows Code Pages | Windows-1250, Windows-1251, Windows-1252… | Single-byte
Asian Encodings | Shift-JIS, EUC-JP, EUC-KR, GB2312, Big5 | Multi-byte
KOI8-R / KOI8-U | Russian / Ukrainian encodings | Single-byte
MacRoman / MacCyrillic | Classic Mac OS encodings | Single-byte
ASCII | 7-bit basic characters | Single-byte
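The single-byte vs multi-byte distinction in the table is easy to see by encoding the same text with Python's built-in codecs:

```python
text = "日本語"  # three CJK characters

print(len(text.encode("utf-8")))      # 9 bytes: 3 bytes per character in UTF-8
print(len(text.encode("shift_jis")))  # 6 bytes: 2 bytes per character in Shift-JIS

# A single-byte encoding such as Latin-1 cannot represent them at all
try:
    text.encode("latin-1")
except UnicodeEncodeError:
    print("not representable in latin-1")
```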

Note: chardet can detect many encodings, but accuracy improves with longer text samples. Short strings or mixed encodings may give low-confidence results.

2️⃣ Table: Methods / Workflow in chardet

Step / Method | Description | Purpose / Notes
--- | --- | ---
Input Bytes | Raw bytes from file/network | Starting point
BOM Check | Look for Byte Order Mark (UTF-8, UTF-16, UTF-32) | Quick detection if present
Charset Probers | Multiple detectors for different families | Detects likely encoding
UTF-8 Prober | Validates multi-byte sequences | Checks UTF-8 correctness
Single-byte Probers | Tests Latin1, Windows-125x, ISO-8859-x encodings | Finds likely single-byte encodings
Multi-byte Probers | Tests Shift-JIS, Big5, EUC, etc. | For CJK languages (Japanese, Chinese, Korean)
Statistical Analysis | Frequency analysis of bytes / character patterns | Compare to known profiles
Confidence Scoring | Each prober assigns a confidence value | Helps rank possible encodings
Best Match Selection | Choose the encoding with highest confidence | Determines result
Return Result | Dictionary: encoding, confidence, language hint | Ready for decoding bytes
Fallback Handling | Optional: fall back to UTF-8 or locale encoding if uncertain | Ensures text is still usable
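The fallback step can be sketched as a simple try-in-order decoder (the function name and encoding list here are illustrative, not part of chardet):

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each encoding in order; latin-1 accepts any byte, so it is a safe last resort."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Unreachable while latin-1 is in the list, but kept as a defensive default
    return data.decode("latin-1", errors="replace"), "latin-1"

# b"caf\xe9" is not valid UTF-8 (truncated sequence), but decodes fine as cp1252
text, used = decode_with_fallback(b"caf\xe9")
print(text, used)  # café cp1252
```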