Text - Text - Text¶
Overview¶
There is a lot of data that exists in text form. For example:
a) Text in legal documents
b) Text in investment analysis brochures
c) Text in transaction billing
d) Text in scanned trade registration documents
And so on.
It lives in different kinds of storage, but it holds a lot of information that we can parse into data, giving us a variety of general information. And if we can feed it into a system, it works like a charm.
Transformation¶
Transform types of sources¶
In the examples you can see that text appears in various forms, from messy whitespace to very usable layouts.
E.g:
PDF to text
PNG (image) to text
Online newspaper pages to text
Extract information of data¶
- Text to number
| Example | Target | Information |
|---|---|---|
| This is increased 40 percentage revenue | 40% | Positive, for revenue |
| There are has 3 types of flowers | 3 | Category, Number of class |
So there is a pattern we can capture for these.
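A minimal sketch of that pattern idea for the percentage example above; the regex and the rule it encodes are my own illustration, not a general-purpose extractor:

```python
import re

# Hypothetical extraction rule: grab the number that precedes
# "percent"/"percentage"/"%" in the sentence.
text = "This is increased 40 percentage revenue"
match = re.search(r'(\d+(?:\.\d+)?)\s*(?:percent(?:age)?|%)', text)
if match:
    print(match.group(1) + '%')  # 40%
```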
The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:
^((?!hede).)*$ The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.
And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):
/^((?!hede).)*$/s or use it inline:
/(?s)^((?!hede).)*$/ (where the /.../ are the regex delimiters, i.e., not part of the pattern)
If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:
/^((?!hede)[\s\S])*$/
Explanation
A string is just a list of n characters. Before and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":
        ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S   =   │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
        └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘
index        0      1      2      3      4      5      6      7
where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width assertions because they don't consume any characters. They only assert/validate something.
So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group and repeated zero or more times: ((?!hede).)*. Finally, the start and end of input are anchored to make sure the entire input is consumed: ^((?!hede).)*$
As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word
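The lookaround trick from the answer above can be checked quickly in Python:

```python
import re

# Match only strings that do NOT contain the substring "hede".
pattern = re.compile(r'^((?!hede).)*$')

print(bool(pattern.match("ABCD")))      # True: no "hede"
print(bool(pattern.match("ABhedeCD")))  # False: contains "hede"
```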
import string
from string import Template
import datetime
# `string` library has 4 main parts:
# a) Built-in constants
# b) Custom string formatting
# c) Template
# d) Helper functions
# A. BUILT-IN CONSTANTS
# What is ASCII:
# Short for American Standard Code for Information Interchange.
# It is a character encoding standard for electronic communication.
# ASCII codes represent text in computers, telecommunications equipment, and other devices
# Read more: [ASCII](https://en.wikipedia.org/wiki/ASCII)
# Built-in constants with self-explanatory names,
# separated into 4 groups: letters, digits, punctuation, and whitespace.
# A special case containing all four groups is `printable`.
# These save you from having to remember all the characters yourself
# and are very helpful in text analysis.
# Group 1:
string.ascii_letters
string.ascii_lowercase
string.ascii_uppercase
# Group 2:
string.digits
string.hexdigits
string.octdigits
# Group 3:
string.punctuation
# Group 4:
string.whitespace
# Contain 4 groups:
string.printable
# B. CUSTOM FORMAT
# Type 1: Index-based, with explicit or automatic indices (automatic requires Python 3.1+)
# Normal Case
'{0}, {1}, {2}'.format('a', 'b', 'c')
# Index Position
'{2}, {1}, {0}'.format('a', 'b', 'c')
# Auto index without using index
'{}, {}, {}'.format('a', 'b', 'c')
# Unpacking using *
'{0}, {1}, {2}'.format(*'abc')
# Repeat
'{0}, {1}, {0}'.format('F', 'S')
# Type 2: Naming arguments
# Normal case
'Coordinates: {lat}, {lon}'.format(lat = '24.7N', lon='-12.4E')
# Unpack dict using **
coord = {'lat': '24.7N', 'lon':'-12.4E'}
'Coordinates: {lat}, {lon}'.format(**coord)
# Standard Format Specifier:
# format_spec ::= [[fill]align][sign][#][0][width][grouping_option][.precision][type]
# fill ::= <any character>
# align ::= "<" | ">" | "=" | "^"
# sign ::= "+" | "-" | " "
# width ::= digit+
# grouping_option ::= "_" | ","
# precision ::= digit+
# type              ::= "b" | "c" | "d" | "e" | "E" | "f" | "F" | "g" | "G" | "n" | "o" | "s" | "x" | "X" | "%"
# Fill and Align
# Case 1: Fill with * and align right with > (fill is padded on the left)
# E.g: '*****************************************************Tunnels'
'{:*>60}'.format('Tunnels')
# Case 2: Fill with ~ and center with ^
# E.g: '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Python Pathway~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
'{:~^80}'.format('Python Pathway')
# Case 3: Sign options for numbers
# '+' always shows the sign; ' ' leaves a space for positives; '-' shows only the minus (the default)
'{:+f}; {:+f}'.format(4.6, -12.14)
'{: f}; {: f}'.format(3.14, -3.14)
'{:-f}; {:-f}'.format(5.94, -9.14)
# Case 4:
# E.g: Format a number in different bases
'int: {0:d}; hex: {0:x}; oct: {0:o}; bin: {0:b}'.format(93)
# Case 5: With 0x, 0o, or 0b as prefix
# E.g: Such as hex and oct type
'int: {0:d}; hex: {0:#x}; oct: {0:#o}; bin: {0:#b}'.format(47)
# Case 6: Using the comma as a thousands separator
# E.g: 1234 into 1,234
'{:,}'.format(1234)
'{:_}'.format(123456)
# Case 7: Percentage with number of precisions
'{:.3%}'.format(0.05821)
# C. TEMPLATE
# Template strings support $-based substitutions, using the following rules:
# ====
# $$ is an escape; it is replaced with a single $.
# $identifier names a substitution placeholder matching a mapping key of "identifier". By default, "identifier" is restricted to any case-insensitive ASCII alphanumeric string (including underscores) that starts with an underscore or ASCII letter. The first non-identifier character after the $ character terminates this placeholder specification.
# ${identifier} is equivalent to $identifier. It is required when valid identifier characters follow the placeholder but are not part of the placeholder, such as "${noun}ification".
# Any other appearance of $ in the string will result in a ValueError being raised.
# Basic concept
# Two steps:
# a) Define a template through Template
# b) Bind arguments with `substitute`
# Template
s = Template("$user has been reviewed by $reviewer at $time")
# Binding
s.substitute(user="Pja", reviewer="Sungri", time=datetime.datetime.now())
# KeyError when a placeholder is missing:
# E.g. $time is not supplied
try:
    s.substitute(user="Pja", reviewer="Sungri")
except KeyError as err:
    print("Missing placeholder:", err)
# No error with `safe_substitute`:
# it leaves missing placeholders in the result as-is
s.safe_substitute(user="Pja", reviewer="Sungri")
# D. HELPFUL FUNCTIONS
# In my opinion it doesn't help much,
# but I like the idea: it combines split, capitalize, and join from `str`
string.capwords("Capitalize word by separator", sep=" ")
- Text to Date
- Text Padding
- Regex
  - Google syntax for re2: https://github.com/google/re2/wiki/Syntax
  - BigQuery's regex functions implement this syntax
Libraries¶
Reference¶
https://github.com/google/re2/wiki/Syntax
Encoding Detection¶
1. General Idea of Character Encoding Detection¶
When you have raw bytes (e.g. from a file or network), you need to know the encoding to convert them into readable text. The general steps are:
- Check for BOM (Byte Order Mark): Some encodings (UTF-8, UTF-16, UTF-32) may include a BOM at the beginning. If present, it directly indicates the encoding.
- Byte pattern analysis:
  - Certain byte sequences are valid in one encoding but invalid in another.
  - For example, UTF-8 has strict rules: continuation bytes must be in 0x80–0xBF.
- Statistical analysis:
  - Libraries like chardet analyze frequency distributions of bytes in the text.
  - They compare patterns against profiles of known encodings (like Windows-1252, ISO-8859-1, Shift-JIS, etc.).
- Confidence scoring:
  - Each possible encoding gets a confidence score.
  - The detector selects the most likely encoding.
- Fallbacks:
  - If no encoding is certain, default to UTF-8 or a common locale encoding.
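The BOM check in step 1 can be sketched by hand; the byte prefixes below are the standard BOMs, while `detect_bom` is my own helper name:

```python
# Standard BOM byte prefixes; longer prefixes must be tested first,
# because the UTF-32-LE BOM starts with the UTF-16-LE BOM bytes.
BOMS = [
    (b'\xef\xbb\xbf', 'utf-8-sig'),
    (b'\xff\xfe\x00\x00', 'utf-32-le'),
    (b'\x00\x00\xfe\xff', 'utf-32-be'),
    (b'\xff\xfe', 'utf-16-le'),
    (b'\xfe\xff', 'utf-16-be'),
]

def detect_bom(data: bytes):
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: fall through to heuristics

print(detect_bom('hi'.encode('utf-8-sig')))  # utf-8-sig
print(detect_bom(b'plain ascii'))            # None
```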
2. How chardet Works¶
chardet is a Python port of Mozilla's Universal Charset Detector. Its process can be summarized as:
- Input raw bytes.
- Run multiple "probers":
  - Each prober is a small detector for a specific charset family (UTF-8, ISO-8859-*, Shift-JIS, etc.).
  - Each prober analyzes byte sequences and calculates a confidence score.
- Select the encoding with the highest confidence.
- Return the result as a dictionary, e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}.
Key Features:¶
- Heuristic-based: Not perfect, but works well on reasonably long text.
- Multi-byte aware: Can detect multi-byte encodings like UTF-8, Shift-JIS, Big5.
- Language hints: Can give a hint of language when charset alone is ambiguous.
3. Mozilla's Universal Charset Detector¶
- The original UCD (Universal Charset Detector) was developed by Mozilla for Firefox.
- Goals: Detect the encoding of arbitrary web content reliably.
- `chardet` in Python is a port of this project.
- Key ideas from Mozilla:
- Multiple charset probers work in parallel.
- Use byte sequence validation and statistical analysis.
- Return a confidence score for each candidate.
- Designed to handle HTML pages, emails, and other text.
So essentially:
Mozilla UCD → Heuristics + Byte Stats → Detect Charset
chardet → Python port of UCD → same heuristics and stats
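A toy illustration of the prober idea above, assuming a crude confidence rule; this is not chardet's real algorithm, just the shape of it:

```python
# Toy UTF-8 "prober": a strict decode either succeeds or fails.
def utf8_prober(raw: bytes) -> float:
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return 0.0
    # Pure ASCII decodes as UTF-8 but is ambiguous, so score it lower.
    return 0.99 if any(b >= 0x80 for b in raw) else 0.5

print(utf8_prober('héllo'.encode('utf-8')))  # 0.99
print(utf8_prober(b'h\xe9llo'))              # 0.0 (Latin-1 é is invalid UTF-8)
```

A real detector runs many such probers in parallel and picks the highest score.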
4. Mermaid Diagram for the Process¶
flowchart TD
A[Start: Raw Bytes] --> B[Check BOM]
B -->|BOM Found| C[Return Encoding from BOM]
B -->|No BOM| D[Run Charset Probers]
D --> E[UTF-8 Prober]
D --> F["Single-byte Probers (Latin1, Windows-1252...)"]
D --> G["Multi-byte Probers (Shift-JIS, Big5, etc.)"]
E --> H[Compute Confidence Score]
F --> H
G --> H
H --> I[Compare Scores]
I --> J[Select Most Likely Encoding]
J --> K[Return Encoding and Confidence]
5. How This Relates to Real-World Usage¶
- In Python:
import chardet
raw_bytes = b'\xc3\xa1' # á in UTF-8
result = chardet.detect(raw_bytes)
print(result)
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
- When it aligns with Mozilla's approach:
  - Both use multiple probers and statistics.
  - Both return a confidence score.
  - Both rely on heuristics for ambiguous cases.
- Limitations:
  - Short text may give wrong results.
  - Some encodings (like ISO-8859-1 vs Windows-1252) are very similar → low confidence.
  - Rare or mixed encodings may confuse the detector.
flowchart TD
A[Start: Raw Bytes Input] --> B["Check BOM (Byte Order Mark)"]
B -->|BOM Found| C[Determine Encoding from BOM]
B -->|No BOM| D[Run Charset Probers / Heuristics]
D --> E[UTF-8 Prober: Validate Byte Sequences]
D --> F[Single-byte Probers: Latin1, Windows-1252, ISO-8859-*]
D --> G[Multi-byte Probers: Shift-JIS, Big5, EUC-JP, etc.]
E --> H[Compute UTF-8 Confidence Score]
F --> I[Compute Single-byte Confidence Scores]
G --> J[Compute Multi-byte Confidence Scores]
H --> K[Compare Confidence Scores]
I --> K
J --> K
K --> L[Select Most Likely Encoding]
L --> M[Return Encoding + Confidence Score]
C --> M
M --> N[Decode Bytes to Text using Detected Encoding]
%% Optional notes
classDef note fill:#f9f,stroke:#333,stroke-width:1px,color:#000;
class N note;
Explanation of the Diagram¶
- Start: Receive raw bytes. Could be from a file, web page, or network.
- Check BOM:
  - If a BOM exists → the encoding is known directly (UTF-8, UTF-16LE/BE, UTF-32).
  - If no BOM → heuristics are needed.
- Charset probers:
  - UTF-8 prober: checks validity of multi-byte UTF-8 sequences.
  - Single-byte probers: detect encodings like Latin1, Windows-1252.
  - Multi-byte probers: for languages like Japanese or Chinese.
- Compute confidence scores: Each prober returns a score indicating likelihood.
- Compare scores: Choose the encoding with the highest confidence.
- Return encoding + confidence: Can be used to decode the bytes reliably.
- Decode: Final step to get readable text.
| Encoding Family | Examples / Notes | Type |
|---|---|---|
| UTF-8 | Standard Unicode encoding, variable-length | Multi-byte |
| UTF-16 | UTF-16LE, UTF-16BE | Multi-byte |
| UTF-32 | UTF-32LE, UTF-32BE | Multi-byte |
| ISO-8859-* | ISO-8859-1 (Latin1), ISO-8859-2, ISO-8859-5… | Single-byte |
| Windows Code Pages | Windows-1250, Windows-1251, Windows-1252… | Single-byte |
| Asian Encodings | Shift-JIS, EUC-JP, EUC-KR, GB2312, Big5 | Multi-byte |
| KOI8-R / KOI8-U | Russian / Ukrainian encodings | Single-byte |
| MacRoman / MacCyrillic | Classic Mac OS encodings | Single-byte |
| ASCII | 7-bit basic characters | Single-byte |
Note:
`chardet` can detect many encodings, but accuracy improves with longer text samples. Short strings or mixed encodings may give low-confidence results.
2️⃣ Table: Methods / Workflow in chardet¶
| Step / Method | Description | Purpose / Notes |
|---|---|---|
| Input Bytes | Raw bytes from file/network | Starting point |
| BOM Check | Look for Byte Order Mark (UTF-8, UTF-16, UTF-32) | Quick detection if present |
| Charset Probers | Multiple detectors for different families | Detects likely encoding |
| UTF-8 Prober | Validates multi-byte sequences | Checks UTF-8 correctness |
| Single-byte Probers | Tests Latin1, Windows-125x, ISO-8859-x encodings | Finds likely single-byte encodings |
| Multi-byte Probers | Tests Shift-JIS, Big5, EUC, etc. | For CJK languages (Japanese, Chinese, Korean) |
| Statistical Analysis | Frequency analysis of bytes / character patterns | Compare to known profiles |
| Confidence Scoring | Each prober assigns a confidence value | Helps rank possible encodings |
| Best Match Selection | Choose the encoding with highest confidence | Determines result |
| Return Result | Dictionary: encoding, confidence, language hint | Ready for decoding bytes |
| Fallback Handling | Optional: fallback to UTF-8 or locale encoding if uncertain | Ensures text is still usable |
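The fallback-handling row can be sketched as a small helper; the fallback order is my own choice, and Latin-1 is the last resort because every byte value is valid in it, so it never fails:

```python
def decode_with_fallback(raw: bytes, detected=None) -> str:
    # Try the detected encoding first, then UTF-8, then Latin-1 as a last resort.
    for enc in [detected, 'utf-8']:
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode('latin-1')

print(decode_with_fallback(b'\xc3\xa1', 'utf-8'))  # á
print(decode_with_fallback(b'\xe9'))               # é (via Latin-1 fallback)
```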