
Text - Text - Text

Overview

A lot of data exists as text. For example:

a) Text in legal documents

b) Text in investment analysis brochures

c) Text in transaction bills

d) Text in scanned trade registration documents

And so on.

This text lives in different kinds of storage, but it holds a lot of information that we can parse into data and use to derive general insights. Once we can feed it into a system, it works like a charm.

Transformation

Transform types of sources

Text arrives in many forms, from messy raw input to immediately usable content, so the first step is converting each source type to plain text.

E.g:

PDF files to text

PNG images to text

Online newspaper articles to text
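As a rough sketch of the online newspaper case, the snippet below pulls the visible text out of an HTML page using only the standard library. The names `TextExtractor` and `html_to_text` and the sample markup are made up for illustration. PDF and PNG sources would need a third-party PDF or OCR library, which is not shown here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


# Hypothetical sample page
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Revenue up 40%</h1><p>Details inside the report</p></body></html>"
)
print(html_to_text(sample))
```

This keeps only the text nodes, which is usually enough as input for the extraction steps that follow.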

Extract information from data

  1. Text to number
Example | Target | Information
--- | --- | ---
This is a 40 percent increase in revenue | 40% | Positive, for revenue
There are 3 types of flowers | 3 | Category, number of classes

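So there is a pattern for this:

import re

# Match integers, with or without a comma group (e.g. "40" or "1,234").
# The comma form comes first so that "1,234" is not cut off at "1".
pattern = re.compile(r"\d+,\d+|\d+")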
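To make the text-to-number step concrete, here is a small sketch that finds number tokens in a sentence and converts them to floats. The pattern is a slightly extended variant (thousands separators, decimals, optional percent sign), and the helper names `extract_numbers` and `to_number` are assumptions for this example only.

```python
import re

# Try the comma-grouped form first, then the plain form.
# Both allow an optional decimal part and a trailing percent sign.
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?%?|\d+(?:\.\d+)?%?")


def extract_numbers(text):
    """Return every number-like token found in the text, in order."""
    return NUMBER.findall(text)


def to_number(token):
    """Convert a token such as '1,234' or '40%' to a float (percent -> fraction)."""
    is_percent = token.endswith("%")
    value = float(token.rstrip("%").replace(",", ""))
    return value / 100 if is_percent else value


print(extract_numbers("Revenue increased by 40% this year"))
print(extract_numbers("There are 3 types of flowers"))
print([to_number(t) for t in extract_numbers("Total: 1,234 units at 12.5%")])
```

The surrounding words ("revenue", "types of flowers") still have to be interpreted separately to fill the Information column.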

The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:

^((?!hede).)*$

The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.

And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):

/^((?!hede).)*$/s

or use it inline:

/(?s)^((?!hede).)*$/

(where the /.../ are the regex delimiters, i.e., not part of the pattern)

If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:

/^((?!hede)[\s\S])*$/

Explanation

A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":

    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘

index    0      1      2      3      4      5      6      7

where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width-assertions because they don't consume any characters. They only assert/validate something.

So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$

As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).


https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word
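The same pattern can be tried in Python's `re` module, where the DOT-ALL modifier is the `re.DOTALL` flag. This is only a quick check of the behaviour described above; the sample lines are made up.

```python
import re

# Matches only when "hede" appears nowhere in the string
NO_HEDE = re.compile(r"^((?!hede).)*$")
# Same pattern, but the dot also matches line breaks
NO_HEDE_DOTALL = re.compile(r"^((?!hede).)*$", re.DOTALL)

lines = ["ABhedeCD", "ABCD", "hede", ""]
matches = [line for line in lines if NO_HEDE.match(line)]
print(matches)

# Without DOTALL, a string with an inner line break does not match at all
print(NO_HEDE.match("AB\nCD"))
print(bool(NO_HEDE_DOTALL.match("AB\nCD")))

# For a plain substring check, the `in` operator gives the same answer
simple = [line for line in lines if "hede" not in line]
print(simple == matches)
```

When the check is just "does not contain this substring", the last form is easier to read. The look-ahead is mainly useful when the condition must live inside a larger pattern.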

bubble_sort.py
import string
from string import Template
import datetime

# The `string` module has 4 main parts:
# a) Built-in variables
# b) Custom String Format
# c) Template
# d) Helpful functions

# A. BUILT-IN VARIABLE

# What is ASCII:
# Short for American Standard Code for Information Interchange.
# It is a character encoding standard for electronic communication.
# ASCII codes represent text in computers, telecommunications equipment, and other devices.
# Read more: [ASCII](https://en.wikipedia.org/wiki/ASCII)

# Built-in variables with self-explanatory names,
# separated into 4 groups: letters, digits, punctuation, and whitespace.
# A special case, `printable`, contains all 4 groups.
# They save you from memorizing these character sets
# and are very helpful in text analysis.

# Group 1:
string.ascii_letters
string.ascii_lowercase
string.ascii_uppercase

# Group 2:
string.digits
string.hexdigits
string.octdigits

# Group 3:
string.punctuation

# Group 4:
string.whitespace

# Contains all 4 groups:
string.printable

# B. CUSTOM FORMAT

# Type 1: Positional arguments, with or without explicit indices (auto-numbering needs Python 3.1+)
# Normal Case
'{0}, {1}, {2}'.format('a', 'b', 'c')
# Index Position
'{2}, {1}, {0}'.format('a', 'b', 'c')
# Auto index without using index
'{}, {}, {}'.format('a', 'b', 'c')
# Unpacking using *
'{0}, {1}, {2}'.format(*'abc')
# Repeat
'{0}, {1}, {0}'.format('F', 'S')

# Type 2: Named arguments
# Normal case
'Coordinates: {lat}, {lon}'.format(lat = '24.7N', lon='-12.4E')
# Unpack dict using **
coord = {'lat': '24.7N', 'lon':'-12.4E'}
'Coordinates: {lat}, {lon}'.format(**coord)

# Standard Format Specifier:
# format_spec     ::=  [[fill]align][sign][#][0][width][grouping_option][.precision][type]
# fill            ::=  <any character>
# align           ::=  "<" | ">" | "=" | "^"
# sign            ::=  "+" | "-" | " "
# width           ::=  digit+
# grouping_option ::=  "_" | ","
# precision       ::=  digit+
# type            ::=  "b" | "c" | "d" | "e" | "E" | "f" | "F" | "g" | "G" | "n" | "o" | "s" | "x" | "X" | "%"

# Fill and Align
# Case 1: Fill with * and align with > (right-aligned)
# E.g: '*****************************************************Tunnels'
'{:*>60}'.format('Tunnels')

# Case 2: Fill with ~ and align with ^ (centered)
# E.g: '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Python Pathway~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
'{:~^80}'.format('Python Pathway')

# Case 3: Sign options
# E.g: Positive and negative numbers with the +, space, and - options
'{:+f}; {:+f}'.format(4.6, -12.14)
'{: f}; {: f}'.format(3.14, -3.14)
'{:-f}; {:-f}'.format(5.94, -9.14)

# Case 4: Number bases
# E.g: Format the same number as decimal, hex, octal, and binary
'int: {0:d};  hex: {0:x};  oct: {0:o};  bin: {0:b}'.format(93)

# Case 5: Same as Case 4, with 0x, 0o, or 0b as prefix
# E.g: Add # before the type, such as for hex and oct
'int: {0:d};  hex: {0:#x};  oct: {0:#o};  bin: {0:#b}'.format(47)

# Case 6: Using the comma as a thousands separator
# E.g: 1234 into 1,234
'{:,}'.format(1234)
'{:_}'.format(123456)

# Case 7: Percentage with a fixed number of decimal places
'{:.3%}'.format(0.05821)

# C. TEMPLATE

# Template strings support $-based substitutions, using the following rules:
# ====
# $$ is an escape; it is replaced with a single $.
# $identifier names a substitution placeholder matching a mapping key of "identifier". By default, "identifier" is restricted to any case-insensitive ASCII alphanumeric string (including underscores) that starts with an underscore or ASCII letter. The first non-identifier character after the $ character terminates this placeholder specification.
# ${identifier} is equivalent to $identifier. It is required when valid identifier characters follow the placeholder but are not part of the placeholder, such as "${noun}ification".
# Any other appearance of $ in the string will result in a ValueError being raised.

# Basic concept
# 2 steps:
# a) Define template through Template
# b) Binding argument with `substitute`

# Template
s = Template("$user has been reviewed by $reviewer at $time")

# Binding
s.substitute(user="Pja", reviewer="Sungri", time=datetime.datetime.now())

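# `substitute` raises KeyError when a placeholder has no value,
# e.g. when $time is missing
try:
    s.substitute(user="Pja", reviewer="Sungri")
except KeyError as err:
    print("Missing placeholder:", err)

# `safe_substitute` does not raise an error:
# missing placeholders are left unchanged in the output
s.safe_substitute(user="Pja", reviewer="Sungri")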

# D. HELPFUL FUNCTIONS

# In my opinion, this one does not help much.
# But I like the idea: it combines `split`, `capitalize`, and `join` from `str`
string.capwords("Capitalize Word by separator", sep=" ")
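To show how the built-in variables help in text analysis, here is a minimal sketch that strips punctuation and measures how digit-heavy a string is. The helper names `normalize` and `digit_ratio` are invented for this example.

```python
import string

# Translation table that deletes every ASCII punctuation character
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    return " ".join(text.translate(_STRIP_PUNCT).lower().split())


def digit_ratio(text):
    """Share of characters that are ASCII digits (0.0 for an empty string)."""
    if not text:
        return 0.0
    return sum(ch in string.digits for ch in text) / len(text)


print(normalize("Hello, World!  It's 40% up."))
print(digit_ratio("a1b2"))
```

Note that `string.punctuation` only covers ASCII, so typographic quotes and other Unicode punctuation pass through unchanged.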
  1. Text to Date
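A possible starting point for text to date is to try a short list of known formats with `datetime.strptime`. The formats in `DATE_FORMATS` and the helper name `parse_date` are assumptions; real sources will need their own list, and `%B` depends on the active locale.

```python
from datetime import datetime

# Candidate formats, tried in order (assumed for illustration)
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y")


def parse_date(text):
    """Return a date for the first format that fits, or None if none does."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # this format did not fit, try the next one
    return None


print(parse_date("2021-03-15"))
print(parse_date(" 15/03/2021 "))
print(parse_date("not a date"))
```

Ambiguous inputs such as 03/04/2021 are resolved by the order of the list, so put the format your sources actually use first.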

  2. Text Padding
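For text padding, the built-in `str` methods and the format mini-language cover most cases. The values and the helper `format_row` below are illustrative only.

```python
# Left-pad a numeric string with zeros
invoice_id = "42".zfill(6)

# Pad on the right or left with a fill character (default is a space)
label = "name".ljust(10, ".")
amount = "9.50".rjust(8)

# Center within a fixed width
title = "title".center(11, "*")


def format_row(label, value, width=20):
    """Dotted label on the left, right-aligned amount in the last 10 columns."""
    return label.ljust(width - 10, ".") + "{:>10,.2f}".format(value)


print(invoice_id)
print(label)
print(amount)
print(title)
print(format_row("Total", 1234.5))
```

Fixed-width rows like this are handy when rebuilding bills or reports as aligned plain text.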

Regex

Google's RE2 syntax: https://github.com/google/re2/wiki/Syntax

BigQuery's regular expression functions are built on this RE2 syntax.
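One practical consequence: RE2 does not support look-around assertions, so the negative look-ahead pattern quoted earlier will not work there. The usual workaround is to match the plain substring and negate the result outside the regex (in BigQuery, something like NOT REGEXP_CONTAINS). Below is a sketch of the same idea in Python, with the helper name `lines_without` assumed for the example.

```python
import re

# A plain pattern with no look-around, so it is also valid RE2 syntax
CONTAINS_HEDE = re.compile(r"hede")


def lines_without(pattern, lines):
    """Keep the lines where the pattern does NOT occur (negation in code)."""
    return [line for line in lines if not pattern.search(line)]


print(lines_without(CONTAINS_HEDE, ["ABhedeCD", "ABCD", ""]))
```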

Libraries

Reference

https://github.com/google/re2/wiki/Syntax