Tokenizer

Overview

Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The goal of tokenization is to convert input text into a format that is easier to analyze and process.

Example

The following is an example of tokenization:

Input:  Hello my name is John.
Tokens: ["Hello", "my", "name", "is", "John", "."]
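
A minimal word-level tokenizer along these lines can be sketched with Python's standard library (the regular expression here is illustrative; production tokenizers handle many more cases, such as contractions and Unicode categories):

```python
import re

def simple_tokenize(text):
    # Match runs of word characters, or any single non-space symbol,
    # so punctuation like "." becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

simple_tokenize("Hello my name is John.")
# → ['Hello', 'my', 'name', 'is', 'John', '.']
```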

(+) Two subword methods: BPE and WordPiece

Method 1: BPE

BPE stands for Byte Pair Encoding. It is a subword segmentation algorithm that splits words into smaller units. BPE is a statistical method that finds a set of subwords (learned as a sequence of merges) that can compactly represent the input text.

The process of BPE can be described as follows:

  1. Initialize the vocabulary with all individual characters that appear in the training corpus.
  2. Count the frequency of every adjacent pair of symbols across the corpus.
  3. Merge the most frequent pair into a single new symbol and add it to the vocabulary.
  4. Repeat steps 2-3 until the desired vocabulary size (or number of merges) is reached.
  5. To tokenize new text, apply the learned merges in order, so known words stay whole and unknown words split into known subwords.

The following is an example of BPE:

Input    Subwords
cat      [cat]
cats     [cat, s]
running  [run, ning]
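
The training loop above can be sketched in a few dozen lines of Python (a minimal illustration on a toy corpus, not an optimized implementation; real BPE trainers also handle word-boundary markers and byte-level fallback):

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

def train_bpe(corpus, num_merges):
    # Step 1: start from individual characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # steps 2-3: most frequent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

merges, words = train_bpe(["cat", "cats", "cat", "run", "running", "runner"], 6)
```

After a few merges on this toy corpus, frequent words such as "cat" and "run" become single symbols, while rarer words remain split into subwords.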

Method 2: WordPiece

WordPiece is a subword segmentation algorithm that splits words into subwords. WordPiece is similar to BPE, but it uses a different criterion to choose which pair of symbols to merge.

The process of WordPiece can be described as follows:

  1. Initialize the vocabulary with all individual characters that appear in the training corpus.
  2. Count adjacent symbol pairs as in BPE, but score each pair by count(ab) / (count(a) × count(b)) rather than by raw frequency.
  3. Merge the highest-scoring pair into a new symbol and add it to the vocabulary. This favors pairs whose parts rarely occur apart from each other.
  4. Repeat steps 2-3 until the desired vocabulary size is reached.
  5. To tokenize a word, greedily match the longest vocabulary entry from left to right; pieces that continue a word are written with a "##" prefix.

The following is an example of WordPiece (in the BERT convention, "##" marks a subword that continues a word):

Input    Subwords
cat      [cat]
cats     [cat, ##s]
running  [run, ##ning]
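
The segmentation step can be sketched as a greedy longest-match-first search. In the BERT convention, subwords that continue a word carry a "##" prefix; the toy vocabulary below is an assumption for illustration:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation, as used by BERT's tokenizer."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Shrink the window from the right until a vocabulary entry matches.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no segmentation found for this word
        tokens.append(piece)
        start = end
    return tokens

vocab = {"cat", "run", "##s", "##ning", "##ner"}  # toy vocabulary (assumed)
wordpiece_tokenize("running", vocab)
# → ['run', '##ning']
```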

Comparison

The following is a comparison of BPE and WordPiece:

Feature          BPE                              WordPiece
Merge criterion  Most frequent pair               Highest score count(ab) / (count(a) × count(b))
Complexity       Lower (frequency counting only)  Higher (likelihood-style scoring)
Example          [cat, s]                         [cat, ##s]
Used by          NMT systems, GPT-2, RoBERTa      BERT
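
The difference in merge criterion can be shown with a small worked example (the counts below are illustrative numbers, not taken from a real corpus):

```python
from collections import Counter

# Toy symbol and pair frequencies (assumed for illustration).
unigrams = Counter({"r": 10, "u": 10, "n": 25, "e": 40})
pairs = Counter({("r", "u"): 9, ("n", "e"): 12})

def bpe_rank(pair):
    # BPE merges the most frequent pair.
    return pairs[pair]

def wordpiece_score(pair):
    # WordPiece normalizes by unigram counts, favoring pairs whose
    # parts rarely occur apart.
    a, b = pair
    return pairs[pair] / (unigrams[a] * unigrams[b])

best_bpe = max(pairs, key=bpe_rank)        # ("n", "e"): 12 > 9
best_wp = max(pairs, key=wordpiece_score)  # ("r", "u"): 9/100 > 12/1000
```

Here BPE picks ("n", "e") because it is the more frequent pair, while WordPiece picks ("r", "u") because "r" and "u" almost always occur together relative to their individual counts.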

(+) Tokenizers for Vietnamese

The following is a table of currently available Vietnamese tokenizer packages:

Package Name Language Last Updated
coccoc-tokenizer C++
RDRSegmenter Java
RDRPOSTagger Java
VnCoreNLP Java
vlp-tok Scala
ETNLP Python
VietnameseTextNormalizer Python
nnvlp Python
jPTDP Python
vi_spacy Python
underthesea Python
vnlp Java
pyvi Python
JVnTextPro Java
DongDu C++
VLSP Toolkit
vTools
JNSP Java

https://vlsp.hpda.vn/demo/?page=resources

https://github.com/vndee/awsome-vietnamese-nlp

Using: https://arxiv.org/html/2411.12240v2