Tokenizer¶
Overview¶
Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The goal is to convert input text into units that are easier to analyze and process.
Example¶
The following is an example of tokenization:
Input | Tokens |
---|---|
Hello my name is John. | ["Hello", "my", "name", "is", "John", "."] |
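The split shown in the table can be reproduced with a short regular expression. This is only a minimal sketch: the function name `simple_tokenize` is hypothetical, and the rule (runs of word characters, plus each punctuation mark as its own token) is an assumption that works for this sentence, not a general-purpose tokenizer.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character
    # nor whitespace, such as "." or ",".
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Hello my name is John.")
print(tokens)  # ['Hello', 'my', 'name', 'is', 'John', '.']
```

Whitespace is discarded and never becomes a token, which matches the table above.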
(+) Two subword tokenization methods: BPE and WordPiece
Method 1: BPE¶
BPE stands for Byte Pair Encoding. Originally a data compression technique, it is used in NLP as a subword segmentation algorithm: it learns, from corpus statistics, a vocabulary of subwords that can represent any input text.
The process of BPE can be described as follows:
- Initialize the vocabulary with the individual characters that occur in the training corpus, and split every word into characters.
- Count how often each pair of adjacent symbols occurs across the corpus.
- Merge the most frequent pair into a single new symbol and add it to the vocabulary.
- Recount the pairs on the updated corpus and repeat until the vocabulary reaches the target size (or a fixed number of merges has been performed).
- To tokenize new text, split each word into characters and apply the learned merges in the order in which they were learned.
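BPE training is commonly formulated as repeatedly merging the most frequent pair of adjacent symbols. The sketch below follows that formulation on a toy word-frequency table. The names `merge_pair` and `learn_bpe` and the toy counts are hypothetical, and ties between equally frequent pairs are broken by whichever pair was seen first, which is an assumption of this sketch rather than part of the algorithm.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Fuse every occurrence of `pair` in a sequence of symbols."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs: dict, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules from a word-frequency table."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (first seen wins on ties).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged = merge_pair(symbols, best)
            new_corpus[merged] = new_corpus.get(merged, 0) + freq
        corpus = new_corpus
    return merges


# Toy corpus (hypothetical counts): "cat" seen 5 times, "cats" 3 times.
print(learn_bpe({"cat": 5, "cats": 3}, num_merges=3))
# [('c', 'a'), ('ca', 't'), ('cat', 's')]
```

The ordered list of merges is the learned model: the final vocabulary is the initial characters plus one new symbol per merge.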
The following is an illustrative example of BPE output (the actual splits depend on the merges learned from the training corpus):
Input | Subwords |
---|---|
cat | [cat] |
cats | [cat, s] |
running | [run, ning] |
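To see how splits like these can arise, the sketch below segments a word by replaying an ordered list of merge rules. The merge list is hand-written for illustration (hypothetical, not learned from a real corpus) and was chosen so that it reproduces the three rows of the table; `apply_bpe` is likewise a hypothetical name.

```python
def apply_bpe(word: str, merges: list) -> list:
    """Segment `word` by replaying merge rules in the order they were learned."""
    symbols = list(word)  # start from single characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, listed in learning order.
merges = [
    ("c", "a"), ("ca", "t"),
    ("r", "u"), ("ru", "n"),
    ("n", "g"), ("i", "ng"), ("n", "ing"),
]

for word in ["cat", "cats", "running"]:
    print(word, apply_bpe(word, merges))
# cat ['cat']
# cats ['cat', 's']
# running ['run', 'ning']
```

A different training corpus would yield different merges and therefore different splits for the same words.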
Method 2: WordPiece¶
WordPiece is a subword segmentation algorithm. It is similar to BPE, but it chooses merges differently: instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data under the current vocabulary.
The process of WordPiece can be described as follows:
- Initialize the vocabulary with the individual characters that occur in the training corpus.
- Score every pair of adjacent symbols; a commonly described score is the pair's frequency divided by the product of the frequencies of its two parts.
- Merge the highest-scoring pair and add it to the vocabulary.
- Repeat until the vocabulary reaches the target size.
- To tokenize a word, repeatedly take the longest prefix of the remaining characters that is in the vocabulary; if no piece matches, the word is mapped to an unknown token.
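The main difference from BPE lies in how candidate merges are ranked. The sketch below uses a commonly described scoring rule, count(pair) / (count(first) * count(second)); treat this formula as an assumption, since published descriptions of WordPiece training vary. The function name `wordpiece_pair_scores` and the toy counts are hypothetical, and the `##` prefix on non-initial characters follows the BERT-style convention.

```python
from collections import Counter


def wordpiece_pair_scores(word_freqs: dict) -> dict:
    """Score each adjacent symbol pair as count(pair) / (count(a) * count(b))."""
    symbol_counts = Counter()
    pair_counts = Counter()
    for word, freq in word_freqs.items():
        if not word:
            continue  # skip empty strings
        # First character is bare; later characters carry the "##" prefix.
        symbols = [word[0]] + ["##" + ch for ch in word[1:]]
        for symbol in symbols:
            symbol_counts[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    return {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }


# Toy corpus (hypothetical counts).
scores = wordpiece_pair_scores({"cat": 5, "cats": 3, "car": 2})
best = max(scores, key=scores.get)
print(best, scores[best])  # ('##t', '##s') 0.125
```

On this toy corpus the most frequent pair is `('c', '##a')` with 10 occurrences, which is what BPE would merge first, yet the score favours `('##t', '##s')` because `##s` almost never appears without a preceding `##t`.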
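The following is an illustrative example of WordPiece output (BERT-style implementations write continuation pieces with a `##` prefix, for example `##s`):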
Input | Subwords |
---|---|
cat | [cat] |
cats | [cat, s] |
running | [run, ning] |
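At tokenization time, WordPiece as used in BERT-style tokenizers applies a greedy longest-match-first rule against a fixed vocabulary. The sketch below illustrates that rule; the function name `wordpiece_tokenize`, the four-entry vocabulary, and the `[UNK]` default are assumptions chosen to mirror the table, with continuation pieces written using the `##` prefix.

```python
def wordpiece_tokenize(word: str, vocab: set, unk: str = "[UNK]") -> list:
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary.
vocab = {"cat", "run", "##s", "##ning"}

for word in ["cat", "cats", "running", "dog"]:
    print(word, wordpiece_tokenize(word, vocab))
# cat ['cat']
# cats ['cat', '##s']
# running ['run', '##ning']
# dog ['[UNK]']
```

Unlike BPE, this procedure needs only the final vocabulary, not the order in which its entries were learned.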
Comparison¶
The following is a comparison of BPE and WordPiece:
Feature | BPE | WordPiece |
---|---|---|
Complexity (relative score) | 3 | 5 |
Example | [cat, s] | [run, ning] |
Use-cases | NMT, GPT-2, RoBERTa | BERT |
(+) Tokenizers for Vietnamese
Currently available tokenizers:
Package Name | Language | Last Updated |
---|---|---|
coccoc-tokenizer | C++ | |
RDRSegmenter | Java | |
RDRPOSTagger | Java | |
VnCoreNLP | Java | |
vlp-tok | Scala | |
ETNLP | Python | |
VietnameseTextNormalizer | Python | |
nnvlp | Python | |
jPTDP | Python | |
vi_spacy | Python | |
underthesea | Python | |
vnlp | Java | |
pyvi | Python | |
JVnTextPro | Java | |
DongDu | C++ | |
VLSP Toolkit | | |
vTools | | |
JNSP | Java | |
https://vlsp.hpda.vn/demo/?page=resources
https://github.com/vndee/awsome-vietnamese-nlp