Tokenizer

Overview

Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The goal of tokenization is to convert input text into a format that is easier to analyze and process.

Example

The following is an example of tokenization:

Input:  Hello my name is John.
Tokens: ["Hello", "my", "name", "is", "John", "."]
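
A minimal word-level tokenizer along these lines can be sketched with Python's standard library (the regular expression here is illustrative; production tokenizers handle many more cases, such as contractions and Unicode categories):

```python
import re

def simple_tokenize(text):
    # Match runs of word characters, or any single non-space symbol,
    # so punctuation like "." becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

simple_tokenize("Hello my name is John.")
# → ['Hello', 'my', 'name', 'is', 'John', '.']
```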

(+) Two subword methods: BPE and WordPiece

Method 1: BPE

BPE stands for Byte Pair Encoding. It is a subword segmentation algorithm that splits words into smaller units. BPE is a statistical method that finds a set of subwords (learned as a sequence of merges) that can compactly represent the input text.

The process of BPE can be described as follows:

  1. Initialize the vocabulary with all individual characters that appear in the training corpus.
  2. Count the frequency of every adjacent pair of symbols across the corpus.
  3. Merge the most frequent pair into a single new symbol and add it to the vocabulary.
  4. Repeat steps 2-3 until the desired vocabulary size (or number of merges) is reached.
  5. To tokenize new text, apply the learned merges in order, so known words stay whole and unknown words split into known subwords.

The following is an example of BPE:

Input    Subwords
cat      [cat]
cats     [cat, s]
running  [run, ning]
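
The training loop above can be sketched in a few dozen lines of Python (a minimal illustration on a toy corpus, not an optimized implementation; real BPE trainers also handle word-boundary markers and byte-level fallback):

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

def train_bpe(corpus, num_merges):
    # Step 1: start from individual characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # steps 2-3: most frequent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

merges, words = train_bpe(["cat", "cats", "cat", "run", "running", "runner"], 6)
```

After a few merges on this toy corpus, frequent words such as "cat" and "run" become single symbols, while rarer words remain split into subwords.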

Method 2: WordPiece

WordPiece is a subword segmentation algorithm that splits words into subwords. WordPiece is similar to BPE, but it uses a different criterion to choose which pair of symbols to merge.

The process of WordPiece can be described as follows:

  1. Initialize the vocabulary with all individual characters that appear in the training corpus.
  2. Count adjacent symbol pairs as in BPE, but score each pair by count(ab) / (count(a) × count(b)) rather than by raw frequency.
  3. Merge the highest-scoring pair into a new symbol and add it to the vocabulary. This favors pairs whose parts rarely occur apart from each other.
  4. Repeat steps 2-3 until the desired vocabulary size is reached.
  5. To tokenize a word, greedily match the longest vocabulary entry from left to right; pieces that continue a word are written with a "##" prefix.

The following is an example of WordPiece (in the BERT convention, "##" marks a subword that continues a word):

Input    Subwords
cat      [cat]
cats     [cat, ##s]
running  [run, ##ning]
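
The segmentation step can be sketched as a greedy longest-match-first search. In the BERT convention, subwords that continue a word carry a "##" prefix; the toy vocabulary below is an assumption for illustration:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation, as used by BERT's tokenizer."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Shrink the window from the right until a vocabulary entry matches.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no segmentation found for this word
        tokens.append(piece)
        start = end
    return tokens

vocab = {"cat", "run", "##s", "##ning", "##ner"}  # toy vocabulary (assumed)
wordpiece_tokenize("running", vocab)
# → ['run', '##ning']
```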

Comparison

The following is a comparison of BPE and WordPiece:

Feature          BPE                              WordPiece
Merge criterion  Most frequent pair               Highest score count(ab) / (count(a) × count(b))
Complexity       Lower (frequency counting only)  Higher (likelihood-style scoring)
Example          [cat, s]                         [cat, ##s]
Used by          NMT systems, GPT-2, RoBERTa      BERT
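
The difference in merge criterion can be shown with a small worked example (the counts below are illustrative numbers, not taken from a real corpus):

```python
from collections import Counter

# Toy symbol and pair frequencies (assumed for illustration).
unigrams = Counter({"r": 10, "u": 10, "n": 25, "e": 40})
pairs = Counter({("r", "u"): 9, ("n", "e"): 12})

def bpe_rank(pair):
    # BPE merges the most frequent pair.
    return pairs[pair]

def wordpiece_score(pair):
    # WordPiece normalizes by unigram counts, favoring pairs whose
    # parts rarely occur apart.
    a, b = pair
    return pairs[pair] / (unigrams[a] * unigrams[b])

best_bpe = max(pairs, key=bpe_rank)        # ("n", "e"): 12 > 9
best_wp = max(pairs, key=wordpiece_score)  # ("r", "u"): 9/100 > 12/1000
```

Here BPE picks ("n", "e") because it is the more frequent pair, while WordPiece picks ("r", "u") because "r" and "u" almost always occur together relative to their individual counts.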

(+) Tokenizers for Vietnamese

The following is a table of currently available Vietnamese tokenizer packages:

Package Name Language Last Updated
coccoc-tokenizer C++
RDRSegmenter Java
RDRPOSTagger Java
VnCoreNLP Java
vlp-tok Scala
ETNLP Python
VietnameseTextNormalizer Python
nnvlp Python
jPTDP Python
vi_spacy Python
underthesea Python
vnlp Java
pyvi Python
JVnTextPro Java
DongDu C++
VLSP Toolkit
vTools
JNSP Java

https://vlsp.hpda.vn/demo/?page=resources

https://github.com/vndee/awsome-vietnamese-nlp

Using: https://arxiv.org/html/2411.12240v2