Finance Entity¶
Overview¶
Design a model for finance entity extraction using spaCy.
Process¶
Concept:
Named Entity Recognition (NER) is a subtask of natural language processing that focuses on identifying and classifying named entities within text. Named entities refer to specific categories of information such as person names, organization names, locations, dates, numerical values, and more. NER is vital in various applications, including information extraction, question answering, chatbots, sentiment analysis, and recommendation systems.
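As a quick reference for how this looks in code, here is a minimal sketch with spaCy's pretrained small English pipeline (assuming `en_core_web_sm` has been downloaded; the example sentence is made up):

```python
# Minimal sketch of NER with spaCy's pretrained English pipeline.
# Assumes: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. reported revenue of $94.8 billion for the second quarter of 2023.")

for ent in doc.ents:
    # Each entity exposes its surface text and predicted label (e.g. ORG, MONEY, DATE).
    print(ent.text, ent.label_)
```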
Follow:
https://blog.futuresmart.ai/building-a-custom-ner-model-with-spacy-a-step-by-step-guide
Concept: https://ongxuanhong.wordpress.com/2015/06/10/machine-learning-la-gi/
Spacy Training: https://spacy.io/usage/training
Sample dataset for NER: https://www.kaggle.com/datasets/finalepoch/medical-ner/data
Training and using PyVi: https://github.com/trungtv/vi_spacy
See more models: https://gitlab.com/trungtv/vi_spacy
Build CoNLL-U datasets (see the sketch after this list): https://pypi.org/project/conllu/
Using: https://github.com/vncorenlp/VnCoreNLP
Example: https://github.com/vndee/sentivi/tree/master
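As a small illustration of the conllu package referenced above, parsing a CoNLL-U formatted sentence gives structured token lists. The sentence and tags below are made up for the sketch:

```python
# Hedged sketch: building and parsing a CoNLL-U sentence with the conllu package.
from conllu import parse

# Fields: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
rows = [
    ["1", "Ngân_hàng", "_", "NOUN", "N", "_", "0", "root", "_", "_"],
    ["2", "ACB", "_", "PROPN", "Np", "_", "1", "nmod", "_", "_"],
]
conllu_text = "# text = Ngân hàng ACB\n" + "\n".join("\t".join(r) for r in rows) + "\n\n"

sentences = parse(conllu_text)
for token in sentences[0]:
    print(token["form"], token["upos"])
```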
Training:
Preparing data for the training strategy (see the sketch after this list): https://ner.pythonhumanities.com/03_02_train_spacy_ner_model.html
Using: https://github.com/undertheseanlp/ner
Design process for training: https://github.com/undertheseanlp/ner/blob/master/data_conversion2.py
Sample dataset for training: https://github.com/ds4v/absa-vlsp-2018/blob/main/datasets/vlsp2018_hotel/1-VLSP2018-SA-Hotel-train.txt
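For the data preparation item above: spaCy v3 expects training data as serialized `Doc` objects in a `DocBin`. A minimal sketch of that conversion (the example text, entity offsets, and output path are placeholders):

```python
# Hedged sketch: converting (text, [(start, end, label), ...]) annotations into
# spaCy's binary training format. Example data and "train.spacy" path are placeholders.
import spacy
from spacy.tokens import DocBin

TRAIN_DATA = [
    ("Vietcombank công bố lợi nhuận quý 3.", [(0, 11, "ORG")]),
]

nlp = spacy.blank("vi")  # spaCy's Vietnamese tokenizer requires the pyvi package
doc_bin = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    spans = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:  # skip annotations that do not align to token boundaries
            spans.append(span)
    doc.ents = spans
    doc_bin.add(doc)
doc_bin.to_disk("./train.spacy")
```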
Metrics:
Calculated metrics: https://keras.io/api/metrics/
Model Evaluation: https://mlflow.org/docs/latest/model-evaluation/index.html
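Keras and MLflow cover general-purpose metrics and evaluation; for NER specifically, the usual numbers are span-level precision, recall, and F1. A self-contained sketch, independent of either library and using made-up spans:

```python
# Hedged sketch: span-level precision/recall/F1 for NER.
# Spans are (start_char, end_char, label) tuples; the values below are made up.
def span_prf(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                                  # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {(0, 11, "ORG"), (30, 36, "DATE")}
pred = {(0, 11, "ORG"), (15, 20, "MONEY")}
print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)
```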
Term:
- Data Catalog
- baseline >> Model baseline
- Commercial baseline
TODO¶
- Build datacards for models
- Write up the created training datasets.
- Learn KerasNLP: https://keras.io/examples/generative/gpt2_text_generation_with_kerasnlp/
- Build a transformer from scratch: https://keras.io/guides/keras_nlp/transformer_pretraining/
- Documentation: https://huybik.github.io/Word-Tokenizer-Benchmark/
And there we have our desired metrics:

| Tool | Tagging time (s) | Word segmentation accuracy | POS tag accuracy | Entity recognition accuracy |
| --- | --- | --- | --- | --- |
| PyVi | 18.31 | 0.5788 | 0.6820 | 0.0 |
| Underthesea | 38.30 | 0.8004 | 0.6021 | 0.0 |
| VnCoreNLP | 67.42 | 0.7837 | 0.6329 | 0.0 |
We have tagging time along with accuracy for word segmentation, POS tagging, and entity recognition (though entities are missing from the dataset, so entity recognition is always 0).
PyVi is the fastest of the lot, twice as fast as the second fastest, which is the result of the optimized spaCy library. However, PyVi's trained model loses out on segmentation accuracy, managing only 57.8%. Underthesea achieved the highest segmentation accuracy at 80%, but loses out on POS tagging to both PyVi and VnCoreNLP. The Java tool VnCoreNLP is the slowest of the lot due to its Java wrapper. I conclude that the best way to extract the most correct tokens is to mix and match: use Underthesea for word segmentation and PyVi for POS tagging (a sketch of this combination follows below).
Tokenizer benchmark: https://huybik.github.io/Word-Tokenizer-Benchmark-followup/
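As a sketch of the mix-and-match conclusion above (Underthesea for word segmentation, PyVi for POS tagging), assuming both packages are installed; the example sentence is made up:

```python
# Hedged sketch: segment with Underthesea, then POS-tag the segmented text with PyVi.
# Assumes: pip install underthesea pyvi
from underthesea import word_tokenize
from pyvi import ViPosTagger

text = "Ngân hàng Vietcombank công bố lợi nhuận quý ba."
segmented = word_tokenize(text, format="text")   # words joined with underscores, e.g. "Ngân_hàng ..."
words, tags = ViPosTagger.postagging(segmented)  # PyVi POS tags on Underthesea's segmentation
print(list(zip(words, tags)))
```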
Follow¶
For the thinking process:
https://github.com/ds4v/absa-vlsp-2018
https://nlpprogress.com/vietnamese/vietnamese.html
https://github.com/vndee/awsome-vietnamese-nlp
FAQ:
- How to store models?
- How to implement with MLflow?
- Do we buy or create samples? https://prodi.gy/buy | 390 PER | 490 ORG (1715 renew 12m)
- Tools for annotations:
Text annotation is a crucial task in Natural Language Processing (NLP) that involves labeling various aspects of text data for training machine learning models, creating training datasets, or extracting information. Several tools are available for text annotation, each with its own strengths and weaknesses. Some popular ones:
Label Studio: Label Studio is a versatile open-source tool that supports text annotation along with other data types. It provides a user-friendly interface for creating labeled datasets for NLP tasks.
Prodigy: Prodigy (prodi.gy) is a paid annotation tool developed by Explosion AI, the creators of spaCy. It offers a streamlined interface for annotating text data, supports custom annotation workflows, and uses active learning to improve annotation efficiency.
Brat: Brat (Brat Rapid Annotation Tool) is an open-source web-based tool specifically designed for text annotation tasks. It allows users to annotate text for various NLP tasks like named entity recognition, relation extraction, and more.
Doccano: Doccano is an open-source text annotation tool that supports various annotation types, including named entity recognition, text classification, and sequence labeling. It offers an intuitive interface for annotating text data.
Tagtog: Tagtog is a web-based text annotation tool that supports collaborative annotation workflows. It provides features for annotating text for tasks like named entity recognition, text classification, and more.
Amazon SageMaker Ground Truth: Amazon SageMaker Ground Truth is a managed data labeling service that supports text annotation along with other data types. It provides a scalable platform for creating high-quality labeled datasets for machine learning.
https://www.quora.com/NLP-what-are-the-best-tools-for-text-annotation
Label with: https://doccano.github.io/doccano/developer_guide/
Then convert to spaCy format: https://stackoverflow.com/questions/77248199/how-to-convert-doccano-exported-jsonl-format-to-spacy-format
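A hedged sketch of that conversion, assuming the doccano export is JSONL with a `text` field and a `label` field of `[start, end, label]` triples (field names vary between doccano project types), feeding into the same DocBin approach shown in the Training section:

```python
# Hedged sketch: converting a doccano JSONL export into spaCy's binary training format.
# Assumes each line looks like {"text": "...", "label": [[start, end, "LABEL"], ...]};
# adjust the field names to match your doccano project type.
import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("vi")  # requires pyvi for spaCy's Vietnamese tokenizer
doc_bin = DocBin()

with open("doccano_export.jsonl", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        record = json.loads(line)
        doc = nlp.make_doc(record["text"])
        spans = []
        for start, end, label in record.get("label", []):
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                spans.append(span)
        doc.ents = spans
        doc_bin.add(doc)

doc_bin.to_disk("./train.spacy")
```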