Finance Entity¶
Overview¶
Design a model for finance entity extraction using spaCy.
Process¶
Concept:
Named Entity Recognition (NER) is a subtask of natural language processing that focuses on identifying and classifying named entities within text. Named entities refer to specific categories of information such as person names, organization names, locations, dates, numerical values, and more. NER is vital in various applications, including information extraction, question answering, chatbots, sentiment analysis, and recommendation systems.
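As a quick reference for how this looks in code, here is a minimal sketch with spaCy's pretrained small English pipeline (assuming `en_core_web_sm` has been downloaded; the example sentence is made up):

```python
# Minimal sketch of NER with spaCy's pretrained English pipeline.
# Assumes: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. reported revenue of $94.8 billion for the second quarter of 2023.")

for ent in doc.ents:
    # Each entity exposes its surface text and predicted label (e.g. ORG, MONEY, DATE).
    print(ent.text, ent.label_)
```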
Follow:
https://blog.futuresmart.ai/building-a-custom-ner-model-with-spacy-a-step-by-step-guide
Concept: https://ongxuanhong.wordpress.com/2015/06/10/machine-learning-la-gi/
Spacy Training: https://spacy.io/usage/training
Sample dataset for NER: https://www.kaggle.com/datasets/finalepoch/medical-ner/data
Training and using PyVi: https://github.com/trungtv/vi_spacy
See more models: https://gitlab.com/trungtv/vi_spacy
Build CoNLL-U datasets (see the sketch after this list): https://pypi.org/project/conllu/
Using: https://github.com/vncorenlp/VnCoreNLP
Example: https://github.com/vndee/sentivi/tree/master
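As a small illustration of the conllu package referenced above, parsing a CoNLL-U formatted sentence gives structured token lists. The sentence and tags below are made up for the sketch:

```python
# Hedged sketch: building and parsing a CoNLL-U sentence with the conllu package.
from conllu import parse

# Fields: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
rows = [
    ["1", "Ngân_hàng", "_", "NOUN", "N", "_", "0", "root", "_", "_"],
    ["2", "ACB", "_", "PROPN", "Np", "_", "1", "nmod", "_", "_"],
]
conllu_text = "# text = Ngân hàng ACB\n" + "\n".join("\t".join(r) for r in rows) + "\n\n"

sentences = parse(conllu_text)
for token in sentences[0]:
    print(token["form"], token["upos"])
```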
Training:
Preparing data for the training strategy (see the sketch after this list): https://ner.pythonhumanities.com/03_02_train_spacy_ner_model.html
Using: https://github.com/undertheseanlp/ner
Design process for training: https://github.com/undertheseanlp/ner/blob/master/data_conversion2.py
Sample dataset for training: https://github.com/ds4v/absa-vlsp-2018/blob/main/datasets/vlsp2018_hotel/1-VLSP2018-SA-Hotel-train.txt
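For the data preparation item above: spaCy v3 expects training data as serialized `Doc` objects in a `DocBin`. A minimal sketch of that conversion (the example text, entity offsets, and output path are placeholders):

```python
# Hedged sketch: converting (text, [(start, end, label), ...]) annotations into
# spaCy's binary training format. Example data and "train.spacy" path are placeholders.
import spacy
from spacy.tokens import DocBin

TRAIN_DATA = [
    ("Vietcombank công bố lợi nhuận quý 3.", [(0, 11, "ORG")]),
]

nlp = spacy.blank("vi")  # spaCy's Vietnamese tokenizer requires the pyvi package
doc_bin = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    spans = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:  # skip annotations that do not align to token boundaries
            spans.append(span)
    doc.ents = spans
    doc_bin.add(doc)
doc_bin.to_disk("./train.spacy")
```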
Metrics:
Calculated metrics: https://keras.io/api/metrics/
Model Evaluation: https://mlflow.org/docs/latest/model-evaluation/index.html
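Keras and MLflow cover general-purpose metrics and evaluation; for NER specifically, the usual numbers are span-level precision, recall, and F1. A self-contained sketch, independent of either library and using made-up spans:

```python
# Hedged sketch: span-level precision/recall/F1 for NER.
# Spans are (start_char, end_char, label) tuples; the values below are made up.
def span_prf(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                                  # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {(0, 11, "ORG"), (30, 36, "DATE")}
pred = {(0, 11, "ORG"), (15, 20, "MONEY")}
print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)
```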
Term:
- Data Catalog
- baseline >> Model baseline
- Commercial baseline
TODO¶
- Build datacards for models
- Write up the created training datasets.
- Learn KerasNLP: https://keras.io/examples/generative/gpt2_text_generation_with_kerasnlp/
- Build a transformer from scratch: https://keras.io/guides/keras_nlp/transformer_pretraining/
- Documentation: https://huybik.github.io/Word-Tokenizer-Benchmark/
And there we have our desired metrics:

| Tool | Tagging time (s) | Word segmentation accuracy | POS tag accuracy | Entity recognition accuracy |
| --- | --- | --- | --- | --- |
| PyVi | 18.31 | 0.5788 | 0.6820 | 0.0 |
| Underthesea | 38.30 | 0.8004 | 0.6021 | 0.0 |
| VnCoreNLP | 67.42 | 0.7837 | 0.6329 | 0.0 |
We have tagging time along with accuracy for word segmentation, POS tagging, and entity recognition (though entities are missing from the dataset, so entity recognition is always 0).
PyVi is the fastest of the lot, twice as fast as the second fastest, which is the result of the optimized spaCy library. However, PyVi's trained model loses out on segmentation accuracy, managing only 57.8%. Underthesea achieved the highest segmentation accuracy at 80%, but loses out on POS tagging to both PyVi and VnCoreNLP. The Java tool VnCoreNLP is the slowest of the lot due to its Java wrapper. I conclude that the best way to extract the most correct tokens is to mix and match: use Underthesea for word segmentation and PyVi for POS tagging (a sketch of this combination follows below).
Tokenizer benchmark: https://huybik.github.io/Word-Tokenizer-Benchmark-followup/
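As a sketch of the mix-and-match conclusion above (Underthesea for word segmentation, PyVi for POS tagging), assuming both packages are installed; the example sentence is made up:

```python
# Hedged sketch: segment with Underthesea, then POS-tag the segmented text with PyVi.
# Assumes: pip install underthesea pyvi
from underthesea import word_tokenize
from pyvi import ViPosTagger

text = "Ngân hàng Vietcombank công bố lợi nhuận quý ba."
segmented = word_tokenize(text, format="text")   # words joined with underscores, e.g. "Ngân_hàng ..."
words, tags = ViPosTagger.postagging(segmented)  # PyVi POS tags on Underthesea's segmentation
print(list(zip(words, tags)))
```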
Follow¶
For the thinking process:
https://github.com/ds4v/absa-vlsp-2018
https://nlpprogress.com/vietnamese/vietnamese.html
https://github.com/vndee/awsome-vietnamese-nlp
FAQ:
- How to store models?
- How to implement with MLflow?
- Do we buy or create samples? https://prodi.gy/buy | 390 PER | 490 ORG (1715 renew 12m)
- Tools for annotations:
Text annotation is a crucial task in Natural Language Processing (NLP) that involves labeling various aspects of text data for training machine learning models, creating training datasets, or extracting information. Several tools are available for text annotation, each with its own strengths and weaknesses. Some popular ones:
Label Studio: Label Studio is a versatile open-source tool that supports text annotation along with other data types. It provides a user-friendly interface for creating labeled datasets for NLP tasks.
Prodigy: Prodigy (prodi.gy) is a paid annotation tool developed by Explosion AI, the creators of spaCy. It offers a streamlined interface for annotating text data, supports custom annotation workflows, and uses active learning to improve annotation efficiency.
Brat: Brat (Brat Rapid Annotation Tool) is an open-source web-based tool specifically designed for text annotation tasks. It allows users to annotate text for various NLP tasks like named entity recognition, relation extraction, and more.
Doccano: Doccano is an open-source text annotation tool that supports various annotation types, including named entity recognition, text classification, and sequence labeling. It offers an intuitive interface for annotating text data.
Tagtog: Tagtog is a web-based text annotation tool that supports collaborative annotation workflows. It provides features for annotating text for tasks like named entity recognition, text classification, and more.
Amazon SageMaker Ground Truth: Amazon SageMaker Ground Truth is a managed data labeling service that supports text annotation along with other data types. It provides a scalable platform for creating high-quality labeled datasets for machine learning.
https://www.quora.com/NLP-what-are-the-best-tools-for-text-annotation
Label with: https://doccano.github.io/doccano/developer_guide/
Then convert to spaCy format: https://stackoverflow.com/questions/77248199/how-to-convert-doccano-exported-jsonl-format-to-spacy-format
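A hedged sketch of that conversion, assuming the doccano export is JSONL with a `text` field and a `label` field of `[start, end, label]` triples (field names vary between doccano project types), feeding into the same DocBin approach shown in the Training section:

```python
# Hedged sketch: converting a doccano JSONL export into spaCy's binary training format.
# Assumes each line looks like {"text": "...", "label": [[start, end, "LABEL"], ...]};
# adjust the field names to match your doccano project type.
import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("vi")  # requires pyvi for spaCy's Vietnamese tokenizer
doc_bin = DocBin()

with open("doccano_export.jsonl", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        record = json.loads(line)
        doc = nlp.make_doc(record["text"])
        spans = []
        for start, end, label in record.get("label", []):
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                spans.append(span)
        doc.ents = spans
        doc_bin.add(doc)

doc_bin.to_disk("./train.spacy")
```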