
Hackathon - Document Search Solution

Overview

This repo forks the source code of the backend component that we implemented for our Document Resources Management System mobile app in the One Mount Hackathon contest, organized in Ha Noi, 2021. Our solution focused on being feasible to implement and working as a prototype.

Our application contains a search engine that covers all legal documents in our company and solves problems that we faced in our workplace:

  • It supports full-text search, instead of searching metadata only (title, abstract, short description, ...), reducing the time needed to deep-dive into the body of a document when users see a matched keyword.

  • It centralizes knowledge, following the single-source-of-truth principle, and is served through a friendly mobile UI.

  • It contains interactive modules that connect with user data, such as search behavior and related documents, to build a persona box with useful information.

Logic Implemented

After researching a number of papers and walkthroughs, we adopted the following components and tried to deploy them in our system.

Hackathon One Mount System Logic Design

Architecture

By leveraging our cloud infrastructure, we used many products offered by Google to deliver our backend components:

  • Scheduling, functions: Cloud Scheduler, Cloud Functions, Dataproc

  • Text extraction, transformation: Translation API, Vision API, Natural Language API

  • Storage: BigQuery, Cloud Storage

Then, we used the low-code OutSystems platform to build the Front End component, which packs our information and serves it through the mobile application. We put the Front End and Back End on different virtual machine instances (for security purposes).

Hackathon One Mount SAD

Based on that, behind the scenes, we designed a data pipeline to extract information from our documents.

Data Pipeline

There is a consistent process that the Back End component covers along the document journey. The diagram below shows it:

Hackathon One Mount Interactive Step

When document owners or operators publish a document into the system, the Back End triggers the following four steps:

Step 1: Metadata and Document Read.

This step pulls the metadata of the document from what the user filled in through the operation UI, with the sample below:

metadata:
  entity: Company Sample A
  language: VIE
  number: "No. 120"
  issued_date: 2021-11-04
  effective_date: 2022-01-01
  source: Business Development
  subject: Strategic Opportunities Template
  title: Strategic Opportunities Template
  type: Internal Regulation
  is_internal: true

The number of metadata annotations on a document will increase over time: they can be input by the document owner or operators, and some can be extracted automatically based on text mapping but still need to be verified later.

Then we read the document, convert it into a text file (*.txt), and translate it between English and Vietnamese depending on the language of the document.

The output of this step is three files: the metadata in *.yml and two text files in the desired languages.
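
For illustration, a minimal sketch of how the metadata file from this step could be loaded and the translation direction decided. The PyYAML usage is standard, but the file path and helper name are hypothetical and not part of this repo:

# Hypothetical sketch: load the Step 1 metadata file and pick the translation target.
import yaml  # PyYAML

def load_metadata(path="data/source/sample_document.yml"):  # path is illustrative
    with open(path, encoding="utf-8") as f:
        metadata = yaml.safe_load(f)["metadata"]
    # If the source document is Vietnamese, translate to English, and vice versa.
    target_language = "EN" if metadata["language"] == "VIE" else "VIE"
    return metadata, target_language

metadata, target_language = load_metadata()
print(metadata["title"], "->", target_language)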

Step 2: Document Class. At our core, we designed a Document class with the methods listed below:

Methods of the Document class

Method | Feature | Description | Is Public
_read_document | Process document | Parse document information from files; supported file types are text (*.txt), Word (*.doc, *.docx), and, with the Vision API, PDF files (*.pdf) and images (*.png) | False
_extract_document | Extract content | Based on the processed text, parse the document into paragraphs, sentences, and words for both English and Vietnamese; returns a dictionary of positions and related types | False
_parse_dictionary | Separate content components | Separate the dictionary into single dictionaries with the synopsis {'position': number, 'components': string} | False
_get_unique_words | Create tokens | Create the set of words used in the document | False
_content_to_df | Document tabular form | Parse the content of the document into a data frame | False
check_valid_ext | Util - validate file extension | Check the extension of the file against the target extension | True
_check_file | Util - validate file exists | Return True if the file exists, otherwise False, instead of raising an error | False
get_id | Get the ID of the document | Hash generated from the file name, e.g. Db9c73ec7aee3cbc103a29d07938b5c39 | True
get_document | Get document | Document text read from the source | True
get_content | Get contents | Dictionary of all components of the document, including paragraphs, sentences, and words | True
get_paragraphs | Get paragraphs | Dictionary of paragraphs only, e.g. {'position': 1, 'paragraph': "This is first paragraph, included multiple rows"} | True
get_sentences | Get sentences | Dictionary of sentences only, e.g. {'position': 1, 'sentence': "This is first sentence"} | True
get_words | Get words | Dictionary of words only, e.g. {'position': 1, 'word': "This"} | True
get_tokens | Get tokens | Array that contains the set of words | True
get_metadata | Get metadata | Dictionary of metadata provided by the document owner | True
to_dataframe | Get dataframe | Data frame of the document | True
statistics | Document basic statistics | Dictionary of basic statistics: counts of contents (number of paragraphs, sentences, and words) | True
write_excel | Write to Excel | Write the data frame into an Excel file | True
write_parquet | Write to Parquet | Write the data frame into a Parquet file | True
update_metadata | Update metadata | When metadata is updated, refresh the information related to the document | True
search | Simple search backed by the thefuzz library | Search for a keyword in a document and return related components, with a limit and threshold | True
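
To make the table concrete, here is a hypothetical usage sketch. The constructor signature, import path, and argument values are assumptions for illustration; only the public method names come from the table above:

# Hypothetical usage sketch of the Document class; constructor and import path are assumed.
from src.document import Document  # import path is an assumption, adjust to the package layout

doc = Document("data/source/sample_document.txt")
print(doc.get_id())          # hashed ID generated from the file name
print(doc.statistics())      # counts of paragraphs, sentences, and words
print(doc.get_paragraphs())  # e.g. {'position': 1, 'paragraph': "..."}
hits = doc.search("salary", limit=5, threshold=80)  # fuzzy search via thefuzz
doc.write_parquet("temp/document/sample_document.parquet")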

Step 3: Storage Output

We then store the outputs gathered from steps 1 and 2 into storage:

  1. Raw files [uploaded file, processed text, translated text, metadata] go to Cloud Storage.

  2. The content of the document as a data frame goes to BigQuery, where further transformation steps turn it into useful information.
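
As a rough sketch of this step, it could look like the snippet below. The bucket, dataset, and table names are placeholders; the client calls come from the official google-cloud-storage and google-cloud-bigquery libraries:

# Sketch only: upload a raw file to Cloud Storage and load the data frame into BigQuery.
from google.cloud import storage, bigquery

def store_outputs(local_path, df):
    # 1. Raw file goes to Cloud Storage (bucket name is a placeholder).
    bucket = storage.Client().bucket("hackathon-document-raw")
    bucket.blob(f"raw/{local_path}").upload_from_filename(local_path)
    # 2. Document content as a data frame goes to BigQuery (table name is a placeholder).
    bigquery.Client().load_table_from_dataframe(df, "document.f_document_content").result()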

Step 4: Data Model

Based on the data from step 3, we can generate various targeted outputs that support the use cases of the UI. We combine the user search logs with document relations to create new data models, built with dbt (data build tool) in the BigQuery environment.

For example:

  • Synonym Keyword: suggested related words for a keyword, separated by |.
keyword:
  string: Ha Noi
  related: Ho Chi Minh|Hai Phong|Ninh Binh|Vinh Phuc
  • Most Search: the top-N searched keywords in a period (a toy derivation sketch follows the examples below).
most_search:
  1: Salary
  2: License
  3: Tax
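
As a toy sketch of how the Most Search model could be derived (pandas is used here purely for illustration; in our system such models are built with dbt on BigQuery, and the column names below are assumptions):

# Toy sketch: derive a top-N "most_search" model from a search log; column names are assumed.
import pandas as pd

search_log = pd.DataFrame({
    "keyword": ["Salary", "License", "Salary", "Tax", "Salary", "License"],
    "searched_at": pd.to_datetime(["2021-11-01"] * 6),
})
most_search = (
    search_log.groupby("keyword").size()
    .sort_values(ascending=False).head(3)  # top-N in the period
)
print(most_search.to_dict())  # e.g. {'Salary': 3, 'License': 2, 'Tax': 1}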

We can then serve these models through an API so the Front End can fetch up-to-date data, potentially in real time.
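
For instance, a minimal sketch of exposing such a model to the Front End (Flask is used here only for illustration and is not necessarily what we used; in practice the handler would query the BigQuery data model instead of returning a constant):

# Illustrative only: expose the "most_search" model through a small HTTP endpoint.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/most-search")
def most_search():
    # Placeholder payload; a real handler would read from the data model.
    return jsonify({"1": "Salary", "2": "License", "3": "Tax"})

if __name__ == "__main__":
    app.run()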

Quickstart

In this repo, we mirror steps 1 and 2, which cover the document processing journey from source files, and focus on the usage of the Document class. The source code folder and file hierarchy is shown below.

.
data/ # Contain dataset
src/ # Contain `src` component
|____ document/
|____ file/
|____ util/
__init__.py
.gitignore
config.py # Configuration file
extract.py # Extract Document Information
metadata.py # Parse metadata
README.md # Project introduction
Makefile # Automation with target
requirements.txt # Dependencies

[1] Installation:

a) Python version >= 3.9 [minor version 3.9.1]

b) Install dependencies

requirements.txt
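
For example, the dependencies can typically be installed with pip:

pip install -r requirements.txt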

[2] Configuration:

To integrate with the pipeline, you can modify the global components in the config file.

config.py

Note that the outputs can be large depending on the file sizes, so remember to add them to .gitignore so they do not affect the codebase.

Based on which file types the document readers can handle, we can extend or reduce the glob file extensions we support. SUPPORTED_FILE_TYPE currently supports [".doc", ".docx", ".txt"].
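
As an illustration only, the relevant part of config.py might look something like this; the constant name SUPPORTED_FILE_TYPE comes from the text above, while the other names are hypothetical:

# Illustrative sketch of config.py; only SUPPORTED_FILE_TYPE is named in this README.
SUPPORTED_FILE_TYPE = [".doc", ".docx", ".txt"]

# Hypothetical path settings for the pipeline (adjust to your folders).
SOURCE_FOLDER = "data/source/"
DESTINATION_FOLDER = "temp/document"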

[3] Using the CLI to interact with the module:

Add metadata for the source folder if a related *.yml file does not exist:

python document/metadata.py --source data/source/

Extract information and send it to the destination folder:

python document/extract.py --source data/source/ --destination temp/document

Or, with the Makefile, we have two targets that automate the process above. You can change the config variables declared with the document_* prefix.

Makefile

Then, we can use the commands below:

make document-metadata
make document-extract

Afterwards, you can see everything in the destination folder. Example:

Hackathon Sample Output Folder

with:

F_DOCUMENT_METADATA.xlsx will look like this:

Hackathon Sample Output Metadata

F_DOCUMENT_METADATA.xlsx will look like this:

Hackathon Sample Output Attribute

Note

  1. Replica Source Code: This is only fragmented code, representing around 50% of what we implemented in the contest. It has been enhanced with better code style and packaged into a module. For the implementation of the Document class, refer to the Source Code part.

  2. We welcome contributions of ideas and comments/feedback on our solution.

Reference

  1. Henning Wachsmuth, Text Analysis Pipelines: Towards Ad-hoc Large-Scale Text Mining.

Source Code

Source code design for the Document class

Document Class Implemented