Hackathon - Document Search Solution¶
Overview¶
This repo mirrors source code from the backend component that we implemented for our Document Resources Management System mobile app at the One Mount Hackathon, organized in Ha Noi, 2021. Our solution aimed to be feasible to implement and workable as a prototype.
Our application provides a search engine over all legal documents in our company and solves problems we faced in our workplace:

- It supports full-text search, instead of searching metadata only (title, abstract, short description, ...), reducing the time needed to deep-dive into the body of a document once users see a matching keyword.
- It centralizes knowledge, following the single source of truth principle, and is served through a friendly mobile UI.
- It contains interactive modules that connect with user data, such as search behaviors and related documents, to build a persona box with useful information.
Logic Implemented¶
After researching a number of papers and walkthroughs, we adopted the following components and deployed them to our systems.
Architecture¶
Leveraging our cloud infrastructure, we used several Google Cloud products to deliver our backend components:

- Scheduling and functions: Cloud Scheduler, Cloud Functions, Dataproc
- Text extraction and transformation: Translation API, Vision API, Natural Language API
- Storage: BigQuery, Cloud Storage

Then, we used the low-code OutSystems platform to build the Front End component, which packages our information and serves it through the mobile application. We put the Front End and Back End on different virtual machine instances (for security purposes).
Based on that, behind the scenes, we designed a data pipeline to extract information from our documents.
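As an illustration of how these components connect, below is a minimal sketch (an assumption, not the production code) of a Cloud Storage-triggered Cloud Function that could kick off the pipeline when a document is uploaded; the process_document entry point, bucket handling, and print-based logging are hypothetical.

# Hypothetical sketch: a background Cloud Function (gen 1) triggered by a
# Cloud Storage upload that starts the document pipeline. Names are illustrative.
import os

from google.cloud import storage


def process_document(event, context):
    """Entry point: triggered when a file lands in the documents bucket."""
    bucket_name = event["bucket"]   # bucket that received the upload
    file_name = event["name"]       # object path of the uploaded document

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(file_name)

    # Download to a temporary location, then hand over to the metadata and
    # extraction steps described in the Data Pipeline section below.
    local_path = os.path.join("/tmp", os.path.basename(file_name))
    blob.download_to_filename(local_path)
    print(f"Downloaded {file_name}; starting extraction for {local_path}")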
Data Pipeline¶
The Back End component covers a consistent process for every document journey. When document owners or operators publish a document into the system, the Back End triggers the following four steps:
Step 1: Metadata, Document Read.
This step pulls the document's metadata from what the user fills in through the operation UI, as in the sample below:
metadata:
entity: Company Sample A
language: VIE
number: "No. 120"
issued_date: 2021-11-04
effective_date: 2022-01-01
source: Business Development
subject: Strategic Opportunities Template
title: Strategic Opportunities Template
type: Internal Regulation
is_internal: true
The number of metadata annotations on a document will grow over time: some are entered by the document owner or operators, and some can be extracted automatically based on text mapping but still need to be verified later.
Then, we read the document, convert it into a text file (*.txt), and translate it between English and Vietnamese depending on the language of the document.
The endpoint of this step is three files: the metadata in *.yml and two text files in the desired languages, as in the sketch below.
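A minimal sketch of Step 1, assuming a metadata *.yml produced by the operation UI and a document already converted to plain text; the prepare_document function and the use of the google-cloud-translate client are illustrative, not the exact production code.

# Hypothetical sketch of Step 1: load the operator-supplied metadata, read the
# converted *.txt document, and translate it between English and Vietnamese.
import yaml
from google.cloud import translate_v2 as translate


def prepare_document(doc_path: str, meta_path: str) -> dict:
    # Metadata filled in through the operation UI, stored as *.yml
    with open(meta_path, encoding="UTF-8") as f:
        metadata = yaml.safe_load(f)["metadata"]

    # Document body, already converted to a *.txt file upstream
    with open(doc_path, encoding="UTF-8") as f:
        text = f.read()

    # Translate depending on the source language of the document
    target = "en" if metadata.get("language") == "VIE" else "vi"
    client = translate.Client()
    translated = client.translate(text, target_language=target)["translatedText"]

    # Endpoint of this step: the metadata plus two language versions of the text
    return {"metadata": metadata, "text": text, "translated_text": translated}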
Step 2: At our core, we designed a Document class with the methods listed below (a usage example follows the table):
Methods of the Document class:
| Method | Feature | Description | Is Public |
|---|---|---|---|
| _read_document | Process document | Parses document information from files; supported file types are text (*.txt) and Word (*.doc, *.docx), and with the Vision API we can also support PDF files (*.pdf) and images (*.png) | False |
| _extract_content | Extract content | Based on the processed text, parses the document into one dictionary of paragraphs, sentences, and words for both English and Vietnamese, returning a dictionary of positions and component types | False |
| _parse_dictionary | Separate content components | Separates the nested dictionary into single dictionaries with the following synopsis: {'position': number, 'components': string} | False |
| _get_unique_words | Token creation | Creates the set of words used in the document | False |
| _content_to_df | Document tabular form | Parses the content of the document into a data frame | False |
| check_valid_ext | Util - validate file extension | Checks the extension of the file against the target extension | True |
| _check_file | Util - validate file exists | Returns True if the file exists, else False, instead of raising an error | False |
| get_id | Get the ID of the Document | Hash generated from the file name. Example: Db9c73ec7aee3cbc103a29d07938b5c39 | True |
| get_document | Get document | Document text read from the source | True |
| get_content | Get contents | Dictionary of all components of the document, including paragraphs, sentences, words | True |
| get_paragraphs | Get paragraphs | Dictionary of paragraphs only. Example: {'position': 1, 'paragraph': "This is the first paragraph, which can include multiple rows"} | True |
| get_sentences | Get sentences | Dictionary of sentences only. Example: {'position': 1, 'sentence': "This is the first sentence"} | True |
| get_words | Get words | Dictionary of words only. Example: {'position': 1, 'word': "This"} | True |
| get_tokens | Get tokens | Array that contains the set of unique words | True |
| get_metadata | Get metadata | Dictionary of metadata obtained from the document owner | True |
| to_dataframe | Get dataframe | Dataframe of the document | True |
| statistics | Document basic statistics | Dictionary of basic statistics: counts of contents (number of paragraphs, sentences, and words) | True |
| write_excel | Write to Excel | Writes the data frame into an Excel file | True |
| write_parquet | Write to Parquet | Writes the dataframe into a Parquet file | True |
| update_metadata | Update metadata | When metadata is updated, refreshes the information related to the document | True |
| search | Simple search backed by the thefuzz library | Searches for a keyword in the document and returns related components, with a limit and a score threshold | True |
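The full implementation is listed in the Source Code section at the end of this document. A short usage example follows; the file paths are placeholders, and the import path depends on how the module is packaged under src/.

# Example usage of the Document class; paths are placeholders and the matching
# sample.yml metadata file is expected next to the source file.
from document import Document  # adjust the import to the actual module layout

doc = Document(path='data/source/sample.txt')

print(doc.get_id())         # e.g. "Db9c73ec7aee3cbc103a29d07938b5c39"
print(doc.statistics())     # {'paragraph': ..., 'sentence': ..., 'word': ...}
print(doc.get_metadata())   # metadata dictionary loaded from the *.yml file

# Fuzzy search backed by thefuzz: returns (type, component, score) tuples
matches = doc.search('strategic opportunities', limit=5, threshold=60)

# Persist the tabular form for Step 3
doc.write_excel('temp/sample.xlsx')
doc.write_parquet('temp/sample.parquet')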
Step 3: Storage Output
We then store the outputs gathered from Steps 1 and 2, as sketched below:

- Raw files [uploaded file, processed text, translated text, metadata] go to Cloud Storage.
- The content of the document in data frame form goes to BigQuery, where it is transformed through a few more steps to become useful information.
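A hedged sketch of this step is shown below; the bucket, dataset, and table names are made up for illustration, and the production pipeline may load the data differently.

# Hypothetical sketch of Step 3: raw artifacts go to Cloud Storage, the tabular
# document content goes to BigQuery. Bucket/dataset/table names are examples.
import pandas as pd
from google.cloud import bigquery, storage


def store_outputs(raw_files: list, df: pd.DataFrame) -> None:
    # Raw files: uploaded file, processed text, translated text, metadata
    gcs = storage.Client()
    bucket = gcs.bucket("document-search-raw")  # illustrative bucket name
    for path in raw_files:
        bucket.blob(f"raw/{path}").upload_from_filename(path)

    # Document content in tabular form, ready for further transformation
    bq = bigquery.Client()
    job = bq.load_table_from_dataframe(df, "document_search.f_document_content")
    job.result()  # wait for the load job to finish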
Step 4: Data Model
Based on the data from Step 3, we can generate various targeted outputs that support the use cases of the UI. We combine the user search processes and document relations to create new data models, backed by dbt (data build tool) in the BigQuery environment.
For example (see the sketch after this list):

- Synonym Keyword: all suggested related words for a keyword, separated by |.
- Most Search: the top-N searches in a period.

We can then serve these through an API so the Front End can fetch updated data, potentially in real time.
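To make these two example models concrete, here is an illustrative pandas stand-in for what the dbt models compute; the real models run as SQL in BigQuery, and the sample frames below are invented for the example.

# Illustrative pandas stand-in for the dbt/BigQuery data models.
# The input frames are invented sample data, not real user logs.
import pandas as pd

synonyms = pd.DataFrame({
    "keyword": ["contract", "contract", "policy"],
    "related_word": ["agreement", "amendment", "regulation"],
})
searches = pd.DataFrame({"keyword": ["contract", "policy", "contract"]})

# Synonym Keyword: all suggested related words per keyword, separated by "|"
synonym_keyword = (
    synonyms.groupby("keyword")["related_word"]
    .apply(lambda s: "|".join(sorted(s)))
    .reset_index(name="synonyms")
)

# Most Search: top-N searched keywords in the period
N = 10
most_search = searches["keyword"].value_counts().head(N)

print(synonym_keyword)
print(most_search)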
Quickstart¶
In this repo, we mirror Steps 1 and 2, which cover the baking journey from the sources, and focus on the usage of the Document class. Below is the source code folder and file hierarchy.
.
data/ # Contains the dataset
src/ # Contains the `src` components
|____ document/
|____ file/
|____ util/
__init__.py
.gitignore
config.py # Configuration file
extract.py # Extract Document Information
metadata.py # Parse metadata
README.md # Project introduction
Makefile # Automation with target
requirements.txt # Dependencies
[1] Installation:
a) Python at version >= 3.9 [developed on 3.9.1]
b) Install the dependencies:
seaborn==0.11.2
thefuzz==0.19.0
pandas>=1.1.0
nltk>=3.5
pyarrow>=7.0.0
matplotlib==3.5.2
[2] Configuration:
To integrate with the pipeline, you can modify the global components in the config file.
# Global
import sys
import os

# Path Append
sys.path.append(os.path.abspath(os.curdir))

# Declare
# Temporary folder, which contains a log file
TEMP_DIR = os.path.join('temp')
# Source folder, which stores the raw files from uploaded documents.
SOURCE_DIR = os.path.join('data', 'source')
# Output folder for processed files and extracted data.
OUTPUT_DIR = os.path.join(TEMP_DIR, 'document')
# Supported file types
SUPPORTED_FILE_TYPE: list[str] = ["*.doc", "*.docx", "*.txt"]
# Truncated length threshold
TRUNCATED_LENGTH: int = 500
# Metadata skeleton
METADATA_SKLETON: dict[str, dict] = {
    'metadata': {
        'language': None,
        'title': None,
        'subject': None,
        'entity': None,
        'source': None,
        'number': None,
        'type': None,
        'issued_date': None,
        'effective_date': None,
        'is_internal': False
    }
}
Note that the files written here can be large depending on the input file sizes, so remember to add them to .gitignore so they do not affect the codebase.
Based on the file-type handlers we have for reading documents, we can extend or reduce the glob file extensions we support. SUPPORTED_FILE_TYPE currently supports ["*.doc", "*.docx", "*.txt"], as used in the sketch below.
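For reference, a small sketch of how these settings could be used to discover source files, assuming the config module shown above is importable as config; the print loop is illustrative only.

# Minimal sketch: discover supported source files with the glob patterns from
# the configuration above.
import glob
import os

import config  # the configuration file shown above

files = []
for pattern in config.SUPPORTED_FILE_TYPE:  # e.g. "*.docx"
    files.extend(glob.glob(os.path.join(config.SOURCE_DIR, pattern)))

for path in files:
    print(f"Found supported document: {path}")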
[3] Using the CLI to interact with the module:
Add metadata to the source folder if the related *.yml file does not exist.
Extract information and send it to the destination folder.
Or, with the Makefile, we have two targets that automate the steps above. You can change the config variables declared with the document_* prefix.
# Default Shell
SHELL := /usr/bin/bash

# Folders
SOURCE_DIR = data/source
DESTINATION_DIR = temp/document

metadata:
	python metadata.py --source ${SOURCE_DIR}

extract:
	python extract.py --source ${SOURCE_DIR} --destination ${DESTINATION_DIR}
Then, we can run the pipeline with make metadata and make extract.
After that, you can see all the outputs in the destination folder, for example F_DOCUMENT_METADATA.xlsx.
Note¶
- Replica source code: this is fragmented code, representing around 50% of what we implemented in the contest. It has been cleaned up for code style and packaged into a module. For the implementation of the Document class, refer to the Source Code section.
- We welcome idea contributions and comments/feedback on our solution.
Reference¶
- Henning Wachsmuth, Text Analysis Pipelines: Towards Ad-hoc Large-Scale Text Mining
Source Code¶
Source code design for the Document class:
import os
import re
import string
import hashlib
import subprocess

import pandas as pd
import yaml
from thefuzz import process


class Document:
    """Document is a class representing a file of various types (txt, doc, docx)
    that is baked through a mechanic to enrich the document with metadata.

    Return:
        [Document]: Document class

    Example:
        samp = Document(path='data/source/sample.txt')
        samp.write_parquet('temp/t.parquet')
        samp.write_excel('temp/t.xlsx')
    """

    def __init__(self, path: str = None):
        # Mechanic: resolve the file name, extension and matching metadata file
        self.path = path
        self.original_path, self.file = os.path.split(self.path)
        self.file_name, self.file_ext = os.path.splitext(self.file)
        self.file_metadata = os.path.join(self.original_path, self.file_name + ".yml")
        metadata = {}
        if os.path.isfile(self.file_metadata):
            try:
                with open(self.file_metadata) as f:
                    component: dict = yaml.load(f, Loader=yaml.FullLoader)
                metadata = component.get('metadata')
            except Exception:
                self.file_metadata = None
                raise
        else:
            # No metadata file yet: generate a skeleton with the metadata.py CLI.
            # The generated file can be loaded later through update_metadata().
            try:
                subprocess.run(
                    args=['python', 'metadata.py', '--file', self.file_metadata],
                    capture_output=True,
                    text=True,
                    check=True,
                    timeout=60
                )
            except Exception:
                self.file_metadata = None
                raise
        # Attributes taken from the metadata file
        self.language = safe_dict_extract(d=metadata, key='language')
        self.title = safe_dict_extract(d=metadata, key='title')
        self.subject = safe_dict_extract(d=metadata, key='subject')
        self.entity = safe_dict_extract(d=metadata, key='entity')
        self.source = safe_dict_extract(d=metadata, key='source')
        self.number = safe_dict_extract(d=metadata, key='number')
        self.type = safe_dict_extract(d=metadata, key='type')
        self.issued_date = safe_dict_extract(d=metadata, key='issued_date')
        self.effective_date = safe_dict_extract(d=metadata, key='effective_date')
        self.is_internal = safe_dict_extract(d=metadata, key='is_internal')
        # Components derived from the document body
        self.id = f"D{hashlib.md5(self.file.encode()).hexdigest()}"
        self.document = self._read_document()
        self.content = self._extract_content()
        self.paragraphs, self.sentences, self.words = self._parse_dictionary()
        self.tokens = self._get_unique_words()
        self.document_dataframe = self._content_to_df()

    def _read_document(self):
        # Validate that the path points to an existing file
        if not os.path.isfile(self.path):
            raise FileNotFoundError('The path does not point to an existing file. Check your path.')
        # Processing
        if self.file_ext == ".txt":
            with open(self.path, encoding='UTF-8') as f:
                cont = f.read()
        elif self.file_ext in (".doc", ".docx"):
            import textract
            parse_content = textract.process(
                self.path,
                input_encoding='UTF-8',
                output_encoding='UTF-8',
                extension=self.file_ext
            ).decode('UTF-8')
            cont = ''.join(parse_content)
        else:
            raise Exception(f"The extension {self.file_ext} is currently not supported")
        # Decoding
        content = cont.encode('UTF-8', "ignore").decode('UTF-8')
        # Replace whitespace
        # \u200b is a "zero-width space" in Unicode.
        # Ref: https://stackoverflow.com/questions/35657620/illegal-character-error-u200b
        content = content.replace('\u200b', ' ').replace('  ', ' ')
        # Special characters: strip punctuation, but keep whitespace and full
        # stops so the content can still be split into paragraphs and sentences.
        # Ref: https://stackoverflow.com/questions/5843518/remove-all-special-characters-punctuation-and-spaces-from-string
        content = re.sub(r'[^\w\s.]+', ' ', content).strip()
        return content

    def _extract_content(self):
        # Special character note:
        # b'\xe2\x80\x93' -> TODO: revert to bytes then exclude special characters
        # Parse the document into nested dictionaries of paragraphs, sentences
        # and words, keyed by their 1-based position.
        contents = {}
        _paragraphs = list(filter(lambda x: x not in ('', string.punctuation), self.document.split('\n\n')))
        # Paragraphs
        for ind_par, par in enumerate(_paragraphs, start=1):
            sentences = {}
            _sentences = list(filter(lambda x: x not in ('', string.punctuation), re.split(r'\.', par)))
            # Sentences
            for ind_sen, sen in enumerate(_sentences, start=1):
                words = {}
                _words = list(filter(lambda x: x not in ('', string.punctuation), sen.split()))
                # Words
                for ind_wor, wor in enumerate(_words, start=1):
                    words.update({ind_wor: {
                        'position': ind_wor,
                        'word': wor
                    }})
                sentences.update({ind_sen: {
                    'position': ind_sen,
                    'sentence': sen,
                    'words': words
                }})
            contents.update({ind_par: {
                'position': ind_par,
                'paragraph': par,
                'sentences': sentences
            }})
        return contents

    def _parse_dictionary(self):
        # Flatten the nested content dictionary into separate dictionaries for
        # paragraphs, sentences and words.
        all_paragraphs = []
        all_sentences = []
        all_words = []
        for i in range(len(self.content)):
            all_paragraphs.append(self.content.get(i + 1).get('paragraph'))
            sents = self.content.get(i + 1).get('sentences')
            for j in range(len(sents)):
                all_sentences.append(sents.get(j + 1).get('sentence'))
                words = sents.get(j + 1).get('words')
                for k in range(len(words)):
                    all_words.append(words.get(k + 1).get('word'))
        paragraphs_dict = {}
        for ind, p in enumerate(all_paragraphs, start=1):
            paragraphs_dict.update({ind: {'position': ind, 'paragraph': p}})
        sentences_dict = {}
        for ind, s in enumerate(all_sentences, start=1):
            sentences_dict.update({ind: {'position': ind, 'sentence': s}})
        word_dict = {}
        for ind, w in enumerate(all_words, start=1):
            word_dict.update({ind: {'position': ind, 'word': w}})
        return paragraphs_dict, sentences_dict, word_dict

    def _get_unique_words(self):
        # Set of unique tokens used in the document
        all_words = [self.words.get(i + 1).get('word') for i in range(len(self.words))]
        return list(set(all_words))

    def _content_to_df(self):
        # Tabular form: one row per paragraph/sentence/word component
        _tabu_tuples = []
        for cont, ty in zip(
            [self.paragraphs, self.sentences, self.words],
            ['paragraph', 'sentence', 'word']
        ):
            for _posi in range(1, len(cont) + 1):  # positions are 1-based and inclusive
                _info = cont[_posi]
                _value = _info.get(ty)
                _tabu_tuples.append((self.id, ty, _posi, _value, hashlib.md5(_value.encode()).hexdigest()))
        df = pd.DataFrame(_tabu_tuples, columns=['document_id', 'type', 'position', 'value', 'hashed_content_md5'])
        return df

    def write_excel(self, target):
        # Validate the target file extension
        if self.check_valid_ext(target, "xlsx") is False:
            raise ValueError(f"Target file is not valid: {target}")
        with pd.ExcelWriter(target, mode='w', engine='xlsxwriter') as writer:
            self.document_dataframe.to_excel(writer, sheet_name='DATA')

    def write_parquet(self, target):
        # Validate the target file extension
        if self.check_valid_ext(target, "parquet") is False:
            raise ValueError(f"Target file is not valid: {target}")
        self.document_dataframe.to_parquet(path=target)

    def get_id(self):
        return self.id

    def get_document(self):
        return self.document

    def get_content(self):
        return self.content

    def get_paragraphs(self):
        return self.paragraphs

    def get_sentences(self):
        return self.sentences

    def get_words(self):
        return self.words

    def get_tokens(self):
        return self.tokens

    def get_metadata(self):
        return {
            'id': self.id,
            'language': self.language,
            'title': self.title,
            'subject': self.subject,
            'entity': self.entity,
            'source': self.source,
            'number': self.number,
            'type': self.type,
            'issued_date': self.issued_date,
            'effective_date': self.effective_date,
            'is_internal': self.is_internal
        }

    def to_dataframe(self):
        return self.document_dataframe

    def statistics(self):
        return {
            'paragraph': len(self.paragraphs),
            'sentence': len(self.sentences),
            'word': len(self.words)
        }

    def update_metadata(self, path: str = None):
        # Reload the metadata attributes when an updated *.yml file is provided
        fi = os.path.basename(path)
        _, fie = os.path.splitext(fi)
        if os.path.isfile(path) and fie == '.yml':
            self.file_metadata = path
            # Metadata
            try:
                with open(self.file_metadata) as f:
                    metadata = yaml.load(f, Loader=yaml.FullLoader).get('metadata')
            except Exception:
                self.file_metadata = None
                raise
            # Update
            self.language = safe_dict_extract(d=metadata, key='language')
            self.title = safe_dict_extract(d=metadata, key='title')
            self.subject = safe_dict_extract(d=metadata, key='subject')
            self.entity = safe_dict_extract(d=metadata, key='entity')
            self.source = safe_dict_extract(d=metadata, key='source')
            self.number = safe_dict_extract(d=metadata, key='number')
            self.type = safe_dict_extract(d=metadata, key='type')
            self.issued_date = safe_dict_extract(d=metadata, key='issued_date')
            self.effective_date = safe_dict_extract(d=metadata, key='effective_date')
            self.is_internal = safe_dict_extract(d=metadata, key='is_internal')

    def search(self, keyword: str, limit: int = 10, threshold: int = 50):
        # Fuzzy-search the keyword against paragraphs, sentences and words,
        # keeping only matches whose score reaches the threshold.
        _satisfied = []
        for cont, ty in zip(
            [self.paragraphs, self.sentences, self.words],
            ['paragraph', 'sentence', 'word']
        ):
            search_pool = [item.get(ty) for item in cont.values()]
            _pool = process.extract(keyword, search_pool, limit=limit)
            for match, score in _pool:
                if score >= threshold:
                    _satisfied.append((ty, match, score))
        return _satisfied

    @staticmethod
    def check_valid_ext(path, ext):
        # Check the extension of the file against the target extension
        ffile = os.path.basename(path)
        _, fext = os.path.splitext(ffile)
        return fext == "." + ext

    @staticmethod
    def _check_file(path):
        # Return True if the file exists, else False, instead of raising
        return os.path.isfile(path)


def safe_dict_extract(d: dict = None, key: str = None):
    # Return d[key] if the dictionary and key are present, otherwise None
    if not d:
        return None
    return d.get(key)





