Tax Notices, Entity Extraction and Document Classification Using Document AI




About UKG

Ultimate Kronos Group (UKG) is an American multinational technology company with dual headquarters in Lowell, Massachusetts, and Weston, Florida. It provides workforce management and human resource management services. As a leading global provider of HCM, payroll, HR service delivery, and workforce management solutions, UKG’s award-winning Pro, Dimensions, and Ready solutions help tens of thousands of organisations across geographies and in every industry drive better business outcomes, improve HR effectiveness, streamline the payroll process, and help make work a better and more connected experience for everyone.

The challenge

The main challenges concerned the data provided. The tax notices arrived as PDF files with issues such as rotated pages, poorly scanned or incorrectly oriented pages, and blank pages appearing at random positions in a document.

Another challenge was distinguishing between the types of tax notice. There were 200 to 400 different notice types, and classifying them was hard because the text and context of the documents were largely similar, so finding the terms that reliably differentiate one notice type from another was difficult. This required us to try different approaches, such as Jaccard similarity and Naive Bayes.

For information and entity extraction, Google Cloud Platform services and Document AI were used. The Document AI Form Parser processed the unstructured documents and converted them into a structured format. Although most of the required data was extracted with high accuracy, the parser returned a few important entities with low confidence, and some garbage values were extracted alongside the valid data. We needed to analyse the output carefully and avoid using these garbage values.
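The screening of low-confidence and garbage values could be sketched roughly as follows. This is only an illustration: it assumes each parsed key-value pair has already been flattened into a key, a value and a confidence score between 0 and 1, and the `FormField` type, the `is_garbage` heuristic and the 0.7 threshold are hypothetical choices, not values from the project.

```python
from dataclasses import dataclass


@dataclass
class FormField:
    """One key-value pair, flattened from the parser output (assumed shape)."""
    key: str
    value: str
    confidence: float  # 0.0 to 1.0


def is_garbage(text):
    """Hypothetical heuristic: empty text, or text with no letters or digits."""
    stripped = text.strip()
    return not stripped or not any(ch.isalnum() for ch in stripped)


def screen_fields(fields, min_confidence=0.7):
    """Split fields into accepted ones and ones that need manual review.

    Pairs whose key or value looks like garbage are dropped entirely.
    """
    accepted, review = [], []
    for field in fields:
        if is_garbage(field.key) or is_garbage(field.value):
            continue
        if field.confidence >= min_confidence:
            accepted.append(field)
        else:
            review.append(field)
    return accepted, review
```

Low-confidence pairs are kept in a separate review list rather than discarded, since the text notes that some of them were important entities.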

“Technology is a great equaliser that enables our clients to compete with the largest banks in the world. One of the significant technology advantages that Knoldus expertise solution provides is the ability to share across our product portfolio. The significant events that occur throughout an end user’s financial journey, from opening an account to initiating a home or small business loan to saving for college or retirement,” said Vice President, hosting architecture.

The solution

Our team started by analysing the different notice types and the relevant information that could be extracted from each. The first task was to use the Document AI service to extract data from the documents with the Form Parser, which returns data as key-value pairs. The data extracted from all the documents was stored in a BigQuery table for use at later stages. Here is a sample of how the Document AI Form Parser extracts data from documents.


However, this processor could only extract data that appeared as key-value pairs. Some data, such as dates, company names and identification numbers, had no associated key, and a few entities that needed to be extracted appeared only in paragraph text. Google Cloud Data Loss Prevention (DLP) proved a very good solution for these cases. DLP uses built-in infoType detectors to find information in documents, and it has nearly 150 of them. An infoType is a type of sensitive data, such as an email address, identification number, credit card number or date of birth. Our team used the built-in infoTypes and also created custom infoTypes for extracting entities. With both services, Google Document AI and DLP, our team was able to extract all the data that was needed. Entity extraction accuracy was 85%-95%.
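As a hedged sketch of how built-in and custom infoTypes might be combined, the helper below builds an inspection configuration as a plain dictionary. The field names (`info_types`, `custom_info_types`, `regex`, `include_quote`) follow the DLP `InspectConfig` structure as we understand it; the helper names, the `EMPLOYER_ID` infoType and its pattern are hypothetical and not taken from the project. The preview function uses Python's `re` module locally and does not call DLP, which evaluates patterns with RE2 syntax.

```python
import re


def build_inspect_config(builtin_names, custom_patterns):
    """Build a DLP-style inspect configuration as a plain dict.

    builtin_names: names of built-in infoTypes, e.g. "EMAIL_ADDRESS".
    custom_patterns: mapping of custom infoType name to a regex pattern.
    """
    return {
        "info_types": [{"name": name} for name in builtin_names],
        "custom_info_types": [
            {"info_type": {"name": name}, "regex": {"pattern": pattern}}
            for name, pattern in custom_patterns.items()
        ],
        # Ask for the matched text itself, not only its location.
        "include_quote": True,
    }


def preview_custom_matches(text, custom_patterns):
    """Local sanity check of the custom patterns (not a DLP call)."""
    return {
        name: re.findall(pattern, text)
        for name, pattern in custom_patterns.items()
    }


# Hypothetical custom infoType: two digits, a hyphen, seven digits.
CUSTOM_PATTERNS = {"EMPLOYER_ID": r"\b\d{2}-\d{7}\b"}

config = build_inspect_config(
    ["EMAIL_ADDRESS", "CREDIT_CARD_NUMBER", "DATE_OF_BIRTH"],
    CUSTOM_PATTERNS,
)
```

Checking a custom pattern locally before sending it to the service is a cheap way to catch obvious mistakes, although the two regex dialects are not identical.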

In the next step, we manually mapped the extracted data to the expected data fields for each notice type. This was done manually in order to analyse which labels represent a particular entity in each notice document. For example, the expected field “NAME” can appear under different labels in different documents, such as “Name” or “Tax-Payer Name”. This manually mapped data was then used to build an automated pipeline that maps each entity’s label and value to the expected field.
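The label-to-field mapping described above can be illustrated with a small lookup table. This is a minimal sketch under assumptions: the alias table stands in for the manually curated mapping, and every entry other than “Name” and “Tax-Payer Name” is a hypothetical example.

```python
# Expected field -> labels seen in documents (illustrative entries only).
LABEL_ALIASES = {
    "NAME": ["name", "tax-payer name", "taxpayer name"],
    "NOTICE_DATE": ["notice date", "date of notice"],
}


def normalise(label):
    """Lower-case a label, drop colons and collapse repeated whitespace."""
    return " ".join(label.lower().replace(":", " ").split())


def build_lookup(aliases):
    """Invert the alias table into normalised label -> expected field."""
    return {
        normalise(alias): field
        for field, labels in aliases.items()
        for alias in labels
    }


def map_to_expected_fields(extracted, aliases=LABEL_ALIASES):
    """Map extracted label/value pairs onto expected fields.

    Labels with no known alias are ignored; the first value found for a
    field is kept.
    """
    lookup = build_lookup(aliases)
    mapped = {}
    for label, value in extracted.items():
        field = lookup.get(normalise(label))
        if field is not None and field not in mapped:
            mapped[field] = value
    return mapped
```

Supporting a new notice type then mostly means adding its labels to the alias table, not changing the pipeline code.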

Next, document classification of the tax notices was implemented. We first used a Jaccard similarity approach based on MinHash. MinHash is a locality-sensitive hashing (LSH) family for Jaccard distance, where the input features are sets of natural numbers. The Jaccard distance of two sets A and B is defined from the cardinality of their intersection and union: d(A, B) = 1 - |A ∩ B| / |A ∪ B|. MinHash applies a random hash function to each element in the set and takes the minimum of all hashed values. However, the accuracy obtained with this approach was quite low, and the results were not satisfactory.
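To make the MinHash idea concrete, here is a minimal pure-Python sketch. It is not the implementation used in the project: the linear hash family `(a * x + b) mod p`, the prime, the number of hash functions and the `token_ids` helper are all illustrative assumptions. The fraction of matching signature positions gives an estimate of the Jaccard similarity, and the Jaccard distance is one minus that value.

```python
import random
import zlib

PRIME = 2_147_483_647  # Mersenne prime 2**31 - 1, used as the hash modulus


def token_ids(text):
    """Turn a text into a set of natural numbers, one per distinct token."""
    return {zlib.crc32(tok.encode("utf-8")) % PRIME for tok in text.lower().split()}


def jaccard_similarity(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficients for num_hashes random linear hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def minhash_signature(items, params):
    """For each hash function, keep the minimum hashed value over the set.

    items must be a non-empty set of integers in the range [0, PRIME).
    """
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

A notice could then be assigned the type of the reference document whose signature agrees with it in the most positions. Because many notice types share most of their wording, such similarities tend to sit close together, which is consistent with the low accuracy reported for this approach.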

The next approach was multi-class classification using a Naive Bayes classifier, with the hashing trick to handle out-of-vocabulary (OOV) terms. The hashing trick helps address the memory consumption of a large vocabulary, and it also mitigates the problem of filter circumvention. With this approach, accuracy increased to 75-85%, and the model could classify the tax notices more accurately.
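The combination of Naive Bayes and the hashing trick can be sketched in a few lines of plain Python. This is an illustrative multinomial Naive Bayes with Laplace smoothing, not the project's model: the bucket count, the CRC32 hash, the whitespace tokeniser and the class name are assumptions. Every token, seen or unseen, is hashed into one of a fixed number of buckets, so no vocabulary has to be stored and OOV terms need no special handling.

```python
import math
import zlib
from collections import defaultdict

NUM_BUCKETS = 2 ** 12  # fixed feature-space size (assumed value)


def hashed_counts(text, num_buckets=NUM_BUCKETS):
    """Hashing trick: count tokens per bucket instead of per vocabulary word."""
    counts = defaultdict(int)
    for token in text.lower().split():
        counts[zlib.crc32(token.encode("utf-8")) % num_buckets] += 1
    return counts


class HashedNaiveBayes:
    """Multinomial Naive Bayes over hashed token counts (illustrative)."""

    def __init__(self, num_buckets=NUM_BUCKETS, alpha=1.0):
        self.num_buckets = num_buckets
        self.alpha = alpha  # Laplace smoothing constant
        self.doc_counts = defaultdict(int)
        self.bucket_counts = defaultdict(lambda: defaultdict(int))
        self.total_counts = defaultdict(int)

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            for bucket, n in hashed_counts(text, self.num_buckets).items():
                self.bucket_counts[label][bucket] += n
                self.total_counts[label] += n
        return self

    def predict(self, text):
        """Return the label with the highest log posterior score."""
        total_docs = sum(self.doc_counts.values())
        features = hashed_counts(text, self.num_buckets)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            denom = self.total_counts[label] + self.alpha * self.num_buckets
            for bucket, n in features.items():
                count = self.bucket_counts[label].get(bucket, 0)
                score += n * math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The trade-off is that distinct tokens can collide in the same bucket; a larger bucket count reduces collisions at the cost of more memory.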

The outcome 

  • Automated extraction of the expected entities from tax notices.
  • Automated classification of tax notices.
  • Reduced manual work and the number of errors arising from manual processing.
  • New notice types can be added efficiently and their extracted entity results stored.
