The focus of this blog is to highlight design considerations for any Intelligent Document Processing (IDP) system. Keep in mind, however, that the design and development of any AI/ML-driven system is iterative in nature, which is very different from a traditional software development project, and that is certainly true of IDP system development as well. For a high-level overview and conceptual understanding of IDP, please refer to “Intelligent Document Processing – An Overview”.
Broadly, the key aspects of an IDP system can be summarized as listed below:
Administration - Access Control & Permissions
Data Acquisition and Management
Annotation Lifecycle Management
Machine Learning Model Training, Validation and Testing
Model Pipeline & Management
User Experience - Inference, Review and Correction
Integration points
Hardware / GPU Management
All of the above aspects need to come together in harmony for the system to perform as expected. Let’s take a brief look at each of them.
The Administration module can be further sub-divided into 1) onboarding, 2) grouping, and 3) permissions. This requires a comprehensive role and permission management system to define specific groups (like administrators, annotators, business users, etc.) and associate specific permissions that control what actions they can perform - can upload documents, can annotate, can create new document categories, etc.
Depending upon the design of the system, you may want to restrict activities – like document upload for OCR and model training – that can be expensive or consume excessive computing resources if everyone is allowed to perform them.
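To make the permission model concrete, here is a minimal sketch in Python; the role names and permission set are hypothetical assumptions, and a real system would load them from a database or identity provider:

```python
from enum import Enum, auto

class Permission(Enum):
    UPLOAD_DOCUMENT = auto()
    ANNOTATE = auto()
    CREATE_CATEGORY = auto()
    TRIGGER_TRAINING = auto()

# Hypothetical role-to-permission mapping; a real system would load this
# from a database or an identity provider rather than hard-coding it.
ROLE_PERMISSIONS = {
    "administrator": {Permission.UPLOAD_DOCUMENT, Permission.ANNOTATE,
                      Permission.CREATE_CATEGORY, Permission.TRIGGER_TRAINING},
    "annotator": {Permission.ANNOTATE},
    "business_user": {Permission.UPLOAD_DOCUMENT},
}

def has_permission(roles: list[str], permission: Permission) -> bool:
    """Return True if any of the user's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)

# Example: gate an expensive action such as triggering model training.
if not has_permission(["annotator"], Permission.TRIGGER_TRAINING):
    print("Training is restricted to administrators.")
```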
The input to any IDP system will inherently be in the form of documents, emails, images, HTML, and other similar data sources. This calls for a flexible design so that new data sources arising in the future can be adopted without any major disruption to existing functionality or to the overall design and architecture of the system.
Post data acquisition, as the input goes through the various stages of OCR, NLP processing and prediction, a series of intermediate data files and outputs will be generated. Well-thought-out mechanisms need to be designed upfront to manage this data for easy and quick retrieval of information with optimal performance.
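One common way to keep data acquisition flexible is to hide each source behind a common interface so downstream stages never change when a new source is added. The sketch below is illustrative only; the source classes and RawDocument fields are assumptions, not a prescribed design:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class RawDocument:
    source_id: str
    content: bytes
    content_type: str  # e.g. "application/pdf", "message/rfc822"

class DataSource(ABC):
    """Common interface every ingestion source implements."""
    @abstractmethod
    def fetch(self) -> list[RawDocument]:
        ...

class EmailSource(DataSource):
    def fetch(self) -> list[RawDocument]:
        # Placeholder: pull messages from a mailbox and wrap them.
        return []

class SharedDriveSource(DataSource):
    def fetch(self) -> list[RawDocument]:
        # Placeholder: scan a shared folder for new PDFs and images.
        return []

def ingest(sources: list[DataSource]) -> list[RawDocument]:
    """New source types can be added without touching downstream stages."""
    docs: list[RawDocument] = []
    for source in sources:
        docs.extend(source.fetch())
    return docs
```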
This is one of the most laborious, time-consuming and error-prone steps in any machine learning project. Data scientists will appreciate that model performance is directly correlated with the quality of the training data. Quality can be defined in terms of “consistency” and “correctness”.
Consistency is a measure of whether the same type of text is labeled as intended across all documents in the dataset. This is a tricky problem if more than one annotator is involved. Depending on the level of domain understanding, the interpretation of the text and the entities may vary significantly from one annotator to another.
Correctness, which is self-explanatory, is a measure of whether what is being annotated is indeed correct.
Both consistency and correctness are important. Machine learning models are tolerant of some noise in the training data, but if the annotations are consistently wrong, it leads to a systemic problem where the models learn the wrong information. Systemic annotation issues can have a material impact on the project, both in terms of time and money, and should be caught as early in the lifecycle as possible.
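As a rough illustration of quantifying consistency, the sketch below computes simple percent agreement between two annotators over the same spans; a production system would more likely use a chance-corrected statistic such as Cohen's kappa, and the labels shown are hypothetical:

```python
def percent_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of spans two annotators labeled identically."""
    assert len(labels_a) == len(labels_b), "Annotators must label the same spans"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Two annotators labeling the same five entity spans in a document.
annotator_1 = ["LENDER", "ADMIN_AGENT", "LENDER", "BORROWER", "LENDER"]
annotator_2 = ["LENDER", "LENDER",      "LENDER", "BORROWER", "LENDER"]
print(percent_agreement(annotator_1, annotator_2))  # 0.8 -> review the disagreement
```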
It is extremely important to have a well-thought-out annotation lifecycle with review and approval controls to ensure consistency and correctness in the training dataset.
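A minimal sketch of such lifecycle controls might model annotation states and the allowed transitions between them; the states below are illustrative assumptions, not a fixed scheme:

```python
from enum import Enum

class AnnotationStatus(Enum):
    DRAFT = "draft"
    SUBMITTED = "submitted"
    IN_REVIEW = "in_review"
    APPROVED = "approved"
    REJECTED = "rejected"   # goes back to the annotator for rework

# Allowed transitions; only approved annotations feed the training dataset.
TRANSITIONS = {
    AnnotationStatus.DRAFT: {AnnotationStatus.SUBMITTED},
    AnnotationStatus.SUBMITTED: {AnnotationStatus.IN_REVIEW},
    AnnotationStatus.IN_REVIEW: {AnnotationStatus.APPROVED, AnnotationStatus.REJECTED},
    AnnotationStatus.REJECTED: {AnnotationStatus.SUBMITTED},
    AnnotationStatus.APPROVED: set(),
}

def transition(current: AnnotationStatus, target: AnnotationStatus) -> AnnotationStatus:
    """Enforce the review/approval workflow when an annotation changes state."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Cannot move annotation from {current.value} to {target.value}")
    return target
```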
This is the core step of any machine learning project and is in itself a very vast topic, which is outside the scope of this blog. All I would like to say is that the system should be flexible enough to support multiple iterations of training, validation and testing to get to acceptable model performance. Depending upon the model metrics, the system should aid data scientists in deciding whether more data is required, help analyze predictions, and evaluate the model’s ability to generalize to unseen data.
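As an illustration, a simple evaluation gate after each train/validate/test cycle might look like the sketch below; the metric names and thresholds are hypothetical and would be tuned per use case:

```python
def evaluate_iteration(metrics: dict, min_f1: float = 0.85, gap_tolerance: float = 0.05) -> str:
    """Rough decision aid after a train/validate/test cycle.

    Assumes `metrics` contains F1 scores on the validation and test splits.
    """
    val_f1, test_f1 = metrics["val_f1"], metrics["test_f1"]
    if val_f1 - test_f1 > gap_tolerance:
        return "Model is not generalizing; review annotations or add more varied data."
    if test_f1 < min_f1:
        return "Performance below target; annotate more documents and retrain."
    return "Acceptable; promote model to the pipeline."

print(evaluate_iteration({"val_f1": 0.91, "test_f1": 0.82}))
```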
Depending upon the type of document – structured, semi-structured, or unstructured – you may have different sets of models that have to be stitched together to extract information. This brings in the concept of a model pipeline, which needs to be trained in conjunction and then orchestrated in the same sequence at runtime. This is easier said than done, as the performance of models later in the pipeline is heavily dependent on the output (or predictions) from earlier models in the pipeline.
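A bare-bones orchestration of such a pipeline could look like the sketch below, where each stage consumes the document plus the predictions accumulated so far; the stages shown are hypothetical:

```python
from typing import Callable

# Each stage takes the document plus accumulated predictions and returns new predictions.
Stage = Callable[[dict, dict], dict]

def run_pipeline(document: dict, stages: list[Stage]) -> dict:
    """Orchestrate stages in the same order they were trained together."""
    predictions: dict = {}
    for stage in stages:
        # Later stages depend on the (possibly imperfect) output of earlier ones.
        predictions.update(stage(document, predictions))
    return predictions

# Hypothetical stages for a semi-structured document.
def classify_document(doc, preds):
    return {"category": "credit_agreement"}

def detect_sections(doc, preds):
    return {"sections": ["parties", "definitions"]} if preds["category"] == "credit_agreement" else {"sections": []}

def extract_entities(doc, preds):
    return {"entities": [{"label": "ADMIN_AGENT", "section": s} for s in preds["sections"]]}

result = run_pipeline({"text": "..."}, [classify_document, detect_sections, extract_entities])
```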
Another important consideration in an NLP project is the ability to annotate, review and make corrections for the entire document. Please note, this is in the context of the entire “document” and not a specific sentence or paragraph. There are many tools available in the market that facilitate annotation of sentences (or small pieces of text that can be supplied via CSV files) for part-of-speech tagging, entity labeling or sentiment labeling. However, when annotating documents, context is very important, and a particular entity may or may not be an appropriate label depending upon the context in which it appears. Hence, it is important that the entire document is available to the annotator at the time of labeling to facilitate appropriate interpretation and correct annotation of entities. For example, if JPMC is acting as both the administrative agent and a lender in a credit agreement, then when picking JPMC as the administrative agent, all other instances of JPMC where it appears as a lender in the document are not valid and should not be marked as administrative agent.
The same is true when validating predictions, as they, too, need to be validated based on the context in the document.
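One way to support this kind of context-aware review is to store annotations with offsets into the full document rather than as isolated snippets, so a reviewer can always see the surrounding text when judging a label; the record layout below is an illustrative assumption:

```python
from dataclasses import dataclass

@dataclass
class EntityAnnotation:
    document_id: str
    label: str   # e.g. "ADMIN_AGENT" or "LENDER"
    text: str    # surface form, e.g. "JPMC"
    start: int   # character offsets into the *full* document,
    end: int     # so reviewers can inspect the surrounding context

# The same surface form can carry different labels depending on where it appears.
annotations = [
    EntityAnnotation("credit_agr_001", "ADMIN_AGENT", "JPMC", start=1250, end=1254),
    EntityAnnotation("credit_agr_001", "LENDER", "JPMC", start=8830, end=8834),
]
```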
An IDP system by itself does not add any value until it is integrated with a business process, or its data is collected in a repository for downstream analysis. Depending on the use case, this integration step can be accomplished via an API endpoint or by simply downloading the data in a flat file for downstream processing and analysis.
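Both integration options can be quite thin. The sketch below assumes FastAPI for the API route and uses hypothetical paths, fields and an in-memory results store purely for illustration:

```python
from fastapi import FastAPI
import csv

app = FastAPI()

# In a real system this would query the extraction results store.
RESULTS = {"doc-123": {"category": "credit_agreement", "admin_agent": "JPMC"}}

@app.get("/documents/{doc_id}/extractions")
def get_extractions(doc_id: str) -> dict:
    """Downstream systems pull structured predictions for a processed document."""
    return RESULTS.get(doc_id, {})

def export_flat_file(path: str) -> None:
    """The simpler integration option: dump results to a CSV for offline analysis."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["document_id", "category", "admin_agent"])
        for doc_id, fields in RESULTS.items():
            writer.writerow([doc_id, fields["category"], fields["admin_agent"]])
```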
The total number of models in an IDP system will steadily increase as more document categories are onboarded onto the platform. This renders static serving solutions, where the number of models is known upfront, impractical. During runtime, any number of these models may be active, which may max out the available GPU capacity. Scaling out horizontally to add new GPUs at runtime may not be an optimal solution, as it can take several minutes for a new instance to be provisioned. That might be acceptable for training jobs, but not at inference time when new documents are being submitted for processing. Some dedicated compute capacity is required, along with intelligent GPU management, to avoid scenarios where GPUs are maxed out.
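One approach, sketched below under simplified assumptions (capacity counted in number of models rather than actual GPU memory, and a hypothetical load_model_to_gpu loader), is an LRU-style cache that keeps only the most recently used models resident:

```python
from collections import OrderedDict

class ModelCache:
    """Keep at most `capacity` models resident on the GPU, evicting the least
    recently used one when a new model must be loaded. A rough sketch; real
    capacity checks would use actual GPU memory rather than model counts."""

    def __init__(self, capacity: int, loader):
        self.capacity = capacity
        self.loader = loader            # function: model_name -> loaded model
        self._resident = OrderedDict()

    def get(self, model_name: str):
        if model_name in self._resident:
            self._resident.move_to_end(model_name)   # mark as recently used
            return self._resident[model_name]
        if len(self._resident) >= self.capacity:
            _, evicted_model = self._resident.popitem(last=False)
            del evicted_model           # release the evicted model's GPU memory
        self._resident[model_name] = self.loader(model_name)
        return self._resident[model_name]

# Usage (load_model_to_gpu is a hypothetical loader):
#   cache = ModelCache(capacity=4, loader=load_model_to_gpu)
#   model = cache.get("credit_agreement_ner")
```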
There you have it – key design considerations for building an IDP system. Other essential ingredients required: data science, machine learning and software engineering skills, combined with grit, creativity and a lot of patience!