One of the questions that keeps coming up in my conversations with colleagues, clients, and acquaintances is: what are the different types of documents, and how do you tell them apart? In this blog, I will attempt to answer this question and briefly delve into aspects of information extraction for each document type.
At a high level, there are three types of documents: structured, semi-structured, and unstructured.
Structured documents are the simplest of the three, and the ones we encounter all too often whenever we are filing taxes, applying for a driver's license, filling out a medical form, responding to a survey, and so on. All of us have come across such a document at some point in our lives. Many of these forms have been replaced with digital forms for data collection, but believe it or not, paper forms are still very common in business-to-business (B2B) scenarios. A popular example is the set of ACORD forms that are widely used in the insurance industry.
Structured documents are characterized by a fixed layout: information is always captured at the same spot on the page. If you have the coordinates of the regions where information is captured on the form, you can build OCR-based systems to extract information from these documents.
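To make the coordinate idea concrete, here is a minimal sketch. It assumes an OCR engine has already returned each word with a bounding box; the `Word` and `Zone` types, the field names, and the pixel values are all made up for illustration and are not tied to any particular OCR product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR'd word with its bounding box (hypothetical OCR output)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


@dataclass(frozen=True)
class Zone:
    """A fixed rectangle on the form where one field is always written."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, word: Word) -> bool:
        # Test the word's centre so that words slightly overlapping
        # the zone border are still assigned consistently.
        cx = word.x + word.w / 2
        cy = word.y + word.h / 2
        return self.x0 <= cx <= self.x1 and self.y0 <= cy <= self.y1


def extract_fields(words, zones):
    """Collect the text that falls inside each named zone."""
    fields = {}
    for name, zone in zones.items():
        hits = [w for w in words if zone.contains(w)]
        # Simplification: assumes words on one line share the same top edge.
        hits.sort(key=lambda w: (w.y, w.x))
        fields[name] = " ".join(w.text for w in hits)
    return fields


# Illustrative data: printed labels on the left, handwritten or typed values on the right.
words = [
    Word("Name:", 50, 100, 45, 12),
    Word("Jane", 120, 100, 40, 12),
    Word("Doe", 165, 100, 30, 12),
    Word("DOB:", 50, 140, 40, 12),
    Word("1985-04-12", 120, 140, 80, 12),
]
zones = {
    "name": Zone(110, 95, 300, 115),
    "date_of_birth": Zone(110, 135, 300, 155),
}

print(extract_fields(words, zones))
```

The zones are the only thing that has to be configured per form, which is why this approach works well as long as every page really does match the template.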
However, complications arise when the layout varies, whether through a different page size, a change in format, or drift in the position of the content when the document is printed on paper. OCR-based systems cannot process these documents if the variation exceeds the thresholds the system was designed to handle. What you need here is a smarter system that can cope with such variations.
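One modest way to absorb a small, uniform shift is to locate a printed landmark on the page and move the extraction regions by the same amount. The sketch below assumes regions are `(x0, y0, x1, y1)` tuples and that the landmark's position is known on the template; the function names and the `max_drift` tolerance are illustrative choices, and this handles translation only, not scaling or rotation.

```python
def estimate_offset(template_anchor, observed_anchor):
    """Shift between where a landmark should be and where it was found."""
    return (
        observed_anchor[0] - template_anchor[0],
        observed_anchor[1] - template_anchor[1],
    )


def shift_zone(zone, offset, max_drift=20):
    """Move a zone by the measured offset, refusing shifts beyond the tolerance."""
    dx, dy = offset
    if abs(dx) > max_drift or abs(dy) > max_drift:
        # Beyond this point a simple shift is no longer trustworthy.
        raise ValueError(f"drift {offset} exceeds tolerance of {max_drift}")
    x0, y0, x1, y1 = zone
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


# Illustrative values: the landmark sits at (50, 100) on the template
# but was found at (58, 97) on the scanned page.
offset = estimate_offset((50, 100), (58, 97))
name_zone = shift_zone((110, 95, 300, 115), offset)
print(offset, name_zone)
```

The explicit tolerance mirrors the threshold described above: inside it, a coordinate-based system keeps working; outside it, the page needs a more capable approach.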
Semi-structured documents are documents in which text is laid out spatially on the page. Common examples include invoices, receipts, and passports. Documents in a given category (such as invoices) carry similar types of information, but the layout may vary from one document to another, especially when they come from different sources, such as an invoice from vendor A versus an invoice from vendor B.
In a semi-structured document, pieces of text derive their meaning from their position relative to other text fragments in the vicinity. For example, a six-digit number in an invoice derives its meaning from a label saying "inv#" (or something to that effect) that is typically placed either to the left of the number or above it. For this reason, semi-structured documents are also called Visually Rich Documents (VRDs), because information is represented through the visual, relative arrangement of text on the page.
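The invoice-number example can be sketched with a simple hand-written rule. This is only an illustration of relative positioning, not the learned approach that real VRD systems use; the `Token` type, the label pattern, and the pixel tolerances are assumptions chosen for the example.

```python
import math
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One piece of text with the top-left corner of its box (hypothetical OCR output)."""
    text: str
    x: float
    y: float


# Assumed label spellings such as "inv#", "Invoice#", "Inv No."
LABEL = re.compile(r"^inv(?:oice)?\s*(?:#|no\.?):?$", re.IGNORECASE)
# The six-digit number from the example above.
VALUE = re.compile(r"^\d{6}$")


def find_invoice_number(tokens, line_tol=5.0, col_tol=20.0):
    """Return the six-digit value closest to an invoice label, to its right or below it."""
    labels = [t for t in tokens if LABEL.match(t.text)]
    values = [t for t in tokens if VALUE.match(t.text)]
    best = None
    for label in labels:
        for value in values:
            dx = value.x - label.x
            dy = value.y - label.y
            right_of = abs(dy) <= line_tol and dx > 0  # same line, to the right
            below = abs(dx) <= col_tol and dy > 0      # same column, underneath
            if not (right_of or below):
                continue
            dist = math.hypot(dx, dy)
            if best is None or dist < best[0]:
                best = (dist, value.text)
    return best[1] if best else None


# Vendor A puts the number to the right of the label...
vendor_a = [
    Token("Inv#", 400, 50), Token("104233", 460, 51),
    Token("Total", 400, 300), Token("980000", 460, 300),
]
# ...while vendor B puts it underneath.
vendor_b = [
    Token("INV#", 50, 80), Token("771204", 52, 98),
    Token("555123", 300, 400),
]

print(find_invoice_number(vendor_a))
print(find_invoice_number(vendor_b))
```

The same rule finds the number in both layouts because it looks at the label's neighborhood rather than at a fixed spot on the page. Rules like this become brittle as the number of vendors grows, which is where learned models come in.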
Extracting information from semi-structured documents (or VRDs) is much more complex. It requires specialized machine learning algorithms, drawn from natural language processing (NLP) or inspired by computer vision, that learn the general patterns of how text is presented and then isolate the pieces of interest.
Unstructured documents are by far the most complex of the three. These documents are typically free-form text such as letters, agreements, legal documents, and research papers. As the name suggests, unstructured documents have no fixed structure governing how information is captured or presented. A document may comprise several paragraphs combined with tables, pictures, and other artifacts, all intended to communicate information for human consumption.
One of the unique characteristics of unstructured documents is that the meaning of words changes from one domain to another and, even within a domain, from one function to another. The context of the subject, or of the document, matters far more than it does for structured and semi-structured documents. This variability makes information extraction from unstructured documents an extremely complex challenge.
However, information extraction from unstructured documents is now becoming possible thanks to recent breakthroughs in natural language processing. It is possible to train a general-purpose language model on a vast corpus of text in a given language, and then fine-tune it for specific tasks using transfer learning techniques. Even with these breakthroughs, extracting information from unstructured documents remains a fairly complex endeavor and requires a large investment in specialized skill sets (such as data science and engineering) and in computing resources.