Skilja Blog

Document Understanding Primer

by | Feb 27, 2012 | Recognition, Extraction, Fundamentals, Classification

For a long time, document understanding has been a research topic in computer science. Conferences have discussed concepts and approaches for using computers and machine learning to understand documents. The topic also appears quite often in proceedings on text analytics or, more recently, document analysis.

More recently, practical applications have become available that provide basic document understanding functionality. Enterprises typically use them to manage large volumes of incoming documents (especially paper) and to recognize and distribute those documents automatically. As these early and simple solutions have proven successful, we will soon see a wealth of new concepts that provide much larger benefits to the companies and end users who adopt them. So let’s take a look at what document understanding actually is.

The first goal of document understanding is to identify the function and the meaning of a document and its parts. A document is typically written for a specific purpose, and that purpose defines its function. An invoice is designed and created to notify a buyer of the goods bought and how much money needs to be paid for them, along with some other information for accounting and tax purposes. All content of the invoice follows from this function. Similarly, an application form in a bank is used to collect all the information needed to open an account. Such a document is normally highly structured. On the other hand, an e-mail (which is also a document) conveys information, opens a discussion and calls for action in a very unstructured way.

So the first step in document understanding is to identify the function and to separate the documents to be processed accordingly. This step is typically called “classification”. However, it is only the primary classification. Further categorization according to a taxonomy can follow, but its purpose is not to define the function. It is therefore very important to distinguish these two types of classification, because confusing them causes a lot of misunderstanding. The function of a document determines the possible content and the information entities that can be found in it.
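The distinction between the two types of classification can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the keyword lists, the category names and the simple substring matching are hypothetical, and a real system would usually learn its classes from example documents. The point is only that the function is a single primary label, while taxonomy labels are additional and independent of it.

```python
from dataclasses import dataclass, field

# Hypothetical keyword lists, chosen only for this illustration.
FUNCTION_KEYWORDS = {
    "invoice": {"invoice", "amount due", "vat"},
    "account_application": {"application", "open an account", "date of birth"},
}

# Secondary taxonomy: labels that describe a document without defining its function.
TAXONOMY_KEYWORDS = {
    "urgent": {"immediately", "overdue"},
    "foreign_currency": {"usd", "gbp"},
}


@dataclass
class Classification:
    function: str                              # primary classification: one value
    topics: set = field(default_factory=set)   # taxonomy labels: zero or more


def classify(text: str) -> Classification:
    lowered = text.lower()
    # Primary classification: pick the function whose keywords match most often.
    scores = {
        name: sum(keyword in lowered for keyword in keywords)
        for name, keywords in FUNCTION_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    function = best if scores[best] > 0 else "unknown"
    # Taxonomy categorization: collected separately, never changes the function.
    topics = {
        name
        for name, keywords in TAXONOMY_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    }
    return Classification(function, topics)
```

For example, `classify("Invoice no. 17: amount due 100 USD, payment overdue")` yields the function `invoice` plus the taxonomy labels `urgent` and `foreign_currency`. Removing the word "overdue" would change the taxonomy labels but not the function.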

The second step of understanding is to identify all information entities, or predicates, of a document that relate to its function. Typically only a few entities are actually required (e.g. the amounts on an invoice), but we would prefer to talk about “understanding” only when all entities have been identified. To identify an entity (such as a tax amount), it must be detected and a meaning must be attached to it. A number without a meaning is not a predicate. Only if all entities can be labeled with a meaning does the computer system really understand the document.
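As a rough sketch of this second step, the snippet below attaches a meaning to numbers on an invoice and checks whether all expected entities were found. The label patterns, the entity names and the `EXPECTED` table are hypothetical assumptions for this example and not a description of any particular product. Numbers that appear without a recognizable label are ignored, because a number without a meaning is not a predicate.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    meaning: str   # e.g. "tax_amount"
    value: str     # the detected text, kept as a string


# Hypothetical: the entities each document function is expected to carry.
EXPECTED = {"invoice": {"net_amount", "tax_amount", "total_amount"}}

# Hypothetical label patterns; each one ties a number to a meaning.
AMOUNT = r":?\s*(\d+(?:\.\d{2})?)"
LABEL_PATTERNS = {
    "net_amount": re.compile(r"net(?: amount)?" + AMOUNT, re.IGNORECASE),
    "tax_amount": re.compile(r"(?:tax|vat)(?: amount)?" + AMOUNT, re.IGNORECASE),
    "total_amount": re.compile(r"total(?: amount)?" + AMOUNT, re.IGNORECASE),
}


def extract_entities(text: str) -> list:
    """Return only numbers that could be labeled with a meaning."""
    entities = []
    for meaning, pattern in LABEL_PATTERNS.items():
        match = pattern.search(text)
        if match:
            entities.append(Entity(meaning, match.group(1)))
    return entities


def is_understood(function: str, entities: list) -> bool:
    """True only if every entity expected for this function was identified."""
    expected = EXPECTED.get(function)
    if expected is None:
        return False  # unknown function: nothing to check against
    return expected <= {entity.meaning for entity in entities}
```

With the text `"Net: 100.00\nVAT: 19.00\nTotal: 119.00"`, all three amounts receive a meaning and `is_understood("invoice", ...)` returns `True`. A document that contains only a total does not pass the check.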

In a third step, real understanding can take place when the entities are brought into the context of the document function, the purpose of the communication and the other entities. A (business) document typically triggers an action. The context of the discovered predicates and the correlations between them need to be analyzed to determine which action that is. An e-mail may contain a request to send some information back (which would be an entity “request for information”). But only in context with the rest of the e-mail, the e-mail thread or the attachments does it become clear which actions to perform. I will discuss this in more depth in upcoming posts.
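To make the role of context more concrete, here is a deliberately simple sketch. The `Email` structure, the action names and the rules are all hypothetical. They only show that the same entity (“request for information”) can lead to different actions depending on what else is known about the e-mail.

```python
from dataclasses import dataclass, field


@dataclass
class Email:
    entities: set                                        # meanings found in the body
    thread_subjects: list = field(default_factory=list)  # earlier messages in the thread
    attachments: list = field(default_factory=list)      # attached file names


def decide_action(email: Email) -> str:
    """Pick an action from the entity plus its context (hypothetical rules)."""
    if "request_for_information" not in email.entities:
        return "file"  # nothing is being asked for
    if email.attachments:
        # The request probably refers to the attached material.
        return "review_attachment_then_reply"
    if email.thread_subjects:
        # The earlier thread tells us which information is meant.
        return "reply_in_thread"
    # The entity alone does not say what to send back.
    return "ask_for_clarification"
```

An e-mail with the entity `request_for_information` and no further context ends up as `ask_for_clarification`, while the same entity inside an existing thread results in `reply_in_thread`.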

As you have seen, it is important to know the function of a document, or of any text, in order to understand its content. With knowledge of the function (= purpose) and some of the content, we can already take action and achieve a basic kind of document understanding. This is what current solutions provide.