If you have been involved in plans or projects for automated document processing, you have surely come across the distinction between structured and unstructured information, and you probably have some idea of what it means. But what does it really mean?
Let's start from the end, from the desired result, and look at the type of information a computer can process. Typically this is information stored in databases, in records and in labeled fields. The algorithm at the heart of any information-processing software needs input to produce the desired output. In the end, that input must be what engineers call name-value pairs: each value used for the task must be accompanied by a name that identifies its meaning. Without that meaning, the computer cannot interpret the value. As an example, the sequence of numbers 01.04.12 can have many meanings:
- A date: the first of April if you live in Europe
- A date: the fourth of January if you live in the US
- A telephone number (or a part of a telephone number)
- An account number, etc.
Only when a label is assigned to this number can a software program use it to make a meaningful decision (e.g. treating it as a meeting date and popping up a window to remind you).
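To make the point about name-value pairs concrete, here is a minimal Python sketch. The label names (`meeting_date_eu`, `meeting_date_us`, `phone_fragment`) and the `interpret` function are made up for illustration; they are assumptions, not part of any particular product.

```python
from datetime import date, datetime
from typing import Union

RAW = "01.04.12"  # the ambiguous value from the example above


def interpret(name: str, value: str) -> Union[date, str]:
    """Interpret a raw value according to its (hypothetical) label."""
    if name == "meeting_date_eu":
        # European convention: day.month.year
        return datetime.strptime(value, "%d.%m.%y").date()
    if name == "meeting_date_us":
        # US convention: month.day.year
        return datetime.strptime(value, "%m.%d.%y").date()
    if name == "phone_fragment":
        # Not a date at all: just keep the digits
        return value.replace(".", "")
    raise ValueError(f"No known meaning for label {name!r}")


print(interpret("meeting_date_eu", RAW))  # 2012-04-01
print(interpret("meeting_date_us", RAW))  # 2012-01-04
print(interpret("phone_fragment", RAW))   # 010412
```

The same characters produce three different results, and the only thing that changes is the name attached to the value.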
So, in general, labeled or tagged data is the type of information we call structured. This information does not have to be numeric, as in the example above, nor does it have to reside in a database. There are numerous ways to represent the meaning of information. Even a free-text document like this one might be stored together with XML tags that identify the entities in it. The words “April” and “January” in the list above could be tagged as months and hence be interpreted and used by an algorithm.
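As a rough illustration of such tagging, the sketch below wraps the two month names in a `<month>` element and reads them back with Python's standard `xml.etree.ElementTree` module. The element names (`list`, `item`, `month`) are invented for this example and do not follow any specific tagging standard.

```python
import xml.etree.ElementTree as ET

# Free text with (hypothetical) inline tags marking the month entities
tagged = (
    "<list>"
    "<item>A date: first of <month>April</month> if you live in Europe</item>"
    "<item>A date: fourth of <month>January</month> if you live in US</item>"
    "</list>"
)

root = ET.fromstring(tagged)

# An algorithm can now pick out the months without understanding the text
months = [el.text for el in root.iter("month")]

# The human-readable text is still there once the tags are ignored
plain_text = ["".join(item.itertext()) for item in root.iter("item")]

print(months)         # ['April', 'January']
print(plain_text[0])  # A date: first of April if you live in Europe
```

The text a human reads is unchanged; the tags simply add the label that a program needs.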
Now here comes the problem. Most of the information we receive, and most of the communication we conduct, is not tagged and hence not structured. Human cognitive capabilities allow us to understand it anyway. Without further help, a computer program cannot tell whether “April” is a month or a girl's name. We humans can, because we understand the context of the word, so it is easy for us to see that it is a month. And since the number above would also appear in a context, we would understand its meaning as well. This is something computer programs are only just starting to accomplish; it is called content analysis or fact extraction. It is much more difficult than you would expect, precisely because the human brain is so good at this task that it seems effortless.
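To give a feeling for what using context means for a program, here is a deliberately naive Python sketch. It looks for a few hand-picked cue patterns around the word “April”; the cue list and the function name `classify_april` are assumptions made up for this example. Real content analysis systems are far more elaborate, and this toy rule set fails on many ordinary sentences.

```python
import re

# Hand-picked context cues that suggest "April" is a month (illustrative only)
MONTH_CUES = re.compile(
    r"\b(?:in|of|since|until|next|last)\s+April\b"  # "first of April", "in April"
    r"|\bApril\s+\d{1,4}\b"                         # "April 12", "April 2012"
    r"|\b\d{1,2}(?:st|nd|rd|th)?\s+April\b"         # "1st April", "12 April"
)


def classify_april(sentence: str) -> str:
    """Guess the role of the word 'April' from its immediate context."""
    if not re.search(r"\bApril\b", sentence):
        return "absent"
    if MONTH_CUES.search(sentence):
        return "month"
    # No cue found: the program cannot decide, although a human reader usually can
    return "unknown"


print(classify_april("The meeting is on the first of April."))    # month
print(classify_april("April called to reschedule the meeting."))  # unknown
```

Even this tiny example shows the gap: a reader immediately sees that the second sentence is about a person, while the program can only report that none of its rules applied.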
The majority of the information in the world is unstructured. Most of it is still on analog media like paper. But even when it is digitized and available electronically, as we say, machines do not automatically understand it. Processing it is only possible because humans automatically assign a context and a meaning to each element and therefore understand it. Once we understand it, we can draw conclusions, make decisions, and synthesize new information. To make it usable by machines, it needs to be transformed into structured information, either by recognition systems or simply by data typists.
Structured information is information a computer can process. Unstructured information is what we see in the world. Recognition and Document Understanding bridge the gap between the two, enabling automated document processing and analysis.