If you have followed my presentations in the past you know that document classification closely corresponds to concept creation of the human mind. The concepts represent classes of real life objects. We are able to recognize concepts and group objects and group them into classes to which we assign meaning and terms. Already when we are very young we are able to do that with ease. See related article on the baby computer on this.
Obviously this is a basic ability of the cognitive system. This can be very nicely seen in a collection of book covers that was presented in the recent edition of German weekly journal “Die Zeit”. They have asked the question why book covers tend to display the same theme (=concept) over and over again. Particularly striking is the example they found for book covers that show an empty chair.
More examples of concepts used in book cover can be found here.
For us in the document understanding arena this is a very interesting collection. Because it is ideally suited to describe how to build a classifier:
- Define a category of similar objects (documents of a certain type) you want to classify. In this case this is the category “chair”
- Identify all the features that you have in your images. For the book covers these can be colors, shapes, edges but also text (in the titles) and logos
- When all possible features have been defined and found the important step is to understand which features are relevant and which are irrelevant. This is done by the classifier in the learn-by-example step. The classifier will look at all book covers (represented as features) and remove all the features that are not relevant for our category. For example the classifier will find that color, the title and the logo are not important while certain aspects of the shape are important.
- Not all samples must contain all features but each need to have a certain number of relevant features. While the classification technique is relatively simple and well understood the art is the selection and correct discrimination of the features. This makes a good classifier.
You see what tremendous cognitive power we as humans have as for us there is no difficulty at all to correctly classify all these chairs. But we have the advantage, that we have a stored concept of a chair (world knowledge) which makes the task easier. The same holds true for documents. While statistical analysis needs to start from scratch each time, humans have their ontology (world knowledge) available. In the future also document classification will be greatly improved by adding context and knowledge.