Taxonomy and Hierarchies

Classification can be defined as modeling real objects via a simplified mathematical representation consisting of a set of characteristic features. The goal of such a description is to collect objects, which are quite similar into one group.

A big benefit of the similarity concept is that these groups can be used in an ordered way to organize and access a large number of objects, like the books in a library. From medical diagnostics to international news classification schemes, in almost every profession there are specific bodies regulating the hierarchical set of classes, their content and relationship, and their usage. These sets are called Taxonomies. Taxonomies can either be predefined by domain specific agreements or can be individually built for a concrete organization and classification task.

Often category systems are represented by a mathematical graph, called a tree, that arranges the classes or groups of similar objects according to their hierarchical dependence. The phylogenetic tree of life that is known from Biology is a good example of such a tree.

Tree of life as an example of a taxonomy (from Wikipedia Commons)

The tree shows classes with similar objects in the same branches with common parents. Meaning that in this example animals, fungi and plants are more similar than animals and bacteria. The main classes are depicted in colors (blue, red and brown) and the sub hierarchies appear as twigs in the tree.

A key to success of any classification system, scientific as above or domain specific for a document processing system is the way the taxonomy is created. The typical approach is creating a taxonomy and building a tree of categories related to existing business processes in the company. The classes are added manually or imported from an existing taxonomy in a database, a mail server or the file system. The imported structure is then used for building the classification tree and, if machine learning is used, the documents are used as training samples.

But it is very important to choose the classes and there position in the tree wisely to avoid conflicts and contradictions within the tree. This requires significant domain knowledge and a deep understanding of the classification technology used to build it correctly. Normally it is not possible to just map existing manual processes into a software system. An expert needs to analyze the situation and build the structure according the restrictions implied. But if properly used hierarchies can significantly improve classification quality and help to avoid overlap and ambiguities between classes.