Classification tries to mimic human understanding. Several methods have been developed over the years to achieve what we as humans can do almost effortlessly. These methods can be divided into two groups.
Rule-based classification
Rule-based systems are carefully crafted by system designers who are familiar with the subject matter and are able to encode their knowledge as rules in a software application. Rule-based systems have been popular for a long time. In fact, the programming language LISP became a mainstay of early artificial intelligence work largely because it makes it easy to express rule-based algorithms. The advantage of rule-based systems is that they are fully controllable; the disadvantage is that their complexity grows faster than the complexity of the task.
Most rule-based systems today use a combination of Boolean operators (AND, OR, NOT) and dictionaries to find positive or negative evidence of a match to a category. For example, a rule for content about financial businesses might require that the keyword “bank” is matched but NOT in proximity to words such as “river” or “to sit on”.
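To make this concrete, the sketch below shows how such a rule could be evaluated in code. The rule format, category name and keyword lists are hypothetical and much simpler than what real products offer, but they illustrate the combination of positive keywords, negative keywords and proximity.

```python
import re

# Hypothetical rule format: each category lists required keywords (positive
# evidence, combined with OR) and keywords that must NOT appear within a
# window of words around a match (negative evidence).
RULES = {
    "financial_business": {
        "require": {"bank", "credit", "mortgage"},
        "exclude_near": {"river", "sit"},
        "window": 5,  # proximity measured in words
    }
}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def matches(category, text, rules=RULES):
    rule = rules[category]
    tokens = tokenize(text)
    for i, token in enumerate(tokens):
        if token in rule["require"]:
            # Check the proximity window around the positive keyword.
            lo, hi = max(0, i - rule["window"]), i + rule["window"] + 1
            if not rule["exclude_near"] & set(tokens[lo:hi]):
                return True
    return False

print(matches("financial_business", "The bank approved the mortgage."))  # True
print(matches("financial_business", "We walked along the river bank."))  # False
```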
The success of rule-based systems depends on the complexity of the document content and the size of the taxonomy, as well as on a well-curated collection of synonyms and keywords associated with each category. The challenge lies in creating and tweaking the rules for each category and keyword combination, which can be a lot of work for large taxonomies. It can therefore generally be said that the smaller the number of classes and the more complex the document content, the better rule-based systems work.
We also see hybrid systems on the market today in which the rules are defined manually and the dictionary is created automatically by training the system. To be effective, this still requires a lot of supervision by an experienced coder to avoid falsely learned keywords, so much so that it often seems easier to define the keywords manually up front.
Statistical classification
The second group of systems is statistical and uses machine learning to identify a set of features in the text that are characteristic of a document category. Statistical classification models are mostly trained by supervised learning. During this process a set of representative samples is presented to the system; these samples are analyzed and relevant features are extracted. Relevant means relevant for the type of object (in the case of text documents these are words or fragments of words) and relevant for a specific category: features that are present in one category but missing or less frequent in all other categories. A typical text page has about 250 words, of which maybe 20-30 carry real significance and are not just grammatical filler words. To find good features, a significant number of different, characteristic text samples must be available and prepared. Normally these are provided as a starter set during project setup and are then enhanced and revised using feedback from live operation in production.
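The following sketch illustrates this supervised workflow end to end with scikit-learn. The sample texts and category labels are invented for illustration; a real project would use far larger, representative training sets per category.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented, labeled sample documents (text, category).
samples = [
    ("Please approve the mortgage and credit application for the customer.", "loan"),
    ("The bank extended a credit line secured by the property deed.",        "loan"),
    ("We hiked along the river bank and sat down on a bench.",               "leisure"),
    ("The camping trip followed the river through the valley.",              "leisure"),
]
texts, labels = zip(*samples)

# Extract features (here: simple word counts) and train the classifier in one pipeline.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["The deed for the mortgage was signed at the bank."]))  # expected: ['loan']
```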
In the simplest case the features are just the words of the documents (the “bag of words” approach). More sophisticated systems use stemming and lemmatization to normalize the words, so that for example “buying” and “bought” are recognized as the same feature related to a financial transaction. Others use trigrams or, more generally, n-grams, which break words down into short overlapping character sequences so that similar words share features. Finding internal representations of natural-language documents that are both general and effective is the most important part of the process and decisive for the quality of the results.
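The sketch below contrasts these representations on an invented sentence. The lemmatization part assumes NLTK with its WordNet data installed, so it is shown as commented-out code rather than executed.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["The customer is buying a house; last year she bought an apartment."]

# Bag of words: each distinct word becomes one feature.
bow = CountVectorizer()
bow.fit(text)
print(sorted(bow.vocabulary_))  # ['an', 'apartment', 'bought', 'buying', ...]

# Character n-grams (3-5 characters within word boundaries): similar words
# such as "buying" and "buyer" share many features even without normalization.
ngrams = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5))
ngrams.fit(text)
print(len(ngrams.vocabulary_))  # number of character n-gram features

# Lemmatization (sketch, assuming NLTK with the WordNet data downloaded):
# both "buying" and "bought" are reduced to the same base form "buy".
# from nltk.stem import WordNetLemmatizer
# lemmatizer = WordNetLemmatizer()
# print(lemmatizer.lemmatize("buying", pos="v"), lemmatizer.lemmatize("bought", pos="v"))
```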
Once the features are known, a classification algorithm is applied to assign a weight to each feature. The weight determines the overall importance of a feature and its relevance for a specific category. For example, grammatical stop words like “and”, “the” or “to” are so frequent in every document that they have no discriminative value and are eliminated completely. More meaningful words like “credit”, “mortgage” or “deed” may be present in the samples of several classes but be significant for only one. The classifier therefore assigns weights to the features to mark the ones that are really important. It should be clear from this description that a significant number of typical samples is required in the training set of each class to achieve a sufficient result.
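A small sketch of this weighting step, assuming the toy documents below: TF-IDF with a stop-word list removes ubiquitous words, and a linear classifier then learns a per-class weight for each remaining feature, which can be inspected directly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy documents and labels, invented for illustration.
texts = [
    "the credit and the mortgage and the deed",   # loan
    "the mortgage was approved by the bank",      # loan
    "the river and the bench and the park",       # leisure
    "we sat by the river near the park",          # leisure
]
labels = ["loan", "loan", "leisure", "leisure"]

vectorizer = TfidfVectorizer(stop_words="english")  # drops "the", "and", "by", ...
X = vectorizer.fit_transform(texts)

clf = LogisticRegression().fit(X, labels)

# Show each remaining feature with its learned weight; in this binary case,
# positive values favor one class and negative values the other.
for feature, weight in zip(vectorizer.get_feature_names_out(), clf.coef_[0]):
    print(f"{feature:10s} {weight:+.2f}")
```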
The classifiers used normally fall into one of the following families (a sketch using common library implementations follows the list):
- Nearest Neighbor
- Support Vector Machines
- Bayesian Networks
- Maximum Entropy
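As a rough illustration, the sketch below instantiates one widely used representative per family from scikit-learn. The mapping is approximate: MultinomialNB is a naive Bayes classifier rather than a full Bayesian network, and multinomial logistic regression corresponds to a maximum-entropy model.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# One common representative per classifier family (approximate mapping).
classifiers = {
    "Nearest Neighbor":       KNeighborsClassifier(n_neighbors=3),
    "Support Vector Machine": LinearSVC(),
    "Bayesian (naive Bayes)": MultinomialNB(),
    "Maximum Entropy":        LogisticRegression(max_iter=1000),
}

texts  = ["credit and mortgage", "deed and bank loan", "river and bench", "park by the river"]
labels = ["loan", "loan", "leisure", "leisure"]

for name, clf in classifiers.items():
    model = make_pipeline(TfidfVectorizer(), clf).fit(texts, labels)
    print(name, model.predict(["mortgage deed at the bank"]))
```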
The model created by the classifier is applied to unknown documents for classification. The features of the unknown document are extracted and compared to the stored, weighted features. The result of the classification is a list of confidences describing the correlation between the document and each class. A confidence lies between 0 (no similarity) and 1 (very similar to the class). The confidences should not be normalized to sum to 1 but should give a true measurement of the correlation between the document and each class.
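One way to obtain such unnormalized per-class confidences is to train an independent binary classifier per category, as in the following sketch (toy data, scikit-learn assumed). Each classifier reports its own score between 0 and 1, and the scores across classes need not sum to 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts  = ["credit and mortgage deed", "bank approved the loan",
          "river bench in the park", "walk along the river"]
labels = ["loan", "loan", "leisure", "leisure"]

vectorizer = TfidfVectorizer().fit(texts)
X = vectorizer.transform(texts)

# One independent binary classifier per category ("belongs to category" vs. not).
classifiers = {}
for category in set(labels):
    y = [1 if label == category else 0 for label in labels]
    classifiers[category] = LogisticRegression().fit(X, y)

unknown = vectorizer.transform(["the mortgage for the house near the river"])
for category, clf in classifiers.items():
    confidence = clf.predict_proba(unknown)[0, 1]  # probability of membership
    print(f"{category}: {confidence:.2f}")         # values need not sum to 1
```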
Often the training is iterative: the model is retrained with additional samples, either offline or in production, to improve the results.
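A minimal sketch of such incremental retraining, assuming corrected feedback samples arrive during production: a stateless HashingVectorizer turns new documents into features without refitting, and an SGDClassifier accepts incremental updates via partial_fit.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**16)  # stateless, no fitting required
model = SGDClassifier()
classes = ["loan", "leisure"]

# Initial training with the starter samples from the project setup.
X0 = vectorizer.transform(["credit and mortgage", "river and bench"])
model.partial_fit(X0, ["loan", "leisure"], classes=classes)

# Later: corrected samples from production feedback are folded into the model.
X1 = vectorizer.transform(["the bank approved the deed"])
model.partial_fit(X1, ["loan"])

print(model.predict(vectorizer.transform(["mortgage deed"])))
```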