Many automatic classification systems today rely on a pure bag-of-words approach to find the features that determine what a document is about. A few add correlation and collocation to account for the fact that a word's meaning depends on its context. None of them performs a full semantic analysis of word meaning. Yet that is exactly what is needed to classify a document accurately.
The main reason is that language, and English in particular, is highly ambiguous. English nouns have on average 5-8 close synonyms. Some words, “strike” for example, have more than 30 common meanings (striking a baseball, the strike price when buying stock, going on strike as an employee, etc.). If you use a simple bag of words as features, the software will never be able to distinguish an important fact (strike = work stoppage) from irrelevant information (baseball). As a result, the classification is just as ambiguous and not very precise.
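To make the problem concrete, here is a minimal sketch of a bag-of-words feature extractor in Python. The function name bag_of_words and the two example sentences are our own illustrations, not part of any particular classification product.

```python
from collections import Counter
import re

def bag_of_words(text: str) -> Counter:
    """Count lowercase word tokens, ignoring word order and context."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Two illustrative sentences that use "strike" in different senses.
labor = bag_of_words("The union called a strike at the factory.")
baseball = bag_of_words("The batter took a strike in the ninth inning.")

# Both documents produce the same value for the feature "strike",
# although the word means something different in each of them.
print(labor["strike"], baseball["strike"])  # prints: 1 1
```

A classifier that sees only these counts has nothing to tell the work stoppage from the baseball pitch.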
Full semantic analysis can solve this. Wikipedia gives a good definition of semantics: “Linguistic semantics is the study of meaning that is used for understanding human expression through language”. That sounds exactly like what we want to achieve in document classification and understanding. Linguistic semantics can resolve the ambiguity of an expression and assign a single meaning to each word. It does so by analyzing the relations the word has with other words in the text.
The word “plant” is a good example. When it means a factory, you can enter the plant, build a plant or close it, but you cannot pick it or eat it. That is the other plant, the “living organism lacking the power of locomotion”. By analyzing these relations with rules, a semantic analyzer can determine the meaning of a word much as a human reader does. This lets us tell the two meanings of “plant” apart. Apple is an even better case. If you simply search a text for “Apple” without semantics, your results will also include all the rotten apples, apple pies and, of course, the Big Apple. Only semantics can identify Apple as a company through its relations in the text.
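The idea of picking a meaning from a word's relations can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how a real semantic analyzer works: the SENSE_VERBS table, the sense labels and the function disambiguate_plant are hypothetical, and the code only checks whether a telltale verb occurs in the same sentence instead of parsing the actual grammatical relation.

```python
import re

# Hypothetical hand-written rules, taken from the "plant" example:
# verbs that typically go with each meaning of the word.
SENSE_VERBS = {
    "plant/factory": {"enter", "build", "close"},
    "plant/organism": {"pick", "eat"},
}

def disambiguate_plant(sentence: str) -> str:
    """Guess the sense of "plant" from verbs that co-occur in the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    # Score each sense by how many of its telltale verbs appear.
    scores = {sense: len(tokens & verbs) for sense, verbs in SENSE_VERBS.items()}
    best = max(scores, key=scores.get)
    # Without any matching verb the rules cannot decide.
    return best if scores[best] > 0 else "plant/unknown"

print(disambiguate_plant("The company will close the plant next year."))
print(disambiguate_plant("You can pick the plant and eat it."))
print(disambiguate_plant("The plant is very large."))
```

The three calls return plant/factory, plant/organism and plant/unknown. A production system would need a parser, handling of inflected forms such as “closed”, and a far larger rule base, but the principle is the same: the surrounding relations select the meaning.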
Using semantic analysis as a feature generator greatly improves the precision of a classification algorithm and, at the same time, makes it possible to distinguish between important and irrelevant features in a text, just as you do when reading. Future intelligent algorithms will need to use this technology.
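As a rough sketch of what “feature generator” means here, the following Python snippet builds features from words that already carry a sense label. The (word, sense) pairs are tagged by hand and merely stand in for the output of a semantic analyzer; the function sense_features and the word#sense naming are our own assumptions.

```python
from collections import Counter

def sense_features(tagged_tokens):
    """Build features from (word, sense) pairs; untagged words keep their surface form."""
    return Counter(
        f"{word}#{sense}" if sense else word
        for word, sense in tagged_tokens
    )

# Hand-tagged input standing in for the output of a semantic analyzer.
labor = sense_features([("union", None), ("strike", "work_stoppage")])
baseball = sense_features([("batter", None), ("strike", "baseball_pitch")])

print(sorted(labor))     # ['strike#work_stoppage', 'union']
print(sorted(baseball))  # ['batter', 'strike#baseball_pitch']
```

The two documents no longer share a “strike” feature, so a classifier can weight the work stoppage as important and ignore the baseball pitch.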
In an upcoming post we will explain how semantic analysis can be used to generalize concepts in order to find topics, another important aspect of good classification.