Document understanding is about interpreting text: determining its meaning and the entities it contains. In unstructured text, the meaning of entities, or even of simple words, often cannot be clearly derived and can be highly ambiguous. Take a simple number in a text: it can be anything from an identification number to a date. Many systems try statistical approaches these days, but in the end only a complete linguistic analysis of the text makes it possible to really understand it and to derive structured information and actionable events.
Linguistic analysis applies several methods step by step, each building on the previous one. Only very sophisticated systems today fully implement all of the steps, the combination of which we call semantic technologies:
- Lexicon: Identifies ultimate units of text
- Morphology: Identifies grammatical properties (gender, case, voice, etc.)
- Syntax: Identifies the structure of the sentence, i.e. how words are grammatically related
- Semantics: Interprets the meaning of syntactic relations and grammatical properties
- Pragmatics: Uses extra-linguistic data (i.e. knowledge about the world) to interpret semantic relations
Syntax (from Ancient Greek σύνταξις “arrangement”) is the study of the principles and processes by which sentences are constructed in particular languages. It refers directly to the rules and principles that govern the sentence structure of any individual language. Syntax attempts to describe languages in terms of such rules.
Semantics is the study of meaning. It focuses on the relation between signifiers, such as words, phrases, signs and symbols, and what they stand for. For document understanding, it is the study of meaning that is used to understand human expression through language.
Pragmatics studies the ways in which context contributes to meaning. The transmission of meaning depends not only on the linguistic knowledge of the speaker and listener, but also on the context, knowledge about the status of those involved, the inferred intent of the speaker, and so on. (All definitions cited from Wikipedia).
One of the problems of human language is that words are ambiguous. In fact, English words have 6-8 close synonyms. As an example, the English word strike has more than 30 common meanings: strike a ball, strike of the workers, strike a match, strike the hour… Note that I am already using additional nouns as identifiers to exemplify the meanings.
Let me give you some examples of how semantic technologies allow you to understand the meaning of ambiguous text. Each step in the layered semantic model can help you resolve ambiguities:
Syntax: “mean” – depending on its syntactic role, the word has a different meaning:
- He doesn’t mean it. (mean as a verb)
- He is really mean. (mean as an adjective)
- This is the monthly mean. (mean as a noun)
In each of these examples the word mean has a very different meaning, which can be (partially) resolved using syntactic analysis. A statistical classifier that only counts words, without looking at their role in the sentence, has no way to distinguish these.
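To make the idea concrete, here is a minimal sketch of resolving the part of speech of "mean" from its neighbour in the sentence. The word lists and the `tag_mean` function are hypothetical and only cover the three example sentences; a real system would use a full lexicon and a parser rather than a one-word lookbehind.

```python
# Toy word lists (assumptions for illustration, not a real lexicon).
AUXILIARIES = {"doesn't", "does", "don't", "didn't", "to", "will", "would"}
COPULAS_AND_INTENSIFIERS = {"is", "was", "really", "very", "so"}
DETERMINERS_AND_MODIFIERS = {"the", "a", "monthly", "annual"}


def tag_mean(sentence):
    """Guess the part of speech of 'mean' from the word directly before it."""
    # Normalise case and apostrophes, drop final punctuation, split on spaces.
    tokens = sentence.lower().replace("’", "'").rstrip(".!?").split()
    idx = tokens.index("mean")  # raises ValueError if 'mean' is absent
    previous = tokens[idx - 1] if idx > 0 else ""
    if previous in AUXILIARIES:
        return "verb"
    if previous in COPULAS_AND_INTENSIFIERS:
        return "adjective"
    if previous in DETERMINERS_AND_MODIFIERS:
        return "noun"
    return "unknown"


for example in ("He doesn't mean it.",
                "He is really mean.",
                "This is the monthly mean."):
    print(example, "->", tag_mean(example))
```

The point of the sketch is only that the decision is driven by the word's position in the sentence structure, not by the word form itself, which is identical in all three sentences.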
Semantics: “bank” – always a noun, but it has different meanings depending on its relations in the text:
- We were sitting on the bank of the Colorado. (bank as the border of a river)
- I opened an account at the bank of Colorado. (bank as a financial institution)
- I entered the bank of Colorado through the door. (bank as a building)
Semantic disambiguation is only possible because there are defined relations between words within their specific meanings. You simply cannot sit on a financial institution, and you cannot enter the border of a river. Because these relations do not exist, the true meaning can be determined in each of these cases.
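The following sketch shows one way such relations could be encoded: every sense of "bank" gets a semantic type, and every predicate lists the types it accepts. All names, types and entries here are hypothetical and cover only the three example sentences; real semantic lexicons are far larger and more finely grained.

```python
# Hypothetical mini-lexicon: each sense of "bank" carries a semantic type.
BANK_SENSES = {
    "border of a river": "surface",
    "financial institution": "organization",
    "building": "enclosure",
}

# Hypothetical relations: the semantic types each predicate accepts.
ALLOWED_TYPES = {
    "sit on": {"surface"},
    "open an account at": {"organization"},
    "enter": {"enclosure"},
}


def senses_of_bank(predicate):
    """Return the senses of 'bank' that the given predicate allows."""
    allowed = ALLOWED_TYPES.get(predicate, set())
    return [sense for sense, sem_type in BANK_SENSES.items()
            if sem_type in allowed]


print(senses_of_bank("sit on"))              # ['border of a river']
print(senses_of_bank("open an account at"))  # ['financial institution']
print(senses_of_bank("enter"))               # ['building']
```

A sense is ruled out simply because no relation connects it to the predicate, which mirrors the argument above: there is no "sit on" relation for a financial institution.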
Pragmatics: “Fire!” – as an exclamation, its meaning depends on the situation:
- Calling “Fire!” in a building means everybody should leave as quickly as possible.
- Calling “Fire!” as the commander of a squadron will cause rifles to be fired.
In document understanding, the context is of course derived from the document type and the rest of the text.
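As a rough illustration, the same utterance can be looked up together with its context, here represented by a document type. The document type labels and the `interpret` function are assumptions made for this sketch; in practice the context would be inferred from the document and the surrounding text rather than passed in by hand.

```python
# Hypothetical table: (utterance, document type) -> interpretation.
INTERPRETATIONS = {
    ("fire!", "building safety notice"): "leave the building",
    ("fire!", "military drill report"): "discharge the rifles",
}


def interpret(utterance, document_type):
    """Interpret an utterance in the context of a document type."""
    key = (utterance.strip().lower(), document_type)
    # Without a known context the utterance stays ambiguous.
    return INTERPRETATIONS.get(key, "ambiguous")


print(interpret("Fire!", "building safety notice"))  # leave the building
print(interpret("Fire!", "military drill report"))   # discharge the rifles
print(interpret("Fire!", "unknown"))                 # ambiguous
```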
Disambiguation is only a small part of what can be achieved with semantic analysis. Understanding text also comprises document classification, entity extraction, relation analysis, extraction of structured data and events, and sentiment analysis. All of this is now within reach and will be the topic of follow-up articles, where we will show applications of these new technologies to practical problems.