<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Erkennung | Skilja</title>
	<atom:link href="https://skilja.com/de/kategorie/erkennung/feed/" rel="self" type="application/rss+xml" />
	<link>https://skilja.com/de/</link>
	<description>Document Understanding – Deep Learning</description>
	<lastBuildDate>Fri, 16 Jan 2026 16:29:27 +0000</lastBuildDate>
	<language>de</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://skilja.com/wp-content/uploads/2021/06/cropped-skilja_logo_transparent_02-32x32.png</url>
	<title>Erkennung | Skilja</title>
	<link>https://skilja.com/de/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>The End of OCR: How Word Recognition Changes Everything</title>
		<link>https://skilja.com/de/das-ende-der-ocr-wie-worterkennung-alles-veraendert/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Thu, 30 Oct 2025 16:26:57 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Grundlagen]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/das-ende-der-ocr-wie-worterkennung-alles-veraendert/</guid>

					<description><![CDATA[Modern systems no longer read characters – at least not individually. They read words, phrases, and even meanings. OCR has evolved into something fundamentally different, and this shift brings a level of accuracy, fluidity, and naturalness that earlier generations could not achieve.
This is exactly what Skilja has achieved with Lesa, our transformer-based deep-learning system, designed to read the way humans do: holistically, contextually, and intelligently.]]></description>
										<content:encoded><![CDATA[<p style="font-weight: 400;">For more than half a century, OCR – Optical Character Recognition – has stood for one thing: machines deciphering individual characters. From the early templates of the 1960s to the statistical engines of the 2000s, OCR has always treated reading as a mechanical process. It segmented text into character-shaped fragments and tried to work out their meaning letter by letter. But that era is now coming to an end. Modern systems no longer read characters at all – at least not individually. They read words, phrases, and even meanings. OCR has evolved into something fundamentally different, and this shift brings a level of accuracy, fluidity, and naturalness that earlier generations could not achieve.</p>
<p style="font-weight: 400;">This is what Skilja has created with <strong>Lesa</strong>, our transformer-based deep-learning system, designed to read the way humans do: holistically, contextually, and intelligently.</p>
<h4>A Brief Look Back: From Characters to Context</h4>
<p style="font-weight: 400;">Traditional OCR began as an analog process (hence the O for Optical) and went through several phases:</p>
<ul>
<li><strong>1950s–1980s:</strong> Rigid templates – effective only on pristine typewritten pages.</li>
<li><strong>1990s:</strong> Feature extraction and early machine learning – better, but still fragile.</li>
<li><strong>2000s–2010s:</strong> Statistical modeling and improved analysis workflows – good enough for books, printed forms, and constrained handwriting, but always bound to characters.</li>
</ul>
<p style="font-weight: 400;">Even at its best, classical OCR remained a guessing game. It confused the digit 1 with the letter l, turned smudges into glyphs, and struggled with anything outside its narrow expectations – above all, handwriting.</p>
<p style="font-weight: 400;">The problem was not a lack of better algorithms – it was characters themselves. Humans read words, not characters, as anyone who has watched a child learn to read can confirm.</p>
<p><img fetchpriority="high" decoding="async" class="alignnone size-large wp-image-2523" src="https://skilja.com/wp-content/uploads/2026/01/Lesa-Full-Text-Recognition-1-1024x573.png" alt="" width="1024" height="573" srcset="https://skilja.com/wp-content/uploads/2026/01/Lesa-Full-Text-Recognition-1-980x548.png 980w, https://skilja.com/wp-content/uploads/2026/01/Lesa-Full-Text-Recognition-1-480x268.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1024px, 100vw" /></p>
<h4>The Deep-Learning Shift: From Decoding Shapes to Understanding Language</h4>
<p style="font-weight: 400;">Transformers changed everything. Instead of interpreting characters, transformer-based models interpret <strong>sequences, context, and linguistic probability</strong>. They see text not as isolated shapes but as parts of sentences, paragraphs, and concepts.</p>
<p style="font-weight: 400;">This allows Lesa to:</p>
<ul>
<li>recognize whole <strong>words</strong>, not just letters,</li>
<li>use the surrounding text as context,</li>
<li>maintain coherence across entire pages,</li>
<li>and adapt to different visual styles.</li>
</ul>
<p style="font-weight: 400;">Reading becomes a task of <strong>language understanding</strong> rather than pixel decoding.</p>
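<p style="font-weight: 400;">The difference between character-level guessing and word-level rescoring can be pictured with a small sketch. Lesa's internals are proprietary; the confusion sets, the lexicon, and the scores below are invented purely for illustration:</p>

```python
# Toy sketch: resolve character-level ambiguity ("1" vs "l", "0" vs "O")
# by scoring whole-word candidates against a lexicon with word frequencies.
# All data here is made up for illustration; this is not Lesa's model.

from itertools import product

# Character-level OCR often cannot distinguish these glyph pairs.
CONFUSABLE = {"1": "1l", "l": "l1", "0": "0O", "O": "O0"}

# Tiny "language model": relative word frequencies (illustrative values).
LEXICON = {"Invoice": 0.5, "No": 0.3, "1001": 0.2, "lOOl": 0.0}

def candidates(raw: str):
    """Expand a raw character string into all confusable spellings."""
    pools = [CONFUSABLE.get(ch, ch) for ch in raw]
    return ["".join(p) for p in product(*pools)]

def read_word(raw: str) -> str:
    """Pick the candidate the 'language model' finds most probable."""
    return max(candidates(raw), key=lambda w: LEXICON.get(w, 0.0))

# A character-level engine might emit "lOOl"; word-level rescoring
# recovers the intended "1001".
```

<p style="font-weight: 400;">A real word-level system replaces the toy lexicon with a learned model of linguistic probability, but the shift in the question being asked is the same.</p>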
<h4>Lesa: Built for Word-Level Intelligence</h4>
<p style="font-weight: 400;">Lesa was built from the ground up to treat a document as a linguistic entity. Using a transformer architecture trained on diverse texts and real-world images, Lesa does not ask “What character is this?” but rather “What does this <em>word</em> mean – and how does it fit into the sentence?” This matters because we now:</p>
<ul>
<li><strong>Make far fewer errors:</strong> No more error cascades triggered by single characters.</li>
<li><strong>Produce natural output:</strong> Clean spacing, correct punctuation, coherent text.</li>
<li><strong>Stay robust to fonts and layouts:</strong> Works on formatted text, messy receipts, tables, and multi-column pages.</li>
</ul>
<p style="font-weight: 400;">One of the most striking results of word-level understanding is that cursive handwriting – as found in forms, notes, medical records, delivery slips, and corporate records – now works <em>far better</em>. Messy block letters, hand-print squeezed into boxes, and half-printed forms are suddenly quite readable. Handwriting recognition is finally within reach and is being applied routinely.</p>
<p><img decoding="async" class="alignnone size-large wp-image-2521" src="https://skilja.com/wp-content/uploads/2026/01/Lesa-Handprint-Recognition-1024x251.png" alt="" width="1024" height="251" srcset="https://skilja.com/wp-content/uploads/2026/01/Lesa-Handprint-Recognition-980x240.png 980w, https://skilja.com/wp-content/uploads/2026/01/Lesa-Handprint-Recognition-480x118.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1024px, 100vw" /></p>
<p style="font-weight: 400;">Calling this “the end of OCR” is not an exaggeration. It is a technical acknowledgment of what has changed.</p>
<p style="font-weight: 400;"><strong>Traditional OCR = character recognition. Modern OCR = language understanding.</strong></p>
<p style="font-weight: 400;">Lesa belongs to a new generation of systems that read documents the way humans do – by interpreting words rather than decoding symbols.</p>
<p style="font-weight: 400;">OCR as we knew it is over. Something better has replaced it. <strong>Lesa</strong> stands for this new era.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Auto Classification and Bias</title>
		<link>https://skilja.com/de/auto-classification-and-bias/</link>
		
		<dc:creator><![CDATA[skiljaadmin]]></dc:creator>
		<pubDate>Tue, 22 Jul 2025 12:01:16 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Grundlagen]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<guid isPermaLink="false">https://skilja.com/auto-classification-and-bias/</guid>

					<description><![CDATA[Personal bias and individual opinions are a big issue in standardized business processing if they influence the outcome of a process and the decisions made. Nobody wants to be subject to random changes in the outcome of a personal request – and yet it happens, because humans are biased in how they see facts, based on their education, their cultural background, and even the mood they happen to be in at a given time of the week. So in addition to different people making different decisions, you can even expect the same person to make different decisions over the course of a week. You simply look at a task differently on Monday morning than on Friday evening. The reason is so-called priming, which affects all of us day by day through our experience, knowledge, physical condition, context, and many other small factors.]]></description>
										<content:encoded><![CDATA[
<p>Personal bias and individual opinions are a big issue in standardized business processing if they influence the outcome of a process and the decisions made. Nobody wants to be subject to random changes in the outcome of a personal request – and yet it happens, because humans are biased in how they see facts, based on their education, their cultural background, and even the mood they happen to be in at a given time of the week. So in addition to different people making different decisions, you can even expect the same person to make different decisions over the course of a week. You simply look at a task differently on Monday morning than on Friday evening. The reason is so-called priming, which affects all of us day by day through our experience, knowledge, physical condition, context, and many other small factors.</p>



<p>In a recent article on the <a href="http://www.skilja.de/2015/the-meaning-of-words/">meaning of words</a> we showed how the sound of words influences our perception. There are many more linguistic associations that influence the way we think and behave, and they introduce bias. If, for example, I tell you that I am driving north across hilly terrain, would you expect the trip to be mostly uphill or downhill? In fact, most people associate moving north with going uphill and moving south with going downhill. An interesting study by the psychologists <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=963159">Leif D. Nelson from UC San Diego and Joseph Simmons from Yale</a> shows that these associations can actually be measured and produce some strange biases: people think it will take longer to travel north than south, that it will cost more to ship to a northern than to a southern location, and that a moving company will charge more for a northward move than for a southward one. A <a href="http://spp.sagepub.com/content/2/5/547">similar study</a> concluded that people assume property is more valuable when it sits in the northern part of town. Of course, these opinions stem from the decision of the ancient Greeks to draw the map of the world with north at the top. But it also shows clearly how much we are biased by our language – and north/south is only one of the many linguistic associations we are exposed to.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/06/compass-152121_1280.png"><img decoding="async" class="wp-image-946" src="http://www.skilja.de/wp-content/uploads/2015/06/compass-152121_1280.png" alt="North-South Compass" /></a></figure>



<p>Ancient mapmakers introduced north and south unwittingly, but lawyers do have an intention when they describe car accidents. While the defense might call a car accident “contact”, the plaintiff might say one car “smashed” into the other. Elizabeth Loftus and John Palmer showed in a classic experiment that these labels really matter. They had a group of students watch the same series of traffic accidents and then asked them to estimate the speed of the cars when the accident occurred. When the scene was described as the cars having “contacted” one another, the students’ average speed estimate was thirty-two miles an hour, whereas it was forty miles an hour when they were told the cars “smashed” into one another. In another experiment, 14% of participants incorrectly remembered seeing shattered glass when told that the cars “hit” one another, whereas 32% of participants made the same error when told the cars “smashed” into one another. Even a single word can change how people remember an event they witnessed only minutes earlier – making it very clear how priming can bias our decisions.</p>



<p>This brings us back to auto-classification. A classifier like the Skilja Content Classifier is trained with representative samples collected by a group of people. Applied to a batch of documents, it will then make the same decision over and over again. It represents – through machine learning – the average opinion about the content of a document and will repeat it without tiring. The same on Monday morning and Friday evening – at a speed of several hundred thousand pages per hour. It will make errors – based on statistics – but no more than a human. And the errors are reproducible and can be corrected if necessary. If you have ever thought about compliance, this is a good example, because compliance does not say that you cannot make errors. It says that you need a reproducible, documented procedure for how you store and treat your documents. Auto-classification can help to achieve this goal. It is a great tool for boosting productivity. But it is even more helpful for avoiding bias, irreproducible results, and non-compliance.</p>
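<p>The repeatability argument can be made concrete with a toy example: once a model is fixed, the same document always yields the same decision, which is exactly the auditable behavior compliance asks for. The keyword-weight model below is an illustrative stand-in, not the Skilja Content Classifier:</p>

```python
# Toy deterministic classifier: given a fixed trained model (here, just
# keyword weights), the decision for a document never varies between runs.
# Categories and weights are invented for illustration.

MODEL = {
    "invoice":   {"amount": 2.0, "due": 1.5, "invoice": 3.0},
    "complaint": {"refund": 2.5, "broken": 2.0, "disappointed": 1.5},
}

def classify(text: str) -> str:
    words = text.lower().split()
    scores = {
        label: sum(w for term, w in weights.items() if term in words)
        for label, weights in MODEL.items()
    }
    return max(scores, key=scores.get)

doc = "Please refund me, the device arrived broken"
# Monday morning or Friday evening: the answer is identical.
monday = classify(doc)
friday = classify(doc)
```

<p>A statistical classifier in production is of course far richer than this, but the property being illustrated – identical input, identical output, every time – is what makes its errors reproducible and correctable.</p>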
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How Meanings of Words Change</title>
		<link>https://skilja.com/de/how-meanings-of-words-change/</link>
		
		<dc:creator><![CDATA[skiljaadmin]]></dc:creator>
		<pubDate>Fri, 02 Dec 2022 11:57:32 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Grundlagen]]></category>
		<guid isPermaLink="false">https://skilja.com/how-meanings-of-words-change/</guid>

					<description><![CDATA[We all know that our language is fluid and words can change their meaning over time. Words become extinct and new words are created, but more often existing words are adapted to new circumstances. It is interesting to see how this happens over the course of years, but sometimes words change their meaning overnight. In [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>We all know that our language is fluid and words can change their meaning over time. Words become extinct and new words are created, but more often existing words are adapted to new circumstances. It is interesting to see how this happens over the course of years, but sometimes words change their meaning overnight.</p>



<p>In a study on “Statistically Significant Detection of Linguistic Change”, published last year (available online from&nbsp;<a href="http://arxiv.org/abs/1411.3315" target="_blank" rel="noreferrer noopener">arxiv.org</a>), the researchers Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena used data mining to find out how the way we use words reveals the linguistic earthquakes that constantly reshape our language. Their findings are very interesting for anybody who works professionally with text analytics, as they reveal a lot about how semantics in our language work. Kulkarni et al. tracked these linguistic changes by mining the corpora of words stored in sources such as Google Books, movie reviews from Amazon, and of course Twitter.</p>



<p>In pre-internet times the usage and meaning of words changed relatively slowly. This can be seen in the metamorphosis of the word “gay” from its social meaning in the fifties to the purely sexual-orientation meaning of our time, nicely displayed in the word-cloud view below:</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/12/Gay-Meaning.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2015/12/Gay-Meaning.png" alt="Showing the meaning of the word Gay over the century" class="wp-image-1034"/></a></figure>



<p>A faster change happened to the word “mouse” in the 1970s, when it gained the new meaning of “computer input device”; later, the word “windows” came to be used internationally as the name of the Microsoft operating system within just a few years.</p>



<p>Today the meaning of a word can change almost instantly. Before October 2012, the word “sandy” was an adjective meaning “covered in or consisting mostly of sand”. Then Hurricane “Sandy” approached. Almost overnight, this word gained an additional meaning as a proper noun for one of the costliest storms in US history.</p>



<p>Now this might not sound like a big deal to us – but just imagine the insurance industry and the thousands of e-mails they suddenly receive referring to damage caused by Sandy! If they are using a static or rule-based classification system, they might easily miss these.</p>



<p>This is a big challenge for automatic classification systems that have been trained by a machine-learning algorithm on a specific set of documents containing words with a specific meaning. When these meanings suddenly change, or the usage of words suddenly widens, the classifier will make wrong decisions. The only way to cope with this problem in a living, productive system is continuous learning. The system must learn from user corrections – supervised learning – but also from good classification results that contain new aspects and features. This is called unsupervised learning, and it is important because the number of documents available for training is much higher than the number of manual corrections. Through unsupervised learning – or better, enhancement, which by the way also includes forgetting – the classification system will be able to cope with changed meanings over time. Abrupt changes like the one mentioned above will lead to a drop in classification rate, from which the classifier will recover within a few days – like humans, who also need a short time to adapt.</p>
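<p>One simple way to picture “learning with forgetting” is an exponentially decayed term profile per class: each new document is added at full weight while old evidence fades, so a class can absorb the new sense of a word like “sandy” within days. The decay scheme below is a generic sketch, not the actual Skilja learning algorithm:</p>

```python
# Sketch of unsupervised enhancement with forgetting: per-class term
# weights decay by a constant factor on every update, so recent usage
# dominates. The decay factor and the sample data are illustrative.

from collections import defaultdict

class DecayingProfile:
    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.weights = defaultdict(float)

    def observe(self, words):
        """Fade old evidence, then add the new document at full weight."""
        for term in self.weights:
            self.weights[term] *= self.decay
        for term in words:
            self.weights[term] += 1.0

storm_class = DecayingProfile()
# Before October 2012: the class vocabulary has no "sandy".
storm_class.observe(["hurricane", "wind", "flood"])
# After the storm: documents mentioning "sandy" stream in for days.
for _ in range(5):
    storm_class.observe(["sandy", "hurricane", "damage"])
# "sandy" now outweighs terms that have stopped appearing.
```

<p>The same mechanism also implements forgetting: a term that stops appearing decays toward zero, just as the post-2012 sense of “sandy” would fade again if usage reverted.</p>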
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>On the Benefits of Page Classification</title>
		<link>https://skilja.com/de/on-the-benefits-of-page-classification/</link>
		
		<dc:creator><![CDATA[skiljaadmin]]></dc:creator>
		<pubDate>Thu, 29 Sep 2022 14:37:47 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/on-the-benefits-of-page-classification/</guid>

					<description><![CDATA[Classification deals with the categorization of objects. In our process automation and digitization world, we often think of the objects as&#160;complete documents&#160;that need to be classified. Of course, it is important to understand what the type of a document is and automatic classification can determine exactly this. But documents in a business context normally are [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Classification deals with the categorization of objects. In our process automation and digitization world, we often think of the objects as&nbsp;complete documents&nbsp;that need to be classified. Of course, it is important to understand what type a document is, and automatic classification can determine exactly this. But documents in a business context are normally complex and not homogeneous. When you receive a multipage document, you will typically browse through it to&nbsp;<strong>see what is in it</strong>&nbsp;and understand&nbsp;<strong>what it is about</strong>. A document in an envelope or a manila folder that lands on your desk may consist of a covering letter, some notes, then the really important document – for example a court order – and perhaps some attached standard forms. To understand which process to initiate and what to do with the document, you will therefore look at the pages and work out from their content what it is all about. Sometimes two or more processes even originate from different pages within one document: you might need to answer a request from one page and execute a payment from another.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2017/09/Screenshot-2017-09-29-um-17.06.11.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2017/09/Screenshot-2017-09-29-um-17.06.11-1024x547.png" alt="Page Classification Car Insurance Claim - Laera Classifier" class="wp-image-1217"/></a></figure>



<p>Laera Classifier – Page Classification for Claims Processing</p>



<p>This is exactly what page classification in document understanding can provide automatically. Instead of looking at the document as a whole, the algorithm classifies page by page and derives decisions from the results. This is much more granular than considering only the complete document. And it is different from automatic document separation, which physically splits the document. Of course, separation is another option based on page results, but it is error-prone and risky, as the document may be ripped apart incorrectly. Often this is not necessary at all; it is sufficient to structure and digitize the document page by page to achieve the intended process goals.</p>



<p>Page classification requires a solid infrastructure and an understanding of physical documents. We provide this with the Laera Classification Framework, which inherently understands structured documents. Going even further would be paragraph and sentence classification, but that will be a topic for another article. In Laera you can simply define a page classification scheme alongside the document classification. And you can even use the page classification results to determine the document type (e.g. by majority rule or by priority rule).</p>
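<p>The majority and priority rules mentioned above are easy to sketch: classify each page, then let the most frequent page class – or the highest-priority class present – decide the document type. The class names and the priority list below are illustrative; the actual rules in Laera are configurable:</p>

```python
# Sketch: deriving a document type from per-page classification results,
# either by majority vote or by a priority list. Page labels are invented.

from collections import Counter

def doc_type_by_majority(page_labels):
    """Most frequent page class wins."""
    return Counter(page_labels).most_common(1)[0][0]

def doc_type_by_priority(page_labels, priority):
    """First class in the priority list that occurs on any page wins."""
    present = set(page_labels)
    for label in priority:
        if label in present:
            return label
    return "unknown"

pages = ["cover_letter", "court_order", "form", "form", "form"]
# Majority says "form", but a priority rule can promote the one page
# that actually drives the process.
majority = doc_type_by_majority(pages)
important = doc_type_by_priority(pages, ["court_order", "cover_letter", "form"])
```

<p>The priority variant reflects the court-order example above: one decisive page can matter more than the bulk of the pages.</p>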



<p>An example of a real-life project that has been in production for more than a year is shown above.</p>



<p>In this case the customer receives thousands of car insurance claims per day. These are 10- to 50-page documents that contain many different kinds of pages, for example:</p>



<ul class="wp-block-list"><li>Covering letter or e-mail (“Anschreiben”)</li><li>Attorney’s letter</li><li>Expertise (“Gutachten”)</li><li>Calculation of repair (“Kalkulation”)</li><li>Declaration of Assignment (“Abtretungserklärung”)</li><li>Photos</li></ul>



<p>Laera Classifier is able to automatically determine all of these types at a rate in the high nineties (percent). Photo detection tags all photos and hides them from the subsequent recognition steps, as they would otherwise needlessly slow down OCR and extraction. The page classification results make it possible to structure and reorder the document in an optimal way for the subsequent extraction of data from the different page types. Being able to define specific extraction for each page type leads to a significant increase in extraction quality and speed. It also greatly eases the task of the clerks in the subsequent process steps, as they receive an already structured document (in this case an assembled PDF) with tags, always in the same order.</p>
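<p>The routing just described – hiding photo pages from OCR and dispatching each remaining page to a type-specific extraction step – can be pictured as a simple dispatch table. The page types come from the example above; the extractor functions are illustrative stubs, not Laera’s API:</p>

```python
# Sketch: skip photo pages before OCR and route each page type to its
# own extraction function. The extractors here are placeholder stubs.

def extract_cover_letter(page):   return {"sender": "..."}
def extract_expertise(page):      return {"damage_total": "..."}
def extract_calculation(page):    return {"repair_cost": "..."}

EXTRACTORS = {
    "Anschreiben": extract_cover_letter,
    "Gutachten":   extract_expertise,
    "Kalkulation": extract_calculation,
}

def process(classified_pages):
    """classified_pages: list of (page_type, page_data) tuples."""
    results, skipped = [], 0
    for page_type, page in classified_pages:
        if page_type == "Photo":
            skipped += 1          # tag and hide from OCR/extraction
            continue
        extractor = EXTRACTORS.get(page_type)
        if extractor:
            results.append((page_type, extractor(page)))
    return results, skipped

claim = [("Anschreiben", "p1"), ("Photo", "p2"), ("Photo", "p3"),
         ("Gutachten", "p4"), ("Kalkulation", "p5")]
extracted, photos_hidden = process(claim)
```

<p>Because each extractor only ever sees its own page type, it can be tuned narrowly – which is where the gains in extraction quality and speed come from.</p>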



<p>In this way, page classification plays an important role in streamlining the process, coming a bit closer to how a person would look at the document and work from it.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Document Separation Revisited</title>
		<link>https://skilja.com/de/document-separation-revisited/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Thu, 08 Sep 2022 09:50:54 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/document-separation-revisited/</guid>

					<description><![CDATA[One of the frequently overlooked and genuinely difficult problems in document automation, and a real annoyance in daily processing, is the automatic separation of a stack of documents into single meaningful documents and their assignment to a document class. The goal is to simply scan the whole stack and have it separated by an intelligent algorithm. Fortunately this is readily available today from the Skilja technology stack as a built-in feature of the Laera classifier. That does not mean it is easy. It requires considerable experience and infrastructure to manage several interdependent steps of classification and separation in a stable and reliable way. This is what Laera provides out of the box.]]></description>
										<content:encoded><![CDATA[<header class="entry-header"></header>
<div class="entry-content">
<p>One of the frequently overlooked and genuinely difficult problems in document automation, and a real annoyance in daily processing, is the automatic separation of a stack of documents into single meaningful documents and their assignment to a document class. In traditional scanning processes this is often achieved by manually preparing the paper and sticking a barcode onto each first page as a document separator. But this is labor-intensive and error-prone. In addition, as we go more and more digital, even with paper-based processes the processing facility normally no longer has access to the paper. So the goal is to simply scan the whole stack and have it separated by an intelligent algorithm.</p>
<p>Fortunately this is readily available today, for example from the Skilja technology stack, as a built-in feature of the Laera classifier. That does not mean it is easy. It requires considerable experience and infrastructure to manage several interdependent steps of classification and separation in a stable and reliable way. This is what Laera provides out of the box.</p>
<p>How does document structuring work in principle? In exactly the same way (our credo!) a human would do it: go through the stack page by page, determine what type each page is, and decide whether it belongs to the previous page or starts a new topic or form. Then check page numbers, where present, for confirmation. If in doubt, go back one or a few pages to double-check, and then make your decision to separate.</p>
<figure class="wp-block-image size-large">
<p><div id="attachment_2117" style="width: 1034px" class="wp-caption alignnone"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2117" class="wp-image-2117" src="https://skilja.com/wp-content/uploads/Document-Separation-1024x479.png" sizes="(max-width: 1024px) 100vw, 1024px" srcset="https://skilja.com/wp-content/uploads/Document-Separation-1024x479.png 1024w, https://skilja.com/wp-content/uploads/Document-Separation-300x140.png 300w, https://skilja.com/wp-content/uploads/Document-Separation-768x359.png 768w, https://skilja.com/wp-content/uploads/Document-Separation-500x234.png 500w, https://skilja.com/wp-content/uploads/Document-Separation.png 1229w" alt="Laera Document Separation" width="1024" height="479" /><p id="caption-attachment-2117" class="wp-caption-text">Laera Document Separation</p></div></figure>
<p>In an AI classifier such as Laera, this is built into a sequence of algorithms. The system is trained on a sample set that is already correctly separated. Laera learns for each page whether it is a first, middle, end, or single page. The user does not have to specify this explicitly, as the Laera AI works it out automatically from the samples and hides this complexity from the user. The training interface just requires you to drop the individual documents into the training set. It is not necessary to specify an exact number of pages (or a range) for each document type; Laera automatically takes into account that page counts can vary per document type. However, if you know them, you can also restrict the allowed pages – for example for forms that are always a single page.</p>
<p>Laera then learns the structure and applies it at runtime to a whole stack of unseparated single pages. Each page is analyzed. A second classifier (we could call it a “meta-classifier”) then takes these results and finds the most probable separation based on the trained model. So even if a first page has not been identified as such, there is a chance that the meta-classifier will still consider it more probably a first page and separate correctly. A third classifier then determines the document type for each separated document. As usual with Laera, all this is very fast: separating a stack of 200 pages with 150 document types takes less than 30 seconds in total.</p>
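<p>The interplay of the page classifier and a meta-classifier can be pictured as a search for the most probable segmentation: given each page’s probability of being a first page, choose the set of split points that maximizes the joint likelihood under a simple document-length prior. This dynamic-programming sketch, with invented scores and an assumed maximum length, is a generic illustration, not Laera’s actual algorithm:</p>

```python
# Sketch of a "meta-classifier": pick document boundaries that maximize
# the joint probability of per-page first-page scores, under a simple
# prior that documents are at most MAX_LEN pages. Scores are invented.

import math

MAX_LEN = 4  # assumed maximum document length (illustrative prior)

def best_separation(p_first):
    """p_first[i] = classifier's probability that page i starts a document.
    Returns the sorted list of page indices where new documents start."""
    n = len(p_first)
    # log-score of declaring page i a boundary vs. a continuation
    bound = [math.log(max(p, 1e-9)) for p in p_first]
    cont = [math.log(max(1 - p, 1e-9)) for p in p_first]
    best = [0.0] * (n + 1)   # best[i] = best score for pages[:i]
    back = [0] * (n + 1)     # back[i] = start page of the last document
    for i in range(1, n + 1):
        cands = []
        for start in range(max(0, i - MAX_LEN), i):
            score = best[start] + bound[start] + sum(cont[start + 1:i])
            cands.append((score, start))
        best[i], back[i] = max(cands)
    cuts, i = [], n
    while i > 0:             # walk back through the chosen boundaries
        cuts.append(back[i])
        i = back[i]
    return sorted(cuts)

# Page 4 got only a weak first-page score (0.45), but the length prior
# makes a boundary there more probable than one 7-page document.
scores = [0.95, 0.05, 0.1, 0.05, 0.45, 0.1, 0.05]
cuts = best_separation(scores)   # documents start at pages 0 and 4
```

<p>This is the “rescue” effect described above: a page the first classifier was unsure about is still recognized as a boundary once the whole sequence is considered.</p>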
<p>The example below shows the results of separating a mortgage application stack of 153 pages and classifying it into 244 document types. The horizontal lines indicate the detected separators, and the “New page” column shows the new numbering of pages in the separated documents.</p>
<figure class="wp-block-image size-full">
<p><div id="attachment_2107" style="width: 1740px" class="wp-caption alignnone"><a href="https://skilja.com/wp-content/uploads/Laera-Separation-Result.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2107" class="wp-image-2107" src="https://skilja.com/wp-content/uploads/Laera-Separation-Result.png" sizes="(max-width: 1730px) 100vw, 1730px" srcset="https://skilja.com/wp-content/uploads/Laera-Separation-Result.png 1730w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-300x164.png 300w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-1024x561.png 1024w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-768x420.png 768w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-1536x841.png 1536w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-500x274.png 500w" alt="Laera Mortgage Separation Result" width="1730" height="947" /></a><p id="caption-attachment-2107" class="wp-caption-text">Laera Mortgage Separation Result (click on image to see full screen)</p></div></figure>
<p>The detail view of the separation result for one page nicely shows how the separation algorithm came to a decision for the first page of a URLA, supported in addition by the page count detected on the page (“Page 1 of 9”).</p>
<figure class="wp-block-image size-full">
<p><div id="attachment_2108" style="width: 640px" class="wp-caption alignnone"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2108" class="wp-image-2108" src="https://skilja.com/wp-content/uploads/Laera-Separation-Details.png" sizes="(max-width: 630px) 100vw, 630px" srcset="https://skilja.com/wp-content/uploads/Laera-Separation-Details.png 630w, https://skilja.com/wp-content/uploads/Laera-Separation-Details-300x137.png 300w, https://skilja.com/wp-content/uploads/Laera-Separation-Details-500x229.png 500w" alt="Laera Mortgage Separation Details" width="630" height="288" /><p id="caption-attachment-2108" class="wp-caption-text">Laera Mortgage Separation Details</p></div></figure>
<p>Training of this model takes about 10 minutes, so it is easy to test and refine it frequently. <strong>All this can be done by the end user and does not need an AI engineer.</strong></p>
<p>Quality is very important, and Laera is deliberately biased towards precision so that no errors are made, allowing the workflow to present unconfident separations to a user for decision. In a project done 18 months ago for a large Swiss insurance company, Laera achieved an <strong>automation rate of 87% with an error rate (false positives) of 0.14%</strong>. Of course each separation result still needs to be checked, and the corrections are fed back into Laera online learning to improve the model.</p>
<p>Overall, the reduction of work in separation and the increase in quality are very measurable and yield huge benefits. All this is available either on premise or as a cloud service, to be used through RPA or a RESTful API in any backend. Let us know if you are interested and we can show you a demo; a setup with your own documents is also achievable with little effort. Contact us at info (at) skilja.com.</p>
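<p>To make the workflow behavior concrete, here is a minimal sketch of how a consumer might act on page-level separation scores; the threshold, field names and routing logic are illustrative assumptions, not Laera's actual API:</p>

```python
# Sketch: turn per-page "starts a new document" scores into document
# boundaries, flagging uncertain boundaries for manual review.
# Threshold and data layout are illustrative assumptions.

def split_into_documents(page_scores, threshold=0.9):
    """page_scores: probability that each page starts a new document.
    Returns (documents, needs_review), where documents is a list of
    page-index lists and needs_review flags uncertain boundaries."""
    documents, needs_review = [], False
    for i, score in enumerate(page_scores):
        if i == 0 or score >= 0.5:
            documents.append([i])          # page starts a new document
        else:
            documents[-1].append(i)        # page continues the current one
        # A score that is neither clearly boundary nor clearly continuation
        # would be routed to a user for decision in the workflow.
        if i > 0 and threshold > score > 1 - threshold:
            needs_review = True
    return documents, needs_review

docs, review = split_into_documents([0.99, 0.02, 0.03, 0.97, 0.4])
# docs == [[0, 1, 2], [3, 4]]; the 0.4 score is uncertain, so review is True
```

<p>Pages whose scores fall into the uncertain band are exactly the separations a reviewer would be asked to confirm.</p>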
</div>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Confusion Matrix</title>
		<link>https://skilja.com/de/confusion-matrix/</link>
		
		<dc:creator><![CDATA[skiljaadmin]]></dc:creator>
		<pubDate>Wed, 10 Aug 2022 11:01:43 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/confusion-matrix/</guid>

					<description><![CDATA[Understanding the quality of an automatic classification system is crucial for its acceptance and any attempt to improve it over time. Quality means that we need to look at errors and at the recognition rate. In classification terms these values are called&#160;precision&#160;and&#160;recall. Precision gives the percentage of documents that have been classified correctly with respect [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Understanding the quality of an automatic classification system is crucial for its acceptance and any attempt to improve it over time. Quality means that we need to look at errors and at the recognition rate. In classification terms these values are called&nbsp;precision&nbsp;and&nbsp;recall. Precision gives the percentage of documents that have been classified correctly with respect to all documents assigned by the classifier (a/(a+b)), while recall is the number of documents correctly classified into a class with respect to the total number of documents that should be in this class (a/(a+c)). In a previous post&nbsp;<a href="http://www.skilja.com/2012/measuring-classification-quality/">(Measuring Classification Quality)</a>&nbsp;we have already discussed these measures and why they are important. It is easy to depict them in a graphical visualization:</p>
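<p>In code, the two measures defined above reduce to a few lines, with a, b and c as defined in the text:</p>

```python
# a = documents correctly assigned to the class,
# b = documents wrongly imported into the class (false positives),
# c = documents belonging to the class but assigned elsewhere (lost).

def precision(a, b):
    return a / (a + b)

def recall(a, c):
    return a / (a + c)

p = precision(90, 10)  # 0.9
r = recall(90, 30)     # 0.75
```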



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/08/Precision-Recall-1.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2015/08/Precision-Recall-1.png" alt="" class="wp-image-970"/></a></figure>



<p>While these values might appear a little abstract, their advantage is that they are independent of the size of the set. It can be more intuitive, however, to talk about the actual number of documents that are&nbsp;imported&nbsp;into a class from other classes (set b) or&nbsp;exported&nbsp;and lost from the class (set c). This makes it obvious that recall and precision are related and have the same value if no threshold is applied, as every document that is imported into one class must have been lost from another. It also makes it easy to spot particular problem classes with a lot of imports (attractors) or exports (donors).</p>



<p>For a classification system these values can be depicted in a so-called confusion matrix (also known as a&nbsp;<a href="https://en.wikipedia.org/wiki/Contingency_table">contingency table</a>&nbsp;or an error matrix) that shows all relations between classes at a glance.</p>



<p>Our classification designer in the Skilja Content Classification system has a built-in visualization that lets you easily see the migration of documents into other classes. As an example we have used the popular Reuters newswire test set and arranged the classes in 7 hierarchical groups. If you run a 90:10 split benchmark on all 5,917 documents (which fortunately only takes a few seconds because the SCC is so fast) the confusion matrix obtained for the 51 classes looks as follows:</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/08/Confusion-Matrix-full.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2015/08/Confusion-Matrix-full.png" alt="" class="wp-image-972"/></a></figure>



<p>Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. Of course the user interface allows you to zoom in to look at the details.</p>
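<p>The row/column convention just described can be sketched in a few lines of Python; this is an illustrative toy, not the SCC implementation:</p>

```python
from collections import Counter

# Rows are actual classes, columns are predicted classes,
# matching the convention described in the text.
def confusion_matrix(actual, predicted, classes):
    counts = Counter(zip(actual, predicted))
    return [[counts[(row, col)] for col in classes] for row in classes]

m = confusion_matrix(
    actual=["acq", "acq", "earn"],
    predicted=["acq", "earn", "earn"],
    classes=["acq", "earn"],
)
# m == [[1, 1], [0, 1]]: one "acq" document was exported to "earn"
```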



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/08/Confusion-Matrix-zoom.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2015/08/Confusion-Matrix-zoom.png" alt="" class="wp-image-974"/></a></figure>



<p>The correctly classified documents are summed up on the diagonal, the exports are on the upper right and the imports on the lower left. In our case you see quite a few exports from the class “acq”, which is news on acquisitions, into “earn”, which covers earnings. But this is to be expected, as these classes are close by topic, and a report on an acquisition often talks about the same topics (shares, revenue, board) as an earnings report. The user can now click on the box of the 57 exported documents, open them in a list and review them to improve classification if desired. Thus it becomes easy to drill down into the results and see exactly what can be improved. You will never achieve 100% precision, but remember that manual human classification also only achieves about 95% on average, as experiments have shown.</p>



<p>When the classes are organized in a hierarchy, the confusion matrix by Skilja also allows you to collapse the nodes and look at upper levels only. In this case the values of the hidden subclasses are summed up and shown for the parent class.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/08/Confusion-Matrix-hierarchic1.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2015/08/Confusion-Matrix-hierarchic1.png" alt="" class="wp-image-978"/></a></figure>



<p>The diagonal has two values now. For example, 4,320 of the finance documents have been correctly classified, but 175 have been exported/imported within the finance category. Often you are only interested in the migration between the main parent classes, while errors under one parent are less problematic.</p>



<p>Typically an organisation can assign a cost to each export and import. The cost can be different for each pair of classes where this happens. Migrations within a set of subclasses are often not very expensive if they relate, for example, to documents that are processed in the same department anyway. On the other hand, an import into a class that leads to an automatic payment can be very expensive. This can be mitigated by assigning different thresholds to such classes, which SCC allows; the confusion matrix lets you find out where these need to be applied. The matrix can also be exported, so you can apply your own cost matrix to the results to determine which improvements make sense. We are currently working with a real client to create a case study that shows these numbers in a real-world example at an insurance company. When available, this study will be published here. Stay tuned!</p>
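<p>Applying a cost matrix to an exported confusion matrix is a simple weighted sum over the off-diagonal cells; all numbers below are illustrative, not from a real project:</p>

```python
# Sketch: weight each misclassification count by a per-class-pair cost.
# The confusion and cost values are made up for illustration.

def total_cost(confusion, cost):
    return sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
        if i != j  # only misclassifications carry cost
    )

confusion = [[4320, 57], [12, 980]]
cost = [[0, 5.0], [50.0, 0]]  # imports into a "payment" class cost far more
c = total_cost(confusion, cost)  # 57*5.0 + 12*50.0 = 885.0
```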
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Process as a Service</title>
		<link>https://skilja.com/de/process-as-a-service-part-1/</link>
		
		<dc:creator><![CDATA[skiljaadmin]]></dc:creator>
		<pubDate>Wed, 05 May 2021 13:57:54 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Prozess]]></category>
		<category><![CDATA[Technologie]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Classification]]></category>
		<category><![CDATA[Extraction]]></category>
		<category><![CDATA[Process]]></category>
		<category><![CDATA[Process as service]]></category>
		<guid isPermaLink="false">https://skilja.com/process-as-a-service-part-1/</guid>

					<description><![CDATA[Imagine that you have created a powerful process for superb document automation using all kind of advanced recognition, image processing and AI technologies available. With these technologies it is possible to automate almost any document driven process that involves repetitive cognitive tasks like classification, indexing and decision making today. VINNA by Skilja is a powerful [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Imagine that you have created a powerful process for superb document automation using all kinds of advanced recognition, image processing and AI technologies. With these technologies it is possible today to automate almost any document-driven process that involves repetitive cognitive tasks like classification, indexing and decision making. VINNA by Skilja is a powerful platform that enables and orchestrates the LAERA components by Skilja to perform all these miracles. In addition, VINNA plugs in many other powerful tools from other technology companies, such as barcode recognition, office format conversion, e-mail normalization and PDF/A generation.</p>



<p>Now that this process is built and works well, the question arises how to integrate the new capabilities into your line-of-business applications and existing processes. File import and export is insecure and outdated, and full integration requires more effort from IT than might be available.</p>



<p>Fortunately there is a solution 🙂 : By plugging in an Event Driven Activity (EDA) you can enable ANY process to be accessible through a standard web service protocol. Simply by adding the EDA to an existing process you make it available to a RESTful service call. The EDA can be the only start point of a process but it can also be added in addition to existing starters like file importers, message queues or IMAP collectors that pump documents into the same process. </p>



<p>Typically you add one EDA starter and one or several EDA Reporters or Listeners. These are accessed through a simple web protocol by the EDA Consumer. The Consumer can either be a web page (as in the example below) or a Windows application like an RPA client that automates the cognitive task scheduling. Both are provided as sample source code with your installation. At any stage of the process the Consumer is optionally updated via events on the progress of the work item in the process. When processing is finished the results are retrieved either through an event or from a queue that is queried via REST.</p>
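<p>As a rough sketch of the consumer side, the round trip of submitting a work item and polling for results might look like this; the endpoint path and JSON fields are assumptions for illustration, not the actual EDA protocol:</p>

```python
import json

def build_submit_request(base_url, process_id, document_name):
    # Construct the submission call; in production this would be an
    # HTTP POST to the EDA starter (the path is a hypothetical example).
    url = f"{base_url}/eda/processes/{process_id}/workitems"
    body = json.dumps({"document": document_name})
    return url, body

def poll_result(fetch, work_item_id, max_tries=3):
    """fetch is any callable returning the work item's state dict;
    in production it would be an HTTP GET against an EDA Reporter."""
    for _ in range(max_tries):
        state = fetch(work_item_id)
        if state["status"] == "finished":
            return state["result"]
    return None

# Stubbed fetch standing in for the REST call:
responses = iter([{"status": "running"},
                  {"status": "finished", "result": "Invoice"}])
result = poll_result(lambda _id: next(responses), "wi-42")
# result == "Invoice"
```

<p>An event-driven Consumer would replace the polling loop with a callback, but the shielding effect is the same: it only ever sees the submit call and the result.</p>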



<figure class="wp-block-image size-large"><img decoding="async" src="https://skilja.com/wp-content/uploads/2020/05/Vinna-Process-as-a-Service-Schematically.png" alt="" class="wp-image-575"/><figcaption>VINNA Process as a Service schematically</figcaption></figure>



<p>The process can be deployed anywhere &#8211; there is zero setup. Simply use the URL of the service and, of course, the credentials for the secure and encrypted communication with the authentication service. The process can sit anywhere in the cloud and be used by any client worldwide that has access.</p>



<p><strong>Through EDA VINNA is used completely in slave mode and the consumer is shielded from any complexity of the process</strong>. </p>



<p>The process can contain any number of steps and routes, including manual correction, approvals or even sending tasks to the crowd. The Consumer will see none of this; it simply gets notified of the results via a standard API, irrespective of the process. So when the process is changed and enhanced, the Consumer can stay as it is. This is the true power of process as a service: &#8220;technology under the hood&#8221; &#8211; especially when combined with online learning that continuously improves the results.</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://skilja.com/wp-content/uploads/2020/05/Vinna-Process-as-a-Service-Classification-Example.png" alt="" class="wp-image-576"/><figcaption>Vinna Process as a Service for RPA with Classification Example</figcaption></figure>



<p>A nice example is shown in the graphic above. The Consumer (in this case any RPA client) chooses to use a classification service by selecting the process (through EDA) and the classification project that should be used. In this case both the classification activity and the classification web designer access the same classification model in the database. An admin user can therefore even modify the classification model and the taxonomy, or create a new one, because the process itself and the call to use it stay unchanged. This process also contains a manual correction step (Batch Review) that is conditionally used in case of uncertain results. The system can even send a link for a correction task as a web page, so the user can make any decisions or corrections of submitted tasks herself without having to install anything. The corrected results in turn will be aggregated and used for online learning. If several hundred users are using this process, it will quickly optimize the results automatically without any further effort. And at all times the actual processing can happen anywhere in the world on any cloud server.</p>



<p>This is the power of process as a service.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Reading Medical Reports</title>
		<link>https://skilja.com/de/reading-medical-reports/</link>
		
		<dc:creator><![CDATA[skiljaadmin]]></dc:creator>
		<pubDate>Fri, 27 Dec 2019 14:08:21 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Extraktion]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/reading-medical-reports/</guid>

					<description><![CDATA[Medical Reports are complex documents that are written by doctors who use their specific language and style to express not only facts but also hypotheses and suggestions. They are intended to be read by other doctors or experts who have a deep knowledge of the subject at hand and can make judgements based on what [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Medical reports are complex documents written by doctors, who use their specific language and style to express not only facts but also hypotheses and suggestions. They are intended to be read by other doctors or experts who have a deep knowledge of the subject at hand and can make judgements based on what they learn. In the end, the information contained therein is of vital importance for many decisions to be taken &#8211; medication, lifestyle changes, possible surgery, but also the cost of insurance.</p>



<p>So reading medical reports using Artificial Intelligence and machine learning is a challenging task. The use case described here utilizes Laera Information Extraction to assist experts at a life insurer in assessing the health risk of applicants. In the end the decision needs to be taken by doctors, but the system can greatly assist them in sifting through the amount of text provided, because the medical reports attached to a life insurance application can easily run to hundreds of pages.</p>



<p>Laera Information Extraction uses advanced AI methods to assess the risks contained in these reports and points them out to the experts. In a first step the diagnoses are extracted based on the common ICD-10 code. But a simple word search is not enough, because each diagnosis needs to be put into its context. Here Laera's ability to assign multiple roles to an entity becomes very useful. Once a critical diagnosis is found, Laera makes an assessment based on several categories:</p>



<ul class="wp-block-list"><li>Is the polarity negative or positive? In most cases symptoms are excluded in the reports and therefore the diagnosis is negative. These are of no interest (of course the patient is happy about that). Only the positively confirmed ones are relevant for the risk assessment.</li><li>Is the diagnosis for the present or the past? Many reports contain a lot of history of what happened in the past. While the history might be interesting, the main focus is on the current situation.</li><li>Is the diagnosis for the person herself or maybe for the family? Family history (Father had heart attack) might be important but needs to be assessed differently.</li></ul>
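<p>To illustrate how these categories combine, here is a toy representation of an extracted diagnosis and the filter an assessment might apply; the field names and values are illustrative, not Laera's data model:</p>

```python
from dataclasses import dataclass

# Toy model of a diagnosis with the three roles described above;
# all names here are assumptions for illustration.

@dataclass
class Diagnosis:
    icd10: str
    polarity: str  # "asserted" or "excluded"
    tense: str     # "present" or "past"
    subject: str   # "self" or "family"

def relevant_for_risk(d: Diagnosis) -> bool:
    # Only confirmed, current diagnoses about the applicant herself
    # feed directly into the risk assessment.
    return (d.polarity == "asserted"
            and d.tense == "present"
            and d.subject == "self")

findings = [
    Diagnosis("I21", "excluded", "present", "self"),   # ruled out
    Diagnosis("I21", "asserted", "past", "family"),    # family history
    Diagnosis("E11", "asserted", "present", "self"),   # relevant
]
relevant = [d.icd10 for d in findings if relevant_for_risk(d)]
# relevant == ["E11"]
```

<p>Family history and past events are not discarded; they would simply be reported in their own categories, as described above.</p>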



<figure class="wp-block-image"><a href="https://skilja.com/reading-medical-reports/finding-diagnoses/"><img decoding="async" src="https://skilja.com/wp-content/uploads/2019/12/Finding-Diagnoses.png" alt="" class="wp-image-156"/></a><figcaption><em>Finding Diagnoses, polarity and roles</em></figcaption></figure>



<p>Laera intelligent extraction performs all these tasks and analyzes all pages in milliseconds using semantic and structural methods:
</p>



<ul class="wp-block-list"><li>Find all diagnoses and symptoms</li><li>Find the polarity (is the diagnosis excluded or asserted?)</li><li>Determine the context (role, e.g. self or family)</li><li>Classify the paragraph by relevance</li><li>Present and auto-summarize results in ICD-10 terms</li><li>Highlight relevant areas in the document for quick visual confirmation</li></ul>



<p>Of course, just to make sure this does not get lost, the roles and assignments are not defined by rules but trained using machine learning from a few hundred labeled examples.</p>



<figure class="wp-block-image"><a href="https://skilja.com/reading-medical-reports/icd-10-summary-of-symptoms/"><img decoding="async" src="https://skilja.com/wp-content/uploads/2019/12/ICD-10-Summary-of-Symptoms.png" alt="" class="wp-image-157"/></a><figcaption><em>Summary of symptoms with polarity in ICD-10 tree</em></figcaption></figure>



<p>The customer using this system was able to reduce the time spent assessing an application by more than 50%, as the experts get a prepared data set that allows them to quickly jump to the relevant sections and make the decision.</p>



<p>It is also important to note that this is not a hard-coded special solution, but an example of the application of the Laera Information Extraction product. Any other industry can use this approach to solve its specific requirements. Examples range from contract management to court documents, but of course standard extraction tasks can also be easily solved with Laera Information Extraction.</p>



<p>If you are interested in learning more about this use case or the application of AI extraction in your specific domain, please let us know via e-mail at info(at)skilja.com and we will be happy to provide more information.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>The Magic of Online-Learning</title>
		<link>https://skilja.com/de/the-magic-of-online-learning/</link>
		
		<dc:creator><![CDATA[skiljaweb3]]></dc:creator>
		<pubDate>Sun, 03 Feb 2019 08:26:58 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Extraktion]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/the-magic-of-online-learning/</guid>

					<description><![CDATA[Wouldn’t it be nice if your AI enabled document processing system would continuously take the input from user interactions and use this information to improve the quality of recognition over time? And nobody would have to take care of this – even in the case of hundreds of document classes with dozens of index fields [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Wouldn’t it be nice if your AI enabled document processing system would continuously take the input from user interactions and use this information to improve the quality of recognition over time? And nobody would have to take care of this – even in the case of hundreds of document classes with dozens of index fields each. In the best case the system would be easy to set up, run completely unattended in background and work like a charm.</p>



<p>This is what Skilja with its Laera Classification and Extraction software suites provides. We have completely implemented this new paradigm which is available either as SDKs or as integrated modules to our Vinna Document Processing Platform. But of course what looks easy for the user requires significant infrastructure and automated checks and balances to make this a reliable and stable part of your processing tasks.</p>



<p>Machine Online-Learning of document classification and recognition uses supervised and unsupervised continuous training of incoming data streams. Supervised learning will take the corrections the users have made, analyze them and apply them as new patterns as appropriate. Unsupervised learning will use the results of successful and correct classification and extraction to generate additional knowledge (expanding the space) and statistics of usage of existing knowledge. Both combined are then used to continuously improve the system. The infrastructure is set up quickly and consists of services that do the work in the background: collect statistics, collect samples, analyze the validity of the new data and publish them to the production runtime system if the AI has determined them to be valid additions.</p>



<figure class="wp-block-image"><a href="https://www.skilja.com/wp-content/uploads/2019/02/Skilja-Classification-3.0-Online-Learning.png"><img decoding="async" src="https://www.skilja.com/wp-content/uploads/2019/02/Skilja-Classification-3.0-Online-Learning-1024x640.png" alt="" class="wp-image-1339"/></a></figure>



<p>As we know that system administrators might be wary of having their setup changed automatically (at least until they have seen that it really works), there are several intermediate levels of AI automation that they can choose. The most important are:</p>



<ul class="wp-block-list"><li>Have all changes and each new document manually reviewed, benchmarked and checked before explicitly publishing them. This is the box on the left.</li><li>Have automatically created improvements reviewed and explicitly published.</li><li>View any conflicts and resolve them manually (or at least check them).</li><li>Restrict the users that can contribute to the training to a certain group. Only corrections from this group will be taken into account, while the input from less experienced users will be discarded.</li></ul>



<p>But in the end learning can run completely unattended. As in school (think exams), we need to check the validity of new knowledge before we apply it. Therefore the Laera algorithms will always analyze the conflicts that are created and try to resolve them. In addition, each new revision of the training pattern will be fully automatically quality-checked in the background and only be accepted if the recognition results of the new model exceed those of the existing one. This is an assurance for the production system: changes in quality will always only go in one direction &#8211; better!</p>
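<p>The publish-only-if-better rule can be sketched as a tiny registry that keeps all revisions (so a publication could also be reverted); this is an illustration of the principle, not the Laera services themselves:</p>

```python
class ModelRegistry:
    """Illustrative sketch of the publish-only-if-better gate; the real
    services benchmark each new revision automatically in the background."""

    def __init__(self, initial_score):
        self.revisions = [initial_score]  # production model is the last entry

    @property
    def production_score(self):
        return self.revisions[-1]

    def try_publish(self, candidate_score):
        # Quality may only move in one direction: better.
        if candidate_score > self.production_score:
            self.revisions.append(candidate_score)
            return True
        return False

registry = ModelRegistry(0.90)
rejected = registry.try_publish(0.88)  # worse than production: not published
accepted = registry.try_publish(0.93)  # better: becomes the production model
# registry.production_score == 0.93
```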



<p>Again, this is not a black box: Laera provides precise insight into what is happening and lets you influence or even revert the suggested improvements at any stage. Laera Monitor is the tool for this, a web application that shows the continuously measured quality numbers of your system.</p>



<figure class="wp-block-image"><a href="https://www.skilja.com/wp-content/uploads/2019/02/Laera-Monitor.png"><img decoding="async" src="https://www.skilja.com/wp-content/uploads/2019/02/Laera-Monitor-1024x651.png" alt="" class="wp-image-1340"/></a></figure>



<p>The example here shows a typical curve for the F1 score (an averaged quality measurement). Starting with a setup of a few hundred trained documents, the quality quickly deteriorates as new and unknown samples arrive in production, especially when the real volumes start to be processed. It is interesting to see that the precision stays high, close to 95%, which is very satisfying, but recall (the recognition rate) goes down as the system simply does not “know” the new documents. But then online learning kicks in and uses the new samples and corrections made to quickly improve the quality back to 95% after a few thousand new training documents have been processed.</p>
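<p>For reference, F1 is the harmonic mean of precision and recall, which is why a drop in recall pulls the curve down even while precision stays near 95%:</p>

```python
# F1 = harmonic mean of precision and recall; the numbers below are
# illustrative, roughly matching the curve described in the text.

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

low = round(f1(0.95, 0.60), 3)   # 0.735: high precision, poor recall
high = round(f1(0.95, 0.95), 3)  # 0.95: after online learning catches up
```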



<p>Online learning will make classification and extraction much easier in the future. After an initial setup, the AI will simply learn in the background what needs to be known to arrive at the best possible automation rate within a few weeks. This opens up a whole new area of processes (for example those with smaller document volumes) and will greatly improve quality for existing automation processes.</p>



<p>Please let us know if you have additional questions, need more insight, or have a direct interest. Contact us at info (at) skilja.com.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>A New Approach to OCR Quality</title>
		<link>https://skilja.com/de/a-new-approach-to-ocr-quality/</link>
		
		<dc:creator><![CDATA[skiljaadmin]]></dc:creator>
		<pubDate>Fri, 08 Apr 2016 10:33:16 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Technologie]]></category>
		<category><![CDATA[Eurostars]]></category>
		<category><![CDATA[Heritage]]></category>
		<guid isPermaLink="false">https://skilja.com/a-new-approach-to-ocr-quality/</guid>

					<description><![CDATA[The approach to improve OCR on a given document is very similar to human capabilities of adapting their cognitive capabilities to a specific sample. Just imagine that you see a document with very difficult handwriting. In the begining you will be able to distinguish some of the more distinct characters which in turn allow you [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>The approach to improving OCR on a given document is very similar to the human capability of adapting cognitive skills to a specific sample. Just imagine that you see a document with very difficult handwriting. In the beginning you will be able to distinguish some of the more distinct characters, which in turn allow you to infer the meaning of other characters, as you can derive them from the characteristics of the writer. The same is done with unsupervised machine learning in the OCR Accuracy Extension. We use object detection and classification to create clusters of all possible characters on a specific page and then use well-recognizable characters to automatically label these clusters with their meaning (e.g. these are all capital “E”). From these a prototype can be derived and then applied in a second round to all the unknown characters. This helps the system to identify even deteriorated or distorted samples confidently, thus boosting OCR quality.</p>
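<p>The unsupervised labeling step can be illustrated with a toy sketch: each glyph cluster is labeled by a majority vote of its confidently recognized members, and that label is then propagated to the uncertain ones. The data structures and confidence threshold are assumptions for illustration, not the AE implementation:</p>

```python
from collections import Counter, defaultdict

def label_clusters(glyphs, min_conf=0.9):
    """glyphs: list of (cluster_id, ocr_char, confidence).
    Returns a label per cluster, chosen by majority vote among
    the confidently recognized members."""
    votes = defaultdict(Counter)
    for cluster, char, conf in glyphs:
        if conf >= min_conf:  # only well-recognized characters vote
            votes[cluster][char] += 1
    return {c: counter.most_common(1)[0][0] for c, counter in votes.items()}

glyphs = [
    (1, "o", 0.97), (1, "o", 0.95), (1, "e", 0.40),  # faint "o" misread as "e"
    (2, "E", 0.99), (2, "E", 0.93),
]
labels = label_clusters(glyphs)
# labels == {1: "o", 2: "E"}: the low-confidence "e" in cluster 1
# would be corrected to "o" in the second recognition round
```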



<p>An example for a very old book is shown below displayed in our test and benchmark tool, the Accuracy Extension (AE) Studio. The green blocks highlight characters that have been corrected by the AE. In the tooltip you can see that in the first line in the word “solten” the character, which is an “o” in GT (ground truth), has been incorrectly recognized as “e” but has been corrected back to an “o” by AE. The text line shows the difference between the first normal OCR pass and the corrected result for the full line.</p>



<div class="wp-block-image"><figure class="aligncenter"><a href="http://www.skilja.de/wp-content/uploads/2016/04/Transcribus-Ausschnitt.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2016/04/Transcribus-Ausschnitt.png" alt="Historical Book Page" class="wp-image-1111"/></a><figcaption>Document was provided by courtesy of University of Innsbruck</figcaption></figure></div>



<p>Another example is a typical old typewriter document (actually a telegram). All the faint characters have been well corrected. The text line at the bottom shows the comparison between original OCR and the magically corrected result.</p>



<div class="wp-block-image"><figure class="aligncenter"><a href="http://www.skilja.de/wp-content/uploads/2016/04/Telegram.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2016/04/Telegram.png" alt="" class="wp-image-1115"/></a></figure></div>



<div class="wp-block-image"><figure class="alignleft"><a href="http://www.skilja.de/wp-content/uploads/2016/04/Norwegian-Newspaper.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2016/04/Norwegian-Newspaper-243x300.png" alt="" class="wp-image-1116"/></a></figure></div>



<p>A more complex example is the recognition of a complete old newspaper page as shown below. This page contains 16,837 characters. This is an advantage, because a high number of available characters is beneficial for the automatic creation of good prototypes, which in turn can be used to improve the quality.</p>



<div class="wp-block-image"><figure class="alignleft"><a href="http://www.skilja.de/wp-content/uploads/2016/04/Norwegian-Newspaper-Results.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2016/04/Norwegian-Newspaper-Results-300x56.png" alt="" class="wp-image-1117"/></a></figure></div>



<p>In this case the first pass of OCR with FineReader 11 (from ABBYY) yields a decent quality of 78.4% correct characters when compared to a manually corrected ground truth file. If the AE OCR booster is initialized by unsupervised learning (default init), the recognition rate goes up by about 6 percentage points to 84.5%. If in addition we use the ground truth of another page of the same newspaper for the learning step, the system achieves an improvement of more than 10 points, to 88.7%.</p>



<p>These are promising first results for improving OCR on difficult documents using unsupervised machine learning techniques. The project is ongoing and will surely yield even better results in the coming year, allowing researchers to access cultural heritage more easily and faster.</p>



<div class="wp-block-image"><figure class="alignleft"><a href="http://www.skilja.de/wp-content/uploads/2014/12/Eureka-EU.jpg"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/12/Eureka-EU-300x150.jpg" alt=""/></a></figure></div>



<p><em>This project is funded by the Federal Ministry of Education and Research (BMBF) of Germany and the European Union under the project OptO-Heritage and grant number 01QE140.</em></p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
