<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Alexander | Skilja</title>
	<atom:link href="https://skilja.com/de/author/alexander/feed/" rel="self" type="application/rss+xml" />
	<link>https://skilja.com/de/</link>
	<description>Document Understanding – Deep Learning</description>
	<lastBuildDate>Mon, 19 Jan 2026 14:43:08 +0000</lastBuildDate>
	<language>de</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://skilja.com/wp-content/uploads/2021/06/cropped-skilja_logo_transparent_02-32x32.png</url>
	<title>Alexander | Skilja</title>
	<link>https://skilja.com/de/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Das Ende der OCR: Wie Worterkennung alles verändert</title>
		<link>https://skilja.com/de/das-ende-der-ocr-wie-worterkennung-alles-veraendert/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Thu, 30 Oct 2025 16:26:57 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Grundlagen]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/das-ende-der-ocr-wie-worterkennung-alles-veraendert/</guid>

					<description><![CDATA[Modern systems do not read characters – at least not individually. They read words, phrases, and even meaning. OCR has evolved into something fundamentally different, and this shift delivers a level of accuracy, fluidity, and naturalness that earlier generations could not reach.
That is exactly what Skilja has achieved with Lesa, our transformer-based deep-learning system, designed to read the way humans do: holistically, contextually, and intelligently.]]></description>
										<content:encoded><![CDATA[<p style="font-weight: 400;">For more than half a century, OCR – Optical Character Recognition – has stood for one thing: machines deciphering individual characters. From the early templates of the 1960s to the statistical engines of the 2000s, OCR always treated reading as a mechanical process. It segmented text into character-shaped fragments and tried to work out their meaning letter by letter. But that era is now coming to an end. Modern systems no longer read characters at all – at least not individually. They read words, phrases, and even meaning. OCR has evolved into something fundamentally different, and this shift delivers a level of accuracy, fluidity, and naturalness that earlier generations could not reach.</p>
<p style="font-weight: 400;">This is what Skilja has created with <strong>Lesa</strong>, our transformer-based deep-learning system, designed to read the way humans do: holistically, contextually, and intelligently.</p>
<h4>A Brief Look Back: From Characters to Context</h4>
<p style="font-weight: 400;">Traditional OCR began as an analog process (hence the O for Optical) and went through several phases:</p>
<ul>
<li><strong>1950s–1980s:</strong> Rigid templates – effective only on pristine typewritten pages.</li>
<li><strong>1990s:</strong> Feature extraction and early machine learning – better, but still fragile.</li>
<li><strong>2000s–2010s:</strong> Statistical modeling and improved analysis workflows – good enough for books, printed forms, and constrained handwriting, but always bound to characters.</li>
</ul>
<p style="font-weight: 400;">Even at its best, classical OCR remained a guessing game. It confused the digit 1 with the letter l, turned smudges into glyphs, and struggled with anything outside its narrow expectations – handwriting above all.</p>
<p style="font-weight: 400;">The problem was not a lack of better algorithms but characters themselves – humans read words, not characters, as anyone who has ever watched a child learn to read can confirm.</p>
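<p style="font-weight: 400;">The word-level idea can be sketched in a few lines. The toy example below is not Lesa's actual algorithm; the vocabulary and scores are invented for illustration. It resolves the classic 1-versus-l ambiguity by scoring whole-word candidates instead of deciding glyph by glyph:</p>

```python
# Toy illustration (not Lesa's real implementation): resolve the "1" vs "l"
# OCR ambiguity by scoring whole-word candidates against a small vocabulary.
# The scores stand in for language-model probabilities and are made up.
VOCAB = {"invoice": 0.9, "total": 0.8}

def best_reading(candidates):
    """Return the candidate word with the highest vocabulary score."""
    return max(candidates, key=lambda w: VOCAB.get(w, 0.0))

# A character-level engine might emit "lnvoice"; word-level scoring recovers it.
print(best_reading(["lnvoice", "invoice"]))  # prints "invoice"
```

<p style="font-weight: 400;">A real system replaces the lookup table with a learned language model over sequences, but the decision is the same: pick the reading that makes sense as a word in context.</p>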
<p><img fetchpriority="high" decoding="async" class="alignnone size-large wp-image-2523" src="https://skilja.com/wp-content/uploads/2026/01/Lesa-Full-Text-Recognition-1-1024x573.png" alt="" width="1024" height="573" srcset="https://skilja.com/wp-content/uploads/2026/01/Lesa-Full-Text-Recognition-1-980x548.png 980w, https://skilja.com/wp-content/uploads/2026/01/Lesa-Full-Text-Recognition-1-480x268.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1024px, 100vw" /></p>
<h4>The Deep-Learning Shift: From Decoding Shapes to Understanding Language</h4>
<p style="font-weight: 400;">Transformers changed everything. Instead of interpreting characters, transformer-based models interpret <strong>sequences, context, and linguistic probability</strong>. They treat text not as isolated shapes but as parts of sentences, paragraphs, and concepts.</p>
<p style="font-weight: 400;">This allows Lesa to:</p>
<ul>
<li>recognize whole <strong>words</strong>, not just letters,</li>
<li>use the surrounding text as context,</li>
<li>maintain coherence across entire pages,</li>
<li>and adapt to different visual styles.</li>
</ul>
<p style="font-weight: 400;">Reading becomes a task of <strong>language understanding</strong> rather than one of pixel decoding.</p>
<h4>Lesa: Built for Word-Level Intelligence</h4>
<p style="font-weight: 400;">Lesa was designed from the ground up to treat a document as a linguistic entity. Using a transformer architecture trained on diverse texts and real-world images, Lesa does not ask "What character is this?" but "What does this <em>word</em> mean – and how does it fit into the sentence?" This matters because we now:</p>
<ul>
<li><strong>Make far fewer errors:</strong> No more error cascades from single characters.</li>
<li><strong>Produce natural output</strong>: Clean spacing, correct punctuation, coherent text.</li>
<li><strong>Stay robust across fonts and layouts</strong>: Works on formatted text, messy receipts, tables, and multi-column pages.</li>
</ul>
<p style="font-weight: 400;">One of the most striking results of word-level understanding is that constrained handwriting – the kind found on forms, notes, medical records, delivery slips, and corporate paperwork – now works <em>much better</em>. Messy block letters, boxed handwriting, and half-printed forms are suddenly readable. Handwriting recognition is finally within reach and can be applied routinely.</p>
<p><img decoding="async" class="alignnone size-large wp-image-2521" src="https://skilja.com/wp-content/uploads/2026/01/Lesa-Handprint-Recognition-1024x251.png" alt="" width="1024" height="251" srcset="https://skilja.com/wp-content/uploads/2026/01/Lesa-Handprint-Recognition-980x240.png 980w, https://skilja.com/wp-content/uploads/2026/01/Lesa-Handprint-Recognition-480x118.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1024px, 100vw" /></p>
<p style="font-weight: 400;">Calling this "the end of OCR" is not an exaggeration. It is a technical acknowledgment of what has changed.</p>
<p style="font-weight: 400;"><strong>Traditional OCR = character recognition. Modern OCR = language understanding.</strong></p>
<p style="font-weight: 400;">Lesa belongs to a new generation of systems that read documents the way humans do – by interpreting words rather than decoding symbols.</p>
<p style="font-weight: 400;">OCR as we knew it is over. Something better has replaced it. <strong>Lesa</strong> stands for this new era.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Intelligent Document Processing and Enterprise Security</title>
		<link>https://skilja.com/de/intelligent-document-processing-and-security/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Sun, 22 Jun 2025 11:15:06 +0000</pubDate>
				<category><![CDATA[Prozess]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/?p=1410</guid>

					<description><![CDATA[In the past, Capture was almost always departmental. It was only allowed in a (badly lit) corner of an enterprise, mainly because no Capture system was able to fully integrate into enterprise IT and, most importantly, comply with all the security rules established for enterprises.

This has changed for the better in the past few years. At Skilja we have invested a lot of effort in strictly following all security requirements so that Vinna (our enterprise platform) can become a part of enterprise IT. This means, on the one hand, that during development we run frequent security screenings and penetration tests to ensure the utmost security of the software. For example, Vinna and all Skilja software have been Veracode Verified for many years. On the other hand, we independently make sure to follow all defined industry standards very closely.]]></description>
										<content:encoded><![CDATA[<p>For those of us who have historically worked in the area of Intelligent Document Processing (IDP) – or Capture, as it was simply called before – it is a very pleasant observation that IDP, which has been around for a long time, is drawing more and more interest in general CEO discussions and is now seen as an integral part of process optimization.</p>
<p>This is, on the one hand, due to the rise of AI technologies and the subsequent understanding of what can be achieved with algorithms that mimic human understanding. AI, having arrived in the mainstream (and even dominating mainstream discussions), is now generally understood as capable of performing cognitive tasks that humans perform. We have known and preached this for a long time, but it is a good development that our former niche is becoming standard.</p>
<p>On the other hand, IDP is getting more and more integrated into core business processes. In the past, Capture was almost always departmental. It was only allowed in a (badly lit) corner of an enterprise, mainly because no Capture system was able to fully integrate into enterprise IT and, most importantly, comply with all the security rules established for enterprises.</p>
<p>This has changed for the better in the past few years. At Skilja we have invested a lot of effort in strictly following all security requirements so that Vinna (our enterprise platform) can become a part of enterprise IT. This means, on the one hand, that during development we run frequent security screenings and penetration tests to ensure the utmost security of the software. For example, Vinna and all Skilja software have been <a href="https://www.veracode.com/verified/directory/skilja-gmbh" target="_blank" rel="noreferrer noopener">Veracode Verified</a> for many years. On the other hand, we independently make sure to follow all defined industry standards very closely.</p>
<p>The most important requirement for being allowed to run in an enterprise is <strong>authentication and authorization</strong>. Typically, an enterprise will not (or only grudgingly) allow an application to store user names or passwords outside of its internal identity-management software. A platform should not have its own user management – this is a no-go for many customers. Therefore, Vinna has from the beginning used roles that are then mapped to users in the enterprise user directory. In Vinna 3.0, the password still needed to be entered by the user and was sent (encrypted) to the authentication backend.</p>
<p>Since version 3.1, our Vinna platform uses the OAuth2 protocol for the authorization of users. OAuth2 itself does not directly deal with the authentication of users and clients; instead, an authentication backend is required to grant authorization and thus access. OAuth2 is supported by many backends, notably Microsoft Azure AD and Keycloak. All communication with the backend (resource server) is bundled in the Skilja Authorization Server, which is used by all Vinna platform services, clients, and activities.</p>
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" class="wp-image-2409" src="https://skilja.com/wp-content/uploads/Platform-Authorization-and-Authentication.png" sizes="(max-width: 840px) 100vw, 840px" srcset="https://skilja.com/wp-content/uploads/Platform-Authorization-and-Authentication.png 840w, https://skilja.com/wp-content/uploads/Platform-Authorization-and-Authentication-300x144.png 300w, https://skilja.com/wp-content/uploads/Platform-Authorization-and-Authentication-768x368.png 768w, https://skilja.com/wp-content/uploads/Platform-Authorization-and-Authentication-500x240.png 500w" alt="" width="840" height="403" /></figure>
<p>Vinna provides authentication via different methods:</p>
<ul>
<li>Authorization Code Flow with PKCE via a web browser</li>
<li>User name and password authentication via the password grant flow</li>
<li>Client authentication via the client credentials grant flow</li>
</ul>
<p><strong>Authorization Code Flow with PKCE</strong> (Proof Key for Code Exchange) is the current best practice for logging into any client application because it avoids entrusting the client application with user credentials.</p>
<p>A user who wants to log into a web site is redirected to the login page of the Skilja Authorization Server. If the user has not yet authenticated to the Skilja Authorization Server, they enter a user name and password into the login form. After the credentials have been verified, the user is redirected back to the web site where they started, along with an authorization code. The web site then exchanges this authorization code with the Authorization Server for an access token. The authorization-code step prevents user credentials from ever being entered into a potentially untrusted client application. The PKCE part of this flow prevents access tokens from being passed around in redirect URLs shown in the browser's address bar, thereby preventing accidental token leakage through copying and pasting the URL.</p>
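<p>As a rough sketch of the client-side PKCE preparation defined in RFC 7636 (the endpoint URL and client ID below are placeholders, not Vinna's actual configuration):</p>

```python
# Client-side PKCE preparation (RFC 7636, S256 method). The authorization
# endpoint and client ID are placeholders for illustration only.
import base64
import hashlib
import secrets

def make_pkce_pair():
    """Return (code_verifier, code_challenge) for the S256 challenge method."""
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge

verifier, challenge = make_pkce_pair()

# The challenge goes into the authorization redirect; the verifier is kept
# secret and sent later with the code-for-token exchange, so the server can
# confirm that both requests came from the same client.
auth_url = (
    "https://auth.example.com/authorize"
    "?response_type=code&client_id=my-client"
    f"&code_challenge={challenge}&code_challenge_method=S256"
)
```

<p>Because only the hashed challenge travels in the redirect, an attacker who intercepts the authorization code still cannot redeem it without the original verifier.</p>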
<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" class="wp-image-2410" src="https://skilja.com/wp-content/uploads/Authorization-via-client-credentials.png" sizes="(max-width: 836px) 100vw, 836px" srcset="https://skilja.com/wp-content/uploads/Authorization-via-client-credentials.png 836w, https://skilja.com/wp-content/uploads/Authorization-via-client-credentials-300x141.png 300w, https://skilja.com/wp-content/uploads/Authorization-via-client-credentials-768x361.png 768w, https://skilja.com/wp-content/uploads/Authorization-via-client-credentials-500x235.png 500w" alt="" width="836" height="393" /></figure>
<p>The <strong>client credentials grant flow</strong> shown above allows client credentials to be registered with the authorization service, along with the claims each client ID may receive. This type of authorization is intended for machine-to-machine communication, not for typical user interactions. In this flow, a client ID and matching client secret are sent to the authorization service, which issues an access token. Once the access token expires, the client ID and client secret can be used again to obtain a new one. In some cases, a refresh token is also made available that can likewise be used to acquire a new access token.</p>
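<p>A minimal client-credentials token request might look as follows (illustrative only; the token endpoint and credentials are placeholders, not Skilja's actual configuration):</p>

```python
# Illustrative OAuth2 client-credentials request (RFC 6749, section 4.4).
# Endpoint and credentials are placeholders for a hypothetical setup.
import json
import urllib.parse
import urllib.request

def token_request_body(client_id: str, client_secret: str) -> bytes:
    """Form-encoded body for the client credentials grant."""
    return urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()

def fetch_token(token_url: str, client_id: str, client_secret: str) -> dict:
    """POST the request and return the parsed token response
    (access_token, token_type, expires_in, ...)."""
    req = urllib.request.Request(
        token_url,
        data=token_request_body(client_id, client_secret),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

<p>When the returned <code>expires_in</code> elapses, the same call simply runs again; no user interaction is ever involved.</p>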
<p>Overall, this new architecture for authorization and authentication allows an enterprise to integrate Vinna, and all IDP activities that run within it, into their environment and make them available to all users – opening up far more options for using IDP in their processes than if they were isolated in their own network.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Happy 12th birthday Skilja</title>
		<link>https://skilja.com/de/happy-12th-birthday-skilja/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Thu, 25 Jan 2024 17:54:00 +0000</pubDate>
				<category><![CDATA[Unternehmen]]></category>
		<guid isPermaLink="false">https://skilja.com/?p=2420</guid>

					<description><![CDATA[Unbelievable – but 24.01.2024 was the twelfth birthday of Skilja, meaning that next year Skilja will become a teenager.

12 years ago we founded Skilja in the middle of the German winter with the goal of creating the best possible Document Understanding – or IDP, as others call it – solution. Now we look back and are proud that we have achieved that goal, at least in our view. A lot has happened along the way.]]></description>
										<content:encoded><![CDATA[<p>Unbelievable &#8211; but 24.01.2024 was the twelfth birthday of Skilja, meaning that next year Skilja will become a teenager.</p>
<p>12 years ago we founded Skilja in the middle of the German winter with the goal of creating the best possible Document Understanding &#8211; or IDP, as others call it &#8211; solution. Now we look back and are proud that we have achieved that goal &#8211; at least in our view. A lot has happened along the way. We have created:</p>
<ul>
<li>A leading classification technology with statistical, semantic, and LLM classifiers, used by more than 1,000 customers worldwide</li>
<li>An extraction framework that lets a customer handle everything from forms and invoices to totally freeform contracts in the same environment, with no technology disruption, from a single designer</li>
<li>Our own recognition engine &#8211; AI based &#8211; as we were not happy with existing engines</li>
<li>A powerful process platform &#8211; used by &gt; 250 enterprises worldwide, on premise and in the cloud, with some of them processing 500,000 documents per day</li>
</ul>
<p>And of course we look back on a large number of great projects, successes in the market, and a lot of happy customers. We did not achieve this alone, but only with the help of our partners, who one by one joined our partner network over the last 12 years. Many thanks to them for their commitment, their dedication, and their encouragement.</p>
<p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-2416" src="https://skilja.com/wp-content/uploads/2024/01/IMG_5881-1024x576.jpeg" alt="" width="1024" height="576" srcset="https://skilja.com/wp-content/uploads/2024/01/IMG_5881-980x551.jpeg 980w, https://skilja.com/wp-content/uploads/2024/01/IMG_5881-480x270.jpeg 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1024px, 100vw" /></p>
<p>Sometimes we also ask ourselves &#8211; how did we do that? In the end we are just a medium-sized company with (now) 24 developers and engineers. The answer &#8211; I believe &#8211; is FOCUS. We were always very clear about what we wanted to achieve, focused our energy on these ideas, and did not allow ourselves to be distracted. This is the key to our success, together with our absolutely customer-centric approach (which might seem to conflict with focus, but in reality does not if it is handled and communicated correctly). And of course we have a tremendous team and pool of talent, for which I am very grateful. It also seems that we made some correct technological decisions along the way that pay off now &#8211; a focus on services, web interfaces, machine-learning approaches, ease of use, and high-quality software development with the most advanced tools&#8230;</p>
<p>Skilja was created as an independent technology provider with a strong partner network, and our plan is to remain one in the years to come. Independence is not a value in itself but the core of our ability to innovate. We do not need to look at quarterly results; we can define long-term development goals, decide on them, and follow them through to the end. This makes us strong and successful and will continue to do so.</p>
<p>Thank you all for making this possible.</p>
<p>Alexander Goerke, CEO</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Document Separation Revisited</title>
		<link>https://skilja.com/de/document-separation-revisited/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Thu, 08 Sep 2022 09:50:54 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/document-separation-revisited/</guid>

					<description><![CDATA[One of the frequently overlooked and really difficult problems in document automation, which is also really annoying in daily processing, is the automatic separation of a stack of documents into single meaningful documents and their assignment to a document class. The goal is to simply scan the whole stack and have it separated by an intelligent algorithm. Fortunately, this is readily available today from the Skilja technology stack as a built-in feature of the Laera classifier. That does not mean it is easy: it requires considerable experience and infrastructure to manage several interdependent steps of classification and separation in a stable and reliable way. This is what Laera provides out of the box.]]></description>
										<content:encoded><![CDATA[
<div class="entry-content">
<p>One of the frequently overlooked and really difficult problems in document automation, which is also really annoying in daily processing, is the automatic separation of a stack of documents into single meaningful documents and their assignment to a document class. In traditional scanning processes this is often achieved by manually preparing the paper and sticking a barcode on each first page as a document separator. But this is labor intensive and error prone. In addition, as we go more and more digital, even with paper-based processes the processing facility normally no longer has access to the paper. So the goal is to simply scan the whole stack and have it separated by an intelligent algorithm.</p>
<p>Fortunately, this is readily available today, for example from the Skilja technology stack as a built-in feature of the Laera classifier. That does not mean it is easy: it requires considerable experience and infrastructure to manage several interdependent steps of classification and separation in a stable and reliable way. This is what Laera provides out of the box.</p>
<p>How does document structuring work in principle? In exactly the same way (our credo!) as a human would do it: go through the stack page by page, determine what type each page is and whether it is related to the previous page or starts a new topic or form. Then check page numbers for confirmation if they are present. If in doubt, go back one or a few pages to double-check, and then make your decision to separate.</p>
<figure class="wp-block-image size-large">
<p><div id="attachment_2117" style="width: 1034px" class="wp-caption alignnone"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2117" class="wp-image-2117" src="https://skilja.com/wp-content/uploads/Document-Separation-1024x479.png" sizes="(max-width: 1024px) 100vw, 1024px" srcset="https://skilja.com/wp-content/uploads/Document-Separation-1024x479.png 1024w, https://skilja.com/wp-content/uploads/Document-Separation-300x140.png 300w, https://skilja.com/wp-content/uploads/Document-Separation-768x359.png 768w, https://skilja.com/wp-content/uploads/Document-Separation-500x234.png 500w, https://skilja.com/wp-content/uploads/Document-Separation.png 1229w" alt="Laera Document Separation" width="1024" height="479" /><p id="caption-attachment-2117" class="wp-caption-text">Laera Document Separation</p></div></figure>
<p>In AI classification, which is what Laera is, this is built into a sequence of algorithms. The system is trained on a sample that is already correctly separated. Laera learns for each page whether it is a first, middle, end, or single page. The user does not have to specify this explicitly, as the Laera AI works it out automatically from the samples and hides this complexity from the user. The training interface just requires you to drop the individual documents into the training set. It is not necessary to specify an exact number of pages (or a range) for each document type; Laera automatically takes into account that these can vary. However, if you know them, you can also restrict the allowed pages – for example for forms that are always single-page.</p>
<p>Laera will then learn the structure and, at runtime, apply it to a whole stack of unseparated single pages. Each page is analyzed. A second classifier (we could call it a “meta-classifier”) then takes these results and finds the most probable separation based on the trained model. So even if a first page has not been identified as such, there is a chance that the meta-classifier will still see it as more probably a first page and separate correctly. A third classifier then determines the document type for the separated documents. As usual with Laera, all this is very fast: separating a stack of 200 pages across 150 document types takes less than 30 seconds in total.</p>
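<p>The idea of turning per-page decisions into document boundaries can be sketched as follows. This is an illustrative greedy cut, not Laera's actual meta-classifier, which optimizes over the whole label sequence:</p>

```python
# Illustrative sketch (not Laera's implementation): convert per-page
# "probability this is a first page" scores into document boundaries by
# starting a new document whenever a page looks like a first page.
def separate(first_page_probs, threshold=0.5):
    """Return documents as lists of page indices."""
    docs, current = [], []
    for i, p in enumerate(first_page_probs):
        if p >= threshold and current:
            docs.append(current)  # close the previous document
            current = []
        current.append(i)
    if current:
        docs.append(current)
    return docs

# Five pages, of which pages 0 and 3 look like first pages:
print(separate([0.9, 0.1, 0.2, 0.8, 0.1]))  # prints [[0, 1, 2], [3, 4]]
```

<p>A real meta-classifier would additionally weigh middle/end/single labels and evidence such as detected page numbers before committing to a cut.</p>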
<p>The example below shows the results of separating a mortgage application stack of 153 pages and classifying into 244 document types. The horizontal lines indicate the detected separators, and the “New page” column shows the new numbering of pages in the separated documents.</p>
<figure class="wp-block-image size-full">
<p><div id="attachment_2107" style="width: 1740px" class="wp-caption alignnone"><a href="https://skilja.com/wp-content/uploads/Laera-Separation-Result.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2107" class="wp-image-2107" src="https://skilja.com/wp-content/uploads/Laera-Separation-Result.png" sizes="(max-width: 1730px) 100vw, 1730px" srcset="https://skilja.com/wp-content/uploads/Laera-Separation-Result.png 1730w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-300x164.png 300w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-1024x561.png 1024w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-768x420.png 768w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-1536x841.png 1536w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-500x274.png 500w" alt="Laera Mortgage Separation Result" width="1730" height="947" /></a><p id="caption-attachment-2107" class="wp-caption-text">Laera Mortgage Separation Result (click on image to see full screen)</p></div></figure>
<p>The detail view of the separation result for one page nicely shows how the separation algorithm came to a decision for the first page of a URLA, supported in addition by the page count detected on the page (“Page 1 of 9”).</p>
<figure class="wp-block-image size-full">
<p><div id="attachment_2108" style="width: 640px" class="wp-caption alignnone"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2108" class="wp-image-2108" src="https://skilja.com/wp-content/uploads/Laera-Separation-Details.png" sizes="(max-width: 630px) 100vw, 630px" srcset="https://skilja.com/wp-content/uploads/Laera-Separation-Details.png 630w, https://skilja.com/wp-content/uploads/Laera-Separation-Details-300x137.png 300w, https://skilja.com/wp-content/uploads/Laera-Separation-Details-500x229.png 500w" alt="Laera Mortgage Separation Details" width="630" height="288" /><p id="caption-attachment-2108" class="wp-caption-text">Laera Mortgage Separation Details</p></div></figure>
<p>Training this model takes about 10 minutes, so it is easy to test and refine it frequently. <strong>All this can be done by the end user and does not need an AI engineer.</strong></p>
<p>Quality is very important, and Laera makes sure to bias towards precision so that no errors are made, allowing the workflow to show unconfident separations to a user for decision. In a project done 18 months ago for a large Swiss insurance company, Laera achieved an <strong>automation rate of 87% with an error rate (false positives) of 0.14%</strong>. Of course, each separation result still needs to be checked, and the corrections are used by Laera's online learning to improve the model.</p>
<p>But overall, the reduction of work in separation and the increase in quality are very measurable and yield huge benefits. All this is available either on premise or as a cloud service, to be used through RPA or a RESTful API from any backend. Let us know if you are interested and we can show you a demo; a setup with your own documents is also easily achievable with little effort. Contact is info (at) skilja.com.</p>
</div>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Vinna 3.0 Released</title>
		<link>https://skilja.com/de/vinna-3-0-released/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Tue, 24 Aug 2021 10:17:26 +0000</pubDate>
				<category><![CDATA[News]]></category>
		<category><![CDATA[Prozess]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/vinna-3-0-released/</guid>

					<description><![CDATA[We are proud to announce the release of Vinna 3.0, our open 4th-generation Document Processing Platform. We created a totally new and modern UI – with an improved backend to support enterprise performance, scalability, and security requirements. Process Editor and Process Monitor are completely redesigned with the latest web technologies. Vinna is an open, process-oriented platform that allows users to define a process exactly the way it is optimally operated in a company. The architecture of Vinna is service-oriented (SOA), and the runtime is easily deployed either in the cloud (Microsoft Azure, AWS, or a private cloud), on premise, or in mixed environments where data storage is kept in-house and processing happens outside.]]></description>
										<content:encoded><![CDATA[<p>We are proud to announce the release of <strong>Vinna 3.0</strong>, our open 4th-generation Document Processing Platform. After 18 months of concentrated and intensive development, we are very happy that we can now provide even more value to our customers. We invested a lot in taking our customers' feedback back to our engineers and creating a totally new and modern UI – with an improved backend to support enterprise performance, scalability, and security requirements. Process Editor and Process Monitor are completely redesigned, and both are now available in English as well as in German. To avoid any pain for our many existing customers, special effort has been spent on compatibility with Vinna 2.4, so all projects can be smoothly upgraded. You can manage a 2.4 runtime from the 3.0 design time to achieve a step-by-step upgrade without disrupting production, and the transfer of old process versions into 3.0 has also been very thoroughly tested.</p>
<h4><strong>New Process Editor UI</strong></h4>
<div class="wp-block-image is-style-default">
<figure class="alignleft size-large is-resized"><img loading="lazy" decoding="async" class="wp-image-1685" src="https://skilja.com/wp-content/uploads/logo_vinna_400x400-3.png" alt="" width="100" height="100" /></figure>
</div>
<p>The new design makes creating processes with <span class="has-inline-color">no coding – no scripting – no configuration file editing</span> as easy as it should be. Vinna 3.0 comes with the new BPMN process editor, with improved speed and usability. Now in Angular 10, all functions are componentized and can be integrated separately.</p>
<p>Plenty of new features improve the design and runtime management of processes. Cooperate better with your team members by writing comments directly on activity instances, and work together when designing a process. All activities can be configured through the UI – either through a standard dialog or through an activity’s individual extended dialog, which can even bring up its own web UI. When a process is locked, you can now immediately see by whom. Besides many graphical changes, e.g. in the views of processes, document types and variables, it is now also possible to switch all views to lists and to search in all trees and lists. You can see all environments a process (version) has been published to, and deletion is allowed only when no published version exists.</p>
<figure class="wp-block-image size-large">
<p><div id="attachment_1701" style="width: 1034px" class="wp-caption alignnone"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1701" class="wp-image-1701" src="https://skilja.com/wp-content/uploads/Vinna-3.0-Process-Designer-BPMN-1024x629.png" sizes="(max-width: 1024px) 100vw, 1024px" srcset="https://skilja.com/wp-content/uploads/Vinna-3.0-Process-Designer-BPMN-1024x629.png 1024w, https://skilja.com/wp-content/uploads/Vinna-3.0-Process-Designer-BPMN-300x184.png 300w, https://skilja.com/wp-content/uploads/Vinna-3.0-Process-Designer-BPMN-768x472.png 768w, https://skilja.com/wp-content/uploads/Vinna-3.0-Process-Designer-BPMN-1536x943.png 1536w, https://skilja.com/wp-content/uploads/Vinna-3.0-Process-Designer-BPMN-2048x1258.png 2048w, https://skilja.com/wp-content/uploads/Vinna-3.0-Process-Designer-BPMN-488x300.png 488w" alt="Vinna 3.0 Process Designer" width="1024" height="629" /><p id="caption-attachment-1701" class="wp-caption-text">Vinna 3.0 Process Designer</p></div></figure>
<h4><strong>Process Version Management</strong></h4>
<p>If you have ever been in charge of managing a production system, you know how important staging and versioning are. “Never touch a running system” is a common attitude, but in the end it leads to legacy problems, as nothing can be updated any more. Key to any enterprise-critical production system is version management that allows full control over what is changed – of course with thorough testing in staging steps. Therefore, version management and staging have been a central part of the Vinna architecture from the start and have been further improved in version 3.0. You can now create major and minor versions (1.0, 1.1, 2.0, …) of processes. The latest version you edit is always marked as a draft version – you can’t break anything! A process is always deployed as a specific selected version. So it is easy to work on major changes of a process and test them, while at the same time creating hot fixes (patches) for existing production processes if necessary. And you can even change variables separately for each version in each of your runtime environments.</p>
<figure class="wp-block-image size-large">
<p><div id="attachment_1665" style="width: 1034px" class="wp-caption alignnone"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1665" class="wp-image-1665" src="https://skilja.com/wp-content/uploads/Bildschirmfoto-2021-08-24-um-12.17.02-1024x352.png" sizes="(max-width: 1024px) 100vw, 1024px" srcset="https://skilja.com/wp-content/uploads/Bildschirmfoto-2021-08-24-um-12.17.02-1024x352.png 1024w, https://skilja.com/wp-content/uploads/Bildschirmfoto-2021-08-24-um-12.17.02-300x103.png 300w, https://skilja.com/wp-content/uploads/Bildschirmfoto-2021-08-24-um-12.17.02-768x264.png 768w, https://skilja.com/wp-content/uploads/Bildschirmfoto-2021-08-24-um-12.17.02-1536x527.png 1536w, https://skilja.com/wp-content/uploads/Bildschirmfoto-2021-08-24-um-12.17.02-500x172.png 500w, https://skilja.com/wp-content/uploads/Bildschirmfoto-2021-08-24-um-12.17.02.png 1934w" alt="Vinna 3.0 Process Version Management" width="1024" height="352" /><p id="caption-attachment-1665" class="wp-caption-text">Vinna 3.0 Process Version Management</p></div><figcaption></figcaption></figure>
<h4><strong>Environment Management</strong></h4>
<p>An environment is the runtime system a process is published to. Many environments – on premises, private cloud, public cloud – can be managed from the same Designer. The environment executes the process by hosting and running the activities in as many Activity Servers as needed. In Vinna 3.0 we now have “transient” Activity Servers that auto-start with a VM or in a Docker container, do their work and shut down again when no longer needed. Together with the separation of the Activity Server configuration from the instance, you can easily assign arbitrary resources to a project to scale up dynamically in peak hours, or reduce hardware cost by using just as many servers as you need. An overview of all assigned activities and Activity Servers across an environment fulfills a long-requested requirement.</p>
<h4><strong>Process Monitor</strong></h4>
<p>The 3.0 runtime backend is fully compatible with 2.4 and introduces a lot of invisible changes related to scaling, performance and security. Process Monitor, the GUI used to monitor the runtime, has also been completely redesigned. It comes with many usability enhancements: the new controls allow grouping, filtering and customization of the UI for business operators. There is now a new tab for directly previewing the documents in a work item together with all their data. Licenses can now be reviewed and managed either at runtime or at design time, with a common license view including status and a report of click rates.</p>
<figure class="wp-block-image size-large">
<p><div id="attachment_1666" style="width: 1034px" class="wp-caption alignnone"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-1666" class="wp-image-1666" src="https://skilja.com/wp-content/uploads/Bildschirmfoto-2021-08-24-um-12.21.11-1024x515.png" sizes="(max-width: 1024px) 100vw, 1024px" srcset="https://skilja.com/wp-content/uploads/Bildschirmfoto-2021-08-24-um-12.21.11-1024x515.png 1024w, https://skilja.com/wp-content/uploads/Bildschirmfoto-2021-08-24-um-12.21.11-300x151.png 300w, https://skilja.com/wp-content/uploads/Bildschirmfoto-2021-08-24-um-12.21.11-768x386.png 768w, https://skilja.com/wp-content/uploads/Bildschirmfoto-2021-08-24-um-12.21.11-1536x772.png 1536w, https://skilja.com/wp-content/uploads/Bildschirmfoto-2021-08-24-um-12.21.11-2048x1029.png 2048w, https://skilja.com/wp-content/uploads/Bildschirmfoto-2021-08-24-um-12.21.11-500x251.png 500w" alt="Vinna 3.0 Process Monitor with grouped work item list." width="1024" height="515" /><p id="caption-attachment-1666" class="wp-caption-text">Vinna 3.0 Process Monitor with grouped work item list.</p></div></figure>
<p>Vinna is an open and process-oriented platform that allows users to define a process in exactly the way it is optimally operated in a company. The design allows full flexibility in the data model, with a hierarchical document model supporting batches, folders, documents and pages. The documents are processed as work items in the flow and passed through activities. The activities are either standard tasks like OCR or classification, or custom tasks integrated into the platform. Any number of activities can be defined in the process as microservices, including arbitrary routing decisions based on intermediate results. The architecture of Vinna is service-oriented (SOA) and the runtime is easily deployed in the cloud (Microsoft Azure, AWS or a private cloud), on premises, or in mixed environments where the data storage is kept in house and processing happens outside.</p>
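<p>The hierarchical document model described above (batch, folder, document, page) can be sketched as plain data structures. The class and field names below are illustrative assumptions for this sketch, not Vinna's actual API:</p>

```python
# Hypothetical sketch of a hierarchical document model
# (batch -> folder -> document -> page); names are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Page:
    image_uri: str

@dataclass
class Document:
    doc_type: str
    pages: List[Page] = field(default_factory=list)

@dataclass
class Folder:
    documents: List[Document] = field(default_factory=list)

@dataclass
class Batch:
    """A batch is the work item passed through the activities of the flow."""
    work_item_id: str
    folders: List[Folder] = field(default_factory=list)

# Example: one batch with a single two-page invoice.
batch = Batch("wi-001", [Folder([Document("invoice",
                                          [Page("p1.tif"), Page("p2.tif")])])])
```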
<p>All communication between services and databases is transaction based, securely encrypted and uses standard REST protocols over HTTP and HTTPS. Three powerful HTML-based graphical user interfaces are provided for defining, managing and monitoring processes. Vinna is suitable for small projects but also incorporates all the enterprise features needed for large production systems. The biggest Vinna customer now processes 100M documents p.a. in one system, which is about 400,000 documents per working day.</p>
<p>Whitepapers and data sheets are available. If you are interested in further details, please contact us at info(at)skilja.com to obtain your copy.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Classification and Context</title>
		<link>https://skilja.com/de/classification-and-context/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Wed, 15 Apr 2015 16:00:00 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<guid isPermaLink="false">https://skilja.com/classification-and-context/</guid>

					<description><![CDATA[This is a situation that probably sounds familiar to you: You meet a person and you are sure that you know him/her well and that you have already seen him or her many times – but you cannot remember who it is. More precisely – you cannot put the person into a context. This does [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>This is a situation that probably sounds familiar to you: You meet a person and you are sure that you know him or her well and that you have already seen him or her many times – but you cannot remember who it is. More precisely – you cannot put the person into a context. This does not happen with people who are very close to you, whom you can identify (or categorize) immediately and without effort. It happens with people you are acquainted with but do not know by heart. When this happens you start to search your memories for this person. You do this by looking for a context that you can apply to this face, body and appearance. This is because the categorization problem you are faced with is&nbsp;<strong>an out-of-context problem</strong>. It is so much easier to recognize a pattern or identify an object if a context is applied.</p>



<p>This happened to me last fall when I met the owner of a mountain restaurant from the Black Forest at the baggage claim of the airport of Rhodes. As he greeted me in a friendly way and asked how I was, he was visually present in my memories, but I was searching in vain for the context, which couldn’t have been more remote. Being on vacation, after a nice flight to a foreign country close to the sea, and looking forward to my first ouzo simply blocked out all of my familiar reference system, which would have given me the answer instantly. Fortunately I was able to talk around my ignorance until he gave me enough additional information that it slowly dawned upon me who he was.</p>



<p>Context is a very important concept for classification. Not only does the ease with which objects are categorized depend on it; the result of classification is also related to context. Cognitive psychologists have analyzed the effects of context and have come up with some striking results. As you might remember, classification and object recognition have a lot in common.</p>



<p>One early study that illustrates this is Labov’s investigation of the semantic boundaries of the concepts described by the words&nbsp;<strong>cup</strong>&nbsp;and&nbsp;<strong>bowl</strong> (‘The Boundaries of Words and Their Meanings’,&nbsp;<a href="http://en.wikipedia.org/wiki/William_Labov">William Labov</a>). He showed his subjects line drawings of containers that varied in width and height.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2012/03/Labov-Cups-1.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2012/03/Labov-Cups-1-215x300.png" alt="Labov Cups 1" class="wp-image-218"/></a></figure>



<p>When they were asked to assign one of the categories (words) to the drawings, he observed a gradual shift from cup to bowl depending on the diameter of the container. This is as expected, since the diameter is surely one of the features taken as relevant for the category. So this is a simple illustration of how a classifier works and how the weight of a feature influences the result. As we know, this is very simplified, as normally thousands of features are taken into account for classification, even of cups and bowls: transparency, material, color, handle yes/no, etc.</p>
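<p>As a toy illustration (my own sketch, not taken from Labov's study), a one-feature linear classifier over the width-to-height ratio shows how a feature's weight and a bias shift the cup/bowl boundary; the numeric values are arbitrary illustrative choices:</p>

```python
# Toy "cup vs. bowl" classifier with a single geometric feature.
# Weight and bias values are arbitrary illustrative choices.
def classify_vessel(width: float, height: float,
                    w_ratio: float = 1.0, bias: float = -1.1) -> str:
    ratio = width / height            # Labov's key feature: diameter vs. height
    score = w_ratio * ratio + bias    # a positive score means "bowl"
    return "bowl" if score > 0 else "cup"
```

<p>A narrow vessel scores below zero and stays a cup; increasing either the diameter or the feature's weight pushes the same drawing across the boundary into the bowl category.</p>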



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2012/03/Labov-Cups-2.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2012/03/Labov-Cups-2.png" alt="Labov Cups 2" class="wp-image-219"/></a></figure>



<p>The interesting part of the experiment came when he put the situation into a context of either drink or food. In the context of food the category boundary shifted significantly: more of the vessels were now seen as bowls than before.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2012/03/Labov-Cups-3.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2012/03/Labov-Cups-3.png" alt="Labov Cups 3" class="wp-image-220"/></a></figure>



<p>You see how context is a very important concept in classification and should not be ignored. Psychologists have even found that you can influence the outcome of an answer by putting the subject into a context beforehand. This effect is called&nbsp;priming. For example, they asked people to think about something they are ashamed of. Thus primed, the subjects were shown the ambiguous fragments W _ _ H and S _ _ P.</p>



<p>It was shown that those who were ashamed were more likely to complete these fragments as WASH and SOAP instead of WISH and SOUP. The opposite was true for people primed with eating memories. Amazing, isn’t it?</p>



<p>What is the relevance for document understanding? These experiments show that the human classification we want to emulate with document understanding is more complex than thought and by far not objective. This means that for document sorting, too, the context is very relevant. Humans are far superior to machines in this task because they have knowledge of the context. A simple phrase like “My address was changed” is always treated the same by a statistical classifier. But if you have a context, you know whether it is a postal address, an e-mail address or an IP address. You know whether the address was changed by the person themselves, by accident, or whether the change was forced upon them – a lot of context relevant for the subsequent business process. Astonishingly enough, there is no classification system on the market that makes use of context to prime the classifier, although it would be very useful and could improve the results. There are some systems, like Kofax KTM or Paradatec AIDA, which use hierarchical classification, so they have a context in the sub-classification, which is good. The next big step towards human-like categorization would be to allow adding context explicitly and taking it into account for classification. At least this is what is suggested by the cognition experiments shown above.</p>
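<p>Priming a classifier with context could be as simple as letting the context shift the decision score (or, equivalently, the class priors) before the decision is made. A minimal sketch in the spirit of Labov's cup/bowl experiment; all numeric values and names are my own assumptions:</p>

```python
# Sketch: context "primes" the classifier by shifting the decision score.
# Prior values are assumed purely for illustration.
CONTEXT_PRIOR = {"neutral": 0.0, "food": 0.5, "drink": -0.5}

def classify_in_context(width: float, height: float,
                        context: str = "neutral") -> str:
    # base score from the width/height feature, plus a context-dependent prior
    score = width / height - 1.1 + CONTEXT_PRIOR[context]
    return "bowl" if score > 0 else "cup"
```

<p>With these values, the very same vessel is categorized as a cup in a neutral context but as a bowl in a food context – mirroring the shift Labov observed in his subjects.</p>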



]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Auto-Classification Technologies and RFID Smart Docs</title>
		<link>https://skilja.com/de/auto-classification-technologies-and-rfid-smart-docs/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Sat, 10 Jan 2015 13:16:43 +0000</pubDate>
				<category><![CDATA[Gastbeitrag]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/auto-classification-technologies-and-rfid-smart-docs/</guid>

					<description><![CDATA[Editor’s note:&#160;This is a guest post from Cláudio Chaves from&#160;TCG Brazil Recent advances in the auto-classification technologies – as described in this blog – have provided a substantial manual labor reduction for several companies related to physical preparation, classification and separation of documents into its operations. Although these advances have achieved tangible results in optimizing [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h6 class="wp-block-heading"><strong>Editor’s note</strong>:&nbsp;This is a guest post from Cláudio Chaves from&nbsp;<a href="http://www.tcgprocess.com.br/" target="_blank" rel="noreferrer noopener">TCG Brazil</a></h6>



<p>Recent advances in auto-classification technologies – as described in this blog – have provided a substantial reduction of the manual labor related to the physical preparation, classification and separation of documents in the operations of several companies. Although these advances have achieved tangible results in optimizing document-centric workflows, there is still a gap in the classification and tracking of&nbsp;paper documents. This is especially important in countries where physical documents are subject to different retention policies based on legal requirements: a certain number of documents must be retained physically for a varying number of years, based on the document type determined by classification.</p>



<p>Some capture applications are able to identify document types using barcodes at scan time and, using auto-classification or the barcode content, apply different rules to separate and classify the document images. In the digital world everything is straightforward and works pretty well, but if you need to track and trace the same documents&nbsp;physically&nbsp;until the final archiving step is completed, it always becomes a challenge, especially in a large-scale operation with tons of documents.</p>



<p>RFID is an acronym for Radio Frequency Identification, and it is also used as a generic term denoting the ability to identify an object remotely. It means that the information is transmitted via radio waves and does not require line of sight or contact between the reader and the tags. RFID technology provides great benefits through the combined use of a barcode, a microchip and an antenna, encapsulated in a tag, also called a&nbsp;<strong>smart label</strong>. The radio waves are sent from a reader and then picked up by a tag that signals back its unique number, called the&nbsp;<strong>EPC (Electronic Product Code)</strong>. The presence of a tagged folder or document is seen at a reader’s specific location, and this information can be reported to the tracking software, which updates a records management database, an ECM repository or even a document capture platform.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2014/12/Rfid_Sample_Tag_inlay_detail.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/12/Rfid_Sample_Tag_inlay_detail-1024x510.png" alt="" class="wp-image-832"/></a></figure>



<p>Due to the high reading speed and the capacity to identify an item even without visual access to the document, RFID technology makes it possible to quickly read a stack of tagged folders and documents even when they are stored inside a cardboard box. In this way, it is possible to perform an automatic check-in of a ton of documents without any human intervention. Additionally, it is also possible to inspect document containers like boxes and folders at the receiving and delivery points, checking whether all the required document classes are really there.</p>



<p>Just like barcodes, RFID technology allows storing a freely encoded data schema in the EPC memory; on the other hand, it is always recommended to use a standard in order to avoid a proprietary encoding. There are now a few international standards available for different types of objects, such as fixed assets, returnable assets, trade items, documents, etc. These standards have been developed by GS1, an international non-profit association, aiming at efficiency improvements, higher item visibility and interoperability across the whole chain.</p>



<p>The EPC data schema GDTI (Global Document Type Identifier), specified by GS1, was developed to identify documents, including the class or type of each document. A GDTI can be encoded in a 1D/2D barcode, stored in an EPC memory or printed directly on the document. Companies can use the GDTI as a method for the identification and registration of documents and related events. They can also use the GDTI for information retrieval, document tracking, electronic archiving or even to prevent fraud and document falsification.</p>
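<p>GS1 keys such as the GDTI share the standard GS1 mod-10 check digit, so a key read from a tag or barcode can be validated in software. A minimal sketch of that algorithm (the sample digits in the comment are from a generic GS1 example, not a real document key):</p>

```python
# Standard GS1 mod-10 check digit (used by GDTI, GTIN/EAN, SSCC, ...):
# weights 3, 1, 3, 1, ... are applied starting from the rightmost data digit.
def gs1_check_digit(data_digits: str) -> int:
    total = sum(int(d) * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(data_digits)))
    return (10 - total % 10) % 10

# e.g. for the 12 data digits "400638133393" the check digit is 1,
# so the complete 13-digit key ends in ...3931.
```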



<p>All these standards were specified based on the EPCglobal framework, which describes the relation between different RFID components such as hardware, software and data interfaces. In the context of this article, we are referring to passive RFID. This technology does not use batteries and works in the UHF band. Based on this specification, objects can be identified not only in the near field but also in the far field, at distances of up to 10 meters, depending on the type of the object, the tag, the antenna and the reader.</p>



<p><strong>The combination of auto-classification technologies and RFID-tagged documents makes it possible to match the classification results physically and logically.</strong> Given the physical document class, it is possible to define and choose the most appropriate document container (e.g. box, folder, etc.) and pass the parent document class to the image/content classification engine to perform a deeper classification.</p>



<p>At the end of the process, we can match the results and track both versions (image and paper) during the entire flow. This can be achieved without physical contact with the paper document. Imagine that you receive a box of paper from a remote location for archiving. If the documents have been classified and RFID encoded, then within a second you can check the completeness of the physical archive and stow the documents away. If all of this happens within your capture process automation system, you have a tight combination of auto-classification with the physical sorting of paper – solving this last obstacle to full automation.</p>



<p>There are still several other interesting RFID use cases for documents, such as automatic check-in/check-out, hunting, inventory, exit detection, etc., which will become more and more popular very soon with the decreasing cost of the technology and the advance of new concepts like IoT (Internet of Things).</p>



<p>Links:<br>RFID:&nbsp;<a href="http://en.wikipedia.org/wiki/Radio-frequency_identification">http://en.wikipedia.org/wiki/Radio-frequency_identification</a><br>GS1:&nbsp;<a href="http://www.gs1.org/about/overview">http://www.gs1.org/about/overview</a><br>EPCGlobal Framework:&nbsp;<a href="http://www.gs1.org/gsmp/kc/epcglobal">http://www.gs1.org/gsmp/kc/epcglobal</a><br>GDTI:&nbsp;<a href="http://www.gs1.org/barcodes/technical/idkeys/gdti">http://www.gs1.org/barcodes/technical/idkeys/gdti</a></p>



<div class="wp-block-image"><figure class="alignleft"><a href="http://www.skilja.de/wp-content/uploads/2014/12/Claudio-Chavez.jpg"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/12/Claudio-Chavez-150x150.jpg" alt="" class="wp-image-793"/></a></figure></div>



<p>####<br><a href="http://br.linkedin.com/pub/claudio-chaves-jr-pmp-ecmm/7/1a6/43" target="_blank" rel="noreferrer noopener">Cláudio Chaves</a>&nbsp;is Managing Director at TCG Brasil in Santana de Parnaíba, São Paulo, Brazil. He has many years of rich experience with document processing applications, especially in the South American market.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>What is a good classifier? (2/4)</title>
		<link>https://skilja.com/de/what-is-a-good-classifier-2-4/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Tue, 16 Dec 2014 13:18:00 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/what-is-a-good-classifier-2-4/</guid>

					<description><![CDATA[In our small series about classification quality we have used the precision-recall graph to show the difference between a very good and a so-so classifier in a recent post that you can find here. This graphical representation is very common and easy to understand. Apart from the absolute numbers for the recall (e.g. 85% correctly classified [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>In our small series about classification quality we used the precision-recall graph to show the difference between a very good and a so-so classifier in a recent post that you can find <a href="https://skilja.com/what-is-a-good-classifier-1-4/">here</a>. This graphical representation is very common and easy to understand. Apart from the absolute numbers for the recall (e.g. 85% correctly classified documents), it is also important to understand how classification quality can be influenced by a threshold applied to the classification result. This can already be seen in the precision-recall graph if you know how it should look. But it becomes much more obvious if the errors are displayed as a function of the recall. We call this diagram the <strong>inverted-precision</strong> graph, which is described in this part 2 of our little series.</p>



<p><strong>2. The Inverted-Precision Graph</strong></p>



<p>The graph can easily be created by the same type of benchmark test that is used for the precision-recall graph: either by measuring the classification quality against a “golden” test set, or by simply using the train-test split method, where a certain percentage of the training set (e.g. 10%) is used for testing while the remaining 90% are used for training. Of course this is repeated iteratively (in this case 10 times) until each document has been classified exactly once.</p>
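<p>The iterative train-test split described above is essentially 10-fold cross-validation, which can be sketched as follows (function and variable names are my own, not from any particular toolkit):</p>

```python
# 10-fold cross-validation: every document lands in the test set exactly once.
import random

def ten_fold_splits(documents, k=10, seed=0):
    docs = list(documents)
    random.Random(seed).shuffle(docs)          # fixed seed for reproducibility
    folds = [docs[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test
```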



<p>The first curve in the inverted-precision graph plots the error rate as a function of the read rate (recall). Naturally, the higher the recall, the higher the error rate you need to accept. The error rate is shown on the left vertical axis. The graph also allows you to determine exactly the achievable recall and the required threshold for a desired error rate: on the right y-axis the threshold is plotted as a function of recall. By connecting the lines it is easy to see where we need to put the threshold to achieve a predefined error rate. The animated graphic below shows this step by step:</p>
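<p>Given the per-document results of such a benchmark run as (confidence, correct) pairs, both curves can be computed by sweeping the threshold down over the ranked confidences. A minimal sketch (my own naming, not from any particular toolkit):</p>

```python
# Inverted-precision data: error rate and threshold as functions of recall.
def inverted_precision_curve(results):
    """results: (confidence, is_correct) per classified document.
    Returns (recall, error_rate, threshold) triples, one per accepted document."""
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    n, errors, points = len(ranked), 0, []
    for i, (conf, correct) in enumerate(ranked, start=1):
        if not correct:
            errors += 1
        # accepting the top i documents: recall i/n, errors/i among the accepted,
        # and conf is the threshold that realizes exactly this operating point
        points.append((i / n, errors / i, conf))
    return points
```

<p>Reading the returned triples from the desired error rate backwards gives exactly the construction shown in the animated graphic: the highest recall whose error rate is still acceptable, together with the threshold that achieves it.</p>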



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2014/12/InvertedPrecision.gif"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/12/InvertedPrecision.gif" alt="" class="wp-image-764"/></a></figure>



<p>Inverted-Precision Graph and Relation between Error and Threshold</p>



<p>The inverted-precision graph is especially suited to uncover the weaknesses of classifiers related to thresholding.</p>



<p>In a real-life example we have again used the well-known Reuters-21578 Apte test set. This set was assembled many years ago (available at&nbsp;<a href="http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html" target="_blank" rel="noreferrer noopener">http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html</a>). It includes 12,902 documents for 90 classes, with a fixed split between test and training data (3,299 vs. 9,603). The image shows the graph for a very good, linear classifier.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2014/12/Inverted-Precision-Skilja.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/12/Inverted-Precision-Skilja-1024x587.png" alt="" class="wp-image-767"/></a></figure>



<p>The second image shows the graph for a standard classifier. This is the same data as in the first post of the series, but when you compare the two you see that the differences between a good and a mediocre classifier become much more obvious in this representation.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2014/12/Inverted-Precision-Bayes.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/12/Inverted-Precision-Bayes-1024x635.png" alt="" class="wp-image-768"/></a></figure>



<p>The discrepancy mainly has to do with the normalization of results. Even if you accept that the absolute recall of the weak classifier is low, the results should at least be normalized in such a way that an error rate below 5% can somehow be achieved. This is obviously not the case. The inverted-precision graph is a good way to uncover this fact, which might be due either to a weak classifier or to an incomplete training set. Therefore a good classification toolkit should always provide the means to create and visualize the results in this way as well.</p>



<p>There are good technical reasons in the algorithms to explain the differences above, but this should not be the topic of this blog. More important for users is to understand that there are significant differences and that they become visible in the graphical evaluation. In an upcoming article we will drill even deeper and show the effect of classifier quality on the separation of selected pairs of classes. Stay tuned!</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>OCR on Historical Documents</title>
		<link>https://skilja.com/de/ocr-on-historical-documents/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Wed, 03 Dec 2014 13:20:00 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/ocr-on-historical-documents/</guid>

					<description><![CDATA[Skilja is proud to announce that we have received a grant from the European Union supporting a research and development project to improve OCR on historical documents. The grant is provided through the Eurostars program of the European Union. This program supports research-performing small and medium enterprises, which develop innovative products, processes and services, to [&#8230;]]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-image"><figure class="alignleft"><a href="http://www.skilja.de/wp-content/uploads/2014/12/Eurostars.jpg"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/12/Eurostars-300x244.jpg" alt="" class="wp-image-718"/></a></figure></div>



<p>Skilja is proud to announce that we have received a grant from the European Union supporting a research and development project to improve OCR on historical documents. The grant is provided through the Eurostars program of the European Union. This program supports research-performing small and medium enterprises that develop innovative products, processes and services to gain a competitive advantage. It is a transnational program, where projects have partners from two or more Eurostars countries. Thanks to this international collaboration, SMEs can more easily gain access to new markets. Please see&nbsp;<a href="https://www.eurostars-eureka.eu/" target="_blank" rel="noreferrer noopener">here</a>&nbsp;for more details on the program.<br>Skilja has won this grant together with our partner company Lumex in Norway (<a href="http://www.lumex.no/heritage-solutions.html" target="_blank" rel="noreferrer noopener">www.lumex.no</a>). It will support our own investment in research activities and will run for three years. The Eurostars evaluation process selected Lumex’ and Skilja’s proposal as a top-5% technology and business model winner amongst hundreds of European SMEs representing all industries and sciences.</p>



<p>The goal of the three-year project is to improve the recognition of difficult historical documents. An example of such documents is given below: a typewritten document from a correspondence archive. The technology we develop also extends to standard and Gothic (Fraktur) fonts. The main target is the digitization of old archives and newspapers.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2014/12/Typewriter-Example.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/12/Typewriter-Example-1024x266.png" alt="" class="wp-image-729"/></a></figure>



<p>Example of a difficult historical typewriter document</p>



<p>This improved conversion will give researchers better access to cultural heritage and preserve historical content for the future digital world. The project builds upon a current version of an accuracy extension for existing OCR that has been created by Lumex and Skilja. It will use advanced image processing and classification technologies to further improve the results.</p>



<div class="wp-block-image"><figure class="alignleft"><a href="http://www.skilja.de/wp-content/uploads/2014/12/Eureka-EU.jpg"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/12/Eureka-EU-300x150.jpg" alt="" class="wp-image-721"/></a></figure></div>



<p><em>This project is funded by the Federal Ministry of Education and Research (BMBF) of Germany and the European Union under the project OptO-Heritage and grant number 01QE140.</em></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Visual Classifiers From Random Images</title>
		<link>https://skilja.com/de/visual-classifiers-from-random-images/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Mon, 03 Nov 2014 13:21:47 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Grundlagen]]></category>
		<guid isPermaLink="false">https://skilja.com/visual-classifiers-from-random-images/</guid>

					<description><![CDATA[Now this is an interesting experiment that leads us very close to the touch point between machine classification and human imaginations. As described in previous posts, auto-classification algorithms are using features that are extracted from the objects to be classified (images or text) and are represented in a feature space. Classification can be described as [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Now this is an interesting experiment that takes us very close to the touch point between machine classification and human imagination.</p>



<p>As described in previous posts, auto-classification algorithms use features that are extracted from the objects to be classified (images or text) and represented in a feature space. Classification can be described as finding the correct separating plane (in many dimensions) between the features of different objects. A group of researchers from MIT (<a href="http://arxiv.org/find/cs/1/au:+Vondrick_C/0/1/0/all/0/1">Carl Vondrick</a>,&nbsp;<a href="http://arxiv.org/find/cs/1/au:+Pirsiavash_H/0/1/0/all/0/1">Hamed Pirsiavash</a>,&nbsp;<a href="http://arxiv.org/find/cs/1/au:+Oliva_A/0/1/0/all/0/1">Aude Oliva</a>,&nbsp;<a href="http://arxiv.org/find/cs/1/au:+Torralba_A/0/1/0/all/0/1">Antonio Torralba</a>) has now used an interesting approach to get a glimpse of what this feature space might actually look like in our minds. They generated random white noise in the feature space and inverted this noise into actual images. These images were then shown to humans, who were asked whether they resembled certain well-known objects. The results are quite fascinating:</p>



<div class="wp-block-image"><figure class="aligncenter"><a href="http://www.skilja.de/wp-content/uploads/2014/11/Acquiring-Visual-Classifiers-from-Human-Imagination-1.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/11/Acquiring-Visual-Classifiers-from-Human-Imagination-1.png" alt="" class="wp-image-700"/></a></figure></div>



<p>All the image patches on the left are just noise. Many thousands of them were shown to online workers, who were asked to find the ones that look like cars. See the full scientific paper&nbsp;<a href="http://arxiv.org/pdf/1410.4627v1.pdf" target="_blank" rel="noreferrer noopener">here</a>.</p>



<p>Most of the time these random images will appear to people as exactly that: random. But every now and then somebody will feel that an image does remind them of a car. That image is set aside, and the process repeats. After assessing 100,000 images in this way, we end up with a set of essentially random pictures that remind people of cars. Taking the average of these reveals something interesting: the resulting image does indeed look like a blurry car, not a specific kind of car but a very general template of one.</p>



<p>Mathematically speaking, this noise-driven method estimates the decision boundary that the human visual system uses for recognition.</p>
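<p>For a linear observer and Gaussian noise, this estimate can be simulated directly: average the noise samples the observer accepts, and the average aligns with the observer's internal template. This reverse-correlation sketch uses a hypothetical linear classifier as a stand-in for the human observer; it is not the paper's human data or feature inversion:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden "observer": a linear classifier with an unknown template w.
d = 256                          # assumed dimensionality of the feature space
w = rng.normal(size=d)
w /= np.linalg.norm(w)

def observer_accepts(x, threshold=2.0):
    # The observer says "looks like a car" when the projection of the
    # noise sample onto its internal template exceeds the threshold.
    return w @ x > threshold

# Show 100,000 random noise samples and keep the accepted ones.
samples = rng.normal(size=(100_000, d))
mask = np.array([observer_accepts(x) for x in samples])
accepted = samples[mask]

# The average of the accepted samples recovers the direction of w.
template = accepted.mean(axis=0)
template /= np.linalg.norm(template)
print(float(template @ w))       # cosine similarity, close to 1.0
```

<p>Only a few percent of the samples are accepted, yet their mean points almost exactly along the hidden template, which is why averaging the "car-like" noise patches produces a blurry prototype car.</p>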



<div class="wp-block-image"><figure class="alignleft"><a href="http://www.skilja.de/wp-content/uploads/2014/11/Acquiring-Visual-Classifiers-from-Human-Imagination-2.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/11/Acquiring-Visual-Classifiers-from-Human-Imagination-2.png" alt="" class="wp-image-703"/></a></figure></div>



<p>My favorite example among those tested is the fire hydrant that emerges from the random white-noise images.</p>



<p>Now, as the random noise was generated in the feature space and not in the images themselves, the researchers can deduce which features actually lead to the recognition of objects, and hence gain an understanding of how human object recognition actually works. Humans have some remarkable capabilities to recognize objects that they have never seen, touched or smelled before. This understanding of the actual feature selection in human minds will help us in the future to derive new classifiers that more closely resemble the way we all do object recognition, including its human bias. That bias is one of the most interesting results of the study: the object that emerges depends on the cultural background of the people selecting the random images. For example, when online workers from India were asked to find a sports ball, a red circular object emerged, because the most popular sport in India is cricket, which is played with a red ball. Ask the same question of US workers and an orange ball appears: think of football or basketball.</p>



<p>We encounter the same human bias in our daily work classifying documents in big enterprises. Every person will classify a document set a little differently from their co-workers, leading to a lot of inconsistencies. These can be overcome with auto-classification, which will always make predictable and consistent decisions.</p>



<p>The research described here provides a very interesting insight into the nature of the human mind. It will also allow us to work on refined methods of classification that more closely resemble the way humans make decisions.</p>



]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
