<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Klassifikation | Skilja</title>
	<atom:link href="https://skilja.com/de/kategorie/klassifikation/feed/" rel="self" type="application/rss+xml" />
	<link>https://skilja.com/de/</link>
	<description>Document Understanding – Deep Learning</description>
	<lastBuildDate>Mon, 24 Nov 2025 11:32:25 +0000</lastBuildDate>
	<language>de</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://skilja.com/wp-content/uploads/2021/06/cropped-skilja_logo_transparent_02-32x32.png</url>
	<title>Klassifikation | Skilja</title>
	<link>https://skilja.com/de/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Auto Classification and Bias</title>
		<link>https://skilja.com/de/auto-classification-and-bias/</link>
		
		<dc:creator><![CDATA[skiljaadmin]]></dc:creator>
		<pubDate>Tue, 22 Jul 2025 12:01:16 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Grundlagen]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<guid isPermaLink="false">https://skilja.com/auto-classification-and-bias/</guid>

					<description><![CDATA[Personal bias and individual opinions are a big issue in standardized business processing if they happen to influence the outcome of a process and the decisions made. Nobody wants to be subject to random changes in the outcome of a personal request – and yet it happens. Because humans have a bias in how they see facts, based on their education, cultural background and even the mood they happen to be in at a certain time in the week. So in addition to different persons making different decisions you can even expect that the same person makes different decisions during the week. You just look differently at a task on Monday morning than on Friday evening. The reason is so-called priming which happens to all of us day-by-day through our experience, knowledge, physical condition, context and a lot of other small factors.]]></description>
										<content:encoded><![CDATA[
<p>Personal bias and individual opinions are a big issue in standardized business processing if they happen to influence the outcome of a process and the decisions made. Nobody wants to be subject to random changes in the outcome of a personal request – and yet it happens, because humans have a bias in how they see facts, based on their education, cultural background and even the mood they happen to be in at a certain time of the week. So in addition to different people making different decisions, you can even expect the same person to make different decisions during the week. You just look at a task differently on Monday morning than on Friday evening. The reason is so-called priming, which happens to all of us day by day through our experience, knowledge, physical condition, context and many other small factors.</p>



<p>In a recent article on the <a href="http://www.skilja.de/2015/the-meaning-of-words/">meaning of words</a> we have shown how the sound of words influences our perception. There are many more linguistic associations that influence the way we think and behave, and which introduce bias. If, for example, I tell you that I am driving north across hilly terrain, would you expect the trip to be mostly uphill or downhill? In fact most people associate movement north with uphill and south with downhill. An interesting study by the psychologists <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=963159">Leif D. Nelson from UC San Diego and Joseph Simmons from Yale</a> shows that these associations can actually be measured and produce some strange biases: people think it will take longer to travel north than south, that it will cost more to ship to a northern than to a southern location, and that a moving company will charge more for a northward move than for a southward one. A <a href="http://spp.sagepub.com/content/2/5/547">similar study</a> concluded that people assume property is more valuable when it sits in the northern part of town. Of course these opinions stem from the decision of the ancient Greeks to draw the map of the world with north at the top. But it also shows clearly how much we are biased by our language – and north/south is only one of many linguistic associations we are exposed to.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/06/compass-152121_1280.png"><img decoding="async" class="wp-image-946" src="http://www.skilja.de/wp-content/uploads/2015/06/compass-152121_1280.png" alt="North-South Compass" /></a></figure>



<p>Ancient mapmakers introduced north and south unwittingly, but lawyers do have an intention when they describe car accidents. While the defense might call a car accident “contact”, the plaintiff might say one car “smashed” into the other. Elizabeth Loftus and John Palmer showed in a classic experiment that these labels really matter. They had a group of students watch the same series of traffic accidents. The students were then asked to estimate the speed of the cars when the accident occurred. When the scene was described as the cars having “contacted” one another, the students’ average speed estimate was thirty-two miles an hour, whereas it was forty miles an hour when they were told that the cars “smashed” into one another. In another experiment, 14% of participants incorrectly remembered seeing shattered glass when told that the cars “hit” one another, whereas 32% of participants made the same error when told the cars “smashed” into one another. This shows that even a single word can change how people remember an event they witnessed only minutes earlier – making it very clear how priming can bias our decisions.</p>



<p>This brings us back to auto-classification. A classifier like the Skilja Content Classifier is trained with representative samples that are collected by a group of people. Applied to a batch of documents, it will then make the same decision over and over again. It represents – through machine learning – the average opinion about the content of a document and will repeat it without tiring. The same on Monday morning as on Friday evening – at a speed of several hundred thousand pages per hour. It will make errors – based on statistics – but no more than a human. And the errors are reproducible and can be corrected if necessary. If you have ever thought about compliance, this is a good example. Because compliance does not say that you cannot make errors. It says that you need a reproducible, documented procedure for how you store and treat your documents. Auto-classification can help to achieve this goal. It is a great tool for boosting productivity. But it is even more helpful for avoiding bias, irreproducible results and non-compliance.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>On the Benefits of Page Classification</title>
		<link>https://skilja.com/de/on-the-benefits-of-page-classification/</link>
		
		<dc:creator><![CDATA[skiljaadmin]]></dc:creator>
		<pubDate>Thu, 29 Sep 2022 14:37:47 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/on-the-benefits-of-page-classification/</guid>

					<description><![CDATA[Classification deals with the categorization of objects. In our process automation and digitization world, we often think of the objects as&#160;complete documents&#160;that need to be classified. Of course, it is important to understand what the type of a document is and automatic classification can determine exactly this. But documents in a business context normally are [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Classification deals with the categorization of objects. In our process automation and digitization world, we often think of the objects as&nbsp;complete documents&nbsp;that need to be classified. Of course, it is important to understand what the type of a document is, and automatic classification can determine exactly this. But documents in a business context are normally complex and not homogeneous. When you receive a multipage document, you typically browse through it to&nbsp;<strong>see what is in it</strong>&nbsp;and understand&nbsp;<strong>what it is about</strong>. A document in an envelope or a manila folder that lands on your desk may consist of an opening letter, some notes, then the really important document – for example a court order – and perhaps some standard forms attached. To understand which process to initiate and what to do with the document, you will therefore look at the pages and decide how to determine from their content what this is all about. Two or more processes may even originate from different pages within one document, where you might need to answer a request from one page and execute a payment from another.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2017/09/Screenshot-2017-09-29-um-17.06.11.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2017/09/Screenshot-2017-09-29-um-17.06.11-1024x547.png" alt="Page Classification Car Insurance Claim - Laera Classifier" class="wp-image-1217"/></a></figure>



<p>Laera Classifier – Page Classification for Claims Processing</p>



<p>This is exactly what page classification in document understanding can provide automatically. Instead of looking at the document as a whole, the algorithm classifies page by page and derives decisions from the results. This is much more granular than taking only the complete document. And it is different from automatic document separation, which physically splits the document. Of course, separation is another option based on page results, but it is error-prone and risky, as the document may be ripped apart incorrectly. Often this is not necessary at all; it is sufficient to structure and digitize the document page-wise to achieve the intended process goals.</p>



<p>Page classification requires a solid infrastructure and an understanding of physical documents. We provide this with the Laera Classification Framework, which inherently understands structured documents. Going even further would be paragraph and sentence classification, but that is a topic for another article. In Laera you can simply define a page classification scheme alongside the document classification. And you can even use the page classification results to determine the document type (e.g. by majority rule or by priority rule).</p>
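<p>The two aggregation rules mentioned above could be sketched in a few lines of plain Python. The function names and class labels here are hypothetical illustrations, not Laera code – Laera's actual aggregation is configurable and not shown in this article:</p>

```python
from collections import Counter

def document_type_by_majority(page_classes):
    """Majority rule sketch: the most frequent page class wins."""
    return Counter(page_classes).most_common(1)[0][0]

def document_type_by_priority(page_classes, priority):
    """Priority rule sketch: the highest-priority class present wins."""
    present = set(page_classes)
    for cls in priority:
        if cls in present:
            return cls
    return None

pages = ["Anschreiben", "Gutachten", "Gutachten", "Foto"]
print(document_type_by_majority(pages))                                 # Gutachten
print(document_type_by_priority(pages, ["Abtretungserklärung", "Gutachten"]))  # Gutachten
```

<p>With the priority rule, a rare but decisive page type (say, a declaration of assignment) can override more numerous but less significant pages.</p>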



<p>An example of a real-life project that has been in production for more than a year is shown above.</p>



<p>In this case the customer receives thousands of car insurance claims per day. These are documents of 10 to 50 pages that contain many different kinds of pages, for example:</p>



<ul class="wp-block-list"><li>Covering letter or e-mail (“Anschreiben”)</li><li>Attorney’s letter</li><li>Expertise (“Gutachten”)</li><li>Calculation of repair (“Kalkulation”)</li><li>Declaration of Assignment (“Abtretungserklärung”)</li><li>Photos</li></ul>



<p>Laera Classifier is able to automatically determine all of these types at a rate in the high nineties (percent). Photo detection tags all photos and hides them from the following recognition steps, as they would otherwise unnecessarily block OCR and extraction. The page classification results make it possible to structure and reorder the document in an optimal way for subsequent extraction of data from the different page types. Being able to define specific extraction for each page type leads to a significant increase in extraction quality and speed. It also greatly eases the task for the clerks in the subsequent process steps, as they already receive a structured document (in this case an assembled PDF) with tags, always in the same order.</p>
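<p>As a toy illustration of this restructuring step – reordering classified pages into a fixed target order while tagging photo pages so OCR can skip them – one might write something like the following. The function, the target order and the labels are made-up examples, not part of the Laera API:</p>

```python
def restructure(pages, order, hide=("Foto",)):
    """Sketch: reorder classified pages into a fixed target order and
    drop page classes (e.g. photos) that should be hidden from OCR.
    `pages` is a list of (page_number, page_class) tuples."""
    kept = [(n, c) for n, c in pages if c not in hide]
    rank = {cls: i for i, cls in enumerate(order)}
    # unknown classes sort last; ties keep original page order
    return sorted(kept, key=lambda pc: (rank.get(pc[1], len(order)), pc[0]))

pages = [(1, "Anschreiben"), (2, "Foto"), (3, "Kalkulation"), (4, "Gutachten")]
order = ["Anschreiben", "Gutachten", "Kalkulation"]
print(restructure(pages, order))  # [(1, 'Anschreiben'), (4, 'Gutachten'), (3, 'Kalkulation')]
```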



<p>In this way page classification plays an important role in streamlining the process, bringing it a bit closer to the way a person would look at the document and work with it.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Document Separation Revisited</title>
		<link>https://skilja.com/de/document-separation-revisited/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Thu, 08 Sep 2022 09:50:54 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/document-separation-revisited/</guid>

					<description><![CDATA[One of the frequently overlooked and really difficult problems in document automation, which is also really annoying in daily processing, is the automatic separation of a stack of documents into single meaningful documents and assignment to a document class. The goal would be to simply scan the whole stack and have it separated by an intelligent algorithm. Fortunately this is readily available today from the Skilja technology stack as a built in feature into the Laera classifier. This does not say it is easy. It requires quite some experience and infrastructure to manage several interdependent steps of classification and separation in a stable and reliable way. This is what Laera provides out of the box.]]></description>
										<content:encoded><![CDATA[<header class="entry-header"></header>
<div class="entry-content">
<p>One of the frequently overlooked and really difficult problems in document automation, which is also really annoying in daily processing, is the automatic separation of a stack of documents into single meaningful documents and their assignment to a document class. In traditional scanning processes this is often achieved by manually preparing the paper and sticking a barcode as a document separator on each first page. But this is labor-intensive and error-prone. In addition, as we go more and more digital – even with paper-based processes – the processing facility normally no longer has access to the paper. So the goal would be to simply scan the whole stack and have it separated by an intelligent algorithm.</p>
<p>Fortunately this is readily available today, for example from the Skilja technology stack as a feature built into the Laera classifier. That does not mean it is easy. It requires quite some experience and infrastructure to manage several interdependent steps of classification and separation in a stable and reliable way. This is what Laera provides out of the box.</p>
<p>How does document structuring work in principle? In exactly the same way (our credo!) as a human would do it: go through the stack page by page, determine what page type it is, and whether it is related to the previous page or a new topic/form starts. Then check page numbers, if present, for confirmation. If in doubt, go back one or a few pages to double-check, and then make your decision to separate.</p>
<figure class="wp-block-image size-large">
<p><div id="attachment_2117" style="width: 1034px" class="wp-caption alignnone"><img fetchpriority="high" decoding="async" aria-describedby="caption-attachment-2117" class="wp-image-2117" src="https://skilja.com/wp-content/uploads/Document-Separation-1024x479.png" sizes="(max-width: 1024px) 100vw, 1024px" srcset="https://skilja.com/wp-content/uploads/Document-Separation-1024x479.png 1024w, https://skilja.com/wp-content/uploads/Document-Separation-300x140.png 300w, https://skilja.com/wp-content/uploads/Document-Separation-768x359.png 768w, https://skilja.com/wp-content/uploads/Document-Separation-500x234.png 500w, https://skilja.com/wp-content/uploads/Document-Separation.png 1229w" alt="Laera Document Separation" width="1024" height="479" /><p id="caption-attachment-2117" class="wp-caption-text">Laera Document Separation</p></div></figure>
<p>In an AI classifier like Laera, this is built into a sequence of algorithms. The system is trained on a sample set that is already correctly separated. Laera learns for each page whether it is a first, middle, end or single page. The user does not have to specify this explicitly, as the Laera AI finds it out automatically from the samples and hides this complexity from the users. The training interface just requires you to drop the individual documents into the training set. It is not required to specify an exact number of pages (range) for each document type; Laera automatically takes into account that this can vary. However, if you know the allowed page count, you can also restrict it – for example for forms that are always a single page.</p>
<p>Laera will then learn the structure and apply it at runtime to the whole stack of unseparated single pages. Each page is analyzed. A second classifier (we could call it a “meta-classifier”) then takes these results and finds the most probable separation based on the trained model. So even if a first page has not been identified as a first page, there is a chance that the meta-classifier will still see it as more probably a first page and separate correctly. A third classifier then determines the document type for the separated documents. As usual in Laera, all this is very fast: separating a stack of 200 pages with 150 document types takes less than 30 seconds in total.</p>
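<p>The interplay of page classifier and meta-classifier is of course far more sophisticated than any toy example, but the basic idea – turning per-page “first page” evidence into document boundaries – can be sketched as follows. The <code>separate</code> function and its plain probability threshold are assumptions for illustration, not Laera's actual algorithm:</p>

```python
def separate(first_page_probs, threshold=0.5):
    """Toy separation sketch: start a new document wherever the page
    classifier considers 'first page' sufficiently probable.
    Returns lists of 0-based page indices, one list per document."""
    docs, current = [], []
    for i, p in enumerate(first_page_probs):
        if p >= threshold and current:
            docs.append(current)   # close the previous document
            current = []
        current.append(i)
    if current:
        docs.append(current)
    return docs

# Probability that each scanned page is a "first page"
probs = [0.9, 0.1, 0.2, 0.8, 0.7, 0.1]
print(separate(probs))  # [[0, 1, 2], [3], [4, 5]]
```

<p>A real meta-classifier would instead search for the globally most probable segmentation, which is what allows it to recover first pages the page classifier missed.</p>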
<p>The example below shows the results of separating a mortgage application stack with 153 pages and classifying into 244 document types. The horizontal lines indicate the found separators, and the “New page” column shows the new numbering of pages in the separated documents.</p>
<figure class="wp-block-image size-full">
<p><div id="attachment_2107" style="width: 1740px" class="wp-caption alignnone"><a href="https://skilja.com/wp-content/uploads/Laera-Separation-Result.png"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2107" class="wp-image-2107" src="https://skilja.com/wp-content/uploads/Laera-Separation-Result.png" sizes="(max-width: 1730px) 100vw, 1730px" srcset="https://skilja.com/wp-content/uploads/Laera-Separation-Result.png 1730w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-300x164.png 300w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-1024x561.png 1024w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-768x420.png 768w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-1536x841.png 1536w, https://skilja.com/wp-content/uploads/Laera-Separation-Result-500x274.png 500w" alt="Laera Mortgage Separation Result" width="1730" height="947" /></a><p id="caption-attachment-2107" class="wp-caption-text">Laera Mortgage Separation Result (click on image to see full screen)</p></div></figure>
<p>The detail view of the separation result for one page nicely shows how the separation algorithm came to a decision for the first page of a URLA, supported in addition by the page count detected on the page (“Page 1 of 9”).</p>
<figure class="wp-block-image size-full">
<p><div id="attachment_2108" style="width: 640px" class="wp-caption alignnone"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-2108" class="wp-image-2108" src="https://skilja.com/wp-content/uploads/Laera-Separation-Details.png" sizes="(max-width: 630px) 100vw, 630px" srcset="https://skilja.com/wp-content/uploads/Laera-Separation-Details.png 630w, https://skilja.com/wp-content/uploads/Laera-Separation-Details-300x137.png 300w, https://skilja.com/wp-content/uploads/Laera-Separation-Details-500x229.png 500w" alt="Laera Mortgage Separation Details" width="630" height="288" /><p id="caption-attachment-2108" class="wp-caption-text">Laera Mortgage Separation Details</p></div></figure>
<p>Training of this model takes about 10 minutes, so it is easy to test and refine it frequently. <strong>All this can be done by the end user and does not need an AI engineer.</strong></p>
<p>Quality is very important, and Laera makes sure to bias towards precision so that no errors are made, allowing the workflow to show unconfident separations to a user for decision. In a project done 18 months ago for a large Swiss insurance company, Laera achieved an <strong>automation rate of 87% with an error rate (false positives) of 0.14%</strong>. Of course each separation result still needs to be checked, and the correction results are used by Laera online learning to improve the model.</p>
<p>But overall the reduction of work in separation and the increase in quality are clearly measurable and yield huge benefits. All this is available either on premise or as a cloud service, to be used through RPA or a RESTful API in any backend. Let us know if you are interested and we can show you a demo. A setup with your own documents is also achievable with little effort. Contact is info (at) skilja.com.</p>
</div>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Confusion Matrix</title>
		<link>https://skilja.com/de/confusion-matrix/</link>
		
		<dc:creator><![CDATA[skiljaadmin]]></dc:creator>
		<pubDate>Wed, 10 Aug 2022 11:01:43 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/confusion-matrix/</guid>

					<description><![CDATA[Understanding the quality of an automatic classification system is crucial for its acceptance and any attempt to improve it over time. Quality means that we need to look at errors and at the recognition rate. In classification terms these values are called&#160;precision&#160;and&#160;recall. Precision gives the percentage of documents that have been classified correctly with respect [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Understanding the quality of an automatic classification system is crucial for its acceptance and for any attempt to improve it over time. Quality means that we need to look at errors and at the recognition rate. In classification terms these values are called&nbsp;precision&nbsp;and&nbsp;recall. Precision gives the percentage of documents that have been classified correctly with respect to all documents assigned by the classifier (a/(a+b)); recall is the number of documents classified into a class with respect to the total number of documents that should be in this class (a/(a+c)). In a previous post&nbsp;<a href="http://www.skilja.com/2012/measuring-classification-quality/">(Measuring Classification Quality)</a>&nbsp;we have already discussed these values and how important they are. It is easy to depict them in a graphical visualization:</p>
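<p>Expressed as code, the two formulas look like this. The sets a, b and c follow the definitions above (correctly assigned, wrongly imported, lost to other classes); <code>precision_recall</code> is an illustrative helper, not part of the SCC API:</p>

```python
def precision_recall(a, b, c):
    """Precision and recall from the three document sets:
    a = documents correctly assigned to the class,
    b = documents wrongly imported into the class,
    c = documents belonging to the class but assigned elsewhere."""
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    return precision, recall

# Example: 90 correct, 10 imported, 30 exported
p, r = precision_recall(90, 10, 30)
print(p, r)  # 0.9 0.75
```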



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/08/Precision-Recall-1.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2015/08/Precision-Recall-1.png" alt="" class="wp-image-970"/></a></figure>



<p>While these values might appear a little abstract, their advantage is that they are independent of the size of the set. But it might be more intuitive to talk about the actual number of documents that are&nbsp;imported&nbsp;into a class from other classes (set b) or&nbsp;exported&nbsp;and lost from the class (set c). It then becomes obvious that recall and precision are related and have the same value if no threshold is applied – as every document that is imported into a class must have been lost from another class. It also makes it easy to look at particular problem classes with a lot of imports (attractors) or exports (donors).</p>



<p>For a classification system these values can be depicted in a so-called confusion matrix (also known as a&nbsp;<a href="https://en.wikipedia.org/wiki/Contingency_table">contingency table</a>&nbsp;or an error matrix) showing all relations between classes at one glance.</p>



<p>Our classification designer in the Skilja Content Classification system has a built-in visualization that lets you easily see the migration of documents into other classes. As an example we have used the popular Reuters newswire test set and arranged the classes in 7 hierarchical groups. If you run a 90:10 split benchmark on all 5917 documents (which fortunately only takes a few seconds because the SCC is so incredibly fast), the confusion matrix obtained for the 51 classes looks as follows:</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/08/Confusion-Matrix-full.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2015/08/Confusion-Matrix-full.png" alt="" class="wp-image-972"/></a></figure>



<p>Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. Of course the user interface allows you to zoom in to look at the details.</p>
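<p>The layout just described – columns for predicted classes, rows for actual classes, correct documents on the diagonal – can be reproduced in a few lines of plain Python. The function and the tiny example data are illustrative, not Skilja code:</p>

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

classes = ["acq", "earn", "crude"]
actual    = ["acq", "acq", "earn", "crude", "acq"]
predicted = ["acq", "earn", "earn", "crude", "acq"]
m = confusion_matrix(actual, predicted, classes)
# The diagonal holds correctly classified documents;
# m[0][1] counts "acq" documents exported to "earn".
print(m)  # [[2, 1, 0], [0, 1, 0], [0, 0, 1]]
```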



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/08/Confusion-Matrix-zoom.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2015/08/Confusion-Matrix-zoom.png" alt="" class="wp-image-974"/></a></figure>



<p>The correctly classified documents are summed up on the diagonal, the exports are on the upper right and the imports on the lower left. In our case you see quite a few exports from the class “acq” (news on acquisitions) to “earn” (earnings). But this is to be expected, as these classes are close by topic: a report on an acquisition often talks about the same things (shares, revenue, board) as an earnings report. The user can now click on the box of the 57 exported documents, open them in a list and review them to improve classification if desired. Thus it becomes easy to drill down into the results and see exactly what can be improved. You will never achieve 100% precision, but remember that manual human classification also only achieves about 95% on average, as experiments have shown.</p>



<p>When the classes are organized in a hierarchy, the confusion matrix by Skilja also allows you to collapse the nodes and look at upper levels only. In this case the values of the hidden subclasses are summed up and shown for the parent class.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/08/Confusion-Matrix-hierarchic1.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2015/08/Confusion-Matrix-hierarchic1.png" alt="" class="wp-image-978"/></a></figure>



<p>The diagonal now has two values. For example, 4,320 of the finance documents have been correctly classified, but 175 have been exported/imported within the finance category. Often you are only interested in the migration between the main parent classes, while errors under one parent are less problematic.</p>



<p>Typically an organisation can assign a cost to each export and import. The cost can be different for each pair of classes where this happens. Migrations within a set of subclasses are often not very expensive, if they relate for example to documents that are processed in the same department anyway. On the other hand, an import into a class that leads to an automatic payment can be very expensive. This can be mitigated by assigning different thresholds to such classes, which SCC allows. The confusion matrix lets you find out where these need to be applied. But the matrix can also be exported, and you can apply your own cost matrix to the results to determine which improvements make sense. We are currently working with a real client to create a case study that shows these numbers in a real-world example at an insurance company. When available, this study will be published here. Stay tuned!</p>
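<p>Applying such a cost matrix to an exported confusion matrix can be as simple as the following sketch; every cell count of misrouted documents is weighted by the cost of that particular class pair. The numbers and the <code>cost</code> table are made up for illustration:</p>

```python
def total_misclassification_cost(confusion, cost):
    """Sum over off-diagonal cells: (documents misrouted between a class
    pair) times (cost per document for that pair)."""
    total = 0.0
    for i, row in enumerate(confusion):
        for j, n in enumerate(row):
            if i != j:
                total += n * cost[i][j]
    return total

confusion = [[950, 40, 10],
             [20, 970, 10],
             [5, 5, 990]]
# Hypothetical cost per misrouted document (row = actual, column = predicted);
# imports into the expensive "payment" class (last column/row) cost the most.
cost = [[0.0, 1.0, 5.0],
        [1.0, 0.0, 5.0],
        [20.0, 20.0, 0.0]]
print(total_misclassification_cost(confusion, cost))  # 360.0
```

<p>Comparing this total before and after a change immediately tells you whether an improvement is worth the effort.</p>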
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>The Magic of Online-Learning</title>
		<link>https://skilja.com/de/the-magic-of-online-learning/</link>
		
		<dc:creator><![CDATA[skiljaweb3]]></dc:creator>
		<pubDate>Sun, 03 Feb 2019 08:26:58 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Extraktion]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/the-magic-of-online-learning/</guid>

					<description><![CDATA[Wouldn’t it be nice if your AI enabled document processing system would continuously take the input from user interactions and use this information to improve the quality of recognition over time? And nobody would have to take care of this – even in the case of hundreds of document classes with dozens of index fields [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Wouldn’t it be nice if your AI-enabled document processing system continuously took the input from user interactions and used this information to improve the quality of recognition over time? And nobody would have to take care of it – even in the case of hundreds of document classes with dozens of index fields each. In the best case the system would be easy to set up, run completely unattended in the background and work like a charm.</p>



<p>This is what Skilja provides with its Laera Classification and Extraction software suites. We have fully implemented this new paradigm, which is available either as SDKs or as integrated modules of our Vinna Document Processing Platform. But of course what looks easy for the user requires significant infrastructure and automated checks and balances to make it a reliable and stable part of your processing tasks.</p>



<p>Machine online-learning of document classification and recognition uses supervised and unsupervised continuous training on the incoming data streams. Supervised learning takes the corrections the users have made, analyzes them and applies them as new patterns where appropriate. Unsupervised learning uses the results of successful and correct classification and extraction to generate additional knowledge (expanding the space) and statistics on the usage of existing knowledge. Combined, both are then used to continuously improve the system. The infrastructure is set up quickly and consists of services that do the work in the background: collect statistics, collect samples, analyze the validity of the new data and publish it to the production runtime system once the AI has determined it to be a valid addition.</p>
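<p>Laera's learning algorithms are not public, but the core idea of continuous supervised updates can be illustrated with a toy centroid classifier whose model is nudged by every confirmed or corrected sample. Everything here – the class, the feature vectors, the labels – is a made-up sketch, not Skilja's implementation:</p>

```python
class OnlineCentroidClassifier:
    """Toy continuous learning: each confirmed or corrected sample
    shifts the running-mean centroid of its class."""
    def __init__(self):
        self.centroids = {}   # label -> feature vector (running mean)
        self.counts = {}

    def learn(self, features, label):
        if label not in self.centroids:
            self.centroids[label] = list(features)
            self.counts[label] = 1
            return
        self.counts[label] += 1
        n = self.counts[label]
        c = self.centroids[label]
        # incremental running-mean update
        self.centroids[label] = [ci + (f - ci) / n for ci, f in zip(c, features)]

    def predict(self, features):
        def dist(c):
            return sum((ci - f) ** 2 for ci, f in zip(c, features))
        return min(self.centroids, key=lambda lbl: dist(self.centroids[lbl]))

clf = OnlineCentroidClassifier()
clf.learn([1.0, 0.0], "invoice")
clf.learn([0.0, 1.0], "claim")
clf.learn([0.9, 0.1], "invoice")   # supervised: a user-corrected sample
print(clf.predict([0.8, 0.2]))     # invoice
```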



<figure class="wp-block-image"><a href="https://www.skilja.com/wp-content/uploads/2019/02/Skilja-Classification-3.0-Online-Learning.png"><img decoding="async" src="https://www.skilja.com/wp-content/uploads/2019/02/Skilja-Classification-3.0-Online-Learning-1024x640.png" alt="" class="wp-image-1339"/></a></figure>



<p>As we know that system administrators might be wary of having their setup changed automatically (at least until they have seen that it really works), there are several intermediate levels of AI automation that they can choose. The most important are:</p>



<ul class="wp-block-list"><li>Have all changes and each new document manually reviewed, benchmarked and checked before explicitly publishing them (the box on the left)</li><li>Have automatically created improvements reviewed and explicitly published</li><li>View any conflicts and resolve them manually (or at least check them)</li><li>Restrict the users who can contribute to the training to a certain group. Only corrections from this group will be taken into account, while the input from less experienced users will be discarded.</li></ul>



<p>But in the end, learning can run completely unattended. As in school (think exams), we need to check the validity of new knowledge before we apply it. Therefore Laera's algorithms always analyze for conflicts that have been created and try to resolve them. In addition, each new revision of the training pattern is fully automatically quality-checked in the background and only accepted if the recognition results of the new model exceed those of the existing one. This is an assurance for the production system: changes in quality will only ever go in one direction – better!</p>
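The "only better models get published" gate can be sketched as a simple benchmark comparison. The function names are hypothetical, not Laera's actual interface:

```python
# Sketch of a quality gate: a candidate training revision is published only
# if its benchmarked F1 score exceeds that of the production model.
# Function names are illustrative, not part of any real product API.

def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def should_publish(candidate_counts, production_counts):
    """Accept the new revision only if it beats production on the benchmark."""
    return f1_score(*candidate_counts) > f1_score(*production_counts)

# Candidate: 96 true positives, 4 false positives, 6 false negatives.
# Production: 90 / 5 / 12 on the same benchmark set.
assert should_publish((96, 4, 6), (90, 5, 12))
assert not should_publish((80, 10, 20), (90, 5, 12))
```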



<p>Again, this is not a black box: Laera provides precise insight into what is happening and lets you influence or even revert the suggested improvements at any stage. Laera Monitor is the tool for this, a web application that shows the continuously measured quality numbers of your system.</p>



<figure class="wp-block-image"><a href="https://www.skilja.com/wp-content/uploads/2019/02/Laera-Monitor.png"><img decoding="async" src="https://www.skilja.com/wp-content/uploads/2019/02/Laera-Monitor-1024x651.png" alt="" class="wp-image-1340"/></a></figure>



<p>The example shown here is a typical curve for the F1 score (an averaged quality measurement). Starting with a setup of a few hundred trained documents, the quality quickly deteriorates as new and unknown samples arrive in production, especially once the real volumes start to be processed. It is interesting to see that precision stays close to 95%, which is very satisfying, but recall (recognition rate) goes down as the system simply does not “know” the new documents. Then online learning kicks in and uses the new samples and the corrections made to quickly bring the quality back up to 95% after a few thousand new training documents have been processed.</p>
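The effect of unknown documents on the two measures can be illustrated with a small calculation: documents the system refuses to classify lower recall but leave precision untouched. The data below is made up for illustration:

```python
# Illustration of the effect described above: documents the system does not
# "know" are left unclassified, which lowers recall but not precision.
# The numbers are invented for the example.

def precision_recall(results):
    """results: list of (predicted_class_or_None, true_class) pairs."""
    classified = [(p, t) for p, t in results if p is not None]
    correct = sum(1 for p, t in classified if p == t)
    precision = correct / len(classified) if classified else 0.0
    recall = correct / len(results) if results else 0.0
    return precision, recall

# 100 documents: 60 classified correctly, 3 misclassified, and 37 documents
# of new, untrained types that the model leaves unclassified.
results = [("invoice", "invoice")] * 60 + [("invoice", "dunning")] * 3 \
        + [(None, "new_type")] * 37
p, r = precision_recall(results)   # precision stays high, recall drops
```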



<p>Online learning will make classification and extraction much easier in the future. After an initial setup, the AI simply learns in the background what needs to be known to arrive at the best possible automation rate within a few weeks. This makes a whole new area of processes (for example those with smaller document volumes) accessible and will greatly improve quality for existing automation processes.</p>



<p>Please let us know if you have additional questions, need more insight or have a direct interest. Contact us at info (at) skilja.com.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>What is a good classifier? (4/4)</title>
		<link>https://skilja.com/de/what-is-a-good-classifer-4-4/</link>
		
		<dc:creator><![CDATA[skiljaadmin]]></dc:creator>
		<pubDate>Thu, 23 Apr 2015 13:06:23 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/what-is-a-good-classifer-4-4/</guid>

					<description><![CDATA[]]></description>
										<content:encoded><![CDATA[<p><div class="et_pb_section et_pb_section_0 et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_0">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_0  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module et_pb_text et_pb_text_0  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p>This is the fourth and final post on the characteristics of a good content-based classifier. In previous posts we focused on the presentation of statistical results and the comparison of the Skilja Classifier to a plain vanilla naïve Bayes classifier in <a href="https://skilja.com/what-is-a-good-classifier-1-4/">Recall-Precision Graphs</a> or <a href="https://skilja.com/what-is-a-good-classifier-3-4/">Overlap-Separation Graphs</a>. The first article very clearly showed the significant differences in classification results for a complete test set of documents. The second article focused on the graphical representation for two selected classes, drilling deeper into the details of the results and revealing the strengths and weaknesses of the classifiers.</p>
<p>Both representations are provided with the goal of making auto-classification transparent and understandable to the user – on different levels of detail. Users feel uneasy if they are presented with a black box and do not understand the decisions of an auto-classifier. Therefore explaining and displaying a result is of high importance in every classifier implementation to achieve user acceptance. This is why Skilja has placed such a high emphasis on the visual representation of the result.</p>
<p>The last level of detail, after having looked at global results and the comparison on class level, is the view on single documents. As you have learned from previous posts (e.g. <a href="http://www.skilja.de/2012/classification-methods/">Classification Methods</a>), the classifier determines so-called features that are used to establish the similarity of any given document to a trained class. While these features are complex mathematical structures in a high-dimensional feature space, they can be made visible for a single document. The Skilja Content Classifier can display the features by highlighting the words and groups of words in the document that are used to calculate them. Let me add a word of caution: there are a lot of classifiers out there that simply use these words (the bag-of-words approach) as features. This is not a very good choice, as the quality will be inferior. The Skilja Content Classifier uses other features (like morphemes, correlations and compounds), but for display we show the words that best represent the chosen features, as this is easiest to understand. So do not be misled by the display.</p>
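The display idea itself is simple to sketch: map each feature's relevance to a highlight intensity over the words of the document. This is a minimal illustration, not Skilja's actual rendering code, and the weights are invented:

```python
# Minimal sketch of feature highlighting (hypothetical, not the Skilja
# renderer): each word gets a shade proportional to the weight of the
# feature it represents; darker shade = more relevant feature.

def highlight(text, feature_weights):
    """Return (word, shade) pairs; shade in [0, 1], 0 = no highlight."""
    max_w = max(feature_weights.values(), default=1.0)
    return [(word, feature_weights.get(word.lower(), 0.0) / max_w)
            for word in text.split()]

weights = {"stock": 0.9, "stake": 0.7, "sec": 0.5}   # illustrative values
marked = highlight("The SEC reviewed the stock and the stake", weights)
# "stock" receives the darkest shade; unrelated words get a shade of 0.
```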
<p>As an example you can see below the classification features used for classifying a document from the Reuters test set on the topic of acquisitions.</p>
<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/04/Class-Result-Acq-1.png"><img decoding="async" class="wp-image-915" src="http://www.skilja.de/wp-content/uploads/2015/04/Class-Result-Acq-1.png" alt="" /></a></figure>
<p>The right side shows the result with a high confidence for the class “acq”, representing acquisitions. The features are highlighted in turquoise in the image. The darker the color, the more relevant the feature. Not surprisingly (and allowing the user to agree with the result), the main features used are “investment firm”, “SEC”, “stock”, “stake” etc. – all of which you would associate with acquisitions yourself. Remember that the features have been generated automatically by machine learning from a few hundred examples. But it helps enormously to build the user's trust that the system works well. In addition, we allow features to be deselected manually, but do not recommend doing so, as the results usually deteriorate after manual intervention.</p>
<p>The next series of images shows the example of the same document (on coffee) with the features for the classes “coffee”, “trade” and “crude” which are the top three confidences in the result.</p>
<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/04/Class-Result-Coffee.png"><img decoding="async" class="wp-image-917" src="http://www.skilja.de/wp-content/uploads/2015/04/Class-Result-Coffee.png" alt="" /></a></figure>
<p>Top hit for the coffee document, with expected features like “Columbia”, “coffee” and “export”, which, especially in the correlation and context in which they are used, lead to the “coffee” class being selected as the best result.</p>
<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/04/Class-Result-Crude.png"><img decoding="async" class="wp-image-918" src="http://www.skilja.de/wp-content/uploads/2015/04/Class-Result-Crude.png" alt="" /></a></figure>
<p>Second-best hit for the coffee document is “crude (oil)”. Crude also comes from Columbia, but crude features like “energy” cannot beat the coffee result.</p>
<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/04/Class-Result-Trade.png"><img decoding="async" class="wp-image-919" src="http://www.skilja.de/wp-content/uploads/2015/04/Class-Result-Trade.png" alt="" /></a></figure>
<p>Third-best hit for the coffee document is the “trade” class – the result comes from “total exports”, “business” etc., but is already far away from the actual coffee topic.</p>
<p>By now it should be very obvious how important it is to actually see what the classification algorithm decides and why. Although the intricate details of the math involved remain well hidden, this transparency helps a lot with the acceptance of machine learning techniques in customer projects. Together with the overall statistical performance analysis, a transparent visualization of the inner workings is what makes a good classifier.</p></div>
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Classification and Context</title>
		<link>https://skilja.com/de/classification-and-context/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Wed, 15 Apr 2015 16:00:00 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<guid isPermaLink="false">https://skilja.com/classification-and-context/</guid>

					<description><![CDATA[This is a situation that probably sounds familiar to you: You meet a person and you are sure that you know him/her well and that you have already seen him or her many times – but you cannot remember who it is. More precisely – you cannot put the person into a context. This does [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>This is a situation that probably sounds familiar to you: you meet a person and you are sure that you know him or her well and have already seen him or her many times – but you cannot remember who it is. More precisely, you cannot put the person into a context. This does not happen with persons who are very close to you, whom you can identify (or categorize) immediately without effort. It happens with people you are acquainted with but do not know by heart. When this happens, you start to search your memory for this person by looking for a context that you can apply to this face, body and appearance. Because the problem of categorization you are faced with is <strong>an out-of-context problem</strong>. It is so much easier to recognize a pattern or identify an object if a context is applied.</p>



<p>This happened to me last fall when I met the owner of a mountain restaurant from the Black Forest at the baggage claim of the airport of Rhodes. As he greeted me warmly and asked how I was, he was visually present in my memories, but I was searching in vain for the context, which could not have been more remote. Being on vacation, after a nice flight to a foreign country close to the sea, and looking forward to my first ouzo simply blocked out all of my familiar reference system, which would have given me the answer instantly. Fortunately I was able to talk around my ignorance until he gave me enough additional information that it slowly dawned upon me who he was.</p>



<p>Context is a very important concept for classification. Not only does the ease with which objects are categorized depend on it; the result of classification is also related to context. Cognitive psychologists have analyzed the effects of context and have come up with some striking results. As you might remember, classification and object recognition have a lot in common.</p>



<p>One early study that illustrates this is Labov&#8217;s analysis of the semantic boundaries of the concepts described by the words <strong>cup</strong> and <strong>bowl</strong> (&#8216;The Boundaries of Words and Their Meanings&#8217;, <a href="http://en.wikipedia.org/wiki/William_Labov">William Labov</a>). He showed his subjects line drawings of containers that varied in width and height.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2012/03/Labov-Cups-1.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2012/03/Labov-Cups-1-215x300.png" alt="Labov Cups 1" class="wp-image-218"/></a></figure>



<p>When they were asked to assign one of the categories (words) to the drawings, he observed a gradual shift from cup to bowl depending on the diameter of the container. This is as expected, as the diameter is surely one of the features taken as relevant for the category. So this is a simple illustration of how a classifier works and how the weight of a feature influences the result. As we know, this is very simplified: normally thousands of features are taken into account for classification, even of cups and bowls – transparency, material, color, handle yes/no etc.</p>
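Labov's observation can be caricatured as a one-feature classifier. The numbers below are invented for illustration; they are not Labov's measurements:

```python
# A deliberately over-simplified one-feature classifier: the width-to-height
# ratio alone decides the label, and the decision boundary is where Labov's
# gradual cup-to-bowl shift happens. All numbers are illustrative.

def categorize(width, height, boundary=1.2):
    """Return 'cup' or 'bowl' from a single feature, the width/height ratio."""
    return "bowl" if width / height > boundary else "cup"

print(categorize(width=8, height=9))    # a slender vessel
print(categorize(width=15, height=9))   # a wide vessel
```

Shifting `boundary` is exactly what a context does in Labov's experiment: the feature stays the same, but the category assignment changes.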



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2012/03/Labov-Cups-2.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2012/03/Labov-Cups-2.png" alt="Labov Cups 2" class="wp-image-219"/></a></figure>



<p>The interesting part of the experiment came when he put the situation into the context of either drink or food. In the context of food, the categories shifted significantly: now more of the vessels were seen as bowls than before.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2012/03/Labov-Cups-3.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2012/03/Labov-Cups-3.png" alt="Labov Cups 3" class="wp-image-220"/></a></figure>



<p>You see how context is a very important concept in classification and should not be ignored. Psychologists have even found that you can influence the outcome of an answer by putting the subject into a context beforehand. This effect is called priming. For example, they asked people to think about something they are ashamed of. Thus primed, they were shown the ambiguous fragments W _ _ H and S _ _ P.</p>



<p>It was shown that those who were ashamed were more likely to complete these fragments as WASH and SOAP instead of WISH and SOUP. The opposite was true for people primed with eating memories. Amazing, isn&#8217;t it?</p>



<p>What is the relevance for document understanding? These experiments show that the human classification we want to emulate with document understanding is more complex than thought and by far not objective. This means that for document sorting, too, the context is very relevant. Humans are much superior to machines in this task, as they have knowledge of the context. A simple phrase like “My address was changed” is always treated the same by a statistical classifier. But if you have a context, you know whether it is a postal address, an e-mail address or an IP address. You know whether the address was changed by the person themselves, by accident, or whether the change was forced upon them – a lot of context relevant for the subsequent business process. Astonishingly enough, there is no classification system on the market that makes use of the context to prime the classifier, although it would be very useful and could improve the results. There are some systems, like Kofax KTM or Paradatec AIDA, which use hierarchical classification, so they have a context in the sub-classification, which is good. The next big step towards human-like categorization would be to allow context to be added explicitly and taken into account for classification. At least this is what is suggested by the cognition experiments shown above.</p>
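One simple way a context could "prime" a statistical classifier is to re-weight its raw confidences with context-dependent priors. This is a hypothetical sketch of the idea, not a feature of any product named above; class names and weights are invented:

```python
# Hypothetical sketch of context priming: context-dependent priors
# re-weight a classifier's raw class confidences and the result is
# renormalized. All class names and numbers are invented.

def prime(confidences, context_priors):
    """Re-weight raw class confidences by context priors and renormalize."""
    weighted = {cls: conf * context_priors.get(cls, 1.0)
                for cls, conf in confidences.items()}
    total = sum(weighted.values())
    return {cls: w / total for cls, w in weighted.items()}

raw = {"postal_address_change": 0.4, "email_address_change": 0.35,
       "ip_address_change": 0.25}
# In a customer-relocation process, postal address changes are far more likely:
primed = prime(raw, {"postal_address_change": 3.0})
best = max(primed, key=primed.get)
```

The classifier itself is untouched; only the interpretation of its output changes with the process context, which is exactly the "priming" suggested by the experiments above.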



]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>What is a good classifier? (3/4)</title>
		<link>https://skilja.com/de/what-is-a-good-classifier-3-4/</link>
		
		<dc:creator><![CDATA[skiljaadmin]]></dc:creator>
		<pubDate>Thu, 29 Jan 2015 13:09:55 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/what-is-a-good-classifier-3-4/</guid>

					<description><![CDATA[In recent articles about classifier quality we have focused on the overall statistical results. For this we have used either the precision-recall graph or the inverted precision graph. While these are very good tools to predict the overall quality of a classification scheme and hence the gain in productivity to be expected – they do not reveal where and why [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>In recent articles about classifier quality we have focused on the overall statistical results. For this we have used either the <a href="https://skilja.com/what-is-a-good-classifier-1-4/">precision-recall graph</a> or the <a href="https://skilja.com/what-is-a-good-classifier-2-4/">inverted precision graph</a>. While these are very good tools to predict the overall quality of a classification scheme and hence the gain in productivity to be expected – they do not reveal where and why errors occur and how they can be improved.</p>



<p>Digging deeper into the results, a more detailed analysis of potential conflicts and overlaps can be made. This method also shows the capability of a classifier to resolve these and gives us another measure for the quality of the classification engine used. For this purpose we analyze pairs of classes and the classification confidences for them. The graphical representation of this analysis is the <strong>Overlap &amp; Separation Graph</strong>, which clearly shows whether there is any significant interference and what the conflict level is.</p>



<p><strong>3. The Overlap &amp; Separation Graph</strong></p>



<p>Instead of looking at the overall results, we look at pairs of classes. For each pair we take the documents from the learn set of the two respective classes and classify them. For the Reuters test set we have, for example, the classes “<em>acq</em>” (acquisitions) and “<em>earn</em>” (earnings reports). The result gives us the classification confidence for a document from <em>acq</em> to be in <em>acq</em> and another confidence to be in <em>earn</em>. Of course we would expect that most of the documents in the learn set from <em>acq</em> are highly confident in <em>acq</em> – which is the case. But a few also have significantly confident results from <em>earn</em>. If the confidence from <em>earn</em> for an <em>acq</em> document is higher than for <em>acq</em>, then we have an overlap. This can happen because a lot of documents about the acquisition of a company will also contain significant vocabulary about the results of that company. A news document might very well cover both aspects when a company is about to be acquired.</p>



<p>This analysis of pairs can be represented in a diagram that shows the cumulative percentage of documents from <em>acq</em> as a function of the difference of confidences between the two classes. In the same diagram we can depict the cumulative percentage of the <em>earn</em> documents:</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/01/Overlap-Separation-ACQ-EARN.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2015/01/Overlap-Separation-ACQ-EARN.png" alt="" class="wp-image-848"/></a></figure>



<p>As you can see, there is an overlap – which is not surprising, given what we said before. Please note that the percentage of documents is represented on a logarithmic scale; we are therefore talking about a very low number of overlapping documents. In the left image the two lines cross at about 2% of the documents; in the right image the overlap is significantly larger, at 10%.</p>
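The core of the pair analysis can be sketched in a few lines: for each learn-set document of one class, compare its own-class confidence with the paired-class confidence; documents where the paired class wins form the overlap. The confidences below are invented, not the actual Reuters results:

```python
# Sketch of the overlap computation behind the Overlap & Separation Graph:
# documents of class A whose confidence for the paired class B is at least
# as high as for A itself are the overlap. Confidences are illustrative.

def overlap_fraction(own_confidences, other_confidences):
    """Fraction of documents whose paired-class confidence is at least
    as high as the confidence for their own class."""
    pairs = zip(own_confidences, other_confidences)
    overlapping = sum(1 for own, other in pairs if other >= own)
    return overlapping / len(own_confidences)

# Five illustrative acq learn-set documents:
acq_in_acq  = [0.95, 0.90, 0.88, 0.70, 0.40]   # confidence in acq
acq_in_earn = [0.10, 0.20, 0.30, 0.65, 0.55]   # confidence in earn
frac = overlap_fraction(acq_in_acq, acq_in_earn)   # one of five overlaps
```

The full graph then plots, for each confidence difference, the cumulative percentage of documents on either side of zero.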



<p>So you can also clearly see which classifier is good and can separate even difficult cases (left graph), and which is a so-so classifier (right graph). We have used the same comparison of classifiers and the same dataset as in previous posts of this series. It is immediately obvious that the Skilja classifier performs much better in the separation than a standard algorithm.</p>



<p>This becomes even more obvious if you compare <em>coffee</em> and <em>gold</em>. You would think that they can be separated well. Yes, they can – by the Skilja Content Classifier on the left side, while the other classifier shows an unexpected overlap error of about 3% of the documents.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2015/01/Overlap-Separation-COFFEE_GOLD.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2015/01/Overlap-Separation-COFFEE_GOLD.png" alt="" class="wp-image-849"/></a></figure>



<p>This analysis can be performed for any pair of document classes, and the system can even point out the classes with the most conflicts and overlaps. This is a built-in function of the Classification Designer and should be a must for any tool on the market. Now you can actually select the documents below the intersection in your learn set, take a closer look at them, and perhaps determine that they are not correct, or decide that you need to add more training samples to this class to make it perform better. The system can only perform what you teach it. Our analysis directly guides you to these conflict pairs and helps to quickly teach the system and correct its errors to achieve the best possible classification.</p>



<p>Of course, actually using these analyses and graphs is much easier than digging into the details of the calculation, so don&#8217;t be afraid if this sounds complicated. Interactive usage is easy and intuitive. In the next and last part of our series we will dig even deeper and see what we can learn from the analysis of a single document. Stay tuned.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Auto-Classification Technologies and RFID Smart Docs</title>
		<link>https://skilja.com/de/auto-classification-technologies-and-rfid-smart-docs/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Sat, 10 Jan 2015 13:16:43 +0000</pubDate>
				<category><![CDATA[Gastbeitrag]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/auto-classification-technologies-and-rfid-smart-docs/</guid>

					<description><![CDATA[Editor’s note:&#160;This is a guest post from Cláudio Chaves from&#160;TCG Brazil Recent advances in the auto-classification technologies – as described in this blog – have provided a substantial manual labor reduction for several companies related to physical preparation, classification and separation of documents into its operations. Although these advances have achieved tangible results in optimizing [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h6 class="wp-block-heading"><strong>Editor’s note</strong>:&nbsp;This is a guest post from Cláudio Chaves from&nbsp;<a href="http://www.tcgprocess.com.br/" target="_blank" rel="noreferrer noopener">TCG Brazil</a></h6>



<p>Recent advances in auto-classification technologies – as described in this blog – have provided a substantial reduction of the manual labor related to the physical preparation, classification and separation of documents in many companies&#8217; operations. Although these advances have achieved tangible results in optimizing document-centric workflows, there is still a gap when it comes to classifying and tracking paper documents. This is especially important in countries where physical documents are subject to different retention policies based on legal requirements: a certain set of documents must be retained physically for a varying number of years, based on the document type determined by classification.</p>



<p>Some capture applications are able to identify document types using barcodes at scan time and, using auto-classification or the barcode content, apply different rules to separate and classify the document images. In the digital world everything is straightforward and works pretty well, but if you need to track and trace the same documents physically until the final archiving step is completed, it always becomes a challenge, especially in a large-scale operation with tons of documents.</p>



<p>RFID is an acronym for Radio Frequency Identification, and it is also used as a generic term denoting the ability to identify an object remotely. The information is transmitted via radio waves and does not require line-of-sight or contact between the reader and the tags. RFID technology provides great benefits through the combined use of a barcode, a microchip and an antenna, encapsulated in a tag, also called a <strong>smart label</strong>. Radio waves are sent from a reader and picked up by a tag that signals back its unique number, called the <strong>EPC (Electronic Product Code)</strong>. The presence of a tagged folder or document is registered at a reader&#8217;s specific location, and this information can be reported to the tracking software, which updates a records management database, an ECM repository or even a document capture platform.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2014/12/Rfid_Sample_Tag_inlay_detail.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/12/Rfid_Sample_Tag_inlay_detail-1024x510.png" alt="" class="wp-image-832"/></a></figure>



<p>Due to the high reading speed and the capacity to identify an item even without visual access to the document, RFID technology makes it possible to quickly read a stack of tagged folders and documents even when they are stored inside a cardboard box. In this way, it is possible to perform an automatic check-in of a ton of documents without any human intervention. Additionally, it is also possible to inspect document containers like cardboard boxes and folders at the receiving and delivery points, checking whether all the required document classes are really there.</p>



<p>Just like barcodes, RFID technology allows a freely encoded data schema to be stored in the EPC memory; on the other hand, it is always advisable to use a standard to avoid proprietary encoding. There are now a few international standards available for different types of objects, such as fixed assets, returnable assets, trade items, documents, etc. These standards have been developed by GS1, an international non-profit association, aiming at efficiency improvements, higher item visibility and interoperability across the whole chain.</p>



<p>The EPC data schema GDTI (Global Document Type Identifier), specified by GS1, was developed to identify documents, including the class or type of each document. A GDTI can be encoded in a 1D/2D barcode, stored in an EPC memory or printed directly on the document. Companies can use the GDTI as a method for the identification and registration of documents and related events. They can also use the GDTI for information retrieval, document tracking, electronic archiving or even to prevent fraud and document falsification.</p>



<p>All these standards are specified on the basis of the EPCglobal framework, which describes the relation between different RFID components such as hardware, software and data interfaces. In the context of this article, we are referring to passive RFID. This technology does not use batteries and works in the UHF frequency band. With this specification, objects can be identified not only in the near field but also in the far field, at ranges of up to 10 meters, depending on the type of object, the tag, the antenna and the reader.</p>



<p><strong>The combination of auto-classification technologies and RFID-tagged documents makes it possible to match the classification results physically and logically.</strong> Given the physically classified document class, it is possible to define and choose the most appropriate document container (e.g. cardboard box, folder, etc.) and pass the parent document class to the image/content classification engine to perform a deeper classification.</p>



<p>At the end of the process, we can match the results and track both versions (image and paper) during the entire flow. This can be achieved without physical contact with the paper document. Imagine that you receive a box of paper from a remote location for archiving. If the documents have been classified and RFID-encoded, then within a second you can check the completeness of the physical archive and stow the documents away. If all of this happens within your capture process automation system, you have a tight combination of auto-classification with the physical sorting of paper – solving this last obstacle to full automation.</p>



<p>There are still several other interesting RFID use cases for documents, such as automatic check-in/out, item searches, inventory, exit control, etc., which will become more and more popular very soon with the decreasing cost of the technology and the advance of new concepts like the IoT (Internet of Things).</p>



<p>Links:<br>RFID:&nbsp;<a href="http://en.wikipedia.org/wiki/Radio-frequency_identification">http://en.wikipedia.org/wiki/Radio-frequency_identification</a><br>GS1:&nbsp;<a href="http://www.gs1.org/about/overview">http://www.gs1.org/about/overview</a><br>EPCGlobal Framework:&nbsp;<a href="http://www.gs1.org/gsmp/kc/epcglobal">http://www.gs1.org/gsmp/kc/epcglobal</a><br>GDTI:&nbsp;<a href="http://www.gs1.org/barcodes/technical/idkeys/gdti">http://www.gs1.org/barcodes/technical/idkeys/gdti</a></p>



<div class="wp-block-image"><figure class="alignleft"><a href="http://www.skilja.de/wp-content/uploads/2014/12/Claudio-Chavez.jpg"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/12/Claudio-Chavez-150x150.jpg" alt="" class="wp-image-793"/></a></figure></div>



<p>####<br><a href="http://br.linkedin.com/pub/claudio-chaves-jr-pmp-ecmm/7/1a6/43" target="_blank" rel="noreferrer noopener">Cláudio Chaves</a>&nbsp;is Managing Director at TCG Brasil in Santana de Parnaíba, São Paulo, Brazil. He has many years of rich experience in document processing applications, especially in the South American market.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>What is a good classifier? (2/4)</title>
		<link>https://skilja.com/de/what-is-a-good-classifier-2-4/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Tue, 16 Dec 2014 13:18:00 +0000</pubDate>
				<category><![CDATA[Erkennung]]></category>
		<category><![CDATA[Klassifikation]]></category>
		<category><![CDATA[Technologie]]></category>
		<guid isPermaLink="false">https://skilja.com/what-is-a-good-classifier-2-4/</guid>

					<description><![CDATA[In our small series about classification quality we have used the precision-recall graph to show the difference between a very good and a so-so classifier in a recent post that you can find here. This graphical representation is very common and easy to understand. Apart from the absolute numbers for the recall (e.g. 85% correctly classified [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>In our small series about classification quality, we used the precision-recall graph to show the difference between a very good and a so-so classifier in a recent post that you can find <a href="https://skilja.com/what-is-a-good-classifier-1-4/">here</a>. This graphical representation is very common and easy to understand. Apart from the absolute numbers for the recall (e.g. 85% correctly classified documents), it is also important to understand how classification quality can be influenced by a threshold applied to the classification result. This can already be seen in the precision-recall graph if you know how it should look. But it becomes much more obvious if the errors are displayed as a function of the recall. We call this diagram the <strong>inverted-precision</strong> graph, which is described in this part 2 of our little series.</p>



<p><strong>2. The Inverted-Precision Graph</strong></p>



<p>The graph can easily be created with the same type of benchmark test that is used for the precision-recall graph: either by measuring the classification quality against a “golden” test set, or simply by using the train-test split method, where a certain percentage of the training set (e.g. 10%) is used for testing while the remaining 90% is used for training. This split is repeated iteratively (in this case 10 times) until each document has been classified once.</p>
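<p>The iterative split described above is essentially a k-fold rotation. A minimal sketch (the function name and shuffling are illustrative, not the exact benchmark procedure):</p>

```python
import random

def kfold_splits(n_docs, k=10, seed=0):
    """Yield (train, test) index lists. Each fold serves as the test set
    exactly once, so every document is classified once across the k rounds."""
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

<p>With k=10 each round trains on 90% of the documents and classifies the remaining 10%, matching the split described above.</p>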



<p>The first curve in the inverted-precision graph plots the error rate as a function of the read rate (recall). As expected, the higher the recall, the higher the error rate you need to accept. The error rate is shown on the left vertical axis. The graph also allows you to determine exactly the achievable recall and the required threshold for a desired error rate: against the right y-axis, the threshold is plotted as a function of the recall. By connecting the lines it is easy to see where the threshold needs to be set to achieve a predefined error rate. The animated graphic below shows this step by step:</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2014/12/InvertedPrecision.gif"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/12/InvertedPrecision.gif" alt="" class="wp-image-764"/></a></figure>



<p>Inverted-Precision Graph and Relation between Error and Threshold</p>
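<p>The data behind both curves can be computed from the scored benchmark results. A minimal sketch, assuming (hypothetically) that each classified document yields a confidence score and a correct/incorrect flag: sort by descending confidence, and accepting the top i documents at each step produces one (recall, error rate, threshold) point.</p>

```python
def inverted_precision_curve(results):
    """results: list of (confidence, is_correct) pairs, one per document.
    Returns (recall, error_rate, threshold) points: accepting every document
    with confidence >= threshold yields that read rate and error rate."""
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    points, errors = [], 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        if not ok:
            errors += 1
        points.append((i / len(ranked), errors / i, conf))
    return points
```

<p>Plotting these points with recall on the x-axis, error rate against the left y-axis and threshold against the right y-axis reproduces the two curves of the graph above.</p>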



<p>The inverted-precision graph is especially suited to uncover the weaknesses of classifiers related to thresholding.</p>



<p>As a real-life example we have again used the well-known Reuters-21578 Apte test set, which was assembled many years ago (available at&nbsp;<a href="http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html" target="_blank" rel="noreferrer noopener">http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html</a>). It includes 12,902 documents in 90 classes, with a fixed split between test and training data (3,299 vs. 9,603). The image shows the graph for a very good linear classifier.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2014/12/Inverted-Precision-Skilja.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/12/Inverted-Precision-Skilja-1024x587.png" alt="" class="wp-image-767"/></a></figure>



<p>The second image shows the graph for a standard classifier. This is the same data as in the first post of the series, but a comparison of the two shows that the differences between a good and a mediocre classifier become much more obvious in this representation.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2014/12/Inverted-Precision-Bayes.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/12/Inverted-Precision-Bayes-1024x635.png" alt="" class="wp-image-768"/></a></figure>



<p>The discrepancy mainly has to do with the normalization of the results. Even if you accept that the absolute recall of the weak classifier is low, the results should at least be normalized such that an error rate below 5% can be achieved at some threshold. This is obviously not the case here, which may be due either to a weak classifier or to an incomplete training set, and the inverted-precision graph is a good way to uncover this. A good classification toolkit should therefore always provide the means to create and visualize the results in this way as well.</p>
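<p>Whether a classifier's confidence scores support a given error budget at all can be checked directly from the scored results. A sketch with hypothetical score data: it returns the highest read rate whose error rate stays within the budget, or None if no threshold achieves it.</p>

```python
def max_recall_at_error(results, max_error=0.05):
    """results: (confidence, is_correct) per document.
    Returns (recall, threshold) for the highest read rate whose error
    rate stays within max_error, or None if no threshold achieves it."""
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    best, errors = None, 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        if not ok:
            errors += 1
        if errors / i <= max_error:
            best = (i / len(ranked), conf)
    return best
```

<p>With well-normalized scores the errors concentrate at low confidence and a suitable threshold exists; if errors sit at the top of the confidence range, no threshold reaches the budget and the function returns None, which is exactly the weakness the graph makes visible.</p>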



<p>There are good technical reasons in the algorithms that explain the differences above, but they are beyond the scope of this blog. More important for users is to understand that there are significant differences and that they become visible in the graphical evaluation. In an upcoming article we will drill even deeper and show the effect of classifier quality on the separation of selected pairs of classes. Stay tuned!</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
