<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Market | Skilja</title>
	<atom:link href="https://skilja.com/category/market/feed/" rel="self" type="application/rss+xml" />
	<link>https://skilja.com</link>
	<description>Document Understanding – Deep Learning</description>
	<lastBuildDate>Mon, 18 Sep 2023 11:18:08 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://skilja.com/wp-content/uploads/2021/06/cropped-skilja_logo_transparent_02-32x32.png</url>
	<title>Market | Skilja</title>
	<link>https://skilja.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>IDP: Solving bot illiteracy in the digital workforce &#8211; Part 2</title>
		<link>https://skilja.com/idp-solving-bot-illiteracy-in-the-digital-workforce-part-2/</link>
		
		<dc:creator><![CDATA[Guest]]></dc:creator>
		<pubDate>Wed, 21 Oct 2020 10:12:00 +0000</pubDate>
				<category><![CDATA[Essentials]]></category>
		<category><![CDATA[Guest Post]]></category>
		<category><![CDATA[Market]]></category>
		<guid isPermaLink="false">https://skilja.com/?p=730</guid>

					<description><![CDATA[Editor’s note: This is a guest post from Jupp Stöpetie In this post we examine the role of Intelligent Document Processing (IDP) relative to Robotic Process Automation (RPA) and how these technologies drive Digital Transformation when combined. Part one looked at what is driving and enabling Digital Transformation. This part two is dedicated to RPA and [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><span class="has-inline-color" style="color: #7cb955;"><strong>Editor’s note</strong>: </span>This is a guest post from Jupp Stöpetie</p>



<p><em>In this post we examine the role of Intelligent Document Processing (IDP) relative to Robotic Process Automation (RPA) and how these technologies drive Digital Transformation <em>when combined</em>.</em> <em>Part one looked at what is driving and enabling Digital Transformation. This part two is dedicated to RPA and why IDP is essential.</em></p>



<h2 class="wp-block-heading"><strong>Robotic Process Automation </strong></h2>



<p>Over the past three or four years, RPA has become the tool of choice for automating repetitive tasks in many companies. RPA vendors claim that their systems are easy to set up and maintain without the need for coding, and that they can be operated by business managers after some lightweight training. Basically, every manager can create bots that replace human workers. This promise has led to widespread adoption of RPA in businesses: there is no longer a need for complicated, lengthy and costly automation projects managed by IT departments. RPA puts automation and the use of AI in the hands of business managers, and the workforce in companies is increasingly a combination of humans and robots. But of course things are more complicated when you look a bit deeper than the RPA marketing collateral. What happens, for example, when we need to retrieve information from a document? That is no problem for human workers, but bots have no humanlike reading skills. They can only “read” data from structured sources like files and databases, because there it is easy to instruct bots with the exact location where to find the data.</p>
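<p>To make the contrast concrete, here is a minimal sketch (in Python; the field names and phrasing are hypothetical): reading a value from structured data is a deterministic lookup, while the same fact in free text has no fixed location.</p>

```python
import json
import re

# Structured input: the bot is told the exact location (key) of the value.
po_record = json.loads('{"po_number": "4711", "total": "1280.00"}')
total_structured = po_record["total"]  # deterministic lookup

# Unstructured input: the same fact is buried in free text with no fixed
# location, so the bot needs pattern matching (or real document understanding).
letter = "Please find attached our order. The amount due is EUR 1280.00."
match = re.search(r"amount due is EUR\s+(\d+\.\d{2})", letter)
total_unstructured = match.group(1) if match else None

print(total_structured)    # 1280.00
print(total_unstructured)  # 1280.00 -- but only because the phrasing matched
```

<p>The brittle regular expression is the point: change one word in the letter and the bot is illiterate again.</p>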



<h2 class="wp-block-heading"><strong>Why bots can&#8217;t read documents</strong></h2>



<p>A document can be seen as a container for content. The content is made up of static data plus explicit and hidden information describing the relationships between the data, which gives the document its meaning. A document also has meta-data: the properties of the container itself. Content is represented in documents in a way that lets humans process it. Note that this has an important implication: most documents were never designed to be read by bots. When data must be extracted, it doesn’t matter to human workers whether a document has a fixed structure like a form, a semi-structured layout like an invoice, or no structure at all like a contract. With some proper instructions, human workers will be able to find the data they are looking for; depending on how much structure there is, processing time may of course vary significantly. Bots, however, have no cognitive reading skills. And adding OCR and data capture technology to an RPA solution is often not enough to make bots really skilled at processing documents.</p>



<figure class="wp-block-image size-large"><img decoding="async" class="wp-image-711" src="https://skilja.com/wp-content/uploads/arlington-research-kN_kViDchA0-unsplash-1-1024x683.jpg" alt="" />
<figcaption>Photo by Arlington Research on <a href="https://unsplash.com/s/photos/business?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText" target="_blank" rel="noreferrer noopener">Unsplash</a></figcaption>
</figure>



<h2 class="wp-block-heading"><strong>Why OCR and Data Capture often do not offer the right reading skills for bots</strong></h2>



<p>The short answer: these technologies fall short because they were not developed for RPA. OCR was not designed to understand content; it is a technology for converting pixels into characters. Most OCR packages can also convert document images (scans) into text files while recreating the original layout. Data capture systems use OCR technology alongside many other AI technologies, and were designed for extracting data from large volumes of documents with the highest possible accuracy. Neither OCR nor data capture was designed for RPA users to teach their bots how to read documents. </p>



<p>The users who typically set up data capture installations are engineers who know exactly how to use all the levers and parameters of these systems. They will often create scripts or even self-coded additions in order to achieve the highest possible accuracy. The initial setup effort is high, but that makes a lot of sense: the ROI of data capture installations doesn’t have to be fast, because these installations are almost always set up to run for many years. </p>



<p>Batch-oriented data capture systems, although very powerful when set up correctly, are logically not the first choice of RPA users who need to add document processing capabilities to their bots. These users are looking for simple, easy, fast and flexible functionality. The volumes they need to process are small. They also have a greater need for systems that learn while doing, because the full spectrum of variability in the documents to be processed will not be available at the start of the project. They often need a higher level of intelligence as well, because they want to automate tasks that were formerly performed by humans; and when they design new processes, RPA users want to create intelligent bots that behave just like humans. But what RPA users cannot handle, and what would break the RPA paradigm, is if adding reading skills to bots comes at the expense of requiring a lot of investment and special technical skills like solution design, coding and production testing.</p>



<p>Note that in cases where RPA systems are used for processing large volumes of documents, it makes sense to use data capture systems to extract data from those documents, because at scale efficiency &#8211; that is, accuracy &#8211; will most likely play a significant role. Processing large volumes of documents is data capture’s sweet spot.</p>



<p>What RPA users really need when they have to process documents is something that may look a bit like OCR and data capture <strong>but is much smarter than that, because it operates a much broader set of AI technologies</strong>, while at the same time being easier to use.</p>



<h2 class="wp-block-heading">IDP: a new product category for a new market</h2>



<p>The massive proliferation of RPA installations in companies over the past three to four years has led to increasingly high demand for these easy, simple, flexible yet powerful intelligent data capture solutions. This fast-growing demand has spawned a new generation of companies that have gone down a different path, using different technologies than the incumbents in the data capture market have been using for more than 20 years. These new systems are all based on the idea that Deep Neural Networks and other forms of Machine Learning are better and much easier ways to fulfill the needs of RPA users: all you need is a lot of samples, train your neural networks and off you go. What is unclear at this stage, however, is whether ML can actually deliver the accuracy needed when bots with humanlike reading skills execute mission-critical tasks. When deep neural networks make mistakes you cannot go in and correct for those mistakes; these systems are black boxes. It is noteworthy that all incumbents are updating their existing offerings by adding ML technologies, especially with the goal of becoming better at processing unstructured documents. And it is not an unreasonable assumption that these companies, with many years of experience developing document processing systems, have an advantage over competitors that are fully ML-focused. It looks quite plausible that incumbents are better set up to marry old and new AI technologies, based on their solid understanding of how to build robust document processing systems.</p>



<p>All these efforts by old and new companies to develop document processing skills for RPA bots have led to a new category of products that cater to the digital transformation market rather than the traditional capture market. The emergence of these new products was the reason for the Everest Group, a leading management consulting and research firm, to coin a new product category: <strong>Intelligent Document Processing or IDP</strong>.</p>



<p><a href="https://www.everestgrp.com/2019-12-understanding-enterprise-grade-idp-solutions-market-insights-52033.html">Everest Group defines IDP</a> as any software product or solution that captures data from documents (e.g., email, text, pdf, and scanned documents), categorizes, and extracts relevant data for further processing using AI technologies such as computer vision, OCR, Natural Language Processing (NLP), and machine/deep learning. These solutions are typically non-invasive and can be integrated with internal applications, systems, and other automation platforms.</p>
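<p>Read as a pipeline, this definition amounts to &#8220;classify, then extract&#8221;. The following toy sketch (Python; simple keyword rules stand in for the computer vision, NLP and ML components Everest mentions, and all names are illustrative) shows the shape of such a pipeline:</p>

```python
import re

def classify(text):
    """Toy document categorizer; keyword rules stand in for an ML classifier."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "invoice"
    if "policy" in lowered:
        return "policy"
    return "other"

def extract(doc_class, text):
    """Toy field extraction; per-class rules stand in for NLP/ML extractors."""
    fields = {}
    if doc_class == "invoice":
        match = re.search(r"total[:\s]+([\d.,]+)", text, re.IGNORECASE)
        if match:
            fields["total"] = match.group(1)
    return fields

text = "Invoice No. 2020-17\nTotal: 1,280.00 EUR"
doc_class = classify(text)
print(doc_class, extract(doc_class, text))  # invoice {'total': '1,280.00'}
```

<p>A real IDP product replaces each stub with trained models, but the non-invasive integration Everest describes is exactly this shape: text in, categorized and extracted data out.</p>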



<figure class="wp-block-image size-large"><img decoding="async" class="wp-image-721" src="https://skilja.com/wp-content/uploads/markus-spiske-3Tf1J8q9bBA-unsplash-1024x683.jpg" alt="" />
<figcaption>Photo by Markus Spiske on <a href="https://unsplash.com/s/photos/business?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText" target="_blank" rel="noreferrer noopener">Unsplash</a></figcaption>
</figure>



<h2 class="wp-block-heading"><strong>About OCR and Online-learning</strong></h2>



<p>In many blogs about IDP, authors set up a contrast between OCR as the old-fashioned, unintelligent way of extracting data and modern AI-based extraction methods that are state of the art and intelligent. First of all, as I pointed out, OCR is not data extraction; OCR is one of many AI technologies used in data capture systems. And there are different ways to build intelligence into systems that help find and interpret data in documents. Machine learning may perform better on unstructured documents, but when dealing with forms and semi-structured documents, systems that combine templates, classifiers and other AI technologies including machine learning will almost always outperform systems based solely on machine learning. Note that these comprehensive systems, operating ML in a smart way, learn from automated feedback &#8211; users correcting mistakes, and the results of successful and correct classification and extraction &#8211; to generate additional knowledge (expanding the knowledge space) and statistics on the usage of existing knowledge. See <a href="https://skilja.com/the-magic-of-online-learning/" target="_blank" rel="noreferrer noopener">The Magic of Online-Learning</a>.</p>
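<p>The online-learning loop described above can be sketched in a few lines (Python; a per-class word-count classifier is an illustrative stand-in, not the actual mechanism of any product mentioned here):</p>

```python
from collections import Counter, defaultdict

class OnlineClassifier:
    """Toy online learner: per-class word counts, updated one document at a time."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)

    def predict(self, text):
        words = text.lower().split()
        scores = {cls: sum(counts[w] for w in words)
                  for cls, counts in self.word_counts.items()}
        return max(scores, key=scores.get) if scores else None

    def learn(self, text, confirmed_class):
        # The feedback loop: every verified or corrected result expands the
        # knowledge base, so accuracy improves while the system is running.
        self.word_counts[confirmed_class].update(text.lower().split())

clf = OnlineClassifier()
clf.learn("invoice total amount due", "invoice")        # user-confirmed result
clf.learn("insurance policy coverage terms", "policy")  # user correction
print(clf.predict("please pay the total amount"))       # invoice
```

<p>The essential property is that learning happens in production, from the verification work users do anyway, rather than in a separate offline training project.</p>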



<h2 class="wp-block-heading"><strong>What is the ideal IDP solution?</strong></h2>



<p>The challenge for intelligent document processing is that there seems to be no single ideal approach. Depending on the type of documents, the volumes that have to be processed and the importance of accuracy, either ML, more traditional AI approaches, or a blend thereof will be the best choice. When document volumes are big, accuracy is important: when a shared service centre processes 50 million documents a year, improving accuracy from, say, 95% to 96% is significant, because it reduces the number of documents that have to be corrected by 500,000. Another case where accuracy is critical is straight-through processing.</p>
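<p>The arithmetic behind that example is worth spelling out, because a one-point accuracy gain sounds small until it is multiplied by the volume:</p>

```python
# Worked example from the text: at 50 million documents a year, one extra
# point of accuracy removes half a million manual corrections.
docs_per_year = 50_000_000

errors_at_95 = docs_per_year * (1 - 0.95)  # 2,500,000 documents to correct
errors_at_96 = docs_per_year * (1 - 0.96)  # 2,000,000 documents to correct

print(round(errors_at_95 - errors_at_96))  # 500000
```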



<p>It seems that the best option customers have is to adopt an IDP platform that enables them to operate different solutions, or combinations of such solutions, while shielding users from their complexity. Vinna is such a platform; it even allows adding crowdsourcing (Human In The Loop verification) without users ever being aware of the complexity of what goes on under the hood.</p>



<p>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;</p>



<div class="wp-block-image">
<figure class="alignleft size-large is-resized"><img decoding="async" class="wp-image-654" src="https://skilja.com/wp-content/uploads/0-3.jpeg" alt="Jupp Stöpetie" width="170" height="170" /></figure>
</div>



<p class="has-text-align-left">As CEO of ABBYY Europe, Jupp Stöpetie established ABBYY&#8217;s presence in the Western European markets, growing the brand and market presence to a leadership position over more than 25 years. His experience includes founding and growing companies and managing all levels of business operations, sales and marketing. Jupp left ABBYY in spring 2020 and now works as an independent consultant based in Munich, Germany.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>IDP: Solving bot illiteracy in the digital workforce &#8211; Part 1</title>
		<link>https://skilja.com/idp-solving-bot-illiteracy-in-the-digital-workforce-part-1/</link>
		
		<dc:creator><![CDATA[Guest]]></dc:creator>
		<pubDate>Wed, 15 Jul 2020 10:11:47 +0000</pubDate>
				<category><![CDATA[Essentials]]></category>
		<category><![CDATA[Guest Post]]></category>
		<category><![CDATA[Market]]></category>
		<category><![CDATA[Process]]></category>
		<guid isPermaLink="false">https://skilja.com/?p=649</guid>

					<description><![CDATA[Editor’s note:&#160;This is a guest post from Jupp Stöpetie In this post we examine the role of Intelligent Document Processing (IDP) relative to Robotic Process Automation (RPA) and how these technologies drive Digital Transformation when combined. In part one we look at what is driving and enabling&#160;Digital Transformation. Part two then is dedicated to RPA [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><span style="color:#7cb955" class="has-inline-color"><strong>Editor’s note</strong>:&nbsp;</span>This is a guest post from Jupp Stöpetie</p>



<p><em>In this post we examine the role of Intelligent Document Processing (IDP) relative to Robotic Process Automation (RPA) and how these technologies drive Digital Transformation <em>when combined</em>.</em> <em>In part one we look at what is driving and enabling&nbsp;Digital Transformation. Part two then is dedicated to RPA and why IDP is essential.</em></p>



<h2 class="wp-block-heading">Digital Transformation</h2>



<p>Companies all over the world are redesigning and digitizing their businesses at an ever faster pace and they have many reasons for doing so:</p>



<ul class="wp-block-list"><li>They want to serve their customers faster and better.</li><li>They understand that automation and the use of AI technologies improve innovation, agility, scalability and cost-efficiency.</li><li>They appreciate that, in contrast to human labor-intensive processes, digital processes<ul><li>take less time to design,</li><li>need less capital investment,</li><li>are faster to deploy,</li><li>and are (much) less costly to run.</li></ul></li></ul>



<p>The above is commonly referred to as Digital Transformation. Citing Salesforce’s definition: “Digital Transformation (DX) is the process of using digital technologies to create new — or modify existing — business processes, culture, and customer experiences to meet changing business and market requirements.“&nbsp;</p>



<p>The main drivers of Digital Transformation are ongoing globalisation and the Fourth Industrial Revolution, which have led to dramatically increased levels of competition. DX greatly improves the agility of businesses, so they can adapt much faster than ever before to changes in the market. Changing or even completely redesigning digital processes is a lot easier and comes at much lower cost, which significantly increases a business’ competitiveness. Note that even terminating digitised processes costs much less than when a lot of capital investment and labor was involved. Basically, businesses have no choice: they must become digital, and those who are slow to change find themselves in an increasingly disadvantageous position. New businesses nowadays will always start with a digital concept in mind, and by doing so will avoid manual processes where possible and sensible.&nbsp;</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://skilja.com/wp-content/uploads/mike-kononov-lFv0V3_2H6s-unsplash-1024x576.jpg" alt="" class="wp-image-707"/><figcaption><span class="has-inline-color has-black-color">Photo by&nbsp;Mike Kononov&nbsp;on&nbsp;</span><a rel="noreferrer noopener" href="https://unsplash.com/s/photos/business?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText" target="_blank"><span class="has-inline-color has-black-color">Unsplash</span></a></figcaption></figure>



<h2 class="wp-block-heading"><strong>Market Data on Digital Transformation</strong></h2>



<p>Worldwide spending on technologies and services that enable digital transformation (DX) of business practices, products, and organizations is forecast to reach $2.3 trillion in 2023, according to a new update to the International Data Corporation (<a href="http://www.idc.com/">IDC</a>) <a href="http://www.idc.com/getdoc.jsp?containerId=IDC_P32575" target="_blank" rel="noreferrer noopener">Worldwide Semiannual Digital Transformation Spending Guide</a>. DX spending is expected to expand steadily throughout the 2019-2023 forecast period, achieving a five-year compound annual growth rate of 17.1%. &#8220;We are approaching an important milestone in DX investment with our forecast showing the DX share of total worldwide technology investment hitting 53% in 2023,&#8221; said <a href="https://www.idc.com/getdoc.jsp?containerId=PRF005039" target="_blank" rel="noreferrer noopener">Craig Simpson</a>, research manager with IDC&#8217;s <a href="https://www.idc.com/promo/customerinsights" target="_blank" rel="noreferrer noopener">Customer Insights and Analysis Group</a>.</p>



<p>In their July report, Grand View Research reported that the global Robotic Process Automation market, a segment of the overall DX market, was valued at USD 1.40 billion in 2019 and is projected to exhibit a compound annual growth rate (CAGR) of 40.6% from 2020 to 2027.</p>



<h2 class="wp-block-heading"><strong>Why is Digital Transformation taking place now?</strong></h2>



<p>In the last decade we have seen a tsunami of digital transformation projects, driven by companies&#8217; accelerating desire to increase competitiveness and innovation. But of course one could argue that businesses have always had that desire. So why is all of this happening now? Haven&#8217;t companies been automating for decades already? Yes, but not at the current pace.&nbsp;</p>



<p><em>Note: at this stage it is unclear how the Covid-19 pandemic will influence market dynamics. It seems however unlikely that the need for companies to digitize their businesses will slow down. On the contrary, it is much more likely that the opposite will happen.</em></p>



<p>What has made things totally different from, say, ten years ago, and has really enabled Digital Transformation to take place, is the enormous progress in both computer science and the computer industry. Compared with ten years ago we see:</p>



<ul class="wp-block-list"><li>an enormous increase in computational power</li><li>the rise of super-powerful algorithms that need incredible amounts of computational power, which is now available</li><li>vastly improved connectivity, both in speed and access points, improving at an accelerating pace (5G)</li><li>smartphones: in 2009, 170 million smartphones were sold; in 2020, an estimated 1.5 billion units will be sold, which makes for an estimated 3.5 billion people having a smartphone in 2020</li><li>huge and disproportionate &#8211; compared to the rest of the economy &#8211; amounts of money that have been invested in tech companies. Successful tech companies have rewarded their investors with multiples that dwarf any other industry.</li><li>and last but not least, a steadily growing appetite of consumers and businesses for more, better and faster service anywhere at any time.</li></ul>



<h2 class="wp-block-heading"><strong>Some examples of how robotic process automation and document processing drive Digital Transformation in companies</strong></h2>



<p>A large international pharmaceutical company wanted to capture all details from their purchase orders and perform lookups and validation against their ERP. The tasks were automated using an RPA system, and an intelligent document processing system was needed to read the data from the POs.</p>



<p>A large financial services company wanted to use RPA to automate their KYC process, which involves capturing and verifying driver licences, passports etc. Early in the process they found that they also needed an intelligent document processing system.</p>



<p>A global logistics company wanted to automate their invoice processing (millions of documents). Their RPA system was found to be up to the challenge, but the company initially backed off because of the complexity of extracting data from millions of semi-structured invoices with a wide range of varying layouts. Only after a proof of concept showed that there was an intelligent document processing system on the market that was up to the challenge did the company proceed with the project.</p>



<p>In part 2 of this post we will discuss how the combination of RPA with IDP works and which challenges need to be considered.</p>



<p>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;</p>



<div class="wp-block-image"><figure class="alignleft size-large is-resized"><img decoding="async" src="https://skilja.com/wp-content/uploads/0-3.jpeg" alt="Jupp Stöpetie" class="wp-image-654" width="170" height="170"/></figure></div>



<p class="has-text-align-left">As CEO of ABBYY Europe, Jupp Stöpetie established ABBYY&#8217;s presence in the Western European markets, growing the brand and market presence to a leadership position over more than 25 years. His experience includes founding and growing companies and managing all levels of business operations, sales and marketing. Jupp left ABBYY in spring 2020 and now works as an independent consultant based in Munich, Germany.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How Digitization Will Change the Insurance Industry</title>
		<link>https://skilja.com/how-digitization-will-change-the-insurance-industry/</link>
		
		<dc:creator><![CDATA[skiljaadmin]]></dc:creator>
		<pubDate>Tue, 08 Mar 2016 11:52:10 +0000</pubDate>
				<category><![CDATA[Market]]></category>
		<category><![CDATA[Process]]></category>
		<guid isPermaLink="false">https://skilja.com/?p=107</guid>

					<description><![CDATA[Insurance companies always have been at the forefront of automating processes. Because they are at the source of a significant amount of traffic and correspondence with their customers that leads to repeatable processes: Claims management and Policy creation. In a first wave starting end of the 90′s a lot of insurances have successfully managed to [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Insurance companies have always been at the forefront of automating processes, because they are at the source of a significant amount of traffic and correspondence with their customers, which leads to repeatable processes: claims management and policy creation. In a first wave starting at the end of the 90s, many insurers successfully managed to move their paper-based processes to automated workflows using advanced classification and recognition technologies. I remember well all the big projects that we have been doing, and still do, in this area. This already significantly reduces the time to process a single case and the manpower needed to do so. And insurers face a significant cost problem and increasing competition, not only on their terms but also on their level of service.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2016/03/McKinsey-2.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2016/03/McKinsey-2.png" alt=""/></a></figure>



<p>With process automation and digitization &#8211; using modern classification technology, address and table extraction and semantic understanding &#8211; the content of the documents becomes digitally available. In the current phase, insurers are using this data and modern learning-based analytical methods to automate not only the process but also the decision making. If all information is available, an AI system (like Skilja’s learning classifiers) can automatically decide whether a claim is standard and can be paid right away, or whether it needs expert inspection. An application for a new contract can be clarified and checked mostly by semantic and analytical algorithms that link it to the insurer’s policy, leaving the underwriter the task of finally checking the suggestion based on the conclusion made by the machine.</p>
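<p>The decision logic for such a triage can be as simple as a confidence-and-amount threshold. A minimal sketch (Python; the thresholds and parameter names are purely illustrative, not Skilja&#8217;s actual rules):</p>

```python
def triage_claim(confidence, amount, max_auto_amount=1000.0, min_confidence=0.98):
    """Toy straight-through-processing rule: pay standard claims automatically,
    route everything uncertain or unusual to an expert."""
    if confidence >= min_confidence and amount <= max_auto_amount:
        return "auto-pay"
    return "expert-review"

print(triage_claim(0.99, 250.0))   # auto-pay
print(triage_claim(0.80, 250.0))   # expert-review: extraction too uncertain
print(triage_claim(0.99, 5000.0))  # expert-review: amount above auto limit
```

<p>In practice the confidence comes from the classifiers and extractors themselves, and the thresholds are tuned per document class and line of business.</p>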



<p>In two recent studies from McKinsey, one by Sylvain Johansson and Ulrike Vogelgesang (<a href="http://www.mckinsey.com/industries/financial-services/our-insights/automating-the-insurance-industry" target="_blank" rel="noreferrer noopener">Automating the insurance industry</a>), the other by Michael Chui, James Manyika, and Mehdi Miremadi (<a href="http://www.mckinsey.com/business-functions/business-technology/our-insights/four-fundamentals-of-workplace-automation" target="_blank" rel="noreferrer noopener">Four fundamentals of workplace automation</a>), the authors show in a very impressive way how this will affect the workplace in the industry:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow"><p>As the automation of physical and knowledge work advances, many jobs will be redefined rather than eliminated—at least in the short term. The bottom line is that 45 percent of work activities could be automated using&nbsp;already demonstrated technology. If the technologies that process and “understand” natural language were to reach the median level of human performance, an additional 13 percent of work activities in the US economy could be automated. The magnitude of automation potential reflects the speed with which advances in artificial intelligence and its variants, such as machine learning, are challenging our assumptions about what is automatable. It’s no longer the case that only routine, codifiable activities are candidates for automation and that activities requiring “tacit” knowledge or experience that is difficult to translate into task specifications are immune to automation.</p></blockquote>



<p>Being McKinsey, they look specifically at the workforce, the FTEs (full-time equivalents) needed for a task, and the skills required in the future. Using benchmarking across insurers, McKinsey finds, very interestingly, that most time is spent in policy creation, policy servicing and claims management.</p>



<figure class="wp-block-image"><a href="http://www.skilja.de/wp-content/uploads/2016/03/McKinsey-1.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2016/03/McKinsey-1.png" alt="" class="wp-image-1050"/></a></figure>



<p>Of course, claims management is minor in Life but significant in P&amp;C. The graph is interesting because Skilja is currently involved in several Life projects where we aim exactly at reducing the time needed for underwriting a policy. And the reduction we see using modern AI, deep learning and other algorithms is significant. New technology and agile development methodologies allow swift process automation at limited cost.</p>



<p>The effect on the insurance workforce is dramatic, as shown in the first graph above. Up to 25% of the FTEs in these repeatable tasks will be consolidated, meaning that insurers can lower their costs, increase the number of customers served and reduce the time to process a claim or application.</p>



<p>For more detailed information we recommend reading the excellent original article “Insurance on the threshold of digitization: Implications for the Life and P&amp;C workforce”, available for download&nbsp;<a href="http://www.mckinsey.com/industries/financial-services/our-insights/insurance-on-the-threshold-of-digitization">here</a>.</p>



<p>If you are interested in how this can be achieved from the technological side, and in the current state of the art of classification, document understanding and automatic decision making, please visit www.skilja.com or contact us at info(at)skilja.com.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Visiting Docville – October 2014</title>
		<link>https://skilja.com/visiting-docville-october-2014/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Fri, 24 Oct 2014 13:23:30 +0000</pubDate>
				<category><![CDATA[Market]]></category>
		<category><![CDATA[News]]></category>
		<guid isPermaLink="false">https://skilja.com/?p=249</guid>

					<description><![CDATA[Now already a tradition we just had the fifth meeting of Docville in&#160;Brussels this week.&#160;Docville&#160;is a networking&#160;&#38; exchange initiative for executives from the international Information Management ecosystem (Capture, ECM, BPM, BI and BPO), organized and facilitated&#160;by&#160;Michael Ziegler. Docville now has more than 1100 members on&#160;LinkedIn. This group connects regularly to exchange their experiences on advances [&#8230;]]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-image"><figure class="alignright"><a href="http://www.skilja.de/wp-content/uploads/2012/05/Docville.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2012/05/Docville.png" alt=""/></a></figure></div>



<p>Now already a tradition, we just had the fifth meeting of Docville in&nbsp;Brussels this week.&nbsp;<a href="http://www.docville.net/">Docville</a>&nbsp;is a networking&nbsp;&amp; exchange initiative for executives from the international Information Management ecosystem (Capture, ECM, BPM, BI and BPO), organized and facilitated&nbsp;by&nbsp;Michael Ziegler. Docville now has more than 1100 members on&nbsp;<a href="http://www.linkedin.com/groups?viewMembers=&amp;gid=3780162&amp;sik=1336729256356" target="_blank" rel="noreferrer noopener">LinkedIn</a>. The group connects regularly to exchange experiences on advances in technology and trends in the market. This year, for the first time, Skilja was one of the sponsors of the event.</p>



<p>As usual, the discussions, held at roundtables, were divided between market and marketing aspects on the one hand and technical topics on the other. For me – as a techie – the most relevant discussions revolved around:</p>



<ul class="wp-block-list"><li>Smart Process Apps (SPA) and intelligent BPMS (iBPMS) – are they really the savior for the future?</li><li>Next-generation document service BPOs – from back-file conversion to lucrative document process automation</li><li>AP automation in the cloud – now that this is a reality, what are the challenges?</li><li>Moving from on-premise software to cloud services: the impact on the IM software vendors’ business</li><li>Mobile apps, devices and content in a mobile work environment</li><li>The changing face of document capture and the value proposition</li><li>Auto-classification demystified, or the end of manual indexing</li></ul>



<p>As you can easily see, the hot topics are (surprise!) mobile/cloud and document automation. The latter, you might remember, was also reflected in my recent post “<a href="http://www.skilja.de/2014/intelligence-everywhere-top-10-technology-trends-2015/">Intelligence Everywhere</a>”, which featured a corresponding Gartner study.</p>



<p>Interestingly enough, nobody was really interested in AP automation any more. In fact, some of the participants didn’t even want to discuss it or use it as an example – not because it is not a big business opportunity, but because it is now self-evident. It is clear that it can be done, how it is done and who is doing it. What a development over the last five years! Of course, now everybody is looking for a similar opportunity to package a solution. The hot candidate is HR.</p>



<p>The BPOs especially are looking for opportunities to create new offerings for their customers and, in general, to convert themselves from scan services into real business process outsourcers. Everybody wants to move up the value chain. With the maturing software this is now possible – what remains is to convince the customers.</p>



<p>Auto-classification and automation were discussed widely, with general agreement that this technology is entering the mainstream. After the pioneering years (starting 15 years ago), it is now generally accepted that human cognitive tasks can and should be automated – and not only in mailroom automation, but in a variety of other areas where our software can take over categorization and decision making. It was also agreed that this is best provided at the component level by specialized plugins from technology providers, so that vendors can focus on the actual solutions for their customers – very similar to the way OCR evolved just 10 years earlier.</p>



<p>Another interesting aspect was the discussion about cloud enablement. The acceptance of cloud services in Europe seems to be close to zero. Whether this is due to security concerns or fears of data loss, nobody knows for sure. But as a vendor you need to be cloud-enabled (a checkbox feature) – even if nobody then buys it. The Nordic countries seem more willing to use the new offering. Customers also obviously still want to buy the software (CAPEX) instead of renting it (OPEX) – surprising, but this came out as the result of a quick survey we did. It shows nicely how different the view of analysts can be from the actual reality, although we all expect that the big move to the cloud could happen any day…</p>



<p>You can probably remember more than one event where the most valuable (and pleasant!) moments occurred during conversations with other attendees in short coffee and lunch breaks, or over post-schedule drinks. In these interactions, sometimes the most valuable information and opinions are gained and exchanged. This is especially true for Docville, where these conversations continued over good food and drinks until late in the night in a bar in Brussels.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Intelligence Everywhere: Top 10 Technology Trends 2015</title>
		<link>https://skilja.com/intelligence-everywhere-top-10-technology-trends-2015/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Tue, 14 Oct 2014 13:24:47 +0000</pubDate>
				<category><![CDATA[Market]]></category>
		<guid isPermaLink="false">https://skilja.com/?p=251</guid>

					<description><![CDATA[Gartner has just published their outlook on the strategic technical developments that will be important for enterprises in the next three years. Well – it seems that we at Skilja are at the right place with what we are doing. We have known this for a long time and our success shows it, but it [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Gartner has just published their outlook on the strategic technical developments that will be important for enterprises in the next three years.</p>



<p>Well – it seems that we at Skilja are at the right place with what we are doing. We have known this for a long time and our success shows it, but it is a nice confirmation from an outside view that we are not only in the middle of innovation but actually leading it.</p>



<p>If you look at their second row of technologies, it is all about intelligence and analytics. This is exactly where Skilja is positioned, with&nbsp;<strong><em>classification, text analytics and automatic decision making</em></strong>:</p>



<figure class="wp-block-image"><a href="http://gartnernews.com/gartners-top-10-strategic-technology-trends-for-2015/"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2014/10/Top10TechTrends_infographic_final-676x1024.jpg" alt="" class="wp-image-663"/></a></figure>



<p>Top 10 Tech Trends 2015 (graphics by Gartner)</p>



<p><strong>“Advanced, Pervasive and Invisible Analytics:</strong>&nbsp;Analytics will take center stage as the volume of data generated by embedded systems increases and vast pools of structured and unstructured data inside and outside the enterprise are analyzed. Big data remains an important enabler for this trend but the focus needs to shift to thinking about big questions and big answers first and big data second — the value is in the answers, not the data.”</p>



<p><strong>“Context-Rich Systems</strong>: By understanding the context of a user request, applications can not only adjust their security response but also adjust how information is delivered to the user, greatly simplifying an increasingly complex computing world.”</p>



<p><strong>“Smart Machines:</strong>&nbsp;Deep analytics applied to an understanding of context provide the preconditions for a world of smart machines. This foundation combines with advanced algorithms that allow systems to understand their environment,&nbsp;<strong><em>learn for themselves</em></strong>&nbsp;and act autonomously. The smart machine era will be the most disruptive in the history of IT.”</p>



<p>All this will be in the cloud and available for mobile – which is a matter of course for all our activities and developments, which are built on distributed, cloud-enabled systems.</p>



<p>To top off this great support from Gartner for our strategy, they finish with “<em><strong>Agile programming</strong></em>&nbsp;of everything from applications to basic infrastructure is essential to enable organizations to deliver the flexibility required to make the digital business work.”</p>



<p>Does this sound like what we have been evangelizing and practising for many years? Yes it does!&nbsp;</p>



<p>For full Gartner article see&nbsp;<a href="http://www.gartner.com/newsroom/id/2867917" target="_blank" rel="noreferrer noopener">here</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Visiting AIIM 2013</title>
		<link>https://skilja.com/visiting-aiim-2013/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Thu, 28 Mar 2013 14:02:25 +0000</pubDate>
				<category><![CDATA[Cognition]]></category>
		<category><![CDATA[Market]]></category>
		<guid isPermaLink="false">https://skilja.com/?p=275</guid>

					<description><![CDATA[AIIM is the community that provides education, research, and best practices on information management and collaboration. This year the&#160;AIIM&#160;community met in New Orleans, the city of music and French Creole architecture. New Orleans is also called the “Big Easy” – well big was certainly one of the topics that were discussed related to data, but [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>AIIM is the community that provides education, research, and best practices on information management and collaboration. This year the&nbsp;<a href="http://www.aiimconference.com/" target="_blank" rel="noreferrer noopener">AIIM&nbsp;</a>community met in New Orleans, the city of music and French Creole architecture. New Orleans is also called the “Big Easy” – well big was certainly one of the topics that were discussed related to data, but everybody agreed that this would not be easy.</p>



<p>The new format of AIIM as a conference with keynotes and tracks, introduced in 2012, was continued and was again very successful: the conference was sold out well in advance, with more than 600 participants. So hurry up if you want to be there next year, when AIIM 2014 takes place in Orlando, FL. The presentations, mainly given by end users and subject matter experts, were grouped in two tracks:</p>



<ul class="wp-block-list"><li>Engagement: How can I interact with my customers?</li><li>Governance: How can I manage the wealth and abundance of data?</li></ul>



<p>A third format was called Interaction: at several round tables, industry experts discussed specific topics with their audience. The tables were often crowded, as interest was really high. Maybe this format should be changed to a podium for the next conference.</p>



<p>From all the presentations that I visited, let me highlight two. First, we heard a very good, well-founded overview by&nbsp;<a href="http://www.linkedin.com/in/richmedinadoculabs">Richard Medina</a>&nbsp;from Doculabs about deleting information. Sounds boring? Not at all!</p>



<p>Doing this correctly not only saves a lot of money, but is often also a legal requirement. Richard made a strong case for automatic classification – “Manual classification is not an option” – while admitting that there is no perfect solution out there yet: “The technologies are immature and varied, but you can be successful by matching the techniques to the kind of files you want to target.” However, he was talking about statistical classification only, whereas we know today that semantic analysis is the technology that will do this job much better.</p>
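<p>To make the distinction concrete, here is a minimal sketch of what a purely statistical classifier does (a toy illustration with made-up classes and keyword weights, not any vendor’s implementation): it scores documents by weighted word counts alone.</p>

```python
from collections import Counter

# Hypothetical per-class keyword weights -- for illustration only.
CLASS_WEIGHTS = {
    "invoice":  {"invoice": 2.0, "amount": 1.0, "due": 1.0},
    "contract": {"agreement": 2.0, "party": 1.0, "term": 1.0},
}

def classify(text):
    """Pick the class with the highest bag-of-words score."""
    words = Counter(text.lower().split())
    scores = {
        label: sum(w * words[t] for t, w in weights.items())
        for label, weights in CLASS_WEIGHTS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

<p>Such a scorer ignores word order and meaning entirely – which is exactly the limitation of the statistical approach that semantic analysis addresses.</p>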



<p>Richard gave an impressive example where the implementation of an automated classification and retention system at a big organization saved $2.5 million per year through correct document disposition.</p>



<p>Second,&nbsp;<a href="https://www.linkedin.com/pub/denise-bedford/9/bb/abb">Denise Bedford</a>, Goodyear Professor of Knowledge Management at Kent State University, ventured to talk about semantic technology in front of this application-oriented audience and made some really good points. As you all know, I am a big proponent of using the semantic richness of our language for the understanding of documents, and Denise gave us some really fine arguments for doing so. Her presentation was called&nbsp;<em>“Smart and Semantic – Stop asking people to manage information and start teaching machines to do it”.</em></p>



<p>Sounds familiar, doesn’t it?&nbsp;In Denise’s opinion: “Semantic analysis technologies have come of age. Semantic applications can now leverage embedded human knowledge to help us manage our business information in smart and strategic ways. [..] Semantic technologies require strong knowledge bases to work well, but they can help solve missing metadata problems.”</p>



<p>This was an ideal preparation for my own round table, which I hosted on Friday on the subject of&nbsp;<em>“<a href="http://www.aiimconference.com/conference/agenda/day-3/12b-roundtable-session">How semantic technologies enhance document processing and document management</a>”</em>. Interest in this round table was huge, and after a brief introduction we discussed all the topics and questions that you also see discussed in my posts. Many agreed that this is what they are looking for to help solve their requirements, especially around semantic indexing. We will continue to explain and discuss this new approach in upcoming posts.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Visiting Text Analytics Summit</title>
		<link>https://skilja.com/visiting-text-analytics-summit/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Thu, 14 Jun 2012 15:00:00 +0000</pubDate>
				<category><![CDATA[Market]]></category>
		<guid isPermaLink="false">https://skilja.com/?p=315</guid>

					<description><![CDATA[Text Analytics is a part of the discipline that we call Document Understanding on this site. Originally focused on text mining, text analytics moves more and more to real time analysis of documents to create actionable insight. This year’s conference in Boston was the 8th of its kind and saw a significant shift of activity [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Text analytics is part of the discipline that we call document understanding on this site. Originally focused on text mining, text analytics is moving more and more towards real-time analysis of documents to create actionable insight. This year’s conference in Boston was the 8th of its kind and saw a significant shift of activity towards analyzing social media. This is a big trend in the U.S., as it is seen as crucial for understanding customer sentiment, maximizing social media productivity and optimizing market research. I have my doubts whether the results gained from social media analytics are really as valuable and reliable as presented at the conference. Just think about who is writing public comments (not restricted to friends and circles) on the web, and whether this is the group that a company should take as a guideline for its strategy. Nevertheless, right now it is a big hype – maybe because it represents the only way for big enterprises to tackle social media at all.</p>



<p>But social media was not the only field presented. NASA showed airline safety as a very valid application of text analytics, and other examples included improving customer service by analyzing reports and, of course, e-discovery for litigation support.</p>



<p>In the opening presentation, Seth Grimes, President of Alta Plana, made some good remarks on the state of the industry in his talk “Text and Beyond”. In his view, a megatrend is real-time operation on Big Data (described by the three V’s: volume, velocity, variety). The goal must not be to reduce information but to do more with more (more actions with more data). The problem is not information overload but filter failure – which I find a refreshing new look at the situation.</p>



<p>An important part of text analytics will be&nbsp;<strong>knowledge enrichment</strong>: semantics enables a join across types, sources and structures, using meaningful identifiers to create an ensemble that is greater than the sum of its parts. Semantics interrelates information to represent knowledge. And text analytics generates semantics (= meaning) to bridge search, BI and applications, enabling next-generation information systems.</p>
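<p>The idea of such a semantic join can be sketched in a few lines (hypothetical identifiers and records, purely for illustration): once text analytics resolves mentions to a shared identifier, facts derived from text can be merged with structured records.</p>

```python
# Structured source: a CRM record keyed by a canonical entity ID
# (the ID "Q312" and the records are made up for this sketch).
crm = {"Q312": {"name": "Acme Corp", "segment": "enterprise"}}

# Output of text analytics: mentions resolved to the same IDs.
mentions = [
    {"entity": "Q312", "sentiment": "negative", "doc": "ticket-17"},
    {"entity": "Q312", "sentiment": "positive", "doc": "email-4"},
]

def enrich(crm, mentions):
    """Join text-derived facts onto records via shared identifiers."""
    enriched = {eid: dict(rec, mentions=[]) for eid, rec in crm.items()}
    for m in mentions:
        if m["entity"] in enriched:
            enriched[m["entity"]]["mentions"].append(m["doc"])
    return enriched
```

<p>The meaningful identifier, not the surface string, is what carries the join across sources.</p>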



<p>Kurt Williams&nbsp;from Mindshare Technologies gave good insight into “Next Generation Customer Surveys with Text Analytics” using IBM tools. He uses text analytics on open-ended comments from customers to explain the “why behind the what” and to fill in the gaps in survey design. Text analytics is a disruptive technology for customer feedback programs. As an example he mentioned restaurant feedback that is solicited when you exit the restaurant: it is a proven fact that solicited feedback yields a much higher rate of positive comments than feedback you simply wait for. He made a good distinction between monitoring and discovery, both of which can be achieved with text analytics:</p>



<ul class="wp-block-list"><li>Monitor the&nbsp;<strong>known unknowns</strong>&nbsp;= TQM for text</li><li>Discover the&nbsp;<strong>unknown unknowns</strong>&nbsp;= reveal things that are hidden</li><li>Monitoring uses&nbsp;<strong>rules</strong>&nbsp;and discovery uses&nbsp;<strong>correlations</strong></li></ul>
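<p>The rules-versus-correlations distinction can be sketched as follows (a toy example with invented comments and rules): monitoring fires on predefined patterns, while discovery counts co-occurrences and surfaces whatever pairing dominates.</p>

```python
from collections import Counter
from itertools import combinations

comments = [
    "cold food slow service",
    "cold food rude staff",
    "slow service cold food",
]

# Monitoring the known unknowns: fire on predefined rules.
RULES = {"slow": "speed complaint", "rude": "staff complaint"}

def monitor(comment):
    return [tag for word, tag in RULES.items() if word in comment.split()]

# Discovering the unknown unknowns: count term co-occurrences and
# surface the most frequent pairing, whatever it turns out to be.
def discover(docs):
    pairs = Counter()
    for doc in docs:
        for pair in combinations(sorted(set(doc.split())), 2):
            pairs[pair] += 1
    return pairs.most_common(1)[0]
```

<p>Here discovery would surface the “cold food” pairing even though no rule ever asked about temperature.</p>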



<p>A customer’s view on survey analytics was given by&nbsp;David Williams, Manager of Marketing Analytics at Walt Disney. They analyze customer feedback and forum questions to find regularities and new trends. As an example, Walt Disney analyzes comments on TripAdvisor (e.g. 1071 reviews for one hotel), correlated with ratings (stars), to find issues and trends in their hotels. Walt Disney also runs forums like the “Disney Mom’s Panel”, which is used as a source of information but also as a marketing instrument through real-time analysis. By detecting patterns in questions and relating them to the subsequent rate of booking, it is possible to identify individuals who require a promotion to convert or who need further marketing. Something to keep in mind in the future when talking to big organizations!</p>



<p>The best and most unusual application of text analytics was presented by&nbsp;Ashok Srivastava, Principal Scientist for Data Mining at NASA. They discover precursors to aviation incidents from data such as flight reports, radar data and weather information. There are 6 million flights per year in the U.S., each of which has associated free-text reports. An impressive visualization of the sheer number of daily flights can be seen in “<a href="http://www.youtube.com/watch?v=d9r3H4iHFZk&amp;feature=relmfu">A Day in the Life of Air Traffic Over the United States</a>”. The goal of the mining is to reduce the accident rate by identifying and responding to precursor events before an accident occurs. Using analytics, they can detect when the system moves from a safe state to a compromised state and on to an anomalous state. There are over 100k reports that can be used to answer WHY something happened. Text mining actually discovered airports, and areas on airports, that need further investigation. As an example, DFW (Dallas Fort Worth) has some problems on its runways: the airport is very complex, leading to problems in certain areas that could be detected using text analytics. This cannot be done manually, because there are many thousands of reports. A full discussion of this fascinating application, to the benefit of us all, can be found on&nbsp;<a href="http://www.youtube.com/watch?v=kxH5z8YsKuo">YouTube</a>.</p>



<p>An application of text analytics in a totally different field, but also dealing with risk, was presented by&nbsp;Mattias Tyrberg, CEO and Founder of Saplo from Sweden. They are using predictive text analytics to assess the credit scores of companies from publicly available reports and news articles. The system was trained with thousands of articles from the past 10 years for some 200 companies and was able to predict credit ratings for them. They are not using phrases but a refined model to represent the text, learning statistically from past data.</p>



<p>It is interesting in how many fields text analytics, and hence document understanding, can be applied. We will see much more of this in the near future, as unstructured data now becomes available as never before and the technologies evolve very rapidly.</p>



<p>Feel free to comment and share your opinions using the comment function, or send me an e-mail to discuss further. I am sure there is a lot more that can be said and done. Please share this article with friends and colleagues if you find it worth reading.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Visiting Enterprise Search Summit</title>
		<link>https://skilja.com/visiting-enterprise-search-summit/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Thu, 17 May 2012 15:05:00 +0000</pubDate>
				<category><![CDATA[Cognition]]></category>
		<category><![CDATA[Market]]></category>
		<guid isPermaLink="false">https://skilja.com/?p=321</guid>

					<description><![CDATA[Enterprise Search Summit (ESS2012) is a conference of business professionals in the field of professional search applications that takes place in May in New York. It was obvious from this year’s agenda that search technology need to go far beyond simply indexing content to provide the services business users are looking for. Indexing is a [&#8230;]]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-image"><figure class="alignright"><a href="http://www.skilja.de/wp-content/uploads/2012/05/ESS2012.jpg"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2012/05/ESS2012.jpg" alt="" class="wp-image-302"/></a></figure></div>



<p>Enterprise Search Summit (<a href="http://www.enterprisesearchsummit.com/Spring2012/" target="_blank" rel="noreferrer noopener">ESS2012</a>) is a conference for business professionals in the field of enterprise search applications that takes place in May in New York.</p>



<p>It was obvious from this year’s agenda that search technology needs to go far beyond simply indexing content to provide the services business users are looking for. Indexing is a problem that is basically solved, and high-performing solutions are available both commercially and as open source. But indexing alone does not allow you to find the content you are looking for: simple keywords will&nbsp;<strong>not</strong>&nbsp;sufficiently describe the content. Therefore, a lot of presentations focused on document understanding as a means to find relevant content. Interestingly enough, this is now the area where document capture, business process automation and search suddenly form a big overlap: the need to understand content. Various methods for achieving this goal were presented, but the consensus was that in the end only a semantic analysis of the content will give you the precision and recall you are looking for. Therefore we will see wide adoption of semantic text analytics methods in search in the near future.</p>



<p>And as Google, the market leader in search, is introducing the first semantic elements into its search right now, this will pave the way for wide adoption in the business world as well. Google calls it “<a href="http://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html">Introducing the Knowledge Graph: things, not strings</a>”, which is nicely put, and they add their vision: “We’ve always believed that the perfect search engine should understand exactly what you mean and give you back exactly what you want.”</p>



<p>At ESS 2012 in New York we saw some remarkable presentations:</p>



<p><a href="http://www.idc.com/getdoc.jsp?containerId=PRF000110" target="_blank" rel="noreferrer noopener">Sue Feldman</a>, VP of Search and Discovery Technologies at IDC, made clear in “The Present &amp; Future State of Search” that search has become pervasive because we cannot find our stuff. There is a significant legal risk in missing information or using wrong information, and a need for access to data as well as content. From this she derives the requirement for more semantic analysis – entities, relationships, locations, concepts – with reasoning, inferencing, prediction and analysis across sources, as well as data normalization and relationship mining. In Sue’s opinion we will see a convergence with business intelligence, and machine learning will make the difference.</p>



<p><a href="http://sethgrimes.com/" target="_blank" rel="noreferrer noopener">Seth Grimes</a>&nbsp;from Alta Plana gave a good overview of the state of semantics, starting with a quote from Edward Feigenbaum: “Reading text in general is a hard problem because it involves a lot of common sense knowledge. But reading from text in structured domains I don’t think is as hard. It is a critical problem that needs to be solved.”</p>



<p>In semantic search he sees identity, history and context instead of only keywords. It returns not only hit lists but categories (facets), clusters and answers. Relevance is determined not by page rank but by intent. This leads Seth to the definition that “semantic search finds and produces the information that supports the searcher’s immediate goal, across appropriate sources”.</p>



<p>A very good example of what you can do with semantic search outside the traditional enterprise needs was given by&nbsp;<a href="http://www.inbenta.com/" target="_blank" rel="noreferrer noopener">Jordi Torras</a>, the CEO of Inbenta, in his presentation on semantic search for web traffic and conversion rate. He showed how you can significantly increase the findability of your site, and hence its web traffic, using semantic methods and by simply listening to the user. The main point: forget keywords – users don’t think in keywords. Users know that normally what you type is not what you get. Semantic analysis helps to understand the intent of the users and to direct them to the correct site.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Visiting Docville May 2012</title>
		<link>https://skilja.com/visiting-docville-may-2012/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Fri, 11 May 2012 15:26:00 +0000</pubDate>
				<category><![CDATA[Market]]></category>
		<category><![CDATA[News]]></category>
		<guid isPermaLink="false">https://skilja.com/?p=324</guid>

					<description><![CDATA[I’m just back from the international Docville meeting in Brussels.&#160;Docville&#160;is a&#160;community of professionals in the ECM and capture industry, organized and facilitated&#160;by&#160;Michael Ziegler which already has more than 800 members on&#160;LinkedIn. We connect&#160;regularly to exchange opinions, information and share&#160;trends. Or as his perfect motto expresses:&#160;Travel Once-Meet Many This&#160;spring meeting in Brussels was again very valuable [&#8230;]]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-image"><figure class="alignright"><a href="http://www.skilja.de/wp-content/uploads/2012/05/Docville.png"><img decoding="async" src="http://www.skilja.de/wp-content/uploads/2012/05/Docville.png" alt="" class="wp-image-286"/></a></figure></div>



<p>I’m just back from the international Docville meeting in Brussels.&nbsp;<a href="http://www.docville.net/">Docville</a>&nbsp;is a&nbsp;community of professionals in the ECM and capture industry, organized and facilitated&nbsp;by&nbsp;Michael Ziegler, which already has more than 800 members on&nbsp;<a href="http://www.linkedin.com/groups?viewMembers=&amp;gid=3780162&amp;sik=1336729256356" target="_blank" rel="noreferrer noopener">LinkedIn</a>. We connect&nbsp;regularly to exchange opinions and information and to share&nbsp;trends. Or, as his perfect motto puts it:&nbsp;<strong>Travel Once – Meet Many<br></strong></p>



<p>This&nbsp;spring meeting in Brussels was again very valuable and provided many insights into technology and&nbsp;markets in this rapidly changing industry. As Michael puts it: Sharing ideas and learning valuable lessons from our peers in an environment of trust, respect and aligned interest is not only fun, it is mission-critical in today’s ultra-connected and rapidly changing world.</p>



<p>A lot of acquisition activity has been observed recently in this market: Lexmark buys Brainware, ReadSoft buys foxray, Kofax buys Singularity, and so on. It was therefore a real highlight that the Docville community received a keynote on the topic, presented by Christoph Löslein from&nbsp;<a href="http://www.boardadvisors.eu/">Board Advisors AG</a>. Christoph was a cofounder of Dicom (which made many acquisitions, of which the largest and most successful was Kofax) and has specialized in consulting companies in this area since he left Dicom in 2004.</p>



<p>He presented reasons why integrations fail, and when he showed the high-level causes I saw a lot of grins and nods in the audience, as almost everybody has been through this once or many times:</p>



<ul class="wp-block-list"><li>&nbsp;A vision but no plan of integration: Wishful thinking.</li><li>An integration plan, but no common vision and tangible synergies: &nbsp;an 8‐lane highway leading to nowhere.</li><li>Differences in culture no reconciliation of cultural differences</li><li>Lack of communication internally</li></ul>



<p>And the surprising number is that actually only 20% of all acquisitions turn out to be successful.</p>



<p>After this introduction we discussed topics of interest to the ECM market in the now well-known and valued round-table format. At each round table, one expert in the field presents his insights and then leads an open 45-minute discussion to get the views of all participants and answer questions. There are technical, organizational and market round tables.</p>



<p>Some of the topics presented:</p>



<ul class="wp-block-list"><li>From AP Invoice Processing to an Integrated P2P Approach- What are the Challenges and Opportunities?</li><li>Changing Software Licensing &amp; Delivery Models – Impact on Vendors and Customers</li><li>From Scanning to Electronic Document Capture &amp; Processing – Changing Platforms, Challenges &amp; Solution Providers</li><li>Document Service BPOs – what are the Global Trends that promise profitable growth?</li><li>Is the Digital Mailroom having a Renaissance?</li><li>SharePoint ECM: Leveraging Microsoft and Microsoft SharePoint Partners; Complementary or Competitive?</li><li>What can Semantic technologies accomplish for Document Understanding and Document Management?</li></ul>



<p>You can probably remember more than one event where the most valuable (and pleasant!) moments occurred during brief conversations with other attendees in short coffee and lunch breaks, or over post-schedule drinks – interactions that might have led to subsequent collaboration and better business, if only you had had more time to develop the conversation. This is especially true for Docville, where the conversations continued until late in the night (or early morning) in the bars of the surrounding quarters of Brussels.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Visiting Social Media Analytics Summit</title>
		<link>https://skilja.com/visiting-social-media-analytics-summit/</link>
		
		<dc:creator><![CDATA[Alexander]]></dc:creator>
		<pubDate>Thu, 19 Apr 2012 15:57:00 +0000</pubDate>
				<category><![CDATA[Market]]></category>
		<category><![CDATA[News]]></category>
		<guid isPermaLink="false">https://skilja.com/?p=329</guid>

					<description><![CDATA[I am just back from an interesting&#160;conference&#160;in San Francisco on social media analytics. It is a rather small conference; however the market it deals with is growing rapidly. You might ask what social media analytics has to do with document understanding and to tell you the truth: a lot. Every post and every tweet and [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>I am just back from an interesting&nbsp;<a href="http://textanalyticsnews.com/social-media-analytics/index.php" target="_blank" rel="noreferrer noopener">conference&nbsp;</a>in San Francisco on social media analytics. It is a rather small conference; however, the market it addresses is growing rapidly. You might ask what social media analytics has to do with document understanding, and to tell you the truth: a lot. Every post, every tweet, and every blog entry is a document. It might be a short document (at most 140 characters on Twitter), but it is a document created by a user with the intent to express something.</p>



<p>Companies are very interested in understanding what users – their customers – are saying. Today’s analytics tools try to measure quantitative data such as the number of followers of a brand, the number of re-tweets, and how often a product name is mentioned. Keyword search allows filtering out relevant contributions and even assigning a sentiment to posts. But the general agreement at the conference was that this provides only very limited insight. The real benefit is obtained if you also understand WHY a user is saying something about you, and this can only be achieved by analyzing the unstructured text – hence document understanding.</p>



<p>In this area, document understanding goes by the name of text analytics. The goal is to integrate natural language processing (NLP), which provides a much deeper analysis of the social media content. NLP capabilities include syntactic and semantic parsing, named-entity recognition, relationship analysis, and entity extraction. These methods are superior to traditional keyword and keyword-list search both in matching relevant social media posts and in the accuracy of understanding the information in those posts.</p>
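<p>To make the difference concrete, here is a minimal, hypothetical sketch. The example post, the brand names, and the regex-based “entity extraction” are invented for illustration – real NLP systems use trained parsers and recognizers, not regular expressions:</p>

```python
import re

# Hypothetical example post; the brand names are invented.
post = "Just switched from BrandX to BrandY because support at BrandX never replies."

# Keyword search: counts mentions, but says nothing about WHY.
keyword_hits = post.lower().count("brandx")

# Toy "entity extraction": capitalized tokens as candidate entities, plus a
# crude relation pattern ("switched from X to Y") that does capture the why.
entities = re.findall(r"\b[A-Z][a-zA-Z]+\b", post)
switch = re.search(r"switched from (\w+) to (\w+)", post)

print(keyword_hits)                            # 2 mentions of BrandX
print(switch.group(1), "->", switch.group(2))  # BrandX -> BrandY
```

<p>The keyword count only tells you the brand was mentioned twice; the relation pattern tells you a customer churned and why – which is exactly the insight the quantitative tools miss.</p>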



<p>We have seen some interesting presentations and discussions on these topics in the two days here:</p>



<p><strong>Ian Hersey</strong>&nbsp;from Attensity gave a good overview of the field. We are talking Big Data, with 300m tweets per day, 250b emails per day, 126m blogs, 800m Facebook users, etc. The predictive power is quite impressive: if data volumes are sufficient, you can expect a 90% success rate for predicting events (e.g. “American Idol”). But successful business uses involve not just prediction but engagement, such as product feedback, direct customer service, marketing campaign effectiveness, and political outreach/mobilization. Equally or more important are the “whys” behind the predictions.</p>



<p>Ian also talked about the known limitations of today’s NLP, such as irony, sarcasm, “slanguage”, hidden agendas, and cross-/multi-language issues.</p>



<p><strong>Dana Jacob</strong>&nbsp;(formerly Yahoo) gave a nice insight into the problems of spam and false brand engagement in social media data, including the now famous tweet “Dear Yahoo, I have never heard anyone say, “I don’t know, let’s Yahoo it…” just saying…sincerely Google.” She made clear that 100% accuracy is not achievable in social media analysis – a balance is needed between research rigor and accuracy on the one hand and the limitations of human/machine analysis on the other.</p>



<p>An interesting and concise case study was shown by&nbsp;<strong>Keith Paul</strong>&nbsp;from EMC, who carries the nice title of “Chief Listener”. EMC seems very advanced for a non-consumer product company, with a corporate team of 11 people handling social media and tightly integrating communities of 250k users on social sites. He could show some measurable successes in a business where sales cycles normally run 18 months, rather than the few minutes of an online shop.</p>



<p>In a panel discussion on the second day we were given the strongest arguments for why real Document Understanding is necessary in Social Media Analytics:</p>



<ul class="wp-block-list"><li>Social media is freeform – you do not know which topics or terms will pop up. Social media can be described by “VVV” = volume, velocity, variability.</li><li>Sentiment: Document-level sentiment alone is useless, as not the complete document carries one sentiment. You need to break the data down, but for this you need to understand the relevance of each part. You cannot always be accurate, but you can at least be consistent with sentiment.</li><li>Accuracy: Contributions often rely on sarcasm, wording, etc., but the most important question is the why. Sentiment alone is not useful at all.</li><li>Influence: It is difficult to define what influence in social media is, but it is certainly not the number of followers. Influence carries a tremendous amount of context with it.</li><li>Reach: The number of tweets or articles on a topic is not actually relevant. Context is needed to see whether people are really discussing the topic.</li><li>Engagement: Means nothing in the way analytics uses it today. Spending time on a page or in forums is only valuable if you can show that it leads to a measurable result (a purchase) or a meaningful contribution like a comment.</li></ul>
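<p>The sentiment point above can be sketched in a few lines – a toy, assumption-laden example (the four-word lexicon and the sample text are invented; real systems use NLP models rather than word lists) showing why scoring each sentence beats scoring the whole document:</p>

```python
import re

# Invented toy lexicon, for illustration only.
LEXICON = {"love": 1, "great": 1, "slow": -1, "broken": -1}

def sentence_sentiments(doc):
    """Score each sentence separately instead of the whole document."""
    sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
    scores = []
    for s in sentences:
        words = re.findall(r"[a-z']+", s.lower())
        scores.append((s, sum(LEXICON.get(w, 0) for w in words)))
    return scores

doc = "I love the new camera. The app is slow and the sync is broken."
for sentence, score in sentence_sentiments(doc):
    print(score, sentence)   # +1 for the praise, -2 for the complaint
```

<p>Scored as one document, the text nets out to a misleading -1; scored per sentence, you see one positive and one negative statement – which is what the panel meant by needing sentiment on the relevant part rather than on the document as a whole.</p>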
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
