I am just back from an interesting conference in San Francisco on social media analytics. It is a rather small conference; however, the market it addresses is growing rapidly. You might ask what social media analytics has to do with document understanding. To tell you the truth: a lot. Every post, every tweet and every blog entry is a document. It might be a short document (140 characters maximum on Twitter), but it is a document created by a user with the intent to express something.
Companies are very interested in understanding what users, their customers, are saying. Today’s analytics tools try to measure quantitative data such as the number of followers for a brand, the number of re-tweets and how often a product name is mentioned. Keyword search allows picking out relevant contributions and even assigning a sentiment to posts. But the general agreement at the conference was that this provides only very limited insight. The real benefit is obtained if you also understand WHY a user is saying something about you. This can only be achieved by analyzing the unstructured text, and hence through document understanding.
In this area, document understanding goes under the name of text analytics. The goal is to integrate natural language processing (NLP), which provides a much deeper analysis of social media content. NLP capabilities include syntactic and semantic parsing, named-entity recognition, relationship analysis and entity extraction. These methods are superior to traditional keyword and keyword-list search, both in matching relevant social media posts and in the accuracy of understanding the information in those posts.
We have seen some interesting presentations and discussions on these topics in the two days here:
Ian Hersey from Attensity gave a good overview of the field. We are talking Big Data: 300m tweets per day, 250b emails per day, 126m blogs, 800m Facebook users, etc. If you have something to predict, the predictive power is quite impressive, and you can expect a 90% success rate for events (e.g. “American Idol”) if data volumes are sufficient. But successful business uses involve not just prediction but also engagement, such as product feedback, direct customer service, marketing campaign effectiveness, and political outreach/mobilization. Equally or more important are the “whys” behind the predictions.
Ian also talked about the known limitations of today’s NLP, such as irony, sarcasm, “slanguage”, hidden agendas, and cross-/multi-language issues.
Dana Jacob (formerly of Yahoo) gave a nice insight into the problems with spam and false brand engagement in social media data, including the now famous tweet “Dear Yahoo, I have never heard anyone say, ‘I don’t know, let’s Yahoo it…’ just saying…sincerely Google.” She made clear that 100% accuracy is not achievable in social media analysis; what matters is a balance between research rigor and accuracy on the one hand and the limitations of human and machine analysis on the other.
An interesting and concise case study was presented by Keith Paul from EMC, who carries the nice title of “Chief Listener”. EMC seems to be very advanced for a non-consumer product company: a corporate team of 11 people handles social media and tightly integrates communities with 250k users on social sites. He could show some measurable successes in a business where sales cycles normally last 18 months, rather than the few minutes of an online shop.
In a panel discussion on the second day, we heard the strongest arguments for why real document understanding is necessary in social media analytics:
- Social media is freeform – you do not know which topics or terms pop up. Social media can be described by “VVV” = volume, velocity, variability.
- Sentiment: Document-level sentiment alone is useless, because a sentiment rarely applies to the complete document. You need to cut down the data, but for this you need to understand the relevance of each part. You can’t be fully accurate, but you can at least be consistent with sentiment.
- Accuracy: Contributions often rely on sarcasm, particular wording, etc., but the most important question is the why. Sentiment alone is not useful at all.
- Influence: It is difficult to understand what influence in social media is but it is certainly not the number of followers. Influence has a tremendous amount of context going with it.
- Reach: The number of tweets or articles on a topic is not, by itself, relevant. The context is what shows whether people are actually discussing the topic.
- Engagement: Means nothing in the way it is used today by analytics. Spending time on a page or in forums is only valuable if it leads to a measurable result (a purchase) or a meaningful contribution such as a comment.
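To make the sentiment argument from the panel a bit more concrete, here is a minimal sketch in Python. It is not a tool shown at the conference: the word list, the weights and the sample post are all invented for illustration, and a real system would use proper NLP instead of a tiny lexicon. It only shows how a single document-level score can hide the fact that different parts of a post say opposite things.

```python
import re

# Toy sentiment lexicon: hypothetical words and weights, for illustration only.
LEXICON = {"love": 1, "great": 1, "fast": 1, "terrible": -1, "slow": -1, "broken": -1}


def split_sentences(text):
    """Split after ., ! or ? followed by whitespace (a crude heuristic)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score(text):
    """Sum the lexicon weights of all words found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


def sentence_scores(text):
    """Return (sentence, score) pairs so each part of a post is judged on its own."""
    return [(sentence, score(sentence)) for sentence in split_sentences(text)]


# Invented example post with mixed sentiment.
post = "I love the new camera. The battery life is terrible."

document_score = score(post)          # one number for the whole post
per_sentence = sentence_scores(post)  # one number per sentence

print("document:", document_score)
for sentence, value in per_sentence:
    print(value, sentence)
```

At the document level the positive and negative words cancel each other out, so the post looks neutral. Scored sentence by sentence, the praise for the camera and the complaint about the battery both stay visible, which is the part a company would actually want to act on.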