Text Analytics Forum is part of the KMWorld conference. It was held on November 6-7 at the JW Marriott in D.C. Attendees went to the large KMWorld keynotes in the morning, then split into two parallel text analytics tracks for the remainder of the day: a technical track and an applications track. Most of the slides are available here. My photos, including photos of some slides that caught my attention or were not available on the website, are available here. Since most slides are available online, I have only a few brief highlights below.
Automatic summarization comes in two forms: extractive and generative. Generative summarization doesn’t work very well, and some products are dropping the feature. Enron emails containing lies tend to be shorter. When a customer threatens to cancel a service, the language they use may indicate they are really looking to bargain. Deep learning works well with data, but not with concepts. For good results, make use of all document structure (titles, boldface, etc.); search engines often ignore such details. Keywords assigned to a document by a human are often unreliable or inconsistent, so having the document’s author write a summary may be more useful. Rules work better when there is little content; machine learning prefers more content. Knowledge graphs, which were a major topic at the conference, are better for discovery than for search.
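Extractive summarization, the form that works reliably, can be sketched with simple word-frequency scoring: pick the sentences whose words are most frequent in the document overall. This is a minimal illustration under my own assumptions, not any vendor's actual implementation, and `extractive_summary` is a name I made up:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Return the n_sentences highest-scoring sentences, where a
    sentence's score is the average document-wide frequency of its
    words. A crude but classic extractive approach."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)
    return sorted(sentences, key=score, reverse=True)[:n_sentences]

text = ("Search engines index documents. "
        "Search engines rank documents by relevance. "
        "Cats sleep.")
print(extractive_summary(text))
```

Generative summarization, by contrast, must compose new sentences, which is where current systems fall down.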
DBpedia provides structured data from Wikipedia for knowledge graphs. SPARQL is a standardized query language for graph databases, analogous to SQL for relational databases. When using knowledge graphs, the more hops away the answer is, the more likely it is to be wrong. Knowledge graphs should always start with a good taxonomy or ontology.
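A knowledge graph is, at heart, a set of subject-predicate-object triples like those DBpedia serves. The sketch below shows multi-hop traversal over a toy triple store; the entities and the per-hop confidence decay are illustrative assumptions of mine, not features of any real store, but they mirror the point that answers more hops away are more often wrong:

```python
# Toy triple store in the spirit of RDF; data is made up for illustration.
TRIPLES = [
    ("Ada_Lovelace", "bornIn", "London"),
    ("London", "capitalOf", "England"),
    ("England", "partOf", "United_Kingdom"),
]

def follow(subject, predicates, confidence=1.0, decay=0.9):
    """Follow a chain of predicates from a subject, one hop per
    predicate. Each hop multiplies confidence by `decay` to model
    the compounding risk of multi-hop answers."""
    if not predicates:
        return subject, confidence
    for s, p, o in TRIPLES:
        if s == subject and p == predicates[0]:
            return follow(o, predicates[1:], confidence * decay, decay)
    return None, 0.0

print(follow("Ada_Lovelace", ["bornIn"]))
print(follow("Ada_Lovelace", ["bornIn", "capitalOf", "partOf"]))
```

One hop returns an answer at 0.9 confidence; three hops compound to about 0.73, which is the intuition behind distrusting distant answers.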
Social media text (e.g., tweets) contains a lot of noise. Some software handles both social media and normal text, but some only really works with one or the other. Sentiment analysis can be tripped up when it relies only on keywords. For example, compare “product worked terribly” with “I’m terribly happy with the product.” Humans themselves are only 60-80% accurate at sentiment analysis.