Nine Steps to Extract Insight from Unstructured Data

By Salil Godika On 27 Nov 2014

The increasing digitization of information, coupled with a multitude of multi-channel transactions, has resulted in a data deluge. The ever-increasing pace of digital information has led the world’s aggregate data to double at shorter intervals than ever before. According to Gartner, about 80% of data held by an organization is unstructured data, consisting of information from customer calls, emails and social media feeds. This is in addition to the voluminous diagnostic information logged by embedded and user devices. While it can be daunting to analyze even organized data properly, it is all the more difficult to make sense of unstructured data.

As a result, organizations have to study structured, semi-structured and unstructured data sets to arrive at meaningful business decisions, including determining customer sentiment, complying with e-discovery requirements and personalizing the offerings for their customers.

While sifting through vast amounts of information can look like a lot of work, there are rewards. By reading large, disparate sets of unstructured data, one can identify connections from unrelated data sources and find patterns. What makes this method of analysis extremely effective is that it enables the discovery of trends.

There are nine steps to analyze unstructured data so that one can see more than what meets the eye:

1. Make sense of disparate data sources

Before one can begin, one needs to know what sources of data are important for the analysis. Unstructured data sources may range from web logs to voice files to emails to chat transcripts to streaming videos. If the information being analyzed is only tangentially related to the topic at hand, it should be set aside. Instead, only use information sources that are absolutely relevant.

2. Sign off on the method of analytics and find a clear way to present the results

If the end requirement is not clear, the analysis may be useless. It is important to understand what sort of answer is needed – is it a quantity, a trend, cause & effect or something else? In addition, one must provide a roadmap for what to do with the results so that they can be used in a predictive analytics engine before undergoing segmentation and integration into the business’s information store.

3. Decide the technology stack for data ingestion and storage

Even though the raw data can come from a wide variety of sources, the results of the analysis must be placed in a technology stack or cloud-connected information store so that they can be easily used. The choice of data storage and retrieval technology often depends on the scalability, volume, variety and velocity requirements. A potential technology stack should be evaluated thoroughly against the final requirements, after which the information architecture of the project is set.

A few examples of key business requirements and the respective mapping of the technology stack are:

Real-time: It has become crucial for e-commerce companies to provide real-time quotes. This requires tracking real-time activities and providing offerings based on the results of a predictive analytics engine. Technologies that can provide this include Storm, Flume and Lambda.

High availability: This is crucial for ingesting information from social media. The technology platform used must ensure that no data is lost in a real-time stream. It is a good idea to use a messaging queue, such as Apache Kafka, to hold incoming information as part of a data redundancy plan.

Multi-tenancy: Another critical dimension is the ability to isolate data and resources from different groups of users. Effective Big Data solutions should be able to natively support multi-tenancy situations. Given the sensitivities around customer data and feedback coupled with the criticality of insights, isolation is extremely important as it is often needed in order to meet today’s confidentiality requirements.

Unstructured web logs or security logs: These require flexible schemas to hold the data. HBase or Cassandra, with their flexible column families, could be explored.
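
To picture what a flexible schema means for log data, here is a minimal sketch in plain Python. It is not HBase or Cassandra client code; the `key=value` log format, the `parse_log_line` helper and the field names are all assumptions made for illustration. Each row carries only the columns it actually has, and the full set of columns is discovered rather than declared up front.

```python
from collections import defaultdict

def parse_log_line(line):
    """Split a 'key=value key=value' log line into a sparse row (a dict)."""
    row = {}
    for token in line.split():
        if "=" in token:
            key, value = token.split("=", 1)
            row[key] = value
    return row

# Hypothetical log lines: each event carries a different set of fields.
lines = [
    "ts=2014-11-27T10:00:00 event=login user=alice ip=10.0.0.5",
    "ts=2014-11-27T10:00:03 event=page_view user=alice url=/pricing",
    "ts=2014-11-27T10:00:09 event=error code=500",
]

rows = [parse_log_line(line) for line in lines]

# Count how many rows use each column; nothing was declared in advance.
columns = defaultdict(int)
for row in rows:
    for key in row:
        columns[key] += 1

print(sorted(columns))
```

A fixed relational table would need a column for every field that might ever appear; here a new field in tomorrow's logs simply shows up as a new key.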

4. Keep information in a data lake until it has to be stored in a data warehouse

Traditionally, an organization obtained or generated information, sanitized it and stored it away. For example, if the information source was an HTML file, the text might be extracted and the rest discarded before the data was loaded into a data warehouse. Anything useful that was discarded in the initial data load was lost for good, and the only thing one could do with the data was what remained possible after the extraneous information was stripped away. The appeal of this prior strategy was that the data was in a pristine, mutable format that could be used whenever it was needed. However, with the advent of Big Data, it has become common practice to do the opposite: with a data lake, information is stored in its native format until it is actually deemed useful and needed for a specific purpose, preserving metadata or anything else that might assist in the analysis.
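
The data lake idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ingest_raw` helper, the `.meta.json` sidecar naming and the metadata fields are hypothetical, and a temporary directory stands in for real lake storage. The point is that the HTML file is written byte for byte, with its metadata kept alongside it, instead of being stripped down at load time.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def ingest_raw(lake_dir, name, payload, source):
    """Store payload bytes untouched, plus a JSON sidecar with metadata."""
    lake = Path(lake_dir)
    lake.mkdir(parents=True, exist_ok=True)
    raw_path = lake / name
    raw_path.write_bytes(payload)            # native format, no sanitizing
    metadata = {
        "source": source,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    meta_path = lake / (name + ".meta.json")
    meta_path.write_text(json.dumps(metadata, indent=2))
    return raw_path, meta_path

# Hypothetical HTML document; tags and title are preserved, not discarded.
html = b"<html><head><title>Review</title></head><body><p>Great phone!</p></body></html>"

with tempfile.TemporaryDirectory() as lake_dir:
    raw_path, meta_path = ingest_raw(lake_dir, "review-001.html", html, "web-crawl")
    stored = raw_path.read_bytes()
    stored_meta = json.loads(meta_path.read_text())

print(stored == html, stored_meta["source"])
```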

5. Prepare the data for storage

While keeping the original file, one should clean up a copy of the data before making use of it. In a text file, there can be a lot of noise or shorthand that can obscure valuable information. It is good practice to cleanse noise such as extra whitespace and symbols, while converting informal text in strings to formal language. If it is possible to detect the spoken language, the data should be categorized accordingly. Duplicate results should be removed, the dataset treated for missing values, and off-topic information expunged from the dataset.
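
A minimal sketch of this kind of cleanup, using only the Python standard library, might look as follows. The `SHORTHAND` dictionary, the sample messages and the ASCII-only symbol filter are assumptions for illustration; language detection and off-topic filtering are not attempted here. The original `raw` list is left untouched, and only a cleaned copy is returned.

```python
import re

# Hypothetical shorthand dictionary; a real one would be far larger.
SHORTHAND = {"u": "you", "gr8": "great", "thx": "thanks", "pls": "please"}

def clean_text(text):
    """Lowercase, strip symbols, expand shorthand, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace symbols with spaces
    words = [SHORTHAND.get(word, word) for word in text.split()]
    return " ".join(words)

def prepare(records):
    """Return a cleaned copy, dropping missing values and duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        if record is None or not record.strip():
            continue                           # treat empty entries as missing
        text = clean_text(record)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

raw = ["Thx,  u   r gr8!!!", None, "thx u r gr8", "   ", "Pls fix the   app :("]
print(prepare(raw))
# prints ['thanks you r great', 'please fix the app']
```

Note that the second message only counts as a duplicate after cleaning, which is why deduplication runs on the normalized text rather than on the raw strings.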

6. Retrieve useful information

Through the use of natural language processing and semantic analysis, one can make use of part-of-speech tagging to extract common named entities, such as “person”, “organization”, “location” and their relationships. From this, one can create a term frequency matrix to understand the word pattern and flow in the text.
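
As an illustration of the term frequency matrix, the sketch below builds one with the Python standard library. The tokenizer is a deliberately simple regular expression and the two sample documents are made up; tagging and entity extraction would need an NLP library and are not shown. Each row is a document, each column a term, and each cell the number of times that term occurs.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters (and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def term_frequency_matrix(documents):
    """Return (vocabulary, matrix) with one row per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix

# Hypothetical customer comments.
documents = [
    "The battery life is great",
    "Battery drains fast, the screen is great",
]

vocabulary, matrix = term_frequency_matrix(documents)
print(vocabulary)
for row in matrix:
    print(row)
```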

7. Ontology evaluation

Through analysis, one can then create the relationships among the sources and the extracted entities so that a structured database can be designed to specifications. This can take time, but the insights provided can be worth it for an organization.
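
One possible shape for such a structured database is sketched below with Python's built-in `sqlite3` module. The two-table layout, the table and column names, and the sample entities and source identifiers are all assumptions for illustration, not a prescribed design. Entities extracted in the previous step go in one table, and each relationship records which raw source it came from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT NOT NULL          -- e.g. person, organization, location
);
CREATE TABLE relation (
    subject_id INTEGER NOT NULL REFERENCES entity(id),
    predicate  TEXT NOT NULL,
    object_id  INTEGER NOT NULL REFERENCES entity(id),
    source     TEXT NOT NULL    -- which raw document the relation came from
);
""")

# Hypothetical entities and relationships.
conn.executemany("INSERT INTO entity (id, name, type) VALUES (?, ?, ?)", [
    (1, "Alice", "person"),
    (2, "Acme Corp", "organization"),
    (3, "Pune", "location"),
])
conn.executemany("INSERT INTO relation VALUES (?, ?, ?, ?)", [
    (1, "works_for", 2, "email-0042"),
    (2, "located_in", 3, "weblog-0007"),
])

facts = conn.execute("""
    SELECT s.name, r.predicate, o.name, r.source
    FROM relation r
    JOIN entity s ON s.id = r.subject_id
    JOIN entity o ON o.id = r.object_id
    ORDER BY r.rowid
""").fetchall()
print(facts)
```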

8. Statistical modeling, Data & Text Mining

Once the database has been created, the data must be classified and segmented. It can save time to make use of supervised and unsupervised machine learning, such as the K-means, Logistic Regression, Naïve Bayes, and Support Vector Machine algorithms. These tools can be used to find similarities in customer behavior, target a campaign and classify documents overall. The disposition of customers can be determined with sentiment analysis of reviews and feedback, which helps to inform future product recommendations, understand overall trends and guide the introduction of new products and services.
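
To make the segmentation idea concrete, here is a small K-means sketch in plain Python. It is a teaching version under stated assumptions: the customer features (visits per month, average basket value) are invented, the first `k` points are used as starting centroids so the run is repeatable, and the features are not scaled, which a real analysis would normally do. A production project would more likely rely on an established machine learning library.

```python
def kmeans(points, k, iterations=20):
    """Plain K-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break                                   # converged
        centroids = new_centroids
    return assignments, centroids

# Hypothetical customers: (visits per month, average basket value).
customers = [(1, 20), (12, 210), (2, 25), (11, 190), (1, 30), (13, 200)]

assignments, centroids = kmeans(customers, k=2)
print(assignments)
# prints [0, 1, 0, 1, 0, 1]
```

The two groups that emerge, occasional low spenders and frequent high spenders, are the kind of similarity in customer behavior that could then drive campaign targeting.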

The most relevant topics discussed by customers can be analyzed through temporal modeling techniques, which can extract the topics or events that customers are sharing via social media, feedback forms or any other platform.

9. Visualize, implement and measure impact

After all the above steps, it comes down to the end result, whatever it might be. It is crucial that the answers from the analysis are provided in a tabular and graphical format, providing actionable insights for the end user of the resultant information. To ensure that the information can be used and accessed by the intended parties, it should be rendered in a way that it can be reviewed through a handheld device or web-based tool, so that the recipient can take the recommended actions on a real-time or near real-time basis. Scientific implementation methods like design of experiment (test & control), baselining and a process / continuous improvement framework hold the key to success. The final step would be to measure the impact (RoI), both hard (dollars) and soft (process efficiency & effectiveness, productivity improvements, etc.).
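
The hard side of impact measurement can be illustrated with a simple test and control calculation. The sketch below is one way to do the arithmetic, not a standard formula from any particular framework; the `measure_impact` helper and every number in it (group sizes, conversions, revenue per conversion, campaign cost) are hypothetical. It compares conversion rates between the two groups and turns the difference into incremental revenue and RoI.

```python
def conversion_rate(conversions, customers):
    """Share of customers in a group who converted."""
    return conversions / customers

def measure_impact(test, control, revenue_per_conversion, campaign_cost):
    """Compare a test group with a control group; each is (conversions, customers)."""
    test_rate = conversion_rate(*test)
    control_rate = conversion_rate(*control)
    lift = test_rate - control_rate                 # absolute uplift in rate
    incremental_conversions = lift * test[1]        # extra conversions in test group
    incremental_revenue = incremental_conversions * revenue_per_conversion
    roi = (incremental_revenue - campaign_cost) / campaign_cost
    return {
        "test_rate": test_rate,
        "control_rate": control_rate,
        "lift": lift,
        "incremental_revenue": incremental_revenue,
        "roi": roi,
    }

# Hypothetical campaign: 1,000 customers in each group.
impact = measure_impact(
    test=(150, 1000),
    control=(100, 1000),
    revenue_per_conversion=40.0,
    campaign_cost=1000.0,
)
for name, value in impact.items():
    print(name, round(value, 4))
```

Without the control group as a baseline, the 15% conversion rate in the test group could not be separated from what would have happened anyway.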

Conclusion

The real value lies in combining structured, semi-structured and unstructured data analysis for a 360-degree view. A case in point: while structured data analysis can predict customer behavior, unstructured data analysis can unravel reasons for such behavior. New information forms such as social media and machine logs have made themselves crucial to organizations for their ability to provide unique content and diagnostic intelligence once they are properly analyzed. Traditional or conventional data scientists will have to acquire new skill sets to analyze unstructured data. While enterprises develop content intelligence capabilities, the real power lies in fusing different data formats and overlaying structured data with semi-structured and unstructured data sources for insights into the mind of a user or the life of a device.