Friday, September 30, 2016

Analytics Divide and Big Data

For more information and orders, visit the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" ( ).

In a research report on analytics by MIT and IBM (Kiron, et al. 2011), three progressive levels of analytical sophistication for organizations are defined: aspirational, experienced, and transformed (see Table 1). In transformed organizations, analytics is used at all levels, both day to day and strategically. It is considered an integral part of everything, including the culture. These organizations have high proficiency in both data management and analytics in terms of usage, skills, and tools. Both data management and analytics are enterprise-driven and are ingrained in the enterprise culture. Transformed organizations have robust data and analytics foundations and management competencies, making it possible to capture, combine, and analyze information from disparate sources, and to disseminate it across the organization so that individuals at all levels can consume it. In these organizations, processes, practices, and behaviors are aligned with the fundamental belief that business decisions at all levels should be based on data analysis.
On the other hand, aspirational organizations do not use any sophisticated analytics beyond spreadsheets, and do not have an integrated view of their enterprise data. They lack proficiency on both the data management and analytics fronts. Their culture relies more on gut feeling and intuition than on data analysis for decision-making.
The experienced organizations are somewhere in between, with initiatives that move them closer to transformed organizations. The study shows that experienced and transformed organizations continue to expand their analytics and information management capabilities to add more business value and differentiate themselves, while aspirational organizations keep falling behind. This growing gap or divide has major implications for businesses. For the following discussion, my focus is only on experienced and transformed organizations, since these strongly believe in the value of analytics and either practice it in full or have a goal to get there. I consider them standard organizations in the sense that, in today’s competitive world, they are more the norm than the exception.[1] These organizations also have the culture, the appetite, and the desire to deal with their big data challenge, if there ever is one.

Organization Category | Information Management Proficiency | Analytics Proficiency | Data Culture
Aspirational | | | Line of business driven
Experienced | | | Moving toward enterprise driven
Transformed | | | Enterprise driven

Table 1: Three progressive levels of analytical sophistication in enterprises (Kiron, et al. 2011).

One main differentiator between analytics in the traditional sense and big data analytics is that in the latter, the collected big data may or may not be useful for the specific business purpose intended. From the perspective of analysis, this falls into the category of “you don’t know what you don’t know.” However, if any insights are extracted, they could be enormously valuable. Because traditional data management and analysis technologies are mature, data stored in these environments is already known to be of high value. This data has been prepared to answer known business questions, and its high value justifies its storage and management in enterprise data warehouses or data marts. With new big data, there are plenty of opportunities to ask new business questions never asked before, and the economics are favorable for investigating these questions.

Table 2 enumerates a few possible scenarios in today’s standard analytics environments (experienced and transformed organizations) when they are faced with big data. These environments already excel in dealing with traditional and proven analytics methods and technologies where storage, management, and analysis of the data follow standard processes and practices. Scenario 1 depicts the status quo in these environments—where, in the absence of any big data, it is business as usual.

Data Scenario
Big Data?
Business Value
Somewhat known
Not possible
Not known
Not possible
Not known

Table 2: Big data scenarios in standard analytics environments.

However, in terms of their existing capabilities, these organizations face different scenarios in dealing with their big data challenges. The reader should keep in mind that the size of big data has to be interpreted in the context of the time and place of each enterprise, given its sector and its place on the analytics evolution curve. In Scenario 2, the enterprise is capable of storing its big data, and can also analyze it using existing nonstandard[2] big data analytics techniques. As a result, the enterprise has some understanding of the hidden value in its big data, and can decide how much of it needs to be stored and for how long. In Scenario 3, the organization can cope with storing its big data, but does not yet have the capability to analyze it efficiently enough to assess its value. The reason for this could be technological, methodological, skill-set related, or budgetary. In Scenario 4, the enterprise in its current state is not capable of storing the big data (and hence not able to analyze it either), for reasons similar to those in Scenario 3. Today, Scenarios 3 and 4[3] are still dominant for classic enterprises. Those operating under Scenario 2 are a small minority but are ahead of the curve compared to their peers. Curiosity about the potential value in big data is why big data has become a part of these organizations’ overall data strategies. Going forward, any enterprise data strategy that ignores big data should be considered incomplete.
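
The four scenarios can be read as a small decision rule over three yes/no questions: is there big data at all, can the enterprise store it, and can it analyze it. The sketch below is only one way to write that reading down. The names `Scenario`, `classify_scenario`, and `BUSINESS_VALUE` are hypothetical, and the business value labels are my interpretation of the scenario descriptions, not an official scheme.

```python
from enum import Enum


class Scenario(Enum):
    """Hypothetical labels for the four scenarios discussed in the text."""
    BUSINESS_AS_USUAL = 1   # Scenario 1: no big data present
    STORE_AND_ANALYZE = 2   # Scenario 2: can store and analyze big data
    STORE_ONLY = 3          # Scenario 3: can store but not analyze
    CANNOT_STORE = 4        # Scenario 4: cannot store, hence cannot analyze


def classify_scenario(has_big_data: bool, can_store: bool, can_analyze: bool) -> Scenario:
    """Map an enterprise's current capabilities to one of the four scenarios."""
    if not has_big_data:
        return Scenario.BUSINESS_AS_USUAL
    if not can_store:
        # Without storage there is nothing to analyze, so can_analyze is ignored.
        return Scenario.CANNOT_STORE
    if can_analyze:
        return Scenario.STORE_AND_ANALYZE
    return Scenario.STORE_ONLY


# Assumed reading of how well the business value of the big data is understood.
# Scenario 1 is left out because it involves no big data.
BUSINESS_VALUE = {
    Scenario.STORE_AND_ANALYZE: "somewhat known",
    Scenario.STORE_ONLY: "not known",
    Scenario.CANNOT_STORE: "not known",
}

scenario = classify_scenario(has_big_data=True, can_store=True, can_analyze=False)
print(scenario.name, "->", BUSINESS_VALUE[scenario])
```

Checking storage before analysis mirrors the text: an enterprise that cannot store its big data is in Scenario 4 regardless of its analytical skills.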

Kiron, David, Rebecca Shockley, Nina Kruschwitz, Glenn Finch, and Michael Haydock. 2011. Analytics: The Widening Divide. MIT Sloan Management Review; IBM Institute for Business Value.

[1] More than a decade ago, one could say that the reverse phenomenon was true, meaning that aspirational organizations were more of the norm.

[2] Big data analysis techniques are still in their infancy, and I consider them nonstandard in comparison with traditional data analytics tools and techniques (including data warehousing, BI, and data mining) that have matured, especially in the last two decades.

[3] “Without big data, you are blind and deaf in the middle of a freeway.”—Geoffrey Moore.

I discuss these topics in detail in my book. Visit the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" ( ).

Tuesday, September 6, 2016

What Constitutes a Big Data Scenario (Part 2): Human vs. Machine Generated Data

Global data generation is projected to grow by 40% per year, versus 5% growth in global IT spending (McKinsey Global Institute 2011). From a cost perspective, this divergence clearly shows why big data technology solutions must be cost-effective today in order to stay within existing IT budgets and forecasts. However, the real value of any technology is ultimately defined by its business ROI and not its cost. This is particularly true for classic organizations that are exploring these new big data technologies. From an economic and business standpoint, the potential value creation from big data opportunities has already been proven by numerous web companies. In many cases this value creation has been disruptive.
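
To get a feel for how quickly the two growth rates diverge, here is a minimal sketch of the arithmetic. It assumes that both rates are annual, stay constant, and compound year over year, none of which the cited projection guarantees. The function name `relative_growth` is hypothetical.

```python
def relative_growth(data_rate: float, spend_rate: float, years: int) -> float:
    """Factor by which data volume per unit of IT spending grows over `years`.

    Assumes both rates are constant and compound annually (an assumption
    made for illustration, not a claim from the cited report).
    """
    return ((1 + data_rate) / (1 + spend_rate)) ** years


for years in (1, 5, 10):
    ratio = relative_growth(0.40, 0.05, years)
    print(f"after {years:2d} years: {ratio:.1f}x more data per unit of IT spending")
```

Under these assumptions, the amount of data that each unit of IT spending must cover grows by a third every year, which works out to roughly 4.2 times after five years and roughly 17.8 times after ten.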

More than 80% of data generated is reported to be unstructured data, which includes:

• Semistructured data: weblogs, machine logs, etc.
• Unstructured text data: blogs, e-mail, comments, etc.
• Binary data: photos, images, audio, video, etc.

It has been reported that the data collected by the US Library of Congress as of April 2011 was around 235 terabytes. Fifteen out of seventeen sectors in the United States have more data stored per company than the US Library of Congress (McKinsey Global Institute 2011).

Structured data is stored and resides in predetermined fixed fields. Unstructured data cannot be stored in fixed fields. Freeform text (books, e-mails, articles, blogs, etc.) and untagged video, image, and audio are examples of unstructured data. Semistructured data also does not conform to fixed fields, but includes tags that identify its data elements. XML, JSON, and HTML-tagged text are examples of semistructured data. Multistructured data refers to a combination of all these data varieties.
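
The difference between these varieties is easiest to see side by side. The following sketch uses made-up records; the field names, values, and variable names are all hypothetical and chosen only for illustration.

```python
import json

# Structured: every value sits in a predetermined fixed field.
# The field names and values are hypothetical.
PAYMENT_FIELDS = ("txn_id", "customer_id", "amount_usd")
payment_row = (1001, "C-42", 19.99)
payment = dict(zip(PAYMENT_FIELDS, payment_row))

# Semistructured: no fixed fields, but tags (here JSON keys) identify
# each data element, and records may nest or omit elements freely.
weblog_line = '{"user": "C-42", "event": "page_view", "meta": {"browser": "Firefox"}}'
weblog_event = json.loads(weblog_line)

# Unstructured: free-form text with neither fixed fields nor tags.
review_text = "Loved the product, but shipping took two weeks."

# Multistructured: a combination of the varieties above.
customer_view = {
    "payment": payment,
    "weblog": weblog_event,
    "review": review_text,
}

print(payment["amount_usd"])            # read by fixed field name
print(weblog_event["meta"]["browser"])  # read by following tags
print(len(review_text.split()))         # free text needs its own processing
```

The structured record can be read by position or field name, the semistructured one by following its tags, while the free text has to be processed (here, naively split into words) before anything can be extracted from it.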

What constitutes human- or machine-generated data is loosely defined. I like to differentiate between the two using the following definitions (see the table below). As humans interact with each other and with organizations, or organizations with each other, a massive amount of structured data is generated in the form of transactions such as call records, payment transactions, sales orders, etc. These data are collectively generated through business processes and had been captured and analyzed long before the Internet became mainstream. Conventional big data technologies were originally developed to handle such data. Human interactions also create semistructured data, such as weblog data, that are newer, and their detailed processing often requires newer big data technologies to be cost-effective. Human-generated data can also be directly created by humans as “digital content,” which could be either unstructured or binary. In a nutshell, human-generated data can be defined as the digitization of human interactions.

On the other hand, I define machine-generated data as data that captures machine-to-machine (Internet of Things) interactions. Machine-generated data may be the result of observing human behavior rather than capturing human choices. This data could also be in structured, semistructured, or binary form. Data from RFID tags, computer logs, network logs, security cameras, etc., are typical examples.

Data Generation Origin
Information Management Proficiency

Data representing the digitization of human interactions
Business process data, e.g., payment transactions, sales orders, call records, ERP, CRM
Content such as Web pages, E-mail, Blog, Wiki, Review, Comment
Content such as Video, Audio, Photo

Data representing machine-to-machine interactions, or simply not human-generated (Internet of Things)
Some devices
Computer logs, Device logs, Network logs, Sensor/Meter logs
Video, Audio, Photo
