Tuesday, September 6, 2016

What Constitutes a Big Data Scenario (Part 2): Human vs. Machine Generated Data

For more information and orders, visit the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info).

Global data generated per year is projected to grow by 40%, versus 5% growth in global IT spending (McKinsey Global Institute 2011). From a cost perspective, this divergence shows why big data technology solutions must be cost-effective in order to stay within existing IT budgets and forecasts. However, the real value of any technology is ultimately defined by its business ROI, not its cost. This is particularly true for classic organizations that are exploring these new big data technologies. From an economic and business standpoint, the potential value creation from big data opportunities has already been proven by numerous web companies. In many cases this value creation has been disruptive.
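To make the cost argument concrete, here is a minimal sketch of the arithmetic behind the divergence. It assumes that both growth rates stay constant and compound annually, which the cited projection does not state, and the function names are hypothetical, chosen only for illustration.

```python
def growth_factor(annual_rate: float, years: int) -> float:
    """Cumulative growth after `years` of constant compound annual growth."""
    return (1.0 + annual_rate) ** years


def required_unit_cost_ratio(data_rate: float, budget_rate: float, years: int) -> float:
    """Fraction of today's cost per unit of data that keeps spending within budget.

    Assumption: all data generated must be handled, and spending may grow
    only as fast as the IT budget.
    """
    return growth_factor(budget_rate, years) / growth_factor(data_rate, years)


if __name__ == "__main__":
    # 40% data growth versus 5% IT spending growth, as quoted in the text.
    for years in (1, 3, 5):
        ratio = required_unit_cost_ratio(0.40, 0.05, years)
        print(f"after {years} year(s): unit cost must fall to {ratio:.2f}x of today's")
```

Under these assumptions, the cost of handling one unit of data has to drop by roughly a quarter every year just to keep pace, which is the sense in which big data solutions "must be cost-effective."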

More than 80% of data generated is reported to be unstructured data, which includes:

• Semistructured data: weblogs, machine logs, etc.
• Unstructured text data: blogs, e-mail, comments, etc.
• Binary data: photos, images, audio, video, etc.

The data collected by the US Library of Congress had reportedly reached about 235 terabytes by April 2011. Fifteen out of seventeen sectors in the United States have more data stored per company than the US Library of Congress (McKinsey Global Institute 2011).

Structured data resides in predetermined fixed fields. Unstructured data cannot be stored in fixed fields. Freeform text (books, e-mails, articles, blogs, etc.) and untagged video, image, and audio are examples of unstructured data. Semistructured data also does not conform to fixed fields, but it includes tags that identify its data elements. XML, JSON, and HTML-tagged text are examples of semistructured data. Multistructured data refers to a combination of all these data varieties.
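The distinction between these varieties can be illustrated with a small sketch. The sample records, field names, and the helper `field_names` below are hypothetical and exist only to show the difference in shape; they are not taken from any real data set.

```python
import csv
import io
import json

# Structured: every record has the same predetermined, fixed fields.
structured_csv = "order_id,customer,amount\n1001,Alice,25.50\n1002,Bob,14.00\n"
orders = list(csv.DictReader(io.StringIO(structured_csv)))

# Semistructured: no fixed fields, but tags (here JSON keys) identify each element.
semistructured_json = (
    '[{"user": "alice", "event": "login"},'
    ' {"user": "bob", "event": "purchase", "items": ["book"]}]'
)
events = json.loads(semistructured_json)

# Unstructured: freeform text with no fields or tags at all.
unstructured_text = "Loved the book, although shipping took a week."


def field_names(records):
    """Return the sorted set of field names that appear across all records."""
    return sorted({key for record in records for key in record})


if __name__ == "__main__":
    print(field_names(orders))   # the same three fields appear in every order
    print(field_names(events))   # "items" appears in only one of the events
    print(len(unstructured_text.split()), "words, no fields to enumerate")
```

Every order carries exactly the same fields, whereas the second event carries an `items` tag that the first lacks; the review text has no fields to enumerate and must be interpreted as a whole.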

What constitutes human- or machine-generated data is loosely defined. I like to differentiate between the two using the following definitions (see the table below). As humans interact with each other and with organizations, or as organizations interact with each other, a massive amount of structured data is generated in the form of transactions such as call records, payment transactions, sales orders, etc. These data are collectively generated through business processes and were captured and analyzed long before the Internet became mainstream. Conventional big data technologies were originally developed to handle such data. Human interactions also create newer kinds of semistructured data, such as weblog data, whose detailed processing often requires newer big data technologies to be cost-effective. Human-generated data can also be directly created by humans as “digital content,” which could be either unstructured or binary. In a nutshell, human-generated data can be defined as the digitization of human interactions.

At the other end, I define machine-generated data as data that captures machine-to-machine (Internet of Things) interactions. Machine-generated data may be the result of observing human behavior instead of capturing people's choices. This data could also be in structured, semistructured, or binary form. Data from RFID tags, computer logs, network logs, security cameras, etc., are typical examples.

Data Generation Origin
Information Management Proficiency

Data representing the digitization of human interactions:
• Business process data, e.g., payment transactions, sales orders, call records, ERP, CRM
• Content such as web pages, e-mail, blogs, wikis, reviews, comments
• Content such as video, audio, photos

Data representing machine-to-machine interactions, or simply not human-generated (Internet of Things):
• Some devices
• Computer logs, device logs, network logs, sensor/meter logs
• Video, audio, photos

I discuss these topics in detail in my book. Visit the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info).