High Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data: What Constitutes a Big Data Scenario (Part 2): Human vs. Machine Generated Data

For more information and orders, visit the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info).

Global data generated per year is projected to grow by 40%, versus 5% growth in global IT spending (McKinsey Global Institute 2011). From a cost perspective, this divergence shows why big data technology solutions must be cost-effective in order to stay within existing IT budgets and forecasts. However, the real value of any technology is ultimately defined by its business ROI, not its cost. This is particularly true for classic organizations that are exploring these new big data technologies. From an economic and business standpoint, the potential value creation from big data opportunities has already been demonstrated by numerous web companies. In many cases this value creation has been disruptive.
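
To make the divergence concrete, here is a minimal Python sketch of how the gap between data volume and IT spending would widen over time. It assumes that both rates (40% and 5%) stay constant and compound annually, which the projection above does not state; the function name `relative_gap` is hypothetical and used only for illustration.

```python
def relative_gap(data_growth: float, budget_growth: float, years: int) -> float:
    """Return how much faster data volume has grown than IT spending
    after `years` of annual compounding (assumed constant rates)."""
    return ((1 + data_growth) / (1 + budget_growth)) ** years


if __name__ == "__main__":
    # 0.40 and 0.05 are the growth rates quoted in the text above.
    for years in (1, 5, 10):
        ratio = relative_gap(0.40, 0.05, years)
        print(f"after {years} year(s): data per unit of IT spending x{ratio:.2f}")
```

Under these assumptions, the amount of data that each unit of IT spending must cover grows by roughly a third every year, which is the cost pressure described above.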

More than 80% of data generated is reported to be unstructured data, which includes:

• Semistructured data: weblogs, machine logs, etc.
• Unstructured text data: blogs, e-mail, comments, etc.
• Binary data: photos, images, audio, video, etc.

The data collected by the US Library of Congress was reported to be around 235 terabytes as of April 2011. Fifteen out of seventeen sectors in the United States have more data stored per company than the US Library of Congress (McKinsey Global Institute 2011).

Structured data resides in predetermined fixed fields. Unstructured data cannot be stored in fixed fields. Freeform text (books, e-mails, articles, blogs, etc.) and untagged video, image, and audio are examples of unstructured data. Semistructured data also does not conform to fixed fields, but includes tags that identify its data elements. XML, JSON, and HTML-tagged text are examples of semistructured data. Multistructured data refers to a combination of all these data varieties.
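
The distinction between these varieties can be sketched in a few lines of Python. This is only an illustrative heuristic under stated assumptions, not a definitive classifier: the function name `classify` is hypothetical, binary data is approximated as "bytes that are not valid UTF-8 text," and semistructured data is approximated as "text that parses as a JSON object/array or as well-formed XML." Structured data is left out because recognizing fixed fields requires a known schema, and real-world HTML is often not well-formed XML, so it may fall through to "unstructured" here.

```python
import json
import xml.etree.ElementTree as ET


def classify(payload: bytes) -> str:
    """Roughly label a payload as binary, semistructured, or unstructured."""
    # Assumption: anything that is not valid UTF-8 text is treated as binary.
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        return "binary"

    # Tagged data: a JSON object or array identifies its own elements.
    try:
        if isinstance(json.loads(text), (dict, list)):
            return "semistructured"
    except json.JSONDecodeError:
        pass

    # Tagged data: well-formed XML.
    try:
        ET.fromstring(text)
        return "semistructured"
    except ET.ParseError:
        pass

    # Freeform text with no tags.
    return "unstructured"


if __name__ == "__main__":
    samples = [
        b'{"user": "a", "amount": 3}',
        b"<order><item>pen</item></order>",
        b"Dear team, see attached.",
        b"\xff\xd8\xff\xe0",  # first bytes of a JPEG file
    ]
    for sample in samples:
        print(sample[:20], "->", classify(sample))
```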

What constitutes human or machine generated data is loosely defined. I like to differentiate between the two using the following definitions (see the table below). As humans interact with each other and with organizations, or as organizations interact with each other, a massive amount of structured data is generated in the form of transactions such as call records, payment transactions, sales orders, etc. These data are collectively generated through business processes and were being captured and analyzed long before the Internet became mainstream. Conventional big data technologies were originally developed to handle such data. Human interactions also create semistructured data, such as weblog data, which are newer, and their detailed processing often requires newer big data technologies to be cost-effective. Human-generated data can also be created directly by humans as “digital content,” which could be either unstructured or binary. In a nutshell, human-generated data can be defined as the digitization of human interactions.

On the other end, I define machine-generated data as data that captures machine-to-machine (Internet of Things) interactions. Machine-generated data may also be the result of observing human behavior rather than capturing people's choices. This data could be in structured, semistructured, or binary form. Data from RFID tags, computer logs, network logs, security cameras, etc., are typical examples.

Data Generation Origin	Definition	Information Management Proficiency	Examples
Humans	Data representing the digitization of human interactions	Structured	Business process data, e.g., payment transactions, sales orders, call records, ERP, CRM
		Semistructured	Weblogs
		Unstructured	Content such as Web pages, E-mail, Blog, Wiki, Review, Comment
		Binary	Content such as Video, Audio, Photo
Machines	Data representing machine-to-machine interactions, or simply not human-generated (Internet of Things)	Structured	Some devices
		Semistructured	Computer logs, Device logs, Network logs, Sensor/Meter logs
		Binary	Video, Audio, Photo

I discuss these topics in detail in my book. Visit the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info).


Tuesday, September 6, 2016
