Saturday, August 13, 2016

Big Data Mining and Big Data Analytics


In three previous posts, I made an attempt to describe Data Mining, Data Science, and Machine Learning:
  • Data Mining, Data Science, and Machine Learning (1), (2), and (3).
As explained in these posts, "Data Science" is nothing but a new term. The use of analytics, data mining, and machine learning has been the cornerstone of many applications for more than two and a half decades, e.g., in finance, telecom, and retail. However, the popularity of web products from the likes of Google, LinkedIn, Amazon, and Facebook has helped analytics become a household name and fueled its rise under a new term (Data Science).

Since Data Mining has been around for a long time, people may wonder how it relates to "big data analytics."

Big data is a new marketing term that highlights the ever-increasing, exponential growth of data in every aspect of our lives. The term big data originated from within the open-source community, where there was an effort to develop analytics processes that were faster and more scalable than traditional data warehousing, and could extract value from the vast amounts of unstructured and semi-structured data produced daily by web users. Consequently, the origins of big data are tied to web data, though today the term is used in a larger context.

What is considered big data today may be the new normal in the next several years. What counts as big data also varies with the setting (industry sector or vertical, organization, or enterprise), depending on where each one is on its analytics evolution curve. Big data is always a moving target. Big data analytics is merely a new term meant to focus our attention on the analysis of big data and address the technological and business challenges associated with it.

The figure below shows how I compare high-performance data mining (HPDM for short) and big data analytics (BDA for short). Any type of analysis, whether basic or advanced, is covered under BDA. With traditional tools and techniques, performing even basic computations on big data, such as finding averages, counts, and sums, in a reasonable amount of time is a challenge. As one expects, performing advanced analytics such as iterative graph algorithms (e.g., PageRank), gradient descent (e.g., logistic regression), and expectation maximization (e.g., K-Means) introduces even bigger challenges.
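
To make the "basic" end of the spectrum concrete, here is a minimal Python sketch of how a count, sum, and average can be computed in a single pass when the data arrives in chunks. The function names (partial_stats, combine, chunked_mean) and the tiny in-memory chunks are my own illustrative assumptions; on real big data each chunk would live on a different disk or machine, but the idea of combining small per-chunk summaries is the same.

```python
from typing import Iterable, List, Tuple

# Hypothetical helper names; the chunks stand in for data partitions.

def partial_stats(chunk: List[float]) -> Tuple[int, float]:
    """Summarize one chunk as (count, sum). Each chunk is handled independently."""
    return len(chunk), sum(chunk)

def combine(a: Tuple[int, float], b: Tuple[int, float]) -> Tuple[int, float]:
    """Merge two partial summaries; order of merging does not matter."""
    return a[0] + b[0], a[1] + b[1]

def chunked_mean(chunks: Iterable[List[float]]) -> Tuple[int, float, float]:
    """Return (count, sum, mean) after reading every chunk exactly once."""
    total = (0, 0.0)
    for chunk in chunks:
        total = combine(total, partial_stats(chunk))
    count, s = total
    mean = s / count if count else float("nan")
    return count, s, mean

chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
count, total, mean = chunked_mean(chunks)
print(count, total, mean)
```

Even this simple case needs one full scan of the data, which is where the time goes when the volume is large.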

The main focus of data mining, however, is advanced analytics. From its inception, data mining had to cope with the challenges of optimizing learning algorithms—which are typically iterative in nature and must converge—on very large data. These tasks are both I/O and compute intensive. In the context of big data, HPDM faces the same challenges that BDA faces in implementing advanced analytics algorithms.
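
The iterative nature of these algorithms can be sketched with a small batch gradient descent for one-feature logistic regression. This is only an illustration under my own assumptions: the toy data, the learning rate, the tolerance, and the function name fit_logistic are all made up for the example. The point is that every update needs a full scan of the data, and the number of scans is not known in advance.

```python
import math
from typing import List, Tuple

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(data: List[Tuple[float, int]], lr: float = 0.5,
                 tol: float = 1e-4, max_passes: int = 10000):
    """Batch gradient descent; returns (w, b, number of passes over the data)."""
    w, b = 0.0, 0.0
    n = len(data)
    for passes in range(1, max_passes + 1):
        grad_w = grad_b = 0.0
        for x, y in data:                 # one full scan of the data per update
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        grad_w /= n
        grad_b /= n
        w -= lr * grad_w
        b -= lr * grad_b
        # Stop once the gradient is small (converged) or the pass budget runs out.
        if math.hypot(grad_w, grad_b) < tol:
            return w, b, passes
    return w, b, max_passes

# Toy, non-separable data: (feature, label). Purely illustrative.
data = [(-2.5, 0), (-1.5, 0), (-0.5, 1), (0.5, 0), (1.5, 1), (2.5, 1)]
w, b, passes = fit_logistic(data)
print(f"w={w:.3f} b={b:.3f} passes over the data={passes}")
```

Compare this with a simple average, which needs exactly one pass. When each pass means reading a very large data set, the repeated scans are what make these tasks both I/O and compute intensive.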

The figure below is a conceptual graph comparing the operational ranges of BDA and HPDM across the two dimensions of “analytics sophistication” and “data volume.” BDA operates across the whole analytics spectrum, from basic to advanced analytics, but on the big data side of the data volume spectrum. HPDM operates across the whole data volume spectrum, with its main focus on advanced analytics. The transition of HPDM into big data is disruptive, meaning that it requires a fresh look at the existing technology infrastructure, skills, and processes. The transition of BDA to advanced analytics can also be considered somewhat disruptive in the sense that it requires new techniques and approaches to implement these often iterative algorithms. I refer to the intersection of the two as big data mining, a name that emphasizes both its “big data” requirement and its “data mining” (or advanced analytics) nature. Both BDA and HPDM are subsets of what is referred to as high-performance analytics (HPA).


Comparison of big data analytics to high-performance data mining

I discuss these topics in detail in my book. Visit the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info).