Monday, August 22, 2016

What Constitutes a Big Data Scenario?

For more information and orders, visit the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info ).

Big data is a new marketing term that highlights the ever-increasing and exponential growth of data in every aspect of our lives. The term big data originated from within the open-source community, where there was an effort to develop analytics processes that were faster and more scalable than traditional data warehousing, and could extract value from the vast amounts of unstructured and semi-structured data produced daily by web users. Consequently, the origins of big data are tied to web data, though today the term is used in a larger context.



The most common definition of big data (attributed to Gartner) is datasets that grow so large that they become awkward to work with using available data management and analysis tools. Difficulties include capture, storage, search, sharing, analytics, and visualization of such data. The definition is broad and can be extended in two ways. First, what counts as big data will vary over time as technology advances: what is considered big data today may be the new normal in the next several years. Second, the definition indirectly implies that what counts as big data may also change with the place, i.e., the industry sector (vertical), organization, or enterprise, depending on where each is on its analytics evolution curve.



Those who are proficient in the world of analytics understand that the use of analytics in some verticals is much more advanced than in others. For example, the banking industry has been an early adopter of business intelligence and predictive analytics. When it comes to systematic use of these technologies in application areas such as risk, fraud, and marketing/sales, this sector has been well ahead of other verticals. Large-scale retail and telecommunications have followed suit, while verticals such as health care, health insurance, and the public sector have been laggards. Hence, the interpretation of what is considered big data can differ across these verticals.

Forrester Research defines big data as the frontier of a firm’s ability to store, process, and access all of the data it needs to operate, make decisions, reduce risks, and serve customers. The main phrases here are “the frontier” and “all the data.” In essence, this definition is similar to the first one, though expressed differently. The trend of working with ever-larger datasets continues for two reasons. First and foremost, web companies of the scale of Google, Facebook, Amazon, LinkedIn, and Yahoo! have for some time had to capture and manage all the data on their sites (and elsewhere) at the most granular level to provide the services they offer. They had to think outside the box and innovate to cope with these challenges. Second, for more traditional companies in sectors such as finance, telecommunications, and retail, there are new potential business benefits in analyzing big data that will allow them to uncover new business trends, detect novel customer behavior, create new products and services, identify more complex fraud patterns, and assess more complicated risk events with better accuracy.


Though big data is always a moving target, current limits are on the order of terabytes and petabytes of data. These are the sizes of a single dataset or combinations of datasets that have to be analyzed for a specific analysis purpose at a specific time. Scientists have regularly encountered this problem in data mining, meteorology, genomics, complex physics simulations, biological/environmental research, Internet searching, and finance. Datasets also grow in size because they are being gathered by machines such as information-sensing mobile devices, aerial sensory technologies, computer software (logs), cameras, microphones, radio-frequency identification (RFID) readers, near-field communication (NFC), and wireless sensor networks.






Saturday, August 13, 2016

Big Data Mining and Big Data Analytics


In three previous posts, I made an attempt to describe Data Mining, Data Science, and Machine Learning:
  • Data Mining, Data Science, and Machine Learning (1), (2), and (3).
As explained in these posts, "Data Science" is nothing but a new term. The use of analytics, data mining, and machine learning has been the cornerstone of many applications for more than two and a half decades, e.g., in finance, telecom, and retail. However, the popularity of web products from the likes of Google, LinkedIn, Amazon, and Facebook has helped analytics become a household name and has driven its rise under a new term (Data Science).

Since data mining has been around for a long time, people may wonder how it relates to "big data analytics."

Big data is a new marketing term that highlights the ever-increasing and exponential growth of data in every aspect of our lives. The term big data originated from within the open-source community, where there was an effort to develop analytics processes that were faster and more scalable than traditional data warehousing, and could extract value from the vast amounts of unstructured and semi-structured data produced daily by web users. Consequently, big data origins are tied to web data, though today big data is used in a larger context.

What is considered big data today may be the new normal in the next several years. What counts as big data may also change with the place, i.e., the industry sector (vertical), organization, or enterprise, depending on where each is on its analytics evolution curve. Big data is always a moving target. Big data analytics is merely a new term meant to focus our attention on the analysis of big data and to address the technological and business challenges associated with it.

The figure below shows how I compare high-performance data mining (HPDM for short) and big data analytics (BDA for short). Any type of analysis, whether basic or advanced, is covered under BDA. With traditional tools and techniques, performing even basic computations on big data, such as finding averages, counts, and sums, in a reasonable amount of time is a challenge. As one expects, performing advanced analytics such as iterative graph algorithms (e.g., PageRank), gradient descent (e.g., logistic regression), and expectation maximization (e.g., K-Means) introduces even bigger challenges.
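
To make the "basic" end of the spectrum concrete, here is a minimal Python sketch of how counts, sums, and averages can be computed in a single pass over data that arrives in chunks, by keeping a small mergeable summary per chunk. It is an illustration only, not code from the book: the names PartialStats, summarize_chunk, chunked, and summarize are hypothetical, and the chunks here are in-memory lists standing in for partitions of a much larger dataset.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class PartialStats:
    """Mergeable summary of some data: enough to recover count, sum, and mean."""
    count: int = 0
    total: float = 0.0

    def merge(self, other: "PartialStats") -> "PartialStats":
        # Two summaries combine by simple addition, so chunks can be
        # processed independently and in any order.
        return PartialStats(self.count + other.count, self.total + other.total)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


def summarize_chunk(chunk: Iterable[float]) -> PartialStats:
    """Scan one chunk exactly once and return its summary."""
    stats = PartialStats()
    for value in chunk:
        stats.count += 1
        stats.total += value
    return stats


def chunked(values: List[float], size: int) -> Iterator[List[float]]:
    """Split a list into consecutive chunks (a stand-in for data partitions)."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def summarize(chunks: Iterable[Iterable[float]]) -> PartialStats:
    """Combine per-chunk summaries into one overall summary."""
    result = PartialStats()
    for chunk in chunks:
        result = result.merge(summarize_chunk(chunk))
    return result
```

For example, summarize(chunked(data, 3)) gives the same count, sum, and mean as scanning data in one piece. The point of the design is that each chunk is read only once and the summaries are tiny, which is why this kind of computation is a volume problem rather than an algorithmic one.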

The main focus of data mining, however, is advanced analytics. From its inception, data mining had to cope with the challenges of optimizing learning algorithms—which are typically iterative in nature and must converge—on very large data. These tasks are both I/O and compute intensive. In the context of big data, HPDM faces the same challenges that BDA faces in implementing advanced analytics algorithms.
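
The contrast with a single-pass aggregate can be sketched in a few lines of Python. The example below is a plain batch gradient descent for logistic regression, written only to show that every iteration requires a full scan of the data and that the number of scans is decided by a convergence test. It is a sketch under stated assumptions, not the book's implementation: fit_logistic, its parameters (learning_rate, tolerance, max_passes), and the gradient-norm stopping rule are all choices made for this illustration.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    learning_rate: float = 0.5,   # assumed step size for this sketch
    tolerance: float = 1e-4,      # assumed stopping threshold on the gradient norm
    max_passes: int = 10000,
) -> Tuple[List[float], float, int]:
    """Return (weights, bias, passes), where passes counts full data scans."""
    n_features = len(rows[0])
    n_rows = len(rows)
    weights = [0.0] * n_features
    bias = 0.0
    passes = 0
    for passes in range(1, max_passes + 1):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        # One full scan of the data per iteration: this is the I/O cost.
        for x, y in zip(rows, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        grad_w = [g / n_rows for g in grad_w]
        grad_b /= n_rows
        # Gradient step on the average log-loss.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
        # Stop once the gradient is small enough to call the fit converged.
        grad_norm = math.sqrt(sum(g * g for g in grad_w) + grad_b * grad_b)
        if grad_norm < tolerance:
            break
    return weights, bias, passes
```

Calling fit_logistic([[0.0], [1.0], [2.0], [3.0]], [0, 1, 0, 1]) on this tiny, non-separable toy dataset takes many passes before the stopping rule fires. On a dataset that does not fit in memory, each of those passes is a complete read of the data, which is what makes these tasks both I/O and compute intensive.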

The figure below is a conceptual graph comparing the operational ranges of BDA and HPDM across the two dimensions of “analytics sophistication” and “data volume.” BDA operates across the whole analytics spectrum, from basic to advanced analytics, but on the big data side of the data volume spectrum. HPDM operates across the whole data volume spectrum, with its main focus on advanced analytics. The transition of HPDM into big data is disruptive, meaning that it requires a fresh look at the existing technology infrastructure, skills, and processes. The transition of BDA to advanced analytics can also be considered somewhat disruptive, in the sense that it requires new techniques and approaches to implement these often iterative algorithms. I refer to the intersection of the two as big data mining, a name that emphasizes both the “big data” requirement and the “data mining” (or advanced analytics) nature of the work. Both BDA and HPDM are subsets of what is referred to as high-performance analytics (HPA).


Comparison of big data analytics to high-performance data mining
