Monday, August 22, 2016

What Constitutes a Big Data Scenario?

For more information and orders, visit the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info).

Big data is a new marketing term that highlights the ever-increasing, exponential growth of data in every aspect of our lives. The term originated within the open-source community, where there was an effort to develop analytics processes that were faster and more scalable than traditional data warehousing and that could extract value from the vast amounts of unstructured and semistructured data produced daily by web users. Consequently, the origins of big data are tied to web data, though today the term is used in a larger context.

The most common definition of big data (attributed to Gartner) is datasets that grow so large that they become awkward to work with using available data management and analysis tools. Difficulties include capture, storage, search, sharing, analytics, and visualization of such data. The definition is broad and can be extended in two ways. First, what counts as big data will vary over time as technology advances: what is considered big data today may be the new normal within the next several years. Second, the definition indirectly implies that what counts as big data may also vary with the place (i.e., the industry sector or vertical, the organization, or the enterprise), depending on where each stands on its analytics evolution curve.

Those who are proficient in the world of analytics understand that the use of analytics in some verticals is much more advanced than in others. For example, the banking industry has been an early adopter of business intelligence and predictive analytics. When it comes to the systematic use of these technologies in application areas such as risk, fraud, and marketing/sales, this sector has been well ahead of other verticals. Large-scale retail and telecommunications have followed suit, while verticals such as health care, health insurance, and the public sector have been laggards. Hence, the interpretation of what is considered big data can differ across these verticals.

Forrester Research defines big data as the frontier of a firm’s ability to store, process, and access all of the data it needs to operate, make decisions, reduce risks, and serve customers. The main phrases here are “the frontier” and “all of the data.” In essence, this definition is similar to the first one, though expressed differently. The trend of working with ever-larger datasets continues for two reasons. First and foremost, web companies of the scale of Google, Facebook, Amazon, LinkedIn, and Yahoo! have for some time had to capture and manage all the data on their sites (and elsewhere) at the most granular level to provide the services they offer. They had to think outside the box and innovate to cope with these challenges. Second, for more traditional companies in sectors such as finance, telecommunications, and retail, analyzing big data offers new potential business benefits: it allows them to uncover new business trends, detect novel customer behavior, create new products and services, identify more complex fraud patterns, and assess more complicated risk events with better accuracy.

Though big data is always a moving target, current limits are on the order of terabytes and petabytes of data. These are the sizes of a single dataset, or of a combination of datasets, that has to be analyzed for a specific purpose at a specific time. Scientists have regularly encountered this problem in data mining, meteorology, genomics, complex physics simulations, biological/environmental research, Internet search, and finance. Datasets also grow in size because they are being gathered by machines such as information-sensing mobile devices, aerial sensory technologies, computer software (logs), cameras, microphones, radio-frequency identification (RFID) readers, near-field communication (NFC) devices, and wireless sensor networks.

I discuss these topics in detail in my book, "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info).