Tuesday, July 22, 2014

Big Data Confusion

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info).

I was away for the FIFA World Cup for a couple of weeks and am back now. The first thing I noticed last week was that my Twitter account had been hacked. That is all sorted out now, and this blog post will be tweeted.

In the process of writing my book, I did some research on the definitions of "big data" and "big data technologies." Suffice it to say that I found a lot of conflicting definitions. As a practitioner in the world of analytics, I wondered how others who do not have deep experience in analytics must feel about all this! In the end, I found the following definition of big data the most appropriate.

The most common definition of “big data” is datasets that grow so large that they become awkward to capture, store, search, share, analyze, and visualize using available data management and analysis tools. This definition is relative in two ways. First, it depends on time: what is considered big data today may be the new normal in the next several years. Second, it indirectly implies that what counts as big data may also change with the place (i.e., industry sector or vertical, organization, or enterprise), depending on where that place is on its analytics evolution curve.

In the last few years, the term “big data” has been tied closely to “Hadoop” by the media. This is unfortunate. The reader should understand that an organization may choose, or may have already chosen, to solve its big data problems using big data technologies other than Hadoop. At the same time, one can also use Hadoop to solve a more conventional data problem, and not necessarily a big data problem. Hadoop is only one piece of the whole big data puzzle, but it is an influential one. This 2013 survey (Gartner Survey: Big Data Adoption in 2013 Shows Substance Behind the Hype), like a few other surveys conducted by other research organizations, reflects what organizations think of big data.

In my upcoming book, I have tried to address fact vs. fiction (hype) about big data analytics. Big data analytics can be approached from many different angles as it relates to business use cases, analytic processes (methodologies and algorithms), platform architectures, etc. As far as platforms are concerned, when we talk about big data, we are talking about one of these options:

- Hadoop, MapReduce, and YARN
- Massively parallel databases (relational and NoSQL)
- Real-time event stream processing
- In-memory distributed analytics
- Big data analytics appliances

Again, aside from the hype, the Gartner survey illustrates what organizations consider big data technologies (cloud is at the top), what they use them for (customer experience is at the top), and what type of data they use them on (transactions are at the top). I have not seen a single unified understanding of big data in these surveys, and there are many of them. That shows how loose the definitions and understandings of big data are.