Wednesday, October 5, 2016

The Age of Data Innocence is Over

When I try to explain Data Science and Analytics to business people or others interested in these fields, I use the following example to describe the four pillars: data, platforms/tools, algorithms, and know-how.

To me, "data" is like a collection of bones (say, of an animal) scattered around, some clean and some hidden in dirt. If these pieces are collected and put together correctly, they will resemble the real skeleton of that animal. Some of the bone pieces are perfect. Some are broken and have to be glued together, and some will be missing (hopefully not many). But if these bone pieces are assembled with care, they give us a good view of the animal they once belonged to.

Then the infrastructure (data platforms and tools) is like the muscles that go around these bones after they have been lined up correctly. The muscles make the skeleton move around and do interesting things. They can make the boring data come to life and express itself in many interesting ways.

The algorithms are then the brain: a very small part of the whole, through which the muscles are controlled to make the skeleton do more interesting things in more novel ways. In the end, we have the lifeless and somewhat boring bone pieces moving around in harmony, doing interesting things. That is what Data Science and Analytics try to do.

Data used to be static, and often boring if just collected and not used, but it was innocent. It lay around in many places, sometimes at massive scale. It was up to the art and science of data scientists and engineers to bring it to life. If they did their job right, it could come to life in many useful ways. It could do no harm on its own. It could not lie or cheat on its own.

The recent Volkswagen and Wells Fargo scandals have been a turning point, signaling the end of data innocence. In the past, lies could be told by using data selectively and in a biased way, but the data itself was innocent. These scandals show that data can easily be manipulated at the origin, right where it is created.

The bone pieces in my example above could now be outright fakes, made to look real. Collectively, they would portray a skeleton of fiction or imagination, no matter how great and noteworthy the platforms, tools, and people (know-how) that assemble and use them.

The good news is that in many useful applications, those involved have no incentive to fake the data. However, one can envision many applications in which there is an incentive to manipulate the data at the origin and fool everybody down the chain. That brings us back to the question of ethics and integrity in Data Science and Analytics, and adds yet another important step to the long list of checks for Data Validity (one of the 6Vs of Data discussed in my book).

-----------------------------------------------------------------------------------------------------------------------
I discuss some related topics in detail in my book. Visit the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info).