High Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data

A personal blog currently dedicated to the promotion of the book titled "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data." The success of new big data technologies in large web companies has created a rush toward understanding the impact of these technologies in classic analytics environments that already employ a multitude of legacy analytics technologies. The posts provide concepts and excerpts from the book to study this impact.

Wednesday, October 26, 2016

Machine Learning vs. Traditional Statistics: Different philosophies, Different Approaches

"Machine Learning (ML)" and "Traditional Statistics (TS)" have different philosophies in their approaches. With "Data Science" at the forefront and getting a lot of attention and interest, I would like to dedicate this post to the differences between the two. I often see discussions and arguments between statisticians and data miners/machine learning practitioners about the definition of "data science," its coverage, and the required skill sets. All that is needed is to pay attention to the evolution of these fields.

There is no doubt that when we talk about "Analytics," both data mining/machine learning practitioners and traditional statisticians have been players. However, there are significant differences in the approaches, applications, and philosophies of the two camps that are often overlooked.

What is ML?

ML is a branch of Artificial Intelligence (AI). AI focuses on understanding intelligence and how to replicate it in machines (systems or agents). ML aims at the automatic discovery of regularities in data through the use of computer algorithms, and at generalizing those regularities to new but similar data. Its main focus is the study and design of systems that can “learn from data,” in particular through inductive learning (learning by examples). ML is not the same as “data mining” or “predictive analytics,” which are practices, but it is a core part of both.

ML has its roots in the 1950s, and many startups were formed in the late 1980s and early 1990s around commercially successful applications such as real-time fraud detection, character recognition, and recommendation systems (first-generation ML systems). ML is also closely related to “Pattern Recognition” (PR). While ML grew out of computer science, Pattern Recognition has engineering roots. The two, however, are facets of the same field, and the focus in both is learning from data. Today, the resurgence of ML is the driver of the next big wave of innovation.

ML Application Variety

Data mining and predictive analytics

Fraud detection, ad placement, credit scoring, recommenders, drug design, stock trading, customer relationship & experience, …

Text processing & analysis

Web search, spam filtering, sentiment analysis, …

Graph mining

Other:

Speech recognition, human genome, bioinformatics, optical character recognition (OCR), face recognition, self-driving cars, scene analysis, …

ML Community/Practitioners

• Typically computer science and/or engineering background
• More programming savvy
• Not confined to a single tool
• Open-source friendly
• Rapid prototyping of ideas/solutions is desired

ML vs. Traditional Statistics

Historically, ML techniques and approaches have relied heavily on computing power. TS techniques, on the other hand, were mostly developed when computing power was not an option. As a result, TS relies heavily on small samples and strong assumptions about the data and its distributions.

ML in general tends to make fewer prior assumptions about the problem and is liberal in its approaches and techniques for finding a solution, often using heuristics. The preferred learning method in machine learning and data mining is inductive learning. At its extreme, inductive learning assumes that data is plentiful and that not much prior knowledge about the problem and data distributions exists or is needed for learning to succeed. The other end of the learning spectrum is analytical (deductive) learning, where data is often scarce, or where it is preferred (or customary) to work with small samples of it, and where there is good prior knowledge about the problem and data. In the real world, one often operates between these two extremes. Traditional statistics, in contrast, is conservative in its approaches and techniques and often makes tight assumptions about the problem, especially about data distributions.

The following table shows some of the differences in approach and philosophy between the two fields:

Machine Learning (ML)	Traditional statistics (TS)
Goal: “learning” from data of all sorts	Goal: Analyzing and summarizing data
No rigid pre-assumptions about the problem and data distributions in general	Tight assumptions about the problem and data distributions
More liberal in the techniques and approaches	Conservative in techniques and approaches
Generalization is pursued empirically through training, validation and test datasets	Generalization is pursued using statistical tests on the training dataset
Not shy of using heuristics in approaches in search of a “good solution”	Using tight initial assumptions about data and the problem, typically in search of an optimal solution under those assumptions
Redundancy in features (variables) is okay, and often helpful. Preferable to use algorithms designed to handle a large number of features	Often requires independent features. Preferable to use fewer input features
Does not promote data reduction prior to learning. Promotes a culture of abundance: “the more data, the better”	Promotes data reduction as much as possible before modeling (sampling, fewer inputs, …)
Has been faced with solving more complex problems in learning, reasoning, perception, knowledge representation, …	Mainly focused on traditional data analysis
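
To make the "training, validation and test datasets" row of the table concrete, here is a minimal sketch of an empirical three-way split in Python. The function name split_dataset, the 60/20/20 proportions, and the fixed seed are assumptions chosen for illustration, not something prescribed in this post or in the book.

```python
import random

def split_dataset(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and split them into train, validation, and test lists.

    The fractions are illustrative assumptions; whatever is left after the
    train and validation slices becomes the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# Hypothetical data: 100 record ids standing in for real observations.
train, valid, test = split_dataset(range(100))
print(len(train), len(valid), len(test))
```

A model would be fit on the training slice, tuned on the validation slice, and judged once on the untouched test slice, which is the empirical notion of generalization the ML column refers to.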

In principle, learning could be achieved by manually writing a program that covers all possible data patterns. This is exhausting work and is generally impossible to accomplish for real-world problems. In addition, such a program will never be as good or as thorough as a learning algorithm. Learning algorithms learn from examples automatically (like humans do), and they generalize based on what they learn (inductive learning). Generalization is a key aspect of evaluating the performance of a learner. At the highest level, the most popular learning algorithms can be categorized into supervised and unsupervised types, and each type into useful high-level categories (also called data mining functions):

Supervised learning includes:

· Classification: Predicting to which discrete class an entity belongs (binary classification is used the most)—e.g., whether a customer will be high-risk.

· Regression: Predicting continuous values of an entity’s characteristic—e.g., how much an individual will spend next month on his or her credit card, given all other available information.

· Forecasting: Estimation of macro (aggregated) variables such as total monthly sales of a particular product.

· Attribute Importance: Identifying the variables (attributes) that are the most important in predicting different classification or regression outcomes.

Unsupervised learning includes:

· Clustering: Finding natural groupings in the data.

· Association models: Analyzing “market baskets” (e.g., novel combinations of the products that are often bought together in shopping carts).
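
As a rough illustration of the "market basket" idea, the sketch below counts how often each pair of products appears together and reports the fraction of baskets containing the pair (commonly called support). The function name pair_support and the tiny basket data are made up for this example; real association-rule mining uses more elaborate algorithms.

```python
from collections import Counter
from itertools import combinations

def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        # Sort the distinct items so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

# Hypothetical shopping carts.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]
support = pair_support(baskets)
print(support[("bread", "milk")])  # bread and milk co-occur in 3 of 4 baskets
```

No labels are involved here: the groupings emerge from the data alone, which is what makes this an unsupervised function.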

Statistical Learning Theory

Historically, statisticians have been skeptical of machine learning and resistant to accepting it, because of ML's liberal approach and its lesser emphasis on theoretical proofs. The good news is that "Statistical Learning Theory" has bridged the gap and provided an umbrella theory under which both sides can collaborate and operate. Basic statistical concepts are a cornerstone of many engineering and science fields, very much like math is. But sticking to traditional statistical thinking and practices would have prevented progress. These are two different things, and ML has proved that in practice. For those interested in understanding a bit about Statistical Learning Theory and its relation to ML, see the following lecture by Yaser S. Abu-Mostafa at Caltech.

---------------------------------------------------------------------------

I discuss these topics in detail in my book. Visit the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info ).

10 Required Non-technical Skills for a Data Scientist

"Data Science (DS)" is nothing new except for the term itself and the recent level of interest in it. As a practice, it has existed commercially (not academically) for more than 25 years, mainly under the names "Data Mining (DM)" and "Predictive Analytics (PA)," since the early 1990s. DM and PA originally got a lot of traction in the financial, telco, and retail industries, which had a lot of granular historical data. Like anything that gets sudden attention and interest, DS has been misused and abused in a variety of ways. Given the fast surge in market demand in the last several years, many claim to be or want to be data scientists. True data scientists and DS managers who have had to screen DS resumes can testify to the level of noise (false positives) present in that application process.

"Data Science" tries to be an umbrella field that covers more than what data mining and predictive analytics practices have covered. That is justified: with the growth of data of all kinds in the last decade and what is expected in the coming years, we need many more people with relevant DS skill sets. The challenge, however, has been the definition of that skill set. What makes a good data scientist?

In my previous post, "What is DS-BuDAI?," I explained that a successful DS project requires the involvement of the data science team through the whole cycle. The core part of a data science project deliverable is the insight and decision coming out of analytics. The analytics could be trivial (generally an aggregated view of the data, looking at only a handful of variables together), in which case there would be no need for DS; that would be in the realm of a data or business analyst. DS usually comes into the picture where:

More sophisticated analytics approaches are required,
More complex transformations are required to prepare the data,
Granular or atomic analysis of entities of interest is required,
Analytics could be straightforward but big data is involved requiring attention to optimization of analytics,
...

Within the BuDAI process, the DS team has to interact with business, data engineers, data architects, project managers, and product managers, to name a few. Aside from relevant technical skills/knowledge[1] in math, stats, machine learning, programming, databases, and systems (the breadth and depth will depend on the level of seniority of the data scientist), I have found over the years that the following ten traits are as important as technical skills for junior hires and absolutely essential for senior data scientists.

Problem solving ability
Business acumen
Ability to question the work of self and others
Passion for data (the more data, the better)
Attention to detail and ability to validate own work in multiple ways
Statistical thinking (a thinker who knows when to reason deterministically and when not)
Passion for exploration and discovery (quick learner from failures)
Creativity, or the ability to devise optimal ways to experiment with something new (finding novel, useful insight is cumbersome, and there is never a sure way to find it)
Presentation ability (written and oral)
Ability to simplify complex concepts for explaining to others

----------------------
[1] This is the subject of another blog post; given today's coverage of data science, the required technical abilities vary greatly.


Wednesday, October 19, 2016

What is DS-BuDAI?

Data science[1] (covering data mining and related practices) is a multidisciplinary field that requires knowledge of a number of different skills, practices, and technologies, including but not limited to machine learning, pattern recognition, mathematics, programming, algorithms, statistics, and databases. In the context of big data, more skills and knowledge are required, such as knowledge of distributed computing techniques/algorithms and architectures. By nature, data science is a creative process that combines science, engineering, and art. Hence its success has depended heavily on the quality and experience of the team carrying it out. As a result, for some time in the past, data mining projects were not repeatable with the same level of success across different enterprises. However, with the maturity of the practice, that has changed.

Since the late 1990s there have been a variety of efforts to create standard methodologies and process models for data mining, such as CRISP-DM (Wirth and Hipp 2000). In this methodology there is an important focus on business, data, and deployment aspects, as well as the modeling, which used to be the main focus. Today, data science practices are more mature and well tested. Even though different methodologies may use different names for each step of the process, in general, I can logically divide any data science exercise into four phases (See Figure below):

Business Problem Understanding/Use,

Data Understanding/Use and Preparation,

Analytics and Assessment,

Implementation (Deployment and Monitoring).

In the context of big data, these logical phases stay the same; however, some low-level details of data preparation, analysis, and implementation may be impacted.

We all love acronyms and I have been using DS-BuDAI to refer to this process to communicate with business sponsors and users. The lowercase 'u' represents "Understanding/Use" to overemphasize their importance during Business and Data focused phases. It bridges the two. Analytics and Implementation are simply realizations of the data science deliverable.

The "Understanding" part needs no explanation, especially in the context of the business problem and the data that are going to be addressed and leveraged in the effort. "Use," however, needs a bit of explanation given some recent experiences.

A DS project must start with a full understanding of the business challenge and how it could be solved by leveraging data sources that are available or can be obtained. However, there could be cases where, after everything is done and the value is proven, the business users are still not willing to act on the new insights. This lack of responsiveness has a lot to do with the culture of the organization, how decisions have historically been made, and the marginal improvement the new actions will bring. These obstacles, however, can be overcome with education and training and the full support of senior management for change.

In some cases, though, actionable insights are perceived by business users as "this is what we already knew" and "it is good that the data analysis confirms that." In other words, they are saying that there are no novel findings, only a confirmation of what is known. There is sometimes truth to this perception, but at times it is simply resistance to change or to accepting changes in practice.

In the context of data, "Use" is also essential. Collection, storage, preparation, and management of big data are still expensive, no matter how much costs have dropped in recent years with the advent of open source systems and price drops in storage/processing systems. Data can easily be abused or misused. Sometimes too much data is used, and sometimes data is not used at the right level of detail or aggregation.

The lowercase "u" in DS-BuDAI is to overemphasize understanding and use during business and data focused phases.

Originally published on 10/19/16, 11:47 AM Pacific Standard Time

[1] "Data Science" is nothing new except the term itself and the level of recent interest in it.

Wednesday, October 5, 2016

The Age of Data Innocence is Over

When I try to explain Data Science and Analytics to business people or those interested in these fields, I use the following example to describe the four pillars of: Data, Platform/Tools, Algorithms, and know-how.

To me, "data" is like a collection of bones (say of an animal) scattered around, some clean and some hidden in dirt. If these pieces are collected and put together correctly, it will mimic or resemble the real skeleton of that animal. Some of the bone pieces are perfect. Some are broken and have to be glued together, and some will be missing (hopefully not a lot). But if these bone pieces are organized together with care, it would give us a good view of the animal they once represented.

Then the infrastructure (data platforms and tools) is like the muscles that go around these bones after they have been lined up correctly. The muscles make the skeleton move around and do interesting things. They can make the boring data come to life and express itself in many interesting ways.

The algorithms are then the brain: a very small part of the whole, through which the muscles are controlled to make the skeleton do more interesting things in more novel ways. In the end, we have the lifeless and somewhat boring bone pieces moving around in harmony, doing interesting things. That is what Data Science and Analytics try to do.

Data used to be static and often boring if just collected and not used, but it was innocent. It used to lie around in many places, sometimes at massive scale. It was up to the art and science of the data scientists and engineers to bring it to life. If they did their job right, it could come to life in many useful ways. It could do no harm on its own. It could not lie or cheat on its own.

The recent Volkswagen and Wells Fargo scandals have been a turning point, signaling the end of data innocence. In the past, lies could be told by using data selectively and in a biased way. The data itself was innocent. These scandals show that data can easily be manipulated at the origin, right where it is created.

The bone pieces in my example above could now be fake and made to look real. In that case, they collectively portray a skeleton of fiction or imagination, no matter how great and noteworthy the platforms, tools, and people (know-how) that assemble and use them.

The good news is that for many useful applications, there is no incentive for those involved to fake the data. However, one can envision many applications in which there is an incentive to manipulate the data at the origin and fool everybody down the chain. That brings us back to the question of ethics and integrity in Data Science and Analytics, and adds yet another important step to the long list of steps for ensuring Data Validity (one of the 6Vs of Data discussed in my book).


Friday, September 30, 2016

Analytics Divide and Big Data

In a research report on analytics by MIT and IBM (Kiron, et al. 2011), three progressive levels of analytical sophistication for organizations are defined as aspirational, experienced, and transformed (See Table 1). In transformed organizations, analytics is used at all levels, day to day, strategically. It is considered an integral part of everything, including the culture. These organizations have high proficiency in both data management and analytics in terms of usage, skills, and tools. Both data management and analytics are enterprise-driven and are ingrained in the enterprise culture. Transformed organizations have robust data and analytics foundations and management competencies, making it possible to capture, combine, and analyze information from disparate sources, and to disseminate it across the organization so that individuals at all levels can consume it. In these organizations, one finds that processes, practices, and behaviors are aligned with the fundamental belief that business decisions at all levels should be based on data analysis.

On the other hand, aspirational organizations do not use any sophisticated analytics beyond spreadsheets, and do not have an integrated view of their enterprise data. They lack proficiencies on both the data management and analytics fronts. Their culture relies more on decision-making based on guts and intuition rather than data analysis.

The experienced organizations are somewhere in between, with initiatives that move them closer to transformed organizations. This study shows that experienced and transformed organizations continue to expand their analytics and information management capabilities to add more business value and differentiate themselves, while aspirational organizations keep falling behind. This growing gap or divide has major implications for businesses. For the following discussion, my focus is only on experienced and transformed organizations, since these strongly believe in the value of analytics and either practice it in full or have a goal to get there. I consider them standard organizations in the sense that, in today’s competitive world, they are more of the norm than exception.[1] These organizations have also the culture, the appetite, and the desire to deal with their big data challenge, if there ever is one.

Organization Category	Information Management Proficiency	Analytics Proficiency	Data Culture
Aspirational	Low	Low	Line of business driven
Experienced	Medium	Medium	Moving toward enterprise driven
Transformed	High	High	Enterprise driven

Table 1: Three progressive levels of analytical sophistication in enterprises (Kiron, et al. 2011).

One main differentiator between analytics in the traditional sense and big data analytics is that in the latter, the collected big data may or may not be useful for the specific business purpose intended. From the perspective of analysis, this falls into the category of "you don’t know what you don’t know." However, if any insights are extracted, they could be enormously valuable. Due to the maturity of traditional data management and analysis technologies, data that is stored in these environments is already known to be of high value. This data has been prepared to answer known business questions. The high value justifies its storage and management in enterprise data warehouses or data marts. With new big data, there are plenty of opportunities to ask new business questions never asked before, and the economic situation is favorable when investigating these questions.

Table 2 enumerates a few possible scenarios in today’s standard analytics environments (experienced and transformed organizations) when they are faced with big data. These environments already excel in dealing with traditional and proven analytics methods and technologies where storage, management, and analysis of the data follow standard processes and practices. Scenario 1 depicts the status quo in these environments—where, in the absence of any big data, it is business as usual.

Data Scenario	Big Data?	Storage	Analysis	Business Value
1	No	Standard	Standard	Known
2	Yes	Possible	Nonstandard	Somewhat known
3	Yes	Possible	Not possible	Not known
4	Yes	Not possible	—	Not known

Table 2: Big data scenarios in standard analytics environments.

However, in terms of their existing capabilities, they face different scenarios to deal with their big data challenges. The reader should keep in mind that the size of big data has to be interpreted in the context of time and place of each enterprise, given its sector and its place on the analytics evolution curve. In Scenario 2, the enterprise is capable of storing its big data, and can also analyze it using existing nonstandard[2] big data analytics techniques. As a result, the enterprise has some understanding of the hidden value in its big data, and can decide how much of it needs to be stored and for how long. In Scenario 3, the organization can cope with storing its big data, but does not yet have the capability to analyze it in any efficient way for assessing its value. The reason for this could be technological, methodological, skill set related, or budgetary. In Scenario 4, the enterprise at its current state is not capable of storing the big data (hence not able to analyze it either) for similar reasons to Scenario 3. Today, Scenarios 3 and 4[3] are still dominant for classic enterprises. Those operating under Scenario 2 are a small minority but are ahead of the curve compared to their peers. The curiosity of finding the potential value in big data is why big data has become a part of these organizations’ overall data strategies. Going forward, any enterprise data strategy that ignores big data should be considered incomplete.
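
The scenarios in Table 2 amount to a small decision table, which can be sketched in Python as below. The function name big_data_scenario and its three boolean parameters are hypothetical names introduced only for this illustration; in practice, whether an enterprise "can store" or "can analyze" its big data is a matter of degree rather than a yes/no flag.

```python
def big_data_scenario(has_big_data, can_store, can_analyze):
    """Map an enterprise's situation to a scenario number from Table 2."""
    if not has_big_data:
        return 1  # status quo: standard storage and analysis, known value
    if not can_store:
        return 4  # cannot store the data, hence cannot analyze it either
    if not can_analyze:
        return 3  # storage is possible, analysis is not; value not known
    return 2      # storage possible, nonstandard analysis; value somewhat known

# Example: an enterprise that can store its big data but not yet analyze it.
print(big_data_scenario(has_big_data=True, can_store=True, can_analyze=False))
```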

----------------------------------------------------------------------------

Kiron, David, Rebecca Shockley, Nina Kruschwitz, Glenn Finch, and Michael Haydock. 2011. Analytics: The Widening Divide. MIT Sloan Management Review; IBM Institute for Business Value.

[1] More than a decade ago, one could say that the reverse phenomenon was true, meaning that aspirational organizations were more of the norm.

[2] Big data analysis techniques are still in their infancy, and I consider them nonstandard in comparison with traditional data analytics tools and techniques (including data warehousing, BI, and data mining) that have matured, especially in the last two decades.

[3] “Without big data, you are blind and deaf in the middle of a freeway.”—Geoffrey Moore.
---------------------------------------------------------------------------


Tuesday, September 6, 2016

What Constitutes a Big Data Scenario (Part 2): Human vs. Machine Generated Data

It is projected that growth in global data generated per year is 40%, versus a 5% growth in global IT spending (McKinsey Global Institute 2011). From a cost perspective, this divergence clearly shows why big data technology solutions must be cost-effective today in order to stay within existing IT budgets and forecasts. However, the real value of any technology is ultimately defined by its business ROI and not its cost. This is particularly true for classic organizations that are exploring these new big data technologies. From an economic and business standpoint, the potential value creation from big data opportunities has already been proven by numerous web companies. In many cases this value creation has been disruptive.

More than 80% of data generated is reported to be unstructured data, which includes:

• Semistructured data: weblogs, machine logs, etc.
• Unstructured text data: blogs, e-mail, comments, etc.
• Binary data: photos, images, audio, video, etc.

It has been reported that the data collected by the US Library of Congress by April 2011 was around 235 terabytes. Fifteen out of seventeen sectors in the United States have more data stored per company than the US Library of Congress (McKinsey Global Institute 2011).

Structured data is stored and resides in predetermined fixed fields. Unstructured data cannot be stored in fixed fields. Freeform text (books, e-mails, articles, blogs, etc.) and untagged video, image, and audio are examples of unstructured data. Semistructured data also does not conform to fixed fields, but includes tags that identify its data elements. XML, JSON, and HTML-tagged text are examples of semistructured data. Multistructured data refers to a combination of all these data varieties.
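
The difference between fixed fields and tagged elements can be seen in a short Python sketch. The sample records and field names (customer_id, amount, tags, note) are invented for illustration; only the standard csv and json modules are used.

```python
import csv
import io
import json

# Structured: every record has the same predetermined fields, in fixed columns.
structured = "customer_id,amount\n42,19.99\n43,5.00\n"
rows = list(csv.DictReader(io.StringIO(structured)))
print(rows[0]["amount"])

# Semistructured: tags (keys) identify the elements, but records may nest
# and need not all carry the same elements.
semistructured = '{"customer_id": 42, "tags": ["promo"], "note": {"lang": "en"}}'
record = json.loads(semistructured)
print(record["note"]["lang"])
```

Unstructured data, such as the free text of an e-mail, has neither columns nor tags, so any structure has to be inferred by analysis rather than read off by a parser.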

What constitutes human or machine generated data is loosely defined. I like to differentiate between the two kinds of data using the following definition (See Table below). As humans interact with each other and with other organizations, or organizations with each other, a massive amount of structured data is generated in the form of transactions such as call records, payment transactions, sale orders, etc. These data are collectively generated through business processes and had been captured and analyzed long before the Internet became mainstream. Conventional big data technologies were originally developed to handle such data. Human interactions also create semistructured data, such as weblog data, that are newer, and their detailed processing often requires newer big data technologies to be cost-effective. Human-generated data can also be directly created by humans as “digital content,” which could be either unstructured or binary. In a nutshell, human-generated data can be defined as the digitization of human interactions.

At the other end, I define machine-generated data as data capturing machine-to-machine (Internet of Things) interactions. Machine-generated data may be the result of observing human behavior instead of capturing their choices. This data could also be in structured, semistructured, or binary form. Data from RFID tags, computer logs, network logs, security cameras, etc., are typical examples.

Data Generation Origin	Definition	Data Type	Examples
Humans	Data representing the digitization of human interactions	Structured	Business process data e.g., payment transactions, sales order, call record, ERP, CRM
		Semistructured	Weblogs
		Unstructured	Content such as Web pages, E-mail, Blog, Wiki, Review, Comment
		Binary	Content such as Video, Audio, Photo
Machines	Data representing machine-to-machine interactions, or simply not human-generated (Internet of Things)	Structured	Some devices
		Semistructured	Computer logs, Device logs, Network logs, Sensor/Meter logs
		Binary	Video, Audio, Photo