Thursday, June 12, 2014

Data Mining, Data Science, and Machine Learning (2)

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data".

A question that often comes up is: "What is the difference between Machine Learning and Data Mining Science (now Data Science)?" Newcomers often confuse the two, not to mention the businesses that use these terms to make their products more appealing to their customers and investors. I attempt to describe the difference between the two here. For more detail, you can read my new book.

Data Mining Science (the practice is referred to as Data Mining) covers all aspects of creating value out of data, from capture, storage, and quality checks to analysis, deployment, and infrastructure. It requires some knowledge of a variety of disciplines, including mathematics, machine learning, statistics, programming, and databases. Machine learning (ML), on the other hand, is a set of techniques and algorithms that can be applied to data of any type (numeric, text, binary, ...) once it is placed into the right form: a set of exemplars (also called patterns, observations, or records), each with a set of features that captures the problem of interest. ML (very close to pattern recognition) is a branch of artificial intelligence within computer science that focuses on the general problem of "learning from data" or "learning by examples."

Machine learning and pattern recognition are very close in principle, and both are established fields that have long been taught at top universities. Both address the problem of learning, with the difference that ML has its roots in computer science while pattern recognition has its roots in engineering. Statistical Learning Theory is a sub-field of ML that formalizes the problem of learning, focusing on its theoretical aspects. Statistical learning theory explains why machine learning techniques generally work, something practitioners had already experienced through empirical results. Traditional statisticians have long been skeptical of machine learning applications, even though these techniques are now part of many statistical toolsets.

The fundamental assumption that distinguishes machine learning from traditional statistics is that it makes little or no pre-assumption about the problem, including the data distributions. Traditional statistics has its roots in a time before computers, when analyzing any data centered on data reduction and simplification (sampling, variable reduction, linearity assumptions, ...). These often required making restrictive assumptions about the problem and the data distributions.

Historically, ML techniques had to deal with much more challenging problems, and as such they do not make pre-assumptions about the problem; instead, they use the power of the computer to search and optimize for "a good" solution, often using heuristics. In other words, traditional statistics forces severe assumptions on the data to get its best solution, but that solution is only best assuming those assumptions are correct. For much real-world machine- and human-generated data, those pre-assumptions are rarely correct, so the resulting solutions can be mediocre at best. ML generally does not seek or claim to find the best possible solution: given the complexity of the problems, the best solution may not be achievable, or may not be worth the time and resources to find even if it exists.

Another fundamental pillar of machine learning is the concept of "generalization" and the trade-off between accuracy and robustness. ML addresses this empirically with a training/validation/test approach, while traditional statistics uses statistical tests on the training data (coupled with tight initial assumptions).
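The training/validation/test idea can be sketched in a few lines. This is a minimal, hypothetical illustration (the function name, fractions, and seed are invented for the example), assuming a simple random shuffle split:

```python
import random

def train_val_test_split(examples, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle examples and split into train/validation/test partitions."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```

The model is fit on the training partition, tuned on the validation partition, and its generalization is estimated only once on the untouched test partition.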

As a practice, data mining (and data science) has four phases of equal importance:

(1) Business problem understanding
(2) Data understanding and preparation
(3) Model development and assessment
(4) Deployment and monitoring.

Business Understanding: Data mining starts from a full understanding of the business problem, where business domain knowledge and data mining knowledge both have to be leveraged. The final result of this phase is an ROI expectation and a formulation of the business problem as a data mining problem.

Data Understanding and Preparation: Understanding the data needs little explanation. By definition, it requires proper storage, low-level quality checks, and access to "all relevant data" for the desired business problem. Data preparation covers all aspects of data manipulation required for "model development." Sometimes data preparation is minimal, but very often it requires sophisticated and innovative ways of converting "raw relevant data" into a so-called "analytic data set" (or ADS) that represents the business problem of interest well and can be fed into the algorithms of interest. An ADS is a set of exemplars, each with a set of features that capture the problem of interest.
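As a minimal, hypothetical sketch of such a conversion, the snippet below aggregates made-up raw transaction records into a tiny ADS with one exemplar (feature row) per customer. The customer IDs, amounts, and feature names are all invented for illustration:

```python
from collections import defaultdict

# Hypothetical raw relevant data: (customer_id, transaction_amount)
raw = [
    ("c1", 25.0), ("c1", 40.0), ("c2", 10.0),
    ("c2", 15.0), ("c2", 5.0), ("c3", 100.0),
]

def build_ads(transactions):
    """Aggregate raw transactions into one exemplar per customer."""
    by_customer = defaultdict(list)
    for cust, amount in transactions:
        by_customer[cust].append(amount)
    ads = []
    for cust, amounts in sorted(by_customer.items()):
        ads.append({
            "customer_id": cust,
            "txn_count": len(amounts),                  # feature 1
            "total_spend": sum(amounts),                # feature 2
            "avg_spend": sum(amounts) / len(amounts),   # feature 3
        })
    return ads

for row in build_ads(raw):
    print(row)
```

Each row of the result is an exemplar with features that (in a real project) would be designed to capture the business problem, not just convenient aggregates.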

Model Development and Assessment: The carefully designed and processed ADS is the input to this stage. This stage is where machine learning is applied to process structured, unstructured, or multi-structured data. The results of modeling are assessed against the business goals, and a good solution is selected.
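A toy sketch of the assessment step: score each candidate model on held-out data and select a good one. The candidate "models" here are plain functions and the held-out set is invented, purely to illustrate the selection loop:

```python
def accuracy(model, examples):
    """Fraction of held-out (input, label) pairs the model predicts correctly."""
    correct = sum(1 for x, y in examples if model(x) == y)
    return correct / len(examples)

# Hypothetical held-out set and two candidate "models" (plain functions here).
held_out = [(1, 1), (2, 0), (3, 1), (4, 0)]
candidates = {
    "always_one": lambda x: 1,
    "odd_is_one": lambda x: x % 2,
}

scores = {name: accuracy(m, held_out) for name, m in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])  # odd_is_one 1.0
```

In practice the score would be a business-aligned metric (e.g., expected ROI), not raw accuracy, so that the selected solution matches the goals set in the first phase.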

Model Deployment and Monitoring: Building even the most sophisticated models is only useful when those models can be operationalized and continuously monitored for performance degradation.
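A minimal, hypothetical sketch of such monitoring: compute accuracy over a recent window of scored predictions and flag degradation below a chosen threshold. The window size, threshold, and outcome stream are made up for the example:

```python
def monitor_accuracy(outcomes, window=100, threshold=0.8):
    """Accuracy over the most recent `window` outcomes (True = correct),
    plus a flag indicating degradation below `threshold`."""
    recent = outcomes[-window:]
    accuracy = sum(recent) / len(recent)
    return accuracy, accuracy < threshold

# Hypothetical stream of scored predictions from a deployed model.
outcomes = [True] * 90 + [False] * 10
acc, degraded = monitor_accuracy(outcomes)
print(acc, degraded)  # 0.9 False
```

A real deployment would tie the degradation flag to an alert or a retraining trigger, and would often track several metrics, not just accuracy.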

In the earlier days of data mining, many projects failed because they focused solely on the modeling and the cuteness of the learning algorithm (an academic orientation) rather than on the business value. Many others failed in practice because of a lack of attention to data storage, quality, and preparation for modeling. Of those that passed these hurdles, many failed because they could not be deployed in operations.

Today, many of these early failures have been addressed, and good practitioners evaluate the whole life cycle during the first phase before attempting a solution. The tools used in the process can change and often come from a variety of sources.

With the explosion of data and the popularity of analytics and ML in general, all players in the market are using the terms "Data Science" and "Data Scientist." The data platform vendors (MPP, NoSQL, and Hadoop vendors) use the terms to emphasize the database/data store and basic analytics aspects. Outside the context of big data, I do not consider basic analytics even close to what a data scientist has to do. Startups and big web companies tend to emphasize the programming requirements of a data scientist.

In my opinion, Data Science is mainly a sexier and more appropriate name for "Data Mining Science." "Mining" does not portray the right image because it is generally associated with dangerous, hard manual labor. Also, Data Science as a practice is more focused on creating new products (data products) and tends to sit much higher in the organization's leadership.

In the third piece on this topic, I will try to enumerate the skills required of a data scientist, the skills needed in a data science team, and what needs to be taught in a data science curriculum.

For more detail and in the context of big data impact on data mining, you can read my new book.

In future blogs, I will discuss "my ten commandments" for data mining, i.e., the general principles to be aware of.