High Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data

A personal blog currently dedicated to the promotion of the book titled "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data." The success of new big data technologies in large web companies has created a rush toward understanding the impact of these technologies in classic analytics environments that already employ a multitude of legacy analytics technologies. The blogs provide concepts and excerpts from the book to study this impact.

Thursday, June 12, 2014

Data Mining, Data Science, and Machine Learning (2)

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info ).

A question that often comes up is: "What is the difference between Machine Learning and Data Mining Science (now Data Science)?" Newcomers often confuse these, not to mention the businesses that use these terms to make their products more appealing to customers and investors. I attempt to describe the difference between the two here. For more detail, you can read my new book.

Data Mining Science (the practice is referred to as Data Mining) covers all aspects of creating value out of data, from capture, storage, quality checks, and analysis to deployment and infrastructure. It requires some knowledge of a variety of disciplines including mathematics, machine learning, statistics, programming, and databases. On the other hand, machine learning (ML) is a set of techniques and algorithms that can be applied to data of any type (numeric, text, binary, ...) once it is placed into the right form, meaning a set of exemplars (or patterns or observations or records), each with a set of features that captures the problem of interest. ML (very close to pattern recognition) is a branch of artificial intelligence in computer science that focuses on the general problem of "learning from data" or "learning by examples."

Machine learning and pattern recognition are very close in principle, and both are established fields that have been taught in top universities for a long time. Both try to address the problem of learning, with the difference that ML has its roots in computer science while pattern recognition has its roots in engineering. Statistical Learning Theory is a sub-field of ML that focuses on formalizing the problem of learning, that is, on the theoretical aspects. Statistical learning theory explains why machine learning techniques in general work, something practitioners had already experienced through empirical results. Traditional statisticians have always been skeptical of machine learning applications, even though nowadays these techniques are a part of many statistical toolsets.

The fundamental principle that distinguishes machine learning from traditional statistics is that it makes few or no prior assumptions about the problem, including the data distributions. The roots of traditional statistics go back to a time when there were no computers, and analyzing any data centered on data reduction and simplification (sampling, variable reduction, linearity assumptions, ...). These often required making restrictive assumptions about the problem and the data distributions.

Historically, ML techniques had to solve much more challenging problems. As such, they do not make prior assumptions about the problem; instead, they use the power of the computer to search and optimize for "a good" solution, often using heuristics. In other words, traditional statistics forces severe assumptions on the data to get its best solution, but that solution is best only if those assumptions are correct. For much real-world machine- and human-generated data, those assumptions are rarely correct, so the resulting solutions can be mediocre at best. ML generally does not seek or claim to find the best possible solution. Given the complexity of the problems, the best solution, even if it exists, is often not achievable or not worth the time and resources needed to find it.

Another fundamental pillar of machine learning is the concept of "generalization" and the trade-off between accuracy and robustness. ML addresses this empirically, by partitioning the data into training, validation, and test sets, while traditional statistics relies on statistical tests on the training data (coupled with tight initial assumptions).
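As a minimal sketch of the training/validation/test idea, the snippet below fits a simple nearest-neighbor averager on synthetic data. The data, the split ratios, the candidate values of k, and all function names are assumptions made up for this illustration; they are not taken from the book. The training set supplies the model, the validation set picks the complexity (k), and the test set is touched only once to estimate how well the chosen model generalizes.

```python
import random

def split(records, train=0.6, valid=0.2, seed=0):
    """Shuffle the records and cut them into train/validation/test partitions."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def knn_predict(train_set, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train_set, key=lambda r: abs(r[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train_set, eval_set, k):
    """Mean squared error of the k-NN averager on eval_set."""
    errors = [(knn_predict(train_set, x, k) - y) ** 2 for x, y in eval_set]
    return sum(errors) / len(errors)

# Hypothetical data: y = x^2 plus Gaussian noise.
rng = random.Random(42)
data = []
for _ in range(300):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, x * x + rng.gauss(0.0, 0.1)))

train_set, valid_set, test_set = split(data)

# Model selection uses only the validation set.
candidates = [1, 3, 5, 9, 15, 31]
best_k = min(candidates, key=lambda k: mse(train_set, valid_set, k))

# The test set is used once, for the final generalization estimate.
test_error = mse(train_set, test_set, best_k)
print("best k:", best_k, "test MSE:", round(test_error, 4))
```

Note that with k = 1 the error measured on the training set itself is essentially zero, because every point is its own nearest neighbor. That is exactly why accuracy on training data says little about robustness on new data.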

As a practice, data mining (and data science) has four phases of equal importance:

(1) Business problem understanding

(2) Data understanding and preparation

(3) Model development and assessment

(4) Deployment and monitoring.

Business Understanding: Data mining starts from a full understanding of the business problem, where both business domain knowledge and data mining knowledge have to be leveraged. The final result of this process is an ROI expectation and a formulation of the business problem as a data mining problem.

Data Understanding and Preparation: Understanding the data does not need any explanation. By definition, it requires proper storage, low-level quality checks, and access to "all relevant data" for the desired business problem. Data preparation covers all aspects of data manipulation that will be required for "model development." Sometimes the data preparation is minimal, but very often it requires sophisticated and innovative ways of converting "raw relevant data" into a so-called "analytic data set" (or ADS) that represents the business problem of interest well and can be fed into the algorithms of interest. An ADS is a set of exemplars, each with a set of features that capture the problem of interest.
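To make the idea of an ADS concrete, here is a small sketch that rolls raw transaction records up into one exemplar per customer. The field names, the sample records, and the chosen features are hypothetical and exist only for this illustration; a real ADS depends entirely on the business problem at hand.

```python
from collections import defaultdict

# Hypothetical "raw relevant data": one record per transaction.
raw_transactions = [
    {"customer": "c1", "amount": 20.0, "channel": "web"},
    {"customer": "c1", "amount": 35.0, "channel": "store"},
    {"customer": "c2", "amount": 5.0, "channel": "web"},
    {"customer": "c2", "amount": 7.5, "channel": "web"},
    {"customer": "c2", "amount": 2.5, "channel": "web"},
]

def build_ads(transactions):
    """Convert raw transactions into one exemplar (row of features) per customer."""
    grouped = defaultdict(list)
    for t in transactions:
        grouped[t["customer"]].append(t)

    ads = []
    for customer, rows in sorted(grouped.items()):
        amounts = [r["amount"] for r in rows]
        ads.append({
            "customer": customer,
            "n_transactions": len(rows),
            "total_amount": sum(amounts),
            "mean_amount": sum(amounts) / len(amounts),
            # Share of this customer's transactions made on the web channel.
            "web_share": sum(r["channel"] == "web" for r in rows) / len(rows),
        })
    return ads

ads = build_ads(raw_transactions)
for exemplar in ads:
    print(exemplar)
```

Each dictionary in `ads` is one exemplar, and its keys (other than the identifier) are the features a learning algorithm would consume.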

Model Development and Assessment: The carefully designed and processed ADS is the input to this stage. This stage is where machine learning is applied to process structured, unstructured, or multi-structured data. The results of modeling are assessed against the business goals, and a good solution is selected.

Model Deployment and Monitoring: Even the most sophisticated models are useful only when they can be operationalized and continuously monitored for performance degradation.

In the earlier days of data mining, many projects failed because they focused solely on modeling and the cuteness of the learning algorithm (an academic orientation) rather than on business value. Many others failed in practice because of a lack of attention to data storage, quality, and preparation for modeling. Of those that passed these hurdles, many failed because they could not be deployed in operations.

Today, many of these early failures have been addressed, and good practitioners evaluate the whole life cycle in the first phase before attempting a solution. The tools used in the process can change and often come from a variety of sources.

With the explosion of data and the popularity of analytics and ML in general, all players in the market are using the terms "Data Science" and "Data Scientist." The data platform vendors (MPP, NoSQL, Hadoop vendors) use the terms to emphasize the database/data store and basic analytics aspects. Outside the context of big data, I do not consider basic analytics even close to what a data scientist has to do. Startups and big web companies may place more emphasis on the programming skills required of a data scientist.

In my opinion, Data Science is mainly a sexier and more appropriate name for "Data Mining Science." "Mining" does not portray the right image because it is generally associated with dangerous and hard manual labor. Also, Data Science as a practice is more focused on creating new products (data products) and tends to sit at a much higher level in the organization's leadership.

In the third piece of this topic, I try to enumerate what skills are required for a data scientist, what skills one needs in a data science team, and what needs to be taught in a data science curriculum.

For more detail and in the context of big data impact on data mining, you can read my new book.

In future posts, I will discuss "my ten commandments" for data mining, i.e., the general principles to be aware of.

Monday, June 9, 2014

Analytics Maturity


Like many new technology terms in the last two decades, "analytics" has been used and abused in different contexts. Depending on their background, data professionals have different views of what analytics is. Those who come from IT, database, reporting, or business analysis backgrounds will consider any manipulation of data that generates reports, aggregated results, or data slicing and dicing to be analytics. Those who come from data mining, machine learning, or statistical modeling backgrounds will consider only the application of sophisticated algorithms to the data to be analytics.

The good news is that this battle has already been fought and settled. Today, there is a clear distinction between these two interpretations of analytics. The former is called basic analytics (low-end or spreadsheet-style analysis), often described as looking in the rear-view mirror. The latter is referred to as advanced analytics (high end), where the goal is to apply sophisticated techniques to past data to make predictions about the future. With big data, even basic analytics such as simple aggregation and tabulation of the data for reporting becomes a challenging task if response time is of any concern. Those who focus on basic analytics tasks are like journalists, while those who focus on advanced analytics may be called innovators. Both types of analytics are essential to the well-being of a business.

The fundamental principle is that an organization cannot transition into the advanced analytics era if it has not already mastered basic analytics applications. In other words, basic analytics is a prerequisite for advanced analytics, and both depend on a solid data management infrastructure. This so-called analytics maturity is discussed at length in my book in different contexts (including the big data context), and assessing it is necessary prior to any effort to augment a firm's analytics capabilities.

Friday, June 6, 2014

Data Mining, Data Science, and Machine Learning (1)


People often ask me about data science and its relationship with data mining science (often just referred to as data mining) and machine learning. In my book, I provide a viewpoint on this topic. For those like me who have spent their entire careers developing and promoting data mining, machine learning, and data analytics, “data science” is nothing but a new term.

The use of machine learning and data mining to create value from corporate or public data is nothing new. It is not the first time that these technologies have been in the spotlight. Many remember the late ’80s and the early ’90s, when machine learning techniques, in particular neural networks, had become very popular. Data mining was on the rise. There was talk everywhere about advanced analysis of data for decision making. Even the popular android character in “Star Trek: The Next Generation” had been named appropriately as “Data.”

Data mining science has been the cornerstone of many applications for more than two decades, e.g., in finance and retail. However, the popularity of web products from the likes of Google, LinkedIn, Amazon, and Facebook has helped analytics become a household name. While a decade ago the masses did not know how their detailed data were being used by corporations for decision making, today they are fully aware of that fact. Many people, especially the millennial generation, voluntarily provide detailed information about themselves. Today people know that any mouse click they generate, any comment they write, any transaction they perform, and any location they go to may be captured and analyzed for some business purpose.

All these have contributed to finally bringing analytics to the forefront of many conversations, even among regular people. A decade ago, we could not comfortably tell customers how we anonymously analyze their detailed transactions in real time to protect them from fraud (see Chapter 9 of the book).

Big Data Analytics Confusing Landscape


Any new technology brings its own new buzzwords and lots of promises. It becomes really confusing for outsiders to assess its relevance and to separate noise from reality. In my book, I provide an objective and holistic view of big data technologies and their impact in classic organizations. Leveraging my 20 years of experience in advanced analytics and machine learning, I have tried to cover many topics that are relevant to anybody who is interested in getting into this field.

Who is my audience for my new book?


I wrote this book for a variety of audiences (See here for details). Most importantly, there are many people in the technology, science, and business disciplines who are curious to learn about big data analytics in a broad sense, combined with some historical perspective. They may intend to enter the big data market and play a role. For this group, the book provides an overview of many relevant topics on the subject.

Thursday, June 5, 2014

Hadoop Summit 2014 Takeaways


Notable Trends from Hadoop Summit 2014 #HadoopSummit

- FYI: Hortonworks and Yahoo! are the hosts of this event. It has been held at the San Jose Convention Center for the last four years (I think) and attracted 3200 people this year.

- As a sign of maturity of Hadoop, one can see the list of Diamond and Platinum sponsors that include many heavyweights such as AT&T, Microsoft, SAP, Teradata, Cisco, IBM, Informatica, Oracle, SAS, and VMware.

- With YARN finally released at the end of 2013, expectations were high for Hadoop maturity stories and experiences in classic environments. The stories and experiments that were shared show that Hadoop with YARN is moving along its maturity curve, and that is good news for classic organizations.

- Many tools in the Hadoop ecosystem are still low level, but vendors continue the trend of making the tools for integration and use more efficient and higher level. For classic organizations, this will be the way to spread Hadoop use beyond storage and ETL.

- With Yahoo! being a host, there were lots of presentations from them regarding various aspects of Hadoop 2.0 usage at Yahoo!

- One thing I have found is that there is no "single big data technology" that solves most of the business problems out there, especially for classic organizations that have invested heavily in legacy big data analytics technologies. For now, most business problems will require more than one big data technology. (Good to read my book on the subject when it is out.) Many participants from classic organizations in banking, telco, retail, and others emphasized this as their experience playing with Hadoop. Remember that Hadoop newcomers used to call it the "holy grail", but those days are over. What we will see is more convergence and integration rather than pure replacement.

- Queries using MapReduce (MR) are slow. There are now many solutions on Hadoop that address "low latency" queries for big data, such as Impala (Cloudera), Drill (MapR), Shark (Spark), Presto (Facebook), HAWQ (Pivotal), and Apache Tez. All these make things more confusing, and there will be no answer on which is the right way for some time to come. There were a few presentations focusing on Tez and how it improves performance for Pig and Hive.

- Spark-on-YARN enables Spark applications to run on Hadoop with no need to create a new cluster. Yahoo! shared some early results. Also, there is SparkR from UC Berkeley, which promises interactive analysis of large data in parallel from the R shell.

- There were new use cases ranging from the travel industry and consumer electronics peripherals to car sales, showing Hadoop traction in many more industries.

- Hadoop security is coming along, but more work still needs to be done. Check out Apache Sentry, Apache Knox, and Project Rhino.

- Regarding the cost of ownership of a Hadoop cluster, what I have found is that one should always start with a public cloud implementation for discovery and exploration, and even for the earlier stages of operations. At some point, depending on application-specific factors, an on-premise implementation becomes more cost effective and provides better ROI. For applications in which data security is of high importance, on-premise from the start is the way to go for now.