Saturday, July 30, 2016

Ten General Principles in Data Mining/Science (Focus on Model Development)

For more information and orders, visit the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" ( ).

Over the years, working with different clients and applications, I have found a set of general data mining principles that also hold true in the context of big data. These are all listed in my book. Here I enumerate them using the terminology I have used in the book:

1. Using “all the data” is not equivalent to building the deepest Analytics Dataset (ADS) in terms of the number of rows.
2. Given a choice between more data and a fancier algorithm, choosing “more data” at the top of the data preparation funnel is always preferred.
3. For most problems, when using the same learning algorithm, smart variable creation on a proper sample of data (a sampled ADS) outperforms the use of the deepest ADS with primary and/or rudimentary (not well-thought-out) attributes.
4. For most problems, smart variable creation combined with a simple algorithm outperforms a mediocre variable creation exercise combined with the fanciest algorithm, whatever the ADS row size.
5. For some classes of algorithms, a smart ADS combined with a smart presentation of its variables to the learning algorithm on sampled data outperforms a smart ADS with the largest number of rows but without proper presentation of variables.
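Principles 3 through 5 hinge on variable creation and on how variables are presented to the learner. As a minimal sketch of “presentation” (the transforms and toy data here are illustrative, not a prescription from the book): a heavily skewed raw attribute can be log-transformed and standardized so that distance- and gradient-based learners see comparable scales across columns:

```python
import numpy as np

def present_variables(X):
    """Prepare raw ADS columns for a learning algorithm:
    log-transform skewed amounts and standardize each column."""
    X = np.log1p(X)                      # compress long right tails
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0              # guard against constant columns
    return (X - mu) / sigma

# toy ADS: two skewed "amount"-style columns on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.lognormal(0, 1, 1000),
                     rng.lognormal(5, 2, 1000)])
Z = present_variables(X)
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))
```

After this step, both columns have zero mean and unit variance, so neither dominates the learner merely because of its scale.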

6. In many problems, one has to deal with transactional data, which requires creating time-based variables that give the learner a short- and long-term memory of the past behavior of the entities being modeled. Depending on the problem, such variables need to be computed and updated from the transaction history of each entity, for every transaction, either in real time or at specific time intervals.
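As a simplified sketch of such memory variables, assuming exponentially decayed averages per entity (the decay factors, entity id, and amounts are invented for the example, not taken from the book):

```python
from collections import defaultdict

class BehaviorMemory:
    """Maintain short- and long-term decayed averages of transaction
    amounts per entity (e.g., per card or customer). Updated once per
    transaction, so it fits a streaming / real-time setting.
    Note: starting the averages at 0 biases the first few updates;
    acceptable for a sketch."""
    def __init__(self, short_decay=0.5, long_decay=0.95):
        self.short = defaultdict(float)
        self.long = defaultdict(float)
        self.short_decay = short_decay
        self.long_decay = long_decay

    def update(self, entity, amount):
        s, l = self.short_decay, self.long_decay
        self.short[entity] = s * self.short[entity] + (1 - s) * amount
        self.long[entity] = l * self.long[entity] + (1 - l) * amount
        return self.short[entity], self.long[entity]

mem = BehaviorMemory()
for amount in [10, 12, 11, 500]:        # a sudden large transaction
    short, long_ = mem.update("card_42", amount)
print(round(short, 1), round(long_, 1))
```

The ratio of the short-term to the long-term memory is the kind of derived variable that flags a sudden shift in an entity's behavior, which is exactly what a fraud model wants to see.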

7. For a fixed model complexity, as the number of rows (observations) in the ADS increases, the training and test errors converge.
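This convergence can be seen in a toy experiment (the model, data, and sizes are stand-ins invented for illustration): fit a fixed-complexity model, here a degree-3 polynomial, on increasing numbers of training rows and compare train and test error:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(m):
    """Draw m noisy observations of a fixed underlying signal."""
    x = rng.uniform(-1, 1, m)
    return x, np.sin(3 * x) + rng.normal(0, 0.3, m)

def train_test_mse(n, degree=3):
    x_tr, y_tr = sample(n)          # n training rows of the "ADS"
    x_te, y_te = sample(5000)       # large held-out test set
    coefs = np.polyfit(x_tr, y_tr, degree)
    mse = lambda x, y: float(np.mean((np.polyval(coefs, x) - y) ** 2))
    return mse(x_tr, y_tr), mse(x_te, y_te)

gaps = {}
for n in [20, 200, 20000]:
    tr, te = train_test_mse(n)
    gaps[n] = te - tr               # train/test gap shrinks with n
    print(n, round(tr, 3), round(te, 3))
```

With few rows, the model memorizes noise and the test error sits well above the training error; with many rows the two settle toward the same value, determined by the model's complexity and the noise level.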

8. For some problems, the entire population must be represented in the ADS (e.g., social network analysis, long-tail problems, high-cardinality recommenders, search). For all other problems, sampling continues to be valid. For a subset of these problems, sampling is mandatory (e.g., highly unbalanced datasets, segmented modeling, micro-modeling, and campaign groups). For the remainder, it is optional and no longer a limiting factor; historically, sampling had to be done for these problems to speed up processing or to reduce storage cost.
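For the highly unbalanced case, a minimal sampling sketch (the 5:1 ratio, field name, and counts are illustrative, not recommendations): keep every rare positive, downsample the negatives, and carry a weight so model scores can be recalibrated afterwards:

```python
import random

def downsample_majority(rows, label_key, ratio=5, seed=7):
    """Build a modeling sample for a highly unbalanced dataset:
    keep all rare positives and at most `ratio` negatives per positive.
    Returns the sample and the weight that restores the negatives'
    original prevalence for score calibration."""
    random.seed(seed)
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    neg_sample = random.sample(neg, min(len(neg), ratio * len(pos)))
    sample = pos + neg_sample
    random.shuffle(sample)
    weight = len(neg) / max(len(neg_sample), 1)
    return sample, weight

# toy dataset: 10 positives (e.g., fraud) among 1,000 rows
rows = [{"y": 1}] * 10 + [{"y": 0}] * 990
sample, w = downsample_majority(rows, "y")
print(len(sample), round(w, 2))
```

The model trains on 60 rows instead of 1,000, and the weight of 19.8 on sampled negatives lets the scores be mapped back to the true class prior.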

9. For big data, it is desirable to use the same platform and interface for data understanding, preparation, and model development, with minimal data movement and the fewest iterations through the data to get to the result.

10. In the transition from model development to deployment, automatic code generation for the computation of variables and models is of high importance to ensure quality control. Automatic code generation is mandatory in applications that require a large number of models.

The applications have been very diverse:
- Character recognition (hand-print English or cursive-print languages), involving state-of-the-art image processing and text/character segmentation capabilities, and the use of neural networks with multiple layers, hundreds of nodes, and tens of thousands of weights (as an example, see Machine-printed Arabic OCR),
- Real-time payment card fraud detection (see here),
- Segmentation of buyers and sellers at a large web auction site using both supervised and unsupervised techniques,
- Building models on the fly for each SKU (millions of them) based on store characteristics for one of the largest retailers,
- And tens of other applications in customer experience, risk, marketing, and fraud. Whatever machine learning or pattern recognition technique was used, the principles above always held true.


Saturday, July 16, 2016

End of the Free Lunch for Analytics and Data Mining Software

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" ( )

There has been an interesting phenomenon that is known to many as “What Andy giveth, Bill taketh away.” It means that every time Andy Grove (CEO of Intel) brought a new chip to market, Bill Gates
would soak up the new chip’s power to upgrade his software. For thirty years, Moore’s Law continually enabled software performance to improve, often incrementally and, once in a while, drastically. Many types of applications have enjoyed regular and free performance gains during
this time, without the need for rewriting old code or doing anything special. Many other applications have been created or have evolved by providing new powerful features made possible only because of faster hardware. By the mid-2000s, this so-called “free lunch for software” came to an end.

During this long period, software engineers simply did not need to account for parallelism and concurrency in their programs or algorithms to gain speed. They just relied on the expected 50 percent yearly processor speedups made possible by Moore’s Law. However, by the early 2000s, a new reality started to emerge, one that created a crisis. In the new software world, not all software will enjoy faster year-over-year performance gains, as has been the case in the past. Going forward, only concurrent and parallel software benefits greatly from hardware advancements, while single-threaded software stays behind. As mentioned earlier, writing parallel and concurrent software requires the use of different programming models, such as multithreading and distributed computing, which are more complex.
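As a minimal illustration of that shift (the function name and workload are invented for the example), the same scoring loop expressed first sequentially and then through a worker pool:

```python
from concurrent.futures import ThreadPoolExecutor

def score(record):
    """Stand-in for a CPU-bound task such as scoring one record."""
    return sum(i * i for i in range(record))

chunks = [10_000] * 8

# Sequential version: uses one core no matter how many are available.
serial = [score(c) for c in chunks]

# Concurrent version: the same work expressed so a pool can spread it out.
# (A thread pool is shown for brevity; CPU-bound Python work would use a
# process pool, since the GIL serializes threads in one interpreter.)
with ThreadPoolExecutor(max_workers=4) as pool:
    concurrent = list(pool.map(score, chunks))

print(serial == concurrent)
```

The results are identical; what changes is that the pooled version can actually use the extra cores newer hardware provides, which is precisely the restructuring the end of the free lunch demands.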

This is not the first time that software has faced a big challenge. The first software challenge happened in the ‘60s and ‘70s, when developers were writing assembly language programs. As computers grew in power, memory capacity, and variety, there was a need for portability and some level of abstraction from the hardware. This was solved through high-level procedural languages such as C and FORTRAN, which were very efficient.

In the ‘80s and ‘90s, another challenge emerged: composing and maintaining large software. These were projects that involved tens or hundreds of programmers and where millions of lines of code had to be managed. Commercial and government entities such as Microsoft, while developing MS Word, and the US Department of Defense (DoD) faced such challenges early on. Software performance was not the main issue, since Moore’s Law had proven that it could be dealt with confidently. Object-oriented programming using C++, and then Java and C#, made large software development easier. Hundreds if not thousands of people could work on a single application simultaneously.

The third challenge is what we face now. The end of single-threaded performance scaling has created a major shift to symmetric parallelism. Until recently, programmers were used to not being concerned with hardware; they believed that, as usual, Moore’s Law would continue to take care of them. Now, achieving performance gains requires software developers to write programs differently and more intelligently to exploit parallelism, distributed computing, distributed data, and modern hardware architectures.

Though the end of the free lunch applies to all software, it is more profound in the world of data mining, machine learning, and analytics. See the book "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" ( ) for more information.