Saturday, July 30, 2016

Ten General Principles in Data Mining/Science (Focus on Model Development)

For more information and orders, visit the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data".

Over the years, working with different clients and applications, I have found a set of general data mining principles that also hold true in the context of big data. They are all listed in my book. Here I enumerate them using the terminology of the book:

1. Use of “all the data” is not equivalent to building the deepest Analytics Dataset (ADS) in terms of the number of rows.

2. Given a choice between more data and a fancier algorithm, choosing “more data” at the top of the data preparation funnel is always preferred.

3. For most problems, when using the same learning algorithm, smart variable creation on a proper sample of data (a sampled ADS) outperforms the use of the deepest ADS with primary and/or rudimentary (not well thought out) attributes.

4. For most problems, smart variable creation combined with a simple algorithm outperforms a mediocre variable creation exercise combined with the fanciest algorithms, whatever the ADS row size.

5. For some classes of algorithms, a smart ADS combined with a smart presentation of its variables to the learning algorithm on sampled data outperforms a smart ADS with the largest number of rows without proper presentation of variables.

6. In many problems, one has to deal with transactional data, which requires creating time-based variables that give the learner a short- and long-term memory of the past behavior of the entities being modeled. Depending on the problem, such variables need to be computed and updated from the transaction history of each entity, either for every transaction in real time or at specific time intervals.

7. For a fixed model complexity, as the number of rows (observations) in the ADS increases, the training and test errors converge.

8. For some problems, the entire population must be represented in the ADS (e.g., social network analysis, long-tail problems, high-cardinality recommenders, search). For all other problems, sampling continues to be valid. For a subset of these problems, sampling is mandatory, e.g., highly unbalanced datasets, segmented modeling, micro-modeling, and campaign groups. For the remainder, sampling is optional: it is no longer forced by platform limits, whereas historically it had to be done to speed up processing or to reduce storage cost.

9. For big data, it is desirable to use the same platform and interface for data understanding, preparation, and model development, with minimal data movement and the fewest passes through the data to get to the result.

10. In the transition from model development to deployment, automatic code generation for computation of variables and models is of high importance to ensure quality control. Automatic code generation is mandatory in applications that require a large number of models.
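
As an illustration of principle 6, here is a minimal sketch of time-based variables that are updated on every transaction. It is not the method from the book: it assumes exponentially decayed sums as the "memory", with a short and a long half-life, and the names `DecayedSum`, `add_memory_variables`, and the half-life values are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DecayedSum:
    """Exponentially decayed running sum for one entity (hypothetical helper)."""
    half_life: float                  # same time unit as the timestamps
    value: float = 0.0
    last_time: Optional[float] = None

    def update(self, time: float, amount: float) -> float:
        # Shrink the stored value for the time elapsed, then add the new amount.
        if self.last_time is not None:
            self.value *= 0.5 ** ((time - self.last_time) / self.half_life)
        self.value += amount
        self.last_time = time
        return self.value


def add_memory_variables(transactions, short_half_life=1.0, long_half_life=30.0):
    """Attach short- and long-term decayed spend to every transaction.

    transactions: iterable of (entity_id, time_in_days, amount), sorted by time.
    The half-lives (1 day and 30 days) are assumptions for illustration only.
    """
    profiles = {}
    rows = []
    for entity, time, amount in transactions:
        if entity not in profiles:
            profiles[entity] = (DecayedSum(short_half_life),
                                DecayedSum(long_half_life))
        short, long_ = profiles[entity]
        rows.append({
            "entity": entity,
            "time": time,
            "amount": amount,
            "short_term": short.update(time, amount),
            "long_term": long_.update(time, amount),
        })
    return rows
```

Each entity carries only two numbers per variable (the decayed value and the last timestamp), so the update can run per transaction without rescanning the history. A learner can then compare the short-term value against the long-term one to see whether recent behavior departs from the entity's usual pattern.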

The applications have been very diverse:
- Character recognition (hand-printed English or languages with cursive print) involving state-of-the-art image processing and text/character segmentation capabilities and the use of neural networks with multiple layers, hundreds of nodes, and tens of thousands of weights (as an example, see Machine-printed Arabic OCR),
- Real-time payment card fraud detection (see here),
- Segmentation of buyers and sellers at a large web auction site using both supervised and unsupervised techniques,
- Building models on the fly for each SKU (millions of them) based on store characteristics for one of the largest retailers,
- And dozens of other applications in customer experience, risk, marketing, and fraud.

Whatever machine learning or pattern recognition technique was used, the principles above always held true.
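
Payment card fraud, one of the applications listed above, is a typical highly unbalanced problem of the kind that principle 8 says must be sampled. The sketch below shows one common way to do it, assuming a simple scheme that keeps every positive row and a random fraction of the negatives. The function name `downsample_majority`, the `is_fraud` label, and the `weight` column are hypothetical, and this is not necessarily the sampling scheme used in the projects described here.

```python
import random


def downsample_majority(rows, label_key="is_fraud", keep_rate=0.01, seed=0):
    """Keep every positive row and a random fraction of the negative rows.

    Each kept row gets a 'weight' equal to the inverse of its sampling
    probability, so weighted counts still estimate the full population.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    rng = random.Random(seed)          # fixed seed for a reproducible sample
    sample = []
    for row in rows:
        if row[label_key]:
            # Rare class: always kept, represents only itself.
            sample.append({**row, "weight": 1.0})
        elif rng.random() < keep_rate:
            # Majority class: kept with probability keep_rate.
            sample.append({**row, "weight": 1.0 / keep_rate})
    return sample
```

Carrying the weight along lets a model, or a simple report, recover population-level rates from the sampled ADS instead of the inflated rates seen in the sample.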
