Friday, September 19, 2014

Data Mining, Data Science, and Machine Learning (3)

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info)

Experiences in Building Data Science Teams

By education, I was trained in Intelligent Systems, with a focus on Machine Learning and Pattern Recognition. Consequently, for those of us who have spent our entire careers developing and promoting data mining science, machine learning/pattern recognition, and analytics, "data science" is nothing but a new term. As mentioned, in essence it is the same as data mining science, with a few twists. There is no doubt that the popularity of the term is directly related to the attention big data analytics has received in recent years, especially due to the success of the big web companies. The popularity of these companies has finally brought the importance of analytics to the attention of the masses.

Prior to this recent wave, these technologies were already an integral part of many things in our lives, without our noticing. However, they were never publicized on the scale they are today. The most notable commercial applications of machine learning have been around since the late 1980s and early 1990s: real-time payment card fraud detection, hand-print recognition (US mail, checks, ...), and recommender systems. A startup I worked for, HNC (HNC Software, acquired by FICO), was the pioneer in the first two. The core of HNC's business from the early 1990s was to get data from corporations on their customers and develop intelligent solutions leveraging machine learning and advanced analytics on that data. That was the core value the company provided and the reason for its valuation. Due to the big data craze and the publicity analytics has finally received, many companies today try to do the same on any data they can find, public or private. And with the explosion of human- and machine-generated data, there is a lot more that can be done.

Data science is "the practice of extracting insights from data of any size or variety, using a multitude of disciplines and technologies for the purpose of creating new data products and services or improving the existing ones."

As a part of four startups, I have spent my whole career developing and promoting novel applications of machine learning and pattern recognition for real-world problems such as hand-print recognition, control of non-linear systems, real-time behavioral fraud detection, and many more. At some of these companies, I had to hire and build data science teams with the right skill sets. I also worked with a couple of local universities in the late 1990s, when there was little demand for these skill sets and not much publicity. I am currently advising a university on establishing an undergraduate Data Science program, based on the experience of all these years working in the field.

Here are the skill sets that make a perfect data scientist. Keep in mind that few people possess all of them, given the multidisciplinary nature of the field. That is why it is an important managerial task to hire a data science team that collectively addresses the immediate needs of the business:

(1) Passion, love, and patience for data (often imperfect data), including all it takes (if necessary) to identify all sources of data and to collect and validate the data for the prototype system.
(2) Deep knowledge of machine learning (or pattern recognition) and statistical modeling. These provide solid ground for quantitative analysis. Real-world experience using these is sometimes a must.
(3) Good computational skills and knowledge of the main programming principles - programming experience in at least a couple of languages (one third-generation and one fourth-generation language).
(4) Solid foundation in mathematics, including linear algebra, numerical analysis, and probability theory (Bayes).
(5) Business acumen - focus on data and analytics applications that provide high impact to the business (creating data products and services).
(6) Inquisitive (able to ask questions, challenge assumptions, validate thoughts/ideas, ...) and pragmatic (there are no perfect solutions; good is often best given the time and resources).
(7) Some working knowledge of databases or newer data stores. A solid background in the fundamentals of computing, including high-level architectures, is a plus.
(8) Ability to communicate findings to business, peers, and customers (not for everybody).
(9) Ability to formulate a business problem as a data mining problem (four phases discussed earlier) and execute.
(10) Problem solver + statistical thinker.

It is important to know that tools, packages, and platforms change. As long as a person has the core traits and skills, it should be possible to adapt to new tools and platforms if necessary, though this is not true of everyone.

Here are some real-world observations. Obviously, I could never hire anybody with all of these traits directly from universities. But the successful hires from universities were those with a computer science background and a focus on machine learning or a similar field. They were the easiest to train and develop. Applied statisticians (or applied physicists, or people from other analytics disciplines) needed to be developed in two areas: one was the programming/computation dimension, and the other was the acceptance of machine learning techniques and approaches (sometimes at odds with what they had traditionally learned in their fields). Statisticians usually had a narrow set of computational skills and preferred to use a single tool such as SAS. Software engineers (with an interest in scientific programming) were not a good choice for a data science position, since much more had to be done to develop and train them.

At the end of the day, our data science team was a combination of the following:
(1) A sub-group of people who combined more of the skill sets above,
(2) A sub-group with a statistics focus and bias (but with exposure to machine learning techniques, which are now somewhat a part of the statistical tool set),
(3) A sub-team with data collection/manipulation/validation skills whose members could grow into data scientists,
(4) Software engineers focused on implementation, especially productionizing the analytics process discovered. They would also help with prototyping when it required integration of tools and systems.
(5) Project managers to coordinate efforts.

The business acumen and customer-facing skill sets take training and development. Some people simply preferred the technology and the back office; that was in their nature.

Almost without exception, the data aspects of the data scientist job (collection, validation, manipulation, ...) are learned and practiced in a commercial setting. They are never taught in any university program, and this surprises people who come from academia every time.
