A personal blog currently dedicated to the promotion of the book titled "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data." The success of new big data technologies in large web companies has created a rush toward understanding the impact of these technologies in classic analytics environments that already employ a multitude of legacy analytics technologies. The posts provide concepts and excerpts from the book to study this impact.
Wednesday, October 19, 2016
What is DS-BuDAI?
Data science (covering data mining and related practices) is a multidisciplinary field that requires knowledge of a number of different
skills, practices, and technologies, including but not limited to machine learning,
pattern recognition, mathematics, programming, algorithms, statistics, and
databases. In the context of big data, more skills and knowledge are required, such as knowledge
of distributed computing techniques/algorithms and architectures. By nature, data science is a creative process that combines science, engineering, and art. Hence its success has depended more on the quality
and the experience of the team that has been carrying it out. Thus in the past, for some time,
data mining projects were not repeatable with the same level of success across different
enterprises. However, with the maturity of the practice, that has changed.
Since the late 1990s there have been a variety
of efforts to create standard methodologies and process models for data mining,
such as CRISP-DM (Wirth and Hipp 2000). In this methodology
there is an important focus on business, data, and deployment aspects, as well as
the modeling, which used to be the main focus. Today, data science practices are more mature
and well tested. Even though different methodologies may use different
names for each step of the process, in general, I can logically divide any data science exercise into four phases (See Figure below):
Business Problem Understanding/Use,
Data Understanding/Use and Preparation,
Analytics and Assessment,
Implementation (Deployment and Monitoring).
In the context of big data, these logical phases stay the same; however, some low-level details of data preparation, analysis, and implementation may be impacted.
We all love acronyms, and I have been using DS-BuDAI to refer to this process when communicating with business sponsors and users. The lowercase "u" stands for "Understanding/Use" and emphasizes its importance during the Business and Data focused phases. Sitting between the B and the D, it bridges the two. Analytics and Implementation are simply realizations of the data science deliverable.
The "Understanding" part needs no explanation, especially in the context of the business problem and the data that are going to be addressed and leveraged in the effort. "Use," however, needs a bit of explanation given some recent experiences.
A DS project must start with a full understanding of the business challenge and how it could be solved by leveraging data sources that are available or can be obtained. However, there can be cases where, after everything is done and the value is proven, the business users are still not willing to act on the new insights. This lack of responsiveness has a lot to do with the culture of the organization, how decisions have historically been made, and the marginal improvement the new actions will bring. These obstacles, however, can be overcome with education, training, and the full support of senior management for change.
In some cases, though, actionable insights are perceived by business users as "this is what we already knew" and "it is good that the data analysis confirms that." They are basically saying that there are no novel findings, only a confirmation of what is already known. Sometimes there is truth to this perception, but at other times it is simply resistance to change or to accepting changes in practice.
In the context of data, "Use" is also essential. Collection, storage, preparation, and management of big data are still expensive, no matter how much costs have dropped in recent years with the advent of open source systems and cheaper storage and processing systems. Data can easily be abused or misused. Sometimes too much data is used, and sometimes data is not used at the right level of detail or aggregation.
The lowercase "u" in DS-BuDAI emphasizes understanding and use during the business and data focused phases.
Originally published on 10/19/16, 11:47 AM Pacific Standard Time
"Data Science" is nothing new except the term itself and the level of recent interest in it.