Saturday, November 1, 2014

Somewhere, Something Incredible is Waiting to be Known [1]

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info)

It is projected that global data generated will grow 40% per year, versus 5% growth in global IT spending (McKinsey Global Institute 2011). From a cost perspective, this divergence clearly shows why big data technology solutions must be cost-effective today in order to stay within existing IT budgets and forecasts.

However, the real value of any technology is ultimately defined by its business ROI and not its cost. This is particularly true for classic organizations that are exploring new big data technologies. From an economic and business standpoint, the potential value creation from big data opportunities has already been proven by numerous web companies. In many cases this value creation has been disruptive.

Borrowing from the Carl Sagan quote above, the curiosity to find something incredible, which in this case is also of potentially high monetary value, is the driving force behind the acceptance of big data analytics in big classic organizations. As the cost of these technologies decreases and they become easier to employ, the investment required to capture the potential ROI will be easier to justify. These organizations are not quite there yet, but they are surely on the path to get there.

------------------------------------
[1] Quoted from Carl Sagan.

Friday, October 24, 2014

"The Pale Blue Dot" Effect and Big Data


Many of you may have heard of the “Pale Blue Dot,” a photograph of planet Earth taken in 1990 by the Voyager 1 spacecraft as it was leaving the solar system.[1] The picture was taken from a distance of about 3.7 billion miles from the earth. In the photograph, the earth with all its magnificence (i.e., life) appears as only a fraction of a pixel against the vastness of space, hence the name “Pale Blue Dot.”

This “Pale Blue Dot” effect may well represent the insights that can potentially be extracted from some big data explorations. In such contexts, the insight itself may seem very small given the vastness of the data collected, processed, and analyzed. However, its value could be unimaginable when discovered. For example, some answers to potential cures for diseases may be hidden in DNA sequencing data, but it is extremely difficult and expensive to analyze this data and correlate it with known diseases, given its vastness. If such an insight is found and leveraged for a cure, it will have huge value for society as a whole.


See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info)


[1] Subsequently, the title of the photograph was used by Sagan as the main title of his 1994 book, Pale Blue Dot (Sagan, 1994).

Friday, October 10, 2014

My book titled "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" is published


My book titled "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" is published.

Order at CreateSpace

Order at Amazon

Here is the Book Site.

Description:
The use of machine learning and data mining to create value from corporate or public data is nothing new. It is not the first time that these technologies are in the spotlight. Many remember the late '80s and the early '90s, when machine learning techniques, in particular neural networks, had become very popular. Data mining was on the rise. There was talk everywhere about advanced analysis of data for decision making. Even the popular android character in "Star Trek: The Next Generation" had been named, appropriately, "Data." Data mining science has been the cornerstone of many data products and applications for more than two decades, e.g., in finance and retail. Credit scores have been used for decades to assess the creditworthiness of people applying for credit or loans. Sophisticated real-time fraud scores based on individuals' transaction spending patterns have been used since the early '90s to protect credit card holders from a variety of fraud schemes. However, the popularity of web products from the likes of Google, LinkedIn, Amazon, and Facebook has helped analytics become a household name. While a decade ago the masses did not know how their detailed data were being used by corporations for decision making, today they are fully aware of that fact. Many people, especially the millennial generation, voluntarily provide detailed information about themselves. Today people know that any mouse click they generate, any comment they write, any transaction they perform, and any location they go to may be captured and analyzed for some business purpose.

Every new technology comes with lots of hype and many new buzzwords. Often, fact and fiction get mixed up, making it impossible for outsiders to assess the technology's true relevance. I wrote this book to provide an objective view of analytics trends today. I have written it in complete independence, and solely as a personal passion. As a result, the views expressed in this book are those of the author and do not necessarily represent the views of, and should not be attributed to, any vendor or employer.

Due to the exponential growth of data, today there is an ever-increasing need to process and analyze big data. High-performance computing architectures have been devised to address the need for handling big data, not only from a transaction processing standpoint but also from a tactical and strategic analytics viewpoint. The success of big data analytics in large web companies has created a rush toward understanding the impact of new big data technologies in classic analytics environments that already employ a multitude of legacy analytics technologies. There is a wide variety of readings about big data, high-performance computing for analytics, massively parallel processing (MPP) databases, Hadoop and its ecosystem, algorithms for big data, in-memory databases, implementation of machine learning algorithms for big data platforms, and big data analytics. However, none of these readings provides an overview of all these topics in a single document. The objective of this book is to provide a historical and comprehensive view of the recent trend toward high-performance computing technologies, especially as it relates to big data analytics and high-performance data mining. The book also emphasizes how big data requires a rethinking of every aspect of the analytics life cycle, from data management, to data mining and analysis, to deployment.

As a result of interactions with different stakeholders in classic organizations, I realized there was a need for a more holistic view of big data analytics' impact across classic organizations, and also the impact of high-performance computing techniques on legacy data mining. Whether you are an executive, manager, data scientist, analyst, sales or IT staff, the holistic and broad overview provided in the book will help in grasping the important topics in big data analytics and its potential impact in your organizations.

Friday, September 19, 2014

Data Mining, Data Science, and Machine Learning (3)

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info)

Experiences on building Data Science Teams

By education, I was trained in Intelligent Systems, focusing on Machine Learning and Pattern Recognition. Consequently, for some of us who have spent our entire careers developing and promoting data mining science, machine learning/pattern recognition, and analytics, "data science" is nothing but a new term. As mentioned, in essence it is the same as data mining science, with a few twists. There is no doubt that the popularity of the term has been directly related to the attention big data analytics has received in recent years, especially due to the success of big web companies. The popularity of these companies has finally brought the importance of analytics to the attention of the masses.

Prior to this recent wave, these technologies were already an integral part of many things in our lives, without us noticing it. However, they were never publicized to the scale they are today. The most notable commercial applications of machine learning have been around since the late '80s and early '90s: real-time payment card fraud detection, hand-print recognition (US mail, checks, ...), and recommender systems. A startup I worked for, HNC (HNC Software, later acquired by FICO), was the pioneer in the first two. The core of HNC's business from the early '90s was to get data from corporations on their customers and develop intelligent solutions leveraging machine learning and advanced analytics on that data. That was the core value the company would provide and the reason for its valuation. Due to the big data craze and the recent publicity analytics has finally received, today many companies try to do that on any data they can find, public or private. And with the explosion of human- and machine-generated data, there is a lot more that can be done.

Data science is "the practice of extracting insights from data of any size or variety, using a multitude of disciplines and technologies for the purpose of creating new data products and services or improving the existing ones."

As a part of four startups, I have spent my whole career developing and promoting novel applications of machine learning and pattern recognition for real-world problems such as hand-print recognition, control of non-linear systems, real-time behavioral fraud detection, and many more. At some of these companies, I had to hire and build data science teams with the right skill sets. I also worked with a couple of local universities in the late '90s, when there was no huge demand for these skill sets and not much publicity. Based on the experience of all these years working in the field, I am currently advising a university on establishing an undergraduate Data Science program.

Here are the skill sets that make a perfect data scientist. Keep in mind that, given the multidisciplinary nature of the field, not many people can be found who possess all of them. That is why it is an important managerial task to hire a data science team that collectively addresses the immediate needs of the business:

(1) Passion, love, and patience for data (often imperfect data), including everything it takes, if necessary, to identify all sources of data and to collect and validate it for the prototype system,
(2) Deep knowledge of machine learning (or pattern recognition) and statistical modeling. These provide solid ground for quantitative analysis. Real-world experience with them is sometimes a must,
(3) Good computational skills and knowledge of core programming principles, with programming experience in at least a couple of languages (one third-generation and one fourth-generation language),
(4) A solid foundation in mathematics, including linear algebra, numerical analysis, and probability theory (Bayes),
(5) Business acumen: a focus on data and analytics applications that provide high impact to the business (creating data products and services),
(6) An inquisitive mindset (asking questions, challenging assumptions, validating thoughts/ideas, ...) combined with pragmatism (there are no perfect solutions; good is often best given the time and resources),
(7) Some working knowledge of databases or newer data stores. A solid background in the fundamentals of computing, including high-level architectures, is a plus,
(8) The ability to communicate findings to business, peers, and customers (not for everybody),
(9) The ability to formulate a business problem as a data mining problem (the four phases discussed earlier) and execute,
(10) Problem solver and statistical thinker.

It is important to know that tools, packages, and platforms change, and as long as a person has the core traits and skills, it should be possible to adapt to new tools and platforms if necessary. Though this is not true for everyone.

Here are some real-world observations. Obviously I could never hire anybody with all these traits directly from universities. But the successful hires from the universities were those with a computer science background and a focus on machine learning or similar fields. They were the easiest to train and develop. Applied statisticians (or applied physicists or people from other analytics disciplines) needed to be developed in two areas: one was the programming/computation dimension, and the other was the acceptance of machine learning techniques and approaches (sometimes at odds with what they had learned traditionally in their fields). Usually statisticians had a narrow set of computational skills and preferred to use a single tool like SAS or the like. Software engineers (with an interest in scientific programming) were not a good choice for a data science position, since a lot more had to be done to develop and train them.

At the end of the day, our data science team was a combination of the following:
(1) A sub-group of people who each combined more of the skill sets above,
(2) A sub-group with a statistics focus and bias (but exposure to machine learning techniques, which are now somewhat a part of the statistical tool sets),
(3) A sub-team with data collection/manipulation/validation skills who could grow into data scientists,
(4) Software engineers focused on implementation, especially productionizing the analytics process discovered. They would also help with the prototyping aspects if it required integration of tools and systems,
(5) Project managers to coordinate efforts.

The business acumen and customer-facing skill sets take training and development. Some people simply preferred the technology and the back office. That was in their nature.

Almost guaranteed, the data aspects of the data scientist job (collection, validation, manipulation, ...) are learned and practiced in a commercial setting. They are never taught in any university program, and this surprises people who come from academia every time.

Tuesday, July 22, 2014

Big Data Confusion

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info).

I was away for FIFA World Cup for a couple of weeks and am back now. The first thing I noticed last week was that my Twitter account had been hacked.  But now all is sorted out and this blog will be tweeted.

In the process of writing my book, I did some research on the definitions of "big data" and "big data technologies." Suffice it to say that I found a lot of conflicting definitions, and as a practitioner in the world of analytics, I wondered how others who do not have deep experience in analytics must feel about all this! In the end, I found the following definition of big data the most appropriate.

The most common definition of “big data” is datasets that grow so large that they become awkward to capture, store, search, share, analyze, and visualize using available data management and analysis tools. Two points follow from this definition. First, what is considered big data today may be the new normal in the next several years. Second, what counts as big data may also change with the place, i.e., the industry sector (vertical), organization, or enterprise, depending on where it is on its analytics evolution curve.

In the last few years, the term “big data” has been tied closely to “Hadoop” by the media. This is unfortunate. An organization may choose, or may have already chosen, to solve its big data problems using big data technologies other than Hadoop. At the same time, one can also use Hadoop to solve a more conventional data problem, not necessarily a big data problem. Hadoop is only a piece in the whole big data puzzle, but it is an influential one. This 2013 survey (Gartner Survey: Big Data Adoption in 2013 Shows Substance Behind the Hype), like a few other surveys conducted by other research organizations, reflects what organizations think of big data.

In my upcoming book, I have tried to address fact vs. fiction (hype) about big data analytics. Big data analytics can be approached from many different angles as it relates to business use cases, analytic processes (methodologies and algorithms), platform architectures, etc. As far as platforms are concerned, when we talk about big data, we are talking about one of these options:

- Hadoop, MapReduce, and YARN,
- Massively parallel databases (Relational and NoSQL),
- Real-time event stream processing,
- In-memory distributed analytics,
- Big data analytics appliance.

Again, the Gartner survey illustrates, aside from the hype, what organizations consider big data technologies (cloud is at the top), what they use them for (customer experience is at the top), and what type of data they use them on (transactions are at the top). I have not seen a single unified understanding of big data across these surveys, and there are many of them. That shows how loose the definitions and understandings of big data are.



Thursday, June 12, 2014

Data Mining, Data Science, and Machine Learning (2)

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info).

A question that often comes up is: "What is the difference between Machine Learning and Data Mining Science (now Data Science)?" Newcomers often confuse these, not to mention the businesses that use these terms to make their products more appealing to their customers and investors. I make an attempt to describe the difference between the two here. For more detail, you can read my new book.


Data Mining Science (the practice is referred to as Data Mining) covers all aspects of creating value out of data, from capture, storage, quality checks, analysis, and deployment, to infrastructure. It requires some knowledge of a variety of disciplines, including mathematics, machine learning, statistics, programming, and databases. On the other hand, machine learning (ML) is a set of techniques and algorithms that can be applied to data (any type of data: numeric, text, binary, ...) when placed into the right form: a set of exemplars (or patterns, observations, or records), each with a set of features that captures the problem of interest. ML (very close to pattern recognition) is a branch of artificial intelligence in computer science that focuses on the general problem of "learning from data" or "learning by examples."


Machine learning and pattern recognition are very close in principle, and both are established fields that have been taught in top universities for a long time. Both try to address the problem of learning, with the difference that ML has its roots in computer science while pattern recognition has its roots in engineering. Statistical Learning Theory is a sub-field of ML that focuses on the formalization of the problem of learning, that is, on the more theoretical aspects. Statistical learning theory explains why machine learning techniques in general work, something practitioners had already experienced through empirical results. Traditional statisticians have always been skeptical of machine learning applications, even though nowadays these techniques are a part of many statistical toolsets.

The fundamental principle in machine learning that distinguishes it from traditional statistics is that it makes little or no pre-assumptions about the problem, including the data distributions. Traditional statistics has its roots in a time when there were no computers around, and analyzing any data centered on data reduction and simplification (sampling, variable reduction, linearity assumptions, ...). These often required making restrictive assumptions about the problem and the data distributions.


Historically, ML techniques had to deal with much more challenging problems, and as such they make few pre-assumptions about the problem and instead use the power of the computer to search and optimize for "a good" solution, often using heuristics. Machine learning is not necessarily in search of the best optimized solution, but of "a good solution." In other words, traditional statistics forces severe assumptions on the data to get its best solution, but that solution is only best assuming the correctness of those assumptions. When dealing with much real-world machine- and human-generated data, those pre-assumptions are rarely correct, and hence the resulting solutions can be mediocre at best. ML generally does not seek or claim to find the best possible solution. Often, given the complexity of the problems, the best solution may not be achievable, or may not be worth the time and resources to find even if it exists.


Another fundamental pillar of machine learning is the concept of "generalization" and the trade-off between accuracy and robustness. ML addresses this empirically with the training/validation/test approach, while traditional statistics uses statistical tests on the training data (coupled with tight initial assumptions).
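That empirical discipline can be sketched in a few lines of plain Python. This is a toy illustration on synthetic data, not an example from the book: the data are split three ways, the model's single knob (k in a k-nearest-neighbor classifier) is tuned on the validation set, and generalization is judged only on the untouched test set.

```python
import random

random.seed(0)

# Toy 1-D dataset: the true label is 1 when x > 0.5, flipped with 10% noise.
xs = [random.random() for _ in range(300)]
data = [(x, int((x > 0.5) != (random.random() < 0.1))) for x in xs]

random.shuffle(data)
train, valid, test = data[:180], data[180:240], data[240:]

def knn_predict(x, k):
    # Majority vote among the k nearest training points.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return int(2 * sum(y for _, y in nearest) >= k)

def accuracy(split, k):
    return sum(knn_predict(x, k) == y for x, y in split) / len(split)

# k = 1 memorizes the training set, so training accuracy is a useless
# estimate of generalization; choose k on the validation set and report
# the honest estimate on the untouched test set.
best_k = max([1, 3, 5, 9], key=lambda k: accuracy(valid, k))
print("train accuracy at k=1:", accuracy(train, 1))  # 1.0: pure memorization
print("chosen k:", best_k)
print("test accuracy:", accuracy(test, best_k))
```

The gap between the perfect training score and the lower test score is exactly the accuracy/robustness trade-off the paragraph describes.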



As a practice, data mining (and data science) has four phases of equal importance:

(1) Business problem understanding
(2) Data understanding and preparation
(3) Model development and assessment
(4) Deployment and monitoring.



Business Understanding: Data mining starts from a full understanding of the business problem, where business domain knowledge and data mining knowledge both have to be leveraged. The final result of this phase is an ROI expectation and a formulation of the business problem as a data mining problem.


Data Understanding and Preparation: Understanding the data needs little explanation. By definition, it requires proper storage, low-level quality checks, and access to "all relevant data" for the desired business problem. Data preparation covers all aspects of the data manipulation that will be required for model development. Sometimes the data preparation is minimal, but very often it requires sophisticated and innovative ways of converting "raw relevant data" into a so-called "analytic data set" (or ADS) that well represents the business problem of interest and can be fed into the algorithms of interest. An ADS is a set of exemplars, each with a set of features that capture the problem of interest.
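As a minimal sketch of what that conversion can look like (the transaction fields and features below are hypothetical, purely for illustration), raw event-level records are rolled up into one exemplar per customer, each with a fixed feature set:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw transaction records: one row per transaction.
raw = [
    {"customer": "A", "amount": 25.0, "channel": "web"},
    {"customer": "A", "amount": 900.0, "channel": "pos"},
    {"customer": "B", "amount": 12.5, "channel": "web"},
    {"customer": "B", "amount": 14.0, "channel": "web"},
    {"customer": "B", "amount": 13.0, "channel": "pos"},
]

# Group the raw rows by customer.
by_customer = defaultdict(list)
for tx in raw:
    by_customer[tx["customer"]].append(tx)

# Roll each group up into one exemplar with a fixed set of features --
# the shape a learning algorithm expects.
ads = []
for cust, txs in sorted(by_customer.items()):
    amounts = [t["amount"] for t in txs]
    ads.append({
        "customer": cust,
        "tx_count": len(txs),
        "avg_amount": mean(amounts),
        "max_amount": max(amounts),
        "web_share": sum(t["channel"] == "web" for t in txs) / len(txs),
    })

print(ads[0])  # one exemplar: customer A summarized as a feature vector
```

In real projects this step is far more involved (time windows, behavioral ratios, handling missing data), but the shape of the output, one row per exemplar with a fixed feature set, is the same.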


Model Development and Assessment: The carefully designed and processed ADS is the input to this stage. This stage is where machine learning is applied to process structured, unstructured, or multi-structured data. The results of modeling are assessed against the business goals, and a good solution is selected.


Model Deployment and Monitoring: Even the most sophisticated models are only useful when they can be operationalized and continuously monitored for performance degradation.

In the earlier days of data mining, many projects failed because they were solely focused on the cleverness of the modeling and learning algorithms (an academic orientation) rather than on the business value. Many others failed in practice because of a lack of attention to data storage, quality, and preparation for modeling. Of those that passed these hurdles, many failed because they could not be deployed in operations.


Today, many of these early failures have been addressed, and good practitioners evaluate the whole life cycle in the first phase before attempting a solution. The tools used in the process can change and often come from a variety of vendors.


With the explosion of data and the popularity of analytics and ML in general, all players in the market are using the terms "Data Science" and "Data Scientist." The data platform vendors (MPP, NoSQL, and Hadoop vendors) use the terms to emphasize the database/data store and basic analytics aspects. Outside the context of big data, I do not consider basic analytics even close to what a data scientist has to do. Startups and big web companies may put more emphasis on the programming requirements of a data scientist.


In my opinion, Data Science is mainly a sexier and more appropriate name for "Data Mining Science." "Mining" does not portray the right image, because it is generally associated with dangerous and hard manual labor. Also, Data Science as a practice is more focused on creating new products (data products) and tends to sit much higher in the organization's leadership.


In the third piece of this topic, I try to enumerate what skills are required for a data scientist, what skills one needs in a data science team, and what needs to be taught in a data science curriculum.




For more detail and in the context of big data impact on data mining, you can read my new book.

In future blogs, I discuss "my ten commandments" for data mining, i.e., the general principles to be aware of.


Monday, June 9, 2014

Analytics Maturity

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info).

Like many new technology terms in the last two decades, "analytics" has been used and abused in different contexts. Depending on their background, data professionals have different views of what analytics is. Those who come from IT, database, reporting, or business analysis backgrounds will consider any manipulation of data that generates reports, aggregated results, or data slicing and dicing to be analytics. Those who come from data mining, machine learning, or statistical modeling backgrounds will only consider analytics to be where sophisticated algorithms are applied to the data.

The good news is that this battle has already been fought and settled. Today, there is a clear distinction between these two interpretations of analytics. The former is called basic analytics (low-end, or spreadsheet-style analysis), often described as looking in the rearview mirror. The latter is referred to as advanced analytics (high-end), where the goal is to use sophisticated techniques on past data to make predictions about the future. With big data, even basic analytics, like simple aggregations and tabulations of the data for reporting, becomes a challenging task if response time is at all a concern. Those who focus on basic analytics tasks are like journalists, while those who focus on advanced analytics may be called innovators. Both types of analytics are essential to the well-being of a business.
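The contrast can be made concrete with a toy example (the sales numbers are made up): basic analytics summarizes what already happened, while advanced analytics fits a model to the past to predict the future. Here an ordinary least-squares trend line stands in, very loosely, for the sophisticated techniques the post refers to.

```python
# Toy monthly sales series (hypothetical numbers, for illustration only).
sales = [100, 104, 111, 115, 122, 128]

# Basic analytics: summarize the past ("the rearview mirror").
total = sum(sales)
average = total / len(sales)

# Advanced analytics: fit a model to the past to predict the future.
# Ordinary least-squares on a linear trend, the simplest possible model.
n = len(sales)
xs = range(n)
x_mean, y_mean = (n - 1) / 2, average
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, sales))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean
forecast = intercept + slope * n  # predicted sales for the next month

print(f"total={total}, average={average:.1f}, forecast={forecast:.1f}")
```

The first two numbers describe the past; the third makes a claim about the future, and that, rather than any particular algorithm, is the line between the two kinds of analytics.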

The fundamental principle is that an organization cannot transition into the advanced analytics era if it has not already mastered basic analytics applications. In other words, basic analytics is a prerequisite for advanced analytics, and both depend on a solid data management infrastructure. This so-called analytics maturity is discussed at length in my book in different contexts (including the big data context), and assessing it is necessary prior to any effort to augment a firm's analytics capabilities.

Friday, June 6, 2014

Data Mining, Data Science, and Machine Learning (1)

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info).

People often ask me about data science and its relationship with data mining science (often just referred to as data mining) and machine learning. In my book, I provide a viewpoint on this topic. For those like me who have spent their entire career developing and promoting data mining, machine learning, and data analytics, “data science” is nothing but a new term.

The use of machine learning and data mining to create value from corporate or public data is nothing new. It is not the first time that these technologies are in the spotlight. Many remember the late ‘80s and the early ‘90s, when machine learning techniques, in particular neural networks, had become very popular. Data mining was on the rise. There was talk everywhere about advanced analysis of data for decision making. Even the popular android character in “Star Trek: The Next Generation” had been named, appropriately, “Data.”

Data mining science has been the cornerstone of many applications for more than two decades, e.g., in finance and retail. However, the popularity of web products from the likes of Google, LinkedIn, Amazon, and Facebook has helped analytics become a household name. While a decade ago the masses did not know how their detailed data were being used by corporations for decision making, today they are fully aware of that fact. Many people, especially the millennial generation, voluntarily provide detailed information about themselves. Today people know that any mouse click they generate, any comment they write, any transaction they perform, and any location they go to may be captured and analyzed for some business purpose.

All of this has contributed to finally bringing analytics to the forefront of many conversations, even among regular people. A decade ago, we could not comfortably tell customers how we anonymously analyze their detailed transactions in real time to protect them from fraud (see Chapter 9 of the book).

Big Data Analytics Confusing Landscape

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info).

Any new technology brings its own new buzzwords and lots of promises. It becomes really confusing for outsiders to assess its relevance and to separate noise from reality. In my book, I provide an objective and holistic view of big data technologies and the impact of new big data technologies on classic organizations. Leveraging my 20 years of experience in advanced analytics and machine learning, I have tried to cover many topics relevant to anybody who is interested in getting into this field.

Who is my audience for my new book?

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data."

I wrote this book for a variety of audiences (See here for details).  Most importantly, there are many people in the technology, science, and business disciplines that are curious to learn about big data analytics in a broad sense, combined with some historical perspective. They may intend to enter the big data market and play a role. For this group, the book provides an overview of many relevant topics on the subject.

Thursday, June 5, 2014

Hadoop Summit 2014 Takeaways

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data."

Notable Trends from Hadoop Summit 2014  #HadoopSummit

- FYI: Hortonworks and Yahoo! are the hosts of this event. The event has been held at the San Jose Convention Center for the last four years (I think) and attracted 3,200 people this year.

- As a sign of maturity of Hadoop, one can see the list of Diamond and Platinum sponsors that include many heavyweights such as AT&T, Microsoft, SAP, Teradata, Cisco, IBM, Informatica, Oracle, SAS, and VMware.

- With YARN finally released at the end of 2013, there were high expectations of hearing Hadoop maturity stories and experiences from classic environments. The stories and experiments that were shared showed that Hadoop with YARN is moving along its maturity curve, and that is good news for classic organizations.

- Lots of tools in the Hadoop ecosystem are still low-level, but the trend continues among different vendors to make the tools for integration and use more efficient and higher-level. For classic organizations, this will be the way to make Hadoop use more widespread beyond storage and ETL.

- With Yahoo! being a host, there were lots of presentations from them regarding various aspects of Hadoop 2.0 usage at Yahoo!

- One thing I have found is that there is no single big data technology that solves most of the business problems out there, especially for classic organizations that have lots of investment in legacy big data analytics technologies. For now, most business problems will require more than one big data technology. (It is good to read my book on the subject when it is out.) Many participants from classic organizations in banking, telco, retail, and other industries emphasized this as their experience playing with Hadoop. Remember that Hadoop newcomers used to call it the "holy grail," but those days are over. What we will see is more convergence and integration rather than pure replacement.

- Queries using MapReduce are slow. There are now many solutions on Hadoop to address low-latency queries over big data, such as Impala (Cloudera), Drill (MapR), Shark (Spark), Presto (Facebook), HAWQ (Pivotal), and Apache Tez. All these make things more confusing, and there will be no answer on what the right way is for some time to come. There were a few presentations focusing on Tez and how it improves performance for Pig and Hive.

- Spark-on-YARN enables Spark applications to run on Hadoop with no need to create a new cluster. There were some early results shared by Yahoo!. Also, there is SparkR from UC Berkeley, which promises interactive analysis of large data in parallel from the R shell.

- There were new use cases, from the travel industry and consumer electronics peripherals to selling cars, showing Hadoop traction in many more industries.

- Hadoop security is coming along and still more work needs to be done. Check out Apache Sentry, Apache Knox, and Project Rhino.

- Regarding the cost of ownership of a Hadoop cluster, what I have found is that one should always start with a public cloud implementation for discovery and exploration, and even for the earlier stages of operations. There will be a point where, depending on application-specific factors, an on-premise implementation becomes more cost-effective and also provides better ROI. For applications where data security is of high importance, on-premise from the start will be the way to go for now.

Tuesday, May 27, 2014