Saturday, July 16, 2016

End of the Free Lunch for Analytics and Data Mining Software

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info)

There is an interesting phenomenon known to many as “What Andy giveth, Bill taketh away.” It means that every time Andy Grove (CEO of Intel) brought a new chip to market, Bill Gates would soak up the new chip’s power to upgrade his software. For thirty years, Moore’s Law continually enabled software performance to improve, often incrementally and, once in a while, drastically. Many types of applications enjoyed regular and free performance gains during this time, without the need to rewrite old code or do anything special. Many other applications were created, or evolved, by providing powerful new features made possible only by faster hardware. By the mid-2000s, this so-called “free lunch for software” came to an end.

During this long period, software engineers simply did not need to account for parallelism and concurrency in their programs or algorithms to get a speedup. They just relied on the expected 50% yearly speedup of processors that was made possible by Moore’s Law. However, by the early 2000s, a new reality started to emerge, one that created a crisis. In the new software world, not all software enjoys faster year-over-year performance gains, as was the case in the past. Only concurrent and parallel software benefits greatly from hardware advancements going forward, while single-threaded software is left behind. Writing parallel and concurrent software requires the use of different programming models, such as multithreading and distributed computing, which are more complex.

This is not the first time that software has faced a big challenge. The first software challenge came in the ’60s and ’70s, when developers were writing assembly language programs. As computers grew in power, memory capacity, and variety, there was a need for portability and some level of abstraction from the hardware. This was solved through high-level procedural languages such as C and FORTRAN, which were very efficient.

In the ’80s and ’90s, another challenge emerged: composing and maintaining large software. These were projects that involved tens or hundreds of programmers and where millions of lines of code had to be managed. Commercial and government entities such as Microsoft (while developing MS Word) and the US Department of Defense (DoD) faced such challenges early on. Software performance was not the main issue, since Moore’s Law had proven that it could be counted on to take care of that. Object-oriented programming using C++, and then Java and C#, made large software development easier. Hundreds if not thousands of people could work on a single application simultaneously.

The third challenge is what we face now. The end of scaling of single-threaded performance has created a major shift to symmetric parallelism. Until recently, programmers were used to not being concerned about hardware. They believed that, as usual, Moore’s Law would continue to take care of them. Continuing to get performance gains now requires software developers to write programs differently and more intelligently, so that they exploit parallelism, distributed computing, distributed data, and modern hardware architectures.
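
To make "writing programs differently" a little more concrete, here is a minimal sketch of the data-parallel pattern: split the data into chunks, do the same work on each chunk on a separate worker, then combine the partial results. The function names (sum_of_squares, split, parallel_sum_of_squares), the chunking scheme, and the choice of a sum of squares as the workload are all illustrative assumptions, not something taken from the book. It uses only Python's standard concurrent.futures module.

```python
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(chunk):
    """Sequential work done on one chunk of the data (illustrative workload)."""
    return sum(x * x for x in chunk)


def split(data, parts):
    """Split a list into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(data) // parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, workers=4, executor_cls=ProcessPoolExecutor):
    """Compute partial sums on separate workers, then combine them.

    `executor_cls` is a hypothetical knob added for this sketch: the default
    uses worker processes (one per core is typical); any executor class from
    concurrent.futures with the same interface can be passed instead.
    """
    chunks = split(list(data), workers)
    with executor_cls(max_workers=workers) as pool:
        # Each chunk is processed independently; only the final sum is shared.
        return sum(pool.map(sum_of_squares, chunks))
```

A single-threaded version would just call sum_of_squares on the whole list, and it would run on one core no matter how many cores the machine has. The parallel version can use more cores, but it also has to pay for splitting the data, starting the workers, and moving the chunks to them, so it only pays off when each chunk carries enough work. When running it as a script with the default process-based executor, the call should sit under an `if __name__ == "__main__":` guard.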

Though the end of the free lunch for software applies to all software, its effect is more profound in the world of data mining, machine learning, and analytics. See the book "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data" (http://bigdataminingbook.info) for more information.