Thursday, June 5, 2014

Hadoop Summit 2014 Takeaways

See the book site for "High-Performance Data Mining and Big Data Analytics: The Story of Insight from Big Data."

Notable Trends from Hadoop Summit 2014  #HadoopSummit

- FYI: Hortonworks and Yahoo! host this event. It has been held at the San Jose Convention Center for the last four years (I think) and attracted 3,200 people this year.

- As a sign of Hadoop's maturity, the list of Diamond and Platinum sponsors includes many heavyweights such as AT&T, Microsoft, SAP, Teradata, Cisco, IBM, Informatica, Oracle, SAS, and VMware.

- With YARN finally released at the end of 2013, expectations were high for Hadoop maturity stories and experiences from classic environments. The stories and experiments that were shared showed that Hadoop with YARN is moving along its maturity curve, which is good news for classic organizations.

- Many tools in the Hadoop ecosystem are still low level, but vendors continue the trend of making the tools for integration and use more efficient and higher level. For classic organizations, this will be the way to spread Hadoop use beyond storage and ETL.

- With Yahoo! being a host, there were lots of presentations from them regarding various aspects of Hadoop 2.0 usage at Yahoo!

- One thing I have found is that there is no "single big data technology" that solves most of the business problems out there, especially for classic organizations that have invested heavily in legacy big data analytics technologies. For now, most business problems will require more than one big data technology. (Good to read my book on the subject when it is out.) Many participants from classic organizations in banking, telco, retail, and other industries emphasized this as their experience with Hadoop. Remember that Hadoop newcomers used to call it the "holy grail", but those days are over. What we will see is more convergence and integration rather than pure replacement.

- Queries using MapReduce (MR) are slow. There are now many solutions on Hadoop that address "low latency" queries for big data, such as Impala (Cloudera), Drill (MapR), Shark (Spark), Presto (Facebook), HAWQ (Pivotal), and Apache Tez. All these options make things more confusing, and there will be no clear answer on the right approach for some time to come. There were a few presentations focusing on Tez and how it improves performance for Pig and Hive.

- Spark-on-YARN enables Spark applications to run on an existing Hadoop cluster with no need to create a new one. Yahoo! shared some early results. Also, there is SparkR from UC Berkeley, which promises interactive analysis of large data in parallel from the R shell.

- There were new use cases ranging from the travel industry and consumer electronic peripherals to car sales, showing Hadoop gaining traction in many more industries.

- Hadoop security is coming along, but more work still needs to be done. Check out Apache Sentry, Apache Knox, and Project Rhino.

- On the cost of ownership of a Hadoop cluster, what I have found is that one should always start with a public cloud implementation for discovery and exploration, and even for the earlier stages of operations. Depending on application-specific factors, there will be a point where an on-premise implementation becomes more cost effective and provides a better ROI. For applications where data security is of high importance, on-premise from the start is the way to go for now.
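
To make the cloud versus on-premise trade-off a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a very simple model: the cloud has only a flat monthly cost, while on-premise has an upfront cost plus a lower flat monthly cost. The function name and all the dollar figures are hypothetical and for illustration only; they are not numbers from the Summit, and a real comparison would include many more application-specific factors.

```python
import math


def break_even_month(cloud_monthly, onprem_upfront, onprem_monthly):
    """Return the first month at which cumulative on-premise cost is no
    higher than cumulative cloud cost, or None if that never happens.

    Simplified model (an assumption, not a real pricing formula):
      cloud cost after m months   = cloud_monthly * m
      on-prem cost after m months = onprem_upfront + onprem_monthly * m
    """
    monthly_saving = cloud_monthly - onprem_monthly
    if monthly_saving <= 0:
        # On-premise never catches up if it is not cheaper to run per month.
        return None
    return math.ceil(onprem_upfront / monthly_saving)


# Hypothetical figures, chosen only to illustrate the calculation.
print(break_even_month(cloud_monthly=40_000,
                       onprem_upfront=600_000,
                       onprem_monthly=15_000))  # 24
```

With these made-up numbers, on-premise starts to pay off after two years, so staying in the cloud through discovery and early operations costs less over that period.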