
5 Key Highlights: A Field Report from #HadoopSummit

Hadoop Summit 2013 is underway. Last night, we got a chance to catch up with SK Krishnamurthy and Scott Kahler after the conference finished for the day and heard some of their highlights. SK leads product management for Pivotal HD and HAWQ, and Scott is one of our Field Engineers focused on customer solutions.

Of course, given the sheer volume of innovation around Hadoop, it is hard to cover everything amazing that is happening. After sifting through a lot of it ourselves, we believe these five key highlights are worth passing along.

1. The Hadoop Community is Growing Up
After attending the Hadoop Summit for several years, SK feels the sessions have evolved. In the past, most sessions were technical deep dives—infrastructure, low-level programming, bits, and bytes. This year is different: there are higher-level sessions—discussions of use cases, optimizing operations, and applying Hadoop to the enterprise. Along the same lines, Scott attended the talk by Gartner’s Merv Adrian. Merv described the state of the industry as one of many fragmented parts, and he charted the growth of corporate big data plans: 27% of companies invested in 2012, and 30% plan to invest in 2013. This little elephant is maturing quite quickly.

2. Real-time Data Takes Center Stage
A buzz grew around the details of Storm, Spark, and YARN—but the point is not about any particular technology. It is that real-time, streaming, fast data is important and valuable in many industries, and it can share infrastructure with big data. If you aren’t familiar: Storm does for real-time processing what Hadoop did for batch processing; Spark offers in-memory cluster computing versus disk-based Hadoop jobs; and YARN separates resource management from MapReduce, allowing Hadoop nodes to also serve real-time queries—a much more volatile usage profile than traditional Hadoop batches. For Pivotal, this was strong validation for the Pivotal Data Fabric—where the real-time, in-memory Pivotal GemFire data platform is used alongside Pivotal Greenplum and Pivotal HD, our distribution of Apache Hadoop.

3. SQL on Hadoop—The Elephant on the Table
SK attended the SQL on Hadoop panel with Gavin Sherry, the Pivotal Data Fabric Chief Strategist. Surprisingly, 4 out of the 5 panelists strongly agreed that SQL is becoming an important element of Hadoop and big, fast data. While Hadoop and HDFS originally set out to run batch jobs written in Java across massively parallel compute environments, the majority of the world’s data analysts, scientists, and statisticians know SQL, not Java. When this group can run SQL queries against an HDFS file system or a real-time data grid, a whole new level of compute power opens up to them, as the sketch below illustrates.
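
To make that concrete, here is a minimal sketch (the table and column names are hypothetical, not something from the panel): the kind of question that takes a mapper, a reducer, and a driver class in Java MapReduce collapses into a few lines of the SQL every analyst already knows.

    -- Hypothetical example: top referrers in a clickstream stored in HDFS.
    -- As a Java MapReduce job this is several classes of boilerplate;
    -- as SQL it is the aggregate query analysts write every day.
    SELECT referrer,
           count(*) AS hits
    FROM   clickstream   -- a table backed by files in HDFS
    GROUP  BY referrer
    ORDER  BY hits DESC
    LIMIT  10;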

4. What the heck is HAWQ?
Since we made a big announcement about Pivotal HD and HAWQ not too long ago, many people came by the booth wanting to know what HAWQ is, having heard it was SQL on Hadoop. One of the best ways to explain it is by looking at the history of Greenplum. With HAWQ, we’ve taken a massively parallel data warehouse and database engine and put it on top of HDFS. Companies can use their existing SQL skills, expertise, and SQL tools to interact with data on the Hadoop file system, as sketched below. In the HAWQ architecture, there are no duplicate processing clusters—HAWQ is installed and runs directly on the Hadoop nodes.
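
As a rough illustration of what this means day to day, the sketch below mounts an HDFS directory as a HAWQ external table and then queries it with ordinary SQL. The host, path, and PXF LOCATION string are assumptions for illustration only; the exact URI syntax depends on the HAWQ and PXF versions you run.

    -- Illustrative only: expose tab-delimited files in HDFS as a queryable table.
    -- The pxf:// LOCATION string below is a placeholder; its exact syntax
    -- varies by HAWQ/PXF release.
    CREATE EXTERNAL TABLE web_logs (
        ts      timestamp,
        user_id text,
        url     text
    )
    LOCATION ('pxf://namenode:50070/data/web_logs/*')  -- hypothetical HDFS path
    FORMAT 'TEXT' (DELIMITER E'\t');

    -- From here, ordinary SQL (and ordinary SQL tools) run against HDFS data.
    SELECT date_trunc('hour', ts) AS hour,
           count(*)               AS requests
    FROM   web_logs
    GROUP  BY 1
    ORDER  BY 1;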

Note: we have a post coming soon on how HAWQ runs 100x faster than other SQL-on-Hadoop solutions. And, by the way, data warehousing and business intelligence author Ralph Kimball stopped by the booth!

5. Demonstrating HAWQ’s SQL on Hadoop with Spring XD
Of course, once people understand HAWQ in concept, they want to see how it can be used. If you haven’t seen our demo yet, it has been quite popular and explains a lot. Here is what happens in the demo:

  • Spring XD captures Twitter posts tagged #HadoopSummit, pulls out the hashtags, and loads the data into HDFS.
  • HAWQ mounts the HDFS data as an external table.
  • Tableau is used as a SQL client on top of HAWQ to create tag-cloud-style graphs of the hashtags (a sketch of all three steps follows below).
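
For readers who cannot make it to the booth, here is a hedged sketch of the pipeline end to end. The Spring XD stream definition appears as a comment, and every name, path, and option here is illustrative rather than the demo’s exact code.

    -- Step 1 (Spring XD shell, shown as a comment): an illustrative stream that
    -- tracks #HadoopSummit and writes the tweets into HDFS:
    --   stream create --name summit_tags \
    --     --definition "twitterstream --track='#HadoopSummit' | hdfs"

    -- Step 2 (HAWQ): an external table over the stream's HDFS output directory,
    -- created along the lines of the sketch in the previous section. Assume it
    -- is named summit_hashtags, with a single text column, tag.

    -- Step 3 (Tableau, or any SQL client): the tag cloud is a plain aggregate.
    SELECT tag,
           count(*) AS mentions
    FROM   summit_hashtags
    GROUP  BY tag
    ORDER  BY mentions DESC
    LIMIT  50;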

If you are at Hadoop Summit, please stop by to say hello and talk shop!