Today was Hadoop World 2010 in NYC. Put together by the good folk at Cloudera, this conference was a long day focusing on big data, analytics and the Hadoop ecosystem.

All in all, this was a good conference. Hadoop and I were decent friends before today: I had a pretty good grasp of how it could be used, where to use it and (generally) how to implement it. I’m now on a much stronger footing on all of those, not to mention having learned about some new tools. Well, new to me, at least.

Recap after the jump.


Mike Olson (CEO of Cloudera) opened the conference. His talk was fairly average for a keynote, but it had a few revelatory points.

  • Of the attendees (whom one might assume to be representative of the more highly motivated members of the Hadoop movement), over half have Hadoop clusters under 10TB in size. Hadoop gets the press for handling petabytes of data (Jonathan Gray from Facebook observed they add 10TB a day), but the technology is just as usable and adept at problem solving at more modest data volumes.
  • The largest area of data growth does not come from humans interacting with machines; rather, it’s from machines interacting with each other. Mike’s comments focused more on grepping through log files than anything else, but it’s a point of view I had not considered before this conference.
  • “Open Source means you no longer load the gun and then hand it to Oracle.” For those reading who have ever opened the Oracle licensing bill, you will completely understand that statement.

Tim O’Reilly followed Mike with his observations on big data. (And, by the way, the Strata conference looks killer. I might have to go to that one, too…) Tim’s comments were very interesting and thought-provoking, but it would be hard to report on them and do them justice. Suffice it to say that if you get a chance to hear him speak, it’s worth the time.

He did reference an IBM commercial with a great tagline: “Would you be willing to cross the street — blindfolded — on data that was five minutes old? Five hours? Five days?” What a great argument for real time analytics; may just have to steal that one…

Search Analytics With Flume And HBase

While this talk was a very interesting discussion of moving data around and getting it where it needs to go quickly, it was a bit over my head as far as technical details go (I haven’t worked with Flume and have only dabbled in HBase). I can see the value in Flume for moving log files around efficiently, though. (Slide deck, Video).

RDBMS And Hadoop: A Powerful Coexistence

This would have been an interesting talk — and one right up my alley — if only it hadn’t been an extended sales pitch for the Greenplum RDBMS. (Slide Deck, Video).

The Vendors

Hadoop World had about a dozen vendors. I’m glad to see that the Hadoop ecosystem is spending some serious time trying to integrate MapReduce and SQL. Many of the companies who would benefit from Hadoop have spent millions of dollars investing in a SQL-oriented BI environment; they are not just going to walk away from that investment for the latest technological fad.

Analyst Tools And Applications For Hadoop

One of the guys from Cloudera went over the Hadoop ecosystem focusing on which packages are most applicable for analysts and technical people who may not have a strong Java background. Good talk. (Slide Deck, Video)

Exchanging Data With The Elephant

Quest (the makers of TOAD) have come out with a version of the product that speaks Hadoop. Interestingly enough, it can also combine data from an Oracle instance with the results of a MapReduce run. Using Sqoop, data can be moved from Hadoop into an RDBMS (over JDBC) and vice versa.

I hadn’t really heard about Sqoop before this conference, as I have been focusing more on learning the ins and outs of MapReduce itself. This Sqoop thing is very interesting, though; it definitely merits a much closer look. (Slide Deck, Video)

Sqoop is a tool designed to import data from relational databases into Hadoop. Sqoop uses JDBC to connect to a database. It examines each table’s schema and automatically generates the necessary classes to import data into the Hadoop Distributed File System (HDFS). Sqoop then creates and launches a MapReduce job to read tables from the database in parallel.
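To make the "read tables from the database in parallel" part a little more concrete, here is a minimal sketch of how a table read could be carved into one query per map task. It assumes the table has a dense numeric key; the function names, the `orders` table and the `id` column are all made up for illustration, and this is not Sqoop's actual code.

```python
def split_ranges(min_id, max_id, num_splits):
    """Divide the inclusive range [min_id, max_id] into contiguous inclusive chunks."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if max_id < min_id:
        return []
    total = max_id - min_id + 1
    # Never create more splits than there are key values.
    num_splits = min(num_splits, total)
    base, extra = divmod(total, num_splits)
    ranges = []
    start = min_id
    for i in range(num_splits):
        # The first `extra` chunks take one additional key each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def split_queries(table, key, min_id, max_id, num_splits):
    """Build one bounded SELECT per chunk; each could be handed to a separate map task."""
    return [
        f"SELECT * FROM {table} WHERE {key} >= {lo} AND {key} <= {hi}"
        for lo, hi in split_ranges(min_id, max_id, num_splits)
    ]


if __name__ == "__main__":
    # Hypothetical table and key column, four parallel readers.
    for query in split_queries("orders", "id", 1, 10, 4):
        print(query)
```

Each query covers a disjoint slice of the key space, so the readers never fetch the same row twice and together they cover the whole table.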

Sqoop can also import tables into Hive, for further relational processing, as well as export tabular data from HDFS back to databases.

For certain databases, such as MySQL, Sqoop provides further performance enhancements by using database-specific tools to facilitate imports and exports.

Multi-Channel Behavioral Analytics

A thoroughly entertaining talk from Stefan Groschupf of Datameer. It also happened to be educational, imparting a good bit of detail on how to combine the many datastreams into a holistic view of a customer. He did do a bit of selling from the stage, but it was minor and easily forgivable.

A good quote from the talk: “MapReduce is like assembler; Pig/Hive is like C.” (Slide Deck, Video)
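To put the quote in concrete terms, here is a small, purely illustrative Python sketch (no Hadoop involved) that counts log lines by status code twice: once with hand-written map, shuffle and reduce steps, and once with a single high-level expression. The log format and every function name are assumptions made up for the example. In Hive, the second version would be roughly a one-line `GROUP BY` query over a hypothetical logs table.

```python
from collections import Counter, defaultdict

# Made-up log lines in an assumed "METHOD PATH STATUS" format.
LOG_LINES = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /index.html 200",
]


def map_phase(line):
    """Emit a (status, 1) pair for one log line."""
    _method, _path, status = line.split()
    yield status, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def count_by_status_mapreduce(lines):
    """The 'assembler' version: every step spelled out by hand."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


def count_by_status_declarative(lines):
    """The 'C' version: say what you want and let the library do the grouping."""
    return dict(Counter(line.split()[2] for line in lines))


if __name__ == "__main__":
    print(count_by_status_mapreduce(LOG_LINES))
    print(count_by_status_declarative(LOG_LINES))
```

Both functions return the same counts; the difference is how much plumbing you write yourself.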

Sentiment Analysis

A quick talk from GE about their use of Hadoop to monitor sentiment in media, social networks and writing. Not much detail on how they are doing it, only that they are and “it’s really neat.” (Slide Deck, Video)

MapReduce And Parallel Database Systems

If I had gone only to this talk, my entire trip to Hadoop World would have been worth it. Daniel Abadi discussed the operational characteristics of massively parallel relational databases as compared to Hadoop MapReduce, focusing on where Hadoop wins and loses against MPP RDBMS. Then, he threw out the idea of HadoopDB, which uses single-node SQL databases (Postgres, mostly) as the storage layer in place of HDFS. I’m still working on getting my head around this idea, but it’s very exciting. I see lots of potential there. (Slide Deck)

Overall impression

I give Cloudera high marks for this conference. They are clearly trying to position themselves as Hadoop’s version of Red Hat (taking Open Source software and bundling it together into a certified platform complete with support and a single throat to choke), and conferences like this go a long way towards cementing that position. Cloudera videotaped every talk and will be posting both the video and the slide decks online in the very near future. My only real suggestion for Cloudera would be to put more power outlets around; towards the end of the day, everyone was huddling around the plugs, waiting their turn.

Slides/videos for all talks (not just the ones I attended) can be found at Cloudera’s site.