September HUG Recap

Thanks to the roughly 200 developers who came to Yahoo! recently for our monthly Hadoop User Group meeting. The energy in the packed room was phenomenal, and conversations continued long after the formal sessions ended.

The event started with Chris Riccomini talking about Pig at LinkedIn. It was great to get a firsthand view of how industry is using Pig to solve its data processing problems. Chris explained that Pig is an integral part of data analytics at LinkedIn and showed how it is used to design, develop, and deliver data products there. He walked through a successful Pig deployment, the pain points encountered along the way, and Pig's integration with Azkaban, Voldemort, Hadoop, and the rest of LinkedIn’s ecosystem. Chris also covered the most frequent gotchas and lessons learned, then concluded with some of his thoughts on the evolution of Pig. The talk generated many interesting questions.

Next we had Dhruba Borthakur and Dmytro Molkov from Facebook talking about high availability of the Hadoop NameNode. Although few current users experience NameNode failures, the NameNode is certainly the next big hurdle in delivering high availability. Dhruba talked about the specific innovations Facebook has made to make the NameNode highly available, and discussed the proposed design and its advantages. He described in detail a hot-standby solution called the AvatarNode, then talked about the capabilities of the SecondaryNameNode and the BackupNode. It was a very insightful presentation.

Finally, we had Ahad Rana from Opencrawl talking about building a scalable Web crawler with Hadoop. Ahad shared his experience in building an open and accessible Web-scale crawl. He discussed the Hadoop data-processing pipeline, including the PageRank implementation. He also described techniques for optimizing Hadoop and the design of their URL Metadata service. He concluded with details on how users can leverage the crawl (using Hadoop) today. The discussion generated very detailed questions, and folks stayed well past the scheduled end of the event to understand the internals of the search index and how they can leverage Opencrawl to solve their business needs.

We at Yahoo! embrace Hadoop and are looking for exciting technologies and experiences you want to share. Please contact me via the Hadoop Bay Area User Group Meetup page.