On Feb. 12, I flew through a window of clear weather to join 20 other Yahoos and 40 students at the renowned University of Illinois Urbana-Champaign (UIUC) for a one-day training on Hadoop. The first half of the event was split into two consecutive classes. One was a high-level overview of Hadoop taught by Milind Bhandarkar, accessible even to complete newbies, such as me. The other was a thorough introduction to the Pig scripting language by Viraj Bhat. The second half of the event consisted of a dive into practical Pig programming. Attendees were split into two groups, students and Yahoos. Students worked on the UIUC cluster while Yahoos worked on a Yahoo! cluster. Below, I briefly touch on some of the topics from the lectures that stood out to me. This is not an authoritative, technical deep-dive into Hadoop, but rather a presentation of a few topics with links to more information. Please refer to the Hadoop site, the Hadoop mailing list, and the Hadoop blog on the YDN for official documentation, educational resources, and community support.
Hadoop is composed of two main pieces: a distributed file system, HDFS, and a framework, MapReduce, for running code over large sets of data. It is currently used by a variety of companies, from Amazon and Facebook to Joost and the New York Times, to efficiently run arbitrary processes on huge sets of data.
One of the problems Hadoop addresses is the bandwidth required to move data. For example, scanning 100TB of data takes roughly 165 minutes when the data must be pulled across a LAN, versus roughly 30 minutes when it is read from local storage. So instead of pulling data to the computation, Hadoop moves the computation to the data and processes it in place. Put another way, the user imports a data set into a Hadoop cluster, where it is split and stored across the cluster's nodes using HDFS. Processes are then run on the data by Hadoop running on each node. For more information on this idea, see "Moving Computation is Cheaper than Moving Data".
Another problem Hadoop addresses is fault tolerance. In a cluster composed of thousands of nodes, the mean time between hardware failures is less than a day. Hadoop tracks and coordinates jobs using a central controller called the JobTracker. Should a task fail, the JobTracker will schedule the task to run again. Additionally, all data is stored in triplicate by default as a precaution against hardware failures.
Hadoop's native API is written in Java. There are also wrapper libraries for the Java API written in C and C++ for those more comfortable with these languages. The API is composed of three major pieces: JobClient, MapTask, and ReduceTask. For more information on the API, see the Hadoop docs.
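As a rough sketch (not the exact code from the training), a minimal driver using the older org.apache.hadoop.mapred API might look like the following. It uses Hadoop's built-in identity mapper and reducer and placeholder input/output paths; a real job would plug in its own Mapper and Reducer classes.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MinimalJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MinimalJobDriver.class);
    conf.setJobName("minimal-example");

    // With the default TextInputFormat, map input keys are byte offsets
    // (LongWritable) and values are lines of text (Text); the identity
    // mapper and reducer pass them straight through.
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    // Placeholder paths; a real job would point at data already in HDFS.
    FileInputFormat.setInputPaths(conf, new Path("/user/example/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/example/output"));

    // JobClient splits the job into parallel tasks, submits them to the
    // cluster, and waits for completion.
    JobClient.runJob(conf);
  }
}
```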
The Hadoop architecture is composed of a NameNode and several DataNodes. The NameNode does not contain data, but instead maps file names to the locations of their blocks, coordinates access, and triggers re-replication when a copy of the data is lost or corrupted. DataNodes, as the name suggests, store the data, and Hadoop schedules calculations on the nodes that hold the relevant, locally-stored portions of the data set.
The Hadoop file system is horizontally scalable, meaning the user can add more storage and processing capacity simply by adding more machines. Nodes running Hadoop are available for both storage and processing. A typical cluster node using commodity hardware would have a terabyte of storage and a couple of cores. HDFS is managed through Hadoop's FileSystem class, and individual files are accessed using URIs. Several file systems are supported natively: HDFS, Amazon's S3, Kosmos, FTP, HTTP, and HTTPS.
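To give a feel for the FileSystem API, here is a small sketch that copies a local file into HDFS and then lists a directory. The host name and paths are placeholders of my own, not anything from the training.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The URI scheme (hdfs://, s3://, file://, ...) selects the FileSystem
    // implementation; "namenode.example.com" is a placeholder host.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:9000/"), conf);

    // Copy a local file into the cluster...
    fs.copyFromLocalFile(new Path("/tmp/input.txt"), new Path("/user/alice/input.txt"));

    // ...and list the directory we just wrote to.
    for (FileStatus status : fs.listStatus(new Path("/user/alice"))) {
      System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
    }
  }
}
```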
A typical, introductory use case of Hadoop might go through the following steps:
- the user loads a set of data into the distributed file system (DFS)
- the user submits a job to the Hadoop client
- the client splits the job into several tasks that can be performed in parallel
- the client uploads each task to the DFS and registers the task with the JobTracker
- the nodes running each task send periodic "heartbeats" to the JobTracker to report progress and, eventually, completion
- this cycle repeats until all tasks are complete
Hadoop is designed for one type of operation called MapReduce, which is most efficient when the input and output are composed of key-value pairs. Some common use cases are log processing, search indexing, generation of "you might also try" search suggestions, and machine learning. In general, Hadoop jobs are performed in advance and the results stored for later reference, although there is a Hadoop On Demand (HOD) component of the project.
The oft-used example from the Hadoop documentation is counting the words in a large amount of text. First, the text is broken up into chunks. Then, a "map" operation is run on each chunk, sorting its words into lists, e.g., every occurrence of "the" goes in the "the" list, every "so" goes in the "so" list, and so on, and each map's output is quick-sorted by key. When all mapping operations are complete, the lists are "reduced" by operations that count the items in each list; for example, if there are ten occurrences of "the" in the text, the reduce operation associates "the" with 10. The per-word sums, e.g., "the"->10, "so"->3, "this"->7, are then merge-sorted into a master list. In the end, Hadoop outputs a single list of key-value pairs representing the count of each word in the text.
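For the curious, a condensed sketch of the word count map and reduce classes, loosely following the example in the Hadoop documentation (again using the older org.apache.hadoop.mapred API), might look like this; a driver like the one sketched earlier would wire them in with setMapperClass and setReducerClass.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Map: emit (word, 1) for every word in the input line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce: sum the 1s for each word; the framework has already grouped and
  // sorted the pairs by key before this is called.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}
```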
There are three ways to use Hadoop: define tasks in Java, C, or C++, compile them, and load them into Hadoop; "stream" each task's data from Hadoop through a script written in an arbitrary language, e.g., sed, Python, Perl, etc.; or use the Pig scripting language. The UIUC training did not dive into programming with the Hadoop API, as it was an introductory course, but we did cover streaming and Pig in depth, as these provide the easiest way to get started.
When streaming, Hadoop passes each task's input to a script as text on stdin; the script performs an arbitrary operation on that text and writes its results to stdout as text in the format "key \t val \n" (one tab-separated key-value pair per line), which Hadoop then reads back. Streaming can be thought of as a wrapper around MapReduce.
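To illustrate, a streaming map step only needs to read lines from stdin and write tab-separated pairs to stdout. The sketch below happens to be in Java to match the other examples, but a few lines of Python or Perl would look much the same, which is exactly the appeal of streaming.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// A streaming-style map step: read lines of text from stdin and write
// "key \t val \n" pairs to stdout. Hadoop collects the output, sorts it by
// key, and feeds it to a reduce step that follows the same stdin/stdout
// convention.
public class StreamingWordMapper {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      for (String token : line.trim().split("\\s+")) {
        if (!token.isEmpty()) {
          System.out.println(token + "\t1");
        }
      }
    }
  }
}
```

A mapper like this (and a matching reducer) is typically launched with the streaming jar that ships with Hadoop, which takes the input path, output path, and the mapper and reducer commands as arguments.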
Pig can be described as a procedural language, similar in purpose to SQL, that is processed by the Pig interpreter into a set of calls to the Hadoop API. To be precise, the actual language is called "Pig Latin" and Pig is the interpreter; that is, Pig processes Pig Latin. Out of the box, Pig Latin is limited to its pre-defined, built-in functionality. However, the user is free to define and compile custom operations in Java, specify the path to the resulting .jar files in the Pig Latin script, and then call those functions from within Pig Latin. Viraj went so far as to state that user-defined functions (UDFs) are the most important feature of Pig. For Eclipse users, there is a plug-in called PigPen. Pig Latin can also be combined with streaming to perform "Pig streaming". "PiggyBank" is the name of Pig's repository of contributed UDFs.
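To make the UDF idea concrete, here is a small sketch of one; the package and class names are invented for illustration, but extending EvalFunc is the standard way Pig UDFs are written. Once compiled into a .jar, a Pig Latin script can REGISTER the jar and call the function like any built-in.

```java
package myudfs;  // illustrative package name

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial Pig UDF that upper-cases its single chararray argument.
// In a Pig Latin script it would be used roughly like:
//   REGISTER myudfs.jar;
//   B = FOREACH A GENERATE myudfs.Upper(name);
public class Upper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return ((String) input.get(0)).toUpperCase();
  }
}
```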
Hadoop trainings occur with some regularity. For anyone interested in attending a training, please refer to the Hadoop site and mailing list for the most up-to-date information and schedules. There are also Hadoop user groups that meet in several locations around the world. At the last one in Santa Clara, CA, Matei Zaharia spoke about the Fair Share scheduler, developed in large part by Facebook. Another talk walked through a method for querying Hadoop with SQL over JDBC.
Please post all questions about Hadoop directly to the Hadoop mailing list. For questions about the UIUC training or this post, please leave a comment below.