Scaling Hadoop to 4000 nodes at Yahoo!

We recently ran Hadoop on what we believe is the single largest Hadoop installation, ever:

• 4000 nodes
• 2 quad-core Xeons @ 2.5GHz per node
• 4x1TB SATA disks per node
• 8G RAM per node
• 1 gigabit ethernet on each node
• 40 nodes per rack
• 4 gigabit ethernet uplinks from each rack to the core (unfortunately a misconfiguration, we usually do 8 uplinks)
• Red Hat Enterprise Linux AS release 4 (Nahant Update 5)
• Sun Java JDK 1.6.0_05-b13
• So that's well over 30,000 cores with nearly 16PB of raw disk!

The exercise was primarily an effort to see how Hadoop works at this scale and gauge areas for improvements as we continue to push the envelope. We ran Hadoop trunk (post Hadoop 0.18.0) for these experiments.

Scaling has been a constant theme for Hadoop: we, at Yahoo!, ran a modestly sized Hadoop cluster of 20 nodes in early 2006; currently Yahoo! has several clusters around the 2000 node mark.

HDFS

Scaling issues have always been a major focus in designing HDFS features. Despite these efforts, past attempts to scale clusters up sometimes resulted in unpredictable effects. One of the most memorable examples was the cascading crash described in HADOOP-572, when the failure of just a handful of data-nodes made the whole cluster completely dysfunctional in a matter of minutes.

This time the testing went smoothly and we observed quite decent file system performance. We did not see any startup problems; the name-node did not drown in self-serving heartbeats and block reports. Note that the heartbeat and block report intervals were configured with their default values of 3 seconds and 1 hour, respectively.
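
For reference, the sketch below shows how those two intervals could be read from a Hadoop Configuration. The property names and defaults (dfs.heartbeat.interval in seconds, dfs.blockreport.intervalMsec in milliseconds) are our reading of the 0.18-era hadoop-default.xml, so treat them as assumptions rather than something mandated by the results above.

    // Minimal sketch: reading the data-node heartbeat and block report
    // intervals from a Hadoop Configuration. Property names and defaults
    // are assumed from the 0.18-era hadoop-default.xml.
    import org.apache.hadoop.conf.Configuration;

    public class IntervalCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Heartbeat interval in seconds (assumed default: 3 s).
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3L);
        // Block report interval in milliseconds (assumed default: 3,600,000 ms = 1 hour).
        long blockReportMs = conf.getLong("dfs.blockreport.intervalMsec", 3600000L);
        System.out.println("heartbeat interval:    " + heartbeatSec + " s");
        System.out.println("block report interval: " + (blockReportMs / 1000) + " s");
      }
    }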

We ran a series of standard DFSIO benchmarks on the experimental cluster. The main purpose was to test how HDFS handles the load of 14,000 clients performing writes or reads simultaneously.

HDFS Cluster Statistics

Capacity : 14.25 PB
DFS Remaining : 10.61 PB
DFS Used : 233.44 TB
DFS Used% : 1.6 %
Live Nodes : 4049
Dead Nodes : 226

Map-Reduce Cluster Statistics

Nodes: 3561
Map Slots: 4 slots per node
Reduce Slots: 4 slots per node

The DFSIO benchmark is a map-reduce job in which each map task opens a file, writes to or reads from it, closes it, and measures the i/o time. There is only one reduce task, which aggregates and averages the individual times and sizes. The result is the average throughput of a single i/o, that is, how many bytes per second were written or read on average by a single client.

In the test performed, each of the 14,000 map tasks writes (reads) 360 MB (about 3 blocks) of data into (from) a single file, for a total of 5.04 TB for the whole job.
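
To make the measurement concrete, here is an illustrative sketch of what a single DFSIO write map does: create one HDFS file, stream a fixed amount of data to it, and time the i/o. This is a simplification under our own assumptions, not the actual TestDFSIO source; the path, buffer size, and reporting are made up for the example.

    // Illustrative sketch of a DFSIO-style write map (not the real TestDFSIO code):
    // create one HDFS file, write a fixed number of megabytes, and record the
    // elapsed i/o time. The single reducer then averages the per-task sizes and
    // times to report the mean per-client throughput.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsioWriteSketch {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/benchmarks/sketch/part-0000");  // hypothetical path
        byte[] buffer = new byte[1 << 20];                     // 1 MB write buffer
        long megabytes = 360;                                  // 360 MB per task, as in the test

        long start = System.currentTimeMillis();
        FSDataOutputStream out = fs.create(file, true);
        try {
          for (long mb = 0; mb < megabytes; mb++) {
            out.write(buffer);                                 // timed i/o
          }
        } finally {
          out.close();
        }
        long millis = System.currentTimeMillis() - start;

        // Per-client throughput in MB/s; the reduce step averages these values.
        System.out.println("throughput: " + (megabytes * 1000.0 / millis) + " MB/s");
      }
    }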

The table below compares the 4,000-node cluster performance with one of our 500-node clusters.

Table 1. Throughput
                          500-node cluster         4000-node cluster
                          write       read         write        read
number of files           990         990          14,000       14,000
file size (MB)            320         320          360          360
total MB processed        316,800     316,800      5,040,000    5,040,000
tasks per node            2           2            4            4
avg. throughput (MB/s)    5.8         18           40           66

The 4000-node cluster's throughput was 7 times better than the 500-node cluster's for writes and 3.6 times better for reads, even though the bigger cluster carried a heavier per-node load (4 vs. 2 tasks per node) than the smaller one.

Map-Reduce

The primary area of concern was the JobTracker and how it would react to this scale (we had never subjected the JobTracker to heartbeats flowing in from 4000 tasktrackers since it isn't a very common use-case when we use HoD). We were also concerned about the JobTracker's memory usage as it serviced thousands of user-jobs.

The initial results were slightly worrisome - GridMix, the standard benchmark, took nearly 2 hours to complete and we lost a fairly large number of tasktrackers since the JobTracker couldn't handle them. For good measure, we couldn't run a 6TB sort either; we kept losing tasktrackers. (We routinely run sort benchmarks which sort 1TB, 5TB and 9TB of data.)

Of course, the brand-new hardware didn't help, since we kept losing disks; nor did the misconfigured network, which let us use only 4 of the 8 uplinks available from each rack to the backbone (effectively cutting the available bandwidth in half). On the bright side, memory usage didn't seem to be a problem and the JobTracker stood up to thousands of user jobs without issues.

We then went in armed with the YourKit(TM) profiler - we needed to peek into the JobTracker's guts while it was faltering. This basically meant going through the CPU/memory/monitor profiles of the JobTracker with a fine-toothed comb. To cut a long story short, here are some of the curative actions we took based on those observations:
HADOOP-3863 - Fixed a bug which caused extreme contention for a single, global lock during serialization of Java strings for Hadoop RPCs.
HADOOP-3848 - Cut down wasteful RPCs during task-initialization.
HADOOP-3864 - Fixed locks in the JobTracker during job-initialization to prevent starvation of tasktrackers' heartbeats which caused the huge number of 'lost tasktrackers'.
HADOOP-3875 - Fixed tasktrackers to gracefully scale their heartbeat intervals when the JobTracker is under duress (see the sketch after this list).
HADOOP-3136 - Assign multiple tasks to the tasktrackers during each heartbeat; this significantly cuts down the number of heartbeats in the system.
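
As a rough illustration of the heartbeat-scaling idea behind HADOOP-3875 and HADOOP-3136, the sketch below shows one way a tasktracker's heartbeat interval could grow with cluster size so the JobTracker sees a bounded heartbeat rate. The constants and formula are assumptions chosen for illustration, not the actual patches.

    // Illustrative sketch only: scale the heartbeat interval with cluster size
    // so the JobTracker handles at most a fixed number of heartbeats per second.
    // The budget and formula are assumptions, not the HADOOP-3875 implementation.
    public class HeartbeatIntervalSketch {

      private static final int MIN_INTERVAL_MS = 3000;       // never faster than 3 s
      private static final int HEARTBEATS_PER_SECOND = 100;  // assumed JobTracker budget

      /** Interval grows linearly with cluster size once the budget is exceeded. */
      static int heartbeatInterval(int numTaskTrackers) {
        int scaled = (int) Math.ceil(numTaskTrackers * 1000.0 / HEARTBEATS_PER_SECOND);
        return Math.max(MIN_INTERVAL_MS, scaled);
      }

      public static void main(String[] args) {
        for (int trackers : new int[] {500, 2000, 4000}) {
          System.out.println(trackers + " trackers -> " + heartbeatInterval(trackers) + " ms");
        }
      }
    }

Under the assumed budget of 100 heartbeats per second, 4,000 trackers would stretch the interval to 40 seconds, which is one reason packing more work into each heartbeat (HADOOP-3136) matters.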

The result of these improvements (sans HADOOP-3136 which wasn't ready in time):
1. GridMix came through in slightly under an hour - a significant improvement from where we started.
2. The sort of 6TB of data completed in 37 minutes.
3. We had the cluster run more than 5000 Map-Reduce jobs in a window of around 6 hours and the JobTracker came through without any issues.

Overall, the results are very reassuring with respect to the ability of Hadoop to scale out. Of course we have only scratched the surface and have miles to go!

Konstantin V Shvachko
Arun C Murthy
Yahoo!