Hadoop at Yahoo! Sets New Gray Sort Record – The Yellow Elephant is Getting Faster

hadoop-elephantWe are proud to announce we used Apache Hadoop to set a new Gray sort record for the Jim Gray's Sort benchmark. We nearly doubled the rate of the previous Gray sort entry by sorting at a rate of 1.42 Terabytes per minute. The previous record was 0.725 Terabytes per minute.

Jim Gray's sort benchmark consists of a set of many related benchmarks, each with their own rules. All of the sort benchmarks measure the time to sort different numbers of 100 byte records. The first 10 bytes of each record is the key and the rest is the value. The Gray sort is to measure the sort rate achieved while sorting at least 100 terabytes of data. The Minute sort is the amount of data that can be sorted in less than a minute. There are two different benchmark categories. The Daytona category requires the sort code to be general purpose sort. The Indy category needs to only sort 100-byte records with 10-byte keys. We used Hadoop Terasort with slightly different configurations in both categories.

There were some new rules this year. The biggest rule change is that your sort is required to sort both skewed and non-skewed data for the Daytona benchmark category. The skewed data has to be sorted in no more than twice the elapsed time of the non-skewed data. Other changes include: the input and output data must persist in the case of a single node failure and none of the data can be compressed (input, intermediate, or output)

Results

Our only official entry was to Gray sort, but we also unofficially (we didn’t get the submission in by the deadline - learned that lesson!) broke the previous Minute Sort record. The full report can be found on the sort benchmark page under the Gray sort results.

Benchmark Data Type Amount of Data Sorted Time Rate Duplicate Keys
Daytona
GraySort
non-skewed 102.5 TB 72 minutes
8.053
seconds
1.42 TB
per
minute
0
Daytona
GraySort
skewed 102.5 TB 117 minutes
48.261
seconds
0.87 TB
per
minute
31867643140
Indy
MinuteSort
non-skewed 1612.22 GB 58.027
seconds
0
Daytona
MinuteSort
non-skewed 1497.86 GB 59.223
seconds
0
Daytona
MinuteSort
skewed 1497.86 GB 1 minute
27.242
seconds
0

Software, Hardware, and Operating System

The version of Hadoop used was Hadoop 0.23.7. Hadoop 0.23.7 is an early branch of the Hadoop 2.X line that Yahoo! has used to stabilize YARN. It is available for download at hadoop.apache.org.

The hardware and operating system details are:

  • Approximately 2100 nodes for GraySort and 2200 nodes for MinuteSort
  • System: Dell R720xd, 2 x Xeon E5-2630 2.30GHz, 62.3GB / 64GB 1333MHz DDR3, 12 x 3TB SATA
  • Processors: 2 x Xeon E5-2630 2.30GHz, 7.2GT QPI (HT enabled, 12 cores, 24 threads) - Sandy Bridge-EP C2, 64-bit, 6-core, 32nm, L3: 15MB
  • OS: RHEL Server 6.3, Linux 2.6.32-279.19.1.el6.YAHOO.20130104.x86_64 x86_64, 64-bit
  • Network: eth0 (bnx2x): 10Gb/s <full-duplex>
  • 40 nodes/rack 160Gbps rack to spine. 2.5:1 subscription.
  • Oracle JDK 1.7 (u17) - 64 bit

Biography

Thomas Graves is a software developer at Yahoo! and a Hadoop PMC member at the Apache Software Foundation.

Nathan Roberts is a Hadoop architect at Yahoo!.

Balaji Narayanan and Rajiv Chittajallu are on the Grid Operations team at Yahoo!.