We are pleased to present a new result on computing specific bits of π, the mathematical constant. The bits, represented in hexadecimal, are
0E6C1294 AED40403 F56D2D76 4026265B CA98511D
0FCFFAA1 0F4D28B1 BB5392B8.
These 256 bits end at the 2,000,000,000,000,252nd bit position, which doubles the previous known record. The first bit is at position 1,999,999,999,999,997, and the value of the two-quadrillionth bit is 0. To the best of our knowledge, this result, obtained by the Yahoo! Cloud Computing Team, is a new world record as this article is being written.
How did we get the result?
As you may know, Apache Hadoop is a very important part of our data infrastructure here at Yahoo!, and we tinker with it daily. With Hadoop, we can connect thousands of servers to process and analyze data in parallel at supercomputing speed. While the cluster I used for this project has 1,000 machines, our largest Hadoop cluster has 4,000 machines, and we're continuing to scale the software.
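For a flavor of the mathematics, digit-extraction formulas of the Bailey-Borwein-Plouffe (BBP) family make it possible to compute hexadecimal digits of π starting at an arbitrary position without computing any of the preceding digits. The following is a minimal, single-machine sketch of that idea in Java. It is illustrative only: the class and method names are ours, and the actual record computation used a more refined formula and ran the series evaluation as distributed Hadoop tasks.

```java
// Single-machine sketch of BBP-style hex-digit extraction for pi.
// Illustrative only: class and method names are ours, and the record run
// distributed the series terms across a Hadoop cluster with further
// optimizations that are not shown here.
public class PiHexDigits {

  // Fractional part of sum_{k >= 0} 16^(d - k) / (8k + j), computed with
  // modular exponentiation so only small numbers are ever kept in memory.
  private static double series(int j, long d) {
    double sum = 0.0;
    for (long k = 0; k <= d; k++) {
      long denom = 8 * k + j;
      sum += (double) modPow(16, d - k, denom) / denom;
      sum -= Math.floor(sum);
    }
    for (long k = d + 1; k <= d + 16; k++) {   // tail terms, 16^(d-k) < 1
      sum += Math.pow(16.0, d - k) / (8 * k + j);
      sum -= Math.floor(sum);
    }
    return sum;
  }

  // (base^exp) mod m by repeated squaring.
  private static long modPow(long base, long exp, long m) {
    long result = 1 % m;
    long b = base % m;
    while (exp > 0) {
      if ((exp & 1) == 1) result = result * b % m;
      b = b * b % m;
      exp >>= 1;
    }
    return result;
  }

  // Hex digit of pi at 0-indexed position d after the hexadecimal point.
  static char hexDigitAt(long d) {
    double x = 4 * series(1, d) - 2 * series(4, d) - series(5, d) - series(6, d);
    x -= Math.floor(x);
    return Character.toUpperCase(Character.forDigit((int) (16 * x), 16));
  }

  public static void main(String[] args) {
    StringBuilder digits = new StringBuilder();
    for (long d = 0; d < 8; d++) {
      digits.append(hexDigitAt(d));
    }
    System.out.println(digits);   // 243F6A88, the first hex digits of pi
  }
}
```

Because each term of the series can be evaluated independently, the work splits naturally into disjoint ranges of terms whose partial results are combined at the end, which is exactly the kind of embarrassingly parallel workload MapReduce is built for.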
Thanks to the roughly 175 developers who came to Yahoo! recently for our monthly Hadoop User Group meeting. The energy in the packed room was phenomenal, and conversations continued long after the formal sessions.
Hundreds of Hadoop Fans Flock to Yahoo! for the Hadoop User Group
The event started with Arun Murthy from Yahoo! describing best practices for developing MapReduce applications. Arun introduced the concept of a Grid Pattern which, similar to a Design Pattern, represents a general, reusable solution for applications running on the Grid. Finally, Arun talked about the anti-patterns of applications running on Apache Hadoop clusters.
Next, in his "Social Media: What's Really the Buzz?" talk, Stefan Groschupf, the co-founder and CTO of Datameer, discussed the challenges of social media analytics and how to overcome them with big data analytics built on Hadoop. The demo was very helpful in visualizing the true thought leaders and influencers in social media. (Read more: "August HUG Recap")
Apache Hadoop is a software framework for building large-scale shared storage and computing infrastructures. Hadoop clusters are used for a variety of research and development projects, and for a growing number of production processes at Yahoo!, eBay, Facebook, LinkedIn, Twitter, and other companies in the industry. It is a key component of several business-critical endeavors and represents a very significant technology investment. Thus, appropriate use of the clusters, and of Hadoop itself, is critical to ensuring that we reap the best possible return on this investment.
This blog post presents a compendium of best practices for applications running on Apache Hadoop. In fact, we introduce the notion of a Grid Pattern which, similar to a Design Pattern, represents a general, reusable solution for applications running on the Grid.
The post also enumerates characteristics of well-behaved applications and provides guidance on appropriate uses of various features and capabilities of the… (Read more: "Apache Hadoop: Best Practices and Anti-Patterns")
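To make the idea of a best practice concrete, here is one hypothetical illustration of our own, not an example lifted from the post: registering a Combiner so that map output is pre-aggregated on each node before the shuffle, which can sharply reduce the intermediate data sent across the network.

```java
// Hypothetical illustration of a common Hadoop best practice: a word-count
// job that registers a Combiner so map output is summed locally before the
// shuffle, cutting the data sent across the network.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // one record per token
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountWithCombiner.class);
    job.setMapperClass(TokenMapper.class);
    // The reducer doubles as a combiner: partial sums merge safely.
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that the reducer can double as the combiner here only because summing counts is associative and commutative; reusing a reducer whose logic lacks those properties is itself a classic anti-pattern.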
Yahoo! has begun evaluating Hive for use as part of its Hadoop stack. Since, in many people's minds, Hive and Pig are roughly equivalent and Pig Latin looks quite similar to SQL, this has led to some confusion. Why are we interested in using both technologies?
As we have looked at our workloads and analyzed our use cases, we have come to the conclusion that different use cases require different tools. In this post, I will walk through our thinking on why both of these tools belong in our toolkit and when each is appropriate.
Data preparation and presentation
Let me begin with a little background on processing and using large data sets. Data processing often splits into three separate tasks: data collection, data preparation, and data presentation. I will not discuss the data collection phase, because I want to focus on Pig and Hive, neither of which plays a role in that phase.
The data preparation phase is often known as ETL (Extract, Transform, Load) or the data factory. "Factory" is a good… (Read more: "Pig and Hive at Yahoo!")
I'm excited to invite you to the next Hadoop Bay Area User Group meetup, on August 18th at 6 p.m., at the Yahoo! Sunnyvale campus.
The Hadoop community is growing — we had more than 200 attendees at the last meetup.
We invite you to attend whether you are an active contributor, a developer of Hadoop-based applications, or completely new to the Apache Hadoop world. In addition to interesting presentations, you will enjoy food, beer, and great networking.
The August 18 event comprises two sessions:
- Arun Murthy, from the Hadoop team at Yahoo!, will present a compendium of best practices for applications running on Apache Hadoop. He will introduce the notion of a Grid Pattern which, similar to a Design Pattern, represents a general, reusable solution for applications running on the Grid, and he will also cover the anti-patterns of applications running on Apache Hadoop clusters. Arun will enumerate characteristics of well-behaved applications and provide guidance on appropriate uses of…
Thanks to the more than 200 developers who came to Yahoo! recently for our monthly Hadoop User Group meeting. The energy in the packed room was phenomenal, and conversations continued long after the formal sessions.
Hundreds of Hadoop Fans Flock to Yahoo! for the Hadoop User Group
The event started with Nitin Motgi from Yahoo! describing the challenge of content optimization at scale and how Yahoo! is leveraging the power of the Hadoop stack to conquer this challenge. Hadoop, along with Hive and HBase, is the technology that enables mass content personalization for Yahoo! users. Nitin talked about the high-level architecture and the modeling challenges, and he concluded with a summary of the lessons learned so far.
Anil Madan from eBay discussed eBay's adoption of the Hadoop stack and its plans for further adoption. He also touched on how eBay is leveraging its Hadoop clusters to enhance search relevance and extend catalog coverage. He described in detail eBay's data-sourcing… (Read more: "July HUG Recap")
The Problem of Many Small Files
The Hadoop Distributed File System (HDFS) is designed to store and process large (multi-terabyte) data sets. At Yahoo!, for example, a large production cluster may have 14 PB of disk space and store 60 million files.
However, storing a large number of small files in HDFS is inefficient. We call a file small when its size is substantially less than the HDFS block size, which is 128 MB by default. Files and blocks are namespace objects in HDFS, and they occupy namespace. The namespace capacity of the system is naturally limited by the physical memory of the NameNode.
When many small files are stored in the system, they occupy a large portion of the namespace. As a consequence, the disk space is underutilized because of the namespace limitation: the NameNode runs out of namespace capacity long before the cluster runs out of disk. In one of our production clusters, there are 57 million files of sizes less than 128 MB, which means that these files contain only one… (Read more: "Hadoop Archive: File Compaction for HDFS")
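To sketch how this plays out in practice (the paths below are made up for illustration), a Hadoop Archive created with the hadoop archive command-line tool packs many small files into a few large HDFS files, yet it remains browsable file-by-file through the har:// scheme, so existing FileSystem-based code keeps working while the NameNode only tracks the handful of files and blocks that make up the archive itself.

```java
// Hypothetical sketch: reading files back through the "har" filesystem scheme.
// The paths are made up; the archive is assumed to have been created
// beforehand with the "hadoop archive" tool.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHarContents {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The archive bundles many small files into a few HDFS blocks, but it is
    // still browsable file-by-file through the har:// scheme.
    Path archived = new Path("har:///user/example/logs.har/2010/08");
    FileSystem fs = archived.getFileSystem(conf);
    for (FileStatus status : fs.listStatus(archived)) {
      System.out.println(status.getPath() + "\t" + status.getLen());
    }
  }
}
```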
At Yahoo!, we recently implemented a stronger notion of security for the Hadoop platform, based on Kerberos as the underlying authentication system. We have also successfully enabled this feature on our internal data processing clusters within Yahoo!. I am sure many Hadoop developers and enterprise users are looking forward to getting hands-on experience with this enterprise-class Hadoop security feature.
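As a hypothetical sketch of what hands-on use can look like from the client side (the principal, keytab path, and directory below are placeholders), a Java client can authenticate to a Kerberos-secured cluster from a keytab before making HDFS calls. It assumes the cluster's configuration files are on the classpath, as they are inside the VM appliance described below.

```java
// Hypothetical sketch of a client talking to a Kerberos-secured cluster.
// Principal, keytab path, and directory are placeholders, not real values.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureHdfsClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell the Hadoop client libraries that the cluster expects Kerberos.
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Log in from a keytab instead of relying on an interactive kinit.
    UserGroupInformation.loginUserFromKeytab(
        "someuser@EXAMPLE.COM", "/home/someuser/someuser.keytab");

    // From here on, HDFS calls carry the Kerberos credentials.
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus status : fs.listStatus(new Path("/user/someuser"))) {
      System.out.println(status.getPath());
    }
  }
}
```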
In the past, we've helped developers and users get started with Hadoop by hosting a comprehensive Hadoop tutorial on YDN, along with a pre-configured single-node Hadoop (0.18.0) virtual machine appliance.
This time, we decided to upgrade this Hadoop VM to a pre-configured single-node Hadoop 0.20.S cluster, along with the required Kerberos system components. We have also included Pig (version 0.7.0), a high-level data processing language used at Yahoo!.
This blog post describes how to get started with the Hadoop 0.20.S VM appliance. The basic information about downloading and setting up the VM… (Read more: "Hadoop 0.20.S Virtual Machine Appliance")