Yahoo! at Hadoop World in New York

As the world's largest user of and contributor to Hadoop, Yahoo is
excited to be sponsoring and presenting at the upcoming Hadoop World
in New York City on Friday, October 2, 2009. Yahoo has been using
Hadoop since the beginning of 2006, and we have built up our Hadoop
clusters from 20 machines to a current total of more than 24,000.

Eric Baldeschwieler, the VP of Hadoop Development, will present how
we've grown Hadoop into Yahoo's primary batch data analysis
platform. Hadoop at Yahoo supports complex data analysis and mining,
display advertising, our content platforms, personalization, filtering
email spam, and continuing research in improving our products. Not
only has Hadoop reduced development time across a wide range of data
analysis projects, it has increased access to data by removing data
silos, and enabled projects that would have been previously impossible.

Owen O'Malley, an
architect on the Yahoo Hadoop team and the Apache VP of Hadoop, will
present our upcoming efforts in two emerging areas of Hadoop development:
security and backwards compatibility. As Hadoop becomes a central part of
Yahoo's data analysis platform, confidential data will be stored on the
Hadoop clusters. The current
"friendly" security in Hadoop prevents accidents, but
doesn't slow down anyone trying to work around it. To address
this, we are integrating Hadoop with Kerberos, using strong
authentication to ensure that all users of Hadoop are who they claim
to be and that their access is limited appropriately. On the other front,
Hadoop, as with all quickly growing projects, has had incompatible API
and protocol changes in each major version. This requires application
writers to change and recompile their applications and update the
client versions whenever a new version of Hadoop is deployed to the
cluster. Starting in the upcoming release of Hadoop 0.21, we're
annotating the APIs with the intended audience of the interface
(public, limited, private) and the stability of the interface (stable,
evolving, unstable) and guaranteeing that users of the public stable
interfaces will run without a recompilation on new versions of
Hadoop. Yahoo also started the Hadoop Avro project that will let us
accommodate different versions of clients connecting to the same
server. All of these threads are leading Hadoop toward a 1.0 release.

Viraj Bhat, an engineer on Yahoo's Hadoop Solutions team, will
present his work on Vaidya, which is a contrib project in Hadoop
MapReduce. MapReduce hides many details of parallelization,
fault-tolerance, data distribution and load balancing to simplify
application development. However, tuning the performance of
individual jobs with different data processing and resource
utilization characteristics is a significant challenge, even for
seasoned parallel programmers. Hadoop Vaidya is an extensible,
rule-based performance diagnostic tool for MapReduce jobs. It performs a
post-execution analysis of map/reduce jobs by parsing and collecting
their execution statistics from job history and job configuration
files. It runs these inputs against a set of predefined tests/rules to
diagnose various performance problems and provides targeted advice
to the users through XML reports. At Yahoo, we use Vaidya to analyze
thousands of MapReduce jobs running daily on our clusters to detect
potential performance improvements.
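The rules and thresholds below are illustrative only, not Vaidya's actual rule set, but a minimal Python sketch of this style of rule-based, post-execution analysis might look like:

```python
# Hedged sketch of rule-based post-execution diagnosis in the spirit
# of Hadoop Vaidya: each rule inspects parsed job statistics and
# returns advice when a threshold is exceeded. The counter names and
# thresholds here are made up for illustration.

def check_spilled_records(stats):
    # Heavy spilling suggests the map-side sort buffer is too small.
    if stats["spilled_records"] > 2 * stats["map_output_records"]:
        return "Maps spill heavily; consider a larger sort buffer."

def check_reduce_skew(stats):
    # A slowest reduce far above the average points to skewed keys.
    if stats["max_reduce_time"] > 3 * stats["avg_reduce_time"]:
        return "Reduce times are skewed; check the key distribution."

RULES = [check_spilled_records, check_reduce_skew]

def diagnose(stats):
    """Run every rule against one job's statistics; collect advice."""
    advice = []
    for rule in RULES:
        msg = rule(stats)
        if msg:
            advice.append(msg)
    return advice

job_stats = {
    "spilled_records": 5_000_000,
    "map_output_records": 2_000_000,
    "max_reduce_time": 900,
    "avg_reduce_time": 200,
}
for line in diagnose(job_stats):
    print(line)
```

Because the rules are data-driven checks against collected statistics, adding a new diagnostic is just adding a new rule, which is what makes this style of tool extensible.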

Jake Hofman, a research scientist at Yahoo, will present his
work on developing a large-scale network analysis package. Over the
last several years there has been a rapid increase in the number,
variety, and size of readily available (social) network data. As
such, there is a growing demand for software solutions that enable one
to extract relevant information from these data, often
leveraging tools from network analysis. For sufficiently large
networks (with, e.g., tens or hundreds of millions of nodes)
distributed solutions are often necessary, as the storage and memory
constraints of single machines are prohibitive. The new package
enables such calculations on standard Hadoop clusters. A high-level
overview of the package will be provided, followed by a discussion of
algorithms for calculating node-level features in the map/reduce
framework. He will demonstrate the package on several real-world
networks and discuss the use of the calculated network features for
predictive modeling tasks.
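As a loose illustration of computing a node-level feature in the map/reduce style, here is a toy Python sketch that computes node degrees from an edge list; this is not Jake's package, and the map and reduce functions run locally rather than on a cluster:

```python
# Hedged sketch: degree, one of the simplest node-level features,
# computed map/reduce style. On a real Hadoop cluster this would be
# a streaming or Java job; here the phases run locally on a toy
# undirected edge list.
from itertools import groupby

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

def mapper(edge):
    # Emit a count of 1 for each endpoint of the edge.
    u, v = edge
    yield (u, 1)
    yield (v, 1)

def reducer(node, counts):
    # Sum the per-edge emissions to get the node's degree.
    return (node, sum(counts))

# "Shuffle": sort map output by key, then group values per key.
pairs = sorted(kv for e in edges for kv in mapper(e))
degrees = dict(reducer(node, (c for _, c in grp))
               for node, grp in groupby(pairs, key=lambda kv: kv[0]))
print(degrees)  # {'a': 2, 'b': 2, 'c': 3, 'd': 1}
```

The same map-emit/shuffle/reduce-aggregate pattern scales to networks with hundreds of millions of nodes, since no single machine ever needs to hold the whole graph.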

For those of you who haven't registered yet, a discount
code is available. Also come and join us in a more informal session of
lightning talks the previous night. I hope to see you all there!

-- Owen O'Malley