A summary of Shelton Shugar’s talk: “Accelerating Innovation with Cloud Computing”

Shelton Shugar delivered an excellent keynote address, “Accelerating Innovation with Cloud Computing,” to open the 4th Cloud Conference and Expo in Santa Clara yesterday.

Shugar, Yahoo!'s SVP of Cloud Computing, started off by clarifying that Yahoo! isn't selling anything at the Expo; the company is not into consulting or selling software. At Yahoo!, cloud computing is not about saving money. Yahoo!’s motivation is rooted in the fact that cloud computing drives innovation. Yahoo! has hundreds of products and platforms all over the world. Many of these products were the result of acquisitions, and so they came with their own infrastructure, down to the metal. Cloud computing at Yahoo! is about streamlining the services these products and platforms require. Yahoo! stores hundreds of petabytes of data all over the world, and handles petabytes of internet traffic daily. Scale is of utmost importance.

Yahoo!’s Cloud Strategy

Yahoo! is building a private cloud, deployed in data centers world-wide. Yahoo!’s cloud strategy focuses on two areas: data processing and serving. Data processing refers to data mining and analysis. Serving refers to application environments for Yahoo!’s products, edge capabilities for fast delivery, and a channel for data to flow into storage. This is a multi-year effort. Open source projects play a “central role” in this strategy; Yahoo! consumes and produces them.

Looking Inside the Yahoo! Cloud

The Yahoo! cloud has five components:

  1. Edge services
  2. Cloud serving, where applications are hosted
  3. Online storage for serving content to consumers
  4. A batch-processing data warehouse
  5. Data collection services to filter and de-duplicate incoming data, and block abusive requests

Edge serving is based on the Yahoo! Traffic Server (YTS). Over half of all Yahoo! traffic flows through YTS.

The application serving layer is based on a tiered architecture. Applications can be cloned. Traffic can be split natively, which allows for bucket testing. Developers are freed from having to worry about versions of web serving software, locations of machines, etc. Capacity can be adjusted easily.
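As a rough illustration of how traffic splitting supports bucket testing, here is a minimal Python sketch. It is not Yahoo!’s implementation; the function name `assign_bucket`, the experiment label, and the bucket weights are all hypothetical. The idea shown is simply that hashing a user ID gives each user a stable position between 0 and 1, so the same user always lands in the same bucket while the overall traffic follows the configured split.

```python
import hashlib


def assign_bucket(user_id: str, experiment: str, weights: dict[str, float]) -> str:
    """Deterministically assign a user to one of the weighted buckets.

    Hypothetical helper for illustration; names and weights are assumptions.
    """
    if not weights:
        raise ValueError("at least one bucket is required")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("bucket weights must sum to a positive number")

    # Hash the experiment and user together so that each experiment
    # splits the same population of users independently.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64

    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return name
    return name  # floating-point rounding guard: fall back to the last bucket
```

A call such as `assign_bucket("user-42", "new-homepage", {"control": 0.9, "variant": 0.1})` would send roughly one user in ten to the variant, and a given user sees the same version on every visit.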

Online storage uses RESTful APIs. It’s deployed worldwide. Global replication is supported natively. Multiple consistency models are provided. MObStor (mass object store) is used to store large objects (1 MB to 2 GB) such as images and video. Objects are immutable. Structured content is provided via a product called Sherpa, a key-value store. Sherpa is intended to support enough of the capabilities for which developers currently depend on relational databases.

Batch processing is oriented around Hadoop. Hadoop has been running for a few years, and now operates on tens of thousands of machines. Yahoo!’s Hadoop grid stores over 80PB of data. Yahoo! uses it to optimize advertising, process weblogs, etc. Thousands of Yahoos are trained to run jobs on this grid. The Hadoop Distributed File System (HDFS) allows thousands of computers to be treated as a single machine. Pig is a higher-level procedural language that generates MapReduce code. It’s almost as efficient as well-written MapReduce code, though the Pig team jokes that most people don’t write well-written MapReduce code. Yahoo! is building columnar storage.
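For readers unfamiliar with the MapReduce model that Hadoop runs and Pig generates, the following is a small single-machine sketch in plain Python. It does not use Hadoop or Pig; the function names and the simplified log format (`METHOD PATH STATUS`) are assumptions made for illustration. It counts hits per URL path in a weblog, the kind of job a few lines of Pig (a group followed by a count) would express.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_hits(lines: Iterable[str]) -> Iterator[tuple[str, int]]:
    """Map phase: emit (path, 1) for every well-formed log line."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:  # skip malformed lines
            yield parts[1], 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle phase: group all emitted values by key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_hits(groups: dict[str, list[int]]) -> dict[str, int]:
    """Reduce phase: sum the values collected for each key."""
    return {path: sum(counts) for path, counts in groups.items()}


def count_hits(lines: Iterable[str]) -> dict[str, int]:
    """Run the three phases in sequence on one machine."""
    return reduce_hits(shuffle(map_hits(lines)))
```

On a real grid, the map and reduce phases run in parallel across many machines and the framework performs the shuffle; the sketch only shows how the work is divided.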

Examples

The Yahoo! homepage

When a user visits the Yahoo.com homepage, they are interacting with Yahoo! cloud services. The popular stories to display are selected using a feedback loop involving several cloud components. Hadoop is used to optimize ad-matching and build the search index. Edge services are used to cache and load-balance the page content and normalize the news feeds.

Yahoo! Mail

Yahoo! Mail uses Hadoop to identify and filter spam. Before Hadoop, mail engineers had to spend lots of time maintaining storage and machines to process an enormous amount of data. Hadoop abstracts scale, handles failures, and manages multiple users. This allows the scientists and engineers to focus on their jobs. Yahoo! Mail uses cloud storage’s replication services to help detect abuse.

Yahoo! Sports

People want to find game scores as fast as possible. Edge services provide a proxy service to route requests for dynamic content, allowing Yahoo! Sports to provide the most up-to-date content.

Yahoo! Finance

Yahoo! has the most popular finance page on the Web. Yahoo! Finance uses Hadoop to speed advertising by optimizing resource utilization.

Yahoo! Query Language (YQL)

YQL is an SQL-like language. It allows developers to query, filter, and join data from across the Web. YQL uses Sherpa instead of managing its own database.

Open Source at Yahoo!

Yahoo! contributes the code it produces for Hadoop back to the Apache open source community. External developers benefit and contribute back, which in turn benefits Yahoo!. Pig is open source. ZooKeeper, a utility Yahoo! uses to coordinate multiple systems, is also open source. Yahoo! is a member of Open Cirrus, a consortium designed to facilitate research in cloud computing. The consortium is composed of nine member companies. Yahoo!’s contribution is m45, a thousand-core cluster. Yahoo! works with some of the leading universities in the world, including the University of California, Berkeley and Carnegie Mellon University. Yahoo! has built an enormous community around Hadoop. As a result of its investment in open source, Yahoo! can now hire people directly out of university to work in several areas of cloud computing. Open source attracts the best and the brightest.

Shugar’s announcement that Yahoo! is open-sourcing its Traffic Server, now an Apache Incubator project, was a highlight of his keynote. Traffic Server can process up to 35,000 transactions per second on commodity hardware. It’s modular and forms the basis for Yahoo!’s caching, proxying, load balancing, routing, etc. Yahoo! pushes 400TB through it daily. Yahoo! hopes to support a vibrant community around use of the Traffic Server, as it did with Hadoop. A recent GigaOm post on OStatic gives more information about the project.

Back in June, Yahoo! announced the Yahoo! distribution of Hadoop. Yahoo! selects only the components it needs and tests them well. It's a solid collection of code that’s been proven to work. Yahoo! will be releasing an update shortly.

Change

At the close, Shugar reminded the audience that Yahoo! is fully committed to cloud computing, but “moving to the cloud requires change.” If your organization is like Yahoo!, with lots of legacy systems, you’ll need to make a large organizational commitment, more like a marriage than a transaction. It takes time and investment to create cloud services and migrate to them. Yahoo!’s migration is a multi-year effort, but cloud computing is worth it. Developers are able to deploy much faster than before. It’s changing the company culture.

---

Erik Eldridge

Engineer/Evangelist

Yahoo! Developer Network