Open Source Bridge 2009

General Notes

The Open Source Bridge conference was held at the Oregon Convention Center in Portland from June 17 to June 19. Sessions covered a range of topics, from building and growing open source businesses to yoga and meditation, but the focus was decidedly technical, with some great sessions on a number of different OSS projects at varying levels of detail. The first two days featured talks of interest to the Open Source community, while the last day was an unconference. After each day was done, the Yahoo! Developer Network crew hosted a 24x7 hacker lounge with WiFi, to alleviate midnight hacking withdrawal symptoms.

I had the opportunity to attend an eclectic mix of sessions, and a few common threads emerged. As a Product Manager in the cloud computing group at Yahoo!, I'll focus on subjects that relate to the cloud, although there was no shortage of interesting discussion on a wide range of other subjects too.

The Cloud

People in the Open Source Community are interested in cloud computing, but are also quite skeptical of it. Many of them see the cloud as a trick by hosting providers to make more money or to create buzz around existing technologies. (Quotable quote: "Cloud Computing right now is a 1000 mph bullet train of hype.") Others are excited by its potential, while still skeptical of its existing benefits. A third group, in the minority, is very excited about cloud technology.

Two of the sessions I attended, "Virtualize vs. Containerize: Fight!" and "Bridging the Developer and the Data Center", had the most interesting discussions around the cloud, although the subject came up in other sessions too.

Caching as the key to scalability

Whether applications are built in the cloud, in a private datacenter, or running off a box under your desk, the best way to make an application scale better is to find ways to limit the amount of work needed to serve a request. As dynamic, customized applications become the norm, caching becomes more important - and harder to do.

Caching can happen at multiple tiers:
*Caching proxies can be used to minimize the number of requests that hit an application server generating dynamic pages.
*Portions of a page may be static while others are dynamic. Caching page fragments can yield significant benefits.
*Opcode caching speeds up requests served by server-side scripts by caching the compiled code that executes to serve the request. (For example, JSP does this by default by compiling pages into classes and letting the JVM's JIT compiler optimize and cache the resulting code.)
*Data and query results can be cached to reduce the number of database round trips, but this needs to be done carefully to avoid serving stale data. A minimal sketch of this pattern follows the list.
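
To make the data-caching point concrete, here is a minimal sketch of the get-or-compute pattern in Python. The names and the in-process dict are illustrative only; in practice the cache would usually be a shared service such as memcached.

    import time

    _cache = {}  # key -> (expires_at, value)

    def cached_query(key, compute, ttl=30):
        """Return a cached value for `key`, recomputing it only after `ttl` seconds."""
        now = time.time()
        entry = _cache.get(key)
        if entry and entry[0] > now:
            return entry[1]                # cache hit: skip the database entirely
        value = compute()                  # cache miss: run the expensive query
        _cache[key] = (now + ttl, value)
        return value

    # Hypothetical usage: in a real application `compute` would run the expensive
    # SQL query; here it's a stand-in that just returns a value.
    top_stories = cached_query("top_stories", lambda: ["story 1", "story 2"], ttl=30)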

Most modern web platforms provide opcode caching out of the box. Couple this with intelligent application design, and you can make large portions of your website cacheable within the application server, significantly reducing the resources needed to serve a request. Once you've optimized the application itself, you can add a caching proxy in front of it to cut the number of requests that reach the application stack at all, which can dramatically improve a web application's ability to scale. Even caching large portions of a page for a few seconds delivers a big boost in performance and scale, because all but the most dynamic sites rarely change multiple times a second.
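
One simple way to put a caching proxy to work is to have the application declare, via HTTP headers, how long a response may be cached. Here is a small, hypothetical WSGI sketch that marks a mostly-static page as cacheable for five seconds so a proxy such as Squid or Varnish can absorb repeated requests:

    def render_homepage():
        # Stand-in for the real page-rendering logic.
        return "<html><body>mostly static front page</body></html>"

    def homepage(environ, start_response):
        start_response("200 OK", [
            ("Content-Type", "text/html; charset=utf-8"),
            # Even a 5-second TTL lets a proxy serve repeated requests from its
            # cache instead of hitting the application server.
            ("Cache-Control", "public, max-age=5"),
        ])
        return [render_homepage().encode("utf-8")]

    if __name__ == "__main__":
        # Serve locally for testing; a caching proxy would sit in front of this.
        from wsgiref.simple_server import make_server
        make_server("", 8000, homepage).serve_forever()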

Once you've hit that barrier, it's time to look into more advanced technology to scale.

A new approach to data on the Web

Some of the most interesting sessions revolved around data and the web, and how the web changes assumptions about what data stores should and should not do.

Rethinking Web Databases

Brian Aker led an excellent session on Drizzle, a reworked microkernel database derived from MySQL and designed specifically for scale and concurrency - in other words, for the cloud. Although the code base comes from MySQL, it is organized around components that can be extended to suit specific needs: the core provides only what the team views as fundamental database operations, leaving everything else to be built as a plugin. This has meant killing some sacred cows, like triggers, stored procedures and prepared statements - even the query cache.

This approach works because Drizzle focuses on using new technology - multiple cores, massively distributed and scaled systems, 64-bit computing - to address the problems of tomorrow. The team avoids playing catch-up to the big database engines by not re-solving problems that have been solved before (e.g., ANSI compliance). Drizzle's focus is the web - a group of customers still under-served by today's databases - so the feature set concentrates on scalability and performance. The architecture provides for extensibility, so anyone who needs heavier-weight features can add them via plugins. This approach makes sense because many developers want to customize individual features to suit their needs - for example, many people have written their own replication mechanisms. Drizzle aims to support this kind of customization.
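
For readers unfamiliar with the microkernel-plus-plugins idea, here is an illustrative Python sketch of the general pattern. This is not Drizzle's actual plugin API; the class, hook, and function names are invented for illustration.

    # The core exposes a small set of hooks; optional features (such as custom
    # replication) register themselves against those hooks instead of living
    # in the core engine.
    class Core:
        def __init__(self):
            self._hooks = {"after_write": []}
            self._tables = {}                        # stand-in for real storage

        def register(self, hook, callback):
            self._hooks[hook].append(callback)

        def write_row(self, table, row):
            self._tables.setdefault(table, []).append(row)   # core responsibility
            for callback in self._hooks["after_write"]:
                callback(table, row)                 # plugins observe every write

    # Hypothetical plugin: ship each write to a replica rather than baking
    # replication into the core.
    def replicate_to_peer(table, row):
        print("replicating", table, row)             # stand-in for a network call

    core = Core()
    core.register("after_write", replicate_to_peer)
    core.write_row("users", {"id": 1, "name": "brian"})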

Brian said that Drizzle isn't quite ready for prime time yet, although some people are using it anyway.

It's great to see the open source community challenging the assumptions that have defined databases in the past and leapfrogging their proprietary competitors in serving the needs of the web. Building applications at web scale forces us to re-think traditional means of approaching problems. Frequently, one size does not fit all. Building a robust platform that can be extended as needed to suit specific needs makes a lot of sense.

Do you really need ACID?

While Drizzle started with an ACID-compliant storage engine (InnoDB) as the default, a thought-provoking session led by Bob Ippolito of Mochi Media asked attendees to think about how dropping ACID requirements could yield better scalability and availability - a tradeoff that many of the big Internet companies have cottoned on to already. This is another tenet of the cloud - services like our own Sherpa (our structured key-value store) and MObStor (our unstructured storage cloud) have embraced it, as have Amazon's S3 and SDB services among others. When availability, scalability, and performance are paramount, traditional application development models need to be re-examined.

The answer lies in dropping strong consistency in favor of eventual consistency, where the system eventually converges on a consistent state but may or may not be consistent at any given moment. Bob took a quick tour of the various ways people have attacked this problem - distributed key-value stores (like Amazon's Dynamo), column-oriented databases (like HBase), memcached, and document databases (like CouchDB) - and described how each has chosen to handle consistency: ignoring conflicts, resolving conflicts internally (hopefully yielding the expected answer), or returning all possible values and letting the client application decide how to resolve the conflict.
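
As a concrete example of the last strategy, here is a minimal Python sketch of client-side conflict resolution, assuming the store hands back every surviving version of a key along with a timestamp. Last-write-wins is just one possible policy; a real application might merge the values instead.

    def resolve(siblings):
        """siblings: list of (timestamp, value) pairs returned by the store."""
        # Pick the version with the newest timestamp (last-write-wins).
        return max(siblings, key=lambda s: s[0])[1]

    # Hypothetical usage: two replicas accepted different writes while partitioned,
    # so the store returns both versions and lets the client choose.
    versions = [(1245196800, {"cart": ["book"]}),
                (1245196805, {"cart": ["book", "dvd"]})]
    print(resolve(versions))   # the later write wins: {'cart': ['book', 'dvd']}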

This is an interesting area - there isn't a single one-size-fits-all solution, and there are few (if any) rock-solid open solutions to this problem ... yet. After looking at several open source initiatives in this area, Mochi Media decided to use a proprietary solution.

We've found that on the web, availability is paramount, and in many cases strict consistency is significantly less important (unless you're building an online banking system!). Even though it would be wonderful to have a system that's simultaneously highly available, globally replicated, and consistent everywhere (oh, and cheap too!), the laws of physics make that impossible. However, through careful design, it's often possible to build web applications that have limited (if any) need for strict consistency. Being willing to make this tradeoff opens up a host of ways to achieve higher availability, and when these tools are available in the cloud, developers can build better applications more quickly.

Approaches to Computing in the Cloud

One of the new terms I learned at Open Source Bridge was "Containerization" - basically a form of lightweight virtualization, more commonly known as Operating System-level virtualization. Before Open Source Bridge, I had always thought of containers as living higher up in the stack (e.g., application containers).

Traditional virtualization techniques (e.g., Xen, VMware) virtualize the machine itself, splitting the host into several virtual machines that run guest operating systems, which in turn run developer code. This makes it possible to run different operating systems on the host and guests, and to have each behave as though it has a complete machine at its disposal. Containerized systems (e.g., OpenVZ, Solaris Containers), on the other hand, provide Operating System-level virtualization: guests share the host's kernel, but each still presents a virtualized view to the applications it runs.
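
A quick, hypothetical way to see that distinction for yourself is to check the kernel version on the host and inside a guest; the snippet below assumes a Linux host.

    # In an OS-level container (e.g., OpenVZ) the reported kernel is the host's;
    # in a full virtual machine (e.g., Xen, VMware) the guest reports whatever
    # kernel its own operating system booted.
    import platform

    print("kernel:", platform.release())   # e.g. the same '2.6.x' string on the
                                           # host and in a container, but possibly
                                           # a different one inside a VM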

Both traditional virtualization and containerization offer similar benefits in terms of protection and consolidation. The fundamental difference is that virtualization enforces stronger isolation between instances and provides more flexibility, while containerization ekes out higher performance from the hardware (since all code executes natively, there's a lower resource tax on the host). Think of virtualization as kernel-space and containerization as user-space.

However, as manufacturers add virtualization support to the hardware itself, the performance gap is narrowing considerably. Similarly, while containerization doesn't enjoy as much enterprise and tools support, that's slowly changing too. As these two technologies evolve, perhaps they will converge on a system that shares the benefits of both approaches.

So... virtualize or containerize? As always, the answer is, "it depends ...!" Containers only work if you can (and are willing to) share the kernel between them, and they come with less enterprise/tools support and features. Virtualization has much more enterprise support, allows greater isolation and OS flexibility - but exacts a greater performance tax.

Today's major cloud providers use virtualization and sell a fixed amount of virtual resources per virtual machine. In general, though, people should not assume that the default virtual machine image is optimized for their application. If they're using a cloud service, they should tune the system image for their application, or they may be left with a system that doesn't perform nearly as well as it could.

With hardware manufacturers adding support for virtualization, I expect the performance penalty of traditional virtualization to shrink significantly. At the same time, I expect that most application developers would prefer as much isolation from other tenants of virtualized hardware as possible - after all, it's hard enough diagnosing application crashes without adding the possibility of a kernel panic or OS crash triggered by an application you aren't even aware of. Given this, and the fact that a single kernel version doesn't suit everyone (some developers prize stability and will want an older kernel, while others care a little less about stability and want the enhancements in a newer one), I expect that in the days to come, the cloud will continue to be made up of traditionally virtualized instances.

When all's said and done ...

The Open Source Bridge conference catered to developers and users of open source technology alike. The Open Source community is a hotbed of innovation, and several important new technologies are brewing there now that will have a wide-ranging impact on how software is written in the years to come. While there was a fair share of skepticism about the cloud in its current state, the Open Source community is already tackling some of the problems that are the hallmarks of the massively distributed systems that will form the cloud.

For other attendees' notes and insights, I recommend the #osbridge tag on Twitter.

Navneet Joneja
Sr Product Mgr, MObStor