One of the reasons that I like working at Yahoo!, and the primary reason I've been here six years, is the scale at which we operate. It's not so much the number of users or pages, but the hamsters running behind the scenes to power those pages.
It's also how we manage to stay up even when something goes wrong. We've had several outages in the past. There was once a squirrel that took out an entire data center in a suicide mission, but we managed to keep the site up.
For this we have to thank Service Engineering and some pretty smart Business Continuity Planning (BCP) policies.
Last Thursday, Michael Christian, Chris Westin, and I spoke about high availability at Yahoo!; slides are available here (infrastructure resiliency), here (MySQL business continuity planning), and here (application resiliency).
Laptop problems notwithstanding, we had quite a bit of fun speaking and mingling with attendees afterwards. However, Yahoo!'s real MySQL High Availability guru, Jay Janssen, wasn't around.
To compensate, he has blogged about how we handle MySQL Master High Availability. Here's what he said:
We are assuming out of the gate that we have to have a presence in at least 2 datacenters. This concept is something we call BCP or Business Continuity Planning: we want our business to continue regardless of some event that would completely destroy a datacenter. So, we require that any and all components within Yahoo are run redundantly from at least 2 datacenters that are not in the same geographical region. That way if an earthquake, hurricane, or some other natural (or otherwise) disaster affects one datacenter, the other one should be far enough away to not have been affected....
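As a rough illustration of the rule in that excerpt, here is a minimal Python sketch that checks whether a component's deployment spans at least two datacenters in different geographical regions. It is not Yahoo!'s actual tooling: the `Datacenter` type, the datacenter names, the region labels, and the `meets_bcp` function are all hypothetical, and it assumes each datacenter belongs to exactly one region.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Datacenter:
    name: str    # hypothetical datacenter identifier
    region: str  # hypothetical geographical region label


def meets_bcp(deployment: Iterable[Datacenter]) -> bool:
    """Return True if the deployment spans at least two distinct regions.

    Assumption: a datacenter sits in exactly one region, so two distinct
    regions imply at least two distinct datacenters.
    """
    regions = {dc.region for dc in deployment}
    return len(regions) >= 2


# Usage with made-up datacenters
west_1 = Datacenter("dc-west-1", "us-west")
west_2 = Datacenter("dc-west-2", "us-west")
east_1 = Datacenter("dc-east-1", "us-east")

print(meets_bcp([west_1, east_1]))  # two regions: satisfies the rule
print(meets_bcp([west_1, west_2]))  # two datacenters, one region: does not
print(meets_bcp([west_1]))          # single datacenter: does not
```

The check deliberately counts regions rather than datacenters, since two datacenters in the same region could both be hit by the same earthquake or hurricane.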
Go read Jay's post and tell us what you think.