Journaling is a common technique for implementing durability while keeping performance high on servers that use magnetic disks. Databases (e.g., the Postgres and HBase WALs) and file systems (e.g., Linux ext3, the HDFS Namenode, ZooKeeper) have used this technique to guarantee that changes are durable and to enable recovery when machines crash. An additional example, quite important for our properties but not traditionally considered an instance of journaling, is messaging systems. With messaging systems, we have clients that publish messages and clients that consume published messages. For such systems, it is quite often desirable to guarantee that messages are delivered despite crashes, slow processes, or a slow network.
One key property of journaling is that it induces sequential writes: all modifications are appended to the journal, and no random parts of the journal are ever modified. This property enables efficient writes to magnetic disks by avoiding the disk seeks that would otherwise degrade overall performance. Since the journal is read only upon recovery, we expect the workload to be dominated by writes. Writing to disk efficiently is not the only concern, though. It is also important to decouple the journal from the server writing to it; otherwise, a permanent failure of the server becomes an unrecoverable failure. We consequently write the journal to an external storage system. At Yahoo!, we have traditionally used NFS filers to store such journals.
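To make the two ingredients concrete, here is a minimal sketch (not BookKeeper's actual implementation) of a journal writer: records are only ever appended, and each append is forced to disk before the call returns, so acknowledging the caller implies durability.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch of an append-only journal: sequential writes only,
// and every record is forced to disk before the append is acknowledged.
public class AppendOnlyJournal {
    private final FileChannel channel;

    public AppendOnlyJournal(Path file) throws IOException {
        // APPEND guarantees all writes go sequentially to the end of the file.
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    // Returns only after the record is durable on disk.
    public void append(byte[] record) throws IOException {
        // Length-prefix each record so recovery can delimit entries.
        ByteBuffer buf = ByteBuffer.allocate(4 + record.length);
        buf.putInt(record.length).put(record).flip();
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        channel.force(false); // fsync before acknowledging the append
    }

    public void close() throws IOException {
        channel.close();
    }
}
```

A real journal would also batch forces across concurrent appends (group commit) to amortize the fsync cost, but the ack-after-force ordering is the part that buys durability.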
A couple of important issues we have observed with filers are:
- Single point of failure: Even in the case of devices that use multiple disks and implement RAID, there are still some failure modes that can cause the device to become unavailable. Consequently, a more flexible scheme to replicate the journal is desirable.
- Elasticity: Expanding the storage capacity for journaling online is not simple with such devices, and for systems with a high demand, such as the messaging systems referred to above, this feature is quite critical.
We have designed BookKeeper to be a shared storage for journals and to overcome these issues. Also important to us was to use commodity server hardware that we find in our data centers instead of relying upon special devices.
BookKeeper stores ledgers, and each ledger corresponds to a sequence of entries. The semantics of an entry are specific to each application; we simply store and serve a byte array for each entry. Entries are identified by a sequence number, which enables random reads and range scans over ledgers.
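The ledger abstraction can be sketched in a few lines. This is an in-memory illustration, not the BookKeeper client API: entries are opaque byte arrays, appending assigns the next sequence number, and that number supports both point reads and range scans.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory sketch of the ledger abstraction: an append-only
// sequence of opaque byte arrays, addressed by entry id.
public class Ledger {
    private final List<byte[]> entries = new ArrayList<>();

    // Appends an entry and returns its sequence number (entry id).
    public long addEntry(byte[] data) {
        entries.add(data);
        return entries.size() - 1;
    }

    // Random read of a single entry by its id.
    public byte[] readEntry(long entryId) {
        return entries.get((int) entryId);
    }

    // Range scan: all entries from firstId to lastId, inclusive.
    public List<byte[]> readRange(long firstId, long lastId) {
        return entries.subList((int) firstId, (int) lastId + 1);
    }
}
```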
Unlike classic journaling, reading from a ledger does not necessarily imply reading a long suffix of records; some applications of BookKeeper often read only short sub-sequences of a ledger.
To avoid a single point of failure in BookKeeper, we replicate entries across storage servers called bookies. Bookies have both a journal device and a ledger device. The journal of a bookie is just like a regular journal: it persists changes to ledgers on the given bookie, and we force it to disk before acknowledging a request to append an entry. The ledger device, ideally on a separate disk, serves read requests. Adding and removing a bookie is as simple as registering and unregistering the bookie through ZooKeeper.
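The replication step can be illustrated with a simplified sketch. This is an assumed simplification of BookKeeper's actual replication protocol: the writer sends each entry to every bookie in an ensemble, and considers the entry written once a configurable number of bookies (the ack quorum) have durably stored it. The `Bookie` interface and `QuorumWriter` class here are illustrative names, not BookKeeper API types.

```java
import java.util.List;

// Sketch of quorum replication: an entry counts as written once
// ackQuorum bookies have durably stored it.
public class QuorumWriter {
    public interface Bookie {
        // Returns true once the entry is forced to the bookie's journal.
        boolean addEntry(long entryId, byte[] data);
    }

    private final List<Bookie> ensemble;
    private final int ackQuorum;

    public QuorumWriter(List<Bookie> ensemble, int ackQuorum) {
        this.ensemble = ensemble;
        this.ackQuorum = ackQuorum;
    }

    // Sends the entry to every bookie in the ensemble; succeeds once
    // at least ackQuorum bookies have acknowledged it.
    public boolean write(long entryId, byte[] data) {
        int acks = 0;
        for (Bookie b : ensemble) {
            if (b.addEntry(entryId, data)) {
                acks++;
            }
        }
        return acks >= ackQuorum;
    }
}
```

With, say, an ensemble of three bookies and an ack quorum of two, a single slow or crashed bookie no longer blocks writes or loses data, which is exactly the failure mode the filers could not tolerate.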
When we originally designed BookKeeper, we targeted a single process writing a journal. The use case that seeded this project was the HDFS Namenode. At the time, the Namenode was a single point of failure, which is a critical problem for production systems. The initial design of BookKeeper consequently focused on enabling Namenode recovery through shared storage that was not filer-based, but that still replicated the journal and provided high performance. We have modified the Namenode to accept pluggable journal implementations and have implemented one for BookKeeper. The BookKeeper Journal Manager (BKJM) has been available for use since version 2.0.0-alpha of Hadoop.
Given the importance of messaging systems for our properties, we also developed a topic-based pub-sub system called Hedwig that uses BookKeeper to persist messages and guarantee delivery. With Hedwig, we can potentially have tens to hundreds of millions of concurrent ledgers. For such a volume of traffic, we would like to have any given bookie serving a large number of ledgers.
When we started experimenting with Hedwig, we noticed that our original design, which used two files per ledger on the ledger device, degraded performance substantially with just hundreds of ledgers, while we needed to serve tens of thousands of ledgers without degradation. Our solution was to use a mix of logging and index caching to avoid this degradation. The current design of BookKeeper implements this solution, and one application currently using it in production, through Hedwig, is our push notifications platform.
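The idea behind the logging-plus-index approach can be sketched as follows. This is an assumed simplification of the actual design: entries from all ledgers are interleaved into a single append-only entry log, so writes stay sequential no matter how many ledgers the bookie serves, while an index maps (ledger id, entry id) to the entry's location in that log for reads.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Sketch of an interleaved entry log with an index: one sequential log
// for all ledgers, plus a (ledgerId, entryId) -> (offset, length) index.
public class EntryLog {
    private final ByteArrayOutputStream log = new ByteArrayOutputStream();
    private final Map<String, int[]> index = new HashMap<>(); // key -> {offset, length}

    public void addEntry(long ledgerId, long entryId, byte[] data) {
        // Record where this entry lands, then append sequentially.
        index.put(ledgerId + ":" + entryId, new int[]{log.size(), data.length});
        log.write(data, 0, data.length);
    }

    public byte[] readEntry(long ledgerId, long entryId) {
        int[] loc = index.get(ledgerId + ":" + entryId);
        byte[] all = log.toByteArray();
        byte[] out = new byte[loc[1]];
        System.arraycopy(all, loc[0], out, 0, loc[1]);
        return out;
    }
}
```

In a real bookie the log lives on disk and the index is persisted and cached rather than held fully in memory, but the key point survives: the number of ledgers no longer dictates the number of files being written, so write performance stays flat as ledger count grows.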
Our push notifications platform serves, for example, users of mobile devices who sign up to receive push notifications through Yahoo! mobile applications. In this platform, BookKeeper is used to persist notifications and Hedwig to serve them. For each user device, we create a Hedwig topic. Having a topic per user device enables, for example, personalization of the notifications delivered. The figure below shows the high-level architecture of our notifications platform and how it incorporates BookKeeper and Hedwig.
The push notifications platform is an example of a single application that requires multiple concurrent ledgers, but the support for concurrent ledgers could also be used across applications by having BookKeeper available as a hosted service.
BookKeeper is currently an Apache project and receives contributions from multiple organizations.
Acknowledgements: Ivan Kelly, Sijie Guo, and James Hsu have contributed to the content of this post.