By Francis Liu and Sumeet Singh
In 2009-2010, Yahoo! saw unprecedented growth in the number of users coming onboard to its Apache Hadoop platform for their data processing and analytics needs. We attribute most of that success, and the growth in user base, to the introduction of multi-tenancy, security, and partitioned namespaces in Hadoop.
As Hadoop and ecosystem components such as Apache Pig and Apache Oozie gained popularity at Yahoo!, we needed a way to store mutable data and support random access to it, as a complement to the Apache Hadoop platform. At the time, Yahoo! had been using Apache HBase in isolated instances, most notably for the CORE personalization platform and for the web crawl cache. However, the use of Apache HBase was limited to large projects that had the resources to operate dedicated HBase clusters.
In 2012, Yahoo! developed multi-tenancy in Apache HBase, as part of its grid technology stack, to cater to a growing number of use cases for which HBase was an excellent fit. The introduction of multi-tenancy in Apache HBase has lowered the barrier for all Hadoop users to adopt HBase, and we are seeing a whole new range of use cases for HBase at Yahoo!. Is the success of Hadoop’s multi-tenancy repeating itself? Time will tell, but the number of use cases for Apache HBase has certainly exploded since we introduced multi-tenancy in HBase in late 2012.
So, why is Apache HBase attractive to Yahoo!? The reasons are straightforward, perhaps in this order:
- Supports the random access that we needed
- Accessible multi-tenant platform
- Native to the Hadoop ecosystem. We have a large Hadoop developer base, and HBase offers integration with Hadoop components in use at Yahoo! such as MapReduce, Pig, Oozie, HCatalog/Hive
- Vibrant, active open source community of developers
- Attractive throughput, in particular write throughput
- Acceptable latencies and scan performance
- Supports Yahoo!’s scale
- Support for bulk uploads
- Easy application development (lowers app time to market)
- Support for dynamically adding columns, TTL, versioning, and timestamps
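To make the last bullet concrete, here is a minimal toy sketch in Python of the data model it refers to: cells keyed by row and column, stamped with a timestamp, kept up to a maximum number of versions, and hidden once they outlive a TTL. This is not the HBase API. `ToyTable`, its methods, and its parameters are hypothetical names made up for illustration, and the sketch applies the version limit and TTL to the whole table for brevity, whereas HBase configures them per column family.

```python
import time
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of an HBase-style table (illustrative only, not the HBase API)."""

    def __init__(self, max_versions=3, ttl_seconds=None):
        self.max_versions = max_versions
        self.ttl_seconds = ttl_seconds
        # row key -> column name -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp=None):
        """Write a cell; any column name is accepted without a schema change."""
        ts = time.time() if timestamp is None else timestamp
        cells = self.rows[row][column]
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)
        # Keep only the newest max_versions cells for this column.
        del cells[self.max_versions:]

    def get(self, row, column, versions=1, now=None):
        """Return up to `versions` unexpired cells, newest first."""
        now = time.time() if now is None else now
        cells = self.rows.get(row, {}).get(column, [])
        live = [
            cell for cell in cells
            if self.ttl_seconds is None or now - cell[0] <= self.ttl_seconds
        ]
        return live[:versions]


table = ToyTable(max_versions=2, ttl_seconds=60)
table.put("user1", "profile:name", "Ann", timestamp=100)
table.put("user1", "profile:name", "Anna", timestamp=110)
table.put("user1", "profile:name", "Anne", timestamp=120)
# A brand-new column can be written at any time.
table.put("user1", "profile:city", "Paris", timestamp=120)

print(table.get("user1", "profile:name", versions=5, now=130))
```

With `max_versions=2`, the oldest of the three name values is dropped on write, and reads made after the TTL has elapsed no longer return the older cells.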
We have accomplished multi-tenancy in Apache HBase via Security, Isolated Deployments, Region Server Group (HBASE-6721), and Namespace (HBASE-8015). These features and patches are being contributed back to the Apache HBase open source community.
Francis is a Principal Software Engineer at Yahoo! working mainly on Apache HBase. He is also an Apache Hive contributor and a Podling Project Management Committee (PPMC) member of the Apache HCatalog project. Prior to this, he was involved in the development of a workflow management and incremental processing platform built on top of Hadoop.