Last month the HCatalog project (formerly known as Howl) was accepted into the Apache Incubator. We have already branched for a 0.1 release, which we hope to push in the next few weeks. Given all this activity, I thought it would be a good time to write a post on the motivation behind HCatalog, what features it will provide, and who is working on it.
Why Did We Create HCatalog?
Out of the box, Hadoop provides the HDFS file system for users to store their data. File systems are nice because they provide a simple interface. Users can easily copy data into the file system and run jobs against that data. For more complex data processing tasks, however, the file system abstraction is not rich enough. It forces users to know where data is located, what format it is stored in, how it is compressed, and what its schema is. Consider, for example, a Pig Latin script used to do ETL on raw web logs:
A = load '/data/raw/ds=20110225/region=us/property=news' using PigStorage()
as (user:chararray, url:chararray, timestamp:long);
If the grid administrator needs to move this data to a new location or compact multiple files into an archive (such as har), or if the data producer changes the schema or starts writing the data compressed, this Pig Latin script will need to be changed. Pig and MapReduce job specifications are tightly coupled with the data storage. This inhibits data producers' ability to improve their processes: they and their users are forced to go through a painful transition to take advantage of any improvement. This lack of data storage independence is exacerbated by the fact that in most large Hadoop installations, users access the same data with different tools, so a change in storage format ripples through multiple groups and requires coordination across disparate data processing tools.
The fact that in many Hadoop installations different users will be using different tools raises another issue: these tools do not share a common notion of schemas or data types. Pig and Hive have different, though similar, data models, and MapReduce leaves data types entirely to its users. This can make sharing a data set across users with different tools difficult and error-prone.
When data consumers are building new applications or want to query new data sets, finding the data in a file system is difficult. Generally, a user has to contact the group that produces the data and ask them where they write their data, what format it is in, and what its schema is. Some organizations use wiki pages or similar mechanisms to record this information. But such pages inevitably get stale as changes to the data storage are made.
Even once users know where data is, file systems do not help them know when it is available. In a complex data processing environment there will be many users waiting for the availability of a given data set. This is particularly true of the foundational data sets that form the basis for much of an organization's data processing. With a file system, the only way to know if data is available is to do an 'ls'. Hundreds of users banging away with 'ls' commands every few seconds does not help HDFS' performance.
Finally, a file system tends to become a dumping ground with little knowledge of how the data there should be managed. How long should data be kept? Are there legal reasons to store it on tape before deleting it? If so, how do you know what has been archived and what has not? With a file system all of these issues tend to be addressed by conventions. "Data in this directory is cleaned up after 30 days." "After data is archived a '.archive' file is placed in its directory." Often each data creator has to answer these questions, and they tend to answer them differently. Thus an organization ends up with not one data management policy, but twenty, thirty, or a hundred policies.
How HCatalog Addresses These Issues
Hadoop needs a better abstraction for data storage, and it needs a metadata service. HCatalog addresses both of these issues. It presents users with a table abstraction. This frees them from knowing where or how their data is stored. It allows data producers to change how they write data while still supporting existing data in the old format so that data consumers do not have to change their processes. It provides a shared schema and data model for Pig, Hive, and MapReduce. It will enable notifications of data availability. And it will provide a place to store state information about the data so that data cleaning and archiving tools can know which data sets are eligible for their services.
HCatalog takes Hive's metastore and wraps additional layers around it to provide these services. It comes with HCatInputFormat and HCatOutputFormat for MapReduce users, and HCatLoader and HCatStorer for Pig users. Rewritten to use HCatalog, the Pig script example above looks like this:
A = load 'raw' using HCatLoader();
B = filter A by ds='20110225' and region='us' and property='news';
Notice that Pig no longer knows or cares about the file location or storage format. It is just loading a table named 'raw'. If an administrator decides to relocate the files that store raw, or switch to a better storage format, this Pig Latin script does not change at all. Also notice that the user no longer has to declare the schema. HCatLoader communicates that automatically to Pig. HCatalog uses Hive's data model. When loading data into Pig, the types are translated to Pig types. For example, a Hive struct becomes a Pig tuple. When used by MapReduce, the value that HCatInputFormat returns is an HCatRecord, which is an ordered list of typed data. At the same time, this does not force Pig or MapReduce to scan the whole table. Pig knows how to push the filter statement into HCatLoader so that only the appropriate files are read.
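For MapReduce users, HCatInputFormat provides the same decoupling: the map function receives each row as an HCatRecord rather than raw bytes to parse. As a rough illustration, here is a minimal mapper that counts records per URL from the raw table above. The class name and the column position are my own assumptions for the example, not anything HCatalog prescribes:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hcatalog.data.HCatRecord;

// Illustrative mapper: each input value is an HCatRecord, an ordered list
// of typed fields, so there is no parsing or schema handling in user code.
public class UrlCountMapper
        extends Mapper<WritableComparable, HCatRecord, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(WritableComparable key, HCatRecord value, Context context)
            throws IOException, InterruptedException {
        // Position 1 is assumed to be the 'url' column from the schema above.
        String url = (String) value.get(1);
        context.write(new Text(url), ONE);
    }
}

As with the Pig example, if the table's files move or change format, this code does not change.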
HCatalog includes Hive's command line interface so that administrators can create and drop tables, specify table parameters, and so on. It will also allow users to explore what tables are available and what their schemas are.
HCatalog also provides an API for storage format developers to tell HCatalog how to read and write data stored in different formats. Currently HCatalog knows how to read and write RCFiles. But if data is stored in a different format, a user can implement an HCatInputStorageDriver and HCatOutputStorageDriver to tell HCatalog how to translate between that storage format and the record format HCatalog uses. Which StorageDriver to use is stored at the partition level. So if you need to change how your table is stored when you already have a year's data in it, there is no need to reprocess the old data: new data can be written in the new format while the old data stays in the old format. HCatalog uses the correct StorageDriver to read each partition, allowing users to read across partitions without ever knowing they are stored in different formats.
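To make the division of labor concrete, here is a deliberately simplified, hypothetical sketch of the input side of that contract. This interface is my own illustration of the idea, not HCatalog's actual HCatInputStorageDriver API: a driver's job boils down to exposing the InputFormat that can read a partition's files and converting each raw record into an HCatRecord.

import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hcatalog.data.HCatRecord;

// Hypothetical, simplified stand-in for the real storage driver contract,
// for illustration only: one implementation per storage format.
public interface SimpleInputDriver<K extends WritableComparable, V extends Writable> {

    // The InputFormat that knows how to read this partition's files
    // (for example, RCFile's input format for RCFile-backed partitions).
    InputFormat<K, V> getUnderlyingInputFormat();

    // Translate one raw key/value pair from that InputFormat into the
    // ordered, typed HCatRecord form shared by Pig, Hive, and MapReduce.
    HCatRecord toHCatRecord(K key, V value) throws IOException;
}

Because the driver choice is recorded per partition, HCatalog can bind the right implementation to each partition at read time, which is what makes the mixed-format table described above possible.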
HCatalog Is Brought To You By...
HCatalog is a collaborative effort among members of the Apache Pig, Hive, and Hadoop projects, plus new contributors. Most of the new code has been written by Yahoos, while the Hive team has been very helpful in providing design feedback and pushing HCatalog's changes into the Hive metastore.