Last month the HCatalog project (formerly known as Howl) was accepted into the Apache Incubator. We have already branched for a 0.1 release, which we hope to push in the next few weeks. Given all this activity, I thought it would be a good time to write a post on the motivation behind HCatalog, what features it will provide, and who is working on it.
Why Did We Create HCatalog?
Out of the box, Hadoop provides the HDFS file system for users to store their data. File systems are nice because they present a simple interface: users can easily copy data in and run jobs against it. For more complex data processing tasks, however, the file system abstraction is not rich enough. It forces users to know where data is located, what format it is stored in, how it is compressed, and what its schema is. Consider, for example, a Pig Latin script used to do ETL on raw web logs:
A = load '/data/raw/ds=20110225/region=us/property=news' using PigStorage();
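For contrast, here is a sketch of what the same load might look like once the data is registered as an HCatalog table (the table name "rawlogs" is hypothetical, and the loader class name reflects the incubator-era org.apache.hcatalog package):

```pig
-- Hypothetical: the same data registered as an HCatalog table "rawlogs",
-- loaded via HCatLoader rather than a hard-coded HDFS path.
A = load 'rawlogs' using org.apache.hcatalog.pig.HCatLoader();
-- Partition keys surface as ordinary columns, so the script can filter
-- on them without knowing the directory layout, file format, or codec.
B = filter A by ds == '20110225' and region == 'us' and property == 'news';
```

With the metadata held in HCatalog, the script no longer encodes the storage location, format, or partitioning scheme, which is exactly the gap in the raw file system abstraction described above.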