Yahoo! has begun evaluating Hive for use as part of its Hadoop stack. Since, in many people's minds, Hive and Pig are roughly equivalent and Pig Latin is close to SQL, this has led to some confusion. Why are we interested in using both technologies?
As we have looked at our workloads and analyzed our use cases, we have come to the conclusion that the different use cases require different tools. In this post, I will walk through our thinking on why both of these tools belong in our toolkit, and when each is appropriate.
Data preparation and presentation
Let me begin with a little background on processing and using large data sets. Data processing often splits into three separate tasks: data collection, data preparation, and data presentation. I will not discuss the data collection phase, because I want to focus on Pig and Hive, neither of which plays a role in that phase.
The data preparation phase is often known as ETL (Extract Transform Load) or the data factory. "Factory" is a good analogy because it captures the essence of what is being done in this stage: Just as a physical factory brings in raw materials and outputs products ready for consumers, so a data factory brings in raw data and produces data sets ready for data users to consume. Raw data is loaded in, cleaned up, conformed to the selected data model, joined with other data sources, and so on. Users in this phase are generally engineers, data specialists, or researchers.
The data presentation phase is usually referred to as the data warehouse. A warehouse stores products ready for consumers; they need only come and select the proper products off of the shelves. In this phase, users may be engineers using the data for their systems, analysts, or decision-makers.
Given the different workloads and different users for each phase, we have found that different tools work best in each phase. Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.
Data factory use cases
In our data factory we have observed three distinct workloads: pipelines, iterative processing, and research.
The classical data pipelines bring in a data feed and clean and transform it. A common example of such a feed is the logs from Yahoo!'s web servers. These logs undergo a cleaning step in which bot traffic and company-internal views and clicks are removed. We also apply transformations such as finding, for each click, the page view that preceded it.
See my previous post, which discusses why we have found Pig to be the best tool for implementing these pipelines. We use it together with our workflow tool, Oozie, to construct pipelines, some of which run tens of thousands of Pig jobs a day.
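A pipeline like this can be sketched in a few lines of Pig Latin. The paths, schema, and the IsBot and IsInternal cleaning UDFs below are all hypothetical, but the shape is representative:

```pig
-- load one day of raw web server logs (path and schema are illustrative)
raw = LOAD 'weblogs/2009/06/01' AS (user, time, url, referrer);

-- drop bot traffic and company-internal views;
-- IsBot and IsInternal stand in for real cleaning UDFs
clean = FILTER raw BY NOT IsBot(user) AND NOT IsInternal(url);

-- store the cleaned feed for downstream pipeline stages
STORE clean INTO 'weblogs/clean/2009/06/01';
```

In practice, each such script becomes one action in an Oozie workflow, so individual stages can be monitored and retried independently.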
A closely related use case is iterative processing. In this case, there is usually one very large data set that is maintained. Typical processing brings in small new pieces of data that change the state of that large data set.
For example, consider a data set that contains all the news stories currently known to Yahoo! News. You can envision this as a huge graph, where each story is a node. In this graph, there are links between stories that reference the same events. Every few minutes a new set of stories comes in, and the tools need to add these to the graph, find related stories and create links, and remove old stories that these new stories supersede.
What sets this apart from the standard pipeline case is the constant inflow of small changes, which requires an incremental processing model to handle the data in a reasonable amount of time.
For example, if the process has already joined the full graph of all news stories and a small set of new stories arrives, re-running the join across the whole set is undesirable; it could take hours or days. Instead, the right approach is to join only against the new incremental data and combine the results with the output of the previous full join, which takes only a few minutes. Standard database operations can be implemented in this incremental way in Pig Latin, making Pig a good tool for this use case.
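The incremental pattern might look like the following Pig Latin sketch, where all paths and field names are hypothetical:

```pig
-- the full join result computed by previous runs
previous = LOAD 'news/links/current' AS (story_a, story_b);

-- the small batch of newly arrived stories, and the story graph
new_stories = LOAD 'news/incoming' AS (story, event);
graph       = LOAD 'news/graph'    AS (story, event);

-- join only the new stories against the graph: minutes of work,
-- instead of hours or days for a full re-join
joined    = JOIN new_stories BY event, graph BY event;
new_links = FOREACH joined GENERATE new_stories::story, graph::story;

-- merge the new links with the previously computed results
updated = UNION previous, new_links;
STORE updated INTO 'news/links/updated';
```

The key point is that only the last UNION touches the previous full result; the expensive join runs over just the small incoming batch.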
A third use case is research. Yahoo! has many scientists who use our grid tools to comb through the petabytes of data we have. Many of these researchers want to quickly write a script to test a theory or gain deeper insight.
But, in the data factory, data may not be in a nice, standardized state yet. This makes Pig a good fit for this use case as well, since it supports data with partial or unknown schemas, and semi-structured or unstructured data.
Pig's integration with streaming also makes it easy for researchers to take a Perl or Python script they have already debugged on a small data set and run it against a huge data set.
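For instance, a script already debugged locally can be shipped to the cluster and applied to every record with Pig's STREAM operator. The score_stories.py script and the paths here are hypothetical:

```pig
-- ship the researcher's existing Python script to every node
DEFINE scorer `python score_stories.py` SHIP('score_stories.py');

stories = LOAD 'news/stories' AS (id, title, body);

-- run each record through the script, just as it ran locally
scored = STREAM stories THROUGH scorer;
STORE scored INTO 'news/scored';
```

The script reads tab-delimited records on stdin and writes results to stdout, so it needs no modification to move from a laptop to the grid.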
Data warehouse use cases
In the data warehouse phase of processing, we see two dominant use cases: business-intelligence analysis and ad-hoc queries.
In the first case, users connect the data to business-intelligence (BI) tools such as MicroStrategy to generate reports or do further analysis.
In the second case, data analysts or decision-makers issue ad-hoc queries against the data.
In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. SQL has the right constructs to support the kinds of queries analysts want to run, and it is already familiar to both the tools and the users in the field.
Pig + Hive = Hadoop toolkit
The widespread use of Pig at Yahoo! has enabled the migration of our data factory processing to Hadoop. With the adoption of Hive, we will be able to move much of our data warehousing to Hadoop as well.
Having the data factory and the data warehouse on the same system will lower data-loading time into the warehouse: as soon as the factory finishes producing a data set, it is available in the warehouse.
It will also enable the factory and the warehouse to share metadata; monitoring and management tools; support and operations teams; and hardware.
So we are excited to add Hive to our toolkit, and look forward to using both these tools together as we lean on Hadoop to do more and more of our data processing.