Pig – The Road to an Efficient High-Level Language for Hadoop

Pig started as a research project within Yahoo! in the summer of 2006. The original prototype quickly became very popular with users. It was clear that a higher-level language than raw map-reduce was needed to quickly roll out prototypes as well as to build production-quality applications. Early adopters within Yahoo! reported substantial increases in productivity when they migrated from raw map-reduce to Pig.

In the summer of 2007 a team was put together to turn the project into a product. Working within an open source community was seen as one of the important early goals of the project. Pig has been part of the open source community for over a year, having joined the Apache Incubator in September 2007. During this time Pig has developed a community of users and developers and added two new committers. It has also gained wide popularity within Yahoo!, where 30% of all Hadoop jobs use Pig, which amounts to thousands of jobs per day!

A lot of great technical work went into the project, which helped with the adoption and popularity of the system. The early work included the addition of a streaming operator, parameter substitution, error handling, and some performance improvements such as the use of binary comparators and the combiner.

More recently the entire system, from the parser down, has been rebuilt, making the code much cleaner, more extensible, and more efficient. A type system was also added, further improving performance and allowing for early error detection. This work is still in progress, but the early performance numbers are quite impressive: we are seeing speedups ranging from 40% to 10x between the old and the new code.

Our technical improvements and the growth of the Pig community allowed us to graduate from the Incubator and to join Hadoop as a sub-project. The entire Pig community is excited about reaching this important milestone and the opportunities that being part of the Hadoop family provides! Long live the Pig! :)

Olga Natkovich