Towards Enterprise-Class Compatibility for Apache Hadoop

At Yahoo!, the first users of Apache Hadoop were researchers developing new algorithms or manually sifting through huge data sets. These users threw away most of their code after a few weeks or months, and the little code they carried forward was not subject to rigorous quality procedures. Thus, these early users cared more about new features and scalability improvements in Hadoop than they did about backward compatibility.

This early focus on bigger-and-better helped Hadoop become the powerful platform it is today. However, over the years, both inside and outside of Yahoo!, Hadoop has increasingly been used to run large, long-lived, enterprise-class applications. Porting these applications across incompatible Hadoop upgrades is an arduous, expensive task that distracts teams from finding new and better ways of using Hadoop to bring value to their companies. Today, Hadoop users are demanding backward compatibility and interface stability; these guarantees are necessary for the next growth phase of Hadoop as it gains wider enterprise adoption.

Interface Classification

Over the last year, as part of our plan to provide stronger backward compatibility, we have tagged interfaces in Hadoop to denote their compatibility contract for future releases. An interface can be a Java API, a configuration variable, the parameters or output of a command, a metrics variable, and so on. Java APIs are tagged using Java annotations; other types of interfaces, such as configuration options and output formats, are tagged using informal documentation conventions. The upcoming release 0.21 of Hadoop will be the first to expose this classification.

The classification system is derived from OpenSolaris and from our own internal system at Yahoo!. It distinguishes two aspects of an interface that matter from the perspective of backward compatibility, the audience of the interface and the stability of the interface:

  • The audience (or scope or visibility) of an interface denotes the potential customers of the interface. In addition to the more obvious public and private designations, the audience taxonomy also includes a limited-private category for hooks exposed to peer frameworks or systems.
  • The stability of an interface denotes when changes that break compatibility may be made to it. Again, a binary choice between stable (guaranteed not to change incompatibly) and unstable (guaranteed to change) is not sufficient. Interfaces may also be marked as evolving, meaning they are intended for early adopters who are willing to validate their suitability and absorb small changes. An evolving interface is marked as public only after it has been used internally in the system and is close to being stable. Together, these dimensions allow early adopters to gauge the risk of relying on a new interface (an example of the tagging follows this list).

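For instance, here is a minimal sketch (with hypothetical class names) of how Java APIs are tagged, using the annotation types in Hadoop's org.apache.hadoop.classification package:

    import org.apache.hadoop.classification.InterfaceAudience;
    import org.apache.hadoop.classification.InterfaceStability;

    // A public, stable API: applications can depend on it, and it will
    // not change incompatibly in future releases.
    @InterfaceAudience.Public
    @InterfaceStability.Stable
    public class ExampleStableApi {
      // ...
    }

    // A public but evolving API: early adopters may try it, knowing it
    // may still change slightly before being declared stable.
    @InterfaceAudience.Public
    @InterfaceStability.Evolving
    class ExampleEvolvingApi {
      // ...
    }

Because the tags are ordinary annotations, they also appear in the generated javadoc, so developers can see an interface's contract at the point of use.
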
What this tagging system means to you

  • If you are an application developer, stick to public-stable interfaces. If you are an early adopter, you may use a public-evolving interface, but be aware that it may change slightly in the future, forcing a change to your application.
  • If you are a framework developer on Hadoop, you can of course safely use any of the public interfaces, but you can also use limited-private interfaces targeted at your framework. For example, the Hadoop RPC layer provides limited-private interfaces for HDFS and MapReduce (see the sketch after this list).
  • Stay away from private interfaces unless you are an implementer of the actual sub-component. Pay attention to any private interfaces that are marked stable. While the vast majority of private interfaces are marked unstable, a few are marked stable to warn that breaking their compatibility should happen only after serious consideration (and community discussion). For example, the internal HDFS and MapReduce protocols are marked stable to support rolling upgrades (a feature we would like to add to Hadoop in the near future).

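For completeness, here is how the limited-private and private-stable cases from the list above might look (hypothetical type names again; the LimitedPrivate variant of the audience annotation names the frameworks that may use the interface):

    import org.apache.hadoop.classification.InterfaceAudience;
    import org.apache.hadoop.classification.InterfaceStability;

    // Exposed only to the named peer frameworks, and still evolving.
    @InterfaceAudience.LimitedPrivate({"HDFS", "MapReduce"})
    @InterfaceStability.Evolving
    interface ExampleRpcHook {
      // ...
    }

    // Internal to a sub-component, but kept stable so that old and new
    // versions can interoperate, e.g., during a rolling upgrade.
    @InterfaceAudience.Private
    @InterfaceStability.Stable
    interface ExampleInternalProtocol {
      // ...
    }
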
More details are provided in the Apache Hadoop API classification document.

The Larger Plan for Compatibility

This classification system for interfaces is part of a larger, multi-step plan for backward compatibility, which has been in the works for the last two years (HADOOP-5071).

The first step was to clean up the package structure of the Hadoop source base (HADOOP-2884). While the high-level structure was fine, finer-grained packages were needed to better reflect the abstractions of the underlying system architecture. For example, the HDFS package structure did not distinguish the internal abstractions of block storage and namespace, and there were several cases of layer violations within and across packages. In cleaning up the package structure, we not only provided a better foundation for future work, but also started the process of identifying and separating public interfaces from the private and limited-private ones. We also changed the documentation to reflect this separation.
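As a simplified illustration of the resulting separation, the public file system APIs live apart from the HDFS server internals (package names from the source tree; the annotations on their contents mark the audience):

    org.apache.hadoop.fs                      // public APIs: FileSystem, Path, ...
    org.apache.hadoop.hdfs.protocol           // client-facing HDFS protocol types
    org.apache.hadoop.hdfs.server.namenode    // namespace management (internal)
    org.apache.hadoop.hdfs.server.datanode    // block storage (internal)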

At the same time, key Hadoop interfaces were redesigned so that they can evolve compatibly without limiting innovation in the framework. These key interfaces, which also happen to be where Hadoop has faced its greatest compatibility issues, are the APIs to the HDFS file system (HADOOP-4952) and the MapReduce framework (HADOOP-1230). We will discuss the redesign of these interfaces in a future blog post.
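To give a flavor of the redesign, the new MapReduce API (in the org.apache.hadoop.mapreduce package) routes all interaction with the framework through a single Context object, so the framework can gain capabilities without breaking user-code signatures. A minimal sketch of a mapper against that API:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits the length of each input line. Everything the mapper needs
    // from the framework is reached through the Context parameter.
    public class LineLengthMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        context.write(new Text("length"), new LongWritable(value.getLength()));
      }
    }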

Finally, we have also started to address the wire-compatibility problem, i.e., the compatibility of Hadoop components across RPC boundaries and versions. Surprisingly, Hadoop does not yet offer wire compatibility. To address this limitation, Yahoo! started the Avro project (led by Doug Cutting), which was open-sourced at Apache last year and will be incorporated into Hadoop's RPC system over time.
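Avro carries a JSON-defined schema with the data, and a reader can resolve records written with an older or newer schema, so fields with defaults can be added without breaking existing peers. A hypothetical record schema for an RPC message might look like this:

    {
      "type": "record",
      "name": "HeartbeatRequest",
      "fields": [
        {"name": "datanodeId", "type": "string"},
        {"name": "capacityBytes", "type": "long", "default": 0}
      ]
    }

Because both sides exchange their schemas (Avro RPC negotiates them in a handshake), an older client and a newer server can still interpret each other's messages.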


The source refactoring was done by Sanjay Radia and Raghu Angadi. The interface classification project was initiated by Sanjay Radia and derived mostly from OpenSolaris' classification scheme. Jakob Homan designed the use of annotations to tag the interfaces. Suresh Srinivas, Tom White, Arun Murthy, Owen O'Malley, and Sanjay Radia defined the annotations for the various parts of the Hadoop system. Tom White helped drive the labeling effort to its conclusion in 0.21, particularly its integration with Hadoop's javadoc. Doug Cutting was the lead for the Avro project.

Sanjay Radia

Hadoop Team, Yahoo!