In the Big Data business running fewer larger clusters is cheaper than running more small clusters. Larger clusters also process larger data sets and support more jobs and users.
The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application's execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization.
The current implementation of the Hadoop MapReduce framework is showing its age.
Given observed trends in cluster sizes and workloads, the MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading model, reliability and performance. Over the last 5 years we have applied spot fixes; however, these have lately come at an ever-growing cost, as evinced by the increasing difficulty of making changes to the framework. The architectural deficiencies, and the corrective measures, are both old and well understood - even as far back as late 2007, we documented the proposed fix on MapReduce's jira: https://issues.apache.org/jira/browse/MAPREDUCE-278.
From an operational perspective, the current Hadoop MapReduce framework forces a system-wide upgrade for any change, minor or major, such as bug fixes, performance improvements and features. Worse, it forces every single customer of the cluster to upgrade at the same time, regardless of their interests; this wastes expensive customer cycles as they validate the new version of Hadoop for their applications.
As we consider ways to improve the Hadoop MapReduce framework it is important to keep in mind the high-level requirements. The most pressing requirements for the next generation of the MapReduce framework are:
* Scalability - Clusters of 10,000 machines and 200,000 cores, and beyond.
* Backward (and Forward) Compatibility - Ensure customers’ MapReduce applications run unchanged in the next version of the framework.
* Evolution – Ability for customers to control upgrades to the Hadoop software stack.
* Predictable Latency – A major customer concern.
* Cluster utilization
The second tier of requirements is:
* Support for alternate programming paradigms to MapReduce
* Support for short-lived services
Given the above requirements, it is clear that we need a major re-think of the infrastructure used for data processing on Hadoop. In fact, there seems to be loose consensus in the Hadoop community that the current architecture of the MapReduce framework is incapable of meeting our stated goals and that a re-factor is required; see the proposal we made on jira in January, 2008: https://issues.apache.org/jira/browse/MAPREDUCE-279.
## The Next Generation of MapReduce
The fundamental idea of the re-architecture is to divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components. The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the classic MapReduce sense or a DAG of such jobs. The ResourceManager and the per-machine NodeManager, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
The ResourceManager supports hierarchical application queues, and those queues can be guaranteed a percentage of the cluster resources. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status. Also, it offers no guarantees about restarting failed tasks, whether they fail due to application errors or hardware faults.
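To make the hierarchical-queue model concrete, here is a minimal sketch of queues with guaranteed capacity percentages. The class and method names are hypothetical illustrations, not the actual ResourceManager API:

```python
# Hypothetical sketch of hierarchical queues with guaranteed capacities.
# Names and structure are illustrative, not the real scheduler interface.

class Queue:
    def __init__(self, name, capacity_pct, children=None):
        self.name = name                    # e.g. "root.production.etl"
        self.capacity_pct = capacity_pct    # guaranteed share of the parent
        self.children = children or []

    def absolute_capacity(self, parent_fraction=1.0):
        """Guaranteed fraction of the whole cluster for this queue."""
        return parent_fraction * self.capacity_pct / 100.0

root = Queue("root", 100, [
    Queue("production", 70, [Queue("etl", 60), Queue("adhoc", 40)]),
    Queue("research", 30),
])

# The "etl" leaf queue is guaranteed 60% of production's 70%,
# i.e. roughly 42% of the cluster.
etl = root.children[0].children[0]
share = etl.absolute_capacity(root.children[0].absolute_capacity())
print(share)
```

Guarantees compose multiplicatively down the hierarchy, which is what lets an organization subdivide its own share without involving the cluster operators.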
The ResourceManager performs its scheduling function based on the resource requirements of the applications; each application has multiple resource request types that represent the resources required for containers. The resource requests include memory, CPU, disk, network etc. Note that this is a significant change from the current model of fixed-type slots in Hadoop MapReduce, which has a significant negative impact on cluster utilization. The ResourceManager has a scheduler policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. Scheduler plug-ins can be based, e.g., on the current CapacityScheduler and FairScheduler.
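The shift from typed slots to multi-dimensional requests can be sketched as follows. This is an illustrative model only; the field names and the `fits` check are hypothetical, not the actual request protocol:

```python
# Illustrative model of a multi-dimensional resource request replacing
# fixed-type map/reduce slots. All names here are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int

@dataclass
class ResourceRequest:
    priority: int
    locality: str          # a host name, a rack name, or "*" for anywhere
    capability: Resource   # the size of each requested container
    num_containers: int

# An application asks for 10 containers of 2 GB / 1 core, anywhere:
req = ResourceRequest(priority=1, locality="*",
                      capability=Resource(memory_mb=2048, vcores=1),
                      num_containers=10)

def fits(node_free: Resource, ask: Resource) -> bool:
    """A naive check a scheduler plug-in might make before allocating."""
    return (node_free.memory_mb >= ask.memory_mb
            and node_free.vcores >= ask.vcores)

print(fits(Resource(4096, 4), req.capability))  # True
```

Because a request is just a vector of quantities, any machine with enough free capacity can satisfy it, regardless of what kind of task will run in the container.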
The NodeManager is the per-machine framework agent that is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the Scheduler.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, launching tasks, tracking their status, monitoring progress, and handling task failures.
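The ApplicationMaster's responsibilities amount to a control loop: negotiate containers, launch tasks on the NodeManagers, and retry on failure. The sketch below is highly simplified; every name (the stub scheduler, the `launch` call, the retry policy) is a hypothetical stand-in, not the real ResourceManager/NodeManager protocol:

```python
# Toy ApplicationMaster control loop with stub cluster components.
import collections

Container = collections.namedtuple("Container", "id host")

class StubScheduler:
    """Stands in for the ResourceManager scheduler."""
    def __init__(self, hosts):
        self.hosts, self.next_id = hosts, 0
    def allocate(self, n):
        out = []
        for _ in range(n):
            out.append(Container(self.next_id,
                                 self.hosts[self.next_id % len(self.hosts)]))
            self.next_id += 1
        return out

class StubNodeManager:
    """Stands in for a NodeManager; fails each task's first attempt here."""
    def __init__(self):
        self.seen = set()
    def launch(self, container, task):
        first_time = task not in self.seen
        self.seen.add(task)
        return not first_time          # fail once, then succeed

def run_application_master(scheduler, node_managers, tasks, max_retries=3):
    attempts = {t: 0 for t in tasks}
    pending, completed = list(tasks), []
    while pending:
        containers = scheduler.allocate(len(pending))   # negotiate resources
        for container, task in zip(containers, list(pending)):
            if node_managers[container.host].launch(container, task):
                pending.remove(task)
                completed.append(task)                  # track completion
            else:
                attempts[task] += 1                     # handle task failure
                if attempts[task] > max_retries:
                    raise RuntimeError(f"task {task} failed permanently")
    return completed

nms = {"h1": StubNodeManager(), "h2": StubNodeManager()}
done = run_application_master(StubScheduler(["h1", "h2"]), nms,
                              ["map-0", "map-1", "reduce-0"])
print(done)
```

The key point is that all of this per-task bookkeeping lives in the application's own process, not in a central daemon shared by every job on the cluster.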
## Improvements vis-à-vis current implementation of Hadoop MapReduce
### Scalability

The separation of the management of cluster resources from the management of the life cycle of applications and their component tasks results in an architecture that scales out much better and more gracefully. The Hadoop MapReduce JobTracker spends a very significant portion of its time and effort managing the life cycle of applications, and that is the major cause of software mishaps - moving that work to an application-specific entity is a significant win.
Scalability is particularly important with current hardware trends - currently, Hadoop MapReduce has been deployed on clusters of up to 4,000 machines. However, 4,000 commodity machines of 2009 vintage (i.e. 8 cores, 16G RAM, 4TB disk) are only half as capable as 4,000 machines of 2011 vintage (16 cores, 48G RAM, 24TB disk). Also, operational costs favor consolidation and compel us to run ever-larger clusters of 6,000 machines and beyond.
### Availability

* ResourceManager - The ResourceManager uses Apache ZooKeeper for fail-over. When the ResourceManager fails, a secondary can quickly recover via the cluster state saved in ZooKeeper. On a fail-over, the ResourceManager restarts all of the queued and running applications.
* ApplicationMaster - MapReduce NextGen supports application-specific checkpoint capabilities for the ApplicationMaster. The MapReduce ApplicationMaster can recover from failures by restoring itself from state saved in HDFS.
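The recovery idea in both bullets is the same: persist progress to durable storage so a restarted master resumes rather than restarts. The toy below illustrates the pattern with a local file standing in for HDFS; the function names are hypothetical, not the actual checkpoint API:

```python
# Toy illustration of crash recovery via checkpoints to durable storage.
# A local file stands in for HDFS; names are hypothetical.
import json, os, tempfile

def save_checkpoint(path, completed_tasks):
    # Write atomically (temp file + rename) so a crash mid-write
    # never leaves a half-written checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(sorted(completed_tasks), f)
    os.replace(tmp, path)

def load_checkpoint(path):
    if not os.path.exists(path):
        return set()           # first run: nothing completed yet
    with open(path) as f:
        return set(json.load(f))

ckpt = os.path.join(tempfile.mkdtemp(), "am.ckpt")
save_checkpoint(ckpt, {"map-0", "map-1"})

# After a fail-over, the restarted master re-runs only what is missing:
remaining = {"map-0", "map-1", "reduce-0"} - load_checkpoint(ckpt)
print(remaining)  # {'reduce-0'}
```

Because the checkpoint lives in HDFS rather than on the master's local disk, any machine in the cluster can host the restarted ApplicationMaster.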
### Wire-compatibility

MapReduce NextGen uses wire-compatible protocols to allow different versions of servers and clients to communicate with each other. In future releases, this will enable rolling upgrades of clusters - a major operability win.
### Innovation & Agility
A major plus of the proposed architecture is the fact that MapReduce effectively becomes a user-land library. The computation framework (ResourceManager and NodeManager) is completely generic and contains nothing specific to MapReduce.
This enables end-customers to use different versions of MapReduce concurrently on the same cluster. This is trivial to support since multiple versions of MapReduce ApplicationMaster and runtime can be used for different applications. This provides significant agility for applications for bug fixes, enhancements and new features since the entire cluster does not have to be upgraded. It also allows end-customers to upgrade their applications to MapReduce versions on their own schedule and significantly enhances operability of the cluster.
The ability to run user-defined versions of MapReduce fosters innovation without affecting the stability of the software. It will be trivial to incorporate features such as the Hadoop Online Prototype into a user's version of MapReduce without affecting other users.
### Cluster Utilization
The MapReduce NextGen ResourceManager uses a general concept of a resource for scheduling and allocating to individual applications.
Every machine in the cluster is conceptually composed of resources such as memory, CPU, I/O bandwidth, etc. These resources are fungible and are allocated to applications as containers based on application-defined resource request types. A container is a set of processes that is logically isolated from other containers on the same machine, providing strong multi-tenancy support.
Thus it removes the current notion of fixed-type map and reduce slots in Hadoop MapReduce. Fixed-type slots have a significant negative impact on cluster utilization since, at different times in the life of the cluster, either map or reduce slots are scarce.
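A back-of-the-envelope comparison shows why this matters. The numbers below are made up for illustration, but the asymmetry is the typical one: early in a job, only map tasks are runnable, so reduce slots sit idle under the fixed-slot model:

```python
# Illustrative comparison of fixed-type slots vs. fungible containers.
# All numbers are hypothetical.

def slots_utilization(map_demand, reduce_demand, map_slots, reduce_slots):
    """Fixed slots: idle reduce slots cannot run waiting map tasks."""
    used = min(map_demand, map_slots) + min(reduce_demand, reduce_slots)
    return used / (map_slots + reduce_slots)

def containers_utilization(map_demand, reduce_demand, total_containers):
    """Fungible containers: any container can run either task type."""
    return min(map_demand + reduce_demand, total_containers) / total_containers

# Early in a job: 100 map tasks waiting, no reduces runnable yet.
print(slots_utilization(100, 0, map_slots=60, reduce_slots=40))   # 0.6
print(containers_utilization(100, 0, total_containers=100))       # 1.0
```

The same asymmetry flips at the end of a job, when only reduces are runnable and map slots idle, so the fixed-slot model loses utilization in both phases.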
### Support for programming paradigms other than MapReduce
MapReduce NextGen provides a completely generic computation framework to support MapReduce and other paradigms.
The architecture allows end-users to implement any application-specific framework by implementing a custom ApplicationMaster, which can request resources from the ResourceManager and utilize them as they see fit under familiar notions of isolation, capacity guarantees etc.
Thus, it supports multiple programming paradigms, such as MapReduce, MPI, Master-Worker, and iterative models, on the same Hadoop cluster and allows use of the appropriate framework for each application. This is particularly important for applications (e.g. K-Means, Page-Rank) where custom frameworks out-perform MapReduce by an order of magnitude.
Apache Hadoop, and in particular Hadoop MapReduce, is a very successful open-source project for processing large data sets. Our proposed re-factoring of Hadoop MapReduce addresses the architecture's current issues by providing high availability, enhancing cluster utilization and supporting alternate programming paradigms; it also enables rapid future evolution.
We felt that none of the existing options, such as Torque, Condor, Mesos etc., were designed for MapReduce clusters at scale. Some of the options were new and immature, while others were missing key features such as the ability to do fine-grained scheduling for hundreds of thousands of tasks, performance at scale, security and multi-tenancy.
We will work with the Apache Hadoop community to achieve this and to elevate Apache Hadoop to the next level in the big data space.