This is the beginning of an ongoing series of blog posts on “Managing Big Data”. This series will focus on techniques that Yahoo uses to process large volumes of data, ranging from initial collection of data to the end usage of that data.
Over the last several years, two important trends have emerged that require additional thought when putting together an architecture for a hosted service. At Yahoo!, the ability to analyze and process enormous amounts of data is increasingly important. It’s a foundational layer for improving our consumer experiences and for sharing audience insights with advertisers.
From a technology perspective, the two trends I'd like to focus on are:
1. Batch processing: the increasing awareness of batch processing and the recent uptick in use of the map/reduce paradigm for that purpose.
2. NoSQL stores: the rise of so-called "NoSQL" stores and their use to serve up data to online users (typically inside of the user's request/response cycle).
Both of these trends represent significant advances in the way that hosted systems are developed. But in order to derive the most value for an entire system, developers must think about how these two areas will work together in some holistic manner.
Let's look at a specific scenario to make this more concrete:
Making batch data available to the online system
Let's assume for the moment that you're building a new e-commerce site, and that one of its significant features is recommending items a user may be interested in purchasing. This overall feature decomposes into a batch component (to determine the recommendations to give to a user) and an online component (to present the recommendations to the end user).
For the batch system we may choose to use a map/reduce framework like Hadoop. The batch recommendation component will rely on various types of raw event data: the products the end user has viewed, the products they have purchased, the types of searches the user has performed, and so on. We will create a model using this data; perhaps a simple behavior model (e.g. if the user looks at a significant amount of sports equipment, then recommend products in the sports equipment category) or a collaborative filtering model (e.g. if other users purchased the same products as this end user, recommend other products that they purchased).
Once we have decided on the data inputs and the model for making recommendations, we'll likely produce the output of the batch processing as a set of recommendations for each user on a Hadoop cluster.
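To make the batch step concrete, here is a minimal sketch of the simple behavior model expressed as a mapper and reducer. The event fields, the tiny in-memory event log, and the local grouping loop (standing in for the framework's shuffle phase) are all illustrative, not a real Hadoop job:

```python
# Hypothetical map/reduce sketch of the behavior model: count category
# views per user, then recommend the user's most-viewed category.
from collections import defaultdict

def mapper(event):
    """Emit (user_id, category) for each product-view event."""
    if event["type"] == "view":
        yield event["user_id"], event["category"]

def reducer(user_id, categories):
    """Recommend the category the user viewed most often."""
    counts = defaultdict(int)
    for c in categories:
        counts[c] += 1
    return user_id, max(counts, key=counts.get)

# Local simulation of the map -> shuffle -> reduce flow over a tiny log.
events = [
    {"type": "view", "user_id": "u1", "category": "sports"},
    {"type": "view", "user_id": "u1", "category": "sports"},
    {"type": "view", "user_id": "u1", "category": "books"},
]
grouped = defaultdict(list)
for e in events:
    for user, cat in mapper(e):
        grouped[user].append(cat)
recommendations = dict(reducer(u, cats) for u, cats in grouped.items())
print(recommendations)  # {'u1': 'sports'}
```

In a real deployment the same mapper/reducer logic would run over the full event logs on the cluster, with the framework handling the grouping between the two phases.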
There are two approaches we can then take for making the batch data available online in a NoSQL store.
1. Full updates
In this approach, as part of the batch processing, we recreate the set of recommendations for all users. It may also be possible to create a native version of the online store format in the batch system. This is generally only possible if there is no scenario where the data (user recommendations for products, in this case) is updated in the online store. There must also exist a library to write data into the online store's file format (not all NoSQL stores have such a library).
This approach is attractive for a couple of reasons:
- There should not be any consistency issues between your offline and online representation of the data as you are always creating an entirely new copy of the online data at some interval (for instance once a day).
- This approach should have the least impact on your online store's performance if you can perform a "swap" to the new data set. How this works is that the newly created set of recommendations is pushed, in the native NoSQL store format, to the online store's physical machines. Once there, each NoSQL store instance is "upgraded" to start using the new copy of the recommendations and stop using the old version.
2. Incremental/Delta updates
Taking an incremental update approach between your offline and online store entails creating a new set of incremental data changes in your batch system. For instance, creating a new set of user recommendations only for newly joined users or for users that have had some recent activity (which would change their recommendations). This incremental data must then be pushed to the online store.
This approach is attractive for a couple of reasons:
- Latency. By processing just the incremental updates of your recommendations, it is possible to update the online stores on a frequent basis. For instance, it may be possible to run a batch job on Hadoop every 30 minutes that produces a new set of product recommendations, which can then be pushed to the online store at that same 30-minute interval. The full update approach to moving the offline data online will more likely have an update frequency of several hours or even once a day (due to the large amount of data that needs to be processed and transferred to the online stores).
- Size of updates. Depending on the size of the data, it's also possible that incremental updates may be the only viable solution. For instance, if the number of users we would like to create recommendations for is in the tens of millions and the number of recommendations for each user is large, the data set may simply be too large to recalculate and push in its entirety.
One weakness of the incremental update approach is that it can have a performance impact on your online store. You may therefore need to consider some form of throttling of the updates as they are applied to the online store.
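As a rough sketch of what such throttling could look like, the snippet below rate-limits delta writes with a simple fixed sleep between operations. The store client and its `put()` method are hypothetical, and a real system would likely use a more adaptive scheme (e.g. backing off when serving latency rises):

```python
import time

def push_deltas(store, deltas, max_writes_per_sec=500):
    """Apply incremental updates to the online store at a capped rate."""
    interval = 1.0 / max_writes_per_sec
    for user_id, recommendations in deltas:
        store.put(user_id, recommendations)  # hypothetical store client
        time.sleep(interval)  # simple fixed-rate throttle between writes
```

Capping write throughput this way trades update latency for steadier read latency on the serving path.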
A couple of closing considerations:
- Consider the type of data you have and whether pushing full updates or delta updates is more appropriate for it, as this fundamentally affects the architecture for making your batch data available online. At Yahoo! we use both approaches, depending on the specific scenario.
- Throttling updates to your online store is an important consideration for maintaining your online store's latency and availability.
In a future post I'll also address the reverse direction: taking data that's available in an online NoSQL store and making it available in a batch system like Hadoop.