At the F8 developer conference, Yahoo! unveiled a new feature called Yahoo! News Activity. The feature, on Yahoo! News, represents the first rollout of our global social strategy to build on our leadership of deeply personalized content by adding friends to the mix, guiding you to more interesting content as your new social editors. The world's most interesting content and editorial is on Yahoo!, achieved through the right mix of editorial insight and personalized algorithms. The deep integration with Facebook offers an immersive social experience on Yahoo! News so people can discover and connect around the news and information they are enjoying on Yahoo! seamlessly through updates on Facebook.
But what we think is equally exciting is the under-the-hood technology that powers this experience and makes the deep tech integration with Facebook possible. Our team of leading developers has built Mixer, a complex and finely tuned caching scheme that sits on personalization algorithms, to automatically sort and refine the large data sets that flow across the Yahoo! and Facebook platforms.
Here's more on how it works.
Yahoo! News Activity Backend Architecture
1. Assess: When a user visits a Yahoo! News page, the News Frontend (FE) will query Mixer for the user's friends' latest read activity in the News For You section of the article page. These queries are very expensive to evaluate, as a user typically has 100 or more friends (some have thousands), each with hundreds of read articles, and the News FE expects a response within 250 milliseconds. This is no easy task, especially at the scale of the Yahoo! News page, with 80 million unique users reading and sharing content across the network per month.
2. Engage: When a user reads an article on the Yahoo! News site, the News FE will pass a read event to Mixer for storage in Sherpa, Yahoo!'s cloud-based NoSQL data store. In addition, materialized views in Memcache will be updated to reflect the user's latest read activity and the fact that he or she has read the specific article.
3. Refresh: To evaluate the queries from the News FE, Mixer employs an aggressive dynamic caching scheme that balances response time with data freshness. Most queries can be served directly from materialized views in the cache. These views are refreshed either at update time or, when stale, in the background at query time. Mixer monitors background refresh tasks and will include the refresh results in the response if they're available within the time specified by the client. Mixer can gracefully adjust the refresh rate to match the incoming request load. When the load is high, Mixer will decrease the refresh rate and serve more stale results. When the load is low, Mixer will increase the refresh rate and serve fresher results.
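The Refresh step can be illustrated with a small Java sketch. It is not Mixer's code: the class name RefreshingViewCache, the in-memory map standing in for Memcache, the loader function, and the fixed maxAgeMillis threshold are all assumptions made for illustration, and the load-based adjustment of the refresh rate is left out. A fresh view is returned immediately; a stale or missing view triggers a background refresh, and the caller receives the refreshed value only if it arrives within the caller's time budget, otherwise the stale value (or null when nothing is cached yet).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical sketch of a stale-tolerant view cache; not Mixer's actual code. */
final class RefreshingViewCache<K, V> {

    /** A cached view plus the time it was loaded. */
    private static final class Entry<T> {
        final T value;
        final long loadedAtMillis;

        Entry(T value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final ConcurrentMap<K, Entry<V>> views = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // recomputes a view (assumed expensive)
    private final long maxAgeMillis;       // views at least this old count as stale
    private final Executor executor;       // runs background refreshes

    RefreshingViewCache(Function<K, V> loader, long maxAgeMillis, Executor executor) {
        this.loader = loader;
        this.maxAgeMillis = maxAgeMillis;
        this.executor = executor;
    }

    /** Returns the view for key, waiting at most budgetMillis for a refresh. */
    CompletableFuture<V> get(K key, long budgetMillis) {
        Entry<V> cached = views.get(key);
        long now = System.currentTimeMillis();
        if (cached != null && now - cached.loadedAtMillis < maxAgeMillis) {
            return CompletableFuture.completedFuture(cached.value); // fresh hit
        }
        // Stale or missing: refresh in the background and store the result.
        CompletableFuture<V> refresh = CompletableFuture.supplyAsync(() -> {
            V fresh = loader.apply(key);
            views.put(key, new Entry<>(fresh, System.currentTimeMillis()));
            return fresh;
        }, executor);
        V fallback = (cached == null) ? null : cached.value;
        // copy() leaves the refresh itself untouched, so it keeps running and
        // still updates the cache even when the caller times out first.
        return refresh.copy().completeOnTimeout(fallback, budgetMillis, TimeUnit.MILLISECONDS);
    }
}
```

A caller with a 250 millisecond budget would invoke `cache.get(userId, 250)` and render whatever the returned future yields. The next request benefits from the refresh even if this one did not.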
What makes this process different?
The Yahoo! Mixer is a fully asynchronous service that requires significantly fewer threads than thread-per-connection or thread-per-request architectures, such as standard Apache Tomcat applications, which don't scale well with large numbers of connections or requests. Because of this, the Yahoo! Mixer can efficiently handle a large number of outstanding requests to slower services like Facebook with just a small number of threads tuned to match the number of cores on the server.
Building "follows" applications (where users can follow events generated by other users) is inherently difficult because of the high fan-out of events from producers to consumers, which multiplies the load on the data store. Yahoo! Mixer uses an aggressive dynamic caching scheme to reduce the load on the data store and to balance response time with data freshness. This technique minimizes system cost both on workloads with a high query rate and on those with a high event rate, and provides the foundation for a general platform that can be used to build scalable follows applications.
We hope you enjoy using our new Yahoo! News Activity feature.
Mixer is an HTTP service that runs as a single Java process with multiple threads to take full advantage of multi-core CPUs. It is built on several Java asynchronous NIO platforms: Jetty for inbound requests, Netty for outbound HTTP requests, and SpyMemcacheClient for outbound Memcache requests. In the future, Jetty and SpyMemcacheClient will be replaced with native Netty implementations. At the core of Mixer is the Task Execution Engine. The Task Execution Engine maintains the primary execution thread pool and assigns tasks for execution based on priority. All incoming HTTP requests are dispatched by Jetty to the Task Execution Engine for execution by a specific task instance. Each Mixer API has a specific Java class implementation that is derived from the task base class. The base class encapsulates event dispatching and synchronization which provides a simplified concurrent programming model for developers.
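As a rough illustration of priority-based dispatch onto a single execution thread pool, here is a minimal Java sketch. It is not the Task Execution Engine itself: the names PriorityTaskEngine, Task, dispatch, and shutdownAndWait are hypothetical, and the rule that lower numbers run first is an assumption. The sketch relies only on the standard ThreadPoolExecutor and PriorityBlockingQueue classes.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a priority-ordered task engine; not Mixer's actual code. */
final class PriorityTaskEngine {

    /** A unit of work with a priority; lower numbers run first (an assumption). */
    static final class Task implements Runnable, Comparable<Task> {
        private final int priority;
        private final Runnable body;

        Task(int priority, Runnable body) {
            this.priority = priority;
            this.body = body;
        }

        @Override
        public void run() {
            body.run();
        }

        @Override
        public int compareTo(Task other) {
            return Integer.compare(priority, other.priority);
        }
    }

    private final ThreadPoolExecutor pool;

    PriorityTaskEngine(int threads) {
        // Tasks waiting in the queue are handed to free threads in priority order.
        pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new PriorityBlockingQueue<Runnable>(),
                r -> {
                    Thread t = new Thread(r, "task-engine");
                    t.setDaemon(true);
                    return t;
                });
    }

    /** Uses execute(), not submit(): submit() wraps tasks in a non-Comparable FutureTask. */
    void dispatch(Task task) {
        pool.execute(task);
    }

    /** Stops accepting tasks and waits for queued ones; false on timeout or interrupt. */
    boolean shutdownAndWait(long timeoutMillis) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Sizing the pool with `Runtime.getRuntime().availableProcessors()` would match the thread count to the number of cores, in line with the tuning described earlier in the post.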
Tasks can invoke other tasks (i.e. subtasks), and subtasks can continue to run in the background after the original requesting task has completed. Tasks are event-driven and completely asynchronous. A task will give up its execution thread whenever it's waiting for responses from subtasks or from Memcache, Sherpa, and Facebook service calls. A Timer service is used to signal a task when a subtask does not complete within the allotted time.
Memcache maintains two types of materialized views: 1) Consumer-pivoted, and 2) Producer-pivoted. Consumer-pivoted views (e.g. a user's friends' latest read activity) are refreshed at query time by refresh tasks. Producer-pivoted views (e.g. a user's latest read activity) are refreshed at update time (i.e. when a read event is posted). Producer-pivoted views are, in turn, used to refresh consumer-pivoted views.
Sherpa is Yahoo!'s cloud-based NoSQL data store that provides low-latency reads and writes of key-value records and short range scans. Efficient range scans are particularly important for the Mixer use cases. The read event is stored in the Updates table. The Updates table is a Sherpa Distributed Ordered Table that is ordered by user, then by timestamp descending. This provides efficient scans through a user's latest read activity. A reference to the read record is stored in the UpdatesIndex table to support efficient point lookups. UpdatesIndex is a Sherpa Distributed Hash Table.
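The interplay between subtasks and the Timer service can be sketched with standard Java futures. This is an illustration under assumptions, not Mixer's task base class: SubtaskTimer and awaitSubtasks are hypothetical names, subtasks are modeled as CompletableFuture instances, and a single-threaded scheduler stands in for the Timer service. No thread blocks while waiting; the continuation runs when either every subtask has finished or the timer fires, and it collects whatever results are ready at that point. Unfinished subtasks are not cancelled, so they keep running in the background.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of timer-bounded subtask collection; not Mixer's actual code. */
final class SubtaskTimer {

    // Stand-in for the Timer service: one daemon thread that only fires timeouts.
    private static final ScheduledExecutorService TIMER =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "task-timer");
                t.setDaemon(true);
                return t;
            });

    /** Completes with the results of all subtasks that finished within timeoutMillis. */
    static <T> CompletableFuture<List<T>> awaitSubtasks(
            List<CompletableFuture<T>> subtasks, long timeoutMillis) {
        CompletableFuture<Void> all =
                CompletableFuture.allOf(subtasks.toArray(new CompletableFuture<?>[0]));
        CompletableFuture<Void> timeout = new CompletableFuture<>();
        ScheduledFuture<?> timer = TIMER.schedule(
                () -> timeout.complete(null), timeoutMillis, TimeUnit.MILLISECONDS);

        // Runs as soon as either every subtask is done or the timer has fired.
        return CompletableFuture.anyOf(all, timeout).handle((ignored, error) -> {
            timer.cancel(false);
            List<T> ready = new ArrayList<>();
            for (CompletableFuture<T> subtask : subtasks) {
                // Skip subtasks that are still running or that failed.
                if (subtask.isDone() && !subtask.isCompletedExceptionally()) {
                    ready.add(subtask.join());
                }
            }
            return ready;
        });
    }
}
```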
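The relationship between the two view types can be sketched in a few lines of Java. This is a toy model under assumptions, not Mixer's implementation: in-memory maps stand in for Memcache, and the names PivotedViews, ReadEvent, onRead, refreshConsumerView, and cachedConsumerView, along with the fixed per-view limit, are all hypothetical. A read event updates only the reader's own producer-pivoted view, and a consumer-pivoted view is rebuilt at query time by merging the friends' producer-pivoted views.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of producer- and consumer-pivoted views; not Mixer's actual code. */
final class PivotedViews {

    static final class ReadEvent {
        final String userId;
        final String articleId;
        final long timestamp;

        ReadEvent(String userId, String articleId, long timestamp) {
            this.userId = userId;
            this.articleId = articleId;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Deque<ReadEvent>> producerViews = new ConcurrentHashMap<>();
    private final Map<String, List<ReadEvent>> consumerViews = new ConcurrentHashMap<>();
    private final int limit; // maximum events kept per view (an assumption)

    PivotedViews(int limit) {
        this.limit = limit;
    }

    /** Update time: a read event touches only the reader's producer-pivoted view. */
    void onRead(ReadEvent event) {
        Deque<ReadEvent> view =
                producerViews.computeIfAbsent(event.userId, k -> new ArrayDeque<>());
        synchronized (view) {
            view.addFirst(event);
            while (view.size() > limit) {
                view.removeLast(); // drop the oldest entries
            }
        }
    }

    /** Query time: rebuild a consumer-pivoted view from the friends' producer views. */
    List<ReadEvent> refreshConsumerView(String userId, Collection<String> friendIds) {
        List<ReadEvent> merged = new ArrayList<>();
        for (String friendId : friendIds) {
            Deque<ReadEvent> view = producerViews.get(friendId);
            if (view != null) {
                synchronized (view) {
                    merged.addAll(view);
                }
            }
        }
        merged.sort(Comparator.comparingLong((ReadEvent e) -> e.timestamp).reversed());
        List<ReadEvent> latest =
                new ArrayList<>(merged.subList(0, Math.min(limit, merged.size())));
        consumerViews.put(userId, latest);
        return latest;
    }

    /** Serves the last materialized consumer view without recomputing it. */
    List<ReadEvent> cachedConsumerView(String userId) {
        return consumerViews.getOrDefault(userId, List.of());
    }
}
```

The point of the split is cost: a write touches one producer view regardless of how many followers the reader has, while the fan-in work is paid only when a consumer view is actually refreshed.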
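To show why the ordering of the Updates table matters, here is a small in-memory Java analogy. It does not use any Sherpa API: a TreeMap stands in for the Distributed Ordered Table and a HashMap for the Distributed Hash Table, and the names UpdatesTables, Key, put, latest, and hasRead are hypothetical. The index key format (user plus article ID) is also an assumption, since the post does not describe the UpdatesIndex schema. With keys sorted by user and then by timestamp descending, a user's latest reads form one contiguous range that can be scanned from its start.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory analogy of the Updates and UpdatesIndex tables. */
final class UpdatesTables {

    static final class Key {
        final String user;
        final long timestamp;

        Key(String user, long timestamp) {
            this.user = user;
            this.timestamp = timestamp;
        }
    }

    // Ordered by user ascending, then by timestamp descending (newest first).
    private static final Comparator<Key> ORDER =
            Comparator.comparing((Key k) -> k.user)
                    .thenComparing(Comparator.comparingLong((Key k) -> k.timestamp).reversed());

    private final NavigableMap<Key, String> updates = new TreeMap<>(ORDER);
    private final Map<String, Key> updatesIndex = new HashMap<>();

    /** Stores a read event and a reference to it for point lookups. */
    void put(String user, long timestamp, String articleId) {
        Key key = new Key(user, timestamp);
        updates.put(key, articleId); // same user and timestamp overwrites (simplification)
        updatesIndex.put(user + "/" + articleId, key);
    }

    /** Range scan: the user's newest articles, at most limit of them. */
    List<String> latest(String user, int limit) {
        List<String> result = new ArrayList<>();
        NavigableMap<Key, String> range = updates.subMap(
                new Key(user, Long.MAX_VALUE), true,
                new Key(user, Long.MIN_VALUE), true);
        for (String articleId : range.values()) {
            if (result.size() == limit) {
                break;
            }
            result.add(articleId);
        }
        return result;
    }

    /** Point lookup: has this user read this article? */
    boolean hasRead(String user, String articleId) {
        return updatesIndex.containsKey(user + "/" + articleId);
    }
}
```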
Authors: M Sordo, A Linares, Sidharta S, J Thind