At a recent Hadoop User Group meeting, I made a presentation on how we leverage hadoop for spam mitigation in Yahoo! Mail. A number of people followed up requesting additional details of our architecture and engineering strategy.
In this post, I am going to try and capture our antispam engineering story, how it came to be hadoop centric and how well the new architecture has worked. I will also highlight the results we have been able to achieve. Finally, I will provide an update on when we will be releasing these updates to wide production.
At the Hadoop User group presentation, I had delved into the details of two interesting antispam algorithms. The first was "frequent itemset mining", the second was what we called the "connected components" algorithm. Both these algorithms are implemented as part of our tools portfolio. They are used by engineers, product managers and operations analysts to get a compact summary of the major trends in spam. Both these tools were implemented as part of a new engineering strategy we put in place second quarter of last year.
Our new strategy called for pointed improvements in the ability of systems to digest massive amounts of data. The first portion of the strategy, implemented by the end of 2009, targeted our reporting systems, tool chains and existing abuse reputation algorithms. The proposal was to increase the granularity of the data being handled, increase the response time to detect an attack and do to do more early detection of spam attacks. In our analysis, we quickly realized that even the small changes we were proposing to our reporting systems, tools and algorithms required us to scale our existing systems well beyond the limits that they were meant to scale.
Also, we found that engineering, product and customer advocacy teams were all hungry for data and it would be great to support additional requirements around ad hoc joins across data streams and support a general "slice" and "dice" approach to data engineering. Our first revisions to the sender and content classifiers also made it abundantly clear that we needed massive storage and massive compute.
We took the simple approach of putting ALL the data that we would possibly want to query or develop algorithms on, on a hadoop grid and let the grid scale to the storage and compute requirements. To give you some idea of the scale involved here, let me provide some ball park metrics.
By the end of this quarter, we will be loading close to 4TB of antispam data on our hadoop grids every day and we will be querying several days of data for report generation and running automated classifiers and algorithms at a frequency of a few minutes. We have not run into any scalability problems so far. In general we have found that with proper data organization, hadoop is able to scale linearly with data and compute requirements.
I will complete this section by saying that this strategy has had a huge impact on spam complaints. See for you self; I am enclosing the graph of our spam complaints from last year. The big dip corresponds to when we shifted our reports, algorithms and filters to the hadoop grid. Need I say more?
While making changes to how an existing system works is interesting and clearly the first step, the second and more interesting step is the development of brand new, distributed reputation algorithms using hadoop. Once again, our new engineering strategy called for the rewiring of all algorithms to run in parallel and increase the level of feature engineering across heuristic, statistical and machine learning systems. We needed to do this across reputation algorithms for IPs, domains, from addresses(senders), receivers(users) and content. Once again, we realized that much of the complexity was in massive data engineering. We needed to ensure that we used every bit of data that would help us make a spam/notspam decision. We also had to choose an appropriate model that could interpret this vast amount of data without getting overwhelmed.
The term "massive feature engineering" should be familiar to people in the area of machine learning. In more common engineering terms, we needed to associate several pieces of meta data to every entity we needed to classify and we needed to choose algorithms that would parallelize well. We have been hard at work the last 5 months doing this new "hadoop engineering". By the end of this month, we hope to release our first hadoop based, massively feature engineered, distributed sender classification algorithm. Code named zeroB, our initial tests make this a very compelling replacement for our current sender management system. It is 25% more accurate while being faster and cheaper to run and maintain than the current version.
Indeed I have now come to believe that Hadoop has tremendous applicability to the abuse and security domains as a whole. Both these domains have the proverbial problem of finding the needle in the hay stack and hadoop is well equipped for this task. With the amount of spam that large mail systems like yahoo see, it is truly important to employ powerful frameworks like Hadoop to ensure the problem remains tractable.
Yahoo mail was recently voted by the renowned fraunhaufer institute as the best free mail service for spam management. This recognition clearly demonstrates that our new hadoop based strategy is working and working very well. This is just the tip of the success though. In the next 3 months, we are rolling out many of these new systems to our wide install base in the United States and I am eagerly waiting to see the effect this has on spam. It has been fun and exciting building these systems on top of hadoop but it has been even more exciting to see us winning the war on spam.