M45 Enables Web-Scale Information Extraction Research

About us
We are PhD students at Carnegie Mellon in the Machine Learning Department and the Language Technologies Institute, and our thesis work is part of the Read the Web project, which is led by Professor Tom Mitchell. The goal of our project is to build a system that can start from a limited amount of knowledge and gradually learn to understand language by utilizing millions of web pages. The project serves as a great case study for developing new Machine Learning and Natural Language Processing techniques. We use the Yahoo! M45 Cluster with Hadoop extensively in our work, and this post gives an example of how we use M45 and why it is essential to our research project.

Fact extraction using statistics on patterns and instances
Our algorithms learn to extract facts from the web using statistics computed over a large collection of web pages. They start from an initial input consisting of categories of interest (e.g., Company, City, Sports Team), relations of interest (e.g., CompanyHeadquarteredInCity(Company, City), SportsTeamHomeCity(SportsTeam, City)), and 15 "seed" instances of each category and relation (e.g., Yahoo! and IBM are instances of the Company category, and the pair (White Sox, Chicago) is an instance of the SportsTeamHomeCity relation).

The system learns new facts by discovering contextual patterns like "treaty with X" (good evidence that X is a country) and "X corporate headquarters in Y". It learns patterns like these by observing that seed instances fill their slots; for example, Yahoo! and Sunnyvale fill the X and Y slots in "X corporate headquarters in Y". It also applies learned patterns to discover new pairs of noun phrases that might satisfy the same relation. Because patterns are rarely perfect extractors of facts, the system ranks candidate patterns and instances using statistics about how often they co-occur with each other. For example, "treaty with Lincoln" occurs 5 times on the web, while "treaty with Japan" occurs about 65,000 times: mistakes happen, but for useful patterns the mistakes have much smaller counts. (However, "treaty with aliens" occurs 168,000 times... Maybe this wasn't an ideal example... :) )
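To make the co-occurrence idea concrete, here is a minimal Python sketch (not our actual system) of ranking candidate instances by how many learned patterns they fill and by their total counts. The "ambassador to _" pattern and all the counts in the table are invented for illustration; the "treaty with _" numbers simply echo the approximate figures above.

# A minimal, hypothetical sketch of ranking candidate category instances by
# how often they co-occur with learned patterns.  The counts below are
# invented for illustration, not taken from our real data.
co_occurrence = {
    "treaty with _": {"Japan": 65000, "Lincoln": 5, "aliens": 168000},
    "ambassador to _": {"Japan": 40000, "Lincoln": 2},
}

def score(noun_phrase, patterns):
    # Prefer candidates supported by many distinct patterns, breaking ties by
    # total count.  Requiring support from several patterns helps filter out
    # a huge count that comes from a single noisy pattern.
    counts = [co_occurrence[p].get(noun_phrase, 0) for p in patterns]
    distinct = sum(1 for c in counts if c > 0)
    return (distinct, sum(counts))

patterns = list(co_occurrence)
for candidate in sorted(["Japan", "Lincoln", "aliens"],
                        key=lambda np: score(np, patterns), reverse=True):
    print(candidate, score(candidate, patterns))

Under this toy scoring, "aliens" drops below the candidates that are supported by more than one pattern, despite its large single-pattern count.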

Before M45: Hit counts
At a small scale, it is possible to run such algorithms using search APIs like Yahoo!'s BOSS service. However, to get accurate statistics of how many times each candidate pattern occurs with each candidate pair of noun phrases, a program needs the hit count of the string that results from plugging the noun phrases into the pattern. If we implemented this using individual hit count queries, then for 10,000 patterns and 10,000 noun phrases (and hence roughly 10,000 x 10,000 candidate pairs), about 10,000^3 hit count queries would be necessary to find the number of times each pair of noun phrases occurs with each pattern. At one query per second, this would take over 30,000 years. In our work, this ruled out using search APIs to gather the necessary statistics from the web.
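For reference, the back-of-the-envelope arithmetic behind that estimate:

# Queries needed for exhaustive hit-count lookups, at one query per second.
patterns = 10000
noun_phrases = 10000
pairs = noun_phrases * noun_phrases        # candidate noun phrase pairs
queries = patterns * pairs                 # one hit count query per combination
years = queries / (60.0 * 60 * 24 * 365)
print(queries, round(years))               # 10**12 queries, roughly 31,700 years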

Enter M45, stage West
This is where the M45 computing cluster and Hadoop have been invaluable for our group's research. Our first solution worked like this: we processed a large web corpus (crawled by Jamie Callan and his group here at CMU) to extract all of the sentences in the web pages. This yielded a corpus of approximately 500 million unique sentences. We then wrote subroutines which submitted jobs to M45 to answer three types of queries:

a) Given a list of patterns, what noun phrases fill in the blanks of those patterns?
b) Given a list of noun phrases, what patterns do those noun phrases occur with?
c) Given a list of patterns and noun phrases, how many times does each pattern co-occur with each noun phrase (or pair of noun phrases)?

After some effort to implement these subroutines efficiently (e.g., using trie data structures to decide quickly whether a given sentence contained any strings of interest), our routines were able to scan through the entire corpus in about 10 minutes using only 30 nodes on M45.
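As a rough illustration of query type (a), here is a Hadoop Streaming-style mapper sketch in Python. It is not our actual M45 code: our subroutines used trie-based matching over tokenized sentences and handled many thousands of patterns at once, whereas the tiny pattern list and regex-based slot filling below are simplifying assumptions.

#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: read one sentence per line, emit
# (pattern, filler, 1) whenever a contextual pattern matches.  A companion
# reducer would sum the counts per (pattern, filler) key.
import re
import sys

PATTERNS = [
    "treaty with _",        # "_" marks the slot filled by a noun phrase
    "_ is a framework",
    "users of _",
]

def compile_pattern(pattern):
    # Turn "treaty with _" into a regex that captures a capitalized
    # noun-phrase-like span in place of the "_" slot.
    left, right = (re.escape(part.strip()) for part in pattern.split("_"))
    slot = r"\s*([A-Z][\w.-]*(?:\s+[A-Z][\w.-]*)*)\s*"
    return re.compile(left + slot + right)

COMPILED = [(p, compile_pattern(p)) for p in PATTERNS]

for line in sys.stdin:
    sentence = line.strip()
    for name, regex in COMPILED:
        for match in regex.finditer(sentence):
            print("{}\t{}\t1".format(name, match.group(1)))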

Extracting everything once
This first M45-based solution let us perform web-scale research that was simply not possible for us before. However, our system still needed to call the M45-based subroutines dozens of times due to its iterative nature. Thus, we decided to use M45 to generate a data set of statistics that could be copied to our local workstations and obviate the need to access M45 at run time. We generated a data set listing every noun phrase and every contextual pattern in the corpus, along with the number of times each noun phrase co-occurred with each pattern. This yielded about 50GB of uncompressed text, which we can scan through in 10-20 minutes on a single local machine.

To give an example of what the processed data looks like, here are the patterns that co-occur with "Hadoop" most often in the data (with counts):

instance of _   5
top of _        4
_ is a framework        4
users of _      3
_ is an open source implementation      3
clusters running _      3
versions of _   2
version of _    2
tutorial for _  2
system using _  2
system like _   2
processing system using _       2
platform called _       2
look at _       2
_ has been tested on    2
_ has been demonstrated on      2
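As a sketch of how such a local scan might look, the following loop streams through a co-occurrence file and collects the pattern statistics for a few noun phrases of interest. The file name and the tab-separated (noun phrase, pattern, count) layout are assumptions made for this example, not a description of the released data format.

# Minimal sketch: stream through a local co-occurrence file and collect
# pattern counts for a few noun phrases of interest.  "np_pattern_counts.tsv"
# and its layout are hypothetical.
from collections import defaultdict

TARGETS = {"Hadoop", "MapReduce"}
stats = defaultdict(lambda: defaultdict(int))

with open("np_pattern_counts.tsv") as f:
    for line in f:
        noun_phrase, pattern, count = line.rstrip("\n").split("\t")
        if noun_phrase in TARGETS:
            stats[noun_phrase][pattern] += int(count)

for noun_phrase, patterns in stats.items():
    top = sorted(patterns.items(), key=lambda kv: kv[1], reverse=True)[:15]
    for pattern, count in top:
        print("{}\t{}\t{}".format(noun_phrase, pattern, count))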

Having this local data set lets us run multiple experiments at the same time on different machines, spares us from waiting when M45 is crowded, and frees up M45 for other users.
We're also able to explore new algorithms that can learn to cluster similar noun phrases, try to understand the meaning of individual patterns, etc., without having to use M45 all the time.
In addition, we are sharing this data with other students here at Carnegie Mellon by offering a course this semester whose sole purpose is for the students to do something cool with the data as a course project. We're excited to see the results!

The future
We recently parsed our entire set of 500 million sentences using M45, a scale seldom attained by academic researchers. Even though we used an off-the-shelf dependency parser that was reasonably fast (for a syntactic parser), the task still required the equivalent of 4,000 M45 node-days. It required breaking the task up into many small pieces and writing scripts that submitted jobs based on the number of free nodes on M45, which let us avoid hogging the whole cluster. Whereas our extraction patterns were previously based only on part-of-speech tags (e.g., noun, verb), we can now use syntactic parse tree fragments, enabling more flexibility and generality in matching.
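One way such a throttled submission loop might look (not our actual scripts): keep a queue of work pieces and only submit the next job when the cluster has spare capacity. The helpers count_free_nodes() and submit_parse_job() stand in for cluster-specific commands and are purely hypothetical.

# Hypothetical sketch of throttled job submission.  count_free_nodes() and
# submit_parse_job() are stand-ins for cluster-specific commands; they are
# not real M45 or Hadoop APIs.
import time

def count_free_nodes():
    raise NotImplementedError("query the cluster's status here")

def submit_parse_job(chunk_path):
    raise NotImplementedError("launch a parsing job over one chunk here")

def submit_all(chunk_paths, min_free_nodes=20, poll_seconds=300):
    pending = list(chunk_paths)
    while pending:
        if count_free_nodes() >= min_free_nodes:
            submit_parse_job(pending.pop(0))
        else:
            time.sleep(poll_seconds)  # wait politely instead of hogging the cluster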

We also now have an order of magnitude more data to play with, again thanks to Jamie Callan. We estimate that the English portion of the ClueWeb09 data set will yield 5 billion sentences once we process it.

Thank you, Yahoo!

Without the generous help of Yahoo!, we (and many other academic researchers at the schools fortunate enough to have access to M45) would be largely left on the sidelines, unable to do web-scale research.
Thank you, Yahoo!

Andy Carlson
Graduate Student @ Carnegie Mellon, Machine Learning Department

Justin Betteridge
Graduate Student @ Carnegie Mellon, Language Technologies Institute