Hadoop2010: Efficient Parallel Set-Similarity Joins

A set-similarity join (SSJ) finds pairs of set-based records whose similarity, as measured by a given similarity function, meets a given threshold. Many applications, such as record linkage and plagiarism detection, require efficient SSJ solutions. This talk studies how to perform SSJs efficiently on large data sets using Hadoop. It proposes a three-stage approach that partitions the data across nodes so as to balance the workload and minimize the need for replication. It reports results from extensive experiments on real data sets, synthetically increased in size, that evaluate the speedup and scaleup properties of the proposed algorithms on Hadoop.
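
To make the SSJ definition concrete, here is a minimal single-machine sketch in Python. It is not the talk's three-stage Hadoop approach: it simply compares every pair of records. The abstract does not name a similarity function, so the use of Jaccard similarity is an assumption, as are the function names `jaccard` and `naive_self_join` and the sample records.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed here; 0.0 for two empty sets)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def naive_self_join(records: dict, threshold: float) -> list:
    """Return (id1, id2, similarity) for every pair whose similarity meets the threshold.

    Hypothetical helper for illustration: it checks all pairs, so its cost
    grows quadratically with the number of records.
    """
    results = []
    for (id1, s1), (id2, s2) in combinations(sorted(records.items()), 2):
        sim = jaccard(s1, s2)
        if sim >= threshold:
            results.append((id1, id2, sim))
    return results


# Illustrative records: each one is the set of words in a short string.
records = {
    "r1": frozenset("efficient parallel set similarity joins".split()),
    "r2": frozenset("efficient parallel similarity joins".split()),
    "r3": frozenset("plagiarism detection in large collections".split()),
}

pairs = naive_self_join(records, threshold=0.8)
print(pairs)  # only r1 and r2 share enough words to qualify
```

Because this sketch examines every pair, it becomes impractical as the number of records grows, which is the kind of cost that distributing the join across Hadoop nodes is meant to address.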

Media Production by BAYCAT, a non-profit community media producer that educates and employs underserved youth and adults in the digital media arts.