A set-similarity join (SSJ) finds all pairs of set-based records whose similarity, as measured by a given similarity function, meets a given threshold. Many applications, such as record linkage and plagiarism detection, require efficient SSJ solutions. This talk studies how to perform SSJs efficiently on large datasets using Hadoop. It proposes a three-stage approach that partitions the data across nodes so as to balance the workload and minimize the need for replication. It reports results from extensive experiments on real datasets, synthetically increased in size, that evaluate the speedup and scaleup properties of the proposed algorithms on Hadoop.
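To make the definition above concrete, here is a minimal single-machine sketch of a set-similarity join. It is not the three-stage Hadoop algorithm from the talk: it compares every pair of records directly, which is the quadratic baseline that a parallel approach aims to avoid. The choice of Jaccard similarity, the function names `jaccard` and `naive_ssj`, and the sample records are all assumptions made for illustration; the abstract does not name a specific similarity function.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (assumed measure)."""
    if not a and not b:
        # Convention chosen here: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


def naive_ssj(records, threshold):
    """Return (id1, id2, similarity) for every pair at or above threshold.

    `records` maps a record id to its set of tokens. Every pair is
    compared once, so the cost grows quadratically with the input size.
    """
    results = []
    for (id1, s1), (id2, s2) in combinations(sorted(records.items()), 2):
        sim = jaccard(s1, s2)
        if sim >= threshold:
            results.append((id1, id2, sim))
    return results


# Hypothetical sample data: each record is a set of word tokens.
records = {
    1: {"efficient", "parallel", "set", "similarity", "joins"},
    2: {"parallel", "set", "similarity", "joins"},
    3: {"hadoop", "cluster", "scheduling"},
}

print(naive_ssj(records, 0.8))  # prints [(1, 2, 0.8)]
```

Records 1 and 2 share four of their five distinct tokens, so their Jaccard similarity is 4/5 and the pair meets the 0.8 threshold. Record 3 shares no tokens with either and is filtered out.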
From Hadoop2010: Efficient Parallel Set-Similarity Joins