Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search

Abstract

As two important operations in data cleaning, similarity join and similarity search have attracted much attention recently. Existing methods to support similarity join usually adopt a prefix-filtering-based framework. They select a prefix of each object and prune object pairs whose prefixes have no overlap. We have an observation that prefix lengths have significant effect on the performance. Different prefix lengths lead to significantly different performance, and prefix filtering does not always achieve high performance. To address this problem, in this paper we propose an adaptive framework to support similarity join. We propose a cost model to judiciously select an appropriate prefix for each object. To efficiently select prefixes, we devise effective indexes. We extend our method to support similarity search. Experimental results show that our framework beats the prefix-filtering-based framework and achieves high efficiency.

Publication

Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search
[Paper] [PPT]

Codes

1. Overview

We release the binary codes, AdaptJoin and AdaptSearch, which respectively implement the adaptive similarity-join and similarity-search algorithms in the paper.

2. AdaptJoin

AdaptJoin can support three similarity functions, jaccard, cosine and edit distance.

2.1 Jaccard / Cosine

To support jaccard or cosine, run AdaptJoin from the command line:

./AdaptJoin [jaccard | cosine] theta filename

Arguments:

[jaccard | cosine]: choose jaccard or cosine to quantify string similarity.
theta is a similarity threshold (0 < theta ≤ 1)
filename is the file name of a string collection, where strings are separated by '\n' and tokens in each string are separated by ' ' or '_'.

Output: AdaptJoin prints four lines for each similar pair (string1, string2). The first line consists of the similarity between string1 and string2, the line id of string1 and the line id of string2. The second line is string1 and the third line is string2. The fourth line is a blank line.
We show an example result for the command "./AdaptJoin jaccard 0.7 querylog.txt"

sim line_id1 line_id2
string1
string2
# blank line

# Example
0.75 1985 1205
nba tracy mcgrady
tracy mcgrady nba player

2.2 Edit Distance

To support edit distance, run AdaptJoin from the command line:

./AdaptJoin editdist [gram | indexgram | indexchunk] q edth filename

Arguments:

editdist: choose edit distance to quantify string similarity.
[gram | indexgram | indexchunk]: Adaptjoin supports three ways to map a string to a set.
q is a gram length.
edth is an edit-distance threshold (edth≥0, integer).
filename is the file name of a string collection, where strings are separated by '\n'.

Output: AdaptJoin prints four lines for each similar pair (string1, string2). The first line consists of the edit distance between string1 and string2, the line id of string1 and the line id of string2. The second line is string1 and the third line is string2. The fourth line is a blank line.
We show an example result for the command "./AdaptJoin editdist gram 3 1 querylog.txt"

ed line_id1 line_id2
string1
string2
# blank line

# Example
1 1985 1205
pvldb
vldb

3. AdaptSearch

AdaptSearch can support three similarity functions, jaccard, cosine and edit distance.

3.1 Jaccard / Cosine

To support jaccard or cosine, run AdaptSearch from the command line:

./AdaptSearch [jaccard | cosine] datafile queryfile

Arguments:

[jaccard | cosine]: choose jaccard or cosine to quantify string similarity.
datafile is the file name of a string collection, where strings are separated by '\n' and tokens in each string are separated by ' ' or '_'.
queryfile is the file name of a query collection, where queries are separated by '\n'. Each query consists of a query string and a similarity threshold, which are separated by '\t'.

Output: AdaptSearch first builds an index on the strings in datafile, and then searches similar strings for each query in queryfile. AdaptSearch prints three parts for a query. The first part is the query itself. The second part is the search result. The third part is a black line. In the second part, there may be several lines, where each line represents a similar string with the query. Each line consists of the string id, the string, and the similarity between the string and the query string.
We show an example result for the command "./AdaptSearch jaccard data.txt query.txt"

Query_id: query_string theta
line_id1 string1 sim1
line_id2 string2 sim2
line_id3 string3 sim3
...
# blank line

# Example
Query_12: tracy mcgrady nba player 0.7
1985 nba tracy mcgrady 0.75
1205 usa nba player tracy mcgrady 0.8

3.2 Edit Distance

To support edit distance, run AdaptSearch from the command line:

./AdaptSearch editdist [gram | indexgram | indexchunk] q datafile queryfile

Arguments:

editdist: choose edit distance to quantify string similarity.
[gram | indexgram | indexchunk]: AdaptSearch supports three ways to map a string to a set.
q is a gram length.
datafile is the file name of a string collection, where strings are separated by '\n'.
queryfile is the file name of a query collection, where queries are separated by '\n'. Each query consists of a query string and a similarity threshold, which are separated by '\t'.

Output: AdaptSearch first builds an index on the strings in datafile, and then searches similar strings for each query in queryfile. AdaptSearch prints three parts for a query. The first part is the query itself. The second part is the search result. The third part is a black line. In the second part, there may be several lines, where each line represents a similar string with the query. Each line consists of the string id, the string, and the edit distance between the string and the query string.
We show an example result for the command "./AdaptSearch editdist gram 3 data.txt query.txt"

Query_id: query_string edth
line_id1 string1 ed
line_id2 string2 ed
line_id3 string3 ed
...
# blank line

# Example
Query_12: vldb 2
1985 pvldb 1
1205 vldbj 1
2731 vldb12 2

Download

AdaptJoin && AdaptSearch

Data

AOL Query Log (25.2MB, 1,208,844 strings)
DBLP Authors+Title (147.3MB, 1,385,925 strings)
Enron Title+Body (455.7MB, 517,431 strings)

* Note: Each tar file contains a data file and a query file. The query file consists of 10,000 query strings. It is used to evalute AdaptSearch.

Contact

For any questions about this study, please contact Jiannan Wang.