Can We Beat the Prefix Filtering?
An Adaptive Framework for Similarity Join and Search |
Abstract
As two important operations in data cleaning, similarity join and similarity search have attracted much attention recently. Existing methods to support similarity join usually adopt a prefix-filtering-based framework. They select a prefix of each object and prune object pairs whose prefixes have no overlap. We have an observation that prefix lengths have significant effect on the performance. Different prefix lengths lead to significantly different performance, and prefix filtering does not always achieve high performance. To address this problem, in this paper we propose an adaptive framework to support similarity join. We propose a cost model to judiciously select an appropriate prefix for each object. To efficiently select prefixes, we devise effective indexes. We extend our method to support similarity search. Experimental results show that our framework beats the prefix-filtering-based framework and achieves high efficiency.
|
Publication
|
Codes
1. OverviewWe release the binary codes, AdaptJoin and AdaptSearch, which respectively implement the adaptive similarity-join and similarity-search algorithms in the paper.
2. AdaptJoinAdaptJoin can support three similarity functions, jaccard, cosine and edit distance.
2.1 Jaccard / Cosine
To support jaccard or cosine, run AdaptJoin from the command line:
./AdaptJoin [jaccard | cosine] theta filename
Arguments:
Output: AdaptJoin prints four lines for each similar pair (string1, string2). The first line consists of the similarity between string1 and string2, the line id of string1 and the line id of string2. The second line is string1 and the third line is string2. The fourth line is a blank line.
We show an example result for the command "./AdaptJoin jaccard 0.7 querylog.txt" sim line_id1 line_id2
string1 string2 # blank line # Example 0.75 1985 1205 nba tracy mcgrady tracy mcgrady nba player 2.2 Edit Distance
To support edit distance, run AdaptJoin from the command line:
./AdaptJoin editdist [gram | indexgram | indexchunk] q edth filename
Arguments:
Output: AdaptJoin prints four lines for each similar pair (string1, string2). The first line consists of the edit distance between string1 and string2, the line id of string1 and the line id of string2. The second line is string1 and the third line is string2. The fourth line is a blank line.
We show an example result for the command "./AdaptJoin editdist gram 3 1 querylog.txt" ed line_id1 line_id2
string1 string2 # blank line # Example 1 1985 1205 pvldb vldb 3. AdaptSearchAdaptSearch can support three similarity functions, jaccard, cosine and edit distance.
3.1 Jaccard / Cosine
To support jaccard or cosine, run AdaptSearch from the command line:
./AdaptSearch [jaccard | cosine] datafile queryfile
Arguments:
Output: AdaptSearch first builds an index on the strings in datafile, and then searches similar strings for each query in queryfile. AdaptSearch prints three parts for a query. The first part is the query itself. The second part is the search result. The third part is a black line. In the second part, there may be several lines, where each line represents a similar string with the query. Each line consists of the string id, the string, and the similarity between the string and the query string.
We show an example result for the command "./AdaptSearch jaccard data.txt query.txt" Query_id: query_string theta
line_id1 string1 sim1 line_id2 string2 sim2 line_id3 string3 sim3 ... # blank line # Example Query_12: tracy mcgrady nba player 0.7 1985 nba tracy mcgrady 0.75 1205 usa nba player tracy mcgrady 0.8 3.2 Edit Distance
To support edit distance, run AdaptSearch from the command line:
./AdaptSearch editdist [gram | indexgram | indexchunk] q datafile queryfile
Arguments:
Output: AdaptSearch first builds an index on the strings in datafile, and then searches similar strings for each query in queryfile. AdaptSearch prints three parts for a query. The first part is the query itself. The second part is the search result. The third part is a black line. In the second part, there may be several lines, where each line represents a similar string with the query. Each line consists of the string id, the string, and the edit distance between the string and the query string.
We show an example result for the command "./AdaptSearch editdist gram 3 data.txt query.txt" Query_id: query_string edth
line_id1 string1 ed line_id2 string2 ed line_id3 string3 ed ... # blank line # Example Query_12: vldb 2 1985 pvldb 1 1205 vldbj 1 2731 vldb12 2 Download |
Data
* Note: Each tar file contains a data file and a query file. The query file consists of 10,000 query strings. It is used to evalute AdaptSearch.
|
Contact
For any questions about this study, please contact Jiannan Wang.
|