Fast-Join: An Efficient Method for Fuzzy Token Matching based String Similarity Join
|
Abstract
String similarity join that finds similar string pairs between two string sets is an essential operation in many applications, and has attracted significant attention recently in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation to find all similar string pairs which may not match exactly. In this paper, we propose a new similarity metrics, called "fuzzy token matching based similarity", which extends token-based similarity functions (e.g., Jaccard similarity and Cosine similarity) by allowing fuzzy match between two tokens. We study the problem of similarity join using this new similarity metrics and present a signature-based method to address this problem. We propose new signature schemes and develop effective pruning techniques to improve the performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.
|
Publication
|
Codes
OverviewWe release the binary code FastJoin that can perform similarity joins over one/two string collections with Fuzzy Token Matching based similarity constrains.
InputRun FastJoin from command line:
./FastJoin [FJACCARD|FCOSINE|FDICE] delta tau file1 [file2]
Arguments:
OutputResult: FastJoin prints four lines for each similar pair (string1, string2).
The first line consists of fuzzy-token similarity between string1 and string2, the line id of string1 in file1 and the line id of string2 in file2. The second line is string1 and the third line is string2. The fourth line is a blank line. We show an example result for the command "./FastJoin FJACCARD 0.85 0.8 querylog.txt"
fsim line_id1 line_id2
string1 string2 # blank line # Example 0.88 1985 1205 nba mcgrady macgrady nba Statistics: FastJoin prints six statistics to stderr stream. Signature Time and Candidate Time and Verification Time are the running time of three phases of FastJoin. Total Join Time is the total running time of FastJoin. Candidates Number is the number of the generated candidate pairs. Result Number is the number of the final results. We show an example result for the command "./FastJoin FJACCARD 0.85 0.8 querylog.txt"
Signature Time : 25.23s
Candidate Time : 51.32s Verification Time : 55.03s Total Join Time : 131.58s Candidates Number : 80430264 Result Number : 162236 Notes
Download |
Data
|
Contact
For any questions about this study, please contact Jiannan Wang.
|