Fast-Join: An Efficient Method for Fuzzy Token Matching based String Similarity Join

Abstract

String similarity join that finds similar string pairs between two string sets is an essential operation in many applications, and has attracted significant attention recently in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation to find all similar string pairs which may not match exactly. In this paper, we propose a new similarity metrics, called "fuzzy token matching based similarity", which extends token-based similarity functions (e.g., Jaccard similarity and Cosine similarity) by allowing fuzzy match between two tokens. We study the problem of similarity join using this new similarity metrics and present a signature-based method to address this problem. We propose new signature schemes and develop effective pruning techniques to improve the performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.

Publication

Fast-Join: An Efficient Method for Fuzzy Token Matching based String Similarity Join
[Paper] [PPT]

Codes

Overview

We release the binary code FastJoin that can perform similarity joins over one/two string collections with Fuzzy Token Matching based similarity constrains.

Input

Run FastJoin from command line:

./FastJoin [FJACCARD|FCOSINE|FDICE] delta tau file1 [file2]

Arguments:

[FJACCARD | FCOSINE | FDICE]: Choose FJACCARD, FCOSINE or FDICE to quantify string similarity. FJACCARD denotes Fuzzy-Jaccard similarity; FCOSINE denotes Fuzzy-Cosine similarity; FDICE denotes Fuzzy-Dice similarity.
delta is an edit-similarity threshold (0 < delta ≤ 1)
tau is a fuzzy-token similarity threshold (0 < tau ≤ 1)
file1 [file2] are file names of string collections, where strings are separated by '\n' and tokens are separated by ' ' or '_'. If there's only file1, FastJoin will perform a self-join on file1. Otherwise, FastJoin will perform a join between file1 and file2.

Output

Result: FastJoin prints four lines for each similar pair (string1, string2). The first line consists of fuzzy-token similarity between string1 and string2, the line id of string1 in file1 and the line id of string2 in file2. The second line is string1 and the third line is string2. The fourth line is a blank line. We show an example result for the command "./FastJoin FJACCARD 0.85 0.8 querylog.txt"

fsim line_id1 line_id2
string1
string2
# blank line

# Example
0.88 1985 1205
nba mcgrady
macgrady nba

Statistics: FastJoin prints six statistics to stderr stream. Signature Time and Candidate Time and Verification Time are the running time of three phases of FastJoin. Total Join Time is the total running time of FastJoin. Candidates Number is the number of the generated candidate pairs. Result Number is the number of the final results. We show an example result for the command "./FastJoin FJACCARD 0.85 0.8 querylog.txt"

Signature Time : 25.23s
Candidate Time : 51.32s
Verification Time : 55.03s
Total Join Time : 131.58s
Candidates Number : 80430264
Result Number : 162236

Notes

FastJoin only supports ASCII characters.
FastJoin requires that the edit distance between tokens is no larger than 8 and the number of tokens in a string is no larger than 128.
Fuzzy-Token Similarity subsumes many existing similarity functions, such as Jaccard, Cosine, Dice and Edit Similarity. Therefore, FastJoin can also compute their similarity-join results. For example, the command "./FastJoin FJACCARD 1 0.8 querylog.txt" can join all similar string pairs in "querylog.txt" whose Jaccard similarity is no smaller than 0.8.
If we swap file1 and file2 in the input command, FastJoin may have different running time. Generally, it is better to put in the front the file with more strings.

Download

FastJoin

Data

DBLP Author (9.25MB, 613,542 strings)
AOL Query Log (25.2MB, 1,208,844 strings)

Contact

For any questions about this study, please contact Jiannan Wang.