Trie-Join:Efficient Trie-based String Similarity Joins with Edit Distance Constraints
|
Abstract
A string similarity join finds similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and have the following disadvantages:
|
Publication
|
Codes
OverviewWe provide two programs, TrieJoin and BiTrieJoin, which implement Trie-PathStack and Bi-Trie-PathStack algorithms in the paper respectively.
InputRun TrieJoin (BiTrieJoin) from command line:
./TrieJoin edth file1 [file2] # ./BiTrieJoin edth file1 [file2]
Description: edth is the edit-distance threshold. file1 and file2 are file names of string collections, where strings are separated by '\n'. If there's only file1, the programs will perform a self-join on file1. Otherwise, the programs will perform a join between file1 and file2.
OutputTrieJoin (BiTrieJoin) prints four lines for each similar pair (string1, string2):
ed line_id1 line_id2
string1 string2 # blank line # Example 1 245 789 pvldb vldb Description: The first line consists of ed(string1,string2), the line id of string1 in file1 and the line id of string2 in file2. The second line is string1 and the third line is string2. The fourth line is a blank line. One example output for the similar string pair (pvldb, vldb) is shown on the bottom.
Notes
Download |
Data
|
Contact
For any questions about this study, please contact Jiannan Wang.
|