Abstract
A string similarity join finds similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and have the following disadvantages:
  1. They are inefficient for data sets with short strings (average string length no larger than 30);
  2. They involve large indexes;
  3. They are expensive to maintain under dynamic updates of the data sets.
To address these problems, we propose a novel framework called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find the similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic update of data sets efficiently. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on three real data sets with short strings.
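To make the idea of subtrie pruning concrete, here is a minimal Python sketch. It is not the Trie-PathStack or Bi-Trie-PathStack algorithm from the paper; it swaps in the classic approach of walking a trie while carrying one dynamic-programming row per node, and it stops descending once every entry of a node's row exceeds the threshold. All names (`TrieNode`, `build_trie`, `search`, `self_join`) are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.ids: List[int] = []  # ids of the strings that end at this node


def build_trie(strings: List[str]) -> TrieNode:
    """Index every string in a character trie."""
    root = TrieNode()
    for i, s in enumerate(strings):
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.append(i)
    return root


def search(root: TrieNode, query: str, tau: int) -> List[Tuple[int, int]]:
    """Return (string id, edit distance) for indexed strings within tau of query."""
    results: List[Tuple[int, int]] = []
    first_row = list(range(len(query) + 1))  # distances from the empty prefix
    if first_row[-1] <= tau:
        results.extend((i, first_row[-1]) for i in root.ids)
    stack = [(child, ch, first_row) for ch, child in root.children.items()]
    while stack:
        node, ch, prev = stack.pop()
        # Extend the edit-distance table by one row for the prefix ending in ch.
        row = [prev[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        if row[-1] <= tau:
            results.extend((i, row[-1]) for i in node.ids)
        # Subtrie pruning: the row minimum never decreases along a path, so if
        # it already exceeds tau no descendant can be within the threshold.
        if min(row) <= tau:
            stack.extend((child, c, row) for c, child in node.children.items())
    return results


def self_join(strings: List[str], tau: int) -> List[Tuple[int, int, int]]:
    """Return (ed, i, j) with i < j for all pairs within edit distance tau."""
    root = build_trie(strings)
    pairs = []
    for i, s in enumerate(strings):
        for j, d in search(root, s, tau):
            if i < j:
                pairs.append((d, i, j))
    return pairs
```

For example, `self_join(["pvldb", "vldb", "sigmod", "vldbj"], 1)` reports the pairs (pvldb, vldb) and (vldb, vldbj). Strings that share a prefix share the rows computed for that prefix, which is where the trie saves work over comparing every pair independently.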
Publication
Trie-Join: Efficient Trie-based String Similarity Joins with Edit Distance Constraints
[Paper] [PPT]
Trie-join: a trie-based method for efficient string similarity joins
[Paper]
Codes

Overview

We provide two programs, TrieJoin and BiTrieJoin, which implement the Trie-PathStack and Bi-Trie-PathStack algorithms from the paper, respectively.

Input

Run TrieJoin (BiTrieJoin) from the command line:
./TrieJoin edth file1 [file2]        # ./BiTrieJoin edth file1 [file2]
Description: edth is the edit-distance threshold. file1 and file2 are the file names of the string collections, in which strings are separated by '\n' (one string per line). If only file1 is given, the programs perform a self-join on file1. Otherwise, they perform a join between file1 and file2.
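For checking results on small inputs, a brute-force reference join can be handy. The Python sketch below mirrors the two modes described above (self-join when only one collection is given, otherwise a join between two collections). It is not part of TrieJoin or BiTrieJoin; the function names are hypothetical, ids are 0-based list positions, and the self-join reports each unordered pair once. Whether the programs follow the same id and pair conventions is an assumption that should be checked against their actual output.

```python
from typing import List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def reference_join(edth: int, r: List[str],
                   s: Optional[List[str]] = None) -> List[Tuple[int, int, int]]:
    """Return (ed, id1, id2) for every pair within the threshold edth."""
    out = []
    if s is None:
        # Self-join: compare each unordered pair once (assumed convention).
        for i in range(len(r)):
            for j in range(i + 1, len(r)):
                d = edit_distance(r[i], r[j])
                if d <= edth:
                    out.append((d, i, j))
    else:
        # Join between two collections: compare every string of r with every string of s.
        for i, a in enumerate(r):
            for j, b in enumerate(s):
                d = edit_distance(a, b)
                if d <= edth:
                    out.append((d, i, j))
    return out
```

This runs in quadratic time in the number of strings, so it is only suitable for validating output on small samples.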

Output

TrieJoin (BiTrieJoin) prints four lines for each similar pair (string1, string2):
ed line_id1 line_id2
string1
string2
# blank line

# Example
1 245 789
pvldb
vldb

Description: The first line consists of ed(string1, string2), the line id of string1 in file1, and the line id of string2 in file2. The second line is string1 and the third line is string2. The fourth line is blank. An example output for the similar string pair (pvldb, vldb) is shown above.
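The four-line record format can be read back with a short parser. The Python sketch below assumes the output consists solely of such records, each exactly four lines long, and stops at the first line that does not look like a header. The names `SimilarPair` and `parse_output` are hypothetical and not part of the distributed code.

```python
from typing import List, NamedTuple


class SimilarPair(NamedTuple):
    ed: int
    line_id1: int
    line_id2: int
    string1: str
    string2: str


def parse_output(text: str) -> List[SimilarPair]:
    """Parse records of the form: 'ed id1 id2', string1, string2, blank line."""
    lines = text.split("\n")
    pairs = []
    # Step through the text in fixed four-line records.
    for i in range(0, len(lines) - 2, 4):
        header = lines[i].split()
        if len(header) != 3:
            break  # trailing blank line or content that is not a record
        ed, id1, id2 = (int(x) for x in header)
        pairs.append(SimilarPair(ed, id1, id2, lines[i + 1], lines[i + 2]))
    return pairs
```

Reading fixed four-line records, rather than splitting on blank lines, keeps the parser correct even if a collection contains an empty string.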

Notes

  1. TrieJoin and BiTrieJoin only support ASCII characters.
  2. When loading strings from the file, long strings (length>=1000) are removed.
  3. We optimized TrieJoin for the self-join case on one string collection, where it is about 2x faster than the join algorithm between two collections. However, we did little optimization of BiTrieJoin for the self-join case, so its performance there is almost the same as that of the join algorithm between two collections.
  4. If we swap file1 and file2 in the input command, TrieJoin (BiTrieJoin) may have a different running time. Generally, it is better to put the file with the smaller trie size first.
  5. If you are interested in algorithm selection for different data sets, please refer to Table 3 and Figure 11 (Paper).
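Notes 1 and 2 suggest filtering a collection before running the programs, so that it is clear which strings take part in the join. The Python sketch below keeps only ASCII strings shorter than 1000 characters and remembers each string's original position. It is a hypothetical preprocessing helper, not the programs' own loader; how the programs treat non-ASCII input, and whether reported line ids count removed strings, is not stated here and is left as an open assumption.

```python
from typing import Iterable, List, Tuple


def filter_strings(lines: Iterable[str], max_len: int = 1000) -> List[Tuple[int, str]]:
    """Keep (original 0-based position, string) for ASCII strings with length < max_len."""
    kept = []
    for pos, s in enumerate(lines):
        s = s.rstrip("\n")
        if len(s) >= max_len:
            continue  # long strings (length >= 1000) are removed on load (Note 2)
        if not s.isascii():
            continue  # only ASCII characters are supported (Note 1)
        kept.append((pos, s))
    return kept
```

Writing the kept strings to a new file, one per line, gives an input whose line positions can be mapped back to the original collection through the stored positions.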

Download

Data
Contact
For any questions about this study, please contact Jiannan Wang.