Boosting the Quality of Approximate String Matching by Synonyms

Show full item record



Lu , J , Lin , C , Wang , W , Li , C & Xiao , X 2015 , ' Boosting the Quality of Approximate String Matching by Synonyms ' , ACM Transactions on Database Systems , vol. 40 , no. 3 , 15 .

Title: Boosting the Quality of Approximate String Matching by Synonyms
Author: Lu, Jiaheng; Lin, Chunbin; Wang, Wei; Li, Chen; Xiao, Xiaokui
Other contributor: University of Helsinki, Department of Computer Science
Date: 2015
Language: eng
Number of pages: 42
Belongs to series: ACM Transactions on Database Systems
ISSN: 0362-5915
Abstract: A string similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings ``\textsf{Sam}'' and ``\textsf{Samuel}'' can be considered to be similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, e.g., number of common words or \qgrams. While this is indeed an indicator of similarity, there are many important cases where syntactically different strings can represent the same real-world object. For example, ``\textsf{Bill}'' is a short form of ``\textsf{William}''; and ``\textsf{Database Management Systems}'' can be abbreviated as ``\textsf{DBMS}''. Given a collection of predefined synonyms, the purpose of this article is to explore such existing knowledge to effectively evaluate the similarity between two strings and efficiently perform similarity searches and joins, thereby boosting the quality of approximate string matching. In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. We then study efficient algorithms for similarity searches and joins by proposing two novel indexes, called SI-tree and QP-tree, which combine signature filtering and length filtering strategies. In order to improve the efficiency of our algorithms, we develop an estimator to estimate the size of candidates to enable an online selection of signature filters. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, thus making our method attractive both in theory and in practice. Finally, the experimental results from a comprehensive study of the algorithms with three real datasets verify the effectiveness and efficiency of our approaches.
Subject: 113 Computer and information sciences
String algorithms
Data integration

Files in this item

Total number of downloads: Loading...

Files Size Format View
a15_lu.pdf 3.139Mb PDF View/Open

This item appears in the following Collection(s)

Show full item record