Access Type

Open Access Dissertation

Date of Award

January 2012

Degree Type


Degree Name



Computer Science

First Advisor

Farshad Fotouhi


With the rapid increase of the distributed data sources, and in order to make information integration, there is a need to combine the information that refers to the same entity from different sources. However, there are no global conventions that control the format of the data, and it is impractical to impose such global conventions. Also, there could be some spelling errors in the data as it is entered manually in most of the cases. For such reasons, the need to find and join similar records instead of exact records is important in order to integrate the data. Most of the previous work has concentrated on similarity join when the join attribute is a short string attribute, such as person name and address. However, most databases contain long string attributes as well, such as product description and paper abstract, and up to our knowledge, no work has been done in this direction. The use of long string attributes is promising as these attributes contain much more information than short string attributes, which could improve the similarity join performance. On the other hand, most of the literature work did not consider the semantic similarities during the similarity join process.

To address these issues, 1) we showed that the use of long attributes outperformed the use of short attributes in the similarity join process in terms of similarity join accuracy with a comparable running time under both supervised and unsupervised learning scenarios; 2) we found the best semantic similarity method to join long attributes in both supervised and unsupervised learning scenarios; 3) we proposed efficient semantic similarity join methods using long attributes under both supervised and unsupervised learning scenarios; 4) we proposed privacy preserving similarity join protocols that supports the use of long attributes to increase the similarity join accuracy under both supervised and unsupervised learning scenarios; 5) we studied the effect of using multi-label supervised learning on the similarity join performance; 6) we found an efficient similarity join method for expandable databases.