Access Type

Open Access Dissertation

Date of Award

1-1-2010

Degree Type

Dissertation

Degree Name

Ph.D.

Department

Computer Science

First Advisor

Farshad Fotouhi

Abstract

Research on similarity join techniques is a growing practical area of study, especially given the increasing electronic availability of vast amounts of digital data from more and more source systems. This research focuses on clustering-based pre-processing techniques to improve existing similarity join approaches.

Identifying and extracting the same real-world entities from different data sources remains a major challenge and a significant task in the digital information era. Dissimilar extracts may in fact represent the same real-world entity because of inconsistent values and naming conventions, incorrect or missing data values, or incomplete information. Discovering efficient and accurate approaches to determine the similarity of data objects or values is therefore of theoretical as well as practical significance.

Semantic problems arise even for the concept of similarity itself, regarding its usage and foundation. Existing similarity join approaches often take a very specific view of similarity measures and rely on pre-defined predicates that reflect a narrow view of similarity for a given scenario. The predicates have been assumed to be a group of clustering [MSW 72] related attributes on the join. Identifying those entities for data integration purposes requires a broader view of similarity; for instance, a number of generic similarity measures are useful in a given data integration system.

This study focuses on string similarity joins, specifically those based on the Levenshtein (edit) distance and q-grams. It proposes effective and efficient clustering-based pre-processing techniques that identify clustering-related predicates, based on either attribute value or data value, to improve existing similarity join techniques in enterprise data integration scenarios.
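
For readers unfamiliar with the two string measures named above, the following is a minimal Python sketch of an edit-distance similarity join that uses q-gram overlap as a cheap pre-filter. It is an illustration only and is not taken from the dissertation: the function names (`levenshtein`, `qgrams`, `qgram_overlap`, `similarity_join`), the padding characters `#` and `$`, and the sample strings are all assumptions, and the nested-loop join stands in for whatever clustering-based pre-processing the author actually proposes. The filter relies on the well-known count bound for padded q-grams: two strings within edit distance k share at least max(|x|, |y|) + q - 1 - k*q q-grams.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic program, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def qgrams(s: str, q: int = 2) -> Counter:
    """Multiset of q-grams of s, padded with q-1 marker characters per side.

    The padding characters '#' and '$' are an assumption of this sketch and
    are presumed not to occur in the data.
    """
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_overlap(a: str, b: str, q: int = 2) -> int:
    """Number of q-grams shared by a and b, counted with multiplicity."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())


def similarity_join(left, right, max_distance: int = 1, q: int = 2):
    """Naive nested-loop join returning pairs within max_distance edits.

    Pairs that share too few q-grams are skipped before the more expensive
    edit-distance computation (count filtering).
    """
    pairs = []
    for x in left:
        for y in right:
            needed = max(len(x), len(y)) + q - 1 - max_distance * q
            if qgram_overlap(x, y, q) < needed:
                continue  # cannot be within max_distance edits
            if levenshtein(x, y) <= max_distance:
                pairs.append((x, y))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    customers = ["Jon Smith", "Ann Lee"]
    accounts = ["John Smith", "Anne Lee", "Bob Ray"]
    print(similarity_join(customers, accounts, max_distance=1))
```

The q-gram filter only prunes candidate pairs; every surviving pair is still verified with the exact edit distance, so the filter affects cost rather than the join result.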

Recommended Citation

Tan, Yufen, "Clustering-Based Pre-Processing Approaches To Improve Similarity Join Techniques" (2010). Wayne State University Dissertations. 222.
https://digitalcommons.wayne.edu/oa_dissertations/222

Included in

Computer Sciences Commons

DigitalCommons@WayneState

Wayne State University Dissertations

Clustering-Based Pre-Processing Approaches To Improve Similarity Join Techniques