Access Type

Open Access Dissertation

Date of Award

1-1-2010

Degree Type

Dissertation

Degree Name

Ph.D.

Department

Computer Science

First Advisor

Ming Dong

Abstract

Clustering is traditionally an unsupervised task: finding natural groupings, or clusters, in multidimensional data based on perceived similarities among the patterns. The purpose of clustering is to extract useful information from unlabeled data.

To present the knowledge extracted by clustering in a meaningful way, data visualization has become a popular and growing area of research. Visualization can provide a qualitative overview of large and complex data sets, which helps us gain the insight needed to truly understand the phenomena of interest in the data.

The contribution of this dissertation is two-fold: Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for data clustering/co-clustering and Exemplar-based data Visualization (EV) through matrix factorization. Compared to traditional data mining models, matrix-based methods are fast, easy to understand and implement, and especially suitable for large-scale, challenging problems in text mining, image grouping, medical diagnosis, and bioinformatics. In this dissertation, we present two effective matrix-based solutions in the new directions of data clustering and visualization.

First, in many practical learning domains, there is a large supply of unlabeled data but limited labeled data, and in most cases it is expensive to generate large amounts of labeled data. Traditional clustering algorithms completely ignore these valuable labeled data and thus are inapplicable to these problems. Consequently, semi-supervised clustering, which can incorporate domain knowledge to guide a clustering algorithm, has become a topic of significant recent interest.

Thus, we develop a Non-negative Matrix Factorization (NMF) based framework to incorporate prior knowledge into data clustering. Moreover, with the fast growth of the Internet and computational technologies in the past decade, many data mining applications have advanced swiftly from the simple clustering of one data type to the co-clustering of multiple data types, usually involving high heterogeneity. To this end, we extend SS-NMF to perform heterogeneous data co-clustering. From a theoretical perspective, SS-NMF for data clustering/co-clustering is mathematically rigorous: the convergence and correctness of our algorithms are proved.

In addition, we discuss the relationship between SS-NMF and other well-known clustering and co-clustering models.
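As a rough illustration of the NMF-based clustering idea, the sketch below factors a non-negative data matrix X into W and H with the standard multiplicative update rules and assigns each sample to its largest column of W. It is plain unsupervised NMF, not the SS-NMF algorithm of the dissertation, and it uses no labeled data or constraints. The function name `nmf_cluster`, its parameters, and the toy data are assumptions made only for this example.

```python
import numpy as np

def nmf_cluster(X, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative X (n_samples x n_features) as W @ H and
    label each sample by the column of W with the largest weight.
    Illustrative helper only; not the dissertation's SS-NMF."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps   # sample-to-cluster weights
    H = rng.random((k, m)) + eps   # cluster-to-feature profiles
    for _ in range(n_iter):
        # Multiplicative updates for the squared Frobenius error;
        # eps guards against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H, W.argmax(axis=1)

# Toy data (assumed): two groups of samples with disjoint active features.
rng = np.random.default_rng(1)
A = np.hstack([rng.random((10, 5)) + 1.0, np.zeros((10, 5))])
B = np.hstack([np.zeros((10, 5)), rng.random((10, 5)) + 1.0])
X = np.vstack([A, B])

W, H, labels = nmf_cluster(X, k=2)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(labels, rel_err)
```

On a block-structured matrix like this one, the first ten and the last ten rows should end up with different labels. A semi-supervised variant would additionally use the available labeled data to guide the factorization, which this sketch does not attempt.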

Second, most current clustering models provide only centroids (e.g., the mathematical means of the clusters) without inferring representative exemplars from the real data, and so they cannot adequately summarize or visualize the raw data.

A new method, Exemplar-based Visualization (EV), is proposed to cluster and visualize extremely large-scale data.

Capitalizing on recent advances in matrix approximation and factorization, EV provides a means to visualize large-scale data with high accuracy (in retaining neighbor relations), high efficiency (in computation), and high flexibility (through the use of exemplars).

Empirically, we demonstrate the superior performance of our matrix-based data clustering and visualization models through extensive experiments on publicly available large-scale data sets.
