Access Type

Open Access Dissertation

Date of Award

January 2015

Degree Type

Dissertation

Degree Name

Ph.D.

Department

Computer Science

First Advisor

Sorin Draghici

Abstract

Machine learning is the study of computational algorithms that improve their performance by assimilating data.

As such, the field as a whole has found applications in many diverse disciplines from robotics and communication in engineering to economics and finance, and also biology and medicine.

It should not come as a surprise that many popular methods in use today have completely different origins.

Despite this heterogeneity, different methods can be divided into standard tasks, such as supervised, unsupervised, semi-supervised and reinforcement learning.

Although machine learning can be formalized as a set of methods for solving certain standard tasks, applying these methods to datasets from different fields comes with caveats and is sometimes fraught with challenges.

In this thesis, we develop general procedures and novel solutions, dealing with practical problems that arise when modeling biological and medical data.

Cost sensitive learning is an important area of research in machine learning which addresses the widespread and practical problem of dealing with different costs during the learning and deployment of classification algorithms.

In many applications, such as credit fraud detection, network intrusion detection and, in particular, medical diagnosis, prior class distributions are highly skewed, which leaves the training examples heavily imbalanced.

Combining this with uneven misclassification costs renders standard machine learning approaches useless in learning an acceptable decision function.

We experimentally show the benefits and shortcomings of various methods that convert cost blind learning algorithms to cost sensitive ones.
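
One widely used way to make a cost-blind probabilistic classifier cost sensitive is to move its decision threshold. The following is a minimal sketch of that idea only, not necessarily one of the methods evaluated in the thesis. It assumes a binary problem, zero cost for correct decisions, and a classifier that outputs calibrated probabilities of the positive class; the function names and the example costs are hypothetical.

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability threshold that minimizes expected cost.

    Assumes correct decisions cost nothing. Predicting "positive" is
    cheaper in expectation whenever p * cost_fn >= (1 - p) * cost_fp,
    which rearranges to p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def predict_with_costs(probabilities, cost_fp, cost_fn):
    """Label each instance 1 (positive) or 0 (negative) using the moved threshold."""
    threshold = cost_sensitive_threshold(cost_fp, cost_fn)
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical example: a missed positive is nine times as costly as a false alarm.
probs = [0.05, 0.2, 0.5, 0.9]
labels = predict_with_costs(probs, cost_fp=1.0, cost_fn=9.0)
print(labels)  # [0, 1, 1, 1]
```

With equal costs the threshold is the usual 0.5; as false negatives become more expensive, the threshold drops and more instances are flagged as positive.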

Using the results and best practices found for cost sensitive learning, we design and develop a machine learning approach to ontology mapping.

Next, we present a novel approach to deal with uncertainty in classification when costs are unknown or otherwise hard to assign.

Support Vector Machines (SVM) are considered to be among the most successful approaches for classification.

However, the prediction for instances near the decision boundary depends more on the specific parameter selection or on noise in the data than on a clear difference in features.

In many applications such as medical diagnosis, these regions should be labeled as uncertain rather than assigned to any particular class.

Furthermore, instances may belong to novel disease subtypes that are not from any previously known class.

In such applications, declining to make a prediction could be beneficial when more powerful but expensive tests are available.

We develop a novel approach for optimal selection of the threshold that delimits the uncertain region and show its successful application to three biological and medical datasets.
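
To make the idea of abstaining near the decision boundary concrete, the sketch below rejects any instance whose signed decision value (for example, the output of an SVM decision function) is smaller in magnitude than a threshold, and picks the threshold that minimizes an empirical cost on held-out data. This is an illustrative assumption, not the selection procedure developed in the thesis: the cost model (1 per error, a fixed `reject_cost` per abstention), the function names and the toy scores are all hypothetical.

```python
def predict_with_reject(scores, threshold):
    """Return +1/-1 by the sign of each score, or None (abstain) when |score| < threshold."""
    predictions = []
    for s in scores:
        if abs(s) < threshold:
            predictions.append(None)  # too close to the boundary: label as uncertain
        else:
            predictions.append(1 if s > 0 else -1)
    return predictions


def empirical_cost(scores, labels, threshold, reject_cost):
    """Average cost: 1 per misclassification, reject_cost per abstention."""
    total = 0.0
    for pred, y in zip(predict_with_reject(scores, threshold), labels):
        if pred is None:
            total += reject_cost
        elif pred != y:
            total += 1.0
    return total / len(labels)


def select_threshold(scores, labels, reject_cost):
    """Pick the candidate threshold with the lowest empirical cost."""
    magnitudes = sorted(set(abs(s) for s in scores))
    # 0.0 rejects nothing; the last candidate rejects everything.
    candidates = [0.0] + magnitudes + [magnitudes[-1] + 1.0]
    return min(candidates, key=lambda t: empirical_cost(scores, labels, t, reject_cost))


# Hypothetical held-out decision values and true labels.
scores = [-2.0, -0.3, 0.2, 0.4, 1.5]
labels = [-1, 1, -1, 1, 1]
best = select_threshold(scores, labels, reject_cost=0.2)
print(best)                               # 0.4
print(predict_with_reject(scores, best))  # [-1, None, None, 1, 1]
```

In this toy example the two misclassified instances lie closest to the boundary, so abstaining on them is cheaper than committing to a label.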

The last part of this thesis provides novel solutions for handling high dimensional data.

Although high-dimensional data is found in many disciplines, current life science research almost always involves high-dimensional genomics/proteomics data.

The "omics" data provide a wealth of information and have changed the research landscape in biology and medicine.

However, these data are plagued with noise, redundancy and collinearity, which makes the discovery process very difficult and costly.

Any method that can accurately detect irrelevant and noisy variables in omics data would be highly valuable.

We present Robust Feature Selection (RFS), a randomized feature selection approach dedicated to low-sample high-dimensional data.

RFS combines an embedded feature selection method with a randomization procedure for stability.

Recent advances in sparse recovery and estimation methods have provided efficient and asymptotically consistent feature selection algorithms.

However, these methods lack finite sample error control due to instability.

Furthermore, the chances of correct recovery diminish with more collinearity among features.

To overcome these difficulties, RFS uses a randomization procedure to provide an accurate and stable feature selection method.
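
The general pattern of stabilizing a feature selector through randomization can be sketched as follows: run a base selector on many random subsamples and keep the features that are selected most often. This is not the RFS algorithm itself, whose details are not given here. In particular, the embedded selector is replaced by a plain correlation ranking as a stand-in to keep the sketch dependency-free, and the function names, the half-size subsamples, the 0.8 frequency cut-off and the synthetic data are all assumptions.

```python
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def base_selector(X, y, k):
    """Stand-in selector: the k features most correlated with y."""
    n_features = len(X[0])
    scores = [abs_correlation([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def selection_frequencies(X, y, k, n_rounds=100, seed=0):
    """Fraction of random half-subsamples in which each feature is selected."""
    rng = random.Random(seed)
    n = len(X)
    counts = [0] * len(X[0])
    for _ in range(n_rounds):
        idx = rng.sample(range(n), n // 2)  # subsample without replacement
        X_sub = [X[i] for i in idx]
        y_sub = [y[i] for i in idx]
        for j in base_selector(X_sub, y_sub, k):
            counts[j] += 1
    return [c / n_rounds for c in counts]


def stable_features(freqs, min_frequency=0.8):
    """Indices of features selected in at least min_frequency of the rounds."""
    return [j for j, f in enumerate(freqs) if f >= min_frequency]


# Synthetic data: 40 samples, 6 features, response driven by features 0 and 1.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(40)]
y = [3 * row[0] - 2 * row[1] + data_rng.gauss(0, 0.5) for row in X]

freqs = selection_frequencies(X, y, k=2)
print(freqs)
print(stable_features(freqs))
```

Features that are picked only under particular subsamples receive low frequencies, which is what makes the aggregated selection less sensitive to noise than a single run of the base selector.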

We thoroughly evaluate RFS by comparing it to a number of popular univariate and multivariate feature selection methods and show a marked improvement in the prediction accuracy of a diagnostic signature, while preserving good stability.
