Access Type

Open Access Thesis

Date of Award

January 2017

Degree Type


Degree Name



Computer Science

First Advisor

Sorin Draghici


Uncovering the biological insights hidden in the vast amounts of data collected while investigating a disease is the main reason for collecting such data in the first place.

Changes in gene expression or protein function are important components in the progression of a disease and are key to understanding the disease mechanism.

However, the causes of such changes are often not easily identified. In many cases, genetic variants may account for some of the observed gene expression changes.

In this thesis, we focus on identifying the variants that significantly alter gene expression in an individual by integrating genetic variant data, gene expression data, and a priori knowledge about gene-gene interaction networks from multiple databases. We show that variants that change gene expression can be used to identify subgroups of patients with significantly different survival profiles.

The method is validated on four different cancer types (renal, lung, and colorectal cancer, and leukemia) from The Cancer Genome Atlas (TCGA) database.

The results show that this method identifies variants that significantly affect gene expression (and, in turn, the phenotype) and identifies disease subtypes that are biologically meaningful, as validated by survival and pathway analysis.