Access Type

Open Access Dissertation

Date of Award

January 2017

Degree Type


Degree Name



Computer Science

First Advisor

Sorin Draghici


Modern biomedical research lies at the crossroads of data gathering, interpretation, and hypothesis testing. Because of noise, study bias, or differences in biological signal between disease and healthy conditions that are too small to detect, individual studies often fail to identify the true phenomenon. Data integration is the key to obtaining the statistical power needed to pinpoint the biological mechanisms of disease states. To this end, we made contributions to both horizontal and vertical integration of high-throughput data: the former is the meta-analysis of independent studies, while the latter is the integration of multi-omics data.

For horizontal meta-analysis, we developed two frameworks: DANUBE and the bi-level meta-analysis. In DANUBE, we showed that most pathway analysis approaches make incorrect assumptions about biomolecular data, which lead to non-uniform p-values under the null hypothesis. DANUBE corrects the biased p-values before combining them using the Central Limit Theorem. In the bi-level meta-analysis, we added a second level of meta-analysis to make better use of the samples available within individual studies. Both techniques were validated using thousands of real samples obtained from independent studies related to three human diseases: Alzheimer's disease, acute myeloid leukemia, and type II diabetes mellitus. These frameworks outperformed classical approaches in consistently identifying pathways that are relevant to the given phenotypes. Via extensive simulation studies, we also demonstrated that the proposed techniques are sufficiently general to be applied outside the scope of biomedical research.
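The two ideas in the paragraph above, correcting a p-value against an empirical null and combining independent p-values via the Central Limit Theorem, can be illustrated with a minimal sketch. This is not the DANUBE implementation: the function names, the add-one smoothing, and the assumption that a list of null p-values is already available are all illustrative choices.

```python
import math
from bisect import bisect_right
from statistics import NormalDist


def empirical_correct(p_observed, null_pvalues):
    """Re-express a p-value as its rank within an empirical null distribution.

    Assumption: null_pvalues were produced by running the same analysis on
    data where the null hypothesis holds (e.g. on control-only comparisons).
    Add-one smoothing keeps the result strictly above zero.
    """
    null_sorted = sorted(null_pvalues)
    at_or_below = bisect_right(null_sorted, p_observed)
    return (at_or_below + 1) / (len(null_sorted) + 1)


def combine_clt(pvalues):
    """Combine independent p-values using a normal approximation.

    Under the null, each p-value is Uniform(0, 1), so the mean of n of them
    has expectation 1/2 and variance 1/(12 n). By the Central Limit Theorem
    the standardized mean is approximately standard normal.
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("at least one p-value is required")
    mean_p = sum(pvalues) / n
    z = (mean_p - 0.5) * math.sqrt(12 * n)
    # Small p-values give a negative z, hence a small combined p-value.
    return NormalDist().cdf(z)
```

The normal approximation is only reasonable when the input p-values are close to uniform under the null, which is why a correction step of some kind has to come before the combination step.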

For vertical integrative analysis, we integrated transcriptomics, epigenomics, and non-coding RNA data to identify disease subtypes. Successful subtyping of complex diseases can lead to the identification of biomarkers and targets for new drugs. We developed a perturbation clustering method to accurately subtype patients using high-dimensional gene expression data. The framework was also extended to combine the complementary information available in multi-omics data by adapting techniques from network partitioning and cluster ensembles. The algorithm was validated on thousands of real cancer samples, using mRNA, methylation, and microRNA data available from Gene Expression Omnibus, the Broad Institute, and The Cancer Genome Atlas. This simultaneous subtyping approach accurately identifies known cancer subtypes and predicts the survival of novel subgroups of patients.
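One ingredient commonly used in perturbation-based clustering is a measure of how much a partition changes when the data are perturbed and re-clustered. The sketch below computes the pairwise agreement (the Rand index) between two label assignments through their connectivity matrices. It is offered only as an illustration of that general idea under stated assumptions, not as the method developed in the dissertation; the function names are hypothetical.

```python
def connectivity(labels):
    """Binary matrix whose (i, j) entry is 1 when samples i and j share a cluster."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


def agreement(labels_original, labels_perturbed):
    """Fraction of sample pairs on which two partitions agree (Rand index).

    A pair counts as an agreement when both partitions place the two samples
    together, or both place them apart. Cluster label names are irrelevant.
    """
    if len(labels_original) != len(labels_perturbed):
        raise ValueError("both partitions must cover the same samples")
    n = len(labels_original)
    pairs = n * (n - 1) // 2
    if pairs == 0:
        return 1.0
    conn_a = connectivity(labels_original)
    conn_b = connectivity(labels_perturbed)
    same = sum(1
               for i in range(n)
               for j in range(i + 1, n)
               if conn_a[i][j] == conn_b[i][j])
    return same / pairs
```

In a stability analysis, this score would typically be averaged over many perturbed copies of the data for each candidate number of clusters, with a higher average suggesting a partition that is more robust to noise.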

We also developed a meta-analysis framework that combines the two orthogonal types of data integration: horizontal and vertical. Integrative analyses of omics data often require all data types to be available for each individual patient. This limits their practical applicability, since sample-matched data are relatively rare and difficult or expensive to obtain. We proposed an orthogonal meta-analysis framework that overcomes the sample-matched data bottleneck by integrating datasets of different types generated in independent laboratories from different sets of patients. The proposed framework was validated using 1,471 samples from 15 mRNA and 14 miRNA expression datasets related to two human cancers: colorectal cancer and pancreatic cancer. The orthogonal approach reliably identifies signaling pathways that are impacted in the two cancers. While validated in the context of pathway analysis, the framework can be adapted to other domains or applications.