Access Type

Open Access Embargo

Date of Award

January 2021

Degree Type


Degree Name



Computer Science

First Advisor

Dr. Sorin Draghici


There is a tremendous need to analyze molecular and patient clinical data to identify biomarkers, uncover biological mechanisms, or simply classify samples accurately. Several issues make it challenging to extract valuable insights: i) limited tools to diagnose many diseases, ii) analyses that do not consider biological interactions, and iii) damaged DNA samples. In this work, I address these issues by developing different bioinformatics frameworks.

First, I present three frameworks to identify i) Sarcoidosis biomarkers, ii) Tuberculosis (TB) biomarkers, and iii) Cystic fibrosis (CF) biomarkers. I identified Sarcoidosis biomarkers and applied them to classify Sarcoidosis samples versus non-Sarcoidosis samples (healthy controls, Tuberculosis, and lung cancer) with a sensitivity of 0.92 and a specificity of 0.88. I identified 10 TB biomarkers and applied them to classify TB samples versus non-TB samples (healthy controls and Sarcoidosis). The area under the receiver operating characteristic (ROC) curve for the top 10 biomarkers was 1, with a sensitivity of 1 and a specificity of 1. I identified 20 CF biomarkers and used them to classify CF samples versus non-CF samples (healthy controls and lung cancer). The mean area under the ROC curve for the CF biomarkers was 0.97, with a sensitivity of 0.99 and a specificity of 0.95.

Second, I present a method that constructs networks of genes that can be considered putative mechanisms. A major challenge in life science research is understanding the mechanism involved in a given phenotype. The putative mechanisms constructed by this approach are not limited to the set of differentially expressed (DE) genes but also consider all known and relevant gene-gene interactions. We analyzed three real datasets for which both the causes of the phenotype and the true mechanisms were known. We show that the method identified the correct mechanisms when applied to microarray datasets from mouse.
We compared the results of our method with those of the classical approach, showing that our method produces more meaningful biological insights.

Third, I propose a classification method that analyzes genomic data and assigns an individual to a particular population or group. A current challenge in the analysis of forensic evidence is to classify samples accurately using genomic data. Fragmented DNA due to degradation is a common problem with samples from crime scenes. The proposed classification method can use SNPs from as little as 10% of the DNA in the human genome to identify the population background of a sample. I compared the performance of the proposed method with three other classification methods: i) naive Bayes, ii) Random Forest, and iii) BIASLESS. The accuracy, sensitivity, specificity, and F1 score yielded by the proposed classifier were 0.963, 0.798, 0.983, and 0.827, respectively. The results show that the proposed method outperforms the existing methods.

Finally, I present the findings of analyzing clinical data for 81 COVID-19 ICU patients. Coronavirus disease 2019 (COVID-19) is a highly transmissible viral infection caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). I show evidence that mean platelet volume (MPV) reflects platelet activation and activation of coagulation cascades and plays a major role in the development of acute renal failure. Furthermore, I show that glomerular filtration rate (GFR) values deteriorate after day three in patients with acute kidney injury (AKI). Such findings will help guide treatment.
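
For readers less familiar with the evaluation metrics quoted in this abstract (accuracy, sensitivity, specificity, and F1 score), the following minimal Python sketch shows how they are conventionally computed from binary confusion-matrix counts. The function name and the example counts are hypothetical and chosen only for illustration; they are not the dissertation's code or data. For a multi-class problem such as population assignment, per-class values like these would typically be averaged, which is an assumption here rather than something the abstract states.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives, true negatives, false negatives.
    Assumes every denominator below is non-zero.
    """
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)            # positive predictive value
    # F1 is the harmonic mean of precision and sensitivity
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts, for illustration only (not taken from the dissertation)
metrics = classification_metrics(tp=8, fp=1, tn=9, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because sensitivity and specificity are computed from different rows of the confusion matrix, a classifier can score high on one and low on the other, which is why the abstract reports both alongside accuracy and F1.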