Access Type

Open Access Dissertation

Date of Award

January 2012

Degree Type

Dissertation

Degree Name

Ph.D.

Department

Computer Science

First Advisor

Weisong Shi

Abstract

Genomics research has enormous applications in many areas such as health care, forensics, and agriculture. Most recent achievements in this field stem from the availability of unprecedented amounts of genomic data. However, new sequencing technologies keep producing data at an ever faster pace, yielding very large volumes of data. This poses great challenges for storing, managing, processing, and analyzing the data efficiently. To cope with these challenges, genomics research groups often equip themselves with a small-scale server room of machines with high storage capacity and computing power. This solution is not only costly and hard to scale, but also inefficient. A better solution is Cloud Computing, with its elasticity and pay-as-you-go economic model. Nevertheless, Cloud Computing only provides the infrastructure; to address the high-throughput processing challenges, we also need a suitable programming model. The fundamental idea is to process data in parallel, and among existing models, MapReduce appears to be the best candidate because of its extreme scalability.
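The map-and-reduce idea behind this choice can be shown with a small, self-contained sketch (plain Java, not tied to Hadoop and not code from the dissertation): counting k-mer occurrences across a set of sequencing reads, where the map phase processes each read independently and the reduce phase aggregates counts per k-mer. The k-mer length and input reads are assumed values chosen only for illustration.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch only: the MapReduce idea applied to a toy genomics task,
// counting k-mer occurrences across sequencing reads. In a real deployment the
// map and reduce phases would run as Hadoop tasks spread over many cloud nodes.
public class KmerCountSketch {
    static final int K = 3; // k-mer length (assumed value for this example)

    // Map phase: each read is split into (k-mer, 1) pairs independently,
    // so reads can be processed in parallel on different machines.
    static Stream<Map.Entry<String, Integer>> map(String read) {
        return IntStream.rangeClosed(0, read.length() - K)
                .mapToObj(i -> Map.entry(read.substring(i, i + K), 1));
    }

    // Reduce phase: all counts emitted for the same k-mer are summed.
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.groupingBy(Map.Entry::getKey,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> reads = List.of("ACGTAC", "GTACGT", "ACGTGT");
        Map<String, Integer> counts =
                reduce(reads.stream().flatMap(KmerCountSketch::map));
        counts.forEach((kmer, n) -> System.out.println(kmer + "\t" + n));
    }
}
```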

In this work, we develop a domain-specific system to support data management and analysis in genomics using Cloud Computing and MapReduce. Starting from the application layer, we developed a fundamental alignment tool called CloudAligner, based on the MapReduce framework, that outperformed its counterparts. We then sought solutions to improve the system at the infrastructure level. Observing that scientists spend too much time accessing data from low-speed archives (tapes), we developed the Distributed Disk Cache (DiSK), which was covered in a Master's thesis. Another challenge is enabling the system to support the differentiated services that are prevalent in Cloud Computing. To address this, we proposed a Differentiated Replication (DiR) mechanism that allows data to be inserted and retrieved with different levels of availability. Another problem that greatly reduces system performance is the heterogeneity of the Cloud. To tame it, we created an Open Reputation model called Opera, which employs vectors to record the behaviors (reputations) of nodes from different aspects, and we modified the Hadoop MapReduce scheduler to make use of this information. The results show that under heterogeneous environments our system outperforms the original Hadoop in terms of job execution time, number of failed/killed tasks, and energy consumption. The last challenge we addressed is data movement, since the data in our targeted domain (genomics) is extremely large and grows at an exponential rate. We divided the issue into two categories, internal and external movement, and developed a caching system to minimize internal data movement and an easy-to-use tool called SPBD to handle external data movement with minimal response time.
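As a rough illustration of the reputation-vector idea, a scheduler could keep a small vector of behavior metrics per node and rank candidate nodes by a weighted score. This is a hypothetical sketch only; the aspects, weights, and scoring rule below are assumptions for illustration, not the actual Opera model or the modified Hadoop scheduler.

```java
import java.util.*;

// Hypothetical sketch of a reputation-vector-based node ranking; field names,
// weights, and the scoring rule are illustrative assumptions.
public class ReputationSketch {
    // Each node's behavior is recorded along several aspects.
    record Reputation(double taskSuccessRate, double avgSpeed, double dataLocality) {
        // A single score a scheduler could use to rank candidate nodes;
        // the weights are made up for this example.
        double score() {
            return 0.5 * taskSuccessRate + 0.3 * avgSpeed + 0.2 * dataLocality;
        }
    }

    public static void main(String[] args) {
        Map<String, Reputation> nodes = Map.of(
                "node-a", new Reputation(0.99, 0.80, 0.60),
                "node-b", new Reputation(0.70, 0.95, 0.90));

        // Prefer the node with the best overall reputation when assigning a task.
        String best = nodes.entrySet().stream()
                .max(Comparator.comparingDouble(e -> e.getValue().score()))
                .map(Map.Entry::getKey)
                .orElseThrow();
        System.out.println("Schedule next task on: " + best);
    }
}
```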
