Access Type

Open Access Thesis

Date of Award

January 2015

Degree Type


Degree Name



Computer Science

First Advisor

Chandan K. Reddy


In this Big data era, the need for performing large-scale computations is evident. A better understanding of the most suitable platforms which can efficiently run these computations is needed. In this thesis, we attempt to compare four such big data platforms, namely Hadoop, Spark, GPU, and Multicore CPU. We compare these platforms using two prominent data mining algorithms, namely, K-means clustering and K-nearest neighbour classification and discuss specific implementation-level details. We provide several insights into the best possible implementations of these algorithms and systematically compare the benefits and drawbacks of each of these platforms. We conduct experiments by varying data size and parameters to obtain runtime and scalability performances of these platforms. Our experiments show that GPU and Multicore CPU are faster but have certain limitations. On the other hand, Hadoop and Spark are able to handle large scale datasets. We also observe that Spark performs better than Hadoop for both iterative and non-iterative jobs. In summary, we have examined different characteristics of four big data platforms and provided comparative analysis for the cases of two algorithms. Since many other data mining algorithms either use these two methods during pre-processing or as an integral component, we hope that our analysis will have impact in many other applications and algorithms beyond the ones that are being reported in this thesis.