High dimensional data arise when the number of variables included in an analysis approaches or exceeds the sample size. In the context of group classification, researchers are typically interested in finding a model that can be used to place an individual into their appropriate group, for example to correctly diagnose individuals with depression. However, when the training sample is small and the number of predictors used to differentiate the groups is large, standard approaches such as discriminant analysis may not work well. To address this issue, statisticians have developed a number of tools designed for supervised classification with high dimensional data. The goal of this simulation study was to compare several such approaches in terms of their ability to correctly classify individuals into groups and to identify the number of variables associated with group separation. Results of the study showed that the Random Forest ensemble recursive partitioning algorithm was optimal for group prediction, while the Nearest Shrunken Centroid and Regularized Discriminant Analysis methods were optimal for identifying the number of salient predictor variables. The standard linear discriminant analysis approach was generally the worst performer across all high dimensional simulated conditions. Implications of these results for practice and directions for future research are discussed.
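One well-known reason that standard linear discriminant analysis struggles in this setting is that it relies on inverting the pooled within-group covariance matrix, which cannot have full rank once the number of predictors exceeds the number of cases minus the number of groups. The sketch below illustrates that point only; it is not the simulation design used in the study. The sample sizes, the group mean shift, and the helper names `pooled_covariance` and `matrix_rank` are assumptions made for illustration.

```python
import random


def pooled_covariance(groups):
    """Pooled within-group covariance matrix.

    `groups` is a list of groups; each group is a list of observations,
    and each observation is a list of p floats.
    """
    p = len(groups[0][0])
    n_total = sum(len(g) for g in groups)
    cov = [[0.0] * p for _ in range(p)]
    for g in groups:
        means = [sum(row[j] for row in g) / len(g) for j in range(p)]
        for row in g:
            d = [row[j] - means[j] for j in range(p)]
            for a in range(p):
                for b in range(p):
                    cov[a][b] += d[a] * d[b]
    dof = n_total - len(groups)  # cases minus number of groups
    return [[v / dof for v in r] for r in cov]


def matrix_rank(m, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n_rows, n_cols = len(a), len(a[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < tol:
            continue  # no usable pivot in this column
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, n_rows):
            f = a[r][col] / a[rank][col]
            for c in range(col, n_cols):
                a[r][c] -= f * a[rank][c]
        rank += 1
    return rank


random.seed(0)
p, n_per_group = 12, 4  # hypothetical sizes: 12 predictors, 8 cases in total
groups = [
    [[random.gauss(shift, 1.0) for _ in range(p)] for _ in range(n_per_group)]
    for shift in (0.0, 1.0)  # assumed mean shift between the two groups
]
S = pooled_covariance(groups)
rank = matrix_rank(S)
print(f"predictors p = {p}, pooled covariance rank = {rank}")
```

Because the rank of the pooled matrix is bounded by the number of cases minus the number of groups, it falls short of the 12 predictors here and the matrix has no ordinary inverse. Methods such as Regularized Discriminant Analysis and Nearest Shrunken Centroid avoid this by shrinking the covariance estimate or the group centroids rather than inverting the raw matrix.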
"A Comparison of Methods for Group Prediction with High Dimensional Data,"
Journal of Modern Applied Statistical Methods:
2, Article 5.
Available at: http://digitalcommons.wayne.edu/jmasm/vol13/iss2/5