Document Type: Original/Review Paper

Authors

1 Faculty of Information Technology, Strathmore University, Nairobi, Kenya.

2 Department of Information Technology, Mount Kenya University, Thika, Kenya.

Abstract

Redundant and irrelevant features in high-dimensional data increase the complexity of the underlying mathematical models. Pre-processing steps that search for the most relevant features are therefore necessary to reduce the dimensionality of the data. This study used a meta-heuristic search approach that relies on lightweight random simulations to balance the exploitation of relevant features against the exploration of features that have the potential to be relevant. In doing so, the study evaluated how effective manipulating the search component of feature selection is in achieving high accuracy with reduced dimensions. A control-group experimental design was used to gather empirical evidence. The context of the experiment was the high-dimensional data encountered in the performance tuning of complex database systems. The Wilcoxon signed-rank test at the 0.05 level of significance was used to compare repeated classification accuracy measurements on the independent experiment and control group samples. Encouraging results with a p-value < 0.05 were recorded and provided evidence to reject the null hypothesis in favour of the alternative hypothesis, which states that meta-heuristic search approaches are effective in achieving high accuracy with reduced dimensions, depending on the outcome variable under investigation.
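The statistical comparison described in the abstract can be illustrated with a minimal sketch of an exact two-sided Wilcoxon signed-rank test. This is not the study's own code or data: the function name `wilcoxon_signed_rank`, the choice to pair measurements by repetition, and the accuracy values below are all assumptions made for illustration.

```python
def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W, p), where W = min(W+, W-). Illustrative sketch only.
    """
    # Paired differences; rounding guards against floating-point noise,
    # and zero differences are dropped (the classic Wilcoxon convention).
    diffs = [round(a - b, 9) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    # Ranks are stored doubled so that half-integer ranks stay integers.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # twice the average of ranks i+1..j+1
        i = j + 1

    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)

    # Exact null distribution of W+: count the sign assignments that reach
    # each possible (doubled) rank sum, via subset-sum dynamic programming.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]

    w_min2 = min(w_plus2, total2 - w_plus2)
    tail = sum(counts[: w_min2 + 1])
    p_value = min(1.0, 2 * tail / 2 ** n)
    return w_min2 / 2, p_value


# Hypothetical repeated accuracy measurements (NOT results from the study).
experiment = [0.91, 0.89, 0.93, 0.90, 0.92, 0.94, 0.88, 0.95]
control = [0.85, 0.86, 0.84, 0.88, 0.83, 0.87, 0.87, 0.80]

statistic, p_value = wilcoxon_signed_rank(experiment, control)
reject_null = p_value < 0.05  # 0.05 significance level, as in the abstract
print(statistic, p_value, reject_null)
```

In this made-up sample every one of the eight pairs favours the experiment group, so W is 0 and the exact two-sided p-value is 2/2^8, about 0.0078, which falls below the 0.05 threshold. The exact enumeration is practical only for small samples; larger samples would normally use a normal approximation or a statistics library.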
