Document Type: Original/Review Paper

Author

Faculty of Engineering, Mahallat Institute of Higher Education, Mahallat, Iran.

DOI: 10.22044/jadm.2025.16130.2732

Abstract

Storing and processing very large datasets is one of the most critical problems in large-scale computing, so reducing their size before further processing is essential. This paper proposes a framework for data reduction in large-scale datasets, built on the MapReduce programming model, that proceeds in three steps. First, reservoir sampling selects a subset of instances from the dataset. Second, the features of the sampled instances are weighted with the ReliefF algorithm; the weights are then averaged per feature, and the features with the highest average weights are retained. Finally, the selected features are used for classification. Experimental results show that the proposed framework substantially reduces processing time, and that it maintains or even improves classification accuracy when a large amount of data is removed by eliminating irrelevant features.
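The three steps above can be sketched as follows. This is a simplified, single-machine illustration under my own assumptions, not the authors' MapReduce implementation: the function names are hypothetical, the ReliefF variant is a basic two-class form, and the "map" and "reduce" phases are emulated by a loop over in-memory partitions followed by a per-feature average.

```python
import random
import numpy as np

def reservoir_sample(stream, k, rng=None):
    """Algorithm R: uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random(0)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)          # j uniform on [0, i]
            if j < k:
                reservoir[j] = item        # replace a random slot with probability k/(i+1)
    return reservoir

def relieff_weights(X, y, n_neighbors=3, rng=None):
    """Simplified two-class ReliefF: a feature gains weight when it differs on
    nearest misses (other class) and loses weight when it differs on nearest hits."""
    rng = rng or random.Random(0)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)   # per-feature scale for distance terms
    span[span == 0] = 1.0
    w = np.zeros(d)
    for _ in range(n):
        i = rng.randrange(n)
        diffs = np.abs(X - X[i]) / span    # per-feature normalized differences
        dist = diffs.sum(axis=1)           # Manhattan distance to instance i
        dist[i] = np.inf                   # never pick the instance itself
        same = (y == y[i])
        hits = np.argsort(np.where(same, dist, np.inf))[:n_neighbors]
        misses = np.argsort(np.where(~same, dist, np.inf))[:n_neighbors]
        w += (diffs[misses].mean(axis=0) - diffs[hits].mean(axis=0)) / n
    return w

def select_features(partitions, labels, n_features, sample_size):
    """Map phase: each partition reservoir-samples instances and weights features
    with ReliefF. Reduce phase: average the weights per feature and keep the top k."""
    all_w = []
    for X, y in zip(partitions, labels):
        idx = reservoir_sample(range(len(X)), min(sample_size, len(X)))
        all_w.append(relieff_weights(X[idx], y[idx]))
    avg = np.mean(all_w, axis=0)           # reduce: per-feature average of weights
    return np.argsort(avg)[::-1][:n_features]
```

The selected feature indices would then be used to project the full dataset before training a classifier; in the real framework each partition's weighting runs as an independent map task, so only the small weight vectors travel to the reducer.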

