H.3. Artificial Intelligence
Sajjad Alizadeh Fard; Hossein Rahmani
Abstract
Fraud in financial data is a significant concern for both businesses and individuals. Credit card transactions involve numerous features, some of which may lack relevance for classifiers and could lead to overfitting. A pivotal step in the fraud detection process is feature selection, which profoundly impacts model accuracy and execution time. In this paper, we introduce an ensemble-based, explainable feature selection framework founded on the SHAP and LIME algorithms, called "X-SHAoLIM". We applied our framework to diverse combinations of the best models from previous studies, conducting both quantitative and qualitative comparisons with other feature selection methods. The quantitative evaluation of the "X-SHAoLIM" framework across various model combinations revealed consistent performance improvements on average, including increases in Precision (+5.6), Recall (+1.5), F1-Score (+3.5), and AUC-PR (+6.75). Beyond enhanced accuracy, our proposed framework, leveraging explainable algorithms like SHAP and LIME, provides a deeper understanding of feature importance in model predictions, delivering effective explanations to system users.
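As a rough illustration of the ensemble idea, the sketch below averages normalized mean-|SHAP| and mean-|LIME| importances into a single ranking and keeps the top-k features; the model, toy dataset, equal weighting, and top-k cutoff are illustrative assumptions, not the authors' exact X-SHAoLIM procedure.

```python
# Minimal sketch: combine global SHAP and LIME importances into one ranking.
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Global SHAP importance: mean |SHAP value| per feature.
shap_imp = np.abs(shap.TreeExplainer(model).shap_values(X)).mean(axis=0)

# Global LIME importance: mean |local weight| over a sample of instances.
explainer = LimeTabularExplainer(X, mode="classification")
lime_imp = np.zeros(X.shape[1])
n_explained = 30
for i in range(n_explained):
    exp = explainer.explain_instance(X[i], model.predict_proba,
                                     num_features=X.shape[1])
    for feat_id, weight in exp.as_map()[1]:
        lime_imp[feat_id] += abs(weight)
lime_imp /= n_explained

# Normalize each score vector and combine with equal weight (an assumption).
combined = shap_imp / shap_imp.sum() + lime_imp / lime_imp.sum()
top_k = np.argsort(combined)[::-1][:8]
print("selected features:", sorted(top_k.tolist()))
```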
H.3. Artificial Intelligence
Damianus Kofi Owusu; Christiana Cynthia Nyarko; Joseph Acquah; Joel Yarney
Abstract
Head and neck cancer (HNC) recurrence is ever increasing among Ghanaian men and women. Because machine learning classifiers are not all created equal, even when several of them suit a given task well, it can be very difficult to find one that performs optimally across different distributions. Stacking learns how best to combine weak classifier models into a strong model. To obtain a prognostic model for classifying HNSCC recurrence patterns, this study sought to identify the best stacked ensemble classifier when the same ML classifiers are used for both feature selection and stacked ensemble learning. Four stacked ensemble models were developed, each with a gradient boosting machine (GBM) meta-classifier: the first used two base classifiers, GBM and distributed random forest (DRF); the second used three (GBM, DRF, and a deep neural network (DNN)); the third used four (GBM, DRF, DNN, and a generalized linear model (GLM)); and the fourth used five (GBM, DRF, DNN, GLM, and Naïve Bayes (NB)). The results showed that the stacked ensemble with five base classifiers trained on gradient-boosted features performed better than the same technique on other feature subsets, and better than the other stacked ensembles on both the gradient-boosted features and the other feature subsets. A stacked ensemble with five base classifiers learned on GBM-selected features is therefore clinically appropriate as a prognostic model for classifying and predicting the recurrence of HNSCC patients.
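The stack can be approximated in scikit-learn as below; the study used H2O's GBM, DRF, DNN, GLM, and NB, so the estimators here are hedged stand-ins, with the GBM meta-classifier matching the paper's choice.

```python
# Sketch of a five-base-learner stack with a GBM meta-classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

base_learners = [
    ("gbm", GradientBoostingClassifier(random_state=0)),             # ~ GBM
    ("drf", RandomForestClassifier(random_state=0)),                 # ~ DRF
    ("dnn", make_pipeline(StandardScaler(),
                          MLPClassifier(max_iter=1000, random_state=0))),  # ~ DNN
    ("glm", LogisticRegression(max_iter=1000)),                      # ~ GLM
    ("nb", GaussianNB()),                                            # ~ NB
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=GradientBoostingClassifier(random_state=0),      # GBM meta-classifier
    cv=5,
)

X, y = make_classification(n_samples=500, n_features=12, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
print("held-out accuracy:", stack.fit(Xtr, ytr).score(Xte, yte))
```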
H.3.15.3. Evolutionary computing and genetic algorithms
Mahdieh Maazalahi; Soodeh Hosseini
Abstract
Detecting and preventing malware infections in systems has become a critical necessity. This paper presents a hybrid method for malware detection that utilizes data mining algorithms, namely simulated annealing (SA), support vector machine (SVM), genetic algorithm (GA), and K-means. The proposed method combines these algorithms to achieve effective malware detection. Initially, the SA-SVM method is employed for feature selection, where the SVM algorithm identifies the best features and the SA algorithm tunes the SVM parameters. Subsequently, the GA-K-means method is utilized to identify attacks: the GA selects the best chromosome for the cluster centers, and the K-means algorithm is then applied to identify malware. To evaluate the performance of the proposed method, two datasets, Andro-Autopsy and CICMalDroid 2020, have been utilized. The evaluation results demonstrate that the proposed method achieves high true positive rates (0.964, 0.985) and true negative rates (0.985, 0.989), with low false negative rates (0.036, 0.015) and false positive rates (0.022, 0.043). This indicates that the method effectively detects malware while reasonably minimizing false identifications.
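The SA-SVM step might look roughly like the sketch below, assuming SciPy's dual_annealing as the annealer and cross-validated accuracy as the objective; the parameter bounds and toy data are illustrative, not the paper's setup.

```python
# Sketch: simulated annealing tunes SVM parameters (C, gamma).
from scipy.optimize import dual_annealing
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

def neg_cv_accuracy(params):
    """Objective for the annealer: negative cross-validated SVM accuracy."""
    log_c, log_gamma = params
    svm = SVC(C=10 ** log_c, gamma=10 ** log_gamma)
    return -cross_val_score(svm, X, y, cv=3).mean()

# Anneal over log10(C) in [-2, 3] and log10(gamma) in [-4, 1].
result = dual_annealing(neg_cv_accuracy, bounds=[(-2, 3), (-4, 1)],
                        maxiter=20, no_local_search=True, seed=0)
print(f"C={10 ** result.x[0]:.3g}, gamma={10 ** result.x[1]:.3g}, "
      f"cv_acc={-result.fun:.3f}")
```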
C.3. Software Engineering
Saba Beiranvand; Mohammad Ali Zare Chahooki
Abstract
Software Cost Estimation (SCE) is one of the most widely used and effective activities in project management. In machine learning methods, some features adversely affect accuracy, so preprocessing methods that remove non-effective features can improve it. In clustering techniques, samples are categorized into different clusters according to their semantic similarity. Accordingly, in the proposed study, to improve SCE accuracy, samples are first clustered based on the original features, and a feature selection (FS) technique is then applied separately to each cluster. The proposed FS method combines filter and wrapper FS methods, drawing on the advantages of both in selecting the effective features of each cluster with less computational complexity and more accuracy. Furthermore, since the assessment criterion significantly impacts wrapper methods, a fused criterion has also been used. The proposed method was applied to the Desharnais, COCOMO81, COCONASA93, Kemerer, and Albrecht datasets, and the obtained Mean Magnitude of Relative Error (MMRE) values for these datasets were 0.2173, 0.6489, 0.3129, 0.4898, and 0.4245, respectively. These results were compared with previous studies and showed improvement in the error rate of SCE.
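A minimal sketch of the per-cluster selection idea, assuming KMeans clusters, a simple filter step per cluster, and MMRE as the error measure; the paper's fused filter+wrapper criterion is more elaborate than this.

```python
# Sketch: cluster samples, select features per cluster, report MMRE.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=120, n_features=15, noise=5.0, random_state=0)
y = np.abs(y) + 1.0  # effort values must be positive for MMRE

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
mre = []
for c in np.unique(labels):
    Xc, yc = X[labels == c], y[labels == c]
    selector = SelectKBest(f_regression, k=5).fit(Xc, yc)  # filter step per cluster
    model = LinearRegression().fit(selector.transform(Xc), yc)
    pred = model.predict(selector.transform(Xc))
    mre.extend(np.abs(yc - pred) / yc)
print(f"MMRE = {np.mean(mre):.4f}")  # Mean Magnitude of Relative Error
```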
F.4.18. Time series analysis
Ali Ghorbanian; Hamideh Razavi
Abstract
In time series clustering, features are typically extracted from the time series data and used for clustering instead of directly clustering the data. However, using the same set of features for all data sets may not be effective. To overcome this limitation, this study proposes a five-step algorithm that extracts a complete set of features for each data set, including both direct and indirect features. The algorithm then selects essential features for clustering using a genetic algorithm and internal clustering criteria. The final clustering is performed using a hierarchical clustering algorithm and the selected features. Results from applying the algorithm to 81 data sets indicate an average Rand index of 72.16%, with 38 of the 78 extracted features, on average, being selected for clustering. Statistical tests comparing this algorithm to four others in the literature confirm its effectiveness.
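A compact sketch of the feature-selection step, assuming the silhouette score as the internal clustering criterion and average-linkage hierarchical clustering; the paper's full five-step pipeline also extracts the direct and indirect features themselves, which is omitted here.

```python
# Sketch: a genetic algorithm evolves binary feature masks scored by silhouette.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=200, n_features=12, centers=3, random_state=0)

def fitness(mask):
    """Internal criterion on the selected features (silhouette, assumed)."""
    if mask.sum() == 0:
        return -1.0
    labels = AgglomerativeClustering(n_clusters=3,
                                     linkage="average").fit_predict(X[:, mask])
    return silhouette_score(X[:, mask], labels)

pop = rng.integers(0, 2, size=(20, X.shape[1])).astype(bool)
for _ in range(30):  # generations
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]            # truncation selection
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        cut = rng.integers(1, X.shape[1])              # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(X.shape[1]) < 0.05           # bit-flip mutation
        children.append(child ^ flip)
    pop = np.vstack([parents, np.array(children)])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best).tolist())
```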
S. Hosseini; M. Khorashadizade
Abstract
High dimensionality is the biggest problem when working with large datasets. Feature selection is a procedure for reducing the dimensionality of datasets by removing redundant and irrelevant features; the most effective features in the dataset remain, increasing the algorithms’ performance. In this paper, a novel feature selection procedure is presented based on a binary teaching-learning-based optimization algorithm with mutation (BMTLBO). The TLBO algorithm is one of the most efficient and practical optimization techniques. Although it converges quickly and benefits from a strong exploration capability, it may become trapped in a local optimum, so we try to establish a balance between exploration and exploitation. The proposed method has two parts. First, we used the binary version of the TLBO algorithm for feature selection and added a mutation operator to provide a strong local search capability (BMTLBO). Second, we used a modified TLBO algorithm with a self-learning phase (SLTLBO) to train a neural network, demonstrating the method's application to classification problems and evaluating its performance. We tested the proposed method on 14 datasets in terms of classification accuracy and the number of selected features. The results showed that BMTLBO outperformed the standard TLBO algorithm and proved the potency of the proposed method. The results are very promising and close to optimal.
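The teacher phase plus mutation can be sketched as follows (the learner and self-learning phases are omitted for brevity); the sigmoid transfer function and the KNN cross-validation fitness are assumptions, not necessarily the paper's exact choices.

```python
# Sketch: binary TLBO (teacher phase + mutation) for feature selection.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_feat, pop_size, iters = X.shape[1], 10, 15

def fitness(mask):
    """Cross-validated KNN accuracy on the selected feature subset."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(KNeighborsClassifier(),
                           X[:, mask.astype(bool)], y, cv=3).mean()

def binarize(v):
    """Sigmoid transfer function: map a real vector to a random 0/1 mask."""
    return (rng.random(v.shape) < 1 / (1 + np.exp(-v))).astype(float)

pop = rng.integers(0, 2, (pop_size, n_feat)).astype(float)
for _ in range(iters):  # teacher phase + mutation only; learner phase omitted
    scores = np.array([fitness(p) for p in pop])
    teacher, mean = pop[scores.argmax()], pop.mean(axis=0)
    for i in range(pop_size):
        tf = rng.integers(1, 3)                      # teaching factor in {1, 2}
        cand = binarize(pop[i] + rng.random(n_feat) * (teacher - tf * mean))
        flip = rng.random(n_feat) < 0.02             # mutation: random bit flips
        cand = np.where(flip, 1 - cand, cand)
        if fitness(cand) > scores[i]:                # greedy replacement
            pop[i] = cand

best = pop[np.array([fitness(p) for p in pop]).argmax()]
print(f"selected {int(best.sum())} of {n_feat} features")
```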
Z. Shojaee; Seyed A. Shahzadeh Fazeli; E. Abbasi; F. Adibnia
Abstract
Today, feature selection, as a technique to improve the performance of classification methods, has been widely considered by computer scientists. As the dimensions of a matrix have a huge impact on the performance of any processing over it, reducing the number of features by choosing the best subset of all features will affect the performance of the algorithms. Finding the best subset by comparing all possible subsets is intractable even when the number of features n is small, so many researchers turn to heuristic methods to find near-optimal solutions. In this paper, we introduce a novel feature selection technique that selects the most informative features and omits the redundant or irrelevant ones. Our method is embedded in PSO (Particle Swarm Optimization). To omit the redundant or irrelevant features, it is necessary to figure out the relationship between different features, and many correlation functions can reveal this relationship. In our proposed method, we use the mutual information technique to find it. We evaluate the performance of our method on three classification benchmarks: Glass, Vowel, and Wine. Comparing the results with four state-of-the-art methods demonstrates its superiority over them.
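The mutual-information relationship could be scored as below; the relevance-minus-redundancy form and the quartile discretization are assumptions, and a binary PSO (not shown) would evolve the masks that this fitness scores.

```python
# Sketch: MI-based relevance/redundancy score for candidate feature subsets.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

X, y = load_wine(return_X_y=True)
n = X.shape[1]

relevance = mutual_info_classif(X, y, random_state=0)  # I(feature; class)

# Pairwise redundancy I(f_i; f_j), computed on quartile-discretized features.
Xd = np.column_stack([np.digitize(X[:, j], np.quantile(X[:, j], [0.25, 0.5, 0.75]))
                      for j in range(n)])
redundancy = np.array([[mutual_info_score(Xd[:, i], Xd[:, j]) for j in range(n)]
                       for i in range(n)])

def fitness(mask):
    """Score a candidate subset: high class relevance, low mutual redundancy."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return -np.inf
    sub = redundancy[np.ix_(idx, idx)]
    red = (sub.sum() - np.trace(sub)) / max(idx.size * (idx.size - 1), 1)
    return relevance[idx].mean() - red

# A binary PSO would evolve such masks; here we just score a random one.
rng = np.random.default_rng(0)
print(f"fitness of a random subset: {fitness(rng.integers(0, 2, n)):.4f}")
```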
M. Salehi; J. Razmara; Sh. Lotfi
Abstract
Prediction of cancer survivability using machine learning techniques has become a popular approach in recent years. An important issue in this regard is that preparing some features may require difficult and costly experiments, even though these features have a less significant impact on the final decision and can be omitted from the feature set. Therefore, developing a machine for survivability prediction that ignores these features for simple cases while yielding acceptable prediction accuracy has become a challenge for researchers. In this paper, we have developed an ensemble multi-stage machine for survivability prediction that ignores the difficult features for simple cases. In the first stage, the machine employs three basic learners, namely a multilayer perceptron (MLP), a support vector machine (SVM), and a decision tree (DT), to predict survivability using simple features. If the learners agree on the output, the machine makes the final decision in the first stage. Otherwise, for difficult cases where the learners' outputs differ, the machine makes its decision in the second stage using an SVM over all features. The developed model was evaluated using the Surveillance, Epidemiology, and End Results (SEER) database. The experimental results revealed that the developed machine obtains considerable accuracy while ignoring the difficult features for most input samples.
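A hedged sketch of the two-stage logic on a toy dataset: three learners vote on cheap "simple" features, and the full-feature SVM runs only where they disagree. Which features count as "simple" is an illustrative assumption here, not the paper's split.

```python
# Sketch: agreement-gated two-stage ensemble (MLP, SVM, DT -> full-feature SVM).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
simple = slice(0, 10)  # hypothetical "easy to collect" features
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Stage 1: three learners trained on the simple features only.
mlp = make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0))
svm = make_pipeline(StandardScaler(), SVC())
dt = DecisionTreeClassifier(random_state=0)
stage1 = [m.fit(Xtr[:, simple], ytr) for m in (mlp, svm, dt)]

# Stage 2: an SVM over all features, used only on disagreements.
svm_full = make_pipeline(StandardScaler(), SVC()).fit(Xtr, ytr)

p1, p2, p3 = (m.predict(Xte[:, simple]) for m in stage1)
agree = (p1 == p2) & (p2 == p3)
pred = np.where(agree, p1, svm_full.predict(Xte))  # lazily, stage 2 would run only where needed
print(f"stage 2 needed for {100 * (~agree).mean():.1f}% of cases; "
      f"accuracy = {(pred == yte).mean():.3f}")
```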
J.10.3. Financial
S. Beigi; M.R. Amin Naseri
Abstract
Due to today’s advancements in technology and business, fraud detection has become a critical component of financial transactions. Given the vast amounts of data in large datasets, detecting fraudulent transactions manually becomes more difficult. In this research, we propose a combined method that uses both data mining and statistical techniques, employing feature selection, resampling, and cost-sensitive learning for credit card fraud detection. In the first step, useful features are identified using a genetic algorithm. Next, the optimal resampling strategy is determined based on design of experiments (DOE) and response surface methodologies. Finally, the cost-sensitive C4.5 algorithm is used as the base learner in the AdaBoost algorithm. Results on a real dataset show that the proposed method significantly reduces the misclassification cost, by at least 14%, compared with decision tree, Naïve Bayes, Bayesian network, neural network, and artificial immune system classifiers.
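The final modeling step might be approximated as below, with sklearn's CART standing in for C4.5 (which sklearn does not implement) and misclassification costs folded in via class weights; the costs and data are illustrative, not the paper's formulation.

```python
# Sketch: cost-sensitive trees boosted with AdaBoost on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)  # ~5% fraud
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

cost_tree = DecisionTreeClassifier(max_depth=3,
                                   class_weight={0: 1, 1: 20})  # missing fraud costs 20x
boosted = AdaBoostClassifier(estimator=cost_tree, n_estimators=100,
                             random_state=0).fit(Xtr, ytr)      # sklearn >= 1.2 API
print("fraud recall:", recall_score(yte, boosted.predict(Xte)))
```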
H.6.3.2. Feature evaluation and selection
E. Enayati; Z. Hassani; M. Moodi
Abstract
Breast cancer is one of the most common cancers in the world. Early detection of cancer significantly reduces morbidity rates and treatment costs. Mammography is a known, effective method for diagnosing breast cancer. One way to identify mammography screening behavior is to evaluate women's awareness of, and participation in, mammography screening programs. Today, intelligent systems can identify the main factors behind a specific outcome, which can help experts in a wide range of areas, especially health domains such as prevention, diagnosis, and treatment. In this paper, we use a hybrid model called H-BwoaSvm, in which BWOA detects the factors that influence mammography screening behavior and an SVM performs the classification. Our model is applied to a dataset collected from a descriptive-analytical study of 2256 women. The proposed model achieves 82.27 and 98.89 percent accuracy on this dataset and selects the features that effectively influence mammography screening behavior.
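A sketch of how a binary whale-optimization wrapper around an SVM could look, using a sigmoid transfer to binarize positions; the paper's exact BWOA variant and its survey data are not reproduced here, so a toy dataset stands in.

```python
# Sketch: simplified binary whale optimization wrapped around an SVM fitness.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n, pop, iters = X.shape[1], 8, 15

def fitness(mask):
    """Cross-validated SVM accuracy on the selected feature subset."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(), X[:, mask.astype(bool)], y, cv=3).mean()

positions = rng.random((pop, n))                   # continuous whale positions
best, best_fit = positions[0].copy(), -1.0
for t in range(iters):
    a = 2 - 2 * t / iters                          # control parameter: 2 -> 0
    for i in range(pop):
        mask = (1 / (1 + np.exp(-positions[i])) > rng.random(n)).astype(float)
        f = fitness(mask)                          # sigmoid transfer + evaluate
        if f > best_fit:
            best_fit, best = f, positions[i].copy()
        A, C = 2 * a * rng.random(n) - a, 2 * rng.random(n)
        if rng.random() < 0.5:                     # encircling-prey update
            positions[i] = best - A * np.abs(C * best - positions[i])
        else:                                      # spiral update
            l = rng.uniform(-1, 1)
            D = np.abs(best - positions[i])
            positions[i] = D * np.exp(l) * np.cos(2 * np.pi * l) + best
print(f"best CV accuracy: {best_fit:.3f}")
```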
H.3. Artificial Intelligence
N. Emami; A. Pakzad
Abstract
Breast cancer has become a widespread disease among young women around the world. Expert systems, developed with data mining techniques, are valuable tools for diagnosing breast cancer and can support physicians in the decision-making process. This paper presents a new hybrid data mining approach to classify two groups of breast cancer patients (malignant and benign). The proposed approach, AP-AMBFA, consists of two phases. In the first phase, the Affinity Propagation (AP) clustering method is used as an instance reduction technique that finds and eliminates noisy instances. In the second phase, feature selection and classification are conducted using the Adaptive Modified Binary Firefly Algorithm (AMBFA) to select the predictor variables most related to the target variable, with a Support Vector Machine (SVM) as the classifier. This reduces the computational complexity and speeds up the data mining process. Experimental results on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset show high predictive accuracy: the obtained classification accuracy is 98.606%, a very promising result compared to the current state-of-the-art classification techniques applied to the same database. Hence, this method can help physicians diagnose breast cancer more accurately.
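The first phase could be sketched as below: Affinity Propagation clusters the data, and instances farthest from their exemplar are treated as noisy and dropped. The 10% cutoff, damping value, and dataset are illustrative assumptions.

```python
# Sketch: Affinity Propagation as an instance-reduction / noise filter.
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
Xs = StandardScaler().fit_transform(X)

ap = AffinityPropagation(damping=0.9, max_iter=1000, random_state=0).fit(Xs)
exemplars = ap.cluster_centers_[ap.labels_]   # each instance's cluster exemplar
dist = np.linalg.norm(Xs - exemplars, axis=1)
keep = dist < np.quantile(dist, 0.90)         # drop the farthest 10% as noisy
X_clean, y_clean = X[keep], y[keep]
print(f"kept {keep.sum()} of {len(X)} instances "
      f"across {len(ap.cluster_centers_)} clusters")
```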
F.4.17. Survival analysis
S. Miri Rostami; M. Ahmadzadeh
Abstract
Application of data mining methods as decision support systems is of great benefit in predicting the survival of new patients, and it also has great potential for health researchers investigating the relationship between risk factors and cancer survival. However, due to the imbalanced nature of the datasets associated with breast cancer survival, the accuracy of survival prognosis models is a challenging issue for researchers. This study aims to develop a predictive model for the 5-year survivability of breast cancer patients and to discover relationships between certain predictive variables and survival. The dataset was obtained from the SEER database. First, the effectiveness of two synthetic oversampling methods, Borderline SMOTE and the Density-based Synthetic Oversampling method (DSO), is investigated to solve the class imbalance problem. Then a combination of particle swarm optimization (PSO) and correlation-based feature selection (CFS) is used to identify the most important predictive variables. Finally, to build a predictive model, three classifiers, decision tree (C4.5), Bayesian network, and logistic regression, are applied to the cleaned dataset. Assessment metrics such as accuracy, sensitivity, specificity, and G-mean are used to evaluate the performance of the proposed hybrid approach, and the area under the ROC curve (AUC) is used to evaluate the performance of the feature selection method. Results show that among all combinations, DSO + PSO_CFS + C4.5 performs best in terms of accuracy, sensitivity, G-mean, and AUC, with values of 94.33%, 0.930, 0.939, and 0.939, respectively.
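The oversampling step is available off the shelf for Borderline SMOTE (DSO is not in common libraries), and the G-mean follows directly from the confusion matrix; the sketch below assumes a plain decision tree in place of the three classifiers studied.

```python
# Sketch: Borderline SMOTE oversampling, then classify and report G-mean.
import numpy as np
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

Xres, yres = BorderlineSMOTE(random_state=0).fit_resample(Xtr, ytr)  # balance classes
pred = DecisionTreeClassifier(random_state=0).fit(Xres, yres).predict(Xte)

tn, fp, fn, tp = confusion_matrix(yte, pred).ravel()
sensitivity, specificity = tp / (tp + fn), tn / (tn + fp)
print(f"G-mean = {np.sqrt(sensitivity * specificity):.3f}")
```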
C.1. General
L. khalvati; M. Keshtgary; N. Rikhtegar
Abstract
Information security and intrusion detection systems (IDS) play a critical role in the Internet. An IDS is an essential tool for detecting different kinds of attacks in a network and maintaining data integrity, confidentiality, and system availability against possible threats. In this paper, a hybrid approach towards achieving high performance is proposed; in particular, an important goal of this paper is generating an efficient training dataset. Intrusion detection research often combines clustering and feature selection to exploit the strengths of both, and the proposed method uses these techniques as well. First, a new training dataset is created by K-Medoids clustering and SVM-based feature selection. After that, a Naïve Bayes classifier is used for evaluation. The proposed method is compared with another hybrid algorithm from the literature and is also evaluated with 10-fold cross-validation. Experimental results based on the KDD CUP’99 dataset show that the proposed method achieves better accuracy, detection rate, and false alarm rate than the others.
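A hedged sketch of the pipeline, assuming scikit-learn-extra's KMedoids, LinearSVC-weight-based feature selection as the "SVM method", and synthetic data in place of KDD CUP'99.

```python
# Sketch: K-Medoids builds a compact training set; SVM weights select features;
# Naive Bayes is the evaluated classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn_extra.cluster import KMedoids  # from the scikit-learn-extra package

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# SVM-based feature selection: keep features with large |LinearSVC| weights.
selector = SelectFromModel(LinearSVC(dual=False, C=0.1)).fit(X, y)
Xsel = selector.transform(X)

# K-Medoids per class: the medoids form a small, representative training set.
parts_X, parts_y = [], []
for cls in np.unique(y):
    km = KMedoids(n_clusters=50, random_state=0).fit(Xsel[y == cls])
    parts_X.append(km.cluster_centers_)
    parts_y.append(np.full(50, cls))
Xtrain, ytrain = np.vstack(parts_X), np.concatenate(parts_y)

nb = GaussianNB().fit(Xtrain, ytrain)  # Naive Bayes trained on the reduced set
print(f"accuracy on all data: {nb.score(Xsel, y):.3f}")
```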
C.3. Software Engineering
F. Karimian; S. M. Babamir
Abstract
The reliability of software depends on its fault-prone modules: the fewer fault-prone units a piece of software contains, the more we may trust it. Therefore, if we are able to predict the number of fault-prone modules in a piece of software, we can judge its reliability. In predicting software fault-prone modules, one of the contributing features is software metrics, by which one can classify software modules into fault-prone and non-fault-prone ones. To make such a classification, we investigated 17 classification methods whose features (attributes) are software metrics (39 metrics) and whose instances (software modules) come from 13 datasets reported by NASA. However, two important issues influence prediction accuracy when data mining methods are used: (1) selecting the best and most influential features (i.e., software metrics) when there is a wide diversity of them, and (2) instance sampling to balance the imbalanced classes, since the classifier is biased towards the majority class when the classes are imbalanced. Based on feature selection and instance sampling, we considered 4 scenarios in our appraisal of the 17 classification methods for predicting software fault-prone modules. To select features, we used Correlation-based Feature Selection (CFS), and to sample instances we applied the Synthetic Minority Oversampling Technique (SMOTE). Empirical results showed that suitable sampling of software modules significantly influences the accuracy of predicting software reliability, whereas metric selection has no considerable effect on the prediction.
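The four scenarios (feature selection on/off crossed with SMOTE on/off) can be sketched as below, with SelectKBest standing in for CFS (which sklearn does not provide) and one random forest standing in for the 17 classifiers studied.

```python
# Sketch: compare the 2x2 grid of {feature selection} x {SMOTE} scenarios.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=39, weights=[0.9],
                           random_state=0)  # 39 features, like the metric set
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

for fs in (False, True):
    for smote in (False, True):
        Xa, Xb, ya = Xtr, Xte, ytr
        if fs:
            sel = SelectKBest(f_classif, k=10).fit(Xa, ya)  # CFS stand-in
            Xa, Xb = sel.transform(Xa), sel.transform(Xb)
        if smote:
            Xa, ya = SMOTE(random_state=0).fit_resample(Xa, ya)
        clf = RandomForestClassifier(random_state=0).fit(Xa, ya)
        score = balanced_accuracy_score(yte, clf.predict(Xb))
        print(f"FS={fs!s:5} SMOTE={smote!s:5} balanced acc={score:.3f}")
```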
H.3. Artificial Intelligence
F. Fadaei Noghani; M. Moattar
Abstract
Due to the rise of technology, the possibility of fraud in areas such as banking has increased. Credit card fraud is a crucial problem in banking, and its danger is ever increasing. This paper proposes an advanced data mining method that considers both feature selection and decision cost to enhance the accuracy of credit card fraud detection. After selecting the best and most effective features using an extended wrapper method, ensemble classification is performed. The extended feature selection approach comprises a prior feature filtering step and a wrapper approach using a C4.5 decision tree. Ensemble classification with cost-sensitive decision trees is then performed in a decision forest framework. A locally gathered fraud detection dataset is used to evaluate the proposed method, which is assessed using accuracy, recall, and F-measure as evaluation metrics and compared with basic classification algorithms including ID3, J48, Naïve Bayes, Bayesian network, and NBTree. Experiments show that, with F-measure as the evaluation metric, the proposed approach yields a 1.8 to 2.4 percent performance improvement compared to the other classifiers.
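The wrapper stage might be sketched with sklearn's SequentialFeatureSelector and CART in place of C4.5; selecting on F1 mirrors the paper's F-measure focus, while the prior filter stage is omitted for brevity.

```python
# Sketch: wrapper feature selection with a decision tree as the inner model.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=25, n_informative=6,
                           random_state=0)
wrapper = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),  # evaluation model inside the wrapper
    n_features_to_select=6,
    scoring="f1",   # select on F1, matching the paper's F-measure criterion
    cv=3,
)
X_selected = wrapper.fit_transform(X, y)
print("selected feature indices:", wrapper.get_support(indices=True).tolist())
```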