H.6.3.3. Pattern analysis
Meysam Roostaee; Razieh Meidanshahi
Abstract
In this study, we sought to minimize the need for redundant blood tests in diagnosing common diseases by leveraging unsupervised data mining techniques on a large-scale dataset of over one million patients' blood test results. We excluded non-numeric and subjective data to ensure precision. To identify relationships between attributes, we applied a suite of unsupervised methods including preprocessing, clustering, and association rule mining. Our approach uncovered correlations that enable healthcare professionals to detect potential acute diseases early, improving patient outcomes and reducing costs. The reliability of our extracted patterns also suggests that this approach can lead to significant time and cost savings while reducing the workload for laboratory personnel. Our study highlights the importance of big data analytics and unsupervised learning techniques in increasing efficiency in healthcare centers.
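As a rough sketch of the association-rule step described in this abstract (not the authors' code), the following discretizes numeric lab values into binary flags and mines rules with the third-party mlxtend package. The file name, column handling, and the 95th-percentile cutoff are all hypothetical.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical numeric lab results, one row per patient.
tests = pd.read_csv("blood_tests.csv")

# Discretize each test into a boolean "abnormally high" flag (illustrative cutoff).
flags = pd.DataFrame({f"{col}_high": tests[col] > tests[col].quantile(0.95)
                      for col in tests.columns})

itemsets = apriori(flags, min_support=0.05, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]].head())
```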
C.3. Software Engineering
Saba Beiranvand; Mohammad Ali Zare Chahooki
Abstract
Software Cost Estimation (SCE) is one of the most widely used and effective activities in project management. In machine learning methods, some features have adverse effects on accuracy. Thus, preprocessing methods based on removing non-effective features can improve accuracy in these methods. In clustering techniques, samples are categorized into different clusters according to their semantic similarity. Accordingly, in the proposed study, to improve SCE accuracy, samples are first clustered based on the original features. Then, a feature selection (FS) technique is applied separately to each cluster. The proposed FS method is a combination of filter and wrapper FS methods, exploiting the advantages of both in selecting effective features for each cluster, with less computational complexity and higher accuracy. Furthermore, as the assessment criterion has a significant impact on wrapper methods, a fused criterion has also been used. The proposed method was applied to the Desharnais, COCOMO81, COCONASA93, Kemerer, and Albrecht datasets, and the Mean Magnitude of Relative Error (MMRE) values obtained for these datasets were 0.2173, 0.6489, 0.3129, 0.4898, and 0.4245, respectively. These results were compared with previous studies and showed an improvement in the error rate of SCE.
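For reference, the MMRE criterion reported in these results has a simple closed form, mean(|actual − predicted| / actual); a minimal Python version follows, with purely illustrative effort values rather than numbers from the cited datasets.

```python
import numpy as np

def mmre(actual, predicted):
    """Mean Magnitude of Relative Error: mean(|y - y_hat| / y)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted) / actual)

# Illustrative effort values in person-months (not from the cited datasets).
print(round(mmre([277, 82, 51], [310, 70, 60]), 4))  # -> 0.1473
```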
Mohammad Reza Keyvanpour; Zahra Karimi Zandian; Nasrin Mottaghi
Abstract
Regression testing reduction is an essential phase in software testing. In this step, redundant and unnecessary test cases are eliminated without degrading software accuracy and performance. So far, various methods have been proposed in the field of regression testing reduction. The main challenge in this area is to provide a method that maintains fault-detection capability while reducing test suites. In this paper, a new test suite reduction technique is proposed based on data mining. In this method, in addition to reducing the test suite, its fault-detection capability is preserved using both clustering and classification. In this approach, regression test cases are reduced in two levels using a bi-criteria data mining-based method. At each level, different and useful coverage criteria and clustering algorithms are used to establish a better compromise between the test suite size and the fault-detection ability of the reduced test suite. The results of the proposed method have been compared with those of five other methods based on PSTR and PFDL. The experiments show the efficiency of the proposed method in reducing the test suite while maintaining its fault-detection capability.
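A heavily simplified, hypothetical sketch of the reduce-by-clustering idea: test cases are grouped by their coverage vectors and one representative is kept per cluster. The paper's bi-criteria, two-level method is richer; the random coverage matrix and the KMeans choice here are stand-ins.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
coverage = rng.integers(0, 2, size=(200, 50))   # tests x coverage items (stand-in)

k = 20                                          # target reduced suite size
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coverage)

# Keep one representative test case per cluster of similar coverage.
reduced = [int(np.where(labels == c)[0][0]) for c in range(k)]
print("kept test cases:", reduced)
```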
F.4.18. Time series analysis
Ali Ghorbanian; Hamideh Razavi
Abstract
In time series clustering, features are typically extracted from the time series data and used for clustering instead of directly clustering the data. However, using the same set of features for all data sets may not be effective. To overcome this limitation, this study proposes a five-step algorithm that extracts a complete set of features for each data set, including both direct and indirect features. The algorithm then selects essential features for clustering using a genetic algorithm and internal clustering criteria. The final clustering is performed using a hierarchical clustering algorithm and the selected features. Results from applying the algorithm to 81 data sets indicate an average Rand index of 72.16%, with 38 of the 78 extracted features, on average, being selected for clustering. Statistical tests comparing this algorithm to four others in the literature confirm its effectiveness.
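A compressed, assumption-laden sketch of the select-then-cluster loop: a toy genetic search over feature-subset masks scored by an internal criterion (silhouette is used here as a stand-in), followed by hierarchical clustering on the winning subset. Population size, operators, and the criterion are simplified relative to the paper's five-step algorithm.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20))        # rows: series, cols: extracted features

def fitness(mask):
    """Internal-criterion score of a feature subset (silhouette stand-in)."""
    if mask.sum() < 2:
        return -1.0
    labels = AgglomerativeClustering(n_clusters=3).fit_predict(X[:, mask])
    return silhouette_score(X[:, mask], labels)

pop = rng.integers(0, 2, size=(30, X.shape[1])).astype(bool)
for _ in range(25):                                    # generations
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]            # elitist selection
    children = parents[rng.integers(0, 10, size=30)].copy()
    pop = children ^ (rng.random(children.shape) < 0.05)  # bit-flip mutation

best = pop[np.argmax([fitness(m) for m in pop])]
final_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X[:, best])
```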
F. Amiri; S. Abbasi; M. Babaie mohamadeh
Abstract
During the COVID-19 crisis, we face a wide range of thoughts, feelings, and behaviors on social media that play a significant role in spreading information regarding COVID-19. Trustworthy information, together with hopeful messages, could be used to manage people's emotions and reactions during pandemics. This study examines Iranian society's resilience in the face of the COVID-19 crisis and provides a strategy to promote resilience in similar situations. It investigates posts and news related to the COVID-19 pandemic in Iran to determine which messages and sources caused concern in the community, how those messages could be modified, and which sources were the most trusted publishers. Social network analysis methods such as clustering have been used to analyze the data. In the present work, we applied a two-stage clustering method built on the self-organizing map and K-means. Because of the importance of social trust in accepting messages, this work also examines public trust in social posts. The results showed that trust in health-related posts was lower than in social and cultural posts. The trusted posts were shared on Instagram and news sites. Health and cultural posts with negative polarity affected people's trust and led to negative emotions such as fear, disgust, sadness, and anger. We therefore suggest that non-political discourse be used to share topics in the field of health.
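A minimal sketch of a two-stage SOM-then-K-means pipeline of the kind described, using the third-party minisom package; the post feature vectors, map size, and cluster count are placeholders, and the text-feature extraction step is assumed to have happened already.

```python
import numpy as np
from minisom import MiniSom
from sklearn.cluster import KMeans

X = np.random.rand(500, 40)                  # stand-in post feature vectors

# Stage 1: train an 8x8 SOM and take its codebook vectors.
som = MiniSom(8, 8, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 2000)
codebook = som.get_weights().reshape(-1, X.shape[1])

# Stage 2: run K-means on the codebook instead of the raw posts.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(codebook)

# Each post inherits the K-means label of its best-matching SOM unit.
labels = [km.labels_[i * 8 + j] for i, j in (som.winner(x) for x in X)]
```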
H.5.7. Segmentation
V. Naghashi; Sh. Lotfi
Abstract
Image segmentation is a fundamental step in many image processing applications. In most cases, the image's pixels are clustered based only on intensity or color information, and neither the spatial nor the neighborhood information of pixels is used in the clustering process. Including the spatial information of pixels improves the quality of image segmentation, and using the information of neighboring pixels enhances its accuracy. In this paper, the idea of combining the K-means algorithm and the Improved Imperialist Competitive Algorithm is proposed. Before applying the hybrid algorithm, a new image is created, and then the hybrid algorithm is employed. Finally, a simple post-processing step is applied to the clustered image. Comparing the results of the proposed method on different images with other methods shows that in most cases, the accuracy of the NLICA algorithm is better than that of the other methods.
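To make the spatial-information point concrete, here is a minimal stand-in that appends scaled (row, col) coordinates to each pixel's color before K-means, so nearby similar pixels tend to cluster together. The imperialist-competitive hybrid itself is not reproduced, and the weight alpha is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

img = np.random.rand(64, 64, 3)              # stand-in RGB image in [0, 1]
h, w, _ = img.shape
rows, cols = np.mgrid[0:h, 0:w]

alpha = 0.3                                  # hypothetical spatial weight
feats = np.column_stack([img.reshape(-1, 3),
                         alpha * rows.ravel() / h,
                         alpha * cols.ravel() / w])

segments = (KMeans(n_clusters=4, n_init=10, random_state=0)
            .fit_predict(feats).reshape(h, w))
```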
H.5. Image Processing and Computer Vision
A. R. Yamghani; F. Zargari
Abstract
Video abstraction allows searching, browsing, and evaluating videos by accessing only the useful contents. Most studies work in the pixel domain, which requires decoding and consumes more time and processing than compressed-domain video abstraction. In this paper, we present a new video abstraction method in the H.264/AVC compressed domain, AVAIF. The method is based on the normalized histogram of I-frame prediction modes extracted from the H.264 standard. Frame similarity is calculated by intersecting the I-frame prediction mode histograms. Moreover, fuzzy c-means clustering is employed to categorize similar frames and extract key frames. The results show that the proposed method achieves on average 85% accuracy and a 22% error rate in compressed-domain video abstraction, which is higher than the other tested methods in the pixel domain. Moreover, on average, it generates video key frames that are closer to human summaries, and it shows robustness to coding parameters.
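The histogram-intersection similarity described here is easy to state in code; extraction of prediction modes from the H.264 bitstream is assumed to have been done already, and the mode counts below are random stand-ins.

```python
import numpy as np

def mode_histogram(modes, n_modes=9):
    """Normalized histogram over intra prediction mode indices."""
    h = np.bincount(modes, minlength=n_modes).astype(float)
    return h / h.sum()

def intersection(hist_a, hist_b):
    """Histogram intersection in [0, 1]; 1 means identical distributions."""
    return float(np.minimum(hist_a, hist_b).sum())

a = mode_histogram(np.random.randint(0, 9, 396))   # stand-in modes, frame A
b = mode_histogram(np.random.randint(0, 9, 396))   # stand-in modes, frame B
print(intersection(a, b))   # pairwise similarities would feed fuzzy c-means
```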
H.6.2.2. Fuzzy set
Sh. Asadi; Seyed M. b. Jafari; Z. Shokrollahi
Abstract
Each semester, students go through the process of selecting appropriate courses. It is difficult to find information about each course and ultimately make decisions. The objective of this paper is to design a course recommender model that takes student characteristics into account to recommend appropriate courses. The model uses clustering to identify students with similar interests and skills. Once similar students are found, dependencies between their course selections are examined using fuzzy association rule mining. The application of clustering and fuzzy association rules results in appropriate recommendations and a predicted score. In this study, a collection of data on undergraduate students at the Faculty of Management and Accounting of the College of Farabi, University of Tehran, covering the years 2004 to 2015, is used. The students are divided into two clusters according to educational background and demographics. Finally, recommended courses and predicted scores are given to students. The mined rules facilitate decision-making regarding course selection.
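A toy illustration of the fuzzy-association idea (not the authors' implementation): grades are fuzzified into a "high" linguistic term, and a rule is scored by fuzzy support and confidence. Course names, the 0-20 scale breakpoints, and the five-student sample are all hypothetical.

```python
import numpy as np

def high(x, lo=12.0, hi=17.0):
    """Right-shoulder membership for a 'high grade' term on a 0-20 scale."""
    return np.clip((np.asarray(x, dtype=float) - lo) / (hi - lo), 0.0, 1.0)

# Hypothetical grades of five similar (same-cluster) students in two courses.
stats = high([12, 17, 19, 9, 15])
econo = high([11, 18, 18, 8, 14])

support = np.minimum(stats, econo).mean()               # fuzzy support of A & B
confidence = np.minimum(stats, econo).sum() / stats.sum()
print(f"'high Statistics' -> 'high Econometrics': confidence {confidence:.2f}")
```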
G.3.9. Database Applications
M. Shamsollahi; A. Badiee; M. Ghazanfari
Abstract
Heart disease is one of the major causes of morbidity in the world. Currently, large proportions of healthcare data are not processed properly and thus fail to be used effectively for decision-making purposes. The risk of heart disease may be predicted by investigating heart disease risk factors coupled with data mining knowledge. This paper presents a model developed using combined descriptive and predictive data mining techniques that aims to help specialists in the healthcare system effectively predict patients with Coronary Artery Disease (CAD). To achieve this objective, several clustering and classification techniques are used. First, the number of clusters is determined using clustering indexes. Next, several types of decision tree methods and an Artificial Neural Network (ANN) are applied to each cluster in order to predict CAD patients. Finally, the results obtained show that the C&RT decision tree method performs best on all data used in this study, with an error of 0.074. All data used in this study are real and were collected from a heart clinic database.
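A hedged sklearn analogy for the cluster-then-classify flow: choose the number of clusters with an internal index (silhouette stands in for the paper's clustering indexes), then fit a decision tree per cluster. The paper used C&RT and an ANN; the data here is a random stand-in.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(300, 10)              # stand-in risk-factor matrix
y = np.random.randint(0, 2, 300)         # stand-in CAD labels

def kmeans_labels(k):
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

best_k = max(range(2, 8), key=lambda k: silhouette_score(X, kmeans_labels(k)))
labels = kmeans_labels(best_k)

# One classifier per cluster, mirroring the per-cluster prediction step.
trees = {c: DecisionTreeClassifier(random_state=0)
            .fit(X[labels == c], y[labels == c])
         for c in range(best_k)}
```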
H.3. Artificial Intelligence
Z. Sedighi; R. Boostani
Abstract
Although many studies have been conducted to improve clustering efficiency, most state-of-the-art schemes suffer from a lack of robustness and stability. This paper proposes an efficient approach to elicit prior knowledge, in terms of must-link and cannot-link constraints, from the estimated distribution of the raw data in order to convert a blind clustering problem into a semi-supervised one. To estimate the density distribution of the data, a Weibull Mixture Model (WMM) is utilized due to its high flexibility. Another contribution of this study is a new hill and valley seeking algorithm for finding the constraints of the semi-supervised algorithm. It is assumed that each density peak stands on a cluster center; therefore, neighboring samples of each center are considered must-link samples, while near-centroid samples belonging to different clusters are considered cannot-link ones. The proposed approach is applied to a standard image dataset (designed for clustering evaluation) along with some UCI datasets. The results achieved on both databases demonstrate the superiority of the proposed method over conventional clustering methods.
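A rough sketch of the constraint-elicitation idea on 1-D data, with a Gaussian KDE standing in for the paper's Weibull mixture: density peaks are treated as cluster centres, peak neighbours are paired as must-link, and samples near different peaks as cannot-link. All thresholds are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

np.random.seed(0)
X = np.concatenate([np.random.normal(0, 1, 200),
                    np.random.normal(6, 1, 200)])   # two density bumps

kde = gaussian_kde(X)                    # KDE stands in for the WMM
grid = np.linspace(X.min(), X.max(), 500)
dens = kde(grid)

# Hill seeking: local maxima of the estimated density (expect ~2 here).
peaks = grid[1:-1][(dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])]

nearest = np.abs(X[:, None] - peaks[None, :]).argmin(axis=1)
core = np.abs(X - peaks[nearest]) < 0.5  # neighbours of each peak

# Must-link: pairs near the same peak; cannot-link: pairs near different peaks.
groups = [np.where(core & (nearest == p))[0] for p in range(len(peaks))]
must_link = [(g[0], j) for g in groups if len(g) > 1 for j in g[1:3]]
cannot_link = [(i, j) for i in groups[0][:2] for j in groups[1][:2]]
```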
B. Computer Systems Organization
F. Hoseini; A. Shahbahrami; A. Yaghoobi Notash
Abstract
One of the most important and typical applications of wireless sensor networks (WSNs) is target tracking. Although organizing large-scale WSNs into clusters benefits target tracking, tracking a moving target in cluster-based WSNs suffers from a boundary problem. The main goal of this paper is to introduce an efficient and novel mobility management protocol, namely Target Tracking Based on Virtual Grid (TTBVG), which integrates on-demand dynamic clustering into a cluster-based WSN for target tracking. This protocol converts on-demand dynamic clusters to scalable cluster-based WSNs by using boundary nodes, and facilitates sensors' collaboration around clusters. In this manner, each sensor node has the probability of becoming a cluster head and perceives the tradeoff between energy consumption and local sensor collaboration in cluster-based sensor networks. The simulation results of this study demonstrate the efficiency of the proposed protocol in both one-hop and multi-hop cluster-based sensor networks.
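A very loose, illustrative sketch of the virtual-grid idea (not the TTBVG protocol itself): sensors map to grid cells, and when the target nears a cell border, nodes from the adjacent cells are also recruited so tracking does not break at the boundary. Field size, cell size, and margin are made up.

```python
import numpy as np

CELL = 10.0                                    # virtual grid cell size
sensors = np.random.rand(100, 2) * 50          # sensor positions, 50x50 field

def cell_of(p):
    """Map a position to its virtual grid cell index."""
    return tuple((np.asarray(p) // CELL).astype(int))

def tracking_set(target, margin=2.0):
    """Sensors in the target's cell, plus adjacent cells when the target
    is within `margin` of a boundary (the boundary-problem mitigation)."""
    probes = [np.zeros(2)]
    if np.any(target % CELL < margin) or np.any(target % CELL > CELL - margin):
        probes += [margin * np.array(d) for d in [(1, 0), (-1, 0), (0, 1), (0, -1)]]
    cells = {cell_of(target + p) for p in probes}
    return [i for i, s in enumerate(sensors) if cell_of(s) in cells]

print(len(tracking_set(np.array([19.0, 25.0]))))   # target near a border in x
```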
H.6.4. Clustering
M. Manteqipour; A.R. Ghaffari Hadigheh; R. Mahmoodvand; A. Safari
Abstract
Grouping datasets plays an important role in many scientific studies. Depending on data features and applications, different constraints are imposed on the groups, while having groups with similar members is always a main criterion. In this paper, we propose an algorithm for grouping objects with random labels and nominal features that take many distinct values, subject to a size constraint on the groups. These conditions lead to a mixed integer optimization problem that is neither convex nor linear. It is an NP-hard problem, and exact solution methods are computationally costly. Our motivation for solving such a problem comes from grouping insurance data, which is essential for fair pricing. The proposed algorithm includes two phases. First, we rank random labels using fuzzy numbers. Afterwards, an adjusted K-means algorithm is used to produce homogeneous groups satisfying a cluster size constraint. Fuzzy numbers are used to compare random labels in terms of both their observed values and their chance of occurrence. Moreover, an index is defined to measure the similarity of multi-valued attributes without perfect information to those accompanied by perfect information. Since all ranks are scaled into the interval [0,1], the result of ranking random labels does not need rescaling techniques. In the adjusted K-means algorithm, the optimal number of clusters is found using the coefficient of variation instead of the Euclidean distance. Experiments demonstrate that our proposed algorithm produces fairly homogeneous and significantly different groups having the requisite mass.
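A simplified sketch of the size-constrained clustering phase: a greedy capped assignment bolted onto K-means-style updates (it assumes cap * k >= n so every point is placed). The fuzzy ranking of random labels and the coefficient-of-variation choice of k are not shown.

```python
import numpy as np

def capped_kmeans(X, k, cap, iters=20, seed=0):
    """Greedy capped assignment (assumes cap * k >= len(X))."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = np.full(len(X), -1)
        counts = np.zeros(k, dtype=int)
        for i in np.argsort(d.min(axis=1)):      # most confident points first
            for c in np.argsort(d[i]):
                if counts[c] < cap:              # respect the size constraint
                    labels[i] = c
                    counts[c] += 1
                    break
        centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    return labels, centers

labels, _ = capped_kmeans(np.random.rand(90, 4), k=3, cap=30)
```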
H.6.4. Clustering
M. Lashkari; M. Moattar
Abstract
A well-known clustering algorithm is K-means. Besides advantages such as high speed and ease of implementation, this algorithm suffers from the problem of local optima. In order to overcome this problem, many clustering studies have been conducted. This paper presents a hybrid of an Extended Cuckoo Optimization Algorithm (ECOA) and K-means, called ECOA-K. The COA algorithm has advantages such as a fast convergence rate, intelligent operators, and simultaneous local and global search, which are the motivations behind choosing this algorithm. In the Extended Cuckoo Algorithm, we have enhanced the operators of the classical version of the Cuckoo algorithm. The proposed operator for producing the initial population is based on a chaotic sequence, whereas in the classical version it is random. Moreover, in the revised algorithm, the number of eggs allocated to each cuckoo is based on its fitness. Another improvement is in the cuckoos' migration, which is performed with different deviation degrees. The proposed method is evaluated on several standard datasets from the UCI repository, and its performance is compared with those of Black Hole (BH), Big Bang Big Crunch (BBBC), the Cuckoo Search Algorithm (CSA), the traditional Cuckoo Optimization Algorithm (COA), and the K-means algorithm. The results are compared in terms of purity degree, coefficient of variation, convergence rate, and time complexity. The simulation results show that the proposed algorithm yields optimized solutions with a higher purity degree, a faster convergence rate, and more stability than the other compared algorithms.
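As a sketch of the chaotic-initialization improvement mentioned above, a logistic map can generate the initial population instead of uniform noise; the map parameters and population shape are illustrative, and the remaining cuckoo operators (fitness-based egg allocation, graded migration) are omitted.

```python
import numpy as np

def chaotic_init(n_points, dim, lo=0.0, hi=1.0, x0=0.7, r=4.0):
    """Initial population from the logistic map x <- r * x * (1 - x)."""
    seq = np.empty(n_points * dim)
    x = x0
    for t in range(seq.size):
        x = r * x * (1 - x)
        seq[t] = x
    return lo + (hi - lo) * seq.reshape(n_points, dim)

# Five candidate centroids in a 4-dimensional feature space.
centers0 = chaotic_init(n_points=5, dim=4)
```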
H.3.8. Natural Language Processing
A. Khazaei; M. Ghasemzadeh
Abstract
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research on English document clustering has been carried out. Now the question arises: are these results extendable to other languages? Since the goal of document clustering is to group documents based on their content, it is expected that the answer is yes. On the other hand, the many differences between languages could make the answer no. This research focuses on k-means, one of the basic and popular document clustering methods. We want to know whether the clusters of aligned Persian and English texts obtained by k-means are similar. To answer this question, the Mizan English-Persian Parallel Corpus was used as a benchmark. After feature extraction using text mining techniques and application of the PCA dimension reduction method, k-means clustering was performed. The morphological differences between the English and Persian languages led to a larger feature vector length for Persian. So, in almost all experiments, the English results were slightly richer than those in Persian. Aside from these differences, the overall behavior of the Persian and English clusters was similar. This similar behavior shows that the results of k-means research on English can be extended to Persian. Finally, there is hope that despite the many differences between languages, clustering methods may be extendable to other languages.
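A condensed sketch of the experimental pipeline as described: vectorize each side of the parallel corpus, reduce with PCA, run k-means, and compare the two clusterings with the adjusted Rand index. The file names are placeholders for the Mizan corpus, and the feature and cluster counts are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

def cluster_side(docs, k=5, dims=100):
    X = TfidfVectorizer(max_features=5000).fit_transform(docs).toarray()
    X = PCA(n_components=min(dims, *X.shape)).fit_transform(X)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Placeholder file names: one aligned document per line on each side.
en_docs = open("mizan_en.txt", encoding="utf-8").read().splitlines()
fa_docs = open("mizan_fa.txt", encoding="utf-8").read().splitlines()

print(adjusted_rand_score(cluster_side(en_docs), cluster_side(fa_docs)))
```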
Timing analysis
Z. Izakian; M. Mesgari
Abstract
With the rapid development of information gathering technologies and access to large amounts of data, we increasingly require methods for analyzing data and extracting useful information from large raw datasets, and data mining is an important method for solving this problem. Clustering analysis, as the most commonly used function of data mining, has attracted many researchers in computer science. Because of its various applications, the problem of clustering time series data has become highly popular, and many algorithms have been proposed in this field. Recently, Swarm Intelligence (SI), a family of nature-inspired algorithms, has gained huge popularity in the field of pattern recognition and clustering. In this paper, a technique for clustering time series data using a particle swarm optimization (PSO) approach is proposed, with the Pearson Correlation Coefficient, one of the most commonly used distance measures for time series, as the distance measure. The proposed technique is able to find (near-)optimal cluster centers during the clustering process. To reduce the dimensionality of the search space and improve the performance of the proposed method, a singular value decomposition (SVD) representation of cluster centers is used. Experimental results on three popular datasets indicate the superiority of the proposed technique compared with the fuzzy C-means and fuzzy K-medoids clustering techniques.
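A compact, assumption-heavy sketch of PSO-driven clustering with a Pearson-correlation distance, matching the description above; the SVD compression of particle positions is left out for brevity, and all PSO hyperparameters are generic defaults rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))            # 60 stand-in time series, length 30
k, n_particles, iters = 3, 15, 50

def corr_dist(a, b):
    """1 - Pearson correlation between two series."""
    return 1.0 - np.corrcoef(a, b)[0, 1]

def cost(centers):
    """Sum of each series' distance to its closest candidate centre."""
    d = np.array([[corr_dist(x, c) for c in centers] for x in X])
    return d.min(axis=1).sum()

pos = rng.normal(size=(n_particles, k, X.shape[1]))   # particle = k centres
vel = np.zeros_like(pos)
pbest, pbest_cost = pos.copy(), np.array([cost(p) for p in pos])
gbest = pbest[pbest_cost.argmin()]

for _ in range(iters):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = pos + vel
    costs = np.array([cost(p) for p in pos])
    better = costs < pbest_cost
    pbest[better], pbest_cost[better] = pos[better], costs[better]
    gbest = pbest[pbest_cost.argmin()]
```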