Bridging the semantic gap for software effort estimation by hierarchical feature selection techniques

Software project management is one of the significant activities in the software development process. Software development effort estimation (SDEE) is a challenging task in software project management. SDEE has been practiced in the computer industry since the 1940s, and it has been reviewed several times. An SDEE model is appropriate if it provides accuracy and confidence simultaneously before a software project contract. Due to the uncertain nature of development estimates, and in order to increase accuracy, researchers have recently focused on machine learning (ML) techniques. Choosing the most effective features is crucial for achieving higher accuracy in machine learning. In this work, to narrow the semantic gap in SDEE, hierarchical filter and wrapper feature selection (FS) techniques and a fused measurement criterion are developed in a two-phase approach. In the first phase, two-stage filter FS methods provide start sets for the wrapper FS techniques. In the second phase, a fused criterion is proposed to evaluate the accuracy in the wrapper FS techniques. The experimental results show the validity and efficiency of the proposed approach for SDEE over a variety of standard datasets.


Introduction
Software project management is the most important activity in any software engineering methodology. Software cost estimation (SCE) for development and maintenance processes in software engineering is a challenging activity, on which much research has focused. Similarly, SDEE is the process of predicting the effort required for software system development.
SDEE includes software development and maintenance efforts. In software engineering research, cost estimation and effort estimation are used interchangeably [1][2][3][4]. Accurate estimation of the development cost plays an important role in the success or failure of a software project. Algorithmic methods, expert judgment, and ML techniques are the general approaches in this area. Algorithmic methods are based only on old data; therefore, they do not account for advances in software engineering. With the rapid advancement in this area, new effective features are being recognized, and constant, fixed methods are not sufficient for investigating the effects of the various features. Since algorithmic methods use a constant, proven formula to calculate the software cost, the effects of these new features on the system performance cannot be evaluated. Expert judgment methods are applied by experts in a particular organization; hence, they do not provide the same accuracy in other organizations. Due to the uncertainty in estimating software cost, flexible ML techniques that can handle uncertainty play an important role in improving the accuracy of SDEE. The main advantages of the ML methods are their ability to model the complex set of relationships between effort and its influencing factors through intelligent computational methods, and their ability to learn from old project data [5]. SDEE is an old process that began simultaneously with the computer industry in the 1940s [5]. In 1980, many developments were introduced in its models and techniques [6]. In the same year, Boehm et al. modified the COCOMO model that they had previously developed. The result was a new model called COCOMO II. From the 1990s onwards, extensive research was carried out to improve the software industry and information technology [6]. The categorization of cost estimation methods is shown in figure 1.
Algorithmic models estimate effort as a function of the project features:

$$\mathrm{Effort} = f(x_1, x_2, \ldots, x_n) \quad (1)$$

In this formula, x1, x2, ..., xn express the features of each project. Effort estimates based on different algorithmic models usually differ from one another. In these methods, the estimation model is formulated based on a specific algorithm. A variety of algorithmic models are shown in Fig. 1. Since these models are based on old data and cannot consider current developments in programming languages, hardware, and software engineering, decision-making based on their results is difficult [1]. In expert judgment, there is usually no need for old data. Expert judgment is often based on the reuse of certain previous projects, which may not be documented perfectly. Reported results show that 62% of the software projects in organizations are estimated by this method. The advantage of this method is its customization for any specific organizational culture, which makes it more accurate than the algorithmic methods. In many cases, it has also been shown to be more accurate than other preferred models. However, this estimation is subjective and depends on the reasoning of each expert. Hence, its advantages can also be considered its drawbacks. The cost estimated by each expert is based only on his or her experience in a specific organizational culture, and does not carry over well to other organizations [1]. In the ML techniques, patterns in old project data are learned and then used for effort prediction in new projects [1]. In SDEE, much research has been performed using the ML approach [12]. In ML, a supervised method learns a model from labeled training data. In ML classification methods, labels are discrete, whereas in regression methods, labels are continuous. Because costs or efforts are calculated as numeric values in software projects, regression methods are studied as the ML models in SDEE research. The ML algorithms in SDEE are divided into six categories [3].
Searching for a useful and effective subset of features is a known approach in the ML area to increase the accuracy of the learning model. All ML hypotheses are potentially susceptible to wrong, irrelevant, and redundant features [18]. SCE models use a large set of features for estimation, called cost drivers, and not all of these features are effective for accurate estimation. Thus, in SCE, feature selection (FS) algorithms are used, which are able to select the subset of the most informative cost drivers and can thereby achieve high accuracy for the ML algorithm [19]. In ML studies, the complexity and usability of a classifier or regressor depend on the number of input features. In this area, two main methods, FS and feature extraction, are used for feature reduction. In FS, the aim is to find the k of the d original features that give the most effective information. In feature extraction, k features are extracted from the d initial features in a linear or nonlinear manner [35]. Many researchers have focused on improving efficiency in SDEE by reducing the features of the sample projects [34]. Different ML algorithms have been compared in [12] for estimating software cost on different datasets; the effect of backward selection on each ML algorithm was studied in that work. Datasets in SDEE are divided into within-company and cross-company categories [20][21][22]. In [34], the effect of FS on within-company and cross-company SDEE datasets has been investigated, and it has been concluded that cost estimation with fewer features provides equivalent or better accuracy than estimation with all features. In SDEE, both filter [34] and wrapper [34] FS techniques have been used. Also, in [35], a combination of filter and wrapper techniques has been developed. Studies on SCE have indicated that using ML algorithms with dimension reduction methods can improve the accuracy. Some researchers have used isolation and connection analysis for dimension reduction in SCE [27]. Extensive research has been conducted on finding the best subset of cost drivers using the wrapper method and hill climbing [19]. In [28], using linear regression and wrapper FS, cost drivers have been ranked based on the number of times they are repeated in different groups, and the features with lower ranks have then been removed. In [29], linear regression and wrapper FSS have been implemented. The results show that a combination of pruning rows (samples) and pruning columns (features) can significantly improve effort estimation, particularly for small datasets.
In [30], optimum accuracy has been achieved in this area using filter FS with feature weighting and comparative methods based on the Euclidean distance. In addition, some researchers have developed genetic algorithms to find suitable feature weights [31,32].
In [19], researchers have examined the balance between the features of old datasets in order to reduce the cost drivers while maintaining accuracy. They have used nine known FSS methods to select the most effective features. In [33], a combined method has been provided based on mutual information and feature clustering, combining supervised and unsupervised learning. In the unsupervised learning stage, the features are clustered by hierarchical clustering based on the similarity between them. Then, in the supervised learning stage, the feature that is most similar to the effort feature is selected as the representative of each cluster.
In this work, a hierarchical FS approach was developed. Sets of features were arranged in descending order according to different correlation criteria in the filter methods. The start set for the wrapper-based methods can be initialized from different combinations of these multiple ordered feature sets. Because the initial feature sets are important for the convergence and accuracy of wrapper methods, a hierarchical approach was developed to obtain the advantages of both the filter and wrapper methods in SDEE. The evaluation criterion is also an important factor that influences the effectiveness of the wrapper methods. The literature on SDEE shows that the mean magnitude of relative error (MMRE) and prediction accuracy (PRED) are widely used as the evaluation criteria for wrapper FS methods. In the second phase of the proposed method, a fused MMRE and PRED evaluation criterion, the evaluation function (EF), is used to improve the overall accuracy. The innovation of this paper is twofold: (1) developing a hierarchical structure of filter and wrapper methods for effective FS in SDEE, and (2) developing a fused criterion in the evaluation phase of the wrapper methods that narrows the semantic gap in SDEE and, at the same time, selects the most effective features in SCE by considering two main error rate criteria. The remainder of this paper is structured as follows. Section 2 introduces the FS techniques. Section 3 describes the general framework of the proposed method. Section 4 describes the empirical setup and the implementation on a variety of datasets. Finally, Section 5 presents concluding remarks and further work.

FS techniques
The "curse of dimensionality" was originally discussed by Bellman in 1961. Small sample sets and high dimensionality are two major challenges in many applications. In general, a large number of features increases the complexity of data analysis and reduces the performance of learning methods such as classification, regression, and clustering. Therefore, dimensionality reduction becomes an important issue for improving efficiency. The most popular approaches to feature reduction fall into two categories, FS and feature extraction. In FS, a sample s with d features is generated from a sample x with D features, where d<D. Traditional FS methods attempt to find a globally optimal sub-space. In feature extraction, by contrast, the features of s are transformed into a different feature space, and thus there might be no correspondence between the two feature sets. The mathematical expressions and ideas underlying the feature extraction algorithms have been described in [34]. Different FS methods have been proposed so far. These methods are divided into three categories: filter, wrapper, and embedded methods. They can also be divided into two categories: learning-dependent (wrapper, embedded) and learning-independent (filter) algorithms [35]. In the filter methods, the features are selected based on specific correlation criteria such as mutual information (MI) and correlation coefficients. In the wrapper methods, a learning algorithm is used to determine the relevance of a subset of the features through a prediction model. In the embedded methods, the FS process and the training of the learning algorithm are integrated. These methods are appropriate when the number of features is small. One of the most common approaches in this category is learning by decision tree [35]. Since the filter and wrapper methods are used in the proposed method, they are introduced in the following sections.

Filter methods
In order to check the relationship between two features, a suitable similarity or correlation measure is first required. This criterion may be considered a function of the interaction between the variables, rather than a function of their values. The correlation function may be linear or non-linear, and it should capture the amount of information shared between the two variables. To develop this idea, however, the information must be quantified. The branch of mathematics called information theory deals with such correlation measurement [35]. A flowchart of the filter methods is illustrated in figure 2. Each filter FS method consists of three main steps: (1) production of features, (2) measurement, and (3) testing by the learning algorithm. A subset of features is produced in the production step. Then, in the measurement step, the information in the current feature subset is measured. These two steps are performed iteratively as long as the results are not consistent with the assessment criteria. The evaluation process is then terminated based on a threshold on the measurement results, so that the final feature set contains the maximum information. The test step is performed by a supervised learning algorithm.
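The ranking idea behind a filter method can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the absolute Pearson correlation with the effort as the relevance measure (a stand-in for criteria such as MI), and the names `pearson` and `filter_rank` are hypothetical, not taken from the paper or from any toolbox.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column carries no linear information
    return cov / sqrt(var_a * var_b)

def filter_rank(X, y, top):
    """Return the indices of the `top` features most correlated with y.

    X is a list of samples (each a list of D feature values) and y is the
    list of efforts. No learning algorithm is involved, which is what
    makes this a filter (learning-independent) method.
    """
    n_features = len(X[0])
    scored = []
    for j in range(n_features):
        column = [row[j] for row in X]
        scored.append((abs(pearson(column, y)), j))
    # Sort by descending score; ties are broken by feature index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scored[:top]]

if __name__ == "__main__":
    X = [[1, 5, 2], [2, 3, 4], [3, 6, 6], [4, 2, 8]]
    y = [10, 20, 30, 40]
    print(filter_rank(X, y, top=2))
```

Any other relevance score could be substituted for `pearson` without changing the ranking logic.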

Wrapper methods
A workflow of the wrapper method is shown in figure 3. Its process is the same as that of the filter methods, except that the measurement step is replaced by a learning algorithm. This is the main reason why the wrapper methods are slow. On the other hand, the learning algorithm of the wrapper method can lead to better results in most cases. The process stops when the results worsen or the number of features reaches a pre-determined threshold. Since a limited search scope is most effective for the applicability of the wrapper FSS methods, a hierarchical structure of the filter and wrapper methods is used. In this approach, various combinations of filter methods are tested, and the most effective one is combined with the wrapper methods. Also, because the evaluation criteria in the wrapper methods directly affect the selection of effective features, hybrid criteria were used in this work.
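A minimal sketch of a greedy forward wrapper follows, assuming a caller-supplied `evaluate` function that trains the learning algorithm on a feature subset and returns a score where higher is better. Both `forward_select` and `evaluate` are hypothetical names used for illustration; the stopping rule mirrors the two conditions above (results no longer improve, or a feature-count threshold is reached).

```python
def forward_select(candidates, evaluate, start=(), max_features=None):
    """Greedy forward selection driven by a learning-based score.

    candidates   -- feature indices that may be added
    evaluate     -- callable(list_of_features) -> score, higher is better
    start        -- optional initial subset (for example, a filter output)
    max_features -- optional cap on the size of the selected subset
    """
    selected = list(start)
    best = evaluate(selected) if selected else float("-inf")
    remaining = [f for f in candidates if f not in selected]
    while remaining and (max_features is None or len(selected) < max_features):
        # One model evaluation per remaining feature: this is why wrappers are slow.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda t: t[0])
        if score <= best:
            break  # adding any feature does not improve the result
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best

if __name__ == "__main__":
    # Toy evaluator: each feature contributes a fixed gain to the score.
    gains = {0: 5, 1: 3, 2: -1}
    print(forward_select([0, 1, 2], lambda s: sum(gains[f] for f in s)))
```

Passing a filter-derived `start` set shrinks the number of evaluations, which is the motivation for the hierarchical structure.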

Proposed method
In this section, an effective FS approach is presented that combines the wrapper and filter methods. Filter methods are faster than wrapper methods; however, wrapper methods are more accurate than filter methods [37]. Thus, by combining them, the advantages of each method can be used to offset the disadvantages of the other. In the proposed method, the features are first ranked by each of the P filter FS methods, and the top tp features in each ranking are taken as the features selected by that method. Using the two operators AND and XOR, two final sets of proposed features are produced from the outputs of the filter methods. The AND set is then taken as the base set, and its initial accuracy is evaluated with a regression algorithm based on the fused criterion. Next, the most effective features of the AND and XOR sets are selected using two wrapper FS methods. The AND set is the input to the backward FS method (Algorithm 2), and the XOR set is the input to the forward FS method (Algorithm 3). These two methods are repeated to increase the accuracy, and finally, the most effective features for each dataset are selected. A chart of the proposed method is shown in figure 4. Various filter methods are used in this article, but for the wrapper methods, only the simple greedy forward and backward FS methods are used. The pseudo-code of the proposed method for combining these methods is given in Algorithm 1. In this method, the fused function within the wrapper methods generates a combination of criteria for assessing the effectiveness of the selected features. By selecting the effective features, this approach yields a greater reduction in the semantic gap.
For this purpose, all the evaluation criteria are passed to the fused function. The result is a combined criterion m that inherits from all the criteria. Two sets, A and B, are constructed from the outputs of the filter methods: the features common to the filter methods are assigned to A, and the non-common features are assigned to B. The rest of the proposed method consists of the two wrapper methods (Algorithms 2 and 3), applied in an iterative manner. This iteration continues until the accuracy converges to an optimal value. The output of the proposed method is the selected feature set. Each wrapper function outputs s, the optimum subset of features, and mf, the accuracy of the regression from the s features.
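The construction of the sets A and B can be illustrated as follows. This is a sketch under an assumption: for two filter outputs, XOR is the symmetric difference, and for more than two outputs the "non-common" set is taken here to be the union minus the intersection. The function name `combine_filter_outputs` is hypothetical.

```python
def combine_filter_outputs(ranked_sets):
    """Build the AND set (A) and the non-common set (B) from filter outputs.

    ranked_sets -- one iterable of selected feature indices per filter method
    Returns (A, B) as Python sets.
    """
    sets = [set(s) for s in ranked_sets]
    common = set.intersection(*sets)           # A: chosen by every filter
    non_common = set.union(*sets) - common     # B: chosen by some, not all
    return common, non_common

if __name__ == "__main__":
    top_by_filter_1 = [0, 1, 2]
    top_by_filter_2 = [1, 2, 3]
    A, B = combine_filter_outputs([top_by_filter_1, top_by_filter_2])
    print(sorted(A), sorted(B))
```

With this split, A serves as the start set for the backward pass, and B supplies the candidates for the forward pass.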

Empirical setup
In this section, the implementation and the analysis of the experimental results on different datasets are presented. First, the criteria and datasets used are introduced. Then the results are presented, and finally, they are compared with and verified against the results of other research. The proposed methods are implemented using the FEAST tools, taken from [38].

Performance metrics
In this paper, in order to evaluate the accuracy of the proposed idea in SDEE, the proposed method is applied to various datasets in this field, and the evaluation criteria of the field are used to analyze the results. Various evaluation criteria are used in this field. The most commonly used are the magnitude of relative error (MRE), which represents the relative difference between the estimated cost and the actual cost; MMRE, which represents the average estimation error over all samples (training samples and test samples); and PRED(X), which represents the percentage of samples whose magnitude of relative error is less than or equal to X. In some studies, the median estimation error, MDMRE, has also been used. The formulas for the criteria defined above are given below.


$$\mathrm{MRE} = \frac{\left|\text{Actual Effort} - \text{Estimated Effort}\right|}{\text{Actual Effort}} \quad (2)$$

$$\mathrm{MMRE} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{MRE}_i \quad (3)$$

$$\mathrm{MDMRE} = \mathrm{Median}(\mathrm{MRE}) \quad (4)$$

$$\mathrm{PRED}(X) = \frac{K}{N} \quad (5)$$

Actual Effort is the real effort of the project, and Estimated Effort is the effort estimated by the algorithm. X is the error threshold, which is set to 0.25 in most research works. K is the number of samples for which the relative difference between the estimated cost and the actual cost is less than or equal to X. N is the total number of tested samples.
Therefore, a higher value of PRED(0.25) indicates a lower error rate for the evaluated algorithm, since it means that the relative error of the estimated cost is at most 0.25 for a larger share of the tested samples. Thus, by picking the features that result in a lower MMRE and a higher PRED, the semantic gap can be reduced in the estimation procedure.
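The criteria described in this section can be computed directly from paired actual and estimated efforts. The sketch below uses the standard definitions and expresses every value as a fraction (no multiplication by 100, which is an assumption of this example); the function names are hypothetical.

```python
from statistics import mean, median

def mre(actual, estimated):
    """Magnitude of relative error for one project."""
    return abs(actual - estimated) / actual

def mmre(actuals, estimates):
    """Mean MRE over all tested samples."""
    return mean([mre(a, e) for a, e in zip(actuals, estimates)])

def mdmre(actuals, estimates):
    """Median MRE over all tested samples."""
    return median([mre(a, e) for a, e in zip(actuals, estimates)])

def pred(actuals, estimates, x=0.25):
    """Fraction of samples whose MRE is less than or equal to x (K / N)."""
    k = sum(1 for a, e in zip(actuals, estimates) if mre(a, e) <= x)
    return k / len(actuals)

if __name__ == "__main__":
    actual_effort = [100, 200, 400]
    estimated_effort = [110, 150, 400]
    print(mmre(actual_effort, estimated_effort))
    print(mdmre(actual_effort, estimated_effort))
    print(pred(actual_effort, estimated_effort, x=0.25))
```

A feature subset that lowers `mmre` while raising `pred` is the kind of subset the wrapper phase is looking for.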

Datasets
In this study, three popular datasets in SDEE (cocomo81, coconasa93, and Desharnais) were used. They are briefly introduced below, together with the way they are used.

Desharnais
The original version of this dataset contains 81 projects from a Canadian software house, described by 12 features. The second and third features have missing values in four samples. For this reason, this dataset has been used differently in various articles. In some papers, the 4 samples have been set aside and the other 77 samples have been used [19]. Other researchers have removed the columns with missing values [39]. In this work, both versions of the dataset were used. The features of this dataset are described in table 3.

Experimental result
The results of the various tests on the datasets introduced in the previous section are presented here. The best results for each dataset are marked in bold. Some studies have used a combination of the MMRE and PRED criteria, denoted by EF, to rank the algorithms used in this field [40]. In this work, this criterion is produced by the fusion function.
In this study, EF, which is a combination of criteria, is used for FS, and the selected features are those that provide a higher EF. Research on FS in SDEE usually uses the MMRE criterion and, less often, the PRED criterion for FS. As mentioned, we look for the features that provide a lower MMRE and a higher PRED. Thus, when the EF criterion is used, the selected features satisfy both conditions. In this work, a multi-layer perceptron (MLP) neural network was used as the learning algorithm for the wrapper FS. Artificial neural networks (ANNs) contain many highly inter-connected processing elements called neurons. They usually operate in parallel and are configured in a regular architecture. Each neuron is connected to other neurons via communication links. Each link has a weight that represents information about the input signal. Each neuron calculates a weighted sum of its inputs and, if the sum exceeds a threshold, produces an output. This process continues until one or more outputs are produced. The estimation models can be trained on old training data, fine-tuning the parameter values of the algorithm to reduce the difference between the actual and estimated efforts [34]. The MLP neural network in this study consisted of an input layer, a hidden layer, and an output layer. The parameters of the proposed algorithm were set to the values presented in table 4.
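The text does not state the formula of the fused criterion, so the sketch below is only one plausible form, assumed for illustration: EF = PRED / (1 + MMRE). It rises when PRED rises and falls when MMRE rises, which matches the two conditions stated above. The name `fused_ef` is hypothetical.

```python
def fused_ef(pred_value, mmre_value):
    """Assumed fused criterion: higher PRED and lower MMRE give a higher EF.

    pred_value -- PRED(X) as a fraction in [0, 1]
    mmre_value -- MMRE as a non-negative fraction
    The form PRED / (1 + MMRE) is an assumption of this sketch, not a
    formula confirmed by the text.
    """
    if mmre_value < 0:
        raise ValueError("MMRE must be non-negative")
    return pred_value / (1.0 + mmre_value)

if __name__ == "__main__":
    # Two candidate feature subsets, scored by (PRED, MMRE).
    subset_a = fused_ef(0.80, 0.25)
    subset_b = fused_ef(0.70, 0.40)
    print("prefer A" if subset_a > subset_b else "prefer B")
```

A wrapper that maximizes a single score of this kind selects on both criteria at once, instead of optimizing MMRE alone.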
Based on the results of applying the various combinations to the Desharnais dataset with 77 cases and 12 features, it is clear that the method is effective: of the different combinations, only 2 reduced the accuracy. Among the combinations tested, 9 achieved the highest accuracy. The 10-fold cross-validation method was used to evaluate the dataset. The data was divided into 10 equal parts; one part was used as the test data and the other 9 folds as the training data. As in similar articles, this process was carried out ten times and the average results were reported. For the Desharnais dataset consisting of 81 samples and 9 features, all combinations increased the accuracy; that is, the accuracy of every combination was greater than that of the conventional MLP. Among the various combinations, the highest accuracy was achieved by combining the Mifs and Relief filter methods. From the experimental results on the Cocomo81 dataset, it can be concluded that combining different FS methods is effective on this dataset and yields a higher accuracy.
Based on the comparison of the different combinations in the filtering step, all combinations except one yielded a higher accuracy. The combination of the Betagamma and Relief methods provided the highest accuracy. For the coconasa93 dataset, the leave-one-out cross-validation (LOOCV) method was used to evaluate the technique. In this scheme, the dataset was divided 93 times, each time into 92 training samples and one testing sample. Finally, the average of the results over the 93 divisions was reported. The results obtained showed that the method was effective on the coconasa93 dataset. [...], "* 100" has been removed from the MDMRE formula. In other words, in their presented formula, the output has not been multiplied by 100. To make the results comparable, their results have been multiplied by 100. In table 7, a comparison is presented between the experimental results of this paper and those of other studies.
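The two validation schemes used in these experiments differ only in the number of folds, which the following sketch makes explicit. It is an illustration under assumptions: folds are assigned by a simple stride without shuffling, and the names `k_fold_indices` and `loocv_indices` are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k near-equal folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def loocv_indices(n_samples):
    """Leave-one-out cross-validation is k-fold with k equal to n_samples."""
    return k_fold_indices(n_samples, n_samples)

if __name__ == "__main__":
    # 93 projects: each split has 92 training samples and 1 test sample.
    splits = list(loocv_indices(93))
    print(len(splits), len(splits[0][0]), len(splits[0][1]))
    # 77 projects with 10 folds: fold sizes differ by at most one.
    print(sorted({len(test) for _, test in k_fold_indices(77, 10)}))
```

In either scheme, the reported accuracy is the average of the per-split results.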

Conclusion
In this work, a hierarchical FS approach for SCE was developed. In the proposed approach, both the accuracy and the time complexity were improved.
With wrapper methods, the learning algorithm must be run in each round to evaluate the effectiveness of each feature. Thus, the filter methods were used to limit the scope of the search to the most effective features, which reduces the number of searches in the wrapper methods and, consequently, lowers the computational complexity. The filter methods are faster, but their accuracy is not acceptable. The wrapper methods are slower, but because they use an ML algorithm, they are more accurate. Combining the filter and wrapper methods resulted in an optimal performance by eliminating the weaknesses of each approach and using the advantages of the other. This method was evaluated on the cocomo81, coconasa93, and Desharnais datasets. The results obtained indicated the effectiveness of the method. Across the different combinations, the "size" feature, which is common to all datasets, is identified as the most effective one. In this study, we used a combination of filter methods in the first phase; composing more filter methods may provide more accuracy, and in the future, we intend to work on other combinations of the filter and wrapper methods. In this work, we used a multi-layer neural network as the learning algorithm; in future work, we intend to apply this approach with other learning algorithms. In this work, the EF criterion was used for FS in SCE for the first time. This criterion consists of a combination of two important evaluation criteria used in other articles on this issue. The proposed method has a function for combining the different evaluation criteria used in this field. We intend to provide more powerful combinations of evaluation criteria through the fused function, using techniques such as genetic programming.

[7] Bailey, J. W. & Basili, V. R. (1981). A meta-model for software development resource expenditures. 5th International Conference on Software Engineering, IEEE Press, 1981.

ALGORITHM 1. Hierarchical FS algorithm
Input:
  X = {(x_i, y_i)}, i = 1..N, where x_i is a sample, y_i is its associated effort, and N is the number of samples. Also, any x_i is represented as [x1, x2, …, xD], where D is the number of sample features.
  M = {mi}, where mi is the ith measurement criterion in the application, and K is the number of measurement criteria.
  F = {fi}, where fi is the ith filter method, and P is the number of filter methods.
Process:
  for p = 1:P
    Sp = filter(fp, X, tp), where Sp is a sorted set of the top tp features selected by fp on X.
  end
  m = fusion(M), where fusion returns a fused measurement criterion.
  A = ∩ Sp
  B = ⊕ Sp
  s = A
  mf = regression(X, s, m), where mf is the accuracy result evaluated by m.
  repeat
    [s, A, mf] = BFS-Function(X, s, A, m, mf)    (backward FS)
    [s, B, mf] = FFS-Function(X, s, B, m, mf)    (forward FS)
  until (mf is not better than previous values)
Output: s, where s is the optimum subset of the original features.

ALGORITHM 2. BFS-Function (X, s, A, m, mf)
Input:
  X = {(x_i, y_i)}, i = 1..N, where x_i is a sample, y_i is its associated effort, and N is the number of samples. Also, any x_i is represented as [x1, x2, …, xD], where D is the number of sample features.
  s, where s is the initial subset for backward FS.
  A, where A is the additive subset for backward FS.
  m, where m is the measurement criterion.
  mf, where mf is the accuracy result of the previous step.
Process:
  n = 1
  Max = Size(A)
  while (n ≤ Max)
    while (f = selected next element of A)
      S = s − {f}
      [accuracy] = regression(X, S, m)
      if accuracy > Best result in this iteration
        Best = accuracy
        b = f

ALGORITHM 3. FFS-Function (X, s, B, m, mf)
Input:
  X = {(x_i, y_i)}, i = 1..N, where x_i is a sample, y_i is its associated effort, and N is the number of samples. Also, any x_i is represented as [x1, x2, …, xD], where D is the number of sample features.
  s, where s is the initial subset for forward FS.
  B, where B is the additive subset for forward FS.
  m, where m is the measurement criterion.
  mf, where mf is the accuracy result of the previous step.
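The outer loop of Algorithm 1 can be sketched as below. Because the listings of the backward and forward functions are incomplete here, both passes are simplified greedy passes written for this example and should not be read as the exact BFS-Function and FFS-Function. The `evaluate` callable stands in for regression with the fused criterion (higher is better) and must accept any feature list it is given; all names are hypothetical.

```python
def hierarchical_fs(common, non_common, evaluate):
    """Alternate simplified backward and forward passes until no improvement.

    common     -- AND set A from the filter stage (start subset)
    non_common -- XOR set B from the filter stage (candidates to add)
    evaluate   -- callable(list_of_features) -> score, higher is better
    """
    selected = list(common)
    best = evaluate(selected)
    improved = True
    while improved:
        improved = False
        # Backward pass: try dropping each feature that came from A.
        for f in list(selected):
            if f in common and len(selected) > 1:
                trial = [g for g in selected if g != f]
                score = evaluate(trial)
                if score > best:
                    selected, best, improved = trial, score, True
        # Forward pass: try adding each feature from B.
        for f in non_common:
            if f not in selected:
                trial = selected + [f]
                score = evaluate(trial)
                if score > best:
                    selected, best, improved = trial, score, True
    return selected, best

if __name__ == "__main__":
    # Toy evaluator: each feature contributes a fixed gain to the score.
    gains = {0: 5, 1: -2, 2: 3, 3: -1}
    score = lambda subset: sum(gains[f] for f in subset)
    print(hierarchical_fs([0, 1], [2, 3], score))
```

The loop terminates because every accepted change strictly increases the score and the number of possible subsets is finite.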

Table 4. Values of method parameters in this work.
Of 37 different combinations, 31 provided an accuracy higher than the simple MLP algorithm. The best accuracy resulted from combining the MRMR and CIFE filter methods, and the lowest accuracy from combining the CIFE and ICAP methods. The results of applying these methods to cocomo81 and coconasa93 are shown in table 5, and the results for Desharnais with both approaches are presented in table 6. Across the different combinations, Size is identified as the most effective feature, and it is common to all datasets. Also, the two features Cplx and Tool of COCOMO81, the two features VIRT and VEXP of COCONASA93, the Transactions and Entities features of Desharnais with 81 samples, and the length, entities, and envergure features of Desharnais with 77 samples were identified in all combinations as excess (less important) features.