IRDDS: Instance reduction based on Distance-based decision surface

In instance-based learning, a training set is given to a classifier for classifying new instances. In practice, not all information in the training set is useful to the classifier, so it is convenient to discard irrelevant instances from the training set. This process is known as instance reduction, and it is an important task because it can reduce the time required for training or classification. Instance-based learning methods are often confronted with the difficulty of choosing which instances must be stored for use at test time: storing too many instances may result in large memory requirements and slow execution. In this paper, a Distance-based Decision Surface (DDS) is first proposed and used as a separating surface between the classes. An instance reduction method based on the DDS is then proposed, namely IRDDS (Instance Reduction based on Distance-based Decision Surface), which uses the DDS together with a genetic algorithm to select a reference set for classification. IRDDS selects the most representative instances so as to satisfy two objectives: high classification accuracy and a high reduction rate. The performance of IRDDS is evaluated on real-world data sets from the UCI repository using 10-fold cross-validation. The experimental results are compared with those of some state-of-the-art methods and show the superiority of the proposed method in terms of both classification accuracy and reduction percentage.


Introduction
In pattern recognition, supervised classification is a procedure that assigns a label to an unclassified sample by means of a classifier trained on a set of previously classified samples. Classification is one of the most important goals of pattern recognition [1]. In the data classification literature, some methods classify data based on the distance between new data and the training samples. Instance reduction is a crucial task in some instance-based learning methods. These methods are often confronted with the difficulty of choosing which instances must be stored for use at test time. Storing too many instances may result in large memory occupation and reduced execution speed. Moreover, some training sets contain instances that carry little or no useful information because they are noisy or redundant. Therefore, a process is needed to discard unwanted information from the training set. In the literature, this discarding process is known as instance reduction. After some of the instances have been removed from the training set, the amount of memory needed for storage and the time required for testing are reduced.
One main challenge in designing an instance reduction algorithm is deciding whether to maintain border points, central points, or some other set of points. In this research, we have decided to maintain the border points because the internal points do not affect the decision boundaries as much as the border points do [2][3][4][5]. Hence, the internal points can be removed with a low impact on classification accuracy. A large number of border points may be needed to define a border completely, so some methods instead maintain central points, using the instances that are most typical of a particular class to classify the instances close to them. This can affect the decision boundaries, because the decision boundaries depend not only on where the instances of one class lie, but also on where those of the other classes lie.
In instance reduction, we often face a trade-off between the sample size and the classification accuracy [6]. A successful algorithm often reduces the size of the training set significantly without a significant reduction in classification accuracy. In some cases, classification accuracy can even increase as instances are removed, because noisy instances are discarded or decision boundaries are smoothed. A recent survey of different methods for data reduction can be found in [7].
This paper focuses on the problem of reducing the size of the stored set of instances while trying to maintain classification accuracy. This is accomplished first by surveying the methods that have been employed to reduce the number of instances needed by learning methods, and then by proposing an instance reduction technique based on the Distance-based Decision Surface (DDS) [8], namely IRDDS (Instance Reduction based on Distance-based Decision Surface). In [8], a weighted quadratic decision surface is derived. In this paper, we derive an unweighted decision surface of order one.
The remainder of the paper is organized as follows. In section 2, a survey of instance reduction methods is presented. In section 3, the proposed method is described: the Distance-based Decision Surface is introduced in subsection 3.1, the proposed instance reduction method (IRDDS) is introduced in subsection 3.2, and the statistical stability of the proposed method is proved in subsection 3.3. An evaluation of the proposed method is presented in section 4, where its performance is compared to that of some state-of-the-art methods. Finally, in section 5, conclusions and future research directions are presented.

Survey of instance reduction methods and distance-based classifiers
Several methods have been proposed for instance reduction, some of which are surveyed in this section. Most of the methods discussed here use T to denote the original instances in the training set and S (a subset of T) to denote their representatives.
The Condensed Nearest Neighbor (CNN) [9] and the Edited Nearest Neighbor (ENN) [10] rules are the first two methods of instance reduction. CNN begins by randomly selecting one instance of any class from T and putting it in S. Each instance in T is then classified by using only the instances in S. If an instance is misclassified, it is added to S to ensure that it will be classified correctly. This process is repeated until there are no misclassified instances in T.
Instance Based Learning Algorithms (IBn), introduced in [11], can be considered as editing methods. IB2 is an online learning method similar to CNN: it works by adding to an initially empty set S those instances that are not correctly classified by S. Within this setting, a newly available instance that is not added to the edited set does not need to be stored. Since noisy instances are very likely to be misclassified, they are almost always maintained in the edited set. In order to overcome this weakness, the IB3 method uses a wait-and-see evidence gathering method to determine which of the kept instances are expected to perform well during classification.
The Reduced Nearest Neighbor (RNN) rule was introduced by Gates [12]. The RNN algorithm starts with S = T and removes each instance from S whose removal does not cause any other instance in T to be misclassified. RNN is computationally more expensive than Hart's CNN rule, but it always produces a subset of the CNN result. Thus, RNN is less expensive in terms of computation and storage during the classification stage. Since the removed instance is not guaranteed to be classified correctly, this algorithm is able to remove noisy and internal instances while maintaining border points.
Another variant of CNN is the Generalized Condensed Nearest Neighbor (GCNN) [13]. GCNN is similar to CNN, but it assigns to S the instances that satisfy an absorption criterion. The absorption is calculated in terms of the nearest neighbors of an instance and its rivals (the nearest instances of the other classes). An instance is absorbed, or included in S, if its distances to its nearest neighbor and to its nearest rivals differ by no more than a threshold.
In the ENN algorithm, S starts out the same as T, and any instance in S that does not agree with the majority of its k nearest neighbors is then removed. This removes noisy instances as well as close border cases. It also maintains all internal points, which keeps it from reducing the storage requirements as much as most of the other reduction methods do. A variant of this method is the Repeated ENN (RENN), which applies the ENN algorithm repeatedly until all remaining instances have the majority of their neighbors in the same class. Another extension of ENN is the All k-NN method [14]. This algorithm works as follows: for i = 1 to k, any instance that is not classified correctly by its i nearest neighbors is flagged as bad. After the loop has been executed k times, all instances flagged as bad are removed. In order to reduce storage requirements and remove noisy instances, an instance t should be removed if all k of its neighbors are from the same class, even if that class differs from the class of t. This process removes noisy instances as well as internal ones, while maintaining border ones.
Unlike most previous methods, some methods, such as DROP1 to DROP5, pay more attention to the order in which instances are removed [2]. In these methods, each instance t has k nearest neighbors (ordered from the nearest to the farthest), and those instances that have t as one of their k nearest neighbors are called the associates of t (sorted from the nearest to the farthest).
The Iterative Case Filtering algorithm (ICF) was proposed in [15]. ICF is based on the Coverage and Reachable sets, which are the neighborhood set and the associate set, respectively. The neighborhood set of an instance t consists of all the instances between t and the nearest enemy of t, where the nearest enemy of t is the nearest instance from the other classes. Those instances in the training set T that have t as one of their k nearest neighbors form the associate set of t. In this method, an instance t is flagged for removal if |Reachable(t)| > |Coverage(t)|, which means that more cases can solve t than t can solve itself. All the instances flagged for removal are then deleted.
Another method that finds border instances, namely Prototype Selection by Clustering (PSC), is proposed in [3] and applies a clustering algorithm. PSC distinguishes two types of clustered regions: homogeneous and heterogeneous clusters. In a homogeneous cluster, all instances are from the same class, whereas in a heterogeneous cluster they are from different classes. Thus, PSC retains two types of instances: the mean of the instances in each homogeneous cluster, and border instances taken from the heterogeneous clusters.
Evolutionary algorithms have been used for instance reduction with promising results. The basic idea is to maintain a population of chromosomes, which represent solutions to the problem, and to evolve it over time through a process of competition. In the evaluation of chromosomes, both data reduction and classification accuracy are considered. Examples of the application of genetic algorithms and other evolutionary algorithms to instance reduction can be found in [16][17][18]. The CHC evolutionary algorithm [16] and the Steady-State Memetic Algorithm (SSMA) [17] are the best-known evolutionary algorithms for this task.
In terms of instance selection, the SVM (Support Vector Machine) is not only a classifier but also an instance selection method. SVBPS (Support Vector Based Prototype Selection) is a wrapper method based on SVM [19]. It performs a double selection: the first applies SVM to obtain the support vectors, and the second applies DROP2 to the support vectors. FRPS (Fuzzy Rough Prototype Selection) is a fuzzy-based method introduced in [20]. It uses fuzzy rough set theory to express the quality of the instances and uses a wrapper method to determine which instances to remove.
Nikolaidis et al. introduced a multi-stage method for instance reduction and abstraction in [5], namely the Class Boundary Preserving (CBP) algorithm. CBP is a hybrid method that selects and abstracts instances of the training set that are close to the class boundaries. In the first stage of CBP, the ENN algorithm is used to smooth the class boundaries. In the second stage, CBP tries to distinguish between border and non-border instances by using the geometric characteristics of the instance distribution. In the third stage, border instances are pruned by using the concept of mutual neighborhood, and in the last stage, the non-border instances are clustered.
Hamidzadeh et al. introduced a Large Symmetric Margin Instance Selection method, namely LAMIS [21]. LAMIS removes non-border instances and keeps border ones. This method formulates instance reduction as a constrained binary optimization problem and solves it by employing the filled function algorithm. In LAMIS, the core of the instance selection process is to keep the hyperplane that separates two-class data so as to provide a large margin separation. The same authors introduced another instance reduction method in [4], called Instance Reduction Algorithm using Hyperrectangle Clustering (IRAHC). IRAHC removes interior instances and keeps border and near-border ones. A hyperrectangle is an n-dimensional rectangle with axis-aligned sides, which is defined by its min and max points and a corresponding distance function. The min-max points are determined by using the hyperrectangle clustering algorithm. In IRAHC, the core of the instance reduction process is based on the resulting set of hyperrectangles.
A survey of the related classifiers is given below. Classification can be done based on sample properties, one of which is distance. Distance is a numerical description of how far apart objects are. In the Euclidean space $\mathbb{R}^n$, the distance between two points is usually given by the Euclidean distance (2-norm distance). Other norms lead to different distances, such as the 1-norm, p-norm and infinity-norm distances. In classification, various distances can be employed to measure closeness, such as the Euclidean, Mahalanobis [22] or bands distance [23]. In the data classification literature, some methods classify data based on the distance between new unseen data and the training samples. One of these classifiers is the Minimum Distance Classifier (MDC) [1]. It classifies an unknown sample into the category to which the nearest prototype pattern belongs. In this classifier, the Euclidean distance is used as the metric. Senda et al. omitted the redundant calculations of MDC by using the Karhunen-Loève expansion [24].
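The CNN and ENN rules described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original authors' code: the helper names (`euclidean`, `knn_label`, `cnn`, `enn`) are hypothetical, the Euclidean distance is assumed, and CNN is seeded with the first instance instead of a random one so that the result is reproducible.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean (2-norm) distance between two points given as sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_label(x, points, labels, k=1, skip=None):
    """Majority label among the k points nearest to x (index `skip` is ignored)."""
    order = sorted(
        (i for i in range(len(points)) if i != skip),
        key=lambda i: euclidean(x, points[i]),
    )
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cnn(points, labels):
    """Condensed NN: add to S every instance that S misclassifies, until stable."""
    keep = [0]  # assumption: deterministic seed instead of a random instance
    changed = True
    while changed:
        changed = False
        for i in range(len(points)):
            if i in keep:
                continue
            s_pts = [points[j] for j in keep]
            s_lbl = [labels[j] for j in keep]
            if knn_label(points[i], s_pts, s_lbl, k=1) != labels[i]:
                keep.append(i)
                changed = True
    return sorted(keep)


def enn(points, labels, k=3):
    """Edited NN: keep only instances that agree with the majority of their k neighbors."""
    return [
        i for i in range(len(points))
        if knn_label(points[i], points, labels, k=k, skip=i) == labels[i]
    ]
```

On a toy set with two clusters and one mislabeled point placed inside the first cluster, `enn` discards the mislabeled point while `cnn` retains it, which matches the observation above that condensation methods tend to keep noisy instances.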

The proposed method
In this section, we first propose a distance-based decision surface and then introduce a new instance reduction method, namely IRDDS (Instance Reduction based on Distance-based Decision Surface), which is based on the proposed DDS. For two given classes, we calculate the average distance from a sample to all the training samples of each class. An unclassified sample is assigned to the class with the smaller average distance. Applying this rule leads to a formula that is used as the decision surface. Afterwards, we present an instance reduction method based on the DDS that employs a genetic algorithm. The proposed Distance-based Decision Surface and its kernel extension are introduced in subsection 3.1. The proposed instance reduction method (IRDDS) is introduced in subsection 3.2. Finally, the statistical stability of the proposed method is proved in subsection 3.3.
The steps of IRDDS are as follows. In the first step, the original training sample is considered. In the second step, a proper distance-based surface, namely the DDS, is obtained by parameter tuning, as described in subsections 3.1 and 3.2. In the next two steps, instance reduction is performed based on the DDS using a genetic algorithm, as described in subsection 3.2. Finally, the reduced training sample is obtained.

Distance-based Decision Surface (DDS)
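In this subsection, we derive a formula to determine the decision surface between two classes of samples. Let $d(x, x_i^{(1)})$ denote the distance from a sample $x$ to the $i$-th training sample of the first class, and let $d(x, x_j^{(2)})$ denote the distance from $x$ to the $j$-th training sample of the second class. The decision surface is the set of points whose average distances to the two classes are equal:
\[ \frac{1}{n_1}\sum_{i=1}^{n_1} d\big(x, x_i^{(1)}\big) = \frac{1}{n_2}\sum_{j=1}^{n_2} d\big(x, x_j^{(2)}\big) \tag{1} \]
Each side of (1) is the average distance of $x$ from all samples of one class, where $n_1$ and $n_2$ are the numbers of training samples in the first and second classes, respectively. It should be noted that $d(\cdot,\cdot)$ calculates the distance between two points and that the same distance function is used on both sides of (1). We can derive a linear equation as a decision surface from (1). Therefore, to classify a new unclassified sample, we have presented a formula as the decision surface.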
The proposed decision surface is derived with a and b defined as in (3) and (4), respectively. The resulting linear decision surface, shown in (5), is called the Distance-based Decision Surface (DDS).
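The step from equal average distances to a linear surface can be made explicit under one assumption that the text above does not state: that $d$ is the squared Euclidean distance. The symbols $\mu_c$, $\tilde{a}$ and $\tilde{b}$ below are introduced only for this illustration and may differ from the a and b of (3) and (4) in sign or scaling.

```latex
% Assumption: d(x, y) = \|x - y\|^2 (squared Euclidean distance).
\begin{align*}
\frac{1}{n_c}\sum_{i=1}^{n_c} \big\|x - x_i^{(c)}\big\|^2
  &= \|x\|^2 - 2\,x^\top \mu_c + \frac{1}{n_c}\sum_{i=1}^{n_c} \big\|x_i^{(c)}\big\|^2,
  \qquad \mu_c = \frac{1}{n_c}\sum_{i=1}^{n_c} x_i^{(c)}, \quad c \in \{1, 2\}. \\
\intertext{Equating the two class averages cancels the quadratic term $\|x\|^2$ and leaves}
\tilde{a}^\top x + \tilde{b} &= 0, \\
\tilde{a} &= 2\,(\mu_2 - \mu_1), \\
\tilde{b} &= \frac{1}{n_1}\sum_{i=1}^{n_1} \big\|x_i^{(1)}\big\|^2
           - \frac{1}{n_2}\sum_{j=1}^{n_2} \big\|x_j^{(2)}\big\|^2 .
\end{align*}
```

With this convention, $\tilde{a}^\top x + \tilde{b}$ equals the average distance to the first class minus the average distance to the second class, so a negative value places $x$ closer on average to the first class.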
To classify a test sample x, it is sufficient to determine the sign of (5). A test sample cannot be classified properly if the decision surface evaluates to zero, that is, if its sign is neither positive nor negative. In this situation, the label of the sample is assigned randomly.
Kernel methods are powerful statistical learning techniques that are widely applied to various learning algorithms [25]. They can be employed to transform samples into a high dimensional space, in which various methods can be used to separate the samples linearly. The transformation is performed by a mapping function, denoted $\phi$. By using a kernel function, the inner products between the images of the data in the feature space can be substituted. As a result, we have (6).
Substituting (6) into (7) gives (8). Using the Radial Basis Function (RBF) kernel in (8) then yields a nonlinear decision surface, as shown in (9).
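The average-distance rule and its kernel extension can be sketched as follows. This is an illustration under assumptions rather than the exact form of (5) or (9): the feature-space distance is taken to be the squared distance $k(x,x) - 2k(x,s) + k(s,s)$, the sign convention (negative means first class) is chosen here, and the function names and the `gamma` parameter are hypothetical. A zero value returns `None` so that the caller can assign the label randomly, as described above.

```python
import math


def linear_kernel(u, v):
    """Plain inner product; recovers the linear decision surface."""
    return sum(ui * vi for ui, vi in zip(u, v))


def rbf_kernel(u, v, gamma=1.0):
    """Radial Basis Function kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def mean_feature_distance(x, samples, kernel):
    """Average squared feature-space distance from x to the samples of one class."""
    total = sum(kernel(x, x) - 2 * kernel(x, s) + kernel(s, s) for s in samples)
    return total / len(samples)


def dds_decision(x, class1, class2, kernel=linear_kernel):
    """Average distance to class 1 minus average distance to class 2."""
    return (mean_feature_distance(x, class1, kernel)
            - mean_feature_distance(x, class2, kernel))


def dds_classify(x, class1, class2, kernel=linear_kernel):
    """Return 1 or 2 for the closer class, or None when the value is exactly zero."""
    value = dds_decision(x, class1, class2, kernel)
    if value < 0:
        return 1
    if value > 0:
        return 2
    return None  # undecided: the text assigns the label randomly in this case
```

Passing `rbf_kernel` instead of `linear_kernel` changes only the distance computation, which is what turns the linear surface into a nonlinear one.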

Reduction based on DDS using genetic algorithm
In the context of data reduction, we are required to reduce instances while maintaining the data classification accuracy. Hence, in the reduction of instances, we often face a trade-off between the sample size and the classification accuracy. Therefore, instance reduction is a multi-objective optimization problem that attempts to maximize the classification accuracy and, at the same time, minimize the sample size.
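One way to picture this two-objective search is a small genetic algorithm over binary masks, where each bit decides whether a training instance is kept. The sketch below is an assumption-laden illustration, not the IRDDS procedure itself: the weighted cost with weight `alpha`, the tournament selection, one-point crossover, bit-flip mutation and all parameter values are hypothetical choices, the classifier is the average-distance rule with squared Euclidean distance, and ties are resolved deterministically instead of randomly.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def dds_predict(x, ref_pts, ref_lbl):
    """Assign x to the class (0 or 1) with the smaller average squared distance."""
    avg = {}
    for c in (0, 1):
        members = [p for p, l in zip(ref_pts, ref_lbl) if l == c]
        if not members:
            return 1 - c  # a class without reference instances cannot win
        avg[c] = sum(sq_dist(x, p) for p in members) / len(members)
    return 0 if avg[0] <= avg[1] else 1  # assumption: ties go to class 0


def cost(mask, pts, lbl, eval_pts, eval_lbl, alpha=0.7):
    """Hypothetical cost: alpha * error rate + (1 - alpha) * fraction of kept instances."""
    ref_pts = [p for p, m in zip(pts, mask) if m]
    ref_lbl = [l for l, m in zip(lbl, mask) if m]
    if not ref_pts:
        return 1.0  # worst possible cost for an empty reference set
    errors = sum(dds_predict(p, ref_pts, ref_lbl) != l
                 for p, l in zip(eval_pts, eval_lbl))
    return alpha * errors / len(eval_pts) + (1 - alpha) * len(ref_pts) / len(pts)


def reduce_instances(train, train_lbl, val, val_lbl,
                     pop_size=20, generations=40, p_mut=0.05, seed=0):
    """Evolve keep/discard masks; the final mask has the lowest validation cost."""
    rng = random.Random(seed)
    n = len(train)

    def fit(mask):  # training cost drives the search
        return cost(mask, train, train_lbl, train, train_lbl)

    def val_cost(mask):  # validation cost selects the final individual
        return cost(mask, train, train_lbl, val, val_lbl)

    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    final = min(pop, key=fit)
    for _gen in range(generations):
        children = []
        while len(children) < pop_size:
            parents = []
            for _pick in range(2):  # binary tournament selection
                a, b = rng.sample(pop, 2)
                parents.append(a if fit(a) <= fit(b) else b)
            cut = rng.randrange(1, n)  # one-point crossover
            child = parents[0][:cut] + parents[1][cut:]
            children.append([(not g) if rng.random() < p_mut else g for g in child])
        pop = children
        gen_best = min(pop, key=fit)
        if val_cost(gen_best) < val_cost(final):
            final = gen_best
    return final
```

Raising `alpha` favors accuracy over reduction, and lowering it does the opposite, which makes the trade-off between the two objectives explicit.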

Statistical stability of the proposed method
Statistical stability is the most fundamental property: the patterns identified by the algorithm should be genuine patterns of the data source and not just accidental relations occurring in the finite training set. This property can be considered as the statistical robustness of the output, in the sense that if we rerun the algorithm on a new sample from the same source, it should identify a similar pattern. Proving that a given pattern is indeed significant is the concern of 'learning theory', a body of principles and methods that estimate the reliability of pattern functions under appropriate assumptions about the way in which the data was generated. The Rademacher complexity measures the capacity of a class of functions by its ability to fit random noise. The difference between the empirical and true estimation over the pattern class can be bounded in terms of its Rademacher complexity [27].
The following theorem can be used to derive an upper-bound of the generalization error in terms of Rademacher complexity for the proposed method [27].
Theorem 1: Let P be a probability distribution on Z, and let the training samples be drawn from Z according to P. Then, with probability at least $1-\delta$, the proposed function f (the DDS classifier) satisfies a bound of the form $P(y \neq f(x)) \le \hat{P}_n(y \neq f(x)) + \dots$, where $\hat{P}_n$ denotes the empirical probability over the n training samples. To obtain the upper bound in Theorem 1, we first prove (12), an inequality that gives the empirical complexity measure of the proposed function f.
Using the properties of the supremum together with the Cauchy-Schwarz and Jensen inequalities, we finally obtain (15), which proves the theorem.
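The idea of measuring capacity by the ability to fit random noise can be checked numerically on a simple stand-in class. The sketch below does not use the DDS function class from the theorem; it assumes the class of linear functions with a bounded weight norm, for which the Cauchy-Schwarz inequality gives the supremum in closed form and Jensen's inequality gives an upper bound. The function names and the `radius` parameter are hypothetical.

```python
import math
import random


def empirical_rademacher_linear(xs, radius=1.0, draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{||w|| <= radius} (1/n) sum_i sigma_i <w, x_i>.

    By Cauchy-Schwarz the supremum equals (radius / n) * ||sum_i sigma_i x_i||.
    """
    rng = random.Random(seed)
    n, dim = len(xs), len(xs[0])
    total = 0.0
    for _ in range(draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]  # random noise labels
        s = [sum(sg * x[d] for sg, x in zip(sigma, xs)) for d in range(dim)]
        total += math.sqrt(sum(v * v for v in s))
    return radius * total / (draws * n)


def jensen_bound(xs, radius=1.0):
    """Upper bound (radius / n) * sqrt(sum_i ||x_i||^2) obtained via Jensen's inequality."""
    return radius * math.sqrt(sum(v * v for x in xs for v in x)) / len(xs)
```

For a fixed sample, the Monte Carlo estimate stays below the closed-form bound, and both shrink roughly as one over the square root of n as more samples are added.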

Experimental results
In this section, the SVM and DDS classifiers are first compared, and then IRDDS is compared with some state-of-the-art instance reduction methods. Extensive experiments have been conducted to evaluate the performance of the proposed method against five state-of-the-art instance selection methods using real-world data sets.
In order to validate IRDDS, experiments have been conducted over real-world data sets taken from the UCI data set repository [28]. The selected data sets and their characteristics are shown in table 1. In this table, #samples, #features, and #classes denote the number of instances, the number of attributes, and the number of classes, respectively. The data sets are grouped into two categories depending on their size (a horizontal line divides them in the table). The small data sets have fewer than 1000 instances and the larger data sets have more than 1000 instances. In each group, the data sets are sorted in increasing order of their number of classes.
In the first experiment, the SVM and DDS classifiers have been compared. To this end, the classifiers have been tested on the first group of data sets (the small data sets). The experimental results (without instance reduction for the DDS) are summarized in table 2.
Note that the values in table 2 denote the classification error rates in percent. For multiclass classification, we can use classification by pairwise coupling. In this paper, we have used the voting approach, rather than the classifier-output approach, to extend the two-class classifier to a multiclass classifier.
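Pairwise coupling with voting can be sketched as follows: a two-class classifier is run on every pair of classes and the class that collects the most votes is returned. The function name `pairwise_vote` and the `binary_classify(x, a, b)` callback signature are hypothetical, and the tie-breaking behavior (the first class to reach the top count wins) is an assumption of this sketch.

```python
from collections import Counter
from itertools import combinations


def pairwise_vote(x, classes, binary_classify):
    """One-vs-one voting: each pair of classes casts one vote for x."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        # binary_classify must return either a or b for the sample x
        votes[binary_classify(x, a, b)] += 1
    return votes.most_common(1)[0][0]
```

For example, with three classes represented by the one-dimensional centroids 0, 5 and 10 and a binary rule that picks the nearer centroid, the sample 4.2 receives two votes for the middle class and one for the first.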
To show the performance of IRDDS, it was compared against five other reduction methods, namely BEPS, LAMIS, DROP3, IRAHC and PSC, using k-NN with k=3 and the Euclidean distance. All the methods have been tested on all the data sets in terms of two performance measures: classification error rate and reduction percentage. Table 3 shows the results obtained using the competing methods. For each method, the average classification error rates (Err) and the reduction percentages (Red) are shown. As already mentioned, instance reduction is a multi-objective optimization problem, so both measures are reported for all the competing methods and all the data sets. The reduction percentage is the ratio of the number of discarded instances to the number of instances in the original data set, multiplied by 100.
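The two performance measures and the fold construction for cross-validation can be written down directly. The helper names below are hypothetical, and the fold split is a plain interleaved split without shuffling, which is an assumption of this sketch rather than a description of the experimental protocol.

```python
def reduction_percentage(n_original, n_kept):
    """Discarded instances divided by original instances, multiplied by 100."""
    return 100.0 * (n_original - n_kept) / n_original


def error_rate(predicted, actual):
    """Percentage of samples whose predicted label differs from the true label."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * wrong / len(actual)


def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k disjoint test folds (interleaved, no shuffling)."""
    return [list(range(n))[i::k] for i in range(k)]
```

For instance, keeping 30 of 200 training instances corresponds to a reduction percentage of 85.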
Here $d(x, x_i^{(1)})$ denotes the distance from the test sample $x$ to the $i$-th training sample of the first class, and $d(x, x_j^{(2)})$ denotes the distance from $x$ to the $j$-th training sample of the second class. DDS is based on the distance between an unclassified sample x and the samples of the two classes. The goal is to determine the decision surface in such a way that the average distances from the two classes are equal. Hence, (1) is employed to determine the decision surface.

During the training process, the cost function over the validation set is calculated for the best individual of each population. The individual that achieves the lowest cost value over the validation set is selected as the final individual. The other GA parameters that have been used in the experiments are reported in the next section.

Table 5: Average computational times of IRDDS and the other methods, measured in seconds per run.

Conclusions
In this paper, a new instance reduction procedure is presented. The introduced method, called IRDDS, employs a Distance-based Decision Surface (DDS) and a genetic algorithm. In IRDDS, we proposed a decision surface for binary classification, with respect to which an input test sample can be assigned to one of the two classes. In the experiments, IRDDS obtains the best classification error rate and the second best instance reduction percentage among the competing methods. Overall, IRDDS exhibits good results in both classification error rate and instance reduction percentage. The experiments also show that IRDDS is robust in noisy cases and often yields the most robust performance over the data sets. IRDDS is an offline method. As future research, we are interested in extending it to online applications, such as data streams.