Feature reduction of hyperspectral images: Discriminant analysis and the first principal component

When the number of training samples is limited, feature reduction plays an important role in the classification of hyperspectral images. In this paper, we propose a supervised feature extraction method based on discriminant analysis (DA) that uses the first principal component (PC1) to weight the scatter matrices. The proposed method, called DA-PC1, copes with the small sample size problem and does not have the limitation of linear discriminant analysis (LDA) on the number of extracted features. In DA-PC1, the dominant structure of the distribution is preserved by PC1 and the class separability is increased by DA. Experimental results show the good performance of DA-PC1 compared to several state-of-the-art feature extraction methods.


Introduction
Due to recent advances in remote sensing instruments, hyperspectral imaging has become a fast-growing technique in the field of remote sensing [1]. By acquiring a large number of spectral bands, hyperspectral imaging sensors allow us to better distinguish many subtle objects and materials [2]. An important application of hyperspectral imaging is image classification [3][4][5][6]. However, since the inputs of hyperspectral datasets are high-dimensional vectors whose coordinates are highly correlated, the direct use of classical models for hyperspectral image classification faces several difficulties, particularly when the number of available training samples is limited. With a fixed number of training samples, hyperspectral image classification accuracy can first increase as the dimensionality of the data increases, but it decays once the dimensionality exceeds some optimum value. In other words, the Hughes phenomenon occurs [7]. One of the main approaches to mitigate this problem is dimensionality reduction [8][9][10][11]. Feature reduction can be done with feature selection or feature extraction. In feature selection approaches, an appropriate subset of the original features is selected, usually using a discrimination criterion and a search algorithm [12][13][14][15][16][17][18]. Thus, the physical meaning of the data is preserved by feature selection. In feature extraction methods, however, a linear or nonlinear transformation is applied to the original features to extract new features [19][20][21][22][23][24][25][26]. Depending on the use of labeled samples for training, these techniques are divided into supervised ones, which use the class label information, and unsupervised ones, which do not. Principal component analysis (PCA) and linear discriminant analysis (LDA) are the most widely used unsupervised and supervised linear feature extraction methods, respectively [27].

PCA finds the principal components in accordance with the maximum variance of the data matrix. Thus, after such a transformation, the dominant structure of the distribution is well preserved in the reduced subspace. The generated principal components are linear combinations of the original features and are uncorrelated. PCA searches for directions with large variance in the data and projects the data onto them. However, the position of the data in the reduced feature space may be inappropriate for distinguishing between classes, leading to poor classification. LDA utilizes the label information to infer class separability. LDA seeks projection directions on which the ratio of the between-class scatter to the within-class scatter is maximized. Some difficulties with LDA are as follows. When the number of training samples is limited, accurate estimates of the scatter matrices may not be obtained and the within-class scatter matrix becomes singular. Thus, LDA has no reasonable performance in the small sample size situation. Moreover, LDA can extract at most c − 1 features (c is the number of classes), which is not always sufficient for representing the original data. As generalized discriminant analysis (GDA) provides a mapping of input vectors into a high-dimensional feature space, it can perform nonlinear discriminant analysis using a kernel function operator [28]. By choosing different kernels, one can cover a wide class of nonlinearities. GDA, the kernelized version of LDA, can also extract at most c − 1 features. Nonparametric weighted feature extraction (NWFE) has been proposed for improving LDA [29]. The main ideas of NWFE are to put different weights on samples to compute the weighted means and to define new nonparametric between-class and within-class scatter matrices, allowing it to obtain more than c − 1 features. To alleviate the negative influence of outliers in class-mean-based methods, the authors of [30] proposed a linear dimensionality reduction technique called median-mean line based discriminant analysis (MMLDA). It rectifies, to some extent, the shift of the class-mean position caused by outliers by introducing the median-mean line as an adaptive class prototype.

In this paper, we propose a supervised feature extraction method based on discriminant analysis (DA). The proposed method uses the first principal component (PC1) to weight the scatter matrices. Thus, in addition to the class discrimination information contained in the Fisher criterion (maximizing the between-class scatter and minimizing the within-class scatter), the proposed method uses the data representation and reconstruction information to preserve the main structure of the original data in the reduced subspace. Moreover, the non-parametric form of the scatter matrices and the use of a regularization method allow the extraction of more than c − 1 features and also solve the singularity problem. We introduce the proposed method, called DA-PC1, in more detail in section 2. Then, in section 3, extensive experiments show that the proposed method outperforms popular feature extraction methods in terms of classification accuracy. Finally, conclusions are drawn in section 4.
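The rank limit of LDA mentioned above (at most c − 1 discriminant axes for c classes, versus PCA's variance-driven components) is easy to verify on synthetic data. A minimal sketch using scikit-learn; the data, shapes, and component counts here are purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
c, d = 4, 50                       # 4 classes, 50 spectral bands (illustrative)
X = rng.normal(size=(200, d))      # synthetic "pixels"
y = rng.integers(0, c, size=200)   # synthetic class labels

# PCA is unsupervised: it can return up to min(n_samples, d) components.
X_pca = PCA(n_components=10).fit_transform(X)

# LDA is supervised and rank-limited: at most c - 1 = 3 discriminant axes.
# Requesting more than c - 1 components would raise an error.
X_lda = LinearDiscriminantAnalysis(n_components=c - 1).fit(X, y).transform(X)

print(X_pca.shape, X_lda.shape)    # (200, 10) (200, 3)
```

This rank limit is exactly the restriction that NWFE, and the method proposed here, are designed to lift.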

DA-PC1
The proposed feature extraction method, DA-PC1, uses discriminant analysis to increase the separability between classes. DA-PC1 maximizes the between-class scatter and minimizes the within-class scatter. It defines weighted non-parametric scatter matrices that provide three main advantages: 1) DA-PC1 copes with the singularity problem of the within-class scatter matrix in the small sample size situation; 2) it can extract more than c − 1 features, where c is the number of classes; 3) in addition to class discrimination information, it uses the reconstruction information contained in the first principal component for weighting the scatter matrices.

In the first step, we compute the first principal component (PC1) of the data. To this end, we estimate the covariance matrix of the data as follows:

Σ = (1/n) ∑_{i=1}^{n} (x_i − x̄)(x_i − x̄)^T   (1)

where x_i ∈ R^d (i = 1, 2, …, n) is the ith pixel of the hyperspectral image, d is the number of spectral bands, n is the total number of samples (pixels), and x̄ is the total mean of the data, given by:

x̄ = (1/n) ∑_{i=1}^{n} x_i   (2)

The PC1 is obtained using the eigenvector φ_1 corresponding to the largest eigenvalue of Σ. Then we have:

z_i = φ_1^T x_i   (3)

The between-class scatter matrix (S_b) and the within-class scatter matrix (S_w) are calculated as follows:

S_b = ∑_{i=1}^{N} ∑_{j: y_j ≠ y_i} w_ij (x_i − x_j)(x_i − x_j)^T   (4)

S_w = ∑_{i=1}^{N} ∑_{j: y_j = y_i} w_ij (x_i − x_j)(x_i − x_j)^T   (5)

where x_i (i = 1, 2, …, N) is the ith training sample, N is the total number of training samples, y_i ∈ {1, 2, …, c} is the class label of sample x_i, and c is the number of classes. The closer the first principal components of two samples x_i and x_j are, the larger the weight w_ij will be. Thus, the weight w_ij (i = 1, …, N; j = 1, …, N) is calculated as follows:

w_ij = 1 / (1 + d_ij)   (6)

The number one is added to the denominator so that w_ij cannot become infinite. In the above equation, we have:

d_ij = |z_i − z_j| = |φ_1^T x_i − φ_1^T x_j|   (7)

To mitigate the singularity problem, and thus increase the classification accuracy, we regularize the matrix S_w as follows:

S̃_w = S_w + ε I   (8)

where ε is a small positive constant and I is the identity matrix. The transformation matrix of DA-PC1 is then formed from the eigenvectors corresponding to the largest eigenvalues of S̃_w^{-1} S_b. Because of the non-parametric form of S_w, and also thanks to its regularization, DA-PC1 copes with the singularity problem in the small sample size situation. Because of the non-parametric form of S_b, DA-PC1 can extract more than c − 1 features. DA-PC1 uses the information contained in both DA and PC1: it increases the class separability using DA, while PC1, which is in accordance with the maximum variance of the data matrix, preserves the dominant structure of the distribution in the reduced subspace after transformation.
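The steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: where the extracted text leaves the equations ambiguous, the pairwise forms of the scatter matrices and the ridge-style ε·I regularizer are assumptions, as are all names and parameter values:

```python
import numpy as np
from scipy.linalg import eigh

def da_pc1(X, y, n_features, eps=1e-3):
    """Sketch of DA-PC1: PC1-weighted non-parametric discriminant analysis.

    The pairwise scatter forms and the eps * I regularizer are assumptions;
    the paper's exact definitions may differ in detail.
    """
    N, d = X.shape

    # Step 1: first principal component (PC1) of the data.
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / N
    phi1 = np.linalg.eigh(cov)[1][:, -1]   # eigenvector of the largest eigenvalue
    z = X @ phi1                           # PC1 projection of every sample

    # Step 2: pairwise weights; the "+1" in the denominator keeps w_ij finite.
    W = 1.0 / (1.0 + np.abs(z[:, None] - z[None, :]))

    # Step 3: weighted non-parametric within- and between-class scatter matrices.
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for i in range(N):
        for j in range(N):
            outer = np.outer(X[i] - X[j], X[i] - X[j])
            if y[i] == y[j]:
                Sw += W[i, j] * outer
            else:
                Sb += W[i, j] * outer

    # Step 4: regularize S_w, then solve the generalized eigenproblem
    # S_b v = lambda * S_w_reg v and keep the top eigenvectors.
    Sw_reg = Sw + eps * np.eye(d)
    evals, evecs = eigh(Sb, Sw_reg)        # eigenvalues in ascending order
    A = evecs[:, ::-1][:, :n_features]     # columns: top discriminant axes
    return X @ A, A

# Demo on synthetic 2-class data: note 4 extracted features > c - 1 = 1,
# which classic LDA could not provide.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 10)), rng.normal(2, 1, (30, 10))])
y = np.array([0] * 30 + [1] * 30)
Z, A = da_pc1(X, y, n_features=4)
print(Z.shape)
```

The double loop is written for clarity; in practice the scatter sums would be vectorized for realistically sized training sets.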

Experiments and results
The performance of DA-PC1 is compared with that of LDA, NWFE, GDA, MMLDA, and PCA. To assess the classification performance, we use the accuracy and reliability of the classes, the average accuracy, the average reliability, the kappa coefficient [31], and the McNemar test [32]. These measures are defined below.
The accuracy (Acc.) and reliability (Rel.) for each class are defined as Acc = n_c / n_t and Rel = n_c / n_l, respectively, where n_c is the number of testing samples of the class that are correctly classified, n_t is the total number of testing samples of the class, and n_l is the total number of samples labeled as this class. The kappa coefficient is defined as follows:

κ = (N ∑_{i=1}^{c} n_ii − ∑_{i=1}^{c} n_{i+} n_{+i}) / (N² − ∑_{i=1}^{c} n_{i+} n_{+i})

where N and c denote the number of testing samples and the number of classes, respectively. Figure 1 shows that the efficiency of LDA using just 10 and 15 training samples is very weak. Only in the following cases do other feature extraction methods obtain higher classification accuracy than the proposed method. In the Pavia University dataset, using the ML classifier, the best classification accuracy is obtained by the GDA, MMLDA, and PCA methods, and using the NN classifier, MMLDA obtains the best result for this dataset. Moreover, in the KSC dataset, using the ML classifier, the maximum classification accuracy is obtained by GDA. In figure 5, we compare the performance of DA-PC1 with PCA and LDA for a fixed number of extracted features while varying the number of training samples from 5 to 130 samples per class, for the Indian dataset with a) the SVM classifier and 6 extracted features, b) the ML classifier and 5 extracted features, and c) the NN classifier and 9 extracted features. The following points can be concluded from the results of this experiment: 1) When the training set is small, PCA works better than LDA, and when the training set is large, LDA works better than PCA. 2) When the number of training samples is limited, DA-PC1 is superior to both PCA and LDA; when a large number of training samples is available, LDA outperforms DA-PC1 with the SVM and NN classifiers. Even in this case, however, DA-PC1 still has reasonable performance. 3) With the ML classifier, the performance of DA-PC1 is better than that of PCA and LDA for both small and large training sets. In general, when we use parametric classifiers such as ML, which need to estimate the mean vectors and covariance matrices of the classes and are therefore more sensitive to the training set, DA-PC1 is preferable to PCA and LDA whether the training set is small or large. When we use non-parametric classifiers such as SVM and NN, which are less sensitive to the number of training samples, DA-PC1 is preferable with a small training set and LDA is preferable with a large training set.
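These measures can be computed directly from a confusion matrix. A minimal sketch; the 2-class confusion matrix below is made up for illustration:

```python
import numpy as np

def kappa(confusion):
    """Cohen's kappa coefficient from a c x c confusion matrix
    (rows: true class, columns: predicted class)."""
    conf = np.asarray(confusion, dtype=float)
    N = conf.sum()                                 # total testing samples
    p_o = np.trace(conf) / N                       # observed agreement
    p_e = (conf.sum(axis=1) @ conf.sum(axis=0)) / N**2  # chance agreement
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical 2-class confusion matrix.
conf = np.array([[40,  5],
                 [10, 45]])
acc_class1 = conf[0, 0] / conf[0].sum()     # accuracy of class 1: 40/45
rel_class1 = conf[0, 0] / conf[:, 0].sum()  # reliability of class 1: 40/50
print(acc_class1, rel_class1, kappa(conf))  # kappa here is 0.7
```

The (p_o − p_e)/(1 − p_e) form above is algebraically identical to the counts form with the marginals n_{i+} and n_{+i}: multiplying numerator and denominator by N² recovers it.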

Conclusion
In this paper, the DA-PC1 method is proposed for feature extraction of hyperspectral images. In DA-PC1, the first principal component of the training samples is used to weight the scatter matrices in the DA. Thus, in addition to increasing the class separability, the signal representation in the reduced subspace may be improved. The experimental results show that with parametric classifiers such as ML, DA-PC1 is superior to PCA and LDA whether the training set is small or large. With non-parametric classifiers such as SVM and NN, DA-PC1 is superior to PCA and LDA for small training sample sizes. Moreover, the comparison of DA-PC1 with state-of-the-art feature extraction methods such as NWFE, GDA, and MMLDA shows the better performance of DA-PC1 for feature reduction and classification of hyperspectral images, particularly in the small sample size situation.
with a step size increment of 20, and the parameter of the RBF kernel between [0.1, 2] with a step size increment of 0.1. The best values of the free parameters are obtained using a 5-fold cross-validation approach. The training samples are chosen randomly from the entire datasets and the remaining samples are used for testing. Each experiment is repeated 10 times, with different random training samples each time, and the average results are reported. The average classification accuracies versus the number of extracted features are shown in figure 1 for the Indian dataset with a) SVM classifier, 10 training samples; b) ML classifier, 10 training samples; c) NN classifier, 10 training samples; d) SVM classifier, 15 training samples; e) ML classifier, 15 training samples; f) NN classifier, 15 training samples; g) SVM classifier, 30 training samples; h) ML classifier, 30 training samples; i) NN classifier, 30 training samples; j) SVM classifier, 60 training samples; k) ML classifier, 60 training samples; l) NN classifier, 60 training samples. In most cases, the better performance of DA-PC1 compared to the other feature extraction methods can be seen.
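The parameter selection described above can be sketched with a grid search under 5-fold cross-validation. The RBF kernel grid [0.1, 2] in steps of 0.1 is taken from the text; the range of the other SVM grid is truncated in the text, so the C grid below is hypothetical (only its step of 20 is stated), and the synthetic data stands in for the hyperspectral datasets:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 20))     # synthetic stand-in for reduced features
y = rng.integers(0, 3, size=150)   # synthetic class labels

param_grid = {
    # gamma grid from the text: [0.1, 2] with step 0.1.
    "gamma": np.arange(0.1, 2.0 + 1e-9, 0.1),
    # C grid is hypothetical: the text only states a step of 20.
    "C": np.arange(20, 201, 20),
}

# 5-fold cross-validation over the grid, as described in the text.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

On the real datasets, the whole search would be repeated for each of the 10 random training/testing splits before averaging the results.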

Figure 3. GTM and the classification maps of the Indian dataset obtained by the SVM classifier with 15 training samples and 6 extracted features.

Figure 4. GTM and the classification maps of the KSC dataset obtained by the SVM classifier with 15 training samples and 8 extracted features.

Figure 5. Comparison of the performance of DA-PC1 with PCA and LDA for a fixed number of extracted features while varying the number of training samples, for the Indian dataset with a) SVM classifier and 6 extracted features, b) ML classifier and 5 extracted features, c) NN classifier and 9 extracted features.
Here, n_ii is the number of samples correctly classified in class i, n_{i+} is the number of testing samples labeled as class i, and n_{+i} is the number of samples predicted as belonging to class i. The McNemar test is used to assess the statistical significance of differences in classification results. The parameter Z12 in the McNemar test is defined as follows:

Z12 = (f12 − f21) / sqrt(f12 + f21)

where f12 is the number of samples labeled correctly by classifier 1 and incorrectly by classifier 2, and f21 is the number labeled correctly by classifier 2 and incorrectly by classifier 1. The difference in accuracy between two classifiers is said to be statistically significant if |Z12| > 1.96. If classifier 1 is more accurate than classifier 2, we have Z12 > 0; otherwise, Z12 < 0.
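The McNemar statistic is a one-liner; the counts f12 and f21 below are made up for illustration:

```python
import math

def mcnemar_z(f12, f21):
    """McNemar test statistic: f12 = samples correct under classifier 1 but
    wrong under classifier 2; f21 = the reverse. |z| > 1.96 is significant
    at the 5% level."""
    return (f12 - f21) / math.sqrt(f12 + f21)

z = mcnemar_z(30, 12)   # hypothetical disagreement counts
print(z)                # ~2.78 > 1.96: classifier 1 is significantly better
```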

With only 10 and 15 training samples per class, LDA performs very weakly because of the singularity of the within-class scatter matrix. Moreover, LDA and GDA can extract at most c − 1 = 9 features, which is insufficient in some cases for accurate classification of the data. The classification accuracies of the reduced data for different numbers of extracted features, using 15 training samples per class for the KSC dataset, are shown in figure 2 for a) the SVM classifier, b) the ML classifier, and c) the NN classifier. The accuracy and reliability of the classes obtained with 15 training samples and the SVM classifier are reported for the Indian (6 extracted features) and KSC (8 extracted features) datasets in table 1 and table 2, respectively. The McNemar test results are shown in table 3, where the entry in row i and column j gives the test value for the corresponding pair of methods; the ground truth map (GTM) and the classification maps of these cases are shown in figures 3 and 4. The highest classification accuracies are reported in table 4.