Overlap-based feature weighting: feature extraction of hyperspectral remote sensing imagery

Hyperspectral sensors provide a large number of spectral bands. The massive and complex data structure of hyperspectral images presents a challenge to traditional data processing techniques. Therefore, reducing the dimensionality of hyperspectral images without losing important information is a very important issue for the remote sensing community. We propose overlap-based feature weighting (OFW) for supervised feature extraction of hyperspectral data. In the OFW method, the feature vector of each pixel of the hyperspectral image is divided into segments, and the weighted mean of the adjacent spectral bands in each segment is calculated as an extracted feature. The smaller the overlap between classes in a band, the greater the class discrimination ability in that band. Therefore, the inverse of the overlap between classes in each band (feature) is taken as the weight for that band. The superiority of OFW, in terms of classification accuracy and computation time, over other supervised feature extraction methods is established on three real hyperspectral images in the small sample size situation.


Introduction
The high spectral resolution of hyperspectral images allows the characterization, identification, and classification of land covers with improved accuracy, robustness, and more detail. A large number of training samples is required to achieve satisfactory accuracy in classification problems. However, the collection of ground reference data (training samples) in real-world applications is an expensive and time-consuming task, and so the number of available training samples may be very limited. There are different solutions for coping with the small training sample size. Semi-supervised approaches exploit unlabeled samples in addition to labeled samples to improve the classification accuracy [1,2]. Advanced classifiers such as kernel-based classifiers are distribution free and make no assumptions about the density functions of the data [3,4]. Feature reduction is one of the most important solutions to the small sample size problem [5][6][7][8][9]. In addition to improving the classification accuracy, feature reduction techniques reduce the computational complexity and also simplify the visualization of data. Feature reduction methods are divided into two general groups: feature selection and feature extraction. Feature selection methods select an appropriate subset of the original candidate features and maintain the physical meaning of the data. Feature extraction methods transform the feature space of the data, usually using a projection matrix. Feature reduction can be performed in a supervised [10,11], unsupervised [12,13], or semi-supervised [14] manner. We assess supervised feature extraction methods in this paper.

Multiple features such as spectral, texture, and shape features are employed to represent pixels from different perspectives in hyperspectral image classification. Properly combining multiple features results in good classification performance. A patch alignment framework that linearly combines multiple features in an optimal way, obtaining a unified low-dimensional representation of these multiple features for subsequent classification, is introduced in [15]. A pixel in a hyperspectral image can be represented by both spatial and spectral features. Each feature view summarizes a specific characteristic of the studied object from a different feature space, and the views are complementary to each other. An ensemble manifold regularized sparse low-rank approximation algorithm for multi-view feature dimensionality reduction is proposed in [16].

Linear discriminant analysis (LDA) is a simple and popular method for feature extraction in different pattern recognition applications [17]. LDA maximizes the between-class scatter and minimizes the within-class scatter to increase the class discrimination. Because of the singularity of the within-class scatter matrix, LDA performs poorly when the number of training samples is limited. Generalized discriminant analysis (GDA) is the nonlinear version of LDA, which works in the kernel space [18]. Because of the limited rank of the between-class scatter matrix, LDA and GDA can extract at most c − 1 features, where c is the number of classes. Nonparametric weighted feature extraction (NWFE) uses a nonparametric form and weighted means for the calculation of the scatter matrices [19]. Thus, NWFE can extract more than c − 1 features and, moreover, performs well with a small training set. Median-mean line discriminant analysis (MMLDA), which was recently proposed, copes with the negative effect of outliers on the class mean by introducing the median-mean line as an adaptive class prototype [20].

We propose a supervised feature extraction method in this paper that is simple, fast, and efficient in the small sample size situation. The proposed method is named overlap-based feature weighting (OFW). In a hyperspectral image, adjacent spectral bands contain redundant information. Thus, we divide the feature vector of each sample into segments in such a way that each segment contains adjacent spectral bands. We consider the weighted mean of the spectral bands (original features) in each segment as an extracted feature. If classes have more overlap in a spectral band, then discriminating the classes in that band is harder. Thus, the class discrimination ability in each band has an inverse relationship with the overlap value between classes in that band. Therefore, we assign the inverse of the overlap between classes in each feature as the weight for that feature in the weighted mean.

Feature extraction methods such as LDA, GDA, NWFE, and MMLDA need to estimate the mean vectors (first-order statistics) and the scatter matrices (second-order statistics). Accurate estimation of these statistics needs a large enough training set. When the number of training samples is limited, accurate estimates of the mean vectors and covariance matrices cannot be obtained, and so the accuracy of LDA-based methods such as conventional LDA, GDA, NWFE, and MMLDA decreases. The proposed method, OFW, uses only the original training samples and does not need to estimate the statistics of the data. Therefore, it can perform well in small sample size situations compared to LDA-based methods. Moreover, OFW involves simple calculations, so it is fast. The efficiency of OFW is investigated on three real hyperspectral images. The rest of the paper is organized as follows: section 2 introduces the proposed method, section 3 presents the experimental results, and section 4 concludes the paper.

Proposed method
The adjacent spectral bands (features) in each pixel of a hyperspectral image contain highly redundant information. Hence, for the extraction of m features from the d original spectral bands, we divide the feature vector of each sample of data into m segments, each containing n = ⌊d/m⌋ adjacent spectral bands. Then, the weighted mean of the spectral bands in each segment is considered as the extracted feature for that segment.

Let x = [x_1 x_2 ⋯ x_d]^T be the feature vector of a pixel of the hyperspectral image and y = [y_1 y_2 ⋯ y_m]^T be the extracted feature vector of x, where m < d. The elements of y are calculated as the weighted mean of the bands in each segment:

y_k = ( Σ_{j∈S_k} w_j x_j ) / ( Σ_{j∈S_k} w_j ),  k = 1, …, m,  (1)

where S_k = {(k−1)n+1, …, kn} is the set of band indices in segment k and w_j is the weight of the jth spectral band in the above weighted mean. How to decompose the whole spectral signature has been investigated in the literature, e.g. [21].
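As a minimal sketch of this extraction step (assuming the per-band weights are already available; the function name and the choice to fold any leftover bands beyond m·⌊d/m⌋ into the last segment are our own, a detail the text leaves open):

```python
import numpy as np

def ofw_extract(x, w, m):
    """Extract m features from a d-dimensional pixel vector x.

    x : (d,) spectral signature of one pixel
    w : (d,) per-band weights (e.g. inverse class overlap)
    m : number of extracted features (segments)

    Each segment holds n = floor(d / m) adjacent bands; the extracted
    feature is the weighted mean of the bands in that segment.
    """
    d = x.shape[0]
    n = d // m
    y = np.empty(m)
    for k in range(m):
        lo = k * n
        hi = (k + 1) * n if k < m - 1 else d  # last segment takes the remainder
        y[k] = np.sum(w[lo:hi] * x[lo:hi]) / np.sum(w[lo:hi])
    return y
```

With uniform weights this reduces to plain segment averaging; unequal weights pull each extracted feature toward the less overlapping bands.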
To this end, we implement the simplest possible approach for the segmentation of the spectral signature of pixels. The calculation of the weights is the novelty of our proposed method. In some spectral bands, the difference between classes is greater than in others. The greater the overlap between classes in a spectral band (feature), the harder the class discrimination will be in that band. In other words, the class discrimination ability of each feature has an inverse relationship with the overlap between classes in that feature.
Figure 1 shows the samples of two classes in a two-dimensional feature space. In band b_1, the two classes have no overlap and are thus discriminable from each other, while in band b_2 the classes are overlapped and discrimination between them is hard. For a better understanding, see figure 2. The two classes have no overlap in band b_i and thus are easily separated from each other using a simple line, while the two classes overlap in band b_j, and so a complex nonlinear curve is needed to separate them from each other. Therefore, it is obvious that the ability of each spectral band to discriminate between classes has an inverse relationship with the overlap between classes in that band.

Let x_{kji} (k = 1, …, d; j = 1, …, n_i; i = 1, …, c) be the kth feature of the jth sample of class i, where d, c, and n_i are the number of spectral bands (features), the number of classes, and the number of training samples in class i, respectively. The minimum and maximum values of each spectral band in each class are given by

m_{k,i} = min_j x_{kji},   M_{k,i} = max_j x_{kji},

where m_{k,i} is the minimum value of feature k in class i and M_{k,i} is the maximum value of feature k in class i.
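The per-class band ranges described above can be computed directly from the training samples; a small illustrative sketch (function and variable names are our own):

```python
import numpy as np

def class_band_ranges(X, labels):
    """Per-class minimum and maximum of every spectral band.

    X : (N, d) training samples, labels : (N,) class ids.
    Returns two dicts mapping class id -> (d,) vectors: the per-band
    minima m_{k,i} and maxima M_{k,i} of the text.
    """
    mins, maxs = {}, {}
    for c in np.unique(labels):
        Xc = X[labels == c]          # all training samples of class c
        mins[c] = Xc.min(axis=0)     # band-wise minimum for class c
        maxs[c] = Xc.max(axis=0)     # band-wise maximum for class c
    return mins, maxs
```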
If the intervals [m_{k,i}, M_{k,i}] and [m_{k,j}, M_{k,j}] are disjoint, classes i and j have no overlap in feature k. Otherwise, the two classes i and j have overlap, and the value of the overlap between them in feature k is calculated as the length of the intersection of the two intervals:

(OL_k)_{ij} = min(M_{k,i}, M_{k,j}) − max(m_{k,i}, m_{k,j}),

where (OL_k)_{ij} is the overlap value of class i and class j in feature k. The overlap between all pairs of classes is accumulated as follows:

OL_k = Σ_{i=1}^{c−1} Σ_{j=i+1}^{c} (OL_k)_{ij}.

The class discrimination ability has an inverse relationship with the overlap value between classes. Thus, the weight associated with each feature in the weighted mean in (1) is calculated by

w_k = 1 / OL_k.

Figure 3 shows an example of the determination of the overlap between two classes.
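A sketch of this weight computation, taking the overlap of two classes in a band as the length of the intersection of their [min, max] intervals (zero when disjoint) and guarding the division with a small eps; the exact normalisation used in the paper may differ, so treat this as an assumption:

```python
import numpy as np

def band_weights(mins, maxs, eps=1e-12):
    """Inverse-overlap weight for every spectral band.

    mins, maxs : dicts mapping class id -> (d,) per-band min / max.
    For each pair of classes, the overlap in a band is the length of
    the intersection of their [min, max] intervals (0 when disjoint);
    pairwise overlaps are summed per band, and the weight is the
    inverse of that total, with eps guarding division by zero.
    """
    classes = sorted(mins)
    d = len(mins[classes[0]])
    overlap = np.zeros(d)
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            i, j = classes[a], classes[b]
            inter = np.minimum(maxs[i], maxs[j]) - np.maximum(mins[i], mins[j])
            overlap += np.clip(inter, 0.0, None)  # disjoint intervals add 0
    return 1.0 / (overlap + eps)
```

Bands where the class intervals barely intersect receive large weights and dominate the weighted means; heavily overlapping bands are suppressed.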

Experiments and discussion
In this section, we assess the performance of the proposed method, OFW, compared to supervised feature extraction methods such as LDA, NWFE, GDA, and MMLDA using three real hyperspectral images: the Indian, University of Pavia, and KSC datasets.

Conclusion
Overlap-based feature weighting (OFW) is proposed for feature extraction of hyperspectral images in this paper. In the proposed method, the feature vector of each pixel is divided into segments, and the weighted mean of the features in each segment is calculated as an extracted feature. The weight for each feature is obtained by calculating the overlap between classes in that feature. In the OFW method, there is no need to calculate the statistics of the data. As a result, OFW is simple, fast, and efficient for feature extraction of high-dimensional data in small sample size situations. The superiority of OFW compared to some popular feature extraction methods is shown for the Indian, Pavia, and KSC datasets using limited training samples.

Figure 1. Samples of two classes in a two-dimensional feature space.
Figure 2. There is no overlap between classes in band b_i, so the two classes are easily separated from each other in b_i, while there is overlap between classes in band b_j, so the two classes are hardly separated from each other in b_j.

Figure 3. An example of the determination of the overlap between two classes.

Figure 4. Average classification accuracy versus the number of extracted features obtained by (a) SVM and (b) ML classifiers for the Indian dataset.
Figure 5.

Figure 6. Average classification accuracy versus the number of extracted features obtained by (a) SVM and (b) ML classifiers for the KSC dataset.
Figure 7.

Figure 8. GTM and classification maps for the Pavia dataset obtained by the SVM classifier and 8 extracted features.

Figure 9. Comparison of OFW with LDA for different training sample sizes for the Indian dataset, obtained by the SVM classifier and 9 extracted features.
… after the removal of noisy bands. This urban image has nine classes and 610×340 pixels. The KSC dataset was acquired by AVIRIS over the Kennedy Space Center, Florida. After removing water absorption and low-SNR bands, 176 bands are used for the analysis of the data. The KSC image has 512×614 pixels and 13 classes.

Support vector machine (SVM) and Gaussian maximum likelihood (ML) classifiers are used to assess the performance of the feature extraction methods. A polynomial kernel of degree 3 with the default parameters defined in LIBSVM [22] is used in the SVM classifier. We use several measures to assess the classification accuracy: average accuracy, average reliability, and the kappa coefficient [23]. The reliability of a class is the number of test samples that are correctly classified divided by the total number of samples assigned to that class. We use McNemar's test [24] to assess the statistical significance of differences in the classification results. The sign of z_12 indicates whether classifier 1 is more accurate than classifier 2 (z_12 > 0) or vice versa (z_12 < 0). The difference in classification accuracy between two classifiers is statistically significant if |z_12| > 1.96. We use 16 training samples per class in our experiments to investigate the performance of the feature extraction methods in the small sample size situation. The training samples are chosen randomly from the entire scene, and the remaining samples are used for testing. Each experiment is repeated 10 times, and the average results are reported here.
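For reference, the McNemar statistic described here, in its standard form for two classifiers evaluated on the same test set, can be computed as:

```python
import math

def mcnemar_z(f12, f21):
    """McNemar's test statistic for comparing two classifiers.

    f12 : number of test samples classifier 1 got right and classifier 2 got wrong
    f21 : number of test samples classifier 2 got right and classifier 1 got wrong
    z > 0 favours classifier 1; |z| > 1.96 means the difference is
    significant at the 5% level. (Standard form of the test; sketch
    for illustration only.)
    """
    return (f12 - f21) / math.sqrt(f12 + f21)
```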
… classifier, GDA has better performance than the other feature extraction methods). Popular feature extraction methods such as LDA, NWFE, GDA, and MMLDA calculate the scatter matrices and maximize the between-class scatter matrix and …