Video-based face recognition in color space by graph-based discriminant analysis

Video-based face recognition has attracted significant attention in the past decade in many applications, such as media technology, network security, human-machine interfaces, and automatic access control systems. The usual approach to face recognition is based upon the grayscale image produced by combining the three color component images. In this work, we consider the grayscale image as well as the color space in the recognition process. To extract key frames from a video sequence, the input video is converted into a number of clusters, each of which acts as a linear subspace. The center of each cluster is taken as the cluster representative. To compare the key frames, three popular color spaces (RGB, YCbCr, and HSV) are used for the mathematical representation, and graph-based discriminant analysis is applied for the recognition process. It is also shown that, by introducing the intra-class and inter-class similarity graphs to the color space, the problem becomes one of determining the color component combination vector and the mapping matrix. We introduce an iterative algorithm to determine the optimal vector and matrix simultaneously. Finally, the results for the three color spaces and the grayscale image are compared with those obtained from other available methods. Our experimental results demonstrate the effectiveness of the proposed approach.


Introduction
Face recognition has been one of the most popular research areas in computer vision in recent decades because of its many applications, such as media technology, surveillance systems, access control, and human-machine interfaces. So far, however, no technique has been proposed that provides a robust solution to all of its problems. Most of the proposed algorithms are based on still images, and they have proven successful in handling some of the problems, such as pose, lighting, and expression variations, occlusion, and low image resolution. Recently, owing to developments in media technology and increased demands on security, video-to-video face recognition has received significant attention. One of the most important properties of a video sequence is its temporal continuity, which can be exploited to increase the recognition rate. In this regard, in the past decade, researchers have begun to use spatiotemporal data [1]. A wide variety of popular approaches in this field represent a video sequence as a linear subspace or a nonlinear manifold [2]. The similarity between two sequences is then obtained by measuring the distance between the two subspaces, and most of these methods compute the principal angles between them. In these methods, the recognition process relies mainly on the representation of the video sequence as a linear subspace [3]. In [4], the manifold is partitioned into a number of linear subspaces by Maximal Linear Patch, and the manifold-to-manifold distance is converted into a similarity measure between pairwise linear models. In [5], Wang et al. introduced a discriminative learning method called Manifold Discriminant Analysis (MDA) to solve the recognition problem. MDA determines a mapping matrix on grayscale images such that local models from manifolds with different class labels are better separated while the local data compactness within each manifold is enhanced.
Some methods are based upon still image techniques: they select key frames from a video sequence and then apply still image-based face recognition algorithms. These techniques therefore focus on extracting the key frames from the video sequence as representative frames.
On the other hand, color plays an effective role in biometrics. The RGB color space is an additive color model, and all the other color spaces, which may be more appropriate for many types of visual tasks, can be derived from it [6][7][8][9][10]. For example, the HSV color space has been chosen for image retrieval [11], and it has been demonstrated that the YCbCr color space is more useful than the RGB color model for face detection tasks [12,13]. A conventional way of performing face recognition is to convert the RGB color space into a grayscale image by combining the three color component images. In [14], Wang et al. showed that the human visual system uses color cues for recognition and that using the color space can improve the recognition rate.
In this paper, we compare three color spaces (RGB, YCbCr, and HSV) in video-based face recognition using graph-based discriminant analysis. We first introduce our proposed method, which includes face detection in the video sequence, clustering to obtain the key frames, and graph-based discriminant analysis in color space. An extensive experimental evaluation of the proposed model, together with its comparison across the three color spaces and with other available methods, is presented in Section 3, followed by a discussion of the results.

Proposed algorithm
The proposed algorithm is based upon converting video sequences into clusters, each of which acts as a linear subspace. The center of each cluster is taken as the cluster representative, and the recognition process is completed by comparing the key frames. In this paper, two methods are proposed for clustering. For comparing the key frames, we consider the grayscale image as well as the color space. Three popular color spaces (RGB, YCbCr, and HSV) are used for the mathematical representation of each frame, and graph-based discriminant analysis is used for the recognition process. By introducing the intra-class and inter-class similarity graphs to the color space, the problem becomes one of determining the color component combination vector and the mapping matrix. Two equations are derived through a number of mathematical operations, and an optimal solution is obtained with an iterative algorithm. Figure 1 shows a block diagram of the proposed algorithm. The details of the proposed method are explained in the following parts.

Face detection in video sequence
Face detection is an essential early step in face recognition systems, and real-time face detection is particularly important in video-based face recognition. The cascade AdaBoost face detector proposed by Viola and Jones [15] and face detection based on color information [16] are the two main methods for real-time face detection. The detected face region is then normalized using the distance between the two eyes. There are many methods for eye detection; the most important ones are template matching, eigenspace methods, and the Hough transform. For face detection in the remaining frames, we exploit the fact that neighbouring regions within successive video frames tend to be highly correlated. A search window is therefore defined around the face region found in the first frame, and the correlation between the current face region and each candidate region of the next frame inside the search window is calculated. The candidate with the highest correlation is taken as the face region in the next frame. This procedure is repeated for all frames. The size of the search window has a major effect on the amount of calculation.
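As a rough illustration of the correlation-based tracking step described above, the following sketch searches a window around the previous face position for the patch with the highest correlation. It assumes grayscale frames stored as nested lists and uses normalized cross-correlation as the correlation measure; the function names (`ncc`, `crop`, `track_face`) and the `margin` parameter are hypothetical and are not taken from the original implementation.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation between two equally sized 2-D patches."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    ma = sum(fa) / len(fa)
    mb = sum(fb) / len(fb)
    num = sum((x - ma) * (y - mb) for x, y in zip(fa, fb))
    den = math.sqrt(sum((x - ma) ** 2 for x in fa) *
                    sum((y - mb) ** 2 for y in fb))
    # A constant patch has zero variance; treat it as uncorrelated.
    return num / den if den > 0 else 0.0


def crop(frame, top, left, h, w):
    """Extract an h x w patch whose upper-left corner is (top, left)."""
    return [row[left:left + w] for row in frame[top:top + h]]


def track_face(prev_face, next_frame, top, left, margin):
    """Return (top, left, score) of the best match inside the search window.

    The window (an assumption of this sketch) extends `margin` pixels in
    every direction around the previous face position (top, left).
    """
    h, w = len(prev_face), len(prev_face[0])
    rows, cols = len(next_frame), len(next_frame[0])
    best = (top, left, -2.0)  # any real correlation is >= -1
    for t in range(max(0, top - margin), min(rows - h, top + margin) + 1):
        for l in range(max(0, left - margin), min(cols - w, left + margin) + 1):
            score = ncc(prev_face, crop(next_frame, t, l, h, w))
            if score > best[2]:
                best = (t, l, score)
    return best
```

A larger `margin` tolerates faster head motion but increases the number of candidate positions quadratically, which reflects the role of the search window size noted above.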

Clustering and key frames
Several approaches have been presented for clustering and constructing local linear models from a nonlinear manifold [17][18][19]. Most of them use iterative clustering methods such as k-means to assign the given data to a certain number of clusters. The most important problem with this approach is that the number of clusters must be specified by the user. In addition, the linearity of the local models has a low accuracy, and therefore the clustering result is not optimal. In this paper, two different algorithms are proposed for clustering videos. The first algorithm uses the face positions detected in successive frames of the video sequence. In other words, the motion vectors obtained from successive frames are used to predict the movements of the head and to perform the clustering. This method has the following advantages:
• The number of clusters and the linear models are determined adaptively: the number of clusters depends on the head movement in each video sequence.
• The linearity of the model has a high accuracy.
• The method is very simple because the motion vector information is already available.
• Changes in the direction of the motion vector allow the head direction to be predicted.
In this clustering method, the key frames are selected based on the motion vectors with the largest changes in the horizontal and vertical directions.
In the second clustering method, the first frame of each video is chosen as the center of the first cluster. The correlation between this first key frame and the next frame is then calculated and compared with a suitable threshold to determine whether the second frame belongs to the first cluster. The first frame that does not belong to the first cluster becomes the center of the second cluster. This process is continued over successive frames until all frames are assigned to clusters. After all clusters have been obtained, their key frames are compared with each other against a specified threshold for merging. An appropriate choice of the threshold is important in determining the number of clusters.
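The second clustering method can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the `similarity` argument stands for any correlation measure between two frames, the default threshold of 0.7 is only a placeholder, and the function names `cluster_frames` and `merge_clusters` are hypothetical.

```python
def cluster_frames(frames, similarity, threshold=0.7):
    """Sequentially assign frames to clusters.

    The first frame is the key frame of the first cluster. Each later frame
    joins the current cluster if it is similar enough to that cluster's key
    frame; otherwise it becomes the key frame of a new cluster.
    Returns (key-frame indices, cluster label of every frame).
    """
    centers = [0]
    labels = [0]
    for i in range(1, len(frames)):
        if similarity(frames[centers[-1]], frames[i]) < threshold:
            centers.append(i)          # frame i starts a new cluster
        labels.append(len(centers) - 1)
    return centers, labels


def merge_clusters(frames, centers, labels, similarity, threshold=0.7):
    """Merge clusters whose key frames are similar to an earlier key frame."""
    kept = []      # key-frame indices that survive merging
    remap = {}     # old cluster id -> new cluster id
    for old_id, c in enumerate(centers):
        for new_id, k in enumerate(kept):
            if similarity(frames[k], frames[c]) >= threshold:
                remap[old_id] = new_id
                break
        else:
            remap[old_id] = len(kept)
            kept.append(c)
    return kept, [remap[label] for label in labels]
```

Lowering the threshold produces fewer, larger clusters, while raising it produces more clusters with more homogeneous frames.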

Graph-based discriminant analysis in color space
The usual way for face recognition is to use grayscale images. In the present study, we used color information and compared three popular color spaces: RGB, YCbCr, and HSV. The RGB color space is an additive color model, and all the other color spaces can be derived from it. In this color space, each pixel is represented by three components: red, green, and blue. If an image is represented with the intensity separated from the color information, some processing steps become faster. MPEG compression, which is used in digital component video standards, is coded in the YCbCr color space, in which the luminance and color information are separated: the Y component describes brightness, and Cb and Cr describe color differences rather than colors. The HSV (hue, saturation, value) color space was designed with an emphasis on visual perception.
In the following, face recognition is achieved using graph-based discriminant analysis. Let $A_i$ be a color image with three columns, each color component being a column vector. Suppose that the resolution of each color component is $m \times n$. The color image $A_i$ is then expressed as an $N \times 3$ matrix $A_i = [a_i^{1}, a_i^{2}, a_i^{3}]$, where $a_i^{1}$, $a_i^{2}$, and $a_i^{3}$ are the three color components of a color space (RGB, HSV, or YCbCr) and $N = m \times n$. In this work, we tried to find the optimal coefficients for combining the three color components of the above color spaces. Suppose that $Z_i$ is the combined image, given by (1):
$$Z_i = \alpha_1 a_i^{1} + \alpha_2 a_i^{2} + \alpha_3 a_i^{3} = A_i \alpha \qquad (1)$$
where $\alpha = [\alpha_1, \alpha_2, \alpha_3]^T$ collects the combination coefficients. Our aim was therefore to find the coefficients such that $Z_i$ is the optimal description of $A_i$ for face recognition.
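To make the color representation concrete, the sketch below converts RGB pixels to YCbCr and HSV, stacks an image into an $N \times 3$ matrix, and combines the three components with a coefficient vector. It is an illustration under assumptions: the YCbCr conversion uses the full-range BT.601 (JFIF) coefficients, which may differ from the variant used in the experiments, and the names `to_matrix`, `combine`, and `alpha` are hypothetical.

```python
import colorsys


def rgb_to_ycbcr(pixel):
    """Full-range BT.601 (JFIF) conversion of one 8-bit RGB pixel."""
    r, g, b = pixel
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return (y, cb, cr)


def rgb_to_hsv(pixel):
    """HSV conversion of one 8-bit RGB pixel; all outputs lie in [0, 1]."""
    r, g, b = pixel
    return colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)


def to_matrix(image, convert=None):
    """Flatten an m x n image of 3-tuples into an N x 3 matrix (N = m * n).

    Each column of the result is one color component; `convert` optionally
    maps every RGB pixel into another color space first.
    """
    pixels = [p for row in image for p in row]
    if convert is not None:
        pixels = [convert(p) for p in pixels]
    return [list(p) for p in pixels]


def combine(A, alpha):
    """Combined image Z = A * alpha as a length-N vector."""
    return [sum(a * c for a, c in zip(row, alpha)) for row in A]
```

With `alpha = [1/3, 1/3, 1/3]`, `combine` simply averages the three components, which corresponds to a plain grayscale-like combination; the learned coefficients replace this uniform weighting.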
A graph in our model refers to a collection of vertices (nodes) and a collection of edges that connect pairs of vertices. Given $n$ points $\{Z_1, \ldots, Z_n\}$ from the underlying manifold $M$, with each $Z_i$ belonging to class $c_i$, the similarity score between two points with three color components is defined in terms of the Euclidean norm $\|\cdot\|$. Two adjacency graphs are defined, one for the intra-class case and one for the inter-class case, where $c_i$ is the class label for $Z_i$. The matching score between two points in the new manifold can be computed using the Euclidean distance. Such a mapping can be expressed by minimizing and maximizing two objective functions, (5) and (6), respectively [20]. The minimization problem (5) can be written in the form (7), and the maximization problem (6) in the form (8). This calculation shows that $L_w$ is positive semi-definite.
Equation 7 can be written as Equation 9 in a form that eliminates the arbitrary scaling factor. Equation 9 can then be rewritten as Equation 10 and, similarly, Equation 8 can be rewritten as the maximization problem in Equation 11. Integrating (10) and (11) gives a single objective containing a suitable constant between 0 and 1. To find $V$ and $\alpha$, we first took the derivative of this objective and equated it to zero. Secondly, Equations 5 and 6 were rewritten, which led to a further equation. We therefore have two equations, Equation I and Equation II. To solve them, we used a theorem on the generalized eigen-equation. From this theorem, the solutions to the two equations can be chosen as eigenvectors of the generalized equation (for Equation II, the eigenvector associated with the largest eigenvalue). The optimal $\alpha$ and $V$ can therefore be calculated by an iterative algorithm.
Suppose that an initial combination coefficient vector is given. In the first step, the mapping matrix is calculated from (20). In the second step, we calculated the combination coefficient vector from this mapping matrix. In the next iteration, the updated combination coefficient vector was used as the initial value. The algorithm performs these two steps successively until it converges; convergence may be declared after $n+1$ iterations if the stopping condition is met. Given a test video sequence, we first partitioned it into clusters, and then, using $\alpha_{n+1}$ and $V_{n+1}$, the center of each cluster was mapped to the new space. The similarity between manifolds could therefore be obtained by calculating the distance between the cluster representatives in the new space.
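The positive semi-definiteness of $L_w$ noted above follows from a standard identity for graph Laplacians. Assuming, as is usual in graph-based discriminant analysis, that $L = D - S$ with $D$ the diagonal matrix of row sums of a symmetric similarity graph $S$ (this definition is an assumption of the sketch, not a statement taken from the text), the pairwise objective $\sum_{i,j} S_{ij}(y_i - y_j)^2$ equals $2\, y^T L y$ for any one-dimensional projection $y$. The short check below verifies this numerically; the function names are hypothetical.

```python
def laplacian(S):
    """Graph Laplacian L = D - S, where D holds the row sums of S."""
    n = len(S)
    return [[(sum(S[i]) if i == j else 0.0) - S[i][j] for j in range(n)]
            for i in range(n)]


def pairwise_objective(y, S):
    """Sum over all pairs of S[i][j] * (y[i] - y[j])**2."""
    n = len(y)
    return sum(S[i][j] * (y[i] - y[j]) ** 2
               for i in range(n) for j in range(n))


def quadratic_form(y, L):
    """Quadratic form y^T L y."""
    n = len(y)
    return sum(y[i] * L[i][j] * y[j] for i in range(n) for j in range(n))


# Toy symmetric similarity graph on three points and a 1-D projection.
S = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
y = [1.0, 3.0, 6.0]
L = laplacian(S)
lhs = pairwise_objective(y, S)
rhs = 2 * quadratic_form(y, L)
```

Because the left-hand side is a sum of non-negative terms whenever the entries of $S$ are non-negative, the quadratic form can never be negative, which is exactly the positive semi-definite property.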

Experimental results
The described algorithm was evaluated using the Honda/UCSD database [22] and the CMU MoBo database [23]. The first video database, Honda/UCSD, was collected by K. C. Lee et al. and is used for video-based face recognition. The spatial and temporal resolutions of each video sequence are 640×480 and 15 frames per second, respectively. In this database, every person turns his/her head at different speeds and rotations, and some of the sequences include partial occlusion. The second database, MoBo (Motion of Body), was originally collected for the purpose of human identification at a distance. The considered subset contained 96 sequences of 24 different subjects walking on a treadmill, with 4 videos per person. For each video sequence, after the detection of the face and eyes, the face area was marked by a square whose dimensions are defined in terms of $d$, the distance between the two eyes. Figure 2 shows this face area. Figure 3 illustrates the face regions detected in some frames of the Honda/UCSD database. As described earlier, the first clustering method uses the positions of the face regions obtained within the search window in successive frames, and clustering is performed in accordance with the motion vectors. The key frames are selected based on the highest variation in $dx$ and $dy$ within the cluster. These frames, in addition to the first frame of a video sequence, are considered as the key frames. The second clustering approach was performed on successive frames by comparing the correlations. The number of clusters was determined by the threshold value, which was set to 0.7 in this simulation. Figure 5 illustrates seven clusters obtained by this method.
In the second clustering method, the centers of the clusters were selected as the key frames. Figure 6 shows the key frames of two video sequences in the Honda/UCSD database. Experiments were performed for 5 randomly selected training/test combinations to report the identification rates. To evaluate the proposed approach, all faces were normalized, followed by histogram equalization to reduce the lighting effects, and face recognition experiments were then conducted on the three color spaces (RGB, YCbCr, and HSV) and the grayscale image.
The initial value for the combination coefficient vector was set as $[1/3, 1/3, 1/3]^T$ (averaging the three color component images). After the iterative training process, the algorithm generated one optimal color component combination coefficient vector and the optimal discriminant mapping matrix. The resulting vector determined one discriminating color component. We ran the algorithm on the grayscale images and the three popular color spaces (RGB, YCbCr, and HSV), and then compared its performance with two still image techniques (Eigenface and Fisherface) [24], NN matching in LLE + k-means clustering [18], the Mutual Subspace Method (MSM) [25], the Manifold to Manifold Distance (MMD) method [26], Regularized Nearest Points (RNP) [27], Collaboratively Regularized Nearest Points (CRNP) [28], and Collaborative Mean Attraction (CMA) [29]. Table 1 shows the average recognition rates for the different color spaces and the compared methods on the Honda/UCSD and CMU MoBo databases. It also shows that the proposed algorithm, in color space, achieves a better performance than the classical methods using grayscale images. Therefore, color plays an effective role in face recognition. In the best case, the algorithm achieves an average recognition rate of 97.6% for the YCbCr color space (on the Honda/UCSD database), which is an increase of nearly 7% compared with the proposed method using grayscale images.

Conclusions and future work
In this work, we proposed a novel method for video-based face recognition in color space. To extract the key frames from the video, the input sequences were converted into a number of clusters, each acting as a linear subspace. To compare the centers of the clusters, three popular color spaces (RGB, YCbCr, and HSV) were used for the mathematical representation of each frame, and graph-based discriminant analysis was applied for the recognition process. We introduced an iterative algorithm to determine the optimal color component combination vector and mapping matrix simultaneously. Our experimental results also indicate that the recognition performance with color images is significantly better than with grayscale images. In future work, we intend to explore new methods for computing the manifold-to-manifold distance in color space, which mainly emphasize the similarity of data variation modes between two manifolds, and to compare them with the results of the proposed method.

Figure 1. The block diagram of the proposed algorithm. Stages shown: face detection in the first frame and normalization based on the distance between the two eyes; face detection in subsequent frames based on correlation within the search window; clustering; determination of key frames; graph-based discriminant analysis (choose an initial combination coefficient vector, calculate the mapping matrix from the combination coefficient vector, calculate the combination coefficient vector from the mapping matrix, and repeat with k = k + 1 until convergence).

In the adjacency graphs, $c_j$ is the class label for $Z_j$, and $S_w$ and $S_b$ are the intra-class similarity graph and the inter-class similarity graph, respectively. Our aim was to minimize the distance between the connected points of $S_w$ and to maximize the distance between the connected points of $S_b$ in the feature space. Therefore, the points on the manifold $M$ were mapped to a new manifold $M'$, i.e., $Z_i \to Y_i$.

Figure 2. Face area marked by the distance between the two eyes.

Figure 3. Faces detected in one of the video sequences of the Honda/UCSD database.

Figure 4. Variation of dx and dy in one video sequence of the Honda/UCSD database.

Figure 6. Centers of clusters for two video sequences of the Honda/UCSD database.