B. Z. Mansouri; H.R. Ghaffary; A. Harimi
Abstract
Speech emotion recognition (SER) is a challenging field of research that has attracted attention during the last two decades. Feature extraction has been reported as the most challenging issue in SER systems. Deep neural networks have partially solved this problem in other applications. In order to address this problem, we propose a novel enriched spectrogram calculated from the fusion of wide-band and narrow-band spectrograms. The proposed spectrogram benefits from both high temporal and high spectral resolution. We then feed the resultant spectrogram images to the pre-trained deep convolutional neural network ResNet152. In place of the last layer of ResNet152, we add five additional layers to adapt the model to the present task. All experiments were performed on the popular EmoDB dataset using the leave-one-speaker-out technique, which guarantees that the model is speaker-independent. The model achieves an accuracy of 88.97%, which shows the efficiency of the proposed approach in comparison with other state-of-the-art methods.
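The wide-band/narrow-band fusion described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function name, window lengths, 128×128 output size, nearest-neighbour resampling, and simple averaging fusion are all assumptions.

```python
import numpy as np
from scipy.signal import spectrogram

def enriched_spectrogram(x, fs, wide_win=64, narrow_win=512, out_shape=(128, 128)):
    """Fuse a wide-band spectrogram (short window: good time resolution)
    with a narrow-band one (long window: good frequency resolution)."""
    _, _, S_wide = spectrogram(x, fs, nperseg=wide_win, noverlap=wide_win // 2)
    _, _, S_narrow = spectrogram(x, fs, nperseg=narrow_win, noverlap=narrow_win // 2)

    def resize(S):
        # Nearest-neighbour resampling onto a common (freq, time) grid,
        # followed by log compression of the magnitudes.
        fi = np.linspace(0, S.shape[0] - 1, out_shape[0]).astype(int)
        ti = np.linspace(0, S.shape[1] - 1, out_shape[1]).astype(int)
        return np.log1p(S[np.ix_(fi, ti)])

    # Average the two log-spectrogram images as a simple fusion rule.
    return 0.5 * (resize(S_wide) + resize(S_narrow))
```

The fused image can then be replicated or stacked into three channels to match the input expected by a pre-trained CNN such as ResNet152.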
M. R. Fallahzadeh; F. Farokhi; A. Harimi; R. Sabbaghi-Nadooshan
Abstract
Facial Expression Recognition (FER) is one of the basic ways of interacting with machines and has received growing attention in recent years. In this paper, a novel FER system based on a deep convolutional neural network (DCNN) is presented. Motivated by the powerful ability of DCNNs to learn features and classify images, the goal of this research is to design a compatible and discriminative input for the pre-trained AlexNet DCNN. The proposed method consists of four steps. First, three channels are extracted from each image: the original gray-level image together with its horizontal and vertical gradients, used in place of the red, green, and blue channels of an RGB image as the DCNN input. Second, data augmentation (scaling, rotation, width shift, height shift, zoom, and horizontal and vertical flips) is applied in addition to the original images for training the DCNN. Then, the AlexNet DCNN model is applied to learn high-level features corresponding to the different emotion classes. Finally, transfer learning is implemented and the presented model is fine-tuned on the target datasets. Average recognition accuracies of 92.41% and 93.66% were achieved on the JAFFE and CK+ datasets, respectively. Experimental results on these two benchmark emotional datasets show the promising performance of the proposed model, which can improve upon current FER systems.
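The three-channel input construction in the first step can be sketched as below; the function name and the use of simple first-order differences as the gradient operator are assumptions (the paper may use a different gradient filter).

```python
import numpy as np

def three_channel_input(gray):
    """Stack a gray-level image with its horizontal and vertical
    gradients, mimicking the R, G, B channels of an RGB image."""
    g = gray.astype(float)
    gx = np.zeros_like(g)
    gy = np.zeros_like(g)
    gx[:, 1:] = np.diff(g, axis=1)  # horizontal gradient (column differences)
    gy[1:, :] = np.diff(g, axis=0)  # vertical gradient (row differences)
    return np.stack([g, gx, gy], axis=-1)  # shape (H, W, 3)
```

The resulting (H, W, 3) array can be fed directly to a network pre-trained on RGB images, such as AlexNet.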
H.5. Image Processing and Computer Vision
S. Memar Zadeh; A. Harimi
Abstract
In this paper, a new iris localization method for mobile devices is presented. Our system applies an intensity threshold and a saturation threshold to the captured eye images to determine the iris boundary and the sclera area, respectively. Estimated iris boundary pixels that fall outside the sclera are removed; the remaining pixels mainly lie on the iris boundary inside the sclera. Then, the circular Hough transform is applied to these iris boundary pixels in order to localize the iris. Experiments were performed on 60 iris images taken with an HTC mobile device from 10 different persons, with both left- and right-eye images available per person. We also evaluate the proposed algorithm on the MICHE datasets, which include images captured with an iPhone 5, a Samsung Galaxy S4, and a Samsung Galaxy Tab 2. Experimental evaluation shows that the proposed system can successfully localize the iris in the tested images.
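The final localization step, the circular Hough transform over the surviving boundary pixels, can be sketched as a voting accumulator. This is a generic textbook implementation, not the paper's code; the function name, angular sampling, and radius grid are illustrative assumptions.

```python
import numpy as np

def circular_hough(edge_points, shape, radii, n_angles=72):
    """Each edge pixel votes for every candidate centre lying at
    distance r from it, for each candidate radius r; the accumulator
    maximum gives the best (radius, centre) hypothesis."""
    H = np.zeros((len(radii), shape[0], shape[1]))
    thetas = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
    for ri, r in enumerate(radii):
        for (y, x) in edge_points:
            cy = np.round(y - r * np.sin(thetas)).astype(int)
            cx = np.round(x - r * np.cos(thetas)).astype(int)
            ok = (cy >= 0) & (cy < shape[0]) & (cx >= 0) & (cx < shape[1])
            np.add.at(H, (ri, cy[ok], cx[ok]), 1)  # accumulate votes
    ri, cy, cx = np.unravel_index(np.argmax(H), H.shape)
    return radii[ri], cy, cx
```

In practice the radius range would be restricted to plausible iris sizes for the capture distance of the mobile camera.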
Ali Harimi; Ali Shahzadi; Alireza Ahmadyfard; Khashayar Yaghmaie
Abstract
Speech Emotion Recognition (SER) is a new and challenging research area with a wide range of applications in man-machine interactions. The aim of an SER system is to recognize human emotion by analyzing the acoustics of speech sound. In this study, we propose Spectral Pattern features (SPs) and Harmonic Energy features (HEs) for emotion recognition. These features are extracted from the spectrogram of the speech signal using image processing techniques. For this purpose, details in the spectrogram image are first highlighted using histogram equalization. Then, directional filters are applied to decompose the image into six directional components. Finally, a binary masking approach is employed to extract SPs from the sub-banded images. The proposed HEs are extracted by applying band-pass filters to the spectrogram image. The dimensionality of the extracted features is reduced using a filter-based feature selection algorithm based on the Fisher discriminant ratio. The classification accuracy of the proposed SER system has been evaluated using 10-fold cross-validation on the Berlin database. Average recognition rates of 88.37% and 85.04% were achieved for females and males, respectively. Considering the total numbers of male and female samples, an overall recognition rate of 86.91% was obtained.
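The first preprocessing step, histogram equalization of the 8-bit spectrogram image, can be sketched as below; the function name is hypothetical, and this is the standard CDF-remapping formulation rather than the paper's specific implementation.

```python
import numpy as np

def hist_equalize(img, levels=256):
    """Histogram-equalize an 8-bit image so that faint spectrogram
    details occupy the full dynamic range."""
    hist = np.bincount(img.ravel(), minlength=levels)
    cdf = np.cumsum(hist)
    cdf_min = cdf[hist > 0][0]  # CDF value at the darkest occupied level
    # Remap each level through the normalized CDF (standard equalization).
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * (levels - 1))
    lut = np.clip(lut, 0, levels - 1).astype(np.uint8)
    return lut[img]
```

The equalized image would then be passed through the six directional filters before binary masking extracts the SPs.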
Hossein Marvi; Zeynab Esmaileyan; Ali Harimi
Abstract
The widespread use of Linear Prediction Coefficients (LPC) in speech processing systems has intensified the importance of their accurate computation. This paper is concerned with computing LPC coefficients using evolutionary algorithms: the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Differential Evolution (DE), and Particle Swarm Optimization with Differentially perturbed Velocity (PSO-DV). In this method, the evolutionary algorithms search for the LPC coefficients that predict the original signal with minimum prediction error. To this end, the fitness function is defined as the maximum prediction error in all evolutionary algorithms. The coefficients computed by these algorithms are compared with those obtained by the traditional autocorrelation method in terms of prediction accuracy. Our results show that the coefficients obtained by the evolutionary algorithms predict the original signal with lower prediction error than the autocorrelation method. The maximum prediction errors achieved by the autocorrelation method, GA, PSO, DE, and PSO-DV are 0.35, 0.06, 0.02, 0.07, and 0.001, respectively. This shows that the hybrid algorithm, PSO-DV, is superior to the other algorithms in computing linear prediction coefficients.
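The setup above, evolving LPC coefficients against a max-prediction-error fitness, can be sketched with one of the listed algorithms, Differential Evolution. All names, population sizes, and DE parameters here are illustrative assumptions; the paper's GA, PSO, and PSO-DV variants would swap in different update rules around the same fitness function.

```python
import numpy as np

def lpc_fitness(coeffs, x, order):
    """Maximum absolute prediction error of an order-p linear predictor
    x_hat[n] = sum_k coeffs[k-1] * x[n-k] (the fitness to minimise)."""
    pred = np.zeros_like(x)
    for k in range(1, order + 1):
        pred[order:] += coeffs[k - 1] * x[order - k:-k]
    return np.max(np.abs(x[order:] - pred[order:]))

def de_lpc(x, order=4, pop=30, gens=120, F=0.7, CR=0.9, seed=0):
    """Basic DE loop: mutate with a scaled difference of two random
    members, crossover with the current member, keep the better one."""
    rng = np.random.default_rng(seed)
    P = rng.uniform(-2, 2, (pop, order))           # initial population
    fit = np.array([lpc_fitness(p, x, order) for p in P])
    for _ in range(gens):
        for i in range(pop):
            a, b, c = P[rng.choice(pop, 3, replace=False)]
            mutant = a + F * (b - c)
            trial = np.where(rng.random(order) < CR, mutant, P[i])
            f = lpc_fitness(trial, x, order)
            if f < fit[i]:                          # greedy selection
                P[i], fit[i] = trial, f
    best = np.argmin(fit)
    return P[best], fit[best]
```

A pure sinusoid is a convenient sanity check, since it is exactly predictable by a second-order linear predictor.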