Kh. Aghajani
Abstract
Emotion recognition has applications in various fields, including human-computer interaction. In recent years, various methods have been proposed to recognize emotion from facial or speech information, while the fusion of the two has received less attention. In this paper, the use of face-only and speech-only information for emotion recognition is examined first. For emotion recognition from speech, a pre-trained network called YAMNet is used to extract features. After passing through a convolutional neural network (CNN), the extracted features are fed into a bi-LSTM with an attention mechanism to perform the recognition. For emotion recognition from facial information, a deep CNN-based model is proposed. Finally, after reviewing these two approaches, an emotion recognition framework based on the fusion of the two models is proposed. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), containing videos of 24 actors (12 men and 12 women) across 8 emotion categories, is used to evaluate the proposed model. The results show that combining face and speech information improves the performance of the emotion recognizer.
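The fusion of the two modalities can be illustrated with a minimal late-fusion sketch: the abstract does not specify the paper's fusion rule, so weighted averaging of class posteriors is an illustrative assumption, and the posterior values and weight `alpha` below are placeholders. Only the 8 RAVDESS emotion classes come from the abstract.

```python
import numpy as np

# Hypothetical late-fusion sketch: combine class posteriors from a
# speech model and a face model by weighted averaging. The weight
# `alpha` and the posterior values are illustrative placeholders;
# the 8 emotion classes are those of RAVDESS.
RAVDESS_CLASSES = ["neutral", "calm", "happy", "sad",
                   "angry", "fearful", "disgust", "surprised"]

def fuse_posteriors(p_speech, p_face, alpha=0.5):
    """Weighted average of two posterior vectors, renormalized."""
    p_speech = np.asarray(p_speech, dtype=float)
    p_face = np.asarray(p_face, dtype=float)
    fused = alpha * p_speech + (1.0 - alpha) * p_face
    return fused / fused.sum()

p_speech = np.array([0.05, 0.05, 0.55, 0.05, 0.10, 0.10, 0.05, 0.05])
p_face   = np.array([0.05, 0.05, 0.40, 0.05, 0.30, 0.05, 0.05, 0.05])
fused = fuse_posteriors(p_speech, p_face)
print(RAVDESS_CLASSES[int(np.argmax(fused))])  # both modalities agree on "happy"
```

In practice the two posterior vectors would come from the speech branch (YAMNet features, CNN, attention bi-LSTM) and the face branch (deep CNN), respectively.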
B. Z. Mansouri; H.R. Ghaffary; A. Harimi
Abstract
Speech emotion recognition (SER) is a challenging field of research that has attracted attention during the last two decades. Feature extraction has been reported as the most challenging issue in SER systems, and deep neural networks have partially solved this problem in other applications. To address it, we propose a novel enriched spectrogram computed by fusing wide-band and narrow-band spectrograms. The proposed spectrogram benefits from both high temporal and high spectral resolution. The resulting spectrogram images are then fed to the pre-trained deep convolutional neural network ResNet152, whose last layer we replace with five additional layers to adapt the model to the present task. All experiments on the popular EmoDB dataset use the leave-one-speaker-out protocol, which guarantees that the model is independent of the speaker. The model achieves an accuracy of 88.97%, which shows the efficiency of the proposed approach in contrast to other state-of-the-art methods.
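The wide-band/narrow-band trade-off can be sketched as follows. A short analysis window gives good time resolution (wide-band), a long window gives good frequency resolution (narrow-band); the paper's exact fusion rule is not given in the abstract, so nearest-neighbour resizing followed by averaging is an illustrative stand-in, and a synthetic tone replaces EmoDB speech.

```python
import numpy as np

def stft_mag(x, win_len, hop):
    """Magnitude spectrogram, shape (freq_bins, time_frames)."""
    win = np.hanning(win_len)
    frames = np.stack([x[i:i + win_len] * win
                       for i in range(0, len(x) - win_len + 1, hop)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def resize_nn(S, shape):
    """Nearest-neighbour resize onto a common grid (illustrative)."""
    r = np.linspace(0, S.shape[0] - 1, shape[0]).round().astype(int)
    c = np.linspace(0, S.shape[1] - 1, shape[1]).round().astype(int)
    return S[np.ix_(r, c)]

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)                # 1 s synthetic stand-in signal

S_wide = stft_mag(x, win_len=64, hop=32)       # short window: time detail
S_narrow = stft_mag(x, win_len=1024, hop=256)  # long window: frequency detail

# Illustrative fusion: average the two on a common 128x128 grid.
enriched = 0.5 * resize_nn(S_wide, (128, 128)) \
         + 0.5 * resize_nn(S_narrow, (128, 128))
print(enriched.shape)
```

The enriched image would then be fed to ResNet152 in place of a single-resolution spectrogram.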
E. Kalhor; B. Bakhtiari
Abstract
Feature selection is one of the most important steps in designing speech emotion recognition systems. Because it is uncertain which speech features relate to which emotions, many features must be taken into account, and identifying the most discriminative ones is therefore necessary. To select appropriate emotion-related speech features, the current paper adopts a multi-task approach: the study treats each speaker as a task and proposes a multi-task objective function for feature selection. As a result, the proposed method chooses a single set of speaker-independent features that are discriminative across all emotion classes. Multi-class classifiers are then applied directly, or binary classifiers are combined to perform multi-class classification. The present work employs two well-known datasets, Berlin and eNTERFACE, and applies the openSMILE toolkit to extract more than 6500 features. After the feature selection phase, the results show that the proposed method selects features that are common across different runs, and its runtime is the lowest among the compared methods. Finally, 7 classifiers are employed; the best performance when faced with a new speaker is 73.76% on the Berlin dataset and 72.17% on the eNTERFACE dataset. These experimental results show that the proposed method is superior to existing state-of-the-art methods.
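The idea of treating each speaker as a task and keeping only features that work for all of them can be sketched as below. The paper's actual multi-task objective is not given in the abstract, so a Fisher-style per-feature score stands in for it, and the data is synthetic: features 0-4 are made discriminative on purpose.

```python
import numpy as np

def fisher_scores(X, y):
    """Between-class over within-class variance, per feature (illustrative)."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    num = sum((X[y == c].mean(axis=0) - mu) ** 2 * (y == c).sum()
              for c in classes)
    den = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0)
              for c in classes)
    return num / (den + 1e-12)

rng = np.random.default_rng(0)
n_feat, top_k = 50, 10
selected_per_speaker = []
for spk in range(4):                       # four synthetic "tasks" (speakers)
    X = rng.normal(size=(60, n_feat))
    y = rng.integers(0, 3, size=60)        # three synthetic emotion classes
    X[:, :5] += 2 * y[:, None]             # make features 0-4 discriminative
    scores = fisher_scores(X, y)
    selected_per_speaker.append(set(np.argsort(scores)[-top_k:]))

# Keep only features selected for every speaker: a speaker-independent set.
common = set.intersection(*selected_per_speaker)
print(sorted(common))
```

The intersection step mimics the abstract's observation that the selected features should be common across runs (here, across speaker-tasks).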
Ali Harimi; Ali Shahzadi; Alireza Ahmadyfard; Khashayar Yaghmaie
Abstract
Speech Emotion Recognition (SER) is a new and challenging research area with a wide range of applications in man-machine interaction. The aim of an SER system is to recognize human emotion by analyzing the acoustics of speech. In this study, we propose Spectral Pattern features (SPs) and Harmonic Energy features (HEs) for emotion recognition. These features are extracted from the spectrogram of the speech signal using image processing techniques. For this purpose, details in the spectrogram image are first highlighted using histogram equalization. Directional filters are then applied to decompose the image into 6 directional components. Finally, a binary masking approach is employed to extract the SPs from the sub-banded images. The proposed HEs are extracted by applying band-pass filters to the spectrogram image. The dimensionality of the extracted features is reduced using a filter feature selection algorithm based on the Fisher discriminant ratio. The classification accuracy of the proposed SER system is evaluated using 10-fold cross-validation on the Berlin database. Average recognition rates of 88.37% and 85.04% are achieved for females and males, respectively. Considering the total numbers of female and male samples, an overall recognition rate of 86.91% is obtained.
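The first preprocessing step, histogram equalization of the spectrogram image, can be sketched as follows; the 8-bit low-contrast image here is synthetic rather than a Berlin-database spectrogram, and the implementation is the classic CDF-based method rather than the paper's code.

```python
import numpy as np

def hist_equalize(img):
    """Classic histogram equalization for an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]              # CDF at the lowest occurring level
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255),
                  0, 255).astype(np.uint8)
    return lut[img]

rng = np.random.default_rng(1)
img = rng.integers(100, 140, size=(64, 64)).astype(np.uint8)  # low contrast
eq = hist_equalize(img)
# Equalization stretches the narrow 100-139 range over the full 0-255 scale,
# highlighting detail before directional filtering.
print(int(img.max() - img.min()), int(eq.max() - eq.min()))
```

After this step, the equalized image would be decomposed by the 6 directional filters and binary-masked to obtain the SPs.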