Document Type : Other


Department of Computer Engineering, University of Mazandaran, Babolsar, Iran.


Emotion recognition has applications in many fields, including human-computer interaction. In recent years, various methods have been proposed to recognize emotion from facial or speech information, whereas the fusion of these two modalities has received less attention. In this paper, the use of face or speech information alone for emotion recognition is examined first. For speech-based emotion recognition, a pre-trained network called YAMNet is used to extract features. The extracted features are passed through a convolutional neural network (CNN) and then fed into a bi-LSTM with an attention mechanism to perform the recognition. For face-based emotion recognition, a deep CNN-based model is proposed. Finally, after reviewing these two approaches, an emotion recognition framework based on the fusion of the two models is proposed. The proposed model is evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), which contains videos of 24 actors (12 men and 12 women) covering 8 emotion categories. The experimental results show that combining face and speech information improves the performance of the emotion recognizer.
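
To make two of the ideas in the abstract concrete, the following is a minimal, dependency-free sketch of attention pooling over a sequence of frame features and of combining the outputs of a speech model and a face model. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the two models are fused, so the weighted averaging of class probabilities (decision-level fusion) shown here is only one plausible choice. The function names, the fixed `query` vector (which would be a learned parameter in a trained network), and all numeric values are hypothetical.

```python
import math

# The eight emotion categories annotated in RAVDESS.
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse a sequence of frame vectors into one vector.

    Each frame is scored by its dot product with `query` (an assumed,
    fixed stand-in for a learned attention parameter); the scores are
    normalized with softmax and used as weights in a weighted sum.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


def late_fusion(p_speech, p_face, speech_weight=0.5):
    """Assumed decision-level fusion: weighted average of class probabilities."""
    fused = [speech_weight * s + (1.0 - speech_weight) * f
             for s, f in zip(p_speech, p_face)]
    total = sum(fused)
    return [x / total for x in fused]


# Hypothetical 2-dimensional features for three time steps.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(frames, query=[1.0, 1.0])

# Hypothetical per-class probabilities from the two single-modality models.
p_speech = [0.10, 0.10, 0.50, 0.05, 0.10, 0.05, 0.05, 0.05]
p_face = [0.05, 0.05, 0.30, 0.05, 0.40, 0.05, 0.05, 0.05]
fused = late_fusion(p_speech, p_face)
predicted = EMOTIONS[fused.index(max(fused))]
print(predicted)
```

In this toy input the face model alone would favor "angry" while the speech model favors "happy"; averaging the two distributions lets the more confident modality decide. Changing `speech_weight` shifts how much each modality contributes, which is the kind of trade-off a fusion framework has to settle.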

