Document Type: Original/Review Paper

Authors

1 Electrical and Computer Engineering Department, Ferdows Branch, Islamic Azad University, Ferdows, Iran.

2 Electrical and Computer Engineering Department, Shahrood Branch, Islamic Azad University, Shahrood, Iran.

Abstract

Speech emotion recognition (SER) is a challenging field of research that has attracted attention over the last two decades. Feature extraction is widely reported as the most challenging issue in SER systems, and deep neural networks have partially solved similar problems in other applications. To address this problem, we propose a novel enriched spectrogram computed from the fusion of wide-band and narrow-band spectrograms; the proposed spectrogram benefits from both high temporal and high spectral resolution. The resulting spectrogram images are fed to the pre-trained deep convolutional neural network ResNet152, whose last layer we replace with five additional layers to adapt the model to the present task. All experiments are performed on the popular EmoDB dataset using the leave-one-speaker-out (LOSO) technique, which guarantees that the model is speaker-independent. The model achieves an accuracy of 88.97%, demonstrating the efficiency of the proposed approach compared with other state-of-the-art methods.
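The abstract only names the building blocks, so the sketch below shows one way the pipeline could be wired together in PyTorch/torchaudio. The fusion rule (an element-wise mean of dB-scaled magnitudes on a shared time-frequency grid) and the exact composition of the five replacement layers are not specified above; both are illustrative assumptions, not the paper's confirmed implementation.

```python
# Minimal sketch of the described pipeline, assuming PyTorch/torchaudio.
# Fusion rule and replacement-head layers are assumptions for illustration.
import torch
import torch.nn as nn
import torchaudio
from torchvision import models

SR = 16_000   # assumed sampling rate of the EmoDB recordings
N_FFT = 1024  # shared FFT size -> both spectrograms share one frequency grid
HOP = 160     # shared 10 ms hop -> both spectrograms share one time grid

# Wide-band analysis: short (~5 ms) window, high temporal resolution.
wide_band = torchaudio.transforms.Spectrogram(
    n_fft=N_FFT, win_length=int(0.005 * SR), hop_length=HOP, power=2.0)
# Narrow-band analysis: long (~30 ms) window, high spectral resolution.
narrow_band = torchaudio.transforms.Spectrogram(
    n_fft=N_FFT, win_length=int(0.030 * SR), hop_length=HOP, power=2.0)
to_db = torchaudio.transforms.AmplitudeToDB(stype="power", top_db=80)

def enriched_spectrogram(waveform: torch.Tensor) -> torch.Tensor:
    """Fuse wide-band and narrow-band spectrograms of a 1-D waveform."""
    # Assumed fusion: element-wise mean of the two dB-scaled spectrograms.
    fused = 0.5 * (to_db(wide_band(waveform)) + to_db(narrow_band(waveform)))
    # ResNet152 expects a 3-channel image; replicate the single channel.
    # (Resizing to 224x224 and ImageNet normalization omitted for brevity.)
    return fused.unsqueeze(0).expand(3, -1, -1)

# Pre-trained ResNet152 with its final layer replaced by five new layers
# (this Linear/ReLU/Dropout/Linear/LogSoftmax stack is one plausible choice).
model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 512),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(512, 7),     # EmoDB covers seven emotion classes
    nn.LogSoftmax(dim=1),  # pair with nn.NLLLoss during training
)
```

Under the paper's protocol, training and evaluation would then iterate over EmoDB's ten speakers, holding one speaker out per fold so that test speakers are never seen during training.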

Keywords
