Deep Learning Approach for Robust Voice Activity Detection: Integrating CNN and Self-Attention with Multi-Resolution MFCC

Aghajani, Khadijeh

doi:10.22044/jadm.2024.14839.2582

Document Type: Technical Paper

Author

Khadijeh Aghajani

Department of Computer Engineering, Faculty of Engineering and Technology, University of Mazandaran, Babolsar, Iran.

Abstract

Voice Activity Detection (VAD) plays a vital role in audio processing applications such as speech recognition, speech enhancement, telecommunications, satellite telephony, and noise reduction. An accurate VAD method can improve the performance of these systems. In this paper, multi-resolution Mel-Frequency Cepstral Coefficients (MRMFCCs) and their first- and second-order derivatives (delta and delta2) are extracted from the speech signal and fed into a deep model. The proposed model begins with convolutional layers, which are effective at capturing local features and patterns in the data. The captured features are fed into two consecutive multi-head self-attention layers, which let the model selectively focus on the most relevant features across the entire input sequence and thus reduce the influence of irrelevant noise. The combination of convolutional layers and self-attention enables the model to capture both local and global context within the speech signal. The model concludes with a dense layer for classification. The proposed method is evaluated under noisy conditions using 15 different noise types from the NOISEX-92 corpus. The experimental results show that the proposed framework achieves superior performance compared to traditional VAD techniques, even in noisy environments.
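
The abstract names two building blocks: delta and delta2 features, and self-attention. The dependency-free sketch below illustrates them; it is not the paper's implementation. It makes several assumptions: the delta uses the common regression formula with a window of 2 frames, edge frames are padded by repetition, and the attention is a single head with identity projections (Q = K = V), whereas the paper uses learned multi-head layers. The names `delta`, `stack_with_deltas`, and `self_attention` are hypothetical, as is the input layout (a list of per-frame coefficient lists).

```python
import math


def delta(frames, width=2):
    """Regression-based time derivative of a feature sequence.

    frames: list of T frames, each a list of D coefficients.
    width:  number of frames on each side (assumed value, not from the paper).
    Edge frames are handled by repeating the first/last frame.
    """
    T = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                nxt = frames[min(t + n, T - 1)][d]
                prv = frames[max(t - n, 0)][d]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_with_deltas(frames):
    """Concatenate static, delta, and delta2 coefficients per frame."""
    d1 = delta(frames)
    d2 = delta(d1)  # delta2 is the delta of the delta
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention with Q = K = V = x.

    Each output frame is a weighted average of all input frames, so every
    frame can draw on context from the whole sequence.
    """
    scale = math.sqrt(len(x[0]))
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, x))
                    for d in range(len(q))])
    return out


if __name__ == "__main__":
    # Toy 1-D "cepstral" track: a linear ramp over five frames.
    ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    features = stack_with_deltas(ramp)   # 3 values per frame
    context = self_attention(features)   # same shape as features
    print(len(context), len(context[0]))
```

In a full model, convolutional layers would sit between the stacked features and the attention step, and learned query, key, and value projections would replace the identity mapping used here.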

Main Subjects

H.6.5.13. Signal processing

Volume 12, Issue 3
July 2024
Pages 337-347