Document Type: Original/Review Paper


1 Department of Electrical Engineering, Qaemshahr Branch, Islamic Azad University, Qaemshahr, Iran.

2 Department of Computer Engineering, Aryan Institute of Science and Technology, Babol, Iran.

3 Department of Electrical Engineering, Faculty of Engineering and Technology, University of Mazandaran, Babolsar, Iran.


This paper proposes a novel method for voice activity detection (VAD) based on clustering in the spectro-temporal domain. In the proposed algorithms, an auditory model is used to extract spectro-temporal features. Gaussian mixture model (GMM) and WK-means clustering methods are then used to reduce the dimensionality of the spectro-temporal space, and the energy and positions of the resulting clusters are used for voice activity detection. In each frame, silence and speech are distinguished using the cluster attributes and an updated threshold value. Because it has the higher energy, the first cluster is treated as the main speech component in the computation. The efficiency of the proposed method was evaluated for silence/speech discrimination under different noisy conditions. The displacement of clusters in the spectro-temporal domain was used as the criterion for assessing the robustness of the features. According to the results, the proposed method improved the speech/non-speech segmentation rate compared with temporal and spectral features at low signal-to-noise ratios (SNRs).
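The decision rule outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: each frame is represented as a list of 2-D spectro-temporal positions with one energy weight per position, an energy-weighted k-means stands in for the WK-means step, the first frame is assumed to contain noise only, and the names weighted_kmeans, detect_speech, margin and alpha are hypothetical.

```python
import random


def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Energy-weighted k-means on 2-D points (hypothetical stand-in for WK-means).

    Returns the cluster centers and the summed energy of each cluster.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Move each center to the energy-weighted mean of its members.
        for c in range(k):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == c]
            total = sum(w for _, w in members)
            if total > 0:
                centers[c] = (
                    sum(p[0] * w for p, w in members) / total,
                    sum(p[1] * w for p, w in members) / total,
                )
    energies = [sum(w for w, l in zip(weights, labels) if l == c) for c in range(k)]
    return centers, energies


def detect_speech(frames, k=2, margin=2.0, alpha=0.9):
    """Return one speech (True) / silence (False) flag per frame.

    frames: list of (points, weights) pairs, one pair per frame.
    margin and alpha are assumed tuning parameters, not values from the paper.
    """
    threshold = None
    flags = []
    for points, weights in frames:
        _, energies = weighted_kmeans(points, weights, k)
        top = max(energies)  # energy of the dominant ("first") cluster
        if threshold is None:
            # Assumption: the first frame is noise-only and seeds the threshold.
            threshold = margin * top
            flags.append(False)
            continue
        is_speech = top > threshold
        if not is_speech:
            # Update the threshold only in frames judged to be silence.
            threshold = alpha * threshold + (1 - alpha) * margin * top
        flags.append(is_speech)
    return flags
```

In this sketch the threshold tracks the dominant-cluster energy of silent frames only, so a slowly changing noise floor raises or lowers it without speech frames inflating it. The cluster positions, which the abstract also uses, are returned by weighted_kmeans but are not used in this simplified decision.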

