H.6.5.13. Signal processing
Khadijeh Aghajani
Abstract
Voice Activity Detection (VAD) plays a vital role in various audio processing applications, such as speech recognition, speech enhancement, telecommunications, satellite telephony, and noise reduction. The performance of these systems can be improved by using an accurate VAD method. In this paper, multiresolution Mel-Frequency Cepstral Coefficients (MRMFCCs) and their first- and second-order derivatives (delta and delta2) are extracted from the speech signal and fed into a deep model. The proposed model begins with convolutional layers, which are effective at capturing local features and patterns in the data. The captured features are fed into two consecutive multi-head self-attention layers. With these two layers, the model can selectively focus on the most relevant features across the entire input sequence, thus reducing the influence of irrelevant noise. The combination of convolutional layers and self-attention enables the model to capture both local and global context within the speech signal. The model concludes with a dense layer for classification. To evaluate the proposed model under noisy conditions, 15 different noise types from the NOISEX-92 corpus are used. The experimental results show that the proposed framework achieves superior performance compared to traditional VAD techniques, even in noisy environments.
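The delta and delta2 features mentioned above can be illustrated with a minimal sketch. The abstract does not say how the derivatives are computed, so this assumes the widely used regression formula d_t = sum_{n=1..N} n (c_{t+n} - c_{t-n}) / (2 sum_{n=1..N} n^2) with edge frames repeated at the boundaries. The function names `delta` and `stack_features` and the window size `N=2` are hypothetical choices for illustration, not taken from the paper.

```python
def delta(frames, N=2):
    """Regression-based time derivative of per-frame coefficients.

    frames: list of frames, each a list of cepstral coefficients.
    N: half-width of the regression window (assumed value, not from the paper).
    """
    T = len(frames)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, N + 1):
                # Clamp indices so the first/last frame is repeated at the edges.
                ahead = frames[min(t + n, T - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_features(frames, N=2):
    """Concatenate static, delta, and delta2 coefficients for every frame."""
    d1 = delta(frames, N)
    d2 = delta(d1, N)  # delta2 is the delta of the delta features
    return [c + a + b for c, a, b in zip(frames, d1, d2)]
```

In a setup like the one described, each stacked frame would triple the feature dimension before being passed to the convolutional layers; how the multiresolution MFCCs themselves are extracted is outside this sketch.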
N. Esfandian; F. Jahani bahnamiri; S. Mavaddati
Abstract
This paper proposes a novel method for voice activity detection based on clustering in the spectro-temporal domain. In the proposed algorithms, an auditory model is used to extract the spectro-temporal features. Gaussian Mixture Model and WK-means clustering methods are used to reduce the dimensionality of the spectro-temporal space. Moreover, the energy and positions of the clusters are used for voice activity detection. Silence and speech are distinguished in each frame using the attributes of the clusters and a threshold value that is updated frame by frame. Because it has the highest energy, the first cluster is treated as the main speech section in the computation. The efficiency of the proposed method was evaluated for silence/speech discrimination in different noisy conditions. Displacement of the clusters in the spectro-temporal domain was used as the criterion for determining the robustness of the features. According to the results, the proposed method improved the speech/non-speech segmentation rate in comparison to temporal and spectral features at low signal-to-noise ratios (SNRs).
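The per-frame decision with an updated threshold can be sketched as follows. The abstract does not give the update rule or the cluster attributes used, so this swaps in a common scheme: an exponentially smoothed noise-floor estimate that is updated only on frames judged to be silence. The input `energies` (one energy value per frame for the dominant cluster), the function name `vad_adaptive`, and the parameters `alpha`, `margin`, and `init_frames` are all hypothetical.

```python
def vad_adaptive(energies, alpha=0.95, margin=2.0, init_frames=3):
    """Label each frame as speech (True) or silence (False).

    energies: per-frame energy of the highest-energy cluster (assumed input).
    alpha: smoothing factor for the noise-floor estimate (assumed value).
    margin: threshold = margin * noise floor (assumed value).
    init_frames: leading frames assumed to contain noise only.
    """
    if len(energies) < init_frames:
        raise ValueError("need at least init_frames frames")
    # Initial noise floor: mean energy of the leading frames.
    noise = sum(energies[:init_frames]) / init_frames
    decisions = []
    for e in energies:
        threshold = margin * noise
        is_speech = e > threshold
        if not is_speech:
            # Track the noise floor only during silence so that speech
            # does not inflate the threshold.
            noise = alpha * noise + (1 - alpha) * e
        decisions.append(is_speech)
    return decisions
```

A short usage example: `vad_adaptive([1.0, 1.0, 1.0, 1.1, 5.0, 6.0, 1.0])` marks only the two high-energy frames as speech. The clustering step (GMM or WK-means over auditory-model features) that would produce such energies is not reproduced here.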