H.6.5.13. Signal processing
Khadijeh Aghajani
Abstract
Voice Activity Detection (VAD) plays a vital role in various audio processing applications, such as speech recognition, speech enhancement, telecommunications, satellite telephony, and noise reduction. The performance of these systems can be enhanced by an accurate VAD method. In this paper, multiresolution Mel-Frequency Cepstral Coefficients (MRMFCCs), along with their first- and second-order derivatives (delta and delta-delta), are extracted from the speech signal and fed into a deep model. The proposed model begins with convolutional layers, which are effective at capturing local features and patterns in the data. The captured features are fed into two consecutive multi-head self-attention layers. With the help of these two layers, the model can selectively focus on the most relevant features across the entire input sequence, reducing the influence of irrelevant noise. The combination of convolutional layers and self-attention enables the model to capture both local and global context within the speech signal. The model concludes with a dense layer for classification. To evaluate the proposed model, 15 noise types from the NOISEX-92 corpus have been used to validate the method under noisy conditions. The experimental results show that the proposed framework outperforms traditional VAD techniques, even in noisy environments.
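A minimal sketch of the architecture described in the abstract (convolutional front end, two consecutive multi-head self-attention layers, dense classification head). The sequence length, feature dimension, layer widths, and number of attention heads are assumptions for illustration, not values from the paper.

```python
# Hypothetical sketch of the described VAD model; all hyperparameters
# (frame count, feature size, filters, heads) are assumed, not from the paper.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_vad_model(n_frames=64, n_feats=120):
    # Input: MRMFCC frames stacked with delta and delta-delta features
    # (n_feats = 3 * number of cepstral coefficients, assumed here).
    inp = layers.Input(shape=(n_frames, n_feats))

    # Convolutional layers capture local spectro-temporal patterns.
    x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(inp)
    x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)

    # Two consecutive multi-head self-attention layers let the model
    # attend to the most relevant frames across the whole sequence.
    x = layers.MultiHeadAttention(num_heads=4, key_dim=16)(x, x)
    x = layers.MultiHeadAttention(num_heads=4, key_dim=16)(x, x)

    # Dense head: speech vs. non-speech decision.
    x = layers.GlobalAveragePooling1D()(x)
    out = layers.Dense(1, activation="sigmoid")(x)

    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_vad_model()
model.summary()
```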
Kh. Aghajani
Abstract
Emotion recognition has applications in various fields, including human-computer interaction. In recent years, various methods have been proposed to recognize emotion from facial or speech information, while the fusion of these two modalities has received less attention. In this paper, the use of face-only and speech-only information for emotion recognition is examined first. For emotion recognition from speech, a pre-trained network called YAMNet is used to extract features. After passing through a convolutional neural network (CNN), the extracted features are fed into a bi-LSTM with an attention mechanism to perform the recognition. For emotion recognition from facial information, a deep CNN-based model is proposed. Finally, after reviewing these two approaches, an emotion recognition framework based on the fusion of the two models is proposed. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), containing videos of 24 actors (12 men and 12 women) across 8 emotion categories, has been used to evaluate the proposed model. The implementation results show that combining face and speech information improves the performance of the emotion recognizer.
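A minimal sketch of the fusion idea described here: a speech branch over YAMNet embeddings (CNN, then bi-LSTM with a simple additive attention pooling) and a CNN face branch, fused by concatenation before the classifier. The YAMNet embedding size, sequence length, image size, and all layer widths are assumptions, as is the late-fusion strategy.

```python
# Hypothetical audio-visual fusion sketch; dimensions and fusion scheme
# are assumptions, not taken from the paper.
import tensorflow as tf
from tensorflow.keras import layers, models

N_CLASSES = 8  # RAVDESS emotion categories

def speech_branch(n_frames=32, emb_dim=1024):
    # Sequence of pre-extracted YAMNet embeddings -> CNN -> bi-LSTM
    # with attention pooling over time steps.
    inp = layers.Input(shape=(n_frames, emb_dim))
    x = layers.Conv1D(128, 3, padding="same", activation="relu")(inp)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    scores = layers.Dense(1)(x)               # (batch, T, 1) attention scores
    weights = layers.Softmax(axis=1)(scores)  # normalize over time
    x = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])
    return inp, x

def face_branch(img_size=96):
    # Deep CNN over face crops.
    inp = layers.Input(shape=(img_size, img_size, 3))
    x = inp
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D()(x)
    x = layers.GlobalAveragePooling2D()(x)
    return inp, x

# Late fusion: concatenate the branch embeddings, then classify.
a_in, a_feat = speech_branch()
f_in, f_feat = face_branch()
fused = layers.Concatenate()([a_feat, f_feat])
fused = layers.Dense(128, activation="relu")(fused)
out = layers.Dense(N_CLASSES, activation="softmax")(fused)

model = models.Model([a_in, f_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```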
Kh. Aghajani
Abstract
Deep-learning-based approaches have been extensively used to detect pulmonary nodules in Computed Tomography (CT) scans. In this study, an automated end-to-end framework with a convolutional network (Conv-net) is proposed to detect lung nodules in CT images. Boundary regression is performed by a direct regression method, in which the offset is predicted from a given point. The proposed framework has two outputs: a pixel-wise classification between nodule and normal, and a direct regression used to determine the four coordinates of the nodule's bounding box. The loss function includes two terms, one for classification and the other for regression. The performance of the proposed method is compared with YOLOv2, with the evaluation performed on the Lung-PET-CT-Dx dataset. The experimental results show that the proposed framework outperforms YOLOv2 and achieves high accuracy in nodule localization and boundary estimation.
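A minimal sketch of a two-headed Conv-net in the spirit of this abstract: a shared backbone with one pixel-wise nodule/normal classification head and one head that directly regresses the four bounding-box coordinates, trained with a two-term loss. Backbone depth, channel counts, the choice of Huber loss for regression, and the loss weighting are assumptions, not details from the paper.

```python
# Hypothetical two-headed detector sketch; architecture and loss choices
# are assumed, not taken from the paper.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_detector(img_size=256):
    inp = layers.Input(shape=(img_size, img_size, 1))  # single CT slice
    x = inp
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D()(x)

    # Head 1: per-pixel nodule vs. normal probability on the feature map.
    cls = layers.Conv2D(1, 1, activation="sigmoid", name="cls")(x)
    # Head 2: per-pixel direct regression of the four bounding-box
    # coordinate offsets (e.g. left, top, right, bottom).
    reg = layers.Conv2D(4, 1, name="reg")(x)
    return models.Model(inp, [cls, reg])

model = build_detector()
# Two-term loss: a classification term plus a regression term. In
# practice the regression loss would be masked to nodule pixels only;
# that masking is omitted here for brevity.
model.compile(optimizer="adam",
              loss={"cls": "binary_crossentropy", "reg": "huber"},
              loss_weights={"cls": 1.0, "reg": 1.0})
```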