P. Abdzadeh; H. Veisi
Abstract
Automatic Speaker Verification (ASV) systems have proven to be vulnerable to various types of presentation attacks, among which Logical Access (LA) attacks are generated using voice conversion and text-to-speech methods. In recent years, a great deal of work has concentrated on synthetic speech detection, and with the arrival of deep learning-based methods and their success in various fields of computer science, they have become a prevailing tool for this task as well. Most deep neural network-based techniques for synthetic speech detection have employed acoustic features based on the Short-Time Fourier Transform (STFT), which are extracted from the raw audio signal. Recently, however, it has been found that using the Constant Q Transform (CQT) spectrogram can both improve performance and reduce the processing power and time required by a deep learning-based synthetic speech detection system. In this work, we compare the CQT spectrogram with some of the most widely used STFT-based acoustic features. As secondary objectives, we seek to improve the model's performance as much as possible using methods such as self-attention and one-class learning, and we also address short-duration synthetic speech detection. We find that the CQT spectrogram-based model not only outperforms the STFT-based acoustic feature extraction methods but also reduces the processing time and resources needed to distinguish genuine speech from fake. The CQT spectrogram-based model also places well among the best works on the LA subset of the ASVspoof 2019 dataset, especially in terms of Equal Error Rate.
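For readers unfamiliar with the Equal Error Rate mentioned above, the following is a minimal sketch of how it can be estimated from detector scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that a higher score means "more likely genuine", and the simple threshold sweep are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(genuine_scores, spoof_scores):
    """Estimate the EER and the threshold at which it occurs.

    Assumption (illustrative): higher scores mean "more likely genuine".
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct score observed, in ascending order.
    thresholds = np.unique(np.concatenate([genuine, spoof]))

    # False rejection rate: genuine trials scored below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return eer, thresholds[idx]


if __name__ == "__main__":
    # Toy scores, invented purely for demonstration.
    eer, thr = equal_error_rate([0.9, 0.8, 0.3, 0.6], [0.1, 0.2, 0.7, 0.4])
    print(f"EER = {eer:.2%} at threshold {thr}")
```

A lower EER indicates a better detector; this sketch sweeps only the observed scores, so it does not interpolate between operating points as some evaluation toolkits do.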