Document Type: Original/Review Paper

Authors

Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran.

Abstract

Automatic Speaker Verification (ASV) systems have proven to be vulnerable to various types of presentation attacks, among which Logical Access attacks are produced using voice conversion and text-to-speech methods. In recent years, a considerable amount of work has focused on synthetic speech detection, and with the arrival of deep learning-based methods and their success across many fields of computer science, they have become the prevailing tool for this task as well. Most deep neural network-based techniques for synthetic speech detection employ acoustic features based on the Short-Term Fourier Transform (STFT), extracted from the raw audio signal. Recently, however, it has been shown that the Constant Q Transform (CQT) spectrogram can both improve performance and reduce the processing power and time required by deep learning-based synthetic speech detection. In this work, we compare the CQT spectrogram with several of the most widely used STFT-based acoustic features. As secondary objectives, we seek to improve the model's performance as much as possible using methods such as self-attention and one-class learning, and we also address short-duration synthetic speech detection. Finally, we show that the CQT spectrogram-based model not only outperforms the STFT-based acoustic feature extraction methods but also reduces the processing time and resources required to distinguish genuine speech from fake speech. Moreover, the CQT spectrogram-based model ranks well among the best systems reported on the LA subset of the ASVspoof 2019 dataset, especially in terms of Equal Error Rate.
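
For illustration only, the following is a minimal sketch of how the two front-ends compared in this work might be extracted with the librosa library. The sampling rate, hop lengths, number of bins, and bins per octave are assumptions for the example, not the exact configuration used in the paper.

```python
import numpy as np
import librosa


def stft_spectrogram(path, n_fft=512, hop_length=160):
    """Log-magnitude STFT spectrogram from a raw audio file (illustrative parameters)."""
    y, sr = librosa.load(path, sr=16000)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    return librosa.amplitude_to_db(S, ref=np.max)


def cqt_spectrogram(path, hop_length=256, n_bins=84, bins_per_octave=12):
    """Log-magnitude CQT spectrogram from a raw audio file (illustrative parameters)."""
    y, sr = librosa.load(path, sr=16000)
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                           n_bins=n_bins, bins_per_octave=bins_per_octave))
    return librosa.amplitude_to_db(C, ref=np.max)
```

Either spectrogram can then be fed to a convolutional detector as a two-dimensional input; the CQT front-end uses geometrically spaced frequency bins, which is the property exploited in this work.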

Keywords
