Document Type: Technical Paper

Authors

Department of Electrical and Computer Engineering, Qom University of Technology, Iran.

DOI: 10.22044/jadm.2025.15932.2707

Abstract

Artificial intelligence (AI) has significantly advanced speech recognition applications. However, many existing neural network-based methods struggle with noise, which reduces their accuracy in real-world environments. This study addresses isolated spoken Persian digit recognition (zero to nine) under noisy conditions, with particular attention to phonetically similar digits. A hybrid model combining residual convolutional neural networks and bidirectional gated recurrent units (BiGRU) is proposed, using word units instead of phoneme units to enable speaker-independent recognition. The FARSDIGIT1 dataset, expanded with several data augmentation techniques, is processed using Mel-Frequency Cepstral Coefficients (MFCC) for feature extraction. Experimental results demonstrate the model's effectiveness, achieving 98.53%, 96.10%, and 95.92% accuracy on the training, validation, and test sets, respectively. In noisy conditions, the proposed approach improves recognition accuracy by 26.88% over a phoneme-unit-based LSTM model and outperforms an MLP model using Mel-scale Two-Dimensional Root Cepstrum Coefficient (MTDRCC) features (MTDRCC+MLP) by 7.61%.
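To make the proposed architecture concrete, the following is a minimal sketch, in PyTorch, of one plausible way to combine residual convolutional blocks with a BiGRU over MFCC frames for ten-class digit classification. The layer widths, kernel sizes, and the 13-coefficient MFCC input are illustrative assumptions, not the paper's reported configuration.

# A minimal sketch (not the authors' exact architecture) of a hybrid
# residual-CNN + BiGRU classifier over MFCC features. Channel counts,
# kernel widths, and the 13-coefficient MFCC input are assumptions.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """1D convolutional block with an identity skip connection."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # residual (skip) connection

class CNNBiGRUDigitNet(nn.Module):
    """Residual CNN front end + BiGRU back end for the ten digit classes."""
    def __init__(self, n_mfcc: int = 13, channels: int = 64,
                 gru_hidden: int = 128, n_classes: int = 10):
        super().__init__()
        self.front = nn.Sequential(
            nn.Conv1d(n_mfcc, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            ResidualBlock(channels),
            ResidualBlock(channels),
        )
        self.bigru = nn.GRU(channels, gru_hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * gru_hidden, n_classes)

    def forward(self, mfcc):                # mfcc: (batch, frames, n_mfcc)
        x = mfcc.transpose(1, 2)            # -> (batch, n_mfcc, frames)
        x = self.front(x)                   # -> (batch, channels, frames)
        x = x.transpose(1, 2)               # -> (batch, frames, channels)
        _, h = self.bigru(x)                # h: (2, batch, gru_hidden)
        h = torch.cat([h[0], h[1]], dim=1)  # join both time directions
        return self.head(h)                 # logits over the ten digits

model = CNNBiGRUDigitNet()
logits = model(torch.randn(8, 100, 13))  # 8 utterances, 100 MFCC frames
print(logits.shape)                      # torch.Size([8, 10])

Under these assumptions, each utterance enters as a (frames x 13) MFCC matrix; the residual front end extracts local spectral patterns, and the BiGRU summarizes the whole utterance in both time directions before the linear layer scores the ten digit classes.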

