Document Type: Original/Review Paper

Author

Electrical and Computer Engineering Faculty, Semnan University, Semnan, Iran.

Abstract

Sign language (SL) is the primary mode of communication within the Deaf community. Recent advances in deep learning have enabled various applications and technologies that facilitate bidirectional communication between the Deaf and hearing communities. However, the availability of suitable datasets for deep learning-based models remains a challenge: only a few public large-scale annotated datasets exist for sign sentences, and none exist for Persian Sign Language sentences. To address this gap, we have collected a large-scale dataset comprising 10,000 sign sentence videos corresponding to 100 Persian sign sentences. The dataset includes comprehensive annotations, such as the bounding box of the detected hand, class labels, hand pose parameters, and heatmaps. A notable feature of the proposed dataset is that it also contains the isolated signs that compose its sign sentences. To analyze the complexity of the proposed dataset, we present extensive experiments and discuss the results. More concretely, we report and analyze the results of models in key sub-domains relevant to Sign Language Recognition (SLR), including hand detection, pose estimation, real-time tracking, and gesture recognition. Moreover, we discuss the results of seven deep learning-based models on the proposed dataset. Finally, we present the results of Sign Language Production (SLP) using deep generative models.
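
To make the described annotation layout concrete, the Python sketch below shows one plausible in-memory representation of a single annotated sample. All class names, field names, shapes, and file paths here (FrameAnnotation, SignSentenceSample, the 21-keypoint assumption) are illustrative assumptions for exposition, not the dataset's published schema.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FrameAnnotation:
    """Per-frame annotation for one video (illustrative fields)."""
    hand_bbox: Tuple[int, int, int, int]   # (x, y, w, h) of the detected hand
    keypoints: List[Tuple[float, float]]   # 2D hand pose landmarks; 21 points assumed
    heatmap_path: str                      # path to the stored keypoint heatmap

@dataclass
class SignSentenceSample:
    """One of the 10,000 annotated sign sentence videos (hypothetical layout)."""
    video_path: str
    sentence_id: int                       # class label in [0, 99] for the 100 sentences
    isolated_sign_ids: List[int]           # isolated signs composing the sentence
    frames: List[FrameAnnotation] = field(default_factory=list)

# Example instantiation with placeholder values:
sample = SignSentenceSample(
    video_path="videos/sentence_042/signer_03.mp4",
    sentence_id=42,
    isolated_sign_ids=[7, 19, 55],
    frames=[FrameAnnotation(hand_bbox=(120, 80, 64, 64),
                            keypoints=[(0.0, 0.0)] * 21,
                            heatmap_path="heatmaps/sentence_042/f000.png")],
)

Linking each sentence to its constituent isolated signs, as the isolated_sign_ids field does here, reflects the dataset feature noted in the abstract: models can be trained or evaluated on isolated signs and sign sentences jointly.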
