Skeleton-Based Sign Language Generation Using a Transformer-based Generative Model

Mohammadizand, Rozhin; Rastgoo, Razieh

doi:10.22044/jadm.2025.16369.2759

Articles in Press

Document Type: Original/Review Paper

Authors

Electrical and Computer Engineering Department, Semnan University, Semnan, Iran

Abstract

Sign language is a structured, non-vocal form of communication used primarily by individuals who are deaf or hard of hearing, who often face challenges when interacting with non-signers. Addressing this requires translation systems between sign and spoken language, which encompass both sign language recognition and sign language production. In this work, we focus on sign language production and propose a deep learning framework for generating skeleton-based video representations of sign language at the word level. Our approach employs a conditional Generative Adversarial Network (cGAN) with transformer embeddings in both the generator and the discriminator, augmented with bone-length and joint-angle constraints and a classifier-guided loss to ensure anatomically plausible and semantically consistent gestures. We further introduce a novel loss function to improve human keypoint generation for sign representation. Extensive experiments on three benchmark datasets demonstrate that our method outperforms state-of-the-art approaches on both a statistical metric (MMD) and a perceptual metric (FID), while qualitative analyses confirm that the generated gestures are temporally smooth, anatomically accurate, and semantically meaningful. These results highlight the effectiveness of our model in advancing word-level sign language synthesis.
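As a rough illustration of two ingredients named in the abstract, the sketch below computes a bone-length consistency penalty on a generated skeleton sequence and a Gaussian-kernel MMD between two sample sets. It is a minimal sketch under stated assumptions, not the authors' implementation: the 5-joint chain in `BONES`, the squared-error form of the penalty, the kernel bandwidth `sigma`, and all function names are hypothetical choices made only for this example.

```python
import numpy as np

# Hypothetical 5-joint chain given as (parent, child) index pairs.
# The skeleton layout used in the paper is not specified here.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]


def bone_lengths(seq, bones=BONES):
    """Return per-frame bone lengths, shape (T, B), for seq of shape (T, J, D)."""
    parents = seq[:, [p for p, _ in bones], :]
    children = seq[:, [c for _, c in bones], :]
    return np.linalg.norm(children - parents, axis=-1)


def bone_length_penalty(generated, reference_lengths, bones=BONES):
    """Mean squared deviation of generated bone lengths from reference lengths.

    Assumed form of a bone-length constraint; the paper's exact loss may differ.
    """
    diff = bone_lengths(generated, bones) - reference_lengths
    return float(np.mean(diff ** 2))


def mmd_rbf(x, y, sigma=1.0):
    """Biased estimate of squared MMD between x (n, d) and y (m, d), Gaussian kernel."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances, clipped at zero for numerical safety.
        sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean())
```

In a training loop, a penalty of this kind would typically be added to the adversarial loss with a weighting coefficient, while MMD would be computed at evaluation time on flattened real and generated sequences; both usages are assumptions about a typical setup rather than details taken from the paper.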

Main Subjects

H.6.5.2. Computer vision


Articles in Press, Accepted Manuscript Available Online from 10 February 2026
