Document Type: Original/Review Paper

Authors

1 Electrical and Computer Engineering Department, Semnan University, Semnan, Iran

2 Electrical and Computer Engineering Department, Semnan University, Semnan, Iran.

DOI: 10.22044/jadm.2025.16157.2735

Abstract

Driver distraction remains a major contributor to road accidents and traffic-related injuries worldwide, which makes its reliable detection critically important. This study introduces a hybrid deep learning model that integrates a Spatio-Temporal Graph Convolutional Network (ST-GCN) with a Transformer Encoder and an Attention mechanism to detect distracted driving behaviors. The ST-GCN component captures spatial and temporal dependencies in 3D skeletal motion data, modeling the driver's dynamic body movements. A Transformer Encoder then refines the temporal representations through global attention, allowing the model to capture long-range dependencies and subtle behavioral patterns over time. In addition, an Attention mechanism emphasizes the most informative joints and time frames. To address class imbalance in the dataset, the model is trained with a focal loss function, which down-weights easily classified examples so that training concentrates on the difficult ones. The proposed approach is validated on the 3D skeletal data of the Drive&Act dataset, where it achieves an accuracy of 97.47%, outperforming existing models, particularly under challenging conditions such as poor lighting and complex driving environments. The system shows strong potential for real-time driver monitoring, offering a means to enhance road safety and reduce accident risk through early detection of driver distraction.
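To make the role of the focal loss concrete, the following is a minimal sketch of the standard multi-class focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), written in plain Python. It is an illustration only, not the authors' implementation: the function names, the focusing parameter gamma = 2.0, the optional per-class weights alpha, and the example logits are all assumptions, since the abstract does not report these settings.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=None):
    """Focal loss for one sample.

    logits : raw class scores (hypothetical model output)
    target : index of the true class
    gamma  : focusing parameter (assumed value; gamma = 0 gives cross-entropy)
    alpha  : optional list of per-class weights (assumed, for class imbalance)
    """
    # Log-probability of the true class, computed via log-softmax for stability.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    log_p_t = logits[target] - log_norm
    p_t = math.exp(log_p_t)

    weight = 1.0 if alpha is None else alpha[target]
    # (1 - p_t)^gamma shrinks the loss of confidently correct samples.
    return -weight * (1.0 - p_t) ** gamma * log_p_t


def cross_entropy(logits, target):
    """Plain cross-entropy, i.e. focal loss with gamma = 0 and no class weights."""
    return focal_loss(logits, target, gamma=0.0)


# Two made-up samples whose true class is 0.
easy = [4.0, 0.0, 0.0]   # confidently correct prediction
hard = [0.0, 1.0, 0.0]   # wrong class scored highest

for name, sample in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(sample, 0)
    fl = focal_loss(sample, 0)
    print(f"{name}: cross-entropy={ce:.4f}  focal={fl:.4f}  ratio={fl / ce:.4f}")
```

The printed ratio of focal loss to cross-entropy is much smaller for the easy sample than for the hard one, which is the effect the abstract describes: well-classified examples contribute little to the gradient, so under-represented or ambiguous activities carry more weight during training.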
