H.6.5.2. Computer vision
Rozhin Mohammadizand; Razieh Rastgoo
Abstract
Sign language is a structured, non-vocal form of communication primarily used by individuals who are deaf or hard of hearing, who often face challenges interacting with non-signers. To address this, translation systems between sign and spoken language are essential, encompassing sign language recognition and production. In this work, we focus on sign language production and propose a deep learning framework for generating skeleton-based video representations of sign language at the word level. Our approach employs a conditional Generative Adversarial Network (cGAN) with transformer embeddings in both generator and discriminator, augmented with bone-length and joint-angle constraints and a classifier-guided loss to ensure anatomically plausible and semantically consistent gestures. We further introduce a novel loss function to improve human keypoint generation for sign representation. Extensive experiments on three benchmark datasets demonstrate that our method outperforms state-of-the-art approaches according to statistical (MMD) and perceptual (FID) metrics, while qualitative analyses confirm that the generated gestures are temporally smooth, anatomically accurate, and semantically meaningful. These results highlight the effectiveness of our model in advancing word-level sign language synthesis.
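The abstract does not give the form of the bone-length constraint, so the following is only a minimal sketch of one common way such a term could be written: penalize the squared difference between the bone lengths of a generated skeleton and those of a reference skeleton. The joint layout, the `BONES` list, and the function names are assumptions made for illustration, not the authors' formulation.

```python
import math

def bone_lengths(joints, bones):
    """Return the Euclidean length of each bone, given as a pair of joint indices."""
    return [math.dist(joints[a], joints[b]) for a, b in bones]

def bone_length_loss(generated, reference, bones):
    """Mean squared difference between generated and reference bone lengths."""
    gen = bone_lengths(generated, bones)
    ref = bone_lengths(reference, bones)
    return sum((g - r) ** 2 for g, r in zip(gen, ref)) / len(bones)

# Hypothetical 2D arm: shoulder -> elbow -> wrist.
BONES = [(0, 1), (1, 2)]
reference = [(0.0, 0.0), (3.0, 4.0), (3.0, 6.0)]  # bone lengths 5 and 2
generated = [(0.0, 0.0), (3.0, 4.0), (3.0, 7.0)]  # forearm stretched to length 3

loss = bone_length_loss(generated, reference, BONES)
print(loss)
```

In a cGAN such a term would typically be added to the generator objective with a weighting coefficient, so that implausible limb proportions are penalized alongside the adversarial loss.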
H.6.5.2. Computer vision
Mahdi Davari; Razieh Rastgoo
Abstract
Detecting driver distraction is critically important, as it remains a major contributor to road accidents and traffic-related injuries worldwide. This study introduces a novel hybrid deep learning model that integrates Spatio-Temporal Graph Convolutional Networks (ST-GCN) with a Transformer Encoder and Attention mechanisms to effectively detect distracted driving behaviors. The ST-GCN component captures spatial and temporal dependencies in 3D skeletal motion data, modeling the dynamic body movements of the driver. Following this, a Transformer Encoder is employed to further refine temporal representations by leveraging global attention, allowing the model to understand long-range dependencies and subtle behavioral patterns over time. In addition, an Attention mechanism is applied to emphasize the most informative joints and time frames. To address class imbalance in the dataset, the model uses a focal loss function, which helps focus training on more difficult-to-classify examples. The proposed approach is validated on the 3D skeletal Drive&Act dataset, where it achieves a high accuracy of 97.47%, outperforming existing models, particularly under challenging conditions such as poor lighting and complex driving environments. The system demonstrates strong potential for real-time driver monitoring, offering an intelligent solution to enhance road safety and reduce accident risks through early detection of driver distraction.
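As a rough illustration of why focal loss helps with class imbalance, the sketch below implements the standard multi-class form, FL = -alpha * (1 - p_t)^gamma * log(p_t), for a single sample. The values `gamma=2.0` and `alpha=1.0` and the toy probability vectors are assumptions; the abstract does not report the settings used.

```python
import math

def cross_entropy(probs, target):
    """Standard cross-entropy for one sample, given class probabilities."""
    return -math.log(probs[target])

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: cross-entropy scaled by (1 - p_t) ** gamma."""
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# Toy predictions over three classes; the true class is index 0 in both cases.
easy = [0.9, 0.05, 0.05]  # confidently correct
hard = [0.2, 0.5, 0.3]    # misclassified

for name, probs in (("easy", easy), ("hard", hard)):
    print(name, cross_entropy(probs, 0), focal_loss(probs, 0))
```

With gamma set to 0 the expression reduces to ordinary cross-entropy. With gamma above 0, the well-classified sample is down-weighted far more strongly than the misclassified one, which shifts the training signal toward difficult examples.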
H.3. Artificial Intelligence
Rasoul Hosseinzadeh; Mahdi Sadeghzadeh
Abstract
Attention mechanisms have significantly advanced machine learning and deep learning across domains including natural language processing (NLP), computer vision, and multimodal systems. This paper presents a comprehensive survey of attention mechanisms in Transformer architectures, emphasizing their evolution, design variants, and domain-specific applications in these three areas. We categorize attention types by goals such as efficiency, scalability, and interpretability, and provide a comparative analysis of their strengths, limitations, and suitable use cases. The survey also addresses the lack of visual intuition, offering a clearer taxonomy and a discussion of hybrid approaches, such as sparse-hierarchical combinations. Beyond foundational mechanisms, we cover theoretical underpinnings and practical trade-offs. The paper identifies current challenges in computation, robustness, and transparency, offers a structured classification, and proposes future directions. By comparing state-of-the-art techniques, this survey aims to guide researchers in selecting and designing attention mechanisms best suited to specific AI applications, ultimately fostering the development of more efficient, interpretable, and adaptable Transformer-based models.
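For readers unfamiliar with the foundational mechanism that such surveys build on, here is a minimal pure-Python sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. It is a teaching example with made-up toy inputs, not code from the survey; the variants the survey discusses mostly change how the score matrix is computed or sparsified.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Toy inputs: one query aligned with the first of two keys.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, attn = attention(Q, K, V)
print(attn[0], out[0])
```

The nested loops make the quadratic cost in sequence length explicit: every query is scored against every key, which is the bottleneck that efficiency-oriented variants aim to reduce.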
H.3.8. Natural Language Processing
Nura Esfandiari; Kourosh Kiani; Razieh Rastgoo
Abstract
Chatbots are computer programs designed to simulate human conversation. Powered by artificial intelligence (AI), and in particular by large language models (LLMs), they are increasingly used to provide customer service. Fine-tuning an LLM is one way to personalize chatbot answers, but it demands substantial high-quality data and computational resources. To overcome the computational hurdles of fine-tuning, this article proposes a hybrid approach that enhances the answers generated by LLMs, specifically for Persian chatbots used in mobile customer services. A transformer-based evaluation model was developed to score the generated answers and select the most appropriate one. Additionally, a Persian-language dataset tailored to the domain of mobile sales was collected to support the personalization of the Persian chatbot and the training of the evaluation model. This approach is expected to foster increased customer interaction and boost sales within the Persian mobile phone market. Experiments conducted on four different LLMs demonstrated the effectiveness of the proposed approach in generating more relevant and semantically accurate answers for users.
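The score-and-select step can be sketched as a simple "generate several candidates, keep the best-scoring one" loop. In the sketch below, `overlap_score` is a deliberately crude, hypothetical stand-in for the transformer-based evaluation model described in the abstract, and the question and candidate answers are invented English examples rather than items from the Persian dataset.

```python
def overlap_score(question, answer):
    """Hypothetical stand-in scorer: Jaccard overlap of lowercase word sets."""
    q = set(question.lower().split())
    a = set(answer.lower().split())
    union = q | a
    return len(q & a) / len(union) if union else 0.0

def select_best(question, candidates, score_fn):
    """Return the candidate answer with the highest score for this question."""
    return max(candidates, key=lambda ans: score_fn(question, ans))

question = "what is the battery capacity of this phone"
candidates = [
    "the battery capacity of this phone is listed on the product page",
    "we have many phones in stock",
    "please contact support",
]

best = select_best(question, candidates, overlap_score)
print(best)
```

Because `select_best` only needs a callable that maps a question and an answer to a number, a learned evaluation model could be substituted for `overlap_score` without changing the selection logic, and the LLMs themselves stay frozen.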