H.6.5.2. Computer vision
Rozhin Mohammadizand; Razieh Rastgoo
Abstract
Sign language is a structured, non-vocal form of communication primarily used by individuals who are deaf or hard of hearing, who often face challenges interacting with non-signers. To address this, translation systems between sign and spoken language are essential, encompassing sign language recognition and production. In this work, we focus on sign language production and propose a deep learning framework for generating skeleton-based video representations of sign language at the word level. Our approach employs a conditional Generative Adversarial Network (cGAN) with transformer embeddings in both generator and discriminator, augmented with bone-length and joint-angle constraints and a classifier-guided loss to ensure anatomically plausible and semantically consistent gestures. We further introduce a novel loss function to improve human keypoint generation for sign representation. Extensive experiments on three benchmark datasets demonstrate that our method outperforms state-of-the-art approaches according to statistical (MMD) and perceptual (FID) metrics, while qualitative analyses confirm that the generated gestures are temporally smooth, anatomically accurate, and semantically meaningful. These results highlight the effectiveness of our model in advancing word-level sign language synthesis.
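The abstract does not give the form of the bone-length constraint, so the following is only a minimal sketch of one common way such a penalty can be written: the squared deviation of each generated bone length from a reference length, averaged over frames and bones. The skeleton topology `BONES`, the 2D joint format, and the function names are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical skeleton: each bone is a (parent_joint, child_joint) index pair.
# The real topology used by the authors is not stated in the abstract.
BONES = [(0, 1), (1, 2), (2, 3)]


def bone_lengths(pose, bones=BONES):
    """Euclidean length of every bone in one pose (a list of (x, y) joints)."""
    return [math.dist(pose[a], pose[b]) for a, b in bones]


def bone_length_loss(generated, reference_lengths, bones=BONES):
    """Mean squared deviation of generated bone lengths from reference lengths.

    generated: list of poses (one per frame).
    reference_lengths: one target length per bone, e.g. measured on real data.
    """
    total, count = 0.0, 0
    for pose in generated:
        for length, ref in zip(bone_lengths(pose, bones), reference_lengths):
            total += (length - ref) ** 2
            count += 1
    return total / count


# A frame with unit-length bones, and a frame whose first bone is stretched.
good = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
stretched = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (3.0, 1.0)]
print(bone_length_loss([good, stretched], [1.0, 1.0, 1.0]))
```

In a GAN setting, a term of this kind would typically be added to the generator objective with a weighting factor, so that anatomically implausible skeletons are penalized even when the discriminator is fooled.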
H.6.5.2. Computer vision
Mahdi Davari; Razieh Rastgoo
Abstract
Detecting driver distraction is critically important, as it remains a major contributor to road accidents and traffic-related injuries worldwide. This study introduces a novel hybrid deep learning model that integrates Spatio-Temporal Graph Convolutional Networks (ST-GCN) with a Transformer Encoder and Attention mechanisms to effectively detect distracted driving behaviors. The ST-GCN component captures spatial and temporal dependencies in 3D skeletal motion data, modeling the dynamic body movements of the driver. Following this, a Transformer Encoder is employed to further refine temporal representations by leveraging global attention, allowing the model to understand long-range dependencies and subtle behavioral patterns over time. In addition, an Attention mechanism is applied to emphasize the most informative joints and time frames. To address class imbalance in the dataset, the model uses a focal loss function, which helps focus training on more difficult-to-classify examples. The proposed approach is validated on the 3D skeletal Drive&Act dataset, where it achieves a high accuracy of 97.47%, outperforming existing models, particularly under challenging conditions such as poor lighting and complex driving environments. The system demonstrates strong potential for real-time driver monitoring, offering an intelligent solution to enhance road safety and reduce accident risks through early detection of driver distraction.
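To illustrate how a focal loss shifts training toward hard examples, here is a minimal sketch of the standard multi-class form, FL = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The values `gamma=2.0` and `alpha=1.0` are illustrative defaults, not settings reported in the abstract, and the function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=1.0):
    """Focal loss for a single example.

    The factor (1 - p_t) ** gamma shrinks the loss of well-classified
    examples; with gamma = 0 this reduces to alpha-weighted cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss([5.0, 0.0], target=0)  # confident and correct
hard = focal_loss([0.0, 5.0], target=0)  # confident and wrong
print(easy, hard)
```

Compared with plain cross-entropy, the easy example contributes almost nothing, while the hard example keeps nearly its full loss, which is the behavior the abstract relies on to handle class imbalance.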
H.6.5.2. Computer vision
Havva Askari; Razieh Rastgoo; Kourosh Kiani
Abstract
Drowsiness remains a significant challenge for drivers, often resulting from extended working hours, inadequate sleep, and accumulated fatigue. This condition not only impairs reaction time and decision-making but also contributes to a substantial number of road accidents globally. Therefore, reliable and timely detection of driver drowsiness is essential for enhancing transportation safety and reducing the risk of traffic-related fatalities. With the rapid progress in deep learning, numerous models have been developed to detect driver drowsiness with high accuracy. However, the real-world performance of these models can deteriorate under varying environmental conditions, such as changes in cabin illumination, facial occlusions, and dynamic shadows on the driver’s face. To address these limitations, this paper proposes a robust, real-time driver drowsiness detection model that leverages facial behavioral features and a Transformer-based neural network architecture. The Mediapipe framework is utilized to extract a comprehensive set of facial keypoints, capturing subtle facial movements and expressions indicative of drowsiness. These keypoints are then encoded to form feature vectors that serve as input to the Transformer network, enabling effective temporal modeling of facial dynamics. The proposed model is trained and evaluated on the National Tsing Hua University (NTHU) Driver Drowsiness Detection dataset, achieving a state-of-the-art accuracy of 99.71%, demonstrating its potential for deployment in real-world in-vehicle systems.
H.6.5.2. Computer vision
Kourosh Kiani; Razieh Rastgoo; Alireza Chaji; Sergio Escalera
Abstract
Image inpainting, the process of restoring missing or corrupted regions of an image by reconstructing pixel information, has recently seen considerable advancements through deep learning-based approaches. Aiming to tackle the complex spatial relationships within an image, in this paper, we introduce a novel deep learning-based pre-processing methodology for image inpainting utilizing the Vision Transformer (ViT). Unlike CNN-based methods, our approach leverages the self-attention mechanism of ViT to model global contextual dependencies, improving the quality of inpainted regions. Specifically, we replace masked pixel values with those generated by the ViT, utilizing the attention mechanism to extract diverse visual patches and capture discriminative spatial features. To the best of our knowledge, this is the first instance of such a pre-processing model being proposed for image inpainting tasks. Furthermore, we demonstrate that our methodology can be effectively applied using a pre-trained ViT model with a pre-defined patch size, reducing computational overhead while maintaining high reconstruction fidelity. To assess the generalization capability of the proposed methodology, we conduct extensive experiments comparing our approach with four standard inpainting models across four public datasets. The results validate the efficacy of our pre-processing technique in enhancing inpainting performance, particularly in scenarios involving complex textures and large missing regions.
H.6.5.2. Computer vision
M. Karami; A. Moosavie nia; M. Ehsanian
Abstract
In this paper, we address the problem of automatically arranging the cameras of a 3D system to improve the performance of the depth acquisition procedure. In the absence of ground truth or a priori information, a measure of uncertainty is required to assess the quality of reconstruction. The mathematical model of iso-disparity surfaces provides an efficient way to estimate the depth uncertainty, which is believed to depend on the baseline length, focal length, panning angle, and pixel resolution of a stereo vision system. Accordingly, we first present analytical relations for fast estimation of the uncertainty embedded in depth acquisition; these relations, together with the 3D sampling arrangement, are then used to define a cost function. The optimal camera arrangement is determined by minimizing the cost function with respect to the system parameters, subject to the required constraints. Finally, the proposed algorithm is applied to several 3D models. The simulation results demonstrate a significant improvement (up to 35%) in depth uncertainty in the resulting depth maps compared with the traditional rectified camera setup.