H.3.2.2. Computer vision
Mohammad Hossein Khosravi
Abstract
Document Image Quality Assessment (DIQA) is critical for ensuring the reliability of downstream applications such as Optical Character Recognition (OCR), digital archiving, and automated document workflows. In this paper, we propose a deep learning-based DIQA framework using a Siamese neural network architecture with an InceptionV3 backbone. Our model leverages a composite loss function that combines linear regression loss with a monotonic ranking constraint to jointly optimize for score-level accuracy and perceptual consistency. Unlike prior works that rely on handcrafted features or narrow degradation types, our approach generalizes across diverse distortions commonly observed in scanned and photographed documents. Experimental results on the SOC and SmartDoc-QA datasets demonstrate that the proposed model exhibits a strong correlation with OCR accuracy, achieving SROCC values of 0.952 and 0.873, respectively, and outperforming several state-of-the-art DIQA methods.
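The SROCC values quoted in this abstract measure how well predicted quality scores preserve the ordering of OCR accuracies. As a minimal sketch (not the authors' evaluation code), Spearman's rank-order correlation can be computed as the Pearson correlation of average ranks. The function names and the sample values are illustrative assumptions.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def srocc(predicted, reference):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(reference))


# Hypothetical data: predicted quality scores and OCR accuracies (%).
predicted_quality = [0.10, 0.40, 0.35, 0.80]
ocr_accuracy = [10.0, 40.0, 30.0, 90.0]
print(srocc(predicted_quality, ocr_accuracy))
```

Because only ranks enter the computation, any monotonic rescaling of the predicted scores leaves the SROCC unchanged.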
H.3.2.2. Computer vision
Mohammad Jadidi; Kourosh Kiani; Razieh Rastgoo
Abstract
In recent years, deep learning has transformed many domains, including sports analytics. Ball tracking and trajectory analysis has become an increasingly important research area, driven by advances in technology and the growing demand for data-driven insights into athletic performance. In volleyball, a sport characterized by rapid movements and strategic play, accurately tracking the trajectory of the ball is crucial for both training and competitive analysis. This paper proposes novel deep learning models for accurate volleyball ball detection and tracking. By incorporating attention mechanisms into the YOLOv8 and YOLOv10 architectures, our models significantly improve performance, particularly in challenging situations involving occlusions and fast movements. The proposed models outperform the baseline and other models across several metrics. Specifically, they achieve precision of 94.2% and 94.7%, respectively, and recall of 88.1% and 87.6%, respectively, at real-time processing speeds, making them suitable for various sports analytics applications.
H.3.2.2. Computer vision
Mahdi Zarrin; Haniyeh Nikkhah
Abstract
Medical image analysis, crucial for disease diagnosis and treatment, often suffers from the challenge of class imbalance, where the area of normal tissue significantly outweighs that of abnormal regions. Furthermore, the varying class ratios across different images within a dataset complicate the application of uniform loss adjustments. To address these issues and advance automated segmentation, this study proposes a novel deep learning model integrating the strengths of YOLO Version 8's efficient feature extraction modules (SPPF and C2F) within a U-shaped architecture enhanced by a Receptive Field Enhancement (RFE) module. The RFE module, acting as an advanced skip connection, strategically fuses multi-scale features from corresponding and subsequent encoder layers processed through SPPF and C2F to enrich feature transfer and improve receptive field. To specifically tackle the class imbalance and the diversity of class distributions across images, we introduce a novel Adapt Exponential Loss function. This pixel-level loss dynamically adjusts class weights for each image based on its individual lesion-to-total-pixel ratio (k). We evaluated our proposed model and loss function on challenging skin lesion datasets: ISIC 2018, ISIC 2017, and PH2. Our method achieved significant segmentation performance with IoU scores of 86.47%, 85.67%, and 93.13%, and Dice scores of 91.63%, 90.19%, and 96.02% on ISIC 2018, ISIC 2017, and PH2, respectively, demonstrating its effectiveness in accurately delineating skin lesions despite class imbalance and varying lesion proportions. This work contributes a robust framework for medical image segmentation, facilitating more reliable diagnostic tools in dermatology.
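For readers unfamiliar with the metrics in this abstract, the sketch below computes IoU, Dice, and the per-image lesion-to-total-pixel ratio k on flattened binary masks. It is an illustrative assumption about how such quantities are typically computed, not the authors' code, and it does not reproduce the Adapt Exponential Loss, whose formula is not given here.

```python
def lesion_ratio(mask):
    """Fraction of pixels labelled as lesion (the per-image ratio k)."""
    return sum(mask) / len(mask)


def iou_and_dice(pred, truth):
    """Return (IoU, Dice) for two flattened binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    if tp + fp + fn == 0:
        # Both masks are empty: treat as perfect agreement (a convention).
        return 1.0, 1.0
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice


# Hypothetical 4-pixel masks (1 = lesion, 0 = normal tissue).
truth = [1, 0, 1, 0]
pred = [1, 1, 0, 0]
iou, dice = iou_and_dice(pred, truth)
print(iou, dice, lesion_ratio(truth))
```

A per-image weighting scheme would take `lesion_ratio(truth)` as its input, so images with small lesions can be weighted differently from images with large ones.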
H.3.2.2. Computer vision
Fatemeh Asadi-Zeydabadi; Ali Afkari-Fahandari; Elham Shabaninia; Hossein Nezamabadi-pour
Abstract
Farsi optical character recognition remains challenging due to the script’s cursive structure, positional glyph variations, and frequent diacritics. This study conducts a comparative evaluation of five foundational deep learning architectures widely used in OCR to examine their suitability for the distinct characteristics of Farsi script: two lightweight CRNN-based models aimed at efficient deployment and three Transformer-based models designed for advanced contextual modeling. Performance was benchmarked on four publicly available datasets (Shotor and IDPL PFOD2 for printed text, and Iranshahr and Sadri for handwritten text) using word-level accuracy, parameter count, and computational cost as evaluation criteria. CRNN-based models achieved high accuracy on word-level datasets (99.42% on Shotor, 97.08% on Iranshahr, and 98.86% on Sadri) while maintaining smaller model sizes and lower computational demands. However, their accuracy dropped to 78.49% on the larger and more diverse line-level IDPL PFOD2 dataset. Transformer-based models substantially narrowed this performance gap, exhibiting greater robustness to variations in font, style, and layout, with the best model reaching 92.81% on IDPL PFOD2. To the best of our knowledge, this work is among the first comprehensive comparative studies of lightweight CRNN and Transformer-based architectures for Farsi OCR, encompassing both printed and handwritten scripts, and establishes a solid performance baseline for future research and deployment strategies.
H.3.2.2. Computer vision
Maryam Baghi; Kourosh Kiani; Razieh Rastgoo
Abstract
With rapid advancements in information and communication technology, recommender systems have become vital tools across a wide range of online activities and e-commerce processes. Collaborative recommender systems, which utilize user data and contributions to provide suggestions, represent a significant innovation in this field. In this paper, we conduct an analysis of collaborative recommender systems and evaluate their impact on enhancing the efficiency and accuracy of recommendations. To this end, we propose a deep learning approach using a Graph Convolutional Network (GCN), as a special type of Graph Neural Network (GNN). By assigning weights to edges between nodes, scores are calculated for these edges. The importance of the edges varies based on the number of neighboring nodes and their proximity to the target node. The higher the edge score, the more significant the path. To calculate edge weights, we leverage metrics such as Jaccard similarity, cosine similarity, LHN index, and Salton cosine similarity. This approach improves the identification of relationships between nodes and enhances the accuracy of the recommender system. For implementation, we utilized the well-known MovieLens dataset. Ultimately, users were clustered into 18 clusters, with a large number of nodes within each cluster. By clustering users, we increased the number and diversity of recommendations. This significantly improved the performance of the recommender system, yielding promising results.
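The neighbourhood-based similarity indices named in this abstract can be illustrated on plain Python sets. The sketch below uses the standard link-prediction definitions of the Jaccard, Salton (cosine), and LHN indices. How the authors turn these scores into GCN edge weights is not stated here, so the data and function names are assumptions.

```python
from math import sqrt


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two neighbour sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def salton(a, b):
    """Salton (cosine) index: |A ∩ B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def lhn(a, b):
    """Leicht-Holme-Newman index: |A ∩ B| / (|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / (len(a) * len(b))


# Hypothetical users and the sets of movies each one rated.
neighbors = {
    "u1": {"m1", "m2", "m3"},
    "u2": {"m2", "m3", "m4"},
}
a, b = neighbors["u1"], neighbors["u2"]
print(jaccard(a, b), salton(a, b), lhn(a, b))
```

All three indices share the same numerator (the number of common neighbours) and differ only in how they normalize by neighbourhood size.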
H.3.2.2. Computer vision
Homayoun Rastegar; Hassan Khotanlou
Abstract
One of the challenges in digital image processing today is the presence of haze in images. This challenge is particularly prominent when imaging areas with humid and rainy weather. AI-based systems that can be affected include smart traffic control cameras, autonomous vehicles, Video Assistant Referee (VAR) systems in football stadiums, and security and surveillance cameras. This paper therefore proposes a method that mitigates this problem using Self-Supervised Learning (SSL) and deep learning. To this end, a Convolutional Autoencoder Network (CAN) with a Convolutional Block Attention Module (CBAM) is proposed to reduce haze in images. The advantage of the proposed method is that it uses fewer layers and filters than models introduced by previous researchers in this field, while CBAM focuses the network on the more important convolutional channels and image regions. Experiments in this paper reveal that overusing large or numerous convolutional filters to generate diverse features can reduce a model's ability to dehaze images effectively, so the number of filters should be carefully limited. A combined loss function was used to train the proposed architecture. The proposed model was trained and tested on the NH-haze dataset and the Realistic Single Image Dehazing (RESIDE) dataset. To evaluate our method, we used the structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR). The test results show that the proposed architecture outperforms the state of the art in the field.
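As a reminder of what the PSNR metric measures, the following minimal sketch computes it for two images given as flat lists of 8-bit pixel values. The function name, the flat-list representation, and the sample pixels are illustrative assumptions rather than the authors' evaluation code.

```python
from math import inf, log10


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - d) ** 2 for r, d in zip(reference, restored)) / len(reference)
    if mse == 0:
        # Identical images: the ratio is unbounded.
        return inf
    return 10.0 * log10(max_value ** 2 / mse)


# Hypothetical haze-free reference and dehazed output (4 pixels each).
clean = [100, 100, 100, 100]
dehazed = [110, 100, 100, 100]
print(psnr(clean, dehazed))
```

Higher values indicate that the dehazed output is closer to the haze-free reference in the mean-squared-error sense.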
H.3.2.2. Computer vision
Elahe Yadolahi; Sheis Abolmaali
Abstract
Semantic segmentation is a critical task in computer vision, focused on extracting and analyzing detailed visual information. Traditional artificial neural networks (ANNs) have made significant strides in this area, but spiking neural networks (SNNs) are gaining attention for their energy efficiency and biologically inspired time-based processing. However, existing SNN-based methods for semantic segmentation face challenges in achieving high accuracy due to limitations such as quantization errors and suboptimal membrane potential distribution. This research introduces a novel spiking approach based on Spiking-DeepLab, incorporating a Regularized Membrane Potential Loss (RMP-Loss) to address these challenges. Built upon the DeepLabv3 architecture, the proposed model leverages RMP-Loss to enhance segmentation accuracy by optimizing the membrane potential distribution in SNNs. By optimizing the storage of membrane potentials, where values are stored only at the final time step, the model significantly reduces memory usage and processing time. This enhancement not only improves the computational efficiency but also boosts the accuracy of semantic segmentation, enabling more accurate temporal analysis of network behavior. The proposed model also demonstrates better robustness against noise, maintaining its accuracy under varying levels of Gaussian noise, which is common in real-world scenarios. The proposed approach demonstrates competitive performance on standard datasets, showcasing its potential for energy-efficient image processing applications.
H.3.2.2. Computer vision
Razieh Rastgoo
Abstract
Sign language (SL) is the primary mode of communication within the Deaf community. Recent advances in deep learning have led to the development of various applications and technologies aimed at facilitating bidirectional communication between the Deaf and hearing communities. However, challenges remain in the availability of suitable datasets for deep learning-based models. Only a few public large-scale annotated datasets are available for sign sentences, and none exist for Persian Sign Language sentences. To address this gap, we have collected a large-scale dataset comprising 10,000 sign sentence videos corresponding to 100 Persian sign sentences. This dataset includes comprehensive annotations such as the bounding box of the detected hand, class labels, hand pose parameters, and heatmaps. A notable feature of the proposed dataset is that it contains isolated signs corresponding to the sign sentences within the dataset. To analyze the complexity of the proposed dataset, we present extensive experiments and discuss the results. More concretely, the results of the models in key sub-domains relevant to Sign Language Recognition (SLR), including hand detection, pose estimation, real-time tracking, and gesture recognition, have been included and analyzed. Moreover, the results of seven deep learning-based models on the proposed datasets have been discussed. Finally, the results of Sign Language Production (SLP) using deep generative models have been presented. We report the experimental results of these models from these sub-areas, showcasing their performance on the proposed dataset.
H.3.2.2. Computer vision
Shiva Zeymaran; Vali Derhami; Mehran Mehrandezh
Abstract
This paper presents an accurate and efficient method for determining the coordinates of welding seams, addressing a significant challenge in the deployment of welding robots for complex tasks. Despite welding robots’ precision in following predetermined paths, they struggle with seam identification due to noisy industrial environments, stringent accuracy requirements, and computational complexity. Unlike existing approaches, which either rely on random sampling or are limited to simple geometries, our method combines splicing techniques with welding map alignment to handle complex shapes with multiple seams. This research employs a weighted method to integrate point clouds captured by RGB-D cameras, producing a low-noise point cloud. By leveraging the welding map drawn for the parts, the method identifies probable regions for weld seams within the point cloud, substantially reducing the search space. This enables the system to find the weld seam in a timely manner. Because the approximate shape of the weld is known from the available weld map, an innovative technique is then used to accurately locate the weld seam within these regions. Experimental results on fence-shaped structures in a simulated environment show a mean average error of 1.30 mm, a 30% improvement in precision, and a 77% reduction in computation time compared to state-of-the-art methods. The approach's ability to accurately identify weld seams in complex shapes, coupled with its computational efficiency, suggests strong potential for real-world application. By leveraging welding maps and robust point cloud processing techniques, the method is designed to handle noise and variability, key challenges in industrial environments.
H.3.2.2. Computer vision
Mobina Talebian; Kourosh Kiani; Razieh Rastgoo
Abstract
Fingerprint verification has emerged as a cornerstone of personal identity authentication. This research introduces a deep learning-based framework for enhancing the accuracy of this critical process. By integrating a pre-trained Inception model with a custom-designed architecture, we propose a model that effectively extracts discriminative features from fingerprint images. To this end, the input fingerprint image is aligned to a base fingerprint through minutiae vector comparison. The aligned input fingerprint is then subtracted from the base fingerprint to generate a residual image. This residual image, along with the aligned input fingerprint and the base fingerprint, constitutes the three input channels for a pre-trained Inception model. Our main contribution lies in the alignment of fingerprint minutiae, followed by the construction of a color fingerprint representation. Moreover, we collected a dataset, including 200 fingerprint images corresponding to 20 persons, for fingerprint verification. The proposed method is evaluated on two distinct datasets, demonstrating its superiority over existing state-of-the-art techniques. With a verification accuracy of 99.40% on the public Hong Kong Dataset, our approach establishes a new benchmark in fingerprint verification. This research holds the potential for applications in various domains, including law enforcement, border control, and secure access systems.
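The three-channel "color" fingerprint representation described in this abstract can be pictured with a small sketch. It assumes the two images are already aligned and stored as nested lists of 8-bit grey levels. The clipping of the residual to the 0..255 range and the channel order (residual, aligned input, base) are assumptions made for illustration, not details confirmed by the abstract.

```python
def residual_image(base, aligned):
    """Pixel-wise base minus aligned input, clipped to 0..255 (assumed)."""
    return [
        [max(0, min(255, b - a)) for b, a in zip(base_row, aligned_row)]
        for base_row, aligned_row in zip(base, aligned)
    ]


def stack_channels(base, aligned):
    """Build an H x W image whose pixels are (residual, aligned, base) triples."""
    residual = residual_image(base, aligned)
    return [
        [(r, a, b) for r, a, b in zip(res_row, aligned_row, base_row)]
        for res_row, aligned_row, base_row in zip(residual, aligned, base)
    ]


# Hypothetical 2x2 grey-level patches.
base = [[200, 50], [10, 10]]
aligned = [[180, 60], [10, 0]]
stacked = stack_channels(base, aligned)
print(stacked)
```

The resulting triples play the role of the three colour channels that a pre-trained image classifier expects as input.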
H.3.2.2. Computer vision
Zobeir Raisi; Valimohammad Nazarzehi; Rasoul Damani; Esmaeil Sarani
Abstract
This paper explores the performance of various object detection techniques for autonomous vehicle perception by analyzing classical machine learning and recent deep learning models. We evaluate three classical methods, including PCA, HOG, and HOG alongside different versions of the SVM classifier, and five deep learning models (Faster-RCNN, SSD, YOLOv3, YOLOv5, and YOLOv9) on the benchmark INRIA dataset. The experimental results show that although classical methods such as HOG + Gaussian SVM outperform other classical approaches, they are outperformed by deep learning techniques. Furthermore, classical methods have limitations in detecting partially occluded and distant objects and in handling complex clothing, while recent deep learning models are more efficient and perform better on these challenges (YOLOv9 in particular).
H.3.2.2. Computer vision
Masoumeh Esmaeiili; Kourosh Kiani
Abstract
The classification of emotions using electroencephalography (EEG) signals is inherently challenging due to the intricate nature of brain activity. Overcoming inconsistencies in EEG signals and establishing a universally applicable sentiment analysis model are essential objectives. This study introduces an innovative approach to cross-subject emotion recognition, employing a genetic algorithm (GA) to eliminate non-informative frames. The optimal frames identified by the GA then undergo spatial feature extraction using common spatial patterns (CSP) and the logarithm of variance. Subsequently, these features are input into a Transformer network to capture spatial-temporal features, and the emotion classification is performed by a fully connected (FC) layer with a Softmax activation function. The innovations of this paper thus include using a limited number of channels for emotion classification without sacrificing accuracy, selecting optimal signal segments using the GA, and employing the Transformer network for high-accuracy and high-speed classification. The proposed method is evaluated on two publicly accessible datasets, SEED and SEED-V, in two distinct scenarios. It attains mean accuracy rates of 99.96% and 99.51% in the cross-subject scenario, and 99.93% and 99.43% in the multi-subject scenario, for the SEED and SEED-V datasets, respectively. The proposed method outperforms the state of the art (SOTA) in both scenarios on both datasets. Additionally, comparing per-subject accuracy with previous works in the cross-subject scenario further confirms the superiority of the proposed method on both datasets.
H.3.2.2. Computer vision
H. Hosseinpour; Seyed A. Moosavie nia; M. A. Pourmina
Abstract
Virtual view synthesis is an essential part of computer vision and 3D applications. Obtaining a high-quality depth map is the main problem in virtual view synthesis, because the resolution of the depth image is low compared with that of the corresponding color image. In this paper, an efficient and dependable method based on the gradual omission of outliers is proposed to compute reliable depth values. In the proposed method, depth values that are far from the mean of the depth values are omitted gradually. Compared with other state-of-the-art methods, simulation results show that, on average, PSNR is 2.5 dB (8.1%) higher, SSIM is 0.028 (3%) higher, UNIQUE is 0.021 (2.4%) higher, the running time is 8.6 s (6.1%) shorter, and the number of wrong pixels is 1.97 (24.8%) lower.
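One possible reading of "gradual omission of outliers" is sketched below: the candidate depth value farthest from the current mean is dropped, the mean is recomputed, and the process repeats until every remaining value lies within a tolerance. The stopping rule, the `tolerance` parameter, and the function name are assumptions, since the abstract does not specify them.

```python
def robust_depth(values, tolerance):
    """Iteratively drop the value farthest from the mean until all are close.

    Returns the mean of the surviving values and the survivors themselves.
    """
    if not values:
        raise ValueError("at least one depth value is required")
    kept = list(values)
    while len(kept) > 1:
        mean = sum(kept) / len(kept)
        farthest = max(kept, key=lambda v: abs(v - mean))
        if abs(farthest - mean) <= tolerance:
            break  # every remaining value is within the tolerance
        kept.remove(farthest)  # omit one outlier, then recompute the mean
    return sum(kept) / len(kept), kept


# Hypothetical candidate depths for one high-resolution pixel.
candidates = [10, 11, 9, 10, 50]
depth, survivors = robust_depth(candidates, tolerance=2.0)
print(depth, survivors)
```

Removing one value per pass matters because a single large outlier shifts the mean, so values that look far from the initial mean may turn out to be inliers once the outlier is gone.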
H.3.2.2. Computer vision
M. H. Khosravi
Abstract
Image segmentation is an essential and critical process in image processing and pattern recognition. In this paper, we propose a texture-based method to segment an input image into regions. In our method, an entropy-based texture map of the image is extracted, followed by a histogram equalization step to discriminate between different regions. Then, to eliminate unnecessary details and improve robustness against noise, a low-pass filter is used to smooth the image. Next, the appropriate pixons are extracted and passed to a fuzzy c-means clustering stage to obtain the final image segments. Results on several different images indicate that the proposed method segments images better than the other methods.
H.3.2.2. Computer vision
Seyyed A. Hoseini; P. Kabiri
Abstract
This paper proposes a feature-based technique for camera pose estimation in a sequence of wide-baseline images. Camera pose estimation is an important issue in many computer vision and robotics applications, such as augmented reality and visual SLAM. The proposed method can track images captured by a hand-held camera in room-sized workspaces with a maximum scene depth of 3-4 meters. The system can be used in unknown environments with no additional information available from the outside world, except in the first two images, which are used for initialization. Pose estimation is performed using only natural feature points extracted and matched in successive images. In wide-baseline images, unlike consecutive frames of a video stream, the displacement of feature points between consecutive images is large and hence cannot be tracked easily using patch-based methods. To handle this problem, a hybrid strategy is employed to obtain accurate feature correspondences. In this strategy, initial feature correspondences are first found using the similarity of their descriptors, and outlier matches are then removed by applying the RANSAC algorithm. Further, to provide the required set of feature matches, a mechanism based on a by-product of the robust estimator is employed. The proposed method is applied to real indoor data with images in VGA quality (640×480 pixels); on average, the translation error of the camera pose is less than 2 cm, which indicates the effectiveness and accuracy of the proposed approach.
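The outlier-removal step can be illustrated with a deliberately simplified RANSAC loop. Instead of a camera pose, the sketch fits a pure 2D translation between matched points, which needs only one sampled correspondence per hypothesis. The model, tolerance, iteration count, and data are all illustrative assumptions and not the estimator used in the paper.

```python
import random


def ransac_translation(matches, tolerance=1.0, iterations=50, seed=0):
    """Fit a 2D translation to putative matches and return it with its inliers.

    matches: list of ((x1, y1), (x2, y2)) correspondences.
    """
    rng = random.Random(seed)  # seeded for repeatable runs
    best_inliers = []
    for _ in range(iterations):
        # Minimal sample: one correspondence defines a translation hypothesis.
        (x1, y1), (x2, y2) = rng.choice(matches)
        dx, dy = x2 - x1, y2 - y1
        inliers = [
            m for m in matches
            if abs(m[1][0] - m[0][0] - dx) <= tolerance
            and abs(m[1][1] - m[0][1] - dy) <= tolerance
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on the consensus set by averaging the inlier displacements.
    n = len(best_inliers)
    mean_dx = sum(b[0] - a[0] for a, b in best_inliers) / n
    mean_dy = sum(b[1] - a[1] for a, b in best_inliers) / n
    return (mean_dx, mean_dy), best_inliers


# Hypothetical matches: eight consistent with a (5, -3) shift, two wrong.
points = [(0, 0), (1, 2), (2, 1), (3, 5), (4, 4), (6, 2), (7, 7), (9, 1)]
matches = [((x, y), (x + 5, y - 3)) for x, y in points]
matches += [((0, 0), (40, 40)), ((3, 3), (-20, 7))]
translation, inliers = ransac_translation(matches)
print(translation, len(inliers))
```

The matches outside the consensus set are the ones discarded as outliers. A pose estimator follows the same hypothesize-and-verify pattern with a larger minimal sample.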
H.3.2.2. Computer vision
M. Askari; M. Asadi; A. Asilian Bidgoli; H. Ebrahimpour
Abstract
For many years, researchers have studied high-accuracy methods for handwriting recognition and have achieved many significant improvements. However, the speed of these methods has rarely been studied. Given computer hardware limitations, these methods need to run at high speed. One way to increase processing speed is to exploit the computer's parallel processing power. This paper introduces one of the best feature extraction methods for handwriting recognition, called DPP (Derivative Projection Profile), which is employed for isolated Persian handwritten character recognition. In addition to achieving good results, this computationally light feature can be processed easily. A Hamming neural network is used as the classifier. To increase speed, part of the recognition method is executed on GPU (graphics processing unit) cores using the CUDA platform. The HADAF database, the largest isolated Persian character database, is used to evaluate the system. The results show 94.5% accuracy. We also achieved a speed-up of about 5.5 times using the GPU.