Document Type: Original/Review Paper
Author
University of Mazandaran
Abstract
The rapid growth of intelligent surveillance systems has increased the demand for accurate and efficient criminal activity recognition methods that can operate in real-world environments. Although conventional deep learning and object detection frameworks have shown promising performance, they often struggle to capture the long-range contextual dependencies and complex interactions present in surveillance scenes. To address these limitations, this study proposes a hybrid deep learning framework that combines the real-time detection capability of YOLOv10 with the global contextual modeling of Vision Transformers (ViT). An attention-guided feature fusion mechanism integrates the local spatial representations extracted by YOLOv10 with the global semantic features generated by the transformer. The framework is evaluated on the UCF-Crime dataset, which comprises fourteen categories of normal and criminal activities, including burglary, robbery, assault, vandalism, shoplifting, and abuse. Surveillance videos are converted into image sequences and analyzed under two experimental scenarios: (I) a standalone YOLOv10 model and (II) the proposed Attention-Guided YOLOv10-ViT framework. Performance is assessed using accuracy, precision, recall, and F1-score. In the experiments, the standalone YOLOv10 model achieves an overall classification accuracy of 88.07%, outperforming the previously reported YOLOv8 baseline. The proposed hybrid framework attains an accuracy of 93.45%, exceeding both YOLOv10 and earlier YOLOv8-ViT architectures. The improvement is most evident in challenging scenarios involving occlusion, illumination changes, cluttered backgrounds, and crowded environments. These results indicate that integrating YOLOv10, Vision Transformers, and attention-guided feature fusion provides a scalable, robust, and real-time solution for intelligent surveillance and public monitoring applications.
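As an illustration of the four evaluation metrics named in the abstract, the following is a minimal sketch of how accuracy, precision, recall, and F1-score might be computed for a multi-class classifier. The abstract does not state the averaging scheme, so macro-averaging over classes is an assumption here; the function name `classification_metrics` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Assumption: macro-averaging (unweighted mean over classes); the
    paper's actual averaging scheme is not given in the abstract.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    # Count per-class true positives, false positives, false negatives.
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    def safe_div(num, den):
        # Return 0.0 when a class has no predictions or no samples.
        return num / den if den else 0.0

    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        precision = safe_div(tp[label], tp[label] + fp[label])
        recall = safe_div(tp[label], tp[label] + fn[label])
        f1 = safe_div(2 * precision * recall, precision + recall)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    n_classes = len(labels)
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision": safe_div(sum(precisions), n_classes),
        "recall": safe_div(sum(recalls), n_classes),
        "f1": safe_div(sum(f1_scores), n_classes),
    }
```

Calling `classification_metrics(true_labels, predicted_labels)` with two equal-length lists of class names returns a dictionary with the four scores. Macro-averaging weights every class equally, which may matter on a dataset where normal footage greatly outnumbers individual crime categories.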
Keywords
- Criminal Activity Recognition
- Intelligent Surveillance Systems
- YOLOv10
- Vision Transformer (ViT)
- Deep Learning
Main Subjects