Context-Aware Criminal Activity Recognition in Surveillance Images Using an Attention-Guided YOLOv10-Vision Transformer Framework

Mavaddati, Samira

doi:10.22044/jadm.2026.17656.2925

Articles in Press

Document Type: Original/Review Paper

Author

Samira Mavaddati

University of Mazandaran

Abstract

The rapid growth of intelligent surveillance systems has increased the demand for accurate and efficient criminal activity recognition methods capable of operating in real-world environments. Although conventional deep learning and object detection frameworks have demonstrated promising performance, they often struggle to capture long-range contextual dependencies and complex interactions present in surveillance scenes. To address these limitations, this study proposes a hybrid deep learning framework that combines the real-time detection capability of YOLOv10 with the global contextual modeling power of Vision Transformers (ViT). An attention-guided feature fusion mechanism is introduced to effectively integrate local spatial representations extracted by YOLOv10 with global semantic features generated by the transformer architecture. The proposed framework is evaluated on the UCF-Crime dataset, which consists of fourteen categories of normal and criminal activities, including burglary, robbery, assault, vandalism, shoplifting, and abuse. Surveillance videos are converted into image sequences and analyzed under two experimental scenarios: (I) a standalone YOLOv10 model and (II) the proposed Attention-Guided YOLOv10-ViT framework. Performance is assessed using accuracy, precision, recall, and F1-score metrics. Experimental results show that the standalone YOLOv10 model achieves an overall classification accuracy of 88.07%, outperforming the previously reported YOLOv8 baseline. More importantly, the proposed hybrid framework attains an accuracy of 93.45%, exceeding both YOLOv10 and earlier YOLOv8-ViT architectures. The improvement is particularly evident in challenging scenarios involving occlusion, illumination changes, cluttered backgrounds, and crowded environments. The results demonstrate that integrating YOLOv10, Transformers, and attention-guided feature fusion provides a scalable, robust, and real-time solution for intelligent surveillance and public monitoring applications.
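
The abstract does not specify how the attention-guided fusion is built, so the following is only an illustrative sketch of one common reading: each local feature vector acts as a query over a set of global tokens via scaled dot-product attention, and the resulting context vector is concatenated with the local feature. The function name attention_fuse, the toy vectors, the choice of concatenation, and the absence of learned projection matrices are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(local_feats, global_tokens):
    """Hypothetical fusion step (an assumption, not the paper's method).

    local_feats:   list of d-dim vectors standing in for detector features
    global_tokens: list of d-dim vectors standing in for transformer tokens
    Returns (fused, weights); each fused vector has length 2 * d
    (the local feature followed by its attended global context).
    """
    dim = len(local_feats[0])
    scale = math.sqrt(dim)
    fused, all_weights = [], []
    for query in local_feats:
        # Scaled dot-product score of this local feature against each token.
        weights = softmax([dot(query, tok) / scale for tok in global_tokens])
        # Context vector: attention-weighted average of the global tokens.
        context = [
            sum(w * tok[i] for w, tok in zip(weights, global_tokens))
            for i in range(dim)
        ]
        fused.append(list(query) + context)
        all_weights.append(weights)
    return fused, all_weights


# Toy inputs (made up): two local features, three global tokens, d = 2.
local_feats = [[1.0, 0.0], [0.0, 1.0]]
global_tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused, weights = attention_fuse(local_feats, global_tokens)
```

In this toy case the first local feature attends most strongly to the first global token, which points in the same direction. A real system would typically add learned query, key, and value projections and operate on tensors rather than Python lists.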

Main Subjects

H.3.12. Distributed Artificial Intelligence

Journal of AI and Data Mining

Articles in Press, Accepted Manuscript
Available Online from 01 July 2026
