YOLOv11-Based Explainable Framework for Anomaly Detection in Crowded Scenes Using Attention Fusion

Gozet M., Karakose M., YILMAZ A. E.

15th International Conference on Advanced Computer Information Technologies, ACIT 2025, Hybrid, Sibenik, Croatia, 17 - 19 September 2025, pp.859-862, (Full Text Paper)

Publication Type: Conference Paper / Full Text Paper
DOI Number: 10.1109/acit65614.2025.11185718
City of Publication: Hybrid, Sibenik
Country of Publication: Croatia
Page Numbers: pp.859-862
Keywords: Anomaly detection, Explainable artificial intelligence, TransCAM, UCSD Ped2, Video surveillance, YOLOv11
Affiliated with Ankara University: Yes

Abstract

This study introduces an explainable anomaly detection framework built on the YOLOv11 architecture, targeting object-level irregularities such as bikers, carts, and skaters in crowded scenes. The primary aim is to enhance the interpretability of deep learning models used in visual surveillance by integrating gradient- and attention-based explainability techniques. To this end, we use TransCAM [16], a novel fusion strategy that combines Gradient-weighted Class Activation Mapping (GradCAM) [10] with Transformer-derived attention maps. This fusion yields more precise and semantically coherent visual explanations by highlighting the spatial regions that influence the model's predictions in dense visual contexts. The proposed model is trained on the grayscale UCSD Ped2 dataset, with two data augmentation strategies, brightness variation and horizontal flipping, applied to improve generalization. Experimental evaluation demonstrates the effectiveness of the method in multi-class anomaly detection, achieving a mean Average Precision (mAP@50) of 98.5%, a precision of 94.7%, and a recall of 97.1%.
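
To make the idea of fusing a gradient-based map with an attention map concrete, the following is a minimal sketch in Python with NumPy. It is not the fusion rule used in the paper or in TransCAM [16], which the abstract does not specify. It assumes, purely for illustration, that both maps are already resized to the same spatial shape and that they are combined by an element-wise product after min-max normalization. The function names `normalize` and `fuse_maps` are hypothetical.

```python
import numpy as np


def normalize(m):
    """Rescale a 2-D map to [0, 1]; a constant map becomes all zeros."""
    m = np.asarray(m, dtype=np.float64)
    span = m.max() - m.min()
    if span == 0:
        return np.zeros_like(m)
    return (m - m.min()) / span


def fuse_maps(grad_cam, attention):
    """Illustrative fusion (an assumption, not the paper's rule):
    element-wise product of the two normalized maps, rescaled to [0, 1]."""
    grad_cam = np.asarray(grad_cam, dtype=np.float64)
    attention = np.asarray(attention, dtype=np.float64)
    if grad_cam.shape != attention.shape:
        raise ValueError("maps must share the same spatial shape")
    return normalize(normalize(grad_cam) * normalize(attention))


# Toy 2x2 maps standing in for a GradCAM heatmap and an attention map.
grad_cam = np.array([[0.0, 1.0], [2.0, 4.0]])
attention = np.array([[4.0, 0.0], [2.0, 4.0]])
fused = fuse_maps(grad_cam, attention)
```

Under this assumed rule, a region stays highlighted only when both maps assign it high weight, which is one simple way a fused explanation can end up more localized than either input map.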