YOLOv11-Based Explainable Framework for Anomaly Detection in Crowded Scenes Using Attention Fusion


Gozet M., Karakose M., YILMAZ A. E.

15th International Conference on Advanced Computer Information Technologies, ACIT 2025, Hybrid, Sibenik, Croatia, 17 - 19 September 2025, pp.859-862, (Full Text Paper)

  • Publication Type: Conference Paper / Full Text Paper
  • DOI Number: 10.1109/acit65614.2025.11185718
  • City of Publication: Hybrid, Sibenik
  • Country of Publication: Croatia
  • Page Numbers: pp.859-862
  • Keywords: Anomaly detection, Explainable artificial intelligence, TransCAM, UCSD Ped2, Video surveillance, YOLOv11
  • Affiliated with Ankara University: Yes

Abstract

This study introduces an explainable anomaly detection framework built on the YOLOv11 architecture, targeting object-level irregularities such as bikers, carts, and skaters in crowded scenes. The primary aim is to enhance the interpretability of deep learning models employed in visual surveillance by integrating gradient- and attention-based explainability techniques. To this end, we use TransCAM [16], a novel fusion strategy that combines Gradient-weighted Class Activation Mapping (GradCAM) [10] with Transformer-derived attention maps. This fusion yields more precise and semantically coherent visual explanations by highlighting the spatial regions that influence the model's predictions in dense visual contexts. The proposed model is trained on the grayscale UCSD Ped2 dataset, with two data augmentation strategies, brightness variation and horizontal flipping, applied to improve generalizability. Experimental evaluation demonstrates the effectiveness of the method in multi-class anomaly detection, achieving a mean Average Precision (mAP@50) of 98.5%, with a precision of 94.7% and a recall of 97.1%.
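
The idea of fusing a gradient-based map with an attention map can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so the element-wise product of min-max normalized maps used here is an assumption chosen for illustration, not the authors' TransCAM implementation; the function names `min_max_normalize` and `fuse_maps` are hypothetical, and the maps are plain nested lists standing in for real model outputs.

```python
def min_max_normalize(heatmap):
    """Rescale a 2-D heatmap (list of rows) to the range [0, 1]."""
    flat = [value for row in heatmap for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant map carries no spatial information; return all zeros.
        return [[0.0 for _ in row] for row in heatmap]
    return [[(value - lo) / (hi - lo) for value in row] for row in heatmap]


def fuse_maps(grad_cam, attention):
    """Fuse two same-sized maps (assumed rule: element-wise product).

    A region stays bright only when both the gradient-based map and the
    attention map mark it as relevant.
    """
    same_rows = len(grad_cam) == len(attention)
    same_cols = all(len(a) == len(b) for a, b in zip(grad_cam, attention))
    if not (same_rows and same_cols):
        raise ValueError("maps must have the same shape")
    cam = min_max_normalize(grad_cam)
    att = min_max_normalize(attention)
    product = [
        [c * a for c, a in zip(cam_row, att_row)]
        for cam_row, att_row in zip(cam, att)
    ]
    # Renormalize so the fused map again spans [0, 1] for visualization.
    return min_max_normalize(product)


# Toy 2x2 maps (made-up values, not results from the paper).
toy_grad_cam = [[0, 2], [4, 8]]
toy_attention = [[1, 1], [3, 5]]
fused = fuse_maps(toy_grad_cam, toy_attention)
```

With a product rule, a location highlighted by only one of the two maps is suppressed, which is one plausible way to obtain sharper explanations in dense scenes; other rules, such as a weighted sum, would trade that selectivity for coverage.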