TransCCNet: enhancing global contextual learning in multispectral remote sensing images through criss-cross and transformer attention mechanisms


Ülkü İ.

JOURNAL OF APPLIED REMOTE SENSING, vol.19, no.4, pp.46516-46531, 2025 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 19 Issue: 4
  • Publication Date: 2025
  • DOI: 10.1117/1.jrs.19.046516
  • Journal Name: JOURNAL OF APPLIED REMOTE SENSING
  • Journal Indexes: Scopus, Science Citation Index Expanded (SCI-EXPANDED), Compendex, INSPEC
  • Pages: pp.46516-46531
  • Ankara University Affiliated: Yes

Abstract

Although semantic segmentation models excel in a wide range of vision tasks, multispectral remote sensing images remain challenging for them due to high inter-class similarity, intra-class variability, and occlusions. This work introduces a transformer-based criss-cross network (TransCCNet), a semantic segmentation architecture that couples recurrent criss-cross attention with a parallel transformer block to capture global context while preserving discriminative local cues across multispectral bands. The criss-cross pathway aggregates long-range dependencies along rows and columns, and the transformer pathway models global spectral–spatial relations. Together, their fused representation improves discrimination under occlusion and class ambiguity. Extensive experiments on three multispectral remote sensing image sets (i.e., RIT-18, WYR, and DSTL) demonstrate consistent gains over UNet, DeepLabV3, CCNet, and UNetFormer in terms of intersection over union (IoU) and F1 score across the RGB, near-infrared (NIR), and normalized difference vegetation index (NDVI) domains. Compared to the baseline CCNet, TransCCNet improves IoU on the NIR modality by up to 7.1% for wheat yellow rust and 2.7% for crop segmentation. The absolute difference between precision and recall, which ranges from 0.017 to 0.242, remains the smallest among all models and is consistent across image sets. These findings indicate that TransCCNet effectively captures global contextual information to mitigate scene complexity in multispectral remote sensing images.
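The criss-cross pathway described in the abstract restricts each position's attention to its own row and column, so that recurring the operation reaches the full grid. The following is a minimal illustrative NumPy sketch of that idea only, not the paper's implementation: it uses a single head, no learned query/key/value projections, and omits the transformer pathway and the fusion step entirely (all simplifications, not taken from the source).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def criss_cross_attention(q, k, v):
    """Simplified criss-cross attention over an (H, W, C) feature map.

    Each position (i, j) attends only to the positions in row i and
    column j, gathering horizontal and vertical context in one pass;
    applying the op recurrently propagates context to every position.
    (Unlike the original formulation, position (i, j) is counted in
    both the row and column slices here -- a deliberate simplification.)
    """
    H, W, C = q.shape
    out = np.zeros_like(v)
    for i in range(H):
        for j in range(W):
            # Keys/values along the criss-cross path of (i, j).
            keys = np.concatenate([k[i, :, :], k[:, j, :]], axis=0)  # (W+H, C)
            vals = np.concatenate([v[i, :, :], v[:, j, :]], axis=0)  # (W+H, C)
            attn = softmax(keys @ q[i, j] / np.sqrt(C))              # (W+H,)
            out[i, j] = attn @ vals                                  # (C,)
    return out
```

Because the attention weights sum to one, feeding a constant value map returns that constant everywhere, which is a quick sanity check on the implementation.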