Balancing accuracy and efficiency in fine-grained image classification: a systematic comparison of CNN, vision transformer, and hybrid architectures

Öztaş, Demir; Salehi, Hakan; Demir, Özge; Acici, Koray; Güzel, Mehmet; Eraslan, Emre; Sevindik, Mustafa; Akata, Ilgaz; Ekinci, Fatih

doi:10.1007/s00606-026-02005-z

Öztaş D. M., Salehi H., Demir Ö., Acici K., Güzel M. S., Eraslan E. C., ...

OESTERREICHISCHE BOTANISCHE ZEITSCHRIFT, vol. 312, no. 4, pp. 1-19, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 312 Issue: 4
Publication Date: 2026
DOI: 10.1007/s00606-026-02005-z
Journal Name: OESTERREICHISCHE BOTANISCHE ZEITSCHRIFT
Journal Indexes: Scopus, Science Citation Index Expanded (SCI-EXPANDED), BIOSIS
Pages: pp. 1-19
Affiliated with Ankara University: Yes

Abstract

Fine-grained image classification remains a challenging problem in computer vision due to high inter-class visual similarity and pronounced intra-class variability. In this study, a systematic and architecture-centric comparison of convolutional neural networks (CNNs), Vision Transformer (ViT) models, and hybrid architectures is presented using the Oxford 102 Flowers dataset, which comprises 8189 images across 102 visually similar categories. Rather than proposing a new architecture or training strategy, this work focuses on a controlled and comparative analysis of widely adopted deep learning paradigms under a unified experimental framework. All models are evaluated under a strictly controlled experimental protocol with fixed data splits, identical preprocessing, and unified optimization settings to reduce experimental confounding factors and enable architecture-level analysis. The experimental results show that large-scale Vision Transformers achieve the highest absolute classification performance within this benchmark setting, with ViT-L/16 attaining an accuracy of 99.37% and an F1-score of 99.27%. However, this performance is accompanied by substantial computational cost, including 303.4 million parameters and an inference time of 27.4 ms per image. In contrast, hybrid architectures exhibit a more favorable accuracy-efficiency trade-off under the same experimental conditions. Notably, ConvNeXt-V2-Tiny achieves 98.13% accuracy and a 98.05% F1-score while requiring only 27.9 million parameters and an inference time below 1 ms, achieving competitive performance with significantly lower computational complexity. Conventional CNN architectures demonstrate comparatively lower performance within this specific transfer learning configuration, with average accuracy remaining below 90%, suggesting potential limitations in capturing highly subtle inter-class variations under the evaluated setup rather than indicating general architectural inadequacy.
The findings indicate that the highest classification accuracy does not necessarily correspond to the most practically efficient solution, particularly in deployment-constrained environments. Importantly, the conclusions of this study are restricted to the Oxford 102 Flowers dataset under transfer learning and standardized input resolution, and should not be interpreted as universal claims of architectural superiority. By jointly analyzing accuracy, model size, and inference latency, this work provides a structured and decision-oriented assessment of architectural trade-offs for fine-grained image classification. The results offer practical insights for selecting appropriate deep learning architectures under varying computational constraints and contribute to a more nuanced understanding of how convolutional, transformer-based, and hybrid representations behave within a controlled fine-grained benchmark setting.
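
The trade-off described in the abstract can be made concrete with a small arithmetic sketch. The Python below is illustrative only and is not taken from the paper: the names `ModelReport`, `compare`, and `pick_model` are hypothetical, and the budget values in the usage lines are arbitrary. The accuracy, F1, and parameter figures are those quoted in the abstract. Because the ConvNeXt-V2-Tiny latency is reported only as "below 1 ms", the sketch assumes 1.0 ms as an upper bound, so the resulting latency ratio should be read as a lower bound.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModelReport:
    """Hypothetical container for the per-model figures quoted in the abstract."""
    name: str
    accuracy_pct: float
    f1_pct: float
    params_millions: float
    latency_ms: float  # per-image inference time


# Figures quoted in the abstract. ASSUMPTION: the ConvNeXt-V2-Tiny latency is
# only given as "below 1 ms", so 1.0 is used here as an upper bound.
REPORTED: List[ModelReport] = [
    ModelReport("ViT-L/16", 99.37, 99.27, 303.4, 27.4),
    ModelReport("ConvNeXt-V2-Tiny", 98.13, 98.05, 27.9, 1.0),
]


def compare(a: ModelReport, b: ModelReport) -> Dict[str, float]:
    """Return the accuracy gap (percentage points) and cost ratios of a over b."""
    return {
        "accuracy_gap_pp": round(a.accuracy_pct - b.accuracy_pct, 2),
        "param_ratio": round(a.params_millions / b.params_millions, 2),
        # A lower bound, because b's latency is itself an upper bound.
        "latency_ratio_lower_bound": round(a.latency_ms / b.latency_ms, 2),
    }


def pick_model(
    models: List[ModelReport],
    max_params_millions: float,
    max_latency_ms: float,
) -> Optional[ModelReport]:
    """Pick the most accurate model that fits both budgets, or None if none fits."""
    feasible = [
        m for m in models
        if m.params_millions <= max_params_millions
        and m.latency_ms <= max_latency_ms
    ]
    return max(feasible, key=lambda m: m.accuracy_pct, default=None)


if __name__ == "__main__":
    vit, convnext = REPORTED
    print(compare(vit, convnext))
    # Arbitrary example budgets: a constrained device vs. an unconstrained server.
    print(pick_model(REPORTED, max_params_millions=50, max_latency_ms=5))
    print(pick_model(REPORTED, max_params_millions=500, max_latency_ms=50))
```

Under these assumptions, ViT-L/16 gains 1.24 percentage points of accuracy at roughly 10.9 times the parameter count and at least 27.4 times the per-image latency, which is the sense in which the highest accuracy need not be the most practical choice under deployment constraints.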