European Journal of Rhinology and Allergy, vol. 9, no. 1, pp. 1-9, 2026 (Scopus)
Objective: To compare the diagnostic accuracy of two advanced large language models (LLMs), ChatGPT-o1 and DeepSeek-V3, on expert-validated simulated otorhinolaryngology cases, and to assess subspecialty-specific performance and inter-rater agreement relative to human specialists.

Methods: A cross-sectional diagnostic accuracy study was conducted using 70 expert-validated clinical vignettes spanning five ENT subspecialties. Two academic otolaryngologists and the two LLMs independently evaluated each case. Both LLMs operated in deterministic mode (temperature = 0) with standardized single-pass prompting in isolated sessions. Diagnostic accuracy, inter-rater agreement (Cohen's κ), and subspecialty-specific performance were analyzed, and a post hoc power analysis (Cohen's h = 0.22; α = 0.05) assessed the ability to detect moderate effect sizes.

Results: Both LLMs achieved 90.0% diagnostic accuracy (63/70), with no significant difference between them (p = 1.00) and substantial inter-model agreement (κ = 0.68). The human evaluators achieved 97.1% and 92.9% accuracy, with fair inter-rater agreement (κ = 0.26). Subspecialty performance was highest in otology and pediatric ENT (both 100%) and in rhinology (92.3%), with greater variability in laryngology and head and neck surgery. Shared error patterns included overestimating malignancy in high-risk patients. The post hoc power analysis showed 78% power to detect moderate differences.

Conclusion: In controlled, vignette-based evaluations, ChatGPT-o1 and DeepSeek-V3 demonstrated diagnostic accuracy approaching expert-level performance across simulated ENT scenarios, with strong inter-model agreement and subspecialty-dependent variability. These findings highlight the potential of LLMs as diagnostic decision-support tools while underscoring the need for multimodal and real-world validation before clinical implementation.
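For readers who want to see how the agreement and power statistics above are computed, the sketch below reproduces them in Python with scikit-learn and statsmodels. It is a minimal illustration, not the authors' analysis code: the per-case correctness vectors are hypothetical, constructed only to be consistent with the reported marginals (63/70 per model, κ ≈ 0.68, p = 1.00), and the power calculation uses an independent two-sample formulation as one plausible choice, since the abstract does not specify the exact design behind its 78% figure.

```python
# Minimal sketch of the abstract's statistics: Cohen's kappa for agreement,
# McNemar's exact test for the paired accuracy comparison, and a post hoc
# power calculation. The case-level data below are assumptions, not the
# study dataset.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.power import NormalIndPower

# Hypothetical concordance pattern consistent with the reported figures:
# 61 cases both models correct, 5 both wrong, 2 + 2 split disagreements,
# giving each model 63/70 correct.
gpt_correct = np.array([1] * 61 + [0] * 5 + [1] * 2 + [0] * 2)
deepseek_correct = np.array([1] * 61 + [0] * 5 + [0] * 2 + [1] * 2)

# Inter-model agreement, kappa = (p_o - p_e) / (1 - p_e).
kappa = cohen_kappa_score(gpt_correct, deepseek_correct)
print(f"Cohen's kappa: {kappa:.2f}")  # prints 0.68

# Paired accuracy comparison via McNemar's exact test, a standard choice
# when both models are scored on the same 70 cases.
table = np.zeros((2, 2), dtype=int)
for a, b in zip(gpt_correct, deepseek_correct):
    table[a, b] += 1
result = mcnemar(table, exact=True)
print(f"McNemar p-value: {result.pvalue:.2f}")  # prints 1.00

# Post hoc power for the reported effect size h = 0.22 at alpha = 0.05,
# using statsmodels' independent two-sample normal approximation; the
# printed value reflects this particular formulation only.
power = NormalIndPower().solve_power(effect_size=0.22, nobs1=70,
                                     alpha=0.05, alternative="two-sided")
print(f"Power to detect h = 0.22 with n = 70: {power:.2f}")
```

With this concordance pattern the script recovers κ = 0.68 and p = 1.00 exactly; the power figure, by contrast, is sensitive to whether the test is treated as paired or independent and one- or two-sided, which is why the calculation is flagged as an assumption.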