Diagnostic Performance of ChatGPT-o1 and DeepSeek-V3 in Expert-Validated Simulated Ear, Nose, and Throat Scenarios: A Comparative Accuracy Study



Taraf N. H., Camalan B. V., Doluoglu S., Arslan E., Ural A., Demiroglu G., et al.

European Journal of Rhinology and Allergy, vol. 9, no. 1, pp. 1-9, 2026 (Scopus)

Abstract

Objective: To compare the diagnostic accuracy of two advanced large language models (LLMs), ChatGPT-o1 and DeepSeek-V3, on expert-validated simulated otorhinolaryngology cases, and to assess subspecialty-specific performance and inter-rater agreement relative to human specialists.

Methods: A cross-sectional diagnostic accuracy study was conducted using 70 expert-validated clinical vignettes spanning five ENT subspecialties. Two academic otolaryngologists and the two LLMs independently evaluated each case. Both LLMs operated in deterministic mode (temperature = 0) with standardized single-pass prompting in isolated sessions. Diagnostic accuracy, inter-rater agreement (Cohen's κ), and subspecialty-specific performance were analyzed, and a post hoc power analysis (Cohen's h = 0.22; α = 0.05) assessed the ability to detect moderate effect sizes.

Results: Both LLMs achieved 90.0% diagnostic accuracy (63/70), with no significant difference between them (p = 1.00) and substantial inter-model agreement (κ = 0.68). The human evaluators achieved 97.1% and 92.9% accuracy, with fair inter-rater agreement (κ = 0.26). Subspecialty performance was highest in otology and pediatric ENT (100%) and in rhinology (92.3%), with greater variability in laryngology and head and neck surgery. Shared error patterns included overestimating malignancy in high-risk patients. The post hoc power analysis indicated 78% power to detect moderate differences.

Conclusion: In controlled, vignette-based evaluations, ChatGPT-o1 and DeepSeek-V3 demonstrated diagnostic accuracy approaching expert-level performance across simulated ENT scenarios, with strong inter-model agreement and subspecialty-dependent variability. These findings highlight the potential of LLMs as diagnostic decision-support tools while underscoring the need for multimodal and real-world validation before clinical implementation.
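The two statistics central to the abstract, Cohen's κ for chance-corrected inter-rater agreement and Cohen's h as the arcsine-transformed effect size between two proportions, can be sketched in a few lines. This is a minimal illustration under the standard textbook definitions; the rating vectors in the usage example are hypothetical and are not the study's data.

```python
from math import asin, sqrt

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labeling the same cases (nominal labels)."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: fraction of cases where the raters match.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    labels = set(ratings_a) | set(ratings_b)
    p_expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in labels
    )
    return (p_observed - p_expected) / (1 - p_expected)

def cohens_h(p1, p2):
    """Cohen's h: effect size between two proportions via the arcsine transform."""
    return abs(2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2)))

# Hypothetical example: two raters over four cases with binary diagnoses.
r1 = ["benign", "benign", "malignant", "malignant"]
r2 = ["benign", "malignant", "benign", "malignant"]
print(cohens_kappa(r1, r2))   # chance-level agreement yields kappa = 0
print(cohens_h(63 / 70, 68 / 70))  # effect size between two accuracy rates
```

By the usual benchmarks, κ ≈ 0.68 (the inter-model value reported above) falls in the "substantial" band, while κ ≈ 0.26 (the human raters) is "fair".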