Otolaryngology - Head and Neck Surgery (United States), vol.174, no.4, pp.980-988, 2026 (SCI-Expanded, Scopus)
Objective: To compare the diagnostic accuracy, linguistic clarity, and user satisfaction of three large language models (ChatGPT-4.0, Claude 3.7 Sonnet, and OpenAI Mini 3) in managing sudden sensorineural hearing loss. Study Design: Prospective, multi-domain comparative analysis using blinded expert evaluation. Setting: Online artificial intelligence (AI) platforms accessed under standardized conditions. Methods: Twenty-seven questions related to sudden sensorineural hearing loss, covering general knowledge, audiometric interpretation, and clinical case scenarios, were submitted to the three AI models. Responses were evaluated by 10 board-certified otolaryngologists using three validated tools: Quality Assessment of Medical Artificial Intelligence (QAMAI), Artificial Intelligence Performance Instrument (AIPI), and Artificial Intelligence Satisfaction and Performance Evaluation Questionnaire (AISPE-Q). Linguistic complexity was assessed using metrics such as word count, sentence length, lexical diversity, and clinical verb use. Results: ChatGPT-4.0 demonstrated the highest scores in clinical accuracy (QAMAI: 4.57), completeness (4.53), and evaluator satisfaction (AISPE-Q: 94%). Claude 3.7 outperformed the other models in clarity and sentence complexity, while OpenAI Mini 3 exhibited the highest lexical diversity and the most directive tone but scored lower overall. Inter-rater reliability was strong (intraclass correlation coefficient [ICC] > 0.85). Correlation analysis revealed a significant relationship between objective quality and subjective satisfaction (r > 0.76). Conclusion: ChatGPT-4.0 delivered the most clinically aligned and satisfactory responses, whereas Claude 3.7 provided linguistically refined outputs. Our findings support the context-specific application of hybrid large language model approaches in otolaryngology, particularly for patient education, diagnosis, and AI-driven triage. Level of Evidence: 2 (prospective comparative diagnostic accuracy study).