Egyptian Journal of Otolaryngology, vol. 42, no. 1, 2026 (ESCI, Scopus)
Background: Artificial intelligence (AI) language models are increasingly used in surgical aftercare, yet their performance varies across platforms. This study compared the effectiveness of two large language models, Claude 3.5 Sonnet and ChatGPT-5.0, in providing accurate, clinically relevant guidance for postoperative otoplasty care. Methods: Ten commonly encountered postoperative otoplasty questions were presented to both models. The generated answers were independently assessed by ten ENT specialists using structured Likert-based instruments and predefined clinical evaluation criteria. To evaluate reliability and inter-model differences, several statistical techniques were applied, including t-tests, effect size calculations, sensitivity and specificity analyses, mixed-effects models, and regression-based modeling. Results: Claude 3.5 Sonnet outperformed ChatGPT-5.0 across all evaluation metrics (p < 0.001). Mixed-effects modeling showed a positive model effect (β = 0.752), question-level ROC analysis demonstrated complete separation (AUC = 1.00), PCA supported a dominant single factor explaining 70.86% of the variance in clinician ratings, and inter-rater agreement was higher for Claude 3.5 Sonnet. Conclusion: Claude 3.5 Sonnet exhibited higher accuracy and clinical relevance in postoperative otoplasty management, with statistical validation supporting its reliability in surgical aftercare.
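As an illustration of two of the statistics named in the abstract, the sketch below computes a question-level AUC and a Cohen's d effect size. It is a minimal sketch under stated assumptions, not the authors' analysis: the function names `auc` and `cohens_d`, the variable names `model_a` and `model_b`, and all scores are hypothetical and are not data from the study. The abstract does not say which effect size was used, so the pooled-standard-deviation form of Cohen's d is an assumption. An AUC of 1.00 simply means that every question-level score for one model exceeds every score for the other.

```python
from statistics import mean, stdev


def auc(pos, neg):
    """Probability that a random score from `pos` exceeds one from `neg`.

    Ties count as 0.5. This pairwise form is equivalent to the area
    under the ROC curve when the score is used to tell the groups apart.
    """
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Hypothetical per-question mean ratings on a 1-5 Likert scale.
# These values are invented for illustration; they are NOT the study's data.
model_a = [4.6, 4.4, 4.7, 4.5, 4.8, 4.3, 4.6, 4.5, 4.7, 4.4]
model_b = [3.9, 3.7, 4.0, 3.8, 4.1, 3.6, 3.9, 3.8, 4.0, 3.7]

# Complete separation: the lowest model_a score (4.3) is above the
# highest model_b score (4.1), so the AUC is exactly 1.0.
print("AUC:", auc(model_a, model_b))
print("Cohen's d:", cohens_d(model_a, model_b))
```

With only ten questions per model, complete separation is easier to reach than in a large sample, so an AUC of 1.00 is best read together with the effect size and the inter-rater agreement.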