Artificial intelligence in endocrine practice: comparing ChatGPT, Gemini, and Claude for adrenal incidentaloma care


BAŞ AKSU Ö., AYDIN R. F., GÖKÇAY CANPOLAT A., DEMİR Ö., ŞAHİN M., EMRAL R., ...

Journal of Endocrinological Investigation, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Publication Date: 2025
  • DOI: 10.1007/s40618-025-02715-0
  • Journal Name: Journal of Endocrinological Investigation
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, CAB Abstracts
  • Keywords: Adrenal incidentaloma, Artificial intelligence, ChatGPT, Claude, Gemini
  • Affiliated with Ankara University: Yes

Abstract

Purpose: The clinical use of artificial intelligence (AI) is expanding in endocrinology, yet the performance of large language models (LLMs) in managing adrenal incidentalomas remains uncertain. This study aimed to compare the performance of four LLMs (ChatGPT-4o, ChatGPT-o1, Google Gemini 2.0, and Claude 3.5) on guideline-based queries and clinical scenarios involving adrenal incidentalomas.

Methods: In this cross-sectional study, 34 guideline-derived questions and four case scenarios were presented to the LLMs, covering diagnosis, treatment and follow-up, patient questions, and clinical cases. Six endocrinologists evaluated the responses using Likert scales assessing hallucination tendency, quality, usability, reliability, and accuracy. Readability metrics and word counts were also analyzed.

Results: No significant differences were found between models in the diagnosis (p = 0.86–0.72), treatment and follow-up (p = 0.46–0.10), and patient question (p = 0.78–0.10) categories. In complex cases, however, ChatGPT-4o outperformed ChatGPT-o1, with higher scores in hallucination control (6.5 ± 0.8 vs. 4.8 ± 0.8), quality (6.2 ± 0.8 vs. 5.0 ± 0.6), and usability (4.5 ± 0.8 vs. 3.3 ± 0.5) (all p < 0.05). Readability analysis revealed high text complexity (Flesch-Kincaid Grade Level: 10.6–17.4), and inter-rater reliability was excellent (intraclass correlation coefficient: 0.876–0.961, p < 0.001).

Conclusion: LLMs show potential as decision-support tools in adrenal incidentaloma management. While their performance is comparable in routine tasks, significant differences arise in complex cases, highlighting the need for model selection, human oversight, and attention to readability in endocrine practice.
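
For readers unfamiliar with the Flesch-Kincaid Grade Level reported in the Results, the following is a minimal Python sketch of the standard formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. It is not the tool used in the study, which the abstract does not name. The function names and the vowel-group syllable counter are assumptions made for illustration; the heuristic only approximates true syllable counts, so scores will differ somewhat from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Hypothetical helper: approximate syllables as runs of consecutive vowels.

    This is a rough heuristic (assumption), not a dictionary-based count.
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Compute the Flesch-Kincaid Grade Level of an English text."""
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Treat runs of letters (and apostrophes) as words.
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )
```

A grade level in the range reported above (10.6–17.4) corresponds roughly to text pitched at readers with upper secondary to postgraduate education, which is why the abstract flags readability as a concern for patient-facing use.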