Digital Health, vol. 12, 2026 (SCI-Expanded, SSCI, Scopus)
Background: Large language models such as ChatGPT are increasingly used by patients seeking perioperative information, yet their reliability for anesthesia-related patient education remains insufficiently evaluated. This study assessed the quality of ChatGPT-4.0 responses to frequently asked anesthesia questions using a multi-rater evaluation framework.

Methods: Twenty-two common anesthesia-related patient questions were identified through an online search. Each question was submitted once to ChatGPT-4.0 (GPT-4-turbo; chat.openai.com) without follow-up prompts. Five anesthesiology and reanimation specialists, each with more than 20 years of experience, independently evaluated every response using a validated 4-point Likert-type scale (1 = excellent; 4 = unsatisfactory). Inter-rater reliability was calculated using a two-way random-effects model (ICC[2,1]).

Results: A total of 110 ratings were collected. Of these, 61.8% were classified as excellent, 32.7% as satisfactory requiring minimal clarification, and 5.5% as satisfactory requiring moderate clarification. No responses were rated as unsatisfactory. Mean scores for individual questions ranged from 1.0 to 2.4. Reviewer-wise averages ranged from 1.27 to 1.73, indicating generally positive evaluations with modest variability in scoring strictness. The overall inter-rater reliability was poor to fair (ICC = 0.25).

Conclusions: ChatGPT-4.0 provided high-quality responses to frequently asked patient questions about anesthesia and may serve as a supportive digital health tool for patient education. However, limited agreement among evaluators highlights the need for expert oversight and contextual refinement when integrating large language models into clinical communication pathways.
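The abstract does not include the analysis scripts; the sketch below illustrates how an ICC(2,1) of the kind reported here could be computed from the 22 × 5 rating matrix (22 questions, 5 reviewers), assuming the standard Shrout–Fleiss two-way random-effects, absolute-agreement, single-rater formulation. The function name and the synthetic ratings are illustrative only and are not taken from the study's actual data or code.

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: n_targets x n_raters matrix (here, 22 questions x 5 reviewers).
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-question means
    col_means = ratings.mean(axis=0)   # per-reviewer means

    # Mean squares from the two-way ANOVA decomposition
    ms_rows = k * np.sum((row_means - grand_mean) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand_mean) ** 2) / (k - 1)
    ss_err = np.sum(
        (ratings - row_means[:, None] - col_means[None, :] + grand_mean) ** 2
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Shrout & Fleiss ICC(2,1)
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical 1-4 Likert ratings for demonstration (not the study data)
rng = np.random.default_rng(0)
scores = rng.integers(1, 5, size=(22, 5)).astype(float)
print(f"ICC(2,1) = {icc_2_1(scores):.2f}")
```

In practice the same estimate (with confidence intervals) is usually obtained from an established statistics package rather than a hand-rolled function; the explicit formula is shown here only to make the ICC(2,1) model referenced in the Methods concrete.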