Digital Health, vol. 12, 2026 (SCI-Expanded, SSCI, Scopus)
Background: Large language models such as ChatGPT are increasingly used by patients seeking perioperative information, yet their reliability for anesthesia-related patient education remains insufficiently evaluated. This study assessed the quality of ChatGPT-4.0 responses to frequently asked anesthesia questions using a multi-rater evaluation framework.

Methods: Twenty-two common anesthesia-related patient questions were identified through an online search. Each question was submitted once to ChatGPT-4.0 (GPT-4-turbo; chat.openai.com) without follow-up prompts. Five anesthesiology and reanimation specialists, each with more than 20 years of experience, independently evaluated every response using a validated 4-point Likert-type scale (1 = excellent; 4 = unsatisfactory). Inter-rater reliability was calculated using a two-way random-effects model (ICC[2,1]).

Results: A total of 110 ratings were collected. Of these, 61.8% were classified as excellent, 32.7% as satisfactory requiring minimal clarification, and 5.5% as satisfactory requiring moderate clarification. No responses were rated as unsatisfactory. Mean scores for individual questions ranged from 1.0 to 2.4. Reviewer-wise averages ranged from 1.27 to 1.73, indicating generally positive evaluations with modest variability in scoring strictness. The overall inter-rater reliability was poor to fair (ICC = 0.25).

Conclusions: ChatGPT-4.0 provided high-quality responses to frequently asked patient questions about anesthesia and may serve as a supportive digital health tool for patient education. However, limited agreement among evaluators highlights the need for expert oversight and contextual refinement when integrating large language models into clinical communication pathways.
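The abstract does not include the analysis scripts; the sketch below illustrates how an ICC(2,1) of the kind reported here could be computed from the 22 × 5 rating matrix (22 questions, 5 reviewers), assuming the standard Shrout–Fleiss two-way random-effects, absolute-agreement, single-rater formulation. The function name and the synthetic ratings are illustrative only and are not taken from the study's actual data or code.

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: n_targets x n_raters matrix (here, 22 questions x 5 reviewers).
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-question means
    col_means = ratings.mean(axis=0)   # per-reviewer means

    # Mean squares from the two-way ANOVA decomposition
    ms_rows = k * np.sum((row_means - grand_mean) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand_mean) ** 2) / (k - 1)
    ss_err = np.sum(
        (ratings - row_means[:, None] - col_means[None, :] + grand_mean) ** 2
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Shrout & Fleiss ICC(2,1)
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical 1-4 Likert ratings for demonstration (not the study data)
rng = np.random.default_rng(0)
scores = rng.integers(1, 5, size=(22, 5)).astype(float)
print(f"ICC(2,1) = {icc_2_1(scores):.2f}")
```

In practice the same estimate (with confidence intervals) is usually obtained from an established statistics package rather than a hand-rolled function; the explicit formula is shown here only to make the ICC(2,1) model referenced in the Methods concrete.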