Evaluation of the artificial intelligence chatbots in frequently asked questions about retinitis pigmentosa: a comparative analysis between ChatGPT-4 and Gemini-2.0

Biçer, Özlem; Şahlı, Esra

doi:10.1186/s40942-025-00772-4

International Journal of Retina and Vitreous, vol. 12, no. 1, 2026 (ESCI, Scopus)

Publication Type: Article / Full Article
Volume: 12, Issue: 1
Publication Date: 2026
DOI: 10.1186/s40942-025-00772-4
Journal Name: International Journal of Retina and Vitreous
Journal Indexes: Emerging Sources Citation Index (ESCI), Scopus, EMBASE, Directory of Open Access Journals
Keywords: Artificial intelligence, ChatGPT-4, Gemini-2.0, Readability, Retinitis pigmentosa
Ankara University Affiliated: Yes

Abstract

Background: To evaluate the accuracy and readability of answers to common retinitis pigmentosa (RP) questions given by the popular generative artificial intelligence (AI) chatbots ChatGPT-4 and Gemini-2.0.
Methods: In March 2025, a search for frequently asked questions about RP was run in the Google search tool, and the websites appearing on the first results page were included in the study. ChatGPT-4 and Gemini-2.0 were then prompted to generate responses about RP in both standard and simplified formats. To generate the simplified response, the following request was added to the prompt: ‘Please provide a response suitable for the average American adult, at a sixth-grade comprehension level.’ The chatbots’ responses to 30 questions about RP, frequently asked by patients, were evaluated by two ophthalmologists using a five-point Likert scale (scores from 1 to 5). Additionally, eight readability indices were calculated with an online calculator, Readabilityformulas.com, to assess the ease of comprehension of each answer: Average Reading Level Consensus Calculator (ARLC), Automated Readability Index (ARI), Flesch Reading Ease (FRE), Gunning Fog Index (GFOG), Flesch–Kincaid Grade Level (FKGL), Coleman–Liau Index (CL), Simple Measure of Gobbledygook (SMOG), and Forcast Readability Formula (FRF).
Results: No significant difference in accuracy was found between the chatbots for either the standard or the simplified responses (p = 0.557 and p = 0.090, respectively). Almost all readability indices indicated that the standard responses require a higher level of education for comprehension, whereas the simplified responses require a lower level. Gemini-2.0 standard responses were more readable than ChatGPT-4 standard responses according to ARI, GFOG, and FRF scores (p = 0.014, p = 0.040, and p = 0.001, respectively), whereas Gemini-2.0 simplified responses were more readable than ChatGPT-4 simplified responses only according to FRF scores (p = 0.016).
Conclusions: This study shows that ChatGPT-4 and Gemini-2.0 can give patients a way to access comprehensive and accurate information about RP, tailored to their educational level.
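
To illustrate how three of the readability indices named in the abstract are defined, the following is a minimal Python sketch of the standard published formulas for FRE, FKGL, and ARI. It is not the implementation used by Readabilityformulas.com, whose tokenization and syllable counting may differ. The function names and the vowel-group syllable heuristic are assumptions made for illustration only.

```python
import re


def count_syllables(word: str) -> int:
    """Naive heuristic (an assumption): one syllable per vowel group, minimum 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text: str) -> dict:
    """Count sentences, words, letters, and estimated syllables in a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)
    return {
        "sentences": len(sentences),
        "words": len(words),
        "letters": sum(len(re.sub(r"[^A-Za-z]", "", w)) for w in words),
        "syllables": sum(count_syllables(w) for w in words),
    }


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    # Higher scores indicate easier text.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    # Approximates the US school grade needed to understand the text.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def automated_readability_index(letters: int, words: int, sentences: int) -> float:
    # Uses characters per word instead of syllables.
    return 4.71 * (letters / words) + 0.5 * (words / sentences) - 21.43


if __name__ == "__main__":
    sample = "Retinitis pigmentosa is an inherited eye disease. It slowly damages the retina."
    c = text_counts(sample)
    print("FRE :", round(flesch_reading_ease(c["words"], c["sentences"], c["syllables"]), 2))
    print("FKGL:", round(flesch_kincaid_grade(c["words"], c["sentences"], c["syllables"]), 2))
    print("ARI :", round(automated_readability_index(c["letters"], c["words"], c["sentences"]), 2))
```

A higher FRE score means easier text, while FKGL and ARI approximate a US school grade level. Under these formulas, a response at the sixth-grade level requested in the simplified prompt would score roughly 6 on FKGL and ARI.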