Evaluation of Accuracy, Information Quality, and Readability of Artificial Intelligence Based Chatbots in Pediatric Oral Surgery: A Comparative Analysis Based on the AAPD Clinical Guideline

KAYA, İLHAN; DEMİREL, AKİF

doi:10.7126/cumudj.1799875

Cumhuriyet Dental Journal, vol.28, no.4, pp.586-593, 2025 (Scopus, TRDizin)

Publication Type: Article / Full Article
Volume: 28 Issue: 4
Publication Date: 2025
DOI: 10.7126/cumudj.1799875
Journal Name: Cumhuriyet Dental Journal
Journal Indexes: Scopus, Directory of Open Access Journals, TR DİZİN (ULAKBİM)
Pages: pp.586-593
Keywords: Artificial intelligence, oral surgery, pediatric dentistry
Affiliated with Ankara University: Yes

Abstract

Objectives: Chatbots powered by artificial intelligence (AI) are increasingly used as tools for obtaining medical and dental knowledge. This study aimed to assess and compare the performance of four AI chatbots in providing evidence-based information on pediatric oral surgery topics, with reference to the American Academy of Pediatric Dentistry (AAPD) clinical guideline.

Materials and Methods: This descriptive observational study evaluated four AI chatbots (ChatGPT-5, Gemini, Copilot, and DeepSeek) by posing 20 questions derived from the AAPD Guideline on Management Considerations for Pediatric Oral Surgery. Responses were assessed for accuracy using a grading system, for quality using the 16-item DISCERN instrument, and for readability using the Flesch-Kincaid Grade Level (FKGL) formula. Non-parametric Kruskal-Wallis and Mann-Whitney U tests with Holm-Bonferroni adjustment were used for statistical comparisons, with significance set at p < 0.05.

Results: Significant differences were observed among chatbots in all outcome measures. Gemini and ChatGPT-5 achieved the highest accuracy scores (1.30 ± 0.47 and 1.40 ± 0.60, respectively; p = 0.001), whereas DeepSeek and Copilot showed lower accuracy. In terms of information quality, DeepSeek produced the highest DISCERN scores (52.90 ± 3.73; p < 0.001), followed by Copilot. ChatGPT-5 and Gemini yielded more readable outputs (FKGL 10.73 ± 1.98 and 11.68 ± 1.91, respectively), though readability differences were not statistically significant (p > 0.05).

Conclusions: Of the models evaluated, Gemini and ChatGPT-5 produced the most accurate responses, while DeepSeek generated the highest-quality content. Although AI chatbots show promise as supplementary tools for patient education and clinical learning in pediatric oral surgery, their reliability varies considerably across platforms. Continuous validation and guideline-based evaluation are essential prior to clinical integration.
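
For readers unfamiliar with two of the methods named in the abstract, the following is a minimal Python sketch of the standard Flesch-Kincaid Grade Level formula and of the Holm-Bonferroni step-down adjustment. It is illustrative only: the abstract does not state which software the authors used, the function names `flesch_kincaid_grade` and `holm_adjust` are hypothetical, the word, sentence, and syllable counts are assumed to be supplied by the caller, and the example numbers are invented rather than taken from the study.

```python
from typing import List, Sequence


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard FKGL formula applied to pre-computed text counts.

    How words, sentences, and syllables are counted is left to the caller;
    that choice is an assumption outside this sketch.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm-Bonferroni adjusted p-values, returned in the input order.

    The k-th smallest p-value (k = 0, 1, ...) is multiplied by (m - k),
    capped at 1, and forced to be no smaller than the previous adjusted value.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Invented counts for a single chatbot response (not study data).
    print(flesch_kincaid_grade(words=100, sentences=5, syllables=150))
    # Invented raw p-values for three pairwise comparisons (not study data).
    print(holm_adjust([0.01, 0.04, 0.03]))
```

An adjusted p-value below 0.05 would then be read as a significant pairwise difference, which matches the threshold stated in the abstract.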