Assessing the accuracy of the GPT-4 model in multidisciplinary tumor board decision prediction

Erdat, Efe; Yalçıner, Merih; Örüncü, Mehmet; Ürün, Yüksel; Şenler, Filiz

doi:10.1007/s12094-025-03905-1

Erdat E. C., Yalçıner M., Örüncü M. B., Ürün Y., Şenler F.

CLINICAL & TRANSLATIONAL ONCOLOGY, vol.27, no.9, pp.3793-3802, 2025 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 27 Issue: 9
Publication Date: 2025
DOI: 10.1007/s12094-025-03905-1
Journal Name: CLINICAL & TRANSLATIONAL ONCOLOGY
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, EMBASE, MEDLINE, DIALNET
Pages: pp.3793-3802
Keywords: Artificial intelligence, Cancer treatment, Machine learning, Multidisciplinary approach, Real-world data, Tumor board
Ankara University Affiliated: Yes

Abstract

Purpose: Artificial intelligence models like GPT-4 (OpenAI) have the potential to support clinical decision-making in oncology. This study aimed to assess the consistency between multidisciplinary tumor board (MTB) decisions and GPT-4 model predictions in cancer patient management.

Patients and methods: A cross-sectional study was conducted involving patients aged ≥ 18 years with definite or suspicious cancer diagnoses presented at MTBs in Ankara University Hospitals, Türkiye, from February 2021 to June 2023. GPT-4 was used to generate treatment recommendations based on case summaries. Three independent raters evaluated the compatibility between MTB decisions and GPT-4 predictions using a 4-point Likert scale. Cases with mean compatibility scores equal to or below 2 were reviewed by two expert oncologists for appropriateness.

Results: A total of 610 patients were included. The mean compatibility score across raters was 3.59 (SD = 0.81), indicating high agreement between GPT-4 predictions and MTB decisions. Cronbach's alpha was 0.950 (95% CI 0.935-0.960), demonstrating excellent interrater reliability. Sixty-two cases (10.2%) had mean compatibility scores at or below the threshold of 2. The first expert oncologist deemed GPT-4's predictions inappropriate in 8 of these cases (12.9%), while the second deemed them inappropriate in 16 cases (25.8%). Cohen's kappa showed moderate agreement between the two experts (kappa = 0.50, 95% CI 0.25-0.75, p < 0.001). Discrepancies were often due to rare cases lacking guideline information or misunderstandings of case presentations.

Conclusion: GPT-4 exhibited high compatibility with MTB decisions in cancer patient management, suggesting its potential as a supportive tool in clinical oncology. However, limitations exist, especially in rare or complex cases.
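
For readers who want to check the arithmetic behind the reported statistics, the following is a minimal sketch using the textbook formulas for Cronbach's alpha and Cohen's kappa. It is not the authors' analysis code; the function names (`percent`, `cronbach_alpha`, `cohen_kappa`) are hypothetical, and because the abstract does not include per-case ratings, only the reported percentages are recomputed from the abstract's own counts.

```python
from statistics import variance


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


def cronbach_alpha(scores_by_rater):
    """Cronbach's alpha, treating each rater as an 'item'.

    `scores_by_rater` is a list of equal-length score lists, one per rater
    (an assumed layout; the study's data format is not described).
    """
    k = len(scores_by_rater)
    rater_var = sum(variance(r) for r in scores_by_rater)
    # Per-case totals across all raters.
    totals = [sum(case) for case in zip(*scores_by_rater)]
    return k / (k - 1) * (1 - rater_var / variance(totals))


def cohen_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels on the same cases."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Agreement expected by chance from each rater's marginal frequencies.
    p_expected = sum((a.count(c) / n) * (b.count(c) / n) for c in labels)
    return (p_observed - p_expected) / (1 - p_expected)


# Percentages recomputed from counts stated in the abstract.
print(percent(62, 610), percent(8, 62), percent(16, 62))  # 10.2 12.9 25.8
```

The alpha and kappa functions would need the per-case ratings to reproduce the published values of 0.950 and 0.50; with toy inputs they only illustrate how each coefficient is defined.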