Confirmation of Large Language Models in Head and Neck Cancer Staging


Kayaalp M., BÖLEK H., YAŞAR H. A.

Diagnostics, vol. 15, no. 18, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 15 Issue: 18
  • Publication Date: 2025
  • DOI: 10.3390/diagnostics15182375
  • Journal Name: Diagnostics
  • Indexed In: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, EMBASE, INSPEC, Directory of Open Access Journals
  • Keywords: artificial intelligence, head neck cancers, large language models
  • Ankara University Affiliated: Yes

Abstract

Background/Objectives: Head and neck cancer (HNC) is a heterogeneous group of malignancies in which staging plays a critical role in guiding treatment and prognosis. Large language models (LLMs) such as ChatGPT, DeepSeek, and Grok have emerged as potential tools in oncology, yet their clinical applicability in staging remains unclear. This study aimed to evaluate the accuracy and concordance of LLM-generated staging against clinician-assigned staging in patients with HNC.

Methods: The medical records of 202 patients with HNC who presented to our center between 1 January 2010 and 13 February 2025 were retrospectively reviewed. Information extracted from the hospital information system by a junior researcher was re-evaluated by a senior researcher, and reference staging was completed. The data used for staging, with the assigned stage itself withheld, were then provided to a blinded third researcher, who entered them into the ChatGPT, DeepSeek, and Grok applications with a staging prompt. After all staging was completed, the data were compiled and clinician-assigned stages were compared with those generated by the LLMs.

Results: Most patients had laryngeal (45.5%) or nasopharyngeal (21.3%) cancer. Definitive surgery was performed in 39.6% of patients, and stage 4 was the most common stage (54%). The overall concordance rates, Cohen's kappa values, and F1 scores were 85.6%, 0.797, and 0.84 for ChatGPT; 67.3%, 0.522, and 0.65 for DeepSeek; and 75.2%, 0.614, and 0.72 for Grok, respectively, with no statistically significant differences between models. Pathological and surgical staging showed similar concordance. Assessments based on only imaging, only pathology notes, only physical examination notes, or comprehensive information were also compared, revealing no significant differences in concordance.

Conclusions: LLMs demonstrate relatively high accuracy in staging HNC. With careful implementation and confirmation in prospective studies, these models have the potential to become valuable tools in oncology practice.
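The agreement metrics reported above (overall concordance, Cohen's kappa, and F1 score) are standard measures for comparing two raters on a categorical label such as cancer stage. A minimal sketch of how they can be computed with scikit-learn, using hypothetical stage labels rather than the study's data:

```python
# Illustrative only (hypothetical labels, not the study's dataset):
# computing clinician-vs-LLM agreement metrics with scikit-learn.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

# Hypothetical stage assignments (stages 1-4) for ten patients.
clinician = [4, 4, 3, 2, 4, 1, 3, 4, 2, 4]
llm       = [4, 4, 3, 2, 3, 1, 3, 4, 2, 4]  # one disagreement

concordance = accuracy_score(clinician, llm)       # fraction of exact matches
kappa = cohen_kappa_score(clinician, llm)          # chance-corrected agreement
f1 = f1_score(clinician, llm, average="weighted")  # support-weighted multiclass F1

print(f"concordance={concordance:.3f} kappa={kappa:.3f} f1={f1:.3f}")
```

Cohen's kappa corrects raw concordance for agreement expected by chance, which is why the study reports both; the weighted F1 averages per-stage F1 scores by each stage's frequency.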