A survey of NLP methods for oncology in the past decade with a focus on cancer registry applications

Article by Isaac Hands & Ramakanth Kavuluru

Abstract

Clinical texts from pathology and radiology reports provide critical information for cancer diagnosis and staging. This study surveys the application of natural language processing (NLP) in cancer registry operations from 2014 to 2024. A total of 156 articles from Scopus and PubMed were reviewed and categorized by NLP methods, document types, cancer sites, and research aims. NLP approaches were evenly distributed across rule-based (n=70), machine learning (n=66), and traditional deep learning (n=70), with transformer models (n=29) gaining prominence since 2019. Encoder-only models such as BERT and its clinical adaptations (e.g., ClinicalBERT, RadBERT) show significant promise, though methods for increasing context length are needed. Decoder-only models (e.g., GPT-3, GPT-4) are less explored because of privacy concerns and computational demands. Notably, pediatric cancers, melanomas, and lymphomas are underrepresented, as are research areas such as disease progression, clinical trial matching, and patient communication. Multi-modal models, important for precision oncology and cancer screening, are also scarce. Our study highlights the potential of NLP to enhance data abstraction efficiency and accuracy in cancer registries, enabling greater use of cancer registry data for patient benefit. However, further research is needed to fully leverage transformer-based models, particularly for underrepresented cancer types and outcomes. Addressing these gaps can improve the timeliness, completeness, and accuracy of structured data collection from clinical text, ultimately enhancing cancer research and patient outcomes.