Article by Fatimah Altuhaifa & Mustafa Al Tuhaifa
Abstract
Lung cancer remains a major global health challenge, and survival outcomes differ between male and female patients. This study investigates machine learning-based approaches for imputing missing patient sex in SEER (Surveillance, Epidemiology, and End Results) cancer registry data. We employed multiple machine learning models, including Logistic Regression (LR), Naïve Bayes (NB), Random Forest (RF), Multilayer Perceptron (MLP), and Bagging Classifier (BC), to predict missing sex labels from clinical features: survival years, surgery status, age, and race. The dataset was split into training (70%) and testing (30%) subsets. Feature importance analysis identified race as the most influential predictor (importance score: 0.978). Among the models evaluated, RF achieved the highest performance, with an accuracy of 97%, AUC of 98%, recall of 95%, precision of 87%, and an F1-score of 91%. Relative to LR, this is a 15.47% increase in accuracy (97% vs. 84%); because LR scored 0% on precision, recall, and F1-score, RF’s gains on those metrics are best stated in absolute terms: precision rose from 0% to 87%, recall from 0% to 95%, and F1-score from 0% to 91%. Compared to NB, RF improved precision by a relative 3.57% (87% vs. 84%) and F1-score by a relative 2.25% (91% vs. 89%). MLP and BC performed comparably to RF: both reached an accuracy of 97%, with AUC of 96% and 97%, recall of 94% and 95%, precision of 90% and 87%, and F1-score of 92% and 91%, respectively. This corresponds to a 15.47% relative improvement in accuracy over LR and a 1.03% gain over NB (97% vs. 96%). The findings highlight the clinical relevance of imputing missing sex data, as sex-based disparities can influence cancer treatment outcomes and survival predictions. By ensuring complete and accurate patient records, the proposed approach can enhance personalized treatment planning and epidemiological studies.
In addition to accuracy, computational efficiency is an important consideration: MLP’s high accuracy comes at the cost of increased computational complexity, making it less suitable than LR or NB for large-scale real-time applications. These findings underscore the critical role of race in sex prediction and the potential of the RF model for accurate sex prediction in lung cancer patients, which could improve healthcare practice and address a critical gap in medical research.
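The comparisons quoted in the abstract mix two kinds of arithmetic: the F1-score, which is the harmonic mean of precision and recall, and relative gains between models. The following is a minimal sketch of both calculations, using only the rounded percentages reported above as inputs. The function names `f1_score` and `relative_gain` are illustrative assumptions and are not taken from the study's code.

```python
from typing import Optional


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_gain(new: float, baseline: float) -> Optional[float]:
    """Relative change in percent; None when the baseline is zero,
    because a relative change is undefined in that case."""
    if baseline == 0:
        return None
    return (new - baseline) / baseline * 100


# Inputs are the rounded values reported in the abstract.
rf_f1 = f1_score(0.87, 0.95)        # RF precision and recall
mlp_f1 = f1_score(0.90, 0.94)       # MLP precision and recall

acc_vs_lr = relative_gain(97, 84)   # RF accuracy vs. LR accuracy
prec_vs_nb = relative_gain(87, 84)  # RF precision vs. NB precision
f1_vs_nb = relative_gain(91, 89)    # RF F1-score vs. NB F1-score
prec_vs_lr = relative_gain(87, 0)   # LR precision is 0%: undefined

print(f"RF F1: {rf_f1:.2f}, MLP F1: {mlp_f1:.2f}")
print(f"Accuracy gain over LR: {acc_vs_lr:.2f}%")
print(f"Precision gain over NB: {prec_vs_nb:.2f}%")
print(f"F1 gain over NB: {f1_vs_nb:.2f}%")
print(f"Precision gain over LR: {prec_vs_lr}")
```

With these rounded inputs, the F1 values come out at 0.91 for RF and 0.92 for MLP, consistent with the reported scores, and the relative gains agree with the quoted figures to within rounding. When the baseline is 0%, as for LR's precision, recall, and F1-score, a relative gain cannot be computed, so the difference is better reported in percentage points.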