ABSTRACT
Importance Lung cancer disparities affect minoritized groups, particularly Black populations, who face increased risk yet are screened at lower rates. Standards set by the United States Preventive Services Task Force (USPSTF) are derived from a predominantly White cohort, the National Lung Screening Trial (NLST), which exacerbates disparities in lung cancer screening (LCS) and diagnosis.
Objective To evaluate whether individualized risk assessment, using risk models that integrate clinical and imaging-based risk factors for lung cancer prediction, can improve LCS accuracy and reduce disparities among minoritized populations.
Design, Setting, and Participants A retrospective real-world patient cohort from University of Illinois Health (UIH) was assembled using available low-dose CT (LDCT) scans (January 1, 2015 to March 16, 2024). We then evaluated the performance of a ResNet-18 model trained on LDCTs from the predominantly White NLST cohort on the diverse UIH patient population, consisting of 65,106 patients, of whom 8,823 identify as Black. Inclusion criteria for the UIH cohort used CPT codes as well as ICD-9 and ICD-10 codes for neoplasm of the bronchus or lung. The proposed hybrid model was assessed for its predictive accuracy across racial groups and Body-Mass Index (BMI) categories.
Main Outcomes and Measures The primary outcomes included the hybrid AI model’s ability to improve lung cancer screening adherence, its effectiveness across diverse racial groups—highlighting disparities in performance between Black and White populations—and its performance in individuals with varying BMI, particularly those with BMI ≥ 30. Secondary outcomes were the hybrid model’s performance in terms of sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) compared to traditional USPSTF guidelines.
Results The hybrid AI model was trained using clinical and imaging data from the NLST cohort and tested on a diverse urban and suburban population in the Chicago metropolitan area (UIH cohort). The model, optimized to 7 clinical features, achieved ROC-AUC values of 0.64-0.67 in the NLST test set and 0.60-0.65 in the UIH cohort. The inclusion of ResNet-based image predictors significantly improved the model’s performance, achieving ROC-AUC values of 0.78-0.91 and PR-AUC values of 0.25-0.33 in NLST. However, the hybrid model’s performance deteriorated when applied to Black patients in the UIH cohort, with ROC-AUC values of 0.65-0.75, and fell to 0.67 in obese patients (BMI ≥ 30). Further investigation found that the ResNet-18 model was the underlying cause of the disparate results, with higher performance among White UIH patients than among Black UIH patients. Attempts to optimize the ResNet-18 outputs revealed a domain shift, where model optimization in Black patients resulted in deterioration in White patients, reflecting the limited representation of Black patients in the model’s original training dataset. Model performance also deteriorated for individuals with a BMI ≥ 30 in both the NLST and UIH data sets.
Conclusions and Relevance The hybrid AI model shows promise in providing personalized lung cancer risk predictions with improved accuracy compared to clinical risk models alone. However, biases in training data, particularly regarding race and BMI, limit its generalizability. Future work should focus on developing more inclusive training datasets and further validating the model in diverse prospective cohorts to enhance its applicability in reducing lung cancer disparities.
INTRODUCTION
Lung cancer care in the United States is marked by health disparities in minority populations, and these are especially pronounced in Black individuals1. Low-dose CT (LDCT) lung cancer screening shifts diagnosis to early-stage disease and is associated with reduced mortality2, 3, but stringent eligibility criteria from the United States Preventive Services Task Force (USPSTF)4 exacerbate disparities in Black populations4, 6. These criteria, specifically regarding age and duration of tobacco usage, were derived from risk/benefit analysis of the ∼ 92% White National Lung Screening Trial (NLST) cohort. This results in exclusion of at-risk Black patients from screening. The eligibility to incidence ratio measures this disparity: under current USPSTF guidelines, White populations often have >100 eligible patients screened for each diagnosed case of lung cancer, whereas in Black populations this is reduced to approximately 50 eligible patients for every diagnosed case4. We7 and others5,8,9 have found that individualized selection using clinical feature-based risk models improves lung cancer screening eligibility and sensitivity as compared to USPSTF guidelines. Moreover, individualized lung cancer risk prediction from LDCT has been achieved using Convolutional Neural Networks (CNNs). Taken together, risk prediction models have the potential to guide screening selection, follow-up, and eligibility determination, and to mitigate lung cancer disparities in early detection and survival. However, these models have been trained on patient cohorts which underrepresent minorities. To advance these risk models for implementation, their generalizability must first be assessed in minority populations including Black patients.
In this retrospective ever-smoker cohort analysis, we evaluated the utility of clinical and imaging-based lung cancer risk models in both the NLST cohort and a diverse urban and suburban real-world patient population across multiple sites in the Chicago metropolitan area (University of Illinois Health, UIH), representing 65,106 patients including 8,823 Black patients. The Chicago metropolitan area is racially, socioeconomically, and geographically diverse and has regions with higher rates of lung cancer, predominantly in areas populated by Black patients10–12. We find that a hybrid deep learning model using ResNet and Support Vector Machines (SVMs) from LDCT and clinical features, respectively, outperformed the individual models in the NLST test set. When evaluated in the UIH cohort, the hybrid model showed a racially disparate performance deterioration in Black, but not White, subjects. Further analysis showed that the deterioration occurred within the ResNet imaging component of the hybrid model. Efforts to regularize model performance via Artificial Neural Networks (ANNs) uncovered underlying disparities, where improved accuracy in Black patients coincided with decreased accuracy in White patients. Even so, residual bias still affected performance in the Black population. This could potentially be due to a large imbalance of LDCTs from White versus Black patients in the ResNet training cohort. Overall, we show proof of concept that a hybrid model has the potential to identify individuals at risk of developing lung cancer with high accuracy in this retrospective cohort, but with disparate performance in Black patients. To avoid worsening lung cancer screening disparities, risk-based model performance must be benchmarked across all races, and models must be trained in patient cohorts that have adequate representation of Black and other minoritized populations.
METHODS
Research analysis was carried out under approved UIC IRB protocols 2023-1321 and 2023-1377. We developed a hybrid AI predictive model to probe the biological basis of lung cancer, and thus assembled a real-world cohort of > 10,000 patients treated in urban and suburban settings in the city of Chicago as well as greater Cook County. This real-world cohort, including > 5,000 Black patients with history of smoking and/or lung cancer, was also filtered based on available chest imaging studies.
Collecting Data: NLST
Deidentified NLST data was accessed from the National Cancer Institute Cancer Data Access System approved under UIC IRB Protocol 2023-1321.
Collecting Data: UIH Lung Cancer Screening
To assess the hybrid model in the UIH patient population, we assembled a retrospective cohort of patients with a history of smoking under an IRB-approved protocol. Inclusion criteria for this cohort are listed in Supplementary Table 1. Briefly, LDCT screening patients were placed into the cohort by CPT code 71271, and/or based on ICD codes for smoking cessation counseling and cigarette smoking, and through documented history of smoking. ICD-9 code 305.1 and ICD-10 code F17 were used to select LDCT patients, and LDCT-associated DICOMs were chosen. In addition to LDCT patients, we also expanded the real-world cohort to include patients with smoking histories and CT imaging. A diagnosis of lung cancer was determined based on ICD codes for neoplasm of bronchus or lung (ICD-9 162, ICD-10 C34). Those without a diagnosis of lung cancer were assumed to have no lung cancer at the time of the preparation of this manuscript. Median duration of follow-up between LDCT scans was 431 days, and the median number of LDCT scans per patient with at least one LDCT was 1 (the median is 5 if each convolution kernel and reconstruction of a given LDCT exam is counted separately). The composition of races and ethnicities in both the NLST cohort and the UIH cohort is presented in Fig. 1A. The UIH cohort had a diverse geographic distribution including urban, semiurban, and suburban areas, and patients received healthcare across multiple sites (Fig. 1B).
Generating Ground Truth
The ground truth is represented by six values (1 or 0) indicating the presence or absence of a lung cancer diagnosis 1 to 6 years after a CT scan was performed. Using the Python programming language, a data frame was created which represents the ground truth associated with each LDCT in both the NLST and UIH data sets. The time of each CT event was compared to the time of the lung cancer diagnosis event of the associated patient. If the difference is less than or equal to a given number of years, it is labeled as a positive sample (1), otherwise it is labeled as a negative sample (0). However, if the sample has no known lung cancer diagnosis event, and the number of years exceeds the number of days to last follow-up event (or in the case of UIH data, the date 3/16/2024), then the sample is removed due to a lack of information in determining the ground truth. Over years 1-6, 1888, 3128, 3828, 4056, 4476, and 4884 samples (out of 5,980) were removed from the UIH cohort respectively. The NLST data set includes the number of days to the last known cancer status of each patient, and this was used in a similar manner to remove 4, 35, 159, 280, 461, and 1082 samples (out of 11,980) from the NLST data set for each respective year.
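The labeling rule above can be sketched in Python as follows. This is a minimal illustration rather than the study's code: the function name `label_scan`, the 365.25-day year, and the use of `None` to mark a removed sample are assumptions made for the example. The text does not say how scans dated after a diagnosis were handled, so the sketch does not treat them specially.

```python
from datetime import date

HORIZONS = range(1, 7)   # prediction windows of 1 to 6 years
DAYS_PER_YEAR = 365.25   # assumption: the text does not state the year length used


def label_scan(scan_date, diagnosis_date, last_followup_date):
    """Return {year: 1, 0, or None} for one CT scan.

    1    = lung cancer diagnosed within `year` years of the scan
    0    = no diagnosis within that window
    None = sample removed (no diagnosis and follow-up shorter than the window)
    """
    labels = {}
    for years in HORIZONS:
        window_days = years * DAYS_PER_YEAR
        if diagnosis_date is not None:
            days_to_dx = (diagnosis_date - scan_date).days
            labels[years] = 1 if days_to_dx <= window_days else 0
        else:
            followup_days = (last_followup_date - scan_date).days
            # Without a diagnosis, the window must be fully observed to call it negative.
            labels[years] = 0 if followup_days >= window_days else None
    return labels


# Hypothetical patients (not study data).
diagnosed = label_scan(date(2020, 1, 1), date(2022, 6, 1), date(2024, 3, 16))
undiagnosed = label_scan(date(2022, 1, 1), None, date(2024, 3, 16))
```

In the first hypothetical patient the diagnosis falls between the 2-year and 3-year windows, so the labels switch from 0 to 1 at year 3. In the second, follow-up ends before the 3-year window closes, so years 3 to 6 are dropped.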
Train and Test set selection in NLST
A training set from the NLST was created using the same split as previously described14 with each sample representing an individual patient. In addition, data from patients in the chest X-ray arm of NLST was also incorporated. The SVM models which incorporate a Sybil score were also trained using the same training set, however each sample represents an individual CT scan rather than a patient.
Test set selection in UIH
Data from patients at UIH consisted of various Digital Imaging and Communications in Medicine (DICOM) types, a digital format for storing CT scans and associated metadata. First, LDCTs were selected using codes associated with each DICOM, provided by the UIC Center for Clinical and Translational Sciences.
Prediction with PLCOm2012
PLCOm2012 is a logistic regression model for lung cancer risk inference and was evaluated on NLST data. The PLCOm2012 model was published in 2012, and was created using a set of 11 chosen features to predict the probability of lung cancer diagnosis for a given patient within 6 years: age, race, education, Body-Mass Index (BMI), presence of Chronic Obstructive Pulmonary Disease (COPD), personal history of cancer, family history of lung cancer, smoking status, cigarettes per day, duration of smoking, and duration of quitting. All 11 features are directly represented or can be extracted from the data in the NLST. The PLCOm2012 logistic regression model is represented in an R package created by the authors of PLCOm201216. This was translated to Python code to facilitate incorporation with the remainder of the code base. When using PLCOm2012 for prediction, each sample represents an individual patient. The model was used to generate predictions for probability of lung cancer diagnosis within 6 years only (it was not used to evaluate performance within 1 to 5 years) as intended by its authors (Supp. Fig 3), and these predictions were compared with the ground truth at 6 years, and performance was evaluated across White versus Black patients.
SVM Prediction
Clinical data elements from the NLST were evaluated for inclusion in an SVM using the Scikit-Learn Python library13. We started with a large set of initial features from those available in NLST, as well as some generated features calculated from existing ones, such as BMI. Recursive feature elimination with a linear kernel showed the best results when using 11-13 features. We chose to optimize the SVM to 11 features, since this is the smallest number of features that resulted in optimal performance. Six unique SVMs were then trained to make predictions 1-6 years post CT scan, with each sample being an individual patient. Each of the 6 SVM models was then evaluated on the test sets using the corresponding ground truths. Additional SVM models were created using a truncated set of features to allow comparison between the UIH and NLST cohorts. This truncated set contains 7 features which are well represented in both data sets; the 4 removed features were poorly represented in the UIH cohort. The training and prediction pipeline of the SVM method is given in Fig. 2A.
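A minimal sketch of this workflow with Scikit-Learn is shown below. It runs on synthetic data, not NLST data. The feature counts, the standardization step, the way the per-year labels are simulated, and the use of `decision_function` as the risk score are all assumptions made for illustration and are not details taken from the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for a clinical feature table (NOT NLST data).
X, y6 = make_classification(
    n_samples=300, n_features=30, n_informative=8, random_state=0
)
X = StandardScaler().fit_transform(X)

# Recursive feature elimination with a linear-kernel SVM, keeping 11 features.
selector = RFE(SVC(kernel="linear"), n_features_to_select=11, step=1).fit(X, y6)
selected = np.flatnonzero(selector.support_)
X_sel = X[:, selected]

# Hypothetical year of diagnosis (1..6), used only to simulate cumulative labels.
dx_year = rng.integers(1, 7, size=len(y6))

# One SVM per prediction horizon (years 1 to 6).
models = {}
for year in range(1, 7):
    y_year = ((y6 == 1) & (dx_year <= year)).astype(int)
    models[year] = SVC(kernel="linear").fit(X_sel, y_year)

# Signed distance to the separating hyperplane, used here as a risk score.
risk_scores = {year: m.decision_function(X_sel) for year, m in models.items()}
```

Repeating the `RFE` step over a range of `n_features_to_select` values and comparing held-out performance is one way to arrive at a choice such as 11 features.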
ResNet Prediction
DICOM data files were evaluated by a ResNet-18 CNN model, trained on the ImageNet database with parameters locked and further trained upon the NLST cohort as previously described14. The ResNet-18 model, Sybil, was run with default parameters on the same test set used by the authors of the model, which contains data from 2,328 patients and 11,980 images. The resulting output from each LDCT image was a set of 6 values indicating the probability that lung cancer was diagnosed within 1 to 6 years from the date of the scan. The ResNet was used in a similar manner on data from UIH across 695 patients with 4,092 LDCT images. Each prediction score was then compared with the ground truth values to determine the model’s performance, broken down into 6 performance evaluations by years 1 to 6.
SVM + ResNet Hybrid Model Prediction
For the hybrid model, SVM learning was used to train six unique models for years 1-6 post imaging, each including the corresponding ResNet-inferred risk for that given year (Fig. 2B). Creating the ResNet + SVM models involved the same process as the initial SVM, with two main differences. First, each ResNet + SVM model incorporates the ResNet risk inference for the corresponding year. Second, the training and testing samples are individual DICOMs associated with the clinical features of each subject ID at baseline study entry. The SVM + ResNet hybrid models thus used the same subject IDs for training and testing as the ResNet model.
ResNet+ANN Hybrid Model Prediction
A simple fully connected artificial neural network (ANN) was constructed to incorporate race and age along with a ResNet prediction score, specifically the 1-year prediction score generated by the ResNet. There are thus 3 inputs: a numerical age, a binary value (0 or 1) indicating whether the patient is Black, and an LDCT-based score generated by the ResNet representing the probability of cancer diagnosis (Fig. 2C). Patients listed as “more than one race” in the NLST cohort were evaluated as not Black. This category was not present in the UIH cohort. This hybrid model has 6 outputs, each representing the probability of lung cancer diagnosis by year N (1 to 6). There are 2 fully connected hidden layers. One hundred independent ANNs were trained using randomly initialized parameters, and the top 10 performers on a validation set were used to create an ensemble to determine whether race data may improve the predictive performance of the original ResNet model.
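The train-many, keep-the-best ensemble described above can be sketched with Scikit-Learn's `MLPClassifier`. The study's ANN framework is not stated, so this choice, the hidden layer width of 16, ranking by year-1 validation ROC-AUC, and averaging the selected networks' outputs are all assumptions. The data are synthetic, and the sketch trains 10 networks and keeps 3 (instead of 100 and 10) to stay quick.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 600

# Synthetic stand-in inputs (NOT study data): age, Black indicator, ResNet year-1 score.
age = rng.uniform(50, 80, n)
is_black = rng.integers(0, 2, n)
resnet_y1 = rng.uniform(0, 1, n)
X = np.column_stack([(age - 65) / 10, is_black, resnet_y1])

# Synthetic cumulative labels: column k-1 is 1 if diagnosed within k years.
ever_dx = rng.random(n) < 0.1 + 0.6 * resnet_y1
dx_year = rng.integers(1, 7, n)
Y = np.column_stack([(ever_dx & (dx_year <= k)).astype(int) for k in range(1, 7)])

train, val, test = slice(0, 300), slice(300, 450), slice(450, 600)

N_MODELS, TOP_K = 10, 3  # the text describes 100 networks and a top-10 ensemble
candidates = []
for seed in range(N_MODELS):
    # Two hidden layers; a different seed gives a different random initialization.
    net = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=300, random_state=seed)
    net.fit(X[train], Y[train])
    val_auc = roc_auc_score(Y[val, 0], net.predict_proba(X[val])[:, 0])
    candidates.append((val_auc, net))

# Keep the best networks on the validation set and average their six outputs.
top = sorted(candidates, key=lambda c: c[0], reverse=True)[:TOP_K]
ensemble_pred = np.mean([net.predict_proba(X[test]) for _, net in top], axis=0)
```

With a two-dimensional 0/1 label matrix, `MLPClassifier` treats the problem as multilabel and returns one probability per year, which matches the six-output design described in the text.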
Evaluating Model Performance
Model performance was evaluated on completely held out test sets within the NLST cohort as well as the real-world UIH cohort. Statistical comparisons were performed using paired and unpaired DeLong tests to compare receiver operating characteristic (ROC) area-under-curve values. When models are tested on a given test set, both a precision-recall (PR) curve and a ROC curve are generated using the Scikit-Learn Python library. Then, a prediction cutoff value is selected such that the sensitivity is as close to 0.8 as possible to capture 80% of lung cancer diagnosis events based on proposed sensitivities utilized in the International Lung Screening Trial for the PLCOm2012 model. The remainder of the statistics (i.e., specificity, PPV, NPV) were calculated based on this cutoff value.
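The cutoff selection and the derived statistics can be sketched in plain Python as below. The function names are hypothetical, ties are broken in favor of the higher cutoff (an assumption; the text does not specify tie handling), and the example scores are made up.

```python
def confusion_counts(y_true, y_score, cutoff):
    """Count TP, FP, TN, FN when scores >= cutoff are called positive."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted_positive = score >= cutoff
        if predicted_positive and truth == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    return numerator / denominator if denominator else float("nan")


def metrics_at_target_sensitivity(y_true, y_score, target=0.8):
    """Pick the cutoff whose sensitivity is closest to `target`, then report metrics."""
    best = None
    for cutoff in sorted(set(y_score), reverse=True):
        tp, fp, tn, fn = confusion_counts(y_true, y_score, cutoff)
        gap = abs(_ratio(tp, tp + fn) - target)
        if best is None or gap < best[0]:  # strict '<' keeps the higher cutoff on ties
            best = (gap, cutoff, tp, fp, tn, fn)
    _, cutoff, tp, fp, tn, fn = best
    return {
        "cutoff": cutoff,
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


# Made-up example: five negatives followed by five positives.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.35, 0.5, 0.7, 0.8, 0.9]
result = metrics_at_target_sensitivity(y_true, y_score)
```

In this toy example a cutoff of 0.5 captures four of the five positives (sensitivity 0.8) with one false positive, so specificity, PPV, and NPV are also 0.8. In a screening population with low prevalence, PPV would be far lower at the same sensitivity.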
Comparison of models
To determine whether there was a significant difference in the performance of a given model on a similar population, DeLong’s test was utilized. Unpaired analysis was performed as previously described15.
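For readers unfamiliar with the unpaired comparison, the sketch below shows one common formulation of DeLong's test for two AUCs estimated on independent samples. It is an illustration under stated assumptions and not necessarily the exact procedure of the cited reference: the function names are hypothetical, the quadratic-time computation is chosen for clarity, and each group needs at least two positive and two negative scores.

```python
import math
from statistics import variance


def _psi(pos_score, neg_score):
    """Mann-Whitney kernel: 1 if the positive outranks the negative, 0.5 on ties."""
    if pos_score > neg_score:
        return 1.0
    return 0.5 if pos_score == neg_score else 0.0


def auc_and_variance(pos_scores, neg_scores):
    """AUC and its DeLong variance estimate from structural components."""
    m, n = len(pos_scores), len(neg_scores)
    v10 = [sum(_psi(x, y) for y in neg_scores) / n for x in pos_scores]
    v01 = [sum(_psi(x, y) for x in pos_scores) / m for y in neg_scores]
    auc = sum(v10) / m
    var = variance(v10) / m + variance(v01) / n
    return auc, var


def delong_unpaired(pos_a, neg_a, pos_b, neg_b):
    """Two-sided z-test for the difference of two AUCs from independent cohorts."""
    auc_a, var_a = auc_and_variance(pos_a, neg_a)
    auc_b, var_b = auc_and_variance(pos_b, neg_b)
    # Independent samples: variances add and there is no covariance term.
    # If both variances are zero (perfect separation), z is undefined.
    z = (auc_a - auc_b) / math.sqrt(var_a + var_b)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return auc_a, auc_b, z, p_value


# Made-up scores for two hypothetical cohorts.
pos_a, neg_a = [0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]
pos_b, neg_b = [0.9, 0.5, 0.4, 0.3], [0.6, 0.45, 0.2, 0.1]
auc_a, auc_b, z, p_value = delong_unpaired(pos_a, neg_a, pos_b, neg_b)
```

The paired version of the test, used when two models are scored on the same patients, additionally subtracts a covariance term between the two sets of structural components.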
RESULTS
The NLST and UIH cohorts evaluated in this study initially comprised 53,452 patients in NLST and 11,654 patients in UIH; their racial and ethnic breakdown is shown in Table 1. We sought to develop a hybrid predictive model that combines the long-term predictive potential of clinical features with the near-term predictive potential of radiographic features from CT. To develop a hybrid lung cancer prediction model, we used the NLST as a training data set, inclusive of the CT and chest X-ray arms and amounting to 48,628 total patients2, for the training of the SVM. An SVM model was chosen owing to the interpretability of its features and weights, and because it allowed multi-year risk prognostication from baseline characteristics of participants at entry (Supp Fig. 1). The SVM was also selected because it is effective at separating two classes (i.e. patients with no lung cancer diagnosis versus patients with a lung cancer diagnosis), where ground truth was measured by the time from study entry to cancer incidence within the window of that year’s specific model (e.g. SVM1 predicts the probability of lung cancer within one year or less, SVM2 within two years, etc.). We first initialized the SVM models with 65 features and used recursive feature elimination to optimize the model to 11 features (Supp Fig. 2, Supp. Table 2). These 11 features overlapped with many features previously identified in the PLCOm2012 6-year risk model but were predictive of multi-year risk as opposed to just 6-year risk (Supp. Table 2)16. We then further optimized the model for real-world application by eliminating features not captured in electronic health records during routine clinical care, for which data were unavailable in > 50% of patients.
The excluded features were former versus current smoker status, family history of lung cancer, whether the patient’s education exceeded a bachelor’s degree, and the maximum number of years exposed at work to any of the following: asbestos, chemicals, sand blasting, coal, or foundry work. We found that the SVM model could be reduced to 7 clinical features well represented in the EHR of the UIH cohort while still retaining predictive performance (Fig. 3A, Supp Fig. 2). Performance of the 7-feature multiyear SVM model was first evaluated in a held-out test set from the NLST cohort, with ROC-AUC values of 0.64-0.67, similar to the 6-year lung cancer risk performance of the PLCOm2012 model (Supp. Fig 2). We then applied the SVM to the UIH cohort of patients and found similar performance, with ROC AUC 0.60-0.65 (Fig. 3B).
Following initial model optimization of clinical risk features, we incorporated an image-based lung cancer risk predictor: Sybil14, a ResNet-18 CNN image classifier with additional training on LDCT images from NLST, which infers lung cancer risk probabilities at multiple years of follow-up. The hybrid risk model was trained with the previously identified 7 clinical features in the SVM together with the multiyear risk inferred by the image-based ResNet model from 65,161 LDCT study DICOM files. The hybrid model demonstrated substantially improved accuracy, with ROC-AUC values for Y1-Y6 all significantly improved (p-values < .001, Supp. Table 4). We then benchmarked the hybrid model on the NLST test set against SVM models trained with clinical risk features alone, which showed a substantial improvement in predictive potential (Fig 3C; Table 2; ROC-AUC ranging from 0.78-0.91; PR-AUC values 0.25-0.33). We also evaluated positive and negative predictive values at a sensitivity of 80% (corresponding to a PLCOm2012 6-year risk of 1.6%)16 and found an order of magnitude improvement in positive and negative predictive values when compared to the PLCOm2012 risk prediction model (Supp. Table 3)16.
We then evaluated whether our hybrid model’s predictive ability could be reproduced in the diverse UIH cohort. We observed a substantial decrease in model performance (Fig 3D; ROC AUC 0.68-0.80). Given the similar performance of the clinical feature SVM model between the NLST and UIH cohorts, we focused on the ResNet component of the hybrid model. To confirm that the ResNet model performance was reproducible, we re-ran the model on individual CT scans from the NLST that were not used for the ResNet model training and confirmed findings as previously reported14 (Fig. 4A). We then evaluated baseline LDCT scans from the UIH cohort and identified reduced performance, for example Year 1 ROC-AUC values of 0.8 in UIH versus 0.92 in NLST (Fig. 4B). We then asked whether model performance deterioration17 was due to a technical difference in the LDCT scans (e.g. modern CT scanners and image reconstruction in UIH versus NLST) or due to demographic differences such as race. Evaluating the ResNet model on White patients in the UIH cohort showed no differences in performance (Fig. 4C) between the NLST and the UIH cohorts. However, when assessed in Black patients, there was marked deterioration in model performance, with Y1-Y6 ROC AUC values of 0.65-0.75 (Fig. 4D). We also evaluated other potential factors of model deterioration and found that model performance deteriorated significantly in obese patients (BMI ≥ 30) (Supp. Table 4). In the NLST cohort, Y1 ROC AUC was 0.94 in BMI < 30 versus 0.88 in BMI ≥ 30 subjects. In the UIH cohort, we observed a more pronounced deterioration, where Y1 ROC AUC was 0.90 in BMI < 30 versus 0.67 in BMI ≥ 30 subjects (Supp. Table 4). Notably, we found no correlation between race and BMI that would suggest these clinical features were linked in their effect on model deterioration.
Because of the racial disparity in model performance, we subsequently asked whether it could be mitigated by incorporating clinical information along with LDCT images. We used a simple two-layer ANN because its ease of training allowed us to train a large number of unique models. We used the NLST training cohort for ANN model training, 10% of the UIH cohort for validation, and completely held out 90% of the UIH cohort for testing. The ANN models were trained with the following three inputs: ResNet-predicted lung cancer risk, subject’s age, and subject’s race. 100 ANN models were trained, with Year 1 ROC AUC performance ranging from 0.898 to 0.908 in the NLST test set (Fig 5A). We then evaluated them on 10% of the UIH cohort (UIH validation set) and selected the top 10 performing ANNs to be evaluated in the held-out UIH test cohort. The results were mixed: in the overall UIH cohort, Y1 ROC AUC values increased to 0.83 (Fig 5B) as compared to the ResNet model alone (Fig 4B). When evaluating this difference across races, we observed that ROC AUC values modestly increased by 0.01 to 0.02 across Years 1-6 (Fig 5D; Table 3) in Black subjects as compared to the ResNet model alone (Fig 4D). However, in White subjects, there was a substantial decrease in performance, with Y2-6 ROC-AUC values falling by 0.10 (Fig 5C) as compared to the ResNet model alone (Fig 4C). While domain shift (the situation where the distribution of the training data differs from the distribution of the test data) is a significant factor in the performance discrepancies observed, bias in the training data and limits on model generalization also play critical roles. The observed results indicate that while the new model has improved overall performance, ongoing disparities reflect the lingering impact of the initial training data biases and the complexity of the population-specific factors influencing lung cancer risk predictions.
DISCUSSION
Lung cancer disparities are extensively documented in minority populations with respect to survival outcomes, clinical management, and screening. Recognizing these disparities as a potential means to uncover biological factors, we developed a hybrid AI model that integrates extracted clinical and lung parenchymal imaging features for risk inference. This approach is intended to address the limitations of traditional screening strategies and to enhance sensitivity and specificity for diverse populations. While LDCT screening has proven effective in reducing mortality through early detection and improved survival2,18, existing guidelines have been shown to perpetuate inequities. The USPSTF LDCT criteria18, primarily based on the nearly 95% White NLST study population2, have been shown to contribute to disparities in which the screening-eligible Black population carries twice the risk of lung cancer as compared to the White population4,5. Alternative strategies have been proposed, including utilizing clinical risk models to determine screening eligibility5,7. While these strategies demonstrate increased sensitivity as compared to USPSTF criteria, they have several limitations, including patient selection only at long-term risk levels > 1-1.5% lung cancer risk, lack of risk adjustment from LDCT findings, and limited PPV and NPV values. When the hybrid model was evaluated in the diverse real-world cohort within the Chicago metropolitan area, we noted that it demonstrated higher performance among populations identifying as White as compared to Black, potentially reflecting bias from a training set with a >90% White population. Bias in AI models is an increasing concern, particularly among institutions serving diverse patient populations, and may further exacerbate health disparities19,20.
However, many initiatives are aiming to reduce bias in AI development, including AIM-AHEAD21,22 and, specifically in lung cancer screening, the National Cancer Institute’s CANDID-4AI23. Because of the existing disparities in lung cancer, including the higher risk of lung cancer observed among Black patients who smoke, implementation of AI models will require representative data in training sets and inclusion of Black populations and other minorities in validation cohorts and studies. Furthermore, the domain shift we observed suggests that LDCT images of lung parenchyma from White individuals have different features and latent representations than those from Black individuals. The biological basis of this might be related to the lung parenchyma in Black individuals who smoke, which requires further research and radiomic, histologic, and cellular characterization. In addition, we observed model deterioration in individuals with a BMI ≥ 30. While the underlying mechanism for this deterioration in both the NLST and UIH cohorts was outside the scope of this study, we speculate that latent features within lung parenchyma may be inadequately visualized due to a biological obesity paradox that confounds our hazard risk models24,25. Alternatively, chest wall thickness from subcutaneous adipose tissue may interfere with X-ray penetration, or obesity hypoventilation may cause compression of lung parenchyma.
This study has limitations, chiefly those of the retrospective nature of the UIH cohort which is subject to bias and data missingness. We are aiming to address these limitations with a prospective study on new LDCT participants across multiple sites. The hybrid model also utilizes a CNN architecture trained on the NLST cohort which has limited representation of Black and other minoritized populations. Current efforts such as the CANDID-4AI initiative23 to develop more inclusive training sets may eliminate these biases and overcome domain shifts as we observed in our assessment of the ResNet model14 where performance in White subjects decreased when the model was optimized for Black subjects. Having an inclusive and representative training set will allow ResNet and more advanced deep learning architectures with race labeling to understand underlying structures of racial disparities in lung cancer incidence and incorporate these features and enhance model predictive ability and generalizability.
Collectively, we demonstrate an integrated approach combining imaging features from human lung parenchyma and clinical characteristics that may help elucidate biological contributors to lung cancer. The risk prediction model is capable of predicting lung cancer in two distinct patient cohorts, though its performance deteriorates in Black populations and in individuals with a BMI ≥ 30. Future studies should aim to understand the biological explanation underlying increased lung cancer incidence in Black populations through investigation of the lung parenchyma in Black individuals who smoke or of tumor-adjacent normal tissue. Understanding these biological factors would increase our understanding of lung cancer, open up new avenues for prevention, and potentially enable biologically based strategies that move beyond USPSTF criteria to create an inclusive and individualized paradigm for lung cancer screening.
Data Availability
Requests for data related to this study are subject to IRB approval and review by the UIC Office of Research.
ACKNOWLEDGEMENTS
We wish to thank the UIC Center for Clinical Translational Sciences, the UI Cancer Center Data Integrated Shared Resource, for supporting this research. A.Z. was supported by an AI.Health4all postdoctoral fellowship, M.B. was supported by a Winn-CIPP fellowship, A.A.S. received in kind support as a National Cancer Institute, Center for Cancer Health Equity Early Investigator Advancement Program Scholar,