Abstract
Earlier detection of pancreatic ductal adenocarcinoma (PDAC) is key to improving patient outcomes, as it is mostly detected at advanced stages which are associated with poor survival. Developing non-invasive blood tests for early detection would be an important breakthrough. The primary objective of the work presented here was to use a unique dataset, that is both large and prospectively collected, to quantify a set of 96 cancer-associated proteins and construct multi-marker models with the capacity to accurately predict PDAC years before diagnosis. The data is part of a nested case control study within UK Collaborative Trial of Ovarian Cancer Screening and is comprised of 219 samples, collected from a total of 143 post-menopausal women who were diagnosed with pancreatic cancer within 70 months after sample collection, and 248 matched non-cancer controls. We developed a stacked ensemble modelling technique to achieve robustness in predictions and, therefore, improve performance in newly collected datasets. With a pool of 10 base-learners and a Bayesian averaging meta-learner, we can predict PDAC status with an AUC of 0.91 (95% CI 0.75 - 1.0), sensitivity of 92% (95% CI 0.54 - 1.0) at 90% specificity, up to 1 year to diagnosis, and at an AUC of 0.85 (95% CI 0.74 - 0.93) up to 2 years to diagnosis (sensitivity of 61%, 95 % CI 0.17 - 0.83, at 90% specificity). These models also use clinical covariates such as hormone replacement therapy use (at randomization), oral contraceptive pill use (ever) and diabetes and outperform biomarker combinations cited in the literature.
Introduction
Pancreatic ductal adenocarcinoma (PDAC) is associated with dismal 5-year survival rates (∼3-7%) and is projected to become the second cause of cancer death by 2030 1–3. A non-specific clinical course leading to late-stage diagnosis is a feature of pancreatic cancer, and only 15% of the patients are diagnosed at early stages with resectable tumours 2–5. Following surgery and adjuvant therapy however, less than 30% of patients survive 5 years after diagnosis 5, compared with a less than 10% 5 year survival in those with unresectable disease 6. The development of new tests that could improve detection of early-stage disease is pivotal for improving outcomes for pancreatic cancer patients. Indeed, it has been shown that if tumour size at detection can be reduced from 3 to 2 cm, improved oncological resection rates (7% to 83%, respectively) and increased median survival (from 7.6 to 17.2 months) can be achieved 7, 8. CA19-9 9, 10 is the only serological tumour marker used routinely for the detection and monitoring of PDAC progression, however, with 79-81% test sensitivity and 82-90% specificity for diagnosis of PDAC at best 11. Despite this, we have recently shown that CA19-9 (and CA125) can be used to detect pancreatic cancer up to 2 years before clinical diagnosis, using samples from a repository collected as part of the UK Collaborative Trial of Ovarian Cancer Screening (UKCTOCS) 12. These samples, which were prospectively and longitudinally collected months and years prior to diagnosis, enabled the detection of rising levels of potential serological biomarkers ahead of diagnosis with high reliability. We proposed that CA19-9 in combination with markers identified in this cohort and their use in multi-marker longitudinal algorithms, may improve test performances and lead time to enable early detection of PDAC. Combining markers into multi-marker models has traditionally involved the application of simple cut-off rules and machine learning methods such as multivariate logistic regression 12, neural networks and support vector machines 13–15. Here, to attain better performances and robustness, we applied an ensemble modelling technique 16–19 and a repeated cross-validation re-sampling strategy. Ensemble methods have been immensely successful in producing accurate predictions for many complex classification tasks 16, 18, 20, because they address fundamental problems in data analysis. For example, these methods avoid overfitting by combining single learners with a local search heuristic which decreases the risk of obtaining a local performance minimum. Issues surrounding dimensionality are also addressed with ensemble models. By allowing each classifier to focus on sub-spaces of features, the burden of large search spaces is reduced. This specific field in machine learning is consistent with the well-known Condorcet’s jury theorem, which states that if each classifier has a probability larger than 0.5 of being correct then increasing the pool of classifiers increases the probability of making the correct decision by majority voting. The task of finding successful ensembles is, nevertheless, more complex and dependent on a balance between diversity and consensus among classifiers. A definitive recipe to achieve this goal has yet to be completely defined 16, 18, 20. Two widely cited examples following the ensemble paradigm that focus on complementary and heterogeneity are, for instance, stacking, a form of meta-learning 18, and ensemble selection 16. Stacking constructs a higher-level predictive model over the predictions of base classifiers. Ensemble selection, on the other hand, uses an incremental strategy to select base predictors for the ensemble while balancing diversity and performance 16. Here we will apply the stacking approach due to its simplicity and computational efficiency. Ensemble like modelling has also been used in the design of the CancerSEEK multi-cancer blood test, which assesses blood levels of multiple analytes such as circulating proteins and cell free DNA mutations 21, but also in the creation of indices evaluating changes in DNA methylation patterns 22. The use of multi-datasets and multi-platform integration in pancreatic cancer studies 23 are essential for early detection and aligns at a fundamental level with the data analysis methodology applied in our work. Here, we demonstrate how using a stacked ensemble approach which relies on a panel of 20 features, including cancer-associated proteins and clinical covariates, outperforms state-of-the-art multi-biomarker combinations previously applied in pancreatic ductal adenocarcinoma early detection.
Results
Data set characteristics
Our dataset is part of a nested case control study within UK Collaborative Trial of Ovarian Cancer Screening (UKCTOCS) 24, 25 and is comprised of 143 individuals with PDAC and 248 controls (see Materials and Methods). 35 of the PDAC-diagnosed patients provided longitudinal samples, with between 2 and 6 annual samples per individual collected prior to diagnosis. These samples were considered to be independent for the purpose of data analysis and were divided into a training (2/3) set and test (1/3) set, by stratifying for age, hormone replacement therapy (HRT) use at randomization, oral contraceptive pill (OCP) use (ever), diabetes status (detailed information about duration and type was not available for this work), body-mass index (BMI) quartile, PDAC or control status and for sample single time-group, i.e., 0-1,1-2,2-3,3-4 and 4+ years to diagnosis (YTD), attributed to each sample and determined by the time to diagnosis at sample collection (compare Table 1 with Tables S2 and S3, see also Table S4 for the total number of cases and controls per single time-group). This stratification enabled a clearer evaluation of PDAC classifier panel performances in collected samples not used in training, i.e., the test set, and ensured that the results are realistic and representative, and are not biased by data or information leakage 26.
Time intervals between sample collection and serum/plasma isolation, i.e., time to spin, were comparable between PDAC cases and controls. There was no significant difference in the mean time to spin between cases and controls for the whole study data set (Table 1), with the ranges being also very similar. The same is observed in the training and test sets (Tables S2 and S3). The distribution of ages at sample draw showed a significant association (OR=1.06, P < 0.0001) with PDAC status, with cases having a mean value at 64.94 years and controls at 62.48 years. Once again, a similar observation regarding significance can be made for the training and test sets. In both, we verify odds-ratios favouring PDAC status (Tables S2 and S3). Through further analysis in the training set, we verified that age at sample draw was, nevertheless, only significantly associated with PDAC in training samples obtained 4+ years prior to diagnosis (Fig. S1 and S3A). Moreover, by applying the logistic regression model used for the ranking of individual features which was developed in the training set (see Methods), we observed that age did not generate significant AUCs in the test set, and the sensitivities were very low (Fig. S1). To qualify as a predictive marker, subject age had to be combined as a co-variate in a multi-marker model as is reported below (see also Supplementary Information for an extensive study on single and typical biomarker combinations in simple state-of-the-art logistic regression models). As a single variate, subject BMI was neither a significant predictor in the entire data set (Table 1) nor was it in the training and test sets (Tables S2, S3 and Fig. S1C and D). Regarding HRT, it was significant in all data sets (Tables 1, S2 and S3). In the training set, this result stemmed mostly from the association of HRT use with a lower risk of pancreatic cancer (OR =0.22 < 1, 95% CI 0.04-0.85, P = 0.027, coloured in blue in Fig. S1E) in the 2-3 single time-group samples. Its significant predictive potential for distinguishing between cases and controls, however, was not reproducible in the test set 2-3 time-group by applying a logistic regression model developed in the training set, AUC=0.56 (95% CI 0.50-0.67) (Fig. S1F). Similarly to age, HRT was expected to add value to a PDAC predictive index as a covariate. Although demonstrating an association with PDAC status across the entire study set (Table 1) and the training set (Table S2), OCP was not a significant predictor in any of the single time-group samples evaluated independently (Fig. S1E). Overall, diabetes was the strongest predictor of risk for PDAC among the clinical covariates (Table 1 and Table S2). While consistent across both the entire study set and our training set in all single time-groups excluding the 1-2 YTD samples, the largest risk was in fact observed in the 3-4 YTD subgroup (OR=13.26 (95% CI 1.36 - 1781.23), P =0.022) and 4+ YTD (OR=8.43 (95% CI 1.64 - 84.86), P = 0.0090). These findings were however not reproducible in our test set despite the proportions pertaining to OCP being roughly the same as those in the training (Fig. S1E, F and Tables S2 and S3). Once again, diabetes had to be combined with other co-variates to enhance performances (see section below on ensemble stacking of multi-marker models).
Another aspect of the data that should be reported is the presence of outliers with respect to CA19-9. Despite not having been confirmed, we assumed that Lewis-antigen negative or positive status for each individual could be determined by the lack of expression of CA19-9, verified by a peak at the minimum value in the data used here. Since CA19-9 is not expressed in 8-10% of Caucasians (Lewis-negative blood group), predictive indices should account for this phenomenon. The proportion of individuals in which CA19-9 was not detected was approximately the same in the training and test sets and, apart from in samples belonging to the 0-1 single time-group, there was no significant association with PDAC status in the training set (Fig. S5). Furthermore, logistic regression models, trained to distinguish cases from controls, based on the presence or absence of measurable CA19-9, did not perform well in the test set, with performances always around AUCs of 0.5 and lacked statistical significance (Fig. S5).
Ensemble learning multi-marker models improve performance and robustness
The group of base-learners chosen for the ensemble analysis reported here covered a large and diverse set of approaches, which was beneficial, as different characteristics of the training set were captured by different classifier techniques (see comments below on variable importance attributed by each classifier). This, in turn, will increase classifier heterogeneity and therefore the likelihood of success when predicting outcomes in unseen data 16–19. Here, we will focus on the results of the JTG2L ensemble classifier, which was trained on all samples collected 0-1, 0-2, 0-3, 0-4 or 0-4+ YTD (described in the Methods section), since it was the most robust. For the results on other stacking ensemble strategies see the Supplementary Information section on ensemble classifiers specialized in single time-group samples.
As we increased the interval of joined time-groups from 0-1 to 0-4+, and thus the number of samples used to train, the performances in the test set decreased (when the training and testing time-groups are the same), as expected, since the median time to diagnosis increased (see Fig. 1B and the diagonal values of the heatmap in Fig. 1C). A similar trend was roughly observed for the sensitivity, positive predictive value (PPV) and negative predictive value (NPV) achieved (see Fig. 1D to F), with the 0-2 group appearing as the outlier. The use of the JTG2L Bayesian Model Averaging (BMA) stacked ensemble was beneficial both for improving performance as well as decreasing variability across folds (Fig. 1A). Models trained in each joined time-groups attained, nevertheless, better performances in the test set, in certain instances, when evaluated in samples belonging to narrower time-groups, e.g. AUCtest(0-3) (0-3,training)=0.79 was considerably smaller than AUCtest(0-3) (0-4+,training)=0.84, with the difference being borderline statistically insignificant (P=0.06) (Fig. 1C). This result probably stemmed from the additional non-specific information contained in the single 3-4 and 4+ time-groups, which helped to correct for the poor performances in the 1-2 single time-group (Fig. S28). Under normal circumstances, when evaluating the PDAC risk for samples in newly collected data, early diagnosis would only be a comprehensive effort if evaluated with the classifier trained with 0-4+ joined time-group samples, as time to diagnosis is obviously not available for data collected from new patients. Reassuringly, the cross time-group performances in the test set, generated with the 0-4+ classifier, were not statistically different from those with narrower joined time-groups (Ptest(0-1) (0-4+,training)=0.66, Ptest(0-2) (0-4+,training)=0.68, Ptest(0-3) (0-4+,training)=0.06, Ptest(0-4) (0-4+,training)=0.41, when comparing the last column in Fig. 1C with the diagonal), under this broad classifier, which justifies its use in external validation sets. In addition, the JTG2L BMA ensemble approach outperformed the single marker and multi-marker models relying on simple logistic regression (Fig. S1 and Fig. S2); trained with 0-4+ samples and evaluated in the 0-1 subgroup of the test set, it reached an AUCtest(0-1)(0-4+,training)=0.94 (95% CI 0.83 - 1), which far exceeded the results with CA19-9 alone or in combination with THBS2, MUC16 or CEACAM5 when predicting PDAC status in the same time-group (see Fig. S1 and Fig. S2 and Supplementary Information on single and multi-marker combination models developed with simple logistic regression techniques). The sensitivity at 90% specificity and the respective PPV and NPV were also enhanced for the JTG2L BMA ensemble model. An improvement in performance with respect to previous studies in pre-diagnostic samples was also achieved with samples collected up to 2 years; there a sensitivity of 0.406 at 0.905 specificity was reported 12. The JTG2L BMA stack was capable of attaining a performance of 0.84 (0.72 - 0.93), when trained in samples belonging to the same joined time-group, i.e., 0-2, or 0.85 (95% CI 0.74-0.93), when trained in 0-4+ samples, the difference between the two was insignificant (Ptest(0-2) (0-4+,training)=0.68). The use of the larger feature input space characterising the JTG2L BMA stack is further justified given that the significant single marker models in the training set (Fig. S11) did not reach significance in the test set in 0-2 samples, with exception of CA19-9, with an AUCtest(0-2) < 0.75. A previously published logistic regression model combining CA19-9 and MUC16 gave AUCs of 0.76 for the up to 1 year group in a previous study validation set 12. In the current data set, the same bi-dimensional model predicted PDAC with an AUCtraining(0-1) (CA19-9, MUC16) = 0.81 (95% CI 0.72 - 0.88), and AUCtraining(0-2) (CA19-9, MUC16)=0.74 (95% CI 0.67-0.81), in the training set (reported in Supplementary Information). The same models tested at roughly the same level in unseen data, here the test set: AUCtest(0-1) (CA19-9, MUC16) = 0.76 (95% CI 0.55 - 0.93) and AUCtest(0-2) (CA19-9, MUC16) = 0.66 (95% CI 0.51-0.80) (see Supplementary Information and Fig. S2), the latter considerably lower than the JTG2L BMA stack performance of 0.84 (trained in 0-2 samples, P= 0.0060) and 0.85 (trained in 0-4+ samples, P= 0.0061) (see Fig. 1C). The sensitivity, PPV and NPV at 90% specificity also improved with the algorithm proposed here, with sensitivities as high as 0.92 (95% CI 0.54 - 1), 0.61 (95% CI 0.17 - 0.83) and 0.63 (95% CI 0.29 - 0.77), and PPVs as high as 0.91 (95% CI 0.85-0.92), 0.81 (95% CI 0.55-0.85) and 0.84 (95% CI 0.70-0.87), in 0- 1, 0-2 and 0-3 test samples, depending on the joined/combined time-group used to train (Figs. 1D to 1I). This outcome came at a cost of increased model complexity and input space dimensionality. The features selected from the 101-marker input panel are plotted in Fig. 2 and the enriched terms are presented in Fig. 3. The selected panel of biomarkers for the JTG2L BMA stack meta-learner developed with all time-group samples combined (0-4+), shows typical gene ontology and pathway terms with ‘Constitutive signalling by aberrant PI3K in cancer’ [REAC], ‘Pancreatic adenocarcinoma pathway’ [WP], ‘Pathways in cancer’ and ‘Pancreatic cancer and proteoglycans in cancer’ [KEGG], being significantly over-represented (Fig. 3).
One of the striking aspects of the 0-4+ PDAC predictive index was the presence of the commonly cited markers such as CA19-9, MUC16, THBS2, CEACAM5 and VWF and 4 of the clinical variables, excluding BMI. Diabetes was ranked just below CA19-9 according to the median importance across base-learners and time-groups, something which is consistent with what is observed in Fig. S2. Also remarkable was the variance in the importance attributed to the same markers across models developed in narrower time-groups, except for CA19-9 which almost always was allocated a scaled importance of 1. These observations are in line with those reported in the literature as strong arguments for using stacking and ensembles of classifiers, e.g., a pool of base-learners outperforms single classifiers by enabling heterogeneity within the pool of the base-learners and thus robustness in the predictions 16, 18. There is, nevertheless, a caveat to this as diversity and performance are not strictly directly proportional and while there is a strong dependency between the former and the latter, diversity might hinder performance of the ensemble 18, 27. Given that we started with a reduced pool of base-learners, the true relationship between heterogeneity/diversity and performance could not be analysed extensively. Yet, upon searching all possible combinations of base-learners from the 10-dimensional input space, the use of all 10 classifiers in the stack always outperformed the rest across a set of 10 times 10-fold cross-validation resampling strategy. Although there wasn’t a clear trend for base-learner pair-wise diversity with time-group across base-learners (see Cohen k-statistic in Fig. S17), the JTG2L BMA stack did, in fact, outperform the best base-learner in the training set (Fig. 1A), and any of the remaining stacks (see Fig. S22 and S23). This was particularly clear in the trend observed among stacks from best to worse in Fig.S22A. The BMA stack is the obvious choice across time-groups and, consequently, constitutes the model of choice for validation in new data. It was interesting to note that the MEAN stack outperforms the GEOMEAN stack, a mixture of experts focusing on consensus among base-learners, and the MAX stack, choosing the maximum probability for PDAC status among the base-learners, which, effectively, amounts to highlighting the classifier that has the highest degree of certainty for each sample. The BMA stack and the MEAN stack provide a balance between base-learners, the former being weighted, which increased the performance in the test set, especially in the 0-1 and 0-2 joined time-groups.
Another prominent trend in the importance attributed to each marker was the fact that the more significant a marker is (Fig. 2A), the higher the variance in importance it registered across base-learners (Fig. 2B, see also Fig. S26). By observing the association of each of the markers with increased (OR > 1) or decreased (OR < 1) risk for PDAC (Fig. S6-S15), we also verified that the ensemble trained on 0-4+ samples had a greater proportion of OR <1 markers than, for instance, the ensemble trained on 0-2 samples.
Regarding the importance of developing a classifier that is reliant on features covering both the population that expresses CA19-9 (Lewis-positive) and does not express CA19-9 (Lewis-negative), we also tested the BMA stack model in the subset of subjects in the test set that are Lewis-positive. The number of Lewis-negative subjects was not sufficient to generate significant performances. Overall, the BMA stacked ensemble was not affected substantially and a performance in the test set of AUC=0.89 (95% CI 0.70 - 1.00) in 0-1 samples and of AUC= 0.83 (95% CI 0.70 – 0.93) in 0-2 samples was obtained. This confirms that the association between the CA19-9 peak at the level associated with absence of expression and PDAC was not affecting the results. Further tests in larger pools of confirmed Lewis-negative samples should clarify the performance of the ensembles developed here.
One particular aspect of PDAC is its low prevalence in the general population, at 8–12 per 100 000 per year and a 1.3% lifetime risk of developing the disease 28. The models developed and tested here relied on data where the prevalence was approximately between 40% to 50% (see Table 1 and Table S4) and on the maximization of the ROC AUC, which is independent from prevalence. At these values, the BMA stack JTG2L reached PPVs as high as 0.91 (95% CI 0.85 – 0.92) in 0-1 samples and 0.84 (95% CI 0.70 – 0.87) in 0-3 samples (Fig. 1E). If the prevalence of the disease in the test set was changed the PPV and NPV at 90% specificity decreased and increased, respectively, at lower disease prevalence, as expected (Fig. S25) 29. The ROC AUC was nevertheless stable and the tendency with time-groups observed in Fig. 1 was also verified (see Fig. S24). The larger variances at lower prevalence stem from the smaller datasets.
Discussion
From the classifiers developed here, those trained with all available samples (0-4+) would be the most appropriate under a clinical setting, as time to diagnosis is not available in newly collected data. The features highlighted as predictive under these circumstances included important and widely referenced markers which we confirmed as having among the strongest association with PDAC, namely CA19-9, CEACAM5, MUC16 (CA125), THBS2 and diabetes. Compared to the rest, CA19-9 was the most prominent marker as all base-learner classifiers attributed it almost always the highest importance. CA19-9 is an indicator of aberrant glycosylation in pancreatic cancer and it is considered as a biomarker, predictor, and promoter in pancreatic cancer, although it is often found elevated in benign pancreatic biliary diseases such as pancreatitis, cholangitis and obstructive jaundice, giving a significant rise in false positives 11, 30. Moreover, CA19-9 expression is absent in 8-10% of Caucasians with a Lewis-negative blood group, as the CA19-9 epitope is in fact, the sialylated Lewis A blood group antigen 11. Although CA19-9 levels are routinely used in detection, determination of resectability and monitoring of PDAC progression 31, 32, its low predictive value and a low prevalence of pancreatic cancer in the general population exclude it as a robust screening tool3, 33. The combination of CA19-9 with other markers proposed here is, therefore, advantageous. Regarding CEACAM5 and the role of carcinoembryonic (CEA) related cell adhesion molecules (CEACAMs) 1, 5 and 6 in progression of solid tumours (such as colorectal, lung, melanoma, breast, liver) including pancreatic cancer is well established, and their expression varies between different tumour histological subtypes 34–36. With respect to pancreatic cancer, CEACAM5 has been widely described as having a variable diagnostic value in PDAC detection, while its expression has also been reported to inversely correlate with disease stage 37. Pancreatic cyst fluid levels (>192 ng/mL) of CEACAM5 are also predictive of mucinous rather than serous cystic lesions 36, 38–41, and endoscopic sampling with fluid CEA measurement is recommended by international guidelines for identification of malignant transformation 42–44. In a recent report, CEACAM5 was found to be persistently elevated up to 26.5 months prior to pancreatic cancer diagnosis, in a large cohort (n=1196) of longitudinally sampled subjects 45. The predictive performance of CEACAM5 as a single analyte in pancreatic cancer however, is poor 35. In our work, CEACAM5 taken as a single predictor achieves only significant performances in 0-1 YTD samples, but is among the top covariates in terms of importance across base-learners.
MUC16/CA125 also ranked high in importance across the base-learner classifiers and time-groups despite not performing well as a single predictor. MUC16 is a cell surface glycoprotein which can be elevated in tissue and sera of patients with various cancers and is mostly used in the diagnosis and prognostication of ovarian cancer 46, 47. MUC16 mediated metabolic reprogramming in pancreatic cancer is associated with tumour progression and metastasis through stromal remodelling 47, 48. Due to its overexpression on the surface of pancreatic cancer and absence from normal tissue, the value of MUC16 as a biomarker of PDAC has been investigated 48. A progressive change in expression of MUC16 throughout different stages of disease progression is already evident at pre-malignant (pancreatic intra-epithelial neoplasia; PaNIN) stages, highlighting its potential value as a diagnostic marker of early cancer 46, 49. With respect to its diagnostic performance, one meta-analysis which included 1235 patients reported a pooled sensitivity of 0.59 (95% CI 0.54-0.62; at 0.78 specificity, 95% CI 0.75-0.82) for detecting PDAC using CA125 (an epitope of MUC16) 50 as a single marker 51. When combined with CA19-9 (AUC 0.85) CA125 increased the overall accuracy of the former (AUC 0.89), demonstrating superior diagnostic accuracy compared to CA125 (or CA19-9) alone 51. The clinical value of MUC16 in prediction of metastasis and prognosis in PDAC has been also previously reported 52, 53. Serum levels of MUC16/CA125 have been shown to be the strongest predictor of metastatic disease (AUC of 0.892 at 95% CI 0.846 - 0.938, p<0.001) and survival (HR: 1.804, 95% CI 1.22 - 2.66, p=0.003) compared to other markers (including CEA) in 180 PDAC patients 53.
THBS2 and THBS1 were both significantly associated with PDAC, albeit not across all time-groups, the latter only in a subset of samples (see Supplementary Information). Thrombospondins are glycoproteins which mediate cancer growth and progression through cell-cell and cell-matrix interactions, tissue remodelling and regulation of inflammation, immunity and angiogenesis 54, 55. A specific role for THBS2 in tumour associated vascularisation has been reported in various cancers (including colon, liver, lung and melanoma) in which its aberrant expression was reported to be of both diagnostic and prognostic value 54, 56. With respect to PDAC, recent observations in a cohort of 493 (263 with PDAC) showed that THBS2 serum levels significantly differed and differentiated PDAC from high-risk individuals (familial pancreatic cancer patients) with a 55.9% test sensitivity (at 100% specificity; 100% PPV and 66.5% NPV). Moreover, PDAC patients with higher serum THBS2 (also termed TSP-2) levels showed worse clinical outcomes (hazard ratio = 1.54, 95% CI 1.143 – 2.086, P = 0.005). Interestingly, when combined with CA19-9 an improved panel sensitivity for PDAC (90.5% at 98.7 % specificity) was reported54.Taken as part of a signature as was presented here, THBS2 is expected to help robustly with early detection of PDAC.
In this study, among all interrogated clinical covariates, diabetes was the strongest predictor of a higher risk for PDAC (OR=13.26, 95% CI 1.36 - 1781.23, P =0.022), in our 3-4+ YTD cohort. An association between long-standing type 2 diabetes and a 1 to 1.5 fold increased risk of PDAC compared to non-diabetics has been reported 57, 58. The value of diabetes as a clinical covariate in a multi-marker panel is supported by the fact that the presence of new onset diabetes (NOD) (less than 3 years prior to diagnosis of PDAC) increases this risk 5-fold and is induced by various pancreatic conditions including chronic pancreatitis and PDAC-induced hyperglycaemia (recently described as type 3c diabetes). Abnormal fasting glucose or glucose intolerance are observed in the majority (>80%) of PDAC patients 3, 58, 59; PDAC as an underlying cause of NOD, is found in around 1% of individuals aged over 50 and therefore NOD might be considered an early warning sign of a pancreatic malignancy 59, 60.
Despite numerous reports of recent biomarker discoveries, lack of standardisation in sample identification and handling, methodologies of analysis as well as data capture and bioanalytical interpretation challenge their later validation in larger cohorts 61. Moreover, due to the biological complexity in PDAC, tested biological fluids and the overlapping features with high risk conditions (e.g. chronic pancreatitis), the predictive capabilities of single analytes are clearly insufficient to meet acceptable diagnostic performances62. Considering their individual, relative poor performance (79-81% test sensitivity and 82-90% specificity in the case of CA19-9 for example) as well as the low prevalence of PDAC in the general population however, accepting low values for test sensitivity and specificities results in a significant number of false positives. The combination of multiple analytes in diagnostic panels clearly enhances CA19-9 performance, as well as being able to compensate for cases in which CA19-9 detection is limited (Lewis body negative patients) 63. A mere combination of multiple biomarkers with low sensitivities at high specificities based on individual levels however, would be insufficient, and their independent contribution to the overall risk should be considered 64. With this in mind, the aim of the ensemble modelling strategy explored here was to improve on single classifiers by combining diverse techniques in a way such that the predictive performance of the ensemble would be greater and more robust when resorting to multi-marker models, thus addressing the issues highlighted above. The pool of chosen classifiers proved to be useful in highlighting different aspects of the data by selecting a larger panel of biomarkers and allocating different importance to each. Despite finding that the stacking protocols explored offered substantial and statistically significant improvements over the previous state-of-the-art prediction methods, we expect that scanning the space of all classifier combinations from available code libraries will improve the results. Yet, we must resort to other methods for finding better performing ensembles when predicting PDAC. Early detection is a notoriously difficult problem due to the extreme class imbalance, missing values and the necessity to create predictive indices performing well at representative disease prevalence. Because the robustness of ensembles is an emergent, not an explicit property, different directions for future work should be taken by formalizing the effects of calibration on heterogeneous ensemble performance and explicitly incorporating diversity in the search 18, 65. In fact, this will be even more prominent when combining longitudinal methods 66 with ensemble selection and stacking techniques. To our knowledge longitudinal samples and associated time-series classification techniques have never been applied to early detection of pancreatic cancer. Dynamic changes of CA19-9 and MUC16 identified in previous work 12 are an indication of the importance of these types of studies. The importance of trends and rates of change in each biomarker taken separately or in multi-marker models has also proven to be advantageous in improving early detection in ovarian cancer longitudinal models 66 and mechanistic tumour and biomarker secretion models67. Further developments in PDAC longitudinal data sets and ensemble model selection routines should highlight the importance of early and late biomarker panel dynamic changes and increase performance in a clinical setting, in addition to contributing with invaluable methodologies to the field of machine learning and artificial intelligence in early detection of cancer 68.
Materials and Methods
Study Design
This nested case control discovery study was approved by the Joint UCL/UCLH Research Ethics Committee A (Ref. 05/Q0505/57). Written informed consent was obtained from donors and no data allowing identification of patients was provided. The study set comprised serum from post-menopausal women aged 50-74 recruited to UKCTOCS between 2001 and 2005 and collected according to an SOP 24, 25. All participants were ‘flagged’ with the national agencies for cancer registrations and deaths using their NHS number. Women subsequently diagnosed with pancreatic ductal adenocarcinoma (cases) were identified by cross-referencing with the Health and Social Care Information Centre cancer registry codes and death codes (ICD10 C25.0/1/2/3/9). Confirmation of diagnosis was sought from GPs and consultants through questionnaire and from the Hospitals Episode Statistics database. In total, 143 cases were identified (with 219 associated serum samples) that had not been registered as having any other cancer since randomization and that had a confirmed diagnosis of pancreatic cancer. Matched non-cancer controls, i.e., with no cancer registry code, from individual women were selected based on collection date and centre to minimize variation due to handling and storage. From this set, 248 controls were selected. 35 of the PDAC cases had longitudinal data, with between 2 and 6 annual longitudinal samples per individual years before diagnosis. Due to the design of the UKCTOCS study, PDAC stage for all the cases at the time of diagnosis was not available.
Serum Analyte Measurements
All serum samples were randomized for testing. A total of 111 analytes were tested. CA19-9(A) was measured using the Mucin PC/CA19-9 ELISA Kit (Alpha Diagnostic International) according to the manufacturer, using a 1:4 serum dilution. CA125(A) (MUC16) assay was performed using the Cobas CA125 II CLIA with a CA125 II Calibrator Set (Roche and Fujirebio Diagnostics) on a Cobas E411 analyzer with PreciControl Tumour Marker to monitor assay imprecision. LRG1 level was assessed using the human LRG1 ELISA Assay Kit (Immuno-Biological Laboratories) at a 1:2000 serum dilution. PIGR was measured using the human secretory component (SC) ELISA Kit (Cusabio) at a 1:500 serum dilution. REG3A/PAP level was determined using the PANCREPAP ELISA Kit (DynaBio) at a 1:100 serum dilution and Factor XII (F12) using the Factor XII Human ELISA Kit (abcam) at a 1:1000 serum dilution. For vWF, we resorted to the Von Willebrand Factor Human ELISA Kit (abcam) at a 1:100 serum dilution. THBS1 (TSP1) level was evaluated using the Quantikine Human Thrombospondin-1 Immunoassay (R&D Systems) at a 1:100 serum dilution. AGR2 was calculated using the Anterior Gradient Protein 2 ELISA kit (USCN Life Science) at a 1:25 serum dilution. A1AT (SERPINA1) was measured by α-1-Antitrypsin ELISA kit (Immunodiagnostik AG) and IL6ST (IL6RB) by Quantikine human soluble gp130 (R&D Systems). THBS2 (TSP2) was measured using the Quantikine Human Thrombospondin-2 Immunoassay (R&D Systems) at a 1:10 serum dilution and TEK using the Quantikine human TIE-2 ELISA Assay Kit (R&D Systems) at a 1:10 serum dilution and IGFBP1 by human IGFBP1 ELISA Assay Kit (abcam) at a 1:50 serum dilution. Finally, IL17RA was measured using the human IL17RA ELISA Assay Kit (Abnova). Assays were performed on singlet test samples and values not measurable on the standard curves were given the value ‘low’.
The group of markers AFP, TEK and IGFBP1 was only measured in a subset of subjects from the total used to develop the classifiers presented in the main text. A similar situation occurred for the group of markers LRG1, PIGR, REG3A, F12, AGR2, IL17RA, SERPINA1 and THBS1. Some of these biomarkers were motivated by a previous paper 12. See Supplementary Information for further details and results.
Samples were also tested using Olink’s multiplex immunoassay Oncology II panel. Known cancer antigens, growth factors, receptors, angiogenic factors and adhesion regulators were measured (see Table S1).
Statistical Analysis
We divided the whole set of samples into a training (2/3) and validation (1/3) set, by stratifying for age quartile, HRT use at randomization, OCP use (ever), Diabetes status, BMI quartile, PDAC and control status and sample single time-group, i.e., 0-1,1-2,2-3,3-4 and 4+ YTD, attributed to each sample determined by the time from collection to diagnosis. Despite the fact that 35 of the PDAC cases had longitudinal data, with between 2 and 6 annual longitudinal samples per individual years before diagnosis, with an average of 1.53 samples per individual (see Table 1), all samples were taken as independent, and no intra-individual correlation was imposed or explicitly modelled during data analysis in this instance.
Receiver operating characteristic (ROC) curves were constructed for each model to assess diagnostic accuracy. The AUC for the ROC curves was used as the performance metric. Models were selected based on their rank in the training set across cross-validation folds. ROC curves were generated with the pROC R package (version 1.15.3). 95% CI for AUCs were determined by stratified bootstrapping. All AUC confidence intervals crossing 0.5 were deemed insignificant. In addition, sensitivity, positive and negative predictive values at 90% specificity are also reported.
In order to evaluate the association between each of the single markers, including the clinical covariates (see Table 1), and PDAC status, we resorted to the logistic regression model implemented in the logistf R package (https://cran.r-project.org/web/packages/logistf/logistf.pdf, version 1.23). This approach fits a logistic regression model using Firth’s bias reduction method. The reported confidence intervals for odds ratios and tests were based on the profile penalized log likelihood and incorporate the ability to perform tests where contingency tables are asymmetric or contain zeros. P values were used to rank markers. The performance of single marker models was also verified in the test set (Fig. S3 to S14 and Supplementary Information section on single marker performance).
Multi-dimensional analysis of the data was performed under two separate frameworks: a brute-force algorithm scanning through combinations of up to 3 markers and fitting a logistic regression model (see Supplementary Information), and a stacked ensemble algorithm with 10 base-learners. The models stemming from the brute-force approach were ranked according to their performance across re-sampling training folds. The ensemble models relied on the performance of the base-learners presented in Fig. 1 and highlighted below:
Decision trees and rule-based models for pattern recognition (C50, https://cran.r-project.org/web/packages/C50/C50.pdf);
Support vector machines with radial basis function kernel (SVM, https://cran.r-project.org/web/packages/kernlab/kernlab.pdf);
Regularized random forests (RRF, https://cran.r-project.org/web/packages/RRF/RRF.pdf);
Neural networks with feature extraction (NNET, https://cran.r-project.org/web/packages/nnet/nnet.pdf);
Gaussian process with radial basis function kernel (GAUSSPR, https://cran.r-project.org/web/packages/kernlab/kernlab.pdf)
Lasso and elastic-net regularized generalized linear models (GLMNET, https://cran.r-project.org/web/packages/glmnet/glmnet.pdf)
Bagged Adaptive Boosting (ADABAG, https://cran.r-project.org/web/packages/adabag/adabag.pdf)
Extreme gradient boosting (XGBOOST, https://cran.r-project.org/web/packages/xgboost/xgboost.pdf)
Generalized Linear Model with Stepwise Feature Selection with Akaike Information criterion (GLMStepAIC, https://cran.r-project.org/web/packages/MASS/MASS.pdf)
Naïve Bayes classifier (NB, https://cran.r-project.org/web/packages/klaR/klaR.pdf)
The selection of base-learners was grounded on covering a number of state-of-art methods and algorithmic families, from bagging and boosting to general linear models with in-built feature extraction, previously referenced in the literature 16, 19, that would be able to capture different aspects of the data with an efficient computational effort and that had, for the most part, typical hyperparameter ranges published in the literature 19, some with applications in biology 18. Due to the size of the data sets, we narrowed down the size of set of base-learners to 10. Further work on ensemble selection from libraries of models should contribute to clarifying if other techniques add significant value 16 by testing performance against base-learner pool diversity 18. The training of the base-learners was executed in two ways:
by taking joined/combined time-group samples, i.e., collected 0-1, 0-2, 0-3, 0-4, 0-4+ YTD or
by training the set of base-learners in each single time-group specific samples, i.e., 0-1, 1-2, 2-3, 3-4, 4+ YTD.
The first model forces the base-learners to learn specific and cross-time-group details together, whereas the second model creates specialized groups of base-learners per single time-group. We tested several staking procedures: by Bayesian Model Averaging (https://cran.r-project.org/web/packages/BMA/BMA.pdf) with an underlying logistic regression model (BMA stack), by averaging with an arithmetic mean (MEAN stack) and geometric mean across the probabilities attributed by each base-learner (GEOMEAN stack), or by taking the maximum probability across all base-learners (MAX stack). This class is named throughout this paper as JTG2L (see Fig. 1 and Figs. S19). For the second model we tested a 2-layer and a 3-layer stacked model. The first, referred to as STG2L (Fig. S20), took the base learners trained in each single time-group and applied the 4 stacks as mentioned above, although to a larger stack input space. If, for example, we are training with samples belonging to every single time-group, i.e., 0-1, 1-2, 2-3, 3-4, 4+, the stack feature input space will have 10 times 5 dimensions; each base-learner is trained on each single time-group, giving 5 models per base-learner and a total of 50 base-models (Fig. S20). Subsequently, the probability output from each base-learner model is concatenated and fed into the meta-learner. For the specific case of the STG2L protocol, we also tested an average neural network meta-learner model (AVNNET stack) trained on the concatenated probability matrix created from each base-learner probability output. The second, named STG3L (Fig. S21), stacks twice and, therefore, has 3 layers. First it stacks the base-learners per single time-group with a BMA stack and, subsequently, stacks the result, a 5-dimension feature space of probabilities with either a BMA stack, a MEAN stack, a GEOMEAN stack or a MAX stack, if, for example, we are training with samples belonging to every single time-group (see Fig. S21 for further details). This approach is reminiscent of stacking auto machine-learning procedures implemented in AutoGluon (https://github.com/awslabs/autogluon). Other combinations of time-groups were also tested, e.g., 0-1 plus 1-4, 0-2 plus 2-4, etc., but the stacked classifiers either underperformed or were not robust.
All base-models were trained by 5 times repeated 10-fold cross-validation with over-sampling of the minority class, in our dataset the PDAC cases (see Table 1). In order to avoid overfitting, we ranked each of the features, both biomarkers and clinical covariates, with the logistic regression model mentioned before, and scanned the ranked feature input space, in increments of 10 features, with the objective of finding the optimal performance across cross-validation folds, without bias 69. Despite some features not being significant according to the logistic regression model with bias correction when evaluated as a single predictor, the protocol we applied scanned over all 101 features, clinical covariates included. As described, the stacked models were trained on the probability matrices, i.e., generated by concatenating the vectors whose entries are the probability of being a case according to each base-learner. We opted for a 10-times 10-fold cross-validation resampling strategy for the meta-learner to further secure that the choice of models was robust. The extensive resampling strategy secured both at the base-learner level as well as the meta-learner level that the models learn and performed robustly across a large number of diverse folds, thus further ensuring that the longitudinal samples taken as being independent did not skew the training and reduce the generalization power of the PDAC classifiers. We also developed other classifiers trained in both real and synthetic data by applying state of the art techniques such as SMOTE 70 and non-parametric algorithms 71 (see Supplementary Information for details).
The fact that the PDAC cases had longitudinal samples which we considered as independent, did not affect the training of the models. The 5-times repeated 10-fold cross-validation resampling strategy secured that the random allocation of samples to training and validation folds during training avoids a systematic use of samples from the same individual in hyperparameter optimization. Regarding the stratification of training and test sets, given that this was done also with information of sample single time-groups, there wasn’t a consistent presence of samples from the same individual which would interfere with the performances reported here.
The variable importance routine selected for evaluating feature importance in each base-learner (see for example Fig. 2) was a model-agnostic method based on a simple feature importance ranking measure 72, implemented in the R package vip (https://cran.r-project.org/web/packages/vip/index.html). Model-agnostic interpretability separates interpretation from the model, is a more flexible approach and can be applied to any supervised learning algorithm. It was crucial in our case for understanding the variance in importance attributed by each base-learner since some classifiers have in-built routines, e.g., Random Forests (RRF, Fig. 2), and others don’t, e.g., Naïve Bayes (NB, Fig. 2).
The enrichment analysis for each of the signatures developed with single and joined time-groups was performed with the gprofiler2 R package (https://cran.r-project.org/web/packages/gprofiler2/gprofiler2.pdf) version 0.2.0. In Fig. 3 only up to 15 significant terms are shown. Threshold for multiple comparison correction under false discovery rate was set at 0.05.
Data Availability
Data requestors will need to sign a data access agreement and in keeping with patient consent for secondary use, obtain ethical approval for any new analyses. All packages used in the pre-processing of data and subsequent analysis have been identified to secure full reproducibility.
Supplementary Information
In this supplementary information we provide the additional tables and figures cited in the main text, as well an extensive analysis of single biomarker association with PDAC and previously reported multi-dimensional models, further results of ensemble classifiers specialized in single time-group samples, the effects of training the base-learners with real and synthetic data and, finally, PDAC signatures developed with additional proteomic data for a sub-group of subjects.
Funding
This research was funded by Cancer Research UK (grant C12077/A26223) and supported by the Pancreatic Cancer UK Early Diagnosis Award 2018, project “The Accelerated Diagnosis of neuroEndocrine and Pancreatic TumourS (ADEPTS)”, and by the National Institute for Health Research (NIHR) University College London Biomedical Research Centre. UKCTOCS was core funded by the Medical Research Council, Cancer Research UK, and the Department of Health with additional support from the Eve Appeal, Special Trustees of Bart’s and the London, and Special Trustees of UCLH. AZ thanks support from the Ministry of Science and Higher Education of the Russian Federation within the framework of state support for the creation and development of World-Class Research Centers “Digital biodesign and personalized healthcare” №075-15-2020-926. HW is supported by the NIHR Imperial Biomedical Research Centre. EC is supported by Cancer Research UK (C7690/A26881).
Author contributions
NRN, JT and AZ conceived the study. JT, AZ, SP and UM secured funding. NRN did model building and statistical analysis, produced the figures, interpreted the data in collaboration with JT and AZ. NRN, JT and AZ drafted the paper. HJ performed the experiments for thrombospondin-2. All authors contributed to data acquisition and interpretation, and critically reviewed and approved the article.
Competing interests
No potential conflicts of interest were disclosed by any of the authors.
Data and materials availability
Data requestors will need to sign a data access agreement and in keeping with patient consent for secondary use, obtain ethical approval for any new analyses. All packages used in the pre-processing of data and subsequent analysis have been identified to secure full reproducibility.
Supplementary Information
Single biomarker association with PDAC and previously reported multi-dimensional models
Levels of the established tumour markers CA19-9, MUC16 and CEACAM5 showed a significant association with PDAC in samples taken less than 1 YTD (Figs. S1 and S6), under a logistic regression model developed in the training set (see details in the Methods section), which is consistent with previous reports 12. Notably, THBS2 also appeared to have a predictive value in samples taken up to a year to diagnosis (OR=1.73 (95% CI 1.17-2.70), P= 0.0054, AUC=0.65 (95% CI 0.55-0.75)) (Figs. 1 and S4), between 1 to 2 years (OR=2.34 (95% CI 1.23 - 4.87), P= 0.0081, AUC=0.67 (95% CI 0.55-0.79)) (Figs. 1 and S5), and above 4 YTD (OR=2.63 (95% CI 1.17 - 6.84), P= 0.018, AUC=0.67 (95% CI 0.52-0.82)) (Figs. 1 and S8). A link between THBS2 and an increased risk for pancreatic cancer has been reported before 55 and was shown to improve the ability of CA19-9 to distinguish PDAC from pancreatitis, with a specificity of 98% and a sensitivity of 87% for PDAC. Samples in that report, however, were collected after diagnosis. The samples used in our work were all gathered prior to diagnosis, the latest approximately 1 month before. THBS2, nevertheless, did not generate significant AUCs in the test set, whether with logistic regression models trained with single time-groups (Fig. S1) or combined time-groups (see also Fig. S6 to S14). The only other marker that remained a consistent predictor in more than one single time-group was CA19-9: ORtraining(0-1)=1.67 (P=2.92 x 10-9), AUCtraining(0-1)=0.81 (95% CI 0.71 - 0.89) and Senstraining(0-1)= 0.64 (95% CI 0.48 - 0.77) at 90% Spec; ORtraining(1-2)=1.67 (P=2.92 x 10-9), AUCtraining(1-2)=0.66 (95% CI 0.53-0.78), Senstraining(1- 2)= 0.31 (95% CI 0.10 - 0.49) at 90% Spec; ORtraining(2-3)=1.60 (P=0.0017), AUCtraining(2-3)=0.71 (95% CI 0.59-0.83), Senstraining(2-3)= 0.33 (95% CI 0.061 - 0.61) at 90% Spec (Fig. S1). When attempting to predict PDAC status in the test set with the logistic regression models trained with time-group 0-1 samples and used to rank biomarkers, we still verified that CA19-9 was one of the two biomarker models, the other being CEACAM5 (ORtraining(0-1)=1.94 (95% CI 1.32 - 3.08), P= 0.00029 and AUCtraining(0-1)= 0.67 (95% CI 0.56 - 0.77) in the training set and AUCtest(0-1)=0.84 (95% CI 0.65 −0.97) in the test set) (see Fig. S1 and Figs. S6 to S14), that achieved significant performances: AUCtest(0-1)=0.73 (95% CI 0.52 - 0.93), Senstest(0-1)= 0.62 (95% CI 0.38 - 0.85) at 90% Spec. The combination of CA19-9 and THBS2 improved the performance in both training and the test sets, for 0-1 samples, i.e., AUCtraining(0-1) (CA19-9, THBS2) = 0.82 (95% CI 0.73 - 0.89) and AUCtest(0-1) (CA19-9, THBS2) = 0.74 (95% CI 0.51 - 0.92), although the difference with respect to the CA19-9 single marker model was not significant, i.e., P=0.19 and P= 0.69, respectively, under a DeLong method for comparing AUCs from 2 different ROC curves (see Methods). Additionally, the combined 2-marker model CA19-9+CEACAM5 did not improve the diagnostic performance of CA19-9 alone in the training set (AUCtraining(0-1)=0.80 (95% CI 0.71-0.88)), but improved in the test set, i.e., AUCtest(0-1)=0.81 (95% CI 0.62 - 0.95), although not sufficiently for the difference to be significant (P=0.24). MUC16 (commonly known as CA125), which was previously reported to provide additional sensitivity when combined with CA19-9 in a pre-diagnosis setting similar to ours 12, was ranked as significant in our study, with an ORtraining(0-1)=2.72 (95% CI 1.66 - 5.01), P= 8.7 x 10-6, AUCtraining(0-1)= 0.72 (95% CI 0.62 - 0.82) and Senstraining (0-1)= 0.64 (95% CI 0.29 - 0.67) at 90% Spec, but was borderline insignificant when tested against blinded samples, i.e., AUCtest(0-1)=0.72 (95% CI 0.49 - 0.90). As was the case of the combined model with THBS2, the bi-dimensional model CA19-9+MUC16 had a similar performance to the CA19-9, i.e., AUCtraining(0-1) (CA19-9, MUC16) = 0.81 (95% CI 0.72 - 0.88) and Senstraining(0-1)= 0.58 (95% CI 0.46 - 0.75) at 90% Spec, and therefore with no significant improvement in performance (P= 0.91). When evaluated in the test set, the CA19-9 and MUC16 multi-marker model outperformed CA19-9 and the combination of CA19-9 and THBS2, i.e., AUCtest(0-1) (CA19-9, MUC16) = 0.76 (95% CI 0.55 - 0.93), but not significantly, P=0.76 and P=0.80, respectively. The CA19-9+CEACAM5 model was, therefore, the best multi-marker predictor in the test set built from frequently cited markers but once again the difference was not significant when compared to CA19-9 alone (P=0.44). We must stress that any of the bi-dimensional models reported above did not involve any interaction terms because if we resorted to non-linear terms in the training set the results were not significant for any combinations of the typical panel of analytes reported in the literature.
Several other multi-marker combinations have been put forward to add value to traditionally used panels 73 (see Fig. S1 for the typical single biomarkers used). In fact, as was observed in the training set 0-4+ samples, a large group of markers appeared significant: CA19-9, VWF, THBS2, MUC16, HGF, CEACAM5, VEGFA, FURIN, CTSV, NT5E, TLR3 (see Fig. S14). These findings further stimulated the exploration of multi-dimensional models in order to reach higher performances and robustness across all time-groups. If we resorted once again to the simple logistic regression approach and generated models with up to 3 features, selected from those reported in the literature 3, 73, 74, e.g., CA19-9, CA125, CEACAM5, HGF, THBS2, CEACAM1 (Fig. S1), included clinical covariates and chose the top 10 according to their performance across training cross-validation folds, we concluded that despite generating a significant performance in the test set, a clear advantage in building models from this restricted pool of features was lacking, especially when the sensitivity at 90% specificity values were relatively low (see Fig. S2 and, additionally, Fig. S15 for results obtained with single time-groups). What was notable about these 3-marker models was the presence of the covariate DIABETES and AGE in top ranks, as we had predicted (see Main Text). DIABETES was a predictor of PDAC in samples allocated to all single time-groups in the training set, with exception of 1-2 YTD (see Fig. S1). There was, therefore, the need for a systematic search over a larger panel, in our work composed of in-house and Olink biomarkers (see Table S1), and for the use of several different classifier types which we combined with an ensemble modelling technique (see Methods and Results sections in the Main Text).
Ensemble classifiers specialized in single time-group samples
We also developed classifiers specialized in each single time-group, i.e. 0-1, 1-2, 2-3, 3-4 and 4+, here referred to as STG2L and STG3L (see Methods for details and Figs. S20 and S21), by resorting to the same framework, i.e. stacked ensemble with repeated cross-validation and identification of optimal size of input space, and combining the probability/risk of being a case of each sample generated by each specialized model either by taking only the output that had the maximum probability, by averaging, i.e. with a Bayesian model or a simple arithmetic average, or utilizing a mixture of experts, in our case, by calculating the geometric probability mean across all models. The latter is more conservative since a subject is only classified as a Case if there is a central tendency in the 10-model probability vector of outputs. Both the maximum probability principle as well as the mixture of experts underachieved significantly in the test set when compared with the standard approach (see Figs. S22 and S23). The problem with training specialized ensembles per single time-group lies with the sample size. Their performance in the training set indicates that these alternative models, despite the extensive re-sampling strategy are not robust both at the base-learner level (Fig. S16) and the meta-learner level (Figs. S22 and S23). Consequently, their potential as predictive indices is reduced systematically, despite the overall pair-wise diversity among base-learner predictions being high in the test set (Fig. S18, see also Fig. S27 for feature importance). Further expansion of the dataset should clarify if the use of single time-group specialized classifiers is advantageous.
Training the base-learners with real and synthetic data
For the purposes of exploring the effects of re-balancing the imbalanced classes, in our work PDAC and Control, across all time-groups, we also developed other classifiers under the JTG2L class (see Methods for details and Figs. S19 and S22) but trained in both real and synthetic data we generated by fitting a high-dimensional Dirichlet Process Gaussian Mixture (HD-DPM) Model 71 model to the cases of the training set and, independently, to the controls. The high-dimensionality character of the HD-DPM method is required to capture cross-biomarker correlations, which would not have been preserved had we fitted a one-dimensional DPM model to each biomarker independently. For the sake of completeness, we also tested the Synthetic Minority Oversampling Technique (SMOTE) 70. The inclusion of the routines for synthetic data generation was done during re-sampling and prior to the application of each stacked algorithm. This is particularly important for the HD-DPM algorithm since a separate distribution is fitted to PDAC cases and controls. The number of synthetic samples needs to be similar to those of real samples, otherwise the classifiers learn the distribution of the synthetic data instead.
This did not improve the performance either in the training set or the test sets (see Figs. S29 and S30) and thus renders the option of synthesizing data and enlarging the number of representative examples to aid the training process not viable. Both techniques were only run for joined time-groups, as the number of samples in each single time-group was not sufficient to implement either the HD-DPM or the SMOTE algorithm. Moreover, there is evidence that for high-dimensional problems techniques such as over-sampling/under-sampling of the minority/majority class outperforms methodologies such as SMOTE 70. We also tested the ROSE algorithm 75 for synthesizing data during cross-validation resampling steps, a technique similar to HD-DPM despite not providing the benefits of non-parametric modelling. Overall, the results obtained with ROSE were far inferior to any of the other 3 algorithms, and as is the case for synthetizing samples with HD-DPM, a lengthy process. Therefore, we ran the remaining models with the oversampling of the minority class as a strategy to develop all the ensembles of base-learners reported in the Main Text.
PDAC signatures developed with additional proteomic data for a sub-group of subjects
In addition to the data set used to generate the models reported in the Main Text, we also had additional biomarkers that had been measured only in sub-groups of samples. Sub-set I includes the biomarkers AFP, TEK and IGFBP1 (see Figs S31 and Tables S5, S6 and S7), while sub-set II includes the additional biomarkers LRG1, PIGR, REG3A, F12, AGR2, IL17RA, SERPINA1 and THBS1 (see Figs. 34 and Tables S8, S9 and S10). These were biomarkers motivated by a previous paper 12. The BMA ensemble stack developed in sub-set I of the training set showed inferior performances compared to the original model (Figs. S32 and S33), whereas the stack developed in the sub-set II improved the predictive potential in sub-set II of the test set, with performances as high as 0.94 (95% CI 0.75-1.00) and 0.88 (95% CI 0.71-0.99), in 0-1 and 0-2 samples, respectively (Figs. S35 and S36). Also noteworthy is the importance attributed to each new feature across base-learners (Fig. S37). For the models developed in sub-set II, 4 of the new biomarkers are ranked in the top 20 markers by feature importance across joined time-groups: IL17RA; AGR2; SERPINA and LRG1. REG3A 12 and THBS173 have been studied before in association with PDAC. REG3A was found to be a late marker adding little to combined models with CA19-9 and MUC16 12. Despite this, it is attributed an importance different from zero across the ensemble classifier developed here (Fig. S37).
Supplementary Figures
Supplementary Tables
Acknowledgments
We thank the participants of the UKCTOCS trial; the management team, research nurses, interviewers, research assistants, and other staff who gathered the data that was used in this study
Footnotes
The name of the platform used to screen 92 of the proteins used in this paper has been corrected. Table S1 has also been corrected.