Abstract
The methylation of plasma cell-free DNA (cfDNA) has emerged as a valuable diagnostic and prognostic biomarker in various cancers, including colorectal cancer (CRC). Currently, there are no biomarkers that serve simultaneously for early diagnosis and prognostic prediction in CRC patients. Herein, we developed a plasma panel of 27 differentially methylated regions (DMRs) and validated its superior performance across CRC diagnosis and prognosis prediction in an independent cohort. We first conducted a preliminary screening of 119 CRC tissue samples to identify CRC-specific methylation features. Subsequently, a CRC-specific methylation panel was developed by further filtering in 161 plasma samples. Machine learning algorithms were then applied to develop diagnosis and prognosis models using cfDNA samples from 51 CRC patients and 33 normal controls. The diagnosis model was tested in a cohort consisting of 30 CRC, 37 advanced adenoma (AA), and 14 healthy plasma samples, and independently validated in a cohort consisting of 18 CRC, 91 non-advanced adenoma (NAA), 23 AA and 34 healthy plasma samples. In the external tissue validation cohort (GSE48684), the methylation diagnosis model constructed with the panel reached an area under the curve (AUC) of 0.983; for the plasma cfDNA model in the external validation cohort, the sensitivities for NAA, AA and CRC 0-Ⅱ were 48.4%, 52.2% and 66.7%, respectively, with a specificity of 88%. Additionally, the panel was applied to patient staging and metastasis, performing well in predicting CRC distant metastasis (AUC = 0.955) and prognosis (AUC = 0.867). Using normal samples as controls, the changes in methylation score in both tissue and plasma were consistent across different lesions, although the degree of alteration varied with severity. The methylation scores varied between paired tissue and blood samples, suggesting distinct mechanisms of migration from tumor tissue to blood for the 27 DMRs.
Together, our cfDNA methylation models based on 27 DMRs can identify different stages of CRC and predict metastasis and prognosis, ultimately enabling early intervention and risk stratification for CRC patients.
Introduction
Colorectal cancer (CRC) is the third most prevalent cancer globally and ranks as the fourth leading cause of cancer death, accounting for 9.4% of cancer-related deaths (1). When patients were diagnosed with CRC, over half of their families (62.9%) faced financial burdens (2, 3). Extensive research shows that survival rates are significantly lower for individuals diagnosed with advanced CRC (Stage III or IV) (4, 5). Early detection and removal of precancerous lesions remain the most effective strategy to prevent CRC-associated mortality (6). While colonoscopy is regarded as the gold standard for CRC detection, its limitations, including invasiveness, suboptimal patient compliance and the risk of complications such as intestinal perforation, warrant further consideration (7). The guaiac-based fecal occult blood test (gFOBT), fecal immunochemical test (FIT), carcinoembryonic antigen (CEA), and multi-target stool DNA (mt-sDNA) testing are constrained in clinical application by their lower sensitivity (8, 9).
Recent research has demonstrated a profound correlation between CRC and the development of genetic and epigenetic alterations. During the early stages of CRC development, epigenetic modifications are more frequent than gene mutations, indicating their potential as diagnostic biomarkers in screening for colon polyps and cancer (10). Circulating tumor DNA (ctDNA) in cell-free DNA (cfDNA) primarily comes from apoptotic and necrotic tumor cells and carries cancer-specific epigenetic alterations (11). Notably, blood cfDNA methylation has emerged as a promising approach to cancer screening because of its early appearance in tumorigenesis and abundant signal density (12).
The value of cfDNA methylation in early diagnosis, detection of recurrence, molecular subtyping and prognostic prediction of CRC has been demonstrated (13). However, limitations have been reported. One study found that a diagnostic panel for CRC performed well, but it was derived from CRC tissue and normal blood leukocyte methylation data, potentially introducing bias due to the inconsistent sample types (14). Another study addressed the potential of cfDNA methylation in patient risk stratification, but the requirement for blood sampling every three months led to poor patient compliance (15).
In this study, we aimed to develop plasma biomarkers derived from CRC tissues that exhibit superior performance in the diagnosis, metastasis and prognostic prediction of CRC. We meticulously screened CRC and normal tissues and refined our selection within plasma samples. More importantly, the substantial proportion of paired tissue and plasma samples (from the same patient) effectively minimized potential data bias. The final panel comprised 27 differentially methylated regions (DMRs), with the area under the curve (AUC) reaching 0.983 in the tissue validation cohort. Given that DNA methylation markers in plasma primarily stem from tumor tissues, these 27 DMRs exhibit significant potential in CRC plasma diagnostics. With a single blood test, we can ascertain the presence of CRC, distinguish the specific stages of the lesion (advanced adenoma, CRC 0-II, CRC III-IV), help identify distant metastasis, and achieve optimal risk stratification for CRC patients. This test is of clinical significance and holds promise in guiding the diagnosis and treatment of CRC.
Materials and Methods
Study design and samples
Specific DNA methylation markers for CRC were identified through analysis of TCGA public database data, alongside collected tissue and plasma samples. Subsequently, diagnosis, metastasis, and prognosis models for CRC were established and validated. The 450k chip methylation data encompassing colorectal, esophageal, gastric, lung, liver, and breast cancers, and CRC transcriptome sequencing data were downloaded from The Cancer Genome Atlas (TCGA) database. The methylation data of tissue validation cohort was obtained from
Fifteen pairs of tissue samples from the Seventh Medical Center of PLA General Hospital were sequenced with the Roche NimbleGen SeqCap Epi method. The 89 tissue samples and 77 plasma samples used for marker screening were collected between June and December 2020 at the Seventh Medical Center of PLA General Hospital and Dongying People’s Hospital. Participants for model establishment and validation were enrolled from June 2020 to July 2022 at the Seventh Medical Center of PLA General Hospital and Dongying People’s Hospital.
The participants for the external validation cohort were enrolled from June 2021 to December 2022 at the First Hospital of Longyan, Fujian Medical University. Inclusion and exclusion criteria are outlined in the Supplementary Materials. Detailed methods for DNA extraction, library construction, targeted bisulfite sequencing, and methylation data processing are provided in the Supplementary Materials.
CRC-specific methylation markers selection
Probes were designed by combining TCGA methylation data with the methylation sequencing results of our 15 pairs of tissue samples and were used for further screening in the 89 tissue samples (13 normal tissues and 38 paired CRC tissues); 404 DMRs were identified. The detailed methods are described in the Supplementary Methods. Then, the methylation levels of 77 plasma samples from 13 normal, 15 non-advanced adenoma (NAA), 12 advanced adenoma (AA) and 37 CRC participants were analyzed. Subsequently, Least Absolute Shrinkage and Selection Operator (LASSO) regression analysis was used to filter markers, and 27 regions that appeared in more than 50 iterations were chosen as plasma diagnostic markers. The relationship between the methylation levels of the 27 DMRs and the transcription levels of their respective genes was assessed using Spearman correlation coefficients.
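The frequency-based filtering step can be illustrated with a minimal sketch. It assumes that each LASSO iteration has already produced a set of selected regions; the total of 100 iterations, the region names, and the function name `stable_regions` are hypothetical and are not taken from the study. Only the rule "appeared in more than 50 iterations" comes from the text.

```python
from collections import Counter


def stable_regions(selected_per_iteration, min_count=50):
    """Return regions selected in MORE than `min_count` iterations, sorted by name.

    `selected_per_iteration` is an iterable of collections of region names,
    one collection per LASSO iteration (the LASSO fit itself is not shown).
    """
    counts = Counter()
    for selected in selected_per_iteration:
        counts.update(set(selected))  # count each region at most once per iteration
    return sorted(region for region, n in counts.items() if n > min_count)


# Hypothetical toy input: 100 iterations over three made-up region names.
iterations = []
for i in range(100):
    chosen = {"DMR_A"}          # selected in all 100 iterations
    if i % 2 == 0:
        chosen.add("DMR_B")     # selected in exactly 50 iterations (not > 50)
    if i < 51:
        chosen.add("DMR_C")     # selected in 51 iterations
    iterations.append(chosen)

kept = stable_regions(iterations)
print(kept)  # ['DMR_A', 'DMR_C']
```

The strict inequality matters at the boundary: a region chosen in exactly 50 iterations is dropped under the rule as stated.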
Construction of the CRC diagnosis model
Machine learning was applied to the methylation data from the training cohort, and the model was selected by five-fold cross-validation using minimization of the binomial deviance as the criterion. The diagnosis model was constructed using logistic regression, and its performance was evaluated through Receiver Operating Characteristic (ROC) curve analysis. The optimal cutoff value was determined using the maximum Youden index. The methylation data of the tissue validation cohort (GSE48684) were downloaded from the Gene Expression Omnibus (GEO) database. Using the methylation scores derived from the diagnosis model, the CRC stage (AA, CRC 0-Ⅱ, CRC Ⅲ-Ⅳ) was predicted.
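As an illustration of cutoff selection by the maximum Youden index (J = sensitivity + specificity - 1), the following sketch scans every observed score as a candidate cutoff. The scores and labels are invented toy values, and `youden_cutoff` is a hypothetical helper, not the authors' implementation; a sample is called positive when its score is greater than or equal to the cutoff, which is an assumption about the decision rule.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) maximizing J = sens + spec - 1.

    labels: 1 = CRC, 0 = normal. A sample is called positive when
    score >= cutoff. On ties in J the lowest cutoff is kept.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        sens = tp / n_pos
        spec = tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1], best[2], best[3]


# Toy methylation scores (hypothetical) with known labels.
scores = [0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 1, 1, 1]
cutoff, sens, spec = youden_cutoff(scores, labels)
print(cutoff, sens, spec)
```

Statistical packages may place candidate thresholds differently, for example at midpoints between adjacent observed scores, so the numeric cutoff can differ slightly while the resulting classification is the same.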
Construction of the CRC metastasis and prognosis models
The plasma CRC metastasis model was constructed using the methylation scores derived from the diagnosis model together with the patients' pathological parameters (metastasis). Its performance was evaluated by ROC analysis, and the optimal cutoff value was determined using the maximum Youden index. Using the same methodology, the plasma prognosis model was established and validated based on the methylation scores derived from the CRC diagnosis model and the patients' survival information.
Statistical analysis
All statistical analyses were conducted using SPSS 26.0 or R 4.1.3. Two-sided tests were used for p-values, with differences deemed statistically significant at P < 0.05. Model performance was assessed through ROC analysis using the “roc” function from the R package pROC, generating AUC and a 95% confidence interval (CI). Kaplan-Meier (KM) curves were used for survival curve estimation, with comparisons made through the log-rank test and hazard ratios determined by Cox regression.
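For readers who want to see what the AUC measures, the sketch below computes it as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. This is a plain-Python illustration on made-up numbers, not the pROC routine used in the study, and it does not reproduce the 95% confidence interval; `auc_rank` is a hypothetical name.

```python
def auc_rank(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 = case, 0 = control. Tied pairs contribute 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): 3 of the 4 case/control pairs are ordered correctly.
toy_auc = auc_rank([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1])
print(toy_auc)  # 0.75
```

The pairwise loop is quadratic in the number of samples, which is acceptable for cohorts of the size described here.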
Results
Patient characteristics
To identify DNA methylation biomarkers specific to CRC, 119 tissue samples and 77 plasma samples (originating from the same patients as the tissues) were collected for methylation sequencing analysis, narrowing the candidates down to 142 DMRs. The 27 DMRs were then selected using the sequencing data of 84 plasma samples. Following this, CRC diagnosis, metastasis, and prognosis models were constructed using plasma samples from 84 individuals (51 CRC, 33 Normal). Validation was then carried out in an independent cohort (30 CRC, 37 AA, 14 Normal). The CRC diagnosis model was further tested in an external validation cohort (18 CRC 0-II, 91 NAA, 23 AA, 34 Normal). Detailed information about the study design is illustrated in Figure 1, and comprehensive patient characteristics are summarized in Tables s2-4.
Methylation markers selection
By analyzing the 450k chip methylation data from TCGA, the differentially methylated sites (DMCs) in CRC were identified (Figure s1A). Combined with additional CRC-related methylation sites reported in the literature, 1438 DMCs were selected for validation. The heat map of the methylation levels of the 1438 DMCs revealed distinct methylation patterns between Normal and Tumor (Figure s1B). The top 1400 DMCs were selected based on the results of DNA methylation sequencing of 15 paired tissue samples (Figure s2A-C).
The distribution of the DMCs was predominantly observed in introns and promoters, signifying substantial implications for gene expression and cellular functions (Figure s2D).
After merging the 1438 DMCs with the 1400 DMCs for probe design, 404 DMRs were identified across 89 tissues (Figure s3A). The different methylation patterns between Normal and Tumor were also evident (Figure s3B-C). These DMRs correspond to genes that play pivotal roles in biological processes such as cell-cell adhesion, digestive system development, and cAMP and cGMP pathways (Figure s3D). Subsequently, 142 regions remained after sequencing 77 plasma samples (Figure s3E-F). The final 27 DMRs were selected after analyzing methylation data of the training cohort samples. Detailed information on the 27 DMRs is provided in Table s5. Correlation analysis revealed that the methylation levels of the majority of the 27 DMRs were correlated with the transcription levels of their respective genes (Figure s4).
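The Spearman coefficient used for the methylation-transcription comparison is the Pearson correlation of the ranked values. A minimal sketch follows; the methylation and expression values are invented for illustration, and the helper names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for one region: higher methylation, lower expression.
methylation = [0.1, 0.3, 0.5, 0.7, 0.9]
expression = [9.0, 7.5, 8.0, 3.0, 1.0]
rho = spearman(methylation, expression)
print(round(rho, 3))  # -0.9
```

A negative rho, as in this toy case, corresponds to the pattern in which hypermethylation accompanies reduced transcription.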
Development and validation of the CRC diagnosis model
Using methylation data from the 27 DMRs in tissues, we applied machine learning to construct a diagnosis model for CRC, and the AUC reached 0.994 (Figure 2A). In the external independent tissue validation cohort (GSE48684), the AUC was 0.983 for CRC and 0.966 for AA (Figure 2B-C). The sensitivities for AA and CRC were 95% and 94%, respectively (Figure 2D-E). In addition, a plasma-based diagnosis model for CRC was constructed with the training cohort and validated in an external independent validation cohort; the 27 DMRs exhibited distinctive methylation patterns between CRC and normal individuals (Figure 3A-B). The methylation scores of the model were significantly elevated in the CRC group compared to the normal and AA groups (P < 0.001), and they increased with advancing tumor stage (Figure 3H). In the training cohort, ROC curve analysis showed an AUC of 0.928 [95% confidence interval (CI): 0.874–0.981] (Figure 3C). The sensitivities of the CRC diagnosis model in the training cohort for stages 0-II, III-IV, and all CRC stages were 80%, 92%, and 86%, respectively, with a specificity of 85% (Figure 3I). ROC curve analysis in the validation cohort showed AUC values for AA, stages 0-II, III-IV, and all CRC stages of 0.714 (0.550–0.878), 0.890 (0.749–1.000), 0.967 (0.899–1.000), and 0.929 (0.827–1.000), respectively (Figure 3D-G). The CRC diagnosis model demonstrated a specificity of 93% in the validation cohort, with sensitivities for AA, stages 0-II, III-IV, and all CRC stages of 43%, 67%, 100%, and 83%, respectively (Figure 3I). We further tested the early diagnostic performance of the model in an external validation cohort, and the results showed that the model achieved a sensitivity of 52% for AA and 48% for NAA (Figure 4A-E). The plasma diagnosis model also serves to discern the specific staging of CRC patients.
ROC curve analysis on the validation cohort revealed an AUC of 0.813 (0.656–0.970) for discriminating between CRC 0-II and CRC III-IV, and an AUC of 0.799 (0.693–0.905) for distinguishing between AA and CRC (Figure s5A-C). Kaplan-Meier survival curves revealed a significant reduction in overall survival (OS) for CRC III-IV patients compared to CRC 0-II patients identified by this diagnosis model (Figure s5D-E).
Development and validation of the CRC metastasis and prognosis models
By analyzing the relationship between the diagnosis-model methylation scores of CRC patients’ plasma samples and clinicopathological parameters, we observed that methylation scores were associated with metastasis and staging (P < 0.001, Figure 5A-B), but not with age, gender, or lesion location (P > 0.05, Figure s6A-C). This suggests a possible role of the 27 DMRs in metastasis and prognosis prediction. Therefore, we developed plasma-based CRC metastasis and prognosis models based on the 27 CRC-specific DMRs. In the training and validation cohorts, the metastasis model demonstrated high AUCs of 0.969 (95% CI: 0.926-1) and 0.955 (95% CI: 0.878-1), respectively (Figure 5C). Methylation scores of M1-CRC individuals identified by the metastasis model were significantly higher than those of M0-CRC individuals (Figure s6D). Additionally, the M1-CRC patients exhibited shorter OS than M0-CRC patients (Figure s6E-F). For the prognosis model, ROC curves demonstrated excellent performance in the training cohort (AUC = 0.883, 95% CI: 0.778-0.988) and validation cohort (AUC = 0.867, 95% CI: 0.728-1) (Figure 5D). Furthermore, Kaplan-Meier survival curves revealed a significant reduction in OS for the high-risk group of CRC patients identified by this prognosis model (Figure 5E). Based on the cutoff value determined by the algorithm, we divided CRC patients into high-risk and low-risk groups. Notably, almost all patients who developed distant metastases and those who had died belonged to the high-risk group (Figure 5F, H). We further characterized the distribution of survival status and metastasis status between the two groups. As expected, the high-risk group had a higher proportion of deceased individuals, while non-metastatic patients were more prominent in the low-risk group (Figure 5G, I). These results indicate that the metastasis and prognosis models successfully identified patients who require further treatment.
Multivariate regression analysis revealed a substantial correlation between the methylation score of the 27 DMRs and OS, indicating methylation score as an independent prognostic factor for CRC (Table s6). These findings underscore the considerable potential of the prognosis model based on the 27 DMRs specific to CRC in predicting the prognosis and conducting risk stratification for CRC patients.
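The risk stratification described above can be sketched as two steps: splitting patients at a score cutoff and estimating a Kaplan-Meier curve for each group. The patient tuples, the cutoff of 0.5, and the function names below are hypothetical illustrations only; the log-rank test and Cox regression reported in the study are not implemented here.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) at each event time.

    events: 1 = death observed, 0 = censored. Subjects censored at time t
    are treated as still at risk for deaths occurring at t.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


def split_by_cutoff(patients, cutoff):
    """Split (score, months, event) tuples into high-risk and low-risk lists."""
    high = [p for p in patients if p[0] >= cutoff]
    low = [p for p in patients if p[0] < cutoff]
    return high, low


# Hypothetical patients: (methylation score, follow-up in months, death indicator).
patients = [
    (0.90, 6, 1), (0.80, 10, 1), (0.70, 10, 0), (0.60, 15, 1), (0.55, 20, 0),
    (0.40, 12, 0), (0.30, 18, 1), (0.20, 24, 0), (0.10, 24, 0),
]
high, low = split_by_cutoff(patients, cutoff=0.5)
high_curve = kaplan_meier([p[1] for p in high], [p[2] for p in high])
low_curve = kaplan_meier([p[1] for p in low], [p[2] for p in low])
for t, s in high_curve:
    print("high-risk", t, round(s, 3))
for t, s in low_curve:
    print("low-risk", t, round(s, 3))
```

In this toy data the high-risk curve falls faster than the low-risk curve, which is the qualitative pattern that a survival comparison between the two groups would then test formally.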
Changes in methylation scores of tissue and plasma samples across different populations
Collectively, the CRC diagnosis model established on the methylation levels of these 27 DMRs in tissues exhibits robust performance. However, the plasma-based diagnosis model exhibited reduced performance in AA and CRC 0-II compared to CRC III-IV. Therefore, we analyzed the methylation scores of the 27 DMRs generated from the diagnosis models in paired tissue and plasma samples. Notably, in tissues, the methylation scores were ranked as AA > CRC > Normal (Figure 6A), suggesting significant alterations in the methylation of the 27 DMRs at the onset of precancerous lesions. In plasma, methylation scores gradually increased with CRC progression (CRC III-IV > CRC 0-II > AA > Normal, Figure 6B). Further analysis of individual DMR methylation levels in the blood revealed discernible differences in certain DMRs during both the AA and CRC 0-II stages, while differences in another subset of DMRs became detectable only during CRC III-IV stages (Figure 6C). We then conducted a gene analysis of the DMRs identifiable in early-stage CRC blood on the Metascape website. Our findings revealed their association with processes such as “secretion by cell” and “cell-cell adhesion” (Figure s6G). Consequently, we hypothesize that although alterations of 26 DMRs were observed in AA tissues, only a select few are actively released into the bloodstream through secretion by tumor cells. The majority of these alterations are likely to be detectable in the blood only after apoptosis or necrosis of tumor cells. These results may explain the inconsistency in the performance of the tissue and plasma diagnosis models and the differences in diagnostic efficacy for different CRC stages using the same model.
Discussion
Liquid biopsy of cfDNA has become an ideal clinical detection method because it is minimally invasive and easily sampled. However, due to the low concentration of ctDNA in bodily fluids and the heterogeneity of tumor cells, it remains uncertain whether many current DNA markers for CRC originate from CRC tissues. In this study, we integrated TCGA tissue data with self-collected tissue methylation data to identify CRC-related methylation sites.
Probes were designed and applied in paired tissue and plasma samples to further screen the methylation sites. We successfully determined 27 DMRs originating from CRC tissues, which were subsequently used to construct the diagnostic model. In the validation cohorts of tissue and plasma samples, the diagnosis models based on the 27 DMRs could effectively differentiate samples from CRC and normal participants. In addition, the plasma diagnosis model could distinguish the different stages of CRC. Furthermore, the metastasis and prognosis models, established based on diagnosis model scores, could effectively indicate whether the CRC had metastasized and separate CRC patients into high-risk and low-risk groups.
Several blood-based methylation biomarker candidates have been proposed for early detection of CRC. For instance, the FDA-approved circulating methylated SEPT9 DNA (mSEPT9) demonstrates sensitivities of 11.2%, 35.0%, 63.0%, 46.0%, and 77.4% for AA and CRC stages I-IV, respectively, with a specificity of 91.5% (16). In a recent study, a cfDNA methylation-based CRC screening model had a sensitivity of 86.4% and a specificity of 90.7%, utilizing 149 markers derived from blood samples (17). In our study, 27 markers were selected through a layered screening process from tissue and plasma, undergoing marker selection, model development, and validation to ensure model robustness. The performance of our plasma diagnosis model was then validated, with a sensitivity of 83.3% and a specificity of 92.9%. Notably, the sensitivities for AA and NAA reached 52.2% and 48.4%, respectively, far surpassing the 11.2% sensitivity of mSEPT9 and the 33.3% reported in another blood screening study (16, 17). In summary, our diagnostic model performs well in detecting precancerous lesions of CRC and has the potential to become a screening method for populations at high risk of CRC.
In plasma, we observed that the methylation scores of the 27 DMRs gradually increased with CRC progression. This is possibly due to less vascular infiltration in early CRC (18, 19). In addition, the ctDNA detected in the blood of early-stage CRC patients is largely derived from active secretion by tumor cells into the bloodstream, and its limited quantity makes it challenging to capture precisely with current detection technologies. Conversely, ctDNA in the blood of late-stage CRC patients primarily originates from apoptotic and necrotic tumor cells, resulting in a larger quantity that is easier to detect. Nevertheless, the methylation scores of the 27 DMRs did exhibit noticeable changes in AA and early cancerous tissues.
Additionally, in the tissue diagnosis model using the 27 DMRs, the sensitivity was 94% and the specificity was 100%. Therefore, we have grounds to believe that the 27 DMRs detected in plasma originate from CRC tumor tissues. With the implementation of more sensitive detection methods, the performance of these 27 DMRs in early CRC plasma diagnosis is likely to be further enhanced.
Additionally, cfDNA contributes to risk stratification and early recurrence detection in CRC (20). However, current methods rely on repeated blood cfDNA testing of patients (15, 21).
This study found that the preoperative cfDNA methylation level of the 27 DMRs is linked to distant metastasis and is valuable for predicting the prognosis of CRC, consistent with prior research (22, 23). Our models achieved AUCs of 0.955 and 0.867 for distant metastasis and prognosis in CRC, respectively. This underscores the potential utility of the preoperative application of this cfDNA methylation model for risk stratification, serving as an effective tool to improve the perioperative management of CRC patients.
The treatment strategies for CRC differ significantly across various stages. Molecular stratification approaches for CRC patients are on the rise, and concurrently, the clinical application of biomarkers to determine treatment decisions is gradually gaining traction (24). Studies have demonstrated the precise identification of T1 CRC patients at risk of lymph node metastasis using a specific set of miRNAs, thereby potentially mitigating unnecessary overtreatment (25, 26). Our CRC diagnosis model, with an AUC of 0.813 for distinguishing CRC 0-Ⅱ from CRC Ⅲ-Ⅳ, could also serve as a potent, straightforward, and cost-effective preoperative screening/detection method, guiding patients to select more appropriate treatment plans.
DNA methylation regulates gene transcription, guiding the progression from normal mucosa to AA and ultimately CRC. This process involves silencing tumor suppressor genes and activating oncogene transcription (27). Gene transcription levels correlate notably with methylation levels at specific sites, consistent with our research findings. Additionally, several genes housing the identified 27 DMRs have been partially explored in CRC. Some genes contribute to tumor growth. For instance, high methylation and diminished expression of transmembrane protein 240 (TMEM240) regulate CRC cell proliferation, predicting poor prognosis (28). The transcription factor homeobox A3 (HOXA3) activates aerobic glycolysis, promoting tumor growth (29). KIFC3 controls mitotic spindle assembly initiation (30). Moreover, several genes are implicated in tumor metastasis, such as NOVA alternative splicing regulator 1 (NOVA1), which promotes CRC migration by activating the Notch pathway (31). Protein tyrosine phosphatase receptor type T (PTPRT) contributes to early CRC dissemination (32).
SPARC-related modular calcium binding 2 (SMOC2) serves as the distinctive signature of cancer stem cells (CSCs) in CRC and promotes epithelial-to-mesenchymal transition (EMT) (33, 34). Increased methylation and expression of pancreatic and duodenal homeobox 1 (PDX1) facilitate CRC invasion and migration (35). However, the mechanisms by which alterations in specific DNA methylation impact gene expression are intricate. Certain transcription factors selectively recognize sequences with methylated CpG (mCpG) and influence the expression of multiple genes (36). Further exploration of the potential functional mechanisms of these DNA methylation markers may deepen our understanding of the molecular processes underlying CRC development, offering promising therapeutic targets.
This study has certain limitations. Firstly, although samples were obtained from multiple institutions, the sample size was small. Therefore, the cfDNA methylation model should be further validated in large-sample trials in the future. Secondly, a prospective study is necessary to compare or combine the cfDNA methylation model with commonly used clinical markers such as CEA and CA19-9.
In conclusion, our cfDNA methylation model based on 27 DMRs can identify different stages of CRC, predict metastasis and prognosis, and ultimately achieve early intervention and risk stratification for CRC patients. The preoperative application of our DNA methylation biomarker as a robust, convenient, and cost-effective detection method can contribute to making more informed clinical decisions and improving the perioperative management of CRC patients.
AUTHOR CONTRIBUTIONS
Yuqi He and Jianqiu Sheng conceived and designed the study. Lang Yang, Fangli Men, Jianwei Yu, Xianzong Ma, Junfeng Xu, Yangjie Li, Ju Tian, Hui Xie, Qian Kang, Linghui Duan, Xiang Yi, Wei Guo, Ni Guo collected the samples and curated the data. Xueqing Gong, Lingqin Zhu, Lang Yang, Fangli Men, Jianwei Yu, Shuyang Sun, Chenguang Li, Xianzong Ma, Junfeng Xu, Yangjie Li, Xiang Yi, Wei Guo, Ni Guo, Youyong Lu, Joseph Leung, Yuqi He and Jianqiu Sheng analysed the data. Lingqin Zhu wrote the manuscript with the assistance of Youyong Lu, Joseph Leung, Yuqi He and Jianqiu Sheng.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest.
DATA AVAILABILITY STATEMENT
All of the data supporting this work will be made available from the corresponding author upon reasonable request.
ETHICS APPROVAL AND CONSENT TO PARTICIPATE
The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of the Seventh Medical Center of PLA General Hospital (Approval No. 2016-70, 2020-78), Dongying People’s Hospital (Approval No. DYYW-2019-002-01), and the First Hospital of Longyan, Fujian Medical University (Approval No. 2021-k0001). Informed consent was obtained from all participants involved in the study.
ACKNOWLEDGEMENTS
This research was funded by grants from the National Natural Science Foundation of China (Grant no. 82273245), the Capital’s Funds for Health Improvement and Research (Grant no. 2022-1-5082), the Beijing Natural Science Foundation (Grant no. 7212107), the Dongying City Natural Science Foundation (Grant no. 2023ZR026), and the Longyan City Science and Technology Plan Project (Grant no. 2022LYF17082).