Abstract
Background To reduce lung cancer burden in the US, a better understanding of biological mechanisms in early disease development could provide new opportunities for risk stratification.
Methods In a nested case-control study, we measured blood leukocyte DNA methylation levels in pre-diagnostic samples collected from 430 men and women in the 1989 CLUE II cohort. Median time from blood draw to diagnosis was 14 years for all participants. We compared DNA methylation levels by case/control status to identify novel genomic regions, both single CpG sites and differentially methylated regions (DMRs), while controlling for known DNA methylation changes associated with smoking using a previously described pack-years based smoking methylation score. Stratification analyses were conducted by time from blood draw to diagnosis, histology, and smoking status.
Results We identified sixteen single CpG sites and forty DMRs significantly associated with lung cancer risk (q < 0.05). The identified genomic regions were associated with genes including H19, HOXA4, RUNX3, BRICD5, PLXNB2, and RP13. For the single CpG sites, the strongest association was noted for cg09736286 in the DIABLO gene (OR [for 1 SD] = 2.99, 95% CI: 1.95-4.59, P-value = 4.81 × 10^-7). For the DMRs, we found that CpG sites in the HOXA4 region were hypermethylated in cases compared to controls.
Conclusion The single CpG sites and DMRs that we identified represented significant measurable differences in lung cancer risk, providing new insights into the biological processes of early lung cancer development and potential biomarkers for lung cancer risk stratification.
Introduction
Despite substantial reductions in lung cancer incidence and death rates over the past three decades (Siegel et al. 2020), lung cancer continues to be the leading cause of cancer death in the US and is projected to account for 135,720 deaths in 2020, approximately 22% of all cancer deaths (Siegel et al. 2020). To reduce the lung cancer burden in the US, cancer prevention and early detection remain a top priority (Moyer and Force 2014). However, while the conventional lung cancer screening method of using low-dose CT (LDCT) scans is effective in reducing lung cancer mortality (National Lung Screening Trial Research et al. 2011), it leads to high false-positive rates (>95% of pulmonary nodules detected are benign) (Fabrikant et al. 2018), over-diagnosis (Heleno et al. 2018; Patz et al. 2014), and radiation exposure (McCunney and Li 2014), and it has had poor uptake (Quaife et al. 2016). A better understanding of biological mechanisms in early disease development, especially understanding gained through using DNA methylation data, may contribute new insights into lung cancer development and provide new opportunities for risk stratification, screening prioritization, and drug development.
Genome-wide DNA methylation profiling has provided new insight into risk factors, biological pathways, and disease processes. For instance, differences in blood leukocyte DNA methylation levels (proportion of CpGs methylated) have been associated with smoking (Baglietto et al. 2017; Shenker et al. 2013), elevated subclinical inflammation (Ahsan et al. 2017; Ligthart et al. 2016), obesity (Wahl et al. 2016; Xu et al. 2018), type II diabetes (Chambers et al. 2015), and heart disease (Agha et al. 2019; Huan et al. 2019). With respect to lung cancer, many of the alterations in blood leukocyte DNA methylation levels identified to date have been directly linked to smoking behaviors (Baglietto et al. 2017; Fasanelli et al. 2015; Shenker et al. 2013; Zhang et al. 2016b; Zhang et al. 2016c). One recent meta-analysis that included over 15,000 individuals identified 2,623 differentially methylated CpG sites that are related to cigarette smoking (Joehanes et al. 2016). Some of the identified smoking-related CpG sites have also been shown to mediate the effects of smoking on lung cancer (Baglietto et al. 2017; Battram et al. 2019; Fasanelli et al. 2015). Nevertheless, very few studies have investigated methylation changes associated with lung cancer risk that are not associated with smoking (Zhang et al. 2016a; Zhang et al. 2016b).
While smoking remains the strongest risk factor for lung cancer, identifying DNA methylation alterations that are not caused by active smoking could reveal important biological pathways. Among smokers, variation in methylation may indicate different genetic susceptibility to lung cancer, since individuals often differ in their ability to detoxify carcinogenic compounds or to repair induced DNA damage, for example. Alternatively, variation in methylation could be a result of other environmental risk factors, such as occupational exposures, changes in the immune response, radon, or secondhand smoke. Using blood samples collected from men and women without a cancer diagnosis in 1989 in the CLUE II cohort (Genkinger et al. 2004; Kakourou et al. 2015), we compared DNA methylation levels in individuals who later developed lung cancer with those who did not develop lung cancer in the same time frame. Specifically, we aimed to identify both single CpG sites and differentially methylated regions, measured in pre-diagnostic blood samples of lung cancer cases and matched controls, that represent measurable differences in lung cancer risk independent of smoking exposures.
Materials and Methods
Study population – CLUE I/II cohort
Subjects for this study were selected from among participants of the CLUE II cohort who had also participated in the CLUE I cohort study (flowchart in Supplemental Figure 1 and additional details in Supplemental Methods). Both cohorts were based in Washington County, MD, and were initially established to identify serological precursors to cancer and other chronic diseases (Genkinger et al. 2004; Kakourou et al. 2015; Schober et al. 1987). CLUE II was conducted from May through October 1989, during which 32,894 individuals (25,076 were Washington County residents) provided a blood sample (Comstock et al. 1991). Among all participants, 98.3% were white, reflecting the population of this county at the time, and 59% were female. Participants provided health information at baseline, including the potential confounders attained education, cigarette smoking status, number of cigarettes smoked daily, cigar/pipe smoking status, and self-reported weight and height, from which body mass index (BMI) was calculated.
Lung cancer ascertainment
All incident lung cancer cases (ICD9 162 and ICD10 C34) were ascertained from linkage to the Washington County cancer registry (before 1992 to the present) and the Maryland Cancer Registry (from its start in 1992 to the present). We selected all 241 first primary incident lung cancer cases who had participated in CLUE I and were diagnosed after the day of blood draw in CLUE II through January 2018. Using incidence density sampling, we selected 1 control per case, matched on age, sex, cigarette smoking status, number of cigarettes smoked daily, cigar/pipe smoking status, and date of CLUE II blood draw.
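As an illustration of incidence density (risk-set) sampling, the following minimal Python sketch draws one control per case from participants who share the case's matching profile and are still cancer-free when the case is diagnosed. It is not the study's selection code. The field names (`id`, `dx_time`), the exact-match rule, and the omission of censoring and loss to follow-up are assumptions made for the example.

```python
import random


def incidence_density_sample(cohort, match_keys, seed=0):
    """Pick one control per case from that case's risk set.

    cohort: list of dicts with 'id', 'dx_time' (years from blood draw to
    diagnosis, or None if never diagnosed during follow-up), plus the
    matching variables named in match_keys (hypothetical field names).
    Returns a list of (case_id, control_id) pairs.
    """
    rng = random.Random(seed)
    cases = sorted(
        (p for p in cohort if p["dx_time"] is not None),
        key=lambda p: p["dx_time"],
    )
    pairs = []
    for case in cases:
        # Risk set: same matching profile and not yet diagnosed at the
        # case's diagnosis time. Future cases remain eligible as controls.
        risk_set = [
            p for p in cohort
            if p["id"] != case["id"]
            and all(p[k] == case[k] for k in match_keys)
            and (p["dx_time"] is None or p["dx_time"] > case["dx_time"])
        ]
        if risk_set:
            pairs.append((case["id"], rng.choice(risk_set)["id"]))
    return pairs


cohort = [
    {"id": "A", "dx_time": 5.0, "sex": "F", "smoking": "current"},
    {"id": "B", "dx_time": 12.0, "sex": "F", "smoking": "current"},
    {"id": "C", "dx_time": None, "sex": "F", "smoking": "current"},
    {"id": "D", "dx_time": 8.0, "sex": "M", "smoking": "former"},
]
pairs = incidence_density_sample(cohort, ["sex", "smoking"])
```

In this toy cohort, participant B can be drawn as a control for case A and later enters as a case, and participant C can serve as a control more than once. Case D has no eligible match and is left unpaired.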
DNA methylation measurements
DNA extracted from buffy coat was bisulfite-treated using the EZ DNA Methylation Kit (Zymo) and DNA methylation was measured at specific CpG sites across the genome using the 850K Illumina Infinium MethylationEPIC BeadArray (Illumina, Inc, CA, USA) at the University of Minnesota Genomic Center (details in Supplemental Methods).
Statistical analysis
In the current nested case-control study, we aimed to identify novel genomic regions, both single CpG sites and differentially methylated regions where differences in DNA methylation levels are not explained by smoking exposures, by controlling for the known DNA methylation changes associated with smoking in the statistical analysis. To examine the association between CpG-specific DNA methylation and lung cancer risk, we conducted epigenome-wide association analysis using unconditional multivariable logistic regression to estimate odds ratios (OR) of lung cancer per 1 SD increase in methylation level at single CpG sites. To maximize power, we used unconditional logistic regression to include cases and controls without a matched pair, and included participants every time they were sampled. All models were adjusted for age at blood draw, sex, surrogate variables for batch effects (Leek et al. 2012; Leek et al. 2010), smoking status (never, former, current), pack-years based smoking methylation score (details in Supplemental Methods), BMI, and leukocyte cell composition (Houseman et al. 2012; Salas et al. 2018) (given the potential for confounding by cell composition) (Adalsteinsson et al. 2012). All p-values were adjusted for multiple comparisons using the false discovery rate (FDR) method. Analyses of single CpGs with lung cancer were also stratified by smoking status and time from blood-draw to diagnosis, and separately by non-small cell (NSCLC) and small cell (SCLC) histology. All controls were included in these three types of stratification analyses. All statistical analyses were performed in R (version 3.5.0).
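The multiple-comparison step can be sketched as follows. The text specifies only "the false discovery rate (FDR) method"; this example assumes the Benjamini-Hochberg procedure, and the function name and inputs are illustrative rather than taken from the study's R code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each q-value is the
    # minimum of p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# One p-value per CpG; a CpG is reported when its q-value is below 0.05.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in q_values]
```

In an epigenome-wide analysis the input list would hold one p-value per CpG tested, so the adjustment scales with the number of probes that pass quality control.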
We used the DMRcate Bioconductor R package (Peters et al. 2015) to identify differentially methylated regions (DMRs) associated with lung cancer risk. Adjusting for the same covariates as in the single CpG analyses, DMRs were calculated using a parameter setting of lambda=1,000 and kernel adjustment C=2 (default setting) (Peters et al. 2015). Statistically significant DMRs were required to have a minimum of two statistically significant single CpGs and to meet the multiple testing adjustment criteria of FDR<0.1. Associations were also examined by time from blood-draw to diagnosis and lung cancer histology. Two of the most statistically significant regions were further evaluated for patterns by time to diagnosis.
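To make the region definition concrete, the sketch below merges significant CpGs on one chromosome into candidate regions. It is a simplified illustration, not the DMRcate implementation: it omits the Gaussian kernel smoothing step (for which, as we understand the package documentation, the kernel standard deviation is lambda/C, i.e. 500 bp with the settings above) and assumes only that significant CpGs separated by a gap of lambda bp or more fall into separate regions, with at least two significant CpGs per region. The function name and inputs are hypothetical.

```python
def group_into_regions(positions, significant, lam=1000, min_cpgs=2):
    """Merge significant CpGs on one chromosome into candidate regions.

    positions: genomic coordinates (bp) of CpGs, sorted ascending.
    significant: booleans of the same length, True where the CpG passed
    the significance threshold.
    Returns a list of (start, end, n_cpgs) tuples.
    """
    regions = []
    current = []
    for pos, sig in zip(positions, significant):
        if not sig:
            continue
        # A gap of lam bp or more to the previous significant CpG
        # closes the current region.
        if current and pos - current[-1] >= lam:
            if len(current) >= min_cpgs:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    # Flush the last open region.
    if len(current) >= min_cpgs:
        regions.append((current[0], current[-1], len(current)))
    return regions


regions = group_into_regions(
    [100, 400, 900, 2500, 2600, 5000],
    [True, True, True, True, False, True],
)
```

Here the first three CpGs form one region, while the isolated significant CpGs at 2500 and 5000 are dropped because each would be a region of size one.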
Results
Population characteristics
Table 1 presents the characteristics of the 208 lung cancer cases and 222 controls that we included in this study. Over 99% of participants were white. The median time to lung cancer diagnosis was 14 years. The median age at blood draw in 1989 was 59 and 57 years in cases and controls, respectively. Overall, 55% of cases and controls were women and 11% were never smokers (Table 1).
Single CpG EWAS analysis
The EWAS analysis identified 16 differentially methylated CpGs that were statistically significant after multiple comparisons correction (q<0.05). Results are presented in Table 2 (statistically significant CpGs were sorted by q-value) and Figure 1. Among the 16 CpGs, many were located in genomic regions that have been previously associated with lung cancer or other malignancies (RUNX3 (Sato et al. 2005), H19 (Huang et al. 2018), BAIAP2L2 (Liu et al. 2020), GPR132 (Chen et al. 2017), CUEDC1 (Lopes et al. 2018), SSBP4 (Guo et al. 2018), AMPD2 (Gao et al. 2020), ADAM11 (Sieuwerts et al. 2005), and RTN4R (He et al. 2020); the top 1000 CpGs based on q-value are presented in Supplemental Table 1).
CpGs previously reported to be associated with lung cancer risk, including cg05575921 in the AHRR gene (Fasanelli et al. 2015), cg03636183 in the F2RL3 gene (Fasanelli et al. 2015), cg23387569 in the AGAP2 gene (Baglietto et al. 2017), cg10151248 in the PC gene (Sandanger et al. 2018), and cg13482620 in the B3GNTL1 gene (Sandanger et al. 2018), were not statistically significant in our analyses in which we adjusted for a pack-years methylation score. Adjusting for smoking status (never, former, current) but not for the pack-years methylation score, two CpGs in the AHRR and F2RL3 genes had similar-sized associations with lung cancer risk as previously reported (AHRR: OR [for 1 SD] = 0.43, 95% CI: 0.31-0.60, P-value = 6.76 × 10^-7 vs. previously reported OR [for 1 SD] = 0.39, 95% CI: 0.24-0.61, P-value = 2.55 × 10^-5 for cg05575921; F2RL3: OR [for 1 SD] = 0.53, 95% CI: 0.40-0.70, P-value = 7.91 × 10^-6 vs. previously reported OR [for 1 SD] = 0.51, 95% CI: 0.35-0.73, P-value = 4.19 × 10^-4 for cg03636183) (Fasanelli et al. 2015). In addition, in the complete EWAS analysis conducted without adjusting for the pack-years methylation score, only one CpG (cg14391737) was statistically significantly associated with lung cancer risk after adjusting for multiple comparisons. This CpG has been related to smoking in multiple studies (Joehanes et al. 2016) (top 1000 CpGs presented in Supplemental Table 2).
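For readers comparing the odds ratios, confidence intervals, and P-values quoted above, the following sketch shows how the three quantities relate under a Wald (normal) approximation on the log-odds scale. It assumes the intervals are Wald intervals, which the text does not state. Because the reported values are rounded to two decimals, the recovered P-value agrees with the reported one only approximately.

```python
import math


def wald_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover the log-odds estimate, its standard error, the z statistic,
    and a two-sided p-value from an odds ratio and its 95% CI.

    Assumes a Wald interval: exp(beta +/- z_crit * se).
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = beta / se
    # Two-sided normal tail: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, z, p


# AHRR cg05575921 as reported above: OR = 0.43, 95% CI 0.31-0.60.
beta, se, z, p = wald_from_or_ci(0.43, 0.31, 0.60)
```

For the AHRR estimate this gives a negative z statistic and a P-value on the order of 10^-7, in line with the reported value.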
To further examine the associations of the significant CpGs with lung cancer risk, we stratified by time from blood draw to diagnosis (≤10, >10 years). The magnitude of risk was similar in the two strata; small differences in risk were likely due to statistical variability (Table 2). For these 16 differentially methylated CpGs, the ORs of lung cancer were in general slightly higher for former smokers and for SCLC, than among current smokers or for NSCLC, respectively (Supplemental Table 3).
DMR analysis
Using the DMRcate package in R, we identified differentially methylated regions (DMRs) by case/control status (Peters et al. 2015). Instead of focusing on single CpG identification, the DMRcate method identifies regions of chromosomes that are differentially methylated by case-control status. After adjusting for both smoking status and pack-years methylation score, forty DMRs were found to be statistically significantly associated with lung cancer risk (Table 3). Many of the top regions identified included genes that have been linked to lung cancer in previous studies (H19 (Huang et al. 2018), HOXA4 (Xu et al. 2019), PLXNB2 (Liu and Zhao 2019), PRDM1 (Zhu et al. 2017), TSPAN4 (Ying et al. 2019), PHPT1 (Xu et al. 2010), MSI2 (Kudinov et al. 2016), CBX5 (Yu et al. 2012), RCAN1 (Ma et al. 2017), CCL5 (Huang et al. 2009), and BRDT (Grunwald et al. 2006)).
We conducted stratified analyses to examine whether our DMR results were modified by time to diagnosis or differed by histology. Among those with ≤10 years between blood draw and diagnosis, a region located on chromosome 20:36148699-36149271 (genes NNAT and BLCAP) was statistically significantly associated with lung cancer. By histology, one region located in H19 gene and one region located in MYEOV gene were statistically significant DMRs for NSCLC and SCLC, respectively (Supplemental Table 4).
We conducted further analyses for two of the most statistically significant DMRs. These two regions, located on chromosomes 11 (H19 gene) and 7 (HOXA4 gene), consisted of 31 and 20 CpGs, respectively, that differed between the cases and controls (all with FDR-adjusted q-value<0.1). For each of these two regions, we selected the CpGs with the strongest associations with lung cancer risk for comparison by time to cancer diagnosis (≤10, >10 years). Results for CpGs in these two regions are presented in Table 4 (top 10 statistically significant CpGs are included), with additional tables presented in Supplemental Table 5. For CpGs located in the HOXA4 region, 18 out of 20 were statistically significant and all sites were hypermethylated in cases compared to controls. Results for these HOXA4 region CpGs were similar in the ≤10 and >10 year time to diagnosis groups. In the H19 region, the strongest association was noted for cg00237904. This CpG was also identified in the top statistically significant CpGs in the single CpG EWAS (Table 2, q-value<0.05).
Discussion
In this study, we identified both single CpGs and DMRs in lung cancer that are not primarily driven by smoking history, by using a DNA methylation-based pack-years score to adjust for cumulative smoking. Using a prospective study design (with pre-diagnostic blood), we identified 16 single CpG sites and 40 DMRs that were associated with lung cancer risk; genes in these regions included H19, HOXA4, RUNX3, BRICD5, RP13, and PLXNB2.
Previous studies have used either case-control or nested case-control designs to study the association between methylation markers and lung cancer risk. Retrospective case-control studies on this topic are not comparable to our study because they either used a different methodology to measure blood leukocyte methylation (Wang et al. 2010), or measured methylation biomarkers in sputum as a classifier for lung cancer risk (Leng et al. 2017; Liu et al. 2017). Three nested case-control studies previously used pre-diagnostic, peripheral blood samples to examine DNA methylation levels associated with lung cancer risk, while adjusting for smoking using self-reported information (Baglietto et al. 2017; Fasanelli et al. 2015; Sandanger et al. 2018). Fasanelli 2015 (Fasanelli et al. 2015) and Baglietto 2017 (Baglietto et al. 2017) together identified six CpGs (cg05951221, cg21566642, cg05575921, cg06126421, cg23387569 and cg03636183) with significant ORs for lung cancer after adjusting for smoking using self-reported smoking status. Further stratification showed that five of the six CpGs had methylation levels strongly influenced by smoking. Sandanger 2018 found two additional CpGs, cg10151248 (PC) and cg13482620 (B3GNTL1), to be significantly associated with lung cancer risk after adjusting for smoking status, pack-years, and a comprehensive smoking index built using self-reported information (Sandanger et al. 2018). In our analyses, cg05575921 (AHRR), cg03636183 (F2RL3), and cg21566642 methylation levels were statistically significantly associated with lung cancer when we adjusted for self-reported smoking status, whereas cg10151248 (PC), cg13482620 (B3GNTL1), and cg23387569 (AGAP2) methylation levels did not significantly differ between the cases and controls regardless of whether we adjusted for the pack-years methylation score. cg05951221 and cg06126421 were not associated with lung cancer.
Our analyses identified novel genomic regions that are independent of smoking exposures. Many of the significant single CpGs we identified through EWAS were located in genomic regions that have been previously associated with lung cancer or other malignancies (RUNX3 (Sato et al. 2005), H19 (Huang et al. 2018), BAIAP2L2 (Liu et al. 2020), GPR132 (Chen et al. 2017), CUEDC1 (Lopes et al. 2018), SSBP4 (Guo et al. 2018), AMPD2 (Gao et al. 2020), ADAM11 (Sieuwerts et al. 2005), and RTN4R (He et al. 2020)). For instance, RUNX3 is a tumor suppressor gene that is implicated in lung cancer oncogenesis (Sato et al. 2005). Promoter hypermethylation of RUNX3 has been associated with NSCLC survival (Yanagawa et al. 2007). Many of the top differentially methylated regions we identified using the DMRcate analysis included genes that have been previously linked to lung cancer in other studies (H19 (Huang et al. 2018), HOXA4 (Xu et al. 2019), PLXNB2 (Liu and Zhao 2019), PRDM1 (Zhu et al. 2017), TSPAN4 (Ying et al. 2019), PHPT1 (Xu et al. 2010), MSI2 (Kudinov et al. 2016), CBX5 (Yu et al. 2012), RCAN1 (Ma et al. 2017), CCL5 (Huang et al. 2009), and BRDT (Grunwald et al. 2006)), providing support for our findings. Of these eleven genetic regions already linked to lung cancer, many have been connected to poor outcomes in lung cancer. For instance, decreased expression levels of PLXNB2 (Liu and Zhao 2019) and PRDM1 (Zhu et al. 2017) have been found to be correlated with poor prognosis in lung cancer, while TSPAN4 (Ying et al. 2019), PHPT1 (Xu et al. 2010), MSI2 (Kudinov et al. 2016), and CBX5 (Yu et al. 2012) are linked to metastasis. In addition, HOXA4 (Xu et al. 2019), RCAN1 (Ma et al. 2017), and CCL5 (Huang et al. 2009) are involved in the growth, development, and migration of lung cancer cells. Other top regions identified in our study include genes that have been linked to breast cancer (NNAT (Nass et al. 2017), RPL7A (Zhu et al. 2001), and HIST1H2BO (Xie et al. 2019)), colorectal cancer (RP11 (Sun et al. 2017; Wu et al. 2019)), endometrial cancer (HELZ2 (Qiao et al. 2019)), pancreatic cancer (LFNG (Liu et al. 2016)), prostate cancer (MAST3 (Dahlman et al. 2012)), renal cell carcinoma (KCNJ1 (Guo et al. 2015)), and tumor progression (ZC3H12D (Huang et al. 2012)).
Of all the novel genomic regions that we identified as associated with lung cancer risk, the H19 region was the only one that appeared in both the single CpG EWAS and DMRcate results. The H19 long noncoding RNA (LncRNA) has been previously implicated in lung cancer causation. Inhibition of LncRNA H19 has been found to suppress the growth, migration, and invasion of NSCLC (Huang et al. 2018). In terms of disease development, loss of imprinting of the H19 gene has been connected to a genome-wide loss of methylation, and associated with the transformation from normal to NSCLC (Anisowicz et al. 2008; Kondo et al. 1995; Langevin et al. 2015). In our analyses, we found that hypermethylation of many H19 region CpGs was associated with lung cancer risk in CLUE II. The direction of this association was unexpected since the overexpression of H19 LncRNA in lung tumors is often correlated with hypomethylation of the promoter region CpGs (Kondo et al. 1995). H19 LncRNA belongs to a highly conserved imprinted gene cluster that plays important roles in embryonal development and growth control (Gabory et al. 2010) and H19 region methylation has been found to be influenced by early life exposures, including maternal factors during pregnancy (Miyaso et al. 2017), suggesting the possibility that external exposures could impact H19 methylation. Since the blood samples were drawn years before cancer diagnosis in this study, the methylation patterns we observed could be regions that are modulated early on in lung cancer development. More research is needed to investigate the methylation pattern in blood prior to cancer diagnosis.
Use of the DMRcate analysis methods and the Illumina Infinium MethylationEPIC 850K BeadArray allowed us to investigate genome-wide regional methylation level differences between lung cancer cases and controls. Some of the regions we identified that have not previously been associated with lung cancer should be investigated in other populations. It is possible that some of the single CpGs and DMRs that we identified after adjusting for the pack-years methylation score could be related to risk factors unique to CLUE II. Further studies are needed to investigate pathways related to the novel genomic regions that we identified.
This study demonstrated the importance of carefully controlling for known DNA methylation changes associated with smoking to be able to identify novel genomic regions. We showed the potential for this approach to identify DMRs (i.e., not single CpG alterations) by case/control status using peripheral blood collected prior to lung cancer diagnosis. These findings suggest that methylation changes detectable years prior to cancer diagnosis could potentially influence lung cancer risk, providing new insights into the biological processes of early lung cancer development. Further work in other populations should be conducted to validate regions that we observed to be associated with lung cancer risk independent of smoking exposures, especially among different ethnic and racial groups.
Data Availability
The datasets generated during the current study are available from the corresponding author on reasonable request and will be deposited into dbGaP within 1 year.
Declarations
Competing interest
The authors report no conflicts of interest.
Funding
This work was supported by 2018 American Association for Cancer Research (AACR)-Johnson & Johnson Lung Cancer Innovation Science (18-90-52-MICH).
Note: The funders had no role in the design of the study; the collection, analysis, and interpretation of the data; the writing of the manuscript; and the decision to submit the manuscript for publication.
Authors’ contributions
DSM, KTK, and EAP designed the study, obtained funding, and acquired the data. JL assisted with preparation of the dataset. DSM supervised all research activities. MR and NZ conducted the statistical analyses. DSM and NZ drafted the manuscript. DSM, EAP, KTK, DCK, and CJM interpreted the data and provided critical revisions of the manuscript. All authors read and approved the final version of the manuscript.
Ethics approval
This study was approved by the Institutional Review Board at Johns Hopkins University Bloomberg School of Public Health and at Tufts University.
Acknowledgments
Cancer incidence data were provided by the Maryland Cancer Registry, Center for Cancer Surveillance and Control, Maryland Department of Health, 201 W. Preston Street, Room 400, Baltimore, MD 21201. We acknowledge the State of Maryland, the Maryland Cigarette Restitution Fund, and the National Program of Cancer Registries of the Centers for Disease Control and Prevention for the funds that helped support the availability of the cancer registry data.