ABSTRACT
International differences in the incidence of many cancer types indicate the existence of carcinogen exposures that have not been identified by conventional epidemiology yet potentially make a substantial contribution to cancer burden1. This pertains to clear cell renal cell carcinoma (ccRCC), for which obesity, hypertension, and tobacco smoking are risk factors but do not explain its geographical variation in incidence2. Some carcinogens generate somatic mutations and a complementary strategy for detecting past exposures is to sequence the genomes of cancers from populations with different incidence rates and infer underlying causes from differences in patterns of somatic mutations. Here, we sequenced 962 ccRCC from 11 countries of varying incidence. Somatic mutation profiles differed between countries. In Romania, Serbia and Thailand, mutational signatures likely caused by extracts of Aristolochia plants were present in most cases and rare elsewhere. In Japan, a mutational signature of unknown cause was found in >70% cases and <2% elsewhere. A further mutational signature of unknown cause was ubiquitous but exhibited higher mutation loads in countries with higher kidney cancer incidence rates (p-value <6 × 10−18). Known signatures of tobacco smoking correlated with tobacco consumption, but no signature was associated with obesity or hypertension suggesting non-mutagenic mechanisms of action underlying these risk factors. The results indicate the existence of multiple, geographically variable, mutagenic exposures potentially affecting 10s of millions of people and illustrate the opportunities for new insights into cancer causation through large-scale global cancer genomics.
INTRODUCTION
The incidence rates of most adult cancers vary substantially between geographical regions and many such differences are unexplained by known risk factors1. Together with unexplained trends in incidence over time, this indicates the likely presence of unknown environmental or lifestyle causes for many cancer types1. Traditional epidemiological studies have identified many important lifestyle, environmental and infectious risk factors for cancer. However, they have had limited success in recent decades suggesting that alternative study designs are required if further risk factors are to be identified.
Characterization of mutational signatures within cancer genomes3 is an approach complementary to conventional epidemiology for investigating unknown causes of cancer. Most cancers contain thousands of somatic mutations that have occurred over the lifetime of the individual. These can be caused by endogenous cellular processes, such as imperfect DNA replication and repair, or by exposure to exogenous environmental or lifestyle mutagens such as ultraviolet radiation in sunlight and compounds in cigarette smoke. Mutational signatures are the patterns of somatic mutation imprinted on genomes by individual mutational processes. Analysis of thousands of cancer genome sequences from most cancer types has established a set of reference mutational signatures including 71 single base substitution (SBS) or doublet base substitution (DBS) signatures, and 18 small insertion and deletion (ID) signatures4. A possible etiology has been suggested for 47 SBS/DBS signatures and nine ID signatures.
Kidney cancer has particularly high incidence rates in Central and Northern Europe, notably in the Czech Republic and Lithuania, and has shown increasing incidence in high income countries in recent decades (Fig. 1)2. Most kidney cancers are clear cell renal cell carcinomas (ccRCC)3 for which obesity, hypertension and tobacco smoking are known risk factors2. However, these account for <50% of the global ccRCC burden and do not explain geographical or temporal incidence trends. Previous ccRCC genome sequencing studies have included relatively modest numbers of individuals from a small number of countries with limited variation in ccRCC incidence5–9 and have not comprehensively examined associations between ccRCC risk factors and mutational signatures. To detect the activity of unknown carcinogens involved in ccRCC development and to investigate the mechanisms of action of known risk factors, we generated and analyzed epidemiological and whole genome sequencing data from a large international series of ccRCC10.
RESULTS
A total of 962 ccRCC cases from 11 countries in four continents were studied, including Czech Republic (n=259), Russia (n=216), United Kingdom (n=115), Brazil (n=96), Canada (n=73), Serbia (n=69), Romania (n=64), Japan (n=36), Lithuania (n=16), Poland (n=13), and Thailand (n=5; Fig. 1; Table 1; Methods). These encompass a broad range of ccRCC incidence, from the highest global age-standardized rates (ASRs) of Lithuania and Czech Republic (ASRs of 14.5 and 14.4/100,000 respectively) to the relatively low rates of Brazil and Thailand (ASRs of 4.5 and 1.8/100,000 respectively)11. Epidemiological questionnaire data were available on sex, age at diagnosis, and important risk factors including body mass index (BMI), hypertension, and tobacco smoking (Table 1, Supplementary Table 1). DNAs from ccRCCs and blood from the same individuals were extracted and whole genome sequenced to average coverage of 54-fold and 31-fold, respectively.
Somatic mutation burdens in the 962 ccRCC genomes ranged from 803 to 45,376 (median 5,093) for single base substitutions (SBS), 2 to 240 (median 53) for doublet base substitutions (DBS), and 10 to 14,770 (median 695) for small insertions and deletions (Supplementary Table 2). The average burden of all three mutation types differed between the 11 countries (p-value <2 × 10−23, p-value <2 × 10−14, p-value <6 × 10−14, for SBSs, DBSs, and IDs, respectively). In particular, the burden of all mutation types was elevated in Romania compared to other countries (Extended Data Fig. 1). Principal Component Analysis (PCA) performed on the proportions of the six primary SBS mutation classes (C>A, C>G, C>T, T>A, T>C, T>G) in each sample identified a distinct cluster of mainly Romanian and Serbian cases and a further cluster of mainly Japanese cases (Extended Data Fig. 2). The results, therefore, clearly demonstrate geographical variation of somatic mutation loads and patterns in ccRCC.
To investigate the mutational processes contributing to the geographical variation in mutation burdens we extracted mutational signatures and estimated the contribution of each signature to each ccRCC genome (Supplementary Tables 3-7). Ten signatures with strong similarity to a reference signature in the Catalogue of Somatic Mutations in Cancer (COSMIC) database were extracted: SBS1, due to deamination of 5-methylcytosine12; SBS2 and SBS13, due to cytosine deamination by Apolipoprotein B mRNA-editing enzyme, catalytic polypeptide-like (APOBEC) DNA editing enzymes12; SBS4, due to tobacco smoke mutagens13; SBS5, due to an endogenous mutational process in which mutations accumulate with age13; SBS12, of unknown cause; SBS18, due to DNA damage by reactive oxygen species13; SBS21 and SBS44, due to defective DNA mismatch repair13,14; and SBS22, due to Aristolochic acid exposure15,16.
Five further SBS signatures were identified which were not well described by the COSMIC catalogue (Fig. 2; Supplementary Table 8). SBS40a, SBS40b and SBS40c were present in most ccRCC, accounting for, on average, ∼30%, ∼20%, and ∼3% of mutations respectively (Fig. 2b). Combined, they closely resemble the previously reported SBS40 (0.96 cosine similarity), suggesting that the large number of ccRCC whole genomes analyzed here provides the power to separate the constituent component signatures of SBS40. SBS40 was previously reported frequently, and at high levels, in kidney cancer, but also in other cancers, and is of unknown etiology. Like the composite SBS40, SBS40a is present in multiple cancer types. However, SBS40b and SBS40c are largely restricted to ccRCC (Supplementary Note, Extended Data Fig. 3). SBS_H was found in a single case and SBS_I is related to Aristolochic acid exposure (see below; SBS_I has been renamed as SBS22b). Analysis of all other types of mutational signatures, including doublet bases substitutions, small insertion and deletions, copy number variants and structural variants, is presented in Supplementary results.
The mutation burdens of multiple SBS mutational signatures varied between the 11 countries. SBS22 is thought to be caused by Aristolochic acids, mutagenic derivatives of plants of the Aristolochia genus which are carcinogenic and also cause Balkan endemic nephropathy (BEN), a kidney disease prevalent in areas adjacent to the Danube in Southeastern Europe17. SBS22 has previously been found in ccRCC, other urothelial tract cancers, and hepatocellular carcinomas from Romania5,18 and various countries in East and South-East Asia15,16,19. In this study, SBS22 was present in high proportions of ccRCC from Romania (45/64, 70%), Serbia (16/69, 23%), and Thailand (3/5, 60%), often with very high mutation burdens. Of note, given the limited number of cases in Thailand, they may not be representative of ccRCC in that region. The presence of SBS22 was strongly correlated with that of new signatures SBS_I, DBS_D, and ID_C (Extended Data Fig. 4-6) which are, therefore, also probably due to Aristolochic acid exposure. SBS_I, like SBS22, is composed predominantly of T>A mutations. The signature identified previously as SBS22, has therefore been renamed SBS22a, and SBS_I has been named SBS22b. The mutation burden of both SBS22a and SBS22b differed between Serbia and Romania, with higher levels being detected in Romania, and away from recognized BEN zones (Fig 3a-c). The two signatures may be due to different subsets of Aristolochic acids, and/or to different metabolites, which induce slightly different mutational patterns. Only five ccRCC cases were known to reside within recognized BEN zones, suggesting no clear link between the two diseases. While the source of this exposure is uncertain, these results indicate that a substantial proportion of the population over a wide geographical area of Eastern Europe, possibly numbering in the 10s of millions, has been exposed to Aristolochic acid containing compounds, the public health consequences of which are uncertain.
SBS12 was present in 72% of Japanese and 2% of non-Japanese ccRCC (p-value=4.7 × 10−78) (Extended Data Fig. 7h). Compared to the mutation burdens imposed by Aristolochic acid in ccRCC, SBS12 contributed modest mutation loads. SBS12 is composed predominantly of T>C substitutions and exhibits strong transcriptional strand bias with more T>C mutations on the transcribed than untranscribed strands of protein coding genes. Transcriptional strand bias is typically caused by activity of transcription-coupled nucleotide excision repair acting on bulky DNA adducts due to exogenous mutagenic exposures such as tobacco smoke chemicals13, ultraviolet light13, Aristolochic acids15, and aflatoxins20. Assuming that transcription-coupled repair of DNA adducts is responsible for the SBS12 strand bias, the adducts are likely on adenine. Alternatively, transcriptional strand bias can also be caused by transcription-coupled damage21,22. The presence of SBS12 was replicated in two further series of whole genome sequenced ccRCC from Japan including 14 cases from an independent study group who undertook a broad genomic analysis of ccRCC but without detailed mutational signature analysis23, and a more recent unpublished series of 61 cases from an additional cohort of ccRCC sequenced by the same center as the initial cohort (Supplementary Note). SBS12 was present in 12/14 cases (85%) cases and 46/61 (75%) of cases, respectively. SBS12 was previously reported in hepatocellular carcinomas4,13 and additional analysis of existing datasets revealed strong SBS12 enrichment in hepatocellular carcinomas from Japan compared to other countries (p-value=3.8 × 10−15; Supplementary Note). These results, therefore, indicate that exposure to an agent contributing SBS12 mutations to kidney and liver cancer is common in Japan and rare in the other 10 countries included in this study. The agent responsible for SBS12 is unknown although the precedents provided by other mutational signatures with strong transcriptional strand bias suggest that it is likely of exogenous origin21,22. A polymorphism in aldehyde dehydrogenase 2 known to impair metabolism of alcohol to aldehydes and common in Japan did not associate with levels of SBS12, and neither did any other common germline variants (Supplementary Note).
SBS40a, SBS40b, and SBS40c were present in ccRCC from all 11 countries. The country-specific average mutation burdens of SBS40a and SBS40b positively associated with country-specific ASRs of kidney cancer incidence (p-value=0.0022 and p-value=5.1 × 10−18, respectively; Extended Data Fig. 8a; Fig. 4a, Supplementary Table 9), with the highest mutation loads in the Czech Republic and Lithuania. Kidney cancer incidence rates also vary between the regions of the Czech Republic and SBS40b mutation burdens differed significantly between these (p-value=0.011; Fig 4b,c, Supplementary Table 10), with the highest attribution in the highest risk region. SBS40b exhibits modest transcriptional strand bias and, assuming that transcription-coupled repair of DNA adducts is responsible, the adducts underlying SBS40b are likely on pyrimidines. Insertion and deletion (indel) signatures ID5 and ID8, which together contributed ∼60% of the indel mutation burden on average, were also strongly associated with country-specific kidney cancer ASR (p-value=1.3 × 10−10 and p-value=7.1 × 10−5, respectively, Extended Data Fig. 8b,c). Signatures ID5 and ID8 correlated with each other (Spearman’s r=0.78), as well as with SBS40b (r=0.79 and r=0.74, respectively) indicating that they likely all constitute products of the same underlying mutational process. Thus, the burdens of the full complement of mutation types generated by this mutational process correlate with age-adjusted kidney cancer incidence rates. The overall mutational burden did not, however, associate significantly with kidney cancer incidence rates (Extended Data Fig. 9).
To investigate potential mutagenic agents underlying these geographically variable signatures, an untargeted metabolomics screen of plasma was conducted on 901 individuals in the study, from all countries except Japan (Methods). 2,392 metabolite features were obtained, including 944 independent peaks (r<0.85). Three features were associated with SBS4 (Supplementary Table 11), with two identified as hydroxycotinine (p-value=2.9 × 10-9) and cotinine (p-value=1.9 × 10-5), two major metabolites of nicotine24. Eight features were associated with SBS40b (Supplementary Table 11). One feature was identified as N,N,N-trimethyl-L-alanyl-L-proline betaine (TMAP; p-value=1.2 × 10-5, Supplementary Table 12), increased levels of which correlate strongly with reduced kidney function25. Other established measures of kidney function, including cystatin C and creatinine, were correlated with TMAP (p-value = 2.5 x 10-30 and 1.7 × 10-69, respectively) and also showed evidence of positive association with SBS40b (p-value=0.023 and 0.058, respectively). Thus, exposure to the mutagenic agent responsible for SBS40b is associated with reduced kidney function. No recognized metabolome features were significantly associated with any other signatures.
A total of 1962 “driver” mutations were found in 136 genes including VHL, PBRM1, SETD2 and BAP1, the known frequently mutated cancer genes in ccRCC (Methods) (Fig. 5a, Supplementary Table 13)9,26. The frequencies of mutations in these genes were consistent across countries (Fig. 5b). The spectrum of all driver mutations in ccRCC with Aristolochic acid exposure (Methods) was enriched in T>A mutations compared to non-exposed cases (25% vs 13%, p-value=0.0062, Fig. 5c,d) with similar enrichment specifically in VHL mutations (30% vs 16%; Fig. 5e,f), and in the whole exomes (27% in exposed compared to 12% in unexposed cases). Thus genome-wide Aristolochic acid mutagenesis has contributed in a proportionate fashion to generation of driver mutations in Aristolochic acid-exposed ccRCC. The driver mutation spectrum did not show statistically significant enrichment of T>C mutations in SBS12 exposed cases (20% vs 12%, p = 0.069), but was consistent with the level of enrichment in the exome (21% in exposed compared to 15% in unexposed cases). SBS40b also did not show statistically significant enrichment possibly due to the ubiquitous exposure and its relatively flat and featureless mutation profile.
Exogenous mutagenic exposures that ultimately cause cancer may be present during the early stages of evolution of cancer clones. To time mutagenic exposures, the contribution of each mutational signature to mutations in the primary clone (relatively early) and to mutations in subclones (relatively late) were estimated27,28 (Methods). All signatures of the putative exogenous mutagenic exposures observed in ccRCC were present at relatively early stages of cancer development, consistent with exposures to normal cells. SBS12, SBS22b, and SBS40b showed higher activities in main clones compared to subclones (q-value=0.04, q-value=0.02, q-value=2.3 × 10−5, respectively) (Extended Data Fig. 10) and SBS22a showed no significant difference15,16. By contrast, signatures due to endogenous mutational processes including APOBEC DNA editing (SBS13) and oxidative damage (SBS18), were enriched in subclones (q-value=1.6 × 10−4, q-value=3.2 × 10−7, respectively).
Established or suspected risk factors for ccRCC include age, tobacco smoking, obesity, hypertension, diabetes, and environmental exposure to PFAS compounds29. Total SBS, DBS, and ID mutation burdens associated with age, as did SBS1, SBS4, SBS5, SBS40a, SBS40b, SBS22a, SBS22b, DBS2, ID1, ID5, and ID8. Total SBS (p-value=3.1 × 10−5), DBS (p-value=3.7 × 10−3) and ID (p-value=1.3 × 10-4) mutation burdens also associated with sex, with males having higher mutation burdens than females, and with SBS40b showing a similar association (p-value=9.3 × 10−5). Associations with tobacco smoking were observed for SBS4 (p-value=5.3 × 10−6) and DBS2 (p-value=2.4 × 10−7), both known to be caused by tobacco carcinogens30,31. These results suggest that the known increased risk of ccRCC with tobacco smoking is due to direct exposure of the kidney to tobacco related mutagens (Supplementary Note). Associations of particular mutational signatures with other ccRCC risk factors were not observed (Supplementary Tables 14, 15). To complement this analysis of observational data, associations between polygenic risk scores for known ccRCC risk factors and mutational signatures32,33 were examined (Methods). Consistent with the observational data, no associations were found between genetically inferred risk factors and mutational signatures except for tobacco smoking and DBS2 (p-value=0.01; Supplementary Table 16).
DISCUSSION
Somatic mutations in the genomes of 962 ccRCC patients from 11 countries indicate the existence of multiple, widespread mutational processes exhibiting substantial geographical variation in their contributions to ccRCC mutation loads. The results contrast with those from 552 esophageal squamous carcinomas from eight countries with widely different esophageal carcinoma incidence rates in which geographical differences in mutation burdens or signatures were not observed34. Together the studies implicate both geographically variable mutagenic and non-mutagenic carcinogenic exposures contributing to global cancer incidence. Indeed, the presence of mutational signatures associated with tobacco smoking but absence of signatures associated with other known ccRCC risk factors, such as obesity and hypertension, suggests that the latter may be mediated by non-mutagenic processes and, therefore, that both classes of carcinogen contribute to the development of ccRCC.
The existence, identity, and carcinogenic effect of some of the agents underlying these mutational processes are known. Aristolochic acids are believed to cause SBS22a/b, and its associated signatures, but this study suggests that the geographical extent and proportion of the population acquiring mutations in South-Eastern Europe is far greater than previously anticipated, possibly affecting 10s of millions of individuals. The sources of the Aristolochic acid exposure, the manner by which it is ingested and whether the exposure continues today are uncertain, and further definition of the source and extent of this exposure is required in order to provide a foundation for public health action.
The existence of the mutagenic exposures underlying SBS12 and SBS40b were not previously suspected, and their causative agents are unknown. Based on current information, the exposure causing SBS12 is restricted to Japan. However, larger studies are now indicated to explore the geographical extent of exposure in Japan and neighboring countries, and the proportions of their populations that have been exposed. Studies of Japanese migrants to other countries are also likely to be informative regarding the potential source of exposure. In the first instance this will be achievable by further sequencing of kidney and hepatocellular cancer genomes. However, studies of normal tissues, using recently reported sequencing methods allowing detection of somatic mutations in normal cells35, and particularly relatively accessible ones such as cells in urine that can be prospectively collected, may enable large population-based studies providing better characterization of the exposure and its consequences. As with exposure to Aristolochic acid in South-Eastern Europe, it is possible that 10s of millions of individuals in East Asia are exposed to a potent mutagen, and identification of the source and extent of exposure would seem to be a public health priority.
In contrast to Aristolochic acid and the agent causing SBS12, the exposure underlying SBS40b appears to be globally ubiquitous. It predominantly causes mutations in ccRCC, with much lower burdens in other cancer types, and generates mutation loads correlating strongly with age and sex. There are few clues as to its origin or nature.
The incidence rates of ccRCC vary ∼eightfold across the eleven countries from which ccRCCs were sequenced. A strong positive correlation (p-value=5.5 × 10−18) was found between the average mutation loads attributable to SBS40b in each country (and also those of ID5 and ID8 which are correlated with SBS40b) and incidence of kidney cancer within each country. This correlation reflects approximately a tripling of average country-specific SBS40b mutation loads (a difference of ∼1000 mutations) in parallel with the eightfold increase of country-specific ASR.
SBS40b mutation burdens also positively correlated with biomarkers of impaired kidney function, reminiscent of the nephrotoxic effects of Aristolochic acids in Balkan endemic nephropathy. It is possible that the increased SBS40b somatic mutation load itself engenders this reduction in renal function. However, studies of other normal tissues suggest that they are generally tolerant of elevated mutation burdens, except for manifesting a higher incidence of neoplasia36,37. It is also possible that the agent underlying SBS40b is directly nephrotoxic, for example by engendering DNA damage and a response to it, and that the mutations it generates are immaterial to kidney function. It is also conceivable, however, that impaired renal function, potentially due to many different causes, results in a metabolic state which itself causes the elevated SBS40b mutation load. Whatever the mutational process underlying SBS40b, it is plausible that it contributes to the geographical variation in the age standardized rates of kidney cancer incidence rates. It is of public health interest to determine the cause of SBS40b and, hence, to consider whether the exposure can be mitigated, potentially with concomitant reduction in global ccRCC incidence rates.
The absence of any association between several known risk factors for ccRCC and mutation burden, in particular for obesity and hypertension, supports a model of cancer development where mutations are essential but additional factors affect the expansion of a mutated clone and thus the chance of it progressing into cancer38. Further efforts at defining how lifestyle and environmental exposures contribute to cancer development will therefore require a greater understanding of both the causes of the mutations in cell clones in normal tissue, and the further promotion of such mutant clones by non-mutagenic processes.
Finally, the substantial geographical variability of SBS12 and SBS22a/b, with most countries not showing evidence of exposure, raises the possibility that additional mutational signature studies of ccRCC involving more countries may reveal further mutagenic exposures. Furthermore, the results relating to SBS40b highlight the prospect that a significant proportion of global cancer burden may be caused by relatively ubiquitous exposures that are not readily detectable by classical cancer epidemiology studies. The conduct of large scale whole genome sequencing for other cancer sites across high and low risk populations around the world would seem to be an appropriate strategy for detecting such novel cancer causing agents.
Data Availability
Whole genome sequencing data and patient metadata are deposited in the European Genome-phenome Archive (EGA) associated with study EGAS00001003542. Aligned BAM files for all ccRCC cases included in the final analysis were deposited in dataset EGAD00001012102, variant calling files in dataset EGAD00001012222 and patient metadata in dataset EGAD00001012223. The metabolomics data are available upon request. All other data are provided in the accompanying Supplementary Tables.
https://ega-archive.org/studies/EGAS00001003542
https://ega-archive.org/datasets/EGAD00001012102
ONLINE METHODS
Recruitment of cases and informed consent
The International Agency for Research on Cancer (IARC/WHO) coordinated case recruitment through an international network of over 40 collaborators from the 11 participating countries (Table1; Supplementary Table 17). The inclusion criteria for patients were >=18 years of age (ranging from 23 to 87, with a mean of 60 and a standard deviation of 12), confirmed diagnosis of primary ccRCC and no prior cancer treatment. Informed consent was obtained for all participants. Patients were excluded if they had any condition that could interfere with their ability to provide informed consent or if there were no means of obtaining adequate tissues or associated data as per the protocol requirements. Ethical approvals were first obtained from each Local Research Ethics Committee and Federal Ethics Committee when applicable, as well as from the IARC Ethics Committee.
Bio-samples, data collection, and expert pathology review
Dedicated standard operating procedures, following guidelines from the International Cancer Genome Consortium (ICGC), were designed by IARC/WHO to select appropriate case series with complete biological samples and exposure information as described previously1 (Supplementary Table 17). In brief, for all case series included, anthropometric measures were taken, together with relevant information regarding medical and familial history. Comparable smoking and alcohol history was available from all centers. Detailed epidemiological information on residential history was collected in Czech Republic, Romania, and Serbia. Potential limitations of using retrospective clinical data collected using different protocols from different populations were addressed by a central data harmonization to ensure a comparable group of exposure variables (Supplementary Table 17). All patient related data as well as clinical, demographical, lifestyle, pathological and outcome data were pseudonymized locally through the use a dedicated alpha-numerical identifier system before being transferred to IARC/WHO central database.
Original diagnostic pathology departments provided diagnostic histological details of contributing cases through standard abstract forms. IARC/WHO centralized the entire pathology workflow and coordinated a centralized digital pathology examination of the frozen tumor tissues collected for the study as well as formalin-fixed, paraffin-embedded (FFPE) sections when available, via a web-based report approach and dedicated expert panel following standardized procedures as described previously1. A minimum of 50% viable tumor cells was required for eligibility to whole genome sequencing.
In summary, frozen tumor tissues were first examined to confirm the morphological type and the percentage of viable tumor cells. A random selection of tumor tissues was independently evaluated by a second pathologist. Enrichment of tumor component was performed by dissection of non-tumoral part, if necessary. 90 cases overlapped with a previously published cohort recruited under the Cancer Genomics of the Kidney (CAGEKID) project2, which were also part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) project3.
DNA extraction
Extraction of DNA from fresh frozen tumor and matched blood samples was centrally conducted at IARC/WHO except for Japan, which performed DNA extractions at the local center following a similarly standardized DNA extraction procedure. Of the cases which proceeded to the final analysis (n=962), germline DNA was extracted from either buffy-coat, whole blood, or from adjacent normal tissue (viz., samples from Japan) using previously described protocols and methods1.
Whole genome sequencing
In total, 1583 renal cell carcinoma cases were evaluated, with 1267 confirmed as ccRCC cases. 116 (9%) were excluded due to insufficient viable tumor cells (pathology level), or inadequate DNA (tumor or germline). DNA from 1151 cases was received at the Wellcome Sanger Institute for whole genome sequencing. Fluidigm SNP genotyping with a custom panel was performed to ensure that each pair of tumor and matched normal samples originated from the same individual. Whole genome sequencing (150bp paired end) was performed on the Illumina NovaSeq 6000 platform with target coverage of 40X for tumors and 20X for matched normal tissues. All sequencing reads were aligned to the GRCh38 human reference genome using Burrows-Wheeler-MEM (v0.7.16a and v0.7.17). Post-sequencing QC metrics were applied for total coverage, evenness of coverage and contamination. Cases were excluded if coverage was below 30X for tumor or 15X for normal tissue. For evenness of coverage, the median over mean coverage (MoM) score was calculated. Tumors with MoM scores outside the range of values determined by previous studies to be appropriate for whole genome sequencing (0.92 – 1.09) were excluded. Conpair4 (https://github.com/nygenome/Conpair) was used to detect contamination, cases were excluded if the result was greater than 3%5. A total of 962 cases passed all criteria and were included in subsequent analysis.
Somatic variant calling
Variant calling was performed using the standard Sanger bioinformatics analysis pipeline (https://github.com/cancerit). Copy number profiles were determined first using the algorithms ASCAT6 and BATTENBERG7, where tumor purity allowed. SNV were called with cgpCaVEMan8, indels were called with cgpPINDEL9, and structural rearrangements were called using BRASS. CaVEMan and BRASS were run using the copy number profile and purity values determined from ASCAT where possible (complete pipeline, n=857). Where tumor purity was insufficient to determine an accurate copy number profile (partial pipeline, n=105), CaVEMan and BRASS were run using copy number defaults and an estimate of purity obtained from ASCAT/BATTENBERG. For SNV additional filters (ASRD >= 140 and CLPM ==0) were applied to remove potential false positive calls. A second variant caller, Strelka2, was run for SNVs and indels as consensus variant calling was previously shown to eliminate algorithm specific artefacts and to generate highly reproducible mutational spectra compared to using a single variant calling algorithm1,10. Only variants called by both the Sanger variant calling pipeline and Strelka2 were included in subsequent analysis.
Validation of Japanese sequencing
The matched normal tissue sequenced was blood for all countries with the exception of Japan, where adjacent normal kidney was used. As Japan displayed an enrichment of SBS12, matched blood was obtained from 28 of the 36 patients to confirm that the source of the matched normal tissue was not influencing the result. In all cases, the mutational spectra of Japanese ccRCC generated using either blood or adjacent normal kidney matched each other with a cosine similarity of >0.99.
Generation of mutational matrices
Mutational matrices for single base substitutions (SBS), doublet base substitutions (DBS) and small insertions and deletions (ID) were generated using SigProfilerMatrixGenerator (https://github.com/AlexandrovLab/SigProfilerMatrixGenerator) with default options (v1.2.12)11.
Mutational signature analysis
Mutational signatures were extracted using two algorithms, SigProfilerExtractor (https://github.com/AlexandrovLab/SigProfilerExtractor), based on nonnegative matrix factorization, and mSigHdp12 (https://github.com/steverozen/mSigHdp), based on the Bayesian hierarchical Dirichlet process. For SigProfilerExtractor, de novo mutational signatures were extracted from each mutational matrix using SigProfilerExtractor with nndsvd_min initialization (NMF_init=“nndsvd_min”) and default parameters (v1.1.9)13. Briefly, SigProfilerExtractor deciphers mutational signatures by first performing Poisson resampling of the original matrix with additional renormalization (based on a generalized mixture model approach) of hypermutators to reduce their effect on the overall factorization13. Nonnegative matrix factorization (NMF) was performed using initialization with nonnegative singular value decomposition and by applying the multiplicative update algorithm using the Kullback–Leibler divergence as an objective function13. NMF was applied with factorizations between k=1 and k=20 signatures; each factorization was repeated 500 times13. De novo single base substitution mutational signatures were extracted with SigProfilerExtractor for both SBS-288 and SBS-1536 contexts11. The results were largely concordant with the SBS-1536 de novo signatures allowing additional separation of mutational processes, therefore the SBS-1536 de novo signatures were taken forward for further analysis (Supplementary Table 3). Mutational signatures for DBS and ID were extracted in DBS-78 and ID-83 contexts respectively (Supplementary Tables 4, 5). Where possible, SigProfilerExtractor matched each de novo extracted mutational signature to a set of previously identified COSMIC signatures14, for SBS-1536 signatures this requires collapsing the 1536 classification into the standard 96 substitution type classification with six mutation classes having single 3’ and 5’ sequence contexts (Supplementary Table 8). This step makes it possible to distinguish between de novo signatures which can be explained by a combination of the known catalog of mutational process (which have not been completely separated during the extraction), and those which have not been previously identified. mSigHdp extraction of SBS-96 and ID-83 signatures was performed using the suggested parameters and using the country of origin to construct the hierarchy. SigProfilerExtractor’s decomposition module was subsequently used to match mSigHdp de novo signatures to previously identified COSMIC signatures14. Further details on the comparison of results between SigProfilerExtractor and mSigHdp and decomposition of de novo signatures into COSMIC reference signatures can be found in the Supplementary Note.
Attribution of activities of mutational signatures
The de novo (SigProfiler) and COSMIC signature (SigProfiler and mSigHdp) activities were attributed for each sample using the MSA signature attribution tool (v2.0, https://gitlab.com/s.senkin/MSA)15. For COSMIC attributions, only COSMIC reference signatures, which were identified in the decomposition of de novo signatures, were included in the panel for attribution, in addition to de novo signatures which could not be decomposed into COSMIC reference. At its core, the tool utilizes the nonnegative least squares (NNLS) approach minimizing the L2 distance between the input sample and the one reconstructed using available signatures. To limit false positive attributions, an automated optimization procedure was applied by repeated removal of all signatures that do not increase the L2 similarity of a sample by >0.008 for SBS, >0.014 for DBS, and >0.03 for ID mutation types, as suggested by simulations. These optimal penalties were derived using an optional parameter (params.no_CI_for_penalties = false) utilizing a conservative approach in calculation of penalties. Finally, a parametric bootstrap approach was applied to extract 95% confidence intervals for each attributed mutational signature activity.
Driver mutations
A dNdS approach was used to identify genes under positive selection in ccRCC16. The analysis was performed both for the whole genome (q-value<0.01), and with restricted hypothesis testing (RHT) for a panel of 369 known cancer genes16. Variants in any gene identified as under positive selection in global dNdS or in the 369-cancer gene panel were assessed as potential drivers16. Candidate driver mutations were annotated with the mode of action using the Cancer Gene Census (https://cancer.sanger.ac.uk/census) and the Cancer Genome Interpreter tool (https://www.cancergenomeinterpreter.org). Missense mutations were assessed using the MutationMapper tool (http://www.cbioportal.org/mutation_mapper). Variants were considered likely drivers if they met any of the following criteria: (i) Truncating mutations in genes annotated as tumor suppressors; (ii) mutations annotated as likely or known oncogenic in MutationMapper; (iii) truncating variants in genes with selection (q-value<0.05) for truncating mutations assumed to be tumor suppressors and thus likely drivers; (iv) missense variants in all genes under positive selection and with dN/dS ratios for missense mutations above 5 (assuming 4 of every 5 missense mutations are drivers) labelled as likely drivers; or (v) in-frame indels in genes under significant positive selection for in-frame indels.
Evolutionary analysis
Subclonal architecture reconstruction was performed using the DPClust R package v2.2.87,17, after obtaining cancer cell fraction (CCF) estimates by dpclust3p v1.0.8 (https://github.com/Wedge-lab/dpclust3p) based on the variant allele frequency provided by the somatic variant callers and the copy number profiles determined by the BATTENBERG algorithm. Only tumors with at least 40% purity according to BATTENBERG were considered for further evolutionary analysis. For each tumor with at least one subclone, the respective somatic mutations were split into clonal and subclonal mutations using the most probable cluster assignment for each mutation as per the DPClust output. Mutations not assigned to a cluster by DPClust were removed from further analysis. Clusters centered at a CCF>1.5 and ones where chromosome X contributed the highest number of mutations were deemed artifactual, and the respective mutations were removed. Samples with a total number of clonal or subclonal mutations below 256 were also removed. Additionally, samples with poor separation between the clonal and subclonal distributions (e.g., subclone centered at a CCF>0.80) were removed. Finally, only samples that had both a clone and at least one subclone post-filtering were retained for further analysis. This yielded a total of 223 samples, each with clonal and subclonal mutations. SigProfilerAssignment (v0.0.13) (https://github.com/AlexandrovLab/SigProfilerAssignment) was used to identify the activity of each mutational signature in each clone/subclone, and these activities were then normalized by the total number of mutations belonging to the clone/subclone (i.e., clonal mutations were not included in the subclone). A two-sided Wilcoxon Signed-Rank Test18 was used to assess the differences in the relative activity of each mutational signature between the clones and their respective subclones. P-values were corrected using the Benjamini-Hochberg procedure19 and reported as q-values in the manuscript.
Regressions
Signature attributions were dichotomized into presence and absence using confidence intervals, with presence defined as both lower and upper limits being positive, and absence as the lower limit being zero. If a signature was present in at least 75% of cases (SBS1, SBS40a, SBS40b, ID1, and ID5), it was dichotomized into above and below the median of attributed mutation counts. The binary attributions served as dependent variables in logistic regressions, and relevant risk factors were used as factorized independent variables. To adjust for confounding factors, sex, age of diagnosis, country, and tobacco status were added as covariates in regressions. The Bonferroni method was used to test for significant p-values (i.e., a total of 224 comparisons for regressions with signatures, and a total of 24 comparisons for regressions with mutation burden). P-values reported are raw (not corrected). Regressions with incidence of renal cancer were performed as linear regressions with mutation burdens or signature attributions (those present in at least 75% of cases) with confidence intervals not consistent with zero as a dependent variable, and age-standardized rates (ASR) of renal cancer obtained from Global Cancer Observatory (GLOBOCAN)20, sex and age of diagnosis as independent variables. ASR of renal cancer for regions of Czech Republic were obtained from SVOD web portal21. Signatures present in less than 75% of cases were dichotomized into presence and absence as previously mentioned and analyzed using the logistic regressions. All regressions were performed on a sample basis.
Polygenic risk score (PRS) analysis of lifestyle risk factors
In this analysis, we used the genome-wide association studies (GWAS) summary statistics estimated in European populations for well-established risk factors for ccRCC. For tobacco smoking status, we used results from the GSCAN consortium meta-analysis of smoking initiation (ever vs never status)22. For body mass index (BMI), the results of UK biobank (UKBB) and GIANT consortium meta-analysis of continuous BMI were used23. GWAS summary statistics related to hypertension, namely systolic blood pressure and diastolic blood pressure, as well as the ones related to diabetes24, such as fasting glucose and fasting insulin were also obtained using UKBB studies25.
Since all the GWAS summary statistics used in the current work were based on European populations, we used ADMIXTURE tool (v1.3.0)26 and principal component analysis (PCA) to infer the unsupervised cluster of individuals with European genetic background within ccRCC cases. Hapmap SNPs (n=1,176,821 variants) were extracted from the ccRCC whole-genome sequence genotype data. After basic quality control using PLINK (v1.9b, www.cog-genomics.org/plink/1.9/), 333 variants were removed due to missing genotype rate > 5%, 1,236 variants failed Hardy-Weinberg equilibrium test (p-values<10-8), and 18,702 variants had MAF<1% in our cohort. Additionally, 3 ambiguous variants and 21,358 variants within regions of long-range, high linkage disequilibrium (LD) in the human genome (hg38) were excluded. After pruning for linkage disequilibrium, 143,727 variants remained in ccRCC genotype data. The 1000 genome reference population genotype data (phase 3) for Europeans (N=489), Africans (YRI, N=108) and East Asians (N=103 from China and 104 from Japan) (https://www.internationalgenome.org/data/) were filtered and merged with ccRCC genotype data based on the pruned set of variants present in both datasets. ADMIXTURE was run on the merged genotype data with k=3, which would correspond to the three ancestral continental population groups that likely reflect the participants of our study. The ccRCC cases with European genetic fraction greater than 80% by the ADMIXTURE analysis were selected for the polygenic risk scores (PRS) analyses. To complement the ADMIXTURE analysis, PCA was run on the same samples.
The initial genotype data based on whole-genome sequence from 849 ccRCC cases with European genetic background consisted of biallelic SNPs with MAF >0.01% (to exclude ultra-rare variants; N ∼ 30 million variants). After basic quality control, variants with missing genotype rate of greater than 5% (N=7,519,196 variants) with strong deviation from Hardy-Weinberg equilibrium (p-values<10-8, N=220,862) were excluded. For each GWAS trait, we restricted our analyses to the biallelic SNPs with minor allele frequency (MAF) greater than 1% in the 1000 genomes reference for European populations. For the selection of the independent genome-wide significant hits (p-values<5 × 10-8) of each GWAS summary statistic used to generate the PRS, SNPs were clumped (r2=0.1 within a LD window of 10 MB) using PLINK (v1.9b, www.cog-genomics.org/plink/1.9/) based on the 1000 genomes European reference population genotype data (N=489; ∼ 10 million variants). Where a selected GWAS hit was not found in ccRCC genotype data, we extracted proxies (r2>0.8 in 1000 genomes) also present in ccRCC dataset where possible (Supplementary Table 18). The variance of each genetic trait explained by the genetic variants were calculated as previously suggested27. PRS was subsequently calculated as the sum of the individual’s beta-weighted genotypes using PRSice-2 software28. Associations were estimated per standard deviation increase in the PRS, which was normalized to have a mean of zero across ccRCC cases of European genetic ancestry.
Untargeted metabolomics association with signatures
Of the 962 subjects from the main analysis, 901 subjects were included in this sub-study – all Japanese samples (n=36) as well as few cases from Czech Republic (n=13), Romania (n=5) and Russia (n=3) were not included due to lack of available plasma samples. Samples were randomized and analyzed as two independent analytical batches. Analysis was performed with a UHPLC-QTOF-MS system that consisted of a 1,290 Binary LC and a 6,550 QTOF mass spectrometer equipped with Jet Stream electrospray ionization source (Agilent Technologies), using previously described methods29. Pre-processing was performed using Profinder 10.0.2.162 and Mass Profiler Professional B.14.9.1 software (Agilent Technologies). A “Batch recursive feature extraction (small molecules)” process was employed for samples and blanks to find [M+H]+ ions. The two batches were processes separately and the resulting features were aligned in Mass Profiler Professional. Chromatographic peak areas were used as a measurement of intensity. No normalization or transformation of raw data was performed prior to the downstream data analysis.
A total of 2,392 features were detectable in at least one of the 901 samples. Features present in only one of the two batches were filtered out. Recursive filtering elimination was applied to decrease redundancy from highly correlated variables (r≥0.85, Pearson’s r calculated before any transformation/imputation) by selecting the features with least missing data within clusters of features. A total of 944 features were included in the statistical analysis. Features were pre-processed: missing values were replaced with 1/5 of the minimal value of the feature before applying mean centering and Pareto scaling. Each feature was regressed against both de novo and COSMIC signatures, adjusting for sex and age of diagnosis, as well as body mass index (BMI) and technical factors (batch, acquisition order) that could impact chromatographic peak area. Models for SBS22a and SBS22b were restricted to Romanian and Serbian samples to find potential pathways of Aristolochic acid exposure in the Balkan region. Logistic models were used for zero-inflated signatures (≥30% zeros) while quasi-Poisson regressions were used for the least zero-inflated signatures (SBS1, SBS40a, and SBS40b). To derive specific false detection rates, random variables were created from permutations of the initial features and regressed against signatures in the same fashion as true features. Maximum p-value thresholds from regressions with random features were compared to adjusted p-value thresholds according to Bonferroni’s procedure. The more conservative approach was used in selecting features of interest. Random forest models were also used as cross-checking multivariate models to assess the relative importance of each feature in explaining the signature attribution. As with univariate models, regression models were used for the least zero-inflated signatures (<30% of zeros) while classification models were used for all other signatures, with restriction to Romanian and Serbian samples for SBS22a and SBS22b. Importance was estimated from the total decrease in node impurities from splitting on the variable, averaged over all trees. Node impurity was measured by the Gini index for classification, and by residual sum of squares for regression. The significance of importance metrics for Random Forest models were estimated by permuting the response variable (https://github.com/EricArcher/rfPermute).
Features considered for identification, along with their highly correlated counterparts, were searched in Human Metabolome Database (HMDB), LipidMaps, Metlin, and Kegg. Compound identity was confirmed by comparison of retention times and MS/MS fragmentation against chemical standards when available, or otherwise against reference MS/MS spectra. Since the feature 240.1468@0.8929933 was strongly correlated with several features identified as TMAP (Supplementary Table 11), the integration of these features was inspected and corrected manually, and regressed against SBS40b using the same model applied to features selected for analysis. Creatinine was identified among the features by matching its retention time and MS/MS spectra against a reference standard and also regressed against SBS40b in the same fashion as other metabolites. Estimation of correlation between metabolic features was done using linear regression adjusting for batch and acquisition order.
Targeted metabolomics analyses
Circulating levels of PFAS (Per- and Polyfluorinated Substances) and cystatin C compounds were investigated using targeted mass spectrometry-based methods as described previously30,31.
Out of the 962 subjects from the main analysis, plasma samples from 909 subjects (from all countries except Japan) were randomized and sent frozen in dry ice to each respective laboratory for analyses. Measurement of cystatin C from 906 subjects included its native form and isoforms (3Pro-OH cystatin C, cystatin C-desS, 3Pro-OH cystatin C-desS and cystatin C-desSSP) that were modeled individually and for the total concentration of cystatin C isoforms. Measurements of PFAS compounds included PFOA (Perfluorooctanoic Acid; total, branch, linear), PFOS (Perfluorooctanoic Acid; total, branch, linear), PFHxS (Perfluorohexane sulfonate), PFNA (Perfluorononanoic acid), PFDA (Perfluorodecanoic acid), MePFOSAA (n-methylperfluoro-1 octanesulfonamido acetic acid), EtPFOSAA (2-(N-Ethyl-perfluorooctane. sulfonamido) acetic acid).
Multivariable quasi-Poisson (for the least sparse signatures SBS1, SBS40a and SBS40b) and logistic regression were used to estimate the association between plasma concentrations of the aforementioned substances and mutational signatures. All compounds were modeled continuously (log2-transformed) and categorically, with adjustments made by sex, age, date of recruitment, country, BMI, tobacco and alcohol status in the case of PFAS molecules and by sex, age and BMI, in the case of cystatin C.
Geospatial analyses
Geospatial analyses were performed to estimate the regional effect for signature attribution, particularly for signatures thought to be from exogenous exposure (SBS40b – unknown – and SBS22a/SB22b - Aristolochic acid). Residential history information was available for a large proportion of cases from the countries of interest: Czech Republic for SBS40b and Romania/Serbia for SBS22a/SBS22b. The 259 cases from Czech Republic within this study were recruited from 4 separate regions including Prague, České Budějovice (in Southern Bohemia), as well as Brno and Olomouc in the east of the country. Each individual residence was geocoded to its administrative region. All locations outside the country of recruitment were labeled as “Abroad”. A multi-membership mixed model was used to account for the full list of regions in which each subject resided, as well as the proportion of life spent in that region before diagnosis. As dependent variable, signatures were inverse-normal transformed. Models were adjusted for sex and age of diagnosis (fixed effects). The regional effect was treated as random effect.
Data availability
Whole genome sequencing data and patient metadata are deposited in the European Genome-phenome Archive (EGA) associated with study EGAS00001003542. Aligned BAM files for all ccRCC cases included in the final analysis were deposited in dataset EGAD00001012102, variant calling files in dataset EGAD00001012222 and patient metadata in dataset EGAD00001012223. The metabolomics data are available upon request. All other data are provided in the accompanying Supplementary Tables.
Code availability
All algorithms used for data analysis are publicly available with repositories noted within the respective method sections and in the accompanying reporting summary. Code used for regression analysis and figures is available at: https://gitlab.com/Mutographs/Mutographs_RCC.
Funding
This work was delivered as part of the Mutographs team supported by the Cancer Grand Challenges partnership funded by Cancer Research UK (C98/A24032). This work was supported by the Wellcome Trust grants 206194 and 220540/Z/20/A. The work was also partly funded by Barretos Cancer Hospital, the Public Ministry of Labor of Campinas (Research, Prevention, and Education of Occupational Cancer, 2015 to R.M.R.), and by Hospital de Clínicas de Porto Alegre (180330 to P.A.-P., M.B., B.S.N.). The work was also partly supported by the Practical Research Project for Innovative Cancer Control from the Japan Agency for Medical Research and Development (AMED) (JP20ck0106547h0001 to T.S.), and by the National Cancer Center Japan Research and Development Fund (2020-A-7 to A.F.). The work was also partly funded by the 1st and 2nd Faculties of Medicine, Charles University, Prague (CAGEKID to I.H.; Occupation, Environment and Kidney Cancer in Central and Eastern Europe to A.H.). The work was also partly supported by the Ministry of Health of the Czech Republic (MH CZ – DRO (MMCI, 00209805) to L.F. and M.N.). Measurement of PFAS compounds was funded by Division of Cancer Epidemiology and Genetics of the National Cancer Institute (USA). Measurement of cystatin C was funded by Cancer Research UK (C18281/A29019).
Contributions
The study was conceived, designed and supervised by M.R.S., P.B. and L.B.A. Analysis of data was performed by S.Senkin, S.Moody, M.D.-G., T.C., A.F.-I., J.W., S.F., M.K., R.V., A.P.L., E.N.B., A.K., B.O., S.C., E.T., J.A., K.S.-B., R.C.C.P., V.G., D.J., J.W.T. and J.M. Pathology review was carried out by B.A.-A., S.F. and M.A. Sample manipulation was carried out by C.L., C.C. and P.C. Patient and sample recruitment was led or facilitated by S.Sangkhathat, W.A., B.S., S.J., R.S., D.M., V.Jinga, S.R., S.Milosavljevic, M.M., S.Savic, J.M.S.B, M.A., L.P., P.A.-P., M.B., B.S.N., S.M.B., M.P.C., S.C.Z., R.M.R., E.F., N.S.M., R.S.F., R.B., N.V., D.Z., A.M., O.S., V.M., L.F., M.N., I.H., A.H., V.Janout, S.C. and C.L., M.P. P.K.-R., S.C., M.P., P.M.U. and M.J. contributed to data generation. Patient and sample recruitment for Japanese cases was led by T.S. and A.F. Scientific project management was carried out by L.H., E.C., G.S., A.C.D.C., A.F.-I. and S.P. S.Moody and S.Senkin jointly contributed and were responsible for overall scientific coordination. The manuscript was written by S.Senkin, S.Moody, M.R.S. and P.B. with contributions from all other authors.
Competing interests
LBA is a compensated consultant and has equity interest in io9, LLC and Genome Insight. His spouse is an employee of Biotheranostics, Inc. LBA is also an inventor of a US Patent 10,776,718 for source identification by non-negative matrix factorization. ENB and LBA declare U.S. provisional applications with serial numbers: 63/289,601; 63/269,033; and 63/483,237. LBA also declares U.S. provisional applications with serial numbers: 63/366,392; 63/367,846; 63/412,835; and 63/492,348. VM received honoraria from Ipsen, Bayer, AstraZeneca, Janssen, Astellas Pharm and MSD, and provided expert testimony to BMS, Bayer, MSD and Janssen. No other authors declare any competing interests.
Disclaimer
Where authors are identified as personnel of the International Agency for Research on Cancer/ World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy or views of the International Agency for Research on Cancer / World Health Organization.
Corresponding author
Correspondence to Paul Brennan.
EXTENDED DATA FIGURE AND TABLE LEGENDS
Acknowledgements
The authors would like to thank Laura O’Neill, Kirsty Roberts, Katie Smith, Maisie Farenden, Siobhan Austin-Guest and the staff of DNA Pipelines at the Wellcome Sanger Institute for their contribution. We are grateful for the support provided by Maja Milojevic, Christophe Lallemand, Helene Renard, Aude Bardot, Andreea Spanu and Nivonirina Robinot as well as IARC General Services, including the Laboratory Services and Biobank team led by Zisis Kozlakidis, the Section of Support to Research overseen by Tamas Landesz and the Evidence Synthesis and Classification Section led by Ian Cree, under IARC regular budget funding. The authors would like to thank Gislaine Bergo, Riley Cox and Juliana Oliveira for help with data/sample preparation and processing. The authors would also like to acknowledge the contributions of the Leeds Biobanking and Sample Processing Lab, the Leeds Multidisciplinary RTB and the Leeds NIHR BioRTB for provision of samples. The authors would like to thank Peter Campbell, Inigo Martincorena, Tim Butler, Daniela Mariosa, Laura Torrens Fontanals, Wellington Oliveira Dos Santos, Hana Zahed, Marc Gunter, Maggie Blanks and Mimi McCord for useful discussions. The authors would also like to thank all the patients and their families involved in this study.
Footnotes
Minor changes to the main text emphasizing the public health importance of the findings; validation of the SBS12 finding with two additional cohorts; authors and affiliations updated; figures updated; supplementary files updated.
REFERENCES
Methods references
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.↵
- 31.