Abstract
Tumor mutational signatures have the potential to inform cancer diagnosis and treatment. However, their detection in targeted sequenced tumors is hampered by sparse mutations and variability in targeted gene panels. Here we present SATS, a scalable mutational signature analyzer addressing these challenges by leveraging tumor mutational burdens from targeted gene panels. Through analyzing simulated data, pseudo-targeted sequencing data generated by down-sampling whole exome and genome data, and samples with matched whole genome sequencing and targeted sequencing, we showed that SATS can accurately detect common mutational signatures and estimate signature burdens. Applying SATS to 111,711 targeted sequenced tumors from the AACR Project GENIE, we generated a pan-cancer catalogue of mutational signatures tailored to targeted sequencing, enabling estimation of signature burdens within individual tumors. Integrating signatures with clinical data, we demonstrated SATS’s clinical utility, including identifying signatures enriched in early-onset hypermutated colorectal cancers and signatures associated with cancer prognosis and immunotherapy response.
Introduction
Tumors accumulate somatic mutations that form specific patterns, known as mutational signatures1,2. These signatures can provide insight into the underlying mutational processes involved in carcinogenesis and have the potential to inform cancer detection3-5 and treatment6-9. For example, aristolochic acid-associated signatures can aid in the screening for liver cancers5, while tumors exhibiting HRD (homologous recombination deficiency)-associated signatures may be amenable to PARP (poly (ADP-ribose) polymerase) inhibitors7. Similarly, tumors characterized by APOBEC (Apolipoprotein B mRNA Editing Catalytic Polypeptide-like)-associated signatures might be responsive to ATR (ataxia telangiectasia and Rad3 related) inhibitors8. To decipher these mutational signatures, multiple algorithms have been proposed10-14, and catalogues of mutational signatures have been established1,2 for tumors analyzed through whole exome or whole genome sequencing (WES/WGS).
In clinical practice, tumors are often sequenced by targeted gene panels to detect somatic mutations in cancer driver genes with therapeutic relevance. The small number of detected mutations, coupled with the use of diverse panels across hospitals, makes it unfeasible to use signature extraction10-12 or signature refitting methods13,14 developed for WES/WGS to analyze mutational signatures in targeted sequenced tumors. Specifically, signature extraction methods extract de novo mutational signatures; however, sparse mutation data from targeted sequencing reduce the ability to distinguish correlated de novo signatures with these methods15. Signature refitting approaches estimate mutational signature activities and burdens in a given sample using prespecified reference signatures. However, it remains uncertain to choose the appropriate reference signatures for targeted sequencing data. The mutational signatures detected through WES/WGS may not accurately represent those present in targeted sequenced tumors. Moreover, using reference signatures from all cancer types in signature refitting methods can lead to misassigning mutations to signatures that are not present in the targeted sequenced tumors16. Furthermore, most methods assume that identical genomic regions, such as whole exome or genome, are sequenced across all samples, which may not hold true in targeted sequencing studies using various panels across different genomic regions. Therefore, while current signature extraction and refitting methods have successfully analyzed mutational signatures in tumors subjected to WES/WGS, their effectiveness is limited when applied to targeted sequencing data, hindering the clinical use of mutational signatures derived from targeted sequencing.
Clustering methods have been proposed for detecting specific mutational signatures in targeted sequenced tumors. For example, the SigMA15 algorithm is tailored to detect the HRD-associated signature SBS3. It utilizes pre-training with WGS data from individual tumors to classify targeted sequenced tumors. Similarly, MUTYH mutation-related signatures SBS18/36 have been identified in targeted sequenced colorectal cancers using WES data from individual samples as the training set for clustering17. However, these methods are limited in their capacity to simultaneously analyze multiple active mutational signatures within targeted sequenced tumors. In addition, rare cancer types or underrepresented populations may lack WES/WGS data for pre-training. The Mix18 method presents an alternative clustering strategy that does not rely on pre-training, but it estimates signature activities at the cluster level rather than for individual samples. Hence, to facilitate the clinical use of mutational signatures, specialized analytical methods and a comprehensive catalogue of mutational signatures for targeted sequenced tumors are needed.
Here, we introduce SATS (Signature Analyzer for Targeted Sequencing), a mutational signature analysis tool explicitly developed for targeted sequencing data. Unlike existing methods optimized for WES/WGS, SATS accounts for the variable size and genomic context of targeted gene panels and leverages large sample sizes of targeted sequencing studies. In the following sections, we first describe the SATS pipeline and then investigate the factors that affect signature detection and signature burden estimation in targeted sequenced tumors using pseudo-targeted sequencing data. Additionally, we perform simulations to compare the signature detection and signature burden estimation by SATS and other methods across four major cancer types.
Furthermore, we demonstrate that SATS outperforms other methods in accurately attributing mutations to prespecified signatures in samples with matched WGS and targeted sequencing. Applying SATS, we establish a pan-cancer catalogue of mutational signatures in 111,711 targeted sequenced tumors from the AACR (American Association for Cancer Research) Project GENIE (Genomics Evidence Neoplasia Information Exchange, Version 13.0-public)19,20. This database contains tumors collected from 16 hospitals or cancer centers in multi-ethnic populations, representing 23 cancer types, including 14,983 lung and 12,144 breast tumors (Methods). Finally, we show that through integration with clinical data, mutational signatures derived from targeted sequencing can identify the potential tissue of origin for tumors of unknown primary, find signatures enriched in early-onset hypermutated colorectal cancers, and serve as a biomarker for cancer prognosis and immunotherapy response.
Results
SATS detects signatures and estimates signature burdens in targeted sequenced tumors
SATS is developed for targeted sequencing to detect mutational signatures within a patient cohort and refit detected signatures to estimate signature burdens in individual patients. While we use single base substitutions (SBS) for the purpose of illustration, SATS is also adaptable to other types of somatic mutations, such as double base substitutions (DBS).
In analyzing SBS mutational signatures, SATS uses a mutation type matrix V as input that contains the counts of SBS across 96 mutation types within 32 trinucleotide contexts, such as a C to G mutation at the trinucleotide context TCT (i.e., a T[C>G]T mutation type). Additionally, SATS incorporates a panel context matrix L that specifies the number of trinucleotide contexts where a specific mutation type (e.g., TCT for T[C>G]T substitutions) could potentially occur in the targeted genes. SATS is based on a Poisson Nonnegative-Matrix Factorization (pNMF) model (Methods). The pNMF model decomposes the mutation type-by-patient matrix V into a mutation type-by-signature matrix W that describes signature profiles and a signature-by-patient matrix H that quantifies signature activities, while adjusting panel sizes by the panel context matrix L (Fig. 1a).
SATS comprises signature detection and signature refitting steps as outlined in Fig. 1b. In the signature detection phase, SATS first employs signeR11 to discover de novo signatures from a targeted sequencing patient cohort. signeR is based on the pNMF model and adjusts for differences in the sizes of gene panels (Supplementary Note). Next, the de novo signature profiles are mapped to reference signatures from the pan-cancer COSMIC catalogue1 using penalized nonnegative least squares (pNNLS)21. This allows us to identify reference signatures which are present within the de novo signature profiles. Notably, these reference signatures can originate from any cancer type featured in the pan-cancer catalogue, not restricted to the specific cancer type of the patient cohort. During the signature refitting stage, we have developed an Expectation–Maximization (EM) algorithm, refitting the detected reference signatures to estimate signature activities in individual patients. Given that a mutational signature can contribute to a variety of mutations, we further estimate the expected number of mutations attributed to a reference signature, termed the signature burden, for each patient (Methods).
While direct detection of mutational signatures in a single patient is challenging, SATS can effectively estimate signature burdens at an individual level by adapting (“refitting”) signatures previously identified in a large group of patients who have the same type of cancer and have been subjected to targeted sequencing. To enable this feature, we have compiled a pan-cancer catalogue of mutational signatures specific to targeted sequencing by analyzing over 111,000 tumors from the AACR Project GENIE.
SATS identifies signatures of tumor mutation burden
One key advancement of SATS lies in its ability to discern signatures of tumor mutation burden (TMB), circumventing the requirement of identical genomic region sequenced across tumor samples for mutational signature analysis. SATS achieves this through approximating V by L ∘ WH (i.e., V ≈ L ∘ WH), where ∘ denotes element-wise product. SATS allows for the analysis of diverse gene panels, each covering different sequenced genomic regions, by using the panel context matrix L (measured in megabase (Mb) pairs) to account for these differences. The resulting signature profile matrix W thus characterizes TMB signatures. In contrast, conventional mutational signature analysis algorithms implicitly require identical sequenced regions across tumor samples (e.g., through WES or WGS). These algorithms employ the canonical NMF method to factorize a mutation type matrix V into a 96 × K signature profile matrix W′ and a K × N signature activity matrix H′ (i.e.,V ≈ W′H′). Thus, the estimated W′ delineates tumor mutation count (TMC) signatures.
TMB and TMC signature profiles are different, as TMB signature profiles take into account the mutation context, while TMC signature profiles do not (Methods). For example, the TMC SBS5 profile is relatively consistent across all 16 trinucleotide contexts of C to T mutations, whereas the TMB SBS5 profile shows increased C to T mutations at the NCG trinucleotides (N represents any nucleotide, Supplementary Fig. 1a), since these trinucleotides are depleted in the human genome due to frequent deamination of 5-methylcytosine to thymine22,23 (Supplementary Fig. 1b). We compared the shape of TMB and TMC signature profiles by the Shannon equitability index, which measures the evenness of signature profiles across mutation types (Methods). A higher index value corresponds to a flatter signature profile, whereas a lower value suggests a distinct or spikier profile with significant contributions from certain mutation types as “spikes.” The Shannon equitability indices of TMB and TMC signature profiles are highly correlated (Pearson correlation coefficient r = 0.915, Supplementary Fig. 1c), but there are a few exceptions. For example, TMB signature SBS10b and SBS15 (Shannon equitability index = 0.192 and 0.391 respectively) exhibit more pronounced spikes than their TMC counterparts (Shannon equitability index = 0.491 and 0.624 respectively, Supplementary Fig. 1d).
The advantage of TMB signature profiles is their robustness against variations in mutation contexts between targeted sequencing and WGS. For example, targeted sequences often have different proportions of certain trinucleotides (e.g., higher NCG, lower NTA and NTT) compared to whole genomes (Supplementary Fig. 2). As a result, TMC signature profiles obtained from targeted sequencing differ from those obtained from WGS, while TMB signature profiles remain consistent. Additionally, TMB signatures allow for mutational signature analysis across different targeted gene panels by normalizing the numbers of mutation contexts, making them particularly useful for studies involving multiple targeted gene panels.
Factors impacting signature detection and signature burden estimation by SATS
We investigated factors that may affect the performance of SATS in detecting signatures and estimating signature burdens. These factors include the size of the targeted gene panels, the prevalence of the signatures, the shape of the TMB signature profile (measured by the Shannon equitability index), and the cancer types. We generated pseudo-targeted sequencing data SBSs that were called in The Cancer Genome Atlas (TCGA) WES studies1,24 and located in the regions covered by various targeted sequencing panels (Methods). In addition, we generated pseudo-targeted sequencing data based on 560 breast tumors25 with WGS data (Methods and Supplementary Fig. 3a). Our analysis focused on common signatures contributing more than 5% of SBSs based on WES or WGS analyses for a given cancer type. While the sample size could also impact signature detection, we could not assess it using limited pseudo-targeted sequencing data. Thus, we conducted in silico simulations to examine the impact of sample size.
First, we analyzed pseudo-targeted sequencing data based on WES and showed that SATS is capable of detecting common signatures, though detection probabilities vary across cancer types, targeted gene panels, and the specific signatures. For instance, larger gene panels generally uncovered more signatures (Fig. 2a). Additionally, there was an inverse relationship (r = -0.452) between the detection probability of a signature and its Shannon equitability index (Fig. 2b). This suggests that “spikier” signatures are more likely to be detected than “flatter” ones, consistent with observations based on WES/WGS data9. To simultaneously quantify the impact of these factors on the detection probabilities, we applied a generalized linear mixed model (GLMM) that included cancer type, panel size, signature prevalence, and the Shannon equitability index of the signature (Methods). Our results revealed that cancer types accounted for 53.26% of the variance of detection probabilities at the logit scale (Fig. 2c). This is because certain cancer types possess more distinctive and easily distinguishable signatures. Additionally, cancer types with high TMB (e.g., lung squamous cell carcinoma, median TMB: 10.07 mutations/Mb) had a higher probability of signature detection compared to those with lower TMB (e.g., thyroid adenocarcinoma, median TMB: 0.47 mutations/Mb). Within a specific cancer type, GLMM indicated that spikier (odds ratio (OR) = 0.962, 95% confidence interval (CI) = 0.956-0.967 for a 0.01 increase of the Shannon equitability index, Fig. 2d) and more prevalent (OR = 1.12, 95% CI = 1.14-1.27 for a one percent increase of signature prevalence) signatures are more detectable, especially when using large gene panels (OR = 1.21, 95% CI = 1.14-1.27 for 1Mb increase in gene panel size). This conclusion was also evident in our analysis of pseudo-targeted sequencing data based on WGS (Supplementary Fig. 3b). This finding helps explain the challenge in detecting signatures in thyroid adenocarcinoma, where spikier signatures like SBS1 are less common (the prevalence 6.58%), while the most common signature, SBS5 (28.26%), is flat and may be confused with other flat signatures like SBS3 or SBS40.
Next, we evaluated the signature refitting steps of SATS for estimating signature burdens - the number of mutations attributed to each signature. We compared the signature burdens calculated using WES1 with these estimated using SATS from pseudo-targeted sequencing data of the same tumors (as an example, see Supplementary Fig. 4a for SBS4 in lung cancer based on the MSK-IMPACT468 panel). A strong correlation between the two would indicate that targeted sequencing panels are a feasible alternative to WES for assessing signature burdens. We found that the median Pearson correlation coefficient was 0.7 for panels with sizes greater than 1Mb (Fig. 2e), with higher correlation coefficients for certain signatures, such as SBS4 in lung adenocarcinoma (r = 0.91) and SBS7a and SBS7b in melanoma (r = 0.98 and 0.95 respectively, Supplementary Fig. 4b and 4c). Similar results were also observed in pseudo-targeted sequencing data based on WGS (Supplementary Fig. 3c).
To explore the impact of sample size on mutational signature detection, we conducted in silico simulations with varying sample sizes, using breast cancer as a case study (consisting of 12 mutational signatures with at least 1% prevalence in the TCGA breast cancer study, Supplementary Fig. 5a). We simulated the mutation burden of 96 SBS mutation types for up to 1 million tumors across 21 different targeted sequencing panels, each with the size larger than 1Mb (Methods). In this way, we could use the “truly” mutational signatures present in in silico simulations as benchmarks. Our findings indicated that with a large number of targeted sequenced tumors, SATS can detect almost all common mutational signatures (>5% prevalence) in breast cancer, including the HRD-associated signature SBS3, while maintaining a low false positive rate (Supplementary Fig. 5b, detailed in the Supplementary Note). This study underscores the importance of sample size in the extraction of mutational signatures from targeted sequenced tumors, as detecting certain signatures might necessitate a larger cohort of samples.
Evaluation of SATS and other methods by in silico simulations
To assess the performance of SATS in mutational signature detection and burden estimation, we conducted in silico simulations, comparing it with other methods. Using signature profiles and distributions of signature activities from the AACR Project GENIE, we simulated mutation type matrices for lung, breast, colorectal, and lymphoid-derived hematologic cancers (Methods).
First, we ran SATS, SigProfilerExtractor12 and Mix18 to detect signatures and compared the detected signatures to the ‘prespecified signatures’ used in the simulations. We observed that SATS could accurately detect most prespecified signatures, except for few flat or rare signatures. For lung and breast cancers, SATS identified all nine prespecified signatures in every replicate (Fig. 3a). In colorectal cancer, SATS detected five out of six prespecified signatures in all replicates, with SBS44 being more elusive. SBS44, as the second flattest signature in colorectal cancer, is challenging to distinguish from the common flat signature SBS5. In lymphoid-derived hematologic cancer, all prespecified signatures were frequently detected. The only false positive signatures in four cancer types were SBS10c and SBS92 in colorectal cancer for one replicate. In contrast, SigProfilerExtractor and Mix failed to detect signatures SBS29 and SBS89 in lung cancer (Fig. 3a). SigProfilerExtractor did not identify SBS44 in colorectal cancer, and Mix missed the majority of signatures in lymphoid-derived hematologic cancer. Besides false negative detections, SigProfilerExtractor incorrectly identified SBS24 in lung cancer and SBS7b in lymphoid-derived hematologic cancer. These results suggest that SATS outperforms SigProfilerExtractor and Mix in signature detection for targeted sequenced tumors, effectively identifying most prespecified signatures with minimal false positives.
Next, we applied SATS, SigProfilerAssignment14 and Mix18 to estimate signature burdens, comparing these estimates with the simulated “ground truth”. SATS showed high accuracy in estimating burdens for common or spiky signatures, such as SBS2/13 in breast cancer (r = 0.96, Fig. 3b), SBS4 in lung cancer (r = 0.86) and SBS10a and SBS10b in colorectal cancer (r = 0.99 for both). However, the correlation was lower for flatter or rarer signatures, such as SBS89 in breast cancer (r = 0.55) and SBS6 in colorectal cancer (r = 0.74). The correlations for signature burdens estimated by SigProfilerAssignment and Mix were generally lower than those by SATS, particularly for signatures like SBS6 in colorectal cancer and SBS84/SBS87 in lymphoid-derived hematologic cancer (Fig. 3b). Overall, these results suggest that SATS can more accurately estimate signature burdens for the majority of signatures than other methods.
Finally, we investigated the impact of including irrelevant signatures in signature refitting. Simulating breast cancer targeted sequencing data using signatures SBS1, SBS2/13, and SBS5, we performed signature refitting using 12 signatures (including the three true signatures) from TCGA WES breast cancer study. We found that a considerable proportion of mutations were incorrectly attributed to non-existent signatures in the simulated data (Supplementary Fig. 6). This result emphasizes the importance of choosing an appropriate set of refitted signatures, tailored for targeted sequencing, to enhance the accuracy of signature refitting in targeted sequencing studies.
Evaluation of SATS and other methods in tumors with both WGS and targeted sequencing
We further assessed SATS using 72 kidney tumors that had undergone both whole genome sequencing and targeted sequencing26. Due to the sample size constraints, we focused on refitting the common signatures to estimate their burdens in the targeted sequencing data using SATS, SigProfilerAssignment14, Mix18 and deconstructSigs13. The common signatures included SBS1, SBS5, and SBS40, all of which had been identified in kidney tumors from the AACR Project GENIE, as well as through WGS-based mutational signature analyses of 72 kidney tumors26.
Our analysis showed a consistent correlation between the estimated burdens of flat signatures (SBS5/40) from targeted sequencing and those from WGS across all methods. Specifically, we observed that samples with a higher burden of mutations attributed to flat signatures in targeted sequencing also had a higher burden of these signatures in WGS (Fig. 3c right panel; Wilcoxon P-value = 2.48 x 10-3, 2.35 x 10-7, 2.35 x 10-7, 1.79 x 10-8 for SATS, SigProfilerAssignment, Mix and deconstructSigs, respectively). However, for the less frequent signature SBS1, the association of signature burdens was observed only in SATS (Fig. 3c left panel; Wilcoxon P-value = 3.62 x 10-7). SigProfilerAssignment and deconstructSigs assigned zero burdens to SBS1 in all samples but one for SigProfilerAssignment, while Mix categorized all samples as one cluster with minimal SBS1 signature burdens. This indicates the limitations of other refitting methods in accurately estimating burdens for less common signatures like SBS1 in targeted sequencing data.
The pan-cancer repertoire of targeted sequencing-based mutational signatures
We used SATS to create a pan-cancer repertoire of SBS signatures based on the targeted sequenced tumors in the AACR Project GENIE, which could be adopted in the clinic to refit signatures in a single tumor. Our analysis revealed the ubiquitous presence of SBS1, caused by deamination of 5-methylcytosine to thymine, and flat signatures (SBS3, SBS5 and SBS40 combined, given that the current sample size of a cancer type is insufficient to separate them accurately) across all cancer types (as shown in Fig. 4a bottom panel). In contrast, other SBS signatures are present in particular cancer types. For example, SBS2/13, associated with APOBEC cytosine deaminases, was found in eight cancer types. DNA repair deficiency-related signatures such as SBS6/14/15/44, indicative of mismatch repair (MMR) deficiency, were observed in nine cancer types. Similarly, SBS10a/b/c signatures, resulting from polymerase epsilon (POLE) exonuclease domain mutations, appeared in eight cancer types.
Besides above signatures associated with endogenous mutational processes, we also detected signatures associated with environmental exposures, such as smoking (e.g., SBS4/29 in lung cancer) and UV radiation (e.g., SBS7a/b in head and neck cancer, skin cancer/melanoma or soft tissue cancer). Additionally, treatment-related signatures were identified as well, such as SBS11 caused by temozolomide in glioma27, and thiopurine chemotherapy treatment-induced signature SBS87 in endometrial cancer28, head and neck cancer29 and lymphoid-derived hematologic cancer30.
We calculated the signature burdens and evaluated the prevalence of mutational signatures for individual cancer types (Methods). A significant share of the mutations in many cancer types was found to be attributable to flat signatures (SBS3/5/40), with some notable exceptions (Fig. 4a top panel). For example, skin cancer or melanoma is primarily characterized by UV-induced signatures, while endometrial cancer is dominated by signatures related to DNA mismatch and replication repair deficiency. Additionally, APOBEC-induced signatures are most frequent in bladder cancer, whereas smoking- and APOBEC-induced signatures dominate in lung cancer.
In addition to SBS signatures, we generated a pan-cancer repertoire of DBS mutational signatures for targeted sequenced tumors (Fig. 4b bottom panel). We found seven DBS signatures, which exhibited a low mutation burden (less than one mutation per megabase, Fig. 4b top panel). We observed that the DBS1 signature, associated with UV exposure, is present in head and neck cancer, skin cancer or melanoma, soft tissue cancer, and cancers of unknown primary, consistent with the presence of UV exposure SBS signatures in these cancer types. Furthermore, the DBS2 signature, associated with smoking, was identified in bladder and lung cancer. We also observed DNA repair deficiency-associated signatures DBS3 in non-colorectal bowel cancer.
Clinical applications of targeted sequencing-based mutational signatures
We provide here examples of utilizing targeted sequencing-based mutational signatures generated by SATS to address important clinical questions.
Identification of tissue of origin for tumors of unknown primary
The distinct presence and prevalence of targeted sequencing-based mutational signatures across cancer types suggest their potential as indicators of tissue of origin, particularly for tumors of unknown primary. Notably, when examining the clustering of tumors sequenced at the Memorial Sloan Kettering Cancer Center (Fig. 5, Methods), we observed that most tumors grouped according to their cancer types. For instance, lung tumors formed a distinct cluster, separate from other cancer types, while tumors with UV signatures, like head and neck cancer, skin cancer/melanoma, and soft tissue cancer, constituted another distinct cluster, in which tumors of head and neck cancer and soft tissue cancer grouped together. In contrast, glioma formed unique clusters, as did pancreatic cancer. These clustering patterns were consistently observed in tumors sequenced at the Dana-Farber Cancer Institute (Supplementary Fig. 7). Interestingly, tumors of unknown primary clustered alongside lung tumors, tumors with UV signatures, glioma, pancreatic tumors, and others (Fig. 5), suggesting their potential tissues of origin.
Targeted sequencing-based mutational signatures enriched in early-onset hypermutated colorectal cancer
A previous study showed that early-onset non-hypermutated colorectal cancers exhibited a lower overall TMB than late-onset cases31. However, the opposite trend was observed in hypermutated cases. To examine this discrepancy, we analyzed TMBs attributed to specific mutational signatures in non-hypermutated and hypermutated colorectal cancers (Methods). First, we confirmed the previous findings of the contrasting association between overall TMB and early-onset status in non-hypermutated and hypermutated colorectal cancers. Next, we found that in non-hypermutated colorectal cancers, TMBs attributed to most signatures (SBS1/5/6/10a/10b) were inversely associated with early-onset status (Fig. 6a), leading to an inverse association between the overall TMB and early-onset status (OR = 0.898, 95% CI = 0.877-0.920). In contrast, among hypermutated colorectal cancers, the positive association of overall TMB and early-onset status (OR = 1.004, 95% CI = 1.001-1.007) was primarily driven by the TMB attributed to the deficient DNA mismatch repair signature (SBS44, OR = 1.014, 95% CI = 1.001-1.027, Fig. 6a) and the signature related to deficient replication repair gene POLE (SBS10a, OR = 1.017, 95% CI = 1.000-1.035) but attenuated by the clock-like signature SBS5 (OR = 0.982, 95% CI = 0.963-1.001). These results reveal distinct mutational processes that are characteristic of early-onset hypermutated colorectal cancer.
Targeted sequencing-based mutational signatures associated with cancer prognosis
To investigate the prognostic value of mutational signatures derived from targeted sequencing, we conducted an analysis associating the presence of these signatures with overall survival across different cancer types. The analyses were adjusted for age, race, sex (where applicable), metastatic status, cancer subtype, and hospital site (Methods). Consistent with previous studies based on health records or WES32-34, we observed that targeted sequencing-based smoking signatures SBS4/29 and APOBEC signatures SBS2/13 were associated with poorer prognosis in lung cancer (SBS4/29: Hazard Ratio (HR) = 1.1, 95% CI = 1.05-1.17; SBS2/13, HR = 1.09, 95% CI = 1.03-1.14). In contrast, UV exposure signature SBS7a/b showed a favorable prognosis (SBS7a: HR = 0.71, 95% CI = 0.62-0.80; SBS7b: HR = 0.81, 95% CI = 0.72-0.92) in skin cancer or melanoma, consistent with previous WES studies that patients with signature SBS7 exhibited better survival outcomes35. Furthermore, we found that POLE deficiency-related signature SBS10a was associated with a better prognosis in colorectal cancer36 (HR = 0.83, 95% CI = 0.77-0.90). Similarly, MMR deficiency-related signatures led to favorable prognosis in esophageal and gastric cancers (SBS6: HR = 0.84, 95% CI = 0.76-0.92) and in colorectal cancer (SBS44: HR = 0.79, 95% CI = 0.74-0.85). These results align with multiple studies which reported improved prognosis in MMR-deficiency (measured by microsatellite markers or MMR immunohistochemistry) colorectal cancer37 or gastric cancer38,39 patients when treated with surgery alone or surgery plus chemotherapy. Thus, our findings confirm the prognostic value of mutational signatures derived from targeted sequencing.
Racial heterogeneity on the association of MMR deficiency and colorectal cancer prognosis
Leveraging the large sample size across different racial/ethnic groups from targeted sequencing data, we investigated the association between MMR deficiency signature SBS 44 and colorectal cancer prognosis across different racial groups. Analyzing 9,562 colorectal cancer cases from eight hospitals, we found significant racial heterogeneity (likelihood ratio test (LRT) P-value = 1.22 x 10-5 for the model, including the interaction term of SBS44 and race compared to the model without the interaction term). Specifically, we observed a highly favorable prognosis associated with SBS44 among white patients (HR = 0.78, 95% CI = 0.72-0.85, Fig. 6b). The effect size was attenuated among Black patients (HR = 0.97, 95% CI = 0.76-1.24), and was inverted among Asian patients (HR = 1.15, 95% CI = 0.84-1.58, Fig. 6b). This racial heterogeneity persisted when we restricted the analysis to 4,786 colorectal cancer cases from the Memorial Sloan Kettering Cancer Center (LRT P-value = 1.69 x 10-4) or 2,892 colorectal cancer cases from the Dana-Farber Cancer Institute (LRT P-value = 2.59 x 10-3). These results warrant further investigation into the potential racial heterogeneity of MMR deficiency in colorectal cancer prognosis.
Targeted sequencing-based mutational signatures associated with immunotherapy response
Using treatment regimens and response records from AACR Project GENIE, we examined the association between TMB attributed to specific mutational signatures and progression-free survival (PFS) in 470 non-small cell lung cancer patients who received immune checkpoint inhibitors. To account for potential confounding factors, we adjusted our analysis for smoking history, stage, age, race, sex, cancer subtype, and hospital site. We revealed a significant association between TMB attributed to SBS1 and poor PFS (HR = 1.31, 95% CI = 1.10-1.56, Fig. 6c). In contrast, TMB attributed to the smoking signatures SBS4/29 exhibited a significant association with favorable PFS (HR = 0.94, 95% CI = 0.90-0.98, Fig. 6c). TMB attributed to other signatures (SBS2/13, SBS89 and SBS5/40) did not show a significant association with PFS. Notably, when considering the overall TMB combining all signatures, only a nominal association with favorable PFS was observed (HR = 0.98, 95% CI = 0.95-1.00, Supplementary Fig. 8a). Consistent results were also observed when analyzing overall survival instead of PFS (Supplementary Fig. 8b). These results suggest that signature-specific TMB, particularly SBS1 and SBS4/29, could be used as more effective biomarkers for predicting the response to immune checkpoint inhibitors in non-small cell lung cancer, compared to the overall TMB.
Discussion
In this study, we have introduced SATS, a new tool to identify mutational signatures and estimate signature burdens in targeted sequenced tumors. We evaluated SATS using pseudo-targeted sequencing data and found that spiky signature profiles, a high signature prevalence, and large sequencing panels (> 1Mb) increase the accuracy of signature detection and refitting. Moreover, we showed that SATS outperformed other methods through analyzing in silico simulated data and samples with both WGS and targeted sequencing. We utilized SATS to analyze 111,711 targeted sequenced tumors in the AACR Project GENIE and developed a pan-cancer catalogue of SBS and DBS signatures, specifically tailored for targeted sequencing tumors. Using this repertoire, SATS can estimate signature burden in a single sample, making it a useful tool in the clinic. Finally, we showed examples of using targeted sequencing-based signatures to address clinical questions. In particular, we found that TMBs attributed to DNA repair deficiency signatures could disentangle the effect of TMB on hypermutated early-vs. late-onset colorectal cancer. In addition, we identified mutational signatures associated with cancer prognosis in multiple cancer types and with immunotherapy response in non-small cell lung cancer.
Our study has made several important contributions to the analysis of mutational signatures in targeted sequenced tumors. First, unique to SATS is the incorporation of panel size in the analysis, enabling the identification of TMB signatures that are not restricted to a particular type of gene panel. In contrast, other mutation signature tools commonly assume the same targeted sequencing panel across samples. This adaptability ensures SATS’s applicability over a wide range of targeted gene panels, enhancing its utility in clinical settings. Second, unlike clustering-based methods15,17 that aim to detect a specific mutational signature, SATS can identify multiple mutational signatures simultaneously, providing a more comprehensive analysis of the mutational landscape of target-sequenced tumors. Third, to the best of our knowledge, this study represents the largest and most comprehensive pan-cancer mutational signature analysis for targeted sequenced tumors to date. We have analyzed 23 cancer types, including those less represented in previous studies, such as cancers of unknown primary origin. Moreover, our analysis includes 757 cancer subtypes, which is more extensive than previous studies. For instance, while the TCGA ovarian cancer study40 focuses on high-grade serous ovarian cancer, our analysis involves 22 ovarian cancer subtypes, encompassing high-grade serous ovarian cancer, clear cell ovarian cancer, low-grade serous ovarian cancer, and endometrioid ovarian cancer, among others. Finally, the established repertoire of mutational signatures is derived from a diverse collection of targeted sequenced tumors at various hospitals and cancer centers, ensuring its relevance to clinical settings. This contrasts with repertoires based on WES/WGS that are often oriented more towards research applications. Users of SATS can estimate signature burdens using the repertoire from this study as the reference, facilitating the clinical utility of mutational signature analysis in individual targeted sequenced tumors.
This study has several limitations that should be considered. First, the data were collected from clinics primarily in the United States and Western Europe as part of the AACR Project GENIE, which may limit the representativeness of our findings for targeted sequenced tumors from other geographic regions. Second, while the identified repertoire of mutational signatures for targeted sequencing tumors is extensive, it may not encompass the entire spectrum of possible signatures. The current repertoire mainly includes common signatures with spiky profiles, such as signatures related to hypermutation, that are easy to detect with the current sample size per cancer type in the AACR Project GENIE. This selection bias towards easily detectable signatures could mean that less common or subtler signatures might be underrepresented or missed. Additionally, we observed a higher prevalence of hypermutated signatures in our study compared to previous studies using WGS or WES. This discrepancy could be attributed to the nature of targeted sequencing, which may not capture mutations in non-hypermutated tumors as effectively as WGS or WES. Consequently, if no mutations are detected in these non-hypermutated tumors, they could be excluded from our analysis, potentially leading to an overestimation of the prevalence of hypermutated signatures. Lastly, the current sample sizes for each cancer type within our study are not large enough to effectively differentiate between flat mutational signatures SBS3, SBS5, and SBS40.
To overcome these limitations, it is crucial to increase the number of tumors sequenced by targeted gene panels and to share the resulting data. The decreasing costs and increasing accessibility of targeted sequencing in clinical practice make the expansion of sample sizes a feasible goal. Initiatives like the AACR Project GENIE are already taking steps towards this goal by collecting and sharing more targeted sequencing data and inviting new participants from underrepresented and underserved populations. Our analysis suggests that as the number of targeted sequenced tumors increases, SATS would provide enhanced clinical utility as it can detect additional common signatures with very low false discovery rates. Moreover, with a larger sample pool, SATS could potentially differentiate the HR deficiency-associated signature SBS3 from other flat signatures.
In summary, we have developed a tool for analyzing mutational signatures in targeted sequenced tumors, and created a pan-cancer repertoire of mutational signatures tailored for targeted sequencing. Our study has highlighted the clinical relevance of these targeted sequencing-based signatures. The SATS R package is publicly available on GitHub. We anticipate that SATS and the repertoire will enhance clinical applications of mutational signature analysis using targeted sequence data.
Methods
Genomic data of AACR Project GENIE
We retrieved the AACR Project GENIE dataset (version: 13.0-public) from Synapse (https://synapse.org/genie). This dataset includes 111,711 tumors that were collected as part of routine clinical practice at 16 hospitals or cancer centers and sequenced by targeted sequencing using different gene panels with sizes larger than 50Kb. The patients provided their consent, and the study was approved by an institutional review board (IRB). The dataset contains samples from diverse ethnic backgrounds, including 5,973 Asians (5.3%); 5,545 Blacks (5.0%); 78,003 Whites (69.8%); 5,311 individuals from other racial groups (4.8%); and 16,879 individuals with unknown race (15.1%). It also includes self-reported sex: 58,576 females, 47,694 males, 2 other, and 5,439 of unknown sex. The dataset covers 102 cancer types with 757 subtypes defined by OncoTree41 (please refer to the Supplementary Note for more information on cancer types in the AACR Project GENIE). To facilitate our analysis, we grouped 102 OncoTree cancer types into 23 analysis cancer types.
The tumors were sequenced at CLIA-/ISO-certified labs with high read depth (median: 519X reads, 1st quantile: 307X, 3rd quantile: 808X). Somatic mutations were called at participating centers with various tools, including Mutect2 (ref42) and Strelka43. Germline variants and artifacts were filtered out using pooled external controls and databases of known germline variants, such as the Genome Aggregation Database (gnomAD)44. For more information on the filtering process, please refer to the “AACR GENIE 13.0-public Data Guide” (https://www.aacr.org/wp-content/uploads/2023/03/13.0_data_guide-1.pdf). The dataset includes 1,213,674 single base substitutions (SBS) and 18,519 double base substitutions (DBS). We further removed somatic mutations with a read depth of less than 100 or an alternative allele read count of less than 5. This resulted in 982,095 SBS and 15,149 DBS from 111,711 tumors (Supplementary Table 1) for mutational signature analysis.
It’s worth noting that the choice of mutation calling pipeline may impact the signature analysis results. The influence on signature detection is likely to be less pronounced than on signature burden estimation. This is because signature detection is based on aggregated data from a group of patients, which tends to mitigate the variations caused by different mutation calling approaches. We recommend for users of SATS to review mutation calling pipelines used by the contributing institutions of the AACR Project GENIE (described in the AACR GENIE 13.0-public Data Guide) and to follow best practices45,46 when applying mutation callers and specifying filtering thresholds.
A Poisson NMF model for signature analysis of tumor mutation burden
We define a Poisson Non-Negative Matrix Factorization (pNMF) model for SATS. The pNMF model assumes that the SBS count υpn for the pth mutation type in the nth targeted sequenced tumor follows a Poisson distribution with mean for K signatures, p = 1,2, …,96, n = 1,2, …, N and K = 1,2, …, K. The υpn, 𝓁pn, wpk and hkn represent elements of the corresponding matrices V (dimension N by 96), L (dimension N by 96), W (dimension N by K) and H (dimension K by 96), respectively. This model specification is equivalent to the one used in signeR11.
The W and H are the parameters of interest that will be estimated based on the log-likelihood function of the pNMF model: When the genomic regions sequenced across all samples are identical, as in WES or WGS, the maximum likelihood estimate based on equation (1) is equivalent to that based on the canonical NMF, which is detailed below. Thus, the proposed pNMF model includes the canonical NMF as a special case.
pNMF model extends the canonical NMF
The pNMF model proposed here incorporates the panel context matrix L, extending the canonical NMF, the standard method for identifying mutational signatures in tumors sequenced by WES or WGS. When all samples are sequenced using the same genomic regions (i.e., 𝓁pn = 𝓁p), the log-likelihood function in equation (1) is simplified as where C is a constant irrelevant to W and H. It is worth noting that is equivalent to the objective function of the canonical NMF47 with w, = 𝓁pwpk.
Extraction of de novo TMB signatures
We utilize signeR11 to extract de novo signatures based on the pNMF model. However, because signeR is computationally demanding due to its use of the Markov Chain Monte Carlo (MCMC) method11, we grouped samples to improve computational efficiency. Our results below reveal that grouping samples does not affect the TMB signatures profile (wpk). Specifically, we define C as the sample index set {1,2, …, N}, and Cm as the mutually exclusive set such that . For , the sum of the mutation count for the targeted sequencing tumors with index n belonging to the set Cm, we can show that: where and . Notably, the TMB signature profile wpk remains unchanged. The panel size of combined samples is the sum of the panel size of individual samples 𝓁pn, and signature activity is the weighted sum of the signature activities of individual samples hkn. The mutation count of combined samples follows a Poisson distribution, as the sum of independent Poisson counts is still Poisson distributed.
Grouping samples can significantly reduce computation time. For example, when analyzing 10,000 samples using SATS, analysis by grouping 100 tumors can be completed in 28.5 minutes on a laptop with a Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz processor and 16 GB of 4267 MHz RAM. In contrast, analyzing the same samples without grouping tumors takes approximately 13 hours.
Mapping de novo TMB signatures to COSMIC reference TMB signatures
Due to the limited number of somatic mutations detected by targeted gene panels, the detected de novo TMB signature profiles may be a linear combination of COSMIC reference TMB signature profiles. To address this limitation, we map the de novo signature profile matrix to reference TMB signatures W0 (e.g., a 96 × 76 COSMIC TMB signature profile matrix for 76 reference SBS TMB signatures), using penalized non-negative least squares21: where β is a coefficient vector and the tuning parameter λ is selected based on cross-validations. Compared with the non-negative least squares, the penalized non-negative least squares allow us to shrink small values of β towards zero and select a smaller number of reference signatures with the profile matrix W* that have a significant contribution to the de novo signature profiles. To reduce the randomness caused by the cross-validation step to select λ, we repeat this process 100 times, and select only reference TMB signatures with a coefficient β greater than 0.1 in more than 80 replicates.
Estimation of signature activities by an expectation-maximization algorithm
We propose an expectation-maximization (EM) algorithm to estimate the signature activity matrix H = [hkn], given the mutation type matrix V = [υpn], the panel context matrix L = [𝓁pn], and the mapped reference TMB signature profiles . The element υpn in V can be expressed as the sum of independent latent counts υ1pn, υ2pn, ⋯, υKpn attributed to K signatures. These latent counts are treated as the missing data, following Poisson distributions with expectations , respectively. Introducing latent counts allows us to compute the complete data log-likelihood as: In addition, the conditional distribution of υKpn given V, L, W* and Ht (the H at the t’th iteration of the EM algorithm) follows a multinomial distribution with parameters υpn and .
In the E-step, we compute Q(H|Ht) as the expected complete data log-likelihood: In the M-step, the maximizer of Q(H|Ht) is obtained by setting the derivative with respect to hkn to 0, and the updated activity value is given by: Note that the M-step depends on the current value of activities for the nth tumor only. Therefore, even though the EM algorithm updates the entire activity matrix H for all samples simultaneously, it is equivalent to updating the activity of one tumor at a time. In other words, the EM algorithm of SATS for signature refitting estimates signature activities independently of other tumors, enabling signature activities to be estimated accurately for a single tumor or small subset of samples.
To complete the EM algorithm, the E-step and the M-step are iterated until convergence and output the estimated activity matrix .
Calculation of signature burdens
To calculate the expected number of mutations attributed to a signature (referred to as the signature burden) E = [Ekn], we use the estimated activity matrix from the EM algorithm. The signature burden Ekn of the kth signature in the nth tumor is then calculated as the sum of the product of the panel size 𝓁pn, the reference signature profile and the estimated signature activity across 96 SBS types as .
Relationship between signatures of TMB and TMC
The DKL(V|W′, H) in equation (2) with highlights the relationship between signatures of TMB and TMC: TMB signature profile wpk normalizes TMC signature profile , by the number of mutation context . This means that we can create a catalogue of TMB signatures based on WGS, dividing the catalogue of TMC signatures (e.g., COSMIC WGS reference TMC signatures1) by the number of trinucleotide contexts from which the mutation type could occur in the human reference whole genome.
Create a catalogue of reference TMB signatures
To create a catalogue of reference TMB signatures in Supplementary Table 2, we normalize COSMIC signature profiles of TMC by the size of mutation contexts in the whole genome. This is done by following these steps:
Download the COSMIC SBS and DBS signature profiles (version 3.2) from https://cancer.sanger.ac.uk/signatures/
For an SBS signature profile, divide the level of each mutation type (e.g., A[C > G]G) by the size of the corresponding mutation context (e.g., ACG for A[C > G]G) that can occur in the whole genome (Supplementary Table 3)
Rescale 96 mutation types to sum to one.
Similarly, create DBS signature profiles of TMB (Supplementary Table 4) based on COSMIC DBS signature profiles of TMC and the number of genomic contexts (Supplementary Table 5) for which DBS can occur.
Shannon equitability index of TMB mutational signatures
We use Shannon equitability index to measure the diversity or “flatness” of a signature profile48,49. The index is calculated as where wp is the level of signature profile at the pth mutation type and the sum across all mutation types (i=1 to 96) is equal to 1.
A higher value of the index indicates a more even distribution of mutation types. The index ranges from 0 to 1, with a value of 1 indicating a completely flat signature profile where all mutation types are represented equally and a value of 0 indicating a signature profile with a single dominant “spike” where a single mutation type has a proportion of 1 and all other mutation types have a proportion of 0. Among SBS TMB signatures with n = 96, some profiles are characterized by a few specific mutation types at high levels, referred to as “spikes” (e.g., SBS1 with the C>T substitution at the NCG trinucleotide has a Shannon equitability index of 0.317, and SBS10a with the T[C>A]T substitution has a Shannon equitability index of 0.192). Other signature profiles are more evenly distributed across all mutation types, referred to as “flat” (e.g., SBS3 has a Shannon equitability index of 0.974, SBS5 has a Shannon equitability index of 0.903, and SBS40 has a Shannon equitability index of 0.969).
Generation of pseudo-targeted sequencing data
To investigate the impact of various factors on mutational signature detection, we created pseudo-targeted sequencing datasets using the TCGA WES studies1,24 and the Sanger breast cancer (BRCA) 560 WGS study25. We assume that targeted sequencing would identify SBSs that are identified in the WES or WGS studies, as long as SBSs are located within the targeted genomic regions of the panels. This assumption is reasonable since targeted sequencing typically provides much higher sequencing coverage than WES or WGS. The steps to generate the simulated data are outlined below:
Download the TCGA WES data (mc3.v0.2.8.PUBLIC) from the Cancer Genome Data Portal (https://gdc.cancer.gov/about-data/publications/mc3-2017) and WGS data of Sanger BRCA560 study (Caveman_560_20Nov14_clean) from ftp://ftp.sanger.ac.uk/pub/cancer/Nik-ZainalEtAl-560BreastGenomes.
Download the genomic information file of AACR Project GENIE (https://www.synapse.org/#!Synapse:syn26706790) which specifies the chromosome, start position, and end position of genomic regions for each targeted sequencing panel.
For SBS in WES or WGS studies, select those located in the genomic regions of a targeted sequencing panel to create the SBS mutation type matrix as pseudo-targeted sequencing data. We generated 648 pseudo-targeted sequencing datasets, encompassing 18 TCGA WES cancer types and 36 targeted sequencing panels (panel size: 0.05 Mb to 9.95 Mb). Each TCGA WES cancer type has at least 200 samples, ensuring a sufficient sample size for evaluating the signature detection step of SATS. Note that certain WES cases lacked SBS within the genes covered by a targeted sequencing panel. Hence, the number of cases in the pseudo-targeted sequencing data is less than that of TCGA WES studies (Supplementary Table 8). For instance, out of the 208 TCGA sarcoma WES cases, each panel could only detect SBS in a subset of cases (e.g., 169 cases for the DFCI-ONCOPANEL-3 panel and 172 cases for the MSK-IMPACT505 panel). Similarly, we generated 36 pseudo-targeted sequencing datasets based on 560 breast tumors with WGS data, using 36 targeted gene panels.
Analysis of pseudo-targeted sequencing data
We calculate the signature detection probability, which represents the percentage of common signatures detected by 36 targeted sequencing panels within a cancer type. Next, we employ a generalized linear mixed model (GLMM) to analyze the factors that influence the detection probability of TMB mutational signatures across 648 pseudo-targeted sequencing datasets (by 18 TCGA cancer types and 36 targeted sequencing panels). The GLMM incorporates several fixed effects, including the flatness of the signature profile (quantified using the Shannon equitability index), the prevalence of the mutational signature in the TCGA WES study (as a percentage of SBS attributed to the signature), and the panel size (per megabase). To account for any variation in the results due to the different cancer types under investigation, we include cancer type as a random intercept in the model.
Evaluation of the impact of sample sizes
We conducted an in silico simulation to investigate the effect of sample size on the ability to detect mutational signatures in breast cancer. The simulation was executed using varying sample sizes, ranging from one thousand to up to one million samples, based on the panel context matrix, signatures profile matrix (consisting of 12 mutational signatures with at least 1% prevalence in the TCGA breast cancer study), and signature activity matrix (following the distributions of the signature activity matrix of the TCGA breast cancer study).
We run signature refitting on the TCGA breast cancer dataset (accessible at https://www.synapse.org/#!Synapse:syn11726618), using12 known mutational signatures (SBS1, 2, 3, 5, 7a, 10a, 10b, 13, 15, 29, 30, 44 and 58) that have a prevalence greater than 1% (based on https://www.synapse.org/#!Synapse:syn11801497). Specifically, we applied the EM algorithm to estimate the signature activity matrix , from the mutation type matrix VB, the panel context matrix LWES of the whole exome sequencing, and the pre-defined TMB signature matrix .
We simulated mutation type matrix for 21 targeted sequencing panels with a panel size larger than 1Mb, using a range of sample sizes from 1000 tumors to 1 million tumors. Specifically, we simulated mutation type matrix from a Poisson distribution with the mean , where LS represents the panel size matrix for a given targeted sequencing panel (S), is sampled from the estimated signature activity matrix . As the activities of APOBEC signatures SBS2 and SBS13 are highly correlated, their activities were jointly sampled. Finally, we excluded any tumors with zero mutation count.
We applied signeR to extract de novo signatures from the simulated mutation type matrix . Then, we employed penalized non-negative least squares to select the mapped reference TMB signatures, . Finally, we estimated signature activities and burdens using the EM algorithm.
To evaluate the ability to detect the pre-specified signatures, we analyzed the proportion of 21 panels that were able to rediscover the prespecified 12 mutational signatures using SATS. We also tracked the probability of detecting false positive signatures that were not used to simulate mutation counts.
Evaluation of SATS by in silico simulations
We conducted in silico simulations to evaluate whether SATS can detect and estimate the prespecified signatures in simulated datasets. When multiple flat signatures were present (e.g., SBS5/40 in lung cancer and SBS3/5/40 in lymphoid-derived hematologic cancer), we combined them into a single flat signature as the sample size of the AACR Project GENIE is insufficient to distinguish between these flat signatures accurately.
We first calculated the expectation matrix , where ∘ denotes element-wise product, Lc a panel size matrix, a signature profile matrix and a signature activity matrix of a cancer type (c). The matrices and were estimated from the AACR Project GENIE, allowing us to generate simulated data that accurately reflects actual observations.
We generated ten replicates of the mutation type matrix for lung cancer, breast cancer, colorectal cancer, and lymphoid-derived hematologic cancer respectively by simulating data from the Poisson distribution using expectation matrix . The number of simulated samples is the same as in the corresponding AACR Project GENIE studies.
We applied signeR and penalized non-negative least squares to estimate TMB signatures for each simulated mutation type matrix . We then compared these estimated signatures with the ground truth signatures .
Using the simulated mutation type matrices , panel size matrices (Lc), and estimated TMB signatures , we estimated the signature activity matrix for all tumors using the EM algorithm. We then calculated the signature burden based on , which is compared with the simulated signature burden as the ground truth. For lung cancer, we also estimated for a subset of samples or even for one sample, as detailed in the Supplementary Note.
Applications of other methods for in silico simulations
We applied SigProfilerExtractor12 on the simulated mutation type matrix to extract signatures. Considering the computational intensity of SigProfilerExtractor, we grouped the simulated samples in . To obtain the mapped COSMIC signatures (version 3.2), we ran SigProfilerExtractor with its default options, setting the number of signatures to be extracted within the range of 1 to 10. After extracting these signatures, we utilized SigProfilerAssignment14 to compute the corresponding signature burdens based on the mapped COSMIC signatures.
Moreover, we applied Mix18 to the same simulated data. Mix clustered samples while simultaneously learning the mixture model; hence we used the original simulated matrix as the input. We specified the number of clusters and signatures ranges from 1 to 20 and 1 to 10, respectively. Bayesian Information Criterions (BICs) were computed to identify the optimal combination of the numbers of clusters and signatures. Under the selected optimal combination, the exposure matrix was calculated. As Mix lacks a mapping step to COSMIC signatures, we annotated Mix signatures as COSMIC signatures with the highest cosine similarity. To obtain the signature burden, the total number of mutations was multiplied by the exposure matrix.
Validation of SATS in tumors with both WGS and targeted sequencing
We analyzed 72 kidney tumors with both WGS and targeted sequencing. For WGS, genomic DNA was extracted from fresh frozen tissue using the QIAmp DNA mini kit (Qiagen) and subsequently sequenced on the Illumina HiSeqX platform. The mean sequencing depth was 65.7x for tumor tissue and 40.1x for normal tissue. For targeted sequencing, genomic DNA was purified using Agencourt AMPure XP Reagent (Beckman Coulter Inc., Brea, CA, USA). A targeted driver gene panel (size 1.90 Mb) was designed, encompassing 254 candidate cancer driver genes50. The targeted sequences were captured by NimbleGen’s SeqCap EZ Choice (custom design; Roche NimbleGen, Inc., Madison, WI, USA). Subsequent targeted sequencing was conducted on an Illumina HiSeq 4000, achieving a mean depth of 500x for both tumor and normal tissue. The details of whole-genome and targeted sequencing and sequencing data preprocessing, alignment and somatic mutation calling were described previously26.
We utilized SATS, SigProfilerAssignment14, Mix18 and DeconstructSigs13 to refit the common signatures, namely SBS1, SBS5, and SBS40, estimating their burdens in the targeted sequencing data. Next, we calculated signature burdens for SBS1 and the combined flat signatures (SBS5/40) in the targeted sequencing data and compared these with the corresponding signature burdens derived from WGS. This comparison allowed us to assess the consistency of estimating mutational signature burdens between the two sequencing methods.
Visualization of SBS mutational signature profiles using UMAP
We applied UMAP51 (Uniform Manifold Approximation and Projection) to reduce the dimensionality of SBS mutational signature profiles and visualized tumors in a two-dimensional scatterplot. To facilitate visualization, tumors within each cancer type are first clustered based on their mutational signature burden profiles using hierarchical clustering with Ward’s minimum variance method. A cut-off value of 0.05 is applied for clustering. The UMAP projections are computed based on the median signature burden for each cluster of tumors, with each dot representing a group of tumors with similar signature burden profiles.
Association between mutational signatures and the risk of early-onset colorectal cancer
To investigate the relationship between mutational signatures and the risk of early-onset colorectal cancer (sequencing age < 50 years), we analyzed the mutational signature profiles of 9,562 colorectal cancer patients along with their clinical data. We stratified the patients into non-hypermutated cases (TMB < 10 mutations/Mb, early-onset: 2,495 cases; late-onset: 6,009 cases) and hypermutated cases (early-onset: 790 cases; late-onset: 268 cases).
We employed a generalized linear mixed model (GLMM) to compare early-onset vs late-onset cases for the non-hypermutated and hypermutated colorectal cancer, respectively. The GLMM incorporated tumor status (primary vs metastasis), race, sex, and individual signatures’ TMBs (SBS1, SBS6, SBS10a, SBS10b, SBS44, and flat SBS) as fixed effects, while center and subtype were considered random effects. The GLMMs were fitted using the ‘lme4’ package in R.
Analysis of mutational signatures as prognostic biomarkers
To assess the prognostic implications of mutational signatures obtained from targeted sequencing, we applied mixed-effect Cox models for each specific cancer type individually. The model is represented as: where t denotes overall survival (OS) time, measured in years from the age of targeted sequencing to the last vital status update (alive or deceased). λ0(t) represents the baseline hazard function. We modeled η as: The mixed-effect Cox model was fitted for one SBS signature at a time, including the presence of an SBS signature, tumor status (primary vs. metastasis), age, race (when applicable), and sex (when relevant) as fixed effects. Additionally, we include center and subtype (if applicable) as random effects with random effect coefficients and , where N(0, σ2) denote a normal distribution with mean zero and variance σ2. The ‘coxme’ package in R was utilized to fit the mixed-effect Cox models. Notably, for breast, ovarian, endometrial, other gynecologic cancers, and prostate cancer, the mixed-effect Cox models do not include the variable of sex.
In order to examine the racial differences in hazard ratios of SBS44 signature in colorectal cancer, we employed mixed-effect Cox models that incorporated interactions between SBS44 signature and race. The likelihood ratio test was utilized to compare models with and without the interaction term. Furthermore, we performed stratification by race groups, fitting separate mixed-effect Cox models for each race group to obtain race-specific hazard ratio estimates.
Analysis of mutational signatures as immunotherapy predictive biomarkers
The AACR Project GENIE Biopharma Collaborative released comprehensive clinical data for 1,846 non-small cell lung cancer (NSCLC) patients (v2.0-public dataset) from four hospitals (MSKCC, DFCI, VICC and UHN). Within this dataset, we focused on a subset of 470 NSCLC patients who underwent treatment with immune checkpoint inhibitor (ICI), including Pembrolizumab, Nivolumab, Atezolizumab, and Durvalumab. We aimed to investigate the predictive value of mutational signature-specific TMB for ICI immunotherapy.
We employed a mixed-effect Cox model to analyze progression-free survival, which measures the time interval between the initiation of ICI treatment and the radiologist or oncologist’s assessment of cancer progression or patient death. In the mixed-effect Cox model, we included variables including age, sex, smoking history, stage (IV vs. others), and individual mutational signature-specific TMBs (SBS1, SBS2/13, SBS4/29, SBS89, and flat SBS) as fixed effects. Additionally, we considered the hospital and subtype as random effects. We also applied the same model to evaluate overall survival, which measures the time interval between the start of ICI treatment and patient death or the last follow-up date.
Software
The R package SATS is publicly available at https://github.com/binzhulab/SATS.
Data Availability
The AACR Project GENIE dataset could be retrieved from Synapse (https://synapse.org/genie)
Supplementary figure legends
Supplementary Fig. 1 Mutation profiles and contexts in the human whole genome. a. Profiles of SBS5 signature based on tumor mutation count (TMC) and tumor mutation burden (TMB), respectively, with 96 mutation types on the x-axis and contributions on the y-axis. b. 32 mutation contexts for single base substitutions (SBS) in the human whole genome, with significantly depleted ACG, TCG, GCG, and CCG trinucleotides. c. Scatterplot of the Shannon equitability index of TMC and TMB signature profiles. The black line represents the diagonal line, the blue line the linear regression line, and the shaded area is 95% confidence intervals. The dots are annotated when the differences of Shannon equitability index between TMC and TMB signature profiles are more than 0.1. d. Profiles of SBS10b and SBS15 signatures based on TMC and TMB, respectively.
Supplementary Fig. 2 Boxplots of mutation context ratios between targeted sequences vs. whole genome sequence. Each dot represents the ratio between the proportion of a mutation context (e.g., CCG) in the genomic regions targeted by a sequencing panel and the proportion of the same mutation context in the whole genome.
Supplementary Fig. 3 Results of pseudo-targeted sequencing data based on Sanger breast cancer 560 WGS study. a. The Shannon equitability index of the signature profile and the percentage of single base substitutions (SBS) attributed by a signature for the Sanger breast cancer (BRCA) 560 WGS study. The signatures with > 5% prevalence are highlighted in color, while others are shown in gray. b. The odds ratio of the Shannon equitability index of the signature profile, frequency of Sanger BRCA 560 WGS study signatures, and panel size. The bars represent 95% confidence intervals (CIs). c. The median of the Pearson correlation coefficient for all common WGS signatures detected. The blue curve represents LOWESS (Locally Weighted Scatterplot Smoothing) curve with the shaded area for 95% CIs.
Supplementary Fig. 4 Results of pseudo-targeted sequencing data based on TCGA WES study. a. Scatterplot of the number of mutations attributed by the smoking-related signature SBS4 (on the y-axis) in the MSK-IMPACT 468 panel compared to the number of mutations attributed by the same signature in the TCGA WES study (on the x-axis). The black line represents the ratio of the MSK-IMPACT 468 panel size to the WES size. The blue line represents the linear regression line and the shaded area represents 95% confidence intervals (CIs). b. and c. Medians of Pearson correlation coefficients for signatures SBS4 (b) and SBS7a/b (c) are shown. Pearson correlation coefficient measures the correlation between the number of mutations attributed to a signature using the pseudo-targeted sequencing data and the number of mutations attributed to the same signature reported previously in the TCGA WES study. The curve represents LOWESS (Locally Weighted Scatterplot Smoothing) curve, with the shaded area representing 95% CIs.
Supplementary Fig. 5 Impact of sample sizes on mutational signature detection. a. The scatterplot of the flatness (measured by Shannon equitability index) of signature profiles and percentage of single base substitutions (SBS) attributed to the signatures in the TCGA breast cancer (BRCA) WES study. The reference line with Shannon equitability index one refers to the theoretical maxima (by a completely flat signature). b. The probability of signature detection. Each dot represents the probability of signature detection, measuring the proportion of 21 targeted sequencing panels (with panel size larger than 1Mb) that can identify the signature at a given sample size. The blue line is the LOWESS (Locally Weighted Scatterplot Smoothing) curve, and the shaded area represents 95% confidence intervals (CIs).
Supplementary Fig. 6 The proportion of tumors in which the mutation signatures were detected in simulated breast cancer data. Mapped signatures SBS1, SBS2/13, and SBS5 were used for simulation. Mapped signatures or WES-based signatures were used for signature refitting, respectively. WES-based signatures include SBS1, SBS2/13, SBS5, SBS3, SBS7a, SBS10a, SBS10b, SBS15, SBS29, SBS30, SBS44 and SBS58, which are present in more than 1% of mutations in TCGA WES breast cancer study. The horizontal lines represent the actual proportions in the simulated data. The error bars in the figure represent the mean plus or minus one standard deviation.
Supplementary Fig. 7 UMAP visualization of SBS mutational signature profiles. The UMAP (Uniform Manifold Approximation and Projection) scatterplot displays 2-dimensional projections of the signature burden profiles for tumors obtained from the Dana-Farber Cancer Institute. The colors represent cancer types.
Supplementary Fig. 8 The associations between targeted sequencing-based mutational signatures and immunotherapy response. a. The proportion of lung cancer patients who remain progression free after receiving immune checkpoint inhibitors (ICI), stratified by high (> 10 Mut/Mb) or low overall TMB levels. The P-value was calculated using the log-rank test. b. The hazard ratio (HR) of TMB attributed to individual signatures for overall survival (PFS) in lung cancer patients treated with ICI.
Supplementary Fig. 9 Detection probability of TCGA breast cancer signatures in simulations. The probability of signature detection is shown for increasing sample sizes (a. up to 10,000 tumors, b. up to 1,000,000 tumors). Each dot represents the proportion of targeted sequencing panels (with a panel size larger than 1Mb) that are able to identify the signature at the corresponding sample size. Blue curves are LOWESS (Locally Weighted Scatterplot Smoothing) curves, and shaded areas represent 95% confidence intervals.
Supplementary Fig. 10 The Pearson correlation coefficient between simulated and estimated SBS signature expectancies in lung cancer with various sample sizes for signature refitting. The bars in the figure represent the mean Pearson correlation coefficient for simulation replicates, while the x-axis indicates the number of samples used for signature refitting, including 100 samples, 10 samples, or even one sample at a time. The error bars in the figure represent the mean plus or minus one standard deviation.
Acknowledgment
This research was conducted with support from Intramural Research Program of the National Institutes of Health, National Cancer Institute, Division of Cancer Epidemiology and Genetics (DCEG). Additionally, Dr. Lee was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT, No. RS-2023-00213625). The authors would like to acknowledge the American Association for Cancer Research and its financial and material support in the development of the AACR Project GENIE registry, as well as members of the consortium for their commitment to data sharing. Interpretations are the responsibility of study authors. This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD: https://biowulf.nih.gov. We would like to thank Bill Wheeler (Information Management Services) for computation support.
Footnotes
This version of the manuscript has been revised to update the following: The main title and author information are updated; The entire manuscript is revised for further clarification; Section for clinical application is newly included; Main Figures 1, 3, 5, 6 are revised according to the manuscript; and Supplementary figures, tables, and notes are also updated accordingly.