Pan-cancer mutational signature analysis of 111,711 targeted sequenced tumors using SATS

Donghyuk Lee; Min Hua; Difei Wang; Lei Song; Tongwu Zhang; Xing Hua; Kai Yu; Xiaohong R. Yang; Stephen J. Chanock; Jianxin Shi; Maria Teresa Landi; Bin Zhu

doi:10.1101/2023.05.18.23290188

Abstract

Tumor mutational signatures have the potential to inform cancer diagnosis and treatment. However, their detection in targeted sequenced tumors is hampered by sparse mutations and variability in targeted gene panels. Here we present SATS, a scalable mutational signature analyzer addressing these challenges by leveraging tumor mutational burdens from targeted gene panels. Through analyzing simulated data, pseudo-targeted sequencing data generated by down-sampling whole exome and genome data, and samples with matched whole genome sequencing and targeted sequencing, we showed that SATS can accurately detect common mutational signatures and estimate signature burdens. Applying SATS to 111,711 targeted sequenced tumors from the AACR Project GENIE, we generated a pan-cancer catalogue of mutational signatures tailored to targeted sequencing, enabling estimation of signature burdens within individual tumors. Integrating signatures with clinical data, we demonstrated SATS’s clinical utility, including identifying signatures enriched in early-onset hypermutated colorectal cancers and signatures associated with cancer prognosis and immunotherapy response.

Introduction

Tumors accumulate somatic mutations that form specific patterns, known as mutational signatures^1,2. These signatures can provide insight into the underlying mutational processes involved in carcinogenesis and have the potential to inform cancer detection^3-5 and treatment^6-9. For example, aristolochic acid-associated signatures can aid in the screening for liver cancers⁵, while tumors exhibiting HRD (homologous recombination deficiency)-associated signatures may be amenable to PARP (poly (ADP-ribose) polymerase) inhibitors⁷. Similarly, tumors characterized by APOBEC (Apolipoprotein B mRNA Editing Catalytic Polypeptide-like)-associated signatures might be responsive to ATR (ataxia telangiectasia and Rad3 related) inhibitors⁸. To decipher these mutational signatures, multiple algorithms have been proposed^10-14, and catalogues of mutational signatures have been established^1,2 for tumors analyzed through whole exome or whole genome sequencing (WES/WGS).

In clinical practice, tumors are often sequenced by targeted gene panels to detect somatic mutations in cancer driver genes with therapeutic relevance. The small number of detected mutations, coupled with the use of diverse panels across hospitals, makes it unfeasible to use signature extraction^10-12 or signature refitting methods^13,14 developed for WES/WGS to analyze mutational signatures in targeted sequenced tumors. Specifically, signature extraction methods extract de novo mutational signatures; however, sparse mutation data from targeted sequencing reduce the ability to distinguish correlated de novo signatures with these methods¹⁵. Signature refitting approaches estimate mutational signature activities and burdens in a given sample using prespecified reference signatures. However, it remains uncertain to choose the appropriate reference signatures for targeted sequencing data. The mutational signatures detected through WES/WGS may not accurately represent those present in targeted sequenced tumors. Moreover, using reference signatures from all cancer types in signature refitting methods can lead to misassigning mutations to signatures that are not present in the targeted sequenced tumors¹⁶. Furthermore, most methods assume that identical genomic regions, such as whole exome or genome, are sequenced across all samples, which may not hold true in targeted sequencing studies using various panels across different genomic regions. Therefore, while current signature extraction and refitting methods have successfully analyzed mutational signatures in tumors subjected to WES/WGS, their effectiveness is limited when applied to targeted sequencing data, hindering the clinical use of mutational signatures derived from targeted sequencing.

Clustering methods have been proposed for detecting specific mutational signatures in targeted sequenced tumors. For example, the SigMA¹⁵ algorithm is tailored to detect the HRD-associated signature SBS3. It utilizes pre-training with WGS data from individual tumors to classify targeted sequenced tumors. Similarly, MUTYH mutation-related signatures SBS18/36 have been identified in targeted sequenced colorectal cancers using WES data from individual samples as the training set for clustering¹⁷. However, these methods are limited in their capacity to simultaneously analyze multiple active mutational signatures within targeted sequenced tumors. In addition, rare cancer types or underrepresented populations may lack WES/WGS data for pre-training. The Mix¹⁸ method presents an alternative clustering strategy that does not rely on pre-training, but it estimates signature activities at the cluster level rather than for individual samples. Hence, to facilitate the clinical use of mutational signatures, specialized analytical methods and a comprehensive catalogue of mutational signatures for targeted sequenced tumors are needed.

Here, we introduce SATS (Signature Analyzer for Targeted Sequencing), a mutational signature analysis tool explicitly developed for targeted sequencing data. Unlike existing methods optimized for WES/WGS, SATS accounts for the variable size and genomic context of targeted gene panels and leverages large sample sizes of targeted sequencing studies. In the following sections, we first describe the SATS pipeline and then investigate the factors that affect signature detection and signature burden estimation in targeted sequenced tumors using pseudo-targeted sequencing data. Additionally, we perform simulations to compare the signature detection and signature burden estimation by SATS and other methods across four major cancer types.

Furthermore, we demonstrate that SATS outperforms other methods in accurately attributing mutations to prespecified signatures in samples with matched WGS and targeted sequencing. Applying SATS, we establish a pan-cancer catalogue of mutational signatures in 111,711 targeted sequenced tumors from the AACR (American Association for Cancer Research) Project GENIE (Genomics Evidence Neoplasia Information Exchange, Version 13.0-public)^19,20. This database contains tumors collected from 16 hospitals or cancer centers in multi-ethnic populations, representing 23 cancer types, including 14,983 lung and 12,144 breast tumors (Methods). Finally, we show that through integration with clinical data, mutational signatures derived from targeted sequencing can identify the potential tissue of origin for tumors of unknown primary, find signatures enriched in early-onset hypermutated colorectal cancers, and serve as a biomarker for cancer prognosis and immunotherapy response.

Results

SATS detects signatures and estimates signature burdens in targeted sequenced tumors

SATS is developed for targeted sequencing to detect mutational signatures within a patient cohort and refit detected signatures to estimate signature burdens in individual patients. While we use single base substitutions (SBS) for the purpose of illustration, SATS is also adaptable to other types of somatic mutations, such as double base substitutions (DBS).

In analyzing SBS mutational signatures, SATS uses a mutation type matrix V as input that contains the counts of SBS across 96 mutation types within 32 trinucleotide contexts, such as a C to G mutation at the trinucleotide context TCT (i.e., a T[C>G]T mutation type). Additionally, SATS incorporates a panel context matrix L that specifies the number of trinucleotide contexts where a specific mutation type (e.g., TCT for T[C>G]T substitutions) could potentially occur in the targeted genes. SATS is based on a Poisson Nonnegative-Matrix Factorization (pNMF) model (Methods). The pNMF model decomposes the mutation type-by-patient matrix V into a mutation type-by-signature matrix W that describes signature profiles and a signature-by-patient matrix H that quantifies signature activities, while adjusting panel sizes by the panel context matrix L (Fig. 1a).

Fig. 1 Schematic workflow of SATS.

a. The workflow starts with summarizing somatic mutations (e.g., single base substitutions) identified through targeted sequencing into a mutation type matrix V. In addition, SATS requires a panel context matrix L that specifies the number of trinucleotide contexts for individual panels. SATS is based on a Poisson Nonnegative-Matrix Factorization (pNMF) model, which approximates the matrix V by L ∘ W × H (i.e., V ≈ L ∘ W × H), where ∘ denotes the element-wise product and × represents the matrix multiplication operator. b. The analysis procedure of SATS involves signature detection for a patient cohort and signature refitting for individual patients. In this illustrative example, SATS initially identifies two de novo tumor mutation burden (TMB) signature in the cohort, subsequently mapping them to reference TMB signatures 1, 2/13 and 5. Next, SATS carries out signature refitting for six patients (e.g., Pt.1, Pt.2, …, Pt.6), estimating the activities of the mapped reference TMB signatures and the expected number of mutations attributed to each signature, termed the signature burden. For instance, the activities of SBS1, SBS2/13 and SBS5 for patient 3 (Pt.3) are 0.27, 0.84 and 0.18, respectively. Additionally, we estimate 0.67, 1.16 and 3.17 SBS attributed to signature SBS1, SBS2/13 and SBS5, respectively.

SATS comprises signature detection and signature refitting steps as outlined in Fig. 1b. In the signature detection phase, SATS first employs signeR¹¹ to discover de novo signatures from a targeted sequencing patient cohort. signeR is based on the pNMF model and adjusts for differences in the sizes of gene panels (Supplementary Note). Next, the de novo signature profiles are mapped to reference signatures from the pan-cancer COSMIC catalogue¹ using penalized nonnegative least squares (pNNLS)²¹. This allows us to identify reference signatures which are present within the de novo signature profiles. Notably, these reference signatures can originate from any cancer type featured in the pan-cancer catalogue, not restricted to the specific cancer type of the patient cohort. During the signature refitting stage, we have developed an Expectation–Maximization (EM) algorithm, refitting the detected reference signatures to estimate signature activities in individual patients. Given that a mutational signature can contribute to a variety of mutations, we further estimate the expected number of mutations attributed to a reference signature, termed the signature burden, for each patient (Methods).

While direct detection of mutational signatures in a single patient is challenging, SATS can effectively estimate signature burdens at an individual level by adapting (“refitting”) signatures previously identified in a large group of patients who have the same type of cancer and have been subjected to targeted sequencing. To enable this feature, we have compiled a pan-cancer catalogue of mutational signatures specific to targeted sequencing by analyzing over 111,000 tumors from the AACR Project GENIE.

SATS identifies signatures of tumor mutation burden

One key advancement of SATS lies in its ability to discern signatures of tumor mutation burden (TMB), circumventing the requirement of identical genomic region sequenced across tumor samples for mutational signature analysis. SATS achieves this through approximating V by L ∘ WH (i.e., V ≈ L ∘ WH), where ∘ denotes element-wise product. SATS allows for the analysis of diverse gene panels, each covering different sequenced genomic regions, by using the panel context matrix L (measured in megabase (Mb) pairs) to account for these differences. The resulting signature profile matrix W thus characterizes TMB signatures. In contrast, conventional mutational signature analysis algorithms implicitly require identical sequenced regions across tumor samples (e.g., through WES or WGS). These algorithms employ the canonical NMF method to factorize a mutation type matrix V into a 96 × K signature profile matrix W′ and a K × N signature activity matrix H′ (i.e.,V ≈ W′H′). Thus, the estimated W′ delineates tumor mutation count (TMC) signatures.

TMB and TMC signature profiles are different, as TMB signature profiles take into account the mutation context, while TMC signature profiles do not (Methods). For example, the TMC SBS5 profile is relatively consistent across all 16 trinucleotide contexts of C to T mutations, whereas the TMB SBS5 profile shows increased C to T mutations at the NCG trinucleotides (N represents any nucleotide, Supplementary Fig. 1a), since these trinucleotides are depleted in the human genome due to frequent deamination of 5-methylcytosine to thymine^22,23 (Supplementary Fig. 1b). We compared the shape of TMB and TMC signature profiles by the Shannon equitability index, which measures the evenness of signature profiles across mutation types (Methods). A higher index value corresponds to a flatter signature profile, whereas a lower value suggests a distinct or spikier profile with significant contributions from certain mutation types as “spikes.” The Shannon equitability indices of TMB and TMC signature profiles are highly correlated (Pearson correlation coefficient r = 0.915, Supplementary Fig. 1c), but there are a few exceptions. For example, TMB signature SBS10b and SBS15 (Shannon equitability index = 0.192 and 0.391 respectively) exhibit more pronounced spikes than their TMC counterparts (Shannon equitability index = 0.491 and 0.624 respectively, Supplementary Fig. 1d).

The advantage of TMB signature profiles is their robustness against variations in mutation contexts between targeted sequencing and WGS. For example, targeted sequences often have different proportions of certain trinucleotides (e.g., higher NCG, lower NTA and NTT) compared to whole genomes (Supplementary Fig. 2). As a result, TMC signature profiles obtained from targeted sequencing differ from those obtained from WGS, while TMB signature profiles remain consistent. Additionally, TMB signatures allow for mutational signature analysis across different targeted gene panels by normalizing the numbers of mutation contexts, making them particularly useful for studies involving multiple targeted gene panels.

Factors impacting signature detection and signature burden estimation by SATS

We investigated factors that may affect the performance of SATS in detecting signatures and estimating signature burdens. These factors include the size of the targeted gene panels, the prevalence of the signatures, the shape of the TMB signature profile (measured by the Shannon equitability index), and the cancer types. We generated pseudo-targeted sequencing data SBSs that were called in The Cancer Genome Atlas (TCGA) WES studies^1,24 and located in the regions covered by various targeted sequencing panels (Methods). In addition, we generated pseudo-targeted sequencing data based on 560 breast tumors²⁵ with WGS data (Methods and Supplementary Fig. 3a). Our analysis focused on common signatures contributing more than 5% of SBSs based on WES or WGS analyses for a given cancer type. While the sample size could also impact signature detection, we could not assess it using limited pseudo-targeted sequencing data. Thus, we conducted in silico simulations to examine the impact of sample size.

First, we analyzed pseudo-targeted sequencing data based on WES and showed that SATS is capable of detecting common signatures, though detection probabilities vary across cancer types, targeted gene panels, and the specific signatures. For instance, larger gene panels generally uncovered more signatures (Fig. 2a). Additionally, there was an inverse relationship (r = -0.452) between the detection probability of a signature and its Shannon equitability index (Fig. 2b). This suggests that “spikier” signatures are more likely to be detected than “flatter” ones, consistent with observations based on WES/WGS data⁹. To simultaneously quantify the impact of these factors on the detection probabilities, we applied a generalized linear mixed model (GLMM) that included cancer type, panel size, signature prevalence, and the Shannon equitability index of the signature (Methods). Our results revealed that cancer types accounted for 53.26% of the variance of detection probabilities at the logit scale (Fig. 2c). This is because certain cancer types possess more distinctive and easily distinguishable signatures. Additionally, cancer types with high TMB (e.g., lung squamous cell carcinoma, median TMB: 10.07 mutations/Mb) had a higher probability of signature detection compared to those with lower TMB (e.g., thyroid adenocarcinoma, median TMB: 0.47 mutations/Mb). Within a specific cancer type, GLMM indicated that spikier (odds ratio (OR) = 0.962, 95% confidence interval (CI) = 0.956-0.967 for a 0.01 increase of the Shannon equitability index, Fig. 2d) and more prevalent (OR = 1.12, 95% CI = 1.14-1.27 for a one percent increase of signature prevalence) signatures are more detectable, especially when using large gene panels (OR = 1.21, 95% CI = 1.14-1.27 for 1Mb increase in gene panel size). This conclusion was also evident in our analysis of pseudo-targeted sequencing data based on WGS (Supplementary Fig. 3b). This finding helps explain the challenge in detecting signatures in thyroid adenocarcinoma, where spikier signatures like SBS1 are less common (the prevalence 6.58%), while the most common signature, SBS5 (28.26%), is flat and may be confused with other flat signatures like SBS3 or SBS40.

Fig. 2: Assessing the determinants of SATS performance in signature detection and signature burden estimation

a. Detection probability of common signatures, which contribute at least 5% of single base substitutions (SBSs) per a cancer type in the TCGA WES study (e.g., SBS1, SBS5, SBS10a, SBS10b, SBS15 and SBS40 in glioblastoma). Detection probability is defined as the percentage of these signatures detected by SATS in pseudo-targeted sequencing samples for each cancer type across 36 panels. For instance, if SATS detects three out of six common signatures in glioblastoma (e.g., SBS1, SBS10b, SBS15), the detection probability is 50%. The median sample size for pseudo-panel data across all panels is denoted in parentheses on the y-axis, and the size of genomic regions covered by each sequencing panel is indicated on the x-axis. b. The percentage of common TCGA WES signatures detected by 36 targeted sequencing panels versus the Shannon equitability index of the signature profile. The blue line refers to the linear regression line. The black dots refer to the detected signatures and gray dots refer to the undetected signatures. c. The proportion of variance in detection probabilities of common TCGA WES signatures that can be explained by determinant factors, including the Shannon equitability index of the signature profile (flatness), the frequency of TCGA WES signatures (prevalence), panel size, and cancer type. d. The odds ratio of determinant factors. Bars represent the 95% confidence intervals (CIs). e. The median Pearson correlation coefficient for all detected common WES signatures across 18 TCGA cancer types was compared to panel size. The Pearson correlation coefficient is a measure of the correlation between the number of single base substitutions (SBS) attributed to a signature, estimated from the pseudo-panel data, and the number of SBS attributed to the same signature previously reported in the TCGA WES study. The blue curve represents LOWESS (Locally Weighted Scatterplot Smoothing) curve, and the shaded area is 95% CIs. CA: cervical cancer; HCC: hepatocellular carcinoma; RCC: renal cell carcinoma; CNS: central nervous system; GBM: glioblastoma; AdenoCa: adenocarcinoma; SCC: squamous cell carcinoma; Thy: thyroid; Prost: prostate; Colorect: colon or rectum; DLBC: diffuse large B cell lymphoma.

Next, we evaluated the signature refitting steps of SATS for estimating signature burdens - the number of mutations attributed to each signature. We compared the signature burdens calculated using WES¹ with these estimated using SATS from pseudo-targeted sequencing data of the same tumors (as an example, see Supplementary Fig. 4a for SBS4 in lung cancer based on the MSK-IMPACT468 panel). A strong correlation between the two would indicate that targeted sequencing panels are a feasible alternative to WES for assessing signature burdens. We found that the median Pearson correlation coefficient was 0.7 for panels with sizes greater than 1Mb (Fig. 2e), with higher correlation coefficients for certain signatures, such as SBS4 in lung adenocarcinoma (r = 0.91) and SBS7a and SBS7b in melanoma (r = 0.98 and 0.95 respectively, Supplementary Fig. 4b and 4c). Similar results were also observed in pseudo-targeted sequencing data based on WGS (Supplementary Fig. 3c).

To explore the impact of sample size on mutational signature detection, we conducted in silico simulations with varying sample sizes, using breast cancer as a case study (consisting of 12 mutational signatures with at least 1% prevalence in the TCGA breast cancer study, Supplementary Fig. 5a). We simulated the mutation burden of 96 SBS mutation types for up to 1 million tumors across 21 different targeted sequencing panels, each with the size larger than 1Mb (Methods). In this way, we could use the “truly” mutational signatures present in in silico simulations as benchmarks. Our findings indicated that with a large number of targeted sequenced tumors, SATS can detect almost all common mutational signatures (>5% prevalence) in breast cancer, including the HRD-associated signature SBS3, while maintaining a low false positive rate (Supplementary Fig. 5b, detailed in the Supplementary Note). This study underscores the importance of sample size in the extraction of mutational signatures from targeted sequenced tumors, as detecting certain signatures might necessitate a larger cohort of samples.

Evaluation of SATS and other methods by in silico simulations

To assess the performance of SATS in mutational signature detection and burden estimation, we conducted in silico simulations, comparing it with other methods. Using signature profiles and distributions of signature activities from the AACR Project GENIE, we simulated mutation type matrices for lung, breast, colorectal, and lymphoid-derived hematologic cancers (Methods).

First, we ran SATS, SigProfilerExtractor¹² and Mix¹⁸ to detect signatures and compared the detected signatures to the ‘prespecified signatures’ used in the simulations. We observed that SATS could accurately detect most prespecified signatures, except for few flat or rare signatures. For lung and breast cancers, SATS identified all nine prespecified signatures in every replicate (Fig. 3a). In colorectal cancer, SATS detected five out of six prespecified signatures in all replicates, with SBS44 being more elusive. SBS44, as the second flattest signature in colorectal cancer, is challenging to distinguish from the common flat signature SBS5. In lymphoid-derived hematologic cancer, all prespecified signatures were frequently detected. The only false positive signatures in four cancer types were SBS10c and SBS92 in colorectal cancer for one replicate. In contrast, SigProfilerExtractor and Mix failed to detect signatures SBS29 and SBS89 in lung cancer (Fig. 3a). SigProfilerExtractor did not identify SBS44 in colorectal cancer, and Mix missed the majority of signatures in lymphoid-derived hematologic cancer. Besides false negative detections, SigProfilerExtractor incorrectly identified SBS24 in lung cancer and SBS7b in lymphoid-derived hematologic cancer. These results suggest that SATS outperforms SigProfilerExtractor and Mix in signature detection for targeted sequenced tumors, effectively identifying most prespecified signatures with minimal false positives.

Fig. 3 Comparison of SATS and other methods.

a. Detection frequency of mutational signatures in 10 replicates for lung, breast, colorectal, and lymphoid-derived hematologic cancers. The signatures on the x-axis are used to simulate mutation counts and are considered as the “ground truth”. The dot size is proportional to the detection frequency by SATS, SigProfilerExtractor and Mix. b. Pearson correlation coefficients between the simulated SBS signature burdens (as the benchmark) and the burdens estimated by SATS, SigProfilerAssignment and Mix. Bars represent the average the Pearson correlation coefficient, and the intervals are the average plus or minus one standard deviation. c. Boxplots illustrating mutational signature burdens of SBS1 and flat signatures (SBS5/SBS40) obtained from WGS, separated by signature burden group based on targeted sequencing (SBS1 high: SBS1 burden > 1 mutation per sample; flat signature high: flat signature burden > 4 mutations per sample). The median of the burdens is marked by the line in each box, which spans from the first to the third quartiles. Whiskers extend to the furthest points within 1.5 times the interquartile range from the quartiles. Each dot represents a data point of mutational signature burdens in WGS.

Next, we applied SATS, SigProfilerAssignment¹⁴ and Mix¹⁸ to estimate signature burdens, comparing these estimates with the simulated “ground truth”. SATS showed high accuracy in estimating burdens for common or spiky signatures, such as SBS2/13 in breast cancer (r = 0.96, Fig. 3b), SBS4 in lung cancer (r = 0.86) and SBS10a and SBS10b in colorectal cancer (r = 0.99 for both). However, the correlation was lower for flatter or rarer signatures, such as SBS89 in breast cancer (r = 0.55) and SBS6 in colorectal cancer (r = 0.74). The correlations for signature burdens estimated by SigProfilerAssignment and Mix were generally lower than those by SATS, particularly for signatures like SBS6 in colorectal cancer and SBS84/SBS87 in lymphoid-derived hematologic cancer (Fig. 3b). Overall, these results suggest that SATS can more accurately estimate signature burdens for the majority of signatures than other methods.

Finally, we investigated the impact of including irrelevant signatures in signature refitting. Simulating breast cancer targeted sequencing data using signatures SBS1, SBS2/13, and SBS5, we performed signature refitting using 12 signatures (including the three true signatures) from TCGA WES breast cancer study. We found that a considerable proportion of mutations were incorrectly attributed to non-existent signatures in the simulated data (Supplementary Fig. 6). This result emphasizes the importance of choosing an appropriate set of refitted signatures, tailored for targeted sequencing, to enhance the accuracy of signature refitting in targeted sequencing studies.

Evaluation of SATS and other methods in tumors with both WGS and targeted sequencing

We further assessed SATS using 72 kidney tumors that had undergone both whole genome sequencing and targeted sequencing²⁶. Due to the sample size constraints, we focused on refitting the common signatures to estimate their burdens in the targeted sequencing data using SATS, SigProfilerAssignment¹⁴, Mix¹⁸ and deconstructSigs¹³. The common signatures included SBS1, SBS5, and SBS40, all of which had been identified in kidney tumors from the AACR Project GENIE, as well as through WGS-based mutational signature analyses of 72 kidney tumors²⁶.

Our analysis showed a consistent correlation between the estimated burdens of flat signatures (SBS5/40) from targeted sequencing and those from WGS across all methods. Specifically, we observed that samples with a higher burden of mutations attributed to flat signatures in targeted sequencing also had a higher burden of these signatures in WGS (Fig. 3c right panel; Wilcoxon P-value = 2.48 x 10^-3, 2.35 x 10^-7, 2.35 x 10^-7, 1.79 x 10^-8 for SATS, SigProfilerAssignment, Mix and deconstructSigs, respectively). However, for the less frequent signature SBS1, the association of signature burdens was observed only in SATS (Fig. 3c left panel; Wilcoxon P-value = 3.62 x 10^-7). SigProfilerAssignment and deconstructSigs assigned zero burdens to SBS1 in all samples but one for SigProfilerAssignment, while Mix categorized all samples as one cluster with minimal SBS1 signature burdens. This indicates the limitations of other refitting methods in accurately estimating burdens for less common signatures like SBS1 in targeted sequencing data.

The pan-cancer repertoire of targeted sequencing-based mutational signatures

We used SATS to create a pan-cancer repertoire of SBS signatures based on the targeted sequenced tumors in the AACR Project GENIE, which could be adopted in the clinic to refit signatures in a single tumor. Our analysis revealed the ubiquitous presence of SBS1, caused by deamination of 5-methylcytosine to thymine, and flat signatures (SBS3, SBS5 and SBS40 combined, given that the current sample size of a cancer type is insufficient to separate them accurately) across all cancer types (as shown in Fig. 4a bottom panel). In contrast, other SBS signatures are present in particular cancer types. For example, SBS2/13, associated with APOBEC cytosine deaminases, was found in eight cancer types. DNA repair deficiency-related signatures such as SBS6/14/15/44, indicative of mismatch repair (MMR) deficiency, were observed in nine cancer types. Similarly, SBS10a/b/c signatures, resulting from polymerase epsilon (POLE) exonuclease domain mutations, appeared in eight cancer types.

Fig. 4 Repertoire of mutational signatures in the AACR Project GENIE.

a. Single base substitution (SBS) signatures. The top bar chart displays the stacked tumor mutation burden (TMB) attributed to specific signatures, with the colors indicating different mutational signatures. The bottom panel illustrates the presence of SBS signatures for individual cancer types, with dot sizes representing the proportion of tumors in which an SBS signature is present. The sample size of targeted sequenced tumors is indicated between the two panels, and the proposed etiology of the mutational signature is included in parentheses. b. Double base substitution (DBS) signatures. The top stacked bar chart shows the TMB of DBS signatures, and the bottom panel shows the proportion of tumors for which a DBS signature is present. 5meC: 5-Methylcytosine; APOBEC: apolipoprotein B mRNA-editing enzyme, catalytic polypeptide, MMR: mismatch repair; UV: ultraviolet radiation; POLE-exo^*: mutations in polymerase epsilon exonuclease domain; TMZ: temozolomide; BER: base excision repair; AZA: azathioprine; AID: activation-induced deaminase; TP: thiopurine.

Besides above signatures associated with endogenous mutational processes, we also detected signatures associated with environmental exposures, such as smoking (e.g., SBS4/29 in lung cancer) and UV radiation (e.g., SBS7a/b in head and neck cancer, skin cancer/melanoma or soft tissue cancer). Additionally, treatment-related signatures were identified as well, such as SBS11 caused by temozolomide in glioma²⁷, and thiopurine chemotherapy treatment-induced signature SBS87 in endometrial cancer²⁸, head and neck cancer²⁹ and lymphoid-derived hematologic cancer³⁰.

We calculated the signature burdens and evaluated the prevalence of mutational signatures for individual cancer types (Methods). A significant share of the mutations in many cancer types was found to be attributable to flat signatures (SBS3/5/40), with some notable exceptions (Fig. 4a top panel). For example, skin cancer or melanoma is primarily characterized by UV-induced signatures, while endometrial cancer is dominated by signatures related to DNA mismatch and replication repair deficiency. Additionally, APOBEC-induced signatures are most frequent in bladder cancer, whereas smoking- and APOBEC-induced signatures dominate in lung cancer.

In addition to SBS signatures, we generated a pan-cancer repertoire of DBS mutational signatures for targeted sequenced tumors (Fig. 4b bottom panel). We found seven DBS signatures, which exhibited a low mutation burden (less than one mutation per megabase, Fig. 4b top panel). We observed that the DBS1 signature, associated with UV exposure, is present in head and neck cancer, skin cancer or melanoma, soft tissue cancer, and cancers of unknown primary, consistent with the presence of UV exposure SBS signatures in these cancer types. Furthermore, the DBS2 signature, associated with smoking, was identified in bladder and lung cancer. We also observed DNA repair deficiency-associated signatures DBS3 in non-colorectal bowel cancer.

Clinical applications of targeted sequencing-based mutational signatures

We provide here examples of utilizing targeted sequencing-based mutational signatures generated by SATS to address important clinical questions.

Identification of tissue of origin for tumors of unknown primary

The distinct presence and prevalence of targeted sequencing-based mutational signatures across cancer types suggest their potential as indicators of tissue of origin, particularly for tumors of unknown primary. Notably, when examining the clustering of tumors sequenced at the Memorial Sloan Kettering Cancer Center (Fig. 5, Methods), we observed that most tumors grouped according to their cancer types. For instance, lung tumors formed a distinct cluster, separate from other cancer types, while tumors with UV signatures, like head and neck cancer, skin cancer/melanoma, and soft tissue cancer, constituted another distinct cluster, in which tumors of head and neck cancer and soft tissue cancer grouped together. In contrast, glioma formed unique clusters, as did pancreatic cancer. These clustering patterns were consistently observed in tumors sequenced at the Dana-Farber Cancer Institute (Supplementary Fig. 7). Interestingly, tumors of unknown primary clustered alongside lung tumors, tumors with UV signatures, glioma, pancreatic tumors, and others (Fig. 5), suggesting their potential tissues of origin.

Fig. 5 UMAP visualization of SBS mutational signature profiles.

The UMAP (Uniform Manifold Approximation and Projection) scatterplot displays 2-dimensional projections of the signature burden profiles for tumors obtained from Memorial Sloan Kettering Cancer Center. Each color represents a cancer type. The number in parentheses represent the number of aggregated sample clusters of individual cancer type.

Targeted sequencing-based mutational signatures enriched in early-onset hypermutated colorectal cancer

A previous study showed that early-onset non-hypermutated colorectal cancers exhibited a lower overall TMB than late-onset cases³¹. However, the opposite trend was observed in hypermutated cases. To examine this discrepancy, we analyzed TMBs attributed to specific mutational signatures in non-hypermutated and hypermutated colorectal cancers (Methods). First, we confirmed the previous findings of the contrasting association between overall TMB and early-onset status in non-hypermutated and hypermutated colorectal cancers. Next, we found that in non-hypermutated colorectal cancers, TMBs attributed to most signatures (SBS1/5/6/10a/10b) were inversely associated with early-onset status (Fig. 6a), leading to an inverse association between the overall TMB and early-onset status (OR = 0.898, 95% CI = 0.877-0.920). In contrast, among hypermutated colorectal cancers, the positive association of overall TMB and early-onset status (OR = 1.004, 95% CI = 1.001-1.007) was primarily driven by the TMB attributed to the deficient DNA mismatch repair signature (SBS44, OR = 1.014, 95% CI = 1.001-1.027, Fig. 6a) and the signature related to deficient replication repair gene POLE (SBS10a, OR = 1.017, 95% CI = 1.000-1.035) but attenuated by the clock-like signature SBS5 (OR = 0.982, 95% CI = 0.963-1.001). These results reveal distinct mutational processes that are characteristic of early-onset hypermutated colorectal cancer.

Fig. 6 The associations between targeted sequencing-based mutational signatures and clinical outcomes.

a. The odds ratios (ORs) for the association between tumor mutation burden (TMB) attributed to individual signature and the status of colorectal cancer onset (early-onset vs. late-onset). The left panel illustrates the results for non-hypermutated colorectal cancer, while the right panel displays the findings for hypermutated colorectal cancer. The bars represent 95% confidence intervals (CIs) for the corresponding OR estimates. b. The panels show the proportion of colorectal cancer patients who remain alive after tumor was sequenced, stratified by the presence or absence of the signature SBS44. Each panel represents a different race group: top panel - White patients, middle panel - Black patients, and bottom panel - Asian patients. The P-value was calculated using the log-rank test. c. The top panel shows the hazard ratio (HR) of TMB attributed to individual signatures with 95% CIs for progress-free survival (PFS) in lung cancer patients treated with immune checkpoint inhibitors (ICI). The remaining two panels demonstrate the proportion of lung cancer patients who remain progression free after receiving ICI, stratified by high or low mutational signature-specific TMB levels. The middle panel represents TMB attributed to signature SBS1, while the bottom panel represents TMB attributed to smoking signatures SBS4/29. The P-value was calculated using the log-rank test.

Targeted sequencing-based mutational signatures associated with cancer prognosis

To investigate the prognostic value of mutational signatures derived from targeted sequencing, we conducted an analysis associating the presence of these signatures with overall survival across different cancer types. The analyses were adjusted for age, race, sex (where applicable), metastatic status, cancer subtype, and hospital site (Methods). Consistent with previous studies based on health records or WES^32-34, we observed that targeted sequencing-based smoking signatures SBS4/29 and APOBEC signatures SBS2/13 were associated with poorer prognosis in lung cancer (SBS4/29: Hazard Ratio (HR) = 1.1, 95% CI = 1.05-1.17; SBS2/13, HR = 1.09, 95% CI = 1.03-1.14). In contrast, UV exposure signature SBS7a/b showed a favorable prognosis (SBS7a: HR = 0.71, 95% CI = 0.62-0.80; SBS7b: HR = 0.81, 95% CI = 0.72-0.92) in skin cancer or melanoma, consistent with previous WES studies that patients with signature SBS7 exhibited better survival outcomes³⁵. Furthermore, we found that POLE deficiency-related signature SBS10a was associated with a better prognosis in colorectal cancer³⁶ (HR = 0.83, 95% CI = 0.77-0.90). Similarly, MMR deficiency-related signatures led to favorable prognosis in esophageal and gastric cancers (SBS6: HR = 0.84, 95% CI = 0.76-0.92) and in colorectal cancer (SBS44: HR = 0.79, 95% CI = 0.74-0.85). These results align with multiple studies which reported improved prognosis in MMR-deficiency (measured by microsatellite markers or MMR immunohistochemistry) colorectal cancer³⁷ or gastric cancer^38,39 patients when treated with surgery alone or surgery plus chemotherapy. Thus, our findings confirm the prognostic value of mutational signatures derived from targeted sequencing.

Racial heterogeneity on the association of MMR deficiency and colorectal cancer prognosis

Leveraging the large sample size across different racial/ethnic groups from targeted sequencing data, we investigated the association between MMR deficiency signature SBS 44 and colorectal cancer prognosis across different racial groups. Analyzing 9,562 colorectal cancer cases from eight hospitals, we found significant racial heterogeneity (likelihood ratio test (LRT) P-value = 1.22 x 10^-5 for the model, including the interaction term of SBS44 and race compared to the model without the interaction term). Specifically, we observed a highly favorable prognosis associated with SBS44 among white patients (HR = 0.78, 95% CI = 0.72-0.85, Fig. 6b). The effect size was attenuated among Black patients (HR = 0.97, 95% CI = 0.76-1.24), and was inverted among Asian patients (HR = 1.15, 95% CI = 0.84-1.58, Fig. 6b). This racial heterogeneity persisted when we restricted the analysis to 4,786 colorectal cancer cases from the Memorial Sloan Kettering Cancer Center (LRT P-value = 1.69 x 10^-4) or 2,892 colorectal cancer cases from the Dana-Farber Cancer Institute (LRT P-value = 2.59 x 10^-3). These results warrant further investigation into the potential racial heterogeneity of MMR deficiency in colorectal cancer prognosis.

Targeted sequencing-based mutational signatures associated with immunotherapy response

Using treatment regimens and response records from AACR Project GENIE, we examined the association between TMB attributed to specific mutational signatures and progression-free survival (PFS) in 470 non-small cell lung cancer patients who received immune checkpoint inhibitors. To account for potential confounding factors, we adjusted our analysis for smoking history, stage, age, race, sex, cancer subtype, and hospital site. We revealed a significant association between TMB attributed to SBS1 and poor PFS (HR = 1.31, 95% CI = 1.10-1.56, Fig. 6c). In contrast, TMB attributed to the smoking signatures SBS4/29 exhibited a significant association with favorable PFS (HR = 0.94, 95% CI = 0.90-0.98, Fig. 6c). TMB attributed to other signatures (SBS2/13, SBS89 and SBS5/40) did not show a significant association with PFS. Notably, when considering the overall TMB combining all signatures, only a nominal association with favorable PFS was observed (HR = 0.98, 95% CI = 0.95-1.00, Supplementary Fig. 8a). Consistent results were also observed when analyzing overall survival instead of PFS (Supplementary Fig. 8b). These results suggest that signature-specific TMB, particularly SBS1 and SBS4/29, could be used as more effective biomarkers for predicting the response to immune checkpoint inhibitors in non-small cell lung cancer, compared to the overall TMB.

Discussion

In this study, we have introduced SATS, a new tool to identify mutational signatures and estimate signature burdens in targeted sequenced tumors. We evaluated SATS using pseudo-targeted sequencing data and found that spiky signature profiles, a high signature prevalence, and large sequencing panels (> 1Mb) increase the accuracy of signature detection and refitting. Moreover, we showed that SATS outperformed other methods through analyzing in silico simulated data and samples with both WGS and targeted sequencing. We utilized SATS to analyze 111,711 targeted sequenced tumors in the AACR Project GENIE and developed a pan-cancer catalogue of SBS and DBS signatures, specifically tailored for targeted sequencing tumors. Using this repertoire, SATS can estimate signature burden in a single sample, making it a useful tool in the clinic. Finally, we showed examples of using targeted sequencing-based signatures to address clinical questions. In particular, we found that TMBs attributed to DNA repair deficiency signatures could disentangle the effect of TMB on hypermutated early-vs. late-onset colorectal cancer. In addition, we identified mutational signatures associated with cancer prognosis in multiple cancer types and with immunotherapy response in non-small cell lung cancer.

Our study has made several important contributions to the analysis of mutational signatures in targeted sequenced tumors. First, unique to SATS is the incorporation of panel size in the analysis, enabling the identification of TMB signatures that are not restricted to a particular type of gene panel. In contrast, other mutation signature tools commonly assume the same targeted sequencing panel across samples. This adaptability ensures SATS’s applicability over a wide range of targeted gene panels, enhancing its utility in clinical settings. Second, unlike clustering-based methods^15,17 that aim to detect a specific mutational signature, SATS can identify multiple mutational signatures simultaneously, providing a more comprehensive analysis of the mutational landscape of target-sequenced tumors. Third, to the best of our knowledge, this study represents the largest and most comprehensive pan-cancer mutational signature analysis for targeted sequenced tumors to date. We have analyzed 23 cancer types, including those less represented in previous studies, such as cancers of unknown primary origin. Moreover, our analysis includes 757 cancer subtypes, which is more extensive than previous studies. For instance, while the TCGA ovarian cancer study⁴⁰ focuses on high-grade serous ovarian cancer, our analysis involves 22 ovarian cancer subtypes, encompassing high-grade serous ovarian cancer, clear cell ovarian cancer, low-grade serous ovarian cancer, and endometrioid ovarian cancer, among others. Finally, the established repertoire of mutational signatures is derived from a diverse collection of targeted sequenced tumors at various hospitals and cancer centers, ensuring its relevance to clinical settings. This contrasts with repertoires based on WES/WGS that are often oriented more towards research applications. Users of SATS can estimate signature burdens using the repertoire from this study as the reference, facilitating the clinical utility of mutational signature analysis in individual targeted sequenced tumors.

This study has several limitations that should be considered. First, the data were collected from clinics primarily in the United States and Western Europe as part of the AACR Project GENIE, which may limit the representativeness of our findings for targeted sequenced tumors from other geographic regions. Second, while the identified repertoire of mutational signatures for targeted sequencing tumors is extensive, it may not encompass the entire spectrum of possible signatures. The current repertoire mainly includes common signatures with spiky profiles, such as signatures related to hypermutation, that are easy to detect with the current sample size per cancer type in the AACR Project GENIE. This selection bias towards easily detectable signatures could mean that less common or subtler signatures might be underrepresented or missed. Additionally, we observed a higher prevalence of hypermutated signatures in our study compared to previous studies using WGS or WES. This discrepancy could be attributed to the nature of targeted sequencing, which may not capture mutations in non-hypermutated tumors as effectively as WGS or WES. Consequently, if no mutations are detected in these non-hypermutated tumors, they could be excluded from our analysis, potentially leading to an overestimation of the prevalence of hypermutated signatures. Lastly, the current sample sizes for each cancer type within our study are not large enough to effectively differentiate between flat mutational signatures SBS3, SBS5, and SBS40.

To overcome these limitations, it is crucial to increase the number of tumors sequenced by targeted gene panels and to share the resulting data. The decreasing costs and increasing accessibility of targeted sequencing in clinical practice make the expansion of sample sizes a feasible goal. Initiatives like the AACR Project GENIE are already taking steps towards this goal by collecting and sharing more targeted sequencing data and inviting new participants from underrepresented and underserved populations. Our analysis suggests that as the number of targeted sequenced tumors increases, SATS would provide enhanced clinical utility as it can detect additional common signatures with very low false discovery rates. Moreover, with a larger sample pool, SATS could potentially differentiate the HR deficiency-associated signature SBS3 from other flat signatures.

In summary, we have developed a tool for analyzing mutational signatures in targeted sequenced tumors, and created a pan-cancer repertoire of mutational signatures tailored for targeted sequencing. Our study has highlighted the clinical relevance of these targeted sequencing-based signatures. The SATS R package is publicly available on GitHub. We anticipate that SATS and the repertoire will enhance clinical applications of mutational signature analysis using targeted sequence data.

Methods

Genomic data of AACR Project GENIE

We retrieved the AACR Project GENIE dataset (version: 13.0-public) from Synapse (https://synapse.org/genie). This dataset includes 111,711 tumors that were collected as part of routine clinical practice at 16 hospitals or cancer centers and sequenced by targeted sequencing using different gene panels with sizes larger than 50Kb. The patients provided their consent, and the study was approved by an institutional review board (IRB). The dataset contains samples from diverse ethnic backgrounds, including 5,973 Asians (5.3%); 5,545 Blacks (5.0%); 78,003 Whites (69.8%); 5,311 individuals from other racial groups (4.8%); and 16,879 individuals with unknown race (15.1%). It also includes self-reported sex: 58,576 females, 47,694 males, 2 other, and 5,439 of unknown sex. The dataset covers 102 cancer types with 757 subtypes defined by OncoTree⁴¹ (please refer to the Supplementary Note for more information on cancer types in the AACR Project GENIE). To facilitate our analysis, we grouped 102 OncoTree cancer types into 23 analysis cancer types.

The tumors were sequenced at CLIA-/ISO-certified labs with high read depth (median: 519X reads, 1st quantile: 307X, 3rd quantile: 808X). Somatic mutations were called at participating centers with various tools, including Mutect2 (ref⁴²) and Strelka⁴³. Germline variants and artifacts were filtered out using pooled external controls and databases of known germline variants, such as the Genome Aggregation Database (gnomAD)⁴⁴. For more information on the filtering process, please refer to the “AACR GENIE 13.0-public Data Guide” (https://www.aacr.org/wp-content/uploads/2023/03/13.0_data_guide-1.pdf). The dataset includes 1,213,674 single base substitutions (SBS) and 18,519 double base substitutions (DBS). We further removed somatic mutations with a read depth of less than 100 or an alternative allele read count of less than 5. This resulted in 982,095 SBS and 15,149 DBS from 111,711 tumors (Supplementary Table 1) for mutational signature analysis.

It’s worth noting that the choice of mutation calling pipeline may impact the signature analysis results. The influence on signature detection is likely to be less pronounced than on signature burden estimation. This is because signature detection is based on aggregated data from a group of patients, which tends to mitigate the variations caused by different mutation calling approaches. We recommend for users of SATS to review mutation calling pipelines used by the contributing institutions of the AACR Project GENIE (described in the AACR GENIE 13.0-public Data Guide) and to follow best practices^45,46 when applying mutation callers and specifying filtering thresholds.

A Poisson NMF model for signature analysis of tumor mutation burden

We define a Poisson Non-Negative Matrix Factorization (pNMF) model for SATS. The pNMF model assumes that the SBS count υ_pn for the p^th mutation type in the n^th targeted sequenced tumor follows a Poisson distribution with mean for K signatures, p = 1,2, …,96, n = 1,2, …, N and K = 1,2, …, K. The υ_pn, 𝓁_pn, w_pk and h_kn represent elements of the corresponding matrices V (dimension N by 96), L (dimension N by 96), W (dimension N by K) and H (dimension K by 96), respectively. This model specification is equivalent to the one used in signeR¹¹.

The W and H are the parameters of interest that will be estimated based on the log-likelihood function of the pNMF model: When the genomic regions sequenced across all samples are identical, as in WES or WGS, the maximum likelihood estimate based on equation (1) is equivalent to that based on the canonical NMF, which is detailed below. Thus, the proposed pNMF model includes the canonical NMF as a special case.

pNMF model extends the canonical NMF

The pNMF model proposed here incorporates the panel context matrix L, extending the canonical NMF, the standard method for identifying mutational signatures in tumors sequenced by WES or WGS. When all samples are sequenced using the same genomic regions (i.e., 𝓁_pn = 𝓁_p), the log-likelihood function in equation (1) is simplified as where C is a constant irrelevant to W and H. It is worth noting that is equivalent to the objective function of the canonical NMF⁴⁷ with w, = 𝓁_pw_pk.

Extraction of de novo TMB signatures

We utilize signeR¹¹ to extract de novo signatures based on the pNMF model. However, because signeR is computationally demanding due to its use of the Markov Chain Monte Carlo (MCMC) method¹¹, we grouped samples to improve computational efficiency. Our results below reveal that grouping samples does not affect the TMB signatures profile (w_pk). Specifically, we define C as the sample index set {1,2, …, N}, and C_m as the mutually exclusive set such that . For , the sum of the mutation count for the targeted sequencing tumors with index n belonging to the set C_m, we can show that: where and . Notably, the TMB signature profile w_pk remains unchanged. The panel size of combined samples is the sum of the panel size of individual samples 𝓁_pn, and signature activity is the weighted sum of the signature activities of individual samples h_kn. The mutation count of combined samples follows a Poisson distribution, as the sum of independent Poisson counts is still Poisson distributed.

Grouping samples can significantly reduce computation time. For example, when analyzing 10,000 samples using SATS, analysis by grouping 100 tumors can be completed in 28.5 minutes on a laptop with a Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz processor and 16 GB of 4267 MHz RAM. In contrast, analyzing the same samples without grouping tumors takes approximately 13 hours.

Mapping de novo TMB signatures to COSMIC reference TMB signatures

Due to the limited number of somatic mutations detected by targeted gene panels, the detected de novo TMB signature profiles may be a linear combination of COSMIC reference TMB signature profiles. To address this limitation, we map the de novo signature profile matrix to reference TMB signatures W₀ (e.g., a 96 × 76 COSMIC TMB signature profile matrix for 76 reference SBS TMB signatures), using penalized non-negative least squares²¹: where β is a coefficient vector and the tuning parameter λ is selected based on cross-validations. Compared with the non-negative least squares, the penalized non-negative least squares allow us to shrink small values of β towards zero and select a smaller number of reference signatures with the profile matrix W^* that have a significant contribution to the de novo signature profiles. To reduce the randomness caused by the cross-validation step to select λ, we repeat this process 100 times, and select only reference TMB signatures with a coefficient β greater than 0.1 in more than 80 replicates.

Estimation of signature activities by an expectation-maximization algorithm

We propose an expectation-maximization (EM) algorithm to estimate the signature activity matrix H = [h_kn], given the mutation type matrix V = [υ_pn], the panel context matrix L = [𝓁_pn], and the mapped reference TMB signature profiles . The element υ_pn in V can be expressed as the sum of independent latent counts υ_1pn, υ_2pn, ⋯, υ_Kpn attributed to K signatures. These latent counts are treated as the missing data, following Poisson distributions with expectations , respectively. Introducing latent counts allows us to compute the complete data log-likelihood as: In addition, the conditional distribution of υ_Kpn given V, L, W^* and H^t (the H at the t’th iteration of the EM algorithm) follows a multinomial distribution with parameters υ_pn and .

In the E-step, we compute Q(H|H^t) as the expected complete data log-likelihood: In the M-step, the maximizer of Q(H|H^t) is obtained by setting the derivative with respect to h_kn to 0, and the updated activity value is given by: Note that the M-step depends on the current value of activities for the n^th tumor only. Therefore, even though the EM algorithm updates the entire activity matrix H for all samples simultaneously, it is equivalent to updating the activity of one tumor at a time. In other words, the EM algorithm of SATS for signature refitting estimates signature activities independently of other tumors, enabling signature activities to be estimated accurately for a single tumor or small subset of samples.

To complete the EM algorithm, the E-step and the M-step are iterated until convergence and output the estimated activity matrix .

Calculation of signature burdens

To calculate the expected number of mutations attributed to a signature (referred to as the signature burden) E = [E_kn], we use the estimated activity matrix from the EM algorithm. The signature burden E_kn of the k^th signature in the n^th tumor is then calculated as the sum of the product of the panel size 𝓁_pn, the reference signature profile and the estimated signature activity across 96 SBS types as .

Relationship between signatures of TMB and TMC

The D_KL(V|W^′, H) in equation (2) with highlights the relationship between signatures of TMB and TMC: TMB signature profile w_pk normalizes TMC signature profile , by the number of mutation context . This means that we can create a catalogue of TMB signatures based on WGS, dividing the catalogue of TMC signatures (e.g., COSMIC WGS reference TMC signatures¹) by the number of trinucleotide contexts from which the mutation type could occur in the human reference whole genome.

Create a catalogue of reference TMB signatures

To create a catalogue of reference TMB signatures in Supplementary Table 2, we normalize COSMIC signature profiles of TMC by the size of mutation contexts in the whole genome. This is done by following these steps:

Download the COSMIC SBS and DBS signature profiles (version 3.2) from https://cancer.sanger.ac.uk/signatures/
For an SBS signature profile, divide the level of each mutation type (e.g., A[C > G]G) by the size of the corresponding mutation context (e.g., ACG for A[C > G]G) that can occur in the whole genome (Supplementary Table 3)
Rescale 96 mutation types to sum to one.
Similarly, create DBS signature profiles of TMB (Supplementary Table 4) based on COSMIC DBS signature profiles of TMC and the number of genomic contexts (Supplementary Table 5) for which DBS can occur.

Shannon equitability index of TMB mutational signatures

We use Shannon equitability index to measure the diversity or “flatness” of a signature profile^48,49. The index is calculated as where w_p is the level of signature profile at the pth mutation type and the sum across all mutation types (i=1 to 96) is equal to 1.

A higher value of the index indicates a more even distribution of mutation types. The index ranges from 0 to 1, with a value of 1 indicating a completely flat signature profile where all mutation types are represented equally and a value of 0 indicating a signature profile with a single dominant “spike” where a single mutation type has a proportion of 1 and all other mutation types have a proportion of 0. Among SBS TMB signatures with n = 96, some profiles are characterized by a few specific mutation types at high levels, referred to as “spikes” (e.g., SBS1 with the C>T substitution at the NCG trinucleotide has a Shannon equitability index of 0.317, and SBS10a with the T[C>A]T substitution has a Shannon equitability index of 0.192). Other signature profiles are more evenly distributed across all mutation types, referred to as “flat” (e.g., SBS3 has a Shannon equitability index of 0.974, SBS5 has a Shannon equitability index of 0.903, and SBS40 has a Shannon equitability index of 0.969).

Generation of pseudo-targeted sequencing data

To investigate the impact of various factors on mutational signature detection, we created pseudo-targeted sequencing datasets using the TCGA WES studies^1,24 and the Sanger breast cancer (BRCA) 560 WGS study²⁵. We assume that targeted sequencing would identify SBSs that are identified in the WES or WGS studies, as long as SBSs are located within the targeted genomic regions of the panels. This assumption is reasonable since targeted sequencing typically provides much higher sequencing coverage than WES or WGS. The steps to generate the simulated data are outlined below:

Download the TCGA WES data (mc3.v0.2.8.PUBLIC) from the Cancer Genome Data Portal (https://gdc.cancer.gov/about-data/publications/mc3-2017) and WGS data of Sanger BRCA560 study (Caveman_560_20Nov14_clean) from ftp://ftp.sanger.ac.uk/pub/cancer/Nik-ZainalEtAl-560BreastGenomes.
Download the genomic information file of AACR Project GENIE (https://www.synapse.org/#!Synapse:syn26706790) which specifies the chromosome, start position, and end position of genomic regions for each targeted sequencing panel.
For SBS in WES or WGS studies, select those located in the genomic regions of a targeted sequencing panel to create the SBS mutation type matrix as pseudo-targeted sequencing data. We generated 648 pseudo-targeted sequencing datasets, encompassing 18 TCGA WES cancer types and 36 targeted sequencing panels (panel size: 0.05 Mb to 9.95 Mb). Each TCGA WES cancer type has at least 200 samples, ensuring a sufficient sample size for evaluating the signature detection step of SATS. Note that certain WES cases lacked SBS within the genes covered by a targeted sequencing panel. Hence, the number of cases in the pseudo-targeted sequencing data is less than that of TCGA WES studies (Supplementary Table 8). For instance, out of the 208 TCGA sarcoma WES cases, each panel could only detect SBS in a subset of cases (e.g., 169 cases for the DFCI-ONCOPANEL-3 panel and 172 cases for the MSK-IMPACT505 panel). Similarly, we generated 36 pseudo-targeted sequencing datasets based on 560 breast tumors with WGS data, using 36 targeted gene panels.

Analysis of pseudo-targeted sequencing data

We calculate the signature detection probability, which represents the percentage of common signatures detected by 36 targeted sequencing panels within a cancer type. Next, we employ a generalized linear mixed model (GLMM) to analyze the factors that influence the detection probability of TMB mutational signatures across 648 pseudo-targeted sequencing datasets (by 18 TCGA cancer types and 36 targeted sequencing panels). The GLMM incorporates several fixed effects, including the flatness of the signature profile (quantified using the Shannon equitability index), the prevalence of the mutational signature in the TCGA WES study (as a percentage of SBS attributed to the signature), and the panel size (per megabase). To account for any variation in the results due to the different cancer types under investigation, we include cancer type as a random intercept in the model.

Evaluation of the impact of sample sizes

We conducted an in silico simulation to investigate the effect of sample size on the ability to detect mutational signatures in breast cancer. The simulation was executed using varying sample sizes, ranging from one thousand to up to one million samples, based on the panel context matrix, signatures profile matrix (consisting of 12 mutational signatures with at least 1% prevalence in the TCGA breast cancer study), and signature activity matrix (following the distributions of the signature activity matrix of the TCGA breast cancer study).

We run signature refitting on the TCGA breast cancer dataset (accessible at https://www.synapse.org/#!Synapse:syn11726618), using12 known mutational signatures (SBS1, 2, 3, 5, 7a, 10a, 10b, 13, 15, 29, 30, 44 and 58) that have a prevalence greater than 1% (based on https://www.synapse.org/#!Synapse:syn11801497). Specifically, we applied the EM algorithm to estimate the signature activity matrix , from the mutation type matrix V_B, the panel context matrix L_WES of the whole exome sequencing, and the pre-defined TMB signature matrix .
We simulated mutation type matrix for 21 targeted sequencing panels with a panel size larger than 1Mb, using a range of sample sizes from 1000 tumors to 1 million tumors. Specifically, we simulated mutation type matrix from a Poisson distribution with the mean , where L_S represents the panel size matrix for a given targeted sequencing panel (S), is sampled from the estimated signature activity matrix . As the activities of APOBEC signatures SBS2 and SBS13 are highly correlated, their activities were jointly sampled. Finally, we excluded any tumors with zero mutation count.
We applied signeR to extract de novo signatures from the simulated mutation type matrix . Then, we employed penalized non-negative least squares to select the mapped reference TMB signatures, . Finally, we estimated signature activities and burdens using the EM algorithm.
To evaluate the ability to detect the pre-specified signatures, we analyzed the proportion of 21 panels that were able to rediscover the prespecified 12 mutational signatures using SATS. We also tracked the probability of detecting false positive signatures that were not used to simulate mutation counts.

Evaluation of SATS by in silico simulations

We conducted in silico simulations to evaluate whether SATS can detect and estimate the prespecified signatures in simulated datasets. When multiple flat signatures were present (e.g., SBS5/40 in lung cancer and SBS3/5/40 in lymphoid-derived hematologic cancer), we combined them into a single flat signature as the sample size of the AACR Project GENIE is insufficient to distinguish between these flat signatures accurately.

We first calculated the expectation matrix , where ∘ denotes element-wise product, L_c a panel size matrix, a signature profile matrix and a signature activity matrix of a cancer type (c). The matrices and were estimated from the AACR Project GENIE, allowing us to generate simulated data that accurately reflects actual observations.
We generated ten replicates of the mutation type matrix for lung cancer, breast cancer, colorectal cancer, and lymphoid-derived hematologic cancer respectively by simulating data from the Poisson distribution using expectation matrix . The number of simulated samples is the same as in the corresponding AACR Project GENIE studies.
We applied signeR and penalized non-negative least squares to estimate TMB signatures for each simulated mutation type matrix . We then compared these estimated signatures with the ground truth signatures .
Using the simulated mutation type matrices , panel size matrices (L_c), and estimated TMB signatures , we estimated the signature activity matrix for all tumors using the EM algorithm. We then calculated the signature burden based on , which is compared with the simulated signature burden as the ground truth. For lung cancer, we also estimated for a subset of samples or even for one sample, as detailed in the Supplementary Note.

Applications of other methods for in silico simulations

We applied SigProfilerExtractor¹² on the simulated mutation type matrix to extract signatures. Considering the computational intensity of SigProfilerExtractor, we grouped the simulated samples in . To obtain the mapped COSMIC signatures (version 3.2), we ran SigProfilerExtractor with its default options, setting the number of signatures to be extracted within the range of 1 to 10. After extracting these signatures, we utilized SigProfilerAssignment¹⁴ to compute the corresponding signature burdens based on the mapped COSMIC signatures.

Moreover, we applied Mix¹⁸ to the same simulated data. Mix clustered samples while simultaneously learning the mixture model; hence we used the original simulated matrix as the input. We specified the number of clusters and signatures ranges from 1 to 20 and 1 to 10, respectively. Bayesian Information Criterions (BICs) were computed to identify the optimal combination of the numbers of clusters and signatures. Under the selected optimal combination, the exposure matrix was calculated. As Mix lacks a mapping step to COSMIC signatures, we annotated Mix signatures as COSMIC signatures with the highest cosine similarity. To obtain the signature burden, the total number of mutations was multiplied by the exposure matrix.

Validation of SATS in tumors with both WGS and targeted sequencing

We analyzed 72 kidney tumors with both WGS and targeted sequencing. For WGS, genomic DNA was extracted from fresh frozen tissue using the QIAmp DNA mini kit (Qiagen) and subsequently sequenced on the Illumina HiSeqX platform. The mean sequencing depth was 65.7x for tumor tissue and 40.1x for normal tissue. For targeted sequencing, genomic DNA was purified using Agencourt AMPure XP Reagent (Beckman Coulter Inc., Brea, CA, USA). A targeted driver gene panel (size 1.90 Mb) was designed, encompassing 254 candidate cancer driver genes⁵⁰. The targeted sequences were captured by NimbleGen’s SeqCap EZ Choice (custom design; Roche NimbleGen, Inc., Madison, WI, USA). Subsequent targeted sequencing was conducted on an Illumina HiSeq 4000, achieving a mean depth of 500x for both tumor and normal tissue. The details of whole-genome and targeted sequencing and sequencing data preprocessing, alignment and somatic mutation calling were described previously²⁶.

We utilized SATS, SigProfilerAssignment¹⁴, Mix¹⁸ and DeconstructSigs¹³ to refit the common signatures, namely SBS1, SBS5, and SBS40, estimating their burdens in the targeted sequencing data. Next, we calculated signature burdens for SBS1 and the combined flat signatures (SBS5/40) in the targeted sequencing data and compared these with the corresponding signature burdens derived from WGS. This comparison allowed us to assess the consistency of estimating mutational signature burdens between the two sequencing methods.

Visualization of SBS mutational signature profiles using UMAP

We applied UMAP⁵¹ (Uniform Manifold Approximation and Projection) to reduce the dimensionality of SBS mutational signature profiles and visualized tumors in a two-dimensional scatterplot. To facilitate visualization, tumors within each cancer type are first clustered based on their mutational signature burden profiles using hierarchical clustering with Ward’s minimum variance method. A cut-off value of 0.05 is applied for clustering. The UMAP projections are computed based on the median signature burden for each cluster of tumors, with each dot representing a group of tumors with similar signature burden profiles.

Association between mutational signatures and the risk of early-onset colorectal cancer

To investigate the relationship between mutational signatures and the risk of early-onset colorectal cancer (sequencing age < 50 years), we analyzed the mutational signature profiles of 9,562 colorectal cancer patients along with their clinical data. We stratified the patients into non-hypermutated cases (TMB < 10 mutations/Mb, early-onset: 2,495 cases; late-onset: 6,009 cases) and hypermutated cases (early-onset: 790 cases; late-onset: 268 cases).

We employed a generalized linear mixed model (GLMM) to compare early-onset vs late-onset cases for the non-hypermutated and hypermutated colorectal cancer, respectively. The GLMM incorporated tumor status (primary vs metastasis), race, sex, and individual signatures’ TMBs (SBS1, SBS6, SBS10a, SBS10b, SBS44, and flat SBS) as fixed effects, while center and subtype were considered random effects. The GLMMs were fitted using the ‘lme4’ package in R.

Analysis of mutational signatures as prognostic biomarkers

To assess the prognostic implications of mutational signatures obtained from targeted sequencing, we applied mixed-effect Cox models for each specific cancer type individually. The model is represented as: where t denotes overall survival (OS) time, measured in years from the age of targeted sequencing to the last vital status update (alive or deceased). λ₀(t) represents the baseline hazard function. We modeled η as: The mixed-effect Cox model was fitted for one SBS signature at a time, including the presence of an SBS signature, tumor status (primary vs. metastasis), age, race (when applicable), and sex (when relevant) as fixed effects. Additionally, we include center and subtype (if applicable) as random effects with random effect coefficients and , where N(0, σ²) denote a normal distribution with mean zero and variance σ². The ‘coxme’ package in R was utilized to fit the mixed-effect Cox models. Notably, for breast, ovarian, endometrial, other gynecologic cancers, and prostate cancer, the mixed-effect Cox models do not include the variable of sex.

In order to examine the racial differences in hazard ratios of SBS44 signature in colorectal cancer, we employed mixed-effect Cox models that incorporated interactions between SBS44 signature and race. The likelihood ratio test was utilized to compare models with and without the interaction term. Furthermore, we performed stratification by race groups, fitting separate mixed-effect Cox models for each race group to obtain race-specific hazard ratio estimates.

Analysis of mutational signatures as immunotherapy predictive biomarkers

The AACR Project GENIE Biopharma Collaborative released comprehensive clinical data for 1,846 non-small cell lung cancer (NSCLC) patients (v2.0-public dataset) from four hospitals (MSKCC, DFCI, VICC and UHN). Within this dataset, we focused on a subset of 470 NSCLC patients who underwent treatment with immune checkpoint inhibitor (ICI), including Pembrolizumab, Nivolumab, Atezolizumab, and Durvalumab. We aimed to investigate the predictive value of mutational signature-specific TMB for ICI immunotherapy.

We employed a mixed-effect Cox model to analyze progression-free survival, which measures the time interval between the initiation of ICI treatment and the radiologist or oncologist’s assessment of cancer progression or patient death. In the mixed-effect Cox model, we included variables including age, sex, smoking history, stage (IV vs. others), and individual mutational signature-specific TMBs (SBS1, SBS2/13, SBS4/29, SBS89, and flat SBS) as fixed effects. Additionally, we considered the hospital and subtype as random effects. We also applied the same model to evaluate overall survival, which measures the time interval between the start of ICI treatment and patient death or the last follow-up date.

Software

The R package SATS is publicly available at https://github.com/binzhulab/SATS.

Data Availability

The AACR Project GENIE dataset could be retrieved from Synapse (https://synapse.org/genie)

https://github.com/binzhulab/SATS

Supplementary figure legends

Supplementary Fig. 1 Mutation profiles and contexts in the human whole genome. a. Profiles of SBS5 signature based on tumor mutation count (TMC) and tumor mutation burden (TMB), respectively, with 96 mutation types on the x-axis and contributions on the y-axis. b. 32 mutation contexts for single base substitutions (SBS) in the human whole genome, with significantly depleted ACG, TCG, GCG, and CCG trinucleotides. c. Scatterplot of the Shannon equitability index of TMC and TMB signature profiles. The black line represents the diagonal line, the blue line the linear regression line, and the shaded area is 95% confidence intervals. The dots are annotated when the differences of Shannon equitability index between TMC and TMB signature profiles are more than 0.1. d. Profiles of SBS10b and SBS15 signatures based on TMC and TMB, respectively.

Supplementary Fig. 2 Boxplots of mutation context ratios between targeted sequences vs. whole genome sequence. Each dot represents the ratio between the proportion of a mutation context (e.g., CCG) in the genomic regions targeted by a sequencing panel and the proportion of the same mutation context in the whole genome.

Supplementary Fig. 3 Results of pseudo-targeted sequencing data based on Sanger breast cancer 560 WGS study. a. The Shannon equitability index of the signature profile and the percentage of single base substitutions (SBS) attributed by a signature for the Sanger breast cancer (BRCA) 560 WGS study. The signatures with > 5% prevalence are highlighted in color, while others are shown in gray. b. The odds ratio of the Shannon equitability index of the signature profile, frequency of Sanger BRCA 560 WGS study signatures, and panel size. The bars represent 95% confidence intervals (CIs). c. The median of the Pearson correlation coefficient for all common WGS signatures detected. The blue curve represents LOWESS (Locally Weighted Scatterplot Smoothing) curve with the shaded area for 95% CIs.

Supplementary Fig. 4 Results of pseudo-targeted sequencing data based on TCGA WES study. a. Scatterplot of the number of mutations attributed by the smoking-related signature SBS4 (on the y-axis) in the MSK-IMPACT 468 panel compared to the number of mutations attributed by the same signature in the TCGA WES study (on the x-axis). The black line represents the ratio of the MSK-IMPACT 468 panel size to the WES size. The blue line represents the linear regression line and the shaded area represents 95% confidence intervals (CIs). b. and c. Medians of Pearson correlation coefficients for signatures SBS4 (b) and SBS7a/b (c) are shown. Pearson correlation coefficient measures the correlation between the number of mutations attributed to a signature using the pseudo-targeted sequencing data and the number of mutations attributed to the same signature reported previously in the TCGA WES study. The curve represents LOWESS (Locally Weighted Scatterplot Smoothing) curve, with the shaded area representing 95% CIs.

Supplementary Fig. 5 Impact of sample sizes on mutational signature detection. a. The scatterplot of the flatness (measured by Shannon equitability index) of signature profiles and percentage of single base substitutions (SBS) attributed to the signatures in the TCGA breast cancer (BRCA) WES study. The reference line with Shannon equitability index one refers to the theoretical maxima (by a completely flat signature). b. The probability of signature detection. Each dot represents the probability of signature detection, measuring the proportion of 21 targeted sequencing panels (with panel size larger than 1Mb) that can identify the signature at a given sample size. The blue line is the LOWESS (Locally Weighted Scatterplot Smoothing) curve, and the shaded area represents 95% confidence intervals (CIs).

Supplementary Fig. 6 The proportion of tumors in which the mutation signatures were detected in simulated breast cancer data. Mapped signatures SBS1, SBS2/13, and SBS5 were used for simulation. Mapped signatures or WES-based signatures were used for signature refitting, respectively. WES-based signatures include SBS1, SBS2/13, SBS5, SBS3, SBS7a, SBS10a, SBS10b, SBS15, SBS29, SBS30, SBS44 and SBS58, which are present in more than 1% of mutations in TCGA WES breast cancer study. The horizontal lines represent the actual proportions in the simulated data. The error bars in the figure represent the mean plus or minus one standard deviation.

Supplementary Fig. 7 UMAP visualization of SBS mutational signature profiles. The UMAP (Uniform Manifold Approximation and Projection) scatterplot displays 2-dimensional projections of the signature burden profiles for tumors obtained from the Dana-Farber Cancer Institute. The colors represent cancer types.

Supplementary Fig. 8 The associations between targeted sequencing-based mutational signatures and immunotherapy response. a. The proportion of lung cancer patients who remain progression free after receiving immune checkpoint inhibitors (ICI), stratified by high (> 10 Mut/Mb) or low overall TMB levels. The P-value was calculated using the log-rank test. b. The hazard ratio (HR) of TMB attributed to individual signatures for overall survival (PFS) in lung cancer patients treated with ICI.

Supplementary Fig. 9 Detection probability of TCGA breast cancer signatures in simulations. The probability of signature detection is shown for increasing sample sizes (a. up to 10,000 tumors, b. up to 1,000,000 tumors). Each dot represents the proportion of targeted sequencing panels (with a panel size larger than 1Mb) that are able to identify the signature at the corresponding sample size. Blue curves are LOWESS (Locally Weighted Scatterplot Smoothing) curves, and shaded areas represent 95% confidence intervals.

Supplementary Fig. 10 The Pearson correlation coefficient between simulated and estimated SBS signature expectancies in lung cancer with various sample sizes for signature refitting. The bars in the figure represent the mean Pearson correlation coefficient for simulation replicates, while the x-axis indicates the number of samples used for signature refitting, including 100 samples, 10 samples, or even one sample at a time. The error bars in the figure represent the mean plus or minus one standard deviation.

Acknowledgment

This research was conducted with support from Intramural Research Program of the National Institutes of Health, National Cancer Institute, Division of Cancer Epidemiology and Genetics (DCEG). Additionally, Dr. Lee was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT, No. RS-2023-00213625). The authors would like to acknowledge the American Association for Cancer Research and its financial and material support in the development of the AACR Project GENIE registry, as well as members of the consortium for their commitment to data sharing. Interpretations are the responsibility of study authors. This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD: https://biowulf.nih.gov. We would like to thank Bill Wheeler (Information Management Services) for computation support.

Footnotes

This version of the manuscript has been revised to update the following: The main title and author information are updated; The entire manuscript is revised for further clarification; Section for clinical application is newly included; Main Figures 1, 3, 5, 6 are revised according to the manuscript; and Supplementary figures, tables, and notes are also updated accordingly.

References

1.↵
Alexandrov, L.B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).
OpenUrl CrossRef PubMed
2.↵
Degasperi, A. et al. Substitution mutational signatures in whole-genome-sequenced cancers in the UK population. Science 376, abl9283 (2022).
OpenUrl
3.↵
Poon, S.L., McPherson, J.R., Tan, P., Teh, B.T. & Rozen, S.G. Mutation signatures of carcinogen exposure: genome-wide detection and new opportunities for cancer prevention. Genome Med 6, 24 (2014).
OpenUrl CrossRef PubMed
4.
Wan, J.C.M. et al. Genome-wide mutational signatures in low-coverage whole genome sequencing of cell-free DNA. Nat Commun 13, 4953 (2022).
OpenUrl
5.↵
Poon, S.L. et al. Genome-wide mutational signatures of aristolochic acid and its application as a screening tool. Sci Transl Med 5, 197ra101 (2013).
OpenUrl Abstract/FREE Full Text
6.↵
Van Hoeck, A., Tjoonk, N.H., van Boxtel, R. & Cuppen, E. Portrait of a cancer: mutational signature analyses for cancer diagnostics. BMC Cancer 19, 457 (2019).
OpenUrl CrossRef PubMed
7.↵
Davies, H. et al. HRDetect is a predictor of BRCA1 and BRCA2 deficiency based on mutational signatures. Nat Med 23, 517–525 (2017).
OpenUrl CrossRef PubMed
8.↵
Buisson, R., Lawrence, M.S., Benes, C.H. & Zou, L. APOBEC3A and APOBEC3B Activities Render Cancer Cells Susceptible to ATR Inhibition. Cancer Res 77, 4567–4578 (2017).
OpenUrl Abstract/FREE Full Text
9.↵
Koh, G., Degasperi, A., Zou, X., Momen, S. & Nik-Zainal, S. Mutational signatures: emerging concepts, caveats and clinical applications. Nat Rev Cancer 21, 619–637 (2021).
OpenUrl
10.↵
Alexandrov, L.B., Nik-Zainal, S., Wedge, D.C., Campbell, P.J. & Stratton, M.R. Deciphering signatures of mutational processes operative in human cancer. Cell Rep 3, 246–59 (2013).
OpenUrl CrossRef PubMed Web of Science
11.↵
Rosales, R.A., Drummond, R.D., Valieris, R., Dias-Neto, E. & da Silva, I.T. signeR: an empirical Bayesian approach to mutational signature discovery. Bioinformatics 33, 8–16 (2017).
OpenUrl CrossRef PubMed
12.↵
Islam, S.M.A. et al. Uncovering novel mutational signatures by de novo extraction with SigProfilerExtractor. Cell Genom 2, None (2022).
13.↵
Rosenthal, R., McGranahan, N., Herrero, J., Taylor, B.S. & Swanton, C. DeconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biol 17, 31 (2016).
OpenUrl CrossRef PubMed
14.↵
Diaz-Gay, M. et al. Assigning mutational signatures to individual samples and individual somatic mutations with SigProfilerAssignment. Bioinformatics 39(2023).
15.↵
Gulhan, D.C., Lee, J.J., Melloni, G.E.M., Cortes-Ciriano, I. & Park, P.J. Detecting the mutational signature of homologous recombination deficiency in clinical samples. Nat Genet 51, 912–919 (2019).
OpenUrl CrossRef PubMed
16.↵
Maura, F. et al. A practical guide for mutational signature analysis in hematological malignancies. Nat Commun 10, 2969 (2019).
OpenUrl
17.↵
Georgeson, P. et al. Identifying colorectal cancer caused by biallelic MUTYH pathogenic variants using tumor mutational signatures. Nature Communications 13, 3254 (2022).
OpenUrl
18.↵
Sason, I., Chen, Y., Leiserson, M.D.M. & Sharan, R. A mixture model for signature discovery from sparse mutation data. Genome Med 13, 173 (2021).
OpenUrl
19.↵
Consortium, A.P.G. AACR Project GENIE: Powering Precision Medicine through an International Consortium. Cancer Discov 7, 818–831 (2017).
OpenUrl Abstract/FREE Full Text
20.↵
Pugh, T.J. et al. AACR Project GENIE: 100,000 Cases and Beyond. Cancer Discov 12, 2044–2057 (2022).
OpenUrl CrossRef
21.↵
Tibshirani, R. Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society Series B-Statistical Methodology 73, 273–282 (2011).
OpenUrl CrossRef Web of Science
22.↵
Russell, G.J., Walker, P.M., Elton, R.A. & Subak-Sharpe, J.H. Doublet frequency analysis of fractionated vertebrate nuclear DNA. J Mol Biol 108, 1–23 (1976).
OpenUrl CrossRef PubMed Web of Science
23.↵
Bird, A.P. DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Res 8, 1499–504 (1980).
OpenUrl CrossRef PubMed Web of Science
24.↵
Bailey, M.H. et al. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell 174, 1034–1035 (2018).
OpenUrl CrossRef PubMed
25.↵
Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).
OpenUrl CrossRef PubMed
26.↵
Zhu, B. et al. The genomic and epigenomic evolutionary history of papillary renal cell carcinomas. Nat Commun 11, 3096 (2020).
OpenUrl CrossRef
27.↵
Stupp, R. et al. Effects of radiotherapy with concomitant and adjuvant temozolomide versus radiotherapy alone on survival in glioblastoma in a randomised phase III study: 5-year analysis of the EORTC-NCIC trial. Lancet Oncol 10, 459–66 (2009).
OpenUrl CrossRef PubMed Web of Science
28.↵
Farmanbar, A., Firouzi, S., Kneller, R. & Khiabanian, H. Mutational signatures reveal ternary relationships between homologous recombination repair, APOBEC, and mismatch repair in gynecological cancers. J Transl Med 20, 65 (2022).
OpenUrl
29.↵
Muller, P. et al. Why loss of Y? A pan-cancer genome analysis of tumors with loss of Y chromosome. Comput Struct Biotechnol J 21, 1573–1583 (2023).
OpenUrl
30.↵
Li, B. et al. Therapy-induced mutations drive the genomic landscape of relapsed acute lymphoblastic leukemia. Blood 135, 41–55 (2020).
OpenUrl CrossRef
31.↵
Holowatyj, A.N. et al. Racial/Ethnic and Sex Differences in Somatic Cancer Gene Mutations among Patients with Early-Onset Colorectal Cancer. Cancer Discovery 13, 570–579 (2023).
OpenUrl
32.↵
Fox, J.L., Rosenzweig, K.E. & Ostroff, J.S. The effect of smoking status on survival following radiation therapy for non-small cell lung cancer. Lung Cancer 44, 287–93 (2004).
OpenUrl CrossRef PubMed Web of Science
33.
Parsons, A., Daley, A., Begh, R. & Aveyard, P. Influence of smoking cessation after diagnosis of early stage lung cancer on prognosis: systematic review of observational studies with meta-analysis. BMJ 340, b5569 (2010).
OpenUrl Abstract/FREE Full Text
34.↵
Butler, K. & Banday, A.R. APOBEC3-mediated mutagenesis in cancer: causes, clinical significance and therapeutic potential. Journal of Hematology & Oncology 16(2023).
35.↵
Trucco, L.D. et al. Ultraviolet radiation-induced DNA damage is prognostic for outcome in melanoma. Nat Med 25, 221–224 (2019).
OpenUrl
36.↵
Domingo, E. et al. Somatic POLE proofreading domain mutation, immune response, and prognosis in colorectal cancer: a retrospective, pooled biomarker study. Lancet Gastroenterol Hepatol 1, 207–216 (2016).
OpenUrl
37.↵
Sinicrope, F.A. DNA mismatch repair and adjuvant chemotherapy in sporadic colon cancer. Nat Rev Clin Oncol 7, 174–7 (2010).
OpenUrl CrossRef PubMed Web of Science
38.↵
Kim, H., An, J.Y., Noh, S.H., Shin, S.K., Lee, Y.C. & Kim, H. High microsatellite instability predicts good prognosis in intestinal-type gastric cancers. J Gastroenterol Hepatol 26, 585–92 (2011).
OpenUrl CrossRef PubMed
39.↵
Smyth, E.C. et al. Mismatch Repair Deficiency, Microsatellite Instability, and Survival: An Exploratory Analysis of the Medical Research Council Adjuvant Gastric Infusional Chemotherapy (MAGIC) Trial. JAMA Oncol 3, 1197–1203 (2017).
OpenUrl
40.↵
Cancer Genome Atlas Research, N. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–15 (2011).
OpenUrl CrossRef PubMed Web of Science
41.↵
Kundra, R. et al. OncoTree: A Cancer Classification System for Precision Oncology. JCO Clin Cancer Inform 5, 221–230 (2021).
OpenUrl
42.↵
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20, 1297–303 (2010).
OpenUrl Abstract/FREE Full Text
43.↵
Saunders, C.T., Wong, W.S., Swamy, S., Becq, J., Murray, L.J. & Cheetham, R.K. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 28, 1811–7 (2012).
OpenUrl CrossRef PubMed Web of Science
44.↵
Karczewski, K.J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
OpenUrl CrossRef PubMed
45.↵
Karimnezhad, A. et al. Accuracy and reproducibility of somatic point mutation calling in clinical-type targeted sequencing data. BMC Med Genomics 13, 156 (2020).
OpenUrl
46.↵
Koboldt, D.C. Best practices for variant calling in clinical sequencing. Genome Med 12, 91 (2020).
OpenUrl CrossRef
47.↵
Févotte, C. & Cemgil, A.T. Nonnegative matrix factorizations as probabilistic inference in composite models. in 2009 17th European Signal Processing Conference 1913–1917 (2009).
48.↵
Islam, S.M.A. et al. Uncovering novel mutational signatures by de novo extraction with SigProfilerExtractor. Cell Genom 2, 100179 (2022).
OpenUrl
49.↵
Shannon, C.E. A mathematical theory of communication. The Bell system technical journal 27, 379–423 (1948).
OpenUrl CrossRef Web of Science
50.↵
Lawrence, M.S. et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495–501 (2014).
OpenUrl CrossRef PubMed Web of Science
51.↵
McInnes, L., Healy, J. & Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arxiv:1802.03426 (2018).

View the discussion thread.

Posted April 03, 2024.

Download PDF

Supplementary Material

Data/Code

Citation Tools

Subject Area

Genetic and Genomic Medicine

Subject Areas

All Articles

Addiction Medicine (394)
Allergy and Immunology (706)
Anesthesia (197)
Cardiovascular Medicine (2887)
Dentistry and Oral Medicine (327)
Dermatology (247)
Emergency Medicine (433)
Endocrinology (including Diabetes Mellitus and Metabolic Disease) (1020)
Epidemiology (12629)
Forensic Medicine (10)
Gastroenterology (814)
Genetic and Genomic Medicine (4489)
Geriatric Medicine (410)
Health Economics (717)
Health Informatics (2872)
Health Policy (1059)
Health Systems and Quality Improvement (1058)
Hematology (380)
HIV/AIDS (912)
Infectious Diseases (except HIV/AIDS) (14037)
Intensive Care and Critical Care Medicine (836)
Medical Education (418)
Medical Ethics (115)
Nephrology (466)
Neurology (4256)
Nursing (228)
Nutrition (625)
Obstetrics and Gynecology (796)
Occupational and Environmental Health (727)
Oncology (2229)
Ophthalmology (636)
Orthopedics (255)
Otolaryngology (323)
Pain Medicine (272)
Palliative Medicine (83)
Pathology (491)
Pediatrics (1183)
Pharmacology and Therapeutics (493)
Primary Care Research (489)
Psychiatry and Clinical Psychology (3699)
Public and Global Health (6840)
Radiology and Imaging (1504)
Rehabilitation Medicine and Physical Therapy (880)
Respiratory Medicine (910)
Rheumatology (431)
Sexual and Reproductive Health (435)
Sports Medicine (376)
Surgery (476)
Toxicology (60)
Transplantation (206)
Urology (176)

[1] 1.↵
Alexandrov, L.B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).
OpenUrl CrossRef PubMed

[2] 2.↵
Degasperi, A. et al. Substitution mutational signatures in whole-genome-sequenced cancers in the UK population. Science 376, abl9283 (2022).
OpenUrl

[3] 3.↵
Poon, S.L., McPherson, J.R., Tan, P., Teh, B.T. & Rozen, S.G. Mutation signatures of carcinogen exposure: genome-wide detection and new opportunities for cancer prevention. Genome Med 6, 24 (2014).
OpenUrl CrossRef PubMed

[4] 4.
Wan, J.C.M. et al. Genome-wide mutational signatures in low-coverage whole genome sequencing of cell-free DNA. Nat Commun 13, 4953 (2022).
OpenUrl

[5] 5.↵
Poon, S.L. et al. Genome-wide mutational signatures of aristolochic acid and its application as a screening tool. Sci Transl Med 5, 197ra101 (2013).
OpenUrl Abstract/FREE Full Text

[6] 6.↵
Van Hoeck, A., Tjoonk, N.H., van Boxtel, R. & Cuppen, E. Portrait of a cancer: mutational signature analyses for cancer diagnostics. BMC Cancer 19, 457 (2019).
OpenUrl CrossRef PubMed

[7] 7.↵
Davies, H. et al. HRDetect is a predictor of BRCA1 and BRCA2 deficiency based on mutational signatures. Nat Med 23, 517–525 (2017).
OpenUrl CrossRef PubMed

[8] 8.↵
Buisson, R., Lawrence, M.S., Benes, C.H. & Zou, L. APOBEC3A and APOBEC3B Activities Render Cancer Cells Susceptible to ATR Inhibition. Cancer Res 77, 4567–4578 (2017).
OpenUrl Abstract/FREE Full Text

[9] 9.↵
Koh, G., Degasperi, A., Zou, X., Momen, S. & Nik-Zainal, S. Mutational signatures: emerging concepts, caveats and clinical applications. Nat Rev Cancer 21, 619–637 (2021).
OpenUrl

[10] 10.↵
Alexandrov, L.B., Nik-Zainal, S., Wedge, D.C., Campbell, P.J. & Stratton, M.R. Deciphering signatures of mutational processes operative in human cancer. Cell Rep 3, 246–59 (2013).
OpenUrl CrossRef PubMed Web of Science

[11] 11.↵
Rosales, R.A., Drummond, R.D., Valieris, R., Dias-Neto, E. & da Silva, I.T. signeR: an empirical Bayesian approach to mutational signature discovery. Bioinformatics 33, 8–16 (2017).
OpenUrl CrossRef PubMed

[12] 12.↵
Islam, S.M.A. et al. Uncovering novel mutational signatures by de novo extraction with SigProfilerExtractor. Cell Genom 2, None (2022).

[13] 13.↵
Rosenthal, R., McGranahan, N., Herrero, J., Taylor, B.S. & Swanton, C. DeconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biol 17, 31 (2016).
OpenUrl CrossRef PubMed

[14] 14.↵
Diaz-Gay, M. et al. Assigning mutational signatures to individual samples and individual somatic mutations with SigProfilerAssignment. Bioinformatics 39(2023).

[15] 15.↵
Gulhan, D.C., Lee, J.J., Melloni, G.E.M., Cortes-Ciriano, I. & Park, P.J. Detecting the mutational signature of homologous recombination deficiency in clinical samples. Nat Genet 51, 912–919 (2019).
OpenUrl CrossRef PubMed

[16] 16.↵
Maura, F. et al. A practical guide for mutational signature analysis in hematological malignancies. Nat Commun 10, 2969 (2019).
OpenUrl

[17] 17.↵
Georgeson, P. et al. Identifying colorectal cancer caused by biallelic MUTYH pathogenic variants using tumor mutational signatures. Nature Communications 13, 3254 (2022).
OpenUrl

[18] 18.↵
Sason, I., Chen, Y., Leiserson, M.D.M. & Sharan, R. A mixture model for signature discovery from sparse mutation data. Genome Med 13, 173 (2021).
OpenUrl

[19] 19.↵
Consortium, A.P.G. AACR Project GENIE: Powering Precision Medicine through an International Consortium. Cancer Discov 7, 818–831 (2017).
OpenUrl Abstract/FREE Full Text

[20] 20.↵
Pugh, T.J. et al. AACR Project GENIE: 100,000 Cases and Beyond. Cancer Discov 12, 2044–2057 (2022).
OpenUrl CrossRef

[21] 21.↵
Tibshirani, R. Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society Series B-Statistical Methodology 73, 273–282 (2011).
OpenUrl CrossRef Web of Science

[22] 22.↵
Russell, G.J., Walker, P.M., Elton, R.A. & Subak-Sharpe, J.H. Doublet frequency analysis of fractionated vertebrate nuclear DNA. J Mol Biol 108, 1–23 (1976).
OpenUrl CrossRef PubMed Web of Science

[23] 23.↵
Bird, A.P. DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Res 8, 1499–504 (1980).
OpenUrl CrossRef PubMed Web of Science

[24] 24.↵
Bailey, M.H. et al. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell 174, 1034–1035 (2018).
OpenUrl CrossRef PubMed

[25] 25.↵
Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).
OpenUrl CrossRef PubMed

[26] 26.↵
Zhu, B. et al. The genomic and epigenomic evolutionary history of papillary renal cell carcinomas. Nat Commun 11, 3096 (2020).
OpenUrl CrossRef

[27] 27.↵
Stupp, R. et al. Effects of radiotherapy with concomitant and adjuvant temozolomide versus radiotherapy alone on survival in glioblastoma in a randomised phase III study: 5-year analysis of the EORTC-NCIC trial. Lancet Oncol 10, 459–66 (2009).
OpenUrl CrossRef PubMed Web of Science

[28] 28.↵
Farmanbar, A., Firouzi, S., Kneller, R. & Khiabanian, H. Mutational signatures reveal ternary relationships between homologous recombination repair, APOBEC, and mismatch repair in gynecological cancers. J Transl Med 20, 65 (2022).
OpenUrl

[29] 29.↵
Muller, P. et al. Why loss of Y? A pan-cancer genome analysis of tumors with loss of Y chromosome. Comput Struct Biotechnol J 21, 1573–1583 (2023).
OpenUrl

[30] 30.↵
Li, B. et al. Therapy-induced mutations drive the genomic landscape of relapsed acute lymphoblastic leukemia. Blood 135, 41–55 (2020).
OpenUrl CrossRef

[31] 31.↵
Holowatyj, A.N. et al. Racial/Ethnic and Sex Differences in Somatic Cancer Gene Mutations among Patients with Early-Onset Colorectal Cancer. Cancer Discovery 13, 570–579 (2023).
OpenUrl

[32] 32.↵
Fox, J.L., Rosenzweig, K.E. & Ostroff, J.S. The effect of smoking status on survival following radiation therapy for non-small cell lung cancer. Lung Cancer 44, 287–93 (2004).
OpenUrl CrossRef PubMed Web of Science

[33] 33.
Parsons, A., Daley, A., Begh, R. & Aveyard, P. Influence of smoking cessation after diagnosis of early stage lung cancer on prognosis: systematic review of observational studies with meta-analysis. BMJ 340, b5569 (2010).
OpenUrl Abstract/FREE Full Text

[34] 34.↵
Butler, K. & Banday, A.R. APOBEC3-mediated mutagenesis in cancer: causes, clinical significance and therapeutic potential. Journal of Hematology & Oncology 16(2023).

[35] 35.↵
Trucco, L.D. et al. Ultraviolet radiation-induced DNA damage is prognostic for outcome in melanoma. Nat Med 25, 221–224 (2019).
OpenUrl

[36] 36.↵
Domingo, E. et al. Somatic POLE proofreading domain mutation, immune response, and prognosis in colorectal cancer: a retrospective, pooled biomarker study. Lancet Gastroenterol Hepatol 1, 207–216 (2016).
OpenUrl

[37] 37.↵
Sinicrope, F.A. DNA mismatch repair and adjuvant chemotherapy in sporadic colon cancer. Nat Rev Clin Oncol 7, 174–7 (2010).
OpenUrl CrossRef PubMed Web of Science

[38] 38.↵
Kim, H., An, J.Y., Noh, S.H., Shin, S.K., Lee, Y.C. & Kim, H. High microsatellite instability predicts good prognosis in intestinal-type gastric cancers. J Gastroenterol Hepatol 26, 585–92 (2011).
OpenUrl CrossRef PubMed

[39] 39.↵
Smyth, E.C. et al. Mismatch Repair Deficiency, Microsatellite Instability, and Survival: An Exploratory Analysis of the Medical Research Council Adjuvant Gastric Infusional Chemotherapy (MAGIC) Trial. JAMA Oncol 3, 1197–1203 (2017).
OpenUrl

[40] 40.↵
Cancer Genome Atlas Research, N. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–15 (2011).
OpenUrl CrossRef PubMed Web of Science

[41] 41.↵
Kundra, R. et al. OncoTree: A Cancer Classification System for Precision Oncology. JCO Clin Cancer Inform 5, 221–230 (2021).
OpenUrl

[42] 42.↵
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20, 1297–303 (2010).
OpenUrl Abstract/FREE Full Text

[43] 43.↵
Saunders, C.T., Wong, W.S., Swamy, S., Becq, J., Murray, L.J. & Cheetham, R.K. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 28, 1811–7 (2012).
OpenUrl CrossRef PubMed Web of Science

[44] 44.↵
Karczewski, K.J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
OpenUrl CrossRef PubMed

[45] 45.↵
Karimnezhad, A. et al. Accuracy and reproducibility of somatic point mutation calling in clinical-type targeted sequencing data. BMC Med Genomics 13, 156 (2020).
OpenUrl

[46] 46.↵
Koboldt, D.C. Best practices for variant calling in clinical sequencing. Genome Med 12, 91 (2020).
OpenUrl CrossRef

[47] 47.↵
Févotte, C. & Cemgil, A.T. Nonnegative matrix factorizations as probabilistic inference in composite models. in 2009 17th European Signal Processing Conference 1913–1917 (2009).

[48] 48.↵
Islam, S.M.A. et al. Uncovering novel mutational signatures by de novo extraction with SigProfilerExtractor. Cell Genom 2, 100179 (2022).
OpenUrl

[49] 49.↵
Shannon, C.E. A mathematical theory of communication. The Bell system technical journal 27, 379–423 (1948).
OpenUrl CrossRef Web of Science

[50] 50.↵
Lawrence, M.S. et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495–501 (2014).
OpenUrl CrossRef PubMed Web of Science

[51] 51.↵
McInnes, L., Healy, J. & Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arxiv:1802.03426 (2018).