Abstract
Background Germline variants have traditionally been used for determining carrier status in heritable cancer families. However, studies are increasingly finding use of germline variants as therapeutic, diagnostic and prognostic biomarkers used to guide therapy decisions in cancer treatment.
Objectives To study the prevalence of mutation specific oncogenic biomarkers in the Indian population and analyze their presence across disease cohorts
Materials and Methods We annotate the IndiGen data obtained from whole genome sequencing of 1029 self-declared healthy Indian individuals with biomarker information obtained from the OncoKB knowledgebase, a repository of evidence-based information about somatic biomarkers and structural alterations in patient tumors.
Results In our study we have discovered 34 biomarker variants of therapeutic actionability across 16 genes linked with 23 unique drugs or drug combinations in 23 unique types of cancer disease in the Indian population. In all, we have found 52 biomarker variants with 172 different biomarker types including therapeutic, resistance, diagnostic, and prognostic. We establish that 4.3% of the Indian population are carriers of therapeutic, and 2.13% are carriers of both diagnostic as well as prognostic germline biomarkers. Finally, we also establish the prevalence of 42 biomarker variants across the 23 genes in both AD and AR modes of inheritance in the Indian population.
Introduction
Traditionally, germline variants in cancer have been studied mainly as predisposition biomarkers for surveillance and early detection of hereditary cancers. However, germline mutations can also determine response to both targeted, as well as traditional cancer therapy, including its toxicity. As Next-generation sequencing becomes increasingly accessible, several recent studies have highlighted the importance of studying germline variants as therapeutic markers. Several recent studies have observed germline mutations in cancer predisposition genes in sporadic cases with no family history of cancer1. Similarly, a pan-cancer analysis in 2017 demonstrated that approximately 17% of patients with advanced cancer had a pathogenic/likely pathogenic germline variant in a cancer susceptibility gene, with more than a quarter of these leading to consideration of change in therapy2.
The germline mutation burden is also observed to be high among pediatric cancer cases; A pan-cancer pediatric cohort analysis found that 6% of all patients carried causative germline variants, with 3% of all therapeutic biomarkers having a germline origin, with another pediatric study (CNS tumors) finding that 89% of participants harbored pharmacogenetic variants2. Additionally, pathogenic germline mutations have also been noted in a substantial number of sporadic pediatric cancer cases1. Thus application genomics is increasingly enabling the determination of the true burden of pathogenic germline mutations with therapeutic potential.
A common example of a biomarker-indicated drug is PARP inhibitors, approved for the treatment of metastatic breast cancer in germline BRCA mutation carriers3. Since germline mutations are clonal in nature (while a somatic mutation may be clonal as well as subclonal), a drug targeting a large clonal population has a higher chance of affecting a larger population of tumor cells, making it more important to study therapeutic implications of germline mutations4. Research efforts have also established variants as indicators of drug resistance/tolerance, efficacy, and toxicity (e.g. 5-fluorouracil related toxicity)4. Additionally, biomarkers can also offer diagnostic and prognostic value. Perhaps the most common example is the BCR-ABL1 fusion, which can used both for diagnosis of B-Lymphoblastic Leukemia/Lymphoma, and is also linked with a poor prognosis of the disease. Finally, the presence or absence of germline variants can guide surgical decisions (e.g. whether to perform mastectomy in women with BRCA1/2 mutations), as well as prophylactic decisions such as chemoprevention (e.g. prophylactic use of aspirin in carriers of Lynch syndrome)4,5.
In this study we utilize data from OncoKB6, a precision oncology knowledgebase that offers evidence-based information about somatic mutations and structural alterations in patient tumors, for annotating the IndiGen7 data, a database containing whole genome sequencing data of 1029 self-declared healthy Indian individuals from across 27 states. Using the resulting information, we calculate the prevalence of onco-biomarkers in the Indian population. We also query GUaRDIAN8, a nation-wide collaborative research initiative dedicated to the study of rare diseases to establish the presence of biomarkers across multiple cohorts. Finally, we compare our results with those obtained by cross referencing the MUSTARD9 resource, a database of mutation-specific therapies in cancer.
Materials and Methods
Biomarker Classification Based on Evidence-Based Clinical Actionability
We collected a list of 737 biomarker genes from the OncoKB database, and queried this list across the IndiGen compendium. We thus collected all germline variants reported across these biomarker genes. Next, using OncoKB API, we annotated this variant data with information regarding their biomarker status; OncoKB classifies variants into 4 main evidence based levels, each with several sub-levels of varying strength:
Therapeutic (Levels 1-7): Biomarkers predictive of response or resistance to a drug
Diagnostic (Levels 1-2): Biomarkers that facilitate diagnosis
Prognostic (Levels 1-3): Biomarkers that facilitate prognosis
FDA (Levels 1-3): Categorization of genetic tests by the FDA based on different levels of clinical significance
Population Based Prevalence Analysis
We mapped global as well as subpopulation allele frequencies across all variants that had OncoKB annotations. We queried 14 global populations including IndiGen, GnomAD10, Human Genome Diversity Project (HGDP)11, TogoVar (Japanese population)12, the China Metabolic Analytics Project (ChinaMAP)13, Qatar14, the Singapore Sequencing Indian Project (SSIP)15, the Singapore Sequencing Malay Project (SSMP)16, 1000Genomes17, the Greater Middle East (GME) Variome Project18, Korea1K Variome19, Taiwan Biobank20, the Vietnamese Genetic Variation Database21, and the Iranome22. We performed the Fisher’s exact test across each of these populations and sub-populations with respect to the IndiGen population to determine whether the allele frequencies were significantly different in any population.
Next, based on the number of individuals who were carriers of any of the biomarkers, we established the percentage of biomarker carriers in the Indian populations. Finally, we calculated the incidence and prevalence of both Autosomal Dominant (AD) as well as Autosomal Recessive (AR) disorders linked with each of the genes containing the variants with OncoKB annotations in the IndiGen population.
Disease Cohort Comparison
We queried all variants with OncoKB annotations across samples in the GUaRDIAN project8, a nation-wide collaborative research initiative that caters to rare diseases across multiple cohorts to determine cohort characteristics.
Comparison with MUSTARD
We have previously compiled MUSTARD9, an exhaustive compendium of mutation-specific therapies in cancer. We queried both mutations as well as structural variants present in the database across OncoKB, and analyzed the variants that returned annotations.
Results & Discussion
From our query of the IndiGen database, we found 17,00,332 variants across OncoKB biomarker genes. Upon annotating these, we obtained 2 Oncogenic, 330 Likely Oncogenic, 67 Likely Neutral, 32 Inconclusive and 16,99,900 Unknown variants. Of these 332 Oncogenic / Likely Oncogenic variants, 52 had highest level of annotations in any 4 of the evidence levels: a total of 34 variants had therapeutic biomarkers (28 Level 1, 3 Level 3A, 3 Level 4), 17 Diagnostic (11 Level Dx2, 6 Level Dx3), 11 Prognostic (5 Level Px1, 5 Level Px2, 1 Level Px3), and 34 had FDA levels (28 FDA Level 2, 6 FDA Level 3). Thus including all levels, 172 biomarker types (therapeutic, diagnostic, prognostic) were annotated across 27 unique genes, encompassing 23 unique drugs or drug combinations in 34 unique types of cancer disease in the Indian population. Complete details of the variants along with their annotations are shown in Supplementary Table 1.
Variants Classified Based on Therapeutic Actionability
Of the therapeutic biomarkers, we found 34 biomarker variants across 16 unique genes (Figure 1). These variants included 56 Level 1 associations (defined as an FDA-recognized biomarkers predictive of response to an FDA-approved drug), 16 Level 2 associations (defined as standard care biomarkers recommended by the NCCN or other professional guidelines predictive of response to an FDA-approved drug), 23 Level 3A associations (defined as compelling clinical evidence supporting the biomarker as being predictive of response to a drug), and 9 Level 4 biomarkers (defined as compelling biological evidence supporting the biomarker as being predictive of response to a drug). In all, we discovered the presence of 118 biomarkers (therapeutic, diagnostic, and prognostic) in these 34 variants across 23 unique types of cancer, associated with 23 unique drugs.
Table 1 describes all the diseases and drugs associated with all 13 genes bearing Level 1 associations.
Population Based Prevalence Analysis
We found 8 variants from 7 genes that were significant across different populations at p-value < 0.01. Of these 8, the variant ATM/H1380Y was significant across all populations except Singapore, however, it was present at AF > 0.05 in the IndiGen (AF=0.064) and GnomAD Global (AF=0.066) populations. Additionally, it is also annotated as benign by ClinVar23. All other variants had AF < 0.05 across all populations. The plot depicting all non missing allele frequencies for the 52 variants is shown in Figure 2. The yellow circles mark statistically significant variants.
Next, we calculated the prevalence of biomarkers across the Indian population using the IndiGen dataset. The variants with OncoKB annotations were linked with 27 genes. The ATM/H1380Y variant was removed from prevalence analyses. Additionally, 10 more variants had certain samples that did not pass QC, and calculations were made accordingly. Incidence and prevalence were calculated for each of the remaining 23 genes for both AD and AR modes of inheritance. Supplementary Table 2 shows these numbers along with the disorders associated with the genes according to OMIM24.
Notably, in the ERCC2 gene, 1 in about 85 individuals are expected to be carriers of pathogenic mutations linked with AD disease with known biomarkers. ERCC2 is traditionally linked with AR inheritance of Xeroderma pigmentosum (OMIM). However, ERCC2 mutations increase cisplatin sensitivity in bladder cancer cells, and can thus be used as a predictive biomarker in Muscle-Invasive Bladder Cancer25. Similarly, TP53 is associated with an AD inheritance of Bone marrow failure syndrome among other cancers (OMIM), and is identified as a prognostic biomarker in several hematological malignancies. For example, TP53-mutated AML is associated with a lower likelihood of response to conventional chemotherapy and with poor outcomes, with a median overall survival of 4 to 6 months, and a 2-year overall survival rate of <10%26.
In all, 44 Therapeutic (24 Level 1, 7 Level 2, 8 Level 3A and 5 Level 4), 16 diagnostic and 16 prognostic biomarkers were observed. These mainly encompassed hematological malignancies (46%), breast and ovarian cancers (13%), prostate cancers (13%), and Gastrointestinal Stromal Tumors (11.8%). Other malignancies included pancreatic, Medullary thyroid cancer, Mesenchymal tumors, Uterine, Bladder, central nervous system, and all solid tumors.
Prevalence in India Population
Nearly 7% of the population (72 unique individuals) were found to be carriers of any one of the four evidence based biomarkers. Further, 4.3% (44 unique individuals) were carriers of therapeutic, and 2.13% (22 unique individuals) were carriers of both diagnostic as well as prognostic biomarkers.
Disease Cohort Comparison
We discovered 10 unique variants across 365 samples from multiple cohorts in the GUaRDIAN data. These included all variants that were determined statistically significant except BRCA2/R2842C. Of these variants, two were annotated as diagnostic biomarkers: NOTCH2/M1? Which is a Diagnostic Level3 biomarker for Splenic Marginal Zone Lymphoma, and SUZ12/D725VfsTer18, a Diagnostic Level3 biomarker for Early T-Cell Precursor Lymphoblastic Leukemia. Of note, the SUZ12/D725VfsTer18 variant was present in several cohorts that represented diseases that increase the risk of hematologic malignancies, including Sjögren syndrome27, Mitochondrial Disorders28, Immunodeficiency29, Hyper IgE Syndromes30. The details of all 10 variants are shown in Supplementary Table 3.
Comparison with MUSTARD
We queried the mutations and structural variants present across 168 unique genes from the MUSTARD database across OncoKB, and found 164 variants (66 Likely Oncogenic, 96 Oncogenic and 2 Resistance) with annotations across at least one of 4 evidence levels.
194 variants had Therapeutic evidence levels (63 Level 1, 62 Level 2, 8 Level 3A, 16 Level 4, 38 Level R1 and 7 Level R2), 92 had Diagnostic (9 Level Dx1, 55 Level Dx2, 28 Level Dx3), 46 had Prognostic (45 Level Px1, 1 Level Px2), and 149 had FDA level evidence (100 FDA Level 2, 49 FDA Level 3). The details of the variants along with their annotations are shown in Supplementary Table 4.
We next mapped all MUSTARD variants against the GUaRDIAN data; we found an overlap of 5 unique variants in 5 different samples. Two of the 5 variants had Level 1 biomarkers for breast cancer, and also belonged to the breast cancer cohort.
Conclusions
In our study, we have calculated the prevalence of biomarkers implicated in cancer, that are of therapeutic, diagnostic and prognostic value in the Indian population. Some biomarkers such as ones for bladder, prostate and hematological malignancies are present in high numbers, and the information can be used to clinical advantage. However, only 23 of the 737 OncoKB genes queried had biomarkers in the Indian population. This could be because of differences in genetic makeup between different populations, and indicates a need for increased efforts to analyze biomarkers in variants prevalent in the Indian populations. Our comparison with the GUaRDIAN disease cohort data revealed concordance between two diagnostic biomarkers and their respective cohorts. 9 of the variants further lie in the top 10 most prevalent biomarker associated genes in the Indian population, showing further concordance between the two analyses. This illustrates the clinical advantage of studies like ours, where prevalence estimates can be used advantageously to guide biomarker research in a population specific manner.
Finally, our mapping of the MUSTARD database with the OncoKB knowledgebase yielded a greater overlap of variants than with the IndiGen data. Upon mapping MUSTARD with the GUaRIDIAN data, we found a similar concordance between the biomarkers and their respective cohorts, despite the small sample size.
Data Availability
All data produced in the present work are contained in the manuscript.
Declaration Of Interests
The authors declare no competing interests.
Author Contributions
VS conceptualized, designed and supervised the study. AV performed the analysis and complied the manuscript.
Data Sharing And Availability
All data produced in the present work are contained in the manuscript.
Acknowledgements
Authors acknowledge funding from the Council of Scientific and Industrial Research (CSIR) through CNP-0007 Grant. The funders had no role in the preparation of the manuscript or decision to publish.
Footnotes
Grant Sponsors: This work was supported by the CNP-0007 project funded by the Council for Scientific and Industrial Research (CSIR-India)
Conflict of interest: The authors declare that they have no conflict of interest.