SUMMARY
The peopling of the Americas by ancestral Native American populations involved many individuals settling in the Peruvian Andes and Amazonian regions. Because Latin American countries represent less than 1% of the human genome data available in public reference databases, the evolutionary and migration processes involved in adapting to this unique geography have not yet been fully explained. The Peruvian Genome Project is an initiative, started in 2011, to address the underrepresentation of genomic data from native South American populations. This project has collected 1,149 samples from 17 traditional native and 13 mestizo (admixed native, African, and European ancestry) communities. Currently, 150 individuals have been whole-genome sequenced and 873 have been array-genotyped from across the geography of Western South America, including coastal, Andean, and Amazonian regions. We discovered 1.6 million novel genetic variants with varying frequencies, indicative of local environmental adaptations and population drift. These novel variants allow us to infer local evolutionary traits and population-specific allele frequencies for people living at different altitudes, as well as varying adaptations to pathogens and living conditions. The Peruvian Genome Project is the result of over a decade of work in sample selection, logistics, and regulator-approved community engagement, designed to expand the representation of Native American diversity in the pool of human genome data. The data collected here enable the targeted characterization of endemic diseases, trait adaptations, and new variants of clinical significance in South America. The Peruvian Genome Project represents a step forward in international and multidisciplinary efforts to make precision medicine more inclusive and accessible for underrepresented communities in Latin America, offering significant potential for drug development and diagnostics in a neglected continent.
INTRODUCTION
Estimates of Native American representation within genome reference datasets currently range from 1% to 5%, depending on the data source1,2. Peru, which encompasses an area of 1,285,215 km2, has a highly diverse population of 33 million inhabitants as of 2023, comprising both mestizos and native peoples, the two ethnic categories broadly used in Latin American societies and censuses3. Historical records indicate a significant demographic shift in its native population: in 1620 (almost 90 years after the arrival of the first European conquerors), native people amounted to 75% of Peru’s population, a figure that decreased to 56% by 1796 and further dwindled to 31% by 2003, reflecting in part a complex social process of urbanization of Peruvian society. Today, native populations are fragmented into 47 distinct groups4,5, mostly rural, many of which have been inaccessible for genomic and phenotypic characterization. This downward trend underscores a concerning reality: the proportion of the population considered native is declining over time. This decline, due to colonization and migration, is steadily reducing not only the opportunities for exploring the original genetic landscape of the Americas but also the chances of charting endemic evolutionary adaptations specific to a unique set of climates and environments. The decline in the population classified as native Peruvian therefore translates into a continual reduction in the prospects for investigating novel genetic variants harbored by these communities. This is concerning not only for the local populations, since these variants explain specific characteristics relevant to genomic medicine, but also for the scientific community, which must act now and engage ethically with these understudied communities before the current window of opportunity closes.
Approximately 30% of the population of Peru resides in the high Andes, above 2500 meters above sea level (m a.s.l.)6. The Andes and the Amazon jungle have influenced the environmental exposure of local populations. The population structure within Western South America also shows remarkable homogeneity among Andean populations, suggesting a tendency to migrate within the Andean region7. This pattern of migration has likely kept populations within their environment, fostering diverse adaptive changes, notably in response to hypoxia8. However, the underlying genetic mechanisms of local adaptation, as well as their interactions with environmental factors, lifestyle, and potential epigenomic modifications, remain to be fully elucidated. These factors collectively contribute to the physiological, endocrine, cardiovascular, respiratory, and other systemic changes observed in native populations9,10.
When analyzing coastal, Andean, and Amazonian populations, it is useful to subdivide them according to their proportion of admixture. Examining both the mestizo and native population classifications offers a valuable framework for advancing genomic medicine in Peru and the broader Americas11,12 and has the potential to delineate health disparities and disease prevalence in these communities. While scientific discoveries are starting to emerge from the Peruvian Genome Project (PGP), it is essential to acknowledge its limitations, which mostly relate to its modest size. Nevertheless, these limitations are significantly offset by the stringent sample selection criteria used to capture diversity and by the extraordinary diversity offered by some populations that, to date, had not been studied. These project findings, as illustrated in Figure 1, pave the way for further advances toward precision medicine and data equity within a dramatically underrepresented set of human populations in the Latin American region.
The initial stages of the PGP involved pathogen genome sequencing in 2010. Since then, the project has evolved into a tailor-made protocol for selecting populations and individuals for sequencing and genotyping from across the diverse geography of Western South America. This protocol involved international experts and local authority panels, starting in 2012. It took us over a year to have a suitable ethical and sample logistics procedure approved by regulatory authorities. In 2013, we started visiting communities, with which we engaged through a variety of channels, including local media, presentations, and interviews. These interviews included chiefs (“Apus”) from local communities as our initial point of contact and exchange of information. Once appropriate consent from Apus and authorized individuals was obtained, sample collection, sequencing, and genotyping started in 2014 and continued through 2017. The sequencing was performed at the New York Genome Center (NYGC) and the array genotyping at the facility of the Instituto Nacional de Salud del Perú [Peruvian National Institute of Health]. This phase took 3 years to complete, including the implementation of anthropological and genetic analyses, and culminated in the first peer-reviewed publication of initial results13 in 2018. There, we showed that the studied populations of the three geographic regions in Peru (Amazon, Andes, and coast) diverged from each other ∼12,000 ya. In 2018 and 2019, we performed pharmacogenomic and transcriptomic investigations. This research allowed us to shed light on the susceptibility of highlanders to tuberculosis infection, together with their pharmacological responses14,15. During the COVID pandemic (2020-2022), we shifted our focus to collaborating with international research groups to incorporate ancient Peruvian DNA samples into the pool.
From then onwards, we have focused our efforts on developing a pool of genome and phenome data that goes beyond the ∼1000 genomes collected to date, which are in the process of being sequenced. As we continue exploring the wealth of diversity found across the Andes, Amazon, and coast of Western South America, we have placed special emphasis on the endemic diseases and precision medicine approaches that apply to local native populations.
POPULATION SELECTION CRITERIA AND SAMPLE COLLECTION STRATEGIES
The Andes, which cross the South American continent from North to South (Figure 2), have dramatically shaped human evolution and adaptation in the region. Nearly 30% of the Peruvian population lives above 2,500 m a.s.l. in the Andes, whereas 10% lives in the Amazon jungle, on the East side of the mountains, and these groups experience different epidemiological and isolating environmental conditions16. As indicated above, to carry out balanced recruitment and to identify population diversity criteria under the mandate of the Instituto Nacional de Salud del Perú, we convened a local and an international research panel, as follows. i) The Local Panel was composed of representatives of the native communities of Peru, the Ministerio de Cultura, Non-Governmental Organizations, archaeologists, sociologists, and anthropologists. Based on the experience of this panel and the main objective of the project, 3 criteria were identified to select populations from across the coast, Andes, and Amazon jungle: representativeness (number of residents in each population), degree of isolation, and vulnerability of the population to extinction. Using these criteria, 17 native populations and 13 mestizo populations were identified, spanning diverse locations and geographical distances (Figure 2). Figure 3 shows the breakdown by community of the 1,149 participating individuals. For this selection process, we also considered migratory routes and the self-described identity of populations, in order to understand historical movements of peoples across centuries and processes of miscegenation within the communities. We defined “native individuals” as subjects whose parents and grandparents were born in the same native community.
ii) The International Panel was composed of researchers who had already developed projects in similarly underrepresented countries, including researchers from INMEGEN (National Institute of Genomic Medicine of Mexico), the University of Michigan and the University of Maryland (USA), and Universidade Federal de Minas Gerais (Brazil). These researchers’ experience allowed us to consider appropriate clinical measurements and analyses not necessarily considered in previous peer-reviewed projects.
ETHICAL APPROVAL PROCESS AND CASCADE OF CONSENTS
The ethical process for the collection, stewardship, and dissemination of data and results from the gathered samples was meticulously implemented at the community and individual levels. We made sure that communities and individuals were engaged according to international and local protocols, including the Declaration of Helsinki for medical research involving human subjects. For community participation, a consultation process was performed involving authorities at the national level (Ministerio de Salud, Peruvian Government), regional authorities, and several Peruvian universities. The information campaign for sample collection began a month in advance in each community and consisted of: i) the distribution of a brochure explaining the project in simple language (Spanish or the local native language); ii) the display of a poster reproducing the informed consent format; iii) public informative sessions aimed at the participating community; and iv) communication through local television, radio, and written press whenever possible or available. Native communities were visited several months before sampling to request authorization from the community leaders (Apus). The final decision to participate was made by the individuals themselves, who had to understand and consent to the ethical processes outlined here. Subjects who met the inclusion criteria were invited to participate, and their informed consent and authorization to preserve their samples were then obtained. All participants were offered the possibility of withdrawing from the study at any time without explanation. All participants gave their informed consent in the presence of a translator fluent in their native language and two local witnesses.
All procedures were reported and presented for evaluation and approval by the Research and Ethics Committee from the Instituto Nacional de Salud (authorization no. OI-003-11 and no. OI-087-13).
DATA SAMPLING AND GENOTYPING
Participants were selected to represent diverse self-described Native and Mestizo Peruvian populations. We applied three criteria to optimize participant selection to best represent the Indigenous American populations. These included: (i) the place of birth of the participant, his or her parents and grandparents (they all had to belong to the same community), (ii) their last names (selecting only those corresponding to the region if they existed or were known), and (iii) if several members of a family met our standards for inclusion, the oldest member was then selected for our study.
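The three criteria above can be read as a simple filtering procedure. The following Python sketch is only an illustration of that logic under stated assumptions; the record fields (`birthplaces`, `surname_is_regional`, `family_id`) and function names are hypothetical and do not describe the project's actual data model or software.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Candidate:
    """Hypothetical candidate record; all field names are illustrative."""
    participant_id: str
    family_id: str
    age: int
    community: str                        # community being sampled
    birthplaces: List[str]                # participant, parents and grandparents
    surname_is_regional: Optional[bool]   # None when regional surnames are unknown


def is_eligible(c: Candidate) -> bool:
    # Criterion (i): participant, parents and grandparents all from the same community.
    same_community = all(place == c.community for place in c.birthplaces)
    # Criterion (ii): the surname check applies only when regional surnames are known.
    surname_ok = c.surname_is_regional is not False
    return same_community and surname_ok


def select_participants(candidates: List[Candidate]) -> List[Candidate]:
    # Criterion (iii): keep only the oldest eligible member of each family.
    oldest: Dict[str, Candidate] = {}
    for c in filter(is_eligible, candidates):
        current = oldest.get(c.family_id)
        if current is None or c.age > current.age:
            oldest[c.family_id] = c
    return list(oldest.values())
```

Treating an unknown surname status as acceptable mirrors the wording "if they existed or were known"; a stricter reading of criterion (ii) would change only the `surname_ok` line.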
The first phase of this study included 150 Native and Mestizo Peruvian whole genomes, sequenced to an average of 35X coverage on an Illumina HiSeq X 10 platform by the New York Genome Center (NYGC). An additional 130 Native American and Mestizo Peruvian individuals were genotyped using a 2.5M Illumina chip, featuring over two million markers for dense genome-wide coverage and extensive disease-associated content, at the Biotechnology and Molecular Biology Lab of the “Instituto Nacional de Salud del Perú”.
PHENOTYPE DATA AND LABORATORY MEASUREMENTS
To date, we have collected a total of 1,149 samples (each corresponding to a different individual) from 17 traditional Native and 13 Mestizo (admixed of Native Peruvian, African, Asian, and European ancestry) communities.
Demographic and lifestyle information (smoking status and diet patterns) was collected, and anthropometric data such as body mass index (BMI), weight, and height were measured. Measurements taken under fasting conditions also included blood lipids, blood cell traits (mean hemoglobin levels, red cell count, white cell count, and platelets), glucose levels, and renal function markers (Table 1).
We were able to infer significant differences between Native and Mestizo samples due to environmental factors such as geographical altitude and diet. BMI, which like ancestry has been associated with specific environmental conditions in previous related studies18, differed significantly between Native and Mestizo populations. However, more precise methodologies need to be considered in future studies to discriminate whether these differences are related to excess fat or to muscle mass percentage.
INITIAL FINDINGS AND KEY CONTRIBUTIONS FOR GENOMIC MEDICINE IN PERU AND NATIVE AMERICANS
The PGP offers significant insights into the genetic diversity and evolutionary adaptations of Native and Mestizo populations in Peru. To date, the PGP has sequenced 150 genomes and genotyped 850 individual samples. This work is being applied to a range of actionable clinical interventions that advance precision medicine, public health strategies, and the understanding of human genetic evolution19. In what follows, we present some examples that illustrate how our work has shed light on genome variants endemic to native Peruvian populations.
Enhanced Understanding of Genetic Diversity and Disease Susceptibility
Peruvian Mestizo populations have previously been shown to have at least 60% Indigenous American genetic ancestry, with the native genetic component reaching up to 90% in some native communities, as shown in Figure 2. We have discovered 1.6 million novel genetic variants according to the Variant Effect Predictor (https://grch37.ensembl.org/Homo_sapiens/Tools/VEP), which are not present in existing databank resources such as dbSNP13. The inclusion of variants from under-represented populations such as Native Peruvians in global genomic databases is expected to help refine and broaden the accuracy of genetic risk assessments and pharmacogenomic interventions for the diseases most prevalent in Indigenous American populations, whose pharmacogenomic representation is only 0.1% of existing data20. These efforts should then lead to more personalized and effective healthcare interventions, at least to some degree, for all Indigenous American groups.
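Conceptually, calling a variant "novel" in this sense amounts to checking that it is absent from a reference catalogue. The following Python sketch illustrates only that idea; it is not the project's pipeline (which relied on the Variant Effect Predictor), and the variant tuples and catalogue contents are invented toy data.

```python
from typing import Iterable, Set, Tuple

# A variant is keyed by (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def novel_variants(called: Iterable[Variant], known: Set[Variant]) -> Set[Variant]:
    """Return the called variants that are absent from the reference catalogue."""
    return {v for v in called if v not in known}


# Toy data, invented for illustration only.
known_catalogue = {("1", 1000, "A", "G"), ("2", 2500, "C", "T")}
cohort_calls = [("1", 1000, "A", "G"), ("1", 1000, "A", "T"), ("3", 42, "G", "GA")]

print(sorted(novel_variants(cohort_calls, known_catalogue)))
```

Matching on the full (chromosome, position, ref, alt) key means that a new alternate allele at an already catalogued position still counts as novel; in practice, variants would also need to be normalized to the same reference build before comparison.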
One of the initiatives preceding the Peruvian Genome Project was the 1000 Genomes Project21, which included 85 genomes of Peruvians from Lima. These genomes, although helpful, were all sampled from Mestizo individuals from Lima, who have more European ancestry and are not representative of the rich diversity of Native communities in Peru. The advantage of the PGP lies in its carefully designed inclusion criteria, which consider different biogeographical regions across the coast, the Andes, and the Amazon. This allowed us to evaluate the effects of migration over thousands of years and helped differentiate populations based on evolutionary bottlenecks, revealing a distinct genetic fingerprint in the ancestors of the peoples of the Americas22. Moreover, we have found that the degree of ancestry (mestizos vs natives) and the altitude of the habitat (highlanders vs lowlanders) are linked23.
Unique genomic changes in the composition of Peruvian populations, such as the ones we are beginning to uncover, have helped elucidate the impact of ancient migrations and determine population structure based on geographic barriers in Peru24-26. Our data provide evidence of migrations in the central Andean region; these are restricted in the Southern Andes, likely due to the high elevation of the Andean mountains. Evidence of gene flow from migrations and differential patterns of genetic variation have also been found, including those associated with immune system genes27.
Pharmacogenomics and Drug Safety
Genetic ancestry plays a central role in population pharmacogenomics28. In one of our studies, we investigated adverse reactions during anti-tuberculosis treatment in the Peruvian population. Our results suggest that slow metabolism of isoniazid is present in 30% of the Peruvian population29. We also identified haplotypes whose association with drug-induced liver injury (DILI) differs depending on whether the population is mestizo or native Peruvian. For instance, we found evidence of NAT2*5B and NAT2*7B being associated with DILI risk in mestizos, while no such association has been observed in natives. Additionally, haplotypes NAT2*5G and NAT2*13A have been negatively associated with DILI only in the studied Native Peruvians15. Current research also suggests a greater prevalence of probable hepatotoxicity in the Amazonian population. In a study still in progress, we have compared the pharmacogenetic response to antituberculosis drugs in populations with more than 95% native ancestry from the coast, the Andes, and the Amazon of Peru. Our initial results point to greater hepatotoxicity with antiretroviral cotreatment on the coast. Thus, by considering the genetic diversity within and between populations, healthcare providers are beginning to better predict adverse drug reactions, adjust dosages, and select the most appropriate medication, thereby enhancing patient safety and treatment efficacy30,31.
Adaptation to High Altitude and Its Health Implications
The PGP’s findings on the selection of genes related to immune response in different Peruvian populations offer promising data for understanding susceptibility and resistance to infectious diseases. For instance, we have found that native people living in the Andes show positive selection at the HAND2-AS1 and DUOX2 genes, which are involved in cardiovascular and thyroid functions, respectively. On the other hand, we have observed that Peruvian Amazonians show strong selection at the gene encoding the CD45 protein, which is relevant to the immune response to viral exposure27. Considering that bottleneck effects and genetic fixation have shaped a biological structure marked by geographical differences in Peru, our group set out to determine how this effect may manifest in gene expression.
In previous studies, the immune response has been mainly associated with ancestry. However, our findings show that altitude and the microbiome are as important as ancestry for the immune response in populations with high Indigenous American ancestry. To evaluate the immune response in Native Highlanders, a laboratory was set up at the University of Huanuco, located at 3000 meters above sea level, to carry out a transcriptomic project based on the stimulation of PBMCs with proteins from bacteria, viruses, and fungi. The research question was how the expression of immune response-associated genes varies in high-altitude populations compared with populations not living under such geographic conditions. Our unpublished data suggest that there is indeed a difference in the immune response of these native inhabitants, with upregulated expression of genes such as HLA-DPB1, FN1, CD36, MMP2, HLA-DRB1, FCGR1A, CCL17, and HLA-DRB5, and downregulation of TGAX, CCL22, CSF1, CXCL8, IL12A, MMP9, CSF2, PTGS2, and FGF2. However, we have not found variation in gene expression in those genes that, at the genomic level, are associated with the evolutionary pattern of native populations in the central Peruvian highland region32. Our findings agree with recent observations by Sharma et al., who suggested that the effects of genetic selection do not align with genes exhibiting higher or lower expression, nor with differences at the protein level33.
Prevalence of established autoimmune risk variants in the PGP
Autoimmune diseases, in particular Systemic Lupus Erythematosus (SLE), tend to present at a younger age, and with worse clinical outcomes, in people of non-European descent. Immune genetic variants show differential evolution based on geographic pathogen pressure, as nearly 13% of non-HLA GWAS loci for SLE exhibit signs of natural selection34.
There are inherent challenges in comparing genetic risk scores (GRSs) across diverse populations, in this case scores constructed from Genome-Wide Association Study (GWAS) data predominantly sourced from European and Asian cohorts. Nevertheless, it is noteworthy that healthy Native Peruvians exhibit elevated unweighted polygenic risk for SLE compared with their European, African, and South Asian counterparts, aligning more closely with East Asians. Confirming these findings requires comprehensive population-based studies within the Americas. The findings emphasize the need for continued genetic investigation of autoimmunity in Peru, moving beyond Eurocentric genetic frameworks and offering potential insights into the heterogeneity of SLE.
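For readers unfamiliar with the term, an unweighted polygenic (genetic) risk score is generally computed as a plain count of risk alleles carried across a set of GWAS loci, with each locus contributing 0, 1, or 2. The Python sketch below illustrates that general definition only; the SNP identifiers, genotype encoding, and function name are hypothetical placeholders, not the loci or software used in the study.

```python
from typing import Mapping


def unweighted_grs(genotypes: Mapping[str, str], risk_alleles: Mapping[str, str]) -> int:
    """Count risk alleles over the listed loci (0, 1 or 2 per locus).

    genotypes maps a SNP identifier to a two-letter genotype such as "AG";
    risk_alleles maps the same identifiers to the single risk allele.
    """
    score = 0
    for snp, risk in risk_alleles.items():
        genotype = genotypes.get(snp)
        if genotype is None:
            continue  # assumption: a missing genotype contributes nothing
        score += genotype.count(risk)
    return score


# Hypothetical loci and genotypes, for illustration only.
risk_alleles = {"snp_a": "A", "snp_b": "T", "snp_c": "G"}
individual = {"snp_a": "AA", "snp_b": "CT"}

print(unweighted_grs(individual, risk_alleles))
```

Because every locus is given equal weight, the score does not depend on effect sizes estimated in other populations. The choice of loci and risk alleles still comes from the source GWAS, which is one reason cross-population comparisons need caution, and skipping missing genotypes, as done here, makes scores comparable only between individuals typed at the same loci.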
NEXT STEPS
Ancient DNA studies in samples from 8000 years ago
Genetic variants have been generated throughout human evolution and migration since the first inhabitants. Environmental factors, epidemiological pressures, and human interactions have likely played roles in shaping the emergence and persistence of genetic variants. However, the origins of these variants in the context of Peru’s ancient populations remain unclear. It is uncertain whether these variants were introduced by the first immigrants or emerged locally. Conducting ancient DNA studies in Peru holds the potential to illuminate this question.
For this purpose, we have set up a mobile laboratory at the excavation centers, so that once coprolites are identified, samples can be transported to the lab within 5 minutes to begin ancient DNA extraction and avoid contamination with contemporary DNA34,36.
Peruvian Clinical Genome
Previously, we found SNPs associated with the prognosis, severity and treatment response in tuberculosis infection14. As the pathogenic potential of newly discovered genetic variants within the PGP remains uncertain, a collaborative effort with the University of Westminster is underway to establish a comprehensive database of the Peruvian clinical genome. This database aims to enable and streamline specialized medical studies, elucidating the clinical significance of these variants in relation to Peru’s most prevalent diseases.
Genomics and transcriptomics of cardiovascular-related genes in highlanders
Given the new genetic variants associated with cardiovascular health found in the PGP, and the fact that not every variant carried in the genome is reflected in gene expression, we decided to combine genomic and transcriptomic analyses of cardiovascular genes in response to high altitude. This effort is being carried out in collaboration with Queen Mary University of London.
Beyond 1000 Peruvian Genome Project (B1000PGP)
The rigorous data collection methodology adopted by PGP was developed in concordance with the a priori inquiries posed by an international panel of experts from related projects, specifically regarding opportunities and variables relevant to include in the PGP framework. We are currently deliberating the inception of B1000PGP, a project with heightened ambitions aimed at collecting a broader array of variables beyond genomic data, incorporating additional samples progressively over time. We are conducting assessments on both healthy individuals and those diagnosed with various pathologies. Our objectives encompass more than just elucidating the spectrum, prevalence, and implications of germline or tumor genomic variants. Moreover, acknowledging the pivotal roles of the microbiome and DNA methylation as essential markers in human health, we are deliberating additional initiatives to produce datasets of this type.
CONCLUSIONS
Peru’s limited engagement in genomic research stems from a confluence of factors. Foremost, the nation’s lack of prioritization of genomic medicine policies leads to a dearth of strategic direction and investment. Additionally, the scarcity of adequately trained genomic scientists impedes the development of a proficient workforce capable of advancing research in this field. Meanwhile, the centralization of research infrastructure in the capital city restricts access and opportunities for researchers in the rest of the country. On top of these difficulties, cumbersome administrative processes further hamper Peru’s ability to undertake large-scale genomic studies. Nevertheless, the PGP has managed to implement a multidisciplinary approach through years of work and international collaborations, establishing rigorous criteria for the selection and definition of mestizo and native populations. Ethical considerations and stringent methodologies, alongside strict inclusion criteria, have facilitated the identification of participants with substantial Indigenous American ancestry, yielding 1.6 million novel genetic variants pertinent to understanding migration, adaptation, and immune response in Native Americans and highlanders. These findings hold great promise for potential clinical applications, yet discerning the biological adaptations of Peruvians remains paramount, as genomic variation without established clinical significance is a discovery that does not necessarily lead to actionability. To propel genomic research forward in Peru, it is imperative to foster international collaborations, particularly through training grants for doctoral and post-doctoral positions. Equally crucial is the development of infrastructure conducive to initiating new projects and facilitating the interpretation of results.
Such multidisciplinary collaborative efforts are indispensable for elucidating the genomic heritage of Peruvian populations residing in the Andean and Amazonian regions, thereby enriching our collective understanding of humankind.
FOOTNOTES
Competing interests
At the time of writing, MC declares that he is associated with Cambridge Precision Medicine Limited. HG declares that he worked at the Instituto Nacional de Salud until June 2020 and is now Medical Director at INBIOMEDIC Research and Technological Center. No other author involved in this publication declares any further conflict of interest.
Data Availability
Data have been deposited in the European Genome-phenome Archive (EGA), https://www.ebi.ac.uk/ega/home (accession nos. EGAD00010001958, EGAD00010001990, EGAD00010001991, EGAD00010001992). Access to the data is available upon request; please contact the corresponding authors for access directions.
INCLUSION AND DIVERSITY
We support inclusive, diverse, and equitable conduct of research.
ACKNOWLEDGMENTS
To the researchers who were part of the national panels (Cesar Cabezas, Sonia Guillen, Oswaldo Salaverry, Santiago Pastor, Patricia Mayta, Leonidas Gomez, Evelyn Guevara) and to Angel Medina, Yuri Alegre, and Harrison Montejo for their invaluable support in the coordination and collection of samples. To Luis Guillermo Lumbreras for his advice on anthropological and archaeological topics. To Jose Beraun, Milward Ubillus, and Diana Palma for facilitating studies in highlanders. To all participants of this study.
Footnotes
The co-authors have made a more detailed revision to improve the form and substance of the first accepted version. Likewise, we have improved the writing of the publication so that it is better understood.