Abstract
Multiplexed PCR amplicon sequencing (AmpSeq) is an increasingly popular application for cost-effective monitoring of threatened species and managed wildlife populations, and shows strong potential for genomic epidemiology of infectious disease. AmpSeq data for infectious microbes can inform disease control in multiple ways, including measuring drug resistance marker prevalence, distinguishing imported from local cases, and determining the effectiveness of therapeutics. We describe the design and comparative evaluation of two new AmpSeq assays for Plasmodium falciparum malaria parasites: a four-locus panel (‘4CAST’) composed of highly diverse antigens, and a 129-locus panel (‘AMPLseq’) composed of drug resistance markers, highly diverse loci for measuring relatedness, and a locus to detect Plasmodium vivax co-infections. We explore the performance of each panel in various public health use cases with in silico simulations as well as empirical experiments. We find that the smaller 4CAST panel performs reliably across a wide range of parasitemia levels without DNA pre-amplification, and could be highly informative for evaluating the number of distinct parasite strains within samples (complexity of infection), and distinguishing recrudescent infections from new infections in therapeutic efficacy studies. The AMPLseq panel performs similarly to two existing panels of comparable size for relatedness measurement, despite differences in the data and approach used for designing each panel. Finally, we describe an R package (paneljudge) that facilitates design and comparative evaluation of AmpSeq panels for relatedness estimation, and we provide general guidance on the design and implementation of AmpSeq panels for genomic epidemiology of infectious disease.
Introduction
Genetic data are a valuable resource for understanding the epidemiology of infectious disease. The value of this data type has been highlighted by the COVID-19 pandemic, for which viral sequence analysis has greatly informed patterns of disease spread and evolution, influencing public health policy decisions around the world (Oude Munnink et al., 2021). Applications of genetic data in epidemiology extend from viral and bacterial outbreak management (Baker, Thomson, Weill, & Holt, 2018; Coll et al., 2017; Quick et al., 2016) to the study of eukaryotic parasites underlying important diseases such as malaria, toxoplasmosis, helminthiasis, leishmaniasis and Chagas disease.
Many use cases (applications) of genetic data have been identified for malaria (Dalmat, Naughton, Kwan-Gett, Slyker, & Stuckey, 2019), the leading parasitic killer worldwide (WHO, 2019), include tracking the spread of drug/insecticide resistance genetic markers and diagnostic resistance mutations (Chenet et al., 2016; Jacob et al., 2021; Kayiba et al., 2021; Lautu-Gumal et al., 2021; Miotto et al., 2020), assessing disease transmission levels (Daniels et al., 2015; Galinsky et al., 2015), identifying sources of infections and imported cases (Liu et al., 2020; Tessema et al., 2019), and estimating genetic connectivity among different populations (Taylor et al., 2017). Malaria parasite genetic data also have demonstrated utility in therapeutic efficacy studies, for distinguishing recrudescent infections potentially indicative of low drug efficacy from reinfections or relapses from dormant liver stages (Gruenberg, Lerch, Beck, & Felger, 2019; Jones et al., 2021). In the malaria field, these applications are served by different types of genetic data produced at varying resolution, technical complexity, and cost, ranging from genetic panels that may comprise as few as 8 – 12 polymorphic microsatellites (MS) or 24 single nucleotide polymorphisms (SNPs) (Baniecki et al., 2015; Daniels et al., 2008), to whole genome sequencing (WGS) data (Miotto et al., 2015; Takala-Harrison et al., 2015).
To be scalable and sustainable, genetic data should be produced at the minimum resolution that provides robust support for the intended analysis application. Whole genome sequencing (WGS) data provide the most complete population genomic perspective on an organism of interest. However, the cost and technical challenges of generating, storing, and interpreting WGS data are impediments to scalability and widespread implementation for organisms with large genomes, or microbes with small genomes in samples dominated by host DNA. Targeted sequencing approaches that focus deep coverage on select genomic regions of interest using multiplexed PCR amplification (AmpSeq) are finding increased application in conservation genomics and fisheries biology (Baetscher, Clemento, Ng, Anderson, & Garza, 2018; Hargrove, McCane, Roth, High, & Campbell, 2021; Natesh et al., 2019; Schmidt, Campbell, Govindarajulu, Larsen, & Russello, 2020), and can serve genomic epidemiology of infectious diseases by focusing sequencing coverage on the most informative regions of pathogen or parasite genomes, instead of typically dominant host genomes (Jacob et al., 2021; Tessema et al., 2020).
Recent work on AmpSeq protocols for genotyping malaria and trypanosomatid parasites has confirmed the viability of this approach with low-parasitemia host and vector samples, where parasite DNA comprises a very small fraction of the total sample (Jacob et al., 2021; Schwabl et al., 2020; Tessema et al., 2020). Furthermore, one recent study has confirmed the value of designing amplicons to capture multi-SNP ‘microhaplotypes’, which exhibit polyallelic rather than biallelic diversity to facilitate relatedness inference (Tessema et al., 2020). New relatedness-based analytical approaches for genomic epidemiology are currently developing for malaria parasites and other sexually recombining pathogens (Henden, Lee, Mueller, Barry, & Bahlo, 2018; Schaffner, Taylor, Wong, Wirth, & Neafsey, 2018). The use of genomic data for estimation of recent common ancestry shared by pairs or clusters of parasites or mosquitoes has shown strong potential to provide epidemiologically useful inferences over very small geographic distances (10s-100s of kilometers) and short time scales (weeks to months) relative to traditional population genetic parameters of population diversity and divergence (Cerqueira et al., 2017; Taylor et al., 2017). While many analyses of recent common ancestry in malaria parasites to date have used WGS data, targeted genotyping of as few as 200 biallelic SNPs or 100 polyallelic loci (e.g., microsatellites or microhaplotypes) may also be used to infer recent common ancestry with necessary precision (Taylor, Jacob, Neafsey, & Buckee, 2019), making AmpSeq an excellent candidate to serve relatedness estimation.
However, there remains uncertainty in the molecular epidemiology field as to the suitability of existing panels for profiling parasite or pathogen populations in specific geographic locations that did not inform the original panel designs, and it is unclear which protocol features are most conducive to implementation in both high and low resource settings. Should each disease field adopt a common multiplexed amplicon protocol and panel, or should bespoke panels be implemented regionally to address genetically distinct parasite populations and specific use cases?
To address these questions, in this manuscript we describe the design and comparative evaluation of two new multiplexed amplicon assays for Plasmodium falciparum malaria parasites: a four-locus panel composed of highly diverse loci, ideal for estimating the number of genetically distinct strains within an infection (Complexity of Infection; COI) as well as distinguishing pre-existing vs. new infections in any geographic setting, and a 129-locus panel composed of drug resistance markers and many diverse loci for relatedness inference designed for application in South America (a region that did not inform previously published panel designs) as well as other geographic regions. Both assays use non-proprietary reagents (including standard PCR oligos) in order to maximize accessibility and affordability in malaria-endemic regions. The panels are supported by new open-source bioinformatic analysis pipelines to facilitate widespread use. We also show that the core sets of multiplexed PCR oligos can flexibly accommodate most new targets not included in the original design, allowing for panel customization towards detecting locally relevant resistance markers, polymorphic loci, and co-infecting parasite species. We use whole genome sequencing data to explore the degree to which our newly described and previously published genotyping panels can serve studies in diverse geographies, versus the alternative of customizing panels with targets that are locally informative but not globally useful. We suggest there is value in genotyping panels that can be flexibly adapted to incorporate informative targets from pathogen populations of interest. The analyses and resources described in this manuscript clarify the rapidly diversifying options for targeted microbial sequencing (Fig. 1), by providing tools and guidance for the comparative evaluation and refinement of AmpSeq panels.
Materials and Methods
Panel designs
We developed a small multiplex of four highly polymorphic antigenic loci, dubbed ‘4CAST’: CSP, AMA1, SERA2, and TRAP (Fig. 2). All four amplicons use previously published primer sequences (Miller et al., 2017; Neafsey et al., 2015), as no modification was required for successful multiplexing.
In designing the larger ‘AMPLseq’ multiplexed amplicon panel, we first built a large pool of candidate loci, anticipating significant attrition of candidates due to primer incompatibility. We prioritized four classes of loci: loci within antigens of interest (Helb et al., 2015), loci with high population diversity for relatedness inference (Taylor et al., 2019), loci included in the SpotMalaria v1 panel (Chang et al., 2019; Jacob et al., 2021), and known drug resistance markers. We contracted the services of GTseek LLC (https://gtseek.com) to design multiplexed oligo panels according to the criteria previously described for the Genotyping-in-Thousands by sequencing (GT-seq) protocol (Campbell, Harmon, & Narum, 2015) (S1 Supporting information). We optimized the final primer set and and reaction conditions through several sequencing runs and determined that the primers for the four 4CAST loci (CSP, AMA1, SERA2, TRAP) could be added to the panel without impacting amplification of the other loci. We also successfully added primers amplifying known markers of antimalarial drug resistance within the genes dhfr, dhps, mdr1, and kelch13 (S2 Table). Furthermore, we successfully added previously described primers for PvDHFR (Lefterova, Budvytiene, Sandlund, Färnert, & Banaei, 2015) in order to identify P. falciparum / P. vivax co-infections that have gone undetected in preliminary screening by microscopy or rapid diagnostic test (RDT). The final panel, dubbed ‘AMPLseq’ (short for Assorted Mix of Plasmodium Loci) contains this single P. vivax locus and 128 P. falciparum loci (Fig. 2), with a median length across all amplicons of 276 bp (S1 Fig.).
Panel protocols
To create the primer pool used in 4CAST PCR1, we combined 100 µM of each 4CAST primer (S3 Table) and diluted the combined primer mix to 6.25 µM per primer in nuclease-free water (NF dH2O). Each 10.5 µl PCR1 reaction incorporated 1.5 µl combined primer mix, 5 µl KAPA HiFi HotStart ReadyMix (2x), and 4 µl sample template. PCR1 amplification consisted of an initial incubation step at 95 °C (3 min); 25 amplification cycles at 95 °C (20 s), 57 °C (15 s) and 62 °C (30 s); and a final extension step at 72 °C (1 min). Each 12.2 µl PCR2 reaction (which adds sample indices and sequencing adapters) incorporated 2.2 µl unique dual index (10 µM Illumina Nextera DNA UD Indexes), 5 µl KAPA HiFi HotStart ReadyMix (2x), 2 µl NF dH2O and 3 µl PCR1 product. PCR2 consisted of an initial incubation step at 95 °C (1 min); 10 amplification cycles at 95 °C (15 s), 55 °C (15 s) and 72 °C (30 s); and a final extension step at 72 °C (1 min). We combined PCR2 products in equal volumes and performed double-sided size selection using Agencourt AMPure XP beads (Beckman Coulter): we incubated 100 µl library with 55 µl beads, immobilized beads via magnet rack, and transferred the supernatant to a new tube. We incubated the transferred supernatant with 20 µl beads and washed immobilized beads twice with 80% ethanol. We eluted the library in 25 µl EB buffer (10 mM Tris-Cl, pH 8.5), subsequently adding 2.5 µl EB buffer with 1% Tween-20. We verified size selection via Agilent BioAnalyzer 2100 and sequenced the selected library at 6 pM with >10% PhiX in paired-end, 500-cycle format using MiSeq Reagent Kit v2 (S1 Protocol).
We followed a similar nested PCR and pooled clean-up procedure for AMPLseq library construction. Primer sequences, input volumes and concentrations are listed in S3 Table and PCR conditions and size selection steps are shown in S2 Protocol. As detailed therein, library construction for AMPLseq library construction differs in a few minor aspects. For example, primer input quantities vary slightly (800 pmol +/- 33%) to account for amplification rate differences among loci. PCR1 products are diluted 1:12 in NF dH2O prior to PCR2 and only single-sided (left-tailed) bead-based size selection is used to enhance yield. Sequencing also occurs via paired-end, 500-cycle MiSeq but with a higher final library loading concentration (12 pM) and a lower fraction of PhiX (8%).
Mock samples
We generated mock samples from parasite lines 3D7 and Dd2, cultured at 3% hematocrit in commercially obtained red blood cells as previously described (Trager & Jensen, 1976). We extracted genomic DNA using the Qiagen Blood and Tissue Kit on cells previously lysed with 0.15% saponin. We generated positive control template representing DNA extractions from whole human blood infected with 10,000 monoclonal 3D7 parasites/µl by diluting genomic DNA from 3D7 to 0.25 ng/µl in 10 ng/µl human genomic DNA, and storing in 10 mM Tris-HCl (pH 8.0), 1 mM EDTA (Promega, Madison, WI). We generated further control templates representing 1000, 100, and 10 3D7 parasites/µl by serial 1:10 dilution of the 10,000 3D7 parasites/µl control, likewise using 10 ng/µl human genomic DNA as diluent. We also generated a 10,000 parasites/µl positive control as described above but using Dd2 instead of 3D7 strain genomic DNA. We generated mixed-strain control templates by combining the 10,000 3D7 parasites/µl control with this 10,000 Dd2 parasites/µl control at 1:1, 3:1, and 10:1 ratios (respectively). We serially diluted the 1:1 ratio to 1000, 100, and 10 parasites/µl concentrations and diluted the 3:1 and 10:1 ratios to 1000 and 100 parasites/µl concentrations using 10 ng/µl human genomic DNA diluent as before. We also applied selective whole genome amplification (sWGA) to all above control templates representing ≤ 1000 parasites/µl. The 50 µl sWGA reaction followed Oyola et al. 2016 (Oyola et al., 2016) with the exception of fixing template input volume to 10 µL. We purified sWGA products with Agencourt AMPure XP beads (Beckman Coulter) on the KingFisher Flex (S3 Protocol) and verified amplification success via NanoDrop (ThermoFisher Scientific).
Clinical samples
We tested the panels on clinical dried blood spot (DBS) samples from Mali and Guyana. Tran et al. collected samples in Kalifabougou, Mali in 2011 – 2013 as previously described (Tran et al., 2013). The Kalifabougou cohort study was approved by the Ethics Committee of the Faculty of Medicine, Pharmacy and Dentistry at the University of Sciences, Technique and Technology of Bamako, and the Institutional Review Board of the National Institute of Allergy and Infectious Diseases, National Institutes of Health (NIH IRB protocol number: 11IN126; https://clinicaltrials.gov/; trial number NCT01322581). Written informed consent was obtained from participants or parents or guardians of participating children before inclusion in the study. The Guyana Ministry of Health collected samples from Port Kaituma and Georgetown, Guyana between May – August 2020, by spotting participants’ whole blood onto Whatman FTA cards and storing the samples with individual desiccant packets at room temperature. Informed consent (or parental assent for minors) was obtained for all subjects according to protocols approved by ethical committees.
We punched DBS samples 3 – 5 times into a 96-well deep well plate using the DBS pneumatic card puncher (Analytical Sales and Services, Inc.) equipped with a 3 mm cutter. We then extracted gDNA following the DNA purification from buccal swab section of the KingFisher Ready DNA Ultra 2.0 Prefilled Plates for KingFisher Flex instruments protocol (ThermoFisher Scientific) with minor modifications (S4 Protocol). We used the same sWGA procedure as above on the extracted gDNA.
Whole genome sequencing and variant calling
We performed whole genome sequencing on clinical samples collected in Guyana to validate 4CAST and AMPLseq outcomes. We performed sWGA on DNA samples as described above to enrich parasite DNA. We used the enriched DNA to construct Illumina sequencing libraries from the amplified material using the NEBNext Ultra II FS DNA prep kit (NEB #E6177) prior to sequencing on an Illumina HiSeqX instrument at the Broad Institute, using 150 bp paired-end reads, targeting a sequencing depth of at least 50X coverage. We aligned reads to the P. falciparum v3 reference genome assembly using BWA-MEM (Li, 2013) and called SNPs and INDELs using the GATK HaplotypeCaller (DePristo et al., 2011; McKenna et al., 2010; Van der Auwera et al., 2013) according to the best practices for P. falciparum as determined by the Pf3k consortium (https://www.malariagen.net/resource/34). Analyses were limited to the callable segments of the genome (Miles et al., 2016) and excluded sites where over 20% of samples were multiallelic. Data from these samples were submitted to the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) under accession PRJNA758191.
Amplicon data analysis
We developed an application named AmpSeQC (S2 Supporting information) to assess sequence quality and amplicon/sequence run success (S2 Fig.). We also used AmpSeQC for P. vivax detection by applying a concatenation of the P. falciparum 3D7 and P. vivax PvP01 reference genomes during the BWA-MEM alignment step. For in-depth assessment of P. falciparum sequence variation, we processed paired-end Illumina sequencing data in the form of FASTQ files using a custom analysis pipeline (S2 Supporting information) that leverages the Divisive Amplicon Denoising Algorithm (DADA2) tool designed by Callahan et al. 2016 (Callahan et al., 2016) to obtain microhaplotypes (S2 Fig.). We mapped microhaplotypes obtained from DADA2 against a custom-built database of 3D7 and Dd2 reference sequences for each amplicon locus and filtered microhaplotypes based on edit distance, length, and chimeric identification, using a custom R script. (S2 Supporting information). We summarized observed sequence polymorphism into a concise format by converting individual microhaplotypes into pseudo-CIGAR strings using a custom python script. Microhaplotypes were discarded if supported by fewer than 10 read-pairs or by less than 1% total read-pairs within the locus, or if they exhibited other error features (S3 Supporting information).
We analyzed native and pre-amplified mock samples to determine precision and sensitivity of the DADA2 pipeline and filters. We defined a true positive (TP) as a microhaplotype with a pseudo-CIGAR string identical to the reference strain (either 3D7 or Dd2). We defined a false positive (FP) as a microhaplotype with a pseudo-CIGAR string not matching 3D7 (in the case of samples containing only 3D7) or not matching 3D7 or Dd2 (in the case of the mixes), and we defined a false negative (FN) as a locus without any correct microhaplotype representation. We defined precision as TP/(TP+FP), and sensitivity (recall) as TP/(TP+FN). Forty-two of 128 P. falciparum loci in AMPLseq exhibit identical 3D7 and Dd2 reference sequences; we only included these in precision and sensitivity calculations for pure 3D7 controls (i.e., TP+FN = 128); precision and sensitivity calculations for strain mixtures considered only the 86 loci that differ between 3D7 and Dd2 reference sequences (i.e., TP+FN = 86).
All amplicon sequencing data were submitted to the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) under accession PRJNA758191.
Comparator Panels
We compared AMPLseq and 4CAST to two previously published AmpSeq panels for malaria molecular surveillance, Paragon HeOME v1 (Tessema et al., 2020) and SpotMalaria v2 (Jacob et al., 2021).
Paragon HeOME v1, designed via CleanPlex algorithm (Paragon Genomics Inc, USA), contains 100 primer pairs in a single pool. These target 89 P. falciparum loci selected based on high diversity and differentiation (Jost D ≥ 0.21) among clinical isolates from Africa as well as eleven drug resistance-associated loci. A distinctive feature of HeOME library construction involves its requirement for bead-based clean-up and CleanPlex digestion of each sample between PCR1 and PCR2. The protocol therefore does not require sWGA prior to PCR1.
SpotMalaria v2, designed via Agena BioScience and MPrimer design software, contains 136 primer pairs divided into three different pools. These target loci were considered best able to recapitulate pairwise genetic distance, population differentiation, and sample heterozygosity inferences from global WGS data (MalariaGEN Plasmodium falciparum Community Project, 2016). Primers also target various drug resistance-associated loci (some known to exhibit copy number variation) and mitochondrial loci with conserved primer binding sites among Plasmodium spp. Library construction requires sWGA prior to PCR1 but no special processing between PCR1 and PCR2.
We also compared our amplicon panels to a 24-SNP molecular barcode assay (Daniels et al., 2008). The SNPs targeted by this Taqman qPCR-based assay were chosen principally for their high minor allele frequency (average MAF > 0.35) in parasite sample collections from Thailand and Senegal (Daniels et al., 2008).
paneljudge and in silico data simulations
We used WGS data to simulate genotypic panel data for simulations. This publication uses data from the MalariaGEN Plasmodium falciparum Community Project as described online pending publication and public release of dataset Pf7 (https://www.malariagen.net/resource/34). Specifically, we used genomic data from monoclonal samples collected in Mali, Malawi, Senegal, and Thailand (Zhu et al., 2019), and from Colombia and Venezuela (ENA accession numbers in S4 Table). We also used previously published monoclonal genomic data from Guyana (SRA BioProject PRJNA543530) (Mathieu et al., 2020) and French Guiana (SRA BioProject PRJNA242182) (Pelleau et al., 2015). We used the scikit-allel library (Miles et al., 2020) to process the data and then estimate microhaplotype frequency and diversity. Specifically, we used the read_vcf, is_het, and haploidify_samples functions as described (S1 Supporting information), and we estimated haplotype frequencies with the distinct_frequencies function.
We assessed the performance of different panels for relatedness inference using simulated data. We generated data on pairs of haploid genotypes (equivalent to pairs of monoclonal malaria samples) using paneljudge (Taylor & Jacob, 2020), an R package that we built to simulate data under a hidden Markov model (HMM) (Taylor et al., 2019) (S2 Supporting information). For each panel, we calculated inter-locus distances from the median nucleotide position of each locus and set distances as infinite between chromosomes. For each panel and population of interest, we calculated haplotype frequency estimates using scikit-allel, as described above. Given these distances and frequency estimates, we simulated data using relatedness parameter values of 0.01 (unrelated), 0.50 (siblings), and 0.99 (clonal), and switch rate parameter values of 1, 5, 10, and 50. For each combination of panel, population, relatedness parameter, and switch rate parameter, we simulated data on 100 haploid genotype pairs. For each haploid genotype pair, we then generated estimates of the relatedness parameter and the switch rate parameter using paneljudge, with 95% confidence intervals (CIs), under the same model used to simulate the data. We next performed relationship classification from these estimates and CIs. For estimates of unrelated pairs (relatedness parameter of 0.01), we generously classified estimates as correct if the lower limit of the 95% confidence interval (LCI) was below or equal to 0.01 and the upper limit of the 95% confidence interval (UCI) was below 0.99. We classified estimates of sibling-level relatedness (0.50) as correct if the LCI was above 0.01 and the UCI was below 0.99. We classified estimates of clonal pairs (0.99) as correct if the LCI was above 0.01 and the UCI was above or equal to 0.99. In all relatedness levels, if the 95% confidence interval spanned both 0.01 and 0.99 (i.e., LCI < 0.01 and UCI > 0.99), then we denoted the estimate as unclassified.
To evaluate panel performance in COI estimation, we combined monoclonal WGS data to engineer in silico polyclonal samples using vcftools (Danecek et al., 2011). We then estimated microhaplotype frequencies for each locus of a given panel, using scikit-allel as described above, and counted the number of distinct microhaplotypes observed at each locus per sample. We estimated COI as the maximum number of distinct microhaplotypes observed at any locus within a sample.
To evaluate panel performance for geographic attribution, we identified microhaplotypes at loci as described above. We used the microhaplotype sequences themselves and visualized these data using the Rtsne package (Krijthe, 2015), with 5000 iterations, Θ of 0.0, and perplexity parameters of 10 (for 4CAST and the 24 SNP barcode) or 30 (for the remaining panels).
Results
4CAST and AMPLseq validation
We validated assay precision (defined as TP/(TP+FP)), sensitivity (defined as TP/(TP+FN)), and depth of coverage using 3D7 mock clinical samples representing parasitemia levels between 10 and 10000 parasites/µl in 10 ng/µl human DNA. Both 4CAST and AMPLseq generated 3D7 microhaplotype calls with 100% precision for all parasitemia levels assessed, both with and without pre-amplification by sWGA. 4CAST achieved high sensitivity and depth without preliminary sWGA, generating a median of 43 read-pairs per locus from native templates representing 10 parasites/µl (Fig. 3A). Median depth increased to 443 and 1312.5 read-pairs per locus for native templates representing 100 and 1000 parasites/µl, respectively. Read-pair counts were also evenly distributed among 4CAST loci using native DNA (Fig. 3A).
Unlike 4CAST, AMPLseq required sWGA for 3D7 mock samples representing ≤ 100 parasites/µl (Fig. 3B). Following sWGA on mock samples representing 10 parasites/µl, the assay generated ≥ 10 read-pairs at a median of 126 loci, with a median of 465 read-pairs after excluding loci with fewer than 10 reads. Values were statistically similar for pre-amplified samples representing 100 parasites/µl and increased to 692 read-pairs for pre-amplified samples representing 1000 parasites/µl (S3 Fig.).
We also validated the sensitivity of 4CAST and AMPLseq for genotyping polyclonal infections by using mock samples containing both 3D7 and Dd2 templates (likewise in 10 ng/µl human DNA). These mixtures featured Dd2 at 50% (i.e., 1:1 3D7:Dd2 ratio), 25% (3:1), and 9% (10:1) relative abundance. Total parasitemia levels ranged between 10 and 10000 parasites/µl. Both 4CAST and AMPLseq generated microhaplotype calls with 100% precision at the 86 loci that are dimorphic between the 3D7 and Dd2 references (including all four 4CAST loci and an additional 82 loci in AMPLseq). This perfect precision was observed at all parasitemia levels in both native and pre-amplified mock mixtures of the two strains.
4CAST showed high sensitivity for Dd2 without the need for sWGA. At 1000 parasites/µl, the assay detected Dd2-specific microhaplotypes at each of its four loci in all 1:1, 3:1, and 10:1 mixture replicates (Fig. 3C). At 100 parasites/µl, median Dd2 sensitivity remained 100% at 1:1 and 3:1 ratios but lowered slightly to 94% at 10:1. Sensitivity for each strain decreased at 10 parasites/µl, with 1:1 ratios yielding a median of 3 target loci for 3D7 and a median of 2 targets for Dd2; median sensitivity in these samples rose to 3.5 and 3 loci (respectively) following pre-amplification with sWGA, but this led to unbalanced read-pair support between the two strains (S4A Fig.), possibly due to differential sWGA success on low-quality Dd2 vs. high-quality 3D7 templates. 4CAST read-pair ratios generated from native templates, by contrast, showed a very high correlation with input ratios at 100 p/µl (S4A Fig.) and 1000 p/µl (Fig. 3C). Ratios became less informative at 10 parasites/µl (S4A Fig.).
AMPLseq (with sWGA) was also successful in detecting Dd2-specific microhaplotypes, but only at a maximum of 77 of 86 dimorphic loci (in the 1:1 ratio at 10000 parasites/µl). Dd2-specific sequences were detected at a minimum of two dimorphic loci for all three input ratios (1:1, 3:1, 10:1) and parasitemia levels (≥ 10 p/µl) assessed. Like with 4CAST, however, the use of sWGA decorrelated read-pair ratios from input ratios (S4B Fig.). Dd2 sensitivity was also reduced relative to 3D7 sensitivity (median Dd2 sensitivity / 3D7 sensitivity = 58%) with sWGA. These discrepancies were not observed with native templates (Fig. 3D) at ≥1000 parasites/µl for which AMPLseq achieves high read-pair support without the use of sWGA.
We also tested both panels on genomic DNA extracted from dried blood spots collected by the Guyana Ministry of Health in 2020 from individuals diagnosed as P. falciparum-positive via a rapid diagnostic test (RDT). Ten Guyanese samples were tested with both panels, and an additional six were tested with AMPLseq. Using 4CAST, we observed coverage across all loci in all samples, with a median read-pair depth per locus of 1162 read-pairs without sWGA (Fig. 3A). Using AMPLseq (with sWGA), we observed a median of 122 loci with ≥10 read-pairs and a median read-pair depth of 298 read-pairs per covered locus (Fig. 3B).
Additionally, we tested both panels on gDNA extracted from 16 dried blood spot samples collected in Mali in 2011 (Tran et al., 2013) and subsequently stored at room temperature for ten years. Using 4CAST (without sWGA), we observed a median read-depth of 407 read-pairs per locus (Fig. 3A). Using AMPLseq (with sWGA), we observed a median of 75 loci with ≥10 read-pairs and a median read-depth of 112 read-pairs per covered locus (Fig. 3B).
Evaluation of panel performance for relatedness
We used the R package paneljudge to assess in silico the impact of choosing a specific genotyping panel for relatedness inference. Considering the choice of panel, we evaluated relatedness estimation from data simulated on our 4CAST and AMPLseq panels, the SpotMalaria v2 (Jacob et al., 2021) and Paragon HeOME v1 (Tessema et al., 2020) amplicon panels, and a barcode of 24 SNPs. When data were simulated using microhaplotype frequency estimates of Senegalese parasites, we found that almost all estimates of unrelated or clonal pairs were correctly classified, regardless of the panel (Fig. 4A). All three large panels also performed similarly well in accurately identifying partially-related parasite pairs, despite being the product of three distinct design processes. Neither 4CAST nor the 24 SNP barcode estimated relatedness for partially-related samples as well as the larger panels. We also evaluated panel performance in less diverse parasite populations (Colombia and Thailand), including a population not used in the panel designs (Colombia). We repeated the simulations using microhaplotype frequencies estimated with these data. Again, we found that all panels performed well for estimating relatedness of clonal pairs, and that the 24 SNP barcode and 4CAST were less likely to have correctly classified estimates of non-clonal pairs. With the data simulated using Colombian microhaplotype frequencies from the Pacific Coast region, all three large panels performed well for all three relatedness values, despite the Colombian data not having informed the design of any of the panels.
Pairwise relatedness estimates (Schaffner et al., 2018) from AMPLseq correlated highly with those from WGS data available for the Guyanese sample set (Pearson’s r = 0.86, slope = 1.01, p < 0.001) (Fig. 4B). Despite patient travel history metadata suggesting infections to have occurred in various geographic regions of Guyana (S1 Table), AMPLseq relatedness estimates for the Guyanese sample set are significantly higher than those for the Malian sample set (Mann Whitney U, p < 0.001), consistent with anticipated lower transmission levels in Guyana. Nevertheless, the wide range of relatedness estimates (0.007 – 1) observed among Guyanese sample comparisons suggests AMPLseq capacity to indicate epidemiologically relevant microstructure even in relatively unstructured parasite populations. For example, the first (lowest) quartile of pairwise relatedness estimates from Guyana was enriched for comparisons involving A2-GUY and C5-GUY, two highly related samples that in whole genome analysis show 50% relatedness with a sample from Venezuela (SPT26229, see S4 Table).
Geographic attribution
We again engineered amplicon data in silico to evaluate the relative signal in genotyping panels for geographic attribution of samples (Fig. 5). By sub-sampling WGS variant calls, calling microhaplotypes, and visualizing these data using t-SNE plots, we found that results from both the 24 SNP barcode (Daniels et al., 2008) and 4CAST distinguished samples by continent of origin, though not by country. Results from all three larger panels additionally distinguished non-African samples by country, and these panels separated East African (Malawi) from West African samples (Mali/Senegal) to varying degrees; no panel was able to distinguish between Malian and Senegalese samples in this visualization. We also added empirical AMPLseq data from 5 Guyanese samples (C3-GUY, C4-GUY, C5-GUY, C7-GUY, and C8-GUY) and WGS data sub-sampled to AMPLseq coordinates for Venezuelan sample SPT26229 (S5 Fig.). The AMPLseq samples formed a small cluster beside the WGS-based Guyanese and French Guianese samples. The Venezuelan sample SPT26229 also placed on the perimeter of the Guyana/French Guiana sample cluster, sharing the same axis-2 position as the empirical AMPLseq points. Results show that empirical AMPLseq data can distinguish autochthonous samples from the Guiana shield, and we expect geographic attribution in the region to improve as more data are collected from infections originating in Venezuela and other undersampled localities.
COI estimation
We evaluated COI estimation based on 4CAST as opposed to the single locus AMA1, which is commonly used for this purpose, alone or with another locus (Lerch et al., 2017; Miller et al., 2017; Nelson et al., 2019). We engineered in silico samples with COI ranging from 2 – 10 (100 simulations per COI level) and used the maximum number of unique microhaplotypes present at any locus as a simple objective method to estimate COI. 4CAST provided more accurate estimates of COI than AMA1 alone in these simulated data, especially at simulated COI levels between 5 and 7. (Fig. 6A). S6 Fig. indicates that estimation improves at simulated COI=8 using AMPLseq, but to reap the full benefit of the larger panel in practice will require a more elaborate probabilistic algorithm that accounts for variable coverage/sensitivity among loci and incorporates a multinomial approach for polyallelic loci.
In the absence of such an algorithm, we proceeded with the naive estimation method above to classify COI in the Mali and Guyana clinical samples. Only a single polyclonal infection (C6-GUY) occurred among Guyanese samples assayed by 4CAST. The repeated detection of two CSP and SERA2 alternate alleles at depths ranging from 32 to 168 read-pairs enabled unambiguous COI=2 classification for the sample. WGS sequencing coverage, by contrast, detected only moderately elevated SNP heterozygosity (1.9%) in C6-GUY and this elevation was not sufficient to classify COI>1 via The Real McCoil (Chang et al., 2017) (S7 Fig.). AMPLseq also identified COI=2 for C6-GUY but without consistent support between replicates (2 vs. 6 biallelic loci). Six additional Guyanese samples were assayed by AMPLseq and one was classified as COI=2. This sample (A5-GUY) gave a stronger minor variant signal in both AMPLseq (15 biallelic loci in both replicates) and WGS data (10.9% SNP heterozygosity) (S7 Fig.).
For the Malian sample set, 4CAST and AMPLseq both classified samples E5-PST030 and C6-PST063 as monoclonal and all other samples as polyclonal based on presence/absence of multiallelic loci. While 4CAST detected as many as 10 alternate alleles (median = 3) per sample locus (Fig. 6B), AMPLseq detected at most 4 (median = 2). These results reaffirm 4CAST as a tool of choice for resolving higher COI levels and when parasitemia levels are low. AMPLseq may reach simulated performance levels (S6 Fig.) by reducing sample multiplexing or sequencing on higher output platforms (e.g., NovaSeq) for sample sets with low parasitemia.
Longitudinal sampling: distinguishing recrudescence vs. reinfection
We used 4CAST to examine longitudinal samples that were likely to be diverse and polyclonal. We sequenced samples from the same asymptomatic individual in the longitudinal Mali cohort over three consecutive visits (Fig. 7) (Tran et al., 2013). We detected a single microhaplotype at each locus that was present in the first two time points, suggesting a continued infection during the two weeks between samples. At the third time point, we detected a single, distinct microhaplotype at each locus, suggesting that a new infection had occurred and the original infection had disappeared or decreased below our limit of detection (<10 p/µl). In this particular case, the individual was asymptomatic and did not receive anti-malarial treatment between any samples; however, this simple example demonstrates the clarity that 4CAST can bring to tracking infection turnover in longitudinal studies, and suggests its utility in distinguishing recrudescence vs. reinfection in therapeutic efficacy studies.
Drug resistance profiling
AMPLseq loci in dhfr, mdr1, dhps, kelch13, and mdr2 contain ten sequence regions that code for various amino acid (AA) polymorphisms that have previously been associated with resistance to antimalarial drugs (Ariey et al., 2014; Miotto et al., 2015; Mita et al., 2007; Veiga et al., 2016)Thirteen of these 18 positions of interest contained nonsynonymous mutations in Malian and Guyanese clinical samples of this study (Fig. 8). Positions of interest that lacked mutations across both sample sets were DHFR AA 59 and 164; DHPS AA 613; KELCH13 AA 580; and MDR2 AA 484. All Guyanese sequences shared the same mutant alleles at many loci. Malian samples, by contrast, did not show fixed mutant alleles at any amino acid position of interest. A mix of mutant and wildtype alleles occurred among Malian samples for DHFR AA 51 and 108; MDR1 AA 86, 184, and 1246; DHPS AA 436 and 437; and MDR2 AA 492. A previously reported synonymous polymorphism was observed in one Malian sample at KELCH13 AA 589 (Taylor et al., 2015).
P. falciparum and P. vivax co-infection detection
To test the ability of AMPLseq to detect P. vivax co-infections via co-amplification of PvDHFR, two additional Guyanese blood spot samples that had been diagnosed as P. vivax-only (G4G430) and P. vivax + P. falciparum co-infection (G4G180) via RDT were included in the sample set. These samples did not undergo sWGA.
PvDHFR was detected at high depth in both samples (1068 – 1822 read-pairs for G4G430 and 234 – 560 read-pairs for G4G180) (Fig. 9). Only G4G180 also showed read-pair support at P. falciparum panel loci (>10 read-pairs at 100 – 115 loci). PvDHFR was not detected in any native or pre-amplified 3D7 or mixed-strain (3D7 + Dd2) templates. This demonstrates high specificity of both PvDHFR and P. falciparum AMPLseq primers to their intended target species without any apparent amplification inhibition by the presence of congeneric DNA.
PvDHFR was also detected at low levels (16 – 30 read-pairs) in both native template replicates of C7-GUY, one of the sixteen Guyanese samples previously diagnosed as P. falciparum-only via RDT. Surprisingly, two PvDHFR read-pairs were also detected in one of the two sWGA replicates from the sample, despite the expectation that sWGA would primarily amplify P. falciparum sequencines. Sensitivity of PvDHFR detection in pre-amplified samples could be enhanced by adding PvDHFR primers to the P. falciparum sWGA primer pool. PvDHFR detection did not occur in any Malian sample, consistent with low prevalence of P. vivax in West Africa relative to the Guiana shield.
Discussion
The utility of AmpSeq for molecular surveillance of infectious diseases is evidenced by the growing number of protocols recently published or under development for Plasmodium and other pathogens (Aydemir et al., 2018; Fola et al., 2020; Jacob et al., 2021; Mitchell et al., 2021; Moser et al., 2021; Ruybal-Pesántez et al., 2021; Schwabl et al., 2020; Tessema et al., 2020). Here, we demonstrate the performance of two new panels for P. falciparum, designed to serve different use cases and exhibiting different per-sample costs and levels of complexity. Our comparative analyses of these two new panels, AMPLseq and 4CAST, relative to previously published genotyping panels demonstrates that they perform comparably to existing panels of similar composition across use cases, in a diversity of geographic settings, despite different geographic representation in the population genomic data used to inform their designs. This suggests that de novo custom panel design may not be required for accurate COI and relatedness estimation in parasite populations from previously unstudied geographic regions. We therefore suggest that future implementation of these panels should be guided by three criteria: 1) the intended use cases for the data, 2) protocol complexity and compatibility with available instruments and expertise, and 3) protocol customizability for locally relevant genetic loci.
Considering the first of these criteria, intended use case, our investigations above suggest a straightforward mapping of panels by size and feature to use case. The small 4CAST panel is well suited to COI estimation (Fig. 6), and profiles four highly diverse antigens for the same effort and cost traditionally used to profile a single locus. Because of the very high diversity of the loci in the 4CAST panel in most parasite populations, this panel is also well suited to any application requiring genetic delineation of distinct parasite lineages (Fig. 7). In therapeutic efficacy studies, for example, it is essential to determine whether subjects who become parasitemic following drug treatment are exhibiting a recrudescence of an incompletely-cleared strain from the initial infection (which could indicate treatment failure), or if they have become reinfected with a distinct parasite strain subsequent to treatment. We suggest that the 4CAST panel would be significantly more informative than traditional genotyping approaches used in therapeutic efficacy studies, such as profiling length polymorphisms or allele-specific amplification in the MSP1/MSP2/GLURP loci (Reeder & Marshall, 1994; Snounou, 2002), and more cost-effective than independent monoplex amplification and Illumina sequencing of individual loci (Early et al., 2019; Gruenberg et al., 2019; Lerch et al., 2017).
Our work also demonstrates that the AMPLseq panel performs comparably to two existing multiplexed amplicon sequencing panels of similar size (Jacob et al., 2021; Tessema et al., 2020) for any use case reliant on estimation of parasite relatedness (Fig. 4), despite different design criteria and datasets that informed the panels. Potential public health use cases that employ relatedness information include measuring the connectivity of parasites between locations to define units of control, and monitoring changes in the level of transmission (Cerqueira et al., 2017; Daniels et al., 2015; Knudson et al., 2020). The AMPLseq panel and its peers are also much better suited to detecting imported infections given their improved capacity to distinguish parasites from distinct geographic locations (Fig. 5.) Finally, the larger panels offer the capacity to monitor genetic markers associated with drug resistance (Fig. 8) or, in some panels, detect co-infection with other Plasmodium species (Fig. 9).
The second panel selection criterion, protocol complexity and compatibility with available instruments, should be prefaced with a reminder that all of these protocols employ nested PCR reactions as the fundamental mechanism to produce sequencing libraries targeting small genomic regions of interest. Any molecular laboratory with a capacity for PCR and a small sequencing instrument such as an Illumina iSeq100 will be capable of carrying out any of these protocols. The exquisite sensitivity of amplicon sequencing implemented via an Illumina platform means that all of these protocols are also susceptible to contamination. PCR reactions should be conducted in dedicated hoods, ideally in rooms or locations physically removed from settings in which PCR products are manipulated. Finally, all of these protocols share: 1) a requirement for careful cleanup of inappropriately large or small DNA molecules from pooled libraries prior to sequencing, 2) precise quantification of said libraries for optimal loading on the sequencing instrument, and 3) large batch sizes (samples in increments of 96, 192, or larger to accommodate the capacity of the intended sequencing instrument).
Though these AmpSeq protocols share many common features, they differ in other aspects that may impact implementation. Whereas the 4CAST and Paragon panels and single nested PCR reactions perform well on native DNA from clinical samples, the AMPLseq and SpotMalaria panels require sWGA pre-amplification prior to the first PCR reaction to ensure adequate performance for samples with parasitemia at or below 100 parasites/µl, which may comprise a significant proportion of samples in some settings. sWGA is an isothermal amplification protocol that is relatively simple to perform but requires an expensive phi29 DNA polymerase and a magnetic bead-based cleanup of individual samples afterward, for an approximate additional cost of $8 USD per sample at the time of writing. Though not large in absolute terms, this cost is comparable to the cost of the AMPLseq or 4CAST protocols themselves, which range from $5 – 10 USD per sample, depending on details of implementation such as sequencing instrument and sample indexing per run. The larger panels additionally employ differing numbers of first-round PCR reactions and require a varying number of magnetic bead-based cleanups to tailor the length profile of intermediate products (summarized in S5 Table), which means that the local capacity for automating the bead-based cleanups is a relevant implementation consideration.
The third criterion for panel selection, customizability, may be most relevant for the drug resistance surveillance use case, given differences in the geographic distribution of important drug markers, and varying coverage of known markers by the existing panels. All of the protocols are amenable to customization through the addition of independent target amplifications in the first round of PCR, which could be combined with other first-round PCR multiplex products prior to the second PCR. A more elegant customization approach would be to add (or subtract) targets from the first-round PCR reaction. While complicated bioinformatic pipelines are useful or essential in the design of large multiplexes, in our experience, small multiplexes like 4CAST, which was made from pre-existing primer pairs designed independently, may simply function without optimization, and could presumably be augmented with a small number of additional loci. Though the AMPLseq multiplex of 129 PCR loci benefited from careful design of the original panel, we added the 4CAST targets to the designed AMPLseq target set with no primer modifications and found it to be functional, suggesting it is likely receptive to further augmentation. As the AMPLseq and 4CAST protocols utilize unmodified, commercially available oligos as primers, further customization should be feasible in any setting. However, we must note that not all targets are amenable to incorporation into the multiplex, as we failed despite multiple attempts to include amplicons targeting the pfcrt gene associated with chloroquine resistance (Fidock et al., 2000), or the hrp2/3 genes, which can contain deletions that lead to false-negative diagnosis via rapid diagnostic test (Gamboa et al., 2010).
The proliferation of new AmpSeq protocols for molecular surveillance of infectious diseases raises the important question of whether it is valuable for each disease field to converge on a single approach or common panel. Factors precluding a completely homogeneous approach include varying instrumentation, expertise, and use cases for the data across settings, in addition to an anticipated onward evolution of genotyping technology and elucidation of new markers of interest for drug resistance or other phenotypes. Factors favoring convergence include opportunities for improved procurement of instruments and reagents at a regional level in malaria-endemic countries, and opportunities to directly compare observations between studies and surveillance efforts led by different groups. This latter factor, which we term portability of analyses, has the potential to provide regional or global insight through syntheses across studies. However, the portability of certain analyses is hampered by ascertainment bias, an inherent limitation of any targeted sequencing approach for analyses based on the genotypic state of select loci in different countries. That is, a panel designed based on observations of genetic diversity through WGS in countries A and B may not provide a fair means of comparing diversity in countries C vs. D, if diversity there is distributed differently in the genome than in countries A and B. WGS is the ultimate tool for avoiding this bias. However, the problems of comparing parasite populations profiled with different panels may be mitigated by comparing inferred relatedness levels within populations rather than actual genotypic diversity measures. Overlap of loci among panels would further facilitate direct assessment of relatedness between samples included in different studies (Neafsey, Taylor, & MacInnis, 2021; Taylor et al., 2019). The AMPLseq panel we describe here contains a significant number (n=47) of targets from the SpotMalaria panel, and we expect that future P. falciparum panel designs will also tend to exhibit some degree of overlap with other panels, both by deliberate design and through blind convergence based on key genomic features, such as high diversity and sequence amenability to PCR primer design.
As molecular surveillance efforts for malaria and other diseases are more widely adopted and become increasingly diverse, it will be essential for the community to develop standardized approaches for the design, validation, interpretation, and sharing of targeted amplicon sequencing data. The paneljudge R package described here provides an excellent means to comparatively evaluate existing and hypothetical panel performance via data collected from previous population genomic surveys, and the bioinformatic analysis pipelines we have developed are suitable for interpreting Illumina data from diverse targets and panels in different organisms. We anticipate the growth of this field and the development of new analytical tools to extract even more knowledge from increasingly large AmpSeq datasets.
Data Availability
All amplicon sequencing data, as well as WGS data from 16 Guyana samples, were submitted to the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) under accession PRJNA758191. This publication uses data from the MalariaGEN Plasmodium falciparum Community Project as described online pending publication and public release of dataset Pf7 (https://www.malariagen.net/resource/34); additional ENA accession numbers are available in Table S4. Previously published data from Guyana and French Guiana can be found at SRA BioProjects PRJNA543530 and PRJNA242182, respectively. Software and documentation can be found at https://github.com/broadinstitute/AmpSeQC (AmpSeQC pipeline), https://github.com/broadinstitute/malaria-amplicon-pipeline.git (Malaria amplicon pipeline), and https://github.com/artaylor85/paneljudge (paneljudge).
https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA758191
https://www.malariagen.net/resource/34
https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA543530
https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA242182
https://github.com/broadinstitute/AmpSeQC
https://github.com/broadinstitute/malaria-amplicon-pipeline.git
Data Accessibility and Benefit Sharing
Data Accessibility: All amplicon sequencing data, as well as WGS data from 16 Guyana samples, were submitted to the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) under accession PRJNA758191. This publication uses data from the MalariaGEN Plasmodium falciparum Community Project as described online pending publication and public release of dataset Pf7 (https://www.malariagen.net/resource/34); additional ENA accession numbers are available in Table S4. Previously published data from Guyana and French Guiana can be found at SRA BioProjects PRJNA543530 and PRJNA242182, respectively. Software and documentation can be found at https://github.com/broadinstitute/AmpSeQC (AmpSeQC pipeline), https://github.com/broadinstitute/malaria-amplicon-pipeline.git (Malaria amplicon pipeline), and https://github.com/artaylor85/paneljudge (paneljudge).
Benefits Generated: A research collaboration was developed with scientists from the countries providing genetic samples, all collaborators are included as co-authors, the results of research have been shared with the provider communities and the broader scientific community (see above), and the research addresses a priority concern, in this case the public health concern of malaria. More broadly, our group is committed to international scientific partnerships, as well as institutional capacity building.
Author Contributions
Conceptualization EL, PS, MC, ART, AME, BLM, DEN
Data Curation ZMJ, RP, TJS
Formal Analysis EL, PS, MC, ART, RP, TJS
Funding Acquisition COB, BLM, DEN
Investigation PS, MC, ZMJ, MS, RK, SW
Methodology EL, MC, PS, ART, ZMJ, MS, RP, TJS, RK
Project Administration BLM, DEN
Resources CMA, SP, PDC, BT, JCR, VC, KJ, HC
Software ART, RP, TJS, AME
Supervision COB, BLM, AME, DEN
Validation EL, PS, MC, ZMJ, MS, RP, TJS, RK
Visualization EL, PS, MC, ART, RP
Writing - Original Draft Preparation EL, PS, MC, ART, DEN
Writing - Review & Editing [All authors]
Acknowledgments
This project has been funded in whole or in part with Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Grant Number U19AI110818 to the Broad Institute. This project was also supported by an NIH R01 award to DN (R01AI141544), an award from the Bill and Melinda Gates Foundation to DN and COB (OPP1213366), and a Broad Institute NextGen Award to BM. The Mali cohort study was funded by the Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health. The Colombian cohort study was supported by British Council Newton-Caldas Fund Institutional Links Award G1854. We thank MalariaGen for use of the Colombian WGS data. We thank Annie Laws for project management. We thank Dr. Nathan Campbell for assistance in the AMPLseq panel design and evaluation.