Abstract
Clinically pathogenic chromosomal microdeletions causing genetic disorders such as DiGeorge syndrome are rare genetic aberrations that can cause clinically relevant fetal and childhood developmental deficiencies. Clinical severity of such deficiencies depend on the exact genomic location and genes affected by the fetal chromosomal aberration.
Here we present the BinDel, a novel region-aware microdeletion detection software package developed to infer clinically relevant microdeletion risk in low-coverage whole-genome sequencing NIPT data. To test BinDel, we quantified the impact of sequencing coverage, fetal DNA fraction, and region length on microdeletion risk detection accuracy. We also estimated BinDel accuracy on known microdeletion samples and clinically validated aneuploidy samples.
BinDel identified each positive control sample as high risk. We also determined that it is critical to take into account that the sample with a detected high microdeletion risk does not have a full chromosome aneuploidy, as the latter can cause erroneous high microdeletion risk findings. We observed that lower sequencing coverage resulted in reduced microdeletion detection accuracy, and higher fetal fractions considerably increased the microdeletion detection accuracy, with coverage becoming increasingly relevant as fetal DNA fraction decreased.
In conclusion, we developed an R package-based software tool BinDel for inferring fetal microdeletion risks, which accurately identified all positive control samples with microdeletion or -duplication aberrations as high-risk samples.
Introduction
Non-Invasive Prenatal genetic Testing (NIPT) is a circulating cell-free DNA (cfDNA) and sequencing-based screening test to detect the risk of fetal aneuploidies. In addition to ‘traditional’ screening for most common aneuploidies (causing, e.g. Down, Edwards, and Patau syndromes), whole-genome sequencing (WGS) NIPT assay data can enable the analysis of any region of the genome to infer the risk of potential subchromosomal aberrations – microdeletions. The most frequent (1:1,524) microdeletion (MD) is the 22q11.2 heterozygous deletion (2.1Mb in length), causing the DiGeorge syndrome, characterised by adverse infant and childhood outcomes [1]. Other clinically relevant microdeletions are less frequent but also cause severe genetic disorders such as, for example, Williams-Beuren (5.2Mb in length), Wolf-Hirschhorn (4.5Mb) and Prader-Willi/Angelman (PWS/AS) syndrome (5.5Mb).
While pathogenic microdeletions are individually rare, with frequencies ranging from 1:1,524 for MD causing DiGeorge syndrome to 1:50,000 for MD causing Cri-du-chat syndrome, the overall prevalence rate of microdeletions is estimated to be approximately 1:2,500 [1–3]. 1:2,500 translates to that in a population of pregnant women younger than 35 years fetal microdeletions are more common than Down syndrome [2]. As there are no biomarker-based screening tests for MDs for first trimester screening (ultrasound sonography procedure can only sometimes detect associated findings for specific MD syndromes), a reliable NIPT-based microdeletion screening is required [4]. With the advance in non-invasive prenatal testing (NIPT) algorithms and laboratory methods, the detection of such pathogenic chromosomal microdeletions during routine NIPT is starting to be incorporated into clinical settings [5–7].
Previous low-coverage WGS based NIPT studies have demonstrated that one of the most important factors of successful NIPT analysis is sequencing coverage (depth) [8]. For full chromosome aneuploidy detection, WisecondorX [8], for example, has shown to accurately operate with coverage of 5 million (M) reads per sample (RPS), while some other tools require coverage of 20M RPS to keep the number of false-negative trisomy calls under clinically acceptable 1% screening test limit [9]. The higher the sequencing coverage, the greater the likelihood of comprehensive and confident genome-wide interrogation and, therefore, more accurate evidence to infer chromosomal or even sub-chromosomal aberrations [9]. Kucharik et al. suggested that for MD NIPT assays, sequencing coverage should be 10-20M RPS, with the exact minimum coverage depending on the fetal DNA fraction and target region length [10].
Fetal DNA fraction (FF) estimates the proportion of cell-free fetal DNA molecules originating from the fetus/placenta as compared to the total cell-free DNA molecules captured from a pregnant women’s blood sample. The higher the FF, the greater the possibility of detecting potential fetal aberrations from the maternal ‘background’ with high confidence. For that reason, the quality control related FF cut-off threshold in complete chromosome aneuploidy detection is commonly set to 4% [11,12].
As both higher sequencing coverage and higher FF allow more confident detection of fetal aberrations, the same principle also applies to the MDs region length [10,13]. The longer the region, the more sequencing reads are expected to cover it, increasing the potential to detect the deficit or excess of sequencing reads in the studied sample and chromosomal region, consequently drawing attention to a potential MD risk.
Even though there are some existing software tools that besides full chromosomal analuploidies can detect putative MDs (e.g. WisecondorX), these tools rely on sequential fixed-length genomic bins without possibility to transfer known MD region specific context (customising bin length, removing bins or using variable bin length) into the MD analysis [8]. Furthermore, to our knowledge, these tools have not been comprehensively evaluated for detecting MDs.
Here we present BinDel, a novel microdeletion detection software package developed to infer clinically relevant microdeletion risk from low-coverage WGS-based NIPT data. To demonstrate its feasibility, we quantify the impact of sequencing coverage, FF, and region length on microdeletion detection accuracy. While screening of pathogenic microdeletions in NIPT has been studied, a systematic review by Zaninović et al. notes that there is no consistency in achieved sensitivity, specificity, and PPV rates depending on the copy number variation (CNV) size between different studies [2]. Here we systematically investigated the importance of chromosomal region-specific characteristics with algorithm parameters in detecting microdeletions and exact connections between FF, coverage and different MD regions. Finally, we estimate BinDel accuracy on known microdeletion and aneuploidy samples.
Results
BinDel is an R software package developed to infer microdeletion and -duplication risk in user-defined regions of interest in WGS-based NIPT samples. Given a human reference assembly aligned NIPT sample, BinDel divides sample sequencing reads into genomic bins for each user-defined microdeletion region and counts the number of reads in each bin (Suppl. Figure 1). Next, BinDel performs multiple normalisation steps of these data, creates informative features from the normalised read counts, calculates Mahalanobis distance from the euploid pregnancy reference group samples, and transforms the resulting distances into MD risk probabilities that are reported as output.
Here we used WGS data for 373 euploid NIPT samples (previously determined to be normal euploid pregnancies using the NIPT screening test and postnatal examination) with ∼10 million (M) reads per sample (RPS) to develop and optimise the algorithm. In these samples (sample groups 1-2 in Suppl. Table 1), we bioinformatically reduced read count proportionally in the MD region to in silico induce clinically relevant fetal heterozygous MD events for various fetal DNA fractions (FF) and sequencing coverages (sample groups 9-11 in Suppl. Table 1), which were then blindly studied with BinDel.
Sequencing read coverage and microdeletion detection accuracy
First, we investigated the impact of sequencing coverage on microdeletion detection accuracy by applying BinDel on 73 euploid and 73 samples with in silico created MDs over nine pre-defined clinically relevant MD regions (Suppl. Table 2, sample group 2 and 10 in Suppl. Table 1). In addition to initial sequencing coverage (10M RPS), we also used in silico down- and upsampled (5M and 20M RPS, respectively) data (see Materials and Methods, sample groups 7, 8, 9 and 11 in Suppl. Table 1).
As expected, we observed that lower sequencing coverage resulted in reduced MD detection accuracy (Figure 1, Suppl. Figure 2, Suppl. Figure 3). This was observed for most studied MD regions, and in some regions, sequencing coverage had considerable effect on MD detection accuracy. For example, at 5% FF, if sequencing coverage was increased from 5M RPS to 20M, the sensitivity of microdeletion detection in Prader-Willi/Angelman (PWS/AS) associated region increased from 0.55 to 0.86 and in DiGeorge MD region, sensitivity increased from 0.18 to 0.42 (Figure 1F,B).
Fetal DNA fraction and microdeletion detection accuracy
Next, we investigated the impact of FF on MD detection accuracy. As anticipated, we determined that fetal DNA fraction had a significant effect on MD detection accuracy (Suppl. Figure 2). Higher FF considerably increased the MD detection sensitivity and positive predictive value (PPV) (Figure 1, Suppl. Figure 2). For example, in the case of 20M RPS, as FF increased from 5% to 10%, microdeletion detection sensitivity in the DiGeorge syndrome associated MD region increased from 0.42 to 0.92 (Figure 1B). Similarly, in the case of 10M RPS and increased FF (from 5% to 10%), DiGeorge region MD detection sensitivity increased from 0.27 to 0.71 (Figure 1B).
Microdeletion region length and detection accuracy
We also observed that the MD chromosomal region length had a considerable impact on its detection accuracy. In the longer MD regions, the detection of in silico created MDs was considerably more accurate (Figure 2B). For example, at 8% FF ordered by region length (Suppl. Table 2), 20p13del (1.2Mb), DiGeorge (2.1Mb), 1p36 (4Mb), Wolf-Hirschhorn (4.5Mb), Prader-Willi/Angelman (5.5Mb), Langer-Giedion (9.6Mb), Cri-du-chat (12.5Mb) and Jacobsen region (20.5Mb) MD detection sensitivity increased together with the MD region length (sample group 10, Figure 2B). The only exception to this was observed in Williams-Beuren (5.2Mb) MD region, which had lower sensitivity than other similar-length MD regions (Figure 2B).
Region-specific characteristics in MD detection
Analysis of the MD regions in WGS NIPT should consider local chromosomal characteristics as MD regions are often in the vicinity of low-copy repeat (LCR) or homologous sequences, which in the case of short read sequencing methods do not facilitate unambiguously unique sequencing read mapping [14]. Our context-aware region analysis demonstrates that if each MD region specific characteristics are considered (such as region length, region-specific bin size and regional variability) the accuracy of detecting corresponding potential MDs can improve (Suppl. Figure 7). A good example is the PWS/AS syndrome associated region, where we examined the normalised read count distribution of the BinDel 10M RPS reference set (300 samples, sample group 1 in Suppl. Table 1) and determined two highly variable 300 kb bins (Suppl. Figure 5). After removing these bins from the analysis, we observed a slightly increased sensitivity at FF of 3% and 5% (Suppl. Figure 6B, F). Determining and removing variable bins could be performed for any MD region under analysis. For example, BinDel’s reference set file contains read counts for each bin (per sample) in the plain text format. It is easy to visualise the reference data for each MD region, determine the ultra-variable bins and remove them from the reference file.
BinDel accuracy on validated microdeletion samples
To evaluate the accuracy of our algorithm with the optimal selected settings, we analysed six low-coverage (5 to 20 M RPS) WGS-based NIPT assays with confirmed fetal aberration events. These included two confirmed DiGeorge syndrome MD samples and one sample possessing confirmed 20p13 microdeletion and 3q29 microduplication events (sample groups 4-6 in Suppl. Table 1).
BinDel successfully inferred the high risk for correct regions (DiGeorge, 3q29, and 20p13) in all validation samples (Figure 3A-F). BinDel also erroneously inferred one false positive high-risk MD in the DiGeorge MD region for sample SC-0106, sequenced at 5M RPS (Figure 3F, DiGeorge region), emphasising the relevance of sufficiently high sequencing coverage (≥10M RPS), which would enable avoiding false positive MD findings in clinical applications. BinDel’s MD detection sensitivity, specificity and PPV on validated MD samples are shown in Table 1 (10 and 20M RPS).
BinDel accuracy on clinically validated aneuploidies
We further analysed 244 clinically validated samples (sequenced at 20M RPS, sample group 3 in Suppl. Table 1), including 17 full chromosome fetal aneuploidies and one fetal Prader-Willi syndrome associated MD. BinDel successfully detected the confirmed Prader-Willi syndrome case (Figure 4C). BinDel incorrectly detected high PWS/AS risk for seven samples lacking PWS/AS aberration, resulting in a PPV of 12.5% for PWS/AS (95% CI 0–35%), one of which was a trisomy 21 (T21) sample. Besides one T21 sample with false-positive PWS/AS MD risk, we observed three more aneuploidy samples with false-positive MD risk - two in Wolf-Hirschhorn MD region (samples with T21 ad T18) and one in Jacobsen syndrome associated MD region (sample with T21). Consequently, if MD risk is detected, a sample should also be screened for aneuploidies, as existing full chromosome aneuploidy can cause false MD risk indications. Although BinDel is not specifically developed for chromosomal aneuploidy detection, it was able to detect 1/1 T13 (PPV 17%, 95% CI 0–46%), 5/5 T18 (PPV 83%, 95% CI 54–100%), and 11/12 T21 (PPV of 85%, 95% CI 65–100%) in known trisomy samples (Figure 4A). BinDel falsely detected 20 (8.2%) high risk aneuploidies and 36 high risk MDs (14.8%, four of which were known trisomy samples) (Figure 4A, Figure 4C). This is fairly similar to the false positive proportions (9.6% for MDs and 15% for aneuploidies) observed in the 10M RPS sample group 2 (Suppl. Table 1) that used in the algorithm development and optimisation process (Figure 4B, Figure 4D).
Discussion
NIPT is a fetal aneuploidy risk screening method that is based on laboratory and computational analysis of cell-free DNA extracted from pregnant women’s peripheral blood. NIPT can also be used to screen for rare, but clinically relevant pathogenic chromosomal microdeletions. The most frequent (1:1,524) microdeletion is the 22q11.2 heterozygous deletion (2.1Mb in length, Suppl. Table 2), causing adverse infant and childhood outcomes (DiGeorge syndrome) [1]. Other clinically significant microdeletions are less frequent but also cause severe genetic disorders (Suppl. Table 2).
Here we present BinDel, a novel microdeletion detection software package to infer clinically significant microdeletion risk from low-coverage WGS NIPT data. To demonstrate its feasibility, we quantify the impact of sequencing coverage, FF, and region length on microdeletion detection accuracy. We also estimated BinDel accuracy on known microdeletion samples and clinically validated aneuploidy samples.
In MD regions longer or equal to 4.5Mb in length, we discovered that FF is one of the most important factors for correct MD detection. MDs of 4.5Mb and longer were determined with 100% sensitivity from FF of 15% onwards at all coverages measured (Figure 1D-I). Coverage and FF were equally important components for MD detection sensitivity in regions shorter than 4.5Mb, with the coverage impact on sensitivity significantly reducing with an FF of 20% (Figure 1A-C). Specificity was only affected by the coverage in all MD regions analysed (Suppl. Figure 4). Our analysis demonstrated that considering region-specific characteristics such as region length, region-specific bin size, and regional variability can improve the accuracy of detecting corresponding potential MDs (Suppl. Figure 6, Suppl. Figure 7). In microdeletion detection accuracy analyses, BinDel – an R package based software tool for inferring fetal MD risks – accurately identified all positive control samples (sample groups 4-6 in Suppl. Table 1) with microdeletion or -duplication aberrations as high risk samples (Figure 3A-F). Finally, each analysed sample should also be screened for aneuploidies, as existing aneuploidy can cause false high-risk MD predictions, as we observed in four clinically validated trisomy samples. We believe that the number of false positive high-risk MD results can be reduced in a clinical context by confirming that there is no fetal full chromosome aneuploidy detected in the same sample, followed by resequencing and reanalysis of the sample to avoid the random sequencing read placement affecting the genomic site of the MD. Although not explicitly developed for full chromosome aneuploidy detection, BinDel also accurately identified 17 of the 18 aneuploidies as high-risk results (Figure 4A).
Dar et al. have shown recently that NIPT screening for DiGeorge syndrome can detect most cases, with a PPV of 23.7% (95% CI, 11.44–40.24%) [1]. Wang et al. observed PPV of 49.0% for fetal microdeletion/microduplication syndromes [15]. Due to the different sensitivity of the algorithms and the influence of random sequencing read alignment effects, such PPVs, as seen by Dar et al. and Wang et al., are in accord with our hypothesis that PPV might increase if inferred high-risk samples were resequenced [1,15]. Rafalko et al. associated an overall PPV of 74.2% for detecting sub-chromosomal copy number variations [16], which is higher than BinDel’s PPV on a very limited number of validated microdeletion samples (Table 1) but generally lower than BinDel’s PPV on in silico created MD samples (Suppl. Figure 2).
While our study included a significant number of NIPT samples to create a representative reference panel, we only had a small number of confirmed microdeletion samples. We also observed a considerable number of false positive results for MDs (14.8%, Figure 4C) in the analysis of 20M RPS of 244 clinically validated samples (Suppl. Table 1, sample group 3). In practice, we believe that such high-risk samples should be resequenced at higher coverage to reduce the number of false positive MD results. Further, we analysed different datasets with a single 10M RPS development reference set. Results would be more consistent if the reference matched the analysis data set coverage more accurately, as demonstrated previously for full chromosome aneuploidy detection [9]. Using a single 10M RPS development reference set (sample group 1 in Suppl. Table 1) to analyse various laboratory protocols, on the other hand, demostrates that BinDel can handle different NIPT protocols. Second, in the development phase, upsampling 10M to 20M RPS (sample group 8 in Suppl. Table 1) may not perfectly reflect actual sample sequencing read variability since it might reduce or increase observed sequencing read data variability. Moreover, compared to the in silico MDs, naturally occurring MDs would provide a more accurate measure but would necessitate a relatively large sample set of rare microdeletions based on a single laboratory protocol, which is difficult to obtain.
Finally, for PWS/AS region analysis, we noticed that MD detection accuracy can be increased at low FF by leaving out the ultra-variable bins (Suppl. Figure 6B, C), but we also saw no effect at FF of 1%, 10% and 20%, and a small decrease at 8% (Suppl. Figure 6A, E, F, D respectively), which could be linked to the relatively long bin length (300bp) used in these analyses.
In conclusion, we developed a software tool for MD detection in WGS-NIPT and evaluated it with a number of positive and negative control samples, whilst also considering microdeletion detection accuracy in terms of fetal DNA fraction, sequencing coverage and MD region length.
Materials and Methods
Ethics statement
This study was performed with the written informed consent from the participants and with approval of the Research Ethics Committee of the University of Tartu (#352/M-12).
Studied samples and sample pre-processing
Multiple NIPT datasets were used for this study, including development, validation, DNA mixes and samples with in silico created MDs (Suppl. Table 1). We had separate dataset(s) for the BinDel reference set, BinDel development, MD validation, algorithm specificity, PPV and sensitivity analysis. A selection of datasets was sub- or upsampled, or MD was bioinformatically created in silico (as described below).
Each sequenced sample was aligned against human reference assembly GRCh38, sorted, and the reads originating from a single fragment of DNA were marked as duplicates.
Sample group 1 (Suppl. Table 1) consisted of 300 samples reported previously as euploid fetus pregnancies by NIPTIFY screening test and postnatal evaluation. The group consisted of 151 female and 149 male fetus pregnancies (determined by NIPT) with coverage between 8.0-18.5M RPS (median of 12.7M) and was used as a BinDel 10M RPS euploid reference set. Sample group 1 was processed similarly to previously published guidelines from KU Leuven, with modifications [17]. Briefly, peripheral blood samples were collected in cell-free DNA BCT tubes (Streck, USA), and plasma was separated with standard dual centrifugation. Cell-free DNA was extracted from 3 ml plasma using MagMAX Cell-Free DNA Isolation Kit (ThermoFisher Scientific). Whole-genome libraries were prepared using the FOCUS (Fragmented DNA Compact Sequencing Assay, Competence Centre on Health Technologies, Estonia) NIPT method protocol with 12 cycles for the final PCR enrichment step. In the following quantification, equal amounts of 36 samples were pooled, and the quality and quantity of the pool were assessed on Agilent 2200 TapeStation (Agilent Technologies, USA). Whole genome sequencing was performed on the NextSeq 550 instrument (Illumina Inc.) with an average coverage of 0.32× (minimum 0.08 and maximum 0.42) and producing 85 bp single-end reads.
Sample group 2 consisted of 73 samples used for BinDel development with coverage between 1.9-17.4M (median 12.9) RPS and was processed as was sample group 1. Similarly to sample group 1, sample group 2 samples were reported previously as euploid fetus pregnancies by NIPTIFY screening test and postnatal evaluation. The sample set consisted of 39 female and 28 male fetus pregnancies. For six samples, the gender of the fetus was not determined previously by NIPT due to the patient’s wish. This set also consisted of one SeraCare Life Sciences Inc euploid fetus reference material as a negative control sample. Sample group 2 was used as a development dataset of negative controls when calculating PPV, sensitivity, and specificity in combination with sample group 10. Sample group 2 was also used for in silico introduction of MDs in sample group 10 and for up- and downsampling samples in sample groups 7 and 8.
Sample group 3 is based on a previously published validation study by Žilina et al. [12] (except for one Prader-Willi sample, which was not in the previous study) and consists of 244 clinically validated samples. These samples include one T13, five T18, 12 T21, one Prader-Willi and 225 euploid fetus samples. Sample group 3 was sequenced with Illumina NextSeq 500 platform, producing 85 bp single-end reads with an average coverage of 0.32× at the University of Tartu, Institute of Genomics Core Facility, as described previously by Žilina et al. [12]. The sample group was used for algorithm validation as a test sample set.
SC005 (SeraCare Life Sciences Inc lot #10446565), SC0042 (#10571706), and SC016 (#10560229) are SeraCare Life Sciences Inc circulating cell-free DNA (ccfDNA) like mixture of human genomic DNA that consists of matched maternal and fetus DNA (sample groups 4-6 in Suppl. Table 1). SC005 (male fetus pregnancy) and SC0042 (female fetus pregnancy) consist of matched DNA of maternal and fetus with DiGeorge Syndrome. SC005 DiGeorge MD is at least 2Mb and includes TUPLE1 deletion. SC016 is a custom-ordered DNA Mix from SeraCare Life Sciences Inc with fetus DNA having a pathogenic loss of the terminal region of 20p13 (contains 14 genes including SOX12 and NRSN2 but not including JAG1) and a pathogenic 3q29 duplication. The specification states that the gain of 3q is at least 4.3 M in size, and the loss of 20p is at least 1.2 M. Sample group 4 is sequenced at 20M RPS, sample group 5 at 10M RPS and sample group 6 is subsampled (as is subsampled sample group 7, see below) from sample group 5. Sample groups 4-6 were used for validating MD detection.
Sample groups 7 and 8 consists of down- and upsampled samples from sample group 2 (Suppl. Table 1). Seventy-three euploid samples from sample group 2 were emulated to reduce and increase the RPS to 5M (by reducing read count by 50%) and 20M sequencing coverage, respectively. For 20M RPS dataset upsampling (see sample group 8 in Suppl. Table 1), the read count of the 73 initial samples was increased by combining initial samples randomly. More precisely, combinations of 73 samples were taken twice without repetition and merged. While the intuitive would be to increase sequencing reads by repeating existing reads within the sample by the desired fraction, such increase would lead to the introduction of duplicated sequencing reads and require estimation of the sequencing read distribution in different genomic regions. From 2,628 ways to merge two samples, 73 combinations were randomly chosen to make 20M RPS analysis comparable with 10 and 5M RPS analysis. We considered that merging two different samples leads to possible copy-number variation merging, due to which we did not merge more than two samples. Merging more than two samples will eventually even out read count distribution on copy number variation loci. Sample groups 7 and 8 were used as a development dataset of negative controls (in combination with sample groups 9 and 11) when calculating PPV, sensitivity, and specificity, matching the coverage. Sample group 7 and 8 was also used for in silico introduction of MDs in sample groups 9 and 11.
Sample group 9, 10 and 11 was created in silico bioinformatically introducing heterozygous MDs by reducing sequencing reads in selected genomic regions (see Region selection and Suppl. Table 1) in sample groups 7, 2, 8, respectively. We introduced heterozygous microdeletions by reducing sequencing reads in selected genomic regions (see Region selection). Given that the NIPT sample is euploid, the FF of the sample has no effect on in silico creating and detecting microdeletions as low-coverage whole-genome sequencing NIPT does not distinguish reads between pregnant women and the fetus. We reduced the proportion of reads in the selected region in the aligned sample. For example, for creating heterozygous deletion with 10% FF, we would reduce aligned sequencing reads in the region 0.1 (FF) ×0.5 (heterozygous) × total sample read count. Sample groups 9, 10, 11 combined with 7, 2, 8 (matching the coverage) was used to conduct BinDel’s specificity, PPV and sensitivity analysis.
Region selection
While common pathogenic copy number variations such as 1p36 deletion are well described and defined in OMIM or DECIPHER, determining the exact start and end coordinates for the precision algorithm is not straightforward. For instance, with 1p36 deletion syndrome, OMIM and DECIPHER have non-overlapping coordinates of 23,600,000-27,600,000 and 10,001-12,780,116 (Suppl. Table 2) [18,19]. Furthermore, for some syndromes such as DiGeorge syndrome, OMIM did not define exact coordinates, but the DECIPHER did (Suppl. Table 2) [18,19]. Finally, for 20p13 microdeletion syndrome, we did not find coordinates from DECIPHER or OMIM and had to use coordinates found from the SeraCare Life Sciences Inc sample specification. Given previously defined constraints, we selected OMIM coordinates over DECIPHER if both had coordinates defined, DECIPHER coordinates if OMIM did not provide coordinates, and SeraCare Life Sciences Inc custom DNA-mix sample specification coordinates if no coordinates were found from OMIM or DECIPHER.
Optimal bin size
To choose the optimal bin size and BinDel sensitivity setting, we focused on finding a bin size that would perform well over all coverages of 5, 10, and 20M RPS while also considering both false and true positive MD risk inference and would equally work well with all 9 MD regions. For this, we calculated the area under the curve (AUC) for each of the 9 MD regions with each bin size (100-500 kb), coverage (5, 10, 20M RPS), and algorithm sensitivity (greedy, conservative) setting based on the inferred BinDel risk scores (each bin size × 9 MD regions × two algorithm settings × three coverage settings, sample groups 2, 7, 8, 9, 10 and 11 in Suppl. Table 1). We used a fixed number of PCA components set to 270 (90% of the reference samples derived from the BinDel development phase). Next, we calculated the mean of the 9 MD AUC values per coverage and bin size (Suppl. Figure 7). Finally, we summed the mean AUC values over coverages for optimal bin size. We found that the 300 kb bin size with a conservative setting yields the highest sum of mean AUC over all coverages (2.693), e.g. BinDel distinguished the positive and negative MD samples best with 300 kb conservative sensitivity. From the most optimal to the least optimal bin size, the sizes were 300, 200, 100, 500, and 400 kb (sum of mean 2.679), all conservative sensitivity, followed by greedy sensitivity of 200, 100, 300, 500, and 400 (sum of mean 2.630).
BinDel algorithm
BinDel allows flexible NIPT sample analysis, where the laboratory can incorporate their knowledge about the target regions into the analysis by, for example, leaving out variable parts of the target regions or using custom varying bin lengths. BinDel outputs two risk scores, a more sensitive greedy score and a less sensitive conservative score, from which users can choose the one based on the algorithm’s intended use or the region characteristics under the question. The greedy score is more sensitive but can lead to increased false-positive MD risk calls due to the increased risk of registering noise as MD risk. The conservative score is more balanced than the greedy score in distinguishing MD risks from noise. See Optimal bin size for more information about the bin size and score type dynamics on MD detection accuracy.
Algorithm working principles are shown in Suppl. Figure 1. Calculating a high-risk probability for a sample of interest requires the euploid reference sample group. Creating a file containing euploid sequencing read info by the package requires GRCh38 aligned euploid NIPT samples and pre-defined genomic bin coordinates. The reference file also defines the regions of interest.
Once the reference file is created, for GRCh38 aligned sample of interest, the genomic bins are GC% corrected [20] and normalised by total sequencing read count and bin lengths. Next, the normalised values are transformed by dividing them by dimensionality-reduced versions [21]. After the normalisation, based on the euploid reference group, an intermediate Z-score for each bin is calculated [21]. The Z-scores of each bin are aggregated and transformed so that only two values (features) describing the sample are left. The first value is received by dividing each bin Z-score by the square root of the number of bins and then summed. For the second value, each Z-score is divided by the square root of the number of bins and further divided by the count of bins over the mean reference group (region) read count and then summed. In the construction of the second value, the Laplace smoothing is applied to avoid division by zero.
After the feature engineering, the Mahalanobis distance, based on the newly created features, from the reference group is found and transformed into high-risk probabilities with Chi-Square distribution and further transformed with -log10 to human-interpretable probabilities. Conservative probabilities are calculated only based on the second engineered feature. Greedy probabilities use both Z-scores.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
Supplementary Information
Acknowledgements
The authors would like to thank Signe Mölder, who participated in SeraCare Life Sciences Inc sample management and communications. This study was supported by Enterprise Estonia (grants No. EU48695 and EU53935); Estonian Research Council (grant No. PRG1076) and Horizon 2020 innovation grant (ERIN, grant no. EU952516).