ABSTRACT
Blood transfusion is a life-saving medical procedure performed routinely worldwide. A key element for successful transfusion is compatibility of the patient and donor red blood cell (RBC) antigens. Precise antigen matching reduces the risk for immunization and other adverse transfusion outcomes. RBC antigens are encoded by specific genes, which allows developing computational methods for determining antigens from genomic data.
We describe here a classification method for determining RBC antigens from genotyping array data. Random forest models for 39 RBC antigens in 14 blood group systems and for human platelet antigen (HPA)-1 were trained and tested using genotype and RBC antigen and HPA-1 typing data available for 1,192 blood donors in the Finnish Blood Service Biobank. The algorithm and models were further evaluated using a validation cohort of 111,667 Danish blood donors.
In the Finnish test data set, the median (interquartile range [IQR]) balanced accuracy for 39 models was 99.9 (98.9–100)%. We were able to replicate 34 out of 39 Finnish models in the Danish cohort and the median (IQR) balanced accuracy for classifications was 97.1 (90.1–99.4)%. When applying models trained with the Danish cohort, the median (IQR) balanced accuracy for the 40 Danish models in the Danish test data set was 99.3 (95.1–99.8)%.
The RBC antigen and HPA-1 prediction models demonstrated high overall accuracies suitable for probabilistic determination of blood groups and HPA-1 at biobank scale. Furthermore, a population-specific training cohort increased the accuracies of the models. This stand-alone and freely available method is applicable to research and to screening for antigen-negative blood donors.
INTRODUCTION
Blood transfusion is a life-saving procedure performed widely in treating various medical conditions. Despite routine practices, the safety of transfusions remains a major concern1. Exposure to foreign RBC antigens may result in alloantibody formation and hemolytic transfusion reactions. Additionally, sensitization to non-self RBC antigens and human platelet antigens (HPAs) can also occur via pregnancy and cause fetal morbidity and mortality2,3. The current general practice of matching the recipient and blood donor for ABO and RhD antigens is inadequate to prevent sensitization to other antigens. Extended matching could reduce the risk of alloimmunization and adverse events, which are especially pronounced among patients receiving regular transfusions4,5.
Blood group typing of blood donors has been conventionally performed by serotyping and is still the main method used in blood centers. To overcome limitations regarding low throughput and lack of valid reagents for all clinically relevant antigens, numerous DNA-based genotyping and sequencing methods have emerged within the last decades6–10. This development has been enabled by the accumulating knowledge about the genetic basis of the blood groups11,12 and the rapid evolution of molecular methodology. However, the systematically extended blood group typing of blood donors and, even more so, the recipients, remains sparse. Economic feasibility has been a major restraint to the progress. The development of genotyping array technologies has promoted high-throughput and cost-effective genetic studies in many fields and, in 2020, Gleadall et al.13 introduced a microarray platform for RBC antigen, human leukocyte antigens (HLA), and HPA typing for precision matching of blood.
While accurate blood group typing is obligatory for safe transfusions, an initial screening for potential donors could be achieved using less stringent procedures. In the last decade, the development of machine learning approaches for high-dimensional data has provided new opportunities for exploitation of expanding genetic data. In 2015, Giollo et al.14 presented BOOGIE, an RBC antigen predictor based on Boolean rules and the k-nearest neighbor (k-NN) algorithm. Decision tree-based methods, including bootstrap aggregation15 and random forest16, have been utilized for imputation of HLA alleles17,18 and killer cell immunoglobulin-like receptor (KIR) copy number19 and gene content20. To our knowledge, these methods have not yet been applied to RBC antigen and HPA screening. The analysis of high-dimensional data with computational performance suitable for large-scale analyses may be implemented using the “RANdom forest GEneRator” (ranger) R package21. The execution is feasible in the local computing environment and sensitive data uploads are not required.
Here we describe a stand-alone and freely available random forest classification method and models for determining RBC antigens and HPA-1 from array technology-based genotyping data. We investigate the performance of models trained with Finnish blood donor biobank data and further validate the method with a Danish cohort. Our results suggest that the method is applicable for biobank-scale probabilistic determination of RBC antigens and HPA-1, and could facilitate research and screening for antigen-negative blood donors.
STUDY SUBJECTS AND METHODS
Study cohorts and design
The Finnish study cohort consists of 1,192 blood donors belonging to the Blood Service Biobank, Helsinki, Finland (https://www.veripalvelu.fi/en/biobank/). Genotype and blood group phenotype data were obtained from the Blood Service Biobank. The study (biobank decision 002-2018) conforms to the principles of the Finnish Biobank Act (688/2012) and the participants have given written informed consent to the Blood Service Biobank.
The Danish validation cohort consists of 111,667 participants of the Danish Blood Donor Study (DBDS) Genomic Cohort expanding on the Danish blood bank system22,23. The genetic studies in DBDS have been approved by the Danish Data Protection Agency (P-2019-99) and the Scientific Ethical Committee system (NVK-1700407).
Genotyping and genotype imputation
The genotyping and genotype imputation of the Finnish cohort were originally performed as a part of the FinnGen project (https://www.finngen.fi/en). Biobank samples were genotyped using the FinnGen ThermoFisher Axiom custom array v2 (Thermo Fisher Scientific, Santa Clara, CA, USA) and imputed using the population-specific Sisu v3 imputation reference panel with Beagle 4.1. A detailed description of the procedures is available at https://finngen.gitbook.io/documentation/v/r4/methods/genotype-imputation and the marker content of the custom array v2 is downloadable at https://www.finngen.fi/en/researchers/genotyping. The phased genotypes were filtered for the imputation INFO-score >0.6 and were in vcf format.
In the Danish cohort, the genotyping was performed using Illumina’s Infinium Global Screening Array and imputed using the deCODE genetics’ (Reykjavik, Iceland) North European reference sequence panel. Unphased genotypes were filtered for the imputation INFO-score >0.75, minor allele frequency >0.01, Hardy–Weinberg equilibrium P-values <1 × 10−4, and samples for missingness per individual <3%.
RBC antigen and HPA typing
The RBC antigen and HPA-1 phenotypic information for the Finnish and Danish cohorts is presented in Table 1. The availability of the phenotype data varied widely between antigens because of differing testing criteria and practices. In the Finnish cohort, RBC antigen and HPA-1 typing was performed at the FRCBS Blood Group Unit by routine methods and the results were obtained using validated serological and genotyping techniques.
In the Danish cohort, the RBC antigen and HPA-1 typing results were obtained from the Danish electronic blood bank systems and the typing was performed using serological methods, except for Vel status, which was determined using a polymerase chain reaction technique.
Classification random forest models
An overview of the study design is depicted in Figure 1. RBC antigen and HPA-1 coding genes and the genetic regions used in the models are presented in Supplementary Table 1. The models for the antigens were generated separately using the same hyperparameters. Only antigens having at least four cases in each respective typing data class were included, resulting altogether in 39 models. For the Finnish reference data set, SNVs in RBC antigen and HPA-1 coding genetic regions ± 2,000 bp flanking regions were utilized in dosage format. Table 2 presents the number of SNVs available for each model. Only samples having full dosage data were used. The genetic and antigen typing information were combined into a single full data set and divided randomly 1:1 into train and test data sets.
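The inclusion rule and the 1:1 partitioning described above can be illustrated with a minimal Python sketch. It is not the authors' R implementation: the function names, the fixed random seed, and the assumption that each antigen has exactly two typing classes (positive and negative) are ours.

```python
import random
from collections import Counter


def eligible(labels, min_per_class=4):
    """True when both typing classes have at least `min_per_class` samples.

    Assumes a two-class outcome (antigen positive / negative).
    """
    counts = Counter(labels)
    return len(counts) == 2 and min(counts.values()) >= min_per_class


def split_half(sample_ids, seed=0):
    """Randomly divide sample IDs 1:1 into train and test lists."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded only to make the sketch reproducible
    half = len(ids) // 2
    return ids[:half], ids[half:]


# Hypothetical sample IDs standing in for a cohort of 1,192 donors.
train, test = split_half(range(1192))
print(len(train), len(test))  # 596 596
```

With 1,192 samples, the split yields 596 samples per data set, provided that all samples have full dosage data.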
The R v4.3.0 environment24 was used for the implementation and classification random forest models were created using the R package ranger v0.13.121. The number of trees was 2,000 and the split criterion was based on node impurity measured by the Gini index. Class weights were applied because of unbalanced outcome classes. The number of variables to possibly split at each node (mtry) was the number of SNVs divided by 2 and the variable importance was determined by permutation. Feature selection was based on variable importance >0 and the model was re-fitted using these important SNVs only. The number of important variables and prediction errors for each antigen model are presented in Table 2. Prediction error was determined as misclassification frequency obtained from out-of-bag data and prediction on the test set. The important variables for Finnish models are listed in Supplementary Data 2. The full data set was used in fitting the final models.
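The two bookkeeping steps around model fitting, choosing mtry and selecting important SNVs, can be sketched as follows. This Python fragment does not call ranger; the SNV names and importance values are hypothetical, and flooring the division by 2 is our assumption because the rounding rule is not stated.

```python
def mtry_for(n_snvs):
    """Candidate variables per split: number of SNVs divided by 2.

    Floor division and the lower bound of 1 are assumptions of this sketch.
    """
    return max(1, n_snvs // 2)


def important_snvs(importance):
    """Keep SNVs whose permutation importance is strictly greater than 0."""
    return [snv for snv, score in importance.items() if score > 0]


# Hypothetical permutation importances from a first model fit.
importance = {"snv_a": 0.012, "snv_b": 0.0, "snv_c": -0.001, "snv_d": 0.0004}

kept = important_snvs(importance)
print(kept, mtry_for(len(importance)))  # ['snv_a', 'snv_d'] 2
```

The model would then be re-fitted on the retained SNVs only. Permutation importance can be zero or negative for uninformative variables, which is why a strict >0 threshold removes them.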
Model evaluation metrics
The model accuracy was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and balanced accuracy. The data was wrangled using tidyverse v1.3.1 package25 and the evaluation metrics were derived using caret v6.0-9226. For each model, the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) were determined. Sensitivity was defined as TP / (TP + FN), specificity as TN / (TN + FP), PPV as TP / (TP + FP), NPV as TN / (TN + FN)27. Balanced accuracy accounts for imbalanced classification and was defined as (sensitivity + specificity) / 2.
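The definitions above translate directly into a few lines of code. The following Python sketch is a plain restatement of the formulas, not the caret implementation used in the study, and the confusion-matrix counts in the example are hypothetical.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def evaluate(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts: 3 antigen-positive samples, one of them misclassified.
metrics = evaluate(tp=2, tn=50, fp=0, fn=1)
print(round(metrics["sensitivity"], 3), round(metrics["balanced_accuracy"], 3))
# 0.667 0.833
```

The example shows why balanced accuracy is informative for rare antigens: a single misclassified positive sample lowers it to 83.3% even though 52 of 53 samples are classified correctly.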
Validation of the Finnish models and the random forest method for generating the models
The models obtained using the Finnish data set were applied to the Danish cohort. The implementation required imputed genotype data in vcf or PLINK format. The Danish allele dosage data was harmonized by naming and allele orientation for compatibility with the Finnish models and the dosage data for the missing important variables was imputed using mean values.
The model-generating method was further validated by fitting the models on the Danish data set to create models specific to the Danish cohort. In the Danish data set, the percentage of missing genotypes was on average 5%, depending on the genetic region of the blood group/HPA system. Missing allele dosage values were imputed separately for the train and test data sets using mean values before the classification random forest step. Characteristics of the Danish models are presented in Supplementary Table 2. The important variables for the Danish models are listed in Supplementary Data 3. The evaluation metrics for both prediction and modelling were defined as described in the “Model evaluation metrics” section.
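Mean imputation of missing dosages, applied separately to the train and test data sets, can be sketched as below. This is an illustration under our own assumptions: dosages are held in a list of rows with `None` marking a missing value, the mean is taken per SNV column over observed values, and a column with no observed values raises an error because the source does not state how such a case was handled.

```python
def impute_mean(matrix):
    """Return a copy of `matrix` with None replaced by the column mean.

    Rows are samples and columns are SNV dosages. Raises ValueError when a
    column has no observed value (handling of that case is not specified).
    """
    n_cols = len(matrix[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        if not observed:
            raise ValueError(f"column {j} has no observed dosages")
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in matrix
    ]


# Hypothetical dosage data; train and test are imputed independently.
train = [[0.0, None], [2.0, 1.0], [None, 2.0]]
test = [[1.0, 0.0], [None, 2.0]]
print(impute_mean(train))  # [[0.0, 1.5], [2.0, 1.0], [1.0, 2.0]]
print(impute_mean(test))   # [[1.0, 0.0], [1.0, 2.0]]
```

Imputing each data set from its own column means keeps information from the test samples out of the training data.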
The significance of variation of balanced accuracies was analyzed using Mann-Whitney-Wilcoxon Test implemented with R v3.6.1.
Data availability
Genotyping and RBC antigen/phenotype and HPA-1 typing data for the Finnish cohort are stored in the Blood Service Biobank, Helsinki, Finland. Researchers may apply for access to data (https://www.veripalvelu.fi/en/biobank/for-researchers/). Due to privacy laws, the Danish genetic data and phenotypes are only available to DBDS researchers and blood banks.
RESULTS
Evaluation of the Finnish classification models
In the Finnish cohort, the genotype data was accessible for 1,192 blood donors and the RBC antigen typing data was available for 39 antigens representing 15 blood group systems. The blood group typing frequency varied greatly depending on RBC antigen/phenotype, being at the lowest 5% for HPA-1b and at the highest 100% for A, B, AB, O, K, D, C, c, E, and e (Table 1).
After data partitioning, the number of study subjects in the test data set was 596. The median (interquartile range [IQR]) balanced accuracy for 39 models was 99.9 (98.9–100)% in the test data set and accuracy metrics for all models are presented in Table 3. The models for antigen/phenotype positivity of AB, B, A1, A2, Ytb, Coa, Doa, Dob, Fya, HPA-1b, K, Kpa, Ula, Jka, Lua, S, and s reached balanced accuracy of 100%. For other models, the balanced accuracy was ≥98.0%, except 83.3% for Lsa, 94.0% for Leb, 95.0% for HPA-1a, and 96.0% for hrS. Accuracy metrics for the test and full data sets are presented in Supplementary Tables 3 and 4, respectively. Figure 2 illustrates the confusion matrices for classification models in the Finnish test data set. The number of false negative plus false positive (FN + FP) samples out of all samples was low, ranging from 0 to 1% in all models, except 2% for hrS. Confusion matrices for the test and full data sets are presented in Supplementary Figures 1 and 2, respectively. The median (IQR) prediction error, determined as misclassification frequency obtained from out-of-bag data, of the Finnish models was 1.6 × 10−3 (1.9 × 10−4–7.0 × 10−3) (Table 2).
The distributions of posterior probabilities (PP) in the test data set are depicted in Figure 3. Samples having PP >0.5 were classified as antigen positive and those having PP ≤0.5 as antigen negative. The majority of the PPs were close to 1 for the antigen typing positive samples and close to 0 for the antigen typing negative samples. The Coa-negative samples (only two samples in the test data set) were classified correctly but the PPs were closer to 0.5 than to 0. One of the three Lsa-positive samples was misclassified and the PPs for the other two were closer to 0.5 than to 1 (sensitivity 66.7%). A spectrum of PP distributions with some misclassifications was observed for Cob, Leb, M, N, C, Cw, D, and hrS. Supplementary Figures 3 and 4 depict the distributions of PPs in the test and full data sets.
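The classification rule applied to the posterior probabilities can be written as a one-line function. The Python sketch below only illustrates the PP >0.5 threshold; the sample identifiers and PP values are hypothetical.

```python
def classify(pp, threshold=0.5):
    """Antigen positive when PP > threshold, otherwise antigen negative."""
    return "positive" if pp > threshold else "negative"


# Hypothetical posterior probabilities for four samples.
posterior = {"s1": 0.98, "s2": 0.03, "s3": 0.50, "s4": 0.56}
calls = {sample: classify(pp) for sample, pp in posterior.items()}
print(calls)
# {'s1': 'positive', 's2': 'negative', 's3': 'negative', 's4': 'positive'}
```

A sample with PP exactly 0.5 is classified as antigen negative. Calls such as `s4`, with a PP close to the threshold, correspond to the less certain classifications described for Coa and Lsa.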
Validation of the Finnish classification models in the Danish cohort
The Danish validation cohort had genotype and phenotype data for 34 out of the 39 Finnish classification models. Antigen/phenotype typing data varied from 433 for A2 to ∼111,000 for A, AB, B, O, and D (Table 1). Due to missing Finnish model variables in the Danish genotype data, the Danish allele dosage data was harmonized using mean imputation before applying the Finnish models.
The median (IQR) balanced accuracy for classifications was 97.1 (90.1–99.4)% and all the evaluation metrics are presented in Supplementary Table 4. The balanced accuracies were >98.0% for 14 models including antigen/phenotype positivity of A, AB, B, O, Ytb, Doa, Dob, HPA-1a, Jka, Lea, S, s, E, and e. Models for antigen/phenotype positivity of A1, Cob, Fya, Fyb, HPA-1b, K, Kpa, Lua, M, N, and Cw had balanced accuracy ranging from 91.6 to 98.0%. Six models, A2, Coa, Jkb, Leb, D, and C had balanced accuracy ranging from 64.6 to 89.4%. The Finnish models for LWb, P1, and c failed classification in the Danish cohort.
Validation of the classification model algorithm in the Danish cohort
The RBC antigen/phenotype and HPA-1 typing and genotype data available for the Danish cohort enabled implementation of 40 Danish classification models representing 15 blood group systems. Due to missing genotypes (approximately 5%), missing allele dosage values were imputed separately for train and test data sets using mean values.
The median (IQR) balanced accuracy for the 40 Danish models in the Danish test data set was 99.3 (95.1–99.8)%. The evaluation metrics for the test data set are available in Table 4 and for the train and full data sets in Supplementary Tables 6–7, respectively. The majority (23/40) of the Danish models reached balanced accuracy of ≥99.0%, including models for antigen/phenotype positivity of A, AB, B, O, Yta, Ytb, Doa, Dob, Fya, HPA-1a, HPA-1b, Jka, Jkb, M, N, S, s, C, c, D, E, Lea, and Knb. Balanced accuracies for A1, Cob, Fyb, K, Kpa, Lua, Cw, e, and P1 models ranged from 94.4 to 98.1%, and for A2, Coa, k, Kpb, Lub, Vel, and Leb from 70.0 to 89.3%. The Danish model for Kna failed classification because the test data set contained too few Kna-negative samples. Confusion matrices for the Danish models in the Danish train, test and full data sets depict the distribution of TN, FN, TP, and FP samples and are illustrated in Supplementary Figures 5–7, respectively. The median (IQR) prediction error of the Danish models was 2.3 × 10−3 (9.3 × 10−4–7.1 × 10−3)% (Supplementary Table 5).
Comparison of the Finnish and Danish classification models
A summary of the balanced accuracies for the Finnish and Danish models in the Finnish and Danish full data sets is presented in Table 5. When analyzing the shared 33 models, the Finnish models predicted the blood groups of the Finnish cohort more accurately than the blood groups of the Danish cohort (median [IQR] balanced accuracy 99.9 [98.8–100]% vs. 97.1 [91.6–99.5]%, p = 1.15e-06). The Danish models performed better than the Finnish models in the blood group classification of the Danish cohort (median [IQR] balanced accuracy 99.5 [96.5–99.8]% vs. 97.1 [91.6–99.5]%, p = 0.006).
The number of genetic variants available for the Finnish random forest modelling ranged from 35 to 688 depending on the blood group/HPA system and number of the important variables selected by the classifier for the final models ranged from 12 to 214 (Table 2). In the Danish genotyping data set, the number of variants varied from 42 to 766 and the final models utilized 20–743 variants (Supplementary Table 5).
DISCUSSION
Our study introduces random forest classification models for predicting RBC antigens/phenotypes and HPA-1 from array-based genotyping data. The method and models were generated utilizing blood group typing data from Finnish blood donors and further validated using a large Danish blood donor cohort. The results demonstrate high overall accuracy, and the method is suitable for biobank-scale screening and analysis of HPA-1 and RBC antigens.
Blood transfusion is one of the most common clinical procedures in hospitals and the key element for safe transfusion is compatibility between the recipient and donor RBC antigens1. Although transfusion-related severe outcomes are rare, the prominent risk of sensitization and further alloimmunization affects especially patients dependent on recurrent transfusions4,5. Extended blood group typing has proven to be beneficial by reducing the incidence of alloantibody formation28,29. Additionally, studies have shown that the extended genotyping of blood donors markedly increases the number of suitable donors for immunized recipients13 and enhances the supply of antigen-negative blood6.
At present, preventive matching strategies are implemented only for specific patient groups and, despite the obvious advantages of the extended genotyping of donors, the procedure has not been considered feasible covering all blood donors. Over the last decades, the genotyping of different populations has expanded widely. Using machine learning approaches to screen blood donor and research biobank genotyping data may provide a cost-effective solution for enlarging the pool of antigen-negative blood donors. Our random forest classification method infers RBC antigens and HPA-1 from genotype-imputed microarray data. The R package ranger performed fast and handled the dimensionality of input data without problems21. The obtained results demonstrated high balanced accuracies both in the Finnish discovery cohort (median 99.8% for the 39 Finnish models) and in the Danish validation cohort (median 99.3% for the 40 Danish models) (Table 5). The performance was not affected by nearly a 100-fold size difference between the Finnish and the Danish cohorts (∼1,200 vs. ∼111,000, respectively).
Rh and MNS blood group system antigens have been challenging to determine by sequencing due to complex genetic variation and gene rearrangements12,30. We observed reduced balanced accuracy in the Finnish model for hrS (93.3%) and the Danish model for Cw (95.3%). However, the other Rh and MNS antigen models, including clinically significant E, e, C, c, S, and s, performed accurately. The balanced accuracies for clinically significant antigens in other systems, including K, Jka, Jkb, Fya, and Fyb, ranged from 95.6% to 100% (Table 5).
The BOOGIE method for prediction of RBC antigens was published in 201514. It builds on the 1-NN algorithm and its implementation requires genotype sequencing data and curated haplotype tables for the RBC antigen phenotypes. When compared, the Finnish models for ABO and RhD performed better than the BOOGIE method (median balanced accuracy for the Finnish ABO models 100% vs. BOOGIE ABO accuracy 94.2%; balanced accuracy for the Finnish RhD model 98.8% vs. BOOGIE RhD accuracy 94.2%). The observed differences in accuracies could be explained by the potentially limited haplotype tables utilized by BOOGIE. Additionally, the reported results of BOOGIE are based on a low number of samples.
When applying the Finnish models to the Danish cohort, the observed decrease in balanced accuracies was expected because of the evident genetic, genotyping, and imputation differences between the Finnish and the Danish cohorts (Table 5). The Finnish cohort was imputed using a population-specific imputation reference panel, resulting in no missingness per individual. In contrast, the Danish cohort was imputed using the North European reference sequence panel, resulting in an average missingness of 5%. As random forest is not able to handle missing input data and the important variables of the Finnish models were not fully present in the Danish data, we were obliged to use mean imputation for missing variant dosage data. This approach also introduces errors to the data, which may partly explain the reduced accuracy. The better performance of the Danish models in the Danish cohort underlined the benefit of a population-specific training cohort.
To our surprise, the Finnish genotyping data had only one variant in the RHD region. Nonetheless, the Finnish model for RhD performed with sufficient balanced accuracy in the Finnish cohort (98.8%). Our method combines RHD and RHCE region variants for the modelling and the high linkage disequilibrium may have supported the classification (Table 2). However, the Finnish model for RhD worked poorly in the Danish cohort (78.4%), which may be attributed to the mean imputation of missing values.
The present modelling method is restricted to the RBC antigen typing data available for the training and test data sets, which can be considered a major limitation because the data for some RBC antigens are scarce. RBC antigens have demonstrated significant diversity among populations and rare blood group variants may not be discovered without substantially large typing numbers. The Danish model for Kna failed because of the lack of Kna-negative samples in the test data set, and we were not able to create Finnish models for, e.g., Vel, k, Kpb, Lub, and LWa.
In the future, comprehensive donor and recipient typing and precision matching are likely to increase. A recent publication by van Sambeeck et al.31 demonstrated the feasibility of preventive matching for all genotyped recipients and donors. Our method is suitable for initial screening for antigen-negative donors at biobank scale, presenting a cost-effective solution for the extended blood group and HPA-1 typing. Additionally, successful prediction of polygenic blood groups may facilitate the research of disease associations in large biobanks.
Scripts for random forest modelling and for applying the tested 39 Finnish models are freely available on GitHub. The implementation is possible in the local computing environment without sensitive data uploads and requires only a moderate level of bioinformatic skills.
AUTHOR CONTRIBUTION
K. Hyvärinen, K. Haimila, J. Partanen, and J. Ritari designed the study. The Blood Service Biobank provided the samples and genomic data. K. Haimila supervised blood group typing and provided blood group expertise. K. Hyvärinen and J. Ritari scripted the random forest method, carried out the analyses in the Finnish cohort, and interpreted the results. O.B. Pedersen and C. Erikstrup established the Danish cohort. C. Moslemi executed the replication analysis in the Danish cohort. M.L. Olsson, S.R. Ostrowski, and O.B. Pedersen supervised and interpreted the analysis performed in the Danish cohort. K. Hyvärinen drafted the manuscript. All authors contributed to the final version of the manuscript.
DISCLOSURE AND FUNDING
M.L. Olsson is an inventor on patents about Vel blood group genotyping (unrelated to the methods and models presented in this study) and owns 50% of the shares in BLUsang AB, an incorporated consulting firm, which receives royalties for said patents. The other authors declare no conflicts of interest.
The study was partially supported by funding from the Government of Finland VTR funding, the Academy of Finland, the Finnish Cancer Association, and Business Finland. The study of the Danish cohort was supported by Independent Research Fund Denmark (project number 0214-00127B), Bloddonornes forskningsfond, and A.P Møller Fonden. M.L. Olsson is a Wallenberg Clinical Scholar funded by Knut and Alice Wallenberg Foundation.
SUPPORTING INFORMATION STATEMENT
Supplementary Information is online.
ACKNOWLEDGEMENTS
We want to thank Dr. Satu Pastila and Ms. Ritva Toivanen at the FRCBS for the collaboration with the blood group typing data. We are also grateful to Ms. Birgitta Rantala, Mr. Petteri Vaskin, Ms. Katariina Karjalainen, Ms. Nina Nikiforow, Ms. Jonna Clancy, Dr. Mikko Arvas, and Dr. Tiina Wahlfors at the Blood Service Biobank for their help in handling the data and samples, and to Dr. Jaana Mättö and the personnel at the FRCBS Blood Group Unit for blood group typing analyses.
From Denmark, we wish to thank the Danish blood donors and deCODE Genetics for genotyping the Danish cohort.
Footnotes
Supplementary Figures updated.