Abstract
While many machine learning and deep learning-based models for clinical event prediction leverage various data elements from electronic healthcare records such as patient demographics and billing codes, such models face severe challenges when tested outside of their institution of training. These challenges are rooted in differences in patient population characteristics and medical practice patterns of different institutions. We propose a solution to this problem through systematically adaptable design of graph-based convolutional neural networks (GCNN) for clinical event prediction. Our solution relies on unique property of GCNN where data encoded as graph edges is only implicitly used during prediction process and can be adapted after model training without requiring model re-training. Our adaptable GCNN-based prediction models outperformed all comparative models during external validation for two different clinical problems, while supporting multimodal data integration. These results support our hypothesis that carefully designed GCNN-based models can overcome generalization challenges faced by prediction models.
Introduction
During each patient visit, healthcare centers record the health data of patients in digital systems referred to as Electronic Health Records (EHR) that consist of heterogeneous elements, including demographics, prescriptions, diagnosis, laboratory and radiology test results, encounter notes, procedures, and treatment plans. Structured format of EHR represents data that can take a value within a specified range or from a pre-defined dictionary. Examples of such EHR data include, but are not limited to, medical codes, medications, administrative data, vital signs, and laboratory test outcomes. In the era of digital age, secondary use of structured electronic health records (EHR) for developing machine learning (ML) and deep learning (DL) models for clinical event prediction and digital phenotyping1,2 is becoming widely popular and is being clinically adopted for improving the healthcare delivery. However, models trained on a single institution’s data often face severe challenges when applied across multiple different institutions and diverse populations3. These challenges usually stem from differences in patient population characteristics such as age, gender, race and common comorbidities among the population, as well as differences in medical practice, manifested in the billing practice of medical procedures and recording and coding of comorbidities as CPT and ICD codes.
ML/DL models commonly leverage ICD and CPT codes to incorporate the clinical status of patients in addition to their demographic features4,5. These codes are designed to convert healthcare services to billable revenue. Qualified healthcare coders are responsible for accuracy and completeness of these codes. However, significant differences exist in coding practices between different healthcare institutions6. ML/DL models often learn practice patterns of the training institution rather than relevant predictive features and can fail when applied to another institution7. Some studies have even shown that time and frequency of lab test order is more important for the model than actually the result of the lab test8. A survey paper recently concluded that when tested on external data, more than 20 models trained for prognosis for COVID-19 patients could not outperform univariable predictions made on oxygen saturation level at the time of admission to the hospital, calling into question the utility of ML modeling for clinical event prediction9. Proprietary risk prediction models are not exempt from this trend either. Recent studies have shown that the Epic sepsis model achieves subpar performance when validated externally9. Models also experience performance decay over time even when deployed in the same institution where it was trained, likely attributed to evolution of population characteristics and practice patterns over time10. These challenges limit the generalizability and scalability of ML/DL models that leverage electronic health records for tasks like clinical event prediction or patient phenotyping. Hence, researchers have been motivated to find remedies to the problem of limited generalizability of EHR-based ML/DL models.
A popular applied remedy is the curation of refined clinical features for standardized risk prediction11, such as pooled cohort equations12. Clinical features are refined to eliminate practice pattern based variations that might arise in the recording of those features. However, curation of these features require extreme manual effort, and hence introduces the possibility of curation errors and is limited to generating smaller datasets. This approach also lacks comprehensiveness and can potentially miss other relevant features that might be predictive for a given DL/ML task contained within the EHR. Such models can only focus on expert-defined clinical features and cannot make use of the vast amount of information available in the electronic health records in general. Research has shown that comprehensive models using a wide variety of EHR outperform models using curated features13. Even after valuable curation effort and targeted modeling based on a narrow set of curated features, this approach shows biases among different population groups12. Moreover, such models necessitate availability of curated clinical features thus requiring patients to undergo potential tests that may be part of such features. Another approach is harmonizing EHR under standard data models like Observational Medical Outcomes Partnership (OMOP) and Fast Healthcare Interoperability Resources (FHIR). These data models put well-known limitations on granularity of EHR and cannot handle variations in the data patterns themselves14,15. These challenges hinder wide-spread adoption of EHR-based ML/DL models across multiple institutions.
Considering the limitations of the previously proposed solutions to the challenge of generalization, we propose a novel solution to adaptability challenges of EHR models through the design of graph-based convolutional neural networks (GCNN). GCNNs have been used to fuse data modalities such as radiological images and demographic information in their two data structures - nodes and edges1,16. A few previous studies have hinted towards generalizability properties of GCNN on a small scale for limited patient cohorts17. We formalize GCNN model design such that the trained model is generalizable in cross-institutional validation setup and adaptable to the difference between the coded EHR data. In our model design setup, the edge structure of the graph is used to encode patients’ similarity based on structured EHR data elements like ICD and CPT codes. The GCNN models are trained to learn explicitly from data elements selected as node features (e.g., imaging data or selected data elements from EHR) and implicitly from patients’ similarity patterns based on EHR data elements selected for edge formation. These similarity patterns can be systematically adapted to handle temporal and cross-institutional demographic and practice pattern variations while keeping the pre-trained GCNN model applicable as the model only operates on similarity patterns and is agnostic to the exact estimation process used for that similarity pattern.
We validated our generalizable model design framework to solve two clinically relevant problems on completely different populations; 1) prediction of two clinical events for patients hospitalized with positive COVID-19 test: discharge from hospital and mortality using chest X-rays and EHR data elements such as billing codes and demographics features; and 2) prediction of blood transfusion in hospitalized patients using a wide range of EHR data elements (demographic features, CPT and ICD codes, medications, lab tests, vital signs). Hospital discharge and mortality prediction models were trained over data collected from Emory University Healthcare (EUH) network and externally validated on data collected from four geographically disparate sites of Mayo Clinic (MC). For transfusion prediction, data was collected from MC sites of Rochester and Arizona to serve as internal training and testing data while external validation was performed on publicly available MIMIC IV (Medical Information Mart for Intensive Care) dataset that contains data for ICU patients from Beth Israel Deaconess Medical Center in Boston, Massachusetts.
Methodology
Graph Convolutional Neural Network
Graph convolutional neural network (GCNN) advanced machine learning by allowing the model designer to choose the definition of ‘neighborhood’ to be incorporated by the model through definition of a graph G(N, E) where N denotes the set of nodes and E denotes the set of edges. In this scenario, ith sample from the cohort forms ith node (vi) with two feature vectors, i.e., node features (ni) and edge features (ei). An edge between ith and jth sample, denoted as εi,j, is decided based on edge-formation function ϵ(ei, ej). GCNN model will learn to generate embeddings for ith node by manipulating nodes features of this node (ni), and ‘messages’ received from nodes in its edge-connected neighbor (η(i)). At k + 1th graph convolutional layer, the following describes the process of generating embedding of ith node () In a supervised leaning scenario where target label for each node is available, node embedding generated by graph convolutional layers is used to predict target label ŷi for ith node as Through backpropagation of loss such as binary cross entropy defined on ground truth y and predicted labels ŷ, weight matrices Wk∀ k ∈ K and Wfc are optimized where the model included K graph convolutional layers.
The neighborhood (ηi) of ith node vi can be defined based on its edge-connected nodes, i.e., ηi = {vj ∀εi,j ∈ E }. Messages are sent and received between nodes in a neighborhood. In essence, these messages are features of the nodes (nj, ∀vj ∈ ηi) in the neighborhood (ηi). GCNN model, through its training process, learns the function parameters to manipulate features of the ith node (ni) as well as ‘messages’ being received through various edge-connected nodes from its neighborhood ηi. Hence, GCNN is capable of two-fold learning. The model learns from the features of ith node (ni) directly, and implicitly learns from information used for edge formation (edge features ei), through incorporation of ‘messages’ from edge-connected nodes (nj, ∀vj ∈ ηi). However, model never directly manipulates edge features (ei).
This is an important advantage of GCNNs as it relates to generalizing a trained model to unseen and diverse populations from external institutes. Since GCNN never has to manipulate edge features (ei) with parametric functions, edge features and edge formation function ε(.) can be adapted based on characteristics of the data of the individual institutions when shipping the trained model from one institute to the other. Thus, we based our work upon systematic use of this characteristic of GCNN to build generalizable models using EHR data elements.
Adaptable GCNN Design for external use cases
We focused on adaptable edge-formation process to ensure generalizability of our GCNN based models. Trained GCNN model requires consistent formation of node features (ni) in external cohort for its application on external cohort. However, the model does not directly manipulate edge features (ei), and hence, edge formation process ε(ei, ej) can be adapted to suit external cohort without hindering the application of trained GCNN model on external cohort. The following represent two scenarios where such adaptation is crucial.
Case – 1
Let us assume that edge features of internal cohort are denoted as where and edge features of the external set are denoted as where where . Such distinct feature selection for the two cohorts may be the result of frequency-based selection of common features such as billing codes or administered medication. No existing machine learning model trained on internal feature vectors will be applicable to a separate set of external feature vectors. However, GCNN can tolerate such difference in internal and external cohort by employing these features for edge formation ε(.). As explained earlier, GCNN models do no manipulate edge feature vectors directly, but only implicitly use them through ‘messages’ received through these edges.
Case – 2
Let us assume that edge features for internal and external cohorts are the same, i.e., SE with A = |SE|. However, graph formation process is more intelligent than simple thresholding on count-based or binary edge feature vectors for internal and external cohorts. For example, edge features may be collected over T time intervals, and an edge is formed based on similarity in temporal pattern of these features for nodes i and j, i.e., and ε(oi, oj). Even with the same set of features, temporal patterns may be different for internal and external cohorts. For large academic healthcare centers, such patterns may involve both in-patient and out-patient data. For databases collected for critical-care patients only, outpatient data may be missing in temporal patterns. Hence, temporal pattern forming function should be different for internal and external cohort, i.e., τint and τext. Putting limitations on pattern formation function may enable traditional ML models trained on output of internal temporal pattern function τint to be applicable to outputs of external temporal pattern function τext, but graph learning paradigm provides more flexibility. To suit the characteristics of each cohort, τint and τext may produce output of different dimensions (), or operate of sequences of different length Tintand Text), or even work on different set of features, i.e., and (encompassing the scenario described in case –1).
In terms of patients’ cohort, one node may represent a patient at a certain point in time and edge may denote that two connected nodes/patients are similar in terms of some demographic (e.g., age) or clinical features (e.g., comorbidities). This property has been exploited for detection of Alzheimer and autism spectrum disorder by building graphs with patients as nodes, brain imaging data as node features, and simple demographic features-based similarity used for edge formation1,16,18,19,20. We move beyond such modeling by allowing much more comprehensive information to be used as edge features, e.g., all recorded billing codes for patients, or auto-encoder based compressed representation of historical patterns in recorded billing codes and medications.
The primary intuition of our generalizable GCNN design is to represent the measured/recorded health data (e.g. images, lab values) as node features which face minimal chance of variability due to practice pattern, and leverage the adaptability of the edge formation to represent the variable EHR information (e.g. diagnosis and procedure codes).
Clinical use-cases of the adaptable GCNN design
We validated our model design scheme on two clinically relevant use-cases; 1) prediction of major clinical events (discharge from hospital and mortality) for patients hospitalized with positive RT-PCR test for COVID-19, 2) prediction of need for transfusion for hospitalized patients. Figures 1-a and 1-b show frameworks for both use-cases. The first use-case employs a branched framework where patients marked as highly probable for discharge are evaluated for mortality risk and only in-patient data is used. Second use-case employs historical pattern of recorded procedures, comorbidities, and medication for a patient as well as data recorded during first 48 hours of hospitalization.
Cohort Selection
For use-case 1, internal cohort included all patients admitted to EUH between Jan-Dec 2020 with positive RT-PCR test and for whom chest X-ray examinations (AP view) were acquired at regular intervals during their hospital stay. External cohort was selected with similar criteria from MC (four sites) for the year 2020.
For use-case 2, transfusion prediction model was trained on internal data collected from two sites (Rochester and Arizona) of MC. External validation was performed on two datasets; a) data collected from geographically distant site of MC, i.e., Florida, and b) open-source MIMIC IV dataset. Relatively small number of patients required blood transfusion (approximately 0.5% of hospitalizations in MC in 2019 required blood transfusion). Such a small positivity rate of transfusion drove us to curate a training dataset through propensity matching for the control group.
Demographic features as well as 5 major comorbidities groups including metabolic disorders, hypertensive disease, heart disease, acute kidney failure adverse effects of drugs, were used as confounders to perform one-on-one propensity matching with cases (hospitalizations with blood transfusion) to select control groups (hospitalizations without blood transfusion). Case and control groups were perfectly balanced in terms of confounding variables in our propensity matched training data. Salient characteristics of all cohorts are described in Table 1.
Model Design
Figure 2 shows distributions of subgroups of billing code sets (CPT and ICD) for use-case 1 from two different institutions. External institute used a much larger number diagnostic tests indicated by higher bars for diagnostic radiology and drug assay subgroups. While blood disease seems to be more common among patients in internal institute than in external institute, external cohort had a larger fraction of patients with metabolic disorders and heart disease. Such data elements require adaptation when experimenting from external cohort, and hence are suitable for edge feature formation.
Such variations are handled by our generalizable model design which relies on unique adaptable learning paradigm of GCNN model as described earlier. Chest X-rays and tabular data (demographics, and CPT and ICD codes) were available for case-study 1. Image features extracted from pre-trained DenseNet-121 models were used as node features (ni). All tabular data elements were evaluated as edge feature vectors (ei) for effective edge formation. CPT and ICD codes were mapped to their corresponding subgroups in CPT and ICD code hierarchies respectively, and finally represented as one-hot feature vectors.
Transfusion prediction model employed a larger variety of EHR features. Latest results of selected labs (recorded as Normal/Abnormal/Unknown), trend of change (gradient) in five important vital signs (temperature, mean arterial pressure (MAP), SpO2, pain score, pulse rate) recorded during the first 48 hours of hospitalization, demographic features, and free-text field of reason for visit vectorized under tf-idf featurization scheme were concatenated to form node features for this model. Edge features (ei) were generated by a temporal embedding model τ(.) for variable data elements like billing codes (CPT and ICD) and medication.
Temporal Embedding Model
Embedding model τ(.) encodes temporal patterns recorded as three-point sequences; Timepoint 1 (T1): data collected between 6 and 12 months before hospitalization, Timepoint 2 (T2): data collected between 6 months before hospitalization to the time of hospitalization, Timepoint 3 (T3): data collected within first 48 hours of hospitalization.
Temporal embedding model τ(.) is essentially an LSTM based encoder-decoder architecture. The feature vector at each time point is encoded to a latent space such that when decoded, it generates the feature vector of the next time point. Hence, the model is trained in a self-supervised fashion with no regard to any downstream prediction label. Once the model has been trained, the last hidden state vector generated in response to an input sequence can be used as an embedded edge feature vector ei encompassing temporal information encoded in the data elements used as input. Our transfusion prediction model operates on a graph of patients where edges between patients are decided based on similarity in their embedded vectors ei. As explained in Methodology section, this temporal embedding model is trained separately for external data, while keeping the GCNN-based transfusion prediction model trained on internal cohort applicable over external cohort.
Three timepoints sequence design was selected experimentally as further finer-grained splitting of historical data resulted in many empty timepoints (time interval with no available data). This is due to the nature of data collected as inpatient vs outpatient. Most historical data is outpatient data except for cases where a patient was hospitalized in the last one year as well. Outpatient data is much more sparse than inpatient data. Note that external data was collected from MIMIC IV which is very different from the internal cohort. MIMIC data records patient hospitalizations with no outpatient data. Hence historical data is only available if the patient was hospitalized within the last one year as well. Still, retraining the embedding model provided a chance for adaptation to such a different scenario.
Results
We performed thorough experimental evaluation for both use-cases. As both use-cases involve multiple data elements (including EHR data elements and chest X-rays) and the proposed approach is based on fusion of all data elements through GCNN-based models, an intuitive comparative baseline is formed by single-modality models, each using only one of the data elements. Therefore, we trained and evaluated such single modality models for all three clinical events. In addition, we also employed a traditional fusion modeling approach, i.e., late fusion, which gathered target label probability estimates from single modality models and processed them together through a meta-learner for final target prediction. Tables 2 and 3 show performance of clinical event predictors for use-cases 1 and 2, respectively, for both internal held-out test sets and external sets.
For use case 1, other than the obvious difference in coding patterns (Figure 2), the patient populations are significantly different between EUH and MC (internal and external institutions, respectively) - (1) 49% female patients in EUH while 37% female in MC; (2) 61.5% African American in EUH and 7.9% in MC; (3) as comorbidities, 16.9% kidney disease in EUH and 51.2% in MC; 52.4% cardiovascular disease in EUH while 75.2% in MC (Table 1). For use-case 2, the external cohort was significantly small (68 patients); however similar variations in patient population were also observed.
In the challenging scenario formed by vast differences in internal and external datasets, individual modality classifiers and traditional fusion models struggle when presented with external data. On the other hand, GNN based models tend to fare better under similar settings. For use-case 1, late fusion model achieved 0.58 [0.52 - 0.59] AUROC on the external dataset for hospital discharge prediction while the GCNN achieved 0.70 [0.68-0.70]. For mortality prediction, late fusion model achieved 0.81[0.80-0.82] AUROC on external dataset while the GCNN model achieved 0.91 [0.90 - 0.92]. For the use-case 2, we observed more gaps in performance due to missing/incomplete data in the external dataset - late fusion achieved 0.44[0.31-0.56] while the GCNN model achieved 0.7 [0.62-0.84].
We hypothesize that superior performance of GCNN on external data is due to the adaptation of the edge formation function. To test this hypothesis, we applied GCNN based models without adaptation of edge formation function on external for all our clinical event prediction tasks and compared results with application of GCNN based models with edge formation function adaptation. The models suffer significant performance loss in majority of the cases when used without edge formation function adaptation (Table 4). Hence, we can safely conclude that the generalization and adaptability capabilities of GCNN based models arise from its unique ability to adapt edge formation process to suit new population even after model training.
Discussion
As highlighted in the literature7,8, the challenges related to generalization limit the application scope of ML/DL models that could otherwise leverage the rich electronic health records for tasks such as clinical event prediction or patient phenotyping. In this study, we propose an adaptable GCNN framework for EHR modeling that can be easily generalizable across institutions where the difference in patient population and coding practices are significant. The GCNN framework allows to choose the definition of patient/case similarity to be incorporated into the model through definition of a graph which not only mimics the parts of clinical decision making but can also be utilized to overcome the generalizability limitation of traditional ML/DL models. While the GCNN model learns from the features of node directly, it implicitly learns from information used for edge formation through incorporation of edge-connected nodes. Thus, our generalizable GCNN design primarily represents the measured/recorded health data (e.g. images, lab values) as node features which usually have minimal variability across sites, and leverage the adaptability of the edge formation to represent the variable EHR information (e.g. diagnosis and procedure codes).
We validated the proposed generalizable GCNN model design framework to solve two important clinical use-cases; 1) prediction of adverse clinical events for COVID-19 patients, and 2) prediction of blood transfusion in hospitalized patients. We trained the GCNN models using data from one institution (EUH/MC) and validated externally on MC and publicly available MIMICIV datasets, respectively. During our experimentations, even though the performances on the internal datasets were close, GCNN models consistently outperformed the traditional ML/DL models on the external datasets. We hypothesized that this performance trend is due to the ability of graph-based models to adapt their edge formation functions without requiring any re-training or fine tuning the model itself.
To our knowledge, we are the first to report the edge adaptable GCNN property to improve the generalizability of ML/DL model in healthcare settings. Our proposed design has several important advantages. First, the model trained on an internal dataset does not need fine-tuning or retraining on the external data, even when the EHR data structure and coding frequency differs significantly between the institutions. Second, the graph design implicitly models the similarity between the patients and thus mimics the clinical decision making. Third, the adaptable edge formation technique allows to explore institution specific variables to define the patient/case similarity and provide flexible design choice. Fourth, graph design allows integration of multi-modal data (images + EHR).
There are several limitations in this study, such as those associated with a retrospective design of both use-cases. In addition, for the MIMIC dataset, the timestamp associated with the CPT code and reason for admission were missing, thus we were not able to evaluate the model performance using those data elements. Given the low prevalence, we used propensity score matching to select the control cases which provided only a selective sample for validation.
Conclusion
We proposed a novel solution to the challenges faced by machine learning and deep learning-based models relying on electronic health records for predictive modeling for patient populations. Generally, such models suffer from poor generalization capabilities due to differences in medical practice patterns and patient population characteristics when applied outside of their institute of training. Our systematic design of graph based convolutional neural networks implicitly learns from such varying data elements of electronic health records through their use in the edge formation process. Edge formation function can be adapted to suit the new population when models are to be tested externally without needing any retraining of the originally trained model. We proved the benefits of our approach through its application on two clinically relevant problems; each tested on two diverse populations. We included a wide variety of electronic health records data elements as well as imaging information, indicating that our approach is capable of handling complex multi-modal data while developing highly adaptable models.
Data Availability
Data collected from Emory University Healthcare network and Mayo Clinic is confidential, and can only be made available through request to these institutions.