ABSTRACT
Clinical trial emulation, which is the process of mimicking targeted randomized controlled trials (RCT) with real-world data (RWD), has attracted growing attention and interest in recent years from the pharmaceutical industry. Different from RCTs which have stringent eligibility criteria for recruiting participants, RWD are more representative of real-world patients to whom the drugs will be prescribed. One technical challenge for trial emulation is how to conduct effective confounding control with complex RWD so that the treatment effects can be objectively derived. Recently many approaches, including deep learning algorithms, have been proposed for this goal, but there is still no systematic evaluation and practical guidance on them. In this paper, we emulate 430, 000 trials from two large-scale RWD warehouses, covering both electronic health records (EHR) and general claims, over 170 million patients spanning more than 10 years, aiming to identify new indications of approved drugs for Alzheimer’s disease (AD). We have investigated the behaviors of multiple different approaches including logistic regression and deep learning models, and propose a new model selection strategy that can significantly improve the performance of confounding balance of the participants in different arms of emulated trials. We demonstrate that regularized logistic regression-based propensity score (PS) model outperforms the deep learning-based PS model and others, which contradicts with our intuitions to a certain extent. Finally, we identified 8 drugs whose original indications are not AD (pantoprazole, gabapentin, acetaminophen, atorvastatin, albuterol, fluticasone, amoxicillin, and omeprazole), hold great potential of being beneficial to AD patients.
Pharmaceutical development of novel therapeutics for Alzheimer’s disease (AD) has consumed a large amount of resources over past decades but the majority of AD clinical trials have failed to produce positive results1. Drug repurposing, i.e., identifying novel indications for already approved drugs with well-defined safety and toxicity profiles, can potentially serve as a cost-effective way to accelerate AD drug development with a higher success rate2. Although repurposing drugs for AD has received increasing attention, no success has been reported on clinical sites3. One important reason is that existing efforts have been mostly based on pre-clinical (e.g., -omics, chemical, etc.) data, however, due to the complexity of the disease, these insights may not be directly translational to clinical settings.
On the other hand, large-scale real-world patient data (RWD), such as electronic health records (EHR) or administrative claims, has been accumulated in recent years and becoming readily available. Generating drug repurposing hypotheses from RWD through emulating randomized clinical trials (RCTs) with a causal analysis framework has demonstrated great potential in accelerating translation from bench to bedside for drug development and discovery4–7. This framework consists of two major steps: high-throughput RCT emulation for a large set of drug candidates using RWD, and estimation of the treatment effect of each drug candidate with causal analysis methods (such as the inverse probability of treatment re-weighting, or IPTW) for screening at scale. Due to the complexity of RWD, trial emulation with large-scale RWD has become a great touchstone for advanced AI algorithms, including machine learning or deep learning-based propensity score methods, for effective inference of treatment effects of drugs by adjusting for complicated confounding issues inherent within the observational data. As an example, recently an advanced deep learning-based long short-term memory with attention propensity score (PS) model7 showed superior performance in balancing covariates than logistic regression-based PS model when applied in the IPTW method. However, the superiority of these deep learning-based PS models still lacks systematic studies and validations on high-throughput trial emulations on different RWD databases, and whether it can be applied to AD is largely unknown.
In this study, with two large-scale RWD warehouses covering both electronic health records (EHR) and general claims, we systematically investigated the feasibility of generating AD repurposing hypotheses through high-throughput trial emulations, under an IPTW based framework with different ways of PS calculation. We emulated 430,000 RCTs for candidate drugs existing in RWD based on their impacts on the progression of patients with mild cognitive impairment (MCI) to AD. Inferring such treatment effects from large-scale RWD requires that different drug exposure groups to be balanced after IPTW with respect to high-dimensional baseline covariates4,6–8. Interestingly, we observed that the state-of-the-art deep learning-based PS model failed to balance the majority of our emulated trials. Specifically, these models usually result in insufficient overlap in different exposure groups, namely some patients in one exposure group have zero probability of being assigned to another group. This violates the basic positivity assumption in IPTW and can lead to balance failure with RWD. Related to this issue, we further demonstrate that building a PS model by purely optimizing the performance of predicting the likelihood of treatment can lead to failure in balancing covariates for patients in different treatment arms in trial emulations in our empirical studies. We, therefore, propose a new model selection strategy tailored for building machine learning models for PS calculation which yields significantly better balancing performance than existing practice. With the proposed strategy, we found that a simple regularized logistic regression-based PS model outperformed other complicated machine learning models including deep learning, and we are able to identify eight drugs including gabapentin, acetaminophen, atorvastatin, albuterol, fluticasone, pantoprazole, amoxicillin, omeprazle, with significant and consistent reduced risk of AD within 2 years, which can potentially serve as repurposing candidates for AD. Fig. 1 illustrates the overall pipeline of our proposed framework, which includes the following main steps.
First, we specify the protocols of targeted trials and their high-throughput emulations using two large-scale RWD warehouses, where we treat every single drug existing in RWD as a potential candidate (Fig. 1a). The details of the protocols are summarized in the Method section and their key components are provided in the extended data table 1. Briefly, the eligibility criteria of the treated groups of any emulated trial include MCI patients with age ≥ 50, with at least 1 year of records in the database before baseline (the date of the first prescription of the target drug) for collecting covariates, MCI diagnosis before baseline, and no AD or AD-related dementia diagnoses before baseline. For each target drug candidate, we emulate one hundred trials by constructing different control groups selected either from patients who took a random alternative drug or a similar drug under the same second-level Anatomical Therapeutic Chemical classification codes (ATC-L2) as the target drug. We further excluded patients from the control groups who were also in the treated group or took any trial drugs before baseline. All patients were followed up to 2 years or until AD diagnosis or loss to follow-up (censoring). There were over 4, 300 unique drugs (grouped by their major active ingredients) in the two databases we investigated and for each drug we emulated 100 trials, leading to 430, 000 (referred to as high throughput) emulated trials in total.
Next, we propose a new model selection strategy tailored for training, selecting, and evaluating PS calculation methods, by taking into account both goodness-of-balance and goodness-of-fit (Fig. 1b). In particular, we first randomly partition each emulated trial into mutually exclusive training, validation, and testing sets, and then a) train the PS model on the training set, b) select the best model according to the goodness-of-balance measure on the training and validation combined sets, and the goodness-of-fit measure on the validation set, and c) evaluate the selected best PS model according to the goodness-of-balance measure on the whole dataset and the goodness-of-fit measure on the test set. We quantify the goodness-of-balance by the standardized mean difference (SMD) and the goodness-of-fit by the area under the receiver operating characteristic (AUC). We tested 4 different PS calculation models including regularized logistic regression, long short-term memory network (LSTM, with attention mechanisms)7, gradient boosted decision trees (GBDT)9–11, and multi-layer perceptrons12,13, and observed that i) all these models learned and selected by our proposed strategy balanced more emulated trials than existing model selection strategies, and ii) with our strategy, complicated machine learning models such as LSTM and GBDT did not necessarily outperform the simple regularized logistic regression PS model.
Finally, we estimate the treatment effects from high-throughput emulated trials and based on which we screen and prioritize potential candidates of non-AD drugs that can be repurposed for treating AD (Fig. 1b). We compute the stabilized IPTW weights by the aforementioned PS models and use these learned weights to re-weight each emulated trial. We compute the number of unbalanced covariates of emulated trials by SMD before and after IPTW. We estimate the treatment effects (adjusted 2-year survival difference and adjusted hazard ratio) of successfully balanced trials after IPTW. We propose two criteria, significant benefits, and consistency, to screen and prioritize drug candidates, based on which eight drugs were identified which show significantly beneficial effects on AD and their estimated beneficial effects are consistent over a large number of balanced emulated trials in both databases.
Results
Our model selection for propensity score calculation results in better balancing
We construct treatment groups consisting of eligible patients (Methods) for each unique drug ingredient existing in our databases and emulate trials for all of them. For each emulated drug trial, its treatment arm consists of patients from trial drug group, and its control arm is composed by either patients randomly selected from drug groups other than trial group, or patients from drug groups wherein drugs are under the same second-level Anatomical Therapeutic Chemical classification codes (ATC-L2) as the trial drug. To achieve statistical significance, we emulated 100 trials for each drug trial consisting of 50 emulations by constructing random control groups and 50 emulations by constructing ATC-L2 control groups. Taking the OneFlorida database (see Data Section) as our discovery set, we included 73, 927 patients with MCI diagnosis from 2012 to 2020 (Fig. 1a). We found 1, 825 unique drug ingredients and emulated 182, 500 trials. We finally targeted at 66 drugs with 6, 600 emulated trials of which each treatment group has ≥ 500 patients. For each emulated trial, we randomly partitioned the data into mutually exclusive training, validation and testing subsets as standard practice. All PS calculation models were trained on the same training set, and the best-estimated model was selected by following three strategies: (a) goodness-of-fit performance on the (out-of-sample) validation set, quantified by the area under the receiver operating characteristic curve (AUC) score8,14–16; (b) goodness-of-balance performance on the validation set, quantified by the maximum value of SMD scores over all baseline covariates after IPTW7; and (c) our proposed strategy, which leverages goodness-of-balance on the training and validation combined set, and goodness-of-fit on the validation set (Method Algorithm 1). We evaluated the performance of selected models from two aspects: (i) the goodness-of-balance, which measures how similar the baseline covariates of different exposure groups are after IPTW on the whole data, and (ii) the goodness-of-fit, which measures how good the learned PS model predicts on the unseen test data (See Method Algorithm 2). Of note, the goodness-of-balance is the single most important criterion for evaluating trial emulations. We considered one covariate as balanced if its SMD value ≤ 0.117, and one emulated trial before/after IPTW is balanced if the ratio of unbalanced features among all covariates before/after IPTW ≤ 2%7. The goodness-of-fit was evaluated by the AUC score on the (unseen) test data, which is the most commonly used model evaluation strategy for machine learning models in general for binary classification.
Figure 2 shows the proportion of successfully balanced trials (≥ 10% among all emulations) for different drugs after IPTW by different PS models including: a) regularized logistic regression-based PS models (LR-PS), b) deep learning based-PS model using long short-term memory network with attention mechanisms (LSTM-PS)7, which were trained and selected by different model selection strategies as described above. Please refer to Extended Data Fig. 1 for the other PS models, including multi-layer perceptron based PS model (MLP-PS)12,13, and gradient boosted tree-based PS model (GBT-PS)9–11. We observed that PS models built with our proposed model selection strategy outperformed models selected by other strategies in terms of goodness-of-balance. The previous model selection strategies, either according to AUC (yellow bars in Fig. 2) or SMD (blue bars in Fig. 2) on validation data, failed to balance a large proportion of emulated trials (≤ 50%). By contrast, our strategy (red bars in Fig. 2) balanced many more emulated trials by large margins on all trial drugs. Taking atorvastatin as an example (Fig. 2a), LR-PS selected by our strategy balanced 96% [95% confidence interval (CI) 92% − 99%] of all emulated trials, which was much better than the AUC-based strategy (39%, 95% CI 30% −48%) and the SMD-based strategy (47%, 95% CI 37% −57%). The same phenomenon was observed on atorvastatin trials balanced by LSTM-PS (Fig. 2b), where our strategy (68%, 95% CI 58% −77%) balanced many more trials than AUC-(22%, 95% CI 14% −31%) or SMD-based strategy (53%, 95% CI 43% −63%). In addition, we also compared the performance of PS models built with different model selection strategies in terms of goodness-of-fit measured by AUC score on unseen test data. We observed that all of the above PS models selected by our proposed strategy achieved test AUC on par with models selected by AUC-on-validation strategy, and on par with or better than models selected by SMD-on-validation strategy (Extended Data Figs. 2 and 3). In summary, PS models selected by our proposed strategy balanced many more emulated trials than existing practice, and at the same time showed good generalized prediction performance on unseen data, and the proposed model training, selection, and testing processes were summarized in Algorithm 1 and 2.
Does deep learning based models perform better?
Recently deep learning-based models have demonstrated great promises in various applications and researchers have proposed to apply these models for PS calculation in trial emulation7. We evaluated the performance of the PS calculation model based on the long short-term memory network with attention mechanisms (LSTM-PS) used in Liu et al.7 on our data, and observed that LSTM-PS did not necessarily outperform simple LR-PS. As shown in Fig. 2, the LR-PS model selected by our model selection strategy balanced 21 drugs of which ≥ 10% emulated trials were successfully balanced, while the LSTM-PS model only balanced 15 drugs. The LR-PS also identified more drug candidates than MLP-PS (9 drugs) and GBT-PS (9 drugs) as illustrated in Extended Data Fig. 1. Furthermore, we compared the balancing performance of LR-PS versus LSTM-PS by comparing the number of unbalanced features before and after IPTW (the 6th and 7th columns in Table 1), from which we observed that both LR-PS and LSTM-PS can greatly reduce the number of unbalanced features after re-weighting. However, the number of unbalanced features after IPTW by LSTM-PS similar to or even worse than the LR-PS model (LSTM rows v.s. LR rows in the 7th column). Moreover, LSTM-PS introduced additional biases by consistently under-estimating the number of unbalanced features even before re-weighting (LSTM rows v.s. LR rows in the 6th column in Table 1), which could be originated from the fact that LSTM-PS compressed the original covariate space through the learned attention weights, and the SMD scores were evaluated on the compressed covariates.
To test the generalizability of our conclusion, we further validated our proposed strategy on the MarketScan data, which is a national healthcare insurance claims database (see Data Section). Following the same procedures as we did with the OneFlorida data, we identified a total of 424, 961 MCI patients from 2009 to 2020 and among which, there were 2, 489 unique drug ingredients. We emulated 24, 600 trials for 246 drugs which had ≥ 500 patients in their respective treated groups. With the MarketScan data, we were able to obtain the same conclusions: (a) our model selection strategy built better PS models which balanced more trials than existing model selection strategies over different PS classes; (b) with our proposed strategy, conventional models such as LR-PS outperforms deep learning models such as LSTM-PS7 (Extended Data Fig. 4). In the following, we applied our strategy into the process of identification of repurposing drug candidates for AD via high-throughput trial emulation.
High-throughput screening of repurposing drug candidates for AD
With our proposed model selection strategy and the LR-PS model, we have emulated 430,000 drug trials on two large-scale RWD warehouses (Fig. 1a). The adjusted 2-year survival difference and adjusted hazard ratio (HR) with AD onset as the outcome event obtained from these emulated trials are demonstrated in Fig. 1b. The repurposable drug candidates were identified according to the following two criteria: (i) beneficial effects, meaning the estimated treatment effect of balanced trials for any target drug should be significantly beneficial for MCI to AD progression (≥ 10% emulations were balanced after IPTW, the sample mean of adjusted 2-yr survival difference > 0, adjusted HR < 1, and P-value < 0.05); and (ii) consistency, the estimated treatment effects of each drug candidate from both RWD warehouses (both EHRs and administrative claims) should be beneficial for MCI to AD progression. Table 1 summarizes the screening process and emulated trials: with the first criterion we were able to identify 14 drugs from the OneFlorida data (Table 1, marked in shaded color) and 28 drugs from the Marketscan data (Supplementary Table 3), among which 8 drugs showed consistent beneficial effects on both data sets (Table 1, marked in bold).
We highlight these eight identified repurposable drug candidates in Fig. 3, and for each drug we also have conducted a rapid literature review19 for additional evidence:
Pantoprazole is a proton pump inhibitors (PPI) drug for treating gastroesophageal reflux disease (GERD), a damaged esophagus, and high levels of stomach acid caused by tumors. We observed that pantoprazole was associated with a 43% reduced risk of AD [hazard ratio (HR) 0.57, 95% confidence interval (CI) 0.56-0.59] in OneFlorida compared with a 8% reduced risk of AD (HR 0.92, 95% CI 0.89-0.94) in MarketScan. The association between using PPI drugs and risk of incident AD or non-AD dementias were contradictory20,21 in existing literature. Our study revealed one of the first large-scale RWD signals of pantoprazole for AD.
Gabapentin is an anti-epileptic drug for treating seizures and pain. We observed that gabapentin was associated with a 39% reduced risk of AD (HR 0.61, 95% CI 0.58-0.64) in OneFlorida and a 28% reduced risk of AD (HR 0.72, 95% CI 0.70-0.74) in MarketScan. Previous research suggested possible benefit of gabapentin for behavioural and psychological symptoms of dementia in AD patients based on summarizing case reviews22, and revealed crucial role of gabapentin in the Amyloid Beta Toxicity Cascade23. Our study showed one of the first large-scale RWD signals of gabapentin for AD.
Acetaminophen is used for treating mild to moderate pains and reducing fever. We observed that acetaminophen was associated with a 24% reduced risk of AD (HR 0.76, 95% CI 0.73-0.78) in OneFlorida and a 22% reduced risk of AD (HR 0.78, 95% CI 0.76-0.81) in MarketScan. Previous studies only indicated a weak association of acetaminophen with reduced risk of AD without any significance24–26 (e.g. Relative Risk 0.87, 95% CI 0.40-1.91 in a meta-analysis24).
Atorvastatin is used to treat high cholesterol and triglyceride levels, shows potentially beneficial but not significant effects on AD in27,28. We observed that atorvastatin was associated with a 21% reduced risk of AD (HR 0.79, 95% CI 0.77-0.81) in OneFlorida and a 12% reduced risk of AD (HR 0.88, 95% CI 0.85-0.90) in MarketScan.
Albuterol (also salbutamol) is a drug for asthma and chronic obstructive pulmonary disease (COPD). We observed that the albuterol was associated with a consistent 22% reduced risk of AD (HR 0.78, 95% CI 0.74-0.83) in OneFlorida and a 22% reduced risk of AD (HR 0.78, 95% CI 0.76-0.80) in MarketScan. Previous literature generated AD signals from in-vivo screening29 or rats models30. To the best of our knowledge, our work showed the first RWD signal of albuterol for AD.
Fluticasone is used to treat nasal symptoms, skin diseases, and also asthma. We observed that fluticasone was associated with a consistent 10% reduced risk of AD (HR 0.90, 95% CI 0.84-0.94) in OneFlorida and a 14% reduced risk of AD (HR 0.86, 95% CI 0.83-0.90) in MarketScan. Instead of high-throughput screening repurposing signals from RWD, Xu et al.31 validated fluticasone from MarketScan on a case by case basis and showed a consistent decreased risk for AD (HR 0.86, 95% CI 0.83–0.89) as ours, and Lehrer et al.32 also suggested a lower incidence of AD after taking fluticasone in another independent database, FDA MedWatch Adverse Events Database.
Amoxicillin is used to treat a wide variety of bacterial infections and stomach ulcers. We observed that amoxicillin was associated with a 12% reduced risk of AD (HR 0.88, 95% CI 0.82-0.95) in OneFlorida and a 7% reduced risk of AD (HR 0.93, 95% CI 0.90-0.96) in MarketScan. Jannis et al.33 revealed that the eradication of Helicobacter pylori (Hp) by a triple eradication regimen of omeprazole, clarithromycin and amoxicillin may positively influence AD manifestations in Hp-positive AD patients, and the action of different type of antibiotics in AD remains largely unknown34. To the best of our knowledge, our work revealed one of the first RWD signals of amoxicillin for AD.
Omeprazole is another PPI drug similar to pantoprazole. We observed that omeprazole was associated with a 12% reduced risk of AD (HR 0.88, 95% CI 0.85-0.91) in OneFlorida and a 8% reduced risk of AD (HR 0.92, 95% CI 0.88-0.94) in MarketScan. There is still no consensus on the role of PPIs and AD20,21,35. Our study showed one of the first large-scale RWD signals of omeprazole for AD.
Discussion
We explored the problem of high-throughput clinical trial emulation on two large-scale RWD data warehouses, covering both EHRs and claims, in the context of identifying repurposable drug candidates for AD. There are several aspects we would like to highlight for our investigation.
First, we emulated hundreds of trials for each drug based on two different ways of constructing control groups, which allowed for potentially more robust estimation of treatment effects. In our investigation, indeed, we observed a large variability (e.g., a large range of 95% confidence interval) of estimated treatment effects within emulated trials for certain drugs (e.g., Fig. 3e, albuterol FL-Rand, HR 0.78, 95% CI 0.70-0.86), and sometimes a large discrepancy between emulated trials when building control groups in different ways (Fig. 3f, fluticasone, FL-Rand, HR 0.75, 95% CI 0.70-0.81 versus FL-ATC, HR 1.01, 95% CI 0.97-1.05) These variabilities can become big challenges for existing observational studies that use a single control group36 or a single way of building multiple control groups (e.g. only random control groups)7.
Second, we observed inconsistent results across the two data sets. For example, escitalopram showed a reduced risk in OneFlorida data (HR 0.61, 95% CI 0.55-0.67, Table. 1) but an increased risk in MarketScan database (HR 1.55, 95% CI 1.49-1.61, Table. 3). Potential explanations were rooted in intrinsic heterogeneity across the two datasets: OneFlorida is a regional database mainly covers patients’ EHRs in Florida area, while MarketScan is a nation-wide claims database across the US (Supplementary Tables 1). For example, the number of patients in escitalopram group in OneFlorida and MarketScan were 767 and 5,041 respectively. Such inconsistency highlights the necessity of leveraging at least two (different type of) data sets to derive robust and consistent evidence.
Third, we conducted multiple sensitivity analyses to guarantee the robustness of our findings. We have investigated the impact of different ways of building control groups on balance performance (Supplementary Figs. 6). Our proposed model selection strategy greatly improved the performance of different PS models over conventional approaches. We also examined the influence of the balance diagnostics on the generated repurposing hypotheses. For example, if we adopted a more stringent balance criteria by requiring zero tolerance of unbalanced covariates (compared with 2% used in our primary analyses) in each emulated trial after re-weighting, we still recovered top four drugs among our reported eight drugs–pantoprazole (HR 0.60, 95% CI 0.57-0.63, OneFlorida; HR 0.92, 95% CI 0.89-0.94, MarketScan), gabapentin (HR 0.55, 95% CI 0.50-0.60, OneFlorida; HR 0.72, 95% CI 0.70-0.74, MarketScan), acetaminophen (HR 0.74, 95% CI 0.70-0.79, OneFlorida; HR 0.78, 95% CI 0.76-0.81, MarketScan), and atorvastatin (HR 0.78, 95% CI 0.75-0.81, OneFlorida; HR 0.88, 95% CI 0.85-0.90, MarketScan), which again significantly and consistently reduced risks for AD on both OneFlorida (Supplementary Table 4) and MarketScan (Supplementary Table 3) databases. We also studied different ways of constructing control groups–random controls and ATC-L2 controls (Method Section)–on the estimated treatment effects which were consistent for most of drugs (Fig. 3), except for fluticasone estimated from ATC-L2 controls on OneFlorida (Fig. 3f, FL-ATC, HR 1.01, 95% CI 0.97-1.05) and amoxicillin estimated from ATC-L2 controls on MarketScan (Fig. 3g, MS-ATC, HR 1.06, 95% CI 1.05-1.06).
Last, compared with existing AD repurposing studies which typically focused on validating one or two hypotheses with a single type of RWD31,37,38, our study offered a high-throughput way of generating and validating AD repurposing hypotheses using both EHRs and claims39, which would further catalyze innovation in AD drug discovery at scale, or can be broadly applied to other diseases.
Lots of recent research efforts have been devoted to developing complex deep learning-based models for propensity score based modeling7,40–43In this paper, after emulating hundreds of thousands of trials from two large-scale RWD warehouses, we found that one LSTM-PS7, which is a representative deep learning based PS method, did not outperform LR-PS. Our study also highlighted the importance of model selection and we proposed our own strategy under which we demonstrated LR-PS outperformed gradient boosting tree-based PS models and deep multi-layer peceptron-based PS models as well in terms of balancing performance and the number of generated repurposing hypotheses. In addition, we also evaluated another model selection strategy widely used in literature4,44–47, which did not follow the out-of-sample validation strategy by partitioning data into complementary subsets but just estimated and evaluated PS model on the entire data set. We observed that with this strategy, even simple regularized LR-PS model suffered from over-fitting issue and could not be generalized well to unseen data in our empirical studies as demonstrated in Extended Data Fig. 5, emphasizing the need for a better model selection strategy for PS calculation. With all these investigations, we were able to show that our proposed model selection strategy, together with model training and evaluation pipelines, could serve as a better choice than existing model selection strategies for PS models in terms of goodness-of-balance and goodness-of-fit performance on emulated trials.
This study has several limitations. First, we identified MCI patients and AD onsets using ICD codes (Supplementary Tables 2) which were provided by physicians and validated in48,49, yet there might be a certain level of inaccuracy due to mis- and under-diagnosis or the lack of clinical details in EHRs or claims39,50. Information contained in clinical notes will be explored in the future through natural language processing to complement the structured codes. Second, although we balanced high-dimensional covariates collected during the baseline period, measurement error, residual confounding, and selection bias in the follow-up period were still possible. Therefore, adapting negative control51 for detecting residual confounding and selection bias to high-throughput trial emulation settings would be another promising direction.
Conclusion
In this work, we proposed a high-throughput clinical trial emulation system for AD drug repurposing driven by RWD and propensity score-based causal inference with a tailored model selection strategy. On two large-scale RWD warehouses covering both EHRs and general claims, we demonstrated that our strategy identified eight drugs (pantoprazole, gabapentin, acetaminophen, atorvastatin, albuterol, fluticasone, amoxicillin, omeprazole) with different original indications could be potentially beneficial to AD patients. Our analyses highlighted model selection, rather than the PS model itself, is critical in balancing emulated trials at scale, which informs future RWD-based high-throughput trial emulation and can potentially accelerate the drug development process.
Methods
Data
We used two large-scale real-world longitudinal patient-level healthcare warehouses, including OneFlorida Clinical Research Consortium and IBM MarketScan Commercial Claims and Encounters (Data availability section). The OneFlorida database contains robust patient-level electronic health record (EHR) data for nearly 15 million (14,883,388) patients majorly from Florida and selected cities in Georgia and Alabama from January 2012 to April 2020, and the IBM MarketScan database (formerly known as Truven) contains administrative claim records from January 2009 to June 2020 for over 164 million (164,148,434) enrollees across the US, serving as a nationally representative database of the US population (See Supplementary Tables 1 for the population characteristics of two database). Both databases contain comprehensive longitudinal information on demographics, diagnoses, procedures, prescriptions, and outpatient dispensing for all enrollees. This study was approved by the Institutional Review Board of Weill Cornell Medicine with protocol number 21-07023759. The use of OneFlorida data for this study is approved under the University of Florida IRB number IRB202001888. Access to the MarketScan data analyzed in this manuscript is provided by the University of Kentucky.
High-throughput trial emulation for Alzheimer’s disease (AD)
Instead of emulating one targeted randomized controlled trial on a case-by-case basis, here we tried to scale up trial emulation to a high-throughput setting, namely, to emulate hundreds of thousands of target trials to find potentially new indications of non-AD drugs for AD. We described the protocol of high-throughput trial emulations as follows and compared target trials and their emulations in the Extended Data Table 1. An illustration of the high-throughput cohort selection process was shown in Fig. 1a.
Eligibility criteria
We included patients with at least one mild cognitive impairment (MCI) diagnosis between January 2012 and April 2020 in the OneFlorida database (January 2009 to Jun 2020 in the MarketScan data). Other inclusion criteria were age at MCI diagnosis ≥ 50, no history of AD or AD-related dementia diagnoses before the baseline, the first MCI diagnosis date should be prior to the baseline, and ≥ 1 year of records before baseline. Of note, we defined the baseline as the first prescription date of the trial drug, and at baseline, all of the above criteria should have been met.
Treatment strategies
We compared two strategies for each drug trial: (0) no initiation of the trial drug before or after baseline (control group), and (1) initiation of the trial drug at baseline (treated group). We defined the treatment initiation date with the drug of interest as the first prescription date of the drug and we required at least two consecutive drug prescriptions over 30 days since the first prescription date in our database as a valid drug initiation.
Treatment assignment procedures
We classified patients into different drug groups according to their baseline eligibility criteria and their treatment strategies. We assumed the treated group and control group were exchangeable at baseline conditional on high-dimensional baseline covariates, including diagnoses, medications, demographics, and time from the MCI diagnosis date to drug initiation date. The diagnosis covariates consisted of selected comorbidities from Chronic Conditions Data Warehouse52 and established risk factors for AD selected by experts, resulting in 64 covariates; each defined by a set of selected ICD-9/10 codes. We grouped drug prescriptions coded as National Drug Code (NDC) or RXNORM codes into their major active ingredients coded in RXNORM defined in Unified Medical Language System53 for the OneFlorida case, and into the Medi-Span Generic Product Identifier (GPI)54 by their first 8 digits for the MarketScan data. We used the first 200 most commonly prescribed drug ingredients for the co-prescribed medication covariates for each drug trial and thus the medication covariates varied in different drug trials. We used 2 covariates age and sex for demographics and 1 covariate for the time from the MCI diagnosis date to the drug initiation date. In total, there were 267 covariates to adjust for. In addition to the 267 baseline covariates, we also considered the temporal sequences of each of diagnoses and medications for the deep long short-term memory network with attention mechanisms-based PS calculation7.
Follow-up
We followed each patient from his/her baseline until the day of the first AD diagnosis, loss to follow-up (censoring), 2 years after baseline, or the end date of our databases, whichever came first.
Outcomes
The outcome of interest is the diagnosis of AD recorded in the database within his/her follow-up period, which was denoted as a positive event. If there was no AD diagnosis recorded in a patient’s follow-up period, and the last prescription date or the last diagnosis date recorded in the database came after the end of the follow-up, then we marked it as a negative event. A censoring event is a case where there was no AD diagnosis recorded in a patient’s follow-up period and the last prescription date and the last diagnosis date recorded in the database came before the end of the follow-up. The time to positive event is defined as the days between the baseline date and the first diagnosis of AD. The time to negative event is the time of follow-up. The time to censoring is defined as the days between the baseline date and the last prescription date or the last diagnosis date, whichever comes last. Clinical phenotypes were identified by the selected diagnosis codes by experts (Supplementary Tables 2).
Causal contrasts of interest
The observational analogy of intention-to-treat effect of being assigned to trial drug initiation versus no initiation at baseline.
High-throughput emulation
We emulated trials for all drugs appeared in our databases with at least 500 eligible patients in their treated groups. For each emulated trial, its treated group consists of eligible patients who initiated the trial drug, and its control group consists of eligible patients who had no initiation of the trial drug. We constructed the no-initiation patients group in two ways: a) randomly selecting eligible patients from other drug initiation group49, or selecting patients from similar drug groups that are under the same second-level Anatomical Therapeutic Chemical classification category55 (ATC-L2) as the target trial drug6. We further excluded any of those patients who were also in the trial drug group or prescribed the trial drug before baseline. To investigate statistically significance of results with varying control groups, we emulated 100 trials for each targeted drug and among which 50 emulated trials adopted random controls and the other 50 emulated trials adopted ATC-L2 controls as described above. Different combinations of control groups were studies as sensitivity analysis.
Causal effect estimation and the screening of repurposing drugs
We used propensity score (PS) methods56 for confounding control and treatment effect estimation for high-throughput emulated trials, and proposed two criteria to screen and prioritize non-AD drugs for repurposing (Summarized in Fig. 1b).
Propensity score and IPTW
For each emulated trial, we used propensity score (PS) framework56 to learn empirical treatment assignment given baseline covariates, and used the inverse probability of treatment weighting (IPTW)57 to balance treated and control groups. We used triplet (X, Z,Y, T) to represent data of both treated and control groups where X, Z, Y, T represent the baseline covariates, treatment assignment, outcome indicator, and time to events, respectively. The PS is defined as P(Z = 1|X)56 where Z is treatment assignment (Z = 1 and Z = 0 for treated and control respectively) and X denotes patients’ observed baseline covariates. The inverse probability of treatment weight (IPTW) is defined as 57, 58, which tries to make original trial into a more balanced pseudo trial by re-weighting each data sample. We used an updated version named stabilized IPTW, defined as to deal with extreme re-weighting weights and thus potentially inflated sample size7,59,60.
A machine learning (ML) or deep learning (DL)-based propensity score (ML/DL-PS) model is a binary classification model fθ ∈ ℱΘ : X → Z, to approximate P(Z = 1 |X) by fθ (X) with learnable parameters θ. Here, we use ℱΘ to denote a set of ML/DL models (e.g. a set of models with varying hyper-parameters) and fθ to denote one specific model instance in this set. We considered four classes of ML/DL ℱΘ: (a) regularized logistic regression-based PS models (LR-PS), encompassing its special case logistic regression (without any regularization term), which are most widely used model for PS calculation; (b) the state-of-the-art deep learning based-PS model, long short-term memory network with attention mechanisms-based PS models (LSTM-PS)7; c) multi-layer perception network-based PS models (MLP-PS)12,13; and d) the state-of-the-art gradient boosted tree-based PS models (GBT-PS)9–11.
Performance evaluation criteria
We evaluated the performance of estimated PS models in terms of two aspects: a) the goodness-of-balance, and b) the goodness-of-fit.
The goodness-of-balance is measured by the standardized mean difference (SMD)17,44,61 on the whole dataset, defined as follows: where xtreat, xcontrol ∈ RD represent the vector representations of D covariates of treated group and control group respectively, µtreat, µcontrol ∈ RD are their sample means over the treated group and control group respectively. Similarly, streat2, scontrol2 ∈ RD are their sample variances. Suppose that we have learned sample weight wi for each patient i by IPTW, the weighted sample mean and variance are:
The weighted versions of sample mean and variance hold for both treated and control groups and thus we ignored their corner marks for brevity. The SMDweight can be calculated by applying above weighted mean and variance to Eq.2. All operations in Eq.2 and 3 are conducted in an element-wise way for each covariate. For each dimension d of either original SMD or weighted SMD, it is considered balanced if its dth SMD value SMD(d) ≤ 0.117, and the treated and control groups are balanced if the total number of unbalanced features ≤ 2%∗D7. More stringent balance criteria (e.g., requiring non-unbalanced features) were also considered as sensitivity analysis. Taking IPTW re-weighted case as an example, we can calculate the number of balanced feature after IPTW by:
The smaller the nweight is, the better the balance performance of IPTW is, and the less biased estimated causal effect is. As shown in61, SMD is one of the top predictors of the bias of estimated causal effect. To quantify balance performance of high-throughput emulation of one drug trials, we further defined the probability of successfully balancing one specific drug M trial by a set of PS models ℱΘ as , which can be estimated by the fraction of successfully balanced trials over all emulations as follows: where ne is the total number of emulated trials (X, Z,Y, T)i, i = 1, 2, …, ne for drug M, fbest is the best PS model among ℱΘ learned from the ith emulated trial, and the IPTW and nweight are calculated by applying fbest to the ith emulated trial. We will discuss how to learn and select fbest ∈ ℱΘ in the next section. In general, the larger the balancing success rate PM(nweight ≤ 2% ∗D|ℱΘ) is, the better the ℱΘ model balances the drug M trial.
The goodness-of-fit is the generalized prediction performance of the PS model on the unseen data. We used the area under the receiver operating characteristic (AUC) measured on the (unseen) testing dataset to quantify it62,63. The larger AUC on the testing set is, the better the generalization performance of the classification model is.
Model training, selection and evaluation
Much existing literature used statistical models (e.g. logistic regression) for PS calculation in estimating treatment effect from observational data, and the PS model was both estimated from and applied to the whole dataset4,44–47. By contrast, machine learning or deep learning (ML/DL) models, which are good at capturing complex data and usually have a large number of hyper-parameters, are faced with over-fitting and generalization trade-off64. Thus, to get more generalized ML/DL models, the conventional approach is to split the whole dataset into complementary training, validation, and testing sets; then train the model on the training set, select the best learned model according to the AUC performance for example (a goodness-of-fit measure; the larger the better) on the validation set, and finally evaluate the selected model on the testing set. This model validation strategy for ML/DL models is also known as the (one-round) cross-validation14. Following this conventional data splitting strategy, existing ML or DL-based PS models9–13,49 selected the best model by the AUC (a goodness-of-fit measure; the larger the better) or the maximum SMD value (a goodness-of-balance measure; the smaller the better) measured on the validation set. However, following the above model selection strategies for PS model selection for balancing emulated trials, poor balance performance was observed in our high-throughput study in both two RWD warehouses.
Here, we introduce our model training and selection algorithm tailored for ML/DL-based PS model in Algorithm 1, trying to get the best goodness-of-balance performance as well as the best possible goodness-of-fit performance. We used binary cross-entropy loss ℒ as the objective function for learning empirical binary propensity scores. We also describe the evaluation (testing) algorithm for ML/DL-based PS models in Algorithm 2, to evaluate and benchmark different learned and selected models. Of note, the goodness-of-balance is the single most important criterion for evaluating trial emulations. If the balancing results are tied, the goodness-of-fit is evaluated.
Statistical analysis
We reported adjusted 2-year survival difference by adjusted Kaplan–Meier estimator65,66 and adjusted hazard ratio (HRs) modeled by the adjusted Cox proportional hazard model66,67 for each of the emulated trials. The above outcome estimators were adjusted by inverse probability of treatment weighting (IPTW) based on the best PS calculation selected by our model selection strategy. For each drug we reported their sample means of different outcome estimators with 95% confidence intervals68 over all the balanced trials. We used two ways of building different control groups (e.g., random controls or ATC-L2 controls) and different balance criteria (e.g., different thresholds for SMD) to evaluate the robustness of our estimated effects in various sensitivity analyses.
Screening and prioritization
To generate reliable and robust repurposing hypotheses for AD, we required that the estimated effects of repurposing drug candidates should be significantly and consistently beneficial. As for the significant (beneficial) effects, we require that the fraction of successfully balanced trials of a drug candidates after IPTW ≥ 10%, and their adjusted 2-yr survival difference of these balanced trials should be significant. We used bootstrapping hypothesis testing68 to test if the sample mean of the adjusted 2-yr survival difference from all the balanced trials is > 0 (< 1 for HRs), and we considered p-value < 0.05 as significant. As for the consistency of effects, we required that the estimated effects should be all significantly beneficial over different databases. We then ranked the drug candidates according to their estimated effects. More stringent screening criteria were considered in sensitivity analysis.
Comparison with existing works
We replicated the analytic approach by Liu et al.7 and we found that their methods led to biased SMD estimation and worse balance performance as shown in Table 1 due to their deep LSTM-based PS methods. Besides, there are other major concerns. First, they selected patients at baseline according to patients’ treatment strategy over follow-up and such post-baseline information should not be used at baseline46. Second, they estimated treatment effect by the average treatment effect (ATE) ATE = 𝔼[Y1 − Y0] (Y1 and Y0 are the potential outcomes for each patient under the treatment or the control respectively), which can introduce selection bias due to loss to follow-up (censoring)69. Third, they generated hypotheses only on one database and used only random controls, ignoring the potential variability we found over different databases and over emulations with different control groups.
Experimental settings
We implemented our high-throughput clinical trial emulation system for drug repurposing by Python 3.9 and Pytorch 1.8 (https://pytorch.org/ and trained deep learning models by Adam optimizer70 on a Linux server with two GeForce RTX 2080 Ti GPUs and 16 CPU cores. We used python package lifelines-0.2666 for survival analysis, scikit-learn-0.2318 for machine learning models including regularized logistic regression, and lightgbm-3.211 for gradient boosted machine. We followed Liu et al.7 for their LSTM-PS implementations. We randomly partitioned each emulated trial into complementary training, validation and testing data sets with a ratio of 70:10:20. Please refer to our python package for more details.
Data Availability
OneFlorida data can be requested through https://onefloridaconsortium.org/front-door/. Since OneFlorida data is a HIPAA-limited data set, a data use agreement needs to be established with the OneFlorida network.The MarketScan dataset is available from IBM at https://www.ibm.com/products/marketscan-research-databases.
Code availability
For reproducibility, we open-sourced our python code package at https://github.com/calvin-zcx/RWD4Drug.
Data Availability
The OneFlorida data can be requested through https://onefloridaconsortium.org/front-door/. Since the OneFlorida data is a HIPAA-limited data set, a data use agreement needs to be established with the OneFlorida network. The MarketScan dataset is available from IBM at https://www.ibm.com/products/marketscan-research-databases.
Author contributions statement
CZ and FW proposed the initial idea. CZ designed and implemented the framework, analyzed the data, and wrote the initial draft of the paper. Hansi Zhang, JX and JB processed and analyzed the OneFlorida data. SF and JC processed the data and ran the experiments on the MarketScan database. Hao Zhang, SH and BSG helped with the data analysis and result interpretation. KC and YC designed the statistical analysis. All the authors contributed to the discussions of results and final writing of the paper. JC had access to the raw data of the MarketScan database. JB had access to the raw data of the OneFlorida database.
Competing interests
No competing interests to report.
Supplemental Materials for
Acknowledgements
The work of C.Z., H.Z. and F.W. were supported by NIH awards R01MH124740 and RF1AG072449, as well as NSF award 1750326. B.S.G. is supported by NIH grant RF1AG059319. J.B. is supported by NIH awards R21AG068717, R56AG069880, R01CA246418-02S1, and R21CA245858-01A1S1.
Footnotes
correcting errors in the algorithms, updated IRB number in WCM