Abstract
Linear mixed models (LMMs) are commonly used in many areas including epidemiology for analyzing multi-site data with heterogeneous site-specific random effects. However, due to the regulation of protecting patients’ privacy, sensitive individual patient data (IPD) are usually not allowed to be shared across sites. In this paper we propose a novel algorithm for distributed linear mixed models (DLMMs). Our proposed DLMM algorithm can achieve exactly the same results as if we had pooled IPD from all sites, hence the lossless property. The DLMM algorithm requires each site to contribute some aggregated data (AD) in only one iteration. We apply the proposed DLMM algorithm to analyze the association of length of stay of COVID-19 hospitalization with demographic and clinical characteristics using the administrative claims database from the UnitedHealth Group Clinical Research Database.
1. Introduction
The COVID-19 outbreak has become a pandemic, causing a large increase in mortality and posing a heavy burden to the healthcare system. Much research has been done on treatment efficacy and adverse clinical outcomes [1-5] and much remains to be done. As studies continue to be conducted and published, multi-site collaboration is demanded for evidence synthesis [3,4]. Multi-site studies based on healthcare data, including the electronic health record (EHR) and claims data, can integrate clinical information across multiple sites or systems to improve estimation and predictive performance due to use of a larger and more inclusive sample from the population of interest.
One primary challenge for multi-site collaboration is preserving the privacy of protected health information. Sensitive individual patient data (IPD) including the patient’s identity, diagnoses and treatments are usually not allowed under privacy regulation to be shared across networks. Existing approaches to performing multi-site studies, e.g. distributed algorithms, have the drawback of biased estimation [12] or communication burden due to the requirement of iterative transmission of summarized data [14,15]. Specifically, for ordinary linear regression, identical results with pooled analysis can be obtained by lossless compression [7]. Another important challenge of analyzing multi-site data is the heterogeneity of data distribution across sites. Many existing approaches for multi-site analysis assume the data are homogeneously distributed across sites. This assumption may be violated in practical scenarios and thus make the model vulnerable in estimation and hypothesis testing. For instance, in this manuscript we consider the length of stay of the hospitalization due to COVID-19 as the continuous outcome. Heterogeneity across countries, regions, and sub-populations has been reported in the literature [2]. The aforementioned lossless compression approach [7] for ordinary linear regression, fails to take the heterogeneity into account.
In this paper we propose a novel algorithm for distributed linear mixed models (DLMMs). Linear mixed models (LMMs) are commonly used in many areas including epidemiology for analyzing multi-site data with heterogeneity. The model assumes site-specific random effects of the covariates (and intercept) on a continuous outcome. To the best of our knowledge, there is no existing approach for fitting LMMs in a distributed manner, see Figure 1 for the comparison of several approaches in this context. Our proposed distributed LMM can achieve exactly the same results as if we had pooled individual patient data from all sites, hence the lossless property. These lossless results can be obtained by requiring the sharing of summary statistics from each site in only one iteration. We apply the proposed DLMM to analyze the association of length of stay of COVID-19 hospitalization with demographic and clinical characteristics using the administrative claims database for Medicare Advantage members from a large US Health insurance provider (Appendix Figure 2A).
2. Method
2.1 Linear mixed model
Due to the heterogeneity of data across sites, the effects of the covariates on the outcome among sites in the linear regression model may not always be the same [7]. A linear mixed model is thus often used. Assume for the jth patient at the ith site, yij is the continuous outcome, xijis the p-dimensional covariate vector and β is the vector of fixed effects, zijis the q-dimensional covariate vector having random effect ui, and ϵijis the random error. where ui ∼ N(0, V), ϵij ∼ N(0, σ2). The random effects covariates zij can be part or all of xij, or constant if random intercept only. The random effect covariance matrix V can admit certain structures with unknown parameters. For instance we can assume the random effects are independent, i.e. V = diag(σ12, …, σq2). These parameters (e.g. variance components) and the fixed effects β are usually estimated by maximum likelihood (ML) or restricted maximum likelihood (REML) estimation [6]. The log-likelihood of LMM using all the data is where Xi and Yi are the covariate matrix and the outcome vector of the ith site, |. | is the matrix determinant and .
The maximum likelihood estimation can be further simplified by profiling out β and σ2 from (2). Denote Θ = V/σ2, given Θ, the estimation of β and σ2 are where . Thus the profile log-likelihood with respect to only Θ is and the restricted profile log-likelihood is
The ML or REML estimate of Θcan be obtained by maximizing (5) or (6). The estimates of β and σ2 can be subsequently obtained by (3) and (4). We denote these estimates as . The variance of the estimated fixed effects is thus or the sandwich estimator:
2.2 Distributed linear mixed model
It’s easy to see from (5) that there is no closed-form estimation for LMM. Thus unlike in the ordinary linear model [7], the LMM estimation is not trivial to be distributed to each site losslessly. Fortunately, with some linear algebra, we can disentangle the data (Yi, Xi) and the parameters Θ in |Γi| and Γi−1and thus reconstruct the profile log-likelihood (5) without communicating individual patient data. Specifically, we utilize the Woodbury matrix identity [8] to obtain and the matrix determinant lemma [9] to obtain where Iq is the q × q identity matrix. We focus on the situation that the covariates in Z are a subset of that in X. The more general case is similar and will be elaborated in the Appendix. We require the ith site to contribute some summary statistics, i.e. the p × p matrix Si X = XiTXi, the p × 1 vector Si Xy = Xi Tyi, the scalar siY = yiT yi and the sample size ni. The exact log-likelihood (5) (or a restricted likelihood (6)) can then be reconstructed using these summary statistics. The details of the reconstruction are in the Appendix. The result by the DLMM algorithm is thus identical to that of the pooled LMM analysis, see Figure 2.
2.3 Selection of variance components
We test the significance of random effects of each individual covariate by likelihood ratio test. For simplicity we assume the potential random effects are independent and the random intercept always exists, i.e. V = diag(σ12, …, σq2) and σ12 > 0, for the covariate corresponding to variance component σk2, k ≥ 2, we test
The likelihood ratio test (LRT) gives the likelihood ratio follows a 50:50 mixture of χ02 and χ12[10,11]. Notice both the log-likelihoods in (10) can be reconstructed by the communicated summary statistics.
2.4 Best linear unbiased predictors for the random effects
Finally, the BLUP [6] of the random effects ui at the ith site is
Conditioning on Xi, +ihas mean zero and covariance matrix
Since we are more interested in prediction of ui, it is more appropriate to use prediction intervals as below
We summarize the analysis with the proposed DLMM algorithm as in Algorithm 1.
Analysis with the distributed linear mixed model algorithm
3. Multi-site analysis of COVID-19 hospitalization length of stay
We demonstrate the utility and lossless property of the DLMM method by studying the association of length of stay of COVID-19 hospitalization with patients’ demographic and clinical characteristics. We emphasize that this example is for illustrative purposes only and to this end we considered only covariates that have already been well-documented in the literature.
We identified patients who were admitted as inpatients to a hospital with a primary or secondary diagnosis of COVID-19 between January 1, 2020 and September 30, 2020. The data are collected from K = 538 sites in the UnitedHealth Group Clinical Research Database and the total number of patients is . The detailed inclusion criteria is in Appendix Figure 1A. We treat length of stay as a continuous outcome. The demographic characteristics include age, gender and race and the clinical characteristics include a history of cancer, chronic obstructive pulmonary disease (COPD), heart disease, hypertension, hyperlipidemia, kidney disease and obesity. Charlson comorbidity index score is also included as a measure of the overall patient’s burden of diseases; the higher the score, the more severe the patient’s health condition is. We provide the details of the definition of the characteristics in the Appendix Table 2A.
We select the covariate-specific random effects uim, m = 1, …, q, as described in Section 2.3. For simplicity we assume the random effect of different covariates are independent, i.e. V = diag(σ12, …, σq2). We select the covariates for which the corresponding p-value ≤0.05. In particular, we select random slopes for obesity, diabetes, kidney-diseases, and charlosn scores. A LMM with random intercept and random effects for obesity, diabetes, kidney-diseases, and charlosn scores is then fitted by either pooling the IPD together, or the proposed DLMM algorithm. We also calculate the BLUPs and the prediction intervals of the random effects at each site by (11) in Section 2.4.
We compare the result of the pooled analysis and the distributed algorithm in Figure 2 and Appendix Figure 3A. Specifically, the estimation of the fixed effects, their standard errors, the variance components are shown to be identical by the pooled analysis or the distributed algorithm. The estimated BLUPs by either the pooled analysis or the distributed algorithm are also shown to be identical in Figure 3A. The forest plot of fixed effects estimation and BLUPs of the random effects at a specific site are shown in Figure 3. Notice male, older age (≥80), higher Charlson score, obesity, prevalence of diabetes, and kidney diseases are shown to be significantly associated with longer COVID-19 hospitalization and Caucasian race, compared to others, is significantly associated with shorter COVID-19 hospitalization. The results match with that of literature for most of the covariates [16-19].
4. Discussion and future work
Special care must be taken with healthcare data in order to preserve patient privacy. Anonymizing data while preserving features that are important for understanding an individual’s health is highly non-trivial. In addition, large, representative datasets are especially scarce. Distributed models solve the privacy issue by requiring that only summary level statistics are shared. The one-shot model presented here requires only the p × p matrix of summary statistics, sample size, and p-dimensional vector be sent once. This allows for efficient sharing to build models for various applications across healthcare where data may remain completely protected by eliminating the need for data pooling at a central source. By considering a large, more diverse sample from multiple sites we expect a more robust outcome which benefits all institutions.
Data Availability
The data are proprietary and are not available for public use but can be made available under a data use agreement to confirm the findings of the current study.
Funding
Research reported in this article was partially funded through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-2019C3-18315).
Disclosures
Drs. Sheils and Islam and Mr. Buresh are full-time employees in Optum Labs and own stock in its parent company, UnitedHealth Group, Inc.
5. Acknowledgement
Appendix
A1. Reconstruction of the (restricted) LMM likelihood
In the case that the random effects covariates in Z is not a subset of the fixed effects covariates in X, the proposed DLMM algorithm requires the ith site to communicate
p × p matrix SiX = XiT Xi, p × q matrix SiXZ = (SiZX)T= XiT Zi,
p − dim vector SiXy = (SiyX)T = XiT yi, q − dim vector SiZy = (Siyz)T = ZiT yi,
scalar Siy = yiT yi, sample size ni,
for reconstructing the (restricted) LMM likelihood.
Below are the details of the reconstruction. By (8), thus and can be reconstructed by (3) and (4). Notice the unknown parameters are contained only in Θ and are separated from the summary statistics. Therefore, the profile log-likelihood (5) with respect to Θ can be reconstructed as where |Γi| = |Iq + Si ZΘ| according to (9). The restricted profile log-likelihood (6) can also be reconstructed in the same way.
A2. Further information on data sources
A2.1 Standardization of data entry and data structure
Medical and pharmacy claims data are captured, predominantly electronically, from sites of care seeking third-party reimbursement for both Medicare and commercial plans using the industry standard data collection forms HCFA/CMS-1500 for facility claims, UB04/CMS-1450 for professional services and outpatient claims, and NCPDP for pharmacy claims or their electronic equivalents. Structured data from these standardized forms are coded using the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM), National Drug Codes (NDC), Current Procedural Terminology (CPT) codes, and Logical Observation Identifiers Names and Codes (LOINC) codes, and Diagnosis Related Groups (DRG). This nomenclature ensures consistency of data collection across geographic regions, health systems, and payers throughout the United States.
A2.2 Methods to Control for Errors in Sampling and Data Collection
Claims that do not adhere to the form or coding standards described above are rejected from reimbursement, minimizing the risk that inappropriately structured data are included in the database. Data specific to SARS-CoV-2 and COVID-19 has an additional Quality Control layer to control for errors in sampling and data collection; this is described below in the section on Quality Control.
A2.3 Quality control
A COVID-19 data source-specific layer of quality control is also present, given the rapidly evolving situation. Members with a qualified COVID-19 related hospital admission are included in the report when any diagnosis matches qualified ICD-10 codes of U071, U072, or B9729. Suspected COVID-19 inpatient cases are manually reviewed daily by health plan clinical staff via clinical notes to determine an individual’s COVID-19 status. Each case is then manually flagged as either negative, confirmed, presumed positive, or needs clinical review. If a case is confirmed, it is not reviewed again. If a case is listed as negative or unknown, it is periodically reviewed for changes in the record. All others are reviewed and updated daily.
A2.4 Data Sharing
The data are proprietary and are not available for public use but, under certain conditions, may be made available to editors and their approved auditors under a data use agreement to confirm the findings of the current study.
Footnotes
Institution information updated for authors: Md. Nazmul Islam, Natalie E. Sheils, and Rui Duan.