Lossless Distributed Linear Mixed Model with Application to Integration of Heterogeneous Healthcare Data

Chongliang Luo; Md. Nazmul Islam; Natalie E. Sheils; Jenna Reps; John Buresh; Rui Duan; Jiayi Tong; Mackenzie Edmondson; Martijn J. Schumie; Yong Chen

doi:10.1101/2020.11.16.20230730

Abstract

Linear mixed models (LMMs) are commonly used in many areas including epidemiology for analyzing multi-site data with heterogeneous site-specific random effects. However, due to the regulation of protecting patients’ privacy, sensitive individual patient data (IPD) are usually not allowed to be shared across sites. In this paper we propose a novel algorithm for distributed linear mixed models (DLMMs). Our proposed DLMM algorithm can achieve exactly the same results as if we had pooled IPD from all sites, hence the lossless property. The DLMM algorithm requires each site to contribute some aggregated data (AD) in only one iteration. We apply the proposed DLMM algorithm to analyze the association of length of stay of COVID-19 hospitalization with demographic and clinical characteristics using the administrative claims database from the UnitedHealth Group Clinical Research Database.

1. Introduction

The COVID-19 outbreak has become a pandemic, causing a large increase in mortality and posing a heavy burden to the healthcare system. Much research has been done on treatment efficacy and adverse clinical outcomes [1-5] and much remains to be done. As studies continue to be conducted and published, multi-site collaboration is demanded for evidence synthesis [3,4]. Multi-site studies based on healthcare data, including the electronic health record (EHR) and claims data, can integrate clinical information across multiple sites or systems to improve estimation and predictive performance due to use of a larger and more inclusive sample from the population of interest.

One primary challenge for multi-site collaboration is preserving the privacy of protected health information. Sensitive individual patient data (IPD) including the patient’s identity, diagnoses and treatments are usually not allowed under privacy regulation to be shared across networks. Existing approaches to performing multi-site studies, e.g. distributed algorithms, have the drawback of biased estimation [12] or communication burden due to the requirement of iterative transmission of summarized data [14,15]. Specifically, for ordinary linear regression, identical results with pooled analysis can be obtained by lossless compression [7]. Another important challenge of analyzing multi-site data is the heterogeneity of data distribution across sites. Many existing approaches for multi-site analysis assume the data are homogeneously distributed across sites. This assumption may be violated in practical scenarios and thus make the model vulnerable in estimation and hypothesis testing. For instance, in this manuscript we consider the length of stay of the hospitalization due to COVID-19 as the continuous outcome. Heterogeneity across countries, regions, and sub-populations has been reported in the literature [2]. The aforementioned lossless compression approach [7] for ordinary linear regression, fails to take the heterogeneity into account.

In this paper we propose a novel algorithm for distributed linear mixed models (DLMMs). Linear mixed models (LMMs) are commonly used in many areas including epidemiology for analyzing multi-site data with heterogeneity. The model assumes site-specific random effects of the covariates (and intercept) on a continuous outcome. To the best of our knowledge, there is no existing approach for fitting LMMs in a distributed manner, see Figure 1 for the comparison of several approaches in this context. Our proposed distributed LMM can achieve exactly the same results as if we had pooled individual patient data from all sites, hence the lossless property. These lossless results can be obtained by requiring the sharing of summary statistics from each site in only one iteration. We apply the proposed DLMM to analyze the association of length of stay of COVID-19 hospitalization with demographic and clinical characteristics using the administrative claims database for Medicare Advantage members from a large US Health insurance provider (Appendix Figure 2A).

Figure 1.

Comparison of several approaches for linear regression analysis of multi-site EHR data with heterogeneity.

2. Method

2.1 Linear mixed model

Due to the heterogeneity of data across sites, the effects of the covariates on the outcome among sites in the linear regression model may not always be the same [7]. A linear mixed model is thus often used. Assume for the j^th patient at the i^th site, y_ij is the continuous outcome, x_ijis the p-dimensional covariate vector and β is the vector of fixed effects, z_ijis the q-dimensional covariate vector having random effect u_i, and ϵ_ijis the random error. where u_i ∼ N(0, V), ϵ_ij ∼ N(0, σ²). The random effects covariates z_ij can be part or all of x_ij, or constant if random intercept only. The random effect covariance matrix V can admit certain structures with unknown parameters. For instance we can assume the random effects are independent, i.e. V = diag(σ₁², …, σ_q²). These parameters (e.g. variance components) and the fixed effects β are usually estimated by maximum likelihood (ML) or restricted maximum likelihood (REML) estimation [6]. The log-likelihood of LMM using all the data is where X_i and Y_i are the covariate matrix and the outcome vector of the i^th site, |. | is the matrix determinant and .

The maximum likelihood estimation can be further simplified by profiling out β and σ² from (2). Denote Θ = V/σ², given Θ, the estimation of β and σ² are where . Thus the profile log-likelihood with respect to only Θ is and the restricted profile log-likelihood is

The ML or REML estimate of Θcan be obtained by maximizing (5) or (6). The estimates of β and σ² can be subsequently obtained by (3) and (4). We denote these estimates as . The variance of the estimated fixed effects is thus or the sandwich estimator:

2.2 Distributed linear mixed model

It’s easy to see from (5) that there is no closed-form estimation for LMM. Thus unlike in the ordinary linear model [7], the LMM estimation is not trivial to be distributed to each site losslessly. Fortunately, with some linear algebra, we can disentangle the data (Y_i, X_i) and the parameters Θ in |Γ_i| and Γ_i⁻¹and thus reconstruct the profile log-likelihood (5) without communicating individual patient data. Specifically, we utilize the Woodbury matrix identity [8] to obtain and the matrix determinant lemma [9] to obtain where I_q is the q × q identity matrix. We focus on the situation that the covariates in Z are a subset of that in X. The more general case is similar and will be elaborated in the Appendix. We require the i^th site to contribute some summary statistics, i.e. the p × p matrix S_i ^X = X_i^TX_i, the p × 1 vector S_i ^Xy = X_i ^Ty_i, the scalar s_i^Y = y_i^T y_i and the sample size n_i. The exact log-likelihood (5) (or a restricted likelihood (6)) can then be reconstructed using these summary statistics. The details of the reconstruction are in the Appendix. The result by the DLMM algorithm is thus identical to that of the pooled LMM analysis, see Figure 2.

Figure 2.

Schematic overview of the proposed algorithm for distributed linear mixed model (DLMM). The linear mixed model takes into account the heterogeneity of the effect of the covariates X on the continuous outcome y across sites. The proposed distributed algorithm achieves identical results as pooling the individual patient data (IPD) from all sites, by requiring only aggregated data (AD) S_i^X, S_i^Xy, s_i^yand sample size n_i from the i^th site. The distributed algorithm is privacy-preserving as only summary statistics (i.e. p × p matrices, p × 1 vectors and scalars) are being communicated.

2.3 Selection of variance components

We test the significance of random effects of each individual covariate by likelihood ratio test. For simplicity we assume the potential random effects are independent and the random intercept always exists, i.e. V = diag(σ₁², …, σ_q²) and σ₁² > 0, for the covariate corresponding to variance component σ_k², k ≥ 2, we test

The likelihood ratio test (LRT) gives the likelihood ratio follows a 50:50 mixture of χ₀² and χ₁²[10,11]. Notice both the log-likelihoods in (10) can be reconstructed by the communicated summary statistics.

2.4 Best linear unbiased predictors for the random effects

Finally, the BLUP [6] of the random effects u_i at the i^th site is

Conditioning on X_i, +_ihas mean zero and covariance matrix

Since we are more interested in prediction of u_i, it is more appropriate to use prediction intervals as below

We summarize the analysis with the proposed DLMM algorithm as in Algorithm 1.

Algorithm 1.

Analysis with the distributed linear mixed model algorithm

3. Multi-site analysis of COVID-19 hospitalization length of stay

We demonstrate the utility and lossless property of the DLMM method by studying the association of length of stay of COVID-19 hospitalization with patients’ demographic and clinical characteristics. We emphasize that this example is for illustrative purposes only and to this end we considered only covariates that have already been well-documented in the literature.

We identified patients who were admitted as inpatients to a hospital with a primary or secondary diagnosis of COVID-19 between January 1, 2020 and September 30, 2020. The data are collected from K = 538 sites in the UnitedHealth Group Clinical Research Database and the total number of patients is . The detailed inclusion criteria is in Appendix Figure 1A. We treat length of stay as a continuous outcome. The demographic characteristics include age, gender and race and the clinical characteristics include a history of cancer, chronic obstructive pulmonary disease (COPD), heart disease, hypertension, hyperlipidemia, kidney disease and obesity. Charlson comorbidity index score is also included as a measure of the overall patient’s burden of diseases; the higher the score, the more severe the patient’s health condition is. We provide the details of the definition of the characteristics in the Appendix Table 2A.

Figure 2.

(a) The mean and standard deviation of length of stay of 47756 hospitalized COVID-19 patients from 538 hospitals. The data are collected from a single large US insurer and separated into their respective hospital sites to illustrate the algorithm. The area of each dot is proportional to the number of patients at that hospital and color represents the region. (b) Fixed effects estimation of linear mixed model by the proposed DLMM algorithm vs the pooled analysis. (c) Fixed effects’ standard error estimation of linear mixed model by the proposed DLMM algorithm vs the pooled analysis. (d) Variance components estimation of linear mixed model by the proposed DLMM algorithm vs the pooled analysis.

We select the covariate-specific random effects u_im, m = 1, …, q, as described in Section 2.3. For simplicity we assume the random effect of different covariates are independent, i.e. V = diag(σ₁², …, σ_q²). We select the covariates for which the corresponding p-value ≤0.05. In particular, we select random slopes for obesity, diabetes, kidney-diseases, and charlosn scores. A LMM with random intercept and random effects for obesity, diabetes, kidney-diseases, and charlosn scores is then fitted by either pooling the IPD together, or the proposed DLMM algorithm. We also calculate the BLUPs and the prediction intervals of the random effects at each site by (11) in Section 2.4.

We compare the result of the pooled analysis and the distributed algorithm in Figure 2 and Appendix Figure 3A. Specifically, the estimation of the fixed effects, their standard errors, the variance components are shown to be identical by the pooled analysis or the distributed algorithm. The estimated BLUPs by either the pooled analysis or the distributed algorithm are also shown to be identical in Figure 3A. The forest plot of fixed effects estimation and BLUPs of the random effects at a specific site are shown in Figure 3. Notice male, older age (≥80), higher Charlson score, obesity, prevalence of diabetes, and kidney diseases are shown to be significantly associated with longer COVID-19 hospitalization and Caucasian race, compared to others, is significantly associated with shorter COVID-19 hospitalization. The results match with that of literature for most of the covariates [16-19].

Figure 3.

(a) Fixed effects of demographic and clinical characteristics on COVID-19 hospitalization length of stay. A vertical reference line is drawn for convenience in comparison. Reported are the estimated effect sizes, 95% confidence intervals and corresponding p-values based on Wald test. (b) BLUPs for random effects corresponding to a site located in the south region; reported are 95% prediction intervals.

4. Discussion and future work

Special care must be taken with healthcare data in order to preserve patient privacy. Anonymizing data while preserving features that are important for understanding an individual’s health is highly non-trivial. In addition, large, representative datasets are especially scarce. Distributed models solve the privacy issue by requiring that only summary level statistics are shared. The one-shot model presented here requires only the p × p matrix of summary statistics, sample size, and p-dimensional vector be sent once. This allows for efficient sharing to build models for various applications across healthcare where data may remain completely protected by eliminating the need for data pooling at a central source. By considering a large, more diverse sample from multiple sites we expect a more robust outcome which benefits all institutions.

Data Availability

The data are proprietary and are not available for public use but can be made available under a data use agreement to confirm the findings of the current study.

Funding

Research reported in this article was partially funded through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-2019C3-18315).

Disclosures

Drs. Sheils and Islam and Mr. Buresh are full-time employees in Optum Labs and own stock in its parent company, UnitedHealth Group, Inc.

5. Acknowledgement

Appendix

A1. Reconstruction of the (restricted) LMM likelihood

In the case that the random effects covariates in Z is not a subset of the fixed effects covariates in X, the proposed DLMM algorithm requires the i^th site to communicate

p × p matrix S_i^X = X_i^T X_i, p × q matrix S_i^XZ = (S_i^ZX)^T= X_i^T Z_i,
p − dim vector S_i^Xy = (S_i^yX)^T = X_i^T y_i, q − dim vector S_i^Zy = (S_i^yz)^T = Z_i^T y_i,
scalar S_i^y = y_i^T y_i, sample size n_i,

for reconstructing the (restricted) LMM likelihood.

Below are the details of the reconstruction. By (8), thus and can be reconstructed by (3) and (4). Notice the unknown parameters are contained only in Θ and are separated from the summary statistics. Therefore, the profile log-likelihood (5) with respect to Θ can be reconstructed as where |Γ_i| = |I_q + S_i ^ZΘ| according to (9). The restricted profile log-likelihood (6) can also be reconstructed in the same way.

A2. Further information on data sources

A2.1 Standardization of data entry and data structure

Medical and pharmacy claims data are captured, predominantly electronically, from sites of care seeking third-party reimbursement for both Medicare and commercial plans using the industry standard data collection forms HCFA/CMS-1500 for facility claims, UB04/CMS-1450 for professional services and outpatient claims, and NCPDP for pharmacy claims or their electronic equivalents. Structured data from these standardized forms are coded using the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM), National Drug Codes (NDC), Current Procedural Terminology (CPT) codes, and Logical Observation Identifiers Names and Codes (LOINC) codes, and Diagnosis Related Groups (DRG). This nomenclature ensures consistency of data collection across geographic regions, health systems, and payers throughout the United States.

A2.2 Methods to Control for Errors in Sampling and Data Collection

Claims that do not adhere to the form or coding standards described above are rejected from reimbursement, minimizing the risk that inappropriately structured data are included in the database. Data specific to SARS-CoV-2 and COVID-19 has an additional Quality Control layer to control for errors in sampling and data collection; this is described below in the section on Quality Control.

Figure 1A.

Flow chart of cohort definition for the COVID-19 hospitalization length of stay study using UHG claims data.

A2.3 Quality control

A COVID-19 data source-specific layer of quality control is also present, given the rapidly evolving situation. Members with a qualified COVID-19 related hospital admission are included in the report when any diagnosis matches qualified ICD-10 codes of U071, U072, or B9729. Suspected COVID-19 inpatient cases are manually reviewed daily by health plan clinical staff via clinical notes to determine an individual’s COVID-19 status. Each case is then manually flagged as either negative, confirmed, presumed positive, or needs clinical review. If a case is confirmed, it is not reviewed again. If a case is listed as negative or unknown, it is periodically reviewed for changes in the record. All others are reviewed and updated daily.

A2.4 Data Sharing

The data are proprietary and are not available for public use but, under certain conditions, may be made available to editors and their approved auditors under a data use agreement to confirm the findings of the current study.

Figure 2A.

COVID-19 inpatient case distribution: number of hospitalizations by state. Data are extracted from UHG Clinical Research Database from Jan 1 to Sep 30, 2020.

View this table:

Table 1A.

UHG COVID-19 hospitalized patients characteristics.

View this table:

Table 2A.

ICD-10-CM codes used to calculate Charlson comorbidity score.

Figure 3A.

Comparison of the best linear unbiased predictors (BLUPs) of the random effects by pooled and DLMM methods. The BLUPs are obtained from a linear mixed model with COVID-19 hospitalization as the outcome and demographics and comorbidity variables as covariates. Diabetes, obesity, kidney disease and Charlson score are selected as having significant random effects.

Footnotes

Institution information updated for authors: Md. Nazmul Islam, Natalie E. Sheils, and Rui Duan.

References

1.↵
Kimmel SE, Califf RM, Dean NE, Goodman SN, Ogburn EL. COVID-19 Clinical Trials: A Teachable Moment for Improving Our Research Infrastructure and Relevance. Annals of Internal Medicine. 2020 Jun 16.
2.↵
Becchetti, Leonardo and Conzo, Gianluigi and Conzo, Pierluigi and Salustri, Francesco, Understanding the Heterogeneity of Adverse COVID-19 Outcomes: The Role of Poor Quality of Air and Lockdown Decisions (April 10, 2020). Available at SSRN: http://dx.doi.org/10.2139/ssrn.3572548.
3.↵
Chen Q, Allot A, Lu Z. Keep up with the latest coronavirus research. Nature. 2020 Mar;579(7798):193.
OpenUrl CrossRef PubMed
4.↵
Ogburn EL, Bierer BE, Brookmeyer R, Choirat C, Dean NE, De Gruttola V, Ellenberg SS, Halloran ME, Hanley Jr DF, Lee JK, Wang R. Aggregating data from COVID-19 trials. Science (New York, NY). 2020 Jun 12;368(6496):1198–9.
OpenUrl FREE Full Text
5.↵
Williamson EJ, Walker AJ, Bhaskaran K, Bacon S, Bates C, Morton CE, Curtis HJ, Mehrkar A, Evans D, Inglesby P, Cockburn J. OpenSAFELY: factors associated with COVID-19 death in 17 million patients. Nature. 2020 Jul 8:1–1.
6.↵
Ruppert, D., Wand, M.P. and Carroll, R.J., 2003. Semiparametric regression (No. 12). Cambridge university press.
7.↵
Chen, Y., Dong, G., Han, J., Pei, J., Wah, B.W. and Wang, J., 2006. Regression cubes with lossless compression and aggregation. IEEE Transactions on Knowledge and Data Engineering, 18(12), pp.1585–1599.
OpenUrl
8.↵
Sherman, J. and Morrison, W.J., 1950. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1), pp.124–127.
OpenUrl
9.↵
Ding, J. and Zhou, A., 2007. Eigenvalues of rank-one updated matrices with some applications. Applied Mathematics Letters, 20(12), pp.1223–1226.
OpenUrl
10.↵
Self SG, Liang KY. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association. 1987 Jun 1;82(398):605–10.
OpenUrl CrossRef Web of Science
11.↵
Stram DO, Lee JW. Variance components testing in the longitudinal mixed effects model. Biometrics. 1994 Dec 1:1171–7.
12.↵
Duan R, Boland MR, Liu Z, Liu Y, Chang HH, Xu H, Chu H, Schmid CH, Forrest CB, Holmes JH, Schuemie MJ. Learning from electronic health records across multiple sites: A communication-efficient and privacy-preserving distributed algorithm. Journal of the American Medical Informatics Association. 2020 Mar;27(3):376–85.
OpenUrl
13.
Duan, R., Luo, C., Schuemie, M. H., Tong, J., Liang, J. C., Chang, H. H., Boland, M. R., Bian, J., Xu, H., Holmes, J. H.. Learning from local to global-an efficient distributed algorithm for modeling time-to-event data. Journal of the American Medical Informatics Association. 2020 July; 27(7):1028–1036.
OpenUrl
14.↵
Lu, C.-L., Wang, S., Ji, Z., Wu, Y., Xiong, L., Jiang, X., and Ohno-Machado, L.. Webdisco: a web service for distributed cox model learning without patient-level data sharing. Journal of the American Medical Informatics Association 22, 1212–1219.
15.↵
Wu, Y., Jiang, X., Kim, J., and Ohno-Machado, L.. G rid binary lo gistic re gression (glore): building shared models without sharing data. Journal of the American Medical Informatics Association 19, 758–764.
16.↵
Sanyaolu A, Okorie C, Marinkovic A, Patidar R, Younis K, Desai P, Hosein Z, Padda I, Mangat J, Altaf M. Comorbidity and its Impact on Patients with COVID-19. Sn Comprehensive Clinical Medicine. 2020 Jun 25:1–8.
17.
Price-Haywood EG, Burton J, Fort D, Seoane L. Hospitalization and mortality among black patients and white patients with Covid-19. New England Journal of Medicine. 2020 May 27.
18.
Rees EM, Nightingale ES, Jafari Y, Waterlow NR, Clifford S, Pearson CA, Jombart T, Procter SR, Knight GM, CMMID Working Group. COVID-19 length of hospital stay: a systematic review and data synthesis. medRxiv preprint https://doi.org/10.1101/2020.04.30.20084780.
19.↵
Wang L, He W, Yu X, Hu D, Bao M, Liu H, Zhou J, Jiang H. Coronavirus disease 2019 in elderly patients: Characteristics and prognostic factors based on 4-week follow-up. Journal of Infection. 2020 Mar 30.