Predicting Long COVID in the National COVID Cohort Collaborative Using Super Learner

Zachary Butzin-Dozier; Yunwen Ji; Haodong Li; Jeremy Coyle; Junming (Seraphina) Shi; Rachael V. Philips; Andrew Mertens; Romain Pirracchio; Mark J. van der Laan; Rena Patel; John M. Colford; Alan E. Hubbard; the National COVID Cohort Collaborative (N3C) Consortium

doi:10.1101/2023.07.27.23293272

ABSTRACT

Post-acute Sequelae of COVID-19 (PASC), also known as Long COVID, is a broad grouping of a range of long-term symptoms following acute COVID-19 infection. An understanding of characteristics that are predictive of future PASC is valuable, as this can inform the identification of high-risk individuals and future preventative efforts. However, current knowledge regarding PASC risk factors is limited. Using a sample of 55,257 participants from the National COVID Cohort Collaborative, as part of the NIH Long COVID Computational Challenge, we sought to predict individual risk of PASC diagnosis from a curated set of clinically informed covariates. We predicted individual PASC status, given covariate information, using Super Learner (an ensemble machine learning algorithm also known as stacking) to learn the optimal, AUC-maximizing combination of gradient boosting and random forest algorithms. We were able to predict individual PASC diagnoses accurately (AUC 0.947). Temporally, we found that baseline characteristics were most predictive of future PASC diagnosis, compared with characteristics immediately before, during, or after COVID-19 infection. This finding supports the hypothesis that clinicians may be able to accurately assess the risk of PASC in patients prior to acute COVID diagnosis, which could improve early interventions and preventive care. We found that medical utilization, demographics and anthropometry, and respiratory factors were most predictive of PASC diagnosis. This highlights the importance of respiratory characteristics in PASC risk assessment. The methods outlined here provide an open-source, applied example of using Super Learner to predict PASC status using electronic health record data, which can be replicated across a variety of settings.

BACKGROUND

As the mortality rate associated with acute COVID-19 incidence wanes, investigators have shifted focus to determining its longer-term, chronic impacts.¹ Post-acute Sequelae of COVID-19 (PASC) is a loosely categorized consequence of acute infection that is related to dysfunction across multiple biological systems.² Electronic health record (EHR) databases, such as the National COVID Cohort Collaborative (N3C), provide an important tool for predicting, evaluating, and understanding PASC.³,⁴

Given the broad range of factors associated with PASC, the high dimensionality of the N3C Enclave data, and the unknown determinants of Long COVID, modeling methods for predicting PASC must be highly flexible. Super Learner (SL) is a flexible, ensemble (stacked) machine learning algorithm that uses cross-validation to learn the optimal weighted combination of a specified set of algorithms.⁵,⁶ The SL is grounded in statistical optimality theory, which guarantees that, for large sample sizes, it will perform at least as well as the best-performing algorithm included in the library. Thus, a rich library of learners, combined with a sufficient sample size, positions the SL to perform as well as the best candidate in that library. This robustness is supported by numerous applications, and the SL can be specified to optimize any performance metric (for example, to minimize mean squared error).⁶

Here, we used the SL to estimate the function for predicting PASC diagnosis in COVID-infected patients, given a diverse set of features curated from the EHR. The SL was specified such that it learned the combination of algorithms, including variations of gradient boosting and random forest, that maximized the area under the receiver operating characteristic curve (AUC).⁷ Our set of features for predicting PASC included those previously described in the literature,³ as well as additional features related to subject-matter knowledge and patterns of missingness. We also investigated the importance of features for predicting PASC across multiple levels, including assessing the importance of each individual feature, and of groups of features based on temporality (baseline, pre-COVID, acute COVID, and post-COVID features) and hypothesized biological pathways of PASC.

METHODS

Sample

The Long COVID Computational Challenge (L3C, DUR RP-5A73BA) sample population was selected from the N3C dataset, a national, open dataset that has been described previously.³,⁴ The L3C sample included participants diagnosed with PASC (ICD code U09.9) and controls with a documented COVID-19 diagnosis who had at least one medical visit more than 4 weeks after their initial COVID diagnosis date. Controls were selected at a 1:4 (case:control) ratio and were matched based on the distribution of medical visits prior to COVID-19 diagnosis. The primary outcome of interest was PASC diagnosis via ICD code U09.9.

The dataset included 57,672 patients with 9,031 cases, 46,226 controls, and 2,415 patients excluded due to having a PASC diagnosis before 4 weeks following acute COVID diagnosis. This yielded a final analytic sample of 55,257 participants.

Feature selection

We extracted 304 features from N3C data. After indexing across four time periods and transforming features into formats amenable to machine learning analysis, our sample included 1,339 features (see Supplemental Table 1. Metadata). Details regarding feature selection and processing can be accessed via GitHub (https://github.com/BerkeleyBiostats/l3c_ctml/tree/v1). For continuous features, we included the minimum, maximum, and mean values for each measurement in each temporal window. For binary features, we either included an indicator (when repetition was not relevant) or a count (when repetition was relevant) over each time period and we re-coded categorical variables as indicators.

Temporal windows

We divided each participant’s records into four temporal windows: baseline, which consisted of all records occurring at least 37 days before the COVID index date (up to t - 37, where t represents the COVID index date) and all time-invariant factors (such as sex and ethnicity); pre-COVID, observations falling from 37 days to 7 days before the index date (t - 37 to t - 7); acute COVID, observations falling from 7 days before to 14 days after the index date (t - 7 to t + 14); and post-COVID, records from 14 to 28 days after the index date (t + 14 to t + 28).
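The window definitions, together with the minimum, maximum, and mean summaries used for continuous features, can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names are hypothetical, and because the text does not say which window owns a record falling exactly on a boundary day, the boundary handling below is an assumption.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def assign_window(record_date, index_date):
    """Map a record to a temporal window relative to the COVID index date t.

    Assumed boundary handling (not specified in the text):
    baseline: offset <= -37; pre-COVID: -37 < offset < -7;
    acute COVID: -7 <= offset < 14; post-COVID: 14 <= offset <= 28.
    """
    offset = (record_date - index_date).days  # negative means before t
    if offset <= -37:
        return "baseline"
    if offset < -7:
        return "pre_covid"
    if offset < 14:
        return "acute_covid"
    if offset <= 28:
        return "post_covid"
    return None  # after t + 28: outside the prediction window


def summarize_continuous(measurements, index_date):
    """Return min, max, and mean of one continuous measure per window.

    `measurements` is a list of (record_date, value) pairs.
    """
    by_window = defaultdict(list)
    for record_date, value in measurements:
        window = assign_window(record_date, index_date)
        if window is not None:
            by_window[window].append(value)
    return {
        window: {"min": min(vals), "max": max(vals), "mean": mean(vals)}
        for window, vals in by_window.items()
    }


# Hypothetical temperature readings for one participant.
index_date = date(2021, 3, 1)
readings = [
    (date(2021, 1, 1), 98.0),
    (date(2021, 3, 2), 101.0),
    (date(2021, 3, 5), 103.0),
]
summary = summarize_continuous(readings, index_date)
```

Each (window, statistic) pair then becomes one model feature, which is how a few hundred extracted measures expand into a much larger feature set.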

Features described in the literature

Pfaff et al. used gradient-boosting machine learning models (XGBoost) to identify patients at risk for PASC using N3C data.³ We extracted and transformed key features that were identified by Pfaff et al. These features included 199 previously described factors related to medical history, diagnoses, demographics, and comorbidities.³

Temporality

To account for differences in follow-up, we included as an additional factor a continuous variable for follow-up time, defined as the number of days between the COVID index date and the most recent observation. To account for temporal trends of COVID (such as seasonality and dominant variant), we included categorical (ordinal) covariates for the season and months since the first observed COVID index date.

Missing data

We handled missing data with an approach that also applies when predicting future observations with missing values: for each variable, we created an indicator basis function that flags whether the observation was missing (yes/no).⁸ By including these indicators (and filling each missing value with 0), we allow the machine learning algorithms to determine what predictive information the missingness process carries, without relying on an imputation model. Thus, these indicators allow the pattern of missingness itself to be a predictor of PASC.
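The indicator approach can be illustrated with a short sketch. The function and feature names are hypothetical, and `None` is assumed to mark a missing value; the zero fill and the paired indicator follow the description above.

```python
def add_missingness_indicators(rows, feature_names):
    """Zero-fill missing values and add one 0/1 missingness flag per feature.

    `rows` is a list of dicts; a value of None (or an absent key) is
    treated as missing. This encoding is an illustrative assumption.
    """
    encoded = []
    for row in rows:
        out = {}
        for name in feature_names:
            value = row.get(name)
            is_missing = value is None
            out[name] = 0.0 if is_missing else float(value)
            out[f"{name}_missing"] = 1 if is_missing else 0
        encoded.append(out)
    return encoded


# Hypothetical laboratory features for two participants.
rows = [{"crp": 4.2, "bmi": None}, {"crp": None, "bmi": 27.5}]
encoded = add_missingness_indicators(rows, ["crp", "bmi"])
```

Because the filled value and the flag enter the model together, a tree-based learner can split on the flag and so distinguish a true 0 from a filled 0.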

COVID-19 positivity

We added several measures of COVID severity and persistent SARS-CoV-2 viral load, which are associated with PASC incidence.⁹ We imported measures of COVID severity as well as 15 measures of COVID infection from laboratory measurements, which provided insights on persistent SARS-CoV-2 viral load. We assessed the duration of COVID viral positivity separately for each laboratory measure of COVID and each temporal window. For participants who had both a positive and negative value of a given test during a temporal window, we took the midpoint between the last positive test and the first negative test as being the endpoint of their positivity. For individuals who had a positive test but no subsequent negative test within that temporal window, we determined their endpoint to be their final positive test plus three days. We included separate missingness indicators in each temporal window for each test, for a positive value for each test, and for a negative value following a positive value to indicate an imputed positivity endpoint.
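The endpoint rule can be sketched for a single laboratory measure within a single temporal window. The function name is hypothetical. Two details are assumptions not stated in the text: the "first negative test" is taken to be the first negative after the last positive, and the midpoint is rounded down to whole days.

```python
from datetime import date, timedelta


def positivity_endpoint(tests):
    """Estimate when positivity ended for one test type in one window.

    `tests` is a list of (test_date, is_positive) pairs.
    Returns (endpoint, imputed): `imputed` is True when no negative test
    followed the last positive, so the endpoint is last positive + 3 days.
    Returns (None, False) when the window contains no positive test.
    """
    positives = sorted(d for d, is_pos in tests if is_pos)
    if not positives:
        return None, False
    last_positive = positives[-1]
    later_negatives = sorted(
        d for d, is_pos in tests if not is_pos and d > last_positive
    )
    if later_negatives:
        gap_days = (later_negatives[0] - last_positive).days
        # Midpoint between last positive and first later negative,
        # rounded down to whole days (assumption).
        return last_positive + timedelta(days=gap_days // 2), False
    return last_positive + timedelta(days=3), True


# Hypothetical PCR results for one participant.
pcr = [
    (date(2021, 3, 1), True),
    (date(2021, 3, 5), True),
    (date(2021, 3, 11), False),
]
endpoint, imputed = positivity_endpoint(pcr)
```

The returned flag plays the role of the missingness indicator for "negative value following a positive value" described above.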

Additional features

We incorporated laboratory measurements related to anthropometry, nutrition, COVID positivity, inflammation, tissue damage due to viral infection, auto-antibodies and immunity, cardiovascular health, and microvascular disease, which are potential predictors of PASC.⁹ We also extracted information about smoking status, alcohol use, marital status, and use of insulin or anticoagulants from the observation table as baseline characteristics of individuals, and, from the device table, we included the number of times a person was exposed to respiratory devices in each of the four windows. We extracted covariates related to COVID severity, vaccination history, demographics, medical history, and previous diagnoses from before and during acute COVID infection.

Prediction using ensemble machine learning

We used the SL, an ensemble machine learning method also known as stacking, to learn the optimally weighted combination of candidate algorithms for maximizing the AUC. We reprogrammed the SL in Python in order to capitalize on the resources available in the N3C Data Enclave (e.g., PySpark parallelization), and this software is available to external researchers (https://github.com/BerkeleyBiostats/l3c_ctml/tree/v1). We used a relatively small ensemble of four learners (a mix of robust parametric models and machine learning models): 1. Logistic regression; 2. L1-penalized logistic regression (with penalty parameter lambda = 0.01); 3. Gradient boosting (with n_estimators = 200, max_depth = 5, learning_rate = 0.1); 4. Random forest (with max_depth = 5, num_trees = 20). The original candidate learner library consisted of a large set of candidate learners with different combinations of hyperparameters (e.g., gradient boosting with n_estimators = [200, 150, 100, 50], max_depth = [3, 5, 7], learning_rate = [0.05, 0.1, 0.2]).
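To give a sense of the size of the original library, the gradient boosting grid quoted above can be expanded into individual candidate configurations. This is an illustrative sketch only: the helper name is hypothetical, and the text does not say whether every combination was actually fitted.

```python
from itertools import product

# Hyperparameter values quoted in the text for gradient boosting.
GB_GRID = {
    "n_estimators": [200, 150, 100, 50],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.2],
}


def expand_grid(grid):
    """Return one dict per combination of hyperparameter values."""
    names = list(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


gb_candidates = expand_grid(GB_GRID)
# The configuration retained in the final four-learner ensemble.
final_gb = {"n_estimators": 200, "max_depth": 5, "learning_rate": 0.1}
```

A full factorial expansion of this grid yields 4 x 3 x 3 = 36 gradient boosting candidates, one of which matches the configuration kept in the final ensemble.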

One important decision when optimizing an algorithm is which metric will be used to evaluate the fit and optimize the weighting of the algorithms in the ensemble. We used an approach developed specifically for maximizing the area under the curve (AUC).⁷ Specifically, we used an AUC-maximizing meta-learner with Powell optimization to learn the convex combination of these four candidate algorithms.⁷ The SL was implemented with a 10-fold (V-fold) cross-validation scheme.
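The idea of an AUC-maximizing convex combination can be sketched in a few lines. This is not the authors' implementation: to keep the example dependency-free, a coarse grid search over the weight simplex stands in for Powell optimization, and all function names and the toy predictions are hypothetical.

```python
from itertools import product


def auc(labels, scores):
    """Probability that a random case outscores a random control.

    Ties count as one half (the Mann-Whitney form of the AUC).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simplex_grid(n_learners, steps=10):
    """Yield non-negative weight vectors that sum to 1, in steps of 1/steps."""
    for combo in product(range(steps + 1), repeat=n_learners - 1):
        if sum(combo) <= steps:
            yield [c / steps for c in combo] + [(steps - sum(combo)) / steps]


def fit_auc_weights(labels, cv_preds, steps=10):
    """Pick the convex weights whose blended prediction has the highest AUC.

    `cv_preds` holds one list of cross-validated predictions per learner,
    so the weights are chosen on out-of-fold predictions.
    """
    best_weights, best_auc = None, -1.0
    for weights in simplex_grid(len(cv_preds), steps):
        blended = [
            sum(w * preds[i] for w, preds in zip(weights, cv_preds))
            for i in range(len(labels))
        ]
        score = auc(labels, blended)
        if score > best_auc:
            best_weights, best_auc = weights, score
    return best_weights, best_auc


# Toy out-of-fold predictions from two hypothetical learners.
labels = [0, 0, 1, 1, 0, 1]
cv_preds = [
    [0.2, 0.6, 0.5, 0.9, 0.1, 0.4],
    [0.3, 0.2, 0.4, 0.6, 0.5, 0.7],
]
weights, ensemble_auc = fit_auc_weights(labels, cv_preds)
```

Because the grid includes the corner points that put all weight on a single learner, the selected combination can never score below the best individual learner on the cross-validated predictions. A derivative-free optimizer such as Powell's method searches the same space without being restricted to a fixed grid.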

Variable importance

In this section, for the sake of computational efficiency, we worked with the discrete SL selector (the single candidate learner in the library with the highest cross-validated AUC) instead of the entire ensemble SL. In this case, the gradient-boosting learner was the candidate learner with the highest cross-validated AUC. We used a general approach (applicable to any machine learning algorithm) known as Shapley values.¹⁰ We generated these values within three groupings of predictors for ease of interpretability: individual features (e.g., cough diagnosis during the acute COVID window), the temporal window when measurements were made relative to acute COVID infection (e.g., the pre-COVID window), and specific biological pathways (e.g., the respiratory pathway). At the individual level, we assessed the importance of each variable (indexed across each of the four temporal windows) in predicting PASC. At the temporal level, we assessed the relative importance of each of the four temporal windows (baseline, pre-COVID, acute COVID, and post-COVID) in predicting PASC status. At the level of the biological pathway, we grouped variables based on the following hypothesized mechanistic pathways of PASC: 1) Baseline demographics and anthropometry, 2) Medical visitation and procedures, 3) Respiratory system, 4) Antimicrobials and infectious disease, 5) Cardiovascular system, 6) Female hormones and pregnancy, 7) Mental health and wellbeing, 8) Pain, skin sensitivity, and headaches, 9) Digestive system, 10) Inflammation, autoimmune, and autoantibodies, 11) Renal function, liver function, and diabetes, 12) Nutrition, 13) COVID positivity, 14) Uncategorized disease, nervous system, injury, mobility, and age-related factors.⁹ For temporal and biological groupings, we assessed the mean Shapley value of the 10 most predictive features in each group. A full list of our included covariates along with their grouping by temporality and biological pathway is included in our metatable (Supplemental Table 1. Metadata).
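The group-level summary can be sketched as follows, assuming per-participant Shapley values have already been computed for the selected learner. The function names, feature names, and numbers are hypothetical; the only logic taken from the text is ranking features by mean absolute Shapley value and averaging the top 10 within each group.

```python
from statistics import mean


def mean_abs_shap(shap_matrix, feature_names):
    """Average |Shapley value| per feature across participants.

    `shap_matrix[i][j]` is the Shapley value of feature j for participant i.
    """
    n = len(shap_matrix)
    return {
        name: sum(abs(row[j]) for row in shap_matrix) / n
        for j, name in enumerate(feature_names)
    }


def group_importance(feature_importance, feature_group, top_k=10):
    """Mean importance of the top_k most important features in each group."""
    by_group = {}
    for name, value in feature_importance.items():
        by_group.setdefault(feature_group[name], []).append(value)
    return {
        group: mean(sorted(values, reverse=True)[:top_k])
        for group, values in by_group.items()
    }


# Toy Shapley values: two participants, three features.
names = ["follow_up_days", "cough_acute", "asthma_baseline"]
shap_matrix = [[0.4, -0.1, 0.0], [-0.2, 0.3, 0.1]]
groups = {
    "follow_up_days": "medical_visitation",
    "cough_acute": "respiratory",
    "asthma_baseline": "respiratory",
}
importance = mean_abs_shap(shap_matrix, names)
pathway_importance = group_importance(importance, groups, top_k=10)
```

Averaging only the top features keeps large groups with many weak predictors from being diluted relative to small groups, although groups with fewer than `top_k` features are averaged over all of their members.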

RESULTS

Predictive performance

Our models accurately predicted PASC diagnosis status, achieving an AUC of 0.947 on a holdout test set (10% of the full data).

Variable importance

Individual predictors

We found that the strongest individual predictors (mean absolute Shapley value) of PASC diagnosis were the length of follow-up (0.40), the number of medical visits associated with a diagnosis during the acute COVID window (0.26), data partner ID (0.25), viral lower respiratory infection during the acute COVID window (0.11), and age (0.06) (Figures 1 and 2).

Figure 1.

Bar plot of most important model features associated with PASC. For additional information regarding covariates, see metatable.

Figure 2.

Beeswarm plot of most important model features associated with PASC. For additional information regarding covariates, see metatable.

Temporal windows

Baseline and time-invariant characteristics were the strongest predictors of PASC (mean 0.093), followed by characteristics during the acute COVID window (mean 0.049) (Figure 3).

Figure 3.

Variable importance by temporal window. Ranked by the mean absolute Shapley value of the top 10 features in each category. Baseline (prior to t – 37); pre-COVID (t – 37 to t – 7); acute COVID (t – 7 to t + 14); and post-COVID (t + 14 to t + 28), with t being the index COVID date.

Biological pathways

We found that medical visitation and procedures included the strongest predictors (mean 0.085), followed by demographics and anthropometry (mean 0.054), respiratory factors (mean 0.023), COVID markers (mean 0.0064), and markers of pain (mean 0.0047) (Figure 4).

Figure 4.

Variable importance by biological pathway. Ranked by the mean absolute Shapley value of the top 10 features (ranked by the same metric) in each category. For additional information regarding covariates, see metatable.

DISCUSSION

Predictive performance

These results provide strong support for 1) the choice of an ensemble learning approach, 2) the specific learners used, 3) how the missing data were handled, and 4) the choice of optimization criterion (maximizing the AUC).

Variable importance

Individual predictors

We found that the individual predictors most associated with PASC diagnosis were related to medical utilization rate and site of care, such as length of follow-up and data partner ID. These factors are unlikely to be causal drivers of PASC incidence. On the other hand, we found that viral lower respiratory infection during acute COVID was highly predictive of PASC diagnosis. Lower respiratory infection during acute COVID may lie on a causal pathway by which acute COVID leads to PASC, although future studies should apply a causal inference framework to evaluate this hypothesis.

Temporal windows

We found that baseline factors were the strongest predictors of PASC diagnosis, compared with factors immediately before, during, or after acute COVID-19 infection. This suggests that clinicians may be able to effectively identify who is at risk for PASC based on baseline characteristics and COVID infection symptoms. It should be noted, however, that the baseline window covered the greatest interval of time and included some time-variant factors that were not linked to any specific time point. Future analyses should expand on this finding to evaluate the feasibility of predicting individual PASC incidence, rather than diagnosis, using baseline characteristics alone. Additional information regarding this relationship could identify patients at risk for PASC prior to acute COVID-19 and could inform early interventions to prevent PASC.

Biological pathways

These results are consistent with published literature and highlight the importance of respiratory features (e.g., asthma) in predicting who may develop PASC, which is consistent with the fact that SARS-CoV-2 is a respiratory virus.²,³ Respiratory factors can influence individual susceptibility to COVID-19, are important features of acute COVID-19 severity, and are key symptoms of PASC.²,³,¹¹ Therefore, future studies should seek to parse the contributions of respiratory symptoms to PASC through the pathways of baseline susceptibility to COVID-19 versus phenotyping of severe COVID-19 in order to improve our understanding of respiratory features as a risk factor for PASC. Despite the range of PASC phenotypes, these findings are consistent with respiratory symptoms (e.g., dyspnea, cough) being the most commonly reported PASC symptoms.⁹,¹¹ Other biological pathways, such as cardiovascular factors, have similar roles as both markers of susceptibility and severity of COVID-19 and should also be explored further in future studies.

Limitations

Our goal for this analysis was to maximize predictive accuracy, rather than to make causal inferences regarding exposure-outcome relationships; therefore, we included all predictors prior to four weeks post-COVID (censored window). The inclusion of pre-COVID, acute COVID, and post-COVID factors complicates inference regarding whether predictive features (e.g., respiratory factors) reflect vulnerability to acute COVID, COVID symptoms, or early PASC symptoms. This analytic sample was matched 1:4 (PASC:non-PASC), with matching based on pre-COVID medical visitation rate, and this matched sample was drawn from N3C, which is itself a matched sample of COVID patients and healthy controls. Therefore, this sample may not be representative of a broader population. We note that, for future use of these data, if the prevalence of PASC in the target population is known and the matching identifier is available, there are methods to calibrate the results to the actual population. Because that information was not available here, predictions generated from this model may need to be re-calibrated to the target population of interest.
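As one illustration of what such a re-calibration can look like, the sketch below applies the standard prior-shift (case-control) correction, which rescales predicted odds by the ratio of target to sample outcome odds. This is a swapped-in, simplified technique rather than the method the authors have in mind: it ignores the matching on medical visitation, the function name is hypothetical, and the target prevalence used in the example is an arbitrary placeholder, not an estimate.

```python
def recalibrate(p_sample, sample_prev, target_prev):
    """Adjust a predicted risk for a change in outcome prevalence.

    Assumes 0 < p_sample < 1 and that only the outcome prevalence differs
    between the sample and the target population (a strong assumption that
    the matched design here does not strictly satisfy).
    """
    sample_odds = sample_prev / (1 - sample_prev)
    target_odds = target_prev / (1 - target_prev)
    adjusted_odds = p_sample / (1 - p_sample) * (target_odds / sample_odds)
    return adjusted_odds / (1 + adjusted_odds)


# Case fraction in the analytic sample reported in the Sample section.
sample_prev = 9031 / 55257
# Placeholder target prevalence, chosen only for illustration.
adjusted = recalibrate(0.5, sample_prev, target_prev=0.05)
```

When the target prevalence is lower than the sample case fraction, every predicted risk is pulled downward while the ranking of participants, and therefore the AUC, is unchanged.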

We found measures of medical visitation to be strong predictors of PASC diagnosis. It is plausible that medical visitation may be associated with increased diagnoses in general, rather than true PASC incidence. However, increased medical visitation may be an effect of early PASC symptoms.

Future steps

In order to improve upon the interpretation and clinical applications of these findings, future studies should apply a causal inference approach to evaluate the potential causal impact of individual predictors on the risk of PASC. Future studies should rigorously evaluate highly predictive features, e.g., via targeted maximum likelihood estimation (TMLE), to generate estimates of parameters more interpretable to non-statisticians.¹²⁻¹⁵ TMLE is a general method for deriving estimates and robust inference for nonparametric measures of association, so it is particularly well-suited for use in the context of machine learning. It can produce estimates of parameters such as the average treatment effect, causal relative risk, causal attributable risk, direct effects, and many others. Interpreting these results as estimates of causal parameters requires assumptions outside of the data (e.g., no unmeasured confounding). Thus, although such estimates provide good insights about the magnitude and direction of the average impact of a predictor, causal interpretation of the results should be made with caution.

One key exposure of interest is vaccination, which is a key strategy in preventing acute COVID-19 infection. There is evidence that COVID-19 vaccination is protective against PASC, but less is known about how vaccination timing (i.e., recency of vaccination prior to acute COVID-19 infection) relates to the risk of PASC.¹⁶⁻¹⁸ Additional information on the relationship between vaccination timing and PASC may inform vaccination guidelines. Furthermore, we lack biomarkers that can objectively diagnose or quantify the risk of PASC, which limits our ability to research, prevent, and treat this condition.⁹,¹⁹ Evidence regarding these potential mechanistic biomarkers will be a key step in the efforts to combat this disease.

Summary

These findings highlight the importance of respiratory symptoms, healthcare utilization, and age in predicting PASC incidence, which is consistent with Pfaff et al.³ Although further investigation is needed, this supports the referral of COVID-19 patients with severe respiratory symptoms for subsequent PASC monitoring. In future work, we plan to investigate predictive performance when only baseline information is used as input to classify PASC, as this provides a practical implementation based on readily available clinical features that could identify participants at risk of PASC prior to COVID diagnosis.

Data Availability

All data analyzed and produced in the manuscript are accessible via the National COVID Cohort Collaborative Data Enclave. A version of the manuscript analysis, using synthetic data rather than de-identified data, can be accessed via GitHub.

https://covid.cd2h.org

https://github.com/BerkeleyBiostats/l3c_ctml/tree/v1

Funding

This research was financially supported by a global development grant (OPP1165144) from the Bill & Melinda Gates Foundation to the University of California, Berkeley, CA, USA.

N3C Attribution

The analyses described in this manuscript were conducted with data or tools accessed through the NCATS N3C Data Enclave https://covid.cd2h.org and N3C Attribution & Publication Policy v 1.2-2020-08-25b supported by NCATS U24 TR002306, Axle Informatics Subcontract: NCATS-P00438-B, and the Bill & Melinda Gates Foundation: OPP1165144. This research was possible because of the patients whose information is included within the data and the organizations (https://ncats.nih.gov/n3c/resources/data-contribution/data-transfer-agreement-signatories) and scientists who have contributed to the on-going development of this community resource [https://doi.org/10.1093/jamia/ocaa196].

Disclaimer

The N3C Publication committee confirmed that this manuscript (MSID:1495.891) is in accordance with N3C data use and attribution policies; however, this content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the N3C program.

IRB

The N3C data transfer to NCATS is performed under a Johns Hopkins University Reliance Protocol # IRB00249128 or individual site agreements with NIH. The N3C Data Enclave is managed under the authority of the NIH; information can be found at https://ncats.nih.gov/n3c/resources.

Individual Acknowledgements For Core Contributors

We gratefully acknowledge the following core contributors to N3C: Adam B. Wilcox, Adam M. Lee, Alexis Graves, Alfred (Jerrod) Anzalone, Amin Manna, Amit Saha, Amy Olex, Andrea Zhou, Andrew E. Williams, Andrew Southerland, Andrew T. Girvin, Anita Walden, Anjali A. Sharathkumar, Benjamin Amor, Benjamin Bates, Brian Hendricks, Brijesh Patel, Caleb Alexander, Carolyn Bramante, Cavin Ward-Caviness, Charisse Madlock-Brown, Christine Suver, Christopher Chute, Christopher Dillon, Chunlei Wu, Clare Schmitt, Cliff Takemoto, Dan Housman, Davera Gabriel, David A. Eichmann, Diego Mazzotti, Don Brown, Eilis Boudreau, Elaine Hill, Elizabeth Zampino, Emily Carlson Marti, Emily R. Pfaff, Evan French, Farrukh M Koraishy, Federico Mariona, Fred Prior, George Sokos, Greg Martin, Harold Lehmann, Heidi Spratt, Hemalkumar Mehta, Hongfang Liu, Hythem Sidky, J.W. Awori Hayanga, Jami Pincavitch, Jaylyn Clark, Jeremy Richard Harper, Jessica Islam, Jin Ge, Joel Gagnier, Joel H. Saltz, Joel Saltz, Johanna Loomba, John Buse, Jomol Mathew, Joni L. Rutter, Julie A. McMurry, Justin Guinney, Justin Starren, Karen Crowley, Katie Rebecca Bradwell, Kellie M. Walters, Ken Wilkins, Kenneth R. Gersing, Kenrick Dwain Cato, Kimberly Murray, Kristin Kostka, Lavance Northington, Lee Allan Pyles, Leonie Misquitta, Lesley Cottrell, Lili Portilla, Mariam Deacy, Mark M. Bissell, Marshall Clark, Mary Emmett, Mary Morrison Saltz, Matvey B. Palchuk, Melissa A. Haendel, Meredith Adams, Meredith Temple-O’Connor, Michael G. Kurilla, Michele Morris, Nabeel Qureshi, Nasia Safdar, Nicole Garbarini, Noha Sharafeldin, Ofer Sadan, Patricia A. Francis, Penny Wung Burgoon, Peter Robinson, Philip R.O. Payne, Rafael Fuentes, Randeep Jawa, Rebecca Erwin-Cohen, Rena Patel, Richard A. Moffitt, Richard L. Zhu, Rishi Kamaleswaran, Robert Hurley, Robert T. Miller, Saiju Pyarajan, Sam G. Michael, Samuel Bozzette, Sandeep Mallipattu, Satyanarayana Vedula, Scott Chapman, Shawn T. O’Neil, Soko Setoguchi, Stephanie S. Hong, Steve Johnson, Tellen D. Bennett, Tiffany Callahan, Umit Topaloglu, Usman Sheikh, Valery Gordon, Vignesh Subbian, Warren A. Kibbe, Wenndy Hernandez, Will Beasley, Will Cooper, William Hillegass, Xiaohan Tanner Zhang. Details of contributions available at covid.cd2h.org/core-contributors

Data Partners with Released Data

The following institutions whose data is released or pending: Available: Advocate Health Care Network — UL1TR002389: The Institute for Translational Medicine (ITM) • Boston University Medical Campus — UL1TR001430: Boston University Clinical and Translational Science Institute • Brown University — U54GM115677: Advance Clinical Translational Research (Advance-CTR) • Carilion Clinic — UL1TR003015: iTHRIV Integrated Translational health Research Institute of Virginia • Charleston Area Medical Center U54GM104942: West Virginia Clinical and Translational Science Institute (WVCTSI) • Children’s Hospital Colorado — UL1TR002535: Colorado Clinical and Translational Sciences Institute • Columbia University Irving Medical Center — UL1TR001873: Irving Institute for Clinical and Translational Research • Duke University — UL1TR002553: Duke Clinical and Translational Science Institute • George Washington Children’s Research Institute — UL1TR001876: Clinical and Translational Science Institute at Children’s National (CTSA-CN) • George Washington University — UL1TR001876: Clinical and Translational Science Institute at Children’s National (CTSA-CN) • Indiana University School of Medicine — UL1TR002529: Indiana Clinical and Translational Science Institute • Johns Hopkins University — UL1TR003098: Johns Hopkins Institute for Clinical and Translational Research • Loyola Medicine — Loyola University Medical Center • Loyola University Medical Center — UL1TR002389: The Institute for Translational Medicine (ITM) • Maine Medical Center — U54GM115516: Northern New England Clinical & Translational Research (NNE-CTR) Network • Massachusetts General Brigham — UL1TR002541: Harvard Catalyst • Mayo Clinic Rochester • UL1TR002377: Mayo Clinic Center for Clinical and Translational Science (CCaTS) • Medical University of South Carolina — UL1TR001450: South Carolina Clinical & Translational Research Institute (SCTR) • Montefiore Medical Center — UL1TR002556: Institute for Clinical and Translational 
Research at Einstein and Montefiore • Nemours — U54GM104941: Delaware CTR ACCEL Program • NorthShore University HealthSystem — UL1TR002389: The Institute for Translational Medicine (ITM) • Northwestern University at Chicago — UL1TR001422: Northwestern University Clinical and Translational Science Institute (NUCATS) • OCHIN — INV-018455: Bill and Melinda Gates Foundation grant to Sage Bionetworks • Oregon Health & Science University — UL1TR002369: Oregon Clinical and Translational Research Institute • Penn State Health Milton S. Hershey Medical Center — UL1TR002014: Penn State Clinical and Translational Science Institute • Rush University Medical Center — UL1TR002389: The Institute for Translational Medicine (ITM) • Rutgers, The State University of New Jersey — UL1TR003017: New Jersey Alliance for Clinical and Translational Science • Stony Brook University — U24TR002306 • The Ohio State University — UL1TR002733: Center for Clinical and Translational Science • The State University of New York at Buffalo — UL1TR001412: Clinical and Translational Science Institute • The University of Chicago — UL1TR002389: The Institute for Translational Medicine (ITM) • The University of Iowa — UL1TR002537: Institute for Clinical and Translational Science • The University of Miami Leonard M. 
Miller School of Medicine — UL1TR002736: University of Miami Clinical and Translational Science Institute • The University of Michigan at Ann Arbor — UL1TR002240: Michigan Institute for Clinical and Health Research • The University of Texas Health Science Center at Houston — UL1TR003167: Center for Clinical and Translational Sciences (CCTS) • The University of Texas Medical Branch at Galveston — UL1TR001439: The Institute for Translational Sciences • The University of Utah — UL1TR002538: Uhealth Center for Clinical and Translational Science • Tufts Medical Center — UL1TR002544: Tufts Clinical and Translational Science Institute • Tulane University — UL1TR003096: Center for Clinical and Translational Science • University Medical Center New Orleans — U54GM104940: Louisiana Clinical and Translational Science (LA CaTS) Center • University of Alabama at Birmingham — UL1TR003096: Center for Clinical and Translational Science • University of Arkansas for Medical Sciences — UL1TR003107: UAMS Translational Research Institute • University of Cincinnati — UL1TR001425: Center for Clinical and Translational Science and Training • University of Colorado Denver, Anschutz Medical Campus — UL1TR002535: Colorado Clinical and Translational Sciences Institute • University of Illinois at Chicago — UL1TR002003: UIC Center for Clinical and Translational Science • University of Kansas Medical Center — UL1TR002366: Frontiers: University of Kansas Clinical and Translational Science Institute • University of Kentucky — UL1TR001998: UK Center for Clinical and Translational Science • University of Massachusetts Medical School Worcester — UL1TR001453: The UMass Center for Clinical and Translational Science (UMCCTS) • University of Minnesota — UL1TR002494: Clinical and Translational Science Institute • University of Mississippi Medical Center — U54GM115428: Mississippi Center for Clinical and Translational Research (CCTR) • University of Nebraska Medical Center — U54GM115458: Great Plains 
IDeA-Clinical & Translational Research • University of North Carolina at Chapel Hill — UL1TR002489: North Carolina Translational and Clinical Science Institute • University of Oklahoma Health Sciences Center — U54GM104938: Oklahoma Clinical and Translational Science Institute (OCTSI) • University of Rochester — UL1TR002001: UR Clinical & Translational Science Institute • University of Southern California — UL1TR001855: The Southern California Clinical and Translational Science Institute (SC CTSI) • University of Vermont — U54GM115516: Northern New England Clinical & Translational Research (NNE-CTR) Network • University of Virginia — UL1TR003015: iTHRIV Integrated Translational health Research Institute of Virginia • University of Washington — UL1TR002319: Institute of Translational Health Sciences • University of Wisconsin-Madison — UL1TR002373: UW Institute for Clinical and Translational Research • Vanderbilt University Medical Center — UL1TR002243: Vanderbilt Institute for Clinical and Translational Research • Virginia Commonwealth University — UL1TR002649: C. Kenneth and Dianne Wright Center for Clinical and Translational Research • Wake Forest University Health Sciences — UL1TR001420: Wake Forest Clinical and Translational Science Institute • Washington University in St. 
Louis — UL1TR002345: Institute of Clinical and Translational Sciences • Weill Medical College of Cornell University — UL1TR002384: Weill Cornell Medicine Clinical and Translational Science Center • West Virginia University — U54GM104942: West Virginia Clinical and Translational Science Institute (WVCTSI)

Submitted: Icahn School of Medicine at Mount Sinai — UL1TR001433: ConduITS Institute for Translational Sciences • The University of Texas Health Science Center at Tyler — UL1TR003167: Center for Clinical and Translational Sciences (CCTS) • University of California, Davis — UL1TR001860: UCDavis Health Clinical and Translational Science Center • University of California, Irvine — UL1TR001414: The UC Irvine Institute for Clinical and Translational Science (ICTS) • University of California, Los Angeles — UL1TR001881: UCLA Clinical Translational Science Institute • University of California, San Diego — UL1TR001442: Altman Clinical and Translational Research Institute • University of California, San Francisco — UL1TR001872: UCSF Clinical and Translational Science Institute

Pending: Arkansas Children’s Hospital — UL1TR003107: UAMS Translational Research Institute • Baylor College of Medicine — None (Voluntary) • Children’s Hospital of Philadelphia — UL1TR001878: Institute for Translational Medicine and Therapeutics • Cincinnati Children’s Hospital Medical Center — UL1TR001425: Center for Clinical and Translational Science and Training • Emory University — UL1TR002378: Georgia Clinical and Translational Science Alliance • HonorHealth — None (Voluntary) • Loyola University Chicago — UL1TR002389: The Institute for Translational Medicine (ITM) • Medical College of Wisconsin — UL1TR001436: Clinical and Translational Science Institute of Southeast Wisconsin • MedStar Health Research Institute — UL1TR001409: The Georgetown-Howard Universities Center for Clinical and Translational Science (GHUCCTS) • MetroHealth — None (Voluntary) • Montana State University — U54GM115371: American
Indian/Alaska Native CTR • NYU Langone Medical Center — UL1TR001445: Langone Health’s Clinical and Translational Science Institute • Ochsner Medical Center — U54GM104940: Louisiana Clinical and Translational Science (LA CaTS) Center • Regenstrief Institute — UL1TR002529: Indiana Clinical and Translational Science Institute • Sanford Research — None (Voluntary) • Stanford University — UL1TR003142: Spectrum: The Stanford Center for Clinical and Translational Research and Education • The Rockefeller University — UL1TR001866: Center for Clinical and Translational Science • The Scripps Research Institute — UL1TR002550: Scripps Research Translational Institute • University of Florida — UL1TR001427: UF Clinical and Translational Science Institute • University of New Mexico Health Sciences Center — UL1TR001449: University of New Mexico Clinical and Translational Science Center • University of Texas Health Science Center at San Antonio — UL1TR002645: Institute for Integration of Medicine and Science • Yale New Haven Hospital — UL1TR001863: Yale Center for Clinical Investigation

Authors statement

Authorship was determined using ICMJE recommendations.

ZB: Generated list of included covariates, drafted writeup, managed competition timeline, attended weekly office hours, and coordinated analysis.

YJ and SS: Screened covariates for inclusion, processed datasets, and developed analysis tools.

HL and JC: Developed analysis workflow for the Enclave, implemented analysis, tuned learners, and designed variable importance framework.

AM, RVP, JC, ML, AH, RCP, and RP: Provided oversight on analysis workflow, gave feedback on drafts and proposed plans, and supported subject matter interpretations.

Supplemental Materials

Supplemental Table 1. Metadata

Appendix 1. Competition Writeup

ACKNOWLEDGMENTS

Footnotes

*Members are listed at the end of the manuscript

REFERENCES

1. Iuliano, A. D. et al. Trends in Disease Severity and Health Care Utilization During the Early Omicron Variant Period Compared with Previous SARS-CoV-2 High Transmission Periods - United States, December 2020-January 2022. MMWR Morb Mortal Wkly Rep 71, 146–152 (2022).
2. Al-Aly, Z., Xie, Y. & Bowe, B. High-dimensional characterization of post-acute sequelae of COVID-19. Nature 594, 259–264 (2021).
3. Pfaff, E. R. et al. Identifying who has long COVID in the USA: a machine learning approach using N3C data. Lancet Digit Health 4, e532–e541 (2022).
4. National Institutes of Health. About the National COVID Cohort Collaborative. National Center for Advancing Translational Sciences https://ncats.nih.gov/n3c/about (2023).
5. van der Laan, M. J., Polley, E. C. & Hubbard, A. E. Super learner. Stat Appl Genet Mol Biol 6, Article25 (2007).
6. Phillips, R. V., van der Laan, M. J., Lee, H. & Gruber, S. Practical considerations for specifying a super learner. International Journal of Epidemiology dyad023 (2023) doi:10.1093/ije/dyad023.
7. LeDell, E., van der Laan, M. J. & Petersen, M. AUC-Maximizing Ensembles through Metalearning. 12, 203–218 (2016).
8. Gruber, S., Lee, H., Phillips, R., Ho, M. & van der Laan, M. Developing a Targeted Learning-Based Statistical Analysis Plan. Statistics in Biopharmaceutical Research 1–8 (2022) doi:10.1080/19466315.2022.2116104.
9. Peluso, M., Abdel-Mohsen, M., Walt, D. & McComsey, G. Understanding the Biomarkers of PASC. (2022).
10. Williamson, B. D. & Feng, J. Efficient nonparametric statistical inference on population feature importance using Shapley values. Proc Mach Learn Res 119, 10282–10291 (2020).
11. Daines, L., Zheng, B., Pfeffer, P., Hurst, J. R. & Sheikh, A. A clinical review of long-COVID with a focus on the respiratory system. Curr Opin Pulm Med 28, 174–179 (2022).
12. van der Laan, M. J. & Rose, S. Targeted learning: causal inference for observational and experimental data. (Springer, 2011).
13. van der Laan, M. et al. Targeted Learning in R: Causal Data Science with the tlverse Software Ecosystem. (2023).
14. van der Laan, M. J. & Rose, S. Targeted learning in data science: causal inference for complex longitudinal studies. (Springer Berlin Heidelberg, 2017).
15. Coyle, J. R. et al. Targeted Learning. in Wiley StatsRef: Statistics Reference Online 1–20 (2023). doi:10.1002/9781118445112.stat08414.
16. Notarte, K. I. et al. Impact of COVID-19 vaccination on the risk of developing long-COVID and on existing long-COVID symptoms: A systematic review. eClinicalMedicine 53 (2022).
17. Brannock, M. D. et al. Long COVID Risk and Pre-COVID Vaccination: An EHR-Based Cohort Study from the RECOVER Program. medRxiv 2022.10.06.22280795 Preprint at https://doi.org/10.1101/2022.10.06.22280795 (2022).
18. Azzolini, E. et al. Association Between BNT162b2 Vaccination and Long COVID After Infections Not Requiring Hospitalization in Health Care Workers. JAMA 328, 676–678 (2022).
19. Raveendran, A. V., Jayadevan, R. & Sashidharan, S. Long COVID: an overview. Diabetes & Metabolic Syndrome: Clinical Research & Reviews 15, 869–875 (2021).

Posted August 04, 2023.

Subject Area

Epidemiology
