Abstract
Rationale Multiple pulmonary, sleep, and other disorders are associated with the severity of Covid-19 infections but may or may not directly affect the etiology of acute Covid-19 infection. Identifying the relative importance of concurrent risk factors may prioritize respiratory disease outbreaks research.
Objectives To identify associations of common preexisting pulmonary and sleep disease on acute Covid-19 infection severity, investigate the relative contributions of each disease and selected risk factors, identify sex-specific effects, and examine whether additional electronic health record (EHR) information would affect these associations.
Methods 45 pulmonary and 6 sleep diseases were examined in 37,020 patients with Covid-19. We analyzed three outcomes: death; a composite measure of mechanical ventilation and/or ICU admission; and inpatient admission. The relative contribution of pre-infection covariates including other diseases, laboratory tests, clinical procedures, and clinical note terms was calculated using LASSO. Each pulmonary/sleep disease model was then further adjusted for covariates.
Measurements and main results 37 pulmonary/sleep diseases were associated with at least one outcome at Bonferroni significance, 6 of which had increased relative risk in LASSO analyses. Multiple prospectively collected non-pulmonary/sleep diseases, EHR terms and laboratory results attenuated the associations between preexisting disease and Covid-19 infection severity. Adjustment for counts of prior “blood urea nitrogen” phrases in clinical notes attenuated the odds ratio point estimates of 12 pulmonary disease associations with death in women by ≥1.
Conclusions Pulmonary diseases are commonly associated with Covid-19 infection severity. Associations are partially attenuated by prospectively-collected EHR data, which may aid in risk stratification and physiological studies.
Introduction
Covid-19 continues to be an important public health problem. Prior to the Omicron wave, Covid-19 was one of the leading causes of death in the US in every age group above age 15 [1]. Thousands continue to die despite the availability of vaccines, emphasizing a continued need for research to help mitigate the effects of this and future infectious disease outbreaks.
The range of effects arising from Covid-19 infection, from minimal symptoms to death is striking. Less is known about the relative importance of individual pulmonary and sleep diseases contributing to infection severity in patients with multimorbidity. Examining the association of a range of pulmonary and sleep diseases with Covid-19 infection severity and applying machine learning methods to identify a subset of independent associations that predict infection severity may aid in identifying etiological mechanisms of morbidity and mortality. These associations may differ by sex, given sex differences in both Covid-19 prevalence and associated risk factors.
Associations between preexisting pulmonary/sleep disease and Covid-19 infection severity may be changing with time. The Omicron variant has reduced pulmonary pathology compared to other variants [2,3]. Patients carrying the Omicron variant have reduced risk of hospital admissions and severe infection [4]. The degree of association between individual diseases and Covid-19 severity may therefore change across different phases of the epidemic. Examining these changing environmental interactions may provide insight into the pathophysiology of both Covid-19 and the tested diseases themselves.
Prospective collection of data in clinical biobanks are one means of addressing these and other questions. EHR databases can leverage extensive clinical note, diagnosis, procedure, and laboratory data collected prior to infection. Natural language processing (NLP) [5] of controlled medical vocabulary concepts [6] from date-stamped clinical notes can improve disease phenotyping [7]. These clinical note concepts, together with procedure and laboratory information, can also be used to identify risk factors that complement or potentially mediate or confound the association of existing diseases with Covid-19 infection severity.
In this study, we report associations between common preexisting pulmonary and sleep diseases and acute Covid-19 morbidity and mortality among patients with PCR-based diagnoses of Covid-19 with data in the Mass General Brigham (MGB) EHR database from the greater Boston, MA region. We investigate the relative importance of individual preexisting diseases and ancillary EHR concepts, identify sex- and date-specific associations, and report the effect of lead EHR concepts on the associations of preexisting diseases and Covid-19 infection severity.
Methods
Additional details are provided in the Supplemental Methods.
Study sample
All patients with PCR-diagnosed Covid-19 were enrolled in the MGB Research Patient Data Registry EHR database [8,9]. Study collection ended at 6/1/2022. We used a stringent combined data floor and loyalty cohort [10] approach to minimize the number of patients with reduced documentation (and increased risk of false negative associations) [11] by requiring at least three clinical notes, encounters, and diagnoses. We also required at least 7 clinical facts: one diagnosis, encounter, laboratory result, medication, outpatient visit, routine visit [10], and vital sign obtained before 2020. Children, MGB employees, and patients living outside of Massachusetts or without available race/ethnicity, age, biological sex, or BMI values were excluded.
Data used
The acute phase of infection for measuring ICU admissions, mechanical ventilation, and inpatient admissions was defined as -7 to +30 days relative to the first positive PCR test (we did not analyze Covid-19 reinfections) in the EHR. Diseases, procedures, and NLP terms recorded at least 8 days prior to the PCR test were extracted. Laboratory measurements were extracted from -5 to -1 years prior to the PCR test. A breakthrough infection was defined as having occurred over 30 days after a second vaccination. Majority variant date ranges were based on Broad Institute Covid-19 Genome Surveillance project sequencing in the greater Boston area. Transition dates from the pre-Delta to the Delta and Omicron variants were set at 7/3/2021 and 12/17/2021 respectively.
Diseases were based on PheCode groupings of ICD codes [12,13] (Table S1) and refined using natural language processing. We evaluated 45 common (≥1% prevalence in patients with Covid-19) and selected rare pulmonary disorders and 6 common sleep disorders [14–16]. Clinical note concept terms were extracted using NLP and mapped to Concept Unique Identifiers (CUIs) codes [5,6]. Terms that matched disease names were used to improve disease phenotyping using multimodal automated phenotyping (MAP) [7]. We included all non-pulmonary and non-sleep diseases defined using PheCodes and MAP with >1% prevalence among patients with Covid-19 in LASSO regression analyses (described below). Given the sparse nature of certain individual procedures, we also grouped procedures as specified by AHRQ Clinical Classification Software (CCS, https://www.hcup-us.ahrq.gov/toolssoftware/ccs10/ccs10.jsp). Procedure and CUI IDs for variables retained after LASSO regressions are listed in Table S6.
We documented evidence of continuous positive airway pressure (CPAP) use among patients with at least one clinical encounter and one clinical note from -365 to -8 days prior to the first positive PCR test. Total counts of five related procedures and one NLP term (Table S5) were combined. CPAP evidence in the prior year was present for 2,946 / 4,027 (73.1%) of the eligible patients with sleep apnea.
In an exploratory aim, we examined whether selected hematology laboratory results were associated with Covid-19 severity. We included kidney function measures given the retention of kidney disease in the LASSO results. The median values of laboratory results for individual patients from -5 to -1 years were tested with and without rank normalization. The 23 laboratory measurements were available for 65 – 89% of the patients (Table S8).
Statistical analyses
Logistic regression included 45 pulmonary disorders, 6 sleep disorders, was adjusted for age, sex at birth, WHO obesity classification, race/ethnicity, and breakthrough infection status to predict 3 outcomes: death, mechanical ventilation or ICU admission, or inpatient admissions. The effect of CPAP evidence was tested using logistic regression with further adjustment for a single CPAP evidence term.
Adaptive LASSO regression used training and testing sets, each with cross-validation. Non-zero coefficient terms were carried forward to the training run and then to the final combined sample run. Age, sex, BMI, race/ethnicity, breakthrough infection status, healthcare encounter counts, and days into the pandemic were forced into runs. Laboratory results were not included due to reduced sample size and non-random ascertainment. We opted to include sex-specific terms in combined sex analyses and evaluate them in separate sex-stratified analyses. Finally, the potential effect of lead LASSO variables on disease associations with Covid-19 infection severity was assessed using logistic regression with standard covariates that included and excluded individual LASSO variables as covariates.
Results
Sample characteristics
Sample characteristics are listed in Table 1. The initial sample of patients with Covid-19 (through 6/1/2022) and suitable data to clear a data floor was 37,020. Women represented a majority of infections (62%) but a minority of patients with more severe death and mechanical ventilation, or ICU admission outcomes (48%). The majority of breakthrough infections were in women (62%). At the time of the data freeze (6/1/2022), the number of patients with an incident diagnosis during the pre-Delta majority variant virus dates was slightly less than the combined number of patients with a diagnosis during the Delta or Omicron majority variant diagnosis dates. The median BMI has decreased for Omicron majority variant diagnosis dates, as has the percentage of Hispanic/Latino Americans among patients with an incident Covid-19 infection. The prevalence of 45 pulmonary and 6 sleep diseases grouped by PheCodes (Table S1) may be reduced due to the strict case criteria used by NLP-defined phenotyping. The median number of preexisting pulmonary and sleep diseases diagnosed for patients with an adverse Covid-19 outcome was higher than the median number of preexisting diseases for all patients with Covid-19.
Associations between existing disease with Covid-19 infection severity
The associations between existing diseases and Covid-19 infection severity are described in Table 2. Models included adjustment for breakthrough infection, age, sex at birth, obesity severity, and self-reported race/ethnicity. 42 of 51 diseases were nominally associated with any outcome (p < 0.05). 37 diseases were associated with an outcome at Bonferroni significance (p < 3.3 × 10−4): 25, 23, and 37 diseases were associated with risk of death, mechanical ventilation or ICU admission, and inpatient admission, respectively. Chronic pharyngitis and nasopharyngitis and septal deviations / turbinate hypertrophy were inversely associated with risk of mechanical ventilation or ICU admission at modest significance.
We analyzed sex-specific associations between existing diseases and Covid-19 infection severity. Death outcome results are shown in Table 3, with full results provided in Table S2. 27 and 33 diseases were associated with at least one infection severity outcome in men and women respectively at Bonferroni significance (p < 1.1 × 10−4). While multiple diseases displayed notably different odds ratio point estimates, sex × disease interaction p-values were generally modest in this cohort with stringent loyalty and data floor criteria. Associations between existing diseases and Covid-19 infection severity in different age groups are provided in Table S3.
We examined associations between existing disease and Covid-19 incident infection severity during days when the pre-Delta, Delta, and Omicron virus variants comprised the majority of samples sequenced from the greater Boston area. Table 4 lists the composite outcome results, with full results and sex-stratified results listed in Table S4. The pre-Delta, Delta, and Omicron majority variant date analyses were all adjusted for breakthrough infection status. 31 and 26 diseases were associated with any outcome during pre-Delta and Omicron variant majority dates respectively (Bonferroni p < 1.1 × 10−4). Seven diseases significantly associated with a death outcome during pre-Delta variant majority dates (p < 1.1 × 10−4) were notably attenuated (p > 0.01) during Omicron variant majority dates (bacterial infection NOS, empyema and pneumothorax, epistaxis or throat hemorrhage, pneumonia, pulmonary hypertension, sleep apnea, and symptoms involving respiratory system and other chest symptoms). Conversely, a modest association between solitary pulmonary nodule and the death outcome during pre-Delta variant majority dates strengthened to Bonferroni significance during Omicron variant majority dates.
Effects of continuous airway positive pressure
We examined whether adjusting for evidence of continuous positive airway pressure (CPAP) usage in clinical notes and procedures in the year prior to infection would attenuate the associations with sleep apnea, given the trends we identified in our previous report with a smaller sample size [14]. All associations with sleep apnea were attenuated (p > 0.02, Table S5).
Relative importance of individual existing diseases and EHR information
Given the large number of associated diseases and elevated multimorbidity (18.8% of the patients carried ≥5 of the tested diseases), we analyzed the relative importance of individual existing diseases affecting the risk of Covid-19 infection severity. We performed adaptive LASSO in the combined sex and sex-stratified datasets by including all pulmonary and sleep diseases and multiple covariates. Non-zero coefficients from the outcome results are shown in Tables 5 and S6, with larger absolute values indicating greater importance for individual variables. Chronic airway obstruction, pneumonia, and respiratory failure had non-zero coefficients for the death outcome.
We next leveraged additional EHR information by performing similar adaptive LASSO analyses to identify factors with greater importance using separate analyses of 505 non-pulmonary/sleep diseases, 1,647 procedures, 171 CCS grouped procedures, and 9,839 clinical note NLP terms. Results for non-zero coefficient terms are listed in Table S6. 124 EHR variables were identified in the combined sex analyses. Six non-pulmonary/sleep diseases and ICD-coded procedures had non-zero coefficient terms in the combined-sex death outcome analysis: chronic renal failure, chemotherapy, altered mental status, essential hypertension, thrombocytopenia, and abnormality of gait. Associations between LASSO terms and Covid-19 infection severity adjusted for covariates are presented in Table S7. Multiple non-pulmonary diseases were associated with the death outcome, including chronic renal failure (OR 3.61, 95% CI 3.04 – 4.28) and thrombocytopenia (OR 3.19, 95% CI 2.45 – 4.17).
Predictive laboratory values
Complete blood count and other hematology measurements are commonly collected in many patients undergoing hospital procedures. As an exploratory aim, we examined whether the median values of these measurements collected 1 – 5 years prior to the incident Covid-19 diagnosis were associated with infection severity. We also included measures of kidney function, given the identification of kidney disease in the LASSO results. 17 of the 23 laboratory tests were associated with at least one outcome in the available sample at Bonferroni significance (65 – 89% coverage of patients), using logistic regression with adjustment for covariates (Table S8). Estimated glomerular filtration rate, hematocrit, hemoglobin, and red cell distribution width results were the most notable associated measures with all p-values ≤ 2.0 × 10−16, despite the identification of thrombocytopenia and elevated white blood cell counts PheCodes in the LASSO results.
Associated EHR concepts attenuate disease associations
Finally, we examined whether individual retained LASSO terms and prior hematology and kidney laboratory results would attenuate the associations between individual diseases and infection severity when included individually as model covariate terms. The lead terms with the highest median point estimate adjustment differences across all disease associations with the death outcome are shown in Figure 1. Lead terms spanned a range of healthcare procedures, laboratory results, and pulmonary and non-pulmonary diseases. Full results are provided in Table S9, and additional visualizations are provided in Figures S1 – S17. Adjustment for “blood urea nitrogen” clinical note phases attenuated the odds ratio point estimates of 12 pulmonary disease associations with death in women by ≥ 1.0.
Discussion
In this study, we examined the association of preexisting pulmonary, sleep, and other diseases with the risk of acute Covid-19 infection severity and investigated other factors that may alter those associations. Notable results include (a) common associations between dozens of pulmonary disorders and infection severity, some of which are protective; (b) changes in disease associations across date periods with different SARS-CoV-2 majority variants, with attenuated associations with death for 7 diseases, and an increased point estimate for 1 disease; (c) an increased importance of specific pulmonary disease associations with Covid-19 severity relative to other comorbid pulmonary and sleep diseases; (d) associations between infection severity and multiple hematology and kidney-related laboratory results collected 1 – 5 years prior to infection; and (e) non-pulmonary/sleep diseases, laboratory results, hospital procedures, and clinical note terms that attenuate and sometimes increase the associations between pulmonary and sleep diseases with Covid-19 infection severity. These results may aid in further understanding the etiology of a changing disease.
We observed common associations between preexisting pulmonary and sleep diseases with Covid-19 infection severity, including 37 of the 51 tested diseases associated with at least one outcome at Bonferroni significance (Tables 2 and S2). Certain diseases (e.g. asthma) can confer different risks for men and women, although sex × disease interaction p-values tended to be modest (Tables 3 and S2). Acute sinusitis has a protective association with the death measure in men (0.33 OR, 95% CI 0.14 – 0.77, p = 9.8 × 10−3) that is not present in women (0.94 OR, 95% CI 0.61 – 1.43, p = 0.76). A similar but more modest association was also observed between the death outcome and acute upper respiratory infections. These findings may help improve the personalization of treatment for patients with increased Covid-19 morbidity.
The associations between preexisting disease and infection severity are changing across time for certain diseases (Tables 4 and S4). Notable diseases with attenuated associations with death during Omicron-majority variant periods compared to pre-Delta majority variant periods include bacterial infection NOS (3.00 OR [95% CI 2.16 - 4.16] pre-Delta period; 1.63 [0.90 - 2.95] Omicron period), sleep apnea (1.83 [1.44 - 2.33] pre- Delta period; 1.35 [0.89 - 2.04] Omicron period), and influenza (1.72 [1.29 - 2.30] pre- Delta period; 0.90 [0.53 - 1.55] Omicron period). These changes may reflect differences in SARS-CoV-2 pathogenicity in upper and lower respiratory tracts [2,3]. The differences in outcomes across majority variant strain periods are consistent with a recent report examining the same dataset using independent methods [17]. Differences in risk factors and outcomes have potential study design implications for acute and possibly Long Covid research. The changing nature of SARS-CoV-2 associations with a range of comorbidities could potentially provide a window for future physiological research examining how individual pulmonary and sleep diseases increase risk for outcomes in other diseases.
Our study sample has a high degree of multimorbidity, and 39 pre-existing diseases were associated with at least one infection severity outcome at Bonferroni significance. We therefore performed LASSO regression analyses to identify the relative importance of specific diseases (Tables 5 and S6). LASSO analyses minimize the number of factors that collectively may predict an outcome. Six pulmonary diseases were retained: chronic airway obstruction, respiratory failure, pneumonia, emphysema, pulmonary congestion and hypostasis, and shortness of breath (Table 5). These diseases may be enriched with pathophysiological mechanisms that contribute increased risk across the broader range of pulmonary and sleep disorders associated with acute Covid-19 infection severity.
Preexisting non-pulmonary diseases, hospital procedures and clinical note terms identified by LASSO regression and laboratory results may have predictive value for Covid-19 infection severity, given the common associations between preexisting pulmonary disease and Covid-19 infection severity (Table S6). These variables may potentially mediate or confound associations between preexisting pulmonary and sleep diseases and infection severity (Table S9 and Figures 1, S1 – S17) or could reflect the effects of other factors unmeasured in this study. Five non-pulmonary PheCode-encoded diseases (and chemotherapy) were associated with death: chronic renal failure, altered mental status, essential hypertension, thrombocytopenia, and abnormality of gait. We chose to analyze procedures and all NLP term with common frequencies in order to identify factors that could help predict outcome severity, even if they are unlikely to be directly related to pathophysiology. Some lead terms that attenuate the associations of multiple diseases are related to healthcare utilization (e.g. hospital discharge procedures and “normal diet” text phrases) that are consistent with multimorbidity increasing the risk of adverse Covid-19 infection outcomes. Adjustment for electrocardiogram procedures and red cell distribution width laboratory results also attenuated the median point estimate of pulmonary and sleep disease associations with the death outcome more than adjustment for any pulmonary disease. The most notable variable to broadly affect associations with death was “blood urea nitrogen” in analyses of women (Figures S12 and S13), which attenuated the point estimates of 12 diseases by >1 and the statistical significance of 13 diseases (p > 0.05), although a smaller effect was observed when adjusting for actual median urea nitrogen blood levels collected from -5 to -1 years prior to infection with Covid-19. Adjustment for chronic renal failure and creatinine and estimated glomerular filtration rate laboratory results also support a role for kidney malfunction contributing to pulmonary and sleep disease associations with deaths in women.
Seventeen of 23 prospectively collected hematology and kidney function laboratory tests were associated with Covid-19 outcome severity at Bonferroni significance (Table S8). Estimated glomerular filtration rate and three red blood cell tests were the most significantly associated tests across all models. The laboratory value ranges may reflect physiological changes due to both known and unknown risk factors and may be useful as a general risk stratification measure. These associations were based on a subset of patients with available data, and the associations could possibly also reflect differences in test ordering (e.g. kidney test results may be disproportionally missing for patients whose clinicians do not suspect kidney malfunction). Additional follow-up in a broader range of patients is needed to understand the generalization of these results. The prediction models could be readily improved beyond these initial exploratory results.
The strengths of this study include an extensive sample size of patients diagnosed using PCR testing, which has allowed us to perform sex-, age-, and majority SARS-CoV-2 strain date-specific analyses with increased study power. Data were restricted by stringent combined “data floor” and loyalty cohort criteria. We included patients with minimal clinical encounters, notes, diagnoses and evidence of seven clinical facts each collected before the pandemic in order to increase the reliability of our findings in an open environment where some patient records may be located in other healthcare systems. All disease and healthcare utilization information was collected prior to Covid-19 infection to eliminate potential reverse causation effects. All disease phenotyping included natural language processing to increase the positive predictive value in the face of rule-out diagnoses and other potential biases. We extensively studied the potential effect of dozens of pulmonary disease on Covid-19 infection severity, which enabled us to identify the relative importance of pulmonary and disease on infection severity.
Weaknesses are also possible with this study design. Despite inclusion of over 37,000 patients, this was a single healthcare system study. While the Mass General Brigham system is comprised of multiple hospitals, our results may or may not completely generalize to other healthcare systems. Loss to follow-up and missing data from out-of-network care may affect our results and those of similarly designed studies. Temporal effects may be confounded by changes in PCR testing availability and use and standards of care (e.g. proning). Patients with different financial means may have different levels of healthcare utilization, which may influence the recording of diseases, procedures, and NLP phrases differently. The diseases, procedures, and note terms identified by LASSO regression may not completely generalize, despite cross-validation and separate training and testing sets. Individual terms with lower LASSO coefficients, terms that were not individually associated with outcomes (Table S7), or terms that do not affect multiple disease associations may have increased likelihood of site-specific associations or may indirectly measure factors not included in our study design. While the majority of the patients had undergone all of the laboratory testing (Table S8), there is likely some selection bias present for patients with clinically relevant diagnoses, although some of our results are supported by orthogonal diagnosis and NLP term data (Table S7).
Conclusions
We identified common associations between preexisting pulmonary disease and Covid-19 infection severity and identified specific diseases with increased relative importance in our patient sample with multimorbidity. Certain associations are majority-variant date specific. We also identified several non-pulmonary diseases, clinical procedures, and clinical note phrases that potentially mediate or confound these associations. These results may aid in further understanding the etiology of a changing disease.
Data Availability
All data produced in the present work are contained in the manuscript
Acknowledgments
The authors wish to thank the MGB Research Patient Data Registry (RPDR) and patients for providing health information data.
Footnotes
Funding: Brian Cade is supported by grants from the American Thoracic Society Foundation and the National Institutes of Health (R03-HL154284, R01-HL153805). Susan Redline is supported by grants from the National Institutes of Health (R35-HL135818). Elizabeth Karlson is supported by grants from the National Institutes of Health (U01-HG008685 and P30 AR070253).
Competing Interest Statement: The authors declare no competing interest.