Abstract
The post-acute sequelae of SARS-CoV-2 infection (PASC) refers to a broad spectrum of symptoms and signs that are persistent, exacerbated, or newly incident in the post-acute SARS-CoV-2 infection period of COVID-19 patients. Most studies have examined these conditions individually without providing concluding evidence on co-occurring conditions. To answer this question, this study leveraged electronic health records (EHRs) from two large clinical research networks from the national Patient-Centered Clinical Research Network (PCORnet) and investigated patients’ newly incident diagnoses that appeared within 30 to 180 days after a documented SARS-CoV-2 infection. Through machine learning, we identified four reproducible subphenotypes of PASC dominated by blood and circulatory system, respiratory, musculoskeletal and nervous system, and digestive system problems, respectively. We also demonstrated that these subphenotypes were associated with distinct patterns of patient demographics, underlying conditions present prior to SARS-CoV-2 infection, acute infection phase severity, and use of new medications in the post-acute period. Our study provides novel insights into the heterogeneity of PASC and can inform stratified decision-making in the treatment of COVID-19 patients with PASC conditions.
Introduction
A variety of symptoms and signs involving multiple organ systems (e.g., cardiovascular1, mental2, metabolic3, and renal4) were persistent, exacerbated or newly developed following the acute phase of SARS-CoV-2 infection. Currently, our understanding of these conditions, typically regarded as post-acute sequelae of SARS-CoV-2 infection (PASC)5,6, remains limited. Most existing studies have investigated these PASC conditions individually (e.g., by examining the incidence7 or excess burden8 of each symptom or condition in the post-acute period for COVID-19 patients relative to controls). It is unclear if any of these PASC symptoms and conditions tend to co-appear together or are more likely to develop in certain patient populations.
To answer the above question, we developed a machine learning approach to derive subphenotypes of SARS-CoV-2 infected patients based on the newly incident conditions in the post-acute period (defined as 30 to 180 days after their confirmed SARS-CoV-2 infection) using the data from electronic health records (EHR). We focused on new incidences in this study because it provided a clean way of defining PASC phenotypes without complicated consideration of pre-existing conditions. We leveraged the EHR repositories from two large-scale clinical research networks (CRNs) from the national Patient-Centered Clinical Research Network (PCORnet): the INSIGHT network9, which includes 12 million patients in the New York City (NYC) area, and the OneFlorida+ network10, which includes 19 million patients from Florida, Georgia, and Alabama. We examined the incidence of 137 diagnosis categories defined from the Clinical Classifications Software Refined (CCSR) categories11. Four distinct subphenotypes were identified from the INSIGHT CRN data and validated in the OneFlorida+ CRN data. Patients in Subphenotype 1 were older with incident blood and circulatory conditions in the post-acute phase. Subphenotype 2 included patients who are younger with incident respiratory problems. Subphenotype 3 included patients who developed musculoskeletal and nervous system conditions. Lastly, patients in Subphenotype 4 are characterized by digestive systems incident conditions. Our study dissects the heterogeneity of potential PASC conditions as different subgroups, which can inform the classification and treatment of PASC. This study is part of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative12, which seeks to understand, treat, and prevent the post-acute sequelae of SARS-CoV-2 infection (PASC). For more information on RECOVER, visit https://recovercovid.org/
Results
Overall Pipeline
Our overall analytics pipeline is demonstrated in Figure 1. Using the two CRNs, we extracted patients who had positive nucleic acid amplification or antigen viral tests for SARS-CoV-2 from March 2020 to November 2021. A list of 137 potential PASC diagnosis categories (Methods) were compiled and only patients who had documented new incidences of these conditions in their post-acute infection period were retained. Each patient was initially represented as a 137-dimensional binary vector according to whether a particular condition appeared in the post-acute infection period or not (Step 1). Then a set of “PASC topics” was learned based on the co-incidence patterns of these conditions (Step 2) and the initial patient vectors were projected onto these learned topics to obtain their topic loading representations (Step 3), which was further used in a clustering procedure to identify the subphenotypes (Step 4). Details on the concepts and methods involved in our pipeline are presented in Methods. In the following text, we present our main results.
Study Cohorts
Our study included 20,881 patients from the INSIGHT CRN and 13,724 patients from the OneFlorida+ CRN who tested positive for SARS-CoV-2 on viral tests (see Methods for detailed inclusion-exclusion criteria). The patients within the INSIGHT cohort had a median age of 58.0 (interquartile range [IQR] [42.0-70.0]) and a median Area Deprivation Index (ADI)13 of 15.0 (IQR [6.0-25.0]), consisting of 12,188 (58.37%) females, 7013 (33.59%) White patients, and 4771 (22.85%) Black patients. The OneFlorida+ cohort contained patients who were younger (median age of 51.0 (IQR [35.0-65.0])), with more disadvantaged social conditions (median ADI 59.0; IQR [42.0-76.0])), and more white patients (7175; 52.28%). 33.04% of the INSIGHT CRN patients had a confirmed SARS-CoV-2 infection from March to June 2020 (compared to 8.83% of the patients from OneFlorida+). This coincided with the first wave of COVID-19 in the US when NYC was the epicenter. Patients from OneFlorida+ were more likely to test positive from July to October 2020 (26% vs. 6% of patients from INSIGHT). Table 1 summarizes the summary statistics of the patients from the two cohorts.
Potential PASC Topics
A list of 137 potentially PASC-related diagnoses groups defined by ICD-10 diagnosis codes and CCSR categories11 (Supplementary Table 1) was compiled for our study. We first investigated the co-incidence patterns across different diagnoses within 30-180 days after the SARS-CoV-2 infection confirmation for COVID-19 positive patients (Methods). We achieved this goal through probabilistic topic modeling14, originally proposed for learning word co-occurrence patterns in documents with different semantic topics. With this approach (see details in Methods), we were able to identify ten distinct “PASC topics”, each of which is characterized by a unique post-acute infection incidence probability distribution across the 137 individual conditions.
Figure 2 shows the heatmap matrix of the learned topics from the INSIGHT cohort. Each column was a learned topic, and each row was a potential PASC condition category (we demonstrated 31 of them in the heatmap and aggregated the remaining 106 because none of their incident probabilities exceeded 0.1 in any of the learned topics). Each entry in the matrix corresponded to the probability of the specific PASC condition in the corresponding topic. All entry values in the same column added up to one so that each topic was characterized by a rigorous incident probability distribution over the 137 PASC conditions. Specifically, Topics T1, T2, and T5 concentrated on the conditions of the musculoskeletal system, digestive system, and nervous system, respectively. T4, T7, and T9 included respiratory conditions mixed with sleep disorder and anxiety along with symptoms such as headache and chest pain. T3 included fluid and electrolyte disorders combined with anemia and cardiac problems. T6 was musculoskeletal and skin conditions with headache and fatigue problems. T8 was anemia and digestive system problems. T10 was a mixture of circulatory problems, renal failure, fluid and electrolyte problems, and others.
Potential PASC Subphenotypes
With the potential PASC topics identified, we could describe PASC-affected patients with them and derive potential PASC subphenotypes as patient clusters (Methods). In the INSIGHT cohort, four subphenotypes were identified. Table 2 summarizes their characteristics with respect to patient demographics, disease severity in the acute phase according to treatment setting, as well as the prevalence of comorbidities in the baseline period. Across the subphenotypes, we also demonstrated the prevalence of potential PASC conditions and incident prescriptions of medications in the post-acute infection period for patients in different subphenotypes in Figures 3 and 4. From Figure 3, we observed that four subphenotypes had different prevalent PASC conditions, which were consistent with the top PASC conditions in the topics with large proportions. Next, we characterized these subphenotypes in detail as follows.
Subphenotype 1 (Blood and Circulatory System)
consisted of 7,047 (33.75%) patients. It was dominated by blood- and circulation-related topics (T3, T8, T10), including anemia, fluid and electrolyte problems, and circulatory and cardiac problems. Compared to other subphenotypes, patients in this subphenotype were older (median age 65.0 years, IQR [52.0-75.0]) and had the highest proportion of males (48.53%). They also had a higher severity of COVID-19 in the acute phase (with the highest rate of hospitalization [61.15%], use of mechanical ventilation [4.81%], and critical care services [9.95%]). Furthermore, this subphenotype had the highest portion of patients (37.38%) infected with SARS-CoV-2 during the first wave of the pandemic (March to June 2020) in NYC. In addition, patients in this subphenotype had a higher prevalence of underlying conditions than other subphenotypes, especially for blood, circulation, and endocrine comorbidities. Correspondingly, patients in this subphenotype had a high incident prescription for medications to treat circulatory and endocrine problems and anemia.
Subphenotype 2 (Respiratory System)
included 6,838 (32.75%) patients. It was dominated by respiratory conditions (topics T4, T7, and T9), sleep disorders, anxiety, and symptoms such as headache and chest pain. This subphenotype had the youngest patients among the four (median age 51.0 years, IQR [35.0-64.0]), the highest proportion of females (62.8%), and the lowest rate of hospitalization (31.28%) for COVID-19. It also had the largest proportion of patients who tested positive for SARS-CoV-2 from November 2020 to November 2021 (64.47%). Patients in this subphenotype had higher baseline comorbidity burdens for respiratory conditions such as upper respiratory problems and chronic obstructive pulmonary disease, breathing problems, and a higher incident prescription for a diverse set of anti-asthma, anti-allergy, and anti-inflammation medications including inhaled steroids, levalbuterol, and montelukast.
Subphenotype 3 (Musculoskeletal and Nervous System)
consisted of 4,879 (23.37%) patients. It mainly contained musculoskeletal and nervous system problems (topics T1, T5, and T6) such as musculoskeletal pain, headaches, and sleep-wake problems. This subphenotype included patients with a median age of 57.0 years (IQR [42.0-69.0]), with 60.71% female. It had the highest proportion of patients with more than five outpatient visits (78.4%). Patients in this subphenotype had higher baseline comorbidity burdens of autoimmune and allergy conditions such as rheumatoid arthritis and asthma, as well as other musculoskeletal and nervous system problems including soft tissue, bone, and sleep problems. This subphenotype was also associated with a higher incident prescription risk of pain medications (ibuprofen and ketorolac) in the post-acute infection period.
Subphenotype 4 (Digestive System)
included 2,117 (10.14%) patients mainly with digestive system problems such as abdominal pain, vomiting, and respiratory conditions (topics T2, T4, T8). Patients in this subphenotype had a median age of 54.0 (IQR [39.0-67.0]) with 61.64% female. Patients in this subphenotype had the highest proportion of patients without any baseline emergency visits (57.06%) and the lowest rates of mechanical ventilation (0.8%) and critical care admission (2.79%) in the acute phase of COVID-19. Compared with the other three subphenotypes, this subphenotype had an overall lower prevalence of underlying conditions, and a slightly higher prevalence of digestive problems such as hematemesis, stomach and duodenum disorders, and digestive system neoplasm. In addition, this subphenotype had higher incident prescription rates of drugs for treating the digestive system.
Contrast with COVID-19 Negative Patients
These potential PASC subphenotypes were derived from SARS-CoV-2 infected patient. However, it was unclear how the potential PASC diagnosis co-incidence patterns encoded in them differed from the non-infected patients. To answer this question, we compared the incidence patterns of 28 selected potential PASC conditions in 30-180 days after the COVID-19 lab test between positive and matched negative patients (Methods). The results were demonstrated in Figure 5, where the nodes in each network corresponded to a particular potential PASC condition with their sizes proportional to the incidence rate in the corresponding subphenotype or matched controls, and each line linking a pair of nodes indicated co-incidence of the corresponding potential PASC diagnoses with its thickness proportional to the co-incidence rate. From the figure, we could see that the conditions we used to characterize each subphenotype were clearly associated with larger-sized nodes, representing higher incidence rates. At the same time, we do not observe differences in node sizes on the matched COVID-19 negative networks. In addition, we observed denser connections in COVID-19 positive networks, which suggested that the potential PASC conditions did not appear independently, but collectively, and those larger nodes were network hubs associated with more lines.
Replication on OneFlorida+ CRN
We repeated the same subphenotyping process on the OneFlorida+ cohort; the subphenotypes generated were highly overlapping with those from the INSIGHT cohort. Following the same analytical pipeline (Figure 1), we first identified incident diagnoses for the 137 potential PASC conditions with 30-180 days of testing positive for SARS-CoV-2; we then conducted topic modeling. Supplemental Figure 6 displays the heatmap of all learned potential PASC topics, where we see topics concentrated on problems with the musculoskeletal system (T1), digestive system (T2), nervous system (T5), and topics mixed with respiratory system problems and blood/circulatory system problems (T3), as well as headache and sleep-wake problems (T7). Some topics also include a mixture of diagnoses. For example, T9 is throat/chest pain mixed with breathing/heartbeat abnormalities; T6 is a mixture of musculoskeletal pain, headache, malaise and fatigue, and skin sensory problems; T8 and T10 are topics mixed with electrolyte/fluid disorders and anemia/arrhythmias, and T4 is a topic mixing up problems involving digestive, nervous and respiratory systems. To quantify areas of overlap with findings from the INSIGHT cohort, we evaluated the pairwise similarities between the topics learned from the two different cohorts (Methods) and visualized the results in Supplemental Figure 4, which clearly showed a one-to-one correspondence between the topics learned from the two cohorts (as identified by darker colors on the diagonal line).
With the learned topics, we also built topic loading-based representations for patients in the OneFlorida+ cohort, derived four potential PASC subphenotypes with HAC (Methods), and described the characteristics of patients classified into each subphenotype. (Supplemental Table 3, Figures 7 and 8). Subphenotype 1 was dominated by incidental blood and circulatory system problems in the post-acute infection period, which included 25.43% of the patients who were older (with a median age of 62.0 and IQR [49.0-74.0]), with the highest proportion of males (46.93%, compared to 38.29% for the overall population) and the highest rates of hospitalization (57.34%, compared to 36.69% for the overall population), mechanical ventilation (8.57%, compared to 3.39% for the overall population) and critical care admission (12.52%, compared to 6.07% for the overall population) in the acute phase of COVID-19. This subphenotype was associated with a higher prevalence of underlying conditions and more new prescriptions for medications treating circulatory system, blood, and endocrine problems. Subphenotype 2 was dominated by incidental respiratory problems and was the largest subphenotype containing 5281 (38.48%) patients with a median age of 47.0 (IQR [33.0-61.0]). They had a higher prevalence of respiratory conditions at baseline including chronic obstructive pulmonary disease, pneumonia, upper respiratory tract problems, and had a higher post-acute infection incident prescription for respiratory medications. Subphenotype 3 was dominated by incident problems with musculoskeletal and nervous problems in the post-acute infection phase. It included 3205 (23.35%) patients with a median age of 48.0 (IQR [33.0-61.0]) and had the lowest hospitalization rate in the acute phase (27.8%). This subphenotype had a higher prevalence of baseline musculoskeletal and connective tissue problems and asthma, and more new prescriptions for pain medications, including ketorolac and ibuprofen in the post-acute infection phase. Subphenotype 4 was dominated by incidental digestive problems. It was the youngest (median age 46.0 [32.0-60.0]) and smallest (including 1748 [12.74%] patients) subphenotype, with the highest proportion of females (67.11%, compared to 61.70% overall) and lowest rates of mechanical ventilation (0.97%) and critical care admission (2.8%) in the acute phase. This subphenotype was associated with a higher baseline burden of digestive system problems, and more new prescriptions for medications focused on the digestive system. These observations and characterizations were highly consistent with the subphenotypes identified from the INSIGHT cohort. In addition, the difference in the incidence patterns of 28 selected potential PASC conditions in 30-180 days after the COVID-19 lab test between positive and matched negative patients on the OneFlorida+ cohort is shown in Supplemental Figure 9, which was also highly consistent with the results from the INSIGHT cohort.
Discussion
Many studies have pointed out the existence of a diverse set of symptoms and signs that may develop, persist, or recur in the post-acute SARS-CoV-2 infection period7. These conditions, typically referred to as PASC, involve a wide range of organ systems. Different from most of the existing research which studied these conditions independently, we developed a data-driven framework shown in Figure 1 to identify subphenotypes of SARS-CoV-2 infected patients based on newly incident symptoms and signs from 30 to 180 days (the post-acute period) after their infection confirmation, such that patients within the same subphenotype share a similar distribution of potential PASC condition incidences in the post-acute periods.
Within both INSIGHT and OneFlorida+ CRNs, we identified four consistent subphenotypes dominated by new conditions of the blood and circulatory systems (Subphenotype 1), respiratory system (Subphenotype 2), musculoskeletal and nervous systems (Subphenotype 3), and digestive system (Subphenotype 4).
Comparing the subphenotypes derived from both cohorts, we observed that Subphenotype 1 included older patients with higher baseline comorbidity burden, with greater severity of illness during the acute phase (e.g., higher rates of hospitalization, critical care admission and mechanical ventilation) and with higher incident prescription rates of medications for treating diseases of many different organ systems. This subphenotype had the highest proportion of males, which aligns with the finding that males had more severe acute SARS-CoV-2 infections15. This subphenotype had a large proportion of patients with their SARS-CoV-2 infection confirmed during the early pandemic (March to September 2020) when treatment standards were still evolving. Temporally, NYC was the epicenter for the first wave. This may explain the observation that this was the largest subphenotype for INSIGHT (containing 33.75% of the patients) but the second largest subphenotype for OneFlorida+ (containing 25.43% of the patients). Early cases had greater acute phase severity, which may explain the more severe incident conditions (e.g., heart failure and renal failure) in the post-acute infection period of these patients, which could be caused by the hyperinflammation during the acute phase16.
Subphenotype 2 is another major subphenotype for both cohorts (the second largest for INSIGHT occupying 32.75% of the patients, and the largest for OneFlorida+ accounting for 38.48% of the population). It is the youngest subphenotype for INSIGHT and the second youngest subphenotype for OneFlorida+ and contains the highest proportion of patients who had an acute SARS-CoV-2 infection confirmed from July to November 2021. The young age and infection recency are consistent with the milder incident PASC conditions of this subphenotype. Of note, patients in this subphenotype had a high baseline rate of pulmonary comorbidities which is likely correlated with the high rate of incident respiratory medication prescriptions in the post-acute SARS-CoV-2 period.
Subphenotype 3 and 4 were two smaller subphenotypes with Subphenotype 3 associated with musculoskeletal and neurological conditions, whereas Subphenotype 4 was associated with gastrointestinal conditions. Patients in Subphenotype 3 also suffered from dermatologic conditions and had the highest rates of related conditions at baseline, including autoimmune diagnoses such as rheumatoid arthritis and allergy conditions. Conversely, patients in Subphenotype 4 had the mildest acute phase severity (e.g., lowest rates of mechanical ventilation and critical care admissions.
There are several strengths in our study. First, we adopted a topic modeling approach to derive compact patient representations based on the co-incidence patterns across different diagnoses. Unlike other dimensionality reduction techniques such as Principal Component Analysis (PCA)17, topic modeling is designed specifically for data samples with binary or count features18,19 and thus appropriate for our analysis. Second, INSIGHT and OneFlorida+ include patients from distinct geographic regions in the US with different characteristics, allowing us to validate the robustness of the derived subphenotypes. Third, our study period (March 2020 to November 2021) covers different COVID-19 waves associated with different SARS-CoV-2 virus variants. Our study cohorts contained robust patient populations in New York and Florida, representing the different waves of SARS-CoV-2 infected cases in US. This is important factor contributing to the different distributions of the four subphenotypes in the two cohorts.
Our study is not without limitations. First, our analysis is based on longitudinal observational patient data, which cannot explain the biological mechanisms behind PASC directly. Second, the PASC diagnoses we investigated were encoded as CCSR categories, which may not reflect the co-incidence patterns of fine-grained diagnosis conditions in the context of PASC. Third, we focused on new incidences of conditions in the post-acute infection period for COVID-19 patients and did not consider pre-existing conditions that are persistent or exacerbated due to the acute SARS-CoV-2 infection. Finally, our study period did not represent the recent wave dominated by the Omicron variants of SARS-CoV-2.
To summarize, our study dissects the complexity and heterogeneity of newly incident conditions in 30-180 days after SARS-CoV-2 infection confirmation into four reproducible subphenotypes based on the EHR repositories from two large CRNs using machine learning. These four subphenotypes included a severe one involving problems with the blood and circulatory system and associated with high baseline comorbidity burden and disease severity in its acute phase, a milder one in younger people mainly with respiratory problems, and two pain-dominated ones (musculoskeletal/nervous system pain and abdominal pain respectively). Overall, patients in each subphenotype tend to have higher rates of related conditions in the baseline period. Our study provides the first systematic study on the co-incidence patterns of conditions in the post-acute infection period of SARS-CoV-2 infected adult patients, which can inform more nuanced and tailored diagnosis and treatment plans.
Methods
EHR Data Repositories
Two large-scale de-identified real-world EHR data warehouses were utilized in our analyses. Our first cohort data was based on the EHR from INSIGHT CRN9, which contains the longitudinal clinical information of around 12 million patients in New York City area. Our second cohort data was based on the EHR from the OneFlorida+ CRN10, which contains the information of nearly 15 million patients majorly from Florida and selected cities in Georgia and Alabama.
The use of the INSIGHT data was approved by the Institutional Review Board (IRB) of Weill Cornell Medicine following protocol 21-10-95-380 with title “Adult PCORnet-PASC Response to the Proposed Revised Milestones for the PASC EHR/ORWD Teams (RECOVER)”. The use of the OneFlorida+ data for this study was approved under the University of Florida IRB number IRB202001831.
Potential PASC Conditions
We compiled a list of 137 diagnostic categories covering near 6,500 ICD-10-CM codes as potential PASC conditions for our study. The list was built based on the Clinical Classifications Software Refined (CCSR) v2022.1 covering 66,534 ICD-10-CM Diagnoses. The codes that would not plausibly be considered post-acute sequelae of COVID-19 in the adult population (e.g., HIV, tuberculosis, infection by non-COVID causes, neoplasms, injury due to external causes, etc.) were excluded, and parent codes (e.g., the first 3-digits of ICD-10 codes) were systematically added. The full list of diagnosis codes for the PASC is provided in Supplemental Table 1.
Cohort Construction
For the both cohorts, adult patients (age ≥ 20) with at least one SARS-CoV-2 polymerase-chain-reaction (PCR) or antigen laboratory test (Supplemental Table 2) between March 01, 2020 and November 30, 2021 were selected. Then we chose the patients who had at least one positive test and had at least one potential PASC conditions in the follow-up (or post-acute infection) period defined as below. We further made sure those potential PASC conditions were new incidences in the follow-up period by excluding patients who had any of them in both baseline and follow-up periods. The overall inclusion-exclusion cascade was shown in Supplemental Figure 1, and the relevant definitions are provided below.
Index date: the date of the first COVID-19 positive test.
Baseline period: from 3 years to one week prior to the index date.
Follow-up (post-acute infection) period: from 31 days after the index date to the day of documented death, last record in the database, 180 days after baseline, or the end of our observational window (Nov. 30, 2021), whichever came first.
Topic Modeling
We use binary vectors to represent the patients, where n is the patient index the i-th element of xn, or xn(i) = 1 if the i-th potential PASC condition appears in the post-acute infection period of the n-th patient’s EHR, otherwise xn(i) = 0. Therefore, each xn is a 137-dimensional binary vector (Step 1 in Figure 1). Topic modeling (TM)14 as then applied on these vectors to learn a set of potential PASC topics. Specifically, assume that each patient can be represented as a mixture of K latent PASC topics Φ ∈ RD×K, where each topic φk can be viewed as a set of PASC that are more likely to be co-incident in the post-acute infection period of a particular patient (Step 2 in Figure 1). Then for each patient, TM infers the mixture memberships θn ∈ RK, also called topic proportions or topic loadings, as the new representation for each patient (Step 3 in Figure 1). A patient with higher loading value on a particular topic indicates that he/she suffers from more co-incident condition patterns from this topic. In other words, TM transforms the representations of each patient from the original 137-dimensional binary space xn to the low-dimensional continuous PASC topic space θn, which will be leveraged further for subphenotype identification through clustering later (Step 4 in Figure 1). Specifically, we used Poisson factor analysis (PFA)20 as the concrete TM method, which generates xn as follows.
Draw a topic proportion θn for the n-th patient from a Gamma distribution θn ∼ Gamma(1,1);
Draw the k-th PASC topic from a Dirichlet distribution φk ∼ Dirichlet(0.01), k = 1, …, K;
Draw a binary vector xn from the Bernoulli distribution by Bernoulli-Poisson link21: where 1(·) is an indicator function representing xn = 1 if un ≥ 1, and xn = 0 if un = 0. We use Gibbs sampling22 to infer the posterior of PASC topic Φ and topic proportions .
Determining the Number of Topics
The number of topics, K, is an important parameter in TM. To determine an optimal K based on the data, we used two metrics: data likelihood and topic coherence23. Data likelihood is used to evaluate the fitness of the model on the current data set, where the larger value indicates better fitness. Topic coherence is used to evaluate the relevance of our learned PASC topics to the investigative condition list, where the value is from zero to one and higher value indicates better coherence. Detailed calculations of these two metrics are provided in the Supplemental methods and results are shown in Supplemental Figure 2. From this figure, we can see that more topics can provide higher data likelihood because we have more topics to represent the original PASCs. However, more topics may bring down topic coherence, which suggests the redundancy between the newly added and the old ones. With these considerations, we set the final number of topics as 10 for both INSIGHT and OneFlorida+ cohorts as it achieved the best topic coherence and reasonable data likelihood (we do not want the data likelihood to be too perfect as that may suggests overfitting).
Topic Robustness
As TM is a probabilistic process, we want to guarantee the robustness of the identified topics. To achieve this goal, we firstly did 1000 bootstrapping (randomly choose 80% patients from each cohort) to learn topics. Then according to the importance of each topic (mean topic loadings over all patients), we reordered the topics and calculated the cosine similarity among all topics from each pair of bootstrapping: where is the p-th topic vector learned from the i-th bootstrapped samples, and Spq is the similarity between the p-th topic and the q-th topic. Supplemental Figure 3 demonstrated the heatmap of the similarity matrix with Spq as its (p, q)-th entry, from which we can clearly observe a darker diagonal line, which indicates a high similarity between the topics learned from different bootstrapped samples and thus implies the learned topics are robust.
Topic Consistency
We also quantitatively evaluated the consistency between the two set of topics learned from different cohorts. Specifically, denoting the topic matrices learned from the two cohorts as Φ(1) and Φ(2), then we can evaluate the consistency between the i-th topic in cohort 1 and the j-th topic in cohort 2 as the cosine similarity between their corresponding topic vector and as:
Finally, the heatmap of the topic consistency matrix with Sij as its (i, j)-th entry was shown in Supplemental Figure 4, from which we can clearly observe darker diagonal values, which suggests high consistency between the two sets of topics.
Subphenotyping through Clustering
With the learned K-dimensional topic loading vector for n-th patient θn, we applied hierarchical agglomerative clustering method with Euclidean distance calculation and Ward linkage criterion24 to derive subphenotypes as patient clusters. For determining the optimal number of clusters (subphenotypes), we applied NbClust R package25, which includes 21 cluster indices to evaluate the quality of clusters. With the patients from the INSIGHT and OneFlorida+ CRN, 13 and 12 out of the 21 indices agreed 4 is the optimal number of clusters. Through majority voting, we set the number of clusters as 4 in both two cohorts. Supplemental Figure 5 demonstrates the UMAP embeddings26 and dendrogram of these clusters for both cohorts.
We have also examined the robustness of the identified clusters on both cohorts. Specifically, for each cohort, we used the subphenotypes derived from all patients as references, so that the subphenotype index for the n-th patient is denoted as yn. Then we ran 1000 bootstrapping (randomly choose 80% patients from the cohort denoted as set Ωi, i = 1, …, 1000) to learn topic model and then derive subphenotypes with the same procedure as described above. For each bootstrapped sample set i, we can obtain another subphenotype index for n-th patient from Ωi as ŷn,i. We calculated mean and 95% confidence interval of adjusted rand score (ARI) and normalized mutual information (NMI) between the clustering results on bootstrapped sample sets and reference as 0.902 (95% confidence interval (CI): 0.863 − 0.927) and 0.937 (95% CI: 0.908 − 0.952) for the INSIGHT cohort, and 0.914 (95% CI: 0.907 − 0.929) and 0.950 (95% CI: 0.936 − 0.968) for the OneFlorida+ cohort, which suggests the identified clusters are highly robust.
Comparison with SARS-CoV-2 Infection Negative Patients
We compared the co-incidence patterns of the investigative conditions in the follow-up periods for patients with SARS-CoV-2 infection testing positive and negative. The SARS-CoV-2 infection negative patients are with all negative results for their SARS-CoV-2 infection lab tests during March 2020 to November 2021, and there were no documented COVID-19 related diagnoses during this time period. The index date for each individual patient in non-infected group is defined as the date of the first (negative) lab test.
To make fair comparisons, we performed similarity matching to identify appropriate negative patients for each positive patient based on the following hypothetical confounding variables.
Demographics: age, gender, race, and ethnicity, where age was binned into different groups (20-<40 years, 40-<55 years, 55-<65 years, 65-<75 years, 75-<85 years, 85+ years).
The area deprivation index (10-rank bins of national ADI) for capturing socioeconomic disadvantage of patients’ neighborhood13.
Index date for considering the effect of different stages of pandemic, which was binned into different time intervals (March 2020 – June 2020, July 2020 – October 2020, November 2020 - February 2021, March 2021 – June 2021, July 2021 – November 2021).
Medical utilizations measured by numbers of inpatient, outpatient, and emergency encounters in the baseline period (binned into 0 visit, 1 or 2 visits, 3 or 4 visits, 5+ visits for each encounter type).
Coexisting conditions including comorbidities and medications based on a tailored list of the Elixhauser comorbidities27. We defined the patient having a particular condition if he/she had at least two related records during the baseline period.
For identifying the negative controls for each patient in a particular subphenotype, we first required exact match for confounders of demographics, ADI, and index date to obtain an initial set, and then performed robust propensity score (PS) matching on other hypothetical confounders robust propensity score to rank the patients in the initial set and we finally picked the top 2. We used standardized mean difference (SMD) to quantify the goodness-of-balance of confounders between two groups , where SSDD < 0.2 is the threshold to examine whether this confounder is balanced28. On both ISNIGHT and OneFlorida cohort, we found that all confounders on all subphenotypes were balanced.
Data Availability
All data produced in the present study are available upon reasonable request to the corresponding authors
Supplemental Figure and Table
Supplemental Table 1. PASC list in our study defined by CCSR domain with corresponding ICD-10-CM codes.
Supplemental Table 2. SARS-CoV-2 polymerase-chain-reaction or antigen laboratory test.
Supplemental Methods
Evaluate the data likelihood and topic coherence
To select the optimal number of topics, based on different number of topics, we learned topic model and calculated the data likelihood and topic coherence (Supplemental Figure 2) according to the following methods.
Data likelihood
In PFA, to model the binary PASC vector, we used the Bernoulli-Poisson link as: where, xn is the binary PASC vector, Φ is the topic matrix, θn is the topic proportion vector, un is the latent variables which links the binary observation and topic representation. According to the property of Bernoulli-Poisson link, un can be marginalized out and then one can obtain a Bernoulli likelihood as
Given , after learning Φ and , we can calculate the Bernoulli data likelihood.
Topic coherence
Topic coherence is an important matric to evaluate the quality of topics based on the input data. It is measured based on a sliding window (in our case, the length of the window is the total number of PASCs), and then one can calculate the normalized pointwise mutual information (NPMI) between input data and the learned topics. We used the python package GENSIM (https://radimrehurek.com/gensim/) to calculate the topic coherence.
Acknowledgement
This research was funded by the National Institutes of Health (NIH) Agreement OTA OT2HL161847 (contract number EHR-01-21) as part of the Researching COVID to Enhance Recovery (RECOVER) research program.