Abstract
Temporal inference from laboratory testing results and their triangulation with clinical outcomes as described in the associated unstructured text from the providers’ notes in the Electronic Health Record (EHR) is integral to advancing precision medicine. Here, we studied 181 COVIDpos and 7,775 COVIDneg patients subjected to 1.3 million laboratory tests across 194 assays during a two-month observation period centered around their SARS-CoV-2 PCR testing dates. We found that compared to COVIDneg at the time of clinical presentation and diagnostic testing, COVIDpos patients tended to have higher plasma fibrinogen levels and similarly low platelet counts, with approximately 25% of patients in both cohorts showing outright thrombocytopenia. However, these measures show opposite longitudinal trends as the infection evolves, with declining fibrinogen and increasing platelet counts to levels that are lower and higher compared to the COVIDneg cohort, respectively. Our EHR augmented curation efforts suggest a minority of patients develop thromboembolic events after the PCR testing date, including rare cases with disseminated intravascular coagulopathy (DIC), with most patients lacking the platelet reductions typically observed in consumptive coagulopathies. These temporal trends present, for the first time, fine-grained resolution of COVID-19 associated coagulopathy (CAC), via a digital framework that synthesizes longitudinal lab measurements with structured medication data and neural network-powered extraction of outcomes from the unstructured EHR. This study demonstrates how a precision medicine platform can help contextualize each patient’s specific coagulation profile over time, towards the goal of informing better personalization of thromboprophylaxis regimen.
Introduction
There is a growing body of evidence that severe COVID-19 outcomes may be associated with dysregulated coagulation1, including stroke, pulmonary embolism, myocardial infarction, and other venous or arterial thromboembolic complications2. This so-called COVID-19 associated coagulopathy (CAC) shares similarities with disseminated intravascular coagulation (DIC) and thrombotic microangiopathy but also has distinctive features3. Given the significance of CAC to COVID-19 mortality, there is an urgent need for fine-grained resolution into the temporal manifestation of CAC, particularly in comparison to the broad-spectrum of other, better characterized coagulopathies. While there is a growing body of evidence suggesting associations between COVID-19 infection and mortality with thrombocytopenia, D-dimer levels, and prolongation of prothrombin time, the potentially heterogeneous signatures of CAC onset and progression as well as their connection to clinical outcomes are not well defined 1,4,5. An advanced understanding of this phenotype may aid in risk stratifying patients, thus facilitating optimal monitoring strategies during disease evolution through the paradigm of precision medicine, which hopefully will lead to further improvements in treatment decisions and overall survival.
To this end, we instituted a holistic data science platform across an academic health care system that enables machine intelligence to augment the curation of phenotypes and outcomes from 15.2 million EHR clinical notes and associated 3 million lab tests from 1,192 COVID-19-positive (COVIDpos) and 47,344 confirmed COVID-19-negative (COVIDneg) patients over a retrospectively defined 2-month observation period straddling the date of the PCR test (see Methods). By compiling all available laboratory testing data for the 30 days preceding the SARS-CoV-2 PCR diagnostic testing date (‘day 0’), as well as the 30 days following the diagnostic testing date, and triangulating this information with medications and clinical outcomes, we were able to identify laboratory abnormalities significantly associated with positive COVID-19 diagnosis. We identified coagulation-related parameters among this set of abnormalities and then studied aggregate as well as individual patient trajectories that could aid in extracting a temporal signature of CAC onset and progression. We also correlated these signals with the clinical outcomes of these patients.
In order to hone into longitudinal lab test trends that would apply at the individual patient level, we restricted our analysis to patients with available serial testing data, which had at least 3 test results of the same type during the observation period. After applying these inclusion criteria, 181 COVIDpos and 7,775 COVIDneg patients met study inclusion criteria. The median age in the COVIDpos and COVIDneg groups were 59.9 years and 63.9 years respectively (see Methods and Table 1), and the numbers of males were 96 (53%) and 4,064 (52%), respectively. The number of pre-existing coagulopathies in the COVIDpos and COVIDneg groups were 56 (33%) and 3,325 (40%), respectively. Overall, more patients in the COVIDneg cohort were hospitalized at the start of the study period. In particular, the number of COVIDpos patients hospitalized on the SARS-CoV-2 PCR testing date was 23 (12.7%), compared to 1435 (18.5%) for the COVIDneg cohort. In order to balance some of these clinical covariates and others between the two cohorts, we applied 1:10 propensity score matching to define a subset of 1,810 patients from the COVIDneg cohort to use for the final statistical analyses (see Methods). In particular, the general categories of covariates considered for balancing included: demographics, anticoagulant/antiplatelet medication use, and hospital admission status. Population-level characteristics of the COVIDpos, COVIDneg, and the final propensity score-matched COVIDneg cohorts are summarized in Table 1.
Results
Longitudinal analysis highlights classes of lab abnormalities in COVID-19 progression
To identify laboratory test results which differ between COVIDpos and COVIDneg patients, we analyzed longitudinal trends for 194 laboratory test results in the 30 days prior to and after the day of PCR testing (designated as day 0). As most patients did not undergo laboratory testing for each assay on a daily basis, we grouped the measurements into time windows reflecting potential stages of infection as follows: pre-infection (days -30 to -11), pre-PCR (days -10 to -2), time of clinical presentation (days -1 to 0), and post-PCR phases 1 (days 1 to 3), 2 (days 4-6), 3 (days 7-9), 4 (days 10-12), 5 (days 13-15), and 6 (days 16-30). We only considered test-time window pairs in which there were at least three patients contributing laboratory test results in both groups. During each time window, we then compared the distribution of results from COVIDpos versus COVIDneg patients, allowing us to identify any lab tests which were significantly altered in COVIDpos patients during any time of disease acquisition, onset, and/or progression.
Out of 863 lab test-time window pairs with adequate data points for comparison, we identified 88 such pairs (comprising 36 unique lab tests) which met our thresholds for statistical significance (Cohen’s D > 0.35, BH-adjusted Mann-Whitney p-value < 0.05) (Table 2). Among these were lab tests which may be considered positive controls for our analysis, including a reduction in blood oxygenation in COVIDpos patients after diagnosis and elevated titers of SARS-CoV-2 IgG antibodies in the late phase 6 (days 16-30) after diagnosis (Figures 1A-B). We also identified abnormalities in several other classes of lab tests, including immune cell counts (Figures 1C-G), erythrocyte-related tests (Figures 1H-I), serum chemistry (Figures 1J-K), and coagulation-related tests (Figure 2).
With respect to coagulation, we found that plasma fibrinogen was significantly elevated in COVIDpos patients at the time of diagnosis (Cohen’s D = 1.092, BH-adjusted Mann-Whitney p-value = 0.016, Table 2, Figure 2A). This hyperfibrinogenemia generally resolved during the 10 days following diagnosis, even declining to levels significantly lower than the COVIDneg cohort (Figure 2A). Conversely, platelet counts were lower in the COVIDpos cohort at the time of clinical presentation but tended to increase over the subsequent 10 days to levels significantly higher than those in COVIDneg patients (Cohen’s D = 0.361, BH-adjusted Mann-Whitney p-value = 0.008, Table 2, Figure 2B). While thrombocytopenia has been reported in COVID-19 patients before6,7, an upward trend in platelet counts after diagnosis has not been described to our knowledge. Among other coagulation-related tests, we found no difference in prothrombin time (PT) and slight elevations in activated partial thromboplastin times (aPTT) after diagnosis (Figure 2C). D-dimer levels were frequently above normal limits in both the COVIDpos and COVIDneg cohorts and were not significantly different between these cohorts during any time windows (Figure 2D). Of note, the observed trends in both fibrinogen and platelet counts could not be attributed to the administration of either antiplatelet or anticoagulant therapies, as drug class-matched cohorts of COVIDneg patients failed to show these trends after the day of PCR testing (Figures 2E-F).
We also performed similar analyses comparing the COVIDpos and COVIDneg cohorts using different time windows definitions including daily trends (Figure 3). This approach offers the advantage of increased granularity at the cost of sample size per time point, but we did identify similar lab tests as altered in COVIDpos patients using each approach including the fibrinogen decline and platelet increase in the COVIDpos cohort after diagnosis (Figure 3).
Thrombosis is an enriched phenotype among COVID19 patients undergoing longitudinal lab testing
Given the recently described coagulopathies associated with COVID-191,2,3, we were intrigued by these temporal trends in fibrinogen levels and platelet counts in the COVIDpos cohort. We next asked whether the observed coagulation-related laboratory trends were associated with clinical manifestations of thrombosis. To do so, we employed a BERT-based neural network8 (see Methods) to identify patients who experienced a thrombotic event after their SARS-CoV-2 testing date. Specifically, we extracted diagnostic sentiment from EHR notes (e.g. whether a patient was diagnosed with a phenotype, suspected of having a phenotype, ruled out for having a phenotype, or other) regarding specific thromboembolic phenotypes including pulmonary embolism (PE), deep vein thrombosis (DVT), venous thromboembolism (VTE), thrombotic stroke, myocardial infarction, and cerebral venous thrombosis.
We found that 40 of the total 1729 COVIDpos cohort (2.3%) were positively diagnosed with one or more of these phenotypes in the 30 days after PCR testing, with the majority of these patients (27 of 40; 68%) experiencing a DVT. Interestingly, we found that after creating subsets of the patients with longitudinal lab testing data (i.e. the patients meeting the criteria for inclusion in our study), 33 of the 181 patients (18%) had at least one EHR-derived clot diagnosis, including 25 patients with DVT (Table 6). Thus, the cohort under consideration here is highly enriched (Table 4; hypergeometric p-value < 1×10-20) for patients experiencing thrombotic events compared to the overall COVIDpos cohort.
Longitudinal platelet trends are not strongly associated with the development of thrombosis in COVID-19 patients
Among the 181 COVIDpos patients with longitudinal lab testing data, only 10 were serially tested starting at clinical presentation for fibrinogen versus 87 tested for platelets. As such, we first analyzed whether associations exist between platelet counts (or temporal alterations thereof) and clotting propensity in this cohort. Among these 87, there were 68 patients without thrombosis after PCR-based diagnosis (“non-thrombotic”) and 19 patients with thrombosis (“thrombotic”). Thrombocytopenia (platelet count <150×109/L) was observed in ~25% of COVIDpos (and COVIDneg) patients at the time of diagnosis (Figure 4A), but the platelet levels at this time point were not associated with the subsequent formation of a blood clot in the COVIDpos cohort (Figure 4B).
We hypothesized that the previously discussed increase in platelet counts after COVID-19 diagnosis may be associated with the development of blood clots. If true, then we would expect the thrombotic COVIDpos cohort to show significantly higher maximum platelet counts during their course of disease progression compared to the non-thrombotic COVIDpos cohort. We found that this was not the case, as maximum platelet counts were similar in the two groups (Figure 4C). Similarly, among the 74 COVIDpos patients with at least one post-diagnosis platelet count higher than that at the time of presentation, the degree of maximal platelet increase was not associated with the development of thrombosis (Figure 4D). It would certainly be of interest to perform this same analysis on the larger COVIDpos cohort (n = 1729; 40 thrombotic vs. 1689 non-thrombotic), but we were not able to do so given the lack of longitudinal testing available for the large majority of non-thrombotic COVIDpos patients (Table 3).
Conversely, we explored whether some COVIDpos patients may experience clotting in the setting of low or declining platelets (e.g., consumptive coagulopathy) despite the population-level trend of increasing platelets over time. Indeed, we found that 4 of 19 thrombotic patients showed absolute platelet counts below 100×109/L during at least one post-diagnosis time window (below dotted red line in Figure 4E). We next identified all COVIDpos patients who showed at least one post-diagnosis platelet count lower than the count at the time of diagnosis. Interestingly, the binary presence of post-diagnosis platelet reductions was associated with thrombotic status (n = 27 of 68 in non-thrombotic population; n = 14 of 19 in thrombotic population; p = 0.0017). While the maximum degree of absolute platelet reduction was not associated with clot development overall (p = 0.15; Figure 4F), we did find that 3 of the 15 thrombotic patients experienced a reduction of at least 100×109/L relative to the time of diagnosis (below gray line in Figure 4F). Of note, similar fractions of non-thrombotic COVIDpos patients also showed these low or declining platelet counts, indicating that these trends are not specific indicators of thrombosis (Figures 4E-F).
Consumptive coagulopathy contributes to a small fraction of COVID-19 associated thromboses
The observed declining platelet counts and thrombocytopenia in the context of thrombosis in a small fraction of COVIDpos patients is consistent with previous reports that fewer than 1% of survivors, but over 70% of non-survivors, meet the International Society on Thrombosis and Hemostasis (ISTH) criteria for disseminated intravascular coagulation (DIC)1. As was previously noted, hyperfibrinogenemia was among the strongest lab test features distinguishing COVIDpos from COVIDneg patients at diagnosis, but the subsequent downward trend (Figure 2B) could be attributed to a resolving acute phase response and/or consumption of fibrinogen in a systemic coagulopathy. Using our BERT-based sentiment extraction, we found that only 2 of the 1729 COVIDpos patients were diagnosed with DIC, all of whom were included in our longitudinal cohort of 181 COVIDpos patients. Both extracted diagnoses were manually confirmed by EHR review. This finding suggests that declining fibrinogen after COVID-19 diagnosis typically represents a physiologic return to normal range rather than pathologic coagulation factor consumption. To further examine the plasma fibrinogen trends among COVID-19 patients with DIC, with non-DIC thrombosis, and without thrombosis, we examined patient-level lab test trends from the 10 individuals who were tested for fibrinogen both at the time of diagnosis and at least two times subsequently.
This patient-level analysis indeed revealed multiple distinct trajectories with respect to fibrinogen and other coagulation parameters in COVIDpos patients. Four of these ten individuals developed at least one blood clot during their hospital course. Only one was identified by our BERT model (and confirmed by manual EHR review) to have low-grade DIC, and as expected we found this patient’s longitudinal lab test pattern to be consistent with consumptive coagulopathy (Patient 124; Figure 5A). At the time of diagnosis, this patient showed significant hyperfibrinogenemia with elevated D-dimers (1304.5 ng/mL) and a borderline normal platelet count (1533×109/L). Over the next ten days, this patient’s fibrinogen levels consistently decreased, reaching a minimum of 110 mg/dL on day 9. Similarly, after an initial recovery to 190×109/L the platelet counts consistently declined starting on day 2 post-diagnosis, reaching a minimum of 117×109/L on day 11. D-dimer levels exponentially increased after 5 days, reaching a maximum of 41,300 ng/mL on day 10. Phenotypically, this patient experienced both thrombotic (right internal jugular vein and right superior thyroid artery) and hemorrhagic (oropharyngeal and pulmonary) events. This combination of lab results and clinical manifestations is consistent with the diagnosis of DIC-like consumptive coagulopathy during the first week after COVID-19 diagnosis.
Lab test results from the other three non-DIC thrombotic patients with longitudinal fibrinogen testing confirm the presence of alternative forms of coagulopathy in the COVID-19 population. Patient 23 developed a clot on day 4 post-diagnosis in the context of a declining fibrinogen level and increasing D-dimers but steady platelet counts, which actually increased shortly thereafter (Figure 5B). Patient 79 developed several clots after day 3 post-diagnosis in the setting of upward trending platelets (which eventually exceed the upper limit of normal) and elevated levels of both fibrinogen and D-dimers (Figure 5C). Patient 94 developed a clot on day 8 post-diagnosis with relatively stable platelet counts within normal limits and steadily declining fibrinogen levels (Figure 5D).
Whether early elevations in plasma fibrinogen contribute to the clotting observed in the non-DIC like COVIDpos cohort is unfortunately not possible for us to assess here as only one of the ten patients who received fibrinogen testing at the time of clinical presentation and twice subsequently developed thrombosis, and this patient was diagnosed with DIC. This hypothesis may warrant further analysis in cohorts with more longitudinal fibrinogen data, but again it is important to note that several COVIDpos patients who presented with hyperfibrinogenemia did not go on to develop thromboses (Figures 5E-F). This emphasizes that a steady post-diagnosis decline in plasma fibrinogen may represent physiologic resolution of acute phase response rather than a pathologic consumption of fibrinogen and other coagulation factors (Figures 5B, D-F).
Taken together, this analysis affirms that a DIC-like coagulopathy resulting in a combination of hemorrhage and thrombosis can develop in the setting of COVID-19 infection. However, the observations that (1) DIC was formally diagnosed in only 2 of 1729 COVIDpos patients and (2) 32 of 54 thrombotic COVIDpos patients with serial platelet testing never showed a single platelet count lower than that obtained on day 0 emphasizes that consumptive coagulopathy is an exception rather than the rule as it pertains to thrombotic phenotypes in COVID-19 patients.
Discussion
A number of studies on clinical characteristics and lab tests are shedding light on the spectrum of hematological parameters associated with COVID-19 patients. In an initial study of 41 patients from Wuhan, the blood counts in COVIDpos patients showed leukopenia and lymphopenia, and prothrombin time and D-dimer levels were higher in ICU patients than in non-ICU patients9. Another study based on 343 Wuhan COVIDpos patients found that a D-dimer level of at least 2.0 μg/ml could predict mortality with a sensitivity of 92.3% and a specificity of 83.3%10. An independent study of 43 COVID-19 patients found significant differences between mild and severe cases in plasma interleukin-6 (IL-6), D-dimers, glucose, thrombin time, fibrinogen, and C-reactive protein (P<0.05)4. While such studies indeed highlight that hematological and inflammatory abnormalities are prevalent in COVIDpos, a high-resolution temporal understanding of how these parameters evolve in COVID-19 patients post diagnosis has not been established. Specifically, in the wake of accumulating evidence for hypercoagulability in COVIDpos patients, there are important clinical questions emerging regarding the necessity of and guidelines for thromboprophylaxis in patient management.
The ability to derive this longitudinal understanding of COVID-19 progression, including laboratory abnormalities and their associated clinical manifestations, mandates the synthesis of structured and unstructured EHR data (e.g., lab tests and clinical notes) at a large scale. The fact that tens of thousands of patients have undergone SARS-CoV-2 testing at major academic medical centers (AMCs) provides an abundance of potential data to perform this analysis but also poses significant challenges from a practicality standpoint. Manual review and curation of patient trajectories and associated testing results is not practical, nor is it likely to provide comprehensive or even entirely accurate individual patient records. Rather, triangulation across datasets, including lab measurements, clinical notes, and prescription information, using a scalable digitized approach to extract structured data along with sentiment-surrounded clinical phenotypes and outcomes is required to efficiently perform this analysis in a timely fashion.
By developing and deploying such a digitized platform on the entirety of EHR data from a large AMC, we have identified in an unbiased manner, laboratory abnormalities that characterize COVIDpos patients from their COVIDneg counterparts who also presented with symptoms leading to the performance of a SARS-CoV-2 PCR test. The abnormalities in coagulation-related tests, including fibrinogen and platelets, were intriguing in the context of literature reporting the occurrence of various clotting phenotypes in COVID-19 patients, including DIC-like consumptive coagulopathies along with more isolated clotting events in the lungs, central nervous system, and other tissues1,2,3. Our finding that consumptive coagulopathy represents a minority of COVID-19 associated clotting events corroborates and provides context for other studies which have reported overt DIC or DIC-like disease in over 70% of non-survivors but far lower fractions of survivors1. A study of 83 COVID-19 patients suggests that when LMWH prophylaxis was initiated on hospital admission, progression to DIC was rare11.
While our analysis highlights that consumptive coagulopathy should be considered in the minority of COVIDpos patients with significant serial reductions in platelet counts, it remains to be seen whether the post-diagnosis platelet increases or early hyperfibrinogenemia which we observed may contribute mechanistically to the clotting in the much larger non-DIC thrombotic COVID-19 population. It is important to note that despite the trend of increasing platelets, the platelet count only extended above the normal range (>450×109/L) in 24 of 135 COVIDpos patients with serial measurements, and the development of such outright thrombocytosis was observed with similar frequencies in the thrombotic and non-thrombotic cohorts. Further, the fact that several patients with elevated fibrinogen (i.e. >400 mg/dL) at presentation did not develop thromboses suggests that early hyperfibrinogenemia is not a singular driver of subsequent clotting events, but a small sample size (n = 10 patients; 9 non-thrombotic vs. 1 thrombotic) limited the power of this analysis.
Despite these uncertainties, this linking of longitudinal trends to patient outcomes provides several useful pieces of clinical information, including but not limited to: (1) Early hyperfibrinogenemia is to be expected in COVID-19 patients and may not be useful in determining an appropriate thromboprophylaxis approach for a given patient. (2) Declining fibrinogen levels shortly after diagnosis are also expected and likely represent the resolution of an acute phase response in most patients rather than a decline secondary to the onset of consumptive coagulation. (3) Borderline or overt thrombocytopenia is common in COVID-19 patients at the time of clinical presentation, and the initial platelet count does not robustly predict patients who are likely to develop thromboses. (4) After diagnosis, COVID-19 patients generally show an upward trend in platelets. Patients whose platelets trend down after diagnosis should be monitored, as platelet reductions after clinical presentation are associated with thromboses and significant reductions may be indicative of ongoing consumptive coagulopathy.
There certainly remains much to understand regarding the molecular pathophysiology underlying COVID-19 associated thrombosis. While elevations in fibrinogen and D-dimer have been reported (and confirmed by our analyses), the circulating levels of tissue factor and other coagulation factors have not been assessed broadly in this population. Interestingly, single cell RNA-sequencing data highlights that the respiratory tract, including the nasal cavity, are strongly enriched for high expression of tissue factor (encoded by the gene F3) (Figure 6A). Specifically, F3 is highly expressed in epithelial cells of these tissues - the same cell types which have been found to express ACE2, the putative entry receptor for SARS-CoV-2 (Figures 6B-C). Type II pneumocytes, which may also express ACE2 and have been reported as a putative entry site for SARS-CoV-212–14, are also strongly enriched among all human cell types showing high F3 expression (Figure 6A). Whether infection-related damage to these epithelial cells may lead to shedding of tissue factor into circulation or its increased exposure circulating coagulation factors locally in respiratory tissues would be interesting to evaluate. Such a mechanism could at least partially contribute to the development of clots systemically and locally in the pulmonary vasculature. This has been confirmed in at least 3 autopsy series that reported both microthrombi and pulmonary emboli in a significant proportion of COVID-19 patients15.
In addition to our findings regarding abnormal coagulation parameters in COVID-19 patients, we identified several other laboratory features which differed from the COVIDneg cohort. For example, others have previously reported leukopenia and T cell deficiencies in COVID-19 patients at presentation 9,16; we identified a leukopenia which existed both prior to and at the time of clinical presentation which recovered during the week following COVID-19 diagnosis (Figure 1C). Specifically, we found that monocytes, basophils, neutrophils, and eosinophils followed this temporal trend (Figures 1D-G). With respect to erythrocytes, we found that both COVIDpos and COVIDneg patients developed a worsening mild anemia after PCR testing, but the trajectory was actually slower for COVIDpos patients (Figures 1B, 1H, Table 1). This progressive anemia was interestingly paralleled by a declining mean corpuscular volume unique to the COVIDpos cohort (Figure 1I). This may represent an inflammation associated anemia and could simply be related to repeated blood testing (i.e., mild blood loss) in these patients. Among serum chemistry tests, we observed distinct patterns showing reduced calcium in COVIDpos patients at presentation and through the first week of infection versus elevated magnesium between days 4 and 15 (Figures 1J-K). While low calcium at the time of presentation has been previously described in COVID-19 patients 17, elevations in serum magnesium levels in the 1-2 weeks after diagnosis have not been reported. Determining whether these serum chemistry alterations are induced by SARS-CoV-2 or have any impact on viral replication or pathogenicity, although preliminary analyses have shown no association between these lab measurements at any time point and patient mortality (not shown).
It is important to note that while we center the study period around the PCR testing date, this date may not correspond to the same disease state of COVID-19 for each individual in the COVIDpos cohort. Additionally, there are several covariates which may influence these longitudinal trends and should be explored further. For example, we have already considered whether previous or concomitant administration of anticoagulants or antiplatelet agents influences patient lab test results and/or outcomes. In Figure 2D-E, we present the longitudinal platelet counts and fibrinogen levels for subsets of the COVIDpos and COVIDneg cohort who are on and off these medications. While we do observe elevated levels of platelets generally in the COVIDneg cohort who are on antiplatelets, this potential medication effect does not alone account for the differences in platelet levels that we observe in the COVIDpos cohort. Similarly, the authors wish to explore whether longitudinal lab measurement trends differ between outpatient, inpatient, and ICU admitted patient cohorts. New data sets can also be utilized; for example, rather than grouping patients by the identified thromboembolic phenotypes extracted from the clinical notes alone, patients could be stratified by those who had imaging studies (Duplex ultrasound, CT scan, etc.) performed, and phenotypes could be directly extracted from these procedural reports. As more data accumulates from COVIDpos and COVIDneg patients in the coming months, these analyses will be expanded to assess similarities and differences in the temporal trends of laboratory test results among a wider range of patient subgroups relevant for COVID-19 outcomes, such as those who have pre-existing conditions (e.g., diabetes, hypertension, obesity, malignancies) or patients who are on specific medication (e.g., ACE inhibitors, statins, immunosuppressants).
In summary, this work demonstrates significant progress in our goal of enabling scaled and digitized analyses of longitudinal unstructured and structured EHR records to identify variables (e.g., laboratory results) which are associated with relevant clinical phenotypes (e.g., COVID-19 diagnosis and outcomes). In doing so, we identified trends in lab test results which may be relevant to monitor in COVID-19 patients and warrant both clinical and mechanistic follow-up in more targeted and explicitly controlled prospective analyses.
Data Availability
All data necessary are self-contained in the manuscript
Methods
Study design, setting and patient population
This is a retrospective study of patients who underwent polymerase chain reaction (PCR) testing for suspected SARS-CoV-2 infection at the Mayo Clinic and hospitals affiliated to the Mayo health system.
This research was conducted under IRB 20-003278, “Study of COVID-19 patient characteristics with augmented curation of Electronic Health Records (EHR) to inform strategic and operational decisions”. All analysis of EHRs was performed in the privacy-preserving environment secured and controlled by the Mayo Clinic. Nference, the Mayo Clinic, and the Janssen pharmaceutical companies of Johnson & Johnson (J&J) subscribe to the basic ethical principles underlying the conduct of research involving human subjects as set forth in the Belmont Report and strictly ensure compliance with the Common Rule in the Code of Federal Regulations (45 CFR 46) on the Protection of Human Subjects.
Longitudinal lab testing tied to COVID-19 PCR diagnostic testing
We analyzed data from 48,536 patients who received PCR tests from the Mayo Clinic between February 15, 2020 to May 8, 2020. Among this population, 1,192 patients had at least one positive SARS-CoV-2 PCR test result, and 47,344 patients had all negative PCR test results. In order to align the data for the analysis of aggregate longitudinal trends, we selected a reference date for each patient. For patients in the COVIDpos cohort, we used the date of the first positive PCR test result as the reference date (day = 0). For patients with all negative PCR tests, we used the date of the first PCR test result as the reference date (day = 0). We defined the study period for each patient to be 30 days before and after the PCR testing date. Patients with contradictory PCR test results were excluded for the purpose of this analysis; e.g. a positive PCR test result and a negative PCR test result on the same day, or a positive PCR test result followed immediately by several negative PCR test results.
Over 3 million test results from 5,536 different types of lab tests were recorded for the patients who received PCR tests in the 60-day windows surrounding their PCR testing dates at the Mayo Clinic campuses in Minnesota, Arizona and Florida. Among these lab tests, we restricted our analysis to 194 tests with at least 1000 observations total and at least 10 observations from the COVIDpos cohort. In addition, we considered different subsets of the COVIDpos cohort for the analysis of each of the 194 lab tests, due to differences in availability of testing results. For each lab test, we consider the results from patients with 3 or more observations during the study period.
In the end, there are 181 COVID-19 positive and 7,775 COVID-19 negative patients that had 3 or more test results during the study period for at least one of the assays among the 194 lab tests considered. We take this set of 181 COVID-19 positive patients to be the COVIDpos cohort. In order to construct the COVIDneg cohort from the 7,775 COVID-19 negative patients, we apply propensity score matching, which is described in the next section.
Propensity score matching to select the final COVIDneg cohort
To construct a COVIDneg cohort similar in baseline clinical covariates to the COVIDpos cohort, we employ 1:10 propensity score matching19. In particular, first we trained a regularized logistic regression model to predict the likelihood that each patient will have a positive or negative COVID-19 test result, using the following covariates: demographics (age, gender, race), anticoagulant/antiplatelet medication use (orders for alteplase, antithrombin III, apixaban, argatroban, aspirin, bivalirudin, clopidogrel, dabigatran, dalteparin, enoxaparin, eptifibatide, heparin, rivaroxaban, warfarin in the past year and in the past 30 days), and hospitalization status (i.e. whether or not the patient was hospitalized on the date of PCR testing).
Using the predictions from the logistic regression model as propensity scores, we then matched each of the 181 patients in the COVIDpos cohort to 10 patients out of the 7,775 COVID-19 negative patients, using greedy nearest-neighbor matching without replacement20.20 As a result, we ended up with a final COVIDneg cohort that included 1,810 patients with similar baseline characteristics to the COVIDpos cohort. The characteristics of the two cohorts are summarized in Table S1.
Further, for the analyses conducted on individual lab tests, which include only a subset of patients from the COVIDpos cohort, we use the propensity scores to match each patient from the COVIDpos cohort to 10 patients from the COVIDneg cohort which have the most similar propensity scores and lab tests available. For example, for the Fibrinogen lab test, in which we have data on 10 patients from the COVIDpos cohort, we select 100 patients from the COVIDneg cohort with Fibrinogen lab tests available and the most similar propensity scores to be the control group. In this way, we ensure that all of the comparisons are done between subsets of the positive and negative cohorts with similar propensity scores, and therefore similar underlying characteristics.
Statistical significance assessments over 1-month post SARS-CoV-2 PCR diagnosis date
To identify tests that show significant differentiation, we first compared the overall distributions of test scores of COVIDpos vs COVIDneg over the 30-day period following testing.
For each test the procedure was to:
Compute a single score per patient by averaging all their tests over the duration (may be 1 or many tests per patient); we now have a set of patient test scores for the COVIDpos patient sets, and one for the COVIDneg patient set.
Perform a two-sample t-test as well as a Mann-Whitney U test to get p-values against the null hypothesis that both sets have the same mean (for the t-test), and that both sets come from the same distribution (for the Mann-Whitney U test)
Compute Cohen’s d (effect size) as a measure of the effect size of the difference; for the Mann-Whitney U test, the U statistic itself serves as an effect size.
We then filtered by t-test p-value then ranked by Cohen’s d to identify which tests appeared to have the most differentiation between COVIDpos and COVIDneg scores; we then investigated some of these tests in deeper detail by looking at how they changed over time. We applied the same procedure to compute statistics for each of our time windows (restricting tests to the respective duration in each case).
Augmented curation of anticoagulant administration and the coagulopathy outcomes from the unstructured clinical notes and their triangulation to structured EHR databases
A state-of-the-art BERT-based neural network8 was previously developed to classify sentiment regarding a diagnosis in the EHR21. Sentences containing phenotypes were classified into the following categories: Yes (confirmed diagnosis), No (ruled out diagnosis), Maybe (possibility of disease), and Other (alternate context, e.g. family history of disease). The neural network used to perform this classification was trained using nearly 250 different phenotypes and 18,500 sentences and achieves 93.6% overall accuracy and over 95% precision and recall for Yes/No sentiment classification21. Here, this model was used to classify the sentiment around coagulopathies in the unstructured text of the 181 COVIDpos and 7,775 COVIDneg patients’ clinical notes, structuring this information so that it could be compiled with longitudinal lab measurement and medication information.
Acknowledgments
The authors thank Mathai Mammen, Jim List, JoAnne Foody, Patrick Lenehan, Murali Aravamudan, Rakesh Barve, Sankar Ardhanari, and Vishy Thiagarajan, for their helpful feedback.