Abstract
Estimating the true extent of the outbreak was one of the major challenges in combating the COVID-19 outbreak early on. Our inability to do so allowed unreported/undetected infections to drive up disease spread in numerous regions in the US and worldwide. Accurately identifying the true magnitude of infections remains a major challenge, despite the use of surveillance-based methods such as serological studies, due to their costs and biases. In this paper, we propose an information theoretic approach to accurately estimate the unreported infections. Our approach, built on top of an existing ordinary differential equations based epidemiological model, aims to deduce an optimal parameterization of the epidemiological model and the true extent of the outbreak which “best describes” the observed reported infections. Our experiments show that the parameterization learned by our framework leads to a better estimation of unreported infections as well as more accurate forecasts of the reported infections compared to the baseline parameterization. We also demonstrate that our framework can be leveraged to simulate what-if scenarios with non-pharmaceutical interventions. Our results also support earlier findings that a large majority of COVID-19 infections were unreported and that non-pharmaceutical interventions indeed helped in mitigating the COVID-19 outbreak.
Introduction
The COVID-19 pandemic has emerged as one of the most formidable public health challenges in recent history. As of May 1, 2021, it had already resulted in more than 32 million reported infections and half a million deaths in the United States alone. Worldwide, reported infections totaled 161 million and deaths 3.34 million [26]. The devastating effect of COVID-19 is not limited to the public health sector, but also extends to the economy as a whole. For example, in the US the unemployment rate peaked at 15.8 percent in April 2020 [5], and U.S. GDP contracted at a 3.5% annualized rate for 2020 [1]. Similar economic impacts have been observed worldwide.
One of the major challenges in combating the COVID-19 pandemic early on was our inability to estimate the true magnitude of unreported infections. As noted by many studies [21, 15, 11, 54, 51, 41], a significant number of COVID-19 infections were unreported, due to various factors such as the lack of testing and asymptomatic infections. In the early stages, transmissions in previously unreported regions were facilitated by unreported infections before being detected. As a result, there were many instances of unreported COVID-19 outbreaks across the globe. In each of the early outbreak “hubs”, unreported spread was later identified as the driving force behind the surge in positive infections. For example, phylogenetic studies revealed that COVID-19 had locally spread in Washington state before active community surveillance was implemented in early 2020 [16]. Similarly, there were only 23 reported infections in five major U.S. cities by March 1. However, it has been estimated that there were already more than 28000 total infections in those cities by then [13, 3]. Similar trends were observed in Italy, Germany, and the UK [18]. Even currently, a majority of asymptomatic and mild symptomatic infections remain difficult to identify [21, 35, 12].
More generally, accurate estimation of the true extent of the outbreak is a fundamental epidemiological question, and is also critical for pandemic planning and response. Note that the true reported rate (i.e., the ground truth reported rate), which is the ratio of reported infections to total infections, is different from the case ascertainment rate, which is the ratio of confirmed symptomatic infections to the true number of symptomatic individuals [50]. The ascertainment rate is also an important parameter, and has been studied extensively. However, the true reported rate can help in better evaluation of the effectiveness of different interventions, which has been a big challenge due to the uncertainty in the number of infections and deaths, e.g., [19, 14, 50]. It can also help in efficient resource allocation and the design and implementation of non-pharmaceutical interventions.
One of the most effective methods to identify the true magnitude of unreported infections in a region is through large-scale serological studies [17, 53, 32, 57]. These surveys use blood tests to identify the prevalence of antibodies against SARS-CoV-2 in a large population. The CDC COVID Data Tracker portal [2] summarizes the results of serological studies conducted by commercial laboratories at a U.S. national level as well as 10 specific sites. As per the portal, in a sample collected by the CDC in Connecticut between April 26 and May 3, 2020, it was estimated that the total infections were at least 6 times higher than the reported infections. Similarly, the estimated ratio of total infections to the reported infections in a sample collected between April 13 and April 25, 2020 in Minnesota was at least 10. The same ratio was 12 in New York City between March 23 and April 1, 2020. While serological studies give an accurate estimation of the unreported infections, they are expensive to carry out and not sustainable in the long run. It is even more challenging to perform real-time serological studies as there are unavoidable delays between sample collection and laboratory tests. The situation is even worse during the early stages of a novel pandemic when proper surveillance channels are not set up. Serological studies can also suffer from sampling biases [9], and heuristics have been designed to account for them.
Other lines of work to estimate unreported infections for COVID-19 try to exploit existing influenza surveillance systems to estimate symptomatic COVID-19 infections [42], due to symptomatic similarities between the two diseases. However, they also suffer from numerous issues ranging from ad-hoc corrections to account for the symptomatic similarities, low resolution of the influenza surveillance systems (e.g., they fail to account for outbreaks in nursing homes), and inaccuracy stemming from seasonality of influenza when ILI surveillance systems are operating normally.
In the face of these surveillance challenges, data scientists and epidemiologists have devoted much time and effort to estimating the reported rate and forecasting future COVID-19 trends through models [41, 50, 14, 49, 38, 40, 56]. Usually the reported rate αreported is one of the parameters of the epidemiological model which needs to be calibrated/estimated. Although these epidemiological models, designed with expert knowledge, help guide policymakers, they can have several drawbacks. First, they often employ ad-hoc modelling assumptions such as equating unreported infections with case ascertainment [50] and/or predefining a set of parameters [22]. Second, they often use heuristics for parameter estimation, e.g., limiting the search space to a narrow range [28]. As a result, judging the quality of a specific parameterization pM of an epidemiological model M is hard. This may lead to multiple divergent parameterizations, a problem magnified by noisy reported infection data, especially during the early stages of a pandemic [27, 22]. As we show later in our results, small errors in estimates of reported rates may lead to very different forecasts of future trends as well as non-pharmaceutical intervention policy recommendations. Hence a principled approach to distinguish between these nearly equivalent epidemiological model parameterizations is needed in order to estimate the true extent of the outbreak.
To achieve this goal, in this paper, we propose a new information theory based approach. Imagine that we are given the ground truth extent of the outbreak i.e., the total infections Dgt. Let us call the parameterization we get by calibrating a given epidemiological model against Dgt as pgt. Note that the ground truth reported rate αreported(gt) is also one of the parameters in pgt; so we can easily derive the reported infections as Dreported = Dgt × αreported(gt). However in the real-world, we do not know Dgt and αreported(gt). Therefore, we leverage an information theoretic framework to find the total infections D by posing an optimization problem to select that D which minimizes the bits to describe the observed Dreported.
To describe D efficiently, we also encode p, some baseline parameterization, and p’, a candidate parameterization. Then the optimization is defined over all possible values of Model = (D, p’, p) with the goal of describing Data = Dreported efficiently. Following the convention in information theory, we use L(Model) to denote the number of bits required to encode the Model (i.e. D, p’ and p). Similarly, L(Data | Model) is the number of bits required to encode the Data, Dreported, given D, p’, and p. Then the overall objective of the optimization problem is to infer the optimal Model∗ which minimizes L(Model) + L(Data | Model).
The optimization framework described above follows the popular two-part Minimum Description Length (MDL) framework. The two-part MDL (also known as the sender-receiver framework) consists of hypothetical actors S and R. Sender S is in possession of Data and wants to transmit it to receiver R using as few bits as possible [31]. Hence, sender S searches for the best possible Model, which minimizes the overall cost of encoding and transmitting both the Model and the Data given the Model. Note that the two-part MDL (and MDL in general) does not make any assumption on the nature of the Data or the Model. As such, it has been widely used for numerous optimization problems ranging from network summarization [37] and causality inference [20] to failure detection in critical infrastructures [10]. MDL has also previously been used for some epidemiological problems, mainly in inferring patient-zero and associated infections in cascades over contact networks [48]. However, our technique is the first to propose an MDL-based approach on top of ODE-based epidemiological models, which are harder to formulate and optimize over.
We call our MDL-based optimization framework MdlInfer. We use the ODE model proposed in [24] as the base epidemiological model OM for MdlInfer throughout this article. It consists of compartments representing the Susceptible, Exposed, various levels of Symptomatic infections, Hospitalization, Death and Recovered states. Following past work [36], here we assume that a fraction of symptomatic infections are reported each day and the rest are unreported. The parameters to be inferred include the transmission rate, the proportional reduction in transmission after shelter-in-place orders, the number of initial infections, and the reported rate (see the Materials and Methods section for more detail). We construct a baseline using the standard approach of estimating the reported rate via the usual model calibration. More specifically, we calibrate OM using Dreported following the procedure in [24], and use this as the baseline parameterization BaseParam, represented by p. On the other hand, the parameterization of OM resulting from our approach is termed MdlParam and is represented by p’.
Our experiments show that MdlParam is superior to BaseParam in a variety of tasks. An example is shown in Fig. 1. On March 2, 2020, Santa Clara county had only 11 reported COVID-19 infections. The best version of BaseParam estimated that there were 330 total infections (colored as light green in the iceberg). On the other hand, our MdlParam gave an estimate of 697 total infections shown below the iceberg, which is closer to the total infections estimated from serological studies [17]. In the following Results section, we show that MdlInfer leads to more accurate estimates of unreported infections and of the symptomatic rate than baseline parameterizations. It also leads to a better fit and better projections of reported infections. We also demonstrate that MdlInfer estimates the reported rate more accurately over time and can aid in policy making through analysis of counter-factual non-pharmaceutical interventions.
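As a quick check on the Fig. 1 example, and taking the reported rate to be the ratio of reported to total infections, the two Santa Clara estimates for March 2, 2020 imply the following reported rates (rounded). The subscripted symbols are shorthand introduced here for illustration only.

```latex
\[
\alpha_{\text{BaseParam}} \approx \frac{11}{330} \approx 3.3\%,
\qquad
\alpha_{\text{MdlParam}} \approx \frac{11}{697} \approx 1.6\%
\]
```

Under this reading, the MdlParam estimate implies a reported rate roughly half that implied by BaseParam.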
Results
Here we present our empirical findings on a set of experiments in different geographical regions and different time periods. We chose these regions based on the severity of the outbreak in the early stages of the pandemic and the availability of serological studies and symptomatic surveillance data. In each region, we divide the timeline into two periods: (i) the observed period, when the number of reported infections is available and the models are trained to learn the parameters, and (ii) the future period, where we evaluate the forecasts generated by the epidemiological models trained in the observed period. Note that the inferred parameters include the proportional reduction in transmission after the shelter-in-place order. Hence one would expect a more accurate parameterization to generalize better regardless of policy changes. The division between the periods is chosen based on either (i) the onset of non-pharmaceutical intervention (NPI) policy, where we choose the division date based on the onset of interventions such as stay-at-home orders, or (ii) a random partition where no such information is available.
The days prior to the start of NPI policy are in the observed period and the remaining days are in the future period. Note that the reported infections in the future period are not made available to the models. However, epidemiological model parameters include the proportional reduction in transmission rate post shelter-in-place orders. Hence, one would expect that the better parameterization leads to a better forecast. Next we describe our experiments and findings in detail.
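The split into observed and future periods can be sketched as follows. This is a minimal illustration, assuming the reported infections are held as a date-indexed mapping; the function name `split_periods`, the counts, and the division date are all hypothetical.

```python
from datetime import date
from typing import Dict, Tuple


def split_periods(
    reported: Dict[date, int], division: date
) -> Tuple[Dict[date, int], Dict[date, int]]:
    """Days before `division` form the observed period; the rest form the future period."""
    observed = {day: n for day, n in reported.items() if day < division}
    future = {day: n for day, n in reported.items() if day >= division}
    return observed, future


# Made-up counts and division date, for illustration only.
reported = {
    date(2020, 3, 14): 3,
    date(2020, 3, 15): 5,
    date(2020, 3, 16): 8,
    date(2020, 3, 17): 13,
}
observed, future = split_periods(reported, division=date(2020, 3, 16))
print(len(observed), len(future))
```

Only the observed part would be passed to calibration; the future part is held back for evaluating forecasts.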
To aid understanding, we also list the notation in the Supplementary Information.
Estimating Total Infections
MdlParam estimates total infections more accurately than BaseParam
We plotted total infections, MdlParamTinf, inferred by MdlParam (green curve) against the point estimates of the total infections calculated from serological studies SeroStudyTinf (red dot) in the same figure (See Fig. 2 top row). Note that to compare with SeroStudyTinf, we are using the cumulative value of estimated total infections. The vertical red line shows 95% confidence interval on the estimate made by the serological studies. The blue curve represents the total infections, BaseParamTinf, estimated by BaseParam.
As seen in the figure, the total infections due to MdlParam fall within the confidence interval of the estimates given by serological studies. Fig. 2 (c) shows the results for the Western Washington region, where the baseline parameterization overestimates the total infections. However, MdlInfer correctly predicts a lower number of total infections. This observation, in conjunction with the results in Santa Clara and Bucks county, shows that MdlInfer can improve upon the baseline parameterization in either direction (i.e., by increasing or decreasing the total infections) as necessary.
To quantify the performance gap between the two approaches, we define the metric ρTinf as the ratio of BaseParam’s absolute error to MdlParam’s absolute error with respect to the serological estimate, ρTinf = |BaseParamTinf − SeroStudyTinf| / |MdlParamTinf − SeroStudyTinf|. In Fig. 2 (d), we plot ρTinf. Note that values of ρTinf greater than 1 imply that MdlParam’s estimate of the total infections is closer to the serological studies’ estimate than BaseParam’s. Overall, ρTinf is usually larger than 1 (1.50, 7.40, and 9.62 in (a)-(c)), which indicates that MdlParam generally performs better in estimating total infections than BaseParam. Additional results in the Supplementary Information lead to a similar conclusion.
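The comparison metric can be sketched in a few lines. This is a minimal illustration, assuming the metric is the ratio of BaseParam's absolute error to MdlParam's absolute error against the serological point estimate (consistent with values above 1 favoring MdlParam); the function name and the numbers are made up and are not taken from the figures.

```python
def closeness_ratio(base_estimate: float, mdl_estimate: float, reference: float) -> float:
    """Ratio of the baseline's absolute error to the MDL estimate's absolute error.

    A value above 1 means the MDL estimate is closer to the reference.
    """
    mdl_error = abs(mdl_estimate - reference)
    if mdl_error == 0:
        return float("inf")  # the MDL estimate matches the reference exactly
    return abs(base_estimate - reference) / mdl_error


# Made-up numbers for illustration only.
rho = closeness_ratio(base_estimate=300.0, mdl_estimate=900.0, reference=1000.0)
print(rho)
```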
Estimating Symptomatic Rate
MdlParam estimates the symptomatic rate more accurately than BaseParam
We validate this observation using Facebook’s symptomatic surveillance data [8]. Here we plot the inferred symptomatic rate over time and overlay the estimates and confidence interval from the symptomatic surveillance data (see Fig. 2 bottom row). The green and blue curves are the MdlParam and BaseParam symptomatic rates, MdlParamSymp and BaseParamSymp respectively. As seen in the figure, the ground truth symptomatic rate RateSymp (represented by red plus symbols) aligns more closely with MdlParamSymp. Using a similar definition as earlier, we define a quantitative performance metric ρSymp to compare MdlParam against BaseParam in estimating the symptomatic rate: the ratio of the error of BaseParamSymp to the error of MdlParamSymp, each measured against RateSymp. In Fig. 2 (h), we plot ρSymp. We notice that ρSymp is larger than 1 in all settings (1.71, 1.29, and 6.36 in (e)-(g)), indicating that MdlParam performs better than BaseParam in estimating the symptomatic rate. Additional results in the Supplementary Information lead to a similar conclusion.
To summarize, these two sets of experiments together demonstrate that BaseParam fails to estimate the true unreported infections. On the other hand, MdlParam infers unreported infections that are closer to the ones estimated by serological studies and symptomatic surveillance data.
Estimating Reported Infections
MdlParam leads to better fit and projection than BaseParam at different stages of the COVID-19 epidemic
We measure success by how closely the estimated reported infections track the observed ones. Here we train MdlParam on the observed period and calibrate BaseParam using the same period. We then use the trained models to forecast the reported infections in the future period, which was not accessible to the models during training. The results are summarized in Fig. 3.
In Fig. 3 (a) to (f), the vertical grey dashed line divides the observed and future periods. The black plus symbols represent the New York Times reported infections NYT-Rinf. Similarly, the blue curve depicts the number of reported infections, BaseParamRinf, as per BaseParam. Finally, the green curve represents the number of reported infections, MdlParamRinf, as per MdlParam. Note that the curves to the right of the vertical grey line are the future predictions. As seen in the figure, NYT-Rinf aligns more closely with MdlParamRinf than with BaseParamRinf, indicating the superiority of our approach in estimating reported infections for both the observed and future periods.
We again define a performance metric ρRinf to compare MdlParam against BaseParam in estimating reported infections: the ratio of the error of BaseParamRinf to the error of MdlParamRinf, each measured against NYT-Rinf. Fig. 3 (g) and (h) show ρRinf, with (h) covering the future period. In both (g) and (h), we observe that ρRinf is more than 1. This further shows that MdlParam results in a better fit on reported infections than BaseParam. We also notice that MdlParam’s gain is minimal when the base epidemiological model is not expressive enough to capture the observed trend (see Fig. 3 (e)). However, such cases are rare.
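For time series such as reported infections, the same kind of ratio needs an error measure over many days. The sketch below assumes root-mean-square error (RMSE) as that measure, which is our choice for illustration and is not confirmed by the text; the `start` index restricts the evaluation to one period, e.g. the future period. All series are made up.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root-mean-square error between two equally long, non-empty series."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def error_ratio(base: Sequence[float], mdl: Sequence[float],
                observed: Sequence[float], start: int = 0) -> float:
    """Baseline RMSE divided by MDL RMSE from day `start` onward (above 1 favors MDL)."""
    mdl_error = rmse(mdl[start:], observed[start:])
    if mdl_error == 0:
        return float("inf")
    return rmse(base[start:], observed[start:]) / mdl_error


# Made-up series; the future period is assumed to start at index 2.
observed = [10, 20, 30, 40]
base_forecast = [10, 20, 36, 48]
mdl_forecast = [10, 20, 33, 44]
rho_future = error_ratio(base_forecast, mdl_forecast, observed, start=2)
print(rho_future)
```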
Note that Fig. 3 (a) to (c) correspond to the early stage of the COVID-19 epidemic in spring 2020, and Fig. 3 (d) to (f) correspond to fall 2020. In both cases, the value of ρRinf is greater than 1, indicating that MdlParam outperforms BaseParam and that MdlInfer performs well in estimating temporal patterns at different stages of the COVID-19 epidemic. Additional results in the Supplementary Information lead to a similar conclusion.
Evaluating effect of Non-pharmaceutical Interventions
MdlParam reveals that a large majority of COVID-19 infections were unreported
We plotted the reported rate inferred by MdlParam using the icebergs (see Fig. 1 (d)). The height of the iceberg above the water surface (shown in white) and the height below the water (in blue) are proportional to the number of reported and inferred unreported infections respectively. As seen in the figure, the height above the water is significantly smaller than the height under the water across all regions, implying that the majority of the infections were actually unreported. We also notice that the reported rate for New York City is extremely low when compared with other regions. This could be explained by the fact that New York City has more international traffic, so COVID-19 may have been spreading undetected there for longer than in other regions.
We also computed the reported rate MdlParamRate measured by the ratio of the cumulative value of reported infections to the total infections estimated by MdlParam over time and plotted it in Fig. 4 (a).
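The reported rate over time can be sketched as a running ratio of cumulative counts. This is a minimal illustration, assuming both inputs are daily new counts and that the total infections are also accumulated; the function name and the numbers are hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence


def reported_rate_over_time(new_reported: Sequence[float],
                            new_total: Sequence[float]) -> List[float]:
    """Cumulative reported infections divided by cumulative total infections, per day."""
    if len(new_reported) != len(new_total):
        raise ValueError("series must have equal length")
    cumulative_reported = accumulate(new_reported)
    cumulative_total = accumulate(new_total)
    # The rate is undefined (NaN) until at least one infection has occurred.
    return [r / t if t > 0 else float("nan")
            for r, t in zip(cumulative_reported, cumulative_total)]


# Made-up daily counts for illustration only.
rates = reported_rate_over_time(new_reported=[0, 1, 1, 2], new_total=[5, 5, 10, 20])
print(rates)
```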
As seen in the figure, the reported rate MdlParamRate sharply increased in early February in Santa Clara county, CA followed by a gradual decay and eventual rise. This observation is explained by the fact that Santa Clara was one of the first counties to observe the imported infections. However, the community spread driven COVID-19 outbreaks were not reported until late February. The observation that MdlParamRate is extremely low in early March also fits the earlier study [42].
Non-pharmaceutical interventions on asymptomatic and presymptomatic infections are essential to control the COVID-19 epidemic
Our simulations show that non-pharmaceutical interventions on asymptomatic and presymptomatic infections are essential to control COVID-19. Here, we plotted the simulated reported infections generated by MdlParam in Fig. 4 (c) (green curve). We then repeated the simulation of reported infections for 5 different scenarios: isolating just the reported infections, just the symptomatic infections, and the symptomatic infections in addition to 25%, 50%, and 75% of both asymptomatic and presymptomatic infections. In our setup, we assume that infectiousness is reduced by half when a person is isolated. As seen in the figure, when only the reported infections are isolated, there is almost no change in the “future” reported infections. However, when we isolate both the reported and symptomatic infections, the reported infections decrease significantly. Even here, the reported infections are still on the rise. On the other hand, non-pharmaceutical interventions on some fraction of asymptomatic and presymptomatic infections lead to a decrease in reported infections. Thus, we can conclude that non-pharmaceutical interventions on asymptomatic and presymptomatic infections are essential in controlling the COVID-19 epidemic.
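The what-if scenarios can be illustrated with a toy discrete-time compartmental model. This is only a sketch and not the model OM from [24]: the compartments are simplified, every parameter value is an assumption made for illustration, and the only element taken from the text is that isolation halves infectiousness.

```python
REPORTED_SHARE = 0.2  # assumed fraction of symptomatic infections that are reported


def simulate_reported(isolate_symptomatic: float, isolate_hidden: float,
                      days: int = 40) -> float:
    """Cumulative reported infections in a toy model.

    isolate_symptomatic: fraction of symptomatic infections that are isolated.
    isolate_hidden: fraction of asymptomatic and presymptomatic infections isolated.
    """
    # Illustrative rates: transmission, E exit, symptom onset, recovery.
    population = 1_000_000.0
    beta, sigma, delta, gamma = 0.4, 1 / 3, 1 / 2, 1 / 5
    asymptomatic_share = 0.4
    # Isolation halves infectiousness for the isolated fraction.
    w_symptomatic = 1.0 - 0.5 * isolate_symptomatic
    w_hidden = 1.0 - 0.5 * isolate_hidden

    s, e, p, a, i = population - 10.0, 10.0, 0.0, 0.0, 0.0
    cumulative_reported = 0.0
    for _ in range(days):
        pressure = w_hidden * (p + a) + w_symptomatic * i
        new_exposed = min(s, beta * pressure * s / population)
        leaving_exposed = sigma * e
        new_symptomatic = delta * p
        s -= new_exposed
        e += new_exposed - leaving_exposed
        p += (1 - asymptomatic_share) * leaving_exposed - new_symptomatic
        a += asymptomatic_share * leaving_exposed - gamma * a
        i += new_symptomatic - gamma * i
        cumulative_reported += REPORTED_SHARE * new_symptomatic
    return cumulative_reported


# (fraction of symptomatic isolated, fraction of hidden infections isolated)
scenarios = {
    "no isolation": (0.0, 0.0),
    "reported only": (REPORTED_SHARE, 0.0),
    "all symptomatic": (1.0, 0.0),
    "symptomatic + 25% hidden": (1.0, 0.25),
    "symptomatic + 50% hidden": (1.0, 0.50),
    "symptomatic + 75% hidden": (1.0, 0.75),
}
results = {name: simulate_reported(*args) for name, args in scenarios.items()}
for name, value in results.items():
    print(f"{name}: {value:.1f}")
```

In this toy setting, isolating only the reported infections changes the outcome little because they are a small share of all infectious individuals, which mirrors the qualitative pattern described above.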
Accuracy of non-pharmaceutical intervention simulations relies on the good inference of unreported infections
Here we also plotted the simulated reported infections generated by BaseParam in Fig. 4 (b) (blue curve). As seen in the figure, BaseParam suggests that non-pharmaceutical interventions on symptomatic infections alone would be enough to control the COVID-19 epidemic. However, this has been shown to be incorrect by prior studies and real-world observations [43]. Therefore, we can conclude that the accuracy of non-pharmaceutical intervention simulations relies on the quality of the inferred parameterization.
Discussion and Future Work
In this study, we propose MdlInfer, a data-driven model selection approach that automatically estimates unreported infections based on an optimal parameterization of epidemiological models. Our approach leverages the information theoretic Minimum Description Length (MDL) principle to select the unreported infections which best describe the observed outbreak. Our approach addresses several gaps in current practice, including the long-term infeasibility of serological studies [17, 32] and a lack of formal model selection over epidemiological models. MdlInfer employs a principled method that searches over a large space of parameterizations and selects the one that describes the outbreak best, while existing epidemiological models [24, 41, 45] typically rely on a parameterization inferred by ad-hoc heuristics. The MdlInfer framework can also be adapted to work on a set of calibrated parameterizations to generate uncertainty estimates.
Overall, our results show that MdlInfer estimates total infections and the symptomatic rate at various geographical scales more accurately than baseline parameterization from both directions, i.e., it corrects both over- and under-estimates. For example, compared to the baseline parameterization, we correctly estimate 10792 more infections by April 1 in Santa Clara, CA (Fig. 2 (a)), and 214710 fewer infections in Western Washington (Fig. 2 (c)). We also show that MdlInfer leads to better fit of the reported infections in the observed period and more accurate forecasts for the future period than the baseline parameterization. We also reveal that a large majority of COVID-19 infections were unreported, where non-pharmaceutical interventions on unreported infections can help to mitigate the COVID-19 outbreak. Although our results are slightly inconsistent in matching the symptomatic rate, which can be attributed to the noise in the surveillance data, we are still fitting the symptomatic rate better than the baseline parameterization. And our results show consistent performance with respect to the reported infections and serological studies.
The MdlInfer framework is likely to be useful in surveillance of COVID-19 in the near future, and for future epidemics. Even with the U.S. returning to normalcy, surveillance of the pandemic is still important for public health. Daily incidence of COVID-19 has decreased since early 2021, according to the CDC COVID Data Tracker portal [6]. However new variants of the SARS-CoV-2 (e.g., the delta and gamma variants) have been spreading rapidly [39, 47, 30]. Testing for these, and large-scale surveillance via laboratory tests may be limited and less systematic than what was done for COVID-19 in the past few months. In such settings, using our MdlInfer framework, epidemiologists and policymakers can improve the accuracy of estimates of total infections (without large-scale serological studies), as well as forecasts of their models.
One of the limitations of our work is that the benefits of using MdlInfer depend on the suitability of the base epidemiological model. If the base epidemiological model is not expressive enough for the observed data, then the gains from MdlInfer may not be significant. As future work, it may be useful to adapt MdlInfer to give a measure of the quality of the base epidemiological model. We also note that MdlInfer is built on ODE-based epidemiological models; other kinds of epidemic models, e.g., agent-based models [25, 52, 34, 55, 23, 46], are more suitable in some settings. It would be interesting to extend MdlInfer to incorporate such models. Finally, there is significant population heterogeneity in disease outcomes, e.g., the severity rate and mortality rate of COVID-19 infections differ across age groups [44, 33], which has not been considered in our work.
To summarize, MdlInfer is a robust data-driven method to accurately estimate unreported infections, which will help data scientists, epidemiologists, and policy makers to further improve existing ODE-based epidemiological models, make accurate forecasts, and combat the ongoing COVID-19 pandemic. More generally, MdlInfer opens up a new line of research in epidemiology using information theoretic methods.
Materials and Methods
Data
We are using the following publicly available datasets for our study.
New York Times reported infections dataset: The New York Times reported infections dataset, or NYT-Rinf, consists of the daily time sequence of reported infections Dreported and the mortality Dmortality (cumulative values) for each county in the US starting from January 21, 2020 [7].
Serological studies: The serological studies [17, 32] consist of the point estimate and 95% confidence interval of the prevalence of antibodies to SARS-CoV-2. Using the population and the prevalence of antibodies, we can compute the estimated total infections SeroStudyTinf and its 95% confidence interval in the location.
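The conversion from antibody prevalence to estimated total infections is a direct scaling by population. A minimal sketch follows, assuming prevalence is expressed as a fraction between 0 and 1; the function name and the numbers are hypothetical and do not come from the cited studies.

```python
from typing import Tuple


def total_infections_from_prevalence(
    population: int, prevalence: float, ci_low: float, ci_high: float
) -> Tuple[float, float, float]:
    """Scale a prevalence point estimate and its confidence bounds by the population."""
    for value in (prevalence, ci_low, ci_high):
        if not 0.0 <= value <= 1.0:
            raise ValueError("prevalence values must be fractions in [0, 1]")
    return (population * prevalence, population * ci_low, population * ci_high)


# Made-up population and prevalence, for illustration only.
point, low, high = total_infections_from_prevalence(
    population=1_000_000, prevalence=0.02, ci_low=0.01, ci_high=0.04
)
print(point, low, high)
```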
Symptomatic surveillance data: The symptomatic surveillance data [8] consists of point estimate RateSymp and confidence interval of the COVID-related symptomatic rate for each county in the US starting from April 6, 2020. The survey asks a series of questions designed to help researchers understand the spread of COVID-19 and its effect on people in the United States.
Our Approach
Minimum Description Length
We formulate the problem of inferring COVID-19 unreported infections using the Minimum Description Length (MDL) principle [10, 31]. Here we use the two-part sender-receiver framework. The goal of the framework is to transmit the Data from the sender S to the receiver R using a Model. We do this by identifying the Model that describes the Data such that the total number of bits needed to represent both the Model and the Data is minimized. We express this as a cost function composed of two parts:
Model cost L(Model): The cost of describing the Model.
Data cost L(Data|Model): The cost of describing Data given the Model.
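The two-part cost can be illustrated on a toy problem unrelated to epidemiology. In the sketch below, the Data is a short integer sequence, a candidate Model is a line with integer intercept and slope, and integers are charged the length of an Elias-gamma code plus a sign bit. This encoding is our assumption for illustration; the encoding actually used by MdlInfer is described in the Supplementary Information.

```python
from typing import Sequence, Tuple

Line = Tuple[int, int]  # (intercept, slope)


def int_bits(n: int) -> int:
    """Bits for a signed integer: Elias-gamma code of |n| + 1 plus one sign bit."""
    return 2 * (abs(n) + 1).bit_length()


def model_cost(model: Line) -> int:
    """L(Model): bits needed to send the intercept and the slope."""
    intercept, slope = model
    return int_bits(intercept) + int_bits(slope)


def data_cost(data: Sequence[int], model: Line) -> int:
    """L(Data | Model): bits needed to send the residuals left by the model."""
    intercept, slope = model
    return sum(int_bits(y - (intercept + slope * t)) for t, y in enumerate(data))


def total_cost(data: Sequence[int], model: Line) -> int:
    return model_cost(model) + data_cost(data, model)


data = [3, 5, 7, 9, 11, 13]
candidates = [(8, 0), (3, 2)]  # a constant model and a linear model
best = min(candidates, key=lambda m: total_cost(data, m))
print(best, total_cost(data, best))
```

Both candidates cost the same to send here, so the choice is driven by the residuals: the linear model leaves only zeros, which are cheap to describe.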
MDL Formulation
Before formalizing the model space, L(Model) and L(Data | Model), we will first define some notations and concepts.
In this work, we term the COVID-19 reported infections as Dreported, candidate total infections as D (which corresponds to unreported infections Dunreported), and mortality as Dmortality. We also term the epidemiological model [24] as OM.
For OM, we define the calibration on Dreported and Dmortality as the procedure to find the parameterization p that minimizes the negative log-likelihood between Dreported(p) and Dreported, and between Dmortality(p) and Dmortality. Here, Dreported(p) is the OM output reported infections based on p, and Dmortality(p) is the OM output mortality based on p. In this work, p is the baseline parameterization. We write this procedure as follows:
$$p = \mathrm{Calibrate}(O_M, \{D_{\mathrm{reported}}, D_{\mathrm{mortality}}\})$$
Note that the procedure can easily be adapted to omit Dmortality.
From p, we can generate the OM output unreported infections Dunreported(p), and total infections D(p) = Dreported(p) + Dunreported(p). We can also calculate the baseline reported rate p[αreported] as the ratio of Dreported(p) to D(p). Similarly, we can also calibrate OM on Dunreported = D − Dreported, Dreported, and Dmortality to find the parameterization p’. We write this procedure as follows:
$$p' = \mathrm{Calibrate}(O_M, \{D - D_{\mathrm{reported}}, D_{\mathrm{reported}}, D_{\mathrm{mortality}}\})$$
From p’, we can generate the OM output unreported infections Dunreported(p’), reported infections Dreported(p’), and total infections D(p’) = Dreported(p’) + Dunreported(p’). We can also calculate the reported rate p’[αreported] as the ratio of Dreported(p’) to D(p’). With the notations and concepts defined above, we will next formalize the model space, L(Model) and L(Data|Model).
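Calibration in this sense can be sketched with a toy stand-in for OM. Everything below is an assumption made for illustration: the model is plain exponential growth rather than the compartmental model of [24], the likelihood is Poisson, the objective is the negative log-likelihood, and the search is an exhaustive grid.

```python
import math
from itertools import product
from typing import Dict, List, Sequence


def simulate(initial: float, growth: float, days: int) -> List[float]:
    """Toy stand-in for the model output: expected daily reported infections."""
    return [initial * growth ** t for t in range(days)]


def poisson_nll(expected: Sequence[float], observed: Sequence[int]) -> float:
    """Negative log-likelihood of the observed counts under Poisson means `expected`."""
    return sum(mu - k * math.log(mu) + math.lgamma(k + 1)
               for mu, k in zip(expected, observed))


def calibrate(observed: Sequence[int], initial_grid: Sequence[float],
              growth_grid: Sequence[float]) -> Dict[str, float]:
    """Return the grid point with the smallest negative log-likelihood."""
    initial, growth = min(
        product(initial_grid, growth_grid),
        key=lambda p: poisson_nll(simulate(p[0], p[1], len(observed)), observed),
    )
    return {"initial": initial, "growth": growth}


# Made-up observations generated by initial=2 and growth=2.
observed = [2, 4, 8, 16, 32]
fitted = calibrate(observed, initial_grid=[1, 2, 3], growth_grid=[1.5, 2.0, 2.5])
print(fitted)
```

A second series such as mortality could be added by summing a second likelihood term into the objective.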
Model Space
In this work, the Data to describe is Dreported, and therefore the most natural Model would have been Model = (p), as it directly corrects from the BaseParam reported infections Dreported(p) to describe Dreported. However, this has the disadvantage that the model space could be so fragile that slightly different p could lead to vastly different costs. To account for this, we propose our Model as Model = (D, p’, p), which consists of three components. We use D to reparameterize p’, use p to send p’, and use p’ and D to correct from p’[αreported] × D to send Dreported.
Model Cost
With the model space proposed above, the sender S will send the Model = (D, p’, p) to the receiver R in three parts:
First send the p.
Next send the p’ given p.
Then send D given p’ and p.
Therefore the MDL Model Cost L(D, p’, p) has three components:
$$L(D, p', p) = L(p) + L(p' \mid p) + L(D \mid p', p)$$
Specifically, we send p directly, send p’ given p by sending p’ − p, and send D given p’ and p by sending p’[αreported] × D − Dreported(p). We write this down as below:
$$L(D, p', p) = \mathrm{Cost}(p) + \mathrm{Cost}(p' - p) + \mathrm{Cost}(p'[\alpha_{\mathrm{reported}}] \times D - D_{\mathrm{reported}}(p))$$
The details of the encoding method used to express the cost of each of the three components can be found in the Supplementary Information.
Data Cost
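Next we need to send (describe) the Data, the reported infections Dreported, in terms of the Model. Given the Model = (D, p’, p), we describe Dreported by describing its correction from p’[αreported] × D, the reported infections implied by the Model. We write the MDL Data Cost as L(Dreported | D, p’, p).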
Total MDL Cost
Putting both the MDL Model Cost and MDL Data Cost together, the Total MDL Cost is
$$L(D_{\mathrm{reported}}, D, p', p) = L(D, p', p) + L(D_{\mathrm{reported}} \mid D, p', p)$$
Problem Statement
With L(Dreported, D, p’, p) formulated above, we can state the problem as one of searching for the best total infections D∗: given the time sequence Dreported and the epidemiological model OM, find D∗ that minimizes the MDL total cost, i.e.,
$$D^{*} = \arg\min_{D} L(D_{\mathrm{reported}}, D, p', p)$$
Algorithm
Next, we present our algorithm to find D∗. Note that naively searching for D∗ directly is intractable. An alternative is to first find a good reported rate quickly, since this lets us constrain and reduce the search space. Then, with the reported rate from step 1, we can search for the optimal D∗. Hence we propose a two-step search algorithm to find D∗. The key steps of our algorithm are as follows:
First, we do a linear search to find a good reported rate.
Given the reported rate found above, we use an optimization method to find the D∗ that minimizes L(Dreported, D, p’, p) with constraints.
Step 1: Find the reported rate
In step 1, we do a linear search on αreported to find the . We formulate this as below where the in the linear search. This helps to reduce the search space. Hence, the p’ is For stability and robustness, as Dreported can be noisy, we use D(p’) instead of as D in the MDL Total Cost in this step.
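The linear search in step 1 can be illustrated with a minimal sketch. The cost function here is only a toy stand-in, not the MDL Total Cost of this paper: the names `bits`, `toy_total_cost`, and `linear_search_alpha`, the grid step, and the example series are all hypothetical, and the real procedure would also re-derive p’ and D(p’) for every candidate rate.

```python
import math

def bits(x):
    """Crude stand-in for an encoding cost: bits for an integer residual."""
    return math.log2(1 + abs(round(x)))

def toy_total_cost(alpha, d_reported, d_total_model):
    """Hypothetical cost: bits to encode alpha * D - D_reported, summed over time.

    This is NOT the paper's MDL cost, only a placeholder with the same role.
    """
    return sum(bits(alpha * d - r) for d, r in zip(d_total_model, d_reported))

def linear_search_alpha(d_reported, d_total_model, step=0.01):
    """Scan candidate reported rates on a fixed grid in (0, 1], keep the cheapest."""
    n = int(round(1 / step))
    candidates = [k * step for k in range(1, n + 1)]
    return min(candidates,
               key=lambda a: toy_total_cost(a, d_reported, d_total_model))

# Made-up series: reported infections are exactly 20% of the modeled totals.
d_total_model = [100, 220, 450, 900, 1600]
d_reported = [20, 44, 90, 180, 320]

alpha_hat = linear_search_alpha(d_reported, d_total_model)
print(round(alpha_hat, 2))
```

Because the search is one-dimensional and runs over a fixed grid, its cost grows linearly with the number of candidates, which is what makes it a cheap way to narrow the search space before step 2.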
Step 2: Find the D∗ given the reported rate
With the reported rate found in step 1, we next find the D∗ that minimizes the MDL cost. Since we have already found a good reported rate, we constrain D∗ as below. We then use the Nelder-Mead method [29] to solve this constrained optimization problem for D∗, and we initialize the search from . We describe the two-step algorithm in more detail in the Supplementary Information.
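As a rough illustration of step 2, the sketch below runs a simplified Nelder-Mead simplex search with box constraints enforced by clipping. Everything specific in it is an assumption made for the example: the quadratic `toy_cost` replaces the MDL Total Cost, the starting point Dreported / α and the ±20% box around it are illustrative choices rather than the constraint and initialization used in this work, and `nelder_mead_box` is a hypothetical helper that uses inside contraction only.

```python
def nelder_mead_box(f, x0, lower, upper, rel_step=0.05, max_iter=2000, tol=1e-10):
    """Simplified Nelder-Mead; points are clipped into [lower, upper]."""
    def clip(x):
        return [min(max(v, lo), hi) for v, lo, hi in zip(x, lower, upper)]

    n = len(x0)
    # Initial simplex: x0 plus one perturbed copy per dimension.
    simplex = [clip(x0)]
    for i in range(n):
        x = list(x0)
        x[i] += rel_step * (abs(x[i]) or 1.0)
        simplex.append(clip(x))

    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        if f(worst) - f(best) < tol:
            break
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]

        def move(t):
            # Point on the line through the centroid and the worst vertex.
            return clip([c + t * (c - w) for c, w in zip(centroid, worst)])

        reflected = move(1.0)
        if f(reflected) < f(best):
            expanded = move(2.0)
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = move(-0.5)
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink every vertex halfway toward the best one.
                simplex = [best] + [[b + 0.5 * (v - b) for b, v in zip(best, p)]
                                    for p in simplex[1:]]
    return min(simplex, key=f)

# Toy setup (all values made up): alpha from step 1, a reported series,
# and a model-based total series that the toy cost pulls D toward.
alpha = 0.2
d_reported = [20.0, 44.0, 90.0]
d_model = [120.0, 200.0, 500.0]

def toy_cost(d):
    """Placeholder objective, not the MDL cost."""
    fit = sum((alpha * x - r) ** 2 for x, r in zip(d, d_reported))
    prior = 0.01 * sum((x - m) ** 2 for x, m in zip(d, d_model))
    return fit + prior

d_init = [r / alpha for r in d_reported]   # assumed starting point
lower = [0.8 * x for x in d_init]          # assumed box constraint
upper = [1.2 * x for x in d_init]
d_star = nelder_mead_box(toy_cost, d_init, lower, upper)
print([round(x, 2) for x in d_star])
```

Nelder-Mead needs only function evaluations, which suits a cost that is defined through encoding lengths and has no convenient gradient. Clipping is the simplest way to respect a box constraint, although it can flatten the simplex when the optimum lies on the boundary.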
Calibration
The epidemiological model [24] consists of 10 states: susceptible S, exposed E, pre-symptomatic IP, severe symptomatic IS, mild symptomatic IM, asymptomatic IA, hospitalized (eventual death) HD, hospitalized (eventual recovery) HR, recovered R, and dead D. The calibration process described in [24] infers only the transmission rate β0 (the transmission rate in the absence of interventions), σ (the proportional reduction in β0 under shelter-in-place), and E0 (the number of initial infections). The other parameters are fixed. During inference, the model calibrates on mortality.
We extend the calibration process described above to include the reported infections and unreported infections. Note that we do not make any structural changes to the existing model. Similar to [36], in our calibration process we compute the new reported infections and new unreported infections as follows:
New reported = α1 × (dIP IS + dIP IM): Here dIP IS + dIP IM is the number of new symptomatic infections every day. We assume that a proportion α1 of the new symptomatic infections is reported every day.
New unreported = (1 − α1) × (dIP IS + dIP IM) + dEIA.
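The two formulas above translate directly into code. The sketch below is illustrative only: the function name `split_new_infections` and the example flows are hypothetical, and the daily flows into IS, IM, and IA are assumed to be already available from the epidemiological model.

```python
def split_new_infections(d_ip_is, d_ip_im, d_e_ia, alpha1):
    """Split daily new infections into reported and unreported counts.

    d_ip_is, d_ip_im: daily new severe / mild symptomatic infections.
    d_e_ia: daily new asymptomatic infections.
    alpha1: proportion of new symptomatic infections that are reported.
    """
    if not 0.0 <= alpha1 <= 1.0:
        raise ValueError("alpha1 must lie in [0, 1]")
    new_reported, new_unreported = [], []
    for severe, mild, asymptomatic in zip(d_ip_is, d_ip_im, d_e_ia):
        symptomatic = severe + mild
        # Reported: a fraction alpha1 of the new symptomatic infections.
        new_reported.append(alpha1 * symptomatic)
        # Unreported: the remaining symptomatic plus all asymptomatic ones.
        new_unreported.append((1 - alpha1) * symptomatic + asymptomatic)
    return new_reported, new_unreported

# Made-up flows for two days.
reported, unreported = split_new_infections(
    d_ip_is=[2, 4], d_ip_im=[8, 16], d_e_ia=[5, 10], alpha1=0.25
)
print(reported)    # [2.5, 5.0]
print(unreported)  # [12.5, 25.0]
```

Each day's reported and unreported counts add up to all new symptomatic and asymptomatic infections, so the split changes only how infections are labeled, not how many occur.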
We also extend the calibration to infer two more parameters: α (the proportion of infections that are asymptomatic) and α1 (the proportion of new symptomatic infections that are reported). The extended epidemiological model OM now calibrates on β0, σ, E0, α, and α1. Hence, our vector p of model parameters is defined as p = [p[β0], p[σ], p[E0], p[α], p[α1]].
Calibration Results
From the parameterization vector p and OM, we can infer the reported infections Dreported(p), unreported infections Dunreported(p), and total infections D(p) = Dreported(p) + Dunreported(p). Here, we calculate the BaseParam total infections (cumulative values) as Note that BaseParam also estimates the number of symptomatic people, which can be calculated from the number of infections in state IS (severe symptomatic) and IM (mild symptomatic). With the number of symptomatic people, we can also calculate the COVID-19 related symptomatic rate among the population N : Here, we also calculate the BaseParam estimated reported infections and the reported rate From p’, we can also calculate similar measures for MdlParam.
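A minimal sketch of these derived quantities follows. Only the identity D(p) = Dreported(p) + Dunreported(p) is taken from the text. The definitions of the reported rate (cumulative reported over cumulative total on the last day) and of the symptomatic rate ((IS + IM) / N per day) are assumptions made for illustration, and all function names and numbers are hypothetical.

```python
from itertools import accumulate

def cumulative_totals(new_reported, new_unreported):
    """Cumulative reported, unreported, and total infections per day."""
    cum_reported = list(accumulate(new_reported))
    cum_unreported = list(accumulate(new_unreported))
    # D(p) = Dreported(p) + Dunreported(p), applied day by day.
    cum_total = [r + u for r, u in zip(cum_reported, cum_unreported)]
    return cum_reported, cum_unreported, cum_total

def reported_rate(cum_reported, cum_total):
    """Assumed definition: share of cumulative infections that were reported."""
    return cum_reported[-1] / cum_total[-1] if cum_total[-1] else 0.0

def symptomatic_rate(i_s, i_m, population):
    """Assumed definition: fraction of the population in IS or IM each day."""
    return [(s + m) / population for s, m in zip(i_s, i_m)]

# Made-up daily series for two days.
cum_rep, cum_unrep, cum_tot = cumulative_totals([2.5, 5.0], [12.5, 25.0])
rate = reported_rate(cum_rep, cum_tot)
sym = symptomatic_rate([10, 20], [30, 60], population=1000)
print(cum_tot)
```

The same helpers apply unchanged to series simulated from p’, which gives the corresponding MdlParam measures.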
Data Availability
Coronavirus in the U.S.: Latest Map and Case Count, 2020. URL https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html.
Eran Bendavid, Bianca Mulaney, Neeraj Sood, Soleil Shah, Rebecca Bromley-Dulfano, Cara Lai, Zoe Weissberg, Rodrigo Saavedra-Walker, Jim Tedrow, Andrew Bogan, et al. COVID-19 antibody seroprevalence in Santa Clara County, California. International Journal of Epidemiology, 50(2):410-419, 2021.
Fiona P Havers, Carrie Reed, Travis Lim, Joel M Montgomery, John D Klena, Aron J Hall, Alicia M Fry, Deborah L Cannon, Cheng-Feng Chiang, Aridth Gibbons, et al. Seroprevalence of antibodies to SARS-CoV-2 in 10 sites in the United States, March 23-May 12, 2020. JAMA Internal Medicine, 180(12):1576-1586, 2020.
Delphi's COVID-19 Surveys, 2020. URL https://delphi.cmu.edu/covidcast/surveys/.
Code and data have been deposited in GitHub [4].
Acknowledgements
This paper was partially supported by the NSF (Expeditions CCF-1918770 and CCF-1918656, CAREER IIS-2028586, RAPID IIS-2027862, Medium IIS-1955883, Medium IIS-2106961, IIS-1931628, IIS-1955797, IIS-2027848), NIH 2R01GM109718, CDC MInD program U01CK000589, ORNL, and funds/computing resources from Georgia Tech and GTRI. B. A. was in part supported by the CDC MInD-Healthcare U01CK000531-Supplement. A.V.’s work is also supported in part by grants from the UVA Global Infectious Diseases Institute (GIDI).