Abstract
One of the most significant challenges in the early combat against COVID-19 was the difficulty in estimating the true magnitude of infections. Unreported infections drove up disease spread in numerous regions and made it very hard to accurately estimate the infectivity of the pathogen, thereby hampering our ability to react effectively. Despite the use of surveillance-based methods such as serological studies, identifying the true magnitude is still challenging today. This paper proposes an information theoretic approach for accurately estimating the number of total infections. Our approach is built on top of Ordinary Differential Equations (ODE) based models, which are commonly used in epidemiology and for estimating such infections. We show how we can help such models to better compute the number of total infections and identify the parameterization by which we need the fewest bits to describe the observed dynamics of reported infections. Our experiments show that our approach leads to not only substantially better estimates of the number of total infections but also better forecasts of infections than standard model calibration based methods. We additionally show how our learned parameterization helps in modeling more accurate what-if scenarios with non-pharmaceutical interventions. Our results support earlier findings that most COVID-19 infections were unreported and that non-pharmaceutical interventions indeed helped to mitigate the spread of the outbreak. Our approach provides a general method for improving epidemic modeling which is applicable broadly.
Introduction
The COVID-19 pandemic has emerged as one of the most formidable public health challenges in recent history. By Nov 1, 2022, there were already more than 98 million reported infections and 1.07 million deaths in the United States alone. Worldwide, the reported infections summed to 636 million with at least 6.61 million deaths [19]. The devastating effects of COVID-19 extend to the economy as well. For example, in the US, the unemployment rate peaked at 15.8 percent in April 2020 [6], and US GDP contracted at a 3.5% annualized rate for 2020 [1]. Similar economic impacts have been observed worldwide.
One of the most significant challenges in the early combat against COVID-19 was estimating the number of total infections. A significant number of COVID-19 infections were unreported, due to various factors such as the lack of testing and asymptomatic infections [13, 11, 57, 55, 39]. The inability to estimate these unreported infections allowed them to drive up disease transmission in many regions. For example, phylogenetic studies revealed that COVID-19 had locally spread in Washington state before early 2020, when active community surveillance was implemented [14]. There were only 23 reported infections in five major U.S. cities by March 1, 2020, but it has been estimated that there were in fact more than 28,000 total infections by then [5]. Similar trends were observed in other countries, such as Italy, Germany, and the UK [60]. Despite having more advanced surveillance techniques such as serological studies, estimating the total number of infections continues to be a challenge for COVID-19 response even today [8, 30].
An accurate estimation of the number of total infections is a fundamental epidemiological question and critical for pandemic planning and response. Notwithstanding its importance, there is not even a commonly agreed-upon metric. One proposal is the case ascertainment rate, which is defined as the ratio of reported symptomatic infections to the actual number of symptomatic infections [52]. Another popular proposal is the reported rate αreported, which is defined as the ratio of reported infections to total infections [46]. This definition includes asymptomatic infections, which are known to contribute substantially to community transmission [58, 41]. In this paper, we focus on this particular measure.
However, estimating the reported rate is challenging, and as a result all current methods have their limitations. One of the most effective current methods to identify the reported rate in a region is through large-scale serological studies [56, 26, 64]. These surveys use blood tests to identify the prevalence of antibodies against SARS-CoV-2 in a large population. The CDC COVID Data Tracker portal [2, 26] summarizes the results of serological studies conducted by commercial laboratories at a national level as well as at 10 specific sites. For example, the estimated reported rate was at most 0.1 in Minneapolis and South Florida as of April 2020. This means that there were at least 10 times more total infections than reported infections. While serological studies can give an accurate estimation, they are expensive and are not sustainable in the long run [4]. Furthermore, it is also challenging to obtain real-time data using such studies since there are unavoidable delays between sample collection and laboratory tests [2, 26]. Additional difficulties include sampling biases that make it necessary to use carefully designed heuristics to account for them [9]. Other methods include exploiting existing surveillance systems of related diseases like influenza, and using them to estimate symptomatic infections [40]. However, this can also be unreliable and requires ad-hoc corrections to account for the similarities between COVID-19 and influenza symptoms.
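The relationship between the reported rate and the number of total infections can be made concrete with a small sketch. The function names below are hypothetical and the count of 1,000 reported infections is an illustrative value, not data from the study; only the definition of the reported rate (reported infections divided by total infections) and the 0.1 example come from the text.

```python
def reported_rate(reported: float, total: float) -> float:
    """Reported rate: ratio of reported infections to total infections."""
    if total <= 0:
        raise ValueError("total infections must be positive")
    return reported / total


def implied_total(reported: float, rate: float) -> float:
    """Total infections implied by a reported count and a reported rate."""
    if not 0 < rate <= 1:
        raise ValueError("reported rate must be in (0, 1]")
    return reported / rate


# Illustrative values: with a reported rate of 0.1, every reported
# infection stands for ten total infections.
example_total = implied_total(1_000, 0.1)
print(example_total)
```

Because the serological estimate quoted above is an upper bound on the reported rate ("at most 0.1"), the implied total is a lower bound on the number of total infections.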
In the face of these challenges, data scientists and epidemiologists have devoted much time and effort to estimate the reported rate αreported through epidemiological models OM. By now, there exist carefully constructed models that capture the transmission dynamics of COVID-19 well [39, 52, 12, 50, 36, 38, 61, 25, 33, 62, 63, 17, 43]. In general, an epidemiological model OM has a set of parameters Θ that we estimate from observed data using a so-called calibration procedure, Calibrate. In practice, the data we use for calibration can be the time series of the number of reported infections, which we call Dreported. To estimate the number of total infections, these models often explicitly include the reported rate αreported as one of their parameters, or include multiple parameters that jointly account for it. There are many calibration procedures commonly used in the literature, such as RMSE-based [23] or Bayesian approaches [33, 25].
We call the above general methodology the basic approach to estimate the reported rate, or BaseInfer for short. It takes the epidemiological model OM, a calibration procedure Calibrate, and observed data Dreported as input. The output of BaseInfer is then a baseline parameterization and, by extension, an estimated reported rate. Calibrating a parameterization is generally a complex, high-dimensional problem, since Θ consists of multiple interacting parameters. To make matters worse, there exist many possible parameterizations that show similar performance (e.g., in RMSE or likelihood) yet correspond to vastly different reported rates. BaseInfer cannot select between these competing parameterizations in a principled way: the parameterization it results in may or may not overfit the reported infections and may or may not predict future infections well. One way to select among them is to take a Bayesian approach. That is, we choose a prior distribution, and then select the parameterization that maximizes the posterior probability. Choosing such a prior, however, is ad hoc and does not generalize well across different models OM. As we will see in the experimental evaluation, minor differences in estimates of reported rates can indeed lead to very different forecasts of future trends and thereby to different intervention policy recommendations.
Instead, we propose a new information theory-based approach named MdlInfer. It takes the same input as BaseInfer, but uses a principled approach to determine the best parameterization Θ*. It is based on the following central intuition: Suppose an oracle also gives us the time series of the number of total infections D in addition to the already known reported number of infections Dreported, and we are asked to describe Dreported as succinctly as possible. As we know both D and Dreported, it is trivial to estimate the reported rate αreported. If we know D and αreported, it is trivial to describe Dreported, as it is simply αreported × D plus a little bit of noise. Now to most succinctly describe D, we have to calibrate OM to obtain Θ′. The only things we now have to describe are Θ′, αreported, the (small) errors that OM makes in predicting D, and the (small) errors that we make predicting Dreported using D and αreported. In practice, we are of course not given D, but the key idea of this paper is to estimate D as a latent variable such that we can most succinctly describe (most accurately reconstruct) the dynamics of Dreported.
In practice, we need both a way of measuring how well a latent Model (i.e., D and its corresponding reported rate) describes the Data (i.e., reported infections Dreported), as well as a way to find the best such Model. To do so, the Minimum Description Length (MDL) principle provides a statistically sound approach. MDL has been widely used for numerous optimization problems, including network summarization [34], causality inference [16], and failure detection in critical infrastructures [10]. MDL has also previously been used for some epidemiological problems, mainly in inferring patient-zero and associated infections in cascades over contact networks [49]. However, we are the first to propose an MDL-based approach on top of ODE-based epidemiological models, which are harder to formulate and optimize.
Specifically, we use two-part MDL (aka the sender-receiver framework) consisting of hypothetical actors S and R: Sender S has the Data and wants to transmit it to receiver R using as few bits as possible [24]. Hence, sender S searches for the best possible Model, which minimizes the overall cost of encoding and transmitting both the Model and the Data given the Model. Following the convention in information theory, we use L(Model) to denote the number of bits required to encode the Model, and L(Data|Model) to denote the number of bits required to encode the Data, Dreported, given the Model. The overall objective of our optimization problem is to infer an optimal Model*, which minimizes L(Model) + L(Data|Model). To put MDL into practice for our problem, we carefully design our MDL cost to minimize the discrepancy in fitting Dreported. This cost ensures the generalizability of our learned D* and its reported rate: it can avoid overfitting on Dreported and predict future reported infections well. Our experiments confirm this. Our approach, MdlInfer, can be applied to any ODE model since two-part MDL does not make assumptions about the nature of the Data or the Model.
We compare MdlInfer and BaseInfer using two different ODE-based epidemiological models: SAPHIRE [25] and SEIR + HD [33] as OM. Following their literature [25, 33], we use Markov Chain Monte Carlo (MCMC) as the calibration procedure Calibrate for SAPHIRE and iterated filtering (IF) for SEIR + HD, both of which are Bayesian approaches [29]. Both these epidemiological models have previously been shown to perform well in fitting reported infections and provided insight that was beneficial for the COVID-19 response. SAPHIRE focuses on two key features of the outbreak: high covertness and high transmissibility that drove the outbreak of COVID-19 in Wuhan. SEIR + HD investigates how non-pharmaceutical interventions like social distancing will be needed to maintain epidemic control. These models are broadly representative to show that MdlInfer gives consistent performance across multiple epidemiological models with different dynamics. The experiments clearly show that our proposed MDL-based approach MdlInfer outperforms the state of the art. To illustrate, we give an example in Fig. 1. By March 11, 2020, the Minneapolis Metro Area had only 16 reported COVID-19 infections. BaseInfer estimated 182 total infections, which are colored as light green in the iceberg. On the other hand, our MdlInfer gives an estimate of 301 total infections shown below the sea level, which is closer to the total infections estimated from serological studies [26, 2]. Additionally, MdlInfer also leads to better fits and future projections on reported infections. We also demonstrate that MdlInfer can aid policy making by analyzing counter-factual non-pharmaceutical interventions, while inaccurate BaseInfer estimates lead to wrong non-pharmaceutical intervention conclusions.
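To illustrate the BaseInfer pipeline (model, calibration procedure, reported data in; parameterization and reported rate out), here is a minimal sketch. Everything in it is an assumption made for illustration: the model is a toy discrete-time SIR with a single reported-rate parameter `alpha`, not SAPHIRE or SEIR + HD, and the calibration is an RMSE grid search standing in for the MCMC and iterated filtering procedures used in the study. The "observed" data are synthetic and all parameter values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def simulate_reported(beta: float, alpha: float, days: int,
                      population: float = 1_000_000.0,
                      initial_infected: float = 10.0,
                      gamma: float = 0.1) -> List[float]:
    """Toy discrete-time SIR model; returns daily reported infections."""
    s, i = population - initial_infected, initial_infected
    reported = []
    for _ in range(days):
        new_infections = beta * s * i / population
        s -= new_infections
        i += new_infections - gamma * i
        reported.append(alpha * new_infections)  # only a fraction alpha is reported
    return reported


def rmse(a: Sequence[float], b: Sequence[float]) -> float:
    """Root mean squared error between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def calibrate(observed: Sequence[float], betas: Sequence[float],
              alphas: Sequence[float]) -> Tuple[Tuple[float, float], float]:
    """Grid search for the (beta, alpha) pair with the lowest RMSE."""
    best_params, best_err = (betas[0], alphas[0]), float("inf")
    for beta in betas:
        for alpha in alphas:
            err = rmse(simulate_reported(beta, alpha, len(observed)), observed)
            if err < best_err:
                best_params, best_err = (beta, alpha), err
    return best_params, best_err


# Synthetic "observed" data generated with known, hypothetical parameters.
observed = simulate_reported(beta=0.3, alpha=0.2, days=30)
params, err = calibrate(observed, betas=[0.2, 0.25, 0.3, 0.35],
                        alphas=[0.1, 0.2, 0.3])
print(params, err)
```

On noiseless synthetic data the grid search recovers the generating parameters. With real, noisy data several grid points can fit almost equally well, which is the selection problem described above.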
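The two-part objective L(Model) + L(Data|Model) can be illustrated on a toy problem. The sketch below is not the encoding used by MdlInfer (that is described in the Supplementary Information); it assumes an Elias-gamma-style integer code plus a sign bit, and the data and the two candidate models are hypothetical.

```python
from typing import Callable, Sequence


def integer_bits(n: int) -> int:
    """Bits for a signed integer: Elias-gamma-style code plus one sign bit (assumed encoding)."""
    magnitude = abs(n) + 1  # shift by one so that zero is encodable
    return 2 * (magnitude.bit_length() - 1) + 1 + 1


def description_length(params: Sequence[int],
                       predict: Callable[[int], int],
                       data: Sequence[int]) -> int:
    """Two-part cost: L(Model) + L(Data | Model), in bits."""
    model_cost = sum(integer_bits(p) for p in params)  # L(Model)
    data_cost = sum(integer_bits(y - predict(t))       # L(Data | Model): residuals
                    for t, y in enumerate(data))
    return model_cost + data_cost


# Hypothetical data and two hypothetical candidate models.
data = [5, 8, 11, 14, 17, 20, 23, 26]
candidates = {
    "constant": ((15,), lambda t: 15),
    "linear": ((3, 5), lambda t: 3 * t + 5),
}
costs = {name: description_length(p, f, data) for name, (p, f) in candidates.items()}
best = min(costs, key=costs.get)
print(costs, best)
```

The linear model costs more bits to state than the constant one (two parameters instead of one), but it leaves only zero residuals to encode, so its total description length is lower and it is selected.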
Results
Next, we present our empirical findings on a large set of experiments in different geographical regions and time periods. We choose 8 regions and periods based on the severity of the outbreak and the availability of serological studies and symptomatic surveillance data. In each region, we divide the timeline into two time periods: (i) observed period, when only the number of reported infections is available, and both BaseInfer and MdlInfer are used to learn the baseline parameterization (BaseParam) and MDL parameterization (MdlParam) Θ*, and (ii) forecast period, where we evaluate the forecasts generated by the parameterizations learned in the observed period. To handle the time-varying reported rates, we divide the observed period into multiple sub-periods and learn different reported rates for each sub-period separately.
(A) Estimating total infections: MdlInfer estimates total infections more accurately than BaseInfer
Here, we use the point estimates of the total infections calculated from serological studies as the ground truth (black dots shown in Fig. 2). We call it SeroStudyTinf. We also plot MdlInfer’s estimation of total infections, MdlParamTinf, in the same figure (red curve). To compare the performance of MdlInfer and BaseInfer with SeroStudyTinf, we use the cumulative value of estimated total infections. Note that values from the serological studies are not directly comparable with the total infections because of the lag between antibodies becoming detectable and infections being reported [2, 26]. In Fig. 2, we have already accounted for this lag following CDC study guidelines [2, 26] (see Methods section for details). The vertical black lines show a 95% confidence interval for SeroStudyTinf. The blue curve represents total infections estimated by BaseInfer, BaseParamTinf. As seen in the figure, MdlParamTinf falls within the confidence interval of the estimates given by serological studies. Significantly, in Fig. 2B and Fig. 2F for South Florida, BaseInfer overestimates the total infections for the SAPHIRE model [25], while it underestimates them for the SEIR + HD model. However, MdlInfer consistently estimates the total infections correctly. This observation shows that, as needed, MdlParamTinf can improve upon BaseParamTinf in either direction (i.e., by increasing or decreasing the total infections). Note that the MdlParamTinf curves from both models are closer to SeroStudyTinf even when the BaseParamTinf curves are different. The better accuracy across various geographical regions and time periods shows that MdlInfer is consistently able to estimate total infections more accurately.
To quantify the performance gap between the two approaches, we first compute the root mean squared error (RMSE) between SeroStudyTinf and BaseParamTinf. We also compute the same between SeroStudyTinf and MdlParamTinf. We then compute the ratio, ρTinf, of the two RMSE errors as $\rho_{Tinf} = \mathrm{RMSE}(\text{BaseParamTinf}, \text{SeroStudyTinf}) \,/\, \mathrm{RMSE}(\text{MdlParamTinf}, \text{SeroStudyTinf})$. Note that a value of ρTinf greater than 1 implies that MdlParamTinf is closer to the SeroStudyTinf estimates than BaseParamTinf. In Fig. 2I and Fig. 2J, we plot ρTinf. Overall, the ρTinf values are greater than 1 in Fig. 2I and Fig. 2J, which indicates that MdlInfer performs better than BaseInfer. Note that even when the value of ρTinf is 1.20 for Fig. 2A, the improvement made by MdlParamTinf over BaseParamTinf in terms of RMSE is about 12091. Hence, one can conclude that MdlInfer is indeed superior to BaseInfer when it comes to estimating total infections. We show more experiments in the Supplementary Information.
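The RMSE ratio can be sketched in a few lines. The baseline error goes in the numerator, so that values above 1 mean the MDL-based estimate is closer to the serological reference, as stated in the text. The function names and the numbers in the usage example are hypothetical, not results from the study.

```python
import math
from typing import Sequence


def rmse(estimate: Sequence[float], truth: Sequence[float]) -> float:
    """Root mean squared error between an estimate and a reference series."""
    if len(estimate) != len(truth) or len(truth) == 0:
        raise ValueError("series must be non-empty and of equal length")
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth))


def rmse_ratio(base: Sequence[float], mdl: Sequence[float],
               truth: Sequence[float]) -> float:
    """Ratio RMSE(base, truth) / RMSE(mdl, truth); above 1 favors the MDL estimate."""
    mdl_err = rmse(mdl, truth)
    if mdl_err == 0:
        return math.inf  # a perfect MDL fit makes the ratio unbounded
    return rmse(base, truth) / mdl_err


# Hypothetical cumulative totals at three serological survey dates.
sero = [100.0, 200.0, 300.0]
base_estimate = [130.0, 230.0, 330.0]
mdl_estimate = [110.0, 210.0, 310.0]
rho = rmse_ratio(base_estimate, mdl_estimate, sero)
print(rho)
```

The same function applies to reported infections by passing the reported series as the reference.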
(B) Estimating reported infections: MdlInfer leads to better fit and projection than BaseInfer at different stages of the COVID-19 epidemic
Here, we first use the observed period to learn the parameterizations. We then forecast the future reported infections (i.e., the forecast period), which were not accessible to the model during training. The results are summarized in Fig. 3. In Fig. 3A to Fig. 3H, the vertical grey dashed line divides the observed and forecast periods. The black plus symbols represent reported infections collected by the New York Times, NYT-Rinf. The red curve represents MdlInfer’s estimation of reported infections, MdlParamRinf. Similarly, the blue curve represents BaseInfer’s estimation of reported infections, BaseParamRinf. Note that the curves to the right of the vertical grey line are future predictions. As seen in Fig. 3, MdlParamRinf aligns more closely with NYT-Rinf than BaseParamRinf, indicating the superiority of MdlInfer in fitting and forecasting reported infections.
We define a performance metric ρRinf as $\rho_{Rinf} = \mathrm{RMSE}(\text{BaseParamRinf}, \text{NYT-Rinf}) \,/\, \mathrm{RMSE}(\text{MdlParamRinf}, \text{NYT-Rinf})$ to compare MdlParamRinf against BaseParamRinf in a manner similar to ρTinf. In Fig. 3I and Fig. 3J, we plot ρRinf for the observed and forecast periods. In both periods, we notice that ρRinf is close to or greater than 1. This further shows that MdlInfer has a better or at least comparable fit for reported infections than BaseInfer. Additionally, ρRinf for the forecast period is even greater than ρRinf for the observed period, which shows that the advantage of MdlInfer over BaseInfer is even larger when forecasting.
Note that Fig. 3A, C, E, G correspond to the early stage of the COVID-19 epidemic in spring and summer 2020, and Fig. 3B, D, F, H correspond to fall 2020. We can see that MdlInfer performs well in estimating temporal patterns at different stages of the COVID-19 epidemic. We show more experiments in the Supplementary Information.
(C) Estimating symptomatic rate trends: MdlInfer estimates the symptomatic rate trends more accurately than BaseInfer
We validate this observation using Facebook’s symptomatic surveillance dataset [51]. We plot MdlInfer’s and BaseInfer’s estimated symptomatic rate over time and overlay the estimates and standard error from the symptomatic surveillance data in Fig. 4. The red and blue curves are MdlInfer’s and BaseInfer’s estimation of symptomatic rates, MdlParamSymp and BaseParamSymp, respectively. Note that the SAPHIRE model does not contain states corresponding to symptomatic infections. Therefore, we only focus on the SEIR + HD model. We compare the trends of MdlParamSymp and BaseParamSymp with the symptomatic surveillance results. We focus on trends rather than actual values because the symptomatic rate numbers could be biased [51] (see Methods section for a detailed discussion) and therefore cannot be compared directly with model outputs as we did for serological studies. As seen in Fig. 4, MdlParamSymp captures the trends of the surveyed symptomatic rate RateSymp (black plus symbols) better than BaseParamSymp. We show more experiments in the Supplementary Information.
To summarize, these three sets of experiments in (A), (B) and (C) together demonstrate that BaseInfer fails to accurately estimate the total infections, including unreported ones. On the other hand, MdlInfer estimates total infections closer to those estimated by serological studies and better fits reported infections and symptomatic rate trends.
Evaluating the effect of non-pharmaceutical interventions
We have already shown that MdlInfer is able to estimate the number of total infections accurately. In the following three observations, we show that such accurate estimations are important for evaluating the effect of non-pharmaceutical interventions.
(D) MdlInfer reveals that a large majority of COVID-19 infections were unreported
We compute the cumulative reported rate MdlParamRate, measured as the ratio of the cumulative value of reported infections to the total infections estimated by MdlInfer over time, and plot it for Minneapolis-Spring-20 in Fig. 5A. The figure shows that the MdlParamRate increases in early March, and then gradually decreases. This observation is explained by the community spread-driven COVID-19 outbreaks that were not reported until early March, which fits earlier studies [40].
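The cumulative reported rate described above is a running ratio of two cumulative sums. A minimal sketch follows, with a hypothetical function name and made-up daily counts. It assumes both series are daily and aligned on the same dates.

```python
from itertools import accumulate
from typing import List, Optional, Sequence


def cumulative_reported_rate(daily_reported: Sequence[float],
                             daily_total: Sequence[float]) -> List[Optional[float]]:
    """Cumulative reported infections divided by cumulative total infections, per day.

    Days on which no total infections have accumulated yet yield None.
    """
    if len(daily_reported) != len(daily_total):
        raise ValueError("series must have the same length")
    rates: List[Optional[float]] = []
    for rep, tot in zip(accumulate(daily_reported), accumulate(daily_total)):
        rates.append(rep / tot if tot > 0 else None)
    return rates


# Made-up daily counts: reporting starts one day after infections do.
rates = cumulative_reported_rate([0, 1, 3], [10, 10, 20])
print(rates)
```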
(E) Non-pharmaceutical interventions on asymptomatic and presymptomatic infections are essential to control the COVID-19 epidemic
Our simulations show that non-pharmaceutical interventions on asymptomatic and presymptomatic infections are essential to control COVID-19. Here, we plot the simulated reported infections of MdlParam in Fig. 5B (red curve). We then repeat the simulation of reported infections for 5 different scenarios: (i) isolate just the reported infections, (ii) isolate just the symptomatic infections, and isolate symptomatic infections in addition to (iii) 25%, (iv) 50%, and (v) 75% of both asymptomatic and presymptomatic infections. In our setup, we assume that the infectivity reduces by half when a person is isolated. As seen in Fig. 5B, when only the reported infections are isolated, there is almost no change in the “future” reported infections. However, when we isolate both the reported and symptomatic infections, the reported infections decrease significantly. Even here, the reported infections still do not show a decreasing trend. On the other hand, non-pharmaceutical interventions for some fraction of asymptomatic and presymptomatic infections make reported infections decrease. Thus, we can conclude that non-pharmaceutical interventions on asymptomatic infections are essential in controlling the COVID-19 epidemic.
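The structure of such a what-if simulation can be sketched with a toy model. The only element taken from the text is the assumption that isolation halves a person's infectivity. The model below is a plain discrete-time SIR that splits infectious people into a symptomatic and a non-symptomatic share. It is not SAPHIRE or SEIR + HD, it does not model scenario (i), and every parameter value is hypothetical, so its output says nothing about COVID-19 itself.

```python
from typing import List


def simulate_new_infections(days: int,
                            isolate_symptomatic: float = 0.0,
                            isolate_asymptomatic: float = 0.0,
                            beta: float = 0.3,
                            gamma: float = 0.2,
                            frac_symptomatic: float = 0.4,
                            isolation_effect: float = 0.5,
                            population: float = 1_000_000.0,
                            initial_infected: float = 100.0) -> List[float]:
    """Daily new infections when given fractions of each group are isolated."""
    # Average infectivity multiplier: isolated people transmit at
    # (1 - isolation_effect) of the normal rate, i.e. half by default.
    multiplier = (
        frac_symptomatic * (1 - isolation_effect * isolate_symptomatic)
        + (1 - frac_symptomatic) * (1 - isolation_effect * isolate_asymptomatic)
    )
    s, i = population - initial_infected, initial_infected
    new_per_day = []
    for _ in range(days):
        new_infections = beta * multiplier * s * i / population
        s -= new_infections
        i += new_infections - gamma * i
        new_per_day.append(new_infections)
    return new_per_day


no_isolation = simulate_new_infections(30)
symptomatic_only = simulate_new_infections(30, isolate_symptomatic=1.0)
plus_75_percent = simulate_new_infections(30, isolate_symptomatic=1.0,
                                          isolate_asymptomatic=0.75)
for name, series in [("none", no_isolation),
                     ("symptomatic only", symptomatic_only),
                     ("symptomatic + 75% asymptomatic", plus_75_percent)]:
    trend = "rising" if series[-1] > series[0] else "falling"
    print(f"{name}: {trend}, cumulative = {sum(series):.0f}")
```

With these particular hypothetical parameters, isolating only symptomatic infections slows growth without reversing it, while also isolating a large share of the remaining infections turns the curve downward. Different parameter choices would give different outcomes.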
(F) Accuracy of non-pharmaceutical intervention simulations relies on the good estimation of parameterization
Next, we also plot the simulated reported infections generated by BaseInfer in Fig. 5C (blue curve). As seen in the figure, based on BaseInfer, we would infer that non-pharmaceutical interventions on symptomatic infections alone are enough to control the COVID-19 epidemic. However, this has been shown to be incorrect by prior studies and real-world observations [41]. Therefore, we can conclude that the accuracy of non-pharmaceutical intervention simulation relies on the quality of the learned parameterization.
Discussion and Future Work
This study proposes MdlInfer, a data-driven model selection approach that automatically estimates the number of total infections using epidemiological models. Our approach leverages the information theoretic Minimum Description Length (MDL) principle to select total infections that “best describe” the observed outbreak. Our approach addresses several gaps in current practice including the long-term infeasibility of serological studies [26], and ad-hoc assumptions in epidemiological models [33, 39, 44, 25].
Overall, our results show that MdlInfer estimates total infections at various geographical locations and for different epidemiological models more accurately than BaseInfer from both directions, i.e., it corrects both over- and under-estimates. For example, compared to BaseInfer, we correctly estimate 55719 more infections by April 1 for the SEIR + HD model in Fig. 2F, and 87636 fewer infections for the SAPHIRE model in Fig. 2B for South Florida-Spring-20. We also show that MdlInfer leads to a better fit of the reported infections in the observed period and more accurate forecasts for the forecast period than BaseInfer. We reveal that a large majority of COVID-19 infections were unreported, and that non-pharmaceutical interventions on unreported infections can help to mitigate the COVID-19 outbreak. We also show that MdlInfer estimates more accurate symptomatic rate trends than BaseInfer. Additionally, our results show consistent performance with respect to the reported infections and serological studies on both the SAPHIRE and SEIR + HD models. We also show that MdlInfer identifies the ground truth parameters better than BaseInfer (see Supplementary Information section for details). As an aside, BaseInfer may also give uncertainty estimates for its calibrated parameterizations. Our framework MdlInfer can be adapted to generate such estimates as well (see Supplementary Information section for a demonstration).
The MdlInfer framework is likely to be helpful in the surveillance of COVID-19 in the near future, and for future epidemics. Even with the U.S. returning to normalcy, surveillance of the pandemic is still essential for public health. The daily incidence of COVID-19 has decreased from early 2021 to summer 2021, according to the CDC COVID Data Tracker portal [7, 35]. However, new variants of the SARS-CoV-2 (e.g., the Delta and Omicron variants) have been spreading rapidly [37, 48, 21]. Testing for these new variants and large-scale surveillance via laboratory tests may be limited and less systematic than what was done for COVID-19 before. In such settings, using our MdlInfer framework, epidemiologists and policymakers can improve the accuracy of estimates of total infections (without large-scale serological studies), as well as forecasts of their models.
One of the limitations of our work is that the benefits of using MdlInfer depend on the suitability of the epidemiological model. If the epidemiological model is not expressive enough for the observed data, then the gains from MdlInfer may not be significant. As future work, it may be helpful to adapt MdlInfer to measure the quality of an epidemiological model. We also note that MdlInfer is built on ODE-based epidemiological models; other kinds of epidemic models, e.g., agent-based models [42, 28, 59, 18, 45], are more suitable in some settings. It would be interesting to extend MdlInfer to incorporate such models. Finally, there is significant population or spatial heterogeneity in disease outcomes [15, 31], e.g., differences in severity rate or mortality rate, when infected with COVID-19, for different age groups [22, 27], which has not been considered in our work.
To summarize, MdlInfer is a robust data-driven method to accurately estimate total infections, which will help data scientists, epidemiologists, and policy-makers to further improve existing ODE-based epidemiological models, make accurate forecasts, and combat the ongoing COVID-19 pandemic. More generally, MdlInfer opens up a new line of research in epidemic modeling using information theory.
Materials and Methods
Data
Datasets
We use the following publicly available datasets for our study:
New York Times reported infections [3]: This dataset (NYT-Rinf) consists of the daily time sequence of reported COVID-19 infections Dreported and the mortality Dmortality (cumulative values) for each county in the US from January 21, 2020 to the present.
Serological studies [26, 2]: This dataset consists of the point and 95% confidence interval estimates of the prevalence of antibodies to SARS-CoV-2 in 10 US locations every 3–4 weeks from March to July 2020. For each location, CDC works with commercial laboratories to collect blood specimens in the population and test them for antibodies to SARS-CoV-2. Each specimen collection period ranges from 6 to 14 days. As suggested by prior work [32, 47], these serological studies have high sensitivity to antibodies for 6 months after infection. Hence, using the prevalence and total population in one location, we can compute the estimated total infections SeroStudyTinf for the past 6 months (i.e., from the beginning of the pandemic in January 2020). However, this SeroStudyTinf cannot be compared directly with the total infection numbers estimated by the epidemiological model. The reasons are (i) antibodies may take 10 to 14 days to become detectable after infection [65, 54] and (ii) the specimen collection period lasts 6 to 14 days, as mentioned before. To account for this, we compare the SeroStudyTinf numbers with the total infections estimated by MdlInfer and BaseInfer for the date 7 days prior to the first day of the specimen collection period, as suggested by the CDC serological studies work [26].
Symptomatic surveillance [51, 53]: This dataset consists of the point estimate RateSymp and standard error of the COVID-related symptomatic rate for each county in the US starting from April 6, 2020 to date. The survey asks a series of questions of randomly sampled social media (Facebook) users to estimate the percentage of people who have COVID-like symptoms, such as fever along with cough, shortness of breath, or difficulty breathing, on a given day. However, there are several caveats: the survey cannot cover all symptoms of COVID-19, and these symptoms can also be caused by many other conditions. As a result, the estimates are not expected to be unbiased for the true symptomatic rate [51]. Besides, as the original symptomatic surveillance data is at a county level, we sum up the numbers to compute the RateSymp and focus on trends instead of the exact numbers.
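The alignment of the serological estimates with model output can be sketched as follows. The conversion (prevalence times population) and the 7-day shift before the first day of specimen collection come from the dataset description above. The function names, the prevalence, the population, and the collection date are hypothetical.

```python
from datetime import date, timedelta


def sero_total_infections(prevalence: float, population: int) -> float:
    """Estimated total infections from antibody prevalence and population size."""
    if not 0 <= prevalence <= 1:
        raise ValueError("prevalence must be a fraction in [0, 1]")
    return prevalence * population


def comparison_date(collection_start: date, lag_days: int = 7) -> date:
    """Date whose model-estimated cumulative infections are compared with the survey."""
    return collection_start - timedelta(days=lag_days)


# Hypothetical survey: 2% prevalence in a population of one million,
# with specimen collection starting on April 1, 2020.
survey_total = sero_total_infections(0.02, 1_000_000)
model_date = comparison_date(date(2020, 4, 1))
print(survey_total, model_date.isoformat())
```

The same conversion can be applied to the lower and upper ends of the 95% confidence interval for the prevalence to obtain an interval for the total infections.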
Our Approach
Two-part sender-receiver framework
In this work, we use a two-part sender-receiver framework. The conceptual goal of the framework is to transmit the Data from the possession of the hypothetical sender S to the hypothetical receiver R. We assume the sender does this by first sending a Model and then sending the Data under this Model. In this MDL framework, we want to minimize the number of bits for this process. We do this by identifying the Model that encodes the Data such that the total number of bits needed to encode both the Model and the Data is minimized. Hence our cost function, the total number of bits needed, is composed of two parts: (i) model cost L(Model): the cost in bits of encoding the Model, and (ii) data cost L(Data|Model): the cost in bits of encoding the Data given the Model. Intuitively, the idea is that a good Model will lead to fewer bits needed to encode both Model and Data. We formulate the general MDL optimization problem as follows:
Given the Data, L(Model), and L(Data|Model), find Model* such that
$\text{Model}^{*} = \arg\min_{\text{Model}} \; L(\text{Model}) + L(\text{Data} \mid \text{Model})$ (1)
In our situation, the Data is the reported COVID-19 infections Dreported: it is the only real-world data given to us. Note that total infections are not directly observed. As described in the introduction section, the Model intuitively consists of D and a reported rate. Here D refers to a candidate total infections time series, and the reported rate is the one corresponding to D. Specifically, we calibrate OM on (D, Dreported) using Calibrate to get the “candidate” parameterization Θ′, and then compute this reported rate from Θ′. Further, we choose to also add the reported rate estimated by BaseInfer, so that our Model consists of three components: D, the reported rate computed from Θ′, and the reported rate estimated by BaseInfer. There are alternative Models that can be considered, but we choose this one and explain more in the Supplementary Information. Note that as two-part MDL (and MDL in general) does not make assumptions about the nature of the Data or the Model, our MdlInfer can be applied to any ODE model. We have also discussed intuitive advantages of MdlInfer over BaseInfer briefly in the introduction section (see Supplementary Information for more details). Next, we give more details on how to formulate our problem of estimating total infections D.
MDL formulation
First, we need to introduce some notation. Given an epidemiological model OM and the parameterization estimated by BaseInfer, we can compute the reported infections. However, this is only an estimate of the reported infections rather than the exact Dreported. This is because even though we have already calibrated OM using Dreported, the calibration cannot be perfect, and there will be differences between these estimated reported infections and Dreported. Here, we term these estimated reported infections as . We can also estimate the total infections for OM in the same way. Similarly, we have the Dreported(Θ′) and D(Θ′) for Θ′. As described in the introduction section, we can also calculate the reported rate and using and Θ′. With this notation, we next formulate the space of all possible Models and give the equation for the cost in bits of encoding Model and Data.
Model space
We have $\text{Model} = (D, \Theta', \hat{\Theta})$ as described above. Hence our Model space is all possible daily sequences for $D$ and all possible parameterizations for $\Theta'$ and $\hat{\Theta}$. The MDL framework searches this space to find $\text{Model}^*$.
Model cost
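With $\text{Model} = (D, \Theta', \hat{\Theta})$, we conceptualize the model cost by imagining that the sender S sends the Model to the receiver R in three parts: (i) first send $\hat{\Theta}$ by encoding $\hat{\Theta}$ directly; (ii) next send $\Theta'$ given $\hat{\Theta}$ by encoding $\Theta' - \hat{\Theta}$; and (iii) then send $D$ given $\Theta'$ and $\hat{\Theta}$ by encoding $\alpha'_{\text{reported}} \times D - D_{\text{reported}}(\hat{\Theta})$. Intuitively, both $\alpha'_{\text{reported}} \times D$ and $D_{\text{reported}}(\hat{\Theta})$ should be close to $D_{\text{reported}}$, and the receiver can recover $D$ using $\alpha'_{\text{reported}}$ and $D_{\text{reported}}(\hat{\Theta})$ as they have already been sent. We term the model cost $L(D, \Theta', \hat{\Theta})$, with three components: $\text{Cost}(\hat{\Theta})$, $\text{Cost}(\Theta' \mid \hat{\Theta})$, and $\text{Cost}(D \mid \Theta', \hat{\Theta})$. Hence,

$$L(D, \Theta', \hat{\Theta}) = \text{Cost}(\hat{\Theta}) + \text{Cost}(\Theta' \mid \hat{\Theta}) + \text{Cost}(D \mid \Theta', \hat{\Theta}) \tag{2}$$

In Equation 2, the $\text{Cost}(\cdot)$ function gives the total number of bits we need to spend in encoding each term. The details of the encoding method can be found in the Supplementary Information.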
Data cost
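Next, we need to send the Data $= D_{\text{reported}}$ given the Model. Given $\text{Model} = (D, \Theta', \hat{\Theta})$, we send the Data by encoding $\frac{D - D_{\text{reported}}}{1 - \alpha'_{\text{reported}}} - D(\Theta')$. Intuitively, $D - D_{\text{reported}}$ corresponds to the unreported infections, and $1 - \alpha'_{\text{reported}}$ is the unreported rate. Therefore, $\frac{D - D_{\text{reported}}}{1 - \alpha'_{\text{reported}}}$ should be close to the total infections $D$ and $D(\Theta')$. The receiver can also recover $D_{\text{reported}}$ using $D$, $\alpha'_{\text{reported}}$, and $D(\Theta')$ as they have already been sent. We term the data cost $L(D_{\text{reported}} \mid D, \Theta', \hat{\Theta})$ and formulate it as Equation 3.

$$L(D_{\text{reported}} \mid D, \Theta', \hat{\Theta}) = \text{Cost}\!\left(\frac{D - D_{\text{reported}}}{1 - \alpha'_{\text{reported}}} - D(\Theta') \;\middle|\; D, \Theta', \hat{\Theta}\right) \tag{3}$$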
Total cost
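With $L(D, \Theta', \hat{\Theta})$ as in Equation 2 and $L(D_{\text{reported}} \mid D, \Theta', \hat{\Theta})$ as in Equation 3 above, the total cost is:

$$L(D, \Theta', \hat{\Theta}, D_{\text{reported}}) = L(D, \Theta', \hat{\Theta}) + L(D_{\text{reported}} \mid D, \Theta', \hat{\Theta}) \tag{4}$$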
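The sketch below shows one way the model cost and data cost could be added up for a single candidate $D$. The actual encoding is described in the Supplementary Information and is not reproduced here: the integer code length `int_bits`, the fixed `param_bits` per parameter, and the quantization `precision` are placeholder assumptions, and all function and argument names are hypothetical.

```python
import math

def int_bits(x: float) -> float:
    """Assumed code length (bits) for a value rounded to the nearest integer."""
    return 1.0 + 2.0 * math.log2(abs(round(x)) + 1)

def seq_bits(xs) -> float:
    """Bits for a sequence, encoding each element independently."""
    return sum(int_bits(x) for x in xs)

def total_cost(D, D_rep, D_rep_hat, D_prime, alpha_prime,
               theta_hat, theta_prime, param_bits=32.0, precision=1e-3):
    # Model cost (i): send theta_hat directly at an assumed fixed width.
    cost_theta_hat = param_bits * len(theta_hat)
    # Model cost (ii): send theta_prime as a quantized difference from theta_hat.
    cost_theta_prime = seq_bits((p - h) / precision
                                for p, h in zip(theta_prime, theta_hat))
    # Model cost (iii): send D via alpha' * D - D_reported(theta_hat).
    cost_D = seq_bits(alpha_prime * d - r for d, r in zip(D, D_rep_hat))
    # Data cost: send D_reported via (D - D_reported) / (1 - alpha') - D(theta').
    cost_data = seq_bits((d - r) / (1.0 - alpha_prime) - dp
                         for d, r, dp in zip(D, D_rep, D_prime))
    return cost_theta_hat + cost_theta_prime + cost_D + cost_data

# A candidate D consistent with the reported data and both parameterizations...
good = total_cost([100, 200], [20, 40], [20, 40], [100, 200], 0.2, [0.5], [0.5])
# ...versus one that disagrees with them and therefore needs more bits.
bad = total_cost([150, 250], [20, 40], [20, 40], [100, 200], 0.2, [0.5], [0.5])
```

Under these assumptions, a candidate whose residuals are all zero pays only the fixed parameter cost plus one bit per encoded value, while an inconsistent candidate pays for every residual it leaves behind.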
Problem statement
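Note that our main objective is to estimate the total infections $D$. With $L(D, \Theta', \hat{\Theta}, D_{\text{reported}})$ formulated in Equation 4, we can state the problem as: given the time sequence $D_{\text{reported}}$, an epidemiological model OM, and a calibration procedure Calibrate, find $D^*$ that minimizes the MDL total cost, i.e.,

$$D^* = \arg\min_{D} \; L(D, \Theta', \hat{\Theta}, D_{\text{reported}}) \tag{5}$$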
Algorithm
Next, we present our algorithm to solve the problem in Equation 5. Note that naively searching for $D^*$ directly is intractable since $D^*$ is a daily sequence, not a scalar. Instead, we propose first finding a “good enough” reported rate $\alpha^*_{\text{reported}}$ quickly with the constraint $D = D_{\text{reported}} / \alpha_{\text{reported}}$ to reduce the search space. Then, with this $\alpha^*_{\text{reported}}$, we can search for the optimal $D^*$ in Equation 5. Hence we propose a two-step algorithm: (i) do a linear search to find a good reported rate $\alpha^*_{\text{reported}}$; (ii) given the $\alpha^*_{\text{reported}}$ found above, use an optimization method to find the $D^*$ that minimizes $L(D, \Theta', \hat{\Theta}, D_{\text{reported}})$ with constraints.
Step 1: Find the $\alpha^*_{\text{reported}}$
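In step 1, we do a linear search on $\alpha_{\text{reported}}$ to find $\alpha^*_{\text{reported}}$. As stated before, we use $D_{\text{reported}} / \alpha_{\text{reported}}$ as $D$ in $L(D, \Theta', \hat{\Theta}, D_{\text{reported}})$ to help reduce the search space. We formulate the step 1 algorithm as Equation 6.

$$\alpha^*_{\text{reported}} = \arg\min_{\alpha_{\text{reported}}} \; L\!\left(\frac{D_{\text{reported}}}{\alpha_{\text{reported}}}, \Theta', \hat{\Theta}, D_{\text{reported}}\right) \tag{6}$$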
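A minimal sketch of such a linear search is shown below. It assumes the total cost is available as a callable that takes a candidate $D$; in the real method that callable would involve calibrating OM on the candidate, which is replaced here by a toy stand-in. The grid resolution and all names are hypothetical.

```python
def linear_search_alpha(d_reported, total_cost, grid):
    """Scan candidate reported rates and keep the one with the lowest cost."""
    best_alpha, best_cost = None, float("inf")
    for alpha in grid:
        # Constraint used to shrink the search space: D = D_reported / alpha.
        candidate_D = [x / alpha for x in d_reported]
        cost = total_cost(candidate_D)
        if cost < best_cost:
            best_alpha, best_cost = alpha, cost
    return best_alpha, best_cost

# Toy stand-in for the MDL total cost: smallest when D sums to 1000.
toy_cost = lambda D: abs(sum(D) - 1000.0)
grid = [i / 100 for i in range(1, 101)]   # alpha in {0.01, ..., 1.00}
alpha_star, cost_star = linear_search_alpha([10, 20, 30, 40], toy_cost, grid)
```

Because the search is over a single scalar, its cost is one evaluation of the objective per grid point, which is what makes this first step cheap compared with searching over the full daily sequence.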
Step 2: Find the $D^*$ given $\alpha^*_{\text{reported}}$
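With the $\alpha^*_{\text{reported}}$ found in step 1, we next find the $D^*$ that minimizes $L(D, \Theta', \hat{\Theta}, D_{\text{reported}})$. Since we have already found a good $\alpha^*_{\text{reported}}$, we can constrain $D^*$ so that the sum of $D^*$ equals the sum of $D_{\text{reported}} / \alpha^*_{\text{reported}}$. We use the Nelder-Mead method [20] to solve this constrained optimization problem for $D^*$. We formulate the step 2 algorithm as Equation 7.

$$D^* = \arg\min_{D} \; L(D, \Theta', \hat{\Theta}, D_{\text{reported}}) \quad \text{s.t.} \quad \sum_{t} D_t = \sum_{t} \frac{D_{\text{reported},t}}{\alpha^*_{\text{reported}}} \tag{7}$$

We describe the two-step algorithm in more detail in the Supplementary Information.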
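Nelder-Mead is an unconstrained method, so the sum constraint has to be enforced in some way. How this is done is left to the Supplementary Information; the sketch below shows one possible approach, assumed here purely for illustration: rescale every candidate so that the constraint holds by construction, and hand the wrapped objective to the optimizer. The function names and the toy cost are hypothetical.

```python
def rescale_to_sum(x, target_sum):
    """Map a vector with positive sum onto the set {D : sum(D) = target_sum}."""
    s = sum(x)
    if s <= 0:
        raise ValueError("candidate must have a positive sum")
    return [v * target_sum / s for v in x]

def constrained_objective(total_cost, d_reported, alpha_star):
    """Wrap total_cost so that every evaluated D satisfies the sum constraint."""
    target = sum(d_reported) / alpha_star
    # An unconstrained optimizer can minimize this wrapper directly.
    return lambda x: total_cost(rescale_to_sum(x, target))

# Toy stand-in for the MDL total cost: prefers a flat sequence at 40 per day.
toy_cost = lambda D: sum((d - 40.0) ** 2 for d in D)
objective = constrained_objective(toy_cost, [10, 20, 30], alpha_star=0.5)
D_candidate = rescale_to_sum([1.0, 2.0, 3.0], 120.0)
```

A caveat of this reparameterization is that it does not keep individual entries non-negative, so a practical version would also need to guard against negative daily counts.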
BaseInfer and MdlInfer formulation
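Here, we also give the mathematical formulations for BaseInfer and MdlInfer. As described in the introduction section, given an epidemiological model OM, a typical approach is to calibrate OM to $D_{\text{reported}}$ using the calibration procedure Calibrate. We call this methodology BaseInfer(OM, Calibrate, $D_{\text{reported}}$). As in Equation 8, the output of BaseInfer is the baseline parameterization (BaseParam) $\hat{\Theta}$.

$$\hat{\Theta} = \text{BaseInfer}(\text{OM}, \text{Calibrate}, D_{\text{reported}}) = \text{Calibrate}(\text{OM}, D_{\text{reported}}) \tag{8}$$

MdlInfer takes the same input (OM, Calibrate, $D_{\text{reported}}$) as BaseInfer. Assuming we are given the total infections $D$, we calibrate OM on $(D, D_{\text{reported}})$ to get a “candidate” parameterization $\Theta'$ as in Equation 9.

$$\Theta' = \text{Calibrate}(\text{OM}, D, D_{\text{reported}}) \tag{9}$$

However, we are not given $D$. Hence, we use the MDL framework to find $D^*$ as in Equation 7. With this $D^*$, we finally calibrate OM on $(D^*, D_{\text{reported}})$ to get another parameterization $\Theta^*$. As in Equation 10, we call $\Theta^*$ the MDL parameterization, or MdlParam.

$$\Theta^* = \text{MdlInfer}(\text{OM}, \text{Calibrate}, D_{\text{reported}}) = \text{Calibrate}(\text{OM}, D^*, D_{\text{reported}}) \tag{10}$$

where $D^* = \arg\min_{D} L(D, \Theta', \hat{\Theta}, D_{\text{reported}})$. Intuitively, if the $\hat{\Theta}$ estimated by BaseInfer is perfect, MdlInfer will give the same $\Theta^*$ as $\hat{\Theta}$.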
Epidemiological models
Next, we describe the two epidemiological models we use in our experiments: SEIR + HD and SAPHIRE. SEIR + HD [33] consists of 10 states: susceptible $S$, exposed $E$, pre-symptomatic $I_P$, severe symptomatic $I_S$, mild symptomatic $I_M$, asymptomatic $I_A$, hospitalized (eventual death) $H_D$, hospitalized (eventual recovery) $H_R$, recovered $R$, and dead $D$. The parameters to be calibrated are the transmission rate $\beta_0$ (the transmission rate in the absence of interventions), $\sigma$ (the proportional reduction in $\beta_0$ under shelter-in-place), and $E_0$ (the number of initial infections). The other parameters are fixed and given. The model assumes that importations happen only at the beginning of the pandemic (captured by $E_0$) and that the total population $N$ remains constant. We also extend the SEIR + HD model to infer two more parameters: $\alpha$ (the proportion of asymptomatic infections) and $\alpha_1$ (the proportion of new symptomatic infections that are reported). We compute the new reported infections and unreported infections as follows:
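New reported infections = $\alpha_1 \times (N_{I_P \to I_S} + N_{I_P \to I_M})$: Here $N_{I_P \to I_S} + N_{I_P \to I_M}$ is the number of new symptomatic infections every day. $N_{I_P \to I_S}$ is the number of patients switching their state from $I_P$ to $I_S$ (and similarly for $N_{I_P \to I_M}$). We assume that an $\alpha_1$ proportion of new symptomatic infections every day is reported.
New unreported infections = $N_{I_P \to I_A} + (1 - \alpha_1) \times (N_{I_P \to I_S} + N_{I_P \to I_M})$.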
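The split can be sketched as follows, under the assumption that new infections are counted as daily transitions out of the pre-symptomatic state and that all asymptomatic infections go unreported. The function and argument names are hypothetical, and the numbers are made up for the example.

```python
def seir_hd_new_infections(n_ip_is, n_ip_im, n_ip_ia, alpha1):
    """Split one day's new infections into reported and unreported counts.

    n_ip_is, n_ip_im, n_ip_ia: daily transitions from pre-symptomatic to
    severe symptomatic, mild symptomatic, and asymptomatic states.
    alpha1: proportion of new symptomatic infections that are reported.
    """
    new_symptomatic = n_ip_is + n_ip_im
    reported = alpha1 * new_symptomatic
    # Unreported: all asymptomatic plus the unreported share of symptomatic.
    unreported = n_ip_ia + (1.0 - alpha1) * new_symptomatic
    return reported, unreported

reported, unreported = seir_hd_new_infections(30.0, 50.0, 20.0, alpha1=0.25)
```

By construction the two outputs add up to the total number of transitions out of the pre-symptomatic state on that day.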
SAPHIRE [25] consists of 7 states: Susceptible S, exposed E, pre-symptomatic P, ascertained infectious I, unascertained infectious A, hospitalized H, and recovered R. The parameters to be calibrated are the transmission rate β and reported rate r while keeping other parameters fixed as given values. We also compute the new reported infections and unreported infections as follows:
New reported infections = $r P / D_p$: Here $P / D_p$ is the number of new infections from the pre-symptomatic state every day. $D_p$ is the parameter for the presymptomatic infectious period and is fixed. $r$ is the reported rate estimated by the epidemiological model.
New unreported infections = $(1 - r) P / D_p$.
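For SAPHIRE, the same split can be sketched from the size of the pre-symptomatic compartment. The sketch assumes the daily outflow from that compartment is $P / D_p$, divided between ascertained and unascertained cases by $r$; the function name and the numeric values are hypothetical.

```python
def saphire_new_infections(P, D_p, r):
    """Split the daily outflow of the pre-symptomatic compartment by reported rate r."""
    new_infections = P / D_p          # people leaving P per day
    reported = r * new_infections     # enter the ascertained state I
    unreported = (1.0 - r) * new_infections  # enter the unascertained state A
    return reported, unreported

reported, unreported = saphire_new_infections(P=200.0, D_p=2.0, r=0.25)
```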
Estimating infections using BaseParam and MdlParam
Here, we describe how we obtain the estimates in the results section using BaseParam and MdlParam. We use BaseParam from BaseInfer as the example; the same procedure applies to MdlParam from MdlInfer. Using the epidemiological model OM, we calculate BaseParam’s estimate of total infections, BaseParamTinf, as the cumulative values of $D(\hat{\Theta})$ from the beginning of the pandemic. $D_{\text{reported}}(\hat{\Theta})$ can be used directly as BaseParam’s estimate of reported infections. We calculate the cumulative reported rate, BaseParamRate, as the cumulative values of NYT-Rinf divided by BaseParamTinf. For the symptomatic rate, the SEIR + HD model [33] can estimate BaseParamSymp by dividing the number of infections in states $I_S$ and $I_M$ by the population size. However, the SAPHIRE model [25] does not contain states that correspond to symptomatic cases, so we cannot estimate the symptomatic rate using this model.
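The cumulative reported rate can be sketched as a ratio of running sums. The sketch assumes two aligned daily series, one of observed reported infections and one of model-estimated total infections; the function name and the sample values are hypothetical.

```python
from itertools import accumulate

def cumulative_reported_rate(reported_daily, total_daily):
    """Cumulative reported infections divided by cumulative estimated total infections."""
    cum_reported = accumulate(reported_daily)
    cum_total = accumulate(total_daily)
    # The rate is undefined until at least one total infection has been estimated.
    return [r / t if t > 0 else float("nan")
            for r, t in zip(cum_reported, cum_total)]

rates = cumulative_reported_rate([1, 2, 3], [10, 10, 10])
```

Using cumulative rather than daily values smooths out days with very few estimated infections, where a daily ratio would be unstable.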
Data Availability
All data produced in the present work are contained in the manuscript.
https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html
Acknowledgements
This paper was supported in part by the NSF (Expeditions CCF-1918770 and CCF-1918656, CAREER IIS-2028586, RAPID IIS-2027862, Medium IIS-1955883, Medium IIS-2106961, IIS-1931628, IIS-1955797, IIS-2027848, PIPP CCF-2200269), NIH 2R01GM109718, CDC MInD program U01CK000589, ORNL and funds/computing resources from Georgia Tech and GTRI. B. A. was in part supported by the CDC MInD-Healthcare U01CK000531-Supplement. A.V.’s work is also supported in part by grants from the UVA Global Infectious Diseases Institute (GIDI). J.V. is institutionally funded by CISPA.