Information Theoretic Model Selection for Accurately Estimating Unreported COVID-19 Infections
==============================================================================================

* Jiaming Cui
* Arash Haddadan
* A S M Ahsan-Ul Haque
* Bijaya Adhikari
* Anil Vullikanti
* B. Aditya Prakash

## Abstract

Estimating the true extent of the outbreak was one of the major challenges in combating the COVID-19 outbreak early on. Our inability to do so allowed unreported/undetected infections to drive up disease spread in numerous regions in the US and worldwide. Accurately identifying the true magnitude of infections remains a major challenge, despite the use of surveillance-based methods such as serological studies, due to their costs and biases. In this paper, we propose an information theoretic approach to accurately estimate the unreported infections. Our approach, built on top of an existing ordinary differential equations based epidemiological model, aims to deduce an optimal parameterization of the epidemiological model and the true extent of the outbreak which “best describes” the observed reported infections. Our experiments show that the parameterization learned by our framework leads to a better estimation of unreported infections as well as more accurate forecasts of the *reported* infections compared to the baseline parameterization. We also demonstrate that our framework can be leveraged to simulate what-if scenarios with non-pharmaceutical interventions. Our results also support earlier findings that a large majority of COVID-19 infections were unreported and that non-pharmaceutical interventions indeed helped in mitigating the COVID-19 outbreak.

Keywords
*   information theory
*   model selection
*   epidemiological models
*   non-pharmaceutical interventions

## Introduction

The COVID-19 pandemic has emerged as one of the most formidable public health challenges in recent history. As of May 1, 2021, it had already resulted in more than 32 million reported infections and half a million deaths in the United States alone. Worldwide, reported infections totaled 161 million and deaths 3.34 million [26]. The devastating effect of COVID-19 is not limited to the public health sector, but extends to the economy as a whole. For example, in the US the unemployment rate peaked at 15.8 percent in April 2020 [5], and U.S. GDP contracted at a 3.5% annualized rate for 2020 [1]. Similar economic impacts have been observed worldwide.

One of the major challenges in combating the COVID-19 pandemic early on was our inability to estimate the true magnitude of unreported infections. As noted by many studies [21, 15, 11, 54, 51, 41], a significant number of COVID-19 infections were unreported, due to various factors such as the lack of testing and asymptomatic infections. In the early stages, transmissions in previously unreported regions were facilitated by unreported infections before being detected. As a result, there were many instances of unreported COVID-19 outbreaks across the globe. In each of the early outbreak “hubs”, unreported spread was later identified as the driving force behind the surge in positive infections. For example, phylogenetic studies revealed that COVID-19 had locally spread in Washington state before active community surveillance was implemented in early 2020 [16]. Similarly, there were only 23 reported infections in five major U.S. cities by March 1. However, it has been estimated that there were already more than 28000 total infections in those cities by then [13, 3]. Similar trends were observed in Italy, Germany, and the UK [18]. Even currently, a majority of asymptomatic and mild symptomatic infections remain difficult to identify [21, 35, 12].

More generally, accurate estimation of the true extent of the outbreak is a fundamental epidemiological question, and is also critical for pandemic planning and response. Note that the *true reported rate* (i.e., the ground truth reported rate), which is the ratio of reported infections to total infections, is different from the *case ascertainment rate*, which is the ratio of confirmed symptomatic infections to the true number of symptomatic individuals [50]. The ascertainment rate is also an important parameter, and has been studied extensively. However, the true reported rate can help in better evaluation of the effectiveness of different interventions, which has been a big challenge due to the uncertainty in the number of infections and deaths, e.g., [19, 14, 50]. It can also help in efficient resource allocation and the design and implementation of non-pharmaceutical interventions.

One of the most effective methods to identify the true magnitude of unreported infections in a region is through large-scale serological studies [17, 53, 32, 57]. These surveys use blood tests to identify the prevalence of antibodies against SARS-CoV-2 in a large population. The CDC COVID Data Tracker portal [2] summarizes the results of serological studies conducted by commercial laboratories at the U.S. national level as well as at 10 specific sites. As per the portal, in a sample collected by the CDC in Connecticut between April 26 and May 3, 2020, it was estimated that the total infections were at least 6 times higher than the reported infections. Similarly, the estimated ratio of total infections to reported infections in a sample collected between April 13 and April 25, 2020 in Minnesota was at least 10. The same ratio was 12 in New York City between March 23 and April 1, 2020. While serological studies give an accurate estimation of the unreported infections, they are expensive to carry out and not sustainable in the long run. It is even more challenging to perform real-time serological studies, as there are unavoidable delays between sample collection and laboratory tests. The situation is even worse during the early stages of a novel pandemic, when proper surveillance channels are not yet set up. Serological studies can also suffer from sampling biases [9], and heuristics have been designed to account for them.

Other lines of work to estimate unreported infections for COVID-19 try to exploit existing influenza surveillance systems to estimate symptomatic COVID-19 infections [42], due to symptomatic similarities between the two diseases. However, they also suffer from numerous issues, ranging from ad-hoc corrections to account for the symptomatic similarities, to the low resolution of the influenza surveillance systems (e.g., they fail to account for outbreaks in nursing homes), to inaccuracy stemming from the seasonality of influenza when ILI surveillance systems are operating normally.

In the face of these surveillance challenges, data scientists and epidemiologists have devoted much time and effort to estimating the reported rate and forecasting future COVID-19 trends through models [41, 50, 14, 49, 38, 40, 56]. Usually the reported rate *α*reported is one of the parameters of the epidemiological model which needs to be calibrated/estimated. Although these epidemiological models, designed with expert knowledge, help guide policymakers, they can have several drawbacks. First, they often employ ad-hoc modelling assumptions such as equating unreported infections with case ascertainment [50] and/or predefining a set of parameters [22]. Second, they often use heuristics for parameter estimation, e.g., limiting the search space to a narrow range [28]. As a result, judging the *quality* of a specific parameterization **p***M* of an epidemiological model *M* is hard. This may lead to multiple divergent parameterizations, a problem which is magnified by noisy reported infection data, especially during the early stages of a pandemic [27, 22]. As we show later in our results, small errors in estimates of reported rates may lead to very different forecasts of future trends as well as different non-pharmaceutical intervention policy recommendations. Hence a principled approach to distinguish between these nearly equivalent epidemiological model parameterizations is needed in order to estimate the true extent of the outbreak.

To achieve this goal, in this paper, we propose a new information theory based approach. Imagine that we are given the ground truth extent of the outbreak, i.e., the total infections *D*gt. Let us call the parameterization we get by calibrating a given epidemiological model against *D*gt **p**gt. Note that the ground truth reported rate *α*reported(gt) is also one of the parameters in **p**gt, so we can easily derive the reported infections as *D*reported = *D*gt × *α*reported(gt). However, in the real world, we know neither *D*gt nor *α*reported(gt). Therefore, we leverage an information theoretic framework to find the total infections *D* by posing an optimization problem that selects the *D* which minimizes the number of bits needed to describe the observed *D*reported.

To describe *D* efficiently, we also encode **p**, some baseline parameterization, and **p’**, a candidate parameterization. Then the optimization is defined over all possible values of Model = (*D*, **p’**, **p**) with the goal of describing Data = *D*reported efficiently. Following the convention in information theory, we use *L*(Model) to denote the number of bits required to encode the Model (i.e., *D*, **p’**, and **p**). Similarly, *L*(Data | Model) is the number of bits required to encode the Data, *D*reported, given *D*, **p’**, and **p**. Then the overall objective of the optimization problem is to infer the optimal Model∗ which minimizes *L*(Model) + *L*(Data | Model).
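Written out in display form, the objective described above reads as follows. This is only a restatement of the prose; the concrete encoding costs behind each *L*(·) term are not given in this section, so none are assumed here.

```latex
\text{Model}^{*}
  = \operatorname*{arg\,min}_{\text{Model} \,=\, (D,\ \mathbf{p'},\ \mathbf{p})}
    \Big[\, L(\text{Model}) + L(\text{Data} \mid \text{Model}) \,\Big],
\qquad \text{Data} = D_{\text{reported}}
```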

The optimization framework described above follows the popular two-part Minimum Description Length (MDL) framework. The two-part MDL (also known as the sender-receiver framework) consists of hypothetical actors *S* and *R*. Sender *S* is in possession of the Data and wants to transmit it to receiver *R* using as few bits as possible [31]. Hence, sender *S* searches for the best possible Model, which minimizes the overall cost of encoding and transmitting both the Model and the Data given the Model. Note that the two-part MDL (and MDL in general) does not make any assumption on the nature of the Data or the Model. As such, it has been widely used for numerous optimization problems, ranging from network summarization [37] and causality inference [20] to failure detection in critical infrastructures [10]. MDL has also previously been used for some epidemiological problems, mainly in inferring patient-zero and associated infections in cascades over contact networks [48]. However, our technique is the first to propose an MDL-based approach on top of ODE-based epidemiological models, which are harder to formulate and optimize over.

We call our MDL-based optimization framework MdlInfer. We use the ODE model proposed in [24] as the base epidemiological model *O*M for MdlInfer in this article. It consists of compartments representing the Susceptible, Exposed, various levels of Symptomatic infections, Hospitalization, Death, and Recovered states. Following past work [36], we assume that a fraction of symptomatic infections are reported each day and the rest are unreported. The parameters to be inferred include the transmission rate, the proportional reduction in transmission post shelter-in-place orders, the number of initial infections, and the reported rate (see the Materials and Methods section for more detail). We construct a baseline using the standard approach of model calibration to estimate the reported rate. More specifically, we calibrate *O*M using *D*reported following the procedure in [24], and use this as the baseline parameterization BaseParam, represented by **p**. On the other hand, the parameterization of *O*M resulting from our approach is termed MdlParam and is represented by **p’**.
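To make the role of the reported rate concrete, here is a minimal sketch of a compartmental ODE model in which a fixed fraction of new infections is reported. It is a deliberately simplified stand-in, not the model of [24]: it uses a plain SEIR structure, treats every infection as symptomatic, integrates with forward Euler, and all names (`simulate_seir`, `beta`, `sigma`, `gamma`, `alpha_reported`) and the example values are assumptions chosen for illustration.

```python
def simulate_seir(beta, sigma, gamma, alpha_reported, n_pop, e0, days, dt=0.1):
    """Forward-Euler integration of a toy SEIR model (illustrative only).

    Returns two lists of length `days`: daily new infections (total) and
    the reported share of them, alpha_reported * total.
    """
    s, e, i, r = n_pop - e0, float(e0), 0.0, 0.0
    steps_per_day = round(1 / dt)
    total_new, reported_new = [], []
    for _ in range(days):
        new_today = 0.0
        for _ in range(steps_per_day):
            infections = beta * s * i / n_pop * dt  # S -> E
            onsets = sigma * e * dt                 # E -> I
            recoveries = gamma * i * dt             # I -> R
            s -= infections
            e += infections - onsets
            i += onsets - recoveries
            r += recoveries
            new_today += onsets
        total_new.append(new_today)
        # Assumed reporting rule: a fixed fraction of new infections is reported.
        reported_new.append(alpha_reported * new_today)
    return total_new, reported_new


# Hypothetical parameter values, chosen only to exercise the function.
total, reported = simulate_seir(
    beta=0.5, sigma=0.2, gamma=0.1, alpha_reported=0.1,
    n_pop=1_000_000, e0=10, days=30,
)
```

In this toy setting, calibration would amount to choosing `beta`, `e0`, and `alpha_reported` so that `reported` matches the observed reported infections; many different combinations can fit similarly well, which is the ambiguity the model selection step is meant to resolve.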

Our experiments show that MdlParam is superior to BaseParam in a variety of tasks. An example is shown in Fig. 1. On March 2, 2020, Santa Clara county had only 11 reported COVID-19 infections. The best version of BaseParam estimated that there were 330 total infections (colored light green in the iceberg). On the other hand, our MdlParam gave an estimate of 697 total infections, shown below the iceberg, which is closer to the total infections estimated from serological studies [17]. In the following Results section, we show that MdlInfer leads to more accurate estimates of unreported infections and the symptomatic rate than the baseline parameterization. It also leads to a better fit and better projections of reported infections. We also demonstrate that MdlInfer estimates the reported rate over time more accurately and can aid in policy making through analysis of counter-factual non-pharmaceutical interventions.

![Figure 1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/09/22/2021.09.14.21263467/F1.medium.gif)

Figure 1: The workflow of MdlInfer. (a) The baseline parameterization underestimates the unreported infections (shown in light green in the iceberg plot in (c)). (b) Our MDL model selection approach MdlInfer accurately estimates the extent of the COVID-19 outbreak by inferring unreported infections given the reported infections. In MdlInfer, the Data that a hypothetical Sender *S* wants to transmit to a Receiver *R* is the reported infections *D*reported. At a high level, the sender *S* will send the Data by sending the total infections *D*, the baseline model parameterization **p**, and the inferred best parameterization **p’**. (c) The number of reported COVID-19 infections is just the tip of an iceberg, while the whole iceberg corresponds to the total infections estimated by MdlInfer. (d) MdlParam reveals that a large majority of COVID-19 infections were unreported in multiple regions at different geographical resolutions and in different time periods. We visualize the inferred reported rate using the icebergs. The height of the iceberg above the water (in white) and the height below the water (in blue) are proportional to the reported infections and the unreported infections estimated using MdlInfer, respectively.

## Results

Here we present our empirical findings from a set of experiments across different geographical regions and time periods. We chose these regions based on the severity of the outbreak in the early stages of the pandemic and the availability of serological studies and symptomatic surveillance data. In each region, we divide the timeline into two periods: (i) the observed period, when the number of reported infections is available and the models are trained to learn the parameters, and (ii) the future period, where we evaluate the forecasts generated by the epidemiological models trained in the observed period. Note that the inferred parameters include the proportional reduction in transmission post shelter-in-place orders. Hence one would expect a more accurate parameterization to generalize better regardless of policy changes. The division between the periods is chosen either (i) at the onset of non-pharmaceutical intervention (NPI) policy, where we choose the division date based on the onset of interventions such as stay-at-home orders, or (ii) by random partition where no such information is available.

The days prior to the start of NPI policy are in the observed period and the remaining days are in the future period. Note that the reported infections in the future period are not made available to the models. However, epidemiological model parameters include the proportional reduction in transmission rate post shelter-in-place orders. Hence, one would expect that the better parameterization leads to a better forecast. Next we describe our experiments and findings in detail.
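The split described above can be sketched in a few lines. The function name `split_periods`, the data layout (a date-sorted list of `(date, reported_infections)` pairs), and the example values are assumptions made for illustration and are not taken from the study's data.

```python
from datetime import date


def split_periods(series, split_date):
    """Split a date-sorted series of (date, value) pairs at split_date.

    Days strictly before split_date form the observed period; split_date
    and all later days form the future period.
    """
    observed = [(d, v) for d, v in series if d < split_date]
    future = [(d, v) for d, v in series if d >= split_date]
    return observed, future


# Hypothetical counts; the split date stands in for the start of an NPI policy.
series = [
    (date(2020, 3, 14), 5),
    (date(2020, 3, 15), 8),
    (date(2020, 3, 16), 13),
    (date(2020, 3, 17), 21),
]
observed, future = split_periods(series, date(2020, 3, 16))
```

Only `observed` would be passed to calibration; `future` is held back and used solely to score the forecasts.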

To aid understanding, we also list the notation in the Supplementary Information.

### Estimating Total Infections

#### MdlParam estimates total infections more accurately than BaseParam

We plotted the total infections, MdlParamTinf, inferred by MdlParam (green curve) against the point estimates of the total infections calculated from serological studies, SeroStudyTinf (red dot), in the same figure (see Fig. 2, top row). Note that to compare with SeroStudyTinf, we use the cumulative value of estimated total infections. The vertical red line shows the 95% confidence interval of the estimate made by the serological studies. The blue curve represents the total infections, BaseParamTinf, estimated by BaseParam.

![Figure 2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/09/22/2021.09.14.21263467/F2.medium.gif)

Figure 2: Results of total infections and symptomatic rate estimated by BaseParam and MdlParam. (a)-(c) MdlParam estimates total infections more accurately than BaseParam. The grey dashed line divides the observed period (used to train BaseParam as well as MdlParam) and the future period (which was not accessible to the model while training). The blue curve and green curve represent the total infections estimated by the baseline parameterization, BaseParamTinf, and the total infections estimated by the MdlInfer parameterization, MdlParamTinf, respectively. The red point estimate SeroStudyTinf and confidence interval represent the total infections estimated by serological studies [17, 32]. Note that each plot corresponds to a different geographic region, and the scales are different. (d) The performance metric, *ρ*Tinf, comparing MdlParam against BaseParam in estimating total infections is shown for the regions in (a)-(c). Here the values of *ρ*Tinf are 1.50, 7.40, and 9.62, implying that MdlParam performs better in estimating total infections than BaseParam. (e)-(g) MdlParam estimates the symptomatic rate more accurately than BaseParam. The blue curve and green curve represent the symptomatic rate estimated by the baseline parameterization, BaseParamSymp, and the symptomatic rate estimated by the MdlInfer parameterization, MdlParamSymp, respectively. The red point estimate RateSymp and confidence interval represent the COVID-related symptomatic rate from Facebook’s symptomatic surveillance data [8]. (h) The performance metric, *ρ*Symp, comparing MdlParam against BaseParam in estimating the symptomatic rate is shown for the regions in (e)-(g). Here the values of *ρ*Symp are 1.71, 1.29, and 6.36, implying that MdlParam performs better in estimating the symptomatic rate than BaseParam.

As seen in the figure, the total infections estimated by MdlParam fall within the confidence interval of the estimates given by serological studies. Fig. 2 (c) shows the results for the Western Washington region, where the baseline parameterization overestimates the total infections. However, MdlInfer correctly predicts a lower number of total infections. This observation, in conjunction with the results in Santa Clara and Bucks county, shows that MdlInfer can improve upon the baseline parameterization in either direction (i.e., by increasing or decreasing the total infections) as necessary.

To quantify the performance gap between the two approaches, we define the metric *ρ*Tinf as ![Graphic][1]. In Fig. 2 (d), we plot *ρ*Tinf. Note that values of *ρ*Tinf greater than 1 imply that MdlParam’s estimate of the total infections is closer to the serological study estimate than BaseParam’s estimate. Overall, *ρ*Tinf is usually larger than 1 (1.50, 7.40, and 9.62 in (a)-(c)), which indicates that MdlParam generally performs better in estimating total infections than BaseParam. Additional results in the Supplementary Information support the same conclusion.
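The defining formula for *ρ*Tinf is given as a graphic and is not reproduced in the text. One form consistent with the stated interpretation (values above 1 mean MdlParam is closer to the serological estimate) is the ratio of the two absolute errors. The sketch below implements that assumed form; the function name and the example numbers are hypothetical and are not values from the study.

```python
def rho_tinf(base_param_tinf, mdl_param_tinf, sero_study_tinf):
    """Assumed form of the metric: |BaseParam error| / |MdlParam error|.

    A value greater than 1 means the MdlParam estimate lies closer to the
    serological estimate than the BaseParam estimate does.
    """
    base_error = abs(base_param_tinf - sero_study_tinf)
    mdl_error = abs(mdl_param_tinf - sero_study_tinf)
    if mdl_error == 0:
        return float("inf")  # MdlParam matches the serological estimate exactly
    return base_error / mdl_error


# Toy numbers: baseline is off by 50, MdlParam is off by 10.
ratio = rho_tinf(base_param_tinf=50, mdl_param_tinf=90, sero_study_tinf=100)
```

Because absolute errors are used, the ratio treats over- and under-estimation symmetrically, which matches the observation that the baseline can err in either direction.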

### Estimating Symptomatic Rate

#### MdlParam estimates the symptomatic rate more accurately than BaseParam

We validate this observation using Facebook’s symptomatic surveillance data [8]. Here we plot the inferred symptomatic rate over time and overlay the estimates and confidence interval from the symptomatic surveillance data (see Fig. 2, bottom row). The green and blue curves are the MdlParam and BaseParam symptomatic rates, MdlParamSymp and BaseParamSymp, respectively. As seen in the figure, the ground truth symptomatic rate RateSymp (represented by red plus symbols) aligns more closely with MdlParamSymp. Using a definition similar to the earlier one, we define a quantitative performance metric to compare the performance of MdlParam against BaseParam in estimating the symptomatic rate, ![Graphic][2]. In Fig. 2 (h), we plot *ρ*Symp. We notice that *ρ*Symp is larger than 1 in all settings (1.71, 1.29, and 6.36 in (e)-(g)), indicating that MdlParam performs better than BaseParam in estimating the symptomatic rate. Additional results in the Supplementary Information support the same conclusion.

To summarize, these two sets of experiments together demonstrate that BaseParam fails to estimate the true unreported infections. On the other hand, MdlParam infers unreported infections that are closer to those estimated by serological studies and symptomatic surveillance data.

### Estimating Reported Infections

#### MdlParam leads to better fit and projection than BaseParam at different stages of the COVID-19 epidemic

We next evaluate both parameterizations by how closely their estimated reported infections match the observed reported infections. Here we train MdlParam on the observed period and calibrate BaseParam on the same period. We then use the trained models to forecast the reported infections in the future period, which was not accessible to the models during training. The results are summarized in Fig. 3.

![Figure 3:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/09/22/2021.09.14.21263467/F3.medium.gif)

Figure 3: Results of reported infections estimated by BaseParam and MdlParam. (a)-(f) MdlParam leads to better fit and projection than BaseParam. The grey dashed line divides the observed period and the future period. Black plus symbols, the blue curve, and the green curve represent the infections reported by the New York Times, NYT-Rinf, the reported infections estimated by the baseline parameterization, BaseParamRinf, and the reported infections estimated by the MdlParam parameterization, MdlParamRinf, respectively. Note that each plot corresponds to a different geographic region, and the scales are different. The results show that MdlParamRinf aligns much more closely with NYT-Rinf than BaseParamRinf does. In addition, MdlParamRinf projects the future trends better than BaseParamRinf. (g)-(h) The performance metric, *ρ*Rinf, comparing MdlParam against BaseParam in estimating reported infections is shown for the regions for both the observed period (g) and the future period (h). In both the observed and future periods, the reported infections estimated by MdlInfer are better than those of the baseline parameterization overall, and *ρ*Rinf in the future period is even larger than *ρ*Rinf in the observed period, implying that MdlParam leads to better generalization.

In Fig. 3 (a) to (f), the vertical grey dashed line divides the observed and future periods. The black plus symbols represent the New York Times reported infections, NYT-Rinf. The blue curve depicts the number of reported infections, BaseParamRinf, as per BaseParam. Finally, the green curve represents the number of reported infections, MdlParamRinf, as per MdlParam. Note that the curves to the right of the vertical grey dashed line are the future predictions. As seen in the figure, NYT-Rinf aligns more closely with MdlParamRinf than with BaseParamRinf, indicating the superiority of our approach in estimating reported infections for both the observed and future periods.

We again define a performance metric to compare MdlParam against BaseParam in estimating reported infections as ![Graphic][3]. In Fig. 3 (g) and (h), we plot *ρ*Rinf for the observed period and the future period, respectively. In both (g) and (h), we observe that *ρ*Rinf is greater than 1. This further shows that MdlParam results in a better fit on reported infections than BaseParam. We also notice that MdlParam’s gain is minimal when the base epidemiological model is not expressive enough to capture the observed trend (see Fig. 3 (e)). However, such cases are rare.
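Since reported infections form a time series, a ratio metric of this kind needs an error measure over a whole period. The formula for *ρ*Rinf appears only as a graphic, so the sketch below assumes root-mean-square error (RMSE) as that measure; both the choice of RMSE and the names `rmse` and `rho_rinf` are assumptions, and the series are toy values.

```python
import math


def rmse(estimates, observations):
    """Root-mean-square error between two equally long, non-empty series."""
    if len(estimates) != len(observations) or not estimates:
        raise ValueError("series must be non-empty and of equal length")
    squared = sum((e - o) ** 2 for e, o in zip(estimates, observations))
    return math.sqrt(squared / len(estimates))


def rho_rinf(base_rinf, mdl_rinf, nyt_rinf):
    """Assumed form: RMSE of BaseParam divided by RMSE of MdlParam."""
    mdl_error = rmse(mdl_rinf, nyt_rinf)
    if mdl_error == 0:
        return float("inf")
    return rmse(base_rinf, nyt_rinf) / mdl_error


# Toy series: the baseline is off by 4 every day, MdlParam by 1.
nyt = [10, 20, 30]
ratio = rho_rinf(base_rinf=[14, 24, 34], mdl_rinf=[11, 21, 31], nyt_rinf=nyt)
```

The same function can be evaluated separately on the observed and the future period by slicing the three series at the split date before calling it.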

Note that Fig. 3 (a) to (c) correspond to the early stage of the COVID-19 epidemic in spring 2020, and Fig. 3 (d) to (f) correspond to fall 2020. Here, the value of *ρ*Rinf is greater than 1, indicating that MdlParam outperforms BaseParam and that MdlInfer performs well in estimating temporal patterns at different stages of the COVID-19 epidemic. Additional results in the Supplementary Information support the same conclusion.

### Evaluating effect of Non-pharmaceutical Interventions

#### MdlParam reveals that a large majority of COVID-19 infections were unreported

We plotted the reported rate inferred by MdlParam using the icebergs (see Fig. 1 (d)). The height of the iceberg above the water surface (shown in white) and the height below the water (in blue) are proportional to the number of reported infections and the number of inferred unreported infections, respectively. As seen in the figure, the height above the water is significantly smaller than the height under the water across all regions, implying that the majority of the infections were actually unreported. We also notice that the reported rate for New York City is extremely low compared with other regions. This could be explained by the fact that New York City has more international transportation, and therefore COVID-19 may have already been spreading undetected there for a longer time than in other regions.

We also computed the reported rate MdlParamRate measured by the ratio of the cumulative value of reported infections to the total infections estimated by MdlParam over time and plotted it in Fig. 4 (a).
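The reported rate over time can be computed from two daily series, as sketched below. The sketch assumes that both the reported and the estimated total infections are accumulated before taking the ratio; the function name and the toy inputs are illustrative and not taken from the study.

```python
from itertools import accumulate


def reported_rate_over_time(daily_reported, daily_total):
    """Cumulative reported infections divided by cumulative total infections.

    Returns one rate per day; days with no cumulative total infections yet
    yield NaN rather than raising a division error.
    """
    cum_reported = accumulate(daily_reported)
    cum_total = accumulate(daily_total)
    return [
        r / t if t > 0 else float("nan")
        for r, t in zip(cum_reported, cum_total)
    ]


# Toy series in which one in ten infections is reported every day.
rates = reported_rate_over_time([1, 1, 2], [10, 10, 20])
```

A rate computed this way changes slowly once the cumulative counts are large, so sharp movements such as the early increase described for Santa Clara county can only appear while the counts are still small.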

![Figure 4:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/09/22/2021.09.14.21263467/F4.medium.gif)

Figure 4: Results of reported rate, and non-pharmaceutical intervention simulations on asymptomatic and presymptomatic infections. (a) The reported rate trend during the COVID-19 outbreak. The blue curve and green curve represent the reported rate estimated by the baseline parameterization, BaseParamRate, and the reported rate estimated by the MdlInfer parameterization, MdlParamRate, respectively. The red point estimate and confidence interval represent the reported rate SeroStudyRate estimated by serological studies [17]. As seen in the figure, MdlParamRate is high in February, possibly due to imported infections, and then gradually decreases. It finally increases in March 2020. Our results in (b) and (c) demonstrate that the accuracy of non-pharmaceutical intervention simulations relies on the accuracy of the inferred unreported infections. (b) The grey dashed line divides the observed period and the future period. The blue curve represents the reported infections estimated by BaseParam. The other five curves represent the simulated reported infections for 5 scenarios: isolating the reported infections, the symptomatic infections, and the symptomatic infections plus 25%, 50%, or 75% of asymptomatic and presymptomatic infections, where we reduce the infectiousness of these isolated infections by half in the future period. As seen in the figure, non-pharmaceutical interventions on symptomatic infections alone appear to be enough to control the COVID-19 epidemic. However, this has been proven to be incorrect by prior studies and real-world observations [43]. (c) Non-pharmaceutical interventions on asymptomatic and presymptomatic infections are essential to control the COVID-19 epidemic. The green curve represents the reported infections estimated by MdlParam. The other five curves represent the simulated reported infections for the same 5 scenarios as in (b). As seen in the figure, even when all symptomatic infections are isolated, the reported infections still increase. Therefore, the reported infections will decrease only with non-pharmaceutical interventions on some fraction of asymptomatic and presymptomatic infections.

As seen in the figure, the reported rate MdlParamRate sharply increased in early February in Santa Clara county, CA, followed by a gradual decay and an eventual rise. This observation is explained by the fact that Santa Clara was one of the first counties to observe imported infections. However, the COVID-19 outbreaks driven by community spread were not reported until late February. The observation that MdlParamRate is extremely low in early March is also consistent with an earlier study [42].

#### Non-pharmaceutical interventions on asymptomatic and presymptomatic infections are essential to control the COVID-19 epidemic

Our simulations show that non-pharmaceutical interventions on asymptomatic and presymptomatic infections are essential to control COVID-19. Here, we plotted the simulated reported infections generated by MdlParam in Fig. 4 (c) (green curve). We then repeated the simulation of reported infections for 5 different scenarios: isolating just the reported infections, just the symptomatic infections, and the symptomatic infections in addition to 25%, 50%, and 75% of both asymptomatic and presymptomatic infections. In our setup, we assume that infectiousness is reduced by half when a person is isolated. As seen in the figure, when only the reported infections are isolated, there is almost no change in the “future” reported infections. When we isolate all symptomatic infections, reported and unreported, the reported infections decrease significantly. Even here, however, the reported infections are still on the rise. On the other hand, non-pharmaceutical interventions on some fraction of asymptomatic and presymptomatic infections lead to a decrease in reported infections. Thus, we can conclude that non-pharmaceutical interventions on asymptomatic and presymptomatic infections are essential in controlling the COVID-19 epidemic.
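The five scenarios can be expressed as multipliers on the infectiousness of each group, as in the sketch below. Only the halving of infectiousness under isolation comes from the text; how the multipliers enter the ODE model is not specified in this section, and the function names, the scenario labels, and the `reported_share` argument (the fraction of symptomatic infections that are reported) are assumptions for illustration.

```python
ISOLATION_FACTOR = 0.5  # from the text: isolation halves infectiousness


def infectiousness_multiplier(fraction_isolated):
    """Average infectiousness of a group relative to no isolation."""
    if not 0.0 <= fraction_isolated <= 1.0:
        raise ValueError("fraction_isolated must lie in [0, 1]")
    return 1.0 - (1.0 - ISOLATION_FACTOR) * fraction_isolated


def scenario_multipliers(reported_share):
    """Map each scenario to (symptomatic, asymptomatic/presymptomatic) multipliers.

    reported_share is a hypothetical input: the fraction of symptomatic
    infections that are reported, and hence isolated in the first scenario.
    """
    isolated_fractions = {
        "reported only": (reported_share, 0.0),
        "symptomatic": (1.0, 0.0),
        "symptomatic + 25%": (1.0, 0.25),
        "symptomatic + 50%": (1.0, 0.50),
        "symptomatic + 75%": (1.0, 0.75),
    }
    return {
        name: (infectiousness_multiplier(s), infectiousness_multiplier(a))
        for name, (s, a) in isolated_fractions.items()
    }


multipliers = scenario_multipliers(reported_share=0.2)
```

Under this reading, isolating only reported infections barely changes the symptomatic multiplier when the reported share is small, which is in line with the near-unchanged curve described for that scenario.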

#### Accuracy of non-pharmaceutical intervention simulations relies on good inference of unreported infections

Here we also plotted the simulated reported infections generated by BaseParam in Fig. 4 (b) (blue curve). As seen in the figure, based on BaseParam one would infer that non-pharmaceutical interventions on symptomatic infections alone are enough to control the COVID-19 epidemic. However, this has been proven to be incorrect by prior studies and real-world observations [43]. Therefore, we can conclude that the accuracy of non-pharmaceutical intervention simulations relies on the quality of the inferred parameterization.

## Discussion and Future Work

In this study, we propose MdlInfer, a data-driven model selection approach that automatically estimates unreported infections based on an optimal parameterization of epidemiological models. Our approach leverages the information theoretic Minimum Description Length (MDL) principle to select the unreported infections that best describe the observed outbreak. Our approach addresses several gaps in current practice, including the long-term infeasibility of serological studies [17, 32] and a lack of formal model selection over epidemiological models. MdlInfer employs a principled method that searches over a large space of parameterizations and selects the one that describes the outbreak best, while existing epidemiological models [24, 41, 45] typically rely on a parameterization inferred by ad-hoc heuristics. The MdlInfer framework can also be adapted to work on a set of calibrated parameterizations to generate uncertainty estimates.

Overall, our results show that MdlInfer estimates total infections and the symptomatic rate at various geographical scales more accurately than the baseline parameterization in both directions, i.e., it corrects both over- and under-estimates. For example, compared to the baseline parameterization, we correctly estimate 10792 more infections by April 1 in Santa Clara, CA (Fig. 2 (a)), and 214710 fewer infections in Western Washington (Fig. 2 (c)). We also show that MdlInfer leads to a better fit of the reported infections in the observed period and more accurate forecasts for the future period than the baseline parameterization. We also reveal that a large majority of COVID-19 infections were unreported, and that non-pharmaceutical interventions on unreported infections can help mitigate the COVID-19 outbreak. Although our results are slightly inconsistent in matching the symptomatic rate, which can be attributed to noise in the surveillance data, we still fit the symptomatic rate better than the baseline parameterization. Moreover, our results show consistent performance with respect to the reported infections and serological studies.

The MdlInfer framework is likely to be useful in surveillance of COVID-19 in the near future, and for future epidemics. Even with the U.S. returning to normalcy, surveillance of the pandemic is still important for public health. Daily incidence of COVID-19 has decreased since early 2021, according to the CDC COVID Data Tracker portal [6]. However, new variants of SARS-CoV-2 (e.g., the delta and gamma variants) have been spreading rapidly [39, 47, 30]. Testing for these, and large-scale surveillance via laboratory tests, may be limited and less systematic than what was done for COVID-19 in the past few months. In such settings, epidemiologists and policymakers can use our MdlInfer framework to improve the accuracy of estimates of total infections (without large-scale serological studies), as well as the forecasts of their models.

One of the limitations of our work is that the benefits of using MdlInfer depend on the suitability of the base epidemiological model. If the base epidemiological model is not expressive enough for the observed data, then the gains from MdlInfer may not be significant. As future work, it may be useful to adapt MdlInfer to give a measure of the quality of the base epidemiological model. We also note that MdlInfer is built on ODE-based epidemiological models; other kinds of epidemic models, e.g., agent-based models [25, 52, 34, 55, 23, 46], are more suitable in some settings. It would be interesting to extend MdlInfer to incorporate such models. Finally, there is significant population heterogeneity in disease outcomes, e.g., the severity rate and mortality rate of COVID-19 differ across age groups [44, 33], which has not been considered in our work.

To summarize, MdlInfer is a robust data-driven method to accurately estimate unreported infections, which will help data scientists, epidemiologists, and policy makers to further improve existing ODE-based epidemiological models, make accurate forecasts, and combat the ongoing COVID-19 pandemic. More generally, MdlInfer opens up a new line of research in epidemiology using information theoretic methods.

## Materials and Methods

### Data

We are using the following publicly available datasets for our study.

1.  **New York Times reported infections dataset:** The New York Times reported infections dataset, or NYT-Rinf, consists of the daily time sequences of reported infections *D*reported and mortality *D*mortality (cumulative values) for each county in the US starting from January 21, 2020 [7].

2.  **Serological studies:** The serological studies [17, 32] consist of point estimates and 95% confidence intervals of the prevalence of antibodies to SARS-CoV-2. Using the population and the prevalence of antibodies, we can compute the estimated total infections SeroStudyTinf and its 95% confidence interval in the location.

3.  **Symptomatic surveillance data:** The symptomatic surveillance data [8] consists of the point estimate RateSymp and the confidence interval of the COVID-related symptomatic rate for each county in the US starting from April 6, 2020. The survey asks a series of questions designed to help researchers understand the spread of COVID-19 and its effect on people in the United States.
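As a minimal sketch of the conversion described for the serological studies, the snippet below scales a seroprevalence point estimate and its 95% confidence bounds by the population of a location. The function name `sero_total_infections`, the `InfectionEstimate` type, and the numbers in the usage lines are illustrative assumptions, not values taken from [17, 32].

```python
from typing import NamedTuple


class InfectionEstimate(NamedTuple):
    point: float
    lower: float
    upper: float


def sero_total_infections(population: int, prevalence: float,
                          ci_low: float, ci_high: float) -> InfectionEstimate:
    """Scale antibody prevalence (fractions in [0, 1]) to infection counts."""
    if not 0.0 <= ci_low <= prevalence <= ci_high <= 1.0:
        raise ValueError("expected 0 <= ci_low <= prevalence <= ci_high <= 1")
    return InfectionEstimate(population * prevalence,
                             population * ci_low,
                             population * ci_high)


# Hypothetical location: 1,000,000 residents, 2% prevalence (95% CI 1% to 3%).
estimate = sero_total_infections(1_000_000, 0.02, 0.01, 0.03)
print(estimate)
```

The returned triple plays the role of SeroStudyTinf and its interval when comparing against model-based estimates of total infections.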

### Our Approach

#### Minimum Description Length

We formulate the problem of inferring COVID-19 unreported infections using the Minimum Description Length (MDL) principle [10, 31]. Here we use the two-part sender-receiver framework. The goal of the framework is to transmit the Data from the sender *S* to the receiver *R* using a Model. We do this by identifying the Model that describes the Data such that the total number of bits needed to represent both the Model and the Data is minimized. We term this a cost function composed of two parts:

1.  Model cost *L*(Model): The cost of describing the Model.

2.  Data cost *L*(Data|Model): The cost of describing Data given the Model.
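To make the two-part cost concrete, here is a toy sketch that is unrelated to the epidemiological model itself. The Model is a single integer baseline, the Data is a short integer sequence, and both are priced in bits with a zigzag plus Elias gamma integer code. That code is an assumption chosen for illustration and is not necessarily the encoding used in the Supplementary Information; all function names are hypothetical.

```python
def signed_int_bits(n: int) -> int:
    """Bits to send integer n: zigzag map to a positive integer, then Elias gamma."""
    z = 2 * n if n >= 0 else -2 * n - 1   # zigzag: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    m = z + 1                             # Elias gamma needs m >= 1
    return 2 * m.bit_length() - 1         # 2 * floor(log2(m)) + 1


def model_cost(baseline: int) -> int:
    """L(Model): bits to describe the toy model, a single integer baseline."""
    return signed_int_bits(baseline)


def data_cost(data: list[int], baseline: int) -> int:
    """L(Data | Model): bits to describe each residual from the baseline."""
    return sum(signed_int_bits(x - baseline) for x in data)


def total_cost(data: list[int], baseline: int) -> int:
    """Two-part MDL cost: L(Model) + L(Data | Model)."""
    return model_cost(baseline) + data_cost(data, baseline)


data = [100, 102, 101, 99, 103]
# Search a small model space and keep the baseline with the fewest total bits.
best = min(range(0, 201), key=lambda c: total_cost(data, c))
print(best, total_cost(data, best), total_cost(data, 0))
```

A baseline near the data costs more bits to state than a baseline of 0, but it makes the residuals much cheaper, so the total is smaller. This trade-off between the two parts is what the MDL principle optimizes.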

#### MDL Formulation

Before formalizing the model space, *L*(Model) and *L*(Data | Model), we will first define some notations and concepts.

In this work, we term the COVID-19 reported infections as *D*reported, candidate total infections as *D* (which corresponds to unreported infections *D*unreported), and mortality as *D*mortality. We also term the epidemiological model [24] as *O*M.

For *O*M, we define the calibration on *D*reported and *D*mortality as the procedure to find the parameterization **p** that minimizes the negative log-likelihood between *D*reported(**p**) and *D*reported, and between *D*mortality(**p**) and *D*mortality. Here, *D*reported(**p**) is the reported infections output by *O*M based on **p**, and *D*mortality(**p**) is the mortality output by *O*M based on **p**. In this work, **p** is the baseline parameterization. We write this procedure as follows:

![Formula][4]

Note that the procedure can easily be adapted to calibrate without *D*mortality.

From **p**, we can generate the unreported infections *D*unreported(**p**) output by *O*M, and the total infections *D*(**p**) = *D*reported(**p**) + *D*unreported(**p**). We can also calculate the baseline reported rate **p**[*α*reported]:

![Formula][5]

Similarly, we can calibrate *O*M on *D*unreported = *D* − *D*reported, *D*reported, and *D*mortality to find the parameterization **p’**. We write this procedure as follows:

![Formula][6]

From **p’**, we can generate the unreported infections *D*unreported(**p’**), the reported infections *D*reported(**p’**), and the total infections *D*(**p’**) = *D*reported(**p’**) + *D*unreported(**p’**) output by *O*M. We can also calculate the reported rate **p’**[*α*reported]:

![Formula][7]

With the notations and concepts defined above, we next formalize the model space, *L*(Model), and *L*(Data|Model).

##### Model Space

In this work, the Data to describe is *D*reported, and therefore the most natural Model would have been Model = (**p**), as it directly corrects from the BaseParam reported infections *D*reported(**p**) to describe *D*reported. However, this has the disadvantage that the model space could be fragile: slightly different **p** could lead to vastly different costs. To account for this, we propose our Model as Model = (*D*, **p’, p**), which consists of three components. We use *D* to reparameterize **p’**, use **p** to send **p’**, and use **p’** and *D* to correct from **p’**[*α*reported] *× D* to send *D*reported.

##### Model Cost

With the model space proposed above, the sender *S* will send the Model = (*D*, **p’, p**) to the receiver *R* in three parts:

1.  First send the **p**.

2.  Next send the **p’** given **p**.

3.  Then send *D* given **p’** and **p**.

Therefore the MDL Model Cost *L*(*D*, **p’, p**) has three components:

![Formula][8]

Specifically, we send **p** directly, send **p’** given **p** by sending **p’** − **p**, and send *D* given **p’** and **p** by sending **p’**[*α*reported] *× D* − *D*reported(**p**). We write this as:

![Formula][9]

The details of the encoding method used to express the cost of each of the three components can be found in the Supplementary Information.
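The three sends can be sketched in code under stated assumptions. The actual encoding is given in the Supplementary Information and is not reproduced here; this sketch instead assumes that real values are rounded to a fixed precision and priced with a zigzag plus Elias gamma integer code. The names `real_bits`, `vector_bits`, and `model_cost`, the precision values, and the argument `alpha_reported` (standing for **p’**[*α*reported]) are all hypothetical.

```python
from typing import Sequence


def signed_int_bits(n: int) -> int:
    """Bits for integer n under a zigzag + Elias gamma code (an assumed encoding)."""
    z = 2 * n if n >= 0 else -2 * n - 1
    return 2 * (z + 1).bit_length() - 1


def real_bits(x: float, precision: float = 1e-3) -> int:
    """Bits for a real number after rounding it to a multiple of `precision`."""
    return signed_int_bits(round(x / precision))


def vector_bits(values: Sequence[float], precision: float = 1e-3) -> int:
    """Bits for a vector, pricing each entry independently."""
    return sum(real_bits(v, precision) for v in values)


def model_cost(p: Sequence[float], p_prime: Sequence[float],
               alpha_reported: float, D: Sequence[float],
               D_reported_p: Sequence[float]) -> int:
    """L(p) + L(p' | p) + L(D | p', p), following the three sends in the text."""
    cost_p = vector_bits(p)                                           # send p directly
    cost_p_prime = vector_bits([b - a for a, b in zip(p, p_prime)])   # send p' - p
    # Send alpha_reported * D - D_reported(p), one value per time step,
    # rounded to whole infections.
    diff = [alpha_reported * d - r for d, r in zip(D, D_reported_p)]
    cost_D = vector_bits(diff, precision=1.0)
    return cost_p + cost_p_prime + cost_D


# Illustrative inputs only: two parameters and two time steps.
p = [0.5, 0.3]
D = [100.0, 200.0]
D_reported_p = [10.0, 20.0]
print(model_cost(p, p, 0.1, D, D_reported_p))
print(model_cost(p, [0.9, 0.3], 0.1, D, D_reported_p))
```

Under this assumed code, a **p’** close to **p** and a *D* whose scaled values stay close to *D*reported(**p**) are cheap to send, which is the intuition behind encoding differences rather than raw values.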

##### Data Cost

Next we need to send (describe) the Data, i.e., the reported infections *D*reported, in terms of the Model. Given the Model = (*D*, **p’, p**), we describe *D*reported by describing ![Graphic][10]. We write the MDL Data Cost as:

![Formula][11]

##### Total MDL Cost

Putting the MDL Model Cost and MDL Data Cost together, the Total MDL Cost is

![Formula][12]

##### Problem Statement

With *L*(*D*reported, *D*, **p’, p**) formulated above, we can state the problem as one of searching for the best total infections *D*∗: given the time sequence *D*reported and the epidemiological model *O*M, find the *D*∗ that minimizes the MDL total cost, i.e.,

![Formula][13]

##### Algorithm

Next, we present our algorithm to find *D*∗. Note that naively searching for *D*∗ directly is intractable. An alternative is to first quickly find a good reported rate ![Graphic][14], since we can constrain ![Graphic][15] and reduce the search space. Then, with the ![Graphic][16] from step 1, we can search for the optimal *D*∗. Hence we propose a two-step search algorithm to find *D*∗. The key steps of our algorithm are as follows:

1.  First, we do a linear search to find a good reported rate ![Graphic][17].

2.  Given the ![Graphic][18] found above, we use an optimization method to find the *D*∗ that minimizes *L*(*D*reported, *D*, **p’, p**) under the ![Graphic][19] constraints.

##### Step 1: Find the ![Graphic][20]

In step 1, we do a linear search on *α*reported to find the ![Graphic][21]. We formulate this as

![Formula][22]

where the ![Graphic][23] in the linear search. This helps to reduce the search space. Hence, **p’** is

![Formula][24]

For stability and robustness, as *D*reported can be noisy, we use *D*(**p’**) instead of ![Graphic][25] as *D* in the MDL Total Cost in this step.
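A linear search of this kind can be sketched as a scan over an explicit grid of candidate reported rates, keeping the one with the lowest cost. The cost function below is a hypothetical stand-in: in the framework, each evaluation would presumably recalibrate *O*M to obtain **p’** for the candidate rate and then compute the MDL Total Cost, which is not implemented here. The grid spacing is also an assumption.

```python
from typing import Callable, Iterable, Tuple


def linear_search(cost: Callable[[float], float],
                  candidates: Iterable[float]) -> Tuple[float, float]:
    """Return (best_alpha, best_cost) over an explicit grid of reported rates."""
    best_alpha, best_cost = None, float("inf")
    for alpha in candidates:
        c = cost(alpha)
        if c < best_cost:
            best_alpha, best_cost = alpha, c
    if best_alpha is None:
        raise ValueError("no candidate with a finite cost was found")
    return best_alpha, best_cost


def toy_cost(alpha: float) -> float:
    """Hypothetical stand-in for the MDL total cost as a function of the rate."""
    return (alpha - 0.12) ** 2


grid = [i / 100 for i in range(1, 100)]   # 0.01, 0.02, ..., 0.99
alpha_star, cost_star = linear_search(toy_cost, grid)
print(alpha_star, cost_star)
```

Because every candidate is evaluated independently, the cost of this step grows linearly with the grid size, which is what makes it a fast first pass before the finer optimization in step 2.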

##### Step 2: Find the *D*∗ given ![Graphic][26]

With the ![Graphic][27] found in step 1, we next find the *D*∗ that minimizes the MDL cost:

![Formula][28]

Since we have already found a good ![Graphic][29], we constrain *D*∗ as follows:

![Formula][30]

We then use the Nelder-Mead method [29] to solve this constrained optimization problem for *D*∗, initializing the search from ![Graphic][31]. We describe the two-step algorithm in more detail in the Supplementary Information.

#### Calibration

The epidemiological model [24] consists of 10 states: susceptible *S*, exposed *E*, pre-symptomatic *I**P*, severe symptomatic *I**S*, mild symptomatic *I**M*, asymptomatic *I**A*, hospitalized (eventual death) *H**D*, hospitalized (eventual recovery) *H**R*, recovered *R*, and dead *D*. The calibration process described in [24] infers only *β* (the transmission rate in the absence of interventions), *σ* (the proportional reduction on *β* under shelter-in-place), and *E* (the number of initial infections). The other parameters are fixed. During inference, the model calibrates on mortality.

We extend the calibration process described above to include the reported infections and unreported infections. Note that we do not make any structural changes to the existing model. Similar to [36], in our calibration process we compute the newly reported infections and unreported infections as follows:

1.  New reported = *α*1 *×* (*dI**P* *I**S* + *dI**P* *I**M*): Here *dI**P* *I**S* + *dI**P* *I**M* is the number of new symptomatic infections every day. We assume that a proportion *α*1 of the new symptomatic infections every day is reported.

2.  New unreported = (1 − *α*1) *×* (*dI**P* *I**S* + *dI**P* *I**M*) + *dEI**A*.
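The two formulas above translate directly into code. In this sketch the daily flows are interpreted as transitions into the severe symptomatic, mild symptomatic, and asymptomatic states; the function and argument names are hypothetical, and the numbers in the usage line are illustrative rather than calibrated values.

```python
from typing import Tuple


def split_new_infections(alpha_1: float, d_ip_is: float, d_ip_im: float,
                         d_e_ia: float) -> Tuple[float, float]:
    """Return (new_reported, new_unreported) for one day.

    d_ip_is, d_ip_im: new severe / mild symptomatic infections that day.
    d_e_ia: new asymptomatic infections that day.
    """
    if not 0.0 <= alpha_1 <= 1.0:
        raise ValueError("alpha_1 must be a proportion in [0, 1]")
    new_symptomatic = d_ip_is + d_ip_im
    new_reported = alpha_1 * new_symptomatic
    # Unreported = unreported symptomatic infections plus all asymptomatic ones.
    new_unreported = (1.0 - alpha_1) * new_symptomatic + d_e_ia
    return new_reported, new_unreported


# Illustrative daily flows, not values from the paper.
reported, unreported = split_new_infections(0.25, d_ip_is=8.0, d_ip_im=32.0, d_e_ia=20.0)
print(reported, unreported)
```

By construction, the two outputs always sum to the total number of new infections that day, so changing *α*1 only moves infections between the reported and unreported counts.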

We also extend the calibration to infer two more parameters: *α* (proportion of infections that are asymptomatic) and *α*1 (proportion of new symptomatic infections that are reported). The extended epidemiological model *O*M now calibrates on *β*, *σ*, *E*, *α*, and *α*1. Hence, our vector **p** of model parameters is defined as **p** = [**p**[*β*], **p**[*σ*], **p**[*E*], **p**[*α*], **p**[*α*1]].

#### Calibration Results

From the parameterization vector **p** and *O*M, we can infer the reported infections *D*reported(**p**), unreported infections *D*unreported(**p**), and total infections *D*(**p**) = *D*reported(**p**) + *D*unreported(**p**). Here, we calculate the BaseParam total infections (cumulative values) as

![Formula][32]

Note that BaseParam also estimates the number of symptomatic people, which can be calculated from the number of infections in states *I**S* (severe symptomatic) and *I**M* (mild symptomatic). With the number of symptomatic people, we can also calculate the COVID-19 related symptomatic rate among the population *N*:

![Formula][33]

Here, we also calculate the BaseParam estimated reported infections

![Formula][34]

and the reported rate

![Formula][35]

From **p’**, we can also calculate similar measures for MdlParam.
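The derived quantities in this section can be sketched as follows. The exact formulas appear only as figures in the original, so the forms used here are assumptions based on the surrounding prose: cumulative totals as a running sum of daily reported plus unreported infections, the symptomatic rate as the share of the population in the two symptomatic states, and the reported rate as reported over total infections. All function names are hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence


def cumulative_total_infections(new_reported: Sequence[float],
                                new_unreported: Sequence[float]) -> List[float]:
    """Running total of daily reported plus unreported infections."""
    daily_total = [r + u for r, u in zip(new_reported, new_unreported)]
    return list(accumulate(daily_total))


def symptomatic_rate(i_severe: float, i_mild: float, population: int) -> float:
    """Share of the population in the severe or mild symptomatic state (assumed form)."""
    return (i_severe + i_mild) / population


def reported_rate(new_reported: Sequence[float],
                  new_unreported: Sequence[float]) -> float:
    """Reported infections as a fraction of total infections over the period (assumed form)."""
    total = sum(new_reported) + sum(new_unreported)
    return sum(new_reported) / total


# Illustrative two-day series, not model output.
print(cumulative_total_infections([1.0, 2.0], [3.0, 4.0]))
print(symptomatic_rate(10.0, 40.0, 1000))
print(reported_rate([1.0, 2.0], [3.0, 4.0]))
```

The same helpers would apply unchanged to the outputs generated from **p’**, which is how comparable BaseParam and MdlParam measures could be produced.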

## Supporting information

Supplementary Information: supplements/263467_file07.pdf

## Data Availability

*   Coronavirus in the U.S.: Latest Map and Case Count, 2020. URL [https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html](https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html).
*   Eran Bendavid, Bianca Mulaney, Neeraj Sood, Soleil Shah, Rebecca Bromley-Dulfano, Cara Lai, Zoe Weissberg, Rodrigo Saavedra-Walker, Jim Tedrow, Andrew Bogan, et al. Covid-19 antibody seroprevalence in santa clara county, california. International journal of epidemiology, 50(2):410-419, 2021.
*   Fiona P Havers, Carrie Reed, Travis Lim, Joel M Montgomery, John D Klena, Aron J Hall, Alicia M Fry, Deborah L Cannon, Cheng-Feng Chiang, Aridth Gibbons, et al. Seroprevalence of antibodies to sars-cov-2 in 10 sites in the united states, march 23-may 12, 2020. JAMA internal medicine, 180(12):1576-1586, 2020.
*   Delphi's COVID-19 Surveys, 2020. URL [https://delphi.cmu.edu/covidcast/surveys/](https://delphi.cmu.edu/covidcast/surveys/).


Code and data have been deposited in GitHub [4].

## Acknowledgements

This paper was partially supported by the NSF (Expeditions CCF-1918770 and CCF-1918656, CAREER IIS-2028586, RAPID IIS-2027862, Medium IIS-1955883, Medium IIS-2106961, IIS-1931628, IIS-1955797, IIS-2027848), NIH 2R01GM109718, CDC MInD program U01CK000589, ORNL and funds/computing resources from Georgia Tech and GTRI. B. A. was in part supported by the CDC MInD-Healthcare U01CK000531-Supplement. A.V.’s work is also supported in part by grants from the UVA Global Infectious Diseases Institute (GIDI).

*   Received September 14, 2021.
*   Revision received September 14, 2021.
*   Accepted September 22, 2021.


*   © 2021, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/)

## References

1.  [1].4Q GDP: Economy expands at a 4.0% annualized rate.
    
    
2.  [2].Commercial Laboratory Seroprevalence Survey Data.
    
    
3.  [3].Hidden Outbreaks Spread Through U.S. Cities Far Earlier Than Americans Knew, Estimates Say.
    
    
4.  [4].[https://github.com/adityalab/mdl-ode-missing](https://github.com/adityalab/mdl-ode-missing).
    
    
5.  [5].Interim Economic Projections for 2020 and 2021.
    
    
6.  [6].United States COVID-19 Cases, Deaths, and Laboratory Testing (NAATs) by State, Territory, and Jurisdiction.
    
    
7.  [7].Coronavirus in the U.S.:Latest Map and Case Count, 2020.
    
    
8.  [8].Delphi’s COVID-19 Surveys, 2020.
    
    
9.  [9].Accorsi, E. K., Qiu, X., Rumpler, E., Kennedy-Shaffer, L., Kahn, R., Joshi, K., Goldstein, E., Stensrud, M. J., Niehus, R., Cevik, M., et al. How to detect and reduce potential sources of biases in studies of sars-cov-2 and covid-19. European Journal of Epidemiology (2021), 1–18.
    
    
10. [10].Adhikari, B., Rangudu, P., Prakash, B. A., and Vullikanti, A. Near-optimal mapping of network states using probes. In Proceedings of the 2018 SIAM International Conference on Data Mining (2018), SIAM, pp. 108–116.
    
    
11. [11].Aguilar, J. B., Faust, J. S., Westafer, L. M., and Gutierrez, J. B. Investigating the impact of asymptomatic carriers on covid-19 transmission. MedRxiv (2020).
    
    
12. [12].Alene, M., Yismaw, L., Assemie, M. A., Ketema, D. B., Mengist, B., Kassie, B., and Birhan, T. Y. Magnitude of asymptomatic covid-19 cases throughout the course of infection: A systematic review and meta-analysis. PloS one 16, 3 (2021), e0249090.
    

13. [13].Aleta, A., Martín-Corral, D., Bakker, M. A., y Piontti, A. P., Ajelli, M., Litvinova, M., Chinazzi, M., Dean, N. E., Halloran, M. E., Longini, I. M., et al. Quantifying the importance and location of sars-cov-2 transmission events in large metropolitan areas. medRxiv (2020).
    
    
14. [14].Angelopoulos, A. N., Pathak, R., Varma, R., and Jordan, M. I. On identifying and mitigating bias in the estimation of the covid-19 case fatality rate. Harvard Data Science Review (7 2020). [https://hdsr.mitpress.mit.edu/pub/y9vc2u36](https://hdsr.mitpress.mit.edu/pub/y9vc2u36).
    
    
15. [15].Bai, Y., Yao, L., Wei, T., Tian, F., Jin, D.-Y., Chen, L., and Wang, M. Presumed asymptomatic carrier transmission of covid-19. Jama 323, 14 (2020), 1406–1407.
    

16. [16].Bedford, T., Greninger, A. L., Roychoudhury, P., Starita, L. M., Famulare, M., Huang, M.-L., Nalla, A., Pepper, G., Reinhardt, A., Xie, H., et al. Cryptic transmission of sars-cov-2 in washington state. Science 370, 6516 (2020), 571–575.
    

17. [17].Bendavid, E., Mulaney, B., Sood, N., Shah, S., Bromley-Dulfano, R., Lai, C., Weissberg, Z., Saavedra-Walker, R., Tedrow, J., Bogan, A., et al. Covid-19 antibody seroprevalence in santa clara county, california. International journal of epidemiology 50, 2 (2021), 410–419.
    

18. [18].Böhning, D., Rocchetti, I., Maruotti, A., and Holling, H. Estimating the undetected infections in the covid-19 outbreak by harnessing capture–recapture methods. International Journal of Infectious Diseases 97 (2020), 197–201.
    
    
19. [19].Brauner, J. M., Mindermann, S., Sharma, M., Johnston, D., Salvatier, J., Gavenčiak, T., Stephenson, A. B., Leech, G., Altman, G., Mikulik, V., Norman, A. J., Monrad, J. T., Besiroglu, T., Ge, H., Hartwick, M. A., Teh, Y. W., Chindelevitch, L., Gal, Y., and Kulveit, J. Inferring the effectiveness of government interventions against covid-19. Science 371, 6531 (2021).
    

20. [20].Budhathoki, K., and Vreeken, J. Origo: causal inference by compression. Knowledge and Information Systems 56, 2 (2018), 285–307.
    
    
21. [21].Byambasuren, O., Cardona, M., Bell, K., Clark, J., McLaws, M.-L., and Glasziou, P. Estimating the extent of asymptomatic covid-19 and its potential for community transmission: systematic review and meta-analysis. Official Journal of the Association of Medical Microbiology and Infectious Disease Canada 5, 4 (2020), 223–234.
    
    
22. [22].Chang, S., Pierson, E., Koh, P. W., Gerardin, J., Redbird, B., Grusky, D., and Leskovec, J. Mobility network models of covid-19 explain inequities and inform reopening. Nature 589, 7840 (2021), 82–87.
    

23. [23].Chang, S. Y., Wilson, M. L., Lewis, B., Mehrab, Z., Dudakiya, K. K., Pierson, E., Koh, P. W., Gerardin, J., Redbird, B., Grusky, D., et al. Supporting covid-19 policy response with large-scale mobility-based modeling. medRxiv (2021).
    
    
24. [24].Childs, M. L., Kain, M. P., Kirk, D., Harris, M., Couper, L., Nova, N., Delwel, I., Ritchie, J., and Mordecai, E. A. The impact of long-term non-pharmaceutical interventions on covid-19 epidemic dynamics and control. medRxiv (2020).
    
    
25. [25].Cuevas, E. An agent-based model to evaluate the covid-19 transmission risks in facilities. Computers in biology and medicine 121 (2020), 103827.
    

26. [26].Dong, E., Du, H., and Gardner, L. An interactive web-based dashboard to track covid-19 in real time. The Lancet infectious diseases 20, 5 (2020), 533–534.
    

27. [27].Drake, J. M., Handel, A., Marty, E., B.O’Dea, E., and Tredennick, A. T. Transmission model for sars-cov-2 in us states.
    
    
28. [28].Edeling, W., Arabnejad, H., Sinclair, R., Suleimenova, D., Gopalakrishnan, K., Bosak, B., Groen, D., Mahmood, I., Crommelin, D., and Coveney, P. V. The impact of uncertainty on predictions of the covidsim epidemiological code. Nature Computational Science 1, 2 (2021), 128–135.
    
    
29. [29].Gao, F., and Han, L. Implementing the nelder-mead simplex algorithm with adaptive parameters. Computational Optimization and Applications 51, 1 (2012), 259–277.
    
    
30. [30].Geers, D., Shamier, M. C., Bogers, S., den Hartog, G., Gommers, L., Nieuwkoop, N. N., Schmitz, K. S., Rijsbergen, L. C., van Osch, J. A., Dijkhuizen, E., et al. Sars-cov-2 variants of concern partially escape humoral but not t-cell responses in covid-19 convalescent donors and vaccinees. Science Immunology 6, 59 (2021).
    
    
31. [31].Grunwald, P. A tutorial introduction to the minimum description length principle. arXiv preprint math/0406077 (2004).
    
    
32. [32].Havers, F. P., Reed, C., Lim, T., Montgomery, J. M., Klena, J. D., Hall, A. J., Fry, A. M., Cannon, D. L., Chiang, C.-F., Gibbons, A., et al. Seroprevalence of antibodies to sars-cov-2 in 10 sites in the united states, march 23-may 12, 2020. JAMA internal medicine 180, 12 (2020), 1576–1586.
    
    
33. [33].Ho, F. K., Petermann-Rocha, F., Gray, S. R., Jani, B. D., Katikireddi, S. V., Niedzwiedz, C. L., Foster, H., Hastie, C. E., Mackay, D. F., Gill, J. M., et al. Is older age associated with covid-19 mortality in the absence of other risk factors? general population cohort study of 470,034 participants. PLoS One 15, 11 (2020), e0241824.
    

34. [34].Hoertel, N., Blachier, M., Blanco, C., Olfson, M., Massetti, M., Rico, M. S., Limosin, F., and Leleu, H. A stochastic agent-based model of the sars-cov-2 epidemic in france. Nature medicine 26, 9 (2020), 1417–1421.
    

35. [35].Johansson, M. A., Quandelacy, T. M., Kada, S., Prasad, P. V., Steele, M., Brooks, J. T., Slayton, R. B., Biggerstaff, M., and Butler, J. C. Sars-cov-2 transmission from people without covid-19 symptoms. JAMA network open 4, 1 (2021), e2035057–e2035057.
    
    
36. [36].Kain, M. P., Childs, M. L., Becker, A. D., and Mordecai, E. A. Chopping the tail: How preventing superspreading can help to maintain covid-19 control. Epidemics 34 (2021), 100430.
    
    
37. [37].Koutra, D., Kang, U., Vreeken, J., and Faloutsos, C. Vog: Summarizing and understanding large graphs. In Proceedings of the 2014 SIAM international conference on data mining (2014), SIAM, pp. 91–99.
    
    
38. [38].Kraemer, M. U., Yang, C.-H., Gutierrez, B., Wu, C.-H., Klein, B., Pigott, D. M., Du Plessis, L., Faria, N. R., Li, R., Hanage, W. P., et al. The effect of human mobility and control measures on the covid-19 epidemic in china. Science 368, 6490 (2020), 493–497.
    

39. [39].Kustin, T., Harel, N., Finkel, U., Perchik, S., Harari, S., Tahor, M., Caspi, I., Levy, R., Leshchinsky, M., Dror, S. K., et al. Evidence for increased breakthrough rates of sars-cov-2 variants of concern in bnt162b2-mrna-vaccinated individuals. Nature Medicine (2021), 1–6.
    
    
40. [40].Lai, S., Ruktanonchai, N. W., Zhou, L., Prosper, O., Luo, W., Floyd, J. R., Wesolowski, A., Santillana, M., Zhang, C., Du, X., et al. Effect of nonpharmaceutical interventions to contain covid-19 in china. Nature 585, 7825 (2020), 410–413.
    

41. [41].Li, R., Pei, S., Chen, B., Song, Y., Zhang, T., Yang, W., and Shaman, J. Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (sars-cov-2). Science 368, 6490 (2020), 489–493.
    

42. [42].Lu, F. S., Nguyen, A. T., Link, N. B., Molina, M., Davis, J. T., Chinazzi, M., Xiong, X., Vespignani, A., Lipsitch, M., and Santillana, M. Estimating the cumulative incidence of covid-19 in the united states using influenza surveillance, virologic testing, and mortality data: Four complementary approaches. PLOS Computational Biology 17, 6 (2021), e1008994.
    
    
43. [43].Moghadas, S. M., Fitzpatrick, M. C., Sah, P., Pandey, A., Shoukat, A., Singer, B. H., and Galvani, A. P. The implications of silent transmission for the control of covid-19 outbreaks. Proceedings of the National Academy of Sciences 117, 30 (2020), 17513–17515.
    

44. [44].Mueller, A. L., McNamara, M. S., and Sinclair, D. A. Why does covid-19 disproportionately affect older people? Aging (Albany NY) 12, 10 (2020), 9959.
    

45. [45].Pei, S., Kandula, S., and Shaman, J. Differential effects of intervention timing on covid-19 spread in the united states. Science advances 6, 49 (2020), eabd6370.
    

46. [46].Pei, S., Teng, X., Lewis, P., and Shaman, J. Optimizing respiratory virus surveillance networks using uncertainty propagation. Nature communications 12, 1 (2021), 1–10.
    
    
47. [47].Planas, D., Veyer, D., Baidaliuk, A., Staropoli, I., Guivel-Benhassine, F., Rajah, M. M., Planchais, C., Porrot, F., Robillard, N., Puech, J., et al. Reduced sensitivity of sars-cov-2 variant delta to antibody neutralization. Nature (2021), 1–7.
    
    
48. [48].Prakash, B. A., Vreeken, J., and Faloutsos, C. Spotting culprits in epidemics: How many and which ones? In 2012 IEEE 12th International Conference on Data Mining (2012), IEEE, pp. 11–20.
    
    
49. [49].Press, W. H., and Levin, R.C. Modeling, post covid-19, 2020.
    
    
50. [50].Russell, T. W., Golding, N., Hellewell, J., Abbott, S., Wright, L., Pearson, C. A., van Zandvoort, K., Jarvis, C. I., Gibbs, H., Liu, Y., et al. Reconstructing the early global dynamics of under-ascertained covid-19 cases and infections. BMC medicine 18, 1 (2020), 1–9.
    

51. [51].Shaman, J. An estimation of undetected covid cases in france, 2020.
    
    
52. [52].Silva, P. C., Batista, P. V., Lima, H. S., Alves, M. A., Guimarães, F. G., and Silva, R. C. Covid-abs: An agent-based model of covid-19 epidemic to simulate health and economic effects of social distancing interventions. Chaos, Solitons & Fractals 139 (2020), 110088.
    
    
53. [53].Sood, N., Simon, P., Ebner, P., Eichner, D., Reynolds, J., Bendavid, E., and Bhattacharya, J. Seroprevalence of sars-cov-2–specific antibodies among adults in los angeles county, california, on april 10-11, 2020. Jama (2020).
    
    
54. [54].Stockmaier, S., Stroeymeyt, N., Shattuck, E. C., Hawley, D. M., Meyers, L. A., and Bolnick, D. I. Infectious diseases and social distancing in nature. Science 371, 6533 (2021).
    
    
55. [55].Tian, Y., Sridhar, A., Yağan, O., and Poor, H. V. Analysis of the impact of mask-wearing in viral spread: Implications for covid-19. In 2021 American Control Conference (ACC) (2021), IEEE, pp. 3132–3137.
    
    
56. [56].Wells, C. R., Sah, P., Moghadas, S. M., Pandey, A., Shoukat, A., Wang, Y., Wang, Z., Meyers, L. A., Singer, B. H., and Galvani, A. P. Impact of international travel and border control measures on the global spread of the novel 2019 coronavirus outbreak. Proceedings of the National Academy of Sciences 117, 13 (2020), 7504–7509.
    

57. [57].Zhang, W., Govindavari, J. P., Davis, B. D., Chen, S. S., Kim, J. T., Song, J., Lopategui, J., Plummer, J. T., and Vail, E. Analysis of genomic characteristics and transmission routes of patients with confirmed sars-cov-2 in southern california during the early stage of the us covid-19 pandemic. JAMA network open 3, 10 (2020), e2024191–e2024191.

 [1]: /embed/inline-graphic-1.gif
 [2]: /embed/inline-graphic-2.gif
 [3]: /embed/inline-graphic-3.gif
 [4]: /embed/graphic-5.gif
 [5]: /embed/graphic-6.gif
 [6]: /embed/graphic-7.gif
 [7]: /embed/graphic-8.gif
 [8]: /embed/graphic-9.gif
 [9]: /embed/graphic-10.gif
 [10]: /embed/inline-graphic-4.gif
 [11]: /embed/graphic-11.gif
 [12]: /embed/graphic-12.gif
 [13]: /embed/graphic-13.gif
 [14]: /embed/inline-graphic-5.gif
 [15]: /embed/inline-graphic-6.gif
 [16]: /embed/inline-graphic-7.gif
 [17]: /embed/inline-graphic-8.gif
 [18]: /embed/inline-graphic-9.gif
 [19]: /embed/inline-graphic-10.gif
 [20]: /embed/inline-graphic-11.gif
 [21]: /embed/inline-graphic-12.gif
 [22]: /embed/graphic-14.gif
 [23]: /embed/inline-graphic-13.gif
 [24]: /embed/graphic-15.gif
 [25]: /embed/inline-graphic-14.gif
 [26]: /embed/inline-graphic-15.gif
 [27]: /embed/inline-graphic-16.gif
 [28]: /embed/graphic-16.gif
 [29]: /embed/inline-graphic-17.gif
 [30]: /embed/graphic-17.gif
 [31]: /embed/inline-graphic-18.gif
 [32]: /embed/graphic-18.gif
 [33]: /embed/graphic-19.gif
 [34]: /embed/graphic-20.gif
 [35]: /embed/graphic-21.gif