Abstract
The role of schools in the spread of the COVID-19 pandemic is controversial, with some claiming they are an important driver of the pandemic and others arguing that transmission in schools is negligible. School cluster reports that have been collected in various jurisdictions are a source of data about transmission in schools. These reports consist of the name of a school, a date, and the number of students known to be infected. We provide a simple model for the frequency and size of clusters in this data, based on random arrivals of index cases at schools who then infect their classmates with a highly variable rate, fitting the overdispersion evident in the data. We fit our model to reports for several jurisdictions in the US and Canada, providing estimates of mean and dispersion for cluster size, whilst factoring in imperfect ascertainment. Our parameter estimates are robust to variations in ascertainment fraction. We use these estimates in three ways: i) to explore how uneven the distribution of cases is among different clusters in different jurisdictions (that is, what fraction of cases are in the 20% largest clusters), ii) to estimate how long it will be until we see a cluster a given size in jurisdiction, and iii) to determine the distribution of instantaneous transmission rate β among different index case. We show how these latter distribution can be used in simulations of school transmission where we explore the effect of different interventions, in the context of highly variable transmission rates.
1 Background
In the management of the COVID-19 pandemic an important consideration is the role of children and in particular schools. In most jurisdictions rates of COVID infection among children are similar to those in the adult population (8). But severity is much lower in children; the infection fatality rate (IFR) of COVID for at age 10 is estimated to be 0.002% versus an IFR of 0.01% at age 25, and 0.4% at age 55 (24). Cases are more often asymptomatic among children, less likely to require hospitalization and ICU care (8), and less likely to be classified as long COVID (34). On the other hand, not all the news is good for children, MIS-C is a serious condition sometimes resulting from COVID infection (6), and myocarditis happens more frequently as a side effect of infection among younger individuals (31).
Jurisdictions have had to make a choice between closing schools, with all the attendant social, economic, and psychological costs (9), and leaving schools open, allowing possible transmission of COVID in that setting (8). The direct downside of transmission in schools if it occurs is that children may be infected there, risking the low but non-negligible harms of COVID-19 in that age range, but also adult teachers and staff are put at risk. Transmission in schools may also contribute to overall community transmission, indirectly jeopardizing more vulnerable individuals (41). As a concrete example, if a child contracts COVID at school, they may then go on to transmit it to an elderly relative they live with, for whom the consequences are more severe (21). Estimating the magnitude of these two kinds of harm and making the decision as to what choice to make involves many sources of uncertainty and value judgements, which helps explain why different jurisdictions have taken different approaches (18). In some jurisdictions schools were open for the 2020 to 2021 school year, though many measures were put into place in order to reduce the risk of COVID transmission (28). Measures include cohorting, staggered entrance and exit times, masks, improvements in ventilation, extra sanitization measures. In other jurisdictions schools were closed for large portions of the year (30).
Studies that have looked at the effect of school closures on the overall rate of COVID transmission find mixed results: some find substantial reduction in community transmission when schools are closed, and others small or no effect (11, 41). Given that schools involve many children all sharing a room for many hours a day, it may be surprising that there is not a clearer evidence of significant transmission in schools. One explanation is that children may be less likely to transmit COVID to each other, either by being less infectious or by being less susceptible (14, 38). But transmission in schools does occur, and it’s worthwhile to estimate the magnitude and characterize the variation in it.
One source of evidence for transmission in schools are school exposure reports. Throughout the pandemic organizations have collected data submitted by volunteers about COVID cases in schools, and such data has subsequently been published online (4, 5). Data consists of reports of exposures or clusters in schools, either submitted by parents, or determined from reading newspaper reports. Several such websites exist, though many ceased due to excessive workload after the 2020–21 school year. In some jurisdictions there are also similar sources of data provided by local government (17,29) or Public Health Agencies (19, 20).
Here we propose a simple model of transmission in schools, and we use these data on cluster sizes to estimate parameters of the model for various jurisdictions in the US and Canada. Our model allows for heterogeneity in transmission rate, which is able to capture the considerable variability in the sizes of the clusters, with most exposures leading to no further cases (and so a cluster of size 1) but with few having a large number of cases (36). We estimate the the mean and overdispersion parameters for different jurisdictions. We then use our parameter estimates in a few different ways: firstly, we explore the overdispersion of cluster sizes in different jurisdictions, giving estimates of what fraction of all cases are in the 20% largest of all clusters. Secondly, we are able to estimate how long we can expect to wait until we see clusters of a given size in a jurisdiction. Thirdly, we can get an estimate of the distribution of transmission rate β, the rate at which a single infected individual infects a susceptible person when they are in contact. This parameter, in turn, could be used to simulate school transmission and explore the impacts of interventions (37) as we explore for several parameter choices.
Finally, two important changes have occurred in 2021 that we expect to impact cluster sizes in schools. On the one hand, in many jurisdictions, large portions of children aged 12 and up have been vaccinated with the Pfizer/BIONTech vaccine (35). According to the extent to which the vaccine protects against infection, we expect cluster size will be reduced, as fewer students will be infected if they have been immunized. Observed cluster size may be reduced further even than this, if the vaccine allows harder-to-detect infections to occur. On the other hand, now more infectious variants of the coronavirus have emerged, most noticeably the Alpha variant and subsequently the Delta variant (7). Increased transmissibilty would suggest larger cluster sizes, certainly among unvaccinated ages, but the relative impact of vaccination and the new variants together is difficult to gauge. Furthermore, both changes may lead to a change in the variability in cluster sizes.
2 Methods
Our data consist of lists of clusters of confirmed cases among students in schools for 8 US states and 4 Canadian provinces during the 2020–2021 school year. The US data was gathered from the National Educational Association website (4) (originally started by Alisha Morris, an educator at a Kansas high school) which collected data from news media and from reports submitted by volunteers (39). We selected the 8 states with the most data available, and covering the period between August and November 2020. Canadian data was contributed by Dr Shraddha Pai from COVID Schools Canada (5), an initiative of the group Masks for Canada (1). We included the 4 provinces from this data set with the most clusters recorded (omitting provinces where there was only the total cases by school recorded); entries ranged over the whole 2020–21 school year For both sources of data, each entry is a single cluster with number of confirmed infections among students in the school, the school’s name, the district, and the date of the report. For the US data we used confirmed student cases for the cluster size, excluding teachers and staff; for the Canadian data we used total cases. Information is usually not available about whether the students were in just one classroom, and some of the clusters exceed 30 students, so are likely not restricted to just one class. Accordingly, we interpret clusters as capturing all linked cases at a given school, and not just one classroom. Figure 1 shows histograms of the observed cluster size in the 12 jurisdictions.
In Table 1 we show, for each jurisdiction, the number of clusters, the number of schools appearing, the number of schools with more than one reported cluster, the fraction of clusters that have only one observed case, and the average number of observed cases in the clusters. There are several striking differences between the US and Canadian data. There are substantially more clusters reported in Canada than in the US, despite the US states having greater population on average. This may partly be explained by the Canadian data being collected over a longer period than the US data, but this is likely not the full explanation: in Figure 2 (top) we show the rate (in clusters per day) that cases appear in the dataset over time. In Figure 2 (bottom) we show rate of COVID incidence in the jurisdiction over the same period of time. Even at times when both US and Canadian datasets record clusters, Canadian rates are higher than US rates. This suggests that the method used for gathering cluster reports in different jurisdictions varied substantially between jurisdictions, especially when we look at daily incident cases in each. Statewide incident cases were generally higher per capita in US states than Canadian provinces. In the US data, cluster reports do not track well with incidence, but in the Canadian data there is a clear correspondence. Furthermore, the majority of schools in the US datasets only report one cluster, whereas the opposite is true of the Canadian data.
There is also substantial differences in statistics of cluster sizes. Mean observed cluster sizes were without exception larger in the US states than Canadian provinces, and Canadian provinces tended to have a higher fraction of clusters with only one case. Given the incomplete nature of the US data, we cannot determine whether these differences are due to real differences in transmission in the jurisdictions, or because smaller clusters were less likely to be reported in the US states.
There is substantial uncertainty in whether each reported cluster of cases accurately represents a set of cases linked by transmission. For any cluster of 2 or more cases, it may be that two independent sets of cases are incorrectly included in the same cluster. This may lead us to overestimate the size of clusters. Likewise, any two cluster in the the same school in the same week or adjacent weeks may actually be a single cluster misidentified as two. Both these factors may occur in our data, but we neglect both of them, taking the observed cluster size as given in the dataset. We are also unable to distinguish between transmission occurring in a school and in social activities with classmates outside of school.
In a given jurisdiction, we assume exposure events occur according to a Poisson process with variable rate. Figure 2 (top) gives an estimate of the rate of exposure as a function of time for the dataset of each jurisdiction. Independently of this process, once an exposure event occurs at a school, we say Z additional people are infected by the index case, for a total of Z + 1 individuals in the cluster. The variable Z includes individuals directly infected by the index case, as well as any subsequent infected individuals that are included in the same cluster. Following (25), we model Z as a Poisson random variable with parameter ν, where ν itself is a Gamma-distributed random variable. As described by (25), Z is then a negative binomial random variable. Rather than the usual parametrization of a negative binomial distribution we use parameters Rc and k. The parameter Rc is the expected number of additional infections in a cluster, and k is the dispersion: a measure of how far the distribution of Z is from being Poisson. As k → ∞, the distribution of Z approaches that of a Poisson distribution with mean Rc. The variance of Z is Rc(1 + Rc/k) and so for smaller values of k we expect more of the secondary cases to occur in rare large clusters rather than in frequent small clusters (25).
There are then a total of Z + 1 infected individuals in the school. The number that are actually observed X depends on the ascertainment model. We consider a model where each case is observed and contributes to the reported cluster size with probability q, so that the observed cluster size X (conditioned on Z) is binomial with parameters n = Z + 1 and probability q. The index case is treated the same as the infectees, so X may or may not include the index case. If none of the cases in a cluster are observed, we assume the cluster is not reported, so our model factors in the effect that smaller clusters are more likely to be missed.
For each collection of cluster sizes in our datasets we estimate the mean Rc and dispersion k using the ascertainment model with q = 0.75. We use maximum likelihood estimation to obtain estimates of R and k, and we use the Hessian of the log-likelihood to obtain 95% confidence ellipses for the parameters (42, Sec. 9.10). See the Supplementary Information for an explicit statement of the likelihood function.
3 Results
Figure 3 (left) shows the estimated mean cluster size (= Rc + 1) and dispersion k for the 8 US states. Figure 3 (right) shows the same for 4 Canadian provinces. In the US data, mean cluster size was estimated to range from about 2 in Texas to almost 8 in Florida. Dispersions ranged from 0.05 to 0.3, showing considerable overdispersion compared to the Poisson distribution. In the Canadian provinces mean cluster sizes ranged between 1.3 and 1.7 cases, and dispersion ranged from 0.28 to 0.45, meaning that observed dispersion was lower than in the US data (recalling that no overdispersion corresponds to k → ∞.)
In the Supplementary Information we explore varying the ascertainment fraction between 0.2 and 1. Though lower ascertainment fractions yield bigger values of Rc and smaller values of k, we see that the parameter estimates are relatively insensitive to values of q between 0.5 and 1. The reason for this is that though a given cluster with multiple cases will look smaller with fewer cases detected, and therefore biasing observed size downwards, many single-case clusters will not be detected at all, biasing the observed cluster size upwards again. We also consider an alternate model of ascertainment, where the chance of a cluster being reported at all depends on the size of the cluster, and vary the rate of ascertainment in it; see the Supplementary Information.
As a way of interpreting dispersion values and what they mean for cluster size, we consider the fraction of all cases that occur in the largest 20% of all clusters. (If the distribution of cases follows the Pareto principle (43) then 80% of the cases will be in the top 20% largest clusters.) If we consider only secondary cases (not including the index case) we see from Figure 4 (right) the fraction that are due to the 20% largest clusters for various values of mean cluster size and k. For example, for Alberta with a mean cluster size of 1.6 and a dispersion k of 0.3, 86% of the secondary cases are in the top 20% of the clusters by size. For Florida, with a mean cluster size of 8 and k = 0.075, 97% of secondary cases are in the top 20% of clusters by size. When we include index cases, the fractions are correspondingly lower, as we see in Figure 4 (right).
Another way to visualize the variability of transmission we have inferred from the data is to show the distribution of the Poisson parameter ν, of which Rc is just the mean. In our model ν is the index case-specific expected number of further cases in a cluster, and is a gamma distributed random variable. Figure 5 shows the estimated distribution of ν for each jurisdiction, and Table 3 shows some key properties of the jurisdictions.
As another application of our results, we can estimate the time to observe a cluster of a given size or larger in a given jurisdiction. We used the estimate of rate of cluster occurrence shown in Figure 2 for the day September 15th, 2020. Summing the tail of the cluster size distribution past a certain point with a given Rc and k gives the fraction of clusters observed that are that size or larger. Multiplying these quantities and taking the inverse gives the expected number of days until we would see a cluster this size for larger in the dataset. We show the results of this calculation in Figure 6. These vary widely: in our estimates for Florida, it would take under 5 days for there to be a cluster of 10 cases, whereas in Indiana this would not happen until 11 or 12 days, and in Alberta it would take over 50 days.
Our model does not consider the details of transmission at the individual level, and so does not make use of an instantaneous transmission rate per contact pair. However, by making some simple assumptions about COVID transmission, we can infer a distribution of transmission rate β from our estimate of the distribution of the parameter ν. Recall that ν is a Gamma-distributed random variable that gives mean number of secondary cases. Another way to estimate mean cluster size is to use an individual contact model where when an infectious person is in contact with a susceptible person, the susceptible person is infected with rate β. In such a model we assume that infected individuals are in a classroom for 2 days before isolating (when they develop symptoms), and that the total contact time with their classmates is T = 12 hours. Assuming that all individuals are in the same class, the infected individual is in contact with n = 25 other susceptible students for that time period. Then the infected individual will on average infect βnT other students. So we estimate β = ν/(nT). Since ν is Gamma-distributed, our estimate of β is too. The approximations here are only reasonable when it’s plausible that all secondary infections are in the same classroom; this is generally not the case with the US data sets, since there are occasional clusters of more than 30 cases. So we estimate the distribution of β this way only for the Canadian data sets. Figure 5 (right) shows with the right y-axis the distribution of β for the different Canadian provinces. Table 3 shows some of the features of the estimated distribution for β.
One application of these estimates of the distribution of β is that we can explore the consequences of different types of interventions in the classroom setting. In (37) the authors consider a simple model of COVID-19 transmission among a group of contacts and investigate the quantity Revent, the average number of secondary infections due to the presence of a single infectious individual. Revent is determined by T, the total length of time the infectious individual is with others; ncontact, the number of contacts at any point in time, τ the length of time the individual is with a fixed set of contacts; and β, the instantaneous transmission rate. Interventions can be classified according to which of these parameters they modify: reducing transmission reduces β, social distancing reduces ncontact, and “bubbling” (staying with the same small group rather than mingling) increases τ to T. If use our distributions for β with the model of (37) we can estimate how the distribution of cluster sizes is changed with different interventions under different values of the parameters Rc and k.
In Figure 7 we show estimated size distributions of clusters under different interventions. Our baseline simulation settings intend to capture a pre-COVID high school classroom: T = 12 hours (two days of exposure before the index case isolates), τ = 3 hours (each student has four different classes that they attend, nclass = 25, and β is sampled from our estimated distribution for a given choice of Rc and k. We consider three interventions: reducing transmission (for example, by introducing masks) reduces β by a factor of 2; social distancing cuts the size of a class in half; strict bubbling sets τ to T. Figure 7(left) shows the distributions of cluster size with the baseline and the different interventions for choices of Rc and k appropriate to Saskatchewan (Rc = 0.5, k = 0.3). We see that both reducing transmission and social distancing are effective, whereas bubbling does not contribute much to reduce cluster sizes. This is characteristic of what (37) call the linear regime: the number of secondary infections depends linearly on the time the infectious person is present with others. Figure 7 (right) shows the results in a hypothetical setting where R is much larger (Rc = 0.5, k = 0.3). Here transmission reduction is less effective than in the linear regime, and strict bubbling more so; these parameters have moved us closer to the so-called saturating regime.
4 Discussion
We have used cluster size data to estimate the mean and dispersion in cluster sizes, accounting for imperfect case detection. We have found that in all jurisdictions in our data, the majority of school transmission occurs in a small number of classrooms, with the top 20% of clusters containing between 80% and 100% of the secondary cases in school settings. We developed a method to estimate the transmission rate per contact per unit time, with reference to a simple model of classroom transmission. Having a direct estimate of the transmission rate allows us to compare the benefits of different control measures. We find that with parameters estimated from Canadian jurisdictions, interventions that reduce transmission rates (such as masking) and reduce number of contacts at any one time (class size reduction), are more effective than strategies aimed at keeping sets of contacts consistent (such as bubbling).
Overdispersion in transmission of COVID-19 and other infectious diseases is well documented (eg (45)) and is often described with reference to the 20/80 rule: that 20% of the infected individuals account for 80% of the transmission. Naturally, if the more infectious 20% can be identified, interventions targeting that portion of the population are likely to have a high impact. For SARS in 2003, Lloyd-Smith et al (25) estimated that 20% of the cases were responsible for almost 90% of the transmission. Estimates for COVID-19 also find considerable overdispersion, with the parameter k between 0.1 (16) and 0.5 (22) (with R0 = 2.5 this gives the top 20% of cases causing 69% - 96% of the transmissions; see (32) for a survey). These estimates focus on the distribution of the number of people an infectious person infects directly during the whole course of infection (with mean R0), which is of obvious epidemiological importance, but for which it is difficult to obtain high-quality data. When a case is identified, we are not always able to determine who they infected, and indirect methods must be used. We may miss cases, and others may be wrongly attributed to a given index case.
In our present study, we examined a different random quantity, the number of additional cases Z infected, either directly or through intermediaries, by a given index case in a given setting. We denoted the mean of Z by Rc. Including the index case means that the cluster size is Z + 1, with mean Rc + 1. Compared to estimates of R0, Rc does not count people infected at other sites, but it does include additional cases, because it includes both direct and indirect transmission. Z and its mean Rc are therefore more focused on the particular setting (in this case a school) than R0 is. In general it will depend on the infectiousness of the index case, as well as how conducive the environment is to transmission, and what activities are undertaken there. Determining the distribution of Z, as we have done here, provides an alternative means of investigating transmission.
However, these two measures of transmissibility (R0 and Rc, the mean of Z) may be close enough that it is instructive to compare our estimates for Z with the traditional R0, and our dispersion estimates with dispersion estimates for the number of secondary infections. Our Rc ranges from 0.3 in Ontario to nearly 7 in Florida. The lower range of Rc is inconsistent with R0 estimates (which range from 2 to 6 (3)), and indicate that in these jurisdictions schools are unlikely to be a major contributor to COVID-19 spread. Our medium to higher estimates of Rc are consistent with estimates of R0 and suggest that school transmissions were an important factor in COVID transmission in general in these jurisdictions at these times. Our estimates for k range from 0.06 (in Indiana) to 0.5 (in Alberta), corresponding closely to earlier estimates of dispersion.
Overdispersion has consequences for controlling transmission and for estimation. Estimating the average transmission rate from a small number of clusters will be difficult, and will result in a high variability. Most likely what will be observed in a small number of sampled clusters will be little to no onward transmission, which would lead to underestimates of the transmission rate. But if one or more larger clusters are included in a sample by chance, then this could lead to an overestimate of the transmission rate.
If we could identify the conditions under which the rare larger clusters occur (high-risk individuals, activities and settings) we could achieve a disproportionately large effect on reducing transmission by applying new measures in these settings. There are myriad possible reasons for overdispersion of transmission for SARS-CoV-2, including variation in viral load (10), behaviour, and number of contacts. Properties of the setting may be very important, with some settings (cramped, poor ventilation) being especially conducive to transmission. It would be good to identify classrooms or schools where there is a high risk of larger clusters. For example, if data were available on the occupancy, ventilation standards, mask use, classroom size, distancing behaviour and other features of classrooms, we could investigate how this related to the cluster size. Rapid tests may be especially good for identifying the most infectious individuals, given that they are sensitive to viral loads (27), but additional data collection is likely needed to quantify setting-level risks.
Two important changes have happened since the majority of the data here was collected. Firstly, in the jurisdictions studied, effective vaccines have been developed and deployed for those aged 12 and up (35). There are several ways in which this may effect cluster sizes in the school setting. To the extent that the general population (including adults) being vaccinated reduces incidence of COVID (23, 26, 44), there will be fewer introductions of COVID into the classroom, and so fewer exposures will occur leading to fewer clusters. This effect may be dampened by relaxation of distancing and other measures that were keeping COVID-19 at bay and are no longer necessary in the context of vaccination. The distribution of cluster sizes when clusters do occur will also change: many students who might otherwise be infected will be protected by the vaccine, others who are vaccinated but infected (breakthrough infection) may have reduced symptoms and therefore may not be detected. We therefore expect the mean cluster size to be reduced by vaccination, in age ranges where vaccination has been deployed. It is unclear what the consequences will be for the dispersion.
Secondly, new, more transmissible variants of SARS-CoV-2 have emerged (6), most notably the Alpha variant with a transmissibility approximately 1.5 times higher than that of the original variant, and the Delta variant estimated to have a transmissibility 1.5 times higher than that (15). A natural way to implement this change in our model is to multiply Rc by a factor of 2.25, boosting the size of clusters, without changing the dispersion parameter k. This would result in mean cluster sizes ranging from 1.7 in Ontario to 16 in Florida according to our estimates. Data from the period in which Delta was the prevalent strain is not available, but schools in the Canada and the US saw resurgences in clusters in schools around school openings (2, 13, 33).
Our data and model have some limitations. The data rely on crowdsourcing, and there is reason to believe that reporting is incomplete. Larger clusters may be more likely to be reported. The rate at which exposures were reported in these data varies a great deal from state to state, and so these results cannot be used to estimate the true rate of exposures in schools in the jurisdictions. For example, California and New York are some of the most populous states in the US, both of which were severely hit by COVID during this period, and yet neither of them had enough reports to make it into the top 8 most reported states. However, the consistency of the Canadian estimates, despite the data being from different jurisdictions with different reporting, lend them credibility. In the modelling, we assumed a Poisson random variable for the cluster size, with an underlying gamma-distributed rate variable. This is a flexible model allowing for considerable overdispersion, but it is simple and does not explicitly handle complexities such as the differences between elementary and high schools. Our estimates of the transmission rate were derived (where feasible) from a model with a fixed number of hours that the index case would be infectious in the classroom, and fixed class sizes. Accounting for variation in these would result in more variability in the estimates. We also assumed that reported clusters in the Canadian data occurred in the same class. Errors where distinct clusters are incorrectly put together and errors where one larger cluster is incorrectly split and assumed to be two distinct clusters probably both occur in the data. Without additional information, however, we were not able to account for them. Finally, our illustrative modelling of the impact of interventions was simple, and used simple assumptions for the impacts of masking, distancing and cohorts (bubbling). Our estimates of the per-contact transmission rate per unit time could, however, be used in more sophisticated simulation modelling to compare interventions.
Despite these limitations, our approach has distinct advantages. We have developed estimates of the person-to-person transmission rate derived directly from data. The data we use (cluster sizes) are relatively easy to access. This approach does not require individual-level data or contact tracing information, which are often not available; individuals may be identifiable and data are held within public health institutions. Our estimation approach, together with cluster size data, offers a high-resolution view of transmission: we can estimate the distribution of cluster sizes in specific settings, accounting for reporting and overdispersion, and in some contexts we can estimate the transmission rate, all without requiring either individual-level data or assumptions on transmission parameters such as the serial interval (see, in contrast, (12, 40) which require serial interval estimates). The results offer context-specific tools to simulate interventions in particular settings (here, schools). The methods are readily generalizable to other structured settings, such as workplace outbreaks where workplaces are similar in size and structure. Our results also suggest the need for data collection activities that can relate cluster sizes to setting variables such as occupancy, density, ventilation, activity and distancing behaviour. Ultimately this would provide the data needed to design interventions that best reduce school and/or workplace transmission.
Data Availability
Data is all available at https://github.com/PaulFredTupper/covid-19-clusters-in-schools
https://github.com/PaulFredTupper/covid-19-clusters-in-schools
Supplementary Information
Model for Cluster Size
We consider two models for ascertainment (whether a case is actually detected), though we only consider the first in the main text.
In the first ascertainment model (Individual Ascertainment) each of the infected individuals is detected with a probability q1. So X, the total number of infected individuals is binomial (n, p) with parameters n = Z + 1 and p = q1. If by chance none of the individuals are observed, we do not observe the cluster. This is meant to model a situation where cases are detected independently of each other, and one detected case does not lead to further tests or screening.
In the second ascertainment model (Cluster Ascertainment), at first each case is identified with probability q2, but then if any of the students are identified they are all subsequently identified. This is intended to capture a situation where a single detected case triggers testing for the whole class. Again, if no cases are detected we do not observe the cluster. This is equivalent to saying that clusters of size m are detected in their entirety with probability 1 − (1 − q2)m.
The number Z of new cases given the presence of one infectious case is a Poisson distributed random variable with a rate ν that is itself a Gamma-distributed random variable with a shape k and scale θ. This means Z has a negative binomial distribution NB(r, p) where r = k and p = θ/(1 + θ). Letting Θ = (k, θ), the cdf of Y is Under the individual ascertainment model with ascertainment probability q1, X, the number of observed cases is binomial (n, p) with n = Z + 1 and p = q1. So, the probability that i individuals are observed is for i = 0, 1,… Since we do not observe clusters with no observed cases the probability of observing a cluster of size i is actually for i = 1, 2, …, where . If the observed cluster sizes are Xi, i = 1, …, n, the log-likelihood function for Θ = (k, θ) under the individual ascertainment model is then Under the cluster ascertainment model, the cluster is observed or not with probability 1 − (1 − q2)Y +1. So the probability of observing X = j in a cluster is for i = 0, 1, … but then since we cannot observe clusters of size 0, an observed cluster has size i with probability for i = 1, 2, …, where .
If the observed cluster sizes are Xi, i = 1, …, n, the log-likelihood function for Θ = (k, θ) under the cluster ascertainment model is then Under both ascertainment models, we then go from our estimates of k and Θ to estimates of Rc via the formula We use the Delta method to obtain confidence intervals for Rc from confidence intervals on k and θ.
Varying the rate and model of ascertainment
In the main text we estimated parameters with the assumption of the individual ascertainment model with an ascertainment probability of 0.75. Here we investigate how our main parameters Rc + 1 (expected cluster size) and k (dispersion) vary with this ascertainment probability. We also consider the alternate ascertainment model discussed in the previous section Model for Cluster Size.
Figure 8 shows the parameter estimates for the two models. On the left we show the results for the US states, and on the right for the Canadian provinces. The top plots show results for the individual ascertainment model where we set q2 = 1 and vary q1 from 0.2 to 1. The bottom plots show results for the group ascertainment model with q1 = 1 and q2 varying from 0.2 to 1. We see that the parameters do vary with the model and the ascertainment fraction, but relative magnitudes of the parameters in different jurisdictions do not change. In particular, for a range of q1 greater than 0.5 in the individual ascertainment model, there is little change in the parameter estimates.
Acknowledgments
We thank Alisha Morris and the other volunteers at the National Education Association for providing the US data that was used in this study. PT and CC were supported by Natural Science and Engineering Research Council (Canada) Discovery Grants (RGPIN-2019-06911 and RGPIN-2019-06624). CC receives funding from the Federal Government of Canada’s Canada 150 Research Chair program.