Abstract
Coronavirus disease 2019 (COVID-19) is a pandemic. To characterize the disease transmissibility, we propose a Bayesian change point detection model using daily actively infectious cases. Our model is built upon a Bayesian Poisson segmented regression model that can 1) capture the epidemiological dynamics under the changing conditions caused by external or internal factors; 2) provide uncertainty estimates of both the number and locations of change points; 3) adjust any explanatory time-varying covariates. Our model can be used to evaluate public health interventions, identify latent events associated with spreading rates, and yield better short-term forecasts.
1 Introduction
A newly identified coronavirus, SARS-CoV-2, is a lethal virus for humans. It has caused the worldwide pandemic – COVID-19. As reported by the Johns Hopkins University Center for Systems Science and Engineering (JHU-CSSE), the COVID-19 pandemic has spread to 188 countries and territories with more than 14 million confirmed cases by the end of July 2020. The extremely rapid spreading of the disease and the increasing burden on healthcare systems becomes a major problem for public health. In a response to “flatten the curve” (Akiyama et al., 2020), both federal and local governments in the United States (U.S.) have enforced a wide range of public health measures, such as border shutdowns, travel restrictions, and quarantine.
As a consequence, the importance of understanding the COVID-19 dynamics is steadily increasing in the contemporary world. In epidemiology, the basic reproduction number, ℛ0, is commonly used to evaluate the transmissibility of an infectious disease like COVID-19. ℛ0 is defined as the expected number of secondary cases produced by a typical case in a closed population. During the outbreak of an epidemic, the basic reproduction number can be affected by intervention strategies. For example, social measures that decrease the contact rate between individuals would control the basic reproduction number. Isolating or treating the infected cases could lower the ℛ0 value as well. Another concept in the epidemic theory is the effective reproduction number ℛt, which describes the number of people who can be infected by an individual at any specific time t in a population. ℛt is time-specific since it accounts for the varying proportions of the population that are immune to the disease over time. There are many recent studies implementing the SIR model (Kermack and McKendrick, 1927) or its modified version to analyze the COVID-19 transmissibility in terms of ℛ0 or ℛt (see e.g. Chen et al., 2020; Alvarez et al., 2020; Kantner and Koprucki, 2020; Gostic et al., 2020; Cooper et al., 2020). Furthermore, several studies have incorporated the information on social measures to understand the COVID-19 dynamics over the world. For instance, Dehning et al. (2020) combined the SIR model with Bayesian inference to study the time-dependent spreading rate of COVID-19 in Germany. Song et al. (2020) extended the SIR model by considering the quarantine protocols with a focus on understanding the time-course dynamics of COVID-19 in Hubei, China. Giordano et al. (2020) enriched the SIR model with additional compartments to account for under-diagnosis, which could explain the gap between the actual infection dynamics and perception of COVID-19 outbreak in Italy. Because of heterogeneity in susceptibility and social dynamics, the COVID-19 transmissibility differs among locations and change over time. U.S. local governments have started different interventions since mid-March to combat the widespread of COVID-19. Therefore, the basic reproduction numbers should spatiotemporally vary.
The basic reproduction number of an epidemic event is changing due to the effects of societal and political actions. The effective social measures such as business closures and stay-at-home orders could help lower the transmission rate and thus induce changes in ℛ0. By studying the variations in ℛ0 over time, we can evaluate the dynamic transmissibility of infectious diseases like COVID-19. For instance, during the outbreak of Severe acute respiratory syndrome (SARS) in China around 2003, Riley et al. (2003); Lloyd-Smith et al. (2003) reported an ℛ0 ≈ 3.0 for the onset stage of SARS in Hong Kong. Later in Chowell et al. (2004), the reported ℛ0 for Hong Kong dropped to about 1.1 due to stringent control measures. Decreasing in ℛ0 captured the evolution of SARS transmission dynamics under the efficient diagnosis coupled with patient isolation measures. A recent study in Germany (Dehning et al., 2020) estimated the variations in COVID-19 transmission rates for the four phases defined by the time of three major official government interventions. These studies motivated us to detect the important transitioning times that occurred during the COVID-19 outbreak in the U.S. The change points that caused the transitions partitioned the COVID-19 data into segments characterized by unique temporal patterns. A stochastic version of the discrete-time SIR model was then applied to estimate the basic reproduction numbers across segments. For a given population, the variation in the reproduction numbers across different stages reflects the efficacy of the state governmental interventions.
We propose BayesSMILES: Bayesian Segmentation ModelIng for Longitudinal Epidemiological Studies, to study the dynamics of COVID-19 transmissibility and to evaluate the effectiveness of mitigation interventions. The BayesSMILES adopts a Bayesian Poisson segmented regression model to detect multiple change points based on the daily infectious COVID-19 cases. This novel model can 1) capture the epidemiological dynamics under the changing conditions caused by external or internal factors; 2) quantify the uncertainty in both the number and time of change points; 3) adjust any relevant explanatory time-varying covariates that may affect the infectious case numbers. In addition, the BayesSMILES incorporated the change point information to quantify the COVID-19 transmissibility by estimating the distribution of the basic reproduction numbers among change-points segmented temporal spans. On the simulated data, we demonstrate that our approach can improve the accuracy of the change point detection as compared with a common multiple change-point search method. Applying the proposed methods into U.S. states, we find that the detected change points correlate well with the times of publicly announced interventions. We also demonstrate that the SIR model integrated with change point information provided a better short-term forecast. In all, the BayesSMILES enables us to understand the disease dynamics of COVID-19, and provides useful insights for the anticipation and control of current and future pandemics.
The rest of the paper is organized as follows. We review the traditional susceptible-infectious-recovered (SIR) model in Section 2. In Section 3, we describe the framework of the proposed BayesSMILES. The Markov chain Monte Carlo algorithm and posterior inference procedures are described in Section 4. We provide a comprehensive simulations study to illustrate the performance of the proposed method against a competing approach in Section 5. In Section 6, we analyze the COVID-19 data for U.S. states using the proposed BayesSMILES. Finally, we conclude with remarks in Section 7.
2 Review of the SIR Model
The susceptible-infected-removed (SIR) model was developed to simplify the mathematical modeling of human-to-human infectious diseases by Kermack and McKendrick (1927). It is a fundamental compartmental model in epidemiology. At any given time, each individual in a closed population with size N is assigned to three distinctive compartments with labels: susceptible (S), infectious (I), or removed (R, being either recovery or dead). The standard SIR model in continuous time that models the flow of people from S to I and then from I to R is described by the following set of nonlinear ordinary differential equations (ODEs): for t > 0, subjecting to S(t) + I(t) + R(t) = N. Here β > 0 is the diseases transmission rate and γ > 0 is the removal (recovery or death) rate. Conceptually, susceptible individuals become infectious (S → I) and then are ultimately removed from the possibility of spreading the disease (I → R) due to death, recovery with immunity against reinfection, or isolation from the rest of the population (quarantine).
The rationale behind the first equation is (1) is that the number of new infections during an infinitesimal amount of time, −dS(t)/dt, is equal to the number of susceptible people, S(t), times the product of the contact rate, I(t)/N, and the disease transmission rate β. The third equation in (1) reflects that the infectious individuals leave the infectious population per unit time, dI(t)/dt, as a rate of γI(t). The second equation in (1) follows immediately from the first and third ones as a result of dS(t)/dt + dI(t)/dt + dR(t)/dt = 0. Assuming that only a small fraction of the population is infected or removed in the onset phase of an epidemic, we have S(t)/N ≈ 1 and thus the second equation reduces to dI(t)/dt = (β − γ)I(t), revealing that the infectious population is growing if and only if β > γ. As the expected lifetime of an infected case is given by γ−1, the ratio β/γ is the average number of new infectious cases directly produced by an infected case in a completely susceptible population. Since it is a good indicator of the transmissibility of an infectious disease, the epidemiologists name it the basic reproduction number ℛ0 = β/γ in the context of a standard SIR model, or the effective reproduction number ℛt = βt/γt in the context of a time-variant SIR model, where β and γ are replaced by β(t) and γ(t) in (1).
3 The Proposed BayesSMILES Method
3.1 Data notations
During a pandemic such as COVID-19, the most accessible and complete data are the daily reported numbers on confirmed cases and deaths. Suppose N is the total population size in a given region. Let C = (C1, …, CT) and D = (D1, …, DT) be the sequences of cumulative confirmed case and death numbers observed at T successive equally spaced points in time (e.g. day), where Ct and Dt ∈ ℕ for t = 1, …, T. For a region of which recovery cases are closely monitored day by day, we use E = (E1, …, ET) to denote the sequence of cumulative recovery case numbers. Thus, due to the compositional nature of the basic SIR model, the three trajectories can be constructed as S = (S1, …, ST) with St = N − Ct, R = (R1, …, RT) with Rt = Dt + Et, and I = (I1, …, IT) with It = N − St − Rt = Ct − Dt − Et. For a region of which recovery cases do not exist or under-reported, we consider both R and I as missing data and reconstruct these two sequences according to the last equation of (1) with a pre-specified constant removal rate γ. Specifically we set I1 = C1 and R1 = 0, and generate Rt = Rt−1 + rIt−1 and It = It−1 +(Ct − Ct−1) − (Rt − Rt−1) from t = 2 to T sequentially, where. : [0, ∞) → N denotes the ceiling function. For the choice of γ, we suggest estimating its value from publicly available reports in some region where confirmed, death, and recovery cases are all well-documented, or from prior epidemic studies due to the same under-reporting issue in actual data. Lastly, let Y0 be the initial value and be the lag one difference of a sequence Y, where and each following entry , i.e. the difference between two adjacent observations. Note that Y could be any of those time-series data introduced above, i.e. C, D, E, S, I, and R.
3.2 Modeling epidemic dynamics via a modified stochastic SIR model
An SIR model has three trajectories, one for each compartment. The compositional nature of the three trajectories, i.e. S(t) + I(t) + R(t) = N, implies that we only need two of them to solve the ODEs as shown in (1). As mentioned previously, assuming S(t) ≈ N for all t results in dI(t)/dt = (β − γ)I(t) and further leads to an exact solution: I(t) = I(0) exp((β − γ)t). For modeling daily reported infectious data I, we utilize its discrete-time version, with a time-varying rate βt to account for the transmissibility changes of the disease. For simplicity’s sake, we assume a constant removal rate γ. Based on (2), we introduce a probabilistic model, which approximately mimics the dynamics of the deterministic standard SIR model as shown in (1) during the onset phase of a pandemic. Specifically, we suppose the infectious population size at time t is sampled from a Poisson model, where αt = I0 exp ((βt − γ)t)/N is a redefined time-varying transmissibility parameter that depends on the initial infectious population size I0, the transmission rate βt, the removal rate γ, and any latent factors (e.g. public health intervention, social behavior, virus mutation, healthcare quality, etc.) that may affect the disease transmissibility. This model automatically accounts for measurement errors and uncertainties associated with the counts. Note that (3) can be generalized to a negative binomial (NB) model, i.e. It|αt ∼ NB(Nαt, ϕI) if needed, where ϕI is a dispersion parameter to account for over-dispersion that might be observed in I. Here we use NB(µ, ϕ), µ, ϕ > 0 to denote an NB distribution with expectation µ and variance µ + µ2/ϕ.
3.3 Detecting multiple change-point via a Poisson segmented regression model
Our multiple change-point detection builds upon the above modified stochastic time-variant SIR model (3) with stationary transmissibility αt in a certain segment. Particularly, we assume that βt only changes at certain time points. Identifying those change points is of significant importance, as it not only enables us to characterize the temporal dynamics of the pandemic but also helps policymakers evaluate the effectiveness of the past and ongoing mitigation and intervention strategies.
In this paper, the change points are defined as those discrete time points that significantly alter the transmission rate βt between two adjacent segments, while assuming the removal rate γ is constant across all time points. We introduce a latent binary vector γ = (γ1, …, γT), γt ∈ {0, 1}, with γt = 1 if time point t is a change point and γj = 0 otherwise. We restrict γ1 = 1, although the first time point is not a change point. Those points with γj = 0 can be partitioned into segments bounded by two adjacent change points. Thus, we use another vector z = (z1, …, zT), zi ∈ {1, …, K} to reparameterize γ, where zt = k if time point t is in segment k, that is, between the k and (k + 1)-th change points. The total number of change points excluding the first time point is K − 1. Mathematically, z is the cumulative sum of γ, i.e. , while γ is the lag one difference of z, i.e. γt = zt − zt−1 with γ1 = 1. Figure 1 shows the representation of γ and z for a simulated time-series dataset (T = 10) with two change points.
To infer γ or z given the sequence of infectious population size I, we adopt a Poisson segmented regression framework, which can be written as, where xt is a p-dimensional column vector of covariates that includes a scalar of one for the intercept, time t, and p − 2 explanatory variables observed at time t if applicable. Those explanatory variables could contain the number of tests, weather information, or other temporal information that are important to adjust for during a longitudinal epidemiological study. bk is a p-dimensional column vector of segment-specified coefficients that includes an intercept representing the proportion of infectious people at logarithmic scale, i.e. bk1 = log(I0/N), in segment k, and a slope accounting for the segment-varying transmission rate, i.e. bk2 = βk − γ. For simplicity’s sake, we assume that ϵt is an independent and identically distributed random variable from a zero-mean normal distribution with segment-specific variance . To ensure the identifiability of all model parameters, we try a set of considerably small values for and employ a robust cross validation method called Pareto-smoothed importance sampling leave-one-out (PSIS-LOO) cross validation to determine the best choice (Vehtari et al., 2017).
Let αk be the sequence of all αt’s in segment k, i.e. , where we denote as the time point corresponding to the k-th change point and as the number of time points in segment k. We can rewrite the second stage in (4) as , where is an identity matrix and Xk can be explicitly written as We assume a zero-mean multivariate normal distribution for bk, that is, bk ∼ MN(0p, H), where 0p is an p-by-1 all zero column vector and is set to be a diagonal variance-covariance matrix. For a non-informative setting, we recommend choosing a large value for each . Through this prior specification, we are able to marginalize out the nuisance parameter bk and obtain . Consequently, we can write the collapsed model of (4) as To complete the model specification, we impose an independent Bernoulli prior on γ as , where ω is interpreted as the probability of a time point being a change point a priori. We further relax this assumption by allowing ω ∼ Be(aω, bω) to achieve a beta-Bernoulli prior. In practice, we suggest a constraint of aω + bω = 2 for a vague hyperprior of ω (Tadesse et al., 2005). In addition to that, we make the prior probability of γ equal to zero if two adjacent time points are jointly selected as change points. In other words, a segment should consist of at least two time points.
3.4 Estimating basic reproduction numbers via a stochastic SIR model
Given the multiple change points γ, we estimate the basic reproduction number R0 for each segment via a stochastic version of the standard SIR model as shown in (1), which only needs the cumulative confirmed case numbers C. This is because recovery data only exist in few states in the U.S., which makes both model inference and predictions infeasible. This model considers both of the removed and actively infectious cases as missing data and mimic their relationship in spirit to some compartmental models in epidemiology. Specifically, we assume the number of new removed cases at time t, i.e. , is sampled from a Poisson distribution with mean γIt−1, that is, , where γ should be pre-specified. Based on this simplification, we rewrite the discrete version of the first equation in (1) as, resulting in We further assume the new case number observed at time t, i.e. , is sampled from an NB model, as it automatically accounts for measurement errors and uncertainties associated with the counts. Following most epidemiological models, we assume this stochastic process is a Markov process, where the present state (at time t) depends only upon its previous state (at time t − 1). For the prior distribution of the dispersion parameter ϕ, we choose a gamma distribution, ϕ ∼ Ga(aϕ, bϕ). We recommend small values, such as aϕ = bϕ = 0.001, for a non-informative setting. This model, on average, mimics the epidemic dynamics and is more flexible than those deterministic epidemiological models. We assume β comes from a gamma distribution with hyperparameters that makes both mean and variance of the transformed variable ℛ0 = β/γ equal to 1.
4 Model Fitting
In this section, we describe the Markov chain Monte Carlo (MCMC) algorithms for posterior inference of the proposed BayesSMILES method, including the inferential strategy to identify multiple change points and to estimate the basic reproduction numbers, respectively.
4.1 MCMC algorithms for detecting multiple change-point
Our primary interest lies in identifying the change-point locations via the vector γ based on the actively infectious cases I. According to Section 3.3, the full data likelihood and the priors of the change-point detection model are written as, where we use the superscript (γ) to denote the transmissibility parameters αk, the design matrix Xk, and the number of time points nk in segment k under a specific configuration of γ. At each MCMC iteration, we perform the following steps.
Update change-point indicator γ
We update the binary latent vector γ via an add-delete-swap algorithm. We randomly select an entry in γ, say γt, and change its value to to form a new γ*. This is an add step if and a delete step otherwise. The swap step is performed every ten iterations, where we randomly select a change point, say γt = 1, and swap the values between the t and (t ± 1)-th entries in γ to form a new γ*. We accept the proposed γ* with the probability min(1, mMH), where the Hasting ratio is Here we use J (· ← ·) to denote the proposal probability distribution for the selected move. Note that the last proposal density ratio equals one. This step simultaneously updates the segmentation vector z, as it can be constructed from γ.
Update transmissibility parameters α
For each segment partitioned by γ, we update αt within the same segment, say segment k, sequentially by using a random walk Metropolis-Hastings (RWMH) algorithm. We first propose a new , of which logarithmic value is generated from . Let . Then we accept the proposed value with probability min(1, mMH), where the Hastings ratio is Note that the proposal density ratio cancels out for this RWMH update. The computation of the multivariate normal (MN) probability density involves matrix inversion, which can be time-consuming, particularly when is large. To significantly improve the computational efficiency, we follow Zhou and Guan (2018) to approximate the exact inversion under an appropriate choice of H that satisfies the asymptotic condition. As mentioned previously, H is a p-by-p diagonal matrix, where the first entry corresponds to the variance of the normal prior on bk1. Under the asymptotic condition of , the inversion of an -by- matrix reduces to an inversion of a p-by-p matrix (See more details in Appendix). In practice, we set and to ensure this asymptotic condition.
4.2 MCMC algorithms for estimating basic reproduction numbers
Once the change points are determined, we aim to estimate the basic reproduction numbers ℛ0′s across different segments and quantify their uncertainties based on the cumulative confirmed cases C only. According to Section 3.4, the full data likelihood and the priors of the stochastic SIR model are written as, where β = (β1, …, βK) and ϕ = (ϕ1, …, ϕK), i.e. the collections of transmission and dispersion rates of all segments. For the hyperparameters, we set aβ = 1 and bβ = 1/γ so that both of the expectation and variance of the basic reproduction number ℛ0 = βk/γ are equal to one. We specify aϕ = bϕ = 0.001 to obtain a weakly informative gamma prior for all dispersion parameters. With a pre-defined removal rate γ, we propose the following updates in each MCMC iterations.
Generate R based on C
We assume I1 = C1 and R1 = 0, i.e. all the confirmed cases are capable to pass the disease to all susceptible individuals in a closed population at time point t = 1. Then we sample R2 ∼ Poi(γI1), where γ is a pre-specified tuning parameter. Due to the compositional nature of the SIR model, we can compute , where and are the new confirmed cases and new removed cases at time point t = 2, respectively. Next, we repeat this process of sampling Rt ∼ Poi(γIt−1) and computing , to generate the sequence R used in an iteration.
Update dispersion parameters ϕ
For each segment, we update ϕk by using a RWMH algorithm. We first propose a new , of which logarithmic value is generated from . Let , where only the k-th entry is replaced. Then we accept the proposed value with probability min(1, mMH), where the Hastings ratio is Note that the proposal density ratio cancels out for this RWMH update.
Update transmission rates β
For each segment, we update βk by using a RWMH algorithm. We first propose a new , of which logarithmic value is generated from .
Let , where only the k-th entry is replaced. Then we accept the proposed value with probability min(1;mMH), where the Hastings ratio is Note that the proposal density ratio cancels out for this RWMH update.
4.3 Posterior inference
We explore posterior inference on the parameters of interest by postprocessing the MCMC samples after burn-in. We start by obtaining a point estimate of the change-point indicator γ by analyzing its MCMC samples {γ(b), …, γ(B)}, where b indexes the MCMC iteration after burn-in. One way is to choose the γ corresponding to the maximum-a-posteriori (MAP), The corresponding can be obtained by taking the cumulative sum of . An alternative estimate relies on the computation of posterior pairwise probability matrix (PPM), that is, the probabilities that time point t and t′ are assigned into the same segment, . This estimate utilizes the information from all MCMC samples and is thus more robust. After obtaining this T -by-T co-clustering matrix denoted by P, a point estimate of z can be approximated by minimizing the sum of squared deviations of its association matrix from the PPM, that is, The corresponding can be obtained by taking the difference between consecutive entries in and set the first entry to one. To construct a “credible interval” for a change point, we utilize its local dependency structure from all MCMC samples of γ that belong to its neighbors. Due to the nature of the MCMC algorithm described in Section 4.1, if a time point t is selected as a change point, i.e. γt = 1, then its nearby time points must not be a change point. Thus the correlation between the MCMC sample vectors and tends to be negative when u is small. Thus, we define the credible interval of a change point as the two ends of all its nearby consecutive time points, of which MCMC samples of γ are significantly negatively correlated with that of the change point. This could be done via a one-sided Pearson correlation test with a pre-specified significant level, e.g. 0.05. Although quantifying uncertainties of change points is not formal, it performs very well in the simulation study and yields reasonable results in the real data analysis.
We constructed the confidence interval for a selected change point t by evaluating its MCMC samples . In particular, according to the swap step that updates γ in the MCMC algorithm, the change point was typically switching between the selected time point t and its close neighborhood in the MCMC iterations. This is due to the fact that the change in patterns between segments may not be abrupt, and therefore the time points closed to time t could also be accepted as a change point in an iteration. Hence, for each change point with γt = 1, we calculated the correlation between and Here, we set different values for Δt (e.g., Δt = −3, −2, …, 1, 2) such that the collection of t+Δt included a reasonable neighborhood for time t. Then, we calculated the correlation between and . Given the fact that only one change point would appear in the neighborhood region, vector tended to have a negative correlation with .
Once the change points are determined, an approximate Bayesian estimator of the transmission rate βk for each segment k can be simply obtained by averaging over all of its MCMC samples, . In addition, a quantile estimation or credible interval can be obtained. Lastly, we summarize the basic reproduction number in each segment k as .
4.4 Prediction
Conditional on the multiple change-point locations, we can predict the cumulative or new confirmed cases at any future time Tf by Monte Carlo simulation based on the information in the last segment K only. Specifically, from time T + 1 to Tf, we sequentially generate Then, both short and long-term forecasts can be made by summarizing the (Tf − T)-by-B matrix of MCMC samples. For instance, the predictive number of cumulative and new confirmed cases at time T + 1, in average, are and , respectively.
5 Simulation
We used simulated data to evaluate the performance of our BayesSMILES method in terms of both multiple change-point detection and basic reproduction number estimation. It is shown that the proposed Bayesian framework outperforms alternative change-point detection method.
5.1 The generative model
The three trajectories S, I, and R with length T = 120 were generated in the following ways. We first divided the T = 120 time points into K = 4 segments with the same length, that is, the true change points were t = 31, t = 61, and t = 91. In other words, the first 30 entries of z equal to 1, the following 30 entries equal to 2, the next 30 entries equal to 3, and the last 30 entries equal to 4. To mimic the disease transmissibility dynamics across different segments, we chose segment-varying transmission rates βk while fixing the removal rate γ = 0.03. We considered four scenarios of the set (β1, β2, β3, β4), corresponding to 1) ℛ0 = (3.0, 1.2, 2.0, 0.8); 2) ℛ0 = (3.0, 2.3, 1.5, 0.8); 3) ℛ0 = (3.0, 1.8, 0.8, 1.6); 4) ℛ0 = (3.0, 2.0, 1.1, 0.5). Then based on the stochastic version of the standard SIR model, we sampled St and Rt from negative binomial (NB) distributions, and obtained It, sequentially from t = 1 to T through where N = 1, 000, 000, the initial I0 = 100 and R0 = 0, and the dispersion parameters ϕS = ϕR = 10. Note that the generative scheme was with an NB error structure, which was different from our model assumption based on a Poisson error structure. We repeated the above steps to generate 50 independent datasets for each setting of ℛ0. Figure 2 displays the temporal patterns of the simulated infectious counts I for the four scenarios.
5.2 Evaluation criteria
To evaluate the multiple change-point detection, we may rely on either the binary change point indicator vector γ or the time point allocation vector z. For the choice of γ, a change point is considered to be correctly identified if its location is within a local window of the true position (Killick and Eckley, 2014). The selection of the window size is ad hoc and may bias the evaluation.
In addition to that, since change points and non-change points are usually of very different sizes, most of the binary classification metrics are not suitable for model comparison here. Thus, we chose those metrics that quantify the agreement between the true and estimated allocation vectors, i.e. z and . The two classic performance metrics for the analysis of clustering results are the adjusted Rand index (ARI) and mutual information (MI), proposed by Hubert and Arabie (1985) and Steuer et al. (2002), respectively. ARI is the corrected-for-chance version of the Rand index (Rand, 1971), as a similarity measure between two sample allocation vectors. Let ; and be the number of pairs of time points that are a) in the same segment in both of the true and estimated partitions; b) in different segments in the true partition but in the same segment of the estimated one; c) in the same segment of the true partition but in different segments in the estimated one; and d) in different segments in both of the true and estimated partitions. Then, the ARI can be computed as The ARI usually yields values between 0 and 1, although it can yield negative values (Santos and Embrechts, 2009). The large the index, the more similar between z and , thus the more accurately the method detects the actual times at which change points occurred. An alternative metric choice is MI, which measures the information about one variable that is shared by the other (Steuer et al., 2002). Let be the number of time points shared between the k-th segment in the true z and the kt-the segment in the estimated one . Then, the MI can be computed as where is the number of segments and ’s are the segment lengths for segment in . It yields non-negative values. The larger the MI, the more accurate the partition result.
To quantify how well a method estimates the dynamic transmissibility across different segments, we used the root mean square error (RMSE) that measures the deviation between the true and estimated values of ℛ0 over all T time points: A smaller value of RMSE indicates a more accurate estimation of ℛ0’s.
5.3 Results
As for the MCMC setting of change point detection, we set 40, 000 MCMC iterations and discarded the first half as burn-in. We adopted the weakly informative setting by setting aω = 0.1 and bω = 1.9 in the Beta-Bernoulli prior for the change-point indicator vector γ. We set with and as the covariance matrix in the prior distribution of bk’s. Finally, we let take ten equally spaced values ranging from 0.0001 to 0.01 at the logarithmic scale (base 10) in the PSIS-LOO cross validation. In fitting the stochastic SIR model, we set 100, 000 MCMC iterations with the first half as burn-in. As suggested in Waqas et al. (2020), the value of removal rate γ could be estimated by for each simulated dataset. Then, we set aβ = 1 and bβ = 1/γ so that both of the expectation and variance of the basic reproduction number ℛ0 = βk/γ are equal to 1.
We first checked the performance of the BayesSMILES on a single simulated dataset, which was randomly selected from the 50 replicates in Scenario 4 (marked as the blue line in Figure 2). Figure 3(a) demonstrates the change-point detection result based on the Poisson segmented regression model. The red dashed and the blue solid lines represent the true and the estimated change point locations, respectively, while the gray ribbons represent the 95% confidence intervals for those identified change points. As we can see, the BayesSMILES successfully detected the three true change points in general, as each of the 95% confidence intervals covered the truth. The resulted values of ARI and MI were 0.93 and 1.28, respectively. Later on, the stochastic SIR model introduced in Section 3.4 was then fitted to quantify the disease transmissibility in each segment bounded by the identified change points. Figure 3(b) shows the posterior distributions of ’s for from their MCMC samples. The red dashed and blue solid line pinpoints the true and posterior mean of ’s, while the two black solid lines mark the boundary of their 95% credible intervals. Clearly, those true values were within their corresponding 95% credible intervals. The final RMSE for ℛ0 estimation was 0.38 for this single simulated dataset.
To our best knowledge, there is no method like the BayesSMILES that can detect latent change points while characterizing the transmission dynamics through an SIR model. Thus, in setting up a comparison study, we therefore considered a two-stage approach that first identifies multiple change-point of time-series data based on a likelihood based framework, and then estimates the basic reproduction numbers between each pair of nearby change points, following the stochastic SIR model introduced in Section 3.4. The alternative change point model assumes time points within one segment follow a normal distribution with distinct mean and/or variance from its nearby segments (Hinkley, 1970; Jen and Gupta, 1987), and it uses likelihood ratio test (LRT) to detect multiple change points. An algorithm named binary segmentation (Edwards and Cavalli-Sforza, 1965; Sen and Srivastava, 1975) is commonly used to compute the test statistics for the LRT with high efficiency (Killick et al., 2012). In our case, to detect multiple change points using this alternative approach named likelihood ratio test with binary segmentation (LRT-BinSeg), we input the logarithmic scale of I into the function cpt.meanvar in the related R package changepoint (Killick and Eckley, 2014) for each of the simulated datasets. We specified to use the binary segmentation algorithm with up to 5 of change points to search for. Note that this restriction was not applicable to the alternative algorithms provided in the changepoint package. In practice, we found that the alternative algorithms tended to over-select the number change points.
Figure 4(a) and (b) exhibit the change-point detection performances for the four scenarios of ℛ0. Our BayesSMILES performed much better than the likelihood ratio test with the binary segmentation algorithm (LRT-BinSeg) with respect to multiple change-point detection under both performance metrics, ARI and MI. For instance, the ARI by the BayesSMILES increased 39.29% to 122.16% over the LRT-BinSeg among the four scenarios, while the growth in MI could be up to 60.54%. Figure 4(c) compares the ability to capture the transmission dynamics in terms of RMSE, which depends on the change-point detection accuracy. As expected, Our BayesSMILES yielded smaller RMSE values across all scenarios since its identified change point locations were more accurate. In all, the simulation study demonstrated that the strengths of BayesSMILES.
6 Analysis of COVID-19 Data
In this section, we applied the BayesSMILES to the COVID-19 daily report data provided by JHU-CSSE (COVID-19 Data Repository). Several recent COVID-19 studies also based their analyses on this resource (Dong et al., 2020; Zhou and Ji, 2020; Toda, 2020). We first performed a preprocessing step to ensure the quality of the infectious data I for the model fitting. Due to the fact that the recovery cases did not exist for some U.S. states, we treated I and R as missing data and reconstructed the two sequences according to the process described in Section 3.1. The observed cumulative confirmed case numbers C were collected for each U.S. state starting from an early stage of the pandemic outbreak. In particular, we chose the starting time for each state as when there were least ten confirmed COVID-19 cases for that state. We also set the removal rate γ = 0.1 as suggested by Pedersen and Meneghini (2020) and Weitz et al. (2020). Due to the fact that different states could have different starting times, we further trimmed the sequences I and R for each state based on the latest starting time available. Finally, we set March 22, 2020, as the new starting time (t = 1) for all 50 states, and let July 19, 2020, be the last observed time point (t = 120). We used the same hyperparameter and algorithm settings as described in Section 5.3.
6.1 Detecting change points for U.S. states
We limit our analysis to four U.S. states with the highest cumulative confirmed cases as of July 19, 2020, to keep the paper in a reasonable length. They are New York, Texas, California, and Florida. The results for the 46 remaining states are available in https://shuangj00.github.io/BayesSMILES/ (see details in Section 8). Figure 5 displays the detected change points, as well as the estimated basic reproduction number ℛ0’s cross segments, for the four states. The associated confidence interval to each identified change point is represented by a gray ribbon. In general, those change points detected by the BayesSMILES indeed captured the important COVID-19 events that might affect the transmission rates. For instance, some change points reflected the positive effects of the preventative strategies such as lockdown, while others explained the “bounce back” in confirmed cases after the reopening. Table 2 lists the change point locations and their potentially related events for the four states.
In New York, the first change point was estimated to be March 28. We estimated the posterior mean of the basic reproduction number decreased from 2.24 (between March and to March 27) to 1.63 (between March 28 and April 8). Notably, March 28 was the date when the Centers for Disease Control and Prevention (CDC) issued a 14-day domestic travel advisory for non-essential persons, which presumably alleviated the situation for the populated states such as New York. The second change point appeared around April 9, and the R0 of the third segment dropped to 0.98 with a 95% credible interval of [0.76, 1.25]. This matched the exact day when the New York state posted its first drop of the ICU admissions since the COVID-19 outbreak began. The third change point was around April 27. Though there was no direct intervention issued in late April, we noticed that the mayor of New York City announced that all the major events had been canceled starting from April 20. This action could bring a positive effect in controlling the outbreak, and our estimation from the SIR model suggested a further decrease in the basic reproduction number down to 0.66 with a 95% credible interval of [0.54, 0.81]. We observed another change point around June 18, which was close to the Phase II reopening of New York state on June 22. During the Phase II reopening, restaurants were allowed for outdoor dining, stores opened for in-person retail, and more services resumed operational under strict limitations. Thus, we saw a little “bounced back” in ℛ0 from 0.66 to 0.82. The last change point was on June 29. As expected, the basic reproduction number increased to 1.04 with a 95% credible interval of [0.84, 1.29] in the last segment. Although there was no public announcement around June 29 with a confidence interval from June 28 to July 4, we suspect that the increased social interaction during the Independence Day long weekend (between July 3 and July 5) could be responsible for the increase of the transmission dynamics.
In Texas, there were five change points detected. The first change point was estimated to be March 28, the same day as the first one for New York state. Due to the similar reason, the policy of mandatory 14-day quarantines for travelers entering Texas could bring a decrease in terms of the basic reproduction number (decreased from 2.97 to 2.07). The second change point was around April 9 with a further drop of ℛ0 to 1.14 with a 95% credible interval [0.96, 1.35]. We found that the Texas Governor had extended the state’s disaster declaration for an additional 30 days on April 12. The extension aimed at protecting the health and safety of Texans by ensuring adequate capabilities to support communities. Organizations such as the State Operations Center and the Strategic National Stockpile would continuously supply the state government with resources needed to protect residents. May 25 was detected as the third change point, and it was the first time that ℛ0 increased after the two drops. The estimated basic reproduction number was 1.29 with a 95% credible interval [1.02, 1.62]. This increase appeared around May 25 could be due to the Governor’s updated executive order issued on May 26 that allowed additional services and activities to open for phase II reopening. The next change point was around June 16, and ℛ0 further increased to 1.72 with a 95% credible interval [1.40, 2.11]. According to the prediction reported by UT Austin’s COVID-19 Modeling Consortium at the end of May, there might be a significant increase in the number of cases and hospitalizations beginning mid-June (News from kxan). Here, the change point location, as well as the increased basic reproduction number, were consistent with the results of this report. The last change point was around June 28 with an estimated decrease in ℛ0 to 1.42 with a 95% credible interval [1.17, 1.72]. Notably, the Texas Governor issued multiple executive orders around late June to early July to mitigate the disease spreading. For instance, the executive order on June 26 reemphasized the limited occupancy for all business establishments in Texas. According to an executive order on July 2, all Texans were required to wear a face-covering in public spaces in counties with 20 or more positive COVID-19 cases. On the same day, the Governor announced an update regarding the executive order on June 26 with additional measures to slow down the spreading of COVID-19.
In Florida, the first estimated change point was April 3. It was two days after the statewide stay-at-home order for Florida. We estimated that the basic reproduction number decreased from 2.70 to 1.28 after the change point. The second change point appeared around the middle of April. Starting from April 13, some counties such as Osceola county started the requirement of wearing a face covering while in public. It could explain the reason why we observed a slight decrease in ℛ0, from 1.28 to 0.92 with a 95% credible interval [0.73, 1.15]. The next change point was located around May 13, and ℛ0 in this new stage went above 1 again, with a posterior mean of 1.19 and a 95% credible interval [0.96, 1.50]. We noticed that Florida entered the phase I reopening on May 18, which could lead to the “bounced back” situation. The fourth change point was around June 7, two days after the phase II reopening in Florida. Changes in the phase II reopening included Universal Orlando opened the parks to the general public for the first time in months, and we observed that ℛ0 increased again to 1.81 with a 95% credible interval [1.50, 2.19]. In the last segment (after June 27), our result revealed a slight drop in the basic reproduction number from 1.81 to 1.45. This change was potentially related to the consequence of requiring facial coverings in the four most populated cities in Florida: Tampa, Orlando, Miami, and Jacksonville. The face mask mandates went into effect for the four cities starting from June 19, 20, 25, and 29, respectively. Therefore, the drop in the transmissibility at the end of June may be explained by the effectiveness of wearing face masks as a non-pharmaceutical practice.
In California, we detected two change points. California was the first state to announce lockdown in the COVID-19 pandemic and its stay-at-home order became effective on March 19. Our results in change point detection could miss these early actions since the data we analyzed started from March 22. The first selected change point was on April 5, with the value of the basic reproduction number decreased dramatically when transitioning to the second segment (from 2.20 to 1.20). The second change point was on June 17, and we saw that ℛ0 increased to 1.35 in the last segment with a 95% credible interval [1.12, 1.62]. According to California Governor, higher-risk businesses and venues (e.g. movie theaters, bars, gyms) were allowed to reopen with restrictions on June 12. Hence, increasing in the basic reproduction number could be the consequence of reopening. This observation was also observed in New York and Texas.
6.2 Clustering U.S. states based on their change-point locations
We applied BayesSMILES on all 50 U.S. states. Based on the results, we seek to derive an overall picture of the COVID-19 dynamics across states. We summarized the temporally detected change points of the 50 states into common patterns, and then we labeled each state by matching its specific change point pattern to the common patterns. In particular, for each state, we calculated the marginal posterior probability of inclusion (PPI) for all time points, where the PPI for a time point t was calculated based on the B of MCMC samples after burn-in: . Then we obtained the vector . Each entry in pPPI is a value between 0 and, 1, representing the proportion of time t selected as a change point among all iterations. Next, we computed the overall pattern by averaging over the vector pPPI across 50 states. We noticed that some time points were rarely or never selected as change points. That naturally hinted us to group the time points. To illustrate this, we trimmed the top 20% values of for each time t (Figure 6). The trimming step provided a clear pattern and highlighted the groups of dates that were commonly identified as change points. We observed three time spans as shown in Figure 6: March 27 - April 11, May 1 - May 10, and May 22 - July 3. For each state, we defined its cluster label based on the corresponding change point detection results. If this state had at least one change point (including the confidence interval) between March 27 - April 11, the first element in its cluster label is “Change”. Otherwise, the first element in the group label was set to “Stable”. We repeated the same process to determine the second and third elements of the class label for each state. In the end, each state was assigned to a cluster label from “Change-Change-Change”, “Change-Stable-Change”, or “Stable-Change-Change”.
The map in Figure 7 colors each of the 50 states based on its cluster label, where green, yellow, and pink correspond to temporal patterns “Change-Stable-Change”, “Change-Change-Change”, and “Stable-Change-Change”, respectively. Interestingly, three out of the four states we have analyzed, New York, Texas, and California, belonged to the same category, i.e. “Change-Stable-Change”. Several other states that are also in this category include Georgia, Arizona, North Carolina, and Louisiana. All of these states were in the top ten states with the most COVID-19 confirmed cases. We noticed that the phase I statewide reopening for all these states occurred in mid-May (May 15 for Georgia, May 13 for Arizona, May 8 for North Carolina, May 15 for Louisiana). Therefore, our model did not report any change points for these states between May 1 and May 10. The rest states in the 10 states with the most COVID-19 confirmed cases, including Florida, Illinois, and New Jersey, were labeled as “Change-Change-Change”, and all of them had a change point between May 1 and May 10. As discussed in Section 6.1, Florida had a change point around May 13 with a confidence interval [May 8, May 18]. According to the Executive Order 2020-32 issued by the Illinois governors, the state entered the phase II reopening starting on May 1 with a modified stay-at-home order. For New Jersey, the statewide state-at-home order was not lifted until June 9. However, our model suggested a change point around the end of April with a confidence interval [April 25, May 3] with a drop in R0 from 1.11 to 0.68 (details available at https://shuangj00.github.io/BayesSMILES/). We noticed that on May 3 the New Jersey Governor announced a multi-state agreement to develop a regional supply chain for personal protective equipment, other medical equipment and testing. This joint-state protective measure allowed for efficient delivery and reliability of medical equipment for states, and therefore best utilized the life-saving resources in the face of the COVID-19 outbreak.
6.3 Predicting new confirmed cases for U.S. states
Short-term forecasting of the new daily confirmed COVID-19 cases at a future time Tf is important for designing public policies. Here, we used the estimation results from BayesSMILES to predict the new confirmed cases for the four states that we analyzed: New York, Texas, Florida, and California. As suggested in Section 6.1, the basic reproduction number ℛ0 for one state could go above or below the threshold value of 1 at different disease stages. Hence, for short-term forecasting, the predictions given by the BayesSMILES were based on the last segment’s information, which ensured that the predictions fully reflected the most recent disease characteristics. In contrast, we evaluated an alternative approach that ignored the underlying disease dynamics and used all the data to fit a single SIR model for the prediction. We named this method FullDataSIR. Here, we illustrated the advantages of incorporating the change point information into prediction. In particular, we compared the prediction accuracy of the two methods: the BayesSMILES and FullDataSIR.
Figure 8 showed the true value of the new daily confirmed COVID-19 cases and the two types of predictions for the four states. For each state, the 7-day forecast was from July 20 to 26. First of all, we observed that the predicted mean by the FullDataSIR model tended to be larger than that from the BayesSMILES for the same day. This was because the data in the early stage (in our case, from late March to early April) corresponded to a much larger basic reproduction number, as compared to the data from the late stage, and the BayesSMILES would ignore the early stage observations in making the prediction. Second, the closer the predicted value to the true value on each day, the better the prediction performance. We thus quantified the prediction accuracy using the mean absolute percentage error (MAPE). The MAPE for the 7-day forecast was defined as where was the observed new confirmed cases at future time Tf, and was the corresponding prediction. The smaller the MAPE value, the more accurate the prediction.
We calculated the MAPE for the 7-day forecast of the four states analyzed in Section 6.1, and Table 1 summarized the results. For the first three states, the MAPEs from the BayesS-MILES were much smaller than the MAPEs from the FullDataSIR model, suggesting a better performance of BayesSMILES. For California, the conclusion was reversed since the FullDataSIR model showed a slightly lower MAPE. However, if we look at Figure 8(d), the real data (black horizontal lines) and the predictions from the BayesSMILES (red circles) both showed a down-going trend. In this case, the consistency in trend could be more important to evaluate two types of predictions, as compared to the slight difference in MAPEs. Lastly, for most of the days shown in Figure 8, the 95% credible interval by BayesSMILES tended to be much narrower than the interval given by the FullDataSIR model under the same day, especially for Texas, Florida, and California. Combined with a smaller value of MAPE, the BayesSMILES was able to provide a more accurate and precise prediction as compared to the alternative approach. In all, incorporating the change point location information helped improve the prediction accuracy for the 7-day forecast of new confirmed COVID-19 case numbers.
7 Conclusion
In this paper, we proposed the BayesSMILES, a Bayesian segmentation model for studying the longitudinal epidemiological data, to analyze the transmission dynamics of an infectious disease such as COVID-19. Our approach includes a Bayesian Poisson segmented regression model to detect multiple change-point from the sequence of actively daily infectious cases. Those identified change points correspond to latent events that significantly alter disease spreading rates, while the resulting segments are characterized by unique epidemiological patterns. We further describe the disease transmissibility for each of the segments by using a stochastic time-invariant SIR model, assuming the transmission rate remains the same till the next change points. Our model outputs a series of the basic reproduction numbers ℛ0’s over stages to track the changes in spreading rates during a pandemic.
We applied the BayesSMILES method to analyze the COVID-19 daily report data of 50 U.S. states. Our results showed that the COVID-19 outbreak declined substantially after implementing stringent interventions for several states, including New York, Texas, and Florida. Meanwhile, our identified change points matched well with the timelines of publicly announced intervention strategies. In addition to that, the change in the basic reproduction numbers between two adjacent segments might be used to quantify the effectiveness of an intervention, which could help us understand the impact of different control measures. Several downstream analyses based on the BayesSMILES results were conducted. In particular, we clustered the temporal patterns of the 50 U.S. states based on their change-point locations, leading to an interesting spatial pattern related to the COVID-19 dynamics. Lastly, we demonstrated that our method could also improve the short-term forecasting of the new daily confirmed cases.
In the future, we plan to extend the Poisson error structure in the change point detection model to a negative binomial distribution for modeling the over-dispersed count data. Furthermore, the current BayesSMILES framework can be generalized to characterize temporal patterns in other epidemiological data. To do so, the segmented regression model should not be restricted to countable outcomes.
8 Software
We provide software in the form of R/C++ codes on GitHub https://github.com/shuangj00/BayesSMILES. It includes the tutorial of implementing the BayesSMILES, using U.S. state-level COVID-19 data as an example. Besides, we design a website https://shuangj00.github.io/BayesSMILES/ to summarize the inference results for the 50 U.S. states, as a supplement to Section 6. The website shows that 1) the detected change points for each U.S. state; 2) the COVID-19 transmission dynamics based on the segment-varying basic reproduction numbers R0’s, including their poster means and 95% confidence intervals.
Data Availability
The COVID-19 dataset was downloaded from the COVID-19 Data Repository hosted by Johns Hopkins University Center for Systems Science and Engineering
9 Appendix: Approximate the multivariate normal density function
This section provides the details of the multivariate normal density approximation used to improve the computational efficiency in Section 4.1. We consider a general setting as follows. Let y be an n × 1 vector, W be an n × q matrix, X be an n × p matrix, and A = [W X] (which is an n × (p + q) matrix). Let Σ be a (q + p) × (q + p) diagonal matrix where the first q diagonal elements are h0 and the last p diagonal elements are h1. Assume h0, h1, σ2 > 0. By Woodbury identity and Sylvester’s determinant identity, where | · | denotes the matrix determinant and I is a diagonal matrix of proper dimension. Define where (respectively ) is the residual after regressing out W from X (respectively y). Zhou and Guan (2018) showed that the expressions in (10) and (11) can be further simplified when h0 → ∞. The conclusions are summarized below with the proof available in the supplement of Zhou and Guan (2018).
Let y, be as defined above. Then, and
The conclusions in the lemma above can be used to improve the computational efficiency for approximating the multivariate normal probability density function (p.d.f.) in our model. Within each segment k we assumed log α. = Xkβk + ϵk. Under the prior specification discussed in Section 3.3, we have , and the corresponding p.d.f. for log α. is: where nk is the segment length. Next, we simplify the calculation of and by using Lemma 1. Consider and W as a column vector of 1’s with length nk. The vector y in our case matches log α., and . Lemma 1 states that the inverse and determinant calculation of an nk × nk matrix can be reduced to that of a p × p matrix, and in our case p = 1 (since the regression model only includes “time” as a covariate except the intercept term). Therefore, the computational benefit could be significant when nk was large. We first derive the formula of P as follows, Next, according to Equation (12), can be approximated by when h0 → +∞. In particular, we can derive Therefore, Therefore, To calculate |h0WW T + σ2I|, note that WW T is an nk × nk matrix, and we can derive that According to Sylvester’s determinant lemma, we have Therefore, the formula in Equation (16) equals the following, Combining the results in Equations (15) and (17), we can approximate the matrix determinant in (14) as follows when h0 → ∞, Next, it is straightforward to derive the formula in the exponent part in Equation (14) using the result in Equation (12). In particular, we can derive: Then we approximate the exponent part in Equation (14) as follows under the condition of h0 → ∞, We further introduce the following notation, By combining the approximations in (18) and (19), we obtain that In practice, the condition of h0 → ∞ is satisfied when h0 ≫ h1. We set h0 = 10, 000 and h1 = 10 and the approximation accuracy is desirable.