Abstract
In this paper, a data-driven adaptive model of COVID-19 infection is formulated to predict the confirmed total cases and active cases of an area up to 4 weeks ahead. The model parameters are updated daily based on the latest observations. It is found that short-term prediction of up to 3-4 weeks is possible with good accuracy. A detailed analysis of the predicted and actual values of confirmed total cases and active cases for India from 1st June to 3rd July is provided. Predictions over 7, 14, 21, and 28 days have errors of about 0.73% ± 1.97%, 1.92% ± 2.95%, 4.34% ± 3.91%, and 6.40% ± 9.26% of the actual value of confirmed total cases, respectively. Similarly, the 7, 14, 21, and 28 day predictions have errors of about 1.24% ± 6.57%, 3.04% ± 10.00%, 6.33% ± 16.12%, and 10.20% ± 24.14% of the actual value of confirmed active cases, respectively.
1. Introduction
Accurate predictions of the spread of the novel coronavirus disease (COVID-19) are essential for planning and managing medical resources as well as lockdown strategy. A good prediction can help avoid shortages of critical medical resources and the economic loss associated with unnecessary lockdowns. Mathematical modelling of infectious disease is generally performed by classifying the population into different compartments such as Susceptible, Infectious, Recovered, and Deceased. Various classical compartmental models have been reported in the literature, such as SIR, SIS, SIRD, MSIR, SEIR, SEIS, MSEIR, and MSEIRS. To model the spread of the coronavirus, various modifications of the SIR model have been proposed in the literature to include travel information [1], behavioural changes [2], and various intervention measures such as quarantine, strict social distancing, tracing and isolation [3], [4], [5]. The model parameters are generally obtained for a specific area, as they can vary widely with government policies, economic factors, and adoption of social distancing measures. Predictions of the spread of COVID-19 have been developed for specific countries such as India [6, 7], China [8, 9, 10], South Korea and Italy [1], Iran [11], Korea [2], and the UK [12].
Estimates of the spread of COVID-19 have been made under the assumption that the growth of total cumulative cases follows a certain pattern, such as a Gaussian [13], the Verhulst equation [14], the Lotka-Volterra dynamic model [15], or the Gompertz equation [16]. Estimates for Greece, Netherlands, Germany, Italy, Spain, France, United Kingdom and United States were developed using a Gaussian fitting hypothesis, based on the assumption that the evolution of infected cases is Gaussian in nature [13]. In [17], estimates for COVID-19 are derived from the fatality data for India under the assumption that the growth of infected cases is exponential. In [15], a Lotka-Volterra dynamic model is used to represent the growth rate, and the model coefficients are derived using extended Kalman filter techniques. In [18], Kalman filter-based techniques are used for a short-term prediction model.
A time-invariant SIR model is not effective in predicting the spread of the coronavirus [19], given the large variation in the transmission dynamics. It is also difficult to fit a single model to different areas. Therefore, prediction models have been developed based on time series analysis of the reported data, where the model parameters are estimated from previous observations of the confirmed cases. Data-driven estimation methods using long short-term memory (LSTM) networks and curve fitting are used to predict COVID-19 cases for India 30 days ahead [20]. One-month prediction of COVID-19 deceased cases is developed using a hybrid model combining discrete wavelet decomposition and autoregressive integrated moving average (ARIMA) models [21]. In [22], an ARIMA model is used to predict the daily number of COVID-19 cases up to 4 weeks ahead. A mathematical model for prediction has also been developed using power series polynomials, where the coefficients of the polynomials are obtained using least squares approximation.
Generally, it is difficult to predict over the long term because of the uncertainties involved in the government decision-making process, the social distancing measures followed by the people, and the availability of medical resources. There is also a time lag between the time of infection and the time the infection is reported. Short-term prediction therefore carries less uncertainty than long-term prediction. A good general prediction model should use as little information about the local state as possible.
In this paper, a short-term prediction model is proposed based on the concept of the SIR model, where the parameters are updated continuously from the observations using weighted least squares techniques. It is observed that the spread of COVID-19 and the recovery and deceased rates can be approximated using time-varying polynomials. The proposed model is validated by comparing the predicted and actual values for India in June over different prediction windows of 7, 14, 21, and 28 days. For the 28-day window, the prediction error is found to be 6.40 % ± 9.26 % for total cases and 10.20 % ± 24.14 % for active cases.
The rest of the paper is organized as follows. The basic model is described in Section 2. The model parameters are estimated in Section 3. The proposed model is validated with COVID-19 statistics of India in Section 5. Expected total and active cases for a month are predicted in Section 7. The prediction results are discussed and concluded in the Discussions and Conclusions sections, respectively.
2. Model formulation
The total number of people who have been infected with the virus can be classified as active, recovered or deceased cases. A few cases will go unreported because the infected person does not develop symptoms. The different categories of cases are shown in Fig. 1. We consider the population as Susceptible (S), Infectious (I), and Removed (as shown in Fig. 2). Infectious cases are divided into two categories, active cases (A) and active unreported cases (Aur). Similarly, recovered and deceased cases are classified as reported or unreported. So, if N is the total population, then N = S + A + Aur + R + Rur + D + Dur,
where R and Rur are the reported and unreported recovered cases, respectively, and D and Dur are the reported and unreported deceased cases, respectively. Following the standard SIR model, β denotes the infection rate; γ(t) and µ(t) are the recovery and death rates among the reported cases; and γur and µur are the recovery and death rates among the unreported cases.
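The rate equations of this section are not reproduced in the text. As a sketch only, one set of relations consistent with the definitions of S, A, Aur, β, γ(t), µ(t), γur and µur is given below. It assumes the standard SIR mass-action infection term with I = A + Aur; the form used in the original model may differ, in particular in how new infections are divided between reported and unreported active cases.

```latex
\begin{align}
\frac{dS}{dt} &= -\frac{\beta\, S\,(A + A_{ur})}{N}, \\
\frac{dR}{dt} &= \gamma(t)\,A, \qquad
\frac{dD}{dt} = \mu(t)\,A, \\
\frac{dR_{ur}}{dt} &= \gamma_{ur}\,A_{ur}, \qquad
\frac{dD_{ur}}{dt} = \mu_{ur}\,A_{ur}.
\end{align}
```

Under these assumptions, and with N held constant, the total infectious population A + Aur grows at the rate at which S shrinks, minus the four removal terms above.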
3. Parameter estimation
The parameters of the above model are estimated from the observed data and dynamically adjusted using the daily observations. The active, recovered and deceased counts for the reported cases are readily available, whereas the corresponding statistics for the unreported cases need to be estimated from random antibody tests or serological tests. The confirmed active cases on each day are obtained as A = T - R - D, where T is the total number of reported infected persons, so that T = A + R + D.
From the previously reported daily statistics of new infected cases, recovered cases and deceased cases, the rates of growth of the total, recovered and deceased cases are obtained. The rate of growth of total cases is obtained by taking the derivative of T. T can be represented by different functions. In the case of India, it is observed that T can be approximated using a fourth-order polynomial in time t with coefficients (aT, bT, cT, dT, eT). These coefficients are obtained using the weighted least squares method by minimizing the error between the predicted values and the observed daily values. The weighted least squares method is used to give more weight to recent values, which in turn reflect the status of lockdown and the social distancing measures followed by citizens.
The cost function is the weighted sum of squared errors between yt and T, where yt is the observed cumulative total cases, T is expressed as T = Xζ, and ζ is the vector of polynomial coefficients. The weights (wt) are selected according to the age of the observations: 1 (last 7 days), 0.9 (last 7-14 days), 0.8 (last 14-21 days), 0.5 (last 21-28 days) and 0.2 for earlier days.
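For reference, the fourth-order polynomial and the weighted least squares cost described above can be written out as follows. This is a sketch: the assignment of (aT, ..., eT) to descending powers of t and the summation over n observed days are assumptions, since the original equations are not reproduced in the text.

```latex
\begin{align}
T(t) &= a_T t^4 + b_T t^3 + c_T t^2 + d_T t + e_T, \\
J(\zeta) &= \sum_{t=1}^{n} w_t \,\bigl(y_t - T(t)\bigr)^2
          = (y - X\zeta)^{\top} W \,(y - X\zeta),
\qquad \zeta = (a_T, b_T, c_T, d_T, e_T)^{\top}.
\end{align}
```

Here row t of X is (t^4, t^3, t^2, t, 1), y stacks the observations yt, and W = diag(w_1, ..., w_n).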
The weights need to be selected appropriately for each area, considering its lockdown and unlock periods. The optimum values of ζ are calculated from the cost function (Equation 14) as ζ = (X^T W X)^(-1) X^T W y, where W is the diagonal matrix consisting of the weights (wt) and y is the vector of observations yt. It is observed that the reported recovered and deceased cases can be approximated using fourth-order and second-order polynomials, respectively.
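The weighted least squares fit described in this section can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names `recency_weights` and `fit_wls`, the rescaling of time to [0, 1), and the synthetic series are all assumptions introduced here. The weight boundaries (7, 14, 21, 28 days) follow the schedule stated in the text.

```python
import numpy as np

def recency_weights(n_obs):
    """Weight per observation, oldest first, following the schedule in the text."""
    age = np.arange(n_obs)[::-1]      # age 0 is the most recent day
    weights = np.full(n_obs, 0.2)     # observations 28 or more days old
    weights[age < 28] = 0.5
    weights[age < 21] = 0.8
    weights[age < 14] = 0.9
    weights[age < 7] = 1.0
    return weights

def fit_wls(t, y, weights, degree=4):
    """Return zeta minimizing sum_t w_t * (y_t - (X zeta)_t)^2."""
    X = np.vander(t, degree + 1)      # columns t^degree, ..., t, 1
    sqrt_w = np.sqrt(weights)
    # Least squares on sqrt(W) X and sqrt(W) y gives the same zeta as the
    # closed form (X^T W X)^(-1) X^T W y without forming the inverse.
    zeta, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return zeta

# Synthetic, noise-free cumulative series (illustrative only, not real data).
n_days = 60
t = np.arange(n_days) / n_days        # rescaled time (assumption, for conditioning)
true_zeta = np.array([5.0, -3.0, 2.0, 10.0, 1.0])
y = np.polyval(true_zeta, t)

zeta = fit_wls(t, y, recency_weights(n_days))
next_day = np.polyval(zeta, n_days / n_days)   # one-day-ahead extrapolation
```

Solving the square-root-weighted least squares problem is numerically safer than inverting X^T W X directly, which matters for a fourth-order polynomial over several months of daily data.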
4. Prediction
Prediction values are updated based on the current observations. The data of total cases, recovered cases, and deceased cases for India from 14th March to 3rd July are considered in the analysis. The data are collected from the “https://api.covid19india.org/” website, where the daily COVID-19 updates of the various state governments and the central government of India are recorded. The daily variation of total cases, cumulative recovered cases, cumulative deceased cases, and corresponding active cases is plotted in Figure 3a, Figure 3b, Figure 3c, and Figure 3d, respectively. Based on the observations till 3rd July, the prediction for the next 14 days is shown in Figure 4a and Figure 4b. The variation of the coefficients of the prediction curve of the total cases is tabulated in Table 1.
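The prediction step can be illustrated as follows: fit the total, recovered and deceased series separately, extrapolate each polynomial, and derive the active cases as total minus recovered minus deceased. This is a minimal sketch under stated assumptions, not the authors' code. The function names, the time rescaling and the synthetic series are hypothetical; the polynomial orders (4, 4, 2) and the weight schedule are taken from Section 3. Note that `numpy.polyfit` applies its `w` argument to the unsquared residuals, so the square root of the weights is passed.

```python
import numpy as np

def recency_weights(n_obs):
    """Weights 1, 0.9, 0.8, 0.5, 0.2 by age of the observation, oldest first."""
    age = np.arange(n_obs)[::-1]
    weights = np.full(n_obs, 0.2)
    for limit, value in ((28, 0.5), (21, 0.8), (14, 0.9), (7, 1.0)):
        weights[age < limit] = value
    return weights

def forecast(total, recovered, deceased, horizon):
    """Extrapolate total and active cases `horizon` days past the last observation."""
    n = len(total)
    t = np.arange(n) / n                                   # rescaled time (assumption)
    t_future = (n - 1 + np.arange(1, horizon + 1)) / n
    sqrt_w = np.sqrt(recency_weights(n))                   # polyfit squares its weights

    def extrapolate(series, degree):
        coeffs = np.polyfit(t, series, degree, w=sqrt_w)
        return np.polyval(coeffs, t_future)

    total_pred = extrapolate(total, 4)
    recovered_pred = extrapolate(recovered, 4)
    deceased_pred = extrapolate(deceased, 2)
    active_pred = total_pred - recovered_pred - deceased_pred
    return total_pred, active_pred

# Synthetic cumulative series (illustrative only, not the covid19india.org data).
days = np.arange(90)
total = 1000.0 + 40.0 * days + 1.5 * days**2
recovered = 0.4 * total
deceased = 20.0 + 0.03 * days**2

total_pred, active_pred = forecast(total, recovered, deceased, horizon=14)
```

Re-running `forecast` each day on the series extended by the newest observation gives the daily update of the prediction described above.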
5. Model Validation
5.1. Total case prediction validation
To check the efficiency of the prediction algorithm, the predicted value is compared with the actual value. The predictions made 7, 14, 21 and 28 days earlier are considered for comparison. The predictions from June 1st to July 3rd are considered for detailed analysis. The total case predictions over a 7-day time window and the corresponding actual values are shown in Table 2. On 26th June, the total case prediction for 3rd July was 611130, and the actual value reported is 627065. So, the error in prediction is 2.54 % of the actual value of 627065. From Table 2, over the complete duration of June 1st to July 3rd, the error between the actual and predicted values is 0.73 % ± 1.97 % of the actual value.
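The error measure used in this validation can be sketched as below. Several points are assumptions rather than statements from the text: the error is taken as signed (actual minus predicted) relative to the actual value, the "±" figure is read as a standard deviation over the evaluation days, and the 26th June prediction is taken to be 611130, the value consistent with the stated 2.54 %. The function names are hypothetical.

```python
import numpy as np

def percent_error(predicted, actual):
    """Signed prediction error as a percentage of the actual value."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return 100.0 * (actual - predicted) / actual

def mean_and_spread(errors):
    """Mean and population standard deviation of a set of percentage errors."""
    errors = np.asarray(errors, dtype=float)
    return float(np.mean(errors)), float(np.std(errors))

# Single worked example from the text: prediction made on 26th June for 3rd July.
error_3_july = percent_error(611130, 627065)
print(f"{error_3_july:.2f} %")
```

Applying `mean_and_spread` to the daily errors of one prediction window would give a summary in the "mean ± spread" form reported in the tables, under the assumptions above.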
The predicted and actual values over a 14-day prediction window from June 1st are tabulated in Table 3. As per Table 3, for the 14-day prediction the error between the actual and the predicted value is found to be 1.92 % ± 2.95 % of the actual value. Similarly, predictions over the 21-day and 28-day time windows are tabulated in Table 4 and Table 5. The error between the predicted value and the actual value over 21 days and 28 days is 4.34 % ± 3.91 % and 6.40 % ± 9.26 % of the actual value, respectively.
The error in prediction from mid-April to July 3rd is plotted in Figure 5a. In Figure 5a, the magenta points show the value predicted 7 days earlier and the black points show the value predicted on that day. On each day, the gap between the black and magenta points is the difference between the predicted and the actual value. Similar plots for the prediction windows of 14 days, 21 days, and 28 days are shown in Figure 5b, Figure 5c, and Figure 5d, respectively.
5.2. Active case prediction validation
The predicted active cases are compared with the actual active cases. The predicted values and the errors in prediction of active cases over 7-day, 14-day, 21-day, and 28-day windows are tabulated in Table 6, Table 7, Table 8, and Table 9, respectively. The difference between the predicted and the actual active cases is found to be 1.24 % ± 6.57 %, 3.04 % ± 10.00 %, 6.33 % ± 16.12 %, and 10.20 % ± 24.14 % for the prediction windows of 7 days, 14 days, 21 days, and 28 days, respectively. The error in active case prediction from mid-April to July 3rd over the different time windows is plotted in Figure 6a, Figure 6b, Figure 6c, and Figure 6d.
6. Discussions
The prediction is good over smaller time windows, and the error between the actual and predicted cases grows with larger prediction windows. The prediction errors from June 1st to 3rd July are tabulated in Table 10. The prediction is better for total cases than for active cases. The higher error bound for active cases could be due to frequent changes in discharge policy by the various state governments to manage the growth of active patients in hospitals, which in turn affected the recovery rates. Also, deceased cases are sometimes adjusted in a single day after detailed accounting, which causes a sharp jump in the deceased curve. Discharge policies and the reporting of deceased cases have thus caused large day-to-day variation in the actual value of active cases. However, the reporting of total cases is smooth (Figure 3a); therefore, the bound of the prediction error is also smaller.
7. Future prediction
Future predictions of total cases for India from the proposed algorithm over different time windows are provided in Table 11, Table 12, Table 13, and Table 14. We will compare these tables with the actual values in future to further validate our algorithm. Similar predictions for the active cases are tabulated in Table 15, Table 16, Table 17, and Table 18. The prediction for the next 28 days based on the up-to-date observations is also included in Table 19 for reference.
8. Conclusions
In this paper, a data-driven prediction model is proposed to predict the total cases and active cases of an area. The model can be applied to a given area. It is observed that the error between the actual value and the predicted value is 4.34 % ± 3.91 % and 6.33 % ± 16.12 % for total cases and active cases, respectively, over the prediction window of 21 days. We have also provided the prediction for the next 28 days from today for further validation of the proposed algorithm. The short-term prediction can be used in the allocation of scarce medical resources among different units and in optimal lockdown planning.
Data Availability
All data are available from open sources.
Funding
This work was not funded by any funding agency.
Conflicts of interest
Authors have no conflict of interest.
ACKNOWLEDGEMENTS
The authors would like to thank Mayur Shewale and GCDSL lab members for their suggestions in data processing and analysis of results.