ARIMA Modelling & Forecasting of COVID-19 in Top Five Affected-Sahai A. K DKK
Article info

Article history:
Received 14 July 2020
Received in revised form 21 July 2020
Accepted 22 July 2020

Keywords: COVID-19; SARS-CoV-2; Pandemic; ARIMA; Forecasting

Abstract

Background and aims: In a little over six months, the Corona virus epidemic has affected over ten million and killed over half a million people worldwide as on June 30, 2020. With no vaccine in sight, the spread of the virus is likely to continue unabated. This article aims to analyze the time series data for the top five countries affected by COVID-19 in order to forecast the spread of the epidemic.
Material and methods: Daily time series data from 15th February to June 30, 2020 of total infected cases from the top five countries, namely US, Brazil, India, Russia and Spain, were collected from the online database. ARIMA model specifications were estimated using the Hannan and Rissanen algorithm. An out-of-sample forecast for the next 77 days was computed using the ARIMA models.
Results: The forecast for the first 18 days of July was compared with the actual data, and the forecast accuracy, measured using MAD and MAPE, was found to be within acceptable agreement. The graphic plots of forecast data suggest that while Russia and Spain have reached the inflexion point in the spread of the epidemic, the US, Brazil and India are still experiencing an exponential curve.
Conclusion: Our analysis shows that India and Brazil will hit the 1.38 million and 2.47 million mark while the US will reach the 4.29 million mark by 31st July. With no effective cure available at the moment, this forecast will help the governments to be better prepared to combat the epidemic by ramping up their healthcare facilities.
© 2020 Diabetes India. Published by Elsevier Ltd. All rights reserved.
* Corresponding author.
E-mail address: [email protected] (A.K. Sahai).
https://doi.org/10.1016/j.dsx.2020.07.042
1871-4021/© 2020 Diabetes India. Published by Elsevier Ltd. All rights reserved.
A.K. Sahai et al. / Diabetes & Metabolic Syndrome: Clinical Research & Reviews 14 (2020) 1419–1427

The Corona pandemic, which originated in Wuhan, China in December 2019, has spread to the whole world and in six months has caused unprecedented havoc. This extremely virulent strain of corona virus is highly contagious and has already caused over 10,101,998 cases worldwide and claimed 501,644 lives within seven months [1]. The earlier corona viruses, namely SARS and MERS, were not as contagious and persistent as the 2019-nCoV, or COVID-19 as it has come to be known. The confusion and lack of transparency in the initial stages of the outbreak only worsened the situation, and today 185 countries are suffering from the virus with no cure in sight. The virus in its current form is highly contagious and causes death due to respiratory failure. Owing to differences in epidemiological conditions and testing facilities, the spread of the virus has varied across countries. The worst affected are developed countries like Spain, Italy, France, Germany and the US. Today, the US tops the list, followed by Brazil, Russia, India and Spain respectively [1].

In India, the first case of COVID-19 was reported on January 30, 2020 and the spread in India was initially extremely slow. As the severity of the viral infection became known, the Government of India resorted to a complete lockdown to contain the spread of the virus. The first lockdown was announced on 25th March and was extended gradually till the end of May. Owing to the all-round collapse of industry and the miseries of daily wage earners and migrant labourers, the government decided to lift the lockdown in a phased manner from June 2, 2020. Migrant labourers from the two hotspots of New Delhi and Mumbai migrated to their home states, and this large-scale export of the corona virus resulted in an explosion in the number of cases. Slowly India has entered the top ten countries affected by COVID-19 and today is the third most badly affected country in the world [1].

Despite the claims of the government of increased medical and testing facilities, the number of affected cases is not flattening or abating. The number of new patients is reaching 20,000 per day and many concerns are looming over the spread of COVID-19. How many people will be infected tomorrow? How many
deaths will happen tomorrow? When will the infection curve reach inflexion or get flattened? How many people will be affected during the peak period of the outbreak? Are there mathematical models available to answer these questions? Under the circumstances, it is very important to estimate the spread of COVID-19 so that policymakers, medical personnel and the general public can be better prepared to deal with the emergency.

In this paper, we have employed the Auto Regressive Integrated Moving Average (ARIMA) model to predict the incidence and spread of COVID-19 in India, Russia, Brazil, Spain and the US, the five most badly hit countries [1]. Compared with other econometric models, ARIMA models have been used with success in the prediction of several diseases [2–7].

1. Literature review

The past two decades have seen research focused on statistical issues pertaining to the prospective detection of outbreaks of infectious diseases. The challenges lie in early detection and in anticipating the possible evolution of the epidemic so that appropriate preventive measures can be taken. This rapidly growing area is called biosurveillance [8,9].

An early regression method for outbreak detection was presented by Shewhart [10]. Assuming a normally distributed incidence of infected cases, the method tested whether the count exceeded the mean by a certain multiple of the standard deviation. However, with epidemics the normal distribution is no longer valid, and most epidemics show an exponential distribution or a highly skewed bell curve. Ref. [11] used a simple regression model which computes the expected number of cases at month t as the mean count over months t-1, t and t+1 over a specified number of years. Regression models are used to detect the onset of influenza epidemics [12,13]. When events are infrequent, normal-errors regression models are inadequate and Poisson regression models have been used [14,15]. Unlike the parametric regression models described so far, semi-parametric models can be used to create a baseline model, as used in monitoring mortality and other related effects. A smoothing method to obtain the baseline and standard deviations while working with Salmonella outbreaks was used by Ref. [16,17]. Most regression-based models define a mean at time t and issue an alarm at t if the observed value lies above a certain threshold predetermined by the sample statistics and the quantiles of a suitable normal or Poisson distribution. Ref. [18] described non-thresholding regression methods which test the hypothesis that a given value y_t at time t belongs to the same distribution as the baseline distribution.

The regression techniques do not capture the correlation structure of the data. Time series methods offer this advantage over the regression methods. Syndromic and laboratory data collected with daily or weekly frequency are generally autocorrelated at some lags. They may further exhibit correlations associated with seasonal patterns in the data arising out of weekly or yearly seasonality. Failure to account properly for the autocorrelation in the time series data leads to misspecified models and incorrect forecasts. The Box-Jenkins model is designed to take the autocorrelation of the time series into account.

With outbreak surveillance, the trend is best estimated through a relatively simple procedure. A Serfling model [19] based on trigonometric functions may be used to estimate the trend and seasonal components for time series data with regular seasonality. Simple exponential smoothing [20,21] and the Holt-Winters procedure are employed in surveillance studies. Simple exponential smoothing makes predictions by taking a weighted average of past observations, with the weights decreasing the farther back we go, so that more recent data receive higher weight. The Holt-Winters procedure is a variant of simple exponential smoothing which allows for local trend and seasonality. This method has been used with success in many surveillance studies and has done better than other forecasting methods [22].

Auto Regressive Integrated Moving Average (ARIMA) models [23] have been widely used for detecting outbreaks of infectious diseases [24–27]. Stationarity of the time series is a prerequisite for fitting an ARIMA model. An investigation of ARIMA modelling showed that it was unable to model eight out of 17 syndromic time series resulting from sparse data [28]. However, for the series which were successfully modelled, one-step-ahead forecasts were highly acceptable and forecasts up to 3 years into the future were obtained by continuously updated models. The traditional ARIMA models require a fairly large number of parameters for the autocorrelation to be detected. Further, a model for one syndrome or outbreak cannot be automatically applied to another, and the model has to be identified each time. For shorter time series, it is prudent to use a hierarchical time series model. It is claimed that the hierarchical time series model can detect outbreaks faster than the lab-based exceedance system [29].

The ARIMA model has seen widespread usage in the study of infectious diseases for several time series events. These include leptospirosis and its relationship with rainfall and temperature [5] and the relationship of suicide cases with changes in national alcohol policies [30], among others. Time series modelling of infectious diseases, especially COVID-19, has been reported by several researchers [4,7,31–38].

Table 1
ARIMA model specifications.

                INDIA      BRAZIL     RUSSIA     SPAIN      US
MODEL           (4,2,4)    (3,1,2)    (3,0,0)    (4,2,4)    (1,2,1)
AR1             1.375893   1.38224    1.821833   0.1032     0.224382
  Std Err       0.13892    0.232872   0.080456   0.24914    0.199694
  T             9.90423    5.93562    22.64392   0.41423    1.12363
  P             0.00000    0.00000    0.00000    0.67941    0.26321
AR2             1.02318    0.83395    0.647      0.265827   -
  Std Err       0.112545   0.328354   0.161136   0.201803   -
  T             9.09125    2.53978    4.01521    1.31726    -
  P             0.00000    0.01227    0.00010    0.19014    -
AR3             1.369517   0.448294   0.17526    0.43591    -
  Std Err       0.105984   0.12919    0.080759   0.125818   -
  T             12.92191   3.47003    2.17015    3.46459    -
  P             0.00000    0.00071    0.03177    0.00073    -
AR4             0.72424    -          -          0.63051    -
  Std Err       0.132004   -          -          0.160281   -
  T             5.48652    -          -          3.93377    -
  P             0.00000    -          -          0.00014    -
MA1             1.794408   0.824621   -          0.380611   0.594318
  Std Err       -          0.24318    -          -          0.165616
  T             -          3.39099    -          -          3.58854
  P             -          0.00092    -          -          0.00047
MA2             1.50841    0.11077    -          0.150048   -
  Std Err       -          0.23289    -          -          -
  T             -          0.47564    -          -          -
  P             -          0.63513    -          -          -
MA3             1.591911   -          -          0.37507    -
MA4             0.89826    -          -          0.4913     -
AIC             2128.399   2689.998   2077.294   2179.333   2624.271
SBC             2154.546   2707.474   2088.974   2205.481   2632.987
Log Likelihood  1055.2     1339       1034.65    1080.67    1309.14

Source: Authors' own calculation.

(continued)
DATE        LOWER CI   FORECAST   UPPER CI   STD_ERR
08/31/2020  2,781,469  3,464,810  4,148,151  348649.7
09/01/2020  2,797,771  3,496,107  4,194,443  356300.6
09/02/2020  2,813,935  3,527,351  4,240,767  363994.6
09/03/2020  2,829,962  3,558,541  4,287,121  371731.1
09/04/2020  2,845,853  3,589,678  4,333,503  379509.5
09/05/2020  2,861,611  3,620,762  4,379,914  387329.3
09/06/2020  2,877,236  3,651,793  4,426,351  395189.8
09/07/2020  2,892,729  3,682,772  4,472,814  403090.4
09/08/2020  2,908,092  3,713,697  4,519,303  411030.7
09/09/2020  2,923,326  3,744,571  4,565,815  419010
09/10/2020  2,938,433  3,775,392  4,612,351  427027.9
09/11/2020  2,953,412  3,806,161  4,658,909  435083.7
09/12/2020  2,968,267  3,836,878  4,705,489  443177
09/13/2020  2,982,997  3,867,543  4,752,089  451307.2
09/14/2020  2,997,604  3,898,157  4,798,709  459473.9
09/15/2020  3,012,090  3,928,719  4,845,348  467676.6

Source: Authors' own computation.

Table 4
Forecast and 95% confidence interval for COVID-19 outbreak in Russia.

DATE        LOWER CI   FORECAST   UPPER CI   STD_ERR
07/01/2020  653,535    654,393    655,251    437.7359
07/02/2020  659,024    660,807    662,590    909.7199
07/03/2020  664,181    667,085    669,990    1481.792
07/04/2020  669,040    673,227    677,413    2135.867
07/05/2020  673,619    679,229    684,840    2862.586
07/06/2020  677,928    685,091    692,254    3654.641
07/07/2020  681,978    690,810    699,642    4506.393
07/08/2020  685,775    696,385    706,994    5413.246
07/09/2020  689,326    701,813    714,301    6371.346
07/10/2020  692,635    707,095    721,554    7377.383
07/11/2020  695,707    712,226    728,746    8428.454
07/12/2020  698,545    717,208    735,870    9521.978
07/13/2020  701,152    722,037    742,921    10655.63
07/14/2020  703,531    726,712    749,893    11827.27
07/15/2020  705,684    731,232    756,781    13034.97
07/16/2020  707,615    735,597    763,579    14276.89
07/17/2020  709,323    739,804    770,284    15551.35
07/18/2020  710,813    743,852    776,890    16856.75
07/19/2020  712,085    747,740    783,395    18191.59
07/20/2020  713,142    751,468    789,794    19554.42
07/21/2020  713,985    755,034    796,083    20943.87
07/22/2020  714,615    758,438    802,260    22358.65
07/23/2020  715,035    761,677    808,320    23797.49
07/24/2020  715,246    764,753    814,260    25259.17
07/25/2020  715,249    767,663    820,078    26742.52
07/26/2020  715,046    770,408    825,770    28246.4
07/27/2020  714,639    772,986    831,334    29769.71
07/28/2020  714,029    775,398    836,767    31311.38
07/29/2020  713,217    777,642    842,066    32870.36
07/30/2020  712,205    779,718    847,230    34445.63
07/31/2020  710,996    781,625    852,255    36036.18
08/01/2020  709,589    783,365    857,140    37641.05
08/02/2020  707,988    784,935    861,882    39259.28
08/03/2020  706,193    786,336    866,479    40889.93
08/04/2020  704,207    787,568    870,929    42532.08
08/05/2020  702,030    788,631    875,232    44184.84
08/06/2020  699,665    789,524    879,384    45847.3
08/07/2020  697,114    790,249    883,384    47518.61
08/08/2020  694,378    790,804    887,230    49197.9
08/09/2020  691,459    791,190    890,922    50884.34
08/10/2020  688,359    791,408    894,457    52577.09
08/11/2020  685,079    791,457    897,835    54275.33
08/12/2020  681,623    791,338    901,053    55978.26
08/13/2020  677,991    791,051    904,112    57685.09
08/14/2020  674,185    790,597    907,010    59395.04
08/15/2020  670,209    789,977    909,745    61107.34
08/16/2020  666,063    789,190    912,318    62821.23
08/17/2020  661,750    788,238    914,726    64535.97
08/18/2020  657,272    787,121    916,970    66250.81
08/19/2020  652,631    785,840    919,049    67965.05
08/20/2020  647,830    784,396    920,963    69677.95
08/21/2020  642,870    782,790    922,710    71388.83
08/22/2020  637,755    781,022    924,290    73096.99
08/23/2020  632,485    779,094    925,703    74801.74
08/24/2020  627,065    777,007    926,949    76502.43
08/25/2020  621,495    774,761    928,027    78198.38
08/26/2020  615,778    772,358    928,937    79888.95
08/27/2020  609,918    769,799    929,680    81573.5
08/28/2020  603,916    767,086    930,255    83251.4
08/29/2020  597,775    764,219    930,663    84922.04
08/30/2020  591,497    761,200    930,903    86584.81
08/31/2020  585,085    758,030    930,976    88239.11
09/01/2020  578,542    754,712    930,882    89884.37
09/02/2020  571,870    751,246    930,621    91519.99
09/03/2020  565,072    747,633    930,195    93145.44
09/04/2020  558,150    743,877    929,603    94760.14
09/05/2020  551,108    739,977    928,846    96363.57
09/06/2020  543,948    735,937    927,925    97955.19
09/07/2020  536,673    731,757    926,841    99534.49
09/08/2020  529,285    727,439    925,593    101101
09/09/2020  521,788    722,986    924,184    102654.1
09/10/2020  514,184    718,399    922,615    104193.4
09/11/2020  506,476    713,680    920,885    105718.5
09/12/2020  498,667    708,832    918,996    107228.8
09/13/2020  490,760    703,855    916,950    108724
09/14/2020  482,758    698,753    914,748    110203.5
09/15/2020  474,664    693,527    912,390    111667

Source: Authors' own computation.

2. Methodology

COVID-19 daily data of all reported cases were taken from the Worldometers website (worldometers.info/coronavirus/#countries). Data for India was of primary interest, but data for the two countries above and the two countries below India in the severity of the epidemic were also studied to allow a comparison of the epidemic and to investigate the onset of flattening of the curve. Daily data from 15 February to June 30, 2020 was collected and analysed separately for each country. We used data up to 30th June for modelling, and then a 77-day out-of-sample forecast was made based on the ARIMA models fitted to the data. Actual data from 1st to 7th July was used to compute the accuracy and forecast error.

2.1. Box Jenkins procedure

Box and Jenkins (1971) popularised a method which combines both autoregressive (AR) and moving average (MA) models. An ARMA(p,q) model is a combination of AR(p) and MA(q) models and is best used for univariate time series modelling. In an AR(p) model the future value of a variable is assumed to depend upon a linear combination of p past observations and a random error term. Mathematically, an AR(p) model can be expressed as follows:

$$ y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \phi_3 y_{t-3} + \phi_4 y_{t-4} + \dots + \phi_p y_{t-p} + \varepsilon_t $$

Here $y_t$ and $\varepsilon_t$ are the actual value and the error term at time period t, $\phi_i$ ($i = 1, 2, 3, 4, \dots, p$) are model parameters and c is a constant. The integer p is known as the order of the model. Unlike an AR(p) model, an MA(q) model uses past errors as explanatory variables. The MA(q) model is given below:

$$ y_t = \mu + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \theta_3 \varepsilon_{t-3} + \theta_4 \varepsilon_{t-4} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t $$
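As a minimal sketch of the AR(p) and MA(q) forms given in Section 2.1, the following Python functions compute the expected next value of each model with the current error term set to its mean of zero. The function names, argument order and the sample coefficients are illustrative assumptions; they are not the fitted models of this study, whose parameters were estimated with the Hannan and Rissanen algorithm.

```python
from typing import Sequence


def ar_one_step(c: float, phi: Sequence[float], history: Sequence[float]) -> float:
    """Expected next value of an AR(p) model, with the error term set to 0.

    phi[0] multiplies the most recent observation y_{t-1},
    phi[1] multiplies y_{t-2}, and so on up to y_{t-p}.
    """
    p = len(phi)
    if len(history) < p:
        raise ValueError("need at least p past observations")
    recent = list(history)[::-1][:p]  # y_{t-1}, y_{t-2}, ..., y_{t-p}
    return c + sum(coef * y for coef, y in zip(phi, recent))


def ma_one_step(mu: float, theta: Sequence[float], past_errors: Sequence[float]) -> float:
    """Expected next value of an MA(q) model, with the current error set to 0.

    theta[0] multiplies the most recent past error e_{t-1},
    theta[1] multiplies e_{t-2}, and so on up to e_{t-q}.
    """
    q = len(theta)
    if len(past_errors) < q:
        raise ValueError("need at least q past errors")
    recent = list(past_errors)[::-1][:q]  # e_{t-1}, e_{t-2}, ..., e_{t-q}
    return mu + sum(coef * e for coef, e in zip(theta, recent))


# Hypothetical AR(2): y_t = 1.0 + 0.5 * y_{t-1} + 0.25 * y_{t-2}
ar_value = ar_one_step(1.0, [0.5, 0.25], [10.0, 20.0])

# Hypothetical MA(1): y_t = 2.0 + 0.5 * e_{t-1}
ma_value = ma_one_step(2.0, [0.5], [1.0, -2.0])
```

An ARMA(p,q) prediction adds the two linear combinations under a single constant, and the "I" in ARIMA means the same recursion is applied to a differenced series rather than to the raw counts.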
plots with 95% confidence intervals are presented in Figs. 1–5. We compared the forecast with the actual data from 1st July to 18th July and checked the forecast efficiency using the mean absolute deviation (MAD) and the mean absolute percentage error (MAPE). The MAD was lowest for Spain followed by Russia, whereas India, Brazil and US exhibited increasing absolute deviations, indicating that actual values lean towards the upper bound of the forecast. In other words, the forecast indicated a worsening situation and a steepening of the case graph for India, Brazil and US in the days to come. A better measure of the forecast efficiency is the mean absolute percentage error (MAPE), which expresses the absolute deviations as a percentage of the actual numbers. Percentage numbers are easily compared to give a relative estimate of the severity of the spread across the countries under consideration. MAPE for India, Brazil and US was 3.701%, 1.844% and 2.885% respectively. It was lowest for Russia and Spain at 1.090% and 0.832%, indicating a very tight forecast accuracy. The smaller numbers for Russia and Spain further indicate that the forecast is following the linear trend established by the past data. Spain has even dropped out of the top five countries in the world. Even though the MAPE numbers for US, India and Brazil are all less than 4.0%, the relatively larger numbers indicate a trend which is steepening and leaning towards the upper bound of the forecast. The MAPE numbers validate the accuracy of the forecast. The results are presented in Table 6.
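The two accuracy measures discussed above can be sketched in a few lines of Python. The text does not spell out the formulas, so the standard definitions are assumed here: MAD as the mean of |actual - forecast|, and MAPE as the mean of |actual - forecast| / actual expressed in percent. The function names are illustrative, and the three sample pairs are copied from the forecast-accuracy table (the column pair with values near 297,000, assumed here to be Spain, 1 to 3 July).

```python
from typing import Sequence


def mad(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute deviation between actual and forecast values."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent of the actual values."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)


# Three actual/forecast pairs (assumed to be Spain, 1-3 July 2020).
sample_actual = [296_739, 297_183, 297_625]
sample_forecast = [296_504, 296_695, 296_853]

sample_mad = mad(sample_actual, sample_forecast)    # (235 + 488 + 772) / 3
sample_mape = mape(sample_actual, sample_forecast)  # a fraction of one percent
```

Because MAD is in units of cases, it grows with the size of the outbreak; MAPE divides by the actual count, which is why it is the fairer basis for comparing countries with very different case totals.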
Table 7
Forecast accuracy with mean absolute deviation (MAD) and mean absolute percentage error (MAPE).

           INDIA                 BRAZIL                RUSSIA                SPAIN                 US
DATE       ACTUAL     FORECAST   ACTUAL     FORECAST   ACTUAL     FORECAST   ACTUAL     FORECAST   ACTUAL     FORECAST
1-Jul-20   605,220    605,084    1,453,369  1,448,644  654,405    654,393    296,739    296,504    2,778,452  2,772,875
2-Jul-20   627,168    623,844    1,543,341  1,484,229  661,165    660,807    297,183    296,695    2,835,684  2,818,529
3-Jul-20   649,889    642,773    1,543,341  1,517,010  667,883    667,085    297,625    296,853    2,890,588  2,864,474
4-Jul-20   673,904    663,129    1,578,376  1,550,697  674,904    673,227    297,625    297,115    2,935,770  2,910,744
5-Jul-20   697,836    683,792    1,604,585  1,585,926  681,261    679,229    297,625    297,437    2,982,928  2,957,348
6-Jul-20   720,346    704,039    1,626,071  1,621,272  687,862    685,091    298,869    297,774    3,040,833  3,004,288
7-Jul-20   743,481    725,228    1,674,655  1,655,902  694,230    690,810    299,210    298,102    3,097,084  3,051,562
8-Jul-20   769,052    747,529    1,716,196  1,690,134  700,792    696,385    299,593    298,345    3,163,318  3,099,173
9-Jul-20   794,842    769,604    1,759,103  1,724,467  707,301    701,813    300,136    298,554    3,224,892  3,147,119
10-Jul-20  822,603    791,821    1,804,338  1,758,950  713,936    707,095    300,988    298,739    3,297,170  3,195,401
11-Jul-20  850,358    815,308    1,840,812  1,793,378  720,547    712,226    301,670    298,962    3,359,174  3,244,019
12-Jul-20  879,466    839,280    1,866,176  1,827,649  727,162    717,208    302,352    299,246    3,417,795  3,292,972
13-Jul-20  907,645    862,980    1,887,959  1,861,818  733,699    722,037    303,033    299,568    3,483,584  3,342,261
14-Jul-20  937,487    887,448    1,931,204  1,895,950  739,947    726,712    303,699    299,903    3,549,632  3,391,885
15-Jul-20  970,169    912,992    1,970,909  1,930,047  746,797    731,232    304,574    300,198    3,621,637  3,441,846
16-Jul-20  1,005,637  938,511    2,014,738  1,964,081  752,797    735,597    305,935    300,447    3,695,025  3,492,142
17-Jul-20  1,040,457  964,142    2,048,697  1,998,039  759,203    739,804    307,335    300,664    3,770,012  3,542,774
18-Jul-20  1,077,864  990,871    2,075,246  2,031,931  765,437    743,852    307,335    300,883    3,833,271  3,593,741