Chaos, Solitons and Fractals 142 (2021) 110336
Forecasting spread of COVID-19 using Google Trends: A hybrid GWO-deep learning approach

Sikakollu Prasanth a, Uttam Singh a, Arun Kumar a,∗, Vinay Anand Tikkiwal b, Peter H.J. Chong c
a National Institute of Technology, Rourkela 769008, India
b Jaypee Institute of Information Technology, Noida 201304, India
c Department of Electrical and Electronic Engineering, Auckland University of Technology 1010, New Zealand
ARTICLE INFO

Article history:
Received 17 August 2020
Accepted 1 October 2020
Available online 22 October 2020

Keywords:
COVID-19
Forecasting
Long short term memory (LSTM)
Deep learning
Pandemic
Grey wolf optimization (GWO)
Google trends
Optimization
Auto regressive integrated moving average (ARIMA)

ABSTRACT

The recent outbreak of COVID-19 has brought the entire world to a standstill. The rapid pace at which the virus has spread across the world is unprecedented. The sheer number of infected cases and fatalities in such a short period of time has overwhelmed medical facilities across the globe. The rapid pace of the spread of the novel coronavirus makes it imperative that its spread be forecasted well in advance in order to plan for eventualities. An accurate early forecast of the number of cases would certainly assist governments and various other organizations to strategize and prepare for the newly infected cases well in advance. In this work, a novel method of forecasting the future cases of infection, based on the study of data mined from the internet search terms of people in the affected region, is proposed. The study utilizes relevant Google Trends of specific search terms related to the COVID-19 pandemic along with European Centre for Disease Prevention and Control (ECDC) data on COVID-19 spread to forecast the future trends of daily new cases, cumulative cases and deaths for India, USA and UK. For this purpose, a hybrid GWO-LSTM model is developed, where the network parameters of the Long Short Term Memory (LSTM) network are optimized using the Grey Wolf Optimizer (GWO). The results of the proposed model are compared with baseline models including Auto Regressive Integrated Moving Average (ARIMA), and it is observed that the proposed model achieves much better results in forecasting the future trends of the spread of infection. Using the proposed hybrid GWO-LSTM model incorporating online big data from Google Trends, a reduction in Mean Absolute Percentage Error (MAPE) values for forecasting results to the extent of about 98% has been observed. Further, a reduction in MAPE by 74% for models incorporating Google Trends was observed, thus confirming the efficacy of utilizing public sentiment, in terms of search frequencies of relevant terms online, in forecasting pandemic numbers.

© 2020 Elsevier Ltd. All rights reserved.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
∗ Corresponding author.
E-mail addresses: [email protected] (A. Kumar), [email protected] (V.A. Tikkiwal), [email protected] (P.H.J. Chong).
https://doi.org/10.1016/j.chaos.2020.110336
0960-0779/© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

The recent outbreak of coronavirus, popularly known as COVID-19, took place in Wuhan city of Hubei province in China [1]. Though the first case of coronavirus was reported in December, several countries started reporting cases from late January. Since the details of this virus were not known initially, the spread has been very fast. COVID-19 has emerged as a global pandemic, as declared by the World Health Organization (WHO). It started as a case of pneumonia of unknown cause, and the virus has since spread to more than 150 countries in a short span of time. This virus has infected almost 16.5 million people [2] in the world.

The coronavirus infection has spread quickly all over the world, causing a huge number of deaths. It is an infectious disease that causes severe acute respiratory problems. The infected person initially shows mild symptoms of cold and fever, which worsen as time progresses. Body pain, nausea and high fever are also some of the characteristic symptoms of this infection [3,4]. It affects people of all age groups, the worst affected being people of higher age groups [5] who are suffering from at least one health ailment or have a history of respiratory disease.

An exponential rise in the number of COVID-19 cases has led to a crisis of shortage of medical equipment and healthcare personnel in many countries. ICU beds, ventilators and PPE kits for doctors and other health-care workers have seen a huge surge in
demand. Ramping up the supply of medical equipment in adequate numbers still remains a challenge. The rapid rise in COVID-19 infections necessitates early forecasting of the spread in order to assist governments and local authorities in planning necessary measures, including manpower and medical equipment deployment, among others.

The impact and fast spread of coronavirus have created a general fear among people all over the world, and there has been a rise in the number of internet searches related to COVID-19. People are learning about preventive measures and constantly searching the web for updates. Drawing on this aspect of human behaviour, different techniques are explored to design models that learn from the data mined from the searches of people belonging to a particular region. The intensity of the impact of the virus can be related to the search trends. Google Trends, obtained by processing multiple types of Google search results, reflects the public attention towards a particular search keyword [6] (Li, Ma, Wang, & Zhang, 2015). Google Trends represents the volume for a given search term, relative to the total number of searches on Google, on a scale of 0 to 100. Accordingly, Google Trends has been widely utilized as a form of big data covering large-scale information [7]. Given these implications, this study utilises Google Trends as a predictor for COVID-19 cases.

In this work, the future incidences of coronavirus are forecasted using both European Centre for Disease Prevention and Control (ECDC) data [8] and Google Trends (GT) data for three countries, namely USA, India and UK. The impact of people's internet browsing interest on the incidence of the pandemic is studied using Google search trends data. Highly correlated Google search terms are selected using Spearman's rank correlation between ECDC COVID-19 trends and GT data.

Several time-series forecasting techniques have been proposed in the literature. Traditional statistical techniques such as Autoregressive Integrated Moving Average (ARIMA) have been commonly used to forecast time-series values of variables across disciplines. However, these methods suffer from poor accuracy as the influence of external factors is not well captured. Recently, deep learning based LSTM has garnered significant attention for time series forecasting of various trends. LSTM has previously been employed to forecast weather [9], stock price movements [10,11], pandemics [12], solar irradiance [13,14] and atmospheric pollution levels [15]. LSTM networks have also been employed to predict the answers to questions [16], the next word [17], etc. The most important feature of an LSTM network is its capability to find time-series dependencies. Since the trend of coronavirus cases is time-series data, this work utilizes an LSTM-based forecasting model to forecast the future trends.

Hyperparameter tuning is an important task while designing any neural network-based model. Usually, a huge amount of time is spent on finding the optimal set of hyperparameters manually. In this work, we automate the process of hyperparameter tuning using a meta-heuristic search algorithm, namely Grey Wolf Optimization (GWO). The GWO algorithm has shown success in various optimization tasks such as optimal feature set selection [18], the node localization problem in wireless networks [19], Kernel ELM parameter tuning for bankruptcy prediction [20], etc. Also, the GWO algorithm has been reported to be superior [21] when compared to other meta-heuristic algorithms like GA, PSO, GSA, grid search, etc. The GWO algorithm, being simple, robust, and flexible, is proposed for hyperparameter tuning of LSTM networks.

The rest of the paper is organized as follows. Section 2 presents the related work done on forecasting the outbreak and spreading of epidemics. Section 3 provides a description of the datasets used in this study. The proposed workflow and methods adopted are presented in Section 4. Section 5 illustrates the experimental designs and the results obtained. A detailed comparative analysis of the results is presented in Section 6, while Section 7 concludes the work presented in the paper.

2. Related work

The recent outbreak of COVID-19, which has spread at an unprecedented rate, has led to a lot of research being conducted to predict the spread. Domenico et al. [22] proposed the use of ARIMA on the Johns Hopkins data to predict the epidemiological trend of the incidences of COVID-19. Singh et al. [23] employed ARIMA to forecast COVID-19 related confirmed cases, deaths, and recoveries. The ARIMA model was validated using the AIC values, which were around 20, 14, and 16 for cumulative confirmed cases, deaths, and recoveries from COVID-19, respectively. Ceylan et al. [24] exploited an ARIMA-based framework for predicting the coronavirus trends. The study was carried out for European countries like Spain, Italy and France. Kumar and Hembram [25] proposed a model which used the Logistic equation, the Weibull equation and the Hill equation to find infection rates and obtained the power index of the top ten highly infected countries. A recent work on coronavirus [26] proposes the use of supervised machine learning models to forecast the coronavirus trends. The researchers used four standard forecasting models, including Linear Regression, Support Vector Machines and Exponential Smoothing, to forecast the trends of recovery rate, deaths and daily new cases due to coronavirus. The best Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) values obtained for new cases and recovery rate are 8867.43 and 15322.11, and 1827.85 and 2443.48, respectively. The research carried out by Sahai et al. [27] used the ARIMA model to forecast the spread of coronavirus in the top five affected countries. They used the Hannan and Rissanen algorithm to find the parameters of the ARIMA model. Their study forecast that 1.39, 2.47 and 4.31 million people would be affected in India, Brazil and the United States of America (USA), respectively. Chintalapudi et al. [28] used seasonal ARIMA to forecast the registered and recovered cases after sixty days of lockdown in Italy. Accuracies of 93.75% and 84.4% were recorded for the registered and recovered cases, respectively. A reduced-space Gaussian process regression method was used in Velásquez and Lara [29] to forecast the spread of coronavirus in USA. They forecasted that the peak in cases would occur on 14th July for USA.

Vinay et al. [30] presented an LSTM framework for predicting the number of coronavirus cases in Canada. The authors also compared the results of Canada with transmission rates of USA and Italy. Tomar and Gupta [31] used LSTM to forecast the number of recovered cases, daily positive cases and deceased cases for India, thirty days in advance. The study also reported the effectiveness of preventive measures like social isolation and lockdown on the spread of COVID-19. Ibrahim et al. [32] proposed a variational LSTM autoencoder model to predict the global trends of coronavirus. The authors not only used the historical data of the case trends but also made use of some urban characteristics and government responses to the virus, including closing of workplaces and schools, cancellation of public events and closure of public transport. These features, along with the COVID cases data, were used to forecast the future incidence of coronavirus. The RMSE values obtained for the prediction results were 12722.61, 2712.82 and 271.38 for USA, UK and India, respectively.

In [33], the authors present an infectious disease prediction model using different input variables, selected based on the OLS method. The authors compared the performance of three models, i.e., ARIMA, DNN and LSTM trained with optimal parameters, to predict the future trends of three infectious diseases: chickenpox, malaria and scarlet fever. The results demonstrated that the neural network models gave more accurate predictions when compared to
ARIMA. For chickenpox, the DNN and LSTM models reported an increase of 24% and 19% in average performance, respectively.

Information extracted from social media platforms such as Twitter, Google Trends, blogs etc. can prove to be a useful source of information for forecasting pandemics. Previously, Lampos et al. [34] used thousands of tweets related to the flu and predicted a flu-score. This research was carried out in the UK for the H1N1 virus. The work showed a 95% correlation with the data from the health agency. Similar work has been presented by Signorini et al. [35] for tracking the H1N1 virus spread in USA. The authors developed an SVM-based model using the influenza data and data from tweets regarding the disease to predict the spread of the virus. The research provides a window of dates when the infection would reach a peak in the number of infected cases. Anggraeni and Aristiani [36] studied the usage of GT data in forecasting dengue fever in Indonesia. The research used the data from local hospitals on the number of cases of dengue fever and the Google search index to forecast the new cases using ARIMA. The ARIMA model using Google Trends achieved better accuracy than the plain ARIMA model, with a 3% decrease in MAPE value. Teng et al. [37] carried out experiments to dynamically forecast the spread of the Zika virus epidemic using GT data. The authors used an ARIMA model with GT data as an external regressor to improve the prediction. Effenberger et al. [38] carried out a correlation-based study of the new cases data and the 'Coronavirus' search term GT data. This work emphasized the increase in the search volume about the virus with the increase in the number of cases, which the authors inferred could be a result of public panic in the face of the fast-spreading pandemic.

Previously conducted research work indicates that Google Trends may be an important factor in forecasting a pandemic. However, it is important to identify more relevant search terms with respect to the forecasted variable during the period of spread of a pandemic. In this research, the main objective is to use the Google Trends search frequencies of more significant terms related to COVID-19 for effectively forecasting the incidences of infection spread using a GWO optimized LSTM model. Implementation of metaheuristics for an improved LSTM model through hyperparameter tuning in the pandemic domain has hardly been investigated. To the best of our knowledge, this is the first study to forecast the spread of COVID-19 through the use of optimized LSTM networks incorporating GT data.

3. Dataset description

3.1. ECDC data for different coronavirus trends

Since the outbreak of coronavirus and its spread to various countries, many different organizations including ECDC have maintained a count of the total number of infections, new daily infections, total deaths etc. due to coronavirus, to keep track of the spread of the epidemic. This data is available country-wise and region-wise. For this study, the data between February 24, 2020 and May 20, 2020 for India, USA and United Kingdom (UK) has been taken into consideration [8]. This dataset contains Total Cumulative Cases (TCC), New Cases (NC) and Total Cumulative Deaths (TCD) for the three countries.

3.2. Google trends data

During the outbreak of COVID-19, people all over the world began searching for different terms related to the pandemic like COVID-19, coronavirus, symptoms, sanitizer, etc. The Google Trends (GT) data for a given search term represent the interest of people in that particular search term on Google search. It is represented using percentage values relative to the time period considered. GT data for search terms are good indicators for forecasting the future coronavirus trends. We select countries that appear in the list of countries worst affected by the virus. USA is the worst affected country while UK has a large number of deaths. India has shown an exponential rise in cases in a span of a few weeks. Hence these three countries have been selected for our research. The duration of the research starts from 24th February as some of the countries reported the infection late. For example, the first confirmed case in India was reported on 30th January 2020, as mentioned in the ECDC dataset [8], and the next confirmed cases were reported after one week. A duration with zero cases or deaths would result in a loss in accuracy of the model. Hence the duration was chosen to have finite values.

In this work, nine search terms related to coronavirus that are most searched across the globe are considered. These search terms were obtained by comparing the terms using the compare function in Google Trends [7,38]. This allows mining of terms that have significant search frequency. This research deals with these search terms. GT data [39] for the nine search terms are downloaded for three different countries, India, USA and UK, for a duration of three months, i.e., from February 24, 2020 to May 20, 2020. The search terms are listed in Table 1.

Table 1
Google trends search terms.
S.No.  Search terms
1.     Coronavirus symptoms
2.     Coronavirus
3.     Covid
4.     Handwash
5.     Healthcenter
6.     Mask
7.     Positive cases
8.     Sanitizer
9.     Coronavirus Vaccine

4. Proposed methodology

The problem of forecasting future trends is formulated as a three-step procedure. The three steps are relationship investigation, forecasting model design and hyperparameter optimization. In step 1, the relationship between GT data and ECDC data is evaluated. Next, the top two search terms that have the strongest relationship are selected, and only those are considered as input data for further experiments. In step 2, the models are trained with ECDC data and GT data to forecast future coronavirus trends. In step 3, a nature-inspired metaheuristic algorithm, i.e., the GWO algorithm, is applied to find the optimal set of hyperparameters (window size, number of hidden layers and number of cells in each layer) to be considered for forecasting COVID-19 data. The complete proposed workflow is shown in Fig. 1, and the three steps are explained below in detail.

4.1. Relationship investigation and feature selection

In this step, the relationship between GT data ($Y_1, Y_2, \ldots, Y_n$) and the ECDC data (TCC, NC, TCD) is computed using a correlation-based method. To use the impact of internet searching in forecasting the trends of coronavirus, we make use of GT data of 90 days. Not all the search terms follow the same trend as that of coronavirus cases in a region. To find the optimal search terms, the correlation between the search term frequencies and the history of the cases is calculated. The terms with greater correlation values help in better forecasting. The coronavirus cases for TCC, NC, TCD have been collected for 90 days as shown for TCC = {
$x_1, x_2, \ldots, x_{90}$}. For any search term $Y_i$, the data consist of normalized search frequencies $Y_i = \{y_1, y_2, \ldots, y_{90}\}$. To obtain the correlation between the GT data and the ECDC data, two correlation coefficients are used, namely Pearson correlation and Spearman correlation. The Pearson correlation coefficient between two vectors, $x$ and $y$, is calculated after subtracting the mean of the corresponding data series. It can be viewed as the normalized dot product of two mean-subtracted vectors. The equation for computing the Pearson correlation is given in Eq. (1).

$$\gamma = \frac{\sum_i (x_i - \bar{x}) \cdot (y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \cdot \sum_i (y_i - \bar{y})^2}} \tag{1}$$

The drawback of Pearson correlation is that it can identify only the linear relationship between the two variables. But most of the data in the real world do not follow a straight line equation. Spearman rank correlation can efficiently identify a non-linear (monotonic) relationship between the variables. Thus, the Spearman rank correlation is employed to derive the relation between the two variables. Spearman correlation organises the data by assigning ranks ($rg$) to them. For two variables to have a maximum correlation, the difference between the ranks of each data point must be minimal. Spearman correlation is also defined as the Pearson correlation between the ranks of the data points belonging to two different data series. The formula for calculating the Spearman correlation coefficient is shown in Eq. (2).

$$\rho = \frac{\operatorname{cov}(rg_x, rg_y)}{\sigma_{rg_x} \cdot \sigma_{rg_y}} \tag{2}$$

Both Pearson and Spearman correlation coefficients range from −1 to +1. For the creation of the input dataset, the top two search terms with the highest correlation coefficient values are selected for further experiments.

The plots of the top two highly correlated Google Trends search terms and NC for the three countries are shown in Fig. 2. The graphs are plotted on a daily basis with NC on the primary Y-axis and the normalized search frequency of the Google Trends terms on the secondary Y-axis. NC and search frequency can be seen to have a non-linear correlation. Hence these plots support the idea of using a correlation measure that accounts for the non-linear relationship between the two trends.

Fig. 1. Proposed workflow for forecasting future COVID-19 trends.

Fig. 2. Plots depicting the relationship between the highly correlated GT terms and NC data for India, USA and UK.

4.2. Forecasting methods

In this step, the forecasting models are designed using different inputs and forecasting techniques. This research work uses two time-series forecasting techniques, namely ARIMA and LSTM. The working of the two techniques is described below in detail.

4.2.1. ARIMA

ARIMA [40,41] is a class of regression models used for forecasting time-series data. It predicts the time-series values based on its past values, i.e., its own lags and forecast errors. An ARIMA model is characterised by three parameters, (p, d, q). 'p' stands for the order of the Auto-Regressive (AR) term. It describes the number of lags to be considered for forecasting. 'q' stands for the order of the Moving Average (MA) term. It describes the number of lagged forecasting errors needed for prediction. 'd' stands for the number of differencing operations needed to make the data stationary. To apply the ARIMA model, the input data should be stationary. Data is made stationary by subtracting the previous value from the current value. Depending on the data, multiple rounds of differencing may be needed to make the data stationary. The predicted value is a function of the 'p' lag terms, the 'q' forecasting error lags and the constant term. The working equation of the ARIMA model is given in Eq. (3).

$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \ldots + \beta_p Y_{t-p} + \phi_1 \epsilon_{t-1} + \phi_2 \epsilon_{t-2} + \ldots + \phi_q \epsilon_{t-q} \tag{3}$$

where $Y_t$ represents the value of the time-series data at time $t$, $\beta_1, \beta_2, \ldots, \beta_p$ represent the coefficients of the previous $p$ time step terms, $\epsilon_t$ represents the forecasting error at time $t$, $\phi_1, \phi_2, \ldots, \phi_q$ represent the corresponding coefficients of the error terms, and $\alpha$ corresponds to the constant term.

4.2.2. LSTM network

Long Short-Term Memory (LSTM) [42,43] networks are a kind of recurrent neural network used for forecasting time series data. LSTMs overcome the drawbacks of the vanilla RNN, i.e., the problems of exploding and vanishing gradients, by having memory. Therefore they are well suited to learn from important experiences that have very long time lags in between. LSTM units are used as the building blocks for the layers of the network.

LSTM cells enable the network to remember their inputs over a long period of time. This is because LSTMs hold their information in a memory that is much like the memory of a computer, in that the LSTM can read, write and delete information from its memory. This scheme is implemented using three gates, namely the input, forget and output gates. The working equations of the LSTM cell are shown in Eqs. (4)–(9).

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{4}$$

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{5}$$

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{6}$$

$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \tag{7}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{8}$$

$$h_t = o_t \odot \tanh(c_t) \tag{9}$$

where $f_t$ is the output of the forget gate at time $t$, $h_{t-1}$ is the hidden state vector from time $t-1$, $x_t$ is the input at time $t$, $i_t$ represents the output of the input gate at time $t$, $o_t$ represents the output of the output gate at time $t$, $\tilde{c}_t$ is the candidate cell state, $c_t$ represents the output from the cell unit, $W_i$, $W_c$, $W_f$ and $W_o$ are the weights associated with the input, cell unit, forget and output gates, $b_i$, $b_c$, $b_f$ and $b_o$ are the biases associated with the input, cell unit, forget and output gates, $\sigma$ represents the sigmoid activation function and $\odot$ represents
the Hadamard (element-wise) operation on two matrices. The architecture of the LSTM cell with all the gates and variables is shown in Fig. 3.

Fig. 3. Architecture of an LSTM cell.

4.3. Optimizing LSTM parameters using GWO

Grey Wolf Optimization [44] is a nature-inspired metaheuristic search algorithm that finds the optimal solution from the solution space in an efficient manner. This algorithm is used to find the optimal hyperparameters, namely window size, number of cells in each layer and number of hidden layers, to be considered for the LSTM network. The algorithm imitates the social behaviour and hunting mechanism of grey wolves to find the location of the prey. There are four kinds of grey wolves in a pack: alpha, beta, delta and omega. Alpha wolves are the most dominant wolves in the group, and these are few in number (usually one or two). They lead the hunt and are responsible for decision-making. Beta wolves assist the alpha wolves in decision making and other activities. These are more in number than alpha wolves but fewer than delta and omega wolves. Alpha and beta wolves are the most experienced wolves in the group. Delta wolves assist the alpha and beta wolves but dominate the omega wolves. Omega wolves, the least dominant category, are mostly baby-sitters. The three phases of grey wolf hunting are: (i) searching for and approaching the prey, (ii) encircling the prey so that it cannot move and (iii) attacking the prey.

To model the social behaviour of the grey wolves mathematically, the fittest candidate in the population (of size N) is considered as alpha ($\alpha$), and the second and third fittest candidates are considered as beta ($\beta$) and delta ($\delta$) respectively. The other candidate solutions are considered as omega ($\omega$). The hunting procedure is guided by $\alpha$, $\beta$ and $\delta$. To mathematically encircle the prey, the following equations are used.

$$\vec{L} = |\vec{K} \cdot \vec{X}_P(t) - \vec{X}(t)| \tag{10}$$

$$\vec{X}(t+1) = \vec{X}_P(t) - \vec{H} \cdot \vec{L} \tag{11}$$

where $t$ indicates the current iteration number, $\vec{H}$ and $\vec{K}$ are the coefficient vectors, $\vec{X}_P$ denotes the position of the prey and $\vec{X}$ denotes the position of the grey wolf (candidate). The vectors $\vec{H}$ and $\vec{K}$ are given by

$$\vec{H} = 2 \cdot \vec{h} \cdot \vec{r}_1 - \vec{h} \tag{12}$$

$$\vec{K} = 2 \cdot \vec{r}_2 \tag{13}$$

While hunting, it is assumed that the alpha, beta and delta candidates have a better idea of the location of the prey and they guide the entire search operation towards the optimal solution. During each iteration, the positions of the candidates are updated based on the positions of the top three candidates. If the updated values are outside the solution space, i.e., if the window size is updated to a negative value, those values are updated according to the Evolution scheme as given in Trivedi et al. [45]. The formulae for updating the positions of the wolves are given below.

$$\vec{L}_\alpha = |\vec{K}_1 \cdot \vec{X}_\alpha - \vec{X}| \tag{14}$$

$$\vec{L}_\beta = |\vec{K}_2 \cdot \vec{X}_\beta - \vec{X}| \tag{15}$$

$$\vec{L}_\delta = |\vec{K}_3 \cdot \vec{X}_\delta - \vec{X}| \tag{16}$$

$$\vec{X}_1 = \vec{X}_\alpha - \vec{H}_1 \cdot (\vec{L}_\alpha) \tag{17}$$

$$\vec{X}_2 = \vec{X}_\beta - \vec{H}_2 \cdot (\vec{L}_\beta) \tag{18}$$

$$\vec{X}_3 = \vec{X}_\delta - \vec{H}_3 \cdot (\vec{L}_\delta) \tag{19}$$

$$\vec{X}(t+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3} \tag{20}$$

The values of $r_1$, $r_2$ are randomly chosen in the range (0,1). These allow the wolves to reach any position around the prey. $h$ is chosen in the range [0,2], and $H$ takes values in the range $[-h, h]$. When $|H| < 1$, it allows the wolves to exploit the solution space, meaning that they get close to the prey. When $|H| \geq 1$, it allows the wolves to explore the solution space, meaning that they move away from the prey to explore the search space. $H$ and $K$ also allow the wolves to come out of local minima or maxima. Finally, at the end of the last iteration, the fittest candidate, $\alpha$, is returned as the optimal solution.

5. Experimental study

In this section, the correlation between ECDC data and GT data is investigated. Different experiments are conducted to compute Pearson and Spearman correlation coefficients between the GT data of each search term and the ECDC data (TCC, NC and TCD). Based upon the correlation values, the top 2 search terms having the highest correlation values have been used for forecasting the trends. Next, four different forecasting models are designed, based on different techniques and inputs. The effect of ECDC and GT data on forecasting COVID-19 trends can be studied using these models. The description and implementation details of these four models are given in Section 5.2.

5.1. Feature selection

Google search trends often show what is about to come in future. But not all related search terms convey the future trend. Therefore, the selection of features (search terms) is essential for any future trend forecasting. The Pearson and Spearman correlation values of different search terms with TCC, NC and TCD are computed for the three countries. The correlation values are shown in Tables 2–4 for India, USA and UK respectively. The best features are obtained from the list of search terms by selecting the top two features having the highest correlation values.

From the correlation values, it is observed that for India, the search terms having the highest correlation values with TCC, NC and TCD are "covid" and "coronavirus vaccine". Therefore, those two search terms are selected along with ECDC data as input to the forecasting models for India. Similarly, "mask" and "covid" search terms data are selected for the USA, and "mask" and "coronavirus vaccine" search terms data are selected for the UK as input features.
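The two coefficients used for this selection, Eqs. (1) and (2), can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' code: the helper names `pearson`, `ranks` and `spearman` are introduced here, ties are assumed to receive average ranks, and the two short series are hypothetical toy data, not ECDC or Google Trends values.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation, Eq. (1): normalized dot product of mean-subtracted series."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def ranks(values):
    """1-based ranks; tied values share the average of their positions (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, Eq. (2): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy series: cases grow exponentially, searches saturate.
cases = [1, 2, 4, 8, 16, 32]
searches = [10, 20, 25, 27, 28, 29]

r_pearson = pearson(cases, searches)
r_spearman = spearman(cases, searches)
```

In this toy case both series increase together but not along a straight line, so the Spearman value reaches its maximum while the Pearson value stays noticeably lower, which is the behaviour that motivates the use of rank correlation in Section 4.1.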
Table 2
Correlation values between google trends data and ECDC data for India.
TCC NC TCD
S.No. Search term Pearson Spearman Pearson Spearman Pearson Spearman
1. Sanitizer −0.207 −0.041 −0.247 −0.055 −0.282 −0.136

2. Coronavirus −0.252 0.174 −0.206 0.166 −0.184 0.127
3. Covid 0.374 0.550 0.448 0.539 0.498 0.539
4. Health center 0.037 0.071 0.026 0.054 0.058 0.071
5. Mask −0.032 0.020 −0.070 0.018 −0.112 −0.056
6. Handwash −0.155 0.026 −0.175 0.017 −0.184 −0.052
7. Coronavirus Symptoms −0.077 0.292 −0.034 0.294 −0.018 0.266
8. Positive Cases −0.475 −0.488 −0.519 −0.489 −0.547 −0.562
9. Coronavirus Vaccine 0.362 0.566 0.370 0.545 0.370 0.508
Table 3
Correlation values between Google Trends data and ECDC data for USA.

| S.No. | Search term | TCC Pearson | TCC Spearman | NC Pearson | NC Spearman | TCD Pearson | TCD Spearman |
|---|---|---|---|---|---|---|---|
| 1 | Sanitizer | −0.554 | −0.463 | −0.415 | −0.071 | −0.540 | −0.463 |
| 2 | Coronavirus | −0.593 | −0.443 | −0.164 | 0.049 | −0.616 | −0.444 |
| 3 | Covid | 0.136 | 0.181 | 0.653 | 0.605 | 0.058 | 0.181 |
| 4 | Health center | −0.347 | −0.383 | −0.420 | −0.364 | −0.308 | −0.384 |
| 5 | Mask | 0.234 | 0.541 | 0.683 | 0.851 | 0.143 | 0.540 |
| 6 | Handwash | −0.221 | −0.176 | −0.066 | −0.060 | −0.232 | −0.176 |
| 7 | Coronavirus Symptoms | −0.390 | −0.033 | 0.083 | 0.318 | −0.427 | −0.033 |
| 8 | Positive Cases | −0.632 | −0.771 | −0.588 | −0.404 | −0.597 | −0.772 |
| 9 | Coronavirus Vaccine | −0.047 | 0.179 | −0.045 | 0.260 | −0.039 | 0.178 |
Table 4
Correlation values between Google Trends data and ECDC data for UK.

| S.No. | Search term | TCC Pearson | TCC Spearman | NC Pearson | NC Spearman | TCD Pearson | TCD Spearman |
|---|---|---|---|---|---|---|---|
| 1 | Sanitizer | −0.598 | −0.652 | −0.650 | −0.455 | −0.599 | −0.666 |
| 2 | Coronavirus | −0.559 | −0.358 | −0.193 | 0.060 | −0.568 | −0.362 |
| 3 | Covid | 0.213 | 0.263 | 0.50 | 0.386 | 0.193 | 0.262 |
| 4 | Health center | −0.347 | −0.340 | −0.309 | −0.249 | −0.350 | −0.343 |
| 5 | Mask | 0.583 | 0.698 | 0.502 | 0.604 | 0.574 | 0.697 |
| 6 | Handwash | −0.622 | −0.551 | −0.538 | −0.323 | −0.629 | −0.563 |
| 7 | Coronavirus Symptoms | −0.214 | 0.017 | −0.038 | 0.080 | −0.231 | 0.015 |
| 8 | Positive Cases | −0.711 | −0.832 | −0.639 | −0.513 | −0.719 | −0.837 |
| 9 | Coronavirus Vaccine | 0.318 | 0.471 | 0.280 | 0.377 | 0.325 | 0.470 |
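The screening summarized in Tables 2–4 can be illustrated with a small, standard-library sketch. It assumes that the two terms with the highest signed Spearman coefficient are kept, which matches the selections reported for the three countries; the function names and the toy series below are hypothetical and are not the data used in the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def top_two_terms(trends, target):
    """Return the two search terms with the highest signed Spearman value."""
    scores = {term: spearman(series, target) for term, series in trends.items()}
    return sorted(scores, key=scores.get, reverse=True)[:2]


# Toy, made-up series (not the ECDC or Google Trends data).
target = [1, 2, 3, 4, 5, 6]
trends = {
    "covid": [10, 20, 15, 30, 40, 50],
    "coronavirus vaccine": [1, 4, 9, 16, 25, 36],
    "sanitizer": [60, 50, 40, 30, 20, 10],
    "handwash": [5, 3, 6, 2, 7, 1],
}
selected = top_two_terms(trends, target)
```

Ranking by the signed value rather than the absolute value is a deliberate choice here: terms such as “Positive Cases” have large negative coefficients in the tables but are not among the selected features.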
It is also observed that many search terms have a negative correlation with ECDC data, i.e., these search terms are inversely associated with ECDC data. This inverse relationship arises mainly because, as time progresses, the interest in searching these terms on the internet falls drastically. For example, interest in the search term ‘sanitizer’ decreased once the importance of applying hand sanitizer and sanitizing commonly used surfaces became widely known, resulting in negative correlation coefficients.

5.2. Designing experimental models

This section discusses the four different models used for forecasting the coronavirus trends. These four models use different combinations of models and inputs required for the forecast. The selected features, along with the ECDC data, are given as input to different models. These include ARIMA with ECDC data as input, LSTM with ECDC data as input, LSTM with both GT and ECDC data as input, and LSTM with both GT and ECDC data as input with hyperparameters optimized using GWO. The implementation details for each of these four combinations are described in the subsections below.

5.2.1. Forecasting COVID-19 trends using ARIMA (ECDC-A)

For the ARIMA model (ECDC-A), the previous coronavirus trends data is given as input. The optimal combination of p, d, q is found for each country and for each parameter forecasted using the ‘auto_arima’ module of the ‘pmdarima’ library in Python. It tries to find the optimal combination suitable for the input data to obtain the best forecast. Next, the ARIMA model is trained, as mentioned in Section 4, using the optimal parameters to obtain the predictions.

5.2.2. Forecasting COVID-19 trends using LSTM model (ECDC-L)

For the second model (ECDC-L), only coronavirus trends data (X) is given as input to the LSTM model, i.e., the input data has only one feature. The ‘Keras’ library with the ‘Tensorflow’ backend is used to implement the LSTM model in Python. The LSTM model has two LSTM-cell layers, each layer having 128 cells. A window size of 14 days is used for this model, with the ‘Adam’ optimizer and a learning rate of 0.001. For updating the weights, the ‘MSLE’ loss function is used. A dropout of 0.4 and a recurrent dropout of 0.2 are also used to prevent overfitting of the model. If z is considered to be the forecasted output, then the mathematical representation of the input and output is given as Eq. (21).

$$z = f(X) \tag{21}$$
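To make the 14-day windowing of the ECDC-L model concrete, the following standard-library sketch builds the input windows and evaluates an MSLE loss. It is only an illustration under stated assumptions: the one-step-ahead target, the helper names `make_windows` and `msle`, and the toy series are hypothetical, and the Keras model itself is not shown.

```python
from math import log1p

WINDOW = 14  # window size in days, as stated for the ECDC-L model


def make_windows(rows, window=WINDOW):
    """Slice a daily series into (inputs, targets) pairs.

    rows[t] is the list of features for day t; feature 0 is assumed to be
    the ECDC series X. Each input holds `window` consecutive days and the
    target is the value of X on the following day (an assumption).
    """
    inputs, targets = [], []
    for start in range(len(rows) - window):
        inputs.append(rows[start:start + window])
        targets.append(rows[start + window][0])
    return inputs, targets


def msle(actual, predicted):
    """Mean squared logarithmic error: mean of (log(1+y) - log(1+z))^2."""
    total = sum((log1p(a) - log1p(p)) ** 2 for a, p in zip(actual, predicted))
    return total / len(actual)


# Toy single-feature series of 20 made-up daily values.
series = [[float(day)] for day in range(1, 21)]
inputs, targets = make_windows(series)
```

With one feature per day each window has shape 14 × 1; adding further per-day features to each row (for example two search-term series) changes only the inner list length, not the windowing logic.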
5.2.3. Forecasting COVID-19 trends with ECDC and GT data using LSTM (ECDC-GT-L)

In this model (ECDC-GT-L), both the coronavirus trends data (X) and the top two Google search trends (Y1, Y2) (discussed in Section 5.1) are fed as input to the LSTM model, i.e., the input data has three features. The rest of the hyperparameters used for the LSTM model are as described in Section 5.2.2 above. Running this model shows the improvement in performance due to incorporating GT data. The mathematical formulation of the inputs and output (z) of the forecasting model is given in Eq. (22).

$$z = f(X, Y_1, Y_2) \tag{22}$$

5.2.4. Forecasting COVID-19 trends with ECDC and GT data using optimized LSTM (ECDC-GT-GWO-L)

This is the proposed model (ECDC-GT-GWO-L), where the window size, number of units and number of hidden layers in the LSTM model are optimized using the GWO algorithm. The input data remains the same as in the third model. Using different window sizes gave a lot of variation in the forecasting results. Therefore, the window size is optimized in the range 1 to 28, both inclusive, i.e., a maximum of four weeks, to obtain the best forecast. The number of iterations and the population size in each iteration are chosen as 25 and 10, respectively. RMSE is used as the fitness function, and the fitness function is minimized to obtain the optimal set of hyperparameters. However, the other hyperparameters, like the number of hidden layers and the number of LSTM cells in each layer, did not result in much variation in the metrics.

6. Results & discussion

In this section, the metrics used for evaluating the effectiveness of the different techniques are described in detail. Experiments are conducted using the models described in Section 5.2, and the performance of these techniques is compared using the nRMSE and MAPE metrics. The idea of taking the internet browsing data related to COVID-19 along with ECDC data to forecast the cases is validated for different countries using plots and tables.

6.1. Evaluation metrics

The choice of evaluation metrics plays an important role in judging the performance of a model correctly. The most popularly used metrics for evaluating time-series forecasting models are Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) [46]. In this paper, a variant of RMSE, i.e., normalised RMSE (nRMSE), is used along with MAPE to compare the performance of different models. nRMSE is more useful than RMSE as normalizing RMSE makes the comparison scale-free. Since the value of the RMSE is dependent upon the predicted values, the error in the case of TCC will always be greater than for NC and TCD because of the large TCC values. Normalizing the error values helps in better comparison of the models over all the trends. nRMSE is computed using the mean of the forecasted values. The two metrics are computed according to Eqs. (23) and (24), respectively.

$$\mathrm{nRMSE} = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}}{\bar{y}} \tag{23}$$

$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{y_i} \times 100 \tag{24}$$

6.2. Discussion on forecasted results

A comparative study of LSTM models and regressors has been made to validate the effectiveness of the proposed approach. The four experimental models are trained with eighty three days of data, i.e., from 24th February 2020 to 13th May 2020, and tested with data points between 14th May and 20th May 2020. The forecasting performance of the models is compared, and the plots are shown in Figs. 4–6 for India, USA and UK, respectively. The nRMSE and MAPE metrics are computed for each trend predicted for the three countries, and the values are shown in Tables 5–7 for the three countries, respectively.

For India, ECDC-A gives the least errors when the p, d, q values are 2, 2, 0 respectively (for TCC; the orders used for NC and TCD are listed in Table 5). The input to the ECDC-A model was the ECDC coronavirus trend data. Using the ECDC-A model, the values of nRMSE and MAPE obtained were 0.485 and 41.698 for forecasted TCC, 0.274 and 22.591 for forecasted NC and 0.3098 and 27.535 for forecasted TCD.

For ECDC-L, the values of nRMSE and MAPE obtained were 0.035 and 14.161 for TCC, 0.14 and 20.923 for NC and 0.074 and 13.752 for TCD. This model performed better than the ECDC-A model, showing the effectiveness of LSTM-based models over the ARIMA regressors.

For the ECDC-GT-L model, the values of nRMSE and MAPE obtained were 0.0165 and 4.1540 when the input was TCC, 0.037 and 9.259 for NC and 0.027 and 6.885 for TCD. This model achieved better results than the ECDC-L model, indicating the effectiveness of GT data when used along with ECDC data in forecasting.

For the ECDC-GT-GWO-L model, the values of nRMSE and MAPE obtained were 0.0136 and 3.452 for forecasted TCC, 0.032 and 7.140 for NC and 0.001 and 0.304 for TCD. Similarly, for the other two countries, the proposed model surpasses the other models. For USA, the MAPE values obtained by the proposed model are 3.13, 11.78 and 2.565 for TCC, NC and TCD respectively. For UK, the MAPE values obtained using the proposed model are 1.696, 6.946 and 1.443 for TCC, NC and TCD respectively.

The percentage improvements in MAPE values for the proposed model when compared to the other experimental models are shown in Tables 8–10. The proposed model achieved better results for all three countries when compared to the other models, indicating the effectiveness of the LSTM framework optimized using GWO.

To show the effectiveness of using GT data for forecasting the trends of the infection when compared to the traditional LSTM, the percentage improvement of ECDC-GT-L when compared to ECDC-L is shown in Table 11. From the values obtained, it is clear that using GT data improves the forecasting results.

It is also observed that the error value in the case of NC is higher when compared to TCC and TCD. This is because TCC and TCD, being monotonic in nature, produce a very effective forecast. The same trend can be observed for the other two countries.

Therefore, the proposed model, i.e., using ECDC data and GT data as inputs to an LSTM model that uses the optimal window size derived from GWO and the features selected using Spearman’s correlation coefficient, proves to be the better-performing model when compared to its other variants.

7. Conclusion and future work

In this paper, a novel workflow for forecasting future trends of COVID-19 spread using the historical ECDC data and Google Trends has been proposed. The influence of the pandemic is reflected in the Google search history of people across different countries. The well-known Spearman’s correlation has been used to identify the most relevant search terms. The GWO algorithm has been used to select the optimal hyper-parameters for the LSTM architecture to achieve higher accuracy in forecasting total cumulative cases, deaths and new cases of infection. The analysis is carried out for some of the worst affected countries, including India, USA and UK.
Fig. 4. Comparison of performance of different models for forecasting different parameters in India (a) Daily New cases (b) Total Cumulative cases (c) Total Deaths.
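As an illustration of the window-size search described in Section 5.2.4, the following is a minimal sketch of a grey wolf optimizer over the range 1 to 28 with 10 candidates and 25 iterations. Several parts are assumptions rather than the authors' implementation: `toy_fitness` is a hypothetical stand-in for the RMSE of an LSTM trained with a given window, the position update is the standard rule from the GWO literature [44], and out-of-range positions are simply clamped, which may differ from the boundary handling used in the paper.

```python
import random

LOW, HIGH = 1, 28           # window-size bounds from Section 5.2.4
N_WOLVES, N_ITER = 10, 25   # population size and iterations from Section 5.2.4


def toy_fitness(window):
    """Hypothetical stand-in for the forecasting RMSE of a given window size."""
    return (window - 14) ** 2


def gwo_window_search(fitness, seed=0):
    """Search for an integer window size that minimizes `fitness`."""
    rng = random.Random(seed)

    def score(position):
        # Positions are continuous; the fitness is evaluated on the rounded value.
        return fitness(round(position))

    wolves = [rng.uniform(LOW, HIGH) for _ in range(N_WOLVES)]
    best = min(wolves, key=score)
    for iteration in range(N_ITER):
        # The three best candidates guide the update of every candidate.
        leaders = sorted(wolves, key=score)[:3]
        h = 2 - 2 * iteration / N_ITER  # decreases linearly from 2 towards 0
        updated = []
        for x in wolves:
            proposals = []
            for leader in leaders:
                H = 2 * h * rng.random() - h   # Eq. (12)
                K = 2 * rng.random()           # Eq. (13)
                distance = abs(K * leader - x)
                proposals.append(leader - H * distance)
            new_x = sum(proposals) / 3
            # Assumed boundary handling: clamp into the allowed range.
            updated.append(min(max(new_x, LOW), HIGH))
        wolves = updated
        best = min(wolves + [best], key=score)
    return round(best)


best_window = gwo_window_search(toy_fitness)
```

In an actual experiment the toy fitness would be replaced by a function that trains and evaluates the forecasting model for the candidate window, which is far more expensive per evaluation than this sketch suggests.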
Table 5
Comparison of nRMSE and MAPE values using different models and features for India.

| S.No. | Trend | Model | Inputs used | nRMSE | MAPE |
|---|---|---|---|---|---|
| 1 | Total cumulative cases (TCC) | ECDC-A (2, 2, 0) | TCC | 0.485 | 41.698 |
| | | ECDC-L | TCC | 0.035 | 14.161 |
| | | ECDC-GT-L | TCC, GT - Covid, Vaccine | 0.016 | 4.154 |
| | | ECDC-GT-GWO-L | TCC, GT - Covid, Vaccine | 0.013 | 3.452 |
| 2 | Daily new cases (NC) | ECDC-A (2, 1, 0) | NC | 0.274 | 22.591 |
| | | ECDC-L | NC | 0.140 | 20.923 |
| | | ECDC-GT-L | NC, GT - Covid, Vaccine | 0.037 | 9.259 |
| | | ECDC-GT-GWO-L | NC, GT - Covid, Vaccine | 0.032 | 7.140 |
| 3 | Total cumulative deaths (TCD) | ECDC-A (0, 2, 1) | TCD | 0.309 | 27.535 |
| | | ECDC-L | TCD | 0.074 | 13.752 |
| | | ECDC-GT-L | TCD, GT - Covid, Vaccine | 0.027 | 6.885 |
| | | ECDC-GT-GWO-L | TCD, GT - Covid, Vaccine | 0.001 | 0.304 |
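The nRMSE and MAPE columns follow Eqs. (23) and (24). A minimal standard-library sketch of both metrics is given below; it normalizes RMSE by the mean of the forecasted values, following the statement in Section 6.1, which is one reading of the symbol ȳ. The function names and the toy numbers are hypothetical.

```python
from math import sqrt


def nrmse(actual, forecast):
    """RMSE divided by the mean of the forecasted values (Eq. 23, as read here)."""
    n = len(actual)
    rmse = sqrt(sum((y - f) ** 2 for y, f in zip(actual, forecast)) / n)
    return rmse / (sum(forecast) / n)


def mape(actual, forecast):
    """Mean absolute percentage error in percent (Eq. 24)."""
    n = len(actual)
    return 100 * sum(abs(y - f) / y for y, f in zip(actual, forecast)) / n


# Toy, made-up values (not taken from the ECDC data).
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]
toy_nrmse = nrmse(actual, forecast)
toy_mape = mape(actual, forecast)
```

Because MAPE divides by the actual value, it is undefined when an actual value is zero, which can matter for daily new cases early in an outbreak.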
Table 6
Comparison of nRMSE and MAPE values using different models and features for USA.

| S.No. | Trend | Model | Inputs used | nRMSE | MAPE |
|---|---|---|---|---|---|
| 1 | Total cumulative cases (TCC) | ECDC-A (0, 2, 1) | TCC | 0.109 | 10.46 |
| | | ECDC-L | TCC | 0.135 | 12.914 |
| | | ECDC-GT-L | TCC, GT - Covid, Mask | 0.011 | 3.831 |
| | | ECDC-GT-GWO-L | TCC, GT - Covid, Mask | 0.012 | 3.132 |
| 2 | Daily new cases (NC) | ECDC-A (1, 1, 0) | NC | 0.169 | 15.571 |
| | | ECDC-L | NC | 0.157 | 13.262 |
| | | ECDC-GT-L | NC, GT - Covid, Mask | 0.138 | 12.637 |
| | | ECDC-GT-GWO-L | NC, GT - Covid, Mask | 0.132 | 11.78 |
| 3 | Total cumulative deaths (TCD) | ECDC-L | TCD | 0.250 | 14.55 |
| | | ECDC-GT-L | TCD, GT - Covid, Mask | 0.014 | 3.746 |
| | | ECDC-GT-GWO-L | TCD, GT - Covid, Mask | 0.009 | 2.565 |
Fig. 5. Comparison of performance of different models for forecasting different parameters in USA (a) Daily New cases (b) Total Cumulative cases (c) Total Deaths.
Table 7
Comparison of nRMSE and MAPE values using different models and features for UK.

| S.No. | Trend | Model | Inputs used | nRMSE | MAPE |
|---|---|---|---|---|---|
| 1 | Total cumulative cases (TCC) | ECDC-A (2, 2, 0) | TCC | 0.088 | 8.808 |
| | | ECDC-L | TCC | 0.089 | 8.993 |
| | | ECDC-GT-L | TCC, GT - Covid, Vaccine | 0.027 | 7.136 |
| | | ECDC-GT-GWO-L | TCC, GT - Covid, Vaccine | 0.006 | 1.695 |
| 2 | Daily new cases (NC) | ECDC-A (2, 1, 0) | NC | 0.168 | 16.931 |
| | | ECDC-L | NC | 0.195 | 19.632 |
| | | ECDC-GT-L | NC, GT - Covid, Vaccine | 0.0363 | 9.236 |
| | | ECDC-GT-GWO-L | NC, GT - Covid, Vaccine | 0.027 | 6.945 |
| 3 | Total cumulative deaths (TCD) | ECDC-L | TCD | 0.058 | 3.12 |
| | | ECDC-GT-L | TCD, GT - Covid, Vaccine | 0.025 | 2.484 |
| | | ECDC-GT-GWO-L | TCD, GT - Covid, Vaccine | 0.009 | 1.442 |
Table 8
Percentage improvement in MAPE values for proposed model (ECDC-GT-GWO-L) when compared to other three models for India.

| S.No. | Model | TCC(%) | NC(%) | TCD(%) |
|---|---|---|---|---|
| 1 | ECDC-A | 91.71 | 68.39 | 98.89 |
| 2 | ECDC-L | 75.61 | 65.87 | 97.78 |
| 3 | ECDC-GT-L | 16.8 | 22.89 | 95.57 |

Table 9
Percentage improvement in MAPE values for proposed model (ECDC-GT-GWO-L) when compared to other three models for USA.

| S.No. | Model | TCC(%) | NC(%) | TCD(%) |
|---|---|---|---|---|
| 1 | ECDC-A | 70.05 | 24.34 | 73.04 |
| 2 | ECDC-L | 75.74 | 11.17 | 82.37 |
| 3 | ECDC-GT-L | 18.24 | 6.78 | 31.53 |

Table 10
Percentage improvement in MAPE values for proposed model (ECDC-GT-GWO-L) when compared to other three models for UK.

| S.No. | Model | TCC(%) | NC(%) | TCD(%) |
|---|---|---|---|---|
| 1 | ECDC-A | 80.07 | 58.97 | 81.38 |
| 2 | ECDC-L | 81.14 | 64.62 | 53.76 |
| 3 | ECDC-GT-L | 76.23 | 24.80 | 41.93 |

Table 11
Percentage improvement in MAPE values for ECDC-GT-L when compared to ECDC-L for India, USA, UK.

| S.No. | Country | TCC(%) | NC(%) | TCD(%) |
|---|---|---|---|---|
| 1 | India | 70.66 | 55.74 | 49.93 |
| 2 | USA | 70.33 | 4.71 | 74.25 |
| 3 | UK | 20.64 | 52.95 | 53.78 |
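The improvement figures in Tables 8–11 appear consistent, up to rounding, with the relative reduction in MAPE between a baseline model and the compared model. The paper does not state the formula explicitly, so the sketch below is an inferred reading; the function name is hypothetical, and the inputs are MAPE values taken from Tables 5 and 6.

```python
def mape_improvement(baseline_mape, improved_mape):
    """Relative reduction in MAPE, in percent (assumed formula)."""
    return 100 * (baseline_mape - improved_mape) / baseline_mape


# USA, TCD: ECDC-GT-L (3.746) versus the proposed model (2.565); Table 9 lists 31.53.
usa_tcd = mape_improvement(3.746, 2.565)

# India, TCC: ECDC-L (14.161) versus ECDC-GT-L (4.154); Table 11 lists 70.66.
india_tcc = mape_improvement(14.161, 4.154)
```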
Fig. 6. Comparison of performance of different models for forecasting different parameters in United Kingdom (a) Daily New cases (b) Total Cumulative cases (c) Total
Deaths.
The results obtained establish that mining the search trends of the public in a particular region can be used to forecast the future number of cases of a disease or infection. The different experiments conducted have produced effective results for all three countries. Also, the relevant search terms keep changing as a pandemic progresses. Thus, it becomes imperative that the GT terms are continuously evaluated and updated for forecasting the spread. The results can be improved further when search terms with higher correlation values are used. Hence, the proposed workflow can be adopted in the future for the prediction of the spread of pandemics.

Declaration of Competing Interest

Authors declare that they have no conflict of interest.

References

[1] Shereen MA, Khan S, Kazmi A, Bashir N, Siddique R. COVID-19 infection: origin, transmission, and characteristics of human coronaviruses. J Adv Res 2020;24:91–8.
[2] Coronavirus Updates. 2020, 28 July, Available. https://www.ECDC.int/emergencies/diseases/novel-coronavirus-2019.
[3] Jin X, Lian J-S, Hu J-H. Epidemiological, clinical and virological characteristics of 74 cases of coronavirus-infected disease 2019 (COVID-19) with gastrointestinal symptoms. BMJ J 2020;69:1002–9.
[4] Huang C, Wang Y, Li X, Ren L, Zhao J, Hu Y, Zhang L, Fan G, Xu J, Gu X, Cheng Z, Yu T, Xia J, Wei Y, Wu W, Xie X, Yin W, Li H, Liu M, Xiao Y, Gao H, Guo L, Xie J, Wang G, Jiang R, Gao Z, Jin Q, Wang J, Cao B. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 2020;395:497–506.
[5] Zhou F, Yu T, Du R, Fan G, Liu Y, Liu Z, Xiang J, Wang Y, Song B, Gu X, Guan L, Wei Y, Li H, Wu X, Xu J, Tu S, Zhang Y, Chen H, Cao B. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study. Lancet 2020;395.
[6] Li X, Ma J, Wang S, Zhang X. How does google search affect traders position and crude oil prices? Econ Model 2015;49:162–71.
[7] Yu L, Zhao Y, Tang L, Yang Z. Online big data-driven oil consumption forecasting with google trends. Int J Forecast 2019a;35:213–23.
[8] Roser M, Ritchie H, Ortiz-Ospina E, Hasell J. Coronavirus pandemic (COVID-19), 2020. 2020.
[9] Fente DN, Singh DK. Weather forecasting using artificial neural network. In: Proc. second international conference on inventive communication and computational technologies; 2018. p. 1757–61.
[10] Vijh M, Chandola D, Tikkiwal VA, Kumar A. Stock closing price prediction using machine learning techniques. Procedia Comput Sci 2020;167:599–606.
[11] Lai CY, Chen RC, Charaka RE. Prediction stock price based on different index factors using LSTM. In: Proc. 2019 international conference on machine learning and cybernetics; 2019. p. 1–6.
[12] Venna SR, Tavanaet A, Gottumukkala RN, Raghavan VV, Maidi AS, Nichols S. A novel data-driven model for real-time influenza forecasting. IEEE Access 2018;7:7691–701.
[13] Yu Y, Cao J, Zhu J. An LSTM short-term solar irradiance forecasting. Under Complicat Weather Cond 2019b;7:145651–66.
[14] Chandola D, Gupta H, Tikkiwal VA, Bohra MK. Multi-step ahead forecasting of global solar radiation for arid zones using deep learning. Procedia Comput Sci 2020;167:626–35.
[15] Wang B, Kong W, Guan H, Xiong NN. Air quality forecasting based on gated recurrent long short term memory model in internet of things. IEEE Access 2019;7:69524–34.
[16] Singh U, Kedas S, Prasanth S, Kumar A, Semwal VB, Tikkiwal VA. Design of a recurrent neural network model for machine reading comprehension. Procedia Comput Sci 2020a;167:1791–800.
[17] Ganai AF, Khursheed F. Predicting next word using RNN and LSTM cells: statistical language modeling. In: 2019 fifth international conference on image information processing (ICIIP), Shimla, India; 2019. p. 469–74.
[18] Hu P, Pan J-S, Chu S-C. Improved binary grey wolf optimizer and its application for feature selection. Knowl Based Syst 2020;195:105746–58.
[19] Rajakumar R, Amudhavel J, Dhavachelvan P, Vengattaraman T. GWO-LPWSN: grey wolf optimization algorithm for node localization problem in wireless sensor networks, 2017; 2017. Article ID 7348141.
[20] Wang M, Chen H, Li H, Cai Z, Zhao X, Tong C, Li J, Xu X. Grey wolf optimization evolving kernel extreme learning machine: application to bankruptcy prediction. Eng Appl Artif Intell 2017;63:54–68.
[21] Song X, Tang L, Zhao S, Zhang X, Li L, Huang J, Cai W. Grey wolf optimizer for parameter estimation in surface waves. Soil Dyn Earthq Eng 2017;75:147–57.
[22] Benvenuto D, Giovanetti M, Vassallo L, Angeletti S, Ciccozzi M. Application of the ARIMA model on the COVID-2019 epidemic dataset. Data Brief 2020;29:105340–3.
[23] Singh RK, Rani M, Bhagvathula AS, Shah R, Morales AJR, Kalita H, Nanda C, Sharma S, Sharma YD, Rabban A, Rahmani J, Kumar P. Prediction of the COVID-19 pandemic for the top 15 affected countries: advanced autoregressive integrated moving average (ARIMA) model. JMIR Public Health Surveill 2020b;6.
[24] Ceylan Z. Estimation of COVID-19 prevalence in Italy, Spain and France. Sci Total Environ 2020;79.
[25] Kumar J, SHembram KPS. Epidemiological survey of novel coronavirus (COVID-19). https://arxiv.org/abs/2003.11376.
[26] Rustum F, Reshi AA, Mehmood A, Ullah S, On B-W, Aslam W, Choi GS. COVID-19 future forecasting using supervised machine learning models. IEEE Access 2020;8:101489–99.
[27] Sahai AK, Rath N, Sood V, Singh MP. ARIMA modelling & forecasting of COVID-19 in top five affected countries. Diabetes Metab Syndr 2020;14:1419–27.
[28] Chintalapudi N, Battineni G, Amenta F. COVID-19 virus outbreak forecasting of registered and recovered cases after sixty day lockdown in Italy: a data driven model approach. J Microbiol Immunol Infect 2020;53:396–403.
[29] Velásquez RMA, Lara JVM. Forecast and evaluation of COVID-19 spreading in USA with reduced-space gaussian process regression. Chaos Solitons Fractals 2020;136:109924–32.
[30] Chimmula VKR, Zhang L. Time series forecasting of COVID-19 transmission in Canada using LSTM networks. Chaos Solitons Fractals 2020;135:109864–9.
[31] Tomar A, Gupta N. Prediction for the spread of COVID-19 in India and effectiveness of preventive measures. 2020;728:138762–138767.
[32] Ibrahim MR, Haworth J, Lipani A, Aslam N, Cheng T, Christie N. Variational LSTM autoencoder to forecast the spread of coronavirus across the globe. medRxiv 2020;12(1).
[33] Chae S, Kwon S, Lee D. Predicting infectious disease using deep learning and big data. Int J Environ Res Public Health 2018;15(8):1596.
[34] Lampos V, Cristianini N. Tracking the flu pandemic by monitoring the social web. In: Proc. second international workshop on cognitive information processing; 2020. p. 411–14.
[35] Signorini A, Segre AM, Polgreen PM. The use of twitter to track levels of disease activity and public concern in the U.S. during the influenza a H1N1 pandemic. PLoS One 2011;6(5):1–10.
[36] Anggraeni W, Aristiani L. Using google trends data in forecasting number of dengue fever cases with ARIMAX method case study. In: Proc. 2016 international conference on information & communication technology and systems; 2016. p. 114–18.
[37] Teng Y, Bi D, Xiu G, Jin Y, Huang Y, Lil B, An X, Feng D, Tong Y. Dynamic forecasting of Zika epidemics using google trends. PLoS One 2017;12(1).
[38] Effenberger M, Kronbichler A, Shin JI, Mayer G, Tilg H, Percoe P. Association of the COVID-19 pandemic with internet search volumes: a google trends analysis. Int J Infect Dis 2020;95:192–7.
[39] Google. Trends data retrieved from ’https://www.google.com/trends’ [Online Resource].
[40] Zhang GP. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 2003;50:159–75.
[41] Conejo AJ, Plazas MA, Espinola R, Molina AB. Day-ahead electricity price forecasting using the wavelet transform and ARIMA models. IEEE Trans Power Syst 2005;52:1035–42.
[42] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9(8):1735–80.
[43] Sak H, Senior A, Beaufays F. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. Interspeech 2014:338–42.
[44] Mirjalili S, Mirjalili SM, Lewis A. Grey wolf optimizer. Adv Eng Softw 2014;69:46–61.
[45] Trivedi IN, Gandomi AH, Jangir P, Jangir N. Study of different boundary constraint handling schemes in interior search algorithm. In: International conference on artificial intelligence and evolutionary computations in engineering systems, 517; 2016.
[46] Swamidass PM. Mean absolute percentage error (MAPE). Encycl Prod Manuf Manag 2020:462.
