Module 3.2 Time Series Forecasting LSTM Model


MODULE 3.2 SUPERVISED LEARNING: TIME SERIES FORECASTING PART 2


Deep Learning Approach to Time Series Modeling: Introduction to Long Short-Term Memory (LSTM)

Traditional time series models such as ARIMA are well understood and effective on many problems. However, these
traditional methods also suffer from several limitations: they are linear functions, or simple transformations of linear
functions, and they require manually diagnosed parameters, such as the order of time dependence.

Recurrent neural networks (RNNs) pose an alternative to ARIMA models: RNNs can identify structure and
patterns such as nonlinearity, can seamlessly model problems with multiple input variables, and are relatively robust
to missing data. RNN models retain state from one iteration to the next by using their own output as input for the
next step. These deep learning models can be referred to as time series models because, like traditional time series
models such as ARIMA, they make future predictions using data points from the past.

Part 1: Architecture of Recurrent Neural Networks (RNN)

Recurrent neural networks (RNNs) are called “recurrent” because they perform the same task for every element of a
sequence, with the output being dependent on the previous computations. RNN models have a memory, which
captures information about what has been calculated so far. RNNs are networks with loops in them, allowing
information to persist (i.e., information cycles through the loop, so the output is determined by the current input and
previously received inputs).

In the above diagram, a chunk of neural network, A, looks at some input Xt at timestep index t and outputs a value ht.
The input layer X processes the initial input and passes it to the middle layer A, the hidden state or the memory of the
network.

Before we get into greater detail on how an RNN works, let’s zoom in on the above diagram and understand
computationally what is happening in each cell, state, or neuron of the RNN.
1. At the beginning, input X0 gets multiplied by a matrix U resulting in a value X0*U. U is a matrix of weights that
parametrizes input-to-hidden connections. Note that a neuron or cell takes an input, applies the learning
parameters to generate a weighted sum, and then passes that sum to an activation function that computes
the output.

2. The feedback from the last time step gets multiplied to the matrix W, a parametrized weight matrix of hidden-
to-hidden recurrent connections.

Since this is the initial stage, the feedback value is zero (to keep it simple). Hence the feedback value is A0*W
= 0. Thus, the value after passing through hidden state A0 is 0 + X0*U.

3. Now, this gets multiplied with the matrix V resulting in X0*U*V. V is a matrix of weights that parametrizes
hidden-to-output connections.

4. For the next time steps t1 to tn, this value will be stored in A and will no longer be zero. (A minimal numerical
sketch of steps 1 to 4 follows.)
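
To make steps 1 to 4 concrete, here is a minimal NumPy sketch of the recurrent computation. The dimensions, inputs, and random weights are illustrative assumptions, not values from this module:

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 1, 4, 1
U = rng.normal(size=(n_hid, n_in))    # input-to-hidden weights
W = rng.normal(size=(n_hid, n_hid))   # hidden-to-hidden (recurrent) weights
V = rng.normal(size=(n_out, n_hid))   # hidden-to-output weights

xs = [np.array([0.5]), np.array([0.1]), np.array([-0.3])]  # inputs x0, x1, x2
h = np.zeros(n_hid)                   # initial feedback A0 is zero
for t, x in enumerate(xs):
    h = np.tanh(U @ x + W @ h)        # weighted sum of input and feedback, then activation
    y = V @ h                         # hidden-to-output projection
    print(f"t={t}: output {y}")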

A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a
successor. Consider what happens if we unroll the loop:

When unrolled, the middle layer looks like multiple hidden layers, each with its activation functions, weights, and
biases. However, these parameters are shared across time steps: instead of creating multiple hidden layers, the
network creates one and loops over it.
As stated, the weight matrices U, V, and W do not change across the unrolled time steps. This means that once the
RNN is trained, the weight matrices are fixed during inference and are not time-dependent. In other words, the same
weight matrices (U, V, W) are used in every time step.

Part 2: Training a Recurrent Neural Network

A neuron weight represents the strength of the connection between units and measures the influence the input will
have on the output. If the weight from neuron one to neuron two has greater magnitude, it means that neuron one
has a greater influence over neuron two. Weights near zero mean changing this input will not change the output.
Negative weights mean increasing this input will decrease the output.

Training a neural network basically means calibrating all of the weights in the network. This optimization is performed
using an iterative approach involving forward propagation and backpropagation steps. Almost all kinds of neural
networks use backpropagation.

Suppose the desired output of a network is Y and the predicted value of the network from forward propagation is Y’.
The difference between the predicted output and the desired output (Y–Y′) is converted into the loss (or cost) function
J(w), where w represents the weights in a neural network. The goal is to optimize the loss function (i.e., make the loss
as small as possible) over the training set.

The optimization method used is gradient descent. At each step, gradient descent finds the gradient of J(w) with
respect to w at the current point and takes a small step in the direction of the negative gradient, i.e.,
w ← w − η·∇J(w), where η is the learning rate, repeating until the minimum value is reached.

In any neural network, the function J(w) is essentially a composition of multiple hidden layers. So, if layer one is
represented as function p(), layer two as q(), and layer three as r(), then the overall function is J(w) = r(q(p())). w consists
of all weights in all three layers. We want to find the gradient of J(w) with respect to each component of w.

Skipping the mathematical details, the above essentially implies that the gradient of a component w in the first layer
would depend on the gradients in the second and third layers. Similarly, the gradients in the second layer will depend
on the gradients in the third layer. Therefore, we start computing the derivatives in the reverse direction, starting
with the last layer, and use backpropagation to compute gradients of the previous layer.

Overall, in the process of backpropagation, the model error (difference between predicted and desired output) is
propagated back through the network, one layer at a time, and the weights are updated according to the amount they
contributed to the error.

Now, instead of using traditional backpropagation, RNNs use the backpropagation through time (BPTT) algorithm to
determine the gradient. In backpropagation, the model adjusts its parameters by propagating errors from the output
layer back to the input layer. BPTT additionally sums the error at each time step, because the RNN shares the same
parameters across all time steps; a toy sketch follows.
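
Below is a toy BPTT sketch for a scalar RNN with shared weights u and w; all values are illustrative assumptions. Note how the gradient with respect to the shared weight w accumulates a contribution from every time step:

import numpy as np

# scalar RNN: h_t = tanh(u*x_t + w*h_{t-1}); loss = sum over t of (h_t - y_t)^2
x = np.array([0.5, 0.1, -0.3])
y = np.array([0.4, 0.2, -0.1])
u, w = 0.8, 0.5

# forward pass, caching the hidden states (h[0] is the initial state)
h = [0.0]
for t in range(len(x)):
    h.append(np.tanh(u * x[t] + w * h[-1]))

# backward pass through time
dw, dh_next = 0.0, 0.0
for t in reversed(range(len(x))):
    dh = 2 * (h[t + 1] - y[t]) + dh_next   # local loss gradient + gradient from step t+1
    dz = dh * (1 - h[t + 1] ** 2)          # back through the tanh activation
    dw += dz * h[t]                        # shared-weight gradient sums over time steps
    dh_next = dz * w                       # gradient flowing back to step t-1
print("dL/dw =", dw)
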
Part 3: Introduction to Hyperparameters

Hyperparameters are the variables that are set before the training process, and they cannot be learned during training.
Neural networks have an abundance of hyperparameters, which makes them quite flexible. However, this flexibility
makes the model tuning process difficult. Understanding the hyperparameters and the intuition behind them helps
give an idea of what values are reasonable for each hyperparameter so we can restrict the search space.

Below are some common hyperparameters for neural network models.

1. Number of hidden layers and nodes

More hidden layers or nodes per layer means more parameters in the neural network, allowing the model to fit more
complex functions. To have a trained network that generalizes well, we need to pick an optimal number of hidden
layers, as well as an optimal number of nodes in each hidden layer. Too few nodes and layers will lead to high errors for the system, as
the predictive factors might be too complex for a small number of nodes to capture. Too many nodes and layers will
overfit to the training data and not generalize well.

There is no hard-and-fast rule to decide the number of layers and nodes.

2. Learning Rate

When we train RNNs, we use many iterations of BPTT to optimize the weights. At each iteration, we calculate the
derivative of the loss function with respect to each weight and subtract it from that weight. The learning rate
determines how quickly or slowly we want to update our weight (parameter) values. This learning rate should be high
enough so that it converges in a reasonable amount of time. Yet it should be low enough so that it finds the minimum
value of the loss function.

3. Activation Function

Activation functions define the output of a node as either being ON or OFF. They are used to introduce non-linearity
into models, allowing deep learning models to learn non-linear prediction boundaries. The most common activation
functions used in RNN modules are described below (a small numerical sketch follows the list):

• The sigmoid function has a range between 0 and 1. A large positive input results in an output close to 1; a
large negative input results in an output close to 0. It is also referred to as the logistic activation function.

• The tanh function is similar to the sigmoid function, but its output ranges from –1 to 1, with an equal mass
on both sides of the zero-axis.
• ReLU stands for the Rectified Linear Unit. So, if the input is a positive number, the function returns the number
itself, and if the input is a negative number, then the function returns zero. It is the most commonly used
function because of its simplicity.
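
As referenced above, a minimal NumPy sketch of the three functions (purely illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes any input into (0, 1)

def tanh(z):
    return np.tanh(z)                  # squashes any input into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # 0 for negative inputs, identity for positive

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))
print(tanh(z))
print(relu(z))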

4. Epoch

One round of updating the network for the entire training dataset is called an epoch. A network may be trained for
tens, hundreds, or many thousands of epochs depending on the data size and computational constraints.

5. Batch Size

The batch size is the number of training examples in one forward/backward pass. A batch size of 32 means that 32
samples from the training dataset will be used to estimate the error gradient before the model weights are updated.
The higher the batch size, the more memory space is needed.

Part 4: The Problem of Long-Term Dependencies

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task,
such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they’d
be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. In such cases, where the gap
between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.
Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

RNNs suffer from the problem of preserving the context for long-range sequences. In other words, RNNs are unable to
work with sequences that are very long. The effect of a given input on the hidden layer (and thus the output) either
decays exponentially or blows up and saturates as a function of time (or sequence length). The decay is called the
vanishing gradient problem (and the blow-up, the exploding gradient problem); a toy illustration follows.
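
To see why, note that backpropagating through many time steps multiplies the gradient by the recurrent weight once per step. A toy sketch (the weights 0.9 and 1.1 are arbitrary illustrative values):

# gradient repeatedly multiplied by the recurrent weight, once per time step
for w in (0.9, 1.1):
    grad = 1.0
    for t in range(100):
        grad *= w
    print(f"w={w}: gradient after 100 steps = {grad:.3e}")

With w = 0.9 the gradient shrinks to roughly 2.7e-05 (vanishing); with w = 1.1 it grows to roughly 1.4e+04 (exploding).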

Part 5: Introduction to the Long Short-Term Memory (LSTM) Model

Again, the problem with RNNs is that they simply store the previous data in their short-term memory. Once the memory
in it runs out, it simply deletes the longest retained information and replaces it with new data. The LSTM model
attempts to escape this problem by retaining selected information in long-term memory.

This long-term memory is stored in the so-called Cell State. The cell state is kind of like a conveyor belt. It runs straight
down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it
unchanged. In addition, there is also the hidden state, which we already know from the discussion earlier in RNN
architecture and in which short-term information from the previous calculation steps is stored. The hidden state is
the short-term memory of the model.
In each computational step, an LSTM unit receives three vectors (three lists of numbers) as input (represented as yellow
dots in the diagram below). Two vectors come from the LSTM itself and were generated by the LSTM at the previous
instant or timestep (timestep t-1). These are the Cell State or long-term memory c(t-1) and the previous hidden state
h(t-1). The third vector comes from outside. This is input x(t) submitted to the LSTM at timestep t.

Given the three input vectors (c, h, x), the LSTM regulates, through the gates, the internal flow of information and
transforms the values of the cell state and hidden state vectors.

In practice, the LSTM unit uses recent past information (the short-term memory, h) and new information coming from
the outside (the input vector, x) to update the long-term memory (cell state, c). Finally, it uses the long-term memory
(the cell state, c) to update the current short-term memory (the hidden state, h). The hidden state determined at
instant t is also the output of the LSTM unit at instant t.

LSTM Architecture | Photo based on Towards Data Science article An Intuitive Explanation of LSTM

These three values pass through the following gates on their way to a new Cell State and Hidden State (a numerical
sketch of the full cell follows this list):
1. In the so-called Forget Gate, it is decided which current and previous information is kept and which is thrown out.
This involves the hidden state from the previous timestep h(t-1) and the current input x(t). These values are
passed into a sigmoid function, which can only output values between 0 and 1. A value of 0 means that previous
information can be forgotten because there is possibly new, more important information; a value of 1 means,
accordingly, that the previous information is preserved. The result is multiplied element by element with c(t-1),
the current Cell State (i.e., the cumulative information stored in long-term memory up to this point), so that
knowledge that is no longer needed is forgotten: multiplied by 0, it drops out.

2. After removing some of the information from the cell state received in input c(t-1), the next step is to decide what
new information we’re going to store in the cell state. This activity is carried out by two neural networks: the
candidate memory and the input gate. The two neural networks are independent of each other.
The candidate memory is responsible for generating a candidate vector: a vector of information that is a
candidate to be added to the cell state. The candidate memory's output neurons use the tanh function, whose
properties ensure that all values of the candidate vector are between -1 and 1. This normalizes the
information that will be added to the cell state.

The input gate is responsible for generating a selector vector, which is multiplied element by element with the
candidate vector. A position where the selector vector has a value equal to 0 completely eliminates (in
the multiplication) the information at the same position in the candidate vector; a position where the
selector vector has a value equal to 1 leaves the information at the same position unchanged.

The result of the multiplication between the candidate vector and the selector vector is added to the Cell State
and forms the new Cell State or long-term memory c(t).

3. The output gate decides what each cell will yield as output. The yielded value will be based on the cell state along
with the newly filtered and added data.

A vector is generated by the output gate based on the values of x(t) and h(t-1) it receives as input. The output
gate uses the sigmoid function as the activation function of its output neurons. Then, we put the cell state through
tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate (a value between
0 and 1), so that the unit will only output the information h(t) the model decides is relevant for the next timestep t+1.

Part 6: Some Techniques on Hyperparameter Tuning for LSTM Models

The following are some of the most important hyperparameters when training an LSTM model. As discussed in Part 3,
configuring neural networks is difficult because there is no good theory on how to do it. You must be systematic and
explore different configurations, both from a dynamical and an objective point of view, to try to understand what is
going on for a given predictive modeling problem.

Before we get into tuning the most relevant hyperparameters for LSTM, it is worth noting that there are ways to
let your system find the hyperparameters for you by using optimization tools. These methods are useful for bypassing
the more manual process of identifying good hyperparameters and tuning them. Later, in Part 7, we will be using the
Keras family of tools.

It should be kept in mind that many such hyperparameters are volatile, in the sense that different values (or even the
same values across different runs) may yield different results. So, make sure you always compare models and
performance when tweaking these hyperparameters to get the optimum results.

1. Number of Hidden Layers and Nodes

The layers between the input and output layers are called hidden layers. This fundamental concept is why deep
learning networks are termed “black boxes”: they are often criticized for not being transparent and for predictions
that are not traceable by humans. There is no final number on how many nodes (hidden neurons) or hidden layers one
should use, so, depending on the individual problem (believe it or not), a trial-and-error approach will give the best
results.

As a general rule of thumb, one hidden layer will work for most simple problems, and two layers for reasonably
complex ones. Also, while many nodes within a layer (with regularization techniques) can increase accuracy, too few
nodes may cause underfitting.

2. Number of Units in a Dense Layer


A dense layer is the most frequently used layer; it is a layer where each neuron receives input from all
neurons in the previous layer (hence “densely connected”). Dense layers improve overall accuracy, and 5–10 units or
nodes per layer is a good base. The output shape of the final dense layer is determined by the number of neurons
or units specified.

3. Dropout Layer

Every LSTM layer should be accompanied by a dropout layer. Such a layer helps avoid overfitting in training by bypassing
randomly selected neurons, thereby reducing the sensitivity to the specific weights of individual neurons. While
dropout layers can be used with input layers, they shouldn’t be used with output layers, as that may corrupt the output
from the model and the calculation of the error. And while adding more complexity (by increasing the nodes in dense
layers or adding more dense layers) may risk overfitting and poor validation accuracy, this can be addressed by
adding dropout.

A good starting point is 20% but the dropout value should be kept small (up to 50%). The 20% value is widely accepted
as the best compromise between preventing model overfitting and retaining model accuracy.

4. Activation Function

Again, the choice of activation function depends on the application; however, the ReLU and tanh activation functions
are the most popular. Specific situations entail specific functions. For example, sigmoid activation is used in the output
layer for binary predictions, and softmax is used for multi-class predictions (softmax gives you the ability to interpret
the outputs as probabilities).

5. Learning Rate

This hyperparameter defines how quickly the network updates its parameters. Setting a higher learning rate accelerates
the learning but the model may not converge (a state during training where the loss settles to within an error range
around the final value), or even diverge. Conversely, a lower rate will slow down the learning drastically as steps
towards the minimum of loss function will be tiny, but will allow the model to converge smoothly.

Usually, a decaying learning rate is preferred. This hyperparameter is set for the training phase and takes a small
positive value, usually between 0.0 and 0.1.

6. Epochs

This hyperparameter sets how many complete passes through the dataset are to be run. While theoretically this
number can be any integer between one and infinity, in practice it should be increased until the validation accuracy
starts to decrease even as training accuracy increases (a sign of overfitting).

A pro move is to employ the early stopping method to first specify a large number of training epochs and stop training
once the model performance stops improving by a pre-set threshold on the validation dataset.

7. Batch Size

This hyperparameter defines the number of samples to work on before the internal parameters of the model are
updated. Large sizes make large gradient steps compared to smaller ones for the same number of samples “seen”.

Widely accepted, a good default value for batch size is 32. For experimentation, you can try multiples of 32, such as 64,
128 and 256.

8. Optimization Set-up

While not a hyperparameter per se, setting up an optimizer within your LSTM code can help train your learner better.
One technique is to add an adaptive optimizer like Adam to better handle the complex training dynamics of recurrent
neural networks (which plain gradient descent may not address); a sketch follows.
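
A sketch of such a set-up, pairing Adam with the decaying learning rate preferred under hyperparameter 5. The specific schedule and its values are illustrative assumptions:

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import ExponentialDecay

# start at 0.01 and decay the learning rate by 10% every 1,000 training steps
lr_schedule = ExponentialDecay(initial_learning_rate=0.01,
                               decay_steps=1000, decay_rate=0.9)
optimizer = Adam(learning_rate=lr_schedule)
# model.compile(optimizer=optimizer, loss='mean_squared_error')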

Part 7: Designing an LSTM Model for Forecasting Inflation

We will use the Philippine Consumer Price Index (CPI) dataset introduced in our ARIMA lecture. Here, we will compare
the performance of an LSTM model against the ARIMA forecasts from our last discussion. To recap, here is a chart of
the ARIMA forecasts vs. actual CPI figures.

For this exercise, we will develop an LSTM model in Python using the Keras deep learning library. Keras wraps the
efficient numerical computation libraries and functions and allows us to define and train LSTM neural network models
in a few short lines of code, compared to other libraries like Scalecast.

Also, for purposes of brevity, we will skip the exploratory data analysis already performed on this dataset when we
did our ARIMA models. IMPORTANT: If you decide to implement an LSTM model from the get-go and skip
ARIMA, it is still imperative that you perform EDA on your dataset (i.e., initial time series plotting, ACF and PACF plots,
seasonality and trend decomposition, density plots for distribution, outlier detection, etc.); a minimal sketch of these
steps follows.
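
For reference, a minimal sketch of those EDA steps (assuming statsmodels is available, matplotlib is imported as in Code Block #1, and the dataset has been loaded as in Code Block #2 below):

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose

dataset['CPI'].plot(title='PH CPI')                    # initial time series plot
plot_acf(dataset['CPI'])                               # autocorrelation
plot_pacf(dataset['CPI'])                              # partial autocorrelation
seasonal_decompose(dataset['CPI'], period=12).plot()   # trend and seasonality decomposition
dataset['CPI'].plot(kind='density')                    # distribution / outlier check
pyplot.show()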

CODE BLOCK #1: Importing required libraries and setting random seed

import numpy as np
import pandas as pd
from numpy import concatenate
from pandas import DataFrame
from pandas import Series
from pandas import concat
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.optimizers import SGD
from tensorflow.random import set_seed

Due to the stochastic nature of neural networks, Keras will output different results every time we run the code. To
prevent this, seeding a single random generator is not enough; we need to set seeds at the environment level, as below.
This will work with any non-negative integer, but if you want to reproduce the exact results of my run, please use
seed_value = 8888.

# fix random seed for reproducibility

seed_value = 8888

# 1. Set `PYTHONHASHSEED` environment variable at a fixed value


import os
os.environ['PYTHONHASHSEED']=str(seed_value)

# 2. Set `python` built-in pseudo-random generator at a fixed value


import random
random.seed(seed_value)

# 3. Set `numpy` pseudo-random generator at a fixed value


np.random.seed(seed_value)

# 4. Set `tensorflow` pseudo-random generator at a fixed value


tf.random.set_seed(seed_value)

# 5. Configure a new global `tensorflow` session


from tensorflow.compat.v1.keras.backend import set_session
session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
set_session(sess)

CODE BLOCK #2: Importing dataset and splitting it to train and test sets

# load dataset
path = r'C:\Users\cdeani\Documents\Duane DECSC 131\05 Kaggle Datasets\ph_inflation.csv'
dataset = pd.read_csv(path, header=0, usecols=['CPI'])

print(dataset.head())

# split data into train and test-sets


train, test = dataset[0:-12], dataset[-12:]

print(len(train), len(test))

Like the ARIMA example, we split the data into train and test sets, leaving 12 months for the test set in order to capture
pandemic data in the training of the model.

CODE BLOCK #3: Normalizing the scale of the dataset

# scale train and test data to [0, 1]


scaler = MinMaxScaler(feature_range=(0,1))

train_scaled = scaler.fit_transform(train)
# use transform (not fit_transform) on the test set so the scaler stays fit on the training data only
test_scaled = scaler.transform(test)
print(train_scaled.shape, test_scaled.shape)

LSTM models are sensitive to the scale of the input data, especially when the sigmoid or tanh activation functions are
used. It is good practice to rescale the data to the range of 0-to-1, also called normalizing:
x_scaled = (x − min) / (max − min). You can easily normalize the dataset using the MinMaxScaler preprocessing class
from the scikit-learn library.

Normalization ensures that the input (X) lies within the “good range” of the activation, the central region where the
sigmoid is roughly linear, and doesn’t reach the outer edges of the sigmoid function. If the input is in the good range,
then the activation does not saturate, and thus the derivative also stays in the good range (i.e., the derivative value
isn’t too small). Normalization thus prevents the gradients from becoming too small and makes sure that the gradient
signal is heard.

CODE BLOCK #4: Modifying the dataset to be a supervised learning dataset

# frame a sequence as a supervised learning problem


def timeseries_to_supervised(data, timestep=1):
    df = DataFrame(data)
    columns = [df.shift(i) for i in range(1, timestep+1)]
    columns.append(df)
    df = concat(columns, axis=1)
    df.fillna(0, inplace=True)
    return df

# transform train and test data to be supervised learning


train_sup = timeseries_to_supervised(train_scaled, 1)
train_new = train_sup.values

test_sup = timeseries_to_supervised(test_scaled, 1)
test_new = test_sup.values

print(train_sup.head())

A time series is a sequence of numbers that are ordered by a time index. Supervised learning is where we have input
variables (X) and an output variable (Y). Given a sequence of numbers for a time series dataset, we can restructure
the data into a set of predictor and predicted variables, just like in a supervised learning problem. We can do this by
using previous time steps as input variables and using the next time step as the output variable. Below is an
example:
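
For instance, given the toy series [10, 20, 30, 40] and timestep = 1 (an illustrative example, not module data), the function above produces:

X (previous value)   Y (current value)
0                    10
10                   20
20                   30
30                   40
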
We can see that the previous time step is the input (X) and the next time step is the output (Y) in our supervised
learning problem. The order between the observations is preserved and must continue to be preserved when using
this dataset to train a supervised model. With the code above, the missing first value in the lagged (X) column is
replaced with a 0 by fillna, so that there are no NaNs in the dataset.

CODE BLOCK #5: Reshaping the input data into an input that can be used by RNNs

# reshape input to be [sample, timesteps, features]


trainX, trainY = train_new[:, 0:-1], train_new[:, -1]
testX, testY = test_new[:, 0:-1], test_new[:, -1]

trainX = np.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))


trainY = np.reshape(trainY, (trainY.shape[0], 1))

testX = np.reshape(testX, (testX.shape[0], testX.shape[1], 1))


testY = np.reshape(testY, (testY.shape[0], 1))

print("X_train :",trainX.shape,"Y_train :",trainY.shape)


print("X_test :",testX.shape,"Y_test :",testY.shape)

The data was originally of the shape [samples, timesteps] and we will reshape it to [samples, timesteps, features] ,
where timesteps denotes the number of time steps in the input sequence and features denotes the number of features
in the input data. We are working with univariate series, so the number of features is one.

Running the code above will result in the reshaping below, where there are 212 datapoints in the X_train set, 1 timestep
(i.e., lag = 1), and 1 feature (i.e., the only predictor is the previous datapoint). Likewise, the X_test set has 12 datapoints
and the same number of timesteps/lags and features as the train set.

The Y_train and Y_test arrays are transformed from 1-dimensional arrays of shape [samples] into 2-dimensional
arrays of shape [samples, 1], where each row holds the output value for one sample.

CODE BLOCK #6: Fitting LSTM Model #1 and using it to predict against the test set

# fitting LSTM Model 1

# initializing the model


lstm1 = Sequential()

# adding LSTM with one input layer with 10 hidden units and tanh activation function
lstm1.add(LSTM(units=10, activation="tanh", input_shape=(trainX.shape[1], trainX.shape[2])))

# adding one output layer


lstm1.add(Dense(units=1))

# compiling model 1 using an optimizer


lstm1.compile(optimizer='adam', loss="mean_squared_error")

# adding an early stopping callback to avoid overfitting


# patience = 3 means training stops after three consecutive epochs with no improvement in the loss
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

# fitting model 1
lstm1.fit(trainX, trainY, batch_size=32, epochs=300, callbacks=[callback], shuffle=False)
lstm1.summary()

Model #1 has only one LSTM layer (i.e., there’s only one lstm1.add(LSTM(...)) line in the code), with 10 hidden units
and the tanh activation function (the default for LSTM models). The input_shape argument must be specified as a
tuple giving the number of time steps and the number of features.

Model #1 also has only one dense output layer. It employs the Adam optimizer and mean squared error as the loss
function. We also added an early stopping callback to avoid overfitting; it stops training when the loss fails to improve
for a set number of epochs.

Lastly, the batch size is often much smaller than the total number of samples. Together with the number of epochs, it
defines how quickly the network learns the data (how often the weights are updated). For this run, we will use a batch
size of 32 (a common default for time series LSTM models) and 300 epochs, for computational reasons.

Also, by default, the samples within an epoch are shuffled prior to being exposed to the network. Again, this is
undesirable for the LSTM because we want the network to build up state as it learns across the sequence of
observations. We can disable the shuffling of samples by setting the shuffle argument to False.

When you run the code above, the network reports debug information about the learning progress and the skill of
the model at the end of each epoch.

Next, we will use Model #1 to predict against the test set using the code below. An important step here is where the
predicted values are transformed back from the normalized state to their original scale using the inverse_transform()
function.

# predictions on test set

lstm1_pred = lstm1.predict(testX)
lstm1_pred = scaler.inverse_transform(lstm1_pred)

print(lstm1_pred)
The code above will print the forecasts. We will compile this with the forecasts of the other models that we will fit and
check them against the test set by calculating the RMSE.

CODE BLOCK #7: Fitting LSTM Model #2 and Model #3 and using them to predict against the test set

Models #2 and #3 also employ the Adam optimizer, MSE loss function, tanh activation, early stopping callback, input
shape, and one dense output layer, as in Model #1.

However, in Model #2, we will stack LSTM layers. This time we have two LSTM layers (i.e., two lstm2.add lines) and
more hidden neurons or units (we will set them equal to 50). Here, the goal is to see whether the forecasts improve
if we increase the number of units or neurons. We will also keep the batch size the same as in Model #1 (batch_size
= 32). I did, however, change the number of epochs for computational purposes and to avoid overfitting.

Model #3 does the opposite. We will use the same layer set-up as Model #1, but this time we will set batch_size = 1 and
epochs = 100. The purpose is to see whether smaller batches and a shorter run result in more accurate forecasts.

# fitting LSTM Model 2

# initializing the model


lstm2 = Sequential()

# adding LSTM with two input layers with 50 hidden units each and tanh activation function
lstm2.add(LSTM(units=50, activation="tanh", input_shape=(trainX.shape[1], trainX.shape[2]),
return_sequences=True))
lstm2.add(LSTM(units=50, activation="tanh"))

# adding one output layer


lstm2.add(Dense(units=1))

# compiling model 2 using an optimizer


lstm2.compile(optimizer='adam', loss="mean_squared_error")

# adding an early stopping callback to avoid overfitting


# patience = 3 means training stops after three consecutive epochs with no improvement in the loss
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

# fitting model 2
lstm2.fit(trainX, trainY, batch_size=32, epochs=100, callbacks=[callback], shuffle=False)
lstm2.summary()
# fitting LSTM Model 3 aka a more parsimonious model

# initializing the model


lstm3 = Sequential()

# adding LSTM with one input layer with 10 hidden units and tanh activation function
lstm3.add(LSTM(units=10, activation="tanh", input_shape=(trainX.shape[1], trainX.shape[2])))

# adding one output layer


lstm3.add(Dense(units=1))

# compiling model 3 using an optimizer


lstm3.compile(optimizer='adam', loss="mean_squared_error")

# adding an early stopping callback to avoid overfitting


# patience = 3 means training stops after three consecutive epochs with no improvement in the loss
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

# fitting model 3 with a batch size of 1 and 100 epochs


lstm3.fit(trainX, trainY, batch_size=1, epochs=100, callbacks=[callback], shuffle=False)
lstm3.summary()

We will then run the forecasts for Model #2 and Model #3 and compile them with the forecasts from Model #1. The
code below runs the forecasts.

# predictions on test set

lstm2_pred = lstm2.predict(testX)
lstm2_pred = scaler.inverse_transform(lstm2_pred)

print(lstm2_pred)

# predictions on test set

lstm3_pred = lstm3.predict(testX)
lstm3_pred = scaler.inverse_transform(lstm3_pred)

print(lstm3_pred)
CODE BLOCK #8: Fitting LSTM Model #4 and using it to predict against the test set

Model #4 will do something different. In Models #1 to #3, we used the time series data as is and let the LSTM
network learn the non-linearities, trend, and patterns of the data.

Recall from the ARIMA lecture that the PH Consumer Price Index dataset is not stationary. Model #4 will employ the
same parameters as Model #1 (from input layer up to the batch size and epochs), which is our benchmark model. This
time, however, we will transform the dataset into a stationary time series.

The below code blocks will transform the dataset into a stationary time series, normalize the input data, reshape the
input arrays, fit LSTM, and forecast against the test set.

# create a differenced series


def difference(dataset, interval=1):
    diff = list()
    for i in range(interval, len(dataset)):
        value = dataset[i] - dataset[i - interval]
        diff.append(value)
    return Series(diff)

# invert differenced value


def inverse_difference(history, yhat, interval=1):
    return yhat + history[-interval]

# transform data to be stationary


raw_values = dataset.values
diff_series = difference(raw_values, 1)

# split differenced series into train and test-sets


dtrain, dtest = diff_series[0:-12].values, diff_series[-12:].values

print(len(dtrain), len(dtest))

# scale differenced train and test data to [0, 1]


dscaler = MinMaxScaler(feature_range=(0,1))

dtrain = dtrain.reshape(-1,1)
dtrain_scaled = dscaler.fit_transform(dtrain)

dtest = dtest.reshape(-1,1)
# use transform (not fit_transform) so the scaler stays fit on the training data only
dtest_scaled = dscaler.transform(dtest)
print(dtrain_scaled.shape, dtest_scaled.shape)

# transform differenced train and test data to be supervised learning


dtrain_sup = timeseries_to_supervised(dtrain_scaled, 1)
dtrain_new = dtrain_sup.values

dtest_sup = timeseries_to_supervised(dtest_scaled, 1)
dtest_new = dtest_sup.values

print(dtrain_sup.head())

# reshape stationary input to be [sample, timesteps, features]


dtrainX, dtrainY = dtrain_new[:, 0:-1], dtrain_new[:, -1]
dtestX, dtestY = dtest_new[:, 0:-1], dtest_new[:, -1]

dtrainX = np.reshape(dtrainX, (dtrainX.shape[0], dtrainX.shape[1], 1))


dtrainY = np.reshape(dtrainY, (dtrainY.shape[0], 1))

dtestX = np.reshape(dtestX, (dtestX.shape[0], dtestX.shape[1], 1))


dtestY = np.reshape(dtestY, (dtestY.shape[0], 1))

print("X_train_diff :",dtrainX.shape,"Y_train :",dtrainY.shape)


print("X_test_diff :",dtestX.shape,"Y_test :",dtestY.shape)

The code blocks above prepared the dataset. The code block below fits the same LSTM network as Model #1, but using
the stationary data.

# fitting LSTM Model 4 using differenced series

# initializing the model


lstm4 = Sequential()

# adding LSTM with one input layer with 10 hidden units and tanh activation function
lstm4.add(LSTM(units=10, activation="tanh", input_shape=(dtrainX.shape[1], dtrainX.shape[2])))

# adding one output layer


lstm4.add(Dense(units=1))

# compiling model 4 using an optimizer


lstm4.compile(optimizer='adam', loss="mean_squared_error")

# adding an early stopping callback to avoid overfitting


# patience = 3 means training stops after three consecutive epochs with no improvement in the loss
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

# fitting model 4
lstm4.fit(dtrainX, dtrainY, batch_size=32, epochs=300, callbacks=[callback], shuffle=False)
lstm4.summary()

We will then run the forecasts and compile them with the forecasts from the other models. The code below runs the
forecasts. On top of the inverse scaler, we also need to de-difference the dataset; the third line in the code below does that.

# predictions on test set


lstm4_pred = lstm4.predict(dtestX)
lstm4_pred = dscaler.inverse_transform(lstm4_pred)
lstm4_pred = inverse_difference(raw_values, lstm4_pred, len(dtest_scaled)+1)

print(lstm4_pred)

CODE BLOCK #9: Fitting LSTM Model #5 and using it to predict against the test set

So far, we’ve fitted LSTM models using tanh activation. For Model #5, we will try to fit a network using the ReLU
activation function. Sometimes, with a tanh activation function, the model becomes hard to train: the loss decreases
very slowly, and this persists when we tune the output dimension of the LSTM layer. If we train the model with
more epochs, the loss might drop to a sufficiently low value, but the model likely ends up overfit. This suggests that
the tanh activation function is holding the model back.

Hence, we will see whether the forecasts improve with the ReLU activation function. Here, we will employ the same
optimizer, loss function, input shape, and single dense output layer.

To maximize the power of the ReLU activation function, which is unbounded above, we can increase the number of
hidden neurons and the number of epochs. ReLU activation also uses less computing memory than the tanh activation
function.

First, we need to rescale the data differently from the way we rescaled the past input data. The code blocks below
prepare the dataset for fitting with the ReLU activation function.

# split data into train and test-sets for ReLU model


rtrain, rtest = dataset[0:-12], dataset[-12:]

print(len(rtrain), len(rtest))

# scale train and test data for ReLU activation


rscaler = MinMaxScaler()

rtrain_scaled = rscaler.fit_transform(rtrain)
# use transform (not fit_transform) on the test set so the scaler stays fit on the training data only
rtest_scaled = rscaler.transform(rtest)

print(rtrain_scaled.shape, rtest_scaled.shape)

# transform train and test data to be supervised learning for ReLU


rtrain_sup = timeseries_to_supervised(rtrain_scaled, 1)
rtrain_new = rtrain_sup.values
rtest_sup = timeseries_to_supervised(rtest_scaled, 1)
rtest_new = rtest_sup.values

print(rtrain_sup.head())

# reshape ReLU input to be [sample, timesteps, features]


rtrainX, rtrainY = rtrain_new[:, 0:-1], rtrain_new[:, -1]
rtestX, rtestY = rtest_new[:, 0:-1], rtest_new[:, -1]

rtrainX = np.reshape(rtrainX, (rtrainX.shape[0], rtrainX.shape[1], 1))


rtrainY = np.reshape(rtrainY, (rtrainY.shape[0], 1))

rtestX = np.reshape(rtestX, (rtestX.shape[0], rtestX.shape[1], 1))


rtestY = np.reshape(rtestY, (rtestY.shape[0], 1))

print("X_train_for_ReLU :",rtrainX.shape,"Y_train :",rtrainY.shape)


print("X_test_for_ReLU :",rtestX.shape,"Y_test :",rtestY.shape)

The below code will run an LSTM network that employs 120 hidden neurons and 1,000 epochs.

# fitting LSTM Model 5 for ReLU

# initializing the model


lstm5 = Sequential()

# adding LSTM with one input layer with 120 hidden units and ReLU activation function
lstm5.add(LSTM(units=120, activation="relu", input_shape=(rtrainX.shape[1], rtrainX.shape[2])))

# adding one output layer


lstm5.add(Dense(units=1))

# compiling model 5 using an optimizer


lstm5.compile(optimizer='adam', loss="mean_squared_error")

# adding an early stopping callback to avoid overfitting


# patience = 3 means training stops after three consecutive epochs with no improvement in the loss
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

# fitting model 5
lstm5.fit(rtrainX, rtrainY, batch_size=32, epochs=1000, callbacks=[callback], shuffle=False)
lstm5.summary()

The code below runs the forecasts for Model #5. IMPORTANT: do not forget to inverse-scale, using the scaler we fit
for the ReLU model (rscaler).

# predictions on test set

lstm5_pred = lstm5.predict(rtestX)
lstm5_pred = rscaler.inverse_transform(lstm5_pred)

print(lstm5_pred)
Below, we compile all the forecasts from the ARIMA and LSTM models and compute the RMSE against the test set
data. Recall that RMSE = sqrt[(Σ(Pi – Oi)²) / n], where Pi is the predicted value, Oi is the actual value, and n is the
number of datapoints.
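
A minimal sketch of that compilation (assuming the actual CPI test values are still in `test` from Code Block #2; mean_squared_error was imported in Code Block #1):

# compute RMSE of each model's forecasts against the actual test values
actual = test.values
for name, pred in [('LSTM 1', lstm1_pred), ('LSTM 2', lstm2_pred),
                   ('LSTM 3', lstm3_pred), ('LSTM 4', lstm4_pred),
                   ('LSTM 5', lstm5_pred)]:
    rmse = np.sqrt(mean_squared_error(actual, pred))
    print(name, 'RMSE: %.3f' % rmse)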

You’ll see that, in general, the LSTM models far outperformed the ARIMA models. Among the LSTM models, Model #1
and Model #5 produced the best results. The RMSE for Model #4 shows that using a stationary time series produces a
dampened forecast. The parsimonious Model #3 produced the worst results, even worse than the ARIMA models;
hence, the data benefits from larger batch sizes. Model #2 was better than the ARIMA models, but using one LSTM
layer seems better than using two. The graph below visualizes the performance of the models.

CODE BLOCK #10: 12-month ahead forecasts using LSTM Model #1 and LSTM Model #5
We will now produce 12-month-ahead forecasts using LSTM Models #1 and #5 and compare the two. As in the
ARIMA lecture, we will forecast using the entire dataset. For fairness and comparability, we will run 1,000 epochs and
a batch size of 32 for both.

# 12-month ahead forecast on the entire dataset using best LSTM model(s)
# we will use LSTM 1 and LSTM 5 as they have the lowest RMSEs

from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

series1 = scaler.fit_transform(dataset)
series5 = rscaler.fit_transform(dataset)

n_input = 12
n_features = 1
gen1 = TimeseriesGenerator(series1, series1, length=n_input, batch_size=32)
gen5 = TimeseriesGenerator(series5, series5, length=n_input, batch_size=32)

# run LSTM 1 on the entire series (Model.fit accepts generators in TF 2; fit_generator is deprecated)


lstm1.fit(gen1, epochs=1000)

# run LSTM 5 on the entire series


lstm5.fit(gen5, epochs=1000)

The code below produces the 12-month-ahead forecasts.

# forecasting the next 12 months using LSTM1

lstm1_pred_list = []

batch = series1[-n_input:].reshape((1, n_input, n_features))

for i in range(n_input):
    lstm1_pred_list.append(lstm1.predict(batch)[0])
    batch = np.append(batch[:,1:,:], [[lstm1_pred_list[i]]], axis=1)

lstm1_series_pred = scaler.inverse_transform(lstm1_pred_list)
print(lstm1_series_pred)
# forecasting the next 12 months using LSTM5

lstm5_pred_list = []

batch = series5[-n_input:].reshape((1, n_input, n_features))

for i in range(n_input):
    lstm5_pred_list.append(lstm5.predict(batch)[0])
    batch = np.append(batch[:,1:,:], [[lstm5_pred_list[i]]], axis=1)

lstm5_series_pred = rscaler.inverse_transform(lstm5_pred_list)
print(lstm5_series_pred)
Compiling the forecasts below:

Here, we can see that LSTM Model #5 produced 12-month-ahead forecasts of the PH Consumer Price Index with an
upward bias. How do we check, moving forward, which model has the better performance now that we have
forecasted 12 months ahead on the entire dataset?

The next step would be to perform holdout validation. For every new data point that gets provided (i.e., new CPI data
every month starting September 2023), we will check how far the predicted value is from the actual value.

This process is called model backtesting. For the PH CPI time series analysis, we will need to perform monthly
backtesting, as new data is provided on a monthly basis. A common practice is to track the monthly RMSE, the moving
3-month RMSE, and the moving 6-month RMSE, as sketched below. If the RMSE measures keep getting bigger, the
model is producing poor forecasts and we will need to update it.
