Short-Term Load Forecasting With Deep Residual Networks


IEEE TRANSACTIONS ON SMART GRID, VOL. 10, NO. 4, JULY 2019

Kunjin Chen, Kunlong Chen, Qin Wang, Ziyu He, Jun Hu, Member, IEEE, and Jinliang He, Fellow, IEEE

Abstract—We present in this paper a model for forecasting short-term electric load based on deep residual networks. The proposed model is able to integrate domain knowledge and researchers' understanding of the task by virtue of different neural network building blocks. Specifically, a modified deep residual network is formulated to improve the forecast results. Further, a two-stage ensemble strategy is used to enhance the generalization capability of the proposed model. We also apply the proposed model to probabilistic load forecasting using Monte Carlo dropout. Three public datasets are used to prove the effectiveness of the proposed model. Multiple test cases and comparison with existing models show that the proposed model provides accurate load forecasting results and has high generalization capability.

Index Terms—Short-term load forecasting, deep learning, deep residual network, probabilistic load forecasting.

Manuscript received October 1, 2017; revised February 1, 2018 and March 18, 2018; accepted May 29, 2018. Date of publication June 5, 2018; date of current version June 19, 2019. Paper no. TSG-01422-2017. (Corresponding author: Jinliang He.)
K. Chen, J. Hu, and J. He are with the State Key Laboratory of Power Systems, Department of Electrical Engineering, Tsinghua University, Beijing 100084, China (e-mail: [email protected]).
K. Chen is with the Department of Electrical Engineering, Beijing Jiaotong University, Beijing 100044, China.
Q. Wang is with the Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zürich, Switzerland.
Z. He is with the Department of Industrial and Systems Engineering, University of Southern California, Los Angeles, CA 90007 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSG.2018.2844307

I. INTRODUCTION

THE FORECASTING of power demand is of crucial importance for the development of modern power systems. The stable and efficient management, scheduling, and dispatch in power systems rely heavily on precise forecasting of future loads on various time horizons. In particular, short-term load forecasting (STLF) focuses on the forecasting of loads from several minutes up to one week into the future [1]. A reliable STLF helps utilities and energy providers deal with the challenges posed by the higher penetration of renewable energies and the development of electricity markets with increasingly complex pricing strategies in future smart grids.

Various STLF methods have been proposed by researchers over the years. Some of the models used for STLF include linear or nonparametric regression [2], [3], support vector regression (SVR) [1], [4], autoregressive models [5], the fuzzy-logic approach [6], etc. Reviews and evaluations of existing methods can be found in [7]–[10]. Building STLF systems with artificial neural networks (ANN) has long been one of the mainstream solutions to this task. As early as 2001, a review paper by Hippert et al. surveyed and examined a collection of papers that had been published between 1991 and 1999, and arrived at the conclusion that most of the proposed models were over-parameterized and that the results they had to offer were not convincing enough [11]. In addition to the fact that the size of neural networks grows rapidly with the increase in the number of input variables, hidden nodes, or hidden layers, other criticisms mainly focus on the "over-fitting" issue of neural networks [1]. Nevertheless, different types and variants of neural networks have been proposed and applied to STLF, such as radial basis function (RBF) neural networks [12], wavelet neural networks [13], [14], and extreme learning machines (ELM) [15], to name a few.

Recent developments in neural networks, especially deep neural networks, have had great impact on fields including computer vision, natural language processing, and speech recognition [16]. Instead of sticking with fixed shallow structures of neural networks with hand-designed features as inputs, researchers are now able to integrate their understanding of different tasks into the network structures. Different building blocks, including convolutional neural networks (CNN) [17] and long short-term memory (LSTM) [18], have allowed deep neural networks to be highly flexible and effective. Various techniques have also been proposed so that neural networks with many layers can be trained effectively without the vanishing of gradients or severe overfitting. Applying deep neural networks to short-term load forecasting is a relatively new topic. Researchers have been using restricted Boltzmann machines (RBM) and feed-forward neural networks with multiple layers in forecasting of demand-side loads and natural gas loads [19], [20]. However, these models become increasingly hard to train as the number of layers increases, thus the number of hidden layers is often considerably small (e.g., 2 to 5 layers), which limits the performance of the models.

In this work, we aim at extending existing structures of ANN for STLF by adopting state-of-the-art deep neural network structures and implementation techniques. Instead of stacking multiple hidden layers between the input and the output, we learn from the residual network structure proposed in [21] and propose a novel end-to-end neural network model capable of forecasting the loads of the next 24 hours. An ensemble strategy to combine multiple individual networks is also proposed. Further, we extend the model to probabilistic load forecasting by adopting Monte Carlo (MC) dropout (for a comprehensive review of probabilistic electric load forecasting, the reader is referred to [22] and [23]).
The contributions of this work are twofold. First, a fully end-to-end model based on deep residual networks for STLF is proposed. The proposed model does not involve external feature extraction or feature selection algorithms, and only raw data of loads, temperature, and information that is readily available are used as inputs. The results show that the forecasting performance can be greatly enhanced by improving the structure of the neural networks and adopting the ensemble strategy. As complicated feature engineering techniques and additional information (e.g., humidity, wind speed, cloud cover, etc.) are not involved, we provide a good benchmark that can be easily compared with. In addition, the building blocks of the proposed model can also be adapted to existing neural-network-based STLF models. Combining the building blocks with existing feature extraction and feature selection techniques is straightforward and may lead to further improvement in accuracy. Additional data can also be easily incorporated. Second, a new formulation of probabilistic STLF for an ensemble of neural networks is proposed. By using MC dropout, we can directly obtain the probabilistic forecasting results using the models trained for the task of point forecasting.

The remainder of the paper is organized as follows. In Section II, we formulate the proposed model based on deep residual networks. The ensemble strategy, the MC dropout method, as well as the implementation details are also provided. In Section III, the results of STLF by the proposed model are presented. We also discuss the performance of the proposed model and compare it with existing methods. Section IV concludes this paper and proposes future works. The source code for the STLF model proposed in this paper is available at https://github.com/yalickj/load-forecasting-resnet.

II. SHORT-TERM LOAD FORECASTING BASED ON DEEP RESIDUAL NETWORKS

In this paper, we propose a day-ahead load forecasting model based on deep residual networks. We first formulate the low-level basic structure, where the inputs of the model are processed by several fully connected layers to produce preliminary forecasts of 24 hours. The preliminary forecasts are then passed through a deep residual network. After presenting the structure of the deep residual network, some modifications are made to further enhance its learning capability. An ensemble strategy is designed to enhance the generalization capability of the proposed model. The formulation of MC dropout for probabilistic forecasting is also provided.

A. Model Input and the Basic Structure for Load Forecasting of One Hour

We use the model with the basic structure to give preliminary forecasts of the 24 hours of the next day. Specifically, the inputs used to forecast the load for the hth hour of the next day, L_h, are listed in Table I. The values for loads and temperatures are normalized by dividing by the maximum value of the training dataset. The selected inputs allow us to capture both short-term closeness and long-term trends in the load and temperature time series [24]. More specifically, we expect that L_h^month, L_h^week, T_h^month, and T_h^week can help the model identify long-term trends in the time series (the days with the same day-of-week index as the next day are selected, as they are more likely to have similar load characteristics [13]), while L_h^day and T_h^day are able to provide short-term closeness and characteristics. The input L_h^hour feeds the loads of the most recent 24 hours to the model. Forecast loads are used to replace the values in L_h^hour that are not available at the time of forecasting, which also helps associate the forecasts of the whole day. Note that the sizes of the above-mentioned inputs can be adjusted flexibly. In addition, one-hot codes for season,¹ weekday/weekend distinction, and holiday/non-holiday² distinction (denoted S, W, and H, respectively) are added to help the model capture the periodic and unordinary temporal characteristics of the load time series.

TABLE I. Inputs for the load forecast of the hth hour of the next day.

¹In this paper, the ranges for Spring, Summer, Autumn, and Winter are March 8th to June 7th, June 8th to September 7th, September 8th to December 7th, and December 8th to March 7th, respectively.
²In this paper, we consider three major public holidays, namely, Christmas Eve, Thanksgiving Day, and Independence Day, as the activities involved in these holidays have great impacts on the loads. The rest of the holidays are considered as non-holidays for simplicity.
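Table I itself is not reproduced in this extraction, so the following sketch only illustrates the kind of indexing implied by the description above. The numbers of past months, weeks, and days kept, and the 28-day (same day-of-week) approximation of a month, are assumptions rather than the paper's exact specification:

```python
import numpy as np

def build_inputs(load, temp, day, h, n_month=3, n_week=4, n_day=7):
    """Assemble the per-hour inputs described above for hour h (0-23) of the target day.

    `load` and `temp` are 1-D hourly arrays indexed as 24 * day + hour, already
    normalized by the maximum of the training set. The sizes used here are assumptions.
    """
    t = 24 * day + h
    # same hour on the same day of week, 1..n_week weeks and 1..n_month "months" (28 days) before
    L_week  = load[[t - 7 * 24 * k for k in range(1, n_week + 1)]]
    T_week  = temp[[t - 7 * 24 * k for k in range(1, n_week + 1)]]
    L_month = load[[t - 28 * 24 * k for k in range(1, n_month + 1)]]
    T_month = temp[[t - 28 * 24 * k for k in range(1, n_month + 1)]]
    # same hour of the previous n_day days
    L_day = load[[t - 24 * k for k in range(1, n_day + 1)]]
    T_day = temp[[t - 24 * k for k in range(1, n_day + 1)]]
    # the most recent 24 hourly loads before the hour to be forecast
    L_hour = load[t - 24:t]
    return L_month, L_week, L_day, T_month, T_week, T_day, L_hour
```

The one-hot calendar codes S, W, and H would be built from the date of the target day in the same fashion.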
The structure of the neural network model for load forecasting of one hour is illustrated in Fig. 1. For L_h^month, L_h^week, L_h^day, T_h^month, T_h^week, and T_h^day, we first concatenate the pairs [L_h^month, T_h^month], [L_h^week, T_h^week], and [L_h^day, T_h^day], and connect them with three separate fully-connected layers. The three fully-connected layers are then concatenated and connected with another fully-connected layer denoted as FC2. For L_h^hour, we forward pass it through two fully-connected layers, the second of which is denoted as FC1. S and W are concatenated to produce two fully-connected layers, one used as part of the input of FC1, the other used as part of the input of FC2. H is also connected to FC2. In order to produce the output L_h, we concatenate FC1, FC2, and T_h, and connect them with a fully-connected layer, which is then connected to L_h with another fully-connected layer. All fully-connected layers but the output layer use scaled exponential linear units (SELU) as the activation function.

Fig. 1. The structure of the neural network model for load forecasting of one hour.
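As a concrete illustration, the wiring just described can be sketched with the Keras functional API (the library the authors report using in Section II-E). The layer widths follow Section II-E (10 hidden nodes, 5 for the calendar branch), but the input lengths, the one-hot sizes, and some details of how S, W, and H are merged are assumptions based on the description above, not the authors' released code:

```python
from tensorflow.keras import layers, Model

def one_hour_model(n_month=3, n_week=4, n_day=7):
    # Inputs; the lengths are placeholders, the real sizes are given in Table I
    Lm, Tm = layers.Input((n_month,)), layers.Input((n_month,))
    Lw, Tw = layers.Input((n_week,)), layers.Input((n_week,))
    Ld, Td = layers.Input((n_day,)), layers.Input((n_day,))
    Lhour = layers.Input((24,))
    S, W, H = layers.Input((4,)), layers.Input((2,)), layers.Input((2,))
    Th = layers.Input((1,))                                  # temperature of the target hour

    dense = lambda n: layers.Dense(n, activation="selu")     # a fresh SELU layer per call
    # pairwise concatenations followed by three separate fully-connected layers
    fm = dense(10)(layers.Concatenate()([Lm, Tm]))
    fw = dense(10)(layers.Concatenate()([Lw, Tw]))
    fd = dense(10)(layers.Concatenate()([Ld, Td]))
    # the calendar codes S and W feed two small layers, one for FC1 and one for FC2
    sw1 = dense(5)(layers.Concatenate()([S, W]))
    sw2 = dense(5)(layers.Concatenate()([S, W]))
    # FC1: second layer on top of the recent-24-hour branch
    fc1 = dense(10)(layers.Concatenate()([dense(10)(Lhour), sw1]))
    # FC2: merges the three long-term branches, the second calendar layer, and H
    fc2 = dense(10)(layers.Concatenate()([fm, fw, fd, sw2, H]))
    # final layers: one SELU layer, then a linear output layer for the hourly load
    hidden = dense(10)(layers.Concatenate()([fc1, fc2, Th]))
    Lh = layers.Dense(1)(hidden)                             # output layer without SELU
    return Model([Lm, Tm, Lw, Tw, Ld, Td, Lhour, S, W, H, Th], Lh)
```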
The adoption of the ReLU has greatly improved the performance of deep neural networks [25]. Specifically, ReLU has the form

$$\mathrm{ReLU}(y_i) = \max(0, y_i) \qquad (1)$$

where y_i is the linear activation of the ith node of a layer. A problem with ReLU is that if a unit cannot be activated by any input in the dataset, the gradient-based optimization algorithm is unable to update the weights of the unit, so the unit will never be activated again. In addition, the network becomes very hard to train if a large proportion of the hidden units produce constant 0 gradients [26]. This problem can be solved by adding a slope to the negative half axis of ReLU. With a simple modification to the formulation of ReLU on the negative half axis, we get PReLU [27]. The activations of a layer with PReLU as the activation function are obtained by

$$\mathrm{PReLU}(y_i) = \begin{cases} y_i & \text{if } y_i > 0 \\ \beta_i y_i & \text{if } y_i \le 0 \end{cases} \qquad (2)$$

where β_i is the coefficient controlling the slope of β_i y_i when y_i ≤ 0. A further modification to ReLU that induces self-normalizing properties is provided in [28], where the activation function of SELU is given by

$$\mathrm{SELU}(y_i) = \lambda \begin{cases} y_i & \text{if } y_i > 0 \\ \alpha e^{y_i} - \alpha & \text{if } y_i \le 0 \end{cases} \qquad (3)$$

where λ and α are two tunable parameters. It is shown in [28] that if we have λ ≈ 1.0507 and α ≈ 1.6733, the outputs of the layers in a fully-connected neural network approach the standard normal distribution when the inputs follow the standard normal distribution. This helps the networks prevent the problems of vanishing and exploding gradients.
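The three activation functions can be written directly in NumPy; this small sketch is only for illustration (the β value shown is a typical initialization, not a value taken from the paper):

```python
import numpy as np

def relu(y):
    """Eq. (1)."""
    return np.maximum(0.0, y)

def prelu(y, beta=0.25):
    """Eq. (2); beta is the learnable slope on the negative half axis."""
    return np.where(y > 0, y, beta * y)

def selu(y, lam=1.0507, alpha=1.6733):
    """Eq. (3) with the fixed-point parameters reported in [28]."""
    return lam * np.where(y > 0, y, alpha * np.exp(y) - alpha)
```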
As previously mentioned, in order to associate the forecasts of the 24 hours of the next day, the corresponding values within L_h^hour are replaced by {L_1, ..., L_{h-1}} for h > 1. Instead of simply copying the values, we maintain the neural network connections underneath them. Thus, the gradients of subsequent hours can be propagated backward through time. This would help the model adjust the forecast value of each hour given the inputs and forecast values of the rest of the hours.

We then concatenate {L_1, ..., L_24} as L, which directly becomes the output of the model with the basic structure. Next, we proceed to formulate the deep residual network and add it on top of L. The output of the deep residual network is denoted as ŷ and has the same size as L.

B. The Deep Residual Network Structure for Day-Ahead Load Forecasting

In [21], an innovative way of constructing deep neural networks for image recognition is proposed. In this paper, the residual block in Fig. 2 is used to build the deep neural network structure. In the residual block, instead of learning a mapping from x to H(x), a mapping from x to F(x, Θ) is learned, where Θ is the set of weights (and biases) associated with the residual block. Thus, the overall representation of the residual block becomes

$$H(x) = F(x, \Theta) + x \qquad (4)$$

Fig. 2. The building block of the deep residual network. SELU is used as the activation function between two linear layers.
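Assuming the Keras functional API used elsewhere in this paper, a residual block of this kind (two linear layers with SELU in between, with a hidden width of 20 and an output width of 24 as specified later in Section II-E) might be sketched as follows; the function name and default arguments are ours:

```python
from tensorflow.keras import layers

def residual_block(x, hidden=20, out_dim=24):
    # F(x, Theta): a linear layer with SELU activation followed by a second linear layer
    f = layers.Dense(hidden, activation="selu")(x)
    f = layers.Dense(out_dim)(f)
    # H(x) = F(x, Theta) + x, Eq. (4): identity shortcut added to the block output
    return layers.Add()([x, f])
```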
A deep residual network can be easily constructed by stacking a number of residual blocks. We illustrate in Fig. 3 the structure of the deep residual network (ResNet) used for the proposed model. More specifically, if K residual blocks are stacked, the forward propagation of such a structure can be represented by

$$x_K = x_0 + \sum_{i=1}^{K} F(x_{i-1}, \Theta_{i-1}) \qquad (5)$$

where x_0 is the input of the residual network, x_K the output of the residual network, and Θ_i = {θ_{i,l} | 1 ≤ l ≤ L} the set of weights associated with the ith residual block, L being the number of layers within the block. The back-propagation of the overall loss of the neural network to x_0 can then be calculated as

$$\frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_K}\left(1 + \frac{\partial}{\partial x_0}\sum_{i=1}^{K} F(x_{i-1}, \Theta_{i-1})\right) \qquad (6)$$

where L is the overall loss of the neural network. The "1" in the equation indicates that the gradients at the output of the network can be directly back-propagated to the input of the network, so that the vanishing of gradients (which is often observed when the gradients at the output have to go through many layers before reaching the input) is much less likely to occur [29]. As a matter of fact, this equation can also be applied to any pair (x_i, x_j) with 0 ≤ i < j ≤ K, where x_i and x_j are the output of the ith residual block (or the input of the network when i = 0) and the jth residual block, respectively.

Fig. 3. An illustration of the deep residual network (ResNet) structure. More shortcut connections are made in addition to the ones within the blocks. In this figure, every three residual blocks have one shortcut connection, and another shortcut connection is made from the input to the output. Each round node averages all of its inputs.

In addition to the stacked residual blocks, extra shortcut connections can be added into the deep residual network, as introduced in [30]. Concretely, two levels of extra shortcut connections are added to the network. The lower-level shortcut connection bypasses several adjacent residual blocks, while the higher-level shortcut connection is made between the input and the output. If more than one shortcut connection reaches a residual block or the output of the network, the values from the connections are averaged. Note that after adding the extra shortcut connections, the formulations of the forward propagation of responses and the back-propagation of gradients are slightly different, but the characteristics of the network that we care about remain unchanged.

We can further improve the learning ability of ResNet by modifying its structure. Inspired by the convolutional network structures proposed in [31] and [32], we propose the modified deep residual network (ResNetPlus), whose structure is shown in Fig. 4. First, we add a series of side residual blocks to the model (the residual blocks on the right). Unlike the implementation in [32], the input of the side residual blocks is the output of the first residual block on the main path (except for the first side residual block, whose input is the input of the network). The output of each main residual block is averaged with the output of the side residual block in the same layer (indicated by the blue dots on the right). Similar to the densely connected network in [31], the outputs of those blue dots are connected to all main residual blocks in subsequent layers. Starting from the second layer, the input of each main residual block is obtained by averaging all connections from the blue dots on the right together with the connection from the input of the network (indicated by the blue dots on the main path). It is expected that the additional side residual blocks and the dense shortcut connections can improve the representation capability and the efficiency of error back-propagation of the network. Later in this paper, we will compare the performance of the basic structure, the basic structure connected with ResNet, and the basic structure connected with ResNetPlus.

Fig. 4. An illustration of the modified deep residual network (ResNetPlus) structure. The blue dots in the figure average their inputs, and the outputs are connected to subsequent residual blocks.
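Under the same assumptions as the previous sketch, stacking K residual blocks as in Eq. (5) and adding the two extra levels of shortcut connections (one every few blocks and one from the input to the output, with averaging wherever several connections meet) could look like the sketch below. This is a simplified reading of the ResNet structure of Fig. 3, not the ResNetPlus variant with its side blocks and dense connections:

```python
from tensorflow.keras import layers

def stacked_resnet(x0, K=30, group=5):
    """Stack K residual blocks (residual_block from the previous sketch), add a
    lower-level shortcut every `group` blocks and a higher-level shortcut from the
    input to the output; meeting connections are averaged as described above."""
    x, group_input = x0, x0
    for i in range(1, K + 1):
        x = residual_block(x)
        if i % group == 0:
            x = layers.Average()([x, group_input])   # lower-level shortcut
            group_input = x
    return layers.Average()([x, x0])                 # higher-level shortcut: input to output
```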
C. The Ensemble Strategy of Multiple Models

It is widely acknowledged in the field of machine learning that an ensemble of multiple models has higher generalization capability than individual models [16]. In [33], an analysis of neural network ensembles for STLF of office buildings is provided by the authors. Results show that an ensemble of neural networks reduces the variance of performances. A demonstration of the ensemble strategy used in this paper is shown in Fig. 5. More specifically, the ensemble strategy consists of two stages.

Fig. 5. A demonstration of the ensemble strategy used in this paper. The snapshot models are taken where the slope of the validation loss is considerably small.

The first stage of the strategy takes several snapshots during the training of a single model. Huang et al. [34] show that setting cyclic learning rate schedules for the stochastic gradient descent (SGD) optimizer greatly improves the performance of existing deep neural network models. In this paper, as we use Adam (abbreviated from adaptive moment estimation [35]) as the optimizer, the learning rates for each iteration are decided adaptively. Thus, no learning rate schedules are set by ourselves. This scheme is similar to the NoCycle snapshot ensemble method discussed in [34]; that is, we take several snapshots of the same model during its training process (e.g., the 4 snapshots along the training process of the model with initial parameters W_0^(1)). As indicated in Fig. 5, the snapshots are taken after an appropriate number of epochs, so that the loss of each snapshot is at a similar level.
We can further ensemble a number of models that are trained independently. This is done by simply re-initializing the parameters of the model (e.g., W_0^(1) to W_0^(5) are 5 sets of initial parameters sampled from the same distribution used for initializing the model), which is one of the standard practices for obtaining good ensemble models [36]. The numbers of snapshots and re-trained models are hyper-parameters, which means they can be tuned using the validation dataset. After we obtain all the snapshot models, we average the outputs of the models to produce the final forecast.
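A minimal sketch of this second-stage averaging is given below; the snapshot file paths are hypothetical and stand for the saved snapshots of all independently initialized training runs:

```python
import numpy as np
from tensorflow.keras.models import load_model

def ensemble_forecast(snapshot_paths, inputs):
    """Average the 24-hour forecasts of every saved snapshot model
    (several snapshots per run times several re-initialized runs)."""
    forecasts = [load_model(p, compile=False).predict(inputs)  # compile=False: inference only
                 for p in snapshot_paths]
    return np.mean(forecasts, axis=0)                          # final forecast = mean of all outputs
```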
D. Probabilistic Forecasting Based on Monte Carlo Dropout

If we look at the deep residual network (either ResNet or ResNetPlus) as an ensemble of relatively shallow networks, the increased width and number of connections in the network can provide more shallow networks to form the ensemble model [32]. It is expected that the relatively shallow networks themselves can partially capture the nature of the load forecasting task, and that multiple shallow networks with the same input can give varied outputs. This indicates that the proposed model has the potential to be used for probabilistic load forecasting.

Probabilistic forecasting of time series can be fulfilled by capturing the uncertainty within the models [37]. From a Bayesian probability theory point of view, the predictive probability of a Bayesian neural network can be obtained with

$$p(y^* \mid x^*) = \int_{W} p\left(y^* \mid f^{W}(x^*)\right) p(W \mid X, Y)\, dW \qquad (7)$$

where X and Y are the observations we use to train f^W(·), a neural network with parameters W. The intractable posterior distribution p(W | X, Y) is often approximated by various inference methods [37]. In this paper, we use MC dropout [38] to obtain the probabilistic forecasting uncertainty, which is easy and computationally efficient to implement. Specifically, dropout refers to the technique of randomly dropping out hidden units in a neural network during the training of the network [39], and a parameter p is used to control the probability that any hidden neuron is dropped out. If we apply dropout stochastically M times at test time and collect the outputs of the network, we can approximate the first term of the forecasting uncertainty, which is

$$\mathrm{Var}\left(y^* \mid x^*\right) = \mathrm{Var}\left[\mathbb{E}\left(y^* \mid W, x^*\right)\right] + \mathbb{E}\left[\mathrm{Var}\left(y^* \mid W, x^*\right)\right] = \mathrm{Var}\left(f^{W}(x^*)\right) + \sigma^2 \approx \frac{1}{M}\sum_{m=1}^{M}\left(\hat{y}^*_{(m)} - \bar{\hat{y}}^*\right)^2 + \sigma^2 \qquad (8)$$

where ŷ*_(m) is the mth output we obtain, ȳ̂* is the mean of all M outputs, and E denotes the expectation operator. The second term, σ², measures the inherent noise of the data generating process. According to [37], σ² can be estimated using an independent validation dataset. We denote the validation dataset by X′ = {x′_1, ..., x′_V}, Y′ = {y′_1, ..., y′_V}, and estimate σ² by

$$\sigma^2 = \frac{\beta}{V}\sum_{v=1}^{V}\left(y'_v - f^{\hat{W}}(x'_v)\right)^2 \qquad (9)$$

where f^Ŵ(·) is the model trained on the training dataset and β is a parameter to be estimated, also using the validation dataset.

We need to extend the above estimation procedure to an ensemble of models. Concretely, for an ensemble of K neural network models of the same structure, we estimate the first term of (8) with a single model of the same structure trained with dropout. The parameter β in (9) is also estimated by this model. More specifically, we find the β that provides the best 90% and 95% interval forecasts on the validation dataset. σ² is then estimated by replacing f^Ŵ(·) in (9) with the ensemble model, f*(·). Note that the estimation of σ² is specific to each hour of the day.

After obtaining the forecasting uncertainty for each forecast, we can calculate the α-level interval with the point forecast, f*(x*), and its corresponding quantiles to obtain probabilistic forecasting results.
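A simplified sketch of the MC dropout step is given below, assuming a tf.keras model running eagerly (so that dropout can be kept active with training=True). For brevity it centers the interval on the MC sample mean rather than on the ensemble point forecast f*(x*) used in the paper, and takes σ² as already estimated via Eq. (9):

```python
import numpy as np

def mc_dropout_interval(model, x, sigma2, M=100, z=1.96):
    """Draw M stochastic forward passes with dropout active, approximate the
    variance of Eq. (8), and form a z-score prediction interval around the mean."""
    samples = np.stack([np.asarray(model(x, training=True)) for _ in range(M)])
    mean = samples.mean(axis=0)                      # MC estimate of the forecast
    std = np.sqrt(samples.var(axis=0) + sigma2)      # model uncertainty + inherent noise
    return mean - z * std, mean + z * std            # e.g. z = 1.96 for a 95% interval
```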
E. Model Design and Implementation Details

The proposed model consists of the neural network structure for load forecasting of one hour (referred to as the basic structure), the deep residual network (referred to as ResNet) for improving the forecasts of the 24 hours, and the modified deep residual network (referred to as ResNetPlus). The configurations of the models are elaborated as follows.

1) The Model With the Basic Structure: The graphic representation of the model with the basic structure is shown in Fig. 1. Each fully-connected layer for [L_h^day, T_h^day], [L_h^week, T_h^week], [L_h^month, T_h^month], and L_h^hour has 10 hidden nodes, while the fully-connected layers for [S, W] have 5 hidden nodes. FC1, FC2, and the fully-connected layer before L_h have 10 hidden nodes. All but the output layer use SELU as the activation function.

2) The Deep Residual Network (ResNet): ResNet is added to the neural network with the basic structure. Each residual block has a hidden layer with 20 hidden nodes and SELU as the activation function. The size of the outputs of the blocks is 24, which is the same as that of the inputs. A total of 30 residual blocks are stacked, forming a 60-layer deep residual network. The second level of shortcut connections is made every 5 residual blocks. The shortcut path of the highest level connects the input and the output of the network.

3) The Modified Deep Residual Network (ResNetPlus): The structure of ResNetPlus follows the structure shown in Fig. 4. The hyper-parameters inside the residual blocks are the same as those of ResNet.

In order to properly train the models, the loss of the model, L, is formulated as the sum of two terms:

$$L = L_E + L_R \qquad (10)$$

where L_E measures the error of the forecasts, and L_R is an out-of-range penalty term used to accelerate the training process. Specifically, L_E is defined as

$$L_E = \frac{1}{NH}\sum_{i=1}^{N}\sum_{h=1}^{H}\frac{\left|\hat{y}_{(i,h)} - y_{(i,h)}\right|}{y_{(i,h)}} \qquad (11)$$

where ŷ_(i,h) and y_(i,h) are the output of the model and the actual normalized load for the hth hour of the ith day, respectively, N is the number of data samples, and H is the number of hourly loads within a day (i.e., H = 24 in this case). This error measure, widely known as the mean absolute percentage error (MAPE), is also used to evaluate the forecast results of the models. The second term, L_R, is calculated as

$$L_R = \frac{1}{2N}\sum_{i=1}^{N}\left[\max\left(0,\, \max_h \hat{y}_{(i,h)} - \max_h y_{(i,h)}\right) + \max\left(0,\, \min_h y_{(i,h)} - \min_h \hat{y}_{(i,h)}\right)\right] \qquad (12)$$

This term penalizes the model when the forecast daily load curves are out of the range of the actual load curves, thus accelerating the beginning stage of the training process. When a model is able to produce forecasts with relatively high accuracy, this term serves to emphasize the cost of overestimating the peaks and underestimating the valleys of the load curves.
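One reading of Eqs. (10)-(12) as a TensorFlow loss function is sketched below, assuming y_true and y_pred hold the 24 normalized hourly loads of each day in the batch with shape (batch, 24); this is an illustration, not the authors' released code:

```python
import tensorflow as tf

def forecast_loss(y_true, y_pred):
    """Loss of Eq. (10): MAPE term of Eq. (11) plus the out-of-range penalty of Eq. (12)."""
    l_e = tf.reduce_mean(tf.abs(y_pred - y_true) / y_true)                     # Eq. (11)
    over = tf.nn.relu(tf.reduce_max(y_pred, axis=1) - tf.reduce_max(y_true, axis=1))
    under = tf.nn.relu(tf.reduce_min(y_true, axis=1) - tf.reduce_min(y_pred, axis=1))
    l_r = 0.5 * tf.reduce_mean(over + under)                                   # Eq. (12)
    return l_e + l_r
```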
All the models are trained using the Adam optimizer with default parameters as suggested in [35]. The models are implemented using Keras 2.0.2 with TensorFlow 1.0.1 as the backend in the Python 3.5 environment [40], [41]. A laptop with an Intel Core i7-5500U CPU is used to train the models. Training the ResNetPlus model with data of three years for 700 epochs takes approximately 1.5 hours. When 5 individual models are trained, the total training time is less than 8 hours.

III. RESULTS AND DISCUSSION

In this section, we use the North-American Utility dataset³ and the ISO-NE dataset⁴ to verify the effectiveness of the proposed model. As we use actual temperature as the input, we further modify the temperature values to evaluate the performance of the proposed model. Results of probabilistic forecasting on the North-American Utility dataset and the GEFCom2014 dataset [42] are also provided.

³Available at https://class.ee.washington.edu/555/el-sharkawi.
⁴Available at https://www.iso-ne.com/isoexpress/web/reports/load-and-demand.

A. Performance of the Proposed Model on the North-American Utility Dataset

The first test case uses the North-American Utility dataset. This dataset contains load and temperature data at one-hour resolution for a north-American utility. The dataset covers the time range between January 1st, 1985 and October 12th, 1992. The data of the two-year period prior to October 12th, 1992 is used as the test set, and the data prior to the test set is used for training the model. More specifically, two starting dates, namely, January 1st, 1986, and January 1st, 1988, are used for the training sets. As the latter starting date is used in experiments in the literature, we tune the hyper-parameters using the last 10% of the training set with this starting date.⁵ The model trained with the training set containing 2 years of extra data has the same hyper-parameters.

⁵For this dataset, 4 snapshots are taken between 1200 and 1350 epochs for 8 individual models. For the basic structure, all layers except the input and the output layers are shared for the 24 hours (sharing weights for the 24 hours is only implemented in this test case). The ResNetPlus model has 30 layers on the main path.

Before reporting the performance of the ensemble model obtained by combining multiple individual models, we first look at the performance of the three models mentioned in Section II. The test losses of the three models are shown in Fig. 6 (the models are trained with the training set starting with January 1st, 1988). In order to yield credible results, we train each model 5 times and average the losses to obtain the solid lines in the figure. The coloured areas indicate the range between one standard deviation above and below the average losses. It is observed in the figure that ResNet is able to improve the performance of the model, and that a further reduction in loss can be achieved when ResNetPlus is implemented. Note that the results to be reported in the rest of this paper are all obtained with the ensemble model. For simplicity, the ensemble model with the basic structure connected with ResNetPlus is referred to as "the ResNetPlus model" hereinafter.

Fig. 6. Test losses of the neural network with the basic structure (Basic), the model with the deep residual network (Basic + ResNet), and the model with the modified deep residual network (Basic + ResNetPlus). Each model is trained 5 times with shuffled weight initialization. The solid lines are the average losses, and the standard deviations above and below the average losses are indicated by coloured areas.
TABLE II. Comparison of the proposed ResNetPlus model with existing models on the North-American Utility dataset with respect to MAPE (%).

We compare the results of the proposed ResNetPlus model with existing models proposed in [1] and [43]–[48], as shown in Table II. In order to estimate the performance of the models when forecast temperature is used, we also add Gaussian noise with mean 0 °F and standard deviation 1 °F to the temperature input and report the MAPE in this case. It is seen in the table that the proposed model outperforms existing models, which highly depend on external feature extraction, feature selection, or hyper-parameter optimization techniques. The proposed model also has a lower increase of MAPE when modified temperature is applied. In addition, the test loss can be further reduced when more data is added to the training set.

B. Performance of the Proposed Model on the ISO-NE Dataset

The second task of the paper is to examine the generalization capability of the proposed model. To this end, we use the majority of the hyper-parameters of ResNetPlus tuned with the North-American Utility dataset to train load forecasting models for the ISO-NE dataset (the time range of the dataset is between March 2003 and December 2014). Here, the ResNetPlus structure has 10 layers on the main path.

The first test case is to predict the daily loads of the year 2006 in the ISO-NE dataset. For the proposed ResNetPlus model, the training period is from June 2003 to December 2005⁶ (we reduce the size of L_h^month and T_h^month to 3 so that more training samples can be used, and the rest of the hyper-parameters are unchanged). In comparison, the similar day-based wavelet neural network (SIWNN) model in [13] is trained with data from 2003 to 2005, while the models proposed in [46] and [49] use data from March 2003 to December 2005 (both models use past loads up to 200 hours prior to the hour to be predicted). The results of MAPEs with respect to each month are listed in Table III. The MAPEs for the 12 months in 2006 are not explicitly reported in [49]. It is seen in the table that the proposed ResNetPlus model has the lowest overall MAPE for the year 2006. For some months, however, the WT-ELM-MABC model proposed in [46] produces better results. Nevertheless, as most of the hyper-parameters are not tuned on the ISO-NE dataset, we can conclude that the proposed model has good generalization capability across different datasets.

⁶The training dataset is used to determine how the snapshots are taken for the ensemble model for the ISO-NE dataset. For each implementation, 5 individual models are trained, and the snapshots are taken at 600, 650, and 700 epochs.

TABLE III. MAPEs (%) of the proposed ResNetPlus model for the ISO-NE dataset in 2006 and a comparison with existing models.

We further test the generalization capability of the proposed ResNetPlus model on data of the years 2010 and 2011. The same model used for the year 2006 is used for this test case, and historical data from 2004 to 2009 is used to train the model. In Table IV, we report the performance of the proposed model and compare it with the models mentioned in [12], [49], and [50]. Results show that the proposed ResNetPlus model outperforms existing models with respect to the overall MAPE for the two years, and an improvement of 8.9% is achieved for the year 2011. Note that all the existing models are specifically tuned on the ISO-NE dataset for the period from 2004 to 2009, while the design of the proposed ResNetPlus model is directly implemented without any tuning.

TABLE IV. Comparison of the proposed ResNetPlus model with existing models on the ISO-NE dataset for 2010 and 2011.

As we use actual temperature values for the input of the proposed model (except for the "modified temperature" case of the North-American Utility dataset), the results we have obtained previously provide us with an estimated upper bound of the performance of the model. Thus, we need to further analyze how the proposed model would perform when forecast temperature data is used, and whether the ensemble model is more robust to noise in forecast weather. We follow the way of modifying temperature values introduced in [43], and consider three cases of temperature modification:

• Case 1: add Gaussian noise with mean 0 °F and standard deviation 1 °F to the original temperature values before normalization.
• Case 2: add Gaussian noise with mean 0 °F and change the standard deviation of Case 1 to 2 °F.
• Case 3: add Gaussian noise with mean 0 °F and change the standard deviation of Case 1 to 3 °F.

For all three cases, we repeat the trials 5 times and calculate the means and standard deviations of increased MAPE compared with the case where actual temperature data is used.
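One way to generate the perturbed temperature inputs for the three cases is sketched below; the function name and the use of NumPy's default random generator are our choices, not details given in the paper:

```python
import numpy as np

def perturb_temperature(temp_f, std, rng=None):
    """Add zero-mean Gaussian noise (in degrees Fahrenheit) to the raw temperature
    series before normalization; std = 1.0, 2.0, or 3.0 corresponds to Cases 1-3."""
    if rng is None:
        rng = np.random.default_rng()
    return temp_f + rng.normal(0.0, std, size=temp_f.shape)

# e.g. repeat 5 trials per case:
# for std in (1.0, 2.0, 3.0):
#     noisy_temp = perturb_temperature(temperature_series, std)
```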
The results of increased test MAPEs for the year 2006 with modified temperature values are shown in Fig. 7. We compare the performance of the proposed ResNetPlus model (which is an ensemble of 15 single snapshot models) with a single snapshot model trained with 700 epochs. As can be seen in the figure, the ensemble strategy greatly reduces the increase of MAPE, especially for Case 1, where the increase of MAPE is 0.0168%. As the smallest reported increase of MAPE for Case 1 in [1] is 0.04%, it is reasonable to conclude that the proposed model is robust against the uncertainty of temperature for Case 1 (as we use a different dataset here, the results are not directly comparable). It is also observed that the ensemble strategy is able to reduce the standard deviation over multiple trials. This also indicates the higher generalization capability of the proposed model with the ensemble strategy.

Fig. 7. The comparison of the proposed model with the ensemble strategy and the proposed model without ensemble when different cases of modified temperature are applied. The model without ensemble is a single ResNetPlus model trained with 700 epochs.
C. Probabilistic Forecasting for the Ensemble Model

We first use the North-American Utility dataset to demonstrate the probabilistic STLF by MC dropout. The last year of the dataset is used as the test set and the previous year is used for validation. Dropout with p = 0.1 is added to the previously implemented ensemble model⁷ except for the input layer and the output layer (dropout with p ranging from 0.05 to 0.2 produces similar results, similar to the results reported in [38]). The first term in (8) is estimated by a single model trained with 500 epochs (with M = 100 for (8) and p = 0.1), and the estimated value of β is 0.79.

⁷The model implemented here uses ResNet instead of ResNetPlus, and the information of season, weekday/weekend distinction, and holiday/non-holiday distinction is not used. In addition, the activation function used for the residual blocks is ReLU.

The empirical coverages produced by the proposed model with respect to different z-scores are listed in Table V, and an illustration of the 95% prediction intervals for two weeks in 1992 is provided in Fig. 8. The results show that the proposed model with MC dropout is able to give satisfactory empirical coverages for different intervals.

TABLE V. Empirical coverages of the proposed model with MC dropout.

Fig. 8. Actual load and 95% prediction intervals for a winter week (left) and a summer week (right) of 1992 for the North-American Utility dataset. The two weeks start with February 3rd, 1992, and July 6th, 1992, respectively.

In order to quantify the performance of probabilistic STLF by MC dropout, we adopt the pinball loss and the Winkler score mentioned in [23] and use them to assess the proposed method in terms of coverage rate and interval width. Specifically, the pinball loss is averaged over all quantiles and hours in the prediction range, and the Winkler scores are averaged over all the hours of the year in the test set.
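Generic implementations of the two measures are sketched below; the exact averaging conventions used for Table VI follow [23] and may differ in detail from these illustrative functions:

```python
import numpy as np

def pinball_loss(y_true, q_forecasts, quantiles):
    """Pinball loss averaged over all hours and all quantile levels.
    q_forecasts has shape (n_hours, n_quantiles); quantiles are levels such as 0.01..0.99."""
    total = 0.0
    for j, q in enumerate(quantiles):
        diff = y_true - q_forecasts[:, j]
        total += np.mean(np.maximum(q * diff, (q - 1.0) * diff))
    return total / len(quantiles)

def winkler_score(y_true, lower, upper, alpha=0.05):
    """Winkler score of a (1 - alpha) interval, averaged over all hours: the interval
    width plus a penalty of 2/alpha times the amount by which the interval is missed."""
    width = upper - lower
    penalty = np.where(y_true < lower, 2.0 / alpha * (lower - y_true), 0.0) \
            + np.where(y_true > upper, 2.0 / alpha * (y_true - upper), 0.0)
    return float(np.mean(width + penalty))
```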
We implement the ResNetPlus model⁸ on the GEFCom2014 dataset and compare the results with those reported in [23] and [51]. Following the setting in [23], the load and temperature data from 2006 to 2009 is used to train the proposed model, the data of the year 2010 is used for validation, and the test results are obtained using data of the year 2011. The temperature values used for the input of the model are calculated as the mean of the temperature values of all 25 weather stations in the dataset.

⁸Five individual models are trained with a dropout rate of 0.1, and 6 snapshots are taken from 100 epochs to 350 epochs. M is set to 100 for MC dropout, and the first term in (8) is estimated by a single model trained with 100 epochs. The estimated value of β is 0.77.

In Table VI, we present the values of the pinball loss and Winkler scores for the proposed model and the models in [23] and [51] for the year 2011 in the GEFCom2014 dataset. The Lasso method in [51] serves as a benchmark for methods that build regression models on the input data, and the quantile regression averaging (QRA) method in [23] builds quantile regression models on sister point forecasts (the row of Ind stands for the performance of a single model).

TABLE VI. Comparison of probabilistic forecasting performance measures for the year 2011 in the GEFCom2014 dataset.

It can be seen in Table VI that the proposed ResNetPlus model is able to provide improved probabilistic forecasting results compared with existing methods in terms of the pinball loss and the two Winkler scores. As we obtain the probabilistic forecasting results by sampling the trained neural networks with MC dropout, we can conclude that the proposed model is good at capturing the uncertainty of the task of STLF.
IV. CONCLUSION AND FUTURE WORK

We have proposed an STLF model based on deep residual networks in this paper. The low-level neural network with the basic structure, the ResNetPlus structure, and the two-stage ensemble strategy enable the proposed model to have high accuracy as well as satisfactory generalization capability. Two widely acknowledged public datasets are used to verify the effectiveness of the proposed model with various test cases. Comparisons with existing models have shown that the proposed model is superior in both forecasting accuracy and robustness to temperature variation. We have also shown that the proposed model can be directly used for probabilistic forecasting when MC dropout is adopted.

A number of paths for further work are attractive. As we have only scratched the surface of the state of the art of deep neural networks, we may introduce more building blocks of deep neural networks (e.g., CNN or LSTM) into the model to enhance its performance. In addition, we will further investigate the implementation of deep neural networks for probabilistic STLF and make further comparisons with existing methods.

REFERENCES

[1] E. Ceperic, V. Ceperic, and A. Baric, "A strategy for short-term load forecasting by support vector regression machines," IEEE Trans. Power Syst., vol. 28, no. 4, pp. 4356–4364, Nov. 2013.
[2] K.-B. Song, Y.-S. Baek, D. H. Hong, and G. Jang, "Short-term load forecasting for the holidays using fuzzy linear regression method," IEEE Trans. Power Syst., vol. 20, no. 1, pp. 96–101, Feb. 2005.
[3] W. Charytoniuk, M. S. Chen, and P. V. Olinda, "Nonparametric regression based short-term load forecasting," IEEE Trans. Power Syst., vol. 13, no. 3, pp. 725–730, Aug. 1998.
[4] E. E. Elattar, J. Goulermas, and Q. H. Wu, "Electric load forecasting based on locally weighted support vector regression," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 40, no. 4, pp. 438–447, Jul. 2010.
[5] J. W. Taylor, "Short-term electricity demand forecasting using double seasonal exponential smoothing," J. Oper. Res. Soc., vol. 54, no. 8, pp. 799–805, Aug. 2003.
[6] M. Rejc and M. Pantos, "Short-term transmission-loss forecast for the Slovenian transmission power system based on a fuzzy-logic decision approach," IEEE Trans. Power Syst., vol. 26, no. 3, pp. 1511–1521, Aug. 2011.
[7] E. A. Feinberg and D. Genethliou, "Load forecasting," in Applied Mathematics for Restructured Electric Power Systems, J. H. Chow, F. F. Wu, and J. Momoh, Eds. New York, NY, USA: Springer, 2005, pp. 269–285.
[8] J. W. Taylor, L. M. D. Menezes, and P. E. Mcsharry, "A comparison of univariate methods for forecasting electricity demand up to a day ahead," Int. J. Forecast., vol. 22, no. 1, pp. 1–16, Jan./Mar. 2006.
[9] H. Hahn, S. Meyer-Nieberg, and S. Pickl, "Electric load forecasting methods: Tools for decision making," Eur. J. Oper. Res., vol. 199, no. 3, pp. 902–907, Dec. 2009.
[10] Y. Wang, Q. Chen, T. Hong, and C. Kang, "Review of smart meter data analytics: Applications, methodologies, and challenges," IEEE Trans. Smart Grid, to be published, doi: 10.1109/TSG.2018.2818167.
[11] H. S. Hippert, C. E. Pedreira, and R. C. Souza, "Neural networks for short-term load forecasting: A review and evaluation," IEEE Trans. Power Syst., vol. 16, no. 1, pp. 44–55, Feb. 2001.
[12] C. Cecati, J. Kolbusz, P. Różycki, P. Siano, and B. M. Wilamowski, "A novel RBF training algorithm for short-term electric load forecasting and comparative studies," IEEE Trans. Ind. Electron., vol. 62, no. 10, pp. 6519–6529, Oct. 2015.
[13] Y. Chen et al., "Short-term load forecasting: Similar day-based wavelet neural networks," IEEE Trans. Power Syst., vol. 25, no. 1, pp. 322–330, Feb. 2010.
[14] Y. Zhao, P. B. Luh, C. Bomgardner, and G. H. Beerel, "Short-term load forecasting: Multi-level wavelet neural networks with holiday corrections," in Proc. Power Energy Soc. Gen. Meeting, Calgary, AB, Canada, 2009, pp. 1–7.
[15] R. Zhang, Z. Y. Dong, Y. Xu, K. Meng, and K. P. Wong, "Short-term load forecasting of Australian national electricity market by an ensemble model of extreme learning machine," IET Gener. Transm. Distrib., vol. 7, no. 4, pp. 391–397, Apr. 2013.
[16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[19] S. Ryu, J. Noh, and H. Kim, "Deep neural network based demand side short term load forecasting," Energies, vol. 10, no. 1, p. 3, 2016.
[20] G. Merkel, R. J. Povinelli, and R. H. Brown, "Deep neural network regression for short-term load forecasting of natural gas," in Proc. 37th Annu. Int. Symp. Forecast., 2017. [Online]. Available: https://isf.forecasters.org/wp-content/uploads/ISF2017-Proceedings.pdf
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, 2016, pp. 770–778.
[22] T. Hong and S. Fan, "Probabilistic electric load forecasting: A tutorial review," Int. J. Forecast., vol. 32, no. 3, pp. 914–938, Jul./Sep. 2016.
[23] B. Liu, J. Nowotarski, T. Hong, and R. Weron, "Probabilistic load forecasting via quantile regression averaging on sister forecasts," IEEE Trans. Smart Grid, vol. 8, no. 2, pp. 730–737, Mar. 2017.
[24] J. Zhang, Y. Zheng, D. Qi, R. Li, and X. Yi, "DNN-based prediction model for spatio-temporal data," in Proc. 24th ACM SIGSPATIAL Int. Conf. Adv. Geograph. Inf. Syst., Burlingame, CA, USA, 2016, p. 92.
[25] G. E. Dahl, T. N. Sainath, and G. E. Hinton, "Improving deep neural networks for LVCSR using rectified linear units and dropout," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Vancouver, BC, Canada, 2013, pp. 8609–8613.
[26] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. Int. Conf. Mach. Learn., vol. 30, 2013, p. 3.

[27] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Comput. Vis., Santiago, Chile, 2015, pp. 1026–1034.
[28] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, "Self-normalizing neural networks," in Proc. Adv. Neural Inf. Process. Syst., Long Beach, CA, USA, 2017, pp. 972–981.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 630–645.
[30] K. Zhang et al., "Residual networks of residual networks: Multilevel residual networks," IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 6, pp. 1303–1314, Jun. 2018, doi: 10.1109/TCSVT.2017.2654543.
[31] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 1, Honolulu, HI, USA, 2017, pp. 2261–2269.
[32] L. Zhao et al., "Deep convolutional neural networks with merge-and-run mappings," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2018.
[33] M. De Felice and X. Yao, "Short-term load forecasting with neural network ensembles: A comparative study [application notes]," IEEE Comput. Intell. Mag., vol. 6, no. 3, pp. 47–56, Aug. 2011.
[34] G. Huang et al., "Snapshot ensembles: Train 1, get M for free," presented at the 5th Int. Conf. Learn. Represent. (ICLR), 2017.
[35] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," presented at the 3rd Int. Conf. Learn. Represent. (ICLR), 2015.
[36] A. R. Webb, Statistical Pattern Recognition. Chichester, U.K.: Wiley, 2003.
[37] L. Zhu and N. Laptev, "Deep and confident prediction for time series at Uber," in Proc. IEEE Int. Conf. Data Min. Workshops, New Orleans, LA, USA, 2017, pp. 103–110.
[38] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in Proc. 33rd Int. Conf. Mach. Learn., New York, NY, USA, 2016, pp. 1050–1059.
[39] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, Jan. 2014.
[40] F. Chollet et al. (2015). Keras. [Online]. Available: https://github.com/fchollet/keras
[41] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symp. Oper. Syst. Design Implement., vol. 16, Savannah, GA, USA, 2016, pp. 265–283.
[42] T. Hong et al., "Probabilistic energy forecasting: Global energy forecasting competition 2014 and beyond," Int. J. Forecast., vol. 32, no. 3, pp. 896–913, 2016.
[43] A. J. R. Reis and A. P. A. Da Silva, "Feature extraction via multiresolution analysis for short-term load forecasting," IEEE Trans. Power Syst., vol. 20, no. 1, pp. 189–198, Feb. 2005.
[44] N. Amjady and F. Keynia, "Short-term load forecasting of power systems by combination of wavelet transform and neuro-evolutionary algorithm," Energy, vol. 34, no. 1, pp. 46–57, Jan. 2009.
[45] A. Deihimi and H. Showkati, "Application of echo state networks in short-term electric load forecasting," Energy, vol. 39, no. 1, pp. 327–340, Mar. 2012.
[46] S. Li, P. Wang, and L. Goel, "Short-term load forecasting by wavelet transform and evolutionary extreme learning machine," Elect. Power Syst. Res., vol. 122, pp. 96–103, May 2015.
[47] Z. Hu, Y. Bao, and T. Xiong, "Comprehensive learning particle swarm optimization based memetic algorithm for model selection in short-term load forecasting using support vector regression," Appl. Soft Comput., vol. 25, pp. 15–25, Dec. 2014.
[48] S. Li, P. Wang, and L. Goel, "A novel wavelet-based ensemble method for short-term load forecasting with hybrid neural networks and feature selection," IEEE Trans. Power Syst., vol. 31, no. 3, pp. 1788–1798, May 2016.
[49] S. Li, L. Goel, and P. Wang, "An ensemble approach for short-term load forecasting by extreme learning machine," Appl. Energy, vol. 170, pp. 22–29, May 2016.
[50] H. Yu, P. D. Reiner, T. Xie, T. Bartczak, and B. M. Wilamowski, "An incremental design of radial basis function networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 10, pp. 1793–1803, Oct. 2014.
[51] F. Ziel and B. Liu, "Lasso estimation for GEFCom2014 probabilistic electric load forecasting," Int. J. Forecast., vol. 32, no. 3, pp. 1029–1037, 2016.

Kunjin Chen received the B.Sc. degree in electrical engineering from Tsinghua University, Beijing, China, in 2015, where he is currently pursuing the Ph.D. degree with the Department of Electrical Engineering. His research interests include applications of machine learning and data science in power systems.

Kunlong Chen received the B.Sc. degree in electrical engineering from Beijing Jiaotong University, Beijing, China, in 2015 and the engineering degree from CentraleSupélec, Paris, France, in 2015. He is currently pursuing the M.Sc. degree with the Department of Electrical Engineering, Beijing Jiaotong University. His research interests include applications of statistical learning techniques in the field of electrical engineering.

Qin Wang received the B.Sc. degree in electrical engineering from Tsinghua University in 2015. He is currently pursuing the master's degree with ETH Zürich, Switzerland. His research interests include computer vision and deep learning applications.

Ziyu He received the B.Sc. degree from Zhejiang University, Hangzhou, China, in 2015 and the M.S. degree from Columbia University, NY, USA, in 2017. He is currently pursuing the Ph.D. degree with the Department of Industrial and Systems Engineering, University of Southern California, CA, USA. His research interests are optimization and machine learning and their applications in energy.

Jun Hu (M'10) received the B.Sc., M.Sc., and Ph.D. degrees in electrical engineering from the Department of Electrical Engineering, Tsinghua University, Beijing, China, in 1998, 2000, and 2008, respectively. He is currently an Associate Professor with the Department of Electrical Engineering, Tsinghua University. His research interests include overvoltage analysis in power systems, sensors and big data, dielectric materials, and surge arrester technology.

Jinliang He (M'02–SM'02–F'08) received the B.Sc. degree from the Wuhan University of Hydraulic and Electrical Engineering, Wuhan, China, in 1988, the M.Sc. degree from Chongqing University, Chongqing, China, in 1991, and the Ph.D. degree from Tsinghua University, Beijing, China, in 1994, all in electrical engineering. He became a Lecturer in 1994 and an Associate Professor in 1996 with the Department of Electrical Engineering, Tsinghua University. From 1997 to 1998, he was a Visiting Scientist with the Korea Electrotechnology Research Institute, Changwon, South Korea. From 2014 to 2015, he was a Visiting Professor with the Department of Electrical Engineering, Stanford University, Palo Alto, CA, USA. In 2001, he was promoted to Professor with Tsinghua University, where he is currently the Chair of the High Voltage Research Institute. He has authored seven books and 600 technical papers. His research interests include advanced power transmission technology, sensing technology and big data mining, and smart nanodielectric materials.
