Article:
Cao, J., Harrold, D., Fan, Z. et al. (3 more authors) (2020) Deep Reinforcement Learning-Based Energy Storage Arbitrage With Accurate Lithium-ion Battery Degradation Model. IEEE Transactions on Smart Grid, 11 (5), pp. 4513-4521. ISSN 1949-3053.
https://doi.org/10.1109/TSG.2020.2986333
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other uses, in any current or future media, including reprinting/republishing
this material for advertising or promotional purposes, creating new collective works, for
resale or redistribution to servers or lists, or reuse of any copyrighted component of this
work in other works. Uploaded in accordance with the publisher's self-archiving policy.
Abstract—Accurate estimation of battery degradation cost is one of the main barriers to batteries participating in the energy arbitrage market. This paper addresses this problem by using a model-free deep reinforcement learning (DRL) method to optimize battery energy arbitrage while considering an accurate battery degradation model. Firstly, the control problem is formulated as a Markov Decision Process (MDP). Then a noisy-network-based deep reinforcement learning approach is proposed to learn an optimized control policy for the storage charging/discharging strategy. To address the uncertainty of electricity prices, a hybrid Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM) model is adopted to predict the price for the next day. Finally, the proposed approach is tested on historical UK wholesale electricity market prices. The results, compared with model-based Mixed Integer Linear Programming (MILP), demonstrate the effectiveness and performance of the proposed framework.

Index Terms—Energy storage, Energy arbitrage, Battery degradation, Deep reinforcement learning, Noisy Networks

This work is partly supported by the SEND project (Grant REF. 32R16P00706) funded by ERDF and BEIS, the Royal Society Research Grant (REF. RGS/R1/191395) and EnergyREV (EP/S031863/1).
J. Cao is with the School of Geography, Geology and the Environment, Keele University, UK ST5 5BG (corresponding author: [email protected]).
Dan Harrold and Prof. Z. Fan are with the School of Computing and Mathematics, Keele University, UK ST5 5BG (email: [email protected]).
T. Morstyn is with the Department of Engineering Science, University of Oxford, Oxford OX1 2JD, U.K. ([email protected]).
Prof. David Healey is the managing director of smart grid solutions and also Professor in Practice at Keele University (email: [email protected]).
Prof. Kang Li is with the School of Electronics and Electrical Engineering, University of Leeds, Leeds, UK (e-mail: [email protected]).

I. INTRODUCTION

ENERGY storage systems can improve the flexibility of power systems by providing various ancillary services to system operators, e.g. load shifting, frequency regulation, voltage support and grid stabilization [1]. Among these, energy arbitrage represents the largest profit opportunity for battery storage. In electricity markets, storage can take advantage of daily energy price fluctuations to buy the cheapest energy available during periods of low demand and sell it at the highest price in order to generate profits.

Extensive research has been conducted on the optimisation of the energy storage arbitrage problem to maximise revenue. In [2] and [3], a mixed integer linear approach was developed to optimise the storage dispatch and maximise profits in real-time markets in the United States and Germany, respectively. To handle the uncertainty in electricity prices, a scenario-based stochastic formulation was developed in [4] for battery energy arbitrage in both the day-ahead and real-time markets. The authors of [5] present a bidding mechanism based on two-stage stochastic programming for a group of storage units that participate in the day-ahead reserve market. Apart from the above stochastic optimization approaches, robust optimization is also widely used to handle uncertainty. In [6], a robust-optimization-based bidding strategy was shown to have an increasing probability of yielding better economic performance than a deterministic-optimization-based bidding strategy as the forecast error in electricity prices increases. In [7], an affinely adjustable robust bidding strategy for solar power with a battery storage system was proposed to address the uncertainties of both PV solar power production and electricity prices. However, the research in [2]-[7] did not consider a detailed model of battery degradation during the energy arbitrage process.

The battery degradation model is a key factor in the energy arbitrage problem, and accurate calculation of degradation costs is crucial for obtaining realistic estimates of profitability. There is a growing literature examining the impact of battery degradation on energy arbitrage revenue [8], [9]. This impact is studied in [8], and a novel battery operational cost model that accounts for degradation cost based on depth of charge and discharge rate is developed in [9]. However, the degradation model used in [8], [9] is quite simplistic and does not realistically capture the degradation costs incurred in energy arbitrage. There are already independent research works on battery degradation models using either model-based or data-driven methods [10], [11], which can provide precise degradation costs for different charging profiles. One of the main barriers to embedding such an accurate model in the energy arbitrage problem is that the calculation of the degradation process is quite complicated, and it is not straightforward to find a simple mathematical degradation model that can be included in a model-based energy arbitrage algorithm.

Recently, data-driven model-free approaches have made great progress in decision-making problems [12]. Many studies have focused on the application of Reinforcement Learning (RL), a model-free agent-based AI algorithm, to the smart grid, and especially to demand response. The authors of [13] present a comprehensive review of RL for demand response. The authors of [14] proposed a deep reinforcement learning based
[Figure: (a) input – SoC profile for one week (x-axis: time in hours, 1–169; y-axis: SoC, roughly 0.2–0.8), with the capacity at time 168 marked; (b) output – remaining capacity versus cycle number.]
and next 24 hours electricity prices from forecasting, the agent selects an action (charging or discharging) from an action space a ∈ A(s) based on the policy π. The goal of the proposed algorithm is to find the optimal policy that maximises the reward (profit) in the energy arbitrage process. The MDP formulation for energy arbitrage is defined as follows:

1) State space: The state space at time instant t is defined as st = (ct, ..., ct+22, ct+23, SoCt), where ct, ..., ct+22, ct+23 is the predicted price for the next day. The predicted price signal is used so that the agent knows whether the price is going up or down and can take the best control action. The state transition of the battery SoC from state st to st+1 is defined in (1).

2) Action space: The charging/discharging action space is discrete, a = (−Pemax, −0.5Pemax, 0, 0.5Pemax, Pemax), where Pemax is the maximum charging/discharging power of the battery. The actual charging/discharging power is limited by (7) due to the limits on SoC:

(SoCt − SoCmax) · Eess / (ηt · Ts)  ≤  Pe,t  ≤  (SoCt − SoCmin) · Eess / (ηt · Ts)     (7)

where SoCmax and SoCmin are the maximum and minimum state of charge of the battery, respectively, and ηt is the charging/discharging efficiency defined in (1).

3) Reward: The design of the reward function is the key factor in the algorithm. The reward in the energy arbitrage problem should include not only the profit from the discharging action, but also the degradation cost of the control action. The immediate reward Rt at time step t is defined as follows:

where γ is the discount factor, and the policy π maps from the system states to the charging/discharging action.

By exploring the environment, the agent iteratively updates the action-value function Qπ(s, a) using the following Bellman equation:

Q(st, at) ← Q(st, at) + α [Rt + γ max_a Q(st+1, a) − Q(st, at)]     (11)

where α is the learning rate.

The iteration continues until it converges to the best action-value function Q*π(s, a). The chosen action is then determined by the ε-greedy policy: at every timestep t, the agent selects the greedy action at = argmax_a Q(s, a) with probability 1 − ε, and selects a random action to explore for a better reward with probability ε. In Q-learning, Q*π(s, a) is approximated by a look-up table.

2) Deep Q-network (DQN): Q-learning is confronted with a difficult task when the state or action space is high-dimensional. One solution, proposed by Google DeepMind [12], is to use a deep neural network to approximate the optimal action-value function Q*π(s, a). The value function represented by the DQN with weights ω is denoted as:

Q(st, at; ω) ≈ Q*(st, at)     (12)

The objective of DQN is to minimise the Mean Squared Error (MSE) loss L(ω) between Q(s, a) and the TD (temporal difference) target by Stochastic Gradient Descent (SGD):

L(ω) = (Rt + γ max_{at+1} Q(st+1, at+1; ω⁻) − Q(st, at; ω))²     (13)
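For illustration, the following is a minimal tabular Q-learning sketch of the update in (11) with ε-greedy action selection. The table sizes and hyperparameter values are assumptions for the example; the paper itself uses function approximation (DQN) rather than a table.

import numpy as np

# Tabular Q-learning with epsilon-greedy exploration, illustrating Eq. (11).
n_states, n_actions = 100, 5
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.05

def select_action(s):
    # Epsilon-greedy policy: explore with probability epsilon, otherwise exploit.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    # Eq. (11): Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])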
Fig. 7. The overall framework of the proposed approach. (The top part is the proposed prediction algorithm based on hybrid CNN and LSTM networks; the bottom part is the basic DQN approach. To improve the stability of the training process, we use the experience replay mechanism [12], which stores the state transitions in a replay buffer and randomly samples them during training.)
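As a companion to the experience replay mechanism mentioned in the Fig. 7 caption, the following is a minimal sketch of a replay buffer; the capacity and batch size are assumptions.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one state transition (s_t, a_t, R_t, s_{t+1}, done).
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniformly sample a random minibatch for the gradient update.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)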
a state, while the other stream estimates the advantage function for each valid action. Finally, the two streams are combined to produce an approximation of the Q-function, which is denoted as follows:

Q(s, a) = V(s, v) + A(s, a, ω)     (15)
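For illustration, a minimal Keras sketch (not from the paper) of combining the two streams as in (15) is given below. The shared trunk and layer sizes are assumptions, and the mean-advantage subtraction used in the original dueling DQN formulation is omitted to match (15) as written.

import tensorflow as tf
from tensorflow.keras import layers

def build_dueling_q_head(state_dim=25, n_actions=5):
    # Shared trunk, then a state-value stream V(s) and an advantage stream A(s, a).
    inputs = tf.keras.Input(shape=(state_dim,))
    x = layers.Dense(16, activation="relu")(inputs)
    value = layers.Dense(1)(x)              # V(s): one scalar per state
    advantage = layers.Dense(n_actions)(x)  # A(s, a): one value per action
    q_values = value + advantage            # Q(s, a) = V(s) + A(s, a), cf. (15)
    return tf.keras.Model(inputs, q_values)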
5) Noisy network for Exploration: An alternative approach to exploration when using a neural network to approximate the action-value function is Noisy Networks for Exploration [25], which replaces the linear layer with a noisy linear layer, defined as:

Y = (μ^ω + σ^ω ⊙ ε^ω)X + (μ^b + σ^b ⊙ ε^b)     (16)

where ε = [ε^ω, ε^b] are randomly sampled, zero-mean noise matrices with fixed statistics, and μ = [μ^ω, μ^b] and σ = [σ^ω, σ^b] are the learnable parameters of the network. In a noisy network, instead of using an ε-greedy policy, the agent can act greedily according to a network that uses noisy linear layers.
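For illustration, a minimal NumPy sketch of the noisy linear layer in (16) follows; the initialisation scheme and the sigma value are assumptions.

import numpy as np

class NoisyLinear:
    # y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b), cf. Eq. (16).
    def __init__(self, in_dim, out_dim, sigma_init=0.5):
        self.mu_w = np.random.uniform(-1, 1, (out_dim, in_dim)) / np.sqrt(in_dim)
        self.mu_b = np.random.uniform(-1, 1, out_dim) / np.sqrt(in_dim)
        self.sigma_w = np.full((out_dim, in_dim), sigma_init / np.sqrt(in_dim))
        self.sigma_b = np.full(out_dim, sigma_init / np.sqrt(in_dim))

    def forward(self, x):
        # Resample zero-mean noise for weights and biases at every forward pass.
        eps_w = np.random.randn(*self.mu_w.shape)
        eps_b = np.random.randn(*self.mu_b.shape)
        w = self.mu_w + self.sigma_w * eps_w
        b = self.mu_b + self.sigma_b * eps_b
        return w @ x + b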
B. Proposed algorithm

The overall framework of the proposed approach is shown in Fig. 7. The first part of the algorithm forecasts the electricity price using a hybrid CNN and LSTM network. The predicted prices, concatenated with other features such as SoC, are then fed into the DRL agent to learn the optimal policy. The two parts are explained in detail as follows:

1) Price forecasting: The goal of the proposed forecasting approach is to forecast the hourly market price of the next day (24 hours), using the historical price data of the last week (168 hours). There are three steps in the forecasting approach:

(i) Data pre-processing: the database used contains some extremely high peaks, caused by either market failures or data errors. To reduce the impact of such outliers on prediction accuracy, all values outside the range of the 15% and 85% quantiles are replaced by the threshold values. Then, the price data are scaled to [0, 1] using the MinMaxScaler function in Python Sklearn [26] (a code sketch of this step is given after this list).

(ii) Model architecture design: The proposed model uses combined CNN and LSTM networks. The LSTM network is well known for modelling time-series data [27] and has shown great advantages in load forecasting using smart meter data [28], [29]. The reason for adding a CNN layer prior to the LSTM network is to incorporate multiple features simultaneously (other features such as weather and generation) and to reduce the temporal input dimension if only one feature is included. In this paper, only one feature, the price, is included in the input data. The CNN layer reduces the temporal input dimension (from 1 × 168 to 7 × 24). Finally, a fully connected layer with 24 nodes is connected to the output. Each node corresponds to one predicted hour.

(iii) Training and accuracy assessment: The architecture designed in step (ii) is tuned and trained. The final trained model is used for prediction, and its accuracy is assessed using the Mean Absolute Error (MAE).
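For illustration, a minimal sketch of step (i), assuming a placeholder array of hourly prices; the 15%/85% quantile clipping and the MinMaxScaler call follow the text.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# raw_prices is a placeholder; real data would be loaded from the UK wholesale
# market dataset used in the paper.
raw_prices = np.random.gamma(shape=2.0, scale=25.0, size=8760)

lo, hi = np.quantile(raw_prices, [0.15, 0.85])       # quantile thresholds from the text
clipped = np.clip(raw_prices, lo, hi)                # replace extreme peaks/outliers

scaler = MinMaxScaler(feature_range=(0, 1))          # scale prices to [0, 1]
scaled = scaler.fit_transform(clipped.reshape(-1, 1))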
2) NoisyNet-DDQN algorithm (NN-DDQN): The detailed algorithm for energy arbitrage using NN-DDQN is presented in Algorithm 1.

V. CASE STUDY

In this section, we evaluate the proposed approach using actual UK wholesale market electricity prices [30]. Electricity prices from years 2015 and 2016 are used as the training and testing data, respectively.

We use five Lithium-ion batteries; each battery has a capacity of 200 kWh, and the charging/discharging power is discretised to [-100 kW, -50 kW, 0, 50 kW, 100 kW]. The battery parameters for calculating efficiency are shown in Table I. The whole training takes about three and a half hours on a computer with a GTX 1080 Ti GPU and an i7-7800X CPU. Once the training is finished, the proposed approach takes about 5 ms to output the control actions, so it could be used for real-time control. The algorithm is developed in Python and Keras, a high-level neural networks API [31].

A. Forecasting method evaluation

The price forecasting method proposed in Section IV-B is adopted to predict the electricity price, and the model architecture developed in Keras is shown in Table II.
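For illustration, a minimal Keras sketch of the forecasting model matching the layer shapes and parameter counts in Table II (3200, 20608 and 792 parameters); the activation and training loss are assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def build_price_forecaster():
    # Conv1D with kernel size and stride 24 compresses 168 hours to 7 steps,
    # an LSTM summarises the sequence, and a Dense layer outputs 24 hourly prices.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(168, 1)),                                      # last week of hourly prices
        layers.Conv1D(128, kernel_size=24, strides=24, activation="relu"),   # -> (7, 128)
        layers.LSTM(32),                                                     # -> (32,)
        layers.Dense(24),                                                    # next-day hourly prices
    ])
    model.compile(optimizer="adam", loss="mae")
    return model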
Algorithm 1 (excerpt):
11:  Sample a random minibatch of transitions (sj, aj, rj, sj+1) from D
12:  Sample the noisy variables ε for the online and target networks
13:  Estimate the target yj:
         yj = Rj + γ Q(sj+1, argmax_a′ Q(sj+1, a′, ω), ω⁻)
14:  Perform a gradient descent step on the loss (yj − Q(sj, aj, ω))²
15:  Every C steps, update ω⁻ = ω
16:  end for
17:  end for
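For illustration, a minimal NumPy sketch of the double-DQN target in step 13: the online network selects the next action, and the target network evaluates it. The array names stand in for the two networks' outputs on a minibatch and are assumptions.

import numpy as np

def ddqn_targets(rewards, q_online_next, q_target_next, gamma=0.99):
    # a* = argmax_a' Q(s_{j+1}, a'; omega), chosen by the online network ...
    best_actions = np.argmax(q_online_next, axis=1)
    # ... and evaluated by the target network: y_j = R_j + gamma * Q(s_{j+1}, a*; omega-)
    rows = np.arange(len(rewards))
    return np.asarray(rewards) + gamma * q_target_next[rows, best_actions]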
Fig. 8. Electricity price forecasting results during the summer and winter seasons. ((a) Electricity forecasting results from 1st July to 7th July; (b) Electricity forecasting results from 1st Jan to 7th Jan.)

TABLE I
BATTERY PARAMETERS IN (2) [17]
a0 = -0.852    a1 = 63.867    a2 = 3.6297    a3 = 0.559
a4 = 0.51      a5 = 0.508     b0 = 0.1463    b1 = 30.27
b2 = 0.1037    b3 = 0.0584    b4 = 0.1747    b5 = 0.1288
c0 = 0.1063    c1 = 62.94     c2 = 0.0437    d0 = -200
d1 = -138      d2 = 300

TABLE III
SUMMARY OF DRL TRAINING SETTINGS
Item                          Value
No. of hidden layers          3
No. of nodes in each layer    16
Activation function           ReLU
Learning rate                 0.00025
Optimizer                     Adam optimizer
Batch size                    32
Target model update           10000 steps
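Table III lists hyperparameters only; for illustration, a plain Keras Q-network consistent with those settings is sketched below (3 hidden layers of 16 ReLU nodes, Adam with learning rate 0.00025). The input size of 25 (24 predicted prices plus SoC) and the 5 discrete actions follow the MDP definition; treating the network as a standard feed-forward head, without the noisy or dueling components, is a simplification of this sketch.

import tensorflow as tf
from tensorflow.keras import layers

def build_q_network(state_dim=25, n_actions=5):
    # 3 hidden layers x 16 ReLU nodes, one output per discrete action (Table III).
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(state_dim,)),
        layers.Dense(16, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(n_actions),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00025), loss="mse")
    return model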
The data are randomly split using the train_test_split function in Sklearn [26]. The input data span the whole last week of electricity prices (168 hours), and these data are fed into the convolutional layer with a kernel size and a stride of 24, which results in a length of 7 per feature map. The output is the electricity price prediction for the next day (24 hours). Fig. 8 shows the forecasting results for one week during the summer and winter seasons. We can clearly see that the model learns not only the daily variations of prices, but also the weekly and seasonal patterns (more peak values during winter). The forecasting accuracy (MAE) is 4.686 in this case.
TABLE II
MODEL ARCHITECTURE IN KERAS
Layer type     Output shape     Param
Input Layer    (None, 168, 1)   0
Conv1D         (None, 7, 128)   3200
LSTM           (None, 32)       20608
Dense          (None, 24)       792

B. Performance Evaluation of NN-DDQN

The performance of the proposed NN-DDQN is evaluated in this section using the electricity prices of year 2016. To assess the effectiveness of the proposed approach, NN-DDQN is compared with two other DRL methods: Vanilla DQN and Double Dueling DQN, described in Section IV. All the training settings are summarized in Table III. The NN-DDQN is trained for 12000 episodes. The convergence of the episode rewards over 12000 episodes for these three methods is illustrated in Fig. 10. It can be observed that NN-DDQN is more stable during the training process than the other two approaches; it converges to the optimized reward, around 6, at episode 2200. Because NN-DDQN keeps choosing random actions with a small probability (epsilon = 0.01), the episode rewards keep fluctuating.

After training, the optimal weight parameters of NN-DDQN are used to control the charging/discharging actions of the battery storage using the electricity prices of year 2016. Fig. 9 shows the charging/discharging results over one week for different summer and winter price patterns. The electricity prices are illustrated with the green line, and the SoC and charging/discharging actions are represented with blue and red bars, respectively. The charging power (-100 kW) and discharging power (100 kW) are scaled to -1 and 1 so that they can be drawn on one figure. We can clearly see that the proposed approach learns the optimized charging/discharging strategy (charging during low
Fig. 9. The charging/discharging results over one week for summer (a) and winter (b). (Blue bar: SoC; Red bar: charging(-)/discharging(+) actions, the
values are scaled from [-100kW, 100kW] to [-1, 1]; The green curve with the right axis represents the electricity prices)
[Fig. 10: Episode reward versus training episode (0–10000) for NN-DDQN, Double Dueling DQN and Vanilla DQN.]

Fig. 11. Comparison results of cumulative profits with MILP. (Profit (k$) versus time (hours, 0–8000) for NN-DDQN, MILP, Double Dueling DQN and Vanilla DQN.)
The objective of the MILP is:

Σ_{k=1}^{N} ((Pc,k − Pd,k) · ck − αd (Pc,k + Pd,k))     (17)

subject to the constraints:

SoCk = SoCk−1 − ηd Pd,k · ud,k + ηc Pc,k · uc,k
0 ≤ Pc,k ≤ uc,k Pcmax
0 ≤ Pd,k ≤ ud,k Pdmax
0 ≤ uc,k + ud,k ≤ 1
uc,k, ud,k ∈ {0, 1}
0.2 ≤ SoCk ≤ 1
SoC0 = 0.5     (18)
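A minimal PuLP sketch of the MILP benchmark in (17)-(18) is given below. The hourly prices, the degradation coefficient alpha_d and the SoC-change coefficients eta_c/eta_d are illustrative placeholders, not values from the paper; the binary-times-power products in (18) are enforced implicitly through the power bounds; and writing the objective as a maximisation of discharge revenue minus charging cost and the degradation term is an assumption, since the excerpt does not show whether (17) is minimised or maximised.

import pulp

prices = [30, 25, 20, 22, 45, 60, 55, 35]        # c_k, example hourly prices
N = len(prices)
alpha_d, P_max = 5.0, 100.0                      # degradation cost weight, power limit (kW)
eta_c, eta_d = 0.0045, 0.0055                    # SoC change per kW charged/discharged

m = pulp.LpProblem("energy_arbitrage_milp", pulp.LpMaximize)
Pc = pulp.LpVariable.dicts("Pc", range(N), lowBound=0, upBound=P_max)
Pd = pulp.LpVariable.dicts("Pd", range(N), lowBound=0, upBound=P_max)
uc = pulp.LpVariable.dicts("uc", range(N), cat="Binary")
ud = pulp.LpVariable.dicts("ud", range(N), cat="Binary")
soc = pulp.LpVariable.dicts("soc", range(N + 1), lowBound=0.2, upBound=1.0)

# Objective: arbitrage profit minus the linear degradation term, cf. (17).
m += pulp.lpSum((Pd[k] - Pc[k]) * prices[k] - alpha_d * (Pc[k] + Pd[k]) for k in range(N))

m += soc[0] == 0.5                               # SoC_0 = 0.5
for k in range(N):
    m += soc[k + 1] == soc[k] + eta_c * Pc[k] - eta_d * Pd[k]   # SoC dynamics
    m += Pc[k] <= uc[k] * P_max                  # charge only if uc = 1
    m += Pd[k] <= ud[k] * P_max                  # discharge only if ud = 1
    m += uc[k] + ud[k] <= 1                      # no simultaneous charge/discharge

m.solve(pulp.PULP_CBC_CMD(msg=False))
print("optimal profit:", pulp.value(m.objective))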