Article:
Cao, J., Harrold, D., Fan, Z. et al. (3 more authors) (2020) Deep Reinforcement Learning-Based Energy Storage Arbitrage With Accurate Lithium-ion Battery Degradation Model. IEEE Transactions on Smart Grid, 11 (5), pp. 4513-4521. ISSN 1949-3053.
https://doi.org/10.1109/TSG.2020.2986333
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other uses, in any current or future media, including reprinting/republishing
this material for advertising or promotional purposes, creating new collective works, for
resale or redistribution to servers or lists, or reuse of any copyrighted component of this
work in other works. Uploaded in accordance with the publisher's self-archiving policy.
Abstract—Accurate estimation of battery degradation cost is one of the main barriers to batteries participating in the energy arbitrage market. This paper addresses this problem by using a model-free deep reinforcement learning (DRL) method to optimize battery energy arbitrage while considering an accurate battery degradation model. Firstly, the control problem is formulated as a Markov Decision Process (MDP). Then a noisy-network-based deep reinforcement learning approach is proposed to learn an optimized control policy for the storage charging/discharging strategy. To address the uncertainty of electricity prices, a hybrid Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM) model is adopted to predict the price for the next day. Finally, the proposed approach is tested on historical UK wholesale electricity market prices. The results, compared with model-based Mixed Integer Linear Programming (MILP), demonstrate the effectiveness and performance of the proposed framework.

Index Terms—Energy storage, Energy arbitrage, Battery degradation, Deep reinforcement learning, Noisy Networks

This work is partly supported by the SEND project (Grant REF. 32R16P00706) funded by ERDF and BEIS, the Royal Society Research Grant (REF. RGS/R1/191395) and EnergyREV (EP/S031863/1).
J. Cao is with the School of Geography, Geology and the Environment, Keele University, UK ST5 5BG (corresponding author: [email protected]).
Dan Harrold and Prof. Z. Fan are with the School of Computing and Mathematics, Keele University, UK ST5 5BG (email: [email protected]).
T. Morstyn is with the Department of Engineering Science, University of Oxford, Oxford OX1 2JD, U.K. ([email protected]).
Prof. David Healey is the managing director of smart grid solutions and also Professor in Practice at Keele University (email: [email protected]).
Prof. Kang Li is with the School of Electronics and Electrical Engineering, University of Leeds, Leeds, UK (e-mail: [email protected]).

I. INTRODUCTION

ENERGY storage systems can improve the flexibility of power systems by providing various ancillary services to system operators, e.g. load shifting, frequency regulation, voltage support and grid stabilization [1]. Among these, energy arbitrage represents the largest profit opportunity for battery storage. In electricity markets, storage can take advantage of daily energy price fluctuations to buy the cheapest energy available during periods of low demand and sell it at the highest price in order to generate profits.

Extensive research has been conducted on the optimisation of the energy storage arbitrage problem to maximise revenue. In [2] and [3], a mixed integer linear approach was developed to optimise the storage dispatch and maximise profits in real-time markets in the United States and Germany, respectively. To handle the uncertainty in electricity prices, a scenario-based stochastic formulation was developed in [4] for battery energy arbitrage in both the day-ahead and real-time markets. The authors of [5] present a bidding mechanism based on two-stage stochastic programming for a group of storage units that participate in the day-ahead reserve market. Apart from the above stochastic optimization approaches, robust optimization is also widely used to handle uncertainty. In [6], a robust-optimization-based bidding strategy was shown to have an increasing probability of yielding better economic performance than a deterministic-optimization-based bidding strategy as the forecast error in electricity prices increases. In [7], an affinely adjustable robust bidding strategy for solar power with a battery storage system was proposed to address the uncertainties of both PV solar power production and electricity prices. However, the research in [2]-[7] did not consider a detailed model of battery degradation during the energy arbitrage process.

The battery degradation model is a key factor in the energy arbitrage problem, and accurate calculation of degradation costs is crucial for obtaining realistic estimates of profitability. There is a growing literature examining the impact of battery degradation on energy arbitrage revenue [8], [9]. This impact is studied in [8], and a novel battery operational cost model that accounts for degradation cost based on depth of charge and discharge rate is developed in [9]. However, the degradation model used in [8], [9] is quite simplistic and does not realistically capture the degradation costs incurred in energy arbitrage. There are already independent research works on battery degradation models using either model-based or data-driven methods [10], [11], which can provide precise degradation costs for different charging profiles. One of the main barriers to embedding such an accurate model in the energy arbitrage problem is that the calculation of the degradation process is quite complicated, and it is not straightforward to find a simple mathematical degradation model that can be included in a model-based energy arbitrage algorithm.

Recently, data-driven model-free approaches have made great progress in decision-making problems [12]. Many studies have focused on the application of Reinforcement Learning (RL), a model-free agent-based AI algorithm, to the smart grid, and especially to demand response. The authors of [13] present a comprehensive review of RL for demand response. The authors of [14] proposed a deep reinforcement learning based
[Figure: (a) input – SoC profile for one week (x-axis: time in hours, 1–169; y-axis: SoC, roughly 0.2–0.8), with the capacity at time 168 marked; (b) output – remaining capacity versus cycle number.]
and next 24 hours electricity prices from forecasting, the agent selects an action (charging or discharging) from an action space a ∈ A(s) based on the policy π. The goal of the proposed algorithm is to find the optimal policy that maximises the reward (profit) in the energy arbitrage process. The MDP formulation for energy arbitrage is defined as follows:

1) State space: The state space at time instant t is defined as st = (ct, ..., ct+22, ct+23, SoCt), where ct, ..., ct+22, ct+23 is the predicted price for the next day. The predicted price signal is used so that the agent knows whether the price is going up or down and can take the best control action. The state transition of the battery SoC from state st to st+1 is defined in (1).

2) Action space: The charging/discharging action space is discrete, a = (−Pemax, −0.5Pemax, 0, 0.5Pemax, Pemax), where Pemax is the maximum charging/discharging power of the battery. The actual charging/discharging power is limited by (7) due to the limits on SoC:

(SoCt − SoCmax) · Eess / (ηt · Ts)  ≤  Pe,t  ≤  (SoCt − SoCmin) · Eess / (ηt · Ts)     (7)

where SoCmax and SoCmin are the maximum and minimum state of charge of the battery, respectively, and ηt is the charging/discharging efficiency defined in (1).

3) Reward: The design of the reward function is the key factor in the algorithm. The reward in the energy arbitrage problem should include not only the profit from the discharging action, but also the degradation cost of the control action. The immediate reward Rt at time step t is defined as follows:

where γ is the discount factor, and the policy π maps from the system states to the charging/discharging action.

By exploring the environment, the agent iteratively updates the action-value function Qπ(s, a) using the following Bellman equation:

Q(st, at) ← Q(st, at) + α [Rt + γ max_a Q(st+1, a) − Q(st, at)]     (11)

where α is the learning rate.

The iteration continues until it converges to the best action-value function Q*π(s, a). The chosen action is then determined by the ε-greedy policy: at every timestep t, the agent selects the greedy action at = argmax_a Q(s, a) with probability 1 − ε, and selects a random action to explore for a better reward with probability ε. In Q-learning, Q*π(s, a) is approximated by a look-up table.

2) Deep Q-network (DQN): Q-learning is confronted with a difficult task when the state or action space is high-dimensional. One solution, proposed by Google DeepMind [12], is to use a deep neural network to approximate the optimal action-value function Q*π(s, a). The value function represented by the DQN with weights ω is denoted as:

Q(st, at; ω) ≈ Q*(st, at)     (12)

The objective of DQN is to minimise the Mean Squared Error (MSE) loss L(ω) between Q(s, a) and the TD (temporal difference) target by Stochastic Gradient Descent (SGD):

L(ω) = (Rt + γ max_{at+1} Q(st+1, at+1; ω⁻) − Q(st, at; ω))²     (13)
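For illustration, the following is a minimal tabular Q-learning sketch of the update in (11) with ε-greedy action selection. The table sizes and hyperparameter values are assumptions for the example; the paper itself uses function approximation (DQN) rather than a table.

import numpy as np

# Tabular Q-learning with epsilon-greedy exploration, illustrating Eq. (11).
n_states, n_actions = 100, 5
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.05

def select_action(s):
    # Epsilon-greedy policy: explore with probability epsilon, otherwise exploit.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    # Eq. (11): Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])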
Fig. 7. The overall framework of the proposed approach. (The top part is the proposed prediction algorithm based on hybrid CNN and LSTM networks; the bottom part is the basic DQN approach. To improve the stability of the training process, we use the experience replay mechanism [12], which stores the state transitions in a replay buffer and randomly samples them during training.)
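As a companion to the experience replay mechanism mentioned in the Fig. 7 caption, the following is a minimal sketch of a replay buffer; the capacity and batch size are assumptions.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one state transition (s_t, a_t, R_t, s_{t+1}, done).
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniformly sample a random minibatch for the gradient update.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)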
a state, while the other stream estimates the advantage function for each valid action. Finally, the two streams are combined to produce an approximation of the Q-function, which is denoted as follows:

Q(s, a) = V(s, v) + A(s, a, ω)     (15)
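For illustration, a minimal Keras sketch (not from the paper) of combining the two streams as in (15) is given below. The shared trunk and layer sizes are assumptions, and the mean-advantage subtraction used in the original dueling DQN formulation is omitted to match (15) as written.

import tensorflow as tf
from tensorflow.keras import layers

def build_dueling_q_head(state_dim=25, n_actions=5):
    # Shared trunk, then a state-value stream V(s) and an advantage stream A(s, a).
    inputs = tf.keras.Input(shape=(state_dim,))
    x = layers.Dense(16, activation="relu")(inputs)
    value = layers.Dense(1)(x)              # V(s): one scalar per state
    advantage = layers.Dense(n_actions)(x)  # A(s, a): one value per action
    q_values = value + advantage            # Q(s, a) = V(s) + A(s, a), cf. (15)
    return tf.keras.Model(inputs, q_values)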
5) Noisy network for Exploration: An alternative approach to exploration when using a neural network to approximate the action-value function is Noisy Networks for Exploration [25], which replaces the linear layer with a noisy linear layer, defined as:

Y = (μ^ω + σ^ω ⊙ ε^ω)X + (μ^b + σ^b ⊙ ε^b)     (16)

where ε = [ε^ω, ε^b] are randomly sampled, zero-mean noise matrices with fixed statistics, and μ = [μ^ω, μ^b] and σ = [σ^ω, σ^b] are the learnable parameters of the network. In a noisy network, instead of using an ε-greedy policy, the agent can act greedily according to a network that uses noisy linear layers.
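For illustration, a minimal NumPy sketch of the noisy linear layer in (16) follows; the initialisation scheme and the sigma value are assumptions.

import numpy as np

class NoisyLinear:
    # y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b), cf. Eq. (16).
    def __init__(self, in_dim, out_dim, sigma_init=0.5):
        self.mu_w = np.random.uniform(-1, 1, (out_dim, in_dim)) / np.sqrt(in_dim)
        self.mu_b = np.random.uniform(-1, 1, out_dim) / np.sqrt(in_dim)
        self.sigma_w = np.full((out_dim, in_dim), sigma_init / np.sqrt(in_dim))
        self.sigma_b = np.full(out_dim, sigma_init / np.sqrt(in_dim))

    def forward(self, x):
        # Resample zero-mean noise for weights and biases at every forward pass.
        eps_w = np.random.randn(*self.mu_w.shape)
        eps_b = np.random.randn(*self.mu_b.shape)
        w = self.mu_w + self.sigma_w * eps_w
        b = self.mu_b + self.sigma_b * eps_b
        return w @ x + b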
B. Proposed algorithm

The overall framework of the proposed approach is shown in Fig. 7. The first part of the algorithm forecasts the electricity price using a hybrid CNN and LSTM network. The predicted prices, concatenated with other features such as SoC, are then fed into the DRL agent to learn the optimal policy. The two parts are explained in detail as follows:

1) Price forecasting: The goal of the proposed forecasting approach is to forecast the hourly market price of the next day (24 hours), using the historical price data of the last week (168 hours). There are three steps in the forecasting approach:

(i) Data pre-processing: the database used contains some extremely high peaks, caused by either market failures or data errors. To reduce the impact of such outliers on prediction accuracy, all values outside the range of the 15% and 85% quantiles are replaced by the threshold values. Then, the price data are scaled to [0, 1] using the MinMaxScaler function in Python Sklearn [26] (a code sketch of this step is given after this list).

(ii) Model architecture design: The proposed model uses combined CNN and LSTM networks. The LSTM network is well known for modelling time-series data [27] and has shown great advantages in load forecasting using smart meter data [28], [29]. The reason for adding a CNN layer prior to the LSTM network is to incorporate multiple features simultaneously (other features such as weather and generation) and to reduce the temporal input dimension if only one feature is included. In this paper, only one feature, the price, is included in the input data. The CNN layer reduces the temporal input dimension (from 1 × 168 to 7 × 24). Finally, a fully connected layer with 24 nodes is connected to the output. Each node corresponds to one predicted hour.

(iii) Training and accuracy assessment: The architecture designed in step (ii) is tuned and trained. The final trained model is used for prediction, and its accuracy is assessed using the Mean Absolute Error (MAE).
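For illustration, a minimal sketch of step (i), assuming a placeholder array of hourly prices; the 15%/85% quantile clipping and the MinMaxScaler call follow the text.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# raw_prices is a placeholder; real data would be loaded from the UK wholesale
# market dataset used in the paper.
raw_prices = np.random.gamma(shape=2.0, scale=25.0, size=8760)

lo, hi = np.quantile(raw_prices, [0.15, 0.85])       # quantile thresholds from the text
clipped = np.clip(raw_prices, lo, hi)                # replace extreme peaks/outliers

scaler = MinMaxScaler(feature_range=(0, 1))          # scale prices to [0, 1]
scaled = scaler.fit_transform(clipped.reshape(-1, 1))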
2) NoisyNet-DDQN algorithm (NN-DDQN): The detailed algorithm for energy arbitrage using NN-DDQN is presented in Algorithm 1.

V. CASE STUDY

In this section, we evaluate the proposed approach using actual UK wholesale market electricity prices [30]. Electricity prices from years 2015 and 2016 are used as the training and testing data, respectively.

We use five Lithium-ion batteries; each battery has a capacity of 200 kWh, and the charging/discharging power is discretised to [-100 kW, -50 kW, 0, 50 kW, 100 kW]. The battery parameters for calculating efficiency are shown in Table I. The whole training takes about three and a half hours on a computer with a GTX 1080 Ti GPU and an i7-7800X CPU. Once the training is finished, the proposed approach takes about 5 ms to output the control actions, so it could be used for real-time control. The algorithm is developed in Python and Keras, a high-level neural networks API [31].

A. Forecasting method evaluation

The price forecasting method proposed in Section IV-B is adopted to predict the electricity price, and the model architecture developed in Keras is shown in Table II.
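For illustration, a minimal Keras sketch of the forecasting model matching the layer shapes and parameter counts in Table II (3200, 20608 and 792 parameters); the activation and training loss are assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def build_price_forecaster():
    # Conv1D with kernel size and stride 24 compresses 168 hours to 7 steps,
    # an LSTM summarises the sequence, and a Dense layer outputs 24 hourly prices.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(168, 1)),                                      # last week of hourly prices
        layers.Conv1D(128, kernel_size=24, strides=24, activation="relu"),   # -> (7, 128)
        layers.LSTM(32),                                                     # -> (32,)
        layers.Dense(24),                                                    # next-day hourly prices
    ])
    model.compile(optimizer="adam", loss="mae")
    return model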
Algorithm 1 (excerpt):
11:  Sample a random minibatch of transitions (sj, aj, rj, sj+1) from D
12:  Sample the noisy variables ε for the online and target networks
13:  Estimate the target yj:
         yj = Rj + γ Q(sj+1, argmax_a′ Q(sj+1, a′, ω), ω⁻)
14:  Perform a gradient descent step on the loss (yj − Q(sj, aj, ω))²
15:  Every C steps, update ω⁻ = ω
16:  end for
17:  end for
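For illustration, a minimal NumPy sketch of the double-DQN target in step 13: the online network selects the next action, and the target network evaluates it. The array names stand in for the two networks' outputs on a minibatch and are assumptions.

import numpy as np

def ddqn_targets(rewards, q_online_next, q_target_next, gamma=0.99):
    # a* = argmax_a' Q(s_{j+1}, a'; omega), chosen by the online network ...
    best_actions = np.argmax(q_online_next, axis=1)
    # ... and evaluated by the target network: y_j = R_j + gamma * Q(s_{j+1}, a*; omega-)
    rows = np.arange(len(rewards))
    return np.asarray(rewards) + gamma * q_target_next[rows, best_actions]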
Fig. 8. Electricity price forecasting results during the summer and winter seasons. ((a) Electricity forecasting results from 1st July to 7th July; (b) Electricity forecasting results from 1st Jan to 7th Jan.)

TABLE I
BATTERY PARAMETERS IN (2) [17]
a0 = -0.852    a1 = 63.867    a2 = 3.6297    a3 = 0.559
a4 = 0.51      a5 = 0.508     b0 = 0.1463    b1 = 30.27
b2 = 0.1037    b3 = 0.0584    b4 = 0.1747    b5 = 0.1288
c0 = 0.1063    c1 = 62.94     c2 = 0.0437    d0 = -200
d1 = -138      d2 = 300

TABLE III
SUMMARY OF DRL TRAINING SETTINGS
Item                          Value
No. of hidden layers          3
No. of nodes in each layer    16
Activation function           ReLU
Learning rate                 0.00025
Optimizer                     Adam optimizer
Batch size                    32
Target model update           10000 steps
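Table III lists hyperparameters only; for illustration, a plain Keras Q-network consistent with those settings is sketched below (3 hidden layers of 16 ReLU nodes, Adam with learning rate 0.00025). The input size of 25 (24 predicted prices plus SoC) and the 5 discrete actions follow the MDP definition; treating the network as a standard feed-forward head, without the noisy or dueling components, is a simplification of this sketch.

import tensorflow as tf
from tensorflow.keras import layers

def build_q_network(state_dim=25, n_actions=5):
    # 3 hidden layers x 16 ReLU nodes, one output per discrete action (Table III).
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(state_dim,)),
        layers.Dense(16, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(n_actions),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00025), loss="mse")
    return model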
The data are randomly split using the train_test_split function in Sklearn [26]. The input data span the whole last week of electricity prices (168 hours), and these data are fed into the convolutional layer with a kernel size and a stride of 24, which results in a length of 7 per feature map. The output is the electricity price prediction for the next day (24 hours). Fig. 8 shows the forecasting results for one week during the summer and winter seasons. We can clearly see that the model learns not only the daily variations of prices, but also the weekly and seasonal patterns (more peak values during winter). The forecasting accuracy (MAE) is 4.686 in this case.
TABLE II
MODEL ARCHITECTURE IN KERAS
Layer type     Output shape     Param
Input Layer    (None, 168, 1)   0
Conv1D         (None, 7, 128)   3200
LSTM           (None, 32)       20608
Dense          (None, 24)       792

B. Performance Evaluation of NN-DDQN

The performance of the proposed NN-DDQN is evaluated in this section using the electricity prices of year 2016. To assess the effectiveness of the proposed approach, NN-DDQN is compared with two other DRL methods: Vanilla DQN and Double Dueling DQN, described in Section IV. All the training settings are summarized in Table III. The NN-DDQN is trained for 12000 episodes. The convergence of the episode rewards over 12000 episodes for these three methods is illustrated in Fig. 10. It can be observed that NN-DDQN is more stable during the training process than the other two approaches; it converges to the optimized reward, around 6, at episode 2200. Because NN-DDQN keeps choosing random actions with a small probability (epsilon = 0.01), the episode rewards keep fluctuating.

After training, the optimal weight parameters of NN-DDQN are used to control the charging/discharging actions of the battery storage using the electricity prices of year 2016. Fig. 9 shows the charging/discharging results over one week for different summer and winter price patterns. The electricity prices are illustrated with the green line, and the SoC and charging/discharging actions are represented with blue and red bars, respectively. The charging power (-100 kW) and discharging power (100 kW) are scaled to -1 and 1 so that they can be drawn on one figure. We can clearly see that the proposed approach learns the optimized charging/discharging strategy (charging during low
Fig. 9. The charging/discharging results over one week for summer (a) and winter (b). (Blue bar: SoC; Red bar: charging(-)/discharging(+) actions, the
values are scaled from [-100kW, 100kW] to [-1, 1]; The green curve with the right axis represents the electricity prices)
[Fig. 10: Episode reward versus training episode (0–10000) for NN-DDQN, Double Dueling DQN and Vanilla DQN.]

Fig. 11. Comparison results of cumulative profits with MILP. (Profit (k$) versus time (hours, 0–8000) for NN-DDQN, MILP, Double Dueling DQN and Vanilla DQN.)
The objective of the MILP is:

Σ_{k=1}^{N} ((Pc,k − Pd,k) · ck − αd (Pc,k + Pd,k))     (17)

subject to the constraints:

SoCk = SoCk−1 − ηd Pd,k · ud,k + ηc Pc,k · uc,k
0 ≤ Pc,k ≤ uc,k Pcmax
0 ≤ Pd,k ≤ ud,k Pdmax
0 ≤ uc,k + ud,k ≤ 1
uc,k, ud,k ∈ {0, 1}
0.2 ≤ SoCk ≤ 1
SoC0 = 0.5     (18)
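A minimal PuLP sketch of the MILP benchmark in (17)-(18) is given below. The hourly prices, the degradation coefficient alpha_d and the SoC-change coefficients eta_c/eta_d are illustrative placeholders, not values from the paper; the binary-times-power products in (18) are enforced implicitly through the power bounds; and writing the objective as a maximisation of discharge revenue minus charging cost and the degradation term is an assumption, since the excerpt does not show whether (17) is minimised or maximised.

import pulp

prices = [30, 25, 20, 22, 45, 60, 55, 35]        # c_k, example hourly prices
N = len(prices)
alpha_d, P_max = 5.0, 100.0                      # degradation cost weight, power limit (kW)
eta_c, eta_d = 0.0045, 0.0055                    # SoC change per kW charged/discharged

m = pulp.LpProblem("energy_arbitrage_milp", pulp.LpMaximize)
Pc = pulp.LpVariable.dicts("Pc", range(N), lowBound=0, upBound=P_max)
Pd = pulp.LpVariable.dicts("Pd", range(N), lowBound=0, upBound=P_max)
uc = pulp.LpVariable.dicts("uc", range(N), cat="Binary")
ud = pulp.LpVariable.dicts("ud", range(N), cat="Binary")
soc = pulp.LpVariable.dicts("soc", range(N + 1), lowBound=0.2, upBound=1.0)

# Objective: arbitrage profit minus the linear degradation term, cf. (17).
m += pulp.lpSum((Pd[k] - Pc[k]) * prices[k] - alpha_d * (Pc[k] + Pd[k]) for k in range(N))

m += soc[0] == 0.5                               # SoC_0 = 0.5
for k in range(N):
    m += soc[k + 1] == soc[k] + eta_c * Pc[k] - eta_d * Pd[k]   # SoC dynamics
    m += Pc[k] <= uc[k] * P_max                  # charge only if uc = 1
    m += Pd[k] <= ud[k] * P_max                  # discharge only if ud = 1
    m += uc[k] + ud[k] <= 1                      # no simultaneous charge/discharge

m.solve(pulp.PULP_CBC_CMD(msg=False))
print("optimal profit:", pulp.value(m.objective))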