DRL For Optimal Perturbation For MPPT in Wind Energy

Download as pdf or txt
Download as pdf or txt
You are on page 1of 7

DROP: Deep Reinforcement Learning Based

Optimal Perturbation for MPPT in Wind Energy


Salman Sadiq Shuvo Md Maidul Islam Yasin Yilmaz
Electrical Engineering Electrical Engineering Electrical Engineering
University of South Florida Florida State University University of South Florida
Tampa, FL, USA Tallahassee, FL, USA Tampa, FL, USA
[email protected] [email protected] [email protected]

Abstract—Methods to draw maximum power from Photo-


voltaic (PV) modules are an ongoing research topic. The so-
called Maximum Power Point Tracking (MPPT) method aims
to operate the PV module at its maximum power point (MPP)
by matching the load resistance to its characteristic resistance,
which changes with temperature and solar irradiance. Pertur-
bation and Observation (P&O) is a popular method that lays the
foundation for many advanced techniques. We propose a deep
reinforcement learning (RL) based algorithm to determine the
optimal perturbation size to reach the MPP. Our method utilizes
an artificial neural network-based predictor to determine the
MPP from temperature and solar irradiance measurements. The
proposed technique provides an effective learning-based solution
to the classical MPPT problem. The effectiveness of our model
is demonstrated through comparative analysis with respect to
the popular methods from the literature.
Index Terms—Deep reinforcement learning, markov decision
process, MPPT, renewable energy, wind power.
Fig. 1. Classification of MPPT algorithms for wind power.
I. I NTRODUCTION
A. Wind Energy AI based and hybrid algorithms that utilize both conventional
The increasing cost and adverse effects of fossil fuels on and smart methods [2]. Fig. 1 shows classification of different
climate, have increased the demand for an efficient renewable MPPT methods for wind.
energy alternative. Wind energy can be a great source of clean
C. Wind MPPT Existing methods
and reliable energy, and there has been a rapid penetration of
wind generators in modern power system in the last decade. The most popular IPC based conventional MPPT algorithm
The global wind power generation capacity is expected to is tip speed ratio (TSR). In this method, reference rotor
reach 840 GW by the end of 2022 [1]. The basic nature of speed is generated by estimating the rotor and wind speed.
wind energy is extremely fluctuating, and thus tracking of Using this reference speed and other system parameters,
maximum power point (MPP) to extract maximum capture optimal power extraction is achieved. The TSR algorithm can
of energy at different varying wind speed is of great interest. either utilize mechanical sensors employed in anemometers
in the wind turbine swept area or estimate the wind speed
B. Wind MPPT through mathematical modelling [3], [4]. TSR is simple
Maximum power point tracking (MPPT) algorithms helps to implement and shows rapid response in regulating the
to extract maximum power from wind energy conversion rotor speed under changing environments. The drawbacks are
system (WECS). The speed and direction of wind changes increased installation and maintenance cost, lower efficiency
continuously and thus output from a WECS fluctuates. As and lack of reliability. The optimum torque (OT) method uses
per the Betz limit only 59% of total available wind energy optimal torque curve for multiple wind speeds to regulate
can be harnessed by the wind turbine. The WECS system the generator torque. Although the technique is simple and
operating region is from the cut in wind speed Vcutin to rated yields higher efficiency, it is greatly dependent on climate
wind velocity Vrated . MPPT algorithms take into account and the wind turbine characteristics [5]. The power signal
variables like voltage, optimal power, duty cycle and ensures feedback method uses lookup table for optimal power for
maximum power generation for corresponding wind velocity wind turbine that are generated by experimental setup or
in the operating region. The MPPT tracking algorithms for simulation. This method shows good performance in tracking
WECS can be broadly categorized in four types, as direct MPP at low wind speeds, but requires prior knowledge of
power control (DPC), Indirect Power control (IPC), smart or system and wind speed sensors [6]. Conventional DPC based
methods include Perturb and observe (P&O) and Incremental
conductance (INC) method [6], [7]. In P&O, the control
variables are adjusted and their effect on performance is PC
PB
observed to decide on next steps. The advantage of this al-
PF
gorithm is it doesn’t require additional measurement sensors PE
PD PA
and prior knowledge of wind turbine parameters. However,
choosing the appropriate step size of perturbation may be
challenging and thus the algorithm may oscillate near MPP
but fail to achieve it. The INC method observes the rectifier
output power to decide the direction of perturbation. This
method has increased reliability and does not require sensors
or system parameter information but suffers from the same Fig. 2. Turbine power versus turbine speed for different wind
problem as P&O. With advances in AI models, several smart speeds.
MPPT algorithms have become popular recently. Researchers
in [8], [9] presented radial basis function neural (RBF) turbine from wind. The mechanical power captured by a wind
network and Wilcoxon radial basis function neural (WRBFN) turbine is,
network based MPPT strategy for wind power. Authors in 1
Pm = ρv 3 ACp ,
[10] presented an ANN based MPPT algorithm by using the 2
electric power and rotor speed of the generator as input and 3 2
where ρ, v , A = πR , R, Cp are respectively the air
the action values of the WECS as the output. [11] presents a density, air velocity, area swept by the turbine blades, blade
Q learning based MPPT algorithm has been proposed which radius, and turbine power coefficient. Cp is a function of the
shows comparatively faster and more efficient performance. turbine speed ratio λ = ωR v , and the blade pitch angle β. As
However, this technique uses look up table to store previous the air density is mostly constant; for a particular turbine Eq.
controller actions which require a lot of memory space. Sev- (1) can be rewritten as
eral hybrid methods have also been proposed by researchers
such as ANN and PSF [12], ORB and PSO [13], tip speed 1
P = ηG ηC ρA × v 3 Cp (2)
ratio (TSR) control, power signal feedback (PSF) control 2
| {z } | {z }
f (v,ω)
and hill climb search (HCS) control [14]. Authors in [15] const,K
presented Fuzzy logic (FL) and ANN based Adaptive Neural- Eq. (2) indicates that the electrical power output for a
Fuzzy Interface System (ANFIS) technique for MPPT. The given turbine is depenedent only on wind velocity v, and
hybrid methods aim to combine the advantages of several rotor speed ω; as shown in Fig. 2. The solid curved lines
methods to improve performance but are complex in nature represent the output power for different turbine rotor speed
and time and computational complexity are high. and wind velocity. Higher wind speed results into more
electrical power, as evident by the increasing power curves
D. RL for optimization/ MPPT for higher wind velocity (i.e., vw1 < vw2 < ... < vw7 ). The
Reinforcement learning (RL) is an AI technique which concave power curves show that for a fixed wind velocity
has been extensively used in robotics and optimization there is an unique optimal rotor speed ω ∗ that results into
applications in industries. Researchers have also explored maximum output power Pmax . Connecting these points, we
reinforcement learning for MPPT applications in solar [16]. get the Optimal power curve shown by the dashed magenta
Some works have been done for WECS too, i.e. [17], authors line.
presented RL based MPPT technique where RL agent learns As a test case, lets assume the wind velocity is vw6 and
the MPP by interacting with the environment using model the turbine is operating at ωm7 rotor speed. This operating
free Q algorithm. The advantage of RL is it does not require condition is translated as turbine output power PA shown
prior knowledge of the WECS or deployment of wind speed in Fig. 2. The rotor speed step size is ∆ω = ωm2 − ωm1 ,
measurement sensors. RL is comparatively faster than other where ωm1 = ωm2 = ... = ωm7 . The maximum power
conventional and smart techniques making it ideal for fast point tracking (MPPT) task here is to change the rotor speed
response in higher/abrupt wind speed change. Furthermore, to optimal value (i.e., ωm5 here), which results the optimal
as only learned MPPs are saved, the memory requirement for output power PC . However, it requies two time steps and an
a RL based MPPT technique is comparatively low. intermediate output power PB to reach MPP. Then, the wind
velocity changes to vw5 , and the output power goes down to
II. BACKGROUND PD as the rotor speed is still ωm5 . Now, the MPPT algorithm
will oscillate between power PD and PE , however will not
The electrical power generated by a wind turbine,
reach the optimal power PF , as it falls into a point of no
P = ηG ηC Pm , (1) available rotor speed. It will settle down at PE with rotor
speed of ωm4 .
depends on the generator efficiency ηG , converter effi- This MPPT task is synonymous to track the optimal power
ciency ηC , and the mechanical power Pm captured by the curve for the 2-D coordinate system where X-axis represents

2
110
Maximum Power
Resistance at MPP
ωt Speed Controller Δωt 105
Action
ωt=ωt-1+Δωt At= Δωt
100 99.07 99.4 99.04 99.4 99.06 99.73

% Accuracy
ωt st
Generator ^
(ωt, ω
ANN t) State 95
Pt Predictor St=(ωt ,ω^)
t
Wattmeter ^
Pt 90
Pt Reward
^
Electrical Load Rt= Pt-Pt 85
RIDGE LASSO HUBER
Regression Model
Fig. 3. Proposed wind MPPT model. RL components are
marked in red. Fig. 4. Average accuracy for different regressor models for
predicting Pmax and RM P P from temperature and irradiance
values.
Turbine speed and Y-axis represents Turbine output power.
It asserts that the whole dynamics is captured within the
coordinate system. The test case also explains the rotor
a rotor speed that achieves the maximum power and the
speed step size selection dilemma in many MPPT techniques.
point falls on the optimal power curve. That means all the
Clearly, the big step size ensures fast tracking but fails in
points over the power curve should be mapped to the same
granularity issues in the test case. With aging, the turbine
point on the 2-D coordinate system. We use an Artificial
capacity and performance declines and the optimal power
Neural Network (ANN) based function approximator to map
curve shifts downward. This non-stationary optimal power
these relation. In fact, the wind turbine manufacturer has
curve makes Reinforcement Learning a suitable optimization
the capacity to provide different power curves for different
technique for wind MPPT task.
wind velocity, along with the optimal power curve. The
This work aims to map any point in the 2-D surface to points on these power curves serve as input features and the
its corresponding point at the optimal power curve. That corresponding point in the optimal power curve serves as the
causes the knowledge of wind speed redundant, hence makes target value (output) for the ANN. So, these mapping is a
our work free of wind speed sensors and measurements. For straight forward data driven approach. This predictor breaks
clarity, the point A, and B is mapped to point C; similarly down the task for the RL agent by providing it the current
the point D, and E is mapped to point F. So, our technique rotor speed and optimal rotor speed.
aims to reach point C from A by changing the rotor speed
We consider several regression models (Ridge, Lasso, and
accordingly.
Huber) with two fully connected layers and select Huber
III. M ODEL D EVELOPEMENT regressor, which gives the maximum accuracy, for our ex-
periments. Notably, all of the regressors provide around 99%
Our model consists of an ANN predictor and a RL agent
accuracy, as shown in Fig. 4.
which controls the rotor speed of the turbine. Our model
utilizes the advantage actor-critic (A2C) deep RL framework, The real-time predictor continuously provides prediction
which is suitable for fine-grained action space [18]. Fig. 3 ω̂t , which is in turn used as input for the deep RL state.
shows that at time step t, the speed controller controls the Similarly, the predictor provides prediction P̂t for maximum
turbine rotation speed ωt . The Turbine is connected to a power Pmax that we use to calculate the reward of the RL
generator that converst the mechanical power into electrical agent. Besides, we use P̂t to estimate the performance of the
power serves load. The generated power Pt is measured by algorithms.
a wattmeter. Then the Pt , and ωt is fed into Artificial Neural
Network (ANN) based predictor that maps the operating B. MDP Model
point to the optimal point ω̂t as explained in Fig. 2. The
current rotor speed and predicted optimal speed defines the We develop a Markov Decision Process (MDP) model
state for the RL agent. The RL agent takes action of rotor for formulating the problem for the RL agent. The MDP
speed change ∆ωt to control the turbine speed. The generated model is based on the Markov Property; i.e., the future state
power Pt and the predicted optimal power P̂t defines the is dependent only on the current state and action taken by
reward to complete the RL framework. the agent. The blocks in Fig. 3 represent the elements for
our model, inclusive of the components of the MDP model
A. ANN Predictor marked in red. The MPPT controller is the MDP agent that
Fig. 2 shows that the power curves for different wind speed takes action At about change in speed ∆ω of the turbine.
are non-overlapping. For a particular power curve, there is The other non-marked blocks form the environment.

3
1) State, St : The agent collects the rotor speed ωt from Algorithm 1 A2C algorithm for wind MPPT.
the speed controller and predicted optimal speed ω̂t from the Input: discount factor γ, learning rate, and number of
ANN predictor two define the MDP state as: episodes e
Input: Wind velocity {vt }, and turbine speed {ωt }
St = (ωt , ω̂t ). Initialize: Actor network with random weights and critic
network with random weights
Notably, the state space is framed with only two variables for episode = 1, 2, ..., e do
(rotor speed); suitable for the learning and convergence of the for t= 1, 2, ..., T do
Deep RL model. Both of the inputs are positive real numbers ANN regressor predicts ω̂t , and P̂t .
and floating inputs for the algorithm; hence, they have an Select action At for state St = (ωt , ω̂t ) using actor
infinite number of possible values and require Deep RL to network.
deal with it. Execute action At and observe reward Rt from Eq.
2) Action, At : The RL agent’s action At is to select the (3).
change of turbine rotation speed ∆ωt . So, ∆ωt = 0 indicates Store transitions (St , At , Rt , St+1 ).
no change in speed; and can be positive or negative values to Update actor network via advantage function.
represent increase and decrease of turbine speed, respectively. Update critic network through back propagation.
Theoretically, continuous-valued speed changes can give the end for
highest flexibility and performance optimization; however, end for
the turbine speed controller can only accommodate a limited
number of discrete values. We consider a fine-grained action Parameter Definition Value
space for the deep RL model and choose the speed change Pmax Rated capacity 1500 kW
vrated Rated wind speed 12 m/s
At = ∆ωt ∈ {−0.05, −0.04, ..., 0.04, 0.05}, where the num- vmin Cut-in wind speed 4 m/s
ber indicates the percentile change of the nominal/nameplate Number of rotor blades 3
D = 2R Rotor diameter 70.5 m
turbine speed (provided by the manufacturer). So, the RL A Swept area 3.904 m2
agent has 11 actions possible actions. ω Rotor speed (range) 11.1 – 22.2 rpm
The turbine speed change, ∆ωt , is executed by the Tur- Table I: GE Energy 1.5MW Wind Turbine Technical Speci-
bine Speed Controller (TSC). This TSC includes necessary fications (GE 1.5 S).
mechanical and electrical devices to achieve the target speed
ωt−1 = ωt + ∆ωt . C. Solution Approach
3) Reward, Rt : In RL, the reward function guides the
The RL agent’s objective is to maximize the discounted
agent towards optimal action. The reward is observed from
total reward for time horizon T ,
the environment but requires modeling to provide meaningful
T
insight to the RL agent. So, we define reward, Rt as the X
difference between output power and the predicted maximum RT = γ t Rt , (4)
t=0
power of the ANN predictor.
where γ ∈ (0, 1) is the discount factor for future reward.
Rt = Pt − P̂t (3) There are two popular approaches for finding the optimal
policy {At }, value-based methods (e.g., deep Q-learning)
The agent aims to maximize the reward, i.e., maximize and policy-based methods (e.g., policy gradient). We need
output power Pt . The selected reward in Eq. (3) provides to solve the following Bellman equation. After i iterations,
the RL agent a stable target to reach. The highest reward the the agent’s value function at time step t is
agent can achieve is zero, i.e., equalling the ANN regressor n  o
predicted maximum power. So, once the MPP is reached, V i (ωt , ω̂t ) = max E Rt + γV i−1 (ωt+1 , ω̂t+1 ) ,
At
changing turbine speed will incur a negative reward. So, the
agent selects At = 0, i.e., making no change in rotor speed We consider the popular Advantage Actor-Critic (A2C) al-
(∆ωt = 0). gorithm for this continuous state MDP [18]. A2C is a hybrid
4) Next State, St+1 : If the wind velocity remains same, deep RL method which consists of a policy-based actor
the agent’s action makes the next state deterministic as network and value-based critic network. A pseudo code for
the A2C algorithm is given in Algorithm 1.
St+1 = (ωt+1 = ωt + ∆ωt , ω̂t+1 ). IV. R ESULTS
Here, the ω̂t+1 is the observation from ANN predictor that A. Experimental Setup
uses Pt and ωt for making the prediction. If the wind velocity In our experiments, we use the PV Module 1STH-220-
changes; then the system moves to a different operating point P, whose operation details are provided in Table I. All the
in the turbine speed vs turbine output power curve. These the experiments are performed in Python 3.6.8 version. Fig. 5
observations are obtained from Wattmeter and turbine speed shows the convergence of our A2C deep RL algorithm for
controller. varying irradiance. The y-axis represents the episodic output

4
300
0 Irradiance 1000
275
2
250 800
4

Output Power (W)

Irradiance (W/m2)
6 225
600
E-Emax (kJ)

8 200
10 175 400
12 150 Ideal Case
Chou et al. Method[9] 200
14 125 Proposed Method
Raw Rewards P & O Method
16 Smoothed Rewards 100 0
0 2000 4000 6000 8000 10000 0 25 50 75 100 125 150 175 200
Episode Time (s)

Fig. 5. Convergence of our A2C deep RL method under Fig. 6. Power output for different methods for varying
varying irradiance. irradiance.
190
Ideal Case Temperature 35
energy difference with respect to the ideal case (if the PV Chou et al. Method[9]
185 Proposed Method
always operates at MPP). The smoothed reward is the running 30
P & O Method
mean of the last 10 episodes of raw (actual) rewards. The 180 25

Output Power (W)

Temperautre (°C)
algorithm converges within 4000 episodes and minimizes this
175 20
energy difference to 0.36 kJ, where the total ideal output is
32.7 kJ for the episode duration (200 s). This minimal 1.1% 15
170
loss of energy happens during the irradiance change time 10
step, which is impossible to nullify. 165 5
B. Benchmark Policies
160 0
We compare our method with the following policies. 0 25 50 75 100 125 150 175 200
Time (s)
1) Perturb and Observe (P&O): We use the popular P&O
method [19] as our baseline policy. We determine 0.01 to be Fig. 7. Power output for different methods for varying
a suitable step size for perturbation through a grid search. temperature.
2) RL-based approach: Chou et al. [16] propose a deep
Reinforcement Learning (RL) based MPPT. Their method 2) Varying Temperature (T): We set the irradiance at 800
uses temperature, irradiance, and duty cycle of the DC/DC W/m2 and the load resistance (5 Ω) stationary for this setup.
converter as the RL state, so they require pretty much the The right y-axis in Fig. 7 represents the temperature value
same setup as ours except the ANN predictor. Also, the that changes between 20, 25, and 30 °C, shown by the dashed
reward used in [16] is the change in power, ∆Pt = Pt −Pt−1 , line. The left y-axis represents the output power, and the solid
which provides a less stable (i.e., more fluctuating) feedback blue line shows the ideal output power. Our method performs
than our prediction-based reward Rt = Pt − P̂t . As the significantly better than the other methods.
hardware requirement is quite similar, this method provides 3) Varying Load Resistance (R): The temperature and
a fair comparison for our method. irradiance are fixed at 25 °C and 800 W/m2 respectively for
this case. But the load resitance changes among 1, 1.5 and 2
C. Performance Analysis Ω as shown in Fig. 8. We don’t include Chou et al. method
We aim to test our method for different environments. [16] here as their model does not consider variable load.
Hence, we provide three sets of case simulations where we The maximum power remains stable at 172.8 W as it is free
examine our method by changing either irradiance, tempera- of load variability. The P&O method cannot reach the MPP
ture, or load resistance. The experiment duration is 200 time fast enough due to small step size. We also experimented
steps (seconds) for each analysis. with bigger step sizes, which provided worse results and
1) Varying Irradiance (G): We keep the temperature (25 unstable output power. Our method uses its variable step size
°C) and the load resistance (5 Ω) stationary for this setup. to provide the optimal solution. Clearly, MPPT for variable
The right y-axis in Fig. 6 represents the irradiance value load is a more challenging task as it shifts the operation point
that changes between 600, 800, and 1000 W/m2 , shown by further from the MPP.
the dashed line. The left y-axis shows the output power for Table II shows the summary of the performance for the
different methods. The solid blue line represents the ideal methods for different cases. Our proposed deep RL method is
output power that all the methods try to reach. P&O method the fastest to track the MPP and maximizes the power output
is the slowest to reach, and our proposed method is the for each case. All the methods does well to maximize the
fastest. Chou et al.’s method [16] lie in between. output for variable irradiance and temperature; however, our

5
Time to reach MPP for each change in operating condition (s) Energy Output (kJ)
Case P&O Chou et al. Proposed Ideal P&O Chou et al. Proposed
Variable G 25, 22, 19, 18, 19 18, 14, 13, 13, 15 2, 6, 6, 6, 4 32.68 31.78 32.2 32.34
Variable T 19, 2, 3, 13, f/r* f/r*, 22, 13, 2, 13 8, 6, 3, 5, 4 34.52 34.08 34.11 34.27
Variable R 25, 22, 19, 18, 19 n/a** 2, 6, 6, 6, 4 34.56 27.76 n/a** 31.84
* Fails to reach the MPP, ** Not applicable

Table II: Summary of performances under different cases considered in Figs. 6–8. The five numbers in each cell represent
the performance under five time intervals in each case.

VI. C ONCLUSION
200 10
Resistance This work aims to provide a state-of-the-art solution to the
180
8 MPPT task for photovoltaics by modeling a deep RL-based
160
technique. We integrated an ANN-based pre-trained predictor
140
Output Power (W)

into the deep RL model that predicts power and resistance

Resistance ( )
Ideal Case 6
120 at MPP for a given irradiance and temperature. These two
Proposed Method
100 P & O Method parameters help to shape the state and reward of the RL
4
80 model. This process breaks down the task for the deep RL-
60 based algorithm, resulting in superior performance than the
2
existing P&O and a recent deep RL-based method [16]. Our
40
method is robust and can be used for any PV module by
20 0 training the predictor with the module’s I-V data.
0 25 50 75 100 125 150 175 200
Time (s)
R EFERENCES
Fig. 8. Power output for different methods for varying
resistance. [1] H. H. Mousa, A.-R. Youssef, and E. E. Mohamed, “State of the art
perturb and observe mppt algorithms based wind energy conversion
method outperforms the others to be the closest to ideal case. systems: A technology review,” International Journal of Electrical
Power & Energy Systems, vol. 126, p. 106598, 2021.
The benefit of our method is more evident in the variable [2] J. Pande, P. Nasikkar, K. Kotecha, and V. Varadarajan, “A review of
resistance case, where it outputs 13 % more energy than maximum power point tracking algorithms for wind energy conversion
the P&O method. The time to reach the MPP after every systems,” Journal of Marine Science and Engineering, vol. 9, no. 11,
p. 1187, 2021.
change in PV dynamics is also provided in Table II, which
[3] C. M. Parker and M. C. Leftwich, “The effect of tip speed ratio on a
is consistent with the output energy results. vertical axis wind turbine at high reynolds numbers,” Experiments in
Fluids, vol. 57, no. 5, pp. 1–11, 2016.
[4] Y. Errami, M. Ouassaid, and M. Maaroufi, “Optimal power control
V. D ISCUSSIONS strategy of maximizing wind energy tracking and different operating
conditions for permanent magnet synchronous generator wind farm,”
Energy Procedia, vol. 74, pp. 477–490, 2015.
The MPPT task aims to reach MPP by shifting the load [5] M. Yin, W. Li, C. Y. Chung, L. Zhou, Z. Chen, and Y. Zou, “Optimal
resistance towards the MPP resistance through duty cycle torque control based on effective tracking range for maximum power
change. We define our MDP state as estimated MPP resis- point tracking of wind turbines under varying wind conditions,” IET
Renewable power generation, vol. 11, no. 4, pp. 501–510, 2017.
tance and current load resistance, which has enough infor- [6] D. Kumar and K. Chatterjee, “A review of conventional and advanced
mation to change the duty cycle. This effective breakdown mppt algorithms for wind energy systems,” Renewable and sustainable
of the problem helps us keep the RL state small and to the energy reviews, vol. 55, pp. 957–970, 2016.
[7] H. H. Mousa, A.-R. Youssef, and E. E. Mohamed, “Variable step size
point, which is the underlying reason for the success of this p&o mppt algorithm for optimal power extraction of multi-phase pmsg
model. This model is suitable for large-scale PV units where based wind generation system,” International Journal of Electrical
temperature and irradiance from multiple sensors may keep Power & Energy Systems, vol. 108, pp. 218–231, 2019.
[8] C.-H. Chen, C.-M. Hong, and T.-C. Ou, “Wrbf network based control
it apart from unnecessary noise from those sensors. Further- strategy for pmsg on smart grid,” in 2011 16th International Confer-
more, periodical (yearly) calibration of the ANN predictor ence on Intelligent System Applications to Power Systems, 2011, pp.
may compensate for degradation and corresponding changes 1–6.
[9] T. Li and Z. Ji, “Intelligent inverse control to maximum power point
in the I-V curve of the PV module over long-term usage. tracking control strategy of wind energy conversion system,” in 2011
Deep RL algorithms with continuous action may further Chinese Control and Decision Conference (CCDC), 2011, pp. 970–
benefit this approach; however, the action range is a matter 974.
of deliberation as a significant change in the duty cycle may [10] C. Wei, Z. Zhang, W. Qiao, and L. Qu, “Intelligent maximum power
extraction control for wind energy conversion systems based on
complicate the action of the DC/DC converter. The real-life online q-learning with function approximation,” in 2014 IEEE Energy
implementation of this simulation-based method may provide Conversion Congress and Exposition (ECCE), 2014, pp. 4911–4916.
further insights into this technique. Our discrete action setup [11] A. Kushwaha, M. Gopal, and B. Singh, “Q-learning based maximum
power extraction for wind energy conversion system with variable wind
has small granularity and a suitable operating range for the speed,” IEEE Transactions on Energy Conversion, vol. 35, no. 3, pp.
DC/DC converter. Our method addresses the major problem 1160–1170, 2020.
of optimal perturbation size of the P&O method by providing [12] S. Azzouz, S. Messalti, and A. Harrag, “A novel hybrid mppt controller
using (p&o)-neural networks for variable speed wind turbine based on
flexible duty cycle change based on the state of the MDP dfig a novel hybrid mppt controller using (p&o)-neural networks for
model. variable speed wind turbine based on dfig,” 1874.

6
[13] M. A. Abdullah, T. Al-Hadhrami, C. W. Tan, and A. H. Yatim,
“Towards green energy for smart cities: Particle swarm optimization
based mppt approach,” IEEE Access, vol. 6, pp. 58 427–58 438, 2018.
[14] J. Hussain and M. K. Mishra, “Adaptive maximum power point
tracking control algorithm for wind energy conversion systems,” IEEE
Transactions on Energy Conversion, vol. 31, no. 2, pp. 697–705, 2016.
[15] M. R. Javed, A. Waleed, U. S. Virk, and S. Z. ul Hassan, “Comparison
of the adaptive neural-fuzzy interface system (anfis) based solar max-
imum power point tracking (mppt) with other solar mppt methods,” in
2020 IEEE 23rd international multitopic conference (INMIC). IEEE,
2020, pp. 1–5.
[16] K.-Y. Chou, S.-T. Yang, and Y.-P. Chen, “Maximum power point
tracking of photovoltaic system based on reinforcement learning,”
Sensors, vol. 19, no. 22, p. 5054, 2019.
[17] C. Wei, Z. Zhang, W. Qiao, and L. Qu, “Reinforcement-learning-based
intelligent maximum power point tracking control for wind energy
conversion systems,” IEEE Transactions on Industrial Electronics,
vol. 62, no. 10, pp. 6360–6370, 2015.
[18] P.-H. Su, P. Budzianowski, S. Ultes, M. Gasic, and S. Young, “Sample-
efficient actor-critic reinforcement learning with supervised data for
dialogue management,” arXiv preprint arXiv:1707.00130, 2017.
[19] N. Femia, G. Petrone, G. Spagnuolo, and M. Vitelli, “Optimization of
perturb and observe maximum power point tracking method,” IEEE
transactions on power electronics, vol. 20, no. 4, pp. 963–973, 2005.

You might also like