DRL For Optimal Perturbation For MPPT in Wind Energy
Fig. 3. Proposed wind MPPT model. RL components are marked in red.

Fig. 4. Average accuracy for different regressor models (Ridge, Lasso, Huber) for predicting Pmax and R_MPP from temperature and irradiance values.
The X-axis represents turbine speed and the Y-axis represents turbine output power. This asserts that the whole dynamics is captured within the coordinate system. The test case also illustrates the rotor speed step-size selection dilemma of many MPPT techniques: a big step size ensures fast tracking but fails on granularity in the test case. With aging, the turbine capacity and performance decline and the optimal power curve shifts downward. This non-stationary optimal power curve makes Reinforcement Learning a suitable optimization technique for the wind MPPT task.

This work aims to map any point on the 2-D surface to its corresponding point on the optimal power curve. That makes knowledge of the wind speed redundant, and hence makes our work free of wind speed sensors and measurements. For clarity, points A and B are mapped to point C; similarly, points D and E are mapped to point F. So, our technique aims to reach point C from A by changing the rotor speed accordingly.

III. MODEL DEVELOPMENT

Our model consists of an ANN predictor and an RL agent that controls the rotor speed of the turbine. It uses the advantage actor-critic (A2C) deep RL framework, which is suitable for a fine-grained action space [18]. Fig. 3 shows that at time step t, the speed controller sets the turbine rotation speed ωt. The turbine is connected to a generator that converts the mechanical power into electrical power serving the load. The generated power Pt is measured by a wattmeter. Then Pt and ωt are fed into an Artificial Neural Network (ANN) based predictor that maps the operating point to the optimal point ω̂t, as explained in Fig. 2. The current rotor speed and the predicted optimal speed define the state for the RL agent. The RL agent takes an action, the rotor speed change ∆ωt, to control the turbine speed. The generated power Pt and the predicted optimal power P̂t define the reward, completing the RL framework.

A. ANN Predictor

Fig. 2 shows that the power curves for different wind speeds are non-overlapping. For a particular power curve, there is a rotor speed that achieves the maximum power, and that point falls on the optimal power curve. Hence, all the points on a power curve should be mapped to the same point in the 2-D coordinate system. We use an Artificial Neural Network (ANN) based function approximator to learn this mapping. In fact, the wind turbine manufacturer can provide the power curves for different wind velocities, along with the optimal power curve. The points on these power curves serve as input features and the corresponding point on the optimal power curve serves as the target value (output) for the ANN. So, this mapping is a straightforward data-driven approach. The predictor breaks down the task for the RL agent by providing it the current rotor speed and the optimal rotor speed.

We consider several regression models (Ridge, Lasso, and Huber) with two fully connected layers and select the Huber regressor, which gives the maximum accuracy, for our experiments. Notably, all of the regressors provide around 99% accuracy, as shown in Fig. 4.

The real-time predictor continuously provides the prediction ω̂t, which is in turn used as input for the deep RL state. Similarly, the predictor provides the prediction P̂t of the maximum power Pmax, which we use to calculate the reward of the RL agent. We also use P̂t to evaluate the performance of the algorithms.

B. MDP Model

We develop a Markov Decision Process (MDP) model to formulate the problem for the RL agent. The MDP model is based on the Markov property, i.e., the future state depends only on the current state and the action taken by the agent. The blocks in Fig. 3 represent the elements of our model, with the components of the MDP model marked in red. The MPPT controller is the MDP agent that takes the action At, a change ∆ω in the speed of the turbine. The other, non-marked blocks form the environment.
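To picture this decomposition, the environment side of Fig. 3 can be written as a small simulation wrapper. The sketch below is a hypothetical skeleton, not the authors' implementation: power_curve and predictor stand in for the turbine/generator/wattmeter chain and the ANN predictor, and the state, action, and reward it exposes anticipate the definitions in the subsections that follow.

```python
# Hypothetical environment skeleton mirroring the non-RL blocks of Fig. 3.
# power_curve(w, v): turbine + generator output power for rotor speed w and wind v.
# predictor(w, p):   ANN predictor returning (w_hat, p_hat) for the operating point.
class WindMPPTEnv:
    def __init__(self, power_curve, predictor, wind_profile, w0, w_nominal):
        self.power_curve = power_curve
        self.predictor = predictor
        self.wind = wind_profile          # sequence of wind velocities v_t
        self.w_nominal = w_nominal        # nameplate rotor speed (actions scale it)
        self.w0 = w0
        self.t = 0
        self.w = w0

    def reset(self):
        self.t, self.w = 0, self.w0
        p = self.power_curve(self.w, self.wind[0])
        w_hat, _ = self.predictor(self.w, p)
        return (self.w, w_hat)            # state S_t = (w_t, w_hat_t)

    def step(self, dw_fraction):
        # Action: fractional change of the nominal speed, applied by the TSC.
        self.w = self.w + dw_fraction * self.w_nominal
        self.t += 1
        v = self.wind[min(self.t, len(self.wind) - 1)]
        p = self.power_curve(self.w, v)               # wattmeter reading P_t
        w_hat, p_hat = self.predictor(self.w, p)      # ANN predictions
        reward = p - p_hat                            # R_t = P_t - P_hat_t
        done = self.t >= len(self.wind) - 1
        return (self.w, w_hat), reward, done
```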
1) State, St: The agent collects the rotor speed ωt from the speed controller and the predicted optimal speed ω̂t from the ANN predictor to define the MDP state as:

St = (ωt, ω̂t).

Notably, the state space is framed with only two variables (rotor speeds), which is suitable for the learning and convergence of the deep RL model. Both inputs are positive, real-valued (floating-point) quantities; hence, they have an infinite number of possible values and require deep RL to deal with them.

2) Action, At: The RL agent's action At is to select the change of turbine rotation speed ∆ωt. So, ∆ωt = 0 indicates no change in speed, and positive or negative values represent an increase or decrease of the turbine speed, respectively. Theoretically, continuous-valued speed changes give the highest flexibility and performance; however, the turbine speed controller can only accommodate a limited number of discrete values. We consider a fine-grained action space for the deep RL model and choose the speed change At = ∆ωt ∈ {−0.05, −0.04, ..., 0.04, 0.05}, where the number indicates the fractional change of the nominal/nameplate turbine speed (provided by the manufacturer). So, the RL agent has 11 possible actions.

The turbine speed change ∆ωt is executed by the Turbine Speed Controller (TSC). The TSC includes the necessary mechanical and electrical devices to achieve the target speed ωt+1 = ωt + ∆ωt.

3) Reward, Rt: In RL, the reward function guides the agent towards the optimal action. The reward is observed from the environment but requires modeling to provide a meaningful signal to the RL agent. So, we define the reward Rt as the difference between the output power and the maximum power predicted by the ANN predictor:

Rt = Pt − P̂t.    (3)

The agent aims to maximize the reward, i.e., maximize the output power Pt. The reward in Eq. (3) provides the RL agent a stable target to reach. The highest reward the agent can achieve is zero, i.e., equalling the maximum power predicted by the ANN regressor. So, once the MPP is reached, changing the turbine speed incurs a negative reward, and the agent selects At = 0, i.e., makes no change in rotor speed (∆ωt = 0).

4) Next State, St+1: If the wind velocity remains the same, the agent's action makes the next state deterministic as

St+1 = (ωt+1 = ωt + ∆ωt, ω̂t+1).

Here, ω̂t+1 is the observation from the ANN predictor, which uses Pt and ωt to make the prediction. If the wind velocity changes, the system moves to a different operating point on the turbine speed vs. turbine output power curve. These observations are obtained from the wattmeter and the turbine speed controller.

Algorithm 1 A2C algorithm for wind MPPT.
Input: discount factor γ, learning rate, and number of episodes e
Input: wind velocity {vt} and turbine speed {ωt}
Initialize: actor network with random weights and critic network with random weights
for episode = 1, 2, ..., e do
  for t = 1, 2, ..., T do
    ANN regressor predicts ω̂t and P̂t.
    Select action At for state St = (ωt, ω̂t) using the actor network.
    Execute action At and observe reward Rt from Eq. (3).
    Store the transition (St, At, Rt, St+1).
    Update the actor network via the advantage function.
    Update the critic network through back propagation.
  end for
end for

Table I: GE Energy 1.5 MW Wind Turbine Technical Specifications (GE 1.5 S).
| Parameter | Definition | Value |
| Pmax | Rated capacity | 1500 kW |
| vrated | Rated wind speed | 12 m/s |
| vmin | Cut-in wind speed | 4 m/s |
| – | Number of rotor blades | 3 |
| D = 2R | Rotor diameter | 70.5 m |
| A | Swept area | 3904 m² |
| ω | Rotor speed (range) | 11.1 – 22.2 rpm |

C. Solution Approach

The RL agent's objective is to maximize the discounted total reward over the time horizon T,

RT = Σ_{t=0}^{T} γ^t Rt,    (4)

where γ ∈ (0, 1) is the discount factor for future reward. There are two popular approaches for finding the optimal policy {At}: value-based methods (e.g., deep Q-learning) and policy-based methods (e.g., policy gradient). We need to solve the following Bellman equation; after i iterations, the agent's value function at time step t is

V^i(ωt, ω̂t) = max_{At} E{ Rt + γ V^{i−1}(ωt+1, ω̂t+1) }.

We consider the popular Advantage Actor-Critic (A2C) algorithm for this continuous-state MDP [18]. A2C is a hybrid deep RL method that consists of a policy-based actor network and a value-based critic network. Pseudo code for the A2C algorithm is given in Algorithm 1.

IV. RESULTS

A. Experimental Setup

In our experiments, we use the PV module 1STH-220-P, whose operation details are provided in Table I. All the experiments are performed with Python 3.6.8. Fig. 5 shows the convergence of our A2C deep RL algorithm for varying irradiance.
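Algorithm 1 can be turned into runnable form along the following lines. This is a minimal PyTorch sketch under our own assumptions (network sizes, learning rate, and a one-step advantage estimate are illustrative, not the authors' settings); it expects an environment with the interface of the WindMPPTEnv skeleton given earlier.

```python
# Minimal A2C sketch for the wind-MPPT MDP of Algorithm 1 (illustrative only).
import torch
import torch.nn as nn
from torch.distributions import Categorical

ACTIONS = [round(-0.05 + 0.01 * i, 2) for i in range(11)]   # {-0.05, ..., 0.05}

class ActorCritic(nn.Module):
    def __init__(self, n_actions=len(ACTIONS), hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.pi = nn.Linear(hidden, n_actions)   # actor head: policy logits
        self.v = nn.Linear(hidden, 1)            # critic head: state value

    def forward(self, state):
        h = self.body(state)
        return self.pi(h), self.v(h)

def train(env, episodes=10000, horizon=200, gamma=0.9, lr=1e-3):
    net = ActorCritic()
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(episodes):
        state = torch.tensor(env.reset(), dtype=torch.float32)   # (w_t, w_hat_t)
        for _ in range(horizon):
            logits, value = net(state)
            dist = Categorical(logits=logits)
            action = dist.sample()                       # index into ACTIONS
            next_state, reward, done = env.step(ACTIONS[action.item()])
            next_state = torch.tensor(next_state, dtype=torch.float32)
            with torch.no_grad():
                _, next_value = net(next_state)
            target = reward + (0.0 if done else gamma * next_value.item())
            advantage = target - value.squeeze()         # one-step TD advantage
            actor_loss = -dist.log_prob(action) * advantage.detach()
            critic_loss = advantage.pow(2)
            opt.zero_grad()
            (actor_loss + critic_loss).backward()
            opt.step()
            state = next_state
            if done:
                break
    return net
```

A single transition is used per update here for brevity; storing transitions and batching the actor and critic updates, as written in Algorithm 1, follows the same pattern.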
Fig. 5. Convergence of our A2C deep RL method under varying irradiance.

Fig. 6. Power output for different methods for varying irradiance.
The y-axis represents the episodic output energy difference with respect to the ideal case (if the PV always operated at the MPP). The smoothed reward is the running mean of the raw (actual) rewards over the last 10 episodes. The algorithm converges within 4000 episodes and reduces this energy difference to 0.36 kJ, where the total ideal output is 32.7 kJ for the episode duration (200 s). This minimal 1.1% loss of energy happens during the irradiance change time steps, which is impossible to nullify.

B. Benchmark Policies

We compare our method with the following policies.

1) Perturb and Observe (P&O): We use the popular P&O method [19] as our baseline policy. We determine 0.01 to be a suitable step size for the perturbation through a grid search.

2) RL-based approach: Chou et al. [16] propose a deep Reinforcement Learning (RL) based MPPT. Their method uses the temperature, irradiance, and duty cycle of the DC/DC converter as the RL state, so they require much the same setup as ours, except for the ANN predictor. Also, the reward used in [16] is the change in power, ∆Pt = Pt − Pt−1, which provides a less stable (i.e., more fluctuating) feedback than our prediction-based reward Rt = Pt − P̂t. As the hardware requirement is quite similar, this method provides a fair comparison for our method.

C. Performance Analysis

We aim to test our method in different environments. Hence, we provide three sets of case simulations where we examine our method by changing either the irradiance, the temperature, or the load resistance. The experiment duration is 200 time steps (seconds) for each analysis.

1) Varying Irradiance (G): We keep the temperature (25 °C) and the load resistance (5 Ω) stationary for this setup. The right y-axis in Fig. 6 represents the irradiance value that changes between 600, 800, and 1000 W/m², shown by the dashed line. The left y-axis shows the output power for the different methods. The solid blue line represents the ideal output power that all the methods try to reach. The P&O method is the slowest to reach it, and our proposed method is the fastest; the method of Chou et al. [16] lies in between.

Fig. 7. Power output for different methods for varying temperature.

2) Varying Temperature (T): We set the irradiance at 800 W/m² and keep the load resistance (5 Ω) stationary for this setup. The right y-axis in Fig. 7 represents the temperature value that changes between 20, 25, and 30 °C, shown by the dashed line. The left y-axis represents the output power, and the solid blue line shows the ideal output power. Our method performs significantly better than the other methods.

3) Varying Load Resistance (R): The temperature and irradiance are fixed at 25 °C and 800 W/m², respectively, for this case, but the load resistance changes among 1, 1.5, and 2 Ω, as shown in Fig. 8. We do not include the method of Chou et al. [16] here as their model does not consider a variable load. The maximum power remains stable at 172.8 W as it is free of load variability. The P&O method cannot reach the MPP fast enough due to its small step size; we also experimented with bigger step sizes, which provided worse results and unstable output power. Our method uses its variable step size to provide the optimal solution. Clearly, MPPT for a variable load is a more challenging task as it shifts the operating point further from the MPP.

Table II summarizes the performance of the methods for the different cases. Our proposed deep RL method is the fastest to track the MPP and maximizes the power output in each case. All the methods do well in maximizing the output for variable irradiance and temperature; however, our method outperforms the others and is the closest to the ideal case.
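For reference, the fixed-step P&O baseline of Sec. IV-B can be sketched as follows. The measure_power callback and the sign convention are illustrative assumptions, not the exact implementation used in the experiments; the 0.01 step mirrors the grid-searched value reported above.

```python
# Illustrative perturb-and-observe (P&O) baseline with a fixed step size.
# measure_power(x) is a hypothetical callback returning the output power for
# the current operating setting x (duty cycle or normalized rotor speed).
def perturb_and_observe(measure_power, x0, step=0.01, n_steps=200):
    x, p_prev = x0, measure_power(x0)
    direction = 1.0                       # initial perturbation direction
    history = []
    for _ in range(n_steps):
        x = x + direction * step          # perturb
        p = measure_power(x)              # observe
        if p < p_prev:                    # power dropped: reverse direction
            direction = -direction
        p_prev = p
        history.append((x, p))
    return history
```

The fixed step is exactly the limitation discussed earlier: a small step tracks slowly, while a large one oscillates around the MPP.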
Table II: Summary of performances under the different cases considered in Figs. 6–8. The five numbers in each cell represent the performance under the five time intervals in each case.

Time to reach the MPP after each change in operating condition (s):
| Case | P&O | Chou et al. | Proposed |
| Variable G | 25, 22, 19, 18, 19 | 18, 14, 13, 13, 15 | 2, 6, 6, 6, 4 |
| Variable T | 19, 2, 3, 13, f/r* | f/r*, 22, 13, 2, 13 | 8, 6, 3, 5, 4 |
| Variable R | 25, 22, 19, 18, 19 | n/a** | 2, 6, 6, 6, 4 |

Energy output (kJ):
| Case | Ideal | P&O | Chou et al. | Proposed |
| Variable G | 32.68 | 31.78 | 32.2 | 32.34 |
| Variable T | 34.52 | 34.08 | 34.11 | 34.27 |
| Variable R | 34.56 | 27.76 | n/a** | 31.84 |

* Fails to reach the MPP. ** Not applicable.
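The two metrics reported in Table II could be derived from logged power traces roughly as sketched below; the 2% tolerance band and the 1 s sampling interval are our assumptions, not the paper's stated post-processing.

```python
# Hypothetical post-processing of a logged power trace into the Table II metrics:
# time to reach the MPP after a change in operating condition, and energy output.
import numpy as np

def time_to_reach_mpp(power, p_ideal, t_change, tol=0.02, dt=1.0):
    """Seconds after the change at index t_change until the power stays within
    tol of the ideal power (returns None if the MPP is never reached)."""
    for k in range(t_change, len(power)):
        if abs(power[k] - p_ideal[k]) <= tol * p_ideal[k]:
            return (k - t_change) * dt
    return None

def energy_output_kj(power_w, dt=1.0):
    """Trapezoidal energy of a power trace in W sampled every dt seconds, in kJ."""
    return np.trapz(power_w, dx=dt) / 1e3
```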
Fig. 8. Power output for different methods for varying resistance.

The benefit of our method is most evident in the variable-resistance case, where it outputs 13% more energy than the P&O method. The time to reach the MPP after every change in the PV dynamics is also provided in Table II, and it is consistent with the output energy results.

V. DISCUSSIONS

The MPPT task aims to reach the MPP by shifting the load resistance towards the MPP resistance through duty cycle changes. We define our MDP state as the estimated MPP resistance and the current load resistance, which carries enough information to change the duty cycle. This effective breakdown of the problem helps us keep the RL state small and to the point, which is the underlying reason for the success of this model. The model is suitable for large-scale PV units, where temperature and irradiance readings from multiple sensors may otherwise introduce unnecessary noise. Furthermore, periodic (yearly) calibration of the ANN predictor may compensate for degradation and the corresponding changes in the I-V curve of the PV module over long-term usage. Deep RL algorithms with continuous actions may further benefit this approach; however, the action range is a matter of deliberation, as a significant change in the duty cycle may complicate the operation of the DC/DC converter. A real-life implementation of this simulation-based method may provide further insights into the technique. Our discrete action setup has small granularity and a suitable operating range for the DC/DC converter. Our method addresses the major problem of the P&O method, the optimal perturbation size, by providing a flexible duty cycle change based on the state of the MDP model.

VI. CONCLUSION

This work aims to provide a state-of-the-art solution to the MPPT task for photovoltaics by modeling a deep RL-based technique. We integrated an ANN-based pre-trained predictor that estimates the maximum power and the resistance at the MPP for a given irradiance and temperature. These two parameters help to shape the state and reward of the RL model. This process breaks down the task for the deep RL-based algorithm, resulting in superior performance over the existing P&O method and a recent deep RL-based method [16]. Our method is robust and can be used for any PV module by training the predictor with the module's I-V data.

REFERENCES

[1] H. H. Mousa, A.-R. Youssef, and E. E. Mohamed, "State of the art perturb and observe mppt algorithms based wind energy conversion systems: A technology review," International Journal of Electrical Power & Energy Systems, vol. 126, p. 106598, 2021.
[2] J. Pande, P. Nasikkar, K. Kotecha, and V. Varadarajan, "A review of maximum power point tracking algorithms for wind energy conversion systems," Journal of Marine Science and Engineering, vol. 9, no. 11, p. 1187, 2021.
[3] C. M. Parker and M. C. Leftwich, "The effect of tip speed ratio on a vertical axis wind turbine at high reynolds numbers," Experiments in Fluids, vol. 57, no. 5, pp. 1–11, 2016.
[4] Y. Errami, M. Ouassaid, and M. Maaroufi, "Optimal power control strategy of maximizing wind energy tracking and different operating conditions for permanent magnet synchronous generator wind farm," Energy Procedia, vol. 74, pp. 477–490, 2015.
[5] M. Yin, W. Li, C. Y. Chung, L. Zhou, Z. Chen, and Y. Zou, "Optimal torque control based on effective tracking range for maximum power point tracking of wind turbines under varying wind conditions," IET Renewable Power Generation, vol. 11, no. 4, pp. 501–510, 2017.
[6] D. Kumar and K. Chatterjee, "A review of conventional and advanced mppt algorithms for wind energy systems," Renewable and Sustainable Energy Reviews, vol. 55, pp. 957–970, 2016.
[7] H. H. Mousa, A.-R. Youssef, and E. E. Mohamed, "Variable step size p&o mppt algorithm for optimal power extraction of multi-phase pmsg based wind generation system," International Journal of Electrical Power & Energy Systems, vol. 108, pp. 218–231, 2019.
[8] C.-H. Chen, C.-M. Hong, and T.-C. Ou, "Wrbf network based control strategy for pmsg on smart grid," in 2011 16th International Conference on Intelligent System Applications to Power Systems, 2011, pp. 1–6.
[9] T. Li and Z. Ji, "Intelligent inverse control to maximum power point tracking control strategy of wind energy conversion system," in 2011 Chinese Control and Decision Conference (CCDC), 2011, pp. 970–974.
[10] C. Wei, Z. Zhang, W. Qiao, and L. Qu, "Intelligent maximum power extraction control for wind energy conversion systems based on online q-learning with function approximation," in 2014 IEEE Energy Conversion Congress and Exposition (ECCE), 2014, pp. 4911–4916.
[11] A. Kushwaha, M. Gopal, and B. Singh, "Q-learning based maximum power extraction for wind energy conversion system with variable wind speed," IEEE Transactions on Energy Conversion, vol. 35, no. 3, pp. 1160–1170, 2020.
[12] S. Azzouz, S. Messalti, and A. Harrag, "A novel hybrid mppt controller using (p&o)-neural networks for variable speed wind turbine based on dfig," 1874.
[13] M. A. Abdullah, T. Al-Hadhrami, C. W. Tan, and A. H. Yatim,
“Towards green energy for smart cities: Particle swarm optimization
based mppt approach,” IEEE Access, vol. 6, pp. 58 427–58 438, 2018.
[14] J. Hussain and M. K. Mishra, “Adaptive maximum power point
tracking control algorithm for wind energy conversion systems,” IEEE
Transactions on Energy Conversion, vol. 31, no. 2, pp. 697–705, 2016.
[15] M. R. Javed, A. Waleed, U. S. Virk, and S. Z. ul Hassan, “Comparison
of the adaptive neural-fuzzy interface system (anfis) based solar max-
imum power point tracking (mppt) with other solar mppt methods,” in
2020 IEEE 23rd international multitopic conference (INMIC). IEEE,
2020, pp. 1–5.
[16] K.-Y. Chou, S.-T. Yang, and Y.-P. Chen, “Maximum power point
tracking of photovoltaic system based on reinforcement learning,”
Sensors, vol. 19, no. 22, p. 5054, 2019.
[17] C. Wei, Z. Zhang, W. Qiao, and L. Qu, “Reinforcement-learning-based
intelligent maximum power point tracking control for wind energy
conversion systems,” IEEE Transactions on Industrial Electronics,
vol. 62, no. 10, pp. 6360–6370, 2015.
[18] P.-H. Su, P. Budzianowski, S. Ultes, M. Gasic, and S. Young, “Sample-
efficient actor-critic reinforcement learning with supervised data for
dialogue management,” arXiv preprint arXiv:1707.00130, 2017.
[19] N. Femia, G. Petrone, G. Spagnuolo, and M. Vitelli, “Optimization of
perturb and observe maximum power point tracking method,” IEEE
transactions on power electronics, vol. 20, no. 4, pp. 963–973, 2005.