Deep Reinforcement Learning for Online Computation Offloading in Wireless Powered Mobile-Edge Computing Networks

Liang Huang, Member, IEEE, Suzhi Bi, Senior Member, IEEE, and Ying-Jun Angela Zhang, Senior Member, IEEE

Abstract—Wireless powered mobile-edge computing (MEC) has recently emerged as a promising paradigm to enhance the data
processing capability of low-power networks, such as wireless sensor networks and internet of things (IoT). In this paper, we consider a
wireless powered MEC network that adopts a binary offloading policy, so that each computation task of wireless devices (WDs) is
either executed locally or fully offloaded to an MEC server. Our goal is to acquire an online algorithm that optimally adapts task
offloading decisions and wireless resource allocations to the time-varying wireless channel conditions. This requires quickly solving
hard combinatorial optimization problems within the channel coherence time, which is hardly achievable with conventional numerical
optimization methods. To tackle this problem, we propose a Deep Reinforcement learning-based Online Offloading (DROO) framework
that implements a deep neural network as a scalable solution that learns the binary offloading decisions from the experience. It
eliminates the need of solving combinatorial optimization problems, and thus greatly reduces the computational complexity especially
in large-size networks. To further reduce the complexity, we propose an adaptive procedure that automatically adjusts the parameters
of the DROO algorithm on the fly. Numerical results show that the proposed algorithm can achieve near-optimal performance while
significantly decreasing the computation time by more than an order of magnitude compared with existing optimization methods. For
example, the CPU execution latency of DROO is less than 0.1 second in a 30-user network, making real-time and optimal offloading
truly viable even in a fast fading environment.

Index Terms—Mobile-edge computing, wireless power transfer, reinforcement learning, resource allocation.

1 INTRODUCTION

DUE to the small form factor and stringent production cost constraint, modern Internet of Things (IoT) devices are often limited in battery lifetime and computing power. Thanks to the recent advance in wireless power transfer (WPT) technology, the batteries of wireless devices (WDs) can be continuously charged over the air without the need of battery replacement [1]. Meanwhile, the device computing power can be effectively enhanced by the recent development of mobile-edge computing (MEC) technology [2], [3]. With MEC, the WDs can offload computationally intensive tasks to nearby edge servers to reduce computation latency and energy consumption [4], [5].

The newly emerged wireless powered MEC combines the advantages of the two aforementioned technologies, and thus holds significant promise to solve the two fundamental performance limitations for IoT devices [6], [7]. In this paper, we consider a wireless powered MEC system as shown in Fig. 1, where the access point (AP) is responsible for both transferring RF (radio frequency) energy to and receiving computation offloading from the WDs. In particular, the WDs follow a binary task offloading policy [8], where a task is either computed locally or offloaded to the MEC server for remote computing. The system setup may correspond to a typical outdoor IoT network, where each energy-harvesting wireless sensor computes a non-partitionable simple sensing task with the assistance of an MEC server.

In a wireless fading environment, the time-varying wireless channel condition largely impacts the optimal offloading decision of a wireless powered MEC system [9]. In a multi-user scenario, a major challenge is the joint optimization of individual computing mode (i.e., offloading or local computing) and wireless resource allocation (e.g., the transmission air time divided between WPT and offloading). Such problems are generally formulated as mixed integer programming (MIP) problems due to the existence of binary offloading variables. To tackle the MIP problems, branch-and-bound algorithms [10] and dynamic programming [11] have been adopted, however, with prohibitively high computational complexity, especially for large-scale MEC networks. To reduce the computational complexity, heuristic local search [7], [12] and convex relaxation [13], [14] methods are proposed. However, both of them require a considerable number of iterations to reach a satisfying local optimum. Hence, they are not suitable for making real-time offloading decisions in fast fading channels, as the optimization problem needs to be re-solved once the channel fading has varied significantly.

• L. Huang is with the College of Information Engineering, Zhejiang University of Technology, Hangzhou, China 310058 (e-mail: [email protected]).
• S. Bi is with the College of Electronic and Information Engineering, Shenzhen University, Shenzhen, Guangdong, China 518060 (e-mail: [email protected]).
• Y.-J. A. Zhang is with the Department of Information Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong (e-mail: [email protected]).
[Fig. 1 (schematic): a dual-function AP with an MEC server serves three WDs over channels h1, h2, h3. WD1 and WD3 offload their tasks while WD2 computes locally; each WD contains an energy harvesting circuit, a communication circuit, and a computing unit. The time allocation within a frame of length T is aT for WPT (AP to WDs), τ1T and τ3T for offloading from WD1 and WD3 to the AP, and approximately zero time for downloading the results back.]

Fig. 1. An example of the considered wireless powered MEC network and system time allocation.

In this paper, we consider a wireless powered MEC network with one AP and multiple WDs as shown in Fig. 1, where each WD follows a binary offloading policy. In particular, we aim to jointly optimize the individual WD's task offloading decisions, transmission time allocation between WPT and task offloading, and time allocation among multiple WDs according to the time-varying wireless channels. Towards this end, we propose a deep reinforcement learning-based online offloading (DROO) framework to maximize the weighted sum of the computation rates of all the WDs, i.e., the number of processed bits within a unit time. Compared with the existing integer programming and learning-based methods, we have the following novel contributions:

1) The proposed DROO framework learns from the past offloading experiences under various wireless fading conditions, and automatically improves its action generating policy. As such, it completely removes the need of solving complex MIP problems, and thus, the computational complexity does not explode with the network size.
2) Unlike many existing deep learning methods that optimize all system parameters at the same time, resulting in infeasible solutions, DROO decomposes the original optimization problem into an offloading decision sub-problem and a resource allocation sub-problem, such that all physical constraints are guaranteed. It works for continuous state spaces and does not require the discretization of channel gains, thus avoiding the curse of dimensionality problem.
3) To efficiently generate offloading actions, we devise a novel order-preserving action generation method. Specifically, it only needs to select from few candidate actions each time, thus is computationally feasible and efficient in large-size networks with high-dimensional action space. Meanwhile, it also provides high diversity in the generated actions and leads to better convergence performance than conventional action generation techniques.
4) We further develop an adaptive procedure that automatically adjusts the parameters of the DROO algorithm on the fly. Specifically, it gradually decreases the number of convex resource allocation sub-problems to be solved in a time frame. This effectively reduces the computational complexity without compromising the solution quality.

We evaluate the proposed DROO framework under extensive numerical studies. Our results show that on average the DROO algorithm achieves over 99.5% of the computation rate of the existing near-optimal benchmark method [7]. Compared to the Linear Relaxation (LR) algorithm [13], it significantly reduces the CPU execution latency by more than an order of magnitude, e.g., from 0.81 second to 0.059 second in a 30-user network. This makes real-time and optimal design truly viable in wireless powered MEC networks even in a fast fading environment. The complete source code implementing DROO is available at https://github.com/revenol/DROO.

The remainder of this paper is organized as follows. In Section 2, a review of related works in literature is presented. In Section 3, we describe the system model and problem formulation. We introduce the detailed designs of the DROO algorithm in Section 4. Numerical results are presented in Section 5. Finally, the paper is concluded in Section 6.

2 RELATED WORK

There are many related works that jointly model the computing mode decision problem and resource allocation problem in MEC networks as MIP problems. For instance, [7] proposed a coordinate descent (CD) method that searches along one variable dimension at a time. [12] studies a similar heuristic search method for multi-server MEC networks, which iteratively adjusts binary offloading decisions. Another widely adopted heuristic is through convex relaxation, e.g., by relaxing integer variables to be continuous between 0 and 1 [13] or by approximating the binary constraints with quadratic constraints [14]. Nonetheless, on one hand, the solution quality of the reduced-complexity heuristics is not guaranteed. On the other hand, both search-based and convex relaxation methods require a considerable number of iterations to reach a satisfying local optimum and are inapplicable for fast fading channels.

Our work is inspired by recent advances of deep reinforcement learning in handling reinforcement learning problems with large state spaces [15] and action spaces [16]. In particular, it relies on deep neural networks (DNNs) [17] to learn from the training data samples, and eventually produces the optimal mapping from the state space to the action space. There exists limited work on deep reinforcement learning-based offloading for MEC networks [18]–[22]. By taking advantage of parallel computing, [19] proposed a distributed deep learning-based offloading (DDLO) algorithm for MEC networks. For energy-harvesting MEC networks, [20] proposed a deep Q-network (DQN) based offloading policy to optimize the computational performance. Under the similar network setup, [21] studied an online computation offloading policy based on DQN under random task arrivals.
However, both DQN-based works take discretized channel gains as the input state vector, and thus suffer from the curse of dimensionality and slow convergence when high channel quantization accuracy is required. Besides, because of its exhaustive search nature in selecting the action in each iteration, DQN is not suitable for handling problems with high-dimensional action spaces [23]. In our problem, there are a total of 2^N offloading decisions (actions) to choose from, where DQN is evidently inapplicable even for a small N, e.g., N = 20.

3 PRELIMINARY

3.1 System Model

As shown in Fig. 1, we consider a wireless powered MEC network consisting of an AP and N fixed WDs, denoted as a set N = {1, 2, . . . , N}, where each device has a single antenna. In practice, this may correspond to a static sensor network or a low-power IoT system. The AP has stable power supply and can broadcast RF energy to the WDs. Each WD has a rechargeable battery that can store the harvested energy to power the operations of the device. Suppose that the AP has higher computational capability than the WDs, so that the WDs may offload their computing tasks to the AP. Specifically, we suppose that WPT and communication (computation offloading) are performed in the same frequency band. Accordingly, a time-division-multiplexing (TDD) circuit is implemented at each device to avoid mutual interference between WPT and communication.

The system time is divided into consecutive time frames of equal lengths T, which is set smaller than the channel coherence time, e.g., in the scale of several seconds [24]–[26] in a static IoT environment. At each tagged time, both the amount of energy that a WD harvests from the AP and the communication speed between them are related to the wireless channel gain. Let hi denote the wireless channel gain between the AP and the i-th WD at a tagged time frame. The channel is assumed to be reciprocal in the downlink and uplink,^1 and remain unchanged within each time frame, but may vary across different frames. At the beginning of a time frame, aT amount of time is used for WPT, a ∈ [0, 1], where the AP broadcasts RF energy for the WDs to harvest. Specifically, the i-th WD harvests Ei = µP hi aT amount of energy, where µ ∈ (0, 1) denotes the energy harvesting efficiency and P denotes the AP transmit power [1]. With the harvested energy, each WD needs to accomplish a prioritized computing task before the end of a time frame. A unique weight wi is assigned to the i-th WD. The greater the weight wi, the more computation rate is allocated to the i-th WD. In this paper, we consider a binary offloading policy, such that the task is either computed locally at the WD (such as WD2 in Fig. 1) or offloaded to the AP (such as WD1 and WD3 in Fig. 1). Let xi ∈ {0, 1} be an indicator variable, where xi = 1 denotes that the i-th user's computation task is offloaded to the AP, and xi = 0 denotes that the task is computed locally.

1. The channel reciprocity assumption is made to simplify the notations of channel state. However, the results of this paper can be easily extended to the case with unequal uplink and downlink channels.

3.2 Local Computing Mode

A WD in the local computing mode can harvest energy and compute its task simultaneously [6]. Let fi denote the processor's computing speed (cycles per second) and 0 ≤ ti ≤ T denote the computation time. Then, the amount of processed bits by the WD is fi ti /φ, where φ > 0 denotes the number of cycles needed to process one bit of task data. Meanwhile, the energy consumption of the WD due to the computing is constrained by ki fi^3 ti ≤ Ei, where ki denotes the computation energy efficiency coefficient [13]. It can be shown that to process the maximum amount of data within T under the energy constraint, a WD should exhaust the harvested energy and compute throughout the time frame, i.e., t_i* = T and accordingly f_i* = (Ei /(ki T))^{1/3}. Thus, the local computation rate (in bits per second) is

    r*_{L,i}(a) = f_i* t_i* /(φT) = η1 (hi /ki)^{1/3} a^{1/3},    (1)

where η1 ≜ (µP)^{1/3}/φ is a fixed parameter.

3.3 Edge Computing Mode

Due to the TDD constraint, a WD in the offloading mode can only offload its task to the AP after harvesting energy. We denote τi T as the offloading time of the i-th WD, τi ∈ [0, 1]. Here, we assume that the computing speed and the transmit power of the AP is much larger than the size- and energy-constrained WDs, e.g., by more than three orders of magnitude [6], [9]. Besides, the computation feedback to be downloaded to the WD is much shorter than the data offloaded to the edge server. Accordingly, as shown in Fig. 1, we safely neglect the time spent on task computation and downloading by the AP, such that each time frame is only occupied by WPT and task offloading, i.e.,

    Σ_{i=1}^{N} τi + a ≤ 1.    (2)

To maximize the computation rate, an offloading WD exhausts its harvested energy on task offloading, i.e., P_i* = Ei /(τi T). Accordingly, the computation rate equals to its data offloading capacity, i.e.,

    r*_{O,i}(a, τi) = (B τi /vu) log2(1 + µP a hi^2 /(τi N0)),    (3)

where B denotes the communication bandwidth and N0 denotes the receiver noise power.

3.4 Problem Formulation

Among all the system parameters in (1) and (3), we assume that only the wireless channel gains h = {hi | i ∈ N} are time-varying in the considered period, while the others (e.g., wi's and ki's) are fixed parameters. Accordingly, the weighted sum computation rate of the wireless powered MEC network in a tagged time frame is denoted as

    Q(h, x, τ, a) ≜ Σ_{i=1}^{N} wi ((1 − xi) r*_{L,i}(a) + xi r*_{O,i}(a, τi)),

where x = {xi | i ∈ N} and τ = {τi | i ∈ N}.
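To make the rate expressions concrete, the following minimal Python sketch evaluates (1), (3), and the weighted sum rate Q(h, x, τ, a). The numerical constants are the illustrative values later used in Section 5 and are assumptions of this sketch, not part of the model:

    import numpy as np

    # Illustrative values from Section 5 (assumed here for demonstration only).
    mu, P = 0.51, 3.0            # energy harvesting efficiency and AP transmit power (W)
    phi, k_i = 100, 1e-26        # cycles per bit and computation energy efficiency
    B, N0, vu = 2e6, 1e-10, 1.1  # bandwidth (Hz), noise power, constant vu > 1

    def local_rate(h_i, a):
        """Local computation rate r*_{L,i}(a) in (1)."""
        eta1 = (mu * P) ** (1 / 3) / phi
        return eta1 * (h_i / k_i) ** (1 / 3) * a ** (1 / 3)

    def offload_rate(h_i, a, tau_i):
        """Offloading rate r*_{O,i}(a, tau_i) in (3)."""
        if tau_i <= 0:
            return 0.0
        return B * tau_i / vu * np.log2(1 + mu * P * a * h_i ** 2 / (tau_i * N0))

    def weighted_sum_rate(h, w, x, tau, a):
        """Weighted sum computation rate Q(h, x, tau, a)."""
        return sum(w_i * ((1 - x_i) * local_rate(h_i, a) + x_i * offload_rate(h_i, a, t_i))
                   for h_i, w_i, x_i, t_i in zip(h, w, x, tau))

Under these helpers, the objective Q(h, x, τ, a) defined above is exactly weighted_sum_rate(h, w, x, tau, a).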
[Fig. 2 (block diagram): Computation Rate Maximization, i.e., solving the MIP problem (P1) over (x, τ, a), is split into an Offloading Decision block (deep reinforcement learning, producing x) and a Resource Allocation block (solving the convex problem (P2) over (τ, a)).]

Fig. 2. The two-level optimization structure of solving (P1).

TABLE 1
Notations used throughout the paper

Notation | Description
N        | The number of WDs
T        | The length of a time frame
i        | Index of the i-th WD
hi       | The wireless channel gain between the i-th WD and the AP
a        | The fraction of time that the AP broadcasts RF energy for the WDs to harvest
Ei       | The amount of energy harvested by the i-th WD
P        | The AP transmit power when broadcasting RF energy
µ        | The energy harvesting efficiency
wi       | The weight assigned to the i-th WD
xi       | An offloading indicator for the i-th WD
fi       | The processor's computing speed of the i-th WD
φ        | The number of cycles needed to process one bit of task data
ti       | The computation time of the i-th WD
ki       | The computation energy efficiency coefficient
τi       | The fraction of time allocated to the i-th WD for task offloading
B        | The communication bandwidth
N0       | The receiver noise power
h        | The vector representation of wireless channel gains {hi | i ∈ N}
x        | The vector representation of offloading indicators {xi | i ∈ N}
τ        | The vector representation of {τi | i ∈ N}
Q(·)     | The weighted sum computation rate function
π        | Offloading policy function
θ        | The parameters of the DNN
x̂t       | Relaxed computation offloading action
K        | The number of quantized binary offloading actions
gK       | The quantization function
L(·)     | The training loss function of the DNN
δ        | The training interval of the DNN
∆        | The updating interval for K

For each time frame with channel realization h, we are interested in maximizing the weighted sum computation rate:

    (P1): Q*(h) = maximize_{x, τ, a}  Q(h, x, τ, a)    (4a)
          subject to  Σ_{i=1}^{N} τi + a ≤ 1,          (4b)
                      a ≥ 0, τi ≥ 0, ∀i ∈ N,           (4c)
                      xi ∈ {0, 1}.                      (4d)

We can easily infer that τi = 0 if xi = 0, i.e., when the i-th WD is in the local computing mode.

Problem (P1) is a mixed integer programming non-convex problem, which is hard to solve. However, once x is given, (P1) reduces to a convex problem as follows.

    (P2): Q*(h, x) = maximize_{τ, a}  Q(h, x, τ, a)
          subject to  Σ_{i=1}^{N} τi + a ≤ 1,
                      a ≥ 0, τi ≥ 0, ∀i ∈ N.

Accordingly, problem (P1) can be decomposed into two sub-problems, namely, offloading decision and resource allocation (P2), as shown in Fig. 2:

• Offloading Decision: One needs to search among the 2^N possible offloading decisions to find an optimal or a satisfying sub-optimal offloading decision x. For instance, meta-heuristic search algorithms are proposed in [7] and [12] to optimize the offloading decisions. However, due to the exponentially large search space, it takes a long time for the algorithms to converge.
• Resource Allocation: The optimal time allocation {a*, τ*} of the convex problem (P2) can be efficiently solved, e.g., using a one-dimensional bi-section search over the dual variable associated with the time allocation constraint in O(N) complexity [7].

The major difficulty of solving (P1) lies in the offloading decision problem. Traditional optimization algorithms require iteratively adjusting the offloading decisions towards the optimum [11], which is fundamentally infeasible for real-time system optimization under fast fading channel. To tackle the complexity issue, we propose a novel deep reinforcement learning-based online offloading (DROO) algorithm that can achieve a millisecond order of computational time in solving the offloading decision problem.

Before leaving this section, it is worth mentioning the advantages of applying deep reinforcement learning over supervised learning-based deep neural network (DNN) approaches (such as in [27] and [28]) in dynamic wireless applications. Other than the fact that deep reinforcement learning does not need manually labeled training samples (e.g., the (h, x) pairs in this paper) as DNN, it is much more robust to the change of user channel distributions. For instance, the DNN needs to be completely retrained once some WDs change their locations significantly or are suddenly turned off. In contrast, the adopted deep reinforcement learning method can automatically update its offloading decision policy upon such channel distribution changes without manual involvement. The important notations used throughout this paper are summarized in Table 1.

4 THE DROO ALGORITHM

We aim to devise an offloading policy function π that quickly generates an optimal offloading action x* ∈ {0, 1}^N of (P1) once the channel realization h ∈ R^N_{>0} is revealed at the beginning of each time frame. The policy is denoted as

    π : h ↦ x*.    (5)

The proposed DROO algorithm gradually learns such a policy function π from the experience.
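Each candidate offloading action that DROO evaluates is scored by solving (P2) for that fixed x. As a minimal sketch of this step, the snippet below hands (P2) to SciPy's general-purpose SLSQP solver and reuses the rate helpers sketched after Section 3; the paper itself solves (P2) with the O(N) bi-section search of [7], so this is only an illustration of the decomposition, not the authors' implementation:

    import numpy as np
    from scipy.optimize import minimize

    def solve_p2(h, w, x, local_rate, offload_rate):
        """Approximate Q*(h, x) of (P2) for a fixed offloading decision x."""
        off = [i for i, x_i in enumerate(x) if x_i == 1]   # WDs that offload
        n_var = 1 + len(off)                               # variables: a and tau_j of offloaders

        def neg_q(z):
            a, taus = z[0], z[1:]
            q = 0.0
            for i, (h_i, w_i, x_i) in enumerate(zip(h, w, x)):
                if x_i == 0:
                    q += w_i * local_rate(h_i, a)
                else:
                    q += w_i * offload_rate(h_i, a, taus[off.index(i)])
            return -q

        cons = [{"type": "ineq", "fun": lambda z: 1.0 - np.sum(z)}]  # a + sum(tau) <= 1
        z0 = np.full(n_var, 1.0 / (n_var + 1))                       # feasible starting point
        res = minimize(neg_q, z0, bounds=[(1e-6, 1.0)] * n_var,
                       constraints=cons, method="SLSQP")
        return -res.fun, res.x[0], res.x[1:]   # Q*(h, x), a*, and tau* of the offloaders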
4.1 Algorithm Overview

[Fig. 3 (schematic): in the t-th time frame, the channel gain ht is the input to a DNN; the relaxed output is quantized by gK into K candidate offloading actions; Q*(ht, xk) is computed for each candidate by solving the convex problem (P2), and the arg max selects the offloading action; the output for the time frame is {x*t, a*t, τ*t}; the pair (ht, x*t) is stored in a replay memory, from which random batches of training samples are drawn to train the DNN and update the offloading policy.]

Fig. 3. The schematics of the proposed DROO algorithm.
The structure of the DROO algorithm is illustrated in Fig. 3. It is composed of two alternating stages: offloading action generation and offloading policy update. The generation of the offloading action relies on the use of a DNN, which is characterized by its embedded parameters θ, e.g., the weights that connect the hidden neurons. In the t-th time frame, the DNN takes the channel gain ht as the input, and outputs a relaxed offloading action x̂t (each entry is relaxed to continuous between 0 and 1) based on its current offloading policy πθt, parameterized by θt. The relaxed action is then quantized into K binary offloading actions, among which one best action x*t is selected based on the achievable computation rate as in (P2). The corresponding {x*t, a*t, τ*t} is output as the solution for ht, which guarantees that all the physical constraints listed in (4b)-(4d) are satisfied. The network takes the offloading action x*t, receives a reward Q*(ht, x*t), and adds the newly obtained state-action pair (ht, x*t) to the replay memory.

Subsequently, in the policy update stage of the t-th time frame, a batch of training samples are drawn from the memory to train the DNN, which accordingly updates its parameter from θt to θt+1 (and equivalently the offloading policy πθt+1). The new offloading policy πθt+1 is used in the next time frame to generate offloading decision x*t+1 according to the new channel ht+1 observed. Such iterations repeat thereafter as new channel realizations are observed, and the policy πθt of the DNN is gradually improved. The descriptions of the two stages are detailed in the following subsections.

4.2 Offloading Action Generation

Suppose that we observe the channel gain realization ht in the t-th time frame, where t = 1, 2, · · ·. The parameters of the DNN θt are randomly initialized following a zero-mean normal distribution when t = 1. The DNN first outputs a relaxed computation offloading action x̂t, represented by a parameterized function x̂t = fθt(ht), where

    x̂t = {x̂t,i | x̂t,i ∈ [0, 1], i = 1, · · · , N}    (6)

and x̂t,i denotes the i-th entry of x̂t.

The well-known universal approximation theorem claims that one hidden layer with enough hidden neurons suffices to approximate any continuous mapping f if a proper activation function is applied at the neurons, e.g., sigmoid, ReLU, and tanh functions [29]. Here, we use ReLU as the activation function in the hidden layers, where the output y and input v of a neuron are related by y = max{v, 0}. In the output layer, we use a sigmoid activation function, i.e., y = 1/(1 + e^{−v}), such that the relaxed offloading action satisfies x̂t,i ∈ (0, 1).

Then, we quantize x̂t to obtain K binary offloading actions, where K is a design parameter. The quantization function, gK, is defined as

    gK : x̂t ↦ {xk | xk ∈ {0, 1}^N, k = 1, · · · , K}.    (7)

In general, K can be any integer within [1, 2^N] (N is the number of WDs), where a larger K results in better solution quality and higher computational complexity, and vice versa. To balance the performance and complexity, we propose an order-preserving quantization method, where the value of K could be set from 1 to (N + 1). The basic idea is to preserve the ordering during quantization. That is, for each quantized action xk, xk,i ≥ xk,j should hold if x̂t,i ≥ x̂t,j for all i, j ∈ {1, · · · , N}. Specifically, for a given 1 ≤ K ≤ N + 1, the set of K quantized actions {xk} is generated from the relaxed action x̂t as follows:

1) The first binary offloading decision x1 is obtained as

    x1,i = 1 if x̂t,i > 0.5, and x1,i = 0 if x̂t,i ≤ 0.5,    (8)

for i = 1, · · · , N.

2) To generate the remaining K − 1 actions, we first order the entries of x̂t with respect to their distances to 0.5, denoted by |x̂t,(1) − 0.5| ≤ |x̂t,(2) − 0.5| ≤ · · · ≤ |x̂t,(i) − 0.5| ≤ · · · ≤ |x̂t,(N) − 0.5|, where x̂t,(i) is the i-th order statistic of x̂t.
Then, the k-th offloading decision xk, where k = 2, · · · , K, is calculated based on x̂t,(k−1) as

    xk,i = 1  if x̂t,i > x̂t,(k−1),
    xk,i = 1  if x̂t,i = x̂t,(k−1) and x̂t,(k−1) ≤ 0.5,
    xk,i = 0  if x̂t,i = x̂t,(k−1) and x̂t,(k−1) > 0.5,
    xk,i = 0  if x̂t,i < x̂t,(k−1),    (9)

for i = 1, · · · , N.

Because there are in total N order statistics of x̂t, while each can be used to generate one quantized action from (9), the above order-preserving quantization method in (8) and (9) generates at most (N + 1) quantized actions, i.e., K ≤ N + 1. In general, setting a large K (e.g., K = N) leads to better computation rate performance at the cost of higher complexity. However, as we will show later in Section 4.4, it is not only inefficient but also unnecessary to generate a large number of quantized actions in each time frame. Instead, setting a small K (even close to 1) suffices to achieve good computation rate performance and low complexity after a sufficiently long training period.

We use an example to illustrate the above order-preserving quantization method. Suppose that x̂t = [0.2, 0.4, 0.7, 0.9] and K = 4. The corresponding order statistics of x̂t are x̂t,(1) = 0.4, x̂t,(2) = 0.7, x̂t,(3) = 0.2, and x̂t,(4) = 0.9. Therefore, the 4 offloading actions generated from the above quantization method are x1 = [0, 0, 1, 1], x2 = [0, 1, 1, 1], x3 = [0, 0, 0, 1], and x4 = [1, 1, 1, 1]. In comparison, when the conventional KNN method is used, the obtained actions are x1 = [0, 0, 1, 1], x2 = [0, 1, 1, 1], x3 = [0, 0, 0, 1], and x4 = [0, 1, 0, 1].

Compared to the KNN method where the quantized solutions are closely placed around x̂t, the offloading actions produced by the order-preserving quantization method are separated by a larger distance. Intuitively, this creates higher diversity in the candidate action set, thus increasing the chance of finding a local maximum around x̂t. In Section 5.1, we show that the proposed order-preserving quantization method achieves better convergence performance than the KNN method.

Recall that each candidate action xk can achieve Q*(ht, xk) computation rate by solving (P2). Therefore, the best offloading action x*t at the t-th time frame is chosen as

    x*t = arg max_{xi ∈ {xk}} Q*(ht, xi).    (10)

Note that the K-times evaluation of Q*(ht, xk) can be processed in parallel to speed up the computation of (10). Then, the network outputs the offloading action x*t along with its corresponding optimal resource allocation (τ*t, a*t).

4.3 Offloading Policy Update

The offloading solution obtained in (10) will be used to update the offloading policy of the DNN. Specifically, we maintain an initially empty memory of limited capacity. At the t-th time frame, a new training data sample (ht, x*t) is added to the memory. When the memory is full, the newly generated data sample replaces the oldest one.

We use the experience replay technique [15], [30] to train the DNN using the stored data samples. In the t-th time frame, we randomly select a batch of training data samples {(hτ, x*τ) | τ ∈ Tt} from the memory, characterized by a set of time indices Tt. The parameters θt of the DNN are updated by applying the Adam algorithm [31] to reduce the averaged cross-entropy loss, as

    L(θt) = −(1/|Tt|) Σ_{τ ∈ Tt} [ (x*τ)ᵀ log fθt(hτ) + (1 − x*τ)ᵀ log(1 − fθt(hτ)) ],

where |Tt| denotes the size of Tt, the superscript ᵀ denotes the transpose operator, and the log function denotes the element-wise logarithm operation of a vector. The detailed update procedure of the Adam algorithm is omitted here for brevity. In practice, we train the DNN every δ time frames after collecting a sufficient number of new data samples. The experience replay technique used in our framework has several advantages. First, the batch update has a lower complexity than using the entire set of data samples. Second, the reuse of historical data reduces the variance of θt during the iterative update. Third, the random sampling speeds up the convergence by reducing the correlation in the training samples.

Overall, the DNN iteratively learns from the best state-action pairs (ht, x*t)'s and generates better offloading decision outputs as the time progresses. Meanwhile, with the finite memory space constraint, the DNN only learns from the most recent data samples generated by the most recent (and more refined) offloading policies. This closed-loop reinforcement learning mechanism constantly improves its offloading policy until convergence. We provide the pseudo-code of the DROO algorithm in Algorithm 1.

Algorithm 1: An online DROO algorithm to solve the offloading decision problem.
  input : Wireless channel gain ht at each time frame t, the number of quantized actions K
  output: Offloading action x*t, and the corresponding optimal resource allocation for each time frame t
  1  Initialize the DNN with random parameters θ1 and empty memory;
  2  Set iteration number M and the training interval δ;
  3  for t = 1, 2, . . . , M do
  4      Generate a relaxed offloading action x̂t = fθt(ht);
  5      Quantize x̂t into K binary actions {xk} = gK(x̂t);
  6      Compute Q*(ht, xk) for all {xk} by solving (P2);
  7      Select the best action x*t = arg max_{xk} Q*(ht, xk);
  8      Update the memory by adding (ht, x*t);
  9      if t mod δ = 0 then
  10         Uniformly sample a batch of data set {(hτ, x*τ) | τ ∈ Tt} from the memory;
  11         Train the DNN with {(hτ, x*τ) | τ ∈ Tt} and update θt using the Adam algorithm;
  12     end
  13 end
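As a concrete illustration of the order-preserving quantization gK in (8) and (9) used in the quantization step of Algorithm 1, the sketch below reproduces the worked example given earlier; it is an illustrative re-implementation rather than the authors' released code:

    import numpy as np

    def order_preserving_quantization(x_hat, K):
        """Map the relaxed action x_hat in [0, 1]^N into K binary actions (K <= N + 1)."""
        x_hat = np.asarray(x_hat, dtype=float)
        actions = [(x_hat > 0.5).astype(int)]            # first action, Eq. (8)
        order = np.argsort(np.abs(x_hat - 0.5))          # entries ordered by distance to 0.5
        for k in range(1, K):
            if k > len(x_hat):                           # at most N + 1 distinct actions
                break
            ref = x_hat[order[k - 1]]                    # the (k-1)-th order statistic
            if ref <= 0.5:
                actions.append((x_hat >= ref).astype(int))   # Eq. (9): ties mapped to 1
            else:
                actions.append((x_hat > ref).astype(int))    # Eq. (9): ties mapped to 0
        return actions

    # order_preserving_quantization([0.2, 0.4, 0.7, 0.9], K=4) returns
    # [0,0,1,1], [0,1,1,1], [0,0,0,1], [1,1,1,1], matching the example in the text.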
4.4 Adaptive Setting of K

Compared to the conventional optimization algorithms, the DROO algorithm has the advantage of removing the need of solving hard MIP problems, and thus has the potential to significantly reduce the complexity. The major computational complexity of the DROO algorithm comes from solving (P2) K times in each time frame to select the best offloading action. Evidently, a larger K (e.g., K = N) in general leads to a better offloading decision in each time frame and accordingly a better offloading policy in the long term. Therefore, there exists a fundamental performance-complexity tradeoff in setting the value of K.

In this subsection, we propose an adaptive procedure to automatically adjust the number of quantized actions generated by the order-preserving quantization method. We argue that using a large and fixed K is not only computationally inefficient but also unnecessary in terms of computation rate performance. To see this, consider a wireless powered MEC network with N = 10 WDs. We apply the DROO algorithm with a fixed K = 10 and plot in Fig. 4 the index of the best action x*t calculated from (10) over time, denoted as k*t. For instance, k*t = 2 indicates that the best action in the t-th time frame is ranked the second among the K ordered quantized actions. In the figure, the curve is plotted as the 50-time-frames rolling average of k*t and the light shadow region is the upper and lower bounds of k*t in the past 50 time frames. Apparently, most of the selected indices k*t are no larger than 5 when t ≥ 5000. This indicates that those generated offloading actions xk with k > 5 are redundant. In other words, we can gradually reduce K during the learning process to speed up the algorithm without compromising the performance.

[Fig. 4 (plot): the 50-time-frame rolling average of the best-action index k*t (from 1 to 10) over 10,000 time frames, with a shaded band showing its range in the past 50 frames.]

Fig. 4. The index k*t of the best offloading actions x*t for DROO algorithm when the number of WDs is N = 10 and K = N. The detailed simulation setups are presented in Section 5.

Inspired by the results in Fig. 4, we propose an adaptive method for setting K. We denote Kt as the number of binary offloading actions generated by the quantization function at the t-th time frame. We set K1 = N initially and update Kt every ∆ time frames, where ∆ is referred to as the updating interval for K. Upon an update time frame, Kt is set as 1 plus the largest k*t observed in the past ∆ time frames. The reason for the additional 1 is to allow Kt to increase during the iterations. Mathematically, Kt is calculated as

    Kt = N,                                              if t = 1,
    Kt = min( max(k*_{t−1}, · · · , k*_{t−∆}) + 1, N ),  if t mod ∆ = 0,
    Kt = K_{t−1},                                        otherwise,

for t ≥ 1. For an extreme case with ∆ = 1, Kt updates in each time frame. Meanwhile, when ∆ → ∞, Kt never updates such that it is equivalent to setting a constant K = N. In Section 5.2, we numerically show that setting a proper ∆ can effectively speed up the learning process without compromising the computation rate performance.

5 NUMERICAL RESULTS

In this section, we use simulations to evaluate the performance of the proposed DROO algorithm. In all simulations, we use the parameters of Powercast TX91501-3W with P = 3 Watts for the energy transmitter at the AP, and those of P2110 Powerharvester for the energy receiver at each WD.^2 The energy harvesting efficiency µ = 0.51. The distance from the i-th WD to the AP, denoted by di, is uniformly distributed in the range of (2.5, 5.2) meters, i = 1, · · · , N. Due to the page limit, the exact values of di's are omitted. The average channel gain h̄i follows the free-space path loss model h̄i = Ad (3·10^8/(4π fc di))^{de}, where Ad = 4.11 denotes the antenna gain, fc = 915 MHz denotes the carrier frequency, and de = 2.8 denotes the path loss exponent. The time-varying wireless channel gain of the N WDs at time frame t, denoted by ht = [h1^t, h2^t, · · · , hN^t], is generated from a Rayleigh fading channel model as hi^t = h̄i αi^t. Here αi^t is the independent random channel fading factor following an exponential distribution with unit mean. Without loss of generality, the channel gains are assumed to remain the same within one time frame and vary independently from one time frame to another. We assume equal computing efficiency ki = 10^{−26}, i = 1, · · · , N, and φ = 100 for all the WDs [32]. The data offloading bandwidth B = 2 MHz, receiver noise power N0 = 10^{−10}, and vu = 1.1. Without loss of generality, we set T = 1 and wi = 1 if i is an odd number and wi = 1.5 otherwise. All the simulations are performed on a desktop with an Intel Core i5-4570 3.2 GHz CPU and 12 GB memory.

2. See detailed product specifications at http://www.powercastco.com.

We simply consider a fully connected DNN consisting of one input layer, two hidden layers, and one output layer in the proposed DROO algorithm, where the first and second hidden layers have 120 and 80 hidden neurons, respectively. Note that the DNN can be replaced by other structures with different numbers of hidden layers and neurons, or even other types of neural networks to fit the specific learning problem, such as convolutional neural network (CNN) or recurrent neural network (RNN) [33]. In this paper, we find that a simple two-layer perceptron suffices to achieve satisfactory convergence performance, while better convergence performance is expected by further optimizing the DNN parameters. We implement the DROO algorithm in Python with TensorFlow 1.0 and set the training interval δ = 10, training batch size |T| = 128, memory size as 1024, and learning rate for the Adam optimizer as 0.01. The source code is available at https://github.com/revenol/DROO.
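A sketch of a DNN with the structure and hyper-parameters just described (two hidden layers of 120 and 80 ReLU neurons, a sigmoid output, Adam with learning rate 0.01, the averaged cross-entropy loss, a 1024-sample replay memory, and batches of 128) is given below in TensorFlow 2 / Keras style; the authors' actual TensorFlow 1.0 implementation is the one in the linked repository:

    import numpy as np
    import random
    import tensorflow as tf

    N = 10  # number of WDs, i.e., the input and output dimension

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(N,)),
        tf.keras.layers.Dense(120, activation="relu"),
        tf.keras.layers.Dense(80, activation="relu"),
        tf.keras.layers.Dense(N, activation="sigmoid"),   # relaxed action x_hat in (0, 1)^N
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="binary_crossentropy")             # the averaged cross-entropy L(theta)

    memory, CAPACITY = [], 1024                           # replay memory of (h, x*) pairs

    def remember(h, x_star):
        """Add a state-action pair; once full, the oldest pair is discarded."""
        if len(memory) >= CAPACITY:
            memory.pop(0)
        memory.append((h, x_star))

    def policy_update(batch_size=128):
        """One policy update: train on a uniformly sampled batch from the memory."""
        batch = random.sample(memory, min(batch_size, len(memory)))
        hs = np.array([b[0] for b in batch])
        xs = np.array([b[1] for b in batch])
        model.train_on_batch(hs, xs)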
[Fig. 5 (two stacked plots): the training loss L(θt) and the normalized computation rate Q̂ (moving average with max/min shading) over 10,000 time frames. Fig. 6 shows the same two quantities when all WD weights are alternated at t = 6,000 and t = 8,000.]

Fig. 5. Normalized computation rates and training losses for DROO algorithm under fading channels when N = 10 and K = 10.

Fig. 6. Normalized computation rates and training losses for DROO algorithm with alternating-weight WDs when N = 10 and K = 10.

5.1 Convergence Performance

We first consider a wireless powered MEC network with N = 10 WDs. Here, we define the normalized computation rate Q̂(h, x) ∈ [0, 1], as

    Q̂(h, x) = Q*(h, x) / max_{x′ ∈ {0,1}^N} Q*(h, x′),    (11)

where the optimal solution in the denominator is obtained by enumerating all the 2^N offloading actions.

In Fig. 5, we plot the training loss L(θt) of the DNN and the normalized computation rate Q̂. Here, we set a fixed K = N. In the lower sub-figure, the blue curve denotes the moving average of Q̂ over the last 50 time frames, and the light blue shadow denotes the maximum and minimum of Q̂ in the last 50 frames. We see that the moving average Q̂ of DROO gradually converges to the optimal solution when t is large. Specifically, the achieved average Q̂ exceeds 0.98 at an early stage when t > 400 and the variance gradually decreases to zero as t becomes larger, e.g., when t > 3,000. Meanwhile, in the upper sub-figure, the training loss L(θt) gradually decreases and stabilizes at around 0.04, whose fluctuation is mainly due to the random sampling of training data.

In Fig. 6, we evaluate DROO for MEC networks with alternating-weight WDs. We evaluate the worst case by alternating the weights of all WDs between 1 and 1.5 at the same time, specifically, at t = 6,000 and t = 8,000. The training loss sharply increases after the weights are alternated and then gradually decreases and stabilizes after training for 1,000 time frames, which means that DROO automatically updates its offloading decision policy and converges to the new optimal solution. Meanwhile, as shown in Fig. 6, the minimum of Q̂ is greater than 0.95 and the moving average of Q̂ is always greater than 0.99 for t > 6,000.

In Fig. 7, we evaluate the ability of DROO in supporting WDs' temporarily critical computation demands. Suppose that WD1 and WD2 have a temporary surge of computation demands. We double WD2's weight from 1.5 to 3 at time frame t = 4,000, triple WD1's weight from 1 to 3 at t = 6,000, and reset both of their weights to the original values at t = 8,000. In the top sub-figure in Fig. 7, we plot the relative computation rates for both WDs, where each WD's computation rate is normalized against that achieved under the optimal offloading actions with their original weights. In the first 3,000 time frames, DROO gradually converges and the corresponding relative computation rates for both WDs are lower than the baseline at most of the time frames. During time frames 4,000 < t < 8,000, WD2's weight is doubled. Its computation rate significantly improves over the baseline, where at some time frames the improvement can be as high as 2 to 3 times of the baseline. Similar rate improvement is also observed for WD1 when its weight is tripled between 6,000 < t < 8,000. In addition, their computation rates gradually converge to the baseline when their weights are reset to the original value after t = 8,000. On average, WD1 and WD2 have experienced 26% and 12% higher computation rate, respectively, during their periods with increased weights. In the bottom sub-figure in Fig. 7, we plot the normalized computation rate performance of DROO, which shows that the algorithm can quickly adapt itself to the temporary demand variation of users. The results in Fig. 7 have verified the ability of the proposed DROO framework in supporting temporarily critical service quality requirements.

In Fig. 8, we evaluate DROO for MEC networks where WDs can be occasionally turned off/on. After DROO converges, we randomly turn off one WD at each of the time frames t = 6,000, 6,500, 7,000, 7,500, and then turn them on at time frames t = 8,000, 8,500, 9,000. At time frame t = 9,500, we randomly turn off two WDs, resulting in an MEC network with 8 active WDs. Since the number of neurons in the input layer of the DNN is fixed as N = 10, we set the input channel gains h for the inactive WDs as 0 to exclude them from the resource allocation optimization with respect to (P2). We numerically study the performance of this modified DROO in Fig. 8. Note that, when evaluating the normalized computation rate Q̂ via equation (11), the denominator is re-computed when one WD is turned off/on. For example, when there are 8 active WDs in the MEC network, the denominator is obtained by enumerating all the 2^8 offloading actions. As shown in Fig. 8, the training loss L(θt) increases little after WDs are turned off/on, and the moving average of the resulting Q̂ is always greater than 0.99.
[Fig. 7 (two stacked plots): top, the relative computation rates of WD1 and WD2 over time, with markers for doubling WD2's weight, tripling WD1's weight, and resetting both weights; bottom, the normalized computation rate Q̂ of DROO. Fig. 8 (two stacked plots): the training loss L(θt) and the normalized computation rate Q̂ as WDs are turned off and on over time.]

Fig. 7. Computation rates for DROO algorithm with temporarily new weights when N = 10 and K = 10.

Fig. 8. Normalized computation rates and training losses for DROO algorithm with ON-OFF WDs when N = 10 and K = 10.

In Fig. 9, we further study the effect of different algorithm parameters on the convergence performance of DROO, including different memory sizes, batch sizes, training intervals, and learning rates. In Fig. 9(a), a small memory (=128) causes larger fluctuations in the convergence performance, while a large memory (=2048) requires more training data to converge to the optimal Q̂ = 1. In the following simulations, we choose the memory size as 1024. For each training procedure, we randomly sample a batch of data samples from the memory to improve the DNN. Hence, the batch size must be no more than the memory size 1024. As shown in Fig. 9(b), a small batch size (=32) does not take advantage of all training data stored in the memory, while a large batch size (=1024) frequently uses the "old" training data and degrades the convergence performance. Furthermore, a large batch size consumes more time for training. As a trade-off between convergence speed and computation time, we set the training batch size |T| = 128 in the following simulations. In Fig. 9(c), we investigate the convergence of DROO under different training intervals δ. DROO converges faster with a shorter training interval, and thus more frequent policy updates. However, numerical results show that it is unnecessary to train and update the DNN too frequently. Hence, we set the training interval δ = 10 to speed up the convergence of DROO. In Fig. 9(d), we study the impact of the learning rate in the Adam optimizer [31] on the convergence performance. We notice that either a too small or a too large learning rate causes the algorithm to converge to a local optimum. In the following simulations, we set the learning rate as 0.01.

In Fig. 10, we compare the performance of two quantization methods: the proposed order-preserving quantization and the conventional KNN quantization method under different K. In particular, we plot the moving average of Q̂ over a window of 200 time frames. When K = N, both methods converge to the optimal offloading actions, i.e., the moving average of Q̂ approaches 1. However, they both achieve suboptimal offloading actions when K is small. For instance, when K = 2, the order-preserving quantization method and KNN both only converge to around 0.95. Nonetheless, we can observe that when K ≥ 2, the order-preserving quantization method converges faster than the KNN method. Intuitively, this is because the order-preserving quantization method offers a larger diversity in the candidate actions than the KNN method. Therefore, the training of the DNN requires exploring fewer offloading actions before convergence. Notice that the DROO algorithm does not converge for either quantization method when K = 1. This is because the DNN cannot improve its offloading policy when action selection is absent.

The simulation results in this subsection show that the proposed DROO framework can quickly converge to the optimal offloading policy, especially when the proposed order-preserving action quantization method is used.

5.2 Impact of Updating Intervals ∆

In Fig. 11, we further study the impact of the updating interval of K (i.e., ∆) on the convergence property. Here, we use the adaptive setting method of K in Section 4.4 and plot the moving average of Q̂ over a window of 200 time frames. We see that the DROO algorithm converges to the optimal solution only when setting a sufficiently large ∆, e.g., ∆ ≥ 16. Meanwhile, we also plot in Fig. 12 the moving average of Kt under different ∆. We see that Kt increases with ∆ when t is large. This indicates that setting a larger ∆ will lead to higher computational complexity, i.e., it requires computing (P2) more times in a time frame. Therefore, a performance-complexity tradeoff exists in setting ∆.

To properly choose an updating interval ∆, we plot in Fig. 13 the tradeoff between the total CPU execution latency of 10,000 channel realizations and the moving average of Q̂ in the last time frame. On one hand, we see that the average of Q̂ quickly increases from 0.96 to close to 1 when ∆ ≤ 16, while the improvement becomes marginal afterwards when we further increase ∆.
[Fig. 9 (four panels (a)-(d)): moving average of Q̂ under different memory sizes, training batch sizes, training intervals, and learning rates.]

Fig. 9. Moving average of Q̂ under different algorithm parameters when N = 10: (a) memory size; (b) training batch size; (c) training interval; (d) learning rate.

[Fig. 10 (plot): moving average of Q̂ over 10,000 time frames for the order-preserving quantization and the KNN quantization under different K. Fig. 11 (plot): moving average of Q̂ over 10,000 time frames under different updating intervals ∆.]

Fig. 10. Moving average of Q̂ under different quantization functions and K when N = 10.

Fig. 11. Moving average of Q̂ for DROO algorithm with different updating interval ∆ for setting an adaptive K. Here, we set N = 10.

On the other hand, the CPU execution latency increases monotonically with ∆. To balance between performance and complexity, we set ∆ = 32 for the DROO algorithm in the following simulations.

5.3 Computation Rate Performance

Regarding the weighted sum computation rate performance, we compare our DROO algorithm with the following representative benchmarks:

• Coordinate Descent (CD) algorithm [7]. The CD algorithm iteratively swaps in each round the computing mode of the WD that leads to the largest computation rate improvement, that is, from xi = 0 to xi = 1, or vice versa. The iteration stops when the computation performance cannot be further improved by the computing mode swapping. The CD method is shown to achieve near-optimal performance under different N.
• Linear Relaxation (LR) algorithm [13]. The binary offloading decision variable xi conditioned on (4d) is relaxed to a real number between 0 and 1, as x̂i ∈ [0, 1]. Then the optimization problem (P1) with this relaxed constraint is convex with respect to {x̂i} and can be solved using the CVXPY convex optimization toolbox.^3 Once x̂i is obtained, the binary offloading decision xi is determined as follows

        xi = 1, when r*_{O,i}(a, τi) ≥ r*_{L,i}(a); and xi = 0, otherwise.    (12)

• Local Computing. All N WDs only perform local computation, i.e., setting xi = 0, i = 1, · · · , N in (P2).
• Edge Computing. All N WDs offload their tasks to the AP, i.e., setting xi = 1, i = 1, · · · , N in (P2).

3. The CVXPY package is online available at https://www.cvxpy.org/
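For reference, the greedy mode-swapping of the CD benchmark described above can be sketched as follows; q_star again denotes any routine that returns Q*(h, x) for a fixed x, and the function name is ours:

    def coordinate_descent(h, w, q_star):
        """CD benchmark: in each round, flip the single offloading indicator that
        yields the largest increase of Q*(h, x); stop when no swap improves it."""
        x = [0] * len(h)
        best = q_star(h, w, x)
        while True:
            best_gain, best_i = 0.0, None
            for i in range(len(x)):
                cand = x.copy()
                cand[i] = 1 - cand[i]               # swap the computing mode of WD i
                gain = q_star(h, w, cand) - best
                if gain > best_gain:
                    best_gain, best_i = gain, i
            if best_i is None:
                return x, best
            x[best_i] = 1 - x[best_i]
            best += best_gain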
[Fig. 12 (plot): moving average of Kt over 10,000 time frames under different updating intervals ∆. Fig. 13 (plot): the tradeoff between the CPU execution latency (60 to 240 seconds) and E[Q̂] for ∆ = 1 up to ∆ = 1024. Fig. 14 (bar plot): the maximum computation rate Q* (bits/s) of CD, DROO, LR, Edge Computing, and Local Computing for N = 10, 20, and 30 WDs.]

Fig. 12. Dynamics of Kt under different updating interval ∆ when N = 10.

Fig. 13. Tradeoff between Q̂ and CPU execution latency after training DROO for 10,000 channel realizations under different updating intervals ∆ when N = 10.

Fig. 14. Comparisons of computation rate performance for different offloading algorithms.

In Fig. 14, we first compare the computation rate performance achieved by different offloading algorithms under a varying number of WDs, N. Before the evaluation, DROO has been trained with 24,000 independent wireless channel realizations, and its offloading policy has converged. This is reasonable since we are more interested in the long-term operation performance [34] for field deployment. Each point in the figure is the average performance of 6,000 independent wireless channel realizations. We see that DROO achieves similar near-optimal performance with the CD method, and significantly outperforms the Edge Computing and Local Computing algorithms. In Fig. 15, we further compare the performance of the DROO and LR algorithms. For better exposition, we plot the normalized computation rate Q̂ achievable by DROO and LR. Specifically, we enumerate all 2^N possible offloading actions as in (11) when N = 10. For N = 20 and 30, it is computationally prohibitive to enumerate all the possible actions. In this case, Q̂ is obtained by normalizing the computation rate achievable by DROO (or LR) against that of the CD method. We then plot both the median and the confidence intervals of Q̂ over 6,000 independent channel realizations. We see that the median of DROO is always close to 1 for different numbers of users, and the confidence intervals are mostly above 0.99. In some cases the normalized computation rate Q̂ of DROO is greater than 1, since DROO generates a greater computation rate than CD at some time frames. In comparison, the median of the LR algorithm is always less than 1. The results in Fig. 14 and Fig. 15 show that the proposed DROO method can achieve near-optimal computation rate performance under different network placements.

5.4 Execution Latency

At last, we evaluate the execution latency of the DROO algorithm. The computational complexity of the DROO algorithm greatly depends on the complexity in solving the resource allocation sub-problem (P2). For fair comparison, we use the same bi-section search method as the CD algorithm in [7]. The CD method is reported to achieve an O(N^3) complexity. For the DROO algorithm, we consider both using a fixed K = N and an adaptive K as in Section 4.4. Note that the execution latency for DROO listed in Table 2 is averaged over 30,000 independent wireless channel realizations including both offloading action generation and DNN training. Overall, the training of the DNN contributes only a small proportion of the CPU execution latency, which is much smaller than that of the bi-section search algorithm for resource allocation.
[Fig. 15 (boxplot): the normalized computation rate Q̂ of DROO and LR for N = 10, 20, and 30 WDs.]

Fig. 15. Boxplot of the normalized computation rate Q̂ for DROO and LR algorithms under different number of WDs. The central mark (in red) indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively.

Taking DROO with K = 10 as an example, it uses 0.034 second to generate an offloading action and 0.002 second to train the DNN in each time frame. Here training the DNN is efficient: during each offloading policy update, only a small batch of training data samples, |T| = 128, are used to train a two-hidden-layer DNN with only 200 hidden neurons in total via back-propagation. We see from Table 2 that an adaptive K can effectively reduce the CPU execution latency compared with a fixed K = N. Besides, DROO with an adaptive K requires much shorter CPU execution latency than the CD algorithm and the LR algorithm. In particular, it generates an offloading action in less than 0.1 second when N = 30, while CD and LR take 65 times and 14 times longer CPU execution latency, respectively. Overall, DROO achieves similar rate performance as the near-optimal CD algorithm but requires substantially less CPU execution latency than the heuristic LR algorithm.

TABLE 2
Comparisons of CPU execution latency

# of WDs | DROO (Fixed K = N) | DROO (Adaptive K with ∆ = 32) | CD      | LR
10       | 3.6e-2 s           | 1.2e-2 s                      | 2.0e-1 s | 2.4e-1 s
20       | 1.3e-1 s           | 3.0e-2 s                      | 1.3 s    | 5.3e-1 s
30       | 3.1e-1 s           | 5.9e-2 s                      | 3.8 s    | 8.1e-1 s

The wireless-powered MEC network considered in this paper may correspond to a static IoT network with both the transmitter and receivers fixed in location. Measurement experiments [24]–[26] show that the channel coherence time, during which we deem the channel invariant, ranges from 1 to 10 seconds, and is typically no less than 2 seconds. The time frame duration is set smaller than the coherence time. Without loss of generality, let us assume that the time frame is 2 seconds. Taking the MEC network with N = 30 as an example, the total execution latency of DROO is 0.059 second, accounting for 3% of the time frame, which is an acceptable overhead for field deployment. In fact, DROO can be further improved by only generating offloading actions at the beginning of the time frame and then training the DNN during the remaining time frame in parallel with energy transfer, task offloading and computation. In comparison, the execution of the LR algorithm consumes 40% of the time frame, and the CD algorithm even requires longer execution time than the time frame, which are evidently unacceptable in practical implementation. Therefore, DROO makes real-time offloading and resource allocation truly viable for wireless powered MEC networks in fading environments.

6 CONCLUSION

In this paper, we have proposed a deep reinforcement learning-based online offloading algorithm, DROO, to maximize the weighted sum computation rate in wireless powered MEC networks with binary computation offloading. The algorithm learns from the past offloading experiences to improve its offloading action generated by a DNN via reinforcement learning. An order-preserving quantization and an adaptive parameter setting method are devised to achieve fast algorithm convergence. Compared to the conventional optimization methods, the proposed DROO algorithm completely removes the need of solving hard mixed integer programming problems. Simulation results show that DROO achieves similar near-optimal performance as existing benchmark methods but reduces the CPU execution latency by more than an order of magnitude, making real-time system optimization truly viable for wireless powered MEC networks in fading environments.

Although the resource allocation subproblem is solved under a specific wireless powered network setup, the proposed DROO framework is applicable for computation offloading in general MEC networks. A major challenge, however, is that the mobility of the WDs would make it harder for DROO to converge.

As a concluding remark, we expect that the proposed framework can also be extended to solve MIP problems for various applications in wireless communications and networks that involve coupled integer decision and continuous resource allocation problems, e.g., mode selection in D2D communications, user-to-base-station association in cellular systems, routing in wireless sensor networks, and caching placement in wireless networks. The proposed DROO framework is applicable as long as the resource allocation subproblems can be efficiently solved to evaluate the quality of the given integer decision variables.

REFERENCES

[1] S. Bi, C. K. Ho, and R. Zhang, "Wireless powered communication: Opportunities and challenges," IEEE Commun. Mag., vol. 53, no. 4, pp. 117-125, Apr. 2015.
[2] M. Chiang and T. Zhang, "Fog and IoT: An overview of research opportunities," IEEE Internet Things J., vol. 3, no. 6, pp. 854-864, Dec. 2016.
[3] Y. Mao, J. Zhang, and K. B. Letaief, "Dynamic computation offloading for mobile-edge computing with energy harvesting devices," IEEE J. Sel. Areas Commun., vol. 34, no. 12, pp. 3590-3605, Dec. 2016.
[4] C. You, K. Huang, H. Chae, and B.-H. Kim, "Energy-efficient resource allocation for mobile-edge computation offloading," IEEE Trans. Wireless Commun., vol. 16, no. 3, pp. 1397-1411, Mar. 2017.
[5] X. Chen, L. Jiao, W. Li, and X. Fu, "Efficient multi-user computation offloading for mobile-edge cloud computing," IEEE/ACM Trans. Netw., vol. 24, no. 5, pp. 2795-2808, Oct. 2016.
[6] F. Wang, J. Xu, X. Wang, and S. Cui, "Joint offloading and computing optimization in wireless powered mobile-edge computing systems," IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 1784-1797, Mar. 2018.
[7] S. Bi and Y. J. A. Zhang, "Computation rate maximization for wireless powered mobile-edge computing with binary computation offloading," IEEE Trans. Wireless Commun., vol. 17, no. 6, pp. 4177-4190, Jun. 2018.
[8] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, "A survey on mobile edge computing: The communication perspective," IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322-2358, Aug. 2017.
[9] C. You, K. Huang, and H. Chae, "Energy efficient mobile cloud computing powered by wireless energy transfer," IEEE J. Sel. Areas Commun., vol. 34, no. 5, pp. 1757-1771, May 2016.
[10] P. M. Narendra and K. Fukunaga, "A branch and bound algorithm for feature subset selection," IEEE Trans. Comput., vol. C-26, no. 9, pp. 917-922, Sep. 1977.
[11] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 1995, vol. 1, no. 2.
[12] T. X. Tran and D. Pompili, "Joint task offloading and resource allocation for multi-server mobile-edge computing networks," arXiv preprint arXiv:1705.00704, 2017.
[13] S. Guo, B. Xiao, Y. Yang, and Y. Yang, "Energy-efficient dynamic offloading and resource scheduling in mobile cloud computing," in Proc. IEEE INFOCOM, Apr. 2016, pp. 1-9.
[14] T. Q. Dinh, J. Tang, Q. D. La, and T. Q. Quek, "Offloading in mobile edge computing: Task allocation and computational frequency scaling," IEEE Trans. Commun., vol. 65, no. 8, pp. 3571-3584, Aug. 2017.
[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, Feb. 2015.
[16] G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin, "Deep reinforcement learning in large discrete action spaces," arXiv preprint arXiv:1512.07679, 2015.
[17] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, May 2015.
[18] Y. He, F. R. Yu, N. Zhao, V. C. Leung, and H. Yin, "Software-defined networks with mobile edge computing and caching for smart cities: A big data deep reinforcement learning approach," IEEE Commun. Mag., vol. 55, no. 12, pp. 31-37, Dec. 2017.
[19] L. Huang, X. Feng, A. Feng, Y. Huang, and P. Qian, "Distributed deep learning-based offloading for mobile edge computing networks," Mobile Netw. Appl., 2018, doi: 10.1007/s11036-018-1177-x.
[20] M. Min, D. Xu, L. Xiao, Y. Tang, and D. Wu, "Learning-based computation offloading for IoT devices with energy harvesting," IEEE Trans. Veh. Technol., vol. 68, no. 2, pp. 1930-1941, Feb. 2019.
[21] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis, "Performance optimization in mobile-edge computing via deep reinforcement learning," IEEE Internet of Things Journal, Oct. 2018.
[22] L. Huang, X. Feng, C. Zhang, L. Qian, and Y. Wu, "Deep reinforcement learning-based joint task offloading and bandwidth allocation for multi-user mobile edge computing," Digital Communications and Networks, vol. 5, no. 1, pp. 10-17, 2019.
[23] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," in Proc. ICLR, 2016.
[24] R. Bultitude, "Measurement, characterization and modeling of indoor 800/900 MHz radio channels for digital communications," IEEE Commun. Mag., vol. 25, no. 6, pp. 5-12, Jun. 1987.
[25] S. J. Howard and K. Pahlavan, "Doppler spread measurements of indoor radio channel," Electronics Letters, vol. 26, no. 2, pp. 107-109, Jan. 1990.
[26] S. Herbert, I. Wassell, T. H. Loh, and J. Rigelsford, "Characterizing the spectral properties and time variation of the in-vehicle wireless communication channel," IEEE Trans. Commun., vol. 62, no. 7, pp. 2390-2399, Jul. 2014.
[27] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for wireless resource management," in Proc. IEEE SPAWC, Jul. 2017, pp. 1-6.
[28] H. Ye, G. Y. Li, and B. H. Juang, "Power of deep learning for channel estimation and signal detection in OFDM systems," IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 114-117, Feb. 2018.
[29] S. Marsland, Machine Learning: An Algorithmic Perspective. CRC Press, 2015.
[30] L.-J. Lin, "Reinforcement learning for robots using neural networks," Carnegie Mellon Univ., Pittsburgh, PA, School of Computer Science, Tech. Rep., 1993.
[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2015.
[32] Y. Wang, M. Sheng, X. Wang, L. Wang, and J. Li, "Mobile-edge computing: Partial computation offloading using dynamic voltage scaling," IEEE Trans. Commun., vol. 64, no. 10, pp. 4268-4282, Oct. 2016.
[33] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[34] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA: MIT Press, 2018.

You might also like