Learning Task-Oriented Communication For Edge Inference: An Information Bottleneck Approach
Abstract—This paper investigates task-oriented communication for edge inference, where a low-end edge device transmits the extracted feature vector of a local data sample to a powerful edge server for processing. It is critical to encode the data into an informative and compact representation for low-latency inference given the limited bandwidth. We propose a learning-based communication scheme that jointly optimizes feature extraction, source coding, and channel coding in a task-oriented manner, i.e., targeting the downstream inference task rather

for communication system designs that rely heavily on expert knowledge, and will undoubtedly transform wireless networks toward the next generation [11].

Meanwhile, emerging AI applications also raise new communication problems [12], [13]. To provide an immersive user experience, DNN-based mobile applications need to be performed within the edge of wireless networks, which eliminates the excessive latency incurred by routing data to the
The autoencoder-based framework for communication systems was later extended to a deep joint source-channel coding (JSCC) architecture for wireless image transmission in [8], which enjoys significant improvement in image reconstruction quality over separate source/channel coding techniques. JSCC has also been applied to natural language processing for text transmission, which was accomplished by incorporating the semantic information of sentences using recurrent neural networks [21]. It is worth noting that the aforementioned works focus on data-oriented communication, which targets transmitting data reliably given the limited radio resources. Nevertheless, the shifted objective of feature transmission for accurate edge inference with low latency is not aligned with that of data-oriented communication, as it regards part of the raw input data (e.g., nuisance, task-irrelevant information) as meaningless. Thus, recovering the original data sample with high fidelity at the edge server results in redundant communication overhead, which leaves room for further compression. This insight is also supported by a basic principle from representation learning [22]: a good representation should be insensitive (or invariant) to nuisances such as translations, rotations, and occlusions. Thus, we advocate task-oriented communication for applications such as edge inference, to improve efficiency by transmitting sufficient but minimal information for the downstream task.

There have been recent studies on feature compression for efficient transmission in edge inference [23], [24], [25], [26], [27]. In particular, for the image classification task, an end-to-end architecture was proposed in [25] to jointly optimize feature compression and encoding by integrating deep JSCC. In contrast to data-oriented communication, which concerns data recovery metrics (e.g., the $l_2$-distance or bit error rate), the proposed method was directly trained with the cross-entropy loss for the targeted classification task and ignored the data reconstruction quality. The end-to-end training facilitates the mapping of task-relevant information to the channel symbols and omits the irrelevance. Similar ideas were utilized to design feature compression and encoding schemes for image retrieval tasks at the wireless network edge in [28] and for point cloud data processing in [29].

While the end-to-end learning-driven architectures for task-oriented communication have proven effective in saving communication bandwidth, multiple restrictions remain to be resolved before their full potential can be unleashed. First, there lacks a systematic way to quantify the informativeness of the encoded feature vector and its impact on the inference task, which hinders achieving the best inference performance given the available resources. Besides, the dynamic wireless channel condition necessitates an adaptive encoding scheme for reliable feature transmission, which has received less attention in existing frameworks (e.g., [25], [26], [27], [30]). These form the main motivations of our study.

Data-oriented communication relies on classical source coding and channel coding theory, which, however, is not optimized for task-oriented communication. Recently, an information-theoretic design principle, named information bottleneck (IB) [20], has been applied to investigate deep learning, which seeks the right balance between data fit and generalization by using the mutual information as both a cost function and a regularizer. Particularly, the IB framework maximizes the mutual information between the latent representation and the label of the data sample to promote high accuracy, while minimizing the mutual information between the representation and the input sample to promote generalization. Such a trade-off between preserving the relevant information and finding a compact representation fits nicely with bandwidth-limited edge inference and thus will be adopted as the main design principle in our study of task-oriented communication. The IB framework is inherently related to the communication problem of remote source coding (RSC) [31], and has recently attracted great attention from both the machine learning and information theory communities [32], [33], [34], [35]. Nevertheless, applying it to task-oriented communication demands additional optimization, which forms the main technical contribution of our study.

B. Contributions

In this paper, we develop effective methods for task-oriented communication for device-edge co-inference based on the IB principle [20]. Our major contributions are summarized as follows:
• We design the task-oriented communication system by formalizing a rate-distortion tradeoff using the IB framework. Our formulation aims at maximizing the mutual information between the inference result and the encoded feature, while minimizing the mutual information between the encoded feature and the input data. Thus, it addresses the two objectives of improving the inference accuracy and reducing the communication overhead, respectively. To the best of our knowledge, this is the first time that IB is introduced to design wireless edge inference systems.
• As the mutual information terms in the IB formulation are generally intractable for DNNs with high-dimensional features, we leverage the variational approximation, known as the variational information bottleneck (VIB), to devise a tractable upper bound. Besides, by selecting a sparsity-inducing distribution as the variational prior, the VIB framework identifies and prunes the redundant dimensions of the encoded feature vector to reduce the communication overhead. The proposed method is named variational feature encoding (VFE).
• We extend the proposed task-oriented communication scheme to dynamic communication environments by enabling flexible adjustment of the transmitted signal length. In particular, we develop variable-length variational feature encoding (VL-VFE) based on dynamic neural networks, which can adaptively adjust the active dimensions according to different channel conditions.
• The effectiveness of the proposed task-oriented communication schemes is validated in both static and dynamic channel conditions on image classification tasks. Extensive simulation results demonstrate that VFE and VL-VFE outperform the traditional communication design and existing learning-based joint source-channel coding for data-oriented communication.
the conditional mutual information $I(X,\hat{Z}|Y)$, which corresponds to the amount of redundant information that needs to be transmitted. Compared with data-oriented communication, the IB framework retains the task-relevant information and results in $I(\hat{Z},X)$ that is much smaller than $H(X)$, which reduces the communication overhead.

C. Main Challenges

The IB framework is promising for task-oriented communication as it explicitly quantifies the informativeness of the encoded feature vector and offers a formalization of the rate-distortion tradeoff in edge inference. However, there are three main challenges when applying it to develop practical feature encoding methods, listed as follows.
• Estimation of mutual information: The computation of mutual information terms for high-dimensional data with unknown distributions is challenging, since the empirical estimate of the probability distribution requires the number of samples to increase exponentially with the dimension [38]. Therefore, developing a tractable estimator for mutual information is critical to make the problem solvable.
• Effective control of communication overhead: Minimizing the mutual information between the input data and the feature vector indeed reduces the redundancy about task-irrelevant information. However, there is no direct link between redundancy reduction and feature sparsification, which controls the communication overhead with a JSCC encoder. Thus, to reduce the communication overhead, an effective method is needed to aggregate the nuisance into the expendable dimensions so that the number of symbols to be transmitted is minimized.
• Dynamic channel conditions: The hostile wireless channel always poses significant challenges for communication systems. Particularly, the channel dynamics have to be accounted for. Dynamically adjusting the encoded feature length of a DNN is nontrivial, as the neural network structure is fixed after initialization. Changing the activation of neurons according to the channel conditions calls for additional control modules.

The following two sections tackle these challenges and develop effective methods for task-oriented communication. The effectiveness of the proposed methods will be tested in Section V.

III. VARIATIONAL FEATURE ENCODING

In this section, we develop a variational information bottleneck (VIB) framework to resolve the difficulty of mutual information computation in the original IB objective in (2). Besides, we show that by selecting a sparsity-inducing distribution as the variational prior, minimizing the mutual information between the raw input data $X$ and the noise-corrupted feature $\hat{Z}$ facilitates the sparsification of $\hat{Z}$ by pruning the task-irrelevant dimensions. Such an activation pruning scheme, i.e., removing neurons in a DNN, is effective in reducing the overhead of task-oriented communication. Based on this idea, we name our proposed method variational feature encoding (VFE). This section assumes a static channel condition, while dynamic channels will be treated in Section IV.
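As a quick numerical sanity check (not part of the paper), the decomposition behind the redundancy argument above, $I(\hat{Z},X) = I(\hat{Z},Y) + I(X,\hat{Z}|Y)$, holds whenever the Markov chain $Y \leftrightarrow X \leftrightarrow \hat{Z}$ is satisfied, and can be verified on a toy discrete joint distribution. All distributions below are invented for illustration:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (zeros are skipped)."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi(p_ab):
    """I(A;B) from a 2-D joint probability table."""
    pa, pb = p_ab.sum(axis=1), p_ab.sum(axis=0)
    return entropy(pa) + entropy(pb) - entropy(p_ab)

# Toy joint p(y, x): 2 labels, 4 inputs.
p_yx = np.array([[0.25, 0.15, 0.05, 0.05],
                 [0.05, 0.05, 0.15, 0.25]])

# Stochastic feature encoder p(z_hat | x): 3 feature values per input.
# Since z_hat depends on x only, the chain Y <-> X <-> Z_hat holds by construction.
p_z_given_x = np.array([[0.8, 0.1, 0.1],
                        [0.6, 0.3, 0.1],
                        [0.1, 0.3, 0.6],
                        [0.1, 0.1, 0.8]])

# Joint p(y, x, z_hat) = p(y, x) * p(z_hat | x), shape (2, 4, 3).
p_yxz = p_yx[:, :, None] * p_z_given_x[None, :, :]

i_zx = mi(p_yxz.sum(axis=0))   # I(Z_hat; X): total transmitted information
i_zy = mi(p_yxz.sum(axis=1))   # I(Z_hat; Y): task-relevant part

# I(X; Z_hat | Y) = H(X,Y) + H(Z,Y) - H(Y) - H(X,Y,Z): the redundant part.
i_xz_given_y = (entropy(p_yxz.sum(axis=2)) + entropy(p_yxz.sum(axis=1))
                - entropy(p_yxz.sum(axis=(1, 2))) - entropy(p_yxz))

assert np.isclose(i_zx, i_zy + i_xz_given_y)
```

Minimizing the redundant term $I(X,\hat{Z}|Y)$ while preserving $I(\hat{Z},Y)$ is exactly the tradeoff the IB objective formalizes.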
With a known distribution $p_{\phi}(\hat{z}|x)$ and the joint data distribution $p(x,y)$, the distributions $p(\hat{z})$ and $p(y|\hat{z})$ are fully characterized by the underlying Markov chain $Y \leftrightarrow X \leftrightarrow \hat{Z}$. Unfortunately, these two distributions are intractable due to the following high-dimensional integrals:

$$p(\hat{z}) = \int p(x)\, p_{\phi}(\hat{z}|x)\, dx, \qquad p(y|\hat{z}) = \int \frac{p(x,y)\, p_{\phi}(\hat{z}|x)}{p(\hat{z})}\, dx.$$

To overcome this issue, we apply two variational distributions $q(\hat{z})$ and $q_{\theta}(y|\hat{z})$ to approximate the true distributions $p(\hat{z})$ and $p(y|\hat{z})$, respectively, where $\theta$ denotes the parameters of the server-based network shown in Fig. 1b that computes the inference result $\hat{y}$. Therefore, we recast the objective function in (2) as follows:

$$\mathcal{L}_{\mathrm{VIB}}(\phi,\theta) = \mathbb{E}_{p(x,y)}\Big\{ \mathbb{E}_{p_{\phi}(\hat{z}|x)}\big[-\log q_{\theta}(y|\hat{z})\big] + \beta\, D_{\mathrm{KL}}\big(p_{\phi}(\hat{z}|x)\,\|\,q(\hat{z})\big) \Big\}. \quad (3)$$

The above formulation is termed the variational information bottleneck (VIB) [35], which provides an upper bound on the IB objective function in (2). Details of the derivations are deferred to Appendix A. By further applying the reparameterization trick [39] and Monte Carlo sampling, we are able to obtain an unbiased estimate of the gradient and hence optimize the objective using stochastic gradient descent. In particular, given a mini-batch of data $\{(x_m, y_m)\}_{m=1}^{M}$ and sampling the channel noise $L$ times for each pair $(x_m, y_m)$, we have the following empirical estimate:

$$\mathcal{L}_{\mathrm{VIB}}(\phi,\theta) \simeq \frac{1}{M}\sum_{m=1}^{M}\Big\{ \frac{1}{L}\sum_{l=1}^{L} -\log q_{\theta}\big(y_m \,|\, \hat{z}_{m,l}\big) + \beta\, D_{\mathrm{KL}}\big(p_{\phi}(\hat{z}|x_m)\,\|\,q(\hat{z})\big) \Big\}, \quad (4)$$

where $\hat{z}_{m,l} = z_m + \epsilon_{m,l}$ and $\epsilon_{m,l} \sim \mathcal{N}(0, \sigma^2 I)$.

In the next subsection, we illustrate that minimizing the VIB objective helps to prune the redundant dimensions in the encoded feature vector, and thus it serves as a suitable and tractable objective for task-oriented communication.

Specifically, for each dimension $\hat{z}_i$, the variational prior distribution is chosen as

$$q(\log|\hat{z}_i|) = \text{constant}.$$

Since $p_{\phi}(\hat{z}|x) = \prod_{i=1}^{n} p_{\phi}(\hat{z}_i|x)$, the KL-divergence term in (3) can be decomposed into a summation:

$$D_{\mathrm{KL}}\big(p_{\phi}(\hat{z}|x)\,\|\,q(\hat{z})\big) = \sum_{i=1}^{n} D_{\mathrm{KL}}\big(p_{\phi}(\hat{z}_i|x)\,\|\,q(\hat{z}_i)\big). \quad (5)$$

Nevertheless, as the KL-divergence term in (5) does not have a closed-form expression, we utilize the approximation proposed in [41]:

$$-D_{\mathrm{KL}}\big(p_{\phi}(\hat{z}_i|x)\,\|\,q(\hat{z}_i)\big) = \frac{1}{2}\log\alpha_i - \mathbb{E}_{\epsilon \sim \mathcal{N}(1,\alpha_i)}\log|\epsilon| + C \approx k_1 S(k_2 + k_3 \log\alpha_i) - 0.5\log\big(1+\alpha_i^{-1}\big) + C, \quad (6)$$

where

$$\alpha_i = \frac{\sigma^2}{z_i^2}, \quad k_1 = 0.63576, \quad k_2 = 1.87320, \quad k_3 = 1.48695,$$

and $C$ is a constant. Besides, $z_i$ is the $i$-th dimension of $z$, and $S(\cdot)$ denotes the sigmoid function. It can be verified that the approximate KL-divergence approaches its minimum when $\alpha_i$ goes to infinity (i.e., $z_i$ goes to zero), and minimizing this term encourages the value of $z_i$ to be small. Empirical results in Section V show that the selected sparsity-inducing distribution sparsifies some dimensions of $z$, i.e., $z_i \equiv 0$ for arbitrary input, which can then be pruned to reduce the communication overhead.

C. Variational Pruning on Dimension Importance

While the selected variational prior helps to promote sparsity in the feature vector, we still need an effective method to determine which of the dimensions can be pruned. Maintaining $z_i \equiv 0$ requires all the weights and the bias corresponding to $z_i$ in this layer to converge to zero. However, checking each parameter is time-consuming in a large-scale DNN. To develop an efficient solution, we introduce a dimension importance vector $\gamma$ to denote the importance of each output neuron. Revisiting the fully-connected (FC) layer, each neuron has full
Algorithm 1 Training Variational Feature Encoding (VFE)
Input: $T$ (number of iterations), $n$ (output dimension of the encoder), $L$ (number of channel noise samples per data point), batch size $M$, channel variance $\sigma^2$, and threshold $\gamma_0$.
1: for epoch $t = 1$ to $T$ do
2:   Select a mini-batch of data $\{(x_m, y_m)\}_{m=1}^{M}$
3:   Compute the encoded feature vectors $\{z_m\}_{m=1}^{M}$ based on (8)
4:   Compute the approximate KL-divergence based on (6)
5:   for $m = 1$ to $M$ do
6:     Sample the noise $\{\epsilon_{m,l}\}_{l=1}^{L} \sim \mathcal{N}(0, \sigma^2 I)$
7:   end for
8:   Compute the loss $\mathcal{L}_{\mathrm{VIB}}(\phi,\theta)$ based on (4)
9:   Update parameters $\phi, \theta$ through backpropagation
10:  for $i = 1$ to $n$ do
11:    if $\gamma_i \le \gamma_0$ then
12:      Prune the $i$-th dimension of the encoded feature vector
13:    end if
14:  end for
15: end for

connections to its input $a$, and their activations can thus be computed by a matrix multiplication with $W$ followed by an offset $b$:

$$\mathrm{FC}(a) = Wa + b = \widetilde{W}\tilde{a}, \quad (7)$$

where $\widetilde{W} = [W, b]$ is an augmented weight matrix, and $\tilde{a} = [a^T, 1]^T$ is an augmented input vector. Denoting the $i$-th row of the augmented weight matrix $\widetilde{W}$ by $\widetilde{W}_{i\cdot}$ and the $i$-th dimension of $\gamma$ by $\gamma_i$, we rewrite each row as $\widetilde{W}_{i\cdot} = \gamma_i \frac{\overline{W}_{i\cdot}}{\|\overline{W}_{i\cdot}\|_2}$, where $\gamma$ corresponds to the scale factor of each row. The proposed VFE method defines the mapping from the input $x$ to the encoded feature $z$ according to the following formula:

$$z_i = \mathrm{Tanh}\left(\gamma_i \frac{\widetilde{W}_{i\cdot}}{\|\widetilde{W}_{i\cdot}\|_2} f(x)\right), \quad (8)$$

where $z_i$ is the $i$-th dimension of $z$, and $\mathrm{Tanh}(\cdot)$ is the activation function. Besides, the function $f(\cdot)$ is defined by the previous on-device layers, and its output $f(x)$ is the input of the fully-connected layer (i.e., $a = f(x)$ in (7)). As the weight vector $\widetilde{W}_{i\cdot}$ is normalized by its $l_2$-norm, the magnitude of $z_i$ is highly dependent on the scale factor $\gamma_i$. When $\gamma_i$ is close to zero, $z_i$ is also close to zero, and the corresponding $p_{\phi}(\hat{z}|x)$ degrades to the channel noise distribution without valid information. Based on this idea, we eliminate the redundant channels whose parameter $\gamma_i$ is less than a threshold $\gamma_0$. Since the Tanh activation function has an output range from $-1$ to $1$, the peak transmit power $P$ is constrained to 1. Note that the formula in (8) can be easily extended to convolutional layers by replacing the matrix multiplication with convolution. Such a variational pruning process is one of the main components of the proposed VFE method. The training procedure for VFE is illustrated in Algorithm 1.

IV. VARIABLE-LENGTH VARIATIONAL FEATURE ENCODING

The task-oriented communication scheme developed in Section III assumes static wireless channels. In practice, wireless data transmission may experience changes due to various factors such as beam blockage and signal attenuation. This necessitates instant link adaptation to improve the efficiency of feature encoding for low-latency inference. In this section, we extend our findings in Section III and propose a new encoding scheme, namely variable-length variational feature encoding (VL-VFE), by designing a dynamic neural network, which admits flexible control of the encoded feature dimension.

A. Background on Dynamic Neural Networks

Dynamic neural networks are able to adapt their architectures to the given input and are effective in improving the efficiency of network processing via selective execution. For example, several prior works (e.g., [42], [43], [44]) proposed to learn a binary gating module to adaptively skip layers or prune channels based on the input data. Besides, there are also some variants of dynamic neural networks, including slimmable neural networks and the "Once-for-All" architecture. In particular, the inventors of slimmable neural networks [45] proposed to train a single model to support layers with arbitrary widths, while the authors of [17] proposed the "Once-for-All" architecture with a progressive shrinking algorithm that trains one network to support diverse sub-networks. In this work, we employ the idea of selective activation, as shown in Fig. 2, to learn a network that can adjust the number of activated neurons according to the channel conditions.

B. Selective Activation for Dynamic Channel Conditions

We propose variable-length variational feature encoding (VL-VFE), which is empowered with the capability of adjusting its output length under different channel conditions. Such channel-adaptive feature encoding schemes should satisfy the following two properties:
• The activated dimensions of the feature $z$ can be adjusted in the DNN forward propagation according to the channel conditions. More dimensions should be activated under bad channel conditions and vice versa.
• The activated dimensions start consecutively from the first dimension (shown in Fig. 2b), which avoids transmitting the indexes of the activated dimensions using extra communication resources.

In practical communication systems, the mobile device can be aware of the channel condition via a feedback channel. Therefore, the channel condition can be incorporated into the feature encoding process. Because the amplitude of the encoded feature vector is constrained to 1 by the Tanh function, the noise variance $\sigma^2$ suffices to represent the PSNR and is adopted as an extra input of the feature encoder. In the training process, the noise variance $\sigma^2$ is regarded as a random variable distributed within a range to model the dynamic channel conditions. For simplicity, we sample the channel variance $\sigma^2$ from the uniform distribution $p(\sigma^2)$. As the noise variance $\sigma^2$ is independent of the dataset, we have
$p(x, y, \sigma^2) = p(x, y)\, p(\sigma^2)$. The loss function in (3) is thus revised as follows:

$$\tilde{\mathcal{L}}_{\mathrm{VIB}}(\phi,\theta) = \mathbb{E}_{p(x,y,\sigma^2)}\Big\{ \mathbb{E}_{p_{\phi}(\hat{z}|x,\sigma^2)}\big[-\log q_{\theta}(y|\hat{z})\big] + \beta\, D_{\mathrm{KL}}\big(p_{\phi}(\hat{z}|x,\sigma^2)\,\|\,q(\hat{z})\big) \Big\}. \quad (9)$$

Similarly, we adopt Monte Carlo sampling as in (4) to estimate $\tilde{\mathcal{L}}_{\mathrm{VIB}}$:

$$\tilde{\mathcal{L}}_{\mathrm{VIB}}(\phi,\theta) \simeq \frac{1}{M}\sum_{m=1}^{M}\Big\{ \frac{1}{L}\sum_{l=1}^{L} -\log q_{\theta}\big(y_m \,|\, \hat{z}_{m,l}\big) + \beta\, D_{\mathrm{KL}}\big(p_{\phi}(\hat{z}|x_m, \sigma_m^2)\,\|\,q(\hat{z})\big) \Big\}, \quad (10)$$

where $\hat{z}_{m,l} = z_m + \epsilon_{m,l}$, $\sigma_m^2 \sim p(\sigma^2)$, and $\epsilon_{m,l} \sim \mathcal{N}(0, \sigma_m^2 I)$, and for a given $z_m$, the channel noise is sampled $L$ times.

Then, as the encoding scheme should be channel-adaptive, we have $p_{\phi}(\hat{z}|x, \sigma^2) = \mathcal{N}(\hat{z} \,|\, z(x; \phi, \sigma^2), \sigma^2 I)$, where the function $z(x; \phi, \sigma^2)$, determined by the on-device network, incorporates $\sigma^2$ as an input variable. Hence, the function in (8) is modified as follows:

$$z_i = \mathrm{Tanh}\left(\gamma_i(\sigma^2) \frac{\widetilde{W}_{i\cdot}}{\|\widetilde{W}_{i\cdot}\|_2} f(x)\right), \quad (11)$$

where the dimension importance $\gamma_i(\sigma^2)$ (i.e., the $i$-th element of $\gamma(\sigma^2)$) is a function of the channel condition (i.e., the channel noise variance $\sigma^2$). Rather than directly training a gating network to control the activated dimensions as in other dynamic neural networks (e.g., [42], [43], [44]), $\gamma(\sigma^2)$ can adaptively prune the redundant dimensions of the encoded feature vector for different $\sigma^2$ due to the intrinsic sparsity discussed in Section III. As a result, in the device-edge co-inference system, the activated dimensions of the encoded feature vector can be easily decided by setting a threshold for $\gamma(\sigma^2)$. Besides, as VL-VFE needs to meet the consecutive activation property, we define the function $\gamma(\sigma^2)$ to induce a particular group sparsity pattern, and the $i$-th element $\gamma_i(\sigma^2)$ is constructed as follows:

$$\gamma_i(\sigma^2) = \sum_{j=i}^{n} g_j(\sigma^2), \quad (12)$$

where $g_j(\cdot)$ denotes the $j$-th output dimension of the function $g(\cdot)$, which is parameterized by a lightweight multi-layer perceptron (MLP). By constraining the range of parameters in the MLP, each function $g_j(\sigma^2)$ can be made a non-negative increasing function, which naturally leads to $\gamma_i(\sigma^2) \ge \gamma_j(\sigma^2),\ \forall j > i$, and $\gamma_i(\sigma^2) \ge \gamma_i(\bar{\sigma}^2),\ \forall \sigma^2 \ge \bar{\sigma}^2$. Therefore, given a threshold $\gamma_0$, the VL-VFE method summarized in Algorithm 2 can activate the dimensions consecutively, and more dimensions can be activated under adverse channel conditions. Details of the MLP structure and parameter constraints are deferred to Appendix B.

Algorithm 2 Training Variable-Length Variational Feature Encoding (VL-VFE)
Input: $T$ (number of iterations), $n$ (output dimension of the encoder), $L$ (number of channel noise samples per data point), batch size $M$, noise variance distribution $p(\sigma^2)$, and threshold $\gamma_0$.
1: for epoch $t = 1$ to $T$ do
2:   Get a mini-batch of data $\{(x_m, y_m)\}_{m=1}^{M}$
3:   Sample the channel variances $\{\sigma_m^2\}_{m=1}^{M} \sim p(\sigma^2)$
4:   Compute the encoded feature vectors $\{z_m\}_{m=1}^{M}$ based on (11)
5:   for $m = 1$ to $M$ do
6:     Sample the channel noise $\{\epsilon_{m,l}\}_{l=1}^{L} \sim \mathcal{N}(0, \sigma_m^2 I)$
7:     for $i = 1$ to $n$ do
8:       if $\gamma_i(\sigma_m^2) \le \gamma_0$ then
9:         Deactivate the $i$-th dimension of $z_m$ in this epoch
10:      end if
11:    end for
12:  end for
13:  Compute the approximate KL-divergence based on (6)
14:  Compute the loss $\tilde{\mathcal{L}}_{\mathrm{VIB}}(\phi,\theta)$ based on (10)
15:  Update parameters $\phi, \theta$ through backpropagation
16: end for
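To make these pieces concrete, the following numpy sketch strings together the normalized-row encoder of Eq. (11), the reverse-cumulative-sum construction of $\gamma(\sigma^2)$ in Eq. (12), and the approximate KL term of Eq. (6). This is not the authors' implementation: the constrained MLP $g(\cdot)$ is replaced by hand-picked softplus functions with positive slopes, and the on-device network $f(\cdot)$ and encoder weights are random placeholders, all assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_vec(sigma2, slopes, offsets):
    """Eq. (12) sketch: gamma_i(sigma^2) = sum_{j>=i} g_j(sigma^2). Each stand-in
    g_j is non-negative and increasing in sigma^2 (softplus with positive slope)."""
    g = np.log1p(np.exp(slopes * sigma2 + offsets))   # softplus
    return np.cumsum(g[::-1])[::-1]                   # reverse cumulative sum

def encode(f_x, W, gamma):
    """Eq. (11) sketch: z_i = tanh(gamma_i * <W_i / ||W_i||_2, f(x)>)."""
    rows = W / np.linalg.norm(W, axis=1, keepdims=True)
    return np.tanh(gamma * (rows @ f_x))              # |z_i| <= 1: peak power P = 1

def kl_log_uniform(z, sigma2):
    """Eq. (6) sketch: approximate KL(p_phi(z_i|x) || q(z_i)) for the log-uniform
    prior, with alpha_i = sigma^2 / z_i^2 (additive constant C dropped)."""
    k1, k2, k3 = 0.63576, 1.87320, 1.48695
    alpha = sigma2 / (z**2 + 1e-8)
    sig = 1.0 / (1.0 + np.exp(-(k2 + k3 * np.log(alpha))))
    return -(k1 * sig - 0.5 * np.log1p(1.0 / alpha))

# --- one hypothetical VL-VFE forward pass over a single input ---
n, d = 16, 8                              # feature dim / on-device output dim
W = rng.normal(size=(n, d))               # stand-in for the FC encoder weights
f_x = rng.normal(size=d)                  # stand-in for f(x)
slopes = rng.uniform(0.05, 0.2, size=n)   # positive -> g_j increasing in sigma^2
offsets = rng.uniform(-5.0, -2.0, size=n)

counts = {}
for sigma2 in (0.01, 1.0):                # good channel vs. bad channel
    gamma = gamma_vec(sigma2, slopes, offsets)
    counts[sigma2] = int(np.sum(gamma > 0.05))        # active prefix length
    z = encode(f_x, W, gamma)
    z_hat = z + rng.normal(0.0, np.sqrt(sigma2), size=n)   # AWGN channel
    kl = kl_log_uniform(z, sigma2).sum()              # regularizer in Eq. (10)
```

Because $\gamma$ is a reverse cumulative sum of non-negative terms, it is non-increasing in $i$, so thresholding always activates a consecutive prefix of dimensions, and a larger $\sigma^2$ can only enlarge that prefix.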
A. Experimental Setup

1) Datasets: In this section, we select two benchmark datasets for image classification: MNIST [46] and CIFAR-10 [47]. The MNIST dataset of handwritten digits from "0" to "9" has a training set of 60,000 sample images and a test set of 10,000 sample images. The CIFAR-10 dataset consists of 60,000 color images in 10 classes, with 5,000 training images per class and 10,000 test images. In Appendix D, we further test the performance of the proposed methods on the Tiny ImageNet dataset [48].

2) Baselines: We compare the proposed methods against two learning-based communication schemes for device-edge co-inference: DeepJSCC [8], [28] and learning-based Quantization [49].
• DeepJSCC: DeepJSCC is a learning-based JSCC method, which maps the input data directly to the channel symbols via a JSCC encoder. We set the loss function of DeepJSCC to cross-entropy, and its communication cost is proportional to the output dimension of the feature encoder.
• Learning-based Quantization: This scheme quantizes the floating-point values in the encoded feature vector into low-precision data representations (e.g., the 2-bit fixed-point format). Such a quantization method imitates lossy source coding, and therefore it requires an extra step of channel coding before transmission for error correction. Note that designing a universally optimal channel coding scheme for different channel conditions in the finite block-length regime is highly nontrivial [50]. For fair comparisons, we assume an adaptive channel coding scheme that achieves the following communication rate:

$$C(P, \sigma^2) = \min\left\{ \log_2\left(1 + \sqrt{\frac{2P}{\pi e \sigma^2}}\right),\ \frac{1}{2}\log_2\left(1 + \frac{P}{\sigma^2}\right) \right\} \text{ (b.p.c.u)}, \quad (13)$$

where $\frac{P}{\sigma^2}$ is the PSNR. This formula was shown to be a tight upper bound on the capacity of the amplitude-limited scalar Gaussian channel in [51].

3) Metrics: We are mainly concerned with the rate-distortion tradeoff in task-oriented communication. For the classification tasks, we use the classification accuracy to denote the inference performance (corresponding to the "distortion") and adopt the communication latency as an indicator of the "rate". In the following experiments, we set the bandwidth $W$ to 12.5 kHz with a symbol rate of 9,600 Baud, corresponding to the limited bandwidth at the wireless edge.

4) Neural Network Architecture: Carefully designing the on-device network is important due to the limited onboard computation and memory resources. Besides, as the DNN structure affects the inference performance and communication overhead, all methods adopt the same architecture for fair comparisons, as follows³.
• For the MNIST classification experiment, we assume a microcontroller unit (e.g., the ARM STM32F4 series) as the mobile device, whose memory (RAM) is less than 0.5 MB. Therefore, we use only one fully-connected layer as the on-device network to meet the memory constraint. At the edge server, we select an MLP as the server-based network. The corresponding network structure is shown in Table I. Note that a 4-layer MLP achieves an error rate of 1.38% as reported in [35].
• For the CIFAR-10 classification task, we assume a single-board computer (e.g., the Raspberry Pi series) as the mobile device and adopt ResNet [52] as the backbone for CIFAR-10 processing, which can achieve a classification accuracy of around 92%. As the single-board computer has much more memory than a microcontroller, we deploy convolutional layers on the mobile device to extract a compact representation. Besides, to

³The code is available at github.com/shaojiawei07/VL-VFE.
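As a worked example of how the rate in Eq. (13) converts into latency (the numbers here are illustrative, not taken from the paper's tables), the bound can be evaluated directly. At PSNR = 20 dB with peak power $P = 1$, the rate is about 2.55 b.p.c.u., so a hypothetical 64-dimensional feature quantized at 2 bits per dimension (128 bits) takes roughly 5.2 ms at the assumed 9,600 Baud symbol rate:

```python
import math

def channel_rate(P, sigma2):
    """Eq. (13): achievable rate in bits per channel use, taken as the tighter of
    an amplitude-limited capacity upper bound [51] and the AWGN Shannon capacity."""
    amp_bound = math.log2(1.0 + math.sqrt(2.0 * P / (math.pi * math.e * sigma2)))
    awgn = 0.5 * math.log2(1.0 + P / sigma2)
    return min(amp_bound, awgn)

def latency_ms(bits, P, sigma2, baud=9600):
    """Transmission latency for a payload of `bits` bits at the given symbol rate."""
    return bits / (channel_rate(P, sigma2) * baud) * 1e3

# Peak power P = 1 (Tanh output range); PSNR = P / sigma^2 = 20 dB -> sigma^2 = 0.01.
sigma2 = 10 ** (-20 / 10)
rate = channel_rate(1.0, sigma2)        # ~2.55 b.p.c.u. (amplitude bound binds)
lat = latency_ms(64 * 2, 1.0, sigma2)   # ~5.2 ms for 128 bits
```

At low PSNR the AWGN term $\frac{1}{2}\log_2(1 + P/\sigma^2)$ is the smaller of the two, while at high PSNR the amplitude-limited bound dominates the minimum, which is why the quantization baseline degrades gracefully across channel conditions.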
[Figure 3: two panels of error rate vs. latency (ms), comparing Quantization, DeepJSCC, and VFE.]
Fig. 3. The rate-distortion curves in the MNIST classification task with (a) PSNR = 10 dB and (b) PSNR = 20 dB.
[Figure 4: two panels of error rate vs. latency (ms).]
Fig. 4. The rate-distortion curves in the CIFAR-10 classification task with (a) PSNR = 10 dB and (b) PSNR = 20 dB.
reduce the communication overhead, we add a fully-connected layer at the end of the on-device network to map the intermediate tensor to an $n$-dimensional encoded feature. Correspondingly, there is a fully-connected layer in the server-based network that maps the received feature vector back to a tensor, and several server-based layers are adopted for further processing. The network structure is shown in Table II.

Since the proposed methods can prune the redundant dimensions of the encoded feature vector, our methods initialize $n$ to 64 or 128 in the following experiments. Moreover, the function $g(\cdot)$ in (12) for variable-length encoding is a 3-layer MLP with 16 hidden units per layer, which brings negligible computation compared with other computation-intensive modules⁴.

B. Results for Static Channel Conditions

In this set of experiments, we assume the wireless channel model has the same value of PSNR in both the training and test phases. Then, we record the inference accuracy achieved with different communication latency to obtain the rate-distortion tradeoff curves. In the proposed VFE method, varying the weighting parameter $\beta$ adjusts the encoded feature length, where $\beta \in [10^{-4}, 10^{-2}]$ in the MNIST classification and $\beta \in [5 \times 10^{-5}, 10^{-2}]$ in the CIFAR-10 classification. The communication latency of DeepJSCC is determined by the encoded feature dimension $n$, while for the learning-based Quantization method, the communication latency is determined by the dimension $n$ and the number of quantization levels. Adjusting these parameters affects both the communication latency and the accuracy. The rate-distortion tradeoff curves are shown in Fig. 3 and Fig. 4 for the MNIST and CIFAR-10 classification tasks, respectively. They show that our proposed method outperforms the baselines by achieving a better rate-distortion tradeoff, i.e., for a given latency requirement, a higher classification accuracy is maintained, and vice versa. This is because the proposed VFE method is able to identify and eliminate the redundant dimensions of the encoded feature vector for task-oriented communication. Besides, we also depict the noisy feature vector $\hat{z}$ in the MNIST classification task in Fig. 5 using a two-dimensional t-distributed stochastic neighbor

⁴Note that there is a tradeoff between the on-device computation latency and the communication overhead caused by the complexity of the on-device network [27]. In this paper, as we assume an extremely bandwidth-limited situation, we mainly consider the communication overhead in the experiments.
Fig. 5. 2-dimensional t-SNE embedding of the received feature in the MNIST classification task with PSNR = 20 dB. (a) DeepJSCC: accuracy = 96.77%, dimension $n$ = 24. (b) Proposed VFE: accuracy = 97.39%, dimension $n$ = 24.

⁵Theoretically, based on the channel capacity bound in (13), transmitting an MNIST image takes around 8 ms when PSNR = 25 dB and 20 ms when PSNR = 10 dB. Similarly, transmitting a CIFAR-10 image takes around 70 ms when PSNR = 25 dB and 180 ms when PSNR = 10 dB.

D. Ablation Study

To verify the effectiveness of the log-uniform distribution as the variational prior $q(\hat{z})$ for sparsity induction, we further
9.25
Latency (ms)
3.6 1.5
3.5 9.00
3.5 1.4
3.0 1.3 8.75
3.4
1.2 8.50
3.3 2.5
1.1 8.25
3.2
10 12 14 16 18 20 22 24 10 12 14 16 18 20 22 24
PSNR (dB) PSNR (dB)
(a) The MNIST classification task (b) The CIFAR-10 classification task
Fig. 6. Communication latency and error rate as a function of the channel PSNR in dynamic channel conditions.
conduct an ablation study that selects a Gaussian distribution with a diagonal covariance matrix for comparison. Note that the Gaussian distribution is widely used in previous variational approximation studies (e.g., [39], [34]) as it generally admits a closed-form solution. Since the Gaussian distribution is not a parameter-free distribution, the mean vector and covariance matrix are optimized in the training process to minimize the KL-divergence D_KL(p(ẑ|x)‖q(ẑ)). The experiments are conducted for MNIST and CIFAR-10 classification assuming PSNR = 20 dB. The values of γ with different variational prior distributions are shown in Figs. 7 and 8, where the dashed line corresponds to the value of the threshold γ0 used to prune the dimensions. From these two figures, it can be seen that, although using the Gaussian distribution can also confine some dimensions of γ to close-to-zero values, it is prone to shrinking the remaining informative dimensions, which eventually results in inference accuracy degradation.

Fig. 7. The γ value in the MNIST classification task with (a) a Gaussian distribution as the variational prior (task accuracy = 95.91% with 21 activated dimensions) and (b) a log-uniform distribution as the variational prior (task accuracy = 97.99% with 32 activated dimensions). The red dashed line denotes the pruning threshold γ0 = 0.05.

Fig. 8. The γ value in the CIFAR-10 classification task with (a) a Gaussian distribution as the variational prior (task accuracy = 91.18% with 21 activated dimensions) and (b) a log-uniform distribution as the variational prior (task accuracy = 91.83% with 21 activated dimensions). The red dashed line denotes the pruning threshold γ0 = 0.01.

VI. CONCLUSIONS

In this work, we investigated task-oriented communication for edge inference, where a low-end edge device transmits the extracted feature vector of a local data sample to a powerful edge server for processing. Our proposed methodology is built upon the information bottleneck (IB) framework, which provides a principled way to characterize and optimize a new rate-distortion tradeoff in edge inference. Assisted by variational approximation, we obtained a tractable formulation that is amenable to end-to-end training, named variational feature encoding. We further extended our method to develop a variable-length variational feature encoding scheme based on dynamic neural networks, which makes it adaptive to dynamic channel conditions. The effectiveness of our methods was verified by extensive simulations on image classification datasets.

Through this study, we would like to advocate for rethinking the communication system design for emerging applications such as edge inference. In these applications, communication will keep playing a critical role, but it will serve the downstream task rather than data reconstruction as in the classical communication setting. Thus we should take a task-oriented perspective to design the communication module for such applications. New design tools and methodologies will be needed, and the IB framework is a promising candidate: it bridges machine learning and information theory, and leverages theory and tools from both fields. There are many interesting future research directions on this exciting topic, e.g., applying the IB-based framework to scenarios with multiple devices, developing a theoretical understanding of the new rate-distortion tradeoff, and improving the robustness of the method.
The derivation of the variational upper bound in Appendix A is given in (14):

\begin{align}
\mathcal{L}_{IB}(\boldsymbol{\phi}) &= -I(Y,\hat{Z}) + \beta I(\hat{Z},X) \nonumber \\
&= -\int p(\boldsymbol{y}|\hat{\boldsymbol{z}})\, p(\hat{\boldsymbol{z}}) \log \frac{p(\boldsymbol{y}|\hat{\boldsymbol{z}})}{p(\boldsymbol{y})}\, d\boldsymbol{y}\, d\hat{\boldsymbol{z}} + \beta \int p_{\boldsymbol{\phi}}(\hat{\boldsymbol{z}}|\boldsymbol{x})\, p(\boldsymbol{x}) \log \frac{p_{\boldsymbol{\phi}}(\hat{\boldsymbol{z}}|\boldsymbol{x})}{p(\hat{\boldsymbol{z}})}\, d\boldsymbol{x}\, d\hat{\boldsymbol{z}} \nonumber \\
&= -\int p(\boldsymbol{y}|\hat{\boldsymbol{z}})\, p(\hat{\boldsymbol{z}}) \log p(\boldsymbol{y}|\hat{\boldsymbol{z}})\, d\boldsymbol{y}\, d\hat{\boldsymbol{z}} + \beta \int p_{\boldsymbol{\phi}}(\hat{\boldsymbol{z}}|\boldsymbol{x})\, p(\boldsymbol{x}) \log \frac{p_{\boldsymbol{\phi}}(\hat{\boldsymbol{z}}|\boldsymbol{x})}{p(\hat{\boldsymbol{z}})}\, d\boldsymbol{x}\, d\hat{\boldsymbol{z}} - H(Y) \nonumber \\
&= \underbrace{-\int p(\boldsymbol{y}|\hat{\boldsymbol{z}})\, p(\hat{\boldsymbol{z}}) \log q_{\boldsymbol{\theta}}(\boldsymbol{y}|\hat{\boldsymbol{z}})\, d\boldsymbol{y}\, d\hat{\boldsymbol{z}} + \beta \int p_{\boldsymbol{\phi}}(\hat{\boldsymbol{z}}|\boldsymbol{x})\, p(\boldsymbol{x}) \log \frac{p_{\boldsymbol{\phi}}(\hat{\boldsymbol{z}}|\boldsymbol{x})}{q(\hat{\boldsymbol{z}})}\, d\boldsymbol{x}\, d\hat{\boldsymbol{z}}}_{\mathcal{L}_{VIB}(\boldsymbol{\phi},\boldsymbol{\theta})} \nonumber \\
&\quad - \underbrace{\int p(\boldsymbol{y}|\hat{\boldsymbol{z}})\, p(\hat{\boldsymbol{z}}) \log \frac{p(\boldsymbol{y}|\hat{\boldsymbol{z}})}{q_{\boldsymbol{\theta}}(\boldsymbol{y}|\hat{\boldsymbol{z}})}\, d\boldsymbol{y}\, d\hat{\boldsymbol{z}}}_{D_{KL}(p(\boldsymbol{y}|\hat{\boldsymbol{z}})\,\|\,q_{\boldsymbol{\theta}}(\boldsymbol{y}|\hat{\boldsymbol{z}}))\,\ge\,0} - \beta \underbrace{\int p_{\boldsymbol{\phi}}(\hat{\boldsymbol{z}}|\boldsymbol{x})\, p(\boldsymbol{x}) \log \frac{p(\hat{\boldsymbol{z}})}{q(\hat{\boldsymbol{z}})}\, d\boldsymbol{x}\, d\hat{\boldsymbol{z}}}_{D_{KL}(p(\hat{\boldsymbol{z}})\,\|\,q(\hat{\boldsymbol{z}}))\,\ge\,0} - \underbrace{H(Y)}_{\text{constant}} \tag{14}
\end{align}
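As a numerical sanity check of (14), the snippet below builds a small discrete joint distribution (all probabilities invented for illustration) and verifies that the variational objective L_VIB upper-bounds the IB objective L_IB for arbitrary choices of the variational distributions q_θ(y|ẑ) and q(ẑ):

```python
import numpy as np

beta = 0.5
# Toy Markov chain Y -> X -> Zhat with invented probabilities.
p_y = np.array([0.4, 0.6])
p_x_given_y = np.array([[0.8, 0.2], [0.3, 0.7]])   # rows indexed by y
p_z_given_x = np.array([[0.9, 0.1], [0.2, 0.8]])   # rows indexed by x

p_yx = p_y[:, None] * p_x_given_y        # joint p(y, x)
p_x = p_yx.sum(axis=0)
p_xz = p_x[:, None] * p_z_given_x        # joint p(x, zhat)
p_yz = p_yx @ p_z_given_x                # joint p(y, zhat)

def mutual_info(p_ab):
    """I(A;B) from a joint distribution table p_ab."""
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    return float(np.sum(p_ab * np.log(p_ab / (pa * pb))))

# Exact IB objective: L_IB = -I(Y; Zhat) + beta * I(Zhat; X).
L_IB = -mutual_info(p_yz) + beta * mutual_info(p_xz)

# Arbitrary (suboptimal) variational distributions.
q_y_given_z = np.array([[0.5, 0.5], [0.6, 0.4]])   # rows indexed by zhat
q_z = np.array([0.5, 0.5])

# L_VIB: cross-entropy term plus beta-weighted rate term, as in (14).
ce_term = -np.sum(p_yz * np.log(q_y_given_z.T))
rate_term = beta * np.sum(p_xz * np.log(p_z_given_x / q_z[None, :]))
L_VIB = float(ce_term + rate_term)

assert L_VIB >= L_IB  # the variational upper bound holds
```

The gap L_VIB − L_IB equals H(Y) plus the two KL terms in (14), so the bound is strict here; any other valid q_θ and q would satisfy it as well.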
APPENDIX A
DERIVATION OF THE VARIATIONAL UPPER BOUND

Recall that the IB objective in (2) has the form L_IB(φ) = −I(Ẑ, Y) + βI(Ẑ, X). Writing it out in full, the derivation is shown in (14). L_VIB(φ, θ) in this formulation is the VIB objective function in (3). As the KL-divergence is nonnegative and the entropy of Y is a constant, L_VIB(φ, θ) is a variational upper bound of the IB objective L_IB(φ).

APPENDIX B
MLP STRUCTURE OF THE FUNCTION g(σ²)

We parameterize g(σ²) by a K-layer MLP, and thus it can be written as a composition of K non-linear functions:

g(σ²) = h_K ∘ h_{K−1} ∘ ⋯ ∘ h_1(σ²),

where h_k represents the k-th layer in the MLP and has h_k(x) = tanh(W^(k) x).⁶ To maintain the desired properties of the proposed VL-VFE method, each function g_j(σ²) (the j-th output dimension of the vector function g(σ²)) should be non-negative and increase with the noise variance σ². Therefore, the functions g_j(σ²) should satisfy the following constraints:

g_j(σ²) ≥ 0;  g′_j(σ²) = ∂g_j(σ²)/∂σ² ≥ 0.

The function g_j(σ²) can be written as follows:

g_j(σ²) = h_{K,j} ∘ h_{K−1} ∘ ⋯ ∘ h_1(σ²),

where h_{K,j} is the j-th output dimension of h_K. The derivative of g_j(σ²) can be obtained using the chain rule:

g′_j(σ²) = h′_{K,j} ∘ h′_{K−1} ∘ ⋯ ∘ h′_1(σ²),

where we denote the Jacobian matrix of h_k as h′_k, and h′_{K,j} is the j-th row of h′_K. The derivatives work out as follows:

h′_k(x) = diag(tanh′(W^(k) x)) · W^(k).

To guarantee that each g_j(σ²) is a non-negative increasing function, we set W^(k) = abs(Ŵ^(k)), which means that g_j(σ²) outputs a non-negative value, and all entries in the Jacobian matrices are non-negative.⁷

⁶ tanh(x) = (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ) and tanh′(x) = 1 − tanh²(x). For simplicity, we define tanh(x) and tanh′(x) as element-wise functions.
⁷ abs(·) denotes the element-wise absolute function. Ŵ^(k) are the actual parameters in the K-layer MLP.

APPENDIX C
ROBUSTNESS OF THE VL-VFE METHOD GIVEN INACCURATE CHANNEL NOISE VARIANCE

We conduct experiments to evaluate the robustness of the proposed method given an inaccurate channel noise variance. In particular, assuming m pilot symbols are transmitted from the mobile device for noise variance estimation and adopting the uniformly minimum-variance unbiased estimator, the noise variance is estimated as σ̂² = (1/(m−1)) Σ_{i=1}^{m} (ẑ_{i,p} − z_{i,p})², where ẑ_{i,p} and z_{i,p} correspond to the i-th received and transmitted pilot symbols, respectively. It can be easily verified that E[σ̂²] = σ² and p(σ̂²|σ²) = (σ²/(m−1)) χ²(m), where χ²(m) denotes the chi-square distribution with m degrees of freedom. The variance of σ̂² reduces as m increases, i.e., the noise variance estimation becomes more accurate. With the inaccurate noise variance σ̂² at the transmitter, we test the performance of the proposed VL-VFE method on the CIFAR-10 image classification task for the following three cases:

• VL-VFE (m = 0): This corresponds to the case that the transmitter has no knowledge about the noise variance, and the PSNR is set to be 10 dB for feature encoding;
• VL-VFE (m = 8): The noise variance is estimated via 8 pilot symbols, which corresponds to the case of imperfect channel knowledge for feature encoding;
• VL-VFE (m = ∞): This corresponds to the case of perfect channel knowledge for feature encoding.

Following the experimental settings in Section V, we also adopt DeepJSCC as the baseline in comparison. The experimental results on the error rate and feature transmission latency are shown in Fig. 9 and Fig. 10, with the new findings summarized as follows:

• The proposed method achieves lower communication latency compared with DeepJSCC in all three cases under dynamic channel conditions;
• While reducing the number of pilot symbols to 8 incurs performance degradation due to the inaccurate noise variance, the proposed method still achieves a much better rate-distortion tradeoff than DeepJSCC;
• Even when the transmitter has no knowledge of the noise variance, i.e., m = 0, the proposed method still shows comparable performance to DeepJSCC.
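The weight construction in Appendix B can be sketched as follows. This is a minimal NumPy illustration (layer sizes and initialization are arbitrary choices of ours, not the paper's implementation): taking the element-wise absolute value of the raw weights keeps every Jacobian entry non-negative, so each output dimension g_j is non-negative and non-decreasing in σ².

```python
import numpy as np

def make_g(layer_sizes, seed=0):
    """Sketch of the K-layer MLP g(sigma^2) from Appendix B: each layer
    applies tanh(W^(k) x) with W^(k) = abs(W_hat^(k))."""
    rng = np.random.default_rng(seed)
    w_hat = [rng.standard_normal((n_out, n_in))          # raw parameters W_hat^(k)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

    def g(sigma2):
        h = np.array([float(sigma2)])
        for w in w_hat:
            h = np.tanh(np.abs(w) @ h)  # abs(.) keeps all Jacobian entries >= 0
        return h

    return g

g = make_g([1, 16, 16, 4])
lo, hi = g(0.1), g(1.0)  # a larger noise variance never shrinks any output
assert np.all(lo >= 0) and np.all(hi >= lo)
```

Since the input σ² is non-negative and every layer is a monotone map with non-negative weights, the asserted properties hold for any layer sizes and any seed.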
Fig. 9. Error rate as a function of the channel PSNR in dynamic channel conditions, comparing DeepJSCC with VL-VFE (m = 0), VL-VFE (m = 8), and VL-VFE (m = ∞).

Fig. 10. Communication latency as a function of the channel PSNR in dynamic channel conditions, comparing DeepJSCC with VL-VFE (m = 0), VL-VFE (m = 8), and VL-VFE (m = ∞).
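The pilot-based noise-variance estimator used in Appendix C can be sketched as below. This is a synthetic AWGN simulation for illustration only; the pilot values, seed, and variance are assumptions of ours.

```python
import numpy as np

def estimate_noise_variance(z_tx, z_rx):
    """Estimator from Appendix C: sigma_hat^2 = 1/(m-1) * sum_i
    (z_rx_i - z_tx_i)^2, computed from m known pilot symbols."""
    z_tx, z_rx = np.asarray(z_tx), np.asarray(z_rx)
    m = len(z_tx)
    return float(np.sum((z_rx - z_tx) ** 2) / (m - 1))

rng = np.random.default_rng(0)
sigma2_true = 0.1      # AWGN variance implied by the channel PSNR
m = 10_000             # many pilots -> estimate concentrates near the truth
pilots = rng.standard_normal(m)
received = pilots + rng.normal(scale=np.sqrt(sigma2_true), size=m)

sigma2_hat = estimate_noise_variance(pilots, received)
assert abs(sigma2_hat - sigma2_true) < 0.01
```

With only m = 8 pilots, as in the VL-VFE (m = 8) case, the estimate scatters much more widely around σ², which is the source of the performance degradation reported above.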
[4] X. Hou, S. Dey, J. Zhang, and M. Budagavi, "Predictive view generation to enable mobile 360-degree and VR experiences," in Proc. Morning Workshop VR AR Netw., Budapest, Hungary, Aug. 2018, pp. 20–26.
[5] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, "Artificial neural networks-based machine learning for wireless networks: A tutorial," IEEE Commun. Surv. Tut., vol. 21, no. 4, pp. 3039–3071, Jul. 2019.
[6] J. Downey, B. Hilburn, T. O'Shea, and N. West, "Machine learning remakes radio," IEEE Spectr., vol. 57, no. 5, pp. 35–39, Apr. 2020.
[7] T. O'Shea and J. Hoydis, "An introduction to deep learning for the physical layer," IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 4, pp. 563–575, Oct. 2017.
[8] E. Bourtsoulatze, D. Burth Kurka, and D. Gündüz, "Deep joint source-channel coding for wireless image transmission," IEEE Trans. Cogn. Commun. Netw., vol. 5, no. 3, pp. 567–579, May 2019.
[9] N. Samuel, T. Diskin, and A. Wiesel, "Learning to detect," IEEE Trans. Signal Process., vol. 67, no. 10, pp. 2554–2564, Feb. 2019.
[10] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, "Graph neural networks for scalable radio resource management: Architecture design and theoretical analysis," IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 101–115, Jan. 2021.
[11] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y. Zhang, "The roadmap to 6G: AI empowered wireless networks," IEEE Commun. Mag., vol. 57, no. 8, pp. 84–90, Aug. 2019.
[12] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, "Toward an intelligent edge: wireless communication meets machine learning," IEEE Commun. Mag., vol. 58, no. 1, pp. 19–25, Jan. 2020.
[13] Y. Shi, K. Yang, T. Jiang, J. Zhang, and K. B. Letaief, "Communication-efficient edge AI: Algorithms and systems," IEEE Commun. Surv. Tut., vol. 22, no. 4, pp. 2167–2191, Jul. 2020.
[14] E. Li, Z. Zhou, and X. Chen, "Edge intelligence: On-demand deep learning model co-inference with device-edge synergy," in Proc. Workshop Mobile Edge Commun., Budapest, Hungary, Aug. 2018, pp. 31–36.
[15] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza, "Event-based vision meets deep learning on steering prediction for self-driving cars," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 5419–5427.
[16] L. Liu, H. Li, and M. Gruteser, "Edge assisted real-time object detection for mobile augmented reality," in Proc. Annu. Int. Conf. Mobile Comput. Netw., Los Cabos, Mexico, Oct. 2019, pp. 1–16.
[17] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, "Once for all: Train one network and specialize it for efficient deployment," in Proc. Int. Conf. Learn. Represent., Addis Ababa, Ethiopia, Apr. 2020.
[18] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, "Neurosurgeon: Collaborative intelligence between the cloud and mobile edge," ACM SIGARCH Comput. Archit. News, vol. 45, no. 1, pp. 615–629, Apr. 2017.
[19] H. Li, C. Hu, J. Jiang, Z. Wang, Y. Wen, and W. Zhu, "JALAD: Joint accuracy- and latency-aware deep structure decoupling for edge-cloud execution," in Proc. Int. Conf. Parallel Distrib. Syst., Singapore, Dec. 2018, pp. 671–678.
[20] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," in Proc. Annu. Allerton Conf. Commun. Control Comput., Monticello, IL, USA, Oct. 1999, pp. 368–377.
[21] N. Farsad, M. Rao, and A. Goldsmith, "Deep learning for joint source-channel coding of text," in Proc. Int. Conf. Acoust. Speech Process., Calgary, Canada, Apr. 2018, pp. 2326–2330.
[22] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Mar. 2013.
[23] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., Barcelona, Spain, Dec. 2016, pp. 2082–2090.
[24] W. Shi, Y. Hou, S. Zhou, Z. Niu, Y. Zhang, and L. Geng, "Improving device-edge cooperative inference of deep learning via 2-step pruning," in Proc. IEEE Conf. Comput. Commun. Workshop, 2019, pp. 1–6.
[25] J. Shao and J. Zhang, "BottleNet++: An end-to-end approach for feature compression in device-edge co-inference systems," in Proc. Int. Conf. Commun. Workshop, Dublin, Ireland, Jun. 2020, pp. 1–6.
[26] M. Jankowski, D. Gündüz, and K. Mikolajczyk, "Wireless image retrieval at the edge," IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 89–100, May 2021.
[27] J. Shao and J. Zhang, "Communication-computation trade-off in resource-constrained edge inference," IEEE Commun. Mag., Dec. 2020.
[28] M. Jankowski, D. Gündüz, and K. Mikolajczyk, "Deep joint source-channel coding for wireless image retrieval," in Proc. Int. Conf. Acoust. Speech Process., Barcelona, Spain, May 2020, pp. 5070–5074.
[29] J. Shao, H. Zhang, Y. Mao, and J. Zhang, "Branchy-GNN: A device-edge co-inference framework for efficient point cloud processing," 2020. [Online]. Available: https://arxiv.org/abs/2011.02422.
[30] K. Choi, K. Tatwawadi, A. Grover, T. Weissman, and S. Ermon, "Neural joint source-channel coding," in Proc. Int. Conf. Mach. Learn., Long Beach, CA, USA, Jun. 2019, pp. 1182–1192.
[31] R. Dobrushin and B. Tsybakov, "Information transmission with additional noise," IRE Trans. Inf. Theory, vol. 8, no. 5, pp. 293–304, Sep. 1962.
[32] Z. Goldfeld and Y. Polyanskiy, "The information bottleneck problem and its applications in machine learning," IEEE J. Sel. Areas Inf. Theory, Apr. 2020.
[33] A. Zaidi, I. Estella-Aguerri et al., "On the information bottleneck problems: Models, connections, applications and information theoretic views," Entropy, vol. 22, no. 2, p. 151, Jan. 2020.
[34] A. Achille and S. Soatto, "Information dropout: Learning optimal representations through noisy computation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 12, pp. 2897–2905, Jan. 2018.
[35] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, "Deep variational information bottleneck," in Proc. Int. Conf. Learn. Represent., Toulon, France, Apr. 2017.
[36] S. Dörner, S. Cammerer, J. Hoydis, and S. Brink, "Deep learning based communication over the air," IEEE J. Sel. Topics Signal Process., vol. 12, no. 1, pp. 132–143, Dec. 2018.
[37] T. M. Cover and J. A. Thomas, Elements of Information Theory. Hoboken, NJ: John Wiley & Sons, Inc., 2012.
[38] Z. Wang and D. W. Scott, "Nonparametric density estimation for high-dimensional data—algorithms and applications," Wiley Interdiscip. Rev. Comput. Statist., vol. 11, no. 4, p. 1461, Apr. 2019.
[39] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. Int. Conf. Learn. Represent., Banff, Canada, Apr. 2014.
[40] D. P. Kingma, T. Salimans, and M. Welling, "Variational dropout and the local reparameterization trick," in Proc. Adv. Neural Inf. Process. Syst., San Diego, CA, USA, May 2015, pp. 2575–2583.
[41] D. Molchanov, A. Ashukha, and D. Vetrov, "Variational dropout sparsifies deep neural networks," in Proc. Int. Conf. Mach. Learn., Sydney, Australia, Aug. 2017, pp. 2498–2507.
[42] X. Wang, F. Yu, Z.-Y. Dou, T. Darrell, and J. E. Gonzalez, "SkipNet: Learning dynamic routing in convolutional networks," in Proc. Eur. Conf. Comput. Vis., Munich, Germany, Sep. 2018, pp. 409–424.
[43] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris, "BlockDrop: Dynamic inference paths in residual networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 8817–8826.
[44] Z. Chen, Y. Li, S. Bengio, and S. Si, "You look twice: GaterNet for dynamic filter selection in CNNs," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Seoul, Korea, Oct. 2019, pp. 9172–9180.
[45] J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang, "Slimmable neural networks," in Proc. Int. Conf. Learn. Represent., 2019.
[46] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, May 1998.
[47] A. Krizhevsky, G. Hinton et al., "Learning multiple layers of features from tiny images," 2009. [Online]. Available: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
[48] Y. Le and X. Yang, "Tiny ImageNet visual recognition challenge," 2015. [Online]. Available: http://cs231n.stanford.edu/reports/2017/pdfs/930.pdf.
[49] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," J. Mach. Learn. Res., vol. 18, no. 1, pp. 6869–6898, Jan. 2017.
[50] Y. Polyanskiy, H. V. Poor, and S. Verdu, "Channel coding rate in the finite blocklength regime," IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.
[51] A. L. McKellips, "Simple tight bounds on capacity for the peak-limited discrete-time channel," in Proc. Int. Symp. Inf. Theory, Chicago, IL, USA, Jun. 2004, pp. 348–348.
[52] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 2016, pp. 770–778.
[53] L. v. d. Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res., vol. 9, no. 11, pp. 2579–2605, Nov. 2008.