Massive MIMO Adaptive Modulation and Coding Using Online Deep Learning Algorithm

Evgeny Bobrov, Dmitry Kropotov, Hao Lu, and Danila Zaev

The work has been supported by Huawei Technologies and the Interdisciplinary Scientific and Educational School of Moscow University «Brain, Cognitive Systems, Artificial Intelligence». Evgeny Bobrov is with Moscow Research Center, Huawei Technologies, Russia, and M. V. Lomonosov Moscow State University, Russia (e-mail: [email protected]). Dmitry Kropotov is with National Research University Higher School of Economics, Russia, and M. V. Lomonosov Moscow State University, Russia (e-mail: [email protected]). Hao Lu and Danila Zaev are with Moscow Research Center, Huawei Technologies, Russia (e-mail: [email protected], [email protected]).

Abstract

The paper describes an online deep learning algorithm (ODL) for adaptive modulation and coding in massive MIMO. The algorithm is based on a fully connected neural network, which is initially trained on the output of the traditional algorithm and then incrementally retrained by the service feedback of its output. We show the advantage of our solution over the state-of-the-art Q-learning approach. We provide system-level simulation results to support this conclusion in various scenarios with different channel characteristics and different user speeds. Compared with traditional OLLA, the algorithm shows a 10% to 20% improvement in user throughput in the full-buffer case.

Index Terms:
Adaptive Modulation and Coding, Link Adaptation, OLLA, Deep Learning, Reinforcement Learning, Massive MIMO, Wireless Communications, Online Training

I Introduction

The adaptive modulation and coding (AMC) process carried out in the link adaptation is a crucial part of current wireless communication systems. It becomes especially important and challenging in massive MIMO systems with dynamic beamforming. Advanced AMC techniques allow a significant increase in the data rate that can be reliably transmitted [1].

Following the New Radio (5G) downlink AMC procedure [2], the user equipment (UE) has to suggest to the serving base station (BS) an appropriate modulation and coding scheme (MCS) to be used in the next transmission. The UE conveys the proposed MCS through a channel quality indicator (CQI). However, this indication is not enough for high-performance service. The first reason is that each CQI is associated with an interval of signal-to-interference-plus-noise ratio (SINR), which could correspond to more than one MCS. In addition, in massive MIMO systems, the accuracy of CQI is limited by the number of specific antenna ports, which is usually less than the number of transmit antennas at the BS. Because of this, the BS cannot rely solely on the user’s CQI report in MCS selection, which is why various AMC methods have been proposed.

The well-known outer loop link adaptation (OLLA) technique was first proposed in [3]. OLLA modifies the CQI-based SINR estimate by an offset [4, 5], which can be positive (making the MCS selection more conservative) or negative (making it more aggressive when the CQI-based estimate was too pessimistic). This offset is updated based on the transmission success rate of transport blocks so that the average block error rate is kept as close as possible to a predefined target [6].

It should be noted that the OLLA family of algorithms uses only the last binary acknowledgment information and does not take into account more refined SINR channel data, e.g., sounding (SRS) based measurements. Contrary to that, we offer an adaptive and self-learning method that predicts the next MCS using the available SINR-related measurements. The method performs both the mapping from SINR and channel data to the optimal MCS and the training (self-learning) in an online manner.

The main advantage of the proposed online deep learning (ODL) algorithm is its ability to adapt to different environments, different channel types, and different scenario conditions that the BS cannot measure directly, e.g., UE speed. Due to the channel aging effect, user speed is an important hidden factor for the optimal choice of MCS, and it is hard to capture with an offline pre-trained AI-based model. In the proposed approach, the model is able to adaptively learn the behavior of the UE and implicitly take its speed into account. In the literature, this challenge is called concept drift [7]. It is described as a situation in which some hidden features are important and change over time, but cannot be measured. Our task therefore falls into the class of incremental learning [8] algorithms, which perform optimization in non-stationary environments such as the massive MIMO service of a mobile UE. The deep learning approach in massive MIMO scenarios was also studied in [9].

Traditional OLLA adapts its offset based on HARQ acknowledgment (ACK/NACK) feedback for a transmitted transport block. The adaptation is done only if a transmission is performed. In this respect, the OLLA technique is highly dependent on traffic characteristics. If traffic is sparse compared with the channel variation, the OLLA adaptation may not achieve satisfactory quality. However, other modern techniques, such as eOLLA [6], can update their offset regardless of whether a transmission is carried out, which is very convenient for bursty traffic scenarios. In this manuscript, the proposed solution updates its parameters using only ACK/NACK feedback and assumes continuous (i.e., full-buffer) traffic. The proposed solution is fully compatible with 5G NR specifications (Release 15 or higher). It does not require any modification to the standard.

The novelty of our work is in the proposed scheme of online deep learning with a new optimization target. On the one hand, it is simpler and more effective than the existing Q-learning approach [10, 11] to the AMC problem. On the other hand, it outperforms the basic OLLA approach because of the better utilization of the available channel/SINR information.

In this manuscript, machine training and execution are carried out exclusively on the base station side. The input (set of ’features’) of the algorithm consists of the subband SINR measurements, CQI, time period from the last sounding, and the last reference signal received power (RSRP) measurement. Training data (’features’ and ’labels’) is collected in real-time and stored in a limited memory buffer. The computational complexity and storage requirements of the ODL approach have been investigated. Simulation results prove the stable behavior of the proposal and its uniform advantage over OLLA and Q-learning baselines. Quantitatively, the proposal increases throughput value compared to OLLA by 10% to 20%, depending on the agent speed.

We summarize the advantages of our proposal as follows: (i) the ODL can adapt to different agent speeds, (ii) the proposed approach is fully compliant with the existing 5G NR specifications, and (iii) the entire online machine learning process is conducted at the base station side and has feasible storage and computational overheads.

This paper is organized as follows. Section II briefly describes the massive MIMO model. Section III presents the proposed algorithm structure, the neural network model, and the complexity of the online training with the sample buffer approach. Section IV describes the simulation results. Section V contains the conclusion.

II System Model

In the MIMO system, it is possible to send several information symbols to a multi-antenna user on a single physical resource. The number of such symbols is called the rank of the user. Under certain channel conditions, the higher rank can significantly increase the amount of transmitted information, but at the same time, it increases the requirements for channel quality. The single-user MIMO model is described by the following linear system:

$$r = G(HWx + n). \tag{1}$$

Here, $r \in \mathbb{C}^{L}$ is the vector of detected symbols at the receiver, $x \in \mathbb{C}^{L}$ is the vector of sent symbols, $H \in \mathbb{C}^{R \times T}$ is the channel matrix, $W \in \mathbb{C}^{T \times L}$ is the precoding matrix, $G \in \mathbb{C}^{L \times R}$ is the detection matrix, and $n \sim \mathcal{CN}(0, I_{L})$ is the noise vector. The constant $T$ is the number of transmit antennas, $R$ is the number of receive antennas, and $L$ is the user rank. We assume they are related as follows: $L \leqslant R \leqslant T$. For the detection matrix $G$ we assume linear MMSE [12], and for the precoding $W$ we assume the SVD-based transmission scheme [13].

The optimization objective is to maximize the expected throughput, an objective also considered, e.g., in [14]. The parameters of the model, including the bandwidth and the sounding period, are provided in Table I.

III Structure of the proposed algorithm

The general structure of the solution follows [6]. The algorithm predicts the success acknowledgment (ACK) probability for each MCS given the available SINR measurements.

We propose to consider the product of spectral efficiency and the probability of successful transmission, and maximize the resulting value over possible choices of MCS:

$$\widehat{mcs}_{SE}(sinr) = \arg\max\limits_{mcs} \big\{ p_{w}(ack | mcs, sinr)\, SE(mcs) \big\} \tag{2}$$

This approach corresponds to the maximization of the expected throughput under the assumption of the Bernoulli probabilistic scheme.

Here, $p_{w}(ack | mcs, sinr)$ is a neural network model that predicts probabilities and has weights $w$ as the parameters for optimization. At the inference stage, the neural network takes as input the frequency-specific SINR estimations and an MCS and provides an acknowledgment probability as an output. The algorithm iterates through the MCS values and selects the scheme that provides the maximum expected throughput.
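The selection rule (2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the spectral-efficiency table `TOY_SE` and the probability model `toy_ack_prob` are hypothetical stand-ins for $SE(mcs)$ and the trained $p_{w}$, and only four MCS values are used for brevity.

```python
import math
from typing import Callable, Mapping, Sequence


def select_mcs(
    ack_prob: Callable[[int, Sequence[float]], float],
    spectral_eff: Mapping[int, float],
    sinr: Sequence[float],
) -> int:
    """Return the MCS maximizing p(ack | mcs, sinr) * SE(mcs), as in (2)."""
    return max(spectral_eff, key=lambda mcs: ack_prob(mcs, sinr) * spectral_eff[mcs])


# Hypothetical spectral efficiencies for four MCS values (assumption).
TOY_SE = {1: 0.5, 2: 1.0, 3: 2.0, 4: 4.0}


def toy_ack_prob(mcs: int, sinr: Sequence[float]) -> float:
    """Hypothetical stand-in for the trained model: ACK probability drops as MCS grows."""
    mean_sinr = sum(sinr) / len(sinr)
    return 1.0 / (1.0 + math.exp(-(mean_sinr - 5.0 * (mcs - 1))))


if __name__ == "__main__":
    for sinr in ([0.0, 0.0], [10.0, 10.0], [20.0, 20.0]):
        print(sinr, "->", select_mcs(toy_ack_prob, TOY_SE, sinr))
```

With this toy model, a higher SINR shifts the maximizer toward a higher MCS, because the gain in spectral efficiency starts to outweigh the loss in success probability.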

In the current literature, there is a tendency to use Q-learning (a reinforcement learning technique) for the AMC problem [10, 11]. This technique considers MCS selection as an agent action. While deep Q-learning is widely applied in wireless communication systems and can be applied to this task as well, we argue that this application is not natural. We propose an alternative scheme (2) using classical deep learning that appears to be superior to Q-learning. Our choice of architecture is based on the following observations:

  (a) All actions in AMC are performed immediately, and the reward delay is strictly specified in advance. The reward does not depend on future actions, unlike, e.g., in a chess game, which is naturally modeled by Q-learning.

  (b) There is no influence on the system from our actions. The actual SINR of the transmission is independent of the MCS we choose.

  (c) The actual channel, the BS measurements, and the precoding are time-varying in general. Thus, we have access to the input data (features) and training outputs (labels) sequentially. Older data samples tend to become irrelevant.

Observations (a) and (b) motivate the use of the traditional deep learning approach rather than Q-learning. We consider acknowledgment prediction as a binary classification problem and use the scheme (2) to select the optimal MCS. Observation (c) motivates the use of the online approach.

Compared with Q-learning, the main difference in our ODL approach is the use of a binary logarithmic loss function (log-loss) instead of Q-learning Temporal-Difference (TD)-Loss [15]. This way, we move to the binary classification problem instead of maximizing the delayed rewards (a) and modeling the influence on the system of our actions (b).

Note that we do not need to model a chain of future actions for this type of optimization. Indeed, the proposed ODL method predicts only the MCS for the next transmission, while the Q-learning approach predicts a chain of future actions (which are enclosed in Q-values). Thus, the ODL method is more suitable for the MCS selection problem and, as we show later, provides more stable performance.

Figure 1: Online Deep Learning algorithm block scheme.

As a competitor to our solution, we consider the following Q-learning regression model [10, 11], which selects MCS based on the following maximization principle:

$$\widetilde{mcs}_{SE}(sinr) = \arg\max\limits_{mcs} \big\{ q_{w}^{SE}(ack | mcs, sinr) \big\} \tag{3}$$

Here, $q_{w}^{SE}(ack | mcs, sinr)$ is the neural network regression that predicts real scalar values. The Q-learning model is trained on the rewards $r(ack, mcs) = SE(mcs)[ack]$, where $[x]$ is the indicator function that returns $1$ if condition $x$ is true and $0$ otherwise. The condition $ack$ corresponds to the receipt of the success acknowledgment. We will discuss this in more detail in the next section.
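As a small illustration of the reward defined above, the following Python sketch computes $r(ack, mcs)$ and its empirical mean over a few transmissions at a fixed MCS. The function names and the sample values are assumptions made for this example. The mean reward equals the observed ACK rate times $SE(mcs)$, which is the same kind of quantity that rule (2) maximizes.

```python
from typing import Sequence


def reward(ack: bool, se: float) -> float:
    """r(ack, mcs) = SE(mcs) * [ack]: the spectral efficiency on success, zero on failure."""
    return se if ack else 0.0


def mean_reward(acks: Sequence[bool], se: float) -> float:
    """Empirical mean reward over transmissions that all used the same MCS."""
    return sum(reward(a, se) for a in acks) / len(acks)


if __name__ == "__main__":
    acks = [True, True, False, True]  # hypothetical feedback: 3 ACKs out of 4
    se = 2.0                          # hypothetical SE(mcs)
    print(mean_reward(acks, se))      # ACK rate 0.75 times SE 2.0
```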

III-A Neural Network Model

In this work, we propose using the simplest neural network for binary classification without hidden layers (logistic regression). This model is lightweight, fast to train, and robust to environmental changes in the online-learning setting.

Our classification model uses the standard sigmoid function, which takes any real input $t$ and outputs a value between zero and one. The sigmoid function $\sigma : \mathbb{R} \rightarrow (0, 1)$ is defined as follows: $\sigma(t) = 1 / (1 + e^{-t})$.

Thus, we can express the probability of receiving an acknowledgment in terms of the sigmoid function $\sigma$ depending on the $mcs$ and $sinr$ arguments through the function $f_{w}$, which is the neural network function with weights $w$:

$$p_{w}(ack | mcs, sinr) = \sigma\big(f_{w}(mcs, sinr)\big). \tag{4}$$
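A minimal Python sketch of model (4) for the case without hidden layers, where $f_{w}$ is an affine function of the features. The feature layout (MCS first, then SINR values) and the weight values are assumptions made only for this example.

```python
import math
from typing import Sequence


def sigmoid(t: float) -> float:
    """Sigmoid 1 / (1 + exp(-t)), evaluated in a numerically stable way."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)


def ack_probability(weights: Sequence[float], bias: float, features: Sequence[float]) -> float:
    """p_w(ack | mcs, sinr) = sigmoid(f_w), with f_w = bias + <weights, features>."""
    f_w = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(f_w)


if __name__ == "__main__":
    # Hypothetical weights: a negative weight on MCS, positive weights on two SINR values.
    weights, bias = [-0.5, 0.2, 0.2], 0.0
    for mcs in (5, 10, 15):
        print(mcs, ack_probability(weights, bias, [mcs, 12.0, 14.0]))
```

With a negative weight on the MCS feature, the predicted ACK probability decreases as the MCS grows, which is the qualitative behavior one would expect from such a model.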

The output of the model for a given vector of input features can be interpreted as a probability and serves as the basis for classification. The optimization method computes the log-loss for all the observations $n \in \{1 \dots N\}$ on which it is trained. The function $J$ counts the log-probabilities of ACKs in the following way:

$$J(w) = \frac{1}{N}\sum_{n=1}^{N}\Big( ack_{n} \log p_{w}(ack_{n} | sinr_{n}, mcs_{n}) + (1 - ack_{n}) \log\big(1 - p_{w}(ack_{n} | sinr_{n}, mcs_{n})\big) \Big) \rightarrow \max\limits_{w} \tag{5}$$

Here $ack_{n} \in \{0, 1\}$ is the "true" acknowledgment, which we get to know after the action is completed, and $p_{w}(ack_{n} | sinr_{n}, mcs_{n})$ is the probability model of the $ack_{n}$ reception, which is a function of the features $\{sinr_{n}, mcs_{n}\}$.

For the Q-learning approach, we apply the MSE-Loss function to the reward. Since we do not have a delayed reward, this is the TD-Loss with $\gamma = 0$ [15]:

$$F(w) = \frac{1}{N}\sum_{n=1}^{N}\big( q_{w}(ack_{n} | sinr_{n}, mcs_{n}) - r(ack_{n}, mcs_{n}) \big)^{2} \rightarrow \min\limits_{w} \tag{6}$$
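The two objectives (5) and (6) can be compared side by side in Python. This is a sketch under assumptions: the model outputs are passed in as plain lists, and the clipping constant `eps` is added here only to keep the logarithm finite; neither detail is taken from the paper.

```python
import math
from typing import Sequence


def log_loss_objective(probs: Sequence[float], acks: Sequence[int], eps: float = 1e-12) -> float:
    """J(w) in (5): mean log-likelihood of the observed ACK/NACK labels (to be maximized)."""
    total = 0.0
    for p, a in zip(probs, acks):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); an added safeguard
        total += a * math.log(p) + (1 - a) * math.log(1.0 - p)
    return total / len(acks)


def mse_objective(q_values: Sequence[float], rewards: Sequence[float]) -> float:
    """F(w) in (6): mean squared error between predicted Q-values and rewards (to be minimized)."""
    return sum((q - r) ** 2 for q, r in zip(q_values, rewards)) / len(rewards)


if __name__ == "__main__":
    print(log_loss_objective([0.9, 0.2], [1, 0]))
    print(mse_objective([1.0, 0.0], [2.0, 0.0]))
```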

III-B Proposed algorithm complexity

For the process of online learning, we use the Adam [16] method as one of the simplest gradient-based algorithms. It is worth noting that the previously obtained solution $w_{t}^{*}$ (optimal weights of the model) can be used as the starting point $w_{t+1}^{o}$ for the next re-training step, resulting in $w_{t+1}^{o} = w_{t}^{*}$. Therefore, in a practical implementation, it is enough to make just a few gradient steps at each re-training step.
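The warm-start idea can be sketched as follows: each re-training call takes the previous weights as its starting point and performs only a few Adam steps on the log-loss of a logistic model. This is an illustrative sketch, not the authors' code. The hyper-parameter values, the toy samples, and the choice to reset the Adam moment estimates at every call are assumptions; the bias is handled by a constant feature equal to 1.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))


def mean_log_lik(w, samples):
    """J(w): mean log-likelihood over (features, ack) samples."""
    total = 0.0
    for x, a in samples:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total += a * math.log(p) + (1 - a) * math.log(1.0 - p)
    return total / len(samples)


def neg_log_lik_grad(w, samples):
    """Gradient of -J(w) for logistic regression."""
    grad = [0.0] * len(w)
    for x, a in samples:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            grad[i] += (p - a) * xi / len(samples)
    return grad


def retrain(w_start, samples, steps=5, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Run a few Adam steps starting from the previous solution (warm start)."""
    w = list(w_start)
    m = [0.0] * len(w)  # first-moment estimate, reset per call (assumption)
    v = [0.0] * len(w)  # second-moment estimate, reset per call (assumption)
    for t in range(1, steps + 1):
        g = neg_log_lik_grad(w, samples)
        for i in range(len(w)):
            m[i] = b1 * m[i] + (1 - b1) * g[i]
            v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2
            m_hat = m[i] / (1 - b1 ** t)
            v_hat = v[i] / (1 - b2 ** t)
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


if __name__ == "__main__":
    # Toy samples: features are [1 (bias), scaled SINR]; ACK when the SINR is positive.
    samples = [([1.0, 1.0], 1), ([1.0, -1.0], 0), ([1.0, 2.0], 1), ([1.0, -2.0], 0)]
    w = [0.0, 0.0]
    for period in range(3):
        w = retrain(w, samples)  # w_{t+1}^o = w_t^*
        print(period, w, mean_log_lik(w, samples))
```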

Figure 2: Working Algorithm Time Axis.

Since the algorithm works online, it needs to be retrained on the new data. We suggest using a buffer for every user containing the recent samples (transmission examples): features, the selected MCS, and the result of the transmission (ACK/NACK). Buffer samples are updated in FIFO order; the oldest samples are replaced with the newest ones. This mechanism is illustrated in Fig. 3.

Figure 3: Algorithm Sample Buffer.

We propose adding new samples to the buffer with a (possibly adaptive) subsampling rate to avoid the situation where most features remain the same between the channel measurements. By doing so, we significantly reduce the memory buffer size and the retraining time without sacrificing prediction quality. The quality may even improve, since we can expand the buffer up to our storage limits. For our experiments with full-buffer users, we chose the subsampling rate as an inverse probability of sounding length, excluding the pilot signals.
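The per-user FIFO buffer with subsampling described above might look as follows in Python. This is a sketch under assumptions: the class name `SampleBuffer` is hypothetical, and the subsampling is a fixed "keep every k-th sample" rule, whereas the text allows a possibly adaptive rate.

```python
from collections import deque
from typing import List, Sequence, Tuple

Sample = Tuple[Tuple[float, ...], int, int]  # (features, selected MCS, ack)


class SampleBuffer:
    """Fixed-size FIFO buffer of recent transmission samples for one user."""

    def __init__(self, capacity: int, keep_every: int = 1) -> None:
        self._samples: deque = deque(maxlen=capacity)  # oldest entries are dropped first
        self._keep_every = keep_every                  # fixed subsampling factor (assumption)
        self._seen = 0

    def offer(self, features: Sequence[float], mcs: int, ack: int) -> bool:
        """Store the sample if the subsampling rule keeps it; return True if stored."""
        self._seen += 1
        if self._seen % self._keep_every != 0:
            return False
        self._samples.append((tuple(features), mcs, ack))
        return True

    def samples(self) -> List[Sample]:
        return list(self._samples)

    def __len__(self) -> int:
        return len(self._samples)


if __name__ == "__main__":
    buf = SampleBuffer(capacity=3, keep_every=2)
    for i in range(1, 11):
        buf.offer([float(i)], mcs=i, ack=1)
    print([mcs for _, mcs, _ in buf.samples()])
```

Using `deque(maxlen=...)` gives the FIFO replacement for free: appending to a full deque discards the oldest element.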

Parameters: Initial value $y_{o}$ and step size $d$ of OLLA. Initial CQI $c_{o}$, target BLER $b$, buffer size $U$, retraining period $N$.

Initialize: OLLA: $y = y_{o}$, CQI: $c = c_{o}$, the sample buffer of size $U$, and the neural network model $A(w)$ with number of nodes $P$ and number of connections $Q$.

Complexity: computations $\mathcal{O}(Q/N)$ and memory $\mathcal{O}(PU)$.

procedure An Agent Scheme
     for each of the first $U$ TTIs do
         Set $MCS \leftarrow \min(\max(\text{round}(c + y), 1), 29)$
         Receive and put to the buffer the labels:
              ACK or NACK: $a \in \{0, 1\}$, and the features:
              CQI $c \in \{1 \dots n\}$, SINR $s \in \mathbb{R}^{m}$,
              MCS $\in \{1 \dots 29\}$.
         Update OLLA: $y \leftarrow y + da - d(1 - a)(1 - b)/b$ [6]
         Train $A(w)$ targeting $J$ (5) or $F$ (6) and using buffer
     end for
     for each time frame $k = U + 1, U + 2, \dots$ do
         Set MCS $\leftarrow A(w)$ NN prediction by (2) or (3)
         Receive and replace the oldest values of the buffer
              labels: ACK or NACK: $a \in \{0, 1\}$, and features:
              CQI: $c \in \{1 \dots n\}$, SINR $s \in \mathbb{R}^{m}$,
              MCS $\in \{1 \dots 29\}$.
         if $k \bmod N = 0$ then
              Initialize NN $w_{k}^{o} = w_{k-N}^{*}$ from the previous step
              Retrain $A(w)$ using (5) or (6) and memory buffer
         end if
     end for
end procedure
Figure 4: The scheme of both Online Deep Learning (proposed) and Q-learning with a sample buffer. The key difference between the algorithms is in the different activation and loss functions.
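The OLLA warm-up phase of the scheme in Figure 4 relies on two small formulas, the clipped MCS choice and the offset update. A minimal Python sketch is given below; the function names and the numeric values in the usage part are assumptions for illustration. Note that Python's built-in `round` rounds halves to the nearest even integer, which may differ from the rounding convention intended in the scheme.

```python
def olla_mcs(cqi: int, offset: float, mcs_min: int = 1, mcs_max: int = 29) -> int:
    """MCS <- min(max(round(c + y), 1), 29)."""
    return min(max(round(cqi + offset), mcs_min), mcs_max)


def olla_update(offset: float, ack: int, step: float, target_bler: float) -> float:
    """y <- y + d*a - d*(1 - a)*(1 - b)/b, with a = 1 for ACK and a = 0 for NACK."""
    return offset + step * ack - step * (1 - ack) * (1 - target_bler) / target_bler


if __name__ == "__main__":
    y, d, b = 0.0, 0.1, 0.1  # hypothetical initial offset, step size, and target BLER
    for a in [1] * 9 + [0]:  # nine ACKs followed by one NACK, i.e. a 10% error rate
        y = olla_update(y, a, d, b)
        print(a, round(y, 3), olla_mcs(10, y))
```

The update is balanced at the target BLER: with $b = 0.1$, one NACK removes as much offset as nine ACKs add, so a sequence with exactly 10% errors leaves the offset where it started.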

III-C The structure of the neural network

The input of the neural network in the proposed approach and in the Q-learning approach is designed to be the same. The feature space of the algorithm consists of the SINR for each user antenna ($RxNum = 4$ in our experiments), the reported CQI, the time interval since the last sounding, the cell RSRP, and one of the MCS values. Additional bias parameters are configured for each layer of the network. The output layer predicts the acknowledgment success. We apply standard scaling normalization: for each feature, we subtract its average value and divide by its standard deviation, both computed across all samples. The structure of the neural network is presented in Fig. 5.
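The standard scaling step mentioned above can be sketched in Python as follows. The helper names are hypothetical, and two details are assumptions of this example: the population standard deviation is used, and a feature that is constant across the samples is left unscaled to avoid division by zero.

```python
import statistics
from typing import List, Sequence, Tuple


def fit_scaler(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Compute the per-feature mean and standard deviation across all samples."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A zero deviation (constant feature) is replaced by 1.0 so the division is safe.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(row: Sequence[float], means: Sequence[float], stds: Sequence[float]) -> List[float]:
    """Subtract the mean and divide by the standard deviation, feature by feature."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


if __name__ == "__main__":
    # Hypothetical samples with two features each (for example, an SINR value and a CQI).
    rows = [[1.0, 10.0], [3.0, 10.0]]
    means, stds = fit_scaler(rows)
    print([transform(r, means, stds) for r in rows])
```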

The structure of ODL and Q-learning neural networks is the same in all aspects except the activation functions at the output layer. For the output layer, ODL uses the sigmoid function, while the Q-learning method uses the identity function. This difference is motivated by a difference in the problems the models solve. ODL solves the binary classification problem by predicting the probability of success, while the Q-learning model solves the regression problem by predicting real Q-values. Thus, we selected the activation functions that give the best quality results for each of the considered approaches.

Figure 5: Block diagram of the neural network used.

IV System-level simulation results

First, we compare the proposed machine learning algorithm with the traditional OLLA method. The reported performance gains and losses are calculated with respect to OLLA performance. The results are stable and uniformly better than OLLA in all our experiments. On average, the proposal increases throughput by 12.64% to 21.52%, depending on UE speed.

The advantage of the proposal is explained by the use of additional information from SRS-based SINR measurements. We also note that the step-by-step behavior of OLLA is too conservative in a rapidly changing environment.

IV-A Quality improvement with machine learning

We provide experimental results for different speeds, user ranks, and random seeds. Note that the proposed algorithm is not manually tuned for the various conditions: all its hyper-parameters remain the same. It is important since in the real-life commercial system, BS does not have information about the user speed and, especially, about the user environment (e.g., urban, rural, etc.).

Figure 6: Spectral efficiency gain of ODL over OLLA. Average over 10 random seeds. Ranks 1, 2, and 3. Speeds 3 km/h and 60 km/h.

The proposed online deep learning model performs the mapping from the SINR measurements to the optimal MCS. The most significant advantage is achieved on the rises and falls of the SINR quality because ODL is more adaptive to the instant SINR than OLLA and instantly converges to the optimal MCS. Fig. 8 shows the uniform advantage of the ODL algorithm over OLLA.

IV-B ODL and Q-learning performance comparison

Next, we compare ODL performance with the performance of the Q-learning algorithm. Our simulation results show that the ODL method works uniformly better for all user ranks at a speed of 30 km/h and a random trajectory.

Figure 7: Spectral efficiency gain over OLLA of the two models: ODL and Q-learning. The agent speed is 30 km/h with a random trajectory.

TABLE I: System configuration of simulation-based experiments.
Parameter            Value
CellMaxPower         40 dBm
ThermalNoisePower    -174 dBm/Hz
Bandwidth            20 MHz
TxAntNum, $T$        64
RxAntNum, $R$        4
Sounding Period      5 ms

V Conclusion

This paper proposes a novel online deep learning solution for adaptive modulation and coding for massive MIMO systems. It learns to predict the probability of transmission success for different MCS values and selects the MCS with the highest expected throughput. Simulation results show that the proposed approach outperforms both the Q-learning approach and the traditional outer loop link adaptation method. When compared to standard OLLA, our method improves user performance by 10% to 20% in the full-buffer scenario. We provided an explanation for this advantage. The proposed approach has lower complexity than the Q-learning method and provides better and more stable performance. In addition, the proposed method is fully compatible with the current 5G RAN specifications. We hope that the analysis of the AMC problem provided in this paper will help to design better and simpler NN-based solutions for adaptive MCS selection in massive MIMO systems.

Figure 8: Throughput statistics on a time interval, user rank 2 and 24 km/h speed. The red line is proposed ODL, the blue is OLLA.

References

  • [1] S. T. Chung and A. J. Goldsmith, “Degrees of freedom in adaptive modulation: a unified view,” IEEE Transactions on Communications, vol. 49, no. 9, pp. 1561–1571, 2001.
  • [2] J. Wannstrom, “LTE-advanced,” Third Generation Partnership Project (3GPP), 2013.
  • [3] A. Sampath, P. S. Kumar, and J. M. Holtzman, “On setting reverse link target SIR in a CDMA system,” in 1997 IEEE 47th Vehicular Technology Conference. Technology in Motion, vol. 2.   IEEE, 1997, pp. 929–933.
  • [4] P. Song and S. Jin, “Performance evaluation on dynamic dual layer beamforming transmission in TDD LTE system,” in 2013 Third International Conference on Communications and Information Technology (ICCIT).   IEEE, 2013, pp. 269–274.
  • [5] K. I. Pedersen, G. Monghal, I. Z. Kovacs, T. E. Kolding, A. Pokhariyal, F. Frederiksen, and P. Mogensen, “Frequency domain scheduling for OFDMA with limited and noisy channel feedback,” in 2007 IEEE 66th Vehicular Technology Conference.   IEEE, 2007, pp. 1792–1796.
  • [6] F. Blanquez-Casado, G. Gomez, M. del Carmen Aguayo-Torres, and J. T. Entrambasaguas, “eOLLA: an enhanced outer loop link adaptation for cellular networks,” EURASIP Journal on Wireless Communications and Networking, vol. 2016, no. 1, pp. 1–16, 2016.
  • [7] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A survey on concept drift adaptation,” ACM computing surveys (CSUR), vol. 46, no. 4, pp. 1–37, 2014.
  • [8] B. Krawczyk and A. Cano, “Online ensemble learning with abstaining classifiers for drifting and noisy data streams,” Applied Soft Computing, vol. 68, pp. 677–692, 2018.
  • [9] C.-K. Wen, W.-T. Shih, and S. Jin, “Deep learning for massive MIMO CSI feedback,” IEEE Wireless Communications Letters, vol. 7, no. 5, pp. 748–751, 2018.
  • [10] M. P. Mota, D. C. Araujo, F. H. C. Neto, A. L. de Almeida, and F. R. Cavalcanti, “Adaptive modulation and coding based on reinforcement learning for 5G networks,” in 2019 IEEE Globecom Workshops (GC Wkshps).   IEEE, 2019, pp. 1–6.
  • [11] L. Zhang, J. Tan, Y.-C. Liang, G. Feng, and D. Niyato, “Deep reinforcement learning-based modulation and coding scheme selection in cognitive heterogeneous networks,” IEEE Transactions on Wireless Communications, vol. 18, no. 6, pp. 3281–3294, 2019.
  • [12] D. Wubben, R. Bohnke, V. Kuhn, and K.-D. Kammeyer, “Near-maximum-likelihood detection of MIMO systems using MMSE-based lattice-reduction,” in 2004 IEEE International Conference on Communications (IEEE Cat. No. 04CH37577), vol. 2.   IEEE, 2004, pp. 798–802.
  • [13] L. Sun and M. R. McKay, “Eigen-based transceivers for the MIMO broadcast channel with semi-orthogonal user selection,” IEEE Transactions on Signal Processing, vol. 58, no. 10, pp. 5246–5261, 2010.
  • [14] P. Fan and K. B. Letaief, “Understanding of transmission throughput and channel capacity in a systematic way,” in 2011 20th Annual Wireless and Optical Communications Conference (WOCC).   IEEE, 2011, pp. 1–5.
  • [15] G. Tesauro et al., “Temporal difference learning and TD-Gammon,” Communications of the ACM, vol. 38, no. 3, pp. 58–68, 1995.
  • [16] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.