Cooperative Deep Reinforcement Learning for Multiple Groups NB-IoT Networks Optimization

10/27/2018 ∙ by Nan Jiang, et al. ∙ King's College London ∙ Queen Mary University of London

NarrowBand-Internet of Things (NB-IoT) is an emerging cellular-based technology that offers a range of flexible configurations for massive IoT radio access from groups of devices with heterogeneous requirements. A configuration specifies the amount of radio resources allocated to each group of devices for random access and for data transmission. Assuming no knowledge of the traffic statistics, the problem is to determine, in an online fashion at each Transmission Time Interval (TTI), the configurations that maximize the long-term average number of IoT devices that are able to both access and deliver data. Given the complexity of optimal algorithms, a Cooperative Multi-Agent Deep Neural Network based Q-learning (CMA-DQN) approach is developed, whereby each DQN agent independently controls a configuration variable for each group. The DQN agents are cooperatively trained in the same environment based on feedback regarding transmission outcomes. CMA-DQN is seen to considerably outperform conventional heuristic approaches based on load estimation.




I Introduction

To effectively support the emerging massive Internet of Things (IoT) ecosystem, the 3GPP has standardized NarrowBand-IoT (NB-IoT), a new radio access technology designed to coexist with Long-Term Evolution [1]. NB-IoT supports up to three groups of IoT devices, known as Coverage Enhancement (CE) groups. The devices in each group share a similar average received Signal-to-Noise Ratio (SNR), as measured on a broadcast signal, and similar traffic characteristics (see Fig. 1(a)) [2]. At the beginning of each uplink Transmission Time Interval (TTI), the evolved Node B (eNB) selects a system configuration that specifies the radio resources allocated to each group to accommodate the Random Access CHannel (RACH) procedure, with the remaining resources used for data transmission. The key challenge is to balance the allocation of channel resources between the RACH procedure and data transmission so as to provide reliable connections: allocating too many resources to the RACH improves random access performance but leaves insufficient resources for data transmission.

The eNB observes the number of successful transmissions and collisions on the RACH for all groups at the end of each TTI. This historical information can potentially be used to predict traffic from all groups and to aid the optimization of the configurations for future TTIs. Even if all the relevant statistics were known, tackling this problem exactly would result in a Partially Observable Markov Decision Process (POMDP) with large state and action spaces, which is generally intractable. The complexity of the problem is compounded by the lack of prior knowledge at the eNB regarding the traffic and channel statistics.

Fig. 1: (a) Illustration of system model; (b) Uplink channel frame structure.

In light of these challenges, prior works [3, 4] have tackled the problem under the simplifying assumptions that at most two configurations are allowed and that the optimization is done separately for each group without considering errors due to wireless transmission. In order to consider more complex and practical formulations, Reinforcement Learning (RL) emerges as a natural solution given the availability of feedback in the form of the number of successful and unsuccessful transmissions per TTI. Q-learning based Access Class Barring (ACB) schemes have been proposed in [5, 6] with the aim of optimizing the access success probability of the RACH. These schemes optimize the ACB procedure by using a tabular approach. Finally, the optimization of some of the parameters of the NB-IoT configuration, namely the repetition value (to be defined below), was carried out in [7] from the perspective of a single device, in terms of latency and power consumption, using a queuing framework.

In this paper, we develop a Cooperative Multi-Agent Deep Neural Network based Q-learning (CMA-DQN) algorithm for online uplink resource configuration in NB-IoT systems. In the proposed approach, a number of DQN agents are cooperatively trained to produce the configurations for the three CE groups. The reliance on Deep Neural Networks (DNNs) addresses the limitation of tabular approaches [5, 6] by enabling operation over a large state space, while the use of multiple agents deals with the large dimensionality of the output space, corresponding to the configurations for the three CE groups.

The rest of the paper is organized as follows. Section 2 illustrates the system model. Section 3 discusses the conventional solution. Section 4 presents the CMA-DQN approach. Section 5 provides the numerical results and discussion.

II System Model

As illustrated in Fig. 1(a), we consider a single-cell NB-IoT network composed of an eNB located at the center of the cell, and a set of static IoT devices randomly located in an area of the plane. The devices are divided into three CE groups as further discussed below. In each IoT device, uplink data is generated according to random inter-arrival processes over the TTIs, which are Markovian and possibly time-varying as defined in [8, Ch. 6.1].

II-A Problem Formulation

Once backlogged, an IoT device executes the contention-based RACH procedure in order to establish a Radio Resource Control (RRC) connection with the eNB. This is done by transmitting a randomly selected preamble a given number of times within the next RACH period of the current TTI. The RACH process can fail if: (i) a collision occurs between two or more IoT devices selecting the same preamble; or (ii) there is no collision, but the eNB cannot detect a preamble due to low SNR. As shown in Fig. 1(b), for each TTI $t$ and for each CE group $i \in \{0, 1, 2\}$, in order to reduce the chance of a collision, the eNB can increase the number $n^t_{\mathrm{Rach},i}$ of RACH periods in the TTI or the number $f^t_{\mathrm{Prea},i}$ of preambles available in each RACH period [9]. Furthermore, in order to mitigate the SNR outage, the eNB can increase the number $n^t_{\mathrm{Repe},i}$ of times that a preamble transmission is repeated by a device in group $i$ in one RACH period [9] of the TTI.

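After the RRC connection is established, the IoT device requests uplink channel resources from the eNB for control information and data transmission. As shown in Fig. 1(b), given a total number $R_{\mathrm{UL}}$ of resources available for uplink transmission in the TTI, the number $R^t_{\mathrm{DATA}}$ of resources available for data transmission is obtained as the difference $R^t_{\mathrm{DATA}} = R_{\mathrm{UL}} - R^t_{\mathrm{RACH}}$, where $R^t_{\mathrm{RACH}}$ is the overall number of Resource Elements (REs) allocated for the RACH procedure. This can be computed as $R^t_{\mathrm{RACH}} = B_{\mathrm{RACH}} \sum_{i=0}^{2} n^t_{\mathrm{Rach},i}\, f^t_{\mathrm{Prea},i}\, n^t_{\mathrm{Repe},i}$, where $B_{\mathrm{RACH}}$ measures the number of REs required for the transmission of one preamble, and $n^t_{\mathrm{Rach},i}$, $f^t_{\mathrm{Prea},i}$, and $n^t_{\mathrm{Repe},i}$ denote the number of RACH periods, the number of preambles per RACH period, and the repetition value of CE group $i$ in TTI $t$. (Footnote 1: The uplink channel consists of 48 sub-carriers within 180 kHz bandwidth. With a 3.75 kHz tone spacing, one RE is composed of one time slot of 2 ms and one sub-carrier of 3.75 kHz [1].)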
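The resource split described above can be illustrated with a short calculation. This is a minimal sketch, not the authors' simulator: the function names are hypothetical, and it assumes that the whole uplink grid of one TTI (48 sub-carriers, 2 ms slots, a 640 ms TTI as listed in Table I) is available, and that one preamble repetition costs 4 REs (four symbol groups of one RE each).

```python
# Hypothetical helper names; constants follow the footnote and Table I.
SUBCARRIERS = 48        # 3.75 kHz tones in the 180 kHz uplink carrier
SLOT_MS = 2             # one RE spans one 2 ms slot
TTI_MS = 640            # TTI duration from Table I
RES_PER_PREAMBLE = 4    # four symbol groups, one RE each


def total_uplink_res(tti_ms=TTI_MS):
    """REs in one TTI, assuming the whole uplink grid is available."""
    return SUBCARRIERS * (tti_ms // SLOT_MS)


def rach_res(config):
    """config: one (rach_periods, preambles, repetitions) tuple per CE group."""
    return RES_PER_PREAMBLE * sum(n * f * r for n, f, r in config)


def data_res(config):
    """REs left for data; a negative value means the configuration does not fit."""
    return total_uplink_res() - rach_res(config)


# Example configuration for CE groups 0, 1, 2.
config = [(1, 12, 1), (1, 12, 4), (1, 12, 8)]
print(total_uplink_res(), rach_res(config), data_res(config))
```

Raising the repetition value or the number of RACH periods of any group increases the RACH share linearly, which is the trade-off the eNB has to manage in every TTI.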

In this work, we tackle the problem of optimizing the RACH configuration $A^t$, defined by the parameters $\{n^t_{\mathrm{Rach},i}, f^t_{\mathrm{Prea},i}, n^t_{\mathrm{Repe},i}\}$ (the number of RACH periods, the number of preambles per RACH period, and the repetition value) for each $i$-th group, in an online manner for every TTI $t$. In order to make this decision at the beginning of every TTI $t$, the eNB has available for all prior TTIs $t' < t$ the collection $U^{t'}$ consisting of the following variables: the number of collided preambles, the number of successfully received preambles, and the number of idle preambles in the $t'$-th TTI for the RACH, as well as the number of IoT devices that have successfully sent data and the number of IoT devices that are waiting to be allocated data resources. We denote as $O^t$ the history of all such measurements and past actions.

The eNB aims at maximizing the long-term average number of devices that successfully transmit data with respect to the stochastic policy $\pi(A^t \mid O^t)$ that maps the current observation history $O^t$ to the probabilities of selecting each possible configuration $A^t$. This problem can be formulated as the optimization

$$\max_{\pi(A^t \mid O^t)} \; \sum_{k=t}^{\infty} \sum_{i=0}^{2} \gamma^{k-t}\, \mathbb{E}_{\pi}\!\left[ V^{k}_{\mathrm{su},i} \right], \tag{1}$$

where $V^{k}_{\mathrm{su},i}$ is the number of group-$i$ devices that successfully send data in TTI $k$, $\gamma \in [0, 1)$ is the discount rate for the performance accrued in future TTIs, and index $i$ runs over the CE groups. Since the dynamics of the system are Markovian over the TTIs and are defined by the NB-IoT protocol, to be further discussed below, this is a POMDP problem that is generally intractable. Approximate solutions will be discussed in Sections 3 and 4.

II-B NB-IoT Access Network

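We now provide additional details on the model and on the NB-IoT protocol. For the wireless channels, we consider the standard power-law path-loss model with path-loss exponent $\eta$ and Rayleigh flat-fading. Once an IoT device becomes backlogged, it first determines its associated CE group by comparing the received power $P_{\mathrm{RSRP}}$ of the broadcast signal to the Reference Signal Received Power (RSRP) thresholds $\gamma_{\mathrm{RSRP1}} > \gamma_{\mathrm{RSRP2}}$ according to the rule [10]

$$\text{CE group} = \begin{cases} 0, & P_{\mathrm{RSRP}} > \gamma_{\mathrm{RSRP1}}, \\ 1, & \gamma_{\mathrm{RSRP2}} \le P_{\mathrm{RSRP}} \le \gamma_{\mathrm{RSRP1}}, \\ 2, & P_{\mathrm{RSRP}} < \gamma_{\mathrm{RSRP2}}, \end{cases} \tag{2}$$

where the received power $P_{\mathrm{RSRP}} = P_{\mathrm{NPBCH}}\, r^{-\eta}$ is averaged over small-scale Rayleigh fading, $r$ is the device’s distance from the eNB, and $P_{\mathrm{NPBCH}}$ is the broadcast power of the eNB [10].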
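The threshold rule can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, powers are in linear units, and the numerical thresholds in the example are made-up values chosen to exercise the three branches, not the values of Table I. The handling of the boundary cases (strict versus non-strict inequalities) is also an assumption.

```python
def received_power(broadcast_power, distance, pathloss_exponent=4.0):
    """Average received power under power-law path loss (linear units)."""
    return broadcast_power * distance ** (-pathloss_exponent)


def ce_group(p_rsrp, threshold_hi, threshold_lo):
    """Map an averaged received power to CE group 0, 1, or 2.

    threshold_hi > threshold_lo are the two RSRP thresholds;
    devices with the strongest signal land in group 0.
    """
    if p_rsrp > threshold_hi:
        return 0
    if p_rsrp >= threshold_lo:
        return 1
    return 2


# Illustrative values only (not from the paper).
for d in (2.0, 10.0, 100.0):
    p = received_power(1.0, d)
    print(d, ce_group(p, threshold_hi=1e-3, threshold_lo=1e-5))
```

Devices far from the eNB fall into the higher CE groups, which is why those groups typically need more preamble repetitions to be detected.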

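After CE group determination, each IoT device in group $i$ repeats a randomly selected preamble $n^t_{\mathrm{Repe},i}$ times in the next RACH period by using a pseudo-random frequency hopping schedule. The preamble consists of four so-called symbol groups, each occupying one RE [11, 12, 1]. A preamble is successfully detected if at least one preamble repetition succeeds, which in turn happens if all four of its symbol groups are correctly decoded [12]. Assuming that correct detection is determined by the SNR level $\mathrm{SNR}^t_{j,k}$ for the $j$-th repetition and the $k$-th symbol group, the correct detection event can be expressed as

$$\bigcup_{j=1}^{n^t_{\mathrm{Repe},i}} \; \bigcap_{k=1}^{4} \left\{ \mathrm{SNR}^t_{j,k} \ge \gamma_{\mathrm{th}} \right\}, \tag{3}$$

where $\gamma_{\mathrm{th}}$ is the SNR threshold, and the SNR can be written as $\mathrm{SNR}^t_{j,k} = P\, h_{j,k}\, r^{-\eta} / \sigma^2$, with $h_{j,k}$ the Rayleigh fading power gain, $r$ the device’s distance from the eNB, $\eta$ the path-loss exponent, and $\sigma^2$ the noise power, given the preamble transmit power $P = \min\{\rho\, r^{\eta}, P_{\max}\}$ for $i = 0$ (CE group 0), and $P = P_{\max}$ for $i = 1$ or 2. Here, $P_{\max}$ is the maximal transmit power of IoT devices. Note that the preamble is transmitted using full path-loss inversion power control for CE group 0 [10], which ensures an average received power of $\rho$ unless the power constraint is violated.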
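The detection rule (at least one repetition in which all four symbol groups clear the SNR threshold) lends itself to a small Monte Carlo check. The sketch below is illustrative: the function names are hypothetical, the fading power gains are drawn as independent unit-mean exponentials for every symbol group (an assumption motivated by Rayleigh fading and frequency hopping), and the threshold of 1.0 corresponds to the 0 dB value in Table I.

```python
import random


def preamble_detected(snr, threshold):
    """snr[j][k]: linear SNR of symbol group k in repetition j."""
    return any(all(s >= threshold for s in rep) for rep in snr)


def detection_probability(mean_snr, repetitions, threshold=1.0,
                          trials=5000, seed=0):
    """Estimate the detection probability by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Rayleigh fading: the power gain is exponential with unit mean.
        snr = [[mean_snr * rng.expovariate(1.0) for _ in range(4)]
               for _ in range(repetitions)]
        hits += preamble_detected(snr, threshold)
    return hits / trials


for reps in (1, 4, 16):
    print(reps, detection_probability(mean_snr=2.0, repetitions=reps))
```

Under these assumptions a single repetition succeeds with probability $e^{-4\gamma_{\mathrm{th}}/\overline{\mathrm{SNR}}}$, so more repetitions raise the detection probability at the cost of more REs per preamble.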

If a RACH attempt fails, the IoT device repeats the procedure until it receives a positive acknowledgement that the RRC connection is established, or until it exceeds the maximum number of RACH attempts allowed while being part of one CE group (see Table I). If these attempts are exceeded, the device switches to a higher CE group if possible [13]. Moreover, the IoT device is allowed to attempt the RACH procedure no more than a maximum total number of times (see Table I) before dropping a packet.
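The retry and escalation logic can be sketched as a simple function of the attempt counter. This is a hedged illustration with a hypothetical function name: it uses the limits listed in Table I (5 attempts within one CE group, 10 attempts overall) and assumes that a device already in the highest CE group simply stays there.

```python
def ce_group_for_attempt(attempt, start_group, max_per_group=5,
                         max_total=10, top_group=2):
    """CE group used on a 1-based RACH attempt, or None if the packet is dropped."""
    if attempt > max_total:
        return None  # attempt budget exhausted: the packet is dropped
    # Move up one CE group after every max_per_group failed attempts.
    group = start_group + (attempt - 1) // max_per_group
    return min(group, top_group)


print([ce_group_for_attempt(a, start_group=0) for a in range(1, 12)])
```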

To allocate data resources among the devices that have correctly completed the RACH procedure, we adopt a basic random scheduling strategy, whereby an ordered list of all devices that have successfully completed the RACH procedure but have not received a data channel is compiled using a random order. In each TTI, devices in the list are considered in order for access to the data channel until the data resources are insufficient to serve the next device in the list. The remaining RRC connections between the unscheduled IoT devices and the eNB are preserved for at most a fixed maximum number of subsequent TTIs (see Table I), and attempts are made to schedule the device’s data during these TTIs [14]. The condition that the data resources are sufficient in TTI $t$ is expressed as

$$\sum_{i=0}^{2} V^t_{\mathrm{sch},i}\, r^t_{\mathrm{DATA},i} \le R^t_{\mathrm{DATA}}, \tag{4}$$

where $R^t_{\mathrm{DATA}}$ is the number of REs available for data transmission in TTI $t$; $V^t_{\mathrm{sch},i}$ is the number of scheduled devices; $r^t_{\mathrm{DATA},i} = n^t_{\mathrm{Repe},i}\, B_{\mathrm{DATA}}$ is the number of required REs for serving one IoT device within the $i$-th CE group; and $B_{\mathrm{DATA}}$ is the number of REs per repetition. Note that the number of repetitions $n^t_{\mathrm{Repe},i}$ is the same as for preamble transmission [1].
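The random scheduling step for a single TTI can be sketched as follows. This is an illustrative sketch with hypothetical names: it reshuffles the waiting list on every call, charges each device its repetition value times 32 REs (the per-repetition data cost in Table I), and stops at the first device that does not fit, as described above. Carrying unscheduled devices over to later TTIs is left to the caller.

```python
import random


def schedule(waiting, repetitions, data_res, res_per_repetition=32, seed=None):
    """Serve devices in random order until the next one does not fit.

    waiting:     list of (device_id, ce_group) pairs
    repetitions: repetition value per CE group, e.g. [1, 4, 8]
    data_res:    REs available for data in this TTI
    Returns (scheduled ids, unscheduled pairs, leftover REs).
    """
    order = list(waiting)
    random.Random(seed).shuffle(order)
    scheduled, remaining = [], data_res
    for device_id, group in order:
        cost = repetitions[group] * res_per_repetition
        if cost > remaining:
            break  # stop at the first device that cannot be served
        scheduled.append(device_id)
        remaining -= cost
    return scheduled, order[len(scheduled):], remaining


served, left, spare = schedule([(d, 0) for d in range(5)], [2, 8, 16], 200, seed=1)
print(len(served), len(left), spare)
```

Because the per-device cost scales with the repetition value, a configuration that helps preamble detection also reduces how many devices fit into the data region.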

III Conventional Solutions

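Due to the complexity of problem (1), most previous works simplify the optimization by considering the greedy formulation

$$\max_{\pi(A^t \mid O^t)} \; \mathbb{E}_{\pi}\!\left[ V^{t}_{\mathrm{su},i} \right] \tag{5}$$

for some group $i$, whereby only the performance in the current TTI is considered. Here, $A^t$ is the configuration, $O^t$ is the observation history, and $V^{t}_{\mathrm{su},i}$ is the number of group-$i$ devices that successfully send data in TTI $t$. Furthermore, the expectation in (5) is approximated based on an estimate of the load in TTI $t$ as discussed below; and the action space for $A^t$ is typically reduced to include only some parameters such as the number of preambles in each RACH period [4, 3].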

To elaborate, we now briefly describe a solution based on [4] that follows the outlined simplifying principles. We drop the group index $i$ to simplify the notation. At the beginning of each TTI, the scheme first estimates the number of IoT devices that will attempt RACH access in the $t$-th TTI, and then adjusts only the number of RACH periods and the number of preambles according to the estimated load. The estimate is given as


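where the term $2V_{\mathrm{cp}}^{t-1}$, with $V_{\mathrm{cp}}^{t-1}$ denoting the number of collided preambles, reflects the fact that at least $2V_{\mathrm{cp}}^{t-1}$ IoT devices collided in the last TTI; $\delta^{t}$ is the difference between the estimated numbers of RACH-attempting IoT devices in the $(t-1)$-th and the $t$-th TTIs [4]; and $\tilde{D}^{t-1}_{\mathrm{RACH}}$ is an estimate of the number of RACH-attempting IoT devices in the $(t-1)$-th TTI obtained via moment matching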
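To make the idea of moment matching concrete, the following sketch shows one textbook estimator based on the number of idle preambles. It is an assumption for illustration and not necessarily the estimator used in [4]: if each of $N$ devices picks one of $F$ preambles uniformly at random, the expected number of idle preambles is $F(1 - 1/F)^N$, and equating this to the observed idle count gives an estimate of $N$. The function names are hypothetical.

```python
import math


def expected_idle(devices, preambles):
    """Expected idle preambles when each device picks one uniformly at random."""
    return preambles * (1 - 1 / preambles) ** devices


def estimate_load(idle, preambles):
    """Moment-matching load estimate from the observed idle preamble count.

    Assumes preambles > 1. With no idle preamble the estimator saturates.
    """
    if idle <= 0:
        return math.inf
    return math.log(idle / preambles) / math.log(1 - 1 / preambles)


idle = expected_idle(devices=30, preambles=48)
print(round(idle, 2), round(estimate_load(idle, 48), 2))
```

The estimator loses resolution under heavy load, when almost no preamble is idle, which is one reason a lower bound based on the collided preambles is useful.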


Using the estimated load given in (6), the approach, which is referred to as Load Estimation based Uplink Resource Configuration (LE-URC), attempts to solve problem (5) by approximating the objective as


where the two quantities involved are the expected number of IoT devices requesting uplink resources in the $t$-th TTI, and an upper bound on the number of IoT devices that can be scheduled.

IV Cooperative Multi-Agent DNN-Q Approach

We now introduce an RL-based approach to tackle problem (1). A direct application of the DQN approach [15] or of its enhancement proposed in [16], whereby the policy over the full configuration is modelled by a single DNN, is not feasible due to the large size of the action space. In order to overcome this issue, we break up the action space by considering separately each of the nine action variables in the configuration $A^t$. Recall that we have three variables for each group $i$, namely the number of RACH periods, the number of preambles per RACH period, and the repetition value.
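The benefit of splitting the action space can be quantified with the value sets listed in Table I. The short calculation below (variable names are illustrative) compares the number of outputs a single joint DQN would need with the total number of outputs across nine per-variable agents.

```python
from math import prod

# Value sets from Table I.
PREAMBLES = [12, 24, 36, 48]
REPETITIONS = [1, 2, 4, 8, 16, 32]
RACH_PERIODS = [1, 2, 4]
GROUPS = 3

per_group = [PREAMBLES, REPETITIONS, RACH_PERIODS]

# One joint agent: every combination of all nine variables is an action.
joint_actions = prod(len(v) for v in per_group) ** GROUPS

# Nine agents: each one only ranks the values of its own variable.
factored_outputs = GROUPS * sum(len(v) for v in per_group)

print(joint_actions, factored_outputs)
```

A joint agent would have to rank 373,248 configurations, whereas the nine agents together have only 39 outputs.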

A separate DQN agent is introduced for each output variable in the configuration $A^t$. We define $a_k^t$ as the action selected by the $k$-th agent. Each agent $k$ is responsible for updating the value of action $a_k^t$ in state $S^t$, where the state variable $S^t$ only includes information about a fixed number of the most recent TTIs. All agents receive the same reward signal at the end of each TTI as per problem (1). The use of this common reward signal ensures that all DQN agents aim at cooperatively increasing the objective in (1). Note that the approach can be interpreted as applying a factorization of the overall value function akin to the approach proposed in [17] for multi-agent systems.

The DQN agents are trained in parallel. Each agent $k$ parameterizes the action-state value function by using a function $Q(S^t, a_k; \theta_k)$, where $\theta_k$ represents the weight matrix of a DNN with fully-connected layers. The input of the DNN is given by the variables in state $S^t$; the intermediate layers are Rectified Linear Units (ReLUs); and the output layer is composed of linear units. Each output neuron provides the value of one of the actions available to the agent, as in [15]. The weight matrix is updated online along each training episode by using double deep Q-learning (DDQN) [16]. Accordingly, learning takes place over multiple training episodes, with each episode lasting a fixed number of TTI periods. In each TTI, the parameters $\theta_k$ of the Q-function approximator are updated using Stochastic Gradient Descent at all agents as

$$\theta_k^{t+1} = \theta_k^{t} - \lambda_{\mathrm{RMS}}\, \nabla L(\theta_k^{t}), \tag{9}$$

where $\lambda_{\mathrm{RMS}}$ is the RMSProp learning rate [18] and $\nabla L(\theta_k^{t})$ is the gradient of the loss function $L(\theta_k^{t})$ used to train the Q-function approximator. This is given as


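$$L(\theta_k) = \mathbb{E}\!\left[ \Big( r^{j+1} + \gamma\, Q\big(S^{j+1}, \arg\max_{a} Q(S^{j+1}, a; \theta_k); \bar{\theta}_k\big) - Q(S^{j}, a_k^{j}; \theta_k) \Big)^{2} \right], \tag{10}$$

where $r^{j+1}$ is the common reward, $\gamma$ is the discount rate, and the expectation is taken with respect to randomly selected previous samples $(S^{j}, a_k^{j}, r^{j+1}, S^{j+1})$ for some $j \in \{t - M_r, \ldots, t\}$, with $M_r$ being the size of the replay memory [15]. When $t - M_r$ is negative, this is to be interpreted as including samples from the previous episode. Following DDQN [16], in (10), $\bar{\theta}_k$ is a so-called target parameter that is used to estimate the future value of the Q-function in the update rule. This parameter is periodically copied from the current value $\theta_k$ and kept fixed for a number of episodes.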
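The role of the target parameter in the double Q-learning update of [16] can be illustrated without any deep learning library. In the sketch below the Q-values of the next state are given as plain lists, the function names are hypothetical, and the discount rate of 0.5 is the value listed in Table I. The online network selects the next action, while the target network evaluates it.

```python
def ddqn_target(reward, next_q_online, next_q_target, discount=0.5):
    """Double DQN target: select with the online network, evaluate with the target."""
    best = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + discount * next_q_target[best]


def squared_td_error(q_taken, target):
    """Per-sample loss whose minibatch average is minimized by gradient descent."""
    return (target - q_taken) ** 2


# Illustrative Q-values for a three-action agent.
online = [0.2, 0.9, 0.1]   # current weights
frozen = [0.5, 0.3, 0.8]   # periodically copied target weights
y = ddqn_target(reward=1.0, next_q_online=online, next_q_target=frozen)
print(y, squared_td_error(q_taken=0.7, target=y))
```

A plain DQN target would instead use the maximum of the target network's values (0.8 here, giving 1.4), which tends to overestimate; decoupling selection from evaluation is the point of DDQN.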

V Simulation Results and Discussion

In this section, we evaluate the performance of the proposed CMA-DQN and compare it with the conventional LE-URC described in Sec. 3 via numerical experiments. The eNB is assumed to be located at the center of a circular area with a 12 km radius, and we adopt the standard network parameters listed in Table I following [1, 2, 19, 13, 9]. Accordingly, one epoch consists of 937 TTIs (i.e., 10 minutes). Throughout the epoch, each device has a bursty traffic profile, where the packet generation probability is given by the time-limited Beta profile defined in [8, Ch. 6.1], which has a peak around the 400th TTI. The resulting average number of generated packets is shown as a dashed line in Fig. 2. The DQNs used by CMA-DQN have three hidden layers, each with 128 ReLU units, and the other hyperparameters are listed in Table I. All results are obtained by averaging over 1000 training episodes.
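The epoch length and the shape of the bursty traffic profile can be checked numerically. The sketch below is illustrative: the function name is hypothetical, and the shape parameters (3, 4) are an assumption, taken from the values commonly associated with the bursty traffic model of [8]; they are not stated in the text above.

```python
import math


def beta_profile(t, duration, a=3, b=4):
    """Time-limited Beta density on [0, duration]; a and b are assumed values."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)  # Beta(a, b)
    return (t ** (a - 1) * (duration - t) ** (b - 1)
            / (duration ** (a + b - 1) * norm))


TTI_S = 0.64     # TTI duration from Table I, in seconds
EPOCH_S = 600    # 10-minute bursty traffic duration

tti_count = int(EPOCH_S / TTI_S)
peak_tti = max(range(tti_count + 1), key=lambda t: beta_profile(t, tti_count))
print(tti_count, peak_tti)
```

With these assumed parameters the activation probability peaks at about 40% of the epoch, which is roughly consistent with the peak location quoted above.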

TABLE I: Simulation Parameters and Q-learning hyperparameters

| Simulation Parameters | Setting | Simulation Parameters | Setting |
| Path-loss exponent | 4 | Noise power | -138 dBm |
| Received SNR threshold | 0 dB | Power control threshold | 120 dB |
| eNB broadcast power | 35 dBm | TTI | 640 ms |
| Bursty traffic duration | 10 mins | IoT devices | 30000 |
| Maximum transmit power | 23 dBm | Set of number of preambles | {12, 24, 36, 48} |
| Maximum resource requests | 5 | Set of repetition values | {1, 2, 4, 8, 16, 32} |
| Maximum RACH in one CE | 5 | Set of RACH periods | {1, 2, 4} |
| Maximum RACH attempts | 10 | RSRP thresholds | {0, -5} dB |
| REs required for one preamble | 4 | REs required for data (per repetition) | 32 |

| Q-learning Hyperparameters | Value | Q-learning Hyperparameters | Value |
| Exploration | [0.1, 1] | RMSProp learning rate | 0.0001 |
| Discount rate | 0.5 | Minibatch size | 32 |
| Replay memory | 10000 | Target Q-network update frequency | 1000 |

Fig. 2 compares the number of successfully served IoT devices during one epoch using CMA-DQN and LE-URC. The “LE-URC-[1,4,8]” and “LE-URC-[2,8,16]” curves represent the LE-URC approach with the repetition values of the three CE groups set to {1, 4, 8} and {2, 8, 16}, respectively. We observe that CMA-DQN slightly outperforms LE-URC in the light traffic regions at the beginning and end of the epoch, and that it substantially outperforms LE-URC in the period of heavy traffic in the middle of the epoch. This demonstrates the capability of CMA-DQN to better manage the scarce channel resources in the presence of heavy traffic. It is also observed that increasing the repetition value of each CE group with LE-URC improves the received SNR, and thus the RACH success rate, in the light traffic region, but it degrades the scheduling success rate due to limited channel resources in the heavy traffic region.

Fig. 2: The average number of successfully served IoT devices per TTI during one bursty traffic duration. The dashed line represents the average number of generated packets per TTI.

To gain more insight into the operation of CMA-DQN, Fig. 3 plots the average number of repetitions and the average number of Random Access Opportunities (RAOs), defined as the product of the number of RACH periods and the number of preambles per RACH period, for each CE group, as selected by CMA-DQN over the training episodes. As seen in Fig. 3(a)-(c), CMA-DQN increases the number of repetitions in the light traffic region in order to improve the SNR and reduce RACH failures, while decreasing it in the heavy traffic region so as to reduce scheduling failures. As illustrated in Fig. 3(d)-(f), this allows CMA-DQN to increase the number of RAOs in the heavy traffic regime, mitigating the impact of collisions on the throughput. In contrast, for CE groups 1 and 2, in the heavy traffic region, LE-URC decreases the number of RAOs in order to reduce resource scheduling failures, causing an overall lower throughput as seen in Fig. 2.

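Fig. 3: The allocated repetition value and the number of RAOs (the product of the number of RACH periods and the number of preambles per RACH period) for each CE group.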


  • [1] J. Schlienz and D. Raddino, “Narrowband internet of things whitepaper,” IEEE Microw. Mag., vol. 8, no. 1, pp. 76–82, Aug. 2016.
  • [2] Y.-P. E. Wang, X. Lin, A. Adhikary, A. Grovlen, Y. Sui, Y. Blankenship, J. Bergman, and H. S. Razaghi, “A primer on 3GPP narrowband internet of things (NB-IoT),” IEEE Commun. Mag., vol. 55, no. 3, pp. 117–123, Mar. 2017.
  • [3] D. T. Wiriaatmadja and K. W. Choi, “Hybrid random access and data transmission protocol for machine-to-machine communications in cellular networks,” IEEE Trans. Wireless Commun., vol. 14, no. 1, pp. 33–46, Jan. 2015.
  • [4] S. Duan, V. Shah-Mansouri, Z. Wang, and V. W. Wong, “D-ACB: Adaptive congestion control algorithm for bursty M2M traffic in LTE networks,” IEEE Trans. Veh. Technol., vol. 65, no. 12, pp. 9847–9861, Dec. 2016.
  • [5] M. Jihun and L. Yujin, “A reinforcement learning approach to access management in wireless cellular networks,” in Wireless Commun. Mobile Comput., May 2017, pp. 1–7.
  • [6] T.-O. Luis, P.-P. Diego, P. Vicent, and M.-B. Jorge, “Reinforcement learning-based ACB in LTE-A networks for handling massive M2M and H2H communications,” in IEEE Int. Commun. Conf. (ICC), May 2018, pp. 1–7.
  • [7] A. Azari, G. Miao, C. Stefanovic, and P. Popovski, “Latency-energy tradeoff based on channel scheduling and repetitions in nb-iot systems,” arXiv preprint arXiv:1807.05602, Jul. 2018.
  • [8] “Study on RAN improvements for machine-type communications,” 3GPP TR 37.868 V11.0.0, Sep. 2011.
  • [9] “Evolved universal terrestrial radio access (E-UTRA); Physical channels and modulation,” 3GPP TS 36.211 v.14.2.0, Apr. 2017.
  • [10] “Evolved universal terrestrial radio access (E-UTRA); Physical layer measurements,” 3GPP TS 36.214 v. 14.2.0, Apr. 2017.
  • [11] X. Lin, A. Adhikary, and Y.-P. E. Wang, “Random access preamble design and detection for 3GPP narrowband IoT systems,” IEEE Wireless Commun. Lett., vol. 5, no. 6, pp. 640–643, Jun. 2016.
  • [12] N. Jiang, Y. Deng, M. Condoluci, W. Guo, A. Nallanathan, and M. Dohler, “RACH preamble repetition in NB-IoT network,” IEEE Commun. Lett., vol. 22, no. 6, pp. 1244–1247, Jun. 2018.
  • [13] “Evolved universal terrestrial radio access (E-UTRA); Medium Access Control protocol specification,” 3GPP TS 36.321 v.14.2.1, May 2017.
  • [14] “Evolved universal terrestrial radio access (E-UTRA); Requirements for support of radio resource management,” 3GPP TS 36.133 v. 14.3.0, Apr. 2017.
  • [15] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, Feb. 2015.
  • [16] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Assoc. Adv. AI (AAAI), vol. 2, Feb. 2016, p. 5.
  • [17] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and T. Graepel, “Value-decomposition networks for cooperative multi-agent learning based on team reward,” in Proc. Int. Conf. Auton. Agents MultiAgent Syst. (AAMAS), Jul. 2018, pp. 2085–2087.
  • [18] T. Tieleman and G. Hinton, “Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Netw. Mach. Learn., vol. 4, no. 2, pp. 26–31, Oct. 2012.
  • [19] “Cellular system support for ultra-low complexity and low throughput Internet of Things (CIoT),” 3GPP TR 45.820 V13.1.0, Nov. 2015.