I Introduction
Given its ubiquitous coverage, the 5th Generation of cellular networks (5G) has great potential to support diverse wireless technologies. These wireless technologies are expected to be crucial in next-generation smart cities, smart homes, automated factories, automated health management systems, and many other applications, some of which cannot even be foreseen today [1]. The heterogeneous ecosystem of applications will result in different and often conflicting demands on the 5G radio and, as a result, the air interface must be capable of supporting both high and low data rates, mobility, (ultra) low latency, as well as many different types of Quality of Service (QoS). In order for this QoS diversity to be achieved, improved medium access control (MAC) protocols have to be developed. As the state-of-the-art MAC protocols for cellular networks have been designed and optimized to support primarily Human-to-Human (H2H) communication, these MAC protocols are not optimal for a massive number of Internet of Things (IoT) devices, due to the unique characteristics of IoT traffic.
In IoT networks, communication devices can operate autonomously with little or no human intervention. The amount of data generated by the IoT nodes is usually small, and the communication activity of these devices is heterogeneous [2]. The heterogeneous communication activity of the IoT nodes makes any attempt to pre-allocate network resources to each IoT node spectrally inefficient [3]. Currently, IoT devices gain access to the channel, and thereby transmit information, by performing grant-based random access (RA) or grant-free RA [4]. In grant-based RA, the nodes attempting to access the channel have to first obtain an access grant from an Access Point (AP) through a four-way handshake procedure [4]. This ensures that the user has exclusive rights to the channel if granted access, thus avoiding any potential collisions, at the expense of large latency and signalling overhead. In grant-free RA, the data is piggybacked on the first transmission itself along with the required control information, in order to reduce the access latency. However, both schemes suffer from massive packet collisions as the number of IoT nodes requiring access increases. Packet collisions increase latency and energy consumption, since collided packets need to be retransmitted, and require a heavy exchange of signalling messages. As a result, packet collisions in massive IoT networks can easily become a bottleneck.
One promising research direction for MAC-related problems in wireless communication is machine learning. Reinforcement Learning (RL) is one of many machine learning paradigms, where agents mimic the human learning process and learn optimal strategies by trial-and-error interactions with the environment [5]. RL has been implemented in the development of MAC schemes for cognitive radios [6], where the authors have developed an RL-based MAC scheme which allows each autonomous cognitive radio to distributively learn its own spectrum sensing policy. In [7], the authors add intelligence to sensor nodes to improve the performance of Slotted ALOHA. In addition, solving MAC problems with multi-agent Deep Reinforcement Learning (DRL) has been proposed in [8], [9], [10]. Specifically, in [8], the authors propose a DRL MAC protocol for wireless networks in which multiple agents learn when to access the channel. A DRL MAC protocol for wireless networks in which several different MAC protocols coexist has been studied in [9]. The authors of [10] have proposed a multi-agent DRL-based MAC scheme for wireless sensor networks with multiple frequency channels. Another distributed MAC scheme has been investigated in [11], where the authors embed learning mechanisms in the IoT nodes in order to control IoT traffic and consequently reduce its impact on any cellular network. In [12], [13], [14], the authors reduce the access congestion by adapting the parameters of the access class barring mechanism to different IoT traffic conditions via DRL. A different approach is proposed in [15], [16], where the authors investigate learning mechanisms to aid the MAC in IoT networks, via dynamic AP selection schemes, in order to avoid overloading a single AP. In spite of being highly promising, the schemes proposed in [7]-[11] do not necessarily account for the severe device constraints in terms of energy availability and computing power for running on-device optimization and inference [17]. On the other hand, the schemes in [12]-[14] still rely on RA as the primary access mechanism. In this paper, we propose a DRL-aided RA scheme which does not require on-device inference at the IoT nodes and is therefore applicable to devices with computational and energy constraints.
In particular, we consider an IoT network comprised of an AP and IoT nodes that sporadically become active and transmit information towards the AP. Practical applications that fit these assumptions include smart metering, temperature monitoring, air-quality monitoring, emergency reporting, etc. In the proposed scheme, the AP is assumed to have a limited number of time-frequency resource blocks that it can allocate to the IoT nodes that wish to send data, smaller than the number of IoT nodes in the cell. The main problem is how to allocate the time-frequency resource blocks to the IoT nodes in each time slot such that the average packet rate received at the AP is maximized. For this problem, we propose a DRL-aided RA scheme, where an intelligent DRL agent at the AP learns to predict the activity of the IoT nodes in each time slot and grants time-frequency resource blocks to the IoT nodes predicted as active. Next, the IoT nodes that are misclassified as non-active by the DRL agent, as well as unseen or newly arrived nodes in the cell, employ the standard RA scheme in order to obtain time-frequency resource blocks. In this paper, we rely on grant-based RA; however, the proposed hybrid scheme is also compatible with grant-free RA. To reduce the amount of live data which needs to be acquired from the IoT network for training the DRL agent, we propose to leverage expert knowledge from the available theoretical models in the literature. Our numerical results show that the proposed algorithm significantly increases the packet rate and implicitly decreases the energy consumption of the IoT nodes. In addition, as the intelligence is concentrated at the AP, the IoT nodes do not need significant computational power, or energy, for on-device inference, and thereby the proposed scheme can be deployed in cells with generic IoT nodes that have limited computational capabilities.
The promising results of the proposed scheme stem from the fact that the conventional RA scheme cannot exploit the determinism in the nodes' activity patterns, which usually exists in practice [18]. Our proposed DRL-aided RA scheme fills this gap. Specifically, the proposed DRL-aided RA scheme uses the DRL algorithm to learn the deterministic components of the nodes' activity patterns and to allocate time-frequency resources accordingly. Moreover, the proposed DRL-aided RA scheme uses the conventional RA scheme to cope with the random components of the nodes' activity patterns. In that sense, the proposed DRL-aided RA scheme operates in the range between the two limiting types of activity patterns: at one end of the range is the fully independent and identically distributed (i.i.d.) random activity pattern, and at the other end is the fully deterministic activity pattern.
The rest of the paper is organized as follows. Section II provides the network model. Section III presents the proposed DRL-aided RA algorithm. Section IV provides the numerical evaluation, and Section V concludes the paper.
II System Model and Problem Formulation
In the following, we provide the system model and formulate the underlying problem.
II-A System Model
We consider a network comprised of $N$ IoT nodes and an AP, as illustrated in Fig. 1. The locations of the IoT nodes are assumed to be fixed and not to change with time. The transmission time is divided into time slots of equal duration. At the beginning of each time slot, each IoT node sporadically becomes active in order to sense its environment, generates a data packet from the sensed data, and tries to transmit this data packet to the AP in the same time slot. In order for an IoT node to transmit a data packet to the AP in time slot $t$, a dedicated time-frequency block, referred to as a resource block (RB), needs to be allocated to the IoT node. Without loss of generality, we assume that all nodes transmit their packets with identical data rate, which is set to one. We assume that the AP has $M$ RBs available in total, where $M < N$ holds. As a result, in each time slot, the AP needs to perform intelligent resource allocation by allocating the available RBs to the active IoT nodes only. Otherwise, if the AP allocates an RB to a non-active node, that RB is wasted and the AP will not receive a packet on the corresponding RB.
II-B Problem Formulation
The AP receives a packet from the $n$-th IoT node in time slot $t$ if the following two events occur:

1) the $n$-th IoT node is active in time slot $t$,

2) the $n$-th IoT node has been allocated an RB in time slot $t$.

Otherwise, the AP will not receive a packet from the $n$-th IoT node in time slot $t$. To model this behaviour, let $a_n(t)$ and $b_n(t)$ be binary indicators defined as

$a_n(t) = 1$ if the $n$-th node is active in time slot $t$, and $a_n(t) = 0$ otherwise, (1)

$b_n(t) = 1$ if the $n$-th node is allocated an RB in time slot $t$, and $b_n(t) = 0$ otherwise. (2)

Using these binary indicators, we can obtain the average packet rate, denoted by $\bar{R}$, as

$\bar{R} = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{n=1}^{N} a_n(t)\, b_n(t).$ (3)

Our aim is to maximize the average packet rate by solving the following optimization problem

$\max_{\{b_n(t)\}} \bar{R} \quad \text{s.t.} \quad b_n(t) \in \{0,1\}, \quad \sum_{n=1}^{N} b_n(t) \leq M, \;\; \forall t,$ (4)
where the last constraint follows since the AP has $M$ RBs in total in each time slot. If the activity indicators $a_n(t)$ were known, the optimal solution of (4) would be to set $b_n(t) = 1$ whenever $a_n(t) = 1$, until all $M$ RBs are used up. However, $a_n(t)$ is unknown in practice and needs to be estimated. As a result, a practical algorithm that provides the optimal solution to the maximization problem in (4) is difficult to obtain in general. Hence, our aim in this paper is to propose a suboptimal but practical solution to the resource allocation problem in (4), which provides good performance.

III Proposed Solution
In the following, we discuss the existing solution used in practice and propose our solution.
III-A Existing Solution: The RA Scheme
The existing practical suboptimal solution to the resource allocation problem in (4) is the conventional RA scheme [4]. In the RA scheme, each of the IoT nodes has an identical set of $L$ orthonormal sequences. At the start of each time slot, each active node selects a single orthonormal sequence uniformly at random from its set, and uses that sequence to transmit information to the AP via a dedicated control channel. This transmission, if successful, informs the AP that the considered node is active in the current time slot and thereby needs to be allocated an RB. The AP is able to receive this information from a given active node correctly if no other active node has selected the same orthonormal sequence as the considered node. Otherwise, if two or more active nodes have selected the same orthonormal sequence, collisions occur and the AP is not able to receive the information from these nodes correctly. As a result, the AP will not know that these nodes are active in time slot $t$, and consequently the AP will not grant RBs to these nodes. In addition, the RA scheme also fails if the AP does not have enough RBs to grant to all active nodes which have successfully completed the RA procedure and informed the AP that they are active.
The average packet rate achieved by the RA scheme is given by

(5)

where $q(k)$ is the probability that collisions will not occur if $k$ nodes are active, and $N_a(t)$ denotes the number of active nodes at time slot $t$, found as

$N_a(t) = \sum_{n=1}^{N} a_n(t).$ (6)
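If each of $k$ active nodes picks one of $L$ sequences uniformly at random, the probability that no two nodes pick the same sequence has the classic birthday-problem form $\prod_{i=0}^{k-1}(1 - i/L)$. The short sketch below checks this closed form against a Monte Carlo simulation; the value of `L_seq` is an illustrative assumption, not a parameter taken from the paper.

```python
import random
from math import prod

def q_no_collision(k, L):
    # Probability that k active nodes, each picking one of L sequences
    # uniformly at random, all pick distinct sequences (no collision).
    return prod(1 - i / L for i in range(k))

def simulate(k, L, trials=20000, seed=0):
    # Monte Carlo estimate of the same probability.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        picks = [rng.randrange(L) for _ in range(k)]
        hits += len(set(picks)) == len(picks)
    return hits / trials

L_seq = 54  # assumed number of orthonormal sequences (illustrative)
for k in (2, 5, 10):
    print(k, round(q_no_collision(k, L_seq), 3), round(simulate(k, L_seq), 3))
```

As expected, the no-collision probability decays quickly with the number of simultaneously active nodes, which is exactly the congestion effect that motivates the proposed scheme.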
III-B Proposed Solution: The DRL-Aided RA Scheme
In the proposed DRL-aided RA scheme, the allocation of the RBs is conducted in two consecutive phases. In the first phase, $K$ RBs are allocated by the AP using the DRL algorithm presented below, where $K \leq M$. Next, in the second phase, the remaining $M - K$ RBs are allocated by the AP using the conventional RA scheme described in Sec. III-A. To this end, the AP is assumed to host a DRL agent that learns to predict which nodes will be active in a given time slot $t$. The learning and resource allocation process of the DRL agent, which is repeated in each time slot, is as follows:

We define a state in time slot $t$, denoted by $s_t$. The state $s_t$ is the set comprised of the nodes which have been active during the previous $H$ time slots, where $H$ denotes the length of the history that the agent "remembers".

Based on the state $s_t$ in time slot $t$, the DRL agent produces an output set, denoted by $\hat{A}_t$ and referred to as the action, comprised of the nodes that the DRL agent predicts to be active in time slot $t$.

Based on the set of nodes predicted as active, the following RB allocations occur:

If the number of predicted nodes is at most $K$, each predicted node is allocated an RB.

If the number of predicted nodes exceeds $K$, then $K$ nodes are selected uniformly at random from the predicted set, and each of these nodes is allocated an RB.


Based on the allocations of RBs to the nodes predicted as active by the DRL agent, the following occurs in time slot $t$:

If a node has been correctly predicted as active, and thereby granted an RB, the node transmits its data packet to the AP on the corresponding RB, and the AP receives this data packet correctly. Consequently, the DRL agent will classify this node as correctly predicted.

If a node has been mispredicted as active by the DRL agent, and thereby has been granted an RB, this node stays silent since it is inactive. As a result, the AP will not receive a packet on the corresponding RB. Consequently, the DRL agent will classify this node as erroneously predicted.

If a node has been correctly predicted as inactive, and thereby has not been granted an RB, the node stays silent.

If a node has been mispredicted as inactive, and thereby has not been granted an RB, the mispredicted active node attempts to obtain an RB using the conventional RA scheme, as described in Sec. III-A. Thereby, the node selects a single orthonormal sequence uniformly at random from its set, and uses that sequence to inform the AP, via the control channel, that it is active and has been mispredicted in the current time slot. The AP listens to the control channel and detects the nodes which have been mispredicted as inactive. The AP can detect only those nodes which have selected a unique orthonormal sequence; the other nodes cannot be detected due to collisions, as explained in Sec. III-A.


Next, based on the observations, the AP constructs the set of observed active nodes by including the following nodes:

the nodes which the DRL agent predicted as active and from which the AP received a packet on the corresponding allocated RB;

the nodes which the AP detected as active on the control channel via the RA scheme.


Based on the observation, the AP computes a reward in time slot $t$, denoted by $r_t$, which takes its maximum value if all nodes are correctly predicted in time slot $t$ by the DRL agent, and a lower value otherwise.

The system transitions to the next time slot and the whole process described above is repeated.
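The per-slot operation of the hybrid scheme described in the steps above can be sketched as follows. The function and variable names, the uniform tie-breaking rules, and all parameter values are illustrative assumptions rather than the paper's exact procedure.

```python
import random

def run_slot(active, predicted, K, M, L, rng):
    """One time slot of the DRL-aided RA scheme (sketch).
    active, predicted: sets of node ids. K RBs are granted to nodes the
    agent predicted as active; the remaining M - K RBs are contended for
    via random access with L orthonormal sequences. Returns the set of
    nodes whose packets the AP receives in this slot."""
    # Phase 1: grant up to K RBs to the predicted nodes.
    granted = list(predicted)
    if len(granted) > K:
        granted = rng.sample(granted, K)
    # A grant to an inactive (mispredicted) node wastes that RB.
    received = {n for n in granted if n in active}
    # Phase 2: active nodes mispredicted as inactive attempt conventional RA.
    contenders = [n for n in active if n not in granted]
    picks = {n: rng.randrange(L) for n in contenders}
    detected = [n for n in contenders
                if sum(p == picks[n] for p in picks.values()) == 1]
    received |= set(rng.sample(detected, min(len(detected), M - K)))
    return received

rng = random.Random(1)
got = run_slot(active={0, 1, 2}, predicted={0, 1, 5}, K=2, M=4, L=16, rng=rng)
print(sorted(got))
```

When the prediction is perfect and capacity suffices, every active node is served in Phase 1 and no node needs to contend; mispredictions are absorbed by the RA fallback at the cost of possible collisions.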
III-B1 Implementation of the DRL Agent
The DRL agent is implemented as a deep neural network located at the AP. In time slot $t$, this deep neural network takes the state $s_t$ as input and produces as outputs the values $Q(s_t, A_1; \boldsymbol{\theta}), Q(s_t, A_2; \boldsymbol{\theta}), \ldots$, where $Q(s_t, A_i; \boldsymbol{\theta})$ is the estimated average reward the agent will receive in the future if it predicts that the set of active nodes in time slot $t$ is $A_i$, and $\boldsymbol{\theta}$ is a vector comprised of the weights of the neural network. Next, the agent chooses the set $\hat{A}_t$ which corresponds to the largest output value of the neural network, i.e.,

$\hat{A}_t = \arg\max_{A_i} Q(s_t, A_i; \boldsymbol{\theta}).$ (7)
The function obtained at the output of the neural network is an estimate of the function $Q(s_t, A)$, which is known as the discounted average reward. The discounted average reward function is defined as

$Q(s_t, A) = \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t,\, \hat{A}_t = A\right],$ (8)

where $0 \leq \gamma < 1$ is referred to as the discount factor. In order for the neural network to produce output functions that are estimates of $Q(s_t, A)$, the Bellman equation is used, and thereby the weights $\boldsymbol{\theta}$ of the neural network are optimized such that the following mean squared error is minimized

$\mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}\!\left[\left(r_t + \gamma \max_{A} Q(s_{t+1}, A; \boldsymbol{\theta}) - Q(s_t, \hat{A}_t; \boldsymbol{\theta})\right)^{2}\right].$ (9)

The above minimization of the mean squared error can be implemented iteratively via stochastic gradient descent (or a variant), where in each iteration the weights $\boldsymbol{\theta}$ of the neural network are updated according to

$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \left(r_t + \gamma \max_{A} Q(s_{t+1}, A; \boldsymbol{\theta}) - Q(s_t, \hat{A}_t; \boldsymbol{\theta})\right) \nabla_{\boldsymbol{\theta}} Q(s_t, \hat{A}_t; \boldsymbol{\theta}),$ (10)

where $\alpha$ is the learning rate.
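In a tabular simplification (the paper uses a deep network, but the update rule is the same temporal-difference step as in (10)), a single weight update can be sketched as follows; the state and action names and all numeric values are purely illustrative.

```python
def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.05):
    # One temporal-difference step: move Q(s, a) toward the Bellman
    # target r + gamma * max_a' Q(s', a'), as in (10).
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]

# Illustrative state/action table (not from the paper).
Q = {"s0": {"a0": 0.0, "a1": 0.0}, "s1": {"a0": 1.0, "a1": 0.5}}
print(td_update(Q, "s0", "a1", r=1.0, s_next="s1"))
```

With the table above, the target is $1.0 + 0.05 \cdot 1.0 = 1.05$, so $Q(s_0, a_1)$ moves from $0$ toward the target by a fraction $\alpha = 0.1$.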
III-B2 Training the DRL Agent Using Expert Knowledge
In order for the training process of the agent to be successful, the agent needs to obtain a sufficient number of "live" training data samples from the interaction between the AP and the IoT nodes, where the live data sample at time slot $t$ is obtained as per Subsection III-B. In practice, the acquisition of a sufficient number of live samples can be impractical and ultimately prohibitive, due to the excessively long amount of time required to acquire the data. In these cases, transfer learning can be leveraged in order to accelerate the training process. Transfer learning is a recent trend in the ML community where available prior knowledge about the considered problem, stemming from theoretical models, is embedded in the neural networks [20], [21]. Transfer learning dramatically reduces the number of live data samples that are needed for the training process to be successful. The IoT networks research community has provided many theoretical models for the activity of the IoT nodes, such as those in [2], [22]. We choose to leverage the model in [22], where the authors use a Coupled Markov Modulated Poisson Process (CMMPP) to model the activity of the nodes in an IoT cell theoretically. The CMMPP model captures both regular and alarm reporting, as well as the correlated activity behaviour among the nodes. Hence, we use the CMMPP model in [22] to train the DRL agent at the AP. To this end, we first synthesize artificial activity patterns of the IoT nodes according to the CMMPP model, with which we train the DRL agent. Thereby, during the training process, in each time slot, the DRL agent learns to predict the active nodes from the artificial activity patterns, as per Subsection III-B. Once the prior knowledge has been transferred, i.e., the DRL agent has been trained using the artificial activity patterns, the DRL-RA scheme starts using the actual live samples from the IoT nodes.
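The pretrain-then-fine-tune idea can be illustrated with a deliberately simple stand-in: a per-node activity estimator is first fitted on abundant synthetic traces drawn from a two-state Markov model (a crude stand-in for the CMMPP model of [22]), then refined with a small number of live samples. The actual scheme trains a deep Q-network; the counting estimator below is only a sketch, and all parameter values are assumptions.

```python
import random

def markov_trace(p01, p10, T, rng):
    # Two-state (inactive/active) Markov activity trace: switch on with
    # probability p01, stay on with probability 1 - p10.
    x, out = 0, []
    for _ in range(T):
        x = (rng.random() < p01) if x == 0 else (rng.random() >= p10)
        out.append(int(x))
    return out

def fit_activity_prob(trace, prior_count=0, prior_active=0):
    # Estimate P(active), optionally warm-started with pretraining counts.
    return (sum(trace) + prior_active) / (len(trace) + prior_count)

rng = random.Random(0)
synthetic = markov_trace(0.2, 0.8, T=5000, rng=rng)   # cheap artificial data
live = markov_trace(0.2, 0.8, T=50, rng=rng)          # scarce live data
p_pretrained = fit_activity_prob(
    live, prior_count=len(synthetic), prior_active=sum(synthetic))
p_scratch = fit_activity_prob(live)
print(round(p_pretrained, 3), round(p_scratch, 3))
```

The warm-started estimate is anchored near the true stationary activity probability (here $0.2$) even though only 50 live samples are available, mirroring the role the synthetic CMMPP traces play for the DRL agent.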
III-B3 Average Packet Rate
In the following, we derive the average packet rate of the proposed DRL-aided RA scheme. To this end, let $p_A(t)$ and $p_I(t)$ denote the probability that a node is predicted as active and inactive at time slot $t$, respectively. Let $p_{CA}(t)$ and $p_{MI}(t)$ denote the probability that a node is correctly predicted as active and mispredicted as inactive at time slot $t$, respectively. Finally, let $p_F(t)$ denote the probability that a node is correctly predicted as active at time slot $t$, but the AP does not have any RBs left to allocate to the node. For the DRL-aided RA scheme, in each time slot, nodes that are misclassified as inactive, and nodes that are correctly classified as active but to which the DRL agent does not have any RBs left to allocate, will attempt the RA procedure in order to obtain one of the remaining RBs reserved for the RA phase. The rate of the proposed algorithm is thus given by
(11) 
To find the rate in (11), we need to calculate the number of nodes which have not been granted an RB in spite of being correctly classified as active. To do so, let $\binom{n}{k}$ denote the binomial coefficient, defined as
$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$ (12)
Depending on whether the numbers of misclassified and correctly classified nodes exceed the number of available RBs, we can distinguish four cases. In the first case, the following can occur:

of the misclassified nodes are granted all RBs and none of the correctly classified nodes get an RB, which occurs with probability
(13) 
of the misclassified nodes are granted RBs and one of the correctly classified nodes gets an RB, which occurs with probability
(14) 
of the misclassified nodes are granted RBs and two of the correctly classified nodes get an RB, which occurs with probability
(15) ⋮

None of the misclassified nodes are granted RBs and of the correctly classified nodes get an RB, which occurs with probability
(16)
Thereby, when and , is given by
(17) 
By extending this analysis to the other three cases, we obtain the desired probability as

(18)

where the binomial coefficient is given by (12). In (18), the constants can be found as
(19) 
(20) 
(21) 
and
(22) 
IV Numerical Results
In this section, we compare the performance of the proposed DRL-aided RA scheme with the conventional RA scheme in [4]. To this end, we first present the data sets used in the simulations in Section IV-A; the hyperparameters of the proposed algorithm are given in Section IV-B, and the numerical results are finally given in Section IV-C.
IV-A Data Sets
IV-A1 Synthetic Activity Patterns
To demonstrate the effectiveness of the proposed scheme on different traffic types, we first generate synthetic data sets. Specifically, node $n$ is assumed to be active in time slot $t$ with a probability given by

(23)

where $\sigma$ is a constant which controls the determinism of the activity patterns. Thereby, lower values of $\sigma$ result in a more periodic activity pattern, and as $\sigma$ increases, the activity pattern becomes more random.
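Since the exact form of (23) is not reproduced here, the sketch below uses one plausible instantiation of a tunable-determinism pattern: each node wakes up periodically, with Gaussian jitter of standard deviation `sigma` added to every wake-up time, so that `sigma -> 0` gives a purely periodic pattern and large `sigma` approaches randomness. The functional form and all parameter values are illustrative assumptions, not the paper's (23).

```python
import random

def activity_pattern(period, sigma, T, rng):
    # Binary activity trace of one node over T slots: a wake-up every
    # `period` slots, perturbed by Gaussian jitter of std dev `sigma`.
    active = [0] * T
    t = rng.randrange(period)  # random phase
    while t < T:
        slot = int(round(t + rng.gauss(0.0, sigma)))
        if 0 <= slot < T:
            active[slot] = 1
        t += period
    return active

rng = random.Random(42)
print(sum(activity_pattern(period=10, sigma=0.5, T=1000, rng=rng)))
```

With a small `sigma`, the wake-up slots are almost perfectly predictable from the history, which is the regime in which the DRL agent gains the most over plain RA.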
IV-A2 Real-World Activity Patterns
To demonstrate the effectiveness of the proposed scheme even further, real-world activity patterns are drawn from the publicly available data sets in [23], [24], [25], and [26]. We assume that all nodes operate during the same time period and in the same IoT cell. The data sets in [23]-[26] are comprised of nodes which have different reporting intervals [2]. In particular, the time elapsed between two consecutive data arrivals ranges from one second for some nodes up to one hour for others.
IV-B Neural Network Hyperparameters
To speed up the training of the DRL agent, we split the neural network into an ensemble of neural networks, such that only subsets of the nodes are included in each network in the ensemble. All networks in the ensemble are trained identically, as described previously. Each neural network in the ensemble is a three-layer, fully connected neural network. The activation functions for the neurons are ReLU functions [27], given by

$\mathrm{ReLU}(x) = \max(0, x).$ (24)

The discount factor $\gamma$ in (9) is set to $0.05$. The exploration-exploitation trade-off [27] is controlled via the $\epsilon$-greedy algorithm, where $\epsilon$ decreases from $1$ to $0.01$ as

(25)

Thereby, the agent chooses the action with the highest estimated value with probability $1 - \epsilon$, and randomly chooses an action with probability $\epsilon$. At the start of the training, when $\epsilon$ is high, the agent explores the action space by randomly choosing the action. As $\epsilon$ decreases, the agent begins to exploit the accumulated knowledge by choosing the action with the highest estimated value. The parameters of the proposed algorithm are summarized in Table I.
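The epsilon-greedy selection with a decaying exploration rate can be sketched as follows; the linear decay schedule is an assumption standing in for the paper's schedule (25), and the Q-values are illustrative.

```python
import random

def epsilon_schedule(step, total_steps, eps_start=1.0, eps_end=0.01):
    # Linear decay from eps_start to eps_end (an assumed schedule;
    # the paper's (25) may use a different decay law).
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, eps, rng):
    # With probability eps explore a random action; otherwise exploit
    # the action with the highest estimated value.
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
q = [0.1, 0.7, 0.3]  # illustrative value estimates for three actions
actions = [epsilon_greedy(q, epsilon_schedule(t, 1000), rng)
           for t in range(1000)]
print(actions.count(1))  # the greedy action dominates as epsilon decays
```

Early steps are dominated by exploration, while late steps almost always pick the greedy action, matching the behaviour described above.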
TABLE I: Parameters of the proposed algorithm

Parameter                 Value
No. of hidden layers      3
Discount factor           0.05
Learning rate             0.001
Exploration rate ε        1 to 0.01
IV-C Performance Evaluation
IV-C1 Synthetic Data
In Fig. 2, we present the average packet rate achieved with the proposed DRL-aided RA scheme on the synthetic activity pattern generated by (23) and compare it with the packet rate achieved with the conventional RA scheme for different values of . In this example, the number of nodes in the cell is set to and the number of available RBs at the AP is set to . The number of RBs allocated by the AP via the DRL agent decreases from for , to for . To determine , we only need to know the probability that a node is correctly classified as active and the probability that a node is misclassified as inactive (see (11)-(22)), which are obtained from the data samples used for training. In particular, we use these samples to calculate the rate using (11) for all values of , and we choose the value which results in the highest rate. In the case of transfer learning, we use only the samples from the real-world activity pattern (not the data used for pretraining). The number of available orthonormal sequences is set to . As Fig. 2 illustrates, the average packet rate of the conventional RA scheme is not sensitive to . On the other hand, the average packet rate of the proposed DRL-aided RA scheme is a decreasing function of . This is due to the increasing amount of randomness in the activity patterns. In particular, when the constant is low, the activity pattern is almost periodic, so the DRL agent is able to learn it and correctly allocate the available RBs, thereby reducing the need for the nodes to attempt RA. Conversely, when the constant is high, the activity pattern is highly random and the agent is not able to learn it completely, so the nodes attempt RA to obtain RBs. This example illustrates that the average packet rate of the proposed DRL-aided RA scheme is lower bounded by the average packet rate of the RA scheme. Thereby, the worst possible performance of the proposed DRL-aided RA scheme, obtained for highly random activity patterns, is identical to the performance of the RA scheme.
In Fig. 3, we illustrate the average packet rate achieved with the proposed DRL-aided RA scheme and compare it with the average packet rate of the conventional RA scheme as a function of the number of nodes in the cell, for two different values of . The number of available RBs at the AP is set to . The number of RBs allocated by the AP via the DRL agent is set to and for and , respectively. The number of available orthonormal sequences is set to . As can be seen from Fig. 3, the packet rate of the proposed DRL-aided RA scheme is significantly higher than the rate of the RA scheme. For example, the proposed scheme can achieve a packet rate of when nodes, whilst the conventional RA scheme can achieve the same packet rate with nodes when . Similarly, when the activity patterns are more random, i.e., when , the proposed scheme can achieve a packet rate of for nodes, whilst the conventional RA scheme achieves the same rate with nodes.
IV-C2 Real-World Activity Patterns
In Fig. 4, we illustrate the instantaneous packet rate in each time slot during a period of 1 hour (3600 s) for the real-world activity pattern. In total, our IoT cell is comprised of nodes, which report data arrivals during one hour. The number of available RBs at the AP is set to . The number of RBs allocated by the AP via the DRL agent is set to . The number of available orthonormal sequences is set to . Since the minimum duration between two data arrivals in these data sets is s, we assume that the duration of a time slot is s. As can be seen from Fig. 4, the proposed DRL-aided RA scheme achieves a packet rate that is significantly higher than that of the conventional RA scheme in each time slot. This is a consequence of the fact that the agent is able to extract the determinism in the activity pattern, which exists in practice, and correctly predict some of the active nodes. As a result, the number of nodes that attempt the RA procedure is much lower compared to the conventional RA scheme. In the spirit of reproducible science, the code used for generating this figure is made available at [28].
To illustrate the benefits of transfer learning, we present Fig. 5, where the percentage of the maximum possible reward is illustrated as a function of the percentage of sufficient live samples. The sufficient number of live samples is defined as the number of live samples needed for the DRL agent to obtain the maximum possible reward, and thereby achieve the maximum possible inference accuracy. The maximum reward is defined as the reward obtained by using of live data samples. Fig. 5 shows that the maximum reward can be obtained by using of live data samples and of artificial samples. Note that using an insufficient number of live data samples, and no artificial samples, leads to a reward that is significantly lower than the maximum possible reward, as the agent does not have enough data for the training process. In addition, the obtained reward is significantly lower if only artificial samples, without any live samples, are used, which is a consequence of the mismatch between the artificial model and the actual activity in the IoT cell. Thereby, optimal performance can be achieved by using of live samples and of artificial samples from the theoretical model. This in turn significantly decreases the time required for the agent to be trained, i.e., by up to in our case.
V Conclusion
In this paper, we proposed a DRL-aided RA scheme for a network comprised of IoT nodes and an AP. In particular, an intelligent DRL agent placed at the AP learns to predict the activity of the IoT nodes in each time slot and grants time-frequency resource blocks to the IoT nodes predicted as active. The standard RA scheme is used as a backup access mechanism for potentially misclassified, unseen, or new nodes in the cell. In addition, we leverage expert knowledge in order to ensure faster training of the DRL agent. Using publicly available data sets, we show significant improvements in terms of packet rate when the proposed DRL-aided RA scheme is implemented, compared to the conventional RA scheme.
References
 [1] R. Ratasuk, A. Prasad, Z. Li, A. Ghosh, and M. A. Uusitalo, “Recent advancements in M2M communications in 4G networks and evolution towards 5G,” in 2015 18th International Conference on Intelligence in Next Generation Networks, Feb. 2015.
 [2] V. W. Wong, R. Schober, D. W. K. Ng, and L.-C. Wang, Key Technologies for 5G Wireless Systems. Cambridge University Press, 2017.
 [3] A. Laya, L. Alonso, and J. Alonso-Zarate, “Is the random access channel of LTE and LTE-A suitable for M2M communications? A survey of alternatives,” IEEE Communications Surveys and Tutorials, vol. 16, no. 1, pp. 4–16, 2014.
 [4] 3GPP, “Medium Access Control (MAC) protocol specification,” 3GPP TS 38.321 V0.0.3, May 2017.
 [5] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
 [6] M. Bkassiny, S. K. Jayaweera, and K. A. Avery, “Distributed reinforcement learning based MAC protocols for autonomous cognitive secondary users,” in Wireless and Optical Communications Conference (WOCC), 2011 20th Annual. IEEE, 2011, pp. 1–6.
 [7] Y. Chu, P. D. Mitchell, and D. Grace, “ALOHA and Q-learning based medium access control for wireless sensor networks,” in Wireless Communication Systems (ISWCS), 2012 International Symposium on. IEEE, 2012, pp. 511–515.
 [8] O. Naparstek and K. Cohen, “Deep multi-user reinforcement learning for dynamic spectrum access in multichannel wireless networks,” arXiv preprint arXiv:1704.02613, 2017.
 [9] Y. Yu, T. Wang, and S. C. Liew, “Deep-reinforcement learning multiple access for heterogeneous wireless networks,” arXiv preprint arXiv:1712.00162, 2017.
 [10] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access,” in International Conference on Computing, Networking and Communications (ICNC), 2017.
 [11] L. M. Bello, P. Mitchell, and D. Grace, “Application of Q-learning for RACH access to support M2M traffic over a cellular network,” in European Wireless 2014; 20th European Wireless Conference. VDE, 2014, pp. 1–6.
 [12] J. Moon and Y. Lim, “Access control of MTC devices using reinforcement learning approach,” in 2017 International Conference on Information Networking (ICOIN). IEEE, 2017, pp. 641–643.
 [13] ——, “A reinforcement learning approach to access management in wireless cellular networks,” Wireless Communications and Mobile Computing, vol. 2017, 2017.
 [14] L. Tello-Oquendo, D. Pacheco-Paramo, V. Pla, and J. Martinez-Bauset, “Reinforcement learning-based ACB in LTE-A networks for handling massive M2M and H2H communications,” in 2018 IEEE International Conference on Communications (ICC). IEEE, 2018, pp. 1–7.
 [15] Y.-J. Liu, S.-M. Cheng, and Y.-L. Hsueh, “eNB selection for machine type communications using reinforcement learning based Markov decision process,” IEEE Transactions on Vehicular Technology, vol. 66, no. 12, pp. 11330–11338, 2017.
 [16] A. Mohammed, A. S. Khwaja, A. Anpalagan, and I. Woungang, “Base station selection in M2M communication using Q-learning algorithm in LTE-A networks,” in 2015 IEEE 29th International Conference on Advanced Information Networking and Applications. IEEE, 2015, pp. 17–22.
 [17] T. Park, N. Abuzainab, and W. Saad, “Learning how to communicate in the internet of things: Finite resources and heterogeneity,” IEEE Access, vol. 4, pp. 7063–7073, 2016.
 [18] L. Ferdouse, A. Anpalagan, and S. Misra, “Congestion and overload control techniques in massive M2M systems: A survey,” Transactions on Emerging Telecommunications Technologies, vol. 28, no. 2, p. e2936, 2017.
 [19] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [20] T. Inoue, S. Choudhury, G. De Magistris, and S. Dasgupta, “Transfer learning from synthetic to real images using variational autoencoders for precise position detection,” in 2018 25th IEEE International Conference on Image Processing (ICIP), Oct. 2018, pp. 2725–2729.
 [21] A. N. C. Kim, E. Variani and M. Bacchiani, “Efficient implementation of the room simulator for training deep neural network acoustic models,” 2019.
 [22] G. C. Madueño, Č. Stefanović, and P. Popovski, “Reliable and efficient access for alarminitiated and regular m2m in ieee 802.11 ah systems,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 673–682, 2016.
 [23] D. Murray and L. Stankovic, “REFIT: Electrical load measurements,” available on https://pureportal.strath.ac.uk/en/datasets/refitelectricalloadmeasurementscleaned, 2016.
 [24] S. De Vito, E. Massera, M. Piga, L. Martinotto, and G. Di Francia, “On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario,” Sensors and Actuators B: Chemical, vol. 129, no. 2, pp. 750–757, 2008.
 [25] N. Batra, O. Parson, M. Berges, A. Singh, and A. Rogers, “A comparison of non-intrusive load monitoring methods for commercial and residential buildings,” available on arXiv:1408.6595, 2014.
 [26] “Individual household electric power consumption data set.”
 [27] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning. MIT press Cambridge, 2016, vol. 1.
 [28] “Ra x drl,” https://github.com/ikoloska.