Resource Allocation in Uplink NOMA-IoT Networks: A Reinforcement-Learning Approach

07/16/2020, by Waleed Ahsan et al., Queen Mary University of London

Non-orthogonal multiple access (NOMA) exploits the potential of the power domain to enhance connectivity for the Internet of Things (IoT). Due to time-varying communication channels, dynamic user clustering is a promising method to increase the throughput of NOMA-IoT networks. This paper develops an intelligent resource allocation scheme for uplink NOMA-IoT communications. To maximise the average sum rate, this work designs an efficient optimization approach based on two reinforcement learning algorithms, namely deep reinforcement learning (DRL) and SARSA-learning. For light traffic, SARSA-learning is used to explore the safest resource allocation policy at low cost. For heavy traffic, DRL is used to handle the large number of variables introduced by the traffic. With the aid of the considered approach, this work addresses two main problems of fair resource allocation in NOMA techniques: 1) allocating users dynamically and 2) balancing resource blocks and network traffic. We analytically demonstrate that the rate of convergence is inversely proportional to the network size. Numerical results show that: 1) compared with the optimal benchmark scheme, the proposed DRL and SARSA-learning algorithms achieve high accuracy with low complexity and 2) NOMA-enabled IoT networks outperform conventional orthogonal multiple access based IoT networks in terms of system throughput.


I Introduction

The Internet of Things (IoT) enables millions of devices to communicate simultaneously. It is predicted that the number of IoT devices will rapidly increase in the next decades [zhai2019delay]. Owing to a large number of time-varying communication channels, dynamic network access with massive connectivity becomes a key requirement for future IoT networks. Recently, non-orthogonal multiple access (NOMA) has evolved as a promising approach to solve this problem [islam2017power], [sharma2019towards]. The key benefit of NOMA is that it exploits the power domain to enable more connectivity than traditional orthogonal multiple access (OMA). More specifically, NOMA supports multiple users in the same time/frequency resource block (RB) by employing superposition coding at transmitters and successive interference cancellation (SIC) at receivers [wan2018non]. Various model-based schemes have been proposed to improve different metrics of NOMA-IoT networks, such as coverage performance, energy efficiency, and system throughput (sum rate). Regarding the importance of the sum rate, recent work on wireless networks based on state-of-the-art reflective intelligent surfaces (RIS) also adopted a sum-rate maximization objective function [guo2020weighted]. The sum rate captures the average performance of a wireless network at the level of each user, and it is therefore widely used as a key performance indicator by the research community [zeng2020sum], [tse2005fundamentals]. This underlines the significance of sum-rate maximization based objective functions. Regarding the system design, the uncertainty and dynamics of wireless communication environments are difficult to capture with an accurate model. These dynamics involve spectral availability, channel access methods (e.g., OMA, NOMA, hybrid systems), and dynamic traffic arrivals. In practical NOMA systems, where a resource block is shared among more than one user, the process becomes even more dynamic, as users join and leave the network on both short-term and long-term bases. Numerous model-based techniques target the dynamic behaviour of wireless networks but fail to provide long-term performance guarantees [ding2017survey], [shao2018dynamic], [ali2016dynamic], [miuccio2020joint], [mostafa2019connection]

. Moreover, due to the absence of learning abilities, the computational complexity of traditional schemes becomes extremely high when long-term network stability must be provided. This is because, by default, traditional approaches cannot extract knowledge from a given problem (e.g., from given distributions) online. Fortunately, the online learning properties of recently developed machine learning (ML) methods are well suited to handling such dynamic problems [8519960].

I-A Related Works and Motivations

I-A1 Studies on NOMA-IoT Networks

Due to the aforementioned benefits, academia has produced numerous studies on the optimization of resource allocation in NOMA-enabled IoT networks. For single-cell scenarios, the authors in [shao2018dynamic] proposed a two-stage NOMA-based model to optimize the computation offloading mechanism for IoT networks [hussain2019machine]. In the first stage, a large number of IoT devices are clustered into several NOMA groups depending on their channel conditions. In the second stage, different power levels are allocated to users to enhance the network performance. A comparison between uplink NOMA-IoT and OMA-IoT is presented in [zhang2016uplink], which considers the optimal selection of targeted data rates for each user. Regarding downlink transmission, a similar topic was studied in [ding2014performance] and [hanif2016minorization]. Differently, the authors in [zhang2018energy] performed dynamic resource allocation for downlink NOMA using 2D matching theory with energy efficiency as the objective. Similarly, in [miuccio2020joint], dynamic resource management for the massive machine-type communications (mMTC) usage scenario, also known as massive IoT (mIoT), is performed in the sparse code multiple access (SCMA) domain using conventional mathematical tools. The authors in [yang2016general] proposed a general power allocation scheme for uplink and downlink NOMA to guarantee the quality of service (QoS). In [zhai2018energy], NOMA scheduling schemes in terms of power allocation and resource management were optimized to realize massive connectivity in IoT networks. For multi-cell scenarios, the impact of NOMA on large-scale multi-cell IoT networks was investigated in [liu2017enhancing]. To characterize the communication distances, the authors in [8635489] analysed the performance of large-scale NOMA communications via stochastic geometry. It is worth noting that NOMA-IoT channels are time-varying in the real world. Therefore, the study in [ali2018coordinated] considered a practical framework with dynamic channel state information for evaluating the performance of massive connectivity. The authors in [qian2018optimal], [shahab2019grant], and [dai2018survey] discussed the advantages of various NOMA-IoT applications. Interestingly, the proposed schemes introduced artificial intelligence (AI) methods to solve some practical challenges of NOMA-IoT systems. For both uplink and downlink scenarios, AI-based multi-constrained objective functions can be utilized to optimise multiple parameters simultaneously.

I-A2 Studies on ML-based NOMA Systems

Due to the dynamic nature of NOMA-IoT communications, traditional methods may not be suitable for such networks [mostafa2019connection]. ML-based methods, by contrast, are capable of handling the complex requirements of future wireless networks through learning. In [gui2018deep], a typical deep learning method, namely long short-term memory (LSTM) [hochreiter1997long], was applied to maximize user rates based on the received signal-to-interference-plus-noise ratio (SINR). In [xu2018outage], a successive-approximation-based algorithm was proposed to minimize outage probabilities by optimizing power allocation strategies. For next-generation ultra-dense networks, ML-aided user clustering schemes were discussed in [jiang2017machine] to obtain efficient network management and performance gains. By using clustering schemes, the entire network can be divided into several small groups, which eases resource management [bi2015wireless]. Regarding AI-based clustering techniques, in [arafat2019localization] and [cui2018unsupervised], resources were assigned to the most suitable user to ensure the best QoS for unmanned aerial vehicle (UAV) networks and millimetre-wave networks, respectively. It is worth noting that the optimization of clustering is an NP-hard problem. Therefore, for such problems the authors in [gui2018deep], [jiang2017machine], and [liu2019machine] recommended using AI instead of conventional mathematical models. Currently, realistic datasets are not available for most machine learning algorithms; to overcome this, designers use synthetic datasets for simulations. Because such a dataset is generated for a specific environment, it can hardly capture the general properties and online scenarios of wireless networks. Therefore, algorithms such as reinforcement learning, where data is collected online (during simulation) to learn the given search space, play a very important role. Various Q-learning variants have been used for NOMA systems. Due to their inefficient learning mechanisms, methods such as traditional Q-learning and multi-armed bandits (MABs) are heavily influenced by regret (negative reward) [li2020multi], [de2018comparing]. On the other hand, two powerful methods, deep reinforcement learning (DRL) developed by Google DeepMind [silver2017mastering] and SARSA-learning proposed by the authors in [rummery1994line], are efficient learners. Due to their unique learning behaviour, DRL and SARSA tend to receive more rewards, and a main advantage of DRL and online SARSA-learning is their ability to handle dynamic control, as in [lillicrap2015continuous]. With the development of such RL techniques, challenges of NOMA systems that are difficult to solve via traditional optimization methods have been reinvestigated via RL-based approaches [xiao2017reinforcement, liu2019uav, yang2019reinforcement].

I-A3 Motivations

Combining multi-user relationships and resource allocation increases the complexity of NOMA-IoT systems and introduces new problems for optimizing power allocation and scheduling schemes. Unlike traditional methods [zhai2018energy], which consider only one BS in a small-scale network without inter-cell interference or dynamic user connectivity, the design of schedulers here must work in tandem with large-scale dynamic resource allocation and user decoding strategies. Therefore, due to the high complexity of the problem in multi-cell multi-user cases, AI is a feasible option for dynamic resource allocation [cui2017optimal]. For large-scale NOMA-IoT networks, an intelligent reinforcement learning (RL) algorithm is a promising approach to find the optimal long-term resource allocation strategy. Such an algorithm should jointly optimize multiple criteria under dynamic network states. In this paper, our main goal is to address the following research questions:

  • Q1: In NOMA-IoT networks, how can the long-term sum rate of users be maximized for a given network traffic density?

  • Q2: How does inter-cell interference affect the long-term sum rate?

  • Q3: What is the correlation between traffic density, system bandwidth, and the number of clusters in NOMA-IoT networks?

As noted above, model-free methods are suitable for addressing multi-constrained long-term problems online. In the long term, the above research questions are therefore strongly correlated with general problems in wireless networks: intermittent connectivity of IoT users (users continuously joining and leaving the network), balanced resource allocation (an optimal allocation policy for dynamic network settings), and network traffic (the minimum and maximum numbers of users competing for the resource blocks). Specifically, Q1 (capacity maximization), Q2 (network scalability), and Q3 (long-term network performance) depend strongly on the main problems of balancing network resources, IoT users, and dynamic network behaviour.

I-B Contributions and Organization

This paper considers uplink NOMA-IoT networks, where multiple IoT users are allowed to share the same RB based on NOMA techniques. With the aid of RL methods, we propose a multi-constrained clustering solution to optimize the resource allocation among IoT users, base stations (BSs), and sub-channels, according to the received power levels of IoT users. Appropriate bandwidth selection for the entire system with different traffic densities is also taken into consideration for enhancing the generality. Our work provides several noteworthy contributions:

  • We design a 3D-association, model-free framework for connecting IoT users, BSs, and sub-channels. Based on this framework, we formulate a sum-rate maximization problem with multiple constraints. These constraints involve long-term variables of the proposed NOMA-IoT networks, such as the number of users, channel gains, and transmit power levels. To characterize the dynamic (online) nature of the network, these variables can change at each time slot.

  • We propose two RL techniques, namely SARSA-learning and DRL, to solve this long-term optimization problem. SARSA-learning is used for light traffic scenarios to avoid high complexity and memory requirements. Heavy traffic scenarios with a huge number of variables are handled by DRL, where three different neuron activation mechanisms, namely TanH, Sigmoid, and ReLU, are compared to evaluate the impact of neuron activation on the convergence of the proposed DRL algorithm.

  • We design novel 3D state and action spaces to minimise the number of Q-tables for both the SARSA and DRL frameworks. The considered action space represents switching between RBs, which is the most efficient strategy for our networks. Based on this compact Q-table design, DRL is able to converge faster.

  • We show that: 1) according to the time-varying environment, resources can be assigned dynamically to IoT users based on our proposed framework; 2) for the proposed model, the selected learning rate provides the best convergence and data rates; 3) for SARSA and DRL, the sum rate is proportional to the number of users; 4) DRL with the ReLU activation mechanism is more efficient than with TanH and Sigmoid; and 5) IoT networks with NOMA provide better system throughput than those with OMA.

The rest of the paper is organised as follows. Section II presents the system model for the proposed NOMA-IoT networks. Section III investigates SARSA-learning and DRL-based resource allocation and presents the corresponding algorithms. Finally, numerical results and conclusions are given in Section IV and Section V, respectively.


Fig. 1: Illustration of uplink NOMA resource allocation, where the optimization algorithm efficiently clusters users onto resource blocks at the base-station side. Resource allocations-(a) shows different resource blocks in yellow, green, and blue, with power on the x-axis and time/frequency on the y-axis, assigned to IoT users; the transmit powers and channel gains of the users are also indicated.
Symbol Definition
Number of BSs, symbol of BSs
Number of sub-channels (NOMA clusters), symbol of sub-channels (NOMA clusters)
Number of users, symbol of users
Set of BSs
Set of sub-channels (NOMA clusters)
, Set of users connected to BS via sub-channel , user in the set
Clustering variable for user connecting to BS via sub-channel at time
Transmit power for user at time
Channel gain for user at time
Additive white Gaussian noise at time
Inter-cell interference at time
Instantaneous SINR for user at time
Instantaneous data rate for user at time
Rate requirement for the SIC process of user
, Maximal load of each sub-channel, Maximal power for each sub-channel
Duration of the considered long-term communication
, Matrix for clustering parameters, matrix for transmit power
Vector for DRL gradients
Moment estimation decay rate
TABLE I: Table of notations

II System Model

In this paper, we consider an uplink IoT network with NOMA techniques, as shown in Fig. 1, where BSs communicate with IoT users via orthogonal sub-channels. The set of users is dynamic in each time slot in our model; for simplicity, the time index is omitted in later sections. Additionally, channel gains are dynamic for each user at each time slot, even for the same user. The BSs and sub-channels are indexed by their respective sets, and the set of users served by one BS through a given sub-channel contains the intra-set users. BSs and users are assumed to be equipped with a single antenna. For each BS, the entire bandwidth is equally divided among the sub-channels, so each sub-channel occupies an equal share of the bandwidth. In a time slot, we assume that a subset of users is active and the remaining users keep silent. To share knowledge, we consider fiber links with ideal backhaul for inter-BS connectivity. The notation used in this system model is listed in TABLE I.
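To make the system model concrete, the following Python sketch sets up a small instance of the considered network and draws a random initial 3D association. All names and numerical values (numbers of BSs, sub-channels, and users, bandwidth, activity probability, fading model) are illustrative assumptions rather than parameters taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative network dimensions (assumed, not from the paper).
NUM_BS = 2            # base stations
NUM_SUBCH = 2         # orthogonal sub-channels per BS (one NOMA cluster each)
NUM_USERS = 8         # IoT users
TOTAL_BW_HZ = 120e3   # total bandwidth, split equally over the sub-channels
SUBCH_BW_HZ = TOTAL_BW_HZ / NUM_SUBCH

# In each time slot only a subset of users is active (probability assumed).
active = rng.random(NUM_USERS) < 0.6

# Time-varying channel gains, redrawn every slot (exponential magnitudes are
# used purely as an illustrative fading model).
channel_gain = rng.exponential(scale=1.0, size=NUM_USERS)

# 3D clustering variable c[b, k, u] = 1 if user u is served by BS b on
# sub-channel k in this slot, cf. the clustering variable in (1).
c = np.zeros((NUM_BS, NUM_SUBCH, NUM_USERS), dtype=int)
for u in np.flatnonzero(active):
    b, k = rng.integers(NUM_BS), rng.integers(NUM_SUBCH)
    c[b, k, u] = 1    # random initial association, later refined by the RL agents

print("active users:", np.flatnonzero(active))
print("cluster sizes per (BS, sub-channel):\n", c.sum(axis=2))
```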

II-A NOMA Clusters

Based on the principles of NOMA, multiple users can be served in the same resource block (time/frequency), which forms a NOMA cluster. In this paper, each sub-channel represents one NOMA cluster [kiani2018edge]. To simplify the analysis, we assume that BSs have perfect CSI of all users. This CSI constitutes our state space, capturing the signalling and channel conditions of the IoT users connected to each sub-channel via a base station; a detailed explanation is given in Sections III-B and III-C. Based on such CSI, BSs are able to dynamically optimize the sub-channel allocation for active users in a long-term communication. For an arbitrary user, we define its clustering variable c(t) at time t as follows:

c(t) = 1, if the user is connected to the BS via the sub-channel at time slot t; c(t) = 0, otherwise.   (1)

It is worth noting that c(t) also implies the activity status of users: if a user is inactive, its clustering variable equals zero. The clustering variables of all users are collected in the clustering parameter matrix defined in TABLE I.

II-B Signal Model

In a NOMA cluster, one BS first receives the superposed messages from the active users in the cluster and then applies SIC to sequentially decode each user's signal [liu2016cooperative]. Without loss of generality, we assume that the channel gains of the users within a cluster are ordered [803503]. Therefore, the decoding order in this paper is the reverse of the channel gain order [8680645]. In a time slot, the instantaneous signal-to-interference-plus-noise ratio (SINR) for an intra-cluster user is given by

(2)

where

(3)

where the transmit power of each user is an entry of the transmit power matrix defined in TABLE I [8626185]. The thermal noise power is given by the product of Boltzmann's constant, the resistor temperature, and the considered bandwidth. The inter-cell interference term is generated by the active users served by other BSs on the same sub-channel. In uplink NOMA, the decoding of a user relies on the SIC process of the previously decoded user. If the data rate required to successfully complete the SIC process is the SIC rate requirement in TABLE I, then, when the decoding rate of the previous user satisfies

(4)

the data rate of the user is given by

(5)

Otherwise, if this condition is violated, the decoding of all remaining users fails and their data rates are zero.
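The per-cluster rate computation described above can be sketched as follows. The decoding order follows the statement in the text (the reverse of the channel-gain order), while the function name, the noise temperature of 290 K, the SIC rate threshold, and all numerical values are assumptions made only for illustration.

```python
import numpy as np

def cluster_rates(p, g, inter_cell_i, noise_power, bw_hz, r_sic_bps):
    """Per-user rates in one uplink NOMA cluster under SIC.

    p, g         : transmit powers and channel gains of the clustered users
    inter_cell_i : inter-cell interference power on this sub-channel
    r_sic_bps    : rate needed to complete the SIC step; if a user misses it,
                   all users decoded after it fail (rate 0), cf. (4)-(5).
    """
    order = np.argsort(g)          # reverse of the (descending) gain order
    rates = np.zeros(len(p))
    for idx, u in enumerate(order):
        remaining = order[idx + 1:]                    # not yet decoded users
        intra_i = np.sum(p[remaining] * g[remaining])  # residual intra-cluster interference
        sinr = p[u] * g[u] / (intra_i + inter_cell_i + noise_power)
        rates[u] = bw_hz * np.log2(1.0 + sinr)
        if rates[u] < r_sic_bps:   # SIC cannot proceed: later users get rate 0
            break
    return rates

# Thermal noise = Boltzmann constant * temperature * bandwidth (290 K assumed).
noise = 1.380649e-23 * 290 * 60e3
r = cluster_rates(p=np.array([0.1, 0.2, 0.2]),
                  g=np.array([0.8, 0.3, 1.5]) * 1e-9,
                  inter_cell_i=1e-13, noise_power=noise,
                  bw_hz=60e3, r_sic_bps=1e4)
print("per-user rates (bps):", r, "cluster sum rate (bps):", r.sum())
```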

II-C Problem Formulation

For a long-term communication of a given period, the number of active users differs across time slots. Given the maximal load of each sub-channel, we assume that the number of active users is uniformly distributed between the minimum load and the maximal load. Under this condition, the average long-term sum rate can be maximized by optimizing the clustering parameters and transmit powers. Therefore, the objective function is given by

(6a)
(6b)
(6c)
(6d)
(6e)
(6f)
(6g)

where (6b) imposes the ordered channel gains based on the perfect CSI, (6c) imposes the power constraint of each sub-channel, (6d) ensures that all clustered IoT users can be successfully decoded so as to maximize connectivity, (6e) and (6f) limit the number of clustered users for the entire system and for each sub-channel, respectively, and (6g) indicates that each user belongs to only one cluster. Problem (6a) is NP-hard even when a fixed number of users per cluster is considered instead of a dynamic range, especially because of (6c) and (6f). The proof is provided in Appendix A and follows the ideas in [cui2018optimal] and [8807386].
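As a complement to the formulation, the sketch below evaluates the long-term average sum rate for a given sequence of clustering and power decisions while enforcing simple per-cluster load and power checks in the spirit of (6c) and (6f). The data layout, the constraint limits, and the `rate_fn` interface (e.g. a `functools.partial` of the `cluster_rates` helper from the previous sketch) are illustrative assumptions.

```python
def average_sum_rate(slots, max_load, max_cluster_power, rate_fn):
    """Average long-term sum rate over the given time slots, cf. objective (6a).

    slots   : iterable of per-slot decisions; each slot maps a (bs, subchannel)
              pair to (powers, gains, inter_cell_interference) for its cluster.
    rate_fn : callable (powers, gains, inter_cell_interference) -> per-user rates.
    """
    total, n_slots = 0.0, 0
    for slot in slots:
        n_slots += 1
        for (bs, k), (p, g, inter_i) in slot.items():
            # Per-cluster checks mirroring the load limit (6f) and the
            # power budget (6c); the limits themselves are assumed values.
            if len(p) > max_load or p.sum() > max_cluster_power:
                continue          # infeasible clusters contribute no rate here
            total += rate_fn(p, g, inter_i).sum()
    return total / max(n_slots, 1)
```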

III Intelligent Resource Allocation

III-A Markov Decision Process Model for Uplink NOMA

In this section, we formulate user clustering and optimal resource allocation for uplink NOMA as a Markov decision process (MDP) problem. The problem transformations are shown in Fig. 2(a) and Fig. 2(b). A general MDP problem contains one or more agents, an environment, states, actions, rewards, and policies. The process starts with the interaction of an agent with a given environment. In each interaction, the agent performs an action following a policy based on the previous state. After processing the action according to these conditions and the observed state, the agent receives a reward as feedback and transitions to the next state. A reward can be positive (reward) or negative (penalty). It helps the agent find an optimal set of actions that maximizes the cumulative reward over all interactions. The Q-table acts as the brain of an agent; its main function is to store the states and the corresponding actions that the agent can take in each state during the trials of basic RL algorithms. SARSA and DRL are two promising RL methods to solve this MDP problem. SARSA learns the safest path: the policy is learned by estimating the state-value function, but this requires more memory for a complex state space. DRL uses a neural network to simplify the Q-table, reducing memory requirements and handling more complex problems. Therefore, this work implements SARSA-learning for light traffic and, to further reduce the impact of state-space complexity, DRL for heavy traffic scenarios. Additionally, if the SARSA algorithm fails to provide an optimal policy for any type of network traffic within a threshold number of trials, the final allocation is performed using DRL. In summary, this model follows the model-free, on-policy SARSA-learning algorithm (instead of value iteration and off-policy methods) for light traffic and DRL for complex networks. The major advantage of the proposed algorithms is that they avoid huge memory requirements (DRL) and learn the safest allocation policy (SARSA) for the different traffic conditions.

Fig. 2: Overview of the proposed framework for the sum-rate maximization problem. Sub-figure (a) is a breakdown of the optimization problem showing where the RL algorithms are applied, and sub-figure (b) shows the problem transformation, with the users and the BSs acting as the system states and as the brains of the reinforcement learning agents, respectively.

III-B SARSA-Learning Based Optimization for Light Traffic ((2-3, 2-4) UEs)

As the name suggests, in this type of traffic scenario fewer users join and leave the network; in other words, the state space is not as large as for heavy traffic. Therefore, we use the SARSA-learning algorithm to find the optimal long-term policy. Traditional Q-learning is less suitable for the long term because its update does not use knowledge of the next action, which does not fit our case. Secondly, the state space is not as large as in heavy traffic, which would require more complex control. To efficiently utilise system resources, we use SARSA-learning for light traffic and DRL for heavy traffic, where the state space is huge with dynamic users. For SARSA-learning, the discount factor, the sum reward, and the number of iterations are significant hyper-parameters. The flow of information updates is shown in Fig. 3. The elements of the SARSA-learning 5-tuple (s, a, r, s′, a′), together with the transition probability and the reward function, are described below:

  1. State space: a finite set whose 3D dimensions determine the total number of states. Each state represents one sub-set of 3D associations among users, BSs, and sub-channels.

  2. Action space: a finite set of actions that move the agent within the environment. The actions in this model are -1, +1, and 0: '-1' decrements one of the state matrix elements, '+1' increments one of the state matrix elements, and '0' leaves the current state of the agent (the BSs) unchanged. In effect, the actions are swap operations between sub-channels across all BSs. For example, when the agent takes an action from the action set in (7), the first action corresponds to swapping a user between sub-channels at a BS. In this model, the agents have a fixed total number of swap operations between BSs and sub-channels.

    (7)
  3. Transition probability: the expected probability of moving from the current state to the next state by taking an action. The total number of actions for an agent comprises 8 swap operations built from the '+1', '-1', and '0' actions; the agent selects suitable actions according to the corresponding state to obtain an optimal state-action pair.

  4. Reward set: a finite set of rewards, where a reward is obtained after a transition to the next state by taking an action. The reward function indicates that, as a result of the chosen associations, the agent receives a reward according to the conditions specified in the reward function.

  5. Multi-constrained reward function: the short-term reward in the proposed model depends on two conditions: 1) the sum rate and 2) the state of the system, i.e., the total number of users associated with BSs and sub-channels. The reward function can be expressed as follows:

    (8)
  6. Next state: the next state of the agent, based on the previous state, action, and reward of the agent.

  7. Next action: the next possible action that can be taken by the agent from the next state.

Definition 1.

The 3D state matrix is defined by the total number of states and its three dimensions. For all types of network traffic, the minimum load per resource block is 2 users; the maximum load is 3 or 4 users for light traffic and 10 users for heavy network traffic.

Furthermore, the optimal policy over the aforementioned parameters can be discovered by an agent using the following function:

(9)

where the optimal policy is represented on the left-hand side. This function provides the optimal policy value for each state from the finite state set after taking the appropriate action. For a better understanding, the optimal policy can be written explicitly as:

π*(s) = argmax_{a ∈ A} Q(s, a).   (10)

The Q-table contains the state and corresponding action values of the agent; to update these values, the Bellman equation is utilised in the optimization process. According to the Bellman equation, there is only one optimal solution strategy for each environment setting. The Bellman update is defined as:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ R_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ],   (11)

where γ is the discount factor, which balances historical and future Q-table values: the larger γ is, the more weight is given to future values, and vice versa. α denotes the learning rate and acts like a step size: a larger α gives faster learning but, due to minimal experience, may result in non-convergence, whereas a too-small α increases the time complexity of the system by slowing down the learning process.
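For concreteness, the single-step SARSA update corresponding to (11) is shown below; the Q-table layout and the hyper-parameter values are illustrative assumptions.

```python
import numpy as np

def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy SARSA update of the Q-table entry for (s, a), cf. (11).

    Unlike Q-learning, the bootstrap target uses the action a_next that the
    agent will actually take next, not the greedy maximum over actions.
    """
    td_target = reward + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])   # larger alpha = larger step
    return Q

# Minimal usage: a toy Q-table with 4 states and 3 actions (-1, 0, +1).
Q = np.zeros((4, 3))
Q = sarsa_update(Q, s=0, a=2, reward=0.0, s_next=1, a_next=1)
```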

Definition 2.

For Q-learning, we define an ε-greedy mechanism to learn the greedy policy over all state and action pairs.

Fig. 3: An illustration of the communication environment for the proposed algorithm, where the RL technique (SARSA) is invoked to optimize the NOMA-IoT uplink 3D associations and resource allocation. The agents in this case are the BSs. The process of associations and resource allocation based on user activities constitutes the state of our system.

One main limitation of reinforcement learning algorithms is slow convergence due to the exploration requirement, which is especially challenging with a 3D state space and dynamic systems [watkins1992q], [melo2001convergence]. Due to the dynamic behaviour of IoT users, the 3D state and action spaces strongly influence the learning process, since the states and actions are the main components of the Q-table. The convergence of the reward function and the reinforcement learning hyper-parameters guide the algorithm towards the optimal policy. In other words, the choice of the reward function and the hyper-parameter values is what allows the reinforcement learning agent to avoid a random walk; a random walk in the search space causes infinite exploration and hence no convergence. Therefore, we are able to propose the following conclusion.

Remark 1.

The selection of suitable rewards according to the system dynamics is critical for effective convergence to the optimal policy. Consequently, following (10), altering the reward function does not change the output of the RL algorithms, but the convergence towards the optimal policy is highly influenced.

The proposed protocols are capable of handling multi-constrained optimization problems for different network traffic scenarios. We use SARSA-learning and DRL to explore and exploit the search space and find dynamic outcomes, so the proposed protocols successfully obtain the optimal clustering solution. The Q-table in our model contains solutions for all subsets (user associations) in the search space; in each episode, only a specific subset of users is active.

Remark 2.

In reinforcement learning, to find the best associations from the set of possible states, an agent converges towards the optimal state-action pairing with the highest probability. In this way, as the probabilities increase, the number of visits per state-action pair and the rewards increase as well.

Since an agent has a limited number of successful visits, the achieved rewards follow Remark 1 and Remark 2. As a result, the agent successfully finds the optimal policy for the given system by selecting the best actions.

III-B1 SARSA Algorithm

Based on the above discussion, we design Algorithm 1, which presents the main optimization stages of the SARSA algorithm for light-traffic networks step by step. The details of the algorithm are as follows:

1:Inputs for SARSA:
  1. Episodes

  2. Explorations per trials

  3. Learning rate

2:Initialization for SARSA:
  1. Network parameters ()

  2. Q-Table

3:Define number of clusters-k
4:Define range of users per cluster
5:load and
6:Random user association to any and Cluster
7:for iteration = : do
8:       
9:       for iteration = : do
10:             
11:             compute
12:             update
13:             update
14:             Update towards greediness
15:             
16:       end for
17:       return optimised (c,p) (6) under constraints (6a),(6b),(6c),(6d) and (6e)
18:end for
19:Return Q-Table
Algorithm 1 SARSA-Learning Based NOMA-IoT Uplink Resource Optimization
  • Initialization: presents the initialization of the SARSA algorithm, in which the system is initialized with the initial sets of users, BSs, and sub-channels as the initial state. After this, we define the maximum number of clusters and the maximum number of users per cluster. The brain of the agent is then initialized as a Q-table whose entries indicate that the agent still needs training; after training, the Q-table entries approach zero in the best case. This also reflects that the proposed algorithm targets a maximization problem: larger Q-values correspond to better solutions. The remaining initialization lines define the SARSA-learning parameters and the initial random association among IoT users, BSs, and sub-channels.

  • Training: shows the key training steps based on Q-table updates via the Bellman equation. The agent performs actions according to the given state of the environment, i.e., the 3D associations and cluster allocation. In each episode, the agent picks new associations for the different active users and, over all trials, tries to obtain the optimal associations with the optimal sum rate; if the associations are successful the agent receives a reward of 0, and if they fail a negative reward of -10 is given as a punishment. Based on the designed 3D 5-tuple (s, a, r, s′, a′), equation (11) is updated online. Performing online updates with the full 5-tuple, instead of the 4-tuple of traditional Q-learning, makes the online learning mechanism converge faster; in other words, the agent finds the optimal long-term online allocation policy more efficiently. These updates are computed for the maximum number of episodes and all trials to maximize the overall long-term average reward of the system. A minimal episode-loop sketch following these steps is given below.
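The following sketch mirrors the training steps of Algorithm 1 described above. The environment interface (`reset`, `step`), the 0/-10 reward convention, the negative Q-table initialization, and the epsilon value are assumptions for illustration; in the actual model the state encodes the 3D association matrix described in the text.

```python
import numpy as np

def train_sarsa(env, n_states, n_actions, episodes=500, trials=500,
                alpha=0.1, gamma=0.9, epsilon=0.1, rng=None):
    """SARSA training loop in the spirit of Algorithm 1 (illustrative only)."""
    rng = rng or np.random.default_rng()
    Q = np.full((n_states, n_actions), -10.0)  # assumed negative start; trained values approach 0

    def policy(s):
        # Epsilon-greedy: mostly exploit the Q-table, occasionally explore.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()                  # random initial 3D association
        a = policy(s)
        for _ in range(trials):
            # Assumed env contract: 0 reward for a feasible association,
            # -10 otherwise, as described for Algorithm 1.
            s_next, reward = env.step(a)
            a_next = policy(s_next)
            Q[s, a] += alpha * (reward + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
    return Q
```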

Definition 3.

In the 3D state matrix, drawn from the set of possible states, the state is defined as the CSI of the proposed network, which is known to both reinforcement learning agents. Therefore, each reinforcement learning agent has perfect knowledge of the CSI for the whole network.

III-C Deep Reinforcement Learning for Heavy Traffic ((2-10) UEs)

In general, both online and offline Q-learning methods require large memory space to build the state of the system. However, practical systems are high-dimensional and complex, so Q-learning is not suitable for a large action space; this is a major drawback of conventional Q-learning methods. To overcome this, the DRL method adopts a deep neural network (DNN) to generate its Q-table by approximating the Q-values [silver2017mastering]. Therefore, DRL agents only need to memorize the network weights instead of reserving huge memory space for all possible state-action pairs, which is the main advantage of using a DNN. More specifically, the optimization of the Q-table in conventional Q-learning algorithms corresponds to the optimization of the DNN weights in DRL, with lower memory requirements. The weight updates are based on the history of states, actions, and reward values; more specifically, these values are obtained as the DRL agent interacts with the environment and learns the relationships among the different actions and states by continuously observing the given environment.

  1. State space: a unique state space used as the input of the DNN. Each state is a combination of multiple sub-sets of 3D associations among users, BSs, and sub-channels. It also contains the current rewards of the system, namely the instantaneous reward and the average reward from previous iterations.

  2. Reward: the reward of the system, composed of an instantaneous reward, defined as in the SARSA algorithm, and the long-term average reward for the time slot.

  3. Action space: a multi-dimensional matrix representing the actions. For the DRL algorithm, the action mechanism is based on two main parts of the allocation strategy: a switching strategy, which is similar to SARSA and is used for the DRL channel-switching process, and an association strategy for the optimization process. The association strategy results from the selected switching action and denotes an index of the 3D associations among users, BSs, and sub-channels. Finally, the DRL agent uses the loss function in (12) to update its prediction based on previous experience:

    L(θ) = E[ ( y_t − Q(s_t, a_t; θ) )² ],   (12)

    where

    y_t = R_{t+1} + γ max_{a′} Q(s_{t+1}, a′; θ⁻)   (13)

    and y_t is the target Q-value obtained from the target DNN with weights θ⁻. For improved training, the target network is in general updated slowly; for this reason, the target network remains fixed until the target-network update threshold is reached.

The DRL agent uses a gradient descent method on (12) to reduce the prediction error by minimizing the loss function. The update of the network weights is provided in (16) and is based on the outcome of new experience. The updating function for the Q-values is defined in (18), namely the DRL Bellman equation.

(14)
(15)
(16)

where the function gives the Q-values and the long-term reward calculation for DRL, based on the discount factor and the optimal DRL policy shown below.

(17)

where the optimal policy for the DRL algorithm is represented on the left-hand side. This function provides the optimal policy value for each state from the finite state set after taking the appropriate action.

(18)

where the Q-value is updated according to the DRL Bellman equation.

(19)
(20)

where (19) represents the state of the DRL agent and (20) shows the activation mechanism of each neuron layer, based on the layer weights and a bias term. In this model, the input of the DRL algorithm is the instantaneous network observation. This state is fed to the neurons of the neural network to obtain, as the final output, a set of Q-values for all actions. For the DRL framework, the size of the output action set is the same as for SARSA. We use a replay memory as the experience of the DRL agent, storing the transition tuple of every time step in an experience dataset of fixed size. When the memory is full, the oldest tuple is removed to free space for the new experience. This experience update reflects the sequential exploration of the DRL framework; however, the sample distribution should be independent and identically distributed. Therefore, to obtain a more general output, the update process is performed on randomly sampled tuples instead of the current tuple, because the output is otherwise highly influenced by correlated sets of tuples and by the variance of the updates.
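A minimal experience-replay buffer matching this description (fixed capacity, oldest tuple dropped first, uniform random sampling to break correlation between consecutive transitions) could look as follows; the capacity and batch size are assumed values.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size experience buffer; the oldest transition is evicted when full."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # deque drops the oldest item

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform random sampling decorrelates consecutive transitions and
        # reduces the variance of the DNN updates.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```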

Definition 4.

The DRL design in this work has two main elements. The first element is the target Q-network. The second element is the state transition mechanism, which is used to construct mini-batches for experience replay from the stored dataset to train the DNN.

Fig. 4: DRL structure: it shows the flow of information between target and training networks to minimize the loss function using states, actions, rewards, and replay memory.
Remark 3.

The convergence speed of the proposed algorithm varies according to the randomly selected initial 3D association (state). In this model, the state space consists of allocation strategies that include subsets of all possible associations of active users for each sub-channel in each episode.

Based on the above discussion, we design Algorithm 2, which presents the main optimization stages of the DRL algorithm for heavy traffic step by step. The details of the algorithm are as follows:

III-C1 DRL

  • Initialization: In this stage, parameter initialization is performed, similar to the initialization step of SARSA. However, instead of state-action pairing, the weight matrix is initialized for DRL to find the optimal policy.

  • Pre-training: In this stage, initial actions are selected using a uniform random distribution over the initial state space in a continuous environment. In this way, the initial weights are also computed to start the optimization process.

  • Training: The overall DRL process is similar to the SARSA training loop but uses the DRL equations (12) to (19).

  • Loss Calculation: Equation (12) computes the loss, i.e., the mean squared error (MSE) between the target and predicted networks. To optimise these values between the target network and the prediction network, we use the Adam optimiser, which minimizes the loss and thereby further improves the predicted Q-values in each episode. As a result, the DRL framework converges faster even in a huge state space. In (13), we calculate the target Q-values based on tuples from the mini-batch, and the mini-batch is updated after 100 iterations.

  • DRL Updates: The updating functions for the DRL prediction and the long-term reward calculation are shown in (14) to (16), where the DRL agent obtains rewards and the prediction loss after every transition to the next state in order to find the greedy policy. Additionally, the discount factor has a significant impact on the search for the greedy policy because, as mentioned above, it determines whether the agent weights immediate or previous Q-values. The policy is calculated using (17) to maximize the Q-values by greedy search, and the DRL Bellman equation is evaluated over the states in (18).

  • Sparse Activations: The activation function of the DNN uses sparse activations via ReLU. Sparse activations help agents converge efficiently by avoiding useless neuron activations; the outcome of this sparsity is shown in the results section, comparing sparse ReLU, Sigmoid, and TanH. In (20), the activations are performed over the neurons of each layer: for each neuron, the state of the system is used as the input, multiplied by the layer weights, and a bias value is added before activation. In the next steps, the current states, actions, and rewards are added to the mini-batch for experience replay (for self-training), and the agent receives the next states from the mini-batch built up during pre-training. Before the mini-batch is full, the learning process of the agent is based on pre-training; once the mini-batch is full, the agent learns the optimal policy through the experience replay mechanism with the help of mini-batch processing.

  • Neural Networks: This paper uses a DRL architecture built with two DNNs, as shown in Fig. 4: 1) a training network that learns the policy and 2) a target network that computes the target Q-values for every update, where the two networks keep separate weights. For the training network, the weights are updated based on the current state and action; the target network weights are based on previous episodes and remain fixed while the training targets are computed. Additionally, we utilize the MSE loss function (12) to evaluate the training accuracy against the target network; the loss is therefore based on both sets of weights and measures the deviation of the predicted DNN weights. A compact training-step sketch combining these elements is given after this list.

  • Output: Finally, the output of this algorithm is the optimal policy for all clusters, under which the overall long-term sum rate is maximized.
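Putting these elements together, one training step of the two-network design (training DNN, slowly synchronised target DNN, MSE loss as in (12)-(13), Adam optimiser, and a selectable ReLU/TanH/Sigmoid activation) might be sketched in PyTorch as below. The layer sizes, the state and action dimensions, the learning rate, and the synchronisation schedule are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ACTIVATIONS = {"relu": nn.ReLU, "tanh": nn.Tanh, "sigmoid": nn.Sigmoid}

def build_qnet(state_dim, n_actions, hidden=64, activation="relu"):
    act = ACTIVATIONS[activation]
    return nn.Sequential(
        nn.Linear(state_dim, hidden), act(),
        nn.Linear(hidden, hidden), act(),
        nn.Linear(hidden, n_actions),          # one Q-value per action
    )

def dqn_train_step(q_net, target_net, optimizer, batch, gamma=0.9):
    """One gradient step on the MSE loss between predicted and target Q-values."""
    states, actions, rewards, next_states = batch          # batched tensors
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # target net held fixed
        q_next = target_net(next_states).max(dim=1)[0]
        target = rewards + gamma * q_next                   # cf. the target in (13)
    loss = F.mse_loss(q_pred, target)                       # cf. the loss in (12)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example wiring with placeholder dimensions: two hidden layers, ReLU, Adam.
q_net = build_qnet(state_dim=12, n_actions=9, activation="relu")
target_net = build_qnet(state_dim=12, n_actions=9, activation="relu")
target_net.load_state_dict(q_net.state_dict())   # synchronise periodically
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
```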

Definition 5.

We use the ReLU activation function, max(0, x), where x is the input of a neuron, for the DRL performance evaluations. A ReLU network is characterized by its density, its number of hidden layers, and the width of each hidden layer.

In this definition, the network is a function constructed from the weights of each layer and the activation applied to each neuron. The mesh structure of the neural network remains fixed in this model, so the main learned neural network parameters are the weights and biases, in addition to the chosen activation function and the input of the neural network. In the neural network, bias terms are added to the input of each layer of the DNN as a shift value. To optimise our dynamic objective function, a greedy-search agent is used; with the help of greedy search, the DRL agent receives higher rewards.

Remark 4.

To avoid useless visits, the ε-greedy policy provides balanced exploitation: it exploits in most cases and sometimes takes a random action to explore the environment in search of different solutions.

For DRL, unbalanced random actions cause large error propagation, so this balanced policy is suitable for achieving efficient learning in a dynamic state space. Note that the exploration parameter for the policy selection lies between 0 and 1: towards one extreme the policy becomes the greedy policy, and towards the other extreme the agent explores more.
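The exploration-exploitation balance discussed here is commonly realised as an epsilon-greedy rule; a small sketch follows, where the epsilon value and its decay schedule are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """Pick the greedy action with probability 1 - epsilon, a random one otherwise."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

# Epsilon near 0: almost always greedy; near 1: mostly random exploration.
# An assumed decay schedule moves from exploration towards exploitation:
epsilon = 1.0
for episode in range(500):
    epsilon = max(0.05, epsilon * 0.99)
```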

1:Inputs for DRL:
  1. Episodes

  2. Explorations per trials

  3. Learning rate

2:Initialization for DRL:
  1. Network parameters ()

  2. memory, hidden size, State size, action size and mini-batch

3:train DRL to find a good policy
4:for iteration = : do
5:       for iteration = : do
6:             
7:             compute
8:             update using
9:              update using
10:             
11:             update mini-batch (Experience)
12:             if   then
13:                    get from mini-batch
14:             end if
15:       end for
16:end for
17:Return
Algorithm 2 Deep Q-Learning Based NOMA-IoT Uplink Resource Optimization
Definition 6.

(Sparsity for ReLU DNNs): The sparsity of the ReLU network is a weight-based sparsity; sparse ReLU networks are bounded in the number of non-zero weights across all layers, including every hidden layer.

where is used to represent . The function is from Definition 5 and is the element of .

III-D Complexity

The complexity of the proposed model depends on the number of BSs, the total number of sub-channels, and the number of communicating users. In the proposed scheme, the simulated experiments are based on different examples; this paper considers 2-3 and 2-4 users per resource block for light traffic and 2-10 for heavy traffic. These examples are association decisions between users and sub-channels at the BSs that receive signals on those channels. The computational complexity of SARSA-learning scales with the number of state-action pairs, with a corresponding memory requirement for the Q-table that simulates the brain of the learning agent. The complexity of DRL is lower because it uses a smaller Q-table and a 1D experience replay containing the state vectors (19) instead of the huge memory requirement of traditional Q-learning. The benchmark scheme considered in this work is a memory-less method that gives the maximum achievable rate by exhaustively searching all possible combinations of 3D associations; consequently, it requires many more operations, and its computational complexity grows exponentially.
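To illustrate the complexity gap described above, the short sketch below counts the candidate 3D associations that an exhaustive benchmark would have to score against the number of table or network updates an RL agent performs; the network sizes and the episode/trial counts are illustrative assumptions.

```python
def exhaustive_candidates(num_bs, num_subch, num_users):
    # Each active user can be placed on any (BS, sub-channel) pair, so an
    # exhaustive benchmark scores (num_bs * num_subch) ** num_users options,
    # i.e. the count grows exponentially in the number of users.
    return (num_bs * num_subch) ** num_users

def rl_updates(episodes, trials):
    # SARSA/DRL perform one update per trial in every episode.
    return episodes * trials

print(exhaustive_candidates(2, 2, 10))   # 4**10 = 1,048,576 candidate associations
print(rl_updates(500, 500))              # 250,000 learning updates
```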

Load balancing factor values per resource block: 2-3, 2-4, 2-10
Total number of trials: 500
Total number of time steps: 500
Bandwidth: 60 kHz, 120 kHz
Channel gain: [zhang1999finite]
Optimisers: SARSA, DRL (Adam)
Deep neural network activations: Sigmoid, TanH, ReLU
TABLE II: Network parameters

IV Numerical Results

In this section, simulation results are provided for the performance evaluation of the proposed multi-constrained algorithms. The proposed algorithms are tested under different network settings to solve the 3D association among users, BSs, and sub-channels as well as sum-rate optimization under different network traffic. For the simulations, we consider two different traffic-density threshold values to analyse the impact of network load with various power levels on the sum rate and the 3D associations. Additionally, network load in our case represents the load of each resource block rather than the total number of users in the network; therefore, a maximum network load of 10 with two RBs per BS means up to 20 users per BS. To show the significance of the available channel bandwidth, we start with a minimum channel bandwidth of 60 kHz and then increase it to 120 kHz under different network traffic conditions. The hardware and software used for experimentation are an Intel Core i7-7700 CPU at 3.60 GHz with 16 GB of RAM and a 64-bit operating system (Windows 10). All experiments are simulated using MATLAB R2019a and Python 3.6. As listed in Table II, both algorithms are run for 500 trials with 500 time steps each. Similarly, the learning rate and the exploration rate are the significant hyper-parameters of the proposed algorithms. We use load-balancing factor values of 2-3, 2-4, and 2-10 users per resource block to set the minimum and maximum user connectivity for each resource block. The channel gain of each user is defined as in [zhang1999finite]. For DRL, additional parameters are configured, such as the MSE loss, activation functions, batch size, optimiser, experience memory, pre-training length, and the number and size of hidden units. We use ReLU, Sigmoid, and TanH as activation functions with two hidden layers, and the Adam optimizer is utilized for the convergence of DRL.

Fig. 5: Sub-figure (a) shows the convergence of the proposed algorithms: DRL for heavy traffic (maximum scheduling of up to 10 users), SARSA for medium and low traffic (supporting 2, 3, 4, and up to 10 users scheduling), and a comparison for two different learning rates. Sub-figure (b) shows a long-term comparison between the channel bandwidth, average sum rate, and power levels for the proposed SARSA, DRL, and the benchmark scheme.

IV-A Convergence vs Sum Rate vs Traffic Density

Fig. 5 shows the inter-correlations among the four measures of convergence. It is apparent from this figure that convergence is slower when the traffic density increases, and vice versa. DRL converges better for heavy traffic with the maximum allocation capacity/load, which makes DRL more suitable for scenarios with high traffic densities. Secondly, another interesting insight is that the performance of SARSA with one of the two learning rates is better than with the other. The convergence of Adam depends on the DRL weights, which remain within a feasible set at all steps.

Definition 7.

The gradients of the objective function are bounded, and the distance between the iterates generated by the Adam optimiser is bounded for any step, with the bias terms satisfying the stated condition. Let the learning rate of the Adam optimiser decay with the step index and let the first-moment bias term decay accordingly; then, for all steps, Adam satisfies the following condition [kingma2014adam]:

(21)

The results of the primary analysis of sum rate and traffic densities are shown in Fig. 5 for long-term settings; it is clearly visible that the proposed model performs close to the benchmark scheme and better than OMA. Fig. 6 shows a short-term performance analysis of the sum rate, bandwidth, and number of iterations. This figure illustrates the performance of DRL and SARSA for different bandwidths, where DRL performs better than SARSA. Interestingly, it also shows that as the traffic density increases, the sum rate increases as well; the sum rate is therefore proportional to the number of users/traffic density in this case. Furthermore, Fig. 5 shows that even under light traffic the sum rate of NOMA systems is higher than that of OMA. Lastly, Fig. 6 shows the long-term user connectivity over the simulation time, where it is clearly visible that NOMA is more efficient for user connectivity, serving more users than OMA. From this figure we can see that connectivity improves as the reinforcement learning agents, specifically the DRL agent, learn the dynamic environment. The number of served users increases significantly after 150 episodes of learning: the total number of served users exceeds 3000 for DRL-NOMA and 1000 for SARSA-NOMA within 200 episodes.

Fig. 6: Sub-figure (a) shows a short-term comparison between the channel bandwidth, average sum rate, and different network traffic loads for the proposed DRL and SARSA, where (L) denotes light traffic, (M) medium traffic, and (H) heavy traffic. Sub-figure (b) shows a long-term comparison between the time episodes and the clustering parameter c, giving the total number of connected users in the long term for the proposed SARSA, DRL, and the OMA scheme.

IV-B DQN Loss vs Rewards

Fig. 7 shows the loss (MSE) of the DRL algorithm, comparing three well-known activation functions (ReLU, TanH, and Sigmoid). As can be seen, ReLU performs better than both Sigmoid and TanH. Sigmoid and TanH perform relatively better only in the initial steps, when the DRL agent has little experience; once the DRL agent gains experience through exploration and exploitation of the environment, the outcome changes accordingly. The loss (y-axis) for all activation functions decreases with the number of episodes (x-axis). Furthermore, this figure also shows that the DRL algorithm is most efficient (in terms of loss) when ReLU activation is used. Fig. 7 also provides summary statistics of the achieved average rewards for the three activation functions; from these data, it is apparent that the DRL algorithm with ReLU outperforms the Sigmoid and TanH activation functions. Combining the loss and reward results, another interesting outcome is that as the loss improves, the rewards improve as well. Lastly, DRL with Sigmoid activations is the second best until 200 episodes, whereas beyond 200 episodes the performance of TanH is better than that of Sigmoid.

Fig. 7: Sub-figure (a) shows the DRL loss vs the number of episodes, a comparison between the DRL loss and the training episodes for different activation functions (ReLU, TanH, Sigmoid). Sub-figure (b) shows rewards vs activation functions, a comparison between the achieved rewards and the episodes for the different activation functions of the DRL algorithm.

IV-C Clustering Time

Fig. 8 compares the average clustering time in seconds for the DRL and SARSA algorithms under different types of traffic and learning rates. The learning rate is a significant hyper-parameter of RL algorithms, determining how long the agent spends exploring and exploiting the given environment. From the figure, it can be seen that the learning rate has no large effect on the clustering time (y-axis) in any of the scenarios with the current hyper-parameters; however, if it is not tuned together with the other hyper-parameters, the learning rate can negatively influence the learning process. With improper tuning, the learning process becomes unbalanced and the agent may search for a solution for an unbounded amount of time. Lastly, the clustering time increases, but not significantly, when the maximum load is increased from 3 to 10.

Fig. 8: Clustering time (mean (sec)) vs Traffic Densities for DRL and SARSA: A comparison between different traffic densities and learning rates of the proposed algorithms.

V Conclusion

This paper has proposed a resource allocation scheme for IoT users in the uplink transmission of NOMA systems. Two algorithms, DRL and SARSA, have been designed to determine the effect of three different traffic densities on the sum rate of IoT users. To improve the overall sum rate under different numbers of IoT users, we have formulated a multi-dimensional optimization problem using intelligent clustering based on RL algorithms, with several interesting outcomes. Firstly, the simulation results have indicated that the proposed technique performs close to the benchmark scheme in all scenarios. The second major finding is that this framework provides a guaranteed long-term average rate with long-term reliability and stability. Thirdly, it has been shown that DRL is efficient for complex scenarios. Additionally, we have shown that sparse activations improve the performance of the DNNs compared to the traditional mechanisms; therefore, DRL with sparse activations is suitable for heavy traffic, while SARSA is suitable for light traffic conditions. Furthermore, both algorithms (DRL and SARSA) obtain better sum rates than OMA systems. Lastly, further research will explore performance improvements under different network scales.

Appendix A Proof of Problem (6a)

With the aid of the theory of computational complexity, we use the following two steps to prove that problem (6a) is NP-hard. Step 1: the association problem for every subset of