I. Introduction
The Internet of Things (IoT) enables millions of devices to communicate simultaneously, and the number of IoT devices is predicted to increase rapidly over the next decades [zhai2019delay]. Owing to the large number of time-varying communication channels, dynamic network access with massive connectivity becomes a key requirement for future IoT networks. Recently, non-orthogonal multiple access (NOMA) has emerged as a promising approach to this problem [islam2017power], [sharma2019towards]. The key benefit of NOMA is that it exploits the power domain to enable more connectivity than traditional orthogonal multiple access (OMA). More specifically, NOMA supports multiple users in the same time/frequency resource block (RB) by employing superposition coding at transmitters and successive interference cancellation (SIC) at receivers [wan2018non]. Various model-based schemes have been proposed to improve different metrics of NOMA-IoT networks, such as coverage performance, energy efficiency, and system throughput (sum rate). Underlining the importance of the sum rate, recent work on wireless networks based on state-of-the-art reconfigurable intelligent surfaces (RIS) also adopted a sum-rate maximization objective [guo2020weighted]. The sum rate is an important parameter because it depicts the average performance of wireless networks in detail for each user; for this reason it is widely used as a significant performance indicator by the research community [zeng2020sum], [tse2005fundamentals], which shows the significance of sum-rate-based objective functions. Regarding system design, the uncertainty and dynamics of wireless communication environments are difficult to capture with an accurate model. These dynamics involve spectrum availability, channel access methods (e.g., OMA, NOMA, hybrid systems), and dynamic traffic arrival.
Practical NOMA systems are especially dynamic: resources are shared among multiple users while users join and leave the network on both short-term and long-term bases. Numerous model-based techniques target the dynamic behaviour of wireless networks but fail to provide long-term performance guarantees [ding2017survey], [shao2018dynamic], [ali2016dynamic], [miuccio2020joint], [mostafa2019connection]. Moreover, owing to their lack of learning abilities, the computational complexity of traditional schemes becomes extremely high when long-term network stability is required; by default, traditional approaches cannot extract knowledge from a given problem (e.g., given distributions) online. Fortunately, the online learning properties of recently developed machine learning (ML) methods are well suited to such dynamic problems [8519960].

I-A. Related Works and Motivations
I-A1. Studies on NOMA-IoT Networks
Due to the aforementioned benefits, academia has produced numerous studies on resource allocation optimization in NOMA-enabled IoT networks. For single-cell scenarios, the authors in [shao2018dynamic] proposed a two-stage NOMA-based model to optimize the computation offloading mechanism for IoT networks [hussain2019machine]. In the first stage, a large number of IoT devices are clustered into several NOMA groups depending on their channel conditions. In the second stage, different power levels are allocated to users to enhance network performance. A comparison between uplink NOMA-IoT and OMA-IoT is presented in [zhang2016uplink], which considered the optimal selection of target data rates for each user. For downlink transmission, a similar topic was studied in [ding2014performance] and [hanif2016minorization]. Differently, the authors in [zhang2018energy] performed dynamic resource allocation for downlink NOMA using two-dimensional matching theory with energy efficiency in mind. Similarly, in [miuccio2020joint], dynamic resource management for the massive machine-type communications (mMTC) usage scenario, also known as the massive Internet of Things (mIoT), is performed in the sparse code multiple access (SCMA) domain using conventional mathematical tools. The authors in [yang2016general] proposed a general power allocation scheme for uplink and downlink NOMA to guarantee quality of service (QoS). In [zhai2018energy], NOMA scheduling schemes, in terms of power allocation and resource management, were optimized to realize massive connectivity in IoT networks. For multi-cell scenarios, the impact of NOMA on large-scale multi-cell IoT networks was investigated in [liu2017enhancing]. To characterize the communication distances, the authors in [8635489] analysed the performance of large-scale NOMA communications via stochastic geometry. It is worth noting that NOMA-IoT channels are time-varying in the real world.
Therefore, the study in [ali2018coordinated] considered a practical framework with dynamic channel state information for evaluating the performance of massive connectivity. The authors in [qian2018optimal], [shahab2019grant], and [dai2018survey] discussed the advantages of various NOMA-IoT applications. Interestingly, the proposed schemes introduced artificial intelligence (AI) methods to solve some practical challenges of NOMA-IoT systems. For both uplink and downlink scenarios, AI-based multi-constrained functions can be utilized to optimise multiple parameters simultaneously.
I-A2. Studies on ML-based NOMA Systems
Due to the dynamic nature of NOMA-IoT communications, traditional methods may not be suitable for such networks [mostafa2019connection]. ML-based methods, in contrast, are capable of handling the complex requirements of future wireless networks via learning. In [gui2018deep], a typical deep learning method, namely long short-term memory (LSTM) [hochreiter1997long], was applied to maximize user rates based on the received signal-to-interference-plus-noise ratio (SINR). In [xu2018outage], a successive-approximation-based algorithm was proposed to minimize outage probabilities by optimizing power allocation strategies. For next-generation ultra-dense networks, ML-aided user clustering schemes were discussed in [jiang2017machine] for obtaining efficient network management and performance gains. By using clustering schemes, the entire network can be divided into several small groups, which eases resource management [bi2015wireless]. Regarding AI-based clustering techniques, in [arafat2019localization] and [cui2018unsupervised], resources were assigned to the most suitable users to ensure the best QoS for unmanned aerial vehicle (UAV) networks and millimetre-wave networks, respectively. It is worth noting that the optimization of clustering is an NP-hard problem. Therefore, for such problems the authors in [gui2018deep], [jiang2017machine], and [liu2019machine] recommended using AI instead of conventional mathematical models. Currently, realistic datasets are not available for most machine learning algorithms, so designers use synthetic datasets for simulations. Because such a dataset is generated for a specific environment, it is difficult to capture the general properties and online scenarios of wireless networks. Therefore, algorithms like reinforcement learning play a very important role: data is collected online (during simulation) to learn the given search space under the simulation requirements. Various Q-learning variants have been used for NOMA systems. Due to their inefficient learning mechanisms, methods such as traditional Q-learning and multi-armed bandits (MABs) are heavily influenced by regret (negative reward) [li2020multi], [de2018comparing]. On the other hand, two of the most powerful methods, deep reinforcement learning (DRL) and SARSA learning, were developed by Google DeepMind [silver2017mastering] and by the authors in [rummery1994line], respectively. Both are efficient learners: owing to their unique learning behaviour, DRL and SARSA tend to receive more rewards. Their main advantage is the ability to handle dynamic control, as in [lillicrap2015continuous].
With the development of such RL techniques, challenges of NOMA systems that are difficult to solve via traditional optimization methods have been reinvestigated via RL-based approaches [xiao2017reinforcement, liu2019uav, yang2019reinforcement].

I-A3. Motivations
Combining multi-user relationships and resource allocation increases the complexity of NOMA-IoT systems and introduces new problems for optimizing power allocation and scheduling schemes. Unlike traditional methods [zhai2018energy], where only one BS is considered for a small-scale network with no inter-cell interference or dynamic user connectivity, the design of schedulers should go hand in hand with large-scale dynamic resource allocation and user decoding strategies. Due to the high complexity of the problem in multi-cell multi-user cases, AI is a feasible option for dynamic resource allocation [cui2017optimal]. For large-scale NOMA-IoT networks, an intelligent reinforcement learning (RL) algorithm becomes a promising approach to find the optimal long-term resource allocation strategy; such an algorithm should jointly optimize multiple criteria under dynamic network states. In this paper, our main goal is to address the following research questions:

Q1: In NOMA-IoT networks, how can the long-term sum rates of users be maximized for a given network traffic density?

Q2: How does inter-cell interference affect the long-term sum rates?

Q3: What is the correlation between traffic density, system bandwidth, and the number of clusters in NOMA-IoT networks?
As noted above, model-free methods are suitable for addressing multi-constrained long-term problems online. In the long term, these research questions correlate strongly with general problems in wireless networks: intermittent connectivity of IoT users (continuously joining and leaving the network), balanced resource allocation (an optimal allocation policy for dynamic network settings), and network traffic (the min-max number of users competing for the resource blocks). Specifically, Q1 (capacity maximization), Q2 (network scalability), and Q3 (long-term network performance) depend strongly on the main problems of balancing network resources, IoT users, and the dynamic network behaviour.
I-B. Contributions and Organization
This paper considers uplink NOMA-IoT networks, where multiple IoT users are allowed to share the same RB based on NOMA techniques. With the aid of RL methods, we propose a multi-constrained clustering solution to optimize the resource allocation among IoT users, base stations (BSs), and subchannels according to the received power levels of IoT users. Appropriate bandwidth selection for the entire system under different traffic densities is also taken into consideration to enhance generality. Our work provides several noteworthy contributions:

We design a model-free 3D association framework connecting IoT users, BSs, and subchannels. Based on this framework, we formulate a sum-rate maximization problem with multiple constraints. These constraints consider long-term variables in the proposed NOMA-IoT networks, such as the number of users, channel gains, and transmit power levels. To capture the dynamic (online) nature of the network, these variables may change at each time slot.

We propose two RL techniques, namely SARSA-learning and DRL, to solve this long-term optimization problem. SARSA-learning is used for light traffic scenarios to avoid high complexity and memory requirements. Heavy traffic scenarios with a huge number of variables are handled by DRL, where three different neuron activation mechanisms, namely TanH, Sigmoid, and ReLU, are compared to evaluate the impact of neuron activation on the convergence of the proposed DRL algorithm.

We design novel 3D state and action spaces to minimise the number of Q-tables for both the SARSA and DRL frameworks. The considered action space represents switching between RBs, which is the most efficient strategy for our networks. Based on this adequate Q-table design, DRL is able to converge faster.

We show that: 1) under the time-varying environment, resources can be assigned dynamically to IoT users based on our proposed framework; 2) for the proposed model, the chosen learning rate provides the best convergence and data rates; 3) for both SARSA and DRL, the sum rate is proportional to the number of users; 4) DRL with the ReLU activation mechanism is more efficient than with TanH and Sigmoid; and 5) IoT networks with NOMA provide better system throughput than those with OMA.
The rest of the paper is organised as follows: In Section II, the system model for the proposed NOMA-IoT networks is presented. In Section III, SARSA-learning- and DRL-based resource allocation is investigated, and the corresponding algorithms are presented. Finally, numerical results and conclusions are drawn in Section IV and Section V, respectively.
Symbol | Definition
— | Number of BSs; symbol of BSs
— | Number of subchannels (NOMA clusters); symbol of subchannels (NOMA clusters)
— | Number of users; symbol of users
— | Set of BSs
— | Set of subchannels (NOMA clusters)
— | Set of users connected to a BS via a subchannel; a user in that set
— | Clustering variable for a user connecting to a BS via a subchannel in a given time slot
— | Transmit power of a user in a given time slot
— | Channel gain of a user in a given time slot
— | Additive white Gaussian noise in a given time slot
— | Inter-cell interference in a given time slot
— | Instantaneous SINR of a user in a given time slot
— | Instantaneous data rate of a user in a given time slot
— | Rate requirement for the SIC process of a user
— | Maximal load of each subchannel; maximal power of each subchannel
— | Duration of the considered long-term communication
— | Matrix of clustering parameters; matrix of transmit powers
— | Vector of DRL gradients
— | Moment estimation decay rate
II. System Model
In this paper, we consider an uplink IoT network with NOMA techniques, as shown in Fig. 1, where BSs communicate with IoT users via orthogonal subchannels. Our model is dynamic in each time slot; for simplicity, the time index is omitted in later sections. Additionally, channel gains are dynamic at each time slot, even for the same user. The BSs and subchannels are indexed by their respective sets. Regarding users, the set of users served by one BS through a subchannel is defined accordingly, together with the number of intra-set users. BSs and users are assumed to be equipped with a single antenna. For each BS, the entire bandwidth is equally divided among the subchannels, so each subchannel has an equal share of bandwidth. In a time slot, we assume a part of the users are active and the remaining users keep silent. To share knowledge, we consider fiber links with ideal backhaul for inter-BS connectivity. The notation used in this system model is listed in TABLE I.
II-A. NOMA Clusters
Based on the principles of NOMA, two or more users can be served in the same resource block (time/frequency), forming a NOMA cluster. In this paper, each subchannel represents one NOMA cluster [kiani2018edge]. To simplify the analysis, we assume BSs have perfect CSI of all users. This CSI forms our state space, representing the signalling and channel conditions of IoT users connected to a subchannel via a base station; a detailed explanation is given in Sections III-B and III-C. Based on such CSI, BSs are capable of dynamically optimizing the subchannel allocation for active users over a long-term communication. For an arbitrary user, we define its clustering variable at a given time as follows:
(1) 
It is worth noting that the clustering variable also implies the activity status of users: if a user is inactive, its clustering variable equals zero. The set of clustering parameters is defined accordingly.
II-B. Signal Model
In a NOMA cluster, a BS first receives the superposed messages from the active users and then applies SIC to sequentially decode each user's signal [liu2016cooperative]. Without loss of generality, we assume an ordering of the channel gains within the cluster [803503]. The decoding order in this paper is therefore the reverse of the channel gain order [8680645]. In a time slot, the instantaneous signal-to-interference-plus-noise ratio (SINR) for an intra-cluster user is given by
(2) 
where
(3) 
and the transmit power of each user belongs to the given set of transmit powers [8626185]. The power of the thermal noise obeys the standard relation in which the resistor temperature, Boltzmann's constant, and the considered bandwidth determine the noise power; a fixed temperature (in K) is used in this paper. The inter-cell interference term is generated by the active users served by other BSs using the same subchannel. In uplink NOMA, the decoding of a user relies on the SIC process of its previous user. If the data rate required to successfully complete the SIC process is given, then when the decoding rate of the user obeys
(4) 
the data rate of user is given by
(5) 
Otherwise, if the decoding rate falls below this requirement, the decoding of all remaining users fails, and their data rates are zero.
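To make the SIC decoding chain above concrete, the following sketch computes per-user rates in one uplink NOMA cluster under the stated decoding order (strongest channel decoded first, not-yet-decoded weaker users treated as intra-cluster interference). The function name and argument layout are illustrative, not from the paper.

```python
import math

def uplink_noma_rates(powers, gains, bandwidth, inter_cell_i, noise, r_sic):
    """Sketch of uplink NOMA SIC decoding in one cluster.

    Users are decoded in descending order of channel gain; signals of
    not-yet-decoded (weaker) users act as intra-cluster interference.
    If a user's rate falls below the SIC requirement r_sic, decoding of
    all remaining users fails (their rates stay 0).
    """
    # sort users by channel gain, strongest first (decoding order)
    order = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    rates = [0.0] * len(gains)
    for pos, i in enumerate(order):
        # interference from users decoded later (weaker channels)
        intra = sum(powers[j] * gains[j] for j in order[pos + 1:])
        sinr = powers[i] * gains[i] / (intra + inter_cell_i + noise)
        rate = bandwidth * math.log2(1.0 + sinr)
        if rate < r_sic:        # SIC fails: remaining users undecodable
            break
        rates[i] = rate
    return rates
```

Note that the last decoded (weakest-received) user enjoys interference-free decoding, which is why its achievable rate can exceed that of earlier users despite a weaker channel.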
II-C. Problem Formulation
For a long-term communication with a given period, the number of active users differs across time slots. Given the maximal load of each subchannel, we assume the number of active users is uniformly distributed within the allowed range. Under this condition, the average long-term sum rate can be maximized by optimizing the clustering parameters and transmit powers. Therefore, the objective function is given by
(6a)  
(6b)  
(6c)  
(6d)  
(6e)  
(6f)  
(6g) 
where (6b) specifies the ordered channel gains based on perfect CSI, (6c) imposes the power constraint of each subchannel, (6d) ensures that all clustered IoT users can be successfully decoded to maximize connectivity, (6e) and (6f) limit the number of clustered users for the entire system and for each subchannel, respectively, and (6g) indicates that each user belongs to only one cluster. Problem (6a) is NP-hard even when a fixed number of users per cluster is considered instead of a dynamic range, especially in the case of (6c) and (6f). The proof is provided in Appendix A; it follows the idea in [cui2018optimal] and [8807386].
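As an illustration of the clustering constraints (6e)-(6g), a minimal feasibility check might look as follows; the dictionary representation of clusters is an assumption made for this sketch, not notation from the paper.

```python
def feasible(clustering, max_load):
    """Check the clustering constraints sketched in (6e)-(6g):
    each subchannel serves at most max_load users, and each user
    belongs to at most one cluster.

    clustering: dict mapping (bs, subchannel) -> set of user ids.
    """
    users_seen = []
    for members in clustering.values():
        if len(members) > max_load:      # (6f): per-subchannel load limit
            return False
        users_seen.extend(members)
    # (6g): no user may appear in more than one cluster
    return len(users_seen) == len(set(users_seen))
```

A scheduler would call such a check on every candidate 3D association before evaluating its sum rate.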
III. Intelligent Resource Allocation
III-A. Markov Decision Process Model for Uplink NOMA
In this section, we formulate user clustering and optimal resource allocation for uplink NOMA as a Markov decision process (MDP) problem. The problem transformations are shown in Fig. 2. A general MDP problem contains one or more agents, an environment, states, actions, rewards, and policies. The process starts with the interaction of an agent with a given environment. In each interaction, the agent takes an action following a policy based on its previous state. After processing the action, and according to the observed state, the agent receives a reward as feedback and moves to the next state. A reward can be positive (reward) or negative (penalty); it helps the agent find an optimal set of actions that maximizes the cumulative reward over all interactions. The Q-table acts as the brain of an agent: in basic RL algorithms, its main function is to store the states and the corresponding actions the agent can take in each state during the trials. SARSA and DRL are two promising RL methods for solving this MDP problem. SARSA learns the safest path; its policy is learned by estimating a state-value function, but it requires more memory for complex state spaces. DRL uses a neural network to approximate the Q-table, reducing memory requirements and handling more complex problems. Therefore, this work implements SARSA learning for light traffic, and uses DRL for heavy traffic scenarios to further reduce the impact of state-space complexity. Additionally, whenever the SARSA algorithm fails to provide an optimal policy for a given network traffic within a threshold number of trials, the final allocation is done using DRL. To summarize, this model follows the model-free, on-policy SARSA-learning algorithm (instead of value iteration and off-policy methods) for light traffic, and DRL for complex networks. The major advantage of the proposed algorithms is avoiding huge memory requirements (DRL) and learning the safest allocation policy (SARSA) under different traffic conditions.

III-B. SARSA-Learning Based Optimization for Light Traffic (2-3, 2-4 UEs)
As the name suggests, in this type of traffic scenario fewer users join and leave the network; in other words, the state space is not as large as in heavy traffic. Therefore, we use the SARSA learning algorithm to find the optimal long-term policy. Traditional Q-learning is less suitable for the long term because its update does not use the action actually taken in the next step, which does not fit our case. To efficiently utilise system resources, we use SARSA learning for light traffic and DRL for heavy traffic, where the state space is huge with dynamic users. For SARSA learning, the discount factor, the sum reward, and the number of iterations are significant hyper-parameters. The flow of the information update is shown in Fig. 3. The elements of the SARSA-learning 5-tuple (state, action, reward, next state, next action) are described below:

The state space is a finite set whose 3D dimensions determine the total number of states. Each state represents one subset of the 3D associations among users, BSs, and subchannels.

The action space is a finite set of actions that move the agent within the environment. The actions in this model are '−1', '+1', and '0'. The '−1' action reduces one of the elements of the state matrix; similarly, '+1' increments one of the state matrix elements; the last action, '0', represents no change in the current state of the agent (BSs). In effect, actions are swap operations between subchannels across all BSs. For example, when an agent takes the first action in (7), it performs a swap operation of a user between subchannels at a BS. In this model, agents have a fixed total number of swap operations between BSs and subchannels.
(7) 
The state transition function gives the expected probability of changing the current state into the next state by taking an action. The total number of actions for an agent corresponds to its eight swap operations, which comprise the '+1', '−1', and '0' actions; the agent selects suitable actions for the corresponding states to obtain an optimal state-action pair.

The reward set is a finite set of rewards, where a reward is obtained after transitioning to the next state by taking an action. The reward function indicates that, as a result of the chosen associations, the agent receives a reward according to the conditions specified in the reward function.

Multi-constrained reward function: the short-term reward in the proposed model depends on two conditions: 1) the sum rate, and 2) the state of the system, i.e., the total number of users associated with BSs and subchannels. The reward function can be expressed as follows:
(8) 
The next state of the agent is determined by the previous state, action, and reward pairs of the agent.

The next action is the next possible action that can be taken by the agent from the next state.
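The multi-constrained reward in (8) can be sketched as a small function. The 0/−10 values follow the description of Algorithm 1 later in this section; the threshold form of the "successful association" test is an assumption made for illustration (the exact conditions in (8) involve the achieved sum rate and the system state).

```python
def reward(sum_rate, rate_threshold, num_associated, capacity):
    """Sketch of the two-condition reward in (8). Assumed form: the
    association is 'successful' when the achieved sum rate meets a
    threshold and the number of associated users does not exceed the
    system capacity; otherwise a -10 punishment is returned."""
    if sum_rate >= rate_threshold and num_associated <= capacity:
        return 0       # successful association
    return -10         # punishment (failed association)
```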
Definition 1.
The parameters of the 3D state matrix define the total number of states through its dimensions. For all types of network traffic the minimum load is fixed, while the maximum load is defined separately for light traffic and for heavy network traffic.
Furthermore, the optimal policy over the aforementioned parameters can be discovered by the agent using the following function:
(9) 
where the result is the optimal policy. This function provides the optimal policy value for each state from the finite state set after taking the appropriate action. For a better understanding, the optimal policy can be defined as:
(10) 
The Q-table contains the state values and corresponding action values of the agent; the Bellman equation is utilised to update it during the optimization process. According to the Bellman equation, there is only one optimal solution strategy for each environment setting. The Bellman equation is defined as:
(11) 
where the discount factor balances historical and future Q-table values: the larger it is, the more weight is given to the future value, and vice versa. The learning rate works like a step size: a larger learning rate contributes to fast learning, but due to minimal experience it may result in non-convergence; similarly, if the learning rate is too small, it increases the time complexity of the system by leading to a slow learning process.
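A single on-policy SARSA update of the Q-table, following the Bellman-style rule in (11); the dictionary-based Q-table keyed by (state, action) is an implementation convenience for this sketch.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA update (the Bellman-style rule in (11)).
    Q is a dict keyed by (state, action); unseen pairs default to 0."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)   # uses a', not max
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q[(s, a)]
```

The key difference from Q-learning is visible in the first line: the target uses the action the agent will actually take next, not the maximum over all actions.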
Definition 2.
For Q-learning we define an exploration parameter to learn an ε-greedy policy over all state-action pairs.
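An ε-greedy selection rule consistent with Definition 2 might be sketched as follows: with probability ε the agent explores a random action, otherwise it exploits the best known action for the current state.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Epsilon-greedy action selection over a dict-based Q-table."""
    if random.random() < epsilon:
        return random.choice(actions)                              # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploit
```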
One main limitation of reinforcement learning algorithms is slow convergence due to the exploration requirement, which is especially challenging with a 3D state space and dynamic systems [watkins1992q], [melo2001convergence]. Due to the dynamic behaviour of IoT users, the 3D state and action spaces strongly influence the learning process, as they are the main components of the Q-table. The reward function and the reinforcement learning hyper-parameters guide the algorithm towards the optimal policy; in other words, the choice of reward function and hyper-parameter values is what allows the reinforcement learning agent to avoid a random walk. A random walk in the search space causes infinite exploration, resulting in no convergence. Therefore, we can propose the following conclusion.
Remark 1.
The selection of suitable rewards according to the system dynamics is critical for effective convergence to the optimal policy. Consequently, following (10), altering the reward function does not change the output of the RL algorithms, but the convergence towards the optimal policy is highly influenced.
The proposed protocols are capable of handling multi-constrained optimization problems for different network traffic scenarios. We use the SARSA-learning and DRL algorithms to explore and exploit the search space and find dynamic outcomes, so the proposed protocols successfully obtain the optimal clustering solution. The Q-table in our model contains solutions for all subsets (user associations) in the search space; in each episode, only a specific subset of users is active.
Remark 2.
In reinforcement learning, to find the best associations from the set of possible states, the agent converges towards the optimal state-action pairings with the highest probability. In this way, as the probabilities increase, the number of visits per state-action pair and the rewards increase as well.
Since an agent has a limited number of successful visits, the achieved rewards behave as described in Remark 1 and Remark 2. As a result, the agent successfully finds the optimal policy for the given system by taking the best actions.
III-B1. SARSA Algorithm
Based on the above discussion, we design Algorithm 1, which presents the key optimization stages of the SARSA algorithm for light-traffic networks step by step. The details are as follows:

Initialization: the system is initialized with the initial sets of users, BSs, and subchannels as the initial state. After this, we define the maximum number of clusters and the maximum number of users for each cluster. The brain of the agent is then initialized as a Q-table of the corresponding dimensions. The purpose of this initialization is to show that the brain of the agent needs training: after training, the Q-table contains values approaching zero in the best case, and vice versa. It also reflects that the proposed algorithm targets a maximization problem, where a larger Q-value means a better solution. The remaining initialization steps define the SARSA-learning parameters and the initial random association among IoT users, BSs, and subchannels.

Training: the key training steps are based on Q-table updates via the Bellman equation. The agent performs actions according to the given state of the environment, i.e., the 3D associations and cluster allocation. The agent picks new associations for the different active users in each episode; over all trials it tries to reach the optimal associations with the optimal sum rate. If the associations are successful, the agent receives a reward of 0; if they fail, a negative reward of −10 is given as a punishment. Based on the designed 3D 5-tuple (s, a, r, s′, a′), equation (11) is updated online. Performing online updates with (s, a, r, s′, a′), instead of (s, a, r, s′) as in traditional Q-learning, makes the online learning mechanism converge faster; in other words, the agent finds the optimal long-term online allocation policy more efficiently. These updates are computed over the maximum number of episodes and all trials to maximize the overall long-term average reward of the system.
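The training loop above can be sketched roughly as follows. The `env` interface (`reset`/`step`) is hypothetical and stands in for the 3D association environment; the 0/−10 rewards and the zero-initialized Q-table follow the algorithm description.

```python
import random

def train_sarsa(env, actions, episodes, trials,
                alpha=0.1, gamma=0.9, epsilon=0.1):
    """Sketch of Algorithm 1. Assumed interface: env.reset() -> state,
    env.step(state, action) -> (next_state, reward). The Q-table starts
    empty (all zeros), matching the 'untrained brain' initialization."""
    Q = {}

    def choose(s):
        # epsilon-greedy over the swap actions
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    for _ in range(episodes):
        s = env.reset()             # initial random 3D association
        a = choose(s)
        for _ in range(trials):
            s2, r = env.step(s, a)  # reward: 0 on success, -10 on failure
            a2 = choose(s2)
            # on-policy update uses the 5-tuple (s, a, r, s', a')
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
                r + gamma * Q.get((s2, a2), 0.0) - Q.get((s, a), 0.0))
            s, a = s2, a2
    return Q
```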
Definition 3.
In the 3D state matrix, from the set of possible states, the state is defined by the CSI of the proposed network, which is known to both reinforcement learning agents. Therefore, each reinforcement learning agent has perfect knowledge of the CSI for the whole network.
III-C. Deep Reinforcement Learning for Heavy Traffic (2-10 UEs)
In general, both online and offline Q-learning methods require large memory space to build the state of the system, while practical systems are high-dimensional and complex. For this reason, Q-learning is not suitable for large action spaces, which is a major drawback of conventional Q-learning methods. To overcome this, the DRL method adopts a deep neural network (DNN) to generate its Q-table by approximating the Q-values [silver2017mastering]. Therefore, DRL agents only need to memorize the network weights instead of reserving huge memory space for all possible state-action pairs; this is the main advantage of using a DNN. More specifically, the optimization of the Q-table in conventional Q-learning is equivalent to the optimization of the DNN weights in DRL, with low memory requirements. The weight updates are based on the history of states, actions, and reward values; they arise from the DRL agent's interactions with the environment, through which it learns the relationship among the different actions and states by continuously observing the given environment.

The state space here is unique and is used as the input of the DNN. Each state is a combination of multiple subsets of 3D associations among users, BSs, and subchannels, and also includes the current rewards of the system, as the instantaneous reward and the average reward from previous iterations.

The reward of the system comprises an instantaneous reward, similar to the SARSA algorithm, and the long-term average reward for the current time slot.

The action space is a multi-dimensional matrix of actions. For the DRL algorithm, the action mechanism has two main parts: an allocation strategy, described as the switching strategy, and an association strategy for the optimization process. The switching mechanism is similar to SARSA and is used for the DRL channel switching process; the second strategy results from the selected switching strategy and denotes an index into the 3D associations among users, BSs, and subchannels. Finally, the DRL agent uses the loss function in (12) to update the weights based on previous experience.
(12)
where
(13)
and the target Q-values come from the target DNN. For improved training, the target network is generally updated slowly; it therefore remains fixed until the target-network update threshold is reached.
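A minimal sketch of the loss in (12)-(13) with a separate target network. Here `q_net` and `target_net` are hypothetical callables mapping a state to a list of per-action Q-values; a real implementation would use a deep learning framework.

```python
def dqn_loss(q_net, target_net, batch, gamma):
    """Mean squared error between predicted Q-values and targets built
    from the slowly updated target network, as in (12)-(13)."""
    loss = 0.0
    for (s, a, r, s_next) in batch:
        target = r + gamma * max(target_net(s_next))   # (13): target value
        pred = q_net(s)[a]                             # predicted Q-value
        loss += (target - pred) ** 2
    return loss / len(batch)
```

Keeping `target_net` frozen between threshold updates stabilizes the regression targets, which is the reason the paper updates the target network slowly.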
The DRL agent uses the gradient descent method on (12) to reduce the prediction error by minimizing the loss function. The weight update, based on each new experience, is provided in (16), and the Q-value updating function is defined in (18), namely the DRL Bellman equation.
(14)  
(15)  
(16) 
where the Q-function gives the Q-values and the long-term reward calculation for DRL, based on the discount factor and the optimal DRL policy mentioned below.
(17) 
which represents the optimal policy for the DRL algorithm. This function provides the optimal policy value for each state from the finite state set after taking the appropriate action.
(18) 
which shows the Q-value update according to the DRL Bellman equation.
(19)  
(20) 
where (19) represents the state of the DRL agent and (20) shows the activation mechanism for each neuron layer, based on the weights for the depth of neurons and a bias term. In this model, the input of the DRL algorithm is the instantaneous network observation. This state is sent to the neurons of the neural network to obtain as final output a set of Q-values, one for each action. For the DRL framework, the size of the output action set is the same as in SARSA. We use a replay memory as the experience of the DRL agent, storing the experience tuple for every time step in an experience dataset of fixed size. When the dataset is full, the oldest tuple is removed to free space for the new experience; this updating reflects the sequential exploration of the DRL framework. However, consecutive tuples are highly correlated, which increases the variance of the updates. Therefore, to obtain a more general output, the update process is performed on randomly sampled tuples instead of the current tuple, so that the sampled experiences are closer to independently and identically distributed.
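The replay-memory behaviour described above (fixed capacity, oldest-first eviction, uniform random sampling) can be sketched as follows; the capacity and tuple contents are illustrative:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size experience buffer: oldest tuples are dropped when full."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # deque evicts the oldest entry automatically

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive
        # tuples and reduces the variance of the updates.
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=3)
for t in range(5):                 # push 5 tuples into a capacity-3 buffer
    memory.push(t, 0, 1.0, t + 1)
print(len(memory.buffer))          # -> 3 (the two oldest tuples were evicted)
print(memory.sample(2))            # a random minibatch of 2 tuples
```

The `deque(maxlen=…)` choice keeps eviction O(1); a plain list with explicit `pop(0)` would behave the same but cost O(n) per eviction.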
Definition 4.
The DRL design in this work is defined with two main elements: the first is the target Q-network, and the second is the state-transition mechanism, which is used to construct the minibatches for experience replay from the dataset to train the DNN.
Remark 3.
The convergence speed of the proposed algorithm varies with the randomly selected initial 3D association (state). In this model, the state space consists of allocation strategies that include subsets of all possible associations of active users for each subchannel at each episode.
Based on the above discussion, we design Algorithm 2 for the main optimization stages of the DRL algorithm under heavy traffic. The details of the algorithm are as follows:
III-C1 DRL

Initialization: In this stage, parameter initialization is performed, similar to the initialization step of SARSA. However, instead of state-action pairing, the weight matrix is initialized for DRL to find the optimal policy.

Pretraining: In this stage, initial actions are selected using a uniform random distribution as an initial state space in the continuous environment. In this way, initial weights are also calculated to start the optimization process.

Loss Calculation: Equation (12) calculates the loss as the mean squared error (MSE) between the target and prediction networks. To optimize these values we use the Adam optimizer, which minimizes the loss and thereby improves the predicted Q-values for each episode; as a result, the DRL framework converges faster even in a huge state space. In (13), we calculate the target Q-values based on tuples from the minibatch, and the minibatch is updated every 100 iterations.
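A minimal numeric sketch of the target computation in (13) and the MSE loss in (12), assuming a standard discounted TD target (the discount value, array shapes, and episode-termination handling are illustrative assumptions):

```python
import numpy as np

GAMMA = 0.9  # illustrative discount factor, not from the paper

def td_targets(rewards, next_q_target, done):
    """Targets y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at episode end."""
    return rewards + GAMMA * next_q_target.max(axis=1) * (1.0 - done)

def mse_loss(predicted_q, targets):
    """MSE between the prediction network's Q-values and the targets, as in (12)."""
    return float(np.mean((targets - predicted_q) ** 2))

rewards   = np.array([1.0, 0.0])
next_q    = np.array([[0.5, 2.0],      # Q_target(s', .) for each minibatch sample
                      [1.0, 0.0]])
done      = np.array([0.0, 1.0])       # the second sample ends its episode
predicted = np.array([2.0, 0.5])

y = td_targets(rewards, next_q, done)  # [1 + 0.9*2.0, 0.0] = [2.8, 0.0]
print(y, mse_loss(predicted, y))
```

In practice an optimizer such as Adam would then take a gradient step on this loss with respect to the prediction network's weights.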

DRL Updates: The update functions for the DRL prediction and the long-term reward calculation are shown in (14) to (16), where DRL agents obtain the rewards and the prediction loss after every transition to the next state in search of the greedy policy. Additionally, the discount factor has a significant impact on this search because, as mentioned in the previous paragraphs, it determines whether the agent favors immediate or previous Q-values. The policy is calculated using (17) to maximize the Q-values by greedy search, and the DRL Bellman equation is computed over the states in (18).

Sparse Activations: ReLU is used as the activation function to obtain sparse activations in the DNNs. Sparse activations help agents converge efficiently by avoiding useless neuron activations; the outcome of sparsity is shown in the results section, comparing sparse ReLU, Sigmoid, and TanH. In (20), the activations are performed over the neurons with the given index: for each neuron, the state of the system is used as an input that is multiplied by the corresponding weight, and a bias value is added before activation. In the next steps, the current states, actions, and rewards are added to the minibatch for experience replay (for self-training), and the agent receives the next states from the minibatch. Initially the learning process of the agent is based on pretraining, but once the minibatch is full, the agent learns the optimal policy through the experience-replay mechanism with the help of minibatch processing.
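The sparsity effect of ReLU relative to Sigmoid can be shown with a quick numeric sketch (the sample size and input distribution are illustrative assumptions): ReLU zeroes every negative pre-activation, so a large fraction of neurons are exactly inactive, while Sigmoid leaves every neuron at least slightly active.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
pre_activations = rng.normal(size=10_000)  # zero-mean pre-activations (illustrative)

relu_out = relu(pre_activations)
sig_out = sigmoid(pre_activations)

# ReLU: roughly half the outputs are exactly zero; Sigmoid: none are.
print(np.mean(relu_out == 0.0))   # close to 0.5
print(np.mean(sig_out == 0.0))    # 0.0
```

This exact-zero structure is what lets downstream computation skip inactive neurons, which is the convergence benefit the text attributes to sparse activations.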
Neural Networks: This paper uses a DRL framework built with two DNNs, as shown in Fig. 4: 1) a training network that learns the policy, and 2) a target network that computes the target Q-values for every update, each with its own set of weights. For the training network, the weights are predicted based on the current state and action; the target-network weights are based on previous episodes and remain fixed while the training targets are calculated. Additionally, we utilize the MSE loss function (12) to evaluate the training accuracy against the target network; the loss is therefore based on both sets of weights and measures the deviation of the predicted DNN weights.
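The two-network arrangement with a slow target update can be sketched as follows (the sync interval, weight shapes, and update rule are illustrative assumptions, not the paper's values):

```python
import numpy as np

SYNC_EVERY = 100  # illustrative target-network update threshold

train_weights  = {"W": np.zeros((2, 2))}
target_weights = {"W": np.zeros((2, 2))}

def train_step(step):
    # The training network is updated continuously (a dummy increment here
    # stands in for a real gradient step) ...
    train_weights["W"] += 0.01
    # ... while the target network is copied only every SYNC_EVERY steps,
    # keeping the regression targets fixed in between.
    if step % SYNC_EVERY == 0:
        for k in train_weights:
            target_weights[k] = train_weights[k].copy()

for step in range(1, 251):
    train_step(step)

# The target network lags the training network between syncs.
print(round(float(train_weights["W"][0, 0]), 2),
      round(float(target_weights["W"][0, 0]), 2))
```

Freezing the targets between syncs is what stabilizes the regression problem that the MSE loss in (12) defines.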

Output: Finally, the output of this algorithm is the optimal policy for all clusters that maximizes the overall long-term sum rate.
Definition 5.
We use the ReLU activation function, max(0, x), where x is the input to a neuron, for the DRL performance evaluations. A ReLU network of a given density, with a given number of hidden layers each of a given width, can be represented accordingly for any positive number of layers.
In this definition, the neural network is constructed from the weights of each layer together with the activation of each neuron. The mesh structure of the neural network remains fixed in this model; the two main neural network parameters are learned, in addition to the activation function and the input of the neural network. Bias terms are added to the input of the DNN as a shift value. To optimize our dynamic objective function, an ε-greedy search agent is used; with its help, the DRL agent receives higher rewards.
Remark 4.
To avoid useless visits, the ε-greedy policy provides balanced exploitation: it exploits in most cases and sometimes takes a random action to explore the environment in search of different solutions.
For DRL, unbalanced random actions cause large error propagation, so the ε-greedy policy is suitable for achieving efficient learning in a dynamic state space. Note that the boundary for the policy selection is between 0 and 1: for ε close to 0 the policy becomes the greedy policy, and for ε close to 1 the agent explores more.
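The ε-greedy selection described here can be sketched as follows (the ε value and Q-values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (random action); otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit (greedy)

q = np.array([0.1, 0.9, 0.3])
actions = [epsilon_greedy(q, epsilon=0.1) for _ in range(1000)]
print(actions.count(1) / len(actions))  # mostly the greedy action (index 1)
```

With ε = 0.1, roughly 90% of the selections are greedy, matching the "exploits in most cases" behaviour above; ε is often decayed over episodes so the agent explores early and exploits later.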
Definition 6.
(Sparsity for ReLU DNNs): The sparsity of a ReLU network is a weight-based sparsity: sparse ReLU networks are those whose number of nonzero weights is bounded across all layers, including every hidden layer.
Here the notation follows the function construction of Definition 5, applied elementwise to the weights.
III-D Complexity
The complexity of the proposed model depends on the number of BSs, the total number of subchannels, and the number of communicating users. In the proposed scheme, the simulated experiments are based on different examples; this paper considers one setting for light traffic and one for heavy traffic. These examples are association decisions for each user and subchannel at a BS that receives signals on the channels from the users. The computational complexity of SARSA learning is a bounded number of operations per update, with the memory requirement of a Q-table to simulate the brain of the learning agent(s). The complexity of DRL is similar but with a much smaller Q-table, and DRL uses a 1D experience replay containing the state vector (19) instead of the huge memory requirement of traditional Q-learning. The benchmark scheme considered in this work is a memoryless method that shows the maximum achievable rate by exhaustively searching all possible combinations of 3D associations; consequently, it requires far more operations, and its computational complexity increases exponentially.
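To illustrate why the exhaustive benchmark scales exponentially, here is a rough combinatorial sketch; the counting model (choosing up to a load limit of users per subchannel, independently per subchannel) is an illustrative proxy, not the paper's exact search space:

```python
from math import comb

def exhaustive_associations(n_users, n_channels, max_load):
    """Rough count of user-to-subchannel association combinations when
    each subchannel serves between 1 and max_load users (illustrative)."""
    per_channel = sum(comb(n_users, k) for k in range(1, max_load + 1))
    return per_channel ** n_channels

# The search space blows up quickly as the number of users grows.
for users in (6, 10, 20):
    print(users, exhaustive_associations(users, n_channels=2, max_load=3))
```

Even this simplified count grows combinatorially in the number of users and exponentially in the number of subchannels, which is why the learning-based schemes are attractive despite giving only near-benchmark rates.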
Load balancing factor values per resource block  2-3, 2-4, 2-10
Total number of trials  500
Total number of time steps  500
Bandwidth  
Gain  [zhang1999finite]
Optimizers  SARSA, DRL (Adam)
Deep neural network activations  Sigmoid, TanH, ReLU
IV Numerical Results
In this section, simulation results are provided for the performance evaluation of the proposed multi-constrained algorithms, which are tested under different network settings to solve the 3D association among users, BSs, and subchannels, as well as sum-rate optimization under different network traffic. For the simulations, we consider two different traffic-density threshold values to analyse the impact of network load, with various power levels, on the sum rate and the 3D associations. Additionally, the network load in our case represents the load of each resource block rather than the total number of users in the network; a maximum network load of 10 with two RBs for each BS therefore determines the total number of users in the network. To show the significance of the available channel bandwidth, we start with a minimum channel bandwidth of 60 kHz and then increase it to 120 kHz under different network traffic conditions. The hardware and software used for the experiments are an Intel Core i7-7700 CPU at 3.60 GHz with 16 GB of RAM and a 64-bit operating system (Windows 10); all experiments are simulated using MATLAB R2019a and Python 3.6. For both algorithms we use the numbers of episodes and iterations per episode listed in Table II. Similarly, the learning rate and the exploration rate are the significant hyperparameters of the proposed algorithms. We use the load-balancing factor values per resource block to set the maximum and minimum user connectivity for each resource block, and the channel gain for each user is defined as in [zhang1999finite]. For DRL, additional parameters are tuned, such as the MSE loss, activation functions, batch size, optimizers, experience-memory size, pretraining length, and the number and size of hidden units. We use ReLU, Sigmoid, and TanH as activation functions with two hidden layers, and the Adam optimizer is utilized for the optimal convergence of DRL.
IV-A Convergence vs Sum Rate vs Traffic Density
Fig. 5 shows the interrelations among the four measures of convergence. It is apparent from this figure that as the traffic density increases, convergence becomes slower, and vice versa. DRL converges better under heavy traffic with the maximum allocation capacity/load, which makes it more suitable for scenarios with high traffic densities. Secondly, another interesting insight is that SARSA performs better with one learning-rate value than with the other. The convergence of Adam depends on the DRL weights, which lie in a feasible set at every step.
Definition 7.
The gradients of the objective function are bounded, and the distance between successive iterates generated by the Adam optimizer is bounded for any step, with the bias terms satisfying the stated condition. Let the learning rate of the Adam optimizer and the bias term for each step be given; then, for all steps, Adam satisfies the following condition [kingma2014adam]:
(21) 
The results obtained from the primary analysis of sum rate and traffic densities in long-term settings are shown in Fig. 5: it is clearly visible that the proposed model performs close to the benchmark scheme and better than OMA. Fig. 6 shows a short-term performance analysis of the sum rates against bandwidth and the number of iterations. This figure illustrates the performance of DRL and SARSA for different bandwidths, where DRL performs better than SARSA. Interestingly, it also shows that as the traffic density increases, the sum rate also increases; the sum rate is therefore proportional to the number of users/traffic density in this case. Furthermore, Fig. 5 shows that even under light traffic conditions the sum rate of NOMA systems is higher than that of OMA. Lastly, Fig. 6 shows the long-term user connectivity over the simulation time, where it is clearly visible that NOMA is more efficient for user connectivity, serving more users than OMA. From this figure we can see that connectivity improves as the reinforcement learning agents, specifically the DRL agent, learn the dynamic environment; the number of served users increases significantly after 150 episodes of learning. Within 200 episodes, the total number of served users is more than 3000 for DRL NOMA and more than 1000 for SARSA NOMA.
IV-B DQN Loss vs Rewards
Fig. 7 shows the loss (MSE) of the DRL algorithm for three well-known activation functions (ReLU, TanH, and Sigmoid). It can be seen that ReLU performs better than both Sigmoid and TanH; Sigmoid and TanH perform relatively better only in the initial steps, when the DRL agent has little experience. Once the DRL agent gains experience through exploration and exploitation of the given environment, the outcome of the DRL algorithm changes accordingly. The loss (y-axis) for all activation functions decreases with the number of episodes (x-axis), and the figure confirms that the DRL algorithm is most efficient when ReLU activation is used. Fig. 7 also provides summary statistics of the achieved average rewards for the three activation functions, from which it is apparent that the DRL algorithm with ReLU outperforms the Sigmoid and TanH activations. Combining the loss and reward results in Fig. 7, another interesting outcome is that as the loss improves (decreases), the rewards improve (increase) as well, so loss and reward are closely coupled. Lastly, DRL with Sigmoid activations is the second best up to 200 episodes; beyond 200 episodes, TanH performs better than Sigmoid.
IV-C Clustering Time
Fig. 8 compares the average clustering time in seconds for the DRL and SARSA algorithms under different types of traffic and different learning rates. The learning rate is a significant hyperparameter of RL algorithms, governing how the agent explores and exploits the given environment. From the figure, it can be seen that the learning rate has no large effect on the clustering time (y-axis) in any of the scenarios with the current hyperparameters; however, if it is not tuned together with the other hyperparameters, the learning rate can negatively influence the learning process. With improper tuning the learning process becomes unbalanced, and the agent can search for a solution for an unbounded amount of time. Lastly, the clustering time increases, but not significantly, when the maximum load is increased from 3 to 10.
V Conclusion
This paper has proposed resource allocation for IoT users in the uplink transmission of NOMA systems. Two algorithms, DRL and SARSA, have been designed to determine the effect of three different traffic densities on the sum rate of IoT users. In order to improve the overall sum rate under different numbers of IoT users, we have formulated a multidimensional optimization problem using intelligent clustering based on RL algorithms, with several interesting outcomes. Firstly, the simulation results have indicated that the proposed technique performs close to the benchmark scheme in all scenarios. The second major finding is that this framework provides a long-term guaranteed average rate with long-term reliability and stability. Thirdly, it has been shown that DRL is efficient in complex scenarios. Additionally, we have shown that sparse activations improve the performance of the DNNs compared to the traditional mechanisms; therefore, DRL with sparse activations is suitable for heavy traffic, while SARSA is suitable for light traffic conditions. Furthermore, both algorithms (DRL and SARSA) obtain better sum rates than OMA systems. Lastly, further research will explore performance improvements under different network scales.
Appendix A Proof of Problem (6a)
With the aid of the theory of computational complexity, we use the following two steps to prove that problem (6a) is NP-hard. Step 1: the association problem for every subset of