The number of machine-type communication (MTC) devices and the corresponding mobile data volume have grown rapidly with the development of smart metering, smart traffic surveillance, environmental monitoring, smart grid, and other Internet of Things (IoT) applications. More than billion machine-type devices are anticipated to be connected to the Internet by 2023 . Because channel conditions are too costly for low-power machines to measure and update, random access (RA) has attracted great attention for MTC networks .
Non-orthogonal multiple access (NOMA), a promising technology for 5G and beyond, allows more than one device to share the same time-frequency resource block, which improves spectral efficiency considerably . Recently, the notion of NOMA has been applied to slotted ALOHA systems to achieve higher throughput in MTC networks [2, 3]. Specifically, the transmission power of each device is tuned according to the channel state information (CSI) so that the received signal strength at the receiver equals one of a set of predefined values. By letting each device choose among these power levels with different probabilities, NOMA significantly improves network throughput because collisions can be resolved via successive interference cancellation (SIC). However, owing to power limitations, IoT devices may face various constraints in a realistic IoT network. For instance, some IoT devices may not be able to transmit at all power levels. To analyze this scenario, the author in  developed an analytical model in which two types of devices are considered and two power levels are available at the receiver side. Based on the analytical model, two algorithms were proposed to find the transmission probabilities that attain the maximum throughput and max-min fairness, respectively. However, the analytical model requires knowledge of the number of devices of each type, and the proposed algorithms cannot be applied when more than two power levels are available in the system because of the complexity of the formulation.
Machine learning (ML) plays a very important role in daily life these days. Reinforcement learning (RL), a promising solution to many ML problems, has been applied extensively to NOMA-based MTC networks in recent years. To minimize random access channel collisions, a Q-learning algorithm has been implemented at each MTC device to dynamically select the RA slot and transmit power for its transmission . The authors in  extended the work in  by further considering short-packet communication and imperfect successive interference cancellation. In , the case with unsaturated traffic was further considered. However, the Q-learning methods proposed in [5, 11, 9] cannot handle the situation in which the number of supported devices changes dynamically, because of the limitation of the Q table. Besides, fairness between devices cannot be guaranteed. To tackle these problems, a NOMA-based slotted ALOHA scheme may better suit the MTC network. The author in  applied a reinforcement learning method to an adaptive NOMA-based p-persistent slotted ALOHA protocol. However, the reversed power control was not considered, which may cause performance degradation.
In this work, we study the network throughput performance of NOMA-based slotted ALOHA in an MTC network. To capture the realistic power constraints of IoT devices and stochastic wireless fading channels, truncated channel inversion power control is considered. We first analyze the power level design strategy in the MTC network. For a given power level design, devices in the network are categorized into different types: some types of devices are able to utilize all power levels, while other types can only use the lower power levels. To guarantee fairness between different types of devices, two optimization problems are formulated, namely, maximizing the minimum long-term expected throughput and maximizing the geometric mean of the long-term expected throughput of the devices in the network. To solve the optimization problems effectively, a stateless deep RL approach is proposed. To the best of our knowledge, our approach is the first to optimize the network performance of NOMA-based slotted ALOHA in an MTC network with truncated channel inversion power control in which more than two types of devices exist. Extensive simulations have been conducted to validate the performance of the proposed approach.
The remainder of the paper is organized as follows. Section II describes the system model. The power level design analysis is presented in Section III. In Section IV, the optimization problems are formulated and a deep RL approach is proposed to solve them. The performance of the proposed approach is validated and analyzed in Section V, followed by concluding remarks and future work in Section VI.
II System Model
As depicted in Fig. 1, we consider a single-cell uplink MTC network in which an access point (AP) is located at the center of a circular coverage area with a radius of meters, and multiple IoT devices are randomly distributed in its coverage area. The number of IoT devices may change dynamically to capture the mobility feature of the MTC network.
During the uplink transmission, IoT devices send data to the AP using a NOMA-based p-persistent slotted ALOHA protocol. Specifically, time is slotted, and power-domain NOMA is exploited to allow the AP to receive data from multiple IoT devices in each time slot. During each transmission, an IoT device can adjust its transmission power so that the received signal strength at the AP belongs to a fixed predefined set where , and . Because IoT devices have limited transmission power, truncated channel inversion power control is considered in this scenario. That is, IoT devices in the coverage area may not be able to adjust their transmission power to achieve all power levels. Therefore, types of devices can exist in the system. The first type of devices can only utilize the power level as their received signal strength at the AP. The second type of devices can use all power levels less than or equal to , and so on. Thus, the -th type of devices can exploit all power levels. Let be the set of IoT devices in the system, in which denotes the set of -th type IoT devices and .
Similar to the p-persistent slotted ALOHA protocol, a transmission probabilities matrix is introduced to guide the uplink transmission of IoT devices. The transmission probabilities matrix is updated by the AP and broadcast to all devices after every time slots. If we use to indicate the index of the power level, the probability that a -th type device transmits using the -th power level is denoted as . Since the total transmission probability of each type of devices is less than or equal to 1, we have . Moreover, -th type devices cannot transmit with the -th power level when . Thus, the transmission probabilities matrix is a triangular matrix, which can be written as
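The role of the triangular matrix can be illustrated with a minimal Python sketch. All names here (`is_valid_matrix`, `choose_action`, the zero-based indexing, and the example values) are hypothetical and only mirror the constraints stated above: entries are nonnegative, each row sums to at most 1, and a device of type i cannot use a level with index above i.

```python
import random

def is_valid_matrix(matrix, tol=1e-9):
    """Check the constraints described in the text (illustrative helper)."""
    for i, row in enumerate(matrix):
        for k, p in enumerate(row):
            if p < 0:
                return False
            if k > i and p != 0:      # level k is unreachable for type i
                return False
        if sum(row) > 1 + tol:        # total transmission probability <= 1
            return False
    return True

def choose_action(matrix, device_type, rng=random):
    """Return the power-level index a device uses in this slot, or None if it stays silent."""
    u = rng.random()
    cumulative = 0.0
    for level, p in enumerate(matrix[device_type]):
        cumulative += p
        if u < cumulative:
            return level
    return None                        # remaining probability mass: no transmission

# Example with three device types and three power levels (values are made up).
example = [
    [0.5, 0.0, 0.0],
    [0.2, 0.3, 0.0],
    [0.1, 0.1, 0.1],
]
```

A device only needs its own row of the matrix, which is consistent with the AP broadcasting one matrix to all devices and each device knowing its own type.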
Multiple devices may transmit simultaneously during each time slot in a random access network. If we use to denote the set of devices transmitting by using as received power, the received signal at the AP side can be written as
where denotes the transmitted signals of the device , and is the background noise.
The AP can decode signals sequentially by applying the SIC technique in descending order of signal strength. Specifically, the AP starts decoding from the signal with the highest received power, under the interference from all other signals that are transmitted concurrently. Without loss of generality, we assume that the AP can decode a signal successfully only when its SINR is larger than or equal to a threshold . Once the signal has been successfully decoded, it is canceled by the AP, so the remaining signals are not interfered with by it. The throughput of a signal that has been decoded successfully is given by
In contrast, if the signal has not been decoded successfully, it cannot be canceled and will interfere with the decoding of the following signals. The corresponding throughput of the signal is thus . It is worth noting that, since the AP decodes signals in descending order of signal strength, once a signal with a high power level cannot be decoded successfully, none of the following signals can be decoded successfully either. It is possible that more than one device transmits using the same power level. In this case, the AP decodes their signals sequentially in random order.
If all signals before the -th decoding signal with the power level have been decoded and canceled successfully, the interference comes from all following signals that have not yet been decoded. In this case, the SINR of the signal can be written as
in which denotes the number of devices in the corresponding device set and , is the normalized background noise.
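The decoding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: received power levels are listed in ascending order, `counts[k]` is the number of devices received at `levels[k]`, the noise is normalized, and the function name `sic_decode` is hypothetical. The SINR of the signal being decoded is its power divided by the sum of all not-yet-canceled powers plus noise, and the first failure stops the whole process.

```python
def sic_decode(counts, levels, threshold, noise=1.0):
    """Return how many signals are decoded at each power level in one slot.

    counts[k]: number of devices received at power levels[k] (levels ascending).
    threshold: minimum SINR required for successful decoding.
    """
    decoded = [0] * len(levels)
    # Total received power of all signals that have not been canceled yet.
    remaining = sum(c * p for c, p in zip(counts, levels))
    for k in reversed(range(len(levels))):          # strongest level first
        for _ in range(counts[k]):
            interference = remaining - levels[k]    # everything except this signal
            sinr = levels[k] / (interference + noise)
            if sinr < threshold:
                # An undecodable signal cannot be canceled, so every
                # following (weaker or equal) signal fails as well.
                return decoded
            decoded[k] += 1
            remaining -= levels[k]                  # cancel the decoded signal
    return decoded
```

For example, with levels `[1, 2, 4]`, threshold 1 and unit noise, one device per level is fully decodable, whereas two devices on the highest level collide and block the slot.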
III Power Level Analysis
In this section, we analyze the constraints on the power levels after inversion power control. To take advantage of NOMA, the predefined power level set needs to satisfy
If we use to indicate the maximum achievable power level of the system, the maximum number of power levels we can have is given by
in which denotes the largest integer less than or equal to .
To find the maximum number of power levels, the gap between adjacent power levels should be as small as possible. Under this circumstance, equation (3) can be rewritten as follows,
Thus, we have . To ensure ,
It is also possible to design the power levels in other ways as long as equation (3) is satisfied, and the proposed deep RL method can still solve the problem. However, in the remainder of this paper, we focus on the case that , which allows us to utilize more power levels in the MTC network.
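A small numerical sketch may help make the minimum-gap design concrete. Since the exact form of the constraint is not reproduced here, the code assumes one common reading: each level must meet the SINR threshold when exactly one device occupies every lower level and noise is present. Under that assumption the smallest feasible levels form a geometric sequence. The function name and parameters are hypothetical.

```python
def min_gap_levels(threshold, noise, max_power):
    """Build the tightest set of received power levels not exceeding max_power.

    Assumption (not taken from the paper's equations): level k must satisfy
    level_k / (sum of all lower levels + noise) >= threshold, with equality
    chosen so that the gaps are as small as possible. Requires threshold > 0.
    """
    levels = []
    lower_sum = 0.0
    while True:
        level = threshold * (lower_sum + noise)
        if level > max_power:
            break
        levels.append(level)
        lower_sum += level
    return levels
```

With this construction the k-th level equals `threshold * (1 + threshold) ** (k - 1) * noise`, so the number of usable levels grows only logarithmically with the maximum achievable power.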
IV Proposed RL Method
IV-A Optimization Problems
Our goal is to maximize the network performance by tuning the transmission probabilities matrix . The most commonly used performance metric for IoT networks is the total expected throughput. However, total expected throughput and fairness are generally conflicting performance metrics in heterogeneous IoT networks. Specifically, devices belonging to the same type can achieve long-term fairness in a random access network, yet devices belonging to different types use different transmission probabilities and achieve different expected throughput. NOMA transmission may favor the type of devices that helps to achieve the highest total expected throughput and stop the transmission of the other types of devices to avoid channel congestion. To account for fairness, we consider two optimization problems. The first objective considered in this paper is to achieve max-min fairness among the devices in the network. In this case, we formulate the decision problem of tuning as an optimization problem
in which is the expected throughput of device over time slots.
The geometric mean of the expected throughput of all devices, on the other hand, is a performance metric that also takes fairness into account. It is zero if any device in the network has no chance to transmit. As the geometric mean increases, we ensure that no device is starved of transmission opportunities and that the expected throughput of most devices increases. The corresponding optimization problem can be written as
where is the number of devices in the IoT network.
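The two fairness-oriented objectives can be compared on a per-device throughput vector with a short Python sketch. The function names and sample numbers are illustrative only and do not come from the paper's experiments.

```python
import math

def min_throughput(throughputs):
    """Max-min objective: the throughput of the worst-off device."""
    return min(throughputs)

def geometric_mean(throughputs):
    """Geometric mean of per-device throughput; zero if any device is starved."""
    if any(t <= 0 for t in throughputs):
        return 0.0
    # Averaging logarithms avoids overflow/underflow of a long product.
    return math.exp(sum(math.log(t) for t in throughputs) / len(throughputs))

def arithmetic_mean(throughputs):
    """Total expected throughput divided by the number of devices."""
    return sum(throughputs) / len(throughputs)

# A starving allocation can have a high arithmetic mean but a zero geometric mean.
starving = [0.0, 0.0, 0.9]
balanced = [0.2, 0.2, 0.2]
```

Here `arithmetic_mean(starving)` exceeds `arithmetic_mean(balanced)`, while both `geometric_mean` and `min_throughput` prefer the balanced allocation, which is the conflict between total throughput and fairness described above.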
The objective functions in and are mathematically intractable. Therefore, we propose a data-driven approach where a policy-based deep RL agent is applied at the AP to learn the transmission probabilities matrix automatically.
IV-B Deep RL Basics
The goal of an RL approach is to find an optimal strategy, i.e., a sequence of actions that maximizes the long-term expected accumulated discounted reward. Policy-based RL methods, specifically, are well known for addressing tasks with continuous action spaces. Several policy-based RL algorithms have been developed recently, e.g., REINFORCE, trust region policy optimization (TRPO), deep deterministic policy gradient (DDPG), and proximal policy optimization (PPO). Among these algorithms, PPO has drawn great attention because it is efficient and easy to implement and tune .
In policy based RL algorithms, the most commonly used estimator is given by
where and represent the state and action at time , respectively, represents the policy at time , and is the parameter of the actor neural network that is used to generate the policy. Given the parameter , the action can be generated using a Gaussian distribution,
in which and are parameters generated by the actor neural network. is the advantage value, where is the discounted future reward after time and is the baseline.
It is worth noting that in PPO, data generated in previous episodes can also be used to update the current policy. To reuse the historical data, a clipping function is used to avoid large changes between the updated policy and the old policy. The clip function is given by
The changes between current updated policy and old policy can be written as
Thus, the new estimator can be modified as
It can be found that if is negative, the estimator is bounded by . On the other hand, when is positive, the estimator is at most .
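The bounding behavior just described can be checked numerically with a minimal sketch of the standard PPO clipped term for a single sample. The function name and the default clipping parameter of 0.2 are assumptions for illustration (0.2 is a commonly used value, not necessarily the one used in this paper).

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped objective for one sample.

    ratio:     new-policy probability divided by old-policy probability.
    advantage: estimated advantage of the sampled action.
    eps:       clipping parameter (assumed value).
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum makes the objective a pessimistic bound, so the
    # policy gains nothing from moving the ratio far outside [1-eps, 1+eps].
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage the value never exceeds `(1 + eps) * advantage`, and for a negative advantage it never exceeds `(1 - eps) * advantage`, regardless of how large or small the ratio becomes.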
As an actor-critic algorithm, PPO learns the baseline by using a critic neural network. The loss function of the critic network is given by
where is the value generated by the critic neural network and is the parameter of the critic network. During each episode, the estimator in (13) is optimized with respect to , and the loss function is minimized with respect to .
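For reference, the quantities discussed in this subsection are usually written as follows in the standard PPO formulation. The notation below is the conventional one from the PPO literature and is an assumption here; the symbols and equation numbering used in this paper may differ.

```latex
% Policy-gradient estimator with advantage \hat{A}_t = R_t - b(s_t)
\hat{g} = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right],
\qquad a_t \sim \mathcal{N}\!\big(\mu_\theta(s_t),\, \sigma_\theta^2(s_t)\big)

% Probability ratio between the updated policy and the old policy
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

% Clipped surrogate objective (maximized with respect to \theta)
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\Big(r_t(\theta)\hat{A}_t,\;
\operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t\Big)\right]

% Critic loss (minimized with respect to \phi)
L^{V}(\phi) = \hat{\mathbb{E}}_t\!\left[\big(V_\phi(s_t) - R_t\big)^2\right]
```

In a stateless setting such as the one considered in this paper, the dependence on the state can be read as a dependence on a fixed dummy input.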
IV-C Overview of Our Approach
A stateless deep RL approach is taken to solve our optimization problems and , in which the RL agent (PPO network) at the AP generates a transmission probabilities matrix based on the given power level set ; once has been generated, the AP broadcasts it to all devices in the network. The AP knows how many devices exist in the network but does not know the type of each device. Devices then start to upload packets to the AP with the corresponding transmission probabilities in . Without loss of generality, devices are aware of their own type, so a device can determine its transmission probabilities once is received. After time slots, the AP calculates the reward, updates the actor-critic network, and generates a new transmission probabilities matrix. The stateless deep RL problem can be formulated as a Markov decision process (MDP) consisting of two key elements:
Actions: the transmission probabilities matrix .
Reward: the reward collected during time slots (e.g., the minimum expected throughput or the geometric mean of the expected throughput).
Recall that for the transmission probabilities matrix, we need to guarantee . One way to generate a reasonable is to introduce the Beta distribution. The Beta distribution is defined on the interval [0, 1], and two positive parameters and control the shape of the distribution. For the -th type of devices, in order to generate , we first generate continuous numbers and from the agent, then calculate , to ensure that and are larger than or equal to 0. Given and , the cumulative distribution function (CDF) of the Beta distribution can be calculated easily. Let be the value of the CDF at ; we have and . The probability that -th type devices transmit using the -th power level can be calculated as
Thus, the probability that the -th type of devices do not transmit is .
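One possible way to turn two raw agent outputs into a valid row of the matrix is sketched below in pure Python. Several details are assumptions made only for illustration: the softplus mapping to positive shape parameters, the evenly spaced CDF evaluation points, and the function names. The sketch shows why CDF differences are convenient: they are nonnegative and sum to at most 1 by construction.

```python
import math

def softplus(x):
    """Map an unconstrained number to a positive one (assumed mapping)."""
    return math.log1p(math.exp(x))

def beta_cdf(a, b, x, steps=20000):
    """Approximate the Beta(a, b) CDF at x in [0, 1] with the midpoint rule."""
    h = 1.0 / steps
    weights = [((j + 0.5) * h) ** (a - 1) * (1 - (j + 0.5) * h) ** (b - 1)
               for j in range(steps)]
    upto = int(round(x * steps))
    # Normalizing by the numerical total avoids evaluating the Beta function.
    return sum(weights[:upto]) / sum(weights)

def transmission_row(raw_a, raw_b, usable_levels):
    """Return (per-level transmission probabilities, idle probability) for one device type."""
    a, b = softplus(raw_a), softplus(raw_b)
    # Assumed grid: usable_levels + 1 equal bins on [0, 1]; the last bin is "stay silent".
    edges = [k / (usable_levels + 1) for k in range(usable_levels + 2)]
    cdf = [beta_cdf(a, b, e) for e in edges]
    probs = [cdf[k] - cdf[k - 1] for k in range(1, usable_levels + 1)]
    idle = 1.0 - cdf[usable_levels]
    return probs, idle
```

Calling `transmission_row(1.0, 2.0, 3)` yields three nonnegative probabilities and an idle probability that together sum to 1, so the row constraint holds for any pair of raw outputs.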
In order to find the that achieves the max-min fairness, the reward is defined as the following:
If the goal is to find the which maximizes the geometric mean of the expected throughput for all devices, the reward is given by
We summarize our proposed approach in .
V Performance Evaluation
In this section, we evaluate the performance of a NOMA-based random access network with truncated channel inversion that adopts the proposed deep RL approach. The IoT network parameters used in the simulation are , , , , , unless otherwise specified. The power level set can be calculated as mentioned in Lemma 1, in which . The parameters of the deep RL approach are as follows: the learning rates of the actor and critic networks are , and the clipping parameter is .
In Fig. 2, the network performance, namely the arithmetic mean of the expected throughput, the geometric mean of the expected throughput, and the minimum expected throughput, is shown for the cases of (a) using the total expected throughput as the reward function, (b) using the geometric mean of the expected throughput as the reward function, and (c) using the minimum expected throughput as the reward function, respectively. It can be seen that our approach converges quickly and is able to find the optimal solution. The corresponding average throughput of each type of devices is shown in Fig. 3.
For case (a), after convergence, the geometric mean of the expected throughput and the minimum expected throughput are both zero, as shown in Fig. 2(a). It can also be observed in Fig. 3(a) that only type 5 devices, which can utilize all power levels, are transmitting. This is because the transmission of the other types of devices would only increase the congestion probability of the channel; the best strategy to maximize the total expected throughput is therefore to maximize the performance of the type 5 devices while the other types of devices stop transmitting. For case (b), the network performance and the average throughput of each type of devices are shown in Fig. 2(b) and Fig. 3(b), respectively. It can be observed that, after convergence, all types of devices are able to transmit while the total expected throughput is still relatively high. However, the average throughput of devices with more power levels to utilize is higher than that of devices with fewer power level options. In other words, the network is not totally fair for all devices. In Fig. 2(c), we plot the network performance of case (c). As shown in this figure, after convergence, the gaps between the arithmetic mean of the expected throughput, the geometric mean of the expected throughput, and the minimum expected throughput are small. This is because, after maximizing the minimum expected throughput, the average throughput of each type of devices is approximately equal, as shown in Fig. 3(c). Otherwise, if the average throughput of one type of devices were smaller than that of the other types, the transmission probabilities of the other types of devices should be decreased to raise the average throughput of the type with the minimum expected throughput, which would increase the minimum expected throughput of the network.
It is worth noting that even though case (c) achieves fairness and improves the minimum throughput of the system, its total expected throughput is lower than that of case (b), which indicates that, in order to achieve max-min fairness, the average throughput of devices with more power levels to select from is severely reduced.
In this paper, we introduced a novel deep RL approach for NOMA-based slotted ALOHA networks with truncated channel inversion power control. The proposed approach enables the AP to tune the transmission probabilities matrix, which guides the uplink transmission of IoT devices, to improve the network performance. Instead of optimizing the total expected throughput, which ignores fairness in this heterogeneous scenario, two optimization problems were formulated to maximize the geometric mean of the expected throughput and the minimum expected throughput of all devices. Extensive simulations showed that our approach helps to find the optimal solutions of these optimization problems. In our future work, we will incorporate intelligent reflecting surfaces (IRS) into the NOMA-based random access network.
This work was supported in part by NSF CAREER award ECCS1554576 and NSF grant CNS-1320736.
-  (2021) Performance study of random access NOMA with truncated channel inversion power control. In ICC 2021-IEEE International Conference on Communications, pp. 1–6.
-  (2020) Optimizing non-orthogonal multiple access in random access networks. In 2020 IEEE 91st Vehicular Technology Conference (VTC2020-Spring), pp. 1–5.
-  (2021) Performance study of cybertwin-assisted random access NOMA. IEEE Internet of Things Journal.
-  (2019) From 5G to 6G: has the time for modern random access come? arXiv preprint arXiv:1903.03063.
-  (2020) A NOMA-based Q-learning random access method for machine type communications. IEEE Wireless Communications Letters 9 (10), pp. 1720–1724.
-  (2019) Cisco visual networking index: global mobile data traffic forecast update, 2018–2023. Update 2018, pp. 2022.
-  (2020) A deep reinforcement learning based approach for channel aggregation in IEEE 802.11ax. In GLOBECOM 2020-2020 IEEE Global Communications Conference, pp. 1–6.
-  (2021) Data-driven random access optimization in multi-cell IoT networks with NOMA. arXiv preprint arXiv:2101.00464.
-  (2021) Distributed Q-learning aided uplink grant-free NOMA for massive machine-type communications. IEEE Journal on Selected Areas in Communications.
-  (2019) Analysis of RF energy harvesting in uplink-NOMA IoT-based network. In 2019 IEEE 90th Vehicular Technology Conference (VTC2019-Fall), pp. 1–5.
-  (2021) BLER-based adaptive Q-learning for efficient random access in NOMA-based mMTC networks. In 2021 IEEE 93rd Vehicular Technology Conference (VTC2021-Spring), pp. 1–5.