In recent years, the rapid development of communication technology has made wireless spectrum resources increasingly scarce [1, 2, 3]. Therefore, improving spectrum utilization and system throughput is a pertinent issue in the field of wireless communication.
Cooperative communication has attracted much attention, for it enables resource collaboration between different nodes and yields diversity benefits in multi-user scenarios. Depending on how the cooperative relay processes the received signal, relay modes can be mainly divided into amplify-and-forward (AF) and decode-and-forward (DF). Under the AF protocol, the relay node amplifies the received signal and directly forwards it to the destination; this method is simple, but the noise is amplified as well. Under the DF protocol, the relay node decodes the signal and then re-encodes it before forwarding, without noise amplification.
Relay selection is a hot topic in cooperative communication. In a multi-relay scenario, it is usually possible to select multiple relays to coordinate data transmission and assign them orthogonal channels to avoid interference. Jedrzejczak et al. studied the relay selection problem by calculating the harmonic mean of channel gains. Islam et al. demonstrated the influence of network coverage capability on relay selection. Das and Mehta proposed an approach for relay selection by analyzing the outage probability. The drawback of employing too many relays is that it may lead to extensive time consumption and wasted frequency resources when forwarding signals. To address this issue, Bletsas et al. proposed an opportunistic relay selection scheme that chooses only the best relay according to channel state, which can obtain the full diversity gain. However, these methods all assume exact channel state information (CSI), which is impractical because of inevitable noise.
Power allocation is another significant issue for cooperative communication. Given partial channel information, Wang and Chen derived a closed-form formula for optimal power allocation based on maximizing a tight capacity lower bound. Other works considered optimal power allocation schemes for the different situations of conventional relaying and opportunistic relaying. Tabataba et al.
deduced the expression for the system outage probability under high signal-to-noise ratio (SNR) conditions, and studied power allocation under the AF protocol. These approaches are reasonable to some extent, but they still require prior knowledge of the channel and cannot be applied to other situations.
Machine learning has been applied to various fields and achieved remarkable results with its unprecedented algorithmic capabilities. As one of the three paradigms of machine learning, reinforcement learning (RL) is an emerging tool for solving decision-making problems such as resource management in communication [12, 13, 14]. RL uses an agent that interacts with the environment. Unlike traditional methods, the agent in RL has no prior knowledge of the environment; that is, we do not need to add any assumptions to the learning process. The agent repeatedly interacts with the environment, chooses actions according to the current state, and continuously adjusts its behavior strategy according to the feedback from the environment. Deep reinforcement learning (DRL) combines deep neural networks (DNN) with traditional RL methods, and was proposed to solve problems with large state or action spaces [15, 16]. Hierarchical reinforcement learning (HRL) is a more recent technology based on DRL, and is considered a promising method for solving problems with sparse rewards in complex environments. HRL enables more efficient exploration of the environment by abstracting complex tasks into different levels [17, 18].
Recently, some researchers have successfully applied RL methods to cooperative communication. The source node is empowered with the ability to learn the optimal relay or power allocation for the current moment, based on previous observations of system states and rewards. Shams et al. employed the Q-learning algorithm to solve the power control problem, and Wang et al. proposed a Q-learning based relay selection scheme for relay-aided communication scenarios. The drawback of these studies is obvious: they are only suitable for simple problems of low dimension. Su et al. proposed a DQN-based relay selection scheme with mutual information (MI) as the reward, but did not take power allocation into consideration. To solve the joint optimization problem, Su et al. employed convex optimization and DQN to deal with relay selection and power allocation, respectively. However, in that study, prior knowledge of the channel state is still required by the convex optimization method.
In this paper, we propose an outage-based approach for relay selection and power allocation that minimizes the outage probability of the communication system. Unlike traditional optimization methods, our method can learn a behavior policy without assuming any prior knowledge of the channel state. It also differs from existing RL-based methods in that our agent only receives binary signals representing success or failure from the environment. Furthermore, we propose a novel hierarchical framework to reduce the search space and improve learning efficiency. Specifically, the contributions of this paper can be summarized as follows.
In our two-hop cooperative communication model, we analyze the SNR of the system under different protocols. Then, we transform the traditional outage probability optimization problem into a statistical problem that is suitable for RL frameworks to handle.
We propose an outage-based method, where the reward in the DRL framework is determined only by binary signals of success or failure fed back from the environment, without any additional information. This is practical because other information, such as the instantaneous SNR or MI, may not be available in certain situations.
We further design a two-level HRL framework for cooperative communication, where relay selection and power allocation are decomposed into two optimization objectives. By decomposing the objectives, the complex action space of the traditional DRL framework is simplified. After interacting with the environment, the agent can learn behavior strategies based on channel state prediction, so as to minimize the outage probability.
The rest of this paper is organized as follows. Section II introduces the preliminaries of DRL. Section III analyzes our system model and Section IV formulates the outage minimization problem. Section V describes our outage-based method using DQN framework. Section VI describes our proposed HRL framework and learning algorithm, and presents our pre-training algorithm in detail. Section VII presents simulation results. Finally, Section VIII concludes this paper and outlines future works.
A Markov decision process (MDP) consists of an environment, a state space S, an action space A, and a reward function r. At each discrete time step t, the agent observes the current state s_t and selects an action a_t according to a policy π, which maps states to a probability distribution over actions. After executing the action, the agent receives a scalar reward r_t from the environment and observes the next state s_{t+1} according to the transition probability. This process continues until a terminal state is reached.
The goal of the agent is to find the optimal policy that maximizes the expected long-term discounted reward, i.e., the expected accumulated return R_t = sum_k γ^k r_{t+k} from each state s_t, where the sum runs over the remaining steps of the episode and γ ∈ [0, 1] is the discount factor that trades off the importance of immediate and future rewards.
The action-value function Q^π(s, a) is usually used to describe the expected return after selecting action a in state s and thereafter following policy π.
This action-value function can be obtained via a recursive relationship known as the Bellman equation.
Moreover, the optimal action-value function Q*(s, a) gives the maximum action value for state s and action a, and it also obeys the Bellman equation.
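In standard notation, and assuming the conventional definitions of Q^π, Q*, γ, and r given above (the paper's exact equation numbering may differ), these relations read:

```latex
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\textstyle\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a\right],
\qquad
Q^{*}(s,a) = \mathbb{E}\!\left[\, r_t + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \,\middle|\, s_t = s,\ a_t = a \right].
```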
In practice, we usually do not know the underlying state transition probability, i.e., we are in a model-free situation. This requires the agent to interact with the environment, learn from the feedback, and constantly adjust its behavior to maximize the expected reward.
Temporal difference (TD) methods were proposed by combining Monte Carlo methods and dynamic programming, enabling the agent to learn directly from raw experience. From them we obtain the well-known Q-learning algorithm.
where δ_t denotes the TD error and α denotes the learning rate.
Through continuous iterative updating, the Q value of different actions selected in each state finally tends to be stable, which can provide a policy for the subsequent action selection.
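The tabular update described above can be sketched as follows (a minimal illustration of the standard Q-learning rule, not the paper's own code; the state/action indices and hyper-parameter values are arbitrary):

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update: move Q(s, a) toward the TD target
    r + gamma * max_a' Q(s_next, a') by a step of size alpha."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q

# Toy update on a 2-state, 2-action table (values purely illustrative).
Q = np.zeros((2, 2))
Q = q_learning_step(Q, s=0, a=1, r=1.0, s_next=1)
```

Repeating this update over sampled transitions drives the Q values toward the fixed point of the Bellman equation.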
III System Model
III-A Communication Network
Consider a wireless network consisting of a multi-antenna source, a multi-antenna destination, and a group of single-antenna relays, as shown in Fig. 1. We assume that the source is far from the destination, so the help of a relay node is needed. Due to equipment limitations at the relays, we consider a half-duplex signaling mode, where the communication from the source to the destination via the selected relay takes two time slots. In the first time slot, the source broadcasts its signal, and all other nodes, including the destination, listen to this transmission. In the second time slot, the selected relay forwards the detected signal using the AF or DF protocol. Next, we analyze the MI obtained by these two protocols in the high-SNR regime.
III-B Amplify and Forward Model
In the case that all relays can only scale the received signal and send it to the destination, we employ the AF protocol to realize cooperative communication. In the first phase, the received signal at the selected relay can be written as
where the source transmits the data symbol with a transmission power that is upper-bounded by a maximum value, over the channel between the source and the relay, and the received signal is corrupted by complex Gaussian noise at the relay. Similarly, the received signal at the destination in this time slot can be written as
where the source-destination link is described by a channel matrix, and the signal is corrupted by complex Gaussian noise at the destination.
In the second time slot, the selected relay amplifies the signal and transmits it to the destination. The destination combines the data from the source and the relay using maximal ratio combining (MRC). The received signal at the destination from the relay can be written as
where the relay transmission power is also upper-bounded by a maximum value, the channel vector between the relay and the destination has entries that are complex Gaussian random variables with zero mean and a given variance, and the amplification factor can be written as
which normalizes the relay transmit power with respect to the power of the signal received at the relay.
Similarly, we can obtain the SNR of the direct transmission from the source to the destination. Finally, we obtain the MI between the source and the destination under the AF protocol.
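For reference, half-duplex AF relaying with MRC at the destination commonly yields the following mutual information (a standard form, stated under our own shorthand: γ_sd, γ_sr, and γ_rd denote the received SNRs of the source-destination, source-relay, and relay-destination links, which may differ from the paper's notation):

```latex
I_{\mathrm{AF}} = \frac{1}{2}\log_{2}\!\left(1 + \gamma_{sd} + \frac{\gamma_{sr}\,\gamma_{rd}}{\gamma_{sr} + \gamma_{rd} + 1}\right),
```

where the factor 1/2 accounts for the two time slots consumed per transmission.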
III-C Decode and Forward Model
Assume that all relays are able to decode the signal from the source, and then re-encode and transmit it to the destination. The first time slot in DF mode is the same as that in AF mode, and (6) and (7) give the received signals at the relay and the destination in this time slot.
In the second time slot, unlike in AF mode, the selected relay decodes and forwards the signal to the destination, and the received signal at the destination from the relay can be written as
When employing the DF protocol, the relay must first successfully decode the signal from the source, which requires the MI of the source-relay link to be above the required transmission rate [25, 7]. Then, the signals received in the two time slots are combined at the destination using MRC, and we obtain the following instantaneous MI between the source and the destination.
III-D Outage Probability
Outage probability is an important criterion for measuring the robustness of a cooperative communication system. Given an outage threshold on the transmission rate, we can obtain the outage probabilities for our AF model and DF model, respectively.
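In general form, with R_th denoting the outage threshold (our notation), each outage probability is the probability that the instantaneous MI falls below the threshold:

```latex
P_{\mathrm{out}}^{\mathrm{AF}} = \Pr\!\left(I_{\mathrm{AF}} < R_{\mathrm{th}}\right),
\qquad
P_{\mathrm{out}}^{\mathrm{DF}} = \Pr\!\left(I_{\mathrm{DF}} < R_{\mathrm{th}}\right).
```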
IV Problem Formulation
Suppose that there is an agent in the communication environment, which has access to all CSI of the previous time slot. The agent estimates the current channel state based on historical CSI, and accordingly selects a relay and allocates transmission power. Afterwards, it receives a reward from the environment, which indicates whether the communication is successful. In this section, we model this process as an MDP, describe the variables in our communication scenario, and then formulate our problem.
IV-A State Space
A full observation of our two-hop communication system consists of the channel states between every pair of nodes in the previous time slot. Therefore, the state space in the current time slot is a union of the different wireless channel states, one for each link in the network.
In order to characterize the temporal correlation between time slots for each channel, we employ the following widely adopted Gauss-Markov block-fading autoregressive model [26, 27].
where ρ denotes the normalized channel correlation coefficient between corresponding elements of the channel states in consecutive time slots, and the error term is uncorrelated with the previous channel state. According to Jakes' fading spectrum, ρ = J_0(2π f_d T_s), where J_0 denotes the zeroth-order Bessel function of the first kind, and f_d and T_s denote the Doppler frequency and the length of a time slot, respectively.
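The AR(1) channel update above can be sketched as follows (a minimal sketch; the coefficient name `rho` and the unit-variance complex Gaussian innovation are assumptions matching the model's standard form, not the paper's code):

```python
import numpy as np

def evolve_channel(h_prev, rho=0.95, rng=np.random):
    """Gauss-Markov (AR(1)) block-fading update:
    h_t = rho * h_{t-1} + sqrt(1 - rho^2) * e_t,
    where e_t is unit-variance circularly symmetric complex Gaussian noise
    uncorrelated with h_{t-1}, so the marginal channel statistics are preserved."""
    e = (rng.standard_normal(h_prev.shape)
         + 1j * rng.standard_normal(h_prev.shape)) / np.sqrt(2)
    return rho * h_prev + np.sqrt(1 - rho**2) * e
```

The sqrt(1 - rho^2) scaling keeps the channel variance constant across time slots.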
IV-B Action Space
The full action space includes relay selection, source power allocation, and relay power allocation.
Considering that the total power is constrained, we can assume that the sum of the power used by the source and its selected relay equals the total power budget. The relay power can then be directly represented as the difference between the total power and the source power. In this way, we reduce the number of actions that need to be optimized and derive the following reduced action space.
Relay Selection: The relay selection is denoted by
where an entry of 1 indicates that the corresponding relay is selected in the current time slot, and 0 otherwise.
Source Power Allocation: Similarly, the power allocation for the source node is denoted by
where the maximum source power is divided into discrete power-levels, and an entry of 1 indicates that the corresponding power-level is selected for source transmission in the current time slot, and 0 otherwise.
IV-C Reward and Optimization Problem
Recall that when an indicator function is employed to represent each occurrence of an event, the expectation of the indicator function equals the probability of the original event. We therefore define an indicator function of the outage event, which represents the result of each selection.
In this paper, we formulate an optimization problem to minimize the outage probability of our proposed communication system, which jointly optimizes relay selection and power allocation. Then, the problem can be formulated as
A reward is fed back to the agent to evaluate the selected action under the current system state. Since our method is outage-based, meaning the agent only receives a communication result denoting success or failure from the environment, we define our binary reward function as
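The binary reward and the resulting outage estimate can be sketched as follows (an illustrative sketch under our assumptions; the function names and the MI-threshold success criterion are our own, mirroring the indicator-function formulation above):

```python
def outage_reward(mutual_info, rate_threshold):
    """Binary outage-based reward: 1 if the achieved MI supports the target
    rate (success), 0 otherwise (outage). No SNR or MI value is revealed
    to the agent, only this success/failure bit."""
    return 1 if mutual_info >= rate_threshold else 0

def empirical_outage(rewards):
    """Outage probability estimated as one minus the average success indicator,
    i.e., one minus the expectation of the indicator function."""
    return 1.0 - sum(rewards) / len(rewards)
```

Maximizing the expected binary reward is thus equivalent to minimizing the outage probability.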
V DRL Based Framework for Minimizing Outage
Many reinforcement learning algorithms share the same basic idea: estimate the action-value function by using the Bellman equation as an iterative update and converge to the optimum. However, this basic approach is impractical, as the action-value function is estimated separately for each sequence without generalization. In this section, we realize our outage-based relay selection and power allocation algorithm on the basis of the well-known DQN framework.
The diagram of our DRL-based communication framework is shown in Fig. 2. Note that the source node has no prior knowledge of the communication system, which means the distributions of the wireless channels between any two nodes are all unknown to it. The agent can only observe the current state of the communication environment. A deep neural network is then employed as a nonlinear function approximator to process the input data, which makes it possible to estimate the action-value function over a high-dimensional state space. According to (3), we have
After this calculation, the agent chooses the action that induces the maximal Q value, and the environment then gives the corresponding reward and updates the system state. At this point, we have obtained a complete experience tuple of state, action, reward, and next state, which is stored in the experience replay buffer. During training, a batch of experiences is sampled and used to optimize the loss function below.
where the target network parameters are taken from a previous iteration and held fixed during optimization, being periodically replaced by the parameters of the evaluation network.
Differentiating this loss function yields the following gradient expression.
The standard non-centered RMSProp optimization algorithm [28, 29] is then employed to update the parameters of the Q network.
Note that this is a model-free approach, as the agent uses states and rewards sampled from the environment rather than estimating transition probabilities. It is also off-policy, because an epsilon-greedy method is employed as the behavior policy. The complete algorithm can be found in Algorithm 1.
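The target computation and behavior policy just described can be sketched as follows (a minimal sketch of standard DQN machinery, not the paper's implementation; the batch layout and function names are assumptions):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Behavior policy: explore uniformly with probability epsilon,
    otherwise pick the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def dqn_targets(batch, q_target_fn, gamma=0.9):
    """Regression targets for the DQN loss: y = r + gamma * max_a' Q_target(s', a'),
    truncated to y = r at terminal transitions. `batch` holds
    (state, action, reward, next_state, done) tuples from the replay buffer."""
    targets = []
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * np.max(q_target_fn(s_next))
        targets.append(y)
    return np.array(targets)
```

The squared difference between these targets and the evaluation network's outputs is the loss minimized by RMSProp.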
VI Hierarchical Framework for Dynamic Relay Selection and Power Allocation
Traditional DRL-based approaches put all variables together in a single action, which results in a complex search space. In this section, we propose a novel two-level HRL framework for cooperative communication, which learns the relay selection policy and the power allocation policy at different levels.
VI-A Proposed Framework
HRL is a recent technology based on DRL, and has been considered a promising method for solving problems with sparse rewards. HRL composes low-level actions into high-level primitives called 'goals', which effectively reduces the dimension of the search space.
As shown in Fig. 3, the communication agent has two levels. At the higher level, the meta-controller receives an observation of the state from the external communication environment and outputs a goal. The controller at the lower level is supervised by the goals learned and proposed by the meta-controller; it observes the state of the external communication environment and selects an action. Note that the meta-controller proposes a goal every several time steps, and the goal remains fixed until the low-level controller reaches the terminal state. We employ a standard experience replay buffer, and it is worth noting that the experience tuples for the meta-controller and for the controller are stored in disjoint spaces for training.
The hierarchical environment includes the state, the high-level goal, the low-level action, and the rewards, which are described below.
State: The state in our hierarchical framework is the same as that in the DRL environment, consisting of the channel states between every pair of nodes in the previous time slot. The expression for the state space is given in (16).
High-Level Goal: In cooperative communication systems, we can intuitively see that relay selection plays the major role. Therefore, we separate the different action components into different levels, and extract relay selection as the high-level goal for overall planning. Denoting the goal at the higher level accordingly, we then have
In fact, goal selection at the high level is similar to the relay selection action in the previous DRL method. Therefore, the goal should satisfy the same one-relay-per-slot constraint.
Low-Level Action: By decomposing relay selection and power allocation into different levels, we further reduce the action space. The low-level action space then has only one variable, the source power allocation, which satisfies the same constraints as in (22).
Reward: Note that the higher level and the lower level work on different time scales. The meta-controller first proposes a temporarily fixed goal for the lower level; the controller then performs actions over a period of time according to both the system state and the high-level goal, and receives feedback from the environment. Therefore, we can denote the internal reward for the low-level controller as
On the other hand, we use the communication success rate of a given relay over a period of time to measure the quality of the current relay selection. Therefore, the external reward for the high-level meta-controller can be represented as the average success rate over that period, whose expectation the agent aims to maximize.
VI-B Hierarchical Learning Policy
For the meta-controller at the higher level, we use a gradient bandit method to learn a goal-policy that dynamically proposes goals according to a given system state. Recall that we have a set of candidate relays to choose from. Therefore, we first establish the following probability distribution.
where the high-level policy gives the probability that each relay is selected as the goal in the current time slot, computed from a preference value for choosing that relay, which is updated every several steps.
Then, we employ stochastic gradient ascent to update the preference values.
where the learning step size scales the update, and the expectation of the external reward can be computed over the goal distribution. Replacing the expectation form in equation (34), we then have
where the newly introduced baseline scalar is independent of the chosen goal. It denotes the average of all external rewards, i.e., the average success rate of our communication system. The partial derivative term can be written as
Note that the baseline is independent of the other variables. We can then derive the following equation by rewriting the gradient in expectation form.
In the training process, sampling is conducted every several time steps, and the gradient in (34) is replaced by a single-sample estimate. Therefore, we finally obtain the following update expression for the preference values.
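The softmax goal distribution and the gradient-bandit preference update can be sketched as follows (a minimal sketch of the standard gradient-bandit rule; the names `H`, `baseline`, and the step size value are our own assumptions, not the paper's notation):

```python
import numpy as np

def softmax_policy(H):
    """Goal-selection probabilities from the vector of preference values H
    (one entry per candidate relay)."""
    z = np.exp(H - np.max(H))  # subtract max for numerical stability
    return z / z.sum()

def bandit_update(H, chosen, reward, baseline, step=0.1):
    """Gradient-bandit preference update: push the chosen relay's preference
    up (down) when the external reward beats (trails) the running baseline,
    and move the other preferences in the opposite direction."""
    pi = softmax_policy(H)
    H = H.copy()
    for k in range(len(H)):
        grad = (1.0 - pi[k]) if k == chosen else -pi[k]
        H[k] += step * (reward - baseline) * grad
    return H
```

Because the per-relay gradients sum to zero, the update shifts probability mass between relays without drifting the overall preference level.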
The controller at the lower level learns an action-policy for selecting actions according to both the state and the goal, aiming to maximize the long-term discounted expected internal reward.
In order to reflect the difference in value between different actions, we make some changes to the architecture of the traditional deep Q network. Inspired by the dueling architecture, we employ a dueling network, which enhances the stability of the DRL algorithm by ignoring subtle changes in the environment and focusing on key states.
A schematic illustration of the dueling architecture is shown in Fig. 4. The input layer and hidden layers are the same as those in a traditional deep Q network. The key difference is a sub-output layer in our dueling network, where the traditional Q-function output is separated into a state-goal value function and an advantage evaluation function. In the state-goal value part, there is only one neuron, which represents the assessment of the current state and goal. In the advantage evaluation part, the number of neurons equals that of the output layer, representing the advantage of choosing each optional action.
Since the expected advantage of the actions under the policy is zero, we rewrite the advantage part by subtracting its mean over actions, so that this property holds by construction,
and thus obtain the following expression for the Q function by combining the two parts of the sub-output layer.
where one set of parameters belongs to the common part of the DNN (i.e., the three columns on the left in Fig. 4), and two further sets belong to the separate fully connected sub-output layers for the value function and the advantage function, respectively. Note that the output of our dueling network is still the same as that of the traditional network, namely the estimated expected return for each action under the current state and goal.
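The combination of the two sub-output streams can be sketched as follows (a minimal sketch of the standard dueling aggregation; the function and argument names are our own):

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine the scalar state-goal value and the per-action advantages:
    Q(s, g, a) = V(s, g) + A(s, g, a) - mean_a A(s, g, a).
    Subtracting the mean advantage makes the V/A decomposition identifiable."""
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()
```

For example, `dueling_q(1.0, [1.0, 3.0])` yields one Q value per action while preserving their relative ordering from the advantage stream.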
By employing the temporal difference method, the optimal value of the Q function can be written as (41), where the low-level policy governs power allocation in each time slot. Note that the meta-controller and the controller work on different timescales: the controller operates at every time step, while the meta-controller operates on a longer timescale.
During training, a batch of memories is sampled from the experience replay buffer. We calculate a set of loss functions and derive a batch of gradients as in (42). The RMSProp optimizer in (28) is then employed to update the network parameters.
Please refer to Algorithm 2 for detailed procedure of our hierarchical algorithm.
VII Simulation Results
In this section, we first introduce the setup of the simulation environment, and then carry out experiments to evaluate our proposed algorithms.
VII-A Experiment Setup
In our two-hop cooperative relay network, the channel vectors between any two nodes in each time slot are calculated according to formula (31), where the correlation coefficient is set to 0.95. The channel vectors are initialized at the beginning of each episode according to a path-loss model, in which the path-loss coefficients, the reference distance, and the distance between each pair of nodes determine the initial channel gain. The maximum transmission power is fixed in all experiments.
To implement our proposed framework, the learning step size of the high-level meta-controller is set to 0.1. The elements of the preference value vector are initialized to 0, and the vector is updated after each inner loop is completed.
In the low-level controller, we use two separate dueling deep Q networks that share the same structure. Both networks include two hidden layers of 50 neurons each, and we employ the ReLU activation function for all hidden layers. The number of neurons in the input layer equals the sum of the dimensions of the state and the goal, and the number of neurons in the output layer corresponds to the dimension of the low-level action.
Note that the deep Q network requires the action space to be discrete, so we discretize the power allocation in the environment and set different power-levels for the agent to choose from. For comparison, we use the following methods as baselines in our experiments.
Random selection: For each time slot, the agent randomly selects a relay to perform cooperative communication with random transmission power.
DQN-based approach: Our DRL framework for minimizing outage probability proposed in Section V. The traditional DQN-based algorithm makes relay selection and power allocation at the same time, and is used here as one of the baseline methods.
VII-B Numerical Results
Since both our DRL method and our HRL method use the deep Q network structure, we first study the influence of different hyper-parameters on convergence performance, in order to obtain the optimal network structure. Note that the average success rate for each hyper-parameter value is measured over 10 runs, and the mean curves and ranges are recorded.
Fig. 5(a) shows that setting a learning rate that is too large or too small is ineffective for convergence, leading to a local optimum or slow convergence, respectively. Therefore, in the following simulations, we set the learning rate to 0.001.
In Fig. 5(b), we notice that different memory sizes have little influence on the final value to which the average success rate converges. Considering the speed of convergence, we finally set the memory size to 8000.
During training, a batch of data is sampled from the experience buffer to improve the DNN. In Fig. 5(c), we fix the memory size and study the effect of different batch sizes on convergence performance. It can be seen that training with a small batch size cannot take advantage of all the data stored in the experience buffer, while a large batch size (256, pink dotted line) gives the fastest convergence, although it consumes much more time during training.
Finally, we investigate the convergence performance under different training intervals, as shown in Fig. 5(d). The smaller the interval, the higher the training frequency, and therefore the longer the training time. The average reward clearly converges faster with a shorter training interval. On the other hand, we find that the final convergence values with intervals of 5 and 10 are very close. Therefore, we set the training interval to 10, as it is unnecessary to train and update the DNN parameters too frequently.
Setting the above hyper-parameters to their optimal values and applying them to all deep Q networks, we carry out the following experiments.
We set appropriate rate thresholds for the AF and DF communication environments, and then evaluate the performance of the different methods, as depicted in Fig. 6.
It can be observed that the random selection method always performs very poorly. Both the DQN-based method and our hierarchical algorithm can be trained effectively, and their average reward curves eventually converge to stable values with slight fluctuations.
Taking Fig. 6(a) as an example, with the DQN-based method the average reward is only about 0.9, i.e., the outage probability of the communication system is about 0.1. When employing our hierarchical algorithm, the average reward is closer to 1.0. Moreover, our HRL method learns faster, converging after about 15 iterations, while the DQN-based method needs about 40 iterations to reach its convergence value. The training result in the DF environment, presented in Fig. 6(b), also shows that our method achieves the best performance.
Comparing Fig. 6(a) with Fig. 6(b), we observe that although there is little difference in convergence performance between the methods, the fluctuation of the average success rate is larger for the DQN-based method. Our hierarchical method still converges to a value with smaller fluctuations, which indicates that it is more stable. In all, our HRL agent learns a better strategy faster for dynamic relay selection and power allocation, in both the AF and DF communication environments.
We then test the performance of our proposed hierarchical method and the DQN-based method under different search space scales. This experiment is conducted in the DF communication environment with a rate threshold of 2.0. As shown in Fig. 7, we consider two scenarios where the numbers of relays and power levels are both set to 10 or both set to 20.
As the numbers of relays and power levels increase, the average success rate of our method is slightly lower at the beginning of training, due to the increased difficulty of exploration. However, the performance then becomes better, and the average success rate increases from 0.87 to 0.91. With more optional power levels, transmission power can be allocated more efficiently at the source and relay nodes, resulting in an improved communication success rate.
On the other hand, the DQN-based method performs worse in the larger search space, where its average success rate drops noticeably. It takes more training iterations to converge, and the fluctuations get larger. This is because, in a larger search space, it becomes more difficult to learn the joint behavior policy. There are fewer successful explorations, which leads to a sparse reward problem. Traditional DRL methods cannot perform well in environments with sparse rewards, due to the lack of positive experience to learn from. By introducing hierarchy, our method reduces the complexity of the search space, which ensures the efficiency of exploration and learning. Therefore, when employing our proposed hierarchical method, we can still obtain a stable behavior policy for relay selection and power allocation.
After 100 iterations of training, we obtain dynamic relay selection and power allocation policies from both the hierarchical method and the DQN-based method. To further evaluate the robustness of the different methods, we test the performance of these well-trained policies under different rate thresholds; the result is depicted in Fig. 8.
This experiment is conducted in a DF environment, with communication rate thresholds ranging from 1.6 to 2.4. The only difference between testing and training is that the parameters of all networks are fixed during testing, which means the DNN is only used to provide the best action rather than to perform further learning.
As we can see from Fig. 8, both the HRL policy and the DRL policy trained in the smaller search space can be applied to other situations. However, the DRL policy trained in the larger search space performs poorly in testing, while we can still obtain an appropriate behavior scheme in different environments by following our HRL policy. Our hierarchical algorithm is clearly more robust and can greatly reduce the outage probability, which means that the HRL agent can perform better relay selection and adjust power allocation more reasonably according to the current state after training, regardless of the environment.
VIII Conclusion and Future Work
In this paper, we propose an outage-based method to dynamically select the relay and allocate power in a two-hop cooperative communication model, with the goal of minimizing the outage probability under a total transmission power constraint. Unlike traditional studies, our method does not require any assumption about the channel distribution; it relies instead on the interaction between the agent and the communication environment. Compared with existing RL-based methods, our outage-based reward function is more practical, and we further design a more efficient HRL framework by decomposing relay selection and power allocation into two sub-tasks. Simulation results show that our method achieves lower outage probability and converges faster than traditional methods in both AF and DF communication environments. Furthermore, our hierarchical method effectively solves the problem of sparse rewards, which the other methods can hardly handle.
Our method provides a novel approach to resource allocation and optimization in the field of communication. However, the total transmission power is discretized into several power levels in the DRL architecture, which could be refined further to obtain a higher SNR. In future work, we would like to explore optimization methods applicable to continuous action spaces.
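The discretization mentioned above can be sketched as follows: the total power budget is split between source and relay on a uniform grid, so the agent chooses among finitely many allocations. The budget and level count here are illustrative assumptions:

```python
def power_allocations(p_total: float, levels: int):
    """Enumerate (source, relay) power splits on a uniform grid,
    excluding the degenerate all-or-nothing endpoints."""
    step = p_total / levels
    return [(i * step, p_total - i * step) for i in range(1, levels)]

# Illustrative budget of 2.0 split into 4 levels:
allocs = power_allocations(p_total=2.0, levels=4)
print(allocs)  # prints [(0.5, 1.5), (1.0, 1.0), (1.5, 0.5)]
```

A continuous-action method (the stated future direction) would instead output the split directly as a real number in (0, p_total), avoiding the quantization loss of this grid.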
-  F. Zhong, X. Xia, H. Li, and Y. Chen, “Distributed linear convolutional space-time coding for two-hop full-duplex relay 2x2x2 cooperative communication networks,” IEEE Transactions on Wireless Communications, vol. 17, no. 5, pp. 2857–2868, May 2018.
-  C. Wang, T. Cho, T. Tsai, and M. Jan, “A cooperative multihop transmission scheme for two-way amplify-and-forward relay networks,” IEEE Transactions on Vehicular Technology, vol. 66, no. 9, pp. 8569–8574, Sept. 2017.
-  Y. Liu, E. Liu, R. Wang, and Y. Geng, “Channel estimation and power scaling law of large reflecting surface with non-ideal hardware,” arXiv preprint arXiv:2004.09761, 2020.
-  J. Jedrzejczak, G. J. Anders, M. Fotuhi-Firuzabad, H. Farzin, and F. Aminifar, “Reliability assessment of protective relays in harmonic-polluted power systems,” IEEE Transactions on Power Delivery, vol. 32, no. 1, pp. 556–564, Feb. 2017.
-  S. N. Islam, M. A. Mahmud, and A. M. T. Oo, “Relay aided smart meter to smart meter communication in a microgrid,” in 2016 IEEE International Conference on Smart Grid Communications (SmartGridComm), Sydney, NSW, Australia, Nov. 2016, pp. 128–133.
-  P. Das and N. B. Mehta, “Direct link-aware optimal relay selection and a low feedback variant for underlay CR,” IEEE Transactions on Communications, vol. 63, no. 6, pp. 2044–2055, Jun. 2015.
-  A. Bletsas, A. Khisti, D. P. Reed, and A. Lippman, “A simple cooperative diversity method based on network path selection,” IEEE Journal on Selected Areas in Communications, vol. 24, no. 3, pp. 659–672, Mar. 2006.
-  C. Wang and J. Chen, “Power allocation and relay selection for AF cooperative relay systems with imperfect channel estimation,” IEEE Transactions on Vehicular Technology, vol. 65, no. 9, pp. 7809–7813, Sept. 2016.
-  O. Amin, S. S. Ikki, and M. Uysal, “On the performance analysis of multirelay cooperative diversity systems with channel estimation errors,” IEEE Transactions on Vehicular Technology, vol. 60, no. 5, pp. 2050–2059, Jun. 2011.
-  M. Seyfi, S. Muhaidat, and J. Liang, “Amplify-and-forward selection cooperation over rayleigh fading channels with imperfect CSI,” IEEE Transactions on Wireless Communications, vol. 11, no. 1, pp. 199–209, Jan. 2012.
-  F. S. Tabataba, P. Sadeghi, and M. R. Pakravan, “Outage probability and power allocation of amplify and forward relaying with channel estimation errors,” IEEE Transactions on Wireless Communications, vol. 10, no. 1, pp. 124–134, Jan. 2011.
-  Y. Hua, R. Li, Z. Zhao, X. Chen, and H. Zhang, “GAN-powered deep distributional reinforcement learning for resource management in network slicing,” IEEE Journal on Selected Areas in Communications, vol. 38, no. 2, pp. 334–349, Feb. 2020.
-  L. P. Qian, A. Feng, X. Feng, and Y. Wu, “Deep RL-based time scheduling and power allocation in EH relay communication networks,” in IEEE International Conference on Communications (ICC), Shanghai, China, May 2019, pp. 1–7.
-  L. Huang, S. Bi, and Y. J. A. Zhang, “Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks,” IEEE Transactions on Mobile Computing, vol. 19, no. 11, pp. 2581–2593, Nov. 2020.
-  V. Mnih, K. Kavukcuoglu, D. Silver et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
-  L. Espeholt, H. Soyer, R. Munos et al., “IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures,” in International Conference on Machine Learning (ICML), Stockholm, Sweden, Jul. 2018, pp. 1407–1416.
-  T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,” in Neural Information Processing Systems (NIPS), Barcelona, Spain, Dec. 2016, pp. 3675–3683.
-  O. Nachum, S. Gu, H. Lee, and S. Levine, “Data-efficient hierarchical reinforcement learning,” in Neural Information Processing Systems (NIPS), Montreal, Canada, Dec. 2018, pp. 3303–3313.
-  F. Shams, G. Bacci, and M. Luise, “Energy-efficient power control for multiple-relay cooperative networks using Q-learning,” IEEE Transactions on Wireless Communications, vol. 14, no. 3, pp. 1567–1580, Mar. 2015.
-  X. Wang, T. Jin, L. Hu, and Z. Qian, “Energy-efficient power allocation and Q-learning-based relay selection for relay-aided D2D communication,” IEEE Transactions on Vehicular Technology, vol. 69, no. 6, pp. 6452–6462, Jun. 2020.
-  Y. Su, X. Lu, Y. Zhao, L. Huang, and X. Du, “Cooperative communications with relay selection based on deep reinforcement learning in wireless sensor networks,” IEEE Sensors Journal, vol. 19, no. 20, pp. 9561–9569, Oct. 2019.
-  Y. Su, M. LiWang, Z. Gao, L. Huang, X. Du, and M. Guizani, “Optimal cooperative relaying and power control for IoUT networks with reinforcement learning,” IEEE Internet of Things Journal, pp. 1–1, Jul. 2020.
-  R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
-  J. Boyer, D. D. Falconer, and H. Yanikomeroglu, “Multihop diversity in wireless relaying channels,” IEEE Transactions on Communications, vol. 52, no. 10, pp. 1820–1830, Oct. 2004.
-  R. Annavajjala, P. C. Cosman, and L. B. Milstein, “Statistical channel knowledge-based optimum power allocation for relaying protocols in the high SNR regime,” IEEE Journal on Selected Areas in Communications, vol. 25, no. 2, pp. 292–305, Feb. 2007.
-  H. A. Suraweera, T. A. Tsiftsis, G. K. Karagiannidis, and A. Nallanathan, “Effect of feedback delay on amplify-and-forward relay networks with beamforming,” IEEE Transactions on Vehicular Technology, vol. 60, no. 3, pp. 1265–1271, Mar. 2011.
-  Z. Chen and X. Wang, “Decentralized computation offloading for multi-user mobile edge computing: A deep reinforcement learning approach,” arXiv preprint arXiv:1812.07394, 2018.
-  V. Mnih, A. P. Badia, M. Mirza et al., “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning (ICML), New York City, NY, USA, Jun. 2016, pp. 1928–1937.
-  T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.
-  Z. Wang, T. Schaul et al., “Dueling network architectures for deep reinforcement learning,” in International Conference on Machine Learning (ICML), New York City, NY, USA, Jun. 2016, pp. 1995–2003.