I Introduction
The popularity of mobile devices and the growth of multimedia applications have placed a high demand on the data transmission rate of wireless networks. Device-to-device (D2D) communication is regarded as a key technology for improving the data transmission rate and reducing latency and energy consumption, and it is an important part of future 5G and Internet of Things (IoT) networks. With the assistance of the cellular base station (BS), D2D communication allows two nearby cellular users (CUEs) to form a D2D pair and communicate with each other directly without traversing the BS or core network, which significantly improves transmission quality because of the short transmission distance [1]. D2D underlay communication reuses the spectrum of the cellular network and can therefore increase spectral efficiency. However, D2D communication generates interference to the cellular network if the radio resources are not properly allocated [2, 3, 4]. Thus, it is important to allocate radio resources properly to ensure the reliability of cellular communication and increase the capacity of D2D underlay cellular networks.
Many resource allocation schemes based on traditional optimization methods exist in the literature. However, in future wireless networks where users are dense and the scene changes rapidly, resource allocation faces two main challenges. On the one hand, as the number of users increases, acquiring channel state information (CSI) requires huge signaling overhead, and assuming that the BS has global network information is unrealistic. On the other hand, resource allocation problems are often modeled as combinatorial optimization problems with nonlinear constraints, which are difficult to solve efficiently with traditional optimization methods. Fortunately, reinforcement learning (RL) has been shown to be effective in addressing decision making under uncertainty [5]. RL can learn decision policies from historical data, and hard-to-optimize objectives can be handled in an RL framework by designing training rewards that correlate with the final objective. Moreover, RL for resource allocation can be designed as a distributed algorithm in which each D2D pair is supported by an autonomous agent that selects a reasonable spectrum for transmission based on the policy learned by RL. Therefore, we use RL to solve the spectrum allocation problem for D2D underlay communications in this paper.
I-A Related Work
Existing resource allocation methods can be divided into centralized and distributed schemes according to their execution modes. In the centralized schemes [4, 6, 7, 8], the BS is responsible for allocating resources to the CUEs and D2D pairs and for monitoring information such as the signal-to-interference-plus-noise ratio (SINR), CSI, and interference level of each user in the cell. With global CSI at the BS, various solutions to the channel and power allocation problems of the D2D tier have been proposed in [4] and the references therein. Graph theory is a useful centralized method for this kind of resource allocation problem [9, 10]. A bipartite graph is designed in [11], where CUEs and D2D pairs are modeled as vertices and the weights of the bipartite graph are the rates of the associated D2D and cellular links. However, centralized schemes require the BS to have global network information. Moreover, the complexity of the centralized schemes increases with the number of users, placing an enormous computational burden on the BS.
In order to reduce the signaling overhead and the computing load of the BS, a series of distributed resource allocation methods have been proposed. In a distributed approach, there is no central controller, and the D2D pairs opportunistically and autonomously reuse the spectrum of CUEs. Although the BS is not required to obtain global information or participate in the calculation, such an approach still requires frequent exchange of information between adjacent D2D users and requires the devices to sense the cellular communications to gather information about channel quality and available resource blocks. Distributed schemes can scale to large networks, but they require complicated interference avoidance algorithms to ensure high-quality cellular communications and reliable D2D communications. Some distributed algorithms are based on game theory [12, 13, 14]. Game theory is used in [12] to model D2D pairs sharing spectrum resources with CUEs as an auction mechanism. A distributed resource allocation scheme has been proposed in [13] to guarantee a minimum data rate for CUEs and to maximize the D2D average data rate. Interference is controlled using reuse prices and a power control game. Since the channel gain information and prices have to be shared among the D2D pairs, large signaling overhead is incurred. Moreover, this type of method usually requires many iterations to converge.
In addition to game theory, machine learning has been considered an effective tool for solving different network problems in 5G [15, 16]. RL is one of the most powerful tools for policy control and intelligent decision making [5] and has been widely adopted in wireless communications [17, 18, 19]. Recently, a number of works have applied RL to the intelligent resource management and decision-making problem in D2D underlay networks [20, 21, 22, 23, 24, 25, 26, 27]. A Q-learning based resource allocation scheme has been proposed in [20], in which resources are shared among D2D and cellular users using a Q-learning based strategy to maximize the network throughput. A distributed Q-learning based spectrum allocation scheme has been proposed in [21], where D2D users learn the wireless environment and select spectrum resources autonomously to maximize their throughput while causing minimum interference to the cellular users. Since Q-learning converges slowly and is not always suitable for continuous-valued state and action spaces, an efficient transfer actor-critic (AC) RL approach has been proposed in [22] to address the intelligent resource management problem in D2D-based Internet of Vehicles (IoV) networks. The above works can only be applied to low-dimensional state-action mappings. Recently, deep learning has also been introduced into resource allocation problems.
The work in [28] leverages the deep long short-term memory (LSTM) learning technique to make localized predictions of the traffic load at the ultra-dense network (UDN) base station. In [29], a damped three-dimensional (D3D) message-passing algorithm (MPA) based on deep learning has been proposed for resource allocation in cognitive radio networks. A novel deep-learning-based traffic load prediction algorithm that forecasts future traffic load and network congestion has been proposed in [30]. With deep learning techniques, reinforcement learning has shown impressive improvement. The work in [31] exploits a collaborative learning framework that combines deep learning with reinforcement learning for resource scheduling in network slicing. In [25], a decentralized resource allocation mechanism for vehicle-to-vehicle (V2V) communications based on deep RL has been developed, which can be applied to both unicast and broadcast scenarios. All of the above works model the policy search process in RL as a Markov decision process (MDP), which is valid if different agents (D2D pairs) update their policies independently at different times. However, if two or more agents (D2D pairs) update their policies at the same time, the environment becomes a multi-agent environment that appears non-stationary.
There are some resource allocation studies based on multi-agent RL [26, 32, 33, 27]. In [26], the resource allocation problem is modeled as a stochastic non-cooperative game and a Q-learning based algorithm is proposed. This method combines Q-learning and game theory to alleviate the instability of the multi-agent environment. However, it cannot be applied to high-dimensional state-action mappings, and its convergence to the Nash equilibrium requires many iterations. A fingerprint-based deep Q-network method has been proposed in [27]. This method combines multi-agent RL and deep learning. By giving all agents a common reward, it mitigates the instability of the multi-agent environment but prevents each agent from achieving a higher individual reward.
I-B Contribution
This paper proposes two distributed spectrum allocation frameworks, multi-agent actor critic (MAAC) and neighbor-agent actor critic (NAAC), which are trained in a centralized manner and executed in a distributed manner. The frameworks set a separate reward for each agent. By sharing all users’ historical states, actions and policies during centralized training, MAAC mitigates the instability of the multi-agent environment while ensuring that each agent’s policy is updated in the direction of increasing individual reward. Moreover, to reduce the computational complexity of training, NAAC is further proposed, which shares only neighbor users’ historical information for centralized training. Our motivation is to learn from historical information, with the help of deep reinforcement learning, how to make decisions (select spectrum) based on the states observed in real time. These states include, among others, the instantaneous channel information observed by the UEs. We do not use historical information to make decisions; we collect it to train the reinforcement learning model. Our frameworks can learn a model with generalization capability that makes reasonable decisions based on the states observed in real time. The two frameworks require no information exchange during execution, so they significantly reduce the signaling overhead. In addition, our methods move the complex training process to the BS and significantly reduce the computational complexity of algorithm execution.
Part of the work related to NAAC appeared in a conference paper [34] published at IEEE Globecom 2019. This paper provides a unified multi-agent deep reinforcement learning framework covering MAAC and NAAC for distributed spectrum allocation. We theoretically derive the feasibility of the proposed framework based on Markov game theory, and we provide a detailed analysis of how the framework is deployed, together with its computational complexity and performance overhead. In addition, more implementation details, experimental results and discussions are provided to better explain the multi-agent deep reinforcement learning based spectrum management scheme. The main contributions of this paper are summarized as follows:

To model state transitions in a multi-agent environment more accurately, the D2D communication environment is modeled as a Markov game for the first time.

A multi-agent deep RL framework, MAAC, is proposed. It shares all users’ historical states, actions and policies during centralized training, which mitigates the instability of the multi-agent environment and the resulting difficulty of training convergence. In addition, it takes into account both the cooperation between users and the pursuit of higher individual rewards.

We find that sharing the historical information of neighbor users is sufficient to stabilize training. Therefore, an enhanced learning framework, NAAC, is proposed. While ensuring the convergence of training, it reduces the computational complexity and is more suitable for complex and varied communication scenarios.
I-C Paper Organization
The rest of this paper is organized as follows. Section II shows the system model. In Section III, we formulate the D2D communication environment as a partially observable Markov game and adopt the MAAC framework to address it. In Section IV, the NAAC framework with low computational complexity is proposed. The simulation results and analysis are presented in Section V. Finally, Section VI concludes the paper. The key mathematical notations used in our paper are listed in Table I.
Notations  Physical interpretation 

Number of CUEs, D2D pairs and RBs  
Power of BS and D2D transmitter  
Channel gains from the BS to  
Channel gains from to  
Channel gains from to  
Channel gains from the BS to  
Channel gains from to  
The power of AWGN  
SINR of the received signal at from BS in RB  
SINR of the received signal at from in RB  
Data rates of CUE  
Data rates of D2D pair  
The bandwidth of each RB  
State space and action space  
State, action and reward at time slot  
The instant channel information of the D2D link  
The channel information of the cellular link  
The previous interference to the link  
The RB selected by the D2D link in the previous time slot  
The SINR threshold of the CUE  
Positive reward  
Negative reward  
Transition probability  
The reward discount factor  
The sum of discounted future reward  
The expected cumulative discounted reward  
The policy of reinforcement learning  
The actionvalue function of reinforcement learning  
Deterministic target policy  
The weight of actor network and critic network  
Experience replay buffer  
“soft” update factor  
The number of neighbor D2D pairs taken in NAAC  
A set of neighbor D2D pairs of  
The states and actions of the neighbors of agent  
The number of neurons in the th layer of the actor network 

The number of neurons in the th layer of the critic network 
II System Model
As illustrated in Fig. 1, a downlink scenario in a single-cell system is considered. A set of CUEs, denoted as , and a set of active D2D pairs, denoted as , are located in the coverage area of the base station (BS). We denote the CUE in the system by , , the D2D pair by , , the transmitter and the receiver of a D2D pair by and , respectively. Orthogonal frequency division multiple access (OFDMA) is employed to support multiple access for both the cellular and D2D communications, where a set of resource blocks (RBs) are available for spectrum allocation. An RB is the smallest unit of spectrum resources that can be allocated to a user; it is 180 kHz wide in frequency and one slot long in time. In this system, the D2D pairs share the same spectrum with the CUEs. There are three types of interference in the system:

the interference received from the transmitter of a D2D pair at a CUE;

the interference received from the BS at a D2D receiver;

the interference received from the transmitter of a D2D pair at the receiver of another D2D pair sharing the same spectrum with that D2D pair.
We assume that the BS and the transmitter of a D2D pair transmit with powers and , respectively. Denote , , , , and as the channel gains of the cellular communication link from the BS to CUE , the D2D communication link from D2D transmitter to D2D receiver , the interference link from D2D transmitter to CUE , the interference link from the BS to D2D receiver and the interference link from D2D transmitter to D2D receiver when they share the same spectrum for data transmission respectively. The power of the additive white Gaussian noise (AWGN) at a receiver is denoted by .
The instantaneous SINR of the received signal at CUE, , from the BS in RB can be written as
(1) 
and the instantaneous SINR of the received signal at the D2D receiver, , from the D2D transmitter, , in RB can be written as
(2) 
where represents the set of D2D pairs to which RB is allocated.
With the instantaneous SINR, we can find the data rates of CUE, , and D2D pair, , by
(3) 
and
(4) 
where is the bandwidth of each RB.
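To make the SINR and rate expressions concrete, the following minimal sketch evaluates them numerically for a single RB. It assumes the standard forms implied by the three interference types listed above: the SINR at a CUE is the received BS power over the co-channel D2D interference plus noise, the SINR at a D2D receiver additionally counts the BS interference, and the data rate follows the Shannon formula. All function names, argument names and numerical values are illustrative assumptions and are not taken from the paper.

```python
import math


def cue_sinr(p_bs, g_bs_cue, p_d2d, g_d2d_to_cue, noise):
    """SINR at a CUE on one RB.

    g_d2d_to_cue lists the gains of the interference links from every
    D2D transmitter that reuses this RB to the CUE (assumed input format).
    """
    interference = sum(p_d2d * g for g in g_d2d_to_cue)
    return p_bs * g_bs_cue / (interference + noise)


def d2d_sinr(p_d2d, g_d2d, p_bs, g_bs_to_rx, g_other_d2d_to_rx, noise):
    """SINR at a D2D receiver on one RB.

    Interference comes from the BS and from the other D2D transmitters
    that share the same RB.
    """
    interference = p_bs * g_bs_to_rx + sum(p_d2d * g for g in g_other_d2d_to_rx)
    return p_d2d * g_d2d / (interference + noise)


def shannon_rate(bandwidth_hz, sinr):
    """Achievable rate in bit/s for a given RB bandwidth and SINR."""
    return bandwidth_hz * math.log2(1.0 + sinr)
```

For example, `shannon_rate(180e3, cue_sinr(...))` gives the rate of a CUE on one 180 kHz RB; adding more co-channel D2D transmitters to `g_d2d_to_cue` lowers that rate, which is the coupling the allocation scheme has to manage.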
We assume that each CUE has been assigned an RB and that an RB can be allocated to multiple D2D pairs. We also assume that D2D users who need to communicate have already completed pairing before spectrum allocation. When the algorithm is executed, the D2D pairs autonomously select RBs for communication. Traditionally, resource allocation in D2D communications is formulated as an NP-hard combinatorial optimization problem [35] with nonlinear constraints, which has prohibitive complexity. To address this issue, we investigate multi-agent RL for resource allocation in D2D communications.
III Multi-Agent Deep Reinforcement Learning based Spectrum Allocation
In this section, we first model the multi-agent environment and then propose a distributed framework based on multi-agent RL to address the spectrum allocation problem.
III-A Modeling of Multi-Agent Environments
In the RL model for D2D underlay communications, an agent, corresponding to a D2D pair, interacts with the environment and takes an action according to a policy. At each time , the D2D link, as the agent, observes a state, , from the state space, , and accordingly takes an action (select RBs or power levels), , from the action space, , based on the policy, . Following the action, the state of the environment transits to a new state and the agent receives a reward, .
In our system, the state space, , the action space, , and the reward function, , are defined as follows:
State space: The state observed by the D2D link (agent ) for characterizing the environment consists of several parts: the instant channel information of the D2D corresponding link, , the channel information of the cellular link, e.g., from the BS to the D2D transmitter, , the previous interference to the link, , the RB selected by the D2D link in the previous time slot, . Hence, . The instant channel information and the interference received reveal the quality of each channel.
Action space: At each time , the agent takes an action , which represents the selection of an RB by the agent, according to the current state, , based on the decision policy . The dimension of the action space is if there are RBs. Our method also scales to richer action spaces. Discrete-valued power levels and one or more RBs selected by D2D pairs can be modeled jointly as actions, whose dimension is if there are sets of optional RBs and the transmission power is discretized into levels. Therefore, our algorithm can also solve the joint problem of discrete-valued power control and spectrum allocation. Note that the action selection of each agent should satisfy the constraint , where represents the SINR threshold of the CUE.
Reward function: The learning process in RL is driven by the reward function. Each agent makes its decisions to maximize its reward through interaction with the environment. We therefore design a reward function for this distributed resource allocation problem as follows.
The reward function consists of two parts: the D2D link rate and the SINR constraint of the CUE. In our setting, the reward is positive if the SINR constraint is satisfied; otherwise, it is a negative reward, . When the D2D pair (agent ) takes an action at the current time slot , it receives a positive reward , proportional to the D2D link rate, if the constraint is satisfied. We use the Shannon capacity to evaluate ,
(5) 
where is the instantaneous SINR of the received signal at the D2D receiver at current time slot . Therefore, the reward function can be expressed as,
,  (6)  
otherwise.  (7) 
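The two-branch reward described above can be sketched as follows. This is only an illustration under stated assumptions: the positive branch is taken to be the Shannon capacity of the D2D link itself (the text only says it is proportional to the link rate), and the value of the negative reward is a placeholder, since the paper's value is not given in this section. The function and parameter names are hypothetical.

```python
import math


def d2d_reward(d2d_sinr, cue_sinr, sinr_threshold, bandwidth_hz,
               negative_reward=-1.0):
    """Reward of one agent (D2D pair) for one time slot.

    If the CUE sharing the selected RB still meets its SINR threshold,
    the agent receives the Shannon capacity of its own D2D link.
    Otherwise it receives a fixed penalty (placeholder value).
    """
    if cue_sinr >= sinr_threshold:
        return bandwidth_hz * math.log2(1.0 + d2d_sinr)
    return negative_reward
```

Because the penalty does not depend on the D2D rate, an agent cannot compensate for violating the cellular constraint by achieving a high rate on its own link.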
Most of the existing works model the policy search process in RL as a Markov decision process (MDP). In MDP, a sequence of resource management decisions of a learning agent by interacting with the wireless communication environment at some discrete time scale can be defined as a tuple , where is the transition probability when the agent takes the action from the current state to a new state , and is a discount factor.
However, in the decentralized setting of the spectrum allocation problem, all D2D links, as agents, independently update their policies as learning progresses. If two or more agents update simultaneously, this is a multi-agent environment that appears non-stationary from the view of any one agent, which violates the Markov assumptions required for the convergence of RL and causes instability in the training process.
To overcome the shortcomings of the MDP, we consider a multi-agent extension of the MDP, called a partially observable Markov game, to model multi-agent RL. In the multi-agent RL model for D2D underlay communications, at each time , the D2D link , as the agent , observes a state, , from the state space, , and accordingly takes an action, , from the action space, , selecting an RB based on the policy . Following the action, the state of the environment observed by agent transits to a new state and the agent receives a reward, .
An N-agent Markov game is formalized by a tuple , where denotes the state space, is the action space, which is assumed to be the same for all agents (D2D pairs), is the reward function for agent (D2D pair ), and is the transition probability when all agents take actions simultaneously from the current state to a new state . Compared with an MDP, a Markov game models state transitions more accurately. The constant represents the reward discount factor across time. At time step , all agents take their actions simultaneously, and each receives an immediate reward as a consequence of the previous actions of all agents. The return of agent from a state is defined as the sum of discounted future rewards
(8) 
where is the time horizon.
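As a small numerical illustration of the return defined above, the sketch below computes the discounted sum of a finite reward sequence for one agent. It assumes the usual definition, in which the reward received k steps ahead is weighted by the discount factor raised to the power k; the function name and the sample values are illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite horizon.

    The sum is accumulated backwards so that each step needs only one
    multiplication: G_t = r_t + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With three unit rewards and a discount factor of 0.5, the return is 1 + 0.5 + 0.25 = 1.75, which shows how later rewards contribute less to the objective.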
The goal of multi-agent RL is to learn a policy for each agent that maximizes the expected return from the start distribution, defined as the expected cumulative discounted reward
(9) 
III-B Multi-Agent Actor Critic for Spectrum Allocation
To overcome the inherent non-stationarity of the multi-agent environment and to exploit the cooperation between agents, a multi-agent actor-critic (MAAC) framework is adopted. It optimizes the policy by modeling the multi-agent environment as a Markov game and considering the action policies of other agents, so that it can learn policies that require complex multi-agent coordination. In addition, MAAC can make full use of the cooperation among users to further improve the overall performance of the system.
The architecture for MAAC based spectrum allocation in D2D underlay communications is shown in Fig. 2. Each D2D pair, , is supported by an autonomous agent . MAAC is an extension of AC [36] in which each agent is divided into two parts: a critic and an actor. We allow the policies to use the states and actions of all users to ease training. Because training deep networks incurs a large computational overhead, we move the training process to the BS. To do so, our scheme requires the D2D users to upload the historical information collected during execution to the BS. The centralized training process is carried out at the BS, where the critic is augmented with extra information about the policies of other agents to evaluate the quality of the action. In the distributed execution process, a D2D pair (agent ) downloads the trained weights of the actor from the BS and loads them into its own actor . The actor selects an action (RB) based on the state observed by the agent from the environment. When the agent takes the action , the environment returns a reward . When the communication is in good condition, the D2D pair can upload the historical information, including , collected at execution time to the BS for subsequent training.
An overview of MAAC with centralized training and decentralized execution is shown in Fig. 3. The states and actions of all agents are fed into the critic to evaluate the quality of the current actions. We allow the policies to use extra information to ease training as long as this information is not used at execution time. It is unnatural to do this with Q-learning based methods, as the Q function generally cannot contain different information at training and test time.
The goal in RL is to learn a policy which maximizes the expected return from the start distribution , where denotes the environment. To simplify the notation, the state , action , and return at the current moment are simply denoted as , , and , respectively, and those at the next moment are simply denoted as and , respectively. The action-value function is used in many RL algorithms. In a single-agent environment, it describes the expected return after taking an action in state and thereafter following policy :
(10) 
Many approaches in RL make use of the recursive relationship known as the Bellman equation [37]:
(11) 
We extend it to the multi-agent environment. Consider a Markov game with agents and denote as the set of all agent policies. The action-value function (critic) of agent can be written as:
(12) 
where consists of the states of all agents, , consists of the actions of all agents, , and is a centralized action-value function that takes the states and actions of all agents as input and outputs the Q-value for agent .
If the target policy is deterministic, we can describe it as a function : and avoid the inner expectation. We now consider deterministic policies (actor) denoted as . The action-value function (critic) of agent can be written as:
(13) 
According to AC [36], the critic can be learned using the Bellman equation as in Q-learning [38]. Q-learning is a commonly used off-policy algorithm using the greedy policy . We consider the function approximator of the centralized action-value function of agent parameterized by , which we optimize by minimizing the loss:
(14) 
where
(15) 
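The critic update can be illustrated with a small framework-free sketch. It assumes the target takes the usual form for a centralized critic with deterministic target policies, namely the agent's reward plus the discounted target-critic value of the next joint state and the joint action produced by all target actors, and that the loss is the mean squared difference between the critic output and this target. Critics and actors are passed in as plain Python callables; all names are hypothetical.

```python
def td_target(reward, gamma, target_critic, next_states, target_actors):
    """Target value: reward + gamma * Q'(next joint state, next joint action).

    The next joint action is obtained by letting every agent's target
    actor act on that agent's own next state.
    """
    next_actions = [mu(s) for mu, s in zip(target_actors, next_states)]
    return reward + gamma * target_critic(next_states, next_actions)


def critic_loss(critic, batch, gamma, target_critic, target_actors):
    """Mean squared TD error of one agent's critic over a mini-batch.

    Each batch element is (states, actions, reward, next_states), where
    states and actions hold one entry per agent.
    """
    total = 0.0
    for states, actions, reward, next_states in batch:
        y = td_target(reward, gamma, target_critic, next_states, target_actors)
        total += (critic(states, actions) - y) ** 2
    return total / len(batch)
```

In a real implementation the callables would be neural networks and the loss would be minimized by gradient descent on the critic weights; the sketch only shows which quantities enter the target and the loss.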
Based on the deterministic policy gradient (DPG) algorithm [39], a parameterized actor function can be used to specify the current policy by deterministically mapping states to a specific action. Policy gradient methods are known to exhibit high-variance gradient estimates, and this problem is exacerbated in multi-agent settings because an agent’s reward usually depends on the actions (RBs) of many agents (D2D pairs). When the actions of other agents are not considered in the agent’s optimization process, the reward conditioned only on the agent’s own actions exhibits much more variability, thereby increasing the variance of its gradients.
To analyze the variance of policy gradient methods in multi-agent settings, [40] considers a simple scenario with agents and binary actions: . The reward is defined to be if all actions are the same , and otherwise. Agents must simply learn to either always output or always output at each time step. It can be proved that the probability of taking a gradient step in the correct direction decreases exponentially with the number of agents, , as stated in the following proposition.
Proposition 1: Consider agents with binary actions: , where when , and otherwise. We assume an uninformed scenario, in which agents cannot get any information from each other and are initialized to . Then, if we are estimating the gradient of the with policy gradient, we have:
(16) 
where is the policy gradient estimator from a single sample, and is the true gradient.
denotes the probability of taking a gradient step in the right direction that increases reward. Equation (16) indicates that the probability of taking a gradient step in the right direction decreases exponentially, as the number of agents increases.
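The exponential decay stated in Proposition 1 can be checked empirically with a short Monte Carlo experiment. The sketch below rests on explicit assumptions that are one reading of the setup, not necessarily the exact setting of [40]: each agent outputs 1 with probability 0.5, the reward equals 1 only when every agent outputs 1 and is 0 otherwise, and the single-sample gradient is the score-function (REINFORCE) estimator. Under these assumptions the estimator is non-zero only when all agents output 1, so the estimated probability should be close to 0.5 raised to the number of agents.

```python
import random


def prob_correct_direction(num_agents, num_samples=200_000, seed=0):
    """Estimate P(<single-sample gradient, true gradient> > 0)."""
    rng = random.Random(seed)
    theta = 0.5  # uninformed initialization of every agent
    # With J = product of theta_i, dJ/dtheta_i = theta**(N - 1) > 0.
    true_grad = theta ** (num_agents - 1)
    hits = 0
    for _ in range(num_samples):
        actions = [1 if rng.random() < theta else 0 for _ in range(num_agents)]
        reward = 1.0 if all(actions) else 0.0
        # Score-function estimate: reward * d/dtheta_i log P(a_i).
        grad_est = [
            reward * (1.0 / theta if a == 1 else -1.0 / (1.0 - theta))
            for a in actions
        ]
        inner = sum(g * true_grad for g in grad_est)
        if inner > 0:
            hits += 1
    return hits / num_samples
```

Calling the function for an increasing number of agents shows the estimate roughly halving with each additional agent, which is the behavior the proposition describes.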
MAAC addresses the high variance of the gradient estimates in policy gradient methods. The centralized critic in MAAC helps reduce the variance of the gradients, since the critic is augmented with extra information about the policies of other agents, which removes a source of uncertainty. In addition, when the reward is conditioned only on the agent’s own actions, there is significant variability associated with the actions of other agents, which is largely removed when these actions are used as input to the critic.
In MAAC, if the deterministic policy of agent is parameterized by , the actor of agent is updated by applying the chain rule to the expected return from the start distribution with respect to the actor parameters:
(17)
Here is the experience replay buffer, which contains the tuples , recording the experiences of all agents.
MAAC controls the update of historical states by using a fixed-size experience replay buffer. The experience replay buffer is a finite-sized cache. Transitions are sampled from the environment and the tuple is stored in the replay buffer. When the replay buffer is full, the oldest samples are discarded. At each time step, the actor and critic are updated by sampling a minibatch uniformly from the buffer. Since MAAC is an off-policy algorithm, the replay buffer can be large, allowing the algorithm to benefit from learning across a set of uncorrelated transitions.
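The buffer behavior described above (fixed capacity, oldest samples discarded first, uniform minibatch sampling) can be sketched in a few lines. The class name, the method names and the layout of a stored transition are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size cache of joint transitions of all agents."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest item once it is full.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, states, actions, rewards, next_states):
        """Store one joint transition (one entry per agent in each field)."""
        self._buffer.append((states, actions, rewards, next_states))

    def sample(self, batch_size):
        """Draw a minibatch uniformly at random, without replacement."""
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)
```

Sampling uniformly from a large buffer breaks the temporal correlation between consecutive transitions, which is why this structure suits an off-policy method.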
A primary motivation behind MAAC is that, if we know the actions taken by all agents, the environment is stationary even as the policies change [40] since
(18)  
for any . We use an N-agent Markov game to model the multi-agent RL for D2D underlay communications, where the transition probability is . So if we know the actions taken by all agents, (18) clearly holds. The constant transition probability satisfies the Markov assumption required for the convergence of RL. Therefore, the experience replay buffer can be used in MAAC; at the same time, the training process of MAAC mitigates the inherent non-stationarity of the multi-agent environment and converges well. Moreover, the critic considers the actions of all agents to evaluate the quality of the selected action and can fully exploit the cooperation between the agents.
The mapping between the state space and the action space in the actor and the action-value function in the critic both need to be approximated by a function approximator. Q-learning works well, and a lookup table can be used to implement the update rule, if the state and action spaces of the problem are small. However, if the state-action space is too large, many states may be rarely visited and the corresponding Q-values are seldom updated, leading to a much longer time to converge [25]. To solve this problem, deep neural networks (DNNs) are used to approximate the mapping in high-dimensional space. The weight of a DNN is updated by training. Once is determined, a state corresponds to a unique action. The DNN can approximate a complex mapping between high-dimensional spaces based on a large amount of training data that is used to update .
In MAAC, we denote the set of actor networks and critic networks of all agents as and with the weights and , respectively. The input of the actor network is the state observed by the agent, and the output is the selected action. The hidden layers in the actor network are all fully connected layers. Fig. 4 provides the structure of the critic network. The critic network first passes the states of all agents through a fully connected layer; the result, together with the actions of all agents, then goes through several fully connected layers, and the network finally outputs the Q-value.
Directly implementing Q-learning in equation (14) with neural networks has proved to be unstable in many environments. Since the network being updated is also used in calculating the target value in equation (15), the Q-value update is prone to divergence. To solve this problem, we use “soft” target updates. We create a copy of the actor and critic networks for every agent, and respectively, which are used for calculating the target values. The weights of these target networks are then updated by having them slowly track the learned networks
(19) 
with “soft” update factor . In the experiment, . This means that the target values are constrained to change slowly, greatly improving the stability of learning.
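The soft target update can be sketched on plain lists of weights. The sketch assumes the common form in which each target weight moves a small fraction of the way towards the corresponding learned weight at every step; the function name and the factor used in the example are illustrative, and the factor used in the experiments is not reproduced here.

```python
def soft_update(target_weights, online_weights, tau):
    """Return tau * online + (1 - tau) * target, element by element.

    A small tau makes the target network track the learned network
    slowly, which keeps the target values from changing abruptly.
    """
    return [
        tau * w + (1.0 - tau) * w_t
        for w, w_t in zip(online_weights, target_weights)
    ]
```

With a factor of 0.5, a target weight of 0 and a learned weight of 1 give an updated target weight of 0.5; in practice a much smaller factor is typical, so many steps are needed before the target catches up.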
III-C Training and Execution
In use, the MAAC framework is divided into training and execution. Since the BS has more computing power than the mobile devices, the training part of the algorithm is completed at the BS; the users only need to download the weights of the trained target actor network from the BS and use only the actor part to execute the algorithm in a distributed manner. The training algorithm is shown in Algorithm 1. The MAAC framework uses historical information to train the DNNs of the actor part and the critic part, and returns the weights of the target actor network . New data is generated when the algorithm is executed, which can be added to the experience replay buffer to further fine-tune the weights.
The execution algorithm is as shown in Algorithm 2. Users download from the base station and import the weights into their actor networks. All agents input the observed states into the actor networks and the selected actions are output, where the actions correspond to the selected RB.
IV Neighbor-Agent Deep Reinforcement Learning based Spectrum Allocation
The MAAC framework proposed in Section III can make full use of the cooperation among users to improve system performance while converging quickly. However, MAAC requires the information of all agents to assist training, which results in high computational complexity and overhead when the number of users is large. Therefore, in this section we further reduce the complexity of MAAC without losing too much performance.
IV-A Neighbor-Agent Actor Critic for Spectrum Allocation
In order to ensure the stability of the multi-agent environment and make full use of the cooperation between users, MAAC adds the states and actions of all agents to the critic network for training. The geographic locations of all CUEs and D2D pairs and the spectrum allocation results after MAAC training converges are visualized in Fig. 5 (a), where different colors represent different RBs and the numbers of CUEs, D2D pairs and RBs are , and , respectively. We can see that users who are close together are allocated different RBs, while users who are far apart may share RBs. This is because, in a wireless communication environment, inter-user interference is mainly caused by neighboring users. When the users’ transmit power is constant, the main factor affecting the strength of inter-user interference is large-scale fading, which is mainly determined by the distance between users. Therefore, it is not necessary to have all users’ information to ensure the stability of the environment; the information of the neighbor users is enough. We thus improve MAAC by proposing a framework in which only the states and actions of a fixed number of agents adjacent to the target agent are added to the critic network for training, called neighbor-agent actor critic (NAAC).
We use distance to define the neighbor users. In NAAC, for a D2D pair , we define D2D pairs closest to as the neighbor users of . We denote a set of D2D pairs which contains itself and its neighbor users as . We can use the information of neighbor agents (D2D pairs) instead of global information to ensure the stability of the multi-agent environment since
(20) 
where contains the actions of the neighbors of agent .
The loss function and gradient update function of NAAC have the same form as those of MAAC; the difference is that and in equations (14) and (17) are changed to and , where contains the states of the neighbors of agent , . The actor network in the NAAC framework is identical to that in the MAAC framework. The input of the critic network in the NAAC framework is changed to the states and actions of the neighbor agents. Like MAAC, NAAC uses an experience replay mechanism to overcome the correlation and non-stationary distribution of empirical data. In addition, NAAC also uses “soft” target updates to ensure the stability of training.
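The neighbor selection and the resulting critic input can be sketched as follows. This is only an illustration under stated assumptions: each D2D pair is represented by a single 2-D coordinate, states and actions are plain lists, and the names `nearest_neighbors` and `critic_input` are hypothetical helpers rather than part of the framework definition.

```python
import math


def nearest_neighbors(positions, agent, k):
    """Indices of the k D2D pairs closest to `agent`, excluding the agent itself."""
    ax, ay = positions[agent]
    others = [i for i in range(len(positions)) if i != agent]
    others.sort(key=lambda i: math.hypot(positions[i][0] - ax,
                                         positions[i][1] - ay))
    return others[:k]


def critic_input(states, actions, positions, agent, k):
    """Concatenate states and actions of the agent and its k nearest neighbors."""
    group = [agent] + nearest_neighbors(positions, agent, k)
    flat = []
    for i in group:
        flat.extend(states[i])
    for i in group:
        flat.extend(actions[i])
    return group, flat


# Hypothetical layout: five D2D pairs, 2-dimensional states, one-hot actions over 3 RBs.
positions = [(0.0, 0.0), (10.0, 0.0), (200.0, 50.0), (15.0, 5.0), (400.0, -300.0)]
states = [[0.1 * i, 0.2 * i] for i in range(5)]
actions = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1], [1, 0, 0]]

group, flat = critic_input(states, actions, positions, agent=0, k=2)
print(group, len(flat))
```

The length of the critic input depends only on the number of neighbors and the per-agent state and action sizes, not on the total number of D2D pairs in the cell.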
IV-B Training and Execution
The NAAC framework can also be divided into two parts: the training process and the execution process. The execution algorithm of NAAC is exactly the same as that of MAAC. In the training algorithm, for each time slot , we save the tuples in the experience replay buffer and sample a random minibatch of tuples from the buffer, then set
(21) 
and update the critic by minimizing the loss in equation (14), where and are changed to and . Finally, we update the actor policy using the sampled policy gradient according to equation (17), where and are changed to and . The remaining steps in the training algorithm are the same as in MAAC.
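The buffering and sampling steps above can be sketched with a fixed-capacity buffer. The class `ReplayBuffer`, the tuple layout `(state, action, reward, next_state)` and the helper `td_target` (the standard one-step target, reward plus discounted target-critic value) are assumptions made for illustration; the discount factor used in the example is a placeholder, not the value from our simulations.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience buffer; the oldest tuples are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks temporal correlation.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q, gamma):
    """One-step target: immediate reward plus discounted target-critic value."""
    return reward + gamma * next_q


# Hypothetical usage with scalar placeholders for states and actions.
buf = ReplayBuffer(capacity=1000)
for t in range(50):
    buf.push(state=t, action=t % 3, reward=1.0, next_state=t + 1)

batch = buf.sample(8, rng=random.Random(0))
targets = [td_target(r, next_q=0.0, gamma=0.9) for (_, _, r, _) in batch]
print(len(batch), targets[0])
```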
The spectrum allocation results after NAAC training converges are visualized in Fig. 5 (b), where NAAC works with the number of neighbor users . We find that sharing the historical information of neighbor users is enough to keep training stable and to obtain a reasonable spectrum allocation result.
IV-C Computational Complexity and Overhead Analysis
Computational complexity is critical to the utility of an algorithm. Therefore, we analyze the computational complexity of the two proposed methods at execution time. Define the number of neurons in the th layer of the actor network as . The computational complexity of the th layer is . The computational complexity of the actor network is , where is the number of layers of the actor network. The critic network is also a fully connected network. Define the number of neurons in the th layer of the critic network as . The computational complexity of the th layer is . The computational complexity of the critic network is , where is the number of layers of the critic network.
For MAAC and NAAC, only the actor network is used during execution, and the two frameworks use the same actor network, so MAAC and NAAC are executed with the same complexity, which is . Both the actor network and the critic network participate in the training process, so the computational complexity of the training process is . Since MAAC needs to input the states and actions of all users into the critic network during training, the first layer of the critic network of MAAC has more neurons than that of NAAC, so the computational complexity of training MAAC is also higher than that of NAAC.
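To make the comparison concrete, the following sketch counts the multiplications of one forward pass through a fully connected network as the sum of products of adjacent layer widths, which is the usual cost model for such layers. The hidden-layer widths are those reported in the simulation setup; the state dimension (20), the number of agents (10), the number of neighbors (3) and the treatment of the critic input as one concatenated vector are assumptions chosen only for illustration.

```python
def fc_multiplications(layer_sizes):
    """Multiplications in one forward pass of a fully connected network."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


STATE_DIM = 20      # assumed per-agent state size
NUM_RBS = 10        # action size (one output per RB)
NUM_AGENTS = 10     # assumed number of D2D pairs for MAAC
NUM_NEIGHBORS = 3   # assumed number of neighbors for NAAC

actor = [STATE_DIM, 512, 128, NUM_RBS]
critic_maac = [NUM_AGENTS * (STATE_DIM + NUM_RBS), 1024, 512, 256, 1]
critic_naac = [(NUM_NEIGHBORS + 1) * (STATE_DIM + NUM_RBS), 1024, 512, 256, 1]

print("actor:", fc_multiplications(actor))
print("MAAC critic:", fc_multiplications(critic_maac))
print("NAAC critic:", fc_multiplications(critic_naac))
```

Under these assumptions only the first critic layer differs between the two frameworks, so the gap grows linearly with the number of D2D pairs for MAAC while the NAAC cost stays fixed.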
The overhead of system deployment is also important for the utility of the system. The deep learning training process incurs a large computational overhead, and it is unrealistic to complete the training on a mobile device. Therefore, we transfer the training process to the BS, which can easily deploy hardware such as GPUs and thus has considerably more computing power. To transfer the training process to the BS, our scheme requires D2D users to upload the historical information collected during execution to the BS. This historical information includes the states observed by the D2D users, the actions taken and the rewards obtained, all of which are numeric data. The historical information generated by a UE in a time slot is only a few kilobytes in size, which results in a small transmission overhead. After the training process is completed, the device only needs to download the weights of the trained actor network from the BS and import them into its own actor network to perform spectrum selection. The weights of the neural network are also numerical data. The weights of each UE’s actor network are about 300 KB in size, which does not cause much transmission overhead. In summary, transferring the complex training process to the BS requires only a small amount of transmission overhead. Traditional strategies require users to report channel state information or exchange information with each other in real time, which can cause serious signaling overhead. Our method requires neither real-time reporting nor real-time information exchange; it only requires users to upload samples of the historical information they have collected to the BS when the communication conditions are good, which saves a large amount of real-time signaling overhead.
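As a rough plausibility check of the download size, the sketch below counts the weights and biases of a fully connected actor and converts them to kilobytes. The hidden widths (512 and 128) and the 10 outputs follow the simulation setup; the input dimension of 20 and the 4-byte (32-bit float) storage per parameter are assumptions, so the result should be read only as an order-of-magnitude estimate.

```python
def fc_parameter_count(layer_sizes):
    """Number of weights plus biases in a fully connected network."""
    return sum(a * b + b for a, b in zip(layer_sizes, layer_sizes[1:]))


def weight_size_kb(layer_sizes, bytes_per_param=4):
    """Storage size in KB, assuming `bytes_per_param` bytes per parameter."""
    return fc_parameter_count(layer_sizes) * bytes_per_param / 1024


actor = [20, 512, 128, 10]  # input size 20 is an assumed state dimension
print(fc_parameter_count(actor), round(weight_size_kb(actor), 1))
```

With these assumed dimensions the estimate is on the order of a few hundred kilobytes per UE.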
IV-D Comparison of MAAC and NAAC
Since the input of the critic network in NAAC is the information of only a part of the agents, the data dimension is smaller and the required neural network is also smaller, which reduces the computational complexity of the algorithm and improves the training speed. In addition, since the input of the critic network in NAAC is the states and actions of a fixed number of neighbor users of the target agent, the network structure of NAAC does not change when the number of D2D pairs in the cell changes, so the previously trained weights can continue to be used to speed up the training process. Compared with MAAC, since NAAC does not share all users’ states, actions and policies, its modeling of user state transitions is not as accurate as that of MAAC, which will inevitably cause some loss in the convergence of training and the reliability of user communication. However, NAAC has better generalization ability, scales well to larger networks, adapts to more varied environments and saves computing resources.
V Performance Evaluation
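In this section, we compare MAAC and NAAC with four other distributed approaches: Q-learning, DQN and AC, which are the three learning-based baselines discussed with the results below, and a game-theoretic approach, the Uncoupled Stochastic Learning Algorithm (denoted as SLA), which is developed in [14].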
Since we assume that each D2D pair can only obtain its own CSI and there is no information exchange among D2D users, centralized approaches with global information do not participate in performance comparisons.
For the simulation, we consider a single-cell scenario with a radius of 500 m. We assume that each CUE has been assigned an RB and that an RB can be allocated to multiple D2D pairs, so we set the number of RBs to be the same as the number of CUEs. The size of the experience replay buffer is set to 1,000,000. The CUEs and D2D pairs are distributed randomly in the cell, where the communication distance of each D2D pair cannot exceed a given maximum distance of 30 m. The detailed parameters can be found in Table II. The actor network in our proposed frameworks is a four-layer fully connected neural network with two hidden layers. The numbers of neurons in the two hidden layers are 512 and 128, respectively. The critic network in our proposed frameworks is a five-layer fully connected neural network with three hidden layers. The numbers of neurons in the three hidden layers are 1024, 512 and 256, respectively. The ReLU function is used as the activation function. The learning rates of the actor and critic parts are 0.0001 and 0.001, respectively. The reward discount factor . The UE noise figure is taken as 8 dB, and the negative reward . The channel model is set according to the 3GPP Technical Specification [42]. In the first 2000 time slots of the system simulation, we use random allocation to assign RBs to users so that the framework can collect a certain number of samples for training, and then apply our algorithm for spectrum allocation. All simulations were conducted with the PyTorch deep learning framework on an NVIDIA Tesla M40 GPU with 24 GB of memory.
Parameter  Value
Cell radius  500 m
Maximum D2D pair distance  30 m
Carrier frequency  2 GHz
RB bandwidth  180 kHz
Number of CUEs  10
Number of RBs  10
Number of D2D pairs  10, 20, …, 50
BS transmission power ()  46 dBm
D2D transmission power ()  13 dBm
Cellular link path loss  
D2D link path loss exponent  4
UE thermal noise density  -174 dBm/Hz
CUE target SINR threshold ()  0 dB
UE noise figure  8 dB
Negative reward ()  -1
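As a worked example of how the noise-related entries of Table II combine, the sketch below computes the receiver noise power over one RB, taking the thermal noise density as -174 dBm/Hz (the standard room-temperature value), the RB bandwidth as 180 kHz and the noise figure as 8 dB. The helper names are hypothetical, and the formula is the usual link-budget expression rather than one quoted from the channel model in [42].

```python
import math


def noise_power_dbm(density_dbm_per_hz, bandwidth_hz, noise_figure_db):
    """Receiver noise power in dBm: density + 10*log10(bandwidth) + noise figure."""
    return density_dbm_per_hz + 10 * math.log10(bandwidth_hz) + noise_figure_db


def dbm_to_watt(power_dbm):
    """Convert a power level from dBm to watts."""
    return 10 ** ((power_dbm - 30) / 10)


noise_dbm = noise_power_dbm(-174.0, 180e3, 8.0)
print(f"noise power per RB: {noise_dbm:.2f} dBm")
print(f"BS power:  {dbm_to_watt(46.0):.2f} W")
print(f"D2D power: {dbm_to_watt(13.0) * 1e3:.2f} mW")
```

The noise power per RB comes out near -113.4 dBm, far below both transmit powers, so in this setting link quality is governed mainly by path loss and interference.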
V-A Simulation Results
Fig. 6 compares the convergence of the five approaches in terms of the total reward performance when the number of D2D pairs is 10 and NAAC works with the number of neighbor users . Total reward is the sum of the rewards obtained by the agents corresponding to all D2D pairs. SLA is not included since it is an online learning algorithm that does not have an offline training process. From Fig. 6, the proposed MAAC and NAAC converge to the maximum total reward within only 60 time slots. The two proposed methods achieve a larger total reward and the most stable convergence (fewer fluctuations) compared to the other three algorithms. The total reward performance and convergence of Q-learning are the worst since Q-learning does not work well when the state-action space is very large. DQN solves the mapping problem of high-dimensional space by introducing a DNN to approximate the complex mapping over the state-action space. Compared to Q-learning, DQN improves in both total reward performance and convergence. The performance of the AC algorithm is better than that of Q-learning and DQN since it optimizes the policy by combining policy learning and value learning, which has good convergence properties. However, none of the above three algorithms considers the impact of the multi-agent environment on the stability of the training process or the effect of cooperation between multiple agents (D2D pairs) on system performance. The two proposed approaches introduce the state and action information of extra D2D pairs to assist the training process, which greatly improves the stability of the training process, achieves a higher total reward and converges quickly.
The outage probability reflects the reliability of the communication links. In Fig. 7, we show the outage probability of cellular links as a function of the number of D2D pairs, where NAAC works with . The outage probability of cellular links increases as the number of D2D links grows since more D2D pairs share the spectrum with the CUEs, which causes the CUEs to suffer more severe cross-layer interference. The two proposed methods are better than the other four algorithms because the reward function in the proposed frameworks penalizes policies that do not meet the SINR threshold of the CUEs and the frameworks introduce the states and actions of extra D2D pairs to assist the training process. Therefore, the policies of different D2D pairs can be coordinated with each other to prevent multiple D2D pairs from simultaneously selecting the same RB and causing severe cumulative interference to the CUE. From Fig. 7, the MAAC algorithm achieves the lowest outage probability, which is 0.005 lower than that of the NAAC algorithm. Since the MAAC algorithm uses the information of all D2D pairs for centralized training, the learned strategy meets the SINR constraints of the CUEs more strictly than that of NAAC, which uses only part of the D2D pairs’ information for training.
Fig. 8 illustrates the outage probability of D2D links as a function of , where for NAAC. The outage probability of D2D links increases as the number of D2D links grows since more D2D pairs share the spectrum and the co-layer interference among them becomes more serious. From Fig. 8, the two proposed algorithms are clearly superior to the other algorithms since they make full use of the cooperation between D2D pairs, so that the policies learned by the agents can coordinate with each other and avoid selecting the same RB at the same time, which leads to better transmission quality. For Q-learning, DQN and AC, the outage probabilities increase significantly with the number of D2D links. This is because the policies learned by these three algorithms consider no information about other D2D pairs when they are executed, which results in multiple D2D pairs competing for the same RB and seriously degrades the transmission quality of D2D links. The SLA algorithm achieves performance close to that of our proposed algorithms since it estimates the interference experienced by the D2D pairs and takes an action based on this estimate.
Fig. 9 shows the sum rate of D2D links as a function of , where for NAAC. The sum rate of D2D links increases as the number of D2D links grows since more D2D pairs are allocated to RBs. When the number of D2D links increases, the outage probability increases due to higher interference. Therefore, the slope of all the curves in Fig. 9 is decreasing. The proposed methods are significantly better than the other four algorithms, and the advantage becomes more significant as the number of D2D links increases. The other four distributed algorithms can only achieve individual optimization, so global optimization cannot be guaranteed, whereas the proposed methods adopt a framework of centralized training with decentralized execution, which can optimize the sum rate of D2D links. The performance indicators in Fig. 8 and Fig. 9 are both related to D2D communication, and NAAC is slightly better than MAAC on them. That is because our optimization goal is to maximize the sum rate of D2D communications while guaranteeing the outage probability of cellular links. In order to achieve this goal, our reinforcement learning framework penalizes actions that fail to meet the SINR requirements of cellular users. MAAC uses more information to better achieve this goal, so it achieves the lowest outage probability for cellular users. Ensuring the communication quality of cellular users inevitably costs some performance of D2D users, so the performance of MAAC is slightly worse than that of NAAC in Fig. 8 and Fig. 9.
In Fig. 10, we show the outage probability of cellular and D2D links as a function of the number of neighbor users in NAAC with number of D2D pairs . For NAAC, the outage probability of the cellular links is greater than that of MAAC when is small, and it decreases continuously as increases until it is equal to that of MAAC. Since NAAC obtains more comprehensive information during training as increases, the trained policy can provide more reliable protection for the transmission quality of CUEs. In addition, the outage probability of D2D links with NAAC is smaller than that with MAAC when is small, and it increases with until it is equal to that of MAAC. The reason is that cellular communications have a higher priority than D2D communications, so the policy trained by the MAAC algorithm using global information sacrifices the transmission quality of some D2D links to meet the transmission quality requirements of cellular users.
Fig. 11 illustrates the sum rate of cellular and D2D links as a function of in NAAC with . For NAAC, the sum rate of cellular links is 2 bit/s/Hz lower than that of MAAC at , and the rate increases as . This is because NAAC uses only part of the information of the neighbor D2D pairs for training, so the trained policy does not guarantee the transmission quality of the CUEs as adequately as MAAC does. In addition, the sum rate of D2D links with NAAC is larger than that with MAAC when is small, and it continues to decrease as increases until it is equal to that of MAAC. The reason is that the constraint of satisfying the SINR of the cellular users has higher priority than increasing the D2D rate. As increases, the information obtained by the NAAC algorithm during training is more comprehensive, and the trained policy satisfies the constraint more strictly, resulting in the loss of a portion of the D2D sum rate. According to the simulation results in Fig. 10 and Fig. 11, when using NAAC, can be flexibly adjusted according to different communication scenarios to meet various communication requirements.
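For readers who wish to reproduce metrics of this kind, the sketch below shows one straightforward way to estimate an outage probability and a sum rate from per-link SINR samples. It assumes that an outage is a link whose SINR falls below a threshold and that the per-link spectral efficiency follows the Shannon formula in bit/s/Hz; the function names and the sample values are hypothetical and are not simulation outputs.

```python
import math


def db_to_linear(value_db):
    """Convert a ratio from decibels to linear scale."""
    return 10 ** (value_db / 10)


def outage_probability(sinrs_db, threshold_db):
    """Fraction of links whose SINR is strictly below the threshold."""
    below = sum(1 for s in sinrs_db if s < threshold_db)
    return below / len(sinrs_db)


def sum_rate(sinrs_db):
    """Sum of Shannon spectral efficiencies log2(1 + SINR) in bit/s/Hz."""
    return sum(math.log2(1 + db_to_linear(s)) for s in sinrs_db)


# Hypothetical SINR samples (dB) for four links.
samples_db = [-3.0, 0.0, 5.0, 10.0]
print(outage_probability(samples_db, threshold_db=0.0))
print(round(sum_rate(samples_db), 3))
```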
V-B Discussion
The two proposed frameworks exploit the advantages of both centralized and distributed schemes. Compared with centralized methods, our methods are executed without requiring global information, which significantly reduces the signaling overhead and alleviates the computational pressure on the BS. Compared with distributed methods, our methods use the historical information of extra users to learn policies of mutual cooperation, which avoids frequent real-time information exchange between users and makes them more suitable for user-dense communication scenarios. In addition, our methods can transfer the complex training process to the cloud (BS), which significantly reduces the computational complexity of algorithm execution. Each of the two proposed methods has its own advantages. MAAC uses the historical information of all users to assist in training. The trained policies can meet very strict transmission quality requirements and are suitable for high-reliability wireless communication scenarios. NAAC uses the historical information of a fixed number of users to assist training, has better generalization ability, scales well to larger networks and has lower training complexity.
VI Conclusion
This paper has studied the resource management problem in D2D underlay communications and formulated the intelligent spectrum allocation problem as a decentralized multi-agent deep RL model to improve the sum rate of D2D links while ensuring the transmission quality of CUEs. In order to make full use of the performance gains brought by cooperation between users, the MAAC framework of centralized training with distributed execution is adopted, which not only requires no signaling interaction but also ensures the convergence of the algorithm. In addition, the NAAC framework with lower computational complexity and better generalization ability is proposed. The simulation results show that, compared with existing approaches, the proposed approaches can effectively guarantee the transmission quality of the CUEs, greatly improve the sum rate of D2D links and converge better. The proposed methods can be used to address the intelligent resource management problem in D2D-based Internet of Vehicles networks. In future work, we plan to combine the proposed approaches with continuous-valued power control and design an integrated deep reinforcement learning framework that automatically selects the RB and transmit power to further improve the effectiveness and robustness of the algorithm.
References
 [1] Y. Kai, J. Wang, H. Zhu, and J. Wang, “Resource allocation and performance analysis of cellular-assisted OFDMA device-to-device communications,” IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 416–431, Jan. 2019.
 [2] D. Feng, L. Lu, Y. Yuan-Wu, G. Y. Li, G. Feng, and S. Li, “Device-to-device communications underlaying cellular networks,” IEEE Trans. Commun., vol. 61, no. 8, pp. 3541–3551, Aug. 2013.
 [3] H. Min, W. Seo, J. Lee, S. Park, and D. Hong, “Reliability improvement using receive mode selection in the device-to-device uplink period underlaying cellular networks,” IEEE Trans. Wireless Commun., vol. 10, no. 2, pp. 413–418, Feb. 2011.
 [4] P. Phunchongharn, E. Hossain, and D. I. Kim, “Resource allocation for device-to-device communications underlaying LTE-advanced networks,” IEEE Wireless Commun., vol. 20, no. 4, pp. 91–100, Aug. 2013.
 [5] R. S. Sutton, A. G. Barto et al., Introduction to reinforcement learning. MIT Press, Cambridge, 1998, vol. 135.
 [6] D. Wu and N. Ansari, “High capacity spectrum allocation for multiple D2D users reusing downlink spectrum in LTE,” in Proc. IEEE ICC, May 2018, pp. 1–6.
 [7] A. Köse and B. Özbek, “Resource allocation for underlaying device-to-device communications using maximal independent sets and knapsack algorithm,” in Proc. IEEE PIMRC, Sep. 2018, pp. 1–5.
 [8] Z. Kuang, G. Liu, G. Li, and X. Deng, “Energy efficient resource allocation algorithm in energy harvesting-based D2D heterogeneous networks,” IEEE Internet Things J., vol. 6, no. 1, pp. 557–567, Feb. 2019.
 [9] H. Tamura, M. Sengoku, K. Nakano, and S. Shinoda, “Graph theoretic or computational geometric research of cellular mobile communications,” in Proc. IEEE ISCAS, vol. 6, May 1999, pp. 153–156.
 [10] A. Checco and D. J. Leith, “Learning-based constraint satisfaction with sensing restrictions,” IEEE J. Sel. Areas Commun., vol. 7, no. 5, pp. 811–820, Oct. 2013.
 [11] L. Wang, H. Tang, H. Wu, and G. L. Stüber, “Resource allocation for D2D communications underlay in Rayleigh fading channels,” IEEE Trans. Veh. Technol., vol. 66, no. 2, pp. 1159–1170, Feb. 2017.
 [12] F. W. Zaki, S. Kishk, and N. H. Almofari, “Distributed resource allocation for D2D communication networks using auction,” in Proc. IEEE NRSC, Mar. 2017, pp. 284–293.
 [13] H. Nguyen, M. Hasegawa, and W. Hwang, “Distributed resource allocation for D2D communications underlay cellular networks,” IEEE Commun. Lett., vol. 20, no. 5, pp. 942–945, May 2016.
 [14] S. Dominic and L. Jacob, “Distributed resource allocation for D2D communications underlaying cellular networks in time-varying environment,” IEEE Commun. Lett., vol. 22, no. 2, pp. 388–391, Feb. 2018.
 [15] C. Jiang, H. Zhang, Y. Ren, Z. Han, K. Chen, and L. Hanzo, “Machine learning paradigms for next-generation wireless networks,” IEEE Wireless Commun., vol. 24, no. 2, pp. 98–105, Apr. 2017.

 [16] X. Wang, X. Li, and V. C. M. Leung, “Artificial intelligence-based techniques for emerging heterogeneous network: State of the arts, opportunities, and challenges,” IEEE Access, vol. 3, pp. 1379–1391, 2015.
 [17] R. Li, Z. Zhao, X. Chen, J. Palicot, and H. Zhang, “TACT: A transfer actor-critic learning framework for energy saving in cellular radio access networks,” IEEE Trans. Wireless Commun., vol. 13, no. 4, pp. 2000–2011, Apr. 2014.
 [18] K. A.M, F. Hu, and S. Kumar, “Intelligent spectrum management based on transfer actor-critic learning for rateless transmissions in cognitive radio networks,” IEEE Trans. Mobile Comput., vol. 17, no. 5, pp. 1204–1215, May 2018.
 [19] Y. Saleem, K. A. Yau, H. Mohamad, N. Ramli, M. H. Rehmani, and Q. Ni, “Clustering and reinforcement-learning-based routing for cognitive radio networks,” IEEE Wireless Commun., vol. 24, no. 4, pp. 146–151, Aug. 2017.
 [20] Y. Luo, Z. Shi, X. Zhou, Q. Liu, and Q. Yi, “Dynamic resource allocations based on Q-learning for D2D communication in cellular networks,” in Proc. ICCWAMTIP, Dec. 2014, pp. 385–388.
 [21] K. Zia, N. Javed, M. N. Sial, S. Ahmed, A. A. Pirzada, and F. Pervez, “A distributed multi-agent RL-based autonomous spectrum allocation scheme in D2D enabled multi-tier HetNets,” IEEE Access, vol. 7, pp. 6733–6745, 2019.
 [22] H. Yang, X. Xie, and M. Kadoch, “Intelligent resource management based on reinforcement learning for ultra-reliable and low-latency IoV communication networks,” IEEE Trans. Veh. Technol., vol. 68, no. 5, pp. 4157–4169, May 2019.
 [23] H. Ye and G. Y. Li, “Deep reinforcement learning for resource allocation in V2V communications,” in Proc. IEEE ICC, May 2018, pp. 1–6.
 [24] H. Ye and G. Y. Li, “Deep reinforcement learning based distributed resource allocation for V2V broadcasting,” in Proc. IEEE IWCMC, Jun. 2018, pp. 440–445.
 [25] H. Ye, G. Y. Li, and B. F. Juang, “Deep reinforcement learning based resource allocation for V2V communications,” IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 3163–3173, Apr. 2019.
 [26] A. Asheralieva and Y. Miyanaga, “An autonomous learning-based algorithm for joint channel and power level selection by D2D pairs in heterogeneous cellular networks,” IEEE Trans. Commun., vol. 64, no. 9, pp. 3996–4012, Sep. 2016.
 [27] L. Liang, H. Ye, and G. Y. Li, “Spectrum sharing in vehicular networks based on multi-agent reinforcement learning,” arXiv preprint arXiv:1905.02910, 2019.
 [28] Y. Zhou, Z. M. Fadlullah, B. Mao, and N. Kato, “A deep-learning-based radio resource assignment technique for 5G ultra dense networks,” IEEE Network, vol. 32, no. 6, pp. 28–34, Nov. 2018.
 [29] M. Liu, T. Song, J. Hu, J. Yang, and G. Gui, “Deep learning-inspired message passing algorithm for efficient resource allocation in cognitive radio networks,” IEEE Trans. Veh. Technol., vol. 68, no. 1, pp. 641–653, Jan. 2019.
 [30] F. Tang, Z. M. Fadlullah, B. Mao, and N. Kato, “An intelligent traffic load prediction-based adaptive channel assignment algorithm in SDN-IoT: A deep learning approach,” IEEE Internet Things J., vol. 5, no. 6, pp. 5141–5154, Dec. 2018.
 [31] M. Yan, G. Feng, J. Zhou, Y. Sun, and Y. Liang, “Intelligent resource scheduling for 5G radio access network slicing,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 7691–7703, Aug. 2019.
 [32] A. Asheralieva and Y. Miyanaga, “Multi-agent Q-learning for autonomous D2D communication,” in Proc. IEEE ISPACS, Oct. 2016, pp. 1–6.
 [33] K. Zia, N. Javed, M. N. Sial, S. Ahmed, and F. Pervez, “Multi-agent RL based user-centric spectrum allocation scheme in D2D enabled HetNets,” in Proc. IEEE CAMAD, Sep. 2018, pp. 1–6.
 [34] Z. Li and C. Guo, “A multi-agent deep reinforcement learning based spectrum allocation framework for D2D underlay communications,” arXiv preprint arXiv:1904.06615, 2019.
 [35] D. A. Plaisted, “Some polynomial and integer divisibility problems are NP-hard,” in Proc. SFCS, Oct. 1976, pp. 264–267.
 [36] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [37] J. Von Neumann, O. Morgenstern, and H. W. Kuhn, Theory of games and economic behavior (commemorative edition). Princeton University Press, 2007.
 [38] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3–4, pp. 279–292, 1992.
 [39] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proc. Int. Conf. Mach. Learning (ICML), 2014.
 [40] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Proc. Advances Neural Inf. Process. Syst. (NIPS), 2017, pp. 6379–6390.
 [41] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [42] E. Access, “Further advancements for E-UTRA physical layer aspects,” 3GPP Technical Specification TR, vol. 36, p. V2, 2010.