I Introduction
Connecting vehicles on the road into a dynamic communication network, commonly known as a vehicle-to-everything (V2X) network, is gradually becoming a reality that promises to make our daily experience on wheels safer and more convenient [1]. V2X-enabled coordination among vehicles, pedestrians, and other entities on the road can alleviate traffic congestion and improve road safety, in addition to providing ubiquitous infotainment services [2, 3, 4]. Recently, the 3rd Generation Partnership Project (3GPP) has begun to support V2X services in long-term evolution (LTE) [5] and, further, in the fifth generation (5G) mobile communication networks [6]. Cross-industry alliances, such as the 5G Automotive Association (5GAA), have also been founded to push the development, testing, and deployment of V2X technologies.
Due to the high mobility of vehicles and complicated time-varying communication environments, it is very challenging to guarantee the diverse quality-of-service (QoS) requirements in vehicular networks, such as extremely large capacity, ultra reliability, and low latency [7]. To address such issues, efficient resource allocation for spectrum sharing becomes necessary in the V2X scenario. Existing works on spectrum sharing in vehicular networks can be mainly categorized into two classes: centralized schemes [8, 9, 10, 11] and distributed approaches [12, 13]. In the centralized schemes, decisions are usually made centrally at a given node, such as the head of a cluster or the base station (BS) in a given coverage area. Graph-based resource allocation schemes have been proposed in [8] and [9] to maximize the vehicle-to-infrastructure (V2I) capacity by exploiting the slow-fading statistics of channel state information (CSI). In [10], an interference-hypergraph-based resource allocation scheme has been developed for the non-orthogonal multiple access (NOMA)-integrated V2X scenario with the distance, channel gain, and interference known in each vehicle-to-vehicle (V2V) and V2I group. In [11], a segmentation medium access control (MAC) protocol has been proposed for large-scale V2X networks, where the location information of vehicles is updated. In these schemes, the decision-making node needs to acquire accurate CSI, the interference information of all V2V links, and each V2V link's transmit power to make spectrum sharing decisions. However, reporting all such information from each V2V link to the decision-making node imposes a heavy burden on the feedback links and can even become infeasible in practice.
As for distributed schemes [12, 13], each V2V link makes its own decision with partial or little knowledge of other V2V links. In [12], a distributed shuffling-based Hopcroft-Karp algorithm has been devised to handle subchannel allocation in V2V communications with one-bit CSI broadcasting. In [13], the spatio-temporal traffic pattern has been exploited for distributed load-aware resource allocation in V2V communications with slowly varying channel information. In these methods, V2V links may exchange partial channel information, or none at all, with their neighbors before making a decision. However, each V2V link can only observe partial information of its surrounding environment since it is geographically separated from other V2V links in the V2X scenario. This may leave some channels overly congested while others remain underutilized, leading to substantial performance degradation.
Notably, the above works usually rely on some level of channel information, such as channel gains, interference, and locations. This kind of channel information is usually hard to obtain perfectly in practical wireless communication systems, and acquiring it is even more challenging in the V2X scenario. Fortunately, machine learning enables wireless communication systems to learn their surroundings and feed critical information back to the BS for resource allocation. In particular, reinforcement learning (RL) can make decisions that maximize the long-term return in sequential decision problems, and has achieved great success in various applications, such as AlphaGo [14]. Inspired by its remarkable performance, the wireless community is increasingly interested in leveraging machine learning for physical layer and resource allocation design [15, 16, 17, 18, 19, 20, 21, 22, 23]. In particular, machine learning for future vehicular networks has been discussed in [24] and [25]. In [26], each V2V link is treated as an agent to ensure that the latency constraint is satisfied while minimizing interference to V2I link transmission. In [27], a multi-agent RL-based spectrum sharing scheme has been proposed to promote the payload delivery rate of V2V links while improving the sum capacity of V2I links. A dynamic RL scheduling algorithm has been developed to solve the network traffic and computation offloading problems in vehicular networks [28].

In order to fully exploit the advantages of both centralized and distributed schemes while alleviating the CSI requirement for spectrum sharing in vehicular networks, we propose an RL-based resource allocation scheme with learned feedback. In particular, we devise a distributed CSI compression and centralized decision-making architecture to maximize the sum rate of all V2V links in the long run. In this architecture, each V2V link first observes the state of its surrounding channels and adopts a deep neural network (DNN) to learn what to feed back to the decision-making unit, such as the BS, instead of sending all observed information directly. To maximize the long-term sum rate of all links, the BS then adopts deep reinforcement learning to allocate spectrum for all V2V links. To further reduce the feedback overhead, we adopt a quantization layer in each vehicle's DNN and learn how to quantize the continuous feedback. Besides, to further facilitate distributed spectrum sharing, we devise a distributed spectrum sharing architecture that lets each V2V link make its own decision locally. The contributions of this paper are summarized as follows.

We leverage the power of DNNs and RL to devise a centralized decision-making and distributed implementation architecture for vehicular spectrum sharing that maximizes the long-term sum rate of all vehicles. We use a weighted sum-rate reward to balance V2I and V2V performance dynamically.

We exploit the DNN at each vehicle to compress local observations, further augmented by a quantization layer, to reduce network signaling overhead while achieving desirable performance.

We also develop a distributed decision-making architecture that allows spectrum sharing decisions to be made locally at each vehicle, with binary feedback designed to reduce signaling overhead.

Based on extensive computer simulations, we demonstrate that both proposed architectures achieve near-optimal performance and are robust to feedback interval variations, input noise, and feedback noise. In addition, the numbers of continuous feedback values and feedback bits for each V2V link that strike a balance between signaling overhead and performance loss are identified.
The rest of this paper is organized as follows. The system model is presented in Section II. Then, the BS-aided spectrum sharing architecture, including distributed CSI compression and feedback, centralized resource allocation, and quantized feedback, is introduced in Section III. The distributed decision-making and spectrum sharing architecture is discussed in Section IV. Simulation results are presented in Section V. Finally, conclusions are drawn in Section VI.
II System Model
We consider a vehicular communication network with $M$ cellular users (CUs) and $N$ pairs of coexisting device-to-device (D2D) users, where all devices are equipped with a single antenna. Let $\mathcal{N} = \{1, \ldots, N\}$ and $\mathcal{M} = \{1, \ldots, M\}$ denote the sets of all D2D pairs and CUs, respectively. Each pair of D2D users exchanges important and short messages, such as safety-related information, by establishing a V2V link, while each CU uses a V2I link to support bandwidth-intensive applications, such as social networking and video streaming. In order to ensure the QoS of the CUs, we assume all V2I links are assigned orthogonal radio resources. Without loss of generality, we assume that each CU occupies one channel for its uplink transmission. To improve spectrum utilization efficiency, all V2V links share the spectrum resource with V2I links. Therefore, $\mathcal{M}$ is also referred to as the channel set.
Denote the channel power gain from the $m$-th CU to the BS on the $m$-th channel, i.e., the $m$-th V2I link, by $h_{m,B}$. Let $h_{n,B}[m]$ represent the cross channel power gain from the transmitter of the $n$-th V2V link to the BS on the $m$-th channel. The received signal-to-interference-plus-noise ratio (SINR) of the $m$-th V2I link can be expressed as

$$\gamma^c_m = \frac{P^c_m h_{m,B}}{\sigma^2 + \sum_{n \in \mathcal{N}} \rho_n[m] P^d_n h_{n,B}[m]}, \quad (1)$$

where $P^c_m$ and $P^d_n$ refer to the transmit powers of the $m$-th V2I link and the $n$-th D2D pair, respectively, $\sigma^2$ represents the noise power, and $\rho_n[m]$ is the channel allocation indicator with $\rho_n[m] = 1$ if the $n$-th D2D user pair chooses the $m$-th channel and $\rho_n[m] = 0$ otherwise. We assume each D2D pair only occupies one channel, i.e., $\sum_{m \in \mathcal{M}} \rho_n[m] = 1$. Then, the capacity of the $m$-th V2I link on the $m$-th channel can be written as

$$C^c_m = W \log_2\left(1 + \gamma^c_m\right), \quad (2)$$

where $W$ denotes the channel bandwidth.
Similarly, $g_n[m]$ denotes the channel power gain of the $n$-th V2V link on the $m$-th channel. Meanwhile, $g_{n',n}[m]$ denotes the cross channel power gain from the transmitter of the $n'$-th D2D pair to the receiver of the $n$-th D2D pair on the $m$-th channel. Denote the cross channel power gain from the $m$-th CU to the receiver of the $n$-th D2D pair on the $m$-th channel by $h_{m,n}[m]$. Then, the SINR of the $n$-th V2V link over the $m$-th channel can be written as

$$\gamma^d_n[m] = \frac{P^d_n g_n[m]}{\sigma^2 + I_n[m]}, \quad (3)$$

where the interference power for the $n$-th V2V link is

$$I_n[m] = \sum_{n' \in \mathcal{N}, n' \neq n} \rho_{n'}[m] P^d_{n'} g_{n',n}[m] + P^c_m h_{m,n}[m]. \quad (4)$$

In (4), the two terms refer to the interference from the other V2V links and from the V2I link on the $m$-th channel, respectively. Hence, the capacity of the $n$-th V2V link on the $m$-th channel can be written as

$$C^d_n[m] = W \log_2\left(1 + \gamma^d_n[m]\right). \quad (5)$$
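To make (1)-(5) concrete, the short sketch below evaluates the V2I and V2V capacities for a toy configuration. The array names mirror the notation above, and every numerical value (powers, gains, the allocation matrix) is an illustrative assumption rather than a setting taken from the paper.

```python
# Illustrative computation of the V2I/V2V SINRs and capacities in (1)-(5).
import numpy as np

M, N = 4, 4                      # number of V2I links/channels and V2V links
W = 1.0                          # normalized channel bandwidth
sigma2 = 1e-11                   # noise power (linear scale, example value)

P_c = np.full(M, 0.2)            # V2I transmit powers (example values)
P_d = np.full(N, 0.01)           # V2V transmit powers (example values)
rho = np.eye(N, M)               # rho[n, m] = 1 if V2V link n uses channel m

rng = np.random.default_rng(0)
h_mB = rng.exponential(1e-9, M)              # V2I gain to the BS on its own channel
h_nB = rng.exponential(1e-10, (N, M))        # V2V-transmitter-to-BS cross gains
g_n = rng.exponential(1e-8, (N, M))          # V2V direct gains per channel
g_cross = rng.exponential(1e-10, (N, N, M))  # V2V-to-V2V cross gains
h_mn = rng.exponential(1e-10, (M, N))        # CU-to-V2V-receiver cross gains

# Eq. (1)-(2): V2I SINR and capacity on channel m
v2v_interf_at_bs = np.array([np.sum(rho[:, m] * P_d * h_nB[:, m]) for m in range(M)])
gamma_c = P_c * h_mB / (sigma2 + v2v_interf_at_bs)
C_c = W * np.log2(1.0 + gamma_c)

# Eq. (3)-(5): V2V interference, SINR, and capacity on its chosen channel
C_d = np.zeros(N)
for n in range(N):
    m = int(np.argmax(rho[n]))               # channel chosen by V2V link n
    I_nm = (sum(rho[k, m] * P_d[k] * g_cross[k, n, m] for k in range(N) if k != n)
            + P_c[m] * h_mn[m, n])            # Eq. (4)
    gamma_d = P_d[n] * g_n[n, m] / (sigma2 + I_nm)   # Eq. (3)
    C_d[n] = W * np.log2(1.0 + gamma_d)              # Eq. (5)

print(C_c, C_d)
```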
In V2X networks, a naive distributed approach would allow each V2V link to select a channel independently to maximize its own data rate. However, local rate maximization often leads to suboptimal global performance due to the interference among different V2V links. On the other hand, the BS in the V2X scenario has enough computational and storage resources to achieve efficient resource allocation. With the help of machine learning, we propose a centralized decision-making scheme based on compressed information learned by each individual V2V link in a distributed manner.
In order to achieve this goal, each V2V link first learns to compress local observations, including the channel gain, the observed interference from other V2V links and the V2I link, the transmit power, etc., and then feeds the compressed information back to the BS. According to the feedback information from all V2V links, the BS makes spectrum sharing decisions for all V2V links using RL. Then, the BS broadcasts the decision results to all V2V links.
III BS-Decision-Based Spectrum Sharing Architecture
As shown in Fig. 1, we adopt a deep RL approach for resource allocation. In this section, we first design the DNN architecture of each V2V link and the deep Q-network (DQN) for centralized control at the BS, respectively. Then, we propose the centralized decision-making and distributed spectrum sharing architecture, termed the C-Decision scheme. Finally, we introduce the binary feedback design for information compression.
III-A V2V DNN Design
Here, we discuss the DNN at each V2V link that compresses the local observation for feedback. As shown in Fig. 1, each V2V link first observes its surroundings and obtains its transmit power, the current channel gains, and the interference powers on all channels, denoted as $\{g_n[m]\}_{m \in \mathcal{M}}$ and $\{I_n[m]\}_{m \in \mathcal{M}}$, respectively. Here, $I_n[m]$ refers to the aggregated interference power at the $n$-th V2V link on the $m$-th channel, as shown in (4). To account for the impact of V2V links on V2I links, the observation of the $n$-th V2V link also needs to include the cross channel gains from the $n$-th V2V link to all V2I links, i.e., $\{h_{n,B}[m]\}_{m \in \mathcal{M}}$. Then, the observation of the $n$-th V2V link can be written as

$$o_n = \left\{ P^d_n, \{g_n[m]\}_{m \in \mathcal{M}}, \{I_n[m]\}_{m \in \mathcal{M}}, \{h_{n,B}[m]\}_{m \in \mathcal{M}} \right\}, \quad (6)$$
where $n \in \mathcal{N}$. Here, the channel gains $\{g_n[m]\}$ can be accurately estimated by the receiver of the $n$-th V2V link, and we assume they are also available at the transmitter through delay-free feedback [29]. Similarly, the received interference power over all channels can be measured at the $n$-th V2V receiver. Each V2V transmitter knows its own transmit power. Besides, the cross channel gains $\{h_{n,B}[m]\}$ can be estimated at the BS and then broadcast to all V2V links in its coverage, which incurs only a small signaling overhead [27].

Then, the local observation, $o_n$, is compressed using the DNN at each V2V link. The compressed information, $f_n$, which is the output of the DNN, is fed back to the DQN at the BS. To limit the feedback overhead, each V2V link only reports the compressed vector, $f_n = [f_n^1, \ldots, f_n^{K_n}]$, instead of $o_n$, to the BS. Here, $f_n$ is also known as the feedback vector of the $n$-th V2V link, $f_n^k$ refers to its $k$-th feedback element, and $K_n$ denotes the number of feedback values learned by the $n$-th V2V link. All V2V links aim at maximizing their global sum rate in the long run while minimizing the amount of feedback information. Therefore, the parameters of the DNNs at all V2V links and those of the DQN are jointly determined to maximize the sum rate of the whole V2X network.
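The per-V2V compression DNN described above can be sketched as a small multilayer perceptron. The layer sizes follow Table III (13 inputs and hidden layers of 16, 32, and 16 units), while the number of feedback outputs and the example observation are assumptions for illustration.

```python
# A minimal sketch of the per-V2V compression DNN, assuming the layer sizes
# listed in Table III; the number of feedback values is an example choice.
import torch
import torch.nn as nn

class CompressionDNN(nn.Module):
    def __init__(self, obs_dim: int = 13, num_feedback: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 16), nn.ReLU(),
            nn.Linear(16, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, num_feedback),   # linear output: real-valued feedback f_n
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Each V2V link feeds its local observation o_n through its own DNN and
# reports only the compressed vector f_n to the BS.
o_n = torch.randn(1, 13)                   # hypothetical local observation
f_n = CompressionDNN()(o_n)                # compressed feedback sent to the BS
```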
III-B Deep Q-Network at the BS
To make proper resource sharing decisions, we introduce the deep RL architecture at the BS, as shown in Fig. 1. In order to maximize the long-term sum rate of all links, we resort to RL by treating the BS as the agent. In RL, an agent interacts with its surroundings, referred to as the environment, by taking actions and then observing a corresponding numerical reward from the environment. The agent's goal is to find optimal actions so that the expected sum of rewards is maximized. Mathematically, the RL problem can be modelled as a Markov decision process (MDP). At each discrete time slot $t$, the agent observes the current state of the environment, $s_t$, from the state space $\mathcal{S}$, and then chooses an action $a_t$ from the action space $\mathcal{A}$; one time step later, it obtains a reward $r_{t+1}$. Then, the environment evolves to the next state, $s_{t+1}$, with transition probability $P(s_{t+1} \mid s_t, a_t)$.

The BS treats all the learned feedback as the current state of the agent's environment, which can be expressed as:
$$s_t = \{f_1, f_2, \ldots, f_N\}. \quad (7)$$
Then, the action of the BS is to determine the values of the channel allocation indicators, $\rho_n[m]$, for each V2V link. Thus, we define the action of the BS as

$$a_t = \{\boldsymbol{\rho}_1, \boldsymbol{\rho}_2, \ldots, \boldsymbol{\rho}_N\}, \quad (8)$$

where $\boldsymbol{\rho}_n = [\rho_n[1], \ldots, \rho_n[M]]$ refers to the channel allocation vector for the $n$-th V2V link.
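Since the BS DQN outputs one Q-value per joint allocation, a single action index has to be mapped back to one channel per V2V link. The paper does not specify the exact indexing, so the base-$M$ decoding below is only one plausible convention.

```python
# One possible mapping from a joint action index (out of M**N) to the channel
# chosen by each of the N V2V links; the encoding itself is an assumption.
def decode_action(action_index: int, num_v2v: int, num_channels: int) -> list[int]:
    """Return the channel index chosen for each V2V link."""
    channels = []
    for _ in range(num_v2v):
        channels.append(action_index % num_channels)
        action_index //= num_channels
    return channels

# Example: with 4 V2V links and 4 channels there are 4**4 = 256 joint actions.
print(decode_action(27, num_v2v=4, num_channels=4))  # -> [3, 2, 1, 0]
```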
Finally, we design the reward for the BS, which is crucial to the performance of RL. To maximize the long-term sum rate of V2V links while ensuring the QoS of V2I links in the V2X scenario, we need a mechanism that considers the transmissions of V2V links and V2I links simultaneously. The V2V links usually carry safety-critical messages, such as a vehicle's speed and emergency vehicle warnings on the road, while the V2I links often support entertainment services [27]. Thus, we should guarantee the transmission of V2V links as the primary target while making sure that the impact of V2V transmission on the V2I links remains tolerable and adjustable for specific applications. To this end, we model the reward of the BS as
$$r_{t+1} = \lambda_c \sum_{m \in \mathcal{M}} C^c_m + \lambda_d \sum_{n \in \mathcal{N}} C^d_n, \quad (9)$$

where $C^d_n = \sum_{m \in \mathcal{M}} \rho_n[m] C^d_n[m]$ refers to the capacity of the $n$-th V2V link over all channels. Besides, $\lambda_c$ and $\lambda_d$ are nonnegative weights that balance the performance of V2I links and V2V links.
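A direct implementation of the reward in (9) is a one-liner; it reuses the capacity arrays from the earlier sketch, and the weight values shown are placeholders, not the ones used in the paper.

```python
# Weighted sum-rate reward of Eq. (9), with illustrative weights.
import numpy as np

def reward(C_c: np.ndarray, C_d: np.ndarray, lam_c: float = 0.1, lam_d: float = 1.0) -> float:
    return lam_c * float(np.sum(C_c)) + lam_d * float(np.sum(C_d))
```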
The solution of the RL problem is related to the concept of a policy $\pi$, which defines the probabilities of choosing each action in $\mathcal{A}$ when observing a state in $\mathcal{S}$. The goal of learning is to find an optimal policy that maximizes the expected return from any initial state $s_0$. The return is defined as $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, i.e., the cumulative discounted reward with a discount factor $\gamma \in [0, 1)$.
To solve this problem, we resort to Q-learning [30], a well-known effective approach to the RL problem, due to its model-free property, i.e., the transition probability is not required a priori. Q-learning is based on the action-value function $Q^\pi(s, a)$ for a given policy $\pi$, which is the expected return when the agent starts from state $s$, takes action $a$, and thereafter follows the policy $\pi$. The optimal action-value function $Q^*(s, a)$ under the optimal policy satisfies the well-known Bellman optimality equations [31] and can be approached through the iterative update
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right], \quad (10)$$
where $\alpha$ is the step-size parameter. Besides, the choice of action in state $s_t$ follows some exploratory policy, such as the $\epsilon$-greedy policy. For better understanding, the $\epsilon$-greedy policy can be expressed as

$$a_t = \begin{cases} \arg\max_{a \in \mathcal{A}} Q(s_t, a), & \text{with probability } 1 - \epsilon, \\ \text{a random action in } \mathcal{A}, & \text{with probability } \epsilon. \end{cases} \quad (11)$$

Here, $\epsilon$ is also known as the exploration rate in the RL literature. Furthermore, it has been shown in [31] that, under a variant of the stochastic approximation conditions on $\alpha$ and the assumption that all state-action pairs continue to be updated, $Q$ converges with probability 1 to the optimal action-value function $Q^*$.
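For reference, a compact tabular implementation of the update in (10) together with the $\epsilon$-greedy rule in (11) could look as follows; the step size, exploration rate, and discount factor are example values.

```python
# Tabular Q-learning: epsilon-greedy action selection (11) and TD update (10).
import numpy as np

def epsilon_greedy(Q: np.ndarray, s: int, eps: float, rng) -> int:
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))   # explore: random action
    return int(np.argmax(Q[s]))                # exploit: greedy action

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])   # Eq. (10)
```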
However, in many practical problems, the state and action spaces can be extremely large, which prevents storing all action values in a tabular form. As a result, it is common to adopt function approximation to estimate the action-value function. Moreover, by doing so, we can generalize action values from the limited set of seen state-action pairs to a much larger space.
In [32], a DNN parameterized by $\theta$ is employed to represent the action-value function, hence the name DQN. The DQN adopts the $\epsilon$-greedy policy to explore the state space and stores the transition tuple $(s_t, a_t, r_{t+1}, s_{t+1})$ in a replay memory (also known as the replay buffer) at each time step. The replay memory accumulates the agent's experiences over many episodes of the MDP. At each time step, a mini-batch of experiences is uniformly sampled from the replay memory, called experience replay, to update the network parameters $\theta$ with variants of the stochastic gradient descent method, minimizing the squared error

$$L(\theta) = \mathbb{E}\left[ \left( r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta) \right)^2 \right], \quad (12)$$

where $\theta^-$ is the parameter set of a target Q-network, which is duplicated from the training Q-network parameter set $\theta$ and kept fixed for a number of updates to further improve the stability of the DQN. Besides, experience replay improves sample efficiency by repeatedly sampling experiences from the replay memory and also breaks the correlation in successive updates, which further stabilizes the learning process.
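One mini-batch update of the DQN with experience replay and a target network, as described above, can be sketched as follows. The Huber loss is used here in place of the plain squared error in (12), matching the loss mentioned in Section V; tensor shapes and hyperparameters are assumptions.

```python
# A minimal sketch of one DQN mini-batch update; transitions are assumed to be
# stored as tensors (s, a, r, s') with a of dtype long and r of dtype float.
import random
from collections import deque
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, replay: deque, optimizer, batch_size=32, gamma=0.9):
    batch = random.sample(replay, batch_size)
    s, a, r, s_next = map(torch.stack, zip(*batch))
    q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)          # Q(s, a; theta)
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values  # uses frozen theta^-
    loss = nn.functional.smooth_l1_loss(q_sa, target)             # Huber-loss variant of (12)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```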
III-C Centralized Control and Distributed Transmission Architecture
In this part, the architecture of the C-Decision scheme is shown in Fig. 1. Each V2V link first observes its local environment and then adopts a DNN to compress the observed information into several real numbers, which are fed back to the BS for centralized decision making. The BS takes the feedback information of all V2V links as the input, utilizes the DQN to perform Q-learning and decide the channel allocation for all V2V links, and broadcasts its decision. Finally, each V2V link uses the BS-allocated channel for its transmission.
Details of the training framework for the C-Decision scheme are provided in Algorithm 1. We define $O_t = \{o_{1,t}, \ldots, o_{N,t}\}$ as the observations of all V2V links at time step $t$, where $o_{n,t}$ refers to the observation of the $n$-th V2V link at time step $t$. Then, the estimate of the return, also known as the approximate target value [32], can be expressed as

$$y_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-), \quad (13)$$

where $r_{t+1}$ and $Q(s_{t+1}, a'; \theta^-)$ represent the reward of all links and the Q function of the target DQN with parameters $\theta^-$ under the next state and action $a'$, respectively. Then, the updating process for the BS DQN can be written as [32, 33]

$$\theta \leftarrow \theta + \beta \left[ y_t - Q(s_t, a_t; \theta) \right] \nabla_\theta Q(s_t, a_t; \theta), \quad (14)$$

where $\beta$ is the step size of one gradient iteration.
As for the testing phase, at each time step $t$, each V2V link feeds its observation $o_n$ into the trained DNN to obtain its learned feedback $f_n$, and then sends it to the BS. After that, the BS takes the collected feedback $s_t$ as the input of its trained DQN to generate the decision result $a_t$, and broadcasts $a_t$ to all V2V links. Finally, each V2V link transmits on the specific channel indicated by $a_t$.
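Putting the pieces together, one testing-phase step of the C-Decision scheme might look like the sketch below; it reuses the CompressionDNN and decode_action sketches introduced earlier, which are assumptions rather than the authors' code.

```python
# Sketch of one C-Decision testing step: distributed compression at the V2V
# links, centralized greedy decision at the BS, and broadcast of the result.
import torch

def c_decision_step(v2v_dnns, bs_dqn, observations, num_channels: int):
    # 1) Distributed compression: each V2V link reports its learned feedback.
    feedback = [dnn(o) for dnn, o in zip(v2v_dnns, observations)]
    # 2) Centralized decision: the BS forms the state from all feedback vectors.
    state = torch.cat(feedback, dim=-1)
    with torch.no_grad():
        action_index = int(torch.argmax(bs_dqn(state)))
    # 3) Broadcast: decode the joint action into one channel per V2V link.
    return decode_action(action_index, num_v2v=len(v2v_dnns), num_channels=num_channels)
```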
III-D Spectrum Sharing with Binary Feedback
In order to further reduce the feedback overhead, we propose a framework to quantize the V2V links' real-valued feedback into several binary digits. In other words, we constrain each feedback element to take values in $\{-1, 1\}$. The binarization procedure can help force the neural networks to learn efficient representations of the feedback information compared to a standard floating-point layer. In other words, a binary layer can make each V2V link compress its observation more efficiently.

The binary quantization process consists of two steps [34]. The first step generates the required number of continuous feedback values in the interval $[-1, 1]$, which equals the desired number of binary feedback bits. The second step then takes the outputs of the first step as its input and produces, for each real-valued output of the first step, a discrete feedback value in the set $\{-1, 1\}$.
For the first step, we adopt a fully connected layer with $\tanh$ activations, defined as $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, which we term the pre-binary layer. Here, the input of this pre-binary layer is connected to the outputs of each V2V link's DNN. Then, in order to binarize the continuous output of the first step, we adopt the traditional sign function in the second step. Specifically, we take the sign of the input value as the output of this layer:
$$b(x) = \operatorname{sign}(x) = \begin{cases} 1, & x \geq 0, \\ -1, & x < 0. \end{cases} \quad (15)$$
However, the gradient of this function is zero almost everywhere and undefined at the origin, which challenges the back-propagation procedure for DNN training. As a remedy, we adopt the identity function in the backward pass, which is known as the straight-through estimator [35]. Combining these two steps, we can express the full binary feedback function as
$$B(x) = \operatorname{sign}\left( \tanh(W x + b) \right), \quad (16)$$
where $W$ and $b$ denote the linear weights and bias of the pre-binary layer that transform the activations from the previous layer in the neural network. We term this composite layer the binary layer.
Finally, to implement the C-Decision scheme with binary feedback, we append the full binary feedback function in (16), which consists of the pre-binary layer in the first step and the binary layer in the second step, to the output of each V2V link's DNN. Besides, since the number of feedback bits produced by each V2V link's new DNN changes, the number of inputs of the DQN at the BS should change correspondingly.
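A possible realization of the binary layer in (15)-(16) uses a custom autograd function whose backward pass is the identity, i.e., the straight-through estimator [35]; the layer sizes in the example are illustrative.

```python
# Sketch of the two-step binary feedback layer in (16): a tanh "pre-binary"
# layer followed by a sign quantizer with a straight-through backward pass.
import torch
import torch.nn as nn

class SignSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))  # Eq. (15)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output            # identity gradient (straight-through estimator)

class BinaryFeedback(nn.Module):
    def __init__(self, in_dim: int, num_bits: int):
        super().__init__()
        self.pre_binary = nn.Linear(in_dim, num_bits)   # W x + b in Eq. (16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return SignSTE.apply(torch.tanh(self.pre_binary(x)))  # values in {-1, +1}

# Appended to each V2V DNN: continuous features in, binary feedback bits out.
bits = BinaryFeedback(in_dim=16, num_bits=9)(torch.randn(1, 16))
```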
IV Distributed Decision Making and Spectrum Sharing Architecture
In order to further facilitate distributed spectrum sharing and reduce the computational complexity, we propose the distributed decision-making and spectrum sharing architecture (termed the D-Decision scheme) shown in Fig. 2, which lets each V2V link make its own spectrum sharing decision. In this section, we first devise the neural network architectures for each V2V link to compress CSI and to make decisions, respectively, and then design the neural network for the BS to aggregate feedback from all V2V links. Next, we propose the hybrid information aggregation and distributed control architecture. Finally, we present the D-Decision scheme with binary aggregated information.
IV-A DNN Design at V2V and BS
To enable distributed decision making, each V2V link contains one DNN to compress local observations for feedback, termed the Compression DNN, and another DQN for distributed spectrum sharing decisions, termed the Decision DQN. Here, we employ the same DNN architecture for each V2V link as that in Part A of Section III since they share the same functionality.
The BS aggregates the feedback from all V2V links via its DNN, termed the Aggregation DNN, and then broadcasts the aggregated global information (AGI) to all V2V links. Here, the AGI can be expressed as $\mathbf{z} = [z_1, \ldots, z_{N_a}]$, where $N_a$ refers to the number of AGI values and also equals the number of outputs of the BS Aggregation DNN. Finally, each V2V link combines its local observation and the AGI as the input of its Decision DQN to decide which channel to use for transmission.
IV-B Hybrid Information Aggregation and Distributed Control Architecture
Each V2V link first observes its local environment to obtain $o_n$, then adopts its Compression DNN to compress $o_n$ into several real numbers $f_n$, and finally feeds this compressed information back to the BS. After that, the BS takes the feedback values of all V2V links as the input of its Aggregation DNN to aggregate the compressed observations of all V2V links and further compress this information into the AGI $\mathbf{z}$. Finally, each V2V link combines the received AGI $\mathbf{z}$ and its local observation $o_n$ as the input of its Decision DQN, and performs the Q-learning algorithm to decide which channel to use for transmission.
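One inference step of the D-Decision scheme can be sketched as below; the module names follow the description above (Compression DNN, Aggregation DNN, Decision DQN), but the function itself is an assumed illustration, not the paper's implementation.

```python
# Sketch of one D-Decision step: distributed compression, BS-side aggregation
# into the AGI, and a local Decision DQN at each V2V link.
import torch

def d_decision_step(comp_dnns, agg_dnn, dec_dqns, observations):
    # 1) Each V2V link compresses its local observation and reports f_n.
    feedback = [dnn(o) for dnn, o in zip(comp_dnns, observations)]
    # 2) The BS aggregates all feedback into the AGI z and broadcasts it.
    agi = agg_dnn(torch.cat(feedback, dim=-1))
    # 3) Each V2V link feeds [o_n, z] into its own Decision DQN and picks
    #    the channel with the largest Q-value.
    actions = []
    with torch.no_grad():
        for dqn, o in zip(dec_dqns, observations):
            q_values = dqn(torch.cat([o, agi], dim=-1))
            actions.append(int(torch.argmax(q_values)))
    return actions
```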
Details of the training framework for the D-Decision scheme are provided in Algorithm 2. Here, we define $A_t = \{a_{1,t}, \ldots, a_{N,t}\}$ as the actions of all V2V links at time step $t$, where $a_{n,t}$ refers to the action of the $n$-th V2V link. Besides, in the training process, we take the observations of all V2V links as the input and train all DNNs and DQNs in an end-to-end manner. The training process can be implemented in a fully distributed manner.
As for the testing phase, at each time step $t$, each V2V link feeds its observation $o_n$ into its Compression DNN to obtain the feedback $f_n$, and sends it to the BS. Then, the BS uses the collected feedback as the input of its Aggregation DNN to generate the AGI $\mathbf{z}$, and broadcasts $\mathbf{z}$ to all V2V links. Finally, each V2V link takes $o_n$ and $\mathbf{z}$ as the input of its Decision DQN to make a decision, and then transmits on the chosen channel.
IV-C Distributed Spectrum Sharing with Binary Information
Similar to Section III-D, we can also quantize the continuous feedback and the AGI in the D-Decision scheme into binary data to further reduce the signaling overhead. In this case, both the Compression DNN of each V2V link and the Aggregation DNN at the BS need to include the binary function in (16).
V Simulation Results
In this section, we conduct extensive simulations to verify the performance of the proposed schemes. In particular, we provide the simulation settings in Part A and evaluate the training performance of the C-Decision scheme in Part B. Then, we assess the testing performance under real-valued feedback and binary feedback in Parts C and D, respectively. Besides, we demonstrate the impacts of the V2I and V2V link weights on the performance in Part E and the robustness of the proposed scheme in Part F, respectively. Finally, we show the training and testing performance of the D-Decision scheme in Part G.
V-A Simulation Settings
The simulation scenario follows the urban case in Annex A of [5], with the BS located at the center of the simulation area. For better understanding, we provide the related parameters and their settings in Table I. In addition, we list the corresponding channel models for the V2I and V2V links in Table II.
Table I: Simulation parameters
Parameter | Typical value
Number of V2I links | 4
Number of V2V links | 4
Carrier frequency | 2 GHz
Normalized channel bandwidth | 1
BS antenna height | 25 m
BS antenna gain | 8 dBi
BS receiver noise figure | 5 dB
Vehicle antenna height | 1.5 m
Vehicle antenna gain | 3 dBi
Vehicle receiver noise figure | 9 dB
Vehicle speed | randomly distributed in [10, 15] km/h
Vehicle drop and mobility model | urban case of A.1.2 in [5]
V2I transmit power | 23 dBm
V2V transmit power | 10 dBm
Table II: Channel models for V2I and V2V links
Parameter | V2I link | V2V link
Path loss model | 128.1 + 37.6 log10(d), d in km | LOS in WINNER + B1 Manhattan [36]
Shadowing distribution | Log-normal | Log-normal
Shadowing standard deviation | 8 dB | 3 dB
Decorrelation distance | 50 m | 10 m
Noise power | -114 dBm | -114 dBm
Fast fading | Rayleigh fading | Rayleigh fading
Fast fading update | every 1 ms | every 1 ms
The specific architectures of the V2V DNNs and the BS DQN under the C-Decision scheme are summarized in Table III, where $K_n$ denotes the number of feedback values of each V2V link and FC denotes a fully connected layer. In addition, the number of neurons in the output layer of the BS DQN equals the number of possible joint channel allocations for all V2V links, i.e., $M^N = 4^4 = 256$ under the current simulation setting. Besides, the settings of the DNNs and DQNs under the D-Decision scheme are listed in Table IV.

Table III: Network architectures under the C-Decision scheme
Layer | V2V DNN | BS DQN
Input layer | 13 | total number of feedback values from all V2V links
Hidden layers | 3 FC layers (16, 32, 16) | 3 FC layers (1200, 800, 600)
Output layer | $K_n$ | $M^N$
Table IV: Network architectures under the D-Decision scheme
Layer | Compression DNN | Aggregation DNN | Decision DQN
Input layer | 13 | total number of feedback values from all V2V links | local observation plus AGI
Hidden layers | 3 FC layers (16, 32, 16) | 3 FC layers (500, 400, 300) | 3 FC layers (80, 40, 20)
Output layer | $K_n$ | $N_a$ | number of channels $M$
We use the rectified linear unit (ReLU) activation function, defined as $\mathrm{ReLU}(x) = \max(0, x)$, for the hidden layers of all DNNs and DQNs, while the activation function of the output layers is set as a linear function. Besides, the RMSProp optimizer [37] is adopted to update the network parameters, and the loss function is set as the Huber loss [38]. Nonnegative weights $\lambda_c$ and $\lambda_d$ are chosen for the V2I and V2V links, respectively. We train the whole neural network over a fixed number of training episodes, during which the exploration rate $\epsilon$ is linearly annealed from its initial value to its final value over the beginning episodes and then kept constant. Each training episode contains a fixed number of steps, the target Q-network is updated every several steps, and the discount factor $\gamma$ and the size of the replay buffer are fixed throughout training. Meanwhile, the mini-batch size varies across settings and is specified in each figure.
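The linear annealing of the exploration rate mentioned above can be implemented with a simple schedule such as the one below; the start value, end value, and annealing horizon are placeholders, since the exact settings are not reproduced here.

```python
# Linear annealing of the exploration rate; all values are example placeholders.
def exploration_rate(episode: int, eps_start=1.0, eps_end=0.02, anneal_episodes=2000) -> float:
    if episode >= anneal_episodes:
        return eps_end                      # keep constant after the annealing phase
    frac = episode / anneal_episodes
    return eps_start + frac * (eps_end - eps_start)
```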
V-B Training Performance Evaluation
Fig. 3 demonstrates the training performance of the proposed C-Decision scheme for a given mini-batch size and number of real-valued feedback values. In Fig. 3(a), the loss decreases quickly as the number of training episodes increases and then remains nearly unchanged as training proceeds. On the other hand, the evolution of the average return per episode is displayed in Fig. 3(b). Here, we evaluate the training process every few training episodes under different random seeds with a fixed exploration rate, and plot the average return per episode. The average return per episode first increases quickly with the number of training episodes and then gradually converges despite some small fluctuations due to the time-varying V2X scenario, which shows the stability of the training process. Thus, Figs. 3(a) and 3(b) demonstrate the desired convergence of the proposed training algorithm, and we fix the number of training episodes accordingly for the C-Decision scheme in the sequel.
V-C Performance of Real-Valued Feedback
Fig. 4(a) shows the return of the real-valued feedback scheme over the testing episodes for a given mini-batch size, number of testing episodes, and number of real-valued feedback values. For comparison, we also display the performance of two benchmark schemes: the optimal scheme and the random action scheme. In the optimal scheme, we perform a time-consuming brute-force search to find the optimal spectrum allocation in each testing step. In the random action scheme, each V2V link chooses its channel randomly. For better comparison, we depict the normalized return of these three schemes in Fig. 4(a), where the return of the optimal scheme is used to normalize the return of the other two schemes in each testing episode. Besides, the average returns of the proposed scheme and the random action scheme are also depicted. In Fig. 4(a), the return of the C-Decision scheme approaches that of the optimal scheme in most episodes, and its average performance is close to the optimal scheme, while the average performance of random selection is substantially lower. Thus, we conclude that the proposed C-Decision scheme can achieve near-optimal spectrum sharing.
Fig. 4(b) shows the impacts of different mini-batch sizes and different numbers of real-valued feedback values on the performance of the C-Decision scheme, using the average return percentage (ARP) as the metric. Here, the ARP is defined as the return of the C-Decision scheme averaged over the testing episodes and then normalized by the average return of the optimal scheme. In Fig. 4(b), a number of real-valued feedback values equal to 0 refers to the situation where each V2V link feeds nothing back to the BS and therefore just randomly selects a channel to transmit, i.e., the random action scheme. From Fig. 4(b), the ARP of the C-Decision scheme increases rapidly with the number of feedback values and quickly saturates near its maximum, after which it remains virtually constant as the number of feedback values grows. In other words, each V2V link only needs to send a small number of real-valued feedback values to the BS to achieve near-optimal performance. Besides, different mini-batch sizes achieve very similar performance, and a moderate mini-batch size is good enough considering the computational overhead in the training process and the performance gained.
V-D Performance of Binary Feedback
Fig. 5 demonstrates how the ARP changes with an increasing number of feedback bits under different mini-batch sizes. Here, we fix the number of real-valued feedback values and quantize them into different numbers of feedback bits. Similarly, a number of feedback bits equal to 0 in Fig. 5 refers to the situation where each V2V link feeds nothing back to the BS and just adopts the random action scheme. The ARP first increases quickly with the number of feedback bits and then remains nearly unchanged once the number of feedback bits is sufficiently large. The ARP under different mini-batch sizes is quite similar, and a modest number of feedback bits already attains close-to-optimal performance. Considering the trade-off between performance and feedback overhead, we fix the number of feedback bits and the mini-batch size accordingly in the subsequent evaluation.
V-E Impacts of V2I and V2V Weights
In this part, we evaluate the impacts of the V2I link weight $\lambda_c$ and the V2V link weight $\lambda_d$ on the system performance. For better understanding, we fix $\lambda_d$ and vary the value of $\lambda_c$. Fig. 6 demonstrates the empirical cumulative distribution functions (CDFs) of the V2I and V2V sum rates. In Fig. 6, “Real FB” and “Binary FB” refer to the proposed C-Decision scheme with real-valued feedback and with binary feedback, respectively, and “Optimal” represents the optimal scheme. In particular, the two empirical CDFs of the V2I sum rate under both real-valued feedback and binary feedback in Fig. 6(a) shift quickly to the right as the V2I weight $\lambda_c$ increases, which shows that the proposed scheme can meet different QoS requirements of the V2I links by adjusting $\lambda_c$. Besides, the performance gap between real-valued feedback and binary feedback decreases as $\lambda_c$ increases. From Fig. 6(b), the empirical CDFs of the V2V sum rate under real-valued feedback and binary feedback are very close to each other and shift only slightly to the left with increasing $\lambda_c$, which implies that the rate degradation of the V2V links is quite small. Besides, the CDFs of the V2V sum rate under both feedback schemes are very close to that under the optimal scheme, and deviate only slightly from the optimal performance as $\lambda_c$ further increases. Thus, the proposed C-Decision scheme ensures negligible degradation of the V2V links while adjusting the QoS of the V2I links through different values of $\lambda_c$.

V-F Robustness Evaluation
Fig. 7 shows the impacts of different feedback intervals on the performance of both real-valued feedback and binary feedback, where the feedback interval is measured in the number of testing steps. To investigate the impact of very large feedback intervals, we use a large number of testing steps and testing episodes. The normalized average return under both feedback schemes decreases quite slowly at small feedback intervals, which shows that the proposed scheme is robust to feedback interval variations, and then drops quickly only at very large feedback intervals. Note that the average return here is normalized by the average return of the proposed scheme at the smallest feedback interval rather than by the optimal scheme, since the large number of testing steps makes the brute-force search for the optimal return computationally prohibitive.
Fig. 8 evaluates the impacts of different noise sources on the ARP. Specifically, Fig. 8(a) illustrates the impact of noisy input on the performance of both real-valued feedback and binary feedback. Here, the x-axis is the ratio of the strength of the Gaussian white noise to each observation (such as a channel gain value) of the V2V links. In Fig. 8(a), the ARP under both feedback schemes decreases very slowly at first, then drops quickly, and finally remains nearly unchanged at very large input noise, which shows the robustness of the proposed scheme. In addition, the proposed scheme still retains a large fraction of the optimal performance under both real-valued feedback and binary feedback even at very large input noise, which is still better than the random action scheme shown in Fig. 4(b). Based on this observation, we remark that the proposed scheme can learn the intrinsic structure of the resource allocation problem in the V2X scenario.

Besides, Fig. 8(b) displays the impact of noisy feedback on the performance of both feedback schemes. Here, noisy feedback refers to the situation where noise corrupts the learned feedback that each V2V link sends to the BS. Similarly, the x-axis is the ratio of the strength of the Gaussian white noise to each feedback value. In Fig. 8(b), the ARP of both feedback schemes remains nearly unchanged as the feedback noise increases, which demonstrates the robustness of the proposed scheme; with further increasing feedback noise, the ARP decreases more quickly under real-valued feedback than under binary feedback. This is because there are only a few real-valued feedback values under the real-valued feedback scheme, whereas the binary feedback scheme uses a larger number of feedback bits. Finally, the ARP of both feedback schemes becomes nearly constant at very large feedback noise. Overall, the binary feedback scheme is more robust to feedback noise than the real-valued feedback scheme.
V-G Performance Evaluation for the D-Decision Scheme
Fig. 9 evaluates the training process of the D-Decision scheme for given numbers of feedback values, AGI values, and mini-batch size. In particular, the training loss of the 1st V2V link in Fig. 9(a) first decreases very slowly with some jitter as the number of training episodes increases, then drops almost linearly, and finally remains nearly unchanged with further training. The average return per episode under the D-Decision scheme in Fig. 9(b) first increases quickly, then increases slowly, and finally gradually converges despite some fluctuations, which shows the stability of the training process. Besides, we observe that the number of training episodes required by the D-Decision scheme is much larger than that of the C-Decision scheme, which indicates that the D-Decision scheme converges more slowly. To train the whole neural network well, we therefore use a larger number of training episodes for the D-Decision scheme, and the exploration rate is linearly annealed from its initial value to its final value over the beginning episodes and then kept constant.
Then, the testing performance of the D-Decision scheme with an increasing number of AGI values is shown in Fig. 10. In particular, Fig. 10(a) illustrates the ARP with an increasing number of real-valued AGI values $N_a$. Here, the number of real-valued feedback values that each V2V link transmits to the BS is set as indicated by Fig. 4(b). The ARP first increases with $N_a$ and then remains nearly unchanged as $N_a$ further increases, nearly reaching its maximum with only a few AGI values. In other words, the BS only needs a few real-valued AGI values to represent the real-valued feedback of all V2V links and still achieve close-to-optimal performance. Furthermore, even a single AGI value yields a high ARP, which is suitable for the bandwidth-constrained broadcast channel of the BS. Compared with the C-Decision scheme, the D-Decision scheme incurs only a small ARP degradation. However, it achieves fully distributed decision making and spectrum sharing, which is very appealing in the V2X scenario. In addition, the computational complexity for decision making under the D-Decision scheme is greatly reduced compared with the C-Decision scheme, which further facilitates fully distributed spectrum sharing in the V2X scenario.
Besides, the testing performance of the D-Decision scheme with binary AGI is evaluated in Fig. 10(b). Here, we fix the number of feedback bits of each V2V link and the number of real-valued AGI values. In Fig. 10(b), the ARP first increases with the number of AGI bits and then becomes nearly unchanged as the number of AGI bits further increases; a modest number of AGI bits is already sufficient to approach the best achievable ARP. Similarly, compared with the C-Decision scheme with binary feedback, the D-Decision scheme with binary AGI incurs only a small ARP degradation, yet it can be implemented in a fully distributed manner.
VI Conclusion
In this paper, we proposed a novel C-Decision architecture that allows distributed V2V links to share spectrum efficiently with the aid of the BS in the V2X scenario, and devised an approach to binarize the continuous feedback. To further facilitate distributed decision making, we developed a D-Decision scheme that lets each V2V link make its own decision locally and designed the corresponding binarization procedure. Simulation results demonstrated that a quite small number of real-valued feedback values suffices to achieve near-optimal performance. Meanwhile, the D-Decision scheme also attains near-optimal performance and enables fully distributed decision making, which is more appealing for V2X networks. Besides, quantizing the feedback or the AGI incurs only a small performance loss with an acceptable number of bits under both schemes. The proposed schemes are robust to variations of the feedback interval, input noise, and feedback noise, which validates their robustness. In the future, we will investigate the joint power control and spectrum sharing problem in this scenario.
References
 [1] H. Seo, K. Lee, S. Yasukawa, Y. Peng, and P. Sartori, “LTE evolution for vehicle-to-everything services,” IEEE Commun. Mag., vol. 54, no. 6, pp. 22–28, Jun. 2016.
 [2] S. Chen, J. Hu, Y. Shi, Y. Peng, J. Fang, R. Zhao, and L. Zhao, “Vehicle-to-everything (V2X) services supported by LTE-based systems and 5G,” IEEE Commun. Standards Mag., vol. 1, no. 2, pp. 70–76, 2017.
 [3] L. Liang, H. Peng, G. Y. Li, and X. Shen, “Vehicular communications: A physical layer perspective,” IEEE Trans. Veh. Technol., vol. 66, no. 12, pp. 10 647–10 659, Dec. 2017.
 [4] H. Peng and L. Liang and X. Shen and G. Y. Li, “Vehicular communications: A network layer perspective,” IEEE Trans. Veh. Technol., vol. 68, no. 2, pp. 1064–1078, Feb. 2019.
 [5] 3rd Generation Partnership Project, “Technical Specification Group Radio Access Network: Study on LTE-based V2X services,” 3GPP, TR 36.885 V14.0.0, Jun. 2016.
 [6] ——, “Study on enhancement of 3GPP support for 5G V2X services,” 3GPP, TR 22.886 V15.1.0, Mar. 2017.
 [7] C. Guo, L. Liang, and G. Y. Li, “Resource allocation for lowlatency vehicular communications: An effective capacity perspective,” IEEE J. Sel. Areas Commun., vol. 37, no. 4, pp. 905–917, Apr. 2019.
 [8] L. Liang, G. Y. Li, and W. Xu, “Resource allocation for D2Denabled vehicular communications,” IEEE Trans. Commun., vol. 65, no. 7, pp. 3186–3197, Jul. 2017.
 [9] L. Liang, S. Xie, G. Y. Li, Z. Ding, and X. Yu, “Graphbased resource sharing in vehicular communication,” IEEE Trans. Wireless Commun., vol. 17, no. 7, pp. 4579–4592, Jul. 2018.
 [10] C. Chen, B. Wang, and R. Zhang, “Interference hypergraphbased resource allocation (IHGRA) for NOMAintegrated V2X networks,” IEEE Internet Things J., vol. 6, no. 1, pp. 161–170, Feb. 2019.
 [11] C. Han, M. Dianati, Y. Cao, F. Mccullough, and A. Mouzakitis, “Adaptive network segmentation and channel allocation in largescale V2X communication networks,” IEEE Trans. Commun., vol. 67, no. 1, pp. 405–416, Jan. 2019.
 [12] B. Bai, W. Chen, K. B. Letaief, and Z. Cao, “Low complexity outage optimal distributed channel allocation for vehicle-to-vehicle communications,” IEEE J. Sel. Areas Commun., vol. 29, no. 1, pp. 161–172, Jan. 2011.
 [13] M. I. Ashraf, M. Bennis, C. Perfecto, and W. Saad, “Dynamic proximity-aware resource allocation in vehicle-to-vehicle (V2V) communications,” in Proc. IEEE Globecom Workshops (GC Wkshps), Dec. 2016, pp. 1–6.
 [14] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.

 [15] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 4, pp. 563–575, Dec. 2017.
 [16] Z. Qin, H. Ye, G. Y. Li, and B. F. Juang, “Deep learning in physical layer communications,” IEEE Wireless Commun., vol. 26, no. 2, pp. 93–99, Apr. 2019.
 [17] H. Ye, G. Y. Li, and B. Juang, “Power of deep learning for channel estimation and signal detection in OFDM systems,” IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 114–117, Feb. 2018.
 [18] F. A. Aoudia and J. Hoydis, “Endtoend learning of communications systems without a channel model,” arXiv preprint arXiv:1804.02276, 2018.
 [19] C. Jiang, H. Zhang, Y. Ren, Z. Han, K. Chen, and L. Hanzo, “Machine learning paradigms for nextgeneration wireless networks,” IEEE Wireless Commun., vol. 24, no. 2, pp. 98–105, Apr. 2017.

 [20] R. Li, Z. Zhao, X. Zhou, G. Ding, Y. Chen, Z. Wang, and H. Zhang, “Intelligent 5G: When cellular networks meet artificial intelligence,” IEEE Wireless Commun., vol. 24, no. 5, pp. 175–183, Oct. 2017.
 [21] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access in wireless networks,” IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 2, pp. 257–265, Jun. 2018.
 [22] Y. Sun, M. Peng, and S. Mao, “Deep reinforcement learning based mode selection and resource management for green fog radio access networks,” IEEE Internet Things J., vol. 6, no. 2, pp. 1960–1971, Apr. 2019.
 [23] L. Liang, H. Ye, G. Yu, and G. Y. Li, “Deep learning based wireless resource allocation with application to vehicular networks,” arXiv preprint arXiv:1907.03289, 2019.
 [24] H. Ye, L. Liang, G. Y. Li, J. Kim, L. Lu, and M. Wu, “Machine learning for vehicular networks: Recent advances and application examples,” IEEE Veh. Technol. Mag., vol. 13, no. 2, pp. 94–101, Jun. 2018.
 [25] L. Liang, H. Ye, and G. Y. Li, “Toward intelligent vehicular networks: A machine learning framework,” IEEE Internet Things J., vol. 6, no. 1, pp. 124–135, Feb. 2019.
 [26] H. Ye, G. Y. Li, and B. F. Juang, “Deep reinforcement learning based resource allocation for V2V communications,” IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 3163–3173, Apr. 2019.
 [27] L. Liang, H. Ye, and G. Y. Li, “Spectrum sharing in vehicular networks based on multiagent reinforcement learning,” to appear in IEEE J. Sel. Areas Commun., 2019.
 [28] Y. Wang, K. Wang, H. Huang, T. Miyazaki, and S. Guo, “Traffic and computation cooffloading with reinforcement learning in fog computing for industrial applications,” IEEE Trans. Ind. Informat., vol. 15, no. 2, pp. 976–986, Feb. 2019.
 [29] Y. S. Nasir and D. Guo, “Deep reinforcement learning for distributed dynamic power allocation in wireless networks,” arXiv preprint arXiv:1808.00490, 2018.
 [30] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, Feb. 1992.
 [31] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
 [32] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
 [33] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. 30th AAAI Conf., Feb. 2016, pp. 2094–2100.
 [34] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” arXiv preprint arXiv:1511.06085, 2015.
 [35] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
 [36] Y. Bultitude and T. Rautiainen, “IST-4-027756 WINNER II D1.1.2 V1.2, WINNER II channel models.”
 [37] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
 [38] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY, USA: Springer Science & Business Media, 2009.