Connecting vehicles on the road as a dynamic communication network, commonly known as vehicle-to-everything (V2X) networks, is gradually becoming a reality to make our daily experience on wheels safer and more convenient. V2X-enabled coordination among vehicles, pedestrians, and other entities on the road can alleviate traffic congestion and improve road safety, in addition to providing ubiquitous infotainment services [2, 3, 4]. Recently, the 3rd generation partnership project (3GPP) has begun to support V2X services in the long-term evolution (LTE) and, further, the fifth generation (5G) mobile communication networks. Cross-industry alliances, such as the 5G automotive association (5GAA), have also been founded to push development, testing, and deployment of V2X technologies.
Due to high mobility of vehicles and complicated time-varying communication environments, it is very challenging to guarantee the diverse quality-of-service (QoS) requirements in vehicular networks, such as extremely large capacity, ultra reliability, and low latency . To address such issues, efficient resource allocation for spectrum sharing becomes necessary in the V2X scenario. Existing works on spectrum sharing in vehicular networks can be mainly categorized into two classes: centralized schemes [8, 9, 10, 11] and distributed approaches [12, 13]. For the centralized schemes, decisions are usually made centrally at a given node, such as the head in a cluster or the base station (BS) in a given coverage area. Novel graph-based resource allocation schemes have been proposed in  and  to maximize the vehicle-to-infrastructure (V2I) capacity, exploiting the slow fading statistics of channel state information (CSI). In , an interference hyper-graph based resource allocation scheme has been developed in the non-orthogonal multiple access (NOMA)-integrated V2X scenario with the distance, channel gain, and interference known in each vehicle-to-vehicle (V2V) and V2I group. In , a segmentation medium access control (MAC) protocol has been proposed in large-scale V2X networks, where the location information of vehicles is updated. In these schemes, the decision making node needs to acquire accurate CSI, interference information of all the V2V links, and each V2V link’s transmit power to make spectrum sharing decisions. However, reporting all such information from each V2V link to the decision making node poses a heavy burden on the feedback links, and even becomes infeasible in practice.
As for distributed schemes [12, 13], each V2V link makes its own decision with partial or little knowledge of other V2V links. In , a distributed shuffling based Hopcroft-Karp algorithm has been devised to handle the subchannel allocation in V2V communications with one-bit CSI broadcasting. In , the spatio-temporal traffic pattern has been exploited for distributed load-aware resource allocation for V2V communications with slowly varying channel information. In these methods, V2V links may exchange partial or no channel information with their neighbors before making a decision. However, each V2V link can only observe partial information of its surrounding environment since it is geographically separated from other V2V links in the V2X scenario. This may leave some channels overly congested while others are underutilized, leading to substantial performance degradation.
Notably, the above works usually rely on some level of channel information, such as channel gain, interference, and locations. This kind of channel information is usually hard to obtain perfectly in practical wireless communication systems, and is even more challenging to obtain in the V2X scenario. Fortunately, machine learning enables wireless communication systems to learn their surroundings and feed critical information back to the BS for resource allocation. In particular, reinforcement learning (RL) can make decisions that maximize the long-term return in sequential decision problems and has achieved great success in various applications, such as AlphaGo. Inspired by its remarkable performance, the wireless community is increasingly interested in leveraging machine learning for the physical layer and resource allocation design [15, 16, 17, 18, 19, 20, 21, 22, 23]. In particular, machine learning for future vehicular networks has been discussed in  and . In , each V2V link is treated as an agent to ensure the latency constraint is satisfied while minimizing interference to V2I link transmission. In , a multi-agent RL-based spectrum sharing scheme has been proposed to promote the payload delivery rate of V2V links while improving the sum capacity of V2I links. A dynamic RL scheduling algorithm has been developed to solve the network traffic and computation offloading problems in vehicular networks .
In order to fully exploit the advantages of both centralized and distributed schemes while alleviating the requirement on CSI for spectrum sharing in vehicular networks, we propose an RL-based resource allocation scheme with learned feedback. In particular, we devise a distributed CSI compression and centralized decision making architecture to maximize the sum rate of all V2V links in the long run. In this architecture, each V2V link first observes the state of its surrounding channels and adopts a deep neural network (DNN) to learn what to feed back to the decision making unit, such as the BS, instead of sending all observed information directly. To maximize the long-term sum rate of all links, the BS then adopts deep reinforcement learning to allocate spectrum for all V2V links. To further reduce feedback overhead, we adopt a quantization layer in each vehicle’s DNN and learn how to quantize the continuous feedback. Besides, to further facilitate distributed spectrum sharing, we devise a distributed spectrum sharing architecture to let each V2V link make its own decision locally. The contributions of this paper are summarized as follows.
We leverage the power of DNN and RL to devise a centralized decision making and distributed implementation architecture for vehicular spectrum sharing that maximizes the long-term sum rate of all vehicles. We use a weighted sum rate reward to balance V2I and V2V performance dynamically.
We exploit the DNN at each vehicle to compress local observations, which is further augmented by a quantized layer, to reduce network signaling overhead while achieving desirable performance.
We also develop a distributed decision making architecture that allows spectrum sharing decisions to be made at each vehicle locally and binary feedback is designed for signaling overhead reduction.
Based on extensive computer simulations, we demonstrate both of the proposed architectures can achieve near-optimal performance and are robust to feedback interval variations, input noise, and feedback noise. In addition, the optimal number of continuous feedback and feedback bits for each V2V link are presented that strike a balance between signaling overhead and performance loss.
The rest of this paper is organized as follows. The system model is presented in Section II. Then, the BS aided spectrum sharing architecture, including distributed CSI compression and feedback, centralized resource allocation and quantized feedback, is introduced in Section III. The distributed decision making and spectrum sharing architecture is discussed in Section IV. Simulation results are presented in Section V. Finally, conclusions are drawn in Section VI.
II System Model
We consider a vehicular communication network with cellular users (CUs) and pairs of coexisting device-to-device (D2D) users, where all devices are equipped with a single antenna. Let and denote the sets of all D2D pairs and CUs, respectively. Each pair of D2D users exchange important and short messages, such as safety-related information via establishing a V2V link while each CU uses a V2I link to support bandwidth-intensive applications, such as social networking and video streaming. In order to ensure the QoS of the CUs, we assume all V2I links are assigned orthogonal radio resources. Without loss of generality, we assume that each CU occupies one channel for its uplink transmission. To improve the spectrum utilization efficiency, all V2V links share the spectrum resource with V2I links. Therefore, is also referred to as the channel set.
Denote the channel power gain from the -th CU to the BS on the -th channel, i.e., the -th V2I link, by . Let represent the cross channel power gain from the transmitter of the -th V2V link to the BS on the -th channel. The received signal-to-interference-plus-noise-ratio (SINR) of the -th V2I link can be expressed as
where and refer to the transmit powers of the -th V2I link and the -th D2D pair, respectively, represents the noise power, and is the channel allocation indicator with if the -th D2D user pair chooses the -th channel and otherwise. We assume each D2D pair only occupies one channel, i.e., . Then, the capacity of the -th V2I link on the -th channel can be written as
where denotes the channel bandwidth.
Similarly, denotes the channel power gain of the -th V2V link on the -th channel. Meanwhile, denotes the cross channel power gain from the transmitter of the -th D2D pair to the receiver of the -th D2D pair on the -th channel. Denote the cross channel power gain from the -th CU to the receiver of the -th D2D pair on the -th channel by . Then, the SINR of the -th V2V link over the -th channel can be written as
where the interference power for the -th V2V link is
In (4), the terms and refer to the interference of the other V2V links and the V2I link on the -th channel, respectively. Hence the capacity of the -th V2V link on the -th channel can be written as
In the V2X networks, a naive distributed approach will allow each V2V link to select a channel independently such that its own data rate is maximized. However, local rate maximization often leads to suboptimal global performance due to the interference among different V2V links. On the other hand, the BS in the V2X scenario has enough computational and storage resources to achieve efficient resource allocation. With the help of machine learning, we propose a centralized decision making scheme based on compressed information learned by each individual V2V link distributively.
In order to achieve this goal, each V2V link first learns to compress local observations, including the channel gain, the observed interference from other V2V links and V2I link, transmit power, etc., and then feeds the compressed information back to the BS. According to feedback information from all V2V links, the BS will make optimal decisions for all V2V links using RL. Then, the BS broadcasts the decision result to all V2V links.
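The rate expressions of the system model can be sketched numerically as follows. This is a minimal illustration rather than the authors' code: the array names (`g_c`, `g_cross_b`, `g_d`, `g_dd`, `g_cd`, `rho`) are hypothetical stand-ins for the channel gains and the allocation indicator described above, and a single transmit power per link type is assumed for brevity.

```python
import numpy as np

def v2x_rates(g_c, g_cross_b, g_d, g_dd, g_cd, rho, p_c, p_d, noise, bandwidth=1.0):
    """Return (V2I capacity per channel, V2V capacity per link).

    g_c[m]          gain of the m-th V2I link (CU m -> BS) on channel m
    g_cross_b[k, m] gain from V2V transmitter k to the BS on channel m
    g_d[k, m]       gain of V2V link k on channel m
    g_dd[l, k, m]   gain from V2V transmitter l to V2V receiver k on channel m
    g_cd[k, m]      gain from CU m to V2V receiver k on channel m
    rho[k, m]       1 if V2V link k uses channel m, else 0
    """
    rho = np.asarray(rho, dtype=float)
    num_links, num_channels = rho.shape

    # V2I links: interference comes from every V2V link sharing the channel.
    interference_at_bs = p_d * np.sum(rho * g_cross_b, axis=0)
    sinr_c = p_c * np.asarray(g_c) / (noise + interference_at_bs)
    cap_c = bandwidth * np.log2(1.0 + sinr_c)

    # V2V links: interference from other V2V links and from the V2I link
    # on the same channel; a link only earns rate on its selected channel.
    cap_d = np.zeros(num_links)
    for k in range(num_links):
        for m in range(num_channels):
            if rho[k, m] == 0:
                continue
            others = sum(rho[l, m] * p_d * g_dd[l, k, m]
                         for l in range(num_links) if l != k)
            interference = others + p_c * g_cd[k, m]
            sinr = p_d * g_d[k, m] / (noise + interference)
            cap_d[k] += bandwidth * np.log2(1.0 + sinr)
    return cap_c, cap_d
```

With one V2I link and one V2V link sharing a single channel, the function reduces to two Shannon-rate evaluations, which makes it easy to check by hand.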
III BS Decision based Spectrum Sharing Architecture
As shown in Fig. 1, we adopt the deep RL approach for resource allocation. In this section, we first design the DNN architecture of each V2V link and the deep Q-network (DQN) for centralized control at the BS, respectively. Then, we propose the centralized decision making and distributed spectrum sharing architecture, termed C-Decision scheme. Finally, we introduce the binary feedback design for information compression.
III-A V2V DNN Design
Here, we discuss the DNN at each V2V link to compress local observation for feedback. As shown in Fig. 1, each V2V link first observes its surroundings, and obtains its transmission power, the current channel gains and interference powers of all channels, which are denoted as and , respectively. Here, refers to the aggregated interference powers at the -th V2V link on the -th channel as shown in (4). To consider the impact of V2V links on V2I links, the observation of the -th V2V also needs to include the cross channel gain from the -th V2V link to all V2I links, such as . Then, the observation of the -th V2V can be written as
where . Here, the channel information can be accurately estimated by the receiver of the -th V2V link and we assume it is also available at the transmitter through delay-free feedback . Similarly, the received interference power over all channels can be measured at the -th V2V receiver. Each V2V transmitter knows its transmit power . Besides, the vector can be estimated at the BS and then broadcast to all V2V links in its coverage, which incurs a small signaling overhead .
Then, the local observation, , is compressed using the DNN at each V2V link. The compressed information, , which is the output of the DNN, is fed back to the DQN at the BS. To limit overhead on information feedback, each V2V link only reports the compressed information vector, , instead of to the BS. Here, is also known as the feedback vector of the -th V2V link and the term refers to the -th feedback element of the -th V2V, where denotes the number of feedback learned by the -th V2V link. All V2V links aim at maximizing their global sum rate in the long run while minimizing the feedback information . Therefore, the parameters of the DNNs at all V2V links and those of the DQN will be jointly determined to maximize the sum rate of the whole V2X network.
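The compression step can be illustrated with a small fully connected network. The sketch below is illustrative only: the observation layout (channel gains, interference powers, cross gains, transmit power), the ReLU hidden activations, the random initialization, and the choice of three feedback values are assumptions; the hidden sizes (16, 32, 16) and the linear output layer follow the simulation settings reported later in the paper.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Create (weight, bias) pairs for a fully connected network."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Forward pass: ReLU on hidden layers (assumption), linear output layer."""
    for i, (weight, bias) in enumerate(params):
        x = x @ weight + bias
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def build_observation(gains, interference, cross_gains, tx_power):
    """Stack the local measurements of one V2V link into a single vector."""
    return np.concatenate([gains, interference, cross_gains, [tx_power]])

num_channels = 4      # matches the number of V2I links in the simulations
num_feedback = 3      # hypothetical number of real-valued feedback values
rng = np.random.default_rng(0)
params = init_mlp([3 * num_channels + 1, 16, 32, 16, num_feedback], rng)

obs = build_observation(rng.random(num_channels), rng.random(num_channels),
                        rng.random(num_channels), tx_power=0.01)
feedback = mlp_forward(params, obs)   # vector reported to the BS
```

Only `feedback` would be sent over the air, so the uplink signaling cost scales with `num_feedback` rather than with the size of `obs`.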
III-B Deep Q-Network at the BS
To make a proper resource sharing decision, we introduce the deep RL architecture at the BS as shown in Fig. 1. In order to maximize the long-term sum rate of all links, we resort to the RL technique by treating the BS as the agent. In RL, an agent interacts with its surroundings, called the environment, by taking actions, and then observes a corresponding numerical reward from the environment. The agent's goal is to find optimal actions so that the expected sum of rewards is maximized. Mathematically, RL can be modelled as a Markov decision process (MDP). At each discrete time slot, the agent observes the current state of the environment from the state space and then chooses an action from the action space and one time step later obtains a reward . Then, the environment evolves to the next state , with the transition probability .
The BS treats all the learned feedback as the current state of the agent’s environment, which can be expressed as:
Then, the action of the BS is to determine the values of the channel indicators, , for each V2V link. Thus, we define the action of the BS as
where refers to the channel allocation vector for the -th V2V link.
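Because every V2V link occupies exactly one channel, a joint action can be indexed by a single integer. The following sketch shows one possible encoding (base-M digits); the encoding itself and the function names are assumptions made for illustration, not something specified in the text.

```python
def decode_action(index, num_links, num_channels):
    """Map a flat action index to one channel index per V2V link."""
    if not 0 <= index < num_channels ** num_links:
        raise ValueError("action index out of range")
    channels = []
    for _ in range(num_links):
        channels.append(index % num_channels)   # channel of the next link
        index //= num_channels
    return channels

def to_indicator(channels, num_channels):
    """Expand channel indices into 0/1 allocation vectors, one row per link."""
    return [[1 if m == c else 0 for m in range(num_channels)] for c in channels]
```

With 4 V2V links and 4 channels, as in the simulation settings, this encoding yields 4 ** 4 = 256 joint actions.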
Finally, we design the reward for the BS, which is very crucial to the performance of RL. To maximize the long-term sum rate of V2V links while ensuring the QoS of V2I links in the V2X scenario, we need to devise a mechanism to consider the transmissions of V2V links and V2I links simultaneously. As we know, the V2V links usually carry the safety-critical messages, such as vehicle’s speed and emergency vehicle warning on the road, while the V2I links often support the entertainment services . Thus, we should guarantee the transmission of V2V links as the primal target while making sure that the impact of V2V transmission on the V2I links can be tolerable and adjustable to some specific applications. To this end, we model the reward of the BS as
where refers to the capacity of the -th V2V on all the channels. Besides, and are nonnegative weights to balance the performance of V2I links and V2V links.
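A reward of this kind can be sketched as a weighted sum of the V2I and V2V capacities. The exact expression is not reproduced here, so the form below is an assumption based on the description of the two nonnegative weights; `lambda_c` and `lambda_d` are hypothetical names for them.

```python
def weighted_sum_rate_reward(v2i_caps, v2v_caps, lambda_c, lambda_d):
    """Weighted sum of V2I capacities and V2V capacities (assumed reward form)."""
    if lambda_c < 0 or lambda_d < 0:
        raise ValueError("weights must be nonnegative")
    return lambda_c * sum(v2i_caps) + lambda_d * sum(v2v_caps)
```

Raising `lambda_c` relative to `lambda_d` makes the BS more protective of the V2I links, which is the adjustment studied later in the simulations.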
The solution of the RL problem is related to the concept of policy , which defines the probabilities of choosing each action in when observing a state in . The goal of learning is to find an optimal policy to maximize the expected return from any initial state . The expected return is defined as , which is the cumulative discounted return with a discount factor .
To solve this problem, we resort to the Q-learning , which is a well-known effective approach to tackle the RL problem, due to its model-free property where is not required a priori. Q-learning is based on the idea of action-value function for a given policy , which means the expected return when the agent starts from the state , takes action , and thereafter follows the policy . The optimal action-value function under the optimal policy satisfies the well-known Bellman optimality equations , which can be approached through an iterative update method:
where is the step-size parameter. Besides, the choice of action in state follows some exploratory policies, such as the -greedy policy. For better understanding, the -greedy policy can be expressed as
Here, is also known as the exploration rate in the RL literature. Furthermore, it has been shown in  that with a variant of the stochastic approximation conditions on and the assumption that all the state-action pairs continue to be updated, converges with probability 1 to the optimal action-value function .
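For reference, the tabular Q-learning update and the epsilon-greedy rule can be written compactly. This is the textbook form, shown as a sketch; the dictionary-based table and the function names are illustrative choices.

```python
import random
from collections import defaultdict

def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])

def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]

q_table = defaultdict(float)   # unseen state-action pairs start at 0
```

The table grows with the number of visited state-action pairs, which is what motivates replacing it with a function approximator when the spaces are large.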
However, in many practical problems, the state and action spaces can be extremely large, which prevents storing all action-value functions in a tabular form. As a result, it is common to adopt function approximation to estimate these action-value functions. Moreover, by doing so, we can generalize action-value functions from a limited set of seen state-action pairs to a much larger space.
In , a DNN parameterized by is employed to represent the action-value function, and is thus called a DQN. The DQN adopts the -greedy policy to explore the state space and stores the transition tuple in a replay memory (also known as the replay buffer) at each time step. The replay memory accumulates the agent's experiences over many episodes of the MDP. At each time step, a mini-batch of experiences is uniformly sampled from the replay memory, called experience replay, to update the network parameters with variants of the stochastic gradient descent method to minimize the squared errors shown as follows:
where is the parameter set of a target Q-network, which is duplicated from the training Q-network parameter set , and fixed for a couple of updates with the aim of further improving the stability of DQN. Besides, experience replay improves sample efficiency via repeatedly sampling experiences from the replay memory and also breaks correlation in successive updates, which also stabilizes the learning process.
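The interplay of replay memory, target network, and squared TD error can be sketched without a deep learning framework. In the snippet below, `q_fn` and `target_q_fn` are hypothetical callables that return a vector of action values for a state, and the error is averaged over the mini-batch, which is an assumption about the normalization.

```python
import random
from collections import deque
import numpy as np

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest entries are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Uniformly sample a mini-batch without replacement."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def dqn_loss(q_fn, target_q_fn, batch, gamma):
    """Mean squared TD error, with targets computed by the frozen target network."""
    errors = []
    for state, action, reward, next_state in batch:
        target = reward + gamma * np.max(target_q_fn(next_state))
        errors.append(target - q_fn(state)[action])
    return float(np.mean(np.square(errors)))
```

In a full implementation the gradient of this loss would be taken with respect to the training network only, and the target network parameters would be copied over every fixed number of steps.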
III-C Centralized Control and Distributed Transmission Architecture
In this part, the architecture for the C-Decision scheme is shown in Fig. 1. Each V2V link first observes its local environment and then adopts a DNN to compress the observed information into several real numbers, which are finally fed back to the BS for centralized decision making. The BS takes the feedback information of all V2V links as the input, utilizes the DQN to perform Q-learning to decide the channel allocation for all V2V links, and broadcasts its decision. Finally, each V2V link chooses the BS-allocated channel for its transmission.
Details of the training framework for the C-Decision scheme are provided in Algorithm 1. We define as the observations of all V2Vs at the time step , where refers to the observation of the -th V2V at the time step . Then, we can express the estimation of the return also known as the approximate target value  as
where and represent the reward of all links and the Q function of the target DQN with parameters under the next observation and the action , respectively. Then, the updating process for the BS DQN can be written as [32, 33]:
where is the step size in one gradient iteration.
As for the testing phase, at each time step , each V2V adopts its observation as the input of the trained DNN to obtain its learned feedback , and then sends it to the BS. After that, the BS takes as the input of its trained DQN to generate the decision result , and broadcasts to all V2Vs. Finally, each V2V chooses the specific channel indicated by to transmit.
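The testing phase described above amounts to three steps: compress locally, decide centrally, and broadcast. A minimal sketch follows, assuming `compressors` and `bs_q_network` are already-trained callables (hypothetical names) and that the joint action index is decoded digit by digit in base M, which is an illustrative encoding.

```python
import numpy as np

def c_decision_step(observations, compressors, bs_q_network, num_channels):
    """One testing step: returns the BS state and the channel chosen per link."""
    # 1) Each V2V link compresses its own observation into a feedback vector.
    feedback = [compress(obs) for compress, obs in zip(compressors, observations)]
    # 2) The BS stacks all feedback into its state and picks the greedy action.
    state = np.concatenate(feedback)
    action = int(np.argmax(bs_q_network(state)))
    # 3) The joint action is decoded into one channel per link and broadcast.
    channels = []
    for _ in observations:
        channels.append(action % num_channels)
        action //= num_channels
    return state, channels
```

No exploration is applied here, since the sketch covers testing only.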
III-D Spectrum Sharing with Binary Feedback
In order to further reduce feedback overhead, we propose a framework to quantize the V2V links’ real-valued feedback into several binary digits. In other words, we try to constrain . The binarization procedure can help force the neural networks to learn efficient representations of the feedback information compared to the standard floating-point layer. In other words, a binary layer can make each V2V compress its observation more efficiently.
The binary quantization process consists of two steps . The first step is to generate the required number of continuous feedback values in the continuous interval , which is also equal to the desired number of the binary feedback. Then, the second step takes the outputs of the first step as its input to produce the desired number of discrete feedback in the set for each output real-valued feedback of the first step.
For the first step, we adopt a fully-connected layer with activations, defined as , where we term this layer as the pre-binary layer. Here, the input of this pre-binary layer connects the outputs of each V2V’s DNN. Then, in order to binarize the continuous output of the first step, we adopt the traditional sign function method in the second step. To be specific, we take the sign of the input value as the output of this layer, which is shown as below:
However, the gradient of this function is not continuous, challenging the back propagation procedure for DNN training. As a remedy to this, we adopt the identity function in the backward pass, which is known as the straight-through estimator . Combining these two steps together, we can express the full binary feedback function as
where and denote the linear weights and bias of the pre-binary layer that transform the activations from the previous layer in the neural network respectively. Here, we term this layer as the binary layer.
Finally, to implement the C-Decision scheme with binary feedback, we add the full binary feedback function in (16), which consists of the pre-binary layer in the first step and the binary layer in the second step, to the output of each V2V link’s DNN. Besides, in response to the change in the number of feedback bits at each V2V link’s new DNN, the number of inputs in the DQN of BS should change correspondingly.
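The two-step quantization with a straight-through estimator can be sketched with explicit forward and backward passes. The tanh activation for the pre-binary layer and the mapping of an exact zero to +1 are assumptions made for this illustration; `W` and `b` stand for the pre-binary layer's weights and bias.

```python
import numpy as np

def binary_feedback_forward(x, W, b):
    """Pre-binary layer (tanh, assumed) followed by a sign-based binary layer."""
    pre = np.tanh(x @ W + b)                   # continuous values in (-1, 1)
    bits = np.where(pre >= 0.0, 1.0, -1.0)     # values in {-1, +1}; 0 maps to +1
    return bits, pre

def binary_feedback_backward(grad_bits, x, W, pre):
    """Backward pass using the straight-through estimator for the sign step."""
    grad_pre = grad_bits                       # sign treated as identity
    grad_z = grad_pre * (1.0 - pre ** 2)       # derivative of tanh
    grad_W = np.outer(x, grad_z)
    grad_b = grad_z
    grad_x = W @ grad_z                        # gradient passed to earlier layers
    return grad_W, grad_b, grad_x
```

Without the identity substitution in `binary_feedback_backward`, the gradient of the sign step would be zero almost everywhere and the layers before it could not be trained.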
IV Distributed Decision Making and Spectrum Sharing Architecture
In order to further facilitate distributed spectrum sharing and reduce the computational complexity, we propose the distributed decision making and spectrum sharing architecture (named as the D-Decision scheme) shown in Fig. 2 to let each V2V link make its own spectrum sharing decision. In this section, we first devise the neural network architecture for each V2V link to compress CSI and make decision, respectively, and then design the neural network for the BS to aggregate feedback from all V2V links. Then, we propose the hybrid information aggregation and distributed control architecture. Finally, we propose the D-Decision scheme with the binary aggregated information.
IV-A DNN Design at V2V and BS
To enable distributed decision making, each V2V contains one DNN to compress local observations for feedback, termed the Compression DNN and another DQN for distributed spectrum sharing decision making, termed Decision DQN. Here, we employ the same DNN architecture for each V2V as that in Part A of Section III since they share the same functionality.
The BS aggregates the feedback from all V2Vs via its DNN, termed as the Aggregation DNN, and then broadcasts the aggregated global information (AGI) to all V2Vs. Here, the AGI can be expressed as , where refers to the number of AGI values and also equals the number of outputs of BS Aggregation DNN. Finally, each V2V combines its local observation and the AGI as the input of its Decision DQN to decide which channel to transmit.
IV-B Hybrid Information Aggregation and Distributed Control Architecture
Each V2V link first observes its local environment to obtain , and then adopts its Compression DNN to compress into several real numbers , and finally feeds this compressed information back to the BS. After that, the BS takes the feedback values of all V2V links as the input of its Aggregation DNN to aggregate the compressed observations of all V2V links and further compress this information into the AGI . Finally, each V2V link combines the received AGI and its local observation as the input of its Decision DQN, and performs the Q-learning algorithm to decide which channel to transmit.
Details of the training framework for the D-Decision scheme are provided in Algorithm 2. Here, we define as the actions of all V2V links at the time step , where refers to the action for -th V2V. Besides, in the training process, we take the observations of all V2V links as the input and train all DNNs and DQNs in an end-to-end manner. The training process can be implemented in a fully distributed manner.
As for the testing phase, at each time step , each V2V link adopts its observation as the input of its Compression DNN to learn the feedback , and sends it to the BS. Then, the BS utilizes as the input of its Aggregation DNN to generate the AGI , and broadcasts to all V2V links. Finally, each V2V link takes and as the input of its Decision DQN to make decision, and then transmits on the chosen channel.
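The testing flow of the D-Decision scheme can be sketched in the same style. The callables `compressors`, `aggregator`, and `decision_qs` are hypothetical stand-ins for the trained Compression DNNs, the Aggregation DNN, and the Decision DQNs.

```python
import numpy as np

def d_decision_step(observations, compressors, aggregator, decision_qs):
    """One testing step: returns the AGI and the channel chosen by each link."""
    # Uplink: every V2V link sends its compressed observation to the BS.
    feedback = [compress(obs) for compress, obs in zip(compressors, observations)]
    # BS: aggregate all feedback into the AGI and broadcast it.
    agi = aggregator(np.concatenate(feedback))
    # Downlink: each link combines its own observation with the AGI and
    # selects the channel with the largest action value.
    channels = []
    for obs, q_net in zip(observations, decision_qs):
        q_values = q_net(np.concatenate([obs, agi]))
        channels.append(int(np.argmax(q_values)))
    return agi, channels
```

Compared with the centralized variant, each Decision DQN only needs one output per channel, since it decides for a single link.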
IV-C Distributed Spectrum Sharing with Binary Information
Similar to Section III-D, we can also quantize the continuous feedback and the AGI in the D-Decision scheme into the binary data to further reduce signaling overhead. Then, both the Compression DNN of each V2V link and the Aggregation DNN at the BS need to include the binary function in (16).
V Simulation Results
In this section, we conduct extensive simulation to verify the performance of the proposed schemes. In particular, we provide the simulation settings in Part A, and evaluate the training performance of the C-Decision scheme in Part B. Then, we assess the testing performance under the real-valued feedback and binary feedback in Parts C and D respectively. Besides, we demonstrate the impacts of V2I and V2V links weights on the performance in Part E and the robustness of the proposed scheme in Part F, respectively. Finally, we show the training and testing performance of the D-Decision scheme in Part G.
V-A Simulation Settings
The simulation scenario follows the urban case in Annex A of . The simulation area size is , where the BS is located in the center of this area. For better understanding, we provide related parameters and their corresponding settings in Table I. In addition, we list the corresponding channel models for both V2V and V2I links respectively in Table II.
Table I: Simulation parameters
|Parameter|Value|
|Number of V2I links|4|
|Number of V2V links|4|
|Carrier frequency|2 GHz|
|Normalized channel bandwidth|1|
|BS antenna height|25 m|
|BS antenna gain|8 dBi|
|BS receive noise figure|5 dB|
|Vehicle antenna height|1.5 m|
|Vehicle antenna gain|3 dBi|
|Vehicle receive noise figure|9 dB|
|Vehicle speed|randomly in [10, 15] km/h|
|Vehicle drop and mobility model|Urban case of A.1.2 in |
|V2I transmit power|23 dBm|
|V2V transmit power|10 dBm|
Table II: Channel models for V2I and V2V links
|Parameter|V2I link|V2V link|
|Path loss model|, d in km|LOS in WINNER + B1 Manhattan |
|Shadowing standard deviation|8 dB|3 dB|
|Decorrelation distance|50 m|10 m|
|Noise power|-114 dBm|-114 dBm|
|Fast fading|Rayleigh fading|Rayleigh fading|
|Fast fading update|Every 1 ms|Every 1 ms|
The specific architectures of the DNNs and the BS DQN under the C-Decision scheme are summarized in Table III, where refers to the number of feedback values for each V2V link and FC denotes a fully connected layer. In addition, the number of neurons in the output layer of the BS DQN is set as , which refers to all the possible channel allocations for all V2V links under the current simulation setting. Besides, the settings for the DNNs and DQNs under the D-Decision scheme are listed in Table IV.
Table III: Network architectures under the C-Decision scheme
| |V2V DNN|BS DQN|
|Hidden layers|3 FC layers (16, 32, 16)|3 FC layers (1200, 800, 600)|
Table IV: Network architectures under the D-Decision scheme
| |Compression DNN|Aggregation DNN|Decision DQN|
|Hidden layers|3 FC layers (16, 32, 16)|3 FC layers (500, 400, 300)|3 FC layers (80, 40, 20)|
Here, the activation function of the output layers in the DNNs and DQNs is set as a linear function. Besides, the RMSProp optimizer is adopted to update the network parameters with a learning rate of . The loss function is set as the Huber loss.
We choose the weights and for V2I and V2V links, respectively. We train the whole neural network for episodes and the exploration rate is linearly annealed from to over the beginning episodes and keeps constant afterwards. The number of steps in each episode is set as . The update frequency of the target Q-network is every steps. The discount factor, , in the training is chosen as . The size of the replay buffer is set as samples. Meanwhile, the mini-batch size varies in different settings, to be specified in each figure.
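The linear annealing of the exploration rate can be sketched as below. The start value, end value, and number of annealing episodes are left as parameters because their concrete settings are not reproduced here; the function name is illustrative.

```python
def linear_epsilon(episode, eps_start, eps_end, anneal_episodes):
    """Linearly anneal epsilon over the first anneal_episodes, then hold it."""
    if anneal_episodes <= 0:
        raise ValueError("anneal_episodes must be positive")
    fraction = min(max(episode, 0) / anneal_episodes, 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```

For example, halfway through the annealing window the function returns the midpoint between `eps_start` and `eps_end`, and every later episode returns `eps_end`.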
V-B Training Performance Evaluation
Fig. 3 demonstrates the training performance of the proposed C-Decision scheme with a mini-batch size and the number of real-valued feedback . In Fig. 3, the loss function decreases quickly with the increasing number of training episodes , and becomes nearly unchanged with the further increasing . On the other hand, the change of average return per episode is also displayed in Fig. 3. Here, we evaluate the training process every training episodes under different random seeds with the exploration rate , and plot the average return per episode. The average return per episode first increases quickly with increasing , and gradually converges despite some small fluctuations due to the time-varying V2X scenario, which shows the stability of the training process. Thus, both panels of Fig. 3 demonstrate the desired convergence of the proposed training algorithm. Therefore, we set for the C-Decision scheme afterwards.
V-C Performance of Real-Valued Feedback
Fig. 4 shows the return variation under the real-valued feedback with the number of testing episodes. Here, we choose the mini-batch size as , number of testing episodes as , and the number of real-valued feedback as , respectively. For comparison, we also display the performance of two benchmark schemes: the optimal scheme and the random action scheme. In the optimal scheme, we perform a time-consuming brute-force search to find the optimal spectrum allocation in each testing step. In the random action scheme, each V2V link chooses its channel randomly. For better comparison, we depict the normalized return of these three schemes in Fig. 4, where we use the return of the optimal scheme to normalize the return of the other two schemes in each testing episode. Besides, the average returns of our proposed scheme and the random action scheme are also depicted. In Fig. 4, the performance of the C-Decision scheme approaches in most episodes and its average performance is about of the optimal scheme while the average performance of random selection is about of the optimal performance. Thus, we conclude that the proposed C-Decision scheme can achieve near-optimal spectrum sharing.
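The two benchmark schemes can be sketched as follows. The `reward_fn` argument is a hypothetical callable that scores a tuple of channel indices (one per V2V link); in the setting above it would evaluate the reward of that allocation.

```python
import itertools
import numpy as np

def brute_force_allocation(reward_fn, num_links, num_channels):
    """Exhaustively search all M**K allocations and return the best one."""
    best_alloc, best_reward = None, -np.inf
    for alloc in itertools.product(range(num_channels), repeat=num_links):
        reward = reward_fn(alloc)
        if reward > best_reward:
            best_alloc, best_reward = alloc, reward
    return best_alloc, best_reward

def random_allocation(num_links, num_channels, rng):
    """Random action benchmark: every link picks a channel uniformly at random."""
    return tuple(int(c) for c in rng.integers(num_channels, size=num_links))
```

The exhaustive search evaluates `num_channels ** num_links` candidates per step, which is why it is only practical as an offline benchmark for small networks.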
Fig. 4 also shows the impacts of different mini-batch sizes and different numbers of real-valued feedback on the performance of the C-Decision scheme, which adopts the average return percentage (ARP) as the metric. Here, the ARP metric is defined as follows: the return under the C-Decision scheme is first averaged over testing episodes and then normalized by the average return of the optimal scheme. In Fig. 4, a number of real-valued feedback values equal to zero refers to the situation where each V2V link does not feed anything back to the BS and therefore just randomly selects a channel to transmit, which is known as the random action scheme. From Fig. 4, the ARP under the C-Decision scheme increases rapidly with the increase of , and reaches the maximal percentage nearly at . Thereafter, the ARP virtually keeps constant with increasing . In other words, each V2V link only needs to send real-valued feedback to the BS to achieve near-optimal performance. Besides, different mini-batch sizes can achieve very similar performance. Particularly, the mini-batch size achieves the best performance, which is good enough considering the computational overhead in the training process and the gained performance.
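The ARP metric as defined above is straightforward to compute. A small sketch, with illustrative argument names and the result expressed in percent:

```python
def average_return_percentage(scheme_returns, optimal_returns):
    """Average return of a scheme, normalized by the optimal average, in percent."""
    if not scheme_returns or len(scheme_returns) != len(optimal_returns):
        raise ValueError("need two non-empty sequences of equal length")
    avg_scheme = sum(scheme_returns) / len(scheme_returns)
    avg_optimal = sum(optimal_returns) / len(optimal_returns)
    return 100.0 * avg_scheme / avg_optimal
```

Note that this is a ratio of averages rather than an average of per-episode ratios, which matches the order of operations in the definition.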
V-D Performance of Binary Feedback
Fig. 5 shows how the ARP changes with an increasing number of feedback bits under different mini-batch sizes. Here, we fix the number of real-valued feedback as , and quantize the real-valued feedback into different numbers of feedback bits. Similarly, the case where the number of feedback bits equals in Fig. 5 refers to the situation where each V2V link feeds nothing back to the BS and just adopts the random action scheme. The ARP first increases quickly with the number of feedback bits and then remains nearly unchanged once the number of feedback bits is larger than . The ARP is quite similar under different . Besides, the ARP can reach with feedback bits under . Considering the tradeoff between performance and feedback overhead, we choose feedback bits under in the subsequent evaluation.
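One common way to turn real-valued outputs into bits while keeping a network trainable is a sign function in the forward pass combined with a straight-through gradient in the backward pass. The sketch below illustrates that general technique only; whether it matches the binarization used in this paper in every detail is an assumption, and the function names and the clipping threshold are hypothetical.

```python
def binarize(values):
    """Forward pass: map each real value to +1.0 or -1.0 (zero maps to +1.0)."""
    return [1.0 if v >= 0.0 else -1.0 for v in values]


def straight_through_grad(values, upstream_grads, clip=1.0):
    """Backward pass of a straight-through estimator: pass the upstream
    gradient unchanged where |value| <= clip and zero it elsewhere."""
    return [g if abs(v) <= clip else 0.0
            for v, g in zip(values, upstream_grads)]


# Made-up pre-binarization outputs and upstream gradients.
pre_activations = [0.7, -0.2, 0.0, -1.5]
bits = binarize(pre_activations)
grads = straight_through_grad(pre_activations, [0.1, 0.2, 0.3, 0.4])
```

Each element of `bits` costs one bit on the feedback link, so the number of outputs of such a layer directly sets the feedback overhead.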
V-E Impacts of V2I and V2V Weights
In this part, we evaluate the impacts of the V2I link weight and the V2V link weight on the system performance. For better understanding, we fix and vary the values of . Fig. 6 shows the empirical cumulative distribution functions (CDFs) of the V2I and V2V sum rates. In Fig. 6, “Real FB” and “Binary FB” refer to the proposed C-Decision scheme with real-valued feedback and with binary feedback, respectively, and “Optimal” represents the optimal scheme. In particular, the two empirical CDFs of the V2I sum rate under real-valued feedback and binary feedback in Fig. 6 shift quickly to the right when the V2I weight increases to , which shows that our proposed scheme can meet different QoS requirements of V2I links by adjusting . Besides, the performance gap between real-valued feedback and binary feedback decreases with the increase of . From Fig. 6, the empirical CDFs of the V2V sum rate under real-valued feedback and binary feedback are very close to each other and shift slightly to the left with increasing , which implies that the rate degradation of V2V links is quite small. Besides, the CDFs of the V2V sum rate under both feedback schemes are very close to that under the optimal scheme, and deviate slightly from the optimal performance with the further increase of . Thus, the proposed C-Decision scheme can ensure negligible degradation of V2V links while adjusting the QoS of V2I links via choosing different values of .
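For readers less familiar with empirical CDFs, the following sketch shows how such a curve is obtained from sum-rate samples and what a shift to the right means numerically. The sample values are made up for illustration and are not simulation results.

```python
def empirical_cdf(samples):
    """Return (value, cumulative probability) pairs for the sorted samples."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


def cdf_at(samples, threshold):
    """Fraction of samples that are less than or equal to the threshold."""
    return sum(1 for x in samples if x <= threshold) / len(samples)


# Made-up sum-rate samples under a lower and a higher V2I weight.
low_weight_rates = [3.0, 1.0, 2.0, 4.0]
high_weight_rates = [5.0, 3.0, 4.0, 6.0]

curve = empirical_cdf(low_weight_rates)
# A CDF shifted to the right has a smaller value at the same threshold.
low_at_threshold = cdf_at(low_weight_rates, 2.5)
high_at_threshold = cdf_at(high_weight_rates, 2.5)
```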
V-F Robustness Evaluation
Fig. 7 shows the impacts of different feedback intervals on the performance of both real-valued feedback and binary feedback, where the feedback interval is measured in the number of testing steps. To investigate the impact of very large feedback intervals on the performance, we set the number of testing steps as and the number of testing episodes as . The normalized average return under both feedback schemes first decreases quite slowly with increasing feedback interval, which shows that the proposed scheme is robust to moderate feedback interval variations, and then drops quickly when the feedback interval becomes very large. Note that the average return is normalized by the average return under the scheme with , since we set and it is computationally very demanding to find the return under the optimal scheme.
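One plausible reading of the feedback interval, assumed here for illustration, is that the BS receives fresh feedback only once every fixed number of testing steps and reuses the most recent value in between. The sketch below models that hold-last-value behaviour; the function name and the integer placeholders for feedback are hypothetical.

```python
def feedback_seen_by_bs(fresh_feedback, interval):
    """Return the feedback the BS acts on at each step when a link reports
    only every `interval` steps and the last report is reused in between."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    seen = []
    latest = None
    for step, value in enumerate(fresh_feedback):
        if step % interval == 0:
            latest = value  # a new report arrives at this step
        seen.append(latest)
    return seen


# Integers stand in for the per-step feedback a link would compute.
per_step = [10, 11, 12, 13, 14, 15, 16]
every_step = feedback_seen_by_bs(per_step, 1)
every_third = feedback_seen_by_bs(per_step, 3)
```

With a larger interval, the feedback used by the BS becomes increasingly stale relative to the current channel, which is consistent with the eventual performance drop described above.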
Fig. 8 illustrates the impacts of noisy input on the performance of both real-valued feedback and binary feedback. Here, the x-axis denotes the ratio of the strength of the Gaussian white noise to each observation (such as the channel gain value) for V2V links. In Fig. 8, the ARP under both feedback schemes decreases very slowly at the beginning, then drops very quickly, and finally remains nearly unchanged under very large input noise, which shows the robustness of the proposed scheme. In addition, the proposed scheme can still attain nearly of the optimal performance under both real-valued feedback and binary feedback even at very large input noise, which is still better than the random action scheme shown in Fig. 4. Based on this observation, we remark that the proposed scheme can learn the intrinsic structure of the resource allocation in the V2X scenario.
Besides, Fig. 8 displays the impacts of noisy feedback on the performance of both feedback schemes. Here, noisy feedback refers to the situation where noise is introduced when each V2V link sends its learned feedback to the BS. Similarly, the x-axis denotes the ratio of the strength of the Gaussian white noise to each feedback. In Fig. 8, the ARP of both feedback schemes first remains nearly unchanged with increasing feedback noise, which demonstrates the robustness of the proposed scheme, and then, as the feedback noise increases further, decreases more quickly under real-valued feedback than under binary feedback. This is because there are only real-valued feedback under the real-valued feedback scheme while there exist feedback bits under the binary feedback scheme. Finally, the ARP of both feedback schemes becomes nearly constant under very large feedback noise. Overall, the binary feedback scheme is more robust to feedback noise than the real-valued feedback scheme.
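The noise model in both robustness tests scales the Gaussian noise relative to each value. A minimal sketch follows, assuming that the noise standard deviation equals the ratio multiplied by the magnitude of the value; this interpretation of "strength", the function name, and the sample observations are assumptions made for illustration.

```python
import random


def add_relative_noise(values, ratio, rng):
    """Add zero-mean Gaussian noise to each value, with standard deviation
    ratio * |value| (assumed meaning of the noise-to-value ratio)."""
    return [v + rng.gauss(0.0, ratio * abs(v)) for v in values]


rng = random.Random(1)
clean = [1.0, -2.0, 0.5]  # made-up observations or feedback values

unchanged = add_relative_noise(clean, 0.0, rng)  # ratio 0 adds no noise
noisy = add_relative_noise(clean, 0.5, rng)
```

Under this model a value of exactly zero receives no noise, so an additive noise floor would be needed if such entries should also be perturbed.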
V-G Performance Evaluation for the D-Decision Scheme
Fig. 9 evaluates the training process of the D-Decision scheme. Here, we choose , and , respectively. In particular, the training loss for the st V2V link in Fig. 9 first decreases very slowly, with some jitter, with an increasing , then drops almost linearly, and finally becomes nearly unchanged with the further increase of . The average return per episode under the D-Decision scheme in Fig. 9 first increases quickly with the increase of , then increases slowly, and finally converges gradually despite some fluctuations, which shows the stability of the training process. Besides, we observe that under the D-Decision scheme is much larger than under the C-Decision scheme, which indicates that the D-Decision scheme converges more slowly than the C-Decision scheme. To train the whole neural network well, we set under the D-Decision scheme. Besides, the exploration rate is linearly annealed from to over the beginning episodes and then remains constant.
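The linear annealing of the exploration rate can be sketched as below. The start value, end value, and annealing length used in the example are placeholders chosen for illustration; they are not the settings of this paper.

```python
def annealed_epsilon(episode, eps_start, eps_end, anneal_episodes):
    """Linearly move epsilon from eps_start to eps_end over the first
    anneal_episodes episodes, then hold it at eps_end."""
    if anneal_episodes <= 0:
        raise ValueError("anneal_episodes must be positive")
    fraction = min(max(episode, 0), anneal_episodes) / anneal_episodes
    return eps_start * (1.0 - fraction) + eps_end * fraction


# Placeholder schedule: 1.0 -> 0.1 over the first 10 episodes.
schedule = [annealed_epsilon(e, 1.0, 0.1, 10) for e in (0, 5, 10, 100)]
```

A larger exploration rate early in training favours trying random channels, and the constant final value keeps a small amount of exploration for the rest of training.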
Then, the testing performance of the D-Decision scheme with an increasing number of AGI values is shown in Fig. 10. In particular, Fig. 10 illustrates the ARP performance with the increasing number of real-valued AGI . Here, as indicated by Fig. 4, we set the number of real-valued feedback that each V2V link transmits to the BS as . The ARP first increases with increasing and then remains nearly unchanged with the further increase of . In particular, the ARP nearly achieves its maximal value when . In other words, the BS only needs real-valued AGI to represent the real-valued feedback of all V2V links to achieve of the optimal performance. Furthermore, even when , the ARP can still reach , which is suitable for the bandwidth-constrained broadcast channel of the BS. Compared with the C-Decision scheme, the D-Decision scheme only incurs ARP degradation. However, it achieves fully distributed decision making and spectrum sharing, which is very appealing in the V2X scenario. In addition, the computational complexity of decision making under the D-Decision scheme is greatly reduced compared with that under the C-Decision scheme, which further facilitates fully distributed spectrum sharing in the V2X scenario.
Besides, the testing performance of the D-Decision scheme with binary AGI is evaluated in Fig. 10. Here, we choose the number of feedback bits as for each V2V link and the number of real-valued AGI . In Fig. 10, the ARP first increases with the increasing number of AGI bits and then becomes nearly unchanged with the further increase of . In particular, the ARP reaches when . Meanwhile, the ARP is very close to even when . Similarly, compared with the C-Decision scheme with binary feedback, the D-Decision scheme with binary feedback only incurs ARP degradation while allowing a fully distributed implementation.
VI Conclusion
In this paper, we proposed a novel C-Decision architecture that allows distributed V2V links to share spectrum efficiently with the aid of the BS in the V2X scenario, and devised an approach to binarize the continuous feedback. To further facilitate distributed decision making, we developed a D-Decision scheme in which each V2V link makes its own decision locally, and designed a binarization procedure for this scheme. Simulation results demonstrated that a quite small number of real-valued feedback suffices to achieve near-optimal performance. Meanwhile, the D-Decision scheme also achieves near-optimal performance and enables fully distributed decision making, which is more appealing for V2X networks. Besides, quantizing the feedback or the AGI incurs only a small performance loss with an acceptable number of bits under both schemes. The proposed schemes are also quite robust to variations in the feedback interval, input noise, and feedback noise. In the future, we will investigate the joint power control and spectrum sharing problem in this scenario.
-  H. Seo, K. Lee, S. Yasukawa, Y. Peng, and P. Sartori, “LTE evolution for vehicle-to-everything services,” IEEE Commun. Mag., vol. 54, no. 6, pp. 22–28, Jun. 2016.
-  S. Chen, J. Hu, Y. Shi, Y. Peng, J. Fang, R. Zhao, and L. Zhao, “Vehicle-to-everything (V2X) services supported by LTE-based systems and 5G,” IEEE Commun. Standards Mag., vol. 1, no. 2, pp. 70–76, 2017.
-  L. Liang, H. Peng, G. Y. Li, and X. Shen, “Vehicular communications: A physical layer perspective,” IEEE Trans. Veh. Technol., vol. 66, no. 12, pp. 10 647–10 659, Dec. 2017.
-  H. Peng, L. Liang, X. Shen, and G. Y. Li, “Vehicular communications: A network layer perspective,” IEEE Trans. Veh. Technol., vol. 68, no. 2, pp. 1064–1078, Feb. 2019.
-  3rd Generation Partnership Project, “Technical specification group radio access network: Study on LTE-based V2X services,” 3GPP, TR 36.885 V14.0.0, Jun. 2016.
-  ——, “Study on enhancement of 3GPP support for 5G V2X services,” 3GPP, TR 22.886 V15.1.0, Mar. 2017.
-  C. Guo, L. Liang, and G. Y. Li, “Resource allocation for low-latency vehicular communications: An effective capacity perspective,” IEEE J. Sel. Areas Commun., vol. 37, no. 4, pp. 905–917, Apr. 2019.
-  L. Liang, G. Y. Li, and W. Xu, “Resource allocation for D2D-enabled vehicular communications,” IEEE Trans. Commun., vol. 65, no. 7, pp. 3186–3197, Jul. 2017.
-  L. Liang, S. Xie, G. Y. Li, Z. Ding, and X. Yu, “Graph-based resource sharing in vehicular communication,” IEEE Trans. Wireless Commun., vol. 17, no. 7, pp. 4579–4592, Jul. 2018.
-  C. Chen, B. Wang, and R. Zhang, “Interference hypergraph-based resource allocation (IHG-RA) for NOMA-integrated V2X networks,” IEEE Internet Things J., vol. 6, no. 1, pp. 161–170, Feb. 2019.
-  C. Han, M. Dianati, Y. Cao, F. Mccullough, and A. Mouzakitis, “Adaptive network segmentation and channel allocation in large-scale V2X communication networks,” IEEE Trans. Commun., vol. 67, no. 1, pp. 405–416, Jan. 2019.
-  B. Bai, W. Chen, K. B. Letaief, and Z. Cao, “Low complexity outage optimal distributed channel allocation for vehicle-to-vehicle communications,” IEEE J. Sel. Areas Commun., vol. 29, no. 1, pp. 161–172, Jan. 2011.
-  M. I. Ashraf, M. Bennis, C. Perfecto, and W. Saad, “Dynamic proximity-aware resource allocation in vehicle-to-vehicle (V2V) communications,” in Proc. IEEE Globecom Workshops (GC Wkshps), Dec. 2016, pp. 1–6.
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
-  T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 4, pp. 563–575, Dec. 2017.
-  Z. Qin, H. Ye, G. Y. Li, and B. F. Juang, “Deep learning in physical layer communications,” IEEE Wireless Commun., vol. 26, no. 2, pp. 93–99, Apr. 2019.
-  H. Ye, G. Y. Li, and B. Juang, “Power of deep learning for channel estimation and signal detection in OFDM systems,” IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 114–117, Feb. 2018.
-  F. A. Aoudia and J. Hoydis, “End-to-end learning of communications systems without a channel model,” arXiv preprint arXiv:1804.02276, 2018.
-  C. Jiang, H. Zhang, Y. Ren, Z. Han, K. Chen, and L. Hanzo, “Machine learning paradigms for next-generation wireless networks,” IEEE Wireless Commun., vol. 24, no. 2, pp. 98–105, Apr. 2017.
-  R. Li, Z. Zhao, X. Zhou, G. Ding, Y. Chen, Z. Wang, and H. Zhang, “Intelligent 5G: When cellular networks meet artificial intelligence,” IEEE Wireless Commun., vol. 24, no. 5, pp. 175–183, Oct. 2017.
-  S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access in wireless networks,” IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 2, pp. 257–265, Jun. 2018.
-  Y. Sun, M. Peng, and S. Mao, “Deep reinforcement learning based mode selection and resource management for green fog radio access networks,” IEEE Internet Things J., vol. 6, no. 2, pp. 1960–1971, Apr. 2019.
-  L. Liang, H. Ye, G. Yu, and G. Y. Li, “Deep learning based wireless resource allocation with application to vehicular networks,” arXiv preprint arXiv:1907.03289, 2019.
-  H. Ye, L. Liang, G. Y. Li, J. Kim, L. Lu, and M. Wu, “Machine learning for vehicular networks: Recent advances and application examples,” IEEE Veh. Technol. Mag., vol. 13, no. 2, pp. 94–101, Jun. 2018.
-  L. Liang, H. Ye, and G. Y. Li, “Toward intelligent vehicular networks: A machine learning framework,” IEEE Internet Things J., vol. 6, no. 1, pp. 124–135, Feb. 2019.
-  H. Ye, G. Y. Li, and B. F. Juang, “Deep reinforcement learning based resource allocation for V2V communications,” IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 3163–3173, Apr. 2019.
-  L. Liang, H. Ye, and G. Y. Li, “Spectrum sharing in vehicular networks based on multi-agent reinforcement learning,” to appear in IEEE J. Sel. Areas Commun., 2019.
-  Y. Wang, K. Wang, H. Huang, T. Miyazaki, and S. Guo, “Traffic and computation co-offloading with reinforcement learning in fog computing for industrial applications,” IEEE Trans. Ind. Informat., vol. 15, no. 2, pp. 976–986, Feb. 2019.
-  Y. S. Nasir and D. Guo, “Deep reinforcement learning for distributed dynamic power allocation in wireless networks,” arXiv preprint arXiv:1808.00490, 2018.
-  C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, Feb. 1992.
-  R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
-  V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
-  H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. 30th AAAI Conf., Feb. 2016, pp. 2094–2100.
-  G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” arXiv preprint arXiv:1511.06085, 2015.
-  Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
-  Y. Bultitude and T. Rautiainen, “IST-4-027756 WINNER II D1.1.2 V1.2 WINNER II channel models.”
-  S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
-  T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY, USA: Springer Science & Business Media, 2009.