## I Introduction

Vehicular networking has lately drawn significant attention in wireless networking, particularly for next-generation communication platforms such as beyond-5G systems. The advent of vehicular ad-hoc networking [4539481] has made it possible to establish communication among vehicles on the road. This offers a direct method for sharing important safety messages among vehicles using device-to-device-like communication. Early dedicated short-range communications (DSRC) [jiang2006design] was designed following IEEE 802.11 physical and medium access layer technologies [7497762]. However, DSRC uses collision-avoidance medium access schemes [molina2017lte], which create an extra burden when extreme reliability has to be ensured, particularly as the network load increases [araniti2013lte]. Cellular vehicle-to-everything (C-V2X) was developed by the 3GPP as a potential alternative [molina2017lte] that addresses these limitations of DSRC. C-V2X has two main radio interfaces, namely the cellular interface (Uu interface) and the sidelink (PC5 interface). VUs maintain their communication with the infrastructure using the Uu interface, whereas they can communicate with each other directly using the PC5 interface. In practice, V2X applications aim to incorporate both DSRC and C-V2X [ghafoor2019enabling], and they have recently drawn significant attention for reducing travel time and traffic congestion and for providing ubiquitous internet connectivity in smart cities [heo2019performance].

There are several works in the literature addressing various aspects of vehicular networking. Şahin et al. have presented a virtual cell formation for V2X in [sahin2018virtual] where vehicles that are in close proximity can receive downlink multicast data from the transmission points. Gao et al. have proposed a deep neural network based resource allocation scheme in [li2019learning] where they have used the block coordinate descent method and minimized the weighted minimum mean square error. Lien et al. have considered a local area data network and shown a way to achieve lower latency in the radio access part in [lien2018low]. Considering the potential of performing caching at the edge nodes, the authors have used sophisticated tools of stochastic optimization and reinforcement learning (RL) to achieve lower latency using feedback-less transmission. He et al. have proposed an integrated platform for connected vehicles in [8061008]. The authors have formulated an integrated platform, in which they have orchestrated a joint networking, caching and computing resource allocation problem, and applied deep reinforcement learning (DRL) techniques. Tan et al. have incorporated both caching and computing in mobility-aware vehicular networks in [8447267]. The authors have modeled vehicle mobility using a naive contact time-based model. Considering that both VUs and RSUs have caching and computing capabilities, they have used a DRL framework to find the optimal caching placement and computing resource allocation strategy. Ye et al. have used DRL to allocate radio resources for vehicle-to-vehicle (V2V) communication in [ye2019deep]. In particular, they have reused the vehicle-to-infrastructure (V2I) uplink channels for V2V transmissions. Distributed vehicular networking resource allocation problems have also been addressed in [8792382] using multi-agent reinforcement learning. Similar to Ye et al., the authors have reused uplink frequency resources for V2V communications. They have devised a system where the agents interact with the environment and decide which channel to reuse and which transmission power level to use for V2V communication.

While most of these works address various aspects of V2X communication from the traditional network perspective, in this work, we incorporate user-centric dynamic cell design in a software-defined (SD) environment to serve the users from multiple sources simultaneously. Particularly, we contemplate a centralized environment where resources can be programmed, controlled and distributed based on the users’ demands. As presented in Fig. 1, we consider multiple edge servers deployed close to the edge nodes in order to lessen end-to-end latency and improve spectral efficiency. Each of these edge servers is physically connected to a cloud server and controlled by its corresponding AN. Furthermore, low-powered, relay-like APs are stationed as RSUs to guarantee ubiquitous connectivity of the VUs. Each of these APs is mesh connected to each of the edge servers. Moreover, we are interested in the joint design of the virtual cells for each of the scheduled VUs and the optimal beamforming vectors of the low-powered APs. Therefore, in this paper, we address the joint user scheduling and power allocation problem in a highly mobile vehicular downlink network from the users’ perspective.

To the best of our knowledge, this is the first work to consider software-defined joint user-centric cell formation and power allocation, ensuring reliable communication and maximizing the weighted sum rate (WSR), in a highway vehicular platform. Due to the complex combinatorial nature of the formulated optimization problem, it is extremely hard to solve using traditional optimization methods. Thus, we propose a distributed Q-learning based RL solution to obtain the optimal WSR. We also validate our results against the genie-aided optimal solution, baseline single-agent RL (SARL), multi-agent RL (MARL) and other baseline schemes.

The rest of the paper is organized as follows. Section II presents our system model with an explanation of node distributions, SD V2I communication models and problem statements. The proposed RL solution is described in Section III followed by results and discussions in Section IV. Finally, Section V concludes the paper.

## II System Model and Problem Formulation

We present our software-defined system model, followed by the node distributions, communication model and dynamic user-centric cell formation problem in this section.

### II-A Software-Defined System Model

We consider highly mobile autonomous VUs moving on the road. Let us denote the VU set by , where . Various low-powered APs are deployed in the surrounding geographic region. Let the set of APs be denoted by , where . The VUs move on the road and each VU establishes its connection with the network via these APs. We assume that all of the APs are physically connected to edge servers. Note that, in reality, the number of such connections will be confined to a small group due to the geographic locations of these edge nodes. Furthermore, each of the edge servers is controlled, programmed and operated by its respective AN. Let us denote AN by . In other words, there exists numbers of edge servers each of which can be expressed by the identical notation of its AN. Each of the edge servers has a fixed and limited radio spectrum assigned by the cloud server. Let us denote the available radio resources of the edge server by hertz. Considering an open-loop communication system, we assume that the channel state information (CSI) is perfectly known at each of the ANs. The ANs can form and schedule the beamforming weights of the APs for each of the virtual cells. Therefore, our system model is based on SoftAir [lin2018e2e], where the ANs control, create and assign resource slices to the APs based on users’ demands and thus enhance the overall spectral efficiency.

In order to ensure reliability and ubiquitous connectivity, we assume a VU can associate with multiple APs that are in its communication region. That is, in contrast to the conventional network-centric approach, we consider user-centric communication by forming virtual cells for the scheduled users, associating each of them with multiple APs. Some patterns of forming virtual cells are shown by the dotted ellipses in Fig. 1. Without any loss of generality, let us denote a user and an AP by and , respectively. Since we are considering a user-centric approach, let us denote the set of APs that a VU may associate with in time slot by . Furthermore, we denote the VU-AP association by the following indicator function:

(1) |

Moreover, an AP might be connected to multiple VUs at the same time. Thus, let us denote the set of VUs connected to an AP at time slot by the set . We may also represent this by the following indicator function:

(2) |

Therefore, denotes the set of APs that VU is connected to whereas, denotes the set of VUs connected to the AP in a given time slot.
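The relation between the two association sets can be illustrated with a minimal Python sketch. The dictionary layout, the string identifiers, and the function names `invert_associations` and `is_associated` are hypothetical and chosen only for illustration; they are not the paper's notation.

```python
from collections import defaultdict


def invert_associations(aps_of_vu):
    """Build the per-AP VU sets from the per-VU AP sets (one time slot)."""
    vus_of_ap = defaultdict(set)
    for vu, aps in aps_of_vu.items():
        for ap in aps:
            vus_of_ap[ap].add(vu)
    return dict(vus_of_ap)


def is_associated(aps_of_vu, vu, ap):
    """Indicator: 1 if the AP belongs to the VU's virtual cell, else 0."""
    return 1 if ap in aps_of_vu.get(vu, set()) else 0


# Hypothetical virtual cells: each VU is served by two APs.
aps_of_vu = {"vu0": {"ap0", "ap1"}, "vu1": {"ap1", "ap2"}}
vus_of_ap = invert_associations(aps_of_vu)
```

Knowing either family of sets is enough to recover the other, which is why the two indicator functions carry the same information.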

### II-B Road Model and Node Distributions

We consider a straight three-lane one-way road structure without any intersection as our region of interest (ROI). Specifically, we consider the freeway case of 3GPP [3gpp36_885]. However, our modeling can be extended to more complex, practical road models. We are interested in establishing a communication framework for the V2X platform where the vehicle models are independent of road structure. The lane can be denoted by , where in our case. We deploy VUs and APs following uniform distributions while maintaining a safety distance between two VUs. The method of updating mobility is described in Algorithm 1.

### II-C SD V2I Communication Model

We assume that the VUs are equipped with a single antenna, whereas each of the low-powered APs is equipped with a number of antennas. Unless mentioned otherwise, we assume an omnidirectional antenna for both VUs and APs. Furthermore, with a slight abuse of notation, we represent the set of APs in by the set , where represents cardinality of set . Moreover, i.e., the total number of APs assigned to the virtual cell of VU has to be less than or equal to the total number of available APs in the network for all time slots .

We model the wireless channel, between AP and VU, as an independent quasi-static flat fading model during a basic time block. Let us denote the channel response at a VU from the AP by , where denotes the channel between and the antenna of AP at time . Then, the channel response at the VU from all APs is represented as follows:

(3) |

where , and are large-scale fading, log-Normal shadowing and fast fading channel vectors, respectively. Note that channels are modeled following appropriate measures listed in [3gpp36_885]. Moreover, denotes the total number of antennas in all APs.

We assume linear downlink beamforming in our SD V2I platform. Let us denote the beamforming vector for VU at AP by in time . Then, the beamforming vector to VU can be denoted by . Furthermore, the entire network beamforming design can be denoted by . Now, let us also denote the unit-power intended signal for VU by . Therefore, we have to satisfy all the time. Furthermore, applying the beamforming vector, the transmitted signal of the AP is denoted as . The received signal at the VU is then calculated as follows:

(4a) | |||

(4b) |

where denotes the received noise, which is zero-mean circularly symmetric Gaussian distributed with variance .

### II-D User-Centric Dynamic Cell Formation

With our analysis and vehicular traffic modeling, as presented in Section II-B, we now aim to derive the instantaneous achievable rate at the VU. At first, we calculate the signal-to-interference-plus-noise ratio (SINR) as follows:

(5) |

Considering a time division duplexing (TDD) operated system, we calculate the instantaneous achievable data rate for VU as follows:

(6) |

where represents spectral efficiency loss due to signaling at the APs. Note that if a user is scheduled during the transmission time slot, the beamforming vector is nonzero, and thus the rate is nonzero as well. On the other hand, if a user is not scheduled, the beamforming vector for that user should be zero, leading to a zero achievable rate. Furthermore, as multiple APs are serving each of the scheduled users, the backhaul resource consumption by each of those users should also be carefully calculated. As such, next, we calculate the backhaul resource consumption [6831362] as follows:

(7) |

where represents the total number of nonzero elements in a vector and is commonly referred to as the $\ell_0$-norm.
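The SINR, rate, and nonzero-count computations described above can be sketched in plain Python. This is a hedged illustration rather than the paper's exact expressions: the pre-log factor `(1 - overhead)` is an assumed way of modeling the signaling loss, and the names `inner`, `sinr`, `rate`, and `l0_norm` are hypothetical.

```python
import math


def inner(h, w):
    """Hermitian inner product h^H w of two equal-length complex lists."""
    return sum(hc.conjugate() * wc for hc, wc in zip(h, w))


def sinr(h_u, beams, u, noise_var):
    """SINR at VU u: desired power over interference plus noise."""
    signal = abs(inner(h_u, beams[u])) ** 2
    interference = sum(
        abs(inner(h_u, w)) ** 2 for v, w in enumerate(beams) if v != u
    )
    return signal / (interference + noise_var)


def rate(sinr_value, overhead):
    """Achievable rate in bits/s/Hz; the (1 - overhead) factor is assumed."""
    return (1.0 - overhead) * math.log2(1.0 + sinr_value)


def l0_norm(vec, tol=1e-12):
    """Number of nonzero entries of a vector."""
    return sum(1 for x in vec if abs(x) > tol)


# Two VUs, two antennas in total; each beam points at one antenna.
h0 = [1 + 0j, 0j]
beams = [[1 + 0j, 0j], [0j, 1 + 0j]]
sinr0 = sinr(h0, beams, 0, noise_var=1.0)
```

An unscheduled user has an all-zero beamforming vector, so its desired signal power, and therefore its rate, is zero, matching the remark above.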

Now, due to resource constraints, we may not serve all users at the same time. Therefore, there will be certain restrictions on the number of active users in the VU set . However, recall that our formulated rate calculation in equation (6) can also be used to design the user scheduling. We intend to serve all active users in in a transmission time slot by forming virtual cells for each of the users and dynamically selecting the transmission power of the APs. Our objective is to find the optimal user-centric cell formation and beamforming weights for the APs. The question that we try to answer is: what are the optimal user associations and AP transmission powers that maximize the throughput in our SD-controlled highly mobile vehicular network? A naive approach would be to serve a user from as many APs as possible with the maximum transmission powers of the APs. However, transmitting to a particular VU from the APs with maximum transmission power will severely degrade the SINR of the other active VUs. Therefore, it is essential to know the optimal transmission power of the APs for each of the active VUs.

As such, we present our joint optimization problem as follows:

Find: | ||||

Maximize | (8a) | |||

Subject to | (8b) | |||

(8c) | ||||

(8d) | ||||

(8e) |

where , and are the weights of the data rate of user at time , minimum SINR requirement for reliable communication and maximum allowable transmit power of AP , respectively. Note that constraint (8b) ensures that our user-centric virtual cell contains more than one AP to form the cluster. Constraint (8c) ensures the obtained SINR is greater than a minimum threshold. Moreover, constraint (8d) limits the total transmit power of AP to be at most .

Due to the $\ell_0$-norm, the formulated problem is not suitable to be solved using a gradient-based algorithm. Furthermore, due to the combinatorial nature of the problem, it is extremely hard to solve within a short period. Note that if we know all s - , we can figure out the s using equations (1-2). Therefore, for each AP, there are possible combinations, only for the VU-AP associations, in a single time slot. Besides, for each of these associations, the AP has to select the power level for each of the active users. Moreover, in a centrally controlled environment, the centralized agent needs to make a central decision for all such AP-VU association and power level selection pairs. As such, as the number of APs and VUs increases, the complexity increases exponentially. Therefore, we employ a machine learning approach to solve the problem in what follows.
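To make the exponential growth concrete, the following sketch counts candidate decisions under simplifying assumptions that are not taken from the paper: each AP may serve any subset of the VUs, each served VU gets one of a fixed number of discrete power levels, and the constraints of problem (8) are ignored. The function names are hypothetical.

```python
def association_count_per_ap(num_vus):
    """Number of VU subsets a single AP could serve in one time slot."""
    return 2 ** num_vus


def joint_action_count(num_aps, num_vus, num_power_levels):
    """Joint association-and-power choices for a centralized agent.

    Each (AP, VU) pair is either not served or served at one of the
    discrete power levels, giving (levels + 1) options per pair.
    """
    return (num_power_levels + 1) ** (num_aps * num_vus)


# Hypothetical sizes, for illustration only.
per_ap = association_count_per_ap(4)
joint = joint_action_count(num_aps=2, num_vus=2, num_power_levels=3)
```

Even these small hypothetical sizes give hundreds of joint choices, and each additional AP-VU pair multiplies the count again, which is the motivation for a learning-based search.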

## III Reinforcement Learning-Based Vehicular Edge Slicing Mechanism

In this section, we discuss the problem solution from the RL perspective. We first state the state, action, and reward of the RL agent. Then we present our learning algorithms.

### III-A State

The state-space contains all channel state information . It also contains the geographic locations of the VUs and APs. Let us denote the positions of the VUs and APs at state by and , respectively. Therefore, we denote our state space as follows:

(9) |

### III-B Action

The action space contains the VU-AP association indicator functions followed by the beamforming vectors for the selected VU-AP associations.

(10) |

Notice that, in equation (10), the RL agent needs to take two action sets, i.e., VU-AP associations and beamforming weights. However, both of these actions cannot be performed at the same time. Simply put, for deciding the optimal beamforming weights, the agent needs to know the VU-AP associations. Therefore, the action set can be thought of as a two-step process. At first, the RL agent needs to decide the VU-AP associations. Based on that decision, it can then allocate each AP’s transmission power for all of the VUs that are being served by the respective AP. Therefore, we divide the action space into two stages as described below.

Considering an open-loop system with one-shot transmission, we serve a scheduled VU from multiple APs to ensure reliability and increase the user data rate. If a user is scheduled to be served, we then design the beamforming vectors of the APs for that particular user. Now, note that we assume perfect CSI is known at the AN. As such, we model the beamforming weights using the following equation:

(11) |

where is the wireless channel information from AP to VU and is the allocated transmission power of AP to transmit to VU . Therefore, we rewrite the action space as:

(12) |

Now, instead of a continuous power level, we divide each AP’s transmission power into multiple discrete levels. Particularly, we divide into levels (e.g., dBm). In other words, each AP can select its transmission power to serve a user from one of these power levels. Therefore, our objective function is still the same as presented in equation (8). We formulate the beamforming vectors using the optimal power selection and equation (11).
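As a hedged illustration of building a beamforming vector from a known channel and a discrete power level, the sketch below uses a matched-filter (maximum ratio transmission) form, i.e., the channel direction scaled by the square root of the allocated power. This form is an assumption for illustration and may differ from equation (11); the power levels listed are hypothetical.

```python
import math


def dbm_to_watt(p_dbm):
    """Convert a power level from dBm to watts."""
    return 10 ** ((p_dbm - 30) / 10)


def matched_filter_beam(h, p_watt):
    """Beam along the channel direction with total power p_watt.

    Assumed form: w = sqrt(p) * h / ||h||. Returns all zeros when the
    user is not served (zero power) or the channel is zero.
    """
    norm = math.sqrt(sum(abs(x) ** 2 for x in h))
    if norm == 0 or p_watt == 0:
        return [0j] * len(h)
    scale = math.sqrt(p_watt) / norm
    return [scale * x for x in h]


# Hypothetical discrete levels in dBm; the agent picks one per served VU.
power_levels_dbm = [0, 10, 20, 30]
beam = matched_filter_beam([3 + 0j, 4j], dbm_to_watt(power_levels_dbm[3]))
```

With this form the agent's power choice alone determines the beam, which is what lets the action space be written in terms of associations and discrete power levels.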

### III-C Reward

Without any loss of generality, we define the weighted sum rate as the reward function for our learning algorithm. We also require that each of the users attains at least the minimum SINR threshold for a chosen action; otherwise, we return a zero reward. Accordingly, the reward function is given as follows:

(13) |
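The reward rule described above can be sketched as follows. The argument names are hypothetical, and the SINR values are compared against the threshold on a linear scale.

```python
def reward(rates, weights, sinrs, sinr_min):
    """Weighted sum rate if every user meets the SINR threshold, else 0."""
    if any(s < sinr_min for s in sinrs):
        return 0.0
    return sum(w * r for w, r in zip(weights, rates))


# Two users with equal weights (illustrative numbers only).
ok = reward([2.0, 1.0], [0.5, 0.5], sinrs=[3.0, 1.5], sinr_min=1.0)
violated = reward([2.0, 1.0], [0.5, 0.5], sinrs=[3.0, 0.5], sinr_min=1.0)
```

Returning zero whenever any single user falls below the threshold penalizes actions that favor a few users at the expense of others.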

### III-D Single Agent Reinforcement Learning (SARL)

Q-learning is a model-free RL framework [watkins1992q] which takes the state and action into account and can solve hard optimization problems such as equation (8) efficiently. In each state the agent takes an action from which it gets a reward and the environment transitions to the next state . The governing equation of Q-learning is shown in the following:

(14) |

where and are learning rate and discount factor, respectively. Our learning algorithm for SARL is presented in Algorithm 2.
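For reference, a minimal tabular sketch of the standard Q-learning update and an epsilon-greedy action choice is given below. It uses the textbook form of the update, which is assumed to correspond to equation (14); the list-of-lists Q-table layout and the function names are hypothetical.

```python
import random


def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Standard update: Q(s,a) += alpha * (r + gamma * max Q(s',.) - Q(s,a))."""
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]


def epsilon_greedy(q, state, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q[state]))
    row = q[state]
    return max(range(len(row)), key=row.__getitem__)


# Tiny table: 2 states x 2 actions.
q_table = [[0.0, 0.0], [1.0, 2.0]]
new_value = q_update(q_table, state=0, action=1, reward=1.0,
                     next_state=1, alpha=0.5, gamma=0.9)
```

Here `alpha` plays the role of the learning rate and `gamma` the discount factor mentioned above.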

### III-E Multi-agent Reinforcement Learning (MARL)

In MARL, multiple agents take their individual actions and receive an overall centralized reward for their joint decisions. Liu et al. have recently presented a MARL framework in [liu2019trajectory] in which each agent has its own Q-table and can take actions independently. Particularly, following the $\epsilon$-greedy policy, if the value of the random variable is greater than $\epsilon$, the authors choose the action of the th agent using the following equation:

(15) |

where represents the cooperating agents. The update rule for the th agent follows the following equation [liu2019trajectory]:

(16) |

### III-F Distributed SARL with Multi-agent Learning (SARL-MARL Collaboration)

While SARL is the baseline RL scheme, it may struggle to attain the best performance if the action and state spaces are too large. In such a platform, MARL can be used to shrink the size of the action space. However, whether MARL will achieve the optimal performance is still questionable, as each agent takes its decision independently. Although Liu et al. [liu2019trajectory] have considered collaboration among the agents, we validate through simulation results that this scheme fails to attain the best performance in our specific problem. As an alternative, we incorporate the concept of distributed learning. Since a single-agent RL framework has to evaluate all possible actions, we divide the action space among multiple agents. We use an individual Q-table for each agent and keep track of the global best performance. Note that the number of possible states for all agents is still the same as in SARL; we are only evaluating the performance from segmented action spaces.

Let us denote the number of agents by . Then, the dimension of the Q-table for each of the agents is , where and represent the size of the state space and action space, respectively. Note that we selected such a way that . In each state , each of these agents takes its action following the same methodology as the baseline Q-learning. Let us denote a centralized vector that stores the global best action in each of the states. In each time step, we update the using the following equation:

(17) |

Note that each of the agents follows and updates its action and Q-table according to the baseline SARL. The detailed procedure of our distributed SARL with multi-agent learning is presented in Algorithm 3.
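The bookkeeping behind the segmented action space can be sketched as below. This is an illustrative reading of the description, not Algorithm 3 itself: it assumes the number of actions is divisible by the number of agents, and it assumes the global best action per state is the one with the highest observed reward, which may differ from equation (17). All names are hypothetical.

```python
def split_actions(num_actions, num_agents):
    """Partition global action indices into equal contiguous slices."""
    if num_actions % num_agents != 0:
        raise ValueError("assumed: num_actions divisible by num_agents")
    size = num_actions // num_agents
    return [range(i * size, (i + 1) * size) for i in range(num_agents)]


def make_q_tables(num_states, num_actions, num_agents):
    """One table per agent, each num_states x (num_actions / num_agents)."""
    size = num_actions // num_agents
    return [[[0.0] * size for _ in range(num_states)]
            for _ in range(num_agents)]


def update_global_best(global_best, state, agent, local_action, reward, slices):
    """Keep, per state, the global action with the highest reward seen."""
    action = slices[agent][local_action]
    if state not in global_best or reward > global_best[state][1]:
        global_best[state] = (action, reward)
    return global_best


slices = split_actions(num_actions=6, num_agents=3)
tables = make_q_tables(num_states=4, num_actions=6, num_agents=3)
best = {}
update_global_best(best, 0, agent=1, local_action=1, reward=2.5, slices=slices)
update_global_best(best, 0, agent=0, local_action=0, reward=1.0, slices=slices)
```

Each agent would then run the baseline Q-learning update on its own table, while the shared record of best actions is consulted at test time.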

## IV Performance Evaluation

We consider meters of a three-lane one-way freeway as our ROI. Each vehicle’s new position is generated by adding a linear displacement with a VU velocity of km/h as per [3gpp36_885]. The channel model and related parameters are also chosen as specified in [3gpp36_885]. In order to keep a tractable state space, we consider updating it after every milliseconds. Furthermore, for ease of simulation, we consider a full buffer network model in which each of the APs serves all VUs simultaneously. Note that our proposed solution can work with other scheduling algorithms as well. As we are not adopting any actual scheduling schemes, we consider full buffer scenarios with . We consider AN, APs and users. While the VUs are dropped uniformly in each lane, the APs are placed meters apart at fixed locations. We have assumed antennas per AP. For a tractable state space, we have considered that, at a given time step, all VUs are in the same locations - while they have different locations. We also have dB and .

For the SARL, MARL and SARL-MARL collaborative algorithms, we have considered . Besides, the values of both and are decayed linearly from to in each episode. Also, we have considered APs as independent agents for the MARL algorithm of Liu et al. [liu2019trajectory] and we have used for our SARL-MARL distributed learning algorithm. For both SARL and MARL cases, we have trained our models for episodes, while we only train our model for episodes in the SARL-MARL distributed learning case. In order to validate the results of each of these algorithms, we have compared the results with the genie-aided optimal solution. We have also used two baseline schemes, namely, (1) random power: each AP randomly chooses its transmitting power from and (2) equal power: each AP transmits to all VUs using dBm power.
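A linear decay schedule of the kind mentioned above can be sketched as follows. The start and end values and the number of steps are hypothetical placeholders, since the specific values are not legible in this text.

```python
def linear_decay(start, end, step, num_steps):
    """Linearly interpolate from start (step 0) to end (last step)."""
    if num_steps <= 1:
        return end
    frac = min(step, num_steps - 1) / (num_steps - 1)
    return start + (end - start) * frac


# Hypothetical schedule: decay from 1.0 to 0.0 over 5 steps.
schedule = [linear_decay(1.0, 0.0, t, 5) for t in range(5)]
```

The same helper could be applied to both decayed quantities, with one call per time step inside an episode.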

We compare the average rewards of all of these schemes after running test episodes in Fig. (a). Both the SARL and the SARL-MARL distributed learning algorithms attain an average per-VU reward of bits/sec/Hz, which is identical to the genie-aided optimal per-VU reward () bits/sec/Hz. In the meantime, MARL gets a per-VU reward of bits/sec/Hz. Recall that we have trained both SARL and MARL for episodes, whereas the proposed SARL-MARL distributed learning algorithm has been trained for only episodes. The performance of the equal and random power allocation schemes is, as expected, worse than that of the RL schemes. With regard to user fairness, our RL schemes and reward function always ensure that each of the users receives a fair data rate. Since we return zero rewards in the case that any user’s SINR is below the threshold level, we expect that our RL agents learn this pattern quickly. This is also verified by our simulation results in Fig. (b).

We further consider the impact of the reliability threshold. Note that, in our optimization problem, we have set the reliability constraint that each of the users has to attain a minimum SINR threshold to ensure reliability. This, therefore, guarantees that regardless of the environment, our RL agent will learn to allocate resources in such a way that each of the VUs receives at least SINR. Furthermore, Fig. 3 indicates that as this threshold increases, the network experiences difficulties in meeting this demand for all of the users. Both Figs. (a)-(b) confirm that our proposed distributed SARL multi-agent learning (SARL-MARL) algorithm reaches near-optimal results compared to Liu et al.’s MARL and other baseline schemes. The performance gap between the optimal and the proposed SARL-MARL results is very small. Besides, the gap between our proposed solution and the other baseline schemes becomes more evident as the reliability threshold increases.

Finally, we show the impact of the coverage radius of the APs in Fig. 4. Note that a VU can only be served if it is in the coverage region of the AP. Therefore, as the coverage radius of the AP increases, more VUs can be served by that AP. As such, the expected sum rate of the users should increase if the power levels of the APs are chosen appropriately as the coverage radius increases. This is also reflected and validated in our result presented in Fig. 4.

## V Conclusion

In this paper, we have presented an efficient way to dynamically allocate the transmission power of the APs and form virtual cells for the VUs in user-centric vehicular networks. Using well-established machine learning algorithms, we have demonstrated that the original hard combinatorial problem can be solved efficiently. Furthermore, as the numbers of possible states and actions increase, the traditional SARL struggles to behave optimally due to the curse of dimensionality. While MARL might be an option, we have shown through simulation that MARL does not attain optimal performance in our problem. As such, we have used a SARL-MARL based distributed learning approach that achieved near-optimal performance within a nominal number of training episodes.
