Eco-Vehicular Edge Networks for Connected Transportation: A Decentralized Multi-Agent Reinforcement Learning Approach

03/02/2020 ∙ by Md Ferdous Pervej et al. ∙ NC State University

This paper introduces an energy-efficient, software-defined vehicular edge network for the growing intelligent connected transportation system. A joint user-centric virtual cell formation and resource allocation problem is investigated to bring eco-solutions to the edge. This joint problem aims to rein in the power-hungry edge nodes while maintaining assured reliability and data rate. More specifically, by prioritizing the downlink communication of dynamic eco-routing, highly mobile autonomous vehicles are served by multiple low-powered access points simultaneously, ensuring ubiquitous connectivity and guaranteed network reliability. The formulated optimization is hard to solve in polynomial time due to its combinatorial structure. Hence, a decentralized multi-agent reinforcement learning (D-MARL) algorithm is proposed for eco-vehicular edges. First, the algorithm segments the centralized action space into multiple smaller groups. Based on a model-free decentralized Q-learner, each edge agent then takes its actions from its respective group. In each learning state, a software-defined controller chooses the global best action from the individual bests of all distributed agents. Numerical results validate that our learning solution outperforms existing baseline schemes and achieves near-optimal performance.







I Introduction

User-centric communication has drawn significant attention lately. To ensure a better quality of service (QoS) and appease data-hungry users, more networking components are being shifted toward the network edge. Vehicular networking, meanwhile, has been evolving from its rudimentary phase into an intelligent transportation system (ITS) that can guarantee public safety, lessen congestion, reduce travel time, and improve the QoS of vehicle users (VUs). An advanced ITS can undoubtedly save countless lives by assuring ubiquitous connectivity and well-measured, timely road hazard alerts, thus increasing the quality of experience of the VUs. Motivated by this, several governing bodies, such as the United States Department of Transportation [USDOT, USDOT2], are investing heavily in vehicle-to-everything (V2X) communication.

Novel technologies such as dedicated short-range communications (DSRC) and cellular-V2X (C-V2X) are expected to be coupled together [ghafoor2019enabling] to enable intelligent solutions in this sector. Note that while DSRC is an earlier technology based on IEEE 802.11, C-V2X was developed by the 3rd Generation Partnership Project (3GPP) and first introduced in its Release 14 for basic safety messages in vehicular communication [zhou2020evolutionary]. Later 3GPP releases focus on a more evolved system design with advanced safety measures in addition to higher throughput, higher reliability, and much lower latency. VUs move rapidly on the highway, causing frequent handovers in vehicle-to-infrastructure (V2I) communication. In a very short period, the received signal strength at a downlink VU can deteriorate severely under a traditional network-centric communication infrastructure. Therefore, V2I communication is notably problematic for connected transportation. A potential solution to these problems should ensure universal connectivity, reliability, high throughput, and low latency. In addition, energy efficiency (EE) should also be considered, as the urgency of green communication has been demonstrated repeatedly of late [jain2019energy, huang2020energy].

In the literature, several works address diverse aspects of vehicular networks from a traditional network-centric perspective. A downlink multicasting scenario for close-proximity vehicles was considered in [sahin2018virtual], where a group of vehicles acting as hot spots was used to simultaneously transmit multiple data streams. A vehicle-to-vehicle (V2V) radio resource management (RRM) method was proposed by Ye et al. in [ye2019deep]. The authors used deep reinforcement learning (DRL) to scrutinize how uplink frequency resources can be utilized effectively for V2V communication. A similar approach was also considered by Liang et al. in [8792382] for V2V applications using multi-agent RL (MARL). Ding et al. considered RRM for vehicular networks in [8756652]. Although the authors considered virtual cell member association and RRM in their problem formulation, they did not consider user mobility. Note that we consider highly mobile VUs for connected transportation, where each parameter needs to be chosen optimally in each transmission time interval (TTI). Therefore, our work is fundamentally different from the single-snapshot, static user-centric approach in [8756652].

Gao et al. proposed a joint admission control and resource management scheme for both static and vehicular users in [gao2019joint]. Using the Lyapunov optimization technique, the authors showed how to increase network throughput over the traditional network-centric approach. Guleng et al. considered a Q-learning based solution for V2V communication in [guleng2020edge]. They used a two-stage learning process to minimize overall latency and maximize the network throughput. In our previous work [Pervej_WSR], we considered a throughput-optimal vehicular edge network for highway transportation, in which we achieved a maximum weighted sum rate using RL. While the studies in [ye2019deep, 8792382, 8756652, gao2019joint, guleng2020edge, Pervej_WSR] aimed at optimizing network throughput, they did not address energy consumption in vehicular edge networks for smart and connected transportation.

Different from the existing studies, this paper focuses on uncovering an energy-efficient solution for user-centric, reliable vehicular edge networks in connected transportation. Particularly, we formulate a joint virtual cell formation and power allocation problem for highly mobile VUs in a sophisticated software-defined (SD) environment. In a freeway road environment, we deploy edge servers and serve each VU from multiple low-powered access points (APs) to meet the users' demands, as presented in Fig. 1. Although such a distributed deployment reduces end-to-end latency and increases the reliability of the network, the system complexity also increases. Furthermore, as multiple APs serve each VU, it is essential to optimally form the virtual cells of the users and allocate the transmission powers of the APs. While our joint formulation addresses these requirements, it is a hard combinatorial optimization problem. Therefore, we use a model-free decentralized MARL (D-MARL) solution that can effectively form the virtual cells and slice the resources.

To the best of our knowledge, this is the first work to consider a reliable, energy-efficient, user-centric, software-defined vehicular edge network for connected transportation. The rest of the paper is organized as follows: our software-defined network and problem formulation are presented in Section II. An efficient RL solution for resource slicing is presented in Section III. Section IV presents the results and findings. Finally, Section V concludes the paper.

Fig. 1: Energy-efficient user-centric vehicular edge network.

II Software-Defined Vehicular Edge Networks

We present our software-defined vehicular edge network model, followed by the problem formulation, in this section.

II-A Software-Defined System Model

Following the freeway case of 3GPP [3gpp36_885], a three-lane, one-way road structure is considered as the region of interest (ROI). In this paper, we are interested in establishing a communication framework for vehicular edge networks; however, our modeling can readily be extended to a more practical environment. Highly mobile autonomous VUs move on the road. Besides, several low-powered APs are deployed along the roadside to maintain ubiquitous connectivity. In addition, multiple edge servers, each controlled by its anchor node (AN), are deployed at fixed, known geographic positions. The APs are physically mesh-connected to each of these edge servers. Furthermore, the edge servers are connected to a centralized cloud server and have a limited radio resource budget (in hertz). We consider an open-loop communication system in which the ANs have perfect channel state information (CSI). Moreover, our software-defined system model is based on [lin2018e2e], where the beamforming weights can be formed and scheduled by the ANs based on the users' requirements.

Creating a virtual cell for each scheduled user, we aim to guarantee the reliability of the network. In each virtual cell, multiple APs are activated to serve the VU, as shown by the dotted ellipses in Fig. 1. The VU-AP associations are denoted by binary indicator functions: an indicator equals 1 if an AP is scheduled to serve a VU and 0 otherwise. Therefore, the set of APs that a VU is connected to constitutes the virtual cell of that VU.
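The indicator-based associations above can be sketched in a few lines. This is an illustrative toy (the names x and virtual_cell are ours, not the paper's): each row of a binary matrix marks which APs are scheduled for a VU, and the virtual cell is simply the set of nonzero columns.

```python
def virtual_cell(x, u):
    """Return the set of AP indices serving VU u, given binary indicators x."""
    return {b for b, connected in enumerate(x[u]) if connected}

# Toy example: 2 VUs, 4 APs; row u marks the APs scheduled for VU u.
x = [[1, 0, 1, 0],
     [0, 1, 1, 1]]
cells = [virtual_cell(x, u) for u in range(len(x))]
# VU 0 is served by APs 0 and 2; VU 1 by APs 1, 2, and 3.
```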

II-B SD-V2I Communication Model

In this paper, we consider a multiple-input-single-output (MISO) communication model in which each VU is equipped with a single antenna and each AP is equipped with multiple antennas. (While we consider omnidirectional antennas in this paper, the proposed framework can easily be extended with directional antennas and beam patterns to further improve the SINR at the vehicle receivers.) The wireless channel is considered to be quasi-static flat fading during a basic time block. The channel between a VU and an AP is composed of large-scale fading, log-Normal shadowing, and fast-fading components. Each AP precodes the unit-powered data symbol of its scheduled VU with a beamforming vector before transmission. As such, at time t, the downlink received signal at a VU is the superposition of the intended beamformed signal, the interfering transmissions meant for other VUs, and the received noise, which is circularly symmetric complex Gaussian distributed with zero mean.
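The resulting downlink model can be sketched numerically. In the toy below, the per-VU SINR is the coherently combined signal power from the APs, divided by the interference from beams intended for other VUs plus the noise power; the array shapes and names are our illustrative assumptions, not the paper's notation.

```python
import numpy as np

def sinr(H, W, u, noise_power):
    """H[b][v]: channel vector AP b -> VU v; W[b][v]: beamformer of AP b for VU v."""
    B, U = len(H), len(H[0])

    def gain(target_vu, beam_vu):
        # Coherent sum over APs of h^H w, then received power.
        s = sum(np.vdot(H[b][target_vu], W[b][beam_vu]) for b in range(B))
        return abs(s) ** 2

    signal = gain(u, u)
    interference = sum(gain(u, v) for v in range(U) if v != u)
    return signal / (interference + noise_power)

# Toy setup: 1 AP with 2 antennas serving 2 VUs over orthogonal channels,
# so VU 0 sees no inter-user interference.
H = [[np.array([1.0, 0.0]), np.array([0.0, 1.0])]]
W = [[np.array([1.0, 0.0]), np.array([0.0, 1.0])]]
g0 = sinr(H, W, u=0, noise_power=0.1)
```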


II-C User-Centric Dynamic Cell Formation

We consider that the vehicular edge network operates in time division duplex mode. Thus, the achievable rate for a VU at time t follows the Shannon formula, discounted by the spectral efficiency loss due to signaling at the APs and driven by the received SINR. Moreover, as multiple APs are scheduled to transmit to a VU, the backhaul link consumption of the VU is calculated as follows [6831362]:


where the backhaul consumption counts the total number of nonzero elements in the stacked precoding vector, commonly known as the ℓ0-norm. If a user is scheduled in a transmission time slot, the precoding vector of at least one AP for that VU is nonzero, leading to a nonzero achievable data rate.
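The backhaul accounting above can be sketched as follows, using one common instantiation from the cited framework [6831362]: a VU consumes one backhaul link per AP whose precoder for it is nonzero. The names and the numerical tolerance are illustrative.

```python
import numpy as np

def backhaul_links(precoders, tol=1e-12):
    """precoders: list of per-AP beamforming vectors for a single VU.
    Counts APs with a nonzero precoder (an l0-style count per AP)."""
    return sum(1 for w in precoders if np.linalg.norm(w) > tol)

# Toy example: AP 1's precoder for this VU is all-zero, so only 2 links.
w_u = [np.array([0.3, 0.1]), np.zeros(2), np.array([0.0, 0.5])]
links = backhaul_links(w_u)
```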

Note that we presume to serve all active users in a transmission time slot by forming a virtual cell for each user and dynamically selecting the transmission powers of the APs. As such, our objective function seeks the optimal user-centric cell formation and beamforming weights of the APs. The first question that we try to answer is: what is the maximum throughput in our SD-controlled, highly mobile vehicular network? A naive approach would be to serve a user from as many APs as possible, each using its maximum transmission power. However, this degrades both user fairness and EE. Therefore, it is essential to balance the user data rate against EE. To avoid cross-domain nomenclature, let us define what we refer to as the EE: the ratio of the total user sum rate to the total power consumption of the network. At a given time slot, we calculate the EE as follows:


where the user sum rate is calculated in Equation (5).
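The EE metric can be sketched directly from its definition: the user sum rate divided by the total power spent in the slot. The rate and power values below are fabricated for illustration only, and the optional circuit-power term is our assumption.

```python
def energy_efficiency(rates_bps_per_hz, tx_powers_w, circuit_power_w=0.0):
    """EE = total user sum rate / total power consumption (bits/Hz/J)."""
    return sum(rates_bps_per_hz) / (sum(tx_powers_w) + circuit_power_w)

rates = [3.2, 1.8, 2.5]          # bits/s/Hz per scheduled VU (illustrative)
powers = [0.5, 0.25, 0.25, 0.5]  # watts per active AP (illustrative)
ee = energy_efficiency(rates, powers)  # 7.5 / 1.5 = 5.0 bits/Hz/J
```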

Therefore, in this paper, we address the following question: what are the user-centric associations and power allocations that guarantee reliability, programmability and EE of the entire network? To this end, we formulate a joint optimization problem as follows:

Maximize (7a)
Subject to (7b)

where the minimum SINR requirement of our reliable communication is enforced by the reliability constraint in Equation (7c). The maximum allowable transmit power of each AP is controlled via Equation (7d). Equation (7b) ensures that each virtual cell contains more than one AP. Moreover, Equation (7e) indicates the feasible solution space.

Note that the ℓ0-norm precludes gradient-based solutions. Besides, the formulated problem is a hard combinatorial problem, which is extremely difficult to solve within a short period. Moreover, for each AP at each time slot, the number of possible VU-AP association combinations alone grows exponentially with the number of VUs, and for each association the AP further needs to choose the optimal power levels for the scheduled users. In this paper, instead of a continuous power level, we divide each AP's transmission power into multiple discrete levels. As our SD-controlled ANs know the perfect CSI, we model the beamforming vector as follows:


where the beamforming vector of an AP for a VU is built from the wireless channel between them and the transmission power the AP allocates to that VU. If a centralized decision has to be taken, the centralized agent needs to decide all of the AP-VU associations and their power-level selections jointly, and the size of this action space grows exponentially with the numbers of APs, VUs, and discrete power levels. Thus, traditional optimization methods may take an enormous amount of time to solve such an intricate problem. As such, in the next section we use a model-free Q-learning approach to solve the optimization problem efficiently.
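One natural instantiation of such a CSI-based beamformer, sketched below, is maximum ratio transmission: the channel direction scaled by the square root of the allocated discrete power. The paper's exact closed form may differ, and the action-space counting is likewise one illustrative convention (each AP independently picks one of L levels, including an "off" level, per VU).

```python
import numpy as np

def mrt_beamformer(h, p_alloc):
    """Channel direction scaled so that ||w||^2 equals the allocated power."""
    return np.sqrt(p_alloc) * h / np.linalg.norm(h)

def central_action_space(num_aps, num_vus, num_levels):
    """Illustrative counting: each AP picks one of num_levels powers
    (including 'off') independently for every VU."""
    return num_levels ** (num_aps * num_vus)

w = mrt_beamformer(np.array([3.0, 4.0]), p_alloc=2.0)
power = float(np.dot(w, w))                              # recovers 2.0
size = central_action_space(num_aps=2, num_vus=2, num_levels=4)  # 4^4 = 256
```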

III Energy-Efficient Resource Slicing at Edges: A Reinforcement Learning Approach

As we assume the CSI is known, our state space contains all CSIs, the locations of the VUs, and the locations of the APs. The action space, on the other hand, contains the VU-AP associations and the beamforming vectors for the chosen associations. Taking an action is thus a two-step process: first, the RL agent chooses a possible association; then it designs the beamforming vectors. Moreover, we consider the EE of Equation (6) as the reward function of the RL agent. However, to ensure fairness among the users' achievable rates, at each time slot we employ the following restriction:


III-A Single Agent Reinforcement Learning (SARL)

Taking states and actions into account, a Q-learning based RL framework can effectively solve hard optimization problems. Note that it is a model-free learning [watkins1992q, lin2016qos] process where, in each state s, the agent takes an action a, receives a reward r for the chosen action, and the environment transitions to the next state s'. The governing equation of Q-learning is the following update:

Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ],     (10)

where α and γ are the learning rate and the discount factor, respectively. Although SARL is a good baseline scheme, if the number of states and actions is too large, it may become impracticable to handle: the baseline centralized SARL action space grows exponentially with the numbers of APs, VUs, and power levels. (Note that this counts the total action space; the number of valid actions is smaller due to the maximum allowable transmission power constraint of the APs.) This is commonly known as the curse of dimensionality. As an alternative, a D-MARL solution is proposed in what follows.
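The update in Equation (10) is the standard tabular Q-learning step and can be sketched as below; the dictionary-based Q-table and the toy state/action encoding are our assumptions, not the paper's implementation.

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular step: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q[(s, a)]

Q = {}
actions = [0, 1]
v = q_update(Q, s=0, a=1, r=1.0, s_next=1, actions=actions)
# First update on an empty table: (1-0.1)*0 + 0.1*(1 + 0.9*0) = 0.1
```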

III-B Decentralized Multi-Agent RL (D-MARL)

In traditional MARL, multiple agents make independent decisions that can lead to optimal network performance. The action space of each agent is very small compared to that of the centralized SARL; for the same example as the centralized SARL case, if we consider each AP as an independent agent in the MARL scheme, the per-agent action space shrinks by orders of magnitude. However, whether MARL will reach the optimal solution on our platform is uncertain. Therefore, while we retain the multi-agent concept, instead of multiple agents taking independent actions from shrunken individual action spaces, we use a distributed learning process in which distributed agents take decisions from segments of the original SARL action space. In other words, the original SARL action space is subdivided into multiple groups, and each agent takes its decisions from its assigned, smaller group.

If there are N such agents, then each agent maintains a Q-table over the full state space and its assigned slice of the action space, so the per-agent action space is roughly 1/N of the centralized one. Furthermore, let us assume there is a centralized vector, maintained at the SD controller, that stores the global best action at every state. We update this global best action using the following rule (Equation (11)): the controller replaces the stored action for a state whenever an agent finds an action with a higher reward in that state.
Therefore, our proposed D-MARL algorithm can distributively learn to take the optimal central action. On the other hand, traditional MARL [liu2019trajectory] may not achieve the optimal solution as independent agents take autonomous actions in a shrunk action space. The joint actions of these agents may not be centrally optimal and lead to a sub-optimal solution. Algorithm 1 summarizes the proposed D-MARL algorithm.
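The segmentation and the controller's global-best selection can be sketched as follows. The contiguous partitioning scheme and the toy reward are illustrative assumptions; any disjoint grouping of the centralized action space would serve the same purpose.

```python
def split_action_space(actions, num_agents):
    """Partition the centralized action list into num_agents contiguous groups."""
    k, m = divmod(len(actions), num_agents)
    groups, start = [], 0
    for i in range(num_agents):
        size = k + (1 if i < m else 0)
        groups.append(actions[start:start + size])
        start += size
    return groups

def global_best(per_agent_best, reward_fn):
    """SD controller: pick the best among the agents' individual best actions."""
    return max(per_agent_best, key=reward_fn)

reward = lambda a: -(a - 6) ** 2          # toy reward, peaked at action 6
actions = list(range(10))
groups = split_action_space(actions, 3)   # [[0,1,2,3], [4,5,6], [7,8,9]]
bests = [max(g, key=reward) for g in groups]  # each agent's local best
best = global_best(bests, reward)             # controller's global best: 6
```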

1: Initialize: the total number of agents N, chosen so that the N action groups partition the centralized action space
2: Generate N random Q-tables, one per agent
3: Generate the global best action vector randomly
4: for each episode do
5:     Initiate the environment and generate the initial state
6:     while the terminal state is not reached do
7:         for each agent do
8:             Observe the environment; choose an action from the agent's group following the ϵ-greedy policy; receive the reward; update the agent's Q-table using Equation (10)
9:             if the reward exceeds that of the stored global best action then
10:                 Update the global best action using Equation (11)
11:             end if
12:         end for
13:     end while
14: end for
Algorithm 1 Decentralized Multi-Agent RL (D-MARL)

IV Performance Evaluation

We consider the ROI length (m), the VU velocity (km/h), the number of antennas per AP, the noise power (dBm/Hz), and the TTI (ms) as simulation parameters, with the channels, path loss, and shadowing modeled following [3gpp36_885]. For ease of simulation, we consider a full-buffer network model where all APs serve all VUs simultaneously. We consider the following association rule:


Note that our proposed solution can work with other scheduling algorithms as well. While the VUs are dropped uniformly in each lane, the APs are placed at fixed locations, equally spaced along the road. For a tractable state space, we consider that, at a given time step, all VUs share the same coordinate along one axis of the road while having different coordinates along the other.
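The association rule can be sketched as a simple distance test: an AP is admitted to a VU's virtual cell only if the VU falls within the AP's coverage radius. The coordinates and the 250 m radius below are illustrative values, not the paper's exact deployment.

```python
import math

def within_coverage(ap_pos, vu_pos, radius_m):
    """True if the VU lies inside the AP's coverage disc."""
    return math.dist(ap_pos, vu_pos) <= radius_m

aps = [(0.0, 0.0), (500.0, 0.0), (1000.0, 0.0)]   # roadside AP positions (m)
vu = (450.0, 10.0)                                 # one VU on the road
cell = {b for b, ap in enumerate(aps) if within_coverage(ap, vu, 250.0)}
# Only AP 1 is within 250 m of this VU.
```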

Scheme Name Training Episodes Test Episodes Average EE [bits/Hz/J] Deviation from Benchmark
Brute Force (Benchmark) N/A (Benchmark)
D-MARL (Proposed)
SARL [lin2016qos]
MARL [liu2019trajectory]
Equal Power N/A
Random Power N/A
TABLE I: Performance Comparisons: , AP Coverage Radius

IV-A Performance Comparisons

At first, we show the effectiveness of our proposed solution. In order to do that, we compare our results with the following schemes:

  • Brute Force (Benchmark): This is the optimal solution. In this case, at each state, we need to search for the optimal action that provides the maximum reward.

  • SARL [lin2016qos]: This is the baseline RL scheme. We adopted the learning process mentioned in [lin2016qos] for this case.

  • MARL [liu2019trajectory]: We have used the novel MARL learning process proposed by Liu et al. in [liu2019trajectory].

  • Equal Power Allocation: In this case, we have assumed that the AP divides its transmission power equally to serve the VUs. Essentially, this is the centralized case. The central power allocation decision is chosen in such a way that each AP transmits to its scheduled users using equal power.

  • Random Power Allocation: We have assumed that the AP chooses random transmission power from the discrete power level to serve a VU. This is also a centralized case. In each state and time slot, we pick a random central decision from all of the possible actions in the centralized action space.

We use each AP as an independent agent in the MARL algorithm [liu2019trajectory]; thus, each AP takes its association and power allocation decisions independently. For our proposed D-MARL algorithm, we use multiple distributed agents as described in Section III-B. The SARL and MARL models are trained for four times as many episodes as the D-MARL model. Besides, both the learning rate and the exploration probability are decayed linearly across the episodes.

Over the test episodes, the performance of our proposed algorithm is compared with the other schemes in Table I; a fixed SINR threshold (dB) and AP coverage radius (m) are used for this comparison. Clearly, the machine learning solutions achieve much higher performance than the two baseline schemes (equal power allocation and random power allocation). Furthermore, notice that the centralized baseline SARL solution and the proposed D-MARL solution deliver nearly identical performance to the brute-force optimum. Thanks to RL, the agents learn to take optimal actions during the training episodes and deliver near-optimal performance. The performance of MARL [liu2019trajectory] is also very close to this optimal solution. However, recall that the total number of training episodes for SARL and MARL is 4 times that of our proposed D-MARL algorithm. Our proposed D-MARL also achieves substantial gains (in dB) over the equal power allocation and random power allocation schemes.

Fig. 2: Probability of success for different SINR thresholds when the AP coverage radius is 250 m
Fig. 3: Average EE for different SINR thresholds when the AP coverage radius is 250 m

Iv-B Impact of the Reliability Constraint

The reliability constraint has a significant impact on the overall network performance. If we increase the reliability constraint, we force the RL agents to find solutions that maximize the EE without violating it; as the constraint tightens, the number of failed events increases. We first calculate the probability of successfully delivering the reliability constraint as follows:


where the success probability is the fraction of all time steps in which every scheduled VU attains the minimum required SINR. This probability is shown in Fig. 2. The RL algorithms perform better than the baseline schemes. Furthermore, as the SINR threshold increases, the number of successful transmission events decays. Our proposed D-MARL delivers a near-optimal success probability across these varying reliability requirements, and the performance gap between MARL [liu2019trajectory] and D-MARL is quite evident from this result. Moreover, increasing the reliability constraint may force the APs to transmit to the VUs with more power to attain the SINR threshold, which immediately degrades the EE; this is also reflected in our simulation results in Fig. 3. As the performance of the two baseline schemes (equal power allocation and random power allocation) is very poor compared to the RL schemes, hereinafter we compare our proposed algorithm only with the brute force (benchmark) and the other two RL schemes.
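The success probability in Equation (13) can be sketched empirically: count the time steps in which every VU meets the threshold. The SINR trace below is fabricated purely for illustration.

```python
def success_probability(sinr_trace, threshold):
    """sinr_trace: per-time-step lists of VU SINRs (linear scale).
    A step succeeds only if every VU meets the threshold."""
    hits = sum(1 for step in sinr_trace if all(g >= threshold for g in step))
    return hits / len(sinr_trace)

trace = [[2.1, 1.5], [0.8, 3.0], [1.2, 1.9], [2.5, 2.2]]
p = success_probability(trace, threshold=1.0)  # 3 of 4 steps succeed -> 0.75
```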

Fig. 4: Probability of success for different AP coverage radius when SINR threshold is dB
Fig. 5: Average EE for different AP coverage radius when SINR threshold is dB

IV-C Impact of the Coverage Radius

Now, we analyze the impact of the AP coverage radius. To do that, we keep the reliability constraint fixed and vary the coverage radius. Note that, as the reliability constraint is fixed, the probability of success in Equation (13) should not fluctuate much as the coverage radius varies; this is reflected in Fig. 4. Besides, as the coverage radius of an AP increases, more VUs can be served by each AP. Although the SINR constraint is fixed, from our association rule in Equation (12) and the rate calculation in Equation (5), it is clear that increasing the coverage radius increases the total number of links for a VU and therefore improves the user sum rate. On the other hand, if a VU is far from an AP, the AP may need to transmit to it with more power. However, the RL agents find power allocations that increase the user sum rate at appropriate power levels, ultimately increasing the EE of Equation (6). This trend is reflected in our simulation results in Fig. 5: as the coverage radius increases, the D-MARL algorithm finds associations and power allocations that improve the EE of the network.

IV-D User Fairness

Furthermore, a reliable and efficient network should ensure fairness while serving its associated users. A fair system delivers a nearly equal data rate to all users, and our reward function is designed to serve this purpose. User fairness is thus also ensured by our proposed D-MARL algorithm. Over the test episodes, user fairness, while conserving the maximized EE, is presented in Fig. 6, which reports Jain's fairness index [jain1999throughput] for our proposed D-MARL alongside the optimal scheme, SARL [lin2016qos], and MARL [liu2019trajectory]. Note that this fairness index is at most 1: an index of 1 means all users achieve equal rates, and fairness among the users increases as the index rises.
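Jain's fairness index [jain1999throughput] used above has the closed form J = (Σ x_i)² / (n Σ x_i²) and can be sketched as follows; the rate vectors are illustrative.

```python
def jains_index(rates):
    """Jain's fairness index: 1 when all rates are equal, smaller otherwise."""
    n = len(rates)
    return sum(rates) ** 2 / (n * sum(r * r for r in rates))

equal = jains_index([2.0, 2.0, 2.0])   # perfectly fair -> 1.0
skewed = jains_index([4.0, 0.0, 0.0])  # one user takes everything -> 1/3
```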

Fig. 6: User fairness: AP coverage radius 250 m, fixed SINR threshold (dB)

V Conclusion

In this paper, we have jointly optimized virtual cell formation and power allocation to assure ubiquitous connectivity and reliability in vehicular edge networks for connected transportation. Leveraging RL's ability to solve complex problems, the hard combinatorial joint optimization problem is solved efficiently; in particular, we have used a D-MARL solution for the eco-vehicular edge network in connected transportation. Our proposed algorithm attains near-optimal benchmark performance within a nominal number of training episodes.