A Reinforcement Learning Approach to Jointly Adapt Vehicular Communications and Planning for Optimized Driving

07/10/2018 ∙ by Mayank K. Pal, et al. ∙ IIIT Delhi 0

Our premise is that autonomous vehicles must optimize communications and motion planning jointly. Specifically, a vehicle must adapt its motion plan staying cognizant of communications rate related constraints and adapt the use of communications while being cognizant of motion planning related restrictions that may be imposed by the on-road environment. To this end, we formulate a reinforcement learning problem wherein an autonomous vehicle jointly chooses (a) a motion planning action that executes on-road and (b) a communications action of querying sensed information from the infrastructure. The goal is to optimize the driving utility of the autonomous vehicle. We apply the Q-learning algorithm to make the vehicle learn the optimal policy, which makes the optimal choice of planning and communications actions at any given time. We demonstrate the ability of the optimal policy to smartly adapt communications and planning actions, while achieving large driving utilities, using simulations.



There are no comments yet.


page 6

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

We consider an on-road environment that consists of autonomous vehicles equipped with sensors and other actors, for example, human driven vehicles, which may not have sensing ability. An autonomous vehicle would like to optimize its driving utility, for example, speed and/or smoothness of drive. To do so, it must choose a suitable motion plan while being cognizant of the other on-road actors.

Further, we envision the prevalence of roadside infrastructure sensors. Information sensed by vehicles or infrastructure may be communicated over a wireless network. In the absence of a network, an autonomous vehicle carries out motion planning given the road region it can perceive using its local sensors. Information from other sensors communicated over a wireless network allows the vehicle to perceive a larger region. Figure 1 provides an illustration. One expects this to allow for better motion planning and improved driving utilities.

Fig. 1: Illustration of the on-road environment. A Vehicle may query the infrastructure for information that it uses to extend its local view. The infrastructure makes its own measurements and may receive measurements made by other vehicles too. The local view and the possible extended view is shown for the vehicle .

However, communications related constraints impose limits on information that can be communicated at any given time. These limits impact the region that may be perceived by the autonomous vehicle and thus impact its motion plan. Conversely, an on-road environment, for example, a high vehicle density, may restrict the feasible motion plans and therefore diminish the requirement for communicating sensed information over the network.

In this work, we capture this interplay for a single autonomous vehicle (the ego vehicle) that uses the view perceived by its local sensors (local view) and may choose to query information sensed by the infrastructure to append an extended view to its local view. The communications network is constrained and the ego vehicle may need multiple queries to populate its extended view. The ego vehicle must adapt its motion plan and its communications action (specifically, the choice of sensed information queried from the infrastructure) to the on-road environment, such that its driving utility is optimized. We formulate this requirement as a reinforcement learning problem.

Works that study the impact of communication constraints on driving utilities are limited. A recent work [1] considers the limits on speed in autonomous vehicle networks that result from communications constraints. Works like [2] and [3] consider techniques to merge locally sensed information with information obtained from other vehicles, to extend a vehicle’s view. They use an n/g wireless network as is for transferring sensed data amongst two to three vehicles. There are other works [4] [5] that don’t dwell on the sensed information and motion planning. Instead, they consider problems related to vehicles sending information over a wireless network.

Our specific contributions are:

  1. [leftmargin=*]

  2. We formulate the problem of optimizing the driving utility of an autonomous vehicle using information obtained from its local sensors, and by querying the infrastructure over a network, as a reinforcement learning problem. Specifically, we formulate a discrete-time infinite horizon expected sum cost minimization. The goal is to find an optimal policy that the vehicle will use to choose its communications and motion planning actions at every time step.

  3. We avail the occupancy grid representation of the on-road environment to capture both motion planning and communications related constraints. While simple, this helps us to clearly exhibit the interplay of planning and communications.

  4. We use the Q-learning algorithm to enable the ego vehicle to learn the optimal policy, which solves the above problem, as it drives through a simulated on-road environment. Since Q-learning does not require the ego vehicle to know the model that guides evolution of the on-road environment, it can be used by an ego vehicle to learn optimal policies over time in real settings.

  5. We compare the driving utility achieved by the ego vehicle, its choice of communications and motion planning actions when using the obtained policy, with alternate policies and under varied traffic densities and communications constraints.

The rest of the paper is organized as follows. In Section II we summarize related works. In Section III we formulate the problem followed by details on the occupancy grid representation that we use in Section IV. In Section V we explain how we use Q-learning to solve the problem. Section VI details the simulator. Section VII provides an evaluation of policies obtained using Q-learning for different scenarios. We conclude in Section VIII.

Ii Related Works

There exists a large body of work that explores planning strategies for autonomous vehicles connected in a V2X framework. We focus on works relevant to our problem context, which involves cooperative perception and planning, communications in vehicular networks and applications of reinforcement learning to these problems.

Cooperative Perception and Planning: With increasing sensing and communication capabilities in cars, cooperative perception has become key to improving safety and traffic flow density. Several efforts [2, 3, 6] have been made in extending the range of perception of the ego vehicle. Kim et al. [2, 3] combine locally sensed information from various vehicles to create a merged occupancy grid map. The authors also study and characterize the effect of communication delay on map merging and conclude that safety critical tasks like collision avoidance should rely on local sensing information. They also suggest that longer-term decisions like early lane change or lane keeping could benefit from remote information. In [6], the authors take a Bayesian approach to incorporate uncertainty in perception modules as well as communication delays in the merged occupancy grid, over which an appropriate algorithm like RRT* is applied for path planning.

More recent works like [7, 8] discuss cooperative perception and planning in mixed-traffic scenarios, where certain human driven vehicles may not have sensing and communication capabilities.

Manzinger and Althof [8] develop an algorithm for cooperative collision avoidance by redistributing drivable regions fairly among the cooperating vehicles. The human-driven vehicles in the occupancy maps are treated as any other obstacle and are assumed to be known at each time instance. Kamal et al. [7] also work in a mixed-traffic, i.e., partially connected vehicle environment, and use the local and extended view to create a ‘road speed profile’ of the upcoming road segment. The road speed profile is then used for anticipative planning and control to optimize a given utility function.

These approaches develop the framework for merging the extended view with the local view and leverage the global view for better planning, however, all the schemes discuss broadcasting of information and do not consider the ‘usefulness’ of the sensing information received by the ego vehicle. A naïve approach of collecting data may lead to unnecessary bandwidth consumption and processing delays.

Planning and Communications: A crucial aspect of cooperative perception is the underlying protocol for communication and the constraints it imposes. In case of vehicular networks, often vehicles are equipped with multiple network interfaces like DSRC, Wi-Fi and Cellular. Higuchi and Altintas [9] explore hybrid schemes for vehicular communication and design a strategy for selecting one of the communication media (DSRC, Wi-Fi, Cellular, etc.) based on vehicular density on road regions. Talak et al. [1], on the other hand, observe that any wireless communication scheme imposes constraints due to interference and delay and derive bounds for velocity as a function of traffic density.

In [10]

, Roth et al. consider a multi-agent setting and learn a policy with each agent broadcasting its local view with no cost to communication. At execution time, however, a heuristic approach is taken, where an agent decides to communicate only when it benefits the team performance. More recently, Best et al.

[11] take a planning-aware communication approach for decentralized coordination of multiple agents. A particle filter framework is used where each agent tracks the action distribution of every other agent and decides to communicate with one only when its local utility can improve with new information.

Iii Model and Optimization Problem

We model the interaction of the ego vehicle with its environment as a discrete-time dynamic system [12]


where indexes discrete time, is the state observed by the ego vehicle at time , is the action that the ego vehicle takes at time , and is a random disturbance that captures the fact that the next state , given the current state and action , is governed by a probabilistic model. The function describes how the state is updated.

The state , for any , is a vector. It consists of the two subvectors and . The former is a vector that is obtained from measurements made by sensors local to the ego vehicle, and the latter is a vector that the ego vehicle obtains by querying the infrastructure. The action is a vector. Here is the motion planning action that determines the vehicle’s motion on the road (say, for example, its velocity or lane) for the following time step and is the communications action that determines the query the vehicle sends to the infrastructure. This query, at time , populates elements in .

Let and , respectively, be the set of all motion related actions and communications related actions that are feasible in state . While the set of feasible motion related actions restricts the on-road maneuvers the ego vehicle may make, the feasible set of communications actions captures the constraints on the ability of the ego vehicle to obtain information from the infrastructure using the communications network. An example of the former is a limit on the maximum acceleration or velocity and of the latter is a limited communications bandwidth/rate. For any , is feasible only if and .

At time , given that the ego vehicle observes state and chooses action , it incurs a bounded stage cost . Let be the set of all states. Define a stationary (independent of time ) policy , which is a function that maps every state to a feasible action . Let be the infinite horizon discounted expected sum cost when the ego vehicle starts in state and uses the policy . In terms of the stage costs, we can write


where is the factor that discounts future costs, and the expectation is over the sequence of states the ego vehicle visits, having started in . In our work, we use .

Our optimization problem is to find the optimal policy that minimizes the sum cost , for all . Let be the set of all feasible stationary policies. The optimal policy is


The optimal policy , at every time step, simultaneously chooses a motion planning and communications action, as a function of the state, such that the discounted sum cost in (2) is minimized.

Iv Grid Representation

Fig. 2: The grid corresponding to the states and is shown. The ego vehicle (blue car) has a local view of six cells and an extended view of eight cells. Assume a velocity of cells/second. In one other car is in the local view. At any time , the cells in the extended view are indexed relative to the local view of the ego vehicle. The occupancy of those with a grey background is unknown. The ego vehicle chooses to stay at a velocity of at and would like to query the cell that is indexed in its extended view at time . This is cell in the grid at . The control action and the query is .

We use an occupancy grid to represent the on-road environment through which the ego vehicle is driving. Each cell in the grid is either empty or is occupied (has a stationary or a moving obstacle). The grid, relative to the position of the ego vehicle, is divided into the two regions of (a) local view and (b) extended view. The local region occupancy can be measured by the sensors local to the ego vehicle. On the other hand, the ego vehicle needs to query the infrastructure to obtain occupancy information in the extended view. Figure 2 shows an illustration. In practice, cell occupancy information is obtained by using sensors like cameras and LIDARs that are observing the road region corresponding to the cell.

States in the grid representation: The state at time is obtained from the occupancy grid as follows. The subvector is constituted of the velocity of the ego vehicle, its lane, and the occupancy information of each cell that is in its local view and can be measured111In this work we assume that the measurements are accurate. More generally, the state may also include measurement uncertainty. by its local sensors. In Figure 2, assuming that the ego vehicle is moving at a speed of cells/second, and listing the cell occupancy in its local view (starting from the cell behind the ego vehicle and going anti-clockwise), we get , where and denote, respectively, an occupied and a free cell.

The subvector consists of the occupancy of cells that are in the extended view of the ego vehicle. Their occupancy is obtained by querying the infrastructure at . The occupancy of a cell that has never been queried is set to Unknown, else it is set to the last queried value. In Figure 2, in state , the occupancy of cells marked (relative to the ego vehicle (blue)) was queried earlier to be free and that of is unknown.

Actions in the grid representation: In this work, we restrict the set of motion planning actions to . Accelerate and Decelerate correspond to the ego vehicle, respectively, increasing and decreasing velocity. Lane Change has the vehicle move to an adjacent lane. The action Do Nothing implies that the vehicle sticks to its current velocity and lane. Occupancy of one or more cells in the extended region at time may be queried together at time . The set includes actions, each of which corresponds to a group of cells, which will be in the extended view at , being queried. The set also includes the action where in the ego vehicle chooses not to query. Larger numbers of cells in a group correspond to a larger available communications bandwidth (bits/sec). On the other hand, if groups contain fewer cells, in effect, the ego vehicle has access to a smaller communications bandwidth. In Figure  2, the extended region has eight cells. Supposed we group them into the sets and . To populate occupancy of either group at time , the ego vehicle queries them at time . We have . Implicitly, the communications bandwidth is cells/time step.

Uncertainty in the on-road environment:

We introduce uncertainty in the grid world by choosing a cell in the grid to be occupied with a certain probability

, independently of the other cells in the grid, and unknown to the ego vehicle. As the ego vehicle moves, new cells are added and the occupancy of each cell is determined similarly. The ego vehicle gets to know the true occupancy of a cell once the cell is in its local view or if the cell is in the vehicle’s extended view and it queries for its occupancy.

State evolution and costs: The ego vehicle will learn the optimal policy using Q-learning, without an explicit knowledge of the state evolution model (1), using state and action trajectories and stage costs obtained from a simulator. Next, we describe the Q-learning based method followed by a description of the simulator, where we describe the stage costs.

V Finding the Optimal Policy Using Q-Learning

The Q-learning algorithm provides a simulation driven iterative method to find the optimal Q-factors for every state and action . To simplify notation, define set . The optimal Q-factor is the optimal infinite horizon expected sum cost when starting with the state-action pair of (,). It is related to the optimal cost of starting in state as per the following equation.


The optimal policy (Equation (3)) is given by


Further define


The Q-learning based approach is summarized in Algorithm 1. It proceeds in an episodic manner. A given episode begins by randomly picking a certain initial state-action pair and running a long simulation trajectory ( long) that starts at the state-action pair. For every episode, we provide the simulator a probability with which a cell in the grid must be occupied. For every state that the simulation trajectory visits, a motion planning and a communications action is picked uniformly and randomly, respectively, from the sets and of actions feasible in the state. Inputting these to the simulator gives the stage cost and the next state .

The Q-factor corresponding to the visited state and chosen action is then updated using the original Q-learning algorithm [12]. All other state-action pairs are left unchanged. In this update, is a step-size that when chosen appropriately guarantees the convergence of the Q-factors to the optimal Q-factors . We set it to .

For the simulation scenarios that we present later, we observed convergence for a maximum of and .

  Inititalize: , , ,
  Output: , .
  while   do
      SIMULATOR(); {Get starting state from simulator}
      Choose uniformly and randomly from ;
     while   do
          Choose uniformly and randomly from ;
          Choose uniformly and randomly from ;
     end while
  end while
Algorithm 1 The Q-learning Based Algorithm

Vi Simulator

We found simulators like Flow [13], DeepTraffic222https://selfdrivingcars.mit.edu/deeptraffic/ by MIT and the Auto Drive Simulator333https://github.com/HugoTian/Auto_drive_simulation. Flow integrates micro-traffic simulation and reinforcement learning to explore control strategies in various traffic scenarios. DeepTraffic and Auto Drive both use deep reinforcement learning techniques to optimize the motion plan of an ego vehicle. However, we were unable to find simulators that supported the functionality above together with the possibility of an extended view that can be exposed to the ego vehicle in response to its querying for information. Motivated by existing approaches we built a simple simulator using, amongst others, the Python libraries NumPy and Pygame. This enables us to grasp the effectiveness of policies that adapt motion and communications actions jointly. We will briefly describe how the different actions are simulated followed by a description of the stage costs.

The simulator keeps track of every cell in an occupancy grid that includes the local view and the extended view. The sizes of each view are fixed during a simulation. As the ego vehicle moves, at every time step, new cells are added and an equal number of cells leave the simulation. The simulator sets the occupancy of a new cell to Occupied independently and randomly with the probability that is provided to it. The occupancy of the cells that leave are lost.

The ego vehicle has access to the occupancy of all cells in the local view. However, the occupancy of a cell in the extended view is revealed to it by the simulator only when the cell is queried.

The simulator maintains distance in the units of cells and, correspondingly, velocity and acceleration in the units of cells/time-step and cells/time-step.444For ease of presentation, we will skip the units. The ego vehicle has a certain maximum velocity that can be set. The minimum velocity is . For scenarios we evaluated, the cell size was set to that of the ego vehicle. Let be the velocity of the ego vehicle at the time step and let be the acceleration (a negative value implies deceleration) chosen by it. The number of cells covered by the ego vehicle and its velocity at time are calculated as


To exemplify, if a vehicle moving at a velocity of at time decelerates at a rate of , it will come to a halt at time in the cell it started in at time . On the other hand, a vehicle that has a velocity of at time , on choosing to accelerate at , will have a velocity of at time . However, it will be in the same cell as it was in time .

Lastly, lane changes are executed by moving the ego vehicle by the number of cells given by (7) and then placing it in the corresponding cell in the chosen adjacent lane.

Vi-a Stage costs (negative rewards)

To make the ego vehicle learn the optimal policy using Q-learning we must set the stage costs (see Equation 2) appropriately. We use a very simple reward structure. The ego vehicle accrues unit stage reward (stage cost of ) for every cell it moves in its direction of travel. This is to ensure that the vehicle covers distance. To discourage unnecessary motion planning actions or communications actions, the ego vehicle gets a stage reward of when it chooses the motion planning action Do Nothing or a communications action of No Query. Note that a vehicle may choose both in a time step, in which case it will get a reward of added to the number of cells it covered. An example of an unnecessary motion planning action is that of changing lanes when all available lanes including the current are empty. Similarly querying the occupancy of a cell that is already known is an example of wasteful communications. When an action leads to an imminent collision, the vehicle receives a reward of (a very large stage cost). Its velocity is forced to and its position remains unchanged.

Vii Evaluation and Results

Fig. 3: The grid maintained by the simulator. We show an instance in which the local view of the ego vehicle has a car.
(a) Distance covered
(b) Velocity distribution
(c) Motion planning action distribution
(d) Communications action distribution
Fig. 8: Figures (a), (b), (c), and (d) show, respectively, the average distance covered, the distribution of the velocity of the agent, the distribution of the motion planning actions taken by the agent, and the distribution of the use of communications, calculated over episodes of length time steps. These are shown for four different probabilities of occupancy (akin to traffic density). For each probability, we consider scenarios including Local view (LV), Random Communications (RC), Communications - one column at a time (C1), Communications - two columns at a time (C2), and Full View (FV).

We will demonstrate the efficacy of jointly adapting motion planning and communications actions in response to on-road and communications constraints. All results in this section were obtained for the grid shown in Figure 3. The ego vehicle could be in either lane, in its column of the grid. It was constrained to have a maximum velocity of . It could choose an acceleration and deceleration of and , respectively. We will compare the following scenarios. For each of them, the optimal policy , given by Equation (5)), was obtained using Q-learning, as described earlier.

  1. [leftmargin=*]

  2. Local View (LV) Only: The ego vehicle has no access to communications. It can only adapt its motion plan based on its local view.

  3. Random Communication (RC): The ego vehicle receives information about its extended view from the infrastructure. At any time instant, it receives the occupancy of cells or that of cells , independently of the previous time instants, and with equal probability. The vehicle chooses its motion planning actions given its local view and the occupancy received for the extended view.

  4. Query of one column (two cells) at a time (C1): The ego vehicle may query the infrastructure to populate its extended view. A query at any time instant can be about exactly one of the four columns, the groups of cells , , , and . This implies a communications constraint of cells per time step. C1 has about state-action pairs for which Q-factors must be calculated.

  5. Query of two columns (four cells) at a time (C2): Similar to C1. However, exactly one of the two groups and may be queried by the ego vehicle. This amounts to having a communications rate of cells per time step, which is twice that available in C1. C2 has about state-action pairs.

  6. Full View (FV): The ego vehicle obtains, without the need to query the infrastructure, the occupancy of both its local and extended views at any given time instant. This amounts to the ego vehicle having a large local view that includes all cells in the grid in Figure 3. FV has about state-action pairs.

Note that only in C1 and C2 does the ego vehicle jointly choose the motion planning and communications actions. In LV, RC, and FV, the ego vehicle chooses only a motion planning action. However, it has access to occupancy information that ranges from local only in LV to the entire grid in FV.

For each scenario, we test the obtained policy over a set of randomly chosen episodes, for a fixed choice of probability of a cell being occupied, where each episode was time steps long. We show results for . These choices correspond to on-road environments varying from no occupancy (no traffic) to very high occupancy (heavy traffic). During the episodes we ensure that both the cells in adjacent lanes are never occupied. This ensures that the ego vehicle has a free cell to move into for all the time steps in an episode. Recall from Algorithm 1 that, for any scenario, the policy is obtained by randomly choosing the value of . Specifically, we don’t train the ego vehicle separately for the different values of .

Understanding gains from an extended view: Figure (a)a shows the distance traveled by the ego vehicle for different traffic densities and the scenarios we detailed above. Not surprisingly, the distance covered in FV is at least as much and often better than that covered under the other scenarios. On the other hand, given the limited view that the ego vehicle has in LV, the distance covered is smaller than the rest. The ego vehicle does almost as well in C2 as in FV. This is because the policy in C2 is able to query occupancy of all the cells in the extended view (Figure 3) in two time steps (four cells (two columns) at a time). Given that the maximum speed of the ego vehicle is , C2 is in effect the same as FV.

Observe that the differences in distances covered under the different scenarios shrink as the traffic density increases. Also, in Figure (b)b, observe that as traffic density increases the agent chooses a velocity of more often, in all the scenarios. These two effects are in fact related. At large traffic densities the ego vehicle must restrict its speed to to avoid collisions even if it knows the occupancy of all cells in the grid. Note that at a velocity of the ego vehicle moves two cells, either in the same lane or in an adjacent lane, in one time step. While at any given time step, the ego vehicle will find an empty cell to enter, the probability of finding two consecutive empty cells is small for large traffic densities. To summarize, larger traffic densities constrain the velocities that the motion plan can choose. As a result, the knowledge of occupancy of cells in the extended view brings smaller rewards for larger densities.

Adapting communications to a constrained on-road environment: While in FV the ego vehicle gets the entire view without any communications, one would hope that C1 and C2 would query less in high density scenarios as they jointly adapt motion planning and communications. This is in fact the case and can be seen in Figure (d)d. Note that as the traffic density increases, the fraction of times No Query is used as a communications action increases. It is about for C2 at a density of . On the other hand, C1 does not query at all at the density of . This is because, given that it can query only two cells (one column) at a time, it is never able to benefit from the resulting extended view and it always has the ego vehicle move at a velocity of (Figure (b)b).

Adapting motion planning actions to communications constraints: From Figure (a)a, we can see that in C2 the ego vehicle covers the largest distance amongst RC, C1, and C2. In RC the ego vehicle has no control over the received information. However, it receives cells per time step as described earlier. This is the same as what the ego vehicle can query in C2. Clearly, from Figure (a)a, there are huge gains to choosing the communications action smartly. On the other hand, communications in C1 are constrained because of a smaller available communications rate ( cells (one column) per time step). The impact of this on motion planning is apparent on seeing the distributions of velocity (Figure (b)b) chosen by the ego vehicle in C1 and C2. The constrained communications in C1 restricts the fraction of time the ego vehicle chooses the maximum velocity of . The differences are especially significant for low traffic densities, that is when the on-road environment does not constrain motion planning.

Observe from Figure (a)a that RC shows gains over C1 for lower densities and does as well for higher densities. The gains are because in RC the communications rate is twice that in C1. This larger rate compensates for the randomness regarding what information must be sent in RC.

Action Selection: Consider Figure (c)c, which shows how motion planning actions are distributed across the scenarios and for different traffic densities. For a traffic density of , LV, C2, and FV almost always choose Do Nothing. This is because, given the absence of any other occupants, in LV the ego vehicle sticks to a velocity of and in C2 and FV it sticks to a velocity of . C1 and RC try to opportunistically benefit from communications. When the ego vehicle has a large enough extended view, it accelerates. At a higher velocity, it covers distance faster and isn’t able to acquire the required extended view to stay fast. Hence, it decelerates. For similar reasons, at higher traffic densities, C2 sees a lot more of acceleration and deceleration than FV. In C2, the ego vehicle opportunistically increases velocity when communications reveals space in the extended view. However, this also means that it is forced to decelerate when the density is high. In FV, on the other hand, the ego vehicle is more often at a velocity of (see Figure (b)b for densities and ). Both C2 and FV see similar distance rewards (Figure (a)a), however.

Finally, observe that as traffic density increases the frequency of lane changes increases. This is simply because the ego vehicle encounters occupied cells more often.

Viii Conclusions and Future Work

We formulated a reinforcement learning problem in which an autonomous vehicle jointly chooses a motion planning action and a communications action at every time step, such that its driving utility is optimized. We used the Q-learning algorithm to make the ego vehicle learn the optimal policy using simulations. We demonstrated how jointly adapting motion planning and communications actions allows an autonomous vehicle to (a) make judicious use of the communications network in an on-road environment that constrains feasible motion plans and (b) smartly choose motion planning given communications constraints.

In the future we plan to investigate more realistic motion and uncertainty models and leverage deep reinforcement learning techniques to obtain policies. In addition, extensions to networks of autonomous vehicles are planned.


  • [1] R. Talak, S. Karaman, and E. Modiano, “Speed limits in autonomous vehicular networks due to communication constraints,” in 2016 IEEE 55th Conference on Decision and Control (CDC), Dec 2016, pp. 4998–5003.
  • [2] S. W. Kim, B. Qin, Z. J. Chong, X. Shen, W. Liu, M. H. Ang, E. Frazzoli, and D. Rus, “Multivehicle cooperative driving using cooperative perception: Design and experimental validation,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 2, pp. 663–680, April 2015.
  • [3] S.-W. Kim, Z. J. Chong, B. Qin, X. Shen, Z. Cheng, W. Liu, and M. H. Ang, “Cooperative perception for autonomous vehicle control on the road: Motivation and experimental results,” in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on.   IEEE, 2013, pp. 5059–5066.
  • [4] M. Patra, R. Thakur, and C. S. R. Murthy, “Improving delay and energy efficiency of vehicular networks using mobile femto access points,” IEEE Transactions on Vehicular Technology, vol. 66, no. 2, pp. 1496–1505, Feb 2017.
  • [5] E. David, N. Ndih, and S. Cherkaoui, “Map: Contention-free mac protocol for vanets with plnc,” in 2017 IEEE International Conference on Communications (ICC), May 2017, pp. 1–7.
  • [6] W. Liu, S.-W. Kim, Z. J. Chong, X. Shen, and M. H. Ang, “Motion planning using cooperative perception on urban road,” in Robotics, Automation and Mechatronics (RAM), 2013 6th IEEE Conference on.   IEEE, 2013, pp. 130–137.
  • [7] M. A. S. Kamal, T. Hayakawa, and J. Imura, “Realization of highly anticipative driving in a partially connected vehicle environment,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Oct 2017, pp. 1–6.
  • [8] S. Manzinger and M. Althoff, “Negotiation of drivable areas of cooperative vehicles for conflict resolution,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Oct 2017, pp. 1–8.
  • [9] T. Higuchi and O. Altintas, “Leveraging cloud intelligence for hybrid vehicular communications,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Oct 2017, pp. 15–20.
  • [10] M. Roth, R. Simmons, and M. Veloso, “Reasoning about joint beliefs for execution-time communication decisions,” in Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, ser. AAMAS ’05.   New York, NY, USA: ACM, 2005, pp. 786–793. [Online]. Available: http://doi.acm.org/10.1145/1082473.1082593
  • [11] G. Best, M. Forrai, R. R. Mettu, and R. Fitch, “Planning-aware communication for decentralised multi-robot coordination,” in Proceedings of the International Conference on Robotics and Automation, Brisbane, Australia, vol. 21, 2018.
  • [12] D. P. Bertsekas, Dynamic programming and optimal control, 2012, vol. 2, no. 4.
  • [13] C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky, and A. M. Bayen, “Flow: Architecture and benchmarking for reinforcement learning in traffic control,” arXiv preprint arXiv:1710.05465, 2017.