Car accidents are recognized as a serious problem. Although modern vehicles are often equipped with many accident-avoidance systems, the number of highway-related fatalities in the US alone is approximately 32,000 a year. As a response to this issue, the US Department of Transportation (DOT) has issued a notice of Transportation of proposed rule-making (NPRM) requiring the installation of vehicle-to-vehicle (V2V) communication capabilities in all new cars by 2023. This is expected to become a federal motor vehicle safety standard.
The approach is that vehicles will issue messages alerting each other of potential safety concerns so as to act upon these messages and avoid accidents. The NPRM Administration enumerates the causes of vehicle crashes and the information that will a will aid in avoiding them, yet there are many uncertainties in the proposal and requests for comments from stakeholders. For instance, the proposal expresses its uncertainty in which type of information must be sent as part of a safety message:
We tentatively believe that speed, heading, acceleration, and yaw are the most relevant pieces of information about a vehicle’s moment. Essentially, we propose to measure the rate at which the sending device’s location is changing and also any changes to that rate at which a device’s location is changing…
In another case, the proposal requests for comments on the specification of transmission range:
We ask for comment on [the minimum V2V range limit]. Is there any reason that the agency should require a maximum transmission range as well as a minimum? Should the agency choose a different minimum range requirement? What would be appropriate alternative minimum and maximum transmission range values?
This uncertainty calls for a scientific investigation. In this work, we explore the possibility of studying the effect of communication in vehicle coordination in the context of reinforcement learning. In doing so, we create our own simplified simulation in which we can deploy multiple vehicles, each with a distinct goal. We then train these vehicles and demonstrate that communication, specifically an engineered protocol used as input to the policy, greatly improves both safety and efficiency in adverse conditions. We also investigate emergent communication approaches but our efforts there are still preliminary.
1.1 Related Work
Communication in multi-agent scenarios
During the past couple of years, there has been a surge of publications in the area of multi-agent communication. At a high level, we can categorize them into two families. In the first family, communication is restricted to natural language, enabling us to study the effect and properties of natural languages in a setting richer than simple supervised learning. These include communication based machine translation(Lee et al., 2017), learning-to-negotiate (Lewis et al., 2017) and visually-grounded dialog agents (Das et al., 2017). Another angle has focused more on solving a target task by allowing multiple agents to develop their own communication protocol. These include simple traffic navigation (Sukhbaatar et al., 2016), cooperative riddle solving (Foerster et al., 2016; Evtimova et al., 2017) and multi-agent reinforcement learning (Mordatch and Abbeel, 2017; Lowe et al., 2017).
Our work sits in between these two directions of research in that we study the multi-agent coordination problem augmented with either a manually-designed communication protocol or an emergent one, and compare them against a no-communication setting. Additionally, our research is aimed towards solving a problem that today has no good solution.
Fully learning-based autonomous driving has received renewed interest since the early work by Muller et al. (2006). There is related work on training a single vehicle to drive (see, e.g., Bojarski et al., 2016; Zhang and Cho, 2016; Chowdhuri et al., 2017, and references therein). There is also work trying to train multiple vehicles collectively to better coordinate amongst themselves (see, e.g., Shalev-Shwartz et al., 2016), albeit without communication. Our work falls into the second category, however, with the novelty that we allow vehicles to directly message each other, potentially allowing them to develop a more efficient coordination strategy.
|4-5||Velocity (X, Y)|
|8||Steering Wheel Angle|
2 Simulation Setup
Our simulator involves multiple vehicles, driven by a single shared policy, each tasked with driving as fast as possible to a highway exit without crashing into each other
We use Box2D (Box2D, ), a Python gaming framework, to build the highway and use a car implementation from OpenAI Gym (Brockman et al., 2016). Every episode consists of twelve cars. The vehicle’s size, exit, and starting position are selected randomly from five, two, and fifteen possibilities respectively. The length and angle of the highway also change randomly at each episode, and there are no lane markings, which makes driving even more challenging. These choices ensure diversity in the driving scenarios. See Fig. 1 for a rendering of this environment.
Our reward structure is meant to encourage the agents to quickly drive to the exit without crashing. Consequently, we settled on for a successful exit, for a crash or going past the exit. A negative reward of is given each step.
|2||Global Message Id|
|3||Episode time step|
|4||Current X Position|
|5||Current Y Position|
|19||Hard Brake Indicator|
|20||Steering wheel angle|
3 Policy Setup
Observation and action
We share a single policy across all the vehicles in the highway. In our baseline model without communication, each agent receives a 40-dimensional observation vector every time step as described in Table1. When communication is allowed, the observation vector is augmented with the messages from all other nearby vehicles. In the case of the V2V message protocol, there are 21 additional dimensions, delineated in Table 2. Note that we do not adhere to the full specification of V2V (Administration, ) as some fields suggested in the protocol do not fit the proposed environment. Given such an observation, the policy outputs a tuple of three continuous actions: acceleration, steering wheel angle, and brake.
We do not impose any restriction other than the structures of the observations and actions as described above. This gives the policy full freedom in developing novel strategies based on its observations as well as messages from other vehicles. In the case of emergent communication especially, we expect this freedom to allow the policy to develop a more effective and efficient protocol than which can obtained using an engineered approach.
We use a
nonlinear feed-forward neural network with two hidden layers (200 and 100 units, respectively) and column norm initialization for each of the policy and value networks. Together, they output the mean of a Gaussian distribution per action.
There are a number of ways to incorporate the messages into the model. From the preliminary experiment, it was found most effective to concatenate all the messages together, non-linearly project them into a fixed size 32-dimensional input, and then concatenate that result with the personal observations as the input to the policy.
After each episode, we accumulate a sequence of state-action-reward tuples per vehicle. We consider each such vehicle-specific episode as an independent episode when training a policy. We use a recently proposed proximal policy algorithm (PPO, Schulman et al., 2017)
based on the TensorFlow implementation byHafner et al. (2017).
We found that randomly dropping the messages % of the time improved performance. We implemented two variants. The first variant works per agent by dropping a received message randomly. The second variant, on the other hand, randomly drops all the messages across all the vehicles at each time step. In this work, we use the first variant.
One target scenario for a car is to get to its destination as fast as possible while never crashing. Consequently, our evaluation metrics were designed around safety and speed objectives. For both adverse and regular conditions, we measured how often the cars succeeded as opposed to crashing or passing their exits, how oftenall of the cars finished the episode, and the mean number of steps for the failed cars and the successful cars. The adverse condition we consider is that of fog decreasing lidar range uniformly.
We report results as the average of three evaluation runs with different seeds. While training was unstable - a priority to address in future research - evaluation runs across a single model were consistent.
The experimental setup we have described does not demand that the cars do perfectly on safety, but that they do a reasonable job such that marked improvements over a baseline without communication could be considered valid. Thus, we tuned the difficulty of our environment such that our best baseline could only achieve approximately success rate per car.
|Model||Eval Conditions||Flat success||Episode Success||Mean Length Success||Mean Length Fail|
Table 3 presents results for models trained on normal sunny conditions without fog, but with evaluation for both the sunny and foggy settings. The best baseline model succeeded at a rate of per car and per episode in the sunny environment. It starts to learn a sufficiently good policy after episodes. While the model took longer to learn (approximately 2,400 episodes) with the V2V communication, the per-car success rate improved at the expense of a speed reduction. The gap in the success rate between the baseline and the one with the V2V communication grows when evaluated in foggy conditions. These relative numbers are roughly the same, albeit both models lose some efficacy.
|Model||Eval Conditions||Flat Success||Episode Success||Mean Length Success||Mean Length Fail|
We also tried training the model in adverse conditions. This is where we expect the communication to be most advantageous because now the model can learn to utilize the messages from other cars in situations where its lidar is ineffective. The results, shown in Table 4, verify this and demonstrate that the V2V protocol maintains a high level of success rate while only slightly reducing the speed from the policy learned in sunny conditions.
One notable statistic in the tables is that the models trained in sunny conditions do better when evaluated in foggy conditions than the models trained in foggy conditions do. This is because we set up the evaluation process to match the training conditions where only 10% of the episodes were foggy. Thus the sunny models do much better in the 90% of evaluation runs which are sunny than the foggy models do on the 10% of foggy evaluations.
In this paper, we have explored using reinforcement learning to model a real world scenario that is otherwise very difficult to test. Our current results suggest that the proposed protocol will in fact make the road safer in adverse conditions and enable cars to go faster. There are of course caveats to this claim, a major one being that our simulation is a limited interpretation of actual driving scenarios.
One aspect of self-driving cars that is of particular interest is that the agents could utilize the messages in ways that go beyond what an engineer would program. Whereas an engineer might program a few specific reactions to any given safety message, the size of the search space is so large that they probably will not think of the best solutions. A learned policy on the other hand can use messages in surprising ways. We have seen this recently with AlphaGoSilver et al. (2016) finding moves in Weiqi that humans players had not discovered after many centuries. In future research, we will compare the model’s learned policies to fixed safety responses.
Preliminary result: emergent communication
Instead of using the fixed V2V protocol, we can let the model design its own communication protocol by allowing it to synthesize a message itself. This type of emergent communication research has blossomed recently, with most prior work focusing on understanding or yielding attributes in the generated language such as compositionality Andreas and Klein (2017). We are unaware of any research such as ours that utilizes this method for finding solutions to real world problems that don’t already have a viable approach.
In preliminary experiments, we explored two variants of emergent communication. ‘Continuous’ was very similar to the described V2V model. The difference was that each agent emitted a fixed number of continuous message actions, which were then fed to other agents instead of the V2V protocol. In ‘Select’, each agent selected which of the twelve dimensions of the V2V protocol it wanted to include in a message. Their message then consisted of those parts with the rest of the protocol zeroed out.
We hypothesized that ‘continuous’ would perform better than the V2V protocol, because the agent could include information outside of the V2V protocol or it may be encoded in a way that is easier for a receiver to decode. We hypothesized that ‘select’ would be at least as good as the V2V protocol and would give us interpretable insight into what parts of the protocol were most important. The policies using either of these schemes were however no better and frequently converged to a degenerate solution in which the vehicles spent the entire episode turning in one place. Thus, finding ways to stabilize training and improve these models is important future research.
We thank Tudor Achim, Ilya Kostrikov, Adam Bouhenguel, and Martin Arjovsky for helpful discussions. This work was partly supported by NVIDIA (Project: "NVIDIA - NYU Autonomous Driving Collaboration").
-  N. H. T. S. Administration. Federal motor vehicle safety standards; v2v communications. https://www.federalregister.gov/documents/2017/01/12/2016-31059/federal-motor-vehicle-safety-standards-v2v-communications.
- Andreas and Klein  J. Andreas and D. Klein. Analogs of linguistic structure in deep representations. CoRR, abs/1707.08139, 2017. URL http://arxiv.org/abs/1707.08139.
- Bojarski et al.  M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
-  Box2D. https://github.com/pybox2d/pybox2d.
- Brockman et al.  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym, 2016.
- Chowdhuri et al.  S. Chowdhuri, T. Pankaj, and K. Zipser. Multi-modal multi-task deep learning for autonomous driving. arXiv preprint arXiv:1709.05581, 2017.
- Das et al.  A. Das, S. Kottur, J. M. Moura, S. Lee, and D. Batra. Learning cooperative visual dialog agents with deep reinforcement learning. arXiv preprint arXiv:1703.06585, 2017.
- Evtimova et al.  K. Evtimova, A. Drozdov, D. Kiela, and K. Cho. Emergent language in a multi-modal, multi-step referential game. arXiv preprint arXiv:1705.10369, 2017.
- Foerster et al.  J. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
- Hafner et al.  D. Hafner, J. Davidson, and V. Vanhoucke. Tensorflow agents: Efficient batched reinforcement learning in tensorflow. CoRR, abs/1709.02878, 2017. URL http://arxiv.org/abs/1709.02878.
- Lee et al.  J. Lee, K. Cho, J. Weston, and D. Kiela. Emergent translation in multi-agent communication. CoRR, abs/1710.06922, 2017. URL http://arxiv.org/abs/1710.06922.
- Lewis et al.  M. Lewis, D. Yarats, Y. N. Dauphin, D. Parikh, and D. Batra. Deal or no deal? end-to-end learning for negotiation dialogues. arXiv preprint arXiv:1706.05125, 2017.
- Lowe et al.  R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017.
- Mordatch and Abbeel  I. Mordatch and P. Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908, 2017.
- Muller et al.  U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun. Off-road obstacle avoidance through end-to-end learning. In Advances in neural information processing systems, pages 739–746, 2006.
-  U. S. D. of Transportation. Department of transportation advances deployment of connected vehicle technology to prevent hundreds of thousands of crashes. https://www.transportation.gov/briefing-room/us-dot-advances-deployment-connected-vehicle-technology-prevent-hundreds-thousands.
- Schulman et al.  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.
- Shalev-Shwartz et al.  S. Shalev-Shwartz, S. Shammah, and A. Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
- Silver et al.  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
Sukhbaatar et al. 
S. Sukhbaatar, a. szlam, and R. Fergus.
Learning multiagent communication with backpropagation.In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2244–2252. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6398-learning-multiagent-communication-with-backpropagation.pdf.
- Zhang and Cho  J. Zhang and K. Cho. Query-efficient imitation learning for end-to-end autonomous driving. arXiv preprint arXiv:1605.06450, 2016.