1 Introduction
Driving is an inherently social activity, and it will remain so as long as human drivers share the road with autonomous vehicles. The social nature of driving introduces several problems that are often overlooked in self-driving car (SDC) research. An effective SDC will have to account for the variance in human driving styles and preferences [1], as well as varying norms [4] and laws [2] in different parts of the world. Additionally, SDCs will have to navigate the many nuanced “corner cases” of driving that require complex social negotiation.
Current planning algorithms for self-driving cars are hard-coded and programmed to be cautious. Caution is important for safety, but single-minded caution can make it impossible to complete required driving tasks. For example, to merge onto a busy highway, a driver must stick the car’s nose out and encourage other drivers to slow down [5]. Furthermore, human drivers can take advantage of overly cautious SDCs by driving aggressively, effectively bullying SDCs by forcing them to yield when they should otherwise have the right of way. Such behavior has been noted as a possible impediment to mainstream adoption of SDCs [13, 3].
In this work, we explore a computational game-theoretic approach that might be useful in these scenarios. In particular, we examine the idea of using a strategy that adaptively discourages antisocial behavior while remaining safe. Our proposed strategy has the overall structure of the “folk theorem” of repeated games: stabilize mutually beneficial behavior with the threat of punishment in later rounds of play [11]. However, unrestricted punishment could include unsafe behavior like intentionally crashing into the opponent’s car. We propose using a punishment strategy that only restricts the opponent’s utility to some safe target level while maximizing the utility of the agent. By analogy to a Stackelberg equilibrium, which is the strategy with maximum utility when paired with the opponent’s best response, we call such a strategy a Stackelberg punishment.
A Stackelberg punishment can be computed efficiently in several classes of games for which efficient Stackelberg equilibria algorithms exist. We demonstrate the concept deployed in a simple but strategically relevant driving scenario of negotiating right of way on a one-lane bridge.
In the first part of this paper, we discuss efficient algorithms for computing Stackelberg punishment. In the second part, we describe an application of a Stackelberg punishment to solving the SDC bullying problem and demonstrate its efficacy in an experiment with human participants.
2 Tree-Based Games
We define a tree-based alternating-move two-player game as a tuple $\langle S, S_L, S_F, S_T, s_0, A, T, R_L, R_F \rangle$. Here, $S$ is a set of states with $S_L$ being the subset of states where the first player (the leader) has control, $S_F$ being the subset of states where the second player (the follower) has control, and $S_T$ being the subset of states that are final states (leaves), ending the decision process. The sets $S_L$, $S_F$, and $S_T$ partition $S$. The initial state $s_0 \in S$ is the root of the tree. The set $A$ is the set of actions available at each state, with the transition function $T(s, a)$ returning the next state reached from non-final state $s$ when action $a \in A$ is selected. The state space forms a tree in that each state can be reached by only one path from the root. The pair $R(s) = (R_L(s), R_F(s))$ gives the reward values obtained by the players when state $s$ is reached.
A policy $\pi$ maps each non-final state $s$ and action $a$ to the probability $\pi(s, a)$ that action $a$ is taken in that state. We can define the value of a policy $\pi$ from state $s$ as
$$V^{\pi}(s) = \sum_{a \in A} \pi(s, a)\, V^{\pi}(T(s, a)),$$
where $V^{\pi}(s) = R(s)$ for $s \in S_T$. That is, $V^{\pi}(s)$ represents the payoff pair for the two players if they adopt the joint (stochastic stationary Markov) strategy $\pi$. The leader makes the selection in the $S_L$ states and the follower makes the selection in the $S_F$ states. Given a policy $\pi_L$ defined on the states in $S_L$ and a policy $\pi_F$ defined on the states in $S_F$, we write $\langle \pi_L, \pi_F \rangle$ to represent the policy $\pi$ with $\pi(s, a) = \pi_L(s, a)$ if $s \in S_L$ and $\pi(s, a) = \pi_F(s, a)$ if $s \in S_F$.
For policies for the two players, $\pi_L$ and $\pi_F$, we can write $V_L(\pi_L, \pi_F)$ and $V_F(\pi_L, \pi_F)$, representing the expected payoffs to the two players when these policies are executed:
$$\big(V_L(\pi_L, \pi_F),\, V_F(\pi_L, \pi_F)\big) = V^{\langle \pi_L, \pi_F \rangle}(s_0).$$
Given such a tree-based game and a policy $\pi_L$, we call a policy $\pi_F$ a best response if, for all $\pi_F'$, $V_F(\pi_L, \pi_F) \ge V_F(\pi_L, \pi_F')$. That is, the follower cannot improve its value by adopting a different policy. We write $BR(\pi_L)$ for the best response to $\pi_L$.
In this setting, a Stackelberg equilibrium policy for the leader is
$$\pi_L^{*} = \operatorname*{argmax}_{\pi_L} V_L(\pi_L, BR(\pi_L)).$$
That is, assuming the follower will adopt a best response to whatever the leader elects to do, the leader behaves so as to maximize its reward.
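To make these definitions concrete, here is a minimal sketch that evaluates a fixed leader policy against a best-responding follower on a tiny tree. The `Node` class, the example tree, and the tie-breaking rule (ties at follower states go to the leader) are illustrative assumptions rather than part of the algorithm of [7]; the sketch only evaluates a given leader policy and does not search for the equilibrium.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Payoff = Tuple[float, float]  # (leader payoff, follower payoff)


@dataclass
class Node:
    name: str
    owner: str  # "L" = leader moves, "F" = follower moves, "T" = final state
    children: Dict[str, "Node"] = field(default_factory=dict)  # action -> child
    reward: Optional[Payoff] = None  # set only at final states


def value(node: Node, leader_policy: Dict[str, Dict[str, float]]) -> Payoff:
    """Payoff pair at `node` when the follower best-responds to `leader_policy`.

    `leader_policy` maps a leader node's name to {action: probability}.
    """
    if node.owner == "T":
        return node.reward
    vals = {a: value(child, leader_policy) for a, child in node.children.items()}
    if node.owner == "L":
        probs = leader_policy[node.name]
        lead = sum(probs.get(a, 0.0) * v[0] for a, v in vals.items())
        foll = sum(probs.get(a, 0.0) * v[1] for a, v in vals.items())
        return (lead, foll)
    # Follower state: maximize the follower's payoff.
    # Assumption: ties are broken in the leader's favor.
    return max(vals.values(), key=lambda v: (v[1], v[0]))


def leaf(name: str, lead: float, foll: float) -> Node:
    return Node(name, "T", reward=(lead, foll))


# Hypothetical two-level game: the leader moves first, then the follower.
root = Node("root", "L", {
    "x": Node("A", "F", {"l": leaf("Al", 4.0, 1.0), "r": leaf("Ar", 0.0, 2.0)}),
    "y": Node("B", "F", {"l": leaf("Bl", 3.0, 3.0), "r": leaf("Br", 1.0, 0.0)}),
})

pure_y = value(root, {"root": {"y": 1.0}})
mixed = value(root, {"root": {"x": 0.5, "y": 0.5}})
```

Because each state is reached by exactly one path, choosing the follower's best action independently at every follower state yields a best response to the whole leader policy.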
Letchford and Conitzer [7] introduced an efficient algorithm for computing Stackelberg equilibria in tree-based games; its running time is polynomial in the numbers of leaves and internal nodes of the tree. The algorithm works by determining, for each state $s$, a set of payoff pairs that can be obtained through some choice of leader policy $\pi_L$ and a follower best response $BR(\pi_L)$. Since the objective is to find a policy that maximizes reward for the leader, we need only maintain the points of this set that maximize reward for the leader for each possible value obtained by the follower.
The algorithm represents this set via a finite set of payoff points and a finite set of line segments connecting some subset of those points. It builds up the representation for a state $s$ out of the representations computed for the children of $s$. At the leaves of the tree, the representation is simply the rewards at the leaves.
For a state $s$ where the leader selects the action, the representation is computed by noting that the leader can choose any child of $s$, so any of the achievable payoffs at any of the children of $s$ are also achievable. However, the leader can also probabilistically select any of its children. This translates into line segments where one endpoint comes from the representation of one child node and the other endpoint comes from the representation of a different child node. This set is sufficient for capturing the representation of the set of possible values at $s$, but it may include some unnecessary lines (or even points). These extra bits of representation can be removed or ignored, as we are ultimately only concerned with the points with maximum value for the leader.
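As a worked illustration of why probabilistic selection produces line segments (the payoff symbols below are introduced only for this example), suppose one child of a leader state offers the payoff pair $(l_1, f_1)$ and a different child offers $(l_2, f_2)$, with the leader's payoff listed first. If the leader selects the first child with probability $p$ and the second with probability $1 - p$, the expected payoff pair is

$$\big(p\, l_1 + (1 - p)\, l_2,\;\; p\, f_1 + (1 - p)\, f_2\big), \qquad p \in [0, 1].$$

As $p$ ranges from 0 to 1, this pair traces the straight segment between the two points, so every point on the segment is achievable at the leader state.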
For a state $s$ where the follower selects the action, the follower will select the action that gives it the highest value, assuming it has adopted a best response policy. For an action $a$, we compute the lowest value for which the follower should be willing to tolerate selecting $a$ over the alternatives. To compute this value, we assume the leader will make the alternatives maximally unattractive. We then modify the points and lines representing the child’s values to reflect this preference. Once that modification is completed, every value for every child can be achieved at $s$.
Once the set of points and lines needed to represent the achievable values at the root are computed, the point with the largest value for the leader can be returned as the value of the Stackelberg equilibrium for the tree. (Computing the policy itself involves unrolling this computation in reverse order and is detailed in the original paper.)
A single change is all that is needed to adapt the algorithm to produce a Stackelberg punishment: the strategy must also result in an expected reward for the follower that does not exceed a cap $\kappa$. Writing $V_L$ and $V_F$ for the leader’s and follower’s expected payoffs and $BR(\pi_L)$ for the follower’s best response to a leader policy $\pi_L$, the Stackelberg punishment is
$$\pi_L^{\kappa} = \operatorname*{argmax}_{\pi_L \,:\, V_F(\pi_L, BR(\pi_L)) \le \kappa} V_L(\pi_L, BR(\pi_L)).$$
That is, the leader maximizes its reward against a best responding follower while holding the follower’s value to a cap of $\kappa$. It follows that the leader cannot improve its value without the follower’s value rising above $\kappa$.
Our algorithm for computing a Stackelberg punishment is a simple extension over the Stackelberg equilibrium solution. In particular, it finds the point on the line segments that maximizes the leader’s value subject to the follower’s value being at most $\kappa$. For lines that fall completely below $\kappa$ in terms of their follower values, we need only check the endpoints to see which is largest for the leader. For lines that span $\kappa$ in terms of their follower values, we need to check the intersection point with $\kappa$ as well as the endpoint that falls below $\kappa$. This calculation does not increase the overall complexity over that of computing the Stackelberg equilibrium.
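The final selection step can be sketched as follows. This is a minimal illustration, assuming the root representation has already been computed and is given as a list of segments whose endpoints are (leader value, follower value) pairs; the function name `best_capped_point` and the convention of encoding an isolated point as a segment with two equal endpoints are hypothetical choices made for this example.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (leader value, follower value)
Segment = Tuple[Point, Point]


def best_capped_point(segments: List[Segment], cap: float) -> Optional[Point]:
    """Return the achievable point with the largest leader value whose
    follower value does not exceed `cap`, or None if no point qualifies."""
    best: Optional[Point] = None
    for (l1, f1), (l2, f2) in segments:
        # Endpoints at or below the cap are always candidates.
        candidates = [(l, f) for (l, f) in ((l1, f1), (l2, f2)) if f <= cap]
        # A segment that strictly spans the cap also contributes the point
        # where its follower value equals the cap.
        if (f1 - cap) * (f2 - cap) < 0:
            t = (cap - f1) / (f2 - f1)
            candidates.append((l1 + t * (l2 - l1), cap))
        for point in candidates:
            if best is None or point[0] > best[0]:
                best = point
    return best


# Hypothetical root representation with two segments.
example_segments = [((1.0, 0.0), (3.0, 4.0)), ((0.5, 1.0), (1.8, 1.5))]
capped = best_capped_point(example_segments, cap=2.0)
uncapped = best_capped_point(example_segments, cap=10.0)
```

Each segment is inspected once with a constant amount of work, which is consistent with the observation that the cap adds no asymptotic cost.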
3 Other Models
The game representation in which transitions form a general graph, payoffs can occur at any node, and actions can have stochastic effects has also been called a simple stochastic game [6] or an alternating Markov game [9]. The difference between a stochastic game [12] and an alternating stochastic game is that actions are selected nonsimultaneously in an alternating stochastic game.
Since a Stackelberg equilibrium is a Stackelberg punishment with an unbounded cap ($\kappa = \infty$), computing a Stackelberg punishment is at least as hard as computing a Stackelberg equilibrium. Letchford and Conitzer [7] provide complexity results for a variety of Stackelberg equilibrium problems, showing that allowing stochastic transitions, simultaneous actions, or DAG-structured transition functions results in an NP-hard problem. As such, we should not expect efficient algorithms for computing Stackelberg punishment in these other game models.
4 User Study
Our motivation for studying Stackelberg punishment is as a component of an algorithm that can work productively with people. We conducted an experiment to assess the efficacy of this idea in a simple SDC-inspired game that requires social negotiation.
The scenario we used consists of a one-lane bridge fed from both ends by a two-lane road (Figure 1). When two cars arrive on opposite sides of the bridge at roughly the same time, right-of-way rules dictate that the car closer to the bridge should cross, while the farther car should wait. However, the farther car has the opportunity to “bully” the closer car by crossing the bridge first, forcing the closer car to wait. In this case, a self-driving car that is hard-coded to be cautious would be forced to back off the bridge, yielding to the human bully to avoid a collision, despite having the right of way.
Our experiment takes the form of an online game in which a virtual self-driving car (controlled by our algorithm) starts on one side of the bridge, displayed at the top of the screen, and a human-controlled car starts on the other side, shown at the bottom. On each turn, a player can move one position forward, stay in place, or move one position backward. The human participants control their own car’s actions using the arrow keys on their keyboard.
Human participants were sourced online through Amazon Mechanical Turk and were rewarded monetarily based on how quickly they got to the other side of the bridge. The reward for each episode was $0.13, minus $0.01 for every two seconds before the user reached the goal. This structure was designed to encourage participants to finish quickly while still receiving a fair wage for their time (around $15/hour). We limited our study to participants from the US to ensure that all participants had experience with similar driving laws and norms. Other demographic information was not collected.
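The payment rule can be written as a small function. This is a sketch under two assumptions not stated in the text: the penalty is charged per completed two-second interval, and the reward is floored at zero. Amounts are kept in whole cents to avoid floating-point rounding.

```python
def episode_reward_cents(seconds_to_goal: float) -> int:
    """Reward for one episode in cents: 13 cents minus 1 cent for every
    completed two-second interval before the participant reaches the goal."""
    penalty = int(seconds_to_goal // 2)  # assumption: whole intervals only
    return max(0, 13 - penalty)          # assumption: never negative


quick = episode_reward_cents(5)   # two completed intervals
slow = episode_reward_cents(18)   # nine completed intervals
```

Under these assumptions, a participant who is delayed by 18 seconds loses 9 of the 13 available cents for that episode.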
Participants were placed into either a control group or an experimental group, and each participant completed 20 episodes of the game. At the start of each episode, one car begins noticeably closer to the bridge than the other (controlled so that each participant has an equal number of “close” and “far” starts). We consider the closer-starting car to always have the right of way in terms of crossing the bridge first.
In response to the human participant’s behavior in prior episodes, the SDC follows either a hard-coded “cautious” policy or a Stackelberg punishment-based policy. The policy-switching logic will be explained in Section 4.3.
4.1 Stackelberg Punishment Policy
We computed a Stackelberg punishment policy on a simplified game with four abstract positions for each car (start, before-bridge, on-bridge, finish), resulting in a total of 16 distinct arrangements of the cars. On each decision round, one car could move forward, backward, or stay in place. We built a tree-based game over these arrangements with a maximum depth of 20 (10 decision rounds for each player). The resulting tree has 2,621,437 nodes and 1,572,862 leaves. Payoffs were computed using the same scheme as for the interactive game, $0.13 minus $0.01 for each step it takes to reach the finish. Each step in this abstracted game corresponds to two seconds of gameplay in the interactive game.
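The abstract state space is small enough to enumerate directly. The sketch below lists the 16 arrangements and the per-step payoff; the identifier names and the clamping of moves at the two ends of the track are assumptions, and the rule that keeps both cars from occupying the bridge at once is not modeled.

```python
from itertools import product

POSITIONS = ("start", "before_bridge", "on_bridge", "finish")
MOVES = {"forward": 1, "stay": 0, "backward": -1}

# Every (SDC position, human position) pair: 4 x 4 = 16 arrangements.
ARRANGEMENTS = list(product(POSITIONS, repeat=2))


def move(position: str, action: str) -> str:
    """Apply one action to one car's abstract position."""
    index = POSITIONS.index(position) + MOVES[action]
    # Assumption: a car cannot leave the track, so the index is clamped.
    index = min(max(index, 0), len(POSITIONS) - 1)
    return POSITIONS[index]


def payoff_cents(steps_to_finish: int) -> int:
    """13 cents minus 1 cent per abstract step taken to reach the finish."""
    return 13 - steps_to_finish
```

With at most 10 decision rounds per player, `payoff_cents` stays between 3 and 13 cents for any car that finishes within the tree.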
The result of running the algorithm on the abstracted game is shown in Figure 3. It produces no more than line segments in any one node. Three behaviors for the SDC emerge as the follower’s reward cap $\kappa$ is set to different values: block, bully, and yield. In the bully case, the SDC always crosses the bridge first, regardless of starting position, as quickly as possible. This behavior is analogous to the human driver’s bullying behavior. The block strategy also takes the bridge first regardless of starting position, but drives slowly while on the bridge. Doing so decreases the reward for both players by forcing the human player to wait longer while the SDC crosses the bridge. To achieve more severe punishments, the block strategy waits on the bridge for more time steps before proceeding to the finish line. The yield strategy causes the SDC to let the human driver take the bridge first. For values of $\kappa$ other than those labeled in the figure, the SDC would behave according to a stochastic mixture of the pure strategies on either side of it.
In our experiments, we used a cap that results in a Stackelberg punishment in which the SDC blocks the human driver for 9 steps (18 seconds) before proceeding. Note that the Stackelberg equilibrium strategy is for the SDC to always bully (maximizing the leader’s payoff), and the minimax punishment for the game is to block the human car indefinitely (minimizing the follower’s payoff). Our Stackelberg punishment strategy strikes a more humane balance between these extremes.
4.2 Control Group
In the control group, the SDC is controlled by a naïve, cautious policy: If the SDC starts farther away from the bridge than does the human driver, it will wait until the human driver passes the bridge before proceeding. If the SDC starts closer to the bridge, it will try to cross the bridge, but will back off to avoid a collision if the human driver takes the bridge. We say the human has “bullied” the SDC if either (1) the human forces the SDC to back off the bridge and finishes first on a round where the SDC had the right of way, or (2) the human blocks the SDC from finishing within the round time limit (26 seconds). We hypothesized that once participants in the control group discover that they can force the SDC to yield to them, they will bully the SDC at every opportunity to maximize their monetary reward.
4.3 Experimental Group
In the experimental group, the SDC is controlled by a policy that can be in one of two driving modes, determined by a computational version of the folk theorem [8, 10]. In cooperative mode, the SDC is hard-coded to follow right-of-way rules and avoid collisions (as in the control group). In punishing mode, the SDC selects actions according to a computed Stackelberg punishment policy that limits the human driver’s reward. Informally, the resulting policy is: go to the start of the bridge and drive forward slowly (to block the human driver) until enough time has passed that the participant’s final reward cannot be above the imposed limit, then finish crossing the bridge.
We also use a horn to signal the SDC’s state to the human driver. In cooperative mode, the SDC will honk while it is being bullied. In punishing mode, the SDC will honk the entire round (Figure 1). Anecdotally, we found honking to be an important signaling device. Without it, the human participants did not understand the motivation behind the SDC’s reactive behavior. To our knowledge, this work is the first research that explores the use of the horn as a social signaling device for autonomous vehicles. Further experimentation is necessary to decorrelate the effects of honking from the effects of the adaptive policy, but exit survey responses (discussed in the Results section) suggest that participants’ decision-making was mainly affected by the adaptive policy.
The SDC selects its mode based on the human driver’s behavior in the previous round, tit-for-tat style. If the human driver obeys right-of-way rules, the SDC uses its cooperative mode in the following round. If the participant bullies the SDC, it switches to punishing mode in the following round. This tit-for-tat strategy provides the necessary incentives to cooperate with the SDC. Other response strategies could be used, but this is left to future work. We hypothesized that participants in the experimental group who bully the SDC at first will learn to treat the SDC fairly over the course of multiple episodes, as bullying will cause our adaptive policy to restrict the participant’s subsequent reward.
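The mode-switching rule amounts to a one-round memory. The following sketch assumes the SDC starts in cooperative mode when there is no history; the `Mode` enum and the function names are hypothetical.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    COOPERATIVE = "cooperative"
    PUNISHING = "punishing"


def next_mode(human_bullied_last_round: bool) -> Mode:
    """Tit-for-tat: punish only in the round right after being bullied."""
    return Mode.PUNISHING if human_bullied_last_round else Mode.COOPERATIVE


def modes_for_session(bullied_by_round: List[bool]) -> List[Mode]:
    """SDC mode in each round, given whether the human bullied in each round."""
    modes = [Mode.COOPERATIVE]  # assumption: no history means cooperate
    for bullied in bullied_by_round[:-1]:
        modes.append(next_mode(bullied))
    return modes


session = modes_for_session([False, True, False, True])
```

In this example the human bullies in the second round, so the SDC punishes in the third round only and returns to cooperative mode in the fourth.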
To test this hypothesis, we compared occurrences of bullying between a control group of 18 participants and an experimental group of 37 participants. (We assigned fewer participants to the control group because pilot testing suggested that their behavior would have lower variability than the experimental group.)
4.4 Results
In both the experimental and control groups, around 15% of participants never bullied. Since the conditions look exactly the same up until the first occurrence of bullying (punishing mode is never triggered), we only consider data from participants in both groups who bullied at least once, leaving 31 and 16 participants in the experimental and control groups, respectively.
Of these participants, Figure 3 shows the drop-off in the fraction who bully more than a given number of rounds. Most participants in the experimental group stop bullying after just a few initial rounds in which they experience punishment, while participants in the control group bully many more times.
The first takeaway from the control group is that human bullying of SDCs does occur. Once participants in the control group realized that the SDC would yield to them even when they did not have the right of way, they tended to take advantage of that fact at every opportunity, despite understanding it was unfair. In a post-experiment survey, control group participants commented:
Once I realized that the other car would reverse as soon as I crossed the line, I used it to my advantage. I would go no matter what so that I could cross the finish line faster.
Since the other car was completely submissive, I just did whatever was in my own best interests to ‘win’ the game.
In the post-experiment survey for the experimental group, participants expressed that the adaptive policy stopped them from bullying:
At first it made me more aggressive, since I noticed I could easily barge my way through to get a bit of extra cash. However it only took one time for me to realize anything I gained by doing that was quickly lost in the next round as the car went agonizingly slow.
When asked to rate the fairness of their driving compared to the SDC’s, only 32% of the control group described their own behavior as fair, while 91% described the SDC’s behavior as fair. In contrast, in the experimental group, 73% of subjects described their own behavior as fair (significantly different from the control group), and 85% described the SDC’s behavior as fair (not significantly different from the control group).
It is worth noting that the control group fraction in Figure 3 eventually dips because there are a limited number of rounds per subject (20 rounds), and it usually takes subjects a few rounds to “discover” that bullying is possible (that is, that the SDC will back off the bridge to let them pass). We expect that if the number of rounds were considerably larger, the fraction of control group participants who bully would remain high for an indefinite number of rounds, and the experimental group would drop to zero.
To quantitatively evaluate the results, we looked at how the adaptive policy influences drivers after their first exposure to the punishing mode. Participants in the experimental group face an SDC in punishing mode in the round immediately following their first occurrence of bullying, so we compared the fraction of subjects in both groups that bully only once to the fraction that bully more than once. We use a Fisher Exact Test with a preset alpha level to determine statistical significance. Table 1 shows the categorical data from our experiments. The Fisher Exact Test gives a p-value of 0.0016, meaning that the adaptive Stackelberg punishment policy significantly reduced repeat bullying.

| | Control | Experimental |
|---|---|---|
| Bullied Only Once | 0 | 14 |
| Bullied More Than Once | 16 | 17 |

Table 1: The contingency table for bullying as a function of participant group.
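The test on Table 1 can be re-run with the standard library alone. This sketch assumes the usual two-sided convention of summing the probabilities of all tables, with the same margins, that are no more likely than the observed one; the function name is hypothetical, and the analysis tool used in the study is not specified.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lowest = max(0, col1 - (n - row1))
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small tolerance so that equally likely tables are counted despite rounding.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Rows: bullied only once, bullied more than once.
# Columns: control, experimental.
p_value = fisher_exact_two_sided(0, 14, 16, 17)
print(f"p = {p_value:.4f}")
```

Under this convention, the computed value agrees with the reported 0.0016 after rounding to four decimal places.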
5 Conclusion
Research on self-driving cars has historically focused on the hard technical problems of perception, planning and control. Social interaction between autonomous vehicles and human drivers has been largely overlooked, but has major implications for the mainstream adoption of self-driving technology.
In this paper, we explored “right-of-way bullying,” a social problem that could hinder the effectiveness of self-driving cars. Through an online experiment with human subjects, we showed that such bullying does occur in a simplified driving scenario. By adopting an adaptive driving policy based on a novel Stackelberg punishment formulation, we showed how to significantly decrease repeat occurrences of bullying and encourage prosocial driving behavior.
Future work should explore how Stackelberg punishment could interact with hard-coded safety features in a production self-driving car. In addition, the solution algorithm needs to be made considerably more efficient to scale to more complex social behaviors with finer-grained states and actions.
We hope that this work can be a foundation for further investigation of autonomous driving as an inherently social problem that necessitates novel technological, behavioral and sociological solutions.
References
 [1] Basu, C., Yang, Q., Hungerman, D., Singhal, M., Dragan, A.D.: Do you want your autonomous car to drive like you? In: ACM/IEEE International Conference on Human-Robot Interaction. pp. 417–425 (2017)
 [2] Brodsky, J.S.: Autonomous vehicle regulation: How an uncertain legal landscape may hit the brakes on self-driving cars. Berkeley Technology Law Journal 31, 851–878 (2016)
 [3] Brooks, R.: Unexpected consequences of self driving cars (2017), blog post: rodneybrooks.com/unexpectedconsequencesofselfdrivingcars/
 [4] Bruce, A.: Planning for human-robot interaction: Representing time and human intention (2005), Ph.D. thesis, Robotics Institute, Carnegie Mellon University
 [5] Chesterman, S.: Do driverless cars dream of electric sheep? SSRN (2016), available at SSRN: https://ssrn.com/abstract=2833701 or http://dx.doi.org/10.2139/ssrn.2833701
 [6] Condon, A.: The complexity of stochastic games. Information and Computation 96(2), 203–224 (February 1992)
 [7] Letchford, J., Conitzer, V.: Computing optimal strategies to commit to in extensive-form games. In: Proceedings of the 11th ACM Conference on Electronic Commerce. pp. 83–92. ACM (2010)
 [8] Littman, M.L., Stone, P.: A polynomial-time Nash equilibrium algorithm for repeated games. Decision Support Systems 39(1), 55–66 (2005)
 [9] Littman, M.L.: Algorithms for Sequential Decision Making. Ph.D. thesis, Department of Computer Science, Brown University (February 1996), also Technical Report CS-96-09
 [10] Munoz de Cote, E., Littman, M.L.: A polynomial-time Nash equilibrium algorithm for repeated stochastic games. In: 24th Conference on Uncertainty in Artificial Intelligence (UAI’08) (2008)
 [11] Osborne, M.J., Rubinstein, A.: A Course in Game Theory. The MIT Press (1994)
 [12] Shapley, L.: Stochastic games. Proceedings of the National Academy of Sciences of the United States of America 39, 1095–1100 (1953)
 [13] Tennant, C., Howard, S., Franks, B., Bauer, M.W.: Autonomous vehicles: Negotiating a place on the road (2016), online report: http://www.lse.ac.uk/websitearchive/newsAndMedia/PDF/AVsnegociatingaplaceontheroad1110.pdf