Learning Multi-agent Implicit Communication Through Actions: A Case Study in Contract Bridge, a Collaborative Imperfect-Information Game

10/10/2018 · by Zheng Tian, et al.

In situations where explicit communication is limited, a human collaborator is typically able to learn to: (i) infer the meaning behind their partner's actions and (ii) balance between taking actions that are exploitative given their current understanding of the state and those that convey private information about the state to their partner. The first component of this learning process has been well studied in multi-agent systems, whereas the second, which is equally crucial for a successful collaboration, has not. In this work, we complete the learning process and introduce our novel algorithm, Policy-Belief-Iteration ("P-BIT"), which mimics both components mentioned above. A belief module models the other agent's private information by observing their actions, whilst a policy module makes use of the inferred private information to return a distribution over actions. The two are mutually reinforced with an EM-like algorithm, and a novel auxiliary reward encourages information exchange through actions. We evaluate our approach on the non-competitive bidding problem from contract bridge and show that, through self-play, agents are able to collaborate effectively with implicit communication; P-BIT outperforms several meaningful baselines.





Communication is fundamental in collaborative multi-agent systems so that agents can learn to interact as a collective, as opposed to a collection of individuals. This is particularly important in the imperfect-information setting, where parts of the state that are hidden in one party’s partial view of the world — but observable by others — may be critical to their success. This motivates the need for a communication protocol between agents, so that hidden information can be exchanged, participants can learn about the true state of the world, and actions can be taken to coordinate joint activity.

Typically in multi-agent systems, we facilitate inter-agent communication by incorporating an explicit communication channel. This is conceptually similar to language, or verbal communication, which is known to be important for maintaining mutual understanding in human social interaction [Baker et al. 1999]. Explicit communication channels come at a cost [Roth, Simmons, and Veloso 2006], however, and can be difficult to employ under decentralized control. Moreover, in some games (e.g. bridge, Hanabi) and real-world applications (e.g. driving, coordination of storage robots, team sports) explicit communication is either not allowed or not efficient. In these situations, humans are effective at learning to infer the meaning of others' actions implicitly [Heider and Simmel 1944] and at relying upon non-verbal communication as a means of information exchange [Rasouli, Kotseruba, and Tsotsos 2017].

Previous works have considered ways in which an agent can build a model (either implicit [He et al. 2016, Bard et al. 2013, Bjarnason and Peterson 2002] or explicit [Raileanu et al. 2018, Lockett, Chen, and Miikkulainen 2007, Hernandez-Leal and Kaisers 2017, Li and Miikkulainen 2018]) of another agent's characteristics, objectives or hidden information by observing its behavior. Whilst these works are of great value, they overlook the fact that an agent should also consider that it is itself being modeled and adapt its behavior accordingly. For example, in collaborative tasks an agent could choose to take actions that are informative to its teammates, whereas in competitive situations agents might benefit from acting in a way that limits others' ability to model them. Our work combines modeling of others with a policy that also accounts for being modeled.

Our proposed Policy-Belief-Iteration (P-BIT) is a general framework for learning to cooperate in imperfect-information multi-agent games. P-BIT consists of a belief module, which models other agents' hidden information from their previous actions, and a policy module, which combines the agent's current observation with the inferred hidden information to return a distribution over actions. The two modules are trained iteratively: at each P-BIT iteration, the belief module learns to infer the meaning behind others' actions more accurately, and the policy module learns to take actions that are more informative (without compromising overall performance in the game). We show the convergence of P-BIT using expectation maximization.

We evaluate P-BIT on the non-competitive bidding challenge from contract bridge; our experiments show that agents trained with P-BIT learn collaborative behaviors more effectively than a number of meaningful baselines, without any explicit communication.

Related Work

Multi-agent Communication and Agent Modeling

Agent modeling has been well studied in navigation, motion planning [Vemula, Muelling, and Oh 2017, Bai et al. 2015, Bandyopadhyay et al. 2013] and human-robot collaborative planning [Mainprice and Berenson 2013, Wang et al. 2013]. However, these works mainly focus on human-robot interactions. Foerster et al. [2017] focus on using agent modeling to solve the classical non-stationarity problem in training caused by the presence of multiple learning agents. They assume information is fully observable to every agent. In our work, by contrast, we assume that each agent has some private information and that being able to model other agents' private information gives an agent an advantage in the game, which is not the case in [Foerster et al. 2017].

Our work is more closely related to [He et al. 2016, Raileanu et al. 2018, Rabinowitz et al. 2018], where an agent builds models to estimate other agents' strategies or private information. The first two works incorporate the belief into an agent's decision making and show that agent modeling can improve an agent's performance in imperfect-information games. Rabinowitz et al. [2018] focus purely on agent modeling, but in the more challenging setting in which an observer has to model multiple types of agents with limited data.

Recently there has also been a surge of interest in using reinforcement learning (RL) to learn communication protocols [Foerster et al. 2016, Mordatch and Abbeel 2017, Sukhbaatar, Szlam, and Fergus 2016, Lazaridou, Peysakhovich, and Baroni 2016]. Most of these works enable agents to communicate over an explicit discrete or continuous channel whose protocol is learned by reinforcement learning. Mordatch and Abbeel [2017] observe the emergence of non-verbal communication in collaborative environments without an explicit communication channel; to our knowledge, however, an agent can only be either a sender or a receiver in their setting. We do not have this restriction, which makes our problem more difficult. Knepper et al. [2017] propose a framework of implicit communication in a cooperative setting and show that various problems can be mapped into it. Although our work is conceptually close to [Knepper et al. 2017], we go beyond it and actually train agents to solve problems with implicit communication.

Our work lies in an interesting overlap between agent modeling and multi-agent communication. A factor distinguishing our work from previous work on multi-agent communication is that we have no explicit communication channel; information exchange can only happen through actions, i.e. non-verbal communication. An agent therefore needs to build a belief model of another agent's private information and update it online as additional observations arrive. This belief can be incorporated into the agent's decision making for better performance. To our knowledge, ours is the first work to combine agent modeling and implicit communication to solve a coordination problem with deep RL.

Computerized Bridge Program

Imitation learning has been used to learn human bidding systems [Yegnanarayana, Khemani, and Sarkar 1996, DeLooze and Downey 2007]. Gambäck et al. [1993] and Ginsberg [1999] enhance lookahead search in bidding by borrowing rules from human bidding systems. Amit and Markovitch [2006] go further in this direction and propose a decision-tree model, PIDM (Partial Information Decision Making), which uses Monte Carlo sampling to predict bids and prunes the tree with a learned refinement strategy. Ho and Lin [2015] build a decision-tree model with a contextual bandit algorithm; they learn a bidding model without reference to rules designed by human experts. However, their model is limited to choosing among at most 5 bids, and a hand-crafted feature representation is used to facilitate learning.

Recent works have achieved promising results, but their successes all rely on human expert knowledge. The first work to learn a bidding system without direct human domain knowledge is [Yeh and Lin 2016], who propose an algorithm that learns a Q-function at each step of bidding. However, because a separate Q-function is learned for each time step, their approach is less practical and does not scale.

Figure 1: Two agents interacting through P-BIT. Figure 2: The P-BIT training process.

Algorithm 1: Policy Belief Iteration (P-BIT)
1: initialize $k \leftarrow 0$
2: train the first policy $\pi_0$ naively
3: initialize belief module $\Phi_0$
4: for $k = 1$ to $K$ do
5:    sample episodes $D_k$ for belief training
6:    train belief module $\Phi_k$ on $D_k$
7:    train policy $\pi_k$ given the belief module $\Phi_k$


Multi-Agent POMDP

We frame our problem as a cooperative partially observable Markov game. A Markov game for $N$ agents is defined as a tuple $(S, \{O_i\}, \{A_i\}, T, \{r_i\})$, where $S$ is the set of environment states, $O_i$ is the set of partial observations of agent $i$, and $A_i$ is the set of actions of agent $i$. The partial observation of each agent is given by a deterministic function $\omega_i : S \to O_i$. Initial states are determined by a distribution $\rho_0 : S \to [0, 1]$. State transitions are determined by a function $T : S \times A_1 \times \dots \times A_N \to S$. For each agent $i$, a reward is given by a function $r_i : S \times A_1 \times \dots \times A_N \to \mathbb{R}$. To choose actions, each agent $i$ uses a stochastic policy $\pi_{\theta_i} : O_i \times A_i \to [0, 1]$.

In this paper, we focus on a cooperative environment with two agents, i.e. $N = 2$. We assume that agents share the same action and observation spaces, act according to the same policy $\pi_\theta$, and receive a shared reward. The objective is to find a policy that maximizes the expected shared return, posed as a joint maximization problem:

$$\max_\theta \ \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big],$$

where $\gamma \in [0, 1)$ is a discount factor.

We denote the discounted state distribution by $\rho^\pi(s') = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s' \mid s_0, \pi)$, where $\Pr(s_t = s' \mid s_0, \pi)$ is the probability of being in state $s'$ after transitioning for $t$ time steps from state $s_0$ [Sutton and Barto 1998]. Then we can rewrite the objective as:

$$J(\theta) = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\big[r(s, a)\big],$$

where $\mathbb{E}_{s \sim \rho^\pi}[\cdot]$ denotes the expectation with respect to the discounted state distribution $\rho^\pi$. In particular, although $\rho^\pi$ depends on the policy parameters, the policy gradient does not depend on the gradient of the state distribution [Sutton and Barto 1998].

Nonverbal Communication Environment

In a nonverbal communication environment, each agent $i$ has private information hidden from the others at time step $t$, denoted $x_i \in X$, where $X$ is an agent's private state space. For agent $i$, we define the set of unobservable information held by all other agents as $x_{-i} = \{x_j : j \neq i\}$. In our case, where $N = 2$, $x_{-i} = x_j$. To succeed in a partially observable environment, agent $i$ maintains a belief module to infer $x_{-i}$ from its observed history of actions. We denote agent $i$'s belief about the set of all other agents' private information at time step $t$ by $b_i^t$.

Now we define the history of actions observed by agent $i$ at time step $t$ as $h_i^t = (\mathbf{a}^0, \mathbf{a}^1, \ldots, \mathbf{a}^{t-1})$, where boldface $\mathbf{a}$ denotes the one-hot encoding of action $a$. We assume that actions made by any agent are observable to all other agents and that agents have perfect memory of the history. The history perceived by agent $i$ differs from that of other agents in the order in which actions are placed in the history vector $h_i^t$. The full state of the environment is $s = (x_1, \ldots, x_N)$, and the observation of agent $i$ is $o_i^t = (x_i, h_i^t)$.
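The history encoding above can be sketched in a few lines of Python. The action-space size below is an illustrative assumption (35 contract bids plus PASS), not a value given in the paper:

```python
NUM_ACTIONS = 36  # assumed: 35 contract bids (7 levels x 5 strains) + PASS

def one_hot(action_index, num_actions=NUM_ACTIONS):
    """One-hot vector a for a single action index."""
    vec = [0.0] * num_actions
    vec[action_index] = 1.0
    return vec

def encode_history(actions, num_actions=NUM_ACTIONS):
    """Encode the observed action sequence as a list of one-hot vectors.

    The row order records whose action came first, which is exactly how
    one agent's perceived history differs from its partner's.
    """
    return [one_hot(a, num_actions) for a in actions]
```

Each agent feeds its own ordering of this sequence to its belief module.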

Policy Belief Iteration (P-BIT)

In our model (summarized in Fig. 1), agent $i$ models the unseen information $x_{-i}$ at time step $t$ using a parameterized belief module $\Phi$, which takes only the history of actions as input and outputs a distribution representing its belief about the other agent:

$$b_i^t = \Phi(h_i^t).$$

In addition, agent $i$ has a parameterized policy module $\pi_\theta$ for choosing actions, which takes its private information and belief as inputs and outputs a distribution over legal actions:

$$a_i^t \sim \pi_\theta(\cdot \mid x_i, b_i^t).$$
Henceforth we omit the subscripts of $x$, $b$ and $h$ where unambiguous. The weights of the belief and policy modules are shared among agents. In every policy-belief iteration $k$, we first train a new policy $\pi_k$ by policy gradient, with the belief module $\Phi_{k-1}$ from the previous iteration held fixed. We then sample a dataset $D_k$ by self-play using $\pi_k$, and train $\Phi_k$ on $D_k$ by supervised learning to minimize the divergence between the other agents' private information $x_{-i}$ and one's own belief vector $b_i$.

We use centralized training, treating all agents as one single agent with different observations. Information is therefore leaked during training, both through gradient updates and through the use of the ground truth as training targets for the belief module. At test time, however, even though the players use the same trained network, their private information remains hidden and there are no gradient updates. We omit the agent-identity subscript when we treat the agents as one agent with different observations and beliefs.

As we do not have a trained belief module at the first P-BIT iteration ($k = 0$), we train a naive policy that takes actions without considering the existence of other agents. We can then iterate between training the two modules until we either run out of computing resources or the policy and belief modules converge.

Fig. 1 illustrates a simple way in which two agents interact using their policy and belief modules. Fig. 2 shows the P-BIT training process; see Algorithm 1 for pseudo-code.
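The iteration described above can be sketched as a short Python skeleton. The three callables stand in for the policy-gradient and supervised-learning sub-routines; their names and signatures are illustrative, not the paper's API:

```python
def policy_belief_iteration(num_iterations, train_policy, sample_episodes,
                            train_belief):
    """Skeleton of Algorithm 1 (P-BIT).

    train_policy(belief) -> policy     : policy-gradient step, belief fixed
    sample_episodes(policy) -> dataset : self-play with the current policy
    train_belief(dataset) -> belief    : supervised belief-module update
    """
    belief = None                    # iteration 0: no trained belief module yet
    policy = train_policy(belief)    # naive policy ignoring the other agent
    for _ in range(num_iterations):
        dataset = sample_episodes(policy)  # D_k sampled by self-play
        belief = train_belief(dataset)     # fit belief module on D_k
        policy = train_policy(belief)      # retrain policy with belief fixed
    return policy, belief
```

The loop terminates after a fixed budget, matching the paper's stopping criterion of exhausting compute or reaching convergence.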

An EM-like Approach

In this section, we show that P-BIT is an EM-style coordinate ascent, alternating a supervised E-step trained on self-play data with a policy-gradient M-step.

Solving reinforcement learning as an inference problem has a long history [Dayan and Hinton 1997] and has been well studied [Levine and Koltun 2013, Vlassis et al. 2009, Kober and Peters 2009, Abdolmaleki et al. 2018, Furmston and Barber 2010]. Following these approaches, we first reformulate the optimization problem from the Multi-Agent POMDP section as an inference problem by introducing a binary random variable $R$ at each time step that serves as an indicator of "optimality". The reward is bounded, but the optimality of the reward given an agent's observable state is unknown and can be regarded as a random variable. We therefore switch from optimizing expected reward to maximizing the log-likelihood of optimality with respect to the discounted state distribution $\rho^\pi$ and the policy $\pi_\theta$:

$$\max_\theta \ \log p_{\pi_\theta}(R = 1 \mid o),$$

where the observation $o$ is a part of the state $s$.

By casting our reinforcement learning problem as an inference problem, we can apply Expectation Maximization (EM) to find the optimal policy. Previous works [Levine and Koltun 2013, Vlassis et al. 2009, Kober and Peters 2009] assume the environment is stochastic but the state transitions are unknown, and therefore build a lower bound on the likelihood of optimality by treating the trajectory of a game as a latent variable. In our work, we instead take, for each agent, the set of all other agents' private information $x_{-i}$ as the hidden variable. We build the free energy $\mathcal{F}(q, \theta)$, a lower bound on $\log p_{\pi_\theta}(R = 1 \mid o)$, using a variational distribution $q(x_{-i})$:

$$\log p_{\pi_\theta}(R = 1 \mid o) \;\geq\; \mathbb{E}_{q(x_{-i})}\left[\log \frac{p_{\pi_\theta}(R = 1, x_{-i} \mid o)}{q(x_{-i})}\right] \;=\; \mathcal{F}(q, \theta).$$

The detailed derivation of the lower bound is given in the appendix. We distinguish the (multivariate) random variable $X_{-i}$ from $x_{-i}$, the value(s) it can take.

In this work, we parameterize the auxiliary distribution $q$ by $\phi$; it is effectively the belief module $\Phi$. Henceforth, we substitute the belief module $\Phi$ for the auxiliary distribution $q$.

In the M-step, we maximize $\mathcal{F}$ as if the latent variable $x_{-i}$ were not hidden:

$$\theta_{k+1} = \arg\max_\theta \ \mathbb{E}_{x_{-i} \sim \Phi_k}\big[\log p_{\pi_\theta}(R = 1, x_{-i} \mid o)\big].$$

Following [Abdolmaleki et al. 2018], we set the likelihood of optimality proportional to the exponentiated reward, which gives:

$$\theta_{k+1} = \arg\max_\theta \ \mathbb{E}_{x_{-i} \sim \Phi_k,\, a \sim \pi_\theta}\big[r(s, a)\big],$$

where the belief $b$ is assumed to fully describe $x_{-i}$ in our work. This shows that maximizing $\mathcal{F}$ in the M-step is equivalent to optimizing our policy module $\pi_\theta$ while the belief module $\Phi$ is held fixed.

For the E-step, as can be seen from the bound above, making the bound tight requires setting $q(x_{-i}) = p(x_{-i} \mid R = 1, o)$. In practice, however, when the environment state space is too large we can neither evaluate $p(x_{-i} \mid R = 1, o)$ for every $x_{-i}$ in the discrete case nor obtain its analytical form in the continuous case. Instead, we draw samples from it: assuming the current greedy policy gives rise to optimality, we generate a dataset $D_k$ by self-play. To bring $\Phi$ closer to $p(x_{-i} \mid R = 1, o)$, we minimize the KL-divergence with respect to $\phi$:

$$\phi_{k+1} = \arg\min_\phi \ \mathbb{E}_{(h,\, x_{-i}) \sim D_k}\big[-\log \Phi_\phi(x_{-i} \mid h)\big].$$
Auxiliary Communication Reward

Knowing that others hold beliefs about its private information, an agent can assess how informative an action was by examining how much closer others' beliefs moved toward its private information after that action. To build a better shared understanding, the agent can use this signal from past experience to take more informative actions in the future.

Inspired by this idea, we use an auxiliary communication reward to incentivize agents to communicate through actions. Specifically, for agent $i$'s action at time step $t$, we set

$$r^c_t = \max\big(0,\; d(x_i, b^*) - d(x_i, b_j^t)\big),$$

where $b^*$ is agent $j$'s best belief so far about agent $i$'s private information and $d$ is a distance function; in our work we use the Kullback-Leibler divergence. The communication reward for agent $i$ at time step $t$ is thus positive only if agent $j$'s belief after the action is closer to agent $i$'s private information than $b^*$, in which case it becomes the new $b^*$. At each time step, an agent receives a combined reward $r_t = r^e_t + \beta r^c_t$, where $r^e_t$ is the environment reward and $\beta$ balances communication against environment reward.
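A minimal sketch of this reward, assuming the KL divergence is taken element-wise over per-card Bernoulli beliefs; the exact functional form in the paper may differ from this reconstruction:

```python
import math

def kl_bernoulli(x, b, eps=1e-8):
    """KL divergence between Bernoulli targets x and belief probabilities b."""
    total = 0.0
    for xi, bi in zip(x, b):
        xi = min(max(xi, eps), 1 - eps)  # clip to avoid log(0)
        bi = min(max(bi, eps), 1 - eps)
        total += xi * math.log(xi / bi) + (1 - xi) * math.log((1 - xi) / (1 - bi))
    return total

def communication_reward(x_i, belief_t, best_dist):
    """Reward the improvement over the best (smallest) distance seen so far.

    Returns (reward, updated best distance); reward is zero when the
    partner's belief did not move closer to x_i than ever before.
    """
    d = kl_bernoulli(x_i, belief_t)
    if d < best_dist:
        return best_dist - d, d
    return 0.0, best_dist
```

In training, `best_dist` would be carried across the time steps of an episode, initialized from the belief under a uniform prior.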


Contract Bridge

In this paper, we implement P-BIT for non-competitive contract bridge bidding, an imperfect-information, fully cooperative game that requires agents to exchange information through actions in order to settle on a good contract.

Contract bridge, or bridge, is a trick-taking game played with a standard 52-card deck by four players, North, East, South and West (henceforth N/E/S/W). The players are split into two teams, with the players of each team sitting opposite one another (Team 1: N/S; Team 2: E/W). The deal is followed by bidding and playing phases.

In bidding, players bid sequentially for a contract until a proposed one is agreed, normally called the final contract. A PASS bid makes no change to the previously proposed contract, and a contract is agreed only once it is followed by 3 PASS bids. A non-PASS bid proposes a new contract and has the form (level, strain), where the level is an integer from 1 to 7 and the strain is one of {♣, ♦, ♥, ♠, NT}. The level determines the number of tricks needed to fulfill the contract, and the strain is the trump suit, with NT standing for "no trump". In each deal, points are assigned to the contract-declaring team if they fulfill the contract, and otherwise to the other team (the "defenders"). Bidding must be non-decreasing: the level cannot go down, and it must go up if the newly proposed trump suit comes no later than that of the previous contract in the order ♣ < ♦ < ♥ < ♠ < NT.
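The legality rule amounts to a total order over the 35 contract bids. A small sketch, with suit symbols mapped to the letters C/D/H/S for convenience:

```python
# Strains rank C < D < H < S < NT within a level; a new contract bid must
# be strictly higher than the standing contract in this total order.
STRAINS = ['C', 'D', 'H', 'S', 'NT']

def bid_index(level, strain):
    """Position of a contract bid in the total order 1C < 1D < ... < 7NT."""
    return (level - 1) * len(STRAINS) + STRAINS.index(strain)

def is_legal_bid(new_bid, current_contract):
    """A non-PASS bid is legal iff strictly higher than the standing contract."""
    if current_contract is None:  # opening bid: any contract may be proposed
        return True
    return bid_index(*new_bid) > bid_index(*current_contract)
```

For example, 2♣ may follow 1NT because it sits higher in the order, even though its strain ranks lowest within a level.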

Bridge is a great challenge for AI mainly because of its imperfect-information nature: a player has perfect information about neither its opponents nor its partner. The best chance for a team to exchange key information is the bidding phase. Professional human players normally use a highly refined bidding system, effectively a pre-designed implicit communication protocol for exchanging information. Ambiguity can still exist within a system, and human players resolve it with intelligence and experience. This is extremely hard for an AI agent, because bidding must follow strict rules, so the information that can be conveyed is limited. In addition, since the final contract greatly affects the result of the game, a player must strike a balance between exchanging information and steering toward a good final contract. Bidding in bridge is therefore an appropriate and challenging test bed for P-BIT.

Problem Setup

In this work, we focus on bidding in bridge without competition: Players N and S bid in the game, while Players W and E always bid PASS. In this setting, the declaring team never changes, so each deal can be viewed as an independent episode of the game. The private information of Player $i$ is its hand $x_i$, a 52-dimensional binary vector encoding its 13 cards. An agent's observation at time step $t$ consists of its hand and the bidding history: $o_i^t = (x_i, h_i^t)$. In each episode, Players N and S are dealt hands $x_N$ and $x_S$ respectively. Together their hands describe the full state of the environment $s = (x_N, x_S)$, which is not fully observable by either player.

As we are only interested in learning a computerized bidding system through self-play, and playing out the game for every possible contract given a state $s$ is computationally expensive, we use Double Dummy Analysis (DDA) [Haglund 2010] to estimate scores for every possible contract without actually playing out the game. The result is a vector $u$ whose elements are the scores of each possible contract given state $s$. DDA assumes that, for a particular deal, each player's hand is fully observable by the other players and that players always play their cards to best advantage. However, given $(x_N, x_S)$, the distribution of the remaining cards over the two non-bidding players, Player W and Player E, is still unknown. To reduce the variance of the estimate, we repeatedly sample a deal by allocating the remaining cards randomly to Players W and E $M$ times, and estimate $u$ by averaging their DDA scores:

$$u = \frac{1}{M} \sum_{m=1}^{M} \text{DDA}\big(x_N, x_E^{(m)}, x_S, x_W^{(m)}\big),$$

where $x_E^{(m)}$ and $x_W^{(m)}$ are the hands of Players E and W in the $m$-th sample. For a specific contract $c$, the corresponding score is given by $u[c]$.
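The averaging step can be sketched as follows, with `dda_score` standing in for an external double-dummy solver (a hypothetical callable, not the paper's implementation):

```python
import random

def estimate_scores(hand_n, hand_s, dda_score, num_samples=20):
    """Average double-dummy scores over random completions of the E/W hands.

    dda_score(hand_n, hand_e, hand_s, hand_w) -> list of per-contract scores
    is assumed to wrap an external solver such as Haglund's DDS.
    """
    held = set(hand_n) | set(hand_s)
    deck = [c for c in range(52) if c not in held]  # 26 unseen cards
    totals = None
    for _ in range(num_samples):
        rest = deck[:]
        random.shuffle(rest)
        hand_e, hand_w = rest[:13], rest[13:26]     # one random E/W split
        scores = dda_score(hand_n, hand_e, hand_s, hand_w)
        totals = scores if totals is None else [t + s for t, s in zip(totals, scores)]
    return [t / num_samples for t in totals]
```

The number of samples trades estimate variance against the cost of extra solver calls.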

Generating a new pair of hands for Players N and S and computing scores for every possible contract at the start of each episode during policy training would be inefficient. Instead, we pre-generate 1.5 million deals and score them in advance, then sample episodes from this dataset when training a policy. We also generate a separate test dataset of 30K deals for evaluating our models.

To show that non-competitive bidding is still an appropriate problem for testing P-BIT, we examine the 1.5 million episodes. We find that the optimal contract (the contract giving the highest reward at the lowest risk for a given deal) is usually at Level 1, sometimes at Level 6 or 7, and crucially never at the "mid-level" contracts in between. Bidding and successfully making a Level 6-7 contract yields a much higher reward than other contracts, owing to the "slam bonus". The non-competitive game is therefore interesting because players must learn to sometimes bid non-optimal contracts in order to exchange information with their partner. Competitive bidding is far more complex, requiring not only a communication strategy with one's partner but also skills such as hiding information from opponents and disrupting the opposing team's communication; we leave it to future work.

Note that Double is only a valid bid in response to a contract proposed by one's opponents, and a Redouble must be preceded by a Double. In the non-competitive game the opponents propose no contracts, so these options naturally never arise.

Model Architecture

We parameterize the policy and belief modules with two neural networks, $\pi_\theta$ and $\Phi_\phi$ respectively; a figure of the model is given in the appendix. The weights of the belief and policy modules are each shared between players. The last-layer activation of the belief module is a sigmoid, so each element of the belief vector lies in $(0, 1)$. The input to a player's policy is the sum of its private information $x$ and its belief $b$. By adding the belief to $x$, we reuse the parameters associated with $x$ for $b$ and avoid training extra parameters. All training details can be found in the appendix.

Pre-train Policy

In the first P-BIT iteration ($k = 0$), to obtain a good initial policy and avoid an overly deterministic one, we train the policy to predict the distribution obtained by applying a softmax with temperature $T$ to the pre-computed score vector $u$. The pre-training loss is:

$$L_{\text{pre}} = D_{KL}\big(\text{softmax}(u / T) \,\big\|\, \pi_\theta\big),$$

where $D_{KL}$ is the KL-divergence. For a fair comparison, all our baselines are also initialized with this pre-trained policy. Supervising a bidding policy to predict pre-calculated scores for all actions only gives the policy a basic understanding of its own hand. It confers no great advantage for solving the whole problem, since the policy has access to only one player's hand, and the accuracy of predicting the optimal contract from a single hand is low.
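The softmax target and KL pre-training loss can be sketched in plain Python (illustrative only; the paper trains this with a neural network):

```python
import math

def softmax(scores, temperature=1.0):
    """Convert DDA scores into a soft target distribution over bids."""
    z = [s / temperature for s in scores]
    m = max(z)                              # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted): the pre-training loss for the policy."""
    return sum(t * math.log(max(t, eps) / max(p, eps))
               for t, p in zip(target, predicted))
```

A higher temperature flattens the target, which is how the pre-training avoids an overly deterministic initial policy.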

Learning A Belief Network

When a player models its partner's hand from the observed bidding history, we assume it may ignore the constraint that its partner holds exactly 13 cards. We therefore treat predicting the partner's hand given the observed bidding history as a 52-label classification problem, in which each label indicates whether the corresponding card is in the partner's hand. In other words, we treat each card of the 52-card deck being in the partner's hand as an independent Bernoulli variable, and train the belief network by maximizing the joint likelihood of these 52 Bernoulli distributions given a bidding history $h$. This gives the belief-network loss:

$$L_{\text{belief}} = -\sum_{j=1}^{52} \big[x_j \log b_j + (1 - x_j) \log(1 - b_j)\big],$$

where $x_j$ and $b_j$ are elements of the binary encoding of the partner's hand $x$ and of the agent's belief $b$. The reasoning behind this assumption is that we consider a more accurate prediction over an invalid distribution preferable to a less accurate prediction over a valid one.
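This loss is the standard binary cross-entropy summed over the 52 cards; a minimal sketch:

```python
import math

def belief_loss(partner_hand, belief, eps=1e-8):
    """Negative joint log-likelihood of 52 independent Bernoulli variables.

    partner_hand: 52-dim binary vector x (1 = card is in partner's hand)
    belief:       52-dim vector b of predicted probabilities
    """
    loss = 0.0
    for x_j, b_j in zip(partner_hand, belief):
        b_j = min(max(b_j, eps), 1 - eps)   # clip to avoid log(0)
        loss -= x_j * math.log(b_j) + (1 - x_j) * math.log(1 - b_j)
    return loss
```

A belief concentrated on the true 13 cards drives the loss toward zero, while a uniform belief pays about 52·ln 2 ≈ 36 nats.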


To demonstrate its success, we compare P-BIT with the following four baselines:

  1. Independent Player (IP): A simple baseline in which a player bids independently, without any consideration of the other player's existence.

  2. No communication reward (NCR): One important question is how beneficial the auxiliary communication reward is for learning a good bidding strategy, since this additional reward effectively changes the problem being solved. This baseline uses the same architecture and training schedule as P-BIT but sets the weight $\beta$ on the communication reward to zero.

  3. No P-BIT style iteration (NPBI): To demonstrate that iterating between policy and belief training helps, we compare against a baseline whose policy is trained with the same number of weight updates as our model, but with no further P-BIT iterations after the belief network is trained at P-BIT iteration $k = 1$.

  4. Penetrative Q-Learning (PQL): Yeh and Lin [2016] proposed penetrative Q-learning, the first algorithm to learn a non-competitive bidding policy without human domain knowledge. However, because they learn a Q-network at each time step, they must pre-define the maximum number of bids in an episode, which makes the model less scalable and hard to generalize to other problems. Additionally, they supervise the Q-networks with relative rewards (how good one action is relative to the others in a given state) rather than absolute ones, which effectively allows several rounds of trial and error within a single episode. These tricks give PQL a large advantage by leaking crucial information to the policy without the need for good exploration.

Figure 3: Average scores of P-BIT and four baselines on 30K test dataset over 6 training runs.

Fig. 3 shows the average learning curves of our model and the four baselines on the pre-generated test dataset of 30K pairs of hands, averaged over 6 training runs. As can be seen, IP and NCR both learn faster than our model initially. This is reasonable, as P-BIT spends more time learning a communication protocol at first; however, P-BIT soon starts to outperform these baselines. IP converges quickly to a local optimum. NCR learns a better bidding strategy than IP thanks to its belief module, but learns more slowly and less stably than P-BIT because it has no guidance on how to convey information to its partner. We also observe that only NCR's performance drops in the first few policy-gradient iterations. Since neither IP, which has no belief module, nor P-BIT, which has both a belief module and the guidance of the communication reward, shows such a decrease, the drop in NCR's performance is plausibly caused by "noise" from its belief network in the absence of guidance. NPBI also outperforms IP and NCR, but it learns much more slowly in the later stages of training and converges to a worse optimum than P-BIT, which shows the importance of iterative training of the policy and belief modules. We evaluate the best version of PQL on our test dataset and show that our model outperforms it. P-BIT achieves the largest improvement over the IP baseline of all models, with PQL, despite its unfair advantages, coming second.

We also observe that P-BIT's performance never decreases across P-BIT iterations, empirically confirming the non-decreasing property of the EM approximation. Fig. 3 shows 8 P-BIT iterations, each comprising 100 policy-gradient iterations; all baselines are run for the same number of policy-gradient iterations as P-BIT.

Bid     ♣    ♦    ♥    ♠   Total
PASS   1.4  1.4  1.4  1.4    5.6
1♣     4.7  1.9  1.9  2.0   10.5
1♦     2.0  4.8  1.9  2.0   10.7
1♥     2.3  2.3  4.8  2.2   12.0
1♠     2.2  2.3  2.2  4.8   11.5
1NT    4.6  4.6  3.7  3.0   15.9
6♣     7.8  4.0  4.3  7.3   23.4
6♦     4.0  7.8  4.6  6.4   22.8
6♥     4.5  5.7  6.9  6.4   23.5
6♠     6.7  3.0  5.8  8.8   24.3
Table 2: Opening bid - own aHCPs.

Bid     ♣    ♦    ♥    ♠   Total
PASS   1.4  1.5  1.5  1.4    5.8
1♣     4.8  1.9  1.9  2.0   10.6
1♦     2.0  4.9  1.8  2.0   10.7
1♥     2.2  2.3  4.8  2.1   11.4
1♠     2.3  2.2  2.2  4.8   11.5
1NT    4.6  4.6  3.8  2.9   15.9
6♣     7.3  3.8  4.1  5.0   20.2
6♦     3.7  7.3  4.8  5.2   21.0
6♥     4.5  4.9  7.1  5.4   21.9
6♠     4.6  5.1  5.0  7.3   22.0
Table 3: Belief HCPs after observing opening bid.

Bid     ♣    ♦    ♥    ♠   Total
PASS   4.0  4.0  4.1  4.4   16.5
1♣     7.3  4.0  4.0  4.0   19.3
1♦     4.7  7.3  4.0  4.0   20.0
1♥     4.8  4.9  7.2  4.1   21.0
1♠     4.7  4.8  4.8  7.2   21.5
1NT    6.9  6.8  6.0  5.2   24.9
2♣     7.5  4.5  4.4  4.2   20.5
2♦     4.3  7.5  4.4  4.4   20.6
2♥     4.8  4.8  7.7  4.8   22.1
2♠     5.4  5.5  5.2  8.6   24.7
2NT    7.8  7.7  6.9  6.6   29.0
6♣     9.2  6.4  6.9  7.8   30.3
6♦     6.8 10.0  6.8  7.0   30.6
6♥     7.2  6.5  9.8  7.3   30.8
6♠     6.9  7.0  6.7 10.1   30.7
6NT    8.7  9.0  7.8  8.7   34.2
Table 4: Responding bid - own + belief aHCPs.
21.53% 18.29% 16.45% 22.30% 20.97% 22.93%
Table 5: Double Pass rates.

Double Pass Analysis

In bridge, when both players bid PASS (a Double Pass), all hands are re-dealt and a new episode starts. If we ignored these episodes in training, a naive strategy would emerge in which a player always bids PASS unless it is highly confident about its hand, in which case it bids at Level 1, where the risk is minimal. In this work we are interested in problems where private information must be inferred from observed actions for better performance, so this strategy is uninteresting, and the opportunity cost of bidding PASS can be high when a player could have won with a high reward. To reflect this opportunity cost in training, we set the Double Pass reward so that a player is penalized heavily for passing when it could otherwise have obtained a high reward, and rewarded slightly when it could never have won in the current episode. Table 5 shows the Double Pass rates of P-BIT and the baselines on the unseen 30K deals. The optimal rate is the percentage of deals whose rewards are all negative, in which case a Double Pass is rewarded. As expected, only P-BIT and NPBI have lower Double Pass rates than the optimal rate, as they are explicitly encouraged to communicate during the game. It is worth noting that PASS can also convey information, but the information exchanged by a Double Pass is relatively limited.
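The paper's exact Double Pass reward is not reproduced above; one simple hypothetical choice consistent with the description (heavy penalty when a winning contract existed, a small reward otherwise) is the negated best available score:

```python
def double_pass_reward(contract_scores):
    """Hypothetical opportunity-cost reward for a Double Pass.

    Heavily negative when some contract would have scored highly;
    slightly positive when every available contract would lose points.
    -max(score) is one simple function with this behaviour, not the
    paper's published formula.
    """
    return -max(contract_scores)
```

Under this choice, passing out a deal with a 300-point contract available costs 300, while passing out a hopeless deal earns a small bonus.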

Learned Bidding Convention

Human bridge players decide which bids to make by following a set of rules, more commonly known as the bidding system or bidding convention. Because there are so many ($\binom{52}{13} \approx 6.35 \times 10^{11}$) possible hands, it is infeasible to define a specific rule for each situation; as a result, humans have developed summary statistics to quickly assess the strength of a given hand and defined bidding rules on top of these. The most commonly used summary statistic is the high card point (HCP) count for each suit, in which each card is given a point score. The specific mapping can vary between players, but by far the most frequently used is: A=4, K=3, Q=2, J=1, else=0. The total number of HCPs in a full deck is 40, so the expected number of HCPs in a given hand is 10.
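As a minimal illustration of this scoring (the rank encoding is our own, not from the paper):

```python
# High card points: A=4, K=3, Q=2, J=1, all other ranks 0.
HCP = {'A': 4, 'K': 3, 'Q': 2, 'J': 1}

def hand_hcp(hand):
    """Total HCPs of a hand given as rank characters, e.g. 'AKQJ' or 'T9874'."""
    return sum(HCP.get(rank, 0) for rank in hand)

# One full 13-card suit contains exactly 10 HCPs, so the deck holds 4 * 10 = 40.
full_suit = hand_hcp('AKQJT98765432')   # 10
```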

Whilst our model’s bidding decisions are based entirely on raw card data, we can use high card points as a simple way to observe and summarize the decisions being made. We run the model on the unseen test set of 30,000 deals. Table 2 shows the average HCPs (aHCPs) present in a hand for each opening bidding decision made by NORTH. Once an opening bid is observed, SOUTH updates their belief; Table 3 shows the effect each opening bid has on SOUTH’s belief. Table 4 shows the responding bidding decisions made by SOUTH; the aHCP values here are the sum of the HCPs in SOUTH’s hand and the HCPs in SOUTH’s belief of NORTH’s hand.
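A sketch of how such aHCP tables can be assembled: group the test deals by the bid the model chose, then average the per-suit HCPs within each group (the record format and bid labels are illustrative, not from the paper):

```python
from collections import defaultdict

def average_hcps(records):
    """records: iterable of (bid, [hcp_clubs, hcp_diamonds, hcp_hearts, hcp_spades]).
    Returns, for each bid, the mean HCPs per suit over all deals with that bid."""
    sums = defaultdict(lambda: [0.0] * 4)
    counts = defaultdict(int)
    for bid, suit_hcps in records:
        for i, h in enumerate(suit_hcps):
            sums[bid][i] += h
        counts[bid] += 1
    return {bid: [s / counts[bid] for s in sums[bid]] for bid in sums}

table = average_hcps([('1C', [4, 2, 2, 2]),
                      ('1C', [6, 2, 2, 0]),
                      ('PASS', [1, 1, 2, 1])])
# table['1C'] averages the two 1C deals suit by suit.
```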

We hand-craft 24 features using human expert knowledge (such as HCPs per suit, cards per suit, and whether or not a hand is balanced) and train a decision tree classifier to predict our model’s opening bid from a given hand’s feature vector. The most important features (with their respective feature importance in brackets) are total_HCP (0.24) and the four per-suit card counts (0.19, 0.18, 0.13 and 0.12). This is consistent with the primary features used in human bidding conventions. At depth 14, the decision tree correctly predicts opening bids on an unseen test set with 88% accuracy, whereas reducing the depth to 3 lowers the accuracy to 56%. This highlights the limitations of relying on human bidding systems (which normally consider 2-3 features and rely upon summary statistics) when aiming to build super-human-level bridge AIs.
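A sketch of this kind of probe, using synthetic stand-ins for the paper's 24 features and 30,000-deal test set (the toy bidding rule below is ours, not the learned policy):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
total_hcp = rng.integers(0, 21, n)                      # total high card points
suit_counts = rng.multinomial(13, [0.25] * 4, size=n)   # cards per suit
X = np.column_stack([total_hcp, suit_counts])

# Toy bidding rule standing in for the learned policy: pass (class 0)
# with a weak hand, otherwise open the longest suit (classes 1-4).
y = np.where(total_hcp < 12, 0, 1 + np.argmax(suit_counts, axis=1))

shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(max_depth=14, random_state=0).fit(X, y)

# A deeper tree can express more of the rule, mirroring the reported
# jump from 56% (depth 3) to 88% (depth 14) on the real model's bids.
gap = deep.score(X, y) - shallow.score(X, y)
```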

In our analysis of the non-competitive bridge bidding problem, we showed that the optimal contracts are most often at levels 1-2 ( of deals) and sometimes at levels 6-7 ( of deals). Although level 3-5 contracts can sometimes carry the same reward as level 1-2 contracts, their risk is often far higher. Moreover, whenever a level 3-5 contract can be made, its level 1-2 counterpart can also be made, for the same reward. There is therefore no benefit to bidding at levels 3-5 except to convey information to one’s partner. Our model frequently uses level 3 bids to convey information, and in the appendix we show example trajectories in which this strategy is used. We observe that a level 3 bid is made by a player with a reasonably strong hand who is unsure whether a level 6 contract is obtainable. Their partner responds with a level 6 bid if they have a strong hand and PASS otherwise.


Conclusion

In this paper, we proposed P-BIT, a novel approach that combines agent modelling and communication to solve collaborative imperfect-information games. We iterate between training a policy module and a belief module to improve an agent’s performance in a game. We also introduced a novel auxiliary reward that encourages an agent to take actions which convey critical information to its partner; our experimental results show that this auxiliary reward can guide collaborators in imperfect-information games to communicate. We evaluate our model on the non-competitive bridge bidding problem, where a non-verbal communication protocol is essential to succeed. We demonstrate P-BIT’s performance by comparing it with meaningful baselines, and also show that P-BIT can outperform a previous work that was heavily adapted to the non-competitive bridge bidding problem. To our knowledge, P-BIT is the first work to learn an implicit communication protocol through reinforcement learning.

