1. Introduction
Games have long been of great interest to artificial intelligence (AI) researchers. One line of work focuses on full-information competitive games such as chess (Grammenos et al., 2005) and Go (Coulom, 2007; Silver et al., 2016). Such games present two challenges: the large state space for decision-making and the competition from the opponent player. The AI program AlphaGo (Silver et al., 2016; Silver et al., 2017) has achieved great success in the game of Go. Another line of work investigates imperfect-information card games, such as poker (Sandholm, 2010; Yakovenko et al., 2016) and bridge (Amit and Markovitch, 2006; Ginsberg, 2001; Yeh and Lin, 2016). The computer programs Libratus (Brown et al., 2017) and DeepStack (Moravčík et al., 2017) for no-limit Texas hold’em both showed expert-level performance, but their techniques can only handle the heads-up (two-player) situation.
Contract bridge, or simply bridge, is one of the most interesting and difficult card games, because 1) it presents all the above challenges, i.e., large state space, competition and imperfect information; 2) it has four players rather than two, so the methods designed for two-player zero-sum games (e.g., heads-up no-limit poker (Sandholm, 2010; Yakovenko et al., 2016)), which focus on finding or approximating a Nash equilibrium, are not applicable; 3) it has to deal with cooperation between partners. The best bridge AI programs, such as GIB (http://www.gibware.com/), Jack (http://www.jackbridge.com/) and Wbridge5 (http://www.wbridge5.com/), have not yet reached the level of top human experts, which is probably because of the weakness of their bidding systems (Amit and Markovitch, 2006; Ginsberg, 2001).

The game of bridge consists of two parts, bidding and playing. Playing is relatively easy for AI agents and many programs (e.g., GIB, Jack and Wbridge5) have shown good playing performance (Amit and Markovitch, 2006; Ginsberg, 2001). For example, in 1998, the GIB program attained the 12th place among 35 human experts in a contest without bidding, which demonstrates that computer bridge agents can compete against human expert players in the playing stage. In human world championships, the variation in the level of the players during card playing is also negligible, making the quality of the bidding the decisive factor in the game (Amit and Markovitch, 2006).
Bidding is the hardest part of bridge. During bidding, the players can only see their own cards and the historical bids, and try to find the best contract together with their partners. The difficulty arises from the imperfect-information setting and the complex meanings of the bids. A bid can serve one or more of the following purposes: 1) suggesting a possible contract, 2) exchanging information between partners and 3) interfering with opponents. Human players design a large number of complex bidding rules to explain the meanings of bidding sequences and then suggest a bid based on one’s own cards. To the best of our knowledge, almost all bridge programs are based on such human-designed rules. However, since there are $\binom{52}{13} \approx 6.35 \times 10^{11}$ possible hand holdings with 13 out of 52 cards (Amit and Markovitch, 2006) and on the order of $10^{47}$ possible bidding sequences (see the analysis at http://tedmuller.us/Bridge/Esoterica/CountingBridgeAuctions.htm), it is infeasible for the rules to cover all the situations. Hence, the bidding rules are usually ambiguous. Besides, as the bidding rules are handcrafted, some of them may be inefficient, and it is very likely that a pair of hand and bidding sequence is not covered by any rule or satisfies multiple rules suggesting conflicting bids.
Considering these drawbacks, many researchers have studied how to improve rule-based computer bidding systems. Amit and Markovitch (2006) used Monte Carlo sampling to resolve the conflicts, but did not consider the ambiguity problem. Some researchers tried to infer the cards of other players on the basis of their calls (Amit and Markovitch, 2006; Ando et al., 2003; Ando and Uehara, 2000; Ginsberg, 1999). However, because the possible cards in the others’ hands amount to $\binom{39}{13}\binom{26}{13} \approx 8.45 \times 10^{16}$ combinations and the rules are inexact, the inference may be very inaccurate under limited computing resources and time. DeLooze and Downey (2007) introduced a Self-Organizing Map neural network trained with examples from a human bidding system in order to reduce ambiguities, which was shown to be effective only for no-trump hands. Recently, deep neural networks have achieved unprecedented performance in many games, e.g., Go (Silver et al., 2016) and Atari games (Mnih et al., 2015), and have also been applied to bridge bidding. Yeh and Lin (2016) used a reinforcement learning algorithm to train a value network with raw data. However, the training of their system is based on a small dataset of randomly generated games, and the competition from opponents is not considered (i.e., opponent players were assumed to always “pass”).

In this paper, we, for the first time, develop a competitive bidding system based on deep neural networks, which combines supervised learning (SL) (Mohri et al., 2012) from human expert data and reinforcement learning (RL) (Mnih et al., 2015; Sutton and Barto, 1998) from self-play. Our techniques have the following two novelties. First, we design an efficient feature representation for learning, in which the bidding sequence is encoded as a compact 0-1 vector of 318 bits. Second, to deal with partnership bidding and imperfect information (i.e., unknown cards in the other three players’ hands), we propose a card estimation neural network (ENN) to infer the partner’s cards, and demonstrate by experiments that one’s reward highly depends on the partner’s cards, whereas the opponents’ cards are much less important and not even necessary for finding the optimal contract. The ENN outputs a probability distribution over possible cards, which serves as a part of the features of the policy neural network (PNN) to produce a probability distribution over possible bids. Both neural networks are first trained in the SL stage using expert data. Then we design an RL algorithm based on REINFORCE (Williams, 1992) to let the system gradually improve its ability and learn its own bidding rules from self-play. The learning procedure needs the final rewards of the game, which are related to the result of bidding and the outcome of playing. We leverage double dummy analysis (DDA) (http://bridgecomposer.com/DDA.htm) to directly compute the playing outcome. We show by experiments that DDA is a very good approximation to real expert playing.

We compare our bidding system with the program Wbridge5 (Ventos et al., 2017), the champion of the World Computer-Bridge Championship 2016-2018. Results indicate that our bidding system outperforms Wbridge5 by 0.25 IMP, which is significant because Ventos et al. (2017) show that an improvement of 0.1 IMP can greatly enhance the bidding strength of Wbridge5.
The rest of the paper is organized as follows. We introduce some basic knowledge of the game of bridge in Section 2. Our neural network-based bidding system is introduced in Section 3. The learning algorithms are proposed in Section 4. In Section 5, we conduct extensive experiments to evaluate our bidding system. The conclusion is given in the last section.
2. Background
In this section, we introduce the bidding, playing and scoring mechanism of the game of bridge.
2.1. Bridge Bidding
The game of bridge is played by four players, commonly referred to as North, South, East and West. The players are divided into two opposing partnerships, with North-South against East-West. The game uses a standard deck of 52 cards with 4 suits (club ♣, diamond ♦, heart ♥ and spade ♠), each containing 13 cards from A down to 2. The clubs and diamonds are called the minor suits, while the other two are the major suits. Each player is given 13 cards and one player is designated as the dealer, who proposes the first bid (called the opening bid). Then the auction proceeds around the table in a clockwise manner. Each player chooses one bid from the following 38 candidates in his/her turn:

- a bid higher than the most recent contract bid according to the ordered contract set
  {1♣, 1♦, 1♥, 1♠, 1NT, 2♣, …, 7NT}, (1)
  where NT means no trump;
- pass;
- double a contract bid by the opponents;
- redouble if one’s or one’s partner’s bid is doubled.

We call the bids in the contract set the contract bids and the other three bids the noncontract bids. During the bidding, a player can only observe his/her own 13 cards and the historical bids. The bidding stage ends when a bid is followed by three consecutive “passes”.
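
As an illustration of the rules above, the following minimal sketch enumerates the calls available to the player about to act. The call names (`"P"`, `"X"`, `"XX"`, `"1C"` to `"7NT"`) and the function `legal_calls` are hypothetical and only serve this example; the sketch assumes the auction is still open and does not detect the terminating passes.

```python
# Sketch only: call names and helper names are illustrative assumptions.
STRAINS = ["C", "D", "H", "S", "NT"]  # ascending order within a level
CONTRACT_BIDS = [f"{level}{strain}" for level in range(1, 8) for strain in STRAINS]
CONTRACT_INDEX = {bid: i for i, bid in enumerate(CONTRACT_BIDS)}


def legal_calls(history):
    """Return the calls available to the player about to act.

    `history` lists the calls made so far, starting with the dealer.
    The auction is assumed to be still open.
    """
    calls = ["P"]  # passing is always allowed

    # Any contract bid ranked above the latest contract bid is allowed.
    ranks = [CONTRACT_INDEX[c] for c in history if c in CONTRACT_INDEX]
    highest = max(ranks, default=-1)
    calls.extend(CONTRACT_BIDS[highest + 1:])

    # Find the latest non-pass call. Sides alternate around the table, so an
    # odd distance from the end means the call was made by an opponent.
    for steps_back, call in enumerate(reversed(history), start=1):
        if call == "P":
            continue
        made_by_opponent = steps_back % 2 == 1
        if call in CONTRACT_INDEX and made_by_opponent:
            calls.append("X")   # double the opponents' contract bid
        elif call == "X" and made_by_opponent:
            calls.append("XX")  # redouble after the opponents doubled
        break
    return calls
```

With an empty history the function returns "pass" plus all 35 contract bids; "double" and "redouble" only appear when the latest non-pass call came from the opposing partnership.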
A contract is a tuple of the level (1-7) and the trump suit (♣, ♦, ♥, ♠, or NT), and the partnership that bids the highest contract wins the auction. The winners are called the contractors and the other two players are then the defenders. The player from the contractor partnership who first called the trump suit becomes the declarer and his/her partner the dummy. Bridge is a zero-sum game and the contractors need to win at least the level plus 6 tricks in the playing stage to get a positive score (usually referred to as “making the contract”). For example, a contract at level 4 proposes to win at least 4 + 6 = 10 tricks in the playing. The tricks beyond the level plus 6 are called overtricks. In this example, 2 overtricks for the contractors means that they win 12 tricks in total.
There are a large number of bidding systems, for example, the Standard American Yellow Card (SAYC) (https://en.wikipedia.org/wiki/Standard_American). The bidding rules of many systems are usually ambiguous and even conflicting. For example, a rule may say “do not bid with a balanced hand after an enemy 1NT opening unless you are strong enough to double”, where the definition of “strong enough” is ambiguous. The bid suggested by the rule “with 4-4 or longer in the minor suits, open 1♦ and rebid 2♣” conflicts with the rule “always open your longer minor and never rebid a five-card minor” for a hand with 5 clubs and 5 diamonds. Human players usually devote much time to practicing together to reduce the ambiguities and conflicts.
2.2. Playing and Scoring
The playing stage begins right after the bidding and runs for 13 rounds (tricks). The player sitting to the left of the declarer plays the first card (called the opening lead) and then the dummy exposes his/her cards to all players. The playing continues in a clockwise manner. Each of the players puts one card on the table in each round. A player must follow the lead suit if possible; otherwise he/she may play a card of any other suit. The winner of the trick is determined by the following rule: if any trump card is played, the highest trump wins the trick; otherwise, the highest card of the lead suit wins the trick. During the play, the declarer plays both his/her own cards and the dummy’s. The player who wins the trick has the right to lead in the next round.
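
The trick-winning rule can be sketched in a few lines. The card format (a `(rank, suit)` pair) and the name `trick_winner` are assumptions made for this example only.

```python
RANKS = "23456789TJQKA"  # ascending rank order


def trick_winner(cards, trump=None):
    """Return the index (0-3, in play order) of the card that wins the trick.

    `cards` is a list of (rank, suit) pairs, the first being the lead.
    `trump` is a suit letter, or None for a no-trump contract.
    """
    lead_suit = cards[0][1]
    trumps = [i for i, (_, suit) in enumerate(cards) if suit == trump]
    # If any trump was played, only trumps compete; otherwise only lead-suit cards.
    candidates = trumps or [i for i, (_, suit) in enumerate(cards) if suit == lead_suit]
    return max(candidates, key=lambda i: RANKS.index(cards[i][0]))
```

For instance, with hearts as trump, a lone ♥2 beats the ♠A that was led, while in no trump the ♠A wins.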
The scoring depends on the number of tricks taken by the contractors, the final contract and whether the contract is doubled or redoubled. Besides, in bridge, a partnership can be vulnerable, which is predetermined before the game begins; vulnerability increases the reward for successfully making the contract, but also increases the penalty for failure. If the contractors win the number of tricks they committed to, they get a positive score and the defenders a negative score; otherwise the positive score is given to the defenders. The most widely used scoring mechanism is Duplicate Bridge Scoring (DBS) (http://www.acbl.org//learn_page/howtoplaybridge/howtokeepscore/), which encourages players to bid higher contracts for more bonuses, in addition to the trick points. For example, if a “game” (a contract with at least 100 trick points) is made, the contractors are awarded a bonus of 300 points if not vulnerable, and 500 points if vulnerable. A larger bonus is won if the contractors make a “small slam” or “grand slam”, a contract of level 6 and level 7, respectively. However, they might face a negative score if they fail, even if they took most of the tricks.
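
To make the scoring rules concrete, the sketch below computes the declaring side's duplicate score from the standard duplicate scoring table. The function name `duplicate_score` and its argument encoding are hypothetical; this is an illustration, not the scoring code used in our experiments.

```python
def duplicate_score(level, strain, tricks, vulnerable=False, doubled=0):
    """Duplicate score for the declaring side.

    strain: "C", "D", "H", "S" or "NT"; tricks: tricks won by the declarer's side;
    doubled: 0 = undoubled, 1 = doubled, 2 = redoubled.
    """
    needed = level + 6
    factor = (1, 2, 4)[doubled]

    if tricks < needed:  # contract failed: penalty goes to the defenders
        down = needed - tricks
        if doubled == 0:
            return -down * (100 if vulnerable else 50)
        if vulnerable:
            penalty = 200 + 300 * (down - 1)
        else:
            penalty = 100 + 200 * min(down - 1, 2) + 300 * max(down - 3, 0)
        return -penalty * (factor // 2)  # redoubled penalties are twice the doubled ones

    per_trick = 20 if strain in ("C", "D") else 30
    trick_points = (per_trick * level + (10 if strain == "NT" else 0)) * factor
    score = trick_points
    # Game bonus for 100+ trick points, otherwise the part-score bonus of 50.
    score += (500 if vulnerable else 300) if trick_points >= 100 else 50
    if level == 6:
        score += 750 if vulnerable else 500    # small slam
    elif level == 7:
        score += 1500 if vulnerable else 1000  # grand slam
    score += (0, 50, 100)[doubled]             # bonus for making a (re)doubled contract
    overtricks = tricks - needed
    if doubled == 0:
        score += overtricks * per_trick
    else:
        score += overtricks * (200 if vulnerable else 100) * (factor // 2)
    return score
```

For example, `duplicate_score(4, "H", 10)` gives 420 (120 trick points plus the 300-point game bonus), whereas the same contract going one down costs 50 points when not vulnerable.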
In real-world clubs and tournaments, team competition is popular, where a team usually has four players. Two of the players, playing as a partnership, sit North-South at one table. The other two players of the same team sit East-West at a different table. The two partnerships from the opposing team fill the empty seats at the two tables. During the course of the match, exactly the same deal is played at both tables. Then the sum of the duplicate scores of the two partnerships of a team is converted to International Match Points (IMPs) (http://www.acbl.org/learn_page/howtoplaybridge/howtokeepscore/teams/). The team with more IMPs wins the match.
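
The conversion from a score difference to IMPs follows a fixed lookup table. A minimal sketch using the standard IMP scale is given below; the names `to_imps` and `team_imps` are hypothetical.

```python
from bisect import bisect_right

# Lower bounds of the score differences worth 1, 2, ..., 24 IMPs (standard IMP scale).
IMP_THRESHOLDS = [20, 50, 90, 130, 170, 220, 270, 320, 370, 430, 500, 600,
                  750, 900, 1100, 1300, 1500, 1750, 2000, 2250, 2500, 3000,
                  3500, 4000]


def to_imps(score_diff):
    """Convert a duplicate score difference to (signed) IMPs."""
    imps = bisect_right(IMP_THRESHOLDS, abs(score_diff))
    return imps if score_diff >= 0 else -imps


def team_imps(ns_score_table1, ew_score_table2):
    """IMPs for a team from its N-S pair's score at one table and its
    E-W pair's score at the other table, where the same deal is played."""
    return to_imps(ns_score_table1 + ew_score_table2)
```

For example, if a team's North-South pair scores +420 at one table while its East-West pair concedes 170 at the other, the net difference of 250 points converts to 6 IMPs.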
3. Neural NetworkBased Bidding System
Bridge is a teamwork-based game and the two players of a partnership adopt the same bidding system for information exchange. For human players, the system consists of predefined rules, i.e., a set of agreements and understandings assigned to bids and sequences of bids used by a partnership. Each bidding system ascribes a meaning to every possible bid by each player of a partnership, and presents a codified language which allows the players to exchange information about their card holdings. We implement the bidding system with two neural networks, the ENN (estimation neural network) and the PNN (policy neural network), where the ENN is used to estimate the cards in the partner’s hand and the PNN is designed for taking actions based on the information obtained so far. We will show in Section 5.1 that it is not necessary to estimate the opponents’ cards because the distribution of the remaining 26 cards between the opponents’ hands has little effect on the final results.
In the following two subsections, we first give the definitions of the two networks and then introduce their feature representations.
3.1. Definitions of ENN and PNN
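The set of players is represented as
$$P = \{N, E, S, W\}, \tag{2}$$
where $N$ and $S$ are in a team, and so are $E$ and $W$. We use $p^{+}$ to denote the partner of player $p \in P$. Let
$$C = \Big\{x \in \{0,1\}^{52} : \sum_{i=1}^{52} x_i = 13\Big\} \tag{3}$$
be the set of possible initial cards of a player. Given the cards $c_p \in C$ of player $p$, $C_{-c_p} = \{x \in C : x^{\top} c_p = 0\}$ represents the possible initial hands excluding the cards in $c_p$. We use $H$ to indicate the set of all bidding histories and let $V$ be the set of vulnerabilities. A player $p$ can infer the cards of $p^{+}$ based on the information he/she has got, including his/her own cards $c_p$, the public vulnerability $v \in V$ and the bidding history $h \in H$. Specifically, we define
$$D : C \times V \times H \to [0,1]^{52} \tag{4}$$
such that the $i$-th component of $D(c_p, v, h)$ is the probability that, in $p$’s belief, $p^{+}$ holds card $i$. Our PNN is then defined as a neural network
$$\sigma_{\theta} : C \times V \times H \times [0,1]^{52} \to [0,1]^{38}, \tag{5}$$
where $\theta$ represents the network parameters. That is, the PNN’s features consist of the cards in one’s hand, the vulnerability, the bidding history and the estimation of one’s partner’s cards. Let $\Pr(x \mid h_{p^{+}}, v)$ represent the conditional probability of $x$ given $h_{p^{+}}$ and $v$; it follows that
$$D(c_p, v, h) = \sum_{x \in C_{-c_p}} x \cdot \Pr(x \mid h_{p^{+}}, v), \tag{6}$$
where $h_{p^{+}}$ is the set of bids called by $p^{+}$ in the history $h$. In the above equation, $x$ is a vector and $\Pr(x \mid h_{p^{+}}, v)$, the posterior probability of $x$, is a scalar. The product of a scalar and a vector means multiplying each component of the vector by the scalar. Given the PNN, theoretically, we can compute $\Pr(x \mid h_{p^{+}}, v)$ based on Eq. (6) and Bayes’ rule (Feller, 1968):
$$\Pr(x \mid h_{p^{+}}, v) = \frac{\Pr(h_{p^{+}} \mid x, v)\,\Pr(x)}{\sum_{x' \in C_{-c_p}} \Pr(h_{p^{+}} \mid x', v)\,\Pr(x')}. \tag{7}$$
Further, we have that
$$\Pr(h_{p^{+}} \mid x, v) = \prod_{i=1}^{|h_{p^{+}}|} \sigma_{\theta}\big(b_i \mid x, v, h_i, D(x, v, h_i)\big), \tag{8}$$
where $b_i$ is the $i$-th bid of $p^{+}$, $h_i$ represents the bidding sequence observed when he/she takes his/her $i$-th action, and $\sigma_{\theta}(b \mid \cdot)$ denotes the component of the PNN output corresponding to bid $b$. Substituting Eqs. (7) and (8) into Eq. (6) shows that, to compute $D(c_p, v, h)$, we need to recursively apply Bayes’ rule. Since the size of the space $C_{-c_p}$ is
$$|C_{-c_p}| = \binom{39}{13} \approx 8.12 \times 10^{9}, \tag{9}$$
when the length of $h$ is $n$, the time complexity for computing $D(c_p, v, h)$ is
$$O\Big(\binom{39}{13}^{n}\Big). \tag{10}$$
That is, it is impractical to directly use Bayes’ rule to compute $D(c_p, v, h)$. Thus, we propose the neural network ENN
$$\phi_{\omega} : C \times V \times H \to [0,1]^{52} \tag{11}$$
to approximate $D$, where $\omega$ denotes the parameters to be learned.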
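
To give a sense of the scale behind the intractability argument above, the following short sketch counts the relevant hand spaces with Python's `math.comb`. It assumes that the space in question is the set of 13-card hands drawn from the 39 cards a player does not hold; the variable names are illustrative.

```python
import math

# Number of distinct 13-card hands from a 52-card deck.
all_hands = math.comb(52, 13)

# Candidate partner hands once one's own 13 cards are known:
# 13 cards chosen from the remaining 39.
partner_hands = math.comb(39, 13)

# Ways to split the remaining 39 cards among the three other players.
other_three_hands = math.comb(39, 13) * math.comb(26, 13)

print(f"{all_hands:.3e} {partner_hands:.3e} {other_three_hands:.3e}")
```

Even a single exact summation over the partner's candidate hands already involves billions of terms, before any recursion over the bidding history is taken into account.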
In our model, the ENN outputs 52 probabilities about the partner’s cards, which are directly fed into the PNN, because in most situations of bridge it is unnecessary to know the exact 13 cards of one’s partner and the probability distribution is sufficient for the PNN to make decisions. For example, to make a specific contract, one may just want to confirm that his/her partner holds at least 4 minor-suit cards, i.e., that the sum of the probabilities over the minor-suit cards is not less than 4, no matter which minor-suit cards are held.
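
The minor-suit example can be sketched as follows: summing the per-card probabilities of a suit gives the expected number of cards the partner holds in that suit. The suit-major card ordering used here (13 consecutive positions per suit, in the order spades, hearts, diamonds, clubs) is an assumption made for illustration and may differ from the ordering used by the networks.

```python
SUIT_ORDER = ["S", "H", "D", "C"]  # assumed ordering, 13 consecutive cards per suit


def expected_suit_lengths(card_probs):
    """Map each suit to the expected number of partner's cards in it."""
    assert len(card_probs) == 52
    return {suit: sum(card_probs[13 * k: 13 * (k + 1)])
            for k, suit in enumerate(SUIT_ORDER)}


def expected_minor_cards(card_probs):
    """Expected number of minor-suit (diamond and club) cards held by the partner."""
    lengths = expected_suit_lengths(card_probs)
    return lengths["D"] + lengths["C"]
```

With no information at all (every card having probability 13/52 = 0.25), the expected number of minor-suit cards is 6.5; the check "at least 4 minor-suit cards" then amounts to comparing this sum with 4.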
In the next subsection, we describe how one’s own cards, the vulnerability of both partnerships and the bidding sequence are represented as features.
3.2. Compact Feature Representation
We use a 52-dimensional 0-1 vector for a player’s cards, where a “1” in position $i$ means the player has the $i$-th card in the ordered set
(12) 
A 2-dimensional vector is used to indicate the vulnerability, with “00” for neither partnership vulnerable, “11” for both partnerships vulnerable, “01” for favorable vulnerability (only the opponent partnership is vulnerable) and “10” for unfavorable vulnerability (only one’s own partnership is vulnerable).
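
A minimal sketch of these two encodings is shown below. The card ordering (spades, hearts, diamonds, clubs, each from A down to 2), the card strings such as `"SA"` and the function names are assumptions made for this example.

```python
SUITS = "SHDC"            # assumed suit order (hypothetical)
RANKS = "AKQJT98765432"   # from A down to 2
CARD_INDEX = {suit + rank: 13 * si + ri
              for si, suit in enumerate(SUITS)
              for ri, rank in enumerate(RANKS)}


def encode_hand(cards):
    """Encode 13 distinct cards (e.g. "SA", "H7") as a 52-dimensional 0-1 vector."""
    if len(set(cards)) != 13:
        raise ValueError("a hand must contain 13 distinct cards")
    vec = [0] * 52
    for card in cards:
        vec[CARD_INDEX[card]] = 1
    return vec


def encode_vulnerability(own_vulnerable, opponents_vulnerable):
    """Two bits: [own side vulnerable, opponents vulnerable]."""
    return [int(own_vulnerable), int(opponents_vulnerable)]
```

Under this convention, favorable vulnerability is encoded as `[0, 1]` and unfavorable vulnerability as `[1, 0]`, matching the “01” and “10” codes in the text.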
There are 38 possible bids, including 35 contract bids, “pass”, “double” and “redouble”. According to the bidding rules of bridge, there are at most 8 non-contract bids after each contract bid, i.e., “pass-pass-double-pass-pass-redouble-pass-pass”. Note that the dealer can begin the bidding with “pass”, and if the other three players also choose “pass” then the game ends immediately without a playing stage. Thus, the maximum bidding length a player needs to consider is 3 + 35 × (1 + 8) = 318. Previous work such as (Yeh and Lin, 2016) represents an individual bid with a one-hot vector of 38 dimensions, which requires a vector with more than ten thousand dimensions (318 × 38 = 12,084) to represent a bidding history. Such a representation is computationally inefficient, and (Yeh and Lin, 2016) assumes that the maximal length of a bidding sequence is 5 to address this problem. However, we observe from the expert data that most of the sequences have more than 5 bids (see Figure 2(c)). We propose a more compact 318-dimensional vector to represent the bidding sequence.
Figure 1 shows an example of the representation of the compact feature with bidding sequence
where a “1” in position $i$ indicates that the $i$-th bid in the maximal possible bidding sequence is called. We do not need to represent the player identity of each historical bid because the bids are called by the players one by one in a clockwise manner, so the player to bid can correctly match the observed bids to the corresponding players directly from the bidding sequence.
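
One way to realize such an encoding is sketched below. It lays out the maximal sequence as 3 opening passes followed by one 9-slot block per contract bid (the bid itself plus “pass-pass-double-pass-pass-redouble-pass-pass”). The exact slot assignment, the call names and the function `encode_history` are assumptions for illustration; the sketch only checks the structural constraints it needs and is not a full legality checker.

```python
STRAINS = ["C", "D", "H", "S", "NT"]
CONTRACT_BIDS = [f"{level}{strain}" for level in range(1, 8) for strain in STRAINS]
CONTRACT_INDEX = {bid: i for i, bid in enumerate(CONTRACT_BIDS)}

BLOCK = 9                 # a contract bid plus "P P X P P XX P P"
OPENING_PASSES = 3        # a fourth opening pass would end the deal
VECTOR_LEN = OPENING_PASSES + len(CONTRACT_BIDS) * BLOCK  # 3 + 35 * 9 = 318
PASS_OFFSETS = (1, 2, 4, 5, 7, 8)
DOUBLE_OFFSET, REDOUBLE_OFFSET = 3, 6


def encode_history(calls):
    """Encode the calls observed so far as a 318-dimensional 0-1 vector."""
    vec = [0] * VECTOR_LEN
    base = None   # start of the current contract bid's block
    cursor = 0    # next slot: index among opening passes, or offset in the block
    for call in calls:
        if call in CONTRACT_INDEX:
            new_base = OPENING_PASSES + CONTRACT_INDEX[call] * BLOCK
            if base is not None and new_base <= base:
                raise ValueError(f"{call} does not outrank the previous bid")
            base, cursor = new_base, 1
            vec[base] = 1
        elif call == "P":
            if base is None:
                if cursor >= OPENING_PASSES:
                    raise ValueError("a fourth opening pass ends the deal")
                vec[cursor] = 1
            else:
                if cursor not in PASS_OFFSETS:
                    raise ValueError("a third consecutive pass ends the auction")
                vec[base + cursor] = 1
            cursor += 1
        elif call == "X":
            if base is None or cursor > DOUBLE_OFFSET:
                raise ValueError("nothing to double")
            vec[base + DOUBLE_OFFSET] = 1
            cursor = DOUBLE_OFFSET + 1
        elif call == "XX":
            if base is None or not DOUBLE_OFFSET < cursor <= REDOUBLE_OFFSET:
                raise ValueError("nothing to redouble")
            vec[base + REDOUBLE_OFFSET] = 1
            cursor = REDOUBLE_OFFSET + 1
        else:
            raise ValueError(f"unknown call: {call}")
    return vec
```

Because skipped slots stay at 0, the sequences “1♣, double” and “1♣, pass, pass, double” map to different vectors, so the history can be recovered from the encoding without storing who made each call.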
4. Learning Algorithms
We introduce the learning algorithms in this section, which combine supervised learning from expert data and reinforcement learning from self-play.
The expert data are collected from the Vugraph Project (http://www.bridgebase.com/vugraph_archives/vugraph_archives.php), which contains more than 1 million games from expert-level tournaments of the past 20 years and is constantly being extended. The information recorded for each game in the dataset includes the players’ cards, the vulnerability, the dealer, the bidding sequence, the contract, the playing procedure, and the number of tricks won by the declarer.
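Since the PNN takes the output of the ENN as a part of its features, we first train the ENN on tuples of the form (own cards, vulnerability, bidding history, partner’s cards) generated from the expert data, where the first three components form the input and the partner’s cards are the label. The ENN is a multi-class multi-label classifier because the label, a 52-dimensional 0-1 vector of the partner’s cards, contains 13 ones and 39 zeros. The output layer of the ENN consists of 52 sigmoid neurons. We calculate the cross entropy of each neuron and then sum them together as the final loss. When the training of the ENN is finished, we use it to generate the features of the PNN by modifying each instance (own cards, vulnerability, bidding history, bid) in the dataset into
(own cards, vulnerability, bidding history, ENN output, bid), (13)
where the bid is the one called by the player in that situation. The PNN is a multi-class single-label classifier with a softmax output layer of 38 neurons.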
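
The ENN loss described above, i.e., the per-neuron binary cross entropy summed over the 52 sigmoid outputs, can be sketched as follows. The function name `enn_loss` and the clipping constant `eps` are assumptions added for numerical safety in this example.

```python
import math


def enn_loss(card_probs, partner_hand, eps=1e-12):
    """Sum over the 52 outputs of the binary cross entropy between the
    predicted probability and the 0-1 label of the partner's hand."""
    total = 0.0
    for p, y in zip(card_probs, partner_hand):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

As a sanity check, an uninformed prediction of 0.25 for every card yields a loss of 13 ln 4 + 39 ln(4/3) ≈ 29.24, while a prediction identical to the label yields a loss close to zero.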
After the SL procedure, we further improve the ability of our bidding system by RL through self-play. We randomly generate the deal, including the dealer, the vulnerability and the cards of each player, and then the self-play starts. There are two bidding systems in the RL phase, one for the target team to be improved and the other for the opponents. Each deal is played twice, with the target system playing at the North-South and East-West positions, respectively. The two players of a partnership use the same bidding system (ENN and PNN) and a player’s bid is sampled from the distribution output by his/her PNN. The self-play ends when a bid is followed by three “passes”.
Note that the final score depends on both the contract of bidding and the playing result. It is time consuming to play each deal out either by humans or by some playing program. To address this problem, we use DDA to approximate the playing outcome, which computes the number of tricks taken by each partnership for each possible contract of a deal under the assumption of perfect information and optimal playing strategy. Since players can exchange information during bidding, the result of DDA is usually close to the real one especially for professional players. The biggest advantage of DDA is its speed, usually within several seconds using double dummy solvers, e.g., the Bridge Calculator ^{10}^{10}10Bridge Calculator. http://bcalc.w8.pl/. Given the DDA results and the contract, we can compute the two partnerships’ duplicate scores based on the rule of DBS. The duplicate score for the target system is then used to update the parameters of its PNN according to the following equation:
$$\theta \leftarrow \theta + \alpha\, r \sum_{i=1}^{N} \nabla_{\theta} \log \sigma_{\theta}(b_i \mid f_i), \tag{14}$$
where $\theta$ denotes the parameters of the PNN $\sigma_{\theta}$, $r$ is the duplicate score of the target system, $\alpha$ is the learning rate, $N$ is the number of bids called by the target PNN, $b_i$ and $f_i$ correspond to the $i$-th sampled bid and feature vector of the PNN, respectively, and $\sigma_{\theta}(b_i \mid f_i)$ is the probability of calling $b_i$ given the input $f_i$. The loss function of the ENN in the RL phase is the same as that in the SL procedure.

We train the ENN and PNN simultaneously in the RL. The complete process is depicted in Algorithm 1. To improve the stability of the training, we use a mini-batch of 100 games to update the parameters. Furthermore, following the practice of (Silver et al., 2016), we maintain a pool of opponents consisting of the target bidding systems from previous iterations of the RL. We add the latest bidding system into the pool every 100 updates and randomly select one for the opponents at each mini-batch of self-play.
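
The REINFORCE-style update can be illustrated with a small self-contained sketch. A linear softmax policy stands in for the deep PNN here, and the names `policy_probs` and `reinforce_update` are hypothetical; the mini-batching over 100 games and the opponent pool are omitted.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probs(theta, features):
    """Bid distribution of a linear softmax policy (a stand-in for the PNN).
    theta holds one weight row per bid."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in theta]
    return softmax(logits)


def reinforce_update(theta, episode, reward, lr):
    """One REINFORCE step: theta + lr * reward * sum_i grad log pi(b_i | f_i).

    episode: list of (features, sampled_bid_index) pairs of the target system.
    """
    grad = [[0.0] * len(row) for row in theta]
    for features, bid in episode:
        probs = policy_probs(theta, features)
        for a, grad_row in enumerate(grad):
            # Gradient of log-softmax w.r.t. the logit of action a.
            coeff = (1.0 if a == bid else 0.0) - probs[a]
            for j, x in enumerate(features):
                grad_row[j] += coeff * x
    return [[w + lr * reward * g for w, g in zip(row, grad_row)]
            for row, grad_row in zip(theta, grad)]
```

A positive score increases the probability of the bids that were sampled in the episode and a negative score decreases it, which is the behavior the update rule is meant to capture.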
5. Experimental Results
We conduct a set of experiments to evaluate our bidding system. We first give an overview of the expert data and compare the DDA results with those of expert playing in the dataset, to demonstrate that DDA is a good approximation of the expert playing process and that estimating the partner’s cards is much more important than inferring the opponents’ cards. Next we present a detailed evaluation of the performance of the ENN and PNN. Finally, we test the strength of our bidding system.
5.1. Expert Data Overview and DDA Evaluation
The expert data include the information of the bidding and playing processes, the declarer, the contract and the number of tricks won by the declarer. There is some noise in the data, e.g., incomplete records and mismatches between the tricks and the playing process. After filtering out the noisy data, we finally get about 1 million games.
The numbers of different contracts in the dataset are plotted in Figure 2(a), from which we see that major-suit and no-trump contracts are preferred by experts. This observation is consistent with the DBS system, where the basic scores of making major-suit and no-trump contracts are higher than those of minor-suit contracts. For example, the basic scores of making a level-2 no-trump, major-suit and minor-suit contract are 120, 110 and 90, respectively. Besides, we see from Figure 2(b) that most of the major-suit and no-trump contracts are at levels 4 and 3, respectively, which is because level-4 major-suit contracts and level-3 no-trump contracts constitute “game” contracts and making them is worth 250 bonus points. The distribution of the lengths of the bidding sequences in the data is depicted in Figure 2(c), which indicates that most of the biddings run for 6-15 rounds. We see from Figure 2(d) that the overtricks for contracts with levels greater than 4 are negative, which means that it is difficult to make those contracts.
For each game, we use DDA to compute the declarer’s tricks and calculate the gap between the DDA’s result and the real tricks in the data. The declarer’s partnership can win at most 13 tricks in one deal and thus the range of the gap is [-13, 13]. Figure 3 shows the distribution of the gap. As can be seen, more than 90% of the gaps lie in  and  of the gaps are equal to zero, which implies that DDA is a very good approximation to expert playing.
Based on the DDA results, we demonstrate that 1) it is important to estimate our partner’s cards for finding the optimal contract and 2) the card distribution in the opponents’ hands has little effect on the DDA results. Since there are 4 possible declarers and 5 possible trump suits, given the initial cards of the four players, DDA outputs a double dummy table (DDT) containing the number of tricks each possible declarer can win with each possible trump suit. The optimal contracts of the two partnerships can then be calculated based on the DDT and the DBS system. Thus, we just need to show that, given one’s own cards, 1) the DDTs with different partner’s cards are highly divergent and 2) the DDTs with different opponents’ cards are similar. We take the perspective of one player as an example in the experiments. For demonstration 1, 2 thousand decks of initial cards for this player and one opponent are randomly generated. For each deck, we use DDA to compute the DDTs of 1 thousand different initial cards of the partner (note that once the cards of three players are given, the cards of the fourth can be obtained directly) and then calculate the standard deviation (std) of each declarer-suit pair over the 1 thousand DDTs. Finally we get 2,000 × 4 × 5 = 40,000 std values, which are indexed from the smallest to the largest and plotted in Figure 4. The second demonstration uses a similar method, except that the 2 thousand decks of initial cards are generated for the player and his/her partner, and 1 thousand hands are then randomly sampled for one opponent. The standard deviation caused by modifying a player’s cards indicates how relevant the DDA result is to that player’s cards: the lower the standard deviation, the weaker the relevance and the less important those cards are to the result. We see from Figure 4 that about  of the std values of different partner’s cards are greater than 1.5 and  of them are even greater than 2, while about  of the std values of different opponents’ cards are less than 1 and about half of them are less than 0.75, which is consistent with our expectation.

5.2. Evaluation of ENN and PNN
We first evaluate the fitting accuracy of the ENN on the expert data. We generate more than 12 million training instances from the 1 million games, 70% of which are used for training, 10% for validation and 20% for testing. To increase the divergence between these datasets, the instances from a single game are all put in the same dataset (training, validation or test).
The ENN uses a fully connected neural network with 8 layers, each hidden layer of which has 1500 neurons. Besides, we add an extra skip connection every two layers of the network. To evaluate the accuracy of the ENN, the cards with the 13 highest probabilities are selected as the predictions of the ENN, and the accuracy is then equal to the number of correct predictions divided by 13. Note that the PNN takes the 52 probabilities output by the ENN as input, not the 13 highest-probability cards. The average accuracies of the ENN for different bidding lengths on the test dataset are shown in Figure 5. We see that the accuracies increase gradually. That is, the more actions we observe from our partner, the more accurately we can estimate his/her cards. The average accuracy and recall for each card are depicted in Figure 6(a). They imply that the accuracy and recall for the “A” cards are higher than for other cards and that the fitting performance on major suits is slightly better than that on minor suits. Because the number of possible partner’s hands is $\binom{39}{13} \approx 8.12 \times 10^{9}$, it is very difficult to get a high accuracy with very limited observations. In fact, in most situations, it is not necessary to know the exact cards of the partner. For example, to make a specific contract, one may just want to confirm that his/her partner holds at least 4 minor-suit cards, i.e., that the sum of the ENN’s probabilities over the minor-suit cards is not less than 4, no matter which minor-suit cards are held. Note that the card distribution output by the ENN is directly fed into the PNN and thus such information can be used by the PNN for decision making.
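
The accuracy metric just described can be computed as in the following sketch, where `top13_accuracy` is a hypothetical helper name.

```python
def top13_accuracy(card_probs, partner_hand):
    """Fraction of the 13 highest-probability cards that the partner really holds.

    card_probs: 52 probabilities output by the estimator;
    partner_hand: 52-dimensional 0-1 vector with 13 ones.
    """
    top13 = sorted(range(52), key=lambda i: card_probs[i], reverse=True)[:13]
    return sum(partner_hand[i] for i in top13) / 13
```

Since exactly 13 cards are predicted and exactly 13 cards are held, the precision and recall of the whole hand coincide under this metric; the per-card accuracy and recall in Figure 6(a) are computed per output instead.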
The datasets for the PNN are generated based on the ENN. The PNN consists of a fully connected neural network of 10 layers with skip connections, and each hidden layer has 1200 neurons. The accuracies of the PNN for predicting experts’ moves with different bidding lengths are shown in Figure 5. We learn that the accuracies of the two partnerships’ first bids are higher, which is because the rules for opening bids and the corresponding responses are relatively well-defined. Overall, the accuracy also increases with the number of observed bids. The accuracies and recalls for different bids are shown in Figure 6(b), where the first 35 indexes correspond to the 35 ordered contract bids and the last three (36, 37, 38) represent “pass”, “double” and “redouble”, respectively. The results indicate that the “pass” action is easy to predict because both the accuracy and recall are high. In fact, more than  of the bids in the expert data are “pass”. Besides, we see that the PNN performs better at low-level bids than at high-level bids, because the bidding usually begins with low-level bids and thus we have more data for training the PNN on them.
Next, we evaluate the improvements of the bidding system in the RL procedure. We randomly generate 2 million deals and use Algorithm 1 to train the ENN and PNN. To distinguish between the different networks, we call the ENN (PNN) after the SL the SL-ENN (SL-PNN). Similarly, we use RL-ENN (RL-PNN) to denote the networks after the RL training. The opponent pool consists of the historical versions of the RL-ENN and RL-PNN in the RL. To evaluate the performance of the algorithm, we compare the historical networks with the initial SL networks through bidding competition over 10 thousand random deals. The average IMPs obtained by these RL networks in the competition are plotted in Figure 7. As can be seen, the strength of the bidding system is significantly improved in the RL.
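Table 1. Average IMPs of the row bidding system against the column bidding system.

System | SL-PNN | RL-PNN | SL-PNN+ENN | RL-PNN+ENN | Wbridge5
SL-PNN | N/A | -8.7793 | -5.653 | -9.2957 | –
RL-PNN | 8.7793 | N/A | 2.1006 | -1.0856 | –
SL-PNN+ENN | 5.653 | -2.1006 | N/A | -2.2854 | –
RL-PNN+ENN | 9.2957 | 1.0856 | 2.2854 | N/A | 0.25
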
5.3. Performance of the Bidding System
To show the importance of the ENN, we trained another policy neural network without consideration of the partner’s cards, which has the same structure as the PNN except that the ENN output feature is removed. In this subsection, we use SL-PNN (RL-PNN) and SL-PNN+ENN (RL-PNN+ENN) to denote the bidding systems built with the SL (RL) version of a single PNN and the SL (RL) version of the PNN plus ENN, respectively. The performances of the different bidding systems are shown in Table 1, where the IMPs are from the viewpoint of the row bidding systems and are averaged over 10 thousand random deals when both teams are network-based systems.
We see that the RL bidding systems are stronger than the SL systems, and even the RL-PNN can beat the SL-PNN+ENN by 2.1006 IMPs, which implies that the RL can significantly improve the bidding system. The strongest system, RL-PNN+ENN, beats RL-PNN by 1.0856 IMPs, which indicates that the ENN is a key component of our bidding system. The comparison with Wbridge5 is tested manually on 64 random boards because neither code nor a command line interface is available for Wbridge5. We only compare the bidding ability with Wbridge5, and the scores are also computed with DDA. Wbridge5 implements many bidding rules, e.g., Weak Two, Strong 2D and Unusual 2NT, which can be selected by players in the “Options” of Wbridge5. The comparison results indicate that our best system, RL-PNN+ENN, is stronger in bidding, with a positive average IMP of 0.25 over Wbridge5. It is claimed that a 0.1 IMP improvement is significant for bidding strength (Ventos et al., 2017).
6. Conclusion and Discussion
In this paper, we designed a neural network-based bidding system, consisting of an estimation neural network (ENN) for inferring the partner’s cards and a policy neural network (PNN) that selects bids based on the public information and the output of the ENN. Experimental results indicate that our system outperforms the top rule-based program, Wbridge5.
Contract bridge is a good testbed for artificial intelligence because it is one of the most difficult card games, involving a large state space, competition, cooperation and imperfect information. Our methods can be applied to other games. Specifically, the feature representation method in our paper provides a general way to efficiently encode the action history in a game where the maximal history length is finite. For example, the method can be applied to limit Texas Hold’em poker. Since the possible action sequences in each round and the possible number of rounds in the game are finite, we can use a vector whose length is equal to that of the maximal action sequence to encode the action history, where a “1” in a position means that the corresponding action is taken, while a “0” means that it is not taken.
Besides, how to deal with the private information of other players in imperfect-information games is a key problem. Although it lacks theoretical support, our work shows experimentally that using a dedicated estimation component is effective. Since bridge is a very complex game with multiple players, imperfect information, collaboration and competition, this experimental evidence can motivate applying the method to other games, e.g., multi-player no-limit Texas Hold’em poker and mahjong.
For future work, first, we will further improve the strength of our system. Second, we will develop a computer program for bridge and open it to the community for public testing. Third, we will “translate” the network-based system into a convention so that it can play with humans.
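The encoding idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the action set, the number of rounds and the cap on actions per round are hypothetical values chosen for the example, and the slot-by-action layout is only one possible way to fix a position for every action in the maximal sequence.

```python
from typing import List, Sequence

# Hypothetical game parameters (illustrative only, not taken from the paper).
ACTIONS = ("fold", "call", "raise")  # actions available at each decision point
NUM_ROUNDS = 4                       # e.g. pre-flop, flop, turn, river
MAX_ACTIONS_PER_ROUND = 6            # assumed cap on decisions within one round

SLOT_WIDTH = len(ACTIONS)
VECTOR_LENGTH = NUM_ROUNDS * MAX_ACTIONS_PER_ROUND * SLOT_WIDTH


def encode_history(history: Sequence[Sequence[str]]) -> List[int]:
    """Encode per-round action lists as a fixed-length 0/1 vector.

    Position (round, slot, action) holds 1 if that action was taken at that
    point of the game and 0 otherwise; unused positions stay 0.
    """
    if len(history) > NUM_ROUNDS:
        raise ValueError("too many rounds")
    vector = [0] * VECTOR_LENGTH
    for round_idx, round_actions in enumerate(history):
        if len(round_actions) > MAX_ACTIONS_PER_ROUND:
            raise ValueError("too many actions in one round")
        for slot_idx, action in enumerate(round_actions):
            offset = (round_idx * MAX_ACTIONS_PER_ROUND + slot_idx) * SLOT_WIDTH
            vector[offset + ACTIONS.index(action)] = 1
    return vector


# Two rounds of play: raise, call, then call, raise, call.
encoded = encode_history([["raise", "call"], ["call", "raise", "call"]])
```

Because the vector length is fixed by the longest possible history, histories of any length map to inputs of the same size, which is what allows them to be fed to a feed-forward network.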
Acknowledgements.
This paper is supported by the 2015 Microsoft Research Asia Collaborative Research Program.
References
 Amit and Markovitch (2006) Asaf Amit and Shaul Markovitch. 2006. Learning to bid in bridge. Machine Learning 63, 3 (2006), 287–327.
 Ando et al. (2003) Takahisa Ando, Noriyuki Kobayashi, and Takao Uehara. 2003. Cooperation and competition of agents in the auction of computer bridge. Electronics and Communications in Japan 86, 12 (2003), 76–86.
 Ando and Uehara (2000) Takahisa Ando and Takao Uehara. 2000. Reasoning by agents in computer bridge bidding. In International Conference on Computers and Games. 346–364.
 Brown et al. (2017) Noam Brown, Tuomas Sandholm, and Strategic Machine. 2017. Libratus: The Superhuman AI for No-Limit Poker. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence.
 Coulom (2007) Rémi Coulom. 2007. Computing Elo Ratings of Move Patterns in the Game of go. Journal of International Computer Games Association 30, 4 (2007), 198–208.
 DeLooze and Downey (2007) Lori L DeLooze and James Downey. 2007. Bridge bidding with imperfect information. In IEEE Symposium on Computational Intelligence and Games. 368–373.
 Feller (1968) William Feller. 1968. An introduction to probability theory and its applications: volume I. Vol. 3. John Wiley & Sons New York.
 Ginsberg (1999) Matthew L Ginsberg. 1999. GIB: Steps Toward an Expert-Level Bridge-Playing Program. In Proceedings of the 16th International Joint Conference on Artificial Intelligence.
 Ginsberg (2001) Matthew L. Ginsberg. 2001. GIB: Imperfect information in a computationally challenging game. Journal of Artificial Intelligence Research 14 (2001), 303–358.
 Grammenos et al. (2005) Dimitris Grammenos, Anthony Savidis, and Constantine Stephanidis. 2005. UAChess: A universally accessible board game. In Proceedings of the 11th International Conference on Human-Computer Interaction, Vol. 7.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
 Mohri et al. (2012) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. 2012. Foundations of machine learning. MIT press.
 Moravčík et al. (2017) Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. 2017. DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker. arXiv preprint arXiv:1701.01724 (2017).
 Sandholm (2010) Tuomas Sandholm. 2010. The state of solving large incomplete-information games, and application to poker. AI Magazine 31, 4 (2010), 13–32.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
 Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. 2017. Mastering the game of go without human knowledge. Nature 550, 7676 (2017), 354.
 Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. Vol. 1. MIT press Cambridge.
 Ventos et al. (2017) Veronique Ventos, Yves Costel, Olivier Teytaud, and Solène Thépaut Ventos. 2017. Boosting a Bridge Artificial Intelligence. <hal-01665867> (2017).
 Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 3-4 (1992), 229–256.
 Yakovenko et al. (2016) Nikolai Yakovenko, Liangliang Cao, Colin Raffel, and James Fan. 2016. PokerCNN: a pattern learning strategy for making draws and bets in poker games using convolutional networks. In Proceedings of the 30th AAAI Conference on Artificial Intelligence. 360–367.
 Yeh and Lin (2016) Chih-Kuan Yeh and Hsuan-Tien Lin. 2016. Automatic Bridge Bidding Using Deep Reinforcement Learning. In Proceedings of the 22nd European Conference on Artificial Intelligence. 1362–1369.