Suphx: Mastering Mahjong with Deep Reinforcement Learning

03/30/2020 ∙ by Junjie Li, et al. ∙ 0

Artificial Intelligence (AI) has achieved great success in many domains, and game AI is widely regarded as its beachhead since the dawn of AI. In recent years, studies on game AI have gradually evolved from relatively simple environments (e.g., perfect-information games such as Go, chess, shogi or two-player imperfect-information games such as heads-up Texas hold'em) to more complex ones (e.g., multi-player imperfect-information games such as multi-player Texas hold'em and StartCraft II). Mahjong is a popular multi-player imperfect-information game worldwide but very challenging for AI research due to its complex playing/scoring rules and rich hidden information. We design an AI for Mahjong, named Suphx, based on deep reinforcement learning with some newly introduced techniques including global reward prediction, oracle guiding, and run-time policy adaptation. Suphx has demonstrated stronger performance than most top human players in terms of stable rank and is rated above 99.99 This is the first time that a computer program outperforms most top human players in Mahjong.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 6

page 16

page 18

page 22

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Building superhuman programs for games is a long-standing goal of artificial intelligence (AI). Game AI has made great progress in past two decades  [18, 13, 16, 15, 2, 11, 3]. Recent studies have gradually evolved from relatively simple perfect-information or two-player games (e.g., shogi, chess, Go, and heads-up Texas hold’em) to more complicated imperfect-information multi-player ones (e.g., contract bridge [12], Dota [1], StarCraft II [21], and multi-player Texas hold’em [4]).

Mahjong, a multi-round tile-based game with imperfect information and multiple players, is very popular with hundreds of millions of players worldwide. In each round of a Mahjong game, four players compete with each other towards the first completion of a winning hand. Building a strong Mahjong program raises great challenges to the current studies on game AI.

First, Mahjong has complicated scoring rules. Each game of Mahjong contains multiple rounds, and the final ranking (and so the reward) of a game is determined by the accumulated round scores of those rounds. The loss of one round does not always mean that a player plays poorly for that round (e.g., the player may tactically lose the last round to ensure rank 1 of the game if he/she has a big advantage in previous rounds), and so we cannot directly use the round score as a feedback signal for learning. Furthermore, Mahjong has a huge number of possible winning hands. Those winning hands can be very different from each other, and different hands result in different winning round scores of the round. Such scoring rules are much more complex than previously studied games including chess, Go, etc. A professional player needs to carefully choose what kind of winning hand to form in order to trade off the winning probability and winning score of the round.

Second, in Mahjong each player has up to 13 private tiles in his/her hand which are not visible to other players, and there are 14 titles in the dead wall, which are invisible to all the players throughout the game, and 70 tiles in the live wall, which will become visible once they are drawn and discarded by the players. As a result, on average, for each information set (a decision point of a player), there are more than hidden states that are indistinguishable to him/her. Such a large set of hidden information makes Mahjong a much more difficult imperfect-information game than previously studied ones such as Texas hold’em poker. It is hard for a Mahjong player to determine which action is good only based on his/her own private tiles, since the goodness of an action highly depends on the private tiles of other players and the wall tiles that are invisible to everyone. Consequently, it is also difficult for an AI to connect the reward signal to the observed information.

Third, the playing rule of Mahjong is complex: (1) there are different types of actions including Riichi, Chow, Pong, Kong, discard, and (2) the regular order of plays can be interrupted when making a meld (Pong or Kong), going Mahjong (declaring a winning hand), or robbing a Kong. Because each player can have up to 13 private tiles, it is hard to predict those interruptions, and therefore we even cannot build a regular game tree; even if we build a game tree, such a tree will have a huge number of paths between the consecutive actions of a player. This prevents the direct application of previously successful techniques for games such as Monte-Carlo tree search [14, 15] and counterfactual regret minimization [3, 4].

Due to the above challenges, although there are several attempts [9, 8, 7, 20], the best Mahjong AI is still far behind top human players.

In this work, we build Suphx

(short for Super Phoenix), an AI system for 4-player Japanese Mahjong (Riichi Mahjong), which has one of the largest Mahjong communities in the world. Suphx adopts deep convolutional neural networks as its models. The networks are first trained through supervised learning from the logs of human professional players and then boosted through self-play reinforcement learning (RL), with the networks as the policy. We use the popular policy gradient algorithm

[17] for self-play RL and introduce several techniques to address the aforementioned challenges.

  1. Global reward prediction trains a predictor to predict the final reward (after several future rounds) of a game based on the information of the current and previous rounds. This predictor provides effective learning signals so that the training of the policy network can be performed. In addition, we design look-ahead features to encode the rich possibilities of different winning hands and their winning scores of the round, as a support to the decision making of our RL agent.

  2. Oracle guiding introduces an oracle agent that can see the perfect information including the private tiles of other players and the wall tiles. This oracle agent is a super strong Mahjong AI due to the (unfair) perfect information access. In our RL training process, we gradually drop the perfect information from the oracle agent, and finally convert it to a normal agent which only takes observable information as input. With the help of the oracle agent, our normal agent improves much faster than standard RL training which only leverages observable information.

  3. As the complex playing rules of Mahjong lead to an irregular game tree and prevent the application of Monte-Carlo tree search techniques, we introduce parametric Monte-Carlo policy adaptation (pMCPA) to improve the run-time performance of our agent. pMCPA gradually modifies and adapts the offline-trained policy to a specific round in the online playing stage when the round goes on and there is more information observable (such as public tiles discarded by four players).

We evaluated Suphx on the most popular and competitive Mahjong platform, Tenhou [19], which has more than 350,000 active users. Suphx reached 10 dan at Tenhou, and its stable rank, which describes the long-term average of the performance of a player, surpasses most top human players.

2 Overview of Suphx

In this section, we first describe the decision flow of Suphx, and then the network structures and features used in Suphx.

2.1 Decision Flow

Due to the complex playing rules of Mahjong, Suphx learns five models to handle different situations: the discard model, the Riichi model, the Chow model, the Pong model, and the Kong model, as summarized in Table 1.

Model Functionality
Discard model Decide which tile to discard in normal situations
Riichi model Decide whether to declare Riichi
Chow model Decide whether/what to make a Chow
Pong model Decide whether to make a Pong
Kong model Decide whether to make a Kong
Table 1: Five models in Suphx

Besides these five learned models, Suphx employs another rule-based winning model to decide whether to declare a winning hand and win the round. It basically checks whether a winning hand can be formed from a tile discarded by other players or drawn from the wall, and then makes decisions according to the following simple rules:

  • If this is not the last round of the game, declare and win the round;

  • If this is the last round of the game,

    • If after declaring a winning hand the accumulated round score of the whole game is the lowest among the four players, do not declare;

    • Otherwise, declare and win the round.

There are two kinds of situations that a Mahjong player needs to take actions, so does our AI Suphx (see Figure 1):

  • The Draw situation: Suphx draws a tile from the wall. If its private tiles can form a winning hand with the drawn tile, the winning model decides whether to declare winning. If yes, it declares and the round is over. Otherwise,

    1. Kong step: If the private tiles can make an ClosedKong or AddKong with the drawn tile, the Kong model decides whether to make the ClosedKong or AddKong. If not, go to the Riichi step; otherwise, there are two sub cases:

      1. If it is an ClosedKong, make the ClosedKong and go back to the Draw situation.

      2. If it is an AddKong, other players may win the round using this AddKong tile. If some other player wins, the round is over; otherwise, make the AddKong and go back to the Draw situation.

    2. Riichi step: If the private tiles can make Riichi with the drawn tile, the Riichi model decides whether to declare Riichi. If not, go to the Discard step; otherwise, declare Riichi and then go to the Discard step.

    3. Discard step: The discard model chooses a tile to discard. After that, it is the other players’ turn to take actions or the round ends with no wall tiles left.222If a round ends with no wall tiles left, this is also known as an exhaustive draw.

  • The Other-Discard situation: Other players discard a tile. If Suphx can form a winning hand with this tile, the winning model decides whether to declare winning. If yes, it declares and the round is over. Otherwise, it checks whether a Chow, Pong, or Kong can be made with the discarded tile. If not, it is the other players’ turn to take actions; otherwise, the Chow, Pong, or Kong model decides what action to take:

    1. If no action is suggested by the three models, it is the turn for other players to take actions or the round ends with no wall tiles left.

    2. If one or more actions are suggested, Suphx proposes the action with the highest confidence score (outputted by those models). If the proposed action is not interrupted by higher-priority actions from other players, Suphx takes the action, and then goes to the Discard step in the first situation. Otherwise, the proposed action is interrupted and it becomes the turn for other players to take actions.

Figure 1: Decision flow of Suphx.

2.2 Features and Model Structures

Since deep convolutional neural networks (CNNs) have demonstrated powerful representation capability and been verified in playing games like chess, shogi and Go, Suphx also adopts deep CNNs as the model architecture for its policy.

Different from board games like Go and Chess, the information (as shown in Figure 2) available to players in Mahjong is not naturally in the format of images. We carefully design a set of features to encode the observed information into channels that can be digested by CNNs.

Figure 2: Example of a state. Mahjong’s state contains several types of information: (1) the tile set including private tiles, open hand, and doras, (2) the sequence of discarded tiles, (3) integer features including accumulated round scores of the four players and the number of tiles left in the live wall, and (4) categorical features including round id, dealer, counters of repeat dealer, and Riichi bets.

As there are 34 unique tiles in Japanese Mahong, we use multiple channels to represent a state. As shown in Fig. 3, we use four channels to encode the private tiles of a player. Open hand, doras, and the sequence of discarded tiles are encoded similarly into other channels. Categorical features are encoded into multiple channels, with each channel being either all 0’s or all 1’s. Integer features are partitioned into buckets and each bucket is encoded using a channel of either all 0’s or all 1’s.

Figure 3: Encoding of private tiles. We encode the private tiles of a player into four channels. There are four rows and 34 columns, with each row corresponding to one channel, and each column indicating one type of tile. The m-th column in the n-th channel means whether there are n tiles of the m-th type in the hand.

In addition to the directly observable information, we also design some look-ahead features, which indicate the probability and round score of winning a hand if we discard a specific tile from the current hand tiles and then draw tiles from the wall to replace some other hand tiles. In Japanese Mahjong, a winning hand of 14 tiles contains four melds and one pair. There are 89 kinds of melds and 34 kinds of pairs, which lead to a huge number of different possible winning hands. Furthermore, different hands result in different winning scores of the round according to the complex scoring rules.333A common list of different scoring patterns could be found at http://arcturus.su/wiki/List_of_yaku

It is impossible to enumerate all the combinations of different discarding/drawing behaviors and winning hands. Therefore, to reduce computational complexity, we make several simplifications while extracting look-ahead features: (1) We perform depth first search to find possible winning hands. (2) We ignore opponents’ behaviors and only consider drawing and discarding behaviors of our own agent. With those simplifications, we obtain 100+ look-ahead features, with each feature corresponding to a 34-dimensional vector. For example, a feature represents whether discarding a specific tile can lead to a winning hand of 12,000 round score with replacing 3 hand tiles by tiles drawn from the wall or discarded by other players.

In Suphx, all the models (i.e., the discard/Riichi/Chow/Pong/Kong models) use similar network structures (Figures 4 and 5), except the dimensions of the input and output layers (Table 2

). The discard model has 34 output neurons corresponding to 34 unique tiles, the Richii/Chow/Pong/Kong models have only two output neurons corresponding to whether or not to take a certain action. In addition to the state information and look-ahead features, the inputs of Chow/Pong/Kong models also contain information about what tiles to make a Chow/Pong/Kong. Note that there is no pooling layer in our models, because every column of a channel has its semantic meaning and pooling will lead to information loss.

Discard Riichi Chow Pong Kong
Input
Output 34 2
Table 2: Input/out dimensions of different models
Figure 4: Structure of the discard model
Figure 5: Structures of the Riichi, Chow, Pong, and Kong models

3 Learning Algorithm

The learning of Suphx contains three major steps. First, we train the five models of Suphx by supervised learning, using (state, action) pairs of top human players collected from the Tenhou platform. Second, we improve the supervised models through self-play reinforcement learning (RL), with the models as policy. We adopt the popular policy gradient algorithm (Section 3.1) and introduce global reward prediction (Section 3.2) and oracle guiding (Section 3.3) to handle the unique challenges of Mahjong. Third, during online playing, we employ run-time policy adaptation (Section 3.4) to leverage the new observations on the current round in order to perform even better.

3.1 Distributed Reinforcement Learning with Entropy Regularization

The training of Suphx is based on distributed reinforcement learning. In particular, we employ the policy gradient method and leverage importance sampling to handle the staleness of trajectories due to asynchronous distributed training:

(1)

where is (the parameters of) an old policy generating trajectories for training, is the latest policy to update, and is the advantage of action at state with respect to policy .

We find that RL training is sensitive to the entropy of the policy. If the entropy is too small, RL training converges quickly and self-play does not significantly improve the policy; if the entropy is too large, RL training becomes unstable and the learned policy is of large variance. Therefore, we regularize the entropy of policy during the RL training as follows:

(2)

where is the entropy of policy and is a trade-off coefficient. To ensure stable exploration, we dynamically adjust to increase/decrease the entropy term if the entropy of our policy is smaller/larger than the target in a recent period:

(3)

where is the empirical entropy of the trajectories in a recent period, and is a small step size.

The distributed RL system used by Suphx is shown in Figure 6. The system consists of multiple self-play workers, each containing a set of CPU-based Mahjong simulators and a set of GPU-based inference engines to generate trajectories. The update of the policy is decoupled from the generation of the trajectories: a parameter server is used to update the policy using multiple GPUs based on a replay buffer. During training, every Mahjong simulator randomly initializes a game with our RL agent as a player and other three opponents. When any of the four players needs to take an action, the simulator sends the current state (represented by a feature vector) to a GPU inference engine, which then returns an action to the simulator. GPU inference engines pull the up-to-date policy from the parameter server in a regular basis, to ensure that the self-play policy is close enough to the latest policy .

Figure 6: Distributed RL system in Suphx

3.2 Global Reward Prediction

In Mahjong, each game contains multiple rounds, e.g., 8-12 rounds in Tenhou.444The number of rounds of a game is not fixed, depending on the win/loss of each round. More details can be found at https://tenhou.net/man/. A round starts with 13 private tiles dealt to each player, in turn players draw and discard tiles, the round ends until one of the players completes a winning hand or no tile is left in the wall, and then each player gets a round score. For example, the player who forms a winning hand gets a positive round score, and the others get zero or negative round scores. When all the rounds end, each player gets a game reward based on the rank of the accumulated round scores.

Players receive round scores at the end of each round, and receive game rewards after 8-12 rounds. However, neither round scores nor game rewards are good signals for RL training:

  • Since multiple rounds in the same game share the same game reward, using game rewards as a feedback signal cannot differentiate well-played rounds and poorly-played rounds. Therefore, one should better measure the performance of each round separately.

  • While the round score is computed for each individual round, it may not be able to reflect the goodness of the actions, especially for top professional players. For example, in the last one or two rounds of a game, the rank-1 player with a big lead in terms of accumulated round scores will usually become more conservative, and may purposely let rank-3 or rank-4 players win this round so that he/she can safely keep rank 1 overall. That is, a negative round score may not necessarily mean a poor policy: it may sometimes reflect certain tactics and thus correspond to a fairly good policy.

Therefore, to provide effective signal for RL training, we need to appropriately attribute the final game reward (a global reward) to each round of the game. For this purpose, we introduce a global reward predictor , which predicts the final game reward given the information of the current round and all previous rounds of this game. In Suphx, the reward predictor

is a recurrent neural network, more specifically, a two-layer gated recurrent unit (GRU) followed by two fully-connected layers, as shown in Figure

7.

Figure 7: Reward predictor: GRU network

The training data for this reward predictor come from the logs of top human players in Tenhou, and is trained by minimizing the following mean square error:

(4)

where denotes the number of games in the training data, denotes the final game reward of the -th game, denotes the number of rounds in the -th game, denotes the feature vector of the -th round in the -th game, including the score of this round, the current accumulated round score, the dealer position, the counters of repeat dealer and Riichi bets.

When is well trained, for a self-play game with rounds, we use as the reward of the -th round for RL training.

3.3 Oracle Guiding

There is rich hidden information in Mahjong (e.g., the private tiles of other players and the wall tiles). Without access to such hidden information, it is hard to take good actions. This is a fundamental reason why Mahjong is a difficult game. In this situation, although an agent can learn a policy by means of reinforcement learning, the learning could be very slow. To speed up the RL training, we introduce an oracle agent, which can see all the perfect information about a state: (1) private tiles of the player, (2) open (previously discarded) tiles of all the players, (3) other public information such as the accumulated round scores and Riichi bets, (4) private tiles of the other three players, and (5) the tiles in the wall. Only (1)(2) and (3) are available to the normal agent,555“Normal” here means that the agent has no access to the perfect information. while (4) and (5) are additional “perfect" information that is only available to the oracle.

With the (unfair) access to the perfect information, the oracle agent will easily become a master of Mahjong after RL training. Here the challenge is how to leverage the oracle agent to guide and accelerate the training of our normal agent. According to our study, simple knowledge distillation does not work well: it is difficult for a normal agent, who only has limited information access, to mimic the behavior of a well-trained oracle agent, who is super strong and far beyond the capacity of a normal agent. Therefore, we need a smarter way to guide our normal agent with the oracle.

To this end, there might be different approaches. In Suphx, what we do is to first train the oracle agent through reinforcement learning, using all the features including the prefect ones. Then we gradually drop out the perfect features so that the oracle agent will eventually transit to a normal agent:

(5)

where denote the normal features and the additional perfect features of state , and is the dropout matrix at the -th iteration whose elements are Bernoulli variables with . We gradually decay from 1 to 0. When , all the prefect features are dropped out and the model transits from the oracle agent to a normal agent.

After becomes zero, we continue the training of the normal agent for a certain number of iterations. We adopt two tricks during the continual training. First, we decay the learning rate to one tenth. Second, we reject some state-action pairs if the importance weight is larger than a pre-defined threshold. According to our experiments, without these tricks, the continual training is not stable and does not lead to further improvements.

3.4 Parametric Monte-Carlo Policy Adaptation

The strategy of a top human player will be very different when his/her initial hand (private tiles) varies. For example, he/she will play aggressively to win more given a good initial hand, and conservatively to lose less given a poor initial hand. This is very different from previous games including Go and StarCraft. Therefore, we believe that we could build a stronger Mahjong agent if we are able to adapt the offline-trained policy in the run time.

Monte-Carlo tree search (MCTS) is a well established technique in games like Go [14] for run-time performance improvement. Unfortunately, as aforementioned, the playing order of Mahjong is not fixed and it is hard to build a regular game tree. Therefore, MCTS cannot be directly applied to Mahjong. In this work, we design a new method, named parametric Monte-Carlo policy adaptation (pMCPA).

When a round begins and the initial private hand is dealt to our agent, we adapt the offline-trained policy to this given initial hand as follows:

  1. Simulations: Randomly sample private tiles for the three opponents and wall tiles from the pool of tiles excluding our own private tiles, and then use the offline-trained policy to roll out and finish the whole trajectory. In total, trajectories are generated in this way.

  2. Adaptation: Perform gradient updates using the rollout trajectories to finetune the offline policy.

  3. Inference: Use the finetuned policy to play against other players in this round.

Let denote the private hand tiles of our agent at a round, denote the parameters of the policy trained off-line, and the parameters of the new policy adapted to this round. Then we have

(6)

where is the set of trajectories with prefix , and is the probability of policy generating trajectory .

According to our study, the number of simulations/trajectories does not need to be very large and pMCPA does not need to collect statistics for all the following states for this round. Since pMCPA is a parametric method, the updated policy (using the

simulations) can lead to updated estimation of those states not visited in the simulations as well. That is, such run-time adaptation can help to generalize our knowledge obtained from limited simulations to unseen states.

Please note that the policy adaptation is performed for each round independently. That is, after we adapt the policy of our agent in the current round, for the next round, we will restart from the offline-trained policy once again.

4 Offline Evaluation

In this section, we report the effectiveness of each technical component of Suphx through offline experiments.

4.1 Supervised Learning

In Suphx, the five models were first trained through supervised learning separately. Each training sample is a state-action pair collected from human professional players with state serving as the input and action serving as the label for supervised learning. For example, for the training of the discard model, the input of a sample is all the observable information (and the look-head features) of a state, and the label is the action taken by a human player, i.e., the tile discarded at this state.

The training data sizes and the test accuracy are reported in Table 3. The sizes of the validation data and test data are 10K and 50K respectively, for all the models. Since the discard model addresses a 34-class classification problem, we collected more training samples for it. As can be seen from the table, we achieve an accuracy of for the discard model, for the Riichi model, for the Chow model, for the Pong model, and for the Kong model. We also list the accuracy achieved by previous works [6] as a reference.666We would like to point out that our numbers are not directly comparable to previous numbers due to different training/test data and model structures.

Model Training Data Size Test Accuracy Previous Accuracy [6]
Discard model 15M 76.7% 68.8 %
Riichi model 5M 85.7 % -
Chow model 10M 95.0% 90.4%
Pong model 10M 91.9 % 88.2%
Kong model 4M 94.0 % -
Table 3: Results for supervised learning

4.2 Reinforcement Learning

To demonstrate the value of each RL component in Suphx, we trained several Mahjong agents:

  • SL: the supervised learning agent. This agent (with all the five models) was trained in a supervised way, as illustrated in the previous subsection.

  • SL-weak: an under-trained version of the SL agent, which serves as opponent models while evaluating other agents.

  • RL-basic: the basic version of the reinforcement learning agent. In RL-basic, the discard model was initialized with the SL discard model and then boosted through the policy gradient method with round scores as reward and entropy regularization. The Riichi, Chow, Pong, and Kong models remain the same as those of the SL agent.777These four models can also be improved through reinforcement learning, although not as significant as the discard model. Therefore, to reduce training time, we inherit the SL models.

  • RL-1: the RL agent that enhances RL-basic with global reward prediction. The reward predictor was trained through supervised learning with human game logs from Tenhou.

  • RL-2: the RL agent that further enhances RL-1 with oracle guiding. Please note that in both RL-1 and RL-2, we also only trained the discard model using RL, and left the other four models the same as those of the SL agent.

The initial private tiles have large randomness and will greatly impact the win/loss of a game. To reduce the variance caused by initial private tiles, during the offline evaluation, we randomly generated one million games. Each agent plays against 3 SL-weak agents on these games. In such a setting, the evaluation of one agent took 20 Tesla K80 GPUs for two days. For the evaluation metric, we computed the stable rank of an agent following the rules of Tenhou (see Appendix C). To reduce the variance of stable rank, for each agent, we randomly sampled 800K games from the one million games, for 1000 times.

Figure 8 shows the interquartile ranges888

According to Wikipedia, "In descriptive statistics, the interquartile range (IQR) is a measure of statistical dispersion, being equal to the difference between 75-th and 25-th percentiles, or between upper and lower quartiles, IQR = Q3 - Q1. In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot on the data. It is a trimmed estimator, defined as the 25% trimmed range, and is a commonly used robust measure of scale."

of stable ranks over the 1000 samplings for those agents. Note that for fair comparison, each RL agent was trained using 1.5 million games. The training of each agent costs 44 GPUs (4 Titan XP for the parameter server and 40 Tesla K80 for self-play workers) and two days. As can be seen, RL-basic leads to good improvement over SL, RL-1 outperforms RL-basic, and RL-2 brings additional gains against RL-1. These experimental results clearly demonstrate the value of reinforcement learning, as well as the additional values of global reward prediction and oracle guiding.

Figure 8: Statistics of stable rank over one million games. The plot shows the three quartile values of a distribution along with extreme values. The “whiskers” extend to points that lie within 1.5 IQRs of the lower and upper quartile, and then observations that fall outside this range are displayed independently.

With the global reward predictor distributing the game reward to each round, the trained agent can better maximize the final game reward instead of the round score. For example, in Figure 9, our agent (the south player) has a big lead with good hand in the last round of the game. Based on the current accumulated round scores of the four players, winning this round gets only marginal reward while losing this round will lead to a big punishment. Therefore instead of playing aggressively to win this round, our agent plays conservatively, chooses the safest tile to discard, and finally gets the first place/rank for this game. In contrast, RL-basic discards another tile to win the round, which brings a big risk of losing the 1-st rank of the entire game.

Figure 9: With global reward prediction, our agent (the south player) plays conservatively when its accumulated round score has a big lead in the last round of a game, even if its hand tiles are good and it has a certain probability to win this round. The RL-basic agent discards the red-boxed tile to win this round, but discarding this tile is risky since the same tile has not been discarded by any player in this round. In contrast, RL-1 and RL-2 agents play in a defense mode and discard the blue-boxed tile, which is a safe tile because the same tile has just been discarded by the west player.

4.3 Evaluation of Run-Time Policy Adaptation

In addition to testing the enhancement to offline RL training, we also tested the run-time policy adaptation. The experimental setting is described as follows.

When a round begins and the private tiles are dealt to our agent,

  1. Data generation: We fix the hand tiles of our agent and simulate 100K trajectories. In each trajectory, the hand tiles of the other three players and wall tiles are randomly generated, and we use four copies of our agent to roll out and finish the trajectory.

  2. Policy adaptation: We fine-tune and update the policy trained offline over those 100K trajectories by using the basic policy gradient method.

  3. Test of the adapted policy: Our agent plays against the other three players using the updated policy on another 10K test set where the private tiles of our agent are still fixed. As the initial private tiles of our agent is fixed, the performance of the adapted agent on this test set can tell whether such run-time policy adaptation really makes our agent adapt and work better for the current private tiles.

Please note that run-time policy adaptation is time consuming due to the roll-outs and online learning. Therefore, at the current stage, we only tested this technique on hundreds of initial rounds. The wining rate of the adapted version of RL-2 against its non-adapted version is 66%, which demonstrates the advantage of run-time policy adaptation.

Policy adaption makes our agent work better for the current private hand, especially at the last 1 or 2 rounds of a game. Figure 10 shows an example of the last round of a game. Through simulations the agent learns to know that while it is easy to win this round with a nice round score, this is unfortunately not enough to avoid ending the game with the 4-th place. Thus, after adaptation, the agent plays more aggressively, takes more risks, and eventually wins the round with a much larger round score and successfully avoid ending the game with the 4-th place.

Figure 10: In this example, to move out of the 4-th place of the game, the agent needs to win more than 12,000 round score in this round. Through simulations, the agent learns to know that discarding the red-boxed tile is easy to win this round; however, the corresponding winning round score will be less than 12,000. After adaptation, the agent discards the blue-boxed tile, which leads to lower probability of winning but more than 12,000 winning round score once it wins. By doing so, it takes risk and successfully move out of the 4-th place.

5 Online Evaluation

To evaluate the real performance of our Mahjong AI Suphx999Suphx is equivalent to RL-2 trained with about 2.5 million games. Given that Tenhou.net has time constraint for each action, run-time policy adaptation was not integrated into Suphx while testing on Tenhou.net since it is time consuming. We believe the integration of run-time policy adaptation will further improve Suphx., we let it play on Tenhou.net, the most popular online platform for Japanese Mahjong. Tenhou has two major rooms, the expert room and the phoenix room. The expert room is open to AI and human players of 4 dan and above, while the phoenix room is only open to human players of 7+ dan. According to this policy, Suphx can only play in the expert room.

Suphx played 5000+ games in the expert room and achieved 10 dan in terms of record rank101010Record rank is the highest rank a player has ever achieved in Tenhou. As shown in Appendix B, the rank of a player is dynamic and usually changes over time. For example, if he/she does not play well recently, his/her rank will drop. and 8.74 dan in terms of stable rank.111111See Appendix C for the definition of stable rank. It is the first and only AI in Tenhou that achieves 10 dan in terms of record rank.

We compare Suphx with several AI/human players in Table 4:

  • Bakuuchi [10]: This is a Mahjong AI designed by the University of Tokyo based on Monte Carlo simulation and opponent modeling. It does not use reinforcement learning.

  • NAGA121212https://dmv.nico/ja/articles/mahjong_ai_naga/: This is a Mahjong AI designed by Dwango Media Village based on deep convolutional neural networks. It does not use reinforcement learning either.

  • We also compare Suphx with top human players who have achieved 10 dan in terms of record rank. To be fair, we only compare their game playing in the expert room after they achieved 10 dan. Since these top human players spent the majority of their time in the phoenix room (partly due to its more friendly scoring rules) and only played occasionally in the expert room after they achieved 10 dan, we can hardly calculate a reliable stable rank for each individual of them.131313We chose to compare with the performance of top human players according to their game playing in the expert room but not the phoenix room because these two rooms have different scoring rules and the stable ranks are not directly comparable. Therefore, we treat them as one macro player to make a statistically reasonable comparison.

We can see that in terms of stable rank, Suphx is about 2 dan better than Bakuuchi and NAGA, the two best Mahjong AIs before Suphx. Although these top human players have achieved the same record rank (10 dan) as Suphx, they are not as strong as Suphx in terms of stable rank. Figure 11 plots the distributions of record ranks141414Phoenix is a honorable title in Tenhou, when a 10-dan player gets 4000 ranking points. There are only 13 players (and 14 accounts) in Tenhou’s history won this honorable title for 4-player Mahjong. of current active users in Tenhou, which shows that Suphx is above 99.99% human players in Tenhou.

#Game record rank Stable Rank
Bakuuchi 30,516 9 dan 6.59
NAGA 9,649 8 dan 6.64
Top human 8,031 10 dan 7.46
Suphx 5,760 10 dan 8.74
Table 4: Comparison with other AIs and top human players.
Figure 11: Distributions of record ranks of human players in Tenhou. Each bar indicates the number of human players above a certain level in Tenhou.

As discussed in Appendix B, the record rank sometimes cannot reflect the true level of a player: for example, there are 100+ players with a record rank of 10 dan in Tenhou’s history , but their true levels can be very different. The stable rank is more stable (by its definition) and of finer granularity than the record rank; however, it could also be of large variance, especially when a player has not played enough number of games in Tenhou. Thus, in order to make more informative and reliable comparisons, we proceed as follows. For each AI/human player, we randomly sample games from its/his/her logs in the expert room and compute the stable rank using those games. We do such sampling for times and show the statistics of the corresponding stable ranks of each player in Figure 12. As can be seen, Suphx surpasses both the other two AIs and the average performance of top human professional players with a big margin.

Figure 12: Statistics of stable ranks with and .

We further show more statistics of those AI/human players in Table 5. We have several interesting observations from the table:

  • Suphx is very strong at defense and has very low deal-in rate. This is confirmed by the comments from top human players on Suphx.151515A quote from a pheonix human player can be found at https://twitter.com/Futokunaio_Sota/status/1142399895577325568

  • Suphx has very low 4-th rank rate, which is the key to get a high stable rank in Tenhou according to its scoring rules.

1st Rank 2nd Rank 3rd Rank 4th Rank Win Rate Deal-in Rate
Bakuuchi 28.0% 26.2% 23.2% 22.4% 23.07% 12.16%
NAGA 25.6% 27.2% 25.9% 21.1% 22.69% 11.42%
Top human 28.0% 26.8% 24.7% 20.5% - -
Suphx 29.3% 27.5% 24.4% 18.7% 22.83% 10.06%
Table 5: More statistics: rank distribution and win/deal-in rate

Suphx has developed its own playing styles, which are well recognized by top human players. For example, Suphx is very good at keeping safe tiles, prefers winning hand with half-flush161616https://en.wikipedia.org/wiki/Japanese_Mahjong_yaku, etc. Figure 13 is an example that Suphx keeps a safe tile to balance future attack and defense171717Game replay can be found at https://tenhou.net/3/?log=2019070722gm-0029-0000-3bee4a7e&tw=3..

Figure 13: Suphx keeps a safe tile at state to balance future attack and defense. Although it is safe to discard the red-boxed tile at state (actually this is indeed the tile that most human players would discard), Suphx keeps this tile in hand; instead, it discards the blue-boxed tile, which might slow down the process of forming a winning hand. Doing so leads to more flexibility in future states and can better balance future attack and defense. Consider a future state in which another player declares Riichi that is unexpected to our agent. In this case, Suphx can discard the safe tile kept at state and does not break the winning hand it tries to form. In contrast, if Suphx discards this safe tile at state , it has no other safe tiles to discard at , and thus may have to break a meld or pair in its hand tiles that are close to a winning hand, consequently resulting in smaller winning probability.

6 Conclusion and Discussions

Suphx is the strongest Mahjong AI system up to date, and is also the first Mahjong AI that surpassed most top human players in Tenhou.net, a famous online platform for Japanese Mahjong. Because of the complexity and unique challenges of Mahjong, we believe that even though Suphx has performed very well, there is still a lot of space for further improvement.

  • We introduced global reward prediction in Suphx. In the current system, the reward predictor takes limited information as its input. Clearly, more information will lead to better reward signal. For example, if a round is very easy to win due to our good luck of initial hand tiles, winning this round does not reflect the superiority of our policy and should not be rewarded too much; in contrast, winning a difficult round should be rewarded more. That is, game difficulty should be taken into consideration while designing reward signals. We are investing how to leverage prefect information (e.g., by comparing the private initial hands of different players) to measure the difficulty of a round/game and then boost the reward predictor.

  • We introduced the concept of oracle guiding, and instantiated this concept using the gradual transition from an oracle agent to a normal agent by means of perfect feature dropout. In addition to this, there could be other approaches to leverage the perfect information. For one example, we can simultaneously train an oracle agent and a normal agent, let the oracle agent distill its knowledge to the normal agent while constraining the distance between these two agents. According to our preliminary experiments, this approach also works quite well. For another example, we can consider designing an oracle critic, which provides more effective state-level instant feedback (instead of round-level feedback) to accelerate the training of the policy function based on the perfect information.

  • For run-time policy adaptation, in the current system of Suphx, we did simulations at the beginning of each round when private tiles were dealt to our agent. Actually we can also do simulations after each tile is discarded by any player. That is, instead of only adapting the policy to the initial hands, we can continue the adaptation as the game goes on and more and more information becomes observable. Doing so should be able to further improve the performance of our policy. Moreover, since we gradually adapt our policy, we do not need too many samplings and roll-outs at each step. In other words, we can amortize the computational complexity of policy adaptation over the entire round. With this, it is even possible to use policy adaption in the online playing with affordable computational resources.

Suphx is an agent that constantly learns and improves. Today’s achievement of Suphx on Tenhou.net is just the beginning. Looking forward, we will introduce more novel technologies to Suphx, and continue to push the frontier of Mahjong AI and imperfect-information game playing.

Most real world problems such as finance market predication and logistic optimization share the same characteristics with Mahjong rather than Go/chess - complex operation/reward rules, imperfect information, etc. We believe our techniques designed in Suphx for Mahjong, including global reward prediction, oracle guiding and parametric Monte-Carlo policy adaptation, have a great potential to benefit for a wide range of real-world applications.

Acknowledgement

We would like to sincerely thank Tsunoda Shungo and Tenhou.net for providing the expert game playing logs and online platform for our experiments. We would like to thank players on Tenhou.net for playing games with Suphx. We would like to thank MoYuan for helping us to collect the statistics of human professional players. We would also like to thank our interns Hao Zheng and Xiaohong Ji, as well as our colleagues Yatao Li, Wei Cao, Weidong Ma, and Di He for their contributions to developing the learning algorithms and training system of Suphx.

References

  • [1] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  • [2] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit Hold’Em poker is solved. Commun. ACM, 60(11):81–88, October 2017.
  • [3] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, January 2018.
  • [4] Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 365(6456):885–890, 2019.
  • [5] European Mahjong Association. Rules for Japanese Mahjong. http://mahjong-europe.org/portal/images/docs/Riichi-rules-2016-EN.pdf.
  • [6] Shiqi Gao, Fuminori Okuya, Yoshihiro Kawahara, and Yoshimasa Tsuruoka. Supervised learning of imperfect information data in the game of mahjong via deep convolutional neural networks. Information Processing Society of Japan, 2018.
  • [7] Shiqi Gao, Fuminori Okuya, Yoshihiro Kawahara, and Yoshimasa Tsuruoka. Building a computer mahjong player via deep convolutional neural networks. June 2019.
  • [8] Moyuru Kurita and Kunihito Hoki.

    Method for constructing artificial intelligence player with abstraction to markov decision processes in multiplayer game of mahjong.

    April 2019.
  • [9] N Mizukami and Y Tsuruoka. Building a computer mahjong player based on monte carlo simulation and opponent models. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), pages 275–283, August 2015.
  • [10] Naoki Mizukami and Yoshimasa Tsuruoka. Building a computer mahjong player based on monte carlo simulation and opponent models. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), pages 275–283. IEEE, 2015.
  • [11] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, May 2017.
  • [12] Jiang Rong, Tao Qin, and Bo An. Competitive bridge bidding with deep neural networks. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 16–24. International Foundation for Autonomous Agents and Multiagent Systems, 2019.
  • [13] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.
  • [14] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
  • [15] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, December 2018.
  • [16] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, October 2017.
  • [17] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • [18] Gerald Tesauro. Temporal difference learning and TD-Gammon. Commun. ACM, 38(3):58–68, 1995.
  • [19] Shingo Tsunoda. Tenhou. https://tenhou.net/. Accessed: 2019-6-17.
  • [20] Dwango Media Village.

    NAGA: Deep learning mahjong AI.

    https://dmv.nico/ja/articles/mahjong_ai_naga/. Accessed: 2019-6-29.
  • [21] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

Appendix A: Rules of Mahjong

Mahjong is a tile-based game that was developed in China hundreds of years ago and is now popular worldwide with hundreds of millions of players. The game of Mahjong itself has numerous variations across the world, and Mahjong in different regions is different in both the rules and the tiles used. In this work, we focus on 4-player Japanese Mahjong (Riichi Mahjong) considering that Japanese Mahjong is very popular with a professional league181818https://m-league.jp/ of top players in Japan and its rules [5] are clearly defined and well accepted. Since the playing/scoring rules of Japanese Mahjong are very complex and the focus of this work is not to give a comprehensive introduction to them, here we make some selections and simplifications and give a brief introduction to those rules. Comprehensive introductions can be found at Mahjong International League rules  [5].

There are 136 tiles in Japanese Mahjong, consisting of 34 different kinds of tiles, with four of each kind. The 34 tiles are consists of three suit, Bamboo, Character and Dot, each from 1 to 9, and 7 different Honour tiles. A game contains multiple rounds and ends when one player loses all points, or some winning condition is triggered.

Each player in a game will start with 25,000 points, and one of four players is designated as the dealer. At the beginning of each round, all the tiles are shuffled and arranged into four walls, each with 34 tiles. 52 tiles are dealt to 4 players (13 for each player as his/her private hand), 14 tiles form the dead wall, which are never played except when players declare Kongs and draw a replacement tile, and the remaining 70 tiles form the live wall. The 4 players take turns to draw and discard tiles.191919The turns can be interrupted by a Kong/Kong and declaring a winning hand. The player who first builds a complete hand with at least one yaku wins the round and gets certain round score calculated by the rewarding rules:

  • A complete hand is a set of 4 melds plus a pair. A meld can be a Pong (three identical tiles), a Kong (four identical tiles), and a Chow (three consecutive simple tiles of the same suit). The pair consists of any two identical tiles, but it cannot be mixed with the four melds. A player can make a Chow/Pong/Kong from (1) a tile drawn from the wall by himself/herself, in which case this meld is concealed to others, or (2) a tile discarded by other players, in which case this meld is exposed to others. If a Kong is made from a tile drawn from the wall, it is called ClosedKong. After a player makes a Kong, he/she needs to draw an additional tile from the dead wall for replacement. A special case called AddKong converts an exposed Pong into a Kong when a player draws a tile that matches an exposed Pong he/she has.

  • A yaku is a certain pattern of players’ tiles or a special condition. Yaku is the main factor to determine round score, and the value of yaku varies from different patterns. A winning hand could contain several different yaku’s and the final round score will be accumulated across all the yaku’s in hand. Different variations of Japanese Majong have different yaku patterns. A common list of yaku consist of 40 different types. Furthermore, dora, a special tile determined by rolling dice before drawing tiles, provides additional points as a bonus.

The last tile that a player forms a winning hand together with his/her private tiles can come from (1) a tile from the wall drawn by himself/herself or (2) a tile discarded by other players. For the first case, all other players will lose points to the winner. For the second case, the player who discards the tile will lose points to the winner.

A special yaku is that player can declare Riichi when his/her hand is only one tile away from a winning hand. Once declaring a Riichi, the player can only claim winning hand from either a self-drawn tile or a tile discarded by other players, and he/she is not allowed to change his/her hand tiles any more.202020http://mahjong.wikidot.com/riichi

The final ranking points a player gets from a game is determined by his/her level and the rank of his/her accumulated round score over the multiple rounds of the game, as shown in Table 6.

Appendix B: Tenhou Ranking Rules

Level
Base
Ranking Pts.
1st 2nd 3rd 4th
Level Up
Ranking Pts.
Level Down
Rookie 0
+20
@Normal Room
+40
@Advanced Room
+50
@Expert Room
+60
@Phoenix Room
+10
@Normal Room
+10
@Advanced Room
+20
@Expert Room
+30
@Phoenix Room
+0 0 20 -
9Kyu 0 0 20 -
8Kyu 0 0 20 -
7Kyu 0 0 20 -
6Kyu 0 0 40 -
5Kyu 0 0 60 -
4Kyu 0 0 80 -
3Kyu 0 0 100 -
2Kyu 0 -15 100 -
1Kyu 0 -30 100 -
1Dan 200 -45 400 Yes
2Dan 400 -60 800 Yes
3Dan 600 -75 1200 Yes
4Dan 800 -90 1600 Yes
5Dan 1000 -105 2000 Yes
6Dan 1200 -120 2400 Yes
7Dan 1400 -135 2800 Yes
8Dan 1600 -150 3200 Yes
9Dan 1800 -165 3600 Yes
10Dan 2000 -180 4000 Yes
Phoenix Honorable Title -
Table 6: Tenhou ranking systems: different levels and their requirements

Tenhou uses the Japanese Martial Arts Ranking System212121https://en.wikipedia.org/wiki/Dan_(rank), which starts from rookie, 9 kyuu down to 1 kyuu, and then 1 dan up to 10 dan. Players earn/lose ranking points when they win/lose ranked games. The amount earned or lost depends on the game result (which ranges from 1 to 4), the current level of the player, and the room that the player is in. There are four types of room in Tenhou, normal room, advanced room, expert room and phoenix room. The punishment (i.e., the negative ranking points) of losing a game is the same across these rooms, but the reward (i.e., the positive ranking points) of winning a game are different. The ranking system is designed to punish high-level players if he/she loses a game: a player of a higher level will lose more point for the 4-th rank of game than that of a lower-level player, e.g., -180 for a 10-dan player vs. -120 for a 6-dan player.

When a player plays a game, he/she wins/loses some ranking points from the game, and his/her total ranking points change. If the total ranking points increase and reach the requirement of the next level, his/her rank increases 1 level; if the total ranking points decrease to 0, his/her rank decreases 1 level. When his/her rank level changes, he/she will get initial ranking points at the new level. The details of the ranking points across different rooms and levels are listed in Table 6. Therefore, the rank of a player is not stable and often change over time. We use record rank to denote the highest rank a player has ever achieved in Tenhou.

Appendix C: Stable Rank

Tenhou uses stable rank to evaluate the long-term average performance of a player. The stable rank in the expert room is calculated as follows.222222https://tenhou.net/man/#RANKING

Let denote the number of games a player gets the highest accumulated round scores, the number of games he/she gets the lowest accumulated round scores, and and the numbers of games for the second/third highest accumulated round scores. Then the stable rank of the player in terms of dan is

(7)

Since the accumulated round score of a game depends not only on the skills of the player but also on the private tiles of the four players and the wall tiles, the stable rank could be of large variance due to the randomness in the hidden information. Furthermore, when playing in Tenhou, opponents are randomly allocated by the Tenhou system, which brings in additional randomness. Thus, for a player in Tenhou, it is usually assumed that at least a few thousands of games are needed to get a relatively reliable stable rank.