Recent progress in deep learning has fueled breakthroughs in reinforcement learning [mnih2015human]. However, most successful deep learning approaches rely on annotated data, which are expensive to acquire and labor-intensive to prepare. Deep reinforcement learning models have attracted increasing interest because of their ability to learn from their own experience. They have been used to train agents that play various games, such as Atari games and board games like Go and Chess.
AlphaGo [AlphaGo] achieved superhuman performance in the game of Go and defeated champions of the game. It did so by first training with supervision and then training further as a reinforcement learning problem. This was improved upon by AlphaGoZero [alphagozero], which learned to play Go solely through reinforcement learning, without any supervision from human data. This suggests that learning a game solely through self-play is not only possible, but could also result in novel strategies that human players have not considered.
Motivated by the results of AlphaGoZero, we seek to further explore other means of learning. Self-play is a good learning mechanism, assuming that the agent gets better over time. However, it requires careful optimization to prevent the suboptimal results often seen in optimization methods cast as a game between two agents [arjovsky2017wasserstein]. In particular, we turn to purely random play as an alternative and develop a fast-converging training algorithm around it. To test our hypothesis, we choose the game of Sungka, a Filipino variant of Mancala. It is a two-player, turn-based board game in which each player tries to collect as many stones as possible.
The game looks deceptively simple because the Sungka board has only 2 heads, 14 houses, and 98 stones in total. However, its actual state-space complexity makes it more complex than various games such as Nine Men’s Morris, Connect Four, Pentominoes, and Domineering, and comparable to American Checkers, as seen in Table I. This level of complexity makes Sungka a good candidate for experimenting with random play as a mechanism for learning.
|Game||State-space complexity (log10)|
|Nine Men’s Morris||10 [allis1994searching]|
|Connect Four||13 [allis1994searching]|
|American Checkers||18 [allis1994searching]|
In this paper, we present a reinforcement learning agent capable of playing Sungka at human-level performance. We also show empirical evidence that with just random play, our training algorithm still converges fast, and that the trained agent discovers various strategies such as maximizing the number of consecutive turns, and choosing an action which would result in sunog.
Specifically, our contributions are as follows (source code at: https://github.com/baudm/sungka-ai):
OpenAI Gym environment for Sungka
Reward formulation which penalizes actions resulting in high opponent scores
Fast-converging and stable training algorithm
II Related Work
To the best of our knowledge, there is no prior work on Sungka, although there have been several works on Mancala and its other variants.
Prior to Deep Q Networks (DQN) [mnih2015human], the successful use of deep neural networks in the context of reinforcement learning had not yet been demonstrated. An early attempt at using neural nets to develop an intelligent agent for Congkak [chepa2013application], a traditional Malaysian game and a variant of Mancala, was largely unsuccessful. Results showed that the neural net policy was even worse than a random policy.
Pinto et al. [PintoMancala] used three agents (game trees, Q-learning, and Rule of Thumb (RoT)) and six reward functions. Their results show that Q-learning beats minimax, and that RoT is efficient but easily beaten. They also found that the reward function has a bigger impact on the game outcome than the type of agent used. DaVolio and Langenborg [DaVolioAI] compared eight agents (Random, Max, Exact, MinMax, MCTS, Q-learning, Deep Q, A3C) and found that agent performance was roughly in line with complexity: the Random agent was the worst and the A3C agent was the best.
III-A Game Mechanics
Sungka is a two-player board game where the players take turns moving stones, with the objective of collecting the most stones in their respective heads. Each player has seven houses, each initially filled with seven stones, as shown in Figure 2.
A player chooses one of the houses on his side, takes the stones in it, and then moves in a clockwise direction, dropping one stone into each house or head he passes, excluding the other player’s head. If the last stone is dropped into his own head, the player continues his turn by again choosing any of the houses on his side. If the last stone is dropped into a filled house, the player picks up all the stones in that house (including the last stone dropped) and continues the turn.
A player’s turn ends when the last stone is dropped into an empty house. If this empty house is on the player’s side, the player takes all the stones in the opponent’s house directly opposite it, together with the last stone dropped, and puts them in his own head. This mechanic is called sunog. The other player then chooses his move. The game ends when all stones have been dropped into the heads. The player with more stones in his head wins the game.
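The sowing rules above can be sketched in a few lines of Python. This is a minimal illustration and not the environment's actual implementation: the 16-pit indexing (houses 0-6 and head 7 for the first player, houses 8-14 and head 15 for the second), the modelling of the sowing direction as increasing index, and the names `new_board` and `play_move` are all assumptions made for this example.

```python
HOUSES_PER_SIDE = 7
STONES_PER_HOUSE = 7
N_PITS = 16
HEADS = (7, 15)  # assumed layout: index of each player's head


def new_board():
    """Initial board: seven stones in every house, empty heads."""
    board = [STONES_PER_HOUSE] * N_PITS
    board[HEADS[0]] = board[HEADS[1]] = 0
    return board


def play_move(board, player, house):
    """Sow from `house` (0-6) of `player` (0 or 1).

    Returns (new_board, extra_turn). The input board is not modified.
    """
    board = list(board)
    own_head, other_head = HEADS[player], HEADS[1 - player]
    own_houses = range(player * 8, player * 8 + HOUSES_PER_SIDE)
    pit = own_houses[house]
    if board[pit] == 0:
        raise ValueError("cannot sow from an empty house")

    while True:
        # Pick up every stone in the current pit and sow them one by one.
        stones, board[pit] = board[pit], 0
        while stones > 0:
            pit = (pit + 1) % N_PITS
            if pit == other_head:
                continue  # the opponent's head is skipped
            board[pit] += 1
            stones -= 1

        if pit == own_head:
            return board, True  # last stone in own head: move again
        if board[pit] > 1:
            continue  # last stone in a filled house: pick up and keep sowing

        # Last stone landed in an empty house, so the turn ends.
        if pit in own_houses:
            # Sunog: capture the opposite house plus the last stone.
            opposite = 14 - pit
            board[own_head] += board[opposite] + 1
            board[opposite] = 0
            board[pit] = 0
        return board, False
```

For example, `play_move(new_board(), 0, 0)` returns the board after the first player's opening move from his first house, along with a flag saying whether he gets to move again. Whatever the move, the total number of stones on the board stays at 98.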
III-B Environment Limitations
In an actual game of Sungka, both players start their first move simultaneously. When a player has finished his move, he must wait until the other player is finished. Once both players have finished their first move, they take turns, with the player who finished his first move faster moving first. This gives the game a real-time element. For this project, we omit this mechanism and limit the game to a purely turn-based one.
After the first round of Sungka, each player redistributes his collected stones back into his houses, seven stones each. If a house cannot be filled with exactly seven stones, it is left empty and is considered burnt. After filling the houses, all excess stones are returned to the player’s head. The game is then played again, with neither player putting stones in the burnt houses and with the winner of the previous round moving first. The game ends when a player has all his houses burnt or surrenders. For simplicity, we only play a single round of Sungka and declare the player with more stones after this round as the winner.
III-C Sungka Environment
We implement our environment using the OpenAI Gym toolkit [OpenAIGym]. Since the gym environment does not explicitly support turn-based games, we manually enforce the turn-based nature of the actions by keeping track of the players’ turns.
III-C1 Observation Space
An observation is a representation of the game board state. It is a 1x14 array where the first seven elements represent the first player’s side, while the last seven elements represent the second player’s side. Each value in the array is the number of stones in the corresponding house. The heads effectively contain the players’ current scores. Since the current scores do not affect the decision-making process, we exclude them from the state vector.
III-C2 Action Space
A player has seven possible actions, which correspond to choosing one of the seven houses on his side of the board. In the actual environment, the action space has size 14. We map the player-specific actions to the raw indices used in the environment: 0-6 for Player 1 and 7-13 for Player 2.
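The mapping between player-specific actions and raw environment indices amounts to a simple offset. The sketch below illustrates it; the function names and the convention of numbering the players 0 and 1 (for Player 1 and Player 2) are assumptions for this example, not the environment's API.

```python
HOUSES_PER_SIDE = 7


def to_raw_action(player, house):
    """Map (player, house) to a raw index: 0-6 for Player 1, 7-13 for Player 2.

    `player` is 0 for Player 1 and 1 for Player 2 (assumed convention).
    """
    if player not in (0, 1) or not 0 <= house < HOUSES_PER_SIDE:
        raise ValueError("invalid player or house")
    return player * HOUSES_PER_SIDE + house


def from_raw_action(raw):
    """Inverse mapping: raw index 0-13 back to (player, house)."""
    if not 0 <= raw < 2 * HOUSES_PER_SIDE:
        raise ValueError("raw action out of range")
    return divmod(raw, HOUSES_PER_SIDE)
```

With this convention, the agent's network only ever needs seven outputs, and the same network can be used on either side of the board by changing the offset.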
An action made by a player not only contributes to the player’s own score, but also affects the maximum score attainable by the opponent in the next turn. Since the player with the most stones wins, the agent should not only maximize its own score, but also minimize the opponent’s score.
We define a timestep such that each one consists of two turns: the agent’s and the opponent’s. We denote by $r_t$ the reward for timestep $t$, and by $s^{a}_t$ and $s^{o}_t$ the scores obtained by the agent and the opponent at timestep $t$, respectively. Thus, we formulate the reward for each timestep as shown in (1).
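To make the idea concrete, the sketch below shows one reward that is consistent with this description: the agent's score for the timestep minus the opponent's score for the same timestep. The exact expression of (1) is not reproduced here, so the difference form, the function names, and the "naive" variant that ignores the opponent are all assumptions for illustration.

```python
def timestep_reward(agent_score, opponent_score):
    """Assumed form: reward the agent's gain and penalize the opponent's gain."""
    return agent_score - opponent_score


def naive_reward(agent_score, opponent_score):
    """Assumed baseline: only the agent's own gain counts."""
    return agent_score
```

Under the difference form, a move that scores 3 stones but lets the opponent score 10 in reply is valued at -7, whereas the naive variant would still value it at 3.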
III-D Deep Q-Learning
The game board consists of 2 heads, 14 houses, and 98 stones in total. We can model the board configuration as a combination-with-replacement problem, that is, distributing 98 indistinguishable stones among 16 pits. As such, the number of theoretically possible game states is $\binom{98 + 16 - 1}{16 - 1} = \binom{113}{15} \approx 1.8 \times 10^{18}$. With this large number of states, using a Q-table to store the values for all state-action pairs becomes impractical, if not impossible. Thus, we instead base our approach on DQN and use a neural network to learn the optimal Q function.
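The size of this theoretical state space can be checked with a short calculation. The sketch below assumes the standard stars-and-bars count for distributing indistinguishable stones among the 16 pits (2 heads plus 14 houses); it ignores which player is to move and whether a configuration is actually reachable in play.

```python
import math

PITS = 2 + 14  # heads plus houses
STONES = 98

# Combinations with replacement: C(stones + pits - 1, pits - 1)
n_states = math.comb(STONES + PITS - 1, PITS - 1)

print(f"theoretical states: {n_states:.2e}")
print(f"log10: {math.log10(n_states):.2f}")
```

The result is on the order of $10^{18}$, which is far too large for a tabular Q function.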
We test our trained DQN agent against several policies:
Random Policy: The random policy agent simply chooses a random action from a uniform distribution.
Max Policy: The max policy agent always chooses the house with the most stones.
Exact Policy: The exact policy agent chooses a house where the number of stones is equal to its distance to the head. This allows the agent to get another turn. If more than one house satisfies the condition, the house nearest to the head is chosen. If no house satisfies the condition, the max policy is used.
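The three hand-crafted policies can be sketched as follows. This is an illustration under stated assumptions: `houses` holds the seven stone counts on the acting player's side, with index 6 adjacent to that player's head, so the distance from house `i` to the head is `7 - i`. The function names and the tie-breaking rule of the max policy (the first house with the maximum count) are also assumptions.

```python
import random


def random_policy(houses, rng=random):
    """Pick any of the houses uniformly at random."""
    return rng.randrange(len(houses))


def max_policy(houses):
    """Pick the house with the most stones (first one on ties)."""
    return max(range(len(houses)), key=lambda i: houses[i])


def exact_policy(houses):
    """Pick the house nearest the head whose stone count equals its
    distance to the head, which earns another turn; otherwise fall
    back to the max policy."""
    n = len(houses)
    for i in reversed(range(n)):  # scan from the house nearest the head
        if houses[i] == n - i:
            return i
    return max_policy(houses)
```

On the initial board, where every house holds seven stones, `exact_policy` selects the house farthest from the head, since it is the only one whose distance to the head is exactly seven.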
DQN Agent: The trained DQN agent plays against itself.
We generate training episodes by making the DQN Agent play against the Random Agent. We train the DQN every step of an episode, for a total of 10,000 episodes. We employ Experience Replay with a buffer size of 2,000, and sample a random mini-batch of size 128 every training iteration. Algorithm 1 describes the training procedure.
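The experience replay setup described above can be sketched with a bounded queue. The buffer size of 2,000 and the mini-batch size of 128 come from the text; the `Transition` fields, the class name, and the optional seed are assumptions for this example.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")


class ReplayBuffer:
    """Fixed-size buffer that discards the oldest transitions first."""

    def __init__(self, capacity=2000, seed=None):
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size=128):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```

In a training loop, a transition would be pushed after every timestep, and a training update would be run once the buffer holds at least one full mini-batch.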
We explored various training setups, but we highlight our experiences in two scenarios:
Annealing $\epsilon$ is the typical approach used in most reinforcement learning work because it ensures that in the early phases of training, the DQN gets trained on a very varied set of states. However, we found that for Sungka, starting with a high $\epsilon$ and annealing it to a low value resulted in relatively unstable training.
Using a fixed $\epsilon$ resulted in the fastest convergence and most stable training. While this seems counterintuitive, note that the practical state space is smaller than the theoretical maximum, and that the stochastic behavior of the Random Agent opponent already provides ample exploration.
With either approach, the DQN Agent always achieves human-level performance at the end of training.
At test time, we initially used full exploitation mode by setting $\epsilon$ to zero. However, when the model is not yet trained well enough, the agent gets stuck choosing a house that does not have any stones in it. To prevent this, we use a very small $\epsilon$ instead of zero at test time.
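The role of $\epsilon$ at both training and test time can be illustrated with a minimal epsilon-greedy selection routine. The function name is hypothetical, and the value 0.05 in the usage note is only a placeholder for "a very small epsilon", not a value taken from the experiments.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over a list of Q-values, one per house."""
    if rng.random() < epsilon:
        # Explore: pick any house uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the house with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` the routine is purely greedy, so an agent whose highest Q-value points at an empty house would repeat that choice indefinitely. A call such as `select_action(q_values, epsilon=0.05)` occasionally picks a different house, which lets the game move on.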
V-A Effects of Reward Formulation
Results show that training our DQN agent against a random agent allows it to learn how to play Sungka such that it maximizes the number of stones it can place in its own head. Figure 3 shows that the agent is able to increase the average reward as training progresses. Moreover, the effect of the reward formulation in (1) is apparent in Figure 4. With everything else held constant, the agent trained using the naive reward formulation, which uses only the agent’s own score, is consistently outscored by the Exact policy and has an inconsistent performance against the Max policy.
V-B Performance of First Turn Player vs. Second Turn Player
During training, our DQN agent reaches a high win percentage against each of the four policies tested after a few hundred episodes, as shown in Figure 5. Among the four policies, our agent had the hardest time against the exact policy: it had the lowest win rate and the second-lowest average reward when playing against it, as shown in Table II(a). This can be attributed to the exact policy’s ability to maximize its number of turns, which the other policies lack.
Table II(a) shows the performance of the final DQN agent against the other policies over 1000 test episodes with a very small $\epsilon$. It can be seen that the trained agent performs very well against any policy when it has the first turn. When it plays second, it still wins against the random policy at a good rate. However, it could not win against the max policy and the exact policy. It also wins only 11 games out of 1000 against itself. This shows that the game of Sungka is biased towards the player that gets the first turn.
Looking into this further, we trained another DQN agent that plays second (Player2DQN). Table III(a) shows the performance of that agent when playing second. The average rewards received by Player2DQN against every policy are higher than those of Player1DQN when it plays second. Its win percentages against the random policy and itself are also higher than those of the agent that was not trained to play second.
We also look into the performance of Player2DQN as the first player in Table III(b). It shows that the average reward increased against max policy, exact policy, and self. Its win percentage against exact policy and self also drastically increased.
Results from these experiments show that when playing against max policy and exact policy, having the first turn leads to more rewards and a higher win percentage. Player2DQN performed well against random, exact, and self when playing first even though it was trained as second. On the other hand, performance against random policy is dependent on how the agent was trained.
We also tested our Player1DQN agent against the Player2DQN agent. Table IV shows that the first player has a clear advantage regardless of which agent moves first.
|Agent||Average Reward||Win Percentage|
|Player1DQN as 1st turn||71.991||98.1%|
|Player2DQN as 2nd turn||26.009||1.8%|
|Player2DQN as 1st turn||68.189||98.0%|
|Player1DQN as 2nd turn||29.811||1.8%|
V-C Learned Actions based on Game Mechanics
Figures 6 and 7 show the agent’s ability to exploit some of the game mechanics. Figure 6 shows that the agent has learned to choose a move that puts the last stone in an empty house on its side of the board. This move allows it to collect more stones, since it also takes the stones in the opposite house on the opponent’s side. In Figure 7, the agent exploits the fact that putting the last stone in its own head allows it to make another move. This gives the agent one point and lets it take another action without the board state changing unpredictably (due to an opponent’s action).
VI Conclusion and Recommendation
We have trained a network that is capable of playing and winning in Sungka. The trained agent is able to choose actions that maximize its reward and thereby increase its probability of winning the game. We showed that the reward formulation which uses both the score accumulated by the other player and the agent’s score for that turn results in more stable training and better performance.
In this paper, we only trained a DQN agent. We recommend looking into the performance of other reinforcement learning methods such as cross entropy, trust region policy optimization, proximal policy optimization, and A3C. It would also be interesting to see the performance of agents trained using different reinforcement learning methods against each other.