Learning from Learners: Adapting Reinforcement Learning Agents to be Competitive in a Card Game

by   Pablo Barros, et al.
Istituto Italiano di Tecnologia

Learning how to adapt to complex and dynamic environments is one of the most important factors that contribute to our intelligence. Endowing artificial agents with this ability is not a simple task, particularly in competitive scenarios. In this paper, we present a broad study on how popular reinforcement learning algorithms can be adapted and implemented to learn and to play a real-world implementation of a competitive multiplayer card game. We propose specific training and validation routines for the learning agents, in order to evaluate how the agents learn to be competitive and explain how they adapt to each other’s playing style. Finally, we pinpoint how the behavior of each agent derives from its learning style and create a baseline for future research on this scenario.



I Introduction

With the current interest in reinforcement learning caused by the development of deep reinforcement learning techniques [22], novel methods and mechanisms have been developed in recent years. Such mechanisms allow an artificial agent to map between state and actions within highly complex state representations and in an end-to-end learning manner, reducing the need for strong and well-defined prior knowledge. In recent cases, reinforcement learning agents have been used for guiding autonomous cars [25, 15], predicting the stock exchange impact [24, 19], and coordinating a swarm of robots to protect the environment [11, 33].

Most of these solutions, although having real-world-inspired scenarios, focus on a direct space-action-reward mapping between the agent’s action and the environment state. That translates to agents that can adapt to dynamic scenarios, but, when applied to competitive scenarios, they fail to address the impact of the opponents. In most cases, when these agents choose an action, they do not take into consideration how other agents can affect the state of the scenario. In this regard, competitive reinforcement learning is still behind the mainstream applications and demonstrations of recent years.

In competitive scenarios, the agents have to learn decisions that a) maximize their goal, and b) minimize their adversaries’ goals. Besides dealing with complex scenarios, they usually have to deal with the dynamics between the agents themselves. Some of the most common applications for competitive reinforcement learning involve multi-agent simulations, such as multiple autonomous vehicles [7], life-simulation/resource gathering [32], pursuer/pursued scenarios [31], and multi-player games [18].

The recent development and popular interest in deep reinforcement learning have contributed, however, to the design, implementation, and evaluation of only a few competitive learning solutions. The implementation of a counterfactual thinking solution [31], based on a classic psychological phenomenon, obtained a good performance on a simple multi-agent resource gathering life-simulation water world scenario [10]. The model is certainly interesting but becomes very complex to scale to realistic scenarios, as it implements an extra counterfactual policy network that is extremely sensitive to hyperparameter changes. In another direction, a centralized learning mechanism was introduced by Tampuu et al. [28]. This presents an effective way of learning competitive actions, but it demands the learner to have total control of the environment, which restricts its applications. Moreover, all of these models were evaluated using very limited simulations of real-world events and most of the time do not scale well to real-world problems [13].

To better assess how popular reinforcement learning methods perform in a real-world competitive scenario, we propose a broad study on how different reinforcement learning agents learn and behave when deployed in such an environment. We investigate how three reinforcement learning models (Deep Q-Learning - DQL [30], Advantage Actor-Critic - A2C [21], and Proximal Policy Optimization - PPO [26]) can learn a competitive multiplayer card game, and evaluate how their emergent behavior affects their own decisions towards winning the game. By focusing on these three implementations, we aim to provide the training, analysis and performance baseline for the competitive Chef’s Hat card game [3], without the need for a centralized learner or overly complex solutions. Our goal is to understand how these established models behave in a real-world-inspired competitive scenario.

To maintain our scenario as close to real-world as possible, we implement in full the Chef’s Hat card game, which has been designed to be used in Human-Robot Interactions (HRI). The game contains specific mechanics that allow complex dynamics between the players to be used in the development of a winning game strategy. We use the OpenAI Gym-based Chef’s Hat simulation environment [2] to emulate, in a 1:1 scale, all the possible game mechanics. A card game scenario allows us to have a naturally-constrained environment and yet obtain responses that are the same as the real-world counter-part application. It additionally helps us to better understand the decision-making process of the agents and to better illustrate the strategies learned by each agent and how they affect each other.

For each of the three reinforcement learning methods, we introduce adaptations to the learning mechanisms of each agent, including a novel greedy policy for action selection. We perform three main competitive learning tasks: first, each of these agents is trained against random agents, to evaluate their capability to learn a game strategy. Second, we deploy a self-play routine that allows each agent to further improve its strategies by playing with evolving versions of itself. Third, once all the agents are trained, we choose the best of them and perform an inter-method competition, where the best agents of each learning method play against each other.

We compare the performance of these agents by measuring the number of wins they have in a series of games, and to better understand and explain their learned strategies, we evaluate their action-selection behavior over time. We explain our results in terms of how the agents learn gaming strategies, and discuss how their specific learning mechanisms affect their learning behavior.

Fig. 1: Chef’s Hat in real-life gameplay and 1:1 rendered simulation environment.

II Learning to be Competitive

One of the most important aspects of a competitive agent is the definition of the overall environment goal. In our card game scenario, we define the overall goal as winning as many games as possible. This gives each agent a clear goal and allows us to observe how this affects the agent’s behavior while playing the game, and which strategies emerge. We implement the Chef’s Hat card game [3] through the OpenAI Gym-based simulation environment [2], illustrated in Figure 1. The game represents a controllable action-perception cycle, where each player can only perform a restricted set of actions, and we can directly measure the impact of each action on the game state and on the formation of each player’s strategy. Furthermore, it allows each player to behave as organically as possible, given the in-game constraints, and allows for a naturally-controllable real-world scenario.

II-A Chef’s Hat Card Game Mechanics

Chef’s Hat is played by four players and is themed around a kitchen environment. The game was designed, implemented and validated to allow Human-Robot Interaction experiments to be conducted, where one player can be replaced by a robot without changing the game rules or dynamics. The game is built around a role-based hierarchy: each player can be a Chef, a Sous-Chef, a Waiter, or a Dishwasher. The objective of the players is to be the first one to get rid of their ingredient cards and become the Chef. The player who was the Chef most often is considered the winner of the entire game. The flow of one full game is depicted in Algorithm 1.

Shuffle the deck;
Deal an equal amount of cards per player;
Exchange roles;
Exchange cards;
if special action is evoked then
    Do special action;
end if
Discard cards;
while not end of the game do
    for each player do
        if player can, and wants to, discard then
            Discard cards;
        end if
        if all players passed then
            Make the pizza;
        end if
        if all players finished then
            End of game;
        end if
    end for
end while
Algorithm 1: The game flow of the Chef’s Hat card game.

During each game there are three phases: Start of the game, Making Pizzas, and End of the game. The game starts with the cards having been shuffled and dealt to the players. Then, starting from the second game, the exchange of roles takes place based on the last game’s finishing positions. The player who finished first becomes the Chef, the one who finished second becomes the Sous-Chef, the one who finished third becomes the Waiter, and the last one the Dishwasher. Once the roles are swapped, the exchange of the cards starts. The Dishwasher has to give the two cards with the highest values to the Chef, who in return gives back two cards of their liking. The Waiter has to give their lowest valued card to the Sous-Chef, who in return gives one card of their liking. If any player has two jokers at hand, they can perform a special action: in the case of the Dishwasher, this is “Food Fight” (the hierarchy is inverted); in the case of the other roles, it is “Dinner is served” (there will be no card exchange during that game).

Once the cards and roles have been exchanged, the game starts. The goal of each player is to discard all the cards at hand. They can do this by making a pizza, which consists of laying down the cards onto the playing field, represented by a pizza dough. The person who holds the Golden Mozzarella card (with a face value of 11) starts making the first pizza of the game. A pizza is done when no one can, or wants to, lay down any more ingredients. To be discarded, cards need to be rarer (i.e., have lower face values) than the previously played cards. The ingredients are played from the highest to the lowest face value, that is, from 11 to 1. Players can play multiple copies of an ingredient at once, but always have to play an equal or greater number of copies than the previous player did. If a player cannot (or does not want to) play, they pass until the next pizza starts. A joker card is also available and, when played together with other cards, it assumes their value. When played alone, the joker has the highest face value (12). Once everyone has passed, the players start a new pizza by cleaning the playing field, and the last player to play an ingredient is the first one to start the new pizza.

II-B Chef’s Hat Card Game Simulation

To simulate the Chef’s Hat game, we implemented our scenario using the OpenAI Gym-based simulation environment of Chef’s Hat [3]. The environment simulates all the game mechanics described above and allows different agents to be plugged in to play the game. It comes embedded with dummy agents that randomly perform actions.

The simulator represents the current game state for each player as an aggregation of the cards the player has at hand, and the current cards in the playing field; using a total of 28 values for the state representation, one value per card. For each player, there are a total of 200 allowed actions: to discard one card of face value 1 represents one move, while to discard 3 cards of face value 1 and a joker is another move, and passing is considered another move. Each player can only do one action per game turn.

II-C Learning to be the Chef

To train artificial agents to play Chef’s Hat, we employ different reinforcement learning algorithms based on Q-learning. Q-Learning allows our agents to apply a temporal difference calculation when updating the policy network to maximize the state transitions that will lead to the optimal reward. In this regard, Q-learning showed a faster convergence and a simplified learning process [27, 8, 17] when compared to other reinforcement learning methods.
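For reference, the temporal-difference update applied by standard tabular Q-learning can be written as below. The symbols $\alpha$ (learning rate) and $\gamma$ (discount factor) follow common textbook notation and are not values reported in this paper; the deep variants discussed later approximate $Q$ with a neural network rather than a table.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```

The bracketed term is the temporal-difference error: the gap between the current estimate and the one-step bootstrapped target.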

Chef’s Hat Greedy Policy. To provide an ideal balance between exploration and exploitation, an $\epsilon$-greedy exploration mechanism is usually adopted [29, 23, 6]. In the traditional form, each agent implements an action selection mechanism that ensures exploration through random action selection at the beginning of the training:

$$a = \begin{cases} random(a), & \text{if } p < \epsilon \\ \arg\max_{a} Q(s, a), & \text{otherwise} \end{cases}$$

where random(a) represents a random action selection over the entire action space, $\epsilon$ is the greedy factor, and $p$ is a random number selected at each action. Usually, $\epsilon$ starts with a higher value at the beginning of the training and is reduced each time the policy is updated. This guarantees that the model performs a large number of exploratory steps at the beginning of the learning but incrementally trusts the updated policy more and more towards the end of the learning phase.

In our scenario, however, performing a fully random action is not beneficial to the agent. As the game simulation only allows valid actions to go through and move the game on to the next state, choosing random actions could lead to an agent getting stuck in a state until it randomly chooses a valid action. At the same time, penalizing the agent for choosing an invalid action is not ideal, as it creates an unbalanced training set with a reward representing multiple goals (win the game, and perform valid actions), which leads to unstable and unfocused learning. To solve this problem, we introduce here an updated greedy policy for the Chef’s Hat agents. Instead of selecting a random action, the agent calculates the allowed actions given a state using Algorithm 2.

for  do
       for  do
             if  in and  then
                   if  and in  then
                         if is  then
                               if  then
                               end if
                         end if
                   end if
             end if
       end for
end for
Algorithm 2: Chef’s Hat novel greedy action selection algorithm. It creates a vector containing all the 200 possible actions and marks which of them are allowed given a certain state.

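The output of Algorithm 2 is a hot-encoded vector over all the 200 possible actions, with a 1 representing an allowed action and a 0 representing an invalid action given that specific state. Our $\epsilon$-greedy function is then represented by:

$$a = \begin{cases} random(a \in A_{s}), & \text{if } p < \epsilon \\ \arg\max_{a} Q(s, a), & \text{otherwise} \end{cases}$$

where $A_{s}$ is the set of actions marked as allowed by Algorithm 2 for state $s$, $\epsilon$ is the greedy factor, and $p$ is a random number drawn at each action selection.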

Fig. 2: Example of possible actions given a certain game state. The columns represent the card face values, and the rows represent the number of cards to be discarded. The letter “j” represents the presence of a joker. It marks all the allowed actions within the game mechanics (blue regions) and not allowed actions (the gray regions), and the currently allowed actions (green dots) given a certain state.

To better understand the output of Algorithm 2, Figure 2 illustrates an example of calculated possible actions given a game state. The blue areas mark all the possible action states, while the gray areas mark actions that are not allowed due to the game’s mechanics. The green dots illustrate the possible actions given that specific state, and the red dots display the invalid actions.
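The following is a minimal sketch of how an allowed-action vector of this kind could be derived from the discard rules described in Section II-A. It is not the environment's implementation: the `Action` tuple layout, the use of 12 to mark a joker in the hand, the assumption that passing is always permitted, and the function names are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import NamedTuple

JOKER = 12  # face value of a joker played alone, as stated in the game rules


class Action(NamedTuple):
    """Hypothetical action encoding (not the environment's own layout)."""
    face: int      # ingredient face value 1..11, or JOKER for a lone joker
    quantity: int  # number of ingredient cards of that face value
    jokers: int    # number of jokers added to the play
    is_pass: bool = False


PASS = Action(0, 0, 0, True)


def is_allowed(action, hand, board_face=None, board_count=0):
    """Check one action against the discard rules.

    hand: list of face values held by the player (12 marks a joker).
    board_face / board_count: face value and size of the last play,
    or (None, 0) when the pizza is still empty.
    """
    if action.is_pass:
        return True  # assumption: a player may always pass
    held = Counter(hand)
    if action.face == JOKER:
        # A joker played alone: no ingredient cards, exactly one joker.
        if action.quantity != 0 or action.jokers != 1:
            return False
    elif action.quantity < 1 or action.quantity > held[action.face]:
        return False  # the player does not hold enough copies
    if action.jokers > held[JOKER]:
        return False
    if board_face is None:
        return True  # an empty pizza accepts any play the hand supports
    total = action.quantity + action.jokers
    # Cards must be rarer (lower face value) and at least as many as before.
    return action.face < board_face and total >= board_count


def action_mask(actions, hand, board_face=None, board_count=0):
    """Return a 0/1 vector aligned with `actions`."""
    return [1 if is_allowed(a, hand, board_face, board_count) else 0
            for a in actions]


# Example: two cards of face value 5 are on the pizza.
actions = [PASS, Action(3, 1, 0), Action(3, 2, 0),
           Action(5, 2, 0), Action(3, 1, 1), Action(JOKER, 0, 1)]
hand = [3, 3, 5, 5, JOKER]
mask = action_mask(actions, hand, board_face=5, board_count=2)
```

In this example only the pass, the pair of 3s, and the single 3 completed by a joker are marked as allowed; the single 3 is too few cards, the pair of 5s is not rarer than the board, and the lone joker has the highest face value.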

II-D The Tale of Three Learners

In order to validate our learning scenario and the Chef’s Hat greedy action selection mechanism, we adapted three popular Q-learning-based methods: Deep Q-Learning - DQL [30], Advantage Actor-Critic - A2C [21], and Proximal Policy Optimization - PPO [26]. Each of these algorithms represents one particular aspect of reinforcement learning, and our goal is to demonstrate how they learn and behave when deployed in our scenario using our specific greedy action selection process.

For each action taken by an agent, we calculate a mask composed of the output of Algorithm 2. This mask is applied to the output layer of the neural network that calculates the Q-values of the actions for each algorithm. The mask is extremely important to guarantee that the outputs of the networks are in agreement with the game’s mechanics, and thus focuses the Q-value maximization on finding the best game-play strategy.
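One possible way to combine such a mask with $\epsilon$-greedy selection is sketched below. It assumes the mask is a 0/1 list aligned with the Q-value vector and restricts both the random choice and the greedy choice to allowed indices; the paper's networks apply the mask at the output layer, so this plain-Python function (its name and signature are hypothetical) only illustrates the selection logic.

```python
import random


def masked_epsilon_greedy(q_values, mask, epsilon, rng=random):
    """Pick an action index using only actions flagged as allowed in `mask`."""
    allowed = [i for i, flag in enumerate(mask) if flag == 1]
    if not allowed:
        raise ValueError("the mask must allow at least one action")
    if rng.random() < epsilon:
        # Exploration: sample uniformly among the allowed actions only.
        return rng.choice(allowed)
    # Exploitation: highest Q-value among the allowed actions.
    return max(allowed, key=lambda i: q_values[i])


q_values = [0.9, 0.1, 0.5, 0.7]
mask = [0, 1, 1, 0]
greedy_choice = masked_epsilon_greedy(q_values, mask, epsilon=0.0)
explore_choice = masked_epsilon_greedy(q_values, mask, epsilon=1.0)
```

With `epsilon=0.0` the function returns index 2, the best allowed action, even though index 0 has a higher Q-value; with `epsilon=1.0` it always returns one of the two allowed indices.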

All learning agents’ parameters, illustrated in Figure 3, were optimized using a TPE optimization implemented by the Hyperopt [4] library. Each of the learning agents implemented a single optimization routine for minimizing its loss when playing against dummy random agents. We implemented the agents using the Keras library [9], and our agents and experiments implementations are publicly available at https://github.com/pablovin/ChefsHatGYM.

II-D1 Deep Q-Learning

Deep Q-learning is an evolution of the standard Q-learning method and introduces two novel aspects: a target model and the experience replay. The target model helps to stabilize the learning of Q by providing a stable Q-estimation over the training. The experience replay stores the agent’s own experience by saving important steps taken by the agent, to increase the available data for learning state/action pairs through batch-learning. Deep Q-learning has been recently applied to teach agents to play complex video games with great success [12, 14, 20], mostly due to their capability of performing batch-learning using the experience replay. This drastically increases their training time but results in finding optimal game-winning strategies. We expect to see this behavior reflected in how this agent learns different strategies to play our game as well.
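The two ingredients named above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the authors' Keras implementation: the buffer capacity, the discount factor `gamma`, and the `target_q` callable standing in for the frozen target model are all hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) steps."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest steps are dropped first

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Draw a batch without replacement for batch-learning.
        size = min(batch_size, len(self.memory))
        return rng.sample(list(self.memory), size)


def td_targets(batch, target_q, gamma=0.99):
    """Compute one bootstrapped target per stored step.

    target_q(state) must return the list of Q-values predicted by the
    target model, which is kept fixed between periodic synchronizations.
    """
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q(next_state))
        targets.append(reward + bootstrap)
    return targets


buffer = ReplayBuffer(capacity=2)
buffer.store("s0", 0, -0.01, "s1", False)
buffer.store("s1", 1, -0.01, "s2", False)
buffer.store("s2", 0, 1.0, "s3", True)  # evicts the oldest step
targets = td_targets(list(buffer.memory), lambda s: [0.0, 2.0], gamma=0.5)
```

Because the targets come from the frozen `target_q` rather than the network being trained, the regression target does not shift with every gradient step, which is the stabilizing effect described above.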

II-D2 Advantage Actor-Critic

Actor-critic models present a hybrid learning method where an agent learns how to estimate the Q-values for a given state by following a policy, the actor network, and updates the importance of the chosen Q-values through a value-function approximator, the critic network. Advantage Actor-Critic [21] was introduced recently to stabilize the learning of the two networks by introducing the advantage function, which helps the entire model to identify, given a certain state, how much better it is to take a specific action compared to an average of all the actions. Recent research demonstrates how A2C models present stable learning for video-game scenarios [5], and we expect to observe a steady improvement of this agent while learning a strategy. Our implementation of the A2C model uses a common decoder and a two-tailed network architecture, and it is represented in Figure 3.
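The advantage function described above is commonly written as follows. The notation is the standard one from the actor-critic literature rather than taken from this paper: $V(s)$ is the critic's state value, which corresponds to the policy-weighted average over actions.

```latex
A(s, a) = Q(s, a) - V(s), \qquad V(s) = \mathbb{E}_{a' \sim \pi(\cdot \mid s)} \left[ Q(s, a') \right]
```

A positive $A(s, a)$ indicates that action $a$ is better than the policy's average behavior in state $s$, and a negative value indicates that it is worse.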

Fig. 3: The detailed implementation of our three agents: DQL, A2C, with a double-tail implementation, and PPO with individual actor and critic networks.

II-D3 PPO

Our third implemented learning model is the Proximal Policy Optimization (PPO) [26]. PPO is a recently introduced policy-based method, which follows the same learning structures as the A2C. It, however, implements an adaptive penalty control, based on the Kullback–Leibler divergence, to drive the updates of the agent at each interaction. This allows the model to create an update region that functions similarly to the stochastic gradient descent optimization, reducing the need for the algorithm to keep large memory replays or complex update rules. PPO has been used recently with great success in different competitive scenarios, where the environment is constantly changing [1, 16]. We expect that our PPO agent will present quick adaptation to newly perceived competitive strategies, in particular when playing against the other agents, which have a slower adaptation mechanism. Figure 3 illustrates our PPO agent.
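For orientation, the adaptive KL-penalty objective from the original PPO formulation is reproduced below. Whether the agents in this study use exactly this variant or the clipped surrogate is not stated in the text, so the expression should be read as background notation: $\hat{A}_t$ is the advantage estimate and $\beta$ the penalty coefficient.

```latex
L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t - \beta \, \mathrm{KL}\left[ \pi_{\theta_{old}}(\cdot \mid s_t), \pi_\theta(\cdot \mid s_t) \right] \right]
```

In that formulation, $\beta$ is increased when the measured divergence exceeds a target value and decreased when it falls below it, which keeps each policy update close to the previous policy.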

III Evaluating Competition

The goal of this paper is to demonstrate, evaluate and understand how the three reinforcement learning methods described above behave when learning in a multiplayer competitive scenario provided by the Chef’s Hat simulation environment. As such, we separate our evaluation routines into three experiments: First, we train one agent implementing each of these methods playing against three other agents implementing random action selections. Second, we perform a self-play training routine where each of the learning agents plays with different generations of themselves. Finally, we choose the best learning agent from the self-play experiments and play a competitive game with the three agents and a random agent.

Reward and Metrics. To train our agents we use an overall rewarding strategy: the environment gives a full reward (1.0) for the action that leads an agent to win the game. Every other action receives a reward of -0.01, to promote exploration within the agent’s Q-learning algorithm and avoid a suboptimal solution. Given the temporal-difference learning, the agent will learn how to generate strategies, composed of a sequence of actions, in order to achieve the maximum reward without receiving any prior information from the environment.
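The reward scheme above is simple enough to state directly in code. The sketch below uses the two reward values given in the text; the function names and the discount factor `gamma` are assumptions added for illustration.

```python
WIN_REWARD = 1.0     # action that leads the agent to win the game
STEP_REWARD = -0.01  # every other action


def reward_for(action_wins_game):
    """Return the reward for a single action."""
    return WIN_REWARD if action_wins_game else STEP_REWARD


def discounted_return(rewards, gamma=0.99):
    """Accumulate rewards from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A three-action game that ends with the winning discard.
episode = [reward_for(False), reward_for(False), reward_for(True)]
undiscounted = discounted_return(episode, gamma=1.0)
```

The small negative step reward means that, among winning sequences, shorter ones accumulate a higher return.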

For each experiment, we evaluate the agent’s performance by calculating the average number of victories over all the games the agent played in a series of 10 experimental runs of 100 games each, totaling 1000 games. To help us understand and explain how the agents learn, we also calculate the selected action Q-values over all the games, which gives us an insight into how confident the agents are in selecting certain actions during the game-play. We post-process all the Q-values of an agent, per turn, using a softmax function, which helps us to present the Q-values as probabilities, improving readability.
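Both metrics can be sketched with the standard library alone. The helper names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics


def softmax(q_values):
    """Turn the Q-values of one turn into a probability distribution."""
    shift = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp(q - shift) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def summarize_victories(wins_per_run):
    """Mean and standard deviation of victories across runs of 100 games."""
    return statistics.mean(wins_per_run), statistics.pstdev(wins_per_run)


probabilities = softmax([1.0, 2.0, 3.0])
mean_wins, std_wins = summarize_victories([80, 86])
```

The softmax keeps the ordering of the Q-values, so the action with the highest Q-value is also the one shown with the highest probability.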

To fully illustrate our experimental setup, we report all the experiments, training and validation routines, agent combinations and the number of games in Table I.

Exp.     Routine   Agents                         # Games
Random   Train     DQL vs Random                  1000
                   A2C vs Random                  1000
                   PPO vs Random                  1000
         Val.      DQL vs Random
                   A2C vs Random
                   PPO vs Random
Myself   Train     DQL
         Val.      vs vs vs
                   vs vs vs
                   vs vs vs
Others   Train     DQL vs A2C vs PPO vs Random    1000
         Val.      DQL vs A2C vs PPO vs Random
TABLE I: Experimental setup: training and validation routines, agent combinations, and number of performed games per routine.

III-1 vs Random

Our first experiment puts each of the learning agents to play against three dummy agents. We perform a training routine that lasts 1000 games. We then perform an evaluation routine where each trained agent plays 10x100 games against the random agents, without further training, and we measure the average number of victories achieved by each agent per 100 games, together with the standard deviation. This experiment aims to give us important information about how each trained agent learns to beat a simple strategy based on random selections.

III-2 vs Myself

Our second experiment is composed of a self-play routine. For each self-play generation, we train agents playing against each other for 1000 games. In order to increase the opponents’ variability and avoid an overspecialization of the agent, in every generation we save the best and second-best agents in a list, based on their averaged summed reward when playing against each other in a validation routine composed of 1000 games without further training. For the next generation, we copy the best agent from the previous generation and put it to play against three other agents, which can be pulled from the best and second-best list, a newly instantiated agent, or a random agent. The selection happens randomly, with the same probability of choosing any of these agents. We repeat the self-play routine for 50 generations, totaling 50,000 played games per learning method. We evaluate the impact of the self-play routine by getting the first, the 25th and the last generation to play against the best agent from the previous experiment for 10x100 games, and measure the averaged number of victories and standard deviation. This experiment allows us to observe how the self-play routine affects the trained agents’ performance within different generations.
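The opponent selection step of this routine could be sketched as follows. The function and argument names are hypothetical, and the sketch assumes that "the same probability" means a uniform choice among the three sources (saved agents, a new agent, a random agent); a uniform choice over individual agents would be an equally valid reading of the text.

```python
import random


def sample_opponents(saved_agents, new_agent, random_agent,
                     n_opponents=3, rng=random):
    """Draw the opponents faced by the copied best agent in one generation.

    saved_agents: best and second-best agents kept from earlier generations.
    new_agent / random_agent: zero-argument factories that build a freshly
    instantiated learner and a dummy random agent, respectively.
    """
    sources = [new_agent, random_agent]
    if saved_agents:
        # Only offer the saved list once it contains at least one agent.
        sources.append(lambda: rng.choice(saved_agents))
    return [rng.choice(sources)() for _ in range(n_opponents)]


saved = ["best_gen1", "second_best_gen1"]
opponents = sample_opponents(saved, lambda: "new", lambda: "random",
                             rng=random.Random(0))
first_generation = sample_opponents([], lambda: "new", lambda: "random")
```

Mixing saved, new, and random opponents is what provides the variability that the routine relies on to avoid an agent that only beats its immediate predecessor.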

III-3 vs Others

Our last experimental setup involves an inter-method evaluation. We take the best-trained agents for each learning method, based on the results of the vs. Myself experiments, and put them to play against each other and a dummy agent. Playing against the dummy agent normalizes their behavior by providing a very easy opponent that all of them can beat. We perform two evaluation routines here. The first involves these agents playing against each other for 10x100 games, without further training. The second instead consists of a training routine that lasts 1000 games followed by an evaluation routine that lasts 10x100 games without training. We calculate here the average victories for each agent, together with the standard deviation. This experiment will exhibit the performance of the implemented agents when compared to each other, and how they can adapt to more complex strategies than random action selection.

IV Results

The results from all three experiments - vs. Random, vs. Myself and vs. Others - are depicted in Table II.

vs. Random
Model    Victories      Random1        Random2        Random3
DQL      66.8 ± 5.69    9.7 ± 3.13     12.9 ± 4.66    10.6 ± 1.8
A2C      65.1 ± 5.19    9.3 ± 3.1      12.1 ± 4.35    13.5 ± 3.58
PPO      83.1 ± 4.18    4.7 ± 2.19     6.0 ± 2.28     6.2 ± 1.83

vs. Myself
Model    Gen-1          Gen-25         Gen-50         Random
DQL      19.4 ± 4.78    24.8 ± 4.98    42.9 ± 7.06    12.9 ± 6.64
A2C      25.4 ± 4.39    29.1 ± 6.14    34.5 ± 7.12    11 ± 2.86
PPO      16.9 ± 3.36    32.5 ± 3.75    40.3 ± 3.52    10.3 ± 4.1

vs. Others
Model    Before training    After training
DQL      35.9 ± 3.11        35.9 ± 3.11
A2C      18.9 ± 3.51        4.9 ± 2.84
PPO      42.8 ± 5.06        48.5 ± 40.6
Random   2.4 ± 0.8          3.3 ± 1.85

TABLE II: Results for all three experiments, given as average victories per 100 games ± standard deviation.

IV-A vs Random

We observe that the PPO agent achieves the highest number of victories during the validation routine, with an average of 83.1 victories per 100 games, followed by DQL (66.8 averaged victories) and A2C (65.1 averaged victories). As these experiments were performed while playing against random agents, these numbers inform us that all the agents learned how to beat a random strategy, with the PPO agent being the best at it.

IV-B vs Myself

Our results from the self-play experiments clearly show that the vs. Myself agents learned how to beat the strategies learned by the vs. Random agents. What is important to notice is the higher standard deviation obtained by the DQL and the A2C agents when compared to the PPO agent. Given PPO’s advantage in fast adaptation, it presents a much more consistent behavior in learning the best strategies to play against a more varied set of opponents. Also, our results validate our training routine, as the final generation of every agent achieves more victories than the previous ones.

IV-C vs Others

Our third and last experiment puts the best agent from each learning algorithm (based on the results of the vs. Myself experiment) to play against each other. This result, illustrated in the “Before training” column, shows us that the PPO agent is the one with the best performance, followed closely by the DQL agent, and both with much better results than the A2C. The A2C number of victories tells us that the strategies it learned were much less successful when compared to the PPO and DQL.

This is much clearer when we re-train the three agents, making them adapt to each other’s strategies (illustrated in the “After training” column). The re-adaptation causes the DQL and the PPO agents to obtain similar performance, with a slight advantage to PPO, while the A2C agent seems to be completely ineffective against the other two. This can be explained by how these agents learn. The fast adaptation of the PPO agent presents an expected advantage compared to the A2C agent, while the experience replay of the DQL helps it to experience many more training samples, and to focus on learning a set of winning strategies. This behavior is better explored and explained in the next section.

V What is to be competitive?

Calculating the overall number of victories per agent tells us if they were successful in maximizing the goal of the game. However, once we have shown that these agents can learn, and some better than others, it is of high importance to shed light on how they achieve such performance in the competitive Chef’s Hat scenario. In this regard, we discuss below our interpretations of how these agents learn the game strategy and how they learn to be competitive when playing against each other.

V-A How do I learn an action-selection strategy?

To gain a better insight into the learned action-selection strategy per agent, we run a hundred games in the vs. Random, vs. Others before training, and vs. Others after training routines, and plot the selected Q-values over all the played matches in Figure 4. In the vs. Random games, we keep the agents receiving the same card distribution when playing against the random agents, to reproduce a similar initial condition.

Fig. 4: Q-values readings (Y axis) for each action of a game (X axis) for a hundred games following the vs. Random, vs. Others before and after training routine.

We observe that, in the vs. Random routine, the DQL agent has higher confidence in a single selected action, usually by the end of the game. This possibly indicates that this agent learned a small set of actions that guarantee it a win against the random agents, usually by the middle to end of a game. The A2C agent has a more distinct action-selection pattern, where a single action seems to have high confidence over the entire game. The PPO agent behaves somewhat opposite to the DQL agent: it presents higher confidence in a single action at the beginning of the match, while having devised different strategies, demonstrated by low confidence in any single action, through the rest of the game.

When playing against each other, in the vs. Others before training routine, the A2C and PPO agents present roughly the same behavior as in the vs. Random routine. As this scenario is composed of the agents that learned via self-play, we can infer that their action-selection strategy was not much altered by this training routine. Also, this scenario differs from the vs. Random scenario by providing a much more complex and dynamic state throughout the game, as each action taken by an agent has a direct impact on its opponents. This is reflected directly in the action-selection behavior of the agents changing through time. The DQL agent, however, changed its behavior drastically. It seems to have higher confidence in a single action at the beginning of the game, the opposite of its behavior in the vs. Random routine. This behavior change can be explained by the batch-learning technique used by the DQL. In the vs. Random scenario, the agent probably learned a set of similar strategies to beat the random agents and reinforced it. In the vs. Others routine, the agent learned another small set of strategies that seem to win most of the games, which is reflected in the agent’s behavior change.

In the vs. Others after training routine, the A2C and PPO agents change their behavior. In this scenario, the agents were updated to win the game by playing against each other, and thus the update routine rewarded behavior that hinders the other players from winning. That means they probably try to learn strategies to counter each other’s playing style. The A2C agent’s predilection for a specific action seems to disappear, and it presents a behavior similar to that of the PPO agent in the previous experiments. However, this does not translate into victories; on the contrary, based on our results, the A2C agent becomes much less effective. This probably indicates that the A2C agent is lost and did not learn any strategy to play against the other two. The PPO agent seems to continue its adaptation towards different actions per game state, which translates into the highest number of victories. The DQL agent keeps the same behavior as in the previous experiment, which seems to help it win games, although it shows a slight delay in when it focuses on a single action. It probably learned a single strategy that is quite effective against the other players. This probably indicates the difference between the DQL and PPO agents: while the DQL agent learned a small set of strategies that win the game, the PPO agent learned a more balanced gameplay style and different strategies to counterbalance the other agents’ behavior.

V-B How do I learn to be competitive?

Observing the evolution of the selected Q-value during training against the random agents, illustrated in Figure 5, gives us an insight into how the learning algorithms devise strategies to beat their opponents.

Fig. 5: Q-value evolution (Y axis) for each action (X axis) taken in a 1000-game training routine following the vs. Random and vs. Others routines.

When training against the random agents, the DQL agent presents a quick growth in Q-values at the beginning of the training routine, which can indicate that it associates the selected action with a higher chance of achieving the maximal goal, i.e. winning the game, very early on. This corroborates our insights from the single-game observation, and it indicates that the agent learns, fairly early during training, a specific set of strategies to be followed with high confidence.

This behavior changes in the A2C and PPO agents. They take longer to show an increase in the selected actions’ Q-values, which indicates they need more time to establish a state-action association with confidence while playing the game. They present lower confidence in selecting a specific action when compared to the DQL agent, which however does not translate into fewer overall victories. In the case of the PPO agent, the opposite is true: its lower confidence comes with more victories. This can be explained as the PPO agent learning steadily how to play against a random player, and in this way deriving many different strategies instead of a strict set as the DQL agent does. In this regard, it provides the best performance, probably due to its adaptive learning mechanism, when compared to A2C.

When training in the vs. Others experiment, the behavior changes. The DQL agent maintains high confidence during the entire training. The A2C agent appears to decrease its confidence drastically in the middle of the training routine, which corroborates our understanding that it loses focus and is not able to devise winning strategies. The PPO agent shows an interesting behavior, as it reduces its own confidence in one single action over the training routine. This is probably due to its fast adaptation in finding different strategies to beat the opponents over time, and in learning a high number of associations between game states and actions.
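Trends such as "quick growth early in training" or "a drastic decrease mid-training" are easier to read when the per-game values are smoothed. The sketch below shows one simple way to do this with a trailing moving average. It is an illustration only: the helper `moving_average`, the window size, and the per-game numbers are assumptions and are not taken from the experiments.

```python
from typing import List


def moving_average(values: List[float], window: int) -> List[float]:
    """Trailing moving average; the first entries average over the values available so far."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Made-up per-game means of the selected action's value over six training games.
per_game_mean_q = [0.1, 0.2, 0.6, 0.8, 0.9, 0.9]
trend = moving_average(per_game_mean_q, window=3)
```

A smoothed curve that rises within the first games and then stays flat would match the description of the DQL agent, whereas a curve that falls partway through training would match the description of the A2C agent in the vs. Others routine.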

All the agents’ behaviors directly reflect their learning mechanisms. The memory replay of the DQL makes it focus on a specific action, probably reinforced by the actions stored in the memory itself. We believe that with more games, the DQL agent would probably learn different strategies, as its memory would grow over time. The PPO agent’s fast adaptation translates into learning more associations between actions and states than the other two algorithms, in particular when the agents train against each other. The A2C agent struggles to keep pace with the PPO agent and, without the focused reinforcement that the DQL has, it loses its ability to adapt quickly, which translates into the smallest number of victories.
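The effect attributed to memory replay can be illustrated with a generic replay memory: when most stored transitions contain the same action, uniformly sampled minibatches are dominated by that action, so updates keep reinforcing it. The code below is a minimal sketch of this general technique, not the DQL implementation used in the experiments. The class `ReplayMemory`, the simplified `Transition` tuple, the capacity, and the toy data are all hypothetical.

```python
import random
from collections import Counter, deque
from typing import Deque, List, Tuple

# Simplified, hypothetical transition: (state id, action id, reward).
Transition = Tuple[int, int, float]


class ReplayMemory:
    """Fixed-capacity memory that returns uniformly sampled minibatches."""

    def __init__(self, capacity: int) -> None:
        self.buffer: Deque[Transition] = deque(maxlen=capacity)

    def store(self, transition: Transition) -> None:
        # Once the memory is full, the oldest transition is dropped.
        self.buffer.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        # Sample without replacement, capped at the number of stored transitions.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


rng = random.Random(0)
memory = ReplayMemory(capacity=100)
# Made-up experience: action 3 was chosen in 90 of the 100 stored transitions.
for step in range(100):
    action = 3 if step % 10 != 0 else 7
    memory.store((step, action, 1.0))

batch = memory.sample(32, rng)
action_counts = Counter(action for _, action, _ in batch)
```

In this toy setting any minibatch of 32 transitions contains at most 10 occurrences of action 7, so action 3 always dominates the batch. A larger and more varied memory would dilute this effect, which is consistent with the expectation that more games could lead the DQL agent to different strategies.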

VI Conclusion

In this paper, we presented a broad experiment with three different reinforcement learning algorithms playing the competitive Chef’s Hat card game. We implemented these algorithms in agents and trained them to play the game. To evaluate the agents, we performed three validation routines: playing against random agents, self-play, and playing against each other. We described how each learning algorithm behaved within the competitive scenario, and how their learning characteristics contributed to their performance. The PPO-based agent presented the best performance in all of our tasks, demonstrating how its quick update mechanisms contributed to competitive learning.

The agents learned different action-selection strategies, and their learning nature affected the way they tried to optimize their gameplay style. From our results, we consolidated the Chef’s Hat card game simulation environment as a challenging task to be learned, and set the initial work on understanding how reinforcement learning can be used in such a competitive task.

We envision the development of further specific adaptations to reinforcement learning agents to be more competitive in the Chef’s Hat card game. Given the proximity to the real-world card game scenario, we also encourage further research on applying such agents to play against real humans and embodied agents, such as social robots.


  • [1] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch (2017) Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748. Cited by: §II-D3.
  • [2] P. Barros, A. C. Bloem, I. M. Hootsmans, L. M. Opheij, R. H. Toebosch, E. Barakova, and A. Sciutti (2020) The chef’s hat simulation environment for reinforcement-learning-based agents. arXiv preprint arXiv:2003.05861. Cited by: §I, §II.
  • [3] P. Barros, A. Sciutti, I. M. Hootsmans, L. M. Opheij, R. H. A. Toebosch, and E. Barakova (2020) It’s food fight! introducing the chef’s hat card game for affective-aware hri. External Links: 2002.11458 Cited by: §I, §II-B, §II.
  • [4] J. Bergstra, D. Yamins, and D. D. Cox (2013) Hyperopt: a python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, pp. 13–20. Cited by: §II-D.
  • [5] K. Clary, E. Tosch, J. Foley, and D. Jensen (2019) Let’s play again: variability of deep reinforcement learning agents in atari environments. arXiv preprint arXiv:1904.06312. Cited by: §II-D2.
  • [6] Y. Efroni, G. Dalal, B. Scherrer, and S. Mannor (2018) Multiple-step greedy policies in approximate and online reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5238–5247. Cited by: §II-C.
  • [7] L. Fridman, J. Terwilliger, and B. Jenik (2018) Deeptraffic: crowdsourced hyperparameter tuning of deep reinforcement learning systems for multi-agent dense traffic navigation. arXiv preprint arXiv:1801.02805. Cited by: §I.
  • [8] A. Gosavi (2009) Reinforcement learning: a tutorial survey and recent advances. INFORMS Journal on Computing 21 (2), pp. 178–192. Cited by: §II-C.
  • [9] A. Gulli and S. Pal (2017) Deep learning with keras. Packt Publishing Ltd. Cited by: §II-D.
  • [10] J. K. Gupta, M. Egorov, and M. Kochenderfer (2017) Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83. Cited by: §I.
  • [11] R. N. Haksar and M. Schwager (2018) Distributed deep reinforcement learning for fighting forest fires with a network of aerial robots. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1067–1074. Cited by: §I.
  • [12] M. Hausknecht and P. Stone (2015) Deep recurrent q-learning for partially observable mdps. In 2015 AAAI Fall Symposium Series, Cited by: §II-D1.
  • [13] P. Hernandez-Leal, B. Kartal, and M. E. Taylor (2019) A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems 33 (6), pp. 750–797. Cited by: §I.
  • [14] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. (2018) Deep q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §II-D1.
  • [15] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura (2018) Navigating occluded intersections with autonomous vehicles using deep reinforcement learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 2034–2039. Cited by: §I.
  • [16] Ł. Kidziński, S. P. Mohanty, C. F. Ong, Z. Huang, S. Zhou, A. Pechenko, A. Stelmaszczyk, P. Jarosik, M. Pavlov, S. Kolesnikov, et al. (2018) Learning to run challenge solutions: adapting reinforcement learning methods for neuromusculoskeletal environments. In The NIPS’17 Competition: Building Intelligent Systems, pp. 121–153. Cited by: §II-D3.
  • [17] B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis (2017) Optimal and autonomous control using reinforcement learning: a survey. IEEE transactions on neural networks and learning systems 29 (6), pp. 2042–2062. Cited by: §II-C.
  • [18] M. McKenzie, P. Loxley, W. Billingsley, and S. Wong (2017) Competitive reinforcement learning in atari games. In Australasian Joint Conference on Artificial Intelligence, pp. 14–26. Cited by: §I.
  • [19] T. L. Meng and M. Khushi (2019) Reinforcement learning in financial markets. Data 4 (3), pp. 110. Cited by: §I.
  • [20] W. Meng, Q. Zheng, L. Yang, P. Li, and G. Pan (2019) Qualitative measurements of policy discrepancy for return-based deep q-network. IEEE transactions on neural networks and learning systems. Cited by: §II-D1.
  • [21] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §I, §II-D2, §II-D.
  • [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §I.
  • [23] C. Painter-Wakefield and R. Parr (2012) Greedy algorithms for sparse reinforcement learning. arXiv preprint arXiv:1206.6485. Cited by: §II-C.
  • [24] E. Ponomarev, I. Oseledets, and A. Cichocki (2019) Using reinforcement learning in the algorithmic trading problem. Journal of Communications Technology and Electronics 64 (12), pp. 1450–1457. Cited by: §I.
  • [25] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani (2017) Deep reinforcement learning framework for autonomous driving. Electronic Imaging 2017 (19), pp. 70–76. Cited by: §I.
  • [26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §I, §II-D3, §II-D.
  • [27] Y. Shoham, R. Powers, and T. Grenager (2003) Multi-agent reinforcement learning: a critical survey. Web manuscript. Cited by: §II-C.
  • [28] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente (2017) Multiagent cooperation and competition with deep reinforcement learning. PloS one 12 (4). Cited by: §I.
  • [29] M. Tokic (2010) Adaptive ε-greedy exploration in reinforcement learning based on value differences. In Annual Conference on Artificial Intelligence, pp. 203–210. Cited by: §II-C.
  • [30] H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, Cited by: §I, §II-D.
  • [31] Y. Wang, Y. Wan, C. Zhang, L. Cui, L. Bai, and P. S. Yu (2019) Competitive multi-agent deep reinforcement learning with counterfactual thinking. arXiv preprint arXiv:1908.04573. Cited by: §I, §I.
  • [32] H. Xu, K. Paster, Q. Chen, H. Tang, P. Abbeel, T. Darrell, and S. Levine (2018) Hierarchical deep reinforcement learning agent with counter self-play on competitive games. Cited by: §I.
  • [33] T. K. Yu, Y. M. Chieh, and H. Samani (2018) Reinforcement learning and convolutional neural network system for firefighting rescue robot. In MATEC Web of Conferences, Vol. 161, pp. 03028. Cited by: §I.