Deep RL Agent for a Real-Time Action Strategy Game

by   Michal Warchalski, et al.

We introduce a reinforcement learning environment based on Heroic - Magic Duel, a 1 v 1 action strategy game. This domain is non-trivial for several reasons: it is a real-time game, the state space is large, the information given to the player before and at each step of a match is imperfect, and the distribution of actions is dynamic. Our main contribution is a deep reinforcement learning agent that plays the game at a competitive level. We trained it using PPO and self-play with multiple competing agents, employing only a simple reward of ±1 depending on the outcome of a single match. Our best self-play agent obtains around 65% win rate against the existing AI and over 50% win rate against a top human player.





Deep reinforcement learning for games has become an exciting area of research in recent years. It has proved successful in, among others, Atari games [3, 4] and the games of Go and chess [9, 11, 12, 10], and has led to major progress in the complex real-time game environments of Dota 2 [6] and StarCraft 2 [14, 13].

Inspired by the recent advances, we tackle a new reinforcement learning environment based on the video game Heroic - Magic Duel. The game combines elements of a card game and a real-time strategy game. Heroic has been downloaded over a million times since it launched in June 2019, and more than half a million matches are being played per day at the time of writing.

The players create custom decks consisting of cards that they will play against each other in a live match, which usually lasts between 2 and 3 minutes. The match is played 1 v 1, in real time, on a symmetrical battlefield that separates the player’s castle from the opponent’s. Players take actions by playing cards in order to deploy units on the battlefield, which fight the opponent’s units with the goal of destroying the opponent’s castle. In addition to deploying units, players have another type of action available to them: casting spells. Spells usually affect units in some way and are limited only by the fact that once a spell is used, it cannot be cast again for a period of time (it is on “cooldown”). Deploying units, in turn, uses a specific resource (mana) which refreshes over time, meaning that if a certain unit is deployed, another one cannot be deployed for a certain period afterwards.

There are several reasons why this domain is non-trivial. First of all, it is a real-time game with a considerable state space, since there are many different units in the game and coordinates on every lane are expressed in floating point precision. Secondly, the information given is imperfect both before and during the game. More concretely, the decks that both players use in a match are a priori unknown, and at any time of the match neither player knows which cards are currently available to the opponent. Moreover, because of the aforementioned cooldown and mana, taking any action has an intrinsic cost of limiting further action use for a time.

Figure 1: Game screen. The yellow dashed arrow represents the card casting mechanics.

As the main contribution of this paper, we trained a deep reinforcement learning agent that plays the game at a competitive level via self-play and the proximal policy optimization algorithm [8], using only a simple reward of ±1 depending on the outcome of a single match. The agent is robust to both different strategies and different decks. Following [1], in the learning process the agent competes against an ensemble of policies trained in parallel and, moreover, before every match the decks of both the agent and the opponent are randomly sampled from a pool of decks. With such a distribution of decks our best self-play agent has achieved around 65% win rate against the existing AI and over 50% win rate against a top human player.

In the following we further elaborate on the parts that we found crucial for our work.

The game offers plenty of visual information, which makes it challenging to learn a policy from pure pixels. Similarly to [14], we represent the spatial state of the game by constructing low resolution grids of features, which contain positions of different units on the map, and use them together with the non-spatial information as the observation.

As for the value network, we adopt the Atari-net architecture appearing in [14] in the context of StarCraft 2. For the policy network we initially started with the same baseline, which did not give us good results. We upgraded the policy by adding more flexibility to spatial actions. The change is twofold. First, we condition the spatial action on the selected action in the network head instead of sampling it independently of the selected action. Second, we output a single spatial action instead of sampling its coordinates separately. We found that these modifications improved the results by a substantial margin. In their implementation we make use of the fact that the action space and the discretized representation of the map are not very large.

Through experiments we found that the version of the game that involves spells is considerably more complex. Our hypothesis is that this is because spells usually need to be cast with more precision, and it is often more desirable to save a spell for later than to cast it immediately.

We test several training curricula, including playing against the existing AI and self-play, where the agent is trained against an ensemble of policies that are being trained in parallel [1]. To measure the performance we test the agent against two AI systems: rule-based and tree-search based. We also measure the performance of the agent against a top human player.

It is crucial to skip meaningless no-op actions in the experiments. Most of the time during a match no action is available, which makes the no-op action useless from the strategic point of view and hampers the learning. We resolve this issue by returning the state only when there is a non-trivial action available.

Related Work

In this section we elaborate on the work related to the paper, namely, deep reinforcement learning agents, proximal policy optimization, self-play, representation of spatial information in games as feature layers and passive no-op actions.

There has been a lot of progress in the area of deep reinforcement learning in games, which started several years ago with agents obtaining a superhuman level of performance in arcade games [4, 2]. Such an approach, combined with tree-search techniques and self-play, also proved successful in complex board games [9, 11, 12, 10]. Most recently, deep reinforcement learning agents playing real-time strategy games beat top professional human players [6, 13].

Proximal policy optimization (PPO), introduced in [8], is a popular on-policy algorithm in the reinforcement learning community due to its performance and simplicity. It was recently used successfully for training a neural network that beat top Dota 2 players [6]. PPO is often combined with generalized advantage estimation (GAE).

Deep reinforcement learning agents for competitive games are usually trained via self-play [10, 6] or by combining self-play with imitation learning [13]. In order to provide a variety of strategies, it has become popular to train multiple competing agents in parallel [1, 13].

In order to simplify the input of policy and value networks while still preserving the crucial spatial information, the authors of [14] use low resolution grids of features corresponding, among other things, to positions of various types of units on the game map and their properties. Such grids are then fed into a convolutional layer of a neural network in the form of a tensor with multiple channels, together with the non-spatial information. Subsequently, the processed feature layers and the non-spatial component are combined (for example via flattening and concatenation) and fed into the head of the policy network, which outputs a probability distribution over actions. It is also worth noting that feature layers representing the positions of chess pieces at several past timesteps were used in [11, 12].

The authors of [5] encountered a related issue concerning no-op actions when no other action can be executed. Although they tackle a quite different problem, namely a real-time fighting game, they resolve it in the same way as we do, by skipping passive no-op actions.

The Heroic Environment

In this section we present details of the environment based on Heroic - Magic Duel. In particular, we describe the game itself and introduce the observation and the action space of the reinforcement learning environment used throughout the paper. Finally, we give some technical details on the implementation of the system.

Game Details

Overview of the Game

The map where the match happens fits onto one screen of a mobile phone, see Figure 1. The map is symmetrical and consists of a battlefield which separates two castles, blue belonging to the left player and red belonging to the right player. The battlefield is itself split into 3 separate lanes. Lanes are separate parallel linear battlefields which all start at one castle and end at the other, so every lane gives access to the enemy castle at its end. The goal is to destroy the enemy castle.

Figure 2: Schematic diagram of the game mechanics (left) and diagram of attack dependencies (right).

Cards and Units

Before the match, the player selects a deck of 12 cards to be used during the match out of the total of 56 possible cards that exist in the game. At the time of the match, players play their cards at a desired location on the map, effectively spawning the units there.

Card Action Casting

During the match, the players spend mana to play cards from their hand in order to cast units onto the battlefield. They start the match with 4 cards in their hand, and periodically draw one more card from the deck to refill it. The player’s current hand is represented as cards at the bottom of the screen, see Figure 1. The deck holds an infinite supply of cards in the ratio that is determined by the 12 cards the player previously selected. In order to play a certain card, the player needs to pay its cost in mana. Mana is a resource which slowly refills until it reaches the mana cap, which is the limit of mana supply the player can hold. Each player starts the match with no mana and a mana cap of 2. The mana cap increases at the same pace as cards are drawn, meaning that more expensive cards can be played only at later stages of the match. Because mana can be capped (and thus does not refill until some is spent by playing a card), and the hand is capped at a maximum of 5 cards (and thus does not refill with a new card until one is played), there is pressure to keep playing actions in order to use the steady flow of resources optimally, see Figure 2.
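The resource rules above can be sketched in a few lines of Python. This is only an illustration: the starting values (no mana, cap of 2, 4 cards, hand cap of 5) come from the text, while the names `PlayerResources`, `refill_mana`, `draw_card`, `play_card` and the parameters `mana_per_second` and `cap_increase` are hypothetical, and we assume that the mana cap grows on every draw tick even when the hand is full.

```python
from dataclasses import dataclass

HAND_CAP = 5  # maximum number of cards in hand (from the text)


@dataclass
class PlayerResources:
    mana: float = 0.0      # each player starts the match with no mana
    mana_cap: float = 2.0  # initial mana cap
    hand_size: int = 4     # cards in hand at the start of the match


def refill_mana(res, dt, mana_per_second):
    """Refill mana for dt seconds; mana never exceeds the cap."""
    res.mana = min(res.mana_cap, res.mana + mana_per_second * dt)


def draw_card(res, cap_increase):
    """One draw tick: add a card if there is a free slot and raise the cap.

    Assumption: the cap grows on every draw tick, even if the hand is full.
    """
    if res.hand_size < HAND_CAP:
        res.hand_size += 1
    res.mana_cap += cap_increase


def play_card(res, cost):
    """Pay the mana cost and remove the card; return False if not affordable."""
    if res.hand_size == 0 or cost > res.mana:
        return False
    res.mana -= cost
    res.hand_size -= 1
    return True
```

The sketch makes the pressure to act visible: once `mana` equals `mana_cap` or `hand_size` equals `HAND_CAP`, further refills and draws are wasted until a card is played.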

To play a card, players drag it from their hand to the position where they want to cast it on the map. Once the card is cast, the unit corresponding to the card appears and the card is removed from the hand. The unit automatically starts moving towards the opponent’s castle (e.g. if the left player casts the card, the unit will start moving towards the right end of the lane). A card can be cast only between one’s own castle and the closest opponent unit on the given lane, but no further than the furthest friendly unit on any of the 3 lanes, and no further than half-length of the lane. This means at the start of the match players can only cast units at their castle.
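The casting restriction can be expressed as a small function. The sketch below is a hedged illustration, assuming positions are measured from the player's own castle (0.0) towards the opponent's castle (`lane_length`); the function name and its arguments are hypothetical and not part of the game's code.

```python
def max_cast_position(enemy_positions_on_lane, friendly_positions_all_lanes,
                      lane_length=1.0):
    """Return the furthest position on a lane where a card may be cast.

    Positions are distances from the player's own castle (assumption).
    The limit is the smallest of:
      * half of the lane length,
      * the furthest friendly unit on any lane (0.0 if there are none),
      * the closest opponent unit on the given lane (if any).
    """
    limit = lane_length / 2.0
    furthest_friendly = max(friendly_positions_all_lanes, default=0.0)
    limit = min(limit, furthest_friendly)
    if enemy_positions_on_lane:
        limit = min(limit, min(enemy_positions_on_lane))
    return limit
```

With no friendly units on the map the function returns 0.0, which matches the remark that at the start of the match units can only be cast at the player's own castle.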

For example, in Figure 1, the left player is playing his card by dragging it on the bottom lane. Additionally, he has four other cards in his hand, one of them is fully available, while the other three require more mana.

Unit Intrinsic Behavior

A unit on the battlefield automatically moves towards the opponent castle. If it manages to get to the end of the lane it starts attacking the opponent castle, lowering its health points. If any player’s castle hit points reach 0, that player loses the match, and their opponent wins the match. Units can only move within, and interact with enemies that are in, the same lane as that unit. If the moving unit encounters an opposing unit on the way to the enemy castle it will start attacking this opposing unit, attempting to destroy it by depleting it of health points in order to proceed towards the goal - this is how units “fight” each other. Speed and fighting abilities are determined by the game (and specified on the card that was used to cast that unit) - the player cannot influence any of them after the card is cast. Fighting abilities of the unit are determined by a number of properties, such as its range, type of attack, and any special abilities it might have. Certain unit types are good against certain types of units, but lose to others. This gives the choice of which units to cast the dimension of a rock-paper-scissors problem.

For example, Figure 2 shows that units that have a splash attack - attack which hits multiple enemies at once - are good against units which consist of many weak enemies, which are the “swarm” archetype. These are strong against melee type - the most basic units which attack one enemy at a time in close range. Melee type is normally good against ranged units - but this interaction can depend on how each unit is placed. Ranged units are again good against splash units as splash units are generally slow moving, and do not do great against single targets. Additionally, since ranged and swarm units have low hit points (they are easily defeated when attacked), they are generally weak against aggressive spells.


In addition to preparing their decks of cards, players choose 2 spells out of a total of 25 to be available to them to use during the match. Spells are special actions that require no mana to be played, but once played cannot be cast for a set period of time - the spell is on cooldown. Spells have a variety of effects, usually affecting units in some way. For example, they can lower the opponent units’ health points in an area where they are cast or make one’s own units stronger. Spell actions are generally more powerful but less frequently available than casting unit actions, and correct use can swing the match in one’s favor. The spells that are available to be cast from the interface are represented as hexagons above the player’s hand of cards on the game screen, see Figure 1.

Spell Action Casting

The player can select any of the available spells and place it at any location on the map. Once the spell is cast, the special action happens, which lasts up to a couple of seconds. Spells take effect only in a certain range. There are no restrictions on when a spell can be cast, but once it is, it is on cooldown, meaning it cannot be cast again for a period of 25 - 60 seconds. The length of the cooldown is a property of the spell itself and is enforced by the game, see Figure 2.
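As an illustration of the cooldown rule, the following minimal Python sketch tracks when each spell becomes castable again. The class `SpellCooldowns`, its methods and the spell name used in the usage note are hypothetical; only the rule itself (a cast spell is unavailable for its own fixed cooldown) is taken from the text.

```python
class SpellCooldowns:
    """Track spell availability; times are in seconds of match time."""

    def __init__(self, cooldown_seconds):
        # cooldown_seconds maps a spell name to its fixed cooldown length.
        self.cooldown_seconds = dict(cooldown_seconds)
        # Every spell is castable from the start of the match.
        self.ready_at = {name: 0.0 for name in self.cooldown_seconds}

    def time_until_ready(self, spell, now):
        """Seconds until the spell can be cast again (0.0 if ready)."""
        return max(0.0, self.ready_at[spell] - now)

    def cast(self, spell, now):
        """Cast the spell if it is off cooldown; return whether it was cast."""
        if now < self.ready_at[spell]:
            return False
        self.ready_at[spell] = now + self.cooldown_seconds[spell]
        return True
```

For example, a spell registered as `{"fireball": 30.0}` (an invented name) that is cast at time 0 is rejected at time 10 and accepted again at time 30.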

Reinforcement Learning Environment

In this subsection we present specifications of the reinforcement learning environment that we used in the experiments. We start by introducing some necessary setup.

Let $N_u$ be the total number of units in the game and let $N_l$ be the number of lanes. We discretize each lane by splitting it into $N_b$ bins of equal length. Let $N_s$ be the number of available spells. Let $N_a = N_u + N_s + 1$ be the total number of actions in the game, i.e. the total number of units (since it is equal to the number of cards), the total number of spells plus the no-op action. We index the unit types, lanes, discretization bins on every lane and the actions with the sets $[N_u]$, $[N_l]$, $[N_b]$ and $[N_a]$, respectively, where for any natural number $n$ we set $[n] = \{1, \dots, n\}$.


We represent an observation as a pair $o = (o_s, o_{ns})$, where $o_s$ is the spatial and $o_{ns}$ the non-spatial component.

The spatial component $o_s$ is a tensor of shape $2N_u \times N_l \times N_b$, where for $u \in [N_u]$ the value of $o_s[u, l, b]$ is the sum of health (in percent) of own units of type $u$ on lane $l$ whose position on the lane falls into the $b$-th bin of the discretization, and for $u \in \{N_u + 1, \dots, 2N_u\}$ it is the sum of health of the opponent’s units of type $u - N_u$ at the same $l$, $b$.

The non-spatial component $o_{ns}$ is a vector whose entry $o_{ns}[a]$, for each action $a$, is the number of seconds till the action $a$ is available. If this time is not known, there are two possibilities. The first one is that the action is available in the match, but the time-till-available is unknown. For such actions we set a default large positive value. The other possibility is that the action is not available in the current match; in this case we set a default negative value. The last three coordinates of $o_{ns}$ correspond to own castle health, opponent’s castle health and elapsed match time, respectively.


We represent an action as a triple $(a, l, b)$, where $a \in [N_a]$ is the selected action type and $l \in [N_l]$, $b \in [N_b]$ are the lane and the position on the lane where the action should be executed. We refer to $a$ as the non-spatial action and to $(l, b)$ as the spatial action. Note that in the case of the no-op action the coordinates $(l, b)$ are irrelevant. Nevertheless, we do not give this action any special treatment.


Throughout the experiments we give the agent a reward of $+1$ if it wins and $-1$ if it gets defeated. We also use discounted returns with discount parameter $\gamma$.
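To make the observation layout concrete, here is a minimal NumPy sketch of how the two components could be assembled. It is an illustration under stated assumptions, not the environment's actual code: indices are 0-based, unit positions are normalized to the interval [0, 1] along the lane, the input tuples and the function names are hypothetical, and the default placeholder values `unknown_value` and `absent_value` are invented.

```python
import numpy as np


def bin_index(position, n_bins):
    """Map a normalized lane position in [0, 1] to a bin index."""
    return min(int(position * n_bins), n_bins - 1)


def spatial_observation(own_units, opp_units, n_units, n_lanes, n_bins):
    """Build the spatial tensor of shape (2 * n_units, n_lanes, n_bins).

    Each unit is a tuple (unit_type, lane, position, health_percent).
    Own units fill the first n_units channels, opponent units the rest.
    """
    obs = np.zeros((2 * n_units, n_lanes, n_bins), dtype=np.float32)
    for offset, units in ((0, own_units), (n_units, opp_units)):
        for unit_type, lane, position, health in units:
            obs[offset + unit_type, lane, bin_index(position, n_bins)] += health
    return obs


def non_spatial_observation(times, in_match, own_castle, opp_castle, elapsed,
                            unknown_value=1000.0, absent_value=-1.0):
    """Build the non-spatial vector.

    times[i] is the seconds until action i is available, or None if unknown;
    in_match[i] says whether action i exists in the current match.
    """
    entries = []
    for t, present in zip(times, in_match):
        if not present:
            entries.append(absent_value)   # action not in this match
        elif t is None:
            entries.append(unknown_value)  # in the match, time unknown
        else:
            entries.append(t)
    entries += [own_castle, opp_castle, elapsed]
    return np.array(entries, dtype=np.float32)
```

Summing health per cell means that several units of the same type in the same bin are merged into one value, which keeps the tensor size independent of the number of units on the map.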

Technical Details

In this part we provide some details of the implementation of the match simulation. The Heroic - Magic Duel game has a clear separation between the game view, which is used for displaying the match, and the simulation, which is used to simulate matches. We extract the simulation part into a service that is able to run multiple matches concurrently. Given that match mechanics are deterministic, i.e. a discrete time step is used, it is possible to simulate the entire match in only a fraction of real match time. During training, the agent sends its action to the service, the service applies the given action and simulates the match up to the point where the agent can act again, and finally it returns the observation back to the agent.

Reinforcement Learning Agent

In this section we introduce the policy and the value networks that we use as well as their updates during training.

Let $\mathcal{A}$ be the set of all possible non-spatial actions in the game, let $\mathcal{A}_c$ be the set of card actions together with the no-op action and let $\mathcal{A}_s$ be the set of spell actions together with the no-op action. Moreover, let $\mathcal{O}$ be the set of all observations.

Policy Network

Figure 3: Two-headed policy network (left), single-headed policy network (center) and value network (right). Both policies take an observation consisting of a spatial component $o_s$ and a non-spatial component $o_{ns}$. $o_s$ is processed with one convolutional layer, while $o_{ns}$ is processed with a fully connected layer. The output of the convolutional layer is subsequently flattened and both vectors are concatenated. The result is then passed to a network head corresponding to a set $\mathcal{A}'$ of non-spatial actions. The head starts with a fully connected network, followed by a softmax layer that outputs a probability distribution over actions in $\mathcal{A}'$ and a sequence of probability distributions of spatial actions, one for each $a \in \mathcal{A}'$. Similarly to [14] we mask out invalid actions and renormalize the distribution. Next, an action $a$ is sampled. Finally, the spatial distribution corresponding to $a$ is used for sampling a spatial action.

We model the policy as $\pi_\theta$, where $\theta$ are the parameters of a neural network. For every observation $o$ and action $a$, the probability of taking $a$ given $o$ is given by $\pi_\theta(a \mid o)$. Below we describe the architecture of the policy network.

Similarly to the Atari-net in [14] we process the spatial observation with a convolutional network and the non-spatial observation with a fully connected network, whereupon we flatten the output of the spatial branch and concatenate both vectors.

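Concerning further stages of the architecture, we employ a two-headed and a single-headed policy network. In the following we describe a single policy head, which outputs a non-spatial action $a \in \mathcal{A}'$ and a spatial action, where $\mathcal{A}'$ is the set of all non-spatial actions, the set of card actions with no-op, or the set of spell actions with no-op, depending on the architecture and the head considered.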

A head for an action set $\mathcal{A}'$ outputs an action $a \in \mathcal{A}'$ and an accompanying spatial action. A difference between the Atari-net and our approach is that the output spatial action is conditioned on the selected action $a$. We do it by outputting a distribution of spatial actions for each $a \in \mathcal{A}'$, including invalid actions. Subsequently, we mask out the invalid actions, sample a valid $a$ and use the spatial action distribution corresponding to $a$ in order to sample a spatial action. Note that we can do that since the total number of cards and spells in the game is not very large; otherwise it could be costly to use multiple distributions with the same approach. Another difference is that we output a single spatial action corresponding to a pair of lane and position on the lane, i.e. we do not sample the two coordinates independently as is done in Atari-net. We found such an approach natural, since products of probability distributions form a smaller class of 2D probability distributions and, moreover, because the lane is intrinsically a discrete value. Here we make use of the fact that the ranges of both coordinates are not large, which guarantees almost no impact on the size or the stability of the network.
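The sampling procedure of a single head can be sketched with NumPy as follows. This is a hedged illustration rather than the authors' implementation: the function names and array layouts are assumptions, the logits stand in for the outputs of the network head, and we assume that at least one action is valid.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def sample_action(action_logits, spatial_logits, valid_mask, rng):
    """Sample (action, lane, bin) from one policy head.

    action_logits:  shape (n_actions,)
    spatial_logits: shape (n_actions, n_lanes, n_bins), one joint spatial
                    distribution per action (including invalid actions)
    valid_mask:     boolean array of shape (n_actions,), at least one True
    """
    # Mask out invalid actions and renormalize the distribution.
    probs = np.where(valid_mask, softmax(action_logits), 0.0)
    probs = probs / probs.sum()
    action = int(rng.choice(len(probs), p=probs))

    # Use the spatial distribution that belongs to the sampled action and
    # sample lane and bin jointly from a single flattened distribution.
    n_lanes, n_bins = spatial_logits.shape[1:]
    spatial_probs = softmax(spatial_logits[action].reshape(-1))
    flat = int(rng.choice(n_lanes * n_bins, p=spatial_probs))
    lane, bin_ = divmod(flat, n_bins)
    return action, lane, bin_
```

Sampling one index from the flattened lane-by-bin grid is what distinguishes this head from sampling the two coordinates independently: the joint distribution can represent preferences such as "far forward on lane 0 but close to the castle on lane 2", which a product of two marginals cannot.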

Value Network

The main building blocks of the value network are the same as the ones of Atari-net in [14], with the hyperparameters adjusted for our problem. As in the policy network, we feed the spatial observation into a convolutional network and the non-spatial observation into a fully connected network; subsequently we flatten the output of the convolutional net and concatenate both outputs. Then, we pass the resulting vector to a fully connected layer which outputs a single real value, see Figure 3.

Policy Network Update

We train agents using proximal policy optimization (PPO) introduced in [8], which is the state-of-the-art deep reinforcement learning algorithm in many domains, including real-time games with large state and action spaces [6]. In this subsection we briefly describe it.

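Given a policy $\pi_{\theta_k}$, PPO updates the parameters via

$$\theta_{k+1} = \arg\max_{\theta} \; \mathbb{E}_{(o, a) \sim \pi_{\theta_k}} \left[ \min\left( r_\theta(o, a)\, A^{\pi_{\theta_k}}(o, a),\; \mathrm{clip}\left(r_\theta(o, a),\, 1 - \epsilon,\, 1 + \epsilon\right) A^{\pi_{\theta_k}}(o, a) \right) \right], \qquad (1)$$

where $r_\theta(o, a) = \pi_\theta(a \mid o) / \pi_{\theta_k}(a \mid o)$, $\epsilon$ is the clip ratio and $A^{\pi_{\theta_k}}$ is the advantage function. The right hand side of Equation 1 is estimated using the Monte-Carlo method. In the experiments we estimate $A^{\pi_{\theta_k}}$ using generalized advantage estimation (GAE) and stop the update early if the KL-divergence between $\pi_\theta$ and $\pi_{\theta_k}$ becomes too large.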
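For illustration, the Monte-Carlo estimate of the clipped PPO objective from [8] can be computed from a batch of samples as in the NumPy sketch below. The function names are hypothetical, the inputs are log-probabilities of the sampled actions under the new and the old policy, and the sample-based KL estimate shown is one common approximation, not necessarily the one used in the experiments.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, clip_ratio):
    """Monte-Carlo estimate of the clipped surrogate objective (to maximize)."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))


def approx_kl(logp_new, logp_old):
    """Simple sample-based estimate of KL(old || new) for early stopping."""
    return float(np.mean(logp_old - logp_new))
```

In a training loop one would typically take gradient steps on the objective and break out of the epoch loop as soon as `approx_kl` exceeds a chosen threshold.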

Value Network Update

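PPO is coupled with learning the value network. We parametrize the value network with $\phi$ and update it by minimizing the expected squared error between $V_\phi(o_t)$ and the discounted return $\hat{R}_t$, where $t$ denotes the timestep. More concretely, the update is as follows

$$\phi_{k+1} = \arg\min_{\phi} \; \mathbb{E}_{o_t \sim \pi_{\theta_k}} \left[ \left( V_\phi(o_t) - \hat{R}_t \right)^2 \right],$$

where $\pi_{\theta_k}$ is the current policy and the right hand side is estimated via the Monte-Carlo method.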
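As a small worked example, the discounted return and the sample estimate of the squared error can be computed as follows. The function names are hypothetical and the sketch assumes a single finished episode whose rewards are given in order.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    """Return R_t = r_t + gamma * r_{t+1} + ... for every timestep t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def value_loss(values, returns):
    """Monte-Carlo estimate of the expected squared error."""
    return float(np.mean((np.asarray(values) - np.asarray(returns)) ** 2))
```

With the reward used in this paper, all rewards of a match are zero except the final one, so the return at step t is simply the match outcome discounted by the number of remaining steps.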

Training Curriculum

In this section we describe the curriculum that we use for training agents. First, we specify the AI that we use as benchmarks to test the agents against. Then, we elaborate on how we sample decks during learning. Finally, we introduce two training procedures. The first one is based on playing against the benchmark bots, while the second one is based on competing against an ensemble of policies [1].

Benchmarking against Existing AI

We use two different AI systems to benchmark the performance of our agents. The first one is a rule-based AI, which utilizes multiple built-in handcrafted conditions in order to decide which actions to execute. The second one is a tree-search AI, which uses a tree search algorithm combined with a handcrafted value function.

Deck Sampling

A single deck in the match consists of 12 cards and may contain duplicates, while in total there are 56 cards in the game. In order to make the agent robust under various decks we perform deck sampling. Not all choices of decks are reasonable, and in order to simplify the selection process we prepared a pool of decks chosen in collaboration with a top player. In the training we uniformly sample one deck for the agent and one for the opponent.
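Deck sampling itself is a one-liner; the sketch below only makes the assumption explicit that the two decks are drawn independently, so both players may receive the same deck. The function name and the representation of a deck as a tuple of card names are hypothetical.

```python
import random


def sample_decks(deck_pool, rng):
    """Uniformly sample one deck for the agent and one for the opponent."""
    agent_deck = rng.choice(deck_pool)
    opponent_deck = rng.choice(deck_pool)  # independent draw (assumption)
    return agent_deck, opponent_deck
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across training runs.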

Note that it would be an interesting problem to learn deck selection as a part of agent’s task, but we are not aiming to do it in this work.

Training via Competing against AI

Given the existing AI that lets us perform fast rollouts, the simplest curriculum is to pit an agent against a bot during training. This has the advantage of quicker progress, which allowed for faster iterations. The main disadvantage is that the agent fits its policy against a particular type of opponent, which can result in low versatility.

Training via Self-Play with Ensemble of Policies

The second approach that we take is self-play, which recently became popular in combination with deep reinforcement learning [9, 11, 12, 10, 13, 6]. However, agents trained via simple self-play, with one agent competing against older versions of itself, tend to find poor strategies, and depending on the complexity of the environment there are several ways to mitigate the issue [1, 13]. In this paper we follow the strategy proposed in [1] and train an ensemble of policies competing against each other. Every several iterations we first sample two policies (possibly the same policy) and for each policy we select its weights uniformly from the iterations in the interval $[\delta v, v]$, where $v$ is the last iteration number for the latest available parameters and $\delta \in [0, 1]$ is a parameter of the algorithm, see [1] for details.
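The sampling scheme can be sketched as below. It is a hedged illustration of the idea from [1], assuming that parameters are stored after every iteration so that any integer iteration in the interval can be loaded; the function names and the list `latest_iterations` (latest stored iteration per policy in the ensemble) are hypothetical.

```python
import math
import random


def sample_iteration(latest_iteration, delta, rng):
    """Pick an iteration uniformly from [delta * v, v], v = latest_iteration."""
    low = math.ceil(delta * latest_iteration)
    return rng.randint(low, latest_iteration)  # both ends inclusive


def sample_matchup(latest_iterations, delta, rng):
    """Sample two policies (possibly the same) and an iteration for each."""
    matchup = []
    for _ in range(2):
        policy = rng.randrange(len(latest_iterations))
        iteration = sample_iteration(latest_iterations[policy], delta, rng)
        matchup.append((policy, iteration))
    return matchup
```

Setting `delta` to 1 always selects the latest parameters, while `delta` equal to 0 samples from the whole training history of the policy.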


We train deep reinforcement learning agents using the architectures and the training plans introduced in the previous sections. Our goal is to demonstrate that these methods can yield agents playing the game at a competitive level. Concerning the training plans we demonstrate with experiments that the self-play curriculum yields agents with versatile strategies against different types of opponents. We test the self-play agents against the existing AI and a top human player. We also study the game with spells enabled. We point out that this version of the game is considerably more complex than the restricted counterpart and we examine different ways of handling the associated actions via the introduced policies. Before we present the results, we make a remark on handling no-op actions.

No-Op Actions

In the early stage of experimentation we discovered that naively triggering the network too often (every 0.2, 0.5 or 1 second), even when no move is available, can result in poor strategies. This is because for the vast majority of match time no action is available to be played. As in [5], we mitigate this issue by letting the agent perform no-op actions only when there is a non-trivial action available. The duration of no-op is a hyperparameter of our learning procedure, chosen separately for card no-op and for spell no-op.
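The skipping logic can be sketched with a toy simulation. Everything in the block is hypothetical and heavily simplified: `ToySimulation` is an invented stand-in for the match simulation in which a single action becomes available after a fixed number of ticks, and `noop_ticks` plays the role of the no-op duration hyperparameter.

```python
class ToySimulation:
    """Invented stand-in for the match simulation; not the real game."""

    def __init__(self, recharge_ticks=20, match_ticks=300):
        self.recharge_ticks = recharge_ticks
        self.match_ticks = match_ticks
        self.tick = 0
        self.charge = 0  # accumulated resource, capped at recharge_ticks

    def advance(self):
        self.tick += 1
        self.charge = min(self.charge + 1, self.recharge_ticks)

    def can_act(self):
        return self.charge >= self.recharge_ticks

    def done(self):
        return self.tick >= self.match_ticks

    def play_card(self):
        self.charge = 0


def skip_to_decision(sim):
    """Advance the match until a non-trivial action is available (or it ends)."""
    while not sim.done() and not sim.can_act():
        sim.advance()
    return sim.tick


def step(sim, action, noop_ticks):
    """Apply an action, then return control only at the next decision point."""
    if action == "noop":
        # A no-op waits for a fixed duration instead of acting.
        for _ in range(noop_ticks):
            if sim.done():
                break
            sim.advance()
    else:
        sim.play_card()
    return skip_to_decision(sim)
```

In this toy setting the agent is queried at ticks 20, 40, 45 and so on, rather than at every tick, so every observation it sees corresponds to a state in which a real choice exists.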

Experiment Details


We train the agents on 2 Nvidia Titan V GPUs. We scale training using a distributed PPO implementation with multiple parallel worker processes. Each worker collects multiple trajectories, and the data is then used to perform a synchronized update step.

Policy and Value Network

We use the policy and the value network architectures introduced above. The first two layers of all networks are the same: a convolutional layer with a fixed number of filters, filter size and stride, and a fully connected layer. The fully connected layers in the policy head and the last layer of the value network use the same number of units.

Algorithm Parameters

We use the PPO algorithm with a fixed clip ratio and with separate learning rates for the policy network and the value network. We estimate advantages with the generalized advantage estimator, which is parametrized by a discount factor $\gamma$ and a parameter $\lambda$. In each iteration we collect a fixed number of samples and perform several epochs of updates in batches. We also employ early stopping if the approximate mean KL-divergence between the current and the previous policy exceeds a threshold.
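For reference, generalized advantage estimation for a single trajectory can be sketched as follows. This is the standard recursion, written under the assumption that `values[t]` is the value estimate at step t and `last_value` is the estimate after the final step (0 for a finished match); the function name and argument names are hypothetical.

```python
import numpy as np


def gae_advantages(rewards, values, gamma, lam, last_value=0.0):
    """Generalized advantage estimates for one trajectory.

    delta_t = r_t + gamma * V(o_{t+1}) - V(o_t)
    A_t     = delta_t + gamma * lam * A_{t+1}
    """
    rewards = np.asarray(rewards, dtype=float)
    values_ext = np.append(np.asarray(values, dtype=float), last_value)
    deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

The parameter `lam` trades bias for variance: with `lam` equal to 1 the estimate reduces to the discounted return minus the value estimate, and with `lam` equal to 0 it reduces to the one-step temporal difference error.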

Episode and Match Length

We noticed that collecting more samples per iteration helped stabilize the training and improved exploration. We found value of to be a good trade-off between stability and training speed. Matches without spells enabled took on average around steps and with spells around steps.

Effect of Card No-Op

We compare two agents in the version of the game without spells - one with no-op disabled and the other with no-op enabled. We train them against the rule-based AI and use the policy in Figure 3 (since spells are disabled, both introduced policies are equivalent). In Figure 4 we demonstrate the win rates of both agents over the course of training. They perform almost the same. Our hypothesis is that this is because the number of cards in the player’s hand is bounded (by 5) and new cards cannot be drawn as long as there is no free slot. Hence, no-op actions do not give the agent any advantage and, as we additionally observed during the experiment, the agent learns to avoid them over time.

Figure 4: Policy without vs with card no-op.

Effect of Conditioning Spatial Actions

We compare our policy network (Figure 3) with an Atari-net-like policy network. The difference is that the latter samples spatial actions independently of the selected card action and, moreover, samples the two spatial coordinates independently of each other. We run experiments without spells and with the no-op action disabled against the existing rule-based AI. In Figure 5 we demonstrate the win rate of both agents in several hundred initial epochs - even in such a short training time the agent using conditioning of spatial actions performs better than the other. Note, however, that in the first phase of training the performance of both agents is inverted - we think that this is because the simpler agent discovers basic rules of casting cards more quickly, while the other agent needs more time to do so.

Figure 5: Policy conditioned spatial actions vs policy with spatial actions independent of the card action.

Effect of Self-Play

We present the results of training agents via self-play with an ensemble of policies, as in [1]. We run the experiment with sampling interval $[\delta v, v]$, where $v$ is the number of the last iteration for a given agent and $\delta$ is the parameter of the sampling scheme of [1]. We periodically tested the agents against the existing AI in order to track the agents’ progress and versatility. We run the experiment without spells. Our best self-play agent obtained around 65% win rate against the existing AI.

We also tested the agent against a human player, who is a developer that worked on the game and is among the top Heroic players. The matches were played with random decks from the pool for both players. In this test our self-play agent achieved over 50% win rate.

We additionally tested the self-play agent, the rule-based AI and the tree-search AI against the same human player in a series of matches against each of the three. This time the self-play agent obtained win rate, while the rule-based AI achieved win rate and the tree-search AI reached win rate.

Effect of Enabling Spells

In this subsection we study the effect of enabling spells in the game. First we compare the single-headed and the two-headed policy network (Figure 3), which are two different ways of handling spells. We use the two-headed policy network with card no-op disabled (see the card no-op experiment) and spell no-op enabled. The single-headed policy uses a single unified no-op action. In Figure 6 we demonstrate the win rates of both agents. To our surprise, we observed that even after a large number of iterations the win rate of the single-headed agent remained better by a margin of a few percent. Note that there is a change of slope at a certain point of the training curve of the single-headed agent - we discovered that from this point on the agent started to use no-ops more often. In order to evaluate the importance of spell no-op actions, we ran additional experiments with both card no-op and spell no-op actions disabled or enabled. The results are inconclusive. On one hand, the performance of the agents was very similar to the performance of the agents in the previous experiment. On the other hand, as opposed to the experiments without spells (Figure 4), the frequency with which the agents were using no-op actions did not drop during training.

Figure 6: Single-headed policy vs two-headed policy.

We think that this is the result of the agents not learning to play no-op actions optimally, while no-op actions do not appear irrelevant as in the no-spell experiments. We hypothesize that this is one of the potential reasons for the gap in performance between the no-spell and spell agents, which we present in Figure 7. In the upper part of the figure we show the win rate of two agents trained against the rule-based AI. In the same number of epochs the no-spell agent obtains around win rate, while the spell agent reaches around . The difference is even more apparent when training against the stronger tree-search AI; see the lower part of Figure 7. There we present the win rate curve of an agent playing without spells and an agent playing with spells that was additionally pretrained against the rule-based AI. We applied pretraining for the latter agent, since otherwise the reward signal is not strong enough to initiate any learning. Note that the drop in win rate at the end of pretraining is a result of the stronger performance of the tree-search AI.

Figure 7: Training without spells vs with spells.

Future Work

Although the presented agents offer a good level of competition, the strategies they learn lack long-term planning, and earlier states are not utilized because the agents use an MLP policy. We experimented with providing past information by stacking previous observations into a tensor with multiple channels, similarly to [11]. However, we did not observe any immediate benefit. Still, we think that endowing the agents with memory could be a major improvement. Thus, in future work we would like to run experiments with agents equipped with LSTM-based policies.
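The observation-stacking idea mentioned above can be sketched as a small fixed-length buffer. This is only an illustration under simplifying assumptions: observations are treated as flat lists of numbers, the history is zero-padded at the start of a match, and the class name `ObservationStacker` is hypothetical.

```python
from collections import deque


class ObservationStacker:
    """Keep the last `num_frames` observations, oldest first."""

    def __init__(self, num_frames, obs_size):
        self.num_frames = num_frames
        self.obs_size = obs_size
        self.frames = deque(maxlen=num_frames)
        self.reset()

    def reset(self):
        # At the start of a match, pad the history with all-zero observations.
        self.frames.clear()
        for _ in range(self.num_frames):
            self.frames.append([0.0] * self.obs_size)

    def push(self, obs):
        """Add a new observation and return the stack (num_frames x obs_size)."""
        if len(obs) != self.obs_size:
            raise ValueError("unexpected observation size")
        self.frames.append(list(obs))  # the oldest frame is dropped automatically
        return [list(frame) for frame in self.frames]


stacker = ObservationStacker(num_frames=3, obs_size=2)
stacker.push([1.0, 2.0])
print(stacker.push([3.0, 4.0]))
```

The returned stack plays the role of a multi-channel input, with one channel per time step; a recurrent policy would instead carry this history in its hidden state.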

We also plan to scale the agents further. We would like to augment the set of available decks in order to make the policy more robust. We think that the training procedure should also be scaled, by experimenting with more competing agents. It would be interesting to adopt a bigger league of policies together with exploiters or equilibrium strategies [13].

Finally, we wish to tackle the performance gap between spell and no-spell agents. In future work we would like to let the agent learn the no-op duration, as in [13], and to combine this with recurrent policies and extended training in order to mitigate this issue.


Acknowledgments

The authors would like to thank Marko Antonic for providing a detailed game description, Sandra Tanackovic for proofreading the paper, and Marko Knezevic for valuable feedback during the writing of this paper.


References

  • [1] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch (2017) Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748.
  • [2] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783.
  • [3] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  • [4] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518, pp. 529–533.
  • [5] I. Oh, S. Rho, S. Moon, S. Son, H. Lee, and J. Chung (2019) Creating pro-level AI for a real-time fighting game using deep reinforcement learning. arXiv preprint arXiv:1904.03821.
  • [6] OpenAI, C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Jozefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang (2019) Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
  • [7] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
  • [8] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • [9] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529, pp. 484–489.
  • [10] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis (2018) A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, pp. 1140–1144.
  • [11] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis (2017) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
  • [12] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. R. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis (2017) Mastering the game of Go without human knowledge. Nature 550, pp. 354–359.
  • [13] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. W. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. P. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, pp. 1–5.
  • [14] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, J. Quan, S. Gaffney, S. Petersen, K. Simonyan, T. Schaul, H. van Hasselt, D. Silver, T. Lillicrap, K. Calderone, P. Keet, A. Brunasso, D. Lawrence, A. Ekermo, J. Repp, and R. Tsing (2017) StarCraft II: a new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782.