I Introduction
Hearthstone: Heroes of Warcraft is a free-to-play online video game developed and published by Blizzard Entertainment. It is a turn-based collectible card game played between two opponents. During a game, players use custom decks of thirty cards, along with a selected hero that has a unique power. They spend mana points to cast spells or summon minions to attack the opponent, with the goal of reducing the opponent’s health to zero. Building efficient decks is an essential skill, and many deck archetypes exist. These archetypes are characterized by different distributions of the cards’ mana costs and thus suit players with different play styles. There are also sets of cards that synergize well due to their unique properties and can be used in many different decks.
In recent years, Hearthstone has become a testbed for AI research. A community of passionate players and developers has started the HearthSim project (https://hearthsim.info/) and created many tools that allow simulating the game for the purpose of AI and machine learning experiments. Several researchers have already used this game in their studies [1, 2]. Moreover, our research team decided to use Hearthstone as one of the case studies that demonstrate the capabilities of Grail, our framework for designing AI in video games. For this reason, one objective of this article is to explain how powerful heuristic search algorithms can be combined with prediction models from the machine learning domain in order to construct a smart and cunning artificial Hearthstone player.
The paper is organized as follows: in the next section, we describe the specifics of the AAIA’17 Data Mining Competition. Section III presents an approach that uses the collected data to play the game of Hearthstone effectively. The approach is based on the so-called Monte Carlo Tree Search algorithm (Section III-A) coupled with machine learning models (Sections III-B to III-D). The last section is devoted to conclusions.
II AAIA’17 Data Mining Challenge
AAIA’17 Data Mining Challenge: Helping AI to Play Hearthstone (https://knowledgepit.fedcsis.org/contest/view.php?id=120) took place between March 23 and May 15, 2017. It was organized under the auspices of the International Symposium on Advances in Artificial Intelligence and Applications (AAIA’17, https://fedcsis.org/2017/aaia), which is a part of the FedCSIS conference series.
The main objective in this competition was to construct a prediction model able to foresee who is going to win, using only information about a single game state. The ability to accurately assess the winning chances of a player in different game states is essential for designing efficient and challenging AI opponents in many video games. The most famous example is the AlphaGo program, which used two neural networks to evaluate possible moves and game states of Go games [3]. In our competition, we challenged participants to design such models for Hearthstone.
In particular, the dataset provided to participants contained examples of game states extracted from Hearthstone play outs between weak AI players (i.e. the agents used to generate the data made their in-game decisions at random). The participants were asked to predict the winning chances of the first player from game states belonging to the test set and to submit their predictions to the Knowledge Pit competition platform [4]. To give participants the freedom to choose the data representation they wanted to use, the datasets were provided in two formats: a tabular format (with a simplified representation) and raw JSON files (with detailed game states).
The training part of the data was made available along with the corresponding information regarding the actual game winners. These labels were removed from the test set, which was also made available to participants. Initially, the training set consisted of game states; however, after an unwitting data leakage [5] was detected in the first few weeks of the challenge, it was extended by additional cases from the original test set (in total, there were training examples). The final test set consisted of game states. Test set examples were obtained from a different set of Hearthstone play outs than the training cases. In fact, while the training data contained game states from simulations, more than play outs were simulated to generate the test set. It is also important to note that while the training games used only different sets of cards (one deck for every hero type), the test games were played using different decks. As a consequence, the test data contained Hearthstone cards which had never appeared in the training set. Table I shows a summary of basic characteristics of the datasets used in the challenge.
Table I: Basic characteristics of the datasets used in the challenge.
characteristic | training set | test set
no. examples | |
no. games | |
no. used decks | |
percent of wins | |
min. win rate per hero (percent/hero_id) | |
max. win rate per hero (percent/hero_id) | |
II-A Evaluation of results and participation in the challenge
Participants of the competition had to prepare their solutions in the form of a file with predictions of the likelihood that player 1 will win, given a corresponding description of a game state. The files with predictions had to be sent using the submission system of Knowledge Pit [4]. Each of the competing teams could submit multiple solutions. The quality of the submissions was measured using the Area Under the ROC Curve (AUC) [6]. The submitted solutions were evaluated online and the preliminary results were published on the competition leaderboard. The preliminary score was computed on a subset of the test set, fixed for all participants, which consisted of a randomly chosen 5% of the test set. The final evaluation was conducted after completion of the competition using the remaining part of the test data.
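To make the evaluation measure concrete, the following is a minimal sketch of how AUC can be computed from a file of predicted win likelihoods and the true outcomes. It uses the rank-based formulation, in which AUC equals the probability that a randomly chosen winning state receives a higher score than a randomly chosen losing state (ties count as one half). The function name `roc_auc` and the toy data are illustrative assumptions and are not part of the competition tooling.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: labels are 0/1 (1 = player 1 won), scores are predictions."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the group of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one example of each class")
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example: three of the four (winner, loser) pairs are ordered correctly.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3]))  # 0.75
```

Because AUC depends only on the ordering of the predictions, any monotone rescaling of the submitted likelihoods would leave the score unchanged.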
Apart from submitting their predictions, each team was also obligated by the competition rules to provide a brief report describing its approach. Only the final solutions from teams which sent a valid report could undergo the final evaluation and be published among the competition results. In this way, we were able to collect a vast amount of information regarding efficient representation methods of Hearthstone game states and state-of-the-art approaches to this type of prediction problem.
II-B Summary of the competition
Even though the AAIA’17 Data Mining Challenge lasted for less than two months, it attracted the attention of many researchers from the domains of machine learning and artificial intelligence in video games. By the end of the competition there were teams from countries registered in the challenge. Among them, teams submitted at least one solution to the leaderboard and teams described their solution in a report uploaded to the Knowledge Pit platform. In total, we received submissions, which makes this competition the most popular one among challenges organized at Knowledge Pit to this day.
The large number of submitted reports gave us a unique opportunity to review the most effective prediction methods for the assessment of Hearthstone game states. The most successful approach in this regard turned out to be artificial neural networks [7], and particularly deep learning methods [8]. In fact, all top-ranked teams used neural networks in their solutions, and the winners focused particularly on convolutional neural networks [9]. Another popular approach was the XGBoost algorithm [10]. There were also much simpler approaches which turned out to be efficient, such as logistic regression models. Moreover, these methods were often combined: techniques such as averaging, bagging or stacking were commonly used to obtain better prediction results [11]. Table II presents the scores obtained by the five top-ranked teams. Notably, the difference in scores between the best solution and the baseline is less than 2%.
Table II: Final results of the top-ranked teams.
team name | rank | # of submissions | final result
iwannabetheverybest | | |
hieuvq | | |
johnpateha | | |
vz | | |
jj | | |
baseline | – | |
Many teams decided to use the data in the JSON format in order to construct richer representations of game states than the one available in the provided tabular data. Feature engineering [12] turned out to be an important aspect of the most efficient solutions. The extracted features often reflected the participants’ experience and domain knowledge about Hearthstone. Their descriptions, included in the reports, turned out to be a valuable source of knowledge which can be used to improve our artificial Hearthstone players.
III Augmentation of game state search heuristics with neural networks
III-A Monte-Carlo Tree Search
Monte Carlo Tree Search (MCTS) [13] is a method of learning an optimal policy for solving problems such as game playing. It was first applied to the game of Go [14] as an improvement over a Monte Carlo sampling technique (without the tree search). The algorithm led to a breakthrough in Go, which had previously been regarded as intractable for computer programs [15]. Driven by this success, MCTS became the state-of-the-art approach in various game domains, such as General Game Playing [16] and General Video Game Playing [17]. The idea of MCTS is to repeatedly simulate the game (problem) and build statistics about states and actions. Each iteration of the algorithm consists of four phases, as depicted in Figure 1.
(1) Selection. In this phase, the algorithm starts from the root node and descends the tree by choosing subsequent child nodes. The child at each node along the path is chosen according to the so-called selection policy. The selection phase ends when there is no child node to choose, i.e., a leaf node has been reached.
(2) Expansion. One of the possible actions is applied to the node selected in the previous step and the tree is grown by adding a child node representing the resulting state.
(3) Simulation. The algorithm starts from the new node and performs a complete game simulation, i.e., it plays until a terminal state is reached. This phase is done outside the game tree and no nodes are added to it. Once the simulation reaches the terminal state, the obtained goals (outcomes) of each player are fetched.
(4) Backpropagation. Here, the statistics are recalculated inside all nodes along the path from the root to the leaf (containing the starting state for the simulation) in the game tree. The statistics include the average scores of each player and the number of visits to a node. An average score is computed as the total score achieved in iterations going through a particular node divided by the number of visits to that node.
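The statistics kept in the backpropagation phase can be sketched as follows. This is a minimal illustration rather than the implementation used in our experiments: the `Node` class, the player identifiers and the dictionary-based score format are assumptions made for the example.

```python
class Node:
    """Tree node holding the statistics updated during backpropagation."""

    def __init__(self, parent=None):
        self.parent = parent
        self.visits = 0          # number of iterations going through this node
        self.total_scores = {}   # player id -> summed score over those iterations

    def average_score(self, player):
        """Total score of `player` divided by the number of visits to this node."""
        if self.visits == 0:
            return 0.0
        return self.total_scores.get(player, 0.0) / self.visits


def backpropagate(leaf, scores):
    """Update every node on the path from `leaf` up to the root.

    `scores` maps each player id to the outcome fetched at the end of the simulation.
    """
    node = leaf
    while node is not None:
        node.visits += 1
        for player, score in scores.items():
            node.total_scores[player] = node.total_scores.get(player, 0.0) + score
        node = node.parent


# Two iterations through the same root: one ends in `leaf`, one in `child`.
root = Node()
child = Node(parent=root)
leaf = Node(parent=child)
backpropagate(leaf, {"p1": 1.0, "p2": 0.0})
backpropagate(child, {"p1": 0.0, "p2": 1.0})
print(root.visits, root.average_score("p1"))  # 2 0.5
```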
In the classic implementation, actions in the simulation phase are chosen according to a uniform random distribution. In the selection phase, a more sophisticated formula (selection policy) is typically used. The most common one, which was also employed for all MCTS-based programs used in the experiments reported in this paper, is called Upper Confidence Bounds applied to Trees (UCT) [18]:
$$a^{*} = \arg\max_{a \in A(s)} \left\{ Q(s,a) + C \sqrt{\frac{\ln N(s)}{N(s,a)}} \right\} \qquad (1)$$
where $A(s)$ is the set of actions available in state $s$, $Q(s,a)$ denotes the average result of playing action $a$ in state $s$ in the simulations performed so far, $N(s)$ is the number of times state $s$ has been visited in previous simulations and $N(s,a)$ is the number of times action $a$ has been sampled in this state in previous simulations. The constant $C$ controls the balance between exploration and exploitation.
The MCTS algorithm using the UCT selection formula has been proven to converge to the min-max theoretical optimum [18]. However, it offers several advantages over a classic min-max search. For instance, it does not require any game-specific evaluation function, and it constructs the tree in an asymmetric manner, focusing on the most promising lines of play. It also scales better with the depth of the tree and can be stopped at any time to return the best action found so far.
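The four phases and the UCT selection policy can be put together in a compact sketch. The example below is not the Hearthstone implementation; it assumes a toy single-pile Nim game (players alternately remove 1 to 3 stones and whoever takes the last stone wins) so that the code stays self-contained. All function names, the value of the exploration constant and the choice of returning the most visited root action are assumptions made for illustration.

```python
import math
import random


def legal_actions(state):
    stones, _ = state
    return [a for a in (1, 2, 3) if a <= stones]


def apply_action(state, action):
    stones, player = state
    return (stones - action, 1 - player)


def is_terminal(state):
    return state[0] == 0


def winner(state):
    # The player who took the last stone wins, i.e. the one NOT to move now.
    return 1 - state[1]


class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.untried = legal_actions(state)
        self.visits = 0
        self.total = 0.0  # summed results for the player who moved into this node


def uct_child(node, c):
    """UCT: average result plus an exploration bonus for rarely sampled actions."""
    return max(
        node.children,
        key=lambda ch: ch.total / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def mcts(root_state, iterations, c=1.41, rng=None):
    rng = rng or random.Random()
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # (1) Selection: descend while the node is fully expanded.
        while not node.untried and node.children:
            node = uct_child(node, c)
        # (2) Expansion: add one child for a not yet tried action.
        if node.untried:
            action = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(apply_action(node.state, action), node, action)
            node.children.append(child)
            node = child
        # (3) Simulation: uniformly random play until a terminal state.
        state = node.state
        while not is_terminal(state):
            state = apply_action(state, rng.choice(legal_actions(state)))
        win = winner(state)
        # (4) Backpropagation: update statistics on the path to the root.
        while node is not None:
            node.visits += 1
            mover = 1 - node.state[1]
            node.total += 1.0 if win == mover else 0.0
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).action


# With 5 stones the winning move is to take 1 and leave a multiple of 4.
best = mcts((5, 0), iterations=5000, rng=random.Random(0))
```

The loop can be interrupted after any number of iterations and still returns an action, which illustrates the anytime property mentioned above.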
III-B Monte-Carlo Tree Search with Heuristics
Despite its wide usage in a variety of game domains, the MCTS method has bottlenecks and limitations. It is both computationally demanding and memory intensive. Games with a huge branching factor, i.e., a large average number of actions available to players, often inhibit the usage of MCTS and other tree-search methods. This weakness has motivated us to combine the algorithm with heuristics represented by prediction models. Such prediction models can be trained to predict the outcome of the game by looking either at a potential next state of the game (candidate state) or at a potential action (candidate action). In the scope of this paper, we will use the terms “machine learning prediction models” and “heuristic evaluation” interchangeably.
There are several ways to combine external heuristics with the MCTS algorithm. The authors of [19] review four common methods:
(1) Tree Policy Bias - here the heuristic evaluation function is included together with the average action value $Q(s,a)$ in the UCT formula (see Eq. 1) or its equivalent. A typical implementation of this idea is called Progressive Bias [20], in which the standard UCT evaluation is linearly combined with the heuristic evaluation with the weight proportional to the number of simulations. The more simulations are performed, the greater the statistical confidence and, therefore, the higher the weight assigned to the standard UCT formula.
(2) Move Ordering - the heuristic defines the order in which actions in the tree are expanded (chosen for the first time). This method has the most significant impact on the deeper parts of the tree, because MCTS is less likely to visit them again, so the order matters. If better moves are expanded first, their neighbourhood in the tree has a higher chance of being visited in subsequent simulations.
(3) Simulation Policy Bias - in the baseline version of the MCTS algorithm, the actions during the simulation phase are chosen randomly. With a good heuristic evaluation, a sensible approach is to incorporate this knowledge into the action selection process, while still leaving some degree of randomness. The two most common implementations are pseudo-roulette selection with probabilities computed using the Boltzmann distribution (where the heuristic evaluation is used) or the so-called epsilon-greedy approach [21]. In the latter, the action with the highest heuristic evaluation is chosen with the probability of or a random one with the probability of .
(4) Early Cutoff - the authors of [19] achieved the best results by terminating Monte Carlo simulations before the game ends and returning the heuristic evaluation of the current state. This variant is called Early Cutoff and the cutoff is typically done at a fixed depth or with a certain small probability (e.g. P=) in each step.
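The first, third and fourth methods listed above can be sketched in a few lines each. The code below is an illustration under stated assumptions, not the implementation from [19] or [20]: the bonus form `w * h / (n_action + 1)` is one commonly used way to let the heuristic fade as an action gathers visits, `greedy_prob` and `temperature` are hypothetical parameters, and the game is passed in through generic callbacks.

```python
import math
import random


def progressive_bias_score(q, n_parent, n_action, h, c=1.41, w=1.0):
    """Tree Policy Bias: UCT value plus a heuristic bonus that fades with visits.

    The form of the bonus is an assumption made for this sketch.
    """
    exploration = c * math.sqrt(math.log(n_parent) / n_action)
    return q + exploration + w * h / (n_action + 1)


def greedy_or_random(actions, heuristic, greedy_prob, rng):
    """Simulation Policy Bias, epsilon-greedy style: best-rated action with
    probability greedy_prob, otherwise a uniformly random one."""
    if rng.random() < greedy_prob:
        return max(actions, key=heuristic)
    return rng.choice(actions)


def boltzmann_probs(actions, heuristic, temperature=1.0):
    """Selection probabilities proportional to exp(heuristic / temperature)."""
    values = [heuristic(a) / temperature for a in actions]
    top = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_choice(actions, heuristic, temperature, rng):
    """Simulation Policy Bias, roulette style: sample from the Boltzmann distribution."""
    probs = boltzmann_probs(actions, heuristic, temperature)
    return rng.choices(actions, weights=probs)[0]


def rollout_with_cutoff(state, legal_actions, apply_action, is_terminal,
                        terminal_value, heuristic_value, max_depth, rng):
    """Early Cutoff: stop a random simulation at a fixed depth and return the
    heuristic evaluation of the state reached so far."""
    depth = 0
    while not is_terminal(state):
        if depth >= max_depth:
            return heuristic_value(state)
        state = apply_action(state, rng.choice(legal_actions(state)))
        depth += 1
    return terminal_value(state)


# Toy counting game: the state is an integer that grows by 1 until it reaches 10.
rng = random.Random(0)
value = rollout_with_cutoff(
    0,
    legal_actions=lambda s: [1],
    apply_action=lambda s, a: s + a,
    is_terminal=lambda s: s >= 10,
    terminal_value=lambda s: 1.0,
    heuristic_value=lambda s: s / 10,
    max_depth=3,
    rng=rng,
)
```

In each case the heuristic would be supplied by a trained prediction model; here it is an arbitrary callable so that the three mechanisms can be inspected in isolation.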
The aforementioned AlphaGo program employs both Tree Policy Bias and Simulation Policy Bias. Motivated by its success, we decided to apply a similar approach to Hearthstone.
III-C Generality of models trained on random simulations
Both the Tree Policy Bias and Simulation Policy Bias methods utilize a heuristic function which provides the value of a game state or an action. Various machine learning methods may be used to obtain these evaluations, including supervised prediction models [6]. Since random simulations are used in the classic version of MCTS, it is natural to train such models on game states obtained from play outs between agents making random decisions.
The datasets used in the AAIA’17 Data Mining Challenge constitute an example of this type of data. They can be used to train prediction models for the purpose of evaluating Hearthstone game states during MCTS simulations. However, a question arises whether such models could also be effective in games played by more intelligent opponents. To check that, we conducted a series of experiments. First, we generated an additional dataset containing game states from duels between strong MCTS agents. Each agent was making decisions after performing random simulations before a single action. In total, there were play outs generated in this way, which resulted in a dataset consisting of game states. We trained a simple neural network with two hidden layers on the training set used in our competition and we checked its performance on the available test set. Next, we constructed another model using all competition data and we tested it on the additional dataset. Figure 2 shows a comparison of the obtained AUC scores in consecutive turns. Surprisingly, total AUC dropped only slightly (from to ) when the test was done on the data generated by MCTS players. This shows that predictive models can be successfully used for the evaluation of game states even when they are trained on random simulations.
III-D Learning a playing strategy from sequences
In practice, it is often desirable to have a function that provides the policy rather than the value of a particular action. The policy specifies the probabilities for all actions available in a given state, thus it enables the selection of the best action candidate in a single state evaluation.
Reinforcement Learning is used in particular for cases where the optimal policy is unknown and needs to be learned based on sparse reward signals. However, a supervised learning approach may be used when examples of policies are available (e.g. from human players or other algorithms). In our case, we may use MCTS to generate Hearthstone matches and obtain state-action pairs, i.e. record which action was chosen by MCTS in response to a given game state. Next, we may train a model that predicts the action that MCTS would choose for a given state. Training such a model is essentially a classification task, well suited to deep neural networks (DNN).
Long short-term memory (LSTM) is a type of recurrent neural network [22] designed for use with sequences. The architecture of LSTM enables it to learn both long-term and short-term temporal dependencies. LSTM provides superior performance in tasks such as speech recognition, machine translation and language modeling. A deep LSTM network may be created by stacking multiple LSTM layers.
A DNN may be trained to approximate a policy from examples of state-action pairs. We will refer to such a network as the policy network [3]. However, in the case of Hearthstone, a single state may not provide enough information for the model to make a valid action prediction. This is due to the fact that a single turn in Hearthstone consists of many moves. Moreover, a single move is decomposed into a few atomic actions in order to be efficiently implemented in the Hearthstone simulator. For example, putting a minion on the board consists of two actions: selecting a card from the hand and selecting the slot on the board where the minion should be placed. To improve the accuracy of the policy predictions, rather than using a single state, we chose to use a sequence of states and previous actions as the input to the policy network.
Our policy network is presented in Figure 3. The LSTM network is provided with a sequence of vectors, each created by concatenating a state vector and an action vector for one step of the sequence. The state vector includes 403 values representing the state of the game from the point of view of a selected player. The action vector is of length 91 and contains a one-hot encoded action. The output of the LSTM is a single vector with probabilities assigned to each of the 91 available actions (probabilities are also provided for actions that are illegal in a given state). The LSTM in our experiments consisted of 3 layers of 256 LSTM cells with dropout.
To obtain the training data, we first used MCTS with 1000 iterations to generate 7000 games between randomly selected decks. Using this data, we obtained 528365 sequences of length 10, where each element of a sequence included 494 values. We will refer to this dataset as SeqMCTS1k. Next, we generated 1500 games using MCTS with 10000 iterations. As a result, we created a second dataset, SeqMCTS10k, which consists of 149521 sequences.
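The shape of a single training sequence follows directly from the numbers above: 403 state values concatenated with a 91-element one-hot action vector give 494 values per step, and 10 such steps form one input sequence. The sketch below only illustrates this encoding with plain Python lists; the function names are hypothetical and the actual feature extraction from a game state is not shown.

```python
STATE_SIZE = 403    # values describing the game state
NUM_ACTIONS = 91    # size of the one-hot action vector
SEQ_LEN = 10        # steps per input sequence


def one_hot(index, size=NUM_ACTIONS):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("action index out of range")
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def encode_step(state_vector, action_index):
    """Concatenate a state vector with the one-hot encoding of an action."""
    if len(state_vector) != STATE_SIZE:
        raise ValueError("unexpected state vector length")
    return list(state_vector) + one_hot(action_index)


def encode_sequence(state_vectors, action_indices):
    """Build one network input: SEQ_LEN steps of STATE_SIZE + NUM_ACTIONS values."""
    if len(state_vectors) != SEQ_LEN or len(action_indices) != SEQ_LEN:
        raise ValueError("a sequence must contain exactly SEQ_LEN steps")
    return [encode_step(s, a) for s, a in zip(state_vectors, action_indices)]


# Dummy data: all-zero states paired with actions 0..9.
sequence = encode_sequence([[0.0] * STATE_SIZE for _ in range(SEQ_LEN)],
                           list(range(SEQ_LEN)))
print(len(sequence), len(sequence[0]))  # 10 494
```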
To evaluate the policy network, we created a greedy DNN agent that always selects the most promising action according to the predictions of the policy network. We pitted this agent against a random agent and against an agent using MCTS with 1000 iterations. The greedy DNN agents used policy networks trained in three variants: 1) using only the SeqMCTS1k dataset, 2) using only the SeqMCTS10k dataset and 3) trained first on the SeqMCTS1k dataset and then retrained on the SeqMCTS10k dataset. The results are presented in Table III. Each score is calculated based on 500 games played between the agents.
Table III: Results of the greedy DNN agents.
greedy agent type | wins vs random agent | wins vs MCTS(1000)
SeqMCTS1k | |
SeqMCTS10k | |
SeqMCTS1k retrained with SeqMCTS10k | |
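Since the policy network also assigns probabilities to actions that are illegal in the current state, a greedy agent needs some way of restricting its choice. The sketch below assumes that the agent simply takes the most probable action among the legal ones; this masking step and the function name `greedy_action` are assumptions made for illustration.

```python
def greedy_action(probabilities, legal_action_indices):
    """Return the legal action with the highest predicted probability.

    `probabilities` is the network output (one value per action) and
    `legal_action_indices` lists the actions allowed in the current state.
    """
    if not legal_action_indices:
        raise ValueError("no legal actions available")
    return max(legal_action_indices, key=lambda index: probabilities[index])


# Action 0 has the highest probability but is illegal, so action 1 is chosen.
print(greedy_action([0.5, 0.3, 0.2], [1, 2]))  # 1
```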
IV Conclusions
In this paper we provided a summary of the AAIA’17 Data Mining Challenge, which was held at the Knowledge Pit platform. The results of this competition clearly show that learning from Hearthstone game logs is feasible and has the potential to facilitate the construction of intelligent artificial agents which play that game. We explained how prediction models can be combined with game state search heuristics to improve their performance. We also presented results of experiments showing that models trained on data obtained from random simulations can be successfully applied to the assessment of states in games between intelligent agents. Finally, we showed that more advanced approaches, such as supervised action policy learning based on game state sequences, are feasible and deserve further investigation.
Acknowledgments
This research was co-funded by the Smart Growth Operational Programme 2014-2020, financed by the European Regional Development Fund under a GameINN project POIR.01.02.00000150/16, operated by The National Centre for Research and Development (NCBiR), and by the Silver Bullet Solutions company.
References
[1] D. Taralla, “Learning artificial intelligence in large-scale video games: A first case study with Hearthstone: Heroes of Warcraft,” Ph.D. dissertation, Université de Liège, Liège, Belgium, 2015.
[2] P. García-Sánchez, A. Tonda, G. Squillero, A. Mora, and J. J. Merelo, “Evolutionary deckbuilding in Hearthstone,” in Computational Intelligence and Games (CIG), 2016 IEEE Conference on. IEEE, 2016, pp. 1–8.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[4] A. Janusz, D. Ślęzak, S. Stawicki, and M. Rosiak, “Knowledge Pit - a data challenge platform,” in Proceedings of the 24th International Workshop on Concurrency, Specification and Programming, Rzeszow, Poland, September 28-30, 2015, pp. 191–195. [Online]. Available: http://ceur-ws.org/Vol-1492/Paper_18.pdf
[5] S. Kaufman, S. Rosset, C. Perlich, and O. Stitelman, “Leakage in data mining: Formulation, detection, and avoidance,” TKDD, vol. 6, no. 4, p. 15, 2012.
[6] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, ser. Springer Series in Statistics. New York, NY, USA: Springer New York Inc., 2001.
[7] M. S. Szczuka and D. Ślęzak, “Feedforward neural networks for compound signals,” Theor. Comput. Sci., vol. 412, no. 42, pp. 5960–5973, 2011.

[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems, ser. NIPS’12. USA: Curran Associates Inc., 2012, pp. 1097–1105. [Online]. Available: http://dl.acm.org/citation.cfm?id=2999134.2999257
[9] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, ser. CVPR ’14. Washington, DC, USA: IEEE Computer Society, 2014, pp. 1725–1732. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2014.223
[10] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. New York, NY, USA: ACM, 2016, pp. 785–794. [Online]. Available: http://doi.acm.org/10.1145/2939672.2939785
[11] A. Janusz, “Combining multiple predictive models using genetic algorithms,” Intelligent Data Analysis, vol. 16, no. 5, pp. 763–776, 2012. [Online]. Available: http://dx.doi.org/10.3233/IDA-2012-0550
[12] A. Grużdź, A. Ihnatowicz, and D. Ślęzak, “Interactive gene clustering - a case study of breast cancer microarray data,” Information Systems Frontiers, vol. 8, no. 1, pp. 21–27, 2006.
[13] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A Survey of Monte Carlo Tree Search Methods,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, 2012.
[14] S. Gelly, Y. Wang, R. Munos, and O. Teytaud, “Modification of UCT with Patterns in Monte-Carlo Go,” 2006.
[15] X. Cai and D. C. Wunsch II, “Computer Go: A Grand Challenge to AI,” in Challenges for Computational Intelligence. Springer, 2007, pp. 443–465.
[16] M. Świechowski, H. Park, J. Mańdziuk, and K.-J. Kim, “Recent Advances in General Game Playing,” The Scientific World Journal, vol. 2015, 2015.
[17] D. Perez, S. Samothrakis, and S. Lucas, “Knowledge-Based Fast Evolutionary MCTS for General Video Game Playing,” in 2014 IEEE Conference on Computational Intelligence and Games. IEEE, 2014, pp. 1–8.
[18] L. Kocsis and C. Szepesvári, “Bandit Based Monte-Carlo Planning,” in Proceedings of the 17th European Conference on Machine Learning, ser. ECML’06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 282–293.
[19] K. Walędzik and J. Mańdziuk, “An automatically generated evaluation function in general game playing,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 6, no. 3, pp. 258–270, Sept 2014.
[20] G. M. J. Chaslot, M. H. Winands, H. J. van den Herik, J. W. Uiterwijk, and B. Bouzy, “Progressive strategies for Monte-Carlo tree search,” New Mathematics and Natural Computation, vol. 4, no. 03, pp. 343–357, 2008.
[21] M. Świechowski and J. Mańdziuk, “Self-Adaptation of Playing Strategies in General Game Playing,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 6, no. 4, pp. 367–381, Dec 2014.
[22] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735