1 Introduction
In reinforcement learning, an agent solves a Markov decision process (MDP) by selecting actions that maximize its long-term reward. Most state-of-the-art algorithms assume numerical rewards. In domains like finance, a real-valued reward is naturally given, but many other domains have no natural numerical reward representation. In such cases, numerical values are often hand-crafted by experts so that they optimize the performance of their algorithms. This process is not trivial, and it is hard to reason about what constitutes a good reward. Hence, such hand-crafted rewards may easily be erroneous and contain biases. For special cases such as domains with truly ordinal rewards, it has been shown that it is impossible to create numerical rewards that are not biased. For example,
[Yannakakis, Cowie, and Busso2017] argue that emotions need to be treated as ordinal information. In fact, it is often hard or impossible to tell whether domains are real-valued or ordinal by nature. Experts may even design hand-crafted numerical rewards without considering alternatives, since using numerical rewards is the state of the art and most algorithms need them. In this paper, we want to emphasize that numerical rewards do not have to be the ground truth, and that it may be worthwhile for the machine learning community to take a closer look at other options, ordinal rewards being only one of them.
MCTS
Monte Carlo tree search (MCTS) is a popular algorithm for solving MDPs. MCTS is used in many successful AI systems, such as AlphaGo [Silver et al.2017] or top-ranked algorithms in the general video game playing competitions [PerezLiebana et al.2018, Joppen et al.2018]. A recurring problem of MCTS is its behavior in the face of danger. As a running example, consider a generic platform game in which an agent has to jump over deadly gaps to eventually reach the goal on the right. Dying is very bad, and the further the agent proceeds to the right, the better. The problem becomes apparent when comparing the actions jump and stand still: jumping either leads to a better state than before, because the agent proceeded to the right by successfully jumping a gap, or to the worst possible state (death) in case the jump attempt failed. Standing still, on the other hand, safely avoids death, but will never advance to a better game state. MCTS averages the rewards obtained by experience, which often lets it choose the safer action and therefore not progress in the game, because the (few) experiences ending with its death pull the average reward of jump down below the mediocre but steady reward of standing still. Because of this, the behavior of MCTS has also been called cowardly in the literature [Jacobsen, Greve, and Togelius2014, Khalifa et al.2016].
Transferring those platform game experiences onto an ordinal scale eliminates the need for meaningful distances. In this paper, we present an algorithm that only depends on pairwise comparisons on an ordinal scale, and selects jump over stand still if it is more often better than worse. We call this algorithm Ordinal MCTS (OMCTS) and compare it to different MCTS variants using the General Video Game AI (GVGAI) framework [PerezLiebana et al.2016].
2 Monte Carlo Tree Search
In this section, we briefly recapitulate Monte Carlo tree search and some of its variants, which are commonly used for solving Markov decision processes.
2.1 Markov Decision Process
A Markov decision process (MDP) [Puterman2005] can be formalized as a quintuple $(S, A, \delta, r, \mu)$, where $S$ is the set of possible states $s$, $A$ the set of actions $a$ the agent can perform (with the possibility of only having a subset $A(s) \subseteq A$ of actions available in state $s$), $\delta$ a state transition function, $r$ a reward function for reaching state $s$, and $\mu$ a distribution over starting states. [Weng2011] has extended this notion to ordinal reward MDPs (ORMDPs), where rewards are defined over a qualitative, ordinal scale $O$, on which states can only be compared to obtain a preference between them; the feedback does not provide any numerical information that would allow assessing the magnitude of the difference in their evaluations.
The goal is to learn a policy $\pi(a \mid s)$ that defines the probability of selecting an action $a$ in state $s$. The optimal policy maximizes the expected cumulative reward in the MDP setting [Sutton and Barto1998], or the preferential information for each reward in a trajectory in the ORMDP setting [Weng2011]. For finding an optimal policy, one needs to solve the so-called exploration/exploitation problem: the state/action spaces are usually too large to sample exhaustively, so it is required to trade off the improvement of the current best policy (exploitation) with an exploration of unknown parts of the state/action space.
2.2 Monte Carlo Tree Search
Monte Carlo tree search (MCTS) is a method for approximating an optimal policy for an MDP. It builds a partial search tree, which is more detailed where the rewards are high. MCTS spends less time evaluating less promising action sequences, but does not avoid them entirely, in order to explore the state space. The algorithm iterates over four steps [Browne et al.2012]:

Selection: Starting from the root node, which corresponds to the start state, a tree policy traverses to deeper nodes until a node with unvisited successor states is reached.

Expansion: One successor state is added to the tree.

Simulation: Starting from the new state, a so-called rollout is performed, i.e., random actions are played until a terminal state is reached or a depth limit is exceeded.

Backpropagation: The reward of the last state of the simulation is backed up through the selected nodes in the tree.
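The four steps above can be sketched on a toy task in a few dozen lines of Python. This is a minimal illustration of ours, not the agent used in the experiments; the toy MDP, the function signatures, and all names are hypothetical:

```python
import math
import random

class Node:
    """One node of the partial search tree."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}     # action -> child Node
        self.visits = 0
        self.value_sum = 0.0   # sum of backpropagated rewards

def uct(child, c):
    """UCT value of a child node (see Eq. (1) below)."""
    if child.visits == 0:
        return float("inf")
    exploit = child.value_sum / child.visits
    explore = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return exploit + explore

def mcts(root_state, actions, step, reward, iterations=1000,
         c=math.sqrt(2), max_depth=6):
    root = Node(root_state)
    for _ in range(iterations):
        # 1) Selection: follow the tree policy while fully expanded
        node = root
        while len(node.children) == len(actions):
            node = max(node.children.values(), key=lambda ch: uct(ch, c))
        # 2) Expansion: add one successor state to the tree
        a = random.choice([a for a in actions if a not in node.children])
        node.children[a] = Node(step(node.state, a), parent=node)
        node = node.children[a]
        # 3) Simulation: random rollout up to a depth limit
        state = node.state
        for _ in range(max_depth):
            state = step(state, random.choice(actions))
        r = reward(state)
        # 4) Backpropagation: push the reward up the selected path
        while node is not None:
            node.visits += 1
            node.value_sum += r
            node = node.parent
    # final move: return the most visited root action
    return max(root.children, key=lambda a: root.children[a].visits)
```

On a toy random-walk MDP where moving right eventually pays off, this sketch reliably learns to prefer the right move.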
The UCT formula

$a^* = \arg\max_{a \in A(s)} \left( \bar{v}_{s,a} + C \sqrt{\frac{\ln n_s}{n_{s,a}}} \right)$   (1)

is used to select the most interesting action in a node $s$ by trading off the expected reward, estimated as $\bar{v}_{s,a}$ from the $n_{s,a}$ samples in which action $a$ has been taken in node $s$, with an exploration term, where $n_s$ is the total number of samples in node $s$. The trade-off parameter $C$ is often set to $\sqrt{2}$, which has been shown to ensure convergence for rewards in $[0, 1]$ [Kocsis and Szepesvári2006]. In the following, we will often omit the subscript $s$ when it is clear from the context.
2.3 MixMax Modification
As mentioned in the introduction, MCTS has been blamed for cowardly behavior in the sense that it often prefers a safe, certain option over a more promising but uncertain one. To change this behavior, [Jacobsen, Greve, and Togelius2014] proposed to use MixMax, which uses a mix of the maximum and the average reward

$Q(s, a) = q \cdot \max_i r_i + (1 - q) \cdot \bar{v}_{s,a}$   (2)

where $q \in [0, 1]$ is a parameter to trade off between the two values. As illustrated further below (Figure 2), this is a possible way to encourage MCTS to boost actions that can lead to highly rated states. Hence, MixMax may solve the running problem, given a well-tuned value of $q$.
The benefit of MixMax is its simplicity, which makes it very cheap to compute. However, the maximum does not take the distribution of rewards into account, which makes it very sensitive to noise: a single outlier may lead to a high MixMax bonus. Hence, MCTS may choose a generally very deadly action just because it survived once and got a good score for it, thereby, in a way, inverting the problem of MCTS's conservative action selection. Also note that, in comparison to vanilla MCTS, this bonus does not decrease with a higher number of deadly samples. The MixMax modification has already been used in the General Video Game framework to reduce cowardly behavior: [Khalifa et al.2016] found that a particular setting of $q$ exhibits more human-like behavior, and we also use their parameter setting in our experiments.
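The MixMax backup (2) is a one-liner; the sketch below shows how it boosts a risky action past a safe one. The sample values and the $q = 0.25$ setting are illustrative choices of ours, not necessarily the setting from [Khalifa et al.2016]:

```python
def mixmax(rewards, q):
    """MixMax backup: a convex mix of the best observed reward and the mean.
    q = 0 recovers the vanilla MCTS average, q = 1 the pure maximum."""
    return q * max(rewards) + (1 - q) * sum(rewards) / len(rewards)

# A risky action that rarely pays off versus a safe, mediocre one:
risky = [0.0, 0.0, 1.0]   # died twice, jumped the gap once
safe  = [0.4, 0.4, 0.4]   # standing still, steady mediocre reward

plain_risky, plain_safe = mixmax(risky, 0.0), mixmax(safe, 0.0)    # 0.33 vs 0.40
boost_risky, boost_safe = mixmax(risky, 0.25), mixmax(safe, 0.25)  # 0.50 vs 0.40
```

With $q = 0$ the safe action wins on average; with $q = 0.25$ the single successful jump lifts the risky action above it, illustrating both the intended boost and the outlier sensitivity discussed above.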
2.4 PreferenceBased Monte Carlo Tree Search
A version of MCTS that uses preference-based feedback (PBMCTS) was recently introduced by [Joppen, Wirth, and Fürnkranz2018]. In this setting, the agent receives rewards in the form of preferences over states. Hence, feedback about a single state $s$ is not available; it can only be compared to another state $s'$, i.e., $s \succ s'$ ($s$ dominates $s'$), $s \prec s'$, or $s \perp s'$ ($s$ and $s'$ are incomparable).
An iteration of PBMCTS consists of the same abstract steps as MCTS, but their realization differs. First and foremost, it is impossible to use preference information in a vanilla MCTS iteration, since it only samples a single trajectory, whereas a second state is needed for a comparison. Hence, PBMCTS does not select a single path per iteration but an entire subtree of the search tree. In each of its nodes, two actions are selected that can be compared to each other. In the backpropagation phase, the two selected actions in a node both have at least one trajectory; all trajectories are compared and the received preference information is stored. For the selection step, a modified version of the dueling bandit algorithm RUCB [Zoghi et al.2014] is used to select two actions per node, given the stored preferences.
There are two main disadvantages with this approach:

No transitivity is used. Given ten actions $a_1$ to $a_{10}$, MCTS needs at most 10 iterations to obtain a first fair estimate of the quality of each of those actions. In the preference-based approach, each action has to be compared with every other action before a first complete estimate can be made. These are $\binom{10}{2} = 45$ iterations, i.e., in general the effort is quadratic in the number of actions.

A binary subtree is needed to learn in each node of the currently best trajectory. Instead of a path of length $d$ as in vanilla MCTS, the subtree consists of up to $2^{d+1} - 1$ nodes and $2^d$ trajectories instead of only one, causing an exponential blowup of PBMCTS's search tree.
Hence, we believe that PBMCTS does not make optimal use of the available computing resources, since from a local perspective, transitivity information is lost, and from a global perspective, the desired asymmetric growth of the search tree is undermined by the need to select a binary tree. Note that even in the case of a non-transitive domain, PBMCTS will nevertheless obtain a transitive policy, as illustrated in Figure 1, where the circular preferences between actions A, B, and C cannot be reflected in the resulting tree structure.
3 Ordinal Monte Carlo Tree Search
In this section, we introduce OMCTS, an MCTS variant that relies only on ordinal information to learn a policy. We first present the algorithm, and then take a closer look at the differences to MCTS and PBMCTS.
3.1 OMCTS
Ordinal Monte Carlo tree search (OMCTS) proceeds like conventional MCTS as introduced in Section 2.2, but replaces the average value $\bar{v}_{s,a}$ in (1) with the Borda score $B(a)$ of an action $a$. To calculate the Borda score for each action in a node, OMCTS stores the backpropagated ordinal values and estimates pairwise preference probabilities from these data. Hence, it is not necessary to do multiple rollouts in the same iteration as in PBMCTS, because the current rollout can be directly compared to previously observed ones.
Note that $B(a)$ can only be estimated if each action has been visited at least once. Hence, similar to other MCTS variants, we enforce this by always first selecting non-visited actions in a node.
3.2 The Borda Score
The Borda score is based on the Borda count, which has its origins in voting theory [Black1976]. Essentially, it estimates the probability of winning against a random competitor. In our case, $B(a)$ estimates the probability of action $a$ winning against any other action available in node $s$.
To calculate the Borda score $B(a)$, we store all backpropagated ordinal values for each action $a$ available in node $s$. A simple solution to summarize this information is to use a two-dimensional array to count how often value $o$ is obtained by playing action $a$ in node $s$. Given these counts $c_a(o)$ in a node, we can derive the estimated density probabilities

$\hat{p}_a(o) = \frac{c_a(o)}{\sum_{o' \in O} c_a(o')}$   (3)
for receiving ordinal reward $o$ by playing action $a$ in this node. The probability of receiving an ordinal reward worse than $o$ for action $a$ (which we denote with $\hat{P}_a(o)$) is then

$\hat{P}_a(o) = \sum_{o' \prec o} \hat{p}_a(o').$   (4)
Given this, the probability of action $a$ beating action $b$ can be estimated as

$P(a \succ b) = \sum_{o \in O} \hat{p}_a(o) \left( \hat{P}_b(o) + \tfrac{1}{2}\, \hat{p}_b(o) \right).$   (5)
For each ordinal value $o$, this estimates the probability of $a$ receiving reward $o$ while $b$ receives a lesser reward (plus half of the probability that $b$ receives the same reward, to deal with ties). This is then summed up over all possible values $o \in O$.
The Borda score of $a$ is then the average win probability of $a$ over all other actions available in this node:

$B(a) = \frac{1}{|A(s)| - 1} \sum_{b \in A(s) \setminus \{a\}} P(a \succ b).$   (6)
It has several properties that encourage its use as a value estimator:

$B(a) = 1$ if and only if action $a$ strictly dominates every other action. Action $a$ then appears to be the best option and receives the highest possible estimate.

$B(a) = 0$ if and only if action $a$ is strictly dominated by every other action. If an action is worse than any other action, it receives the lowest possible estimate.

$B(a) = B(b)$ if two actions $a$ and $b$ have equal ordinal outcomes: since $P(a \succ b) = P(b \succ a) = \tfrac{1}{2}$, and the remaining terms needed to compute $B(a)$ and $B(b)$ are all identical, $B(a) = B(b)$ must hold.
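The pairwise win probabilities (5) and the Borda score (6) can be computed directly from the stored reward lists. The following brute-force sketch (our own naming; ordinal values are assumed to be encoded as comparable integers) also exhibits the properties listed above:

```python
def win_prob(rewards_a, rewards_b):
    """P(a > b): probability that a random sample of a beats a random
    sample of b, with ties counting as half a win (cf. Eq. (5))."""
    wins = sum(1.0 if ra > rb else 0.5 if ra == rb else 0.0
               for ra in rewards_a for rb in rewards_b)
    return wins / (len(rewards_a) * len(rewards_b))

def borda_score(samples, action):
    """Average win probability of `action` against all other actions.
    `samples` maps each action to its list of backpropagated ordinal values."""
    others = [b for b in samples if b != action]
    return sum(win_prob(samples[action], samples[b]) for b in others) / len(others)
```

For example, with samples {a: [2, 2], b: [1, 1], c: [0, 1]}, action a strictly dominates and scores 1, while c is nearly always beaten. This direct version re-scans all samples on every call; the count-based bookkeeping of Section 3.3 avoids that.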
3.3 Incremental Update
In order to reduce computation, we do not recompute the counts $c_a(o)$ used in (3) from scratch, but additionally maintain cumulative counts from which $\hat{P}_a(o)$ can be directly estimated, thereby avoiding the summation in (4). For each rollout that took action $a$ in node $s$ and yielded a reward $o$, we increase the cumulative counts for all values $o' \succ o$, as well as the counter $c_a(o)$ and the sample counter $n_a$. From this, the Borda score can be updated incrementally in the backpropagation step: given a new ordinal reward for action $a$ in node $s$, the Borda scores of all actions of $s$ have to be updated. For each action $b$ of $s$ we can update

$B_{t+1}(b) = (1 - w_{t+1})\, B_t(b) + w_{t+1}\, \tilde{B}_{t+1}(b),$   (7)

where $\tilde{B}_{t+1}(b)$ is the estimate derived from the new sample and $w_{t+1}$ is the relative proportion of data from time step $t+1$, such that all iterations are weighted equally.
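One possible realization of this count-based bookkeeping (our own class and variable names; the paper's exact counters may differ) maintains, per action, the outcome counts and cumulative "worse-than" counts, so that the pairwise probability (5) can be evaluated without touching the raw rollout data:

```python
class OrdinalCounts:
    """Per-action statistics over an ordinal scale encoded as 0..k-1."""
    def __init__(self, k):
        self.count = [0] * k   # c(o): rollouts ending with value o
        self.worse = [0] * k   # rollouts ending with a value below o
        self.n = 0

    def add(self, o):
        self.count[o] += 1
        self.n += 1
        for higher in range(o + 1, len(self.worse)):
            self.worse[higher] += 1   # the new value is worse than every higher o

    def p(self, o):        # density estimate, cf. Eq. (3)
        return self.count[o] / self.n

    def p_worse(self, o):  # cumulative probability below o, cf. Eq. (4)
        return self.worse[o] / self.n

def win_prob(a, b):
    """Eq. (5) evaluated from the maintained counts only."""
    return sum(a.p(o) * (b.p_worse(o) + 0.5 * b.p(o))
               for o in range(len(a.count)) if a.count[o])
```

Adding a sample costs at most one pass over the (small) ordinal scale, and evaluating a pairwise probability needs no summation over raw rollouts.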
3.4 Differences to MCTS
Although the changes from MCTS to OMCTS are comparatively small, the algorithms have very different characteristics. In this section, we highlight some of the differences between OMCTS and MCTS.
Loss function
MCTS and OMCTS do not use the same loss function. Consider an example node with two actions $a$ and $b$, where, for illustration, the past rollouts were $v_a = (10, 0, 0)$ and $v_b = (1, 1, 1)$ (the concrete values are only meant to make the contrast visible). MCTS averages the backpropagated values and compares them directly. This can be seen as minimizing a linear loss; here $a$ is better, since $\bar{v}_a \approx 3.33 > \bar{v}_b = 1$. OMCTS has a different loss function: instead of averaging the values, a preference comparison is used, and the action is chosen that dominates the other more frequently. This can be seen as minimizing a ranking loss; here $b$ is better, since $b$ wins 6 of the 9 pairwise comparisons against $a$. Which loss function should be used depends on the specific problem.
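The disagreement between the two criteria can be checked with a small numerical experiment (the rollout values are illustrative choices of ours):

```python
rollouts_a = [10, 0, 0]   # risky: one big success, two failures
rollouts_b = [1, 1, 1]    # safe: always a small reward

# Linear loss (vanilla MCTS): compare the averages.
mean_a = sum(rollouts_a) / len(rollouts_a)   # about 3.33
mean_b = sum(rollouts_b) / len(rollouts_b)   # 1.0

# Ranking loss (OMCTS): compare pairwise wins.
wins_a = sum(ra > rb for ra in rollouts_a for rb in rollouts_b)  # 3 of 9
wins_b = sum(rb > ra for ra in rollouts_a for rb in rollouts_b)  # 6 of 9
```

Averaging prefers a (3.33 > 1), while the pairwise comparison prefers b (6 wins versus 3), mirroring the two loss functions described above.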
Ordinal Values Only
The most prominent difference between MCTS and OMCTS is that for problems where only ordinal rewards exist, MCTS is not applicable without creating an artificial reward signal. Any assignment of numerical values to ordinal values is arbitrary and will add a bias [Yannakakis, Cowie, and Busso2017]. Similarly, a linear loss function (or any other loss function that uses value differences) will also introduce a bias, and a ranking loss should be used instead.
Cowardly Behavior
As mentioned previously, MCTS has been blamed for behaving cowardly, by preferring safe but unyielding actions over actions that carry some risk but will in the long run result in higher rewards. As an example, consider Figure 2, which shows in its bottom row the distribution of trajectory values for two actions over a range of possible rewards. One action (circles) has a mediocre quality with low deviation, whereas the other (stars) is sometimes worse but often better than the first one. Since MCTS prioritizes the stars only if their average is above the average of the circles, MCTS would often choose the safe, mediocre action. In the literature one can find many ideas to tackle this problem, like MixMax backups (cf. Section 2.3) or adding domain knowledge (e.g., by giving a direct bonus to actions that should be executed [PerezLiebana et al.2018, Joppen et al.2018]). OMCTS takes a different point of view, by not comparing average values but by comparing how often stars are the better option than circles and vice versa. As a result, it would prefer the star action, which is preferable in 70% of the games.
Normalization
Although MCTS does not depend on normalized reward values, in practice rewards are nevertheless often normalized to the range $[0, 1]$ in order to simplify the tuning of the parameter $C$. OMCTS is already normalized in the sense that all Borda scores lie in the range $[0, 1]$. Note, however, that this is a local, relative scaling and not a global, absolute scale as in regular MCTS: if $B_\nu(a) > B_{\nu'}(b)$, this does not mean that $a$ is a better action than $b$ unless $\nu = \nu'$.
MCTS can be modified to use local normalization as well by storing the minimal ($r_{\min}$) and maximal ($r_{\max}$) reward seen in each node $\nu$. For each new sample in $\nu$, these values are updated using the received reward $r$, which is then normalized to $(r - r_{\min})/(r_{\max} - r_{\min})$.
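A sketch of this per-node bookkeeping (class and field names are ours):

```python
class NodeNormalizer:
    """Tracks the reward extrema seen in one node and rescales into [0, 1]."""
    def __init__(self):
        self.r_min = float("inf")
        self.r_max = float("-inf")

    def normalize(self, r):
        # update the extrema seen in this node, then rescale r
        self.r_min = min(self.r_min, r)
        self.r_max = max(self.r_max, r)
        if self.r_min == self.r_max:
            return 0.5   # only one distinct reward observed so far
        return (r - self.r_min) / (self.r_max - self.r_min)
```

Note that earlier samples are not retroactively rescaled in this sketch; a full implementation would also have to decide how stored averages react when the extrema change.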
In our experiments, we tested this version under the name of NormalizedMCTS (NMCTS).
Computational Time
Even though we propose an incremental update for the Borda score, it should be mentioned that calculating a running average (MCTS) is faster than calculating the Borda score (OMCTS). In our experiments, computing the Borda score took several times longer than averaging, depending on the size of $O$ and $A(\nu)$.
4 Experimental Setup
We test the five algorithms described above (MCTS, OMCTS, NMCTS, MixMax, and PBMCTS) using the General Video Game AI (GVGAI) framework [PerezLiebana et al.2016]. GVGAI implements a variety of video games and provides playing agents with a unified interface for simulating moves using a forward model. Using this forward model is expensive, so simulations take a lot of time. We use the number of calls to this forward model as the computational budget. In comparison to using the real computation time, this is independent of the specific hardware, algorithm implementations, and side effects such as logging data.
Our algorithms are given access to the following pieces of information provided by the framework:

Available actions: The actions the agent can perform in a given state.

Game score: The score of the given state. The range of possible scores depends on the game.

Game result: The result of the game: won, lost or running.

Simulate action: The forward model. It is stochastic, e.g., with respect to enemy moves or random object spawns.
4.1 Heuristic Monte Carlo Tree Search
The games in GVGAI have a large search space, with several available actions per state and episodes that can last for a large number of turns. Using vanilla MCTS, one rollout may use a substantial amount of time, since many moves have to be made to reach a terminal state. To achieve a good estimate, many rollouts have to be simulated. Hence, it is common to stop rollouts early at non-terminal states, using a heuristic to estimate the value of these states. In our experiments, we use this variation of MCTS, adding the maximal rollout length $RL$ as an additional parameter. The heuristic value at non-terminal nodes is computed in the same way as the terminal reward (i.e., it essentially corresponds to the score at this state of the game).
4.2 Mapping Rewards to $[0, 1]$
The objective function has two dimensions: on the one hand, the agent needs to win the game by achieving a certain goal; on the other hand, it also needs to maximize its score. Winning is more important than achieving a high score.
Since MCTS needs its rewards to be bounded, or even better in $[0, 1]$, the two-dimensional target function needs to be mapped to one dimension; in our case, for comparability and ease of parameter tuning, into $[0, 1]$. Knowing the possible scores of a game, the score can be normalized as $\tilde{r}(s) = \frac{score(s) - s_{\min}}{s_{\max} - s_{\min}}$, with $s_{\max}$ and $s_{\min}$ being the highest and lowest possible score. Note that this differs from the NMCTS normalization discussed in Section 3.4 in that here global extrema are used, whereas NMCTS uses the extrema seen in each node.
To model the relation lost $\prec$ playing $\prec$ won, which must hold for all states, we split the interval $[0, 1]$ into three equal parts (cf. also the axis of Figure 2):

$r(s) = \frac{1}{3}\tilde{r}(s)$ if $s$ is lost, $\quad r(s) = \frac{1}{3}\tilde{r}(s) + \frac{1}{3}$ if $s$ is running, $\quad r(s) = \frac{1}{3}\tilde{r}(s) + \frac{2}{3}$ if $s$ is won.   (8)
This is only one of many possibilities to map the rewards to $[0, 1]$, but it is an obvious and straightforward approach. Naturally, the results for the MCTS techniques, which use this reward, will change when a different reward mapping is used, and their results can probably be improved by shaping the reward. In fact, one of the main points of our work is to show that for OMCTS (as well as for PBMCTS) no such reward shaping is necessary, because these algorithms do not rely on the numerical information: for them, the mapped reward induces the same preferences as the original two-dimensional feedback.
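A sketch of this mapping (the function name and the exact placement of the thirds are our reading of the construction in (8)):

```python
def map_reward(score, s_min, s_max, result):
    """Map the (result, score) pair into [0, 1]: lost states fall into the
    lowest third, running states into the middle third, and won states
    into the highest third of the interval."""
    norm = (score - s_min) / (s_max - s_min)       # global score normalization
    offset = {"lost": 0.0, "running": 1 / 3, "won": 2 / 3}[result]
    return offset + norm / 3
```

With this mapping, any won state is rated at least as high as any running state, which in turn is rated at least as high as any lost state, regardless of the score.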
4.3 Selected Games
GVGAI provides users with many games, and an evaluation on all of them is not feasible. Furthermore, some results would exhibit erratic behavior, since the tested algorithms are not suited to solving some of the games. For example, true rewards are often very sparse, and the agent has to be guided in some way to reliably solve the game.
For this reason, we manually played all the games and selected a variety of interesting and not too complex games with different characteristics, which we believed to be solvable by the tested algorithms:

Zelda: The agent can hunt monsters and slay them with its sword. It wins by finding the key and taking the door.

Chase: The agent has to catch animals that flee from it. Once an animal finds a caught one, it gets angry and chases the agent. The agent wins once no more animals flee, and loses if a chasing animal catches it.

Whackamole: The agent can collect mushrooms that spawn randomly. A cat helps it in doing so. The game is won after a fixed amount of time, or lost if the agent and the cat collide.

Boulderchase: The agent can dig through sand towards a door that opens after it has collected ten diamonds. Monsters chase it through the sand, turning sand into diamonds.

Surround: The agent can win the game at any time by taking a specific action, or collect points by moving while leaving a snake-like trail. A moving enemy also leaves a trail. The game is lost if the agent collides with a trail.

Jaws: The agent controls a submarine that is hunted by a shark. It can shoot fish, which gives points and leaves an item behind. Once 20 items are collected, a collision with the shark gives a large number of points; otherwise it loses the game. Colliding with a fish always loses the game. The fish spawn randomly at 6 specific positions.

Aliens: The agent can only move from left to right and shoot upwards. Aliens fly from top to bottom, throwing rocks at the agent. To increase its score, the agent can shoot the aliens or shoot disappearing blocks.
The number of iterations that can be performed by the algorithms depends on the computational budget of calls to the forward model. We tested the algorithms with 2000, 1000, 500, and 250 forward model uses (later called time resources). Thus, in total, we experimented with 28 problem settings (7 domains × 4 time resources).
4.4 Tuning Algorithms and Experiments
All MCTS algorithms have two parameters in common, the exploration trade-off $C$ and the rollout length $RL$. For $C$ we tested 4 different values, and for $RL$ we tested 9 values, resulting in 36 configurations per algorithm. To reduce variance, we repeated each experiment 40 times. Overall, 5 algorithms with 36 configurations were run 40 times on 28 problems, resulting in 201,600 games played for tuning.
Additionally, we compare the algorithms to Yolobot, a highly competitive GVGAI agent that has won several challenges [Joppen et al.2018, PerezLiebana et al.2018]. Yolobot is able to solve games that none of the other five algorithms can solve. Note that Yolobot is designed and tuned to act within a 20 ms time limit; scaling the time resources might not lead to better behavior. Still, it is added for the sake of comparison and interpretability of strength. Yolobot plays each of the 28 problems 40 times, which leads to 1,120 additional games, or 202,720 games in total.¹

¹For anonymization, we added the agents as supplementary material. In case of acceptance, they will be made publicly available.
We are mainly interested in how well the different algorithms perform on the problems, given optimal tuning per problem. To answer this, we report the performance of the algorithms per problem as the percentage of wins and the obtained average score. We apply a Friedman test on the average ranks of these data, with a post-hoc Wilcoxon signed-rank test to test for significance [Demšar2006]. Additionally, we show and discuss the performance of all parameter configurations.
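The average ranks used by the Friedman test (and reported in the last row of Table 1) can be computed as follows. This is a generic sketch with our own names, not our evaluation code:

```python
def average_ranks(scores):
    """scores[i][j]: performance of algorithm j on problem i (higher is better).
    Returns the mean rank of each algorithm, 1 = best; ties share the mean rank."""
    n_algos = len(scores[0])
    totals = [0.0] * n_algos
    for row in scores:
        order = sorted(range(n_algos), key=lambda j: -row[j])
        pos = 0
        while pos < n_algos:
            # collect the group of algorithms tied at this position
            tied = [order[pos]]
            while (pos + len(tied) < n_algos
                   and row[order[pos + len(tied)]] == row[order[pos]]):
                tied.append(order[pos + len(tied)])
            shared = sum(range(pos + 1, pos + len(tied) + 1)) / len(tied)
            for j in tied:
                totals[j] += shared
            pos += len(tied)
    return [t / len(scores) for t in totals]
```

The Friedman test then checks whether these average ranks deviate from each other more than would be expected if all algorithms performed equally.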
Table 1: Win rate / average score for each algorithm, game, and time resource (budget of forward model calls).

Game | Time | OMCTS | MCTS | NMCTS | Yolobot | PBMCTS | MixMax
Jaws | 2000 | 100% / 1083.8 | 100% / 832.7 | 100% / 785.7 | 27.5% / 274.7 | 80.0% / 895.7 | 67.5% / 866.8
Jaws | 1000 | 92.5% / 1028.2 | 95.0% / 958.9 | 92.5% / 963.2 | 35.0% / 391.0 | 52.5% / 788.5 | 65.0% / 736.4
Jaws | 500 | 85.0% / 923.4 | 90.0% / 1023.1 | 97.5% / 1078.2 | 65.0% / 705.7 | 50.0% / 577.6 | 52.5% / 629.0
Jaws | 250 | 85.0% / 1000.9 | 85.0% / 997.6 | 87.5% / 971.9 | 32.5% / 359.6 | 37.5% / 548.8 | 37.5% / 469.0
Surround | 2000 | 100% / 81.5 | 100% / 71.0 | 100% / 63.5 | 100% / 81.2 | 100% / 64.3 | 100% / 57.6
Surround | 1000 | 100% / 83.0 | 100% / 80.8 | 100% / 75.2 | 100% / 77.3 | 100% / 40.8 | 100% / 25.0
Surround | 500 | 100% / 84.6 | 100% / 61.8 | 100% / 79.3 | 100% / 83.3 | 100% / 26.3 | 100% / 17.3
Surround | 250 | 100% / 83.4 | 100% / 64.7 | 100% / 55.2 | 100% / 76.1 | 100% / 14.3 | 100% / 10.3
Aliens | 2000 | 100% / 82.4 | 100% / 81.6 | 100% / 81.2 | 100% / 81.5 | 100% / 81.8 | 100% / 77.0
Aliens | 1000 | 100% / 79.7 | 100% / 78.4 | 100% / 77.7 | 100% / 82.2 | 100% / 76.9 | 100% / 76.4
Aliens | 500 | 100% / 78.0 | 100% / 77.3 | 100% / 78.6 | 100% / 81.1 | 100% / 77.2 | 100% / 76.0
Aliens | 250 | 100% / 77.7 | 100% / 77.1 | 100% / 77.1 | 100% / 79.3 | 100% / 75.8 | 100% / 74.8
Chase | 2000 | 87.5% / 6.2 | 80.0% / 6.0 | 80.0% / 5.8 | 50.0% / 4.8 | 67.5% / 5.2 | 37.5% / 3.9
Chase | 1000 | 60.0% / 4.8 | 50.0% / 4.8 | 47.5% / 5.0 | 70.0% / 5.1 | 30.0% / 3.7 | 17.5% / 2.6
Chase | 500 | 55.0% / 4.9 | 45.0% / 4.5 | 45.0% / 4.7 | 90.0% / 5.5 | 27.5% / 2.9 | 12.5% / 2.1
Chase | 250 | 40.0% / 4.2 | 32.5% / 4.1 | 32.5% / 4.2 | 90.0% / 5.6 | 17.5% / 2.5 | 7.5% / 2.6
Boulderchase | 2000 | 62.5% / 23.7 | 75.0% / 22.1 | 82.5% / 24.0 | 45.0% / 18.8 | 82.5% / 27.3 | 30.0% / 20.1
Boulderchase | 1000 | 50.0% / 22.8 | 32.5% / 18.6 | 37.5% / 18.6 | 52.5% / 21.8 | 40.0% / 18.1 | 22.5% / 16.2
Boulderchase | 500 | 47.5% / 24.7 | 30.0% / 20.2 | 37.5% / 21.4 | 35.0% / 18.3 | 32.5% / 19.4 | 15.0% / 14.4
Boulderchase | 250 | 40.0% / 20.9 | 40.0% / 20.1 | 35.0% / 20.2 | 60.0% / 21.7 | 17.5% / 14.7 | 15.0% / 15.3
Whackamole | 2000 | 100% / 72.5 | 100% / 44.4 | 100% / 44.6 | 75.0% / 37.0 | 97.5% / 60.1 | 75.0% / 48.5
Whackamole | 1000 | 100% / 64.0 | 100% / 41.8 | 100% / 48.2 | 55.0% / 33.9 | 77.5% / 43.9 | 65.0% / 39.0
Whackamole | 500 | 100% / 59.5 | 100% / 50.0 | 100% / 51.5 | 57.5% / 29.0 | 70.0% / 38.1 | 52.5% / 35.4
Whackamole | 250 | 97.5% / 54.8 | 100% / 45.9 | 97.5% / 46.4 | 50.0% / 28.5 | 65.0% / 35.1 | 52.5% / 26.6
Zelda | 2000 | 97.5% / 8.3 | 87.5% / 7.4 | 90.0% / 6.7 | 95.0% / 3.8 | 90.0% / 9.6 | 70.0% / 8.1
Zelda | 1000 | 80.0% / 8.8 | 85.0% / 7.5 | 77.5% / 7.4 | 87.5% / 5.3 | 57.5% / 8.6 | 42.5% / 8.8
Zelda | 500 | 62.5% / 8.6 | 75.0% / 8.2 | 70.0% / 7.8 | 77.5% / 4.6 | 50.0% / 8.8 | 35.0% / 7.8
Zelda | 250 | 55.0% / 8.4 | 55.0% / 7.8 | 57.5% / 7.8 | 70.0% / 4.4 | 45.0% / 8.0 | 30.0% / 7.2
Rank (avg.) | – | 1.9 | 3.1 | 3.1 | 3.0 | 4.3 | 5.7
5 Experimental Results
Table 1 shows the best win rate and the corresponding average score of each algorithm, averaged over 40 runs for each of the different parameter settings. In each row, the best values for the win rate and the average score are highlighted, and a ranking of the algorithms is computed. The resulting average ranks are shown in the last line. We use a Friedman test and a post-hoc Wilcoxon signed-rank test as an indication of significant differences in performance. The results of the latter are shown in Figure 2(a).
We can see that OMCTS performed best, with an average rank of 1.9 and a significantly better performance than all other MCTS variants. Only the advanced algorithm Yolobot, which has won the GVGAI competition several times, comes close to it, as can be seen in Figure 2(a). Table 1 allows us to take a closer look at the domains where OMCTS is better: for games that are easy to win, such as Surround, Aliens, and Whackamole, OMCTS beats the other MCTS-like algorithms by winning with a higher score. In Chase, a deadly but more deterministic game, OMCTS achieves a higher win rate. In deadly and stochastic games like Zelda, Boulderchase, and Jaws, OMCTS is beaten by Yolobot, NMCTS, or MCTS, but still performs well.
NMCTS and MCTS perform similarly in all games, which lets us conclude that per-node normalization does not strongly influence performance. MixMax performed worst on nearly every game: in hard games, MixMax does not win often, and in high-score games it falls short in score. In the recorded videos,² one can see that MixMax greedily goes for high scores: for example, in Zelda, it approaches enemies where MCTS often flees. This often leads to a badly rated death. Nevertheless, MixMax achieves a good score in Zelda compared to MCTS or NMCTS. In Whackamole, MixMax dies often, most probably because of greedily chosen dangerous moves.

²The videos are available at https://bit.ly/2ohbYb3
Figure 2(b) summarizes the results when only won games are considered. It can be seen that, in this case, MixMax is better than MCTS or NMCTS, but the difference is not significant. OMCTS still performs best, but Yolobot falls behind. This is because Yolobot is designed to primarily maximize the win rate, not the score.
In conclusion, we found evidence that OMCTS’s preference for actions that maximize win rate works better than MCTS’s tendency to maximize average performance for the tested domains.
Parameter Optimization
Table 2 shows the overall ranks over all parameter configurations for all algorithms. It is clearly visible that a low rollout length $RL$ improves performance and is more important to tune correctly than the exploration/exploitation trade-off $C$. Since Yolobot has no parameters, it is not shown. Except for the extreme case of no exploration ($C = 0$), OMCTS is better than any other MCTS algorithm for every tested parameter setting. The best configuration is OMCTS with a low rollout length and moderate exploration.
Video Demonstrations
For each algorithm and game, we recorded a video in which the agent wins.² In those videos, it can be seen that OMCTS frequently plays actions that lead to a higher score, whereas MCTS and NMCTS play more safely, often too cautiously, averse to risking any potentially deadly effect.
6 Conclusion
In this paper we proposed OMCTS, a modification of MCTS that handles rewards in an ordinal way: instead of averaging backpropagated values to obtain a value estimate, it estimates the winning probability of an action using the Borda score. By doing so, the magnitudes of the distances between different reward signals are disregarded, which can be useful in ordinal domains. In our experiments using the GVGAI framework, we compared OMCTS to MCTS, several MCTS modifications, and Yolobot, a specialized agent for this domain. Overall, OMCTS achieved higher win rates and reached higher scores than the other algorithms, confirming that this approach can be useful in domains where no meaningful numerical reward information is available.
Acknowledgments
This work was supported by the German Research Foundation (DFG project number FU 580/10). We gratefully acknowledge the use of the Lichtenberg high performance computer of the TU Darmstadt for our experiments.
References
 [Black1976] Black, D. 1976. Partial justification of the Borda count. Public Choice 28(1):1–15.
 [Browne et al.2012] Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1):1–43.

 [Demšar2006] Demšar, J. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan):1–30.
 [Jacobsen, Greve, and Togelius2014] Jacobsen, E. J.; Greve, R.; and Togelius, J. 2014. Monte Mario: Platforming with MCTS. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, 293–300. ACM.
 [Joppen et al.2018] Joppen, T.; Moneke, M. U.; Schröder, N.; Wirth, C.; and Fürnkranz, J. 2018. Informed hybrid game tree search for general video game playing. IEEE Transactions on Games 10(1):78–90.
 [Joppen, Wirth, and Fürnkranz2018] Joppen, T.; Wirth, C.; and Fürnkranz, J. 2018. Preference-based Monte Carlo tree search. In Proceedings of the 41st German Conference on AI (KI-18).

 [Khalifa et al.2016] Khalifa, A.; Isaksen, A.; Togelius, J.; and Nealen, A. 2016. Modifying MCTS for human-like general video game playing. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI-16), 2514–2520.
 [Kocsis and Szepesvári2006] Kocsis, L., and Szepesvári, C. 2006. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning (ECML-06), 282–293.
 [PerezLiebana et al.2016] PerezLiebana, D.; Samothrakis, S.; Togelius, J.; Lucas, S. M.; and Schaul, T. 2016. General video game AI: Competition, challenges and opportunities. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, 4335–4337.
 [PerezLiebana et al.2018] PerezLiebana, D.; Liu, J.; Khalifa, A.; Gaina, R. D.; Togelius, J.; and Lucas, S. M. 2018. General video game AI: A multitrack framework for evaluating agents, games and content generation algorithms. arXiv preprint arXiv:1802.10363.
 [Puterman2005] Puterman, M. L. 2005. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 2nd edition.
 [Silver et al.2017] Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the game of Go without human knowledge. Nature 550(7676):354.
 [Sutton and Barto1998] Sutton, R. S., and Barto, A. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
 [Weng2011] Weng, P. 2011. Markov decision processes with ordinal rewards: Reference point-based preferences. In Proceedings of the 21st International Conference on Automated Planning and Scheduling (ICAPS-11).
 [Yannakakis, Cowie, and Busso2017] Yannakakis, G. N.; Cowie, R.; and Busso, C. 2017. The ordinal nature of emotions. In Proceedings of the 7th International Conference on Affective Computing and Intelligent Interaction (ACII17).
 [Zoghi et al.2014] Zoghi, M.; Whiteson, S.; Munos, R.; and Rijke, M. 2014. Relative upper confidence bound for the k-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 10–18.