1 Introduction
Conventional reinforcement learning (RL) algorithms rely on numerical feedback signals. Their main advantages include ease of aggregation, efficient gradient computation, and the many use cases where numerical reward signals arise naturally, often representing a quantitative property. However, in some domains numerical rewards are hard to define and are prone to several problems. One issue is the difficulty of reward shaping, i.e., the task of creating a reward function. Since RL algorithms use rewards as direct feedback to learn a behavior that optimizes the aggregation of received rewards, the reward function has a significant impact on the behavior learned by the algorithm. Manually creating a reward function is often expensive, non-intuitive, and difficult in certain domains, and can therefore bias the optimal behavior learned by an algorithm. Since the learned behavior is sensitive to these reward values, the rewards of an environment should not be introduced or shaped arbitrarily if they are not explicitly known or naturally defined. This can also lead to another problem called reward hacking, where an algorithm exploits the reward function and misses the intended goal of the environment because it can obtain better rewards through undesired behavior. The use of numerical rewards furthermore requires infinite rewards to model undesired decisions that must not allow any tradeoff for a given state. This can be illustrated by an example from the medical domain, where it must not be possible to compensate one occurrence of the death of a patient with multiple occurrences of cured patients in order to stay at a positive reward on average, and therefore artificial feedback signals are used that cannot be averaged. These issues have motivated the search for alternatives, such as preference-based feedback signals [13].
In this paper, we investigate the use of rewards on an ordinal scale, where we have information about the relative order of various rewards, but not about the magnitude of the quality differences between different rewards. Our goal is to extend reinforcement learning algorithms so that they can make use of ordinal rewards as an alternative feedback signal type in order to avoid and overcome the problems with numerical rewards.
Reinforcement learning with ordinal rewards has multiple advantages and directly addresses several issues of numerical rewards. Firstly, the problem of reward shaping is minimized, since manually creating the ordinal reward function, specifically by ordering the rewards, is often intuitive and can be done easily without the need to specify exact reward values. Even though the creation of ordinal reward values, so-called reward tiers, through the ascending order of feedback signals introduces a naturally defined bias, it avoids the much larger artificial bias introduced by manually shaping reward values. At the same time, ordinal rewards mitigate the problem of reward hacking, because the omission of specific numeric reward values means that any possible exploitation of rewards by an algorithm can only stem from an incorrect reward order, which is more easily fixed than searching for correct numerical values. While infinite rewards cannot be modelled directly, it is still possible to define them as the highest or lowest ordinal reward tier and to implement policies which completely avoid or encourage certain tiers.
Since the creation of an ordinal reward function is cheap and intuitive, it is especially suitable for newly defined environments, where ordinal rewards can easily be defined by naturally ordering the possible outcomes by desirability. Additionally, it should be noted that for existing environments with numerical rewards, ordinal rewards can be extracted directly from these environments.
The focus of this paper is the technique of using ordinal rewards for reinforcement learning. To this end, we propose an alternative reward aggregation for ordinal rewards, introduce a method for policy determination from ordinal rewards, and compare the performance of ordinal reward algorithms to algorithms for numerical rewards. In Section 2, we discuss related work and previous approaches. A formal definition of common reinforcement learning terminology can be found in Section 3. Section 4 introduces reinforcement learning algorithms which use ordinal reward aggregations instead of numerical rewards, and illustrates the differences to conventional approaches. In Section 5, experiments are run within the OpenAI Gym framework, and common reinforcement learning algorithms are compared to ordinal reinforcement learning.
2 Related Work
The technique of using rewards on an ordinal scale as an alternative to numerical rewards is mainly based on the approach of preference learning (PL) [1]. In contrast to traditional supervised learning, PL follows the core idea of using preferences over states or symbols as labels and predicting these preferences on unseen data instances instead of labelling data with explicit nominal or numerical values.
Recently, there have been several proposals for combining PL with RL, where pairwise preferences over trajectories, states or actions are defined and applied as feedback signals in reinforcement learning algorithms instead of the commonly used numerical rewards. For a survey of such preference-based reinforcement learning algorithms, we refer the reader to [13].
While preference-based RL provides algorithms for learning an agent's behavior from pairwise comparisons of trajectories, [12] presents an approach for creating preferences over multiple trajectories in the order of ascending ordinal reward tiers, thereby deviating from the concept of pairwise comparisons over trajectories. Using a tutor as an oracle, this approach approximates a latent numerical reward score from a sequence of received ordinal feedback signals. This alternative reward computation functions as a reward transformation from the ordinal to the numerical scale and is applicable on top of an existing reinforcement learning algorithm.
Contrary to this approach, we do not use a tutor for the comparison of trajectories but can directly use ordinal rewards as a feedback signal. In order to use environments where numerical feedback already exists without the need for acquiring human feedback about the underlying preferences, we automatically extract rewards on an ordinal scale from existing environments with numerical rewards. To this end, we adapt an approach that has been proposed for Monte Carlo Tree Search [4] to reinforcement learning.
Furthermore, we handle ordinal rewards in a similar manner as previous approaches by directly using aggregated received ordinal rewards for comparing different options. The idea of directly comparing ordinal rewards builds on the works of [10], [11], [2] and [4], which provide criteria for the direct comparison of ordinal reward aggregations. We utilize the approach of [4], which transfers the numerical reward maximization problem into a best-choice maximization problem for an alternative computation of the value function for reinforcement learning from ordinal feedback signals. [4] used this idea for adapting Monte Carlo Tree Search to the use of ordinal rewards.
In summary, we automatically transfer numerical feedback into preference-based feedback and propose a new conceptual idea for utilizing ordinal rewards in reinforcement learning, which should not be seen as an alternative to the existing algorithms stated above. Hence, we do not compare the performance of our new approach to any of the algorithms that use additional human feedback, but to common RL techniques that use numerical feedback.
3 Markov Decision Process and Reinforcement Learning
In this section, we briefly recapitulate Markov decision processes and reinforcement learning algorithms. Our notation and terminology are based on [8].
3.1 Value function and policy for Markov Decision Process
A Markov Decision Process (MDP) is defined as a tuple $(S, A, T, R)$ with $S$ being a finite set of states, $A$ being a finite set of actions, $T : S \times A \times S \to [0, 1]$ being the transition function that models the probability $T(s, a, s')$ of reaching a state $s'$ when action $a$ is performed in state $s$, and $R$ being the reward function, which maps executing action $a$ in state $s$ and reaching $s'$ in the process to a reward $r$ from a subset of possible rewards. In the following we assume that $T$ is deterministic, so that every transition has a probability of 0 or 1. Furthermore, it is assumed that each action $a$ is executable in any state $s$; hence the transition function is defined for every element of $S \times A \times S$. A policy $\pi$ is the specification of which decision to take based on the environmental state. In a deterministic setting, it is modeled as a mapping $\pi : S \to A$ which directly maps an environmental state to the decision that should be taken in this state. The value function $V_\pi(s)$ represents the expected quality of a policy $\pi$ in state $s$ with respect to the rewards that will be received in the future. Value functions for numerical rewards are computed as the expectation of the discounted sum of rewards $G_t$. The value function of a policy $\pi$ in an environmental state $s$ can therefore be computed by

  $V_\pi(s) = E[G_t \mid s_t = s] = E\left[\sum_{i=0}^{\infty} \gamma^i \, r_{t+i} \;\middle|\; s_t = s\right]$   (1)

where $G_t$ is the discounted sum of rewards when following policy $\pi$, $\gamma \in [0, 1]$ a discount factor, and $r_t$ the direct reward at time step $t$. The optimal policy $\pi^*$ in a state $s$ is the policy with the largest $V_\pi(s)$, which complies with the goal of an RL algorithm to maximize the expected future reward.
3.2 Reinforcement Learning
Reinforcement learning can be described as the task of learning a policy that maximizes the expected future numerical reward. The agent learns iteratively by updating its current policy after every action and the corresponding reward received from the environment. Furthermore, the agent may perform multiple training sessions, so-called episodes, in the environment. Using the previously defined formalism, this can be expressed as iteratively approximating the optimal policy $\pi^*$ by repeatedly choosing actions that lead to states $s$ with the highest estimated value function $V_{\pi^*}(s)$. In the following sections, two common reinforcement learning algorithms are introduced.
3.2.1 Q-learning.
The key idea of the Q-learning algorithm [9] is to estimate Q-values $Q(s, a)$, which represent the expected future sum of rewards when choosing action $a$ in state $s$ and following the optimal policy afterwards. Hence the Q-value can be seen as a measure of goodness of a state-action pair $(s, a)$, and therefore, in a given state $s$, the optimal policy should select the action that maximizes this value in comparison to the other available actions in that state. The approximated Q-values are stored and iteratively updated in a Q-table. The Q-table is updated after an action $a_t$ has been performed in a state $s_t$ and the reward $r_t$ and the newly reached state $s_{t+1}$ are observed. The expected Q-value is computed as

  $Q_{\mathit{expected}}(s_t, a_t) = r_t + \gamma \max_a Q(s_{t+1}, a)$   (2)

Following this so-called Bellman equation, every previously estimated Q-value is updated towards the newly computed expected Q-value with the formula

  $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \, [Q_{\mathit{expected}}(s_t, a_t) - Q(s_t, a_t)]$   (3)

where $\alpha$ represents a learning rate and $\gamma$ the discount factor.
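The tabular update in (2) and (3) can be sketched in a few lines. This is an illustrative minimal implementation, not the authors' code; the state and action encodings are assumptions.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q[s, a] towards the
    Bellman target r + gamma * max_a' Q[s', a'] (equations (2)-(3))."""
    target = r + gamma * np.max(Q[s_next])          # expected Q-value, eq. (2)
    Q[s, a] += alpha * (target - Q[s, a])           # interpolation, eq. (3)
    return Q

# Toy Q-table with 2 states and 2 actions (illustrative sizes).
Q = np.zeros((2, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])  # 0.1 after a single update of an all-zero table
```

With an all-zero table, the target is simply the immediate reward, so the entry moves a fraction $\alpha$ of the way towards it.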
3.2.2 Deep Q-Network.
The original Q-learning algorithm is limited to very simple problems because of the explicitly stored Q-table, which essentially memorizes the quality of each possible state-action pair independently. It thus requires, e.g., that each state-action pair be visited a certain number of times in order to make a reasonable prediction for this pair. A natural extension of this method is to replace the Q-table with a learned Q-function, which is able to predict a quality value for a given, possibly previously unseen state-action pair. The key idea behind the Deep Q-Network (DQN) [6, 7] is to learn such a continuous function in the form of a deep neural network, with input nodes that represent the feature vector of state $s$ and one output node per action $a$, each containing the Q-value $Q(s, a)$ of that action. Neural networks can be iteratively updated to fit the output nodes to the desired Q-values. The expected Q-value for a state-action pair is calculated in the same manner as defined in (2), with the difference that the Q-values are now predicted by the DQN, with one output node for each possible action $a$. Therefore (2) becomes

  $Q_{\mathit{expected}}(s_t, a_t) = r_t + \gamma \max_a Q_{\mathit{node}_a}(s_{t+1})$   (4)

where $Q_{\mathit{node}_a}(s)$ represents the Q-value output node of action $a$ for state $s$.
In order to optimize the learning procedure, DQN makes use of several optimizations such as experience replay, separate target and evaluation networks, and the Double Deep Q-Network. More details on these techniques can be found in the following paragraphs.
Experience replay.
Using a neural network to fit the Q-value of only the previously executed state-action pair as described in (4) leads to overfitting to recent experiences, because of the high correlation between environmental states across successive time steps and the tendency of neural networks to overfit recently seen training data. Instead of only using the previous state-action pair for fitting the DQN, experience replay [5] uses a memory to store previous experience instances and, at every time step, reuses a random sample of these experiences to update the network prediction.
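The replay memory described above amounts to a bounded buffer with random sampling. The following sketch is ours (class and method names are illustrative, not from [5]):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience-replay memory: store (s, a, r, s_next, done)
    tuples and draw a decorrelated random mini-batch for each update."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest entries drop out first

    def store(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Sample without replacement; cap at the current buffer size.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

buf = ReplayBuffer(capacity=1000)
for t in range(5):
    buf.store((t, 0, 1.0, t + 1, False))
batch = buf.sample(3)   # three random, not necessarily consecutive, transitions
```

Because the batch is drawn uniformly over the whole memory, successive updates no longer see only strongly correlated consecutive states.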
Target and evaluation networks.
Frequently updating the neural network that is simultaneously used for predicting the expected Q-value leads to unstable fitting of the network. These two tasks, namely the prediction of the target Q-value for network fitting and the prediction of the Q-value used for policy computation, can therefore be split across two networks: the evaluation network, which is used for policy computation, and the target network, which is used for predicting the target value for continuously fitting the evaluation network. In order to keep the target network up to date, it is replaced by a copy of the evaluation network after a fixed number of steps.
Double Deep QNetwork.
Deep Q-Networks tend to overestimate the predicted Q-values for some actions, which may result in an unjustified bias towards certain actions. To address this problem, Double Deep Q-Networks [3] additionally use the target and evaluation networks to decouple the action choice from the Q-value prediction by letting the evaluation network choose the next action to be played and letting the target network predict the respective Q-value.
4 Deep Ordinal Reinforcement Learning
In this section, Markov decision processes and reinforcement learning algorithms are adapted to settings with ordinal reward signals. More concretely, we present a method for reward aggregation that fits ordinal rewards and explain how this method can be used in Qlearning and Deep QNetworks in order to learn to solve environments that return feedback signals on an ordinal scale.
4.1 Ordinal Markov Decision Process
Similar to the standard Markov Decision Process, [10] defines an ordinal version of an MDP as a tuple $(S, A, T, R_o)$, with the only difference being that the reward function is modified to return ordinal rewards instead of numerical ones. Thus, it maps executing action $a$ in state $s$ and reaching state $s'$ to an ordinal reward $r_i$ from a set of possible ordinal rewards $R_o = \{r_1, \dots, r_n\}$, with $n$ representing the number of ordinal rewards. Whereas a real-valued reward provides information about the quantitative size of the reward, the ordinal scale breaks rewards down into naturally ordered reward tiers. These reward tiers solely represent the rank of desirability of a reward compared to all other possible rewards, which is given by the ranking position $i$ of a reward $r_i$ in the set of all possible rewards $R_o$. Interpreting the reward signals on an ordinal scale still allows us to order and directly compare individual reward signals, but while the numerical scale allows for comparison of rewards by means of the magnitude of their difference, ordinal rewards do not provide this information.
In order to aggregate multiple ordinal rewards, a distribution is constructed that stores and represents the expected frequency of received rewards on the ordinal scale. This distribution is represented by a vector $D(s, a) = (d_1(s, a), \dots, d_n(s, a))$, in which $d_i(s, a)$ represents the frequency of receiving the ordinal reward $r_i$ when executing $a$ in $s$. The distribution vector is defined by

  $d_i(s, a) = |\{t \mid s_t = s, \, a_t = a, \, r_t = r_i\}|$   (5)

Through normalization of the distribution vector $D(s, a)$, a probability distribution can be constructed which represents the expected probability of receiving each reward. This probability distribution is represented by a probability vector $p(s, a) = (p_1(s, a), \dots, p_n(s, a))$, in which $p_i(s, a)$ represents the estimated probability of receiving the ordinal reward $r_i$ when executing $a$ in $s$. Hence the probability vector can be defined by

  $p_i(s, a) = \frac{d_i(s, a)}{\sum_{k=1}^{n} d_k(s, a)}$   (6)
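The normalization in (6) is a one-liner; the following sketch (our own, with illustrative names) turns a frequency vector of reward tiers into the probability vector:

```python
import numpy as np

def to_probabilities(d):
    """Normalize a frequency vector of received ordinal rewards
    into a probability vector, as in equation (6)."""
    d = np.asarray(d, dtype=float)
    return d / d.sum()

# Frequencies of n = 3 ordinal reward tiers for one state-action pair:
# tier 1 seen once, tier 2 once, tier 3 twice.
p = to_probabilities([1, 1, 2])
print(p)  # [0.25 0.25 0.5 ]
```

The resulting vector sums to 1 and preserves the tier ordering, which is all the subsequent comparison measure needs.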
4.1.1 Value function for ordinal rewards.
While numerical rewards enable the representation of the value function by the expected sum of rewards, the value function for environments with ordinal rewards needs to be estimated differently. Since ordinal rewards are aggregated in a distribution of received ordinal rewards, the value function in state $s$ can be calculated based on the probability vector $p(s, a)$ for the action $a$ that is selected by policy $\pi$. Hence, the value function can be modeled by

  $V_\pi(s) = Z(s, \pi(s))$   (7)

The computation of the value function from the probability distribution through the function $Z$ is performed by the technique of the measure of statistical superiority [4]. This measure computes the probability that an action $a$ receives a better ordinal reward than a random alternative action $b$ in the same environmental state $s$. This probability can be calculated as the sum of all probabilities of receiving a better ordinal reward than $b$; to deal with ties, additionally half the probability of receiving the same reward tier as $b$ is added. Hence the probability of an action $a$ performing better than another action $b$ can be defined as

  $P(a \succ b) = \sum_{i=1}^{n} p_i(s, b) \left( \frac{1}{2} \, p_i(s, a) + \sum_{j=i+1}^{n} p_j(s, a) \right)$

The function $Z$ of the measure of statistical superiority is therefore computed as the averaged winning probability of $a$ against all other actions:

  $Z(s, a) = \frac{1}{|A| - 1} \sum_{b \in A, \, b \neq a} P(a \succ b)$   (8)

for the available actions $A$ in state $s$.
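The measure of statistical superiority can be sketched directly from its definition. The following is our own illustrative implementation (function names are not from [4]); it assumes one probability vector per action:

```python
import numpy as np

def win_probability(p_a, p_b):
    """P(a beats b): probability that action a draws a strictly better
    ordinal reward tier than action b, plus half the tie probability."""
    p_a, p_b = np.asarray(p_a), np.asarray(p_b)
    total = 0.0
    for i, pb in enumerate(p_b):
        # b lands in tier i; a wins with any higher tier, ties count half.
        total += pb * (0.5 * p_a[i] + p_a[i + 1:].sum())
    return total

def statistical_superiority(probs, a):
    """Z(s, a): average win probability of action a against all other
    actions, given a list `probs` of probability vectors, one per action."""
    others = [win_probability(probs[a], probs[b])
              for b in range(len(probs)) if b != a]
    return sum(others) / len(others)

# Two actions over n = 2 tiers: action 0 always receives the better tier.
probs = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
print(statistical_superiority(probs, 0))  # 1.0
```

As a sanity check, an action compared against an identically distributed one scores exactly 0.5, reflecting the tie-splitting rule.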
4.2 Transformation of existing numerical rewards to ordinal rewards
If an environment has predefined rewards on a numerical scale, transforming numerical rewards into ordinal rewards can easily be done by translating every numerical reward to its ordinal position among all possible numerical rewards. This way the lowest possible numerical reward is mapped to position 1 and the highest numerical reward is mapped to position $n$, with $n$ representing the number of possible numerical rewards. This transformation simply removes the metric, i.e., the semantics of distances between rewards, while keeping their order.
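This rank transformation can be sketched as follows (an illustrative implementation of the mapping described above, assuming the set of possible numerical rewards is known in advance):

```python
def to_ordinal(reward, possible_rewards):
    """Map a numerical reward to its rank (1 = worst tier) among all
    possible numerical rewards, discarding distance information."""
    ranks = {r: i + 1 for i, r in enumerate(sorted(set(possible_rewards)))}
    return ranks[reward]

# Hypothetical reward set: large penalty, small penalty, neutral, bonus.
rewards = [-100.0, -1.0, 0.0, 10.0]
print([to_ordinal(r, rewards) for r in rewards])  # [1, 2, 3, 4]
```

Note that the gap between -100.0 and -1.0 and the gap between -1.0 and 0.0 both collapse to a single rank step, which is exactly the loss of metric information described above.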
4.3 Ordinal Reinforcement Learning
In Section 4.1.1, we have shown how to compute a value function and defined the optimal policy for environments with ordinal rewards. This can now be used for adapting common reinforcement learning algorithms to ordinal rewards.
4.3.1 Ordinal Qlearning.
For the adaptation of the Q-learning algorithm to ordinal rewards, we do not directly update a Q-value that represents the quality of a state-action pair, but instead update the distribution of received ordinal rewards. The target distribution is computed by adding the received ordinal reward $r_t$ (represented through a unit vector $e_{r_t}$ of length $n$) to the distribution of taking an action in the new state $s_{t+1}$ according to the optimal policy $\pi^*$. The previous distribution $D(s_t, a_t)$ is updated towards the target distribution by interpolating both values with learning rate $\alpha$:

  $D(s_t, a_t) \leftarrow D(s_t, a_t) + \alpha \left[ e_{r_t} + \gamma \, D(s_{t+1}, \arg\max_a Z(s_{t+1}, a)) - D(s_t, a_t) \right]$   (9)

In this adaptation of Q-learning,^1 the expected quality of a state-action pair is not represented by the Q-value (3) but by the measure of statistical superiority $Z$ (8) applied to the probability distribution $p(s, a)$, which is derived from the iteratively updated distribution $D(s, a)$.
^1 This technique of modifying the Q-learning algorithm to deal with rewards on an ordinal scale can analogously be applied to other Q-table based reinforcement learning algorithms like Sarsa and Sarsa(λ) [14].
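The distribution update can be sketched as follows. This is our own reading of the update rule, under the assumption that the target distribution is the unit vector of the received tier plus the γ-discounted distribution of the statistically superior action in the next state; all names are illustrative:

```python
import numpy as np

def win_probability(p_a, p_b):
    p_a, p_b = np.asarray(p_a), np.asarray(p_b)
    return sum(pb * (0.5 * p_a[i] + p_a[i + 1:].sum())
               for i, pb in enumerate(p_b))

def ordinal_q_update(D, s, a, r_idx, s_next, alpha=0.1, gamma=0.9):
    """Move D[s, a] towards e_r + gamma * D[s', a*], where a* maximizes
    the measure of statistical superiority in s'.
    D has shape (states, actions, tiers); r_idx is the 0-based tier index."""
    n = D.shape[2]
    e_r = np.eye(n)[r_idx]                     # unit vector of the received tier
    # Normalize each action's distribution; unvisited pairs fall back to uniform.
    probs = [d / d.sum() if d.sum() > 0 else np.full(n, 1.0 / n)
             for d in D[s_next]]
    z = [np.mean([win_probability(probs[a1], probs[b])
                  for b in range(len(probs)) if b != a1])
         for a1 in range(len(probs))]
    a_star = int(np.argmax(z))                 # statistically superior action
    target = e_r + gamma * D[s_next, a_star]
    D[s, a] += alpha * (target - D[s, a])      # interpolation with rate alpha
    return D

D = np.zeros((2, 2, 3))                        # 2 states, 2 actions, 3 tiers
D = ordinal_q_update(D, s=0, a=0, r_idx=2, s_next=1)
print(D[0, 0])  # [0.  0.  0.1]
```

Starting from an all-zero table, the target collapses to the unit vector of the received tier, so only the entry of that tier moves, by a fraction $\alpha$.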
4.3.2 Ordinal Deep QNetwork.
Because ordinal rewards are aggregated in a distribution instead of a numerical value, the neural network is adapted to predict distributions instead of Q-values for every possible action. Hence, for one action the network does not predict a 1-dimensional Q-value, but an $n$-dimensional reward distribution, with $n$ being the length of the ordinal scale. Since this distribution has to be computed for each of the available actions, the adaptation of the Deep Q-Network algorithm to ordinal rewards requires a differently structured neural network. Contrary to the original Deep Q-Network, where one network simultaneously predicts the Q-values for all actions, the structure of the ordinal DQN consists of an array of neural networks, each of which computes the expected ordinal reward distribution for one separate action $a$. In such a deep neural network for the prediction of distributions, every output node of the network computes one distribution value $d_i$. The structure of the neural networks used for the prediction of distributions can be seen in Figure 1.
The predicted ordinal reward distributions for all actions can afterwards be normalized to probability distributions and used to compute the value function through the measure of statistical superiority, as previously defined in (7). Once the value function and policy have been evaluated, the ordinal variant of the DQN algorithm follows a similar procedure as ordinal Q-learning and updates the prediction of the reward distribution for $(s_t, a_t)$ by fitting the network to the target reward distribution:

  $D_{\mathit{expected}}(s_t, a_t) = e_{r_t} + \gamma \, D(s_{t+1}, \arg\max_a Z(s_{t+1}, a))$   (10)

The main difference in the update step between ordinal Q-learning (9) and ordinal DQN consists of fitting the neural network of action $a_t$ for input $s_t$ to the expected reward distribution by backpropagation instead of updating a Q-table entry. Additional modifications to the ordinal Deep Q-Network in the form of experience replay, the split into target and evaluation networks, and the usage of a Double DQN are made in a similar fashion as described for the standard DQN algorithm in Section 3.2.2. These modifications are described in the following paragraphs.
Experience replay.
A memory is used to sample multiple saved experience elements randomly and replay these previously seen experiences by fitting the ordinal DQN networks to the samples of earlier memory elements.
Target and evaluation networks.
In order to prevent unstable behavior by using the same networks for the prediction and updating step, we use separate evaluation networks to predict reward distributions for the policy computation, and use target networks to predict the target reward distributions which are used for fitting the evaluation networks continuously.
Double Deep QNetwork.
The neural networks of the ordinal DQN tend to overestimate the predicted reward distributions for some actions, which may result in an unjustified bias towards certain actions. Therefore, in order to determine the next action to be played, the measure of statistical superiority is computed based on the reward distributions predicted by the evaluation networks. Afterwards, the prediction of the reward distribution for this action is computed by the respective target network.
5 Experiments and Results
In the following, the standard reinforcement learning algorithms described in Section 3.2 and the ordinal reinforcement learning algorithms described in Section 4.3 are evaluated and compared in a number of testing environments.^2
^2 The source code for the implementation of the experiments can be found at https://github.com/az79nefy/OrdinalRL.
5.1 Experimental setup
The environments used for evaluation are provided by OpenAI Gym,^3 which can be viewed as a unified toolbox for our experiments. All environments expect an action input after every time step and return feedback in the form of the newly reached environmental state, the direct reward for the executed action, and the information whether the newly reached state is terminal. The environments the algorithms were tested on are CartPole and Acrobot.^4
^3 For further information about OpenAI Gym visit https://gym.openai.com.
^4 Further technical details about the environments CartPole and Acrobot from OpenAI can be found at https://gym.openai.com/envs/CartPole-v0/ and https://gym.openai.com/envs/Acrobot-v1/.
Policies of the reinforcement learning algorithms were modified to use ε-greedy exploration [8], which encourages early exploration of the state space and increases exploitation of the learned policy over time. In the experiments, maximum exploitation is reached after half of the total episodes. In order to directly compare the standard and ordinal variants of the reinforcement learning algorithms, the quality of the learned policy and the computational efficiency are investigated across all environments with varying episode numbers. Information about the quality of the learned policy is derived from the sum of rewards over a whole episode (score) or the win rate, while efficiency is measured by real-time processing time. In addition to the standard variant with unchanged rewards, the performance of the standard algorithms is tested with changed rewards in order to simulate the performance in environments where no optimal reward engineering has been performed. It should be noted that the modification of the rewards is performed under the constraint of preserving the existing reward order, and therefore does not change the transformation to the ordinal scale. The change of rewards (CR) from the existing numerical rewards is performed for all rewards by the calculation of .
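The exploration schedule described above can be sketched as follows. We assume a linear annealing of ε that reaches zero at the halfway point, which is one way to realize "maximum exploitation after half of the total episodes"; the exact schedule used in the experiments is not specified here:

```python
import random

def epsilon(episode, total_episodes):
    """Linearly anneal exploration: fully random at the start,
    pure exploitation from the halfway point onwards (assumed schedule)."""
    half = total_episodes / 2
    return max(0.0, 1.0 - episode / half)

def choose_action(q_values, eps):
    """Epsilon-greedy: random action with probability eps, else greedy."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(epsilon(0, 400), epsilon(100, 400), epsilon(200, 400))  # 1.0 0.5 0.0
```

The same `choose_action` works for the ordinal variants by passing the per-action values of the measure of statistical superiority instead of Q-values.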
The parameter configuration of the Q-learning algorithms is learning rate and discount factor . The parameter configuration of the Deep Q-Network algorithm is learning rate and discount factor . As for the network-specific parameters, the Adam optimizer is used for network fitting, the target network is replaced every 300 fitting updates, the experience memory size is 200000, and the replay batch size is 64.
5.2 Experimental results
The results of the comparison between the numerical and ordinal algorithms for the CartPole and Acrobot environments in terms of score, win rate and computation time are presented and analyzed in the following. This comparison is based on averaged results from 10 and 5 independent runs, respectively, of Q-learning and Deep Q-Network on the environments.
5.2.1 Qlearning.
Figure 2 shows the scores for the CartPole environment over the course of 400 and 10000 episodes, achieved by an agent using the ordinal (orange) as well as the standard Q-learning algorithm, with (red) and without (blue) modified rewards. Additionally, the individual dots in this figure represent the scores achieved by the respective algorithms when using the optimal policy instead of ε-greedy exploration. The evaluation of these scores shows that the ordinal variant of Q-learning performs better than the standard variant with engineered rewards for 400 episodes and reaches the optimal score of 200 more quickly for 10000 episodes. Additionally, the use of ordinal rewards significantly outperforms the standard variant with modified rewards for both episode numbers. It can therefore be seen that ordinal Q-learning learns a good policy for the CartPole environment better than the standard variants.
In order to explain the difference in learned behavior between the standard and ordinal variants, the average relative difference of the Q-values and, respectively, of the measure of statistical superiority for the two possible actions is plotted and compared in Figure 3 for standard (blue) and ordinal (orange) Q-learning. For both episode numbers it can be seen that the policy learned by ordinal RL through the measure of statistical superiority converges to a difference of 0, meaning that the function converges to similar values for both actions. This can be interpreted as the policy learning to play safely and rarely entering critical states where this function would indicate a strong preference towards one action (e.g., at large pole angles). On the other hand, for 400 episodes it can be seen that common RL does not converge towards similar Q-values for the actions over time, and therefore a policy is learned that enters critical states more often. It should be noted that the Q-value difference for standard Q-learning converges to 0 in evaluations with more episodes, and a safe policy is eventually learned as well.
In Figure 4, the win rates for the Acrobot environment are plotted over the course of 400 and 10000 episodes, analogously to the scores for the CartPole environment. For low episode numbers, the policy learned by the standard variant of Q-learning with unchanged rewards performs better than the policy learned by the ordinal variant, while changing the numerical reward values yields the same performance as the ordinal variant. For high episode numbers, however, the ordinal variant reaches a performance similar to the standard variant, with a win rate of 0.3 after 10000 episodes, and clearly outperforms the win rate of the standard Q-learning algorithm with CR.
As for the CartPole environment, the Q-value and statistical-superiority margins of the best actions over the course of 400 and 10000 episodes are compared in Figure 5. They yield different observations for the standard and ordinal variants, from which it can be concluded that the learned policies differ. While the ordinal variant decreases the relative margin of the best action's superiority measure and therefore learns a policy which plays safely, the standard variant learns a policy which maximizes the Q-value margin of the best action and therefore enters critical states more often. While the standard variant learns a good policy more quickly, it should be noted that both policies perform comparably after many episodes despite these differences.
Table 1: Computation times of standard and ordinal Q-learning on CartPole and Acrobot.

Number of     CartPole               Acrobot
episodes      Standard    Ordinal    Standard     Ordinal
400           2.10 s      4.17 s     35.74 s      52.85 s
2000          10.07 s     24.86 s    174.38 s     266.40 s
10000         67.29 s     130.09 s   855.15 s     1258.30 s
50000         354.52 s    667.87 s   4149.78 s    6178.76 s
As can be seen in Table 1, using the ordinal variant results in an additional computational load by a factor between 0.8 and 1.2 for CartPole and around 0.5 for Acrobot. The additionally required computation is caused by the measure of statistical superiority, which is less efficient to evaluate than the expected sum of rewards. This factor could be reduced by using the iterative update of the measure of statistical superiority described in [4].
5.2.2 Deep QNetwork.
Figure 6 shows the scores achieved in the CartPole environment by the ordinal as well as the standard Deep Q-Network, with and without CR, over the course of 160 and 1000 episodes. For 160 episodes it can be seen that the ordinal DQN as well as the standard variant without CR converge to a good policy, reaching an episode score close to 150. Contrary to this, modified rewards negatively impact the standard DQN, whose performance is significantly worse, not reaching a score above 100. Additionally, for low episode numbers the policy learned by the ordinal variant of the Deep Q-Network achieves good scores faster than the standard variant, matching the observation made for the Q-learning algorithms. The evaluation for 1000 episodes shows that the performances of the standard DQNs, with and without CR, and the ordinal DQN are comparable.
Figure 7 plots the win rates of the Deep Q-Network algorithms for the Acrobot environment over the course of 160 and 1000 episodes. For 160 episodes, the standard DQN with engineered rewards performs better than the ordinal variant, but loses this advantage once the rewards are modified. For high episode numbers, the ordinal variant is comparable to the standard algorithm without CR and solves the environment with a win rate close to 1.0, while clearly outperforming the standard DQN with modified rewards, which only achieves a win rate of 0.6. It should be noted that all variants of DQN learn a better policy than their respective Q-learning counterparts, achieving a higher win rate in fewer than 160 episodes.
Additionally, it should be noted that the ordinal variant of DQN adds a computational overhead factor between 0 and 0.5 for the CartPole environment and of about 1.0 for the Acrobot environment, as can be seen in Table 2.
Since the ordinal Deep Q-Network achieves results comparable to the standard DQN with engineered rewards and furthermore outperforms the standard variant with modified rewards, we conclude that the conversion of the Deep Q-Network algorithm to ordinal rewards is successful. This demonstrates that deep reinforcement learning algorithms can likewise be adapted to the use of ordinal rewards.
Table 2: Computation time (in seconds) of the standard and ordinal variants of DQN.

Number of    CartPole                      Acrobot
episodes     Standard      Ordinal        Standard      Ordinal
160          1520.01 s     2232.48 s      3659.44 s     7442.49 s
400          6699.69 s     7001.79 s      9678.80 s     19840.88 s
1000         15428.41 s    15526.84 s     23310.36 s    47755.90 s
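The overhead factors quoted above can be recovered directly from the timings in Table 2 (a small sanity-check script; the dictionary layout is ours):

```python
# Timings from Table 2 in seconds: (standard, ordinal) per episode budget.
timings = {
    "CartPole": [(1520.01, 2232.48), (6699.69, 7001.79), (15428.41, 15526.84)],
    "Acrobot":  [(3659.44, 7442.49), (9678.80, 19840.88), (23310.36, 47755.90)],
}

for env, pairs in timings.items():
    # Relative overhead of the ordinal variant: ordinal / standard - 1.
    overheads = [ordinal / standard - 1 for standard, ordinal in pairs]
    print(env, [round(x, 2) for x in overheads])
```

For CartPole this yields overheads of roughly 0.47, 0.05, and 0.01 (i.e. between 0 and 0.5, shrinking with longer runs), and for Acrobot a consistent overhead of about 1.0, matching the factors stated in the text.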
6 Conclusion
In this paper we have shown that reinforcement learning with ordinal rewards can match and even improve upon the quality of standard reinforcement learning algorithms with numerical rewards. We compared RL algorithms for both numerical and ordinal rewards on a number of environments and demonstrated that the performance of the ordinal variant is mostly comparable to common RL algorithms that use engineered rewards, while significantly improving performance when the rewards are modified.
Finally, it should be noted that ordinal reinforcement learning enables learning a good policy without much manual effort for shaping rewards. We lose the fine-grained control over behavior that numerical reward shaping allows, but in return gain a reward structure that is much simpler to design. Hence, our variant of reinforcement learning with ordinal rewards is especially suitable for environments that have no natural semantics of numerical rewards or where reward shaping is difficult. Additionally, this method opens up new and unexplored environments to RL with only the specification of an order of desirability, instead of the effort of manually engineering numerical rewards with sensible semantic meaning.
Acknowledgements
This work was supported by DFG. Calculations for this research were conducted on the Lichtenberg high performance computer of the TU Darmstadt.
References
[1] Fürnkranz, J., Hüllermeier, E. (eds.): Preference Learning. Springer-Verlag (2011)
[2] Gilbert, H., Weng, P.: Quantile Reinforcement Learning. CoRR abs/1611.00862 (2016)
[3] van Hasselt, H., Guez, A., Silver, D.: Deep Reinforcement Learning with Double Q-Learning. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16). pp. 2094–2100. AAAI Press (2016)
[4] Joppen, T., Fürnkranz, J.: Ordinal Monte Carlo Tree Search. CoRR abs/1901.04274 (2019)
[5] Lin, L.J.: Reinforcement Learning for Robots Using Neural Networks. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA, USA (1992). UMI Order No. GAX93-22750
[6] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.A.: Playing Atari with Deep Reinforcement Learning. CoRR abs/1312.5602 (2013)
[7] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M.A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
[8] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning, MIT Press, second edn. (2018)
[9] Watkins, C.J., Dayan, P.: Q-learning. Machine Learning 8, 279–292 (1992)
[10] Weng, P.: Markov Decision Processes with Ordinal Rewards: Reference Point-Based Preferences. In: Proceedings of the 21st International Conference on Automated Planning and Scheduling (ICAPS-11). AAAI Press, Freiburg, Germany (2011)
[11] Weng, P.: Ordinal Decision Models for Markov Decision Processes. In: Proceedings of the 20th European Conference on Artificial Intelligence (ECAI-12). pp. 828–833. IOS Press, Montpellier, France (2012)
[12] Weng, P., Busa-Fekete, R., Hüllermeier, E.: Interactive Q-Learning with Ordinal Rewards and Unreliable Tutor. In: Proceedings of the ECML/PKDD-13 Workshop on Reinforcement Learning from Generalized Feedback: Beyond Numeric Rewards (2013)
[13] Wirth, C., Akrour, R., Neumann, G., Fürnkranz, J.: A Survey of Preference-Based Reinforcement Learning Methods. Journal of Machine Learning Research 18(136), 1–46 (2017)
[14] Zap, A.: Ordinal Reinforcement Learning. Master's thesis, Technische Universität Darmstadt (2019). To appear