## Introduction

Recent progress in Reinforcement Learning (RL) has shown remarkable success in learning to play games such as Atari from raw sensory input mnih2015human, hessel2017rainbow. Still, tabula rasa, RL typically requires a significant amount of interaction with the environment in order to learn. In real-world environments, particularly when a risk factor is involved, an inept policy might be hazardous shalev2016safe. Thus, an appealing approach is to record a dataset of other agents in order to learn a safe initial policy, which may later be improved via RL techniques taylor2011integrating.

Learning from a dataset of experts has been extensively researched in the literature; such learning methods and algorithms are commonly referred to as Learning from Demonstrations (LfD) argall2009survey. In this paper, we consider a sibling, less explored problem: learning from a dataset of observations (LfO). We define LfO as a relaxation of LfD, where: (1) we do not assume a single policy generating the data and; (2) the policies are not assumed to be optimal, nor do they cover the entire state space. Practically, LfO is of interest since it is often easier to collect observations than expert demonstrations, for example, using crowdsourcing kurin2017atari. Technically, LfO is fundamentally different from LfD, where the typical task is to clone a single expert policy. The data under the LfO setting is expected to be more diverse than LfD data, which in general can be beneficial for learning. However, it also brings in the challenge of learning from multiple, possibly contradicting, policy trajectories.

In this work, we propose to solve the LfO problem with a three-phase approach: (1) imitation; (2) annotation and; (3) safe improvement. The imitation phase seeks to learn the average behavior in the dataset.
In the annotation phase we approximate the value function of the average behavior, and in the final safe improvement step (section Document), we craft a novel algorithm that takes the learned average behavior and its approximated value function and yields an improved policy without generating new trajectories. The improvement step is designed to increase the policy performance with high confidence in the presence of the value estimation errors that exist in an LfO setup. Our three-phase approach, which we term Rerouted Behavior Improvement (RBI), provides a robust policy (without any interaction with the environment) that both eliminates the risks of random policy initialization and, in addition, can boost the performance of a succeeding RL process. We demonstrate our algorithm both in a Taxi grid-world dietterich2000hierarchical and in the Atari domain (section Document). In the latter, we tackle the challenge of learning from non-expert human players kurin2017atari. We show that our algorithm provides a robust policy, on par with deep RL policies, using only the demonstrations and without any additional interaction with the environment. As a baseline, we compare our approach to two state-of-the-art algorithms: (1) learning from demonstrations, DQfD hester2018deep and; (2) robust policy improvement, PPO schulman2017proximal.

## Related Work

Learning from demonstrations in the context of deep RL in the Atari environment has been studied in [cruz2017pre], DQfD hester2018deep and recently in [pohlen2018observe]. However, all these methods focus on expert demonstrations, and they benchmark their scores after additional trajectory collection with an iterative RL process. Therefore, these methods can essentially be categorized as RL augmented with an expert's supervised data. In contrast, we take a deeper look into the first part: best utilizing the observed data to provide the highest initial performance.

Previously, LfO has often been solved solely with the imitation step, e.g. in AlphaGo silver2016mastering, where the system learned a policy which mimics the average behavior in a multiple-policies dataset. While this provides sound empirical results, we found that one can do better by applying a safe improvement step to boost the policy performance. A greedy improvement method with respect to multiple policies has already been suggested in [barreto2017successor], yet we found that, practically, estimating the value function of each different policy in the dataset is both computationally prohibitive and may also produce large errors, since the data generated by each different policy is typically small. In section Document we suggest a feasible alternative: instead of learning the value of each different policy, we estimate the value of the average behavior. While such an estimation is not exact, we show both theoretically and experimentally that it provides a surrogate value function that may be used for policy improvement purposes.

There is also significant research in the field of safe RL garcia2015comprehensive, yet there are multiple accepted definitions of this term, ranging from a worst-case criterion tamar2013scaling to a baseline benchmark approach ghavamzadeh2016safe. We continue the line of research of safe RL investigated by [kakade2002approximately, pirotta2013adaptive, thomas2015high]

but we focus on a dataset composed of unknown policies. Finally, there are two recent well-known works in the context of non-decreasing policy improvement (which can also be categorized as safe improvement): TRPO and PPO schulman2015trust,schulman2017proximal. We compare our work to these algorithms and show two important advantages: first, our approach can be applied without an additional Neural Network (NN) policy optimization step and second, we provide theoretical and experimental arguments why both approaches may be deemed unsafe when applied in the context of an LfO setup.

## Problem Formulation

We consider a Markov Decision Process (MDP) puterman2014markov where an agent interacts with an environment and tries to maximize a reward. An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is a set of states and $\mathcal{A}$ is a set of actions. $P$ is the set of transition probabilities, i.e. $P(s'|s,a)$ is the probability of switching from state $s$ to $s'$ when executing action $a$, and $r(s,a)$ is a reward function which defines the reward that the agent receives when applying action $a$ in state $s$. An agent acts according to a policy $\pi$, and its goal is to find a policy that maximizes the expected cumulative discounted reward, also known as the objective function $J(\pi) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r(s_t, a_t) \mid s_0\right]$, where $\gamma \in [0, 1)$ is a discount factor, $t$ is a time index and $s_0$ is an initial state. We assume that all policies belong to the Markovian randomized set, s.t. $\pi(\cdot|s)$ is a probability distribution over $\mathcal{A}$ given a state $s$.^{1}

^{1}Note that humans' policies can generally be considered as part of the history randomized set, where the policy is a probability function over $\mathcal{A}$ given the history of states and actions. In the appendix we explain how we circumvented this hurdle in the Atari dataset.

For convenience, and when appropriate, we may simply write $\pi(a)$ to denote $\pi(a|s)$ (omitting the state's dependency). In the paper we will discuss two important distance measures between policies. The first is the Total Variation (TV), $D_{TV}(\pi_1, \pi_2)(s) = \frac{1}{2}\sum_a |\pi_1(a|s) - \pi_2(a|s)|$, and the second is the KL divergence, $D_{KL}(\pi_1 \| \pi_2)(s) = \sum_a \pi_1(a|s) \log\frac{\pi_1(a|s)}{\pi_2(a|s)}$. These measures are often used to constrain the updates of a learned policy in an iterative policy improvement RL algorithm schulman2015trust,schulman2017proximal. For a given policy $\pi$, the state's value $v^\pi(s)$ is the expected cumulative reward starting at this state. Similarly, the $Q$-value function $Q^\pi(s,a)$ is the value of taking action $a$ in state $s$ and then immediately following with policy $\pi$. The advantage $A^\pi(s,a) = Q^\pi(s,a) - v^\pi(s)$ is the gain of taking action $a$ in state $s$ over the average value (note that $\mathbb{E}_{a \sim \pi}\left[A^\pi(s,a)\right] = 0$). We denote by $P^\pi_t(s'|s)$ the probability of switching from state $s$ to state $s'$ in $t$ steps with a policy $\pi$.

We define the LfO problem as learning a policy solely by observing a finite set of trajectories of other behavior policies, without interacting with the environment. Formally, we are given a dataset $D$ of trajectories executed by $N_p$ different players, each with a presumably different policy. Players are denoted by $i \in \{1, \dots, N_p\}$, and their corresponding policies are $\beta_i$, with value and $Q$-value functions $v^{\beta_i}$ and $Q^{\beta_i}$, respectively. $D$ is indexed as $k \in \{1, \dots, N\}$, where the cardinality of the dataset is denoted by $N$ and each record is the tuple $(s_k, a_k, r_k, d_k, i_k)$, s.t. $d_k$ is a termination signal and $i_k$ is the player's index. $D$ may also be partitioned into subsets $\{D_i\}$, representing the different players' records.


The paper is accompanied by a running example based on the Taxi grid-world domain dietterich2000hierarchical. In the Taxi world, the driver's task is to pick up and drop off a passenger at predefined locations in a minimal number of steps (see Figure Document). For this example, we synthetically generated policies that mix the optimal policy $\pi^*$ with a random policy, with a different mixing parameter for each different policy. Generally, we divide the state space into two complementary, equal-size sets $\mathcal{S}_1, \mathcal{S}_2$, where each policy follows $\pi^*$ for $s \in \mathcal{S}_1$ and the mixture for $s \in \mathcal{S}_2$. In the next sections, we will use different selections of $\mathcal{S}_2$ to generate different types of datasets. For example, randomly picking half of the states to form $\mathcal{S}_2$ is termed in the paper a random selection (see also appendix).
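To make this construction concrete, the following sketch generates such a family of synthetic behavior policies. All sizes, seeds and mixing parameters here are hypothetical illustrations, not the paper's exact Taxi settings (those are in its appendix); the "optimal" policy is a stand-in deterministic table.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 20, 4

# Stand-in "optimal" policy: deterministic, one action per state.
pi_star = np.eye(N_ACTIONS)[rng.integers(0, N_ACTIONS, size=N_STATES)]

def make_behavior_policy(pi_star, noisy_states, alpha):
    """Follow pi_star everywhere, but mix in a uniform policy with
    weight alpha on the states in `noisy_states` (the set S_2)."""
    beta = pi_star.copy()
    uniform = np.full(pi_star.shape[1], 1.0 / pi_star.shape[1])
    beta[noisy_states] = (1 - alpha) * pi_star[noisy_states] + alpha * uniform
    return beta

# "Random selection": randomly pick half of the states to form S_2.
s2 = rng.choice(N_STATES, size=N_STATES // 2, replace=False)
policies = [make_behavior_policy(pi_star, s2, alpha)
            for alpha in (0.2, 0.5, 0.8)]

for beta in policies:
    assert np.allclose(beta.sum(axis=1), 1.0)  # valid distributions per state
```

Each policy in the list is a valid stochastic policy that deviates from the optimal one only on the selected half of the state space.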

## Average Behavior Policy and its Value Function

We begin with the first phase of the RBI approach, which is learning a behavior policy from the dataset. Learning a behavior policy is a sensible approach both for generating sound initial performance and for avoiding unknown states and actions. Yet, contrary to LfD, where a single expert policy is observed, in a multiple-policies dataset the definition of a behavioral policy is ambiguous. To that end, we define the average behavior of the dataset as a natural generalization of the single-policy imitation problem.

[Average Behavior] The average behavior of a dataset $D$ is

$$\beta(a|s) = \frac{\sum_{k \in D} \mathbb{1}\left[s_k = s,\, a_k = a\right]}{\sum_{k \in D} \mathbb{1}\left[s_k = s\right]} = \sum_{i=1}^{N_p} \frac{\sum_{k \in D_i} \mathbb{1}\left[s_k = s\right]}{\sum_{k \in D} \mathbb{1}\left[s_k = s\right]}\, \beta_i(a|s) \qquad (\text{beta\_0})$$

where $\mathbb{1}[\cdot] = 1$ if the condition holds, and $0$ otherwise. The first form in Eq. beta_0 is simply the fraction of each action taken in each state in $D$, which for a single-policy dataset is identical to behavioral cloning. Typically, when $\beta$ is expressed with a NN (as in section Document), we apply a standard classification learning process with a Cross Entropy loss. Otherwise, for a tabular representation (as in the Taxi example), we directly enumerate to calculate this expression. The second form in Eq. beta_0 is a weighted sum over all players in the dataset, which may also be expressed with conditional probability terminology as

$$\beta(a|s) = \sum_{i=1}^{N_p} \mathbb{P}(i\,|\,s)\, \beta_i(a|s)$$

Here $\mathbb{P}(i\,|\,s)$ is the probability of visiting an $i$-th player's record, given that a uniform sample over $D$ picked the state $s$.
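For a tabular representation, the first form of Eq. beta_0 — the per-state action fractions — can be computed by direct counting. A minimal sketch on hypothetical toy records:

```python
import numpy as np
from collections import Counter

def average_behavior(records, n_actions):
    """Tabular estimate of beta(a|s): the fraction of each action taken
    in each state, pooled over all players' records."""
    counts = Counter(records)
    states = {s for s, _ in records}
    beta = {}
    for s in states:
        n_s = sum(counts[(s, a)] for a in range(n_actions))
        beta[s] = np.array([counts[(s, a)] / n_s for a in range(n_actions)])
    return beta

# Toy data: two "players" visit state 0, choosing different actions.
data = [(0, 1), (0, 1), (0, 2), (1, 0)]
beta = average_behavior(data, n_actions=3)
print(beta[0])  # [0.  0.66666667  0.33333333]: the pooled action fractions
```

For a single-policy dataset this reduces exactly to behavioral cloning; with multiple players it yields the player-weighted average of Eq. beta_0.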
While other definitions of average behavior are possible, the ease of learning such a formulation with a NN makes it a natural candidate for RBI. Yet, for the second phase of RBI, one must evaluate its $Q$-value function, i.e. $Q^\beta$. It is not straightforward to learn the $Q$-value of the average behavior since, essentially, such a policy was never executed in the data.^{2} However, we suggest that the following function

$$Q^D(s,a) = \sum_{i=1}^{N_p} \mathbb{P}(i\,|\,s,a)\, Q^{\beta_i}(s,a)$$

which we term the $Q$-value of the dataset, may serve as a surrogate value function for policy improvement purposes. $Q^D$ may be interpreted as the weighted average over the players' $Q$-values, where the weights are the probabilities of visiting an $i$-th player's record, given that a uniform sample over $D$ picked the pair $(s,a)$. In the following two propositions, we show that such a function has two appealing characteristics. First, it may be evaluated with an $L_2$-norm loss and Monte-Carlo (MC) learning Sutton2016ReinforcementL from the dataset trajectories, without Off-Policy corrections and without the burden of evaluating each $Q^{\beta_i}$ independently. Second, it is a $Q$-value function of a time-dependent policy with a structure very similar to the average behavior policy. Taken together, these features provide an efficient way to approximate the value function of $\beta$.

^{2}One may argue that it can be learned Off-Policy; we discuss Off-Policy learning in section Document and show that other Off-Policy approaches yielded significantly lower results.

[Consistency of MC upper bound] For an approximation $Q_\theta$ and the loss $\mathcal{L}(Q_\theta) = \mathbb{E}\left[\left(Q_\theta(s,a) - Q^D(s,a)\right)^2\right]$, an upper bound for the loss is

$$\mathcal{L}(Q_\theta) \le \mathbb{E}_{k \sim U(D)}\left[\left(Q_\theta(s_k, a_k) - R_k\right)^2\right]$$

where $R_k$ is the sampled Monte-Carlo return (Proof in the appendix). The attractive implication of Proposition Document is that we can learn $Q^D$: (1) without approximating and plugging in any sort of Importance Sampling (IS) factor as executed in Off-Policy learning and; (2) from complete trajectories, without bootstrapping with Temporal Difference (TD) learning. Learning a value function Off-Policy is prone to high variance munos2016safe, particularly when combined with a function approximation such as a NN and/or bootstrapping Sutton2016ReinforcementL.

[$Q^D$ as the value of a time-dependent policy] Given a state-action pair $(s,a)$, $Q^D(s,a)$ is the $Q$-value of a time-dependent policy $\beta_t^{s,a}$, where $t$ is a time index and $(s,a)$ is a fixed initial state-action pair,

$$\beta_t^{s,a}(a'|s') = \sum_{i=1}^{N_p} \mathbb{P}\left(i \,\middle|\, s',\, D_t^{s,a}\right)\, \beta_i(a'|s')$$

Here the conditional probability is over a uniform sample from $D_t^{s,a}$, the subset of $D$ that contains all the entries in the dataset with distance $t$ from an entry with the state-action pair $(s,a)$ (Proof in the appendix). Proposition Document indicates that when $\beta_t^{s,a}$ can be approximated as $\beta$, at least for a finite horizon $t$, then $Q^D(s,a) \approx Q^\beta(s,a)$. This happens when the distribution of players in states near $(s,a)$ equals the distribution of players in those states in the entire dataset. Practically, while both policies are not equal, they have a relatively low TV distance, and therefore their $Q$-values are close.

To further increase the robustness of our method, we add an interesting consideration: our improvement step will rely only on the action ranking of the $Q$-value in each state, i.e. the order of the actions when sorted by $Q^D(s,\cdot)$ (see next section). This, as we show hereafter, significantly increases the effective similarity between $Q^D$ and $Q^\beta$.

We demonstrate the action ranking similarity between $Q^D$ and $Q^\beta$ in the Taxi grid-world example. To that end, we generated trajectories with players according to three selection types of $\mathcal{S}_2$: (1) row selection; (2) column selection and; (3) random selection. Each selection type provides a different class of policies and therefore forms a different dataset (see exact definitions in the appendix). In the first experiment (Figure Documenta), we plot the average TV distance (for various initial states), between $\beta$ and the time-dependent policy $\beta_t^{s,a}$, as a function of the time-step $t$. Generally it is low, but it may be questionable whether relying on the true value of $Q^D$ for the improvement step will provide adequate results. However, when we consider only the action ranking similarity (evaluated with the Pearson correlation computed on the action ranks), we find an even more favorable pattern. First, in Figure Documentb we plot the average rank correlation between $Q^D$ and $Q^\beta$ as a function of the number of different policies used to generate the dataset. It is evident that the rank correlation is very high and stable for any number of policies.
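The rank-correlation measure used here — the Pearson correlation computed on the action ranks (Spearman's rho) — depends only on the ordering of the two $Q$-vectors, not their scale. A sketch with illustrative toy vectors:

```python
import numpy as np

def rank_correlation(q1, q2):
    """Pearson correlation of the action ranks (Spearman's rho).
    Compares only the *ordering* of the two Q-vectors, not their scale."""
    ranks1 = np.argsort(np.argsort(q1))
    ranks2 = np.argsort(np.argsort(q2))
    return np.corrcoef(ranks1, ranks2)[0, 1]

q_beta = np.array([0.1, 0.9, 0.4, 0.3])   # toy Q of the average behavior
q_data = np.array([1.0, 5.0, 3.0, 2.0])   # toy surrogate Q of the dataset
print(rank_correlation(q_beta, q_data))   # 1.0: identical action ranking
```

Here the two vectors differ substantially in magnitude, yet the rank correlation is perfect — exactly the property that makes $Q^D$ an acceptable surrogate for a ranking-based improvement step.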
In the second experiment, we generated datasets with two different numbers of policies (Figure Documentc and Figure Documentd) and examined the impact of different discount factors. Here too, for the majority of practical scenarios, we observe a sufficiently high rank correlation. Only for a very large discount factor (close to 1) does the rank correlation decrease. This happens since the long horizon accumulates more error from the TV-distance difference at large time-steps. In conclusion, while Proposition Document states bounds on the similarity between $Q^D$ and $Q^\beta$, evaluating the rank correlation confirms our statement that, in practice, the action ranking of $Q^D$ is an acceptable surrogate for the action ranking of $Q^\beta$.
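As a practical note on the Monte-Carlo learning procedure of Proposition Document: the regression targets are plain discounted returns, computed backwards over each complete trajectory, with no importance-sampling factors and no bootstrapping. A minimal sketch:

```python
import numpy as np

def mc_returns(rewards, gamma):
    """Discounted Monte-Carlo returns R_t = sum_k gamma^k * r_{t+k},
    computed backwards over one complete trajectory (no bootstrapping,
    no off-policy corrections)."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Each (s_k, a_k) along the trajectory pairs with R_k as a regression
# target for the Q-network under an L2 loss.
print(mc_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5 1.  2. ]
```

Because the targets are sampled returns of the policy that actually generated each trajectory, averaging them over the pooled dataset directly estimates $Q^D$.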

## Safe Policy Improvement

Our next step is to harness the approximated $Q$-function (henceforth termed $\hat{Q}$) in order to improve the average behavior $\beta$, i.e. to generate a policy $\pi$ such that $v^\pi \ge v^\beta$. However, one must recall that $\hat{Q}$ is learned from a fixed, sometimes even small, dataset. Therefore, in order to guarantee improvement, we analyze the statistics of the value function's error. This leads to an interesting observation: the $Q$-value has a higher error for actions that were taken less frequently; thus, to avoid an improvement penalty, we must restrict the ratio of the change in probability, $\frac{\pi(a|s)}{\beta(a|s)}$. We will use this observation to craft the third phase of RBI, i.e. the safe improvement step, and show that other well-known monotonic improvement methods (such as PPO schulman2017proximal and TRPO schulman2015trust) overlooked this consideration, and therefore their improvement step may be unsafe in an LfO setup.

### Soft Policy Improvement

Before analyzing the error's statistics, we begin by considering a subset of policies which are proven to improve $\beta$ if our estimation of $Q^\beta$ is exact. Out of this subset we will later pick our improved policy. Recall that the most naive and also most common improvement method is taking a greedy step, i.e. deterministically acting with the highest-$Q$-value action in each state. This is known, by the policy improvement theorem Sutton2016ReinforcementL, to improve the policy performance. The policy improvement theorem may be generalized to include a larger family of soft steps.
[Soft Policy Improvement]
Given a policy $\beta$, with value $v^\beta$ and advantage $A^\beta$, a policy $\pi$ improves $\beta$, i.e. $v^\pi \ge v^\beta$, if it satisfies $\sum_a \pi(a|s) A^\beta(s,a) \ge 0$ for all states, with at least one state with strict inequality. The term $\sum_a \pi(a|s) A^\beta(s,a)$ is called the improvement step.^{3}

^{3}We post a proof in the appendix for completeness, though it may have been proven elsewhere.
Essentially, every policy that increases the probability of taking positive-advantage actions over the probability of taking negative-advantage actions achieves improvement. We will use the next corollary to prove that our specific improvement method always yields a positive improvement step.
[Rank-Based Policy Improvement]
Let $(A_1, \dots, A_{|\mathcal{A}|})$ be an ordered list of the advantages in a state $s$, s.t. $A_j \le A_{j+1}$, and let $c_j = \frac{\pi(a_j|s)}{\beta(a_j|s)}$. If, for all states, $(c_j)$ is a monotonic non-decreasing sequence, then $\pi$ improves $\beta$ (Proof in the appendix).
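The improvement-step quantity in the lemma is easy to compute for a single state. A toy sketch (all numbers illustrative) — note that the step of $\beta$ with respect to itself is zero, since the expected advantage under $\beta$ vanishes:

```python
import numpy as np

def improvement_step(pi, beta, q):
    """The lemma's improvement step: sum_a pi(a|s) * A^beta(s,a),
    where A^beta = Q^beta - v^beta and v^beta = sum_a beta(a|s) Q^beta(s,a).
    Non-negative in every state (strict somewhere) => pi improves beta."""
    v = np.dot(beta, q)
    adv = q - v
    return np.dot(pi, adv)

beta = np.array([0.5, 0.3, 0.2])
q = np.array([1.0, 2.0, 0.0])
pi = np.array([0.4, 0.5, 0.1])  # shifts mass toward the positive advantage
print(improvement_step(pi, beta, q) > 0)  # True: a soft improvement step
```

Any policy that reroutes probability mass from low-ranked to high-ranked actions in this way satisfies the lemma's condition.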

### Standard Error of the Value Estimation

To provide a statistical argument for the expected error of the $Q$-function, consider learning $Q^\beta$ with a tabular representation. The $Q$-function is the expected value of the random variable $R(s,a)$, the discounted return of taking action $a$ in state $s$ and thereafter following $\beta$. Therefore, the Standard Error (SE) of an approximation of the $Q$-value with MC trajectories is

$$SE\left(\hat{Q}(s,a)\right) = \frac{\sigma_{R|s,a}}{\sqrt{n(s)\,\beta(a|s)}}$$

where $n(s)$ is the number of visitations in state $s$, s.t. $n(s,a) = n(s)\,\beta(a|s)$. Therefore, and specifically for low-frequency actions, such an estimation may suffer a large SE.^{4}

^{4}Note that even for deterministic environments, a stochastic policy inevitably provides $\sigma_{R|s,a} > 0$.
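The $1/\sqrt{n(s,a)}$ shrinkage of the SE can be verified numerically. In this sketch the return distribution is a stand-in unit-variance Gaussian, not an actual environment return:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # assumed std of the discounted return R for a fixed (s, a)

def se_of_mc_estimate(n_visits, trials=2000):
    """Empirical standard error of a Monte-Carlo Q estimate that
    averages `n_visits` sampled returns."""
    estimates = rng.normal(0.0, sigma, size=(trials, n_visits)).mean(axis=1)
    return estimates.std()

for n in (10, 100, 1000):
    print(n, se_of_mc_estimate(n))  # shrinks roughly like sigma / sqrt(n)
```

An action with only a tenth of the visits of another thus carries roughly three times the estimation error — the key fact motivating the constraint of the next subsections.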

### Policy Improvement in the Presence of Value Estimation Errors

We now turn to the crucial question of what happens when one applies an improvement step with respect to an inaccurate estimation of the $Q$-function, i.e. $\hat{Q}(s,a) = Q^\beta(s,a) + \varepsilon(s,a)$.

[Improvement Penalty] Let $\hat{Q}$ be an estimator of $Q^\beta$ with an error $\varepsilon(s,a)$, and let $\pi$ be a policy that satisfies lemma Document with respect to $\hat{Q}$. Then the following holds

$$v^\pi(s) \ge v^\beta(s) - \sum_{t \ge 0} \gamma^t \sum_{s'} P^\pi_t(s'|s) \sum_a \left(\pi(a|s') - \beta(a|s')\right) \varepsilon(s', a)$$

where the difference, denoted $\Delta(s)$, is called the improvement penalty (proof in the appendix). For simplicity we may write $\Delta(s) = \sum_{s'} \rho^\pi(s'|s) \sum_a \left(\pi(a|s') - \beta(a|s')\right) \varepsilon(s', a)$, where $\rho^\pi(s'|s)$ is sometimes referred to as the undiscounted state distribution of policy $\pi$ given an initial state $s$. Since $\Delta(s)$ is a random variable, it is worth considering its variance. Assuming that the errors are positively correlated (since neighboring state-action pairs share trajectories of rewards), and under the error model introduced above, it follows that the standard deviation of $\Delta(s)$ scales with a sum of elements of the form

$$\rho^\pi(s'|s)\, \left|\pi(a|s') - \beta(a|s')\right|\, \frac{\sigma_{R|s',a}}{\sqrt{n(s')\,\beta(a|s')}}$$

Hence, it is evident that the improvement penalty can be extremely large when the ratio $\frac{\pi(a|s)}{\beta(a|s)}$ is unregulated. Moreover, a single mistake along the trajectory, caused by a single unregulated element, might wreck the performance of the entire policy. Therefore, it is essential to regulate each one of these elements in order to minimize the potential improvement penalty.
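A toy simulation illustrates why an unregulated probability ratio inflates the improvement penalty: when $\pi$ concentrates mass on a rarely-taken action, it multiplies exactly the error term with the largest SE. All numbers here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([0.90, 0.09, 0.01])  # action frequencies in one state
n_s = 1000                           # visits to the state
# Per-action estimation error: SE ~ 1 / sqrt(n(s) * beta(a|s)).
se = 1.0 / np.sqrt(n_s * beta)

def penalty_std(pi, trials=5000):
    """Empirical std of the single-state penalty term sum_a (pi - beta) * eps."""
    eps = rng.normal(0.0, se, size=(trials, 3))  # simulated value errors
    return (eps @ (pi - beta)).std()

pi_greedy = np.array([0.0, 0.0, 1.0])   # jumps onto the rare action
pi_reroute = np.minimum(2.0 * beta, 1.0)  # cap the ratio pi/beta, then renormalize
pi_reroute /= pi_reroute.sum()
print(penalty_std(pi_greedy), penalty_std(pi_reroute))  # greedy is far larger
```

The greedy step's penalty is dominated by the $1/\sqrt{n(s)\beta(a|s)}$ error of the rare action, while the ratio-capped step keeps every element of the sum small.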

### The Reroute Constraint

In order to regulate the ratio $\frac{\pi(a|s)}{\beta(a|s)}$, we suggest limiting the improvement step to a subset of policies based on the following constraint.

[Reroute Constraint] Given a policy $\beta$, a policy $\pi$ is a $reroute(c_{min}, c_{max})$ of $\beta$ if $\pi(a|s) \in \left[c_{min}\,\beta(a|s),\; c_{max}\,\beta(a|s)\right]$, where $0 \le c_{min} \le 1 \le c_{max}$.

Further, note that reroute is a subset of a TV constraint (proof in the appendix). Now, it is evident that with the reroute constraint, each element in the sum of (Document) is regulated, since the ratio $\frac{\pi(a|s)}{\beta(a|s)}$ is bounded in $[c_{min}, c_{max}]$. Analyzing other well-known trust regions, such as the TV constraint, the average KL-divergence constraint used in the TRPO algorithm, and the PPO objective function schulman2017proximal, surprisingly reveals that none of them properly controls the improvement penalty (see an example and an analysis of the PPO objective in the appendix, where we also show that the solution of the PPO objective is not unique).
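The reroute constraint is a simple per-action box constraint and is cheap to verify. A sketch of a membership check, alongside the TV distance that any reroute policy implicitly keeps small:

```python
import numpy as np

def is_reroute(pi, beta, c_min, c_max, tol=1e-9):
    """Check the reroute constraint:
    c_min * beta(a|s) <= pi(a|s) <= c_max * beta(a|s) for every action."""
    return bool(np.all(pi >= c_min * beta - tol) and
                np.all(pi <= c_max * beta + tol))

def tv_distance(pi, beta):
    return 0.5 * np.abs(pi - beta).sum()

beta = np.array([0.5, 0.3, 0.2])
pi = np.array([0.45, 0.35, 0.2])

print(is_reroute(pi, beta, c_min=0.5, c_max=1.5))  # True
print(tv_distance(pi, beta))                        # small TV shift: 0.05
```

Note that the converse fails: a greedy policy such as $(0, 0, 1)$ can have a moderate TV distance from $\beta$ while violating any finite $c_{max}$ on the rare action — which is exactly why a TV trust region alone does not control the improvement penalty.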

### Maximizing the Improvement Step under the Reroute Constraint

We now consider the last part of our improvement approach, i.e. maximizing the objective function under the reroute constraint, and whether such maximization provides a positive improvement step. It is well known that maximizing the objective function without generating new trajectories of $\pi$ is a hard task, since the distribution of states induced by the policy is unknown. Previous works have suggested maximizing a surrogate off-policy objective function, solving the constrained maximization with a NN policy representation and a gradient ascent approach schulman2015trust. Here we suggest a refreshing alternative: instead of precalculating the policy that maximizes the surrogate objective, one may compute ad hoc the policy that maximizes the improvement step (which is the argument of the objective) for each different state. Such an approach also maximizes the surrogate objective, since the improvement step is independent between states. For the reroute constraint, this essentially sums up to solving the following simple linear program for each state:

$$\max_{\boldsymbol{\pi}}\;\; \boldsymbol{\pi}^\top \boldsymbol{A} \quad \text{s.t.} \quad c_{min}\,\boldsymbol{\beta} \le \boldsymbol{\pi} \le c_{max}\,\boldsymbol{\beta},\;\; \boldsymbol{\pi}^\top \boldsymbol{1} = 1$$

where $\boldsymbol{\pi}$, $\boldsymbol{\beta}$ and $\boldsymbol{A}$ are vector representations of $\pi(\cdot|s)$, $\beta(\cdot|s)$ and $A(s,\cdot)$, respectively. We term the algorithm that solves this maximization problem Max-Reroute (see Algorithm Document). In the appendix we also provide an analogous algorithm that maximizes the improvement step under the TV constraint (termed Max-TV); we will use Max-TV as a baseline for the performance of the reroute constraint.

With an ad hoc maximization approach, we avoid the hassle of an additional learning task after the policy imitation step, and in addition, our solution guarantees maximization without the common caveats of NN learning, such as converging to local minima or overfitting. Further analyzing Max-Reroute and Max-TV quickly reveals that they both rely only on the action ranking at each state (as stated in the previous section). This is in contrast with the aforementioned methods (TRPO and PPO), where, by their definition as policy gradient methods sutton2000policy, they optimize the policy according to the magnitude of the advantage function. Finally, notice that both Max-Reroute and Max-TV satisfy the conditions of Corollary 3; therefore, they always provide a positive improvement step, and hence, for a perfect approximation of the value function, they are guaranteed to improve the performance.
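The linear program above admits a simple closed-form greedy solution: give every action its floor $c_{min}\beta(a|s)$, then pour the remaining probability mass into the highest-advantage actions, each up to its cap $c_{max}\beta(a|s)$. The sketch below is our reconstruction of such a per-state step (not necessarily the paper's exact Algorithm Document), assuming $c_{min} \le 1 \le c_{max}$ so the program is feasible:

```python
import numpy as np

def max_reroute(beta, advantage, c_min, c_max):
    """Per-state solution of: maximize sum_a pi(a) * A(a)
    s.t. c_min*beta <= pi <= c_max*beta and sum_a pi(a) = 1.
    Greedy fill: floor every action at c_min*beta, then allocate the
    remaining mass to the best-ranked actions up to the c_max*beta cap.
    Feasibility assumes c_min <= 1 <= c_max."""
    pi = c_min * beta
    budget = 1.0 - pi.sum()
    for a in np.argsort(-advantage):          # highest advantage first
        room = (c_max - c_min) * beta[a]      # headroom below the cap
        delta = min(room, budget)
        pi[a] += delta
        budget -= delta
        if budget <= 0:
            break
    return pi

beta = np.array([0.5, 0.3, 0.2])
adv = np.array([-1.0, 0.5, 2.0])
pi = max_reroute(beta, adv, c_min=0.5, c_max=2.0)
print(pi)  # [0.25 0.35 0.4]: mass rerouted toward high-advantage actions
```

The resulting ratios $\pi(a)/\beta(a)$ are non-decreasing in the advantage ranking, so the step satisfies the rank-based corollary, and it depends only on the ordering of the advantages, not their magnitudes.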

*Figure: Improvement steps comparison.*

Let us now return to our Taxi example and examine three different types of improvement steps with respect to a behavioral cloning baseline: (1) a greedy step;^{5} (2) a TV-constrained step and; (3) a reroute-constrained step. The dataset is generated by two policies with row selection (the evaluation method for $\beta$ and $\hat{Q}$ is in the appendix). Note that in a tabular representation, $\hat{Q}(s,a)$ is undefined for missing state-action pairs. While in such a parameterization we can avoid visiting unknown pairs by simply assigning them a very low $Q$-value, in a function approximation such as a NN, the value of such unseen pairs is practically uncontrolled. To examine the effect of this matter, we consider two different evaluations for unseen pairs: in Figure a, we assign a random value to an unseen pair, and in Figure b we assign the minimal value to such pairs. Both figures show the evaluated performance with respect to the number of trajectories in the dataset.

^{5}An unconstrained step, equivalent to a reroute step with no constraints, i.e. $c_{min} = 0$ and $c_{max} \to \infty$.

Examining the score with respect to the average behavior baseline reveals that TV is indeed not a safe step when unknown pairs are assigned a random score. On the other hand, reroute nicely demonstrates its safety property: it always provides better performance than the average behavior baseline. The results also show that, for any size of the dataset, the greedy policy provides the poorest performance. We found that this is because it easily converges to recurrent states and does not complete the task (note that this is true for both types of experiments). For small to medium dataset sizes, the reroute step outperforms the TV step, but for large datasets, when evaluation errors diminish, TV is better. In a real-world MDP with a larger number of states, it is extremely difficult to sufficiently sample the entire state space; hence, we project that reroute should be better than TV even for large datasets. In the next section, this premise is verified in the Atari domain.

## Learning to play Atari by observing inexpert human players

In the previous section, we analyzed the expected error, which motivated the reroute constraint for a tabular representation. In this section, we experimentally show that the same ideas hold for a deep NN parametric form. We conducted an experiment with crowdsourced data of 4 Atari games (SpaceInvaders, MsPacman, Qbert and Montezuma's Revenge) kurin2017atari. Each game had roughly 1000 recorded episodes. We employed two networks, one for policy cloning and one for $Q$-value estimation, with an architecture inspired by the Dueling DQN mnih2015human,wang2015dueling and a few modifications (see appendix). We evaluated three types of local maximization steps: (1) Max-Reroute, with different parameters $(c_{min}, c_{max})$; (2) Max-TV; and (3) greedy.^{6}

^{6}We do not report the greedy step results due to its poor performance with respect to the other baselines.

We implemented two baselines: (1) the DQfD algorithm, with hyperparameters as in [hester2018deep] and; (2) a single PPO policy search step based on the learned behavior policy and the estimated advantage. The following discussion refers to the results presented in Figure Document.

**Average behavior performance:** When comparing the average behavior performance to the average score in the dataset, we get very similar performance in SpaceInvaders, lower results in MsPacman and significantly higher results in Qbert and Revenge. Generally, we expect that in games where longer trajectories lead to significantly more reward, the average policy obtains an above-average reward, since the effect of good players in the data has a heavier weight on the results. We assume that the lower score in MsPacman is due to the more complex and less clear frame in this game (after decimation) with respect to the other games. We take the average behavior performance as a baseline for comparing the safety level of the subsequent improvement steps.

**Local maximization steps:** We chose to evaluate Max-TV with a constraint parameter that encapsulates the reroute region. It is clear that Max-TV is always dominated by Max-Reroute, and it did not secure safe improvement in any of the games. A comparison of different reroute parameters reveals that there is a fundamental trade-off between safety and improvement: smaller steps support higher safety at the expense of improvement potential. The smallest reroute step provided safe improvement in all games, but in Qbert its results were inferior to a larger step; on the other hand, the largest step reduced the Revenge score, which indicates a too-greedy step. Our results indicate that it is important to set $c_{min} > 0$, since avoiding doing so may discard some important actions altogether due to errors in the value evaluation, resulting in poor performance.

**Comparison with PPO:** For the PPO baseline, we executed a policy search with the learned advantage according to the PPO objective

$$\mathbb{E}\left[\min\left(\frac{\pi(a|s)}{\beta(a|s)}\, A(s,a),\; \mathrm{clip}\left(\frac{\pi(a|s)}{\beta(a|s)},\, 1-\epsilon,\, 1+\epsilon\right) A(s,a)\right)\right]$$

We chose the clipping parameter $\epsilon$ motivated by the similarity to the reroute region (see appendix). Contrary to the PPO paper, we plugged in our advantage estimator and did not use the Generalized Advantage Estimation (GAE). While PPO scored slightly better in Qbert (see box plot in Figure Document), in all other games it reduced the behavioral cloning score. The overall results indicate the similarity between the PPO step and a reroute step with $c_{min} = 0$, probably since negative-advantage actions tend to settle at zero probability to avoid a negative penalty. This emphasizes the importance of the $c_{min}$ parameter of reroute, which is missing from PPO.

**Comparison with DQfD and Off-Policy greedy RL:** DQfD scored below the average behavior in all games. The significantly low scores in MsPacman and Qbert raise the question of whether DQfD and, more generally, Off-Policy greedy RL can effectively learn from multiple non-exploratory fixed policies. The most conspicuous issue is the greedy policy improvement approach taken by DQfD: we have shown that an unconstrained greedy improvement step leads to poor performance (recall the Taxi example). However, Off-Policy RL also suffers from the second RL ingredient, i.e. policy evaluation. Notice that, contrary to conventional Off-Policy learning, the behavior policy is unknown and must be estimated from the data. In practice, a NN might provide an inaccurate estimation of the probability function when classes are imbalanced guo2008class. This might lead to significant errors in the evaluation step, which in turn might lead to a significant improvement penalty in an iterative RL process. As our results show, our proposed safe policy improvement scheme mitigates these issues, leading to significantly better results.

## Conclusions

In this paper, we studied both theoretically and experimentally the problem of LfO. We analyzed factors that impede classical methods, such as TRPO/PPO and Off-Policy greedy RL, and proposed a novel alternative, Rerouted Behavior Improvement (RBI), that incorporates behavioral cloning and a safe policy improvement step. RBI is designed to learn from multiple agents and to mitigate value evaluation errors. It does not use importance sampling corrections or bootstrapping to estimate values, hence it is less sensitive to deep network function approximation errors. In addition, it does not require a policy search process. Our experimental results in the Atari domain demonstrate the strength of RBI compared to current state-of-the-art algorithms. We project that these attributes of RBI would also benefit an iterative RL process. Therefore, in the future, we plan to study RBI as an online RL policy improvement method.