Path Consistency Learning in Tsallis Entropy Regularized MDPs

02/10/2018, by Ofir Nachum, et al.

We study the sparse entropy-regularized reinforcement learning (ERL) problem in which the entropy term is a special form of the Tsallis entropy. The optimal policy of this formulation is sparse, i.e., at each state, it has non-zero probability for only a small number of actions. This addresses the main drawback of the standard Shannon entropy-regularized RL (soft ERL) formulation, in which the optimal policy is softmax, and thus may assign a non-negligible probability mass to non-optimal actions; this problem is aggravated as the number of actions increases. In this paper, we follow the work of Nachum et al. (2017) in the soft ERL setting, and propose a class of novel path consistency learning (PCL) algorithms, called sparse PCL, for the sparse ERL problem that can work with both on-policy and off-policy data. We first derive a sparse consistency equation that specifies a relationship between the optimal value function and policy of the sparse ERL problem along any system trajectory. Crucially, a weak form of the converse is also true: we quantify the sub-optimality of a policy that satisfies sparse consistency, and show that as the number of actions increases, this sub-optimality compares favorably with that of the soft ERL optimal policy. We then use this result to derive the sparse PCL algorithms. We empirically compare sparse PCL with its soft counterpart and show its advantage, especially in problems with a large number of actions.


1 Introduction

In reinforcement learning (RL), the goal is to find a policy with maximum long-term performance, defined as the sum of discounted rewards generated by following the policy (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998). When the number of states and actions is small and the model is known, the optimal policy is the solution of the non-linear Bellman optimality equations (Bellman, 1957). When the system is large or the model is unknown, greedily solving the Bellman equations often results in policies that are far from optimal. A principled way of dealing with this issue is regularization. Among different forms of regularization, such as $\ell_2$ (e.g., Farahmand et al. 2008, 2009) and $\ell_1$ (e.g., Kolter & Ng 2009; Johns et al. 2010; Ghavamzadeh et al. 2011), entropy regularization is among the most studied in both value-based (e.g., Kappen 2005; Todorov 2006; Ziebart 2010; Azar et al. 2012; Fox et al. 2016; O’Donoghue et al. 2017; Asadi & Littman 2017) and policy-based (e.g., Peters et al. 2010; Todorov 2010) RL formulations. In particular, two of the most popular deep RL algorithms, TRPO (Schulman et al., 2015) and A3C (Mnih et al., 2016), are based on entropy-regularized policy search. We refer the interested reader to Neu et al. (2017) for an insightful discussion of entropy-regularized RL algorithms and their connection to online learning.

In entropy-regularized RL (ERL), an entropy term is added to the Bellman equation. This formulation has four main advantages: 1) it softens the non-linearity of the Bellman equations and makes it possible to solve them more easily; 2) the solution of the softened problem is quantifiably not much worse than the optimal solution in terms of accumulated return; 3) the addition of the entropy term brings useful properties, such as encouraging exploration (Shannon entropy) (e.g., Fox et al. 2016; Nachum et al. 2017) and maintaining a close distance to a baseline policy (relative entropy) (e.g., Schulman et al. 2015; Nachum et al. 2018); and 4) unlike the original problem, which has a deterministic solution, the solution to the softened problem is stochastic, which is preferable in problems in which exploration or dealing with unexpected situations is important. However, in the most common form of ERL, in which a Shannon (or relative) entropy term is added to the Bellman equations, the optimal policy is of softmax form. Despite the advantages of a softmax policy in terms of exploration, its main drawback is that at each step it assigns a non-negligible probability mass to non-optimal actions, a problem that is aggravated as the number of actions increases. This may result in policies that are not safe to execute. To address this issue, Lee et al. (2018) proposed to add a special form of a general notion of entropy, called Tsallis entropy (Tsallis, 1988), to the Bellman equations. This formulation has the property that its optimal policy is sparse, i.e., at each state, only a small number of actions have non-zero probability. Lee et al. (2018) studied the properties of this ERL formulation, proposed value-based algorithms (fitted Q-iteration and Q-learning) to solve it, and showed that although it is harder to solve than its soft counterpart, it potentially has a solution closer to that of the original problem.

In this paper, we propose novel path consistency learning (PCL) algorithms for the Tsallis ERL problem, called sparse PCL. PCL is a class of actor-critic type algorithms developed by Nachum et al. (2017) for the soft (Shannon entropy) ERL problem. It uses a nice property of soft ERL, namely the equivalence of consistency and optimality, and learns parameterized policy and value functions by minimizing a loss that is based on the consistency equation of soft ERL. The most notable feature of soft PCL is that it can work with both on-policy (sub-trajectories generated by the current policy) and off-policy (sub-trajectories generated by a policy different than the current one, including any sub-trajectory from the replay buffer) data. We first derive a multi-step consistency equation for the Tsallis ERL problem, called sparse consistency. We then prove that in this setting, while optimality implies consistency (similar to the soft case), unlike the soft case, consistency only implies sub-optimality. We then use the sparse consistency equation and derive PCL algorithms that use both on-policy and off-policy data to solve the Tsallis ERL problem. We empirically compare sparse PCL with its soft counterpart. As expected, we gain from using the sparse formulation when the number of actions is large, both in algorithmic tasks and in discretized continuous control problems.

2 Markov Decision Processes (MDPs)

We consider the reinforcement learning (RL) problem in which the agent's interaction with the system is modeled as an MDP. An MDP is a tuple $\langle \mathcal{S}, \mathcal{A}, r, P, P_0, \gamma \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces; $r$ and $P$ are the reward function and transition probability distribution, with $r(s,a)$ and $P(s'|s,a)$ being the reward and the next-state probability of taking action $a$ in state $s$; $P_0$ is the initial state distribution; and $\gamma \in (0,1)$ is a discounting factor. In this paper, we assume that the action space is finite, but can be large. The goal in RL is to find a stationary Markovian policy, i.e., a mapping $\pi : \mathcal{S} \to \Delta_{\mathcal{A}}$ from states to the simplex over the actions $\mathcal{A}$, that maximizes the expected discounted sum of rewards, i.e.,

$\max_{\pi} \; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big], \qquad (1)$

where $s_0 \sim P_0$, $a_t \sim \pi(\cdot|s_t)$, and $s_{t+1} \sim P(\cdot|s_t, a_t)$. For a given policy $\pi$, we define its value and action-value functions as

$V^{\pi}(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s, \pi\Big], \qquad Q^{\pi}(s,a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a, \pi\Big].$

Any solution of the optimization problem (1) is called an optimal policy and is denoted by $\pi^*$. Note that while an MDP may have several optimal policies, it only has a single optimal value function $V^*$. It has been proven that (1) has a solution in the space of deterministic policies, i.e., $\pi^* : \mathcal{S} \to \mathcal{A}$, which can be obtained as the greedy action w.r.t. the optimal action-value function, i.e., $\pi^*(s) \in \arg\max_{a} Q^*(s,a)$ (Puterman, 1994; Bertsekas & Tsitsiklis, 1996). The optimal action-value function $Q^*$ is the unique solution of the non-linear Bellman optimality equations, i.e., for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$,

$Q^*(s,a) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, \max_{a'} Q^*(s',a'). \qquad (2)$

Any optimal policy $\pi^*$ and the optimal state and state-action value functions, $V^*$ and $Q^*$, satisfy the following equations for all states $s$ and actions $a$:

$V^*(s) = \max_{a} Q^*(s,a), \qquad Q^*(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^*(s'), \qquad \pi^*(s) \in \arg\max_{a} Q^*(s,a).$

3 Entropy Regularized MDPs

As discussed in Section 2, finding an optimal policy for a MDP involves solving a non-linear system of equations (see Eq. 2), which is often complicated. Moreover, the optimal policy may be deterministic, always selecting the same optimal action at a state even when there are several optimal actions in that state. This is undesirable when it is important to explore and to deal with unexpected situations. In such cases, one might be interested in multimodal policies that still have good performance. This is why many researchers have proposed to add a regularizer in the form of an entropy term to the objective function (1) and solve the following entropy-regularized optimization problem

$\max_{\pi} \; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\Big(r(s_t, a_t) + \lambda\, H\big(\pi(\cdot|s_t)\big)\Big)\Big], \qquad (3)$

where $H\big(\pi(\cdot|s)\big)$ is an entropy-related term and $\lambda > 0$ is the regularization parameter. The entropy term smooths the objective function (1), so the resulting problem (3) is often easier to solve than the original one (1). This is another reason for the popularity of entropy-regularized MDPs.

3.1 Entropy Regularized MDP with Shannon Entropy

It is common to use the Shannon entropy $H\big(\pi(\cdot|s)\big) = -\sum_a \pi(a|s)\log \pi(a|s)$ in entropy-regularized MDPs (e.g., Fox et al. 2016; Nachum et al. 2017). Problem (3) with the Shannon entropy can be seen as a RL problem in which the reward function is the sum of the original reward function and a term that encourages exploration. Another entropy term that has been studied in the literature is the relative entropy $H\big(\pi(\cdot|s)\big) = -\sum_a \pi(a|s)\log\big(\pi(a|s)/\pi_0(a|s)\big)$, where $\pi_0$ is a baseline policy; problem (3) with the relative entropy can be seen as a RL problem in which the reward function is the sum of the original reward function and a term that penalizes deviation from the baseline policy $\pi_0$. Unlike (1), the optimization problem (3) with the Shannon entropy has a unique optimal policy $\pi^*_{\mathrm{soft}}$ and a unique optimal value (action-value) function $V^*_{\mathrm{soft}}$ ($Q^*_{\mathrm{soft}}$) that satisfy the following equations:

$\pi^*_{\mathrm{soft}}(a|s) = \exp\Big(\frac{Q^*_{\mathrm{soft}}(s,a) - V^*_{\mathrm{soft}}(s)}{\lambda}\Big), \qquad V^*_{\mathrm{soft}}(s) = \mathrm{sfmax}_\lambda\big(Q^*_{\mathrm{soft}}(s,\cdot)\big), \qquad (4)$

where, for any function $Q(s,\cdot)$, the sfmax operator is defined as $\mathrm{sfmax}_\lambda\big(Q(s,\cdot)\big) = \lambda \log \sum_{a} \exp\big(Q(s,a)/\lambda\big)$. Note that the equations in (3.1) are derived from the KKT conditions of (3) with the Shannon entropy. In this case, the optimal policy is a softmax, with the regularization parameter $\lambda$ playing the role of its temperature (see Eq. 3.1). This is why (3) with the Shannon entropy is called the soft MDP problem. In soft MDPs, the optimal value function is the unique solution of the soft Bellman optimality equations, i.e., for all $s \in \mathcal{S}$,

$V^*_{\mathrm{soft}}(s) = \mathrm{sfmax}_\lambda\Big(r(s,\cdot) + \gamma \sum_{s'} P(s'|s,\cdot)\, V^*_{\mathrm{soft}}(s')\Big). \qquad (5)$

Note that the sfmax operator is a smoother function of its inputs than the max operator associated with the Bellman optimality equation (2). This means that solving the soft MDP problem is easier than solving the original one, at the cost that its optimal policy $\pi^*_{\mathrm{soft}}$ performs worse in the original MDP than the optimal policy $\pi^*$. This difference can be quantified as

$0 \;\le\; V^*(s) - V^{\pi^*_{\mathrm{soft}}}(s) \;\le\; \frac{\lambda \log|\mathcal{A}|}{1-\gamma}, \qquad \forall s \in \mathcal{S}, \qquad (6)$

where we distinguish between the value function of a policy in the soft MDP, $V^{\pi}_{\mathrm{soft}}$, and in the original MDP, $V^{\pi}$. Note that the sub-optimality bound of $\pi^*_{\mathrm{soft}}$ grows without bound as $|\mathcal{A}| \to \infty$. This is the main drawback of using softmax policies: in problems with a large action space, at each step the policy assigns a non-negligible probability mass to non-optimal actions, which in aggregate can be detrimental to its reward performance.
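As a concrete illustration of this drawback (our own numerical sketch, not taken from the paper), the following computes the sfmax value and the softmax policy defined above for a single state with one clearly best action, and shows how the probability mass assigned to sub-optimal actions grows with the number of actions.

import numpy as np

def sfmax(q, lam):
    """Soft-max value: lam * log sum_a exp(q_a / lam), a smooth upper bound on max."""
    z = q / lam
    return lam * (np.max(z) + np.log(np.sum(np.exp(z - np.max(z)))))

def softmax_policy(q, lam):
    """Boltzmann policy pi(a) proportional to exp(q_a / lam) induced by Shannon regularization."""
    z = (q - np.max(q)) / lam
    p = np.exp(z)
    return p / p.sum()

lam = 0.1
for num_actions in (4, 16, 256, 4096):
    q = np.zeros(num_actions)     # one clearly best action, all others slightly worse
    q[0] = 0.5
    pi = softmax_policy(q, lam)
    print(f"|A|={num_actions:5d}  mass on sub-optimal actions: {1.0 - pi[0]:.3f}  "
          f"sfmax value: {sfmax(q, lam):.3f}")

Even with a fixed advantage for the best action, the total probability assigned to sub-optimal actions approaches one as the action set grows, which is exactly the behavior the sub-optimality bound (6) reflects.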

3.2 Entropy Regularized MDP with Tsallis Entropy

To address the issues with the softmax policy, Lee et al. (2018) proposed to use $H\big(\pi(\cdot|s)\big) = \frac{1}{2}\sum_a \pi(a|s)\big(1 - \pi(a|s)\big)$ in entropy-regularized MDPs. Note that this is a special case of a general notion of entropy, called Tsallis entropy (Tsallis, 1988), i.e., $S_q(\pi) = \frac{k}{q-1}\big(1 - \sum_a \pi(a|s)^q\big)$, for the parameters $q = 2$ and $k = 1/2$. (The Shannon entropy is recovered as a special case of the Tsallis entropy in the limit $q \to 1$ with $k = 1$ (Tsallis, 1988).) Similar to the soft MDP problem, the optimization problem (3) with the Tsallis entropy has a unique optimal policy $\pi^*_{\mathrm{sp}}$ and a unique optimal value (action-value) function $V^*_{\mathrm{sp}}$ ($Q^*_{\mathrm{sp}}$) that satisfy the following equations (Lee et al., 2018):

$\pi^*_{\mathrm{sp}}(a|s) = \max\Big(\frac{Q^*_{\mathrm{sp}}(s,a)}{\lambda} - \tau\Big(\frac{Q^*_{\mathrm{sp}}(s,\cdot)}{\lambda}\Big),\; 0\Big), \qquad V^*_{\mathrm{sp}}(s) = \lambda\,\mathrm{spmax}\Big(\frac{Q^*_{\mathrm{sp}}(s,\cdot)}{\lambda}\Big), \qquad (7)$

where $Q^*_{\mathrm{sp}}(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^*_{\mathrm{sp}}(s')$, and for any function $Q(s,\cdot)$, the spmax operator is defined as

$\mathrm{spmax}\Big(\frac{Q(s,\cdot)}{\lambda}\Big) = \sum_{a \in S(s)} \bigg(\frac{1}{2}\Big(\frac{Q(s,a)}{\lambda}\Big)^{2} - \frac{1}{2}\,\tau\Big(\frac{Q(s,\cdot)}{\lambda}\Big)^{2}\bigg) + \frac{1}{2},$

in which

$\tau\Big(\frac{Q(s,\cdot)}{\lambda}\Big) = \frac{\sum_{a \in S(s)} Q(s,a)/\lambda \; - \; 1}{|S(s)|},$

and $S(s)$ is the set of actions satisfying $1 + i\, Q(s, a_{(i)})/\lambda > \sum_{j \le i} Q(s, a_{(j)})/\lambda$, where $a_{(i)}$ indicates the action with the $i$-th largest value of $Q(s,\cdot)$. Note that the equations in (3.2) are derived from the KKT conditions of (3) with the Tsallis entropy. In this case, the optimal policy may have zero probability for several actions (see Eq. 3.2). This is why (3) with the Tsallis entropy is called the sparse MDP problem. The regularization parameter $\lambda$ controls the sparsity of the resulting policy: the policy becomes sparser for smaller values of $\lambda$. In sparse MDPs, the optimal value function is the unique fixed point of the sparse Bellman optimality operator $T_{\mathrm{sp}}$ (Lee et al., 2018), which for any function $V$ is defined as

$T_{\mathrm{sp}} V(s) = \lambda\, \mathrm{spmax}\Big(\big(r(s,\cdot) + \gamma \sum_{s'} P(s'|s,\cdot)\, V(s')\big)/\lambda\Big). \qquad (8)$

Similar to (5), the spmax operator is a smoother function of its inputs than the max operator, and thus solving the sparse MDP problem is easier than solving the original one, at the cost that its optimal policy $\pi^*_{\mathrm{sp}}$ performs worse in the original MDP than the optimal policy $\pi^*$. This difference can be quantified as (Lee et al., 2018),

$0 \;\le\; V^*(s) - V^{\pi^*_{\mathrm{sp}}}(s) \;\le\; \frac{\lambda}{1-\gamma}\cdot\frac{|\mathcal{A}|-1}{2\,|\mathcal{A}|}, \qquad \forall s \in \mathcal{S}. \qquad (9)$

On the other hand, the spmax operator is more complex than sfmax, and thus it is slightly more complicated to solve the sparse MDP problem than its soft counterpart. However, as can be seen from Eqs. 6 and 9, the optimal policy of the sparse MDP, $\pi^*_{\mathrm{sp}}$, can have better performance than its soft counterpart, $\pi^*_{\mathrm{soft}}$, and this difference becomes more apparent as the number of actions grows: for large action spaces, the term $\frac{|\mathcal{A}|-1}{2|\mathcal{A}|}$ in (9) tends to a constant, while the term $\log|\mathcal{A}|$ in (6) grows without bound.
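The following is a minimal illustrative implementation (not the authors' code) of the spmax operator and the sparse policy of Section 3.2 for a single state, following the sparsemax-style formulas above; it shows that only a few actions receive non-zero probability even for a large action set.

import numpy as np

def sparse_policy_and_value(q, lam):
    """Sparse (Tsallis, q=2) policy and spmax value for one state.

    Sort z = q/lam, keep the support S = {i : 1 + i*z_(i) > sum_{j<=i} z_(j)},
    threshold tau = (sum_S z - 1)/|S|, pi(a) = max(z_a - tau, 0),
    value = lam * (sum_S (z_a^2 - tau^2)/2 + 1/2).
    """
    z = np.asarray(q, dtype=float) / lam
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    k = ks[1.0 + ks * z_sorted > cumsum][-1]      # size of the support set S(s)
    tau = (cumsum[k - 1] - 1.0) / k
    pi = np.maximum(z - tau, 0.0)
    value = lam * (0.5 * np.sum(z_sorted[:k] ** 2 - tau ** 2) + 0.5)
    return pi, value

rng = np.random.default_rng(0)
q = rng.normal(size=1000)                         # a large, random action-value vector
pi, v = sparse_policy_and_value(q, lam=0.5)
print("non-zero actions:", int((pi > 0).sum()), "out of", len(q), " value:", round(v, 3))

In contrast to the softmax example of Section 3.1, the resulting policy places exactly zero probability on all but a small support set, regardless of how many actions exist.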

4 Path Consistency Learning in Soft MDPs

A nice property of soft MDPs that was elegantly used by Nachum et al. (2017) is that any policy $\pi$ and function $V$ that satisfy the (one-step) consistency equation, i.e., for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}$,

$V(s) = r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V(s') - \lambda \log \pi(a|s), \qquad (10)$

are optimal, i.e., $\pi = \pi^*_{\mathrm{soft}}$ and $V = V^*_{\mathrm{soft}}$ (consistency implies optimality). Due to the uniqueness of the optimal policy in soft MDPs, the reverse is also true, i.e., the optimal policy $\pi^*_{\mathrm{soft}}$ and value function $V^*_{\mathrm{soft}}$ satisfy the consistency equation (optimality implies consistency).

As shown in Nachum et al. (2017), the (one-step) consistency equation (10) can easily be extended to multiple steps, i.e., any policy $\pi$ and function $V$ that, for any state $s_0$ and sequence of actions $a_0, \ldots, a_{d-1}$, satisfy the multi-step consistency equation

$V(s_0) = \mathbb{E}\bigg[\gamma^{d}\, V(s_d) + \sum_{t=0}^{d-1} \gamma^{t}\big(r(s_t, a_t) - \lambda \log \pi(a_t|s_t)\big)\bigg], \qquad (11)$

where the expectation is over the next states $s_{t+1} \sim P(\cdot|s_t, a_t)$, are optimal, i.e., $\pi = \pi^*_{\mathrm{soft}}$ and $V = V^*_{\mathrm{soft}}$.

The property that both single- and multi-step consistency equations imply optimality (Eqs. 10 and 11) motivated the RL algorithm of Nachum et al. (2017), path consistency learning (PCL). The main idea of (soft) PCL is to learn a parameterized policy $\pi_\theta$ and value function $V_\phi$ by minimizing the following objective function:

$O_{\mathrm{PCL}}(\theta, \phi) = \sum_{s_{0:d}} \frac{1}{2}\, C(s_{0:d}; \theta, \phi)^{2}, \qquad C(s_{0:d}; \theta, \phi) = -V_\phi(s_0) + \gamma^{d} V_\phi(s_d) + \sum_{t=0}^{d-1} \gamma^{t}\big(r(s_t, a_t) - \lambda \log \pi_\theta(a_t|s_t)\big), \qquad (12)$

where $s_{0:d} = (s_0, a_0, \ldots, a_{d-1}, s_d)$ is any $d$-length sub-trajectory, $\theta$ and $\phi$ are the policy and value function parameters, respectively, and $C(s_{0:d}; \theta, \phi)$ is the soft consistency error of the sub-trajectory.

An important property of the soft PCL algorithm is that, since the multi-step consistency equation (11) holds for any $d$-length sub-trajectory, it can use both on-policy data (sub-trajectories $s_{0:d}$ generated by the current policy $\pi_\theta$) and off-policy data, i.e., sub-trajectories generated by a policy different from the current one, including any $d$-length sub-trajectory from the replay buffer.

Note that since both the optimal policy and the optimal value function can be written in terms of the optimal action-value function (see Eq. 3.1), we may write the objective function (12) in terms of a single parameterized action-value function $Q_\rho$, and optimize only one set of parameters $\rho$, instead of separate $\theta$ and $\phi$.
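The following is a minimal sketch (not the authors' implementation) of the soft PCL objective, assuming the consistency error of Eq. (12): it computes the $d$-step soft consistency error and the resulting squared loss for a small batch of sub-trajectories, and applies unchanged to on-policy and replay-buffer data.

import numpy as np

def soft_consistency_error(values, rewards, log_pis, gamma, lam):
    """Soft path-consistency error C(s_{0:d}) for one d-step sub-trajectory.

    values:  length d+1 array with V(s_0), ..., V(s_d)
    rewards: length d array with r(s_t, a_t)
    log_pis: length d array with log pi(a_t | s_t)
    """
    d = len(rewards)
    discounts = gamma ** np.arange(d)
    return (-values[0]
            + gamma ** d * values[-1]
            + np.sum(discounts * (rewards - lam * log_pis)))

def pcl_loss(batch, gamma=0.99, lam=0.1):
    """Average squared consistency error over a batch of sub-trajectories.

    Works for both on-policy and off-policy (replay buffer) sub-trajectories,
    since the consistency equation must hold on any path.
    """
    errs = [soft_consistency_error(v, r, lp, gamma, lam) for (v, r, lp) in batch]
    return 0.5 * np.mean(np.square(errs))

# A toy batch of two 3-step sub-trajectories (values, rewards, log-probs).
batch = [
    (np.array([1.0, 0.9, 0.8, 0.7]), np.array([0.1, 0.1, 0.1]), np.log([0.5, 0.4, 0.6])),
    (np.array([0.5, 0.6, 0.4, 0.3]), np.array([0.0, 0.2, 0.1]), np.log([0.3, 0.7, 0.5])),
]
print("soft PCL loss:", pcl_loss(batch))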

5 Consistency between Optimal Value & Policy in Sparse MDPs

This section begins the main contributions of our work. We first identify a (one-step) consistency equation for the sparse MDPs defined by (3). We then prove the relationship between the sparse consistency equation and the optimal policy and value function of the sparse MDP, and highlight its similarities and differences with that in soft MDPs, discussed in Section 4. We then extend the sparse consistency equation to multiple steps and prove results that allow us to use the multi-step sparse consistency equation to derive on-policy and off-policy algorithms to solve sparse MDPs, which we fully describe in Section 6. The significance of the sparse consistency equation is in providing an efficient tool for computing a near-optimal policy for sparse MDPs, which only involves solving a set of linear equations and linear complementarity constraints, as opposed to (iteratively) solving for the fixed point of the non-linear sparse Bellman operator (8). We report the proofs of all the theorems of this section in Appendix A.

For any policy $\pi$ and value function $V$, we define the (one-step) consistency equation of sparse MDPs as, for all states $s$ and all actions $a$,

(13)

where $\mu$ and $\psi$ are Lagrange multipliers, such that $\psi(s,a) \ge 0$ and $\psi(s,a)\,\pi(a|s) = 0$. We call this the one-step sparse consistency equation; it is the analogue of Eq. 10 in soft MDPs.
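To see where conditions of this type come from, the following sketch writes the per-state Lagrangian of the Tsallis-regularized problem of Section 3.2 and its KKT conditions; this is our illustrative derivation (the derivation actually used for Eq. 13 is given in Appendix A), and the paper's exact statement of Eq. 13 may arrange these terms differently.

% Sketch: per-state Lagrangian for the Tsallis-regularized problem and its KKT
% conditions. The multipliers (mu, psi) follow the notation above.
\begin{align*}
L\big(\pi(\cdot|s),\mu,\psi\big)
  &= \sum_a \pi(a|s)\Big(r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}[V(s')]\Big)
   + \frac{\lambda}{2}\sum_a \pi(a|s)\big(1-\pi(a|s)\big) \\
  &\quad + \mu(s)\Big(1-\sum_a \pi(a|s)\Big) + \sum_a \psi(s,a)\,\pi(a|s), \\
\text{stationarity:}\quad
  & r(s,a) + \gamma\,\mathbb{E}_{s'}[V(s')] + \frac{\lambda}{2} - \lambda\,\pi(a|s)
    - \mu(s) + \psi(s,a) = 0, \\
\text{feasibility / slackness:}\quad
  & \psi(s,a) \ge 0, \qquad \psi(s,a)\,\pi(a|s) = 0 .
\end{align*}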

We now present a theorem which states that, similar to soft MDPs, consistency in sparse MDPs is a necessary condition for optimality, i.e., optimality implies consistency.

Theorem 1.

The optimal policy and value function of the sparse MDP (3) satisfy the consistency equation (13).

Theorem 2 shows that, in sparse MDPs, consistency only implies near-optimality, as opposed to optimality in the case of soft MDPs.

Theorem 2.

Any policy $\pi$ that satisfies the consistency equation (13) is near-optimal in the sparse MDP (3), i.e., for each state $s$, we have

(14)

This result indicates that, as the regularization parameter $\lambda$ decreases (with the rest of the problem held fixed), a policy satisfying the one-step consistency equation approaches the true optimum of the sparse MDP. To connect the performance of such a policy to the original goal of maximizing expected return, we present the following corollary, which is a direct consequence of Theorem 2 and the results reported in Section 3.2 on the performance of $\pi^*_{\mathrm{sp}}$ in the original MDP.

Corollary 1.

Any policy $\pi$ that satisfies the consistency equation (13) is also near-optimal in the original MDP (1); the corresponding per-state bound follows by combining Theorem 2 with the bound in (9).

We now extend the (one-step) sparse consistency equation (13) to multiple steps. For any state $s_0$ and sequence of actions $a_0, \ldots, a_{d-1}$, define the multi-step consistency equation for sparse MDPs as

(15)

where $\mu$ and $\psi$ are Lagrange multipliers, such that $\psi(s,a) \ge 0$ and $\psi(s,a)\,\pi(a|s) = 0$. We call this the multi-step sparse consistency equation, the analogue of Eq. 11 in soft MDPs.

From Theorem 1, we can immediately show that multi-step sparse consistency is a necessary condition for optimality.

Corollary 2.

The optimal policy and value function of the sparse MDP (3) satisfy the multi-step consistency equation (15).

Proof.

The proof follows directly from Theorem 1, by repeatedly applying the expression in (13) over the trajectory $(s_0, a_0, \ldots, a_{d-1}, s_d)$, taking the expectation over the trajectory, and using telescopic cancellation of the value functions of intermediate states. ∎
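For concreteness, the telescoping step can be sketched as follows, writing the one-step condition generically with a per-step correction term $c_t$ that collects the entropy and multiplier terms of Eq. (13); this is an illustrative outline rather than the exact derivation.

% If the one-step condition gives, for every t,
%   V(s_t) = E_{s_{t+1}}[ r(s_t,a_t) + c_t + gamma * V(s_{t+1}) ],
% then multiplying by gamma^t and summing over t = 0,...,d-1 yields
\begin{align*}
V(s_0) = \mathbb{E}\Big[\sum_{t=0}^{d-1}\gamma^{t}\big(r(s_t,a_t)+c_t\big)
     + \gamma^{d}\,V(s_d)\Big],
\end{align*}
% because every intermediate term gamma^{t+1} V(s_{t+1}) cancels with the
% corresponding term of the next step.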

Conversely, building on Theorem 2, we prove the following result on the performance of any policy satisfying the multi-step consistency equation. This is a novel result, showing that solving the multi-step consistency equation is indeed sufficient to guarantee near-optimality.

Corollary 3.

Any policy $\pi$ that satisfies the multi-step consistency equation (15) is near-optimal in the sparse MDP (3), with the same guarantee as in Theorem 2.

Proof.

Consider the multi-step consistency equation (15). Since it holds for any initial state and any sequence of actions, unrolling it for another $d$ steps, starting at state $s_d$ and using the action sequence $a_d, \ldots, a_{2d-1}$, yields

Note that this process can be repeated an arbitrary number of times (say $T$ times), and also note that, since $V$ is a bounded function, the discounted terminal term $\gamma^{Td}\, V(s_{Td})$ vanishes as $T \to \infty$. Therefore, by further unrolling, we obtain

It then follows from the Banach fixed-point theorem (Bertsekas & Tsitsiklis, 1996) that the solution pair $(\pi, V)$ is also a solution to the one-step consistency condition in (13) for any $s$ and $a$. Thus, the near-optimality guarantee of $\pi$ is implied by Theorem 2. ∎

Equipped with the above results on the relationship between (near)-optimality and multi-step consistency in sparse MDPs, we are now ready to present our off-policy RL algorithms to solve the sparse MDP (3).

6 Path Consistency Learning in Sparse MDPs

Similar to the PCL algorithm for soft MDPs, in sparse MDPs the multi-step consistency equation (15) naturally leads to a path-wise algorithm for training a policy $\pi_\theta$ and value function $V_\phi$, parameterized by $\theta$ and $\phi$, as well as Lagrange multipliers $\mu$ and $\psi$ parameterized by an auxiliary parameter $\chi$. To characterize the objective function of this algorithm, we first define the soft consistency error for a $d$-step sub-trajectory $s_{0:d}$ as a function of $\theta$, $\phi$, and $\chi$:

The goal of our algorithm is to learn $\theta$, $\phi$, and $\chi$ (and through $\chi$ the multipliers $\mu$ and $\psi$), such that the expectation of this consistency error, for any initial state and action sequence, is as close to zero as possible. Our sparse PCL algorithm minimizes the corresponding empirical squared-error objective, which converges to its expected counterpart as the number of sampled sub-trajectories grows. By the Cauchy-Schwarz inequality, this squared-error objective is a conservative surrogate of the error in the multi-step consistency equation (15). This relationship justifies that the solution policy of the sparse PCL algorithm is near-optimal (see Corollary 3). Moreover, the gradients of the objective w.r.t. all the parameters are obtained by differentiating the squared consistency error (see the sketch below).
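The following sketch (in PyTorch, with hypothetical module names and toy data) illustrates only the resulting training structure: a single optimizer descends the squared consistency error with respect to all parameters simultaneously. The consistency_error below is a simplified stand-in, not the exact expression of Eq. (15).

import torch
import torch.nn as nn

state_dim, num_actions, gamma, lam = 4, 8, 0.99, 0.1
policy_logits = nn.Linear(state_dim, num_actions)                              # theta
value = nn.Linear(state_dim, 1)                                                # phi
multiplier = nn.Sequential(nn.Linear(state_dim, num_actions), nn.Softplus())   # chi (psi >= 0)

def consistency_error(s0, sd, rewards, actions):
    """Simplified d-step error: -V(s0) + gamma^d V(sd) + sum_t gamma^t r_t plus a
    placeholder policy/multiplier correction; the real error follows Eq. (15)."""
    d = rewards.shape[0]
    disc = gamma ** torch.arange(d, dtype=torch.float32)
    pi = torch.softmax(policy_logits(s0), dim=-1)[0, actions[0]]
    psi = multiplier(s0)[0, actions[0]]
    corr = lam * (0.5 - pi) + psi    # illustrative correction term only
    return (-value(s0).squeeze() + (gamma ** d) * value(sd).squeeze()
            + torch.sum(disc * rewards) + corr)

opt = torch.optim.Adam(
    list(policy_logits.parameters()) + list(value.parameters()) + list(multiplier.parameters()),
    lr=1e-3)

# One update on a toy sub-trajectory (random data in place of environment samples).
s0, sd = torch.randn(1, state_dim), torch.randn(1, state_dim)
rewards, actions = torch.rand(3), torch.tensor([2, 5, 1])
loss = 0.5 * consistency_error(s0, sd, rewards, actions) ** 2
opt.zero_grad()
loss.backward()
opt.step()
print("loss:", float(loss))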

We may relate the sparse PCL algorithm to the standard actor-critic (AC) algorithm (Konda & Tsitsiklis, 2000; Sutton et al., 2000), where the updates of $\theta$ and $\phi$ correspond to the actor and critic updates, respectively. An advantage of sparse PCL over standard AC is that it does not need the multi-time-scale updates required by AC for convergence.

While this optimization minimizes the mean square of the soft consistency error, in order to satisfy the multi-step consistency in (15), one still needs to impose the constraints on the Lagrange multipliers, namely (i) the non-negativity of $\psi$ and (ii) the complementary-slackness conditions, in the optimization problem. One standard approach is to replace these constraints by adding penalty functions (Bertsekas, 1999) to the original objective. Note that each penalty function is associated with a penalty parameter, and the number of constraints scales with the sizes of the state and action spaces; when these are large, tuning all the penalty parameters becomes computationally expensive. Another approach is to update the penalty parameters using gradient ascent (Bertsekas, 2014). This is equivalent to finding a saddle point of the Lagrangian of the constrained optimization problem. However, the challenge in practice is to balance the primal and dual updates.

We hereby describe an alternative and much simpler methodology: parameterize the Lagrange multipliers $\mu$ and $\psi$ such that the aforementioned constraints are immediately satisfied. Although this method may impose extra restrictions on the representations of their function approximators, it avoids the difficulties of directly solving a constrained optimization problem. Specifically, to satisfy constraint (i), one can parameterize $\psi$ with a multilayer perceptron network that has a non-negative activation function (e.g., ReLU or softplus) at its last layer. To satisfy constraint (ii), we consider the case in which the policy is written in the sparse closed form suggested by the solution of the Tsallis entropy-regularized MDP problem in (3.2), i.e., $\pi(a|s) = \max\big(Q(s,a)/\lambda - \tau(Q(s,\cdot)/\lambda),\, 0\big)$ for some function approximator $Q$, and $\psi$ is parameterized by an auxiliary function approximator that vanishes wherever $\pi(a|s) > 0$; then, by the property $\max(x,0)\cdot\max(-x,0) = 0$, the complementary-slackness constraint (ii), $\psi(s,a)\,\pi(a|s) = 0$, is immediately satisfied. A pseudo-code of our sparse PCL algorithm can be found in Algorithm 1 in Appendix A.
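A minimal sketch of this constraint-satisfying parameterization for a single state is given below; the particular choice of psi is our illustration (any non-negative parameterization that vanishes on the support of the policy would do), not necessarily the paper's.

import numpy as np

def sparsemax_threshold(z):
    """Threshold tau(z) such that pi = max(z - tau, 0) lies on the simplex."""
    zs = np.sort(z)[::-1]
    cs = np.cumsum(zs)
    ks = np.arange(1, len(z) + 1)
    k = ks[1.0 + ks * zs > cs][-1]
    return (cs[k - 1] - 1.0) / k

lam = 0.1
q = np.random.randn(16)                 # stand-in for a learned Q(s, .) at one state
z = q / lam
tau = sparsemax_threshold(z)

# Policy in the closed form of Eq. (3.2). Constraint (i) can be enforced by any
# non-negative last-layer activation (e.g., ReLU or softplus); the psi below is
# one illustrative choice: it is non-negative and vanishes exactly where pi > 0,
# so complementary slackness psi * pi = 0 -- constraint (ii) -- holds by construction.
pi = np.maximum(z - tau, 0.0)
psi = lam * np.maximum(tau - z, 0.0)

print("sum(pi) =", round(float(pi.sum()), 6),
      " min(psi) =", float(psi.min()),
      " max(psi * pi) =", float(np.max(psi * pi)))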

Unified Sparse PCL

Note that the closed-form optimal policy and value function are both functions of the optimal state-action value function $Q^*_{\mathrm{sp}}$. As in soft PCL, based on this observation one can also parameterize both the policy and the value function in sparse PCL (see Eq. 3.2) with a single function approximator $Q_\rho$. Although consistency does not imply optimality in sparse MDPs (as opposed to the case of soft MDPs), the justification of this parameterization comes from Corollary 2, where the unique optimal value function and optimal policy satisfy the consistency equation (15). From an actor-critic perspective, the significance of this is that both the policy (actor) and the value function (critic) can be updated simultaneously without affecting convergence. Accordingly, the update rule for the model parameter $\rho$ takes the form of a gradient step on the squared path-consistency error written in terms of $Q_\rho$ (see the sketch below).
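In symbols, the unified parameterization and update can be sketched as follows, using the closed-form relations of Eq. (3.2); here $\eta$ (step size) and $C(s_{0:d};\rho)$ (the path-consistency error of Eq. 15 written in terms of $Q_\rho$) are our notation, not necessarily the paper's.

% Unified parameterization sketch: a single approximator Q_rho induces both the
% policy and the value, and rho is updated by gradient descent on the squared
% path-consistency error.
\begin{align*}
\pi_\rho(a|s) &= \max\!\Big(\tfrac{Q_\rho(s,a)}{\lambda}
                 - \tau\big(\tfrac{Q_\rho(s,\cdot)}{\lambda}\big),\, 0\Big), \qquad
V_\rho(s) = \lambda\,\mathrm{spmax}\big(\tfrac{Q_\rho(s,\cdot)}{\lambda}\big), \\
\rho &\leftarrow \rho \;-\; \eta\, \nabla_\rho
      \sum_{s_{0:d}} \tfrac{1}{2}\, C\big(s_{0:d};\rho\big)^2 .
\end{align*}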

7 Experimental Results

We demonstrate the effectiveness of the sparse PCL algorithm by comparing its performance with that of the soft PCL algorithm on a number of RL environments available in OpenAI Gym (Brockman et al., 2016).


Figure 1: Average reward of sparse PCL and standard soft PCL during training on the algorithmic tasks Copy, DuplicatedInput, RepeatCopy, and Reverse. Each row corresponds to a specific algorithmic task; within a row, the action space increases from left to right, corresponding to an increase in difficulty. We observe that soft PCL returns a better solution when the action space is small, but its performance degrades quickly as the size of the action space grows. On the other hand, sparse PCL is not only able to learn good policies in tasks with small action spaces but, unlike soft PCL, also successfully learns high-reward policies in the higher-dimensional variants. See the appendix for additional results.

7.1 Discrete Control

Here we compare the performance of the two algorithms on the following standard algorithmic tasks: 1) Copy, 2) DuplicatedInput, 3) RepeatCopy, 4) Reverse, and 5) ReversedAddition (see the appendix for more details). Each task can be viewed as a grid environment, where each cell stores a single character from a finite vocabulary $\mathbb{V}$. An agent moves on the grid of the environment and writes to an output. At each time step, the agent observes the character of the single cell in which it is located. After observing the character, the agent must take an action of the form $(m, w, c)$, where $m$ determines the agent's move to an adjacent cell (in 1D environments, left or right; in 2D environments, up, down, left, or right), $w \in \{0, 1\}$ determines whether the agent writes to the output or not, and $c$ determines the character that the agent writes if $w = 1$ (otherwise $c$ is ignored). With this structure, the size of the action space grows linearly with the size of the vocabulary $|\mathbb{V}|$ (see the sketch below), so the difficulty of these tasks grows with the size of the vocabulary. To illustrate the effectiveness of Tsallis entropy-regularized MDPs in problems with a large action space, we evaluate the two PCL algorithms for different vocabulary sizes.
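To make the size of this composite action space concrete, the following minimal sketch enumerates it for a 1D task; the encoding (and the exact count) is our illustrative assumption, not taken from the paper.

from itertools import product

def build_action_space(vocab_size, moves=("left", "right")):
    """Enumerate composite actions (move, write-flag, character) for a 1D task."""
    actions = []
    for m, w in product(moves, (0, 1)):
        if w == 0:
            actions.append((m, 0, None))                 # move without writing
        else:
            actions.extend((m, 1, c) for c in range(vocab_size))
    return actions

for vocab_size in (5, 10, 20, 40):
    print(f"|V| = {vocab_size:3d}  ->  {len(build_action_space(vocab_size))} discrete actions")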

In each task, the agent has a different goal. In Copy, the environment is a 1D sequence of characters and the agent aims to copy the sequence to the output. In DuplicatedInput, the environment is a 1D sequence of duplicated characters and the agent needs to write the de-duplicated sequence to the output. In RepeatCopy, the environment is a 1D sequence of characters that the agent must copy to the output in forward order, reverse order, and finally forward order again. In Reverse, the environment is a 1D sequence of characters that the agent must copy to the output in reverse order. Finally, in ReversedAddition, the environment is a 2D grid of digits representing two numbers (in a base determined by the vocabulary) that the agent needs to sum. In each task, the agent receives a reward of 1 for each correctly output character. The episode is terminated either when the task is completed or when the agent outputs an incorrect character.

We follow an experimental procedure similar to that of Nachum et al. (2017), where the functions $\pi$, $V$, $\mu$, and $\psi$ in the consistency equations are parameterized by a recurrent neural network with multiple heads. For each task and each PCL algorithm, we perform a hyper-parameter search to find the optimal regularization weight $\lambda$, and the corresponding training curves for average reward are shown in Figure 1. To increase the statistical significance of these experiments, we also train the policies over multiple Monte Carlo trials (notice that these environments are inherently deterministic, so no additional Monte Carlo evaluation is needed). Details of the experimental setup and additional numerical results are included in the Appendix.

For each task, we evaluated sparse PCL against the original soft PCL on a suite of variants that successively increase the vocabulary size. For low vocabulary sizes, soft PCL achieves better results. This suggests that the Shannon entropy encourages better exploration in small action spaces. Indeed, in such regimes, a greater proportion of the total actions are useful to explore, and exploration is not as costly. Therefore, the decreased exploration of the Tsallis entropy may outweigh its asymptotic benefits. The sub-optimality bounds presented in this paper support this behavior: when the number of actions is small, the bounds in (6) and (9) are of comparable magnitude.

As we increase the vocabulary size (and thus the action space), the picture changes. We see that the advantage of soft PCL over sparse PCL decreases until eventually the order is reversed and sparse PCL begins to show a significant improvement over the standard soft PCL. This supports our original hypothesis. In large action spaces, the tendency of soft PCL to assign a non-zero probability to many sub-optimal actions over-emphasizes exploration and is detrimental to the final reward performance. On the other hand, sparse PCL is able to handle exploration in large action spaces properly. These empirical results provide evidence for this unique advantage of sparse PCL.

7.2 Continuous Control

We further evaluate the two PCL algorithms on HalfCheetah, a continuous control problem in OpenAI Gym. The environment has a 6-dimensional action space, where each dimension corresponds to a torque in $[-1, 1]$. Here we discretize each action dimension with one of two uniform grids of different resolutions. Even though these discretization grids are coarse, the corresponding joint action spaces are quite large, since their size grows exponentially with the number of action dimensions (see the sketch below).
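As an illustration of how quickly the discretized action space grows, the following sketch builds a product grid over the 6 action dimensions; the specific 3-point and 5-point grids are assumptions for illustration, not necessarily the grids used in the experiments.

import numpy as np
from itertools import product

action_dim = 6
for grid in (np.linspace(-1.0, 1.0, 3), np.linspace(-1.0, 1.0, 5)):
    # Joint discrete action set: one torque value per dimension from the grid.
    joint_actions = list(product(grid, repeat=action_dim))
    print(f"{len(grid)} points per dimension -> {len(joint_actions)} discrete actions")
    # Example: map a flat action (a tuple) back to a torque vector.
    torque = np.array(joint_actions[0])
    assert torque.shape == (action_dim,) and np.all(np.abs(torque) <= 1.0)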

We present the results of sparse PCL and soft PCL on these discretized problems in Figure 2. Similar to the observations in the algorithmic tasks, here the policy learned by sparse PCL performs much better than that of soft PCL. Specifically, sparse PCL achieves higher average reward and learns much faster. To better visualize the learning progress of the two PCL algorithms, at each training step we also compare the average probability of the most-likely action across all time-steps of the on-policy trajectory. (Specifically, in each iteration we collect a single on-policy trajectory; this metric is therefore an average of the greedy-action probabilities over the steps of that trajectory.) Clearly, sparse PCL quickly converges to a near-deterministic policy, while the policy generated by soft PCL still allocates significant probability mass to non-optimal actions, as its average probability of the most-likely action remains well below one. In environments like HalfCheetah, where the trajectory has a long horizon, the soft-max policy will in general suffer because it chooses a large number of sub-optimal actions in each episode for exploration.

Compared with other continuous-control RL algorithms, such as deterministic policy gradient (DPG) (Silver et al., 2014), we found that the policy generated by sparse PCL is sub-optimal. This is mainly due to the coarse discretization of the action space. Our main purpose here is to demonstrate the fast and improved convergence to deterministic policies of sparse PCL, compared to soft PCL. Further evaluation of sparse PCL is left to future work.


Figure 2: Results of sparse PCL and soft PCL in HalfCheetah with discretized actions. The top figure shows the average reward over random runs during training, with the best hyper-parameters. The bottom figure plots the average probability of the most-likely action during training, illustrating the fast convergence of sparse PCL to a near-deterministic policy.

8 Conclusions

In this work, we studied the sparse entropy-regularized RL problem, whose optimal policy has non-zero probability for only a small number of actions. Similar to the work of Nachum et al. (2017), we derived a relationship between (near-)optimality and consistency for this problem. Furthermore, by leveraging the properties of the consistency equation, we proposed a class of sparse path consistency learning (sparse PCL) algorithms that are applicable to both on-policy and off-policy data and can learn from multi-step trajectories. We found that the theoretical advantages of sparse PCL correspond to empirical advantages as well. For tasks with a large number of actions, sparse PCL yields a significant improvement over the original soft PCL, both in final performance and in the time needed to reach that performance.

Future work includes 1) extending the sparse PCL algorithm to the more general class of Tsallis entropy, 2) investigating the possibility of combining sparse PCL and path following algorithms such as TRPO (Schulman et al., 2015), and 3) comparing the performance of sparse PCL with other deterministic policy gradient algorithms, such as DPG (Silver et al., 2014) in the continuous domain.

References

  • Asadi & Littman (2017) Asadi, K. and Littman, M. An alternative softmax operator for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 243–252, 2017.
  • Azar et al. (2012) Azar, M., Gómez, V., and Kappen, H. Dynamic policy programming. Journal of Machine Learning Research, 13:3207–3245, 2012.
  • Bellman (1957) Bellman, R. Dynamic Programming. Princeton University Press, 1957.
  • Bertsekas (1999) Bertsekas, D. Nonlinear programming. Athena scientific Belmont, 1999.
  • Bertsekas (2014) Bertsekas, D. Constrained optimization and Lagrange multiplier methods. Academic press, 2014.
  • Bertsekas & Tsitsiklis (1996) Bertsekas, D. and Tsitsiklis, J. Neuro-Dynamic Programming. Athena Scientific, 1996.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv:1606.01540, 2016.
  • Farahmand et al. (2008) Farahmand, A. M., Ghavamzadeh, M., Szepesvári, Cs., and Mannor, S. Regularized policy iteration. In Proceedings of Advances in Neural Information Processing Systems 21, pp. 441–448. MIT Press, 2008.
  • Farahmand et al. (2009) Farahmand, A. M., Ghavamzadeh, M., Szepesvári, Cs., and Mannor, S. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In Proceedings of the American Control Conference, pp. 725–730, 2009.
  • Fox et al. (2016) Fox, R., Pakman, A., and Tishby, N. G-learning: Taming the noise in reinforcement learning via soft update. In Proceedings of the 32nd International Conference on Uncertainty in Artificial Intelligence, pp. 202–211, 2016.
  • Ghavamzadeh et al. (2011) Ghavamzadeh, M., Lazaric, A., Munos, R., and Hoffman, M. Finite-sample analysis of lasso-td. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, pp. 1177–1184, 2011.
  • Johns et al. (2010) Johns, J., Painter-Wakefield, C., and Parr, R. Linear complementarity for regularized policy evaluation and improvement. In Proceedings of Advances in Neural Information Processing Systems 23, pp. 1009–1017. MIT Press, 2010.
  • Kappen (2005) Kappen, H. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics, 11, 2005.
  • Kolter & Ng (2009) Kolter, Z. and Ng, A. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the Twenty-Sixth International Conference on Machine Learning, pp. 521–528, 2009.
  • Konda & Tsitsiklis (2000) Konda, V. and Tsitsiklis, J. Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008–1014, 2000.
  • Lee et al. (2018) Lee, K., Choi, S., and Oh, S. Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 2018.
  • Mnih et al. (2016) Mnih, V., Badia, A., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1928–1937, 2016.
  • Nachum et al. (2017) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Proceedings of Advances in Neural Information Processing Systems 30, pp. 2772–2782, 2017.
  • Nachum et al. (2018) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Trust-PCL: An off-policy trust region method for continuous control. In Proceedings of the 6th International Conference on Learning Representations, 2018.
  • Neu et al. (2017) Neu, G., Jonsson, A., and Gómez, V. A unified view of entropy-regularized Markov decision processes. arXiv:1705.07798, 2017.
  • O’Donoghue et al. (2017) O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. PGQ: Combining policy gradient and Q-learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.
  • Peters et al. (2010) Peters, J., Müling, K., and Altun, Y. Relative entropy policy search. In Proceedings of the 24th Conference on Artificial Intelligence, pp. 1607–1612, 2010.
  • Puterman (1994) Puterman, M. Markov Decision Processes. Wiley Interscience, 1994.
  • Schulman et al. (2015) Schulman, J., Levine, S., Moritz, P., Jordan, M., and Abbeel, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1889–1897, 2015.
  • Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, 2014.
  • Sutton & Barto (1998) Sutton, R. and Barto, A. An Introduction to Reinforcement Learning. MIT Press, 1998.
  • Sutton et al. (2000) Sutton, R., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of Advances in Neural Information Processing Systems 12, pp. 1057–1063, 2000.
  • Todorov (2006) Todorov, E. Linearly-solvable Markov decision problems. In Proceedings of the 19th Advances in Neural Information Processing, pp. 1369–1376, 2006.
  • Todorov (2010) Todorov, E. Policy gradients in linearly-solvable MDPs. In Proceedings of the 23rd Advances in Neural Information Processing, pp. 2298–2306, 2010.
  • Tsallis (1988) Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1):479–487, 1988.
  • Ziebart (2010) Ziebart, B. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, 2010.

Appendix A Proofs of Section 5

Consider the Bellman operator for the entropy-regularized MDP with Tsallis entropy:

$T_{\mathrm{sp}} V(s) = \lambda\, \mathrm{spmax}\Big(\big(r(s,\cdot) + \gamma \sum_{s'} P(s'|s,\cdot)\, V(s')\big)/\lambda\Big).$

We first have the following technical result about its properties.

Proposition 1.

The sparse-max Bellman operator $T_{\mathrm{sp}}$ has the following properties: (i) Translation: for any constant $c$, $T_{\mathrm{sp}}(V + c\mathbf{1}) = T_{\mathrm{sp}}V + \gamma c\,\mathbf{1}$; (ii) $\gamma$-contraction: $\|T_{\mathrm{sp}}V - T_{\mathrm{sp}}U\|_\infty \le \gamma\,\|V - U\|_\infty$; (iii) Monotonicity: for any value functions $V$ and $U$ such that $V \le U$ (component-wise), $T_{\mathrm{sp}}V \le T_{\mathrm{sp}}U$.

The detailed proof of this proposition can be found in Lee et al. (2018). Using these results, the Banach fixed-point theorem shows that there exists a unique solution to the fixed-point equation $V(s) = T_{\mathrm{sp}}V(s)$, for all $s \in \mathcal{S}$, and this solution is equal to the optimal value function $V^*_{\mathrm{sp}}$. Analogous to the arguments in standard MDPs, the optimal value function can then also be computed using dynamic programming methods such as value iteration.
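For completeness, the standard contraction argument behind this claim reads as follows (with $V_{k+1} = T_{\mathrm{sp}} V_k$).

% Geometric convergence of value iteration under the gamma-contraction (ii):
\begin{align*}
\|V_k - V^*_{\mathrm{sp}}\|_\infty
  = \|T_{\mathrm{sp}} V_{k-1} - T_{\mathrm{sp}} V^*_{\mathrm{sp}}\|_\infty
  \le \gamma\,\|V_{k-1} - V^*_{\mathrm{sp}}\|_\infty
  \le \cdots \le \gamma^{k}\,\|V_0 - V^*_{\mathrm{sp}}\|_\infty
  \;\xrightarrow[k\to\infty]{}\; 0 .
\end{align*}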

Before proving the main results, notice that, by an argument analogous to the complementary-slackness property of the KKT conditions, the second and third consistency equations in (13) are equivalent to the following condition:

(16)

where $S(s)$ represents the set of actions that have non-zero probability w.r.t. policy $\pi$.

Theorem 3.

The pair of optimal value function and optimal policy of the MDP problem in (3) satisfies the consistency equation in (13).

Proof.

Recall that the optimal state-action value function is given by $Q^*_{\mathrm{sp}}(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^*_{\mathrm{sp}}(s')$.

According to Bellman’s optimality, the optimal value function satisfies the following equality:

(17)

at any state , where is the corresponding maximizer. By the KKT condition, we have that

for any and any , where is the Lagrange multiplier that corresponds to equality constraint , and is the Lagrange multiplier that corresponds to inequality constraint such that

Recall from the definition of the optimal state-action value function and the definition of the optimal policy $\pi^*_{\mathrm{sp}}$ that $V^*_{\mathrm{sp}}(s) = \sum_a \pi^*_{\mathrm{sp}}(a|s)\, Q^*_{\mathrm{sp}}(s,a) + \frac{\lambda}{2}\sum_a \pi^*_{\mathrm{sp}}(a|s)\big(1-\pi^*_{\mathrm{sp}}(a|s)\big)$. This condition further implies

Substituting the equality in (17) into this KKT condition, and noticing the complementary-slackness property, the KKT condition implies that

which further implies that

Therefore, by defining , and , one immediately has that , . Using this construction, one further has the following expression for any and any