AlgaeDICE: Policy Gradient from Arbitrary Experience

December 4, 2019 · Ofir Nachum et al., Google

In many real-world applications of reinforcement learning (RL), interactions with the environment are limited due to cost or feasibility. This presents a challenge to traditional RL algorithms since the max-return objective involves an expectation over on-policy samples. We introduce a new formulation of max-return optimization that allows the problem to be re-expressed by an expectation over an arbitrary behavior-agnostic and off-policy data distribution. We first derive this result by considering a regularized version of the dual max-return objective before extending our findings to unregularized objectives through the use of a Lagrangian formulation of the linear programming characterization of Q-values. We show that, if auxiliary dual variables of the objective are optimized, then the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting. In addition to revealing the appealing theoretical properties of this approach, we also show that it delivers good practical performance.


1 Introduction

The use of model-free reinforcement learning (RL) in conjunction with function approximation has proliferated in recent years, demonstrating successful applications in fields such as robotics (Andrychowicz et al., 2018; Nachum et al., 2019a), game playing (Mnih et al., 2013), and conversational systems (Gao et al., 2019). These successes often rely on on-policy access to the environment; i.e., during the learning process agents may collect new experience from the environment using policies they choose, and these interactions are effectively unlimited. By contrast, in many real-world applications of RL, interaction with the environment is costly, if not impossible, hence experience collection during learning is limited, necessitating the use of off-policy RL methods, i.e., algorithms which are able to learn from logged experience collected by potentially multiple and possibly unknown behavior policies.

The off-policy nature of many practical applications presents a significant challenge for RL algorithms. The traditional max-return objective is in the form of an on-policy expectation, and thus policy gradient methods (Sutton et al., 2000; Konda and Tsitsiklis, 2000) require samples from the on-policy distribution to estimate the gradient of this objective. The most straightforward way to reconcile policy gradient with off-policy settings is via importance weighting (Precup et al., 2000). However, this approach is prone to high variance and instability without appropriate damping (Munos et al., 2016; Wang et al., 2016; Gruslys et al., 2017; Schulman et al., 2017). The more common approach to the off-policy problem is to simply ignore it, which is exactly what many existing off-policy policy gradient methods propose (Degris et al., 2012; Silver et al., 2014). These algorithms simply compute the gradients of the max-return objective with respect to samples from the off-policy data, ignoring distribution shift in the samples. The justification for this approach is that the maximum-return policy is optimal regardless of the sampling distribution of states. However, such a justification is unsound in function approximation settings, where models have limited expressiveness, with potentially disastrous consequences for optimization and convergence (e.g., Lu et al., 2018).

Value-based methods provide an alternative that may be more promising for the off-policy setting. In these methods, a value function is learned either as a critic to a learned policy (as in actor-critic) or as the maximum-return value function itself (as in Q-learning). This approach is based on dynamic programming in tabular settings, which is inherently off-policy and independent of any underlying data distribution. Nevertheless, when using function approximation, the objective is traditionally expressed as an expectation over single-step Bellman errors, which re-raises the question, “What should the expectation be?” Some theoretical work suggests that the ideal expectation is in fact the on-policy expectation (Sutton et al., 2000; Silver et al., 2014; Nachum et al., 2018). In practice, this problem is usually ignored, with the same justification as that made for off-policy policy gradient methods. It is telling that actor-critic or Q-learning algorithms advertised as off-policy still require large amounts of online interaction with the environment (Haarnoja et al., 2018; Hessel et al., 2018).

In this work, we present an ALgorithm for policy Gradient from Arbitrary Experience via DICE (AlgaeDICE) as an alternative to policy gradient and value-based methods.¹ We start with the dual formulation of the max-return objective, which is expressed in terms of normalized state-action occupancies rather than a policy or value function. Traditionally, this objective is considered unattractive, since access to the occupancies either requires an on-policy expectation (similar to policy gradient methods) or learning a function approximator to satisfy single-step constraints (similar to value-based methods). We demonstrate how these problems can be remedied by adding a controllable regularizer and applying a carefully chosen change of variables, obtaining a joint objective over a policy and an auxiliary dual function (that can be interpreted as a critic). Crucially, this objective relies only on access to samples from an arbitrary off-policy data distribution, collected by potentially multiple and possibly unknown behavior policies (under some mild conditions). Unlike traditional actor-critic methods, which use a separate objective for actor and critic, this formulation trains the policy (actor) and dual function (critic) to optimize the same objective. Further illuminating the connection to policy gradient methods, we show that if the dual function is optimized, the gradient of the proposed objective with respect to the policy parameters is exactly the on-policy policy gradient. In this way, our approach naturally avoids issues of distribution mismatch without any explicit use of importance weights. We then provide an alternative derivation of the same results, based on a primal-dual form of the return-maximizing RL problem; notably, this perspective extends the previous results to both undiscounted settings and unregularized max-return objectives. Finally, we provide empirical evidence that AlgaeDICE can perform well on benchmark RL tasks.

¹ DICE is an abbreviation for distribution correction estimation and is taken from the DualDICE work (Nachum et al., 2019b) on off-policy policy evaluation. Although our current work focuses on policy optimization as opposed to evaluation and only implicitly estimates the distribution corrections, our derivations are nevertheless partly inspired by this previous work.

2 Background

We consider the RL problem presented as a Markov Decision Process (MDP) (Puterman, 1994), which is specified by a tuple $\mathcal{M} = \langle S, A, R, T, \mu_0 \rangle$ consisting of a state space, an action space, a reward function, a transition probability function, and an initial state distribution. A policy $\pi(\cdot|s)$ interacts with the environment by starting at an initial state $s_0 \sim \mu_0$, and iteratively producing a sequence of distributions $\pi(\cdot|s_t)$ over $A$, at steps $t = 0, 1, 2, \dots$, from which actions $a_t$ are sampled and successively applied to the environment. At each step, the environment produces a scalar reward $R(s_t, a_t)$ and a next state $s_{t+1} \sim T(s_t, a_t)$. In RL, one wishes to learn a return-maximizing policy:

$\max_\pi\ J(\pi) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[Q^{\pi}(s_0,a_0)\right],$   (1)

where $Q^{\pi}$ describes the future rewards accumulated by $\pi$ from any state-action pair $(s,a)$,

$Q^{\pi}(s,a) := \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}R(s_t,a_t)\;\middle|\;s_0=s,\ a_0=a,\ a_t\sim\pi(s_t),\ s_{t+1}\sim T(s_t,a_t)\right],$   (2)

and $\gamma \in [0,1)$ is a discount factor. This objective may be equivalently written in its dual form (Puterman, 1994; Wang et al., 2008) in terms of the policy’s normalized state-action visitation distribution $d^{\pi}$ as

$\max_\pi\ J(\pi) = \mathbb{E}_{(s,a)\sim d^{\pi}}\left[R(s,a)\right],$   (3)

where

$d^{\pi}(s,a) := (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\Pr\left(s_t=s,\ a_t=a\;\middle|\;s_0\sim\mu_0,\ a_t\sim\pi(s_t),\ s_{t+1}\sim T(s_t,a_t)\right).$   (4)

As we will discuss in Section 4 and Appendix A, these objectives are the primal and dual of the same linear programming (LP) problem.

In function approximation settings, optimizing $J(\pi_\theta)$ requires access to gradients. The policy gradient theorem (Sutton et al., 2000) provides the gradient of $J$ with respect to the policy parameters $\theta$ as

$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{(s,a)\sim d^{\pi_\theta}}\left[Q^{\pi_\theta}(s,a)\,\nabla_\theta\log\pi_\theta(a|s)\right].$   (5)

To properly estimate this gradient, one requires access to on-policy samples from $d^{\pi_\theta}$ and access to estimates of the $Q$-value function $Q^{\pi_\theta}$. The first requirement means that every gradient estimate of $J(\pi_\theta)$ necessitates interaction with the environment, which limits the applicability of this method in settings where interaction with the environment is expensive or infeasible. The second requirement means that one must maintain estimates of the $Q$-function to learn $\pi$. This leads to the family of actor-critic algorithms that alternate between updates to $\pi$ (the actor) and updates to a $Q$-approximator (the critic). The critic is learned by encouraging it to satisfy single-step Bellman consistencies,

$Q^{\pi}(s,a) = \mathcal{B}^{\pi}Q^{\pi}(s,a),$   (6)

where $\mathcal{B}^{\pi}$ is the expected Bellman operator with respect to $\pi$, $\mathcal{B}^{\pi}Q(s,a) := R(s,a) + \gamma\,\mathbb{E}_{s'\sim T(s,a),\,a'\sim\pi(s')}\left[Q(s',a')\right]$. Thus, the critic is learned according to some variation on

$\min_{Q}\ \mathbb{E}_{(s,a)\sim\rho}\left[\left(Q(s,a) - \mathcal{B}^{\pi}Q(s,a)\right)^{2}\right],$   (7)

for some distribution $\rho$. Although the use of an arbitrary $\rho$ suggests the critic may be learned off-policy, to achieve satisfactory performance, actor-critic algorithms generally rely on augmenting a replay buffer with new on-policy experience. Theoretical work has suggested that if one desires compatible function approximation, then an appropriate $\rho$ is, in fact, the on-policy distribution $d^{\pi}$ (Sutton et al., 2000; Silver et al., 2014; Nachum et al., 2018).
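For concreteness, the following is a minimal sketch of the critic regression in (7) on a batch drawn from an arbitrary distribution $\rho$; it is our own illustration under assumed array names, not code from the paper.

```python
# A minimal sketch (not the paper's code) of the critic regression in Eq. (7):
# fit Q to single-step Bellman targets under whatever distribution rho the
# batch happens to come from. All array names here are illustrative assumptions.
import numpy as np

def critic_loss(q_sa, r, q_next_pi, gamma=0.99):
    """Mean squared single-step Bellman error.

    q_sa:       Q(s, a) for a batch sampled from rho             -- shape [B]
    r:          rewards R(s, a)                                   -- shape [B]
    q_next_pi:  E_{a'~pi(s')}[Q(s', a')] (or a sampled estimate)  -- shape [B]
    """
    bellman_target = r + gamma * q_next_pi      # one-sample form of B^pi Q(s, a)
    return np.mean((q_sa - bellman_target) ** 2)

# Example with random placeholder values:
rng = np.random.default_rng(0)
B = 32
loss = critic_loss(rng.normal(size=B), rng.normal(size=B), rng.normal(size=B))
```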

In this work, we focus on the off-policy setting directly. Specifically, we are given a dataset $\mathcal{D} = \{(s^{(i)}, a^{(i)}, r^{(i)}, s'^{(i)})\}_{i=1}^{N}$, where $r^{(i)} = R(s^{(i)}, a^{(i)})$; $s'^{(i)} \sim T(s^{(i)}, a^{(i)})$; and each $(s^{(i)}, a^{(i)})$ has been sampled according to an unknown process. We let $d^{\mathcal{D}}$ denote the corresponding unknown state-action distribution, and additionally assume access to a sample of initial states $\{s_0^{(j)}\}_{j=1}^{M}$, such that $s_0^{(j)} \sim \mu_0$.

3 AlgaeDICE via Density Regularization

We begin by presenting an informal derivation of our method, motivated as a regularization of the dual max-return objective in (3). In Section 4 we will present our results more formally as a consequence of the Lagrangian of a linear programming formulation of the max-return objective.

3.1 A Regularized Off-Policy Max-Return Objective

The max-return objective (3) is written exclusively in terms of the on-policy distribution $d^{\pi}$. To introduce an off-policy distribution into the objective, we incorporate a regularizer:

$\max_\pi\ \mathbb{E}_{(s,a)\sim d^{\pi}}\left[R(s,a)\right] - \alpha\,D_f\!\left(d^{\pi}\,\|\,d^{\mathcal{D}}\right),$   (8)

with $\alpha > 0$ and $D_f$ denoting the $f$-divergence induced by a convex function $f$:

$D_f\!\left(d^{\pi}\,\|\,d^{\mathcal{D}}\right) := \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[f\!\left(w_{\pi/\mathcal{D}}(s,a)\right)\right],$   (9)

where we have used the shorthand $w_{\pi/\mathcal{D}}(s,a) := d^{\pi}(s,a)/d^{\mathcal{D}}(s,a)$. This form of regularization encourages conservative behavior, compelling the state-action occupancies of $\pi$ to remain close to the off-policy distribution $d^{\mathcal{D}}$, which can improve generalization. We emphasize that the introduction of this regularizer is to enable the subsequent derivations and not to impose a strong constraint on the optimal policy. Indeed, by appropriately choosing $\alpha$ and $f$, the strength of the regularization can be controlled. Later, we will show that many of our results also hold for exploratory regularization ($\alpha < 0$) and even for no regularization at all ($\alpha = 0$).
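For concreteness, two standard instantiations of (9) are given below; these are our own illustrations, and the derivations that follow keep $f$ generic.

```latex
% Two illustrative instances of the f-divergence regularizer in Eq. (9).
% f(x) = x log x recovers the KL divergence between occupancies:
D_f\!\left(d^{\pi} \,\|\, d^{\mathcal{D}}\right)
  = \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\!\left[
      w_{\pi/\mathcal{D}}(s,a)\,\log w_{\pi/\mathcal{D}}(s,a)\right]
  = \mathrm{KL}\!\left(d^{\pi} \,\|\, d^{\mathcal{D}}\right).
% f(x) = x^2/2 gives a chi-square-like penalty with conjugate f^*(y) = y^2/2,
% the quadratic choice revisited in Section 3.3:
D_f\!\left(d^{\pi} \,\|\, d^{\mathcal{D}}\right)
  = \tfrac{1}{2}\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\!\left[
      w_{\pi/\mathcal{D}}(s,a)^{2}\right].
```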

At first glance, the regularization in (8) seems to complicate things. Not only do we still require on-policy samples from $d^{\pi}$, but we also have to compute the density ratio $w_{\pi/\mathcal{D}}$, which in general can be difficult. To make this objective more approachable, we transform the $f$-divergence to its variational form (Nguyen et al., 2010) by use of a dual function $x: S\times A\to\mathbb{R}$:

$\max_\pi\ \min_{x:S\times A\to\mathbb{R}}\ \mathbb{E}_{(s,a)\sim d^{\pi}}\left[R(s,a) - \alpha\,x(s,a)\right] + \alpha\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[f^{*}\!\left(x(s,a)\right)\right],$   (10)

where $f^{*}(y) := \max_{t\in\mathbb{R}}\{t\,y - f(t)\}$ is the convex (or Fenchel) conjugate of $f$. With the objective in (10), we are finally ready to eliminate the expectation over on-policy samples from $d^{\pi}$. To do so, we make a change of variables, inspired by DualDICE (Nachum et al., 2019b). Define $\nu: S\times A\to\mathbb{R}$ as the fixed point of a variant of the Bellman equation,

$\nu(s,a) = R(s,a) - \alpha\,x(s,a) + \gamma\,\mathbb{E}_{s'\sim T(s,a),\,a'\sim\pi(s')}\left[\nu(s',a')\right].$   (11)

Equivalently, $x(s,a) = \left(\mathcal{B}^{\pi}\nu(s,a) - \nu(s,a)\right)/\alpha$. Note that $\nu$ always exists and is bounded when $R$ and $x$ are bounded (Puterman, 1994). Applying this change of variables to (10) (after some telescoping, see Nachum et al. (2019b)) yields

$J(\pi,\nu) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right] + \alpha\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[f^{*}\!\left(\frac{\mathcal{B}^{\pi}\nu(s,a) - \nu(s,a)}{\alpha}\right)\right].$   (12)
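The telescoping step can be made explicit; the following is our own restatement of the argument, using the flow equation implied by the definition of $d^{\pi}$ in (4).

```latex
% Telescoping argument behind the change of variables (our restatement).
% By (11), \nu is the Q-function of \pi for the augmented reward R - \alpha x, so
\mathbb{E}_{(s,a)\sim d^{\pi}}\!\left[R(s,a) - \alpha x(s,a)\right]
  = \mathbb{E}_{(s,a)\sim d^{\pi}}\!\left[\nu(s,a)
      - \gamma\,\mathbb{E}_{s'\sim T(s,a),\,a'\sim\pi(s')}[\nu(s',a')]\right].
% The occupancy d^\pi satisfies the flow equation implied by its definition (4),
d^{\pi}(s,a) = (1-\gamma)\,\mu_0(s)\,\pi(a|s)
  + \gamma \sum_{\tilde s,\tilde a} d^{\pi}(\tilde s,\tilde a)\,
      T(s\,|\,\tilde s,\tilde a)\,\pi(a|s),
% so all intermediate terms cancel, leaving only the initial-state contribution:
\mathbb{E}_{(s,a)\sim d^{\pi}}\!\left[R(s,a) - \alpha x(s,a)\right]
  = (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\!\left[\nu(s_0,a_0)\right].
```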

The resulting objective is now completely off-policy, relying only on access to samples from the initial state distribution $\mu_0$ and the off-policy dataset $\mathcal{D}$. Thus, we have our first theorem, providing an off-policy formulation of the max-return objective.

Theorem 3.1 (Primal AlgaeDICE). Under mild conditions on $f$, the regularized max-return objective (8) may be expressed as a max-min optimization:

$\max_\pi\ \left\{\mathbb{E}_{(s,a)\sim d^{\pi}}\left[R(s,a)\right] - \alpha\,D_f\!\left(d^{\pi}\,\|\,d^{\mathcal{D}}\right)\right\} = \max_\pi\ \min_{\nu:S\times A\to\mathbb{R}}\ J(\pi,\nu).$   (13)
Remark (extension to $\alpha < 0$):

It is clear that the same derivations above may apply to an exploratory regularizer of the same form with $\alpha < 0$, which leads to the following optimization problem:

$\max_\pi\ \max_{\nu:S\times A\to\mathbb{R}}\ J(\pi,\nu).$   (14)
Remark (Fenchel AlgaeDICE)

The appearance of $\mathcal{B}^{\pi}\nu$ inside $f^{*}$ in the second term of (12) presents a challenge in practice, since $\mathcal{B}^{\pi}\nu$ involves an expectation over the transition function $T$, whereas one typically only has access to a single empirical sample from $T(s,a)$ for a given state-action pair. This challenge, known as double sampling in the RL literature (Baird, 1995), can prevent the algorithm from finding the desired value function, even with infinite data. There are several alternatives to handle this issue (e.g., Antos et al., 2008; Farahmand et al., 2016; Feng et al., 2019). Here, we apply the dual embedding technique (Dai et al., 2016, 2018b). Specifically, the dual representation of $f^{*}$,

$f^{*}(y) = \max_{w}\ \left\{y\,w - f(w)\right\},$

can be substituted into (12), to result in a max-min-max problem:

$\max_\pi\ \min_{\nu}\ \max_{w}\ (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right] + \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[w(s,a)\left(\mathcal{B}^{\pi}\nu(s,a) - \nu(s,a)\right) - \alpha\,f(w(s,a))\right].$   (15)

As we will see in Section 4, under mild conditions, strong duality holds in the inner min-max of (15), hence one can switch the min and max to reduce to a more convenient max-max-min form.
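To see why the dual embedding addresses double sampling, note that the Bellman residual now enters linearly in $w$ rather than inside the convex $f^{*}$. The following is our own restatement of this standard argument (the $1/\alpha$ scaling is omitted for brevity).

```latex
% Inside f^*, a one-sample Bellman backup gives a biased estimate (Jensen):
\mathbb{E}_{s'\sim T(s,a),\,a'\sim\pi(s')}\!\left[
    f^{*}\!\big(R(s,a) + \gamma\,\nu(s',a') - \nu(s,a)\big)\right]
  \;\ne\; f^{*}\!\big(\mathcal{B}^{\pi}\nu(s,a) - \nu(s,a)\big)
  \quad\text{in general.}
% After dual embedding, the residual enters linearly in w, so a single sample
% of s' already yields an unbiased estimate of the inner expectation:
\mathbb{E}_{s'\sim T(s,a),\,a'\sim\pi(s')}\!\left[
    w(s,a)\big(R(s,a) + \gamma\,\nu(s',a') - \nu(s,a)\big)\right]
  = w(s,a)\big(\mathcal{B}^{\pi}\nu(s,a) - \nu(s,a)\big).
```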

3.2 Consistent Policy Gradient using Off-Policy Data

The equivalence between the objective in (12) and the on-policy max-return objective can be highlighted by considering the gradient of this objective with respect to the policy parameters $\theta$. First, consider the optimal $x^{*}$ for (10). By taking the gradient of (10) with respect to $x$ and setting it to zero, one finds that $x^{*}$ satisfies

$f^{*\prime}\!\left(x^{*}(s,a)\right) = \frac{d^{\pi}(s,a)}{d^{\mathcal{D}}(s,a)} = w_{\pi/\mathcal{D}}(s,a).$   (16)

Accordingly, for any $\pi$, the optimal $\nu^{*}$ for (12) satisfies

$f^{*\prime}\!\left(\frac{\mathcal{B}^{\pi}\nu^{*}(s,a) - \nu^{*}(s,a)}{\alpha}\right) = w_{\pi/\mathcal{D}}(s,a).$   (17)

Thus, we may express the gradient of $J(\pi_\theta, \nu^{*})$ with respect to $\theta$ as

$\nabla_\theta J(\pi_\theta, \nu^{*}) = \mathbb{E}_{(s,a)\sim d^{\pi_\theta}}\left[\nu^{*}(s,a)\,\nabla_\theta\log\pi_\theta(a|s)\right],$

where we have used Danskin’s theorem (Bertsekas, 1999) to ignore gradients of $J$ through $\nu^{*}$. Hence, if the dual function is optimized, the gradient of the off-policy objective is exactly the on-policy policy gradient, with $Q$-value function given by $\nu^{*}$.

To characterize this $Q$-value function, note that from (11), $\nu^{*}$ is a $Q$-value function with respect to the augmented reward $R(s,a) - \alpha\,x^{*}(s,a)$. Recalling the expression for $x^{*}$ in (16) and the fact that the derivatives $f'$ and $f^{*\prime}$ are inverses of each other, we have $x^{*}(s,a) = f'\!\left(w_{\pi/\mathcal{D}}(s,a)\right)$. This derivation leads to our second theorem.

Theorem 3.2. If the dual function $\nu$ is optimized, the gradient of the off-policy objective with respect to $\theta$ is the regularized on-policy policy gradient:

$\nabla_\theta J(\pi_\theta, \nu^{*}) = \mathbb{E}_{(s,a)\sim d^{\pi_\theta}}\left[\tilde{Q}^{\pi_\theta}(s,a)\,\nabla_\theta\log\pi_\theta(a|s)\right],$   (18)

where $\tilde{Q}^{\pi_\theta} = \nu^{*}$ is the $Q$-value function of $\pi_\theta$ with respect to the rewards $\tilde{R}(s,a) := R(s,a) - \alpha\,f'\!\left(w_{\pi/\mathcal{D}}(s,a)\right)$.
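As a concrete check (our own instantiation with the quadratic $f$ discussed in Section 3.3):

```latex
% With f(x) = x^2/2 we have f'(x) = x, f^*(y) = y^2/2, f^{*\prime}(y) = y,
% which are indeed inverse maps. The augmented reward in Theorem 3.2 becomes
\tilde{R}(s,a) = R(s,a) - \alpha\,w_{\pi/\mathcal{D}}(s,a)
              = R(s,a) - \alpha\,\frac{d^{\pi}(s,a)}{d^{\mathcal{D}}(s,a)},
% i.e., the on-policy gradient is taken for a reward that is penalized wherever
% \pi visits state-action pairs far more often than the data distribution does.
```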

Remark

We note that Theorem 3.2 also holds when using the more sophisticated objective in (15), since the optimal $\nu^{*}$ is still equal to $\tilde{Q}^{\pi_\theta}$, regardless of the inner dual variable $w$.

3.3 Connection to Actor-Critic

The relationship between the proposed off-policy objective and the classic policy gradient becomes more profound when we consider the form of the objective under specific choices of the convex function $f$. If we take $f(x) = \frac{1}{2}x^{2}$, then $f^{*}(y) = \frac{1}{2}y^{2}$ and the proposed objective is reminiscent of actor-critic:

$\max_\pi\ \min_\nu\ (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[\nu(s_0,a_0)\right] + \frac{1}{2\alpha}\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[\left(\mathcal{B}^{\pi}\nu(s,a) - \nu(s,a)\right)^{2}\right].$

The second term alone is an instantiation of the off-policy critic objective in actor-critic. However, in actor-critic, the use of an off-policy objective for the critic is difficult to motivate theoretically. Moreover, in practice, critic and actor learning can both suffer from the mismatch between the off-policy distribution $d^{\mathcal{D}}$ and the on-policy $d^{\pi}$. By contrast, our derivations show that the introduction of the first term transforms the off-policy actor-critic algorithm into an on-policy actor-critic, without any explicit use of importance weights. Moreover, while standard actor-critic has two separate objectives for value and policy, our proposed objective is a single, unified objective: both the policy and the value function are trained with respect to the same off-policy objective.
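To make the single-objective structure concrete, here is a minimal sketch (not the authors' code) of the unified loss for $f(x) = \frac{1}{2}x^{2}$, assuming the caller supplies $\nu$-values at sampled initial state-actions and single-sample Bellman residuals on a batch from $\mathcal{D}$.

```python
# Minimal sketch of the unified AlgaeDICE objective for f(x) = x^2 / 2.
# Inputs are assumed to be plain NumPy arrays produced elsewhere:
#   nu_init:  nu(s_0, a_0) with s_0 ~ mu_0, a_0 ~ pi(s_0)        -- shape [B0]
#   residual: B^pi nu(s, a) - nu(s, a) on a batch (s, a) ~ D      -- shape [B]
import numpy as np

def algae_objective(nu_init, residual, gamma=0.99, alpha=1.0):
    initial_term = (1.0 - gamma) * np.mean(nu_init)
    # alpha * f*(residual / alpha) with f*(y) = y^2 / 2:
    conjugate_term = np.mean(residual ** 2) / (2.0 * alpha)
    return initial_term + conjugate_term

# The critic nu is trained to minimize this scalar and the actor pi to
# maximize it, so a single function serves both updates.
```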

4 A Lagrangian View of AlgaeDICE

We now show how AlgaeDICE can be alternatively derived from the Lagrangian of a linear programming (LP) formulation of the $Q$-function. Please refer to Appendix A for details. We begin by introducing some notation and assumptions, which have appeared in the literature (e.g., Puterman, 1994; Nachum et al., 2019b; Zhang et al., 2020).

Assumption 1 (Bounded rewards). The rewards of the MDP are bounded by some finite constant $R_{\max}$: $\|R\|_{\infty} \le R_{\max}$.

For the next assumption, we introduce the transpose Bellman operator:

$\mathcal{B}_{*}^{\pi}d(s,a) := (1-\gamma)\,\mu_0(s)\,\pi(a|s) + \gamma\,\pi(a|s)\sum_{\tilde{s},\tilde{a}} T(s\,|\,\tilde{s},\tilde{a})\,d(\tilde{s},\tilde{a}).$   (19)

Assumption 2 (MDP regularity). The transposed Bellman operator $\mathcal{B}_{*}^{\pi}$ has a unique fixed-point solution.² (² When $\gamma < 1$, $\mathcal{B}_{*}^{\pi}$ has a unique fixed point regardless of the underlying MDP. For $\gamma = 1$, in the discrete case, the assumption reduces to requiring that the MDP be ergodic. The continuous case for $\gamma = 1$ is more involved; see Meyn and Tweedie (2012); Levin and Peres (2017) for a detailed discussion.)

Assumption 3 (Bounded ratio). The target density ratio is bounded by some finite constant $C$: $\|d^{\pi}/d^{\mathcal{D}}\|_{\infty} \le C$.

Assumption 4 (Characterization of $f$). The function $f$ is convex with domain containing $[0, C]$ and continuous derivative $f'$. The convex (Fenchel) conjugate of $f$ is $f^{*}$, and $f^{*}$ is closed and strictly convex. The derivative $f^{*\prime}$ is continuous and its range is a superset of $[0, C]$. For convenience, we define $W := [0, C_W]$, where $C_W \ge C$; $W$ will serve as the range for the dual variable $w$.

Our derivation begins with a formalization of the LP characterization of the $Q$-function and its dual form.

Theorem 4.1 (Q-LP). Given a policy $\pi$, the average return of $\pi$ may be expressed in the primal and dual forms as

$\min_{Q}\ (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[Q(s_0,a_0)\right] \quad\text{s.t.}\quad Q(s,a) \ge \mathcal{B}^{\pi}Q(s,a)\quad\forall (s,a)\in S\times A,$   (20)

and

$\max_{d\ge 0}\ \mathbb{E}_{(s,a)\sim d}\left[R(s,a)\right] \quad\text{s.t.}\quad d(s,a) = \mathcal{B}_{*}^{\pi}d(s,a)\quad\forall (s,a)\in S\times A,$   (21)

respectively. Under Assumptions 1 and 2, strong duality holds, i.e., the optimal values of (20) and (21) both equal $J(\pi)$, for optimal solutions $Q^{*}$ and $d^{*}$. The optimal primal $Q^{*}$ satisfies $Q^{*}(s,a) = Q^{\pi}(s,a)$ for all $(s,a)$ reachable by $\pi$, and the optimal dual is $d^{*} = d^{\pi}$.

Consider the Lagrangian of (20), which would typically be expressed with a sum (or integral) of constraints weighted by dual variables $d(s,a) \ge 0$. We can reparametrize the dual variable as $d(s,a) = w(s,a)\,d^{\mathcal{D}}(s,a)$ to express the Lagrangian as

$L(w,Q) := (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[Q(s_0,a_0)\right] + \mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[w(s,a)\left(\mathcal{B}^{\pi}Q(s,a) - Q(s,a)\right)\right].$   (22)

The optimal $w^{*}$ of this Lagrangian is $w_{\pi/\mathcal{D}} = d^{\pi}/d^{\mathcal{D}}$, and this optimal solution is not affected by expanding the allowable range of $w$ from non-negative functions to all of $\mathbb{R}^{S\times A}$. However, in practice, the linear structure in (22) can induce numerical instability. Therefore, inspired by the augmented Lagrangian method, we introduce regularization. By adding a special regularizer using the convex function $f$, we obtain

$\min_{Q}\ \max_{w}\ L_{\alpha}(w,Q) := L(w,Q) - \alpha\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[f(w(s,a))\right].$   (23)

We characterize the optimizers $w^{*}$ and $Q^{*}$ and the optimum value of this objective in the following theorem. Interestingly, although the regularization can affect the optimal primal solution $Q^{*}$, the optimal dual solution $w^{*}$ is unchanged.

Theorem 4.2. Under Assumptions 1–4, the solution to (23) is given by

$w^{*} = w_{\pi/\mathcal{D}} = \frac{d^{\pi}}{d^{\mathcal{D}}}, \qquad Q^{*} = \tilde{Q}^{\pi},\ \text{the } Q\text{-function of }\pi\text{ with respect to the reward } \tilde{R} = R - \alpha\,f'\!\left(w_{\pi/\mathcal{D}}\right).$

The optimal value is $\mathbb{E}_{(s,a)\sim d^{\pi}}\left[R(s,a)\right] - \alpha\,D_f\!\left(d^{\pi}\,\|\,d^{\mathcal{D}}\right)$. Thus, we have recovered the Fenchel AlgaeDICE objective for $\alpha > 0$, given in Equation (15). Furthermore, one may reverse the Legendre transform, $\max_{w}\{w\,\delta - \alpha f(w)\} = \alpha f^{*}(\delta/\alpha)$, to recover the Primal AlgaeDICE objective in Equation (13).

The derivation of this same result from the LP perspective allows us to exploit strong duality. Specifically, under the assumption that $Q$ and $w$ are bounded, the optimum of (23) does not change if we optimize $w$ over a bounded space $W$, as long as $W \supseteq [0, C]$. In this case, strong duality holds (Ekeland and Temam, 1999, Proposition 2.1), and we obtain

$\min_{Q}\ \max_{w\in W}\ L_{\alpha}(w,Q) = \max_{w\in W}\ \min_{Q}\ L_{\alpha}(w,Q).$

This implies that, for computational efficiency, we can optimize the policy via

$\max_{\pi}\ \max_{w\in W}\ \min_{Q}\ L_{\alpha}(w,Q).$   (24)
Remark (extensions to $\alpha = 0$ or $\gamma = 1$):

Although AlgaeDICE is originally derived for $\alpha > 0$ and $\gamma < 1$ in Section 3, the Lagrangian view of the LP formulation of the $Q$-function can be used to generalize the algorithm to $\alpha = 0$ and $\gamma = 1$. In particular, for $\alpha = 0$, one can directly use the original Lagrangian (22) for the LP. For the case $\gamma = 1$, the problem reduces to the Lagrangian of the LP for an undiscounted $Q$-function; details are delegated to Appendix A.

Remark (off-policy evaluation):

The LP form of the $Q$-values leading to the Lagrangian (22) can be directly used for behavior-agnostic off-policy evaluation (OPE). In fact, existing estimators for OPE in the behavior-agnostic setting, which typically reduce the OPE problem to estimation of density-ratio quantities (e.g., DualDICE (Nachum et al., 2019b) and GenDICE (Zhang et al., 2020)), can be recast as special cases by introducing different regularizations to the Lagrangian. As we have shown, the solution to the Lagrangian provides both (regularized) $Q$-values and the desired state-action corrections as primal and dual variables simultaneously.

5 Related Work

Algorithmically, our proposed method follows a Lagrangian primal-dual view of the LP characterization of the $Q$-function, which leads to a saddle-point problem. Several recent works (e.g., Chen and Wang, 2016; Wang, 2017; Dai et al., 2018a, b; Chen et al., 2018; Lee and He, 2018) also considered saddle-point formulations for policy improvement, derived from fundamentally different perspectives. In particular, Dai et al. (2018a) exploit a saddle-point formulation for the multi-step (path) conditions on the consistency between the optimal value function and policy. Other works (Chen and Wang, 2016; Wang, 2017; Dai et al., 2018b; Chen et al., 2018) consider the (augmented) Lagrangian of the LP characterization of Bellman optimality for the optimal $V$-function, which is slightly different from the LP characterization of the $Q$-function that we consider. Although slight, the difference between the $V$- and $Q$-LPs is crucial to enable behavior-agnostic policy optimization in AlgaeDICE. If one were to follow derivations similar to AlgaeDICE but for the $V$-function LP, some form of explicit importance weighting (and thus knowledge of the behavior policy) would be required, as in recent work on off-policy estimation (Tang et al., 2019; Uehara and Jiang, 2019). We further note that the application of a regularizer on the dual variable $w$ to yield Primal AlgaeDICE is key to transforming the Lagrangian optimization over values and state-action occupancies (typical in these previous works) to an optimization over values and policies, which is more common in practice and can help generalization (e.g., Swaminathan and Joachims, 2015).

The regularization we employ is inspired by previous uses of regularization in RL. Adding regularization to MDPs (Neu et al., 2017; Geist et al., 2019) has been investigated for many different purposes in the literature, including exploration (de Farias and Van Roy, 2000; Haarnoja et al., 2017, 2018), smoothing (Dai et al., 2018b), avoiding premature convergence (Nachum et al., 2017a), ensuring tractability (Todorov, 2006), and mitigating observation noise (Rubin et al., 2012; Fox et al., 2016). We note that the regularization employed by AlgaeDICE, as a divergence over state-action densities, is markedly different from these previous works, which mostly regularize only the action distributions of the policy conditioned on state. An approach more similar to ours is given by Belousov and Peters (2017), which regularizes the max-return objective using an $f$-divergence over state-action densities. Their derivations are similar in spirit to ours, using the method of Lagrange multipliers, but their result is distinct in a number of key characteristics. First, their objective (analogous to ours in (12)) includes not only policy and values but also a number of additional functions, complicating any practical implementation. Second, their results are restricted to conservative regularization ($\alpha > 0$), whereas our findings extend to both exploratory regularization ($\alpha < 0$) and unregularized objectives ($\alpha = 0$). Third, the algorithm proposed by Belousov and Peters (2017) follows a bi-level optimization, in which the policy is learned using a separate and distinct objective. In contrast, our proposed AlgaeDICE uses a single, unified objective for both policy and value learning.

Lastly, there are a number of works which (like ours) perform policy gradient on off-policy data via distribution correction. The key differentiator is in how the distribution corrections are computed. One common method is to re-weight off-policy samples by considering eligibility traces (Precup et al., 2000; Geist and Scherrer, 2014), i.e., compute weights by taking the product of per-action importance weights over a trajectory. Thus, these methods can suffer from high variance as the length of trajectory increases, known as the “curse of horizon” (Liu et al., 2018). A more recent work (Liu et al., 2019) attempts to weight updates by estimated state-action distribution corrections. This is more in line with our proposed AlgaeDICE, which implicitly estimates these quantities. One key difference is that this previous work explicitly estimates these corrections, which results in a bi-level optimization, as opposed to our more appealing unified objective. It is also important to note that both eligibility trace methods and the technique outlined in Liu et al. (2019) require knowledge of the behavior policy. In contrast, AlgaeDICE is a behavior-agnostic off-policy policy gradient method, which may be more relevant in practice. Compared to existing behavior-agnostic off-policy estimators (Nachum et al., 2019b; Zhang et al., 2020), this work considers the substantially more challenging problem of policy optimization.

6 Experiments

We present empirical evaluations of AlgaeDICE, first in a tabular setting using the Four Rooms domain (Sutton et al., 1999) and then on a suite of continuous control benchmarks using MuJoCo (Todorov et al., 2012) and OpenAI Gym (Brockman et al., 2016).

Figure 1: We provide a pictorial representation of learned policies and dual variables during training of AlgaeDICE on the Four Rooms domain (Sutton et al., 1999). The agent is initialized at the state denoted by an orange square and receives zero reward everywhere except at the target state, denoted by a green square. We use a fixed offline experience data distribution that is near-uniform. The progression of the learned $\pi$ and $\nu$ during training is shown from left to right. The policy is presented via arrows for each action at each square, with the opacity of each arrow determined by the probability $\pi(a|s)$. The dual variables are presented via their Bellman residuals: the opacity of each square is determined by the sum of the Bellman residuals at that state, $\sum_{a}\left(\mathcal{B}^{\pi}\nu(s,a) - \nu(s,a)\right)$. Recall that for any $\pi$, the Bellman residuals of the optimal $\nu^{*}$ should satisfy $\mathcal{B}^{\pi}\nu^{*}(s,a) - \nu^{*}(s,a) = \alpha\,f'\!\left(d^{\pi}(s,a)/d^{\mathcal{D}}(s,a)\right)$. As expected, we see that in the beginning of training the residuals are high near the initial state, while towards the end of training the residuals show the preferred trajectories of the near-optimal policy.

6.1 Four Rooms

We begin by considering the tabular setting given by the Four Rooms environment (Sutton et al., 1999), in which an agent must navigate to a target location within a gridworld. In this tabular setting, we evaluate Primal AlgaeDICE (Equations (12) and (13)) with $f(x) = \frac{1}{2}x^{2}$. This way, for any $\pi$, the dual value function $\nu$ may be solved exactly using standard matrix operations. Thus, we train by iteratively solving for $\nu$ via matrix operations and then taking a gradient step for $\pi$. We collect an off-policy dataset by running a uniformly random policy for 500 trajectories, where each trajectory is initialized at a random state and is of length 10. This dataset is kept fixed, placing us in the completely offline regime. We use fixed settings of $\gamma$ and $\alpha$ throughout.
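For the quadratic choice $f(x) = \frac{1}{2}x^{2}$, the inner minimization over $\nu$ is a weighted least-squares problem, so the exact solve reduces to a linear system. The following is our own minimal tabular sketch; the inputs P_pi, d_D, b, and the flattened state-action indexing are illustrative assumptions, not the authors' code.

```python
# Exact tabular solve of min_nu J(pi, nu) for f(x) = x^2 / 2.
# J(pi, nu) = (1 - gamma) * b^T nu + 1/(2 alpha) * || D^(1/2) (A nu + r) ||^2,
# where A = gamma * P_pi - I, b(s,a) = mu0(s) * pi(a|s), D = diag(d_D).
import numpy as np

def solve_nu(P_pi, r, b, d_D, gamma=0.99, alpha=1.0):
    """P_pi: [SA, SA] matrix with P_pi[(s,a), (s',a')] = T(s'|s,a) * pi(a'|s').
    r: rewards [SA]; b: initial state-action distribution [SA]; d_D: data
    distribution [SA], assumed to have full support. Returns nu of length SA."""
    n = len(r)
    A = gamma * P_pi - np.eye(n)             # so Bellman residual = A @ nu + r
    D = np.diag(d_D)
    # Setting the gradient (1-gamma) b + (1/alpha) A^T D (A nu + r) to zero:
    lhs = A.T @ D @ A
    rhs = -alpha * (1.0 - gamma) * b - A.T @ D @ r
    return np.linalg.solve(lhs, rhs)
```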

Figure 2: Average per-step reward of policies on Four Rooms learned by AlgaeDICE compared to actor-critic (AC) over training iterations.

Graphical depictions of learned policies and dual value functions are presented in Figure 1, where each plot shows $\pi$ and $\nu$ during the first, fourth, seventh, and tenth iterations of training. The opacity of each square is determined by the Bellman residuals of $\nu$ at that state. Recall that the Bellman residuals of the optimal $\nu^{*}$ are the density ratios $d^{\pi}(s,a)/d^{\mathcal{D}}(s,a)$. We see that this is reflected in the learned $\nu$. At the beginning of training, the residuals are high around the initial state. As training progresses, there is a clear path (or paths) of high-residual states going from the initial to the target state. Thus we see that $\nu$ learns to properly correct for distribution shifts in the off-policy experience distributions. The algorithm successfully learns to optimize a policy using these corrected gradients, as shown by the arrows denoting preferred actions of the learned policy.

We further provide quantitative results in Figure 2. We plot the average per-step reward of AlgaeDICE compared to actor-critic in both online and offline settings. As a point of comparison, the behavior policy used to collect data for the offline setting achieves a much lower average reward. Although all the variants are able to significantly improve upon this baseline, we see that AlgaeDICE performance is only negligibly affected by the type of dataset, while the performance of actor-critic degrades in the offline regime. See the appendix for experimental details.

6.2 Continuous Control

We now present results of AlgaeDICE on a set of continuous control benchmarks using MuJoCo (Todorov et al., 2012) and OpenAI Gym (Brockman et al., 2016). We evaluate the performance of Primal AlgaeDICE with a fixed choice of the convex function $f$. Our empirical objective is thus given by

where $\hat{\delta}_\nu(s,a,r,s')$ is a single-sample estimate of the Bellman residual:

$\hat{\delta}_\nu(s,a,r,s') := r + \gamma\,\nu(s',a') - \nu(s,a), \qquad a'\sim\pi(s').$   (25)

We note that using a single-sample estimate for the Bellman residual in general leads to biased gradients, although previous works have found this to not have a significant practical effect in these domains (Kostrikov et al., 2019). We make the following additional practical modifications:

  • As entropy regularization has been shown to be important on these tasks (Nachum et al., 2017b; Haarnoja et al., 2018), we augment the rewards with a causal entropy term; i.e., we replace $r$ in (25) with $r - \lambda\log\pi(a'|s')$, where the entropy weight $\lambda$ is learned adaptively as in Haarnoja et al. (2018).

  • As residual learning is known to be hard in function approximation settings (Baird, 1995), we replace $\nu(s',a')$ in (25) with a mixture $(1-\eta)\,\nu(s',a') + \eta\,\bar{\nu}(s',a')$, where $\bar{\nu}$ is a target value calculated as in Haarnoja et al. (2018). We use a fixed mixing weight $\eta$.

  • If $\nu$ is fully optimized, the scaled residual $\hat{\delta}_\nu/\alpha$ will correspond to the density ratio $d^{\pi}(s,a)/d^{\mathcal{D}}(s,a)$, and thus always be non-negative. However, during optimization, this may not always hold, which can affect policy learning. Thus, when calculating gradients of this objective with respect to $\pi$, we clip this value from below at 0.

For training, we parameterize $\pi$ and $\nu$ using neural networks and perform alternating stochastic gradient descent on their parameters; a simplified sketch of such an update is shown below.
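The following is our own simplified sketch of one training step for the setup described above, with $f(x) = \frac{1}{2}x^{2}$; the network architectures, the entropy weight, the target-mixture coefficient, and all hyperparameter values are illustrative assumptions, not the authors' implementation.

```python
# Simplified single training step for the continuous-control setup described
# above, with f(x) = x^2 / 2. Architectures and hyperparameters are assumed.
import torch
import torch.nn as nn

class NuNet(nn.Module):                      # critic-like dual function nu(s, a)
    def __init__(self, s_dim, a_dim, h=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, h), nn.ReLU(),
                                 nn.Linear(h, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

class GaussianPolicy(nn.Module):             # minimal reparameterized policy
    def __init__(self, s_dim, a_dim, h=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(s_dim, h), nn.ReLU())
        self.mean = nn.Linear(h, a_dim)
        self.log_std = nn.Parameter(torch.zeros(a_dim))
    def forward(self, s):
        mu = self.mean(self.body(s))
        dist = torch.distributions.Normal(mu, self.log_std.exp())
        a = dist.rsample()
        return a, dist.log_prob(a).sum(-1)

def algae_losses(nu, nu_target, policy, batch, gamma=0.99, alpha=1.0,
                 ent_weight=0.1, mix=0.05):
    s, a, r, s_next, s0 = batch              # tensors sampled from D and mu_0
    a0, _ = policy(s0)                       # a_0 ~ pi(s_0)
    a_next, logp_next = policy(s_next)       # a' ~ pi(s'), with log-probability

    # Entropy-augmented reward and target-mixed next value (practical tweaks).
    r_aug = r - ent_weight * logp_next
    nu_next = (1 - mix) * nu(s_next, a_next) + mix * nu_target(s_next, a_next)
    residual = r_aug + gamma * nu_next - nu(s, a)        # single-sample delta

    init_term = (1 - gamma) * nu(s0, a0).mean()
    critic_loss = init_term + (residual ** 2).mean() / (2 * alpha)

    # The policy ascends the same objective; the implied ratio is clipped at
    # zero and detached so it acts as an advantage-like weight.
    ratio = (residual / alpha).clamp(min=0.0).detach()
    actor_loss = -(init_term + (ratio * residual).mean())
    return critic_loss, actor_loss

# In practice, critic_loss is applied to nu's optimizer and actor_loss to the
# policy's optimizer, alternating steps; nu_target is a slowly updated copy.
```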

We present our results in Figure 3. We see that AlgaeDICE can perform well in these settings, achieving performance that is roughly competitive with the state-of-the-art SAC and TD3 algorithms. There are potentially further improvements to these practical results from choosing $f$ (or $\alpha$) appropriately. In the appendix, we conduct a preliminary investigation into polynomial choices of $f$, showing that certain polynomials can at times provide better performance than the default. A more detailed and systematic study of this and other design choices for implementing AlgaeDICE is an interesting avenue for future work.

Figure 3 (panels: HalfCheetah, Hopper, Walker2d, Ant, Humanoid): We show the results of AlgaeDICE compared to SAC (Haarnoja et al., 2018), TD3 (Fujimoto et al., 2018), and DDPG (Lillicrap et al., 2015). We follow the evaluation protocol of Fujimoto et al. (2018), plotting the performance of 10 randomly seeded training runs, with the shaded region representing half a standard deviation and the x-axis given by environment steps. There are potentially better results achievable by using a choice of $f$ other than the default for AlgaeDICE; see the appendix for a preliminary investigation.

7 Conclusion

We have introduced an ALgorithm for policy Gradient from Arbitrary Experience via DICE, or AlgaeDICE, for behavior-agnostic, off-policy policy improvement in reinforcement learning. Based on a linear programming characterization of the $Q$-function, we derived the new approach from a Lagrangian saddle-point formulation. The resulting algorithm, AlgaeDICE, automatically compensates for the distribution shift in collected off-policy data, and achieves an estimate of the on-policy policy gradient using this off-policy data.

Acknowledgments

We thank Marc Bellemare, Nicolas Le Roux, George Tucker, Rishabh Agarwal, Dibya Ghosh, and the rest of the Google Brain team for insightful thoughts and discussions.

References

Appendix A Proof Details

We follow the notation in the main text, abusing notation slightly where convenient.

Theorem 4.1 (Q-LP, restated).

Given a policy $\pi$, the average return of $\pi$ may be expressed in the primal and dual forms as

$\min_{Q}\ (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[Q(s_0,a_0)\right] \quad\text{s.t.}\quad Q(s,a) \ge \mathcal{B}^{\pi}Q(s,a)\quad\forall (s,a)\in S\times A,$

and

$\max_{d\ge 0}\ \mathbb{E}_{(s,a)\sim d}\left[R(s,a)\right] \quad\text{s.t.}\quad d(s,a) = \mathcal{B}_{*}^{\pi}d(s,a)\quad\forall (s,a)\in S\times A,$

respectively. Under Assumptions 1 and 2, strong duality holds, i.e., the optimal values of the two programs both equal $J(\pi)$, for optimal solutions $Q^{*}$ and $d^{*}$. The optimal primal $Q^{*}$ satisfies $Q^{*}(s,a) = Q^{\pi}(s,a)$ for all $(s,a)$ reachable by $\pi$, and the optimal dual is $d^{*} = d^{\pi}$.

Proof. Recall that $\mathcal{B}^{\pi}$ is monotonic; that is, given two bounded functions $Q_1$ and $Q_2$, $Q_1 \ge Q_2$ implies $\mathcal{B}^{\pi}Q_1 \ge \mathcal{B}^{\pi}Q_2$. Therefore, for any feasible $Q$ (i.e., any $Q$ with $Q \ge \mathcal{B}^{\pi}Q$), we have $Q \ge Q^{\pi}$, proving the first claim.

The duality of the linear program (20) can be obtained by standard Lagrangian arguments, and it is exactly (21). Notice that the equality constraints of (21) correspond to a system of linear equations of dimension $|S|\,|A|$: $(I - \gamma P_{*}^{\pi})\,d = (1-\gamma)\,\mu_0\pi$, where $P_{*}^{\pi}$ denotes the transpose (adjoint) of the state-action transition operator of $\pi$, $\mu_0\pi$ denotes the vector with entries $\mu_0(s)\,\pi(a|s)$, and $I$ is the identity matrix. Since the matrix $I - \gamma P_{*}^{\pi}$ is nonsingular for $\gamma < 1$, the system has a unique solution given by $d^{*} = (1-\gamma)\,(I - \gamma P_{*}^{\pi})^{-1}\mu_0\pi$. Finally, when $\gamma < 1$, we can rewrite this solution as $d^{*}(s,a) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\Pr\left(s_t=s,\ a_t=a \mid s_0\sim\mu_0,\ \pi\right)$, so $d^{*} = d^{\pi}$, as desired.

Theorem 4.2 (restated).

Under Assumptions 1–4, the solution to (23) is given by $w^{*} = w_{\pi/\mathcal{D}}$ and $Q^{*} = \tilde{Q}^{\pi}$, the $Q$-function of $\pi$ with respect to the reward $\tilde{R} = R - \alpha\,f'\!\left(w_{\pi/\mathcal{D}}\right)$. The optimal value is $\mathbb{E}_{(s,a)\sim d^{\pi}}\left[R(s,a)\right] - \alpha\,D_f\!\left(d^{\pi}\,\|\,d^{\mathcal{D}}\right)$.

Proof. By Fenchel duality, we have

$\max_{w}\ \left\{w\,\delta - \alpha f(w)\right\} = \alpha\,f^{*}\!\left(\frac{\delta}{\alpha}\right) \quad\text{for any } \delta\in\mathbb{R}.$

Plugging this into (23), we have

$\min_{Q}\ \max_{w}\ L_{\alpha}(w,Q) = \min_{Q}\ (1-\gamma)\,\mathbb{E}_{s_0\sim\mu_0,\,a_0\sim\pi(s_0)}\left[Q(s_0,a_0)\right] + \alpha\,\mathbb{E}_{(s,a)\sim d^{\mathcal{D}}}\left[f^{*}\!\left(\frac{\mathcal{B}^{\pi}Q(s,a) - Q(s,a)}{\alpha}\right)\right].$   (26)

To investigate the optimality, we apply the change-of-variable,