Reinforcement learning (RL) aims to learn behavior policies to optimize a long-term decision-making process in an environment. RL is thus relevant to a variety of real-world applications, such as robotics, patient care, and recommendation systems. When tackling problems associated with the RL setting, two main difficulties arise: first, the decision-making problem is inherently sequential, with early decisions made by the policy affecting outcomes both near and far in the future; second, the learner’s knowledge of the environment is only through sampled experience, i.e., previously sampled trajectories of interactions, while the underlying mechanism governing the dynamics of these trajectories is unknown.
The environment and its underlying dynamics are typically abstracted as a Markov decision process (MDP). This abstraction gives rise to the Bellman recurrence, which characterizes the optimal value function and behavior policy through a dynamic programming (DP) view of the RL problem (Bellman, 1966). Most of the effective existing RL algorithms are rooted in this dynamic programming paradigm, attempting to find approximate fixed-point solutions to the Bellman recurrence. This leads to the family of temporal-difference (TD) algorithms, including SARSA (Sutton, 1996), Q-learning (Watkins, 1989), and their deep learning variants (Mnih et al., 2015; Hasselt et al., 2016; Wang et al., 2015). While the TD-based algorithms are powerful, their training can oscillate or even diverge in settings where function approximation is used or the ability to sample additional interactions with the environment is limited (Sutton and Barto, 1998).
An alternative paradigm for RL is based on linear programming (LP). A number of RL problems, such as policy optimization and policy evaluation, can be expressed as an LP, i.e., an optimization problem involving a linear objective and linear constraints. The LP may then be transformed to a form more amenable to stochastic and large-scale optimization via the tools of LP duality. Although the LP perspective has existed for decades (e.g., Manne (1960); Denardo (1970)), it has recently received renewed interest for its potential ability to circumvent the optimization challenges of DP-based approaches in exchange for more mature and well-studied techniques associated with convex optimization (De Farias and Van Roy, 2003; Wang et al., 2007; Chen and Wang, 2016; Wang, 2017; Bas-Serrano and Neu, 2019).
In this article, we generalize the LP approach and describe a number of convex problems relevant to RL – i.e., formulations of RL problems as a convex objective and linear constraints. With a convex objective, one must appeal to the more general Fenchel-Rockafellar duality. Perhaps the most useful property of this generalization is that, when the original primal problem involves a strictly convex objective (unlike the LP setting), application of Fenchel-Rockafellar duality leads to a dual problem which is unconstrained. We show that the Fenchel-Rockafellar duality and its variants are at the heart of a number of recent RL algorithms, although many of these were originally presented through less generalizable derivations (e.g., the DICE-family of offline RL algorithms: Nachum et al. (2019a, b); Kostrikov et al. (2019); Zhang et al. (2020)). By providing a unified perspective on these results and the tools and tricks which lead to them, we hope to enable future researchers to better use the techniques of convex duality to make further progress in RL.
Aiming to provide a useful reference for any interested researcher, we begin by reviewing basic knowledge of convex duality (Section 2) and RL (Section 3). We then focus on the discounted policy evaluation problem in RL (Section 4), originally expressed in an LP form known as the Q-LP. We show how the tools of LP and convex duality may be used to derive a variety of useful re-formulations of policy evaluation. We then continue to show how these same techniques can be applied to the policy optimization problem, starting from either the Q-LP (Section 5) or the potentially more streamlined V-LP (Section 6). We then generalize these algorithms to undiscounted settings in Section 7. We conclude in Section 8 with a brief summary and promising future directions.
2 Convex Duality
The concept of duality is a basic and powerful tool in optimization and machine learning, especially in the field of convex analysis, allowing a researcher to easily reformulate optimization problems in alternative ways that are potentially more tractable. In this section, we provide a brief overview of a few key convex duality results which will play an important role in the RL algorithms derived in later sections.
A full and detailed introduction to convex analysis is beyond the scope of this report, and so most of our statements will be presented informally (for example, we use $\min$ and $\max$ as opposed to $\inf$ and $\sup$), although we will strive to properly qualify theoretical claims when appropriate. The curious reader may refer to a number of resources for a more complete and mathematically precise treatment of this subject; e.g., Rockafellar (1970); Boyd and Vandenberghe (2004); Borwein and Lewis (2010); Bauschke and Lucet (2012).
2.1 Fenchel Conjugate
The Fenchel conjugate $f_*$ of a function $f : \Omega \to \mathbb{R}$ is defined as
$$f_*(y) := \max_{x \in \Omega} \; \langle x, y \rangle - f(x), \tag{1}$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product defined on $\Omega$. This function is also referred to as the convex conjugate or Legendre–Fenchel transformation of $f$.
We say a function $f$ is proper when $\{x : f(x) < \infty\}$ is non-empty and $f(x) > -\infty$ for all $x$. We say a function $f$ is lower semi-continuous when $\{x : f(x) > \alpha\}$ is an open set for all $\alpha \in \mathbb{R}$. For a proper, convex, lower semi-continuous $f$, its conjugate function $f_*$ is also proper, convex, and lower semi-continuous. Moreover, one has the duality $(f_*)_* = f$; i.e.,
$$f(x) = \max_{y \in \Omega_*} \; \langle x, y \rangle - f_*(y),$$
where $\Omega_*$ denotes the domain of $f_*$. From now on we will assume any use of $f_*$ is applied to a convex function $f$. Furthermore, we will assume any declared convex function is also proper and lower semi-continuous. Table 1 provides a few common functions and their corresponding Fenchel conjugates.
[Table 1: a few common functions and their corresponding Fenchel conjugates.]
2.1.1 The Indicator Function
One especially useful function is the indicator function $\delta_C$ of a set $C$, which is defined as,
$$\delta_C(x) := \begin{cases} 0 & x \in C, \\ \infty & \text{otherwise}. \end{cases}$$
If $C$ is a closed, convex set, it is easy to check that $\delta_C$ is proper, convex, and lower semi-continuous. The indicator function can be used as a way of expressing constraints. For example, the constrained optimization problem $\min_{x \in C} f(x)$ may be alternatively expressed as the unconstrained problem $\min_x f(x) + \delta_C(x)$. It may be readily shown that the conjugate of the point indicator $\delta_{\{x_0\}}$ is the linear function $\langle x_0, \cdot \rangle$, and vice-versa.
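As a quick numerical illustration (ours, not part of the original presentation), the conjugate definition in (1) can be approximated by a brute-force grid search over $x$; the grid bounds and test points below are arbitrary choices:

```python
import numpy as np

# Grid over which we approximate f*(y) = max_x <x, y> - f(x).
xs = np.linspace(-10, 10, 100001)

def conjugate(f, y):
    return np.max(xs * y - f(xs))

# f(x) = x^2/2 is self-conjugate: f*(y) = y^2/2.
f = lambda x: 0.5 * x**2
for y in [-2.0, 0.3, 1.5]:
    assert abs(conjugate(f, y) - 0.5 * y**2) < 1e-3

# Indicator of C = [0, 1]: its conjugate is max_{x in [0,1]} x*y = max(0, y),
# a piecewise-linear (hence convex, lower semi-continuous) function.
delta = lambda x: np.where((x >= 0) & (x <= 1), 0.0, np.inf)
for y in [-2.0, 0.3, 1.5]:
    assert abs(conjugate(delta, y) - max(0.0, y)) < 1e-3
```

The same grid trick can be applied twice to check the biconjugate identity $(f_*)_* = f$ for any proper, convex, lower semi-continuous $f$.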
2.1.2 $f$-Divergences

The family of $f$-divergences, also known as Csiszár-Morimoto or Ali-Silvey divergences (Ali and Silvey, 1966), has been widely applied in many machine learning applications, including variational inference (Wainwright and Jordan, 2003), generative model estimation (Nowozin et al., 2016; Dai et al., 2019; Ghasemipour et al., 2019; Kostrikov et al., 2019), and reinforcement learning (Nachum et al., 2019a; Zhang et al., 2020; Nachum et al., 2019b).
For a convex function $f$ and a distribution $p$ over some domain $\Omega$, the $f$-divergence of $x$ from $p$ is defined as,
$$D_f(x \,\|\, p) := \mathbb{E}_{z \sim p}\left[f\!\left(\frac{x(z)}{p(z)}\right)\right].$$
Typically, $f$-divergences are used to measure the discrepancy between two distributions (i.e., $x \in \Delta(\Omega)$, the simplex over $\Omega$, and $D_f(x \,\|\, p)$ measures the divergence between $x$ and $p$), although the domain of $D_f(\cdot \,\|\, p)$ may be extended to the set of real-valued functions $x : \Omega \to \mathbb{R}$.
The choice of domain of $D_f(\cdot \,\|\, p)$ is important when considering its Fenchel conjugate. If the domain is the set of unrestricted real-valued functions, the conjugate of $D_f(\cdot \,\|\, p)$ at $y : \Omega \to \mathbb{R}$ is, under mild conditions (conditions of the interchangeability principle (Dai et al., 2016) must be satisfied, and $p$ must have sufficient support over $\Omega$),
$$D_{f,*}(y) = \mathbb{E}_{z \sim p}\left[f_*(y(z))\right]. \tag{5}$$
On the other hand, if one considers the domain of $D_f(\cdot \,\|\, p)$ to be $\Delta(\Omega)$, then one must solve a constrained version of (5), which can be difficult depending on the form of $f$.
Of the family of $f$-divergences, the KL-divergence is arguably the most commonly used one, and it is given by,
$$D_{\mathrm{KL}}(x \,\|\, p) := \mathbb{E}_{z \sim x}\left[\log \frac{x(z)}{p(z)}\right],$$
corresponding to $f(t) = t \log t$. When the domain of $D_{\mathrm{KL}}(\cdot \,\|\, p)$ is restricted to $\Delta(\Omega)$, its Fenchel conjugate at $y$ is the log-average-exp function $\log \mathbb{E}_{z \sim p}[\exp y(z)]$. It is no coincidence that the log-average-exp function (and the closely related max function) is arguably as ubiquitous in RL as the KL-divergence.
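As a small sanity check (ours, under the standard closed-form for this maximization), one can verify numerically that the maximizer $x^*(z) \propto p(z) \exp y(z)$ attains the log-average-exp value, and that other points on the simplex do no better:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(5); p /= p.sum()       # reference distribution p
y = rng.normal(size=5)                # an arbitrary real-valued function y

# Conjugate of KL(.||p) over the simplex: max_x <x, y> - KL(x||p).
# The maximizer is x*(z) proportional to p(z) exp(y(z)).
x_star = p * np.exp(y); x_star /= x_star.sum()
kl = np.sum(x_star * np.log(x_star / p))
value = x_star @ y - kl
assert abs(value - np.log(p @ np.exp(y))) < 1e-8   # log-average-exp

# A random feasible x cannot do better (concavity of the objective in x).
x = rng.random(5); x /= x.sum()
assert x @ y - np.sum(x * np.log(x / p)) <= value + 1e-8
```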
2.2 Fenchel-Rockafellar Duality
Fenchel conjugates are indispensable when tackling a variety of optimization problems. In this section, we present one of the most general and useful tools associated with Fenchel conjugates, known as the Fenchel-Rockafellar duality (Rockafellar, 1970; Borwein and Lewis, 2010).
Consider a primal problem given by
$$\min_{x} \; f(x) + g(Ax), \tag{10}$$
where $f, g$ are convex, lower semi-continuous functions and $A$ is a linear operator (e.g., a matrix). The dual of this problem is given by
$$\max_{y} \; -f_*(-A^* y) - g_*(y), \tag{11}$$
where we use $A^*$ to denote the adjoint linear operator of $A$; i.e., $A^*$ is the linear operator for which $\langle y, Ax \rangle = \langle A^* y, x \rangle$, for all $x, y$. In the common case of $A$ simply being a real-valued matrix, $A^*$ is the transpose of $A$.
Under mild conditions (see Theorem 3.3.5 in Borwein and Lewis (2010); informally, the primal problem needs to be feasible, i.e., $f(x) + g(Ax) < \infty$ for some $x$), the dual problem (11) may be derived from the primal (10) via
$$\min_x f(x) + g(Ax) = \min_x \max_y \; f(x) + \langle y, Ax \rangle - g_*(y) = \max_y \min_x \; f(x) + \langle A^* y, x \rangle - g_*(y).$$
Thus, we have the duality,
$$\min_x \; f(x) + g(Ax) = \max_y \; -f_*(-A^* y) - g_*(y).$$
Furthermore, one may show that a solution to the dual can be used to find a solution to the primal. Specifically, if $\nabla f_*(-A^* y^*)$ is well-defined at a dual solution $y^*$, then $x^* = \nabla f_*(-A^* y^*)$ is a solution to the primal. More generally, one can recover the subdifferential $\partial f_*(-A^* y^*)$ as the set of all primal solutions.
Of course, in the presence of Fenchel-Rockafellar duality, the label of primal and dual is arbitrary. One can consider (11) the primal problem and (10) its dual, and in our derivations we will use these labels interchangeably.
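To make the duality concrete, here is a small numerical sketch of ours (the problem instance is an arbitrary choice): the least-norm problem $\min_x \frac{1}{2}\|x\|^2$ s.t. $Ax = b$ fits the template (10) with $f(x) = \frac{1}{2}\|x\|^2$ and $g = \delta_{\{b\}}$, and its unconstrained dual (11) has a closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 6))   # linear operator; A* is just A.T here
b = rng.normal(size=3)

# Primal: min_x f(x) + g(Ax), f(x) = ||x||^2/2, g = indicator of {b}.
# Dual (11): max_y -f*(-A.T y) - g*(y) = max_y -||A.T y||^2/2 - <b, y>,
# an unconstrained concave quadratic with closed-form maximizer:
y_star = -np.linalg.solve(A @ A.T, b)
dual_value = -0.5 * np.sum((A.T @ y_star) ** 2) - b @ y_star

# Primal recovery: x* = grad f*(-A.T y*) = -A.T y*, since f* = ||.||^2/2.
x_star = -A.T @ y_star
assert np.allclose(A @ x_star, b)                        # primal feasibility
assert abs(0.5 * np.sum(x_star**2) - dual_value) < 1e-8  # strong duality
```

Note that the recovered $x^*$ is exactly the classic least-norm solution $A^\top (A A^\top)^{-1} b$.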
2.2.1 The Lagrangian
The Fenchel-Rockafellar duality is general enough that it can be used to derive the Lagrangian duality. Consider the constrained optimization problem
$$\min_x \; f(x) \;\; \text{s.t.} \;\; Ax = b. \tag{14}$$
If we consider this problem expressed as $\min_x f(x) + g(Ax)$ for $g = \delta_{\{b\}}$, its Fenchel-Rockafellar dual is given by
$$\max_y \; -f_*(-A^* y) - \langle b, y \rangle. \tag{15}$$
By considering $f_*$ in terms of its Fenchel conjugate (equation (1)), we may write the problem as
$$\max_y \min_x \; f(x) + \langle A^* y, x \rangle - \langle b, y \rangle.$$
Using the fact that $\langle A^* y, x \rangle = \langle y, Ax \rangle$ for any $x, y$, we may express this as
$$\max_y \min_x \; f(x) + \langle y, Ax - b \rangle.$$
The expression $L(x, y) := f(x) + \langle y, Ax - b \rangle$ is known as the Lagrangian of the original problem in (14). One may further derive the well-known Lagrange duality (see Veinott (2005) (https://web.stanford.edu/class/msande361/handouts/nlpdual.pdf) for a brief derivation of this fact and Ekeland and Temam (1999)[Proposition 2.1] for more general cases):
$$\max_y \min_x \; L(x, y) = \min_x \max_y \; L(x, y).$$
Moreover, the optimal value of the Lagrangian is the optimal value of the original problem (14), and the optimal solutions (equilibrium points) $x^*, y^*$ are the solutions to the original primal (14) and its dual (15).
2.2.2 LP Duality
The Fenchel-Rockafellar duality also generalizes the well-known linear programming (LP) duality. Specifically, if one considers the functions $f(x) = \langle c, x \rangle + \delta_{\mathbb{R}_{\geq 0}}(x)$ and $g = \delta_{\{b\}}$, then the primal and dual problems in (10) and (11) correspond to,
$$\min_{x \geq 0} \; \langle c, x \rangle \;\; \text{s.t.} \;\; Ax = b, \tag{19}$$
and
$$\max_y \; -\langle b, y \rangle \;\; \text{s.t.} \;\; -A^* y \leq c, \tag{20}$$
respectively. By making the switch $y \to -y$, the dual (20) may be equivalently expressed in the more familiar form,
$$\max_y \; \langle b, y \rangle \;\; \text{s.t.} \;\; A^* y \leq c. \tag{21}$$
3 Reinforcement Learning
In this work, we will show how the Fenchel-Rockafellar duality (and the LP and Lagrangian dualities) can be applied to solve a number of reinforcement learning (RL) problems. Before we present these algorithms, we use this section as a brief introduction to RL.
3.1 Markov Decision Process
In RL, one wishes to learn a behavior policy to interact with an environment in an optimal way, where the typical meaning of 'optimal' is with respect to future discounted rewards (feedback) provided by the environment. The RL environment is commonly abstracted as a Markov decision process (MDP) (Puterman, 1994; Sutton and Barto, 1998), which is specified by a tuple $\mathcal{M} = \langle S, A, R, T, \mu_0, \gamma \rangle$, consisting of a state space, an action space, a reward function, a transition probability function, an initial state distribution, and a discount factor, respectively. The policy is a function $\pi : S \to \Delta(A)$. The policy interacts with the environment iteratively, starting with an initial state $s_0 \sim \mu_0$. At step $t$, the policy produces a distribution $\pi(\cdot|s_t)$ over the actions $A$, from which an action $a_t$ is sampled and applied to the environment. The environment produces a scalar reward $r_t = R(s_t, a_t)$ (for simplicity we consider a deterministic reward function; stochastic rewards are more typical, although the same derivations are usually applicable in either case) and subsequently transitions to a new state $s_{t+1} \sim T(\cdot|s_t, a_t)$.
In summary, the RL setting is concerned with a policy which sequentially makes decisions, and the effects of those decisions are observed through a per-step reward feedback and a stochastic, Markovian state transition process. For simplicity, we will consider infinite-horizon (non-terminating) environments, which may be extended to finite-horizon environments by considering an extra terminal state which continually loops onto itself with zero reward.
3.2 Policy Evaluation and Optimization
The first question one may ask when presented with an MDP $\mathcal{M}$ and a policy $\pi$ is, what is the long-term value of $\pi$ when interacting with $\mathcal{M}$? The next question might be, what is the optimal policy maximizing this long-term value? These two questions constitute the policy evaluation and optimization problems, respectively.
To formalize these questions, we consider a discount factor $\gamma \in [0, 1)$ (see Section 7 for consideration of $\gamma = 1$). The value of $\pi$ is defined as the expected per-step reward obtained by following the policy, averaging over time via $\gamma$-discounting:
$$\rho(\pi) := (1 - \gamma) \cdot \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \;\middle|\; s_0 \sim \mu_0, \; a_t \sim \pi(\cdot|s_t), \; s_{t+1} \sim T(\cdot|s_t, a_t)\right].$$
The policy evaluation problem is to estimate this quantity for a given $\pi$, and the policy optimization problem is to find a policy which maximizes it, i.e., solve $\max_\pi \rho(\pi)$. If the reward function is independent of the policy $\pi$, there exists an optimal policy $\pi^*$ that is deterministic (Puterman, 1994). If one adds a policy-dependent regularization to the objective, e.g., by considering entropy-regularized rewards $\tilde{R}(s, a) = R(s, a) - \log \pi(a|s)$, the optimal policy could be stochastic.
3.3 Online vs. Offline RL
One of the main limitations when approaching either the policy evaluation or policy optimization problems is that one does not have explicit knowledge of the environment; i.e., one does not have explicit knowledge of the functions $R$ and $T$ or the distribution $\mu_0$. Rather, access to the environment is given in the form of experience gathered via interactions with the environment. The specific nature of this experience depends on the context of one's considered problem. The most common forms of experience may be generally categorized into online and offline.
In the online setting, experience from the environment may be collected at any point via Monte-Carlo rollouts. With this type of access to the environment, the policy evaluation and optimization problems may be easily solved. For example, the value of the policy may be estimated by simply averaging the discounted reward of a large number of Monte-Carlo rollouts. For this reason, online RL research typically strives to find sample-efficient algorithms, which find approximate solutions to policy evaluation or optimization with as few interactions with the environment as possible.
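The Monte-Carlo approach can be illustrated on a small synthetic tabular MDP; the following sketch of ours (sizes, seed, and rollout counts are arbitrary choices) compares the rollout average against the exact value obtained by solving the underlying linear system, anticipating the operators introduced in Section 3.4:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
T = rng.random((nS, nA, nS)); T /= T.sum(-1, keepdims=True)  # T(s'|s,a)
R = rng.random((nS, nA))                                     # rewards in [0, 1]
mu0 = np.full(nS, 1.0 / nS)                                  # initial distribution
pi = rng.random((nS, nA)); pi /= pi.sum(-1, keepdims=True)   # fixed policy

# Exact value via the state-action visitation distribution d^pi.
P = np.einsum('sat,tb->satb', T, pi).reshape(nS * nA, nS * nA)
d = (1 - gamma) * np.linalg.solve(
    np.eye(nS * nA) - gamma * P.T, (mu0[:, None] * pi).reshape(-1))
rho_exact = d @ R.reshape(-1)

# Online Monte-Carlo estimate: average (1-gamma)-normalized discounted rewards.
total, n_rollouts, horizon = 0.0, 1000, 100
for _ in range(n_rollouts):
    s = rng.choice(nS, p=mu0)
    for t in range(horizon):
        a = rng.choice(nA, p=pi[s])
        total += (1 - gamma) * gamma**t * R[s, a]
        s = rng.choice(nS, p=T[s, a])
rho_mc = total / n_rollouts

assert abs(rho_mc - rho_exact) < 0.05
```

The truncation bias at horizon 100 is on the order of $\gamma^{100} \approx 3 \times 10^{-5}$, so the dominant error here is Monte-Carlo noise.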
In practice (e.g., consumer web recommendation systems or health applications), interaction with the environment during training is often not available at all. More commonly, access to the environment is offline. That is, interactions with the environment are limited to a static dataset of (logged) experience $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$, where $(s_i, a_i) \sim d^{\mathcal{D}}$ for some unknown distribution $d^{\mathcal{D}}$, $r_i = R(s_i, a_i)$, and $s'_i \sim T(\cdot|s_i, a_i)$. One also typically assumes access to samples from the initial state distribution $\mu_0$. In this report we will mostly focus on the offline setting, although we will relax it to assume that our offline experience is effectively unlimited ($N \to \infty$), and so will write our expectations in terms of $d^{\mathcal{D}}$, $T$, and $\mu_0$. Performing the appropriate finite-sample analysis for finite $N$ is outside the scope of this report. Even with effectively unlimited experience, the offline setting presents a challenge to RL algorithms, due to the mismatch between the experience distribution given by the offline dataset and the online distribution typically needed for policy evaluation or optimization.
Although $d^{\mathcal{D}}$ is a distribution over state-action pairs, we will at times abuse notation and write $s \sim d^{\mathcal{D}}$, and this is intended to mean sampling $s$ from the state marginal of $d^{\mathcal{D}}$; i.e., as if we are sampling from the (infinite) dataset $\mathcal{D}$. Moreover, although one does not have explicit access to $T$, we will oftentimes write expressions such as $\mathbb{E}_{(s,a) \sim d^{\mathcal{D}}, \, s' \sim T(\cdot|s,a)}[\cdot]$, and this is intended to mean an expectation over tuples $(s, a, s')$ sampled from the dataset $\mathcal{D}$.
We emphasize a subtle difference between the offline setting and what is commonly referred to in the literature as off-policy learning. Off-policy algorithms are designed to enable an RL agent to learn from historical samples collected by other policies. However, these algorithms are typically allowed to interact with the environment during training to collect new samples. On the other hand, in the offline setting, one’s access to the environment is exclusively via a fixed dataset of experience. In other words, while an offline RL algorithm is necessarily off-policy, an off-policy algorithm is not necessarily offline.
3.4 Q-values and State-Action Visitations
When evaluating or optimizing policies, both online and offline, the notions of Q-values and state-action visitations are useful. For a policy $\pi$, the Q-values denote the future discounted sum of rewards of following $\pi$ starting at a state-action pair $(s, a)$:
$$Q^\pi(s, a) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \;\middle|\; s_0 = s, \; a_0 = a, \; a_t \sim \pi(\cdot|s_t), \; s_{t+1} \sim T(\cdot|s_t, a_t)\right].$$
The Q-values satisfy the single-step Bellman recurrence
$$Q^\pi(s, a) = R(s, a) + \gamma \cdot \mathcal{P}^\pi Q^\pi(s, a), \tag{23}$$
where $\mathcal{P}^\pi$ is the policy transition operator,
$$\mathcal{P}^\pi Q(s, a) := \mathbb{E}_{s' \sim T(\cdot|s,a), \; a' \sim \pi(\cdot|s')}\left[Q(s', a')\right].$$
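In the tabular case, $\mathcal{P}^\pi$ is a matrix and the Bellman recurrence (23) is a linear system that can be solved exactly; a small numpy sketch of ours (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
T = rng.random((nS, nA, nS)); T /= T.sum(-1, keepdims=True)  # T(s'|s,a)
R = rng.random((nS, nA))
pi = rng.random((nS, nA)); pi /= pi.sum(-1, keepdims=True)

# Policy transition operator as an (nS*nA) x (nS*nA) matrix:
# P[(s,a),(s',a')] = T(s'|s,a) * pi(a'|s').
P = np.einsum('sat,tb->satb', T, pi).reshape(nS * nA, nS * nA)

# Q^pi is the unique fixed point of Q = R + gamma * P Q.
Q = np.linalg.solve(np.eye(nS * nA) - gamma * P, R.reshape(-1))
assert np.allclose(Q, R.reshape(-1) + gamma * P @ Q)
```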
The state-action visitations of $\pi$ (also known as occupancies or density) may be defined similarly as,
$$d^\pi(s, a) := (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr\left(s_t = s, a_t = a \;\middle|\; s_0 \sim \mu_0, \; a_t \sim \pi(\cdot|s_t), \; s_{t+1} \sim T(\cdot|s_t, a_t)\right).$$
That is, the visitation $d^\pi(s, a)$ measures how likely $\pi$ is to encounter $(s, a)$ when interacting with the environment, averaging these encounters over time via $\gamma$-discounting. The visitations constitute a normalized distribution over state-action pairs, and this distribution is referred to as the on-policy distribution.
Like the Q-values, the visitations satisfy the single-step transpose Bellman recurrence:
$$d^\pi(s, a) = (1 - \gamma) \mu_0(s) \pi(a|s) + \gamma \cdot \mathcal{P}^\pi_* d^\pi(s, a),$$
where $\mathcal{P}^\pi_*$ is the transpose (or adjoint) policy transition operator,
$$\mathcal{P}^\pi_* d(s, a) := \pi(a|s) \sum_{\tilde{s}, \tilde{a}} T(s \,|\, \tilde{s}, \tilde{a}) \, d(\tilde{s}, \tilde{a}).$$
These recursions simply reflect the conservation of flow (probability mass) of a stationary distribution on a Markov process. Note that both $\mathcal{P}^\pi$ and $\mathcal{P}^\pi_*$ are linear operators and that the transpose policy transition operator is indeed the mathematical transpose (or adjoint) of $\mathcal{P}^\pi$ in the sense that $\langle d, \mathcal{P}^\pi Q \rangle = \langle \mathcal{P}^\pi_* d, Q \rangle$ for any $d, Q$.
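Both the transpose recurrence and the adjoint property are easy to verify in the tabular case, where $\mathcal{P}^\pi_*$ is literally the matrix transpose of $\mathcal{P}^\pi$; a sketch of ours on an arbitrary random MDP:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
T = rng.random((nS, nA, nS)); T /= T.sum(-1, keepdims=True)
mu0 = np.full(nS, 1.0 / nS)
pi = rng.random((nS, nA)); pi /= pi.sum(-1, keepdims=True)

# P[(s,a),(s',a')] = T(s'|s,a) pi(a'|s'); its matrix transpose plays
# the role of the adjoint operator.
P = np.einsum('sat,tb->satb', T, pi).reshape(nS * nA, nS * nA)
mu0pi = (mu0[:, None] * pi).reshape(-1)

# Solve the transpose recurrence (I - gamma P.T) d = (1 - gamma) mu0pi.
d = (1 - gamma) * np.linalg.solve(np.eye(nS * nA) - gamma * P.T, mu0pi)

assert np.all(d >= 0) and abs(d.sum() - 1.0) < 1e-8        # a distribution
assert np.allclose(d, (1 - gamma) * mu0pi + gamma * P.T @ d)

# Adjoint property: <d, P Q> = <P.T d, Q> for an arbitrary Q.
Q = rng.normal(size=nS * nA)
assert abs(d @ (P @ Q) - (P.T @ d) @ Q) < 1e-10
```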
Both the Q-values and the visitations are useful in RL. For example, the value of a policy may be expressed in two ways:
$$\rho(\pi) = (1 - \gamma) \cdot \mathbb{E}_{s \sim \mu_0, \; a \sim \pi(\cdot|s)}\left[Q^\pi(s, a)\right] = \mathbb{E}_{(s,a) \sim d^\pi}\left[R(s, a)\right]. \tag{27}$$
Also, when performing policy optimization, the policy gradient theorem (Sutton et al., 2000) utilizes the Q-values and visitations to express the gradient of $\rho(\pi)$ as
$$\nabla \rho(\pi) = \mathbb{E}_{(s,a) \sim d^\pi}\left[Q^\pi(s, a) \, \nabla \log \pi(a|s)\right]. \tag{28}$$
It is thus standard in most RL algorithms to either have access to $Q^\pi$ and $d^\pi$ or have some mechanism to estimate these quantities. Typically, the Q-values are estimated by minimizing squared Bellman residuals (or some variation on this) (Sutton et al., 2008, 2009; Liu et al., 2015; Dai et al., 2016; Du et al., 2017); i.e., minimizing the (surrogate) squared difference between the LHS and RHS of (23). For the visitations, it is more typical to assume access to the distribution $d^\pi$ (for example, by simply performing Monte-Carlo rollouts enabled by online access), although instances exist in which $d^\pi$ is approximated by importance-weighting a different (i.e., offline) distribution (Precup et al., 2001; Sutton et al., 2014).
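The policy gradient theorem in (28) can itself be checked numerically against finite differences in the tabular case; a sketch of ours using a softmax policy parameterization (the MDP instance and test coordinates are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
T = rng.random((nS, nA, nS)); T /= T.sum(-1, keepdims=True)
R = rng.random((nS, nA))
mu0 = np.full(nS, 1.0 / nS)
theta = rng.normal(size=(nS, nA))          # softmax policy parameters

def softmax(th):
    e = np.exp(th - th.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def rho(th):
    # Exact rho(pi) = <d^pi, R> via the visitation linear system.
    pi = softmax(th)
    P = np.einsum('sat,tb->satb', T, pi).reshape(nS * nA, nS * nA)
    d = (1 - gamma) * np.linalg.solve(
        np.eye(nS * nA) - gamma * P.T, (mu0[:, None] * pi).reshape(-1))
    return d @ R.reshape(-1)

# Policy gradient theorem (28): grad rho = E_{d^pi}[Q^pi grad log pi].
pi = softmax(theta)
P = np.einsum('sat,tb->satb', T, pi).reshape(nS * nA, nS * nA)
Q = np.linalg.solve(np.eye(nS * nA) - gamma * P, R.reshape(-1)).reshape(nS, nA)
d = (1 - gamma) * np.linalg.solve(
    np.eye(nS * nA) - gamma * P.T,
    (mu0[:, None] * pi).reshape(-1)).reshape(nS, nA)
# For softmax, grad_{theta[s,a]} log pi(a'|s) = 1[a=a'] - pi(a|s), hence:
grad = d * Q - pi * (d * Q).sum(-1, keepdims=True)

# Compare against central finite differences.
for s, a in [(0, 0), (2, 1), (3, 0)]:
    e = np.zeros_like(theta); e[s, a] = 1e-5
    fd = (rho(theta + e) - rho(theta - e)) / 2e-5
    assert abs(fd - grad[s, a]) < 1e-6
```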
4 Policy Evaluation
We now move on to demonstrating applications of Fenchel-Rockafellar duality to RL. We begin by approaching the policy evaluation problem. Although the policy evaluation problem may appear to be simpler or less interesting (it is not!) than the policy optimization problem, in our case the same techniques will be used in either setting. Thus, we will use this section to provide more detailed derivations of a variety of techniques which will be referenced repeatedly in the following sections.
4.1 The Linear Programming Form of $\rho(\pi)$
The equivalent formulations of $\rho(\pi)$ in (27) in terms of either $Q^\pi$ or $d^\pi$ hint at a duality which is formally given by the following LP characterization of $\rho(\pi)$, known as the Q-LP:
$$\rho(\pi) = \min_{Q} \; (1 - \gamma) \cdot \mathbb{E}_{s \sim \mu_0, \; a \sim \pi(\cdot|s)}\left[Q(s, a)\right] \;\; \text{s.t.} \;\; Q(s, a) \geq R(s, a) + \gamma \cdot \mathcal{P}^\pi Q(s, a), \;\; \forall (s, a) \in S \times A.$$
The optimal $Q^*$ of this LP satisfies $Q^*(s, a) = Q^\pi(s, a)$ for all $(s, a)$ reachable by $\pi$.
The dual of this LP provides us with the visitation perspective on policy evaluation:
$$\max_{d \geq 0} \; \mathbb{E}_{(s,a) \sim d}\left[R(s, a)\right] \tag{32}$$
$$\text{s.t.} \;\; d(s, a) = (1 - \gamma) \mu_0(s) \pi(a|s) + \gamma \cdot \mathcal{P}^\pi_* d(s, a), \;\; \forall (s, a) \in S \times A. \tag{33}$$
The optimal $d^*$ of this LP is the state-action visitation $d^\pi$ of $\pi$. It is important to note that this dual LP is over-constrained: the equality constraints in (33) uniquely determine $d^* = d^\pi$, regardless of the objective in (32). This fact will prove useful in a number of later derivations.
For detailed and complete derivations of these LP representations of and , please refer to Nachum et al. (2019b).
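In the tabular case, the primal Q-LP can be handed directly to a generic LP solver and its optimal value compared against $\rho(\pi)$ computed from $d^\pi$; a sketch of ours (using `scipy.optimize.linprog` on an arbitrary random MDP):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
T = rng.random((nS, nA, nS)); T /= T.sum(-1, keepdims=True)
R = rng.random((nS, nA)).reshape(-1)
mu0 = np.full(nS, 1.0 / nS)
pi = rng.random((nS, nA)); pi /= pi.sum(-1, keepdims=True)

P = np.einsum('sat,tb->satb', T, pi).reshape(nS * nA, nS * nA)
mu0pi = (mu0[:, None] * pi).reshape(-1)

# Primal Q-LP: min (1-g) <mu0pi, Q>  s.t.  Q >= R + g P Q,
# i.e. (g P - I) Q <= -R, with Q free.
res = linprog((1 - gamma) * mu0pi,
              A_ub=gamma * P - np.eye(nS * nA), b_ub=-R,
              bounds=[(None, None)] * (nS * nA))

# The LP optimum equals rho(pi) = <d^pi, R>, where d^pi solves (33);
# both expressions of rho(pi) from (27) agree as well.
d = (1 - gamma) * np.linalg.solve(np.eye(nS * nA) - gamma * P.T, mu0pi)
Q = np.linalg.solve(np.eye(nS * nA) - gamma * P, R)
assert res.status == 0
assert abs(res.fun - d @ R) < 1e-6
assert abs((1 - gamma) * mu0pi @ Q - d @ R) < 1e-8
```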
4.2 Policy Evaluation via the Lagrangian
The potentially large number of constraints in either the primal or dual forms of the Q-LP introduces a challenge to estimating $\rho(\pi)$. We may instead derive a more tractable unconstrained optimization perspective on the policy evaluation problem using the Lagrangian of the Q-LP:
$$\rho(\pi) = \min_{Q} \max_{d \geq 0} \; (1 - \gamma) \cdot \mathbb{E}_{s \sim \mu_0, \; a \sim \pi(\cdot|s)}\left[Q(s, a)\right] + \sum_{s, a} d(s, a) \left(R(s, a) + \gamma \cdot \mathcal{P}^\pi Q(s, a) - Q(s, a)\right). \tag{35}$$
In practical settings where $S \times A$ is possibly infinite, it is not feasible to optimize the sum in (35) over all $(s, a)$. In an offline setting, where we only have access to a distribution $d^{\mathcal{D}}$, we may make a change-of-variables via importance sampling, i.e., $d(s, a) = d^{\mathcal{D}}(s, a) \cdot \zeta(s, a)$. If $d^{\mathcal{D}}$ has sufficient support or coverage (Sutton et al., 2016), we may re-write (35) as
$$\rho(\pi) = \min_{Q} \max_{\zeta \geq 0} \; (1 - \gamma) \cdot \mathbb{E}_{s \sim \mu_0, \; a \sim \pi(\cdot|s)}\left[Q(s, a)\right] + \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[\zeta(s, a) \left(R(s, a) + \gamma \cdot \mathcal{P}^\pi Q(s, a) - Q(s, a)\right)\right]. \tag{36}$$
The optimal $\zeta^*$ of this problem satisfies $\zeta^*(s, a) = d^\pi(s, a) / d^{\mathcal{D}}(s, a)$. Thus, to estimate $\rho(\pi)$, one may optimize this objective with respect to $Q$ and $\zeta$ (requiring only access to samples from $d^{\mathcal{D}}$, $T$, and $\mu_0$) and return the resulting objective value as the final estimate. Notably, the objective value equals $\rho(\pi)$ exactly when either $Q = Q^\pi$ (for any $\zeta$) or $\zeta = \zeta^*$ (for any $Q$). Thus, this estimator is robust to errors in at most one of $Q$ and $\zeta$.
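This doubly robust property can be verified directly in the tabular case, where the Lagrangian (36) collapses to simple inner products; a sketch of ours (random MDP, arbitrary "wrong" $Q$ and $\zeta$):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
T = rng.random((nS, nA, nS)); T /= T.sum(-1, keepdims=True)
R = rng.random((nS, nA)).reshape(-1)
mu0 = np.full(nS, 1.0 / nS)
pi = rng.random((nS, nA)); pi /= pi.sum(-1, keepdims=True)

P = np.einsum('sat,tb->satb', T, pi).reshape(nS * nA, nS * nA)
mu0pi = (mu0[:, None] * pi).reshape(-1)
Q_pi = np.linalg.solve(np.eye(nS * nA) - gamma * P, R)
d_pi = (1 - gamma) * np.linalg.solve(np.eye(nS * nA) - gamma * P.T, mu0pi)
rho = d_pi @ R

dD = rng.random(nS * nA); dD /= dD.sum()   # arbitrary offline distribution

def lagrangian(Q, zeta):
    # Offline Lagrangian (36): (1-g) E_{mu0,pi}[Q] + E_{dD}[zeta (R + g P Q - Q)].
    return (1 - gamma) * mu0pi @ Q + dD @ (zeta * (R + gamma * P @ Q - Q))

Q_bad, zeta_bad = rng.normal(size=nS * nA), rng.random(nS * nA)
# Exact Q with a wrong zeta, and exact zeta with a wrong Q, both give rho(pi).
assert abs(lagrangian(Q_pi, zeta_bad) - rho) < 1e-8
assert abs(lagrangian(Q_bad, d_pi / dD) - rho) < 1e-8
```

The second assertion works because $\langle d^\pi, \gamma \mathcal{P}^\pi Q - Q \rangle = -(1-\gamma)\langle \mu_0 \cdot \pi, Q \rangle$ by the transpose recurrence, cancelling the first term for any $Q$.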
Despite the desirable properties of this estimator, the optimization problem associated with it involves environment rewards and learning Q-values with respect to these rewards. Learning Q-values using single-step transitions turns out to be difficult in practice without the use of a number of tricks developed over the years (e.g., target networks, ensembling). Moreover, the bilinear nature of the Lagrangian can lead to instability or poor convergence in optimization (Dai et al., 2017; Bas-Serrano and Neu, 2019). Rather than tackling these various issues head-on, a number of recent works propose an alternative approach, which we describe in the following subsection.
4.3 Changing the Problem Before Applying Duality
As mentioned in Section 4.1, the dual of the Q-LP is over-constrained, in the sense that the constraints in (33) uniquely determine the state-action visitation $d^\pi$. Thus, one may replace the objective in (32) with an alternative objective $h(d)$ for some function $h$ without affecting the optimal solution $d^* = d^\pi$. Therefore, the main idea of a number of recent works is to choose an appropriate $h$ so that either the Lagrangian or the Fenchel-Rockafellar dual of this problem is more approachable and potentially avoids the instabilities associated with the original LP.
Although the problem is changed, the solution is unaffected, and once a solution is found it may be used to provide an estimate of $\rho(\pi)$. Specifically, if the problem is re-written in terms of $\zeta = d / d^{\mathcal{D}}$, then once the problem is optimized, we can derive an estimate for the value of $\pi$ via the approximate solution $\zeta^* \approx d^\pi / d^{\mathcal{D}}$:
$$\rho(\pi) \approx \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[\zeta^*(s, a) \, R(s, a)\right]. \tag{38}$$
4.3.1 Constant Function
If $h$ is taken to be the trivial function $h(d) \equiv 0$, the offline form of the Lagrangian optimization becomes,
$$\min_{Q} \max_{\zeta \geq 0} \; (1 - \gamma) \cdot \mathbb{E}_{s \sim \mu_0, \; a \sim \pi(\cdot|s)}\left[Q(s, a)\right] + \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[\zeta(s, a) \left(\gamma \cdot \mathcal{P}^\pi Q(s, a) - Q(s, a)\right)\right].$$
The optimal solution of this problem is still $\zeta^*(s, a) = d^\pi(s, a) / d^{\mathcal{D}}(s, a)$, and once an approximate solution is found, it may be used to estimate $\rho(\pi)$ according to (38). Unlike the previous form of the Lagrangian in (36), this optimization does not involve learning Q-values with respect to environment rewards, and in practice this distinction leads to much better optimization behavior (Uehara and Jiang, 2019). Still, the Lagrangian is linear in both $Q$ and $\zeta$. This can be remedied by choosing a strictly convex form of $h$, for example, by using an $f$-divergence.
4.3.2 $f$-Divergence Function

The use of an $f$-divergence objective $h(d) = -D_f(d \,\|\, d^{\mathcal{D}})$ leads to the set of general off-policy estimation techniques outlined in the recent DualDICE paper (Nachum et al., 2019a). Specifically, the various estimators proposed by DualDICE correspond to applying either the Lagrange or Fenchel-Rockafellar dualities to the optimization problem,
$$\max_{d \geq 0} \; -D_f(d \,\|\, d^{\mathcal{D}}) \tag{40}$$
$$\text{s.t.} \;\; d(s, a) = (1 - \gamma) \mu_0(s) \pi(a|s) + \gamma \cdot \mathcal{P}^\pi_* d(s, a), \;\; \forall (s, a). \tag{41}$$
Application of Lagrange duality to the above problem yields
$$\min_{Q} \max_{d \geq 0} \; -D_f(d \,\|\, d^{\mathcal{D}}) + (1 - \gamma) \cdot \mathbb{E}_{s \sim \mu_0, \; a \sim \pi(\cdot|s)}\left[Q(s, a)\right] + \langle Q, \; \gamma \cdot \mathcal{P}^\pi_* d - d \rangle.$$
We transform the application of the transpose policy transition operator to an application of its transpose by using the fact $\langle Q, \mathcal{P}^\pi_* d \rangle = \langle \mathcal{P}^\pi Q, d \rangle$:
$$\min_{Q} \max_{d \geq 0} \; -D_f(d \,\|\, d^{\mathcal{D}}) + (1 - \gamma) \cdot \mathbb{E}_{s \sim \mu_0, \; a \sim \pi(\cdot|s)}\left[Q(s, a)\right] + \mathbb{E}_{(s,a) \sim d}\left[\gamma \cdot \mathcal{P}^\pi Q(s, a) - Q(s, a)\right].$$
Now we make the change-of-variables $\zeta = d / d^{\mathcal{D}}$ to yield,
$$\min_{Q} \max_{\zeta \geq 0} \; (1 - \gamma) \cdot \mathbb{E}_{s \sim \mu_0, \; a \sim \pi(\cdot|s)}\left[Q(s, a)\right] + \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[\zeta(s, a) \left(\gamma \cdot \mathcal{P}^\pi Q(s, a) - Q(s, a)\right) - f(\zeta(s, a))\right].$$
Alternatively, we may apply Fenchel-Rockafellar duality directly. The problem in (40) and (41) may be expressed as
$$-\min_{d} \; D_f(d \,\|\, d^{\mathcal{D}}) + \delta_{\{(1-\gamma)\mu_0 \cdot \pi\}}\left(\mathcal{B}_* d\right),$$
where $\mathcal{B}_* := I - \gamma \cdot \mathcal{P}^\pi_*$ corresponds to the linear constraints (41) with respect to the adjoint Bellman operator; i.e., $\mathcal{B}_* d = (1-\gamma)\mu_0 \cdot \pi$ exactly when $d$ satisfies (41). When applying Fenchel-Rockafellar duality, the linear operator $\mathcal{B}_*$ is transformed to its adjoint $\mathcal{B} = I - \gamma \cdot \mathcal{P}^\pi$ and is used as an argument to the Fenchel conjugate of $D_f(\cdot \,\|\, d^{\mathcal{D}})$. At the same time, the indicator $\delta_{\{(1-\gamma)\mu_0 \cdot \pi\}}$ is replaced by its Fenchel conjugate, the linear function $\langle (1-\gamma)\mu_0 \cdot \pi, \, \cdot \, \rangle$.
The dual problem is therefore given by
$$\min_{Q} \; (1 - \gamma) \cdot \mathbb{E}_{s \sim \mu_0, \; a \sim \pi(\cdot|s)}\left[Q(s, a)\right] + \mathbb{E}_{(s,a) \sim d^{\mathcal{D}}}\left[f_*\left(\gamma \cdot \mathcal{P}^\pi Q(s, a) - Q(s, a)\right)\right]. \tag{47}$$
We can now see that the use of an $f$-divergence with respect to $d^{\mathcal{D}}$ naturally leads to an offline problem with expectations over $d^{\mathcal{D}}$, without an explicit change-of-variables. Furthermore, unlike previous dual problems, there are no constraints in this optimization, and so standard gradient-based techniques may be applied to find a solution without appealing to the Lagrange duality, which would necessarily involve nested min-max optimizations. The Fenchel-Rockafellar duality also provides us with a way to recover $\zeta^* = d^\pi / d^{\mathcal{D}}$ from a solution $Q^*$ of (47):
$$\zeta^*(s, a) = f'_*\left(\gamma \cdot \mathcal{P}^\pi Q^*(s, a) - Q^*(s, a)\right).$$
If we set $f(x) = \frac{1}{2} x^2$ (so that $f_*(y) = \frac{1}{2} y^2$ and $f'_*(y) = y$), we may recover what is perhaps the most intriguing result in Nachum et al. (2019a):
$$\frac{d^\pi(s, a)}{d^{\mathcal{D}}(s, a)} = \gamma \cdot \mathcal{P}^\pi Q^*(s, a) - Q^*(s, a).$$
That is, if one optimizes Q-value functions to minimize squared Bellman residuals (with respect to zero reward) while minimizing initial Q-values, then the optimal Bellman residuals are exactly the density ratios between the on-policy and offline state-action distributions.
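In the tabular case, the objective (47) with $f(x) = \frac{1}{2}x^2$ is a quadratic in $Q$ and can be minimized in closed form, which lets us check this result directly; a sketch of ours (random MDP, arbitrary offline distribution with full support):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
T = rng.random((nS, nA, nS)); T /= T.sum(-1, keepdims=True)
mu0 = np.full(nS, 1.0 / nS)
pi = rng.random((nS, nA)); pi /= pi.sum(-1, keepdims=True)

P = np.einsum('sat,tb->satb', T, pi).reshape(nS * nA, nS * nA)
mu0pi = (mu0[:, None] * pi).reshape(-1)
d_pi = (1 - gamma) * np.linalg.solve(np.eye(nS * nA) - gamma * P.T, mu0pi)
dD = rng.random(nS * nA); dD /= dD.sum()   # offline distribution, full support

# Objective (47) with f(x) = x^2/2 (so f* = f):
#   min_Q (1-g) E_{mu0,pi}[Q] + (1/2) E_{dD}[(g P Q - Q)^2].
# Quadratic in Q: set the gradient to zero and solve the linear system.
B = gamma * P - np.eye(nS * nA)            # zero-reward Bellman residual operator
Q_star = np.linalg.solve(B.T @ np.diag(dD) @ B, -(1 - gamma) * mu0pi)

# The optimal residuals are exactly the density ratios d^pi / d^D.
zeta = B @ Q_star
assert np.allclose(zeta, d_pi / dD)
```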
Interestingly, the derivations in Nachum et al. (2019a) do not explicitly use Lagrangian or Fenchel-Rockafellar duality, but rather focus on a cleverly chosen change-of-variables (the so-called DualDICE trick). It is clear from our own derivations that this trick essentially comes from the relationship between a convex function and its Fenchel conjugate, and is simply another way of applying Fenchel-Rockafellar duality.
It is important to note that there is a trade-off introduced by the use of the Fenchel-Rockafellar duality as opposed to the Lagrangian. Namely, the objective (47) applies the convex function $f_*$ to an expectation over next states and actions (via $\mathcal{P}^\pi Q$), whereas in practice one typically only has access to a single empirical sample $s' \sim T(\cdot|s, a)$. Many works ignore this problem, and simply consider the single empirical sample as the full distribution, which introduces a bias into the objective.
We briefly summarize the main takeaways from this section.
The policy evaluation problem may be expressed as an LP, known as the Q-LP, whose solution is $Q^\pi$.
The dual of this LP has solution $d^\pi$, the on-policy state-action visitation distribution.
Taking the Lagrangian of the Q-LP can lead to a doubly robust estimator for $\rho(\pi)$.
Changing the objective in the dual of the Q-LP does not change its solution $d^\pi$.
Changing the objective to an appropriate alternative is a powerful tool. This can lead to a (regularized) Fenchel-Rockafellar dual that is unconstrained, and thus more amenable to stochastic and offline settings.
5 Policy Optimization
The previous section detailed a number of ways to estimate $\rho(\pi)$. In this section, we show how similar techniques may be applied to the policy optimization problem, which is concerned with finding the optimal policy $\pi^* := \arg\max_\pi \rho(\pi)$.
5.1 The Policy Gradient Theorem
Considering the Lagrangian formulation of $\rho(\pi)$ given in (35) can provide a simple derivation of the policy gradient theorem in (28). Let $L(Q, d; \pi)$ be the inner expression in (35). Danskin's theorem (Bertsekas, 1999) tells us that
$$\nabla_\pi \rho(\pi) = \nabla_\pi L(Q^*, d^*; \pi),$$
where $Q^*, d^*$ are the solutions to $\min_Q \max_{d \geq 0} L(Q, d; \pi)$. Recall that $Q^*(s, a) = Q^\pi(s, a)$ for all $(s, a)$ for which $d^\pi(s, a) > 0$ and that $d^* = d^\pi$. Thus,
$$\nabla_\pi \rho(\pi) = \nabla_\pi L(Q^\pi, d^\pi; \pi).$$
We may compute the gradient of $L(Q^\pi, d^\pi; \pi)$ w.r.t. $\pi$ term-by-term. For the first term in (35) we have
$$\nabla_\pi \, (1 - \gamma) \cdot \mathbb{E}_{s \sim \mu_0, \; a \sim \pi(\cdot|s)}\left[Q^\pi(s, a)\right] = (1 - \gamma) \cdot \mathbb{E}_{s \sim \mu_0, \; a \sim \pi(\cdot|s)}\left[Q^\pi(s, a) \, \nabla_\pi \log \pi(a|s)\right],$$
where we used the general identity $\nabla \, \mathbb{E}_{z \sim p}[h(z)] = \mathbb{E}_{z \sim p}[h(z) \nabla \log p(z)]$ for a parameterized distribution $p$. For the second term of (35) we have