1 Introduction
Reinforcement learning (RL) aims to learn behavior policies to optimize a long-term decision-making process in an environment. RL is thus relevant to a variety of real-world applications, such as robotics, patient care, and recommendation systems. When tackling problems associated with the RL setting, two main difficulties arise: first, the decision-making problem is inherently sequential, with early decisions made by the policy affecting outcomes both near and far in the future; second, the learner’s knowledge of the environment is only through sampled experience, i.e., previously sampled trajectories of interactions, while the underlying mechanism governing the dynamics of these trajectories is unknown.
The environment and its underlying dynamics are typically abstracted as a Markov decision process (MDP). This abstraction gives rise to the Bellman recurrence, which characterizes the optimal value function and behavior policy through a dynamic programming (DP) view of the RL problem
(Bellman, 1966). Most of the effective existing RL algorithms are rooted in this dynamic programming paradigm, attempting to find approximate fixed-point solutions to the Bellman recurrence, leading to the family of temporal-difference (TD) algorithms, including SARSA (Sutton, 1996), Q-learning (Watkins, 1989), and their deep learning variants
(Mnih et al., 2015; Hasselt et al., 2016; Wang et al., 2015). While the TD-based algorithms are powerful, their training can oscillate or even diverge in settings where function approximation is used or the ability to sample additional interactions with the environment is limited (Sutton and Barto, 1998). An alternative paradigm for RL is based on linear programming (LP). A number of RL problems, such as policy optimization and policy evaluation, can be expressed as an LP –
i.e., an optimization problem involving a linear objective and linear constraints. The LP may then be transformed to a form more amenable to stochastic and large-scale optimization via the tools of LP duality. Although the LP perspective has existed for decades (e.g., Manne (1960); Denardo (1970)), it has recently received renewed interest for its potential ability to circumvent the optimization challenges of DP-based approaches in exchange for more mature and well-studied techniques associated with convex optimization (De Farias and Van Roy, 2003; Wang et al., 2007; Chen and Wang, 2016; Wang, 2017; Bas-Serrano and Neu, 2019). In this article, we generalize the LP approach and describe a number of convex problems relevant to RL – i.e., formulations of RL problems as a convex objective and linear constraints. With a convex objective, one must appeal to the more general Fenchel-Rockafellar duality. Perhaps the most useful property of this generalization is that, when the original primal problem involves a strictly convex objective (unlike the LP setting), application of Fenchel-Rockafellar duality leads to a dual problem which is unconstrained. We show that Fenchel-Rockafellar duality and its variants are at the heart of a number of recent RL algorithms, although many of these were originally presented through less generalizable derivations (e.g., the DICE family of offline RL algorithms: Nachum et al. (2019a, b); Kostrikov et al. (2019); Zhang et al. (2020)). By providing a unified perspective on these results and the tools and tricks which lead to them, we hope to enable future researchers to better use the techniques of convex duality to make further progress in RL.
Aiming to provide a useful reference for any interested researcher, we begin by reviewing basic knowledge of convex duality (Section 2) and RL (Section 3). We then focus on the discounted policy evaluation problem in RL (Section 4), originally expressed in an LP form known as the Q-LP. We show how the tools of LP and convex duality may be used to derive a variety of useful reformulations of policy evaluation. We then continue to show how these same techniques can be applied to the policy optimization problem, starting from either the Q-LP (Section 5) or the potentially more streamlined V-LP (Section 6). We then generalize these algorithms to undiscounted settings in Section 7. We conclude in Section 8 with a brief summary and promising future directions.
2 Convex Duality
The concept of duality is a basic and powerful tool in optimization and machine learning, especially in the field of convex analysis, allowing a researcher to easily reformulate optimization problems in alternative ways that are potentially more tractable. In this section, we provide a brief overview of a few key convex duality results which will play an important role in the RL algorithms derived in later sections.
A full and detailed introduction to convex analysis is beyond the scope of this report, and so most of our statements will be presented informally (for example, we use max and min as opposed to sup and inf, and we use sums and integrals without ambiguity), although we will strive to properly qualify theoretical claims when appropriate. The curious reader may refer to a number of resources for a more complete and mathematically precise treatment of this subject; e.g., Rockafellar (1970); Boyd and Vandenberghe (2004); Borwein and Lewis (2010); Bauschke and Lucet (2012).
2.1 Fenchel Conjugate
The Fenchel conjugate f* of a function f : Ω → ℝ is defined as

f*(y) := max_{x ∈ Ω} ⟨x, y⟩ - f(x),   (1)

where ⟨·, ·⟩ denotes the inner product defined on Ω. This function is also referred to as the convex conjugate or Legendre-Fenchel transformation of f.
We say a function f is proper when {x : f(x) < ∞} is nonempty and f(x) > -∞ for all x. We say a function f is lower semi-continuous when {x : f(x) > α} is an open set for all α ∈ ℝ. For a proper, convex, lower semi-continuous f, its conjugate function f* is also proper, convex, and lower semi-continuous. Moreover, one has the duality f** = f; i.e.,

f(x) = max_{y ∈ Ω*} ⟨x, y⟩ - f*(y),   (2)

where Ω* denotes the domain of f*. From now on we will assume any use of f* is with respect to a convex function f. Furthermore, we will assume any declared convex function is also proper and lower semi-continuous. Table 1 provides a few common functions and their corresponding Fenchel conjugates.
Function f(x) | Conjugate f*(y) | Notes
(1/2)x² | (1/2)y² |
(1/p)|x|^p | (1/q)|y|^q | For p, q > 1 and 1/p + 1/q = 1.
δ_C(x) | max_{x∈C} ⟨x, y⟩ | δ_C(x) is 0 if x ∈ C and ∞ otherwise.
e^x | y log y - y | With 0 log 0 := 0.
D_f(x‖p) | E_{z∼p}[f*(y(z))] | For f convex and p a distribution over Z.
D_KL(x‖p) | log E_{z∼p}[exp y(z)] | For x ∈ Δ(Z), i.e., x a normalized distribution over Z.
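The conjugate pairs above are easy to verify numerically; a minimal sketch (assuming numpy, with finite grids standing in for the domains) checks both the conjugate of the self-conjugate function f(x) = x²/2 and the biconjugacy f** = f:

```python
import numpy as np

# f(x) = x^2 / 2 is self-conjugate: f*(y) = y^2 / 2, so f** = f as well.
xs = np.linspace(-10.0, 10.0, 20001)   # fine grid over the primal domain
ys = np.linspace(-10.0, 10.0, 2001)    # coarser grid over the dual domain

def f(x):
    return 0.5 * x ** 2

def f_star(y):
    # f*(y) = max_x <x, y> - f(x), approximated by a grid maximum
    return np.max(xs * y - f(xs))

f_star_on_grid = np.array([f_star(y) for y in ys])

def f_star_star(x):
    # f**(x) = max_y <x, y> - f*(y)
    return np.max(ys * x - f_star_on_grid)

assert abs(f_star(2.0) - 0.5 * 2.0 ** 2) < 1e-3
assert abs(f_star_star(1.5) - f(1.5)) < 1e-3
```

The grids only need to be wide enough to contain the maximizing points; here the optimum of each inner problem lies well inside [-10, 10].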
2.1.1 The Indicator Function
One especially useful function is the indicator function δ_C of a set C, which is defined as,

δ_C(x) := { 0 if x ∈ C; ∞ otherwise }.   (3)

If C is a closed, convex set, it is easy to check that δ_C is proper, convex, and lower semi-continuous. The indicator function can be used as a way of expressing constraints. For example, the constrained optimization problem min_{x∈C} f(x) may be alternatively expressed as the unconstrained problem min_x f(x) + δ_C(x). It may be readily shown that the conjugate of δ_{{a}} is the linear function ⟨a, y⟩, and vice-versa.
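As a quick numerical illustration of this fact, the conjugate of δ_C is the support function max_{x∈C} ⟨x, y⟩; the following minimal sketch (assuming numpy, with a grid standing in for the interval C = [0, 1]) checks that the grid maximum matches the closed form max(y, 0):

```python
import numpy as np

# delta_C*(y) = max_x <x, y> - delta_C(x) = max_{x in C} <x, y>,
# the support function of C.  For C = [0, 1] this is max(y, 0).
xs = np.linspace(0.0, 1.0, 10001)   # grid over C; points outside C would
                                    # contribute -infinity and are omitted

def conj(y):
    return np.max(xs * y)

for y in (-2.0, -0.3, 0.0, 0.7, 5.0):
    assert np.isclose(conj(y), max(y, 0.0))
```

The other direction (the conjugate of a linear function ⟨a, ·⟩ being δ_{{a}}) is harder to check on a bounded grid, since the maximum is +∞ whenever y ≠ a.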
2.1.2 Divergences
The family of f-divergences, also known as Csiszár-Morimoto or Ali-Silvey divergences (Ali and Silvey, 1966), has been widely applied in many machine learning applications, including variational inference (Wainwright and Jordan, 2003), generative model estimation (Nowozin et al., 2016; Dai et al., 2019), imitation learning (Ke et al., 2019; Ghasemipour et al., 2019; Kostrikov et al., 2019), and reinforcement learning (Nachum et al., 2019a; Zhang et al., 2020; Nachum et al., 2019b). For a convex function f and a distribution p over some domain Z, the f-divergence of x from p is defined as,

D_f(x‖p) := E_{z∼p}[f(x(z)/p(z))].   (4)

Typically, f-divergences are used to measure the discrepancy between two distributions (i.e., x ∈ Δ(Z), the simplex over Z, and D_f(x‖p) measures the divergence between x and p), although the domain of D_f(·‖p) may be extended to the set of real-valued functions x : Z → ℝ.
The choice of domain of x is important when considering the Fenchel conjugate of D_f(·‖p). If the domain is the set of unrestricted real-valued functions, the conjugate of D_f(·‖p) at y : Z → ℝ is, under mild conditions,^1 (^1 Conditions of the interchangeability principle (Dai et al., 2016) must be satisfied, and p must have sufficient support over Z.)

D*_f(y) = max_{x:Z→ℝ} ⟨x, y⟩ - E_{z∼p}[f(x(z)/p(z))]   (5)
        = max_{x:Z→ℝ} E_{z∼p}[(x(z)/p(z))·y(z) - f(x(z)/p(z))]   (6)
        = E_{z∼p}[f*(y(z))],   (7)

where the maximization in (6) may be performed pointwise for each z. On the other hand, if one considers the domain of D_f(·‖p) to be the simplex Δ(Z), then one must solve a constrained version of (5), which can be difficult depending on the form of f.
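The unconstrained conjugate can be checked numerically on a small finite domain Z; a sketch assuming scipy and the illustrative choice f(t) = t²/2 (so f*(s) = s²/2), with p and y randomly generated:

```python
import numpy as np
from scipy.optimize import minimize

# Verify that, for unrestricted x : Z -> R, the conjugate of D_f(.||p) at y
# equals E_{z~p}[f*(y(z))].  Here f(t) = t^2 / 2, so f*(s) = s^2 / 2, |Z| = 4.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))        # distribution p over Z
y = rng.normal(size=4)               # argument y : Z -> R

def f(t):
    return 0.5 * t ** 2

def neg_value(x):
    # -( <x, y> - D_f(x || p) ), to be minimized by a generic optimizer
    return -(x @ y - np.sum(p * f(x / p)))

res = minimize(neg_value, x0=np.zeros(4))
closed_form = np.sum(p * 0.5 * y ** 2)   # E_p[f*(y)]
assert np.isclose(-res.fun, closed_form, atol=1e-5)
```

The optimizer also recovers the pointwise maximizer x*(z) = p(z)·y(z), matching the pointwise solution of (6).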
The KL Divergence
Of the family of f-divergences, the KL divergence is arguably the most commonly used one, and it is given by,

D_KL(x‖p) := E_{z∼x}[log(x(z)/p(z))],   (8)

which is the result of choosing f(t) = t log t in (4). For the KL divergence, the constrained version of (5) may be shown^2 (^2 See Example 3.25 in Boyd and Vandenberghe (2004).) to yield the conjugate function

D*_KL(y) = log E_{z∼p}[exp y(z)].   (9)

It is no coincidence that the log-average-exp function (and the closely related softmax) is arguably as ubiquitous in RL as the KL divergence.
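This constrained conjugate can be verified directly in a small finite setting: the maximizer over the simplex is the softmax point x* ∝ p·exp(y), and no other simplex point exceeds the log-average-exp value. A numpy sketch (random p and y, purely illustrative):

```python
import numpy as np

# Check max_{x in simplex} <x, y> - KL(x || p) = log E_{z~p}[exp(y(z))],
# attained at the softmax point x* proportional to p * exp(y).
rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(5))
y = rng.normal(size=5)

def value(x):
    return x @ y - np.sum(x * np.log(x / p))   # <x,y> - KL(x || p), x > 0

lse = np.log(np.sum(p * np.exp(y)))            # log E_p[exp(y)]
x_star = p * np.exp(y) / np.sum(p * np.exp(y)) # softmax candidate

assert np.isclose(value(x_star), lse)
# random points in the simplex never exceed the conjectured maximum
samples = rng.dirichlet(np.ones(5), size=2000)
assert all(value(x) <= lse + 1e-9 for x in samples)
```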
2.2 FenchelRockafellar Duality
Fenchel conjugates are indispensable when tackling a variety of optimization problems. In this section, we present one of the most general and useful tools associated with Fenchel conjugates, known as Fenchel-Rockafellar duality (Rockafellar, 1970; Borwein and Lewis, 2010).
Consider a primal problem given by

min_x J_P(x) := f(x) + g(Ax),   (10)

where f, g are convex, lower semi-continuous functions and A is a linear operator (e.g., a matrix). The dual of this problem is given by

max_y J_D(y) := -g*(y) - f*(-A*y),   (11)

where we use A* to denote the adjoint linear operator of A; i.e., A* is the linear operator for which ⟨y, Ax⟩ = ⟨A*y, x⟩ for all x, y. In the common case of A simply being a real-valued matrix, A* is the transpose of A.
Under mild conditions,^3 (^3 See Theorem 3.3.5 in Borwein and Lewis (2010). Informally, the primal problem needs to be feasible; i.e., f(x) + g(Ax) < ∞ for some x.) the dual problem (11) may be derived from the primal (10) via

min_x f(x) + g(Ax) = min_x max_y f(x) + ⟨y, Ax⟩ - g*(y)
                   = max_y min_x f(x) + ⟨A*y, x⟩ - g*(y)
                   = max_y -g*(y) - f*(-A*y).   (12)

Thus, we have the duality,

min_x f(x) + g(Ax) = max_y -g*(y) - f*(-A*y).   (13)

Furthermore, one may show that a solution to the dual can be used to find a solution to the primal. Specifically, if y* is a solution to the dual and ∇f*(-A*y*) is well-defined, then x* = ∇f*(-A*y*) is a solution to the primal. More generally, one can recover the subdifferential ∂f*(-A*y*) as the set of all primal solutions.
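Both strong duality and the primal-recovery rule can be sanity-checked on a quadratic instance where every piece is available in closed form. A numpy sketch, using the convention that the dual is max_y -g*(y) - f*(-A*y) (one of several equivalent sign conventions), with f(x) = ‖x‖²/2 and g(u) = ‖u - b‖²/2:

```python
import numpy as np

# f(x)  = ||x||^2 / 2        -> f*(y) = ||y||^2 / 2
# g(u)  = ||u - b||^2 / 2    -> g*(y) = <b, y> + ||y||^2 / 2
# Primal: min_x f(x) + g(Ax);  Dual: max_y -g*(y) - f*(-A^T y).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
b = rng.normal(size=3)

# Primal solution from the stationarity condition x + A^T (A x - b) = 0
x_star = np.linalg.solve(np.eye(4) + A.T @ A, A.T @ b)
primal = 0.5 * x_star @ x_star + 0.5 * np.sum((A @ x_star - b) ** 2)

# Dual solution: gradient of -<b,y> - ||y||^2/2 - ||A^T y||^2/2 vanishes
y_star = np.linalg.solve(np.eye(3) + A @ A.T, -b)
dual = -(b @ y_star + 0.5 * y_star @ y_star) - 0.5 * np.sum((A.T @ y_star) ** 2)

assert np.isclose(primal, dual)             # strong duality
assert np.allclose(x_star, -A.T @ y_star)   # x* = grad f*(-A^T y*)
```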
Of course, in the presence of Fenchel-Rockafellar duality, the labels of primal and dual are arbitrary. One can consider (11) the primal problem and (10) its dual, and in our derivations we will use these labels interchangeably.
2.2.1 The Lagrangian
Fenchel-Rockafellar duality is general enough that it can be used to derive the Lagrangian duality. Consider the constrained optimization problem

min_x f(x) s.t. Ax = b.   (14)

If we consider this problem expressed as min_x f(x) + g(Ax) for g = δ_{{b}}, its Fenchel-Rockafellar dual is given by

max_y -⟨b, y⟩ - f*(-A*y).   (15)

By considering f* in terms of its Fenchel conjugate (equation (1)), we may write the problem as

max_y min_x f(x) + ⟨A*y, x⟩ - ⟨b, y⟩.   (16)

Using the fact that ⟨A*y, x⟩ = ⟨y, Ax⟩ for any x, y, we may express this as

max_y min_x f(x) + ⟨y, Ax - b⟩.   (17)

The expression L(x, y) := f(x) + ⟨y, Ax - b⟩ is known as the Lagrangian of the original problem in (14). One may further derive the well-known Lagrange duality:^4 (^4 See Veinott (2005) (https://web.stanford.edu/class/msande361/handouts/nlpdual.pdf) for a brief derivation of this fact and Ekeland and Temam (1999)[Proposition 2.1] for more general cases.)

max_y min_x f(x) + ⟨y, Ax - b⟩ = min_x max_y f(x) + ⟨y, Ax - b⟩.   (18)

Moreover, the optimal value of the Lagrangian is the optimal value of the original problem (14), and the optimal solutions (x*, y*) (equilibrium points) are the solutions to the original primal (14) and its dual (15).
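As a minimal illustration (assuming numpy; the equality-constrained quadratic min ‖x‖²/2 s.t. Ax = b is chosen purely for convenience), the saddle point of the Lagrangian is found by solving the linear stationarity conditions, and it matches the direct minimum-norm solution:

```python
import numpy as np

# Lagrangian L(x, y) = ||x||^2 / 2 + <y, Ax - b>.  The saddle point solves
# the linear (KKT) system:  x + A^T y = 0,  Ax = b.
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 5))
b = rng.normal(size=2)

K = np.block([[np.eye(5), A.T],
              [A, np.zeros((2, 2))]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(5), b]))
x_star, y_star = sol[:5], sol[5:]

# The saddle point recovers the minimum-norm solution of Ax = b
x_direct = A.T @ np.linalg.solve(A @ A.T, b)
assert np.allclose(x_star, x_direct)
assert np.allclose(A @ x_star, b)   # primal feasibility
```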
2.2.2 LP Duality
Fenchel-Rockafellar duality also generalizes the well-known linear programming (LP) duality. Specifically, if one considers the functions f(x) = ⟨c, x⟩ + δ_{ℝ≥0}(x) and g = δ_{{b}}, then the primal and dual problems in (10) and (11) correspond to,

min_x ⟨c, x⟩ s.t. Ax = b, x ≥ 0,   (19)
max_y -⟨b, y⟩ s.t. -A*y ≤ c,   (20)

respectively. By making the switch y → -y, the dual (20) may be equivalently expressed in the more familiar form,

max_y ⟨b, y⟩ s.t. A*y ≤ c.   (21)

Fenchel-Rockafellar duality thus provides us with the strong LP duality theorem. Namely, if the primal problem (19) is feasible, then its result is the same as that of the dual (21).
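Strong LP duality is easy to confirm with an off-the-shelf solver; a sketch assuming scipy's linprog and a small hand-picked feasible instance (both problems here have optimal value 2):

```python
import numpy as np
from scipy.optimize import linprog

# Strong LP duality:  min <c,x> s.t. Ax = b, x >= 0
#              equals max <b,y> s.t. A^T y <= c.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])
c = np.array([1.0, 2.0, 3.0])

primal = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))
# linprog minimizes, so maximize <b, y> by minimizing <-b, y>; y is free
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(None, None))

assert primal.status == 0 and dual.status == 0
assert np.isclose(primal.fun, -dual.fun)   # optimal values coincide (= 2 here)
```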
3 Reinforcement Learning
In this work, we will show how Fenchel-Rockafellar duality (and the LP and Lagrangian dualities) can be applied to solve a number of reinforcement learning (RL) problems. Before we present these algorithms, we use this section as a brief introduction to RL.
3.1 Markov Decision Process
In RL, one wishes to learn a behavior policy to interact with an environment in an optimal way, where the typical meaning of ‘optimal’ is with respect to future discounted rewards (feedback) provided by the environment. The RL environment is commonly abstracted as a Markov decision process (MDP) (Puterman, 1994; Sutton and Barto, 1998), which is specified by a tuple M = ⟨S, A, R, T, μ0, γ⟩, consisting of a state space, an action space, a reward function, a transition probability function, an initial state distribution, and a discount factor, respectively. The policy is a function π : S → Δ(A). The policy interacts with the environment iteratively, starting with an initial state s0 ∼ μ0. At step t, the policy produces a distribution π(·|st) over the actions A, from which an action at is sampled and applied to the environment. The environment produces a scalar reward rt = R(st, at)^5 (^5 For simplicity we consider a deterministic reward function. Stochastic rewards are more typical, although the same derivations are usually applicable in either case.) and subsequently transitions to a new state st+1 ∼ T(st, at). In summary, the RL setting is concerned with a policy which sequentially makes decisions, and the effects of those decisions are observed through a per-step reward feedback and a stochastic, Markovian state transition process. For simplicity, we will consider infinite-horizon (non-terminating) environments, which may be extended to finite-horizon environments by considering an extra terminal state which continually loops onto itself with zero reward.
3.2 Policy Evaluation and Optimization
The first question one may ask when presented with an MDP M and a policy π is, what is the long-term value of π when interacting with M? The next question might be, what is the optimal policy maximizing this long-term value? These two questions constitute the policy evaluation and optimization problems, respectively.
To formalize these questions, we consider a discount factor γ ∈ (0, 1).^6 (^6 See Section 7 for consideration of γ = 1.) The value of π is defined as the expected per-step reward obtained by following the policy, averaging over time via discounting:

ρ(π) := (1 - γ)·E[ Σ_{t=0}^∞ γ^t R(st, at) | s0 ∼ μ0, at ∼ π(·|st), st+1 ∼ T(st, at) ].   (22)

The policy evaluation problem is to estimate this quantity for a given π, and the policy optimization problem is to find π which maximizes this quantity, i.e., to solve max_π ρ(π). If the reward function is independent of the policy π, there exists an optimal policy that is deterministic (Puterman, 1994). If one adds a policy-dependent regularization to the objective, e.g., by considering entropy-regularized rewards R(s, a) - log π(a|s), the optimal policy could be stochastic.
3.3 Online vs. Offline RL
One of the main limitations when approaching either the policy evaluation or policy optimization problems is that one does not have explicit knowledge of the environment; i.e., one does not have explicit knowledge of the functions R and T. Rather, access to the environment is given in the form of experience gathered via interactions with the environment. The specific nature of this experience depends on the context of one’s considered problem. The most common forms of experience may be generally categorized into online and offline.
In the online setting, experience from the environment may be collected at any point via Monte-Carlo rollouts. With this type of access to the environment, the policy evaluation and optimization problems may be easily solved. For example, the value of the policy may be estimated by simply averaging the discounted reward of a large number of Monte-Carlo rollouts. For this reason, online RL research typically strives to find sample-efficient algorithms, which find approximate solutions to policy evaluation or optimization with as few interactions with the environment as possible.
In practice (e.g., consumer web recommendation systems or health applications), interaction with the environment during training is not available at all. More commonly, access to the environment is offline. That is, interactions with the environment are limited to a static dataset of (logged) experience D = {(s^(i), a^(i), r^(i), s'^(i))}_{i=1}^N, where (s^(i), a^(i)) ∼ d^D for some unknown distribution d^D, r^(i) = R(s^(i), a^(i)), and s'^(i) ∼ T(s^(i), a^(i)). One also typically assumes access to samples from μ0. In this report we will mostly focus on the offline setting, although we will relax it to assume that our offline experience is effectively unlimited (N → ∞), and so will write our expectations in terms of d^D, T, and μ0. Performing the appropriate finite-sample analysis for finite N is outside the scope of this report. Even with effectively unlimited experience, the offline setting presents a challenge to RL algorithms, due to the mismatch between the experience distribution given by the offline dataset and the online distribution typically needed for policy evaluation or optimization.
Although d^D is a distribution over state-action pairs, we will at times abuse notation and write (s, a, r, s') ∼ d^D, and this is intended to mean (s, a) ∼ d^D, r = R(s, a), s' ∼ T(s, a); i.e., as if we are sampling from the (infinite) dataset D. Moreover, although one does not have explicit access to R or T, we will oftentimes write expressions involving R(s, a) and s' ∼ T(s, a) inside expectations over d^D, and this is intended to mean the corresponding expectation over sampled transitions (s, a, r, s') ∼ d^D.
We emphasize a subtle difference between the offline setting and what is commonly referred to in the literature as off-policy learning. Off-policy algorithms are designed to enable an RL agent to learn from historical samples collected by other policies. However, these algorithms are typically allowed to interact with the environment during training to collect new samples. On the other hand, in the offline setting, one’s access to the environment is exclusively via a fixed dataset of experience. In other words, while an offline RL algorithm is necessarily off-policy, an off-policy algorithm is not necessarily offline.
3.4 Q-values and State-Action Visitations
When evaluating or optimizing policies, both online and offline, the notions of Q-values and state-action visitations are useful. For a policy π, the Q-values denote the future discounted sum of rewards of following π starting at (s, a):

Q^π(s, a) := E[ Σ_{t=0}^∞ γ^t R(st, at) | s0 = s, a0 = a, at ∼ π(·|st), st+1 ∼ T(st, at) ].

The Q-values satisfy the single-step Bellman recurrence

Q^π(s, a) = R(s, a) + γ·P^π Q^π(s, a),   (23)

where P^π is the policy transition operator,

P^π Q(s, a) := E_{s'∼T(s,a), a'∼π(·|s')}[Q(s', a')].   (24)

The state-action visitations of π (also known as occupancies or density) may be defined similarly as,

d^π(s, a) := (1 - γ)·Σ_{t=0}^∞ γ^t Pr(st = s, at = a | s0 ∼ μ0, at ∼ π(·|st), st+1 ∼ T(st, at)).

That is, the visitation d^π(s, a) measures how likely π is to encounter (s, a) when interacting with M, averaging these encounters over time via discounting. The visitations constitute a normalized distribution, and this distribution is referred to as the on-policy distribution.
Like the Q-values, the visitations satisfy the single-step transpose Bellman recurrence:

d^π(s, a) = (1 - γ)·μ0(s)·π(a|s) + γ·P_*^π d^π(s, a),   (25)

where P_*^π is the transpose (or adjoint) policy transition operator,

P_*^π d(s, a) := π(a|s)·Σ_{s̃,ã} T(s|s̃, ã)·d(s̃, ã).   (26)

These recursions simply reflect the conservation of flow (probability mass) of a stationary distribution on a Markov process. Note that both P^π and P_*^π are linear operators and that the transpose policy transition operator is indeed the mathematical transpose (or adjoint) of P^π in the sense that ⟨d, P^π Q⟩ = ⟨P_*^π d, Q⟩ for any d, Q.
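In the tabular case the adjoint relationship can be checked directly by representing P^π as a matrix over state-action pairs, so that the adjoint is literally the matrix transpose. A numpy sketch on a randomly generated MDP (all quantities illustrative):

```python
import numpy as np

# Check <d, P^pi Q> = <P_*^pi d, Q> by building P^pi as an
# (|S||A|) x (|S||A|) matrix; P_*^pi is then its transpose.
rng = np.random.default_rng(0)
nS, nA = 4, 3
T = rng.dirichlet(np.ones(nS), size=(nS, nA))   # T[s, a] = P(. | s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi[s]   = pi(. | s)

# (P^pi Q)(s,a) = sum_{s',a'} T(s'|s,a) pi(a'|s') Q(s',a')
P_pi = np.einsum('sap,pb->sapb', T, pi).reshape(nS * nA, nS * nA)

d = rng.random(nS * nA)       # an arbitrary nonnegative "visitation"
Q = rng.normal(size=nS * nA)  # an arbitrary value function
assert np.isclose(d @ (P_pi @ Q), (P_pi.T @ d) @ Q)
```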
Both the Q-values and the visitations are useful in RL. For example, the value of a policy may be expressed in two ways:

ρ(π) = (1 - γ)·E_{s0∼μ0, a0∼π(·|s0)}[Q^π(s0, a0)] = E_{(s,a)∼d^π}[R(s, a)].   (27)

Also, when performing policy optimization, the policy gradient theorem (Sutton et al., 2000) utilizes the Q-values and visitations to express the gradient of ρ(π) as

∇_π ρ(π) = E_{(s,a)∼d^π}[Q^π(s, a)·∇ log π(a|s)].   (28)

It is thus standard in most RL algorithms to either have access to Q^π and d^π or have some mechanism to estimate these quantities. Typically, the Q-values are estimated by minimizing squared Bellman residuals (or some variation on this) (Sutton et al., 2008, 2009; Liu et al., 2015; Dai et al., 2016; Du et al., 2017); i.e., minimizing the (surrogate) squared difference between the LHS and RHS of (23). For the visitations, it is more typical to assume access to the distribution d^π (for example, by simply performing Monte-Carlo rollouts enabled by online access), although instances exist in which d^π is approximated by importance-weighting a different (i.e., offline) distribution (Precup et al., 2001; Sutton et al., 2014).
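These definitions can be verified end-to-end on a small random tabular MDP, where the Bellman recurrences (23) and (25) become linear systems; a numpy sketch (all quantities randomly generated, purely for illustration):

```python
import numpy as np

# Tabular policy evaluation: solve the Bellman recurrences exactly and
# confirm that the two expressions for rho(pi) in (27) agree.
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
T = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition probabilities
R = rng.random((nS, nA)).reshape(-1)            # reward function
mu0 = rng.dirichlet(np.ones(nS))                # initial-state distribution
pi = rng.dirichlet(np.ones(nA), size=nS)        # policy pi(a|s)

P_pi = np.einsum('sap,pb->sapb', T, pi).reshape(nS * nA, nS * nA)
mu0_pi = (mu0[:, None] * pi).reshape(-1)        # initial state-action dist.
I = np.eye(nS * nA)

# Q^pi from (23):  (I - gamma P^pi) Q = R
Q = np.linalg.solve(I - gamma * P_pi, R)
# d^pi from (25):  (I - gamma P_*^pi) d = (1 - gamma) mu0 pi
d = np.linalg.solve(I - gamma * P_pi.T, (1 - gamma) * mu0_pi)

assert np.isclose(d.sum(), 1.0) and np.all(d >= 0)  # d^pi is a distribution
rho_q = (1 - gamma) * (mu0_pi @ Q)                  # (1-gamma) E[Q(s0, a0)]
rho_d = d @ R                                       # E_{d^pi}[R(s, a)]
assert np.isclose(rho_q, rho_d)
```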
4 Policy Evaluation
We now move on to demonstrating applications of Fenchel-Rockafellar duality to RL. We begin by approaching the policy evaluation problem. Although the policy evaluation problem may appear to be simpler or less interesting (it is not!) than the policy optimization problem, in our case the same techniques will be used in either setting. Thus, we will use this section to provide more detailed derivations of a variety of techniques which will be referenced repeatedly in the following sections.
4.1 The Linear Programming Form of ρ(π)
The equivalent formulations of ρ(π) in (27) in terms of either Q^π or d^π hint at a duality, which is formally given by the following LP characterization of ρ(π), known as the Q-LP:

ρ(π) = min_Q (1 - γ)·E_{s0∼μ0, a0∼π(·|s0)}[Q(s0, a0)]   (29)
       s.t. Q(s, a) ≥ R(s, a) + γ·P^π Q(s, a),   (30)
            ∀(s, a) ∈ S × A.   (31)

The optimal Q* of this LP satisfies Q*(s, a) = Q^π(s, a) for all (s, a) reachable by π.
The dual of this LP provides us with the visitation perspective on policy evaluation:

ρ(π) = max_{d≥0} Σ_{s,a} d(s, a)·R(s, a)   (32)
       s.t. d(s, a) = (1 - γ)·μ0(s)·π(a|s) + γ·P_*^π d(s, a),   (33)
            ∀(s, a) ∈ S × A.   (34)

The optimal d* of this LP is the state-action visitation d^π of π. It is important to note that this dual LP is over-constrained. The equality constraints in (33) uniquely determine d* regardless of the objective in (32). This fact will prove useful in a number of later derivations.
For detailed and complete derivations of these LP representations of ρ(π), please refer to Nachum et al. (2019b).
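In the tabular case, the Q-LP (29)-(31) is a finite LP and can be handed to an off-the-shelf solver; a sketch assuming scipy's linprog and a randomly generated MDP (illustrative only), checking that the LP optimum matches ρ(π) computed by a direct linear solve:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
T = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.random((nS, nA)).reshape(-1)
mu0 = rng.dirichlet(np.ones(nS))
pi = rng.dirichlet(np.ones(nA), size=nS)

P_pi = np.einsum('sap,pb->sapb', T, pi).reshape(nS * nA, nS * nA)
mu0_pi = (mu0[:, None] * pi).reshape(-1)
I = np.eye(nS * nA)

# min (1-gamma) <mu0 pi, Q>  s.t.  (I - gamma P^pi) Q >= R,  Q free
res = linprog(c=(1 - gamma) * mu0_pi,
              A_ub=-(I - gamma * P_pi),
              b_ub=-R,
              bounds=(None, None))

# Reference value: rho(pi) = (1-gamma) <mu0 pi, Q^pi>
rho = (1 - gamma) * mu0_pi @ np.linalg.solve(I - gamma * P_pi, R)
assert res.status == 0 and np.isclose(res.fun, rho)
```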
4.2 Policy Evaluation via the Lagrangian
The potentially large number of constraints in either the primal or dual forms of the Q-LP introduces a challenge to estimating ρ(π). We may instead derive a more tractable unconstrained optimization perspective on the policy evaluation problem using the Lagrangian of the Q-LP:

ρ(π) = min_Q max_{d≥0} (1 - γ)·E_{s0∼μ0, a0∼π(·|s0)}[Q(s0, a0)] + Σ_{s,a} d(s, a)·(R(s, a) + γ·P^π Q(s, a) - Q(s, a)).   (35)

In practical settings where S × A is possibly infinite, it is not feasible to optimize the sum in (35) over all of S × A. In an offline setting, where we only have access to a distribution d^D, we may make a change of variables via importance sampling, i.e., ζ(s, a) := d(s, a)/d^D(s, a). If d^D has sufficient support or coverage (Sutton et al., 2016), we may rewrite (35) as

ρ(π) = min_Q max_{ζ≥0} (1 - γ)·E_{s0∼μ0, a0∼π(·|s0)}[Q(s0, a0)] + E_{(s,a)∼d^D}[ζ(s, a)·(R(s, a) + γ·P^π Q(s, a) - Q(s, a))].   (36)

The optimal ζ* of this problem satisfies ζ*(s, a) = d^π(s, a)/d^D(s, a). Thus, to estimate ρ(π), one may optimize this objective with respect to Q and ζ (requiring only access to samples from d^D, μ0, and π) and return the resulting objective value as the final estimate.
This more practical, offline estimator also has a desirable property, known as the doubly robust property (Funk et al., 2011; Jiang and Li, 2015; Kallus and Uehara, 2019a). Specifically, denoting the objective in (36) for a fixed pair Q, ζ by ρ̂(Q, ζ), we have

ρ̂(Q, ζ*) = ρ(π) = ρ̂(Q^π, ζ) for any Q and any ζ.   (37)

Thus, this estimator is robust to errors in at most one of Q and ζ.
Despite the desirable properties of this estimator, the optimization problem associated with it involves rewards and learning Q-values with respect to these rewards. Learning Q-values using single-step transitions turns out to be difficult in practice without the use of a number of tricks developed over the years (e.g., target networks, ensembling). Moreover, the bilinear nature of the Lagrangian can lead to instability or poor convergence in optimization (Dai et al., 2017; Bas-Serrano and Neu, 2019). Rather than tackling these various issues head-on, a number of recent works propose an alternative approach, which we describe in the following subsection.
4.3 Changing the Problem Before Applying Duality
As mentioned in Section 4.1, the dual of the Q-LP is over-constrained, in the sense that the constraints in (33) uniquely determine the state-action visitation d^π. Thus, one may replace the objective in (32) with some function h(d) without affecting the optimal solution d* = d^π. Therefore, the main idea of a number of recent works is to choose an appropriate h so that either the Lagrangian or the Fenchel-Rockafellar dual of this problem is more approachable and potentially avoids the instabilities associated with the original LP.
Although the problem is changed, the solution is unaffected, and once a solution is found it may be used to provide an estimate of ρ(π). Specifically, if the problem is rewritten in terms of ζ = d/d^D, then once the problem is optimized, we can derive an estimate for the value of π via the approximate solution ζ*:

ρ(π) ≈ E_{(s,a)∼d^D}[ζ*(s, a)·R(s, a)].   (38)
4.3.1 Constant Function
If h is taken to be the trivial function h ≡ 0, the offline form of the Lagrangian optimization becomes,

max_{ζ≥0} min_Q (1 - γ)·E_{s0∼μ0, a0∼π(·|s0)}[Q(s0, a0)] + E_{(s,a)∼d^D}[ζ(s, a)·(γ·P^π Q(s, a) - Q(s, a))].   (39)

The optimal solution of this problem is ζ*(s, a) = d^π(s, a)/d^D(s, a), and once an approximate solution is found, it may be used to estimate ρ(π) according to (38). Unlike the previous form of the Lagrangian in (36), this optimization does not involve learning Q-values with respect to environment rewards, and in practice this distinction leads to much better optimization behavior (Uehara and Jiang, 2019). Still, the Lagrangian is linear in both Q and ζ. This can be remedied by choosing a strictly convex form of h, for example, by using an f-divergence.
4.3.2 f-Divergence
The use of an f-divergence objective leads to the set of general off-policy estimation techniques outlined in the recent DualDICE paper (Nachum et al., 2019a). Specifically, the various estimators proposed by DualDICE correspond to applying either the Lagrange or Fenchel-Rockafellar dualities to the optimization problem,

min_d D_f(d‖d^D)   (40)
s.t. d(s, a) = (1 - γ)·μ0(s)·π(a|s) + γ·P_*^π d(s, a),   (41)
     ∀(s, a) ∈ S × A.   (42)
Lagrange Duality
Application of Lagrange duality to the above problem yields

min_d max_Q D_f(d‖d^D) + Σ_{s,a} Q(s, a)·((1 - γ)·μ0(s)·π(a|s) + γ·P_*^π d(s, a) - d(s, a)).   (43)

We transform the transpose policy transition operator P_*^π to P^π by using the fact ⟨Q, P_*^π d⟩ = ⟨P^π Q, d⟩:

min_d max_Q D_f(d‖d^D) + (1 - γ)·E_{s0∼μ0, a0∼π(·|s0)}[Q(s0, a0)] + Σ_{s,a} d(s, a)·(γ·P^π Q(s, a) - Q(s, a)).   (44)

Now we make the change of variables ζ(s, a) := d(s, a)/d^D(s, a) to yield,

min_ζ max_Q E_{(s,a)∼d^D}[f(ζ(s, a))] + (1 - γ)·E_{s0∼μ0, a0∼π(·|s0)}[Q(s0, a0)] + E_{(s,a)∼d^D}[ζ(s, a)·(γ·P^π Q(s, a) - Q(s, a))].   (45)

Thus we have recovered the general saddle-point form of DualDICE, which proposes to optimize (45) and then use an approximate solution ζ* to estimate ρ(π) via (38).
FenchelRockafellar Duality
Rather than applying Lagrange duality, application of Fenchel-Rockafellar duality to (40) more clearly reveals the wisdom of choosing an f-divergence objective. We write the problem in (40) as

min_d D_f(d‖d^D) + g(A·d),

where g := δ_{{(1-γ)μ0π}} corresponds to the linear constraints (41) with respect to the adjoint Bellman operator; i.e.,

A := I - γ·P_*^π.

When applying Fenchel-Rockafellar duality, the linear operator A is transformed to its adjoint A* = I - γ·P^π and is used as an argument to the Fenchel conjugate of D_f(·‖d^D). At the same time, g is replaced by its Fenchel conjugate g*(Q) = (1 - γ)·E_{s0∼μ0, a0∼π(·|s0)}[Q(s0, a0)].
The dual problem is therefore given by

max_Q -(1 - γ)·E_{s0∼μ0, a0∼π(·|s0)}[Q(s0, a0)]   (46)
      - E_{(s,a)∼d^D}[f*((γ·P^π Q - Q)(s, a))].   (47)

We can now see that the use of an f-divergence with respect to d^D naturally leads to an offline problem with expectations over d^D, without an explicit change of variables. Furthermore, unlike previous dual problems, there are no constraints in this optimization, and so standard gradient-based techniques may be applied to find a solution without appealing to the Lagrange duality, which would necessarily involve nested max-min optimizations. The Fenchel-Rockafellar duality also provides us with a way to recover d^π from a solution Q*:

d^π(s, a) = d^D(s, a)·f*'((γ·P^π Q* - Q*)(s, a)),   (48)

or equivalently,

d^π(s, a)/d^D(s, a) = f*'((γ·P^π Q* - Q*)(s, a)).   (49)

If we set f(x) = (1/2)x², we may recover what is perhaps the most intriguing result in Nachum et al. (2019a):

d^π(s, a)/d^D(s, a) = (γ·P^π Q* - Q*)(s, a), where Q* = argmin_Q (1 - γ)·E_{s0∼μ0, a0∼π(·|s0)}[Q(s0, a0)] + (1/2)·E_{(s,a)∼d^D}[((γ·P^π Q - Q)(s, a))²].   (50)

That is, if one optimizes value functions to minimize squared Bellman residuals (with respect to zero reward) while minimizing initial values, then the optimal Bellman residuals are exactly the density ratios between the on-policy and offline state-action distributions.
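This result can be confirmed in the tabular setting, where the minimizer of the squared-residual objective is available by solving a linear system; a numpy sketch assuming a randomly generated MDP and a full-support offline distribution d^D:

```python
import numpy as np

# Minimizing (1-gamma) <mu0 pi, Q> + 0.5 sum_{s,a} d^D (gamma P^pi Q - Q)^2
# yields residuals (gamma P^pi Q* - Q*) equal to the ratios d^pi / d^D.
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
T = rng.dirichlet(np.ones(nS), size=(nS, nA))
mu0 = rng.dirichlet(np.ones(nS))
pi = rng.dirichlet(np.ones(nA), size=nS)

P_pi = np.einsum('sap,pb->sapb', T, pi).reshape(nS * nA, nS * nA)
mu0_pi = (mu0[:, None] * pi).reshape(-1)
I = np.eye(nS * nA)
d_pi = np.linalg.solve(I - gamma * P_pi.T, (1 - gamma) * mu0_pi)
d_D = rng.dirichlet(np.ones(nS * nA))   # full-support offline distribution

# Setting the gradient of the objective to zero gives the linear system
#   B^T diag(d^D) B Q = -(1 - gamma) mu0 pi,   with B = gamma P^pi - I.
B = gamma * P_pi - I
Q_star = np.linalg.solve(B.T @ np.diag(d_D) @ B, -(1 - gamma) * mu0_pi)

assert np.allclose(B @ Q_star, d_pi / d_D)   # residuals = density ratios
```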
Interestingly, the derivations in Nachum et al. (2019a) do not explicitly use Lagrangian or Fenchel-Rockafellar duality, but rather focus on a cleverly chosen change of variables (the so-called DualDICE trick). It is clear from our own derivations that this trick essentially comes from the relationship between f and f* and is simply another way of applying Fenchel-Rockafellar duality.
It is important to note that there is a trade-off introduced by the use of the Fenchel-Rockafellar duality as opposed to the Lagrangian. Namely, the objective (47) involves optimizing a convex function f* of an expectation over next states s' ∼ T(s, a) (through the term P^π Q), whereas in practice one typically only has access to a single empirical sample s' for each (s, a). Many works ignore this problem, and simply consider the single empirical sample as the full distribution T(s, a).
4.4 Summary
We briefly summarize the main takeaways from this section.

The policy evaluation problem may be expressed as an LP, known as the Q-LP, whose primal solution is Q^π.

The dual of this LP has solution d^π.

Taking the Lagrangian of the Q-LP can lead to a doubly robust estimator for ρ(π).

Changing the objective in the dual of the Q-LP does not change its solution d^π.

Changing the objective to an appropriate alternative is a powerful tool. This can lead to a (regularized) Fenchel-Rockafellar dual that is unconstrained, and thus more amenable to stochastic and offline settings.
5 Policy Optimization
The previous section detailed a number of ways to estimate ρ(π). In this section, we show how similar techniques may be applied to the policy optimization problem, which is concerned with finding the optimal policy π* = argmax_π ρ(π).
5.1 The Policy Gradient Theorem
Considering the Lagrangian formulation of ρ(π) given in (35) can provide a simple derivation of the policy gradient theorem in (28). Let L(Q, d; π) be the inner expression in (35). Danskin’s theorem (Bertsekas, 1999) tells us that

∇_π ρ(π) = ∇_π L(Q*, d*; π),   (51)

where Q*, d* are the solutions to min_Q max_{d≥0} L(Q, d; π). Recall that Q*(s, a) = Q^π(s, a) for all (s, a) for which d^π(s, a) > 0 and that d* = d^π. Thus,

∇_π ρ(π) = ∇_π L(Q^π, d^π; π).   (52)

We may compute the gradient of L w.r.t. π term-by-term. For the first term in (35) we have

∇_π (1 - γ)·E_{s0∼μ0, a0∼π(·|s0)}[Q^π(s0, a0)] = (1 - γ)·E_{s0∼μ0, a0∼π(·|s0)}[Q^π(s0, a0)·∇ log π(a0|s0)],   (53)

where we used the general identity ∇_π E_{a∼π(·|s)}[G(a)] = E_{a∼π(·|s)}[G(a)·∇ log π(a|s)]. For the second term of L in (35) we have