1 Introduction
In reinforcement learning (RL), policy optimization seeks an optimal decision-making strategy, known as a policy [5, 33, 30]. Policies are typically optimized in terms of accumulated rewards, with or without constraints on actions and/or states associated with an environment [2].
Policy optimization has many challenges; perhaps the most basic is the constraint on the flow of state-action visitations, called occupancy measures. Indeed, when RL is formulated as a linear programming problem, occupancy measures appear as an explicit constraint on the optimal policy [25]. The constraint-based formulation suggests the possibility of implementing a broader set of objectives and constraints, such as entropy regularization [22, 19, 18] and cost-constrained MDPs [2].
Considering the reward function as the negative cost of assigning an action to a state, we view RL as a stochastic assignment problem. We formulate policy optimization as unbalanced optimal transport on the space of occupancy measures. Whereas optimal transport (OT) [32] is the problem of adapting two distributions on possibly different spaces via a cost function, unbalanced OT relaxes the marginal constraints of OT to arbitrary measures through penalty functions [15, 7].
We therefore define distributionally-constrained reinforcement learning as a problem of optimal transport. Given baseline marginals over states and actions, policy optimization is unbalanced optimal transport adapting the state marginal to the action marginal via the reward function. Building on the mathematical tools of OT, we generalize the RL objective to the sum of a Bregman divergence and any number of arbitrary lower-semicontinuous convex functions. We optimize this objective with Dykstra’s algorithm [11], a method of iterative projections onto general closed convex constraint sets. Under Fenchel duality, this algorithm allows decomposition of the objective into Bregman projections on the subsets corresponding to each function.
As a particular case, we can regularize over the state space distribution and/or the global action execution distribution of the desired occupancy measures. This formulation allows constraints on the policy optimization problem in terms of distributions on state visitations and/or action executions. We propose an actor-critic algorithm with function approximation for large-scale RL, for settings where we have access to samples from a baseline policy (off-policy sampling or imitation learning) and samples from the constraint marginals.
The structure of the paper is as follows. In Section 2 we briefly present the preliminaries on (unbalanced) optimal transport and policy optimization in reinforcement learning. In Section 3 we introduce a general objective with Bregman divergence for policy optimization and provide Dykstra iterations as a general primal algorithm for optimizing this objective. Section 4 discusses distributionally-constrained policy optimization with unbalanced OT and its applications. In that section, we also provide an actor-critic algorithm for large-scale RL. We conclude with demonstrations of the distributional constraints in Section 5 and a discussion of related work in Section 6.
2 Notation and Preliminaries
For any finite set , let be the set of probability distributions on . We denote as the indicator function on set . For , we define the entropy map , and denote the Kullback-Leibler (KL) divergence between two positive functions by
If , for a given convex function with , we define divergence:
In particular, for , .
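As a numerical illustration of the divergences above, the following sketch assumes the standard definitions: KL(p|q) = Σ p log(p/q) − p + q for positive functions, and a divergence of the form Σ ψ(p/q) q for a convex ψ with ψ(1) = 0. The function names and the choice ψ(x) = x log x are assumptions made here for illustration; the paper's own symbols may differ.

```python
import numpy as np

def generalized_kl(p, q):
    """KL divergence between strictly positive arrays (not necessarily normalized)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q) - p + q))

def psi_divergence(p, q, psi):
    """Divergence sum_x psi(p(x)/q(x)) q(x) for a convex psi with psi(1) = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(psi(p / q) * q))

def x_log_x(x):
    # Assumed generator; with this choice the divergence reduces to KL
    # when p and q are probability distributions.
    return x * np.log(x)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
kl_pq = generalized_kl(p, q)
d_psi = psi_divergence(p, q, x_log_x)
```

For normalized p and q the two quantities coincide, because the extra terms −p + q sum to zero.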
We also use as the natural inner product on . Throughout the paper, by we denote a vector with elements one over set or just if the context is clear.
2.1 Optimal Transport
Given measures , on two sets and , with a cost function , Kantorovich optimal transportation is the problem of finding a stochastic optimal assignment plan :
When and is derived from a metric on , this optimization defines a distance function on the measure space , called the Wasserstein distance [32].
Unbalanced Optimal Transport replaces hard constraints and , with penalty functions
where are positive scalars. This formulation also extends optimal transport to measures of arbitrary mass. As , unbalanced OT approaches the Kantorovich OT problem [15, 7].
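To make the relaxation concrete, here is a minimal sketch of entropy-regularized unbalanced OT with KL marginal penalties, solved by scaling iterations of the kind described in [7]. The function name, the parameters `eps` (entropy weight) and `rho` (penalty weight), and the toy cost matrix are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def unbalanced_sinkhorn(cost, a, b, eps=0.5, rho=1.0, n_iters=1000):
    """Scaling iterations for entropic unbalanced OT with KL marginal penalties.

    Approximately minimizes, over nonnegative plans P,
        <cost, P> + eps * sum(P log P - P) + rho * KL(P 1 | a) + rho * KL(P^T 1 | b).
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = rho / (rho + eps)        # tends to 1 (balanced Sinkhorn) as rho grows
    for _ in range(n_iters):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]

# Toy problem: three points on a line with |i - j| ground cost.
idx = np.arange(3)
cost = np.abs(idx[:, None] - idx[None, :]).astype(float)
a = np.array([0.5, 0.3, 0.2])
b = np.array([0.2, 0.3, 0.5])

P_tight = unbalanced_sinkhorn(cost, a, b, rho=1e6)    # nearly hard marginal constraints
P_relaxed = unbalanced_sinkhorn(cost, a, b, rho=0.1)  # marginals only softly penalized

tight_err = np.abs(P_tight.sum(axis=1) - a).max()
relaxed_err = np.abs(P_relaxed.sum(axis=1) - a).max()
```

With a very large penalty weight the row and column sums of the plan approach `a` and `b`, which mirrors the limit statement above; with a small weight the plan is free to deviate from both marginals.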
To reduce the daunting computational cost of standard algorithms, an entropy term is usually added to the (U)OT objective so that scaling algorithms can be applied [9, 7, 23]. When and are large or continuous spaces, we usually have access to samples from instead of the actual measures. Stochastic approaches usually add a relative entropy , instead of in order to take advantage of the Fenchel dual of the (U)OT optimization and estimate the objective from samples out of [3, 28].
2.2 Reinforcement Learning
Consider a discounted MDP , with finite state space , finite action space , transition model , initial distribution , deterministic reward function and discount factor . Letting be the set of stationary policies on the MDP, for any policy , we denote by the induced Markov chain on . In policy optimization, the objective is
(1)
where is the discounted stationary distribution of . For a policy , we define its occupancy measure , as . Let be the set of occupancy measures of ; the following lemma bridges the two spaces and :
Lemma 2.1.
[31, Theorem 2, Lemma 2]


is a bijection from to .
So, multiplying both sides of the equation in (i) by , one can obtain where and . In the rest of the paper, we may drop the superscripts and when the context is clear. Rewriting the policy optimization objective (1) in terms of , we get
(2) 
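Lemma 2.1 and the rewriting of the objective in terms of occupancy measures can be checked numerically on a small random MDP. The sketch below assumes the usual form of these statements from [31]: a flow constraint on the occupancy measure and recovery of the policy by normalizing over actions. The (1 − γ) normalization, the array layout `P[s, a, s']`, and all variable names are assumptions made here, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Random MDP: P[s, a, s'] transition probabilities, rho initial distribution, r rewards.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
rho = np.full(n_states, 1.0 / n_states)
r = rng.random((n_states, n_actions))

# Random stationary policy pi[s, a].
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

# Induced Markov chain on states: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
P_pi = np.einsum("sa,sat->st", pi, P)

# Normalized discounted state distribution: nu^T = (1 - gamma) rho^T (I - gamma P_pi)^{-1}.
nu = (1.0 - gamma) * np.linalg.solve((np.eye(n_states) - gamma * P_pi).T, rho)

# Occupancy measure mu(s, a) = nu(s) pi(a|s).
mu = nu[:, None] * pi

# Flow constraint: sum_a mu(s, a) = (1 - gamma) rho(s) + gamma sum_{s', a'} P(s|s', a') mu(s', a').
flow_lhs = mu.sum(axis=1)
flow_rhs = (1.0 - gamma) * rho + gamma * np.einsum("sat,sa->t", P, mu)

# The policy is recovered from the occupancy measure by normalizing over actions.
pi_recovered = mu / mu.sum(axis=1, keepdims=True)

# The objective as an inner product <mu, r> matches the normalized discounted return.
r_pi = (pi * r).sum(axis=1)
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
objective_from_values = float((1.0 - gamma) * rho @ V)
objective_from_mu = float((mu * r).sum())
```

Under these assumptions the occupancy measure sums to one, so the policy objective is linear in the occupancy measure, which is what makes the linear programming view possible.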
An entropy-regularized version of objective (2), relative to a given baseline , has also been studied [22, 19]:
(3) 
where is the regularization coefficient.
By Lemma 2.1, one can decompose the regularization term in (3) as
(4) 
with the first term penalizing the shift in state distributions and the second penalizing the average shift of policies over states. Since the goal is to optimize for the best policy, one might consider regularizing only relative to , as in [27, 19]:
(5) 
One can also regularize objective (2) by as
(6) 
which encourages exploration and avoids premature convergence [13, 26, 1].
3 A General RL Objective with Bregman Divergence
In this section, we propose a general RL objective based on Bregman divergence and optimize it using Dykstra’s algorithm.
Let be a strictly convex, smooth function on , the relative interior of its domain, with convex conjugate . For any , we define the Bregman divergence
Given , we consider the optimization
(7) 
where s are proper, lower-semicontinuous convex functions satisfying
(8) 
Letting be the simplex on , the regularized RL objectives in Section 2.2 can be viewed as instances of optimization (7):
The motivation behind using Bregman divergence is to generalize the KL divergence regularization usually used in RL algorithms. Moreover, one may replace the Bregman divergence term in (7) with a Divergence and attempt to derive similar arguments for the rest of the paper.
3.1 Dykstra’s Algorithm
In this section, we use Dykstra’s algorithm [11] to optimize objective (7). Dykstra’s algorithm is a method of iterative projections onto general closed convex constraint sets, which is well suited because the occupancy measure constraint is on a compact convex polytope .
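For intuition, the following is a minimal sketch of the classical Euclidean form of Dykstra's algorithm for two convex sets, which is a simpler special case and not the Bregman generalization developed in this section. The two sets (the nonnegative orthant and the hyperplane of vectors summing to one), the helper names, and the test point are assumptions chosen for illustration; their intersection is the probability simplex, so the result can be compared against a direct sort-based simplex projection.

```python
import numpy as np

def project_nonneg(x):
    """Euclidean projection onto the nonnegative orthant."""
    return np.maximum(x, 0.0)

def project_sum_to_one(x):
    """Euclidean projection onto the hyperplane {x : sum(x) = 1}."""
    return x - (x.sum() - 1.0) / x.size

def dykstra(x0, proj_a, proj_b, n_iters=200):
    """Classical Dykstra iterations with one correction term per set."""
    x = np.asarray(x0, dtype=float).copy()
    p = np.zeros_like(x)  # correction for the first set
    q = np.zeros_like(x)  # correction for the second set
    for _ in range(n_iters):
        y = proj_a(x + p)
        p = x + p - y
        x = proj_b(y + q)
        q = y + q - x
    return x

def project_simplex_sort(v):
    """Direct Euclidean projection onto the simplex, used only as a reference."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    last = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[last] / (last + 1)
    return np.maximum(v - theta, 0.0)

point = np.array([0.5, 1.2, -0.3])
dykstra_result = dykstra(point, project_nonneg, project_sum_to_one)
reference = project_simplex_sort(point)
```

The correction terms `p` and `q` are what distinguish Dykstra's method from plain alternating projections: they ensure the limit is the projection of the starting point onto the intersection, rather than merely some point in it.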
Defining the proximal map of a convex function , with respect to , as
for any , we present the following proposition, which is a generalization of Dykstra’s algorithm in [24]:
Proposition 3.1 (Dykstra’s algorithm).
(All proofs and derivations in this section are included in Appendix A.)
Intuitively, at step , algorithm (9) projects onto the convex constraint set corresponding to the function .
Corollary 3.2.
Note 3.3.
Since the RL objectives in Section 2.2 can be viewed as instances of optimization (7), Dykstra’s algorithm can be used to optimize them. In particular, as the constraint occurs in all of them, each iteration of Dykstra’s algorithm requires
(11) 
which is the Bregman projection of onto the space of occupancy measures .
Although (the measure from the previous step of Dykstra’s algorithm) does not necessarily lie inside , step (11) can be seen as a Bregman-divergence policy optimization, resulting in a dual formulation over value functions (see details of the dual optimization in Appendix B). This dual formulation is similar to the REPS algorithm [22].
In the next section we apply Dykstra to solve unbalanced optimal transport on .
4 Distributionally-Constrained Policy Optimization
A natural constraint in policy optimization is to enforce a global action execution allotment and/or state visitation frequency. In particular, given a positive baseline measure , with being a rough execution allotment of action over the whole state space, for every , we can consider as a global penalty constraint of policy under its natural state distribution . Similarly, the penalty enforces a cautious or exploratory constraint on the policy behavior by avoiding or encouraging visitation of some states according to a given positive baseline measure on .
So, given baseline measures on and on , we define the distributionally-constrained policy optimization objective
(12) 
When , objective (12) looks similar to (3) (considering the expansion in (4)), but they are different. In (12), if and , for some baseline , then the third term is , which is a global constraint on the center of mass of over the whole state space, whereas in (4) is a stronger constraint on the closeness of policies at every single state. The bottom line is that (12) generally constrains the projected marginals of over and , whereas (3) constrains element-wise.
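The gap between a global constraint on the action marginal and a per-state constraint on the policy can be seen in a two-state toy example. The sketch below is purely illustrative: the two policies, the helper `kl`, and in particular the fixed state distribution shared by both policies are assumptions (in an actual MDP the state distribution also depends on the policy).

```python
import numpy as np

def kl(p, q):
    """KL divergence between two strictly positive probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical fixed state distribution over two states.
state_dist = np.array([0.5, 0.5])

# Two policies (rows: states, columns: actions) with swapped action preferences.
pi_a = np.array([[0.9, 0.1],
                 [0.1, 0.9]])
pi_b = np.array([[0.1, 0.9],
                 [0.9, 0.1]])

# Global action marginals: sum_s state_dist(s) * pi(a|s).
marg_a = state_dist @ pi_a
marg_b = state_dist @ pi_b

# Penalty on the action marginal versus state-averaged penalty on the policies.
marginal_penalty = kl(marg_a, marg_b)
per_state_penalty = sum(state_dist[s] * kl(pi_a[s], pi_b[s]) for s in range(2))
```

Both policies execute each action half of the time overall, so the marginal penalty vanishes, while the state-wise penalty is large because the policies disagree in every state.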
For regularization purposes in an iterative policy optimization algorithm (e.g., using mirror descent), one natural choice of the state and action marginals is to take , at the ’th iteration. In Appendix B, we discuss the policy improvement and convergence of such an algorithm. Another source for the marginals is the empirical visitation of states and actions sampled from an expert policy in imitation learning.
Objective (12) has the form of unbalanced optimal transport on the space of occupancy measures. To apply Dykstra’s algorithm, we can therefore add an entropy term to transform (12) into
(13) 
which means setting , , , in Objective (7) (the coefficients and in equation (13) are different from those in equation (12)). Hence, algorithm (10) can be applied with the following proximal functions:
(14)  
(15)  
(16) 
In general, for appropriate choices of and (e.g., ) the proximal operators in (14) and (B.1) have closed-form solutions. However, as discussed in the previous section, in (16) has no closed-form solution (see the detailed derivation of the proximal operators in Appendix B). We can also consider other functions for in this scenario; for example, setting changes the problem into finding a policy which globally matches the distribution under its natural state distribution , i.e., for any (as a constraint, would be a feasible solution when ).
In the next section we propose an actor-critic algorithm for large-scale reinforcement learning.
4.1 Large Scale RL
When are large, policy optimization via Dykstra’s algorithm is challenging because tabular updating of is time-consuming or sometimes even impossible. In addition, it requires knowledge of reward function for the initial distribution and transition model for projection onto in (16). Model estimation usually requires a large number of state-action samples. Also, we might only have off-policy samples or, in imitation learning scenarios, only have access to the marginals through samples observed from the expert. In this section, we derive an off-policy optimization with function approximation to address these problems.
Replacing the last three terms of the objective in (12) with their convex conjugates via dual variables , we get (Appendix C provides derivations for all formulations in this section)
(17) 
where is the convex conjugate (transpose) of .
Having access to samples from a baseline both helps regularize the objective (17) into an easier problem to solve and allows off-policy policy optimization or imitation learning [18]. Yet, by the discussion in Section 4, regularizing with the term in (17) might make the marginal penalties redundant, in particular when and . In the next two subsections we propose different approaches for each of these cases.
4.1.1 or
Without loss of generality, assume marginals and . In this case, regularizing (17) with and under Fenchel duality, we get the off-policy optimization objective
(18) 
where and . Now, the first term is based on an expectation under the baseline and can be estimated from off-policy samples.
In a special case, when in objective (12) and we take as well, similar derivations yield
(19) 
The gradients of can now be computed. Letting and defining to be the softmax operator for any , we have the gradients of with respect to :
(20) 
(21) 
(22) 
The gradient with respect to policy is
(23) 
where can be approximated by some functions and one can apply gradient ascent on and gradient descent on .
4.1.2 and
Following the approach in [29, 18], assuming is known, with the change of variable , we can rewrite (17) with importance sampling weights as a policy evaluation problem
(24) 
The gradients with respect to are as follows:
(25) 
(26) 
(27) 
(28) 
Wrapping around (24) gives the off-policy optimization. Given optimized , the gradient with respect to is
(29) 
5 Demonstrations
In this section, we demonstrate the effectiveness of distributionally-constrained policy optimization with Dykstra’s algorithm. The purpose of our experiments is to answer how distributional penalization on and affects the behavior of the policy, and to study the empirical rate of convergence of Dykstra’s algorithm.
We look into these questions on a grid world with its optimal policy out of (6) shown in Fig. 1. Owing to its simplicity, this environment is suitable for studying the effect of distributional constraints on the policy behavior. To focus on the effects of distributional constraints, we set the coefficient of the entropy term fairly low () in optimizations (6) and (13). (Appendix D provides the numerical settings in the implementation of Dykstra’s algorithm.)
We first observe the independent effect of on the policy, by setting . We use, instead of as an extreme case when to focus on the role . Figure 2(1-7) shows differences among policies when the marginal on actions shifts from a distribution where only equiprobable actions down and left are allowed () towards the case where only up and right are permitted with equal probability (). (The optimal policies in this section are not necessarily deterministic, even though is set to be very small, because of the constraint . In general, the policies out of (13) are not necessarily deterministic either, because of the nonlinear objective.)
In Figure 2(1), under , down is the optimal action in state because this is the only way to get reward (with luck). In 2(2), which changes to a probability on right, the policy eliminates the reliance on chance by switching state to right.
Note that the optimal policy in Figure 1(left) does not include a down move. When down is forced to have nonzero probability, Figures 2(1-6), the policy assigns it to state , towards the case where only up and right are permitted with equal probability ().
Figure 2(7) shows the case where only up and right are allowed. In state , this creates a quandary. Right is unlikely to incur the penalty, but will never allow escape from this state. For this reason, the policy selects up, which with high probability will incur the penalty, but has some probability of escape toward the reward.
Figure 2(8) depicts the convergence of Dykstra’s algorithm towards various . Notably, in all cases the algorithm converges, and the rate of convergence increases following the order of the subfigures.
Next, we test the extreme effect of constraints on the state marginals on the policy via various , by setting and very high. We study the policy when for a single state , and uniform distribution of on the rest of states other than . Figure 3(1-3) shows the policies when . Hitting the wall seems to be a viable strategy to stay and increase the visitation frequency of each of these states. Figure 3(4) depicts the convergence of Dykstra’s algorithm towards various . As shown, the error never gets to zero. This is because by setting , the objective is just to find an occupancy measure with closest state marginal to and can never be zero if is not from a valid occupancy measure.
We also test how imitating a policy with distributional constraints affects the learned policy. For this purpose, we create a new environment with reward at state . The optimal risk-averse policy out of this environment is shown in Figure 4(left). Let be the action marginal corresponding to . Now consider the RL problem with reward of in state constrained by . Figure 4(right) shows the resulting policy . Notice that achieves the action marginal distribution ; however, is quite different from , since the unconstrained optimal policy for the environment with reward of at state is more risk neutral. In contrast, distributionally constraining (as in previous experiments) results in the same policy of as in Figure 4(left). The differences are mostly in states and , where actions can be freely chosen (but not their visitation frequency) and contribution of state which has a lower visitation probability.
Consider constraints on both and . As explained earlier, the policy will then get as close as possible to while satisfying the action distribution . Figure 5(left) shows the optimal policy for with no constraint on . is a distribution where no up is allowed and the other three actions are equiprobable. Figure 5(right) depicts the policy under constraints on both and when is the same distribution as in Figure 3(1). The leftmost column and top row of this policy lead to (0,2) but, in an attempt to satisfy , the policy goes back to the left.
6 Related Works
In this section we review and discuss the related works to our proposed approach.
Objectives and Constraints in Reinforcement Learning.
Posing policy optimization as a constrained linear program on the space of occupancy measures has long been studied [25]. Recent works have expanded the linear programming view through a more general convex optimization framework. For example, [19] unifies policy optimization algorithms in the literature for entropy-regularized average reward objectives. [17, 16, 18] propose entropy-regularized policy optimizations under Fenchel duality to incorporate the occupancy measure constraint. Unlike these works, we look at policy optimization from an optimal transport point of view. To do so, we propose a structured general objective based on Bregman divergence that allows considering relaxations of entropy regularization using marginals. [36] studies a general concave objective in RL and proposes a variational Monte Carlo algorithm using the Fenchel dual of the objective function. Similar to these works, we take advantage of Fenchel duality. However, apart from the different viewpoint and structured objective, our work differs in solving the optimization by breaking up the objective using Dykstra’s algorithm.
[35] proposes various caution penalty functions as the RL objective and a primal-dual approach to solve the optimization. One of these objectives is a penalty function on , which is a part of our proposed unbalanced formulation. Apart from our problem formulation, in this work we focus on a distributional penalty on global action execution which, to the best of our knowledge, has not been studied before.
In constrained MDPs, a second reward is used to define a constrained value function [2, 12]. Here and the constraint is in the form of , where is a constant. Thus, considering as the cost for taking action at state , a constrained MDP optimizes the policy with a fixed upper bound for the expected cost. Rather than introducing a fixed scalar restriction (A), our formulation allows distributional constraints over both the action and state spaces (i.e. and ). The source of these distributional constraints may vary from an expert policy to the environmental costs, and we can apply them via penalty functions. In special cases, when an action can be identified by its individual cost, a constraint on expected cost can be viewed as a special case of a marginal constraint on . For instance, in the grid world of Figure 1, if the cost for up and right is significantly higher than for down and left, then a limited budget (small expected cost) is essentially equivalent to having a marginal constraint supported on down and left. However, in general, when the cost varies across states for each action, the expected cost does not provide much guidance over global usage of actions or visitation of states as our formulation does.
Reinforcement Learning and Optimal Transport
Optimal transport in terms of Wasserstein distance has been proposed in the RL literature. [37] views policy optimization as Wasserstein gradient flow on the space of policies. [20] defines behavioral embedding maps on the space of trajectories and uses an approximation of Wasserstein distance between measures on embedding space as regularization for policy optimization. Marginals of occupancy measures can be viewed as embeddings via state/action distribution extracting maps. Our work defines an additive structure on these embedding functionals which is broken into Bregman projections using Dykstra’s algorithm.
Imitation Learning and Optimal Transport
In the imitation learning literature, [34] proposed an adversarial inverse reinforcement learning method which minimizes the Wasserstein distance to the occupancy measure of the expert policy using the dual formulation of optimal transport. [10] minimized the primal problem of Wasserstein minimization and [21] minimize the Sinkhorn Divergence to the expert’s occupancy measure. These works are fundamentally different from our approach as we are not solving the inverse RL problem and we view RL itself as a problem of stochastic assignment of actions to states. The type of distributional constraints via unbalanced optimal transport proposed in our work can be considered as relaxation of the idea of matching expert policy occupancy measures. We consider matching the distribution of global action executions and state visitations of the expert policy.
Related works in Optimal Transport
Other Settings
In the unbalanced OT formulation of (12) we used penalty functions . One can apply other functions, such as the indicator function, to enforce constraints on marginals like , as discussed in Section 4. However, using the constraint in (12) could be problematic, as it can easily be incompatible with the occupancy measure constraint. If does not come from a policy, then the optimization is infeasible. Despite this, setting and taking in (12) for the ’th iteration of an iterative policy optimization algorithm, (12) results in an objective similar to TRPO [27].
7 Conclusion
We have introduced distributionally-constrained policy optimization via unbalanced optimal transport. Extending prior work, we recast RL as a problem of unbalanced optimal transport via minimization of an objective with a Bregman divergence, which is optimized through Dykstra’s algorithm. We illustrate the theoretical approach through the convergence and policies resulting from marginal constraints on and , both individually and together. The result unifies different perspectives on RL and naturally allows incorporation of a wide array of realistic constraints on policies.
References
[1] (2019) Understanding the impact of entropy on policy optimization. arXiv:1811.11214.
[2] (1999) Constrained Markov decision processes. Vol. 7, CRC Press.
[3] (2016) Stochastic optimization for large-scale optimal transport. arXiv:1605.08527.
[4] (2013) Model-independent bounds for option prices—a mass transport approach. Finance and Stochastics 17 (3), pp. 477–501.
[5] (1995) Dynamic programming and optimal control. Vol. 1, Athena Scientific, Belmont, MA.
[6] (1999) Nonlinear programming. Athena Scientific, Belmont, MA, second edition.
[7] (2017) Scaling algorithms for unbalanced transport problems. arXiv:1607.05816.
[8] (2019) Unbalanced optimal transport: dynamic and Kantorovich formulation. arXiv:1508.05216.
[9] (2013) Sinkhorn distances: lightspeed computation of optimal transportation distances. arXiv:1306.0895.
[10] (2020) Primal Wasserstein imitation learning. arXiv:2006.04678.
[11] (1983) An algorithm for restricted least squares regression. Journal of the American Statistical Association 78 (384), pp. 837–842.
[12] (2006) Reinforcement learning for MDPs with constraints. In European Conference on Machine Learning, pp. 646–653.
[13] (2017) Reinforcement learning with deep energy-based policies. arXiv:1702.08165.
[14] (2013) Automated option pricing: numerical methods. International Journal of Theoretical and Applied Finance 16 (08), pp. 1350042.
[15] (2017) Optimal entropy-transport problems and a new Hellinger–Kantorovich distance between positive measures. Inventiones mathematicae 211 (3), pp. 969–1117. ISSN 1432-1297.
[16] (2019) DualDICE: behavior-agnostic estimation of discounted stationary distribution corrections. arXiv:1906.04733.
[17] (2019) AlgaeDICE: policy gradient from arbitrary experience. arXiv:1912.02074.
[18] (2020) Reinforcement learning via Fenchel-Rockafellar duality. arXiv:2001.01866.
[19] (2017) A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798.
[20] (2020) Learning to score behaviors for guided policy optimization. arXiv:1906.04349.
[21] (2020) Imitation learning with Sinkhorn distances. arXiv:2008.09167.
[22] (2010) Relative entropy policy search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 24.
[23] (2020) Computational optimal transport. arXiv:1803.00567.
[24] (2015) Entropic approximation of Wasserstein gradient flows. SIAM Journal on Imaging Sciences 8 (4), pp. 2323–2351.
[25] (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
[26] (2018) Equivalence between policy gradients and soft Q-learning. arXiv:1704.06440.
[27] (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
[28] (2018) Large-scale optimal transport and mapping estimation. arXiv:1711.02283.
[29] (2016) An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research 17 (1), pp. 2603–2631.
[30] (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
[31] (2008) Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, pp. 1032–1039.
[32] (2008) Optimal transport: old and new. Vol. 338, Springer Science & Business Media.
[33] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256.
[34] (2019) Wasserstein adversarial imitation learning. arXiv:1906.08113.
[35] (2020) Cautious reinforcement learning via distributional risk in the dual domain. arXiv preprint arXiv:2002.12475.
[36] (2020) Variational policy gradient method for reinforcement learning with general utilities. arXiv:2007.02151.
[37] (2018) Policy optimization as Wasserstein gradient flows. In International Conference on Machine Learning, pp. 5737–5746.
Appendix A Fenchel Dual and Proofs in Section 3
For any function , its convex conjugate (or Fenchel dual) is defined as . If is proper, convex and lower-semicontinuous, then has the same properties and one can write . If is strictly convex and smooth on , then and are bijective maps between and int(domf), i.e., . It is easy to verify that

For , .

Consider with , where is an underlying space. For a fixed , if , . Also, if , then and .
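A convex conjugate can be sanity-checked numerically by taking the supremum over a fine grid. The sketch below uses the standard definition f*(y) = sup_x (x y − f(x)) with the example f(x) = x log x − x on the positive reals, whose conjugate is known to be exp(y); this example function, the grid, and the helper name are assumptions chosen for illustration and are not necessarily the cases listed above.

```python
import numpy as np

def f(x):
    """Example convex function on x > 0 (an assumed illustration)."""
    return x * np.log(x) - x

def conjugate_on_grid(func, y, grid):
    """Approximate f*(y) = sup_x (x * y - f(x)) by a maximum over a grid."""
    return float(np.max(grid * y - func(grid)))

grid = np.linspace(1e-6, 50.0, 200_001)
ys = np.array([-1.0, 0.0, 0.5, 2.0])
numeric = np.array([conjugate_on_grid(f, y, grid) for y in ys])
closed_form = np.exp(ys)  # known conjugate of x log x - x
```

The maximizer is x = exp(y), so the grid only needs to cover that range for the values of y being tested.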
Proof of Proposition 3.1.
Under condition (8), the Fenchel-Legendre duality holds and the solution of optimization (7) can be recovered via
(30) 
with the primaldual relationship
Applying coordinate descent on (30), with initial condition , and setting , , at we get the iteration
(31)  
where . The primal problem of optimization in (31) is