A Policy Efficient Reduction Approach to Convex Constrained Deep Reinforcement Learning

08/29/2021 · by Tianchi Cai, et al.

Although well-established in general reinforcement learning (RL), value-based methods are rarely explored in constrained RL (CRL) for their incapability of finding policies that can randomize among multiple actions. To apply value-based methods to CRL, a recent groundbreaking line of game-theoretic approaches uses the mixed policy that randomizes among a set of carefully generated policies to converge to the desired constraint-satisfying policy. However, these approaches require storing a large set of policies, which is not policy efficient, and may incur prohibitive memory costs in constrained deep RL. To address this problem, we propose an alternative approach. Our approach first reformulates the CRL to an equivalent distance optimization problem. With a specially designed linear optimization oracle, we derive a meta-algorithm that solves it using any off-the-shelf RL algorithm and any conditional gradient (CG) type algorithm as subroutines. We then propose a new variant of the CG-type algorithm, which generalizes the minimum norm point (MNP) method. The proposed method matches the convergence rate of the existing game-theoretic approaches and achieves the worst-case optimal policy efficiency. The experiments on a navigation task show that our method reduces the memory costs by an order of magnitude, and meanwhile achieves better performance, demonstrating both its effectiveness and efficiency.


1 Introduction

Table 1: Comparison of different works. Time complexity (the number of RL tasks solved), policy efficiency (the number of neural networks stored), and whether extra hyperparameters are introduced are compared, when using any deep RL method to find an ε-approximate policy for a convex constrained RL problem with a d-dimensional measurement function. The compared works are Le et al. (2019), Miryoosefi et al. (2019), Ours (Vanilla CG), and Ours (Modified MNP).

When applying reinforcement learning (RL) to many real-world tasks, it is inevitable to impose constraints to regulate the behavior of the resulting policy. Examples include adding risk constraints to avoid damaging expensive robots (Blackmore et al., 2011; Ono et al., 2015), placing safety and comfort constraints on autonomous driving (Lefevre et al., 2015; Shalev-Shwartz et al., 2016; Isele et al., 2018; Chen et al., 2019), and introducing diversity constraints to encourage exploration (Hong et al., 2018; Miryoosefi et al., 2019). In general, such problems of learning desired policies under constraints can be cast into the constrained reinforcement learning (CRL) formalism.

As is well acknowledged, model-free RL methods can be classified into two major categories, namely value-based and policy-based methods (Sutton and Barto, 2018). However, compared with the large volume of literature studying value-based methods in the general RL setting, they are rarely investigated in the CRL setting. This somewhat surprising phenomenon has its root cause in the fact that, in CRL, a constraint-satisfying policy may require delicate randomization between different behaviors, and hence selecting multiple actions with specific probabilities is necessary (cf. Example 1). Most value-based algorithms, such as Q-learning (Sutton and Barto, 2018), DQN (Mnih et al., 2013), and their variants (Van Hasselt et al., 2015; Wang et al., 2016; Lillicrap et al., 2015; Fujimoto et al., 2018; Barth-Maron et al., 2018), may fail to find any constraint-satisfying policy in CRL. Therefore, the CRL literature has traditionally focused on policy-based methods (Paternain et al., 2019; Tessler et al., 2018; Achiam et al., 2017; Chow et al., 2017; Chow and Ghavamzadeh, 2014). Recently, value-based algorithms have achieved state-of-the-art performance on various RL tasks (Van Hasselt et al., 2015; Wang et al., 2016; Fujimoto et al., 2018; Barth-Maron et al., 2018). It is thus tempting to ask whether it is possible to solve CRL problems with value-based algorithms.

A new line of research derived from the game-theoretic perspective has made a breakthrough in this direction (Le et al., 2019; Miryoosefi et al., 2019). This line of game-theoretic approaches reformulates the CRL problem as a two-player zero-sum repeated game and solves it with no-regret online learning. In each round, one player, who uses an online learning algorithm, plays against the other player, who uses an RL algorithm that finds a policy maximizing the value of the current game. The policy found by the RL player is then stored. It can be shown that after a certain number of rounds, the mixed policy that uniformly randomly selects one of the found policies converges to the desired constraint-satisfying policy. However, storing all the policies found by the RL player is not policy efficient and may incur very high memory costs. In particular, when deep RL methods are utilized, even on some simple tasks, these game-theoretic approaches need to store dozens to hundreds of neural networks to find a constraint-satisfying policy (cf. Section 5). In theory, to obtain an ε-approximate policy in CRL, these game-theoretic approaches require storing O(1/ε^2) policies, which is a consequence of their reliance on no-regret online learning (Freund and Schapire, 1999; Abernethy et al., 2011; Hazan, 2012). Given the high memory costs, the policy inefficiency of these game-theoretic approaches makes them impractical to work with deep RL methods.

To improve policy efficiency, we propose a novel vector space reduction approach to solve CRL problems. Instead of taking the game-theoretic perspective, we reduce the CRL problem over a policy space to an equivalent distance minimization problem over a vector space. We then show that this distance minimization problem can be solved by a specially designed conditional gradient (CG) algorithm, whose linear optimization oracle is constructed using an RL algorithm. Consequently, this reduction yields a meta-algorithm, which can be instantiated by any variant of CG and any off-the-shelf RL method. Specifically, in each iteration, the RL algorithm finds a policy, and this policy is stored; the mixed policy that selects all found policies with appropriate weights (e.g., step sizes) converges to a desired constraint-satisfying policy. The main benefit of our reduction approach is that it substitutes the no-regret online learning techniques with CG-type methods, so it is no longer necessary to store all found policies. However, since the step sizes of the vanilla CG are non-zero, directly applying it assigns non-zero weights to all found policies and does not improve policy efficiency.

To this end, we propose a new algorithm, based on a variant of CG called the minimum norm point (MNP) method (Wolfe, 1976), that achieves optimal policy efficiency. We extend the vanilla MNP to solve a more general problem, where the distance to a convex set is minimized over a compact convex set. Inspired by the minor cycle technique (Wolfe, 1976) in MNP, our modified MNP method reassigns the weights of all found policies and maintains an active set, which only contains policies with non-zero weights. After the weight adjustment, policies with weight zero are eliminated from the active set immediately to cut the memory costs. To solve CRL problems with d-dimensional measurement vectors, our method stores no more than d + 1 policies throughout the learning process. Notably, this constant is shown to be worst-case optimal. Moreover, with a carefully refined analysis, our method solves the general problem with a faster convergence guarantee than the MNP method: to achieve an ε-approximate solution in a d-dimensional space, our method attains a tighter convergence bound than that of Chakrabarty et al. (2014), with the same memory cost (details in Table 1). We compare our method with the game-theoretic approach (Miryoosefi et al., 2019) on a navigation task, using different RL methods to construct the oracle. In both the tabular RL and the deep RL cases, our method demonstrates superior performance and policy efficiency. In particular, in the deep RL cases, our method reduces the memory costs by an order of magnitude. In summary, our approach enables efficiently utilizing value-based RL methods to solve CRL, and the improved (worst-case optimal) policy efficiency makes it especially appealing to applications using deep RL methods.

2 Background

A vector-valued Markov decision process is defined as a tuple (S, A, P, μ_0, m, γ), where S is a set of states s, A is a set of actions a, P(s' | s, a) is a transition probability function that describes the dynamics of the system, μ_0 defines the initial state distribution, m(s, a) ∈ R^d is a d-dimensional measurement function that may measure reward, risk, or other constraint-related quantities, and γ is a discount factor.

Actions are typically selected according to some (stationary) policy. A policy π maps states to probability distributions over actions, and π(a | s) denotes the probability of selecting action a in state s. We assume that policies under consideration are selected from some candidate policy set Π. For example, in policy-based methods, Π is usually the set of all stationary policies, and in value-based methods, Π is typically the set of all deterministic policies. For a policy π ∈ Π, we define the long-term measurement z(π) as the expectation of the discounted cumulative measurements

(1)   z(π) := E[ Σ_{t=0}^{∞} γ^t m(s_t, a_t) ],

where the expectation is over the random process described above.

To enable utilizing value-based methods to solve CRL problems, we also consider mixed policies, which are distributions over the candidate policy set Π. We define Δ(Π) to be the set of all mixed policies generated by Π. To execute a mixed policy μ ∈ Δ(Π), at the start of an episode we select a policy π ~ μ, and then execute π for the entire episode. The long-term measurement of a mixed policy is defined accordingly:

(2)   z(μ) := E_{π~μ}[ z(π) ].

In the following, we focus on the convex constrained RL problem, also known as the feasibility problem, which generalizes inequality constraints to convex constraints (Miryoosefi et al., 2019). A feasibility problem is specified by a closed and convex set C ⊆ R^d. The goal is to find a policy whose long-term measurement lies inside C:

(3)   find μ ∈ Δ(Π) such that z(μ) ∈ C.

A policy is feasible if it satisfies the constraint, and the problem is feasible if a feasible policy exists. This formulation can also handle tasks that maximize one measurement (e.g., reward) under convex constraints. Such problems can be solved by performing a binary search over the maximum achievable reward value, at each iteration augmenting the constraints with an inequality reward constraint (reward no less than the currently iterated value), as sketched below.
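
The following sketch shows one way such a binary search could be organized; `solve_feasibility` and `intersect_halfspace` are illustrative names for a feasibility solver (e.g., an instantiation of the meta-algorithm below) and a constraint-set helper, not part of any particular library.

```python
def maximize_reward(solve_feasibility, base_set, r_lo, r_hi, tol=1e-3):
    """Binary-search the best achievable reward under convex constraints.

    solve_feasibility(C) is assumed to return a feasible mixed policy for the
    convex target set C, or None if no feasible policy is found (illustrative
    API).  The reward is assumed to be coordinate 0 of the measurement vector.
    """
    best_policy = None
    while r_hi - r_lo > tol:
        r_mid = (r_lo + r_hi) / 2.0
        # Augment the constraints with the halfspace "reward >= r_mid".
        augmented = base_set.intersect_halfspace(coord=0, lower_bound=r_mid)
        policy = solve_feasibility(augmented)
        if policy is not None:
            best_policy, r_lo = policy, r_mid   # feasible: raise the target
        else:
            r_hi = r_mid                        # infeasible: lower the target
    return best_policy
```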

Though both policy-based and value-based methods are well established in general RL, in the feasibility problem the feasible policies may require choosing among multiple actions with specific probabilities, a requirement that many value-based methods cannot satisfy. We illustrate this difficulty with the following example.

Example 1.

We consider the task of playing the Rock, Paper, Scissors game. For simplicity, we assume the environment selects one of the three actions uniformly at random, and the game terminates after a fixed number of rounds. Let the measurement vector be the corresponding basis vector of R^3 if the agent wins with one of the three actions, and the zero vector in case of a tie or a loss. Consider the feasibility problem specified by a set C that requires the agent to win with each of the three actions with at least a given probability in expectation. The only feasible policy for this task is then to select the three actions with equal probability. However, most value-based methods compute a scalar value for each state-action pair and select an action achieving the maximum value at the current state. Since value-based methods cannot specify the probability of choosing each action, they may fail to solve CRL problems.
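
As a quick numerical illustration of this example (assuming a single round against a uniformly random opponent; the exact probability threshold used in the paper is not reproduced here), the expected measurement of any deterministic policy has two zero coordinates, whereas the uniform policy spreads mass over all three:

```python
import numpy as np

# One round of Rock, Paper, Scissors against a uniformly random opponent.
# The measurement is the basis vector e_a if the agent wins with action a,
# and the zero vector on a tie or a loss.  Action a beats action (a - 1) % 3.
def expected_measurement(action_probs):
    z = np.zeros(3)
    for a, p_a in enumerate(action_probs):
        # The agent wins with action a only when the opponent plays (a - 1) % 3,
        # which happens with probability 1/3.
        z[a] += p_a / 3.0
    return z

print(expected_measurement([1.0, 0.0, 0.0]))   # deterministic "Rock": [1/3, 0, 0]
print(expected_measurement([1/3, 1/3, 1/3]))   # uniform policy:       [1/9, 1/9, 1/9]
```

In this single-round setting, any target set requiring a strictly positive winning probability for every action therefore excludes every deterministic policy.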

One workaround is to use mixed policies. However, the main difficulty of using mixed policies is that when each individual policy is found by a deep RL method, the memory costs can be huge. To store such a mixed policy, the neural networks corresponding to all policies with non-zero probability have to be stored. Hence the memory cost of storing a mixed policy is proportional to the number of candidate policies that receive non-zero weight, i.e., the size of its support. Since a neural network may have billions to trillions of parameters (Brown et al., 2020; Fedus et al., 2021), storing a large number of neural networks is impractical in many deep RL tasks. Therefore, we are interested in mixed policies that are policy efficient, i.e., that place non-zero weight on only a small number of policies.

3 A Vector Space Reduction Approach

Our vector space reduction approach reformulates the original CRL problem over a policy space to an equivalent distance minimization problem over a vector space. The key is to construct a specific linear optimization oracle using any RL algorithm, which enables solving this distance minimization problem with any variant of the CG method. This reduction yields a meta-algorithm for the CRL problems, which can be instantiated by any CG method and any RL algorithm. We illustrate this with the vanilla CG method.

3.1 Equivalent Distance Minimization Problem

We first reformulate the feasibility problem as an equivalent distance minimization problem over the policy space. For a closed and convex set C, consider the problem of finding a mixed policy whose long-term measurement is closest to the target convex set,

(4)   min_{μ ∈ Δ(Π)} d_C(z(μ)),

where d_C(x) := ||x - Proj_C(x)||_2 is the Euclidean distance of x to the set C, and Proj_C(x) is the Euclidean projection of x onto the set C.
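
As a minimal sketch of these two quantities (here for an axis-aligned target set with upper bounds only, which matches the flavor of the navigation task in Section 5; the numerical bounds below are illustrative):

```python
import numpy as np

def project_onto_box(x, upper):
    """Euclidean projection onto C = {z : z <= upper} (coordinate-wise)."""
    return np.minimum(x, upper)

def distance_to_box(x, upper):
    """Euclidean distance d_C(x) = ||x - Proj_C(x)||_2."""
    return float(np.linalg.norm(x - project_onto_box(x, upper)))

x = np.array([12.0, 0.8])           # e.g. (expected steps, expected risky steps)
upper = np.array([11.0, 0.5])       # illustrative upper bounds
print(project_onto_box(x, upper))   # [11.   0.5]
print(distance_to_box(x, upper))    # sqrt(1.0**2 + 0.3**2) ≈ 1.044
```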

For this minimization problem, a policy is defined to be optimal if it minimizes (4). Otherwise, the approximation error of a mixed policy μ is defined as

(5)   err(μ) := d_C(z(μ)) - min_{μ' ∈ Δ(Π)} d_C(z(μ')).

A policy is defined to be an ε-approximate policy if its approximation error is no larger than ε.

When the CRL problem (3) is feasible, the equivalence between being optimal for (4) and being feasible for the CRL problem is easily established. Since a feasible policy of the CRL problem has its long-term measurement inside C, it attains zero for the non-negative distance function and hence is optimal for (4). Conversely, any optimal policy for (4) has its long-term measurement inside C and is thus a feasible policy for the CRL problem.

From a geometric perspective, let Z := {z(π) : π ∈ Π} denote the set of all long-term measurements achievable by policies in the candidate policy space Π. The set of long-term measurements achievable by mixed policies is the convex hull conv(Z), and hence is convex and compact. Therefore the distance minimization problem (4) over the policy space is equivalent to the following distance minimization problem over the closed and convex set conv(Z):

(6)   min_{x ∈ conv(Z)} d_C(x).

If the CRL problem is feasible, then any x ∈ conv(Z) that minimizes this distance function over conv(Z) corresponds to a feasible policy for the original problem. Hence we have reduced the original CRL problem over a policy space to an equivalent distance minimization problem (6) over the closed and convex set conv(Z) in a vector space.

3.2 A Solution with Vanilla Conditional Gradient

Since it is unclear how to project onto the implicitly defined set conv(Z), this distance minimization problem (6) is non-trivial. We overcome this difficulty by proposing a specially designed conditional gradient (CG) algorithm, where the linear optimization oracle used by the CG method is constructed using any off-the-shelf RL algorithm.

We briefly review the CG method. CG is a first-order method that minimizes a convex function f over a compact and convex set X, using a linear optimization oracle (Frank et al., 1956)

(7)   LMO(g) := argmin_{x ∈ X} ⟨g, x⟩.

In each iteration t, CG (Algorithm 3 in Appendix A.1) calculates the gradient at the current point x_{t-1} and invokes the linear optimization oracle to find an improving point v_t = LMO(∇f(x_{t-1})). It then updates the iterate by taking a convex average of the current point and the improving point, x_t = (1 - η_t) x_{t-1} + η_t v_t, where, at step t, the step size is typically set to η_t = 2/(t + 1) (Jaggi, 2013).

Input: target set C, learning rates η_t = 2/(t + 1)
Initialize: an arbitrary policy π_0 ∈ Π, mixed policy μ_0 = π_0, x_0 = z(π_0)

1:  for t = 1, …, T do
2:     (π_t, z_t) ← Oracle(x_{t-1} - Proj_C(x_{t-1}))
3:     x_t ← (1 - η_t) x_{t-1} + η_t z_t,   μ_t ← (1 - η_t) μ_{t-1} + η_t π_t // A new policy is stored
4:  end for
5:  return μ_T // z(μ_T) = x_T
Algorithm 1 Solve a CRL Problem with Vanilla CG

We first calculate the gradient of the target function with respect to x. Equation (1.1) of Holmes (1973) shows that the gradient of the squared distance (1/2) d_C(x)^2 with respect to x is x - Proj_C(x), which is non-zero if x is outside C and zero otherwise. Hence, applying the chain rule, it is straightforward that the gradient of d_C(x) at any x outside C is (x - Proj_C(x)) / d_C(x), so both functions share the gradient direction x - Proj_C(x).

We construct the desired linear optimization oracle, denoted Oracle(·), such that for any g ∈ R^d, it outputs a policy together with the corresponding measurement vector,

(8)   (π_g, z(π_g)) = Oracle(g),

satisfying z(π_g) ∈ argmin_{z ∈ conv(Z)} ⟨g, z⟩. To construct the linear optimization oracle, the improving policy can in fact be found by using any off-the-shelf RL algorithm to solve a specific RL task. In particular, for any g ∈ R^d, a policy that minimizes

(9)    ⟨g, z(π)⟩ = E[ Σ_{t=0}^{∞} γ^t ⟨g, m(s_t, a_t)⟩ ]
(10)              = - E[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ],   where r(s, a) := -⟨g, m(s, a)⟩,

is a policy that maximizes the scalar reward r(s, a) at each step. Therefore any reinforcement learning algorithm that maximizes this scalar reward finds an improving policy, and the RL algorithm that best suits the underlying problem can be used to find it.
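
One way to realize this oracle in practice is to wrap the environment so that each step returns the scalar reward -⟨g, m(s, a)⟩ and then train any off-the-shelf RL agent on the wrapped environment. The sketch below assumes a classic Gym-style environment (4-tuple `step` API) that exposes the per-step measurement vector under a hypothetical `info["measurement"]` key.

```python
import numpy as np
import gym

class ScalarizedEnv(gym.Wrapper):
    """Turns the vector-valued measurement into the scalar reward -<g, m(s, a)>.

    Assumes the wrapped environment reports the per-step measurement vector in
    info["measurement"] (an illustrative convention, not a standard Gym field).
    """

    def __init__(self, env, g):
        super().__init__(env)
        self.g = np.asarray(g, dtype=np.float64)

    def step(self, action):
        obs, _, done, info = self.env.step(action)
        m = np.asarray(info["measurement"], dtype=np.float64)
        reward = -float(self.g @ m)   # maximizing this minimizes <g, z(pi)>
        return obs, reward, done, info
```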

Evaluating the measurement vector z(π) is straightforward in online settings, where Monte Carlo rollouts estimate z(π) directly. In batch or offline settings, various off-policy evaluation methods, such as importance sampling (Precup, 2000; Precup et al., 2001) or doubly robust estimators (Jiang and Li, 2016; Dudík et al., 2011), can be used to estimate z(π).

With the linear optimization oracle constructed from any RL method, the distance minimization problem (6) can be solved by any variant of the CG-type algorithms. When the vanilla CG algorithm is used, the resulting method is given in Algorithm 1. In each iteration, the oracle is invoked once to find an improving policy π_t together with its long-term measurement z_t = z(π_t). Then the current mixed policy is updated by selecting π_t with weight η_t and scaling the weights of all previously found policies by (1 - η_t). The iterate x_t is updated in the same way, ensuring the invariant that x_t = z(μ_t). The convergence of the vanilla CG is well established (Jaggi, 2013; Lan, 2020), which readily implies that Algorithm 1 converges at a sublinear rate. However, since the learning rates of the vanilla CG are always non-zero, after T iterations all T found policies have non-zero weight in μ_T. When the policies are found by deep RL methods, this requires storing T neural networks and is not policy efficient. We conclude that Algorithm 1 matches the convergence rate and policy efficiency of the existing game-theoretic approaches.
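
A compact sketch of the resulting meta-algorithm (Algorithm 1). Here `rl_oracle(g)` stands for training an RL agent on the scalarized reward -⟨g, m⟩ and returning the found policy together with an estimate of its long-term measurement, and `project_onto_C` is the Euclidean projection onto the target set; both are assumptions of this sketch rather than prescribed interfaces.

```python
import numpy as np

def solve_crl_vanilla_cg(rl_oracle, project_onto_C, dim, T):
    """Vanilla-CG meta-algorithm: returns (policies, weights) defining a mixed policy."""
    policies, weights = [], []
    x = None                                    # running measurement z(mu_t)
    for t in range(1, T + 1):
        eta = 2.0 / (t + 1)                     # standard CG step size (equals 1 at t = 1)
        g = np.zeros(dim) if x is None else x - project_onto_C(x)
        pi_t, z_t = rl_oracle(g)                # improving policy and its measurement
        policies.append(pi_t)
        weights = [(1.0 - eta) * w for w in weights] + [eta]
        x = z_t if x is None else (1.0 - eta) * x + eta * z_t
    return policies, weights                    # every found policy keeps a non-zero weight
```

Note that the returned mixed policy places non-zero weight on all T found policies, which is exactly the policy inefficiency discussed above.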

4 A Policy Efficient CG Approach

Compared with the game-theoretic approaches, our vector space reduction approach does not require storing all found policies. However, directly applying the vanilla CG method assigns non-zero weights to all found policies and does not improve policy efficiency. To improve policy efficiency, we propose a new CG-type method, based on a variant of CG called the minimum norm point (MNP) method (Wolfe, 1976). We extend the MNP to solve a more general problem. When applied to the CRL problem, we show that our proposed method matches the convergence rate of existing approaches and achieves optimal policy efficiency.

4.1 Minimum Norm Point Method

To find policy efficient mixed policies, we turn to variants of CG-type algorithms, especially those that maintain an active set and assign zero weight to certain iterated points. When the target convex set is a singleton, a policy efficient solution can be readily found using Wolfe's method for the minimum norm point (MNP) problem over a polytope (Wolfe, 1976; De Loera et al., 2018).

When the target set C = {y} is a singleton containing a single point y, the distance minimization problem (6) simplifies to finding a point in the polytope conv(Z) that is closest to y,

(11)   min_{x ∈ conv(Z)} ||x - y||_2,

which can be readily solved by Wolfe's method for finding the minimum norm point (MNP) in a polytope.

In MNP (Algorithm 4 in Appendix A.2), the loop of CG is called a major cycle, and the convex averaging step is replaced by weight reassignment processes, called minor cycles. MNP maintains an active set S, and the current iterate x is represented as a convex combination of the points in S.

Recall that for a set of points S = {v_1, …, v_k}, the affine hull is defined as

(12)   aff(S) := { Σ_{i=1}^{k} λ_i v_i : Σ_{i=1}^{k} λ_i = 1 }.

The convex hull conv(S) is defined similarly, with the additional requirement that λ_i ≥ 0 elementwise. The affine minimizer of S with respect to a point y is defined as AffMin(S, y) := argmin_{x ∈ aff(S)} ||x - y||_2. When the point y is treated as the origin, the affine minimizer u with respect to S satisfies the affine minimizer property

(13)   ⟨u - y, x - u⟩ = 0   for all x ∈ aff(S).

In a major cycle, when the target set C = {y} is a singleton, we have Proj_C(x) = y for all x. Hence, the MNP uses the oracle in the same way as the CG algorithm. To keep the active set small, the MNP repeatedly eliminates points from the active set using minor cycles. The minor cycles are executed until S becomes a corral, that is, until its affine minimizer lies inside its convex hull. To maintain the corral property of the active set S, in a minor cycle, let u be the affine minimizer of S, i.e., the point of the affine hull aff(S) closest to y. If u is in the relative interior of the convex hull conv(S), then the minor cycle terminates. Otherwise, the iterate x is updated to the point nearest to u on the line segment between x and u that remains in conv(S). Thus x is updated to a boundary point of conv(S), and any point not on the face of conv(S) in which x lies is deleted. Note that singletons are always corrals, and hence the minor cycles terminate after a finite number of runs, after which x is updated to the affine minimizer of the corral S.

The process AffMin(S, y) returns the affine minimizer u of S and the coefficient vector λ expressing u as an affine combination of points in S, where λ_i is the weight associated with v_i ∈ S. The process can be straightforwardly implemented using linear algebra; Wolfe (1976) also provides a more efficient implementation that uses a triangular array representation of the active set.
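
For reference, a direct linear-algebra implementation of the affine minimizer (the KKT system of minimizing ||Σ_i λ_i v_i - y||^2 subject to Σ_i λ_i = 1) could look as follows; this is a plain sketch, not Wolfe's more efficient triangular-array scheme.

```python
import numpy as np

def affine_minimizer(points, y):
    """Return (u, lam): the point u in the affine hull of `points` closest to y,
    and affine coefficients lam with u = sum_i lam[i] * points[i], sum(lam) = 1."""
    P = np.asarray(points, dtype=np.float64)    # shape (k, d): one point per row
    y = np.asarray(y, dtype=np.float64)
    k = P.shape[0]
    A = np.zeros((k + 1, k + 1))
    A[:k, :k] = P @ P.T                         # Gram matrix of the points
    A[:k, k] = 1.0                              # Lagrange multiplier column
    A[k, :k] = 1.0                              # affine constraint sum(lam) = 1
    b = np.concatenate([P @ y, [1.0]])
    sol = np.linalg.lstsq(A, b, rcond=None)[0]  # lstsq tolerates affine dependence
    lam = sol[:k]
    return lam @ P, lam
```

For example, affine_minimizer([[0, 0], [2, 0]], [1, 5]) returns the point [1, 0] with coefficients [0.5, 0.5].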

In the singleton case, the MNP solves the distance minimization problem (6), and hence the CRL problem (3). Since the active set is a corral and therefore affinely independent, the number of policies stored is at most d + 1 at any time. After t major cycle steps, the MNP method is shown to converge linearly, at a rate governed by a constant determined by the polytope, as defined in Lacoste-Julien and Jaggi (2015).

Input: Oracle, target set C.
Initialize: an arbitrary policy π_0, current point x = z(π_0), active set S = {x}, active policy set P = {π_0}, weight λ_{π_0} = 1 for the policies in P.

1:  for t = 1, …, T do // Major cycle
2:     (π_t, z_t) ← Oracle(x - Proj_C(x))
3:     y ← Proj_C(x) // Projection step
4:     S ← S ∪ {z_t},  P ← P ∪ {π_t},  λ_{π_t} ← 0
5:     while True do // Minor cycle
6:        (u, α) ← AffMin(S, y)
7:        if u ∈ conv(S) then //  S is a corral
8:           break
9:        else
10:          θ ← max{θ ∈ [0, 1] : (1 - θ) x + θ u ∈ conv(S)},  x ← (1 - θ) x + θ u
11:          λ_π ← (1 - θ) λ_π + θ α_π, for all π ∈ P
12:          // Remove policies with weight zero
13:          S ← {z(π) : π ∈ P, λ_π > 0},  P ← {π ∈ P : λ_π > 0} // Save memory
14:       end if
15:     end while
16:     x ← u,  λ_π ← α_π for all π ∈ P
17:  end for
18:  return the mixed policy μ_T that selects each π ∈ P with probability λ_π // z(μ_T) = x
Algorithm 2 Solve a CRL Problem with Modified MNP

4.2 Modified MNP and Theoretical Analysis

To solve the general case where C may not be a singleton, we propose a modified MNP method. In the general non-singleton case, the target function is in fact not strongly convex (Proposition 4.1). We analyze the complexity of our modified MNP method and improve upon the convergence rate implied by the analysis of Chakrabarty et al. (2014) (Theorem 4.3). Moreover, we show that maintaining an active policy set of size d + 1 is worst-case optimal (Theorem 4.4). Therefore we conclude that the proposed modified MNP method matches the convergence rate of the existing game-theoretic methods and achieves an optimal policy efficiency of storing no more than d + 1 policies.

As illustrated in Algorithm 2, we modify the MNP by adding a projection step to the major cycle (line 3). In each major cycle, the modified MNP minimizes the distance to the projected point y = Proj_C(x). Hence the resulting algorithm is equivalent to Wolfe's MNP method when C is a singleton; otherwise, the oracle step calculates the gradient in the same way as the CG method. Intuitively, at each major step, if we make significant progress toward the projected point, then the distance to the convex set decreases by at least the same amount.
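
The sketch below shows one major cycle of this scheme, reusing `affine_minimizer` from the previous sketch; the bookkeeping of the parallel list of stored policies and the usual numerical safeguards are omitted, so this is an illustration of the control flow rather than a faithful reimplementation of Algorithm 2.

```python
import numpy as np

def modified_mnp_major_cycle(x, S, lam, rl_oracle, project_onto_C, eps=1e-12):
    """One major cycle: S is the list of active measurement vectors, lam their
    convex weights (x == sum_i lam[i] * S[i]); returns the updated (x, S, lam)."""
    y = project_onto_C(x)                       # projection step (line 3 of Algorithm 2)
    _, z_new = rl_oracle(x - y)                 # improving policy's measurement
    S = list(S) + [z_new]
    lam = np.append(np.asarray(lam, dtype=np.float64), 0.0)
    while True:                                 # minor cycles
        u, alpha = affine_minimizer(S, y)
        if np.all(alpha >= -eps):               # S is a corral: accept u
            return u, S, np.clip(alpha, 0.0, None)
        # Move from x toward u until some convex coefficient hits zero.
        neg = alpha < 0.0
        theta = float(np.min(lam[neg] / (lam[neg] - alpha[neg])))
        lam = (1.0 - theta) * lam + theta * alpha
        keep = lam > eps                        # drop zero-weight points (and policies)
        S = [s for s, k in zip(S, keep) if k]
        lam = lam[keep]
        x = lam @ np.asarray(S)
```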

For a non-singleton C, we in fact cannot achieve the linear convergence of the singleton case. This is because in the non-singleton case the target squared distance function is not strongly convex, which is a common assumption required for linear convergence.

Recall that a function f over X is defined to be strongly convex (Boyd et al., 2004) if there exists σ > 0 such that for all x, y ∈ X, f satisfies

(14)   f(y) ≥ f(x) + ⟨∇f(x), y - x⟩ + (σ/2) ||y - x||_2^2.

Proposition 4.1.

For any convex set C, the squared distance function d_C(x)^2 is strongly convex if and only if C is a singleton.

A proof is given in Appendix B.1. This proposition shows that the singleton case solved by MNP is the only case in which the target function is strongly convex and linear convergence can be achieved. For a general non-singleton C, linear convergence does not hold. To analyze the convergence of our modified MNP method, we first show that the approximation error strictly decreases between any two steps.

Theorem 4.2 (Approximation Error Strictly Decreases).

At each step t, the mixed policy μ_t found by Algorithm 2 satisfies d_C(z(μ_t)) < d_C(z(μ_{t-1})). That is, the measurement vector of μ_t gets strictly closer to the convex set C.

A proof is provided in Appendix B.2. Given that the approximation error strictly decreases, the MNP can be shown to terminate finitely (Wolfe, 1976). However, this finite-termination property does not hold for our algorithm: since a changed projected point may yield a lower distance for the same active set S, the active set may stay unchanged across major cycles (cf. Section 5). We establish the convergence of the modified MNP method with the following theorem.

Theorem 4.3 (Convergence in Approximation Error).

For any T, the mixed policy μ_T found by the modified MNP method (Algorithm 2) satisfies

(15)

where B denotes the maximum norm of a measurement vector.

The proof is provided in Appendix B.3. In short, we define major cycle steps with at most one minor cycle as non-drop steps, which are the "good" steps, and major cycle steps with more than one minor cycle as drop steps, which are the "bad" steps. We show that in good steps, Algorithm 2 is guaranteed to make sufficient progress. Though this does not hold for bad steps, we can bound the frequency of bad steps, and by Theorem 4.2 bad steps still make progress. Hence the convergence follows. The main techniques are based on Chakrabarty et al. (2014); however, since we give a tighter bound on the frequency of bad steps, we improve upon their convergence rate.

Figure 1: The illustration of the navigation task. The agent needs to navigate from S to G, with no more than 11 steps, and no more than 0.5 steps in the grey region on expectation. This requires the agent to randomly choose between safer paths and shorter paths.

We then discuss the policy efficiency of mixed policies for the CRL problem. We give a constructive proof in Appendix B.4 showing that, for RL algorithms whose candidate policy set consists of deterministic policies (e.g., DQN (Mnih et al., 2013), DDPG (Lillicrap et al., 2015) and their variants (Van Hasselt et al., 2015; Wang et al., 2016; Fujimoto et al., 2018; Barth-Maron et al., 2018)), storing d + 1 policies is necessary in the worst case to ensure convergence.

Figure 2: The approximation error and policy efficiency of solving the navigation task using different RL methods are compared, where tabular RL (a1, a2), policy-based deep RL (b1, b2) and value-based deep RL (c1, c2) have been considered. In all three cases, our modified MNP outperforms ApproPO, and meanwhile achieves a significant memory improvement.
Theorem 4.4 (Memory Complexity Bound).

When the candidate policy set is the set of all deterministic policies, to solve CRL problems (3) with d-dimensional measurement vectors, a mixed policy needs to randomize among d + 1 policies to ensure convergence in the worst case.

Since the minor cycles of the modified MNP method (Algorithm 2) maintain an affinely independent active set, the modified MNP method requires storing no more than d + 1 individual policies throughout the learning process.

Corollary 4.4.1.

The modified MNP method achieves the worst-case optimal policy efficiency.

Therefore we conclude that the proposed modified MNP method matches the convergence rate of the previous game-theoretical methods. Meanwhile, it achieves optimal policy efficiency, making it favorable for solving constrained deep RL problems.

5 Experiments

We verify the effectiveness and efficiency of the proposed methods on a navigation task and compare them with ApproPO (Miryoosefi et al., 2019), a game-theoretic reduction approach, using various RL methods. ApproPO constructs an RL player similar to our RL oracle, and hence is a natural baseline for comparison. We run experiments with the RL oracle constructed using tabular RL, policy-based deep RL, and value-based deep RL methods. In all three cases, our method outperforms ApproPO while achieving a significant improvement in policy efficiency.

In this navigation task (Figure 1), the agent is required to find a path from the starting point (S) to the goal point (G) by moving to one of the four neighboring cells at each step. Part of the region is designated as risky states (grey hatch) and should be avoided. By design, the risky region contains the shortest path from S to G, so the agent has to trade off between a shorter path and a safer path. The agent receives a 2-dimensional measurement vector that counts the number of steps and the number of steps inside the risky region, i.e., m = (1, 0) for every step outside the risky region and m = (1, 1) for every step inside the risky region. The agent is required to find a navigation policy whose long-term measurement lies inside the target set C, i.e., a policy navigating from S to G that on average takes no more than 11 steps and enters the risky region no more than 0.5 steps per episode. Episodes terminate when the goal point is reached or after a maximum number of steps. To simplify the presentation, we take discount factor γ = 1 for this finite-horizon task. See Appendix C for more experimental details and hyperparameters.

A quick inspection of this task shows that no deterministic policy is feasible. For example, the arrows in Figure 1 show one deterministic policy that bypasses the risky region entirely at the cost of a longer path, and another that takes a shorter path by entering the risky region once. A mixed policy that randomizes between these two policies with equal probability can be feasible (illustrated by the pink arrows).

5.1 Tabular RL Case

We first construct an RL oracle using the tabular Q-learning method. The approximation error and policy efficiency are compared in Figure 2 (a1 and a2). The modified MNP method gets stuck for about 100 steps, which is caused by the added projection step: as mentioned above, a changed projected point may yield a lower distance for the same active set S, so the active set remains unchanged for many steps. However, once an improving policy outside this active set is found, the modified MNP method quickly reaches the optimal value. On the other hand, since the game-theoretic method assigns weights to the policies found at all steps, ApproPO slows down when getting closer to a feasible policy. For policy efficiency (Figure 2, a2), the number of policies stored by ApproPO is simply linear in the number of oracle calls.

5.2 Policy-based Deep RL Case

In an online setting, we solve the navigation task using an RL oracle constructed with a deep Advantage Actor-Critic (A2C) algorithm (Sutton and Barto, 2018; Mnih et al., 2016). In this experiment, all methods use the same A2C agent. ApproPO introduces extra hyperparameters, which are set according to the original paper (see Appendix C for details), whereas the proposed modified MNP introduces no extra hyperparameters.

In Figure 2 (b1 and b2), we plot the mean and standard deviation of the approximation error and the policy efficiency (number of policies stored) of the modified MNP and ApproPO methods over 50 runs. The original ApproPO paper suggests the use of a cache, which heuristically cuts memory costs without affecting convergence; we include this variant in b2 and c2.

The experimental results show that our modified MNP outperforms ApproPO while cutting the memory usage by an order of magnitude. Even ApproPO with the cache stores significantly more policies than our proposed method. Our method stores about 2 policies throughout the process, with a guarantee of no more than 3.

5.3 Value-based Deep RL Case

We then consider the value-based deep RL methods, which are especially popular in offline RL settings (Levine et al., 2020; Fujimoto et al., 2019b, a). We illustrate how our proposed method enables leveraging the value-based deep RL method to solve CRL tasks with the following experiments.

We first randomly collect samples from the training process of the previous A2C agent and construct a replay buffer (Mnih et al., 2013) with these samples. Then we use a Double DQN (DDQN) with a dueling network (Wang et al., 2016) to learn from the samples in this replay buffer only, without any further interaction with the environment.

Learning from offline data without further exploration is harder than learning in the online setting; hence we double the number of training samples. Similar to our results with the policy-based RL method, when using the value-based RL method our proposed method again achieves superior performance. Meanwhile, throughout the learning process, our method stores far fewer policies than ApproPO.

6 Conclusions

In this paper, we propose a policy efficient reduction approach to solve the CRL problem. Using a novel vector space reduction, we derive a meta-algorithm that can be instantiated with any CG-type algorithm and any RL algorithm as subroutines. To improve policy efficiency, we propose a new variant of the CG method, the modified MNP method. The proposed method matches the convergence rate of the existing game-theoretic methods and reduces the memory complexity from O(1/ε^2) to at most d + 1 stored policies, which is worst-case optimal. Experiments demonstrate the superior performance of our method. When working with deep RL methods, our method cuts the memory costs by an order of magnitude, making it practical to utilize deep value-based methods to solve CRL problems.

References

  • J. Abernethy, P. L. Bartlett, and E. Hazan (2011) Blackwell approachability and no-regret learning are equivalent. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 27–46. Cited by: §1.
  • J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017) Constrained policy optimization. In International Conference on Machine Learning, pp. 22–31. Cited by: §1.
  • G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. Tb, A. Muldal, N. Heess, and T. Lillicrap (2018) Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617. Cited by: §1, §4.2.
  • A. Beck and S. Shtern (2017) Linearly convergent away-step conditional gradient for non-strongly convex functions. Mathematical Programming 164 (1-2), pp. 1–27. Cited by: §A.1.
  • L. Blackmore, M. Ono, and B. C. Williams (2011) Chance-constrained optimal path planning with obstacles. IEEE Transactions on Robotics 27 (6), pp. 1080–1094. Cited by: §1.
  • S. Boyd, S. P. Boyd, and L. Vandenberghe (2004) Convex optimization. Cambridge university press. Cited by: §B.1, §4.2.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §2.
  • D. Chakrabarty, P. Jain, and P. Kothari (2014) Provable submodular minimization using wolfe’s algorithm. In Advances in Neural Information Processing Systems, pp. 802–809. Cited by: §B.3, §1, §4.2, §4.2.
  • J. Chen, W. Zhan, and M. Tomizuka (2019) Autonomous driving motion planning with constrained iterative lqr. IEEE Transactions on Intelligent Vehicles 4 (2), pp. 244–254. Cited by: §1.
  • Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone (2017) Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18 (1), pp. 6070–6120. Cited by: §1.
  • Y. Chow and M. Ghavamzadeh (2014) Algorithms for cvar optimization in mdps. In Advances in neural information processing systems, pp. 3509–3517. Cited by: §1.
  • J. A. De Loera, J. Haddock, and L. Rademacher (2018) The minimum euclidean-norm point in a convex polytope: wolfe's combinatorial algorithm is exponential. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 545–553. Cited by: §4.1.
  • M. Dudík, J. Langford, and L. Li (2011) Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601. Cited by: §3.2.
  • W. Fedus, B. Zoph, and N. Shazeer (2021) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961. Cited by: §2.
  • M. Frank, P. Wolfe, et al. (1956) An algorithm for quadratic programming. Naval research logistics quarterly 3 (1-2), pp. 95–110. Cited by: §3.2, Algorithm 3.
  • Y. Freund and R. E. Schapire (1999) Adaptive game playing using multiplicative weights. Games and Economic Behavior 29 (1-2), pp. 79–103. Cited by: §1.
  • S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau (2019a) Benchmarking batch deep reinforcement learning algorithms. arXiv preprint arXiv:1910.01708. Cited by: §5.3.
  • S. Fujimoto, D. Meger, and D. Precup (2019b) Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062. Cited by: §5.3.
  • S. Fujimoto, H. Van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: §1, §4.2.
  • D. Garber and E. Hazan (2013a) A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666. Cited by: §A.1.
  • D. Garber and E. Hazan (2013b) Playing non-linear games with linear oracles. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 420–428. Cited by: §A.1.
  • E. Hazan (2012) 10 the convex optimization approach to regret minimization. Optimization for machine learning, pp. 287. Cited by: §1.
  • R. B. Holmes (1973) Smoothness of certain metric projections on hilbert space. Transactions of the American Mathematical Society 184, pp. 87–100. Cited by: §3.2.
  • Z. Hong, T. Shann, S. Su, Y. Chang, T. Fu, and C. Lee (2018) Diversity-driven exploration strategy for deep reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 10510–10521. Cited by: §1.
  • D. Isele, A. Nakhaei, and K. Fujimura (2018) Safe reinforcement learning on autonomous vehicles. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–6. Cited by: §1.
  • M. Jaggi (2013) Revisiting frank-wolfe: projection-free sparse convex optimization. In Proceedings of the 30th international conference on machine learning, pp. 427–435. Cited by: §A.1, §3.2, §3.2, Algorithm 3.
  • N. Jiang and L. Li (2016) Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661. Cited by: §3.2.
  • S. Lacoste-Julien and M. Jaggi (2015) On the global linear convergence of frank-wolfe optimization variants. In Advances in neural information processing systems, pp. 496–504. Cited by: §A.1, §4.1.
  • G. Lan (2020) First-order and stochastic optimization methods for machine learning. Springer. Cited by: §3.2.
  • H. Le, C. Voloshin, and Y. Yue (2019) Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712. Cited by: Table 1, §1.
  • S. Lefevre, A. Carvalho, and F. Borrelli (2015) A learning-based framework for velocity control in autonomous driving. IEEE Transactions on Automation Science and Engineering 13 (1), pp. 32–42. Cited by: §1.
  • S. Levine, A. Kumar, G. Tucker, and J. Fu (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643. Cited by: §5.3.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1, §4.2.
  • S. Miryoosefi, K. Brantley, H. Daume III, M. Dudik, and R. E. Schapire (2019) Reinforcement learning with convex constraints. In Advances in Neural Information Processing Systems, pp. 14093–14102. Cited by: Table 1, §1, §1, §1, §2, §5.
  • B. Mitchell, V. F. Dem’yanov, and V. Malozemov (1974) Finding the point of a polyhedron closest to the origin. SIAM Journal on Control 12 (1), pp. 19–26. Cited by: §A.1.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §5.2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1, §4.2, §5.3.
  • M. Ono, M. Pavone, Y. Kuwata, and J. Balaram (2015) Chance-constrained dynamic programming with application to risk-aware robotic space exploration. Autonomous Robots 39 (4), pp. 555–571. Cited by: §1.
  • S. Paternain, L. Chamon, M. Calvo-Fullana, and A. Ribeiro (2019) Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems, pp. 7555–7565. Cited by: §1.
  • D. Precup, R. S. Sutton, and S. Dasgupta (2001) Off-policy temporal-difference learning with function approximation. In ICML, pp. 417–424. Cited by: §3.2.
  • D. Precup (2000) Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80. Cited by: §3.2.
  • S. Shalev-Shwartz, S. Shammah, and A. Shashua (2016) Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1, §5.2.
  • C. Tessler, D. J. Mankowitz, and S. Mannor (2018) Reward constrained policy optimization. In International Conference on Learning Representations, Cited by: §1.
  • H. Van Hasselt, A. Guez, and D. Silver (2015) Deep reinforcement learning with double q-learning. arXiv preprint arXiv:1509.06461. Cited by: §1, §4.2.
  • Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas (2016) Dueling network architectures for deep reinforcement learning. In International conference on machine learning, pp. 1995–2003. Cited by: §1, §4.2, §5.3.
  • P. Wolfe (1970) Convergence theory in nonlinear programming. Integer and nonlinear programming, pp. 1–36. Cited by: §A.1.
  • P. Wolfe (1976) Finding the nearest point in a polytope. Mathematical Programming 11 (1), pp. 128–149. Cited by: §A.1, §B.2, §1, §4.1, §4.1, §4.2, §4, Algorithm 4.

Appendix A More on Conditional Gradient Type Methods

A.1 Vanilla Conditional Gradient

Input: convex function f, compact convex set X, linear optimization oracle LMO, learning rates η_t = 2/(t + 1)
Initialize: x_0 ∈ X

1:  for t = 1, …, T do
2:     v_t ← LMO(∇f(x_{t-1}))
3:     x_t ← (1 - η_t) x_{t-1} + η_t v_t
4:  end for
5:  return x_T
Algorithm 3 Vanilla Conditional Gradient (Frank et al., 1956; Jaggi, 2013)

For a convex function f, the vanilla CG method (also known as the Frank-Wolfe method) solves the constrained optimization problem min_{x ∈ X} f(x) over a compact and convex set X using a linear optimization oracle LMO. The process is illustrated in Algorithm 3. The vanilla CG is known to have a sublinear O(1/t) convergence rate (Jaggi, 2013). Various methods have been proposed to improve the convergence rate. For example, when X is a polytope and the objective function is strongly convex, multiple variants, such as away-step CG (Wolfe, 1970; Jaggi, 2013), pairwise CG (Mitchell et al., 1974), and Wolfe's method (Wolfe, 1976), are shown to enjoy a linear convergence rate (Lacoste-Julien and Jaggi, 2015). Linear convergence under other conditions has also been studied (Beck and Shtern, 2017; Garber and Hazan, 2013a, b).

a.2 Wolfe’s Method for Minimum Norm Point

Input: a finite set of points Z ⊂ R^d
Initialize: current point x equal to an arbitrary point of Z, active set S = {x}, weight λ_x = 1 for the points in S.

1:  for t = 1, …, T do // Major cycle
2:     q ← argmin_{z ∈ Z} ⟨x, z⟩ // Potential improving point
3:     S ← S ∪ {q},  λ_q ← 0
4:     while True do // Minor cycle
5:        (u, α) ← AffMin(S, 0)
6:        if u ∈ conv(S) then //  S is a corral
7:           break
8:        else
9:           θ ← max{θ ∈ [0, 1] : (1 - θ) x + θ u ∈ conv(S)},  x ← (1 - θ) x + θ u
10:          λ_z ← (1 - θ) λ_z + θ α_z for all z ∈ S
11:          S ← {z ∈ S : λ_z > 0}
12:       end if
13:     end while
14:     x ← u,  λ_z ← α_z for all z ∈ S
15:  end for
16:  return x
Algorithm 4 Wolfe's Method for Minimum Norm Point (Wolfe, 1976)

Wolfe’s method for minmum norm point (MNP) problem is an iterative algorithm to find the point with minimum Euclidean norm in a polytope, where the polytope is defined as the convex hull of a set of finitely many points . The Wolfe’s method consists of a finite number of major cycles, each of which consists of a finite number of minor cycles. The original MNP method iterates until a termination criteria is satisfied. At the start of each major cycle, let

be the hyperplane defined by

. If separates the polytope from the origin, then the process is terminated. Otherwise, it invokes an oracle to find any point on the near side of the hyperplane. The point is then added into the active set , and starts a minor cycle.

In a minor cycle, let u be the point of smallest norm in the affine hull aff(S). If u is in the relative interior of the convex hull conv(S), then x is updated to u and the minor cycle terminates. Otherwise, x is updated to the point nearest to u on the line segment between x and u that remains in conv(S). Thus x is updated to a boundary point of conv(S), and any point that is not on the face of conv(S) in which x lies is deleted. The minor cycles are executed repeatedly until S becomes a corral, that is, a set whose affine minimizer lies inside its convex hull. Since a set of one point is always a corral, the minor cycles terminate after a finite number of runs.

Appendix B Proofs of the Main Results

Recall that z(μ_t) = x_t throughout the process, where x_t denotes the current point x at the end of the t-th major cycle (the measurement of the mixed policy), and y_t = Proj_C(x_{t-1}) denotes the projected point computed in the t-th major cycle. In the following proofs, we write z_t for the measurement of the latest found policy to simplify notation. When discussing one major cycle step with t fixed, let u_k denote the affine minimizer found in the k-th minor cycle (line 6 of Algorithm 2).

B.1 Proof of Proposition 4.1

Proposition 4.1.

For any convex set C, the squared distance function d_C(x)^2 is strongly convex if and only if C is a singleton.

Recall that a function f over X is defined to be strongly convex (Boyd et al., 2004) if there exists σ > 0 such that (14) holds for all x, y ∈ X.

Proof.

"If" part: when C is a singleton, the target function is twice continuously differentiable with a constant positive-definite Hessian, and hence is strongly convex. The "only if" part can be proved by contraposition: for a non-singleton convex set, take two distinct points from the set; any convex combination of them also lies in the set and hence attains the value 0, i.e., the function is not strictly convex along this segment, and hence not strongly convex. ∎

B.2 Proof of Theorem 4.2

The idea is to consider the distance between the current point and the projected point y_t. When the major cycle has no minor cycle, the non-termination condition and the affine minimizer property imply that this distance strictly decreases. Otherwise, we show that the first minor cycle strictly reduces the distance by moving along the segment joining the current point and the affine minimizer, and the subsequent minor cycles cannot increase it. Since d_C(x_t) ≤ ||x_t - y_t||, we conclude d_C(x_t) < d_C(x_{t-1}), and the approximation error strictly decreases.

Theorem 4.2 (Approximation Error Strictly Decreases).

At each step t, the mixed policy μ_t found by Algorithm 2 satisfies d_C(z(μ_t)) < d_C(z(μ_{t-1})). That is, the measurement vector of μ_t gets strictly closer to the convex set C.

Proof.

If the current step is a major cycle with no minor cycle, then x_t is the affine minimizer of the active set with respect to y_t, and the affine minimizer property implies ||x_t - y_t|| ≤ ||x_{t-1} - y_t||. Since the iteration does not terminate at step t, Wolfe's criterion (Wolfe, 1976) implies that the newly found point z_t is an improving point, and therefore x_t is not equal to x_{t-1}. The uniqueness of the affine minimizer then implies that the inequality is strict.

Otherwise the current step contains one or more minor cycles. In this case, we show that the first minor cycle strictly reduces the approximation error, and the (possibly) following minor cycles cannot increase it. For the first minor cycle, the affine minimizer u_1 of the active set with respect to y_t lies outside the convex hull of the active set. Let w be the intersection of the boundary of the convex hull and the segment joining x_{t-1} and u_1, and let S_k denote the active set after the k-th minor cycle. Then, since u_1 is the affine minimizer with respect to y_t, we have

(16)

where the second step uses the triangle inequality and the last step follows because the segment intersects the interior of the convex hull, so that the distance to y_t strictly decreases along this segment. Therefore the point found by the first minor cycle satisfies

(17)

Subtracting the optimal value of the problem from both sides, it is clear that the first minor cycle strictly decreases the approximation error. By a similar argument, the subsequent minor cycles cannot increase the approximation error; however, after the first minor cycle the iterate may already be at the intersection point w, in which case the strict inequality in the last step of Eq. (16) has to be replaced by a non-strict inequality.

Therefore any major cycle either finds an improving point directly and continues, or enters minor cycles in which the first minor cycle finds an improving point and the subsequent minor cycles do not increase the distance. Combining the two cases, the approximation error strictly decreases. ∎

B.3 Proof of Theorem 4.3

In our analysis, we consider the approximation error err(μ_t) as defined in Section 3.1. We first prove Lemma B.1 and Lemma B.2 below, and then present the proof of Theorem 4.3 using these lemmas.

Lemma B.1.

For a non-drop step, we have .

Proof.

A non-drop step contains either no minor cycle or exactly one minor cycle. We first consider the case with no minor cycle.

If a major cycle contains no minor cycle, then x_t is the affine minimizer of the active set with respect to y_t, and we have

(18)
(19)
(20)
(21)
(22)
(23)

where equation (22) follows from the affine minimizer property (13). For the remaining term in the last equation, we have

(24)
(25)
(26)
(27)

Then it suffices to show that .

Since C is a convex set, the squared Euclidean distance function d_C(x)^2 is convex in x, which implies

(28)

Instantiating this inequality appropriately, and combining it with Eq. (23) and Eq. (27), we conclude that the claimed bound holds for non-drop steps with no minor cycles.

For a non-drop step with one minor cycle, we use Theorem 6 of Chakrabarty et al. (2014). By a linear translation of all points, it gives

(29)

Then, applying the same argument as in Eq. (28), we finish the proof. ∎

Lemma B.2.

After t major cycle steps of the modified MNP method, the number of drop steps is less than t/2.

Proof of Theorem 4.3.

Since Lemma B.2 shows that drop steps account for no more than half of the total major cycle steps, and Theorem 4.2 guarantees that these drop steps also reduce the approximation error, we can safely skip these steps and re-index the steps to include non-drop steps only.

For these non-drop steps, using Lemma B.1, we prove the convergence rate by induction. We first bound the initial approximation error. For any

(30)