1 Introduction
Work  Time complexity  Policy efficiency  No extra hyperparameters 

Le et al. (2019)  ✗  ✗  
Miryoosefi et al. (2019)  ✗  
Ours (Vanilla CG)  ✗  
Ours (Modified MNP)  ✓ 
Table 1: Comparison of different works. Time complexity (number of RL tasks solved) and policy efficiency (number of neural networks stored) are compared when using any deep RL method to find an approximate policy to a convex constrained RL problem with dimensional measurement function.

When applying reinforcement learning (RL) to many real-world tasks, it is inevitable to impose constraints to regulate the behavior of the resulting policy. Examples include adding risk constraints to avoid damaging expensive robotics (Blackmore et al., 2011; Ono et al., 2015), placing safety and comfort constraints on autonomous driving (Lefevre et al., 2015; Shalev-Shwartz et al., 2016; Isele et al., 2018; Chen et al., 2019), and introducing diversity constraints to encourage exploration (Hong et al., 2018; Miryoosefi et al., 2019). In general, such problems of learning desired policies under constraints can be cast into the constrained reinforcement learning (CRL) formalism.
As is well acknowledged, model-free RL methods can be classified into two major categories, value-based and policy-based (Sutton and Barto, 2018). However, compared with the large volume of literature studying value-based methods in the general RL setting, they are rarely investigated in the CRL setting. This somewhat surprising gap arises because, in CRL, a constraint-satisfying policy may require delicate randomization between different behaviors, and hence selecting multiple actions with specific probabilities is necessary (cf. Example 1). Most value-based algorithms such as Q-learning (Sutton and Barto, 2018), DQN (Mnih et al., 2013), and their variants (Van Hasselt et al., 2015; Wang et al., 2016; Lillicrap et al., 2015; Fujimoto et al., 2018; Barth-Maron et al., 2018) may fail to find any constraint-satisfying policy in CRL. Therefore, the CRL literature has traditionally focused on policy-based methods (Paternain et al., 2019; Tessler et al., 2018; Achiam et al., 2017; Chow et al., 2017; Chow and Ghavamzadeh, 2014). Recently, value-based algorithms have achieved state-of-the-art performance in various RL tasks (Van Hasselt et al., 2015; Wang et al., 2016; Fujimoto et al., 2018; Barth-Maron et al., 2018). It is thus tempting to ask whether CRL problems can be solved with value-based algorithms.

A new line of research derived from the game-theoretic perspective has made a breakthrough in this direction (Le et al., 2019; Miryoosefi et al., 2019). These game-theoretic approaches reformulate the CRL problem as a two-player zero-sum repeated game and solve it with no-regret online learning. In each round, one player, who uses an online learning algorithm, plays against the other player, who uses an RL algorithm to find a policy maximizing the value of the current game. The policy found by the RL player is then stored. It can be shown that after sufficiently many rounds, the mixed policy that selects one of the found policies uniformly at random converges to the desired constraint-satisfying policy. However, storing all the policies found by the RL player is not policy efficient and may incur very high memory costs. In particular, when deep RL methods are utilized, even on some simple tasks, these game-theoretic approaches need to store dozens to hundreds of neural networks to find a constraint-satisfying policy (cf. Section 5).
In theory, to obtain an approximate policy in CRL, these game-theoretic approaches require storing many policies, which is a consequence of their reliance on no-regret online learning (Freund and Schapire, 1999; Abernethy et al., 2011; Hazan, 2012). Given the high memory costs, the policy inefficiency of these game-theoretic approaches makes them impractical to work with deep RL methods.
To improve policy efficiency, we propose a novel vector space reduction approach to solve the CRL problems. Instead of taking the game-theoretic perspective, we reduce the CRL problem over a policy space to an equivalent distance minimization problem over a vector space. We then show that this distance minimization problem can be solved by a specially designed conditional gradient (CG) algorithm, whose linear optimization oracle is constructed using an RL algorithm. Consequently, this reduction yields a meta-algorithm, which can be instantiated by any variant of CG and any off-the-shelf RL method. Specifically, in each iteration, the RL algorithm finds a policy, and this policy is stored; the mixed policy that selects all found policies with appropriate weights (e.g., step sizes) converges to a desired constraint-satisfying policy. The main benefit of our reduction approach is that it substitutes the no-regret online learning techniques with CG-type methods, and thus it is not necessary to store all found policies. However, since the step sizes of the vanilla CG are nonzero, directly applying it assigns nonzero weights to all found policies and does not improve policy efficiency.
To this end, we propose a new algorithm, which achieves optimal policy efficiency, based on a variant of CG called the minimum norm point (MNP) method (Wolfe, 1976). We extend the vanilla MNP to solve a more general problem, where the distance function to a convex set is minimized over a compact convex set. Inspired by the minor cycle technique (Wolfe, 1976) in MNP, our modified MNP method reassigns the weights of all found policies and maintains an active set, which only contains policies with nonzero weights. After the weight adjustment, policies with weight zero are eliminated from the active set immediately to cut the memory costs. To solve CRL problems with dimensional measurement vectors, our method stores no more than policies throughout the learning process. Notably, this constant is shown to be worst-case optimal. Moreover, with a carefully refined analysis, our method solves the general problem with a faster convergence guarantee than the MNP method. To achieve an approximate solution in an dimensional space, our method improves the convergence from (Chakrabarty et al., 2014) to a tighter , with the same memory cost (details in Table 1).

We compare our method with the game-theoretic approach (Miryoosefi et al., 2019) in a navigation task, using different RL methods to construct the oracle. With both tabular RL and deep RL, our method demonstrates superior performance and policy efficiency. In particular, in the deep RL cases, our method reduces the memory costs by an order of magnitude. In summary, our approach enables value-based RL methods to be used efficiently for CRL, and the improved (worst-case optimal) policy efficiency makes it especially appealing for applications using deep RL methods.
2 Background
A vector-valued Markov decision process is defined as a tuple , where is a set of states , is a set of actions , is a transition probability function of the form that describes the dynamics of the system, defines the initial state distribution , is an dimensional measurement function that may measure reward, risk or other constraints, and is a discount factor.

Actions are typically selected according to some (stationary) policies. A policy maps states to probability distributions over actions, and denotes the probability of selecting action in . We assume that policies under consideration are selected from some candidate policy set . For example, in policy-based methods, is usually the set of all stationary policies, and in value-based methods, is typically the set of all deterministic policies. For a policy , we define the long-term measurement as the expectation of the discounted cumulative measurements
(1)
where the expectation is over the described random process.
To enable the use of value-based methods for CRL problems, we also consider mixed policies, which are distributions over the candidate policy space . We define to be the set of all mixed policies generated by . To execute a mixed policy , at the start of an episode, we select a policy , and then execute for the entire episode. The long-term measurement of a mixed policy is defined accordingly:
(2) 
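The execution rule for mixed policies can be illustrated with a minimal sketch. It assumes a mixed policy is given as a list of weights over a finite set of candidate policies, and that each candidate's long-term measurement is already known; the names `sample_policy` and `mixed_measurement` and all numbers are hypothetical and serve only as illustration.

```python
import random

def sample_policy(weights, rng):
    """Draw one policy index according to the mixture weights.
    This draw happens once, at the start of an episode."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def mixed_measurement(weights, measurements):
    """Long-term measurement of a mixed policy: the weighted average of the
    long-term measurements of its component policies."""
    dim = len(measurements[0])
    return [sum(w * m[j] for w, m in zip(weights, measurements)) for j in range(dim)]

# Hypothetical long-term measurements of two candidate policies.
measurements = [[1.0, 0.0], [0.0, 1.0]]
weights = [0.25, 0.75]

rng = random.Random(0)
num_episodes = 10000
counts = [0, 0]
for _ in range(num_episodes):
    counts[sample_policy(weights, rng)] += 1  # one draw per episode

z_mixed = mixed_measurement(weights, measurements)
empirical_freq = [c / num_episodes for c in counts]
print(z_mixed, empirical_freq)
```

Because the policy is drawn once per episode rather than once per step, only the component policies with nonzero weight ever need to be stored.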
In the following, we focus on the convex constrained RL problem, also known as the feasibility problem, which generalizes inequality constraints to convex constraints (Miryoosefi et al., 2019). A feasibility problem is specified by a closed and convex set . The goal is to find a policy whose long-term measurement lies inside .
(3) 
A policy is feasible if it satisfies the constraint, and the problem is feasible if a feasible policy exists. This formulation can potentially handle tasks that maximize one measurement (e.g., reward) under convex constraints. Such problems can be solved by performing a binary search over the maximum achievable reward value and at each iteration augmenting an inequality reward constraint (reward no less than the current iterated value) to the constraints.
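The binary search described above can be sketched as follows. This is only an illustration under stated assumptions: `is_feasible` is a hypothetical callback standing in for a full feasibility solver run with the extra constraint "reward no less than r", and feasibility is assumed to be monotone in r.

```python
def max_feasible_reward(is_feasible, lo, hi, tol=1e-3):
    """Binary search for the largest reward threshold r such that the
    feasibility problem augmented with 'reward >= r' is still feasible.
    Assumes monotonicity: feasible at r implies feasible at any smaller r."""
    if not is_feasible(lo):
        return None  # even the weakest reward constraint cannot be met
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_feasible(mid):
            lo = mid  # threshold achievable, try a higher one
        else:
            hi = mid  # threshold too demanding, lower it
    return lo

# Hypothetical stand-in for a CRL feasibility solver: thresholds up to 0.7 are achievable.
def toy_is_feasible(r):
    return r <= 0.7

best = max_feasible_reward(toy_is_feasible, 0.0, 1.0)
print(best)
```

Each call to `is_feasible` corresponds to solving one feasibility problem, so the number of feasibility problems solved grows logarithmically with the required precision.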
Though both policy-based and value-based methods are well established in general RL, in the feasibility problem a feasible policy may need to choose among multiple actions with specific probabilities, which many value-based methods cannot do. We illustrate this difficulty with the following example.
Example 1.
We consider the task of playing the Rock, Paper, Scissors game. For simplicity, we assume the environment randomly selects one of the three actions, and the game terminates after a fixed number of rounds. Let the measurement vector be the basis vectors in , indicating whether the agent won with each of the three actions, and the zero vector in case of a tie or loss. Consider the feasibility problem specified by , which requires the agent to win with each action with at least probability in expectation. It is obvious that the only feasible policy for this task is to select the three actions with the same probability. However, most value-based methods calculate a scalar value for each state-action pair and select any action achieving the maximum value at the current state. Since value-based methods cannot specify the probability of choosing each action at a state, they may fail to solve CRL problems.
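The difficulty in Example 1 can be checked numerically. The sketch below computes the per-round expected measurement vector, assuming the environment plays uniformly at random; the threshold of the feasibility problem is not reproduced here, and the helper `expected_measurement` is a hypothetical name introduced for illustration.

```python
from fractions import Fraction

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def expected_measurement(policy):
    """Per-round expected measurement: component a is the probability that the
    agent plays a and wins, against a uniformly random environment."""
    env_prob = Fraction(1, 3)
    result = []
    for a in ACTIONS:
        win_prob = sum(env_prob for b in ACTIONS if BEATS[a] == b)
        result.append(policy[a] * win_prob)
    return result

# The three deterministic policies: always rock, always paper, always scissors.
deterministic = [{a: Fraction(int(a == chosen)) for a in ACTIONS} for chosen in ACTIONS]
per_policy = [expected_measurement(p) for p in deterministic]

# Mixed policy that picks one of the three deterministic policies uniformly.
mixed = [sum(column) / 3 for column in zip(*per_policy)]

print(per_policy)
print(mixed)
```

Every deterministic policy leaves two components at zero, whereas the uniform mixture makes all three components equal, which is why randomization is unavoidable in this task.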
One workaround is to use mixed policies. However, the main difficulty in using mixed policies is that when each policy is found by a deep RL method, the memory costs can be huge. To store such a mixed policy , the neural networks corresponding to all policies with nonzero probability have to be stored. Hence the memory cost of storing is proportional to the number of policies with nonzero weights in the candidate policy space, i.e., . Since a neural network may have billions to trillions of parameters (Brown et al., 2020; Fedus et al., 2021), storing a large number of neural networks is impractical in many deep RL tasks. Therefore, we are interested in mixed policies that are policy efficient, that is, those with only a small number of policies of nonzero weight.
3 A Vector Space Reduction Approach
Our vector space reduction approach reformulates the original CRL problem over a policy space as an equivalent distance minimization problem over a vector space. The key is to construct a specific linear optimization oracle using any RL algorithm, which enables solving this distance minimization problem with any variant of the CG method. This reduction yields a meta-algorithm for the CRL problems, which can be instantiated by any CG method and any RL algorithm. We illustrate this with the vanilla CG method.
3.1 Equivalent Distance Minimization Problem
We first reformulate the feasibility problem as an equivalent distance minimization problem over the policy space. For a closed and convex set , consider the problem of finding a mixed policy , whose long-term measurement is closest to the target convex set,
(4) 
where is the Euclidean distance of to the set , and is the Euclidean projection of onto the set .
For this minimization problem, a policy is defined to be optimal if it minimizes (4). Otherwise, the approximation error of is defined as
(5) 
A policy is defined to be an approximate policy if its approximation error is no larger than .
When the CRL problem (3) is feasible, the equivalence of being optimal to (4) and being feasible to the CRL problem can be easily established. Since a feasible policy of the CRL problem lies inside , it minimizes the nonnegative function, and hence is optimal to (4). Vice versa, any optimal policy to (4) lies inside and is a feasible policy to the CRL problem.
From a geometric perspective, let denote the set of all long-term measurements achievable by policies in the candidate policy space . It is clear that is the convex hull of , and hence is closed and compact. Therefore the distance minimization problem (4) over the policy space is equivalent to the following distance minimization problem over a closed and convex set :
(6) 
If the CRL problem is feasible, then any that minimizes this distance function over the convex set finds a feasible policy to the original problem. Hence we have reduced the original CRL problem over a policy space to an equivalent distance minimization problem (6) over the closed and convex set in a vector space.
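The distance and projection used in this reduction can be made concrete with a small sketch. It assumes, purely for illustration, that the target set is an axis-aligned box (for which the Euclidean projection is a coordinate-wise clip) and that two candidate policies have the hypothetical long-term measurements shown below.

```python
import math

def project_onto_box(x, lower, upper):
    """Euclidean projection onto the box [lower, upper] (coordinate-wise clip)."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]

def distance_to_box(x, lower, upper):
    """Euclidean distance from x to the box."""
    p = project_onto_box(x, lower, upper)
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)))

lower, upper = [0.0, 0.0], [1.0, 1.0]

# Hypothetical long-term measurements of two deterministic policies.
z_a, z_b = [2.0, 0.0], [0.0, 2.0]
# Measurement of the mixed policy that selects each with probability 1/2.
z_mix = [0.5 * a + 0.5 * b for a, b in zip(z_a, z_b)]

d_a = distance_to_box(z_a, lower, upper)
d_b = distance_to_box(z_b, lower, upper)
d_mix = distance_to_box(z_mix, lower, upper)
print(d_a, d_b, d_mix)
```

Neither vertex attains distance zero, but a point in their convex hull does, which mirrors the role of mixed policies in the reduction.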
3.2 A Solution with Vanilla Conditional Gradient
Since it is unclear how to project a policy to the implicitly defined set , this distance minimization problem (6) is nontrivial. We overcome this difficulty by proposing a specially designed conditional gradient (CG) algorithm, where the linear optimization oracle used by the CG method is constructed using any off-the-shelf RL algorithm.
We briefly review the CG method. CG is a first-order method to minimize a convex function over a compact and convex set , using a linear optimization oracle (Frank and Wolfe, 1956)
(7) 
In each iteration step , the CG (Algorithm 3 in Appendix A.1) calculates the gradient at the current point , and invokes the linear optimization oracle to find an improving point . Then it updates the iterated point by taking a convex average of the current point and the improving point , where at step , the step size is typically set to (Jaggi, 2013).
We first calculate the gradient of the target function with respect to . (1.1) of Holmes (1973) shows that the gradient of the function with respect to is , where if else
. Hence applying the chain rule, it is straightforward that
We construct the desired linear optimization oracle, denoted by , such that for any , it outputs a policy, together with the corresponding measurement vector
(8) 
satisfying . To construct the linear optimization oracle, the improving policy can in fact be found by using any off-the-shelf RL algorithm to solve a specific RL task. In particular, for any , a policy that minimizes
(9)  
(10) 
is a policy that maximizes the scalar reward at each step. Therefore any reinforcement learning algorithm that maximizes this scalar reward finds an improving policy, and the RL algorithm that best suits the underlying problem can be used to find an improving policy .
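The oracle construction can be sketched as follows. The exact scalar reward is assumed here to be the negative inner product between the query direction and the per-step measurement vector; the enumerated policy set replaces the RL algorithm only to keep the example self-contained, and the names `scalarized_reward` and `linear_oracle` are hypothetical.

```python
def scalarized_reward(direction, measurement):
    """Scalar reward handed to the RL algorithm (assumed form): the negative
    inner product of the query direction with the per-step measurement."""
    return -sum(d * m for d, m in zip(direction, measurement))

def linear_oracle(direction, policy_measurements):
    """Stand-in for the RL oracle over an enumerated policy set: return the
    policy whose long-term measurement minimizes <direction, z>."""
    def inner(name):
        return sum(d * z for d, z in zip(direction, policy_measurements[name]))
    best = min(policy_measurements, key=inner)
    return best, policy_measurements[best]

# Hypothetical long-term measurements of two candidate policies.
policies = {"a": (2.0, 0.0), "b": (0.0, 2.0)}
direction = (1.0, 0.0)

name, z = linear_oracle(direction, policies)
# By linearity, minimizing <direction, z> is the same as maximizing the
# long-term scalarized reward.
best_by_reward = max(policies, key=lambda k: scalarized_reward(direction, policies[k]))
print(name, z, best_by_reward)
```

In practice the `min` over an enumerated set would be replaced by training an RL agent on the scalarized reward, so the oracle is only as exact as the underlying RL algorithm.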
Evaluating the measurement vector is handy in online settings, where Monte Carlo simulations estimate directly. In batch or offline settings, various off-policy evaluation methods, such as importance sampling (Precup, 2000; Precup et al., 2001) or doubly robust estimation (Jiang and Li, 2016; Dudík et al., 2011), can be used to estimate .

With the linear optimization oracle constructed using any RL method, the distance minimization problem (6) can be solved by any CG-type algorithm. When the vanilla CG algorithm is used, the resulting method is illustrated in Algorithm 1. In each iteration, the is invoked once to find an improving policy , together with its long-term measurement . Then, the current mixed policy is updated by selecting with weight , and selecting any previously found policy with weight . The iterated point is updated in the same way, ensuring the invariance that . The convergence of the vanilla CG is well established (Jaggi, 2013; Lan, 2020), which readily implies that Algorithm 1 converges at a sublinear rate. However, since the step sizes of vanilla CG are always nonzero, after iterations, all policies have nonzero weights in . When the policies are found by deep RL methods, this requires storing neural networks, and is not policy efficient. We conclude that Algorithm 1 matches the convergence rate and policy efficiency of the existing game-theoretic approaches.
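The weight bookkeeping of the vanilla CG instantiation can be sketched as below. This is not Algorithm 1 itself but a toy version under stated assumptions: the target set is an axis-aligned box, the RL oracle is replaced by a search over two hypothetical measurement vectors, the objective is taken as half the squared distance, and the step size is assumed to be 2/(t+2), a common choice for CG.

```python
import math

def project_onto_box(x, lower, upper):
    """Euclidean projection onto an axis-aligned box (stand-in target set)."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]

def toy_oracle(direction, policy_measurements):
    """Stand-in for the RL oracle: index of the policy minimizing <direction, z>."""
    return min(range(len(policy_measurements)),
               key=lambda i: sum(d * z for d, z in zip(direction, policy_measurements[i])))

def vanilla_cg(policy_measurements, lower, upper, num_iters):
    x = list(policy_measurements[0])  # measurement of the initial policy
    stored = [0]                      # one stored policy per oracle call
    weights = [1.0]                   # mixture weight of each stored policy
    for t in range(1, num_iters + 1):
        proj = project_onto_box(x, lower, upper)
        grad = [xi - pi for xi, pi in zip(x, proj)]  # gradient of 0.5 * dist^2
        idx = toy_oracle(grad, policy_measurements)
        eta = 2.0 / (t + 2)                          # assumed step size
        weights = [(1 - eta) * w for w in weights] + [eta]
        stored.append(idx)
        x = [(1 - eta) * xi + eta * zi for xi, zi in zip(x, policy_measurements[idx])]
    return x, stored, weights

lower, upper = [0.0, 0.0], [1.0, 1.0]
x, stored, weights = vanilla_cg([(2.0, 0.0), (0.0, 2.0)], lower, upper, 200)
proj = project_onto_box(x, lower, upper)
dist = math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, proj)))
print(len(weights), min(weights), dist)
```

In this toy the oracle keeps returning one of two vertices, but with a deep RL oracle every call yields a newly trained network, so the length of `weights` is the number of networks that must be kept.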
4 A Policy Efficient CG Approach
Compared with the game-theoretic approaches, our vector space reduction approach does not require storing all found policies. However, directly applying the vanilla CG method assigns nonzero weights to all found policies and does not improve policy efficiency. To improve policy efficiency, we propose a new CG-type method. Our method is based on a variant of CG called the minimum norm point (MNP) method (Wolfe, 1976). We extend the MNP to solve a more general problem. When applied to the CRL problem, our proposed method matches the convergence rate of existing approaches and achieves optimal policy efficiency.
4.1 Minimum Norm Point Method
To find policy efficient mixed policies, we turn to variants of CG-type algorithms, especially those that maintain an active set and assign zero weights to certain iterated points. When the target convex set is a singleton, a policy efficient solution can be readily found using Wolfe’s method for Minimum Norm Point (MNP) over a polytope (Wolfe, 1976; De Loera et al., 2018).
When the target set is a singleton containing one point , the distance minimization problem (6) is simplified to finding a point in the polytope that is closest to
(11) 
which can be readily solved by Wolfe’s method for finding Minimum Norm Point (MNP) in a polytope.
In MNP (Algorithm 4 in Appendix A.2), the loop in CG is called a major cycle, and the convex averaging step is replaced by weight reassignment processes, called minor cycles. MNP maintains an active set , and the current iterated point is represented as a convex combination of points in .
Recall that for a set of points , the affine hull is defined as
(12) 
The convex hull is defined similarly with an additional requirement that element-wise. The affine minimizer is defined as . When a point is treated as the origin, the affine minimizer with respect to is and the affine minimizer property gives
(13) 
In a major cycle, when , we have for all . Hence, the MNP uses the oracle the same way as the CG algorithm. To minimize the size of the active set, the MNP repeatedly eliminates points from the active set using minor cycles. The minor cycles are executed until becomes a corral, that is, its affine minimizer lies inside its convex hull. To maintain the corral property of active set , in a minor cycle, let be the point of smallest norm in of the affine hull . If is in the relative interior of the convex hull , then the minor cycle is terminated. Otherwise, is updated to the nearest point to on the line segment . Thus is updated to a boundary point of , and any point not on the face of in which lies is deleted. Note that singletons are always corrals, and hence the minor cycles terminate after a finite number of runs. After the minor cycles terminate, is updated to the affine minimizer of the corral .
The process returns the affine minimizer of and is the coefficient expressing as an affine combination of points in , where is the weight associated with . The process can be straightforwardly implemented using linear algebra. Wolfe (1976) also provides a more efficient implementation that uses a triangular array representation of the active set.
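The linear algebra behind the affine minimizer and a single minor cycle can be sketched as follows. This is an illustrative implementation, not the triangular-array scheme of Wolfe (1976): it takes the target point as the origin, assumes the active points are affinely independent with strictly positive current weights, and solves the optimality system exactly with rational arithmetic. The function names are hypothetical.

```python
from fractions import Fraction

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact rationals.
    Assumes A is nonsingular."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]

def affine_minimizer(points):
    """Weights alpha (summing to 1) minimizing ||sum_i alpha_i p_i||^2, and the
    resulting point, from the system [G 1; 1^T 0][alpha; nu] = [0; 1]."""
    k = len(points)
    gram = [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]
    A = [gram[i] + [1] for i in range(k)] + [[1] * k + [0]]
    sol = solve_linear(A, [0] * k + [1])
    alpha = sol[:k]
    point = [sum(a * p[j] for a, p in zip(alpha, points)) for j in range(len(points[0]))]
    return alpha, point

def minor_cycle_step(points, lam):
    """One minor cycle: move the convex weights lam toward the affine-minimizer
    weights as far as the convex hull allows, then drop zero-weight points."""
    alpha, _ = affine_minimizer(points)
    if all(a > 0 for a in alpha):
        return points, alpha  # the active set is already a corral
    theta = min(l / (l - a) for l, a in zip(lam, alpha) if a <= 0)
    new_lam = [(1 - theta) * l + theta * a for l, a in zip(lam, alpha)]
    kept = [(p, w) for p, w in zip(points, new_lam) if w > 0]
    return [p for p, _ in kept], [w for _, w in kept]

alpha, y = affine_minimizer([(3, 0), (2, 1)])
active, lam = minor_cycle_step([(3, 0), (2, 1)], [Fraction(1, 2), Fraction(1, 2)])
print(alpha, y, active, lam)
```

In the usage above the affine minimizer has a negative weight on the first point, so the step stops at the boundary of the segment and that point is removed from the active set.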
In the singleton case, the MNP solves the distance minimization problem (6), and hence the CRL problem (3). Since the active set is a corral and hence is affinely independent, the number of policies stored is at most at any time. After major cycle steps, the MNP method is shown to converge linearly with a rate where is an constant determined by the polytope as defined in Lacoste-Julien and Jaggi (2015).
4.2 Modified MNP and Theoretical Analysis
To solve the general case where may not be a singleton, we propose a modified MNP method. In the general non-singleton case, our target function is in fact not strongly convex (Proposition 4.1). We analyze the complexity of our modified MNP method, and improve from the previous (Chakrabarty et al., 2014) to a tighter convergence rate (Theorem 4.3). Moreover, we show that maintaining an active policy set of size is worst-case optimal (Theorem 4.4). Therefore we conclude that the proposed modified MNP method matches the convergence of the existing game-theoretic methods, and achieves an optimal policy efficiency of storing no more than policies.
As illustrated in Algorithm 2, we modify the MNP by adding a projection step into the major cycle (line 3). In each major cycle, the modified MNP minimizes the distance to a projected point . Hence the resulting algorithm is equivalent to Wolfe’s MNP method when is a singleton, and otherwise, the oracle step calculates the gradient the same as the CG method. Intuitively, at each major step, if we make significant progress toward the projected point, then the distance to the convex set decreases by at least the same amount.
For non-singleton , we cannot in fact achieve linear convergence as in the singleton case. This is because in the non-singleton case, the target squared distance function is not strongly convex, which is a common assumption required for linear convergence.
Recall that a function over is defined to be strongly convex (Boyd et al., 2004), if there exists , such that for all , satisfies
(14) 
Proposition 4.1.
For any convex set , the function is strongly convex if and only if is a singleton.
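One direction of this proposition admits a short sketch, given here only as an illustration and not as the argument of Appendix B.1. The notation is assumed: $\Omega$ denotes the target convex set, $f(x) = \tfrac{1}{2}\,\mathrm{dist}^2(x, \Omega)$, whose gradient is $x$ minus its projection onto $\Omega$, and $m > 0$ is a candidate strong-convexity modulus in the standard first-order definition.

```latex
% Assumed notation: \Omega target set, f(x) = \tfrac{1}{2}\,\mathrm{dist}^2(x,\Omega), m > 0.
\begin{align*}
&\text{Let } x \neq y \text{ with } x, y \in \Omega.
  \text{ Then } f(x) = f(y) = 0 \text{ and } \nabla f(x) = x - \mathrm{Proj}_{\Omega}(x) = 0. \\
&\text{Strong convexity with modulus } m \text{ would require} \\
&\qquad f(y) \;\ge\; f(x) + \nabla f(x)^{\top}(y - x) + \frac{m}{2}\,\lVert y - x \rVert^{2}
  \;=\; \frac{m}{2}\,\lVert y - x \rVert^{2} \;>\; 0, \\
&\text{which contradicts } f(y) = 0.
  \text{ Hence no } m > 0 \text{ works when } \Omega \text{ contains two distinct points.}
\end{align*}
```

The function is flat on the whole of the target set, so curvature can only be bounded away from zero when that set is a single point.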
A proof is given in Appendix B.1. This proposition shows that the singleton case solved by MNP is the only case where the target function is strongly convex and linear convergence can be achieved. For general nonsingleton , the linear convergence does not hold. To analyze the convergence of our modified MNP method, we first show that the approximation error strictly decreases between any two steps.
Theorem 4.2 (Approximation Error Strictly Decreases).
For each step, the found by Algorithm 2 satisfies . That is, the measurement vectors of get strictly closer to the convex set .
A proof is provided in Appendix B.2. Given the approximation error strictly decreases, MNP can be shown to terminate finitely (Wolfe, 1976). However, this finitely terminating property does not hold for our algorithm. Since a changed projected point may yield a lower distance for the same active set , the active set may stay unchanged across major cycles (cf. Section 5). We establish the convergence of the modified MNP method by the following theorem.
Theorem 4.3 (Convergence in Approximation Error).
For any , the mixed policy found by the modified MNP method (Algorithm 2) satisfies
(15) 
where is the maximum norm of a measurement vector.
The proof is provided in Appendix B.3. In short, we define major cycle steps with at most one minor cycle as non-drop steps, which are "good" steps, and major cycle steps with more than one minor cycle as drop steps, which are "bad" steps. We show that in good steps, Algorithm 2 is guaranteed to make enough progress. Though this does not hold for bad steps, we can bound the frequency of bad steps, and by Theorem 4.2, bad steps still make progress. Hence the convergence follows. The main techniques are based on Chakrabarty et al. (2014). However, since we give a tighter bound on the frequency of bad steps, we improve the convergence rate from their to a tighter .
We then discuss the policy efficiency of mixed policies for the CRL problem. We give a constructive proof in Appendix B.4 to show that, to ensure convergence for RL algorithms whose candidate policy set consists of deterministic policies (e.g., DQN (Mnih et al., 2013), DDPG (Lillicrap et al., 2015) and variants (Van Hasselt et al., 2015; Wang et al., 2016; Fujimoto et al., 2018; Barth-Maron et al., 2018)), storing policies is necessary in the worst case.
Theorem 4.4 (Memory Complexity Bound).
When the candidate policy set is the set of all deterministic policies, to solve CRL problems (3) with dimensional measurement vectors, a mixed policy needs to randomize among policies to ensure convergence in the worst case.
Since the minor cycles of the modified MNP method (Algorithm 2) keep the active set affinely independent, the modified MNP method requires storing no more than individual policies throughout the learning process.
Corollary 4.4.1.
The modified MNP method achieves the worst-case optimal policy efficiency.
Therefore we conclude that the proposed modified MNP method matches the convergence rate of the previous game-theoretic methods. Meanwhile, it achieves optimal policy efficiency, making it favorable for solving constrained deep RL problems.
5 Experiments
We verify the effectiveness and efficiency of the proposed methods in a navigation task and compare them with ApproPO (Miryoosefi et al., 2019), a game-theoretic reduction approach, using various RL methods. ApproPO constructs an RL player similar to our RL oracle. Hence it is a natural baseline for comparison. We run experiments with the RL oracle constructed using tabular RL, policy-based deep RL, and value-based deep RL methods. In all three cases, our method outperforms ApproPO while achieving a significant improvement in policy efficiency.
In this navigation task (Figure 1), the agent is required to find a path from the starting point (S) to the goal point (G), by moving to one of the four neighboring cells at each step. Part of the region is designated as risky states (grey hatch), which should be avoided. By design, the risky region contains the shortest path from S to G, so that the agent has to trade off between a shorter path and a safer path. The agent receives a 2-dimensional measurement vector that signals the number of steps and the number of steps inside the risky region, i.e. for every step outside the risky region, and for every step inside the risky region. The agent is required to find a navigation policy whose measurement vector lies inside . That is, the agent is required to find a policy navigating from S to G that, on average per episode, takes no more than steps and spends no more than steps inside the risky region. The episodes terminate when the goal point is reached or after steps. To simplify the presentation, we take discount for this finite horizon task. See Appendix C for more experimental details and hyperparameters.
A quick inspection of this task shows that none of the deterministic policies is feasible. For example, the arrows in Figure 1 show one deterministic policy achieving by bypassing the risky region entirely and another achieving by entering the risky region once. A mixed policy that randomizes between these two policies with the same probability can be feasible (illustrated by the pink arrows).
5.1 Tabular RL Case
We first construct an RL oracle using the tabular Q-learning method. The approximation error and policy efficiency are compared in Figure 2 (a1 and a2). The modified MNP stalls for about 100 steps. This is caused by the added projection step. As mentioned above, a changed projected point may yield a lower distance for the same active set , so the active set can remain unchanged for many steps. However, once an improving policy is found outside this active set, the modified MNP method quickly achieves the optimal value. In contrast, since the game-theoretic method assigns weights to the policies from all steps, ApproPO slows down when getting closer to a feasible policy. For the policy efficiency (Figure 2 a2), the number of policies stored by ApproPO simply grows linearly with the number of oracle calls.
5.2 Policy-based Deep RL Case
In an online setting, we solve the navigation task using an RL oracle constructed by a deep Advantage Actor-Critic (A2C) algorithm (Sutton and Barto, 2018; Mnih et al., 2016). In this experiment, all methods use the same A2C agent. ApproPO introduces extra hyperparameters, which are set according to the original paper (see Appendix C for details), whereas the proposed modified MNP introduces no extra hyperparameters.
In Figure 2 (b1 and b2), we plot the mean and standard deviation of the approximation error and policy efficiency (number of policies stored) of the modified MNP and ApproPO methods over 50 runs. The original ApproPO paper suggests the use of a cache, which heuristically cuts memory costs and does not affect its convergence. We include the cached variant in b2 and c2.
The experimental results show that our modified MNP outperforms ApproPO while cutting the memory usage by an order of magnitude. Even ApproPO with the cache stores significantly more policies than our proposed method. Our method stores about 2 policies throughout the process, with a guarantee of no more than 3.
5.3 Value-based Deep RL Case
We then consider value-based deep RL methods, which are especially popular in offline RL settings (Levine et al., 2020; Fujimoto et al., 2019b, a). We illustrate how our proposed method enables value-based deep RL methods to solve CRL tasks with the following experiments.
We first randomly collect thousand samples from the training process of the previous A2C agent, and construct a replay buffer (Mnih et al., 2013) with these samples. Then we use a Double DQN (DDQN) with dueling network (Wang et al., 2016) to learn from samples in this replay buffer only, without any further interaction with the environment.
Learning from offline data without any further exploration is harder than learning in the online setting. Hence we double the training samples. As with the policy-based RL method, our proposed method also achieves superior performance when using the value-based RL method. Meanwhile, throughout the learning process, our method stores far fewer policies than ApproPO.
6 Conclusions
In this paper, we propose a policy efficient reduction approach to solve the CRL problem. Using a novel vector space reduction, we derive a meta-algorithm, which can be instantiated with any CG-type algorithm and any RL algorithm as subroutines. To improve policy efficiency, we propose a new variant of the CG method, the modified MNP method. The proposed method matches the convergence rate of the existing game-theoretic methods and reduces the memory complexity from to at most , which is worst-case optimal. Experiments demonstrate the superior performance of our method. When working with deep RL methods, our method cuts the memory costs by an order of magnitude, making it practical to utilize deep value-based methods to solve CRL problems.
References
 Blackwell approachability and no-regret learning are equivalent. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 27–46. Cited by: §1.
 Constrained policy optimization. In International Conference on Machine Learning, pp. 22–31. Cited by: §1.
 Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617. Cited by: §1, §4.2.
 Linearly convergent away-step conditional gradient for non-strongly convex functions. Mathematical Programming 164 (1-2), pp. 1–27. Cited by: §A.1.
 Chance-constrained optimal path planning with obstacles. IEEE Transactions on Robotics 27 (6), pp. 1080–1094. Cited by: §1.
 Convex optimization. Cambridge University Press. Cited by: §B.1, §4.2.
 Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §2.
 Provable submodular minimization using Wolfe’s algorithm. In Advances in Neural Information Processing Systems, pp. 802–809. Cited by: §B.3, §1, §4.2, §4.2.
 Autonomous driving motion planning with constrained iterative LQR. IEEE Transactions on Intelligent Vehicles 4 (2), pp. 244–254. Cited by: §1.
 Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18 (1), pp. 6070–6120. Cited by: §1.
 Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, pp. 3509–3517. Cited by: §1.
 The minimum Euclidean-norm point in a convex polytope: Wolfe’s combinatorial algorithm is exponential. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 545–553. Cited by: §4.1.
 Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601. Cited by: §3.2.
 Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961. Cited by: §2.
 An algorithm for quadratic programming. Naval Research Logistics Quarterly 3 (1-2), pp. 95–110. Cited by: §3.2, Algorithm 3.
 Adaptive game playing using multiplicative weights. Games and Economic Behavior 29 (1-2), pp. 79–103. Cited by: §1.
 Benchmarking batch deep reinforcement learning algorithms. arXiv preprint arXiv:1910.01708. Cited by: §5.3.
 Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052–2062. Cited by: §5.3.
 Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: §1, §4.2.
 A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666. Cited by: §A.1.
 Playing non-linear games with linear oracles. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 420–428. Cited by: §A.1.
 The convex optimization approach to regret minimization. Optimization for Machine Learning, pp. 287. Cited by: §1.
 Smoothness of certain metric projections on Hilbert space. Transactions of the American Mathematical Society 184, pp. 87–100. Cited by: §3.2.
 Diversity-driven exploration strategy for deep reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 10510–10521. Cited by: §1.
 Safe reinforcement learning on autonomous vehicles. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–6. Cited by: §1.
 Revisiting Frank-Wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pp. 427–435. Cited by: §A.1, §3.2, §3.2, Algorithm 3.
 Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661. Cited by: §3.2.
 On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pp. 496–504. Cited by: §A.1, §4.1.
 First-order and stochastic optimization methods for machine learning. Springer. Cited by: §3.2.
 Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712. Cited by: Table 1, §1.
 A learning-based framework for velocity control in autonomous driving. IEEE Transactions on Automation Science and Engineering 13 (1), pp. 32–42. Cited by: §1.
 Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643. Cited by: §5.3.
 Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1, §4.2.
 Reinforcement learning with convex constraints. In Advances in Neural Information Processing Systems, pp. 14093–14102. Cited by: Table 1, §1, §1, §1, §2, §5.
 Finding the point of a polyhedron closest to the origin. SIAM Journal on Control 12 (1), pp. 19–26. Cited by: §A.1.
 Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §5.2.
 Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1, §4.2, §5.3.
 Chanceconstrained dynamic programming with application to riskaware robotic space exploration. Autonomous Robots 39 (4), pp. 555–571. Cited by: §1.
 Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems, pp. 7555–7565. Cited by: §1.
 Off-policy temporal-difference learning with function approximation. In ICML, pp. 417–424. Cited by: §3.2.
 Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80. Cited by: §3.2.
 Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295. Cited by: §1.
 Reinforcement learning: an introduction. MIT press. Cited by: §1, §5.2.
 Reward constrained policy optimization. In International Conference on Learning Representations, Cited by: §1.
 Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461. Cited by: §1, §4.2.
 Dueling network architectures for deep reinforcement learning. In International conference on machine learning, pp. 1995–2003. Cited by: §1, §4.2, §5.3.
 Convergence theory in nonlinear programming. Integer and nonlinear programming, pp. 1–36. Cited by: §A.1.
 Finding the nearest point in a polytope. Mathematical Programming 11 (1), pp. 128–149. Cited by: §A.1, §B.2, §1, §4.1, §4.1, §4.2, §4, Algorithm 4.
Appendix A More on Conditional Gradient Type Methods
A.1 Vanilla Conditional Gradient
For a convex function , the vanilla CG method (also known as the Frank-Wolfe method) solves the constrained optimization problem over a compact and convex set using a linear optimization oracle . The process is illustrated in Algorithm 3. For , the vanilla CG method is known to have a sublinear convergence rate (Jaggi, 2013). Various methods have been proposed to improve the convergence rate. For example, when is a polytope and the objective function is strongly convex, multiple variants, such as away-step CG (Wolfe, 1970; Jaggi, 2013), pairwise CG (Mitchell et al., 1974), and Wolfe’s method (Wolfe, 1976), are shown to enjoy a linear convergence rate (Lacoste-Julien and Jaggi, 2015). Linear convergence under other conditions has also been studied (Beck and Shtern, 2017; Garber and Hazan, 2013a, b).
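As a rough illustration of the vanilla CG iteration described above, the following minimal sketch runs it on a toy problem. It is not a transcription of Algorithm 3: the names `frank_wolfe` and `simplex_lmo`, the open-loop step size 2/(t+2), the probability simplex as the feasible set, and the quadratic objective are all assumptions chosen for this example.

```python
def frank_wolfe(grad, lmo, x0, num_iters=200):
    """Vanilla conditional gradient with the open-loop step size 2/(t+2)."""
    x = list(x0)
    for t in range(num_iters):
        s = lmo(grad(x))            # linear oracle: a minimizer of <g, s> over the set
        gamma = 2.0 / (t + 2.0)     # assumed step-size rule (a common default)
        # A convex combination of feasible points stays feasible, so no projection is needed.
        x = [(1.0 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return x


def simplex_lmo(g):
    """Linear minimization over the probability simplex returns a vertex."""
    best = min(range(len(g)), key=lambda i: g[i])
    return [1.0 if i == best else 0.0 for i in range(len(g))]


# Toy objective (an assumption): f(x) = 0.5 * ||x - c||^2 with c inside the simplex.
c = [0.2, 0.3, 0.5]
grad = lambda x: [xi - ci for xi, ci in zip(x, c)]

x_hat = frank_wolfe(grad, simplex_lmo, x0=[1.0, 0.0, 0.0])
gap = 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x_hat, c))
```

Each iterate is a convex combination of the vertices returned by the oracle, which is why the number of stored vertices (policies, in the CRL setting) can grow with the number of iterations.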
A.2 Wolfe’s Method for Minimum Norm Point
Wolfe’s method for the minimum norm point (MNP) problem is an iterative algorithm that finds the point with minimum Euclidean norm in a polytope, where the polytope is defined as the convex hull of a set of finitely many points . Wolfe’s method consists of a finite number of major cycles, each of which consists of a finite number of minor cycles. The original MNP method iterates until a termination criterion is satisfied. At the start of each major cycle, let be the hyperplane defined by . If separates the polytope from the origin, then the process terminates. Otherwise, it invokes an oracle to find any point on the near side of the hyperplane. The point is then added to the active set , and a minor cycle starts.
In a minor cycle, let be the point of smallest norm in the affine hull . If is in the relative interior of the convex hull , then is updated to and the minor cycle is terminated. Otherwise, is updated to the nearest point to on the line segment . Thus is updated to a boundary point of , and any point that is not on the face of in which lies is deleted. The minor cycles are executed repeatedly until becomes a corral, that is, a set whose affine minimizer lies inside its convex hull. Since a set of one point is always a corral, the minor cycles terminate after a finite number of runs.
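The major/minor cycle structure described above can be sketched in a few lines. This is a hedged illustration of the classical Wolfe method on an explicit finite point set, not the modified MNP method of Algorithm 2. The helper names (`solve`, `affine_minimizer`, `combine`, `wolfe_mnp`), the tolerance `tol`, the iteration cap `max_major`, the choice of the min-norm vertex as the starting point, and the use of an exact argmin as the oracle are all assumptions of this sketch.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def affine_minimizer(S):
    """Weights (summing to 1) of the min-norm point in the affine hull of S."""
    n = len(S)
    M = [[1.0 + dot(S[i], S[j]) for j in range(n)] for i in range(n)]
    beta = solve(M, [1.0] * n)      # solves (11^T + S S^T) beta = 1
    total = sum(beta)
    return [b / total for b in beta]


def combine(S, w):
    return [sum(wi * p[k] for wi, p in zip(w, S)) for k in range(len(S[0]))]


def wolfe_mnp(points, tol=1e-10, max_major=1000):
    S = [min(points, key=lambda p: dot(p, p))]   # start from the min-norm vertex
    lam, x = [1.0], list(S[0])
    for _ in range(max_major):                   # major cycles
        q = min(points, key=lambda p: dot(x, p)) # oracle: point on the near side
        if dot(x, x) <= dot(x, q) + tol:         # hyperplane separates: stop
            break
        S.append(q)
        lam.append(0.0)
        while True:                              # minor cycles
            alpha = affine_minimizer(S)
            if all(a > tol for a in alpha):      # S is a corral
                lam = alpha
                x = combine(S, lam)
                break
            # Move from x toward the affine minimizer until a weight hits zero.
            theta = min((l / (l - a) for l, a in zip(lam, alpha)
                         if a <= tol and l > a), default=0.0)
            lam = [theta * a + (1.0 - theta) * l for l, a in zip(lam, alpha)]
            keep = [i for i, l in enumerate(lam) if l > tol]
            S = [S[i] for i in keep]             # drop points off the current face
            total = sum(lam[i] for i in keep)
            lam = [lam[i] / total for i in keep]
            x = combine(S, lam)
    return x, S, lam


# Triangle that does not contain the origin; the second major cycle drops a vertex.
x_star, active, weights = wolfe_mnp([(2.0, 1.0), (-1.0, 2.0), (3.0, -1.0)])
```

On this triangle the first major cycle adds a second vertex without any drop, while the second major cycle needs two minor cycles and removes one point, so only two points remain in the active set at termination.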
Appendix B Proofs of the Main Results
Recall that (measurement of the mixed policy) throughout the process. In the following proofs, we define (measurement of the latest found policy) to simplify notation. When discussing one major cycle step with fixed, let denote the affine minimizer found in the th minor cycle (line 6 of Algorithm 2).
B.1 Proof of Proposition 4.1
Proposition 4.1.
For any convex set , the function is strongly convex if and only if is a singleton.
Recall that a function over is said to be strongly convex (Boyd et al., 2004) if , such that
Proof.
“If” part: when is a singleton, the target function is twice continuously differentiable, with , and hence is strongly convex with . The “only if” part can be proved by contraposition. For a non-singleton convex set, take two distinct points from the set; every convex combination of them also lies in the set and therefore attains the value 0 for , i.e., is not strictly convex, and hence not strongly convex. ∎
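Both directions of the argument can be written out explicitly. The derivation below is a sketch that assumes the function in question is the half squared Euclidean distance to the set; the symbols $f$, $\Omega$, $z_0$, $u$, $v$, $\theta$, $w_\theta$, and $m$ are notation introduced here for illustration and may differ from the paper's.

```latex
\begin{aligned}
&\textbf{Assumed form: } f(x) = \tfrac{1}{2}\min_{z \in \Omega}\|x - z\|^2 \;\ge\; 0. \\[4pt]
&\textbf{If: } \Omega = \{z_0\} \;\Rightarrow\; f(x) = \tfrac{1}{2}\|x - z_0\|^2,
  \quad \nabla^2 f(x) = I \succeq m I \ \text{ with } m = 1. \\[4pt]
&\textbf{Only if: } \text{take } u, v \in \Omega,\ u \neq v,\ \theta \in (0,1),\ 
  w_\theta = \theta u + (1-\theta) v \in \Omega
  \ \Rightarrow\ f(u) = f(v) = f(w_\theta) = 0. \\
&\text{Strong convexity with modulus } m > 0 \text{ would require} \\
&\qquad f(w_\theta) \;\le\; \theta f(u) + (1-\theta) f(v)
  - \tfrac{m}{2}\,\theta(1-\theta)\,\|u - v\|^2 \;<\; 0,
  \ \text{ contradicting } f(w_\theta) = 0 .
\end{aligned}
```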
B.2 Proof of Theorem 4.2
The idea is to consider the distance between and . When the major cycle has no minor cycle, the non-terminal condition and the affine minimizer property imply . Otherwise, we show that the first minor cycle strictly reduces the by moving along the segment joining and , and that the subsequent minor cycles cannot increase it. Since , we conclude , and the approximation error strictly decreases.
Theorem 4.2 (Approximation Error Strictly Decreases).
For each step, the found by Algorithm 2 satisfies . That is, the measurement vector of gets strictly closer to the convex set .
Proof.
If the current step is a major cycle with no minor cycle, then is the affine minimizer of with respect to . The affine minimizer property then implies . Since the iteration does not terminate at step , we have (Wolfe’s criterion (Wolfe, 1976)), and therefore is not equal to . Since is the unique affine minimizer, it follows that .
Otherwise, the current step contains one or more minor cycles. In this case, we show that the first minor cycle strictly reduces the approximation error, and that any subsequent minor cycles cannot increase it. For the first minor cycle, the affine minimizer of with respect to is outside . Let be the intersection of and the segment joining and . Let and denote the active set after the th minor cycle. Then, since is the affine minimizer of with respect to , we have
(16) 
where the second step uses the triangle inequality and the last step follows since the segment intersects the interior of , and the distance to strictly decreases along this segment. Therefore the point found by first minor cycle satisfies
(17) 
Subtracting the optimal value of the problem from both sides, it is clear that the first minor cycle strictly decreases the approximation error. By a similar argument, the approximation error cannot increase in subsequent minor cycles. However, after the first minor cycle, the iterate may already be at the intersection point, so the strict inequality in the last step of Eq. (16) needs to be replaced by a nonstrict inequality.
Therefore, any major cycle either finds an improving point and continues, or enters minor cycles in which the first minor cycle finds an improving point and the subsequent minor cycles do not increase the distance. Shifting both sides of by then shows that the approximation error strictly decreases. ∎
B.3 Proof of Theorem 4.3
In our analysis, we consider the approximation error as defined in (4)
We first prove the following Lemma B.1 and Lemma B.2. Then we present the proof of Theorem 4.3 using the lemmas.
Lemma B.1.
For a nondrop step, we have .
Proof.
The nondrop step contains either no minor cycle or one minor cycle. We first consider the no minor cycle case.
If a major cycle contains no minor cycle, then is the affine minimizer of the .
(18)  
(19)  
(20)  
(21)  
(22)  
(23) 
where Eq. (22) follows from the affine minimizer property Eq. (9). For in the last equation, and , we have
(24)  
(25)  
(26)  
(27) 
Then it suffices to show that .
Since is a convex set, the squared Euclidean distance function is convex for , which implies
(28) 
Putting in , we get , which together with Eq. (23) and Eq. (27) shows that for a nondrop step with no minor cycle, we have .
For a nondrop step with one minor cycle, we use Theorem 6 of Chakrabarty et al. (2014). By a linear translation of adding all points with , it gives
(29) 
Applying the same argument as in Eq. (28) then completes the proof.
∎
Lemma B.2.
After major cycle steps of the modified MNP method, the number of drop steps is less than .
Proof.
Since Lemma B.2 shows that drop steps are no more than half of the total major cycle steps, and Theorem 4.2 guarantees that these drop steps reduce the approximation error, we can safely skip these steps and reindex the step numbers to include nondrop steps only using .
For these nondrop steps, we claim that . Using Lemma B.1, we prove the convergence rate by induction. We first bound the error of any . For any
(30)  