1 Introduction
Imitation learning is a paradigm in which an agent learns to perform a task from expert demonstrations. The most straightforward approach to imitation learning is behavioral cloning (Pomerleau, 1991), which learns from expert trajectories to predict the expert action at any state. Despite its simplicity, behavioral cloning ignores the accumulation of prediction error over time. Consequently, although the learned policy closely resembles the expert policy at any given point in time, their trajectories may diverge in the long run.
To remedy the issue of error accumulation, inverse reinforcement learning (Russell, 1998; Ng and Russell, 2000; Abbeel and Ng, 2004; Ratliff et al., 2006; Ziebart et al., 2008; Ho and Ermon, 2016) jointly learns a reward function and the corresponding optimal policy, such that the expected cumulative reward of the learned policy closely resembles that of the expert policy. In particular, as a unifying framework for inverse reinforcement learning, generative adversarial imitation learning (GAIL) (Ho and Ermon, 2016) casts most existing approaches as iterative methods that alternate between (i) minimizing the discrepancy in expected cumulative reward between the expert policy and the policy of interest and (ii) maximizing such a discrepancy over the reward function of interest. Such a minimax optimization formulation of inverse reinforcement learning mirrors the training of generative adversarial networks (GAN), which alternates between updating the generator and updating the discriminator.
Despite its prevalence, inverse reinforcement learning, especially GAIL, is notoriously unstable in practice. More specifically, most inverse reinforcement learning approaches involve (partially) solving a reinforcement learning problem in an inner loop, which is often unstable, especially when the intermediate reward function obtained from the outer loop is ill-behaved. This is particularly the case for GAIL, which, for the sake of computational efficiency, alternates between policy optimization and reward function optimization without fully solving each of them. Moreover, such instability is exacerbated when the policy and reward function are both parameterized by deep neural networks. In this regard, the training of GAIL is generally more unstable than that of GAN, since policy optimization in deep reinforcement learning is often more challenging than training a standalone deep neural network.
In this paper, we take a first step towards theoretically understanding and algorithmically taming the instability in imitation learning. In particular, under a minimax optimization framework, we for the first time establish the global convergence of GAIL under a fundamental setting known as linear quadratic regulators (LQR). Such a setting of LQR is studied in a line of recent works (Bradtke, 1993; Fazel et al., 2018; Tu and Recht, 2017, 2018; Dean et al., 2018a, b; Simchowitz et al., 2018; Dean et al., 2017; Hardt et al., 2018) as a lens for theoretically understanding more general settings in reinforcement learning. See Recht (2018) for a thorough review. In imitation learning, particularly GAIL, the setting of LQR captures four critical challenges of more general settings:
(i) the minimax optimization formulation,
(ii) the lack of convex-concave geometry,
(iii) the alternating update of policy and reward function, and
(iv) the instability of the dynamical system induced by the intermediate policy and reward function (which differs from the aforementioned algorithmic instability).
Under such a fundamental setting, we establish a global sublinear rate of convergence towards a saddle point of the minimax optimization problem, which is guaranteed to be unique and recovers the globally optimal policy and reward function. Moreover, we establish a local linear rate of convergence, which, combined with the global sublinear rate of convergence, implies a global Q-linear rate of convergence. A byproduct of our theory is the stability of all the dynamical systems induced by the intermediate policies and reward functions along the solution path, which addresses the key challenge in (iv) and plays a vital role in our analysis. At the core of our analysis is a new potential function tailored towards non-convex-concave minimax optimization with alternating update, which is of independent interest. To ensure the decay of the potential function, we rely on the aforementioned stability of the intermediate dynamical systems along the solution path. To achieve such stability, we unveil an intriguing “self-enforcing” stabilizing mechanism, that is, with a proper configuration of stepsizes, the solution path approaches the critical threshold that separates the stable and unstable regimes at a slower rate as it gets closer to such a threshold. In other words, such a threshold forms an implicit barrier, which ensures the stability of the intermediate dynamical systems along the solution path without any explicit regularization.
Our work extends the recent line of works on reinforcement learning under the setting of LQR (Bradtke, 1993; Recht, 2018; Fazel et al., 2018; Tu and Recht, 2017, 2018; Dean et al., 2018a, b; Simchowitz et al., 2018; Dean et al., 2017; Hardt et al., 2018) to imitation learning. In particular, our analysis relies on several geometric lemmas established in Fazel et al. (2018), which are listed in §F for completeness. However, unlike policy optimization in reinforcement learning, which involves solving a minimization problem where the objective function itself serves as a good potential function, imitation learning involves solving a minimax optimization problem, which requires incorporating the gradient into the potential function. In particular, the stability argument developed in Fazel et al. (2018), which is based on the monotonicity of the objective function along the solution path, is no longer applicable, as minimax optimization alternately decreases and increases the objective function at each iteration. In a broader context, our work takes a first step towards extending the recent line of works on nonconvex optimization, e.g., Baldi and Hornik (1989); Du and Lee (2018); Wang et al. (2014a, b); Zhao et al. (2015); Ge et al. (2015, 2017a, 2017b); Anandkumar et al. (2014); Bandeira et al. (2016); Li et al. (2016a, b); Hajinezhad et al. (2016); Bhojanapalli et al. (2016); Sun et al. (2015, 2018), to non-convex-concave minimax optimization (Du and Hu, 2018; Sanjabi et al., 2018; Rafique et al., 2018; Lin et al., 2018; Dai et al., 2017, 2018a, 2018b; Lu et al., 2019) with alternating update, which is prevalent in reinforcement learning, imitation learning, and generative adversarial learning, and poses significantly more challenges.
In the rest of this paper, §2 introduces imitation learning, the setting of LQR, and the generative adversarial learning framework. In §3, we introduce the minimax optimization formulation and the gradient algorithm. In §4 and §5, we present the theoretical results and sketch the proof. We defer the detailed proof to §A-§F of the appendix.
Notation. We denote by $\|\cdot\|$ the spectral norm and by $\|\cdot\|_{\mathrm{F}}$ the Frobenius norm of a matrix. For vectors, we denote by $\|\cdot\|$ the Euclidean norm. In this paper, we write parameters in matrix form, and correspondingly, all the Lipschitz conditions are defined in the Frobenius norm.
2 Background
In the following, we briefly introduce the setting of LQR in §2.1 and imitation learning in §2.2. To unify the notation of LQR and more general reinforcement learning, we stick to the notion of cost function instead of reward function throughout the rest of this paper.
2.1 Linear Quadratic Regulator
In reinforcement learning, we consider a Markov decision process in which an agent interacts with the environment in the following manner. At the $t$-th time step, the agent selects an action $u_t$ based on its current state $x_t$, and the environment responds with the cost $c(x_t, u_t)$ and the next state $x_{t+1}$, which follows the transition dynamics. Our goal is to find a policy that minimizes the expected cumulative cost. In the setting of LQR, we consider states $x_t \in \mathbb{R}^d$ and actions $u_t \in \mathbb{R}^k$. The dynamics and cost function take the form
$$x_{t+1} = A x_t + B u_t, \qquad c(x_t, u_t) = x_t^\top Q x_t + u_t^\top R u_t,$$
where $A \in \mathbb{R}^{d \times d}$, $B \in \mathbb{R}^{d \times k}$, $Q \in \mathbb{R}^{d \times d}$, and $R \in \mathbb{R}^{k \times k}$ with $Q$ and $R$ positive definite. The problem of minimizing the expected cumulative cost is then formulated as the optimization problem
$$\begin{aligned} \underset{\pi}{\text{minimize}} \quad & \mathbb{E}_{x_0 \sim \mathcal{D}} \Bigl[ \sum_{t=0}^{\infty} \bigl( x_t^\top Q x_t + u_t^\top R u_t \bigr) \Bigr] \qquad (1) \\ \text{subject to} \quad & x_{t+1} = A x_t + B u_t, \quad u_t = \pi_t(x_t), \quad x_0 \sim \mathcal{D}, \end{aligned}$$
where $\mathcal{D}$ is a given initial distribution. Here we consider the infinite-horizon setting with a stochastic initial state $x_0 \sim \mathcal{D}$. In this setting, the optimal policy is known to be static and takes the form of linear feedback $u_t = -K x_t$, where $K \in \mathbb{R}^{k \times d}$ does not depend on $t$ (Anderson and Moore, 2007). Throughout the rest of this paper, we also refer to $K$ as the policy and drop the subscript $t$ in $\pi_t$. To ensure the expected cumulative cost is finite, we require the spectral radius of $A - BK$ to be less than one, which ensures that the dynamical system
$$x_{t+1} = (A - BK)\, x_t \qquad (2)$$
is stable. For a given policy $K$, we denote by $C(K)$ the expected cumulative cost in (1). For notational simplicity, we define
$$\Sigma_K = \mathbb{E}_{x_0 \sim \mathcal{D}} \Bigl[ \sum_{t=0}^{\infty} x_t x_t^\top \Bigr]. \qquad (3)$$
By (3), we have the following equivalent form of $C(K)$,
$$C(K) = \bigl\langle \Sigma_K, \; Q + K^\top R K \bigr\rangle, \qquad (4)$$
where $\langle \cdot, \cdot \rangle$ denotes the matrix inner product. Also, throughout the rest of this paper, we assume that the initial distribution $\mathcal{D}$ satisfies $\mathbb{E}_{x_0 \sim \mathcal{D}}[x_0 x_0^\top] \succ 0$. See Recht (2018) for a thorough review of reinforcement learning in the setting of LQR.
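To make the quantities in (2)-(4) concrete, the following Python sketch computes $\Sigma_K$ by solving the discrete Lyapunov equation $\Sigma_K = (A - BK)\,\Sigma_K\,(A - BK)^\top + \mathbb{E}[x_0 x_0^\top]$ and then evaluates $C(K)$ via (4). The problem instance and function names are illustrative only and are not taken from the paper.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def state_correlation(A, B, K, Sigma0):
    """Sigma_K = E[sum_t x_t x_t^T] for x_{t+1} = (A - B K) x_t, x_0 ~ D."""
    L = A - B @ K
    if max(abs(np.linalg.eigvals(L))) >= 1.0:
        raise ValueError("A - B K is not stable; the cumulative cost is infinite.")
    # Sigma_K solves the discrete Lyapunov equation Sigma = L Sigma L^T + Sigma0.
    return solve_discrete_lyapunov(L, Sigma0)

def lqr_cost(A, B, K, Q, R, Sigma0):
    """C(K) = <Sigma_K, Q + K^T R K> as in (4)."""
    Sigma_K = state_correlation(A, B, K, Sigma0)
    return np.trace(Sigma_K @ (Q + K.T @ R @ K))

# Illustrative instance: d = 2 states, k = 1 action.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
Sigma0 = np.eye(2)            # E[x_0 x_0^T], assumed positive definite
K = np.array([[1.0, 2.0]])    # a stabilizing linear feedback u_t = -K x_t
print(lqr_cost(A, B, K, Q, R, Sigma0))
```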
2.2 Imitation Learning
In imitation learning, we parameterize the cost function of interest by $c_\theta(x, u)$, where $\theta$ denotes the unknown cost parameter. In the setting of LQR, we have $c_\theta(x, u) = x^\top Q x + u^\top R u$ with $\theta = (Q, R)$. We observe expert trajectories in the form of state-action pairs $\{(x_t, u_t)\}_{t \ge 0}$, which are induced by the expert policy $\pi_E$. As a unifying framework of inverse reinforcement learning, GAIL (Ho and Ermon, 2016) casts maximum-entropy inverse reinforcement learning (Ziebart et al., 2008) and its extensions as the following minimax optimization problem,
$$\min_{\pi} \max_{\theta} \;\; \mathbb{E}_{\pi} \Bigl[ \sum_{t=0}^{\infty} c_\theta(x_t, u_t) \Bigr] - \mathbb{E}_{\pi_E} \Bigl[ \sum_{t=0}^{\infty} c_\theta(x_t, u_t) \Bigr] - H(\pi) - \psi(\theta), \qquad (5)$$
where for ease of presentation, we restrict to deterministic policies of the form $u_t = -K x_t$. Here $H(\pi)$ denotes the causal entropy of the dynamical system induced by $\pi$, which takes value zero in our setting of LQR, since the transition dynamics in (2) is deterministic conditioning on the current state. Meanwhile, $\psi(\theta)$ is a regularizer on the cost parameter.
The minimax optimization formulation in (5) mirrors the training of GAN (Goodfellow et al., 2014), which seeks to find a generator that recovers a target distribution. In the training of GAN, the generator and the discriminator are trained simultaneously, in the manner that the discriminator maximizes the discrepancy between the generated and target distributions, while the generator minimizes such a discrepancy. Analogously, in imitation learning, the policy of interest $\pi$ acts as the generator of trajectories, while the expert trajectories act as the target distribution. Meanwhile, the cost parameter of interest $\theta$ acts as the discriminator, which differentiates between the trajectories generated by $\pi$ and $\pi_E$. Intuitively, maximizing over the cost parameter $\theta$ amounts to assigning high costs to the state-action pairs visited more by $\pi$ than by $\pi_E$. Minimizing over $\pi$ aims at making such an adversarial assignment of cost impossible, which amounts to making the visitation distributions of $\pi$ and $\pi_E$ indistinguishable.
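As a rough illustration of how the expert enters (5) in practice, one can roll out trajectories under a linear feedback policy and form empirical correlation matrices from the visited states; the discrepancy between the learned and expert policies is then evaluated through such statistics of the demonstrations. The sketch below uses our own naming, with the infinite horizon truncated for simplicity.

```python
import numpy as np

def rollout(A, B, K, x0, horizon):
    """One trajectory of x_{t+1} = A x_t + B u_t under the feedback u_t = -K x_t."""
    xs, us, x = [], [], x0
    for _ in range(horizon):          # truncation of the infinite horizon
        u = -K @ x
        xs.append(x)
        us.append(u)
        x = A @ x + B @ u
    return np.array(xs), np.array(us)

def empirical_state_correlation(state_trajectories):
    """Estimate E[sum_t x_t x_t^T] by averaging over sampled trajectories."""
    return np.mean([sum(np.outer(x, x) for x in xs) for xs in state_trajectories],
                   axis=0)
```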
3 Algorithm
In the sequel, we first introduce the minimax formulation of generative adversarial imitation learning in §3.1, then we present the gradient algorithm in §3.2.
3.1 Minimax Formulation
We consider the minimax optimization formulation of the imitation learning problem,
$$\min_{K \in \mathcal{K}} \; \max_{\theta \in \Theta} \;\; L(K, \theta) = C(K, \theta) - C(K_E, \theta) - \lambda\, \psi(\theta). \qquad (6)$$
Here we denote by $\theta = (Q, R)$ the cost parameter, where $Q$ and $R$ are both positive definite matrices, $\lambda > 0$ is a regularization weight, and $\Theta$ is the feasible set of cost parameters. We write $C(K, \theta)$ for the expected cumulative cost in (4) to make its dependence on the cost parameter explicit. We assume $\Theta$ is convex and there exist positive constants $\alpha_Q$, $\beta_Q$, $\alpha_R$, and $\beta_R$ such that for any $\theta = (Q, R) \in \Theta$, it holds that
$$\alpha_Q \cdot I \preceq Q \preceq \beta_Q \cdot I, \qquad \alpha_R \cdot I \preceq R \preceq \beta_R \cdot I. \qquad (7)$$
Also, $\mathcal{K}$ consists of all stabilizing policies, such that $\rho(A - BK) < 1$ for all $K \in \mathcal{K}$, where $\rho(\cdot)$ is the spectral radius, defined as the largest modulus of the eigenvalues of a matrix. The expert policy is defined as $K_E = \mathop{\mathrm{argmin}}_{K \in \mathcal{K}} C(K, \theta_E)$ for an unknown cost parameter $\theta_E \in \Theta$. However, note that $\theta_E$ is not necessarily the unique cost parameter such that $K_E$ is optimal. Hence, our goal is to find one such cost parameter under which $K_E$ is optimal. The term $\psi(\theta)$ is the regularizer on the cost parameter, which is set to be strongly convex and smooth. To understand the minimax optimization problem in (6), we first consider the simplest case with $\lambda = 0$. A saddle point $(K^*, \theta^*)$ of the objective function in (6), defined by
$$L(K^*, \theta) \le L(K^*, \theta^*) \le L(K, \theta^*) \quad \text{for any } K \in \mathcal{K} \text{ and } \theta \in \Theta, \qquad (8)$$
has the following desired properties. First, we have that the optimal policy $K^*$ recovers the expert policy $K_E$. By the optimality condition in (8), we have
$$L(K^*, \theta_E) \le L(K^*, \theta^*) \le L(K_E, \theta^*) = 0, \qquad (9)$$
where the first inequality follows from the optimality of $\theta^*$ and the second inequality follows from the optimality of $K^*$. Meanwhile, by the optimality of $K_E$ under $\theta_E$, we have $L(K^*, \theta_E) = C(K^*, \theta_E) - C(K_E, \theta_E) \ge 0$, so that (9) holds with equalities. Since the optimal solution to the policy optimization problem $\min_{K \in \mathcal{K}} C(K, \theta_E)$ is unique (as proved in §5), we obtain from (9) that $K^* = K_E$. Second, $K_E$ is an optimal policy with respect to the cost parameter $\theta^*$, since by $K^* = K_E$ and the optimality condition $L(K^*, \theta^*) \le L(K, \theta^*)$, we have $C(K_E, \theta^*) \le C(K, \theta^*)$ for any $K \in \mathcal{K}$. In this sense, the saddle point of (6) recovers a desired cost parameter and the corresponding optimal policy.
Although dropping the regularizer already yields the desired properties of the saddle point, there are several reasons why we cannot simply discard it. The first reason is that a strongly convex regularizer improves the geometry of the problem and makes the saddle point of (6) unique, which eliminates the ambiguity in learning the desired cost parameter. Second, the regularizer draws a connection to the existing optimization formulations of GAN. For example, as shown in Ho and Ermon (2016), with a specific choice of $\psi$, (6) reduces to the classical optimization formulation of GAN (Goodfellow et al., 2014), which minimizes the Jensen-Shannon divergence between the generator and target distributions.
3.2 Gradient Algorithm
To solve the minimax optimization problem in (6), we consider the alternating gradient updating scheme,
$$K_{t+1} = K_t - \eta \, \nabla_K L(K_t, \theta_t), \qquad (10)$$
$$\theta_{t+1} = \Pi_\Theta \bigl[ \theta_t + \nu \, \nabla_\theta L(K_{t+1}, \theta_t) \bigr]. \qquad (11)$$
Here $\eta$ and $\nu$ are the stepsizes of the policy update and the cost parameter update, respectively, and $\Pi_\Theta$ is the projection operator onto the convex set $\Theta$, which ensures that each iterate $\theta_{t+1}$ stays within $\Theta$.
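The projection in (11) depends on the feasible set $\Theta$, which the paper only assumes to be convex and to satisfy (7). If, purely for illustration, $\Theta$ is taken to be the set of pairs of symmetric matrices whose eigenvalues lie in the intervals prescribed by (7), the Frobenius-norm projection reduces to clipping eigenvalues, as in the following sketch (function names are ours).

```python
import numpy as np

def project_psd_interval(M, lo, hi):
    """Frobenius-norm projection of a symmetric matrix onto {lo*I <= X <= hi*I}."""
    w, V = np.linalg.eigh((M + M.T) / 2)      # symmetrize defensively
    return V @ np.diag(np.clip(w, lo, hi)) @ V.T

def project_theta(Q, R, alpha_Q, beta_Q, alpha_R, beta_R):
    """Project the cost parameter theta = (Q, R) onto the assumed feasible set."""
    return (project_psd_interval(Q, alpha_Q, beta_Q),
            project_psd_interval(R, alpha_R, beta_R))
```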
There are several ways to obtain the gradient in (10) without knowing the dynamics, but based on sampled trajectories. One example is the deterministic policy gradient algorithm (Silver et al., 2014), in which the gradient of the cost function is obtained as the limit of the gradient of a stochastic policy as its exploration noise vanishes. Here the action-value function associated with the policy, defined as the expected total cost of the policy starting from a given state and action, can be estimated based on the sampled trajectory. An alternative approach is the evolutionary strategy (Salimans et al., 2017), which uses zeroth-order information to approximate the gradient with a random perturbation of the policy, where the perturbation is a random matrix with a sufficiently small variance.
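A minimal sketch of such a zeroth-order estimate, in the spirit of Salimans et al. (2017), is given below; the cost oracle, the perturbation scale, and the sample count are illustrative choices rather than the paper's.

```python
import numpy as np

def es_gradient(cost, K, sigma=0.05, n_samples=100, rng=None):
    """Gaussian-smoothing (evolutionary-strategy) estimate of grad_K cost(K)."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(K)
    base = cost(K)                    # baseline term reduces the variance
    for _ in range(n_samples):
        U = rng.standard_normal(K.shape)
        # sigma must be small enough that K + sigma * U stays stabilizing,
        # since otherwise the perturbed cost is infinite
        grad += (cost(K + sigma * U) - base) * U
    return grad / (n_samples * sigma)
```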
To obtain the gradient in (11), we have
$$\nabla_Q L(K, \theta) = \Sigma_K - \Sigma_{K_E} - \lambda\, \nabla_Q \psi(\theta), \qquad (12)$$
$$\nabla_R L(K, \theta) = K \Sigma_K K^\top - K_E \Sigma_{K_E} K_E^\top - \lambda\, \nabla_R \psi(\theta). \qquad (13)$$
Here $\Sigma_{K_E} = \mathbb{E}_{x_0 \sim \mathcal{D}}[\sum_{t=0}^{\infty} x_t x_t^\top]$ with $\{x_t\}_{t \ge 0}$ generated by the expert policy $K_E$, which can be estimated based on the expert trajectory.
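The scheme in (10)-(13) can be made concrete in a short model-based sketch. This is a sketch only: the exact policy gradient expression $\nabla_K C(K, \theta) = 2\bigl((R + B^\top P_K B)K - B^\top P_K A\bigr)\Sigma_K$ from Fazel et al. (2018) stands in for the trajectory-based estimators discussed above, the regularizer is the squared penalty used as an example in §4.2, the projection is the eigenvalue clipping assumed in §3.1, and all function names are ours.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def clip_eigs(M, lo, hi):
    """Frobenius projection of a symmetric matrix onto {lo*I <= X <= hi*I}."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return V @ np.diag(np.clip(w, lo, hi)) @ V.T

def policy_gradient(A, B, K, Q, R, Sigma0):
    """Exact grad_K C(K, theta) = 2 E_K Sigma_K (Fazel et al., 2018)."""
    L = A - B @ K
    Sigma_K = solve_discrete_lyapunov(L, Sigma0)          # state correlation (3)
    P_K = solve_discrete_lyapunov(L.T, Q + K.T @ R @ K)   # value matrix of K
    E_K = (R + B.T @ P_K @ B) @ K - B.T @ P_K @ A
    return 2 * E_K @ Sigma_K

def gail_lqr(A, B, K_E, K0, Sigma0, Q0, R0, lam, eta, nu, n_iters,
             bounds=(0.1, 10.0, 0.1, 10.0)):
    """Alternating updates (10)-(11) with psi(theta) = 0.5*||theta - theta_0||_F^2;
    K0 must be stabilizing, and eta must be small enough to keep the iterates
    stabilizing (cf. Condition 4.1)."""
    aQ, bQ, aR, bR = bounds
    Q, R, K = Q0.copy(), R0.copy(), K0.copy()
    Sigma_E = solve_discrete_lyapunov(A - B @ K_E, Sigma0)
    for _ in range(n_iters):
        K = K - eta * policy_gradient(A, B, K, Q, R, Sigma0)                 # (10)
        Sigma_K = solve_discrete_lyapunov(A - B @ K, Sigma0)
        grad_Q = Sigma_K - Sigma_E - lam * (Q - Q0)                          # (12)
        grad_R = K @ Sigma_K @ K.T - K_E @ Sigma_E @ K_E.T - lam * (R - R0)  # (13)
        Q = clip_eigs(Q + nu * grad_Q, aQ, bQ)                               # (11)
        R = clip_eigs(R + nu * grad_R, aR, bR)
    return K, (Q, R)
```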
4 Main Results
In this section, we present the convergence analysis of the gradient algorithm in (10) and (11). We first prove that the solution path is guaranteed to be stabilizing and then establish the global convergence. For notational simplicity, we define the following constants,
(14)
where the constants involved are defined in (7). Also, we define
(15)
(16)
which play a key role in upper bounding the cost function along the solution path.
4.1 Stability Guarantee
A minimum requirement in reinforcement learning is to obtain a stabilizing policy, such that the induced dynamical system does not tend to infinity. Throughout this paper, we employ a notion of uniform stability, which states that there exists a constant $\bar\rho < 1$ such that $\rho(A - B K_t) \le \bar\rho$ for all iterations $t$. Moreover, the uniform stability also allows us to establish the smoothness of the objective function, which is discussed in §5.2.
Recall that we assume $Q$ and $R$ are positive definite. Therefore, the uniform stability is implied by the boundedness of the cost function, since we have
$$C(K, \theta) \ge \langle \Sigma_K, Q \rangle \ge \alpha_Q \cdot \| \Sigma_K \|, \qquad (17)$$
where the second inequality follows from the properties of trace and the assumption in (7). However, it remains difficult to show that the cost function is upper bounded. Although the update of the policy in (10) decreases the cost function for a fixed cost parameter, the update of the cost parameter in (11) increases the objective function, which possibly increases the cost function as well. To this end, we choose suitable stepsizes as in the next condition to ensure the boundedness of the cost function.
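As a numerical sanity check on (17), the following sketch (with our own function names and an illustrative tolerance) verifies that the spectral norm of the state correlation matrix is controlled by the cost whenever $Q \succeq \alpha_Q I$, which is the mechanism by which a uniform bound on the cost keeps the closed-loop system stable.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def stability_certificate(A, B, K, Q, R, Sigma0, alpha_Q):
    """Check alpha_Q * ||Sigma_K|| <= C(K, theta) as in (17), assuming Q >= alpha_Q * I."""
    L = A - B @ K
    rho = max(abs(np.linalg.eigvals(L)))            # spectral radius of A - B K
    Sigma_K = solve_discrete_lyapunov(L, Sigma0)    # state correlation (3)
    cost = np.trace(Sigma_K @ (Q + K.T @ R @ K))    # C(K, theta) as in (4)
    assert alpha_Q * np.linalg.norm(Sigma_K, 2) <= cost + 1e-8
    return rho, cost
```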
For the update of policy and cost parameter in (10) and (11), let
Here , , , and are defined in (14), (15), and (16). The constants and are defined as
(18) |
The next lemma shows that the solution path is uniformly stabilizing, and meanwhile, along the solution path, the cost function and the norm of the policy are both upper bounded.
Proof.
See §A for a detailed proof. ∎
4.2 Global Convergence
Before showing that the gradient algorithm converges to the saddle point of (6), we establish its uniqueness. We define the proximal gradient of the objective function in (6) as
(19)
Then a proximal stationary point is defined as a point at which the proximal gradient in (19) vanishes. [Uniqueness of Saddle Point] There exists a unique proximal stationary point, denoted as $(K^*, \theta^*)$, of the objective function in (6), which is also its unique saddle point.
Proof.
See §D.1 for a detailed proof. ∎
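Since the exact form of (19) is not reproduced here, the following sketch uses the standard projected-gradient mapping as a stand-in: it combines the policy gradient with the displacement of the cost parameter under one projected ascent step, and it can serve as the stationarity measure that defines $T(\epsilon)$ in (23) below.

```python
import numpy as np

def gradient_mapping(grad_K, theta, grad_theta, project, nu):
    """Stationarity measure: policy gradient plus the projected-gradient
    mapping of the cost parameter (a stand-in for the quantity in (19))."""
    theta_plus = project(theta + nu * grad_theta)   # one projected ascent step
    return np.sqrt(np.linalg.norm(grad_K) ** 2
                   + np.linalg.norm((theta - theta_plus) / nu) ** 2)
```

The iteration can then be stopped once this quantity falls below a target accuracy $\epsilon$.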
To analyze the convergence of the gradient algorithm, we first need to establish the Lipschitz continuity and smoothness of the objective function. However, the cost function becomes steep as the policy approaches the boundary of the set of stabilizing policies. Therefore, we do not have the desired Lipschitz continuity and smoothness globally with respect to the policy. Nevertheless, given that the cost function is upper bounded along the solution path as in Lemma 4.1, we obtain the desired properties in the following lemma.
For notational simplicity, we slightly abuse the notation and rewrite the cost parameter as a block diagonal matrix $\theta = \mathrm{diag}(Q, R)$ and correspondingly define the matrix-valued function
$$m(K) = \mathrm{diag}\bigl( \Sigma_K, \; K \Sigma_K K^\top \bigr). \qquad (20)$$
Then the objective function takes the form
$$L(K, \theta) = \bigl\langle m(K) - m(K_E), \; \theta \bigr\rangle - \lambda\, \psi(\theta). \qquad (21)$$
We assume that the initial policy of the gradient algorithm is stabilizing. Under Condition 4.1, there exists a compact set of stabilizing policies that contains the iterates $K_t$ for all $t$. Also, there exist constants such that the matrix-valued function $m(\cdot)$ defined in (20) is Lipschitz continuous and smooth over this compact set.
Proof.
See §5.2 for a detailed proof. ∎
Note that the cost parameter is only identifiable up to a multiplicative constant. Recall the assumption on the feasible set of cost parameters in (7). In the sequel, we establish the sublinear rate of convergence with a proper choice of the parameters, which is characterized by the following condition.
We assume that the parameters satisfy
(22)
where the constants involved are defined in (14), (15), and (16) and in Lemma 4.2.
The following condition, together with Condition 4.1, specifies the required stepsizes to establish the global convergence of the gradient algorithm.
In the following, we establish the global convergence of the gradient algorithm. Recall the proximal gradient of the objective function in (6), as defined in (19).
Under Conditions 4.1 and 4.2 and the stepsize condition above, the proximal gradient converges to zero, which implies that $(K_t, \theta_t)$ converges to the unique saddle point of the objective function. To characterize the rate of convergence, we define $T(\epsilon)$ as the smallest iteration index at which the norm of the proximal gradient falls below an error $\epsilon$,
(23)
Then there exists a constant $c$, which depends on the quantities specified in (37), such that $T(\epsilon) \le c/\epsilon^2$ for any $\epsilon > 0$.
Proof.
See §5.2 for a detailed proof. ∎
To understand Condition 4.2, we consider a simple case where the regularizer is the squared penalty centered at some point $\theta_0$, that is, $\psi(\theta) = \frac{1}{2} \| \theta - \theta_0 \|_{\mathrm{F}}^2$, which is strongly convex and smooth with both constants equal to one. Also, by (16) we have
(24) |
Let for some constant . By (15) we have
(25) |
By (25) and Lemma 4.1, we obtain
(26)
(27)
for all $t$. In §5.2 we further prove that this constant is determined by the uniform upper bound of the cost function and the policy along the solution path, which is established in Lemma 4.1. Hence, by (26) and (27), it is independent of the regularization weight. Meanwhile, by (24) and (25) we have
Thus, for a sufficiently large regularization weight, the required inequality holds, which leads to Condition 4.2.
Condition 4.2 plays a key role in establishing the convergence. On the one hand, to ensure the boundedness of the cost function, we require an upper bound on the stepsizes in Condition 4.1. On the other hand, to ensure the convergence of the gradient algorithm, we require another upper bound on the stepsizes. Condition 4.2 ensures that these two requirements on the stepsizes are compatible.
4.3 Q-Linear Convergence
In this section, we establish the Q-linear convergence of the gradient algorithm in (10) and (11). Recall that for a given cost parameter $\theta = (Q, R)$, the optimal policy takes the form of linear feedback $u_t = -K(\theta)\, x_t$, where $P(\theta)$ is the positive definite solution to the discrete-time algebraic Riccati equation (Anderson and Moore, 2007),
$$P(\theta) = A^\top P(\theta) A - A^\top P(\theta) B \bigl( R + B^\top P(\theta) B \bigr)^{-1} B^\top P(\theta) A + Q. \qquad (28)$$
We denote by $P(\theta)$ the corresponding implicit matrix-valued function defined by (28). Also, we define $K(\theta)$ as
$$K(\theta) = \bigl( R + B^\top P(\theta) B \bigr)^{-1} B^\top P(\theta) A$$
for $\theta \in \Theta$. We assume the following regularity condition.
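For completeness, $P(\theta)$ and $K(\theta)$ can be computed with an off-the-shelf Riccati solver; the sketch below uses `scipy.linalg.solve_discrete_are`, which returns the stabilizing solution of (28).

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def optimal_policy(A, B, Q, R):
    """K(theta) and P(theta) for theta = (Q, R), via the Riccati equation (28)."""
    P = solve_discrete_are(A, B, Q, R)                  # stabilizing solution of (28)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # K(theta)
    return K, P
```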
The unique stationary point $\theta^*$ of the cost parameter is an interior point of $\Theta$. We also impose an additional regularity assumption at $\theta^*$.
We denote by $K(\theta)$ the unique optimal policy corresponding to the cost parameter $\theta$ and consider the corresponding value of the objective function defined in (6), that is,
$$\min_{K \in \mathcal{K}} L(K, \theta) = L\bigl( K(\theta), \theta \bigr). \qquad (29)$$
The following two lemmas characterize the local properties of the functions $K(\theta)$ and $L(K(\theta), \theta)$ in a neighborhood of the saddle point of the objective function.
Under Condition 4.3, there exist constants and a neighborhood of $\theta^*$ such that $K(\theta)$ is Lipschitz continuous and smooth with respect to $\theta$ over this neighborhood.
Proof.
See §D.4 for a detailed proof. ∎
Under Condition 4.3, there exist a constant and a neighborhood of $\theta^*$ such that $L(K(\theta), \theta)$ is strongly concave and smooth with respect to $\theta$ over this neighborhood.
Proof.
See §D.5 for a detailed proof. ∎
To establish the Q-linear convergence, we need an additional condition, which further upper bounds the stepsizes $\eta$ and $\nu$ in (10) and (11).
We define the following potential function
(30)
where the weighting constant is specified later. Note that the convergence of the potential function to zero implies that $(K_t, \theta_t)$ converges to the saddle point $(K^*, \theta^*)$. Also, we define
(31)
The following theorem establishes the Q-linear convergence of the gradient algorithm.
Under Conditions 4.1, 4.2, and 4.3, together with the stepsize conditions above, the contraction factor defined in (31) is strictly less than one. Moreover, there exists an iteration index after which the potential function defined in (30) contracts by this factor at every iteration, which establishes the Q-linear convergence.
Proof.
See §5.3 for a detailed proof. ∎
5 Proof Sketch
In this section, we sketch the proof of the main results in §4.
5.1 Proof of Stability Guarantee
To prove Lemma 4.1, we lay out two auxiliary lemmas that characterize the geometry of the cost function with respect to the policy $K$. The first lemma characterizes the stationary point of policy optimization. The second lemma shows that the cost function is gradient dominated with respect to $K$.
If $\nabla_K C(K, \theta) = 0$, then $K$ is the unique optimal policy corresponding to the cost parameter $\theta$.
Proof.
See §D.2 for a detailed proof. ∎
[Corollary 5 in Fazel et al. (2018)] The cost function $C(K, \theta)$ is gradient dominated with respect to $K$, that is, the suboptimality $C(K, \theta) - C(K(\theta), \theta)$ is upper bounded by a constant multiple of $\| \nabla_K C(K, \theta) \|_{\mathrm{F}}^2$, where $K(\theta)$ is the optimal policy defined in (29).
Proof.
See Fazel et al. (2018) for a detailed proof. ∎
Lemma 5.1 allows us to upper bound the increment of the cost function at each iteration of (10) and (11) by choosing a sufficiently small stepsize $\nu$ relative to $\eta$. In fact, we construct a threshold such that, when the cost function is close to such a threshold, an upper bound of the increment goes to zero. Thus, the cost function is upper bounded by such a threshold. See §A for a detailed proof.
5.2 Proof of Global Convergence
To prove Theorem 4.2, we first establish the Lipschitz continuity and smoothness of the objective function within a restricted domain, as in Lemma 4.2. Recall that $m(K)$ is defined in (20) and the objective function takes the form in (21). Since the matrix-valued function $\Sigma_K$ plays a key role in $m(K)$, in the sequel we characterize the smoothness of $\Sigma_K$ with respect to $K$ within a restricted set. For any constant $c > 0$, there exist constants depending on $c$ such that $\Sigma_K$ is Lipschitz continuous and smooth in $K$ over the set of stabilizing policies whose cost is bounded by $c$.
Proof.
See §D.3 for a detailed proof. ∎
Proof.
Let the compact set in Lemma 4.2 be the sublevel set of the cost function corresponding to the uniform upper bound in Lemma 4.1. Then by Lemma 4.1, the cost function is upper bounded along the solution path,
which implies that $K_t$ belongs to this set for all $t$. By Lemma 5.2, we obtain the Lipschitz continuity and smoothness of $\Sigma_K$ over this set. Furthermore, by the definition of $m(K)$ in (20) and the boundedness of the policy iterates established in Lemma 4.1, $m(K)$ is also Lipschitz continuous and smooth over this set. Thus, we conclude the proof of Lemma 4.2. ∎
Based on Lemma 4.2, we prove the global convergence in Theorem 4.2. To this end, we construct a potential function that decays monotonically along the solution path, which takes the form
(32)
for some constant, which is specified in the next lemma. Meanwhile, we define three constants as
(33)
(34)
(35)
The following lemma characterizes the decrement of the potential function defined in (32) at each iteration.
Proof.
See §B for a detailed proof. ∎
Proof.
By the definitions of the potential function and the objective function in (32) and (6), we obtain a per-iteration bound, stated as (36), that holds along the solution path and involves the constants defined in (33) and (34). By rearranging the terms in (36) and summing over iterations, we conclude that the proximal gradient converges to zero. Also, let
(37)
For any $\epsilon > 0$, by the definition of $T(\epsilon)$ in (23), the summed bound yields the claimed upper bound on $T(\epsilon)$, which concludes the proof of Theorem 4.2. ∎