# A Generic First-Order Algorithmic Framework for Bi-Level Programming Beyond Lower-Level Singleton

In recent years, a variety of gradient-based first-order methods have been developed to solve bi-level optimization problems for learning applications. However, the theoretical guarantees of these existing approaches rely heavily on the simplification that, for each fixed upper-level variable, the lower-level solution must be a singleton (a.k.a. Lower-Level Singleton, LLS). In this work, we first design a counter-example to illustrate how these approaches can fail when the LLS condition does not hold. Then, by formulating Bi-Level Programs (BLPs) from the viewpoint of optimistic bi-level optimization and aggregating hierarchical objective information, we establish Bi-level Descent Aggregation (BDA), a flexible and modularized algorithmic framework for generic bi-level optimization. Theoretically, we derive a new methodology to prove the convergence of BDA without the LLS condition. Our investigations also demonstrate that BDA is compatible with a variety of particular first-order computation modules. Additionally, as an interesting byproduct, we improve the conventional first-order bi-level schemes (under the LLS simplification) by establishing their convergence under weaker assumptions. Extensive experiments justify our theoretical results and demonstrate the superiority of the proposed BDA for different tasks, including hyper-parameter optimization and meta learning.


## 1 Introduction

Bi-Level Programs (BLPs) are mathematical programs with optimization problems in their constraints and have recently been recognized as powerful theoretical tools for a variety of learning tasks (e.g., hyper-parameter optimization and meta learning). Mathematically, most BLPs in these areas can be (re)formulated as the following hierarchical optimization problem:

$$\min_{x \in \mathcal{X},\, y} F(x, y), \quad \text{s.t. } y \in \mathcal{S}(x), \tag{1}$$

where the Upper-Level (UL) objective $F$ is a continuous function, the UL constraint $\mathcal{X}$ is a compact set, and $\mathcal{S}(x)$ is a set-valued mapping which indicates the parameterized solution set of the Lower-Level (LL) subproblem. In this work, we just consider the following LL subproblem:

$$\mathcal{S}(x) := \arg\min_{y} f(x, y), \tag{2}$$

where $f$ is a continuous function. Indeed, the BLP model in Eqs. (1)-(2) is a hierarchical optimization problem with two coupled variables $(x, y)$. Specifically, given the UL variable $x$ from the feasible set $\mathcal{X}$ (i.e., $x \in \mathcal{X}$), the LL variable $y$ is an optimal solution of the LL subproblem governed by $x$ (i.e., $y \in \mathcal{S}(x)$). Due to the hierarchical structure, the BLP model in Eqs. (1)-(2) is in general nonconvex and NP-hard, even when both the UL and LL subproblems are linear (Jeroslow, 1985; Dempe, 2018). Moreover, due to the complicated dependency between the UL and LL variables in Eq. (1), BLPs are very challenging to solve. This difficulty is further aggravated when the LL solution set $\mathcal{S}(x)$ in Eq. (2) is not a singleton for a given $x$. Hereafter, we refer to the requirement that $\mathcal{S}(x)$ be a singleton for every $x \in \mathcal{X}$ as the Lower-Level Singleton condition, or LLS for short.

### 1.1 Related Work

Although early works on bi-level optimization date back to the nineteen seventies (Dempe, 2018), it was not until the last decade that a large number of BLP models were proposed to address specific learning and vision problems. Representative applications include meta learning (Franceschi et al., 2018; Rajeswaran et al., 2019; Zügner and Günnemann, 2019), hyper-parameter optimization (Franceschi et al., 2017; Okuno et al., 2018; MacKay et al., 2019) (Yang et al., 2019), generative adversarial learning (Pfau and Vinyals, 2016), graph and image processing (Kunisch and Pock, 2013; De los Reyes et al., 2017), just to name a few.

A large number of optimization methods have been developed to solve BLPs in Eqs. (1)-(2). A prevailing approach is associated with the optimality characterization of the LL subproblem. Using the first-order optimality conditions, BLPs in Eqs. (1)-(2) are reformulated into single-level optimization problems which are numerically tractable (Moore, 2010; Kunapuli et al., 2008; Okuno et al., 2018). However, these bi-level algorithms involve too many auxiliary variables; as a consequence, their performance is hardly satisfactory for BLP models in complex learning fields.

Recently, gradient-based First-Order Methods (FOMs) have been revisited to solve BLPs for learning and vision tasks. The key idea underlying these approaches is to calculate gradients of the UL and LL objectives in a hierarchical manner. A popular approach in this direction is to first calculate gradient representations of the LL objective and then perform either reverse or forward gradient computations (based on the LL gradients) for the UL subproblem. It is known that the reverse mode is identical to back-propagation through time, while the forward mode calculates gradients by appealing to the chain rule (Maclaurin et al., 2015; Franceschi et al., 2017, 2018). Similar techniques were also used in (Jenni and Favaro, 2018; Zügner and Günnemann, 2019; Rajeswaran et al., 2019), but with different specific implementations. The work in (Shaban et al., 2019) adopted truncated back-propagation to improve the scalability of these methods. Furthermore, in (Lorraine and Duvenaud, 2018; MacKay et al., 2019), a so-called hyper-network was introduced and trained to map LL gradients for such hierarchical optimization. Although widely used in practical applications, the theoretical properties of these bi-level FOMs are still not convincing. Indeed, all of these methods impose the LLS constraint on Eqs. (1)-(2) to simplify their optimization problem. To satisfy such a restrictive condition, existing works (e.g., (Franceschi et al., 2018; Shaban et al., 2019)) have to introduce the strong convexity (or local strong convexity) assumption for the LL subproblem, which is too restrictive to be satisfied in real-world complex tasks.

### 1.2 Our Contributions

In this work, we propose a generic first-order bi-level algorithmic framework, named Bi-level Descent Aggregation (BDA), that is flexible and efficient in handling BLPs of the form in Eqs. (1)-(2). Unlike the above prior gradient-based bi-level methods, which formulate the iteration schemes as two task-related single-level optimization problems and are fully dependent on the LLS condition, our BDA investigates BLPs from the optimistic point of view and develops a hierarchical optimization scheme, which consists of a single-level optimization formulation for the UL variable $x$ and a simple bi-level optimization formulation for the LL variable $y$. We prove in theory that the convergence of BDA can be strictly guaranteed in the absence of the restrictive LLS condition. Moreover, our theoretical results are general enough to allow a variety of embedded iteration modules to handle different types of objective functions in Eqs. (1)-(2); thus BDA is indeed a task-agnostic optimization framework for BLPs. In addition, we demonstrate that the strong convexity of the LL objective (needed in previous theoretical results (Franceschi et al., 2018)) is non-essential, and we improve the convergence theories under the LLS condition by eliminating the strong convexity assumption. Our experimental results first verify the theoretical investigations and then show that BDA compares favorably to state-of-the-art methods on various applications, including hyper-parameter optimization and meta learning. The contributions can be summarized as:

• A counter-example (i.e., Example 1) explicitly indicates the importance of the LLS condition for existing bi-level FOMs. In particular, we investigate their iteration behaviors and reach the conclusion that using these approaches in the absence of the LLS condition may lead to incorrect solutions.

• By formulating BLPs in Eqs. (1)-(2) from the view point of optimistic bi-level, BDA provides a generic bi-level algorithmic framework. Embedded with a specific gradient-aggregation-based iterative module, BDA is applicable to a variety of learning/vision tasks.

• We strictly prove the convergence of BDA for general BLPs without the LLS condition. Our theoretical results are fairly general in the sense that, with slight modifications, they apply to different types of bi-level objectives in Eqs. (1)-(2). In fact, depending on the specific problem setting, various appropriate iteration modules can be incorporated into BDA while the theoretical convergence is still guaranteed.

• As a nontrivial byproduct, we revisit and improve the convergence justification of existing gradient-based schemes (Franceschi et al., 2018; Shaban et al., 2019) for BLPs in the LLS scenario. In particular, we successfully eliminate the strong convexity assumption on the LL subproblem, which is usually too restrictive for real-world applications.

## 2 First-Order Bi-level Approaches

### 2.1 Solution Strategies with Lower-Level Singleton

As aforementioned, a number of FOMs have been proposed to solve BLPs in Eqs. (1)-(2). However, these existing methods all rely on the uniqueness of $\mathcal{S}(x)$ (i.e., the LLS condition). That is, rather than considering the original BLPs in Eqs. (1)-(2), they actually solve the following simplification:

$$\min_{x \in \mathcal{X}} F(x, y), \quad \text{s.t. } y = \arg\min_{y} f(x, y), \tag{3}$$

where the LL subproblem has only one single solution for a given $x$. By treating $x$ as a parameter, the idea behind these approaches is to apply a gradient-based scheme (e.g., gradient descent, accelerated gradient descent, block coordinate descent, or their variations) to the LL subproblem. Therefore, with the initialization point $y_0$, a sequence $\{y_k\}_{k=0}^{K}$ parameterized by $x$ can be generated, e.g.,

$$y_{k+1} = y_k - s_l \nabla_y f(x, y_k), \quad k = 0, \cdots, K-1, \tag{4}$$

where $s_l > 0$ is an appropriately chosen step size. Then these existing schemes, e.g., (Franceschi et al., 2018; Shaban et al., 2019; Jenni and Favaro, 2018; Zügner and Günnemann, 2019; Rajeswaran et al., 2019), just follow the LLS assumption and consider $y_K(x)$ (i.e., the output of Eq. (4) for a given $x$) as an approximation of the unique optimal solution to the LL subproblem in Eq. (3) and embed it into the UL objective, i.e., $F(x, y_K(x))$. In this way, by unrolling the iterative update scheme in Eq. (4) as a computational graph, the derivative of $F(x, y_K(x))$ (w.r.t. $x$) can be approximately calculated accordingly (Franceschi et al., 2017).
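The unrolling idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by the cited methods: the scalar toy objectives $f(x, y) = \frac{1}{2}(y - x)^2$ and $F(x, y) = \frac{1}{2}(y - 1)^2 + 0.1 x^2$, the function name `unrolled_value_and_grad`, and the default values of `K`, `s_l` and `y0` are all assumptions chosen so that the forward-mode recursion can be written by hand.

```python
def unrolled_value_and_grad(x, K=20, s_l=0.3, y0=0.0):
    """Return F(x, y_K(x)) and its derivative w.r.t. x by forward-mode unrolling.

    Toy problem (an assumption for illustration):
        LL objective f(x, y) = 0.5 * (y - x)**2
        UL objective F(x, y) = 0.5 * (y - 1)**2 + 0.1 * x**2
    """
    y, dy_dx = y0, 0.0  # y_0 does not depend on x, so its derivative is 0
    for _ in range(K):
        grad_y = y - x                         # d f / d y at (x, y_k)
        y = y - s_l * grad_y                   # LL gradient step, as in Eq. (4)
        dy_dx = dy_dx - s_l * (dy_dx - 1.0)    # derivative of that step w.r.t. x
    value = 0.5 * (y - 1.0) ** 2 + 0.1 * x ** 2
    # Chain rule: dF/dx = partial_x F + partial_y F * d y_K / d x
    grad = 0.2 * x + (y - 1.0) * dy_dx
    return value, grad


if __name__ == "__main__":
    value, grad = unrolled_value_and_grad(0.5)
    print("F(x, y_K(x)) =", value, " dF/dx =", grad)
```

The recursion for `dy_dx` is simply the derivative of each LL step with respect to $x$, which is what forward-mode differentiation of the unrolled computational graph produces; reverse mode would yield the same derivative by traversing the graph backwards.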

### 2.2 Fundamental Issues and Counter-Example

As aforementioned, the LLS condition matters considerably for the validity of those gradient-based FOMs. Unfortunately, the uniqueness of the LL subproblem solution is actually too restrictive to be satisfied in practice. Interestingly, without the LLS assumption, the conventional gradient-based FOMs may still perform well in applications, see, e.g., (Franceschi et al., 2017; Jenni and Favaro, 2018; Lorraine and Duvenaud, 2018). However, the lack of theoretical support limits the application horizon of the gradient-based FOMs. Indeed, it is not surprising that this solution strategy fails for BLPs when the LLS condition is not met. In this subsection, we present a counter-example to illustrate such a failure of the conventional gradient-based FOMs in the absence of the LLS condition.

###### Example 1.

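(Counter-Example) Define $x \in \mathbb{R}$ and $y \in \mathbb{R}^2$. Then we consider the following BLP problem:

$$\min_{x \in [-100, 100]} \frac{1}{2}\left(x - [y]_2\right)^2 + \frac{1}{2}\left([y]_1 - 1\right)^2, \quad \text{s.t. } y \in \arg\min_{y \in \mathbb{R}^2} \frac{1}{2}[y]_1^2 - x[y]_1, \tag{5}$$

where $[\cdot]_i$ denotes the $i$-th element of the vector. By simple calculation, we know that the optimal solution of Eq. (5) is $x^* = 1$, $y^* = (1, 1)$. However, if adopting the existing gradient-based scheme in Eq. (4) with initialization $y_0 = (0, 0)$ and varying step size $s_l^k \in (0, 1)$, we have that $[y_K]_1 = \left(1 - \prod_{k=0}^{K-1}(1 - s_l^k)\right) x$ and $[y_K]_2 = 0$. Then the approximated problem of Eq. (5) amounts to

$$\min_{x \in [-100, 100]} F(x, y_K) = \frac{1}{2}x^2 + \frac{1}{2}\left(\left(1 - \prod_{k=0}^{K-1}(1 - s_l^k)\right) x - 1\right)^2.$$

By defining $\phi_K(x) := F(x, y_K)$, we have

$$x_K^* = \arg\min_{x \in [-100, 100]} \phi_K(x) = \frac{1 - \prod_{k=0}^{K-1}(1 - s_l^k)}{1 + \left(1 - \prod_{k=0}^{K-1}(1 - s_l^k)\right)^2}.$$

As

$$\lim_{K \to \infty} \prod_{k=0}^{K-1}(1 - s_l^k) \in [0, 1],$$

we obtain

$$\lim_{K \to \infty} \frac{1 - \prod_{k=0}^{K-1}(1 - s_l^k)}{1 + \left(1 - \prod_{k=0}^{K-1}(1 - s_l^k)\right)^2} \in \left[0, \frac{1}{2}\right].$$

Thus $x_K^*$ and $y_K(x_K^*)$ will not converge to the optimal solution $(x^*, y^*)$.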
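The counter-example can also be checked numerically. The sketch below runs the LL gradient steps of Eq. (4) on the LL objective of Eq. (5), starting from $y_0 = (0, 0)$, and minimizes the resulting approximation over $x$. The helper names (`gd_lower_level`, `phi_K`, `argmin_1d`), the constant step size 0.5, and the use of a ternary search are assumptions made for illustration.

```python
def gd_lower_level(x, step_sizes):
    """Gradient steps of Eq. (4) on f(x, y) = 0.5*y1^2 - x*y1, from y_0 = (0, 0)."""
    y1, y2 = 0.0, 0.0
    for s in step_sizes:
        y1 -= s * (y1 - x)  # d f / d y1 = y1 - x
        y2 -= s * 0.0       # d f / d y2 = 0, so the second component never moves
    return y1, y2


def phi_K(x, step_sizes):
    """UL objective of Eq. (5) evaluated at the unrolled LL iterate y_K(x)."""
    y1, y2 = gd_lower_level(x, step_sizes)
    return 0.5 * (x - y2) ** 2 + 0.5 * (y1 - 1.0) ** 2


def argmin_1d(fun, lo=-100.0, hi=100.0, iters=200):
    """Ternary search; valid here because phi_K is a convex quadratic in x."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if fun(m1) <= fun(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


steps = [0.5] * 50  # constant step size in (0, 1), chosen for illustration
prod = 1.0
for s in steps:
    prod *= 1.0 - s
a = 1.0 - prod
x_closed_form = a / (1.0 + a * a)  # closed-form minimizer of phi_K derived above
x_numeric = argmin_1d(lambda x: phi_K(x, steps))

if __name__ == "__main__":
    print("closed form:", x_closed_form, " numeric:", x_numeric, " true optimum: 1.0")
```

Because the LL gradient carries no information about $[y]_2$, the second component stays at its initial value no matter how many LL steps are taken, which is why the minimizer of $\phi_K$ cannot exceed $\frac{1}{2}$.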

###### Remark 1.

The UL objective $F$ is indeed a function of both the UL variable $x$ and the LL variable $y$. Conventional FOMs only use the gradient information of the LL subproblem to update $y$. Thanks to the LLS condition, for a fixed UL variable $x$, the LL solution $y$ is uniquely determined. Then the generated sequence $\{y_k\}$ converges to the true solution, which is not only the one that minimizes the LL objective, but also the one that optimizes the UL objective. However, when the LLS condition is absent, the generated sequence $\{y_k\}$ may easily fail to converge to the true solution. Therefore, the resulting iterates may tend to incorrect limiting points. Fortunately, even without the LLS condition, Section 3 demonstrates that the example in Eq. (5) is actually solvable by our proposed BDA.

## 3 Bi-level Descent Aggregation (BDA)

In contrast to previous works in the literature, which only address simplified BLPs with the LLS assumption, we propose a method named Bi-level Descent Aggregation (BDA). The new BDA scheme aggregates both the UL and the LL objective information to generate the LL iterates, aiming to handle more generic (and more challenging) BLPs in the absence of the LLS condition.

### 3.1 Optimistic Bi-level Algorithmic Framework

By considering BLP from the optimistic point of view (for more theoretical details of optimistic BLPs, we refer to (Dempe, 2018) and the references therein), we can reformulate Eqs. (1)-(2) as

$$\min_{x \in \mathcal{X}} \varphi(x), \quad \text{with } \varphi(x) := \inf_{y \in \mathcal{S}(x)} F(x, y). \tag{6}$$

Such a reformulation reduces BLP to a single-level model w.r.t. the UL variable $x$, while for any given $x$, $\varphi(x)$ actually turns out to be the value function of a simple bi-level problem w.r.t. the LL variable $y$, i.e.,

$$\min_{y} F(x, y), \quad \text{s.t. } y \in \mathcal{S}(x), \quad (\text{with fixed } x). \tag{7}$$

Inspired by this observation, we may update $y$ as

$$y_{k+1}(x) = \mathcal{T}_k\big(x, y_k(x)\big), \quad k = 0, \cdots, K-1, \tag{8}$$

where $\mathcal{T}_k(x, \cdot)$ stands for a schematic iterative module originated from a certain simple bi-level solution strategy on Eq. (7) with a fixed UL variable $x$. We set the initialization as $y_0$, and $K$ is a prescribed positive integer. It can be seen that $\mathcal{T}_k$, by its nature, should integrate the information from both the UL and LL subproblems in Eqs. (1)-(2). We will discuss specific choices of $\mathcal{T}_k$ in the following subsection. Replacing $\varphi(x)$ by $F(x, y_K(x))$ amounts to the following approximation of BLP in Eq. (6):

$$\min_{x \in \mathcal{X}} \varphi_K(x) := F(x, y_K(x)), \tag{9}$$

where $y_K(x)$ is the output of Eq. (8) after $K$ iterations. With the above procedure, the BLP in Eqs. (1)-(2) is approximated by a sequence of standard unconstrained optimization problems. For each approximation subproblem in Eq. (9), its descent direction is implicitly representable in terms of a certain simple bi-level solution strategy (i.e., Eq. (8)). Therefore, standard first-order solvers can be used to solve these approximation subproblems. The solution sequences of the approximated subproblems converge to the true solution of the BLP in Eqs. (1)-(2), which will be shown in Section 4.
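As a worked illustration of the optimistic reformulation, consider the counter-example in Eq. (5). The derivation below is a sketch computed directly from Eq. (5): the LL solution set fixes only the first component of $y$, and the optimistic value function selects the second component that is best for the UL objective.

```latex
\begin{aligned}
\mathcal{S}(x) &= \left\{ y \in \mathbb{R}^2 : [y]_1 = x \right\}, \\
\varphi(x) &= \inf_{y \in \mathcal{S}(x)} \tfrac{1}{2}\left(x - [y]_2\right)^2 + \tfrac{1}{2}\left([y]_1 - 1\right)^2
            = \inf_{[y]_2 \in \mathbb{R}} \tfrac{1}{2}\left(x - [y]_2\right)^2 + \tfrac{1}{2}\left(x - 1\right)^2
            = \tfrac{1}{2}\left(x - 1\right)^2, \\
\arg\min_{x \in [-100, 100]} \varphi(x) &= \{1\}, \qquad \text{attained with } y = (1, 1).
\end{aligned}
```

The infimum over $[y]_2$ is attained at $[y]_2 = x$, so for each $x$ the optimistic selection is $y = (x, x)$, and minimizing $\varphi$ recovers the optimal solution $x^* = 1$, $y^* = (1, 1)$ stated in Example 1.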

### 3.2 Flexible Iteration Modules

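Now optimizing BLP in Eqs. (1)-(2) has been reduced to the problem of designing a proper iterative module $\mathcal{T}_k$ for Eq. (8). As discussed above, $\mathcal{T}_k$ is related to both the UL and LL objectives, so it is natural to average the descent information of these two subproblems to obtain $\mathcal{T}_k$. Specifically, for a given $x$, the descent directions of the UL and LL objectives can be respectively defined as $d_k^F(x) = s_u \nabla_y F(x, y_k)$ and $d_k^f(x) = s_l \nabla_y f(x, y_k)$, where $s_u, s_l > 0$ are their step size parameters. Then we formulate $\mathcal{T}_k$ as the following first-order descent scheme:

$$\mathcal{T}_k\big(x, y_k(x)\big) = y_k - \left(\alpha_k d_k^F(x) + (1 - \alpha_k) d_k^f(x)\right), \tag{10}$$

where $\alpha_k \in (0, 1)$ denotes the aggregation parameter.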
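A gradient-aggregation step of this kind can be sketched on the counter-example in Eq. (5). The code below assumes the aggregated update $y_{k+1} = y_k - \big(\alpha_k s_u \nabla_y F(x, y_k) + (1 - \alpha_k) s_l \nabla_y f(x, y_k)\big)$; the function names, the step sizes $s_u = s_l = 1$, the decaying schedule $\alpha_k = 0.9/(k+1)$, and the ternary search over $x$ are illustrative assumptions and are not taken from the paper's experiments.

```python
def aggregated_lower_level(x, K, s_u=1.0, s_l=1.0, alpha=lambda k: 0.9 / (k + 1)):
    """Run K aggregated steps on the LL variable of the counter-example, y_0 = (0, 0)."""
    y1, y2 = 0.0, 0.0
    for k in range(K):
        a = alpha(k)
        # UL direction: s_u * grad_y F, with F = 0.5*(x - y2)^2 + 0.5*(y1 - 1)^2
        dF1, dF2 = s_u * (y1 - 1.0), s_u * (y2 - x)
        # LL direction: s_l * grad_y f, with f = 0.5*y1^2 - x*y1
        df1, df2 = s_l * (y1 - x), 0.0
        y1 -= a * dF1 + (1.0 - a) * df1
        y2 -= a * dF2 + (1.0 - a) * df2
    return y1, y2


def phi_K(x, K, **kwargs):
    """UL objective evaluated at the aggregated LL iterate y_K(x)."""
    y1, y2 = aggregated_lower_level(x, K, **kwargs)
    return 0.5 * (x - y2) ** 2 + 0.5 * (y1 - 1.0) ** 2


def argmin_1d(fun, lo=-100.0, hi=100.0, iters=200):
    """Ternary search; valid here because phi_K is a convex quadratic in x."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if fun(m1) <= fun(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


# Decaying aggregation weight versus no aggregation (plain LL gradient descent).
x_aggregated = argmin_1d(lambda x: phi_K(x, K=100))
x_plain = argmin_1d(lambda x: phi_K(x, K=100, alpha=lambda k: 0.0))

if __name__ == "__main__":
    print("aggregated:", x_aggregated, " plain LL descent:", x_plain)
```

With the UL gradient mixed in, the second component of $y$ is pulled towards $x$, so the minimizer of the approximation moves close to the optimal solution $x^* = 1$; with the aggregation weight set to zero the scheme reduces to plain LL gradient descent and the minimizer stays at $\frac{1}{2}$.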

###### Remark 2.

In this part, we introduce a gradient-aggregation-based iterative module to handle the simple bi-level subproblem in Eq. (7). Indeed, the theoretical analysis in Section 4 will demonstrate that our BDA algorithmic framework is flexible enough to incorporate a variety of numerical schemes. For example, in the Supplemental Material, we present an appropriate iterative module to handle BLPs with a nonsmooth LL objective, while its convergence is still strictly guaranteed within our framework.

## 4 Theoretical Investigations

In this section, the convergence behaviors of first-order bi-level optimization schemes are systematically investigated. We first derive two elementary properties and a convergence proof recipe. Following this roadmap, we show that the convergence of our BDA does not depend on the LLS condition (Section 4.2). We also improve the convergence results for existing FOMs in the LLS scenario (Section 4.3). To avoid triviality, we assume hereafter that $\mathcal{S}(x)$ is nonempty for any $x \in \mathcal{X}$. Please note that all the proofs are given in our Supplemental Material.

### 4.1 A General Proof Recipe

We establish a general methodology in Theorem 1, which describes the main steps to achieve the convergence guarantees for our schematic first-order bi-level scheme in Eqs. (8)-(9) (with an abstract iterative module) for BLPs in Eqs. (1)-(2). Basically, our proof methodology consists of two main steps:

1. LL solution set property: For any , there exists such that whenever ,

2. UL objective convergence property: $\varphi(x)$ is LSC (some definitions, including the Outer/Inner Semi-Continuous (OSC/ISC) properties for set-valued mappings and the Lower/Upper Semi-Continuous (LSC/USC) and local uniformly level-bounded properties for functions, are moved to our Supplemental Material; one may also refer to (Rockafellar and Wets, 2009) for more details) on $\mathcal{X}$, thus

Equipped with these properties, the following theorem establishes the general convergence results for our schematic bi-level scheme in Eqs. (8)-(9).

###### Theorem 1.

Suppose both the above LL solution set and UL objective convergence properties hold, then

• if is local minimum of with uniform neighbourhood modulus , we have any limit point of the sequence is a local minimum of ;

• if , we have any limit point of the sequence satisfies that ; and as .

### 4.2 Convergence Properties of BDA

The objective here is to demonstrate that our BDA meets these two elementary properties required by Theorem 1. Before proving the convergence properties of BDA, we first take the following as our blanket assumption.

###### Assumption 1.

For any $x \in \mathcal{X}$, $F(x, \cdot)$ is Lipschitz continuous, smooth, and strongly convex, and $f(x, \cdot)$ is smooth and convex, each with its own constant.

Notice that Assumption 1 is quite standard for BLPs in learning/vision areas (Franceschi et al., 2018; Shaban et al., 2019). As can be seen, it is satisfied for all the applications considered in this work. We first present some necessary variational analysis preliminaries. Denoting

 ~S(x):=argminy∈S(x)F(x,y), (11)

under Assumption 1, we can quickly obtain that is nonempty and unique for any . Moreover, we can derive the boundedness of in the following lemma.

###### Lemma 1.

Suppose is level-bounded w.r.t. and locally uniform w.r.t. . If is ISC on , then is bounded.

Denoting further , thanks to the continuity of , we have the following result.

###### Lemma 2.

If is continuous on , then is USC on .

Now we are ready to establish our fundamental LL solution set and UL objective convergence properties required in Theorem 1. In the following proposition, we first derive the convergence of in the light of the general fact stated in (Sabach and Shtern, 2017).

###### Proposition 1.

Suppose Assumption 1 is satisfied and and let , , , with and . Denoting , and , with and , it holds that

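$$\|y_K(x) - y^*(x)\| \le C_{y^*(x)}, \tag{12}$$

$$\|y_K(x) - \tilde{y}_K(x)\| \le \frac{2 C_{y^*(x)} (J + 2)}{K (1 - \beta)}, \tag{13}$$

$$f(x, \tilde{y}_K(x)) - f^*(x) \le \frac{2 C_{y^*(x)}^2 (J + 2)}{K (1 - \beta) s_l}, \tag{14}$$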

where . Furthermore, converges to as for any .

Proposition 1, upon together with Lemma 1, shows that is a bounded sequence and uniformly converges. We next prove the uniform convergence of towards the solution set through the uniform convergence of .

###### Proposition 2.

Let be a bounded set and . If is ISC on , then there exists such that for any , , in case is satisfied.

Combining Lemmas 1 and 2, together with Proposition 2, the LL solution set property required in Theorem 1 can be eventually derived. Let us now prove the LSC property of on in the following proposition.

###### Proposition 3.

Suppose is level-bounded w.r.t. and locally uniform w.r.t. . If is OSC at , then is LSC at .

Then the UL objective convergence property required in Theorem 1 can be obtained subsequently based on Proposition 3. In summary, we present the main convergence results of BDA in the following theorem.

###### Theorem 2.

Suppose Assumption 1 is satisfied and and let , , with and . Assume further that is continuous on . Then we have the same convergence results as that in Theorem 1.

###### Remark 3.

Our proposed theoretical results are indeed general enough for BLPs in different application scenarios. For example, when the LL objective takes a nonsmooth form, e.g., one combining a smooth part and a nonsmooth part, we can adopt the proximal-operation-based iteration module (Beck, 2017) to construct the iterative module within our BDA framework. The convergence proofs are highly similar to that of Theorem 2. More details on such an extension can be found in our Supplemental Material.

### 4.3 Improving Existing LLS Theories

Even with the LLS simplification on BLP in Eqs. (1)-(2), the theoretical properties of existing bi-level FOMs are still not very convincing. Their convergence proofs in essence depend on the strong convexity (or local strong convexity) of the LL objective, restricting the use of FOMs in complex learning/vision applications. To address this issue, this subsection shows that under the LLS condition, existing convergence results (Franceschi et al., 2018; Shaban et al., 2019) can be improved in the sense that weaker assumptions are required. We begin with an assumption on the LL objective needed in this subsection.

###### Assumption 2.

is level-bounded w.r.t. and locally uniform w.r.t. .

In fact, Assumption 2 is mild and satisfied by a large number of bi-level FOMs, when the LL subproblem is convex but not necessarily strongly convex. In contrast, the more restrictive strong convexity on is an essential assumption in (Franceschi et al., 2018; Shaban et al., 2019). Under Assumption 2, the following lemma verifies the continuity of in the LLS scenario.

###### Lemma 3.

Suppose that Assumption 2 is satisfied and is single-valued on . Then is continuous on .

As can be seen from the proof of Theorem 3 in our Supplemental Material, Lemma 3 together with the uniform convergence of imply the LL solution set and UL objective convergence properties. Hence Theorem 1 is applicable, which inspires an improved version of the convergence results for existing bi-level FOMs as follows.

###### Theorem 3.

Suppose that Assumption 2 is satisfied, is uniformly bounded on , and converges uniformly to on as . Then concerning and , we have the same convergence results as that in Theorem 1.

Theorem 3 actually improves the convergence results in (Franceschi et al., 2018). In fact, the uniform convergence assumption of $y_K(x)$ towards $y^*(x)$ required in (Franceschi et al., 2018) is essentially based on the strong convexity assumption (see Remark 3.3 of (Franceschi et al., 2018)). Instead of assuming such strong convexity, we only need to assume the weaker condition that $f(x, y_K(x))$ converges uniformly to $f^*(x)$ on $\mathcal{X}$ as $K \to \infty$.

It is natural for us to illustrate our improvement in terms of concrete applications. Specifically, we take the gradient-based bi-level scheme in Section 2.1 (which has been used in (Franceschi et al., 2018; Shaban et al., 2019; Jenni and Favaro, 2018; Zügner and Günnemann, 2019; Rajeswaran et al., 2019)). In the following two propositions, we assume that is -smooth and convex, and . Inspired by Theorems 10.21 and 10.23 in (Beck, 2017), we derive the following proposition.

###### Proposition 4.

Let be generated by Eq. (4). Then it holds that , and , with and .
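For intuition about objective-value convergence without strong convexity, the sketch below checks the textbook sublinear bound for gradient descent on a convex smooth function, $f(x, y_K) - f^*(x) \le \|y_0 - y^*\|^2 / (2 s_l K)$ for a step size $s_l$ not exceeding the inverse smoothness constant. This is the standard bound, not necessarily the exact statement of the proposition above. The LL objective of Eq. (5), which is convex but not strongly convex in $y$, is used as the test case, and the helper names and the step size 0.5 are assumptions.

```python
def ll_objective(x, y1):
    """LL objective of the counter-example; it does not depend on the second component."""
    return 0.5 * y1 ** 2 - x * y1


def gd_objective_gap(x, K, s_l):
    """f(x, y_K) - f*(x) after K gradient steps from y_0 = (0, 0)."""
    y1 = 0.0
    for _ in range(K):
        y1 -= s_l * (y1 - x)
    f_star = -0.5 * x ** 2  # minimal LL value, attained whenever y1 = x
    return ll_objective(x, y1) - f_star


def sublinear_bound(x, K, s_l):
    """||y_0 - y*||^2 / (2 s_l K), with y* = (x, 0) the LL solution closest to y_0."""
    return x ** 2 / (2.0 * s_l * K)


if __name__ == "__main__":
    for K in (1, 10, 100):
        # Worst case over the UL feasible set [-100, 100] sits at the endpoints.
        print(K, gd_objective_gap(100.0, K, 0.5), sublinear_bound(100.0, K, 0.5))
```

Since the bound depends on $x$ only through $\|y_0 - y^*\|$, which is bounded on the compact set $[-100, 100]$, the objective gap shrinks uniformly in $x$ as $K$ grows, even though the LL solution set is not a singleton.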

Then we can immediately verify our required assumption on in the absence of strong convexity for .

###### Proposition 5.

Suppose Assumption 2 is satisfied. Then is uniformly bounded on and converges uniformly to on as .

###### Remark 4.

When the LL subproblem is convex, but not necessarily strongly convex, a large number of gradient-based methods, including accelerated gradient methods such as FISTA (Beck and Teboulle, 2009) and the block coordinate descent method (Tseng, 2001), automatically meet our assumption, i.e., the uniform convergence of the LL objective values $f(x, y_K(x))$ towards the optimal values $f^*(x)$ on $\mathcal{X}$.

## 5 Experimental Results

In this section, we first verify our theoretical findings and then evaluate the performance of our proposed method on different problems, such as hyper-parameter optimization and meta learning. We conducted these experiments on a computer with Intel Core i7-7700 CPU (3.6 GHz), 32GB RAM and an NVIDIA GeForce RTX 2060 6GB GPU.

### 5.1 Synthetic BLP

Our theoretical findings are investigated based on the synthetic BLP described in Section 2.2. As stated above, this deterministic bi-level formulation satisfies all the assumptions in Section 4, but it does not satisfy the LLS assumption required in  (Franceschi et al., 2018; Finn et al., 2017; Shaban et al., 2019; Franceschi et al., 2017). Here, we fix the parameters and in this experiments.

In Figure 1, we plotted numerical results of BDA and one of the most representative first-order BLP methods (i.e., Reverse Hyper-Gradient (RHG) (Franceschi et al., 2017, 2018)) with different initialization points. We considered four numerical metrics that compare the computed objectives and variables with the true ones. We observed that RHG consistently failed to obtain the correct solution, even when started from different initialization points. This is mainly because the solution set of the LL subproblem in Eq. (5) is not a singleton, which violates the fundamental assumption of RHG. In contrast, our BDA aggregated the UL and LL information to perform the LL updating, and thus was able to obtain the true optimal solution in all these scenarios. The initialization only slightly affected the convergence speed of our iterative sequences.

Figure 2 further plotted the convergence behaviors of BDA and RHG with different LL iterations (i.e., ). We observed that the results of RHG cannot be improved by increasing . But for BDA, the three iterative sequences (with ) are always converged and the numerical performance can be improved by performing relatively more LL iterations. In the above two figures, we set , .

Figure 3 evaluated the convergence behaviors of BDA with different choices of the aggregation parameter $\alpha_k$. By setting $\alpha_k = 0$, we were unable to use the UL information to guide the LL updating, and thus it was hard to obtain proper feasible solutions for the UL subproblem. When choosing a fixed nonzero $\alpha_k$, the numerical performance improved but the convergence was still slow. Fortunately, we followed our theoretical findings and introduced an adaptive strategy to incorporate UL information into the LL iterations, leading to nice convergence behaviors for both the UL and LL variables.

### 5.2 Hyper-parameter Optimization

Hyper-parameter optimization is the problem of choosing a set of optimal hyper-parameters for a given learning task. Here we consider a specific hyper-parameter optimization example, known as data hyper-cleaning (Shaban et al., 2019; Franceschi et al., 2017). In this problem, we need to train a linear classifier on a given image set, but part of the training labels are corrupted. Following (Shaban et al., 2019; Franceschi et al., 2017), we consider this problem within BLP as follows. We first split the data into a training set and a validation set. Then in the LL subproblem, we define $f$ as a weighted training loss, built from the cross-entropy function of the classification parameter $y$ and the training data pairs, in which the hyper-parameters $x$ penalize the objective for different training samples. Here, an element-wise sigmoid function is applied to $x$ to constrain the weights to a bounded range. For the UL subproblem, we define $F$ as the cross-entropy loss with regularization on the validation set, where the trade-off parameter is fixed.
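The structure of the two objectives can be sketched as follows. This is a hedged illustration in plain Python rather than the experimental code: the helper names, the data layout as lists of `(features, label)` pairs, the linear classifier given by a weight matrix `W` (playing the role of the LL variable), and the squared-norm regularizer with coefficient `reg` are all assumptions.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed with the log-sum-exp trick."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_norm - logits[label]


def linear_logits(W, u):
    """Class scores of a linear classifier with one weight row per class."""
    return [sum(w_j * u_j for w_j, u_j in zip(row, u)) for row in W]


def weighted_training_loss(x, W, train_set):
    """LL objective: each training sample is weighted by sigmoid(x_i) in (0, 1)."""
    total = 0.0
    for x_i, (u_i, v_i) in zip(x, train_set):
        total += sigmoid(x_i) * cross_entropy(linear_logits(W, u_i), v_i)
    return total


def validation_loss(W, val_set, reg=0.0):
    """UL objective: validation cross-entropy plus an assumed squared-norm regularizer."""
    ce = sum(cross_entropy(linear_logits(W, u), v) for u, v in val_set)
    return ce + reg * sum(w * w for row in W for w in row)


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    train = [([1.0, 0.0], 0), ([1.0, 0.0], 1)]  # second label plays the corrupted one
    print(weighted_training_loss([50.0, -50.0], W, train))
```

Driving a hyper-parameter $x_i$ strongly negative pushes the weight of the $i$-th sample towards zero, which is how a corrupted sample can be effectively removed from the training loss.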

We applied our BDA together with the baselines RHG and Truncated RHG (T-RHG) (Shaban et al., 2019) to solve the above BLP model on MNIST (LeCun et al., 1998). Both the training and the validation sets consist of 7000 class-balanced samples, and the remaining 56000 samples are used as the test set. We adopted the architectures used in RHG as the feature extractor for all the compared methods. For T-RHG, we used truncated back-propagation with a fixed number of steps to guarantee its convergence. Table 1 reports the averaged accuracy for all the compared methods with different numbers of LL iterations (i.e., $K$). We observed that RHG outperformed T-RHG, while BDA consistently achieved the highest accuracy. Our theoretical results suggest that most of the improvements in BDA come from the aggregation of the UL and LL information. The results also show that more LL iterations improve the final performance in most cases.

### 5.3 Meta Learning

The aim of meta learning is to learn an algorithm that works well on novel tasks. In particular, we consider the few-shot learning problem (Vinyals et al., 2016; Qiao et al., 2018), where each task is a multi-class classification problem and the goal is to learn the hyper-parameter $x$ such that each task can be solved with only a few training samples. To evaluate this problem, we collect a meta training data set in which each subset is linked to a specific task. We learn a cross-task intermediate representation, parameterized by $x$, as our meta features. Then for each task, we utilize multinomial logistic regression, parameterized by task-specific LL variables, as the ground classifier and the cross-entropy function as the task-specific loss. In this way, we first optimize the hyper-parameter $x$ to obtain the overall setup, and then the task-specific parameters are fine-tuned for each task. The LL and UL objectives are defined accordingly from these task-specific losses.

Our experiments are conducted on two widely used benchmarks, i.e., Omniglot (Lake et al., 2015), which contains 1623 handwritten characters from 50 alphabets, and MiniImageNet (Vinyals et al., 2016), which is a subset of ImageNet (Deng et al., 2009) and includes 60000 downsampled images from 100 different classes. We followed the experimental protocol used in MAML (Finn et al., 2017) and compared our BDA to several state-of-the-art approaches, such as MAML (Finn et al., 2017), Meta-SGD (Li et al., 2018), Reptile (Nichol et al., 2018), RHG, and T-RHG.

It can be seen in Table 2 that BDA compared well to these methods and achieved the highest classification accuracy except in the 5-way 5-shot task, where the practical performance of BDA was slightly worse than that of MAML. We further conducted experiments on the more challenging MiniImageNet data set. In the second column of Table 3, we reported the averaged accuracy of three first-order BLP-based methods (i.e., RHG, T-RHG and BDA). Again, BDA performed better than RHG and T-RHG. In the rightmost two columns, we also compared the averaged number of UL iterations needed to reach almost the same accuracy. These results showed that BDA needed the fewest iterations to achieve such accuracy.

## 6 Conclusions

This paper proposed BDA, a generic first-order algorithmic framework to address BLPs in Eqs. (1)-(2). Our approach has a number of theoretical benefits. Its convergence can be rigorously proved without the LLS assumption, which is the fundamental restriction in existing gradient-based bi-level methods. It is also compatible with a variety of particular computation modules. As a nontrivial byproduct, we also improved the convergence results for those classical gradient-based schemes. Extensive evaluations showed the superiority of BDA on different applications.

## Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61922019, 61672125, 61733002 and 61772105) and the LiaoNing Revitalization Talents Program (XLYC1807088).

## References

• A. Beck and M. Teboulle (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2 (1), pp. 183–202. Cited by: Remark 4.
• A. Beck (2017) First-order methods in optimization. SIAM. Cited by: §A.4.3, §4.3, Remark 3.
• J. F. Bonnans and A. Shapiro (2013) Perturbation analysis of optimization problems. Springer Science & Business Media. Cited by: §A.4.1.
• J. C. De los Reyes, C. Schönlieb, and T. Valkonen (2017) Bilevel parameter learning for higher-order total variation regularisation models. Journal of Mathematical Imaging and Vision 57 (1), pp. 1–25. Cited by: §1.1.
• S. Dempe (2018) Bilevel optimization: theory, algorithms and applications. TU Bergakademie Freiberg Mining Academy and Technical University. Cited by: §1.1, §1, footnote 1.
• J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §5.3.
• [7] J. Domke Generic methods for optimization-based modeling. Cited by: Appendix A.
• C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135. Cited by: §5.1, §5.3.
• L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil (2018) Bilevel programming for hyperparameter optimization and meta-learning. In ICML, pp. 1563–1572. Cited by: Appendix A, 4th item, §1.1, §1.1, §1.2, §2.1, §4.2, §4.3, §4.3, §4.3, §4.3, §5.1, §5.1.
• L. Franceschi, M. Donini, P. Frasconi, and M. Pontil (2017) Forward and reverse gradient-based hyperparameter optimization. In ICML, pp. 1165–1173. Cited by: Appendix A, §1.1, §1.1, §2.1, §2.2, §5.1, §5.1, §5.2.
• S. Jenni and P. Favaro (2018) Deep bilevel learning. In ECCV, pp. 618–633. Cited by: §1.1, §2.1, §2.2, §4.3.
• R. G. Jeroslow (1985) The polynomial hierarchy and a simple model for competitive analysis. Mathematical Programming 32 (2), pp. 146–164. Cited by: §1.
• G. Kunapuli, K. P. Bennett, J. Hu, and J. Pang (2008) Classification model selection via bilevel programming. Optimization Methods & Software 23 (4), pp. 475–489. Cited by: §1.1.
• K. Kunisch and T. Pock (2013) A bilevel optimization approach for parameter learning in variational models. SIAM Journal on Imaging Sciences 6 (2), pp. 938–983. Cited by: §1.1.
• B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §5.3.
• Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §5.2.
• Z. Li, F. Zhou, F. Chen, and H. Li (2018) Meta-sgd: learning to learn quickly for few-shot learning. In ICML, Cited by: §5.3.
• J. Lorraine and D. Duvenaud (2018) Stochastic hyperparameter optimization through hypernetworks. CoRR, abs/1802.09419. Cited by: §1.1, §2.2.
• M. MacKay, P. Vicol, J. Lorraine, D. Duvenaud, and R. Grosse (2019) Self-tuning networks: bilevel optimization of hyperparameters using structured best-response functions. ICLR. Cited by: §1.1, §1.1.
• D. Maclaurin, D. Duvenaud, and R. Adams (2015) Gradient-based hyperparameter optimization through reversible learning. In ICML, pp. 2113–2122. Cited by: Appendix A, §1.1.
• G. M. Moore (2010) Bilevel programming algorithms for machine learning model selection. Rensselaer Polytechnic Institute. Cited by: §1.1.
• A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms. CoRR, abs/1803.02999. Cited by: §5.3.
• T. Okuno, A. Takeda, and A. Kawana (2018) Hyperparameter learning via bilevel nonsmooth optimization. CoRR, abs/1806.01520. Cited by: §1.1, §1.1.
• D. Pfau and O. Vinyals (2016) Connecting generative adversarial networks and actor-critic methods. In NeurIPS Workshop on Adversarial Training, Cited by: §1.1.
• S. Qiao, C. Liu, W. Shen, and A. L. Yuille (2018) Few-shot image recognition by predicting parameters from activations. In CVPR, pp. 7229–7238. Cited by: §5.3.
• A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine (2019) Meta-learning with implicit gradients. In NeurIPS, pp. 113–124. Cited by: §A.5, §1.1, §1.1, §2.1, §4.3.
• R. T. Rockafellar and R. J. Wets (2009) Variational analysis. Springer Science & Business Media. Cited by: §A.1.1, §A.3.3, §A.4.1, footnote 2.
• S. Sabach and S. Shtern (2017) A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization 27 (2), pp. 640–660. Cited by: §A.5, §4.2.
• A. Shaban, C. Cheng, N. Hatch, and B. Boots (2019) Truncated back-propagation for bilevel optimization. In AISTATS, pp. 1723–1732. Cited by: Appendix A, 4th item, §1.1, §2.1, §4.2, §4.3, §4.3, §4.3, §5.1, §5.2, §5.2.
• P. Tseng (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications 109 (3), pp. 475–494. Cited by: Remark 4.
• O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In NeurIPS, pp. 3630–3638. Cited by: §5.3, §5.3.
• S. Wang, S. Fidler, and R. Urtasun (2016) Proximal deep structured models. In NeurIPS, pp. 865–873. Cited by: §A.5.
• Z. Yang, Y. Chen, M. Hong, and Z. Wang (2019) Provably global convergence of actor-critic: a case for linear quadratic regulator with ergodic cost. In NeurIPS, pp. 8351–8363. Cited by: §1.1.
• D. Zügner and S. Günnemann (2019) Adversarial attacks on graph neural networks via meta learning. ICLR. Cited by: §1.1, §1.1, §2.1, §4.3.

## Appendix

The appendix is organized as follows. Section A compares the theoretical results of BDA and existing state-of-the-art bi-level FOMs. In Section A.1, we provide detailed proofs for all of the theoretical results in our manuscript. Finally, Section A.5 discusses a possible extension of BDA for BLPs with a non-smooth LL objective.

## Appendix A Comparisons on Theoretical Results

Table 4 summarizes the proved convergence properties together with the required model conditions for BDA and existing gradient-based bi-level FOMs, such as (Domke; Maclaurin et al., 2015; Franceschi et al., 2017, 2018; Shaban et al., 2019). In fact, the theoretical results for these previous approaches have been proved in (Franceschi et al., 2018). To simplify the notation, we define the following abbreviations: “JC” (Jointly Continuous), “LC” (Lipschitz Continuous), “SC” (Strongly Convex), and “LB” (Level-Bounded). We also denote subsequentially convergent and uniformly convergent as “” and “”, respectively. The superscript $*$ denotes the true optimal variables/values. For each category of methods, the top two rows and the bottom row respectively summarize the required properties of the models (i.e., the UL and LL subproblems) and the proved convergence results for these methods.

It can be seen that in the LLS scenario, our BDA and these existing bi-level FOMs share the same requirements for the UL subproblem. However, as for the LL subproblem, the uniform convergence assumption of $y_K(x)$ towards $y^*(x)$, considered in the previous FOMs, is essentially more restrictive than the assumptions required in our BDA. Notice that this has already been discussed below Theorem 3 in our manuscript. More importantly, when solving BLPs without the LLS assumption, we can see that no theoretical results can be obtained for these existing FOMs. Fortunately, we demonstrate that BDA can obtain the same convergence properties as those in the LLS scenario.

### A.1 Detailed Proofs

#### A.1.1 Necessary Definitions

We state some definitions, which are necessary for our analysis. One may also refer to (Rockafellar and Wets, 2009) for more details on these variational analysis properties. Specifically, by denoting

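$$
\begin{aligned}
\limsup_{x\to\bar{x}} S(x) &:= \{\, y \mid \exists\, x^{\nu}\to\bar{x},\ \exists\, y^{\nu}\to y,\ y^{\nu}\in S(x^{\nu}) \,\},\\
\liminf_{x\to\bar{x}} S(x) &:= \{\, y \mid \forall\, x^{\nu}\to\bar{x},\ \exists\, y^{\nu}\to y,\ y^{\nu}\in S(x^{\nu}) \,\},
\end{aligned}
\tag{15}
$$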

we define various continuity properties of the set-valued mapping as follows.

###### Definition 1.

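A set-valued mapping $S(x)$ is Outer Semi-Continuous (OSC) at $\bar{x}$ when $\limsup_{x\to\bar{x}} S(x)\subseteq S(\bar{x})$ and Inner Semi-Continuous (ISC) at $\bar{x}$ when $\liminf_{x\to\bar{x}} S(x)\supseteq S(\bar{x})$. It is called continuous at $\bar{x}$ when it is both OSC and ISC at $\bar{x}$, as expressed by $\lim_{x\to\bar{x}} S(x)=S(\bar{x})$.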
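As a small worked illustration of these notions, consider a mapping on the real line whose value jumps at $\bar{x}=0$. The mapping is a hypothetical example and is not taken from the manuscript. Reading OSC as $\limsup_{x\to\bar{x}}S(x)\subseteq S(\bar{x})$ and ISC as $\liminf_{x\to\bar{x}}S(x)\supseteq S(\bar{x})$, in the sense of Rockafellar and Wets (2009), the limits in (15) show that it is OSC but not ISC at the origin:

```latex
\begin{align*}
S(x) &= \begin{cases} [0,1], & x = 0, \\ \{0\}, & x \neq 0, \end{cases} \\
\limsup_{x\to 0} S(x) &= [0,1] \subseteq S(0)
  && \text{(take } x^{\nu} = 0 \text{ to reach any } y \in [0,1]\text{)}, \\
\liminf_{x\to 0} S(x) &= \{0\} \not\supseteq S(0)
  && \text{(along } x^{\nu} = 1/\nu \text{ only } y^{\nu} = 0 \text{ is available)}.
\end{align*}
```

Swapping the two cases, i.e., $S(0)=\{0\}$ and $S(x)=[0,1]$ for $x\neq 0$, gives a mapping that is ISC but not OSC at $0$.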

Before providing the following semi-continuous definitions, we introduce the upper and lower limits of a function as

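$$
\begin{aligned}
\limsup_{x\to\bar{x}}\varphi(x) &:= \lim_{\delta\to 0}\Big[\sup_{x\in\mathbb{B}_{\delta}(\bar{x})}\varphi(x)\Big] = \inf_{\delta>0}\Big[\sup_{x\in\mathbb{B}_{\delta}(\bar{x})}\varphi(x)\Big],\\
\liminf_{x\to\bar{x}}\varphi(x) &:= \lim_{\delta\to 0}\Big[\inf_{x\in\mathbb{B}_{\delta}(\bar{x})}\varphi(x)\Big] = \sup_{\delta>0}\Big[\inf_{x\in\mathbb{B}_{\delta}(\bar{x})}\varphi(x)\Big],
\end{aligned}
\tag{16}
$$

where $\mathbb{B}_{\delta}(\bar{x})$ denotes the ball of radius $\delta$ centered at $\bar{x}$.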

###### Definition 2.

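The function $\varphi(x)$ is Upper Semi-Continuous (USC) at $\bar{x}$ if

$$
\limsup_{x\to\bar{x}}\varphi(x)\le\varphi(\bar{x}),\quad\text{or equivalently}\quad\limsup_{x\to\bar{x}}\varphi(x)=\varphi(\bar{x}),
\tag{17}
$$

and USC on a set if this holds for every $\bar{x}$ in that set. The function $\varphi(x)$ is Lower Semi-Continuous (LSC) at $\bar{x}$ if

$$
\liminf_{x\to\bar{x}}\varphi(x)\ge\varphi(\bar{x}),\quad\text{or equivalently}\quad\liminf_{x\to\bar{x}}\varphi(x)=\varphi(\bar{x}),
\tag{18}
$$

and LSC on a set if this holds for every $\bar{x}$ in that set.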
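To make the upper and lower limits concrete, the following sketch approximates them on shrinking balls for a step function that jumps at the origin. It is only a finite-grid illustration under stated assumptions: the function `step`, the helper `upper_lower_limits`, and the chosen radii are hypothetical, and a grid can only approximate the true supremum and infimum.

```python
def upper_lower_limits(phi, x_bar, deltas=(1e-1, 1e-2, 1e-3), n=2001):
    """Approximate limsup and liminf of phi at x_bar via sup/inf over shrinking balls."""
    sups, infs = [], []
    for delta in deltas:
        # Uniform grid on the ball of radius delta; x_bar itself is always included.
        grid = [x_bar - delta + 2.0 * delta * i / (n - 1) for i in range(n)] + [x_bar]
        values = [phi(x) for x in grid]
        sups.append(max(values))
        infs.append(min(values))
    # limsup = inf over delta of the sup; liminf = sup over delta of the inf.
    return min(sups), max(infs)

def step(x):
    """Hypothetical test function: jumps from 0 to 1 at the origin."""
    return 1.0 if x >= 0.0 else 0.0

limsup, liminf = upper_lower_limits(step, 0.0)
is_usc = limsup <= step(0.0)   # USC test: upper limit does not exceed the value
is_lsc = liminf >= step(0.0)   # LSC test: lower limit is not below the value
print(is_usc, is_lsc)
```

For this function the upper limit at $0$ equals the function value while the lower limit falls below it, so the function is USC but not LSC at the jump; replacing `x >= 0.0` by `x > 0.0` reverses the two conclusions.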

We also present the level-bounded and locally uniform property for a function in the following definition.

###### Definition 3.

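Given a function $\phi(x,y)$, if for the point $x\in X$ and $c\in\mathbb{R}$, there exists $\delta>0$ along with a bounded set $B\subset\mathbb{R}^m$, such that

$$
\{\, y\in\mathbb{R}^m \mid \phi(\bar{x},y)\le c \,\}\subseteq B,\quad\forall\,\bar{x}\in\mathbb{B}_{\delta}(x)\cap X,
\tag{19}
$$

then we call $\phi(x,y)$ level-bounded w.r.t. $y$ and locally uniform at $x$. It is called locally uniform w.r.t. $x\in X$ if the above holds for each $x\in X$.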
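Two simple functions, chosen here purely for illustration and not taken from the manuscript, show what condition (19) rules in and out. With $X=\mathbb{R}$, $m=1$ and $\delta=1$, the first function satisfies (19) at every point, while the second fails at $x=0$ because its level sets there are the whole real line:

```latex
\begin{align*}
\phi(x,y) = (y-x)^2 :&\quad
  \{\, y \mid \phi(\bar{x},y) \le c \,\} \subseteq
  \big[\, x - 1 - \sqrt{c},\; x + 1 + \sqrt{c} \,\big]
  \quad \forall\, \bar{x} \in \mathbb{B}_{1}(x),\ c \ge 0, \\
\phi(x,y) = x^2 y^2 :&\quad
  \{\, y \mid \phi(0,y) \le c \,\} = \mathbb{R}
  \quad \text{for every } c \ge 0 .
\end{align*}
```

For $c<0$ the level sets of both functions are empty, so only non-negative levels matter in this example.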

### A.2 Proofs of Section 4.1

#### A.2.1 Proof of Theorem 1

###### Proof.

Since is compact, we can assume without loss of generality that and by considering a subsequence of . For any , there exists such that whenever , so we have

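$$
\sup_{x\in X}\operatorname{dist}\big(y_K(x),S(x)\big)\le\frac{\epsilon}{2L_0}.
\tag{20}
$$

Thus, for any $x\in X$, there exists $y^*(x)\in S(x)$ such that

$$
\|y_K(x)-y^*(x)\|\le\frac{\epsilon}{L_0}.
\tag{21}
$$

Therefore, for any $x\in X$, we have

$$
\begin{aligned}
\varphi(x) &= \inf_{y\in S(x)}F(x,y)\\
&\le F(x,y^*(x))\\
&\le F(x,y_K(x))+L_0\|y_K(x)-y^*(x)\|\\
&\le \varphi_K(x)+\epsilon.
\end{aligned}
\tag{22}
$$

Here the third line uses the $L_0$-Lipschitz continuity of $F(x,\cdot)$, and the last line follows from (21) together with $\varphi_K(x)=F(x,y_K(x))$.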

This implies that, for any , there exists such that whenever ,

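$$
\varphi(x_K)\le\varphi_K(x_K)+\epsilon\le\varphi_K(x)+\epsilon,\quad\forall x\in X.
\tag{23}
$$

Next, as $x_K$ is a local minimum of $\varphi_K$ with uniform neighbourhood modulus $\delta$, it follows that

$$
\varphi_K(x_K)\le\varphi_K(x),\quad\forall x\in\mathbb{B}_{\delta}(x_K)\cap X.
$$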

And since , we have, for any , there exists such that whenever , ,

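$$
\varphi(x_K)\le\varphi_K(x_K)+\epsilon\le\varphi_K(x)+\epsilon=\varphi(x)+\epsilon.
$$

Taking $K\to\infty$ and by the LSC of $\varphi$, we have

$$
\begin{aligned}
\varphi(\bar{x}) &\le \liminf_{K\to\infty}\varphi(x_K)\\
&\le \liminf_{K\to\infty}\varphi_K(x_K)+\epsilon\\
&\le \lim_{K\to\infty}\varphi_K(x)+\epsilon=\varphi(x)+\epsilon,\quad\forall x\in\mathbb{B}_{\delta/2}(\bar{x})\cap X.
\end{aligned}
$$

By taking $\epsilon\to 0$, we have

$$
\varphi(\bar{x})\le\varphi(x),\quad\forall x\in\mathbb{B}_{\delta/2}(\bar{x}),
$$

which implies that $\bar{x}$ is a local minimum of $\varphi$.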

We can show the second result with similar arguments. Since is compact, we can assume without loss of generality that by considering a subsequence of . As shown above in (23), for any , there exists such that whenever ,

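$$
\varphi(x_K)\le\varphi_K(x_K)+\epsilon\le\varphi_K(x)+\epsilon,\quad\forall x\in X.
\tag{24}
$$

Taking $K\to\infty$ and by the LSC of $\varphi$, we have

$$
\begin{aligned}
\varphi(\bar{x}) &\le \liminf_{K\to\infty}\varphi(x_K)\\
&\le \liminf_{K\to\infty}\varphi_K(x_K)+\epsilon\\
&\le \lim_{K\to\infty}\varphi_K(x)+\epsilon=\varphi(x)+\epsilon,\quad\forall x\in X.
\end{aligned}
\tag{25}
$$

By taking $\epsilon\to 0$, we have

$$
\varphi(\bar{x})\le\varphi(x),\quad\forall x\in X,
\tag{26}
$$

which implies $\bar{x}\in\arg\min_{x\in X}\varphi(x)$.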

We next show that as . If this is not true, then there exist and sequence such that

$$
\Big|\inf_{x\in X}\varphi_l(x)-\inf_{x\in X}\varphi(x)\Big|>\delta,\quad\forall l.
\tag{27}
$$

For each , there exists such that . And since is compact, we can assume without loss of generality that . For any , there exists such that whenever , the following holds

 φ(xl) ≤φl(xl)+ϵ (28) ≤infx∈Xφ</