1 Introduction
Bi-Level Programs (BLPs) are mathematical programs with optimization problems in their constraints, and they have recently been recognized as powerful theoretical tools for addressing a variety of learning tasks (e.g., hyperparameter optimization and meta learning). Mathematically, most BLPs in these areas can be (re)formulated as the following hierarchical optimization problem:
(1) $\min_{x \in \mathcal{X}} F(x, y), \quad \mathrm{s.t.} \quad y \in \mathcal{S}(x),$
where the Upper-Level (UL) objective $F$ is a continuous function, the UL constraint $\mathcal{X}$ is a compact set, and $\mathcal{S}(x)$ is a set-valued mapping which indicates the parameterized solution set of the Lower-Level (LL) subproblem. In this work, we only consider the following LL subproblem:
(2) $\mathcal{S}(x) = \arg\min_{y} f(x, y),$
where $f$ is a continuous function. The BLP model in Eqs. (1)-(2) is thus a hierarchical optimization problem with two coupled variables $(x, y)$. Specifically, given the UL variable $x$ from the feasible set $\mathcal{X}$, i.e., $x \in \mathcal{X}$, the LL variable $y$ is an optimal solution of the LL subproblem governed by $x$, i.e., $y \in \mathcal{S}(x)$. Due to the hierarchical structure, the BLP model in Eqs. (1)-(2) is in general nonconvex and NP-hard, even when both the UL and LL subproblems are linear (Jeroslow, 1985; Dempe, 2018). Moreover, due to the complicated dependency between the UL and LL variables in Eq. (1), solving BLPs is very challenging. This difficulty is further aggravated when the LL solution set $\mathcal{S}(x)$ in Eq. (2) is no longer a singleton for a given $x$. Hereafter, we refer to the uniqueness of the LL solution as the Lower-Level Singleton condition, or LLS for short.
1.1 Related Work
Although early works on bilevel optimization date back to the 1970s (Dempe, 2018), it was not until the last decade that a large number of BLP models were proposed to address specific learning and vision problems. Representative applications include meta learning (Franceschi et al., 2018; Rajeswaran et al., 2019; Zügner and Günnemann, 2019), hyperparameter optimization (Franceschi et al., 2017; Okuno et al., 2018; MacKay et al., 2019), reinforcement learning (Yang et al., 2019), generative adversarial learning (Pfau and Vinyals, 2016), and graph and image processing (Kunisch and Pock, 2013; De los Reyes et al., 2017), to name just a few.
A large number of optimization methods have been developed to solve the BLPs in Eqs. (1)-(2). A prevailing approach is based on the optimality characterization of the LL subproblem. Using first-order optimality conditions, the BLPs in Eqs. (1)-(2) are reformulated into single-level optimization problems that are numerically tractable (Moore, 2010; Kunapuli et al., 2008; Okuno et al., 2018). However, these bilevel algorithms involve too many auxiliary variables; as a consequence, their performance is rarely satisfactory for BLP models in complex learning fields.
Recently, gradient-based First-Order Methods (FOMs) have been revisited to solve BLPs for learning and vision tasks. The key idea underlying these approaches is to calculate gradients of the UL and LL objectives in a hierarchical manner. A popular approach in this direction is to first calculate gradient representations of the LL objective and then perform either reverse or forward gradient computations (based on the LL gradients) for the UL subproblem. It is known that the reverse mode is equivalent to back-propagation through time, while the forward mode computes gradients by appealing to the chain rule (Maclaurin et al., 2015; Franceschi et al., 2017, 2018). Similar techniques were also used in (Jenni and Favaro, 2018; Zügner and Günnemann, 2019; Rajeswaran et al., 2019), but with different specific implementations. The work in (Shaban et al., 2019) adopted truncated back-propagation to alleviate the scalability issue of these methods. Furthermore, in (Lorraine and Duvenaud, 2018; MacKay et al., 2019), a so-called hyper-network was introduced and trained to map LL gradients for such hierarchical optimization. Although these bilevel FOMs are widely used in practical applications, their theoretical properties are still not convincing. Indeed, all of these methods enforce the LLS condition on Eqs. (1)-(2) to simplify the optimization problem. To satisfy such a restrictive condition, existing works (e.g., (Franceschi et al., 2018; Shaban et al., 2019)) have to introduce a strong convexity (or local strong convexity) assumption on the LL subproblem, which is too restrictive to be satisfied in real-world complex tasks.
1.2 Our Contributions
In this work, we propose a generic first-order bilevel algorithmic framework, named Bilevel Descent Aggregation (BDA), which is flexible and efficient in handling BLPs of the form of Eqs. (1)-(2). Unlike the above gradient-based bilevel methods, which formulate their iteration schemes as two task-related single-level optimization problems and fully depend on the LLS condition, BDA investigates BLPs from the optimistic point of view and develops a hierarchical optimization scheme, which consists of a single-level optimization formulation for the UL variable and a simple bilevel optimization formulation for the LL variable. We prove in theory that the convergence of BDA can be strictly guaranteed in the absence of the restrictive LLS condition. Moreover, our theoretical results are general enough to allow a variety of embedded iteration modules to handle different types of objective functions in Eqs. (1)-(2); thus BDA is indeed a task-agnostic optimization framework for BLPs. In addition, we demonstrate that the strong convexity of the LL objective (needed in previous theoretical results (Franceschi et al., 2018)) is nonessential, and we improve the convergence theories under the LLS condition by eliminating the strong convexity assumption. Our experimental results first verify the theoretical investigations and then show that BDA compares favorably to state-of-the-art methods on various applications, including hyperparameter optimization and meta learning. The contributions can be summarized as:

A counterexample (i.e., Example 1) explicitly indicates the importance of the LLS condition for existing bilevel FOMs. In particular, we investigate their iteration behaviors and reach the conclusion that using these approaches in the absence of the LLS condition may lead to incorrect solutions.

We strictly prove the convergence of BDA for general BLPs without the LLS condition. Our theoretical results are fairly general in the sense that, with slight modifications, they apply to different types of bilevel objectives in Eqs. (1)-(2). In fact, depending on the specific problem setting, various appropriate iteration modules can be incorporated into BDA while the theoretical convergence is still guaranteed.

As a nontrivial byproduct, we revisit and improve the convergence justification of existing gradient-based schemes (Franceschi et al., 2018; Shaban et al., 2019) for BLPs in the LLS scenario. In particular, we successfully eliminate the strong convexity assumption on the LL subproblem, which is usually too restrictive for real-world applications.
2 First-Order Bilevel Approaches
2.1 Solution Strategies with Lower-Level Singleton
As mentioned above, a number of FOMs have been proposed to solve the BLP in Eqs. (1)-(2). However, these existing methods all rely on the uniqueness of the LL solution (i.e., the LLS condition). That is, rather than considering the original BLP in Eqs. (1)-(2), they actually solve the following simplification:
(3) $\min_{x \in \mathcal{X}} F(x, y), \quad \mathrm{s.t.} \quad y = \arg\min_{y} f(x, y),$
where the LL subproblem has exactly one solution for a given $x$. Treating $x$ as a parameter, the idea behind these approaches is to apply a gradient-based scheme (e.g., gradient descent, accelerated gradient descent, block coordinate descent, or their variants) to the LL subproblem. With the initialization point $y_0$, a sequence $\{y_k(x)\}_{k=0}^{K}$ parameterized by $x$ can then be generated, e.g.,
(4) $y_{k+1}(x) = y_k(x) - s_l \nabla_y f(x, y_k(x)), \quad k = 0, \dots, K-1,$
where $s_l > 0$ is an appropriately chosen step size. These existing schemes, e.g., (Franceschi et al., 2018; Shaban et al., 2019; Jenni and Favaro, 2018; Zügner and Günnemann, 2019; Rajeswaran et al., 2019), then follow the LLS assumption and treat $y_K(x)$ (i.e., the output of Eq. (4) for a given $x$) as an approximation of the unique optimal solution of the LL subproblem in Eq. (3), and embed it into the UL objective, i.e., $F(x, y_K(x))$. In this way, by unrolling the iterative update scheme in Eq. (4) as a computational graph, the derivative of $F(x, y_K(x))$ (w.r.t. $x$) can be approximately calculated based on $y_K(x)$ (Franceschi et al., 2017).
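The unrolling strategy above can be sketched on a scalar toy problem. The snippet is a minimal illustration under stated assumptions, not the implementation of any cited method: the objectives `f(x, y) = 0.5 * (y - x)**2` and `F(x, y) = 0.5 * (y - 1)**2`, the function names, and all step sizes are hypothetical choices for which the LLS condition holds. It runs LL gradient descent, propagates the derivative of the LL iterate with respect to the UL variable alongside it (forward mode), and uses the resulting hypergradient for UL descent.

```python
def unrolled_lower_level(x, num_steps=50, step=0.5, y0=0.0):
    """Run LL gradient descent on the assumed toy f(x, y) = 0.5 * (y - x)**2.

    Returns the final iterate y_K(x) and its derivative d y_K / d x,
    obtained by differentiating every update (forward mode).
    """
    y, dy_dx = y0, 0.0
    for _ in range(num_steps):
        grad_y = y - x                       # gradient of f w.r.t. y
        y = y - step * grad_y                # gradient step on the LL problem
        dy_dx = (1.0 - step) * dy_dx + step  # derivative of the update w.r.t. x
    return y, dy_dx


def hypergradient(x, **kwargs):
    """Derivative of x -> F(x, y_K(x)) for the assumed toy F(x, y) = 0.5 * (y - 1)**2."""
    y, dy_dx = unrolled_lower_level(x, **kwargs)
    dF_dy = y - 1.0  # F has no direct dependence on x in this toy
    return dF_dy * dy_dx


def solve_upper_level(x0=0.0, outer_steps=200, lr=0.1):
    """Plain gradient descent on the UL variable using the unrolled hypergradient."""
    x = x0
    for _ in range(outer_steps):
        x -= lr * hypergradient(x)
    return x


x_hat = solve_upper_level()
y_hat, _ = unrolled_lower_level(x_hat)
print("UL variable:", x_hat, "LL variable:", y_hat)
```

In this toy the LL minimizer is unique (`y = x`), so the unrolled scheme recovers the solution `x = y = 1`; the situation changes once the LL solution set is not a singleton.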
2.2 Fundamental Issues and Counter-Example
As mentioned above, the LLS condition is crucial for the validity of these gradient-based FOMs. Unfortunately, the uniqueness of the LL subproblem solution is too restrictive to be satisfied in practice. Interestingly, even without the LLS assumption, conventional gradient-based FOMs may still perform well in applications, see, e.g., (Franceschi et al., 2017; Jenni and Favaro, 2018; Lorraine and Duvenaud, 2018). However, the lack of theoretical support limits the range of applications of gradient-based FOMs. Indeed, it is not surprising that this solution strategy can fail for BLPs when the LLS condition does not hold. In this subsection, we present a counter-example to illustrate such a failure of conventional gradient-based FOMs in the absence of the LLS condition.
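Example 1.
(Counter-Example) Define $x \in [-100, 100]$ and $y \in \mathbb{R}^2$. Then we consider the following BLP problem:
(5) $\min_{x \in [-100, 100]} \tfrac{1}{2} (x - [y]_2)^2 + \tfrac{1}{2} ([y]_1 - 1)^2, \quad \mathrm{s.t.} \quad y \in \arg\min_{y \in \mathbb{R}^2} \tfrac{1}{2} [y]_1^2 - x [y]_1,$
where $[\cdot]_i$ denotes the $i$-th element of the vector. By simple calculation, we know that the optimal solution of Eq. (5) is $x^* = 1$, $y^* = (1, 1)$. However, if adopting the existing gradient-based scheme in Eq. (4) with initialization $y_0 = (0, 0)$ and varying step size $s_l^k \in (0, 1)$, we have that $[y_K]_1 = \bigl(1 - \prod_{k=0}^{K-1} (1 - s_l^k)\bigr) x$ and $[y_K]_2 = 0$. Then the approximated problem of Eq. (5) amounts to
$\min_{x \in [-100, 100]} F(x, y_K) = \tfrac{1}{2} x^2 + \tfrac{1}{2} \bigl( (1 - \prod_{k=0}^{K-1} (1 - s_l^k)) x - 1 \bigr)^2.$
By defining $c_K = 1 - \prod_{k=0}^{K-1} (1 - s_l^k)$, we have
$x_K^* = \arg\min_{x \in [-100, 100]} F(x, y_K) = \frac{c_K}{1 + c_K^2}.$
As $0 \le c_K \le 1$, we have $x_K^* \in [0, 1/2]$ and then $\limsup_{K \to \infty} x_K^* \le 1/2$. Thus $x_K^*$ and $y_K(x_K^*)$ will not converge to $(x^*, y^*)$.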
Remark 1.
The UL objective is indeed a function of both the UL variable and the LL variable. Conventional FOMs only use the gradient information of the LL subproblem to update the LL variable. Thanks to the LLS condition, for a fixed UL variable, the LL solution is uniquely determined. The generated LL sequence then converges to the true solution, which not only minimizes the LL objective but also optimizes the UL objective. However, when the LLS condition is absent, the generated LL sequence may easily fail to converge to the true solution, and the resulting UL iterates may therefore tend to incorrect limit points. Fortunately, Section 3 demonstrates that, even without the LLS condition, the example in Eq. (5) is solvable by our proposed BDA.
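The failure mode discussed in this subsection can be reproduced numerically. The following is a sketch under stated assumptions: it uses the quadratic instance `F(x, y) = 0.5 * (x - y[1])**2 + 0.5 * (y[0] - 1)**2` and `f(x, y) = 0.5 * y[0]**2 - x * y[0]` with `y` in R^2, whose LL solution set `{y : y[0] = x}` is not a singleton and whose optimistic solution is `x = 1`, `y = (1, 1)`. The function name, the constant step size and the initialization `y0 = (0, 0)` are illustrative choices. Because the LL gradient never moves the second coordinate, unrolling plain LL gradient descent yields a surrogate whose minimizer never exceeds 0.5.

```python
def unrolled_minimizer(num_steps, step=0.5):
    """Minimizer of x -> F(x, y_K(x)) when y_K comes from plain LL gradient descent.

    Starting from y0 = (0, 0), the LL updates give
        y_K[0] = c * x  with  c = 1 - (1 - step)**num_steps,   y_K[1] = 0,
    so the surrogate is 0.5 * x**2 + 0.5 * (c * x - 1)**2, which is minimized
    at x = c / (1 + c**2).
    """
    c = 1.0 - (1.0 - step) ** num_steps
    return c / (1.0 + c * c)


# The optimistic solution of the assumed instance is x = 1.
for num_steps in (1, 4, 16, 64, 256):
    x_k = unrolled_minimizer(num_steps)
    print(f"K = {num_steps:3d}: minimizer of the unrolled surrogate = {x_k:.4f}")
```

Running more LL iterations does not help here: the surrogate minimizer approaches 0.5 rather than 1, since the information needed to pick the right LL solution lives in the UL objective.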
3 Bilevel Descent Aggregation (BDA)
In contrast to previous works in the literature, which only address simplified BLPs under the LLS assumption, we propose a method named Bilevel Descent Aggregation (BDA). The new BDA scheme aggregates both the UL and the LL objective information to generate the LL iterates, aiming to handle more generic (and more challenging) BLPs in the absence of the LLS condition.
3.1 Optimistic Bilevel Algorithmic Framework
By considering the BLP from the optimistic point of view [Footnote 1: For more theoretical details of optimistic BLPs, we refer to (Dempe, 2018) and the references therein.], we can reformulate Eqs. (1)-(2) as
(6) $\min_{x \in \mathcal{X}} \varphi(x), \quad \text{with} \quad \varphi(x) := \inf_{y \in \mathcal{S}(x)} F(x, y).$
Such a reformulation reduces the BLP to a single-level model w.r.t. the UL variable $x$, while for any given $x$, $\varphi(x)$ turns out to be the value function of a simple bilevel problem w.r.t. the LL variable $y$, i.e.,
(7) $\min_{y} F(x, y), \quad \mathrm{s.t.} \quad y \in \mathcal{S}(x).$
Inspired by this observation, we may update $y$ as
(8) $y_{k+1}(x) = \mathcal{T}_{k+1}(x, y_k(x)), \quad k = 0, \dots, K-1,$
where $\mathcal{T}_k(x, \cdot)$ stands for a schematic iterative module originating from a certain simple bilevel solution strategy on Eq. (7) with a fixed UL variable $x$. We set the initialization as $y_0$, and $K$ is a prescribed positive integer. It can be seen that $\mathcal{T}_k$, by its nature, should integrate the information from both the UL and LL subproblems in Eqs. (1)-(2). We will discuss specific choices of $\mathcal{T}_k$ in the following subsection. Replacing $\varphi(x)$ by $F(x, y_K(x))$ amounts to the following approximation of the BLP in Eq. (6):
(9) $\min_{x \in \mathcal{X}} \varphi_K(x) := F(x, y_K(x)),$
where $y_K(x)$ is the output of Eq. (8) after $K$ iterations. With the above procedure, the BLP in Eqs. (1)-(2) is approximated by a sequence of standard optimization problems. For each approximation subproblem in Eq. (9), the descent direction is implicitly representable in terms of a certain simple bilevel solution strategy (i.e., Eq. (8)). Therefore, standard first-order solvers can be employed to solve these approximation subproblems. The solution sequences of the approximated subproblems converge to the true solution of the BLP in Eqs. (1)-(2), as will be shown in Section 4.
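Since each subproblem in Eq. (9) is handled by first-order solvers, it may help to spell out how a UL descent direction could be obtained from the recursion in Eq. (8). The derivation below is a sketch, writing $\varphi_K(x) = F(x, y_K(x))$ for the objective of Eq. (9) and $\mathcal{T}_{k+1}$ for the iterative module of Eq. (8), under the additional assumption that $F$ and every $\mathcal{T}_{k+1}$ are differentiable; the symbol $Z_k$ is introduced here for illustration only.

```latex
% Assumption: F and each module T_{k+1} are differentiable in (x, y).
\begin{aligned}
\nabla \varphi_K(x) &= \nabla_x F\bigl(x, y_K(x)\bigr)
  + Z_K^{\top}\, \nabla_y F\bigl(x, y_K(x)\bigr),
\qquad Z_k := \frac{\partial y_k(x)}{\partial x}, \\
Z_{k+1} &= \frac{\partial \mathcal{T}_{k+1}}{\partial x}\bigl(x, y_k(x)\bigr)
  + \frac{\partial \mathcal{T}_{k+1}}{\partial y}\bigl(x, y_k(x)\bigr)\, Z_k,
\qquad k = 0, \dots, K-1 .
\end{aligned}
```

If the initialization does not depend on $x$, then $Z_0 = 0$. Propagating $Z_k$ alongside the iterates corresponds to forward-mode differentiation, while accumulating the same products from $k = K$ backwards corresponds to the reverse mode.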
3.2 Flexible Iteration Modules
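Optimizing the BLP in Eqs. (1)-(2) has now been reduced to the problem of designing a proper $\mathcal{T}_k$ for Eq. (8). As discussed above, $\mathcal{T}_k$ is related to both the UL and LL objectives, so it is natural to aggregate the descent information of these two subproblems to obtain $\mathcal{T}_k$. Specifically, for a given $x$, the descent directions of the UL and LL objectives can be respectively defined as $d_k^F(x) = s_u \nabla_y F(x, y_k)$ and $d_k^f(x) = s_l \nabla_y f(x, y_k)$, where $s_u, s_l > 0$ are their step size parameters. Then we formulate $\mathcal{T}_{k+1}$ as the following first-order descent scheme:
(10) $\mathcal{T}_{k+1}(x, y_k(x)) = y_k - \bigl( \alpha_k d_k^F(x) + (1 - \alpha_k) d_k^f(x) \bigr),$
where $\alpha_k \in (0, 1)$ denotes the aggregation parameter.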
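The aggregation step can be sketched in a few lines. The snippet below is an illustration under stated assumptions rather than the authors' implementation: it uses the toy instance `F(x, y) = 0.5 * (x - y[1])**2 + 0.5 * (y[0] - 1)**2`, `f(x, y) = 0.5 * y[0]**2 - x * y[0]` (whose LL solution set is not a singleton and whose optimistic solution is `x = 1`, `y = (1, 1)`), the hypothetical schedule `alpha_k = 1 / (k + 1)`, the step sizes `s_u = 1.0` and `s_l = 0.5`, and projected gradient descent on `[-100, 100]` for the UL variable. Setting the aggregation parameter to zero recovers plain LL gradient descent for comparison.

```python
def aggregated_lower_level(x, num_steps=100, s_u=1.0, s_l=0.5,
                           alpha=lambda k: 1.0 / (k + 1)):
    """Aggregated LL iterations on an assumed toy instance.

    F(x, y) = 0.5 * (x - y[1])**2 + 0.5 * (y[0] - 1)**2   (UL objective)
    f(x, y) = 0.5 * y[0]**2 - x * y[0]                    (LL objective)

    Returns y_K(x) and d y_K / d x as 2-element lists. The derivative is
    propagated through every update (forward mode).
    """
    y, dy = [0.0, 0.0], [0.0, 0.0]
    for k in range(1, num_steps + 1):
        a = alpha(k)
        # Descent directions of F and f w.r.t. y, scaled by their step sizes.
        d_F = [s_u * (y[0] - 1.0), s_u * (y[1] - x)]
        d_f = [s_l * (y[0] - x), 0.0]
        # Derivatives of those directions w.r.t. x (chain rule through y).
        dd_F = [s_u * dy[0], s_u * (dy[1] - 1.0)]
        dd_f = [s_l * (dy[0] - 1.0), 0.0]
        y = [y[i] - (a * d_F[i] + (1.0 - a) * d_f[i]) for i in range(2)]
        dy = [dy[i] - (a * dd_F[i] + (1.0 - a) * dd_f[i]) for i in range(2)]
    return y, dy


def solve(alpha, outer_steps=200, lr=0.5, x0=0.0):
    """Projected gradient descent on x over [-100, 100] for x -> F(x, y_K(x))."""
    x = x0
    for _ in range(outer_steps):
        y, dy = aggregated_lower_level(x, alpha=alpha)
        # Total derivative: partial in x plus the chain-rule terms through y_K(x).
        grad = (x - y[1]) + (y[0] - 1.0) * dy[0] + (y[1] - x) * dy[1]
        x = min(100.0, max(-100.0, x - lr * grad))
    return x, aggregated_lower_level(x, alpha=alpha)[0]


x_agg, y_agg = solve(alpha=lambda k: 1.0 / (k + 1))  # aggregation switched on
x_gd, y_gd = solve(alpha=lambda k: 0.0)              # plain LL gradient descent
print("aggregated:", x_agg, y_agg)
print("plain     :", x_gd, y_gd)
```

With aggregation, the UL gradient steers the coordinate that the LL objective leaves undetermined, so the iterates end up close to the optimistic solution; with the aggregation parameter fixed at zero that coordinate never leaves its initial value.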
Remark 2.
In this part, we introduce a gradient-aggregation-based iterative module to handle the simple bilevel subproblem in Eq. (7). Indeed, the theoretical analysis in Section 4 will demonstrate that our BDA algorithmic framework is flexible enough to incorporate a variety of numerical schemes. For example, in the Supplemental Material, we present an appropriate iteration module to handle BLPs with a nonsmooth LL objective, while convergence is still strictly guaranteed within our framework.
4 Theoretical Investigations
In this section, the convergence behaviors of first-order bilevel optimization schemes are systematically investigated. We first derive two elementary properties and a convergence proof recipe (Section 4.1). Following this roadmap, we show that the convergence of BDA does not depend on the LLS condition (Section 4.2). We also improve the convergence results for existing FOMs in the LLS scenario (Section 4.3). To avoid triviality, we assume hereafter that $\mathcal{S}(x)$ is nonempty for any $x \in \mathcal{X}$. Note that all proofs are given in our Supplemental Material.
4.1 A General Proof Recipe
We establish a general methodology in Theorem 1, which describes the main steps to achieve convergence guarantees for our schematic first-order bilevel scheme in Eqs. (8)-(9) (with an abstract iteration module) for the BLPs in Eqs. (1)-(2). Basically, our proof methodology consists of two main steps:

LL solution set property: For any , there exists such that whenever ,

UL objective convergence property: is LSC on , thus
[Footnote 2: Some definitions, including Outer/Inner Semi-Continuous (OSC/ISC) properties for set-valued mappings, Lower/Upper Semi-Continuous (LSC/USC) and local uniformly level-bounded properties for functions, are moved to our Supplemental Material. One may also refer to (Rockafellar and Wets, 2009) for more details.]
Equipped with these properties, the following theorem establishes the general convergence results for our schematic bilevel scheme in Eqs. (8)-(9).
Theorem 1.
Suppose both the above LL solution set and UL objective convergence properties hold, then

if is local minimum of with uniform neighbourhood modulus , we have any limit point of the sequence is a local minimum of ;

if , we have any limit point of the sequence satisfies that ; and as .
4.2 Convergence Properties of BDA
The objective here is to demonstrate that our BDA meets these two elementary properties required by Theorem 1. Before proving the convergence properties of BDA, we first take the following as our blanket assumption.
Assumption 1.
For any $x \in \mathcal{X}$, $F(x, \cdot)$ is Lipschitz continuous, smooth, and strongly convex, and $f(x, \cdot)$ is smooth and convex.
Notice that Assumption 1 is quite standard for BLPs in learning/vision areas (Franceschi et al., 2018; Shaban et al., 2019). As can be seen, it is satisfied for all the applications considered in this work. We first present some necessary variational analysis preliminaries. Denoting
(11) 
under Assumption 1, we can quickly obtain that is nonempty and unique for any . Moreover, we can derive the boundedness of in the following lemma.
Lemma 1.
Suppose is level-bounded w.r.t. and locally uniform w.r.t. . If is ISC on , then is bounded.
Denoting further , thanks to the continuity of , we have the following result.
Lemma 2.
If is continuous on , then is USC on .
Now we are ready to establish our fundamental LL solution set and UL objective convergence properties required in Theorem 1. In the following proposition, we first derive the convergence of in the light of the general fact stated in (Sabach and Shtern, 2017).
Proposition 1.
Suppose Assumption 1 is satisfied and and let , , , with and . Denoting , and , with and , it holds that
(12)  
(13)  
(14) 
where . Furthermore, converges to as for any .
Proposition 1, together with Lemma 1, shows that is a bounded sequence and uniformly converges. We next prove the uniform convergence of towards the solution set through the uniform convergence of .
Proposition 2.
Let be a bounded set and . If is ISC on , then there exists such that for any , , in case is satisfied.
Combining Lemmas 1 and 2, together with Proposition 2, the LL solution set property required in Theorem 1 can be eventually derived. Let us now prove the LSC property of on in the following proposition.
Proposition 3.
Suppose is level-bounded w.r.t. and locally uniform w.r.t. . If is OSC at , then is LSC at .
Then the UL objective convergence property required in Theorem 1 can be obtained based on Proposition 3. In summary, we present the main convergence results of BDA in the following theorem.
Theorem 2.
Remark 3.
Our theoretical results are general enough for BLPs in different application scenarios. For example, when the LL objective takes a nonsmooth form, e.g., one composed of a smooth part and a nonsmooth part, we can adopt the proximal-operation-based iteration module (Beck, 2017) to construct the iteration module within our BDA framework. The convergence proofs are highly similar to those of Theorem 2. More details on this extension can be found in our Supplemental Material.
4.3 Improving Existing LLS Theories
Even with the LLS simplification of the BLP in Eqs. (1)-(2), the theoretical properties of existing bilevel FOMs are still not very convincing. Their convergence proofs essentially depend on the strong convexity (or local strong convexity) of the LL objective, which restricts the use of FOMs in complex learning/vision applications. To address this issue, this subsection shows that, under the LLS condition, existing convergence results (Franceschi et al., 2018; Shaban et al., 2019) can be improved in the sense that weaker assumptions are required. We begin with an assumption on the LL objective needed in this subsection.
Assumption 2.
is level-bounded w.r.t. and locally uniform w.r.t. .
In fact, Assumption 2 is mild and satisfied by a large number of bilevel FOMs, when the LL subproblem is convex but not necessarily strongly convex. In contrast, the more restrictive strong convexity on is an essential assumption in (Franceschi et al., 2018; Shaban et al., 2019). Under Assumption 2, the following lemma verifies the continuity of in the LLS scenario.
Lemma 3.
Suppose that Assumption 2 is satisfied and is single-valued on . Then is continuous on .
As can be seen from the proof of Theorem 3 in our Supplemental Material, Lemma 3 together with the uniform convergence of imply the LL solution set and UL objective convergence properties. Hence Theorem 1 is applicable, which inspires an improved version of the convergence results for existing bilevel FOMs as follows.
Theorem 3.
Theorem 3 improves the convergence results in (Franceschi et al., 2018). In fact, the uniform convergence assumption of towards required in (Franceschi et al., 2018) is essentially based on the strong convexity assumption (see Remark 3.3 of (Franceschi et al., 2018)). Instead of assuming such strong convexity, we only need to assume a weaker condition that converges uniformly to on as .
It is natural to illustrate our improvement in terms of concrete applications. Specifically, we take the gradient-based bilevel scheme in Section 2.1 (which has been used in (Franceschi et al., 2018; Shaban et al., 2019; Jenni and Favaro, 2018; Zügner and Günnemann, 2019; Rajeswaran et al., 2019)). In the following two propositions, we assume that is smooth and convex, and . Inspired by Theorems 10.21 and 10.23 in (Beck, 2017), we derive the following proposition.
Proposition 4.
Let be generated by Eq. (4). Then it holds that , and , with and .
Then we can immediately verify our required assumption on in the absence of strong convexity for .
Proposition 5.
Suppose Assumption 2 is satisfied. Then is uniformly bounded on and converges uniformly to on as .
Remark 4.
When the LL subproblem is convex, but not necessarily strongly convex, a large number of gradientbased methods, including accelerated gradient methods such as FISTA (Beck and Teboulle, 2009) and block coordinate descent method (Tseng, 2001), automatically meet our assumption, i.e., the uniform convergence of optimal values towards on .
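That gradient-based LL iterations can drive the LL objective value to its optimum without strong convexity may be checked numerically. The sketch below is only an illustration and does not reproduce Propositions 4 and 5: it assumes the scalar function `f(y) = log(cosh(y))`, which is convex and 1-smooth but not strongly convex, and compares the optimality gap of gradient descent with the textbook `O(1/k)` bound for smooth convex functions with step size `1/L`. The function name and starting point are hypothetical.

```python
import math


def gd_gaps(y0=5.0, num_steps=50):
    """Optimality gaps f(y_k) - f* of gradient descent on f(y) = log(cosh(y)).

    f is convex and 1-smooth (f'' = 1 / cosh(y)**2 <= 1) but not strongly convex,
    since f'' tends to 0 as |y| grows. Its minimum value is f* = 0 at y = 0.
    The step size 1 / L = 1 is used.
    """
    y, gaps = y0, []
    for _ in range(num_steps):
        y -= math.tanh(y)                    # gradient step, f'(y) = tanh(y)
        gaps.append(math.log(math.cosh(y)))  # current optimality gap
    return gaps


y0 = 5.0
gaps = gd_gaps(y0)
# Standard bound for L-smooth convex f with step 1/L: L * |y0 - y*|**2 / (2 * k).
bounds = [y0 ** 2 / (2.0 * k) for k in range(1, len(gaps) + 1)]
for k in (1, 2, 5, 10):
    print(f"k = {k:2d}: gap = {gaps[k - 1]:.3e}, bound = {bounds[k - 1]:.3e}")
```

The bound only involves the distance from the starting point to a minimizer and the smoothness constant, which is why no strong convexity parameter appears.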
5 Experimental Results
In this section, we first verify our theoretical findings and then evaluate the performance of our proposed method on different problems, such as hyperparameter optimization and meta learning. We conducted these experiments on a computer with an Intel Core i7-7700 CPU (3.6 GHz), 32GB RAM and an NVIDIA GeForce RTX 2060 6GB GPU.
5.1 Synthetic BLP
Our theoretical findings are investigated based on the synthetic BLP described in Section 2.2. As stated above, this deterministic bilevel formulation satisfies all the assumptions in Section 4, but it does not satisfy the LLS assumption required in (Franceschi et al., 2018; Finn et al., 2017; Shaban et al., 2019; Franceschi et al., 2017). Here, we fix the parameters and in this experiment.
In Figure 1, we plot numerical results of BDA and one of the most representative first-order BLP methods (i.e., Reverse Hyper-Gradient (RHG) (Franceschi et al., 2017, 2018)) with different initialization points. We considered the numerical metrics , , , and , where the superscript denotes the true objective/variable. We observed that RHG consistently fails to obtain the correct solution, even when started from different initialization points. This is mainly because the solution set of the LL subproblem in Eq. (5) is not a singleton, which violates the fundamental assumption of RHG. In contrast, BDA aggregates the UL and LL information to perform the LL updating, and is thus able to obtain the true optimal solution in all these scenarios. The initialization only slightly affects the convergence speed of our iterative sequences.
Figure 2 further plots the convergence behaviors of BDA and RHG with different numbers of LL iterations (i.e., ). We observed that the results of RHG cannot be improved by increasing the number of LL iterations. For BDA, however, the three iterative sequences (with ) always converge, and the numerical performance can be improved by performing relatively more LL iterations. In the above two figures, we set , .
Figure 3 evaluates the convergence behaviors of BDA with different choices of the aggregation parameter $\alpha_k$. By setting $\alpha_k = 0$, we were unable to use the UL information to guide the LL updating; thus it is hard to obtain proper feasible solutions for the UL subproblem. When choosing a fixed $\alpha_k \in (0, 1)$, the numerical performance can be improved, but the convergence speed is still slow. Following our theoretical findings, we therefore introduced an adaptive strategy to incorporate UL information into the LL iterations, leading to good convergence behaviors for both the UL and LL variables.
5.2 Hyperparameter Optimization
Hyperparameter optimization is the problem of choosing a set of optimal hyperparameters for a given learning task. Here we consider a specific hyperparameter optimization example, known as data hyper-cleaning (Shaban et al., 2019; Franceschi et al., 2017). In this problem, we need to train a linear classifier on a given image set, but part of the training labels are corrupted. Following (Shaban et al., 2019; Franceschi et al., 2017), we cast this problem as a BLP as follows. We first denote $\mathcal{D}_{\mathrm{tr}}$ and $\mathcal{D}_{\mathrm{val}}$ as the training and validation sets, respectively. Then in the LL subproblem, we define $f$ as the following weighted training loss:
$f(x, y) = \sum_{(u_i, v_i) \in \mathcal{D}_{\mathrm{tr}}} [\sigma(x)]_i \, \ell(y; u_i, v_i),$
where $\ell(y; u_i, v_i)$ denotes the cross-entropy function with the classification parameter $y$ and data pairs $(u_i, v_i)$, and $x$ are the hyperparameters that penalize the objective for different training samples. Here $\sigma(x)$ denotes the element-wise sigmoid function on $x$ and is used to constrain the weights to $[0, 1]$. For the UL subproblem, we define $F$ as the cross-entropy loss with regularization on the validation set, where the trade-off parameter is held fixed.
Table 1. Averaged accuracy with different numbers of LL iterations ($K$).
Method | K = 50 | K = 100 | K = 200 | K = 400 | K = 800
RHG | 88.96 | 89.73 | 90.13 | 90.19 | 90.15
TRHG | 87.90 | 88.28 | 88.50 | 88.52 | 89.99
BDA | 89.12 | 90.12 | 90.57 | 90.81 | 90.86
We applied BDA together with the baselines RHG and Truncated RHG (TRHG) (Shaban et al., 2019) to solve the above BLP model on MNIST (LeCun et al., 1998). Both the training and the validation sets consist of 7000 class-balanced samples, and the remaining 56000 samples are used as the test set. We adopted the architectures used in RHG as the feature extractor for all the compared methods. For TRHG, we used truncated back-propagation with a fixed number of steps to guarantee its convergence. Table 1 reports the averaged accuracy of all compared methods with different numbers of LL iterations. We observed that RHG outperformed TRHG, while BDA consistently achieved the highest accuracy. Our theoretical results suggest that most of the improvement of BDA comes from the aggregation of the UL and LL information. The results also show that more LL iterations improve the final performance in most cases.
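The two hyper-cleaning objectives described in this subsection can be sketched as follows. This is a minimal stand-alone illustration, not the experimental code: the function names, the bias-free linear classifier stored as a list of per-class weight vectors, and the squared L2 form of the regularizer are assumptions made here (the text only states that the UL loss is regularized).

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def cross_entropy(weights, features, label):
    """Softmax cross-entropy of a linear classifier on one sample.

    weights: one weight vector per class; features: input vector; label: class index.
    """
    logits = [sum(w_j * u_j for w_j, u_j in zip(w, features)) for w in weights]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def lower_level_loss(x, weights, train_set):
    """Weighted training loss: sample i is scaled by sigmoid(x[i]) in [0, 1]."""
    return sum(sigmoid(x_i) * cross_entropy(weights, u, v)
               for x_i, (u, v) in zip(x, train_set))


def upper_level_loss(weights, val_set, trade_off):
    """Validation cross-entropy plus an (assumed) squared L2 regularizer."""
    reg = sum(w_j * w_j for w in weights for w_j in w)
    return sum(cross_entropy(weights, u, v) for u, v in val_set) + trade_off * reg


# Tiny usage example with two classes and two-dimensional features.
train = [([1.0, 2.0], 0), ([0.5, -1.0], 1)]
W = [[0.0, 0.0], [0.0, 0.0]]
print(lower_level_loss([0.0, 0.0], W, train))    # both samples weighted by 0.5
print(lower_level_loss([-50.0, 0.0], W, train))  # first sample almost switched off
```

A strongly negative hyperparameter entry drives the weight of the corresponding training sample towards zero, which is how corrupted samples can be suppressed.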
5.3 Meta Learning
Table 2. Few-shot classification accuracy on Omniglot.
Method | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot
MAML | 98.70 | 99.91 | 95.80 | 98.90
Meta-SGD | 97.97 | 98.96 | 93.98 | 98.40
Reptile | 97.68 | 99.48 | 89.43 | 97.12
RHG | 98.60 | 99.50 | 95.50 | 98.40
TRHG | 98.74 | 99.52 | 95.82 | 98.95
BDA | 99.04 | 99.62 | 96.50 | 99.10
Table 3. Results on MiniImageNet. “Ave. ± Var. (Acc.)” denotes the averaged accuracy and the corresponding variance.
Method | Acc. | Ave. ± Var. (Acc.) | UL Iter.
RHG |  | 44.46 ± 0.78 | 3300
TRHG |  | 44.21 ± 0.78 | 3700
BDA | 49.08 | 44.24 ± 0.79 | 2500
The aim of meta learning is to learn an algorithm that works well on novel tasks. In particular, we consider the few-shot learning problem (Vinyals et al., 2016; Qiao et al., 2018), where each task is an $N$-way classification problem and the goal is to learn the hyperparameter $x$ such that each task can be solved with only $M$ training samples (i.e., $N$-way $M$-shot). To evaluate this problem, we collect a meta training data set in which each subset is linked to a specific task. We learn a cross-task intermediate representation, parameterized by $x$, as our meta features. Then for the $j$-th task, we utilize multinomial logistic regression, parameterized by $y^j$, and the cross-entropy function as our ground classifier and the task-specific loss, respectively. In this way, we first optimize the hyperparameter $x$ to obtain the overall setup, and then the parameters $y^j$ are fine-tuned for the $j$-th task. The LL and UL objectives are defined accordingly.
Our experiments are conducted on two widely used benchmarks, i.e., Omniglot (Lake et al., 2015), which contains 1623 handwritten characters from 50 alphabets, and MiniImageNet (Vinyals et al., 2016), which is a subset of ImageNet (Deng et al., 2009) and includes 60000 downsampled images from 100 different classes. We followed the experimental protocol used in MAML (Finn et al., 2017) and compared BDA to several state-of-the-art approaches, such as MAML (Finn et al., 2017), Meta-SGD (Li et al., 2018), Reptile (Nichol et al., 2018), RHG, and TRHG.
It can be seen in Table 2 that BDA compared well to these methods and achieved the highest classification accuracy except in the 5-way 5-shot task, where its performance was slightly worse than that of MAML. We further conducted experiments on the more challenging MiniImageNet data set. In the second column of Table 3, we report the averaged accuracy of three first-order BLP-based methods (i.e., RHG, TRHG and BDA). Again, the performance of BDA is better than that of RHG and TRHG. In the rightmost two columns, we also compare the average number of UL iterations needed to reach almost the same accuracy (see the “Ave. ± Var. (Acc.)” column). These results show that BDA needed the fewest iterations to achieve such accuracy.
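The N-way M-shot task construction used in this subsection can be sketched as an episode sampler. The snippet is a hypothetical illustration, not the experimental protocol of the compared methods: the function name, the additional query (validation) split of `n_query` examples per class, and the placeholder data are assumptions introduced here.

```python
import random


def sample_episode(data_by_class, n_way, m_shot, n_query, rng):
    """Draw one N-way M-shot task with a support (training) and a query split.

    data_by_class maps a class label to a list of examples. Inside the task,
    classes are re-indexed to 0 .. n_way - 1.
    """
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for task_label, cls in enumerate(classes):
        examples = rng.sample(data_by_class[cls], m_shot + n_query)
        support += [(e, task_label) for e in examples[:m_shot]]
        query += [(e, task_label) for e in examples[m_shot:]]
    return support, query


# Hypothetical toy data: 10 classes with 20 placeholder examples each.
toy = {c: [f"class{c}_img{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(toy, n_way=5, m_shot=1, n_query=15,
                                rng=random.Random(0))
print(len(support), "support examples,", len(query), "query examples")
```

In a bilevel reading, the support split of each task would feed the task-specific (LL) loss and the query split the cross-task (UL) loss, although the exact split used in the experiments is not specified here.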
6 Conclusions
This paper proposed BDA, a generic first-order algorithmic framework to address the BLPs in Eqs. (1)-(2). Our approach has a number of theoretical benefits. Its convergence can be strictly proved without the LLS assumption, which is the fundamental restriction of existing gradient-based bilevel methods. It is also compatible with a variety of particular computation modules. As a nontrivial byproduct, we also improved the convergence results for classical gradient-based schemes. Extensive evaluations showed the superiority of BDA on different applications.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 61922019, 61672125, 61733002 and 61772105), LiaoNing Revitalization Talents Program (XLYC1807088).
References
 A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2 (1), pp. 183–202. Cited by: Remark 4.
 Firstorder methods in optimization. SIAM. Cited by: §A.4.3, §4.3, Remark 3.
 Perturbation analysis of optimization problems. Springer Science & Business Media. Cited by: §A.4.1.
 Bilevel parameter learning for higherorder total variation regularisation models. Journal of Mathematical Imaging and Vision 57 (1), pp. 1–25. Cited by: §1.1.
 Bilevel optimization: theory, algorithms and applications. TU Bergakademie Freiberg Mining Academy and Technical University. Cited by: §1.1, §1, footnote 1.
 Imagenet: a largescale hierarchical image database. In CVPR, pp. 248–255. Cited by: §5.3.
 [7] Generic methods for optimizationbased modeling. Cited by: Appendix A.
 Modelagnostic metalearning for fast adaptation of deep networks. In ICML, pp. 1126–1135. Cited by: §5.1, §5.3.

Bilevel programming for hyperparameter optimization and metalearning
. In ICML, pp. 1563–1572. Cited by: Appendix A, 4th item, §1.1, §1.1, §1.2, §2.1, §4.2, §4.3, §4.3, §4.3, §4.3, §5.1, §5.1.  Forward and reverse gradientbased hyperparameter optimization. In ICML, pp. 1165–1173. Cited by: Appendix A, §1.1, §1.1, §2.1, §2.2, §5.1, §5.1, §5.2.
 Deep bilevel learning. In ECCV, pp. 618–633. Cited by: §1.1, §2.1, §2.2, §4.3.
 The polynomial hierarchy and a simple model for competitive analysis. Mathematical Programming 32 (2), pp. 146–164. Cited by: §1.
 Classification model selection via bilevel programming. Optimization Methods & Software 23 (4), pp. 475–489. Cited by: §1.1.
 A bilevel optimization approach for parameter learning in variational models. SIAM Journal on Imaging Sciences 6 (2), pp. 938–983. Cited by: §1.1.
 Humanlevel concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §5.3.
 Gradientbased learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §5.2.
 Metasgd: learning to learn quickly for fewshot learning. In ICML, Cited by: §5.3.
 Stochastic hyperparameter optimization through hypernetworks. CoRR, abs/1802.09419. Cited by: §1.1, §2.2.
 Selftuning networks: bilevel optimization of hyperparameters using structured bestresponse functions. ICLR. Cited by: §1.1, §1.1.
 Gradientbased hyperparameter optimization through reversible learning. In ICML, pp. 2113–2122. Cited by: Appendix A, §1.1.

 Bilevel programming algorithms for machine learning model selection. Rensselaer Polytechnic Institute. Cited by: §1.1.
 On first-order meta-learning algorithms. CoRR, abs/1803.02999. Cited by: §5.3.
 Hyperparameter learning via bilevel nonsmooth optimization. CoRR, abs/1806.01520. Cited by: §1.1, §1.1.
 Connecting generative adversarial networks and actor-critic methods. In NeurIPS Workshop on Adversarial Training. Cited by: §1.1.
 Few-shot image recognition by predicting parameters from activations. In CVPR, pp. 7229–7238. Cited by: §5.3.
 Meta-learning with implicit gradients. In NeurIPS, pp. 113–124. Cited by: §A.5, §1.1, §1.1, §2.1, §4.3.
 Variational analysis. Springer Science & Business Media. Cited by: §A.1.1, §A.3.3, §A.4.1, footnote 2.
 A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization 27 (2), pp. 640–660. Cited by: §A.5, §4.2.
 Truncated backpropagation for bilevel optimization. In AISTATS, pp. 1723–1732. Cited by: Appendix A, 4th item, §1.1, §2.1, §4.2, §4.3, §4.3, §4.3, §5.1, §5.2, §5.2.
 Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications 109 (3), pp. 475–494. Cited by: Remark 4.
 Matching networks for one shot learning. In NeurIPS, pp. 3630–3638. Cited by: §5.3, §5.3.
 Proximal deep structured models. In NeurIPS, pp. 865–873. Cited by: §A.5.
 Provably global convergence of actor-critic: a case for linear quadratic regulator with ergodic cost. In NeurIPS, pp. 8351–8363. Cited by: §1.1.

 Adversarial attacks on graph neural networks via meta learning. ICLR. Cited by: §1.1, §1.1, §2.1, §4.3.
Appendix
The Appendix is organized as follows. Section A compares the theoretical results of BDA and existing state-of-the-art bilevel FOMs. In Section A.1, we provide detailed proofs for all of the theoretical results in our manuscript. Finally, Section A.5 discusses a possible extension of BDA to BLPs with a nonsmooth LL objective.
Appendix A Comparisons on Theoretical Results
Table 4 summarizes the proved convergence properties, together with the required model conditions, for BDA and existing gradient-based bilevel FOMs, such as (Domke, ; Maclaurin et al., 2015; Franceschi et al., 2017, 2018; Shaban et al., 2019). In fact, the theoretical results for these previous approaches were established in (Franceschi et al., 2018). To simplify the notation, we use the following abbreviations: “JC” (Jointly Continuous), “LC” (Lipschitz Continuous), “SC” (Strongly Convex), and “LB” (Level-Bounded). We also denote subsequentially convergent and uniformly convergent as “” and “”, respectively. The superscript denotes that it is the true optimal variables/values. For each category of methods, the top two rows summarize the required properties of the models (i.e., the UL and LL subproblems), and the bottom row summarizes the proved convergence results for these methods.
It can be seen that, in the LLS scenario, our BDA and the existing bilevel FOMs share the same requirements for the UL subproblem. For the LL subproblem, however, the uniform convergence assumption of towards , considered in the previous FOMs, is essentially more restrictive than the assumptions required by our BDA. Note that this has already been discussed below Theorem 3 in our manuscript. More importantly, when solving BLPs without the LLS assumption, no theoretical results can be obtained for these existing FOMs. In contrast, we demonstrate that BDA obtains the same convergence properties as in the LLS scenario.
Alg. | Level | BLPs with LLS condition | BLPs without LLS condition
Existing FOMs | UL | is JC / is LC | Not Available
Existing FOMs | LL | is JC / | Not Available
Existing FOMs | Convergence Results | / | Not Available
BDA | UL | is JC / is LC | is JC / is LC, SC, and smooth
BDA | LL | is JC / is LB / | is JC / is smooth / is continuous
BDA | Convergence Results | / | /
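The role of the LLS condition discussed above can be illustrated numerically. The following is a minimal sketch on a hypothetical toy problem (the objectives, step size, and function names are illustrative assumptions, not taken from Table 4): the LL objective ignores the second LL coordinate, so the LL solution set is a whole line, and plain gradient descent on the LL subproblem returns different minimizers, with different UL values, depending on its initialization.

```python
import numpy as np

# Hypothetical toy problem (illustrative assumption):
#   UL objective: F(x, y) = 0.5 * (x - y2)^2 + 0.5 * (y1 - 1)^2
#   LL objective: f(x, y) = 0.5 * y1^2 - x * y1   (does not depend on y2)
# For a fixed x, every point y = (x, y2) minimizes f, so LLS fails.

def ll_objective(x, y):
    return 0.5 * y[0] ** 2 - x * y[0]

def ll_grad(x, y):
    # Gradient of f w.r.t. y; the y2 component is identically zero.
    return np.array([y[0] - x, 0.0])

def ul_objective(x, y):
    return 0.5 * (x - y[1]) ** 2 + 0.5 * (y[0] - 1.0) ** 2

def solve_ll(x, y0, steps=200, lr=0.5):
    # Plain gradient descent on the LL subproblem from initial point y0.
    y = np.array(y0, dtype=float)
    for _ in range(steps):
        y = y - lr * ll_grad(x, y)
    return y

x = 1.0
y_a = solve_ll(x, [0.0, 0.0])  # y2 stays at its initial value 0
y_b = solve_ll(x, [0.0, 1.0])  # y2 stays at its initial value 1

ll_gap = abs(ll_objective(x, y_a) - ll_objective(x, y_b))
ul_a = ul_objective(x, y_a)
ul_b = ul_objective(x, y_b)
```

Both `y_a` and `y_b` attain the same LL value, yet they give different UL values, so an approach that only tracks a single LL trajectory has no control over which LL solution the UL objective is evaluated at.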
A.1 Detailed Proofs
A.1.1 Necessary Definitions
We first state some definitions that are necessary for our analysis. One may also refer to (Rockafellar and Wets, 2009) for more details on these variational analysis properties. Specifically, by denoting
(15) 
we define various continuity properties of the set-valued mapping as follows.
Definition 1.
A set-valued mapping is Outer Semi-Continuous (OSC) at when and Inner Semi-Continuous (ISC) at when . It is called continuous at when it is both OSC and ISC at , as expressed by .
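For orientation, the outer and inner limits underlying Definition 1 are usually written as in Rockafellar and Wets (2009). The following is a sketch in assumed notation ($S$ for the set-valued mapping, $\bar{x}$ for the reference point, $x^{\nu}$ and $y^{\nu}$ for sequences); the symbols used in Eq. (15) may differ.

```latex
\begin{aligned}
\limsup_{x \to \bar{x}} S(x) &= \big\{ y \;\big|\; \exists\, x^{\nu} \to \bar{x},\ \exists\, y^{\nu} \to y \ \text{with}\ y^{\nu} \in S(x^{\nu}) \big\}, \\
\liminf_{x \to \bar{x}} S(x) &= \big\{ y \;\big|\; \forall\, x^{\nu} \to \bar{x},\ \exists\, y^{\nu} \to y \ \text{with}\ y^{\nu} \in S(x^{\nu}) \ \text{for all sufficiently large } \nu \big\}, \\
\text{OSC at } \bar{x}: &\quad \limsup_{x \to \bar{x}} S(x) \subset S(\bar{x}), \\
\text{ISC at } \bar{x}: &\quad \liminf_{x \to \bar{x}} S(x) \supset S(\bar{x}).
\end{aligned}
```

Under this reading, continuity at $\bar{x}$ means that both inclusions hold, that is, $S(x) \to S(\bar{x})$ as $x \to \bar{x}$.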
Before defining semicontinuity for functions, we introduce the upper and lower limits of a function as
(16)  
where .
Definition 2.
The function is Upper Semi-Continuous (USC) at if
(17) 
and USC on if this holds for every . The function is Lower Semi-Continuous (LSC) at if
(18) 
and LSC on if this holds for every .
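As a reference point, the standard forms of these limits and of the two semicontinuity conditions (Rockafellar and Wets, 2009) can be sketched as follows. The notation is assumed: $f$ denotes the function, $\bar{x}$ the reference point, and $\mathbb{B}_{\delta}(\bar{x}) = \{ x \mid \| x - \bar{x} \| \le \delta \}$ a closed ball; Eqs. (16)-(18) may use different symbols.

```latex
\begin{aligned}
\limsup_{x \to \bar{x}} f(x) &= \lim_{\delta \searrow 0} \Big[ \sup_{x \in \mathbb{B}_{\delta}(\bar{x})} f(x) \Big], &
\liminf_{x \to \bar{x}} f(x) &= \lim_{\delta \searrow 0} \Big[ \inf_{x \in \mathbb{B}_{\delta}(\bar{x})} f(x) \Big], \\
\text{USC at } \bar{x}: &\quad \limsup_{x \to \bar{x}} f(x) \le f(\bar{x}), &
\text{LSC at } \bar{x}: &\quad \liminf_{x \to \bar{x}} f(x) \ge f(\bar{x}).
\end{aligned}
```

For instance, the step function with $f(x) = 1$ for $x \ge 0$ and $f(x) = 0$ otherwise is USC at $0$ (its upper limit there is $1 = f(0)$) but not LSC at $0$ (its lower limit there is $0 < f(0)$).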
We also present the level-bounded and locally uniform property of a function in the following definition.
Definition 3.
Given a function , if for the point and , there exists along with a bounded set , such that
(19) 
then we say that is level-bounded w.r.t. and locally uniform at . It is called locally uniform w.r.t. if the above holds for each .
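The condition in Definition 3 follows the usual notion of uniform level-boundedness in Rockafellar and Wets (2009). A sketch in assumed notation ($f(x, y)$ for the function, $X$ for the set of admissible $x$, $c$ for the level, $\mathcal{B}$ for the bounded set, and $m$ for the dimension of $y$), which may not match the symbols in Eq. (19):

```latex
\begin{gathered}
\forall\, \bar{x} \in X,\ \forall\, c \in \mathbb{R},\ \exists\, \delta > 0 \ \text{and a bounded set}\ \mathcal{B} \subset \mathbb{R}^{m} \ \text{such that} \\
\big\{ y \in \mathbb{R}^{m} \;\big|\; f(x, y) \le c \big\} \subset \mathcal{B}
\quad \text{for all } x \in \mathbb{B}_{\delta}(\bar{x}) \cap X.
\end{gathered}
```

As a simple illustration, $f(x, y) = (y - x)^2$ with scalar $x$ and $y$ satisfies this condition, since its sublevel sets are contained in $[\bar{x} - \delta - \sqrt{c},\ \bar{x} + \delta + \sqrt{c}]$ for $c \ge 0$. In contrast, $f(x, y) = x^2 y^2$ fails it at $\bar{x} = 0$, where every sublevel set with $c \ge 0$ is the whole real line.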
A.2 Proofs of Section 4.1
A.2.1 Proof of Theorem 1
Proof.
Since is compact, we can assume without loss of generality that and by considering a subsequence of . For any , there exists such that whenever , so we have
(20) 
Thus, for any , there exists such that
(21) 
Therefore, for any , we have
(22)  
This implies that, for any , there exists such that whenever ,
(23) 
Next, as is a local minimum of with uniform neighbourhood modulus , it follows that
And since , we have, for any , there exists such that whenever , ,
Taking and by the LSC of , we have
By taking , we have
which implies , i.e., is a local minimum of .
We can show the second result with similar arguments. Since is compact, we can assume without loss of generality that by considering a subsequence of . As shown above in (23), for any , there exists such that whenever ,
(24) 
Taking and by the LSC of , we have
(25)  
By taking , we have
(26) 
which implies .
We next show that as . If this is not true, then there exist and a sequence such that
(27) 
For each , there exists such that . And since is compact, we can assume without loss of generality that . For any , there exists such that whenever , the following holds
(28)  
By taking and from the LSC of , we have