1 Introduction
Bi-Level Optimization (BLO) is a hierarchical mathematical program in which the feasible region of one optimization task is restricted by the solution-set mapping of another optimization task (i.e., the second task is embedded within the first one) [48, 36]. The outer optimization task is commonly referred to as the Upper-Level (UL) problem, and the inner optimization task as the Lower-Level (LL) problem. Accordingly, BLOs involve two kinds of variables, referred to as the UL and LL variables.
The origin of BLOs can be traced to the domain of game theory, where it is known as Stackelberg competition [196]. Subsequently, it has been investigated in view of many important applications in various fields of science and engineering, particularly in economics, management, chemistry, optimal control, and resource allocation problems [63, 47, 182, 203]. In recent years especially, a great number of modern applications in the fields of machine learning and computer vision (e.g., hyperparameter optimization [50, 62, 139, 152], multi-task and meta-learning [64, 66, 178], neural architecture search [117, 89, 113], generative adversarial learning [111, 92, 191], deep reinforcement learning [222, 163, 211], and image processing and analysis [151, 28, 121], just to name a few) have arisen that fit the BLO framework.
In general, BLOs are highly complicated and computationally challenging to solve due to their nonconvexity and nondifferentiability [99, 43]. Despite their apparent simplicity, BLOs are nonconvex problems with an implicitly determined feasible region, even when the UL and LL problems are convex [81, 220]. Indeed, it has been shown that even the simplest BLO model (i.e., the linear bilevel program) is strongly NP-hard, and that even verifying local optimality is hard [195, 22]. In addition, the existence of multiple optima for the LL problem can result in an inadequate formulation of BLOs, and the feasible region of the problem may be nonconvex [40, 126]. Despite these challenges, a large body of research on methods and applications of BLOs has followed; see [46, 43, 179]. Early studies focused on numerical methods, including extreme-point methods [21], branch-and-bound methods [63, 135], descent methods [175, 147], penalty function methods [5, 197], trust-region methods [54, 42], and so on. The most often used procedure is to replace the LL problem with its Karush–Kuhn–Tucker (KKT) conditions; under suitable assumptions (such as smoothness and convexity, among others), BLOs can then be transformed into single-level optimization problems [2, 183, 180]. However, the complexity of the bilevel model makes it difficult to solve in practice using conventional approaches and computing devices. In particular, these approaches may fail in the face of large-scale, high-dimensional applications, especially in the machine learning and computer vision fields [108, 215].
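To make the KKT-based reduction concrete, consider the following toy instance (constructed here purely for illustration; it does not appear in the cited works). For the LL problem min_{y ≥ 0} (y − x)^2, the KKT conditions with multiplier μ read 2(y − x) − μ = 0, y ≥ 0, μ ≥ 0, μy = 0, so the bilevel program

```latex
\min_{x,\,y}\ (x-1)^2 + y^2
\quad \text{s.t.}\quad y \in \operatorname*{argmin}_{y' \ge 0}\ (y'-x)^2
```

can be rewritten as the single-level program

```latex
\min_{x,\,y,\,\mu}\ (x-1)^2 + y^2
\quad \text{s.t.}\quad 2(y-x) = \mu,\quad y \ge 0,\quad \mu \ge 0,\quad \mu\,y = 0,
```

where the complementarity constraint μy = 0 is precisely the disjunctive structure that defeats standard constraint qualifications and makes such reformulations hard to solve.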
The classical approach (also called the first-order approach in the economics literature) to reformulating a BLO is to replace the LL subproblem in Eq. (1) by its KKT conditions and minimize over the original variables as well as the multipliers. The resulting problem is a so-called Mathematical Program with Equilibrium Constraints (MPEC) [167, 219]. This transformation is the most often used one. Unfortunately, MPECs are still a challenging class of problems because of the presence of the complementarity constraint [8]. Solution methods for MPECs can be categorized into two approaches. The first one, the nonlinear programming approach, rewrites the complementarity constraint as nonlinear inequalities and then leverages powerful numerical nonlinear programming solvers. The other one, the combinatorial approach, tackles the combinatorial nature of the disjunctive constraint. MPECs have been studied intensively over the last three decades [100]. Recently, some progress on the MPEC approach to BLO has been witnessed by the mathematical programming community, in the context of selecting optimal hyperparameters in regression and classification problems. However, the multipliers cause two issues in the MPEC approach. First, in theory, if the LL problem has more than one multiplier, the resulting MPEC is not equivalent to the original BLO when local optimality is considered [154]. Second, the introduced auxiliary multiplier variables limit the numerical efficiency of solving the BLO.
In recent years, a variety of modern learning paradigms, including but not limited to hyperparameter optimization [151, 140, 65, 153], multi-task and meta-learning [149, 90, 112, 31], neural architecture search [117, 113, 51, 76], adversarial learning [191, 92, 111, 130], and deep reinforcement learning [222, 194], have been investigated in the machine learning and computer vision fields to address complex tasks in challenging real-world scenarios. For example, hyperparameter optimization aims to jointly learn the model parameters and hyperparameters [177]. Meta-learning intends to investigate both the base and meta learners, and thus obtains the ability to adapt to new environments rapidly with a few training examples [7]. Neural architecture search seeks to simultaneously optimize the network architecture and parameters [209]. Adversarial learning can be deemed the process of constructing both the generator and the discriminator during training [163]. In addition, deep reinforcement learning aims to learn a "policy", i.e., a mapping that tells the agent which action to take in each state so as to maximize reward [32]. In fact, despite their different motivations and mechanisms, all these problems contain a series of closely related subproblems and have a natural hierarchical optimization structure. Unfortunately, although these problems have attracted increasing attention in both academia and industry, there has been a lack of a unified perspective from which to understand and formulate these different categories of learning and vision problems.
We notice that most previous surveys on BLOs (e.g., [48, 4, 36, 201, 193, 102, 34, 74]) are written purely from the viewpoint of mathematical programming and mainly focus on formulations, properties, optimality conditions, and classical solution algorithms, such as evolutionary methods [182]. In contrast, the aim of this paper is to utilize BLO to express a variety of complex learning and vision problems, which explicitly or implicitly contain closely related subproblems. Furthermore, we present a unified perspective to comprehensively survey different categories of gradient-based BLO methodologies in specific learning and vision applications. In particular, we first provide a literature review on various complex learning and vision problems, including hyperparameter optimization, multi-task and meta-learning, neural architecture search, adversarial learning, deep reinforcement learning, and so on. We demonstrate that all these tasks can be modeled as a general BLO formulation. Following this perspective, we then establish a value-function-based single-level reformulation to express these existing BLO models. By further introducing a best-response-based algorithmic framework on the single-level reformulation, we can uniformly understand and formulate existing gradient-based BLOs and analyze their accelerations, simplifications, extensions, and convergence and complexity properties. Finally, we demonstrate the potential of our framework for designing new algorithms and point out some promising research directions for BLO in the learning and vision fields.
Compared with existing surveys on BLOs, our major contributions can be summarized as follows:

To the best of our knowledge, this is the first survey paper to focus on uniformly understanding and (re)formulating different categories of complex machine learning and computer vision tasks and their solution methods (especially in the context of deep learning) from the perspective of BLO.

We provide a value-function-based single-level reformulation, together with a flexible best-response-based algorithmic framework, to unify existing gradient-based BLO methodologies and analyze their accelerations, simplifications, extensions, convergence behaviors, and complexity properties.

Our framework not only comprehensively covers mainstream gradient-based BLO methods, but also has the potential for designing new BLO algorithms to deal with more challenging tasks. We also point out some promising directions for future research.
We summarize our mathematical notations in Table I. The remainder of this paper is organized as follows. We first introduce some necessary fundamentals of BLOs in Section 2. Then, Section 3 provides a comprehensive survey of various learning and vision applications that can all be modeled as BLOs. In Section 4, we establish a unified algorithmic framework for existing gradient-based BLO schemes. Within this framework, we further analyze two different categories of BLOs (i.e., with explicit and implicit gradients for the best-response) in Section 5 and Section 6, respectively. We also discuss the so-called lower-level singleton issue of BLOs in Section 7. The convergence and complexity properties of these gradient-based BLOs are discussed in Section 8. Section 9 puts forward the potential of our framework for designing new algorithms to deal with the more challenging pessimistic BLOs. Finally, Section 10 points out some promising directions for future research.
Notation           Description
D_tr / D_val       Training/Validation data
o / α              Operations/Operation weights
π / r              Policy/Reward
s / a              State/Action
Q^π                Q-function under π
Q(s, a)            State-action value-function
G / D              Generator/Discriminator
u / z              Real-world image/Random noise
μ                  Aggregation parameters
x / y              UL/LL variable
F(x, y)            UL objective
f(x, y)            LL objective for a given x
φ(x)               UL value-function
f*(x)              LL value-function
y*(x)              Best-Response (BR) mapping
⊤ / ∘              Transposition/Compound operation
∂y*(x)/∂x          BR Jacobian
∂ŷ(x)/∂x           Practical BR Jacobian w.r.t. x
y_k                The k-th step of gradient descent
S(x)               Solution-set mapping
y_θ(x)             Hyper-network with parameters θ
φ_o(x) / φ_p(x)    Optimistic/Pessimistic objective
g_o / g_p          Optimistic/Pessimistic aggregated gradient
H^{-1}             Inverse Hessian matrix
H^{-1}v            Inverse Hessian-vector product
k / t              LL/UL iteration step
K / T              Maximum LL/UL iteration number
T_l                Layer-wise transformation
Σ_i (I − H)^i      Neumann series
2 Fundamentals of Bi-Level Optimization
2.1 Overview of Bi-Level Optimization Problems
Bi-Level Optimization (BLO) contains two levels of optimization tasks, where one is nested within the other as a constraint. The inner (or nested) and outer optimization tasks are often referred to as the Lower-Level (LL) and Upper-Level (UL) subproblems, respectively [48]. Correspondingly, there are two types of variables, namely, the LL (y) and UL (x) variables. Specifically, the LL subproblem can be formulated as the following parametric optimization task
(1)   min_{y∈Y} f(x, y),

where we consider a continuous function f : R^n × R^m → R as the LL objective and Y ⊆ R^m is a nonempty and closed set. By denoting the LL value-function as f*(x) := min_{y∈Y} f(x, y), we can define the solution-set mapping of Eq. (1) as S(x) := {y ∈ Y : f(x, y) = f*(x)}. Then the standard BLO problem can be formally expressed as

(2)   min_{x∈X, y∈Y} F(x, y),   s.t.   y ∈ S(x),

where the UL objective F : R^n × R^m → R is also a continuous function and X ⊆ R^n is a nonempty and closed set. In fact, a feasible solution to the BLO in Eq. (2) is a vector (x, y) of UL and LL variables that satisfies all the constraints in Eq. (2), where the LL variable y is optimal for the LL subproblem in Eq. (1) given the UL variable x as a parameter. In Fig. 1, we provide a simple visual illustration of the BLO stated in Eq. (2).
The above BLO problem has a natural interpretation as a non-cooperative game between two players (i.e., a Stackelberg game [46]). Correspondingly, we may also refer to the UL and LL subproblems as the leader and the follower, respectively. The leader chooses its decision x first, and afterwards the follower observes x and responds with a decision y(x). Therefore, the follower's decision depends on the leader's decision. Likewise, the leader has to satisfy a constraint that depends on the follower's decision.
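The leader-anticipates-follower structure can be sketched numerically with a brute-force nested solve on a toy quadratic instance (the objectives below are constructed purely for illustration):

```python
import numpy as np

# Toy BLO (constructed for illustration):
#   UL:  min_x F(x, y) = (x - 2)**2 + y**2
#   LL:  y in argmin_y f(x, y) = 0.5*y**2 - x*y   ->   y*(x) = x
F = lambda x, y: (x - 2.0) ** 2 + y ** 2
f = lambda x, y: 0.5 * y ** 2 - x * y

xs = np.linspace(-3, 3, 601)   # UL (leader) grid
ys = np.linspace(-3, 3, 601)   # LL (follower) grid

def follower_best_response(x):
    # The follower observes x and returns its optimal reaction.
    return ys[np.argmin(f(x, ys))]

# The leader anticipates the follower's reaction when choosing x.
vals = [F(x, follower_best_response(x)) for x in xs]
x_star = xs[int(np.argmin(vals))]
y_star = follower_best_response(x_star)
print(x_star, y_star)  # both close to 1.0 (analytic solution x* = y* = 1)
```

Substituting the best response y*(x) = x into F gives (x − 2)^2 + x^2, which is minimized at x = 1; the grid search recovers this. Gradient-based BLO methods replace this exhaustive nesting with differentiable approximations of the best-response mapping.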
2.2 Formulating BLO from Different Viewpoints
It is worthwhile to notice that in the standard BLO formulation given in Eq. (2), there may exist multiple LL optimal solutions for some of the UL variables (see Fig. 1 (b) for an example). Therefore, it is necessary to define which solution out of the multiple LL solutions in S(x) should be considered. Indeed, most existing works in the learning and vision fields adopt the uniqueness or the optimistic viewpoint to specify the standard BLO in Eq. (2). Notice that we will also discuss the more challenging pessimistic viewpoint for BLO in Section 9.
Specifically, the most straightforward idea in existing works is to assume that S(x) is a singleton (this is also known as the Lower-Level Singleton (LLS) assumption [126]). In this case, we can simplify the original model as
(3)   min_{x∈X} F(x, y(x)),   where S(x) = {y(x)}.
This uniqueness-based version of BLO is well-defined and covers a variety of learning and vision tasks (e.g., [50, 140, 160, 139], just to name a few). Thus, in recent years, dozens of methods have been developed to address this nested optimization task in different application scenarios (see the following sections for more details).
The situation becomes more intricate if the LL subproblem is not uniquely solvable for each x. Essentially, if the follower can be motivated to select an optimal solution in S(x) that is also best for the leader (i.e., with respect to F), this yields the so-called optimistic (strong) formulation of BLO
(4)   min_{x∈X} min_{y∈S(x)} F(x, y).
The above optimistic viewpoint has received extensive attention in the BLO literature [41, 98, 103] and has recently also been investigated in the learning and vision fields [126].
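A one-line example (constructed here for illustration) shows why the optimistic choice matters. Writing S(x) for the LL solution set, take f(x, y) = x^2 y^2 with y ∈ [−1, 1] and F(x, y) = x^2 + y: for x ≠ 0 we get the singleton S(x) = {0}, yet S(0) = [−1, 1], so the optimistic problem in Eq. (4) evaluates to

```latex
\min_{x}\ \min_{y \in S(x)} F(x,y)
\;=\; \min\Big\{\, \inf_{x \ne 0} x^2 ,\;\; \min_{y \in [-1,1]} y \,\Big\}
\;=\; -1,
```

attained at (x, y) = (0, −1), whereas a method that tacitly assumes a unique LL solution and returns y = 0 would report the strictly worse value 0.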
3 Understanding and Modeling Learning and Vision Tasks by BLOs
In this section, we demonstrate that, even with different motivations and mechanisms, a variety of modern complex learning and vision tasks (e.g., hyperparameter optimization, multi-task and meta-learning, neural architecture search, adversarial learning, deep reinforcement learning, and so on) actually share close relationships from the BLO perspective. Moreover, we provide a uniform BLO expression to (re)formulate all these problems. Table II provides a summary of learning and vision applications that can be understood and modeled by BLO.
Task  Important work  Other work 

HO  [16] (ICML, 2013),  [39] (SIAM, 2014), [93] (EURO, 2020), [140] (ICML, 2015), [134] (AISTATS, 2020), 
[160] (ICML, 2016),  [67] (ICML, 2018), [178] (AISTATS, 2019), [114] (arXiv, 2019), [181] (arXiv, 2020),  
[127] (ICML, 2020),  [66] (arXiv, 2018), [65] (ICML, 2017), [140] (ICML, 2015), [66] (ICML, 2018),  
[139] (ICLR, 2019)  [3] (ICML, 2017), [9] (NIPS, 2020), [23] (arXiv, 2020)  
MFL  [64] (arXiv, 2017),  [146] (PMLR, 2017), [166] (CVPR, 2018), [71] (CVPR, 2018), [144] (ICLR, 2018), 
[109] (ICLR, 2017)  [31] (ICML, 2017), [61] (ICLR, 2020), [173] (ICLR, 2018), [229] (ICML, 2019)  
MIL  [112] (ICML, 2017),  [169] (ICLR, 2017), [6] (NIPS, 2016), [59] (ICML, 2017), [148] (arXiv, 2018), 
[168] (NIPS, 2019),  [7] (ICLR, 2019), [78] (CVPR, 2020), [205] (AAAI, 2020), [149] (arXiv, 2018),  
[17] (ICLR, 2019),  [225] (NIPS, 2019), [60] (ICML, 2019), [185] (ICLR, 2020), [49] (ICML, 2019),  
[157] (NIPS, 2019),  [184] (CVPR, 2020), [190] (AAAI, 2020), [14] (arXiv, 2019), [143] (arXiv, 2018),  
[106] (ICML, 2018)  [86] (ICASSP, 2020), [230] (arXiv, 2019)  
NAS  [117] (ICLR, 2019),  [206] (ICLR, 2019), [164] (CVPR, 2018), [113] (ICLR, 2019), [29] (ICCV, 2019), 
[204] (NIPS, 2018),  [150] (AISTATS, 2020), [56] (CVPR, 2020), [213] (AAAI, 2020), [87] (CVPR, 2020),  
[51] (TGRS, 2020),  [116] (CVPR, 2019), [30] (NIPS, 2019), [79] (SPL, 2020), [207] (ICCV, 2019),  
[91] (CVPR, 2020),  [161] (CIIP, 2019), [218] (CVPR, 2020), [224] (arXiv, 2019), [110] (SIGKDD, 2020),  
[209] (ICLR, 2019),  [26] (NIPS, 2019), [77] (CVPR, 2020), [83] (CVPR, 2020), [33] (arXiv, 2020),  
[76] (ECCV, 2020)  [210] (arXiv, 2020)  
AL  [191] (CVPR, 2020),  [163] (arXiv, 2016), [68] (CVPR, 2020), [92] (arXiv, 2018), [216] (AAAI, 2020), 
[111] (PR, 2019)  [94] (ICML, 2020), [130] (CVPR, 2020), [188] (arXiv, 2020), [52] (arXiv, 2018)  
DRL  [194] (CVPR, 2020),  [192] (NIPS, 2019), [222] (arXiv, 2019), [32] (CIRED, 2019), [212] (ICLR, 2019), 
[214] (TSG, 2019)  [69] (arXiv, 2019)  
Others  [101] (SIAM, 2013),  [58] (ICML, 2016), [162] (ICLR, 2019), [121] (IJCAI, 2020), [84] (arXiv, 2020), 
[151] (SSVM, 2015),  [186] (UAI, 2020), [120] (arXiv, 2021), [165] (arXiv, 2020), [125] (TIP, 2020),  
[122] (TIP, 2020),  [123] (arXiv, 2019), [124] (arXiv, 2020), [187] (TRO, 2020), [145] (WACV, 2020),  
[107] (TNNLS, 2020),  [19] (arXiv, 2020), [18] (NIPS, 2020), [25] (arXiv, 2020), [217] (arXiv, 2020)  
[226] (TIP, 2016)  [129] (CVPR, 2020), [38] (arXiv, 2019), [115] (ICLR, 2018) 
3.1 Hyperparameter Optimization
Hyperparameters are those parameters that are external to the model and cannot be learned from the training data alone. The problem of identifying the optimal set of hyperparameters is known as Hyperparameter Optimization (HO). Early methods for selecting hyperparameters typically considered regularized models and Support Vector Machines (SVMs) [15]. These problems are first formulated as a hierarchical problem and then transformed into a single-level optimization problem by replacing the LL problem with its optimality conditions [37]. However, because of the high computational cost, especially given a high-dimensional hyperparameter space, these methods cannot even guarantee a locally optimal solution [16]. More recently, gradient-based HO methods with deep neural networks have gained a lot of traction; they generally fall into two categories, depending on how the gradient with respect to the hyperparameters is computed: iterative differentiation (e.g., [50, 12, 39, 151, 140, 160, 152, 64, 75, 66, 114, 177, 67]) and implicit differentiation (e.g., [27, 176, 101, 20, 139, 62, 160, 133, 134]). The former is an iterative approach in which the best-response function is approximated after some steps of gradient descent on the loss function, while the latter uses the implicit function theorem to derive hypergradients. For instance, data hyper-cleaning [65, 177], a specific HO example, trains a linear classifier with the cross-entropy loss (with parameters y) on the training data set, and introduces hyperparameters x for training with regularization on the validation set. In general, because of its hierarchical nature, modeling the HO problem as a BLO paradigm is inherent and natural [15]. Specifically, the UL objective requires minimizing the validation loss with respect to the hyperparameters (e.g., weight decay), and the LL objective requires outputting a learning algorithm by minimizing the training loss with respect to the model parameters (e.g., weights and biases). We instantiate how to model the HO task from the perspective of BLO in Fig. 2. It can be seen that the full data set D is split into the training and validation data sets (i.e., D = D_tr ∪ D_val). In conclusion, as a nested optimization, most HO applications can be formulated as BLOs, where the UL subproblem involves the optimization of the hyperparameters x and the LL subproblem (w.r.t. the weight parameters y) aims to find the learning algorithm by minimizing the LL objective with the training loss for a given x.
3.2 Multi-task and Meta-Learning
The goal of meta-learning (a.k.a. learning to learn) [85] is to design models that can learn new skills or adapt to new environments rapidly with a few training examples. In other words, by learning to learn across data from many previous tasks, meta-learning algorithms can discover the structure among tasks to enable fast learning of new tasks (see Fig. 3 for a schematic diagram). Multi-task learning is a variant of meta-learning, which intends to jointly solve all the given tasks [189, 172].
In fact, one of the most well-known instances of meta-learning is few-shot classification (i.e., N-way K-shot), where each task is an N-way classification problem, and the aim is to learn the meta-parameter x such that each task can be solved with only K training samples per class. Specifically, the full meta-training data set D can be segmented into subsets, each linked to the j-th task. In essence, these methods generally fall into two categories, meta-feature learning and meta-initialization learning, as can be seen in Fig. 4, depending on the relationship between the meta-parameters and the network parameters. We show the meta-parameter x in blue and the network parameter y in green. In meta-feature learning there is no deeply intertwined and entangled relationship between the two, while in meta-initialization learning the meta-parameter directly serves as the network initialization parameter. In the following, we analyze these two perspectives in turn.
Fig. 4. Meta-parameters (blue blocks) and network parameters (green blocks). (a) shows meta-parameters for features shared across tasks and parameters of the logistic regression layer. (b) shows meta (initial) parameters shared across tasks and parameters of the task-specific layer.
3.2.1 Meta-Feature Learning
Meta-Feature Learning (MFL) aims to learn a shared meta-feature representation for all tasks. It is worth noting that relevant approaches in [223, 1] have recently shown that multi-task learning with hard parameter sharing and meta-feature representation are essentially similar. Furthermore, the optimization of the meta-learner with respect to the meta-parameters in the UL subproblem is similar to HO as discussed in the last subsection [64, 66, 67]. For the j-th task, these methods consider the cross-entropy function as the task-specific loss on the meta-training data set to define the LL objective, and also utilize the cross-entropy function to define the UL objective based on the validation data set.
As can be seen from subfigure (a) of Fig. 4, following the BLO framework, the network architecture in this category can be subdivided into two groups. The first consists of cross-task intermediate representation layers parameterized by x (illustrated by the blue block), outputting the meta features, and the second is the logistic regression layer parameterized by y (illustrated by the green block), serving as the classifier for the j-th task. Concretely, the feature layers are shared across all episodes, while the softmax regression layer is episode (task) specific. The network forward propagation passes from the feature extraction part to the softmax part.
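To make the parameter partition concrete, the following sketch (with synthetic regression tasks, a linear feature map, and closed-form ridge heads, all illustrative assumptions rather than any method from the cited works) evaluates the UL meta-objective for a given shared representation:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 8, 3, 30            # input dim, shared feature dim, samples/task
W = rng.normal(size=(d, k))   # meta-parameter x: shared feature layer

def solve_head(W, X, t, ridge=1e-3):
    # LL subproblem for one task: closed-form ridge head on the shared
    # features phi = X @ W (the ridge term makes the minimizer unique,
    # so the LLS assumption holds here).
    phi = X @ W
    A = phi.T @ phi + ridge * np.eye(k)
    return np.linalg.solve(A, phi.T @ t)

tasks = []
for _ in range(4):  # four tasks sharing the representation
    X = rng.normal(size=(n, d))
    t = X @ rng.normal(size=d)                      # task-specific targets
    tasks.append((X[:20], t[:20], X[20:], t[20:]))  # train/val split

def ul_loss(W):
    # UL objective: aggregate validation loss across tasks as a function
    # of the shared meta-parameter W.
    total = 0.0
    for X_tr, t_tr, X_val, t_val in tasks:
        y = solve_head(W, X_tr, t_tr)   # best-response head per task
        r = (X_val @ W) @ y - t_val
        total += r @ r
    return total / len(tasks)

print(ul_loss(W))  # meta-objective value for the current shared features
```

Meta-training would then descend on ul_loss with respect to W, differentiating through (or implicitly around) the per-task head solves.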
3.2.2 Meta-Initialization Learning
Meta-Initialization Learning (MIL) aims to learn a meta initialization for all tasks. MAML [59], known for its simplicity, estimates the initialization parameters purely by gradient-based search, using the cross-entropy and mean-squared error losses for supervised classification and regression tasks, respectively. Beyond the initial parameters, recent approaches have focused on learning other meta variables, such as updating strategies (e.g., the descent direction and learning rate [6, 169, 14]) and an extra preconditioning matrix (e.g., [61, 173, 106]). Implicit gradient methods are widely used in the context of few-shot meta-learning; a large variety of algorithms replace differentiation through the optimization of the base-learner with the calculation of an implicit meta gradient [168, 60, 225, 10, 17]. More recently, a proximal regularization term has been introduced into the original loss function [168]. As the Hessian-vector product calculation during training requires a large amount of computation, various Hessian-free algorithms have been proposed to alleviate the costly computation of second-order derivatives, including but not limited to [149, 90, 185, 7, 112, 31, 146]. Furthermore, a first-order approximation algorithm has been proposed to avoid the time-consuming calculation of second-order derivatives in [148]. More recently, a modularized optimization library that unifies several meta-learning algorithms into a common BLO framework has been proposed in [128] (the code for this library is available at https://github.com/dutmedialab/BOML).
As shown in subfigure (b) of Fig. 4, x (the blue block) corresponds to the initial parameters, and y (the green block) corresponds to the model parameters. Compared to MFL, there is no deeply intertwined and entangled relationship between the two variables, and y is only explicitly related to x in the initial state. We denote x as the network initialization parameter and y as the updated network parameter. As a bilevel coupled nested-loop strategy, the LL subproblem, based on the base-learner, is trained to solve a given task, while the UL subproblem, based on the meta-learner, aims to learn how to optimize the base-learner. Among the well-known approaches in this direction, much recent work (e.g., [105, 148]) takes the task-specific cross-entropy loss on the training data set as the LL objective, and likewise defines the UL objective by the cross-entropy loss on the validation data set. Moreover, meta-learning has also been successfully applied to a wide range of challenging problems, including image enhancement [184, 88], semantic segmentation [190], tracking [198, 78], and text generation [216], just to name a few.
Both MFL and MIL are essentially solution strategies in which one optimizer is built on top of another optimizer, which is exactly in line with the construction of the BLO scheme. More specifically, with the task-specific loss linked to the j-th task, the LL objective can be defined by aggregating the training losses of all tasks, and the UL objective is likewise given by aggregating the corresponding validation losses. To summarize, the UL meta-learner first calculates the gradient and updates the meta-parameter according to the feedback value, producing better meta knowledge. This meta knowledge is then fed into the base-learner (i.e., the LL subproblem) as part of its model for optimizing y, thus forming an optimization cycle.
3.3 Neural Architecture Search
Neural Architecture Search (NAS) seeks to automate the process of choosing an optimal neural network architecture [55]. Recently, there has been a great deal of interest in gradient-based differentiable NAS methods [117, 24, 142]. These methods involve three main concepts: the search space, the search strategy, and the performance estimation strategy. By designing an architecture search space, they generally use a certain search strategy to find an optimal network architecture. As shown in Fig. 5, such a process can be regarded as optimizing the operation and connection of each node.
The most well-known instance, named DARTS [117], relaxes the search space to be continuous and conducts the search in a differentiable way, so that gradient descent can be used to simultaneously search for architectures and learn weights. In DARTS, each operation corresponds to a coefficient. Denoting by α the architecture parameters and by (i, j) the connection between two nodes, the mixed operation obtained via the softmax function can be written as

\bar{o}^{(i,j)}(z) = Σ_{o∈O} ( exp(α_o^{(i,j)}) / Σ_{o'∈O} exp(α_{o'}^{(i,j)}) ) o(z),

where o and o' denote candidate operations and O is the set of all candidate operations. Then, an argmax over α is performed to obtain the final discrete architecture. However, as too many skip connections lead to a sharp deterioration in performance, a great number of improved approaches have emerged, e.g., ENAS [164], PC-DARTS [209], and P-DARTS [29], just to name a few. Among them, PC-DARTS samples the features extracted by the network along the channel dimension, and then concatenates the processed features with the remaining features. Moreover, a series of gradient-based differentiable NAS approaches combined with meta-learning have recently emerged (e.g., [199, 113, 97, 56]). At present, differentiable NAS based on BLO has achieved promising results in a large variety of tasks, such as image enhancement [79, 35], image classification [51], semantic segmentation [218, 228, 116], object detection [30, 207, 77, 91, 110], video processing [161], medical image analysis [228, 218], video classification [58], recommendation systems [33], graph networks [224, 69], and representation learning [69], etc.
In conclusion, given a proper search space, gradient-based differentiable NAS methods can derive the optimal architecture for different tasks. From this perspective, the UL objective, with respect to the architecture weights (e.g., block/cell), is parameterized by x, and the LL objective, with respect to the model weights, is parameterized by y. The searching process can be generally formulated as a BLO paradigm, where the UL objective is based on the validation data set D_val, and the LL objective is based on the training data set D_tr.
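The continuous relaxation used by DARTS can be sketched as follows (the three-operation search space is a toy assumption; real DARTS cells mix convolutions, pooling, skip connections, and the zero operation on every edge):

```python
import numpy as np

# Candidate operations O (a tiny illustrative search space)
ops = {
    "identity": lambda z: z,
    "zero":     lambda z: np.zeros_like(z),
    "relu":     lambda z: np.maximum(z, 0.0),
}

def mixed_op(z, alpha):
    # Continuous relaxation: a softmax over the architecture parameters
    # alpha weights each candidate operation (the DARTS mixed operation).
    w = np.exp(alpha - alpha.max())
    w = w / w.sum()
    return sum(wi * op(z) for wi, op in zip(w, ops.values()))

z = np.array([-1.0, 2.0])
alpha = np.array([3.0, 0.0, 0.0])  # strongly favours "identity"

out = mixed_op(z, alpha)                    # differentiable w.r.t. alpha
chosen = list(ops)[int(np.argmax(alpha))]   # discretization step: argmax
print(out, chosen)
```

During the search, alpha is updated on the validation loss (UL) while the operation weights are updated on the training loss (LL); the final argmax produces the discrete architecture.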
3.4 Adversarial Learning
Adversarial Learning (AL) is contemporarily deemed one of the most important learning tasks and has been applied in a large variety of areas, e.g., image generation [130, 92, 68], adversarial attacks [156], and face verification [111]. For example, the work in [130] recently introduced an adaptive BLO model for image generation, which guides the generator to modify its parameters in a complementary and mutually promoting way: the global optimization optimizes the whole image, while the local optimization only optimizes the low-quality areas. Moreover, by learning a parameterized optimizer with neural networks, a recent work adopts a new adversarial training method to study adversarial attacks in combination with meta-learning [92]. In essence, as the currently most influential model, the Generative Adversarial Network (GAN) is deemed a deep generative model [73, 131]. For instance, a recent method in [111] introduces a coupled adversarial network for makeup-invariant face verification, which contains two adversarial subnetworks on different levels: one on the pixel level for reconstructing facial images and the other on the feature level for preserving identity information. Similarly, targeting the pure Nash equilibrium of the generator and discriminator, the work in [191] exploits a fully differentiable search framework, formalized as a bilevel minimax optimization problem.
Fig. 6. The generator G is represented as a deterministic feed-forward neural network (red blocks), through which a fixed random noise z is passed to output G(z). The discriminator D is another neural network (green blocks) which maps the sampled real-world image u and G(z) to a binary classification probability.
All the above approaches formulate the unsupervised learning problem as a game between two opponents: a generator G which samples from a distribution, and a discriminator D which classifies samples as real or fake, as shown in Fig. 6. With the standard GAN value function V(G, D) = E_{u∼p_data}[log D(u)] + E_{z∼p_z}[log(1 − D(G(z)))], the goal of GAN training can be expressed as minimizing the duality gap

(5)   DG(G, D) = max_{D'} V(G, D') − min_{G'} V(G', D),

where the fixed random noise z drawn from p_z is input into the generator G and, together with the sampled real-world image u, is then authenticated by the discriminator D. Notice that E denotes the expectation, i.e., the average value of a function under a probability distribution.
In general, adversarial learning corresponds to a min-max BLO problem, where the UL discriminator D (with parameters x) targets learning a robust classifier, and the LL generator G (with parameters y) tries to generate adversarial samples. Specifically, the UL objective (maximized over x) and the LL objective (minimized over y) can be respectively formulated as

(6)   F(x, y) = E_{u∼p_data}[log D_x(u)] + E_{z∼p_z}[log(1 − D_x(G_y(z)))],

(7)   f(x, y) = E_{z∼p_z}[log(1 − D_x(G_y(z)))],

where the generator and the discriminator are parameterized by the variables y and x, respectively. In other words, the UL subproblem aims to reduce the duality gap, and the LL subproblem interactively optimizes the generator parameters y to obtain the optimal response.
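The duality gap can be illustrated on a bilinear two-player game, a common stand-in for the GAN value function (the payoff V(g, d) = g·d and the grid search below are illustrative assumptions, not a GAN implementation):

```python
import numpy as np

# Bilinear game: the minimizing player controls g, the maximizing
# player controls d; the unique pure Nash equilibrium is g = d = 0.
V = lambda g, d: g * d
grid = np.linspace(-1.0, 1.0, 201)  # strategy space for both players

def duality_gap(g, d):
    # DG(g, d) = max_{d'} V(g, d') - min_{g'} V(g', d) >= 0,
    # with equality exactly at a Nash equilibrium.
    return V(g, grid).max() - V(grid, d).min()

dg_eq = duality_gap(0.0, 0.0)    # zero at the equilibrium
dg_off = duality_gap(0.5, -0.3)  # strictly positive away from it
print(dg_eq, dg_off)
```

For this payoff the gap is |g| + |d|, so it vanishes only at the equilibrium, which is exactly the property that makes it a useful training criterion for GANs.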
3.5 Deep Reinforcement Learning
In Deep Reinforcement Learning (DRL), the agent aims to make optimal decisions by interacting with the environment and learning from experience, where the environment is modeled as a Markov Decision Process (MDP). Actor-Critic (AC) methods are a long-established class of techniques consisting of two subproblems that are intertwined with each other, and they have been widely studied in recent years [212, 214, 192, 222, 32, 194, 211]. With guaranteed convergence, [222] proposes a bilevel AC method to solve the multi-agent reinforcement learning problem of finding a Stackelberg equilibrium under Markov games; it allows agents to have different knowledge bases while their actions can still be executed simultaneously and in a distributed manner. Moreover, [32] proposes a multi-agent bilevel cooperative reinforcement learning algorithm to solve the bilevel stochastic decision-making problem. In nonconvex scenarios, the recent work in [211] proves that AC can find a globally optimal pair at a linear rate of convergence, providing a complete theoretical understanding of the BLO paradigm. Actually, both GANs and AC methods can be seen as bilevel or two-timescale optimization problems, in which one model is optimized with respect to the optimum of another model; in this sense, they are closely parallel classes of BLOs. In [163], it is shown that GANs can be viewed as a modified AC method with a blind actor in stateless MDPs. Indeed, AC methods simultaneously learn a state-action value-function Q^π(s, a) that predicts the expected discounted cumulative reward
(8) $Q^{\pi}(s,a)=\mathbb{E}\Big[\sum_{i=0}^{\infty}\gamma^{i}\,r(s_{i},a_{i})\ \Big|\ s_{0}=s,\ a_{0}=a,\ s_{i+1}\sim\mathcal{P}(\cdot\,|\,s_{i},a_{i}),\ a_{i}\sim\pi(\cdot\,|\,s_{i})\Big],$

where $\mathcal{P}$ and $r$ denote the dynamics of the environment and the reward function, respectively; $s$ and $a$ correspond to the state and action, respectively; $s_{i}$ and $a_{i}$ represent the state and action at the $i$-th step; $\gamma\in(0,1)$ is the discount factor; and $\mathbb{E}$ denotes the expectation, i.e., the average value of some function under a probability distribution. Moreover, the policy $\pi$ maximizes the expected discounted cumulative reward for that state-action value function, i.e.,
(9) $\max_{\pi}\ \mathbb{E}_{s_{0}\sim\rho,\ a_{0}\sim\pi(\cdot\,|\,s_{0})}\big[Q^{\pi}(s_{0},a_{0})\big],$

where $s_{0}$ and $a_{0}$ correspond to the initial state and initial action, respectively, $Q^{\pi}$ denotes the state-action value function, and $\rho$ represents the initial state distribution. We show the schematic diagram of AC learning in Fig. 7. Let $\omega$ denote the parameters of the state-action value function $Q_{\omega}$, and $\theta$ denote the parameters of the policy $\pi_{\theta}$. From the perspective of BLO, the UL objective and LL objective can be respectively formulated as below
(10) $\min_{\omega}\ \mathbb{D}\big(Q_{\omega},\,Q^{\pi_{\theta^{*}(\omega)}}\big),$

(11) $\theta^{*}(\omega)\in\arg\max_{\theta}\ \mathbb{E}_{s_{0}\sim\rho,\ a_{0}\sim\pi_{\theta}(\cdot\,|\,s_{0})}\big[Q_{\omega}(s_{0},a_{0})\big],$

where we denote by $\mathbb{D}(\cdot,\cdot)$ any divergence that is positive except when its two arguments are equal. From this perspective, the actor and critic correspond to the LL and UL variables, respectively. Under this framework, the update of the policy can be regarded as a stochastic gradient step for the LL objective.
3.6 Other Related Applications
The rapid development of deep learning has established its dominance in the area of image processing and analysis. In particular, besides the above-mentioned tasks, many other learning and vision tasks that can be (re)formulated as BLO problems have emerged, such as image enhancement [101, 28, 226, 155, 38], image registration [121, 138], image recognition [208], image compression [221] and other related works [186, 18, 162]. For example, the early work in [101] considers the problem of parameter learning for variational image denoising models that incorporate norm-based analysis priors. By formulating it as a BLO, the LL subproblem is given by the variational model, which consists of the data fidelity and regularization terms, while the UL subproblem is expressed by means of a loss function that penalizes errors between the solution of the LL subproblem and the ground-truth data. Furthermore, the work in [226] formulates a discriminative dictionary learning method for image recognition tasks as a BLO. From this perspective, the UL subproblem directly minimizes the classification error, and the LL subproblem uses the sparsity term and the Laplacian term to characterize the intrinsic data structure.
4 A Unified Algorithmic Framework for Gradient-based BLOs
In what follows, we develop a unified algorithmic framework to formulate the existing mainstream gradient-based BLO methods that have emerged in the learning and vision fields. We first state our value-function-based single-level reformulation for BLOs and then introduce a general nested gradient optimization framework on top of it.
4.1 Single-Level Reformulation
Our single-level reformulation is based on the Best-Response (BR) of the follower, which has been used for backward induction in the Stackelberg game [96, 45]. Specifically, given the leader $\mathbf{x}$, we define a unified BR mapping of the follower (denoted as $\psi(\mathbf{x})$) for the two categories of BLOs as follows:

(12) $\psi(\mathbf{x})=\begin{cases}\arg\min_{\mathbf{y}}\,f(\mathbf{x},\mathbf{y}), & \text{w/ the LLS assumption},\\ \arg\min_{\mathbf{y}\in\mathcal{S}(\mathbf{x})}\,F(\mathbf{x},\mathbf{y}), & \text{w/o the LLS assumption},\end{cases}$

where $\mathcal{S}(\mathbf{x})=\arg\min_{\mathbf{y}}f(\mathbf{x},\mathbf{y})$ denotes the LL solution set.
Based on Eq. (12), we have the following value-function-based reformulation (a single-level optimization model) for the BLOs stated in Eq. (2), as specified below:

(13) $\min_{\mathbf{x}}\ \varphi(\mathbf{x}),\quad\text{with}\ \varphi(\mathbf{x})=F(\mathbf{x},\psi(\mathbf{x})),$

where $\varphi(\mathbf{x})$ is able to uniformly represent the UL value function from both the uniqueness and the optimistic viewpoints.
The UL value function has played a key role throughout the history of BLO research, not only as a common ground for deriving the relevant algorithms, but also for pushing the field towards increasingly complex and challenging problems.
4.2 The Devil Lies in the BR Jacobian
Moving one step forward, the gradient of $\varphi(\mathbf{x})$ w.r.t. the UL variable reads as (to simplify our presentation, we do not distinguish between the operations of derivatives and partial derivatives)

(14) $\nabla_{\mathbf{x}}\varphi(\mathbf{x})=\underbrace{\nabla_{\mathbf{x}}F(\mathbf{x},\psi(\mathbf{x}))}_{\text{direct grad.}}+\underbrace{\big(\nabla_{\mathbf{x}}\psi(\mathbf{x})\big)^{\top}\nabla_{\mathbf{y}}F(\mathbf{x},\psi(\mathbf{x}))}_{\text{indirect grad.}},$
where we use “grad.” as the abbreviation for “gradient” and denote the transpose operation as “$\top$”. In fact, by simple computation, the direct gradient is easy to obtain. However, the indirect gradient is intractable, because we must compute the rate of change of the optimal LL solution with respect to the UL variable, i.e., the BR Jacobian $\nabla_{\mathbf{x}}\psi(\mathbf{x})$ (in the following algorithms, we will also call its approximation the practical BR Jacobian w.r.t. $\mathbf{x}$). The computation of the indirect gradient naturally motivates formulating $\psi(\mathbf{x})$ and hence $\nabla_{\mathbf{x}}\psi(\mathbf{x})$. For this purpose, a series of techniques have recently been developed from either explicit or implicit perspectives, which obtain their approximations by automatic differentiation through a dynamical system and by implicit differentiation theory, respectively. In Alg. 1, we present a general scheme to unify these different categories of gradient-based BLOs. It can be seen that we first compute the practical BR Jacobian by an inner algorithm and then perform a standard gradient-descent scheme to update $\mathbf{x}$. Once the UL iterates converge, we can finally solve Eq. (12) to obtain the optimal LL solution. Therefore, the inner algorithm for computing the practical BR Jacobian plays the key role in this general algorithmic framework. Moreover, we emphasize that it is also the most challenging component of the existing gradient-based BLOs.
In Fig. 8, we further summarize mainstream gradient-based BLOs under two different assumptions (i.e., w/ LLS and w/o LLS). In the LLS scenario, from the BR-based perspective, existing gradient methods can be categorized into two groups: Explicit Gradient for Best-Response (EGBR, stated in Section 5) and Implicit Gradient for Best-Response (IGBR, stated in Section 6). As for EGBR, there are mainly three types of methods, namely recurrence-based EGBR (e.g., [65, 140, 66, 177, 117]), initialization-based EGBR (e.g., [148, 149]) and proxy-based EGBR (e.g., [109, 105, 157, 61]), which differ from each other in the way of formulating the BR mapping. For IGBR, existing works consider two groups of techniques (e.g., linear systems [160, 168] and Neumann series [134]) to alleviate the computational complexity of the BR Jacobian. It can be seen that the validity of these formulations depends on the uniqueness of the LL solution set. Furthermore, the EGBR category of methods requires first-order differentiability of the LL objective, while the IGBR category requires that the LL objective be twice differentiable with an invertible Hessian. When solving BLOs without the LLS assumption, it has been demonstrated in [127] that we need to first construct the BR mapping based on both the UL and LL subproblems, and then solve two optimization subproblems, namely the single-level optimization subproblem (w.r.t. $\mathbf{x}$) and the simple bilevel subproblem (w.r.t. $\mathbf{y}$); it is known that simple bilevel optimization is just a specific BLO problem with only one variable [127, 53]. Here the simple bilevel subproblem can also be solved by either EGBR or IGBR.
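To make the general nested scheme concrete, the following is a minimal Python sketch (on an assumed toy quadratic problem, not from the paper): an interchangeable inner routine returns the approximate LL solution together with the practical BR Jacobian, and the UL variable is then updated with the resulting hypergradient, combining the direct and indirect terms of Eq. (14).

```python
def bilevel_gradient_descent(x0, ul_grad_x, ul_grad_y, inner_solver,
                             step=0.1, T=100):
    """Generic nested scheme: inner_solver(x) returns (y, dy_dx), i.e. the
    approximate LL solution and the practical BR Jacobian; the UL variable is
    updated with the full hypergradient (direct + indirect gradient)."""
    x = x0
    for _ in range(T):
        y, dy_dx = inner_solver(x)                      # any EGBR/IGBR routine
        g = ul_grad_x(x, y) + dy_dx * ul_grad_y(x, y)   # Eq. (14), scalar case
        x = x - step * g
    return x

# Assumed toy instance: LL solution y*(x) = x exactly, so dy/dx = 1; the UL
# objective F = 0.5*(x-1)^2 + 0.5*(y-1)^2 then has hypergradient 2*(x-1),
# which is minimized at x = 1.
x_opt = bilevel_gradient_descent(
    x0=0.0,
    ul_grad_x=lambda x, y: x - 1.0,
    ul_grad_y=lambda x, y: y - 1.0,
    inner_solver=lambda x: (x, 1.0),
)
```

Here the inner solver is a closed-form stub; in the methods surveyed below it is replaced by an iterative (EGBR) or implicit (IGBR) routine.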
5 Explicit Gradient for Best-Response
In this section, we delve into the EGBR category of methods, which can be understood as performing automatic differentiation through a dynamical system [159, 200, 136]. Specifically, given an initialization $\mathbf{y}_{0}$ at $k=0$, the iteration process of EGBR can be written as

(15) $\mathbf{y}_{k}=\Phi_{k}(\mathbf{y}_{k-1};\mathbf{x}),\quad k=1,\dots,K,$

where $\Phi_{k}$ defines the operation performed by the $k$-th stage and $K$ is the number of iterations. For example, we can formulate $\Phi_{k}$ based on the gradient descent rule, i.e.,

(16) $\Phi_{k}(\mathbf{y};\mathbf{x})=\mathbf{y}-s_{k}\,d_{k}(\mathbf{x},\mathbf{y}),$

where $d_{k}$ is the descent mapping of $f$ at the $k$-th stage (e.g., $d_{k}(\mathbf{x},\mathbf{y})=\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})$) and $s_{k}$ denotes the corresponding step size. Then we can calculate $\nabla_{\mathbf{x}}\varphi(\mathbf{x})$ by substituting $\mathbf{y}_{K}(\mathbf{x})$ approximately for $\psi(\mathbf{x})$, and the full dynamical system can be defined as $\mathbf{y}_{K}(\mathbf{x})=(\Phi_{K}\circ\Phi_{K-1}\circ\cdots\circ\Phi_{1})(\mathbf{y}_{0};\mathbf{x})$, where $\circ$ represents the composition of the entire iteration. That is, we actually consider the following optimization model

(17) $\min_{\mathbf{x}}\ F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x})),$
and need to calculate $\nabla_{\mathbf{x}}F(\mathbf{x},\mathbf{y}_{K}(\mathbf{x}))$ (instead of Eq. (14)) in the practical optimization scenario. Since $\mathbf{y}_{K}(\mathbf{x})$ actually yields an explicit gradient for the best response of the follower, we call this category of gradient-based BLOs EGBR approaches hereafter. Starting from Eq. (15), it is obvious that the LL iterates are coupled with the variable $\mathbf{x}$ throughout the iteration. This coupling relationship has a direct impact on the optimization of the UL variable in Eq. (14). In fact, existing EGBR algorithms can be summarized from three perspectives. The first is that, if $\mathbf{x}$ acts on $\mathbf{y}_{k}$ during the whole iteration process, the subsequent optimization of $\mathbf{x}$ is carried out recursively. The second is that, when $\mathbf{x}$ only acts in the initial step, the subsequent optimization of $\mathbf{x}$ is simplified. The third is to replace the whole iterative process with a hypernetwork, so as to efficiently approximate the BR mapping. Accordingly, in terms of the coupling dependence of the two variables and the solution procedures, we divide these methods into three categories, namely recurrence-based EGBR (stated in Section 5.1), initialization-based EGBR (stated in Section 5.2) and proxy-based EGBR (stated in Section 5.3).
5.1 Recurrence-based EGBR
It can be seen from Eq. (15) that all the LL iterative variables depend on $\mathbf{x}$ (i.e., $\mathbf{y}_{k}=\mathbf{y}_{k}(\mathbf{x})$), and $\mathbf{x}$ acts as a recurrent variable of the dynamical system. One of the most well-known approaches for calculating $\nabla_{\mathbf{x}}\mathbf{y}_{K}(\mathbf{x})$ (with the above recurrent structure) is Automatic Differentiation (AD) [12, 13], which is also called algorithmic differentiation or simply “AutoDiff”. It has traditionally been applied to imperative programs in two diametrically opposite ways of computing gradients for recurrent neural networks, one of which corresponds to backpropagation through time in a reverse-mode way [158, 72], and the other to real-time recurrent learning in a forward-mode way [95, 170]. Quite a number of methods, all closely related to this subject, have been proposed since then [140, 65, 66, 177]. Here we review recurrence-based BR methods from the AD perspective, covering forward-mode AD, reverse-mode AD, and the truncated and one-stage simplifications.
Forward-mode AD (FAD): To compute $\nabla_{\mathbf{x}}\mathbf{y}_{K}(\mathbf{x})$, FAD (e.g., [65]) appeals to the chain rule for the derivative of the dynamical system. Specifically, recalling that $\mathbf{y}_{k}=\Phi_{k}(\mathbf{y}_{k-1};\mathbf{x})$, the operation $\Phi_{k}$ indeed depends on $\mathbf{x}$ both directly through its expression and indirectly through $\mathbf{y}_{k-1}$. Hence, by using the chain rule, the formulation is given as

(18) $\nabla_{\mathbf{x}}\mathbf{y}_{k}=\frac{\partial\Phi_{k}(\mathbf{y}_{k-1};\mathbf{x})}{\partial\mathbf{y}_{k-1}}\,\nabla_{\mathbf{x}}\mathbf{y}_{k-1}+\frac{\partial\Phi_{k}(\mathbf{y}_{k-1};\mathbf{x})}{\partial\mathbf{x}}.$
Denote $Z_{k}=\nabla_{\mathbf{x}}\mathbf{y}_{k}$, $A_{k}=\partial\Phi_{k}/\partial\mathbf{y}_{k-1}$ and $B_{k}=\partial\Phi_{k}/\partial\mathbf{x}$ for $k=1,\dots,K$, with $Z_{0}=\mathbf{0}$; we can then rewrite Eq. (18) as the recursion $Z_{k}=A_{k}Z_{k-1}+B_{k}$. In this way, we have the following equation

(19) $\nabla_{\mathbf{x}}\mathbf{y}_{K}=Z_{K}=\sum_{k=1}^{K}\Big(\prod_{j=k+1}^{K}A_{j}\Big)B_{k}.$
Based on the above derivation, it is apparent that $\nabla_{\mathbf{x}}\mathbf{y}_{K}$ can be computed by an iterative algorithm, summarized in Alg. 2. FAD allows the program to update parameters after each step, which may significantly speed up the dynamic iteration, and it takes up fewer memory resources when the number of hyperparameters is much smaller than the number of parameters. Conversely, it can be time-prohibitive when there are many hyperparameters.
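As an illustration of the FAD recursion, the following is a minimal scalar sketch on an assumed toy problem (not from the paper) with LL objective $f(x,y)=0.5y^{2}-xy$ (so $y^{*}(x)=x$) and UL objective $F(x,y)=0.5(x-1)^{2}+0.5(y-1)^{2}$, for which the exact hypergradient is $2(x-1)$; the tangent $Z_{k}$ is propagated forward alongside the LL iterates.

```python
def fad_hypergradient(x, y0=0.0, step=0.5, K=50):
    """Forward-mode hypergradient on the assumed toy problem
    f(x,y) = 0.5*y^2 - x*y (so y*(x) = x), F = 0.5*(x-1)^2 + 0.5*(y-1)^2."""
    y, Z = y0, 0.0               # Z_k carries dy_k/dx, with Z_0 = 0
    A, B = 1.0 - step, step      # A_k = dPhi/dy, B_k = dPhi/dx (constant here)
    for _ in range(K):
        Z = A * Z + B            # tangent recursion Z_k = A_k Z_{k-1} + B_k
        y = y - step * (y - x)   # LL gradient step, since grad_y f = y - x
    # direct gradient + Z^T * grad_y F, cf. Eq. (14)
    return (x - 1.0) + Z * (y - 1.0)
```

For this contraction, $Z_{K}$ converges to the true BR Jacobian $\mathrm{d}y^{*}/\mathrm{d}x=1$, so the returned value approaches the exact hypergradient $2(x-1)$ as $K$ grows.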
Reverse-mode AD (RAD): RAD is a generalization of the backpropagation algorithm and is based on a Lagrangian formulation associated with the parameter optimization dynamics [82]. By replacing $\psi(\mathbf{x})$ by $\mathbf{y}_{K}(\mathbf{x})$ and incorporating Eq. (19) into Eq. (14), recent works (e.g., [140, 66, 65]) provide that

(20) $\nabla_{\mathbf{x}}\varphi(\mathbf{x})=\nabla_{\mathbf{x}}F(\mathbf{x},\mathbf{y}_{K})+\sum_{k=1}^{K}B_{k}^{\top}\Big(\prod_{j=k+1}^{K}A_{j}\Big)^{\top}\nabla_{\mathbf{y}}F(\mathbf{x},\mathbf{y}_{K}).$
Rather than calculating $\nabla_{\mathbf{x}}\mathbf{y}_{K}$ by forward propagation as in FAD (i.e., Alg. 2), the computation of Eq. (20) can also be implemented by backpropagation. That is, we first define $\alpha_{K}=\nabla_{\mathbf{y}}F(\mathbf{x},\mathbf{y}_{K})$ and $g_{K}=\nabla_{\mathbf{x}}F(\mathbf{x},\mathbf{y}_{K})$. Then we update $g_{k-1}=g_{k}+B_{k}^{\top}\alpha_{k}$ and $\alpha_{k-1}=A_{k}^{\top}\alpha_{k}$, with $k=K,\dots,1$. Finally, we have that $\nabla_{\mathbf{x}}\varphi(\mathbf{x})=g_{0}$. Indeed, the above RAD calculation is structurally identical to backpropagation through time [65]. Moreover, we can also derive it following the classical Lagrangian approach. That is, we reformulate Eq. (17) as the following constrained model

(21) $\min_{\mathbf{x}}\ F(\mathbf{x},\mathbf{y}_{K}),\quad\mathrm{s.t.}\ \mathbf{y}_{k}=\Phi_{k}(\mathbf{y}_{k-1};\mathbf{x}),\ k=1,\dots,K.$

The corresponding Lagrangian function can be written as

(22) $\mathcal{L}=F(\mathbf{x},\mathbf{y}_{K})+\sum_{k=1}^{K}\lambda_{k}^{\top}\big(\Phi_{k}(\mathbf{y}_{k-1};\mathbf{x})-\mathbf{y}_{k}\big),$

where $\lambda_{k}$ denotes the Lagrange multiplier associated with the $k$-th stage of the dynamical system. The optimality conditions of Eq. (21) are obtained by setting all derivatives of $\mathcal{L}$ to zero. Then, by some simple algebra, we have $\lambda_{K}=\nabla_{\mathbf{y}}F(\mathbf{x},\mathbf{y}_{K})$ and $\lambda_{k-1}=A_{k}^{\top}\lambda_{k}$, which coincides with the backpropagation recursion. Overall, we present the RAD algorithm in Alg. 3.
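A minimal scalar sketch of the reverse-mode computation, on the same assumed toy problem as before ($f(x,y)=0.5y^{2}-xy$, $F(x,y)=0.5(x-1)^{2}+0.5(y-1)^{2}$, exact hypergradient $2(x-1)$): the forward pass stores the trajectory, and the backward pass runs the adjoint recursion on the multiplier $\alpha$.

```python
def rad_hypergradient(x, y0=0.0, step=0.5, K=50):
    """Reverse-mode hypergradient on the assumed toy problem
    f(x,y) = 0.5*y^2 - x*y, F = 0.5*(x-1)^2 + 0.5*(y-1)^2."""
    ys = [y0]
    for _ in range(K):                           # forward pass: store trajectory
        ys.append(ys[-1] - step * (ys[-1] - x))  # grad_y f = y - x
    g = x - 1.0                                  # g_K = grad_x F(x, y_K)
    alpha = ys[-1] - 1.0                         # alpha_K = grad_y F(x, y_K)
    A, B = 1.0 - step, step                      # stage Jacobians (constant here)
    for _ in range(K):                           # backward pass: adjoint recursion
        g = g + B * alpha                        # g_{k-1} = g_k + B_k^T alpha_k
        alpha = A * alpha                        # alpha_{k-1} = A_k^T alpha_k
    return g
```

The result matches FAD, but the trajectory must be stored, which is exactly the memory cost that the truncated and one-stage variants below aim to avoid.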
Truncated RAD: The above two exact calculation methods are tedious and time-consuming in many practical applications due to full backpropagation training. As aforementioned, owing to the complicated long-term dependencies of the UL subproblem on the LL iterates, calculating Eq. (20) in RAD is a challenging task. This difficulty is further aggravated when both $\mathbf{x}$ and $\mathbf{y}$ are high-dimensional vectors. More recently, the truncation idea has been revisited to address this issue and shows competitive performance with significantly less computation time and memory [137, 11, 178]. Specifically, by ignoring long-term dependencies and approximating Eq. (20) with partial sums (i.e., storing only the last $M$ iterations), we have

(23) $\nabla_{\mathbf{x}}\varphi(\mathbf{x})\approx\nabla_{\mathbf{x}}F(\mathbf{x},\mathbf{y}_{K})+\sum_{k=K-M+1}^{K}B_{k}^{\top}\Big(\prod_{j=k+1}^{K}A_{j}\Big)^{\top}\nabla_{\mathbf{y}}F(\mathbf{x},\mathbf{y}_{K}),$

where $1\le M\le K$. It can be seen that ignoring long-term dependencies can greatly reduce the time and space complexity of computing the approximate gradients. The work in [177] has investigated the theoretical properties of the above truncated RAD scheme. Moreover, the work in [178] has confirmed that using few-step backpropagation often performs comparably to optimization with the exact gradient, while requiring far less memory and about half the computation time.
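Truncation changes only the backward pass: the adjoint recursion is run for the last $M$ stages instead of all $K$. A sketch on the same assumed toy problem as above ($f(x,y)=0.5y^{2}-xy$, $F(x,y)=0.5(x-1)^{2}+0.5(y-1)^{2}$):

```python
def truncated_rad(x, y0=0.0, step=0.5, K=50, M=10):
    """Truncated reverse-mode hypergradient: backpropagate through only the
    last M of the K inner stages (assumed toy problem as before)."""
    y = y0
    for _ in range(K):              # forward pass; only the tail must be stored
        y = y - step * (y - x)      # grad_y f = y - x
    g = x - 1.0                     # direct gradient grad_x F
    alpha = y - 1.0                 # grad_y F at (x, y_K)
    A, B = 1.0 - step, step         # stage Jacobians (constant here)
    for _ in range(M):              # truncated adjoint recursion, cf. Eq. (23)
        g = g + B * alpha
        alpha = A * alpha
    return g
```

Because the stage Jacobian $A=1-s$ is a contraction here, the discarded terms decay geometrically, so even $M=10$ is close to the full-backpropagation value; $M=K$ recovers plain RAD.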
One-stage RAD: Limited and expensive memory is often a bottleneck in modern massive-scale deep learning applications; however, the multi-step iteration of the inner program causes a lot of memory consumption [59], so a method that simplifies the iteration steps would be useful. Inspired by BLO, a variety of simplified and elegant techniques have been adopted to circumvent this issue (e.g., [117]). The work in [117] proposed another simplification of RAD, which considers a fixed initialization $\mathbf{y}_{0}$ and only performs a one-step iteration of Eq. (15) to remove the recurrent structure from the gradient computation in Eq. (20), i.e.,

(24) $\mathbf{y}_{1}=\Phi_{1}(\mathbf{y}_{0};\mathbf{x}).$

By formulating the dynamical system as in Eq. (16), we can then write $\nabla_{\mathbf{x}}\varphi(\mathbf{x})$ as

(25) $\nabla_{\mathbf{x}}\varphi(\mathbf{x})=\nabla_{\mathbf{x}}F(\mathbf{x},\mathbf{y}_{1})-s\,\nabla_{\mathbf{x},\mathbf{y}}^{2}f(\mathbf{x},\mathbf{y}_{0})\,\nabla_{\mathbf{y}}F(\mathbf{x},\mathbf{y}_{1}).$

Since calculating the Hessian in Eq. (25) is still time-consuming, to further simplify the calculation we can adopt a finite-difference approximation [117] (e.g., the central difference approximation) to avoid forming the Hessian matrix. The specific derivation can be formalized as follows:

(26) $\nabla_{\mathbf{x},\mathbf{y}}^{2}f(\mathbf{x},\mathbf{y}_{0})\,\nabla_{\mathbf{y}}F(\mathbf{x},\mathbf{y}_{1})\approx\frac{\nabla_{\mathbf{x}}f(\mathbf{x},\mathbf{y}^{+})-\nabla_{\mathbf{x}}f(\mathbf{x},\mathbf{y}^{-})}{2\epsilon},$

in which $\mathbf{y}^{\pm}=\mathbf{y}_{0}\pm\epsilon\,\nabla_{\mathbf{y}}F(\mathbf{x},\mathbf{y}_{1})$. Note that $\epsilon$ is set to be a small scalar equal to the learning rate [209].
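A scalar sketch of the one-stage scheme with the central-difference approximation of the mixed Hessian-vector product, again on the assumed toy problem $f(x,y)=0.5y^{2}-xy$, $F(x,y)=0.5(x-1)^{2}+0.5(y-1)^{2}$ (for which the exact one-step hypergradient at $y_{0}=0$, $x=0$ equals $-(1+s)$):

```python
def one_stage_hypergradient(x, y0, step=0.5, eps=1e-4):
    """One-step inner update plus a central-difference Hessian-vector product
    (assumed toy problem: f = 0.5*y^2 - x*y, F = 0.5*(x-1)^2 + 0.5*(y-1)^2)."""
    grad_f_y = lambda x_, y_: y_ - x_
    grad_f_x = lambda x_, y_: -y_
    y1 = y0 - step * grad_f_y(x, y0)     # single inner step
    v = y1 - 1.0                         # grad_y F at (x, y1)
    # central difference: grad^2_{xy} f * v ~ (grad_x f(y+) - grad_x f(y-)) / 2eps
    hvp = (grad_f_x(x, y0 + eps * v) - grad_f_x(x, y0 - eps * v)) / (2.0 * eps)
    return (x - 1.0) - step * hvp        # direct gradient minus s * mixed term
```

For this quadratic $f$ the central difference is exact up to rounding, so the finite-difference value coincides with the analytic one-step hypergradient.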
5.2 Initialization-based EGBR
The research community has started moving towards the challenging goal of building general-purpose optimization systems that learn better initial parameters. Setting aside the recurrent structure, we now consider the special settings needed to analyze a family of algorithms for learning a parameter initialization, named initialization-based EGBR methods. In this line of work, MAML [59] is the most representative and important contribution. By making more practical assumptions about the coupling dependence of the two variables, these methods no longer use the full dynamical system to explicitly and accurately describe the dependency between $\mathbf{x}$ and $\mathbf{y}$ as in Eq. (21), but adopt a further simplified paradigm.
Specifically, by treating the iterative dynamical system such that only the first step is explicitly related to $\mathbf{x}$, this process can be written as

(27) $\mathbf{y}_{0}=\mathbf{x},\quad\mathbf{y}_{k}=\Phi_{k}(\mathbf{y}_{k-1}),\ k=1,\dots,K,$

where $\mathbf{x}$ represents the network initialization parameters, and $\mathbf{y}_{k}$ represents the network parameters after performing some sort of update. Given the initial condition $\mathbf{y}_{0}=\mathbf{x}$, we then obtain the following simplified formula

(28) $\mathbf{y}_{k}=\mathbf{y}_{k-1}-s\,d_{k}(\mathbf{y}_{k-1}),$

where $d_{k}$ is the descent mapping of $f$ at the $k$-th stage (e.g., $d_{k}(\mathbf{y})=\nabla_{\mathbf{y}}f(\mathbf{y})$). Finally, we have the Jacobian matrix as follows

(29) $\nabla_{\mathbf{x}}\mathbf{y}_{K}=\prod_{k=1}^{K}\big(I-s\,\nabla_{\mathbf{y}}^{2}f(\mathbf{y}_{k-1})\big).$
We then have to calculate the Hessian matrix term $\nabla_{\mathbf{y}}^{2}f$, which is time-consuming in real computation scenarios. To reduce the computational load, we introduce below two remarkably simple families of algorithms obtained via a series of approximate transformations. Among the various schemes to simplify initialization-based EGBR approaches, first-order approximation (e.g., [148, 149]) and layerwise transformation (e.g., [109, 105, 157, 61]) are among the most popular.
First-order Approximation: For example, the most representative works (i.e., FOMAML [148] and Reptile [149]) adopt a first-order approximation, a way to alleviate the burden of computing the Hessian term while not sacrificing much performance. Specifically, this approximation ignores the second-derivative term by removing the Hessian matrix $\nabla_{\mathbf{y}}^{2}f$, i.e., replacing each factor in Eq. (29) by the identity, which yields

(30) $\nabla_{\mathbf{x}}\mathbf{y}_{K}\approx I,\quad\text{so that}\quad\nabla_{\mathbf{x}}\varphi(\mathbf{x})\approx\nabla_{\mathbf{y}}F(\mathbf{x},\mathbf{y}_{K}).$

In addition, there is another first-order extension that simplifies Eq. (29) through a difference approximation [149]. This method also avoids the Hessian term, but takes a softer route: it approximates the meta-gradient by the scaled difference between the initialization and the adapted parameters, $(\mathbf{x}-\mathbf{y}_{K})/s$, in which $s$ is the step size used in the gradient descent operation. Unlike [148], this method uses a linear combination of all the inner steps rather than just the final step. Overall, the above algorithms can significantly reduce the computational cost while keeping roughly equivalent performance.
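To make the approximation tangible, the following sketch contrasts the exact meta-gradient (where the product in Eq. (29) has a closed form) with the first-order version on an assumed 1-D quadratic task; the task target `a`, outer target `b`, and losses are illustrative inventions, not from the papers cited above.

```python
def inner_adapt(x, a, step=0.1, K=5):
    """K inner gradient steps on the task loss f(y) = 0.5*(y - a)^2, init y0 = x."""
    y = x
    for _ in range(K):
        y = y - step * (y - a)
    return y

def exact_maml_grad(x, a, b, step=0.1, K=5):
    # Each factor of Eq. (29) is (1 - step) here, so dy_K/dx = (1 - step)^K.
    y = inner_adapt(x, a, step, K)
    return (1.0 - step) ** K * (y - b)   # outer loss F = 0.5*(y_K - b)^2

def fomaml_grad(x, a, b, step=0.1, K=5):
    # First-order approximation: the Jacobian dy_K/dx is replaced by 1.
    y = inner_adapt(x, a, step, K)
    return y - b
```

The two gradients point in the same direction but differ by the factor $(1-s)^{K}$, which is exactly the Hessian-dependent term that the first-order methods discard.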
Layerwise Transformation: There is also a series of learning-based BLOs related to layerwise transformation, i.e., Meta-SGD [109], T-Net [105], Meta-Curvature [157] and WarpGrad [61]. In addition to the initial parameters, this type of work focuses on learning some additional parameters (or transformations) at each layer of the network. Starting from Eq. (28), they can be uniformly formulated as follows

(31) $\mathbf{y}_{k}=\mathbf{y}_{k-1}-\mathbf{s}\odot\mathbf{T}\,d_{k}(\mathbf{y}_{k-1}),$

where $\mathbf{T}$ defines the matrix transformation learned at each layer and $\mathbf{s}$ is a vector (e.g., of learning rates). For example, Meta-SGD [109] learns a vector of learning rates, which corresponds to $\mathbf{T}=I$. Moreover, T-Net [105] learns block-diagonal preconditioning linear projections. Similarly, an additional block-diagonal preconditioning transformation is also performed by Meta-Curvature [157]. The recent study WarpGrad [61] preconditions gradients from a geometrical point of view: it replaces the linear projection with a non-linear preconditioning operation, referred to as a warp layer.
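A minimal sketch of the Meta-SGD-style case of Eq. (31) (with $\mathbf{T}=I$), on an assumed 2-D toy task: the per-parameter step vector $\mathbf{s}$ is itself meta-trained by gradient descent on the outer loss; the task target and losses are illustrative assumptions.

```python
import numpy as np

a = np.array([1.0, -2.0])        # assumed toy task target
x0 = np.zeros(2)                 # fixed initialization

def adapt(x, s):
    # One inner step with an elementwise (learned) step vector s, i.e.
    # y = x - s * grad_f(x) for the inner loss f(y) = 0.5*||y - a||^2.
    return x - s * (x - a)

s = np.array([0.3, 0.3])
for _ in range(300):             # meta-train s on the outer loss 0.5*||y - a||^2
    y = adapt(x0, s)
    dF_dy = y - a
    dy_ds = -(x0 - a)            # elementwise Jacobian of y w.r.t. s
    s = s - 0.2 * dF_dy * dy_ds  # exact meta-gradient (no Hessian needed here,
                                 # since only one inner step from a fixed x0)
# s converges to 1 per coordinate: a single adapted step then lands exactly on a
```

The learned step vector ends up larger where the target is farther away, which is the kind of per-parameter preconditioning that motivates this family of methods.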
5.3 Proxy-based EGBR
In fact, the main difficulty in solving high-dimensional BLOs is the approximation of the BR mapping. To alleviate this problem, the proxy-based EGBR methods (e.g., [139, 9, 133]) have recently been proposed, utilizing a differentiable hypernetwork (a neural network with parameters $\phi$) to approximate the BR mapping $\psi(\mathbf{x})$ (or the BR Jacobian $\nabla_{\mathbf{x}}\psi(\mathbf{x})$). These methods can be regarded as training the hypernetwork to output optimal solutions for the LL subproblem. Instead of updating $\mathbf{y}$ by iterating the LL dynamics, such methods train a neural network $h_{\phi}$ that takes the variable $\mathbf{x}$ as input and outputs a set of approximately optimal weights as the BR mapping. Specifically, this can be regarded as a data-driven network model that approximates the BR mapping directly with a parametric function

(32) $\psi(\mathbf{x})\approx h_{\phi}(\mathbf{x}).$

Then, $\phi$ and $\mathbf{x}$ are jointly optimized through the following two steps: first updating $\phi$ so that $h_{\phi}(\mathbf{x})\approx\psi(\mathbf{x})$ in a neighborhood around the current UL variable $\mathbf{x}$, and second updating $\mathbf{x}$ by using $h_{\phi}(\mathbf{x})$ as a proxy for $\psi(\mathbf{x})$.
In principle, the proxy network can learn a continuous BR mapping and handle discrete UL variables. Unlike the global approximation (e.g., Self-Tuning Networks, STN [133]), the recent work in [139] locally approximates the BR mapping in a neighborhood around the current UL variable and updates $\phi$ by minimizing the UL objective under perturbation noise added to $\mathbf{x}$, drawn from a factorized Gaussian noise distribution with a fixed scale parameter. In particular, its BR function can be represented by a linear network with Jacobian-norm regularization. To compute the BR Jacobian with a small memory cost, the method in [139] models the BR mapping of each row in a layer's weight matrix as a rank-one affine transformation of $\mathbf{x}$. In comparison to other hypernetworks (e.g., [133]), this approach replaces existing modules in deep learning libraries with hyper counterparts which accept an additional vector of UL variables as input and adapt online; thus it needs less memory consumption to meet the performance requirements.
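The first of the two joint steps can be sketched with the smallest possible hypernetwork: a linear model $h_{\phi}(x)=wx+c$ (an assumption for illustration, far simpler than the architectures used by the cited methods) is trained to output the LL minimizer of an assumed quadratic LL objective by descending the LL loss through the hypernetwork output.

```python
import random

random.seed(0)
# Assumed toy LL objective: f(x, y) = 0.5*(y - 0.5*x)^2, so psi(x) = 0.5*x.
# Linear hypernetwork h_phi(x) = w*x + c, with phi = (w, c), as a BR proxy.
w, c = 0.0, 0.0
lr = 0.1
for _ in range(3000):
    x = random.uniform(-2.0, 2.0)   # sample UL variables in a neighborhood
    y = w * x + c                   # hypernetwork output
    g = y - 0.5 * x                 # df/dy evaluated at the hypernetwork output
    w -= lr * g * x                 # chain rule: dy/dw = x
    c -= lr * g                     # chain rule: dy/dc = 1
# After training, h_phi matches the BR mapping: w -> 0.5, c -> 0
```

Once $h_{\phi}$ fits well, $\mathbf{x}$ can be updated by differentiating $F(\mathbf{x},h_{\phi}(\mathbf{x}))$ directly, with the proxy supplying the otherwise intractable BR Jacobian.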
6 Implicit Gradient for Best-Response
In contrast to the EGBR methods above, IGBR methods can in essence be interpreted as introducing the Implicit Function Theorem (IFT) to derive the BR Jacobian [171]. Indeed, the gradient-based BLO methodologies with implicit differentiation are radically different from EGBR methods, and they have been extensively applied in a string of applications (e.g., [134, 168]). As an example, a set of early IGBR approaches (e.g., [27, 176]) used implicit differentiation to select hyperparameters of kernel-based models. Unfortunately, these approaches are extremely problematic when scaling to contemporary neural architectures with millions of weights. More recent approaches (i.e., [160, 20]) assert that the LL subproblem need only be approximately optimized by leveraging noisy gradients. Soon afterwards, motivated by the problem of setting the regularization parameter in the context of neural networks [104], IGBR approaches were successfully applied to various vision and learning tasks [101].
Actually, when all stationary points of the LL subproblem are verified to be global minima, the BR mapping can be further characterized by the first-order optimality condition, typically expressed as $\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})=\mathbf{0}$. Thus, from this perspective, the implicit equation for the BR Jacobian can be derived when the Hessian $\nabla_{\mathbf{y}}^{2}f$ is assumed to be invertible in advance. Eventually, by using the chain rule, the implicit gradient with respect to $\mathbf{x}$ is computed as

(33) $\nabla_{\mathbf{x}}\varphi(\mathbf{x})=\nabla_{\mathbf{x}}F(\mathbf{x},\mathbf{y}^{*})-\nabla_{\mathbf{x},\mathbf{y}}^{2}f(\mathbf{x},\mathbf{y}^{*})\big(\nabla_{\mathbf{y}}^{2}f(\mathbf{x},\mathbf{y}^{*})\big)^{-1}\nabla_{\mathbf{y}}F(\mathbf{x},\mathbf{y}^{*}),\quad\mathbf{y}^{*}=\psi(\mathbf{x}).$

On the theoretical side, it can be seen that Eq. (33) offers the desired exact gradient. However, evaluating it directly carries the burden of computing a Hessian matrix and its inverse. Therefore, there are mainly two different kinds of techniques to alleviate this computational cost, i.e., IGBR based on a Linear System (LS) [160, 168] and on the Neumann Series (NS) [134].
Based on Linear System: To handle the Hessian inverse more efficiently, a common operation is to solve a linear system instead (e.g., HOAG [160], iMAML [168]). Specifically, one obtains the solution $\mathbf{q}$ of the linear system

(34) $\nabla_{\mathbf{y}}^{2}f(\mathbf{x},\mathbf{y}^{*})\,\mathbf{q}=\nabla_{\mathbf{y}}F(\mathbf{x},\mathbf{y}^{*}),$

so as to approximate $\big(\nabla_{\mathbf{y}}^{2}f(\mathbf{x},\mathbf{y}^{*})\big)^{-1}\nabla_{\mathbf{y}}F(\mathbf{x},\mathbf{y}^{*})$, and then computes

(35) $\nabla_{\mathbf{x}}\varphi(\mathbf{x})=\nabla_{\mathbf{x}}F(\mathbf{x},\mathbf{y}^{*})-\nabla_{\mathbf{x},\mathbf{y}}^{2}f(\mathbf{x},\mathbf{y}^{*})\,\mathbf{q}.$

Based on the above derivation, it is apparent that $\nabla_{\mathbf{x}}\varphi(\mathbf{x})$ can be directly computed by the algorithm summarized in Alg. 4.
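The linear-system route can be sketched with conjugate gradient, so that the Hessian appears only through Hessian-vector products and is never inverted or even materialized; the quadratic LL problem below is an illustrative assumption chosen so that the LL solution and cross term are available in closed form.

```python
import numpy as np

# Assumed toy problem (not from the paper):
#   LL: f(x, y) = 0.5*y^T H y - x^T y  =>  grad_y f = H y - x, y*(x) = H^{-1} x,
#       Hessian grad^2_{yy} f = H (SPD), cross term grad^2_{xy} f = -I.
#   UL: F(x, y) = 0.5*||y - t||^2      =>  grad_x F = 0, grad_y F = y - t.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 0.0])

def hvp(v):
    """Hessian-vector product; in practice obtained via automatic
    differentiation without ever materializing the Hessian."""
    return H @ v

def conjugate_gradient(b, iters=50, tol=1e-12):
    """Solve hvp(q) = b, i.e. the linear system of Eq. (34), by plain CG."""
    q = np.zeros_like(b)
    r = b - hvp(q)
    p = r.copy()
    for _ in range(iters):
        rr = r @ r
        if rr < tol:
            break
        Ap = hvp(p)
        alpha = rr / (p @ Ap)
        q = q + alpha * p
        r = r - alpha * Ap
        p = r + ((r @ r) / rr) * p
    return q

def implicit_hypergradient(x):
    y_star = np.linalg.solve(H, x)        # LL solution (closed form here)
    q = conjugate_gradient(y_star - t)    # q solves Eq. (34)
    # Eq. (35): grad_x F = 0 and grad^2_{xy} f = -I, so the hypergradient is q
    return q
```

For an $n$-dimensional SPD system, CG converges in at most $n$ iterations in exact arithmetic, which is why it is a natural fit when only Hessian-vector products are affordable.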