Bi-Level Optimization (BLO) is a hierarchical mathematical program in which the feasible region of one optimization task is restricted by the solution-set mapping of another optimization task (i.e., the second task is embedded within the first one) [48, 36]. The outer optimization task is commonly referred to as the Upper-Level (UL) problem, and the inner one as the Lower-Level (LL) problem. Accordingly, BLOs involve two kinds of variables, referred to as the UL and LL variables.
The origin of BLOs can be traced to game theory, where the problem is known as Stackelberg competition. Subsequently, BLOs have been investigated in view of many important applications in science and engineering, particularly in economics, management, chemistry, optimal control, and resource allocation [63, 47, 182, 203]. In recent years, a large number of applications that fit the BLO framework have arisen in machine learning and computer vision, e.g., hyper-parameter optimization [50, 62, 139, 152], multi-task and meta-learning [64, 66, 178], neural architecture search [117, 89, 113], generative adversarial learning [111, 92, 191], deep reinforcement learning [222, 163, 211], and image processing and analysis [151, 28, 121], to name just a few.
In general, BLOs are highly complicated and computationally challenging to solve due to their nonconvexity and non-differentiability [99, 43]. Despite their apparent simplicity, BLOs are nonconvex problems with an implicitly determined feasible region even if the UL and LL problems are both convex [81, 220]. Indeed, it has been shown that even the simplest model in BLOs (i.e., the linear BLO program) is strongly NP-hard, and that this hardness persists when only local optimality is sought [195, 22]. In addition, the existence of multiple optima for the LL problem can render the BLO formulation ill-defined, and the feasible region of the problem may be nonconvex [40, 126]. Despite these challenges, a large body of research on methods and applications of BLOs has followed; see [46, 43, 179]. Early studies focused on numerical methods, including extreme-point methods, branch-and-bound methods [63, 135], descent methods [175, 147], penalty function methods [5, 197], trust-region methods [54, 42], and so on. The most frequently used procedure is to replace the LL problem with its Karush–Kuhn–Tucker (KKT) conditions; under suitable assumptions (such as smoothness and convexity), the BLO can then be transformed into a single-level optimization problem [2, 183, 180]. However, the complexity of bi-level models makes them difficult to solve in practice with conventional approaches and computing devices. In particular, these approaches may fail on large-scale, high-dimensional applications, especially in machine learning and computer vision [108, 215].
The classical approach (known as the first-order approach in the economics literature) to reformulating BLO is to replace the LL subproblem in Eq. (1) by its KKT conditions and to minimize over the original variables as well as the multipliers. The resulting problem is a so-called Mathematical Program with Equilibrium Constraints (MPEC) [167, 219]. Unfortunately, MPECs remain a challenging class of problems because of the presence of the complementarity constraint. Solution methods for MPECs can be categorized into two approaches. The first, the nonlinear programming approach, rewrites the complementarity constraint as nonlinear inequalities, so that powerful numerical nonlinear programming solvers can be leveraged. The second, the combinatorial approach, tackles the combinatorial nature of the disjunctive constraint directly. MPECs have been studied intensively over the last three decades. Recently, the mathematical programming community has made some progress on the MPEC approach to BLO in the context of selecting optimal hyper-parameters in regression and classification problems. Nevertheless, the multipliers in the MPEC approach cause two issues. First, in theory, if the LL subproblem has more than one multiplier, the resulting MPEC is not equivalent to the original BLO when local optimality is considered. Second, the auxiliary multiplier variables limit the numerical efficiency of solving the BLO.
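To make this reformulation concrete, the following sketch writes out the KKT-based single-level problem. It assumes, purely for illustration, that the LL feasible set is described by inequality constraints $g(\mathbf{x}, \mathbf{y}) \le 0$ with multipliers $\boldsymbol{\lambda}$, and that $F$ and $f$ denote the UL and LL objectives; the constraint function $g$ and this notation are assumptions and are not taken from the text above.

```latex
\begin{aligned}
\min_{\mathbf{x},\, \mathbf{y},\, \boldsymbol{\lambda}} \quad & F(\mathbf{x}, \mathbf{y}) \\
\text{s.t.} \quad & \nabla_{\mathbf{y}} f(\mathbf{x}, \mathbf{y}) + \nabla_{\mathbf{y}} g(\mathbf{x}, \mathbf{y})^{\top} \boldsymbol{\lambda} = 0, \\
& g(\mathbf{x}, \mathbf{y}) \le 0, \quad \boldsymbol{\lambda} \ge 0, \\
& \boldsymbol{\lambda}^{\top} g(\mathbf{x}, \mathbf{y}) = 0 .
\end{aligned}
```

The last line is the complementarity constraint: for every component, either the multiplier or the constraint value must vanish, which is the source of the disjunctive structure that the combinatorial approach targets.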
In recent years, a variety of modern learning paradigms, including but not limited to hyper-parameter optimization [151, 140, 65, 153], multi-task and meta-learning [149, 90, 112, 31], neural architecture search [117, 113, 51, 76], adversarial learning [191, 92, 111, 130], and deep reinforcement learning [222, 194], have been investigated in machine learning and computer vision to address complex tasks in challenging real-world scenarios. For example, hyper-parameter optimization aims to jointly learn the model parameters and hyper-parameters. Meta-learning investigates both the base and meta learners, and thereby obtains the ability to adapt rapidly to new environments with only a few training examples. Neural architecture search seeks to simultaneously optimize the network architecture and its parameters. Adversarial learning can be viewed as the process of jointly constructing the generator and the discriminator during training. Deep reinforcement learning aims to learn a policy, i.e., a rule that selects an action in each state so as to maximize the reward. Despite their different motivations and mechanisms, all these problems contain a series of closely related subproblems and have a natural hierarchical optimization structure. Unfortunately, although these problems have attracted increasing attention in both academia and industry, a perspective for uniformly understanding and formulating these different categories of learning and vision problems is still lacking.
We notice that most previous surveys on BLOs (e.g., [48, 4, 36, 201, 193, 102, 34, 74]) are written purely from the viewpoint of mathematical programming and mainly focus on formulations, properties, optimality conditions, and classical solution algorithms, such as evolutionary methods. In contrast, the aim of this paper is to use BLO to express a variety of complex learning and vision problems that explicitly or implicitly contain closely related subproblems. Furthermore, we present a unified perspective to comprehensively survey different categories of gradient-based BLO methodologies in specific learning and vision applications. In particular, we first provide a literature review of various complex learning and vision problems, including hyper-parameter optimization, multi-task and meta-learning, neural architecture search, adversarial learning, deep reinforcement learning, and so on. We demonstrate that all these tasks can be modeled by a general BLO formulation. Following this perspective, we then establish a value-function-based single-level reformulation to express these existing BLO models. By further introducing a best-response-based algorithmic framework on the single-level reformulation, we can uniformly understand and formulate existing gradient-based BLO methods and analyze their accelerations, simplifications, extensions, and convergence and complexity properties. Finally, we demonstrate the potential of our framework for designing new algorithms and point out some promising research directions for BLO in learning and vision.
Compared with existing surveys on BLOs, our major contributions can be summarized as follows:
To the best of our knowledge, this is the first survey paper to focus on uniformly understanding and (re)formulating different categories of complex machine learning and computer vision tasks and their solution methods (especially in the context of deep learning) from the perspective of BLO.
We provide a value-function-based single-level reformulation, together with a flexible best-response-based algorithmic framework, to unify existing gradient-based BLO methodologies and analyze their accelerations, simplifications, extensions, convergence behaviors and complexity properties.
Our framework not only comprehensively covers mainstream gradient-based BLO methods, but also has the potential to inspire new BLO algorithms for more challenging tasks. We also point out some promising directions for future research.
We summarize our mathematical notations in Table I. The remainder of this paper is organized as follows. We first introduce the necessary fundamentals of BLOs in Section 2. Then, Section 3 provides a comprehensive survey of various learning and vision applications that can all be modeled as BLOs. In Section 4, we establish a unified algorithmic framework for existing gradient-based BLO schemes. Within this framework, we further discuss and formulate two categories of BLO methods (i.e., those based on explicit and implicit gradients for the best response) in Section 5 and Section 6, respectively. We also discuss the so-called lower-level singleton issue of BLOs in Section 7. The convergence and complexity properties of these gradient-based BLO methods are discussed in Section 8. Section 9 explores the potential of our framework for designing new algorithms to deal with the more challenging pessimistic BLOs. Finally, Section 10 points out some promising directions for future research.
|/||Training/Validation data||Operations/Operation weights|
|Q-function under||State-action value-function|
|Generator/Discriminator||Real-world image/Random noise|
|Aggregation parameters||/||UL/LL variable|
|UL objective||LL objective for a given|
|UL value-function||LL value-function|
|BR mapping||Transposition/Compound operation|
|BR Jacobian||Practical BR Jacobian w.r.t.|
|The -th step of gradient descent||Solution-set mapping|
|Hypernetwork with parameters|
|Optimistic objective||Pessimistic objective|
|Optimistic aggregated gradient||Pessimistic aggregated gradient|
|Inverse Hessian matrix||Inverse Hessian-vector product|
|/||LL/UL iteration step||Maximum LL/UL iteration number|
|Layer-wise transformation||Neumann series|
2 Fundamentals of Bi-Level Optimization
2.1 Overview of Bi-Level Optimization Problems
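Bi-Level Optimization (BLO) contains two levels of optimization tasks, where one is nested within the other as a constraint. The inner (or nested) and outer optimization tasks are often referred to as the Lower-Level (LL) and Upper-Level (UL) subproblems, respectively. Correspondingly, there are two types of variables, namely, the LL variable $\mathbf{y} \in \mathbb{R}^n$ and the UL variable $\mathbf{x} \in \mathbb{R}^m$. Specifically, the LL subproblem can be formulated as the following parametric optimization task
$$\min_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y}), \tag{1}$$
where we consider a continuous function $f: \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ as the LL objective and $\mathcal{Y} \subseteq \mathbb{R}^n$ is a nonempty and closed set. By denoting the value-function as $f^*(\mathbf{x}) := \min_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$, we can define the solution-set mapping of Eq. (1) as $\mathcal{S}(\mathbf{x}) := \{\mathbf{y} \in \mathcal{Y} \mid f(\mathbf{x}, \mathbf{y}) \le f^*(\mathbf{x})\}$. Then the standard BLO problem can be formally expressed as
$$\min_{\mathbf{x} \in \mathcal{X}} F(\mathbf{x}, \mathbf{y}), \quad \text{s.t. } \mathbf{y} \in \mathcal{S}(\mathbf{x}), \tag{2}$$
where the UL objective $F: \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ is also a continuous function and $\mathcal{X} \subseteq \mathbb{R}^m$. A feasible solution to the BLO in Eq. (2) is thus a vector of UL and LL variables that satisfies all the constraints in Eq. (2) and whose LL variables are optimal for the LL subproblem in Eq. (1), with the given UL variables as parameters. In Fig. 1, we provide a simple visual illustration of the BLO stated in Eq. (2).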
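The definitions above can be illustrated with a minimal brute-force sketch. The two objectives, the grids, and all function names below are hypothetical choices made for this example (they do not come from the paper); the sketch only shows how the value function, the solution set of the LL subproblem, and the UL search fit together.

```python
import numpy as np

# Hypothetical toy objectives, chosen so that the answer is easy to check by hand.
def F(x, y):
    """UL objective: on its own, the leader would like x = 1 and y = -1."""
    return (x - 1.0) ** 2 + (y + 1.0) ** 2

def f(x, y):
    """LL objective: the follower wants y to match x."""
    return (y - x) ** 2

X_GRID = np.linspace(-2.0, 2.0, 401)  # discretized UL feasible set
Y_GRID = np.linspace(-2.0, 2.0, 401)  # discretized LL feasible set

def solution_set(x, tol=1e-12):
    """Grid points belonging to the LL solution set for a given x."""
    values = f(x, Y_GRID)
    f_star = values.min()  # LL value function evaluated on the grid
    return Y_GRID[values <= f_star + tol]

def solve_blo():
    """Search over pairs (x, y) with y restricted to the LL solution set."""
    best = (np.inf, None, None)
    for x in X_GRID:
        for y in solution_set(x):
            value = F(x, y)
            if value < best[0]:
                best = (value, x, y)
    return best

best_value, best_x, best_y = solve_blo()
print(f"bi-level solution: x = {best_x:.2f}, y = {best_y:.2f}, F = {best_value:.2f}")
print(f"unconstrained minimum of F: x = 1, y = -1, F = {F(1.0, -1.0):.2f}")
```

In this toy problem the follower always answers with y equal to x, so the leader effectively minimizes 2x² + 2 and cannot reach the unconstrained minimizer of F, which is infeasible for the bi-level problem.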
The above BLO problem has a natural interpretation as a non-cooperative game between two players (i.e., a Stackelberg game). Correspondingly, we may also refer to the UL and LL subproblems as the leader and the follower, respectively. The "leader" chooses its decision $\mathbf{x}$ first, and afterwards the "follower" observes $\mathbf{x}$ and responds with a decision $\mathbf{y}$. Therefore, the follower's decision may depend on the leader's decision. Likewise, the leader has to satisfy a constraint that depends on the follower's decision.
2.2 Formulating BLO from Different Viewpoints
It is worth noting that in the standard BLO formulation given in Eq. (2), there may exist multiple LL optimal solutions for some UL variables $\mathbf{x}$ (see Fig. 1 (b) for an example). Therefore, it is necessary to specify which of the multiple LL solutions in $\mathcal{S}(\mathbf{x})$ should be considered. Indeed, most existing works in learning and vision adopt either the uniqueness or the optimistic viewpoint to specify the standard BLO in Eq. (2). We will also discuss the more challenging pessimistic viewpoint for BLO in Section 9.
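Specifically, the most straightforward idea in existing works is to assume that $\mathcal{S}(\mathbf{x})$ is a singleton (this is also known as the Lower-Level Singleton (LLS) assumption). In this case, we can simplify the original model as
$$\min_{\mathbf{x} \in \mathcal{X}} F(\mathbf{x}, \mathbf{y}), \quad \text{s.t. } \mathbf{y} = \arg\min_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y}).$$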
Such a uniqueness version of BLOs is well-defined and covers a variety of learning and vision tasks (e.g., [50, 140, 160, 139], to name just a few). Thus, in recent years, dozens of methods have been developed to address this nested optimization task in different application scenarios (see the following sections for more details).
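The situation becomes more intricate if the LL subproblem is not uniquely solvable for each $\mathbf{x}$. Essentially, if the follower can be motivated to select an optimal solution in $\mathcal{S}(\mathbf{x})$ that is also best for the leader (i.e., with respect to $F$), this yields the so-called optimistic (strong) formulation of BLO
$$\min_{\mathbf{x} \in \mathcal{X}} \min_{\mathbf{y}} F(\mathbf{x}, \mathbf{y}), \quad \text{s.t. } \mathbf{y} \in \mathcal{S}(\mathbf{x}).$$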
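The effect of a non-unique LL solution can be seen in a small sketch. The objectives below are hypothetical and were chosen so that the LL solution set always contains two points; the optimistic value lets the follower pick the point that is best for the leader, and the pessimistic value (worst case for the leader) is included only as a contrast.

```python
import numpy as np

X_GRID = np.linspace(-2.0, 2.0, 401)
Y_GRID = np.linspace(-2.0, 2.0, 401)

def f(x, y):
    """Hypothetical LL objective with two minimizers, y = -1 and y = +1, for every x."""
    return (y ** 2 - 1.0) ** 2 + 0.0 * x

def F(x, y):
    """Hypothetical UL objective that prefers y = -1 because of the linear term."""
    return (x - y) ** 2 + 0.5 * y

def solution_set(x, tol=1e-9):
    """Grid approximation of the LL solution set for a given x."""
    values = f(x, Y_GRID)
    return Y_GRID[values <= values.min() + tol]

# Optimistic: the follower breaks ties in favor of the leader (inner min over the solution set).
optimistic_value, optimistic_x = min(
    (min(F(x, y) for y in solution_set(x)), x) for x in X_GRID
)

# Pessimistic: the follower breaks ties against the leader (inner max over the solution set).
pessimistic_value, pessimistic_x = min(
    (max(F(x, y) for y in solution_set(x)), x) for x in X_GRID
)

print(f"optimistic : x = {optimistic_x:.2f}, F = {optimistic_value:.4f}")
print(f"pessimistic: x = {pessimistic_x:.2f}, F = {pessimistic_value:.4f}")
```

The two viewpoints select different leader decisions for the same pair of objectives, which is why the formulation has to state explicitly how ties in the LL solution set are resolved.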
3 Understanding and Modeling Learning and Vision Tasks by BLOs
In this section, we demonstrate that, despite their different motivations and mechanisms, a variety of modern complex learning and vision tasks (e.g., hyper-parameter optimization, multi-task and meta-learning, neural architecture search, adversarial learning, deep reinforcement learning, and so on) share close relationships from the BLO perspective. Moreover, we provide a uniform BLO expression to (re)formulate all these problems. Table II provides a summary of learning and vision applications that can be understood and modeled by BLO.
|Task||Important work||Other work|
|HO||(ICML, 2013), (ICML, 2016), (ICML, 2020), (ICLR, 2019)||(SIAM, 2014), (EURO, 2020), (ICML, 2015), (AISTATS, 2020), (ICML, 2018), (AISTATS, 2019), (arXiv, 2019), (arXiv, 2020), (arXiv, 2018), (ICML, 2017), (ICML, 2015), (ICML, 2018), (ICML, 2017), (NIPS, 2020), (arXiv, 2020)|
|MFL||(arXiv, 2017), (ICLR, 2017)||(PMLR, 2017), (CVPR, 2018), (CVPR, 2018), (ICLR, 2018), (ICML, 2017), (ICLR, 2020), (ICLR, 2018), (ICML, 2019)|
|MIL||(ICML, 2017), (NIPS, 2019), (ICLR, 2019), (NIPS, 2019), (ICML, 2018)||(ICLR, 2017), (NIPS, 2016), (ICML, 2017), (arXiv, 2018), (ICLR, 2019), (CVPR, 2020), (AAAI, 2020), (arXiv, 2018), (NIPS, 2019), (ICML, 2019), (ICLR, 2020), (ICML, 2019), (CVPR, 2020), (AAAI, 2020), (arXiv, 2019), (arXiv, 2018), (ICASSP, 2020), (arXiv, 2019)|
|NAS||(ICLR, 2019), (NIPS, 2018), (TGRS, 2020), (CVPR, 2020), (ICLR, 2019), (ECCV, 2020)||(ICLR, 2019), (CVPR, 2018), (ICLR, 2019), (ICCV, 2019), (AISTATS, 2020), (CVPR, 2020), (AAAI, 2020), (CVPR, 2020), (CVPR, 2019), (NIPS, 2019), (SPL, 2020), (ICCV, 2019), (CIIP, 2019), (CVPR, 2020), (arXiv, 2019), (SIGKDD, 2020), (NIPS, 2019), (CVPR, 2020), (CVPR, 2020), (arXiv, 2020), (arXiv, 2020)|
|AL||(CVPR, 2020), (PR, 2019)||(arXiv, 2016), (CVPR, 2020), (arXiv, 2018), (AAAI, 2020), (ICML, 2020), (CVPR, 2020), (arXiv, 2020), (arXiv, 2018)|
|DRL||(CVPR, 2020), (TSG, 2019)||(NIPS, 2019), (arXiv, 2019), (CIRED, 2019), (ICLR, 2019), (arXiv, 2019)|
|Others||(SIAM, 2013), (SSVM, 2015), (TIP, 2020), (TNNLS, 2020), (TIP, 2016)||(ICML, 2016), (ICLR, 2019), (IJCAI, 2020), (arXiv, 2020), (UAI, 2020), (arXiv, 2021), (arXiv, 2020), (TIP, 2020), (arXiv, 2019), (arXiv, 2020), (T-RO, 2020), (WACV, 2020), (arXiv, 2020), (NIPS, 2020), (arXiv, 2020), (arXiv, 2020), (CVPR, 2020), (arXiv, 2019), (ICLR, 2018)|
3.1 Hyper-parameter Optimization
Hyper-parameters are parameters that are external to the model and cannot be learned from the training data alone. The problem of identifying the optimal set of hyper-parameters is known as Hyper-parameter Optimization (HO). Early methods for selecting hyper-parameters typically considered regularized models and Support Vector Machines (SVMs). These problems are first formulated as hierarchical problems and then transformed into single-level optimization problems by replacing the LL problem with its optimality conditions. However, because of the high computational cost, especially in high-dimensional hyper-parameter spaces, these methods cannot even guarantee a locally optimal solution. More recently, gradient-based HO methods with deep neural networks have gained considerable traction. They generally fall into two categories, depending on how the gradient with respect to the hyper-parameters is computed: iterative differentiation (e.g., [50, 12, 39, 151, 140, 160, 152, 64, 75, 66, 114, 177, 67]) and implicit differentiation (e.g., [27, 176, 101, 20, 139, 62, 160, 133, 134]). The former is an iterative approach in which the best-response function is approximated by a finite number of gradient descent steps on the loss function, while the latter uses the implicit function theorem to derive hyper-gradients. For instance, data hyper-cleaning [65, 177], a specific HO example, trains a linear classifier (with parameters $\mathbf{y}$) with the cross-entropy loss on the training data set, and introduces hyper-parameters $\mathbf{x}$ that are trained with regularization on the validation set.
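The difference between the two categories can be sketched on a small quadratic problem. Everything below is a hypothetical illustration (the matrix `A`, the vectors `b` and `target`, and the function names are assumptions): the LL objective is 0.5·yᵀAy − x·bᵀy with a scalar hyper-parameter x, and the UL objective is 0.5·‖y − target‖². Iterative differentiation propagates the derivative through the gradient descent steps, while implicit differentiation applies the implicit function theorem at the LL solution.

```python
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])   # LL Hessian (positive definite)
b = np.array([1.0, 1.0])
target = np.array([1.0, 0.0])

def grad_y_upper(y):
    """Gradient of the UL objective 0.5 * ||y - target||^2 with respect to y."""
    return y - target

def hypergrad_iterative(x, steps=200, lr=0.3):
    """Differentiate through `steps` gradient descent iterations on the LL objective."""
    y = np.zeros(2)
    dy_dx = np.zeros(2)
    for _ in range(steps):
        # Derivative of the update rule y <- y - lr * (A y - x b) with respect to x.
        dy_dx = dy_dx - lr * (A @ dy_dx - b)
        y = y - lr * (A @ y - x * b)
    return grad_y_upper(y) @ dy_dx

def hypergrad_implicit(x):
    """Implicit function theorem at the exact LL solution, where A y - x b = 0."""
    y_star = np.linalg.solve(A, x * b)
    # dy*/dx = -(d2f/dy2)^{-1} (d2f/dy dx) = A^{-1} b, since d2f/dy dx = -b.
    dy_dx = np.linalg.solve(A, b)
    return grad_y_upper(y_star) @ dy_dx

x0 = 1.5
print("iterative (200 steps):", hypergrad_iterative(x0))
print("iterative (3 steps)  :", hypergrad_iterative(x0, steps=3))
print("implicit             :", hypergrad_implicit(x0))
```

With enough inner steps the two estimates agree, whereas a heavily truncated unrolling gives a visibly different hyper-gradient; this trade-off between inner iterations and accuracy is what separates the two families in practice.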
In general, because of its hierarchical nature, modeling the HO problem as a BLO paradigm is inherent and natural. Specifically, the UL objective minimizes the validation-set loss with respect to the hyper-parameters (e.g., weight decay), and the LL objective outputs a learning algorithm by minimizing the training-set loss with respect to the model parameters (e.g., weights and biases). We illustrate how to model the HO task from the perspective of BLO in Fig. 2. It can be seen that the full dataset is split into the training and validation sets (i.e., $\mathcal{D} = \mathcal{D}_{\mathrm{tr}} \cup \mathcal{D}_{\mathrm{val}}$). In conclusion, as nested optimization problems, most HO applications can be formulated as BLOs, where the UL subproblem optimizes the hyper-parameters $\mathbf{x}$ and the LL subproblem (w.r.t. the weight parameters $\mathbf{y}$) finds the learning algorithm by minimizing the training loss for the given $\mathbf{x}$.
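As a minimal sketch of this modeling pattern, the example below tunes the weight decay of a ridge regression model. The synthetic data, the 40/20 split, and all function names are assumptions made for illustration; ridge regression is used only because its LL solution has a closed form, not because the surveyed methods rely on it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 60, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
t = X @ w_true + 0.5 * rng.normal(size=n)

# Split the full data set into training and validation parts.
X_tr, t_tr = X[:40], t[:40]
X_val, t_val = X[40:], t[40:]

def lower_level(lam):
    """LL solution: weights minimizing 0.5*||X_tr w - t_tr||^2 + 0.5*lam*||w||^2."""
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ t_tr)

def upper_level(lam):
    """UL objective: validation loss evaluated at the LL solution for this lam."""
    w = lower_level(lam)
    return 0.5 * np.mean((X_val @ w - t_val) ** 2)

def hypergradient(lam):
    """Derivative of the UL objective with respect to lam via implicit differentiation."""
    H = X_tr.T @ X_tr + lam * np.eye(d)          # LL Hessian with respect to w
    w = np.linalg.solve(H, X_tr.T @ t_tr)
    grad_w = X_val.T @ (X_val @ w - t_val) / len(t_val)
    dw_dlam = -np.linalg.solve(H, w)             # from H * dw/dlam + w = 0
    return grad_w @ dw_dlam

grid = np.logspace(-3, 2, 30)
best_lam = min(grid, key=upper_level)
print(f"best weight decay on the grid: {best_lam:.4g}")
print(f"hyper-gradient at lam = 1: {hypergradient(1.0):.6f}")
```

The grid search and the hyper-gradient both query the same nested structure: every evaluation of the UL objective requires solving the LL training problem for the current hyper-parameter.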
3.2 Multi-task and Meta-Learning
The goal of meta-learning (a.k.a. learning to learn) is to design models that can learn new skills or adapt to new environments rapidly with only a few training examples. In other words, by learning to learn across data from many previous tasks, meta-learning algorithms can discover the structure among tasks and thereby enable fast learning of new tasks (see Fig. 3 for a schematic diagram). Multi-task learning is a variant of meta-learning that simply aims to jointly solve all of the given tasks [189, 172].
In fact, one of the most well-known instances of meta-learning is few-shot classification (i.e., $N$-way $M$-shot), where each task is an $N$-way classification problem and the aim is to learn the meta-parameter $\mathbf{x}$ such that each task can be solved with only $M$ training samples. Specifically, the full meta-training data set $\mathcal{D}$ can be segmented into subsets $\mathcal{D}^j$, each linked to the $j$-th task. In essence, these methods fall into two categories, meta-feature learning and meta-initialization learning, as can be seen in Fig. 4, depending on the relationship between the meta-parameters and the network parameters. We show the meta-parameter in blue and the network parameter in green. Meta-feature learning indicates that there is no deeply intertwined and entangled relationship between the two, while meta-initialization learning indicates that the meta-parameter is directly the network initialization parameter. In the following, we analyze the problem from these two perspectives.
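Fig. 4: Illustration of the two meta-learning categories, with meta-parameters shown as blue blocks and network parameters as green blocks. (a) shows meta-parameters for features shared across tasks and parameters of the logistic regression layer. (b) shows meta (initial) parameters shared across tasks and parameters of the task-specific layer.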
3.2.1 Meta-Feature Learning
Meta-Feature Learning (MFL) aims to learn a shared meta-feature representation for all tasks. It is worth noting that the approaches in [223, 1] recently show that multi-task learning with hard parameter sharing and meta-feature representation are essentially similar. Furthermore, optimizing the meta-learner with respect to the meta-parameters in the UL subproblem is similar to HO, as discussed in the previous subsection [64, 66, 67]. For the $j$-th task, these methods use the cross-entropy function as the task-specific loss on the meta-training data set to define the LL objective, and also use the cross-entropy function to define the UL objective on the validation data set.
As can be seen from subfigure (a) of Fig. 4, following the BLO framework, the network architecture in this category can be subdivided into two groups. The first consists of the cross-task intermediate representation layers parameterized by $\mathbf{x}$ (illustrated by the blue block), which output the meta features; the second is the logistic regression layer parameterized by $\mathbf{y}^j$ (illustrated by the green block), which serves as the ground classifier for the $j$-th task. Concretely, the feature layers are shared across all episodes, whereas the softmax regression layer is episode (task) specific. Forward propagation through the network thus passes from the feature extraction part to the softmax part.
3.2.2 Meta-Initialization Learning
Meta-Initialization Learning (MIL) aims to learn a meta initialization for all tasks. MAML, known for its simplicity, estimates the initialization parameters purely by gradient-based search, using the cross-entropy and mean-squared error losses for supervised classification and regression tasks, respectively. Beyond the initial parameters, recent approaches have focused on learning other meta variables, such as updating strategies (e.g., descent direction and learning rate [6, 169, 14]) and an extra preconditioning matrix (e.g., [61, 173, 106]). Implicit gradient methods are also widely used in few-shot meta-learning: a large variety of algorithms replace differentiation through the optimization process of the base-learner with the calculation of an implicit meta-gradient [168, 60, 225, 10, 17]. More recently, proximal regularization has been introduced into the original loss function. Since computing Hessian-vector products during training requires a large amount of computation, various Hessian-free algorithms have been proposed to alleviate the cost of second-order derivatives, including but not limited to [149, 90, 185, 7, 112, 31, 146]. Furthermore, a first-order approximation algorithm has been proposed to avoid the time-consuming calculation of second-order derivatives. More recently, a modularized optimization library that unifies several meta-learning algorithms into a common BLO framework has been proposed (the code for this library is available at https://github.com/dut-media-lab/BOML).
As shown in subfigure (b) of Fig. 4, $\mathbf{x}$ (the blue block) corresponds to the initial parameters, and $\mathbf{y}$ (the green block) corresponds to the model parameters. Compared to MFL, there is no deeply intertwined and entangled relationship between the two variables, and $\mathbf{x}$ is only explicitly related to $\mathbf{y}$ in the initial state. We denote by $\mathbf{x}$ the network initialization parameter and by $\mathbf{y}$ the updated parameter. As a bi-level coupled nested-loop strategy, the LL subproblem, based on the base-learner, is trained to perform a given task, and the UL subproblem, based on the meta-learner, aims to learn how to optimize the base-learner. Among the well-known approaches in this direction, much recent work (e.g., [105, 148]) uses the cross-entropy function as the task-specific loss to define the LL objective on the training data set, and likewise uses the cross-entropy function to define the UL objective. Moreover, meta-learning has also been successfully applied to a wide range of challenging problems, including image enhancement [184, 88], semantic segmentation, tracking [78], and text generation, to name just a few.
Both MFL and MIL are essentially strategies in which one optimizer is built on top of another optimizer, which is exactly in line with the construction of the BLO scheme. More specifically, the LL objective is defined by the task-specific loss linked to the $j$-th task on its training data, and the UL objective is defined on the corresponding validation data. To summarize, the UL meta-learner first calculates the gradient and updates the meta-parameter according to the feedback value, so that the optimized meta knowledge yields better meta knowledge. Subsequently, this better meta knowledge is fed into the base-learner (i.e., the LL subproblem) as part of its model for optimizing $\mathbf{y}$, thus forming an optimization cycle.
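The optimization cycle of meta-initialization learning can be sketched with a one-dimensional toy problem in the spirit of MAML. All names and values below are hypothetical: each task has a quadratic loss with its own optimum, the base-learner takes a single gradient step from the shared initialization, and, as a simplification, the same loss is used for the inner step and for the meta-objective instead of separate training and validation data.

```python
import numpy as np

task_optima = np.array([-1.0, 0.5, 2.0])  # hypothetical per-task optimum c_j
ALPHA = 0.1                               # inner (base-learner) step size

def task_loss(y, c):
    return 0.5 * (y - c) ** 2

def task_grad(y, c):
    return y - c

def inner_step(x, c):
    """Base-learner: one gradient step on the task loss, starting from initialization x."""
    return x - ALPHA * task_grad(x, c)

def meta_objective(x):
    """Meta-learner objective: sum of task losses after adaptation."""
    return sum(task_loss(inner_step(x, c), c) for c in task_optima)

def meta_gradient(x):
    """Exact meta-gradient; (1 - ALPHA) is the derivative of inner_step w.r.t. x,
    which involves the second derivative of the task loss (equal to 1 here)."""
    return sum(task_grad(inner_step(x, c), c) * (1.0 - ALPHA) for c in task_optima)

def first_order_meta_gradient(x):
    """First-order approximation: treat d inner_step / dx as 1 (no second derivative)."""
    return sum(task_grad(inner_step(x, c), c) for c in task_optima)

x_meta = 0.0
for _ in range(200):
    x_meta -= 0.1 * meta_gradient(x_meta)

print(f"learned initialization: {x_meta:.4f}")
print(f"exact vs first-order meta-gradient at x = 0.3: "
      f"{meta_gradient(0.3):.4f} vs {first_order_meta_gradient(0.3):.4f}")
```

For this quadratic toy problem the first-order approximation only rescales the meta-gradient, so both variants lead to the same initialization; in general they differ by the second-order terms that Hessian-free and first-order methods try to avoid.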
3.3 Neural Architecture Search
Neural Architecture Search (NAS) seeks to automate the process of choosing an optimal neural network architecture. Recently, there has been a great deal of interest in gradient-based differentiable NAS methods [117, 24, 142]. These methods involve three main concepts: the search space, the search strategy, and the performance estimation strategy. After designing an architecture search space, they use a certain search strategy to find an optimal network architecture. As shown in Fig. 5, this can be regarded as the process of optimizing the operation and connection of each node.
The most well-known instance, named DARTS, relaxes the search space to be continuous and conducts the search in a differentiable way, so that gradient descent can be used to simultaneously search for architectures and learn weights. In DARTS, each operation corresponds to a coefficient. By denoting $\alpha$ as the architecture parameters and $(i,j)$ as the connection between two nodes, the mixed operation based on the Softmax function can be written as
$$\bar{o}^{(i,j)}(\cdot) = \sum_{o \in \mathcal{O}} \frac{\exp\big(\alpha_{o}^{(i,j)}\big)}{\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)} \, o(\cdot),$$
where $o$ and $o'$ denote operations, and $\mathcal{O}$ is the set of all candidate operations. Then, $\arg\max_{o \in \mathcal{O}} \alpha_{o}^{(i,j)}$ is performed to obtain the optimal architecture. However, since too many skip connections lead to a sharp deterioration in performance, a great number of improved approaches have emerged, e.g., ENAS, PC-DARTS, and P-DARTS, to name just a few. Among them, PC-DARTS samples the features extracted from the network along the channel dimension and then concatenates the processed features with the remaining ones. Moreover, a series of gradient-based differentiable NAS approaches combined with meta-learning have recently emerged (e.g., [199, 113, 97, 56]). At present, differentiable NAS based on BLO has achieved promising results in a wide variety of tasks, such as image enhancement [79, 35], image classification, semantic segmentation [218, 228, 116], object detection [30, 207, 77, 91, 110], video processing, medical image analysis [228, 218], video classification, recommendation systems, graph networks [224, 69], and representation learning.
In conclusion, given a proper search space, gradient-based differentiable NAS methods can derive the optimal architecture for different tasks. From this perspective, the UL objective is defined with respect to the architecture weights (e.g., block/cell), parameterized by $\mathbf{x}$, and the LL objective is defined with respect to the model weights, parameterized by $\mathbf{y}$. The search process can generally be formulated as a BLO paradigm, where the UL objective is based on the validation data set $\mathcal{D}_{\mathrm{val}}$ and the LL objective is based on the training data set $\mathcal{D}_{\mathrm{tr}}$.
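The continuous relaxation used by DARTS-style methods can be sketched in a few lines. The candidate operations, their names, and the helper functions below are hypothetical stand-ins chosen for illustration (real search spaces use convolutions, pooling, and skip connections); the sketch only shows the softmax-weighted mixture on one edge and the final discretization by argmax.

```python
import numpy as np

# Hypothetical candidate operations on one edge of the cell.
CANDIDATE_OPS = {
    "identity": lambda h: h,
    "zero": lambda h: np.zeros_like(h),
    "relu": lambda h: np.maximum(h, 0.0),
}

def softmax(a):
    e = np.exp(a - a.max())  # subtract the max for numerical stability
    return e / e.sum()

def mixed_op(h, alpha):
    """Softmax-weighted sum of all candidate operations applied to the input h."""
    weights = softmax(alpha)
    return sum(w * op(h) for w, op in zip(weights, CANDIDATE_OPS.values()))

def derive_op(alpha):
    """Discretize the edge by keeping the operation with the largest coefficient."""
    return list(CANDIDATE_OPS)[int(np.argmax(alpha))]

h = np.array([-1.0, 2.0])
alpha = np.array([0.1, -1.0, 2.0])   # architecture coefficients for this edge
print("mixed output:", mixed_op(h, alpha))
print("selected operation:", derive_op(alpha))
```

Because the mixture is differentiable in the coefficients, they can be updated by gradient descent on the validation loss (the UL problem) while the operation weights are trained on the training loss (the LL problem).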
3.4 Adversarial Learning
Adversarial Learning (AL) is currently regarded as one of the most important learning tasks and has been applied in a wide variety of areas, e.g., image generation [130, 92, 68], adversarial attacks, and face verification. For example, a recent work introduces an adaptive BLO model for image generation, which guides the generator to modify its parameters in a complementary and mutually promoting way. In this model, the global optimization optimizes the whole image, while the local optimization only optimizes the low-quality areas. Moreover, by learning a parameterized optimizer with neural networks, another recent work adopts a new adversarial training method that combines adversarial attacks with meta-learning. In essence, the Generative Adversarial Network (GAN), currently one of the most influential models, is a deep generative model [73, 131]. For instance, a recent method introduces a coupled adversarial network for makeup-invariant face verification, which contains two adversarial sub-networks on different levels: one on the pixel level for reconstructing facial images and the other on the feature level for preserving identity information. Similarly, aiming to find a pure Nash equilibrium of the generator and discriminator, another work exploits a fully differentiable search framework by formalizing the task as a bi-level minimax optimization problem.
Fig. 6: Illustration of the GAN architecture. The generator is represented as a deterministic feed-forward neural network (red blocks), through which a fixed random noise is passed to produce an output. The discriminator is another neural network (green blocks), which maps the sampled real-world image and the generator output to a binary classification probability.
All the above approaches formulate the unsupervised learning problem as a game between two opponents: a generator, which samples from a distribution, and a discriminator, which classifies samples as real or fake, as shown in Fig. 6. The goal of GAN is to minimize the duality gap, which is given by:
where the fixed random noise, drawn from a noise distribution, is fed into the generator $G$, whose output, together with the sampled real-world image, is then authenticated by the discriminator $D$. Note that $\mathbb{E}$ denotes the expectation, i.e., the average value of a function under a probability distribution.
In general, adversarial learning corresponds to a min-max BLO problem, where the UL discriminator $D$ aims to learn a robust classifier and the LL generator $G$ tries to generate adversarial samples. Specifically, the UL objective and LL objective can be respectively formulated as
where the generator and the discriminator are parameterized with variables and , respectively. In other words, the UL subproblem aims to reduce the duality gap and the LL subproblem interactively optimizes the discriminator parameters denoted by to obtain the optimal solution.
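For reference, the classical GAN objective from the generative adversarial learning literature can be written as the min-max problem below. This is the standard textbook form, given here as an assumption about the setting the paragraph refers to; the symbols $\mathbf{u}$ (real-world image), $\mathbf{v}$ (random noise), $p_{\mathrm{data}}$, and $\mathcal{N}(0, I)$ are illustrative choices and may differ from the notation of the original equations.

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{\mathbf{u} \sim p_{\mathrm{data}}} \big[ \log D(\mathbf{u}) \big]
+ \mathbb{E}_{\mathbf{v} \sim \mathcal{N}(0, I)} \big[ \log \big( 1 - D(G(\mathbf{v})) \big) \big]
```

Fixing one player and optimizing the other gives the nested structure that the BLO view exploits: the inner problem returns a best response, and the outer problem is evaluated at that response.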
3.5 Deep Reinforcement Learning
In Deep Reinforcement Learning (DRL), the agent aims to make optimal decisions by interacting with the environment and learning from experience, where the environment is modeled as a Markov Decision Process (MDP). Actor-critic (AC) methods are a long-established class of techniques consisting of two subproblems that are intertwined with each other, and they have been widely studied in recent years [212, 214, 192, 222, 32, 194, 211]. With guaranteed convergence, a bi-level AC method has been proposed to solve multi-agent reinforcement learning problems by finding a Stackelberg equilibrium in Markov games. It allows agents to have different knowledge bases (and thus different degrees of intelligence), while their actions can still be executed simultaneously and in a distributed manner. Moreover, a multi-agent bi-level cooperative reinforcement learning algorithm has been proposed to solve bi-level stochastic decision-making problems. In non-convex scenarios, recent work proves that AC can find a globally optimal pair at a linear rate of convergence, and provides a complete theoretical understanding of the BLO paradigm. In fact, both GANs and AC can be seen as bi-level or two-time-scale optimization problems, in which one model is optimized with respect to the optimum of another model; in this sense, the two classes of BLOs are close parallels. It has also been shown that GANs can be viewed as a modified AC method with a blind actor in stateless MDPs.
Indeed, AC methods simultaneously learn an state-action value-function that predicts the expected discounted cumulative reward
where and denote dynamics of the environment and reward function respectively, and correspond to the state and action respectively, and represent the i-th and j-th steps respectively, and also denotes the expectation which implies that the average value of some function under a probability distribution. Moreover, the policy maximizes the expected discounted cumulative reward for that state-action value-function, i.e.,
where and correspond to the initial state and initial action, respectively, and denotes the state-action value-function, and represents the initial state distribution. We show the schematic diagram of AC learning in Fig. 7. Let denote the parameters of the state-action value-function, and denote the parameters of the policy . From the perspective of BLO, the UL objective and LL objective can be respectively formulated as below
where we denote as any divergence that is positive except when its two arguments are equal. From this perspective, the actor and critic correspond to the LL and UL variables, respectively. Under this framework, the update of the policy can be regarded as a stochastic gradient step for the LL objective.
3.6 Other Related Applications
The rapid development of deep learning has made it the dominant approach in image processing and analysis. In particular, besides the above-mentioned tasks, many other learning and vision tasks that can be (re)formulated as BLO problems have emerged, such as image enhancement [101, 28, 226, 155, 38], image registration [138], image recognition , image compression  and other related works [186, 18, 162]. For example, the early work proposed in  considers the problem of parameter learning for variational image denoising models, which incorporate -norm–based analysis priors. When this problem is formulated as a BLO, the LL subproblem is given by the variational model, which consists of the data fidelity and regularization terms, and the UL subproblem is expressed by means of a loss function that penalizes errors between the solution of the LL subproblem and the ground truth data. Furthermore, the work proposed in  formulates a discriminative dictionary learning method for image recognition tasks as a BLO. From this perspective, the UL subproblem directly minimizes the classification error, and the LL subproblem uses the sparsity term and the Laplacian term to characterize the intrinsic data structure.
4 A Unified Algorithmic Framework for Gradient-based BLOs
In what follows, we develop a unified algorithmic framework to formulate the existing mainstream gradient-based BLO methods that have emerged in the learning and vision fields. We first state our value-function-based single-level reformulation for BLOs and then introduce a general nested gradient optimization framework based on it.
4.1 Single-Level Reformulation
Our single-level reformulation is based on the Best-Response (BR) of the follower, which has been used for backward induction in Stackelberg games [96, 45]. Specifically, given the leader , we define a unified BR mapping of the follower (denoted as ) for two categories of BLOs as follows:
where is able to uniformly represent the UL value-function of from the uniqueness and optimistic (i.e., ) viewpoints.
The UL value-function has played a key role throughout the history of BLO research, not only as a common ground for deriving the relevant algorithms, but also in pushing the field towards increasingly complex and challenging problems.
4.2 The Devil Lies in the BR Jacobian
Moving one step forward, the gradient of w.r.t. the UL variable reads as (Footnote 3: to simplify the presentation, we do not distinguish between derivatives and partial derivatives.)
where we use “grad.” as the abbreviation for “gradient” and denote the transpose operation as “”. In fact, the direct gradient is easy to obtain by simple computation. However, the indirect gradient is intractable because we must compute the rate of change of the optimal LL solution with respect to the UL variable (i.e., the BR Jacobian ). (Footnote 4: In the following algorithms, we will also call as the practical BR Jacobian w.r.t. .) The computation of the indirect gradient naturally motivates formulating and hence . For this purpose, a series of techniques have recently been developed from either explicit or implicit perspectives, which respectively obtain their solutions by automatic differentiation through a dynamical system and by implicit differentiation theory. In Alg. 1, we present a general scheme to unify these different categories of gradient-based BLOs. It can be seen that we first compute the practical BR Jacobian by an inner algorithm and then perform a standard gradient-based descent scheme to update . When with , we can finally solve Eq. (12) to obtain the optimal solution. Therefore, the inner algorithm for computing the practical BR Jacobian plays the key role in this general algorithmic framework. Moreover, we emphasize that it is also the most challenging component in these existing gradient-based BLOs.
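The two-stage scheme described above (an inner algorithm that returns a practical BR Jacobian, followed by a gradient step on the UL variable) can be illustrated with a minimal sketch. It is not Alg. 1 itself: the scalar toy problem, the symbols `x` (UL variable) and `y` (LL variable), and all function names are assumptions made only for this illustration. The LL objective is taken as f(x, y) = 0.5 (y - x)^2 and the UL objective as F(x, y) = 0.5 (y - 1)^2 + 0.5 x^2, so the best response is y = x and the UL minimizer is x = 0.5.

```python
# Hypothetical toy bi-level problem (all names are illustrative assumptions):
#   LL: f(x, y) = 0.5 * (y - x)**2            -> best response y*(x) = x
#   UL: F(x, y) = 0.5 * (y - 1)**2 + 0.5 * x**2

def grad_f_y(x, y):
    return y - x          # d f / d y

def grad_F_x(x, y):
    return x              # direct gradient of F w.r.t. x

def grad_F_y(x, y):
    return y - 1.0        # d F / d y

def inner_algorithm(x, steps=50, step_size=0.5):
    """Run gradient descent on the LL problem and track d y_t / d x."""
    y, jac = 0.0, 0.0
    for _ in range(steps):
        # Differentiate the update y <- y - s * (y - x) with respect to x.
        jac = (1.0 - step_size) * jac + step_size
        y = y - step_size * grad_f_y(x, y)
    return y, jac

def solve_blo(x0=3.0, outer_steps=200, outer_lr=0.1):
    x = x0
    for _ in range(outer_steps):
        y, jac = inner_algorithm(x)
        # Direct gradient plus indirect gradient through the BR Jacobian.
        hypergrad = grad_F_x(x, y) + jac * grad_F_y(x, y)
        x -= outer_lr * hypergrad
    return x, inner_algorithm(x)[0]

x_opt, y_opt = solve_blo()
```

In this sketch the inner algorithm happens to be forward accumulation through unrolled gradient descent; any other routine that returns an approximate BR Jacobian could be substituted without changing the outer loop.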
In Fig. 8, we further summarize mainstream gradient-based BLOs based on two different assumptions (i.e., w/ LLS and w/o LLS). In the LLS scenario, from the BR-based perspective, existing gradient methods can be categorized into two groups: Explicit Gradient for Best-Response (EGBR, stated in Section 5) and Implicit Gradient for Best-Response (IGBR, stated in Section 6). As for EGBR, there are mainly three types of methods, namely, recurrence-based EGBR (e.g., [65, 140, 66, 177, 117]), initialization-based EGBR (e.g., [148, 149]) and proxy-based EGBR methods (e.g., [109, 105, 157, 61]), which differ from each other in the way they formulate the BR mapping. For IGBR, existing works consider two groups of techniques (e.g., linear system [160, 168] and Neumann series ) to alleviate the computational complexity issue for the BR Jacobian. It can be seen that the validity of these formulations depends on the uniqueness of the LL solution set. Furthermore, the EGBR category of methods requires first-order differentiability of the LL objective, while the IGBR category requires that the LL objective be twice differentiable and that its Hessian be invertible. When solving BLOs without the LLS assumption, it has been demonstrated in  that we need to first construct the BR mapping based on both the UL and LL subproblems, and then solve two optimization subproblems, namely, the single-level optimization subproblem (w.r.t. ) and the simple bi-level subproblem (w.r.t. ). (Footnote 5: It is known that simple bi-level optimization is just a specific BLO problem with only one variable [127, 53].) Here the subproblem can also be solved by either EGBR or IGBR.
5 Explicit Gradient for Best-Response
In this section, we delve into the EGBR category of methods, which can be understood as performing automatic differentiation through a dynamical system [159, 200, 136]. Specifically, given an initialization at , the iteration process of EGBR can be written as
where defines the operation performed by the -th stage and is the number of iterations. For example, we can formulate based on the gradient descent rule, i.e.,
where is the descent mapping of at the -th stage (e.g., ) and denotes the corresponding step size . Then we can calculate by substituting approximately for , and the full dynamical system can be defined as . (Footnote 6: Here the notation represents the compound dynamical operation of the entire iteration.) That is, we actually consider the following optimization model
and need to calculate (instead of Eq. (14)) in the practical optimization scenario. Since we actually obtain an explicit gradient for the best response of the follower, we call this category of gradient-based BLOs EGBR approaches hereafter. Starting from Eq. (15), it is easy to see that may be coupled with the variable throughout the iteration. This coupling relationship has a direct impact on the optimization process of the UL variable in Eq. (14). In fact, existing EGBR algorithms can be summarized from three perspectives. The first is that, if closely acts on during the whole iteration process, the subsequent optimization of variable will be carried out recursively. The second is that, when only acts in the initial step, the subsequent optimization of variable will be simplified. The third is to replace the whole iterative process with a hyper-network, so as to efficiently approximate the BR mapping. Accordingly, we divide these methods into three categories in terms of the coupling dependence of the two variables and the solution procedures, namely recurrence-based EGBR (stated in Section 5.1), initialization-based EGBR (stated in Section 5.2) and proxy-based EGBR (stated in Section 5.3).
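The iteration introduced at the start of this section, in which the LL solution is replaced by the output of a finite number of descent stages, can be sketched as follows. The toy LL objective f(x, y) = 0.5 x y^2 - y (whose best response is 1/x for x > 0), the UL objective F(x, y) = 0.5 (y - 2)^2 + 0.5 x^2, and the names `psi`, `unroll` and `phi_T` are assumptions introduced only for this example.

```python
def psi(y_prev, x, step_size):
    """One gradient-descent stage on the assumed LL objective
    f(x, y) = 0.5 * x * y**2 - y, whose y-gradient is x * y - 1."""
    return y_prev - step_size * (x * y_prev - 1.0)

def unroll(x, y0=0.0, num_steps=100, step_size=0.5):
    """Compose the stages to obtain y_T as an explicit function of x."""
    y = y0
    for _ in range(num_steps):
        y = psi(y, x, step_size)
    return y

def phi_T(x, **kwargs):
    """Approximate UL value: F(x, y_T(x)) with F = 0.5*(y-2)**2 + 0.5*x**2."""
    y_T = unroll(x, **kwargs)
    return 0.5 * (y_T - 2.0) ** 2 + 0.5 * x ** 2

x = 0.8
y_many = unroll(x)               # long unroll, close to the best response 1/x
y_few = unroll(x, num_steps=2)   # short unroll, only a rough approximation
```

Because every stage is an explicit differentiable expression in `x`, the composed map can be differentiated stage by stage, which is what the methods in the following subsections exploit.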
5.1 Recurrence-based EGBR
It can be seen from Eq. (15) that all the LL iterative variables depend on (i.e., ), and acts as a recurrent variable of the dynamical system . One of the most well-known approaches for calculating (with the above recurrent structure) is Automatic Differentiation (AD) [12, 13], which is also called algorithmic differentiation or simply “AutoDiff”. It has traditionally been applied to imperative programs in two diametrically opposite ways for computing gradients of recurrent neural networks: one corresponds to back-propagation through time in a reverse-mode way [158, 72], and the other corresponds to real-time recurrent learning in a forward-mode way [95, 170]. Quite a number of closely related methods have been proposed since then [140, 65, 66, 177]. Here we review recurrence-based BR methods from the AD perspective, covering forward-mode AD, reverse-mode AD, and the truncated and one-stage simplifications.
Forward-mode AD (FAD): To compute , FAD (e.g., ) appeals to the chain rule for the derivative of the dynamical system. Specifically, recalling that , the operation depends on both directly through its expression and indirectly through . Hence, by using the chain rule, the formulation is given as
Denote , , for and , and we can rewrite Eq. (18) as the recursion . In this way, we have the following equation
Based on the above derivation, it is apparent that can be computed by the iterative algorithm summarized in Alg. 2. FAD allows the program to update parameters after each step, which may significantly speed up the dynamic iterator and take up less memory when the number of hyper-parameters is much smaller than the number of parameters. In brief, FAD is efficient and convenient when there are few hyper-parameters, but it can be time-prohibitive when there are many.
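The forward recursion can be made concrete with a small sketch. It is an illustration under stated assumptions rather than a transcription of Alg. 2: the scalar toy problem (LL objective f(x, y) = 0.5 x y^2 - y, UL objective F(x, y) = 0.5 (y - 2)^2 + 0.5 x^2) and the names `a`, `b`, `z` for the stage derivative w.r.t. the previous iterate, the stage derivative w.r.t. the UL variable, and the accumulated Jacobian are all chosen here for illustration.

```python
def forward_mode_hypergradient(x, y0=0.0, num_steps=100, step_size=0.5):
    """Accumulate z_t = d y_t / d x alongside the LL iterates.

    Stage map (assumed): Psi(y, x) = y - s * (x * y - 1).
    """
    y, z = y0, 0.0                      # y_0 does not depend on x, so z_0 = 0
    for _ in range(num_steps):
        a = 1.0 - step_size * x         # d Psi / d y_{t-1}
        b = -step_size * y              # d Psi / d x, evaluated at y_{t-1}
        z = a * z + b                   # forward recursion for the Jacobian
        y = y - step_size * (x * y - 1.0)
    direct = x                          # d F / d x
    indirect = z * (y - 2.0)            # Jacobian times d F / d y
    return direct + indirect, y, z

hypergrad, y_T, z_T = forward_mode_hypergradient(0.8)
```

Only the current `y` and `z` are kept in memory, so the storage does not grow with the number of stages; in the vector case `z` becomes a matrix with one column per UL variable, which is why the cost grows with the number of hyper-parameters.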
Reverse-mode AD (RAD): RAD is a generalization of the back-propagation algorithm and is based on a Lagrangian formulation associated with the parameter optimization dynamics . By replacing by and incorporating Eq. (19) into Eq. (14), recent works (e.g., [140, 66, 65]) show that
Rather than calculating by forward propagation as that in FAD (i.e., Alg. 2), the computation of Eq. (20) can also be implemented by back-propagation. That is, we first define and . Then we update , and , with . Finally, we have that . Indeed, the above RAD calculation is structurally identical to back-propagation through time . Moreover, we can also derive it following the classical Lagrangian approach. That is, we reformulate Eq. (17) as the following constrained model
The corresponding Lagrangian function can be written as
where denotes the Lagrange multiplier associated with the -th stage of the dynamical system. The optimality conditions of Eq. (21) are obtained by setting all derivatives of to zero. Then, by simple algebra, we have . Overall, we present the RAD algorithm in Alg. 3.
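The backward computation described above can be sketched on the same kind of scalar toy problem. This is a hedged illustration, not Alg. 3: the LL objective f(x, y) = 0.5 x y^2 - y, the UL objective F(x, y) = 0.5 (y - 2)^2 + 0.5 x^2, and the names `alpha` (adjoint variable) and `trajectory` are assumptions of this example.

```python
def reverse_mode_hypergradient(x, y0=0.0, num_steps=100, step_size=0.5):
    """Forward pass stores the trajectory; backward pass propagates an adjoint.

    Stage map (assumed): Psi(y, x) = y - s * (x * y - 1).
    """
    trajectory = []                     # y_0, ..., y_{T-1}
    y = y0
    for _ in range(num_steps):
        trajectory.append(y)
        y = y - step_size * (x * y - 1.0)

    alpha = y - 2.0                     # adjoint initialised with d F / d y_T
    grad = x                            # starts from the direct gradient d F / d x
    for y_prev in reversed(trajectory):
        grad += (-step_size * y_prev) * alpha    # (d Psi / d x) * adjoint
        alpha = (1.0 - step_size * x) * alpha    # (d Psi / d y_{t-1}) * adjoint
    return grad

rad_grad = reverse_mode_hypergradient(0.8)
```

The result agrees with forward accumulation, but the whole trajectory must be stored, so memory grows with the number of stages while the cost no longer depends on the number of UL variables.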
Truncated RAD: In many practical applications, the two exact calculation methods above are tedious and time-consuming because they require full back-propagation. As mentioned above, due to the complicated long-term dependencies of the UL subproblem on , calculating Eq. (20) in RAD is a challenging task. This difficulty is further aggravated when both and are high-dimensional vectors. More recently, the truncation idea has been revisited to address this issue and shows competitive performance with significantly less computation time and memory [137, 11, 178]. Specifically, by ignoring long-term dependencies and approximating Eq. (20) with partial sums (i.e., storing only the last iterations), we have
where . It can be seen that ignoring long-term dependencies can greatly reduce the time and space complexities of computing the approximate gradients. The work in  has investigated the theoretical properties of the above truncated RAD scheme. In particular, the work  reports that few-step back-propagation often performs comparably to optimization with the exact gradient, while requiring far less memory and half the computation time.
One-stage RAD: Limited and expensive memory is often a bottleneck in modern massive-scale deep learning applications. However, multi-step iteration of the inner program consumes a large amount of memory , so a method that simplifies the iteration steps would be useful. Inspired by BLO, a variety of simplified and elegant techniques have been adopted to circumvent this issue (e.g., ). The work in  proposed another simplification of RAD, which considers a fixed initialization and performs only a one-step iteration in Eq. (15) to remove the recurrent structure from the gradient computation in Eq. (20), i.e.,
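A minimal sketch of the truncation idea follows, assuming the same scalar toy problem as a stand-in (LL objective f(x, y) = 0.5 x y^2 - y, UL objective F(x, y) = 0.5 (y - 2)^2 + 0.5 x^2). The parameter name `K` for the number of stored stages and the function name are assumptions of this example.

```python
from collections import deque

def truncated_rad(x, K, y0=0.0, num_steps=100, step_size=0.5):
    """Back-propagate through only the last K stages of the LL iteration."""
    last = deque(maxlen=K)              # keeps only y_{T-K}, ..., y_{T-1}
    y = y0
    for _ in range(num_steps):
        last.append(y)
        y = y - step_size * (x * y - 1.0)

    alpha = y - 2.0                     # d F / d y_T
    grad = x                            # direct gradient d F / d x
    for y_prev in reversed(last):
        grad += (-step_size * y_prev) * alpha
        alpha = (1.0 - step_size * x) * alpha
    return grad

full = truncated_rad(0.8, K=100)        # K equal to the number of stages: no truncation
approx_20 = truncated_rad(0.8, K=20)
approx_5 = truncated_rad(0.8, K=5)
```

On this toy problem each stage contracts the adjoint by a constant factor, so the truncation error decays geometrically in `K`; how quickly it decays on a real problem depends on the conditioning of the LL iteration.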
By formulating the dynamical system as that in Eq. (16), we then write as
Since calculating the Hessian in Eq. (25) is still time-consuming, we can further simplify the calculation by adopting a finite-difference approximation  (e.g., the central difference approximation) to avoid computing the Hessian matrix. The specific derivation can be formalised as follows:
in which . Note that is set to be a small scalar equal to the learning rate .
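The one-step simplification with a central difference can be sketched as below. This is an illustration under assumptions, not the cited method's implementation: the LL objective f(x, y) = sum_i (0.5 x y_i^2 - c_i y_i), the UL objective F(x, y) = 0.5 ||y - target||^2 + 0.5 x^2, and the names `eps`, `c` and `target` are hypothetical, and `eps` is kept as a separate argument rather than tied to the learning rate.

```python
def grad_f_x(x, y):
    """d f / d x for the assumed f(x, y) = sum_i (0.5 * x * y_i**2 - c_i * y_i)."""
    return 0.5 * sum(yi * yi for yi in y)

def grad_f_y(x, y, c):
    return [x * yi - ci for yi, ci in zip(y, c)]

def one_stage_hypergradient(x, y, c, target, step_size=0.1, eps=1e-2):
    # Single LL step from the fixed point y.
    y_new = [yi - step_size * gi for yi, gi in zip(y, grad_f_y(x, y, c))]
    # v = d F / d y evaluated at the updated LL variable.
    v = [yn - ti for yn, ti in zip(y_new, target)]
    # Central difference replaces the mixed second derivative times v.
    y_plus = [yi + eps * vi for yi, vi in zip(y, v)]
    y_minus = [yi - eps * vi for yi, vi in zip(y, v)]
    mixed = (grad_f_x(x, y_plus) - grad_f_x(x, y_minus)) / (2.0 * eps)
    direct = x                          # d F / d x
    return direct - step_size * mixed

approx = one_stage_hypergradient(0.8, [1.0, -0.5], [1.0, 2.0], [2.0, 0.0])
```

Only two extra gradient evaluations of the LL objective are needed, and no second derivative is ever formed. For this quadratic toy problem the central difference is exact up to rounding; in general it carries an error that depends on `eps`.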
5.2 Initialization-based EGBR
The research community has started moving towards the challenging goal of building general-purpose initialization-based optimization systems that are able to learn better initial parameters. Setting aside the recurrent structure, we consider the special setting of a family of algorithms for learning a parameter initialization, named initialization-based EGBR methods. In this series, MAML  is the most representative and important work. By making more practical assumptions about the coupling dependence of the two variables, these methods no longer use the full dynamical system to explicitly and accurately describe the dependency between and as discussed above in Eq. (21), but adopt a further simplified paradigm.
Specifically, by treating the iterative dynamical system with only the first step that is explicitly related to , this process can be written as
where represents the network initialization parameters, and represents the network parameters after performing some sort of update. Given initial condition , then we obtain the following simplified formula
where is the descent mapping of at the -th stage (e.g., ). Finally, we have the Jacobian matrix as follows
We then have to calculate the Hessian matrix term , which is time-consuming in practical computation. To reduce the computational load, we introduce below two remarkably simple classes of algorithms obtained through a series of approximate transformations. Among the various schemes for simplifying initialization-based EGBR approaches, first-order approximation (e.g., [148, 149]) and layer-wise transformation (e.g., [109, 105, 157, 61]) are among the most popular.
First-order Approximation: The most representative works (i.e., FOMAML  and Reptile ) adopt a first-order approximation, which alleviates the cost of computing the Hessian term without sacrificing much performance. Specifically, this approximation ignores the second-derivative term by removing the Hessian matrix , and then simplifies the substitution of performed by
In addition, there is another first-order extension that simplifies Eq. (29) through a difference approximation . This method no longer simply discards the Hessian term but instead approximates it in a softer way (i.e., and ), in which is the step size used in the gradient descent operation. Unlike , this method proposes to use different linear combinations of all steps rather than just the final step. Overall, the above algorithms can significantly reduce the computing costs while keeping roughly equivalent performance.
Layer-wise Transformation: There is also a series of learning-based BLOs related to layer-wise transformation, i.e., Meta-SGD , T-Net , Meta-Curvature  and WarpGrad . In addition to the initial parameters, this type of work focuses on learning some additional parameters (or transformations) at each layer of the network. Following Eq. (28) above, these methods can be uniformly formulated as follows
where defines the matrix transformation learned at each layer and is a vector (e.g., learning rate). For example, Meta-SGD  learns a vector of learning rates and corresponds to . Moreover, T-Net  learns block-diagonal preconditioning linear projections. Similarly, an additional block-diagonal preconditioning transformation is also performed by Meta-Curvature . The recent study WarpGrad  defines preconditioned gradients from a geometrical point of view. It replaces the linear projection with a non-linear preconditioning matrix , referred to as a warp layer.
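The difference between the exact one-step gradient with respect to the initialization and its first-order approximation can be seen on a scalar sketch. The quadratic training and validation losses, their coefficients, and the function name are assumptions made for illustration; the sketch covers only the variant that drops the Hessian, not the difference-based variant or Reptile.

```python
def one_step_meta_gradients(x, step_size=0.1,
                            a_tr=2.0, b_tr=1.0, a_val=1.0, b_val=3.0):
    """Exact and first-order meta-gradients w.r.t. the initialization x.

    Assumed losses: train(y) = 0.5 * a_tr * (y - b_tr)**2,
                    val(y)   = 0.5 * a_val * (y - b_val)**2.
    """
    y1 = x - step_size * a_tr * (x - b_tr)       # one adaptation step from x
    val_grad = a_val * (y1 - b_val)              # d val / d y1
    jacobian = 1.0 - step_size * a_tr            # d y1 / d x, contains the Hessian a_tr
    exact = jacobian * val_grad
    first_order = val_grad                       # Hessian term dropped
    return exact, first_order

exact, first_order = one_step_meta_gradients(0.0)
```

Here the two gradients differ only by the factor `1 - step_size * a_tr`, so they share the same sign and the first-order direction remains a descent direction; with vector parameters the dropped term is a full matrix and the directions can differ.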
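As a concrete instance of the simplest layer-wise transformation, an element-wise learned step size in the spirit of Meta-SGD can be sketched as follows. The separable quadratic losses and the names `alpha`, `h`, `b` and `target` are assumptions of this example; the block-diagonal and warp-layer variants are not covered.

```python
def meta_sgd_gradients(y0, alpha, h, b, target):
    """One adaptation step with per-coordinate step sizes alpha, plus the
    meta-gradients of the validation loss w.r.t. y0 and alpha.

    Assumed losses: train(y) = 0.5 * sum_i h_i * (y_i - b_i)**2,
                    val(y)   = 0.5 * sum_i (y_i - target_i)**2.
    """
    g = [hi * (yi - bi) for hi, yi, bi in zip(h, y0, b)]        # train gradient at y0
    y1 = [yi - ai * gi for yi, ai, gi in zip(y0, alpha, g)]     # element-wise scaled step
    v = [yi - ti for yi, ti in zip(y1, target)]                 # d val / d y1
    grad_alpha = [-gi * vi for gi, vi in zip(g, v)]             # d val / d alpha
    grad_y0 = [(1.0 - ai * hi) * vi for ai, hi, vi in zip(alpha, h, v)]
    return y1, grad_y0, grad_alpha

y1, grad_y0, grad_alpha = meta_sgd_gradients(
    y0=[0.0, 0.0], alpha=[0.1, 0.5], h=[2.0, 1.0], b=[1.0, -1.0], target=[1.0, 1.0])
```

Both the initialization and the step-size vector receive a meta-gradient from the same single adaptation step, which is what allows them to be learned jointly.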
5.3 Proxy-based EGBR
In fact, the main difficulty in solving high-dimensional BLOs is the approximation of the BR mapping. To alleviate this problem, proxy-based EGBR methods (e.g., [139, 9, 133]) have recently been proposed, which utilize a differentiable hyper-network (a neural network with parameters ) to approximate the BR mapping (or BR Jacobian ). These methods can be regarded as training the hyper-network to output optimal solutions for the LL subproblem. Instead of updating via , this method uses the hyper-network to output a set of weights given by . Such methods train a neural network that takes the variable as input and outputs an approximately optimal LL solution, which serves as the BR mapping . Specifically, it can be regarded as a data-driven network model that approximates the BR mapping directly with a parametric function
Then, and are jointly optimized through the following two steps: first, is updated so that in a neighborhood around the current UL variable ; second, is updated by using as a proxy for , performed by
In principle, the proxy network can learn a continuous BR and handle discrete UL variables. Unlike the global approximation (e.g., Self-Tuning Networks, STN ), the recent work in  locally approximates the BR mapping in a neighborhood around the current UL variable and updates by minimizing the objective where represents the perturbation noise added to , and is defined as a factorized Gaussian noise distribution with a fixed scale parameter . In particular, its BR function can be represented over a linear network with Jacobian norm regularization. To compute the BR Jacobian with a small memory cost, STN  models the BR mapping of each row in a layer’s weight matrix as a rank-one affine transformation of . In comparison to other hyper-networks (e.g., ), this approach replaces existing modules in deep learning libraries with hyper counterparts that accept an additional vector of UL variables as input and adapt online; thus it needs less memory to meet the performance requirements.
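The two alternating steps of a proxy-based method can be sketched on a scalar toy problem. Everything here is an assumption made for illustration: the LL objective f(x, y) = 0.5 (y - 2x)^2 (best response 2x), the UL objective F(x, y) = 0.5 (y - 1)^2 + 0.5 x^2 (UL minimizer x = 0.4), a linear proxy `w0 + w1 * x` standing in for the hyper-network, and Gaussian perturbations with a fixed scale `sigma`. It does not reproduce the architecture of any cited method.

```python
import random

def proxy_blo(x0=1.0, outer_steps=300, inner_steps=20,
              lr_w=0.1, lr_x=0.1, sigma=0.5, seed=0):
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0                   # hypothetical linear proxy y_w(x) = w0 + w1 * x
    x = x0
    for _ in range(outer_steps):
        # Step 1: fit the proxy around the current x by SGD on the LL
        # objective evaluated at perturbed points u = x + noise.
        for _ in range(inner_steps):
            u = x + rng.gauss(0.0, sigma)
            residual = (w0 + w1 * u) - 2.0 * u      # d f / d y at y = y_w(u)
            w0 -= lr_w * residual
            w1 -= lr_w * residual * u
        # Step 2: update x, using the proxy in place of the best response.
        y = w0 + w1 * x
        hypergrad = x + w1 * (y - 1.0)              # dF/dx + (dy_w/dx) * dF/dy
        x -= lr_x * hypergrad
    return x, (w0, w1)

x_star, (w0_star, w1_star) = proxy_blo()
```

The slope `w1` of the fitted proxy plays the role of the BR Jacobian in the UL update, so no unrolling or Hessian is required. The perturbation scale matters: with `sigma` close to zero the proxy sees a single input and its slope is not identifiable.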
6 Implicit Gradient for Best-Response
In contrast to the EGBR methods above, IGBR methods can in essence be interpreted as introducing the Implicit Function Theorem (IFT) to derive the BR Jacobian . Indeed, gradient-based BLO methodologies with implicit differentiation are radically different from EGBR methods and have been extensively applied in a string of applications (e.g., [134, 168]). As an example, a set of early IGBR approaches (e.g., [27, 176]) used implicit gradients to select hyper-parameters of kernel-based models. Unfortunately, these approaches become problematic when scaling to contemporary neural architectures with millions of weights. More recent approaches (i.e., [160, 20]) assert that the LL subproblem can be optimized only approximately by leveraging noisy gradients. Soon afterwards, motivated by the problem of setting the regularization parameter in the context of neural networks , IGBR approaches have been successfully applied to various vision and learning tasks .
Actually, when all stationary points of the LL subproblem have been verified to be global minima, the BR mapping can be further characterized by the first-order optimality condition, typically expressed as . Thus, from this perspective, the implicit equation (denoted by ) can be derived when is assumed to be invertible in advance. Eventually, by using the chain rule, the implicit gradient with respect to is computed as
On the theoretical side, Eq. (33) intuitively offers the desired exact gradient. However, computing a Hessian matrix and its inverse is a heavy burden in practice. Therefore, there are mainly two kinds of techniques to alleviate this computational cost, i.e., IGBR based on Linear System (LS) [160, 168] and Neumann Series (NS) .
Based on Linear System: To handle the Hessian matrix inverse more efficiently, these methods (e.g., HOAG , IMAML ) rely on solving linear systems, which is a common operation. Specifically, the solution of a linear equation is obtained (that is, ) to approximate . Based on the above derivation, it is apparent that can be directly computed by the algorithm summarized in Alg. 4.
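For reference, the standard implicit-differentiation argument behind the implicit gradient discussed above can be written out as follows. The notation is assumed here for illustration (UL variable x, LL variable y, UL objective F, LL objective f, best response y*(x)) and may differ from the symbols used in the numbered equations.

```latex
% First-order optimality of the LL problem, differentiated w.r.t. x:
\nabla_y f\bigl(x, y^*(x)\bigr) = 0
\;\Longrightarrow\;
\nabla^2_{yx} f + \nabla^2_{yy} f \,\frac{\partial y^*(x)}{\partial x} = 0
\;\Longrightarrow\;
\frac{\partial y^*(x)}{\partial x}
  = -\bigl(\nabla^2_{yy} f\bigr)^{-1} \nabla^2_{yx} f .

% Substituting into the chain rule for F(x, y^*(x)):
\frac{\mathrm{d} F\bigl(x, y^*(x)\bigr)}{\mathrm{d} x}
  = \nabla_x F
  - \bigl(\nabla^2_{yx} f\bigr)^{\top}
    \bigl(\nabla^2_{yy} f\bigr)^{-1} \nabla_y F ,
```

where all derivatives are evaluated at (x, y*(x)), the Hessian of f with respect to y is assumed invertible (and is symmetric), and the mixed derivative is the Jacobian of the y-gradient of f with respect to x. The expensive part is the product of the inverse Hessian with the y-gradient of F, which is the term that the linear-system and Neumann-series techniques approximate.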
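The linear-system idea can be sketched with a hand-written conjugate gradient solver that only needs Hessian-vector products. This is a generic illustration, not the procedure of Alg. 4 or of the cited methods: the quadratic LL objective f(x, y) = 0.5 y^T H y - x^T y with a fixed symmetric positive definite matrix `H`, the UL objective F(x, y) = 0.5 ||y - target||^2, and all function names are assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(matvec, b, iters=50, tol=1e-24):
    """Solve matvec(q) = b for a symmetric positive definite operator."""
    q = [0.0] * len(b)
    r = list(b)                         # residual b - matvec(q) with q = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        Hp = matvec(p)
        step = rs / dot(p, Hp)
        q = [qi + step * pi for qi, pi in zip(q, p)]
        r = [ri - step * hi for ri, hi in zip(r, Hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return q

H = [[2.0, 1.0], [1.0, 3.0]]            # assumed LL Hessian w.r.t. y

def hessian_vector_product(v):
    return [dot(row, v) for row in H]

def implicit_hypergradient(x, target):
    # LL solution: the stationarity condition H y = x is itself a linear system.
    y = conjugate_gradient(hessian_vector_product, x)
    grad_F_y = [yi - ti for yi, ti in zip(y, target)]
    # Solve H q = dF/dy instead of forming the inverse Hessian.
    q = conjugate_gradient(hessian_vector_product, grad_F_y)
    # The mixed derivative of f is -I here and dF/dx = 0, so the result is q.
    return q, y

hypergrad, y_star = implicit_hypergradient([1.0, 2.0], [0.0, 0.0])
```

The Hessian is touched only through `hessian_vector_product`, so in a larger setting the matrix never has to be stored or inverted; the solver is typically stopped after a few iterations, which makes the resulting gradient approximate.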