Bilevel optimization is at the center of several important machine learning problems such as hyperparameter tuning, data denoising, few-shot learning, and data poisoning. Unlike simultaneous or multi-objective optimization, obtaining the exact descent direction for continuous bilevel optimization requires computing the inverse of the Hessian of the lower-level cost function, even for first-order methods. In this paper, we propose a new method for solving bilevel optimization, using the penalty function, which avoids computing the inverse of the Hessian. We prove convergence of the method under mild conditions and show that it computes the exact hypergradient asymptotically. The small space and time complexity of our method allows us to solve large-scale bilevel optimization problems involving deep neural networks with up to 3.8M upper-level and 1.4M lower-level variables. We present results of our method for data denoising on MNIST/CIFAR10/SVHN datasets, for few-shot learning on Omniglot/Mini-Imagenet datasets and for training-data poisoning on MNIST/Imagenet datasets. In all experiments, our method outperforms or is comparable to previously proposed methods both in terms of accuracy and run-time.
Bilevel optimization appears in many fields of study where two competing parties or objectives are involved. In particular, a bilevel problem arises if one party makes its choice first, thereby affecting the optimal choice of the second party. This setting is also known as the Stackelberg model and dates back to the 1930s (Von Stackelberg, 2010). The general form of a bilevel optimization problem is
A bilevel problem with constraints restricts the upper- and lower-level variables to constraint sets, where the lower-level constraint set can depend on the upper-level variable. However, we focus on unconstrained problems in this paper. The ‘upper-level’ problem is a usual minimization problem except that the lower-level variable is constrained to be the solution to the ‘lower-level’ problem, which depends on the upper-level variable (see Bard (2013) for a review of bilevel optimization). Bilevel optimization also appears in many important machine learning problems, for example, gradient-based hyperparameter tuning (Domke, 2012; Maclaurin et al., 2015; Luketina et al., 2016; Pedregosa, 2016; Franceschi et al., 2017, 2018), data denoising by importance learning (Liu and Tao, 2016; Yu et al., 2017; Ren et al., 2018), few-shot learning (Ravi and Larochelle, 2017; Santoro et al., 2016; Vinyals et al., 2016; Franceschi et al., 2017; Mishra et al., 2017; Snell et al., 2017; Franceschi et al., 2018), and training-data poisoning (Maclaurin et al., 2015; Muñoz-González et al., 2017; Koh and Liang, 2017; Shafahi et al., 2018). We explain each of these problems and their bilevel formulations below.
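As a compact reference for the unconstrained setting described above, one common way to write a bilevel problem is sketched here. The symbols are notation assumed for illustration only: $u$ for the upper-level variable, $v$ for the lower-level variable, $f$ for the upper-level cost and $g$ for the lower-level cost.

```latex
\min_{u} \; f(u, v^{*})
\quad \text{s.t.} \quad
v^{*} = \arg\min_{v} \; g(u, v)
```

In this form the upper level only sees the lower-level variable through its optimal value $v^{*}$, which itself changes with $u$.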
Gradient-based hyperparameter tuning. Finding hyperparameters is an indispensable step in any machine learning problem. Grid search is a popular way of finding the optimal hyperparameters if the domain of the hyperparameters is a predetermined discrete set or a range. However, when the losses are differentiable functions of the hyperparameter(s), we can find optimal hyperparameter values by solving a continuous bilevel optimization. Let the upper-level variable denote the hyperparameter(s) and the lower-level variable the parameter(s) of a class of learning algorithms, with the hypothesis determined by both. The upper-level cost is then the validation loss, and the lower-level cost is the training loss, defined similarly. The best hyperparameter(s) is then the solution to the following problem
Thus, we find the best model parameters for each choice of the hyperparameter and select the hyperparameter value that incurs the smallest validation loss.
Data denoising by importance learning. A common assumption of learning is that the training examples are i.i.d. samples from the same distribution as the test data. However, if the training and testing distributions are not identical, or if the training examples have been corrupted by noise or modified by adversaries, this assumption is violated. In such cases, re-weighting the importance of each training example before training can help reduce the discrepancy between the two distributions. For example, one can up-weight the examples from the same distribution as the test data and down-weight the rest. This problem of finding the correct weight for each training example can be formulated as a bilevel optimization. Let the upper-level variable be the vector of non-negative importance values, one for each training example, and let the lower-level variable be the parameter(s) of a classifier. The lower-level cost is then the weighted training error. Also, assume that we can obtain a small number of examples from the same distribution as the test examples (clean validation examples). Then the importance learning problem can be formulated as:
Hence, the importance of each training example is selected such that the minimizer of the weighted training loss in the lower level also minimizes the validation loss in the upper level. The final importance values can help to identify good points in the noisy training set, and the classifier obtained after solving this optimization will perform better than a model trained on the noisy data.
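The lower-level cost of the importance learning problem can be illustrated with a short sketch. It assumes a softmax cross-entropy loss per example and a normalization by the number of training examples; both are illustrative choices, and the function names are hypothetical rather than taken from the released code.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def weighted_training_loss(importance, logits, labels):
    """Importance-weighted training error (assumed normalization: 1 / N)."""
    importance = np.asarray(importance, dtype=float)
    if np.any(importance < 0):
        raise ValueError("importance values must be non-negative")
    per_example = softmax_cross_entropy(logits, labels)
    return float(np.sum(importance * per_example) / len(labels))

# Two examples with the same given label; the second one is predicted as the
# other class, as would happen for a corrupted label.
logits = np.array([[2.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 0])
uniform = weighted_training_loss([1.0, 1.0], logits, labels)
clean_only = weighted_training_loss([1.0, 0.0], logits, labels)
```

Setting the importance of the suspicious example to zero removes its contribution, so `clean_only` is smaller than `uniform`. In the bilevel formulation these importance values are not set by hand but are chosen by the upper level.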
Meta-learning. A standard learning problem involves finding the best model from a class of hypotheses for a given task (i.e., data distribution). In contrast, meta-learning is the problem of learning a prior on the hypothesis classes (a.k.a. inductive bias) for a given set of tasks. Few-shot learning is an example of meta-learning, where a learner is trained on several related tasks during the meta-training phase so that it can generalize well to unseen (but related) tasks with just a few examples during the meta-testing phase. An effective approach to the few-shot learning problem is to learn a common representation for various tasks and train task-specific classifiers on top of this representation. Consider a map that takes raw features to a common representation for all tasks, together with one classifier per training task. The goal of few-shot learning is to learn both the parameters of the representation map (the upper-level variable) and the parameters of the set of classifiers (the lower-level variables). With a validation loss and a training loss defined for each task, the bilevel problem for few-shot learning is
For evaluation of the learned representation during the meta-test phase, the representation is kept fixed and only the classifiers for the new tasks are trained, one for each of the meta-test tasks.
Training-data poisoning. Recently, machine learning models were shown to be vulnerable to train-time attacks. Unlike test-time attacks, here the adversary modifies the training data so that the model learned from the altered training data performs poorly or differently compared to the model learned from clean data. The most popular train-time attack augments the original training data with one or more ‘poisoned’ examples to create the poisoned dataset; the lower-level cost is the loss on this poisoned training data. The problem of finding poisoning points that, when added to the clean training data, hurt the performance of the model trained on it can be formulated as
where the minus sign in the upper level is used to maximize the validation loss. This is the formulation for untargeted attacks. For targeted attacks, the upper level must minimize the validation loss with respect to the attacker's intended target labels. Another variant of the poisoning attack influences only the prediction of a single predetermined example. The upper-level cost for this attack is the loss over only this single example (see Eq. (10) in Appendix D.4.1).
Challenges of deep bilevel optimization. General bilevel problems cannot be solved using simultaneous optimization of the upper- and lower-level problems. Moreover, exact bilevel optimization is known to be NP-hard even for linear cost functions (Bard, 1991). In addition, recent deep learning models, with millions of variables, make it infeasible to use sophisticated methods beyond first-order methods. For bilevel problems, even first-order methods are difficult to apply since they require computing the inverse Hessian–gradient product to get the exact hypergradient (see Sec. 2.1). Since direct inversion of the Hessian is impractical, even for moderate-sized problems, many approaches have been proposed to approximate the exact hypergradient, including forward/reverse-mode differentiation (Maclaurin et al., 2015; Franceschi et al., 2017; Shaban et al., 2018) and approximate inversion by solving a linear system of equations (Domke, 2012; Pedregosa, 2016). However, there is still considerable room for improvement in these existing approaches in terms of their time and space complexities and practical performance.
Contributions. We propose a penalty function-based algorithm (Alg. 1) for solving large-scale unconstrained bilevel optimization. We prove its convergence under mild conditions (Theorem 2) and show that it computes the exact hypergradient asymptotically (Lemma 3). We present a complexity analysis of the algorithm showing that it has linear time and constant space complexity (Table 1), which makes our method superior to forward-mode and reverse-mode differentiation and similar to the approximate-inversion-based method. The small space and time complexity enables us to solve large-scale bilevel problems involving deep neural networks with up to 3.8M upper-level and 1.4M lower-level variables (Table 7 in Appendix). We show evaluation results on data denoising by importance learning, few-shot learning, and training-data poisoning problems. The proposed penalty-based method is competitive with state-of-the-art methods on simpler problems (with convex lower-level cost) and significantly outperforms other methods on complex problems (with non-convex lower-level cost), both in terms of accuracy (Sec. 3) and run-time (Table 2, Fig. 3).
The remainder of the paper is organized as follows. We present and analyze the main algorithm in Sec. 2, perform comprehensive experiments in Sec. 3, and conclude the paper in Sec. 4. Due to space limitations, proofs, experimental settings and additional results are presented in the appendix. All code is published on GitHub at https://github.com/jihunhamm/bilevel-penalty.
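Throughout the paper, we assume that the upper- and lower-level costs are twice continuously differentiable in both the upper- and lower-level variables. Our notation covers the gradient vectors of the costs with respect to each variable, the matrix of mixed second derivatives of the lower-level cost, and the Hessian matrix of the lower-level cost with respect to the lower-level variable. Additionally, following previous works, we assume that the lower-level solution is unique for every value of the upper-level variable and that this Hessian is invertible everywhere. Later in this section we discuss relaxations of some of these assumptions.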
Assuming we can express the solution to the lower-level problem explicitly as a function of the upper-level variable, we can write the bilevel problem as an equivalent single-level problem in the upper-level variable alone. We can use a gradient-based approach on this single-level problem and compute the total derivative of the upper-level cost with respect to the upper-level variable, called the hypergradient in previous approaches. Using the chain rule, the total derivative is
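In reality, the lower-level solution can be written explicitly only for trivial problems, but we can still compute its derivative with respect to the upper-level variable using the implicit function theorem. Since the gradient of the lower-level cost with respect to the lower-level variable vanishes at the lower-level solution, differentiating this condition with respect to the upper-level variable expresses the derivative of the solution through the inverse Hessian of the lower-level cost. Thus, the hypergradient is as follows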
Existing approaches (Domke, 2012; Maclaurin et al., 2015; Pedregosa, 2016; Franceschi et al., 2017) can be viewed as implicit methods of approximating the hypergradient, with distinct trade-offs in efficiency and complexity.
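To make the implicit-function-theorem expression concrete, the sketch below evaluates a hypergradient on a small quadratic problem and compares it with central finite differences. It assumes the notation u, v, f, g for the upper-/lower-level variables and costs, and the standard implicit-differentiation formula df/du = ∇_u f − ∇²_uv g (∇²_vv g)^{-1} ∇_v f. The toy costs, the matrices `A`, `B` and the vector `a` are illustrative choices, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_u, n_v = 3, 4
M = rng.standard_normal((n_v, n_v))
A = M @ M.T + n_v * np.eye(n_v)   # symmetric positive definite Hessian of g in v
B = rng.standard_normal((n_u, n_v))
a = rng.standard_normal(n_v)

def f(u, v):
    """Toy upper-level cost: 0.5 ||v - a||^2 + 0.5 ||u||^2."""
    return 0.5 * np.sum((v - a) ** 2) + 0.5 * np.sum(u ** 2)

def lower_solution(u):
    """Minimizer over v of the toy lower-level cost g = 0.5 v^T A v - u^T B v."""
    return np.linalg.solve(A, B.T @ u)

def hypergradient(u):
    """Implicit-function-theorem hypergradient at the exact lower-level solution."""
    v = lower_solution(u)
    grad_u_f = u
    grad_v_f = v - a
    hess_vv_g = A
    hess_uv_g = -B
    # Solve a linear system instead of forming the inverse Hessian explicitly.
    q = np.linalg.solve(hess_vv_g, grad_v_f)
    return grad_u_f - hess_uv_g @ q

def finite_difference(u, eps=1e-6):
    """Central differences of u -> f(u, v*(u)), used only as a check."""
    grad = np.zeros_like(u)
    for i in range(u.size):
        step = np.zeros_like(u)
        step[i] = eps
        grad[i] = (f(u + step, lower_solution(u + step))
                   - f(u - step, lower_solution(u - step))) / (2 * eps)
    return grad

u0 = rng.standard_normal(n_u)
max_gap = np.max(np.abs(hypergradient(u0) - finite_difference(u0)))
```

Even in this tiny case the cost is dominated by the linear solve with the lower-level Hessian, which is the step that becomes impractical for deep models.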
A bilevel problem can be considered as a constrained optimization problem since the lower-level optimality is a constraint, in addition to any other constraint in the upper- and the lower-level problems. In this work, we focus on unconstrained bilevel problems i.e. those without any additional constraints on the upper- and lower-level problems. For solving bilevel problems the lower-level problem is often replaced by its necessary optimality condition, resulting in the following problem:
For general bilevel problems, Eq. (8) and Eq. (1) are not the same (Dempe and Dutta, 2012). However, when the lower-level cost is convex in the lower-level variable for each value of the upper-level variable, and the lower-level solution is unique for each such value, Eq. (8) is equivalent to Eq. (1).
We now describe the penalty function approach for solving bilevel optimization. The penalty function method is a well-known approach for solving constrained optimization problems (see Bertsekas (1997) for a review) and has previously been applied to bilevel problems. However, it was analyzed under strict assumptions, and only high-level descriptions of the algorithm were presented (Aiyoshi and Shimizu, 1984; Ishizuka and Aiyoshi, 1992). The penalty function adds to the original upper-level cost a quadratic penalty term that penalizes the violation of the necessary condition for lower-level optimality. Consider the minimizer of the penalty function for a given value of the penalty parameter:
Then the following convergence result is known.
Even though this is a strong result, it is not very practical, since the minimum needs to be computed exactly for each value of the penalty parameter, and moreover the result relies on convexity assumptions on the costs. In our approach, we allow approximately optimal solutions of Eq. (9), up to a tolerance, and show convergence to a KKT point of Eq. (8) without requiring convexity.
Alg. 1 describes our method, in which we minimize the penalty function in Eq. (9) alternately over the lower- and upper-level variables. It is essential to note that our method solves a single-level penalty function (Eq. (9)) and does not need any intermediate step to compute the approximate hypergradient. Other methods, by contrast, first approximate the solution to the lower-level problem of Eq. (1) and then use an intermediate step (solving a linear system or using reverse/forward-mode differentiation) to compute the approximate hypergradient. Lemma 3 (below) shows when the approximate gradient direction computed by Alg. 1 becomes the exact hypergradient of Eq. (7) for bilevel problems.
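Lemma 3. Given a value of the upper-level variable, let the lower-level variable be the minimizer of the penalty function in Eq. (9) for that value. Then the gradient of the penalty function with respect to the upper-level variable equals the hypergradient of Eq. (7) evaluated at that point.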
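Thus, if we find the minimizer of the penalty function over the lower-level variable for a given upper-level variable and penalty parameter, Alg. 1 computes the exact hypergradient of Eq. (7) at that point. Furthermore, under the conditions of Theorem 1, as the penalty parameter grows and this minimizer approaches the lower-level solution, we get the exact hypergradient asymptotically.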
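The alternating scheme can be sketched on a two-dimensional quadratic problem. This is a minimal illustration, not the released implementation: it assumes the penalty function has the form f(u, v) + (γ/2)‖∇_v g(u, v)‖², uses a geometric schedule for the penalty parameter γ, and omits the tolerance test of the full algorithm. The toy costs, step sizes and all names are hypothetical choices made so that the behaviour is easy to verify.

```python
import numpy as np

B = np.array([[1.0, 0.0], [0.5, 1.0]])  # couples u and v in the lower level
a = np.array([1.0, 2.0])                # target for v in the upper level

# Toy costs: f(u, v) = 0.5 ||v - a||^2 + 0.5 ||u||^2,
#            g(u, v) = 0.5 ||v - B^T u||^2, so the lower-level solution is B^T u.

def grad_v_g(u, v):
    """Gradient of the lower-level cost in v (zero at lower-level optimality)."""
    return v - B.T @ u

def penalty_grads(u, v, gamma):
    """Gradients of f + (gamma / 2) ||grad_v g||^2 with respect to u and v."""
    r = grad_v_g(u, v)
    grad_u = u - gamma * (B @ r)
    grad_v = (v - a) + gamma * r
    return grad_u, grad_v

def penalty_method(num_outer=200, T=1, gamma0=1.0, c_gamma=1.05):
    u, v, gamma = np.zeros(2), np.zeros(2), gamma0
    sigma = 1.0 / (1.0 + np.linalg.norm(B, 2) ** 2)  # safe u-step for this toy problem
    for _ in range(num_outer):
        # In this toy problem the Hessian of the penalty in v is (1 + gamma) I,
        # so one step of size 1 / (1 + gamma) already minimizes over v exactly.
        rho = 1.0 / (1.0 + gamma)
        for _ in range(T):
            _, grad_v = penalty_grads(u, v, gamma)
            v = v - rho * grad_v
        grad_u, _ = penalty_grads(u, v, gamma)
        u = u - sigma * grad_u
        gamma *= c_gamma  # slowly increase the penalty parameter
    return u, v, gamma

u_pen, v_pen, gamma_final = penalty_method()
# Closed-form bilevel solution of the toy problem, for comparison.
u_star = np.linalg.solve(np.eye(2) + B @ B.T, B @ a)
```

No linear system with the lower-level Hessian is solved anywhere in the loop; the u-update only uses first-order information of the penalty function. For a general lower-level cost the v-update would need several gradient steps rather than one.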
Comparison with other methods. Many methods have been proposed previously to solve bilevel optimization problems that appear in machine learning, including forward/reverse-mode differentiation (FMD/RMD) (Maclaurin et al., 2015; Franceschi et al., 2017; Shaban et al., 2018) and approximate hypergradient computation by solving a linear system (ApproxGrad) (Domke, 2012; Pedregosa, 2016). For completeness, we describe these methods briefly in Appendix B. Table 1 shows the trade-offs of these methods for computing the hypergradient. As the problem grows, FMD and RMD become impractical due to time complexity and space complexity, respectively, whereas ApproxGrad and Penalty have the same linear time complexity and constant space complexity, which is a big advantage over FMD and RMD. However, complexity analysis does not show the quality of the hypergradient approximation of each method. In Sec. 3.1 we show empirically, on synthetic examples, that the proposed penalty method has better convergence properties than all the other methods. Since ApproxGrad and Penalty have the same complexities, we also compare the two methods on real data and show that Penalty is twice as fast as ApproxGrad (Fig. 3).
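The two families of baselines can be contrasted on a toy quadratic problem. The sketch below is a simplified illustration under stated assumptions: the lower-level process is T plain gradient-descent steps started from zero, reverse-mode differentiation backpropagates through those steps, and the linear-system variant solves for the inverse Hessian–gradient product with M gradient steps. The costs, the step size `RHO` and all function names are hypothetical.

```python
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])  # Hessian of g in v (positive definite)
B = np.array([[1.0, 0.0], [0.5, 1.0]])
a = np.array([1.0, 2.0])
RHO = 0.4                                # step size, below 2 / lambda_max(A)

# Toy costs: f(u, v) = 0.5 ||v - a||^2 + 0.5 ||u||^2, g(u, v) = 0.5 v^T A v - u^T B v.

def lower_steps(u, T):
    """T gradient-descent steps on g(u, .) starting from v = 0."""
    v = np.zeros(2)
    for _ in range(T):
        v = v - RHO * (A @ v - B.T @ u)
    return v

def exact_hypergradient(u):
    v_star = np.linalg.solve(A, B.T @ u)
    return u + B @ np.linalg.solve(A, v_star - a)

def rmd_hypergradient(u, T):
    """Reverse-mode differentiation through the T unrolled lower-level steps."""
    v = lower_steps(u, T)
    q = v - a                          # gradient of f in v at (u, v_T)
    p = u.copy()                       # gradient of f in u
    step_jac_v = np.eye(2) - RHO * A   # d v_{t+1} / d v_t
    step_jac_u = RHO * B.T             # d v_{t+1} / d u
    for _ in range(T):
        p = p + step_jac_u.T @ q
        q = step_jac_v.T @ q
    return p

def linear_system_hypergradient(u, T, M):
    """Approximate the inverse Hessian-gradient product with M gradient steps."""
    v = lower_steps(u, T)
    b = v - a
    q = np.zeros(2)
    for _ in range(M):
        q = q - RHO * (A @ q - b)      # descent on 0.5 q^T A q - b^T q
    return u + B @ q                   # grad_u f - hess_uv_g @ q, with hess_uv_g = -B

u = np.array([1.0, -1.0])
exact = exact_hypergradient(u)
rmd_error_short = np.linalg.norm(rmd_hypergradient(u, 1) - exact)
rmd_error_long = np.linalg.norm(rmd_hypergradient(u, 50) - exact)
```

With few lower-level steps both estimates are far from the exact hypergradient, and they approach it as T (and M) grow. In this toy problem the step Jacobians are constant, so nothing has to be stored; for a general lower-level cost, reverse mode must keep the lower-level trajectory, which is the source of its space cost.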
Improvements. A caveat to these theoretical guarantees is that some of the assumptions made for the analysis may not be satisfied in practice. Here we discuss simple techniques to address these problems and improve Alg. 1 further. The first problem is related to non-convexity of the lower-level cost: a point where the gradient of the lower-level cost vanishes can be either a minimum or a maximum of that cost. To address this, we modify the lower-level update for Eq. (9) by adding a ‘regularization’ term to the cost so that the update finds a minimum of the lower-level cost. This only affects the optimization in the beginning; the final solution remains unaffected with or without regularization. The second problem is that the tolerance may not be satisfied in a limited time, and the optimization may terminate before the penalty parameter becomes large enough. A cure for this is the method of multipliers and the augmented Lagrangian (Bertsekas, 1976), which allows the penalty method to find a solution with a finite penalty parameter. Thus we add a Lagrange multiplier term to the penalty function (Eq. (9)) and use the method of multipliers to update the multiplier. In summary, we use the following update rules in the paper.
These improvements are helpful in theory but the empirical difference was only moderate (see Appendix C for details).
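For orientation, the textbook method of multipliers applied to the constraint that the lower-level gradient vanishes can be written as below. This is a hedged sketch rather than the exact rules used in the experiments: the symbols $\gamma_k$ (penalty parameter), $\nu_k$ (multiplier) and $\lambda_k$ (weight of the regularization term) are assumed names, and the precise form and schedule of the regularization are assumptions.

```latex
\tilde{f}(u, v; \gamma, \nu)
  = f(u, v) + \frac{\gamma}{2}\,\lVert \nabla_v g(u, v) \rVert^{2}
    + \nu^{\top} \nabla_v g(u, v)

v_{k+1} \approx \arg\min_{v}\; \tilde{f}(u_k, v; \gamma_k, \nu_k) + \lambda_k\, g(u_k, v)

u_{k+1} \approx \arg\min_{u}\; \tilde{f}(u, v_{k+1}; \gamma_k, \nu_k)

\nu_{k+1} = \nu_k + \gamma_k\, \nabla_v g(u_{k+1}, v_{k+1})
```

The multiplier update is the standard one for an equality constraint $c(x) = 0$ with augmented Lagrangian $f + \nu^{\top} c + \tfrac{\gamma}{2}\lVert c \rVert^{2}$, here with $c = \nabla_v g$.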
In this section, we evaluate the performance of the proposed penalty method (Penalty) on various machine learning problems discussed in the introduction. We compare Penalty against both bilevel and non-bilevel solutions to these problems previously reported in the literature.
We start by comparing Penalty (ours) with gradient descent (GD), reverse-mode differentiation (RMD), and the approximate hypergradient method (ApproxGrad) on synthetic examples. We omit the comparison with forward-mode differentiation (FMD) because of its impractical time complexity for larger problems. GD refers to alternating minimization, in which gradient steps on the upper-level cost and on the lower-level cost are taken in turn. For RMD, we implemented a simple version of the method using vanilla gradient descent. For ApproxGrad, we implemented our own GPU-compatible version (which uses Hessian-vector products, mini-batches and gradient descent rather than conjugate gradient descent for solving the linear system) of the algorithm proposed by Pedregosa (2016). Using simple quadratic surfaces for the upper- and lower-level costs, we compare all the algorithms by observing their convergence as a function of the number of upper-level iterations while varying the number of lower-level updates (T) used for computing the hypergradient update. Since the synthetic examples are not learning problems, we measure convergence only by the Euclidean distance of the current iterate from the closest optimal solution. Fig. 1 shows the performance on two 10-dimensional examples described in the caption (see Appendix D.1). As one would expect, increasing the number of lower-level updates makes all the algorithms better, since more lower-level iterations make the hypergradient estimate more accurate (Eq. (7)), but it also increases the run-time of the methods. However, even for these examples, only Penalty and ApproxGrad converge to the optimal solution, while GD and RMD converge to non-solution points (regardless of T). Moreover, from Fig. 1(b), we see that Penalty converges even with T=1 while ApproxGrad requires at least T=10 to converge, which shows that our method approximates the hypergradient accurately with smaller T. This directly translates to a smaller run-time for our method compared to ApproxGrad, since the run-time is directly proportional to T (see Table 1).
Figure caption (partial): the example uses a rank-deficient random matrix. The mean curve (blue) is superimposed on 20 independent trials (yellow).
In Fig. 2 we show examples similar to Fig. 1 but with ill-conditioned or singular Hessian for the lower-level problem. The ill-conditioning poses difficulty for the methods since the implicit function theorem requires the invertibility of the Hessian at the solution point. Compared to Fig. 1, Fig. 2 shows that only Penalty converges to the true solution despite the fact that we add regularization in ApproxGrad to improve the ill-conditioning when solving the linear systems by minimization. We ascribe the robustness of Penalty to its simplicity and to the fact that it naturally handles non-uniqueness of the lower-level solution (see Appendix C.3). Additionally, we report the wall clock times for different methods on the four examples tested here in Table 2. We can see that as we increase the number of lower-level iterations all methods get slower but Penalty is faster than both RMD and ApproxGrad. Penalty is slower than GD but as shown in Fig. 1 and Fig. 2, GD does not converge to optima for most of the synthetic examples.
Now, we evaluate the performance of Penalty for learning a classifier from a dataset with corrupted labels (training data). We pose the problem as the importance learning problem presented in Eq. (3). We evaluate the classifier learned by Penalty, with 20 lower-level updates, against the following classifiers. Oracle: a classifier trained on the portion of the training data with clean labels and the validation data. Val-only: a classifier trained only on the validation data. Train+Val: a classifier trained on the entire training and validation data. ApproxGrad: a classifier trained with our implementation of ApproxGrad, with 20 lower-level and 20 linear system updates. We test the performance on the MNIST, CIFAR10 and SVHN datasets with validation set sizes of 1000, 10000 and 1000 points, respectively. We used convolutional neural networks (architectures described in Appendix D.2) at the lower level for this task. Table 3 summarizes our results for this problem and shows that Penalty outperforms Val-only, Train+Val and ApproxGrad by significant margins and in fact performs very close to the Oracle classifier (which is the ideal classifier), even for high noise levels. This demonstrates that Penalty is effective in solving bilevel problems involving several million variables (see Table 7 in Appendix) and in handling non-convex problems. Along with the improvement in accuracy over other bilevel methods like ApproxGrad, Penalty also gives a better run-time per upper-level iteration, leading to a decrease in the overall run-time of the experiments (Fig. 3(a)).
We compared the performance of Penalty against the RMD-based method presented in Franceschi et al. (2017), using their setting from Sec. 5.1, which is a smaller version of this data denoising task. For this, we chose a sample of 5000 training, 5000 validation and 10000 test points from MNIST, randomly corrupted the labels of 50% of the training points, and used softmax regression in the lower level of the bilevel formulation (Eq. (3)). The accuracy of the classifier trained on the validation set together with the subset of training points whose importance values (as computed by Penalty) are greater than 0.9 is 90.77%. This is better than the accuracy obtained by Val-only (90.54%), Train+Val (86.25%) and the RMD-based method (90.09%) used by Franceschi et al. (2017), and is close to the accuracy achieved by the Oracle classifier (91.06%).
|Non-bilevel Approaches|||Bilevel Approaches|||
|MAML (Finn et al., 2017)|(Snell et al., 2017)|SNAIL (Mishra et al., 2017)|RMD (Franceschi et al., 2018)|ApproxGrad|Penalty|
Table 4 (caption, partial): results are reported with s.d. for Omniglot and 95% confidence intervals for Mini-Imagenet over five trials. For bilevel approaches (Penalty, ApproxGrad and RMD (Franceschi et al., 2018)), the result is averaged over 600 randomly-sampled tasks from the meta-test set.
Next, we evaluate the performance of Penalty on the task of learning a common representation for the few-shot learning problem. We use the formulation presented in Eq. (4) and the Omniglot (Lake et al., 2015) and Mini-ImageNet (Vinyals et al., 2016) datasets for our experiments. Following the protocol proposed by Vinyals et al. (2016) for N-way K-shot classification, we generate meta-training and meta-testing datasets. Each meta-set is built using images from disjoint classes. For Omniglot, our meta-training set comprises images from the first 1200 classes, and the remaining 423 classes are used in the meta-testing dataset. We also augment the meta-datasets with three different rotations (90, 180 and 270 degrees) of the images, as done by Santoro et al. (2016). For the experiments with Mini-ImageNet, we used the split of 64 classes in meta-training, 16 classes in meta-validation and 20 classes in meta-testing, as used by Ravi and Larochelle (2017).
Each meta-batch of the meta-training and meta-testing datasets comprises a number of tasks, called the meta-batch-size. Each task in the meta-batch consists of a training set and a testing set, the latter with 15 images from the classes of the task. We train Penalty using a meta-batch-size of 30 for 5-way and 15 for 20-way classification on Omniglot, and a meta-batch-size of 2 for the Mini-ImageNet experiments. The training sets of the meta-train-batch are used to train the lower-level problem and the test sets are used as validation sets for the upper-level problem in Eq. (4). The final accuracy is reported on the meta-test-set, for which we fix the common representation learnt during meta-training. We then train the classifiers at the lower level for 100 steps using the training sets from the meta-test-batch and evaluate the performance of each task on the associated test set from the meta-test-batch. The average performance of Penalty and ApproxGrad over 600 tasks is reported in Table 4. Penalty outperforms the other bilevel methods, namely ApproxGrad (trained with 20 lower-level iterations and 20 updates for the linear system) and the RMD-based method (Franceschi et al., 2018), on Mini-Imagenet and is comparable to them on Omniglot. We also show the trade-off between using a higher T and run-time for ApproxGrad and Penalty in Fig. 3(b), which shows that Penalty achieves the same accuracy as ApproxGrad in almost half the run-time. In comparison to non-bilevel approaches, Penalty is comparable to most approaches but is slightly worse than Mishra et al. (2017), which makes use of temporal convolutions and soft attention.
We used a four-layer convolutional neural network with 64 filters per layer for Omniglot, and a residual network with four residual blocks followed by two convolutional layers for Mini-ImageNet, to learn the common task representation (upper-level variable). The lower-level problem uses logistic regression to learn the task-specific classifiers (lower-level variables). We also normalize the input-weight dot product before taking the softmax, similar to the cosine normalization proposed by Luo et al. (2018). The bilevel problem for Mini-ImageNet has 3.8M upper-level variables, which is the largest among all the experiments presented in this paper (Table 7 in Appendix).
Table 5 (section headers): Untargeted Attacks (lower accuracy is better); Targeted Attacks (higher accuracy is better).
Next, we evaluate Penalty on the task of generating poisoned training data, such that models trained on this data perform poorly or differently compared to models trained on the clean data (Mei and Zhu, 2015; Muñoz-González et al., 2017; Koh and Liang, 2017). We use the same setting as Sec. 4.2 of Muñoz-González et al. (2017) and test both untargeted and targeted data poisoning on MNIST using the data augmentation technique. Here, we assume that regularized logistic regression will be used as the classifier during training. The poisoned points obtained after solving Eq. (5) by the various methods are added to the clean training set, and the performance of a new classifier trained on this data is used to report the results in Table 5. For the untargeted attack, our aim is to lower the overall performance of the classifier on the clean test set. For this experiment, we select a random subset of 1000 training, 1000 validation and 8000 testing points from MNIST and initialize the poisoning points with random instances from the training set, but assign them incorrect random labels. We use these poisoned points along with the clean training data to train logistic regression in the lower-level problem of Eq. (5). For targeted attacks, we aim to misclassify images of eights as threes. For this, we selected a balanced subset (each of the 10 classes is represented equally in the subset) of 1000 training, 4000 validation and 5000 testing points from the MNIST dataset. Then we select images of class 8 from the validation set, label them as 3, and use only these images for the upper-level problem in Eq. (5), with the difference that we now want to minimize the error in the upper level instead of maximizing it (meaning there is no negative sign in the upper level of Eq. (5)). To evaluate the performance, we selected images of 8 from the test set, labeled them as 3, and report the performance on this modified subset of the original test set in the targeted attack section of Table 5.
For this experiment the poisoned points are initialized with images of classes 3 and 8 from the training set, with flipped labels. We did this since images of threes and eights are the only ones involved in the poisoning. We compare the performance of Penalty against the performance reported using RMD in Muñoz-González et al. (2017) and against ApproxGrad. For ApproxGrad, we used 20 lower-level and 20 linear system updates to report the results in Table 5. We see that Penalty significantly outperforms the RMD-based method and performs similarly to ApproxGrad. However, in terms of wall-clock time Penalty has an advantage over ApproxGrad (see Fig. 3(c)). We also compared the methods against a label-flipping baseline in which we select poisoned points from the validation sets and change their labels (randomly for untargeted attacks, and by mislabeling threes as 8 and eights as 3 for targeted attacks). All bilevel methods beat this baseline, showing that solving the bilevel problem can generate much better poisoning points. Examples of the poisoned points for untargeted and targeted attacks generated by Penalty are shown in Figs. 5 and 6 in Appendix D.4.
Additionally, we tested Penalty on the task of generating a clean-label poisoning attack (Koh and Liang, 2017; Shafahi et al., 2018), where the goal is to learn poisoned points that will be assigned correct labels when visually inspected by an expert, but that can cause misclassification of specific target images when the classifier is trained on these poisoned points along with clean data. We used the dog vs. fish dataset and followed the setting in Sec. 5.2 of Koh and Liang (2017), achieving 100% attack success with just a single poisoned point per target image, compared to 57% attack success in the original paper. A recent method (Shafahi et al., 2018) also reports 100% attack success on this same task. Details of the experiment are presented in Appendix D.4.1.
Finally, we compare Penalty and ApproxGrad on accuracy and time in Fig. 3 as we vary the number of lower-level iterations T in the experiments. Intuitively, a larger T corresponds to a more accurate approximation of the hypergradient and therefore a better result for all methods, but it comes with a space and time cost. The figure shows that, past a moderate number of lower-level iterations, the relative improvement is small in comparison to the increased run-time for both Penalty and ApproxGrad. Based on this result we used 20 lower-level iterations in all our experiments on real data. The figure also shows that even though Penalty and ApproxGrad have the same linear time complexity (Table 1), Penalty is about twice as fast as ApproxGrad in wall-clock time.
A wide range of interesting machine learning problems can be expressed as bilevel optimization problems, and new applications are still being discovered. So far, the difficulty of solving bilevel optimization has limited its widespread use for large-scale problems, especially those involving deep models. In this paper we presented an efficient algorithm based on the penalty function to solve bilevel optimization, which is simple and has theoretical and practical advantages over existing methods. Compared to previous methods, we demonstrated competitive performance on problems with convex lower-level costs and significant improvement on problems with non-convex lower-level costs, both in terms of accuracy and time, highlighting the practical effectiveness of our penalty-based method. In future work, we plan to tackle other challenges in bilevel optimization, such as handling additional constraints in both the upper and lower levels.
The proof follows the standard proof for penalty function methods, e.g., . Let refer to the pair, and let be any limit point of the sequence , then there is a subsequence such that . From the tolerance condition
Take the limit with respect to the subsequence on both sides to get
Since is a tall matrix and is invertible by assumption, is full-rank and therefore , which is the primary feasibility condition in Eq. (8). Furthermore, let , then by definition,
We can write
At the minimum the gradient vanishes, that is
Equivalently, . Then,
where disappears, which is the hypergradient . ∎
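The limiting behaviour claimed in the proof can be checked numerically on a small problem. The sketch below is an illustration only, not the paper's experiment: it assumes a hypothetical quadratic toy problem with upper-level cost f(u, v) = 0.5‖v − t‖² + 0.5·alpha·‖u‖², lower-level cost g(u, v) = 0.5 vᵀAv − vᵀBu, and penalty function f + (gamma/2)‖∇_v g‖². All names (`A`, `B`, `t`, `alpha`, `penalty_gradient`) are made up for this example. For this toy problem the inner minimizer is available in closed form, so the gap between the penalty gradient and the exact hypergradient can be measured directly as gamma grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_u, n_v = 3, 4
M = rng.standard_normal((n_v, n_v))
A = M @ M.T + n_v * np.eye(n_v)   # SPD Hessian of the toy lower-level cost
B = rng.standard_normal((n_v, n_u))
t = rng.standard_normal(n_v)
u = rng.standard_normal(n_u)
alpha = 0.1

def hypergradient(u):
    """Exact hypergradient, using the closed-form lower-level solution."""
    v_star = np.linalg.solve(A, B @ u)            # argmin_v g(u, v)
    return alpha * u + B.T @ np.linalg.solve(A, v_star - t)

def penalty_gradient(u, gamma):
    """Gradient w.r.t. u of the penalty function at its minimizer in v."""
    # Stationarity in v: (v - t) + gamma * A (A v - B u) = 0
    v = np.linalg.solve(np.eye(n_v) + gamma * A @ A, t + gamma * A @ (B @ u))
    residual = A @ v - B @ u                      # grad_v g, the constraint residual
    return alpha * u - gamma * B.T @ residual     # no Hessian inverse is formed

exact = hypergradient(u)
errors = [np.linalg.norm(penalty_gradient(u, g) - exact) for g in (1.0, 1e2, 1e4)]
for g, e in zip((1.0, 1e2, 1e4), errors):
    print(f"gamma = {g:8.0e}   ||penalty gradient - hypergradient|| = {e:.2e}")
```

The error shrinks roughly in proportion to 1/gamma here, which matches the asymptotic statement of the theorem for this special case; it says nothing about non-convex lower-level costs.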
Several methods have been proposed to solve bilevel optimization problems appearing in machine learning, including forward/reverse-mode differentiation [19, 10] and approximate gradient [8, 25] described briefly here.
Forward-mode (FMD) and reverse-mode differentiation (RMD). Domke, Maclaurin et al., Franceschi et al., and Shaban et al. studied forward- and reverse-mode differentiation to solve the minimization problem where the lower-level variable follows a dynamical system. This setting is more general than that of a bilevel problem. However, a stable dynamical system is one that converges to a steady state, and thus the process can be considered as minimizing an energy or a potential function.
Define and , then the hypergradient Eq. (7) can be computed by
When the lower-level process is one step of gradient descent on a cost function , that is,
The sequences and can be computed in forward or reverse mode. For reverse-mode differentiation, first compute
The final hypergradient is . For forward-mode differentiation, simultaneously compute , , and
The final hypergradient is
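To make the two accumulation orders concrete, the following sketch unrolls gradient descent on a hypothetical quadratic lower-level cost g(u, v) = 0.5 vᵀAv − vᵀBu and differentiates the upper-level cost f(u, v) = 0.5‖v − t‖² + 0.5·alpha·‖u‖² through the unrolled iterations. The toy problem, the names `J_v`, `J_u`, `Z`, `q`, and the choice of step size are assumptions made for illustration. Because the toy cost is quadratic, the per-step Jacobians are constant; in general they change at every step, and reverse mode has to store the trajectory.

```python
import numpy as np

rng = np.random.default_rng(1)
n_u, n_v, T = 3, 4, 500
M = rng.standard_normal((n_v, n_v))
A = M @ M.T + n_v * np.eye(n_v)           # SPD Hessian of the toy lower-level cost
B = rng.standard_normal((n_v, n_u))
t = rng.standard_normal(n_v)
u = rng.standard_normal(n_u)
alpha = 0.1
rho = 1.0 / np.linalg.eigvalsh(A).max()   # step size keeping the iteration stable

J_v = np.eye(n_v) - rho * A               # Jacobian of one step w.r.t. v
J_u = rho * B                             # Jacobian of one step w.r.t. u

# Unrolled lower-level process: v <- v - rho * grad_v g(u, v), starting from 0.
v = np.zeros(n_v)
for _ in range(T):
    v = v - rho * (A @ v - B @ u)
grad_u_f, grad_v_f = alpha * u, v - t

# Forward mode: propagate Z = dv/du alongside the iterations.
Z = np.zeros((n_v, n_u))
for _ in range(T):
    Z = J_v @ Z + J_u
forward = grad_u_f + Z.T @ grad_v_f

# Reverse mode: propagate the adjoint q backwards from the final iterate.
q = grad_v_f.copy()
reverse = grad_u_f.copy()
for _ in range(T):
    reverse += J_u.T @ q
    q = J_v.T @ q

# Reference value from the closed-form lower-level solution.
v_star = np.linalg.solve(A, B @ u)
exact = alpha * u + B.T @ np.linalg.solve(A, v_star - t)
print(np.abs(forward - reverse).max(), np.abs(forward - exact).max())
```

Forward mode carries an n_v × n_u matrix, so its cost grows with the number of upper-level variables, while reverse mode carries a single vector of length n_v but needs the stored iterates when the Jacobians are not constant.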
Approximate hypergradient (ApproxGrad). Since computing the inverse of the Hessian directly is difficult even for moderately-sized neural networks, Domke  proposed to find an approximate solution to by solving the linear system of equations . This can be done by solving
using conjugate gradient or any other method. Note that the minimization requires evaluation of the Hessian-vector product, which can be done in linear time. The asymptotic convergence with approximate solutions was shown by Pedregosa.
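The approximate-hypergradient idea can be sketched with a plain conjugate gradient solver that touches the Hessian only through Hessian-vector products. This is a minimal illustration on a hypothetical quadratic toy problem (lower-level cost g(u, v) = 0.5 vᵀAv − vᵀBu, upper-level cost f(u, v) = 0.5‖v − t‖² + 0.5·alpha·‖u‖²), not the implementation used in the experiments. The function `conjugate_gradient` and the callable `hvp` are names introduced here; for a neural network the product would come from automatic differentiation rather than from an explicit matrix.

```python
import numpy as np

def conjugate_gradient(hvp, b, tol=1e-10, max_iter=100):
    """Solve H x = b for symmetric positive definite H, given only x -> H x."""
    x = np.zeros_like(b)
    r = b - hvp(x)            # residual
    p = r.copy()              # search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Hp = hvp(p)
        step = rs / (p @ Hp)
        x += step * p
        r -= step * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(2)
n_u, n_v = 3, 4
M = rng.standard_normal((n_v, n_v))
A = M @ M.T + n_v * np.eye(n_v)      # SPD Hessian of the toy lower-level cost
B = rng.standard_normal((n_v, n_u))
t = rng.standard_normal(n_v)
u = rng.standard_normal(n_u)
alpha = 0.1

v = np.linalg.solve(A, B @ u)        # lower-level solution (closed form for the toy)
hvp = lambda x: A @ x                # stand-in for an autodiff Hessian-vector product
q = conjugate_gradient(hvp, v - t)   # q approximates (Hessian)^{-1} grad_v f
# Mixed second derivative of the toy g is -B^T, so the correction term is +B^T q.
hypergrad_cg = alpha * u + B.T @ q

exact = alpha * u + B.T @ np.linalg.solve(A, v - t)
print(np.abs(hypergrad_cg - exact).max())
```

Each conjugate gradient iteration costs one Hessian-vector product, so the total cost scales with the number of iterations rather than with the cube of the number of lower-level variables.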
Here we discuss modifications to Alg. 1 of the main text that can be added to improve the performance of the algorithm in practice.
One of the common assumptions of this and previous works is that the Hessian of the lower-level cost is invertible and locally positive definite. Neither invertibility nor positive definiteness holds in general for bilevel problems involving deep neural networks, and this causes difficulties in the optimization. Note that if the lower-level cost is non-convex in the lower-level variable, minimizing the penalty term does not necessarily lower the cost but instead moves the variable towards a stationary point, which is a known problem even for Newton's method. Thus we propose the following modification to the lower-level update:
keeping the same upper-level update intact. To see how this affects the optimization, note that the lower-level update becomes
After the lower-level variable converges to a stationary point, we get
, and after plugging this into the upper-level update, we get
that is, the Hessian inverse is replaced by a regularized version to improve the positive definiteness of the Hessian. With a decreasing or constant sequence such that the regularization does not change the solution.
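The effect of the regularization can be verified numerically under one plausible reading of the modification. The sketch below assumes that the modified lower-level objective is f + (gamma/2)‖∇_v g‖² + lam·g, and that the resulting regularized inverse has the form (H + (lam/gamma) I)⁻¹, where H is the lower-level Hessian. Both forms are assumptions made for this example, since they are not spelled out in the text above, and the quadratic toy problem and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_u, n_v = 3, 4
M = rng.standard_normal((n_v, n_v))
A = M @ M.T                          # toy lower-level Hessian, may be ill-conditioned
B = rng.standard_normal((n_v, n_u))
t = rng.standard_normal(n_v)
u = rng.standard_normal(n_u)
alpha, gamma, lam = 0.1, 10.0, 1.0
I = np.eye(n_v)

# Toy costs: f(u, v) = 0.5||v - t||^2 + 0.5*alpha*||u||^2,
#            g(u, v) = 0.5 v^T A v - v^T B u, so grad_v g = A v - B u.
# Stationary point in v of the ASSUMED modified objective
#   f + (gamma/2)||grad_v g||^2 + lam * g:
#   (v - t) + (gamma*A + lam*I)(A v - B u) = 0
v = np.linalg.solve(I + gamma * A @ A + lam * A,
                    t + (gamma * A + lam * I) @ (B @ u))

# Upper-level gradient of the unmodified penalty function at that v.
penalty_grad = alpha * u - gamma * B.T @ (A @ v - B @ u)

# The same quantity written with a regularized Hessian inverse.
regularized = alpha * u + B.T @ np.linalg.solve(A + (lam / gamma) * I, v - t)

print(np.abs(penalty_grad - regularized).max())
```

Under these assumptions the two expressions agree to rounding error, and the shift lam/gamma vanishes when gamma grows faster than lam, which is consistent with the statement that the regularization does not change the solution.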
The penalty function method is intuitive and easy to implement, but the sequence of iterates is guaranteed to converge to an optimal solution only in the limit as the penalty parameter grows to infinity, which may not be achieved in practice in limited time. It is known that the penalty method can be improved by introducing an additional term into the function, which is called the augmented Lagrangian (penalty) method:
This new term allows convergence to the optimal solution even when the penalty parameter is finite. Furthermore, with the multiplier update rule known as the method of multipliers, the multiplier estimate is known to converge to the true Lagrange multiplier of this problem corresponding to the equality constraints.
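The benefit of the multiplier term is easiest to see on a generic equality-constrained problem. The sketch below is not the bilevel setting: it assumes a hypothetical toy problem, minimize 0.5‖x − a‖² subject to c(x) = wᵀx − b = 0, with the standard augmented Lagrangian 0.5‖x − a‖² + nu·c(x) + (gamma/2)·c(x)² and the standard multiplier update nu ← nu + gamma·c(x). In the bilevel case the constraint would be the stationarity condition of the lower-level problem instead.

```python
import numpy as np

a = np.array([3.0, 1.0])
w = np.array([1.0, 2.0])
b = 1.0
gamma = 1.0                      # penalty parameter, kept FIXED and finite

def constraint(x):
    return w @ x - b

def inner_minimizer(nu, gamma):
    """Closed-form minimizer of the augmented Lagrangian in x for the toy problem."""
    lhs = np.eye(2) + gamma * np.outer(w, w)
    rhs = a - nu * w + gamma * b * w
    return np.linalg.solve(lhs, rhs)

# Pure penalty method: multiplier fixed at zero, constraint stays violated.
x_penalty = inner_minimizer(0.0, gamma)

# Method of multipliers: alternate the inner minimization and the multiplier update.
nu = 0.0
for _ in range(30):
    x = inner_minimizer(nu, gamma)
    nu += gamma * constraint(x)

nu_star = (w @ a - b) / (w @ w)  # true multiplier of the toy problem
print("penalty only  :", constraint(x_penalty))
print("with multiplier:", constraint(x), " nu =", nu, " nu* =", nu_star)
```

With the same finite gamma, the pure penalty iterate leaves a constraint violation of order 1/gamma, while the multiplier iteration drives the violation to zero and recovers the true multiplier.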
Most existing methods have assumed that the lower-level solution is unique for every value of the upper-level variable. The regularization from the previous section can improve the ill-conditioning of the Hessian, but it does not address the case of multiple disconnected global minima of the lower-level cost. With multiple lower-level solutions, there is an ambiguity in defining the upper-level problem. If we assume that the lower-level solution is chosen adversarially (or pessimistically), then the upper-level problem should be defined as
If the lower-level solution is chosen cooperatively (or optimistically), then the upper-level problem should be defined as
and the results can be quite different between these two cases. Note that the proposed penalty function method naturally solves the optimistic case, as Alg. 1 minimizes jointly over both variables by alternating gradient descent. However, with a gradient-based method, we cannot hope to find all of the disconnected solutions. In the related problem of min-max optimization, which is a special case of bilevel optimization, an algorithm for handling non-unique solutions was proposed recently. This idea of keeping multiple candidate solutions may be applicable to bilevel problems too, and further analysis of the non-unique lower-level problem is left as future work.
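A one-dimensional example shows how far apart the two formulations can be. The sketch below assumes a hypothetical lower-level cost g(u, v) = (v² − 1)², whose minimizers v = −1 and v = +1 do not depend on u, and a hypothetical upper-level cost f(u, v) = (u − v)². It evaluates both formulations by brute force on a grid, which is only feasible because the toy problem is tiny.

```python
import numpy as np

# The two disconnected global minimizers of the toy lower-level cost.
lower_solutions = np.array([-1.0, 1.0])

def upper_cost(u, v):
    return (u - v) ** 2

u_grid = np.linspace(-2.0, 2.0, 401)
costs = upper_cost(u_grid[:, None], lower_solutions[None, :])  # shape (401, 2)

optimistic = costs.min(axis=1)    # lower level cooperates with the upper level
pessimistic = costs.max(axis=1)   # lower level acts adversarially

u_opt = u_grid[optimistic.argmin()]
u_pess = u_grid[pessimistic.argmin()]
print("optimistic : u =", u_opt, " value =", optimistic.min())
print("pessimistic: u =", u_pess, " value =", pessimistic.min())
```

In this toy case the optimistic problem attains zero cost by matching one of the two lower-level solutions, whereas the pessimistic problem settles midway between them with a cost of about one, so the two definitions lead to different upper-level solutions as well as different values.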
Here we present the modified algorithm which incorporates regularization (Sec. C.1) and augmented Lagrangian (Sec. C.2) as discussed previously. The augmented Lagrangian term applies to both - and -update, but the regularization term applies to only the -update as its purpose is to improve the ill-conditioning of during