
Penalty Method for Inversion-Free Deep Bilevel Optimization

by Akshay Mehra, et al.

Bilevel optimization is at the center of several important machine learning problems such as hyperparameter tuning, data denoising, few-shot learning, and data poisoning. Different from simultaneous or multi-objective optimization, obtaining the exact descent direction for continuous bilevel optimization requires computing the inverse of the Hessian of the lower-level cost function, even for first-order methods. In this paper, we propose a new method for solving bilevel optimization, using the penalty function, which avoids computing the inverse of the Hessian. We prove convergence of the method under mild conditions and show that it computes the exact hypergradient asymptotically. The small space and time complexity of our method allows us to solve large-scale bilevel optimization problems involving deep neural networks with up to 3.8M upper-level and 1.4M lower-level variables. We present results of our method for data denoising on the MNIST/CIFAR10/SVHN datasets, for few-shot learning on the Omniglot/Mini-ImageNet datasets, and for training-data poisoning on the MNIST/ImageNet datasets. In all experiments, our method outperforms or is comparable to previously proposed methods in terms of both accuracy and run-time.


1 Introduction

Bilevel optimization appears in many fields of study where two competing parties or objectives are involved. In particular, a bilevel problem arises when one party makes its choice first, thereby affecting the optimal choice of the second party. This is known as the Stackelberg model, dating back to the 1930s (Von Stackelberg, 2010). The general form of a bilevel optimization problem is

$$\min_{u} f(u, v) \quad \text{s.t.} \quad v = \arg\min_{v} g(u, v). \tag{1}$$

A bilevel problem with constraints is of the form $\min_{u \in \mathcal{U}} f(u, v)$ s.t. $v = \arg\min_{v \in \mathcal{V}(u)} g(u, v)$, where the lower-level constraint set $\mathcal{V}(u)$ can depend on $u$. However, we focus on unconstrained problems in this paper. The ‘upper-level’ problem $\min_u f(u, v)$ is a usual minimization problem except that $v$ is constrained to be the solution to the ‘lower-level’ problem $\min_v g(u, v)$, which is dependent on $u$ (see (Bard, 2013) for a review of bilevel optimization). Bilevel optimization also appears in many important machine learning problems. Examples include gradient-based hyperparameter tuning (Domke, 2012; Maclaurin et al., 2015; Luketina et al., 2016; Pedregosa, 2016; Franceschi et al., 2017, 2018), data denoising by importance learning (Liu and Tao, 2016; Yu et al., 2017; Ren et al., 2018), few-shot learning (Ravi and Larochelle, 2017; Santoro et al., 2016; Vinyals et al., 2016; Franceschi et al., 2017; Mishra et al., 2017; Snell et al., 2017; Franceschi et al., 2018), and training-data poisoning (Maclaurin et al., 2015; Muñoz-González et al., 2017; Koh and Liang, 2017; Shafahi et al., 2018). We explain each of these problems and their bilevel formulations below.

Gradient-based hyperparameter tuning. Finding hyperparameters is an indispensable step in any machine learning problem. Grid search is a popular way of finding the optimal hyperparameters if the domain of the hyperparameters is a predetermined discrete set or a range. However, when losses are differentiable functions of the hyperparameter(s), we can find optimal hyperparameter values by solving a continuous bilevel optimization. Let $u$ and $w$ denote the hyperparameter(s) and parameter(s) for a class of learning algorithms, and let $h(x; u, w)$ be the hypothesis. Then $L_{\mathrm{val}}(u, w) = \frac{1}{N_{\mathrm{val}}} \sum_{(x_i, y_i) \in D_{\mathrm{val}}} l(h(x_i; u, w), y_i)$ is the validation loss, and $L_{\mathrm{train}}(u, w)$ is the training loss, defined similarly. The best hyperparameter(s) $u$ is then the solution to the following problem:

$$\min_{u} L_{\mathrm{val}}(u, w) \quad \text{s.t.} \quad w = \arg\min_{w} L_{\mathrm{train}}(u, w). \tag{2}$$

Thus, we find the best model parameters $w$ for each choice of the hyperparameter $u$, and select the value of the hyperparameter $u$ that incurs the smallest validation loss.

Data denoising by importance learning. A common assumption of learning is that the training examples are i.i.d. samples from the same distribution as the test data. However, if the training and testing distributions are not identical, or if the training examples have been corrupted by noise or modified by adversaries, this assumption is violated. In such cases, re-weighting the importance of each training example before training can help reduce the discrepancy between the two distributions. For example, one can up-weight the importance of the examples from the same distribution and down-weight the importance of the rest. This problem of finding the correct weight for each training example can be formulated as a bilevel optimization. Consider $u = [u_1, \dots, u_N]^\top$ to be the vector of non-negative importance values for each training example, where $N$ is the number of training examples, and let $w$ be the parameter(s) of a classifier $h(x; w)$. Then the weighted training error is $L_{\mathrm{train}}(u, w)$, in which the loss on the $i$-th training example is weighted by $u_i$. Also, assume that we can get a small number of examples from the same distribution as that of the test examples (clean validation examples). Then the importance learning problem can be formulated as:

$$\min_{u} L_{\mathrm{val}}(u, w) \quad \text{s.t.} \quad w = \arg\min_{w} L_{\mathrm{train}}(u, w). \tag{3}$$

Hence, the importance of each training example (vector $u$) is selected such that the minimizer of the weighted training loss in the lower level also minimizes the validation loss in the upper level. The final importance values can help to identify good points from the noisy training set, and the classifier obtained after solving this optimization will have superior performance compared to the model trained on the noisy data.

Meta-learning. A standard learning problem involves finding the best model from a class of hypotheses for a given task (i.e., data distribution). In contrast, meta-learning is the problem of learning a prior on the hypothesis classes (a.k.a. inductive bias) for a given set of tasks. Few-shot learning is an example of meta-learning, where a learner is trained on several related tasks during the meta-training phase so that it can generalize well to unseen (but related) tasks with just a few examples during the meta-testing phase. An effective approach to the few-shot learning problem is to learn a common representation for various tasks and train task-specific classifiers on top of this representation. Let $T(\cdot\,; u)$ be the map that takes raw features to a common representation for all tasks and $R_i(\cdot\,; w_i)$ be the classifier for the $i$-th task, $i \in \{1, \dots, N\}$, where $N$ is the total number of tasks for training. The goal of few-shot learning is to learn both the representation map $T$ parameterized by $u$ and the set of classifiers $\{R_i\}$ parameterized by $w = \{w_i\}$. Let $L_{\mathrm{val}}(u, w_i)$ be the validation loss of task $i$ and $L_{\mathrm{train}}(u, w_i)$ be the training loss defined similarly. Then the bilevel problem for few-shot learning is

$$\min_{u} \sum_{i=1}^{N} L_{\mathrm{val}}(u, w_i) \quad \text{s.t.} \quad w_i = \arg\min_{w_i} L_{\mathrm{train}}(u, w_i), \quad i = 1, \dots, N. \tag{4}$$

For evaluation of the learned representation during the meta-test phase, the representation is kept fixed and only the classifiers for the new tasks are trained, i.e., $\min_{w_i} L_{\mathrm{train}}(u, w_i)$ for $i = 1, \dots, M$, where $M$ is the total number of tasks for testing.

Training-data poisoning. Recently, machine learning models were shown to be vulnerable to train-time attacks. Different from test-time attacks, here the adversary modifies the training data so that the model learned from the altered training data performs poorly/differently compared to the model learned from clean data. The most popular train-time attack method augments the original training data $X$ with one or more ‘poisoned’ examples $u$, i.e., $X' = X \cup \{u\}$, to create the poisoned dataset $X'$, with $L_{\mathrm{poison}}(u, w)$ being the loss on the poisoned training data. The problem of finding poisoning points that, when added to the clean training data, hurt the performance of the model trained on it can be formulated as

$$\min_{u} -L_{\mathrm{val}}(u, w) \quad \text{s.t.} \quad w = \arg\min_{w} L_{\mathrm{poison}}(u, w), \tag{5}$$

where the minus sign in the upper level is used to maximize the validation loss. This is the formulation for untargeted attacks. For targeted attacks, the upper level must minimize the validation loss with respect to the intended target labels of the attacker. Another variant of the poisoning attack influences only the prediction of a single predetermined example. The upper-level cost for this attack is the loss over only this single example (see Eq. (10) in Appendix D.4.1).

Challenges of deep bilevel optimization. General bilevel problems cannot be solved using simultaneous optimization of the upper- and lower-level problems. Moreover, exact bilevel optimization is known to be NP-hard even for linear cost functions (Bard, 1991). To add to this, recent deep learning models, with millions of variables, make it infeasible to use sophisticated methods beyond first-order methods. For bilevel problems, even first-order methods are difficult to apply since they require computation of the inverse Hessian–gradient product to get the exact hypergradient (see Sec. 2.1). Since direct inversion of the Hessian is impractical, even for moderate-sized problems, many approaches have been proposed to approximate the exact hypergradient, including forward/reverse-mode differentiation (Maclaurin et al., 2015; Franceschi et al., 2017; Shaban et al., 2018) and approximate inversion by solving a linear system of equations (Domke, 2012; Pedregosa, 2016). However, there is still considerable room for improvement in these existing approaches in terms of their time and space complexities and practical performance.

Contributions. We propose a penalty function-based algorithm (Alg. 1) for solving large-scale unconstrained bilevel optimization. We prove its convergence under mild conditions (Theorem 2) and show that it computes the exact hypergradient asymptotically (Lemma 3). We present a complexity analysis of the algorithm showing that it has linear time and constant space complexity (Table 1), making our method superior to forward-mode and reverse-mode differentiation and similar to the approximate-inversion-based method. The small space and time complexity enables us to solve large-scale bilevel problems involving deep neural networks with up to 3.8M upper-level and 1.4M lower-level variables (Table 7 in Appendix). We show evaluation results on data denoising by importance learning, few-shot learning, and training-data poisoning problems. The proposed penalty-based method is competitive with the state-of-the-art methods on simpler problems (with convex lower-level cost) and significantly outperforms other methods on complex problems (with non-convex lower-level cost), in terms of both accuracy (Sec. 3) and run-time (Table 2, Fig. 3).

The remainder of the paper is organized as follows. We present and analyze the main algorithm in Sec. 2, perform comprehensive experiments in Sec. 3, and conclude the paper in Sec. 4. Due to space limitations, proofs, experimental settings, and additional results are presented in the appendix. All code is published on GitHub.

2 Inversion-Free Penalty Method

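Throughout the paper we assume that the upper- and lower-level costs $f$ and $g$ are twice continuously differentiable in both $u$ and $v$. We use $\nabla_u f$ and $\nabla_v f$ to denote gradient vectors, $\nabla^2_{uv} f$ for the matrix $[\partial^2 f / \partial u_i \partial v_j]$, and $\nabla^2_{vv} f$ for the Hessian matrix $[\partial^2 f / \partial v_i \partial v_j]$. Additionally, following previous works, we assume that the lower-level solution $v^*(u) := \arg\min_v g(u, v)$ is unique for all $u$ and that $\nabla^2_{vv} g$ is invertible everywhere. Later in this section we discuss how some of these assumptions can be relaxed.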

2.1 Hypergradient for bilevel optimization

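Assuming we can express the solution to the lower-level problem $v^*(u) := \arg\min_v g(u, v)$ explicitly, we can write the bilevel problem as an equivalent single-level problem $\min_u f(u, v^*(u))$. We can use a gradient-based approach on this single-level problem and compute the total derivative $\frac{df}{du}(u, v^*(u))$, called the hypergradient in previous approaches. Using the chain rule, the total derivative is

$$\frac{df}{du} = \nabla_u f + \frac{dv}{du}\, \nabla_v f \quad \text{at } (u, v^*(u)). \tag{6}$$

In reality, $v^*(u)$ can be written explicitly only for trivial problems, but we can still compute $\frac{dv}{du}$ using the implicit function theorem. Since $\nabla_v g = 0$ at $v = v^*(u)$, differentiating with respect to $u$ gives $\nabla^2_{uv} g + \frac{dv}{du} \nabla^2_{vv} g = 0$ and consequently $\frac{dv}{du} = -\nabla^2_{uv} g\, (\nabla^2_{vv} g)^{-1}$. Thus, the hypergradient is as follows:

$$\frac{df}{du} = \nabla_u f - \nabla^2_{uv} g\, (\nabla^2_{vv} g)^{-1}\, \nabla_v f \quad \text{at } (u, v^*(u)). \tag{7}$$

Existing approaches (Domke, 2012; Maclaurin et al., 2015; Pedregosa, 2016; Franceschi et al., 2017) can be viewed as implicit methods of approximating the hypergradient, with distinct trade-offs in efficiency and complexity.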
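As a sanity check on the hypergradient formula, the following is a minimal NumPy sketch on a toy quadratic bilevel problem in which the lower-level solution is available in closed form. The problem itself, the matrices `A` and `C`, the target `t`, and the weight `mu` are all illustrative assumptions and do not come from the paper. The sketch compares the implicit-function-theorem expression against central finite differences of the single-level cost.

```python
import numpy as np

# Toy problem (illustrative assumption, not from the paper):
#   upper level: f(u, v) = 0.5*||v - t||^2 + 0.5*mu*||u||^2
#   lower level: g(u, v) = 0.5*v^T A v - u^T C v, with A symmetric positive definite
A = np.array([[2.0, 0.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0], [0.5, 1.0]])
t = np.array([1.0, -1.0])
mu = 1.0

def lower_solution(u):
    # grad_v g = A v - C^T u = 0  =>  v*(u) = A^{-1} C^T u
    return np.linalg.solve(A, C.T @ u)

def upper_cost(u):
    # Single-level cost f(u, v*(u))
    v = lower_solution(u)
    return 0.5 * np.sum((v - t) ** 2) + 0.5 * mu * np.sum(u ** 2)

def hypergradient(u):
    # grad_u f - hess_uv g (hess_vv g)^{-1} grad_v f, evaluated at (u, v*(u))
    v = lower_solution(u)
    grad_u_f = mu * u
    grad_v_f = v - t
    hess_uv_g = -C   # derivative of grad_u g = -C v with respect to v
    hess_vv_g = A
    # A linear solve stands in for the explicit inverse.
    return grad_u_f - hess_uv_g @ np.linalg.solve(hess_vv_g, grad_v_f)

def finite_difference(u, h=1e-6):
    # Central differences of the single-level cost
    grad = np.zeros_like(u)
    for i in range(u.size):
        e = np.zeros_like(u)
        e[i] = h
        grad[i] = (upper_cost(u + e) - upper_cost(u - e)) / (2 * h)
    return grad

u0 = np.array([0.3, -0.7])
print("implicit formula :", hypergradient(u0))
print("finite difference:", finite_difference(u0))
```

On this toy problem the explicit solve is trivial. The difficulty discussed in the text arises when the lower-level Hessian is too large to factor or invert.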

2.2 Penalty function approach

A bilevel problem can be considered a constrained optimization problem, since lower-level optimality is a constraint in addition to any other constraints in the upper- and lower-level problems. In this work, we focus on unconstrained bilevel problems, i.e., those without any additional constraints on the upper- and lower-level problems. For solving bilevel problems, the lower-level problem is often replaced by its necessary optimality condition, resulting in the following problem:

$$\min_{u, v} f(u, v) \quad \text{s.t.} \quad \nabla_v g(u, v) = 0. \tag{8}$$

For general bilevel problems, Eq. (8) and Eq. (1) are not the same (Dempe and Dutta, 2012). However, if the lower-level cost $g$ is convex in $v$ for each $u$ and the lower-level solution is unique for each $u$, Eq. (8) is equivalent to Eq. (1).

We now describe the penalty function approach for solving bilevel optimization. The penalty function method is a well-known approach for solving constrained optimization problems (see (Bertsekas, 1997) for a review) and has previously been applied to bilevel problems. However, it was analyzed under strict assumptions, and only high-level descriptions of the algorithm were presented (Aiyoshi and Shimizu, 1984; Ishizuka and Aiyoshi, 1992). The penalty function $\tilde{f}(u, v; \gamma) := f(u, v) + \frac{\gamma}{2} \|\nabla_v g(u, v)\|^2$ consists of the original cost $f$ plus a quadratic penalty term, which penalizes violation of the necessary condition for lower-level optimality. Let $(\hat{u}, \hat{v})$ be the minimum of the penalty function for a given $\gamma$:

$$(\hat{u}, \hat{v}) = \arg\min_{u, v} \tilde{f}(u, v; \gamma) = \arg\min_{u, v} \left[ f(u, v) + \frac{\gamma}{2} \|\nabla_v g(u, v)\|^2 \right]. \tag{9}$$

Then the following convergence result is known.

Theorem 1 (Theorem 8.3.1 of (Bard, 2013)).

Assume $f$ and $g$ are convex in $v$ for any fixed $u$. Let $\{\gamma_k\}$ be any positive ($\gamma_k > 0$) and divergent ($\gamma_k \to \infty$) sequence. If $\{(\hat{u}_k, \hat{v}_k)\}$ is the corresponding sequence of optimal solutions of the penalty function Eq. (9), then the sequence $\{(\hat{u}_k, \hat{v}_k)\}$ has limit points, any one of which is a solution of Eq. (1).

Even though this is a strong result, it is not very practical, since the minimum $(\hat{u}_k, \hat{v}_k)$ needs to be computed exactly for each $\gamma_k$, and moreover $f$ and $g$ need to be convex in $v$ for any $u$. In our approach, we allow $\epsilon_k$-optimal solutions of Eq. (9) and show convergence to a KKT point of Eq. (8) without requiring convexity.
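To illustrate how the minimizer of the penalty function approaches the bilevel solution as the penalty weight grows, here is a small NumPy sketch on a toy quadratic problem. The problem data (`A`, `C`, `t`, `mu`) and the helper names are assumptions made for illustration only. For this quadratic case the penalty function is itself quadratic in $(u, v)$, so its exact minimizer can be obtained from a single linear solve.

```python
import numpy as np

# Toy problem (illustrative assumption, not from the paper):
#   f(u, v) = 0.5*||v - t||^2 + 0.5*mu*||u||^2
#   g(u, v) = 0.5*v^T A v - u^T C v,  so grad_v g = A v - C^T u
A = np.array([[2.0, 0.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0], [0.5, 1.0]])
t = np.array([1.0, -1.0])
mu = 1.0

def bilevel_solution():
    # v*(u) = J u with J = A^{-1} C^T; then minimize f(u, J u) over u.
    J = np.linalg.solve(A, C.T)
    u = np.linalg.solve(J.T @ J + mu * np.eye(2), J.T @ t)
    return u, J @ u

def penalty_minimizer(gamma):
    # Stationarity of f + (gamma/2)*||A v - C^T u||^2 is linear in (u, v):
    #   (mu I + gamma C C^T) u - gamma C A v   = 0
    #   -gamma A C^T u + (I + gamma A^2) v     = t
    H = np.block([
        [mu * np.eye(2) + gamma * C @ C.T, -gamma * C @ A],
        [-gamma * A @ C.T, np.eye(2) + gamma * A @ A],
    ])
    rhs = np.concatenate([np.zeros(2), t])
    x = np.linalg.solve(H, rhs)
    return x[:2], x[2:]

u_star, v_star = bilevel_solution()
errors = []
for gamma in [1.0, 1e2, 1e4]:
    u_hat, v_hat = penalty_minimizer(gamma)
    err = np.sqrt(np.sum((u_hat - u_star) ** 2) + np.sum((v_hat - v_star) ** 2))
    errors.append(err)
    print(f"gamma = {gamma:8.0f}   distance to bilevel solution = {err:.2e}")
```

In a deep learning setting no closed-form minimizer exists, which is why approximate (tolerance-based) solutions of the penalty problem matter in practice.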

Theorem 2.

Suppose $\{\epsilon_k\}$ is a positive ($\epsilon_k > 0$) and convergent ($\epsilon_k \to 0$) sequence, and $\{\gamma_k\}$ is a positive ($\gamma_k > 0$), non-decreasing ($\gamma_1 \le \gamma_2 \le \cdots$), and divergent ($\gamma_k \to \infty$) sequence. Let $\{(u_k, v_k)\}$ be the sequence of approximate solutions to Eq. (9) with tolerance $\|\nabla_u \tilde{f}(u_k, v_k; \gamma_k)\|^2 + \|\nabla_v \tilde{f}(u_k, v_k; \gamma_k)\|^2 \le \epsilon_k^2$ for all $k$. Then any limit point of $\{(u_k, v_k)\}$ satisfies the KKT conditions of the problem in Eq. (8).

Alg. 1 describes our method, in which we minimize the penalty function in Eq. (9) alternately over $v$ and $u$. It is essential to note that our method solves a single-level penalty function (Eq. (9)) and does not need any intermediate step to compute the approximate hypergradient. Other methods, in contrast, first approximate the solution to the lower-level problem of Eq. (1) and then use an intermediate step (solving a linear system or using reverse/forward-mode differentiation) to compute the approximate hypergradient. Lemma 3 (below) shows when the approximate gradient direction $\nabla_u \tilde{f}$ computed from Alg. 1 becomes the exact hypergradient Eq. (7) for bilevel problems.

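Lemma 3.

Given $u$, let $\hat{v} := \arg\min_v \tilde{f}(u, v; \gamma)$ from Eq. (9). Then, $\nabla_u \tilde{f}(u, \hat{v}; \gamma) = \frac{df}{du}(u, \hat{v})$ as in Eq. (7).

Thus, if we find the minimizer $\hat{v}$ of the penalty function for given $u$ and $\gamma$, Alg. 1 computes the exact hypergradient Eq. (7) at $(u, \hat{v})$. Furthermore, under the conditions of Theorem 1, $\hat{v} \to v^*(u)$ as $\gamma \to \infty$, and we get the exact hypergradient asymptotically.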
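The identity in Lemma 3 can be traced in a few lines. The sketch below assumes the quadratic penalty form $\tilde{f}(u, v; \gamma) = f(u, v) + \frac{\gamma}{2}\|\nabla_v g(u, v)\|^2$, a symmetric and invertible $\nabla^2_{vv} g$, and that all derivatives are evaluated at $(u, \hat{v})$. It is intended as a reading aid and not as a replacement for the proof in the appendix.

```latex
\begin{align*}
\tilde{f}(u, v; \gamma) &= f(u, v) + \tfrac{\gamma}{2}\,\|\nabla_v g(u, v)\|^2, \\[2pt]
\nabla_v \tilde{f}(u, \hat{v}; \gamma)
  &= \nabla_v f + \gamma\,\nabla^2_{vv} g\,\nabla_v g = 0
  \;\Longrightarrow\;
  \gamma\,\nabla_v g = -(\nabla^2_{vv} g)^{-1}\,\nabla_v f, \\[2pt]
\nabla_u \tilde{f}(u, \hat{v}; \gamma)
  &= \nabla_u f + \gamma\,\nabla^2_{uv} g\,\nabla_v g
   = \nabla_u f - \nabla^2_{uv} g\,(\nabla^2_{vv} g)^{-1}\,\nabla_v f .
\end{align*}
```

The last expression has the same form as the hypergradient in Eq. (7), evaluated at $\hat{v}$ instead of $v^*(u)$.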

Comparison with other methods: Many methods have been proposed previously to solve bilevel optimization problems that appear in machine learning, including forward/reverse-mode differentiation (FMD/RMD) (Maclaurin et al., 2015; Franceschi et al., 2017; Shaban et al., 2018) and approximate hypergradient computation by solving a linear system (ApproxGrad) (Domke, 2012; Pedregosa, 2016). For completeness, we describe these methods briefly in Appendix B. Table 1 shows the trade-offs of these methods for computing the hypergradient. One can see that as $T$ increases, FMD and RMD become impractical due to time complexity and space complexity, respectively, whereas ApproxGrad and Penalty have the same linear time complexity and constant space complexity, which is a big advantage over FMD and RMD. However, complexity analysis does not show the quality of the hypergradient approximation of each method. In Sec. 3.1 we show empirically, with synthetic examples, that the proposed penalty method has better convergence properties than all the other methods. Since ApproxGrad and Penalty have the same complexities, we also compare the two methods on real data and show that Penalty is twice as fast as ApproxGrad (Fig. 3).
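For contrast with the penalty approach, the following NumPy sketch shows the idea behind the linear-system route: the inverse Hessian–gradient product is obtained by solving $\nabla^2_{vv} g\, q = \nabla_v f$ iteratively, using only Hessian-vector products. The toy problem, the helper names, and the use of a hand-written conjugate gradient loop are illustrative assumptions. This is not the authors' ApproxGrad implementation (which, according to Sec. 3.1, uses gradient descent on the linear system), and a real model would compute the Hessian-vector product with automatic differentiation rather than an explicit matrix.

```python
import numpy as np

# Toy problem (illustrative assumption, not from the paper):
#   f(u, v) = 0.5*||v - t||^2 + 0.5*mu*||u||^2
#   g(u, v) = 0.5*v^T A v - u^T C v
A = np.array([[2.0, 0.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0], [0.5, 1.0]])
t = np.array([1.0, -1.0])
mu = 1.0

def hessian_vector_product(p):
    # hess_vv g @ p; here the Hessian is the constant matrix A.
    return A @ p

def conjugate_gradient(hvp, b, max_iter=50, tol=1e-12):
    # Solves H q = b for symmetric positive definite H, using only products H @ p.
    q = np.zeros_like(b)
    r = b.copy()          # residual b - H q for q = 0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        q = q + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return q

def approx_hypergradient(u, v):
    grad_u_f = mu * u
    grad_v_f = v - t
    q = conjugate_gradient(hessian_vector_product, grad_v_f)
    hess_uv_g = -C
    return grad_u_f - hess_uv_g @ q

u0 = np.array([0.3, -0.7])
v0 = np.linalg.solve(A, C.T @ u0)   # exact lower-level solution for this toy problem
print(approx_hypergradient(u0, v0))
```

The two-stage structure (first approximate the lower-level solution, then solve a linear system) is the intermediate step that the penalty method avoids.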

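Initialize $u_0$, $v_0$ randomly

  for $k = 0, \dots, K-1$ do
     while $\|\nabla_u \tilde{f}\|^2 + \|\nabla_v \tilde{f}\|^2 > \epsilon_k^2$ do
        for $t = 0, \dots, T-1$ do
           $v$-update: $v \leftarrow v - \rho\, \nabla_v \tilde{f}(u, v; \gamma_k)$ (from Eq. (9))
        end for
        $u$-update: $u \leftarrow u - \sigma\, \nabla_u \tilde{f}(u, v; \gamma_k)$ (from Eq. (9))
     end while
     Break if max iteration is reached
  end for
Algorithm 1 Penalty method for bilevel optimization

| Method | $v$-update | Intermediate update | Time | Space |
|---|---|---|---|---|
| Penalty | | Not required | | |

Table 1: Complexity analysis of various bilevel methods (FMD, RMD and ApproxGrad are discussed in Appendix B). $U$ is the size of $u$, $V$ is the size of $v$, and $T$ is the total number of $v$-updates per hypergradient computation. The intermediate-update variables are those required to compute the hypergradient (also updated $T$ times). Note: the Hessian-vector product can be computed with complexity linear in the number of variables, as shown in (Pearlmutter, 1994).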
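The loop structure of Alg. 1 can be sketched in NumPy as follows. Everything specific here is an assumption made for illustration: the toy quadratic problem, the geometric schedules for the penalty weight and tolerance, the step sizes (set to the inverse of each block's curvature bound), and the function names. The sketch only aims to show the alternation of `T` gradient steps on $v$ with one gradient step on $u$ for the penalty function, with a tolerance-based stopping rule.

```python
import numpy as np

# Toy problem (illustrative assumption, not from the paper):
#   f(u, v) = 0.5*||v - t||^2 + 0.5*mu*||u||^2
#   g(u, v) = 0.5*v^T A v - u^T C v,  so grad_v g = A v - C^T u
A = np.array([[2.0, 0.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0], [0.5, 1.0]])
t = np.array([1.0, -1.0])
mu = 1.0

def penalty_grads(u, v, gamma):
    # Gradients of f + (gamma/2)*||grad_v g||^2 with respect to u and v
    h = A @ v - C.T @ u
    grad_u = mu * u - gamma * (C @ h)
    grad_v = (v - t) + gamma * (A @ h)
    return grad_u, grad_v

def penalty_method(K=6, T=5, gamma0=1.0, eps0=0.1, c_gamma=3.0, c_eps=0.5,
                   max_inner=5000, seed=0):
    rng = np.random.default_rng(seed)
    u, v = rng.standard_normal(2), rng.standard_normal(2)
    gamma, eps = gamma0, eps0
    for k in range(K):
        # Step sizes from the curvature of each block (assumed choice)
        rho = 1.0 / (1.0 + gamma * np.linalg.norm(A, 2) ** 2)    # v-step
        sigma = 1.0 / (mu + gamma * np.linalg.norm(C, 2) ** 2)   # u-step
        for it in range(max_inner):
            grad_u, grad_v = penalty_grads(u, v, gamma)
            if np.sqrt(grad_u @ grad_u + grad_v @ grad_v) <= eps:
                break
            for step in range(T):
                _, grad_v = penalty_grads(u, v, gamma)
                v = v - rho * grad_v
            grad_u, _ = penalty_grads(u, v, gamma)
            u = u - sigma * grad_u
        gamma *= c_gamma   # increase the penalty weight
        eps *= c_eps       # tighten the tolerance
    return u, v

# Reference solution of the toy bilevel problem, for comparison only
J = np.linalg.solve(A, C.T)
u_star = np.linalg.solve(J.T @ J + mu * np.eye(2), J.T @ t)
v_star = J @ u_star

u_hat, v_hat = penalty_method()
dist = np.sqrt(np.sum((u_hat - u_star) ** 2) + np.sum((v_hat - v_star) ** 2))
print("distance to bilevel solution:", dist)
```

Note that no Hessian inverse or linear-system solve appears anywhere in the loop. With a neural network, the products involving second derivatives of $g$ would be obtained by automatic differentiation of $\|\nabla_v g\|^2$.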

Improvements. A caveat to these theoretical guarantees is that some of the assumptions made for the analysis may not be satisfied in practice. Here we discuss simple techniques to address these problems and improve Alg. 1 further. The first problem is related to non-convexity of the lower-level cost $g$: a local minimum of $\|\nabla_v g\|$ can be either a minimum or a maximum of $g$. To address this, we modify the $v$-update for Eq. (9) by adding a ‘regularization’ term $\lambda_k g$ to the cost so that the $v$-update finds a minimum of $g$. Thus, the $v$-update becomes $v \leftarrow v - \rho\, \nabla_v [\tilde{f} + \lambda_k g]$. This only affects the optimization in the beginning; as $\lambda_k \to 0$, the final solution remains unaffected with or without regularization. The second problem is that the tolerance $\epsilon_k$ may not be satisfied in a limited time, and the optimization may terminate before $\gamma_k$ becomes large enough. A cure for this is the method of multipliers and augmented Lagrangian (Bertsekas, 1976), which allows the penalty method to find a solution with a finite $\gamma$. Thus, we add the term $\nabla_v g^\top \nu$ to the penalty function (Eq. (9)) to get $\tilde{f}(u, v; \gamma, \nu) := f + \frac{\gamma}{2} \|\nabla_v g\|^2 + \nabla_v g^\top \nu$ and use the method of multipliers to update $\nu$ as $\nu \leftarrow \nu + \gamma \nabla_v g$. In summary, we use the following update rules in the paper.

$$u \leftarrow u - \sigma\, \nabla_u \tilde{f}(u, v; \gamma, \nu), \qquad v \leftarrow v - \rho\, \nabla_v \left[ \tilde{f}(u, v; \gamma, \nu) + \lambda g \right], \qquad \nu \leftarrow \nu + \gamma\, \nabla_v g.$$

These improvements are helpful in theory but the empirical difference was only moderate (see Appendix C for details).
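The claim that the method of multipliers can reach a solution with a finite penalty weight can be illustrated on a toy quadratic problem. In the NumPy sketch below, the problem data, the fixed value `gamma = 10`, the multiplier name `nu`, and the use of an exact inner solve (possible only because the augmented function is quadratic) are all assumptions for illustration. The paper's algorithm uses gradient steps instead of exact inner minimization.

```python
import numpy as np

# Toy problem (illustrative assumption, not from the paper):
#   f(u, v) = 0.5*||v - t||^2 + 0.5*mu*||u||^2
#   g(u, v) = 0.5*v^T A v - u^T C v,  so grad_v g = A v - C^T u
A = np.array([[2.0, 0.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0], [0.5, 1.0]])
t = np.array([1.0, -1.0])
mu = 1.0

def augmented_minimizer(gamma, nu):
    # Minimizes f + (gamma/2)*||grad_v g||^2 + nu^T grad_v g over (u, v).
    # Stationarity is the linear system
    #   (mu I + gamma C C^T) u - gamma C A v   = C nu
    #   -gamma A C^T u + (I + gamma A^2) v     = t - A nu
    H = np.block([
        [mu * np.eye(2) + gamma * C @ C.T, -gamma * C @ A],
        [-gamma * A @ C.T, np.eye(2) + gamma * A @ A],
    ])
    rhs = np.concatenate([C @ nu, t - A @ nu])
    x = np.linalg.solve(H, rhs)
    return x[:2], x[2:]

def method_of_multipliers(gamma=10.0, iters=20):
    nu = np.zeros(2)
    for _ in range(iters):
        u, v = augmented_minimizer(gamma, nu)
        nu = nu + gamma * (A @ v - C.T @ u)   # multiplier update
    return u, v, nu

# Reference solution of the toy bilevel problem, for comparison only
J = np.linalg.solve(A, C.T)
u_star = np.linalg.solve(J.T @ J + mu * np.eye(2), J.T @ t)
v_star = J @ u_star

u_hat, v_hat, nu_hat = method_of_multipliers()
dist = np.sqrt(np.sum((u_hat - u_star) ** 2) + np.sum((v_hat - v_star) ** 2))
print("distance to bilevel solution with gamma = 10:", dist)
```

With the plain quadratic penalty and the same fixed weight, the minimizer would remain at a nonzero distance from the bilevel solution. The multiplier term removes that bias without driving the weight to infinity.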

3 Experiments

In this section, we evaluate the performance of the proposed penalty method (Penalty) on various machine learning problems discussed in the introduction. We compare Penalty against both bilevel and non-bilevel solutions to these problems previously reported in the literature.

3.1 Synthetic examples

We start by comparing Penalty (ours) with gradient descent (GD), reverse-mode differentiation (RMD), and the approximate hypergradient method (ApproxGrad) on synthetic examples. We omit the comparison with forward-mode differentiation (FMD) because of its impractical time complexity for larger problems. GD refers to the alternating minimization $v \leftarrow v - \rho \nabla_v g$, $u \leftarrow u - \sigma \nabla_u f$. For RMD, we implemented a simple version of the method using vanilla gradient descent. For ApproxGrad, we implemented our own GPU-compatible version of the algorithm proposed by (Pedregosa, 2016), which uses Hessian-vector products, mini-batches, and gradient descent rather than conjugate gradient descent for solving the linear system. Using simple quadratic surfaces for $f$ and $g$, we compare all the algorithms by observing their convergence as a function of the number of upper-level iterations while varying the number of lower-level updates ($T$) used for computing the hypergradient update. Since the synthetic examples are not learning problems, we can only measure convergence by the Euclidean distance of the current iterate $(u, v)$ from the closest optimal solution $(u^*, v^*)$. Fig. 1 shows the performance on two 10-dimensional examples described in the caption (see Appendix D.1). As one would expect, increasing the number of $v$-updates makes all the algorithms better, since doing more lower-level iterations makes the hypergradient estimation more accurate (Eq. (7)), but it also increases the run time of the methods. However, even for these examples, only Penalty and ApproxGrad converge to the optimal solution, while GD and RMD converge to non-solution points (regardless of $T$). Moreover, from Fig. 1(b), we see that Penalty converges even with $T$=1 while ApproxGrad requires at least $T$=10 to converge, which shows that our method approximates the hypergradient accurately with smaller $T$. This directly translates to a smaller run-time for our method compared to ApproxGrad, since the run-time is directly proportional to $T$ (see Table 1).

Figure 1: Convergence of GD, RMD, ApproxGrad, and Penalty for two example bilevel problems. The mean curve (blue) is superimposed on 20 independent trials (yellow).
Figure 2: Convergence of GD, RMD, ApproxGrad, and Penalty for two example bilevel problems where the lower-level problem involves a rank-deficient random matrix. The mean curve (blue) is superimposed on 20 independent trials (yellow).

In Fig. 2 we show examples similar to Fig. 1 but with ill-conditioned or singular Hessian for the lower-level problem. The ill-conditioning poses difficulty for the methods since the implicit function theorem requires the invertibility of the Hessian at the solution point. Compared to Fig. 1, Fig. 2 shows that only Penalty converges to the true solution despite the fact that we add regularization in ApproxGrad to improve the ill-conditioning when solving the linear systems by minimization. We ascribe the robustness of Penalty to its simplicity and to the fact that it naturally handles non-uniqueness of the lower-level solution (see Appendix C.3). Additionally, we report the wall clock times for different methods on the four examples tested here in Table 2. We can see that as we increase the number of lower-level iterations all methods get slower but Penalty is faster than both RMD and ApproxGrad. Penalty is slower than GD but as shown in Fig. 1 and Fig. 2, GD does not converge to optima for most of the synthetic examples.

| Example | T | GD | RMD | ApproxGrad | Penalty |
|---|---|---|---|---|---|
| 1 | 1 | 7.4±0.3 | **15.0±0.1** | 17.4±0.2 | 17.2±0.1 |
| 1 | 5 | 14.3±0.1 | 51.4±0.3 | 39.3±2.3 | **34.3±0.3** |
| 1 | 10 | 23.2±0.1 | 95.4±0.2 | 60.9±0.3 | **57.0±1.0** |
| 2 | 1 | 7.7±0.1 | 18.5±0.1 | **17.2±0.3** | 17.4±0.2 |
| 2 | 5 | 17.3±0.1 | 62.7±0.1 | 37.9±0.1 | **35.0±0.2** |
| 2 | 10 | 22.4±2.6 | 115.0±0.4 | 64.2±0.3 | **52.7±1.4** |
| 3 | 1 | 8.2±0.2 | **18.8±0.1** | 19.8±0.1 | 19.1±0.1 |
| 3 | 5 | 17.4±0.1 | 72.4±0.1 | 47.1±0.4 | **38.6±0.4** |
| 3 | 10 | 28.7±0.6 | 125.0±9.3 | 80.6±0.3 | **62.7±0.1** |
| 4 | 1 | 7.9±0.1 | **19.5±0.1** | 20.4±0.0 | 19.6±0.1 |
| 4 | 5 | 16.9±0.2 | 72.8±0.5 | 48.4±0.6 | **40.2±0.1** |
| 4 | 10 | 28.3±0.2 | 138.0±0.2 | 81.2±1.6 | **58.0±4.3** |

Table 2: Mean wall-clock time (sec) for 10,000 upper-level iterations for synthetic experiments. Boldface is the smallest among RMD, ApproxGrad, and Penalty. (Mean ± s.d. of 10 runs)

3.2 Data denoising by importance learning

Now, we evaluate the performance of Penalty for learning a classifier from a dataset with corrupted labels (training data). We pose the problem as the importance learning problem presented in Eq. (3). We evaluate the performance of the classifier learned by Penalty, with 20 lower-level updates, against the following classifiers: Oracle: classifier trained on the portion of the training data with clean labels and the validation data; Val-only: classifier trained only on the validation data; Train+Val: classifier trained on the entire training and validation data; ApproxGrad: classifier trained with our implementation of ApproxGrad, with 20 lower-level and 20 linear system updates. We test the performance on the MNIST, CIFAR10 and SVHN datasets with validation set sizes of 1000, 10000 and 1000 points, respectively. We used convolutional neural networks (architectures described in Appendix D.2) at the lower level for this task. Table 3 summarizes our results for this problem and shows that Penalty outperforms Val-only, Train+Val and ApproxGrad by significant margins and in fact performs very close to the Oracle classifier (which is the ideal classifier), even for high noise levels. This demonstrates that Penalty is extremely effective in solving bilevel problems involving several million variables (see Table 7 in Appendix) and shows its effectiveness at handling non-convex problems. Along with the improvement in accuracy over other bilevel methods like ApproxGrad, Penalty also gives better run-time per upper-level iteration, leading to a decrease in the overall run time of the experiments (Fig. 3(a)).

We also compared the performance of Penalty against the RMD-based method presented in (Franceschi et al., 2017), using their setting from Sec. 5.1, which is a smaller version of this data denoising task. For this, we chose a sample of 5000 training, 5000 validation and 10000 test points from MNIST, randomly corrupted the labels of 50% of the training points, and used softmax regression in the lower level of the bilevel formulation (Eq. (3)). The accuracy of the classifier trained on the validation set together with the subset of training points whose importance values (as computed by Penalty) are greater than 0.9 is 90.77%. This is better than the accuracy obtained by Val-only (90.54%), Train+Val (86.25%) and the RMD-based method (90.09%) used by (Franceschi et al., 2017), and is close to the accuracy achieved by the Oracle classifier (91.06%).

| Dataset (Noise%) | Oracle | Val-Only | Train+Val | ApproxGrad (bilevel) | Penalty (bilevel) |
|---|---|---|---|---|---|
| MNIST (25) | 99.3±0.1 | 90.5±0.3 | 83.9±1.3 | 98.11±0.08 | 98.89±0.04 |
| MNIST (50) | 99.3±0.1 | 90.5±0.3 | 60.8±2.5 | 97.27±0.15 | 97.51±0.07 |
| CIFAR10 (25) | 82.9±1.1 | 70.3±1.8 | 79.1±0.8 | 71.59±0.87 | 79.67±1.01 |
| CIFAR10 (50) | 80.7±1.2 | 70.3±1.8 | 72.2±1.8 | 68.08±0.83 | 79.03±1.19 |
| SVHN (25) | 91.1±0.5 | 70.6±1.5 | 71.6±1.4 | 80.05±1.37 | 88.12±0.16 |
| SVHN (50) | 89.8±0.6 | 70.6±1.5 | 47.9±1.3 | 74.18±1.05 | 85.21±0.34 |

Table 3: Test accuracy (%) of the classifier learnt from datasets with noisy labels using importance learning. (Mean ± s.d. of 5 runs)
| Dataset | Task | MAML (Finn et al., 2017) | (Snell et al., 2017) | SNAIL (Mishra et al., 2017) | RMD (Franceschi et al., 2018) | ApproxGrad | Penalty |
|---|---|---|---|---|---|---|---|
| Omniglot | 5-way 1-shot | 98.7 | 98.8 | 99.1 | 98.6 | 97.49±0.31 | 97.57±0.11 |
| Omniglot | 5-way 5-shot | 99.9 | 99.7 | 99.8 | 99.5 | 99.43±0.02 | 99.41±0.05 |
| Omniglot | 20-way 1-shot | 95.8 | 96.0 | 97.6 | 95.5 | 93.07±0.24 | 92.20±0.22 |
| Omniglot | 20-way 5-shot | 98.9 | 98.9 | 99.4 | 98.4 | 98.14±0.13 | 98.10±0.06 |
| Mini-ImageNet | 5-way 1-shot | 48.70±1.75 | 49.42±0.78 | 55.71±0.99 | 50.54±0.85 | 48.1±0.82 | 52.10±0.65 |
| Mini-ImageNet | 5-way 5-shot | 63.11±0.92 | 68.20±0.66 | 68.88±0.92 | 64.53±0.68 | 64.9±0.84 | 66.91±0.92 |

Table 4: Few-shot classification accuracy (%) with Omniglot and Mini-ImageNet. The first three method columns are non-bilevel approaches; RMD, ApproxGrad and Penalty are bilevel approaches. We report mean ± s.d. for Omniglot and 95% confidence intervals for Mini-ImageNet over five trials. For bilevel approaches (Penalty, ApproxGrad and RMD (Franceschi et al., 2018)) the result is averaged over 600 randomly sampled tasks from the meta-test set.

3.3 Few-shot learning

Next, we evaluate the performance of Penalty on the task of learning a common representation for the few-shot learning problem. We use the formulation presented in Eq. (4) and use the Omniglot (Lake et al., 2015) and Mini-ImageNet (Vinyals et al., 2016) datasets for our experiments. Following the protocol proposed by (Vinyals et al., 2016) for $N$-way $K$-shot classification, we generate meta-training and meta-testing datasets. Each meta-set is built using images from disjoint classes. For Omniglot, our meta-training set comprises images from the first 1200 classes, and the remaining 423 classes are used in the meta-testing dataset. We also augment the meta-datasets with three different rotations (90, 180 and 270 degrees) of the images, as used by (Santoro et al., 2016). For the experiments with Mini-ImageNet, we used the split of 64 classes in meta-training, 16 classes in meta-validation and 20 classes in meta-testing, as used by (Ravi and Larochelle, 2017).

Each meta-batch of the meta-training and meta-testing datasets comprises a number of tasks, called the meta-batch-size. Each task in the meta-batch consists of a training set with $K$ images and a testing set with 15 images from $N$ classes. We train Penalty using a meta-batch-size of 30 for 5-way and 15 for 20-way classification for Omniglot, and a meta-batch-size of 2 for the Mini-ImageNet experiments. The training sets of the meta-train-batch are used to train the lower-level problem, and the test sets are used as validation sets for the upper-level problem in Eq. (4). The final accuracy is reported using the meta-test-set, for which we fix the common representation learnt during meta-training. We then train the classifiers at the lower level for 100 steps using the training sets from the meta-test-batch and evaluate the performance of each task on the associated test set from the meta-test-batch. The average performance of Penalty and ApproxGrad over 600 tasks is reported in Table 4. Penalty outperforms the other bilevel methods, namely ApproxGrad (trained with 20 lower-level iterations and 20 updates for the linear system) and the RMD-based method (Franceschi et al., 2018), on Mini-ImageNet and is comparable to them on Omniglot. We also show the trade-off between using a higher $T$ and run-time for ApproxGrad and Penalty in Fig. 3(b), which shows that Penalty achieves the same accuracy as ApproxGrad in almost half the run-time. In comparison to non-bilevel approaches, Penalty is comparable to most approaches but is slightly worse than (Mishra et al., 2017), which makes use of temporal convolutions and soft attention.
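The task construction described above can be sketched as follows. The function name `sample_task`, its arguments, and the reading that the $K$ training images and 15 test images are drawn per class are assumptions made for illustration. The sketch operates on a label array only and returns index sets, so it is independent of how the images are stored.

```python
import numpy as np

def sample_task(labels, n_way, k_shot, n_test=15, rng=None):
    """Sample one N-way K-shot task (hypothetical helper, not from the paper).

    Returns the chosen classes plus disjoint training and test index arrays,
    with k_shot training and n_test test examples per chosen class.
    """
    rng = np.random.default_rng() if rng is None else rng
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    train_idx, test_idx = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.extend(idx[:k_shot])
        test_idx.extend(idx[k_shot:k_shot + n_test])
    return classes, np.array(train_idx), np.array(test_idx)

# Usage on synthetic labels: 10 classes with 20 examples each
labels = np.repeat(np.arange(10), 20)
classes, train_idx, test_idx = sample_task(
    labels, n_way=5, k_shot=1, rng=np.random.default_rng(0)
)
print(len(train_idx), len(test_idx))
```

A meta-batch would then be a list of such tasks, with the training indices feeding the lower-level problem and the test indices feeding the upper-level problem.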

We used a four-layer convolutional neural network with 64 filters per layer for Omniglot, and a residual network with four residual blocks followed by two convolutional layers for Mini-ImageNet, to learn the common task representation (upper-level variable). The lower-level problem uses logistic regression to learn the task-specific classifiers (lower-level variables). We also use a normalization for the input-weight dot product before taking the softmax, similar to the cosine normalization proposed by (Luo et al., 2018). The bilevel problem for Mini-ImageNet has 3.8M upper-level variables, which is the largest among all the experiments presented in this paper (Table 7 in Appendix).

Untargeted Attacks (lower accuracy is better)

| Poisoned points | | (Muñoz-González et al., 2017) | ApproxGrad | Penalty |
|---|---|---|---|---|
| 1% | 86.71±0.32 | 85 | 82.09±0.84 | 83.29±0.43 |
| 2% | 86.23±0.98 | 83 | 77.54±0.57 | 78.14±0.53 |
| 3% | 85.17±0.96 | 82 | 74.41±1.14 | 75.14±1.09 |
| 4% | 84.93±0.55 | 81 | 71.88±0.40 | 72.70±0.46 |
| 5% | 84.39±1.06 | 80 | 68.69±0.86 | 69.48±1.93 |
| 6% | 84.64±0.69 | 79 | 66.91±0.89 | 67.59±1.17 |

Targeted Attacks (higher accuracy is better)

| Poisoned points | | (Muñoz-González et al., 2017) | ApproxGrad | Penalty |
|---|---|---|---|---|
| 1% | 7.76±1.07 | 10 | 18.84±1.90 | 17.40±3.00 |
| 2% | 12.08±2.13 | 15 | 39.64±3.72 | 41.64±4.43 |
| 3% | 18.36±1.23 | 25 | 52.76±2.69 | 51.40±2.72 |
| 4% | 24.41±2.05 | 35 | 60.01±1.61 | 61.16±1.34 |
| 5% | 30.41±4.24 | - | 65.61±4.01 | 65.52±2.85 |
| 6% | 32.88±3.47 | - | 71.48±4.24 | 70.01±2.95 |

Table 5: Test accuracy (%) of untargeted poisoning attack (top) and success rate (%) of targeted attack (bottom), using MNIST and logistic regression. (Mean ± s.d. of 5 runs)
Figure 3: Comparison of accuracy and wall clock time (per upper-level iteration) against the number of lower-level iterations $T$ for Penalty and ApproxGrad (for ApproxGrad, we perform $T$ updates for the linear system) on (a) the data denoising problem by importance learning (Sec. 3.2 with 25% noise on MNIST), (b) the few-shot learning problem (Sec. 3.3 with 20-way 5-shot classification on Omniglot), and (c) untargeted data poisoning (Sec. 3.4 with 60 poisoned points on MNIST).

3.4 Training-data poisoning

Next, we evaluate Penalty on the task of generating poisoned training data, such that models trained on this data perform poorly or differently compared to models trained on clean data (Mei and Zhu, 2015; Muñoz-González et al., 2017; Koh and Liang, 2017). We use the same setting as Sec. 4.2 of (Muñoz-González et al., 2017) and test both untargeted and targeted data poisoning on MNIST using the data augmentation technique. Here, we assume that regularized logistic regression will be used as the classifier during training. The poisoned points obtained by solving Eq. (5) with the various methods are added to the clean training set, and the performance of a new classifier trained on this data is reported in Table 5.

For the untargeted attack, our aim is to lower the overall performance of the classifier on the clean test set. For this experiment, we select a random subset of 1000 training, 1000 validation and 8000 testing points from MNIST and initialize the poisoning points with random instances from the training set, but assign them incorrect random labels. We use these poisoned points along with the clean training data to train logistic regression in the lower-level problem of Eq. (5).

For the targeted attack, we aim to misclassify images of eights as threes. For this, we select a balanced subset (each of the 10 classes is represented equally) of 1000 training, 4000 validation and 5000 testing points from the MNIST dataset. We then select the images of class 8 from the validation set, label them as 3, and use only these images for the upper-level problem in Eq. (5), with the difference that we now minimize the error in the upper level instead of maximizing it (that is, the upper level of Eq. (5) has no negative sign). To evaluate the performance, we select the images of 8 from the test set, label them as 3, and report the performance on this modified subset of the original test set in the targeted attack section of Table 5. For this experiment the poisoned points are initialized with images of classes 3 and 8 from the training set, with flipped labels, since threes and eights are the only classes involved in the poisoning.

We compare the performance of Penalty against the performance reported using RMD in (Muñoz-González et al., 2017) and against ApproxGrad. For ApproxGrad, we used 20 lower-level and 20 linear-system updates to obtain the results in Table 5. We see that Penalty significantly outperforms the RMD-based method and performs similarly to ApproxGrad. However, in terms of wall-clock time Penalty has an advantage over ApproxGrad (see Fig. 3(c)). We also compared the methods against a label-flipping baseline, where we select poisoned points from the validation sets and change their labels (randomly for untargeted attacks, and by mislabeling threes as eights and eights as threes for targeted attacks). All bilevel methods beat this baseline, showing that solving the bilevel problem can generate much better poisoning points. Examples of the poisoned points for untargeted and targeted attacks generated by Penalty are shown in Figs. 5 and 6 in Appendix D.4.
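The label-flipping baseline described above can be sketched in a few lines. The function name, the `mode` argument and the fixed seed are hypothetical choices for illustration; only the two flipping rules (random incorrect labels, and swapping threes with eights) come from the text.

```python
import random


def flip_labels(labels, mode, num_classes=10, seed=0):
    """Return a copy of `labels` with flipped labels.

    mode="untargeted": every label is replaced by a different, randomly chosen class.
    mode="targeted":   threes become eights and eights become threes; others are kept.
    """
    rng = random.Random(seed)
    flipped = []
    for y in labels:
        if mode == "untargeted":
            flipped.append(rng.choice([c for c in range(num_classes) if c != y]))
        elif mode == "targeted":
            flipped.append({3: 8, 8: 3}.get(y, y))
        else:
            raise ValueError("mode must be 'untargeted' or 'targeted'")
    return flipped


labels = [0, 3, 8, 9, 3]
untargeted = flip_labels(labels, "untargeted")
targeted = flip_labels(labels, "targeted")
print(untargeted, targeted)
```

Unlike the bilevel attacks, this baseline changes only the labels and leaves the images untouched.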

Additionally, we tested Penalty on the task of generating clean-label poisoning attacks (Koh and Liang, 2017; Shafahi et al., 2018), where the goal is to learn poisoned points that will be assigned correct labels when visually inspected by an expert, but that cause misclassification of specific target images when the classifier is trained on these poisoned points along with clean data. We used the dog vs. fish dataset and followed the setting in Sec. 5.2 of (Koh and Liang, 2017), achieving 100% attack success with just a single poisoned point per target image, compared to 57% attack success in the original paper. A recent method (Shafahi et al., 2018) also reports 100% attack success on this same task. Details of the experiment are presented in Appendix D.4.1.

3.5 Impact of $T$ on accuracy and wall-clock time

Finally, we compare Penalty and ApproxGrad on accuracy and time in Fig. 3 as we vary the number of lower-level iterations $T$ in the experiments. Intuitively, a larger $T$ corresponds to a more accurate approximation of the hypergradient and therefore a better result for all methods, but it comes at a cost in space and time. The figure shows that, beyond a certain $T$, the relative improvement is small in comparison to the increased run-time for both Penalty and ApproxGrad. Based on this result we fixed $T$ accordingly in all our experiments on real data. The figure also shows that even though Penalty and ApproxGrad have the same linear time complexity (Table 1), Penalty is about twice as fast as ApproxGrad in wall-clock time.

4 Conclusion

A wide range of interesting machine learning problems can be expressed as bilevel optimization problems, and new applications are still being discovered. So far, the difficulty of solving bilevel optimization has limited its widespread use for large-scale problems, especially those involving deep models. In this paper we presented an efficient algorithm based on the penalty function to solve bilevel optimization, which is simple and has both theoretical and practical advantages over existing methods. Compared to previous methods, we demonstrated competitive performance on problems with convex lower-level costs and significant improvement on problems with non-convex lower-level costs, both in terms of accuracy and time, highlighting the practical effectiveness of our penalty-based method. In future work, we plan to tackle other challenges in bilevel optimization, such as handling additional constraints in both the upper and lower levels.


References

  • [1] E. Aiyoshi and K. Shimizu (1984) A solution method for the static constrained stackelberg problem via penalty method. IEEE Transactions on Automatic Control 29 (12), pp. 1111–1114. Cited by: §2.2.
  • [2] J. F. Bard (1991) Some properties of the bilevel programming problem. Journal of optimization theory and applications 68 (2), pp. 371–378. Cited by: §1.
  • [3] J. F. Bard (2013) Practical bilevel optimization: algorithms and applications. Vol. 30, Springer Science & Business Media. Cited by: Appendix A, §1, Theorem 1.
  • [4] D. P. Bertsekas (1976) On penalty and multiplier methods for constrained minimization. SIAM Journal on Control and Optimization 14 (2), pp. 216–235. Cited by: §C.2, §2.2.
  • [5] D. P. Bertsekas (1997) Nonlinear programming. Journal of the Operational Research Society 48 (3), pp. 334–334. Cited by: §2.2.
  • [6] S. Dempe and J. Dutta (2012) Is bilevel programming a special case of a mathematical program with complementarity constraints?. Mathematical programming 131 (1-2), pp. 37–48. Cited by: §2.2.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §D.3.
  • [8] J. Domke (2012) Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pp. 318–326. Cited by: Appendix B, Appendix B, Appendix B, §1, §1, §2.1, §2.2.
  • [9] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. Cited by: Table 4.
  • [10] L. Franceschi, M. Donini, P. Frasconi, and M. Pontil (2017) Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, pp. 1165–1173. Cited by: Appendix B, Appendix B, §1, §1, §2.1, §2.2, §3.2.
  • [11] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil (2018) Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pp. 1563–1572. Cited by: §D.3, §1, §3.3, Table 4.
  • [12] J. Hamm and Y. Noh (2018) K-beam minimax: efficient optimization for deep adversarial learning. International Conference on Machine Learning (ICML). Cited by: §C.3.
  • [13] Y. Ishizuka and E. Aiyoshi (1992) Double penalty method for bilevel optimization problems. Annals of Operations Research 34 (1), pp. 73–88. Cited by: §2.2.
  • [14] P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pp. 1885–1894. Cited by: §D.4.1, §1, §3.4, §3.4.
  • [15] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §D.3, §3.3.
  • [16] T. Liu and D. Tao (2016) Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence 38 (3), pp. 447–461. Cited by: §1.
  • [17] J. Luketina, M. Berglund, K. Greff, and T. Raiko (2016) Scalable gradient-based tuning of continuous regularization hyperparameters. In International Conference on Machine Learning, pp. 2952–2960. Cited by: §1.
  • [18] C. Luo, J. Zhan, X. Xue, L. Wang, R. Ren, and Q. Yang (2018) Cosine normalization: using cosine similarity instead of dot product in neural networks. In International Conference on Artificial Neural Networks, pp. 382–391. Cited by: §3.3.
  • [19] D. Maclaurin, D. Duvenaud, and R. Adams (2015) Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122. Cited by: Appendix B, Appendix B, §1, §1, §2.1, §2.2.
  • [20] S. Mei and X. Zhu (2015) Using machine teaching to identify optimal training-set attacks on machine learners.. In AAAI, pp. 2871–2877. Cited by: §3.4.
  • [21] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2017) A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141. Cited by: §1, §3.3, Table 4.
  • [22] L. Muñoz-González, B. Biggio, A. Demontis, A. Paudice, V. Wongrassamee, E. C. Lupu, and F. Roli (2017) Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 27–38. Cited by: §1, §3.4, Table 5.
  • [23] J. Nocedal and S. Wright (2006) Numerical optimization. Springer Science & Business Media. Cited by: Appendix A.
  • [24] B. A. Pearlmutter (1994) Fast exact multiplication by the hessian. Neural computation 6 (1), pp. 147–160. Cited by: Appendix B, Table 1.
  • [25] F. Pedregosa (2016) Hyperparameter optimization with approximate gradient. In International conference on machine learning, pp. 737–746. Cited by: Appendix B, Appendix B, §1, §1, §2.1, §2.2, §3.1.
  • [26] S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. International Conference on Learning Representations (ICLR). Cited by: §1, §3.3.
  • [27] M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. In ICML, Cited by: §1.
  • [28] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016) One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065. Cited by: §1, §3.3.
  • [29] A. Shaban, C. Cheng, N. Hatch, and B. Boots (2018) Truncated back-propagation for bilevel optimization. arXiv preprint arXiv:1810.10667. Cited by: Appendix B, §1, §2.2.
  • [30] A. Shafahi, W. R. Huang, M. Najibi, O. Suciu, C. Studer, T. Dumitras, and T. Goldstein (2018) Poison frogs! targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems, pp. 6103–6113. Cited by: §D.4.1, §1, §3.4.
  • [31] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4080–4090. Cited by: §1, Table 4.
  • [32] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638. Cited by: §D.3, §1, §3.3.
  • [33] H. Von Stackelberg (2010) Market structure and equilibrium. Springer Science & Business Media. Cited by: §1.
  • [34] X. Yu, T. Liu, M. Gong, K. Zhang, and D. Tao (2017) Transfer learning with label noise. arXiv preprint arXiv:1707.09724. Cited by: §1.

Appendix A Proofs

Theorem 2.

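Suppose $\{\epsilon_k\}$ is a positive ($\epsilon_k > 0$) and convergent ($\epsilon_k \to 0$) sequence, and $\{\gamma_k\}$ is a positive ($\gamma_k > 0$), non-decreasing ($\gamma_k \le \gamma_{k+1}$), and divergent ($\gamma_k \to \infty$) sequence. Let $\{(u_k, v_k)\}$ be the sequence of approximate solutions to Eq. (9) with tolerance $\|\nabla_u \tilde f(u_k, v_k; \gamma_k)\|^2 + \|\nabla_v \tilde f(u_k, v_k; \gamma_k)\|^2 \le \epsilon_k^2$ for all $k$. Then any limit point of $\{(u_k, v_k)\}$ satisfies the KKT conditions of the problem Eq. (8).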


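The proof follows the standard proof for penalty function methods, e.g., [23]. Let $w := (u, v)$ refer to the pair, and let $\bar w := (\bar u, \bar v)$ be any limit point of the sequence $\{w_k := (u_k, v_k)\}$; then there is a subsequence $\mathcal{K}$ such that $\lim_{k \in \mathcal{K}} w_k = \bar w$. Write $J(w) := \begin{bmatrix} \nabla^2_{uv} g(w) \\ \nabla^2_{vv} g(w) \end{bmatrix}$, so that the gradient of the penalty function $\tilde f(w; \gamma) = f(w) + \frac{\gamma}{2}\|\nabla_v g(w)\|^2$ of Eq. (9) is $\nabla_w \tilde f = \nabla_w f + \gamma J \nabla_v g$. From the tolerance condition

$$\|\nabla_w \tilde f(w_k; \gamma_k)\| = \|\nabla_w f(w_k) + \gamma_k J(w_k) \nabla_v g(w_k)\| \le \epsilon_k$$

we have

$$\|J(w_k) \nabla_v g(w_k)\| \le \frac{1}{\gamma_k}\left(\|\nabla_w f(w_k)\| + \epsilon_k\right).$$

Take the limit with respect to the subsequence $\mathcal{K}$ on both sides to get

$$J(\bar w)\, \nabla_v g(\bar w) = 0.$$

Since $J$ is a tall matrix and $\nabla^2_{vv} g$ is invertible by assumption, $J$ is full-rank and therefore $\nabla_v g(\bar w) = 0$, which is the primary feasibility condition in Eq. (8). Furthermore, let $\mu_k := -\gamma_k \nabla_v g(w_k)$, then by definition,

$$\nabla_w \tilde f(w_k; \gamma_k) = \nabla_w f(w_k) - J(w_k)\, \mu_k.$$

We can write

$$\mu_k = \left[J(w_k)^\top J(w_k)\right]^{-1} J(w_k)^\top \left[\nabla_w f(w_k) - \nabla_w \tilde f(w_k; \gamma_k)\right].$$

The corresponding limit $\bar\mu$ can be found by taking the limit of the subsequence

$$\bar\mu := \lim_{k \in \mathcal{K}} \mu_k = \left[J(\bar w)^\top J(\bar w)\right]^{-1} J(\bar w)^\top \nabla_w f(\bar w).$$

Since $\lim_{k \in \mathcal{K}} \nabla_w \tilde f(w_k; \gamma_k) = 0$ from the condition $\epsilon_k \to 0$, we get

$$\nabla_w f(\bar w) - J(\bar w)\, \bar\mu = 0$$

at the limit $\bar w$, which is the stationarity condition of Eq. (8). Together with the feasibility condition $\nabla_v g(\bar w) = 0$, the two KKT conditions of Eq. (8) are satisfied at the limit point. ∎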
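As a minimal numerical sketch of the behaviour Theorem 2 describes, the loop below runs an alternating penalty iteration on a hypothetical scalar problem with $f(u, v) = \frac{1}{2}(v - 1)^2 + \frac{1}{2}u^2$ and $g(u, v) = \frac{1}{2} a v^2 - uv$, $a = 2$, whose bilevel solution is $u = 0.4$, $v = 0.2$. It takes the penalty function to be $f + \frac{\gamma}{2}\|\nabla_v g\|^2$. The toy costs, the schedule $\gamma_k = 2^k$ and the step sizes are assumptions chosen for illustration, not the settings used in the experiments.

```python
# Toy bilevel problem (hypothetical, for illustration only):
#   upper level: f(u, v) = 0.5*(v - 1)^2 + 0.5*u^2
#   lower level: g(u, v) = 0.5*a*v^2 - u*v, so grad_v g = a*v - u and v*(u) = u/a
A_CURV = 2.0  # curvature `a` of the lower-level cost


def grad_penalty(u, v, gamma):
    """Gradient of f(u, v) + (gamma/2) * (grad_v g)^2 with respect to (u, v)."""
    c = A_CURV * v - u                        # constraint value grad_v g(u, v)
    grad_u = u + gamma * c * (-1.0)           # d(grad_v g)/du = -1
    grad_v = (v - 1.0) + gamma * c * A_CURV   # d(grad_v g)/dv = a
    return grad_u, grad_v


def penalty_method(num_outer=20, num_inner=5, sigma=0.5):
    u, v = 0.0, 0.0
    for k in range(num_outer):
        gamma = 2.0 ** k                          # positive, non-decreasing, divergent
        rho = 1.0 / (1.0 + gamma * A_CURV ** 2)   # inverse curvature of the penalty in v
        for _ in range(num_inner):                # v-update: gradient steps on the penalty
            _, gv = grad_penalty(u, v, gamma)
            v -= rho * gv
        gu, _ = grad_penalty(u, v, gamma)         # u-update: one gradient step
        u -= sigma * gu
    return u, v


u_hat, v_hat = penalty_method()
print(u_hat, v_hat)  # approaches the bilevel solution (0.4, 0.2)
```

For a fixed $\gamma$ the iterates settle at the minimizer of the penalty function, which differs from the bilevel solution by a term of order $1/\gamma$. Letting $\gamma_k$ grow removes that gap.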

Lemma 3.

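Given $u$, let $\hat v$ be $\arg\min_v \tilde f(u, v; \gamma)$ of Eq. (9). Then, $\nabla_u \tilde f(u, \hat v; \gamma) = \frac{df}{du}(u, \hat v)$.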


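At the minimum $\hat v$ the gradient $\nabla_v \tilde f$ vanishes, that is $\nabla_v f + \gamma \nabla^2_{vv} g\, \nabla_v g = 0$.
Equivalently, $\nabla_v g = -\gamma^{-1} (\nabla^2_{vv} g)^{-1} \nabla_v f$. Then,

$$\nabla_u \tilde f = \nabla_u f + \gamma \nabla^2_{uv} g\, \nabla_v g = \nabla_u f - \nabla^2_{uv} g\, (\nabla^2_{vv} g)^{-1} \nabla_v f,$$

where $\gamma$ disappears, which is the hypergradient $\frac{df}{du}$. ∎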

That is, if we find the minimum $\hat v$ of the penalty function for given $u$ and $\gamma$, we get the hypergradient Eq. (7) at $(u, \hat v)$. Furthermore, under the conditions of Theorem 1, $\hat v(u) \to v^*(u)$ as $\gamma \to \infty$ (see Lemma 8.3.1 of [3]), and we get the exact hypergradient asymptotically.
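The identity in Lemma 3 can be checked numerically on a small example. The sketch below assumes a hypothetical scalar problem, $f(u, v) = \frac{1}{2}(v - 1)^2 + \frac{1}{2}u^2$ and $g(u, v) = \frac{1}{2} a v^2 - uv$ with $a = 2$, and takes the penalty function to be $f + \frac{\gamma}{2}\|\nabla_v g\|^2$. For these costs the minimizer over $v$ has a closed form, so no inner optimization loop is needed.

```python
A_CURV = 2.0  # curvature `a` of the toy lower-level cost g(u, v) = 0.5*a*v^2 - u*v


def argmin_v_penalty(u, gamma):
    """Closed-form minimizer over v of f + (gamma/2)*(grad_v g)^2 for the toy costs."""
    return (1.0 + gamma * A_CURV * u) / (1.0 + gamma * A_CURV ** 2)


def grad_u_penalty(u, v, gamma):
    """grad_u of the penalty function: grad_u f + gamma * grad2_uv g * grad_v g."""
    return u + gamma * (-1.0) * (A_CURV * v - u)


def hypergradient(u, v):
    """grad_u f - grad2_uv g * (grad2_vv g)^{-1} * grad_v f, evaluated at (u, v)."""
    return u - (-1.0) * (1.0 / A_CURV) * (v - 1.0)


u = 0.7
for gamma in (1.0, 10.0, 1000.0):
    v_hat = argmin_v_penalty(u, gamma)
    # The first two printed values agree for every gamma; the last one, the
    # distance of v_hat from the lower-level solution u/a, shrinks as gamma grows.
    print(gamma, grad_u_penalty(u, v_hat, gamma), hypergradient(u, v_hat),
          abs(v_hat - u / A_CURV))
```

The two gradients coincide at $(u, \hat v)$ for any $\gamma$, while $\hat v$ itself reaches the lower-level solution only in the limit. This is why the hypergradient is exact only asymptotically.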

Appendix B Review of bilevel optimization methods

Several methods have been proposed to solve bilevel optimization problems appearing in machine learning, including forward/reverse-mode differentiation [19, 10] and approximate gradient [8, 25], which we describe briefly here.

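Forward-mode (FMD) and Reverse-mode differentiation (RMD). Domke [8], Maclaurin et al. [19], Franceschi et al. [10], and Shaban et al. [29] studied forward- and reverse-mode differentiation to solve the minimization problem $\min_u f(u, v_T)$ where the lower-level variable $v_T$ follows a dynamical system $v_{t+1} = \Phi_{t+1}(v_t; u)$, $t = 0, 1, \dots, T-1$. This setting is more general than that of a bilevel problem. However, a stable dynamical system is one that converges to a steady state and thus, the process $\Phi_{t+1}(\cdot)$ can be considered as minimizing an energy or a potential function.

Define $A_{t+1} := \nabla_v \Phi_{t+1}(v_t)$ and $B_{t+1} := \nabla_u \Phi_{t+1}(v_t)$, then the hypergradient Eq. (7) can be computed by

$$\frac{df}{du} = \nabla_u f(u, v_T) + \sum_{t=1}^{T} B_t A_{t+1} \cdots A_T \nabla_v f(u, v_T).$$

When the lower-level process is one step of gradient descent on a cost function $g$, that is,

$$\Phi_{t+1}(v_t; u) = v_t - \rho \nabla_v g(u, v_t),$$

we get

$$A_{t+1} = I - \rho \nabla^2_{vv} g(u, v_t), \qquad B_{t+1} = -\rho \nabla^2_{uv} g(u, v_t).$$

The sequences $\{A_t\}$ and $\{B_t\}$ can be computed in forward or reverse mode. For reverse-mode differentiation, first compute

$$v_{t+1} = \Phi_{t+1}(v_t; u), \qquad t = 0, 1, \dots, T-1,$$

then compute

$$q_T \leftarrow \nabla_v f(u, v_T), \quad p_T \leftarrow \nabla_u f(u, v_T), \qquad p_{t-1} \leftarrow p_t + B_t q_t, \quad q_{t-1} \leftarrow A_t q_t, \qquad t = T, T-1, \dots, 1.$$

The final hypergradient is $\frac{df}{du} = p_0$. For forward-mode differentiation, simultaneously compute $v_t$, $A_t$, $B_t$ and

$$P_0 \leftarrow 0, \qquad P_{t+1} \leftarrow P_t A_{t+1} + B_{t+1}, \qquad t = 0, 1, \dots, T-1.$$

The final hypergradient is

$$\frac{df}{du} = \nabla_u f(u, v_T) + P_T \nabla_v f(u, v_T).$$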
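A minimal sketch of both differentiation modes, for $T$ gradient-descent steps on a lower-level cost, is given below. It assumes a hypothetical scalar problem, $f(u, v) = \frac{1}{2}(v - 1)^2 + \frac{1}{2}u^2$ and $g(u, v) = \frac{1}{2} a v^2 - uv$, so that the Jacobians of the update map reduce to constant numbers. The step size and all names are illustrative. A central finite difference of the unrolled objective serves as a reference.

```python
A_CURV = 2.0  # curvature `a` of the toy lower-level cost g(u, v) = 0.5*a*v^2 - u*v
RHO = 0.1     # lower-level step size (assumed)

# For this quadratic g the Jacobians of the update map are constant in t.
# In general they depend on v_t, and reverse mode must store the trajectory.
A_JAC = 1.0 - RHO * A_CURV   # d(step)/dv = I - rho * grad2_vv g
B_JAC = -RHO * (-1.0)        # d(step)/du = -rho * grad2_uv g, with grad2_uv g = -1


def step(u, v):
    """One gradient-descent step on g: v - rho * grad_v g(u, v)."""
    return v - RHO * (A_CURV * v - u)


def upper_cost(u, v):
    return 0.5 * (v - 1.0) ** 2 + 0.5 * u ** 2


def reverse_mode(u, v0, T):
    v = v0
    for _ in range(T):       # forward pass: v_0 -> v_T
        v = step(u, v)
    q = v - 1.0              # q_T = grad_v f(u, v_T)
    p = u                    # p_T = grad_u f(u, v_T)
    for _ in range(T):       # backward pass: t = T, ..., 1
        p = p + B_JAC * q
        q = A_JAC * q
    return p                 # p_0


def forward_mode(u, v0, T):
    v, P = v0, 0.0
    for _ in range(T):       # propagate v_t and P_t together
        P = P * A_JAC + B_JAC
        v = step(u, v)
    return u + P * (v - 1.0)


def finite_difference(u, v0, T, h=1e-6):
    def total(u_):
        v = v0
        for _ in range(T):
            v = step(u_, v)
        return upper_cost(u_, v)
    return (total(u + h) - total(u - h)) / (2.0 * h)


rmd = reverse_mode(0.7, 0.0, 20)
fmd = forward_mode(0.7, 0.0, 20)
fd = finite_difference(0.7, 0.0, 20)
print(rmd, fmd, fd)  # the three values agree up to finite-difference error
```

In higher dimensions the two modes differ in cost. Forward mode carries a matrix $P_t$ with one row per upper-level variable, whereas reverse mode carries vectors but needs the stored lower-level trajectory.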

Approximate hypergradient (ApproxGrad). Since computing the inverse of the Hessian $(\nabla^2_{vv} g)^{-1}$ directly is difficult even for moderately-sized neural networks, Domke [8] proposed to find an approximate solution to $q = (\nabla^2_{vv} g)^{-1} \nabla_v f$ by solving the linear system of equations $\nabla^2_{vv} g \cdot q \approx \nabla_v f$. This can be done by solving

$$\min_q \left\|\nabla^2_{vv} g \cdot q - \nabla_v f\right\|$$

using conjugate gradient or any other method. Note that the minimization requires evaluation of the Hessian-vector product, which can be done in linear time [24]. The asymptotic convergence with approximate solutions was shown by Pedregosa [25].
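The linear-system step can be sketched with a plain conjugate-gradient solver that touches the Hessian only through Hessian-vector products. The example assumes a hypothetical quadratic lower-level cost $g(u, v) = \frac{1}{2} v^\top H v - u\, b^\top v$ and upper-level cost $f(u, v) = \frac{1}{2}\|v - c\|^2 + \frac{1}{2}u^2$. Here `hvp` is an explicit matrix-vector product, whereas for a neural network it would come from automatic differentiation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(hvp, rhs, num_iters, tol=1e-20):
    """Approximately solve H q = rhs for symmetric positive definite H,
    using only Hessian-vector products hvp(d) = H d."""
    q = [0.0] * len(rhs)
    r = list(rhs)       # residual rhs - H q for the initial guess q = 0
    d = list(r)         # search direction
    rs = dot(r, r)
    for _ in range(num_iters):
        if rs < tol:    # squared residual norm is small enough
            break
        Hd = hvp(d)
        alpha = rs / dot(d, Hd)
        q = [qi + alpha * di for qi, di in zip(q, d)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hd)]
        rs_new = dot(r, r)
        d = [ri + (rs_new / rs) * di for ri, di in zip(r, d)]
        rs = rs_new
    return q


# Toy quantities (hypothetical): H = grad2_vv g and grad2_uv g = -b^T.
H = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
u, v, c = 0.5, [1.0, -1.0], [0.0, 0.0]


def hvp(d):
    """Hessian-vector product grad2_vv g * d (explicit here, autodiff in practice)."""
    return [dot(row, d) for row in H]


grad_v_f = [vi - ci for vi, ci in zip(v, c)]          # grad_v f = v - c
q = conjugate_gradient(hvp, grad_v_f, num_iters=10)   # q ~ H^{-1} grad_v f
approx_hypergrad = u + dot(b, q)                      # grad_u f - grad2_uv g * q
print(q, approx_hypergrad)
```

Truncating `num_iters` trades accuracy of $q$ for time. That budget plays the role of the number of linear-system updates used for ApproxGrad in the experiments.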

Appendix C Improvements to Algorithm 1

Here we discuss the details of the modifications to Alg. 1 presented in the main text which can be added to improve the performance of the algorithm in practice.

C.1 Improving local convexity by regularization

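One of the common assumptions of this and previous works is that $\nabla^2_{vv} g$ is invertible and locally positive definite. Neither invertibility nor positive definiteness holds in general for bilevel problems involving deep neural networks, and this causes difficulties in the optimization. Note that if $g$ is non-convex in $v$, minimizing the penalty term $\|\nabla_v g\|$ does not necessarily lower the cost $g$ but instead moves the variable towards a stationary point, which is a known problem even for Newton's method. Thus we propose the following modification to the $v$-update:

$$\min_v \left[\tilde f(u, v; \gamma_k) + \lambda_k g(u, v)\right],$$

keeping the $u$-update intact. To see how this affects the optimization, note that the $v$-update becomes

$$v \leftarrow v - \rho \left[\nabla_v f + \gamma_k \nabla^2_{vv} g\, \nabla_v g + \lambda_k \nabla_v g\right].$$

After $v$ converges to a stationary point, we get
$\nabla_v g = -(\gamma_k \nabla^2_{vv} g + \lambda_k I)^{-1} \nabla_v f$, and after plugging this into the $u$-update, we get

$$\nabla_u \tilde f = \nabla_u f - \nabla^2_{uv} g \left(\nabla^2_{vv} g + \frac{\lambda_k}{\gamma_k} I\right)^{-1} \nabla_v f,$$

that is, the Hessian inverse $(\nabla^2_{vv} g)^{-1}$ is replaced by a regularized version $(\nabla^2_{vv} g + \frac{\lambda_k}{\gamma_k} I)^{-1}$ to improve the positive definiteness of the Hessian. With a decreasing or constant sequence $\{\lambda_k\}$ such that $\lambda_k / \gamma_k \to 0$, the regularization does not change the solution.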

C.2 Convergence with finite $\gamma$

The penalty function method is intuitive and easy to implement, but the sequence $\{(u_k, v_k)\}$ is guaranteed to converge to an optimal solution only in the limit $\gamma \to \infty$, which may not be achieved in practice in a limited time. It is known that the penalty method can be improved by introducing an additional term into the function, which is called the augmented Lagrangian (penalty) method [4]:

$$\min_{u, v} \left[\tilde f(u, v; \gamma) + \nabla_v g(u, v)^\top \nu\right].$$

This new term allows convergence to the optimal solution even when $\gamma$ is finite. Furthermore, using the update rule $\nu \leftarrow \nu + \gamma \nabla_v g$, called the method of multipliers, it is known that $\nu$ converges to the true Lagrange multiplier of this problem corresponding to the equality constraints $\nabla_v g = 0$.
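The effect of the multiplier term can be illustrated on a hypothetical scalar problem, $f(u, v) = \frac{1}{2}(v - 1)^2 + \frac{1}{2}u^2$ and $g(u, v) = \frac{1}{2} a v^2 - uv$ with $a = 2$, whose bilevel solution is $u = 0.4$, $v = 0.2$. The sketch below keeps the penalty weight fixed at $\gamma = 1$ and alternates an approximate minimization of the augmented function with the multiplier update. The inner solver, step size and iteration counts are assumptions chosen for this toy problem.

```python
A_CURV = 2.0  # curvature `a` of the toy lower-level cost g(u, v) = 0.5*a*v^2 - u*v
GAMMA = 1.0   # finite penalty weight, kept fixed throughout


def grad_aug_lagrangian(u, v, nu):
    """Gradient in (u, v) of f + (GAMMA/2)*c^2 + nu*c, where c = grad_v g = a*v - u."""
    c = A_CURV * v - u
    grad_u = u - (GAMMA * c + nu)                    # dc/du = -1
    grad_v = (v - 1.0) + A_CURV * (GAMMA * c + nu)   # dc/dv = a
    return grad_u, grad_v


def method_of_multipliers(num_outer=30, num_inner=200, step=1.0 / 6.0):
    u, v, nu = 0.0, 0.0, 0.0
    for _ in range(num_outer):
        for _ in range(num_inner):            # approximately minimize over (u, v)
            gu, gv = grad_aug_lagrangian(u, v, nu)
            u, v = u - step * gu, v - step * gv
        nu = nu + GAMMA * (A_CURV * v - u)    # multiplier update
    return u, v, nu


u_hat, v_hat, nu_hat = method_of_multipliers()
print(u_hat, v_hat, nu_hat)  # approaches u = 0.4, v = 0.2 with gamma fixed at 1
```

With the same $\gamma = 1$ and no multiplier, the minimizer of the penalty function alone is $u = 1/3$, so here the multiplier update is what closes the gap without increasing $\gamma$.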

C.3 Non-unique lower-level solution

Most existing methods have assumed that the lower-level solution $\arg\min_v g(u, v)$ is unique for all $u$. Regularization from the previous section can improve the ill-conditioning of the Hessian $\nabla^2_{vv} g$, but it does not address the case of multiple disconnected global minima of $g$. With multiple lower-level solutions $Z(u) := \{v \mid v \in \arg\min_{v'} g(u, v')\}$, there is an ambiguity in defining the upper-level problem. If we assume that $v \in Z(u)$ is chosen adversarially (or pessimistically), then the upper-level problem should be defined as

$$\min_u \max_{v \in Z(u)} f(u, v).$$

If $v \in Z(u)$ is chosen co-operatively (or optimistically), then the upper-level problem should be defined as

$$\min_u \min_{v \in Z(u)} f(u, v),$$

and the results can be quite different between these two cases. Note that the proposed penalty function method is naturally solving the optimistic case, as Alg. 1 is solving the problem of $\min_{u,\, v \in Z(u)} f(u, v)$ by alternating gradient descent. However, with a gradient-based method, we cannot hope to find all disconnected multiple solutions. In a related problem of min-max optimization, which is a special case of bilevel optimization, an algorithm for handling non-unique solutions was proposed recently [12]. This idea of keeping multiple candidate solutions may be applicable to bilevel problems too, and further analysis of the non-unique lower-level problem is left as future work.
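The gap between the two formulations can be seen on a hypothetical example in which the lower-level cost $g(u, v) = (v^2 - 1)^2$ has two global minima, $v = \pm 1$, for every $u$, and the upper-level cost is $f(u, v) = (u - v)^2 + v$. The grid search below is only an illustration of the definitions, not a practical solver.

```python
def upper_cost(u, v):
    return (u - v) ** 2 + v


# Solution set of the lower-level problem for g(u, v) = (v^2 - 1)^2; it does not depend on u.
LOWER_SOLUTIONS = (-1.0, 1.0)
U_GRID = [i / 100.0 for i in range(-200, 201)]   # candidate upper-level variables in [-2, 2]


def solve(inner):
    """Grid search for min over u of inner over v in Z(u) of f(u, v).

    `inner` is the built-in min (optimistic) or max (pessimistic).
    Returns the pair (best value, best u).
    """
    return min((inner(upper_cost(u, v) for v in LOWER_SOLUTIONS), u) for u in U_GRID)


opt_value, opt_u = solve(min)
pes_value, pes_u = solve(max)
print(opt_u, opt_value)   # optimistic:  u = -1.0, value = -1.0
print(pes_u, pes_value)   # pessimistic: u = 0.5,  value = 1.25
```

The optimistic problem commits to the favourable branch $v = -1$, while the pessimistic problem balances the two branches and ends at a different $u$ with a higher cost.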

C.4 Modified algorithm

Here we present the modified algorithm, which incorporates the regularization (Sec. C.1) and the augmented Lagrangian (Sec. C.2) discussed previously. The augmented Lagrangian term applies to both the $u$- and $v$-updates, but the regularization term applies only to the $v$-update, as its purpose is to improve the ill-conditioning of $\nabla^2_{vv} g$ during the $v$-update. The modified penalized functions for -update and