Bilevel optimization is at the center of several important machine learning problems such as hyperparameter tuning, data denoising, few-shot learning, and data poisoning. Unlike simultaneous or multi-objective optimization, obtaining the exact descent direction for continuous bilevel optimization requires computing the inverse of the Hessian of the lower-level cost function, even for first-order methods. In this paper, we propose a new method for solving bilevel optimization, based on a penalty function, which avoids computing this inverse Hessian. We prove convergence of the method under mild conditions and show that it computes the exact hypergradient asymptotically. The small space and time complexity of our method allows us to solve large-scale bilevel optimization problems involving deep neural networks with up to 3.8M upper-level and 1.4M lower-level variables. We present results of our method for data denoising on the MNIST/CIFAR10/SVHN datasets, for few-shot learning on the Omniglot/MiniImagenet datasets, and for training-data poisoning on the MNIST/Imagenet datasets. In all experiments, our method outperforms or is comparable to previously proposed methods, both in terms of accuracy and runtime.
Bilevel optimization appears in many fields of study where two competing parties or objectives are involved. In particular, a bilevel problem arises when one party makes its choice first, affecting the optimal choice of the second party; this is also known as the Stackelberg model, dating back to the 1930s (Von Stackelberg, 2010). The general form of a bilevel optimization problem is
min_u f(u, v*)  s.t.  v* = argmin_v g(u, v),   (1)

where u and v are the upper- and lower-level variables, and f(u, v) and g(u, v) are the upper- and lower-level costs, respectively.
A bilevel problem with constraints is of the form min_{u∈U} f(u, v*) s.t. v* = argmin_{v∈V(u)} g(u, v), where the lower-level constraint set V(u) can depend on u. However, we focus on unconstrained problems in this paper. The 'upper-level' problem is a usual minimization problem, except that v* is constrained to be the solution of the 'lower-level' problem, which in turn depends on u (see (Bard, 2013) for a review of bilevel optimization). Bilevel optimization also appears in many important machine learning problems, for example gradient-based hyperparameter tuning (Domke, 2012; Maclaurin et al., 2015; Luketina et al., 2016; Pedregosa, 2016; Franceschi et al., 2017, 2018), data denoising by importance learning (Liu and Tao, 2016; Yu et al., 2017; Ren et al., 2018), few-shot learning (Ravi and Larochelle, 2017; Santoro et al., 2016; Vinyals et al., 2016; Franceschi et al., 2017; Mishra et al., 2017; Snell et al., 2017; Franceschi et al., 2018), and training-data poisoning (Maclaurin et al., 2015; Muñoz-González et al., 2017; Koh and Liang, 2017; Shafahi et al., 2018). We explain each of these problems and their bilevel formulations below.
Gradient-based hyperparameter tuning. Finding hyperparameters is an indispensable step in any machine learning problem. Grid search is a popular way of finding optimal hyperparameters when their domain is a predetermined discrete set or range. However, when the losses are differentiable functions of the hyperparameter(s), we can find optimal hyperparameter values by solving a continuous bilevel optimization. Let λ and w denote the hyperparameter(s) and parameter(s) for a class of learning algorithms, and let h_w be the hypothesis. Then f(λ, w) is the validation loss, and g(λ, w) is the training loss, defined similarly. The best hyperparameter(s) λ is then the solution to the following problem:
min_λ f(λ, w*)  s.t.  w* = argmin_w g(λ, w).   (2)
Thus, we find the best model parameters w* for each choice of the hyperparameter λ, and select the value of λ that incurs the smallest validation loss.
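As a concrete illustration of Eq. (2), the sketch below tunes a ridge-regression penalty by gradient descent on the hyperparameter. All data is synthetic, and the closed-form lower-level solve is special to this toy problem; for deep models one would replace it with iterative approximations.

```python
import numpy as np

# Hypothetical toy instance of Eq. (2): tune a ridge penalty lam
# (upper-level variable) for linear-regression weights w (lower-level
# variable). All data below is made up.
rng = np.random.default_rng(0)
n, d = 40, 5
X_tr, X_val = rng.normal(size=(n, d)), rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n)
y_val = X_val @ w_true + 0.5 * rng.normal(size=n)

def w_star(lam):
    # lower level in closed form: argmin_w ||X_tr w - y_tr||^2 + lam ||w||^2
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

def val_loss(w):
    return np.mean((X_val @ w - y_val) ** 2)

def hypergradient(lam):
    # implicit differentiation: dw*/dlam = -(X'X + lam I)^{-1} w*(lam)
    w = w_star(lam)
    dw = -np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), w)
    grad_f = 2.0 * X_val.T @ (X_val @ w - y_val) / len(y_val)
    return float(grad_f @ dw)

lam = 5.0
before = val_loss(w_star(lam))
for _ in range(200):
    lam = max(lam - 0.5 * hypergradient(lam), 1e-6)  # projected gradient step
after = val_loss(w_star(lam))
```

After the descent loop, the validation loss at the tuned λ is no worse than at the starting value, which is exactly the upper-level objective of Eq. (2).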
Data denoising by importance learning. A common assumption in learning is that the training examples are i.i.d. samples from the same distribution as the test data. This assumption is violated if the training and test distributions differ, or if the training examples have been corrupted by noise or modified by adversaries. In such cases, reweighting the importance of each training example before training can help reduce the discrepancy between the two distributions: for example, one can upweight the importance of examples from the test distribution and downweight the importance of the rest. The problem of finding the correct weight for each training example can be formulated as a bilevel optimization. Let α be the vector of nonnegative importance values for the n training examples, and let w be the parameter(s) of a classifier h_w, so that the weighted training loss is g(α, w) = Σ_{i=1}^n α_i ℓ(h_w(x_i), y_i). Also, assume that we can get a small number of examples from the same distribution as the test examples (clean validation examples), with validation loss f(w). Then the importance learning problem can be formulated as:

min_α f(w*)  s.t.  w* = argmin_w g(α, w).   (3)
Hence, the importance of each training example (the vector α) is selected such that the minimizer of the weighted training loss in the lower level also minimizes the validation loss in the upper level. The final importance values can help identify good points in the noisy training set, and the classifier obtained by solving this optimization has superior performance compared to a model trained directly on the noisy data.
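The weighted training loss in Eq. (3) can be illustrated with a small made-up experiment: corrupt some labels, then compare uniform importance weights against "oracle" weights that zero out the corrupted examples, which mimics what the bilevel solution aims to discover automatically.

```python
import numpy as np

# Toy sketch of the weighted training loss in Eq. (3). The data, the
# noise model, and the oracle weights are all made up: 80% of positive
# labels are flipped to negative.
rng = np.random.default_rng(1)
n = 400
X = np.c_[rng.normal(size=(n, 2)), np.ones(n)]   # two features + bias
y = (X[:, 0] > 0).astype(float)                  # clean labels
flip = (y == 1) & (rng.random(n) < 0.8)
y_noisy = np.where(flip, 0.0, y)

def train(weights, steps=500, lr=0.5):
    # gradient descent on the importance-weighted logistic loss
    w = np.zeros(3)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (weights * (p - y_noisy)) / weights.sum()
    return w

def clean_accuracy(w):
    return np.mean(((X @ w) > 0) == (y > 0.5))

acc_uniform = clean_accuracy(train(np.ones(n)))             # alpha_i = 1
acc_oracle = clean_accuracy(train(np.where(flip, 0.0, 1.0)))  # alpha_i = 0 on noise
```

Downweighting the corrupted examples recovers a far better classifier; the bilevel formulation learns such weights from a small clean validation set instead of requiring an oracle.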
Meta-learning. A standard learning problem involves finding the best model from a class of hypotheses for a given task (i.e., data distribution). In contrast, meta-learning is the problem of learning a prior on the hypothesis classes (a.k.a. an inductive bias) for a given set of tasks. Few-shot learning is an example of meta-learning, where a learner is trained on several related tasks during the meta-training phase, so that it can generalize well to unseen (but related) tasks with just a few examples during the meta-testing phase. An effective approach to the few-shot learning problem is to learn a common representation for the various tasks and to train task-specific classifiers on top of this representation. Let Φ_θ be the map, parameterized by θ, that takes raw features to a representation common to all tasks, and let h_{w_j} be the classifier for the j-th task, j = 1, …, N, where N is the total number of training tasks. The goal of few-shot learning is to learn both the representation map (parameterized by θ) and the set of classifiers (parameterized by the w_j). Let f_j be the validation loss of task j and g_j be the training loss, defined similarly; then the bilevel problem for few-shot learning is
min_θ Σ_{j=1}^N f_j(θ, w_j*)  s.t.  w_j* = argmin_{w_j} g_j(θ, w_j),  j = 1, …, N.   (4)
For evaluation of the learned representation during the meta-test phase, the representation θ is kept fixed and only the classifiers for the new tasks are trained, i.e., w_j* = argmin_{w_j} g_j(θ, w_j) for j = 1, …, M, where M is the total number of test tasks.
Training-data poisoning. Recently, machine learning models have been shown to be vulnerable to train-time attacks. Unlike test-time attacks, here the adversary modifies the training data so that the model learned from the altered data performs poorly, or differently, compared to the model learned from clean data. The most popular train-time attack method augments the original training data with one or more 'poisoned' examples x_p to create the poisoned dataset, with g(x_p, w) being the training loss on the poisoned data. The problem of finding poisoning points that, when added to the clean training data, hurt the performance of the model trained on it can be formulated as
min_{x_p} −f(w*)  s.t.  w* = argmin_w g(x_p, w),   (5)
where the minus sign in the upper level is used to maximize the validation loss. This is the formulation for untargeted attacks. For targeted attacks, the upper level instead minimizes the validation loss with respect to the attacker's intended target labels. Another variant of the poisoning attack influences the prediction of only a single predetermined example, in which case the upper-level cost is the loss over this single example (see Eq. (10) in Appendix D.4.1).
Challenges of deep bilevel optimization. General bilevel problems cannot be solved by simultaneous optimization of the upper- and lower-level problems. Moreover, exact bilevel optimization is known to be NP-hard even for linear cost functions (Bard, 1991). In addition, recent deep learning models, with millions of variables, make it infeasible to use sophisticated methods beyond first-order ones. For bilevel problems, even first-order methods are difficult to apply, since they require computing an inverse-Hessian-vector product to obtain the exact hypergradient (see Sec. 2.1). Since direct inversion of the Hessian is impractical even for moderate-sized problems, many approaches have been proposed to approximate the exact hypergradient, including forward/reverse-mode differentiation (Maclaurin et al., 2015; Franceschi et al., 2017; Shaban et al., 2018) and approximate inversion by solving a linear system of equations (Domke, 2012; Pedregosa, 2016). Still, there remains considerable room for improvement over these existing approaches in terms of time and space complexity and practical performance.

Contributions. We propose a penalty-function-based algorithm (Alg. 1) for solving large-scale unconstrained bilevel optimization. We prove its convergence under mild conditions (Theorem 2) and show that it computes the exact hypergradient asymptotically (Lemma 3). We present a complexity analysis showing that the algorithm has linear time and constant space complexity (Table 1), making it superior to forward-mode and reverse-mode differentiation and similar to the approximate-inversion-based method. The small space and time complexity enables us to solve large-scale bilevel problems involving deep neural networks with up to 3.8M upper-level and 1.4M lower-level variables (Table 7 in Appendix). We show evaluation results on data denoising by importance learning, few-shot learning, and training-data poisoning. The proposed penalty-based method performs competitively with state-of-the-art methods on simpler problems (with convex lower-level costs) and significantly outperforms other methods on complex problems (with non-convex lower-level costs), both in terms of accuracy (Sec. 3) and runtime (Table 2, Fig. 3).
The remainder of the paper is organized as follows. We present and analyze the main algorithm in Sec. 2, perform comprehensive experiments in Sec. 3, and conclude the paper in Sec. 4. Due to space limitations, proofs, experimental settings, and additional results are presented in the appendix. All code is published on GitHub at https://github.com/jihunhamm/bilevelpenalty.
Throughout the paper we assume that the upper- and lower-level costs f(u, v) and g(u, v) are twice continuously differentiable in both u and v. We use ∇_u f and ∇_v f to denote gradient vectors, ∇_{uv}g for the matrix of second derivatives ∂²g/∂u∂v, and ∇_{vv}g for the Hessian matrix ∂²g/∂v². Additionally, following previous work, we assume that the lower-level solution v*(u) = argmin_v g(u, v) is unique for all u and that ∇_{vv}g is invertible everywhere. Later in this section we discuss relaxations of some of these assumptions.
Assuming we can express the solution v*(u) of the lower-level problem explicitly, we can write the bilevel problem as the equivalent single-level problem min_u f(u, v*(u)). We can then use a gradient-based approach on this single-level problem by computing the total derivative df/du, called the hypergradient in previous approaches. Using the chain rule, the total derivative is
df/du = ∇_u f + (dv*/du)^T ∇_v f.   (6)
In reality, v*(u) can be written explicitly only for trivial problems, but we can still compute dv*/du using the implicit function theorem. Since ∇_v g = 0 at v = v*(u) for all u, differentiating with respect to u gives ∇_{vu}g + ∇_{vv}g (dv*/du) = 0, and consequently dv*/du = −(∇_{vv}g)^{-1} ∇_{vu}g. Thus, the hypergradient is
df/du = ∇_u f − ∇_{uv}g (∇_{vv}g)^{-1} ∇_v f.   (7)
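Eq. (7) can be checked numerically on a toy quadratic bilevel problem, where v*(u) and all the derivatives are available in closed form (all matrices below are made up):

```python
import numpy as np

# Toy quadratic bilevel problem:
#   g(u, v) = 0.5 v'Av - v'Bu  =>  grad_v g = Av - Bu,  v*(u) = A^{-1} B u
#   f(u, v) = 0.5 ||v - c||^2 + 0.5 ||u||^2
rng = np.random.default_rng(0)
du_, dv_ = 4, 3
G = rng.normal(size=(dv_, dv_))
A = G @ G.T + dv_ * np.eye(dv_)     # SPD lower-level Hessian grad_vv g
B = rng.normal(size=(dv_, du_))
c = rng.normal(size=dv_)
u = rng.normal(size=du_)

def v_star(u):
    return np.linalg.solve(A, B @ u)

def F(u):  # single-level objective f(u, v*(u))
    v = v_star(u)
    return 0.5 * np.sum((v - c) ** 2) + 0.5 * np.sum(u ** 2)

# Eq. (7): df/du = grad_u f - grad_uv g (grad_vv g)^{-1} grad_v f
v = v_star(u)
grad_u_f, grad_v_f = u, v - c
grad_uv_g = -B.T                     # second derivatives d^2 g / du dv
hyper = grad_u_f - grad_uv_g @ np.linalg.solve(A, grad_v_f)

# compare against central finite differences of F
eps, fd = 1e-5, np.zeros(du_)
for i in range(du_):
    e = np.zeros(du_); e[i] = eps
    fd[i] = (F(u + e) - F(u - e)) / (2 * eps)
```

Since F is quadratic in u for this toy problem, the finite-difference estimate agrees with the implicit-function-theorem hypergradient essentially to machine precision.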
Existing approaches (Domke, 2012; Maclaurin et al., 2015; Pedregosa, 2016; Franceschi et al., 2017) can be viewed as implicit methods of approximating the hypergradient, with distinct tradeoffs in efficiency and complexity.
A bilevel problem can be considered a constrained optimization problem, since lower-level optimality is a constraint in addition to any other constraints in the upper- and lower-level problems. In this work, we focus on unconstrained bilevel problems, i.e., those without any additional constraints on the upper- and lower-level problems. For solving bilevel problems, the lower-level problem is often replaced by its necessary optimality condition, resulting in the following problem:
min_{u,v} f(u, v)  s.t.  ∇_v g(u, v) = 0.   (8)
For general bilevel problems, Eq. (8) and Eq. (1) are not the same (Dempe and Dutta, 2012). However, when the lower-level cost g is convex in v for each u and the lower-level solution is unique for each u, Eq. (8) is equivalent to Eq. (1).
We now describe the penalty-function approach to solving bilevel optimization. The penalty-function method is a well-known approach to constrained optimization (see (Bertsekas, 1997) for a review) and has been applied to bilevel problems before; however, it was analyzed under strict assumptions, and only high-level descriptions of the algorithm were presented (Aiyoshi and Shimizu, 1984; Ishizuka and Aiyoshi, 1992). The penalty function f̃ optimizes the original cost plus a quadratic penalty term that penalizes violation of the necessary condition for lower-level optimality. Let ṽ_γ(u) be the minimizer of the penalty function over v for a given u:
f̃(u, v; γ) = f(u, v) + (γ/2) ‖∇_v g(u, v)‖²,  ṽ_γ(u) = argmin_v f̃(u, v; γ).   (9)
Then the following convergence result is known.
Even though this is a strong result, it is not very practical, since the minimizer ṽ_γ(u) needs to be computed exactly for each u, and moreover f and ‖∇_v g‖² need to be convex in v for any u. In our approach, we allow approximately optimal solutions of Eq. (9) and show convergence to a KKT point of Eq. (8) without requiring convexity.
Alg. 1 describes our method, in which we minimize the penalty function of Eq. (9) alternately over v and u. It is essential to note that our method solves a single-level penalty problem (Eq. (9)) and does not need any intermediate step to compute an approximate hypergradient, unlike other methods, which first approximate the solution of the lower-level problem of Eq. (1) and then use an intermediate step (solving a linear system, or reverse/forward-mode differentiation) to compute the approximate hypergradient. Lemma 3 (below) shows when the approximate gradient direction ∇_u f̃ computed by Alg. 1 becomes the exact hypergradient of Eq. (7) for bilevel problems.
Given u, let v = ṽ_γ(u) be from Eq. (9). Then ∇_u f̃(u, v; γ) = ∇_u f(u, v) − ∇_{uv}g(u, v) (∇_{vv}g(u, v))^{-1} ∇_v f(u, v).
Thus, if we find the minimizer ṽ_γ(u) of the penalty function for given u and γ, Alg. 1 computes the exact hypergradient of Eq. (7) at (u, ṽ_γ(u)). Furthermore, under the conditions of Theorem 1, as γ → ∞ we have ṽ_γ(u) → v*(u), and we get the exact hypergradient asymptotically.
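To make the penalty approach concrete, the following sketch applies it to a toy quadratic bilevel problem. Since everything is quadratic here, the inner minimization of Eq. (9) over v is a linear solve; Alg. 1 would instead take T gradient steps on v. All matrices are made up.

```python
import numpy as np

# Toy problem: f(u, v) = 0.5||v - c||^2 + 0.5||u||^2,
#              grad_v g(u, v) = A v - B u  (so v*(u) = A^{-1} B u)
rng = np.random.default_rng(0)
dv_, du_ = 3, 2
G = rng.normal(size=(dv_, dv_))
A = G @ G.T + dv_ * np.eye(dv_)          # SPD lower-level Hessian
B = rng.normal(size=(dv_, du_))
c = rng.normal(size=dv_)

u, gam = np.zeros(du_), 1.0
for _ in range(1500):
    # v-step: minimize f + (gam/2)||grad_v g||^2 over v
    # (closed form for this quadratic toy; Alg. 1 uses T gradient steps)
    v = np.linalg.solve(np.eye(dv_) + gam * A.T @ A, c + gam * A.T @ (B @ u))
    # u-step: gradient of the penalty function with respect to u
    r = A @ v - B @ u
    u -= 0.05 * (u - gam * (B.T @ r))
    gam = min(gam * 1.01, 1e4)           # slowly increase the penalty

# closed-form bilevel solution of min_u f(u, v*(u)) for comparison
M = np.linalg.solve(A, B)                # v*(u) = M u
u_star = np.linalg.solve(M.T @ M + np.eye(du_), M.T @ c)
```

As γ grows, the minimizer of the penalty function tracks the true bilevel solution with an O(1/γ) bias, so the final iterate lands very close to u_star without ever inverting ∇_{vv}g explicitly for a hypergradient.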
Comparison with other methods. Many methods have been proposed to solve the bilevel optimization problems that appear in machine learning, including forward/reverse-mode differentiation (FMD/RMD) (Maclaurin et al., 2015; Franceschi et al., 2017; Shaban et al., 2018) and approximate hypergradient computation by solving a linear system (ApproxGrad) (Domke, 2012; Pedregosa, 2016). For completeness, we describe these methods briefly in Appendix B, and we summarize the trade-offs of computing the hypergradient with each method in Table 1. As the number of lower-level updates T increases, FMD and RMD become impractical due to their time and space complexity, respectively, whereas ApproxGrad and Penalty have the same linear time complexity and constant space complexity, which is a big advantage over FMD and RMD. However, complexity analysis does not reveal the quality of each method's hypergradient approximation. In Sec. 3.1 we show empirically on synthetic examples that the proposed penalty method has better convergence properties than all the other methods; and since ApproxGrad and Penalty have the same complexities, we compare the two on real data and show that Penalty is about twice as fast as ApproxGrad (Fig. 3).
Method  update  Intermediate update  Time  Space 
FMD  
RMD  
ApproxGrad  
Penalty  Not required 
Improvements. A caveat to these theoretical guarantees is that some of the assumptions made in the analysis may not be satisfied in practice. Here we discuss simple techniques to address these issues and improve Alg. 1 further. The first problem is the possible non-convexity of the lower-level cost g: a local minimum of ‖∇_v g‖² can then correspond to either a minimum or a maximum of g. To address this, we modify the v-update of Eq. (9) by adding a 'regularization' term to the cost, so that the update finds a minimum of g. This only affects the optimization in the beginning; as γ → ∞, the final solution remains unaffected with or without regularization. The second problem is that the tolerance condition may not be satisfied within a limited time, and the optimization may terminate before γ becomes large enough. A cure for this is the method of multipliers and the augmented Lagrangian (Bertsekas, 1976), which allows the penalty method to find a solution with a finite γ. Thus we add the term ν^T ∇_v g to the penalty function (Eq. (9)) to get f + ν^T ∇_v g + (γ/2)‖∇_v g‖², and use the method of multipliers to update ν as ν ← ν + γ ∇_v g. In summary, we use gradient updates of v and u on this augmented penalty function, together with the multiplier update of ν, as the update rules in the paper.
These improvements are helpful in theory but the empirical difference was only moderate (see Appendix C for details).
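The effect of the multiplier term can be seen on the same kind of toy quadratic bilevel problem: with the augmented Lagrangian, a finite γ already recovers the exact solution, while the plain penalty at the same γ keeps an O(1/γ) bias. This is a sketch under made-up data; the joint closed-form minimization below is special to the quadratic toy, where Alg. 1 would use gradient steps.

```python
import numpy as np

# f = 0.5||v - c||^2 + 0.5||u||^2, constraint h(u, v) = grad_v g = A v - B u
rng = np.random.default_rng(0)
dv_, du_ = 3, 2
G = rng.normal(size=(dv_, dv_))
A = G @ G.T + dv_ * np.eye(dv_)          # SPD lower-level Hessian
B = rng.normal(size=(dv_, du_))
c = rng.normal(size=dv_)

def minimize_aug_lagrangian(nu, gam):
    # joint minimizer of f + nu'h + (gam/2)||h||^2 over (u, v)
    H = np.block([[np.eye(du_) + gam * B.T @ B, -gam * B.T @ A],
                  [-gam * A.T @ B, np.eye(dv_) + gam * A.T @ A]])
    b = np.concatenate([B.T @ nu, c - A.T @ nu])
    z = np.linalg.solve(H, b)
    return z[:du_], z[du_:]

gam = 10.0
u_pen, _ = minimize_aug_lagrangian(np.zeros(dv_), gam)  # plain penalty, nu = 0
nu = np.zeros(dv_)
for _ in range(100):                                    # method of multipliers
    u, v = minimize_aug_lagrangian(nu, gam)
    nu += gam * (A @ v - B @ u)                         # nu <- nu + gam * h

M = np.linalg.solve(A, B)
u_star = np.linalg.solve(M.T @ M + np.eye(du_), M.T @ c)  # exact solution
```

The multiplier iteration converges linearly to the exact constrained (and hence bilevel) solution at γ = 10, whereas the plain penalty at γ = 10 remains visibly biased.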
In this section, we evaluate the performance of the proposed penalty method (Penalty) on the machine learning problems discussed in the introduction. We compare Penalty against both bilevel and non-bilevel solutions to these problems previously reported in the literature.
We start by comparing Penalty (ours) with gradient descent (GD), reverse-mode differentiation (RMD), and the approximate hypergradient method (ApproxGrad) on synthetic examples. We omit the comparison with forward-mode differentiation (FMD) because of its impractical time complexity for larger problems. GD refers to the alternating minimization u ← u − σ∇_u f, v ← v − σ∇_v g. For RMD, we implemented a simple version of the method using vanilla gradient descent. For ApproxGrad, we implemented our own GPU-compatible version (which uses Hessian-vector products, minibatches, and gradient descent rather than conjugate gradient descent for solving the linear system) of the algorithm proposed by (Pedregosa, 2016). Using simple quadratic surfaces for f and g, we compare all the algorithms by observing their convergence as a function of the number of upper-level iterations while varying the number of lower-level updates (T) used for computing the hypergradient. We measure convergence using the Euclidean distance of the current iterate from the closest optimal solution; since the synthetic examples are not learning problems, this distance is the natural metric. Fig. 1 shows the performance on two 10-dimensional examples described in the caption (see Appendix D.1). As one would expect, increasing the number of lower-level updates T makes all the algorithms better, since more lower-level iterations make the hypergradient estimate more accurate (Eq. (7)), but it also increases the runtime. However, even for these examples, only Penalty and ApproxGrad converge to the optimal solution, while GD and RMD converge to non-solution points (regardless of T). Moreover, from Fig. 1(b), we see that Penalty converges even with T = 1, while ApproxGrad requires at least T = 10 to converge, which shows that our method approximates the hypergradient accurately with smaller T. This directly translates to smaller runtime for our method compared to ApproxGrad, since the runtime is directly proportional to T (see Table 1). [Figure 1. Caption fragment: "… is a rank-deficient random matrix. The mean curve (blue) is superimposed on 20 independent trials (yellow)."]
In Fig. 2 we show examples similar to Fig. 1 but with an ill-conditioned or singular Hessian for the lower-level problem. Ill-conditioning poses difficulty for these methods, since the implicit function theorem requires invertibility of the Hessian at the solution. Compared to Fig. 1, Fig. 2 shows that only Penalty converges to the true solution, despite the fact that we add regularization in ApproxGrad to mitigate the ill-conditioning when solving the linear systems by minimization. We ascribe the robustness of Penalty to its simplicity and to the fact that it naturally handles non-uniqueness of the lower-level solution (see Appendix C.3). Additionally, we report the wall-clock times of the different methods on the four examples in Table 2. As the number of lower-level iterations T increases, all methods get slower, but Penalty is faster than both RMD and ApproxGrad. Penalty is slower than GD, but as shown in Figs. 1 and 2, GD does not converge to the optimum on most of the synthetic examples.
Example 1 | GD | RMD | ApproxGrad | Penalty
T=1 | 7.4±0.3 | 15.0±0.1 | 17.4±0.2 | 17.2±0.1
T=5 | 14.3±0.1 | 51.4±0.3 | 39.3±2.3 | 34.3±0.3
T=10 | 23.2±0.1 | 95.4±0.2 | 60.9±0.3 | 57.0±1.0
Example 2 | GD | RMD | ApproxGrad | Penalty
T=1 | 7.7±0.1 | 18.5±0.1 | 17.2±0.3 | 17.4±0.2
T=5 | 17.3±0.1 | 62.7±0.1 | 37.9±0.1 | 35.0±0.2
T=10 | 22.4±2.6 | 115.0±0.4 | 64.2±0.3 | 52.7±1.4
Example 3 | GD | RMD | ApproxGrad | Penalty
T=1 | 8.2±0.2 | 18.8±0.1 | 19.8±0.1 | 19.1±0.1
T=5 | 17.4±0.1 | 72.4±0.1 | 47.1±0.4 | 38.6±0.4
T=10 | 28.7±0.6 | 125.0±9.3 | 80.6±0.3 | 62.7±0.1
Example 4 | GD | RMD | ApproxGrad | Penalty
T=1 | 7.9±0.1 | 19.5±0.1 | 20.4±0.0 | 19.6±0.1
T=5 | 16.9±0.2 | 72.8±0.5 | 48.4±0.6 | 40.2±0.1
T=10 | 28.3±0.2 | 138.0±0.2 | 81.2±1.6 | 58.0±4.3
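The failure mode of GD observed in these synthetic comparisons can be reproduced in a few lines: alternating u ← u − σ∇_u f, v ← v − σ∇_v g drives ∇_u f to zero while ignoring the dependence of v* on u, so its fixed point is generally not the bilevel solution (toy quadratic, made-up matrices):

```python
import numpy as np

# f(u, v) = 0.5||v - c||^2 + 0.5||u||^2,  g(u, v) = 0.5 v'Av - v'Bu
rng = np.random.default_rng(0)
dv_, du_ = 3, 2
G = rng.normal(size=(dv_, dv_))
A = G @ G.T + dv_ * np.eye(dv_)     # SPD lower-level Hessian
B = rng.normal(size=(dv_, du_))
c = rng.normal(size=dv_)

u, v = np.ones(du_), np.ones(dv_)
for _ in range(2000):
    u = u - 0.1 * u                 # grad_u f = u (ignores v*(u) coupling)
    v = v - 0.05 * (A @ v - B @ u)  # grad_v g = Av - Bu

# the bilevel solution of min_u f(u, v*(u)) is generally nonzero
M = np.linalg.solve(A, B)
u_star = np.linalg.solve(M.T @ M + np.eye(du_), M.T @ c)
```

GD collapses u to the minimizer of f's explicit u-dependence (here, u = 0), while the true bilevel solution u_star is elsewhere; this is exactly the non-solution behavior seen in Figs. 1 and 2.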
Now we evaluate the performance of Penalty for learning a classifier from a dataset with corrupted labels in the training data. We pose this as the importance learning problem of Eq. (3). We evaluate the classifier learned by Penalty, with 20 lower-level updates, against the following classifiers — Oracle: trained on the portion of training data with clean labels plus the validation data; Val-only: trained only on the validation data; Train+Val: trained on the entire training and validation data; ApproxGrad: trained with our implementation of ApproxGrad, with 20 lower-level and 20 linear-system updates. We test the performance on the MNIST, CIFAR10, and SVHN datasets with validation-set sizes of 1000, 10000, and 1000 points, respectively, using convolutional neural networks (architectures described in Appendix D.2) at the lower level. Table 3 summarizes our results and shows that Penalty outperforms Val-only, Train+Val, and ApproxGrad by significant margins, and in fact performs very close to the Oracle classifier (the ideal classifier), even at high noise levels. This demonstrates that Penalty is extremely effective at solving bilevel problems involving several million variables (see Table 7 in Appendix) and at handling non-convex problems. Along with the improvement in accuracy over other bilevel methods such as ApproxGrad, Penalty also gives a better runtime per upper-level iteration, leading to a decrease in the overall runtime of the experiments (Fig. 3(a)).

We also compared Penalty against the RMD-based method of (Franceschi et al., 2017), using their setting from Sec. 5.1, which is a smaller version of this data denoising task. For this, we chose a sample of 5000 training, 5000 validation, and 10000 test points from MNIST, randomly corrupted the labels of 50% of the training points, and used softmax regression in the lower level of the bilevel formulation (Eq. (3)). The accuracy of a classifier trained on the subset of the dataset comprising only points with importance values greater than 0.9 (as computed by Penalty), along with the validation set, is 90.77%. This is better than the accuracy obtained by Val-only (90.54%), Train+Val (86.25%), and the RMD-based method (90.09%) of (Franceschi et al., 2017), and is close to the accuracy of the Oracle classifier (91.06%).
Dataset (Noise %) | Oracle | Val-only | Train+Val | ApproxGrad | Penalty
MNIST (25) | 99.3±0.1 | 90.5±0.3 | 83.9±1.3 | 98.11±0.08 | 98.89±0.04
MNIST (50) | 99.3±0.1 | 90.5±0.3 | 60.8±2.5 | 97.27±0.15 | 97.51±0.07
CIFAR10 (25) | 82.9±1.1 | 70.3±1.8 | 79.1±0.8 | 71.59±0.87 | 79.67±1.01
CIFAR10 (50) | 80.7±1.2 | 70.3±1.8 | 72.2±1.8 | 68.08±0.83 | 79.03±1.19
SVHN (25) | 91.1±0.5 | 70.6±1.5 | 71.6±1.4 | 80.05±1.37 | 88.12±0.16
SVHN (50) | 89.8±0.6 | 70.6±1.5 | 47.9±1.3 | 74.18±1.05 | 85.21±0.34
(The last two columns, ApproxGrad and Penalty, are bilevel approaches.)
 | MAML (Finn et al., 2017) | ProtoNet (Snell et al., 2017) | SNAIL (Mishra et al., 2017) | RMD (Franceschi et al., 2018) | ApproxGrad | Penalty
Omniglot
5-way 1-shot | 98.7 | 98.8 | 99.1 | 98.6 | 97.49±0.31 | 97.57±0.11
5-way 5-shot | 99.9 | 99.7 | 99.8 | 99.5 | 99.43±0.02 | 99.41±0.05
20-way 1-shot | 95.8 | 96.0 | 97.6 | 95.5 | 93.07±0.24 | 92.20±0.22
20-way 5-shot | 98.9 | 98.9 | 99.4 | 98.4 | 98.14±0.13 | 98.10±0.06
MiniImagenet
5-way 1-shot | 48.70±1.75 | 49.42±0.78 | 55.71±0.99 | 50.54±0.85 | 48.1±0.82 | 52.10±0.65
5-way 5-shot | 63.11±0.92 | 68.20±0.66 | 68.88±0.92 | 64.53±0.68 | 64.9±0.84 | 66.91±0.92
MAML, ProtoNet, and SNAIL are non-bilevel approaches; RMD, ApproxGrad, and Penalty are bilevel approaches. ± values are s.d. for Omniglot and 95% confidence intervals for MiniImagenet over five trials. For the bilevel approaches (Penalty, ApproxGrad, and RMD (Franceschi et al., 2018)), results are averaged over 600 randomly sampled tasks from the meta-test set.

Next, we evaluate the performance of Penalty on the task of learning a common representation for few-shot learning. We use the formulation of Eq. (4) and the Omniglot (Lake et al., 2015) and MiniImageNet (Vinyals et al., 2016) datasets for our experiments. Following the protocol proposed by (Vinyals et al., 2016) for N-way K-shot classification, we generate meta-training and meta-testing datasets. Each meta-set is built using images from disjoint classes. For Omniglot, our meta-training set comprises images from the first 1200 classes, and the remaining 423 classes are used in the meta-testing dataset. We also augment the meta-datasets with three rotations (90, 180, and 270 degrees) of the images, as in (Santoro et al., 2016). For the experiments with MiniImagenet, we used the split of 64 classes for meta-training, 16 classes for meta-validation, and 20 classes for meta-testing, as in (Ravi and Larochelle, 2017).
Each meta-batch of the meta-training and meta-testing datasets comprises a number of tasks, called the meta-batch size. Each task in the meta-batch consists of a training set with K images per class and a testing set with 15 images per class, drawn from N classes. We train Penalty using a meta-batch size of 30 for 5-way and 15 for 20-way classification on Omniglot, and a meta-batch size of 2 for the MiniImageNet experiments. The training sets of the meta-train batch are used to train the lower-level problem, and the test sets are used as validation sets for the upper-level problem in Eq. (4). The final accuracy is reported on the meta-test set, for which we fix the common representation learned during meta-training. We then train the classifiers at the lower level for 100 steps using the training sets from the meta-test batch and evaluate the performance of each task on the associated test set from the meta-test batch. The average performance of Penalty and ApproxGrad over 600 tasks is reported in Table 4. Penalty outperforms the other bilevel methods, namely ApproxGrad (trained with 20 lower-level iterations and 20 updates for the linear system) and the RMD-based method (Franceschi et al., 2018), on MiniImagenet, and is comparable to them on Omniglot. We also show the trade-off between using a higher T and runtime for ApproxGrad and Penalty in Fig. 3(b): Penalty achieves the same accuracy as ApproxGrad in almost half the runtime. In comparison to non-bilevel approaches, Penalty is comparable to most, but is slightly worse than (Mishra et al., 2017), which makes use of temporal convolutions and soft attention.
We used four-layer convolutional neural networks with 64 filters per layer (Omniglot) and a residual network with four residual blocks followed by two convolutional layers (MiniImageNet) for learning the common task representation (the upper-level variable). The lower-level problem uses logistic regression to learn the task-specific classifiers (the lower-level variables). We also use a normalization of the input-weight dot product before the softmax, similar to the cosine normalization proposed by (Luo et al., 2018). The bilevel problem for MiniImageNet has the largest number of upper-level variables among all the experiments presented in this paper (Table 7 in Appendix).

Untargeted Attacks (lower accuracy is better)
Poison % | Baseline (label flip) | RMD | ApproxGrad | Penalty
1% | 86.71±0.32 | 85 | 82.09±0.84 | 83.29±0.43
2% | 86.23±0.98 | 83 | 77.54±0.57 | 78.14±0.53
3% | 85.17±0.96 | 82 | 74.41±1.14 | 75.14±1.09
4% | 84.93±0.55 | 81 | 71.88±0.40 | 72.70±0.46
5% | 84.39±1.06 | 80 | 68.69±0.86 | 69.48±1.93
6% | 84.64±0.69 | 79 | 66.91±0.89 | 67.59±1.17
Targeted Attacks (higher accuracy is better)  
Poison % | Baseline (label flip) | RMD | ApproxGrad | Penalty
1% | 7.76±1.07 | 10 | 18.84±1.90 | 17.40±3.00
2% | 12.08±2.13 | 15 | 39.64±3.72 | 41.64±4.43
3% | 18.36±1.23 | 25 | 52.76±2.69 | 51.40±2.72
4% | 24.41±2.05 | 35 | 60.01±1.61 | 61.16±1.34
5% | 30.41±4.24 | – | 65.61±4.01 | 65.52±2.85
6% | 32.88±3.47 | – | 71.48±4.24 | 70.01±2.95
Next, we evaluate Penalty on the task of generating poisoned training data, such that models trained on this data perform poorly or differently compared to models trained on clean data (Mei and Zhu, 2015; Muñoz-González et al., 2017; Koh and Liang, 2017). We use the same setting as Sec. 4.2 of (Muñoz-González et al., 2017) and test both untargeted and targeted data poisoning on MNIST with their data-augmentation technique. Here we assume that a regularized logistic regression classifier will be used during training. The poisoned points obtained by solving Eq. (5) with the various methods are added to the clean training set, and the performance of a new classifier trained on this data is reported in Table 5. For the untargeted attack, the aim is to generally lower the performance of the classifier on the clean test set. For this experiment, we select a random subset of 1000 training, 1000 validation, and 8000 testing points from MNIST and initialize the poisoning points with random instances from the training set, assigned incorrect random labels. We use these poisoned points along with the clean training data to train logistic regression in the lower-level problem of Eq. (5). For targeted attacks, we aim to misclassify images of eights as threes. For this, we selected a balanced subset (each of the 10 classes represented equally) of 1000 training, 4000 validation, and 5000 testing points from MNIST. We then select the images of class 8 from the validation set, label them as 3, and use only these images in the upper-level problem of Eq. (5), with the difference that we now minimize the upper-level error instead of maximizing it (i.e., there is no negative sign in the upper level of Eq. (5)). To evaluate the performance, we selected the images of 8 from the test set, labeled them as 3, and report the performance on this modified subset of the original test set in the targeted-attack section of Table 5.
For this experiment, the poisoned points are initialized with images of classes 3 and 8 from the training set, with flipped labels, since threes and eights are the only classes involved in the poisoning. We compare the performance of Penalty against the performance reported using RMD in (Muñoz-González et al., 2017) and against ApproxGrad. For ApproxGrad, we used 20 lower-level and 20 linear-system updates to produce the results in Table 5. We see that Penalty significantly outperforms the RMD-based method and performs similarly to ApproxGrad; however, in terms of wall-clock time, Penalty has a clear advantage over ApproxGrad (see Fig. 3(c)). We also compared the methods against a label-flipping baseline, in which we select poisoning points from the validation sets and change their labels (randomly for untargeted attacks; mislabeling threes as eights and eights as threes for targeted attacks). All bilevel methods beat this baseline, showing that solving the bilevel problem generates much better poisoning points. Examples of the poisoned points generated by Penalty for untargeted and targeted attacks are shown in Figs. 5 and 6 in Appendix D.4.
Additionally, we tested Penalty on the task of generating clean-label poisoning attacks (Koh and Liang, 2017; Shafahi et al., 2018), where the goal is to learn poisoned points that are assigned correct labels when visually inspected by an expert, but cause misclassification of specific target images when the classifier is trained on these poisoned points along with clean data. We used the dog-vs-fish dataset and followed the setting in Sec. 5.2 of (Koh and Liang, 2017), achieving 100% attack success with just a single poisoned point per target image, compared to 57% attack success in the original paper. A recent method (Shafahi et al., 2018) also reports 100% attack success on this task. Details of the experiment are presented in Appendix D.4.1.
Finally, we compare Penalty and ApproxGrad on accuracy and time in Fig. 3 as we vary the number of lower-level iterations $T$ in the experiments. Intuitively, a larger $T$ corresponds to a more accurate approximation of the hypergradient and therefore a better result for all methods, but it comes with increased space and time costs. The figure shows that the relative improvement beyond a moderate $T$ is small in comparison to the increased runtime for both Penalty and ApproxGrad. Based on this result, we used a fixed $T$ in this range in all our experiments on real data. The figure also shows that even though Penalty and ApproxGrad have the same linear time complexity (Table 1), Penalty is about twice as fast as ApproxGrad in wall-clock time.
A wide range of interesting machine learning problems can be expressed as bilevel optimization problems, and new applications are still being discovered. So far, the difficulty of solving bilevel optimization has limited its widespread use for large-scale problems, especially those involving deep models. In this paper, we presented an efficient algorithm based on a penalty function for bilevel optimization, which is simple and has theoretical and practical advantages over existing methods. Compared to previous methods, we demonstrated competitive performance on problems with convex lower-level costs and significant improvements on problems with nonconvex lower-level costs, both in terms of accuracy and time, highlighting the practical effectiveness of our penalty-based method. In future work, we plan to tackle other challenges in bilevel optimization, such as handling additional constraints in both the upper and lower levels.
The proof follows the standard proof for penalty function methods, e.g., [23]. Let $w = (u, v)$ refer to the pair of upper- and lower-level variables, and let $\bar{w} = (\bar{u}, \bar{v})$ be any limit point of the sequence $\{w_k\}$; then there is a subsequence $\{w_{k_j}\}$ such that $w_{k_j} \to \bar{w}$. Write $h(w) := \nabla_v g(u, v)$, whose Jacobian is the tall matrix $\nabla h = [\nabla^2_{uv} g;\; \nabla^2_{vv} g]$. From the tolerance condition $\|\nabla \tilde{f}(w_k; \gamma_k)\| \le \epsilon_k$ we have
$$\|\nabla h(w_k)\, h(w_k)\| \le \frac{\epsilon_k + \|\nabla f(w_k)\|}{\gamma_k}.$$
Take the limit with respect to the subsequence on both sides (using $\gamma_k \to \infty$ and $\epsilon_k \to 0$) to get
$$\nabla h(\bar{w})\, h(\bar{w}) = 0.$$
Since $\nabla h$ is a tall matrix and $\nabla^2_{vv} g$ is invertible by assumption, $\nabla h$ is full-rank and therefore $h(\bar{w}) = \nabla_v g(\bar{u}, \bar{v}) = 0$, which is the primary feasibility condition in Eq. (8). Furthermore, let $\mu_k := \gamma_k h(w_k)$; then by definition,
$$\nabla \tilde{f}(w_k; \gamma_k) = \nabla f(w_k) + \nabla h(w_k)\, \mu_k.$$
We can write the $v$-component of this gradient as $\nabla_v f + \gamma_k \nabla^2_{vv} g\, \nabla_v g$. At the minimum the gradient vanishes, that is,
$$\nabla_v f + \gamma_k \nabla^2_{vv} g\, \nabla_v g = 0.$$
Equivalently, $\gamma_k \nabla_v g = -\left(\nabla^2_{vv} g\right)^{-1} \nabla_v f$. Then, the $u$-component becomes
$$\nabla_u \tilde{f} = \nabla_u f + \gamma_k \nabla^2_{uv} g\, \nabla_v g = \nabla_u f - \nabla^2_{uv} g \left(\nabla^2_{vv} g\right)^{-1} \nabla_v f,$$
where $\gamma_k$ disappears, which is the hypergradient of Eq. (7). ∎
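The limit in the proof can be checked numerically: on a toy quadratic bilevel problem, the $u$-gradient of the penalty function approaches the exact hypergradient as $\gamma$ grows. A minimal sketch (the constants `a`, `b`, `c` and the quadratic costs are illustrative, not from the paper):

```python
# Toy quadratic bilevel problem (constants and cost functions are illustrative):
#   upper: f(u, v) = 0.5*(v - c)**2 + 0.5*u**2
#   lower: g(u, v) = 0.5*a*v**2 - b*u*v,  so argmin_v g(u, v) = b*u/a
a, b, c, u = 2.0, 1.0, 3.0, 0.5

v_star = b * u / a                         # exact lower-level solution v*(u)
hypergrad = u + (v_star - c) * b / a       # d f(u, v*(u)) / du via the chain rule

def penalty_u_grad(gamma):
    # minimize f + (gamma/2)*(dg/dv)**2 over v (closed form for this quadratic),
    # then evaluate the u-gradient of the penalty function at that v
    v = (c + gamma * a * b * u) / (1.0 + gamma * a * a)
    dg_dv = a * v - b * u
    return u + gamma * (-b) * dg_dv        # grad_u f + gamma * (d2g/dudv) * dg/dv

for gamma in (1.0, 10.0, 1000.0):
    print(gamma, abs(penalty_u_grad(gamma) - hypergrad))   # error shrinks with gamma
```

As $\gamma$ increases, the printed error decreases toward zero even though $\gamma$ itself never appears in the exact hypergradient.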
Several methods have been proposed to solve bilevel optimization problems appearing in machine learning, including forward/reverse-mode differentiation [19, 10] and approximate hypergradients [8, 25], described briefly here.
Forward-mode (FMD) and reverse-mode differentiation (RMD). Domke [8], Maclaurin et al. [19], Franceschi et al. [10], and Shaban et al. [29] studied forward- and reverse-mode differentiation to solve the minimization problem where the lower-level variable $v_t$ follows a dynamical system $v_t = \Phi(v_{t-1}, u)$. This setting is more general than that of a bilevel problem. However, a stable dynamical system is one that converges to a steady state, and thus the process can be considered as minimizing an energy or potential function.
Define $A_t := \frac{\partial \Phi(v_{t-1}, u)}{\partial v_{t-1}}$ and $B_t := \frac{\partial \Phi(v_{t-1}, u)}{\partial u}$; then the hypergradient Eq. (7) can be computed by
$$\nabla f(u) = \frac{\partial f}{\partial u} + \sum_{t=1}^{T} B_t^{\top} A_{t+1}^{\top} \cdots A_T^{\top}\, \frac{\partial f}{\partial v_T}.$$
When the lower-level process is one step of gradient descent on a cost function $g$, that is,
$$\Phi(v_{t-1}, u) = v_{t-1} - \eta\, \nabla_v g(u, v_{t-1}),$$
we get
$$A_t = I - \eta\, \nabla^2_{vv} g(u, v_{t-1}), \qquad B_t = -\eta\, \nabla^2_{uv} g(u, v_{t-1}).$$
These sequences can be computed in forward or reverse mode. For reverse-mode differentiation, first compute the trajectory $v_1, \ldots, v_T$, then compute the backward recursion
$$\alpha_T = \frac{\partial f}{\partial v_T}, \qquad \alpha_{t-1} = A_t^{\top} \alpha_t \quad \text{for } t = T, \ldots, 1.$$
The final hypergradient is $\frac{\partial f}{\partial u} + \sum_{t=1}^{T} B_t^{\top} \alpha_t$. For forward-mode differentiation, simultaneously compute $v_t$ and
$$Z_t = A_t Z_{t-1} + B_t, \qquad Z_0 = 0.$$
The final hypergradient is
$$\frac{\partial f}{\partial u} + Z_T^{\top}\, \frac{\partial f}{\partial v_T}.$$
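The reverse-mode computation can be sketched on a scalar toy problem and checked against finite differences. A minimal illustration (the cost functions, step size, and horizon are illustrative choices, not the paper's):

```python
# Toy instance (cost functions, step size eta, and horizon T are illustrative):
#   f(u, v) = (v - 1)**2,  g(u, v) = 0.5*(v - u)**2
#   lower-level dynamics: v_t = Phi(v_{t-1}, u) = v_{t-1} - eta*(v_{t-1} - u)
eta, T = 0.3, 25

def rmd_hypergrad(u, v0):
    vs = [v0]
    for _ in range(T):                      # forward pass: store the trajectory
        vs.append(vs[-1] - eta * (vs[-1] - u))
    alpha = 2.0 * (vs[-1] - 1.0)            # alpha_T = df/dv_T
    grad_u = 0.0                            # f has no direct dependence on u here
    for t in range(T, 0, -1):               # backward pass
        A_t = 1.0 - eta                     # dPhi/dv_{t-1} = 1 - eta*(d2g/dv2)
        B_t = eta                           # dPhi/du = -eta*(d2g/dudv)
        grad_u += B_t * alpha               # accumulate B_t * alpha_t
        alpha = A_t * alpha                 # alpha_{t-1} = A_t * alpha_t
    return grad_u

def fd_hypergrad(u, v0, h=1e-5):            # central finite-difference check
    def F(u):
        v = v0
        for _ in range(T):
            v = v - eta * (v - u)
        return (v - 1.0) ** 2
    return (F(u + h) - F(u - h)) / (2.0 * h)

print(rmd_hypergrad(0.2, 0.0), fd_hypergrad(0.2, 0.0))   # the two values agree
```

The backward pass touches each step's Jacobians once, which is the source of RMD's linear time but also of its linear memory in $T$.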
Approximate hypergradient (ApproxGrad). Since computing the inverse of the Hessian $\nabla^2_{vv} g$ directly is difficult even for moderately-sized neural networks, Domke [8] proposed to find an approximate solution $q \approx \left(\nabla^2_{vv} g\right)^{-1} \nabla_v f$ by solving the linear system of equations $\nabla^2_{vv} g\, q = \nabla_v f$. This can be done by solving
$$\min_q \left\| \nabla^2_{vv} g\, q - \nabla_v f \right\|^2$$
using conjugate gradient or any other method. Note that the minimization requires only the evaluation of Hessian-vector products, which can be done in linear time [24]. The asymptotic convergence with approximate solutions was shown by Pedregosa [25].
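A minimal sketch of this linear-system approach, using a hand-rolled conjugate gradient that touches the "Hessian" only through matrix-vector products (the explicit 5×5 matrix here is an illustrative stand-in for an autodiff Hessian-vector product):

```python
import numpy as np

# Solve H q = r using only Hessian-vector products (H, r are stand-ins for
# grad^2_vv g and grad_v f; the random instance is purely illustrative).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
H = M @ M.T + 5.0 * np.eye(5)          # symmetric positive-definite "Hessian"
r = rng.standard_normal(5)             # "upper-level gradient" w.r.t. v

def hvp(q):                            # in practice: an autodiff Hessian-vector product
    return H @ q

def conjugate_gradient(hvp, b, iters=50, tol=1e-10):
    q = np.zeros_like(b)
    res = b - hvp(q)                   # initial residual
    p = res.copy()                     # initial search direction
    rs = res @ res
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        q += alpha * p
        res -= alpha * Hp
        rs_new = res @ res
        if rs_new < tol:
            break
        p = res + (rs_new / rs) * p    # conjugate direction update
        rs = rs_new
    return q

q = conjugate_gradient(hvp, r)
print(np.linalg.norm(H @ q - r))       # residual is ~0
```

Each iteration costs one Hessian-vector product, so the inverse Hessian is never formed.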
Here we discuss the details of modifications to Alg. 1 from the main text that can be added to improve the performance of the algorithm in practice.
One of the common assumptions of this and previous works is that $\nabla^2_{vv} g$ is invertible and locally positive definite. Neither invertibility nor positive definiteness holds in general for bilevel problems involving deep neural networks, and this causes difficulties in the optimization. Note that if $g$ is nonconvex in $v$, minimizing the penalty term $\frac{\gamma}{2}\|\nabla_v g\|^2$ does not necessarily lower the cost but instead moves the variable towards a stationary point, which is a known problem even for Newton's method. Thus we propose the following modification to the $v$-update:
$$v \leftarrow v - \rho \left[ \nabla_v f + \gamma \left( \nabla^2_{vv} g + \epsilon I \right) \nabla_v g \right],$$
keeping the $u$-update intact. To see how this affects the optimization, note that after the $v$-update converges to a stationary point, we get
$$\gamma\, \nabla_v g = -\left( \nabla^2_{vv} g + \epsilon I \right)^{-1} \nabla_v f,$$
and after plugging this into the $u$-update, we get
$$u \leftarrow u - \rho \left[ \nabla_u f - \nabla^2_{uv} g \left( \nabla^2_{vv} g + \epsilon I \right)^{-1} \nabla_v f \right],$$
that is, the Hessian inverse is replaced by a regularized version to improve the positive definiteness of the Hessian. With a decreasing sequence $\epsilon_k \to 0$ (or a small constant), the regularization does not change the solution.
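On a toy quadratic bilevel problem, regularized alternating penalty updates of this kind can be sketched as follows (the cost functions, step sizes, and the $\gamma$/$\epsilon$ schedules are illustrative choices, not prescribed by the paper):

```python
# Toy problem (illustrative):
#   f(u, v) = 0.5*(v - 1)**2 + 0.5*u**2
#   g(u, v) = 0.5*k*(v - u)**2        (lower-level solution: v = u)
# The bilevel solution is u = v = 0.5.
k = 0.5
u, v = 0.0, 0.0
for gamma in (1.0, 10.0, 100.0, 1000.0):          # increasing penalty parameter
    eps = 1.0 / gamma                             # decreasing regularizer
    rho_v = 0.5 / (1.0 + gamma * k * (k + eps))   # step size in the stable range
    rho_u = 0.5 / (1.0 + gamma * k * k)
    for _ in range(5000):
        # v-update: grad_v f + gamma * (H_vv + eps*I) * grad_v g, with H_vv = k
        v -= rho_v * ((v - 1.0) + gamma * (k + eps) * k * (v - u))
        # u-update (unchanged): grad_u f + gamma * H_uv * grad_v g, with H_uv = -k
        u -= rho_u * (u + gamma * (-k) * k * (v - u))
print(u, v)   # both approach the bilevel solution 0.5
```

The step sizes shrink as $\gamma$ grows because the penalty term's curvature scales with $\gamma$; in practice this schedule would be tuned rather than set in closed form.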
The penalty function method is intuitive and easy to implement, but the sequence $\{(u_k, v_k)\}$ is guaranteed to converge to an optimal solution only in the limit $\gamma_k \to \infty$, which may not be achieved in practice in a limited time. It is known that the penalty method can be improved by introducing an additional term into the function, which is called the augmented Lagrangian (penalty) method [4]:
$$\tilde{f}(u, v; \gamma, \mu) = f(u, v) + \mu^{\top} \nabla_v g(u, v) + \frac{\gamma}{2} \left\| \nabla_v g(u, v) \right\|^2.$$
This new term allows convergence to the optimal solution even when $\gamma$ is finite. Furthermore, using the update rule $\mu_{k+1} = \mu_k + \gamma_k \nabla_v g(u_k, v_k)$, called the method of multipliers, it is known that $\mu_k$ converges to the true Lagrange multiplier of this problem corresponding to the equality constraints $\nabla_v g(u, v) = 0$.
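The method of multipliers can be illustrated on a scalar equality-constrained toy problem where the inner minimization has a closed form (the problem and the fixed $\gamma$ are illustrative, not the paper's bilevel setting):

```python
# Method-of-multipliers sketch on a scalar toy problem (illustrative):
#   min x**2  subject to  h(x) = x - 1 = 0
# Augmented Lagrangian: L(x, mu) = x**2 + mu*h(x) + (gamma/2)*h(x)**2
gamma, mu = 5.0, 0.0
for _ in range(30):
    x = (gamma - mu) / (2.0 + gamma)   # argmin_x L(x, mu) in closed form
    mu += gamma * (x - 1.0)            # multiplier update: mu <- mu + gamma*h(x)
print(x, mu)   # x approaches 1.0 and mu approaches -2.0 (the true multiplier)
```

Note that $\gamma$ stays fixed at a finite value throughout; the multiplier update alone drives $h(x)$ to zero, which is exactly the advantage over the plain penalty method.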
Most existing methods have assumed that the lower-level solution $v^{*}(u)$ is unique for all $u$. The regularization from the previous section can improve the ill-conditioning of the Hessian, but it does not address the case of multiple disconnected global minima of $g$. With multiple lower-level solutions $\mathcal{S}(u) := \arg\min_v g(u, v)$, there is an ambiguity in defining the upper-level problem. If we assume that $v$ is chosen adversarially (or pessimistically), then the upper-level problem should be defined as
$$\min_u \max_{v \in \mathcal{S}(u)} f(u, v).$$
If $v$ is chosen cooperatively (or optimistically), then the upper-level problem should be defined as
$$\min_u \min_{v \in \mathcal{S}(u)} f(u, v),$$
and the results can be quite different between these two cases. Note that the proposed penalty function method naturally solves the optimistic case, as Alg. 1 solves the problem $\min_{u, v} \tilde{f}(u, v)$ by alternating gradient descent. However, with a gradient-based method, we cannot hope to find all disconnected multiple solutions. In a related problem of min-max optimization, which is a special case of bilevel optimization, an algorithm for handling non-unique solutions was proposed recently [12]. This idea of keeping multiple candidate solutions may be applicable to bilevel problems too, and further analysis of the non-unique lower-level problem is left as future work.
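The gap between the two definitions is easy to see on a toy lower-level cost with two disconnected minima (an illustrative example, not from the paper):

```python
# Toy illustration: the lower-level cost g(u, v) = (v**2 - 1)**2 has two
# disconnected minima, v = +1 and v = -1, for every u, so the upper-level
# value depends on which one the lower level "chooses".
def f(u, v):
    return (v - u) ** 2

S = (+1.0, -1.0)                        # argmin_v g(u, v), independent of u here
u = 0.5
optimistic = min(f(u, v) for v in S)    # cooperative lower level
pessimistic = max(f(u, v) for v in S)   # adversarial lower level
print(optimistic, pessimistic)          # -> 0.25 2.25
```

A gradient-based solver initialized near $v = +1$ would only ever see the optimistic branch, which is the limitation discussed above.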
Here we present the modified algorithm which incorporates regularization (Sec. C.1) and the augmented Lagrangian (Sec. C.2) as discussed previously. The augmented Lagrangian term applies to both the $u$- and $v$-updates, but the regularization term applies only to the $v$-update, as its purpose is to improve the ill-conditioning of $\nabla^2_{vv} g$ during optimization.