1 Introduction
Bilevel optimization can effectively solve problems with hierarchical structures, and it has recently been widely applied in many machine learning applications such as hyperparameter optimization [Franceschi et al., 2018], meta-learning [Franceschi et al., 2018, Liu et al., 2021, Ji et al., 2021], neural network architecture search [Liu et al., 2018, Hong et al., 2020] and image processing [Liu et al., 2021]. In this paper, we consider solving the following nonsmooth nonconvex-strongly-convex bilevel optimization problem:

(Outer)  (1)  
s.t.  (Inner)  (2) 
where the function is smooth and possibly nonconvex, the function is convex and possibly nonsmooth, and the function is strongly convex in . Here the constraint set is compact and convex, or . Problem (1) covers a rich class of nonconvex objective functions with nonsmooth regularization, and is thus more general than the existing nonconvex bilevel optimization formulation in Ghadimi and Wang [2018], which does not consider any regularizer. Here the function typically denotes a nonsmooth regularizer such as .
In machine learning, the loss function generally takes a stochastic form. Thus, we also consider the following stochastic bilevel optimization problem:
(Outer)  (3)  
s.t.  (Inner)  (4) 
where the function is smooth and possibly nonconvex, the function is convex and possibly nonsmooth, and the function is strongly convex in . Here and
are random variables. In fact, problems (1) and (3) cover many machine learning problems with a hierarchical structure, including hyperparameter optimization and meta-learning [Franceschi et al., 2018] and neural network architecture search [Liu et al., 2018]. Specifically, we give two popular applications that can be formulated as the bilevel optimization problem (1) or (3).

Table 1: Comparison of bilevel optimization algorithms for the deterministic problem (1).

Algorithm   Reference                 Nonsmooth
AID-BiO     Ghadimi and Wang [2018]
AID-BiO     Ji et al. [2021]
ITD-BiO     Ji et al. [2021]
BiO-BreD    Ours
denotes the number of Jacobian-vector products, i.e., , where is a vector; denotes the number of Hessian-vector products, i.e., ; denotes the condition number. denotes that the algorithm can solve nonsmooth bilevel optimization.

1.1 Applications
Model-Agnostic Meta-Learning. Model-agnostic meta-learning (MAML) is an effective learning paradigm that finds a good model achieving the best performance on individual tasks by leveraging prior experience. Consider the few-shot meta-learning problem with tasks , where each task has training and test datasets and . As in [Ji et al., 2021, Guo and Yang, 2021], MAML can be formulated as the following bilevel optimization problem:
(5)  
s.t.  (6) 
where is the model parameter of the -th task for all , and is the shared model parameter. Here is a convex and possibly nonsmooth regularizer, and is a tuning parameter. Given a sufficiently large , the above inner problem (6) is clearly strongly convex.
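To make the bilevel structure of (5)-(6) concrete, the following is a minimal sketch assuming quadratic per-task losses, so that the inner problem (6) has a closed form. All data, dimensions and step sizes here are hypothetical illustrations rather than the paper's experimental setup.

```python
import numpy as np

def inner_solution(w, a_i, lam):
    # Inner problem (6) for task i with a toy quadratic training loss
    # 0.5*||w_i - a_i||^2 plus the proximal term 0.5*lam*||w_i - w||^2.
    # Strongly convex for lam > 0, with closed form w_i* = (a_i + lam*w)/(1 + lam).
    return (a_i + lam * w) / (1.0 + lam)

def outer_loss(w, train_targets, test_targets, lam):
    # Outer problem (5): average test loss evaluated at the inner solutions w_i*(w).
    return float(np.mean([0.5 * np.sum((inner_solution(w, a, lam) - b) ** 2)
                          for a, b in zip(train_targets, test_targets)]))

rng = np.random.default_rng(0)
train_targets = [rng.standard_normal(2) for _ in range(3)]  # per-task training targets
test_targets = [rng.standard_normal(2) for _ in range(3)]   # per-task test targets

# Gradient descent on the shared parameter w, using a finite-difference
# approximation of the hypergradient for simplicity.
w, lam, eps = np.zeros(2), 1.0, 1e-6
loss_init = outer_loss(w, train_targets, test_targets, lam)
for _ in range(200):
    grad = np.array([(outer_loss(w + eps * e, train_targets, test_targets, lam)
                      - outer_loss(w - eps * e, train_targets, test_targets, lam)) / (2 * eps)
                     for e in np.eye(2)])
    w = w - 0.5 * grad
loss_final = outer_loss(w, train_targets, test_targets, lam)
```

Because the inner problem is solved in closed form, the outer loss is an explicit function of the shared parameter, which makes the hierarchical structure of (5)-(6) easy to see.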
Table 2: Comparison of bilevel optimization algorithms for the stochastic problem (3).

Algorithm     Reference                 Nonsmooth
TTSA          Hong et al. [2020]
STABLE        Chen et al. [2021]
SMB           Guo et al. [2021]
VRBO          Yang et al. [2021]
SUSTAIN       Khanduri et al. [2021]
VR-saBiAdam   Huang and Huang [2021b]
BSA           Ghadimi and Wang [2018]
stocBiO       Ji et al. [2021]
SBiO-BreD     Ours
ASBiO-BreD    Ours
Neural Network Architecture Search. The goal of neural network architecture search is to find an optimal architecture that minimizes the validation loss. Let and denote the training loss and the validation loss, respectively. These losses are determined not only by the architecture , but also by the weights of the neural network. Specifically, the goal of architecture search is to find an optimal architecture by minimizing the validation loss , where the weights are obtained by minimizing the training loss . As in Liu et al. [2018], we can find the optimal architecture by solving the following bilevel optimization problem:
(7)  
s.t.  (8) 
where denotes a regularizer, is a tuning parameter, and is an initial weight obtained from pre-training or historical information. As in pruning techniques, we generally choose a sparse regularizer such as . For a sufficiently large , the above inner problem (8) is strongly convex.
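A common heuristic for (7)-(8), in the spirit of Liu et al. [2018], alternates one gradient step on the weights for the training loss (inner) with one step on the architecture parameters for the validation loss (outer). The sketch below uses a hypothetical linear model in which the architecture vector gates the features; the data, step sizes, and the choice (initial weight zero, no sparse regularizer) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([1.0, -2.0, 0.5])                   # hidden ground-truth model
X_tr = rng.standard_normal((20, 3))
y_tr = X_tr @ beta + 0.1 * rng.standard_normal(20)  # training data
X_va = rng.standard_normal((20, 3))
y_va = X_va @ beta + 0.1 * rng.standard_normal(20)  # validation data

def val_loss(w, alpha):
    # Outer objective (7): validation loss of the architecture-gated model.
    return 0.5 * np.mean((X_va @ (alpha * w) - y_va) ** 2)

w, alpha, lam = np.zeros(3), np.ones(3), 0.1
v0 = val_loss(w, alpha)
for _ in range(300):
    # Inner step on (8): training loss plus the strongly convex term
    # 0.5*lam*||w - w0||^2 (here with w0 = 0 for simplicity).
    r_tr = X_tr @ (alpha * w) - y_tr
    w = w - 0.1 * (alpha * (X_tr.T @ r_tr) / len(y_tr) + lam * w)
    # Outer step on (7): descend the validation loss w.r.t. alpha.
    r_va = X_va @ (alpha * w) - y_va
    alpha = alpha - 0.05 * (w * (X_va.T @ r_va) / len(y_va))
v1 = val_loss(w, alpha)
```

This single-step alternation only approximates the inner solution, which is exactly the kind of approximation the hypergradient analysis in later sections makes rigorous.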
The above bilevel optimization problems (1) and (3) frequently appear in many machine learning applications, and many bilevel optimization methods have recently been developed to solve them. For example, [Ghadimi and Wang, 2018, Ji et al., 2021] proposed a class of effective methods to solve the above deterministic problem (1) and stochastic problem (3) with . Since these methods still suffer from high computational complexity, some accelerated methods have more recently been proposed for the stochastic problem (3) with . Specifically, [Chen et al., 2021, Khanduri et al., 2021, Guo and Yang, 2021, Yang et al., 2021] proposed accelerated bilevel optimization algorithms by using the variance-reduction techniques of SPIDER [Fang et al., 2018, Wang et al., 2019] and STORM [Cutkosky and Orabona, 2019], respectively. Although these accelerated methods obtain a lower computational complexity when the condition number is ignored, the condition number still accounts for an important part of the computational complexity (see Tables 1 and 2). Moreover, these accelerated methods focus only on the stochastic bilevel optimization problem (3) with . Thus, two natural yet important questions arise:
1) Can we design accelerated methods for solving both the deterministic and stochastic bilevel optimization problems that achieve a lower computational complexity, especially in the condition-number part?
2) Can we develop effective methods for solving both the deterministic and stochastic bilevel optimization problems with nonsmooth regularization?
1.2 Contributions
In this paper, we provide an affirmative answer to the above two questions and propose a class of effective bilevel optimization methods based on dynamic Bregman distances. Specifically, we use mirror descent iterations to update the variable based on a strongly convex mirror function. Our main contributions are summarized as follows:

We propose a class of effective bilevel optimization methods for nonsmooth bilevel optimization problems based on Bregman distances. Moreover, we provide a well-established convergence analysis framework for the proposed bilevel optimization methods.

We propose an effective stochastic bilevel optimization method based on adaptive Bregman distances (SBiO-BreD) for solving the stochastic bilevel problem (3). We further propose an accelerated version of SBiO-BreD (ASBiO-BreD) by using the variance-reduction technique of SARAH/SPIDER [Nguyen et al., 2017, Fang et al., 2018, Wang et al., 2019]. Moreover, we prove that ASBiO-BreD reaches a lower sample complexity than the best known result (see Table 2).
Note that our methods can solve constrained bilevel optimization with nonsmooth regularization and do not rely on any special constraint sets or regularizers. In other words, our methods can also solve the unconstrained bilevel optimization without nonsmooth regularization considered in [Ghadimi and Wang, 2018, Ji et al., 2021]. Naturally, our convergence results apply to both the constrained bilevel optimization with nonsmooth regularization and the unconstrained bilevel optimization without it.
1.3 Notations
Let denote a -dimensional identity matrix. denotes a uniform distribution over a discrete set . denotes the norm for vectors and the spectral norm for matrices, respectively. For two vectors and , denotes their inner product. and denote the partial derivatives w.r.t. the variables and , respectively. Given mini-batch samples , we let . For two sequences , denotes that for some constant . The notation hides logarithmic terms. Given a closed convex set , we define the projection operation . denotes the subgradient set of the function .

2 Related Works
In this section, we overview the existing bilevel optimization methods and Bregman distance based methods, respectively.
2.1 Bilevel Optimization Methods
Bilevel optimization has recently attracted increased interest in many machine learning applications such as model-agnostic meta-learning, neural network architecture search and policy optimization. Thus, many bilevel optimization methods have recently been proposed to solve bilevel problems. For example, [Ghadimi and Wang, 2018] proposed a class of bilevel approximation methods that solve bilevel optimization problems by iteratively approximating the (stochastic) gradient of the outer problem in either a forward or a backward manner. [Hong et al., 2020] presented a two-timescale stochastic algorithm framework for stochastic bilevel optimization. Subsequently, several accelerated bilevel approximation methods have been proposed. Specifically, [Ji et al., 2021] proposed faster bilevel optimization methods based on approximate implicit differentiation (AID) and iterative differentiation (ITD), respectively. Moreover, [Chen et al., 2021, Khanduri et al., 2021, Guo and Yang, 2021, Yang et al., 2021] proposed accelerated bilevel methods for stochastic bilevel problems by using variance-reduction techniques. More recently, Huang and Huang [2021b] proposed a class of efficient adaptive bilevel optimization methods. Meanwhile, lower bounds for bilevel optimization methods have been studied in [Ji and Liang, 2021].
2.2 Bregman distancebased methods
Bregman distance based methods (a.k.a. mirror descent methods) [Censor and Zenios, 1992, Beck and Teboulle, 2003] are a powerful optimization tool because they use Bregman distances to fit the geometry of optimization problems. The Bregman distance was first proposed in Bregman [1967] and extended in Censor and Lent [1981]. Censor and Zenios [1992] first proposed a proximal minimization algorithm with Bregman functions. [Beck and Teboulle, 2003] studied mirror descent for convex optimization. Subsequently, Duchi et al. [2010] proposed an effective variant of mirror descent, i.e., composite objective mirror descent, to solve regularized convex optimization. More recently, [Lei and Jordan, 2020] integrated a variance-reduction technique into the mirror descent algorithm for stochastic convex optimization. Zhang and He [2018] studied the convergence properties of the mirror descent algorithm for solving nonsmooth nonconvex problems. A variance-reduced adaptive stochastic mirror descent algorithm [Li et al., 2020] has been proposed to solve nonsmooth nonconvex finite-sum optimization. More recently, Huang et al. [2021a] effectively applied the mirror descent method to regularized reinforcement learning.
3 Preliminaries
Assumption 1.
The function is possibly nonconvex w.r.t. , and the function is strongly convex w.r.t. . For the stochastic case, the same assumptions hold for and , respectively.
Assumption 2.
The loss functions and satisfy

and for any and ;

The partial derivatives , , and are L-Lipschitz continuous, i.e., for and ,
For the stochastic case, the same assumptions hold for and for any and .
Assumption 3.
The Jacobian matrix and Hessian matrix are Lipschitz and Lipschitz continuous, respectively, i.e., for all and
For the stochastic case, the same assumptions hold for and for any .
Assumption 4.
The functions for any are convex but possibly nonsmooth.
Assumption 5.
The function is bounded below, i.e., .
Assumptions 1-3 are commonly used in bilevel optimization methods [Ghadimi and Wang, 2018, Ji et al., 2021, Khanduri et al., 2021]. According to Assumption 1, , where and . Thus is similar to the assumption that the function is Lipschitz in [Ji et al., 2021]. From the proofs in [Ji et al., 2021], we can see that they still use the norm-bounded partial derivative . Similarly, according to Assumption 1, we have . Since , where and , we can let as in [Ji et al., 2021]. From the proofs in [Ji et al., 2021], we can see that they still use the norm-bounded partial derivative for all . Throughout the paper, we let .
Assumption 4 is generally satisfied by regularizers such as the sparse penalty . Assumption 5 ensures the feasibility of problems (1) and (3).
When we use first-order methods to solve the above bilevel optimization problems (1) and (3), we can easily obtain the partial (stochastic) derivative or to update the variable . However, it is hard to obtain the (stochastic) gradient or when the inner problem of (1) or (3) does not have a closed-form solution. Thus, a key step in solving problems (1) and (3) is to estimate the gradient . The following lemma gives one gradient estimator of .

Lemma 1.
Lemma 1 gives a natural estimator of , defined for all as
(10) 
Next we give some properties of , and in the following lemma:
4 Bilevel Optimization via Bregman Distances Methods
In this section, we propose a class of enhanced bilevel optimization methods based on Bregman distances to solve the deterministic problem (1) and the stochastic problem (3), respectively.
4.1 Deterministic BiOBreD Algorithm
In this subsection, we propose a deterministic bilevel optimization method via Bregman distances (BiO-BreD) to solve the deterministic bilevel optimization problem (1). Algorithm 1 describes the algorithmic framework of the BiO-BreD method.
Given a strongly convex and continuously differentiable function , i.e., , we define a Bregman distance [Censor and Lent, 1981, Censor and Zenios, 1992] for any :
(11) 
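The definition in (11) can be checked numerically. The following small sketch (function names and inputs are illustrative) shows that two standard choices of the mirror function recover familiar distances: the Euclidean mirror function gives half the squared Euclidean distance, and the negative entropy gives the KL divergence on the probability simplex.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    # Bregman distance (11): D_psi(x, y) = psi(x) - psi(y) - <grad_psi(y), x - y>.
    return psi(x) - psi(y) - grad_psi(y) @ (x - y)

# (a) psi(x) = 0.5*||x||^2 recovers half the squared Euclidean distance.
sq = lambda x: 0.5 * (x @ x)
sq_grad = lambda x: x

# (b) Negative entropy recovers the KL divergence on the simplex.
negent = lambda x: float(np.sum(x * np.log(x)))
negent_grad = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])   # points on the probability simplex
y = np.array([0.4, 0.4, 0.2])
d_euc = bregman(sq, sq_grad, x, y)          # equals 0.5*||x - y||^2
d_kl = bregman(negent, negent_grad, x, y)   # equals KL(x || y)
```

Both distances are nonnegative and vanish only at x = y, which is exactly the property the convergence analysis exploits.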
In Algorithm 1, we use the following mirror descent iteration to update the variable at the th step:
(12) 
where is the step size, and is an estimator of . Here the mirror function can change dynamically as the algorithm runs. Let ; then we have . When , the above subproblem (12) is equivalent to proximal gradient descent. When and , the above subproblem (12) is equivalent to projected gradient descent. Let ; then we have . When is an approximated Hessian matrix, the above subproblem (12) is equivalent to proximal quasi-Newton descent. When is an adaptive matrix as used in [Huang et al., 2021b, Huang and Huang, 2021a], the above subproblem (12) is equivalent to proximal adaptive gradient descent.
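As a concrete instance of the special cases above, the following hedged sketch implements one update of (12) with the Euclidean mirror function and an l1 regularizer, in which case the subproblem has the well-known soft-thresholding solution; the input values are hypothetical.

```python
import numpy as np

def mirror_step_l1(x, grad, gamma, lam):
    # One update of (12) with psi(u) = 0.5*||u||^2 and h(u) = lam*||u||_1:
    #   argmin_u <grad, u> + lam*||u||_1 + (1/(2*gamma))*||u - x||^2,
    # whose solution is soft-thresholding of the gradient step (proximal
    # gradient descent, the first special case above).
    z = x - gamma * grad
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

# illustrative inputs
x = np.array([1.0, -0.2, 0.05])
g = np.array([0.5, 0.1, 0.0])
x_new = mirror_step_l1(x, g, gamma=0.1, lam=0.3)  # entries below the threshold shrink toward 0
```

With other mirror functions (e.g. a quasi-Newton or adaptive matrix), the same subproblem changes only through the Bregman distance term, which is what makes the framework uniform.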
In Algorithm 1, we use the gradient estimator to estimate , where the partial derivative is obtained by backpropagation w.r.t. . The following lemma gives an analytical form of :
Lemma 3.
(Proposition 2 in [Ji et al., 2021]) The gradient has the following analytical form:
(13) 
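The implicit-differentiation form in Lemma 3 can be verified numerically on a toy quadratic instance (an illustrative assumption, not the paper's setting), where the inner solution and all partial derivatives are available in closed form:

```python
import numpy as np

# Toy instance: inner g(x, y) = 0.5*||y - A x||^2 + 0.5*mu*||y||^2 is strongly
# convex in y with y*(x) = A x / (1 + mu); outer f(x, y) = 0.5*||y - b||^2.
A = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
mu = 0.5

def y_star(x):
    # closed-form inner solution, since grad_yy g = (1 + mu) I
    return A @ x / (1.0 + mu)

def hyper_grad(x):
    # Lemma 3 form: grad Phi(x) = grad_x f - grad_xy g (grad_yy g)^{-1} grad_y f,
    # evaluated at (x, y*(x)). Here grad_x f = 0, grad_xy g = -A^T,
    # grad_yy g = (1 + mu) I, and grad_y f = y*(x) - b.
    gy_f = y_star(x) - b
    return -(-A.T) @ gy_f / (1.0 + mu)

def phi(x):
    # Phi(x) = f(x, y*(x))
    return 0.5 * np.sum((y_star(x) - b) ** 2)

# compare against central finite differences of Phi
x0 = np.array([0.3, -0.7])
eps = 1e-6
fd = np.array([(phi(x0 + eps * e) - phi(x0 - eps * e)) / (2 * eps) for e in np.eye(2)])
```

The implicit-differentiation gradient matches the finite-difference gradient of the reduced objective, which is the content of the lemma in this special case.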
Lemma 4.
Lemma 4 above shows that the variance of the gradient estimator decays exponentially fast with the iteration number .
4.2 SBiOBreD Algorithm
In this subsection, we propose an effective stochastic bilevel optimization method via Bregman distances (SBiO-BreD) to solve the stochastic bilevel optimization problem (3). Algorithm 2 details the algorithmic framework of the SBiO-BreD method.
Given and drawing independent samples , as in [Hong et al., 2020, Khanduri et al., 2021], we define the following stochastic gradient estimator:
(15) 
where is a uniform random variable independent of . It is easy to verify that is a biased estimator of , i.e., . Here we define the bias of the gradient estimator (15).
Lemma 5.
Lemma 5 shows that the bias decays exponentially fast with the number ; choosing , we have . Specifically, letting , we have . Since , we have . Further, since , letting , we have . Note that here we use .
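The exponential bias decay in Lemma 5 mirrors the behavior of the truncated Neumann series that underlies estimators of the form (15) in [Hong et al., 2020, Khanduri et al., 2021]. The sketch below (the matrix, vector and step size are illustrative assumptions) shows the truncation error of approximating a Hessian-inverse-vector product shrinking geometrically with the number of terms:

```python
import numpy as np

# For a positive definite Hessian H with eigenvalues in [mu, L] and step
# eta <= 1/L, eta * sum_{j=0}^{K-1} (I - eta*H)^j -> H^{-1}, with error
# decaying like (1 - eta*mu)^K -- the same geometric rate as in Lemma 5.
H = np.array([[2.0, 0.5], [0.5, 1.0]])  # toy SPD Hessian
eta = 1.0 / 3.0                         # <= 1/L for this H
v = np.array([1.0, -2.0])

def neumann_inverse_vec(H, v, eta, K):
    # Approximate H^{-1} v with the K-term truncated Neumann series,
    # using only Hessian-vector products.
    out = np.zeros_like(v)
    p = v.copy()
    for _ in range(K):
        out += eta * p
        p = p - eta * (H @ p)           # p <- (I - eta*H) p
    return out

exact = np.linalg.solve(H, v)
errs = [np.linalg.norm(neumann_inverse_vec(H, v, eta, K) - exact) for K in (5, 20, 80)]
```

In the stochastic estimator, the series length is randomized, which trades this deterministic truncation bias against per-iteration cost.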
For notational simplicity, let . In Algorithm 2, we use the mini-batch stochastic gradient estimator , defined as