Bilevel optimization can effectively solve problems with hierarchical structures and has recently been widely applied in many machine learning applications, such as hyper-parameter optimization [Franceschi et al., 2018], meta-learning [Franceschi et al., 2018, Liu et al., 2021, Ji et al., 2021], neural network architecture search [Liu et al., 2018, Hong et al., 2020] and image processing [Liu et al.]. In this paper, we consider solving the following nonsmooth nonconvex-strongly-convex bilevel optimization problem:
where the upper-level function $f(x,y)$ is smooth and possibly nonconvex, the function $h(x)$ is convex and possibly nonsmooth, and the lower-level function $g(x,y)$ is $\mu$-strongly convex in $y$. Here the constraint set $\mathcal{X}$ is compact and convex, or $\mathcal{X} = \mathbb{R}^d$. Problem (1) covers a rich class of nonconvex objective functions with nonsmooth regularization, and is more general than the existing nonconvex bilevel optimization formulation in Ghadimi and Wang [2018], which does not consider any regularizer. Here the function $h(x)$ frequently denotes a nonsmooth regularizer such as a sparse ($\ell_1$-type) penalty.
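In standard notation for this problem class (the symbols $F$, $f$, $h$, $g$ and $y^*$ are assumptions of this sketch, chosen to match the surrounding discussion), problem (1) has the form:

```latex
\min_{x \in \mathcal{X}} \; F(x) := f\big(x, y^*(x)\big) + h(x),
\quad \text{s.t.} \quad y^*(x) = \arg\min_{y \in \mathbb{R}^p} g(x, y),
```

where $f$ is the smooth, possibly nonconvex upper-level loss, $h$ is the convex, possibly nonsmooth regularizer, and $g$ is the $\mu$-strongly convex lower-level loss.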
In machine learning, the loss function generally takes a stochastic form. Thus, we also consider the following stochastic bilevel optimization problem:
where the function $f(x,y)$ is smooth and possibly nonconvex, the function $h(x)$ is convex and possibly nonsmooth, and the function $g(x,y)$ is $\mu$-strongly convex in $y$. Here $\xi$ and $\zeta$ are random variables. In fact, problems (1) and (3) cover many machine learning problems with a hierarchical structure, including hyper-parameter optimization, meta-learning [Franceschi et al., 2018] and neural network architecture search [Liu et al., 2018]. Specifically, we give two popular applications that can be formulated as the bilevel optimization problem (1) or (3).
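Analogously, the stochastic problem (3) can be sketched as (again with assumed notation; $\xi$ and $\zeta$ denote the random variables):

```latex
\min_{x \in \mathcal{X}} \; F(x) := \mathbb{E}_{\xi}\big[f\big(x, y^*(x); \xi\big)\big] + h(x),
\quad \text{s.t.} \quad y^*(x) = \arg\min_{y \in \mathbb{R}^p} \mathbb{E}_{\zeta}\big[g(x, y; \zeta)\big].
```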
| Algorithm | Reference |
|---|---|
| AID-BiO | Ghadimi and Wang [2018] |
| AID-BiO | Ji et al. [2021] |
| ITD-BiO | Ji et al. [2021] |

JV($g,\epsilon$) denotes the number of Jacobian-vector products $\nabla^2_{xy} g(x,y)v$, where $v$ is a vector; HV($g,\epsilon$) denotes the number of Hessian-vector products $\nabla^2_{yy} g(x,y)v$; $\kappa$ denotes the condition number. ✓ denotes that the algorithm can solve nonsmooth bilevel optimization.
Model-Agnostic Meta-Learning. Model-agnostic meta-learning (MAML) is an effective learning paradigm that seeks a good shared model achieving strong performance on individual tasks by leveraging prior experience. Consider the few-shot meta-learning problem with $m$ tasks $\{\mathcal{T}_i\}_{i=1}^m$, where each task $\mathcal{T}_i$ has a training dataset $\mathcal{D}_i^{tr}$ and a test dataset $\mathcal{D}_i^{te}$. As in [Ji et al., 2021, Guo and Yang, 2021], MAML can be formulated as the following bilevel optimization problem
where $w_i$ is the model parameter of the $i$-th task for all $i \in \{1,\ldots,m\}$, and $x$ is the shared model parameter. Here $h(x)$ is a convex and possibly nonsmooth regularizer, and $\lambda > 0$ is a tuning parameter. Given a sufficiently large $\lambda$, the above inner problem (6) is clearly strongly convex.
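A sketch of this formulation in the style of proximal-regularized MAML (the symbols $w_i$, $x$, $\mathcal{L}$, $\lambda$ and the quadratic proximal term are assumptions of this sketch):

```latex
\min_{x} \; \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(w_i^*(x); \mathcal{D}_i^{te}\big) + h(x),
\quad \text{s.t.} \quad w_i^*(x) = \arg\min_{w_i} \mathcal{L}\big(w_i; \mathcal{D}_i^{tr}\big) + \frac{\lambda}{2} \|w_i - x\|^2,
```

where the proximal term $\frac{\lambda}{2}\|w_i - x\|^2$ ties each task parameter to the shared parameter $x$ and makes the inner objective strongly convex once $\lambda$ exceeds the smoothness constant of the training loss.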
| Algorithm | Reference |
|---|---|
| TTSA | Hong et al. [2020] |
| STABLE | Chen et al. [2021] |
| SMB | Guo et al. [2021] |
| VRBO | Yang et al. [2021] |
| SUSTAIN | Khanduri et al. [2021] |
| VR-saBiAdam | Huang and Huang [2021b] |
| BSA | Ghadimi and Wang [2018] |
| stocBiO | Ji et al. [2021] |
Neural Network Architecture Search. The goal of neural network architecture search is to find an optimal architecture that minimizes the validation loss. Let $\mathcal{L}_{tr}$ and $\mathcal{L}_{val}$ denote the training loss and the validation loss, respectively. These losses are determined not only by the architecture parameters $x$, but also by the weights $w$ of the neural network. Specifically, the goal of architecture search is to find an optimal architecture by minimizing the validation loss $\mathcal{L}_{val}(x, w^*(x))$, where the weights $w^*(x)$ are obtained by minimizing the training loss $\mathcal{L}_{tr}(x, w)$. As in Liu et al. [2018], we can find the optimal architecture by solving the following bilevel optimization problem:
where $h(x)$ denotes a regularizer, $\lambda > 0$ is a tuning parameter, and $w_0$ is an initial weight obtained from pre-training or historical information. As in pruning techniques, we generally choose a sparse regularizer such as the $\ell_1$ norm. Choosing a sufficiently large $\lambda$, the above inner problem (8) is strongly convex.
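A sketch of this formulation (the symbols $x$, $w$, $w_0$ and $\lambda$ are assumptions of this sketch, matching the description above):

```latex
\min_{x} \; \mathcal{L}_{val}\big(x, w^*(x)\big) + h(x),
\quad \text{s.t.} \quad w^*(x) = \arg\min_{w} \mathcal{L}_{tr}(x, w) + \frac{\lambda}{2} \|w - w_0\|^2,
```

where the proximal term $\frac{\lambda}{2}\|w - w_0\|^2$ anchors the weights to the initialization $w_0$ and makes the inner problem strongly convex for sufficiently large $\lambda$.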
The above bilevel optimization problems (1) and (3) frequently appear in many machine learning applications. Thus, many bilevel optimization methods have recently been developed to solve them. For example, [Ghadimi and Wang, 2018, Ji et al., 2021] proposed a class of effective methods to solve the above deterministic problem (1) and stochastic problem (3) without nonsmooth regularization. Since these methods still suffer from high computational complexity, some accelerated methods have more recently been proposed for the stochastic problem (3). Specifically, [Chen et al., 2021, Khanduri et al., 2021, Guo and Yang, 2021, Yang et al., 2021] proposed accelerated bilevel optimization algorithms by using the variance-reduction techniques of SPIDER [Fang et al., 2018, Wang et al., 2019] and STORM [Cutkosky and Orabona, 2019], respectively. Although these accelerated methods obtain a lower computational complexity when the condition number is ignored, the condition number also accounts for an important part of the computational complexity (please see Tables 1 and 2). Moreover, these accelerated methods focus only on the stochastic bilevel optimization problem (3) without nonsmooth regularization. Thus two natural yet important questions are:
1) Can we propose accelerated methods for solving both the deterministic and stochastic bilevel optimization problems that obtain a lower computational complexity, especially in the condition-number dependence?
2) Can we develop effective methods for solving both the deterministic and stochastic bilevel optimization problems with nonsmooth regularization?
In this paper, we provide an affirmative answer to the above two questions and propose a class of effective bilevel optimization methods based on dynamic Bregman distances. Specifically, we use mirror descent iterations to update the variable $x$ based on a strongly convex mirror function. Our main contributions are summarized as follows:
We propose a class of effective bilevel optimization methods for nonsmooth bilevel optimization problems based on Bregman distances. Moreover, we provide a well-established convergence analysis framework for the proposed bilevel optimization methods.
We propose an effective bilevel optimization method based on adaptive Bregman distances (SBiO-BreD) for solving the stochastic bilevel problem (3). We further propose an accelerated version of SBiO-BreD (ASBiO-BreD) by using the variance-reduction technique of SARAH/SPIDER [Nguyen et al., 2017, Fang et al., 2018, Wang et al., 2019]. Moreover, we prove that ASBiO-BreD reaches a lower sample complexity than the best known result (please see Table 2).
Note that our methods can solve constrained bilevel optimization with nonsmooth regularization without relying on special constraint sets or nonsmooth regularizers. In other words, our methods can also solve the unconstrained bilevel optimization without nonsmooth regularization considered in [Ghadimi and Wang, 2018, Ji et al., 2021]. Naturally, our convergence results apply to both the constrained bilevel optimization with nonsmooth regularization and the unconstrained bilevel optimization without nonsmooth regularization.
Let $I_d$ denote the $d$-dimensional identity matrix. $\mathcal{U}\{1,\ldots,K\}$ denotes the uniform distribution over the discrete set $\{1,\ldots,K\}$. $\|\cdot\|$ denotes the $\ell_2$ norm for vectors and the spectral norm for matrices, respectively. For two vectors $x$ and $y$, $\langle x, y \rangle$ denotes their inner product. $\nabla_x f(x,y)$ and $\nabla_y f(x,y)$ denote the partial derivatives w.r.t. the variables $x$ and $y$, respectively. Given mini-batch samples $\mathcal{B} = \{\xi_i\}_{i=1}^b$, we let $\nabla f(x; \mathcal{B}) = \frac{1}{b} \sum_{i=1}^{b} \nabla f(x; \xi_i)$. For two sequences $\{a_n\}$ and $\{b_n\}$, $a_n = O(b_n)$ denotes that $a_n \le C b_n$ for some constant $C > 0$. The notation $\tilde{O}(\cdot)$ hides logarithmic terms. Given a closed convex set $\mathcal{X}$, we define the projection operator $P_{\mathcal{X}}(z) = \arg\min_{x \in \mathcal{X}} \|x - z\|^2$. $\partial h(x)$ denotes the subgradient set of the function $h$.
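As a small illustration of the projection operator defined above (a hypothetical helper, not code from the paper), the following sketch projects a point onto a Euclidean ball, one of the simplest compact convex constraint sets:

```python
import math

def project_onto_ball(z, radius=1.0):
    """Euclidean projection of z onto the ball {x : ||x|| <= radius}.

    If z already lies inside the ball it is returned unchanged;
    otherwise it is rescaled onto the boundary, which is exactly
    the minimizer of ||x - z||^2 over the ball.
    """
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= radius:
        return list(z)
    return [v * radius / norm for v in z]
```

For general convex sets the projection has no closed form; the Bregman-distance framework discussed later generalizes this Euclidean projection to other geometries.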
2 Related Works
In this section, we overview the existing bilevel optimization methods and Bregman distance based methods, respectively.
2.1 Bilevel Optimization Methods
Bilevel optimization has recently attracted increasing interest in many machine learning applications such as model-agnostic meta-learning, neural network architecture search and policy optimization. Thus, many bilevel optimization methods have recently been proposed to solve bilevel problems. For example, [Ghadimi and Wang, 2018] proposed a class of bilevel approximation methods that solve bilevel optimization problems by iteratively approximating the (stochastic) gradient of the outer problem in either a forward or backward manner. [Hong et al., 2020] presented a two-timescale stochastic algorithm framework for stochastic bilevel optimization. Subsequently, accelerated bilevel approximation methods have been proposed. Specifically, [Ji et al., 2021] proposed faster bilevel optimization methods based on approximate implicit differentiation (AID) and iterative differentiation (ITD), respectively. Moreover, [Chen et al., 2021, Khanduri et al., 2021, Guo and Yang, 2021, Yang et al., 2021] proposed accelerated bilevel methods for stochastic bilevel problems by using variance-reduction techniques. More recently, Huang and Huang [2021b] proposed a class of efficient adaptive bilevel optimization methods. At the same time, the lower bound of bilevel optimization methods was studied in [Ji and Liang, 2021].
2.2 Bregman Distance-Based Methods
Bregman distance-based methods (a.k.a. mirror descent methods) [Censor and Zenios, 1992, Beck and Teboulle, 2003] are a powerful optimization tool because they use Bregman distances to fit the geometry of optimization problems. The Bregman distance was first proposed in Bregman [1967] and extended in Censor and Lent [1981]. Censor and Zenios [1992] first proposed a proximal minimization algorithm with Bregman functions. [Beck and Teboulle, 2003] studied mirror descent for convex optimization. Subsequently, Duchi et al. [2010] proposed an effective variant of mirror descent, i.e., composite objective mirror descent, to solve regularized convex optimization. More recently, [Lei and Jordan, 2020] integrated a variance-reduction technique into the mirror descent algorithm for stochastic convex optimization. Zhang and He studied the convergence properties of the mirror descent algorithm for solving nonsmooth nonconvex problems. A variance-reduced adaptive stochastic mirror descent algorithm [Li et al., 2020] has been proposed to solve nonsmooth nonconvex finite-sum optimization. More recently, Huang et al. [2021a] effectively applied the mirror descent method to regularized reinforcement learning.
The function $f(x,y)$ is possibly nonconvex w.r.t. $x$, and the function $g(x,y)$ is $\mu$-strongly convex w.r.t. $y$. For the stochastic case, the same assumptions hold for $f(x,y;\xi)$ and $g(x,y;\zeta)$, respectively.
The loss functions $f(x,y)$ and $g(x,y)$ satisfy:
$\|\nabla_y f(x,y)\| \le C_{fy}$ and $\|\nabla^2_{xy} g(x,y)\| \le C_{gxy}$ for any $x \in \mathcal{X}$ and $y \in \mathbb{R}^p$;
the partial derivatives $\nabla_x f(x,y)$, $\nabla_y f(x,y)$, $\nabla_x g(x,y)$ and $\nabla_y g(x,y)$ are $L$-Lipschitz continuous, i.e., for any $x_1, x_2 \in \mathcal{X}$ and $y_1, y_2 \in \mathbb{R}^p$,
For the stochastic case, the same assumptions hold for $f(x,y;\xi)$ and $g(x,y;\zeta)$ for any $\xi$ and $\zeta$.
The Jacobian matrix $\nabla^2_{xy} g(x,y)$ and the Hessian matrix $\nabla^2_{yy} g(x,y)$ are $L_{gxy}$-Lipschitz and $L_{gyy}$-Lipschitz continuous, respectively, i.e., for all $x_1, x_2 \in \mathcal{X}$ and $y_1, y_2 \in \mathbb{R}^p$,
For the stochastic case, the same assumptions hold for $\nabla^2_{xy} g(x,y;\zeta)$ and $\nabla^2_{yy} g(x,y;\zeta)$ for any $\zeta$.
The function $h(x)$ is convex but possibly nonsmooth.
The objective function $F(x)$ is bounded below, i.e., $F^* = \inf_{x \in \mathcal{X}} F(x) > -\infty$.
Assumptions 1-3 are commonly used in bilevel optimization methods [Ghadimi and Wang, 2018, Ji et al., 2021, Khanduri et al., 2021]. According to Assumption 1, the partial derivative $\nabla_y f(x,y)$ is norm-bounded, which is similar to the assumption in [Ji et al., 2021] that the function $f(x,y)$ is Lipschitz; indeed, the proofs in [Ji et al., 2021] still use the norm-bounded partial derivative $\nabla_y f(x,y)$. Similarly, according to Assumption 1, the Jacobian $\nabla^2_{xy} g(x,y)$ is norm-bounded, so we can set the corresponding bound as in [Ji et al., 2021], whose proofs likewise use the norm-bounded Jacobian $\nabla^2_{xy} g(x,y)$ for all $x \in \mathcal{X}$ and $y \in \mathbb{R}^p$. Throughout the paper, we let $\kappa = L/\mu$ denote the condition number.
When we use first-order methods to solve the above bilevel optimization problems (1) and (3), we can easily obtain the partial (stochastic) derivative of the lower-level objective to update the variable $y$. However, it is hard to obtain the (stochastic) gradient of the upper-level objective when the inner problem of (1) and (3) has no closed-form solution. Thus, a key step in solving problems (1) and (3) is to estimate this gradient. The following lemma shows one gradient estimator.
Lemma 1 suggests a natural estimator of the gradient, defined as follows:
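For intuition, hypergradient estimators of this kind (e.g., in Ghadimi and Wang [2018] and Ji et al. [2021]) typically replace the Hessian inverse with a truncated Neumann series; a representative sketch in the assumed notation, with step size $\eta$ and truncation level $N$, is:

```latex
\bar{\nabla} f(x, y) = \nabla_x f(x, y)
- \nabla^2_{xy} g(x, y) \Big[ \eta \sum_{i=0}^{N-1} \big( I - \eta \nabla^2_{yy} g(x, y) \big)^i \Big] \nabla_y f(x, y),
```

since $\eta \sum_{i=0}^{\infty} (I - \eta \nabla^2_{yy} g)^i = [\nabla^2_{yy} g]^{-1}$ whenever $0 < \eta \le 1/L$, the truncation error decays geometrically in $N$.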
Next, we give some properties of these estimators in the following lemma:
4 Bilevel Optimization via Bregman Distances Methods
4.1 Deterministic BiO-BreD Algorithm
In this subsection, we propose a deterministic bilevel optimization method via Bregman distances (BiO-BreD) to solve the deterministic bilevel optimization problem (1). Algorithm 1 describes the algorithmic framework of the BiO-BreD method.
In Algorithm 1, we use the mirror descent iteration to update the variable $x$ at the $t$-th step:
where $\gamma > 0$ is a step size, and $w_t$ is an estimator of the gradient. Here the mirror function $\phi_t$ can be dynamic as the algorithm runs. Let $\phi_t(x) = \frac{1}{2}\|x\|^2$; then the associated Bregman distance is $D_{\phi_t}(x, x_t) = \frac{1}{2}\|x - x_t\|^2$. When the nonsmooth regularizer is present, the above subproblem (12) is equivalent to proximal gradient descent; when it is absent and a constraint set is used, the above subproblem (12) is equivalent to projected gradient descent. Let $\phi_t(x) = \frac{1}{2} x^\top H_t x$; then $D_{\phi_t}(x, x_t) = \frac{1}{2}(x - x_t)^\top H_t (x - x_t)$. When $H_t$ is an approximate Hessian matrix, the above subproblem (12) is equivalent to proximal quasi-Newton descent. When $H_t$ is an adaptive matrix as used in [Huang et al., 2021b, Huang and Huang, 2021a], the above subproblem (12) is equivalent to proximal adaptive gradient descent.
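To make the Euclidean special case concrete, the following sketch (a hypothetical coordinate-wise implementation, not code from the paper) performs one mirror descent step with mirror function $\phi(x) = \frac{1}{2}\|x\|^2$ and an $\ell_1$ regularizer, which reduces to proximal gradient descent via soft-thresholding:

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |v| (scalar soft-thresholding)."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def mirror_descent_step(x, grad, gamma, lam):
    """One mirror descent step with phi(x) = 0.5 * ||x||^2.

    With this mirror function the Bregman distance is the squared
    Euclidean distance, so the update reduces to proximal gradient
    descent on h(x) = lam * ||x||_1:
        x_new = prox_{gamma * lam * ||.||_1}(x - gamma * grad)
    """
    return [soft_threshold(xi - gamma * gi, gamma * lam)
            for xi, gi in zip(x, grad)]
```

With `lam = 0` the update reduces to plain gradient descent, matching the special cases discussed above; replacing the quadratic mirror function with $\frac{1}{2} x^\top H_t x$ would yield the quasi-Newton and adaptive variants.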
In Algorithm 1, we use a gradient estimator to approximate the gradient, where the required partial derivatives are obtained by backpropagation. The following lemma shows an analytical form of this gradient:
(Proposition 2, [Ji et al., 2021]) The gradient has the following analytical form:
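The analytical form referenced in this lemma is the classical implicit-differentiation hypergradient; in the generic notation assumed here (upper-level loss $f$, lower-level loss $g$, lower-level solution $y^*(x)$):

```latex
\nabla f\big(x, y^*(x)\big) = \nabla_x f\big(x, y^*(x)\big)
- \nabla^2_{xy} g\big(x, y^*(x)\big) \big[\nabla^2_{yy} g\big(x, y^*(x)\big)\big]^{-1} \nabla_y f\big(x, y^*(x)\big),
```

which follows from differentiating the optimality condition $\nabla_y g(x, y^*(x)) = 0$ via the implicit function theorem, using the invertibility of $\nabla^2_{yy} g$ guaranteed by strong convexity.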
(Lemma 6, [Ji et al., 2021]) Under the above assumptions, we have
where , and .
The above Lemma 4 shows that the variance of the gradient estimator decays exponentially fast with the iteration number.
4.2 SBiO-BreD Algorithm
In this subsection, we propose an effective stochastic bilevel optimization method via Bregman distances (SBiO-BreD) to solve the stochastic bilevel optimization problem (3). Algorithm 2 details the algorithmic framework of the SBiO-BreD method.
where $u$ is a uniform random variable independent of the other randomness. It is easy to verify that (15) is a biased estimator of the true gradient, i.e., its expectation differs from the gradient in general. Here we define the bias of the gradient estimator (15).
Lemma 5 shows that the bias decays exponentially fast with the number of iterations $Q$, so choosing $Q$ appropriately drives the bias below any target accuracy $\epsilon$. Specifically, since $1 - a \le e^{-a}$ for $a \ge 0$, the geometric factor satisfies $(1 - \eta\mu)^Q \le e^{-\eta\mu Q}$; letting $Q = \frac{1}{\eta\mu}\log(1/\epsilon)$ then yields $(1 - \eta\mu)^Q \le \epsilon$, i.e., $Q = O(\kappa \log(1/\epsilon))$ iterations suffice. Note that here we use the inequality $1 - a \le e^{-a}$.
For notational simplicity, we abbreviate the sample arguments below. In Algorithm 2, we use a mini-batch stochastic gradient estimator, defined as
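As an illustration of the mini-batch averaging used here (a hypothetical helper; the per-sample gradient oracle `grad_fn` is an assumption of this sketch, not the paper's API), a minimal version is:

```python
import random

def minibatch_gradient(grad_fn, samples, batch_size, rng=random):
    """Mini-batch stochastic gradient estimator.

    Draws `batch_size` samples uniformly at random (with replacement)
    and averages the per-sample gradients grad_fn(sample), so the
    estimate is unbiased for the average gradient over `samples`.
    """
    batch = [rng.choice(samples) for _ in range(batch_size)]
    grads = [grad_fn(s) for s in batch]
    dim = len(grads[0])
    return [sum(g[i] for g in grads) / batch_size for i in range(dim)]
```

Averaging over a mini-batch reduces the estimator's variance by a factor of the batch size, which is what the convergence analysis of SBiO-BreD exploits.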