# Enhanced Bilevel Optimization via Bregman Distance

Bilevel optimization has been widely applied to many machine learning problems such as hyperparameter optimization, policy optimization and meta-learning. Although many bilevel optimization methods have recently been proposed, they still suffer from high computational complexity and do not consider more general bilevel problems with nonsmooth regularization. In this paper, we therefore propose a class of efficient bilevel optimization methods based on Bregman distance. In our methods, we use the mirror descent iteration with strongly-convex Bregman functions to solve the outer subproblem of the bilevel problem. Specifically, we propose a bilevel optimization method based on Bregman distance (BiO-BreD) for solving deterministic bilevel problems, which reaches lower computational complexity than the best-known results. We also propose a stochastic bilevel optimization method (SBiO-BreD) for solving stochastic bilevel problems based on stochastic approximated gradients and Bregman distance. Further, we propose an accelerated version of SBiO-BreD (ASBiO-BreD) using a variance-reduction technique. Moreover, we prove that ASBiO-BreD improves on the best-known computational complexities with respect to the condition number κ and the target accuracy ϵ for finding an ϵ-stationary point of nonconvex-strongly-convex bilevel problems. In particular, our methods can solve bilevel optimization problems with nonsmooth regularization at a lower computational complexity.


## 1 Introduction

Bilevel optimization can effectively solve problems with hierarchical structures and has recently been widely applied in many machine learning applications such as hyper-parameter optimization [Franceschi et al., 2018], meta-learning [Franceschi et al., 2018, Liu et al., 2021, Ji et al., 2021], neural network architecture search [Liu et al., 2018, Hong et al., 2020] and image processing [Liu et al., 2021]. In this paper, we consider solving the following nonsmooth nonconvex-strongly-convex bilevel optimization problem:

$$\min_{x\in\mathcal{X}}\; f(x, y^*(x)) + h(x), \quad \text{(Outer)} \qquad (1)$$
$$\text{s.t.}\;\; y^*(x) \in \mathop{\arg\min}_{y\in\mathbb{R}^{d_2}}\; g(x,y), \quad \text{(Inner)} \qquad (2)$$

where the function $f(x,y)$ is smooth and possibly nonconvex, the function $h(x)$ is convex and possibly nonsmooth, and the function $g(x,y)$ is $\mu$-strongly convex in $y$. Here the constraint set $\mathcal{X}$ is compact and convex, or $\mathcal{X}=\mathbb{R}^{d_1}$. The problem (1) covers a rich class of nonconvex objective functions with nonsmooth regularization, which is more general than the existing nonconvex bilevel optimization formulation in Ghadimi and Wang [2018] that does not consider any regularizer. Here the function $h(x)$ frequently denotes nonsmooth regularization such as $\|x\|_1$.

In machine learning, the loss function generally takes a stochastic form. Thus, we also consider the following stochastic bilevel optimization problem:

$$\min_{x\in\mathcal{X}}\; \mathbb{E}_{\xi}\big[f(x, y^*(x); \xi)\big] + h(x), \quad \text{(Outer)} \qquad (3)$$
$$\text{s.t.}\;\; y^*(x) \in \mathop{\arg\min}_{y\in\mathbb{R}^{d_2}}\; \mathbb{E}_{\zeta}\big[g(x,y;\zeta)\big], \quad \text{(Inner)} \qquad (4)$$

where the function $f(x,y;\xi)$ is smooth and possibly nonconvex, the function $h(x)$ is convex and possibly nonsmooth, and the expected function $\mathbb{E}_{\zeta}[g(x,y;\zeta)]$ is $\mu$-strongly convex in $y$. Here $\xi$ and $\zeta$ are random variables. In fact, the problems (1) and (3) cover many machine learning problems with a hierarchical structure, including hyper-parameter meta-learning [Franceschi et al., 2018] and neural network architecture search [Liu et al., 2018]. Specifically, we give two popular applications that can be formulated as the bilevel optimization problem (1) or (3).

### 1.1 Applications

Model-Agnostic Meta-Learning. Model-agnostic meta-learning (MAML) is an effective learning paradigm that seeks a good shared model achieving strong performance on individual tasks by leveraging prior experience. Consider the few-shot meta-learning problem with $m$ tasks, where the $i$-th task has training and test datasets $D_i^{tr}$ and $D_i^{te}$. As in [Ji et al., 2021, Guo and Yang, 2021], MAML can be formulated as the following bilevel optimization problem:

$$\min_{\theta\in\Theta}\; \frac{1}{m}\sum_{i=1}^m \frac{1}{|D_i^{te}|}\sum_{\xi\in D_i^{te}} \mathcal{L}(\theta, \theta_i^*; \xi) + h(\theta) \qquad (5)$$
$$\text{s.t.}\;\; \theta_i^* \in \mathop{\arg\min}_{\theta_i\in\mathbb{R}^d} \Big\{\frac{1}{|D_i^{tr}|}\sum_{\xi\in D_i^{tr}} \mathcal{L}(\theta, \theta_i; \xi) + \frac{\lambda}{2}\|\theta - \theta_i\|^2\Big\} \quad \text{for } i\in[m], \qquad (6)$$

where $\theta_i$ is the model parameter of the $i$-th task for all $i\in[m]$, and $\theta$ is the shared model parameter. Here $h(\theta)$ is a convex and possibly nonsmooth regularizer, and $\lambda > 0$ is a tuning parameter. Given a sufficiently large $\lambda$, the above inner problem (6) is clearly strongly convex.

Neural Network Architecture Search. The goal of neural network architecture search is to find an optimal architecture minimizing the validation loss. Let $\mathcal{L}_{tra}$ and $\mathcal{L}_{val}$ denote the training loss and the validation loss, respectively. These losses are determined not only by the architecture $\alpha$ but also by the weights $w$ of the neural network. Specifically, the goal of architecture search is to find an optimal architecture by minimizing the validation loss $\mathcal{L}_{val}(w^*(\alpha), \alpha)$, where the weights $w^*(\alpha)$ are obtained by minimizing the training loss $\mathcal{L}_{tra}(w, \alpha)$. As in Liu et al. [2018], we can find the optimal architecture by solving the following bilevel optimization problem:

$$\min_{\alpha}\; \mathcal{L}_{val}(w^*(\alpha), \alpha) + h(\alpha) \qquad (7)$$
$$\text{s.t.}\;\; w^*(\alpha) = \mathop{\arg\min}_{w}\big\{\mathcal{L}_{tra}(w, \alpha) + \lambda\|w - w_0\|^2\big\}, \qquad (8)$$

where $h(\alpha)$ denotes a regularizer, $\lambda > 0$ is a tuning parameter, and $w_0$ is an initial weight obtained from pre-training or historical information. As in pruning techniques, we generally choose a sparse regularizer such as $h(\alpha) = \|\alpha\|_1$. Choosing a sufficiently large $\lambda$, the above inner problem (8) is strongly convex.

The above bilevel optimization problems (1) and (3) frequently appear in many machine learning applications. Thus, many bilevel optimization methods have recently been developed to solve them. For example, [Ghadimi and Wang, 2018, Ji et al., 2021] proposed a class of effective methods to solve the above deterministic problem (1) and stochastic problem (3) with $h(x) = 0$. Since these methods still suffer from high computational complexity, some accelerated methods have more recently been proposed for the stochastic problem (3) with $h(x) = 0$. Specifically, [Chen et al., 2021, Khanduri et al., 2021, Guo and Yang, 2021, Yang et al., 2021] proposed accelerated bilevel optimization algorithms using the variance-reduction techniques of SPIDER [Fang et al., 2018, Wang et al., 2019] and STORM [Cutkosky and Orabona, 2019], respectively. Although these accelerated methods obtain a lower computational complexity in the target accuracy without accounting for the condition number, the condition number also contributes an important part of the computational complexity (please see Tables 1 and 2). Moreover, these accelerated methods focus only on the stochastic bilevel optimization problem (3) with $h(x) = 0$. Thus two natural yet important questions are:

1) Can we propose accelerated methods for solving both the deterministic and stochastic bilevel optimization problems that obtain a lower computational complexity, especially in the condition-number dependence?

2) Can we develop effective methods for solving both the deterministic and stochastic bilevel optimization problems with nonsmooth regularization?

### 1.2 Contributions

In this paper, we provide an affirmative answer to the above two questions and propose a class of effective bilevel optimization methods based on dynamic Bregman distances. Specifically, we use the mirror descent iteration to update the variable $x$ based on a strongly-convex mirror function. Our main contributions are summarized as follows:

• We propose a class of effective bilevel optimization methods for nonsmooth bilevel optimization problems based on Bregman distances. Moreover, we provide a well-established convergence analysis framework for the proposed bilevel optimization methods.

• We propose an effective bilevel optimization method based on adaptive Bregman distances (BiO-BreD) for solving the deterministic bilevel problem (1). We prove that BiO-BreD reaches a lower sample complexity than the best-known result (please see Table 1).

• We propose an effective bilevel optimization method based on adaptive Bregman distances (SBiO-BreD) for solving the stochastic bilevel problem (3). We further propose an accelerated version of SBiO-BreD (ASBiO-BreD) using the variance-reduction technique of SARAH/SPIDER [Nguyen et al., 2017, Fang et al., 2018, Wang et al., 2019]. Moreover, we prove that ASBiO-BreD reaches a lower sample complexity than the best-known result (please see Table 2).

Note that our methods can solve constrained bilevel optimization with nonsmooth regularization without relying on special constraint sets or specific nonsmooth regularizers. In other words, our methods can also solve the unconstrained bilevel optimization without nonsmooth regularization considered in [Ghadimi and Wang, 2018, Ji et al., 2021]. Naturally, our convergence results apply to both the constrained bilevel optimization with nonsmooth regularization and the unconstrained bilevel optimization without it.

### 1.3 Notations

Let $I_d$ denote a $d$-dimensional identity matrix. $\|\cdot\|$ denotes the $\ell_2$ norm for vectors and the spectral norm for matrices, respectively. For two vectors $x$ and $y$, $\langle x, y \rangle$ denotes their inner product. $\nabla_x f$ and $\nabla_y f$ denote the partial derivatives w.r.t. the variables $x$ and $y$, respectively. Given mini-batch samples $\mathcal{B} = \{\xi_i\}_{i=1}^b$, we let $\nabla f(x;\mathcal{B}) = \frac{1}{b}\sum_{i=1}^b \nabla f(x;\xi_i)$. For two sequences $\{a_n\}$ and $\{b_n\}$, $a_n = O(b_n)$ denotes that $a_n \le C b_n$ for some constant $C > 0$. The notation $\tilde{O}(\cdot)$ hides logarithmic terms. Given a convex closed set $\mathcal{X}$, we define a projection operation $\mathcal{P}_{\mathcal{X}}(\cdot)$. $\partial h(x)$ denotes the subgradient set of the function $h$.

## 2 Related Works

In this section, we overview the existing bilevel optimization methods and Bregman distance based methods, respectively.

### 2.1 Bilevel Optimization Methods

Bilevel optimization has recently attracted increased interest in many machine learning applications such as model-agnostic meta-learning, neural network architecture search and policy optimization, and many bilevel optimization methods have been proposed to solve bilevel problems. For example, [Ghadimi and Wang, 2018] proposed a class of bilevel approximation methods that solve bilevel optimization problems by iteratively approximating the (stochastic) gradient of the outer problem in either a forward or backward manner. [Hong et al., 2020] presented a two-timescale stochastic algorithm framework for stochastic bilevel optimization. Subsequently, some accelerated bilevel approximation methods have been proposed. Specifically, [Ji et al., 2021] proposed faster bilevel optimization methods based on approximate implicit differentiation (AID) and iterative differentiation (ITD), respectively. Moreover, [Chen et al., 2021, Khanduri et al., 2021, Guo and Yang, 2021, Yang et al., 2021] proposed accelerated bilevel methods for stochastic bilevel problems using variance-reduction techniques. More recently, Huang and Huang [2021b] proposed a class of efficient adaptive bilevel optimization methods. Meanwhile, lower bounds for bilevel optimization methods have been studied in [Ji and Liang, 2021].

### 2.2 Bregman distance-based methods

Bregman distance based methods (a.k.a. mirror descent methods) [Censor and Zenios, 1992, Beck and Teboulle, 2003] are a powerful optimization tool because they use Bregman distances to fit the geometry of optimization problems. The Bregman distance was first proposed in Bregman [1967] and extended in Censor and Lent [1981]. Censor and Zenios [1992] first proposed a proximal minimization algorithm with Bregman functions. [Beck and Teboulle, 2003] studied mirror descent for convex optimization. Subsequently, Duchi et al. [2010] proposed an effective variant of mirror descent, i.e., composite objective mirror descent, to solve regularized convex optimization. More recently, [Lei and Jordan, 2020] integrated the variance-reduction technique into the mirror descent algorithm for stochastic convex optimization. Zhang and He [2018] studied the convergence properties of the mirror descent algorithm for solving nonsmooth nonconvex problems. A variance-reduced adaptive stochastic mirror descent algorithm [Li et al., 2020] has been proposed to solve nonsmooth nonconvex finite-sum optimization. More recently, Huang et al. [2021a] effectively applied the mirror descent method to regularized reinforcement learning.

## 3 Preliminaries

In this section, we first give some mild assumptions on the problems (1) and (3).

###### Assumption 1.

The function $f(x,y)$ is possibly nonconvex w.r.t. $x$, and the function $g(x,y)$ is $\mu$-strongly convex w.r.t. $y$. For the stochastic case, the same assumptions hold for $f(x,y;\xi)$ and $\mathbb{E}_{\zeta}[g(x,y;\zeta)]$, respectively.

###### Assumption 2.

The loss functions $f(x,y)$ and $g(x,y)$ satisfy

• $\|\nabla_y f(x,y)\| \le C_{fy}$ and $\|\nabla^2_{xy} g(x,y)\| \le C_{gxy}$ for any $x\in\mathcal{X}$ and $y\in\mathbb{R}^{d_2}$;

• The partial derivatives $\nabla_x f$, $\nabla_y f$, $\nabla_x g$ and $\nabla_y g$ are $L$-Lipschitz continuous, i.e., for $x_1, x_2 \in \mathcal{X}$ and $y_1, y_2 \in \mathbb{R}^{d_2}$,

$$\|\nabla_x f(x_1,y) - \nabla_x f(x_2,y)\| \le L\|x_1 - x_2\|, \quad \|\nabla_x f(x,y_1) - \nabla_x f(x,y_2)\| \le L\|y_1 - y_2\|,$$
$$\|\nabla_y f(x_1,y) - \nabla_y f(x_2,y)\| \le L\|x_1 - x_2\|, \quad \|\nabla_y f(x,y_1) - \nabla_y f(x,y_2)\| \le L\|y_1 - y_2\|,$$
$$\|\nabla_x g(x_1,y) - \nabla_x g(x_2,y)\| \le L\|x_1 - x_2\|, \quad \|\nabla_x g(x,y_1) - \nabla_x g(x,y_2)\| \le L\|y_1 - y_2\|,$$
$$\|\nabla_y g(x_1,y) - \nabla_y g(x_2,y)\| \le L\|x_1 - x_2\|, \quad \|\nabla_y g(x,y_1) - \nabla_y g(x,y_2)\| \le L\|y_1 - y_2\|.$$

For the stochastic case, the same assumptions hold for $f(x,y;\xi)$ and $g(x,y;\zeta)$ for any $\xi$ and $\zeta$.

###### Assumption 3.

The Jacobian matrix $\nabla^2_{xy} g(x,y)$ and Hessian matrix $\nabla^2_{yy} g(x,y)$ are $L_{gxy}$-Lipschitz and $L_{gyy}$-Lipschitz continuous, respectively, i.e., for all $x_1, x_2 \in \mathcal{X}$ and $y_1, y_2 \in \mathbb{R}^{d_2}$,

$$\|\nabla^2_{xy} g(x_1,y) - \nabla^2_{xy} g(x_2,y)\| \le L_{gxy}\|x_1 - x_2\|, \quad \|\nabla^2_{xy} g(x,y_1) - \nabla^2_{xy} g(x,y_2)\| \le L_{gxy}\|y_1 - y_2\|,$$
$$\|\nabla^2_{yy} g(x_1,y) - \nabla^2_{yy} g(x_2,y)\| \le L_{gyy}\|x_1 - x_2\|, \quad \|\nabla^2_{yy} g(x,y_1) - \nabla^2_{yy} g(x,y_2)\| \le L_{gyy}\|y_1 - y_2\|.$$

For the stochastic case, the same assumptions hold for $g(x,y;\zeta)$ for any $\zeta$.

###### Assumption 4.

The function $h(x)$ is convex but possibly nonsmooth.

###### Assumption 5.

The objective function $f(x, y^*(x)) + h(x)$ is bounded below on $\mathcal{X}$, i.e., its infimum over $\mathcal{X}$ is finite.

Assumptions 1-3 are commonly used in bilevel optimization methods [Ghadimi and Wang, 2018, Ji et al., 2021, Khanduri et al., 2021]. The norm bound $\|\nabla_y f(x,y)\| \le C_{fy}$ in Assumption 2 is similar to the assumption that the function $f(x,y)$ is Lipschitz in $y$ in [Ji et al., 2021]; indeed, from the proofs in [Ji et al., 2021], we can find that they still use the norm-bounded partial derivative $\nabla_y f(x,y)$. Similarly, the bound $\|\nabla^2_{xy} g(x,y)\| \le C_{gxy}$ corresponds to the norm-bounded partial derivative used in the proofs of [Ji et al., 2021]. Throughout the paper, we let $\kappa = L/\mu$ denote the condition number.

Assumption 4 is standard for regularizers such as the sparse penalty $\|x\|_1$. Assumption 5 ensures the feasibility of the problems (1) and (3).

When using first-order methods to solve the above bilevel optimization problems (1) and (3), we can easily obtain the partial (stochastic) derivative $\nabla_y g(x,y)$ to update the variable $y$. However, it is hard to get the (stochastic) gradient of $F(x) := f(x, y^*(x))$ w.r.t. $x$, since the inner problems in (1) and (3) generally do not admit a closed-form solution. Thus, a key point in solving the problems (1) and (3) is to estimate the gradient $\nabla F(x)$. The following lemma gives one form of $\nabla F(x)$.

###### Lemma 1.

(Lemma 2.1 in [Ghadimi and Wang, 2018]) Under the above Assumptions 1-3, we have, for any $x\in\mathcal{X}$,

$$\nabla F(x) = \nabla_x f(x, y^*(x)) + \nabla y^*(x)^T \nabla_y f(x, y^*(x)) = \nabla_x f(x, y^*(x)) - \nabla^2_{xy} g(x, y^*(x)) \big[\nabla^2_{yy} g(x, y^*(x))\big]^{-1} \nabla_y f(x, y^*(x)). \qquad (9)$$

Lemma 1 suggests a natural estimator of $\nabla F(x)$, defined, for all $x\in\mathcal{X}$ and $y\in\mathbb{R}^{d_2}$, as

$$\bar\nabla f(x,y) = \nabla_x f(x,y) - \nabla^2_{xy} g(x,y) \big[\nabla^2_{yy} g(x,y)\big]^{-1} \nabla_y f(x,y). \qquad (10)$$
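As a sanity check on the estimator (10), consider a toy quadratic problem (our illustration, not from the paper) where the inner minimizer $y^*(x)$ has a closed form, so $\nabla F(x)$ can be computed directly and compared term-by-term with (10):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
A = A @ A.T + d * np.eye(d)      # symmetric positive definite -> inner problem strongly convex
c = rng.standard_normal(d)
x = rng.standard_normal(d)

# Inner problem g(x, y) = 0.5 * y^T A y - x^T y has closed-form minimizer y*(x) = A^{-1} x.
y_star = np.linalg.solve(A, x)

# Outer objective f(x, y) = c^T y + 0.5 * ||x||^2, so F(x) = c^T A^{-1} x + 0.5 * ||x||^2,
# and (using that A is symmetric) grad F(x) = A^{-1} c + x.
grad_F_exact = np.linalg.solve(A, c) + x

# Estimator (10) at y = y*(x): grad_x f - grad^2_{xy} g [grad^2_{yy} g]^{-1} grad_y f,
# with grad_x f = x, grad^2_{xy} g = -I, grad^2_{yy} g = A, grad_y f = c.
grad_F_est = x - (-np.eye(d)) @ np.linalg.solve(A, c)
```

At $y = y^*(x)$ the estimator is exact; Lemma 2 below quantifies its error when $y$ only approximates $y^*(x)$.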

Next we give some properties of $\nabla F(x)$, $y^*(x)$ and $\bar\nabla f(x,y)$ in the following lemma:

###### Lemma 2.

(Lemma 2.2 in [Ghadimi and Wang, 2018]) Under the above Assumptions 1-3, for all $x, x_1, x_2 \in \mathcal{X}$ and $y\in\mathbb{R}^{d_2}$, we have

$$\|\bar\nabla f(x,y) - \nabla F(x)\| \le L_y \|y^*(x) - y\|, \quad \|y^*(x_1) - y^*(x_2)\| \le \kappa \|x_1 - x_2\|, \quad \|\nabla F(x_1) - \nabla F(x_2)\| \le L_F \|x_1 - x_2\|,$$

where $\kappa = L/\mu$, and $L_y$ and $L_F$ are constants depending on $L$, $\mu$, $C_{fy}$, $C_{gxy}$, $L_{gxy}$ and $L_{gyy}$ (given explicitly in Ghadimi and Wang [2018]).

## 4 Bilevel Optimization via Bregman Distances Methods

In this section, we propose a class of enhanced bilevel optimization methods based on Bregman distances to solve the deterministic problem (1) and the stochastic problem (3), respectively.

### 4.1 Deterministic BiO-BreD Algorithm

In this subsection, we propose a deterministic bilevel optimization method via Bregman distances (BiO-BreD) to solve the deterministic bilevel optimization problem (1). Algorithm 1 describes the algorithmic framework of the BiO-BreD method.

Given a $\rho$-strongly convex and continuously differentiable function $\psi(x)$, i.e., $\langle x_1 - x_2, \nabla\psi(x_1) - \nabla\psi(x_2)\rangle \ge \rho\|x_1 - x_2\|^2$, we define a Bregman distance [Censor and Lent, 1981, Censor and Zenios, 1992] for any $x_1, x_2 \in \mathcal{X}$:

$$D_\psi(x_1, x_2) = \psi(x_1) - \psi(x_2) - \langle \nabla\psi(x_2), x_1 - x_2 \rangle. \qquad (11)$$
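To make the definition (11) concrete, here is a minimal sketch (ours, not from the paper) evaluating $D_\psi$ for two standard mirror functions: $\psi(x) = \frac{1}{2}\|x\|^2$, which recovers the squared Euclidean distance, and the negative entropy on the simplex, which recovers the KL divergence:

```python
import numpy as np

def bregman(psi, grad_psi, x1, x2):
    """Bregman distance D_psi(x1, x2) = psi(x1) - psi(x2) - <grad psi(x2), x1 - x2>."""
    return psi(x1) - psi(x2) - np.dot(grad_psi(x2), x1 - x2)

# Choice 1: psi(x) = 0.5 * ||x||^2  =>  D_psi(x1, x2) = 0.5 * ||x1 - x2||^2
sq = lambda x: 0.5 * np.dot(x, x)
sq_grad = lambda x: x

# Choice 2: negative entropy on the probability simplex  =>  D_psi = KL divergence
negent = lambda x: np.sum(x * np.log(x))
negent_grad = lambda x: np.log(x) + 1.0

x1 = np.array([0.2, 0.3, 0.5])   # points on the simplex (hypothetical values)
x2 = np.array([0.4, 0.4, 0.2])

d_euclid = bregman(sq, sq_grad, x1, x2)      # equals 0.5 * ||x1 - x2||^2
d_kl = bregman(negent, negent_grad, x1, x2)  # equals sum_i x1_i * log(x1_i / x2_i)
```

Choosing $\psi$ to match the geometry of $\mathcal{X}$ (e.g. negative entropy for the simplex) is exactly what makes mirror descent attractive.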

In Algorithm 1, we use the mirror descent iteration to update the variable $x$ at the $t$-th step:

$$x_{t+1} = \mathop{\arg\min}_{x\in\mathcal{X}} \Big\{ \langle w_t, x\rangle + h(x) + \frac{1}{\gamma} D_{\psi_t}(x, x_t) \Big\}, \qquad (12)$$

where $\gamma > 0$ is a stepsize, and $w_t$ is an estimator of $\nabla F(x_t)$. Here the mirror function $\psi_t$ can change dynamically as the algorithm runs. Letting $\psi_t(x) = \frac{1}{2}\|x\|^2$, we have $D_{\psi_t}(x, x_t) = \frac{1}{2}\|x - x_t\|^2$. When $h(x) \neq 0$, the subproblem (12) is then equivalent to proximal gradient descent; when $h(x) = 0$ and $\mathcal{X} \subset \mathbb{R}^{d_1}$, it is equivalent to projected gradient descent. Letting $\psi_t(x) = \frac{1}{2} x^T H_t x$, we have $D_{\psi_t}(x, x_t) = \frac{1}{2}(x - x_t)^T H_t (x - x_t)$. When $H_t$ is an approximated Hessian matrix, the subproblem (12) is equivalent to proximal quasi-Newton descent; when $H_t$ is an adaptive matrix as used in [Huang et al., 2021b, Huang and Huang, 2021a], it is equivalent to proximal adaptive gradient descent.
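For instance, with $\psi_t(x) = \frac{1}{2}\|x\|^2$, $\mathcal{X} = \mathbb{R}^{d_1}$ and $h(x) = \lambda\|x\|_1$, the update (12) admits a closed-form solution via soft-thresholding. A minimal sketch (ours, with hypothetical values):

```python
import numpy as np

def mirror_step_l1(x_t, w_t, gamma, lam):
    """One step of (12) with psi_t(x) = 0.5 * ||x||^2 and h(x) = lam * ||x||_1:
    argmin_x <w_t, x> + lam * ||x||_1 + (1 / (2 * gamma)) * ||x - x_t||^2,
    solved in closed form by soft-thresholding a plain gradient step."""
    z = x_t - gamma * w_t                                          # gradient step
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)   # prox of lam * ||.||_1

x_t = np.array([0.5, -0.2, 1.0])
w_t = np.array([1.0, -1.0, 0.1])   # hypothetical hypergradient estimate
x_next = mirror_step_l1(x_t, w_t, gamma=0.1, lam=0.5)
```

Note how the $\ell_1$ regularizer zeros out small coordinates, which is the sparsity effect exploited in the architecture-search application above.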

In Algorithm 1, we use the gradient estimator $\frac{\partial f(x_t, y_t^K)}{\partial x}$ to estimate $\nabla F(x_t)$, where this partial derivative is obtained by backpropagation w.r.t. $x_t$. The following lemma gives an analytical form of $\frac{\partial f(x_t, y_t^K)}{\partial x}$:

###### Lemma 3.

(Proposition 2 in [Ji et al., 2021]) The gradient $\frac{\partial f(x_t, y_t^K)}{\partial x}$ has the following analytical form:

$$\frac{\partial f(x_t, y_t^K)}{\partial x} = \nabla_x f(x_t, y_t^K) - \lambda \sum_{k=0}^{K-1} \nabla^2_{xy} g(x_t, y_t^k) \prod_{j=k+1}^{K-1} \big(I_{d_2} - \lambda \nabla^2_{yy} g(x_t, y_t^j)\big) \nabla_y f(x_t, y_t^K). \qquad (13)$$
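The product-sum term in (13) can be read as an approximation of the inverse Hessian in (10). A sketch under simplifying assumptions (ours, not from the paper): when $g$ is quadratic so that $\nabla^2_{yy} g \equiv A$ is constant, the term $\lambda \sum_{k=0}^{K-1}(I - \lambda A)^k v$ is a truncated Neumann series converging to $A^{-1} v$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
A = rng.standard_normal((d, d))
A = A @ A.T + d * np.eye(d)          # constant Hessian grad^2_{yy} g of a quadratic g
v = rng.standard_normal(d)           # stands for grad_y f(x_t, y_t^K)
lam = 1.0 / np.linalg.norm(A, 2)     # stepsize with lam * ||A|| <= 1

def neumann_inverse_times(A, v, lam, K):
    """lam * sum_{k=0}^{K-1} (I - lam * A)^k v: the product term of (13) for
    quadratic g, i.e. a truncated Neumann series for A^{-1} v."""
    out = np.zeros_like(v)
    term = v.copy()
    for _ in range(K):
        out += lam * term
        term = (np.eye(len(v)) - lam * A) @ term
    return out

exact = np.linalg.solve(A, v)
approx = neumann_inverse_times(A, v, lam, K=500)
# The truncation error decays geometrically in K, matching the (1 - lam*mu)^K
# rates in Lemma 4 below.
```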
###### Lemma 4.

(Lemma 6 in [Ji et al., 2021]) Under the above assumptions, for a stepsize $\lambda$ with $0 < \lambda\mu < 1$, we have

$$\Big\|\frac{\partial f(x_t, y_t^K)}{\partial x} - \nabla F(x_t)\Big\| \le \Big(L_1(1-\lambda\mu)^{\frac{K}{2}} + L_2(1-\lambda\mu)^{\frac{K-1}{2}}\Big)\|y_t^0 - y^*(x_t)\| + L_3(1-\lambda\mu)^K, \qquad (14)$$

where $L_1$, $L_2$ and $L_3$ are constants depending on the problem parameters (specified in [Ji et al., 2021]).

The above Lemma 4 shows that the estimation error of the gradient estimator $\frac{\partial f(x_t, y_t^K)}{\partial x}$ decays exponentially fast with the iteration number $K$.

### 4.2 SBiO-BreD Algorithm

In this subsection, we propose an effective stochastic bilevel optimization method via Bregman distances (SBiO-BreD) to solve the stochastic bilevel optimization problem (3). Algorithm 2 details the algorithmic framework of the SBiO-BreD method.

Given $K > 0$ and drawing independent samples $\bar\xi$, as in [Hong et al., 2020, Khanduri et al., 2021], we define a stochastic gradient estimator:

$$\bar\nabla f(x, y; \bar\xi) = \nabla_x f(x,y;\xi) - \nabla^2_{xy} g(x,y;\zeta_0) \Big[\frac{K}{L} \prod_{i=1}^{k(K)} \Big(I_{d_2} - \frac{1}{L}\nabla^2_{yy} g(x,y;\zeta_i)\Big)\Big] \nabla_y f(x,y;\xi), \qquad (15)$$

where $k(K)$ is a uniform random variable over $\{0, 1, \ldots, K-1\}$, drawn independently of the samples. It is easily verified that $\bar\nabla f(x,y;\bar\xi)$ is a biased estimator of $\bar\nabla f(x,y)$, i.e., $\mathbb{E}_{\bar\xi}[\bar\nabla f(x,y;\bar\xi)] \neq \bar\nabla f(x,y)$. The following lemma bounds the bias $R(x,y)$ in the gradient estimator (15).
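The randomized truncation in (15) can be seen as a single-sample estimator of a truncated Neumann series for the inverse Hessian. A sketch under simplifying assumptions (ours, not from the paper), with a fixed Hessian so the expectation over $k(K)$ can be evaluated exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
A = rng.standard_normal((d, d))
A = A @ A.T + d * np.eye(d)    # stands for a fixed Hessian grad^2_{yy} g
L = np.linalg.norm(A, 2)       # Lipschitz constant bounding ||A||
K = 400
v = rng.standard_normal(d)     # stands for grad_y f(x, y; xi)
M = np.eye(d) - A / L

def one_draw(k):
    """(K/L) * (I - A/L)^k v: the bracketed term of (15) for one sampled
    truncation level k."""
    out = v.copy()
    for _ in range(k):
        out = M @ out
    return (K / L) * out

# A single draw with k ~ Uniform{0, ..., K-1}:
sample = one_draw(int(rng.integers(K)))

# Its expectation over k is (1/L) * sum_{k<K} (I - A/L)^k v, a truncated
# Neumann series for A^{-1} v, so the bias decays geometrically in K (cf. Lemma 5).
expectation = sum(one_draw(k) for k in range(K)) / K
exact = np.linalg.solve(A, v)
```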

###### Lemma 5.

(Lemma 11 in [Hong et al., 2020]) Under the above Assumptions 1-3, for any $x\in\mathcal{X}$ and $y\in\mathbb{R}^{d_2}$, the gradient estimator in (15) satisfies

$$\|R(x,y)\| \le \frac{L C_{fy}}{\mu}\Big(1 - \frac{\mu}{L}\Big)^K, \qquad (16)$$

where $R(x,y) = \bar\nabla f(x,y) - \mathbb{E}_{\bar\xi}[\bar\nabla f(x,y;\bar\xi)]$ denotes the bias of the estimator (15).

Lemma 5 shows that the bias decays exponentially fast with the number $K$. In particular, since $(1-\frac{\mu}{L})^K \le e^{-K\mu/L}$, choosing $K = \frac{L}{\mu}\log\frac{LC_{fy}}{\mu\epsilon} = O(\kappa\log\frac{1}{\epsilon})$ yields $\|R(x,y)\| \le \epsilon$, where $\kappa = L/\mu$.

For notational simplicity, let $\bar\xi_t^i$ collect the samples $\{\xi_{t,i}, \zeta_{t,i}^0, \zeta_{t,i}^1, \ldots\}$ used in the $i$-th draw of the estimator (15) at step $t$. In Algorithm 2, we use the mini-batch stochastic gradient estimator $\bar\nabla f(x_t, y_t; \bar\xi_t^i)$, defined as

$$\bar\nabla f(x_t, y_t; \bar\xi_t^i) = \nabla_x f(x_t, y_t; \xi_{t,i}) - \nabla^2_{xy} g(x_t, y_t; \zeta_{t,i}^0) \Big[\frac{K}{L}\prod_{j=1}^{k(K)}\Big(I_{d_2} - \frac{1}{L}\nabla^2_{yy} g(x_t, y_t; \zeta_{t,i}^j)\Big)\Big] \nabla_y f(x_t, y_t; \xi_{t,i}),$$