A Unified Approach to Error Bounds for Structured Convex Optimization Problems

12/11/2015
by Zirui Zhou, et al.
The Chinese University of Hong Kong

Error bounds, which refer to inequalities that bound the distance of vectors in a test set to a given set by a residual function, have proven to be extremely useful in analyzing the convergence rates of a host of iterative methods for solving optimization problems. In this paper, we present a new framework for establishing error bounds for a class of structured convex optimization problems, in which the objective function is the sum of a smooth convex function and a general closed proper convex function. Such a class encapsulates not only fairly general constrained minimization problems but also various regularized loss minimization formulations in machine learning, signal processing, and statistics. Using our framework, we show that a number of existing error bound results can be recovered in a unified and transparent manner. To further demonstrate the power of our framework, we apply it to a class of nuclear-norm regularized loss minimization problems and establish a new error bound for this class under a strict complementarity-type regularity condition. We then complement this result by constructing an example to show that the said error bound could fail to hold without the regularity condition. Consequently, we obtain a rather complete answer to a question raised by Tseng. We believe that our approach will find further applications in the study of error bounds for structured convex optimization problems.


1 Introduction

It has long been recognized that many convex optimization problems can be put into the form

(1)

where is a finite-dimensional Euclidean space, is a proper convex function that is continuously differentiable on , and is a closed proper convex function. On one hand, the constrained minimization problem

where is a closed convex set, is an instance of Problem (1) with being the indicator function of ; i.e.,

On the other hand, various data fitting problems in machine learning, signal processing, and statistics can be formulated as Problem (1), where

is a loss function measuring the deviation of a solution from the observations and

is a regularizer intended to induce certain structure in the solution. With the advent of the big data era, instances of Problem (1) that arise in contemporary applications often involve a large number of variables. This has sparked a renewed interest in first-order methods for solving Problem (1) in recent years; see, e.g., [38, 32, 41] and the references therein. From a theoretical point of view, a fundamental issue concerning these methods is to determine their convergence rates. It is well known that various first-order methods for solving Problem (1) will converge at the sublinear rate of , where is the number of iterations; see, e.g., [37, 5, 38, 23]. Moreover, the convergence rate is optimal when the functions and are given by first-order oracles [22]. However, in many applications, both and are given explicitly and have very specific structure. It has been observed numerically that first-order methods for solving structured instances of Problem (1) converge at a much faster rate than that suggested by the theory; see, e.g., [11, 42]. Thus, it is natural to ask whether the structure of the problem can be exploited in the convergence analysis to yield sharper convergence rate results.

As it turns out, a very powerful approach to addressing the above question is to study a so-called error bound property associated with Problem (1). Formally, let be the set of optimal solutions to Problem (1), assumed to be non-empty. Furthermore, let be a set satisfying and be a function satisfying if and only if . We say that Problem (1) possesses a Lipschitzian error bound (or simply error bound) for with test set and residual function if there exists a constant such that

(2)

where denotes the Euclidean distance from the vector to the set ; cf. [26]. Conceptually, the error bound (2) provides a handle on the structure of the objective function of Problem (1) in the neighborhood of the optimal solution set via the residual function . For the purpose of analyzing the convergence rates of first-order methods, one particularly useful choice of the residual function is , where is the residual map defined by

(3)

and is the proximal map associated with ; i.e.,

(4)

Indeed, by comparing the optimality conditions of (1) and (4), it is immediate that if and only if . Moreover, it is known that many first-order methods for solving Problem (1) have update rules that aim at reducing the value of the residual function; see, e.g., [18, 6, 39]. This leads to the following instantiation of (2):

Error Bound with Proximal Map-Based Residual Function. For any , there exist constants and such that

(EBP)
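To make these objects concrete, the following minimal Python sketch (an illustration under assumed notation, since the symbols are stripped from this extraction) instantiates the residual map of (3)-(4) for a toy instance with a strongly convex quadratic smooth part and an ℓ1 regularizer, and runs a proximal gradient method on it. The residual norm is the quantity that (EBP) controls, and for this instance it decays geometrically, which is the linear-rate behavior discussed below. The data, step size, and names (grad_f, prox_P, lam, eta) are assumptions for illustration only.

```python
import numpy as np

# Assumed toy instance of Problem (1): smooth part f(x) = 0.5 x'Qx - c'x (strongly convex),
# regularizer P(x) = lam * ||x||_1.
rng = np.random.default_rng(0)
n = 20
B = rng.standard_normal((n, n))
Q = B.T @ B + np.eye(n)                      # positive definite
c = rng.standard_normal(n)
lam = 0.5

grad_f = lambda x: Q @ x - c
prox_P = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)   # prox of t*P

def residual(x, t=1.0):
    """Prox-based residual R(x) = (x - prox_{tP}(x - t*grad f(x))) / t; zero iff x is optimal."""
    return (x - prox_P(x - t * grad_f(x), t)) / t

# Proximal gradient method: each step shrinks the residual; under (EBP) the decay is geometric.
eta = 1.0 / np.linalg.norm(Q, 2)             # step size 1/L, L = Lipschitz constant of grad f
x = np.zeros(n)
for k in range(201):
    if k % 50 == 0:
        print(f"iter {k:3d}   ||residual|| = {np.linalg.norm(residual(x, eta)):.2e}")
    x = prox_P(x - eta * grad_f(x), eta)
```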

The usefulness of the error bound (EBP) comes from the fact that whenever it holds, a host of first-order methods for solving Problem (1), such as the proximal gradient method, the extragradient method, and the coordinate (gradient) descent method, can be shown to converge linearly; see [18, 38] and the references therein. Thus, an important research issue is to identify conditions on the functions and under which the error bound (EBP) holds. Nevertheless, despite the efforts of many researchers over a long period of time, the repertoire of instances of Problem (1) that are known to possess the error bound (EBP) is still rather limited. Below are some representative scenarios in which (EBP) has been shown to hold:

  • ([25, Theorem 3.1]) , is strongly convex, is Lipschitz continuous, and is arbitrary (but closed, proper, and convex).

  • ([17, Theorem 2.1]) , takes the form , where , are given and is proper and convex with the properties that (i) is continuously differentiable on , assumed to be non-empty and open, and (ii) is strongly convex and is Lipschitz continuous on any compact subset of , and has a polyhedral epigraph.

  • ([38, Theorem 2]; cf. [44, Theorem 1]) , takes the form , where and are as in scenario (S2), and is the grouped LASSO regularizer; i.e., , where is a partition of the index set , is the vector obtained by restricting to the entries in , and is a given parameter. (Its proximal map is block-wise soft-thresholding; see the sketch after this list.)
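The grouped LASSO regularizer in scenario (S3) is non-polyhedral but admits a simple proximal map, namely block-wise soft-thresholding. The sketch below uses this standard formula with an assumed uniform weight lam; the exact weighting in the scenario above cannot be read off from this extraction.

```python
import numpy as np

def prox_group_lasso(v, groups, lam, t=1.0):
    """Proximal map of t * lam * sum_J ||v_J||_2 over a partition `groups` of the indices
    (block-wise soft-thresholding)."""
    out = np.zeros_like(v)
    for J in groups:
        block = v[J]
        norm = np.linalg.norm(block)
        if norm > t * lam:
            out[J] = (1.0 - t * lam / norm) * block
    return out

v = np.array([3.0, 4.0, 0.1, -0.1, 2.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
print(prox_group_lasso(v, groups, lam=1.0))
# The first block is shrunk toward zero; the second block (small norm) is zeroed out entirely.
```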

In many applications, such as regression problems, the function of interest is not strongly convex but has the structure described in scenarios (S2) and (S3). However, a number of widely used structure-inducing regularizers—most notably the nuclear norm regularizer—are not covered by these scenarios. One of the major difficulties in establishing the error bound (EBP) for regularizers other than those described in scenarios (S2) and (S3) is that they typically have non-polyhedral epigraphs. Moreover, existing approaches to establishing the error bound (EBP) are quite ad hoc in nature and cannot be easily generalized. Thus, in order to identify more scenarios in which the error bound (EBP) holds, some new ideas would seem to be necessary.

In this paper, we present a new analysis framework for studying the error bound property (EBP) associated with Problem (1). The framework applies to the setting where has the form described in scenario (S2) and is any closed proper convex function. In particular, it applies to all the scenarios (S1)–(S3). Our first contribution is to elucidate the relationship between the error bound property (EBP) and various notions in set-valued analysis. This allows us to utilize powerful tools from set-valued analysis to elicit the key properties of Problem (1) that can guarantee the validity of (EBP). Specifically, we show that the problem of establishing the error bound (EBP) can be reduced to that of checking the calmness of a certain set-valued mapping induced by the optimal solution set of Problem (1); see Corollary 1. Furthermore, using the fact that can be expressed as the intersection of a polyhedron and the inverse of the subdifferential of at a certain point (see Proposition 1), we show that the calmness of is in turn implied by (i) the bounded linear regularity of the two intersecting sets and (ii) the calmness of at ; see Theorem 2. These results provide a concrete starting point for verifying the error bound property (EBP) and make it possible to simplify the analysis substantially. We remark that when has a polyhedral epigraph, the early works [18, 19] of Luo and Tseng have already pointed out a connection between (EBP) and the calmness of a certain polyhedral multi-function. However, such an idea has not been further explored in the literature to tackle more general forms of .

To demonstrate the power of our proposed framework, we apply it to scenarios (S1)–(S3) and show that the error bound results in [25, 17, 38] can be recovered in a unified manner; see Sections 4.1–4.3. It is worth noting that scenario (S3) involves the non-polyhedral grouped LASSO regularizer, and the existing proof of the validity of the error bound (EBP) in this scenario employs a highly intricate argument [38]. By contrast, our approach leads to a much simpler and more transparent proof. Motivated by the above success, we proceed to apply our framework to the following scenario, which again involves a non-polyhedral regularizer and arises in the context of low-rank matrix optimization:

  • , takes the form , where is a given linear operator, is a given matrix, is as in scenario (S2), and is the nuclear norm regularizer, i.e., a positive multiple of the sum of the singular values of the matrix variable. (A singular value thresholding sketch of its proximal map is given below.)
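The nuclear norm regularizer in scenario (S4) also admits a closed-form proximal map: soft-thresholding of the singular values (singular value thresholding). The sketch below uses this standard formula with an assumed weight lam; it is an illustration, not a reproduction of any construction from the paper.

```python
import numpy as np

def prox_nuclear_norm(V, lam, t=1.0):
    """Proximal map of t * lam * ||.||_* : soft-threshold the singular values of V."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    s_shrunk = np.maximum(s - t * lam, 0.0)
    return (U * s_shrunk) @ Vt

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))   # rank-3 matrix
X = prox_nuclear_norm(M, lam=1.0)
print(np.linalg.matrix_rank(M), np.linalg.matrix_rank(X, tol=1e-8))
# Singular values below lam are zeroed out, which can reduce the rank of the output.
```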

The validity of the error bound (EBP) in this scenario was left as an open question in [38] and to date is still unresolved. (Footnote 1: It was claimed in [13] that the error bound (EBP) holds in scenario (S4). However, there is a critical flaw in the proof. Specifically, contrary to what was claimed in [13, Supplementary Material, Section C], the matrices and that satisfy displayed equations (37) and (38) need not satisfy displayed equation (35). The erroneous claim was due to an incorrect application of [35, Lemma 4.3]. We thank Professor Defeng Sun and Ms. Ying Cui for bringing this issue to our attention.) As our second contribution in this work, we show that under a strict complementarity-type regularity condition on the optimal solution set of Problem (1), the error bound (EBP) holds in scenario (S4); see Proposition 12. This is achieved by verifying conditions (i) and (ii) mentioned in the preceding paragraph. Specifically, we first show that condition (i) is satisfied under the said regularity condition. Then, we prove that is calm everywhere, which implies that condition (ii) is always satisfied; see Proposition 11. We note that to the best of our knowledge, this last result is new and could be of independent interest. To further understand the role of the regularity condition, we demonstrate via a concrete example that without such a condition, the error bound (EBP) could fail to hold; see Section 4.4.4. Consequently, we obtain a rather complete answer to the question raised by Tseng [38].

The following notations will be used throughout the paper. Let denote finite-dimensional Euclidean spaces. The closed ball around with radius in is given by . For simplicity, we denote the closed unit ball in by . We use and to denote the sets of real symmetric matrices and orthogonal matrices, respectively. Given a matrix , we use or (resp.  or ) to indicate that is positive semidefinite (resp. positive definite). Also, we use and to denote the Frobenius norm and spectral norm of the matrix , respectively.

2 Preliminaries

2.1 Basic Setup

Consider the optimization problem (1). Recall that its optimal value and optimal solution set are denoted by and , respectively. We shall make the following assumptions in our study:

Assumption 1

(Structural Properties of the Objective Function)

  1. The function takes the form

    (5)

    where is a linear operator, is a given vector, and is a convex function with the following properties:

    1. The effective domain of is non-empty and open, and is continuously differentiable on .

    2. For any compact convex set , the function is strongly convex and its gradient is Lipschitz continuous on .

  2. The function is convex, closed, and proper.

Assumption 2

(Properties of the Optimal Solution Set) The optimal solution set is non-empty and compact. In particular, .

The above assumptions yield several useful consequences. First, Assumption 1(a-i) implies that is also non-empty and open, and is continuously differentiable on . Second, under Assumption 1(a-ii), if the Lipschitz constant of on the compact convex set is , then the Lipschitz constant of on is at most , where is the spectral norm of . Third, Assumption 1 implies that is a closed proper convex function. Together with Assumption 2 and [30, Corollary 8.7.1], we conclude that for any , the level set is a compact subset of .
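The second consequence above is the standard composition bound: if the gradient of the outer loss is Lipschitz with constant L on a set, then the gradient of the composite map x -> loss(Ax + b) is Lipschitz there with constant at most the squared spectral norm of A times L. The symbols A, b, and L are assumed here, since the extraction drops them. A quick numerical check on a quadratic loss, where both constants can be computed exactly:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 5))

# loss(y) = 0.5 * ||y||^2 has a 1-Lipschitz gradient, and h(x) = loss(Ax + b) has
# gradient A^T (Ax + b), whose Lipschitz constant is ||A^T A||_2.
L_loss = 1.0
L_h = np.linalg.norm(A.T @ A, 2)
bound = np.linalg.norm(A, 2) ** 2 * L_loss
print(L_h, bound)   # equal here (up to rounding); in general L_h <= ||A||_2^2 * L_loss
```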

Assumptions 1 and 2 are automatically satisfied by a number of applications. As an illustration, consider the problem of regularized empirical risk minimization of linear predictors, which underlies much of the development in machine learning. With being the number of data points and , the problem takes the form

(6)

where is the -th component of the vector and represents the -th linear prediction, is the -th response, is a smooth convex loss function, and is a regularizer used to induce certain structure in the solution. It is clear that Problem (6) is an instance of Problem (1). Moreover, one can easily verify that when instantiated with the loss functions and regularizers in Table 1—which have been widely used in the machine learning literature—Problem (6) satisfies both Assumptions 1 and 2.

Table 1: Some commonly used loss functions and regularizers. (a) Loss functions: linear regression, logistic regression, Poisson regression. (b) Regularizers: LASSO, ridge, grouped LASSO, nuclear norm.
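The formulas in Table 1 do not survive this extraction; the snippet below records the standard definitions these names usually refer to, written as Python callables. These are assumed standard forms (for instance, labels in {-1, +1} for the logistic loss), not necessarily the exact parameterizations used in the original table.

```python
import numpy as np

# Standard loss functions ell(y; b) applied to the linear predictions y = Ax (assumed forms).
linear_regression   = lambda y, b: 0.5 * np.sum((y - b) ** 2)
logistic_regression = lambda y, b: np.sum(np.log1p(np.exp(-b * y)))   # labels b in {-1, +1}
poisson_regression  = lambda y, b: np.sum(np.exp(y) - b * y)          # counts b >= 0

# Standard regularizers P(x) with weight lam > 0.
lasso        = lambda x, lam: lam * np.sum(np.abs(x))
ridge        = lambda x, lam: 0.5 * lam * np.sum(x ** 2)
group_lasso  = lambda x, groups, lam: lam * sum(np.linalg.norm(x[J]) for J in groups)
nuclear_norm = lambda X, lam: lam * np.sum(np.linalg.svd(X, compute_uv=False))
```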

2.2 A Characterization of the Optimal Solution Set

Since Problem (1) is an unconstrained convex optimization problem, its first-order optimality condition is both necessary and sufficient for optimality. Hence, we have

(7)

The following proposition shows that under Assumptions 1 and 2, the optimal solution set admits an alternative, more explicit characterization. Such a characterization will be central to our analysis of the error bound property associated with Problem (1).

Proposition 1

Consider the optimization problem (1). Under Assumptions 1 and 2, there exists a such that

(8)

where . In particular, we have

(9)

Proof   The proof of (8) is rather standard; cf. [36, 17]. For completeness’ sake, we include the proof here. For arbitrary , let and . Note that the line segment between and is a compact convex subset of . By Assumption 1(a-ii), the function is strongly convex on this set. Thus, there exists a such that

Due to (5), the above is equivalent to

Moreover, the convexity of gives

Upon adding the above two inequalities and using , we have

This implies that , for otherwise the above inequality contradicts the fact that is the optimal value of Problem (1). Consequently, the map is invariant over ; i.e., there exists a such that for all . Now, using (5) and Assumption 1(a-i), we compute . Since for all , we have for all . This completes the proof of (8).

To establish (9), we first observe that by (7) and (8), every belongs to the set on the right-hand side of (9). Now, for any satisfying and , we can use the relationships and to get . This, together with (7), implies that , as desired.  
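A key step in the proof of Proposition 1 is that the linear image of the solution under the operator in (5), and hence the gradient of the smooth part, is invariant over the optimal solution set. The following toy LASSO instance with two identical columns (an assumed example, not taken from the paper) has non-unique minimizers, and one can check numerically that they share the same linear image and the same gradient:

```python
import numpy as np

a = np.array([1.0, 2.0])
A = np.column_stack([a, a])          # two identical columns -> non-unique minimizers
b = np.array([3.0, 6.0])
lam = 1.0

# Any nonnegative split of s* between the two coordinates minimizes
# 0.5*||Ax - b||^2 + lam*||x||_1, where s* solves the one-column problem.
s_star = (a @ b - lam) / (a @ a)     # soft-thresholded least-squares solution, s* > 0 here
x1 = np.array([s_star, 0.0])
x2 = np.array([s_star / 2, s_star / 2])

grad = lambda x: A.T @ (A @ x - b)   # gradient of the smooth part
print(A @ x1, A @ x2)                # identical: the linear image is invariant over the solution set
print(grad(x1), grad(x2))            # identical gradients, as the proof of Proposition 1 asserts
```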

2.3 Tools from Set-Valued Analysis

Proposition 1 reveals that the optimal solution set of Problem (1) is completely characterized by the vectors and . Thus, in order to estimate for some , a natural idea is to take and an arbitrary and establish a relationship between and . Intuitively, if is “nice” (e.g., satisfies a certain regularity condition), then one should be able to control the (local) growth of by that of a “nice” function of . Such an idea can be formalized using tools from set-valued analysis, which we now introduce.

Let and be finite-dimensional Euclidean spaces. We say that a mapping is a multi-function (or set-valued mapping) from to (denoted by ) if it assigns a subset of to each vector . The graph and domain of are defined by

respectively. The inverse mapping of , denoted by , is the multi-function from to defined by

Before we proceed further, let us briefly illustrate some of the concepts above.

Example

  1. Let be a given matrix. The mapping defined by is a multi-function from to . Here, is simply the solution set of the linear system . (A small numerical illustration of this multi-function is given after this example.)

  2. Let be a closed proper convex function. Its subdifferential is a multi-function from to . Moreover, by [30, Corollary 23.5.1], we have , where is the conjugate of .
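For the multi-function in item 1 of the example above (with an assumed full-row-rank matrix A, so that the linear system is solvable for every right-hand side), the distance from any point to the solution set can be computed by projection, and it is bounded by a constant multiple of the equation residual. This Hoffman-type inequality is precisely the kind of calmness behavior introduced next.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 6))      # full row rank, so Gamma(y) = {x : Ax = y} is non-empty for all y

def dist_to_solution_set(x, y):
    """Distance from x to the affine set {z : Az = y}, via projection (A has full row rank)."""
    correction = A.T @ np.linalg.solve(A @ A.T, A @ x - y)
    return np.linalg.norm(correction)

kappa = 1.0 / np.min(np.linalg.svd(A, compute_uv=False))   # Hoffman-type constant for this A

x = rng.standard_normal(6)
y = rng.standard_normal(3)
print(dist_to_solution_set(x, y), kappa * np.linalg.norm(A @ x - y))
# The left value never exceeds the right: dist(x, Gamma(y)) <= kappa * ||Ax - y||.
```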

Next, we introduce two regularity notions regarding set-valued mappings.

Definition 1

(see, e.g., [8, Chapter 3H])

  1. A multi-function is said to be calm at for if and there exist constants such that

    (10)
  2. A multi-function is said to be metrically sub-regular at for if and there exist constants such that

    (11)

The notions of calmness and metric sub-regularity have played a central role in the study of error bounds; see, e.g., [26, 9, 31, 27, 15] and the references therein. To see what these notions would yield in the context of Problem (1), consider the multi-function given by

(12)

Suppose that is calm at for , where and are given in Proposition 1. Note that and . Hence, by (10), there exist constants such that

(13)

Since is equivalent to and , it follows from (13) that

(14)

which is an error bound for with test set and residual function . Incidentally, the inequality (14) also shows that the multi-function is metrically sub-regular at for .

The error bound (14) shows that under a calmness assumption on the multi-function given in (12), the local growth of is on the order of , where and is arbitrary. This realizes the idea mentioned at the beginning of this sub-section. However, we are ultimately interested in establishing the error bound (EBP), which is concerned with the test set (where is arbitrary and depends on ) and residual function . At first sight, it is not clear whether the error bounds (EBP) and (14) are compatible. Indeed, the former involves only easily computable quantities (i.e., and ), while the latter involves quantities that are generally not known a priori (i.e., , , and ). Nevertheless, as we shall demonstrate in Section 3, the latter can be used to establish the former under some mild conditions.

Before we leave this section, let us record two useful results regarding the notions of calmness and metric sub-regularity. The first is a well-known equivalence between the calmness of a multi-function and the metric sub-regularity of its inverse. One direction of the equivalence has already manifested itself in our discussion above.

Fact 1

(see, e.g., [8, Theorem 3H.3]) For a multi-function , let . Then, is calm at for if and only if its inverse is metrically sub-regular at for .
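A toy single-valued illustration of Definition 1 and Fact 1 (an assumed example, not from the paper): the map x -> x^3 is not metrically sub-regular at the origin, whereas x -> x is; by Fact 1, the corresponding inverse maps are respectively non-calm and calm there.

```python
import numpy as np

# Toy single-valued maps.
F_cubic  = lambda x: x ** 3     # not metrically sub-regular at x = 0 for y = 0
F_linear = lambda x: x          # metrically sub-regular at x = 0 for y = 0 (kappa = 1)

# Metric sub-regularity at (0, 0) asks for  dist(x, F^{-1}(0)) <= kappa * dist(0, F(x))  near x = 0.
# Here F^{-1}(0) = {0}, so the left side is |x| and the ratio below must stay bounded.
for x in [1e-1, 1e-2, 1e-3]:
    print(f"x = {x:.0e}   cubic ratio = {abs(x)/abs(F_cubic(x)):.1e}   "
          f"linear ratio = {abs(x)/abs(F_linear(x)):.1e}")
# The cubic ratio blows up as x -> 0 (no finite kappa works); the linear ratio stays at 1.
# By Fact 1, the inverse map y -> y^(1/3) is therefore not calm at (0, 0), while y -> y is.
```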

The second result concerns a multi-function that is calm at for a set of points . It shows that if is compact, then the neighborhoods around each in the definition of calmness can be made uniform.

Proposition 2

For a multi-function , let and suppose that is compact. Then, the following statements are equivalent:

  1. is calm at for any .

  2. There exist constants such that

Proof   It is clear that (b) implies (a). Hence, suppose that (a) holds. By (10), given any , there exist constants such that

(15)

Let denote the open unit ball around the origin in . Then, the set forms an open cover of the compact set . Hence, by the Heine-Borel theorem, there exist points (where is finite) such that . We claim that there exists a constant such that . Indeed, suppose that this is not the case. Then, for , we can find vectors such that for ,

Since is compact and , by passing to a subsequence if necessary, we may assume that for some . Then, we have

which shows that . On the other hand, since , there exists an index such that . This implies that

for which contradicts the fact that . Thus, the claim is established.

Now, upon setting , we obtain

where the second inclusion is due to (15) and the fact that for . This completes the proof.  

3 Sufficient Conditions for the Validity of the Error Bound (EBP)

Following our discussion in Section 2.3, we now show that under Assumptions 1 and 2, the error bound (EBP) is implied by a certain calmness property of the multi-function given in (12). This is achieved by exploring the relationships between error bounds defined using different test sets and residual functions. For the sake of convenience, we shall refer to the multi-function given in (12) as the solution map associated with Problem (1) in the sequel.

3.1 Error Bound with Neighborhood-Based Test Set

To begin, recall that the error bound (EBP) involves the test set , where is arbitrary and depends on . The following proposition shows that under Assumptions 1 and 2, we can replace the test set by a neighborhood of . This would facilitate our analysis of the relationship between the error bound (EBP) and the calmness of the solution map , as the latter is also defined in terms of a neighborhood of .

Proposition 3

Consider the optimization problem (1). Under Assumptions 1 and 2, the error bound (EBP) holds if there exist constants such that

(EBN)

Proof   To establish the error bound (EBP), it suffices to show that for any , there exists an such that

Suppose that this does not hold. Then, there exist a scalar and a sequence in such that for and , but for . Since is compact by Assumption 2, by passing to a subsequence if necessary, we may assume that for some . Using the fact that is 1-Lipschitz continuous on (see, e.g., [6, Lemma 2.4]) and is continuous on (Assumption 1(a-i)), we see that is continuous on . This, together with the fact that , implies that ; i.e., . However, this contradicts the fact that for , and the proof is completed.
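The proof above uses the standard fact that the proximal map is 1-Lipschitz continuous (nonexpansive). A quick randomized sanity check for the ℓ1 proximal map (an assumed instance, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
prox = lambda v: np.sign(v) * np.maximum(np.abs(v) - 1.0, 0.0)   # proximal map of ||.||_1

# Nonexpansiveness: ||prox(u) - prox(v)|| <= ||u - v|| for all u, v.
worst = max(
    np.linalg.norm(prox(u) - prox(v)) / np.linalg.norm(u - v)
    for u, v in (rng.standard_normal((2, 10)) for _ in range(1000))
)
print(worst)   # never exceeds 1, consistent with the 1-Lipschitz continuity used in the proof
```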

Before we proceed, two remarks are in order. First, the reverse implication in Proposition 3 is also true if, in addition to Assumptions 1 and 2, the optimal solution set of Problem (1) is contained in the relative interior of . However, since we will mostly focus on sufficient conditions for the error bound (EBP) to hold, we will not indulge in proving this here. Second, for those instances of Problem (1) that do not satisfy Assumption 2, one or both of the error bounds (EBP) and (EBN) could fail to hold. The following example demonstrates such a possibility.

Example   Let . Define the function by

and take to be the indicator function of . Furthermore, let be given by . It can be verified that is convex and continuously differentiable on with

Moreover, we have

which shows that the level sets of are closed but not bounded. It follows that is a closed proper convex function with and

Next, we determine the residual map on . Recall that

Since is the indicator function of , it is easy to see that is the projection operator onto . Note that for each , the function

is decreasing in . Moreover, it can be verified that for all . It follows that

In particular, we have for all .

Now, observe that for any , if satisfies , then for any . However, we have for any and . It follows that there do not exist constants such that (EBP) holds. Similarly, for any , we have

Since for any , there does not exist a constant such that (EBN) holds. In fact, the same arguments show that the instance in question does not possess a Hölderian error bound; i.e., the error bounds (EBP) and (EBN) fail to hold even if one replaces the inequality by for any .  

3.2 Error Bound with Alternative Residual Function

As the reader would recall, a motivation for using as the residual function is that the optimal solution set can be characterized as . Since admits the alternative characterization (9), we can define another residual function by

and consider the error bound

(EBR)

where are constants and , are given in Proposition 1. Our interest in the error bound (EBR) stems from the following result, which reveals that it is closely related to a certain calmness property of the solution map :

Proposition 4

Suppose that Problem (1) satisfies Assumptions 1 and 2. Let and be as in Proposition 1. Then, the error bound (EBR) holds if and only if the solution map is calm at for any .

Proof   Suppose that the error bound (EBR) holds. Let be arbitrary and suppose that . In particular, we have and . Using the inequality , which is valid for all , we see from (EBR) that

Since , this implies that . Hence, is calm at for any .

Conversely, suppose that is calm at for any