
A single-phase, proximal path-following framework

We propose a new proximal, path-following framework for a class of constrained convex problems. We consider settings where the nonlinear (and possibly non-smooth) objective part is endowed with a proximity operator, and the constraint set is equipped with a self-concordant barrier. Our approach relies on the following two main ideas. First, we re-parameterize the optimality condition as an auxiliary problem, such that a good initial point is available; by doing so, a family of alternative paths towards the optimum is generated. Second, we combine the proximal operator with path-following ideas to design a single-phase, proximal, path-following algorithm. Our method has several advantages. First, it allows handling non-smooth objectives via proximal operators; this avoids lifting the problem dimension in order to accommodate non-smooth components in optimization. Second, it consists of only a single phase: while the overall convergence rate of classical path-following schemes for self-concordant objectives does not suffer from the initialization phase, proximal path-following schemes undergo slow convergence in order to obtain a good starting point [50]. In this work, we show how to overcome this limitation in the proximal setting and prove that our scheme has the same O(√ν log(1/ε)) worst-case iteration complexity as standard approaches [33, 37], without requiring an initial phase, where ν is the barrier parameter and ε is the desired accuracy. Finally, our framework allows errors in the calculation of proximal-Newton directions without sacrificing the worst-case iteration complexity. We demonstrate the merits of our algorithm via three numerical examples, where proximal operators play a key role.



1 Introduction.

This paper studies the following constrained convex optimization problem:

(1) F^⋆ := min { F(x) := f(x) + ⟨c, x⟩ : x ∈ X },

where c ∈ ℝⁿ, f is a possibly non-smooth, proper, closed and convex function from ℝⁿ to ℝ ∪ {+∞}, and X is a nonempty, closed and convex set in ℝⁿ. (We note that the linear term ⟨c, x⟩ can be absorbed into f; however, we separate it from f for convenience in processing the numerical examples in the last section.) We denote by X^⋆ the optimal solution set of (1), and by x^⋆ an optimal solution in X^⋆.

For convex sets X associated with a self-concordant barrier (see Section 2 for details), and for f self-concordant and smooth (e.g., linear or quadratic), interior-point methods (IPMs) often constitute the method of choice for solving (1), with a well-characterized worst-case complexity. A non-exhaustive list of instances of (1) includes linear programs, quadratic programs, second-order cone programs, semi-definite programs, and geometric optimization [2, 7, 8, 17, 31, 32, 33, 38, 42, 44, 53, 54].

At the heart of IPMs lies the notion of interior barriers: these mimic the effect of the constraint set in (1) by appropriately penalizing the objective function with a barrier φ over the set X, as follows:

(2) x^⋆(t) := arg min_x { F(x) + t φ(x) }, t > 0.

Here, φ models the structure of the feasible set X and t > 0 is a penalty parameter. For different values of t, the regularized problem (2) generates a sequence of solutions x^⋆(t), known as the central path, converging to x^⋆ of (1) as t goes to 0⁺. Path-following methods operate along the central path: for a properly decreasing sequence of t values, they solve (2) only approximately, by performing a few Newton iterations for each value of t; standard path-following schemes even perform just one Newton iteration per value, assuming a linear objective with no non-smooth term f. For such problem cases, this is sufficient to guarantee that the approximate solution lies sufficiently close to the central path and operates as a warm start for the next value of t in (2) [8, 30, 33, 37]. One requirement is that the initial point must lie within a predefined neighborhood of the central path. In their seminal work [37], Nesterov and Nemirovski showed that such methods admit a polynomial worst-case complexity, as long as the Newton method has polynomial complexity.
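To make this mechanism concrete, the following sketch (illustrative only, not the algorithm proposed in this paper) implements a generic short-step path-following loop for a linear objective over the positive orthant. The barrier is φ(x) = −Σᵢ log xᵢ with barrier parameter ν = n, and each decrease of the penalty parameter is followed by exactly one Newton step; the step-size rule 1 − θ/√ν is the classical short-step schedule.

```python
import numpy as np

def short_step_path_following(c, t0=1.0, eps=1e-6, theta=0.25):
    """Generic short-step path-following for min{ <c, x> : x > 0 } with
    barrier phi(x) = -sum(log x_i) (barrier parameter nu = n): shrink the
    penalty parameter, then take exactly one Newton step."""
    n = len(c)
    nu = n
    t = t0
    x = t / c                    # exactly on the central path at t0
    while t * nu > eps:          # duality-gap style stopping rule
        t *= 1.0 - theta / np.sqrt(nu)   # decrease penalty parameter
        # One Newton step on psi_t(x) = <c, x> - t * sum(log x_i);
        # the Hessian t * diag(1/x^2) is diagonal, so the step is closed form.
        grad = c - t / x
        x = x - (x**2 / t) * grad
    return x, t

c = np.array([1.0, 2.0, 0.5])
x, t = short_step_path_following(c)
print(float(c @ x))   # approaches the infimum 0 as t -> 0
```

Since the iterate starts on the central path and the penalty parameter shrinks by only a 1 − θ/√ν factor per round, a single Newton step suffices to stay in a small neighborhood of the path throughout.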

Based on the above, standard schemes [31, 33, 37] can be characterized by two phases: Phase I and Phase II. In Phase I, for an initial value of the penalty parameter, one has to solve (2) carefully in order to determine a good initial point for Phase II; this entails solving (2) to sufficient accuracy, so that the Newton method for (2) admits fast convergence. In Phase II, using the output of Phase I as a warm start, we path-follow with a provably polynomial time complexity.

Taking into account both phases, standard path-following algorithms—where (2) is a self-concordant objective—are characterized by the following iteration complexity. The total number of iterations required to obtain an ε-solution is

(3) O(√ν log(1/ε)).

Here, ν is the barrier parameter (see Section 2 for details) and ε is the accuracy parameter, according to the following definition: given a tolerance ε > 0, we say that x̄ is an ε-solution for (1) if F(x̄) − F^⋆ ≤ ε.

1.1 Path-following schemes for non-smooth objectives.

For many applications in machine learning, optimization and signal processing [8, 40, 50], the part f in (1) could be non-smooth (or even smooth but non-self-concordant). Such a term is usually included in the optimization in order to leverage the true underlying structure in x^⋆. An example is the ℓ₁-norm regularization, i.e., f(x) = ‖x‖₁, with applications in high-dimensional statistics, compressive sensing, and scientific and medical imaging [12, 18, 20, 23, 29, 41, 47, 55], among others. Other examples for f include the indicator function of a convex set [40], group norms [4, 22, 24], and the nuclear norm [11], used in low-rank matrix approximation.

Unfortunately, non-smoothness in the objective reduces the optimization efficiency. In such settings, one can often reformulate (1) into a standard conic program by introducing slack variables and additional constraints to model f. Such a technique is known as disciplined convex programming (DCP) [19] and has been incorporated in well-known software packages, such as CVX [19] and YALMIP [28]. Existing off-the-shelf solvers are then utilized to solve the resulting problem. However, DCP can increase the problem dimension significantly; this, in turn, reduces the efficiency of IPMs. For instance, in the ℓ₁-norm example above, DCP introduces slack variables to reformulate the norm into additional linear constraints; when f is the nuclear norm (the sum of the singular values of a matrix), it can be modeled via a semi-definite formulation, where the memory requirements and the volume of computation per iteration are high [27].

In this paper, we focus on cases where f is endowed with a generalized proximity operator, associated with a local norm (see Section 2 for details):

prox_f^H(x) := arg min_z { f(z) + (1/2)‖z − x‖_H² }.

Such proximity operators have been used extensively in non-smooth optimization problems, and have proven efficient in real applications, under common gradient Lipschitz-continuity and strong convexity assumptions on the objective function [6, 13, 33]. However, for generic constraints in (1), the resulting interior barrier in (2) does not have Lipschitz continuous gradients and, thus, prevents us from trivially recycling such ideas. This necessitates the design of a new class of path-following schemes that exploit proximal operators and can thus accommodate non-smooth terms in the objective.

To the best of our knowledge, [50] is the first work that jointly treats interior barrier path-following schemes and proximity operators, in order to construct new proximal path-following algorithms for problems of the form (1). The algorithm proposed in [50] follows a two-phase approach, with Phase II having the same worst-case iteration complexity as in (3) (up to constants) [33, 37]. However, the initialization Phase I in [50] requires substantial computational effort, which usually dominates the overall computational time. In particular, to find a good initial point, [50] uses a damped-step proximal-Newton scheme for (2), starting from an arbitrary initial point and for an arbitrarily selected penalty value. For such a configuration, [50] requires a number of damped-step Newton iterations in Phase I, proportional to the initial objective gap, in order to find a point close to the optimal solution of (2) for the selected penalty value; see [50, Theorem 4.4] for the precise bound. That is, in stark contrast to the global iteration complexity (3) of smooth path-following schemes, Phase I of [50] might require a substantial number of iterations just to converge to a point close to the central path, and this number depends on the arbitrary selection of the initial point.

1.2 Motivation.

From our discussion so far, it is clear that most existing works on path-following schemes require two phases. In the case of smooth self-concordant objectives in (1), Phase I is often implemented as a damped-step Newton scheme, which has a sublinear convergence rate, or as an auxiliary path-following scheme, with a linear convergence rate that satisfies the global worst-case complexity in (3) [33, 37]. In standard conic programming, one can unify the two phases in a single-phase IP path-following scheme via homogeneous and self-dual embedding strategies; see, e.g., [45, 52, 54]. Such strategies parameterize the KKT conditions of the primal and dual conic program so that an initial point is immediately available, without performing Phase I. So far, and to the best of our knowledge, it remains unclear how such an auxiliary path-following scheme can find an initial point for non-smooth objectives in (1).

1.3 Our contributions.

The goal of this paper is to develop a new single-phase, proximal path-following algorithm for (1). To do so, we first re-parameterize the optimality condition of the barrier problem associated with (1) as a parametric monotone inclusion (PMI). Then, we design a proximal path-following scheme to approximate the solution of such PMI, while controlling the penalty parameter. Finally, we show how to recover an approximate solution of (1), from the approximate solution of the PMI.

The main contributions of this paper can be summarized as follows:

  • We introduce a new parameterization for the optimality condition of (2) to appropriately select the parameters such that less computation for initialization is needed. Thus, with an appropriate choice of parameters, we show how we can eliminate the slowly-convergent Phase I in [50], while we still maintain the global, polynomial time, worst-case iteration-complexity.

    In particular, we propose novel—checkable a priori—conditions over the set of initial points that can achieve the desiderata; this, in turn, provides rigorous configurations of the algorithm’s parameters such that the worst-case iteration complexity guarantee is obtained provably, avoiding the slowly convergent initialization procedures proposed so far for non-smooth optimization in (1).

  • We design a single-phase, path-following algorithm to compute an -solution of (1). For each value, the resulting algorithm only requires a single approximate Newton iteration (see [50]), followed by a proximal step, of a strongly convex quadratic composite subproblem. We will use the term proximal Newton step when referring to these two steps. The algorithm allows inexact Newton steps, with a verifiable stopping criterion (cf. eq. (25)).

In particular, we establish the following result: the total number of proximal Newton iterations required to reach an ε-solution of (1) is upper bounded by O(√ν log(1/ε)). A complete and formal description of the above result and its proof are provided in Section 4. Our proximal algorithm admits the same iteration complexity as standard path-following methods [33, 37] (up to a constant). To highlight the iteration complexity gains over the two-phase algorithm in [50, Theorem 4.4], recall that in the latter case the total number of proximal Newton steps is bounded by the sum of two terms: the first accounts for Phase I, as mentioned previously, and the second for Phase II.

Our algorithm requires a well-chosen initial point that avoids Phase I; one such choice is an approximation of the analytical center of the barrier (see Section 2 for details). In the text, we argue that evaluating this point is much easier than finding an initial point using Phase I, as in [50]. In addition, for many feasible sets in (1), the analytical center of the barrier can be computed explicitly and easily (see Section 5 for examples).

1.4 The structure of the paper.

This paper is organized as follows. Sections 2 and 3 contain basic definitions and notions, used in our analysis. We introduce a new re-parameterization of the central path in order to obtain a predefined initial point. Section 4 presents a novel algorithm and its complexity theory for the non-smooth objective function. Section 5 provides three numerical examples that highlight the merits of our algorithm.

2 Preliminaries.

In this section, we provide the basic notation used in the rest of the paper, as well as two key concepts: proximity operators and self-concordant (barrier) functions.

2.1 Basic definitions.

Given x, y ∈ ℝⁿ, we use ⟨x, y⟩ or x^⊤y to denote their inner product. For a proper, closed and convex function f, we denote by dom(f) its domain (i.e., dom(f) := {x : f(x) < +∞}), and by ∂f(x) its subdifferential at x. We also denote by cl(dom(f)) the closure of dom(f) [43]. We use C³ to denote the class of three times continuously differentiable functions from ℝⁿ to ℝ.

For a given twice differentiable function f such that ∇²f(x) ≻ 0 at some x, we define the local norm and its dual norm as

‖u‖_x := (u^⊤∇²f(x)u)^{1/2} and ‖v‖_x^⋆ := (v^⊤∇²f(x)^{−1}v)^{1/2},

respectively, for u, v ∈ ℝⁿ. Note that the Cauchy-Schwarz inequality holds, i.e., ⟨u, v⟩ ≤ ‖u‖_x ‖v‖_x^⋆.
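These definitions admit a minimal numerical check (a sketch; the positive-definite matrix H below is an arbitrary stand-in for the Hessian ∇²f(x), not a quantity from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
H = M @ M.T + np.eye(3)          # positive-definite stand-in for the Hessian

u = rng.standard_normal(3)
v = rng.standard_normal(3)

norm_u = np.sqrt(u @ H @ u)                   # local norm ||u||_x
dual_v = np.sqrt(v @ np.linalg.solve(H, v))   # dual norm ||v||_x^*

# generalized Cauchy-Schwarz: <u, v> <= ||u||_x * ||v||_x^*
assert u @ v <= norm_u * dual_v
```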

2.2 Generalized proximity operators.

The generalized proximity operator of a proper, closed and convex function f is defined as the following program:

(4) prox_f^H(x) := arg min_z { f(z) + (1/2)‖z − x‖_H² },

where H ≻ 0 and ‖u‖_H² := u^⊤Hu.

When H = 𝕀—the identity matrix—in the local norm, (4) becomes a standard proximal operator [5]. Computing prox_f^H might be hard even for such cases. Nevertheless, there exist structured smooth and non-smooth convex functions f whose proximity operator comes with a closed-form solution or can be computed with low computational complexity. We capture this idea in the following definition.

[Tractable proximity operator] A proper, closed and convex function f has a tractable proximity operator if (4) can be computed efficiently, either via a closed-form solution or via a polynomial time algorithm.

Examples of such functions include the ℓ₁-norm—where the proximity operator is the well-known soft-thresholding operator [13]—and the indicator functions of simple sets (e.g., boxes, cones and simplexes)—where the proximity operator is simply the projection operator. Further examples can be found in [5, 13, 40]. Observe that, due to the existence of a closed-form solution for most well-known proximity operators, one can always compute them efficiently, with a computational complexity that does not depend on the value of the regularization parameter. Our main result does not require the tractability of computing the proximity operator of f; tractability is only used to analyze the overall computational complexity in Subsection 4.7.
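For concreteness, the two examples just mentioned admit one-line implementations in the standard Euclidean case H = 𝕀 (a sketch, not code from the paper):

```python
import numpy as np

def prox_l1(x, rho):
    """Prox of rho*||.||_1: soft-thresholding,
    argmin_z { rho*||z||_1 + 0.5*||z - x||_2^2 }."""
    return np.sign(x) * np.maximum(np.abs(x) - rho, 0.0)

def prox_box(x, lo, hi):
    """Prox of the indicator of the box [lo, hi]: Euclidean projection."""
    return np.clip(x, lo, hi)

x = np.array([3.0, -0.5, 1.2])
print(prox_l1(x, 1.0))        # soft-thresholded: approx [2., 0., 0.2]
print(prox_box(x, 0.0, 1.0))  # projected onto [0, 1]^3: [1., 0., 1.]
```

Both operators cost O(n), independently of the regularization weight, which is the sense in which they are "tractable" above.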

Some properties of the proximity operator are described in the next lemma: the generalized proximal operator defined in (4) is co-coercive, and therefore nonexpansive, w.r.t. the local norms, i.e., for all u, v,

(5) ⟨prox_f^H(u) − prox_f^H(v), u − v⟩_H ≥ ‖prox_f^H(u) − prox_f^H(v)‖_H²,
(6) ‖prox_f^H(u) − prox_f^H(v)‖_H ≤ ‖u − v‖_H,

where ⟨x, y⟩_H := x^⊤Hy. The proof of this lemma can be found in [51, Lemma 2].

2.3 Self-concordant functions and self-concordant barriers.

A concept used in our analysis is the self-concordance property, introduced by Nesterov and Nemirovskii [33, 37].

A univariate convex function φ ∈ C³ is called standard self-concordant if |φ‴(τ)| ≤ 2φ″(τ)^{3/2} for all τ ∈ dom(φ), where dom(φ) is an open set in ℝ. Moreover, a function f : ℝⁿ → ℝ is standard self-concordant if, for any x ∈ dom(f) and u ∈ ℝⁿ, the univariate function φ(τ) := f(x + τu) is standard self-concordant.

A standard self-concordant function f is a ν-self-concordant barrier for the set cl(dom(f)) with parameter ν > 0, if

sup_u { 2⟨∇f(x), u⟩ − ‖u‖_x² } ≤ ν for all x ∈ dom(f).

In addition, f(x) → +∞ as x tends to the boundary of dom(f).

We note that when ∇²f is non-degenerate (particularly, when dom(f) contains no straight line [33, Theorem 4.1.3]), a ν-self-concordant function f satisfies

(7) ‖∇f(x)‖_x^⋆ ≤ √ν for all x ∈ dom(f).

Self-concordant functions do not, in general, have globally Lipschitz-continuous gradients, yet they can be used to analyze the complexity of Newton methods [10, 33, 37], as well as of first-order variants [16]. For more details on self-concordant functions and self-concordant barriers, we refer the reader to Chapter 4 of [33].

Several simple sets are equipped with a self-concordant barrier. For instance, φ(x) = −Σⁿᵢ₌₁ log xᵢ is an n-self-concordant barrier for the positive orthant, φ(x, τ) = −log(τ² − ‖x‖₂²) is a 2-self-concordant barrier for the Lorentz cone, and the p × p semidefinite cone is endowed with the p-self-concordant barrier φ(X) = −log det X. In addition, other convex sets, such as cones induced by hyperbolic polynomials, are also characterized by explicit self-concordant barriers [26, 34]. Generally, any closed and convex set—with nonempty interior and not containing a straight line—is endowed with a self-concordant barrier; see [33, 37].
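Two defining properties of the orthant barrier can be verified numerically, using nothing beyond the definitions above: in one dimension, −log(τ) satisfies the self-concordance inequality with equality, and −Σ log xᵢ is logarithmically homogeneous, φ(τx) = φ(x) − ν log τ (an illustrative check, not part of the paper's development):

```python
import numpy as np

def phi(x):
    """nu = n self-concordant barrier of the positive orthant."""
    return -np.sum(np.log(x))

# 1) Self-concordance of -log(t) holds with equality:
#    phi'' = 1/t^2 and phi''' = -2/t^3, so |phi'''| = 2*(phi'')**1.5.
t = 0.7
assert np.isclose(abs(-2.0 / t**3), 2.0 * (1.0 / t**2)**1.5)

# 2) Logarithmic homogeneity: phi(tau * x) = phi(x) - nu * log(tau).
x = np.array([0.5, 2.0, 1.5])
nu, tau = len(x), 3.0
assert np.isclose(phi(tau * x), phi(x) - nu * np.log(tau))
```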

Finally, we define the analytical center x_c of φ as

(8) x_c := arg min { φ(x) : x ∈ int(X) }.

If X is bounded, then x_c exists and is unique [35]. Some properties of the analytical center are presented in Section 3. In this paper, we develop algorithms for (1) with a general self-concordant barrier φ of X, as in Definition 2.3.
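Computing x_c is itself a smooth, unconstrained self-concordant minimization, so a damped Newton scheme suffices. The following sketch (illustrative, not the paper's procedure) computes the analytical center of the box (0, 1)^n under the barrier −Σ log xᵢ − Σ log(1 − xᵢ), whose minimizer is the vector of halves:

```python
import numpy as np

def analytic_center_box(n, iters=60):
    """Damped Newton for argmin -sum(log x) - sum(log(1-x)) on (0,1)^n."""
    x = np.full(n, 0.1)                       # any interior starting point
    for _ in range(iters):
        g = -1.0 / x + 1.0 / (1.0 - x)        # gradient
        h = 1.0 / x**2 + 1.0 / (1.0 - x)**2   # (diagonal) Hessian
        step = g / h                           # Newton direction
        lam = np.sqrt(np.sum(g * step))        # Newton decrement
        x = x - step / (1.0 + lam)             # damped step keeps x interior
    return x

xc = analytic_center_box(4)
print(xc)   # approx [0.5, 0.5, 0.5, 0.5]
```

The 1/(1 + λ) damping is the standard safeguard for self-concordant objectives: it guarantees the iterate stays strictly inside the box for any interior start.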

2.4 Basic assumptions.

We make the following assumptions regarding problem (1): the solution set of (1) is nonempty; the objective part f in (1) is proper, closed and convex; the feasible set X is nonempty, closed and convex with nonempty interior, and is endowed with a ν-self-concordant barrier φ; and the analytical center x_c of φ exists.

Except for the last condition, Assumption 2.4 is common for interior-point methods. The last condition can be satisfied by adding an auxiliary bound constraint (e.g., ‖x‖ ≤ R) for sufficiently large R; this technique has also been used in [37], and it does not affect the solution of (1) when R is large.

3 Re-parameterizing the central path.

In this section, we introduce a new parameterization strategy, which will be used in our scheme for (1).

3.1 Barrier formulation and central path of (1).

Since X is endowed with a ν-self-concordant barrier φ, according to Assumption 2.4, the barrier formulation of (1) is given by

(9) x^⋆(t) := arg min_x { F(x) + t φ(x) },

where t > 0 is the penalty parameter and x^⋆(t) denotes the solution of (9) at a given value t. The optimality condition of (9) is necessary and sufficient for x^⋆(t) to be an optimal solution of (9), and can be written as follows:

(10) 0 ∈ ∂f(x^⋆(t)) + c + t ∇φ(x^⋆(t)).

We also denote by {x^⋆(t) : t > 0} the set of solutions of (9), which generates a central path (or a solution trajectory) associated with (1). We refer to each solution x^⋆(t) as a central point.
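For intuition, in one dimension the central path can be written in closed form (a toy example, not from the paper): for min{ x : x ≥ 0 } with φ(x) = −log x, the central point minimizes x − t log x, so 1 − t/x = 0 and x^⋆(t) = t, which indeed tends to the solution x^⋆ = 0 as t → 0⁺. A brute-force check:

```python
import numpy as np

# Central path of min{ x : x >= 0 } with barrier phi(x) = -log(x):
# x*(t) minimizes x - t*log(x), i.e., 1 - t/x = 0, so x*(t) = t.
xs = np.linspace(1e-4, 2.0, 200001)            # fine grid over the domain
for t in [1.0, 0.1, 0.01, 0.001]:
    x_star = xs[np.argmin(xs - t * np.log(xs))]
    assert abs(x_star - t) < 1e-3              # numerical minimizer ~ x*(t)
```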

3.2 Parameterization of the optimality condition.

Let us fix ; a specific selection of is provided later on. For given , let be an arbitrary subgradient of at , and set . For a given parameter , define

(11)

with the gradient . We further define an -parameterized version of (9) as

(12)

We denote by the solution of (12), given . Observe that, for a fixed value of , the optimality condition of (12) at is given by

(13)

Next, we provide some remarks regarding the -parameterized problem in (12):

  1. Clearly, if we set , and thus, (12) is equivalent to (9). Therefore, for any other value , the problem in (12) differs from the original formulation (9) by a factor .

  2. Fix parameters and let be the solution of (12), which is different from the solution of (9), given the remark above. However, as in a path-following scheme, both and converge to an optimum of (1).

  3. Based on the above, for fixed and different values of , (12) leads to a family of paths towards of (1).

Our aim in this paper is to properly combine these quantities such that iteratively solving (12) always enjoys fast convergence (even at the initial point) and, while (12) differs from (9), its solution trajectory remains closely related to the solution trajectory of the original barrier formulation. These points are further discussed in the next subsections.

3.3 A functional connection between solutions of (9) and (12).

Given the definitions above, let us first study the relationship between exact solutions of (9) and (12), for fixed values and . Let be fixed. Assume and is chosen such that . Define as the local distance between and , the solutions of (9) and (12), respectively. Then,

Proof. Let be the solution of (12) and be the solution of (9). By the optimality conditions in (10) and (13), we have and . Moreover, by the convexity of , we have . Using the definition , the last inequality leads to

Further, by [33, Theorem 4.1.5] and the Cauchy-Schwarz inequality, this inequality implies

(14)

which completes the proof of this lemma.

The above lemma indicates that, while (9) and (12) define different central paths towards , there is an upper bound on the distance between and , which is controlled by the selection of and . However, cannot be evaluated a priori, since is unknown.

3.4 Estimating the upper bound.

We can overcome this difficulty by using an approximation of the analytical center point in (8). A key property of is the following [33, Corollary 4.2.1]: Define , where is the self-concordant barrier parameter. If is a logarithmically homogeneous self-concordant barrier, then we set [33]. Then, for any and . This observation leads to the following Corollary; the proof easily follows from that of Lemma 3.3 and the properties above.

Consider the configuration in Lemma 3.3 and define . Then,

(15)

Moreover, if we choose the initial point as , then , where is defined in Lemma 3.3.

Proof. By [33, Corollary 4.2.1], one observes that , where is the analytical center of . Following the same steps as in the proof of Lemma 3.3, we obtain (15). Further, using the property and the definition of , we obtain the last statement.

In the corollary above, we bound the quantity using the local norm at the analytical center. This will allow us to estimate the theoretical worst-case bound in Theorem 4.5, described next. This corollary also suggests choosing the initial point at the analytical center, assuming the latter is available or easy to compute exactly. Later in the text, we propose initialization conditions that can be checked a priori and, if they are satisfied, such initial points are sufficient for our results to apply. For example, consider the case where we only approximate the analytical center up to a tolerance level (rather than computing it exactly); we can then decide on the fly whether such an approximation is adequate (see Lemma 3.5 below).

The above observations lead to the following lemma: given a point , we bound by the distance , using the bound (15).

Consider the configuration in Corollary 3.4, such that . Let and , for any . Then, the following connection between and holds:

(16)

Proof. By definition of the local norm , we have

Here, in the first inequality, we use the triangle inequality for the weighted norm , while in the second inequality we apply [33, Theorem 4.1.6]. The proof is completed when we use (15) to upper bound the RHS.

The above lemma indicates that, given a fixed , any approximate solution to (12) (say ) that is “good” enough (i.e., the metric is small), signifies that is also “close” to the optimal of (9) (i.e., the metric is bounded by and, thus, can be controlled). This fact allows the use of (12), instead of (9), and provides freedom to cleverly select initial parameters and for faster convergence. The next section proposes such an initialization procedure.

3.5 The choice of initial parameters.

Here, we describe how we initialize and . Lemma 3.4 suggests that, for some , if we can bound , then is bounded as well as . This observation leads to the following lemma. Let , where is the solution of (12) at and is an arbitrarily chosen initial point in . Let and, from (11), . Then, we have

(17)

provided that for a particular choice of .

Proof. Since is the solution of (12) at , there exists such that: . Hence, by the definitions of and , we obtain

By convexity of , we have

This inequality leads to . Using the self-concordance of in [33, Theorem 4.1.7] and the Cauchy-Schwarz inequality, we can derive

Hence, . Moreover, by [33, Theorem 4.1.6], we have . Combining these two inequalities, we obtain

After a few elementary calculations, one can see that if , we obtain (17), which also guarantees that the right-hand side of (17) is positive.

In plain words, Lemma 3.5 provides a recipe for the initial selection of parameters: our goal is to choose an initial point and the parameters and such that , for a predefined constant . The following lemma provides sufficient conditions—that can be checked a priori—on the set of initial points that lead to specific configurations and make the results in the previous subsection hold, and it suggests that even an approximation of the analytical center is sufficient.

The initial point and the parameters and need to satisfy the following relations:

(18)

If we choose such that

(19)

then one can choose and

(20)

In addition, the quantity defined in Corollary 3.4 is bounded by . As a special case, if , then , and we can choose and . If is chosen from the first interval in (18), then we can take any .

Proof. Using (17), we observe that in order to satisfy , it is sufficient to require

Since , the inequality further implies

Hence, we obtain the first condition of (18).

By our theory and the choice of as in Lemma 3.3, it holds . Since , the last condition can be upper bounded as follows:

This condition suggests to choose such that . Let be an initial point, then, by Corollary 3.4, we can enforce . This condition leads to

which implies the second condition of (18).

If we take , then the second condition of (18) can be written as . Since , similar to the proof of (14), we can show that . Hence, if