This paper studies the following constrained convex optimization problem:
where $c \in \mathbb{R}^n$, $g$ is a possibly non-smooth, proper, closed and convex function from $\mathbb{R}^n$ to $\mathbb{R} \cup \{+\infty\}$, and $\mathcal{X}$ is a nonempty, closed and convex set in $\mathbb{R}^n$. (We note that the linear term can be absorbed into $g$; however, we keep it separate from $g$ for convenience in the numerical examples of the last section.) We denote by $\mathcal{X}^\star$ the optimal solution set of (1), and by $x^\star$ an optimal solution in $\mathcal{X}^\star$.
For convex sets $\mathcal{X}$ associated with a self-concordant barrier (see Section 2 for details), and for $g$ smooth and self-concordant (e.g., just linear or quadratic), interior point methods (IPMs) often constitute the method of choice for solving (1), with a well-characterized worst-case complexity. A non-exhaustive list of instances of (1) includes linear programs, quadratic programs, second-order cone programs, semi-definite programs, and geometric optimization [2, 7, 8, 17, 31, 32, 33, 38, 42, 44, 53, 54].
At the heart of IPMs lies the notion of interior barriers: these mimic the effect of the constraint set in (1) by appropriately penalizing the objective function with a barrier over the set $\mathcal{X}$, as follows:
Here, the barrier models the structure of the feasible set $\mathcal{X}$, and $t > 0$ is a penalty parameter. For different values of $t$, the regularized problem (2) generates a sequence of solutions $x^\star(t)$, known as the central path, converging to an optimal solution $x^\star$ of (1) in the limit. Path-following methods operate along the central path: for a properly decreasing sequence of $t$ values, they solve (2) only approximately, by performing a few Newton iterations for each value; standard path-following schemes even perform just one Newton iteration, assuming a linear objective with no non-smooth term $g$. For such problem cases, this is sufficient to guarantee that the approximate solution lies sufficiently close to the central path and operates as a warm-start for the next value of $t$ in (2) [8, 30, 33, 37]. One requirement is that the initial point must lie within a predefined neighborhood of the central path. In their seminal work, Nesterov and Nemirovski showed that such methods admit a polynomial worst-case complexity, as long as the Newton method has polynomial complexity.
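To make the path-following template above concrete, the following sketch applies it to a toy instance: a linear objective over the box $[0,1]^n$ with a logarithmic barrier, one damped Newton step per penalty value. The problem, the decrease factor `sigma`, and the iteration budget are all hypothetical choices for illustration; this is not the algorithm analyzed in this paper.

```python
import numpy as np

# Toy path-following sketch: minimize c^T x over the box [0,1]^n using the
# barrier b(x) = -sum(log x_i + log(1 - x_i)) and the homotopy
#   min_x  c^T x + t * b(x),  with t driven towards 0.
def path_following(c, t0=1.0, sigma=0.8, n_outer=60):
    n = len(c)
    x = np.full(n, 0.5)               # analytic center of the box
    t = t0                            # penalty parameter
    for _ in range(n_outer):
        # One Newton step on the barrier problem (diagonal Hessian here)
        g = c + t * (-1.0 / x + 1.0 / (1.0 - x))       # gradient
        H = t * (1.0 / x**2 + 1.0 / (1.0 - x)**2)      # Hessian diagonal
        step = g / H
        alpha = 1.0                    # damping to stay strictly feasible
        while np.any(x - alpha * step <= 0) or np.any(x - alpha * step >= 1):
            alpha *= 0.5
        x = x - alpha * step
        t *= sigma                     # decrease the penalty parameter
    return x
```

With `c = (1, -1)`, the iterates are driven towards the vertex `(0, 1)` of the box, as expected for a linear objective.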
Based on the above, standard schemes [31, 33, 37] can be characterized by two phases: Phase I and Phase II. In Phase I, for an initial value $t_0$ of the penalty parameter, one has to solve (2) carefully in order to determine a good initial point for Phase II; this implies solving (2) up to sufficient accuracy, such that the Newton method for (2) admits fast convergence. In Phase II, using the output of Phase I as a warm-start, we path-follow with a provably polynomial time complexity.
Taking into account both phases, standard path-following algorithms (where (2) has a self-concordant objective) are characterized by the following iteration complexity: the total number of Newton iterations required to obtain an $\varepsilon$-solution is of the order $\mathcal{O}\left(\sqrt{\nu}\,\log\left(\nu/\varepsilon\right)\right)$, where $\nu$ is the parameter of the self-concordant barrier.
1.1 Path-following schemes for non-smooth objectives.
For many applications in machine learning, optimization and signal processing [8, 40, 50], the part $g$ in (1) could be non-smooth (or even smooth but non-self-concordant). Such a term is usually included in the optimization in order to leverage the true underlying structure of the solution. An example is $\ell_1$-norm regularization, i.e., $g(\cdot) = \|\cdot\|_1$ (possibly scaled), with applications in high-dimensional statistics, compressive sensing, and scientific and medical imaging [12, 18, 20, 23, 29, 41, 47, 55], among others. Other examples for $g$ include the indicator function of a convex set, the group norm [4, 22, 24], and the nuclear norm used in low-rank matrix approximation.
Unfortunately, non-smoothness in the objective reduces optimization efficiency. In such settings, one can often reformulate (1) into a standard conic program, by introducing slack variables and additional constraints to model $g$. Such a technique is known as disciplined convex programming (DCP) and has been incorporated in well-known software packages, such as CVX and YALMIP. Existing off-the-shelf solvers are then utilized to solve the resulting problem. However, DCP could potentially increase the problem dimension significantly; this, in turn, reduces the efficiency of the IPMs. For instance, in the $\ell_1$-norm example above, DCP introduces slack variables to reformulate $g$ into additional linear constraints; when $g$ is the nuclear norm (the sum of the singular values of a matrix variable), it can be modeled via a semi-definite formulation, where the memory requirements and the volume of computation per iteration are high.
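For concreteness, the $\ell_1$-norm case mentioned above admits the standard slack-variable reformulation:

```latex
\|x\|_1 \;=\; \min_{s \in \mathbb{R}^n} \Big\{ \sum_{i=1}^{n} s_i \;:\; -s_i \le x_i \le s_i, \ i = 1, \dots, n \Big\},
```

which doubles the number of variables and adds $2n$ linear constraints, illustrating the dimension increase incurred by DCP.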
In this paper, we focus on cases where $g$ is endowed with a generalized proximity operator, associated with a local norm (see Section 2 for details):
Such proximity operators have been used extensively in non-smooth optimization and have proven efficient in real applications, under common gradient Lipschitz-continuity and strong convexity assumptions on the objective function [6, 13, 33]. However, for generic constraints in (1), the resulting interior barrier in (2) does not have Lipschitz continuous gradients, which prevents us from trivially recycling such ideas. This necessitates the design of a new class of path-following schemes that exploit proximal operators and can thus accommodate non-smooth terms in the objective.
To the best of our knowledge, [50] is the first work that jointly treats interior barrier path-following schemes and proximity operators, in order to construct new proximal path-following algorithms for problems as in (1). The algorithm proposed in [50] follows a two-phase approach, with Phase II having the same worst-case iteration-complexity as in (3) (up to constants) [33, 37]. However, the initialization Phase I in [50] requires substantial computational effort, which usually dominates the overall computational time. In particular, to find a good initial point, [50] uses a damped-step proximal-Newton scheme for (2), starting from an arbitrary initial point and for an arbitrarily selected $t_0$. For such a configuration, [50] requires a number of damped-step Newton iterations in Phase I in order to find a point close to the optimal solution $x^\star(t_0)$ of (2) for the selected $t_0$; see [50, Theorem 4.4] for more details. That is, in stark contrast to the global iteration complexity (3) of smooth path-following schemes, Phase I of [50] might require a substantial number of iterations just to converge to a point close to the central path, and this number depends on the arbitrary selection of the initial point.
From our discussion so far, it is clear that most existing works on path-following schemes require two phases. In the case of smooth self-concordant objectives in (1), Phase I is often implemented as a damped-step Newton scheme with a sublinear convergence rate, or as an auxiliary path-following scheme with a linear convergence rate that satisfies the global, worst-case complexity in (3) [33, 37]. In standard conic programming, one can unify the two phases into a single-phase IPM path-following scheme via homogeneous and self-dual embedding strategies; see, e.g., [45, 52, 54]. Such strategies parameterize the KKT conditions of the primal and dual conic programs so that an initial point is immediately available, without performing Phase I. So far, and to the best of our knowledge, it remains unclear how such an auxiliary path-following scheme can find an initial point for non-smooth objectives in (1).
1.3 Our contributions.
The goal of this paper is to develop a new single-phase, proximal path-following algorithm for (1). To do so, we first re-parameterize the optimality condition of the barrier problem associated with (1) as a parametric monotone inclusion (PMI). Then, we design a proximal path-following scheme to approximate the solution of such PMI, while controlling the penalty parameter. Finally, we show how to recover an approximate solution of (1), from the approximate solution of the PMI.
The main contributions of this paper can be summarized as follows:
We introduce a new parameterization of the optimality condition of (2), which allows us to select the parameters such that less computation is needed for initialization. Thus, with an appropriate choice of parameters, we show how to eliminate the slowly-convergent Phase I of [50], while still maintaining the global, polynomial time, worst-case iteration-complexity.
In particular, we propose novel—checkable a priori—conditions over the set of initial points that can achieve the desiderata; this, in turn, provides rigorous configurations of the algorithm’s parameters such that the worst-case iteration complexity guarantee is obtained provably, avoiding the slowly convergent initialization procedures proposed so far for non-smooth optimization in (1).
We design a single-phase, path-following algorithm to compute an $\varepsilon$-solution of (1). For each value of the penalty parameter, the resulting algorithm only requires a single approximate Newton iteration, followed by a proximal step, applied to a strongly convex quadratic composite subproblem. We will use the term proximal Newton step when referring to these two steps. The algorithm allows inexact Newton steps, with a verifiable stopping criterion (cf. eq. (25)).
In particular, we establish the following result: the total number of proximal Newton iterations required to reach an $\varepsilon$-solution of (1) is upper bounded by $\mathcal{O}\left(\sqrt{\nu}\,\log\left(\nu/\varepsilon\right)\right)$. A complete and formal description of this result and its proof are provided in Section 4. Our proximal algorithm thus admits the same iteration-complexity as standard path-following methods [33, 37] (up to a constant). To highlight the iteration complexity gains over the two-phase algorithm in [50, Theorem 4.4], recall that in the latter case the total number of proximal Newton steps is bounded by the sum of two terms: the Phase I bound mentioned previously, and a Phase II bound of the same order as (3).
Our algorithm requires a well-chosen initial point that avoids Phase I; one such choice is an approximation of the analytical center of the barrier (see Section 2 for details). In the text, we argue that evaluating this point is much easier than finding an initial point via Phase I, as in [50]. In addition, for many feasible sets in (1), we can explicitly and easily compute an approximation of the analytical center (see Section 5 for examples).
1.4 The structure of the paper.
This paper is organized as follows. Sections 2 and 3 contain the basic definitions and notions used in our analysis; in particular, we introduce a new re-parameterization of the central path in order to obtain a predefined initial point. Section 4 presents the novel algorithm and its complexity theory for non-smooth objective functions. Section 5 provides three numerical examples that highlight the merits of our algorithm.
In this section, we provide the basic notation used in the rest of the paper, as well as two key concepts: proximity operators and self-concordant (barrier) functions.
2.1 Basic definitions.
Given $x, y \in \mathbb{R}^n$, we use $\langle x, y \rangle$ or $x^\top y$ to denote the inner product in $\mathbb{R}^n$. For a proper, closed and convex function $f$, we denote by $\mathrm{dom}(f)$ its domain, i.e., $\mathrm{dom}(f) := \{x : f(x) < +\infty\}$, and by $\partial f(x)$ its subdifferential at $x$. We also denote by $\bar{\mathcal{X}}$ the closure of the set $\mathcal{X}$. We use $\mathcal{C}^3$ to denote the class of three times continuously differentiable functions on their (open) domain.
For a given twice differentiable function $f$ such that $\nabla^2 f(x) \succ 0$ at some $x$, we define the local norm, and its dual norm, as
$\|u\|_x := \langle \nabla^2 f(x)\, u, u \rangle^{1/2}$ and $\|v\|_x^* := \langle \nabla^2 f(x)^{-1} v, v \rangle^{1/2}$,
respectively, for $u, v \in \mathbb{R}^n$. Note that the Cauchy-Schwarz inequality holds in this setting, i.e., $|\langle u, v \rangle| \le \|u\|_x \|v\|_x^*$.
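As a quick numerical sanity check of these definitions (a hypothetical example, with a randomly generated positive definite matrix standing in for the Hessian):

```python
import numpy as np

# Numerical check of the generalized Cauchy-Schwarz inequality
#   |<u, v>| <= ||u||_x * ||v||_x^*
# for the local norms induced by a positive definite matrix H.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
H = A @ A.T + 4 * np.eye(4)            # positive definite "Hessian"
u, v = rng.standard_normal(4), rng.standard_normal(4)

local = np.sqrt(u @ H @ u)                    # ||u||_x
dual = np.sqrt(v @ np.linalg.solve(H, v))     # ||v||_x^*
assert abs(u @ v) <= local * dual + 1e-12
```

The inequality holds for any positive definite $H$, so the check passes regardless of the random draw.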
2.2 Generalized proximity operators.
The generalized proximity operator of a proper, closed and convex function is defined as the following program:
When the matrix defining the local norm is the identity, (4) becomes the standard proximal operator. Computing the proximity operator might be hard even in such cases. Nevertheless, there exist structured smooth and non-smooth convex functions whose proximity operator comes with a closed-form solution or can be computed with low computational complexity. We capture this idea in the following definition.
[Tractable proximity operator] A proper, closed and convex function has a tractable proximity operator if (4) can be computed efficiently via a closed-form solution or via a polynomial time algorithm.
Examples of such functions include the $\ell_1$-norm, whose proximity operator is the well-known soft-thresholding operator, and the indicator functions of simple sets (e.g., boxes, cones and simplexes), whose proximity operator is simply the projection operator. Further examples can be found in [5, 13, 40]. Observe that, due to the existence of a closed-form solution for most well-known proximity operators, one can often compute the proximity operator efficiently, and its computational complexity does not depend on the value of the regularization parameter. Our main result does not require the tractability of computing the proximity operator of $g$; tractability will only be used to analyze the overall computational complexity in Subsection 4.7.
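A minimal sketch of the $\ell_1$ example just mentioned, assuming the standard scaled $\ell_1$ penalty; `prox_l1_diag` is a hypothetical helper showing that the generalized prox also splits coordinate-wise when the metric is diagonal:

```python
import numpy as np

# Proximity operator of g(x) = lam * ||x||_1: soft-thresholding.
def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Generalized variant under a diagonal metric H = diag(h), h > 0:
#   argmin_y  lam * ||y||_1 + 0.5 * (y - z)^T H (y - z)
# separates coordinate-wise into soft-thresholding with thresholds lam / h_i.
def prox_l1_diag(z, lam, h):
    return soft_threshold(z, lam / h)
```

For example, `soft_threshold([3, -0.5, 1], 1)` returns `[2, 0, 0]`: entries with magnitude below the threshold are zeroed out, and the rest are shrunk towards zero.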
2.3 Self-concordant functions and self-concordant barriers.
A univariate convex function $\varphi \in \mathcal{C}^3(\mathcal{D})$ is called standard self-concordant if $|\varphi'''(x)| \le 2\,\varphi''(x)^{3/2}$ for all $x \in \mathcal{D}$, where $\mathcal{D}$ is an open set in $\mathbb{R}$. Moreover, a function $f$ is standard self-concordant if, for any $x$ in its domain and any direction $u$, the univariate function $\varphi(s) := f(x + s u)$ is standard self-concordant.
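As a standard illustration (not specific to this paper), the function $\varphi(x) = -\log x$ on $(0, +\infty)$ satisfies this definition with equality:

```latex
\varphi''(x) = \frac{1}{x^2}, \qquad
\varphi'''(x) = -\frac{2}{x^3}, \qquad
|\varphi'''(x)| \;=\; \frac{2}{x^3} \;=\; 2\left(\frac{1}{x^2}\right)^{3/2} \;=\; 2\,\varphi''(x)^{3/2}.
```

This is one reason why logarithmic barriers are the prototypical standard self-concordant functions.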
A standard self-concordant function $b$ is a $\nu$-self-concordant barrier for the set $\bar{\mathcal{X}}$, with parameter $\nu > 0$, if
In addition, $b(x)$ tends to $+\infty$ as $x$ tends to the boundary of $\mathcal{X}$.
We note that when is non-degenerate (particularly, when contains no straight line [33, Theorem 4.1.3.]), a -self-concordant function satisfies
Self-concordant functions do not, in general, have globally Lipschitz continuous gradients; nevertheless, they can be used to analyze the complexity of Newton methods [10, 33, 37], as well as of first-order variants. For more details on self-concordant functions and self-concordant barriers, we refer the reader to Chapter 4 of [33].
Several simple sets are equipped with a self-concordant barrier. For instance, $b(x) = -\sum_{i=1}^n \log x_i$ is an $n$-self-concordant barrier of the orthant cone $\mathbb{R}^n_+$; $b(x, t) = -\log(t^2 - \|x\|_2^2)$ is a $2$-self-concordant barrier of the Lorentz cone $\mathcal{L}^n := \{(x, t) : \|x\|_2 \le t\}$; and the semidefinite cone $\mathcal{S}^n_+$ is endowed with the $n$-self-concordant barrier $b(X) = -\log\det X$. In addition, other convex sets, such as hyperbolicity cones of hyperbolic polynomials and other convex cones, are also characterized by explicit self-concordant barriers [26, 34]. Generally, any closed and convex set with nonempty interior that does not contain a straight line is endowed with a self-concordant barrier; see [33, 37].
2.4 Basic assumptions.
We make the following assumptions regarding problem (1): the solution set of (1) is nonempty; the objective function of (1) is proper, closed and convex; the feasible set $\mathcal{X}$ is nonempty, closed and convex, with nonempty interior, and is endowed with a $\nu$-self-concordant barrier; and the analytical center of this barrier exists.
3 Re-parameterizing the central path.
In this section, we introduce a new parameterization strategy, which will be used in our scheme for (1).
3.1 Barrier formulation and central path of (1).
where $t > 0$ is the penalty parameter. We denote by $x^\star(t)$ the solution of (9) at a given value $t$. The optimality condition of (9) is necessary and sufficient for $x^\star(t)$ to be an optimal solution of (9), and can be written as follows:
3.2 Parameterization of the optimality condition.
Let us fix ; a specific selection of is provided later on. For given , let be an arbitrary subgradient of at , and set . For a given parameter , define
with the gradient . We further define an -parameterized version of (9) as
Next, we provide some remarks regarding the -parameterized problem in (12):
Our aim in this paper is to properly combine these quantities such that iteratively solving (12) always enjoys fast convergence (even at the initial point) and, while (12) differs from (9), its solution trajectory is closely related to the solution trajectory of the original barrier formulation. These points are further discussed in the next subsections.
Given the definitions above, let us first study the relationship between exact solutions of (9) and (12), for fixed values and . Let be fixed. Assume and is chosen such that . Define as the local distance between and , the solutions of (9) and (12), respectively. Then,
Proof. Let be the solution of (12) and be the solution of (9). By the optimality conditions in (10) and (13), we have and . Moreover, by the convexity of , we have . Using the definition , the last inequality leads to
Further, by [33, Theorem 4.1.5] and the Cauchy-Schwarz inequality, this inequality implies
which completes the proof of this lemma.
3.4 Estimating an upper bound.
We can overcome this difficulty by using an approximation of the analytical center point in (8). A key property of the analytical center is the following [33, Corollary 4.2.1]: Define , where is the self-concordant barrier parameter. If is a logarithmically homogeneous self-concordant barrier, then we set . Then, for any and . This observation leads to the following corollary; the proof easily follows from that of Lemma 3.3 and the properties above.
Consider the configuration in Lemma 3.3 and define . Then,
Moreover, if we choose the initial point as , then , where is defined in Lemma 3.3.
Proof. By [33, Corollary 4.2.1], one observes that , where is the analytical center of . Following the same steps as in the proof of Lemma 3.3, we obtain (15). Further, using the property and the definition of , we obtain the last statement.
In the corollary above, we bound the quantity using the local norm at the analytical center. This allows us to estimate the theoretical worst-case bound in Theorem 4.5, described next. This corollary also suggests choosing the analytical center as the initial point, assuming it is available or easy to compute exactly. Later in the text, we propose initialization conditions that can be checked a priori; if they are satisfied, such initial points are sufficient for our results to apply. For example, consider the case where the analytical center is only approximated up to a tolerance level (and not computed exactly); then we can decide on the fly whether such an approximation is adequate (see Lemma 3.5 below).
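As an illustration of how cheap such an initial point can be, the following sketch approximates the analytical center of the box $[0,1]^n$ (whose barrier $-\sum_i(\log x_i + \log(1 - x_i))$ has center $0.5 \cdot \mathbf{1}$) by a damped Newton method. The set, the damping rule, and the iteration count are hypothetical choices for illustration only.

```python
import numpy as np

# Damped Newton method to approximate the analytic center
#   argmin_x b(x),  b(x) = -sum(log x_i + log(1 - x_i))
# of the box [0,1]^n; the exact center is x = 0.5 * ones.
def analytic_center_box(n, iters=30):
    x = np.full(n, 0.1)                       # any strictly interior start
    for _ in range(iters):
        g = -1.0 / x + 1.0 / (1.0 - x)        # gradient of the barrier
        H = 1.0 / x**2 + 1.0 / (1.0 - x)**2   # diagonal Hessian
        lam = np.sqrt(np.sum(g**2 / H))       # Newton decrement
        x = x - (1.0 / (1.0 + lam)) * (g / H) # damped Newton step
    return x
```

The damping factor $1/(1+\lambda)$ keeps the iterates strictly inside the box; once the Newton decrement is small, the steps become nearly full and convergence is fast.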
The above observations lead to the following lemma: given a point , we bound by the distance , using the bound (15).
Consider the configuration in Corollary 3.4, such that . Let and , for any . Then, the following connection between and holds:
Proof. By definition of the local norm , we have
Here, in the first inequality, we use the triangle inequality for the weighted norm , while in the second inequality we apply [33, Theorem 4.1.6]. The proof is completed when we use (15) to upper bound the RHS.
The above lemma indicates that, given a fixed , any approximate solution to (12) (say ) that is “good” enough (i.e., the metric is small), signifies that is also “close” to the optimal of (9) (i.e., the metric is bounded by and, thus, can be controlled). This fact allows the use of (12), instead of (9), and provides freedom to cleverly select initial parameters and for faster convergence. The next section proposes such an initialization procedure.
3.5 The choice of initial parameters.
Here, we describe how we initialize the penalty parameter and the initial point. Corollary 3.4 suggests that, for some , if we can bound , then is bounded as well as . This observation leads to the following lemma. Let , where is the solution of (12) at and is an arbitrarily chosen initial point in . Let and, from (11), . Then, we have
provided that for a particular choice of .
Proof. Since is the solution of (12) at , there exists such that: . Hence, by the definitions of and , we obtain
By convexity of , we have
This inequality leads to . Using the self-concordance of in [33, Theorem 4.1.7] and the Cauchy-Schwarz inequality, we can derive
Hence, . Moreover, by [33, Theorem 4.1.6], we have . Combining these two inequalities, we obtain
In plain words, Lemma 3.5 provides a recipe for the initial selection of parameters: our goal is to choose an initial point and the parameters such that , for a predefined constant . The following lemma provides sufficient conditions, checkable a priori, on the set of initial points; these lead to specific configurations under which the results of the previous subsection hold, and suggest that even an approximation of the analytical center is sufficient.
The initial point and the parameters and need to satisfy the following relations:
If we choose such that
then one can choose and
Proof. Using (17), we observe that in order to satisfy , it is sufficient to require
Since , the inequality further implies
Hence, we obtain the first condition of (18).
By our theory and the choice of as in Lemma 3.3, it holds . Since , the last condition can be upper bounded as follows:
This condition suggests choosing such that . Let be an initial point; then, by Corollary 3.4, we can enforce . This condition leads to
which implies the second condition of (18).