1 Introduction.
This paper studies the following constrained convex optimization problem:
(1) $F^\star := \min_{x \in \mathbb{R}^p} \big\{ F(x) := \langle c, x \rangle + g(x) \;:\; x \in \mathcal{X} \big\},$
where $c \in \mathbb{R}^p$, $g$ is a possibly nonsmooth, proper, closed and convex function from $\mathbb{R}^p$ to $\mathbb{R} \cup \{+\infty\}$, and $\mathcal{X}$ is a nonempty, closed and convex set in $\mathbb{R}^p$.^2 We denote by $\mathcal{X}^\star$ the optimal solution set of (1), and by $x^\star$ an optimal solution in $\mathcal{X}^\star$.
^2 We note that the linear term $\langle c, x \rangle$ can be absorbed into $g$. However, we separate it from $g$ for our convenience in processing the numerical examples in the last section.
For convex sets $\mathcal{X}$ associated with a self-concordant barrier (see Section 2 for details), and for $g$ self-concordant and smooth, e.g., just linear or quadratic, interior point methods (IPMs) often constitute the method of choice for solving (1), with a well-characterized worst-case complexity. A nonexhaustive list of instances of (1) includes linear programs, quadratic programs, second-order cone programs, semidefinite programs, and geometric optimization [2, 7, 8, 17, 31, 32, 33, 38, 42, 44, 53, 54]. At the heart of IPMs lies the notion of interior barriers: these mimic the effect of the constraint set in (1) by appropriately penalizing the objective function with a barrier $f$ over the set $\mathcal{X}$, as follows:
(2) $x^\star(t) := \arg\min_{x} \big\{ F(x; t) := \langle c, x \rangle + g(x) + t f(x) \big\}, \quad t > 0.$
Here, $f$ models the structure of the feasible set $\mathcal{X}$ and $t > 0$ is a penalty parameter. For different values of $t$, the regularized problem (2) generates a sequence of solutions $x^\star(t)$, known as the central path, converging to $x^\star$ of (1) as $t$ goes to $0^+$. Path-following methods operate along the central path: for a properly decreasing sequence of $t$ values, they solve (2) only approximately, by performing a few Newton iterations for each $t$ value; standard path-following schemes even perform just one Newton iteration, assuming a linear objective with no nonsmooth term $g$. For such problem cases, this is sufficient to guarantee that the approximate solution lies sufficiently close to the central path, and operates as a warm start for the next value of $t$ in (2) [8, 30, 37, 33]. One requirement is that the initial point must lie within a predefined neighborhood of the central path. In their seminal work [37], Nesterov and Nemirovskii showed that such methods admit a polynomial worst-case complexity, as long as the Newton method has polynomial complexity.
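To fix ideas, the sketch below instantiates this path-following template on a toy version of (1): a linear objective over the nonnegative orthant, with $g \equiv 0$, using the log-barrier $f(x) = -\sum_i \log x_i$. It is our own minimal illustration of the generic scheme; the step-size and update rules are standard textbook choices, not taken from the cited references.

```python
import numpy as np

# Toy instance of (1): min <c, x> s.t. x >= 0, with barrier f(x) = -sum(log x_i).
# One damped Newton step on F(x; t) = <c, x> + t * f(x) per decrease of t.
c = np.array([1.0, 2.0, 0.5])
nu = float(len(c))               # barrier parameter of -sum(log x_i)
x = np.ones_like(c)              # strictly feasible initial point
t, sigma = 1.0, 0.25             # penalty value and decrease factor
while t > 1e-8:
    t *= 1.0 - sigma / np.sqrt(nu)            # shrink t along the central path
    grad = c - t / x                          # gradient of F(.; t)
    hess_diag = t / x**2                      # (diagonal) Hessian of F(.; t)
    dx = -grad / hess_diag                    # Newton direction
    lam = np.sqrt(np.sum(hess_diag * dx**2))  # Newton decrement in the local norm
    x = x + dx / (1.0 + lam)                  # damped step keeps x > 0
print(x)  # approaches the solution x* = 0 of the toy problem
```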
Based on the above, standard schemes [31, 33, 37] can be characterized by two phases: Phase I and Phase II. In Phase I, and for an initial value of $t$, say $t_0$, one has to solve (2) carefully in order to determine a good initial point for Phase II; this implies solving (2) up to sufficient accuracy, such that the Newton method for (2) admits fast convergence. In Phase II, and using the output of Phase I as a warm start, we path-follow with a provably polynomial time complexity.
Taking into account both phases, standard path-following algorithms—where (2) is a self-concordant objective—are characterized by the following iteration complexity: the total number of iterations required to obtain an $\varepsilon$-solution is
(3) $\mathcal{O}\Big( \sqrt{\nu} \, \log\big( \tfrac{\nu}{\varepsilon} \big) \Big).$
Here, $\nu$ is a barrier parameter (see Section 2 for details) and $\varepsilon$ is the approximation parameter, according to the following definition: given a tolerance $\varepsilon > 0$, we say that $\bar{x}_\varepsilon$ is an $\varepsilon$-solution for (1) if $F(\bar{x}_\varepsilon) - F^\star \leq \varepsilon$.
1.1 Path-following schemes for nonsmooth objectives.
For many applications in machine learning, optimization and signal processing [8, 40, 50], the part $g$ in (1) could be nonsmooth (or even smooth, but non-self-concordant). Such a term is usually included in the optimization in order to leverage the true underlying structure in $x$. An example is the $\ell_1$-norm regularization, i.e., $g(x) = \lambda \| x \|_1$ for some $\lambda > 0$, with applications in high-dimensional statistics, compressive sensing, and scientific and medical imaging
[12, 18, 20, 23, 29, 41, 47, 55], among others. Other examples for $g$ include the indicator function of a convex set [40], the group norm [4, 22, 24], and the nuclear norm [11], used in low-rank matrix approximation.
Unfortunately, nonsmoothness in the objective reduces the optimization efficiency. In such settings, one can often reformulate (1) into a standard conic program, by introducing slack variables and additional constraints to model $g$. Such a technique is known as disciplined convex programming (DCP) [19] and has been incorporated in well-known software packages, such as CVX [19] and YALMIP [28]. Existing off-the-shelf solvers are then utilized to solve the resulting problem. However, DCP could potentially increase the problem dimension significantly; this, in turn, reduces the efficiency of IPMs. For instance, in the example above where $g(x) = \lambda \| x \|_1$, DCP introduces $p$ slack variables to reformulate $g$ into $2p$ additional linear constraints; when $g$ is the nuclear norm (the sum of singular values of a matrix variable), then it can be smoothed via a semidefinite formulation, where the memory requirements and the volume of computation per iteration are high [27].
In this paper, we focus on cases where $g$ is endowed with a generalized proximity operator, associated with a local norm (see Section 2 for details):
$\mathrm{prox}_{g}^{x}(u) := \arg\min_{z} \big\{ g(z) + \tfrac{1}{2} \| z - u \|_x^2 \big\}.$
Such proximity operators have been used extensively in nonsmooth optimization problems, and have proven to be efficient in real applications, under common gradient Lipschitz-continuity and strong convexity assumptions on the objective function [6, 13, 33]. However, for generic constraints $\mathcal{X}$ in (1), the resulting interior barrier in (2) does not have Lipschitz continuous gradients and thus prevents us from trivially recycling such ideas. This necessitates the design of a new class of path-following schemes that exploit proximal operators and can thus accommodate nonsmooth terms in the objective.
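To make the dimension growth under DCP concrete, the sketch below reformulates $\min_x \{ \langle c, x \rangle + \lambda \| x \|_1 : A x \leq b \}$ as an LP in the stacked variable $(x, u)$, with $p$ slack variables and $2p$ extra constraints. The instance, its sizes, and the choice of $\lambda$ are made up purely for illustration.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, p = 5, 3
A, b, c = rng.standard_normal((m, p)), rng.random(m) + 1.0, rng.standard_normal(p)
lam = np.abs(c).max() + 1.0     # chosen large enough to keep the LP bounded below

# DCP-style reformulation: min c'x + lam*1'u  s.t.  Ax <= b, -u <= x <= u.
# The variable z = [x; u] has dimension 2p instead of p.
cost = np.concatenate([c, lam * np.ones(p)])
A_ub = np.block([
    [A,          np.zeros((m, p))],   # A x <= b
    [np.eye(p),  -np.eye(p)],         # x - u <= 0
    [-np.eye(p), -np.eye(p)],         # -x - u <= 0
])
b_ub = np.concatenate([b, np.zeros(2 * p)])
res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (2 * p))
print(res.x[:p])                # recover the original variable x
```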
To the best of our knowledge, [50] is the first work that jointly treats interior barrier path-following schemes and proximity operators, in order to construct new proximal path-following algorithms for problems as in (1). According to [50], the proposed algorithm follows a two-phase approach, with Phase II having the same worst-case iteration complexity as in (3) (up to constants) [33, 37]. However, the initialization Phase I in [50] requires substantial computational effort, which usually dominates the overall computational time. In particular, to find a good initial point, [50] uses a damped-step proximal-Newton scheme for (2), starting from an arbitrary initial point $x^0$ and for an arbitrarily selected $t_0 > 0$. For such a configuration, [50] requires
$\mathcal{O}\Big( \frac{ F(x^0; t_0) - F(x^\star(t_0); t_0) }{ \Delta } \Big)$
damped-step Newton iterations in Phase I in order to find a point close to the optimal solution of (2), say $x^\star(t_0)$, for the selected $t_0$. Here, $F(\cdot\,; t_0)$ is the objective of (2) at $t = t_0$, and $\Delta > 0$ is a fixed decrement constant; see [50, Theorem 4.4] for more details. That is, in stark contrast to the global iteration complexity (3) of smooth path-following schemes, Phase I of [50] might require a substantial number of iterations just to converge to a point close to the central path, and this number depends on the arbitrary initial point selection $x^0$.
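For intuition, a damped-step proximal-Newton iteration of this flavor can be sketched as follows, on a toy instance with $g(x) = \lambda \| x \|_1$ over the box $[-1, 1]^p$ with barrier $f(x) = -\sum_i \log(1 - x_i^2)$. This is our own minimal illustration, not the exact scheme of [50].

```python
import numpy as np

def soft(u, thr):
    """Soft-thresholding, i.e., the proximal operator of thr * |.|."""
    return np.sign(u) * np.maximum(np.abs(u) - thr, 0.0)

# Damped-step proximal-Newton on  <c, x> + lam*||x||_1 + t*f(x),
# with f(x) = -sum(log(1 - x_i^2)), the barrier of the box [-1, 1]^p.
c, lam, t = np.array([0.3, -0.8]), 0.2, 1.0
x = np.zeros(2)                               # analytic center of the box
for _ in range(50):
    grad = c + t * 2 * x / (1 - x**2)         # gradient of the smooth part
    H = t * 2 * (1 + x**2) / (1 - x**2)**2    # diagonal Hessian of the smooth part
    z = soft(x - grad / H, lam / H)           # proximal-Newton target point
    lam_k = np.sqrt(np.sum(H * (z - x)**2))   # proximal-Newton decrement
    x = x + (z - x) / (1.0 + lam_k)           # damped step stays inside the box
print(x)
```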
1.2 Motivation.
From our discussion so far, it is clear that most existing works on path-following schemes require two phases. In the case of smooth self-concordant objectives in (1), Phase I is often implemented as a damped-step Newton scheme, which has a sublinear convergence rate, or as an auxiliary path-following scheme, with a linear convergence rate that satisfies the global, worst-case complexity in (3) [33, 37]. In standard conic programming, one can unify a two-phase algorithm into a single-phase IP path-following scheme via homogeneous and self-dual embedding strategies; see, e.g., [45, 52, 54]. Such strategies parameterize the KKT conditions of the primal and dual conic program so that one immediately has an initial point, without performing Phase I. So far, and to the best of our knowledge, it remains unclear how such an auxiliary path-following scheme can find an initial point for nonsmooth objectives in (1).
1.3 Our contributions.
The goal of this paper is to develop a new single-phase, proximal path-following algorithm for (1). To do so, we first reparameterize the optimality condition of the barrier problem associated with (1) as a parametric monotone inclusion (PMI). Then, we design a proximal path-following scheme to approximate the solution of this PMI, while controlling the penalty parameter. Finally, we show how to recover an approximate solution of (1) from the approximate solution of the PMI.
The main contributions of this paper can be summarized as follows:

We introduce a new parameterization for the optimality condition of (2), so as to appropriately select the parameters such that less computation is needed for initialization. Thus, with an appropriate choice of parameters, we show how we can eliminate the slowly-convergent Phase I in [50], while still maintaining the global, polynomial-time, worst-case iteration complexity.
In particular, we propose novel—checkable a priori—conditions on the set of initial points that can achieve these desiderata; this, in turn, provides rigorous configurations of the algorithm's parameters such that the worst-case iteration-complexity guarantee is provably obtained, avoiding the slowly convergent initialization procedures proposed so far for nonsmooth optimization in (1).

We design a single-phase, path-following algorithm to compute an $\varepsilon$-solution of (1). For each value of $t$, the resulting algorithm only requires a single approximate Newton iteration (see [50]), followed by a proximal step, of a strongly convex quadratic composite subproblem. We use the term proximal Newton step when referring to these two steps. The algorithm allows inexact Newton steps, with a verifiable stopping criterion (cf. eq. (25)).
In particular, we establish the following result: the total number of proximal Newton iterations required in order to reach an $\varepsilon$-solution of (1) is upper bounded by $\mathcal{O}\big( \sqrt{\nu} \log( \nu / \varepsilon ) \big)$. A complete and formal description of this result and its proof are provided in Section 4. Our proximal algorithm admits the same iteration complexity as standard path-following methods [33, 37] (up to a constant). To highlight the iteration-complexity gains over the two-phase algorithm in [50, Theorem 4.4], recall that in the latter case the total number of proximal Newton steps is bounded by:
$\mathcal{O}\Big( \frac{ F(x^0; t_0) - F(x^\star(t_0); t_0) }{ \Delta } \Big) + \mathcal{O}\Big( \sqrt{\nu} \, \log\big( \tfrac{\nu}{\varepsilon} \big) \Big),$
where the first term corresponds to Phase I, as mentioned previously, and the second one to Phase II.
Our algorithm requires a well-chosen initial point that avoids Phase I; one such case is that of an approximation of the analytical center $x_c$ of the barrier $f$ (see Section 2 for details). In the text, we argue that evaluating this point is much easier than finding an initial point using Phase I, as in [50]. In addition, for many feasible sets in (1), we can explicitly and easily compute the analytical center $x_c$ of $f$ (see Section 5 for examples).
1.4 The structure of the paper.
This paper is organized as follows. Sections 2 and 3 contain basic definitions and notions used in our analysis; there, we also introduce a new reparameterization of the central path in order to obtain a predefined initial point. Section 4 presents the novel algorithm and its complexity theory for the nonsmooth objective function. Section 5 provides three numerical examples that highlight the merits of our algorithm.
2 Preliminaries.
In this section, we provide the basic notation used in the rest of the paper, as well as two key concepts: proximity operators and self-concordant (barrier) functions.
2.1 Basic definitions.
Given $x, y \in \mathbb{R}^p$, we use $\langle x, y \rangle$ or $x^\top y$ to denote the inner product in $\mathbb{R}^p$. For a proper, closed and convex function $g$, we denote by $\mathrm{dom}(g)$ its domain (i.e., $\mathrm{dom}(g) := \{ x \in \mathbb{R}^p : g(x) < +\infty \}$), and by $\partial g(x)$ its subdifferential at $x$. We also denote by $\overline{\mathrm{dom}(g)}$ the closure of $\mathrm{dom}(g)$ [43]. We use $\mathcal{C}^3$ to denote the class of three-times continuously differentiable functions from $\mathbb{R}^p$ to $\mathbb{R}$.
For a given twice differentiable function $f$ such that $\nabla^2 f(x) \succ 0$ at some $x \in \mathrm{dom}(f)$, we define the local norm, and its dual norm, as
$\| u \|_x := \langle \nabla^2 f(x) u, u \rangle^{1/2} \quad \text{and} \quad \| v \|_x^* := \langle \nabla^2 f(x)^{-1} v, v \rangle^{1/2},$
respectively, for $u, v \in \mathbb{R}^p$. Note that the Cauchy–Schwarz inequality holds in this pair of norms, i.e., $\langle u, v \rangle \leq \| u \|_x \, \| v \|_x^*$.
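As a quick numerical illustration (ours), these local norms and the Cauchy–Schwarz inequality can be checked directly, e.g., for the barrier $f(x) = -\sum_i \log x_i$, whose Hessian at $x$ is $\mathrm{diag}(1/x_i^2)$:

```python
import numpy as np

x = np.array([0.5, 2.0])            # point in the domain of f(x) = -sum(log x_i)
H = np.diag(1.0 / x**2)             # Hessian of the log-barrier at x
u, v = np.array([1.0, -1.0]), np.array([0.3, 0.7])

norm_u = np.sqrt(u @ H @ u)                   # local norm ||u||_x
dual_v = np.sqrt(v @ np.linalg.inv(H) @ v)    # dual local norm ||v||_x^*
assert abs(u @ v) <= norm_u * dual_v + 1e-12  # Cauchy-Schwarz in the local norms
print(norm_u, dual_v)
```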
2.2 Generalized proximity operators.
The generalized proximity operator of a proper, closed and convex function $g$ is defined as the following program:
(4) $\mathrm{prox}_{g}^{x}(u) := \arg\min_{z} \big\{ g(z) + \tfrac{1}{2} \| z - u \|_x^2 \big\}.$
When $\nabla^2 f(x) = \mathbb{I}$—the identity matrix—in the local norm, (4) becomes a standard proximal operator [5]. Computing (4) might be hard even for such cases. Nevertheless, there exist structured smooth and nonsmooth convex functions $g$ whose proximity operator comes with a closed-form solution or can be computed with low computational complexity. We capture this idea in the following definition.
[Tractable proximity operator] A proper, closed and convex function $g$ has a tractable proximity operator if (4) can be computed efficiently via a closed-form solution or via a polynomial-time algorithm.
Examples of such functions include the $\ell_1$-norm—where the proximity operator is the well-known soft-thresholding operator [13]—and the indicator functions of simple sets (e.g., boxes, cones and simplexes)—where the proximity operator is simply the projection operator. Further examples can be found in [5, 13, 40]. Observe that, due to the existence of a closed-form solution for most well-known proximity operators, one can always compute $\mathrm{prox}_{\tau g}$ efficiently, and its computational complexity does not depend on the value of the regularization parameter $\tau > 0$. Our main result does not require the tractability of computing the proximity operator of $g$; tractability will only be used to analyze the overall computational complexity in Subsection 4.7.
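For concreteness, here is a minimal sketch (ours) of two such tractable proximity operators in the Euclidean case: soft-thresholding for $g = \lambda \| \cdot \|_1$, and the projection onto a box for $g$ the indicator of $[l, u]^p$:

```python
import numpy as np

def prox_l1(u, lam):
    """Soft-thresholding: the proximity operator of g(z) = lam * ||z||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def prox_box(u, lo, hi):
    """Projection onto the box [lo, hi]^p: prox of the box indicator."""
    return np.clip(u, lo, hi)

u = np.array([0.7, -0.2, 1.5])
print(prox_l1(u, 0.5))        # [0.2, -0.0, 1.0]
print(prox_box(u, 0.0, 1.0))  # [0.7, 0.0, 1.0]
```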
2.3 Self-concordant functions and self-concordant barriers.
A concept used in our analysis is the self-concordance property, introduced by Nesterov and Nemirovskii [33, 37].
A univariate convex function $\varphi : \mathcal{D} \to \mathbb{R}$, where $\mathcal{D}$ is an open set in $\mathbb{R}$, is called standard self-concordant if $| \varphi'''(\tau) | \leq 2 \varphi''(\tau)^{3/2}$ for all $\tau \in \mathcal{D}$. Moreover, a function $f : \mathcal{D} \subseteq \mathbb{R}^p \to \mathbb{R}$ is standard self-concordant if, for any $x \in \mathcal{D}$ and $u \in \mathbb{R}^p$, the univariate function $\varphi(\tau) := f(x + \tau u)$ is standard self-concordant.
A standard self-concordant function $f$ is a $\nu$-self-concordant barrier for the set $\overline{\mathrm{dom}(f)}$ with parameter $\nu > 0$, if
$\sup_{u \in \mathbb{R}^p} \big\{ 2 \langle \nabla f(x), u \rangle - \| u \|_x^2 \big\} \leq \nu, \quad \forall x \in \mathrm{dom}(f).$
In addition, $f(x) \to +\infty$ as $x$ tends to the boundary of $\mathrm{dom}(f)$.
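As a sanity check (ours), the univariate barrier $\varphi(\tau) = -\log \tau$ satisfies the defining self-concordance inequality with equality, since $\varphi''(\tau) = \tau^{-2}$ and $\varphi'''(\tau) = -2 \tau^{-3}$:

```python
import numpy as np

# Verify |phi'''(tau)| <= 2 * phi''(tau)^{3/2} for phi(tau) = -log(tau):
for tau in np.linspace(0.1, 10.0, 100):
    d2 = 1.0 / tau**2                        # phi''(tau)
    d3 = -2.0 / tau**3                       # phi'''(tau)
    assert abs(d3) <= 2.0 * d2**1.5 + 1e-12  # holds with equality here
print("self-concordance inequality verified for -log(tau)")
```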
We note that when $f$ is nondegenerate (particularly, when $\mathrm{dom}(f)$ contains no straight line [33, Theorem 4.1.3]), a $\nu$-self-concordant barrier $f$ satisfies
(7) $\| \nabla f(x) \|_x^* \leq \sqrt{\nu}, \quad \forall x \in \mathrm{dom}(f).$
Self-concordant functions do not, in general, have globally Lipschitz-continuous gradients; nevertheless, self-concordance can be used to analyze the complexity of Newton methods [10, 33, 37], as well as of first-order variants [16]. For more details on self-concordant functions and self-concordant barriers, we refer the reader to Chapter 4 of [33].
Several simple sets are equipped with a self-concordant barrier. For instance, $f(x) := -\sum_{i=1}^p \log x_i$ is a $p$-self-concordant barrier of the orthant cone $\mathbb{R}^p_+$, $f(x, \tau) := -\log( \tau^2 - \| x \|_2^2 )$ is a $2$-self-concordant barrier of the Lorentz cone $\mathcal{L}_{p+1}$, and the semidefinite cone $\mathcal{S}^p_+$ is endowed with the $p$-self-concordant barrier $f(X) := -\log \det X$. In addition, other convex sets, such as the cones associated with hyperbolic polynomials, are also characterized by explicit self-concordant barriers [26, 34]. Generally, any closed and convex set—with nonempty interior and not containing a straight line—is endowed with a self-concordant barrier; see [33, 37].
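These three standard barriers are easy to write down explicitly; a small sketch (ours), where each function must be evaluated at a point in the interior of the corresponding cone:

```python
import numpy as np

def barrier_orthant(x):
    """nu = p self-concordant barrier of the orthant cone (requires x > 0)."""
    return -np.sum(np.log(x))

def barrier_lorentz(x, tau):
    """nu = 2 self-concordant barrier of the Lorentz cone ||x||_2 < tau."""
    return -np.log(tau**2 - np.dot(x, x))

def barrier_psd(X):
    """nu = p self-concordant barrier of the semidefinite cone (X > 0)."""
    return -np.linalg.slogdet(X)[1]

print(barrier_orthant(np.array([1.0, 2.0])))
print(barrier_lorentz(np.array([0.3, 0.4]), 1.0))
print(barrier_psd(np.array([[2.0, 0.5], [0.5, 1.0]])))
```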
2.4 Basic assumptions.
We make the following assumptions regarding problem (1). The solution set $\mathcal{X}^\star$ of (1) is nonempty. The objective function $F$ in (1) is proper, closed and convex, and $\mathrm{dom}(F) \cap \mathrm{int}(\mathcal{X}) \neq \emptyset$. The feasible set $\mathcal{X}$ is nonempty, closed and convex with nonempty interior, and is endowed with a $\nu$-self-concordant barrier $f$ such that $\overline{\mathrm{dom}(f)} = \mathcal{X}$. The analytical center of $f$,
(8) $x_c := \arg\min_{x \in \mathrm{int}(\mathcal{X})} f(x),$
exists.
3 Reparameterizing the central path.
In this section, we introduce a new parameterization strategy, which will be used in our scheme for (1).
3.1 Barrier formulation and central path of (1).
Since $\mathcal{X}$ is endowed with a $\nu$-self-concordant barrier $f$, according to Assumption A.2.4, the barrier formulation of (1) is given by
(9) $x^\star(t) := \arg\min_{x} \big\{ F(x; t) := \langle c, x \rangle + g(x) + t f(x) \big\},$
where $t > 0$ is the penalty parameter. We denote by $x^\star(t)$ the solution of (9) at a given value $t > 0$. The optimality condition of (9) is necessary and sufficient for $x^\star(t)$ to be an optimal solution of (9), and can be written as follows:
(10) $0 \in c + \partial g( x^\star(t) ) + t \nabla f( x^\star(t) ).$
We also denote by $\{ x^\star(t) : t > 0 \}$ the set of solutions of (9), which generates a central path (or a solution trajectory) associated with (1). We refer to each solution $x^\star(t)$ as a central point.
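To make the central path concrete, consider the scalar toy instance $\min_{x \geq 0} \{ c x + \lambda |x| \}$ with barrier $f(x) = -\log x$ and $c + \lambda > 0$, so that $x^\star = 0$. Condition (10) reads $c + \lambda - t/x = 0$ on $x > 0$, giving the closed-form central point $x^\star(t) = t / (c + \lambda)$; a sketch (ours):

```python
# Closed-form central path for min_{x>=0} c*x + lam*|x| with f(x) = -log(x):
c, lam = 0.5, 0.25                # c + lam > 0, hence x* = 0
for t in [1.0, 0.1, 0.01, 0.001]:
    x_t = t / (c + lam)           # solves (10): c + lam - t / x = 0
    print(t, x_t)                 # the central path x*(t) -> x* = 0 as t -> 0+
```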
3.2 Parameterization of the optimality condition.
Let us fix $t_0 > 0$ and $\hat{x} \in \mathrm{dom}(g) \cap \mathrm{int}(\mathcal{X})$; a specific selection of $(\hat{x}, t_0)$ is provided later on. For the given $\hat{x}$, let $\hat{\xi} \in \partial g(\hat{x})$ be an arbitrary subgradient of $g$ at $\hat{x}$, and set $\hat{\zeta} := \nabla f(\hat{x}) + t_0^{-1} ( c + \hat{\xi} )$. For a given parameter $\eta \geq 0$, define
(11) $f_\eta(x) := f(x) - \eta \langle \hat{\zeta}, x \rangle,$
with the gradient $\nabla f_\eta(x) = \nabla f(x) - \eta \hat{\zeta}$. We further define an $\eta$-parameterized version of (9) as
(12) $\bar{x}^\star(t) := \arg\min_{x} \big\{ \bar{F}(x; t) := \langle c, x \rangle + g(x) + t f_\eta(x) \big\}.$
We denote by $\bar{x}^\star(t)$ the solution of (12), given $\eta$. Observe that, for a fixed value of $\eta$, the optimality condition of (12) at $\bar{x}^\star(t)$ is given by
(13) $0 \in c + \partial g( \bar{x}^\star(t) ) + t \big( \nabla f( \bar{x}^\star(t) ) - \eta \hat{\zeta} \big).$
Note, in particular, that with this choice of $\hat{\zeta}$, the point $\hat{x}$ satisfies (13) exactly at $t = t_0$ and $\eta = 1$.
Next, we provide some remarks regarding the parameterized problem in (12):
Our aim in this paper is to properly combine the quantities $t$, $\eta$ and $\hat{x}$, such that iteratively solving (12) always enjoys fast convergence (even at the initial point $\hat{x}$) and, while (12) differs from (9), its solution trajectory is closely related to the solution trajectory of the original barrier formulation. These points are further discussed in the next subsections.
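For the scalar toy instance used above, one can verify numerically that $\hat{x}$ lies exactly on the parameterized path at $t = t_0$ and $\eta = 1$; the following sketch (ours) checks the residual of (13):

```python
# Check that x_hat solves (12) at (t = t0, eta = 1) for
# min_{x>=0} c*x + lam*|x| with f(x) = -log(x):
c, lam, t0, x_hat = 0.3, 0.2, 0.5, 1.0
xi_hat = lam                                   # subgradient of lam*|x| at x_hat > 0
zeta_hat = -1.0 / x_hat + (c + xi_hat) / t0    # definition of zeta_hat
eta = 1.0
residual = c + xi_hat + t0 * (-1.0 / x_hat - eta * zeta_hat)  # condition (13)
print(residual)   # 0.0: x_hat is exactly on the parameterized path at t0
```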
3.3 A functional connection between solutions of (9) and (12).
Given the definitions above, let us first study the relationship between exact solutions of (9) and (12), for fixed values $t$ and $\eta$.
Lemma 3.3. Let $t > 0$ be fixed. Assume $\hat{x} \in \mathrm{dom}(g) \cap \mathrm{int}(\mathcal{X})$ and $\eta$ is chosen such that $\eta \| \hat{\zeta} \|_{\bar{x}^\star(t)}^* < 1$. Define $\lambda_t := \| x^\star(t) - \bar{x}^\star(t) \|_{\bar{x}^\star(t)}$ as the local distance between $x^\star(t)$ and $\bar{x}^\star(t)$, the solutions of (9) and (12), respectively. Then,
$\lambda_t \leq \frac{ \eta \| \hat{\zeta} \|_{\bar{x}^\star(t)}^* }{ 1 - \eta \| \hat{\zeta} \|_{\bar{x}^\star(t)}^* }.$
Proof. Let $\bar{x}^\star(t)$ be the solution of (12) and $x^\star(t)$ be the solution of (9). By the optimality conditions in (10) and (13), we have $0 \in c + \partial g( x^\star(t) ) + t \nabla f( x^\star(t) )$ and $0 \in c + \partial g( \bar{x}^\star(t) ) + t ( \nabla f( \bar{x}^\star(t) ) - \eta \hat{\zeta} )$. Moreover, by the convexity of $g$, we have $\langle \bar{\xi} - \xi, \bar{x}^\star(t) - x^\star(t) \rangle \geq 0$ for the corresponding subgradients $\bar{\xi} \in \partial g( \bar{x}^\star(t) )$ and $\xi \in \partial g( x^\star(t) )$. Using these relations, we obtain
$t \, \langle \nabla f( \bar{x}^\star(t) ) - \nabla f( x^\star(t) ), \bar{x}^\star(t) - x^\star(t) \rangle \leq t \eta \, \langle \hat{\zeta}, \bar{x}^\star(t) - x^\star(t) \rangle.$
Further, by [33, Theorem 4.1.5] and the Cauchy–Schwarz inequality, this inequality implies
(14) $\frac{ \lambda_t^2 }{ 1 + \lambda_t } \leq \eta \, \| \hat{\zeta} \|_{\bar{x}^\star(t)}^* \, \lambda_t,$
which, after rearranging, completes the proof of this lemma.
3.4 Estimating an upper bound for $\lambda_t$.
The bound in Lemma 3.3 depends on $\| \hat{\zeta} \|_{\bar{x}^\star(t)}^*$, which is evaluated at the unknown solution $\bar{x}^\star(t)$ and hence cannot be computed a priori. We can overcome this difficulty by using an approximation of the analytical center point $x_c$ in (8). A key property of $x_c$ is the following [33, Corollary 4.2.1]: define $\kappa := \nu + 2 \sqrt{\nu}$, where $\nu$ is the self-concordant barrier parameter; if $f$ is a logarithmically homogeneous self-concordant barrier, then we set $\kappa := \sqrt{\nu}$ [33]. Then, $\| y - x_c \|_{x_c} \leq \kappa$ for any $y \in \mathrm{dom}(f)$. This observation leads to the following corollary; the proof easily follows from that of Lemma 3.3 and the properties above.
Corollary 3.4. Consider the configuration in Lemma 3.3 and define $d_c := \| \hat{\zeta} \|_{x_c}^*$. Then,
(15) $\frac{ \lambda_t^2 }{ 1 + \lambda_t } \leq 2 \kappa \, \eta \, d_c.$
Moreover, if we choose the initial point as $\hat{x} = x_c$, then $\hat{\zeta} = t_0^{-1} ( c + \hat{\xi} )$ and hence $d_c = t_0^{-1} \| c + \hat{\xi} \|_{x_c}^*$, where $\lambda_t$ is defined in Lemma 3.3.
Proof. By [33, Corollary 4.2.1], one observes that $\| \bar{x}^\star(t) - x_c \|_{x_c} \leq \kappa$ and $\| x^\star(t) - x_c \|_{x_c} \leq \kappa$, where $x_c$ is the analytical center of $f$. Following the same motions as in the proof of Lemma 3.3, we obtain (15). Further, using the property $\nabla f(x_c) = 0$ and the definition of $\hat{\zeta}$, we obtain the last statement.
In the corollary above, we bound the quantity $\lambda_t$ using the local norm at the analytical center $x_c$. This will allow us to estimate the theoretical worst-case bound in Theorem 4.5, described next. This corollary also suggests choosing the initial point $\hat{x} = x_c$, assuming $x_c$ is available or easy to compute exactly. Later in the text, we propose initialization conditions that can be checked a priori and, if they are satisfied, such initial points are sufficient for our results to apply. E.g., consider the case where $x_c$ is only approximated up to a tolerance level (and not exactly computed); there, we can decide on the fly whether such an approximation is adequate—see Lemma 3.5 below. The above observations lead to the following lemma: given a point $x^0$, we bound $\| x^0 - x^\star(t) \|_{x^0}$ by the distance $\| x^0 - \bar{x}^\star(t) \|_{x^0}$, using the bound (15).
Lemma 3.4. Consider the configuration in Corollary 3.4, such that $\hat{x} = x_c$. Let $x^0 \in \mathrm{dom}(g) \cap \mathrm{int}(\mathcal{X})$ and $r_0 := \| x^0 - \bar{x}^\star(t) \|_{x^0} < 1$, for any $t > 0$. Then, the following connection between $\| x^0 - x^\star(t) \|_{x^0}$ and $r_0$ holds:
(16) $\| x^0 - x^\star(t) \|_{x^0} \leq r_0 + \frac{ \lambda_t }{ 1 - r_0 }.$
Proof. By the definition of the local norm $\| \cdot \|_{x^0}$, we have
$\| x^0 - x^\star(t) \|_{x^0} \leq \| x^0 - \bar{x}^\star(t) \|_{x^0} + \| \bar{x}^\star(t) - x^\star(t) \|_{x^0} \leq r_0 + \frac{1}{1 - r_0} \, \| \bar{x}^\star(t) - x^\star(t) \|_{\bar{x}^\star(t)}.$
Here, in the first inequality, we use the triangle inequality for the weighted norm $\| \cdot \|_{x^0}$, while in the second inequality we apply [33, Theorem 4.1.6]. The proof is completed when we use (15) to upper bound the right-hand side.
The above lemma indicates that, given a fixed $t$, any approximate solution $x^0$ to (12) that is "good" enough (i.e., the metric $\| x^0 - \bar{x}^\star(t) \|_{x^0}$ is small) is also "close" to the optimal solution $x^\star(t)$ of (9) (i.e., the metric $\| x^0 - x^\star(t) \|_{x^0}$ is bounded as in (16) and, thus, can be controlled). This fact allows the use of (12), instead of (9), and provides freedom to cleverly select the initial parameters $t_0$ and $\eta$ for faster convergence. The next subsection proposes such an initialization procedure.
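Since the initialization below hinges on an (approximate) analytical center, we note that $x_c$ is often cheap to compute. E.g., for the box $[0, 1]^p$ with barrier $f(x) = -\sum_i ( \log x_i + \log(1 - x_i) )$, a few damped Newton steps recover the exact center $x_c = (1/2) \mathbf{1}$; a sketch (ours):

```python
import numpy as np

# Analytic center of [0, 1]^p under f(x) = -sum(log x_i + log(1 - x_i)).
p = 4
x = np.full(p, 0.9)                          # strictly feasible starting point
for _ in range(50):
    grad = -1.0 / x + 1.0 / (1.0 - x)        # gradient of f
    hess = 1.0 / x**2 + 1.0 / (1.0 - x)**2   # diagonal Hessian of f
    dx = -grad / hess                        # Newton direction
    lam = np.sqrt(np.sum(hess * dx**2))      # Newton decrement
    x = x + dx / (1.0 + lam)                 # damped step stays in (0, 1)^p
print(x)                                     # converges to the center 0.5 * ones
```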
3.5 The choice of initial parameters.
Here, we describe how we initialize $t_0$ and $\eta$. Lemma 3.4 suggests that, for some $t > 0$, if we can bound $\| x^0 - \bar{x}^\star(t) \|_{x^0}$, then $\| x^0 - x^\star(t) \|_{x^0}$ is bounded, as well as $\lambda_t$. This observation leads to the following lemma.
Lemma 3.5. Let $\lambda_0 := \| \bar{x}^\star(t_0) - x^0 \|_{x^0}$, where $\bar{x}^\star(t_0)$ is the solution of (12) at $t = t_0$ and $x^0$ is an arbitrarily chosen initial point in $\mathrm{dom}(g) \cap \mathrm{int}(\mathcal{X})$. Let $\hat{x} := x^0$ and, from (11), $\nabla f_\eta(x^0) = \nabla f(x^0) - \eta \hat{\zeta}$. Then, we have
(17) $\lambda_0 \leq \frac{ |1 - \eta| \, \| \hat{\zeta} \|_{x^0}^* }{ 1 - |1 - \eta| \, \| \hat{\zeta} \|_{x^0}^* },$
provided that $|1 - \eta| \, \| \hat{\zeta} \|_{x^0}^* < 1$ for a particular choice of $\eta$.
Proof. Since $\bar{x}^\star(t_0)$ is the solution of (12) at $t = t_0$, there exists $\bar{\xi} \in \partial g( \bar{x}^\star(t_0) )$ such that $c + \bar{\xi} + t_0 ( \nabla f( \bar{x}^\star(t_0) ) - \eta \hat{\zeta} ) = 0$. Hence, by the definitions of $\hat{\zeta}$ and $\hat{\xi}$, we obtain
$\bar{\xi} - \hat{\xi} + t_0 \big( \nabla f( \bar{x}^\star(t_0) ) - \nabla f( x^0 ) \big) = t_0 ( \eta - 1 ) \hat{\zeta}.$
By convexity of $g$, we have
$\langle \bar{\xi} - \hat{\xi}, \bar{x}^\star(t_0) - x^0 \rangle \geq 0.$
This inequality leads to $\langle \nabla f( \bar{x}^\star(t_0) ) - \nabla f( x^0 ), \bar{x}^\star(t_0) - x^0 \rangle \leq ( \eta - 1 ) \langle \hat{\zeta}, \bar{x}^\star(t_0) - x^0 \rangle$. Using the self-concordance of $f$ in [33, Theorem 4.1.7] and the Cauchy–Schwarz inequality, we can derive
$\frac{ \lambda_0^2 }{ 1 + \lambda_0 } \leq ( \eta - 1 ) \langle \hat{\zeta}, \bar{x}^\star(t_0) - x^0 \rangle \leq |1 - \eta| \, \| \hat{\zeta} \|_{x^0}^* \, \lambda_0.$
Hence, $\frac{ \lambda_0 }{ 1 + \lambda_0 } \leq |1 - \eta| \, \| \hat{\zeta} \|_{x^0}^*$. Combining these two inequalities, we obtain
$\lambda_0 \leq |1 - \eta| \, \| \hat{\zeta} \|_{x^0}^* \, ( 1 + \lambda_0 ).$
After a few elementary calculations, one can see that if $|1 - \eta| \, \| \hat{\zeta} \|_{x^0}^* < 1$, we obtain (17), which also guarantees the right-hand side of (17) to be positive.
In plain words, Lemma 3.5 provides a recipe for the initial selection of parameters: our goal is to choose an initial point $x^0$ and the parameters $t_0$ and $\eta$ such that $\lambda_0 \leq \beta$, for a predefined constant $\beta \in (0, 1)$. The following lemma provides sufficient conditions—that can be checked a priori—on the set of initial points that lead to specific configurations and make the results in the previous subsection hold; it also suggests that even an approximation of the analytical center $x_c$ is sufficient.
The initial point $x^0$ and the parameters $t_0$ and $\eta$ need to satisfy the following relations:
(18) $\eta \in \Big[ 1 - \frac{\beta}{(1+\beta) \, \| \hat{\zeta} \|_{x^0}^*}, \; 1 \Big] \cup \Big[ 1, \; 1 + \frac{\beta}{(1+\beta) \, \| \hat{\zeta} \|_{x^0}^*} \Big], \qquad t_0 > 0.$
If we choose $t_0$ such that
(19) $t_0 \geq \frac{2 (1+\beta)}{\beta} \, \| c + \hat{\xi} \|_{x^0}^*, \qquad \text{with} \quad \| \nabla f(x^0) \|_{x^0}^* \leq \frac{\beta}{2 (1+\beta)},$
then one can choose any $\eta \in [0, 1]$ and
(20) $\| \hat{\zeta} \|_{x^0}^* \leq \| \nabla f(x^0) \|_{x^0}^* + t_0^{-1} \| c + \hat{\xi} \|_{x^0}^* \leq \frac{\beta}{1+\beta}.$
In addition, the quantity $d_c$ defined in Corollary 3.4 is bounded by $\frac{\beta}{1+\beta}$. As a special case, if $x^0 = x_c$, then $\nabla f(x^0) = 0$ and $\hat{\zeta} = t_0^{-1} ( c + \hat{\xi} )$, and we can choose $t_0$ as in (19) and any $\eta \in [0, 1]$. If $\eta$ is chosen from the first interval in (18), then we can take any $t_0 > 0$.
Proof. Using (17), we observe that in order to satisfy $\lambda_0 \leq \beta$, it is sufficient to require
$|1 - \eta| \, \| \hat{\zeta} \|_{x^0}^* \leq \frac{\beta}{1+\beta}.$
Since $\| \hat{\zeta} \|_{x^0}^* \leq \| \nabla f(x^0) \|_{x^0}^* + t_0^{-1} \| c + \hat{\xi} \|_{x^0}^*$ by the triangle inequality, the inequality further implies
$|1 - \eta| \, \big( \| \nabla f(x^0) \|_{x^0}^* + t_0^{-1} \| c + \hat{\xi} \|_{x^0}^* \big) \leq \frac{\beta}{1+\beta}.$
Hence, we obtain the first condition of (18).