The Newton method is a classical numerical scheme for solving systems of nonlinear equations and smooth optimization problems Nocedal2006 ; Ortega2000. However, at least two reasons prevent such methods from being used to solve large-scale problems. Firstly, while these methods often offer a fast local convergence rate, which can be up to quadratic, their global convergence is not well understood Nesterov2006. In practice, one can use a damped-step scheme that utilizes the Lipschitz constant of the objective derivatives to compute a suitable step-size, as is often done in gradient-type methods, or combine the algorithm with a globalization strategy such as line-search, trust-region, or filter to guarantee a descent property Nocedal2006. Both strategies allow us to prove global convergence of the underlying Newton-type method in some sense. Unfortunately, in practice, there exist several problems whose objective function does not have a globally Lipschitz gradient or Hessian, such as those involving logarithmic or reciprocal functions. This class of problems does not provide uniform bounds from which to obtain a constant step-size in optimization algorithms. On the other hand, using a globalization strategy to determine step-sizes often requires centralized computation such as function evaluations, which prevents us from using distributed computation and stochastic descent methods. Secondly, Newton algorithms are second-order methods, which often have a high per-iteration complexity due to operations on the Hessian of the objective function or its approximations. In addition, these methods require the underlying functionals to be smooth up to a given order, which often fails to hold in practical models.
In recent years, there has been great interest in Newton-type methods for solving convex optimization problems and monotone equations due to the development of new techniques and mathematical tools in optimization, machine learning, and randomized algorithms Becker2012a ; byrd2016stochastic ; Deuflhard2006 ; erdogdu2015convergence ; Lee2014 ; Nesterov2006b ; Nesterov2008b ; pilanci2015newton ; polyak2009regularized ; Roosta-Khorasani2016 ; roosta2016sub ; Tran-Dinh2013b. Several combinations of Newton-type methods with other techniques such as proximal operators Bonnans1994a, cubic regularization Nesterov2006b, gradient regularization polyak2009regularized, randomized algorithms such as sketching pilanci2015newton, subsampling erdogdu2015convergence, and fast eigen-decomposition halko2009finding have opened up a new research direction and attracted great attention in solving nonsmooth and large-scale problems. Hitherto, research in this direction has remained focused on specific classes of problems where standard assumptions such as nonsingularity and Hessian Lipschitz continuity are preserved. However, such assumptions do not hold for many other examples, as shown in Tran-Dinh2013a. Moreover, even if they are satisfied, we often only get a lower bound on the possible step-sizes for our algorithm, which may lead to poor performance, especially in large-scale problems.
In the seminal work Nesterov1994, Nesterov and Nemirovskii showed that the class of log-barriers does not satisfy the standard assumptions of the Newton method if the solution of the underlying problem is close to the boundary of the barrier function domain. They introduced a powerful concept called “self-concordance” to overcome this drawback and developed new Newton schemes that achieve global and local convergence without requiring any additional assumption or a globalization strategy. While the self-concordance notion was initially invented to study interior-point methods, it is less well-known in other communities. Recent works Bach2009 ; cohen2017matrix ; monteiro2015hybrid ; Tran-Dinh2013a ; TranDinh2016c ; zhang2015disco have popularized this concept to solve other problems arising from machine learning, statistics, image processing, scientific computing, and variational inequalities.
In this paper, motivated by Bach2009 ; TranDinh2014d ; zhang2015disco, we aim at generalizing the self-concordance concept in Nesterov1994 to a broader class of smooth and convex functions. To illustrate our idea, we consider a univariate smooth and convex function $\varphi$. If $\varphi$ satisfies the inequality $|\varphi'''(t)| \le M_{\varphi}\,\varphi''(t)^{3/2}$ for all $t$ in the domain of $\varphi$ and for a given constant $M_{\varphi} \ge 0$, then we say that $\varphi$ is self-concordant (in Nesterov and Nemirovskii’s sense Nesterov1994). We instead generalize this inequality to
$$|\varphi'''(t)| \le M_{\varphi}\,\varphi''(t)^{\frac{\nu}{2}}, \qquad (1)$$
for all $t$ in the domain of $\varphi$, and for given constants $\nu > 0$ and $M_{\varphi} \ge 0$.
We emphasize that generalizing from univariate to multivariate functions in the standard self-concordant case (i.e., $\nu = 3$) Nesterov1994 preserves several important properties including the multilinear symmetry (Nesterov2004, Lemma 4.1.2), while, unfortunately, they do not hold for the case $\nu \neq 3$. We therefore modify the definition in Nesterov1994 to overcome this drawback. Note that a similar idea has also been studied in Bach2009 ; TranDinh2014d for a class of logistic-type functions. Nevertheless, the definition used in these papers is limited, and still creates certain difficulties for developing further theory in general cases.
Our second goal is to develop a unified mechanism to analyze convergence (including global and local convergence) of the following Newton-type scheme:
$$x^{k+1} := x^k - s_k F'(x^k)^{-1}F(x^k), \qquad (2)$$
where $F$ can be represented as the right-hand side of a smooth monotone equation $F(x) = 0$, or the optimality condition of a convex optimization or a convex-concave saddle-point problem, $F'$ is the Jacobian map of $F$, and $s_k \in (0, 1]$ is a given step-size. Although the Newton scheme (2) is invariant to a change of variables Deuflhard2006, its convergence property relies on the growth of the Hessian mapping along the Newton iterative process. In classical settings, the Lipschitz continuity and the non-degeneracy of the Hessian mapping in a neighborhood of a given solution are key assumptions to achieve a local quadratic convergence rate Deuflhard2006. These assumptions have been considered to be standard, but they are often very difficult to check in practice, especially the second requirement. A natural idea is to classify the functionals of the underlying problem into a known class of functions in order to choose a suitable method for minimizing them. While first-order methods for convex optimization essentially rely on the Lipschitz continuity of the gradient, Newton schemes usually use the Lipschitz continuity of the Hessian mapping and its non-degeneracy to obtain a well-defined Newton direction, as we have mentioned. For self-concordant functions, the second condition automatically holds, while the first assumption fails to hold. However, both full-step and damped-step Newton methods still work in this case when a suitable metric is chosen. This situation has been observed, and the standard assumptions have been modified in different directions to still guarantee the convergence of Newton-type methods; see Deuflhard2006 for an intensive study of generic Newton-type methods, and Nesterov1994 ; Nesterov2004 for the self-concordant function class.
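To make the damped-step form of scheme (2) concrete, the following minimal sketch applies it to an illustrative standard self-concordant objective of our own choosing, $f(x) = \sum_i (c_i x_i - \ln x_i)$ on $x > 0$, whose minimizer is $x_i^{\star} = 1/c_i$. The objective, the helper names, and the step-size $s_k = 1/(1 + \lambda_k)$ (the classical damped step for standard self-concordant functions, with $\lambda_k$ the Newton decrement) are assumptions made for this illustration only; they are not the step-size rules derived later for the generalized class.

```python
import math

def grad(x, c):
    # Gradient of the illustrative objective f(x) = sum_i (c_i * x_i - ln(x_i)), x > 0.
    return [ci - 1.0 / xi for xi, ci in zip(x, c)]

def hess_diag(x):
    # The Hessian of f is diagonal with entries 1 / x_i^2.
    return [1.0 / (xi * xi) for xi in x]

def damped_newton(x0, c, tol=1e-10, max_iter=100):
    x = list(x0)
    for k in range(max_iter):
        g = grad(x, c)
        h = hess_diag(x)
        # Newton direction d = -H^{-1} g (H is diagonal here).
        d = [-gi / hi for gi, hi in zip(g, h)]
        # Newton decrement: lambda = sqrt(g^T H^{-1} g).
        lam = math.sqrt(sum(gi * gi / hi for gi, hi in zip(g, h)))
        if lam <= tol:
            return x, k
        # Assumed damped step for the standard self-concordant case.
        s = 1.0 / (1.0 + lam)
        x = [xi + s * di for xi, di in zip(x, d)]
    return x, max_iter

c = [0.5, 2.0, 10.0]
x_star, iters = damped_newton([1.0, 1.0, 1.0], c)
print(x_star, iters)
```

With this step-size the iterates stay strictly positive, so no line-search or function evaluation is needed to remain in the domain.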
We first attempt to develop some background theory for a broad class of smooth and convex functions under the structure (1). By adopting the local norm defined via the Hessian mapping of such a convex function from Nesterov1994, we can prove lower and upper bound estimates for the local norm distance between two points in the domain as well as for the growth of the Hessian mapping. Together with this background theory, we also identify a class of functions used in generalized linear models mccullagh1989generalized ; nelder1972generalized as well as in empirical risk minimization vapnik1998statistical that falls into our generalized self-concordance class for many well-known loss-type functions, as listed in Table 1.
Applying our generalized self-concordant theory, we then develop a class of Newton-type methods to solve the following composite convex minimization problem:
$$F^{\star} := \min_{x \in \mathbb{R}^p}\big\{ F(x) := f(x) + g(x) \big\}, \qquad (3)$$
where $f$ is a generalized self-concordant function in our context, and $g$ is a proper, closed, and convex function that can be referred to as a regularization term. We consider two cases. The first case is a non-composite convex problem in which $g$ vanishes (i.e., $g = 0$). In the second case, we assume that $g$ is equipped with a “tractable” proximal operator (see (34) for the definition).
To this end, our main contribution can be summarized as follows.
We generalize the self-concordant notion in Nesterov2004 to a broader class of smooth convex functions, which we call generalized self-concordance. We identify several loss-type functions that can be cast into our generalized self-concordant class. We also prove several fundamental properties and show that the sum and linear transformation of generalized self-concordant functions are generalized self-concordant for a given range of $\nu$ or under suitable assumptions.
We develop lower and upper bounds on the Hessian matrix, the gradient map, and the function values for generalized self-concordant functions. These estimates are key to analyzing several numerical optimization methods including Newton-type methods.
We propose a class of Newton methods including full-step and damped-step schemes to minimize a generalized self-concordant function. We explicitly show how to choose a suitable step-size to guarantee a descent direction in the damped-step scheme, and prove a local quadratic convergence for both the damped-step and the full-step schemes using a suitable metric.
We also extend our Newton schemes to handle the composite setting (3). We develop both full-step and damped-step proximal Newton methods to solve this problem and provide a rigorous theoretical convergence guarantee in both local and global sense.
We also study a quasi-Newton variant of our Newton scheme to minimize a generalized self-concordant function. Under a modification of the well-known Dennis-Moré condition Dennis1974 or a BFGS update, we show that our quasi-Newton method locally converges at a superlinear rate to the solution of the underlying problem.
Let us emphasize the following aspects of our contribution. Firstly, since the self-concordance notion is a powerful concept that has been widely used in interior-point methods as well as in other optimization schemes He2016 ; Lu2016a ; Tran-Dinh2013a ; zhang2015disco, generalizing it to a broader class of smooth convex functions can cover a number of new applications or lead to new methods for solving old problems, including logistic and multinomial logistic regression, optimization involving exponential objectives, and distance-weighted discrimination problems in support vector machines (see Table 1 below). Secondly, verifying theoretical assumptions for convergence guarantees of a Newton method is not trivial; our theory allows one to classify the underlying functions into different subclasses by using different parameters $\nu$ and $M_f$ in order to choose suitable algorithms to solve the corresponding optimization problem. Thirdly, the theory developed in this paper can potentially apply to other optimization methods such as gradient-type, sketching and sub-sampling Newton, and Frank-Wolfe algorithms, as done in the literature odor2016frank ; pilanci2015newton ; Roosta-Khorasani2016 ; roosta2016sub ; Tran-Dinh2013a. Finally, our generalization also shows that it is possible to impose additional structure, such as a self-concordant barrier, to develop path-following schemes or interior-point-type methods for solving a subclass of composite convex minimization problems of the form (3). We believe that our theory is not limited to convex optimization, but can be extended to solve convex-concave saddle-point problems and monotone equations/inclusions involving generalized self-concordant functions TranDinh2016c.
Summary of generalized self-concordant properties:
For ease of reference, we provide a short summary of the main properties of generalized self-concordant (gsc) functions below.
|Definitions 1 and 2||definitions of gsc functions|
|Proposition 1||sum of gsc functions|
|Proposition 2||affine transformation of gsc functions with||
|Proposition 3(a)||non-degenerate property|
|Proposition 4(a)||gsc and strong convexity|
|Proposition 4(b)||gsc and Lipschitz gradient continuity|
|Proposition 6||if is the conjugate of a gsc function
|Propositions 7, 8, 9, and 10||local norm, Hessian, gradient, and function value bounds|
Since the self-concordance concept was introduced in the 1990s Nesterov1994, its first extension was perhaps proposed by Bach2009 for a class of logistic regression problems. In TranDinh2014d, the authors extended Bach2009 to study proximal Newton methods for logistic, multinomial logistic, and exponential loss functions. By adding a strongly convex regularizer, Zhang and Lin in zhang2015disco showed that the regularized logistic loss function is indeed standard self-concordant. In Bach2013a, Bach continued exploiting his result in Bach2009 to show that the averaging stochastic gradient method can achieve the same best known convergence rate as in the strongly convex case without adding a regularizer. In Tran-Dinh2013a, the authors exploited the standard self-concordance theory in Nesterov1994 to develop several classes of optimization algorithms including proximal Newton, proximal quasi-Newton, and proximal gradient methods to solve composite convex minimization problems. In Lu2016a, Lu extended Tran-Dinh2013a to study randomized block coordinate descent methods. In a recent paper gao2016quasi, Gao and Goldfarb investigated quasi-Newton methods for self-concordant problems. As another example, peng2009self proposed an alternative to standard self-concordance, called self-regularity. The authors applied this theory to develop a new paradigm for interior-point methods. The theory developed in this paper, on the one hand, is a generalization of the well-known self-concordance notion developed in Nesterov1994; on the other hand, it also covers the work in Bach2009 ; Tran-Dinh2013b ; zhang2015disco as specific examples. Several concrete applications and extensions of the self-concordance notion can also be found in the literature, including He2016 ; Kyrillidis2014 ; odor2016frank ; peng2009self. Recently, cohen2017matrix exploited smooth structures of exponential functions to design interior-point methods for solving two fundamental problems in scientific computing called matrix scaling and balancing.
The rest of this paper is organized as follows. Section 2 develops the foundational theory for our generalized self-concordant functions including definitions, examples, basic properties, Fenchel’s conjugate, smoothing technique, and key bounds. Section 3 is devoted to studying full-step and damped-step Newton schemes to minimize a generalized self-concordant function, including their global and local convergence guarantees. Section 4 considers the composite setting (3), studies proximal Newton-type methods, and investigates their convergence guarantees. Section 5 deals with a quasi-Newton scheme for solving the noncomposite problem of (3). Numerical examples are provided in Section 6 to illustrate advantages of our theory. Finally, for clarity of presentation, several technical results and proofs are moved to the appendix.
2 Theory of generalized self-concordant functions
We generalize the class of self-concordant functions introduced by Nesterov and Nemirovskii in Nesterov2004 to a broader class of smooth and convex functions. We identify several examples of such functions. Then, we develop several properties of this function class by utilizing our new definitions.
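Given a proper, closed, and convex function $f : \mathbb{R}^p \to \mathbb{R}\cup\{+\infty\}$, we denote by $\mathrm{dom}(f) := \{x \in \mathbb{R}^p \mid f(x) < +\infty\}$ the domain of $f$, and by $\partial f(x)$ the subdifferential of $f$ at $x$. We use $\mathcal{C}^3(\mathrm{dom}(f))$ to denote the class of three times continuously differentiable functions on its open domain $\mathrm{dom}(f)$. We denote by $\nabla f$ its gradient map, by $\nabla^2 f$ its Hessian map, and by $\nabla^3 f$ its third-order derivative. For a twice continuously differentiable convex function $f$, $\nabla^2 f$ is symmetric positive semidefinite, and can be written as $\nabla^2 f(\cdot) \succeq 0$. If it is positive definite, then we write $\nabla^2 f(\cdot) \succ 0$.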
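Let $\mathbb{R}_{+}$ and $\mathbb{R}_{++}$ denote the sets of nonnegative and positive real numbers, respectively. We use $\mathcal{S}^p_{+}$ and $\mathcal{S}^p_{++}$ to denote the sets of symmetric positive semidefinite and symmetric positive definite matrices of the size $p \times p$, respectively. Given a matrix $H \in \mathcal{S}^p_{++}$, we define a weighted norm with respect to $H$ as $\|u\|_H := \langle Hu, u\rangle^{1/2}$ for $u \in \mathbb{R}^p$. The corresponding dual norm is $\|v\|_H^{*} := \langle H^{-1}v, v\rangle^{1/2}$. If $H = \mathbb{I}$, the identity matrix, then $\|u\|_H = \|u\|_H^{*} = \|u\|_2$, where $\|\cdot\|_2$ is the standard Euclidean norm. Note that $\|\cdot\|_2^{*} = \|\cdot\|_2$.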
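We say that $f$ is strongly convex with the strong convexity parameter $\mu_f > 0$ if $f(\cdot) - \frac{\mu_f}{2}\|\cdot\|^2$ is convex. We also say that $f$ has Lipschitz gradient if $\nabla f$ is Lipschitz continuous with the Lipschitz constant $L_f \in [0, +\infty)$, i.e., $\|\nabla f(x) - \nabla f(y)\|^{*} \le L_f\|x - y\|$ for all $x, y \in \mathrm{dom}(f)$.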
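For $f \in \mathcal{C}^3(\mathrm{dom}(f))$, if $\nabla^2 f(x) \succ 0$ at a given $x \in \mathrm{dom}(f)$, then we define a local norm $\|u\|_x := \langle \nabla^2 f(x)u, u\rangle^{1/2}$ as a weighted norm of $u$ with respect to $\nabla^2 f(x)$. The corresponding dual norm $\|v\|_x^{*}$ is defined as $\|v\|_x^{*} := \langle \nabla^2 f(x)^{-1}v, v\rangle^{1/2}$ for $v \in \mathbb{R}^p$.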
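As a small illustration of the local norm and its dual, the sketch below evaluates both for a two-dimensional function of our own choosing, $f(x) = -\ln x_1 - \ln x_2 + \frac{1}{2}(x_1 + x_2)^2$, whose Hessian is positive definite on $x > 0$. The function, the test vectors, and all helper names are hypothetical. The sketch also checks two standard facts: the generalized Cauchy-Schwarz inequality $|\langle u, v\rangle| \le \|u\|_x\|v\|_x^{*}$ and the identity $\|\nabla^2 f(x)u\|_x^{*} = \|u\|_x$.

```python
import math

def hessian(x):
    # Hessian of the illustrative f(x) = -ln(x1) - ln(x2) + 0.5 * (x1 + x2)^2.
    x1, x2 = x
    return [[1.0 / x1**2 + 1.0, 1.0],
            [1.0, 1.0 / x2**2 + 1.0]]

def inverse_2x2(H):
    # Closed-form inverse of a 2 x 2 matrix.
    (a, b), (c, d) = H
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad_form(A, w):
    # w^T A w for a 2 x 2 matrix A.
    return sum(w[i] * A[i][j] * w[j] for i in range(2) for j in range(2))

def local_norm(u, x):
    # ||u||_x = <H(x) u, u>^{1/2}
    return math.sqrt(quad_form(hessian(x), u))

def dual_local_norm(v, x):
    # ||v||_x^* = <H(x)^{-1} v, v>^{1/2}
    return math.sqrt(quad_form(inverse_2x2(hessian(x)), v))

x = [0.5, 2.0]
u = [1.0, -3.0]
v = [2.0, 0.7]
H = hessian(x)
Hu = [H[0][0] * u[0] + H[0][1] * u[1], H[1][0] * u[0] + H[1][1] * u[1]]

inner = u[0] * v[0] + u[1] * v[1]
print(abs(inner), local_norm(u, x) * dual_local_norm(v, x))
print(local_norm(u, x), dual_local_norm(Hu, x))
```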
2.1 Univariate generalized self-concordant functions
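Let $\varphi : \mathbb{R} \to \mathbb{R}$ be a three times continuously differentiable function on the open domain $\mathrm{dom}(\varphi)$. Then, we write $\varphi \in \mathcal{C}^3(\mathrm{dom}(\varphi))$. In this case, $\varphi$ is convex if and only if $\varphi''(t) \ge 0$ for all $t \in \mathrm{dom}(\varphi)$. We introduce the following definition.
Definition 1 Let $\varphi$ be a $\mathcal{C}^3(\mathrm{dom}(\varphi))$ and univariate function with open domain $\mathrm{dom}(\varphi)$, and $\nu > 0$ and $M_{\varphi} \ge 0$ be two constants. We say that $\varphi$ is $(M_{\varphi}, \nu)$-generalized self-concordant if
$$|\varphi'''(t)| \le M_{\varphi}\,\varphi''(t)^{\frac{\nu}{2}}, \quad \forall t \in \mathrm{dom}(\varphi). \qquad (4)$$
The inequality (4) also indicates that $\varphi''(t) \ge 0$ for all $t \in \mathrm{dom}(\varphi)$. Hence, $\varphi$ is convex. Clearly, if $\varphi(t) = \frac{a}{2}t^2 + bt$ for any constants $a \ge 0$ and $b \in \mathbb{R}$, we have $\varphi''(t) = a$ and $\varphi'''(t) = 0$. The inequality (4) is automatically satisfied for any $\nu > 0$ and $M_{\varphi} \ge 0$. The smallest value of $M_{\varphi}$ is zero. Hence, any convex quadratic function is $(0, \nu)$-generalized self-concordant for any $\nu > 0$. While (4) holds for any other constant $\hat{M}_{\varphi} \ge M_{\varphi}$, we often require that $M_{\varphi}$ is the smallest constant satisfying (4).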
Let us now provide some common examples satisfying Definition 1.
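Logistic functions: In Bach2009, Bach modified the standard self-concordant inequality in Nesterov1994 to obtain $|\varphi'''(t)| \le M_{\varphi}\,\varphi''(t)$, and showed that the well-known logistic loss $\varphi(t) := \ln(1 + e^{-t})$ satisfies this definition. In TranDinh2014d the authors also exploited this definition, and developed a class of first-order and second-order methods to solve composite convex minimization problems. Hence, $\varphi$ is a generalized self-concordant function with $M_{\varphi} = 1$ and $\nu = 2$.
Entropy function: We consider the well-known entropy function $\varphi(t) := t\ln(t)$ for $t > 0$. We can easily show that $|\varphi'''(t)| = \frac{1}{t^2} = \varphi''(t)^2$. Hence, it is generalized self-concordant with $M_{\varphi} = 1$ and $\nu = 4$ in the sense of Definition 1.
Arcsine distribution: We consider the function $\varphi(t) := \frac{1}{\sqrt{1 - t^2}}$ for $t \in (-1, 1)$. This function is convex and smooth. Moreover, we verify that it satisfies Definition 1 with $\nu = \frac{14}{5}$ and $M_{\varphi} = \frac{3\sqrt{495 - 105\sqrt{21}}}{(7 - \sqrt{21})^{7/5}} < 3.25$. We can generalize this function to $\varphi(t) := [(t - a)(b - t)]^{-q}$ for $t \in (a, b)$, where $a < b$ and $q > 0$. Then, we can show that $\nu = \frac{2(q+3)}{q+2} \in (2, 3)$.
Robust regression: Consider a monomial function $\varphi(t) := t^q$ for $q \in (1, 2)$ studied in yang2016rsg for robust regression used in statistics. Then, $\nu = \frac{2(3 - q)}{2 - q} \in (4, +\infty)$ and $M_{\varphi} = \frac{2 - q}{[q(q-1)]^{1/(2-q)}}$.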
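The examples above can be checked numerically. The sketch below evaluates the ratio $|\varphi'''(t)| / \varphi''(t)^{\nu/2}$ on a grid for each function and compares its maximum with a candidate constant $M_{\varphi}$. The second and third derivatives, as well as the pairs $(\nu, M_{\varphi})$, were worked out by hand for this sketch and should be read as assumptions rather than as values quoted from the references; the log-barrier $-\ln(t)$ is included as the standard self-concordant reference case, the exponent $q = 1.5$ for the monomial is an arbitrary choice in $(1, 2)$, and the monomial is evaluated on $t > 0$ only.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

Q = 1.5  # hypothetical exponent for the monomial example, q in (1, 2)

# name -> (phi'', phi''', nu, candidate M, sample points inside dom(phi))
EXAMPLES = {
    "logistic": (                      # phi(t) = ln(1 + exp(-t))
        lambda t: sigmoid(t) * (1.0 - sigmoid(t)),
        lambda t: sigmoid(t) * (1.0 - sigmoid(t)) * (1.0 - 2.0 * sigmoid(t)),
        2.0, 1.0, [-10.0 + 0.1 * i for i in range(201)],
    ),
    "entropy": (                       # phi(t) = t * ln(t)
        lambda t: 1.0 / t,
        lambda t: -1.0 / t**2,
        4.0, 1.0, [0.01 * i for i in range(1, 1001)],
    ),
    "log-barrier": (                   # phi(t) = -ln(t)
        lambda t: 1.0 / t**2,
        lambda t: -2.0 / t**3,
        3.0, 2.0, [0.01 * i for i in range(1, 1001)],
    ),
    "arcsine": (                       # phi(t) = 1 / sqrt(1 - t^2)
        lambda t: (1.0 + 2.0 * t**2) * (1.0 - t**2) ** (-2.5),
        lambda t: 3.0 * t * (3.0 + 2.0 * t**2) * (1.0 - t**2) ** (-3.5),
        14.0 / 5.0,
        3.0 * math.sqrt(495.0 - 105.0 * math.sqrt(21.0)) / (7.0 - math.sqrt(21.0)) ** 1.4,
        [-0.99 + 0.01 * i for i in range(199)],
    ),
    "monomial": (                      # phi(t) = t^q on t > 0
        lambda t: Q * (Q - 1.0) * t ** (Q - 2.0),
        lambda t: Q * (Q - 1.0) * (Q - 2.0) * t ** (Q - 3.0),
        2.0 * (3.0 - Q) / (2.0 - Q),
        (2.0 - Q) / (Q * (Q - 1.0)) ** (1.0 / (2.0 - Q)),
        [0.1 * i for i in range(1, 101)],
    ),
}

def worst_ratio(name):
    # Largest observed value of |phi'''(t)| / phi''(t)^(nu / 2) on the grid.
    d2, d3, nu, _, points = EXAMPLES[name]
    return max(abs(d3(t)) / d2(t) ** (nu / 2.0) for t in points)

worst = {name: worst_ratio(name) for name in EXAMPLES}
for name, value in worst.items():
    print(name, "nu =", EXAMPLES[name][2], "max ratio =", value, "M =", EXAMPLES[name][3])
```

A grid check of this kind cannot prove inequality (4), but it is a quick way to catch a wrong exponent or constant before attempting a proof.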
As concrete examples, the following table, Table 1, provides a non-exhaustive list of generalized self-concordant functions used in the literature.
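|Function name||Form of $\varphi(t)$||$\nu$||$M_{\varphi}$||$\mathrm{dom}(\varphi)$||Application||Lipschitz gradient||Reference|
|Log-barrier||$-\ln(t)$||3||2||$\mathbb{R}_{++}$||Poisson||no||Boyd2004 ; Nesterov2004 ; Nesterov1994|
|Exponential||$e^{-t}$||2||1||$\mathbb{R}$||AdaBoost, etc||no||cohen2017matrix ; lafferty2002boosting|
|Arcsine distribution||$\frac{1}{\sqrt{1 - t^2}}$||$\frac{14}{5}$||$< 3.25$||$(-1, 1)$||Random walks||no||Goel2006|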
All examples given in Table 1 fall into the case . However, we note that Definition 1 also covers (zhang2015disco, , Lemma 1) as a special case when . Unfortunately, as we will see in what follows, it is unclear how to generalize several properties of generalized self-concordance from univariate to multivariable functions for , except for strongly convex functions.
Table 1 only provides common generalized self-concordant functions used in practice. However, it is possible to combine these functions to obtain mixture functions that preserve the generalized self-concordant inequality given in Definition 1. For instance, the barrier entropy $\varphi(t) := t\ln(t) - \ln(t)$ is a standard self-concordant function, and it is the sum of the entropy $t\ln(t)$ and the negative logarithmic function $-\ln(t)$, which are generalized self-concordant with $\nu = 4$ and $\nu = 3$, respectively.
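The barrier entropy claim can be examined in the same numerical way. The sketch below assumes the form $\varphi(t) = t\ln(t) - \ln(t)$ on $t > 0$, with derivatives computed by hand, and checks that the ratio $|\varphi'''(t)| / \varphi''(t)^{3/2}$ stays below the standard self-concordance constant $2$ on a logarithmic grid. The grid and the function names are our own choices.

```python
def barrier_entropy_ratio(t, nu=3.0):
    # phi(t) = t * ln(t) - ln(t) on t > 0 (assumed form of the barrier entropy).
    d2 = 1.0 / t + 1.0 / t**2          # phi''(t)
    d3 = -1.0 / t**2 - 2.0 / t**3      # phi'''(t)
    return abs(d3) / d2 ** (nu / 2.0)

# Logarithmic grid from 1e-4 to 1e4.
points = [10.0 ** (k / 10.0) for k in range(-40, 41)]
worst_nu3 = max(barrier_entropy_ratio(t) for t in points)
print(worst_nu3)
```

In closed form the ratio equals $(t + 2)/(t + 1)^{3/2}$, which decreases from $2$ as $t$ grows, so the bound is approached only near the boundary $t \to 0$ where the logarithmic term dominates.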
2.2 Multivariate generalized self-concordant functions
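Let $f : \mathbb{R}^p \to \mathbb{R}$ be a $\mathcal{C}^3$ smooth and convex function with open domain $\mathrm{dom}(f)$. Given $\nabla^2 f$ the Hessian of $f$, $x \in \mathrm{dom}(f)$, and $u, v \in \mathbb{R}^p$, we consider the function $\psi(t) := \langle \nabla^2 f(x + tv)u, u\rangle$. Then, it is straightforward to show that
$$\psi'(t) = \langle \nabla^3 f(x + tv)[v]u, u\rangle$$
for $t \in \mathbb{R}$ such that $x + tv \in \mathrm{dom}(f)$, where $\nabla^3 f$ is the third-order derivative of $f$. It is clear that $\psi(0) = \langle \nabla^2 f(x)u, u\rangle = \|u\|_x^2$. By using the local norm, we generalize Definition 1 to multivariate functions as follows.
Definition 2 A $\mathcal{C}^3$-convex function $f : \mathbb{R}^p \to \mathbb{R}$ is said to be an $(M_f, \nu)$-generalized self-concordant function of the order $\nu > 0$ and the constant $M_f \ge 0$ if, for any $x \in \mathrm{dom}(f)$ and $u, v \in \mathbb{R}^p$, it holds
$$|\langle \nabla^3 f(x)[v]u, u\rangle| \le M_f\,\|u\|_x^2\,\|v\|_x^{\nu - 2}\,\|v\|_2^{3 - \nu}. \qquad (5)$$
Here, we use a convention that $\frac{0}{0} = 0$ for the case $\nu < 2$ or $\nu > 3$. We denote this class of functions by $\widetilde{\mathcal{F}}_{M_f,\nu}(\mathrm{dom}(f))$ (shortly, $\widetilde{\mathcal{F}}_{M_f,\nu}$ when $\mathrm{dom}(f)$ is explicitly defined).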
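To see what the multivariate definition asks for in a concrete case, the sketch below tests it on a logistic finite sum $f(x) = \sum_i \ln(1 + e^{-a_i^{\top}x})$. We assume that, for $\nu = 2$, inequality (5) reads $|\langle \nabla^3 f(x)[v]u, u\rangle| \le M_f\|u\|_x^2\|v\|_2$, and we take $M_f = \max_i \|a_i\|_2$ as the candidate constant; both the data and this constant are assumptions of the sketch. For this $f$, the forms $\langle \nabla^2 f(x)u, u\rangle$ and $\langle \nabla^3 f(x)[v]u, u\rangle$ are available in closed form, so no matrix or tensor needs to be stored.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def hess_quad(A, x, u):
    # <grad^2 f(x) u, u> = ||u||_x^2 for f(x) = sum_i ln(1 + exp(-a_i^T x)).
    total = 0.0
    for a in A:
        p = sigmoid(dot(a, x))
        total += p * (1.0 - p) * dot(a, u) ** 2
    return total

def third_form(A, x, u, v):
    # <grad^3 f(x)[v] u, u>, using phi'''(z) = p (1 - p) (1 - 2 p) with p = sigmoid(z).
    total = 0.0
    for a in A:
        p = sigmoid(dot(a, x))
        total += p * (1.0 - p) * (1.0 - 2.0 * p) * dot(a, v) * dot(a, u) ** 2
    return total

rng = random.Random(0)          # fixed seed so the run is reproducible
n, dim = 20, 4                  # hypothetical problem sizes
A = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
M_f = max(math.sqrt(dot(a, a)) for a in A)   # candidate constant (assumption)

violations = 0
for _ in range(1000):
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    lhs = abs(third_form(A, x, u, v))
    rhs = M_f * hess_quad(A, x, u) * math.sqrt(dot(v, v))
    if lhs > rhs * (1.0 + 1e-12) + 1e-15:
        violations += 1
print("violations:", violations)
```

The bound follows from $|\varphi'''| \le \varphi''$ for the logistic loss together with $|a_i^{\top}v| \le \|a_i\|_2\|v\|_2$, which is why the Euclidean norm of $v$ appears on the right-hand side when $\nu = 2$.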
Let us consider the following two extreme cases:
We emphasize that Definition 2 is not symmetric, but it avoids the use of multilinear mappings as required in Bach2009 ; Nesterov1994. However, by (Nesterov1994, Proposition 9.1.1) or (Nesterov2004, Lemma 4.1.2), Definition 2 with $\nu = 3$ is equivalent to (Nesterov2004, Definition 4.1.1) for standard self-concordant functions.
2.3 Basic properties of generalized self-concordant functions
We first show that if $f_1$ and $f_2$ are two generalized self-concordant functions, then $\beta_1 f_1 + \beta_2 f_2$ is also generalized self-concordant for any $\beta_1, \beta_2 > 0$ according to Definition 2.
Proposition 1 (Sum of generalized self-concordant functions)
Let be -generalized self-concordant functions satisfying (5), where and for . Then, for , , the function is well-defined on , and is -generalized self-concordant with the same order and the constant
It is sufficient to prove for . For , it follows from by induction. By (Nesterov2004, , Theorem 3.1.5), is a closed and convex function. In addition, . Let us fix some and . Then, by Definition 2, we have
Denote and for . We can derive
Let and . Then, and . Hence, the term in the square brackets of (6) becomes
Since and , we can upper bound as
The right-hand side function is linear in on . It achieves the maximum at its boundary. Hence, we have
Using this estimate into (6), we can show that is -generalized self-concordant with .
Using Proposition 1, we can also see that if is -generalized self-concordant, and , then is also -generalized self-concordant with the constant . The convex quadratic function with is -generalized self-concordant for any . Hence, by Proposition 1, if is -generalized self-concordant, then is also -generalized self-concordant.
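The effect of positive weights on the constant can be illustrated numerically in the standard case $\nu = 3$. The sketch below assumes that the constant in Proposition 1 has the form $M_f = \max_i \beta_i^{1 - \nu/2}M_{f_i}$, which for $\nu = 3$ reduces to the classical rule $\max_i M_{f_i}/\sqrt{\beta_i}$ for sums of self-concordant functions, and tests it on $\varphi(t) = \beta_1(-\ln t) + \beta_2(-\ln(1 - t))$ over $(0, 1)$. The weights, the grid, and the helper names are hypothetical.

```python
def predicted_constant(betas, constants, nu):
    # Assumed form of the constant for a weighted sum: max_i beta_i^(1 - nu/2) * M_i.
    return max(b ** (1.0 - nu / 2.0) * M for b, M in zip(betas, constants))

def weighted_barrier_ratio(t, beta1, beta2):
    # phi(t) = beta1 * (-ln t) + beta2 * (-ln(1 - t)) on (0, 1), with nu = 3.
    d2 = beta1 / t**2 + beta2 / (1.0 - t) ** 2
    d3 = -2.0 * beta1 / t**3 + 2.0 * beta2 / (1.0 - t) ** 3
    return abs(d3) / d2 ** 1.5

beta1, beta2 = 4.0, 0.25
# Each log-barrier term has constant 2 for nu = 3.
M_sum = predicted_constant([beta1, beta2], [2.0, 2.0], 3.0)

grid = [i / 1000.0 for i in range(1, 1000)]
worst = max(weighted_barrier_ratio(t, beta1, beta2) for t in grid)
print("predicted M:", M_sum, "largest observed ratio:", worst)
```

The small weight $\beta_2 = 0.25$ determines the constant here: scaling a function down by $\beta < 1$ increases its constant when $\nu > 2$, so the least-weighted term dominates.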
Next, we consider an affine transformation of a generalized self-concordant function.
Proposition 2 (Affine transformation)
Let be an affine transformation from to , and be an -generalized self-concordant function with . Then, the following statements hold:
If , then is -generalized self-concordant with .
If and , then is -generalized self-concordant with , where
is the smallest eigenvalue of.
Since , it is easy to show that and . Let us denote by , , and . Then, using Definition 2, we have
(a) If , then we have . Hence, the last inequality (7) implies
which shows that is -generalized self-concordant with .
(b) Note that , where is the smallest eigenvalue of . If and , then we have . Combining this estimate and (7), we can show that is -generalized self-concordant with .
Proposition 2 shows that generalized self-concordance is preserved under an affine transformation if $\nu \in [2, 3]$. If $\nu > 3$, then it requires the linear part $A$ of the transformation to be overcomplete, i.e., $\lambda_{\min}(A^{\top}A) > 0$. Hence, the theory developed in the sequel remains applicable for $\nu > 3$ if $A$ is overcomplete.
The following result is an extension of standard self-concordant functions , whose proof is very similar to (Nesterov2004, , Theorems 4.1.3, 4.1.4) by replacing the parameters and with the general parameters and (or ), respectively. We omit the detailed proof.
Let be an -generalized self-concordant function with . Then:
If and contains no straight line, then for any .
If there exists , the boundary of , then, for any , and any sequence such that , we have .
Note that Proposition 3(a) only holds for . If we consider for a given affine operator , then the non-degenerateness of is only guaranteed if is full-rank. Otherwise, it is non-degenerated in a given subspace of .
2.4 Generalized self-concordant functions with special structures
We first show that if a generalized self-concordant function is strongly convex or has a Lipschitz gradient, then it can be cast into the special case or .
Let be an -generalized self-concordant with . Then:
If and is also strongly convex on with the strong convexity parameter in -norm, then is also -generalized self-concordant with and .
If and is Lipschitz continuous with the Lipschitz constant in -norm, then is also -generalized self-concordant with and .
(a) If is strongly convex with the strong convexity parameter in -norm, then we have for any . Hence, . In this case, (5) leads to
Hence, is - generalized self-concordant with and .
(b) Since is Lipschitz continuous with the Lipschitz constant in -norm, we have for all , which leads to for all . On the other hand, with , we can show that
Hence, is also -generalized self-concordant with and .
Proposition 4 provides two important properties. If the gradient map of a generalized self-concordant function is Lipschitz continuous, we can always classify it into the special case . Therefore, we can exploit both structures: generalized self-concordance and Lipschitz gradient to develop better algorithms. This idea is also applied to generalized self-concordant and strongly convex functions.
Given smooth convex univariate functions satisfying (4) for with the same order , we consider the function defined by the following form:
where and are given vectors and numbers, respectively for . This convex function is called a finite sum and widely used in machine learning and statistics. The decomposable structure in (8) often appears in generalized linear models bollapragada2016exact ; byrd2016stochastic , and empirical risk minimization zhang2015disco , where is referred to as a loss function as can be found, e.g., in Table 1.
Finally, we show that if we regularize in (8) by a strongly convex quadratic term, then the resulting function becomes self-concordant. The proof can follow the same path as (zhang2015disco, , Lemma 2).
2.5 Fenchel’s conjugate of generalized self-concordant functions
Primal-dual theory is fundamental in convex optimization. Hence, it is important to study the Fenchel conjugate of generalized self-concordant functions.
Let be an -generalized self-concordant function. We consider Fenchel’s conjugate of as
Since is proper, closed, and convex, is well-defined and also proper, closed, and convex. Moreover, since is smooth and convex, by Fermat’s rule, if satisfies , then is well-defined at . This shows that