1 Introduction
The Newton method is a classical numerical scheme for solving systems of nonlinear equations and smooth optimization Nocedal2006; Ortega2000. However, there are at least two obstacles that prevent such methods from being applied to large-scale problems. Firstly, while these methods often achieve a fast local convergence rate, which can be up to quadratic, their global convergence has not been well understood Nesterov2006. In practice, one can use a damped-step scheme that exploits the Lipschitz constant of the objective derivatives to compute a suitable step-size, as is often done in gradient-type methods, or combine the algorithm with a globalization strategy such as line-search, trust-region, or filter techniques to guarantee a descent property Nocedal2006. Both strategies allow us to prove global convergence of the underlying Newton-type method in some sense. Unfortunately, in practice, there are several problems whose objective function does not have a globally Lipschitz gradient or Hessian, such as objectives involving logarithmic or reciprocal terms. This class of problems does not admit uniform bounds from which a constant step-size can be derived. On the other hand, using a globalization strategy to determine step-sizes often requires centralized computation, such as function evaluations, which prevents the use of distributed computation and stochastic descent methods. Secondly, Newton algorithms are second-order methods, which often incur a high per-iteration complexity due to the operations on the Hessian of the objective function or its approximations. In addition, these methods require the underlying functions to be smooth up to a given level, which does not often hold in practical models.
Motivation:
In recent years, there has been great interest in Newton-type methods for solving convex optimization problems and monotone equations, due to the development of new techniques and mathematical tools in optimization, machine learning, and randomized algorithms Becker2012a; byrd2016stochastic; Deuflhard2006; erdogdu2015convergence; Lee2014; Nesterov2006b; Nesterov2008b; pilanci2015newton; polyak2009regularized; RoostaKhorasani2016; roosta2016sub; TranDinh2013b. Several combinations of Newton-type methods with other techniques, such as proximal operators Bonnans1994a, cubic regularization Nesterov2006b, gradient regularization polyak2009regularized, and randomized algorithms such as sketching pilanci2015newton, subsampling erdogdu2015convergence, and fast eigendecomposition halko2009finding, have opened up a new research direction and attracted great attention for solving nonsmooth and large-scale problems. Hitherto, research in this direction remains focused on specific classes of problems for which standard assumptions such as nonsingularity and Lipschitz continuity of the Hessian are preserved. However, such assumptions do not hold for many other examples, as shown in TranDinh2013a. Moreover, even when they are satisfied, we often only obtain a lower bound on the possible step-sizes of our algorithm, which may lead to poor performance, especially on large-scale problems. In the seminal work Nesterov1994, Nesterov and Nemirovskii showed that the class of log-barriers does not satisfy the standard assumptions of the Newton method if the solution of the underlying problem is close to the boundary of the domain of the barrier function. They introduced a powerful concept called "self-concordance" to overcome this drawback and developed new Newton schemes that achieve global and local convergence without requiring any additional assumption or globalization strategy. While the self-concordance notion was initially invented to study interior-point methods, it is less well known in other communities. Recent works Bach2009; cohen2017matrix; monteiro2015hybrid; TranDinh2013a; TranDinh2016c; zhang2015disco have popularized this concept for solving other problems arising in machine learning, statistics, image processing, scientific computing, and variational inequalities.
Our goals:
In this paper, motivated by Bach2009; TranDinh2014d; zhang2015disco, we aim at generalizing the self-concordance concept in Nesterov1994 to a broader class of smooth and convex functions. To illustrate our idea, we consider a univariate smooth and convex function $\varphi \colon \mathbb{R} \to \mathbb{R}$. If $\varphi$ satisfies the inequality $|\varphi'''(t)| \le M_\varphi \varphi''(t)^{3/2}$ for all $t$ in the domain of $\varphi$ and for a given constant $M_\varphi \ge 0$, then we say that $\varphi$ is self-concordant (in Nesterov and Nemirovskii's sense Nesterov1994). We instead generalize this inequality to

(1)    $|\varphi'''(t)| \le M_\varphi \varphi''(t)^{\nu/2}$

for all $t$ in the domain of $\varphi$, and for given constants $\nu > 0$ and $M_\varphi \ge 0$.
We emphasize that generalizing from univariate to multivariate functions in the standard self-concordant case (i.e., $\nu = 3$) Nesterov1994 preserves several important properties, including the multilinear symmetry (Nesterov2004, Lemma 4.1.2), while, unfortunately, they no longer hold for the case $\nu \ne 3$. We therefore modify the definition in Nesterov1994 to overcome this drawback. Note that a similar idea has also been studied in Bach2009; TranDinh2014d for a class of logistic-type functions. Nevertheless, the definition using $\nu = 2$ in these papers is limited and still creates certain difficulties for developing further theory in general cases.
Our second goal is to develop a unified mechanism to analyze the convergence (including global and local convergence) of the following Newton-type scheme:

(2)    $x^{k+1} := x^{k} - \tau_k F'(x^k)^{-1} F(x^k),$
where $F$ can represent the right-hand side of a smooth monotone equation $F(x) = 0$, or the optimality condition of a convex optimization or convex-concave saddle-point problem, $F'$ is the Jacobian map of $F$, and $\tau_k > 0$ is a given step-size. Although the Newton scheme (2) is invariant to a change of variables Deuflhard2006, its convergence properties rely on the growth of the Hessian mapping along the Newton iterative process. In classical settings, the Lipschitz continuity and the nondegeneracy of the Hessian mapping in a neighborhood of a given solution are the key assumptions used to achieve a local quadratic convergence rate Deuflhard2006. These assumptions have been considered standard, but they are often very difficult to check in practice, especially the second requirement. A natural idea is to classify the functions of the underlying problem into a known class of functions in order to choose a suitable method for minimizing it. While first-order methods for convex optimization essentially rely on the Lipschitz continuity of the gradient, Newton schemes usually use the Lipschitz continuity of the Hessian mapping and its nondegeneracy to obtain a well-defined Newton direction, as we have mentioned. For self-concordant functions, the second condition automatically holds, while the first assumption fails to hold. However, both full-step and damped-step Newton methods still work in this case by appropriately choosing a suitable metric. This situation has been observed, and the standard assumptions have been modified in different directions to still guarantee the convergence of Newton-type methods; see Deuflhard2006 for an intensive study of generic Newton-type methods, and Nesterov1994; Nesterov2004 for the self-concordant function class.
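To make the scheme (2) concrete, the following Python sketch implements a damped-step Newton iteration for the special case $F = \nabla f$ with a smooth convex $f$. It is a minimal illustration only: the damping rule $\tau_k = 1/(1 + M\lambda_k)$, with $\lambda_k$ the Newton decrement, is the classical choice for standard self-concordant functions and is used here merely as a placeholder for the step-sizes derived later in this paper; the logistic-regression oracles below are illustrative assumptions, not part of our formal development.

```python
import numpy as np

def damped_newton(grad, hess, x0, M=1.0, tol=1e-8, max_iter=100):
    """Damped-step Newton scheme x+ = x - tau * H(x)^{-1} g(x), cf. (2).

    The damping tau = 1 / (1 + M * lam), with lam the Newton decrement,
    is a classical damped-Newton choice (standard self-concordance);
    it is used here purely for illustration.
    """
    x = x0.astype(float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        d = np.linalg.solve(H, -g)        # Newton direction
        lam = np.sqrt(d @ (H @ d))        # Newton decrement ||d||_x
        tau = 1.0 / (1.0 + M * lam)       # damped step-size
        x = x + tau * d
        if lam < tol:
            break
    return x

# Example: logistic regression f(x) = (1/n) sum_i log(1 + exp(-b_i a_i^T x)).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 5)), rng.choice([-1.0, 1.0], 50)

def grad(x):
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))    # sigma(-b_i a_i^T x)
    return -(A.T @ (b * s)) / len(b)

def hess(x):
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))
    return (A.T * (s * (1.0 - s))) @ A / len(b)

x_star = damped_newton(grad, hess, np.zeros(5))
```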
Our approach:
We first attempt to develop some background theory for a broad class of smooth and convex functions under the structure (1). By adopting the local norm defined via the Hessian mapping of such a convex function from Nesterov1994, we prove lower and upper bound estimates for the local norm distance between two points in the domain, as well as for the growth of the Hessian mapping. Together with this background theory, we also identify a class of functions used in generalized linear models mccullagh1989generalized; nelder1972generalized as well as in empirical risk minimization vapnik1998statistical that falls into our generalized self-concordance class for many well-known loss-type functions, as listed in Table 1.
Applying our generalized self-concordant theory, we then develop a class of Newton-type methods to solve the following composite convex minimization problem:

(3)    $F^\star := \min_{x \in \mathbb{R}^p} \left\{ F(x) := f(x) + g(x) \right\},$

where $f$ is a generalized self-concordant function in our context, and $g$ is a proper, closed, and convex function that can be referred to as a regularization term. We consider two cases. The first case is a non-composite convex problem in which $g$ vanishes (i.e., $g = 0$). In the second case, we assume that $g$ is equipped with a "tractable" proximal operator (see (34) for the definition).
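As a simple illustration of what we mean by a "tractable" proximal operator (the formal definition is given in (34)), the snippet below evaluates the proximal operator of the common regularizer $g(x) = \lambda \|x\|_1$ in closed form; this particular choice of $g$ is an assumption made here for illustration only.

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of g = lam * ||.||_1 (soft-thresholding):
    prox_g(x) = argmin_z { g(z) + 0.5 * ||z - x||_2^2 }."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Example: prox_l1([3, -0.2, 1], 0.5) -> [2.5, 0.0, 0.5]
print(prox_l1(np.array([3.0, -0.2, 1.0]), 0.5))
```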
Our contribution:
Our main contribution can be summarized as follows.

We generalize the self-concordant notion in Nesterov2004 to a broader class of smooth convex functions, which we call generalized self-concordance. We identify several loss-type functions that can be cast into our generalized self-concordant class. We also prove several fundamental properties and show that the sum and the linear transformation of generalized self-concordant functions are generalized self-concordant for a given range of $\nu$ or under suitable assumptions.
We develop lower and upper bounds on the Hessian mapping, the gradient map, and the function values of generalized self-concordant functions. These estimates are key to analyzing several numerical optimization methods, including Newton-type methods.

We propose a class of Newton methods, including full-step and damped-step schemes, to minimize a generalized self-concordant function. We explicitly show how to choose a suitable step-size to guarantee a descent property in the damped-step scheme, and prove a local quadratic convergence for both the damped-step and the full-step schemes using a suitable metric.

We also extend our Newton schemes to handle the composite setting (3). We develop both full-step and damped-step proximal Newton methods to solve this problem and provide a rigorous theoretical convergence guarantee in both the local and global sense.

We also study a quasi-Newton variant of our Newton scheme to minimize a generalized self-concordant function. Under a modification of the well-known Dennis-Moré condition Dennis1974 or a BFGS update, we show that our quasi-Newton method locally converges at a superlinear rate to the solution of the underlying problem.
Let us emphasize the following aspects of our contribution. Firstly, since the self-concordance notion is a powerful concept that has been widely used in interior-point methods as well as in other optimization schemes He2016; Lu2016a; TranDinh2013a; zhang2015disco, generalizing it to a broader class of smooth convex functions can substantially cover a number of new applications and lead to new methods for solving old problems, including logistic and multinomial logistic regression, optimization involving exponential objectives, and distance-weighted discrimination problems in support vector machines (see Table 1 below). Secondly, verifying the theoretical assumptions behind the convergence guarantees of a Newton method is not trivial; our theory allows one to classify the underlying functions into different subclasses by using the parameters $\nu$ and $M_f$ in order to choose suitable algorithms for the corresponding optimization problem. Thirdly, the theory developed in this paper can potentially be applied to other optimization methods, such as gradient-type, sketching and subsampling Newton, and Frank-Wolfe algorithms, as done in the literature odor2016frank; pilanci2015newton; RoostaKhorasani2016; roosta2016sub; TranDinh2013a. Finally, our generalization also shows that it is possible to impose additional structure, such as a self-concordant barrier, to develop path-following schemes or interior-point-type methods for solving a subclass of composite convex minimization problems of the form (3). We believe that our theory is not limited to convex optimization, but can be extended to solve convex-concave saddle-point problems and monotone equations/inclusions involving generalized self-concordant functions TranDinh2016c.
Summary of generalized self-concordant properties:
For ease of reference, we provide a short summary of the main properties of generalized self-concordant (gsc) functions below.
Result | Property | Range of $\nu$

Definitions 1 and 2 | definitions of gsc functions | $\nu > 0$
Proposition 1 | sum of gsc functions | $\nu \ge 2$
Proposition 2 | affine transformation of gsc functions with $\mathcal{A}(x) := Ax + b$ | $\nu \in (0, 3]$ for general $A$; $\nu > 3$ for over-completed $A$
Proposition 3(a) | nondegenerate property | $\nu \ge 2$
Proposition 3(b) | unboundedness | $\nu \ge 2$
Proposition 4(a) | gsc and strong convexity | $\nu \in (0, 3]$
Proposition 4(b) | gsc and Lipschitz gradient continuity | $\nu \ge 2$
Proposition 6 | if $f^*$ is the conjugate of a gsc function $f$, then $f^*$ is gsc | separate ranges of $\nu$ for the univariate and multivariate cases (see Proposition 6)
Propositions 7, 8, 9, and 10 | local norm, Hessian, gradient, and function value bounds | $\nu \in [2, 3]$
Although several results hold for a different range of $\nu$, the complete theory only holds for $\nu \in [2, 3]$. However, this range is sufficient to cover two important cases: $\nu = 2$ in Bach2009; Bach2013a and $\nu = 3$ in Nesterov1994.
Related work:
Since the self-concordance concept was introduced in the 1990s Nesterov1994, its first extension was perhaps proposed in Bach2009 for a class of logistic regression problems. In TranDinh2014d, the authors extended Bach2009 to study proximal Newton methods for logistic, multinomial logistic, and exponential loss functions. By augmenting a strongly convex regularizer, Zhang and Lin zhang2015disco showed that the regularized logistic loss function is indeed standard self-concordant. In Bach2013a, Bach continued exploiting his result in Bach2009 to show that the averaging stochastic gradient method can achieve the same best-known convergence rate as in the strongly convex case without adding a regularizer. In TranDinh2013a, the authors exploited the standard self-concordance theory in Nesterov1994 to develop several classes of optimization algorithms, including proximal Newton, proximal quasi-Newton, and proximal gradient methods, to solve composite convex minimization problems. In Lu2016a, Lu extended TranDinh2013a to study randomized block coordinate descent methods. In a recent paper gao2016quasi, Gao and Goldfarb investigated quasi-Newton methods for self-concordant problems. As another example, peng2009self proposed an alternative to the standard self-concordance, called self-regularity, and applied this theory to develop a new paradigm for interior-point methods. The theory developed in this paper, on the one hand, is a generalization of the well-known self-concordance notion developed in Nesterov1994; on the other hand, it also covers the work in Bach2009; TranDinh2013b; zhang2015disco as specific examples. Several concrete applications and extensions of the self-concordance notion can also be found in the literature, including He2016; Kyrillidis2014; odor2016frank; peng2009self. Recently, cohen2017matrix exploited the smooth structure of exponential functions to design interior-point methods for solving two fundamental problems in scientific computing called matrix scaling and balancing.
Paper organization:
The rest of this paper is organized as follows. Section 2 develops the foundational theory of our generalized self-concordant functions, including definitions, examples, basic properties, Fenchel's conjugate, a smoothing technique, and key bounds. Section 3 is devoted to studying full-step and damped-step Newton schemes to minimize a generalized self-concordant function, including their global and local convergence guarantees. Section 4 considers the composite setting (3), studies proximal Newton-type methods, and investigates their convergence guarantees. Section 5 deals with a quasi-Newton scheme for solving the non-composite version of (3). Numerical examples are provided in Section 6 to illustrate the advantages of our theory. Finally, for clarity of presentation, several technical results and proofs are deferred to the appendix.
2 Theory of generalized self-concordant functions
We generalize the class of self-concordant functions introduced by Nesterov and Nemirovskii in Nesterov2004 to a broader class of smooth and convex functions. We identify several examples of such functions. Then, we develop several properties of this function class by utilizing our new definitions.
Notation:
Given a proper, closed, and convex function $f \colon \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$, we denote by $\mathrm{dom}(f) := \{x \in \mathbb{R}^p : f(x) < +\infty\}$ the domain of $f$, and by $\partial f(x)$ the subdifferential of $f$ at $x$. We use $C^3(\mathrm{dom}(f))$ to denote the class of three times continuously differentiable functions on the open domain $\mathrm{dom}(f)$. We denote by $\nabla f$ its gradient map, by $\nabla^2 f$ its Hessian map, and by $\nabla^3 f$ its third-order derivative. For a twice continuously differentiable convex function $f$, $\nabla^2 f(x)$ is symmetric positive semidefinite, which we write as $\nabla^2 f(x) \succeq 0$. If it is positive definite, then we write $\nabla^2 f(x) \succ 0$.
Let $\mathbb{R}_+$ and $\mathbb{R}_{++}$ denote the sets of nonnegative and positive real numbers, respectively. We use $\mathcal{S}^p_+$ and $\mathcal{S}^p_{++}$ to denote the sets of symmetric positive semidefinite and symmetric positive definite matrices of size $p \times p$, respectively. Given a matrix $H \in \mathcal{S}^p_{++}$, we define a weighted norm with respect to $H$ as $\|u\|_H := \langle Hu, u\rangle^{1/2}$ for $u \in \mathbb{R}^p$. The corresponding dual norm is $\|v\|_H^* := \langle H^{-1}v, v\rangle^{1/2}$. If $H = \mathbb{I}$, the identity matrix, then $\|u\|_H = \|u\|_2$, where $\|\cdot\|_2$ is the standard Euclidean norm. Note that $\langle u, v\rangle \le \|u\|_H \|v\|_H^*$ for any $u, v \in \mathbb{R}^p$. We say that $f$ is strongly convex with the strong convexity parameter $\mu_f > 0$ if $f(\cdot) - \tfrac{\mu_f}{2}\|\cdot\|_2^2$ is convex. We also say that $f$ has Lipschitz gradient if $\nabla f$ is Lipschitz continuous with the Lipschitz constant $L_f \in [0, +\infty)$, i.e., $\|\nabla f(x) - \nabla f(y)\|_2 \le L_f \|x - y\|_2$ for all $x, y \in \mathrm{dom}(f)$.
For $f \in C^2(\mathrm{dom}(f))$, if $\nabla^2 f(x) \succ 0$ at a given $x \in \mathrm{dom}(f)$, then we define the local norm $\|u\|_x := \langle \nabla^2 f(x)u, u\rangle^{1/2}$ as a weighted norm of $u$ with respect to $\nabla^2 f(x)$. The corresponding dual norm $\|v\|_x^*$ is defined as $\|v\|_x^* := \max\{\langle v, u\rangle : \|u\|_x \le 1\} = \langle \nabla^2 f(x)^{-1}v, v\rangle^{1/2}$ for $v \in \mathbb{R}^p$.
2.1 Univariate generalized self-concordant functions
Let $\varphi \colon \mathbb{R} \to \mathbb{R}$ be a three times continuously differentiable function on the open domain $\mathrm{dom}(\varphi)$. Then, we write $\varphi \in C^3(\mathrm{dom}(\varphi))$. In this case, $\varphi$ is convex if and only if $\varphi''(t) \ge 0$ for all $t \in \mathrm{dom}(\varphi)$. We introduce the following definition.
Definition 1
Let $\varphi \in C^3(\mathrm{dom}(\varphi))$ be a univariate function with open domain $\mathrm{dom}(\varphi)$, and let $M_\varphi \ge 0$ and $\nu > 0$ be two constants. We say that $\varphi$ is $(M_\varphi, \nu)$-generalized self-concordant if

(4)    $|\varphi'''(t)| \le M_\varphi \varphi''(t)^{\nu/2}$ for all $t \in \mathrm{dom}(\varphi)$.
The inequality (4) also indicates that $\varphi''(t) \ge 0$ for all $t \in \mathrm{dom}(\varphi)$; hence, $\varphi$ is convex. Clearly, if $\varphi(t) := \tfrac{a}{2}t^2 + bt + c$ for any constants $a \ge 0$ and $b, c \in \mathbb{R}$, we have $\varphi''(t) = a \ge 0$ and $\varphi'''(t) = 0$. The inequality (4) is then automatically satisfied for any $\nu > 0$ and $M_\varphi \ge 0$, and the smallest value of $M_\varphi$ is zero. Hence, any convex quadratic function is generalized self-concordant for any $\nu > 0$. While (4) also holds for any constant larger than $M_\varphi$, we often require that $M_\varphi$ be the smallest constant satisfying (4).
Example 1
Let us now provide some common examples satisfying Definition 1.

Standard self-concordant functions: If we choose $\nu = 3$, then (4) becomes $|\varphi'''(t)| \le M_\varphi \varphi''(t)^{3/2}$, which recovers the standard self-concordant functions introduced in Nesterov1994.

Logistic functions: In Bach2009, Bach modified the standard self-concordant inequality in Nesterov1994 to obtain $|\varphi'''(t)| \le M_\varphi \varphi''(t)$, and showed that the well-known logistic loss $\varphi(t) := \log(1 + e^{-t})$ satisfies this definition. In TranDinh2014d, the authors also exploited this definition and developed a class of first-order and second-order methods to solve composite convex minimization problems. Hence, the logistic loss is a generalized self-concordant function with $\nu = 2$ and $M_\varphi = 1$ (see the numerical check after this list).

Exponential functions: The exponential function $\varphi(t) := e^{-t}$ also satisfies (4) with $\nu = 2$ and $M_\varphi = 1$. This function is often used, e.g., in AdaBoost lafferty2002boosting, or in matrix scaling cohen2017matrix.

Distance-weighted discrimination (DWD): We consider the more general function $\varphi(t) := t^{-q}$ on $\mathrm{dom}(\varphi) = \mathbb{R}_{++}$ with $q > 0$, studied in marron2007distance for DWD in support vector machines. As shown in Table 1, this function satisfies Definition 1 with $\nu = \frac{2(q+3)}{q+2} \in (2, 3)$ and the corresponding constant $M_\varphi$ given there.

Entropy function: We consider the well-known entropy function $\varphi(t) := t\log t$ for $t > 0$. We can easily show that $|\varphi'''(t)| = t^{-2} = \varphi''(t)^2$. Hence, it is generalized self-concordant with $\nu = 4$ and $M_\varphi = 1$ in the sense of Definition 1.

Arcsine distribution: We consider the function $\varphi(t) := \frac{1}{\sqrt{1 - t^2}}$ for $t \in (-1, 1)$. This function is convex and smooth. Moreover, we can verify that it satisfies Definition 1 with $\nu = \frac{14}{5}$. This function can be generalized, e.g., to $\varphi(t) := \left((t - a)(b - t)\right)^{-1/2}$ for $t \in (a, b)$ with $a < b$, which remains generalized self-concordant with $\nu = \frac{14}{5}$ by the affine-transformation and scaling properties developed below.

Robust regression: Consider the monomial function $\varphi(t) := t^q$ for $t > 0$ and $q \in (1, 2)$, studied in yang2016rsg for robust regression in statistics. Then, $\varphi$ satisfies Definition 1 with $\nu = \frac{2(3-q)}{2-q}$ and $M_\varphi = (2-q)\left(q(q-1)\right)^{-\frac{1}{2-q}}$.
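The following Python snippet numerically checks the defining inequality (4) for several of the examples above on a sample grid (the derivatives are hard-coded from the closed forms); it is a sanity check on sampled points, not a proof.

```python
import numpy as np

# Numerical sanity check of Definition 1: |phi'''(t)| <= M * phi''(t)**(nu/2)
# for a few of the univariate examples above (second and third derivatives
# are hard-coded from the closed-form expressions).
examples = {
    # name: (phi'', phi''', nu, M, sample grid)
    "logistic":    (lambda t: np.exp(-t) / (1 + np.exp(-t))**2,
                    lambda t: -np.exp(-t) * (1 - np.exp(-t)) / (1 + np.exp(-t))**3,
                    2.0, 1.0, np.linspace(-5, 5, 1001)),
    "exponential": (lambda t: np.exp(-t), lambda t: -np.exp(-t),
                    2.0, 1.0, np.linspace(-5, 5, 1001)),
    "log-barrier": (lambda t: 1 / t**2, lambda t: -2 / t**3,
                    3.0, 2.0, np.linspace(0.01, 10, 1001)),
    "entropy":     (lambda t: 1 / t, lambda t: -1 / t**2,
                    4.0, 1.0, np.linspace(0.01, 10, 1001)),
}
for name, (d2, d3, nu, M, ts) in examples.items():
    # small relative slack guards against rounding when (4) holds with equality
    ok = np.all(np.abs(d3(ts)) <= M * d2(ts)**(nu / 2) * (1 + 1e-9))
    print(f"{name:12s} nu={nu} M={M} inequality (4) holds: {ok}")
```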
As concrete examples, Table 1 provides a non-exhaustive list of generalized self-concordant functions used in the literature.
Function name | Form of $\varphi(t)$ | $\nu$ | $M_\varphi$ | Application | Lipschitz gradient? | References

Log-barrier | $-\log t$ | $3$ | $2$ | Poisson | no | Boyd2004; Nesterov2004; Nesterov1994
Entropy-barrier | $t\log t - \log t$ | $3$ | $2$ | Interior-point | no | Nesterov2004
Logistic | $\log(1 + e^{-t})$ | $2$ | $1$ | Classification | yes | Hosmer2005
Exponential | $e^{-t}$ | $2$ | $1$ | AdaBoost, etc. | no | cohen2017matrix; lafferty2002boosting
Negative power | $t^{-q}$, $t > 0$, $q > 0$ | $\frac{2(q+3)}{q+2}$ | $(q+2)\left(q(q+1)\right)^{-\frac{1}{q+2}}$ | DWD | no | marron2007distance
Arcsine distribution | $\frac{1}{\sqrt{1-t^2}}$, $|t| < 1$ | $\frac{14}{5}$ | $\approx 3.25$ | Random walks | no | Goel2006
Positive power | $t^{q}$, $t > 0$, $q \in (1, 2)$ | $\frac{2(3-q)}{2-q}$ | $(2-q)\left(q(q-1)\right)^{-\frac{1}{2-q}}$ | Regression | no | yang2016rsg
Entropy | $t\log t$ | $4$ | $1$ | KL divergence | no | Boyd2004
Remark 1
All examples given in Table 1 fall into the case $\nu \ge 2$. However, we note that Definition 1 also covers (zhang2015disco, Lemma 1) as a special case. Unfortunately, as we will see in what follows, it is unclear how to generalize several properties of generalized self-concordance from univariate to multivariate functions for $\nu > 3$, except for strongly convex functions.
Table 1 only lists common generalized self-concordant functions used in practice. However, it is possible to combine these functions to obtain mixtures that preserve the generalized self-concordant inequality given in Definition 1. For instance, the barrier entropy $t\log t - \log t$ is a standard self-concordant function, and it is the sum of the entropy $t\log t$ and the negative logarithm $-\log t$, which are generalized self-concordant with $\nu = 4$ and $\nu = 3$, respectively.
2.2 Multivariate generalized self-concordant functions
Let $f \colon \mathbb{R}^p \to \mathbb{R}$ be a smooth and convex function with open domain $\mathrm{dom}(f)$. Given the Hessian $\nabla^2 f$ of $f$, a point $x \in \mathrm{dom}(f)$, and directions $u, v \in \mathbb{R}^p$, we consider the function $\psi(t) := \langle \nabla^2 f(x + tv)u, u\rangle$. Then, it is straightforward to show that $\psi'(t) = \langle \nabla^3 f(x + tv)[v]u, u\rangle$ for $t$ such that $x + tv \in \mathrm{dom}(f)$, where $\nabla^3 f$ is the third-order derivative of $f$. It is clear that $\psi'(0) = \langle \nabla^3 f(x)[v]u, u\rangle$. By using the local norm, we generalize Definition 1 to multivariate functions as follows.
Definition 2
A convex function $f \in C^3(\mathrm{dom}(f))$ with open domain $\mathrm{dom}(f)$ is said to be an $(M_f, \nu)$-generalized self-concordant function of the order $\nu > 0$ and the constant $M_f \ge 0$ if, for any $x \in \mathrm{dom}(f)$ and $u, v \in \mathbb{R}^p$, it holds that

(5)    $\left|\langle \nabla^3 f(x)[v]u, u\rangle\right| \le M_f \, \|u\|_x^2 \, \|v\|_x^{\nu - 2} \, \|v\|_2^{3 - \nu}.$

Here, we use the convention $0^0 = 1$ for the case $\nu = 2$ or $\nu = 3$. We denote this class of functions by $\widetilde{\mathcal{F}}_{M_f, \nu}(\mathrm{dom}(f))$ (shortly, $\widetilde{\mathcal{F}}_{M_f, \nu}$ when $\mathrm{dom}(f)$ is explicitly defined).
Let us consider the following two extreme cases:

If $\nu = 2$, then (5) becomes $\left|\langle \nabla^3 f(x)[v]u, u\rangle\right| \le M_f \|u\|_x^2 \|v\|_2$, which can be viewed as a multivariate version of the condition used in Bach2009.

If $\nu = 3$ and $u = v$, then (5) reduces to $\left|\langle \nabla^3 f(x)[u]u, u\rangle\right| \le M_f \|u\|_x^3$, and Definition 2 becomes the standard self-concordance definition introduced in Nesterov2004; Nesterov1994.
We emphasize that Definition 2 is not symmetric in $u$ and $v$, but it avoids the use of multilinear mappings as required in Bach2009; Nesterov1994. However, by (Nesterov1994, Proposition 9.1.1) or (Nesterov2004, Lemma 4.1.2), Definition 2 with $\nu = 3$ and $u = v$ is equivalent to (Nesterov2004, Definition 4.1.1) for standard self-concordant functions.
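Analogously to the univariate case, one can numerically probe Definition 2: the quantity $\langle \nabla^3 f(x)[v]u, u\rangle = \psi'(0)$ can be approximated by finite differences of the Hessian map. The sketch below does this for a logistic finite sum with $\nu = 2$; the constant $M_f = \max_i \|a_i\|_2$ used here anticipates the affine-transformation and sum rules proved in the following subsections, and the check runs over random samples only.

```python
import numpy as np

# Probe (5) with nu = 2 for f(x) = (1/n) sum_i log(1 + exp(-b_i a_i^T x)).
# psi(t) = <H(x + t v) u, u>, and psi'(0) = <D^3 f(x)[v] u, u> is
# approximated by a central difference of the Hessian map H.
rng = np.random.default_rng(1)
n, p = 40, 4
A, b = rng.standard_normal((n, p)), rng.choice([-1.0, 1.0], n)

def hess(x):
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))       # sigma(-b_i a_i^T x)
    return (A.T * (s * (1.0 - s))) @ A / n

M_f = np.max(np.linalg.norm(A, axis=1))         # anticipates Prop. 2 / Cor. 1
h = 1e-5
for _ in range(1000):
    x, u, v = (rng.standard_normal(p) for _ in range(3))
    lhs = (u @ (hess(x + h * v) - hess(x - h * v)) @ u) / (2.0 * h)
    rhs = M_f * (u @ hess(x) @ u) * np.linalg.norm(v)   # M ||u||_x^2 ||v||_2
    assert abs(lhs) <= rhs * (1.0 + 1e-4)       # (5) holds up to FD error
print("inequality (5) verified on 1000 random samples")
```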
2.3 Basic properties of generalized self-concordant functions
We first show that if $f_1$ and $f_2$ are two generalized self-concordant functions, then $\beta_1 f_1 + \beta_2 f_2$ is also generalized self-concordant for any $\beta_1, \beta_2 > 0$, in the sense of Definition 2.
Proposition 1 (Sum of generalized self-concordant functions)
Let $f_i$ be $(M_{f_i}, \nu)$-generalized self-concordant functions satisfying (5), where $\nu \ge 2$ and $M_{f_i} \ge 0$ for $i = 1, \dots, m$. Then, for any $\beta_i > 0$, $i = 1, \dots, m$, the function $f(x) := \sum_{i=1}^m \beta_i f_i(x)$ is well-defined on $\mathrm{dom}(f) := \cap_{i=1}^m \mathrm{dom}(f_i)$, and is $(M_f, \nu)$-generalized self-concordant with the same order $\nu$ and the constant $M_f := \max\left\{\beta_i^{1 - \nu/2} M_{f_i} : 1 \le i \le m\right\}$.
Proof
It is sufficient to prove the statement for $m = 2$; for $m > 2$, it follows by induction. By (Nesterov2004, Theorem 3.1.5), $f$ is a closed and convex function. In addition, $\mathrm{dom}(f) = \mathrm{dom}(f_1) \cap \mathrm{dom}(f_2)$. Let us fix some $x \in \mathrm{dom}(f)$ and $u, v \in \mathbb{R}^p$. Then, by Definition 2, we have

$\left|\langle \nabla^3 f(x)[v]u, u\rangle\right| \le \sum_{i=1}^2 \beta_i \left|\langle \nabla^3 f_i(x)[v]u, u\rangle\right| \le \sum_{i=1}^2 \beta_i M_{f_i} \left(u^\top \nabla^2 f_i(x) u\right) \left(v^\top \nabla^2 f_i(x) v\right)^{\frac{\nu-2}{2}} \|v\|_2^{3-\nu}.$

Denote $K_i := u^\top \nabla^2 f_i(x) u \ge 0$ and $T_i := v^\top \nabla^2 f_i(x) v \ge 0$ for $i = 1, 2$. Since $\|u\|_x^2 = \beta_1 K_1 + \beta_2 K_2$ and $\|v\|_x^2 = \beta_1 T_1 + \beta_2 T_2$, we can derive

(6)    $\frac{\left|\langle \nabla^3 f(x)[v]u, u\rangle\right|}{\|u\|_x^2 \, \|v\|_x^{\nu-2} \, \|v\|_2^{3-\nu}} \le \left[\frac{\beta_1 M_{f_1} K_1 T_1^{\frac{\nu-2}{2}} + \beta_2 M_{f_2} K_2 T_2^{\frac{\nu-2}{2}}}{(\beta_1 K_1 + \beta_2 K_2)(\beta_1 T_1 + \beta_2 T_2)^{\frac{\nu-2}{2}}}\right].$

Let $s := \frac{\beta_1 K_1}{\beta_1 K_1 + \beta_2 K_2}$ and $t := \frac{\beta_1 T_1}{\beta_1 T_1 + \beta_2 T_2}$. Then, $s, t \in [0, 1]$, $1 - s = \frac{\beta_2 K_2}{\beta_1 K_1 + \beta_2 K_2}$, and $1 - t = \frac{\beta_2 T_2}{\beta_1 T_1 + \beta_2 T_2}$. Hence, the term in the square brackets of (6) becomes

$h(s, t) := \beta_1^{1-\nu/2} M_{f_1} \, s \, t^{\frac{\nu-2}{2}} + \beta_2^{1-\nu/2} M_{f_2} (1 - s)(1 - t)^{\frac{\nu-2}{2}}.$

Since $\nu \ge 2$ and $s, t \in [0, 1]$, we have $t^{\frac{\nu-2}{2}} \le 1$ and $(1-t)^{\frac{\nu-2}{2}} \le 1$, so we can upper bound $h(s,t)$ as

$h(s, t) \le \beta_1^{1-\nu/2} M_{f_1} s + \beta_2^{1-\nu/2} M_{f_2} (1 - s).$

The right-hand side is linear in $s$ on $[0, 1]$; it achieves its maximum at the boundary. Hence, we have

$h(s, t) \le \max\left\{\beta_1^{1-\nu/2} M_{f_1},\ \beta_2^{1-\nu/2} M_{f_2}\right\}.$

Using this estimate in (6), we can show that $f$ is $(M_f, \nu)$-generalized self-concordant with $M_f := \max\left\{\beta_1^{1-\nu/2} M_{f_1}, \beta_2^{1-\nu/2} M_{f_2}\right\}$.
Using Proposition 1, we can also see that if $f$ is $(M_f, \nu)$-generalized self-concordant and $\beta > 0$, then $\beta f$ is generalized self-concordant with the constant $M_{\beta f} := \beta^{1-\nu/2} M_f$. The convex quadratic function $q(x) := \tfrac{1}{2}x^\top Q x + c^\top x + d$ with $Q \in \mathcal{S}^p_+$ is generalized self-concordant with $M_q = 0$ for any $\nu > 0$. Hence, by Proposition 1, if $f$ is $(M_f, \nu)$-generalized self-concordant, then $f + q$ is also $(M_f, \nu)$-generalized self-concordant.
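In code, the constant of Proposition 1 is a one-line formula; the following trivial helper is included only to make it concrete.

```python
def sum_gsc_constant(Ms, betas, nu):
    """Constant of f = sum_i betas[i] * f_i from Proposition 1, where each
    f_i is (Ms[i], nu)-gsc with nu >= 2: M_f = max_i betas[i]**(1-nu/2) * Ms[i]."""
    assert nu >= 2 and all(beta > 0 for beta in betas)
    return max(beta ** (1.0 - nu / 2.0) * M for beta, M in zip(betas, Ms))

# Example: two standard self-concordant functions (nu = 3, M = 2),
# scaled by 2 and 0.5:  M_f = max{2/sqrt(2), 2*sqrt(2)} = 2*sqrt(2).
print(sum_gsc_constant([2.0, 2.0], [2.0, 0.5], 3.0))
```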
Next, we consider an affine transformation of a generalized self-concordant function.
Proposition 2 (Affine transformation)
Let $\mathcal{A}(x) := Ax + b$ be an affine transformation from $\mathbb{R}^p$ to $\mathbb{R}^n$, and let $f$ be an $(M_f, \nu)$-generalized self-concordant function with $\nu > 0$. Then, the following statements hold:

(a) If $\nu \in (0, 3]$, then $g(x) := f(\mathcal{A}(x))$ is $(M_g, \nu)$-generalized self-concordant with $M_g := M_f \|A\|^{3-\nu}$.

(b) If $\nu > 3$ and $\lambda_{\min}(A^\top A) > 0$, then $g(x) := f(\mathcal{A}(x))$ is $(M_g, \nu)$-generalized self-concordant with $M_g := M_f \lambda_{\min}(A^\top A)^{\frac{3-\nu}{2}}$.
Proof
Since $g(x) = f(\mathcal{A}(x)) = f(Ax + b)$, it is easy to show that $\nabla^2 g(x) = A^\top \nabla^2 f(\mathcal{A}(x)) A$ and $\nabla^3 g(x)[v] = A^\top \left(\nabla^3 f(\mathcal{A}(x))[Av]\right) A$. Let us denote $\tilde{x} := \mathcal{A}(x)$, $\tilde{u} := Au$, and $\tilde{v} := Av$. Then, $\|\tilde{u}\|_{\tilde{x}}^2 = u^\top A^\top \nabla^2 f(\mathcal{A}(x)) A u = \|u\|_x^2$ and $\|\tilde{v}\|_{\tilde{x}} = \|v\|_x$ with respect to the local norms of $f$ and $g$, respectively. Then, using Definition 2, we have

(7)    $\left|\langle \nabla^3 g(x)[v]u, u\rangle\right| = \left|\langle \nabla^3 f(\tilde{x})[\tilde{v}]\tilde{u}, \tilde{u}\rangle\right| \le M_f \|\tilde{u}\|_{\tilde{x}}^2 \, \|\tilde{v}\|_{\tilde{x}}^{\nu-2} \, \|\tilde{v}\|_2^{3-\nu} = M_f \|u\|_x^2 \, \|v\|_x^{\nu-2} \, \|Av\|_2^{3-\nu}.$
(a) If $\nu \in (0, 3]$, then we have $\|Av\|_2^{3-\nu} \le \|A\|^{3-\nu}\|v\|_2^{3-\nu}$. Hence, the last inequality (7) implies

$\left|\langle \nabla^3 g(x)[v]u, u\rangle\right| \le M_f \|A\|^{3-\nu} \|u\|_x^2 \, \|v\|_x^{\nu-2} \, \|v\|_2^{3-\nu},$

which shows that $g$ is $(M_g, \nu)$-generalized self-concordant with $M_g := M_f \|A\|^{3-\nu}$.
(b) Note that $\|Av\|_2^2 \ge \lambda_{\min}(A^\top A)\|v\|_2^2$, where $\lambda_{\min}(A^\top A)$ is the smallest eigenvalue of $A^\top A$. If $\nu > 3$ and $\lambda_{\min}(A^\top A) > 0$, then we have $\|Av\|_2^{3-\nu} \le \lambda_{\min}(A^\top A)^{\frac{3-\nu}{2}}\|v\|_2^{3-\nu}$. Combining this estimate and (7), we can show that $g$ is $(M_g, \nu)$-generalized self-concordant with $M_g := M_f \lambda_{\min}(A^\top A)^{\frac{3-\nu}{2}}$.
Remark 2
Proposition 2 shows that generalized self-concordance is preserved under an affine transformation if $\nu \in (0, 3]$. If $\nu > 3$, then it requires $A$ to be over-completed, i.e., $\lambda_{\min}(A^\top A) > 0$. Hence, the theory developed in the sequel remains applicable for $\nu > 3$ if $A$ is over-completed.
The following result is an extension of standard self-concordant functions ($\nu = 3$), whose proof is very similar to that of (Nesterov2004, Theorems 4.1.3 and 4.1.4) after replacing the standard parameters with the general parameters $\nu$ and $M_f$, respectively. We omit the detailed proof.
Proposition 3
Let $f$ be an $(M_f, \nu)$-generalized self-concordant function with $\nu \ge 2$. Then:

(a) If $\mathrm{dom}(f)$ contains no straight line, then $\nabla^2 f(x) \succ 0$ for any $x \in \mathrm{dom}(f)$.

(b) If there exists $\bar{x} \in \partial\mathrm{dom}(f)$, the boundary of $\mathrm{dom}(f)$, then, for any such $\bar{x}$ and any sequence $\{x_k\} \subset \mathrm{dom}(f)$ such that $\lim_{k\to\infty} x_k = \bar{x}$, we have $\lim_{k\to\infty} f(x_k) = +\infty$.
Note that Proposition 3(a) only holds for $\nu \ge 2$. If we consider $g(x) := f(\mathcal{A}(x))$ for a given affine operator $\mathcal{A}(x) = Ax + b$, then the nondegeneracy of $\nabla^2 g$ is only guaranteed if $A$ is full-rank. Otherwise, $\nabla^2 g$ is only nondegenerate on a given subspace.
2.4 Generalized self-concordant functions with special structures
We first show that if a generalized self-concordant function is strongly convex or has a Lipschitz gradient, then it can be cast into the special case $\nu = 3$ or $\nu = 2$, respectively.
Proposition 4
Let $f$ be an $(M_f, \nu)$-generalized self-concordant function with $\nu > 0$. Then:

(a) If $\nu \in (0, 3]$ and $f$ is also strongly convex on $\mathrm{dom}(f)$ with the strong convexity parameter $\mu_f > 0$ in the $\ell_2$-norm, then $f$ is also $(\hat{M}_f, \hat{\nu})$-generalized self-concordant with $\hat{\nu} = 3$ and $\hat{M}_f := M_f \mu_f^{\frac{\nu-3}{2}}$.

(b) If $\nu \ge 2$ and $\nabla f$ is Lipschitz continuous with the Lipschitz constant $L_f > 0$ in the $\ell_2$-norm, then $f$ is also $(\hat{M}_f, \hat{\nu})$-generalized self-concordant with $\hat{\nu} = 2$ and $\hat{M}_f := M_f L_f^{\frac{\nu-2}{2}}$.
Proof
(a) If $f$ is strongly convex with the strong convexity parameter $\mu_f > 0$ in the $\ell_2$-norm, then $\nabla^2 f(x) \succeq \mu_f\mathbb{I}$ for any $x \in \mathrm{dom}(f)$. Hence, $\|v\|_2 \le \mu_f^{-1/2}\|v\|_x$. Since $3 - \nu \ge 0$, (5) leads to

$\left|\langle \nabla^3 f(x)[v]u, u\rangle\right| \le M_f \|u\|_x^2 \, \|v\|_x^{\nu-2} \left(\mu_f^{-1/2}\|v\|_x\right)^{3-\nu} = M_f \mu_f^{\frac{\nu-3}{2}} \|u\|_x^2 \, \|v\|_x.$

Hence, $f$ is $(\hat{M}_f, 3)$-generalized self-concordant with $\hat{\nu} = 3$ and $\hat{M}_f := M_f \mu_f^{\frac{\nu-3}{2}}$.
(b) Since $\nabla f$ is Lipschitz continuous with the Lipschitz constant $L_f > 0$ in the $\ell_2$-norm, we have $\nabla^2 f(x) \preceq L_f\mathbb{I}$ for all $x \in \mathrm{dom}(f)$, which leads to $\|v\|_x \le L_f^{1/2}\|v\|_2$ for all $v \in \mathbb{R}^p$. On the other hand, since $\nu \ge 2$, we can show that

$\left|\langle \nabla^3 f(x)[v]u, u\rangle\right| \le M_f \|u\|_x^2 \, \|v\|_x^{\nu-2} \, \|v\|_2^{3-\nu} \le M_f L_f^{\frac{\nu-2}{2}} \|u\|_x^2 \, \|v\|_2.$

Hence, $f$ is also $(\hat{M}_f, 2)$-generalized self-concordant with $\hat{\nu} = 2$ and $\hat{M}_f := M_f L_f^{\frac{\nu-2}{2}}$.
Proposition 4 provides two important properties. If the gradient map of a generalized self-concordant function is Lipschitz continuous, we can always classify it into the special case $\nu = 2$. We can therefore exploit both structures, generalized self-concordance and Lipschitz gradient continuity, to develop better algorithms. The same idea applies to generalized self-concordant and strongly convex functions with $\nu = 3$.
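The parameter conversions of Proposition 4 are straightforward to compute; the following helpers are direct transcriptions of the formulas above.

```python
def gsc_under_strong_convexity(M, nu, mu):
    """Proposition 4(a): an (M, nu)-gsc function with nu in (0, 3] that is
    mu-strongly convex is also (M * mu**((nu - 3) / 2), 3)-gsc."""
    assert 0.0 < nu <= 3.0 and mu > 0.0
    return M * mu ** ((nu - 3.0) / 2.0), 3.0

def gsc_under_lipschitz_gradient(M, nu, L):
    """Proposition 4(b): an (M, nu)-gsc function with nu >= 2 whose gradient
    is L-Lipschitz is also (M * L**((nu - 2) / 2), 2)-gsc."""
    assert nu >= 2.0 and L > 0.0
    return M * L ** ((nu - 2.0) / 2.0), 2.0

# Example: a (1, 2)-gsc function (e.g., a logistic loss plus a quadratic)
# that is mu-strongly convex becomes (mu**(-1/2), 3)-gsc, i.e., standard
# self-concordant after rescaling.
print(gsc_under_strong_convexity(1.0, 2.0, 0.01))  # -> (10.0, 3.0)
```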
Given smooth convex univariate functions $\varphi_i \colon \mathbb{R} \to \mathbb{R}$ satisfying (4) with the same order $\nu > 0$ for $i = 1, \dots, n$, we consider the function $f$ defined by the following form:

(8)    $f(x) := \frac{1}{n}\sum_{i=1}^{n} \varphi_i\left(a_i^\top x + b_i\right),$

where $a_i \in \mathbb{R}^p$ and $b_i \in \mathbb{R}$ are given vectors and numbers, respectively, for $i = 1, \dots, n$. This convex function is called a finite sum and is widely used in machine learning and statistics. The decomposable structure in (8) often appears in generalized linear models bollapragada2016exact; byrd2016stochastic and in empirical risk minimization zhang2015disco, where $\varphi_i$ is referred to as a loss function, as can be found, e.g., in Table 1.
Next, we show that if each $\varphi_i$ is generalized self-concordant with $\nu \in [2, 3]$, then $f$ in (8) is also generalized self-concordant. This result is a direct consequence of Proposition 1 and Proposition 2.
Corollary 1
Let $\varphi_i$ be $(M_{\varphi_i}, \nu)$-generalized self-concordant univariate functions satisfying (4) with $\nu \in [2, 3]$ for $i = 1, \dots, n$. Then, the function $f$ defined by (8) is $(M_f, \nu)$-generalized self-concordant with $M_f := n^{\frac{\nu-2}{2}}\max\left\{M_{\varphi_i}\|a_i\|_2^{3-\nu} : 1 \le i \le n\right\}$.
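In code, the constant of Corollary 1 reads as follows; here `M_phi` is the common constant of the univariate losses, and the rows of `A` are the data vectors $a_i$ (an illustrative setup).

```python
import numpy as np

def finite_sum_gsc_constant(M_phi, nu, A):
    """Constant of f(x) = (1/n) sum_i phi(a_i^T x + b_i) from Corollary 1,
    assuming each phi is (M_phi, nu)-gsc with nu in [2, 3]:
        M_f = n**((nu - 2) / 2) * max_i M_phi * ||a_i||_2**(3 - nu)."""
    n = A.shape[0]
    norms = np.linalg.norm(A, axis=1)
    return n ** ((nu - 2.0) / 2.0) * np.max(M_phi * norms ** (3.0 - nu))

# Example: logistic losses (M_phi = 1, nu = 2) give M_f = max_i ||a_i||_2.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(finite_sum_gsc_constant(1.0, 2.0, A))  # -> 5.0
```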
Finally, if we regularize $f$ in (8) by a strongly convex quadratic term, then the resulting function becomes standard self-concordant (i.e., $\nu = 3$). The proof follows the same path as that of (zhang2015disco, Lemma 2), and we omit it here.
2.5 Fenchel's conjugate of generalized self-concordant functions
Primal-dual theory is fundamental in convex optimization. Hence, it is important to study the Fenchel conjugate of generalized self-concordant functions.
Let $f$ be an $(M_f, \nu)$-generalized self-concordant function. We consider Fenchel's conjugate $f^*$ of $f$:

(9)    $f^*(y) := \sup_{x \in \mathrm{dom}(f)}\left\{\langle y, x\rangle - f(x)\right\}.$
Since $f$ is proper, closed, and convex, $f^*$ is well-defined and is also proper, closed, and convex. Moreover, since $f$ is smooth and convex, by Fermat's rule, if $x^*(y)$ satisfies $\nabla f(x^*(y)) = y$, then the supremum in (9) is attained and $f^*(y) = \langle y, x^*(y)\rangle - f(x^*(y))$ is well-defined at $y$.
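As a simple numerical illustration of (9), the snippet below approximates the conjugate of the exponential $f(x) = e^x$ on a grid and compares it with the closed form $f^*(y) = y\log y - y$ for $y > 0$. Note that $f$ is generalized self-concordant with $\nu = 2$ while $f^*$ is generalized self-concordant with $\nu = 4$ (cf. Table 1), illustrating that conjugation changes the parameters $(M_f, \nu)$; Proposition 6 quantifies this effect.

```python
import numpy as np

# Grid approximation of f*(y) = sup_x { y*x - f(x) } in (9) for f(x) = exp(x).
xs = np.linspace(-20.0, 10.0, 2_000_001)

def conj(y):
    return np.max(y * xs - np.exp(xs))

for y in (0.5, 1.0, 2.0):
    print(y, conj(y), y * np.log(y) - y)   # grid value vs. closed form
```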