Consider binary classification problems. Suppose we have a training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, which is assumed to consist of independent and identically distributed realizations of a random pair $(X, Y)$, where $X \in \mathbb{R}^d$ and $Y \in \{-1, +1\}$. Our purpose is to learn a linear classifier $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$ that predicts a new instance correctly. An instance $\mathbf{x}$ is assigned to the positive class if $\mathbf{w}^\top \mathbf{x} > 0$, and to the negative class otherwise. Moreover, we use the 0-1 loss to evaluate the performance of the classifier: if the classifier makes a correct decision, then there is no loss; otherwise, the loss is $1$. To avoid overfitting, it is necessary to apply a regularization term to penalize the classifier. $L_2$-regularization is the most widely used one in machine learning problems, since it allows for the use of a kernel function as a way of embedding the original data in a higher-dimensional space [4, 29]. In certain practical applications such as text classification, one hopes to learn a sparse classifier through $L_1$-regularization [12, 7]. By minimizing the structural risk, the binary classification problem is equivalent to the optimization problem:
where $\mathbb{1}[\cdot]$ returns $1$ if its argument is true and $0$ otherwise, and the second term is the regularization term. Due to the non-differentiability and non-convexity of the 0-1 loss, it is NP-hard to optimize (1) directly. To overcome this difficulty, it is common to replace the 0-1 loss with a convex surrogate loss function. Many efficient convex optimization methods can then be applied to obtain a good solution, such as gradient-based methods and coordinate descent methods [29, 45, 6, 43], which are iterative methods that decrease the objective function by taking the gradient or a coordinate direction as the descent direction.
Which convex functions can be used to replace the 0-1 loss has been studied in [22, 33, 44, 3]. The weakest possible condition on the surrogate loss is that it is classification-calibrated, which is a pointwise form of Fisher consistency for binary classification. Bartlett et al. obtained a necessary and sufficient condition for a convex surrogate loss to be classification-calibrated, as stated in the following theorem.
[Bartlett et al., Theorem 2] A convex function $\phi$ is classification-calibrated if and only if it is differentiable at $0$ and $\phi'(0) < 0$.

The most widely used surrogate loss functions include the Hinge loss $\max\{0, 1-v\}$ for the SVM, the logistic loss $\log(1 + e^{-v})$ for logistic regression, and the exponential loss $e^{-v}$ for AdaBoost, where $v$ denotes the margin $y\,\mathbf{w}^\top\mathbf{x}$. They are all classification-calibrated. Note that the logistic and exponential losses are smooth, but the Hinge loss is not. As a result, solving SVMs with gradient-based methods only gives a suboptimal convergence rate, and no second-order algorithm is available for solving SVMs. To address this issue, the squared Hinge loss was introduced, but this leads to a new model since the squared Hinge loss is not an approximation of the Hinge loss [6, 17].
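These calibration properties are easy to check numerically. The sketch below (function names are ours; the margin is $v = y\,\mathbf{w}^\top\mathbf{x}$) estimates the derivative of each loss at $0$ by central differences, confirming the condition of Theorem 1:

```python
import numpy as np

# The three classical surrogate losses, as functions of the margin v.
# By Theorem 1, a convex loss is classification-calibrated iff it is
# differentiable at 0 with a negative derivative there.
def hinge(v):
    return np.maximum(0.0, 1.0 - v)

def logistic(v):
    return np.log1p(np.exp(-v))

def exponential(v):
    return np.exp(-v)

# Central-difference slope at v = 0.  (The Hinge loss happens to be smooth
# at 0 -- its kink sits at v = 1 -- so the check applies to it as well.)
def slope_at_zero(phi, h=1e-6):
    return (phi(h) - phi(-h)) / (2.0 * h)

for phi in (hinge, logistic, exponential):
    print(phi.__name__, slope_at_zero(phi))
```

All three slopes come out negative (the hinge and exponential losses have slope $-1$ at $0$, the logistic loss $-1/2$), in line with the theorem.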
In this paper, we propose two new convex surrogate losses for binary classification, each involving a tunable hyper-parameter (see (2) below), which are called smooth Hinge losses for the following reasons. First, the two losses converge to the Hinge loss uniformly in the limit of the hyper-parameter, so they retain the advantages of the Hinge loss in SVMs. Second, both losses are infinitely differentiable. By replacing the Hinge loss with these two smooth Hinge losses, we obtain two smooth support vector machines (SSVMs) which can be solved with second-order methods. In particular, they can be solved by the inexact Newton method with a quadratic convergence rate, as done in [1, 20] for logistic regression. Although first-order methods are often sufficient in machine learning, our experiments show a great improvement in training time on large-scale sparse learning problems when second-order methods are used.
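To illustrate the kind of uniform approximation meant here, the following sketch smooths the Hinge loss with a softplus construction; this is an assumption for illustration only and need not coincide with the losses defined in (2):

```python
import numpy as np

# A generic softplus smoothing of the Hinge loss (illustrative only):
#   psi_sigma(v) = sigma * log(1 + exp((1 - v)/sigma)),
# which is infinitely differentiable and satisfies
#   0 <= psi_sigma(v) - max(0, 1 - v) <= sigma * log(2),
# hence converges to the Hinge loss uniformly as sigma -> 0+.
def hinge(v):
    return np.maximum(0.0, 1.0 - v)

def smooth_hinge(v, sigma):
    # logaddexp(0, t) = log(1 + e^t), numerically stable for large t
    return sigma * np.logaddexp(0.0, (1.0 - v) / sigma)

v = np.linspace(-5.0, 5.0, 10001)
for sigma in (1.0, 0.1, 0.01):
    gap = np.max(np.abs(smooth_hinge(v, sigma) - hinge(v)))
    print(f"sigma={sigma:5.2f}  sup gap={gap:.5f}  bound={sigma * np.log(2):.5f}")
```

The sup-norm gap equals $\sigma \log 2$, attained at the kink $v = 1$, so the approximation error shrinks linearly in the smoothing parameter.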
Motivated by the proposed smooth Hinge losses, we also propose a general smooth convex loss function, whose defining functions satisfy the conditions given in Theorem 3 below. This general smooth convex loss function provides a smooth approximation to several surrogate loss functions commonly used in machine learning, such as the non-differentiable absolute loss, which is often used as a regularization term, and the rectified linear unit (ReLU) activation function used in deep neural networks.
This paper is organized as follows. In Section 2, we first briefly review several SVMs with different convex loss functions and then introduce the smooth Hinge loss functions. The general smooth convex loss function is presented and discussed in Section 3. In Section 4, we give the smooth support vector machines obtained by replacing the Hinge loss with the smooth Hinge losses; the first-order and second-order algorithms for the proposed SSVMs are also presented and analyzed. Several empirical examples of text categorization with high-dimensional and sparse features are implemented in Section 5; the results show that the smooth Hinge losses are efficient for binary classification. Some conclusions are given in Section 6.
2 Smooth Hinge Losses
The support vector machine (SVM) is a famous algorithm for binary classification and has also been applied to many other machine learning problems such as AUC learning, multi-task learning, multi-class classification and imbalanced classification [27, 18, 2, 14]. A recent survey covers the applications, challenges and trends of SVM.
The SVM model can be described as the following optimization problem:
$$\min_{\mathbf{w}} \ \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{i=1}^n \max\{0,\, 1 - y_i \mathbf{w}^\top \mathbf{x}_i\},$$
where the surrogate loss used is the Hinge loss. This model is called L1-SVM.
Since the Hinge loss is not smooth, it is usually replaced with a smooth function. One choice is the squared Hinge loss $\max\{0, 1-v\}^2$, which is convex, piecewise quadratic and differentiable [6, 38]. The SVM model with the squared Hinge loss is called L2-SVM. It has also been proposed to replace the squared Hinge loss by a smooth approximation. In order to make the objective of the Lagrangian dual problem of L1-SVM strongly convex, which is needed in developing an accurate estimate of the optimum with dual methods, the following smoothed Hinge loss has been proposed to replace the Hinge loss:
The stochastic dual coordinate ascent method is then applied to accelerate the training process [30, 31]. With the help of $L_1$-regularization and this smoothed Hinge loss, a sparse and smooth support vector machine is obtained. By simultaneously identifying the inactive features and samples, a novel screening method was further developed, which is able to reduce a large-scale problem to a small-scale one.
Motivated by the smoothing technique in quantile regression, a smooth approximation to the Hinge loss has also been presented, which involves a bandwidth parameter and a smoothing function defined by
By replacing the Hinge loss with this smooth approximation, a smooth SVM is obtained, and a linear-type estimator has been constructed for the smooth SVM in a distributed setting.
Although several smooth loss functions are already available to replace the Hinge loss in practice, they may not be ideal choices: they are either not approximations to the Hinge loss or at most twice differentiable, as is the case for the squared Hinge loss and the approximations reviewed above. In this section, we propose two (infinitely differentiable) smooth Hinge loss functions which overcome the above weakness and are given by
where the smoothing parameter is given. The following theorem gives the approximation property of the two smooth Hinge losses.
The two smooth Hinge losses satisfy the estimates:
Thus the two smooth Hinge losses converge to the Hinge loss uniformly in the limit of the smoothing parameter.
Let and . Taking the derivative of , we have
By the definitions, we have
It then follows that
Thus we have , which means that is a monotonically decreasing function of . For , we have , so
For , and , implying that
For we have
By the definitions, it is easy to see that
Thus the difference is monotonically decreasing. Arguing as in the first case, we can easily prove that
The proof is thus complete.
3 A General Smooth Convex Loss
Given and , define ,
where , and are differentiable and satisfy
that and . Then we have
1) is twice differentiable with and ;
2) is convex;
3) is -strongly convex if for all ;
4) is -smooth convex if for all ;
5) is classification-calibrated for binary classification if ;
6) the conjugate of is , where is the inverse function of and is the range of .
1) It is easy to obtain that
2) Since , is convex.
3) If for all , then . Thus is -strongly convex, that is,
4) If , then , so the convex function is -smooth, that is,
5) Under the stated condition, by Theorem 1 the convex function is classification-calibrated.
6) The conjugate function of is
Let and . Then . Thus, reaches its minimum at , and so we have
Note that if the conditions in Theorem 3 are satisfied and, in addition, the outer function is convex and differentiable, then the resulting loss is also convex. The general smooth convex loss function includes many surrogate loss functions commonly used in binary classification as special cases, as shown below. Figure 1 below presents several surrogate convex loss functions.
Example 1: Least Squares Loss. Let and . Let . Then
which is the least squares loss. The parameter satisfies . The conditions in Theorem 3 are easy to verify, and it is easy to obtain that
Example 2: Smooth Hinge Loss . Let and with and given in Section 2. Then
By (3), , so . Then . Setting gives the smooth Hinge loss . Further, , , , , and tends to the Hinge loss as . It is also easy to see that
Example 3: Smooth Hinge Loss . Let and . Then it follows that
By (4), , so . Then . Moreover, , , , , and tends to the Hinge loss as . Setting gives the smooth Hinge loss . It is easy to get
In addition, we have
Example 4: Exponential Loss. Let , . Then it follows that
Further, and for all . Moreover, . Letting and gives the exponential loss. It is easy to get that
Example 5: Logistic Loss. Let , . Then
It is easy to get that , for all , and . Letting and gives the logistic loss. Further, we have
For , , so
Example 6: Smooth Absolute Loss. Let , . Then it follows that
and , so for and for . Thus, . For the condition to hold, we must have . Further, we have . It is also easy to derive that , which is why we call it the smooth absolute loss function. A direct calculation gives
Finally, for we have and
Example 7: Smooth ReLU. ReLU is a famous non-smooth activation function in deep neural networks (DNNs), defined as $\mathrm{ReLU}(x) = \max\{0, x\}$. Define the smooth ReLU (sReLU) function as
with the functions given in Section 2. Then the sReLU can be written in terms of the loss defined in Example 2. By Example 2, the sReLU converges uniformly to ReLU in the limit of the smoothing parameter.
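As an illustration of such a construction, a softplus-type smooth ReLU (our assumption for illustration; the paper's sReLU may be parameterized differently) can be checked numerically:

```python
import numpy as np

# A softplus-style smooth ReLU (a sketch, not necessarily the paper's sReLU):
#   sReLU_sigma(x) = sigma * log(1 + exp(x / sigma)),
# which satisfies 0 <= sReLU_sigma(x) - ReLU(x) <= sigma*log(2) and has the
# logistic sigmoid as its derivative.
def srelu(x, sigma=0.1):
    return sigma * np.logaddexp(0.0, x / sigma)

def srelu_grad(x, sigma=0.1):
    # sigmoid(x/sigma), written via tanh for numerical stability
    return 0.5 * (1.0 + np.tanh(x / (2.0 * sigma)))

x = np.linspace(-3.0, 3.0, 601)
gap = np.max(srelu(x) - np.maximum(0.0, x))
fd = (srelu(x + 1e-6) - srelu(x - 1e-6)) / 2e-6    # finite-difference check
print("sup gap:", gap, " bound:", 0.1 * np.log(2))
print("max derivative error:", np.max(np.abs(fd - srelu_grad(x))))
```

The uniform gap to ReLU is $\sigma \log 2$ (attained at $x = 0$), and the finite-difference check confirms that the derivative is the logistic sigmoid.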
By Theorem 3 we have the following remarks.
1) Any smooth convex function can be rewritten in the form , where and .
2) Given a monotonically increasing, differentiable function , we are able to construct a convex, smooth surrogate loss which is classification-calibrated for binary classification. Moreover, there is no need to know the explicit expression of the loss function when learning with gradient-based methods.
There has been great interest in developing smoothing techniques to approximate non-smooth convex functions [25, 32]. For example, Nesterov improved the traditional bounds on the number of iterations of gradient methods for non-smooth convex minimization problems based on a special smoothing technique. Theorem 3 provides a new smoothing technique: one searches for a monotonically increasing, differentiable function which approximates the sub-gradient of the non-smooth convex function.
4 Smooth Support Vector Machines

There are many algorithms for SVMs. Early on, decomposition methods such as SMO, which overcome the memory requirement of quadratic optimization methods, were proposed to solve L1-SVM in its dual form [26, 15] (see the literature for a convergence analysis of the decomposition methods, including SMO). Later, several convex optimization methods were introduced, such as gradient-based methods, the bundle method [35, 11], the coordinate descent method [42, 6], the dual method [13, 30, 45] and online learning methods [29, 37, 10]. Based on the generalized Hessian matrix of a convex function with a locally Lipschitz gradient, several second-order methods have been applied to solve L2-SVM [16, 20].
In this paper, we focus on the following smooth support vector machine:
where the loss is one of the two smooth Hinge losses proposed in Section 2. We will analyze first-order and second-order convex optimization algorithms for the above SSVM.
4.1 First-Order Algorithms
The gradient descent (GD) method has taken the stage as the primary workhorse for convex optimization problems. It iteratively approaches the optimal solution. In each iteration, the standard full gradient descent (FGD) method takes the negative gradient as the descent direction, and the classifier is updated as follows:
where $\eta$ is a predefined step size. The FGD method has an $O(1/k)$ convergence rate under standard assumptions. Nesterov proposed his famous accelerated gradient (AGD) method in 1983, which achieves the optimal convergence rate of $O(1/k^2)$ for smooth objective functions.
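The FGD update can be sketched as follows for an L2-regularized SVM with a softplus-smoothed hinge loss (an illustrative stand-in of ours, not necessarily the loss of Section 2; all names are assumptions):

```python
import numpy as np

# Full gradient descent on
#   F(w) = (1/n) sum_i sigma*log(1 + exp((1 - y_i w.x_i)/sigma)) + (lam/2)|w|^2,
# a smooth, lam-strongly-convex stand-in for the SSVM objective.
rng = np.random.default_rng(0)
n, d, sigma, lam = 200, 5, 0.1, 0.1
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_true + 0.1 * rng.normal(size=n))

def objective(w):
    m = y * (X @ w)                                # margins y_i w.x_i
    return np.mean(sigma * np.logaddexp(0.0, (1.0 - m) / sigma)) + 0.5 * lam * w @ w

def gradient(w):
    m = y * (X @ w)
    s = 0.5 * (1.0 + np.tanh((1.0 - m) / (2.0 * sigma)))  # = -(d/dm) loss
    return -(s * y) @ X / n + lam * w

w = np.zeros(d)
eta = 0.1                                          # fixed step size below 1/L
for _ in range(3000):
    w -= eta * gradient(w)                         # full-gradient step
print("objective:", objective(w), " gradient norm:", np.linalg.norm(gradient(w)))
```

Because the smoothed objective is strongly convex and has a Lipschitz gradient, this fixed-step iteration converges linearly, which is exactly what the smooth Hinge losses make possible.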
However, for large-scale problems it is computationally very expensive to compute the full gradient in each iteration. To accelerate the learning procedure, the linear classifier can be updated with a stochastic gradient (SG) step instead of a full gradient (FG) step, as in Pegasos. Precisely, at the $t$-th iteration, the classifier is updated based on a randomly chosen example:
where the predefined step size $\eta_t$ is required to satisfy $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$ for convergence. The remaining factor can be seen as the learning rate of the chosen example. Noting that $\eta_t$ is independent of the data while this factor depends on the margin of the chosen example, we call $\eta_t$ the exogenous learning rate and the margin-dependent factor the endogenous learning rate.
An advantage of the above stochastic gradient descent (SGD) algorithm is that its iteration cost is independent of the number of training examples. This property makes it suitable for large-scale problems. In the SGD algorithm, the full gradient is replaced by a stochastic gradient of a randomly chosen example in each iteration. Though the two gradients are equal in expectation, they are rarely the same. Thus, it is not natural to use the norm of the stochastic gradient as a stopping criterion. The discrepancy between the FG and the SG also has a negative effect on the convergence rate of the SGD method. Since there is no guarantee that the stochastic gradient will approach zero, we need to employ a monotonically decreasing step-size sequence for convergence. The small step size leads to sub-linear convergence of the SGD method even for strongly convex objective functions.
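The stochastic update above can be sketched in a Pegasos-style loop (the smoothed loss and all names are our illustrative choices):

```python
import numpy as np

# Pegasos-style SGD on a softplus-smoothed hinge SVM objective.  Each step
# touches one random example; the exogenous rate eta_t = 1/(lam*(t+1))
# satisfies the divergence / square-summability conditions.
rng = np.random.default_rng(1)
n, d, sigma, lam = 500, 5, 0.1, 0.1
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_true)

w = np.zeros(d)
for t in range(20000):
    i = rng.integers(n)                              # sample one example
    m = y[i] * (X[i] @ w)                            # its margin
    s = 0.5 * (1.0 + np.tanh((1.0 - m) / (2.0 * sigma)))  # endogenous factor
    g = -s * y[i] * X[i] + lam * w                   # unbiased gradient estimate
    w -= g / (lam * (t + 1))                         # decreasing step size
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```

The per-iteration cost is $O(d)$ regardless of $n$, illustrating why SGD suits large-scale problems even though its convergence is only sub-linear.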
Many first-order algorithms have been proposed to combine the low computation cost of SGD with the faster convergence of the FGD method by providing variance reduction [8, 36]. These algorithms are known as variance-reduced stochastic gradient algorithms, and include SAG, SAGA, SDCA, Finito, and SVRG. Their convergence analyses can be found in the corresponding references.
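A minimal sketch of the variance-reduction idea, following the SVRG scheme of Johnson and Zhang (again with our illustrative smoothed-hinge objective), reads:

```python
import numpy as np

# SVRG: each epoch recomputes one full gradient at a snapshot w_snap; every
# inner step corrects a stochastic gradient with the snapshot's, so the
# variance of the search direction vanishes as the iterates converge.
rng = np.random.default_rng(2)
n, d, sigma, lam = 200, 5, 0.5, 0.1
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))

def grad_i(w, i):                                  # gradient of i-th term + reg
    m = y[i] * (X[i] @ w)
    s = 0.5 * (1.0 + np.tanh((1.0 - m) / (2.0 * sigma)))
    return -s * y[i] * X[i] + lam * w

def full_grad(w):
    return sum(grad_i(w, i) for i in range(n)) / n

w, eta = np.zeros(d), 0.02
g0 = np.linalg.norm(full_grad(w))
for epoch in range(30):
    w_snap, mu = w.copy(), full_grad(w)            # snapshot and its gradient
    for _ in range(2 * n):                         # inner loop of length 2n
        i = rng.integers(n)
        v = grad_i(w, i) - grad_i(w_snap, i) + mu  # variance-reduced direction
        w -= eta * v
print("gradient norm reduced from", g0, "to", np.linalg.norm(full_grad(w)))
```

Unlike plain SGD, the constant step size here does not prevent convergence, because the corrected direction has vanishing variance at the optimum.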
4.2 Second-Order Algorithms
Recently, inexact Newton methods that avoid computing the inverse Hessian matrix have been proposed to obtain a superlinear convergence rate in machine learning, such as LiSSA and TRON [19, 20]. LiSSA constructs a natural estimator of the inverse Hessian matrix by using a Taylor expansion, while TRON is a trust-region Newton method introduced to deal with general bound-constrained optimization problems, which generates an approximate Newton direction by solving a trust-region subproblem. In TRON, the direction step should give at least as much reduction as the Cauchy step. The TRON method has been applied to maximize the log-likelihood of the logistic regression model, with a conjugate gradient method used to solve the trust-region subproblem approximately. TRON was also extended to solve the L2-SVM model by introducing a generalized Hessian for convex objective functions having a Lipschitz continuous gradient.
where and . Then the Hessian matrix of the total loss function is given by
where is the identity matrix, is a diagonal matrix, and is the input feature matrix with each row representing an instance.
The Newton step is obtained by solving a linear system with the Hessian matrix, which incurs a huge computational cost for high-dimensional machine learning problems. The trust-region method instead provides an approximate Newton direction; for recent advances in trust-region methods, see the literature. Suppose the iterate at the $k$-th iteration is given. Then the trust-region method generates a direction step by solving the quadratic subproblem
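The diagonal-plus-low-rank Hessian structure above makes Hessian-vector products cheap, which is the key to truncated-Newton methods. The following sketch (an illustrative stand-in of ours: a backtracking line search replaces the trust-region safeguard, and the objective is our softplus-smoothed hinge) computes Newton directions by conjugate gradient without ever forming the Hessian:

```python
import numpy as np

# Truncated Newton: solve (lam*I + X^T D X / n) p = g by conjugate gradient
# using only Hessian-vector products (two matrix-vector products each).
rng = np.random.default_rng(3)
n, d, sigma, lam = 300, 20, 0.5, 0.1
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))

def objective(w):
    m = y * (X @ w)
    return np.mean(sigma * np.logaddexp(0.0, (1.0 - m) / sigma)) + 0.5 * lam * w @ w

def grad_hess(w):
    m = y * (X @ w)
    s = 0.5 * (1.0 + np.tanh((1.0 - m) / (2.0 * sigma)))
    g = -(s * y) @ X / n + lam * w
    D = s * (1.0 - s) / sigma                      # diagonal entries of D
    return g, lambda v: X.T @ (D * (X @ v)) / n + lam * v

def cg(hv, b, iters=100, rtol=1e-8):
    x, r = np.zeros_like(b), b.copy()              # approximately solve H x = b
    p, rs0 = r.copy(), r @ r
    rs = rs0
    for _ in range(iters):
        Hp = hv(p)
        alpha = rs / (p @ Hp)
        x, r = x + alpha * p, r - alpha * Hp
        rs, rs_old = r @ r, rs
        if rs <= rtol * rs0:
            break
        p = r + (rs / rs_old) * p
    return x

w = np.zeros(d)
for _ in range(20):                                # damped Newton iterations
    g, hv = grad_hess(w)
    if np.linalg.norm(g) < 1e-12:
        break
    step, t = cg(hv, g), 1.0
    while objective(w - t * step) > objective(w):
        t *= 0.5                                   # backtracking safeguard
    w -= t * step
print("final gradient norm:", np.linalg.norm(grad_hess(w)[0]))
```

Each Hessian-vector product costs the same as a gradient evaluation, so the inner CG loop gives an inexact Newton direction at first-order cost per step while retaining fast local convergence.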