Learning with Smooth Hinge Losses

Due to the non-smoothness of the Hinge loss in SVM, it is difficult to obtain a faster convergence rate with modern optimization algorithms. In this paper, we introduce two smooth Hinge losses ψ_G(α;σ) and ψ_M(α;σ) which are infinitely differentiable and converge to the Hinge loss uniformly in α as σ tends to 0. By replacing the Hinge loss with these two smooth Hinge losses, we obtain two smooth support vector machines (SSVMs). Solving the SSVMs with the Trust Region Newton method (TRON) leads to two quadratically convergent algorithms. Experiments on text classification tasks show that the proposed SSVMs are effective in real-world applications. We also introduce a general smooth convex loss function to unify several commonly used convex loss functions in machine learning. The general framework provides smooth approximations to non-smooth convex loss functions, which can be used to obtain smooth models that can be solved with faster-converging optimization algorithms.

1 Introduction

Consider binary classification problems. Suppose we have a training dataset $\{(x_i,y_i)\}_{i=1}^n$, which is assumed to consist of independent and identically distributed realizations of a random pair $(X,Y)$, where $x_i\in\mathbb{R}^p$ and $y_i\in\{-1,+1\}$. Our purpose is to learn a linear classifier $f(x)=w^Tx$ to predict a new instance $x$ correctly. The instance will be assigned to be positive if $w^Tx>0$, and negative otherwise. Moreover, we use the 0-1 loss to evaluate the performance of the classifier $f$, that is, if the classifier makes a correct decision, then there is no loss and, otherwise, the loss is $1$. To avoid overfitting, it is necessary to apply a regularization term to penalize the classifier. $\ell_2$-regularization is the most widely used one in machine learning problems, and it allows for the use of a kernel function as a way of embedding the original data in a higher-dimensional space [4, 29]. In certain practical applications such as text classification, one hopes to learn a sparse classifier through $\ell_1$-regularization [12, 7]. By minimizing the structural risk, the binary classification problem is equivalent to the optimization problem:

$$\arg\min_{w\in\mathbb{R}^p}\ \frac{1}{2}\lambda\|w\|^2+\frac{1}{n}\sum_{i=1}^n I(y_iw^Tx_i\le 0),$$

where $I(\cdot)$ returns $1$ if its argument is true and $0$ otherwise, and $\frac{1}{2}\lambda\|w\|^2$ is the $\ell_2$-regularization term. Due to the non-differentiability and non-convexity of the 0-1 loss, it is an NP-hard problem to optimize this objective directly. To overcome this difficulty, it is common to replace the 0-1 loss with a convex surrogate loss function. Many efficient convex optimization methods can then be applied to obtain a good solution, such as gradient-based methods and the coordinate descent method [29, 45, 6, 43], which are iterative methods that take the gradient or a coordinate as a descent direction to decrease the objective function.

Which convex functions can be used to replace the 0-1 loss has been studied in [22, 33, 44, 3]. The weakest possible condition on the surrogate loss is that it is classification-calibrated, which is a pointwise form of the Fisher consistency for binary classification [21]. [3] obtained a necessary and sufficient condition for a convex surrogate loss to be classification-calibrated, as stated in the following theorem.

Theorem 1

[[3], Theorem 2] A convex function $\ell$ is classification-calibrated if and only if it is differentiable at $0$ and $\ell'(0)<0$.

A brief overview of surrogate loss functions frequently used in practice is given in [21, 34]. The most commonly used surrogate loss functions include the Hinge loss for SVM, the logistic loss for logistic regression, and the exponential loss for AdaBoost. They are all classification-calibrated. Note that the logistic and exponential losses are smooth, but the Hinge loss is not. As a result, solving SVMs with gradient-based methods only gives a suboptimal convergence rate, and no second-order algorithm is available for solving SVMs. To address this issue, the squared Hinge loss was introduced, but this leads to a new model since the squared Hinge loss is not an approximation of the Hinge loss [6, 17].
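As a minimal sketch of the calibration condition of [3] quoted above, the snippet below evaluates the surrogate losses just named and estimates their slope at 0 by a central difference. The function names, and the unscaled forms chosen for the squared Hinge and logistic losses, are assumptions made for illustration only.

```python
import math

def hinge(a):
    return max(0.0, 1.0 - a)

def squared_hinge(a):
    # Assumed unscaled form; some references include a factor 1/2.
    return max(0.0, 1.0 - a) ** 2

def logistic(a):
    return math.log1p(math.exp(-a))

def exponential(a):
    return math.exp(-a)

def slope_at_zero(loss, h=1e-6):
    """Central-difference estimate of loss'(0)."""
    return (loss(h) - loss(-h)) / (2.0 * h)

LOSSES = {
    "hinge": hinge,
    "squared hinge": squared_hinge,
    "logistic": logistic,
    "exponential": exponential,
}

for name, loss in LOSSES.items():
    # A negative slope at 0 is the calibration condition for a convex loss.
    print(f"{name:14s} slope at 0 = {slope_at_zero(loss):+.3f}")
```

All four slopes come out negative, which is consistent with the statement that these losses are classification-calibrated. The Hinge loss is differentiable at 0 even though it has a kink at 1.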

In this paper, we propose two new convex surrogate losses $\psi_G(\alpha;\sigma)$ and $\psi_M(\alpha;\sigma)$ for binary classification, where $\sigma>0$ is a tunable hyper-parameter (see (2) below). We call them smooth Hinge losses for the following reasons. First, $\psi_G$ and $\psi_M$ converge to the Hinge loss uniformly in $\alpha$ as $\sigma$ approaches $0$, so they keep the advantages of the Hinge loss in SVMs. Secondly, $\psi_G$ and $\psi_M$ are infinitely differentiable. By replacing the Hinge loss with these two smooth Hinge losses, we obtain two smooth support vector machines (SSVMs) which can be solved with second-order methods. In particular, they can be solved by the inexact Newton method with a quadratic convergence rate, as done in [1, 20] for logistic regression. Although first-order methods are often sufficient in machine learning, second-order methods can give a large experimental improvement in training time on large-scale sparse learning problems.

Motivated by the proposed smooth Hinge losses, we also propose a general smooth convex loss function $\psi(\alpha)=\Phi_c(v)(\theta-\alpha)+\phi_c(v)\sigma$ with $v=(\theta-\alpha)/\sigma$, where $\Phi_c$ and $\phi_c$ satisfy the conditions given in Theorem 3 below. This general smooth convex loss function provides a smooth approximation to several surrogate loss functions commonly used in machine learning, such as the non-differentiable absolute loss, which is usually used as a regularization term, and the rectified linear unit (ReLU) activation function used in deep neural networks.

This paper is organized as follows. In Section 2, we first briefly review several SVMs with different convex loss functions and then introduce the smooth Hinge loss functions $\psi_G$ and $\psi_M$. The general smooth convex loss function is then presented and discussed in Section 3. In Section 4, we give the smooth support vector machine obtained by replacing the Hinge loss with the smooth Hinge loss $\psi_G$ or $\psi_M$. The first-order and second-order algorithms for the proposed SSVMs are also presented and analyzed. Several empirical examples of text categorization with high-dimensional, sparse features are presented in Section 5; the results show that the smooth Hinge losses are efficient for binary classification. Some conclusions are given in Section 6.

2 Smooth Hinge Losses

The support vector machine (SVM) is a well-known algorithm for binary classification and has also been applied to many other machine learning problems such as AUC learning, multi-task learning, multi-class classification and imbalanced classification [27, 18, 2, 14]. A recent survey of the applications, challenges and trends of SVM is given in [5].

The SVM model can be described as the following optimization problem

$$\arg\min_{w\in\mathbb{R}^p}\ \frac{1}{2}\lambda\|w\|^2+\frac{1}{n}\sum_{i=1}^n \ell(y_iw^Tx_i), \tag{1}$$

where the surrogate loss $\ell(\alpha)=\max\{0,1-\alpha\}$ is the Hinge loss. The model (1) is called L1-SVM.

Since the Hinge loss is not smooth, it is usually replaced with a smooth function. One is the squared Hinge loss $\max\{0,1-\alpha\}^2$, which is convex, piecewise quadratic and differentiable [6, 38]. The SVM model with the squared Hinge loss is called L2-SVM. [17] proposed to replace the squared Hinge loss by a smooth approximation. In order to make the objective of the Lagrangian dual problem of L1-SVM strongly convex, which is needed for accurate estimation of the optimum with dual methods, the following smoothed hinge loss is proposed in [30] to replace the hinge loss:

$$\ell_\gamma(\alpha)=\begin{cases}0 & \text{if } \alpha\ge 1,\\ 1-\alpha-\frac{\gamma}{2} & \text{if } \alpha\le 1-\gamma,\\ \frac{1}{2\gamma}(1-\alpha)^2 & \text{otherwise.}\end{cases}$$

The stochastic dual coordinate ascent method is then applied to accelerate the training process [30, 31]. With the help of $\ell_1$-regularization and the smoothed hinge loss $\ell_\gamma$, a sparse and smooth support vector machine is obtained in [12]. By simultaneously identifying the inactive features and samples, a novel screening method was further developed in [12], which is able to reduce a large-scale problem to a small-scale problem.
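The piecewise definition of the smoothed hinge loss of [30] translates directly into code. The sketch below is only an illustration (the function names are hypothetical); it also prints the plain Hinge loss so the two can be compared at a few points.

```python
def hinge(alpha):
    """Plain Hinge loss max{0, 1 - alpha}."""
    return max(0.0, 1.0 - alpha)

def smoothed_hinge(alpha, gamma):
    """Smoothed hinge loss; gamma > 0 is the width of the quadratic zone."""
    if alpha >= 1.0:
        return 0.0
    if alpha <= 1.0 - gamma:
        return 1.0 - alpha - gamma / 2.0
    return (1.0 - alpha) ** 2 / (2.0 * gamma)

for alpha in (-1.0, 0.5, 0.75, 1.0, 2.0):
    print(alpha, hinge(alpha), smoothed_hinge(alpha, gamma=0.5))
```

The two linear pieces meet the quadratic piece continuously at $\alpha=1-\gamma$ and $\alpha=1$. The second derivative jumps at those points, which is why this loss is only once continuously differentiable.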

Motivated by the smoothing technique in quantile regression, [39] presents a smooth approximation to the hinge loss that depends on a bandwidth parameter and on the smooth function $H$ defined by

$$H(\alpha)=\begin{cases}0 & \text{if } \alpha\le -1,\\ \frac{1}{2}+\frac{15}{16}\left(\alpha-\frac{2}{3}\alpha^3+\frac{1}{5}\alpha^5\right) & \text{if } -1<\alpha<1,\\ 1 & \text{otherwise.}\end{cases}$$

By replacing the hinge loss with this smooth approximation, a smooth SVM is obtained, and a linear-type estimator is constructed for the smooth SVM in a distributed setting in [39].
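A short sketch of the function $H$ alone follows. It does not implement the full approximation of [39], whose exact form is not reproduced here, and the function name is hypothetical.

```python
def smooth_step(alpha):
    """The function H: 0 for alpha <= -1, 1 for alpha >= 1, a quintic in between."""
    if alpha <= -1.0:
        return 0.0
    if alpha >= 1.0:
        return 1.0
    return 0.5 + (15.0 / 16.0) * (alpha - (2.0 / 3.0) * alpha ** 3 + 0.2 * alpha ** 5)

for alpha in (-1.5, -1.0, 0.0, 1.0, 1.5):
    print(alpha, smooth_step(alpha))
```

The quintic takes the values 0 and 1 at the endpoints $\alpha=\mp 1$, so $H$ is continuous across the three pieces.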

Although several smooth loss functions have already been used to replace the Hinge loss in practice, they may not be ideal choices because they are either not approximations to the Hinge loss or at most twice differentiable, such as the squared Hinge loss and its approximations as well as $\ell_\gamma$ and the approximation of [39]. In this section, we propose two (infinitely differentiable) smooth Hinge loss functions which overcome the above weakness and are given by

$$\psi_G(\alpha;\sigma)=\Phi(v)(1-\alpha)+\phi(v)\sigma,\qquad \psi_M(\alpha;\sigma)=\Phi_M(v)(1-\alpha)+\phi_M(v)\sigma, \tag{2}$$

where $\sigma>0$ is a given parameter and $v=(1-\alpha)/\sigma$. Here, $\Phi(v)$ and $\phi(v)$ are the cumulative distribution function (CDF) and probability density function (PDF) of the standard normal distribution, respectively, and

$$\Phi_M(v)=\frac{1}{2}\left(1+\frac{v}{\sqrt{1+v^2}}\right),\qquad \phi_M(v)=\frac{1}{2\sqrt{1+v^2}}.$$

The following theorem gives the approximation property of $\psi_G$ and $\psi_M$.
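The two losses in (2) can be evaluated with the standard library alone. The sketch below assumes $v=(1-\alpha)/\sigma$, and it assumes $\Phi_M(v)=\frac{1}{2}\bigl(1+v/\sqrt{1+v^2}\bigr)$ and $\phi_M(v)=1/(2\sqrt{1+v^2})$; these forms are inferred from the closed form and derivatives given in Example 3, and all function names are illustrative.

```python
import math

def normal_cdf(v):
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def normal_pdf(v):
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def cdf_M(v):
    # Assumed form of Phi_M, inferred from Example 3.
    return 0.5 * (1.0 + v / math.sqrt(1.0 + v * v))

def pdf_M(v):
    # Assumed form of phi_M, inferred from Example 3.
    return 0.5 / math.sqrt(1.0 + v * v)

def psi_G(alpha, sigma):
    v = (1.0 - alpha) / sigma
    return normal_cdf(v) * (1.0 - alpha) + normal_pdf(v) * sigma

def psi_M(alpha, sigma):
    v = (1.0 - alpha) / sigma
    return cdf_M(v) * (1.0 - alpha) + pdf_M(v) * sigma

for alpha in (-2.0, 0.0, 1.0, 3.0):
    print(alpha, max(0.0, 1.0 - alpha), psi_G(alpha, 0.1), psi_M(alpha, 0.1))
```

Under these assumptions `psi_M` agrees with the closed form $\frac{1}{2}(1-\alpha)+\frac{1}{2}\sqrt{(1-\alpha)^2+\sigma^2}$, which is cheaper to evaluate in practice.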

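Theorem 2

$\psi_G(\alpha;\sigma)$ and $\psi_M(\alpha;\sigma)$ satisfy the estimates:
1) $0\le\psi_G(\alpha;\sigma)-\max\{0,1-\alpha\}\le\sigma/\sqrt{2\pi}$;
2) $0\le\psi_M(\alpha;\sigma)-\max\{0,1-\alpha\}\le\sigma/2$.
Thus $\psi_G(\alpha;\sigma)$ and $\psi_M(\alpha;\sigma)$ converge to the Hinge loss uniformly in $\alpha$ as $\sigma$ tends to $0$.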

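Proof

Let $\ell(\alpha)=\max\{0,1-\alpha\}$ and $v=(1-\alpha)/\sigma$. Taking the derivative of $\psi_G(\alpha;\sigma)$ with respect to $\alpha$, we have

$$\psi_G'(\alpha;\sigma)=-\bigl(\Phi'(v)v+\phi'(v)\bigr)-\Phi(v).$$

By the definition of $\Phi$ and $\phi$ we have

$$\Phi'(v)=\phi(v)=\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}v^2}.$$

It then follows that

$$\Phi'(v)v+\phi'(v)=0. \tag{3}$$

Thus we have $\psi_G'(\alpha;\sigma)=-\Phi(v)<0$, which means that $\psi_G(\alpha;\sigma)$ is a monotonically decreasing function of $\alpha$. For $\alpha\ge 1$, we have $\ell(\alpha)=0$, so

$$0=\lim_{\alpha\to\infty}\psi_G(\alpha;\sigma)\le\psi_G(\alpha;\sigma)-\ell(\alpha)\le\psi_G(1;\sigma).$$

For $\alpha<1$, $\ell(\alpha)=1-\alpha$ and $\psi_G'(\alpha;\sigma)-\ell'(\alpha)=1-\Phi(v)>0$, implying that

$$0=\lim_{\alpha\to-\infty}\bigl[\psi_G(\alpha;\sigma)-\ell(\alpha)\bigr]\le\psi_G(\alpha;\sigma)-\ell(\alpha)\le\psi_G(1;\sigma).$$

Thus,

$$0\le\psi_G(\alpha;\sigma)-\max\{0,1-\alpha\}\le\psi_G(1;\sigma)=\frac{\sigma}{\sqrt{2\pi}}.$$

For $\psi_M(\alpha;\sigma)$ we have

$$\psi_M'(\alpha;\sigma)=-\bigl(\Phi_M'(v)v+\phi_M'(v)\bigr)-\Phi_M(v).$$

By the definition of $\Phi_M$ and $\phi_M$ it is easy to see that

$$\Phi_M'(v)v+\phi_M'(v)=0. \tag{4}$$

Thus, $\psi_M'(\alpha;\sigma)=-\Phi_M(v)<0$, implying that $\psi_M(\alpha;\sigma)$ is monotonically decreasing. As for $\psi_G$, we can easily prove that

$$0\le\psi_M(\alpha;\sigma)-\max\{0,1-\alpha\}\le\psi_M(1;\sigma)=\sigma/2.$$

The proof is thus complete.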
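The two bounds can be checked numerically on a grid. This is a sanity check under stated assumptions rather than part of the proof: $v=(1-\alpha)/\sigma$ is assumed, $\psi_M$ is evaluated through the closed form given in Example 3 with $\theta=1$, and the grid range and helper names are arbitrary choices.

```python
import math

def hinge(alpha):
    return max(0.0, 1.0 - alpha)

def psi_G(alpha, sigma):
    v = (1.0 - alpha) / sigma
    cdf = 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)
    return cdf * (1.0 - alpha) + pdf * sigma

def psi_M(alpha, sigma):
    # Closed form of Example 3 with theta = 1.
    return 0.5 * (1.0 - alpha) + 0.5 * math.sqrt((1.0 - alpha) ** 2 + sigma ** 2)

def gap_range(psi, sigma, lo=-5.0, hi=5.0, steps=2001):
    """Smallest and largest value of psi - hinge on an even grid over [lo, hi]."""
    gaps = []
    for k in range(steps):
        alpha = lo + (hi - lo) * k / (steps - 1)
        gaps.append(psi(alpha, sigma) - hinge(alpha))
    return min(gaps), max(gaps)

for sigma in (1.0, 0.1, 0.01):
    print(sigma,
          gap_range(psi_G, sigma), sigma / math.sqrt(2.0 * math.pi),
          gap_range(psi_M, sigma), sigma / 2.0)
```

On this grid the largest gap is attained at $\alpha=1$ and shrinks linearly with $\sigma$, in line with the uniform convergence statement.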

3 A General Smooth Convex Loss

Motivated by (3) and (4) we propose a general smooth convex loss function as stated in the following theorem.

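Theorem 3

Given $\sigma>0$ and $\theta\in\mathbb{R}$, define $\psi(\alpha)=\Phi_c(v)(\theta-\alpha)+\phi_c(v)\sigma$, where $v=(\theta-\alpha)/\sigma$, and $\Phi_c$ and $\phi_c$ are differentiable and satisfy $\Phi_c'(v)\ge 0$ and $\Phi_c'(v)v+\phi_c'(v)=0$. Then we have
1) $\psi$ is twice differentiable with $\psi'(\alpha)=-\Phi_c(v)$ and $\psi''(\alpha)=\Phi_c'(v)/\sigma$;
2) $\psi$ is convex;
3) $\psi$ is $\gamma$-strongly convex if $\Phi_c'(v)\ge\gamma\sigma$ for all $v$;
4) $\psi$ is $\mu$-smooth convex if $\Phi_c'(v)\le\mu\sigma$ for all $v$;
5) $\psi$ is classification-calibrated for binary classification if $\Phi_c(\theta/\sigma)>0$;
6) the conjugate of $\psi$ is $\psi^\star(\beta)=\beta\theta-\phi_c(\Phi_c^{-1}(-\beta))$ for $-\beta$ in the range of $\Phi_c$, where $\Phi_c^{-1}$ is the inverse function of $\Phi_c$.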

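Proof

1) It is easy to obtain that

$$\psi'(\alpha)=-\bigl(\Phi_c'(v)v+\phi_c'(v)\bigr)-\Phi_c(v)=-\Phi_c(v),\qquad \psi''(\alpha)=\Phi_c'(v)/\sigma.$$

2) Since $\psi''(\alpha)=\Phi_c'(v)/\sigma\ge 0$, $\psi$ is convex.
3) If $\Phi_c'(v)\ge\gamma\sigma$ for all $v$, then $\psi''(\alpha)\ge\gamma$. Thus $\psi$ is $\gamma$-strongly convex, that is,

$$\psi(\alpha)-\psi(\beta)\ge\psi'(\beta)(\alpha-\beta)+\frac{\gamma}{2}|\alpha-\beta|^2,\quad\forall\alpha,\beta\in\mathbb{R}.$$

4) If $\Phi_c'(v)\le\mu\sigma$ for all $v$, then $\psi''(\alpha)\le\mu$, so the convex function $\psi$ is $\mu$-smooth, that is,

$$|\psi'(\alpha)-\psi'(\beta)|\le\mu|\alpha-\beta|,\quad\forall\alpha,\beta\in\mathbb{R}.$$

5) Since $\psi'(0)=-\Phi_c(\theta/\sigma)<0$, by Theorem 1 the convex function $\psi$ is classification-calibrated.
6) The conjugate function of $\psi$ is

$$\begin{aligned}\psi^\star(\beta)&=\sup_{\alpha\in\mathbb{R}}\bigl(\beta\alpha-\psi(\alpha)\bigr)\\&=\sup_{\alpha\in\mathbb{R}}\bigl(\beta\alpha-\Phi_c(v)(\theta-\alpha)-\phi_c(v)\sigma\bigr)\\&=\beta\theta-\inf_{v\in\mathbb{R}}\bigl(\beta v+\Phi_c(v)v+\phi_c(v)\bigr)\sigma.\end{aligned}$$

Let $g(v)=\beta v+\Phi_c(v)v+\phi_c(v)$ and $v^\star=\Phi_c^{-1}(-\beta)$. Then $g'(v)=\beta+\Phi_c(v)$. Thus, $g$ reaches its minimum at $v^\star$, and so we have

$$\psi^\star(\beta)=\beta\theta-\bigl(\beta v^\star+\Phi_c(v^\star)v^\star+\phi_c(v^\star)\bigr)=\beta\theta-\phi_c(v^\star)=\beta\theta-\phi_c(\Phi_c^{-1}(-\beta)).$$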

Note that if and satisfy the conditions in Theorem 3 and , suppose is convex and differentiable, then is also convex. The general smooth convex loss function includes many surrogate loss functions mostly used in binary classification as special cases, as shown below. Figure 1 below presents several surrogate convex loss functions.

Example 1: Least Square Loss. Let and . Let . Then

$$\psi(\alpha)=\Phi_c(v)(\theta-\alpha)+\phi_c(v)\sigma=(\theta-\alpha)^2/2,$$

which is the least square loss. The parameter satisfies . The conditions in Theorem 3 are easy to verify, and it is easy to obtain that

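$$\begin{aligned}\psi'(\alpha)&=-\Phi_c(v)=\alpha-\theta,\qquad \psi''(\alpha)=\Phi_c'(v)/\sigma=1,\\ \psi^\star(\beta)&=\beta\theta-\phi_c(\Phi_c^{-1}(-\beta))=\beta\theta+\sigma(-\beta/\sigma)^2/2=\beta\theta+\beta^2/(2\sigma),\quad\beta\in\mathbb{R}.\end{aligned}$$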

Example 2: Smooth Hinge Loss $\psi_G$. Let $\Phi_c=\Phi$ and $\phi_c=\phi$ with $\Phi$ and $\phi$ given in Section 2. Then

$$\psi(\alpha)=\Phi(v)(\theta-\alpha)+\phi(v)\sigma.$$

By (3), , so . Then . Setting gives the smooth Hinge loss . Further, , , , , and tends to the Hinge loss as . It is also easy to see that

$$\psi''(\alpha)=\Phi_c'(v)/\sigma=\phi(v)/\sigma,\qquad \psi^\star(\beta)=\beta\theta-\phi_c(\Phi_c^{-1}(-\beta))=\beta\theta-\phi(\Phi^{-1}(-\beta))$$

with .

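Example 3: Smooth Hinge Loss $\psi_M$. Let $\Phi_c=\Phi_M$ and $\phi_c=\phi_M$. Then it follows that

$$\psi(\alpha)=\Phi_c(v)(\theta-\alpha)+\phi_c(v)\sigma=\frac{1}{2}(\theta-\alpha)+\frac{1}{2}\sqrt{(\theta-\alpha)^2+\sigma^2}.$$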

By (4), , so . Then . Moreover, , , , , and tends to the Hinge loss as . Setting gives the smooth Hinge loss . It is easy to get

$$\psi''(\alpha)=\Phi_M'(v)/\sigma=(1+v^2)^{-3/2}/(2\sigma),$$

$$\psi^\star(\beta)=\beta\theta-\phi_c(\Phi_c^{-1}(-\beta))=\beta\theta-\frac{1}{2}\sqrt{1-(2\beta+1)^2}$$

with .

Example 4: Exponential Loss. Let , . Then Further, and for all . Moreover, . Letting and gives the exponential loss. It is easy to get that

$$\psi'(\alpha)=-\Phi_c(v)=-e^{v},\qquad \psi''(\alpha)=\Phi_c'(v)/\sigma=e^{v}/\sigma,$$

and

$$\psi^\star(\beta)=\beta\theta-\phi_c(\Phi_c^{-1}(-\beta))=\beta\theta+\beta(1-\ln(-\beta))=\beta(1+\theta-\ln(-\beta)),\quad\beta\in(-\infty,0).$$

Example 5: Logistic Loss. Let , . Then

$$\psi(\alpha)=\Phi_c(v)(\theta-\alpha)+\phi_c(v)\sigma=\sigma\ln(1+e^{v})\ge 0.$$

It is easy to get that , for all , and . Letting and gives the logistic loss. Further, we have

$$\psi'(\alpha)=-\frac{e^{v}}{1+e^{v}},\qquad \psi''(\alpha)=\frac{\Phi_c'(v)}{\sigma}=\frac{1}{\sigma}e^{v}(1+e^{v})^{-2}.$$

For , , so

$$\psi^\star(\beta)=\beta\theta-\phi_c(\Phi_c^{-1}(-\beta))=\beta(\theta-\ln(-\beta))+(1+\beta)\ln(1+\beta).$$
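The derivative formulas of this example can be checked by finite differences. The sketch below assumes $v=(\theta-\alpha)/\sigma$ as in Theorem 3; the function names, the test point and the step size are arbitrary illustrative choices.

```python
import math

def psi_logistic(alpha, theta, sigma):
    """psi(alpha) = sigma * ln(1 + e^v) with v = (theta - alpha) / sigma."""
    v = (theta - alpha) / sigma
    return sigma * math.log1p(math.exp(v))

def d_psi(alpha, theta, sigma):
    """Stated first derivative: -e^v / (1 + e^v)."""
    v = (theta - alpha) / sigma
    return -math.exp(v) / (1.0 + math.exp(v))

def d2_psi(alpha, theta, sigma):
    """Stated second derivative: e^v (1 + e^v)^(-2) / sigma."""
    v = (theta - alpha) / sigma
    return math.exp(v) / (sigma * (1.0 + math.exp(v)) ** 2)

alpha, theta, sigma, h = 0.3, 1.0, 0.5, 1e-4
fd1 = (psi_logistic(alpha + h, theta, sigma) - psi_logistic(alpha - h, theta, sigma)) / (2.0 * h)
fd2 = (psi_logistic(alpha + h, theta, sigma) - 2.0 * psi_logistic(alpha, theta, sigma)
       + psi_logistic(alpha - h, theta, sigma)) / (h * h)
print(fd1, d_psi(alpha, theta, sigma))
print(fd2, d2_psi(alpha, theta, sigma))
```

With $\theta=0$ and $\sigma=1$ the function reduces to the usual logistic loss $\ln(1+e^{-\alpha})$.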

Example 6: Smooth Absolute Loss. Let , . Then it follows that

$$\psi(\alpha)=\arctan(v)(\theta-\alpha)-\frac{\sigma}{2}\ln(1+v^2),$$

and , so for and for . Thus, . To make the condition to hold, we must have . Further, we have . It is also easy to derive that , making us to call the smooth absolute loss function. A direct calculation gives

$$\psi''(\alpha)=\frac{\Phi_c'(v)}{\sigma}=\frac{1}{\sigma(1+v^2)}.$$

Finally, for we have and

$$\psi^\star(\beta)=\beta\theta-\phi_c(\Phi_c^{-1}(-\beta))=\beta\theta+\frac{1}{2}\ln(1+\tan^2(-\beta)).$$

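Example 7: Smooth ReLU. ReLU is a widely used non-smooth activation function in deep neural networks (DNN), which is defined as $\mathrm{ReLU}(\alpha)=\max\{0,\alpha\}$. Define the smooth ReLU (sReLU) function as

$$\psi_{\mathrm{sReLU}}(\alpha;\sigma)=\Phi(\alpha/\sigma)\alpha+\phi(\alpha/\sigma)\sigma$$

with $\Phi$ and $\phi$ given in Section 2. Then $\psi_{\mathrm{sReLU}}(\alpha;\sigma)=\psi(-\alpha)$, where $\psi$ is defined as in Example 2 with $\theta=0$. By Example 2 we know that $\psi_{\mathrm{sReLU}}(\alpha;\sigma)$ converges uniformly to ReLU as $\sigma$ goes to $0$.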
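A minimal sketch of the sReLU function follows, using only `math.erf` for the normal CDF. The function names and the grid are illustrative; the comparison against $\sigma/\sqrt{2\pi}$ assumes that the bound proved for $\psi_G$ in Section 2 carries over through the reflection $\alpha\mapsto-\alpha$.

```python
import math

def relu(alpha):
    return max(0.0, alpha)

def srelu(alpha, sigma):
    """Smooth ReLU: Phi(alpha / sigma) * alpha + phi(alpha / sigma) * sigma."""
    v = alpha / sigma
    cdf = 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)
    return cdf * alpha + pdf * sigma

for sigma in (1.0, 0.1, 0.01):
    grid = [k / 100.0 for k in range(-300, 301)]
    worst = max(srelu(a, sigma) - relu(a) for a in grid)
    print(sigma, worst, sigma / math.sqrt(2.0 * math.pi))
```

The largest gap on the grid occurs at $\alpha=0$, where the sReLU value is $\sigma\phi(0)$.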

By Theorem 3 we have the following remarks.

1) Any smooth convex function can be rewritten in the form , where and .

2) Given a monotonically increasing, differentiable function , we are able to construct a convex, smooth surrogate loss which is classification-calibrated for binary classification. Moreover, there is no need to know the explicit expression of the loss function when learning with gradient-based methods.

3) There is great interest in developing smoothing techniques to approximate a non-smooth convex function [25, 32]. For example, [25] improved the traditional bounds on the number of iterations of gradient methods by using a special smoothing technique in non-smooth convex minimization problems. Theorem 3 provides a new smoothing technique: search for a monotonically increasing and differentiable function which approximates the sub-gradient of the non-smooth convex function.

4 Algorithms

There are many algorithms for SVMs. In the early time, the decomposition methods such as SMO and , which overcome the memory requirement of the quadratic optimization methods, have been proposed to solve L1-SVM in its dual form [26, 15] (see [41] for convergence analysis of the decomposition methods including SMO and ). Later, several convex optimization methods have been introduced, such as the gradient-based methods [43], the bundle method [35, 11], the coordinate descent method [42, 6], the dual method [13, 30, 45] and online learning methods [29, 37, 10]. Based on the generalized Hessian matrix of a convex function with locally Lipschitz gradient proposed in [23], several second-order methods have been applied to solve L2-SVM [16, 20].

In this paper, we focus on the following smooth support vector machine

$$\arg\min_{w\in\mathbb{R}^p}\ L(w):=\frac{1}{2}\lambda\|w\|^2+\frac{1}{n}\sum_{i=1}^n\psi(y_iw^Tx_i;\sigma), \tag{5}$$

where $\psi$ is the smooth Hinge loss $\psi_G$ or $\psi_M$. We will analyze first-order and second-order convex algorithms for the above SSVM.

4.1 First-Order Algorithms

The gradient descent (GD) method has taken the stage as the primary workhorse for convex optimization problems. It iteratively approaches the optimal solution. In each iteration, the standard full gradient descent (FGD) method takes the negative gradient as the descent direction, and the classifier updates as follows

$$w_{t+1}=w_t-\eta_t\Bigl[\lambda w_t-\frac{1}{n}\sum_{i=1}^n\Phi_c(v_i^t)y_ix_i\Bigr], \tag{6}$$

where $\eta_t$ is a predefined step size and $v_i^t=(1-y_iw_t^Tx_i)/\sigma$. The FGD method has an $O(1/t)$ convergence rate under standard assumptions [24]. Nesterov proposed a famous accelerated gradient (AGD) method in 1982 (see [24]). The AGD method achieves the optimal convergence rate of $O(1/t^2)$ for smooth objective functions.
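The full-gradient update (6) can be sketched in a few lines of plain Python. The sketch uses $\psi_M$, for which $\Phi_M(v)=\frac{1}{2}\bigl(1+v/\sqrt{1+v^2}\bigr)$ is inferred from Example 3; the constant step size, the toy data and all names are assumptions made for illustration, not settings from the paper. Only $\Phi_M$ is evaluated, never the loss itself.

```python
import math

def cdf_M(v):
    # Assumed form of Phi_M, inferred from Example 3.
    return 0.5 * (1.0 + v / math.sqrt(1.0 + v * v))

def full_gradient(w, X, y, lam, sigma):
    """lam * w - (1/n) * sum_i Phi_M(v_i) * y_i * x_i, v_i = (1 - y_i w^T x_i) / sigma."""
    n = len(X)
    grad = [lam * wj for wj in w]
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        weight = cdf_M((1.0 - margin) / sigma)
        for j, xij in enumerate(xi):
            grad[j] -= weight * yi * xij / n
    return grad

def fgd(X, y, lam=0.1, sigma=0.5, eta=0.1, steps=500):
    """Full gradient descent with a constant step size (a simplification)."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = full_gradient(w, X, y, lam, sigma)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

# Hypothetical, linearly separable toy data.
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w = fgd(X, y)
print(w, [yi * sum(wj * xj for wj, xj in zip(w, xi)) for xi, yi in zip(X, y)])
```

Each iteration touches all $n$ examples, which is the cost that motivates the stochastic variant.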

However, for large-scale problems it is computationally very expensive to compute the full gradient in each iteration. To accelerate the learning procedure, the linear classifier can be updated with a stochastic gradient (SG) step instead of a full gradient (FG) step, as in Pegasos [29]. Precisely, at the $t$-th iteration, $w_t$ is updated based on a randomly chosen example $(x_{i_t},y_{i_t})$:

$$w_{t+1}=w_t-\eta_t\bigl[\lambda w_t-\Phi_c(v_{i_t}^t)y_{i_t}x_{i_t}\bigr], \tag{7}$$

where $\eta_t$ is a predefined step size which is required to satisfy $\sum_t\eta_t=\infty$ and $\sum_t\eta_t^2<\infty$ for convergence. $\Phi_c(v_{i_t}^t)$ can be seen as the learning rate of the chosen example $(x_{i_t},y_{i_t})$. Noting that $\eta_t$ is independent of $(x_{i_t},y_{i_t})$ while $\Phi_c(v_{i_t}^t)$ depends on the margin of the chosen example, we call $\eta_t$ the exogenous learning rate and $\Phi_c(v_{i_t}^t)$ the endogenous learning rate.
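The stochastic update (7) differs from the full-gradient step only in that one example is drawn per iteration. The sketch below uses $\psi_G$, so $\Phi_c$ is the standard normal CDF, and it assumes $v=(1-y\,w^Tx)/\sigma$. The step size $\eta_t=1/(\lambda(t+t_0))$ is an illustrative Pegasos-style choice with a hypothetical offset `t0`, and the toy data are invented.

```python
import math
import random

def normal_cdf(v):
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def sgd_ssvm(X, y, lam=0.1, sigma=0.5, steps=2000, t0=10, seed=0):
    """One randomly chosen example per step; eta_t = 1 / (lam * (t + t0))."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for t in range(steps):
        i = rng.randrange(len(X))
        eta = 1.0 / (lam * (t + t0))          # exogenous learning rate
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        endo = normal_cdf((1.0 - margin) / sigma)  # endogenous learning rate
        w = [wj - eta * (lam * wj - endo * y[i] * xj) for wj, xj in zip(w, X[i])]
    return w

X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w = sgd_ssvm(X, y)
print(w)
```

Examples with a large positive margin get an endogenous rate close to zero and barely move the classifier, while misclassified examples get a rate close to one.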

An advantage of the above stochastic gradient descent (SGD) algorithm is that its iteration cost is independent of the number of training examples. This property makes it suitable for large-scale problems. In the SGD algorithm, the full gradient $\nabla L(w_t)$ is replaced by the stochastic gradient of a randomly chosen example in the $t$-th iteration. Though both gradients are equal in expectation, they are rarely the same. Thus, it is not natural to use the norm of the stochastic gradient as a stopping criterion. The discrepancy between the FG and the SG also has a negative effect on the convergence rate of the SGD method. Since there is no guarantee that the stochastic gradient will approach zero, we need to employ a monotonically decreasing step size sequence with $\eta_t\to 0$ for convergence. The small step size leads to sub-linear convergence of the SGD method even for strongly convex objective functions.

Many first-order algorithms have been proposed by combining the low computation cost of SGD and the faster convergence of the FGD method to provide variance reduction [8, 36]. These novel algorithms are known as stochastic variance reduction gradient (SVRG) algorithms, such as SAG [28], SAGA [8], SDCA [30], Finito [9], and SVRG [36]. Their convergence analysis can be found in the corresponding references.

4.2 Second-Order Algorithms

Recently, inexact Newton methods without computing the inverse Hessian matrix have been proposed to obtain a superlinear convergence rate in machine learning, such as LiSSA [1] and TRON [19, 20]. LiSSA constructs a natural estimator of the inverse Hessian matrix by using the Taylor expansion, while TRON is a trust region Newton method introduced in [19] to deal with general bound-constrained optimization problems, which generates an approximate Newton direction by solving a trust-region subproblem. In TRON, the direction step should give as much reduction as the Cauchy step. [20] applied the TRON method to maximize the log-likelihood of the logistic regression model, in which a conjugate gradient method was used to solve the trust-region subproblem approximately. TRON was also extended to solve the L2-SVM model by introducing a general Hessian for convex objective functions having a Lipschitz continuous gradient [23].

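We want to apply TRON to solve our SSVM problem (5). By Theorem 3 we have

$$\nabla^2\psi(y_iw^Tx_i)=x_i\Bigl(\frac{\Phi_c'(v_i)}{\sigma}\Bigr)x_i^T=d_ix_ix_i^T,$$

where $v_i=(1-y_iw^Tx_i)/\sigma$ and $d_i=\Phi_c'(v_i)/\sigma$. Then the Hessian matrix of the total loss function is given by

$$\nabla^2L(w)=\lambda E+\frac{1}{n}\sum_{i=1}^nd_ix_ix_i^T=\lambda E+\frac{1}{n}X^TDX,$$

where $E$ is the identity matrix, $D=\mathrm{diag}(d_1,\dots,d_n)$ is a diagonal matrix, and $X=[x_1,\dots,x_n]^T$ is the input feature matrix with each row representing an instance.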
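Conjugate-gradient solvers of the kind used inside trust-region Newton methods only need products of the Hessian with a vector, so the $p\times p$ matrix never has to be formed. The sketch below computes $\nabla^2L(w)s=\lambda s+\frac{1}{n}X^T(D(Xs))$ for $\psi_G$, where $d_i=\phi(v_i)/\sigma$ and $v_i=(1-y_iw^Tx_i)/\sigma$ is assumed. The function names and the tiny data set are illustrative, and this is not the TRON implementation of [19, 20].

```python
import math

def normal_pdf(v):
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def hessian_vector_product(w, s, X, y, lam, sigma):
    """Return (lam * E + X^T D X / n) s without building the Hessian."""
    n = len(X)
    out = [lam * sj for sj in s]
    for xi, yi in zip(X, y):
        v_i = (1.0 - yi * dot(w, xi)) / sigma
        d_i = normal_pdf(v_i) / sigma       # d_i = Phi'(v_i) / sigma for psi_G
        coeff = d_i * dot(xi, s) / n        # scalar d_i * (x_i^T s) / n
        for j, xij in enumerate(xi):
            out[j] += coeff * xij
    return out

X = [[1.0, 0.0], [0.0, 2.0]]
y = [1, -1]
print(hessian_vector_product([0.0, 0.0], [1.0, 1.0], X, y, lam=0.5, sigma=1.0))
```

For sparse data each product costs time proportional to the number of nonzero features, which is what makes this approach attractive in high dimensions.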

The Newton step is given as $s=-(\nabla^2L(w))^{-1}\nabla L(w)$, which requires a huge computation cost for high-dimensional machine learning problems. The trust-region method instead provides an approximate Newton direction. For recent advances in trust-region methods see [40]. Suppose $w_t$ is the solution at the $t$-th iteration. Then the trust-region method generates a direction step by solving the quadratic subproblem

 s