# Stability of the Stochastic Gradient Method for an Approximated Large Scale Kernel Machine

In this paper we measured the stability of the stochastic gradient method (SGM) for learning an approximated Fourier primal support vector machine. The stability of an algorithm is assessed by measuring the generalization error, i.e., the absolute difference between the test and the training error. Our problem is to learn an approximated kernel function, built from random Fourier features, for a binary classification problem in an online convex optimization setting. For a convex, Lipschitz continuous and smooth loss function, SGM is stable given a reasonable number of iterations. We showed that, with high probability, SGM generalizes well for an approximated kernel under the given assumptions, and we empirically verified the theoretical findings for different parameters using several data sets.


## 1 Introduction

The stochastic gradient method (SGM) is widely used as an optimization tool in many machine learning applications, including (linear) support vector machines shalev2011pegasos ; zhu2009p ; zhuang2015distributed ; bottou2012stochastic , graphical models dean2012large ; poon2011sum ; bonnabel2013stochastic , and neural networks deng2013recent ; krizhevsky2012imagenet ; haykin2009neural . SGM computes an estimate of the gradient from a single randomly chosen sample in each iteration; therefore, applying a stochastic gradient method to large scale machine learning problems can be computationally efficient zhang2004solving ; alrajeh2014large ; bottou2010large ; hsieh2008dual . In the context of supervised learning, models trained by such iterative optimization algorithms are commonly assessed through convergence rate analysis. The convergence rate describes how fast the optimization error decreases as the number of iterations grows. However, many fast converging algorithms are algorithmically less stable (an algorithm is stable if it is robust to small perturbations in the composition of the learning data set). The stability of an algorithm is assessed by measuring the generalization error, i.e., the absolute difference between the test and the training error.

The classical results by Bousquet and Elisseeff bousquet2002stability showed that a randomized algorithm such as SGM is uniformly stable if, for all data sets differing in one element, the learned models produce nearly the same predictions. Hardt et al. hardt2015train suggested that, by choosing a reasonable number of iterations, SGM generalizes well and prevents overfitting under standard Lipschitz and smoothness assumptions. Therefore, the iterative optimization algorithm can stop long before its convergence to reduce computational cost. The expected excess risk decomposition has been the main theoretical guideline for this kind of early-stopping criterion. Motivated by this approach, we derived a high probability bound on the expected risk of an approximated kernel machine using the stability definition. We proposed that, in the context of supervised learning, SGM generalizes well with high probability for an approximated kernel under proper assumptions. We showed that with a few iterations the generalization error is independent of the model size and is a function of the number of epochs. In addition, we explored the effect of the learning rate and of the number of Fourier components on the generalization error.

In this paper, we proved that SGM generalizes well for an approximated kernel under proper assumptions by choosing only a few iterations while remaining stable. In particular, we mapped the input data to a randomized low-dimensional feature space to accelerate the training of kernel machines using random Fourier features rahimi2007random . We then incorporated the approximated kernel function into the primal of the SVM to form a linear primal objective function, following chapelle2007training . Finally, we showed that SGM generalizes well for the approximated algorithm under proper assumptions by incorporating the stability term into the classical convergence bound.

This paper is organized as follows. Section 2 states the problem in detail, together with the convex optimization setting used. The theoretical analysis is presented in Section 3, followed by the numerical results in Section 4. Finally, a discussion is provided in Section 5.

## 2 Preliminaries

### 2.1 Optimization problem

Given a training set $S=\{(x_i,y_i)\}_{i=1}^{n}$ with $x_i\in\mathbb{R}^{d}$ and $y_i\in\{-1,+1\}$, a linear hyperplane for SVM problems is defined by $f(x)=w^{T}x$, where $n$ is the number of training examples and $w\in\mathbb{R}^{d}$ is the weight coefficient vector. The standard primal SVM optimization problem is:

$$\min_{w\in\mathbb{R}^{d}}\ \frac{\lambda}{2}\|w\|^{2}+\frac{1}{n}\sum_{i=1}^{n}\max\left(0,\,1-y_{i}f(x_{i})\right). \tag{1}$$

Rather than using the original input attributes $x$, we instead used the kernel trick, so that the algorithm accesses the data only through evaluations of $K(x_i,x_j)$. This is a simple way to generate features for algorithms that depend only on the inner product between pairs of input points. The kernel trick relies on the observation that any positive definite function $K(x_i,x_j)$ with $x_i,x_j\in\mathbb{R}^{d}$ defines an inner product and a lifting $\phi$ so that the inner product between lifted data points can be quickly computed as $K(x_i,x_j)=\langle\phi(x_i),\phi(x_j)\rangle$. Our goal is to efficiently learn a kernel prediction function $f$ and an associated Reproducing Kernel Hilbert Space $\mathcal{H}$ as follows:

$$\min_{f\in\mathcal{H}}\ \frac{\lambda}{2}\|f\|_{\mathcal{H}}^{2}+\frac{1}{n}\sum_{i=1}^{n}\max\left(0,\,1-y_{i}f(x_{i})\right). \tag{2}$$

where

$$f(x)=\sum_{i=1}^{n}\alpha_{i}K(x_{i},x). \tag{3}$$

However, in large scale problems, dealing with kernels can be computationally expensive. Hence, instead of relying on the implicit lifting provided by the kernel trick, we explicitly mapped the data to a low-dimensional Euclidean inner product space using a randomized feature map $z:\mathbb{R}^{d}\to\mathbb{R}^{2D}$, so that the inner product between a pair of transformed points approximates their kernel evaluation yang2012nystrom ; rahimi2008random . Given the random Fourier features, we then learned a linear machine by solving the following optimization problem:

$$\min_{w\in\mathbb{R}^{2D}}\ \frac{\lambda}{2}\|w\|_{2}^{2}+\frac{1}{n}\sum_{i=1}^{n}\max\left(0,\,1-y_{i}w^{T}z(x_{i})\right). \tag{4}$$
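As an illustration, the random Fourier feature map for the RBF kernel can be sketched as follows; the dimensions, kernel width $\gamma$, and variable names are our illustrative choices, not settings from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, gamma = 5, 4000, 0.5  # input dimension, Fourier components, RBF width

# For k(x, y) = exp(-gamma * ||x - y||^2), the frequencies are sampled from
# the kernel's Fourier transform, here N(0, 2 * gamma * I).
W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, d))

def z(X):
    """Map rows of X to random Fourier features z(x) in R^{2D}."""
    P = X @ W.T
    return np.concatenate([np.cos(P), np.sin(P)], axis=1) / np.sqrt(D)

# The inner product of transformed points approximates the kernel value.
x, y = rng.normal(size=(1, d)), rng.normal(size=(1, d))
approx = float(z(x) @ z(y).T)
exact = float(np.exp(-gamma * np.sum((x - y) ** 2)))
print(abs(approx - exact))  # small for large D
```

Note that $z(x)^{T}z(x)=1$ exactly with this parameterization, so the approximation error appears only off the diagonal.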

### 2.2 Convex optimization settings

The goal of our online learning is to achieve minimum expected risk; hence we minimize the loss function. Throughout the paper, we focus on convex, Lipschitz continuous and gradient smooth loss functions, whose definitions are provided here.

###### Definition 2.1

A function $f$ is $L$-Lipschitz continuous if for all $x,y$ we have

$$|f(x)-f(y)|\le L\|x-y\|. \tag{5}$$
###### Definition 2.2

A function $f$ is $\beta$-gradient Lipschitz continuous ($\beta$-smooth) if for all $x,y$ we have

$$\|\nabla f(x)-\nabla f(y)\|\le\beta\|x-y\|. \tag{6}$$

In the theoretical analysis section we require a convex, Lipschitz continuous and gradient smooth function. Note that the huber-hinge loss is Lipschitz continuous and has a Lipschitz continuous gradient; it is defined as follows:

$$l_{\text{huber-hinge}}\!\left(y_{i},w^{T}z(x_{i})\right)=\begin{cases}-4\,y_{i}w^{T}z(x_{i}), & y_{i}w^{T}z(x_{i})<-1,\\ \left(1-y_{i}w^{T}z(x_{i})\right)^{2}, & -1\le y_{i}w^{T}z(x_{i})\le 1,\\ 0, & y_{i}w^{T}z(x_{i})>1.\end{cases}$$

Therefore, in this paper, we used the following optimization problem:

$$\min_{w\in\mathbb{R}^{2D}}\ \sum_{i=1}^{n}l_{\text{huber-hinge}}\!\left(y_{i},w^{T}z(x_{i})\right). \tag{7}$$
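A minimal implementation of this huber-hinge loss and its (Lipschitz continuous) derivative, with a finite-difference sanity check; the function names are ours:

```python
import numpy as np

def huber_hinge(m):
    """Huber-hinge loss as a function of the margin m = y * w^T z(x)."""
    return np.where(m > 1, 0.0, np.where(m < -1, -4.0 * m, (1.0 - m) ** 2))

def huber_hinge_grad(m):
    """Derivative with respect to the margin; piecewise and continuous."""
    return np.where(m > 1, 0.0, np.where(m < -1, -4.0, -2.0 * (1.0 - m)))

# Finite-difference sanity check that the derivative matches the loss,
# including the transition points m = -1 and m = 1.
m = np.linspace(-3.0, 3.0, 13)
eps = 1e-6
num = (huber_hinge(m + eps) - huber_hinge(m - eps)) / (2.0 * eps)
print(np.max(np.abs(num - huber_hinge_grad(m))))  # small
```

The two pieces agree at the transition points (value $4$ and slope $-4$ at $m=-1$; value $0$ and slope $0$ at $m=1$), which is what makes the gradient Lipschitz continuous.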

For simplicity, the loss function in (7) is denoted by $l$. Let $w^{*}$ be the minimizer of the population risk:

$$R(w)\stackrel{\text{def}}{=}\mathbb{E}_{(x,y)}\left[l(w)\right]. \tag{8}$$

Let $\bar{w}_{T}=\frac{1}{T}\sum_{t=1}^{T}w_{t}$, where $T$ is the maximum number of SGM iterations. According to lularge and nemirovski2009robust we have the following lemma:

###### Lemma 2.3

Let $l$ be a convex loss satisfying $\|\nabla l\|\le L$ and let $\eta$ be the constant learning rate. Let $\bar{w}_{T}=\frac{1}{T}\sum_{t=1}^{T}w_{t}$, where $T$ is the maximum SGM iteration. Also, let $w^{*}$ be the minimizer of the population risk $R(w)$. Then,

$$R[\bar{w}_{T}]\le R[w^{*}]+\frac{\|w^{*}\|^{2}}{2T\eta}+\frac{\eta}{2}L^{2}. \tag{9}$$
###### Proof

Note that

$$\|w_{t+1}-w^{*}\|^{2}=\|w_{t}-\eta\nabla l_{t}(w_{t})-w^{*}\|^{2}=\|w_{t}-w^{*}\|^{2}+\eta^{2}\|\nabla l_{t}(w_{t})\|^{2}-2\eta\,\nabla l_{t}(w_{t})^{T}(w_{t}-w^{*}), \tag{10}$$

and, by convexity of $l_{t}$,

$$l_{t}(w_{t})-l_{t}(w^{*})\le\nabla l_{t}(w_{t})^{T}(w_{t}-w^{*}). \tag{11}$$

Combining these two we have:

$$l_{t}(w_{t})-l_{t}(w^{*})\le\frac{\|w_{t}-w^{*}\|^{2}-\|w_{t+1}-w^{*}\|^{2}}{2\eta}+\frac{\eta}{2}\|\nabla l_{t}(w_{t})\|^{2}. \tag{12}$$

Summing the above over $t=1,\dots,T$, averaging, and using the convexity of $l$ to bound $R[\bar{w}_{T}]$ by the average proves the lemma.
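The setting of Lemma 2.3, constant-step SGM returning the averaged iterate $\bar{w}_{T}$, can be sketched on a toy problem; the least-squares objective and all parameter values below are illustrative stand-ins:

```python
import numpy as np

def sgm_average(grad_sample, w0, eta, T, rng):
    """Constant-step SGM returning the averaged iterate bar(w)_T = (1/T) sum_t w_t."""
    w = w0.copy()
    w_sum = np.zeros_like(w0)
    for _ in range(T):
        w = w - eta * grad_sample(w, rng)  # step on one random sample's gradient
        w_sum += w
    return w_sum / T

# Toy check on a realizable least-squares problem with single-sample gradients.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5))
b = A @ np.ones(5)  # the true minimizer is the all-ones vector

def grad_sample(w, rng):
    i = rng.integers(len(b))  # single randomly chosen sample, as in SGM
    return (A[i] @ w - b[i]) * A[i]

w_bar = sgm_average(grad_sample, np.zeros(5), eta=0.01, T=5000, rng=rng)
print(np.linalg.norm(w_bar - np.ones(5)))  # close to the minimizer
```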

From Rahimi and Recht rahimi2007random , we know that with high probability, at least $1-\delta$, the difference between the approximated kernel value and the exact kernel value is bounded. Here $\sigma_{p}^{2}$ is the second moment of the Fourier transform of the kernel function. The following inequality holds when $D=\Omega\!\left(\frac{d}{\epsilon^{2}}\log\frac{\sigma_{p}\,\mathrm{diam}(\mathcal{M})}{\epsilon}\right)$, where $\mathcal{M}$ is the compact domain of the data:

$$\left|z(x_{i})^{T}z(x_{j})-k(x_{i},x_{j})\right|<\epsilon. \tag{13}$$

Assuming $w^{*}=\sum_{i}\alpha_{i}z(x_{i})$ and $f^{*}=\sum_{i}\alpha_{i}k(x_{i},\cdot)$, then:

$$\|w^{*}\|^{2}\le(1+\epsilon)\|f^{*}\|_{1}^{2}, \tag{14}$$

where $\|f^{*}\|_{1}=\sum_{i}|\alpha_{i}|$, which results from (13). By substituting Equation (14) in Equation (9), with high probability $1-\delta$, we obtain:

$$R[\bar{w}_{T}]\le R[w^{*}]+\frac{(1+\epsilon)\|f^{*}\|_{1}^{2}}{2T\eta}+\frac{\eta}{2}L^{2}+\epsilon L\|f^{*}\|_{1}. \tag{15}$$

The optimization error is defined as the gap between the empirical risk and the minimum empirical risk, in expectation:

$$\epsilon_{\text{opt}}(w)\stackrel{\text{def}}{=}\mathbb{E}\left[R_{S}[w]-R_{S}[w_{S}^{*}]\right], \tag{16}$$

where $S$ denotes a population sample of size $n$, $w_{S}^{*}$ is the minimizer of the empirical risk, and $R_{S}$ is the empirical risk defined as:

$$R_{S}[w]\stackrel{\text{def}}{=}\frac{1}{n}\sum_{i=1}^{n}l\left(w;(x_{i},y_{i})\right). \tag{17}$$

Note that the expected minimum empirical risk is smaller than the minimum population risk $R[w^{*}]$, implying:

$$\mathbb{E}\left[R_{S}[w_{S}^{*}]\right]\le R[w^{*}]. \tag{18}$$

Hence, based on Equations (15), (16) and (18), with probability at least $1-\delta$, we have:

$$\epsilon_{\text{opt}}(w)\le\frac{(1+\epsilon)\|f^{*}\|_{1}^{2}}{2T\eta}+\frac{\eta}{2}L^{2}. \tag{19}$$
###### Lemma 2.4

Let $l$ be a convex, $L$-Lipschitz continuous loss function. Let $\|w^{*}\|^{2}\le(1+\epsilon)\|f^{*}\|_{1}^{2}$, resulting from (13) and (14). Also let $T=n$. Suppose we make a single pass of SGM over all $n$ samples and choose $\eta=\frac{\|f^{*}\|_{1}}{L\sqrt{n}}$. Then, with high probability $1-\delta$, the classical convergence bound in (15) becomes:

$$\mathbb{E}\left[R[\bar{w}_{n}]\right]\le R[w^{*}]+\frac{(2+\epsilon)\|f^{*}\|_{1}L}{2\sqrt{n}}. \tag{20}$$

Knowing that

$$\mathbb{E}\left[R[w]\right]\le\mathbb{E}\left[R_{S}[w_{S}^{*}]\right]+\epsilon_{\text{opt}}(w)+\epsilon_{\text{stab}}, \tag{21}$$

where $\epsilon_{\text{stab}}$ is the stability error satisfying $\epsilon_{\text{stab}}\le\frac{TL^{2}\eta}{n}$, given that the function is $L$-Lipschitz continuous and $\beta$-smooth. We know that $\epsilon_{\text{opt}}$ decreases with the number of SGM iterations while $\epsilon_{\text{stab}}$ increases. Hardt et al. hardt2015train showed that, given a few iterations and by balancing $\epsilon_{\text{opt}}$ and $\epsilon_{\text{stab}}$, the generalization error decreases. In the next section, we explore whether using SGM for an approximated algorithm, which is favorable in terms of computational cost, generalizes well by choosing a few iterations while being stable.
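The trade-off in (21) can be illustrated numerically by evaluating the bound $\frac{(1+\epsilon)\|f^{*}\|_{1}^{2}}{2T\eta}+\frac{\eta}{2}L^{2}+\frac{TL^{2}\eta}{n}$ for some illustrative constants (none of these values come from the paper): the optimization term shrinks with $T$ while the stability term grows, so the total bound is minimized at a moderate $T$.

```python
import numpy as np

# Illustrative constants, not the paper's experimental settings.
L, f1, eps, n = 1.0, 1.0, 0.1, 10_000

def bound(T, eta):
    opt = (1 + eps) * f1**2 / (2 * T * eta) + eta * L**2 / 2  # epsilon_opt part
    stab = T * L**2 * eta / n                                 # epsilon_stab part
    return opt + stab

eta = 0.05
for T in (100, 1_000, 10_000, 100_000):
    print(T, round(bound(T, eta), 4))  # decreases, then increases with T
```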

## 3 Generalization of SGM for an approximated algorithm

###### Theorem 3.1

Let $l$ be $L$-Lipschitz continuous and $\beta$-smooth. Let $w_{S}^{*}$ be the minimizer of the empirical risk and $\bar{w}_{T}=\frac{1}{T}\sum_{t=1}^{T}w_{t}$. Let $\|f^{*}\|_{1}=\sum_{i}|\alpha_{i}|$, where $\alpha_{i}$ is the coefficient of the $i$th support vector. For $T$ the maximum number of SGM iterations, with high probability $1-\delta$, we have:

$$\mathbb{E}\left[R[\bar{w}_{T}]\right]\le\mathbb{E}\left[R_{S}[w_{S}^{*}]\right]+\frac{\|f^{*}\|_{1}L\sqrt{1+\epsilon}\,\sqrt{n+2T}}{\sqrt{nT}}. \tag{22}$$
###### Proof

Recall that, with high probability, $\epsilon_{\text{opt}}(w)\le\frac{(1+\epsilon)\|f^{*}\|_{1}^{2}}{2T\eta}+\frac{\eta}{2}L^{2}$ by (19). Also recall that $\epsilon_{\text{stab}}\le\frac{TL^{2}\eta}{n}$. By substituting these two terms in (21), for every $\eta>0$, with high probability $1-\delta$, we have:

$$\mathbb{E}\left[R[\bar{w}_{T}]\right]-\mathbb{E}\left[R_{S}[w_{S}^{*}]\right]\le\frac{(1+\epsilon)\|f^{*}\|_{1}^{2}}{2T\eta}+\frac{\eta}{2}L^{2}+\frac{TL^{2}\eta}{n}. \tag{23}$$

Setting the derivative of the right-hand side of (23) with respect to $\eta$ to zero, the optimal $\eta$ is:

$$\eta=\frac{\|f^{*}\|_{1}\sqrt{(1+\epsilon)\,n}}{L\sqrt{T(n+2T)}}. \tag{24}$$

By substituting the optimal $\eta$ into Equation (23), the theorem is proved.

The above theorem suggests that, with high probability, SGM generalizes well for an approximated kernel with an $L$-Lipschitz continuous and $\beta$-smooth loss function. In general, the optimization error $\epsilon_{\text{opt}}$ decreases with the number of SGM iterations while the stability error $\epsilon_{\text{stab}}$ increases. From (22) we can see that, as the number of iterations increases, $\epsilon_{\text{opt}}$ and $\epsilon_{\text{stab}}$ become less balanced. Thus, choosing a few iterations balances $\epsilon_{\text{opt}}$ and $\epsilon_{\text{stab}}$, yielding a stable SGM.

By setting $\eta$ as in (24), when $T=n$ the generalization error bound for an approximated kernel based on random Fourier features is given by

$$\mathbb{E}\left[R[\bar{w}_{T}]\right]-\mathbb{E}\left[R_{S}[w_{S}^{*}]\right]\le O\!\left(\frac{1}{\sqrt{n}}\right). \tag{25}$$

Our generalization bound has a convergence rate of $O(1/\sqrt{n})$, which compares favorably with the rate achieved by shalev2011pegasos . Recall from rahimi2007random that the required number of random Fourier components is $D=\Omega\!\left(\frac{d}{\epsilon^{2}}\log\frac{\sigma_{p}}{\epsilon}\right)$; fixing the accuracy $\epsilon$ determines how many Fourier features must be sampled to achieve the bound with high probability. A regular kernel classifier $f(x)=\sum_{i=1}^{n}\alpha_{i}k(x_{i},x)$ requires $O(nd)$ operations to evaluate, whereas with the randomized feature map only $O(D+d)$ operations are required. Thus, with a reasonable number of iterations, an approximated kernel learning machine is faster than a regular kernel method, with the added advantage of preventing overfitting, which makes it more practical for large-scale kernel learning.

## 4 Experimental results

Theoretically, we proved that an approximated Fourier primal support vector machine is stable, provided a smooth loss function and a relatively sufficient number of steps. Thus, given a reasonable number of epochs, the stochastic gradient method generalizes well and prevents possible overfitting. We numerically show the effect of three parameters on stability: the model size, the number of Fourier components, and the learning rate. Table 1 describes the four binary classification data sets used for the analysis; they can be downloaded from the UCI machine learning repository.

In Figure 1, we show the effect of the number of random Fourier features on the generalization error. The results show that the generalization error is a function of the number of random Fourier features: by choosing a large number of Fourier components, the approximated kernel performs nearly the same as exact kernel based learning. In other words, sampling more Fourier components yields a more accurate approximation of the kernel function. In general, increasing the number of Fourier components leads to a better approximation and thus a lower testing error sutherland2015error . On the other hand, the computational cost is proportional to the number of Fourier components. Hence, we performed a simulation based experiment for each data set to find the best number of Fourier components, restricting it to a maximum of 200 features for computational cost reasons.

In addition, we used numerical examples to demonstrate the dependence of the generalization error on the number of epochs and its independence of the sample size. Figure 2 shows the generalization error for different numbers of epochs, where an epoch is one complete pass through the training set. The results demonstrate that the generalization error is a function of the number of epochs and not of the model size.

Choosing a proper learning rate for the stochastic gradient method is crucial. Figure 3 shows the strong impact of the learning rate on the generalization error. We conducted an experiment to search for the best learning rate for each data set.
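The epoch experiment behind Figure 2 can be sketched as follows, with synthetic data standing in for the UCI sets; all sizes, $\gamma$, and the learning rate below are illustrative choices, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (a stand-in for the UCI sets).
d, n = 10, 400
X = rng.normal(size=(2 * n, d))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=2 * n))
Xtr, ytr, Xte, yte = X[:n], y[:n], X[n:], y[n:]

# Random Fourier features for an RBF kernel, z(x) in R^{2D}.
D, gamma = 100, 0.1
W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))

def z(X):
    P = X @ W.T
    return np.concatenate([np.cos(P), np.sin(P)], axis=1) / np.sqrt(D)

Ztr, Zte = z(Xtr), z(Xte)

def err(w, Z, y):
    """Misclassification rate of the linear model w on features Z."""
    return float(np.mean(np.sign(Z @ w) != y))

def train(epochs, eta=0.1):
    """Single-sample SGM on the huber-hinge primal; one pass = one epoch."""
    w = np.zeros(2 * D)
    for _ in range(epochs):
        for i in rng.permutation(n):
            m = ytr[i] * (Ztr[i] @ w)
            g = 0.0 if m > 1 else (-4.0 if m < -1 else -2.0 * (1.0 - m))
            w -= eta * g * ytr[i] * Ztr[i]
    return w

# Generalization gap |test error - train error| for increasing epochs.
for epochs in (1, 5, 50):
    w = train(epochs)
    gap = abs(err(w, Zte, yte) - err(w, Ztr, ytr))
    print(epochs, round(gap, 3))
```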

## 5 Discussion

In this paper we measured the stability of the stochastic gradient method (SGM) for learning an approximated Fourier primal support vector machine. We demonstrated that a large-scale approximated online kernel machine trained with SGM is stable with high probability. The empirical results showed that the generalization error is a function of the number of epochs and is independent of the model size. We also showed the strong impact of the learning rate and of the number of Fourier components. Moreover, utilizing random Fourier features induces variance, which slows down the convergence rate; one way to tackle this problem is to use variance reduction methods such as the stochastic variance reduced gradient (SVRG) johnson2013accelerating .

## 6 Conflict of Interest

The authors declare that they have no conflict of interest.

###### Acknowledgements.
The authors are grateful for the tremendous support and help of Professor Maryam Fazel.

## References

• (1) Alrajeh, A., Niranjan, M.: Large-scale reordering model for statistical machine translation using dual multinomial logistic regression (2014)
• (2) Bonnabel, S.: Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control 58(9), 2217–2229 (2013)
• (3) Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp. 177–186. Springer (2010)
• (4) Bottou, L.: Stochastic gradient descent tricks. In: Neural networks: Tricks of the trade, pp. 421–436. Springer (2012)
• (5) Bousquet, O., Elisseeff, A.: Stability and generalization. The Journal of Machine Learning Research 2, 499–526 (2002)
• (6) Chapelle, O.: Training a support vector machine in the primal. Neural computation 19(5), 1155–1178 (2007)
• (7) Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al.: Large scale distributed deep networks. In: Advances in neural information processing systems, pp. 1223–1231 (2012)
• (8) Deng, L., Li, J., Huang, J.T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., et al.: Recent advances in deep learning for speech research at microsoft. In: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8604–8608. IEEE (2013)
• (9) Hardt, M., Recht, B., Singer, Y.: Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240 (2015)
• (10) Haykin, S.S.: Neural networks and learning machines, vol. 3. Pearson, Upper Saddle River, NJ, USA (2009)
• (11) Hsieh, C.J., Chang, K.W., Lin, C.J., Keerthi, S.S., Sundararajan, S.: A dual coordinate descent method for large-scale linear svm. In: Proceedings of the 25th international conference on Machine learning, pp. 408–415. ACM (2008)
• (12) Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)
• (13) Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012)
• (14) Lu, J., Hoi, S.C., Wang, J., Zhao, P., Liu, Z.Y.: Large scale online kernel learning
• (15) Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization 19(4), 1574–1609 (2009)
• (16) Poon, H., Domingos, P.: Sum-product networks: A new deep architecture. In: Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pp. 689–690. IEEE (2011)

• (17) Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in neural information processing systems, pp. 1177–1184 (2007)
• (18) Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in neural information processing systems, pp. 1177–1184 (2008)
• (19) Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A.: Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming 127(1), 3–30 (2011)
• (20) Sutherland, D.J., Schneider, J.: On the error of random fourier features. arXiv preprint arXiv:1506.02785 (2015)
• (21) Yang, T., Li, Y.F., Mahdavi, M., Jin, R., Zhou, Z.H.: Nyström method vs random fourier features: A theoretical and empirical comparison. In: Advances in neural information processing systems, pp. 476–484 (2012)
• (22) Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the twenty-first international conference on Machine learning, p. 116. ACM (2004)
• (23) Zhu, Z.A., Chen, W., Wang, G., Zhu, C., Chen, Z.: P-packsvm: Parallel primal gradient descent kernel svm. In: Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on, pp. 677–686. IEEE (2009)
• (24) Zhuang, Y., Chin, W.S., Juan, Y.C., Lin, C.J.: Distributed newton methods for regularized logistic regression. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 690–703. Springer (2015)