The stochastic gradient method (SGM) is widely used as an optimization tool in many machine learning applications, including (linear) support vector machines shalev2011pegasos ; zhu2009p ; zhuang2015distributed ; bottou2012stochastic , graphical models dean2012large ; poon2011sum ; bonnabel2013stochastic , and deep learning deng2013recent ; krizhevsky2012imagenet ; haykin2009neural . SGM computes an estimate of the gradient from a single randomly chosen sample in each iteration. Therefore, applying a stochastic gradient method to large-scale machine learning problems can be computationally efficient zhang2004solving ; alrajeh2014large ; bottou2010large ; hsieh2008dual .
In the context of supervised learning, models trained by such iterative optimization algorithms are commonly assessed through convergence rate analysis. The convergence rate describes how fast the optimization error decreases as the number of iterations grows. However, many fast-converging algorithms are algorithmically less stable (an algorithm is stable if it is robust to small perturbations in the composition of the training data set). The stability of an algorithm is assessed by measuring the generalization error, defined as the absolute difference between the test error and the training error.
The classical results of Bousquet and Elisseeff bousquet2002stability showed that a randomized algorithm such as SGM is uniformly stable if, for all data sets differing in one element, the learned models produce nearly the same predictions. Hardt et al. hardt2015train suggested that, by choosing a reasonable number of iterations, SGM generalizes well and prevents overfitting under standard Lipschitz and smoothness assumptions. Therefore, the iterative optimization algorithm can stop long before convergence to reduce computational cost. The expected excess risk decomposition has been the main theoretical guideline for this kind of early-stopping criterion. Motivated by this approach, we derived a high-probability bound in terms of expected risk for an approximated kernel function, building on the stability definition. We proposed that, in the context of supervised learning, SGM generalizes well with high probability for an approximated kernel under proper assumptions. We showed that, for a small number of iterations, the generalization error is independent of the model size and is a function of the number of epochs. In addition, we explored the effect of the learning rate and of the number of Fourier components on the generalization error.
In this paper, we proved that SGM with only a few iterations generalizes well for an approximated kernel under proper assumptions while remaining stable. In particular, we mapped the input data to a randomized low-dimensional feature space to accelerate the training of kernel machines using random Fourier features rahimi2007random . We then incorporated the approximated kernel function into the primal of SVM to form a linear primal objective function, following chapelle2007training . Finally, we showed that SGM generalizes well for the approximated algorithm under proper assumptions by incorporating the stability term into the classical convergence bound.
This paper is organized as follows. Section 2 presents the detailed problem statement and the convex optimization setting used for this problem. The theoretical analysis is discussed in Section 3, followed by the numerical results in Section 4. Finally, a discussion is provided in Section 5.
2.1 Optimization problem
Given a training set $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$, a linear hyperplane for SVM problems is defined by $f(x) = w^\top x$, where $n$ is the number of training examples and $w$ is the weight coefficient vector. The standard primal SVM optimization problem is shown as:

$$\min_{w} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, w^\top x_i\big)$$
Rather than using the original input attributes $x$, we instead used the kernel trick so that the algorithm accesses the data only through evaluations of the kernel $k(x, x')$. This is a simple way to generate features for algorithms that depend only on the inner product between pairs of input points. The kernel trick relies on the observation that any positive definite function $k(x, x')$ with $x, x' \in \mathbb{R}^d$ defines an inner product and a lifting $\phi$ so that the inner product between lifted data points can be quickly computed as $\langle \phi(x), \phi(x') \rangle = k(x, x')$. Our goal is to efficiently learn a kernel prediction function and an associated Reproducing Kernel Hilbert Space as follows:
However, in large-scale problems, dealing with kernels can be computationally expensive. Hence, instead of relying on the implicit lifting provided by the kernel trick, we explicitly mapped the data to a low-dimensional Euclidean inner product space using a randomized feature map $z : \mathbb{R}^d \to \mathbb{R}^D$, so that the inner product between a pair of transformed points approximates their kernel evaluation, $k(x, x') \approx z(x)^\top z(x')$ yang2012nystrom ; rahimi2008random . Given the random Fourier features, we then learned a linear machine by solving the following optimization problem:
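The random Fourier feature map of rahimi2007random can be sketched in a few lines. The following is a minimal illustration, assuming a Gaussian kernel $k(x, y) = \exp(-\gamma\|x-y\|^2)$; the bandwidth `gamma` and feature count `D` are illustrative choices, not values from our experiments:

```python
import numpy as np

def rff_map(X, D, gamma, rng):
    """Map inputs X (n x d) to D random Fourier features whose inner
    products approximate the Gaussian kernel exp(-gamma * ||x - y||^2)."""
    n, d = X.shape
    # Frequencies sampled from the Fourier transform of the Gaussian
    # kernel: w ~ N(0, 2*gamma*I); phases uniform on [0, 2*pi).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = rff_map(X, D=4000, gamma=0.5, rng=rng)
K_approx = Z @ Z.T                     # D-dimensional linear inner products
K_exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
```

A linear machine trained on `Z` then stands in for the kernel machine trained with `K_exact`; the approximation error shrinks as `D` grows.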
2.2 Convex optimization settings
The goal of our online learning setting is to achieve minimum expected risk, hence we minimized the loss function. Throughout the paper, we focused on convex, Lipschitz continuous, and gradient-smooth loss functions, whose definitions are provided here.
A function $f$ is $L$-Lipschitz continuous if for all $u, v$ in its domain we have $|f(u) - f(v)| \le L\,\|u - v\|$.
A function $f$ is gradient Lipschitz continuous ($\beta$-smooth) if for all $u, v$ in its domain we have $\|\nabla f(u) - \nabla f(v)\| \le \beta\,\|u - v\|$.
In the theoretical analysis section we required a convex, Lipschitz continuous, and gradient-smooth loss function. Note that the Huber-hinge loss function is Lipschitz continuous and has a Lipschitz continuous gradient; it is defined as follows:
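A common huberized hinge, in the spirit of chapelle2007training , smooths the hinge quadratically in a band of width $2h$ around the margin point $z = 1$, which makes the gradient Lipschitz continuous. The sketch below uses the smoothing parameter `h = 0.5` purely for illustration:

```python
import numpy as np

def huber_hinge(z, h=0.5):
    """Huberized hinge loss of the margin z = y * f(x):
    0 for z >= 1+h, linear (1 - z) for z <= 1-h, and a quadratic
    bridge (1 + h - z)^2 / (4h) in between, so loss and gradient
    are both continuous."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 1 + h, 0.0,
           np.where(z <= 1 - h, 1.0 - z,
                    (1 + h - z) ** 2 / (4 * h)))

def huber_hinge_grad(z, h=0.5):
    """Derivative of the loss with respect to the margin z."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 1 + h, 0.0,
           np.where(z <= 1 - h, -1.0,
                    -(1 + h - z) / (2 * h)))
```

At the joints $z = 1 \pm h$ the pieces agree in both value and slope, which is exactly the gradient-smoothness required by the analysis.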
Therefore, in this paper, we used the following optimization problem:
For simplicity, the loss function in (7) is denoted by . Let be the minimizer of the population risk:
Let be a convex loss satisfying and let be the constant learning rate. Let , where is the maximum SGM iteration. Also, let be the minimizer of the population risk . Then,
Combining these two we have the following:
By summing the above over the iterations and taking the average, the lemma is proved.
From Rahimi and Recht rahimi2007random , we know that with high probability there is a bound on the difference between the approximated kernel value and the exact kernel value:
Assuming and , then:
Accordingly, the optimization error is defined as the expected gap between the empirical risk and the minimum empirical risk, and it is denoted by:
where, denotes a population sample of size and is the empirical risk defined as:
Note that the expected empirical risk is smaller than the minimum risk , implying:
Let l be a convex loss function that is Lipschitz continuous and . Let ; resulting from and . Also let . Suppose we make a single pass of SGM over all the samples , and choose ; then with high probability , the classical convergence bound in (15) becomes:
where is the stability error satisfying , given that the function is -Lipschitz continuous and -smooth. We know that will decrease with the number of SGM iterations while increases. Hardt et al. hardt2015train showed that, for a small number of iterations and by balancing and , the generalization error decreases. In the next section, we explored whether using SGM for an approximated algorithm, which is favorable in terms of computational cost, generalizes well with a small number of iterations while remaining stable.
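The single-pass regime analyzed above can be sketched concretely. The following minimal SGM pass uses the huberized-hinge subgradient with $h = 0.5$ and a constant step size proportional to $1/\sqrt{n}$; the regularization weight `lam` and step constant `c` are illustrative assumptions, not the constants from the bound:

```python
import numpy as np

def single_pass_sgm(Z, y, lam=0.01, c=1.0, seed=0):
    """One pass of SGM over (Z, y) on an L2-regularized huberized-hinge
    primal, with constant step size eta = c / sqrt(n)."""
    rng = np.random.default_rng(seed)
    n, D = Z.shape
    w = np.zeros(D)
    eta = c / np.sqrt(n)
    for i in rng.permutation(n):          # single pass, random order
        z = y[i] * (Z[i] @ w)             # margin of sample i
        if z >= 1.5:                      # h = 0.5: zero-loss region
            g_margin = 0.0
        elif z <= 0.5:                    # linear region of the loss
            g_margin = -1.0
        else:                             # quadratic bridge
            g_margin = -(1.5 - z)
        w -= eta * (g_margin * y[i] * Z[i] + lam * w)
    return w
```

Stopping after one pass is exactly the early-stopping regime in which the optimization and stability terms stay balanced.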
3 Generalization of SGM for an approximated algorithm
Let l be -Lipschitz continuous and -smooth. Let be the minimizer of the empirical risk and . Let , where and is the coefficient of the th support vector. For the maximum number of SGM iterations , with high probability , we have:
Recall that with a high probability, . Also recall that . Then by substituting these two terms in (21), for every , with a high probability , we have:
By taking the gradient of the right hand side of (23) with respect to , the optimal is:
By substituting the optimal in Equation (23) the theorem is proved.
The above theorem suggests that, with high probability, SGM generalizes well for an approximated kernel given an -Lipschitz continuous and -smooth loss function. In general, the optimization error () decreases with the number of SGM iterations while the stability error () increases. From (22) we can see that as the number of iterations increases, and become less balanced. Thus, choosing a small number of iterations balances and , yielding a stable SGM.
By setting to the value in (24), when , the generalization error bound for an approximated kernel based on random Fourier features is given by:
Our generalization bound has a convergence rate of , which, compared with the rate of achieved by shalev2011pegasos , is significantly more efficient. Recall from rahimi2007random that the number of random Fourier components is given by . By setting , we need to sample Fourier features in order to achieve a high-probability guarantee. A regular kernel classifier requires time to compute; with randomized feature maps, only operations are required. Thus, with a reasonable number of iterations, an approximated kernel learning machine is faster than a regular kernel method, with the added advantage of preventing overfitting, making it more practical for large-scale kernel learning.
4 Experimental results
Theoretically, we proved that an approximated Fourier primal support vector machine is stable, provided a smooth loss function and a sufficient number of steps. Thus, given a reasonable number of epochs, the stochastic gradient method generalizes well and prevents possible overfitting. We numerically examined the effect of three parameters on stability: the model size, the number of Fourier components, and the learning rate. Table 1 describes the four binary classification datasets used for the analysis. These datasets can be downloaded from the UCI Machine Learning Repository.
| Dataset | Examples | Features |
| Pima Indians Diabetes | 768 | 8 |
In Figure 1, we showed the effect of the number of random Fourier features on the generalization error. The results show that the generalization error is a function of the number of random Fourier features: by choosing a large number of Fourier components, the approximated kernel performs nearly the same as exact kernel-based learning. In other words, sampling more Fourier components makes the kernel approximation more accurate. In general, increasing the number of Fourier components leads to a better approximation and thus a lower test error sutherland2015error . On the other hand, the computational cost is proportional to the number of Fourier components. Hence, we performed a simulation-based experiment for each data set to find the best number of Fourier components, restricting it to a maximum of 200 features for computational cost reasons.

In addition, we used numerical examples to demonstrate the dependence of the generalization error on the number of epochs and its independence of the sample size. Figure 2 shows the generalization error for different numbers of epochs, where an epoch is one complete pass through the training set. The results demonstrate that the generalization error is a function of the number of epochs, not of the model size.

Choosing a proper learning rate for the stochastic gradient method is crucial. Figure 3 shows the strong impact of the learning rate on the generalization error. We conducted an experiment to search for the best learning rate for each data set.
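The epoch-dependence experiment can be reproduced in miniature. The sketch below measures the generalization gap (|test error - train error|) of a hinge-loss SGM after a given number of epochs on synthetic data; the data model, step size, and sizes are all illustrative assumptions, not our experimental setup:

```python
import numpy as np

def generalization_gap(epochs, n=200, D=50, seed=0):
    """Train a linear classifier with SGM for `epochs` passes on
    synthetic feature data and return |test error - train error|."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(2 * n, D))
    w_true = rng.normal(size=D)
    y = np.sign(Z @ w_true + 0.5 * rng.normal(size=2 * n))  # noisy labels
    Ztr, ytr, Zte, yte = Z[:n], y[:n], Z[n:], y[n:]
    w = np.zeros(D)
    eta = 1.0 / np.sqrt(n * epochs)       # step size shrinks with horizon
    for _ in range(epochs):
        for i in rng.permutation(n):
            if ytr[i] * (Ztr[i] @ w) < 1:  # hinge subgradient step
                w += eta * ytr[i] * Ztr[i]
    err = lambda Zs, ys: np.mean(np.sign(Zs @ w) != ys)
    return abs(err(Zte, yte) - err(Ztr, ytr))
```

Sweeping `epochs` (with the sample size held fixed, then varied) gives the kind of curves summarized in Figures 1 and 2.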
In this paper, we measured the stability of the stochastic gradient method (SGM) for learning an approximated Fourier primal support vector machine. We demonstrated that a large-scale approximated online kernel machine trained with SGM is stable with high probability. The empirical results showed that the generalization error is a function of the number of epochs and independent of the model size. We also showed the strong impact of the learning rate and of the number of Fourier components. Moreover, in this paper we utilized SGM to solve an approximated primal SVM. Utilizing random Fourier features induces variance, which slows down the convergence rate; one way to tackle this problem is to use variance reduction methods such as the stochastic variance reduced gradient (SVRG) johnson2013accelerating .
6 Conflict of Interest
The authors declare that they have no conflict of interest.
Acknowledgements. The authors are grateful for the tremendous support and help of Professor Maryam Fazel.
- (1) Alrajeh, A., Niranjan, M.: Large-scale reordering model for statistical machine translation using dual multinomial logistic regression (2014)
- (2) Bonnabel, S.: Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control 58(9), 2217–2229 (2013)
- (3) Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp. 177–186. Springer (2010)
- (4) Bottou, L.: Stochastic gradient descent tricks. In: Neural networks: Tricks of the trade, pp. 421–436. Springer (2012)
- (5) Bousquet, O., Elisseeff, A.: Stability and generalization. The Journal of Machine Learning Research 2, 499–526 (2002)
- (6) Chapelle, O.: Training a support vector machine in the primal. Neural computation 19(5), 1155–1178 (2007)
- (7) Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al.: Large scale distributed deep networks. In: Advances in neural information processing systems, pp. 1223–1231 (2012)
- (8) Deng, L., Li, J., Huang, J.T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., et al.: Recent advances in deep learning for speech research at microsoft. In: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8604–8608. IEEE (2013)
- (9) Hardt, M., Recht, B., Singer, Y.: Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240 (2015)
- (10) Haykin, S.S., Haykin, S.S., Haykin, S.S., Haykin, S.S.: Neural networks and learning machines, vol. 3. Pearson Upper Saddle River, NJ, USA: (2009)
- (11) Hsieh, C.J., Chang, K.W., Lin, C.J., Keerthi, S.S., Sundararajan, S.: A dual coordinate descent method for large-scale linear svm. In: Proceedings of the 25th international conference on Machine learning, pp. 408–415. ACM (2008)
- (12) Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)
- (13) Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012)
- (14) Lu, J., Hoi, S.C., Wang, J., Zhao, P., Liu, Z.Y.: Large scale online kernel learning
- (15) Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization 19(4), 1574–1609 (2009)
- (16) Poon, H., Domingos, P.: Sum-product networks: A new deep architecture. In: Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pp. 689–690. IEEE (2011)
- (17) Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in neural information processing systems, pp. 1177–1184 (2007)
- (18) Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in neural information processing systems, pp. 1177–1184 (2008)
- (19) Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A.: Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming 127(1), 3–30 (2011)
- (20) Sutherland, D.J., Schneider, J.: On the error of random fourier features. arXiv preprint arXiv:1506.02785 (2015)
- (21) Yang, T., Li, Y.F., Mahdavi, M., Jin, R., Zhou, Z.H.: Nyström method vs random fourier features: A theoretical and empirical comparison. In: Advances in neural information processing systems, pp. 476–484 (2012)
- (22) Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the twenty-first international conference on Machine learning, p. 116. ACM (2004)
- (23) Zhu, Z.A., Chen, W., Wang, G., Zhu, C., Chen, Z.: P-packsvm: Parallel primal gradient descent kernel svm. In: Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on, pp. 677–686. IEEE (2009)
- (24) Zhuang, Y., Chin, W.S., Juan, Y.C., Lin, C.J.: Distributed newton methods for regularized logistic regression. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 690–703. Springer (2015)