1 Introduction
Nonparametric models for classification have become attractive since the introduction of kernel methods such as the Support Vector Machine (SVM) [1]. The complexity of the learned models scales with the data, which gives them desirable asymptotic properties. From an estimation point of view, however, parametric models can offer significant statistical and computational advantages. Recent years have seen a shift of focus from nonparametric to semiparametric models for learning classifiers. This includes the work of Rahimi and Recht [15], who compute an approximate feature map for shift-invariant kernels and solve the kernel problem approximately as a linear problem. This line of work has become extremely attractive with the advent of several algorithms for training linear classifiers efficiently (e.g., LIBLINEAR [6], PEGASOS [16]), including online variants with very low memory overhead.

Additive models, i.e., functions that decompose over dimensions, are a natural extension of linear models and arise naturally in many settings. In particular, if the kernel is additive, i.e.,

$K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{D} K_i(x_i, y_i),$
then the learned SVM classifier is also additive. A large number of useful kernels in computer vision are based on comparing distributions of low-level features on images and are additive, e.g., the histogram intersection and $\chi^2$ kernels [11]. This one-dimensional decomposition allows one to compute approximate feature maps independently for each dimension, leading to very compact feature maps and efficient estimation. This line of work has been explored by Maji and Berg [11], who construct approximate feature maps for the intersection kernel and learn piecewise linear functions in each dimension. For homogeneous additive kernels, Vedaldi and Zisserman [17] propose to use the closed-form features of Hein and Bousquet [8] to construct approximate feature maps.

Smoothing splines [18] are another way of estimating additive models and are well known in the statistics community. Ever since Generalized Additive Models (GAMs) were introduced by Hastie and Tibshirani [7], many practical approaches to training such models for regression have emerged, for example the P-Spline formulation of Eilers and Marx [4]. However, these algorithms do not scale to the extremely large datasets and high-dimensional features typical of image and text classification.
In this work we show that the spline framework can be used to derive embeddings for training additive classifiers efficiently as well. We propose two families of embeddings with the property that the underlying additive classifier can be learned directly by estimating a linear classifier in the embedded space. The first family of embeddings is based on the Penalized Spline ("P-Spline") formulation of additive models (Eilers and Marx [4]), where the function in each dimension is represented using a uniformly spaced spline basis and the regularization penalizes differences between adjacent spline coefficients. The second family is based on a generalized Fourier expansion of the function in each dimension.
This work ties together the literature on additive model regression and linear SVMs to develop algorithms for training additive models in the classification setting. We discuss how our additive embeddings are related to additive kernels in Section 4. In particular, our representations include those of [11] as a special case arising from a particular choice of B-Spline basis and regularization. An advantage of our representations is that they allow explicit control over the smoothness of the functions and over the choice of basis functions, which may be desirable in certain situations. Moreover, the sparsity of some of our representations leads to efficient training algorithms for smooth fits of functions. We summarize previous work in the next section.
2 Previous Work
The history of learning additive models goes back to [7], who proposed the "backfitting algorithm" to estimate additive models. Since then many practical approaches have emerged, the most prominent of which is the Penalized Spline ("P-Spline") formulation proposed by [4], which models the one-dimensional functions using a large number of uniformly spaced B-Spline bases. Smoothness is ensured by penalizing the differences between adjacent spline coefficients. We describe the formulation in detail in Section 3. A key advantage of this formulation is that the whole problem can be solved as a linear system.
Given data $(\mathbf{x}_i, y_i)$, with $\mathbf{x}_i \in \mathbb{R}^D$ and $y_i \in \{-1, +1\}$, discriminative training of a function $f$ often involves an optimization of the form:

$\min_f \; \sum_i \ell\left(y_i, f(\mathbf{x}_i)\right) + \lambda R(f)$   (1)

where $\ell$ is a loss function and $R(f)$ is a regularization term. In the classification setting, a commonly used loss function is the hinge loss:

$\ell\left(y, f(\mathbf{x})\right) = \max\left(0, 1 - y f(\mathbf{x})\right)$   (2)
For kernel SVMs the regularization penalizes the norm of the function in the implicit Reproducing Kernel Hilbert Space (RKHS) of the kernel. Approximating the RKHS of additive kernels provides a way of training additive kernel SVM classifiers efficiently. For shift-invariant kernels, Rahimi and Recht [15] derive features based on Bochner's theorem. Vedaldi and Zisserman [17] propose to use the closed-form features of Hein and Bousquet [8] to train additive kernel SVMs efficiently for many commonly used homogeneous additive kernels. For the intersection kernel, Maji and Berg [11] propose an approximation and an efficient learning algorithm; our work is closely related to this.
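To make the random-feature idea of [15] concrete, here is a minimal sketch for the Gaussian (RBF) kernel; the function name, the choice of kernel, and the bandwidth parameter are ours, not from the paper:

```python
import numpy as np

def random_fourier_features(X, n_features=2000, gamma=1.0, seed=0):
    """Monte Carlo feature map z with z(x).z(y) approximating exp(-gamma*||x-y||^2)."""
    rng = np.random.default_rng(seed)
    # By Bochner's theorem, frequencies are drawn from the kernel's spectral
    # density; for this Gaussian kernel that is a Gaussian with scale sqrt(2*gamma).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

x = np.array([[0.1, 0.4]])
y = np.array([[0.3, 0.2]])
approx = float(random_fourier_features(x) @ random_fourier_features(y).T)
exact = float(np.exp(-np.sum((x - y) ** 2)))  # Gaussian kernel with gamma = 1
```

The approximation error decays as $O(1/\sqrt{n_{\text{features}}})$, so a few thousand random features already give inner products close to the exact kernel value.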
In the additive modeling setting, a typical regularization penalizes the norm of the $d$th order derivatives of the function in each dimension. Our features are based on encodings that enable efficient evaluation and computation of this regularization. In the remainder of the discussion we assume that the features are one dimensional; because the classifiers are additive, the overall embedding is simply the concatenation of the embeddings in each dimension.
3 Spline Embeddings
Eilers and Marx [4] proposed a practical modeling approach for GAMs. The idea is to represent the function in each dimension using a relatively large number of uniformly spaced B-Spline bases, with smoothness ensured by penalizing the first- or second-order differences between adjacent spline coefficients. Let $\phi(x)$ denote the vector with entries $\phi_k(x)$, the projection of $x$ onto the $k$th basis function. The P-Spline optimization problem for the classification setting with the hinge loss consists of minimizing:

$\sum_i \ell\left(y_i, \mathbf{w}^\top \phi(\mathbf{x}_i)\right) + \lambda \left\| D_d \mathbf{w} \right\|^2$   (3)
The matrix $D_d$ constructs the $d$th order differences of $\mathbf{w}$:

$D_d \mathbf{w} = \Delta^d \mathbf{w}$   (4)

The first difference of $\mathbf{w}$, $\Delta^1 \mathbf{w}$, is a vector with elements $w_{k+1} - w_k$. Higher order difference matrices can be computed by repeated differencing. For a $K$-dimensional basis, the first-order difference matrix $D_1$ is a $(K-1) \times K$ matrix with $d_{k,k} = -1$, $d_{k,k+1} = 1$, and zeros everywhere else. For $K = 4$, the matrices $D_1$ and $D_2$ are:

$D_1 = \begin{pmatrix} -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 \end{pmatrix}, \qquad D_2 = \begin{pmatrix} 1 & -2 & 1 & 0 \\ 0 & 1 & -2 & 1 \end{pmatrix}$
To enable a reduction to the linear case we propose a slightly different difference matrix $\tilde{D}_1$. We let $\tilde{D}_1$ be a $K \times K$ matrix with $\tilde{d}_{k,k} = 1$ and $\tilde{d}_{k,k-1} = -1$. This is the same as the first-order difference matrix of Eilers and Marx (up to sign), with one more row added on top. The resulting matrices $\tilde{D}_1$ and $\tilde{D}_2 = \tilde{D}_1^2$ are both $K \times K$; for $K = 4$:

$\tilde{D}_1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 \end{pmatrix}, \qquad \tilde{D}_2 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -2 & 1 & 0 & 0 \\ 1 & -2 & 1 & 0 \\ 0 & 1 & -2 & 1 \end{pmatrix}$

The first row of $\tilde{D}_1$ has the effect of penalizing the norm of the first spline coefficient, which plays the role of the standard regularization in the linear setting (e.g., ridge regression, linear SVMs). Alternatively, one can think of this as an additional basis function at the leftmost point whose coefficient is fixed to zero.
The key advantage is that the matrix $\tilde{D}_d$ is invertible and has a particularly simple form, which allows us to linearize the whole system. We will also show in Section 4 that, for a particular choice of spline basis, the derived embeddings approximate the learning problem of an SVM classifier with the min kernel:

$K_{\min}(x, y) = \min(x, y)$   (5)
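As a quick numerical check of this structure (a NumPy sketch on our part, not the paper's implementation), one can build the modified difference matrices and confirm that their inverses have a simple closed-form triangular structure:

```python
import numpy as np

K = 6
# Modified first-order difference matrix: 1 on the diagonal, -1 on the subdiagonal.
D1 = np.eye(K) - np.diag(np.ones(K - 1), k=-1)
D2 = D1 @ D1  # modified second-order difference matrix

# D1^{-1} is the triangular matrix of all ones (a cumulative-sum operator).
assert np.allclose(np.linalg.inv(D1), np.tril(np.ones((K, K))))

# D2^{-1} = (D1^{-1})^2 has entries j - k + 1 for k <= j (0-based indices here).
expected = np.fromfunction(lambda j, k: np.where(k <= j, j - k + 1, 0.0), (K, K))
assert np.allclose(np.linalg.inv(D2), expected)
```

This is exactly why the reparametrization below costs only cumulative sums rather than general matrix-vector products.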
Given that the regularization matrix $\tilde{D}_d$ is invertible, one can linearize the whole system by reparametrizing with $\mathbf{u} = \tilde{D}_d \mathbf{w}$, which results in:

$\mathbf{w}^\top \phi(\mathbf{x}) = \mathbf{u}^\top \tilde{D}_d^{-\top} \phi(\mathbf{x}), \qquad \left\| \tilde{D}_d \mathbf{w} \right\|^2 = \left\| \mathbf{u} \right\|^2$   (6)

Therefore the whole classifier is linear in the features $\psi(\mathbf{x}) = \tilde{D}_d^{-\top} \phi(\mathbf{x})$, i.e., the optimization problem is equivalent to:

$\min_{\mathbf{u}} \; \sum_i \ell\left(y_i, \mathbf{u}^\top \psi(\mathbf{x}_i)\right) + \lambda \left\| \mathbf{u} \right\|^2$   (7)
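This equivalence between the penalized problem and the plain $\ell_2$-regularized problem in the embedded space can be verified numerically. The sketch below substitutes squared loss for the hinge loss so that both problems have closed-form solutions (an illustrative assumption on our part, not the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, lam = 8, 50, 0.5
Phi = rng.random((n, K))            # rows are the spline activations phi(x_i)
y = rng.standard_normal(n)
D1 = np.eye(K) - np.diag(np.ones(K - 1), k=-1)

# Penalized fit: min_w ||Phi w - y||^2 + lam * ||D1 w||^2
w = np.linalg.solve(Phi.T @ Phi + lam * D1.T @ D1, Phi.T @ y)

# Plain ridge fit on the embedded features psi(x) = D1^{-T} phi(x):
Psi = Phi @ np.linalg.inv(D1)
u = np.linalg.solve(Psi.T @ Psi + lam * np.eye(K), Psi.T @ y)

# The two solutions coincide after mapping back: w = D1^{-1} u.
assert np.allclose(w, np.linalg.inv(D1) @ u)
```

The same identity holds for any convex loss, including the hinge loss used in the paper, since only the regularizer is reparametrized.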
The inverse matrices $\tilde{D}_1^{-1}$ and $\tilde{D}_2^{-1}$ are both triangular with a simple closed form: $\tilde{D}_1^{-1}$ has entries $u_{jk} = 1$ for $k \le j$, and $\tilde{D}_2^{-1} = (\tilde{D}_1^{-1})^2$ has entries $u_{jk} = j - k + 1$ for $k \le j$, with zeros elsewhere. We refer the reader to [5] for an excellent review of additive modeling using splines. Figure 1 shows the embeddings for various choices of the regularization degree and of the B-Spline basis: linear, quadratic, and cubic.
3.1 Generalized Fourier Embeddings
The generalized Fourier expansion provides an alternate way of fitting additive models. Let $\{\phi_k\}$ be an orthogonal basis system on the interval $[a, b]$ with respect to a weight function $w(x)$, i.e., $\int_a^b \phi_j(x)\,\phi_k(x)\,w(x)\,dx = \delta_{jk}$. Given a function $f(x) = \sum_k a_k \phi_k(x)$, the derivative regularization can be written as:

$R(f) = \int_a^b f'(x)^2\, w(x)\, dx$   (8)

Consider an orthogonal family of basis functions that are differentiable and whose derivatives are also orthogonal. One can normalize the basis such that $\int_a^b \phi_j'(x)\,\phi_k'(x)\,w(x)\,dx = \delta_{jk}$. In this case the regularization has a particularly simple form:

$R(f) = \sum_k a_k^2$   (9)

Thus the overall regularized additive classifier can be learned by training a linear classifier in the embedded space $\psi(x) = \left(\phi_1(x), \phi_2(x), \ldots\right)$. In practice one approximates this scheme using a small number of basis functions. We propose two practical families with closed-form embeddings:
Fourier basis.
The classic Fourier basis functions are orthogonal with respect to the constant weight function $w(x) = 1$. Their derivatives remain in the same family (except for the constant basis function), and hence are also orthogonal. The normalized feature embeddings are shown in Table 1.
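For concreteness, here is one possible normalization on the interval $[0, 1]$ (our choice of interval and scaling for illustration; the entries of Table 1 may differ): with basis $\sqrt{2}\cos(2\pi k x)/(2\pi k)$ and $\sqrt{2}\sin(2\pi k x)/(2\pi k)$, the derivatives are orthonormal, so the derivative penalty reduces to the squared norm of the linear weights:

```python
import numpy as np

def fourier_embedding(x, n_terms=3):
    """Embedding on [0, 1] whose linear-weight norm equals the derivative
    penalty (assumed scaling, for illustration only)."""
    ks = np.arange(1, n_terms + 1)
    scale = np.sqrt(2.0) / (2.0 * np.pi * ks)
    return np.concatenate([scale * np.cos(2.0 * np.pi * ks * x),
                           scale * np.sin(2.0 * np.pi * ks * x)])

# Check: the basis derivatives are orthonormal on [0, 1] with unit weight.
xs = np.linspace(0.0, 1.0, 20001)
g1 = -np.sqrt(2.0) * np.sin(2.0 * np.pi * 1 * xs)   # derivative of cos term, k=1
g2 = -np.sqrt(2.0) * np.sin(2.0 * np.pi * 2 * xs)   # derivative of cos term, k=2
assert abs(np.mean(g1 * g1) - 1.0) < 1e-3           # unit norm
assert abs(np.mean(g1 * g2)) < 1e-3                 # orthogonality
```

With this scaling, higher frequencies are damped by $1/k$, which is how the embedding encodes the smoothness prior.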
Hermite basis.
Hermite polynomials are also an orthogonal basis system with orthogonal derivatives, with respect to the weight function $w(x) = e^{-x^2}$. Using the orthogonality relation:

$\int_{-\infty}^{\infty} H_m(x)\, H_n(x)\, e^{-x^2}\, dx = 2^n\, n!\, \sqrt{\pi}\, \delta_{mn}$   (10)

and the property that $H_n'(x) = 2n\, H_{n-1}(x)$ (an Appell-type sequence), one can obtain closed-form features, as shown in Table 1. It is known that the families of polynomial basis functions which are orthogonal and whose derivatives are also orthogonal belong to one of three classes: Jacobi, Laguerre, or Hermite [19]. The extended support of the weight function of the Hermite basis makes it well suited for additive modeling.
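Both properties can be checked with NumPy's Hermite utilities (a numerical sketch, not the paper's code):

```python
import numpy as np
from numpy.polynomial.hermite import Hermite, hermgauss

# Physicists' Hermite polynomials: H_n'(x) = 2 n H_{n-1}(x), so the derivatives
# stay in the family and inherit orthogonality under the weight exp(-x^2).
H2, H3 = Hermite.basis(2), Hermite.basis(3)
assert np.allclose(H3.deriv().coef, 2 * 3 * H2.coef)

# Gauss-Hermite quadrature integrates p(x) * exp(-x^2) exactly for polynomials
# up to degree 2*20 - 1, so these checks are exact up to round-off.
xs, ws = hermgauss(20)
assert abs(np.sum(ws * H2(xs) * H3(xs))) < 1e-6                         # orthogonality
assert np.isclose(np.sum(ws * H3(xs) ** 2), 2**3 * 6 * np.sqrt(np.pi))  # 2^n n! sqrt(pi)
```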
Although both of these bases are complete, for practical purposes one has to use only the first few basis functions. The quality of approximation depends on how well the underlying function can be represented in the chosen basis; for example, low-degree polynomials are better represented by the Hermite basis.
[Table 1: Normalized feature embeddings for the Fourier and Hermite bases.]
4 Additive Kernel Reproducing Kernel Hilbert Space & Spline Embeddings
We begin by showing the close resemblance of the spline embeddings to the min kernel. Let the features in $[0, 1]$ be represented with $K$ uniformly spaced linear spline bases, so that $\phi(x)$ is the (sparse) vector of basis activations and $\psi(x) = \tilde{D}_1^{-\top} \phi(x)$ is the embedded feature. These features closely approximate the min kernel, i.e.:

$\psi(x)^\top \psi(y) \approx K_{\min}(x, y) = \min(x, y)$   (11)

The features $\psi(x)$ form a unary-like representation in which the number of ones equals the index of the bin containing $x$. One can verify that an analogous identity holds for B-Spline bases of higher degree.
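A small sketch (our own encoding, intended to mirror the unary representation just described) verifies that the inner product of two such embeddings reproduces the min kernel up to quantization error:

```python
import numpy as np

def unary_encode(x, K=200):
    """Place 1/sqrt(K) in the first floor(x*K) slots; zeros elsewhere."""
    psi = np.zeros(K)
    psi[: int(np.floor(x * K))] = 1.0 / np.sqrt(K)
    return psi

x, y = 0.62, 0.35
approx = float(unary_encode(x) @ unary_encode(y))
# Inner product = min(floor(x*K), floor(y*K)) / K, i.e. min(x, y) up to 1/K.
assert abs(approx - min(x, y)) <= 1.0 / 200 + 1e-9
```

The approximation error shrinks as $1/K$, which is why a moderately large number of bins suffices in practice.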
Define the kernel corresponding to a B-Spline basis of degree $n$ and regularization matrix $\tilde{D}_d$ as:

$K_{n,d}(x, y) = \psi(x)^\top \psi(y) = \phi_n(x)^\top \left(\tilde{D}_d^\top \tilde{D}_d\right)^{-1} \phi_n(y)$   (13)

Figure 2 shows these kernels for linear, quadratic, and cubic B-Spline bases. In a recent paper, Maji and Berg [11] propose to use a linear spline basis with first-order difference regularization to train approximate intersection kernel SVMs, which in turn approximate arbitrary additive classifiers. Our features can be seen as a generalization of this work that allows arbitrary spline bases and regularizations.
B-Splines are closely related to the truncated polynomial kernel [18, 14], which consists of uniformly spaced knots $\tau_k$ and truncated polynomial features:

$\phi_k(x) = \left(x - \tau_k\right)_+^p$   (14)

However, these features are not as numerically stable as the B-Spline basis (see [5] for an experimental comparison). With uniformly spaced knots, truncated polynomials of a given degree correspond to a B-Spline basis of matching degree with an appropriate order of difference regularization (in the lowest-degree case this recovers the min kernel), because B-Splines are derived from the truncated polynomial basis by repeated application of the difference matrix [3]. As noted by the authors of [5], one advantage of the P-Spline formulation is that it decouples the order of the regularization from the degree of the B-Spline basis. Low-order regularization typically provides sufficient smoothness in our experiments.
5 Optimizations for Efficient Learning with B-Spline Embeddings
For the B-Spline basis one can exploit sparsity to speed up linear solvers. The classification function is based on evaluating $\mathbf{u}^\top \psi(\mathbf{x})$. Most methods for training linear classifiers repeatedly evaluate the classifier and update it only when the classification is incorrect. Since evaluations are far more frequent than updates, it is much more efficient to maintain $\mathbf{w} = \tilde{D}_d^{-1} \mathbf{u}$ and evaluate $\mathbf{u}^\top \psi(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x})$ using sparse vector multiplication. Updates to the weight vector for various gradient descent algorithms then look like:

$\mathbf{w} \leftarrow \mathbf{w} + \tau\, y_i \left(\tilde{D}_d^\top \tilde{D}_d\right)^{-1} \phi(\mathbf{x}_i)$   (15)

where $\tau$ is a step size. Unlike $\phi(\mathbf{x}_i)$, the vector $(\tilde{D}_d^\top \tilde{D}_d)^{-1} \phi(\mathbf{x}_i)$ is dense, and hence an update may change all $K$ entries of $\mathbf{w}$. However, one can compute it in $O(K)$ steps instead of $O(K^2)$ by exploiting the simple form of $\tilde{D}_d$: initialize $\mathbf{v} = \phi(\mathbf{x}_i)$, then repeat step A $d$ times followed by step B $d$ times.
step A (apply $\tilde{D}_1^{-\top}$): $v_k \leftarrow v_k + v_{k+1}$ for $k = K-1, \ldots, 1$
step B (apply $\tilde{D}_1^{-1}$): $v_k \leftarrow v_k + v_{k-1}$ for $k = 2, \ldots, K$
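A sketch of this computation using cumulative sums (the function name is ours), checked against a direct linear solve:

```python
import numpy as np

def gram_inv_apply(v, d=1):
    """Compute (D_d^T D_d)^{-1} v with 2*d cumulative sums, O(K) each."""
    u = np.asarray(v, dtype=float).copy()
    for _ in range(d):                    # step A: apply D_1^{-T} (reverse cumsum)
        u = np.cumsum(u[::-1])[::-1]
    for _ in range(d):                    # step B: apply D_1^{-1} (forward cumsum)
        u = np.cumsum(u)
    return u

K = 7
D1 = np.eye(K) - np.diag(np.ones(K - 1), k=-1)
D2 = D1 @ D1
v = np.arange(1.0, K + 1)
assert np.allclose(gram_inv_apply(v, d=1), np.linalg.solve(D1.T @ D1, v))
assert np.allclose(gram_inv_apply(v, d=2), np.linalg.solve(D2.T @ D2, v))
```

Because $\tilde{D}_1^{-1}$ is the cumulative-sum operator, applying $(\tilde{D}_d^\top \tilde{D}_d)^{-1}$ never requires forming or inverting a matrix.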
6 Experiments
Often, on large datasets consisting of very high-dimensional features, one may compute the encodings in the inner loop of the training algorithm to avoid a memory bottleneck. We refer to this as the "online" method. Our solver is based on LIBLINEAR, but the approach can easily be used with any other solver. The custom solver allows us to exploit the sparsity of the embeddings (Section 5). A practical alternative regularization for the B-Spline embeddings is $\tilde{D}_d = I$, the identity matrix, which leads to sparse features. This makes it difficult to estimate the weights on basis functions that see no data, but one can use a higher-order B-Spline basis to somewhat mitigate this problem.
We present image classification experiments on two datasets: MNIST [10] and Daimler Chrysler (DC) pedestrians [13]. On these datasets, SVM classifiers based on the histogram intersection kernel outperform a linear SVM classifier [11, 12] when used with features based on a spatial pyramid of histograms of oriented gradients [2, 9]. We obtained the features from the authors' websites for our experiments. The MNIST features are high-dimensional and dense; the DC dataset has three training sets and two test sets, each training set with similarly dense, high-dimensional features. These sizes are typical of image datasets, and training kernel SVM classifiers on them often takes several hours on a single machine.
Toy Dataset.
The points are sampled uniformly on a 2D grid, with the points inside a fixed region assigned to the positive class and the rest to the negative class. Figure 3(b) shows the fits to the data along one dimension using uniformly spaced B-Spline bases of various degrees and regularizations; the quadratic and cubic splines offer smoother fits. Figures 3(c,d) show the functions learned using Fourier and Hermite embeddings of various degrees, respectively.
Effect of BSpline parameter choices.
Table 2 shows the accuracy and training times as a function of the number of bins, the regularization degree $d$, and the B-Spline basis degree on the first split of the DC pedestrian dataset. We fix the regularization parameter and the bias term when training all models. On this dataset we find that the lower-order difference regularization is more accurate and significantly faster, so in further experiments we include only those results. In addition, setting the difference regularization to zero ($\tilde{D}_d = I$) leads to very sparse features and can be used directly with any linear solver that exploits this sparsity. The training time for B-Splines scales sublinearly with the number of bins, hence better fits of the functions can be obtained without much loss in efficiency.
[Table 2: accuracy and training time on the first DC split for B-Spline bases of degree 1-3, with varying numbers of bins and regularization settings.]
Effect of Fourier embedding parameter choices.
Table 3 shows the accuracy and training times for various Fourier embeddings on the DC dataset. Before computing the generalized Fourier features, we first normalize the data in each dimension to a fixed interval using a min-max rescaling over the training data:

$x \leftarrow \frac{x - x_{\min}}{x_{\max} - x_{\min}}$   (16)
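A minimal sketch of this per-dimension rescaling (the exact target interval used in the paper is an assumption here):

```python
import numpy as np

def minmax_scale(X):
    """Rescale each column of the training data to [0, 1] (constant columns -> 0)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

X = np.array([[0.0, 5.0], [2.0, 5.0], [4.0, 15.0]])
Xs = minmax_scale(X)
```

At test time the training-set minima and maxima should be reused, with out-of-range values clipped.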
We precompute the features and use LIBLINEAR to train the models, since it is relatively expensive to compute these features online. In this case the training times are similar to those of the B-Spline models. However, precomputing and storing the features may not be possible on very large datasets.
[Table 3: accuracy and training time for Fourier and Hermite embeddings of degree 1-4 on the DC dataset.]
Comparison of various additive models.
Table 4 shows the accuracy and training times of various additive models compared to a linear SVM and the far more expensive kernel SVM on all training/test combinations of the DC dataset. The optimal parameters were found on the first training and test set. The additive models are many times faster to train and are as accurate as the kernel SVM. The B-Spline additive models significantly outperform a linear SVM on this dataset at the expense of a small amount of additional training time.

Table 5 shows the accuracies and training times of various additive models on the MNIST dataset. We train one-vs-all classifiers for each digit, and the classification scores are normalized by passing them through a logistic. During testing, each example is given the label of the classifier with the highest response. The optimal parameters for training were found using 2-fold cross-validation on the training set. Once again the additive models significantly outperform the linear classifier and closely match the accuracy of the kernel SVM, while being much faster to train.
[Table 4: test accuracy and training time on the DC dataset for a linear SVM (LIBLINEAR), a min-kernel SVM (LIBSVM), and the B-Spline, Fourier, and Hermite embeddings in online and batch modes.]
[Table 5: test error and training time on MNIST for a linear SVM (LIBLINEAR), a min-kernel SVM (LIBSVM), and the B-Spline and Hermite embeddings.]
7 Conclusion
We have proposed a family of embeddings that enable efficient learning of additive classifiers. We advocate the use of B-Spline-based embeddings: they are efficient to compute and sparse, which lets us train models with a small memory overhead by computing the embeddings on the fly even when the number of basis functions is large, and they can be seen as a generalization of [11]. Generalized Fourier features are low dimensional but expensive to compute, and so are more suitable when the projected features can be precomputed and stored. The proposed classifiers outperform linear classifiers and match the significantly more expensive kernel SVM classifiers at a fraction of the training time. On both the MNIST and DC datasets, the linear B-Spline basis with difference regularization works best and closely approximates the learning problem of the min-kernel SVM. Higher-degree splines used with identity regularization have even faster training times, but worse accuracies than difference regularization. The code for training the spline models proposed in this paper has been packaged as a library, LIBSPLINE, which will be released upon publication.
References

[1] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144-152. ACM, 1992.
[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886-893. IEEE, 2005.
[3] C. De Boor. A Practical Guide to Splines. Springer Verlag, 2001.
[4] P. Eilers and B. Marx. Generalized linear additive smooth structures. Journal of Computational and Graphical Statistics, 11(4):758-783, 2002.
[5] P. Eilers and B. Marx. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2005.
[6] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871-1874, 2008.
[7] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman & Hall/CRC, 1990.
[8] M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In Proceedings of AISTATS, 2005.
[9] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, volume 2, pages 2169-2178. IEEE, 2006.
[10] Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998.
[11] S. Maji and A. Berg. Max-margin additive classifiers for detection. In ICCV, pages 40-47. IEEE, 2009.
[12] S. Maji and J. Malik. Fast and accurate digit classification. Technical Report UCB/EECS-2009-159, EECS Department, University of California, Berkeley, Nov 2009.
[13] S. Munder and D. Gavrila. An experimental study on pedestrian classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1863-1868, 2006.
[14] N. Pearce and M. Wand. Penalized splines and reproducing kernel methods. The American Statistician, 60(3):233-240, 2006.
[15] A. Rahimi and B. Recht. Random features for large-scale kernel machines. NIPS, 20:1177-1184, 2008.
[16] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, pages 807-814. ACM, 2007.
[17] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. In CVPR, pages 3539-3546. IEEE, 2010.
[18] G. Wahba. Spline Models for Observational Data, volume 59. SIAM, 1990.
[19] M. Webster. Orthogonal polynomials with orthogonal derivatives. Mathematische Zeitschrift, 39:634-638, 1935.