Tensor machines for learning target-specific polynomial features

04/07/2015 ∙ by Jiyan Yang, et al. ∙ Stanford University

Recent years have demonstrated that using random feature maps can significantly decrease the training and testing times of kernel-based algorithms without significantly lowering their accuracy. Regrettably, because random features are target-agnostic, typically thousands of such features are necessary to achieve acceptable accuracies. In this work, we consider the problem of learning a small number of explicit polynomial features. Our approach, named Tensor Machines, finds a parsimonious set of features by optimizing over the hypothesis class introduced by Kar and Karnick for random feature maps in a target-specific manner. Exploiting a natural connection between polynomials and tensors, we provide bounds on the generalization error of Tensor Machines. Empirically, Tensor Machines behave favorably on several real-world datasets compared to other state-of-the-art techniques for learning polynomial features, and deliver significantly more parsimonious models.


1 Introduction

Kernel machines are one of the most popular and widely adopted paradigms in machine learning and data analysis. This success is due to the fact that an appropriately chosen non-linear kernel often succeeds in capturing non-linear structures inherent to the problem without forming the high-dimensional features necessary to explicitly delineate those structures. Unfortunately, the cost of kernel methods scales superlinearly in $n$, the number of datapoints. Because of this high computational cost, it is often the case that kernel methods cannot exploit all the information in large datasets; indeed, even the cost of forming and storing the $n \times n$ kernel matrix, which is $O(n^2)$, can be prohibitive on large datasets.

As a consequence, methods for approximately fitting non-linear kernel machines have drawn much attention in recent years. In particular, explicit random feature maps have been proposed to make large-scale kernel machines practical [14, 7, 13, 6, 22, 23]. The idea behind this method is to randomly choose an $m$-dimensional feature map $\phi : \mathbb{R}^d \to \mathbb{R}^m$ that satisfies $\langle \phi(x), \phi(y) \rangle \approx k(x, y)$, where $k$ is the kernel of interest [14]. Let $Z \in \mathbb{R}^{n \times m}$ be the matrix whose rows comprise the application of $\phi$ to the $n$ datapoints; then the kernel matrix $K$ has the low-rank approximation $K \approx ZZ^T$. This approximation considerably reduces the costs of both fitting models, because one can work with the $n$-by-$m$ matrix $Z$ instead of the $n$-by-$n$ matrix $K$, and of forming the input, because one need form only $Z$ instead of $K$.

The success of this approximation hinges on choosing $\phi$ so that $ZZ^T$ is close to $K$ with high probability. One easy way to accomplish this is to take
$$\phi(x) = \frac{1}{\sqrt{m}} \big( \phi_{w_1}(x), \ldots, \phi_{w_m}(x) \big),$$
where the random features $\phi_{w_i}$ all satisfy $\mathbb{E}_w[\phi_w(x)\phi_w(y)] = k(x, y)$. The concentration of measure phenomenon then ensures that $\|ZZ^T - K\|$ is small with high probability. In practice, one must use a large number of random features to achieve performance comparable with the exact kernel: $m$ must be on the order of tens or even hundreds of thousands [6]. The reason for this is simple: since knowledge of the target function is not used in generating the feature map, enough random features must be selected to get good approximations to arbitrary functions in the reproducing kernel Hilbert space associated with $k$. Thus, under the random feature map paradigm, one trades off a massive reduction in the cost of optimization for the necessity of generating a large number of random features, each of which is rather uninformative about any given target.
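As a concrete illustration of this recipe (a sketch of ours, not taken from the paper), the snippet below builds the random Fourier feature map of Rahimi and Recht [14] for the Gaussian kernel and checks that $ZZ^T$ approximates the exact kernel matrix; the function name and parameter choices are hypothetical.

```python
import numpy as np

def random_fourier_features(X, m, gamma, rng):
    """Map the rows of X to m random Fourier features for the Gaussian
    kernel k(x, y) = exp(-gamma * ||x - y||^2), following Rahimi & Recht [14]."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, m))  # random frequencies
    b = rng.uniform(0, 2 * np.pi, size=m)                  # random phases
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
gamma = 0.5

# Exact kernel matrix K versus the rank-m approximation Z Z^T.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)
Z = random_fourier_features(X, m=2000, gamma=gamma, rng=rng)
print(np.linalg.norm(Z @ Z.T - K) / np.linalg.norm(K))  # relative error shrinks as m grows
```

With only a few hundred features the approximation is noticeably worse, which is the price the random-feature paradigm pays for being target-agnostic.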

Implicit in the formulation of random feature maps is the assumption that the chosen class of features is expressive enough to cover the function space of interest, yet simple enough that features can be sampled randomly in an efficient manner. Given this assumption, it seems natural to ask whether it is possible to directly select a much smaller number of features relevant to the given target in a computationally efficient manner.

Target-specific optimization vs. target-agnostic randomization

Given a kernel $k$ and a parametrized family of features $\{\phi_w\}$ satisfying $k(x, y) = \mathbb{E}_w[\phi_w(x)\phi_w(y)]$, where the expectation is with respect to some distribution on $w$: can optimization over this hypothesis class to select features give a parsimonious representation of a known target, with the same or less computational effort and less error than target-agnostic random sampling of the $\phi_w$?

Figure 1: The relative test error and relative running time (relative to kernel ridge regression on a subset of the training data) of several polynomial learning algorithms over the datasets listed in Table 1. The median and the first and third quartiles of the errors and running times are shown. TM-Batch and TM-SFO are solvers for the Tensor Machine model class introduced in this work. See Section 6 for further discussion.

This question has recently been answered in the affirmative for the case of the radial basis kernel [24, 25]. In this work, we provide a positive answer to the above question in the case of polynomial kernels and the Kar–Karnick random features introduced in [7] (see Figure 1). Polynomial kernels are of interest because, after the radial basis kernel, they are arguably the most widely used kernels in machine learning applications. The Weierstrass approximation theorem guarantees that any continuous function on a compact domain can be approximated to arbitrary accuracy by a polynomial [17, Chapter 7], so in principle a polynomial kernel of sufficiently high degree can be used to accurately approximate any smooth target function.

In this work, we use a natural connection between tensors and polynomials to introduce a restricted hypothesis class of polynomials corresponding to low-rank tensors. In the Tensor Machine paradigm, learning consists of learning a regularized low-rank decomposition of a hidden tensor corresponding to the target polynomial. Given $n$ training points in $d$-dimensional input space, the computational cost of fitting a degree-$q$ TM with a rank-$r$ approximation for each degree (using a first-order algorithm) scales linearly in $n$, $d$, and $r$, while the cost of fitting a predictor using the Kar–Karnick random feature maps and kernel regression approach of [7, 6] is dominated by the number of random features $m$. In practice the $m$ required for accurate predictions is typically on the order of thousands; by way of comparison, we show empirically that TMs typically require $r \le 5$ rank-one features per degree to achieve a comparable approximation error!
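To make the parsimony claim concrete, the following back-of-the-envelope count (ours, with hypothetical sizes) compares the number of parameters in a TM, under the parameterization given in Section 4, with the storage needed for a comparable Kar–Karnick random feature model.

```python
def tm_param_count(d, degree, rank):
    """Parameters in a Tensor Machine as parameterized in Section 4: a bias,
    a linear term, and p vectors of length d for each of the `rank` components
    of every degree p = 2, ..., degree."""
    return 1 + d + sum(p * rank * d for p in range(2, degree + 1))

# Hypothetical sizes, for illustration only.
d, degree, rank, m = 100, 3, 5, 2000
print(tm_param_count(d, degree, rank))  # 2601 TM parameters
print(degree * m * d + m)               # 602000 numbers defining the random-feature model
```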

We show experimentally that Tensor Machines strike a favorable balance between expressivity and parsimony, and provide an analysis of Tensor Machines that establishes favorable generalization properties. The objective function for fitting Tensor Machines is nonconvex, but we show empirically that it is sufficiently structured that the process of fitting a TM is robust, efficient, and requires very few features. We demonstrate that our algorithm exhibits a more favorable time–accuracy tradeoff when compared to the random feature map approach to polynomial regression [7, 13, 6], as well as to the polynomial network algorithm of [9], and the recently introduced Apple algorithm for online sparse polynomial regression [1].

2 Prior Work

Consider the estimation problem of fitting a polynomial function $f$ to $n$ i.i.d. training points $(x_i, y_i)$ drawn from the same unknown distribution, formulated as
$$\min_{f \in \mathcal{H}_q} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) + \lambda \|f\|_{\mathcal{H}_q}^2, \qquad (1)$$
where $\mathcal{H}_q$ is the reproducing kernel Hilbert space of all polynomials of degree at most $q$ and $\ell$ is a specified loss function. An exact solution (to within a specified numerical precision) to this problem can be obtained in $O(n^3)$ time using classical kernel methods.

One approach in the literature has been to couple kernel methods with various techniques for approximating the kernel matrix. The underlying assumption is that the kernel matrix is numerically low-rank, with rank $k \ll n$; typically these methods reduce the cost of kernel methods from $O(n^3)$ to $O(nk^2)$. Nyström approximations, sparse greedy approximations, and incomplete Cholesky factorizations fall into this class [21, 19, 5]. In [3], Bach and Jordan observed that prior algorithms in this class did not exploit all the knowledge inherent in supervised learning tasks: namely, they did not exploit knowledge of the targets (classification labels or regression values). They showed that by exploiting knowledge of the target, one can construct low-rank approximations to the kernel matrix that have significantly smaller rank, with a computational cost that remains linear in $n$.

Another approach to polynomial-based supervised learning relies upon the modeling assumption that the desired target function can be approximated well in the subspace of $\mathcal{H}_q$ consisting of sparse polynomials. Recall that a sparse degree-$q$ polynomial in $d$ variables is one in which only a few of the possible monomials have nonzero coefficients (there are $O(d^q)$ such monomials). The early work of Sanger et al. attempts to learn the monomials relevant to the target in an online manner by augmenting the current polynomial with interaction features [18]. The recent Apple algorithm of Agarwal et al. attempts to learn a sparse polynomial, also in an online manner, using a different heuristic that selects monomials from the current polynomial to be used in forming higher-order monomials at the next stage [1]. The algorithms presented in [2] and [8] provide theoretical guarantees for fitting sparse polynomials, but their computational costs scale undesirably for large-scale learning.

Polynomial fitting has also been tackled using the neural network paradigm. In [9], Livni et al. provide an algorithm for learning polynomial functions as deep neural networks which has the property that the training error is guaranteed to decrease at each iteration. This algorithm has the desirable properties that the network learns a linear combination of a set of polynomials that are constructed in a target-dependent way, and that the degree of the polynomial does not have to be specified in advance: instead, additional layers can be added to the network until the desired error threshold has been reached, with each layer increasing the degree of the predictor by one. Unfortunately, this algorithm requires careful tuning of the hyperparameters (the number of layers and the width of each layer). In the subsequent work [10], Livni et al. provide an algorithm for fitting cubic polynomials in an iterative manner using rank-one tensor approximations. It can be shown that, in fact, this algorithm greedily fits cubic polynomials in the Tensor Machine class we propose in this paper.

Factorization Machines combine the expressivity of polynomial regression with the ability of factorization models to infer relationships from sparse training data [15]. Quadratic Factorization Machines, as introduced by Rendle, are models of the form
$$f(x) = w_0 + \langle w, x \rangle + \sum_{i < j} \langle v_i, v_j \rangle \, x_i x_j,$$
where a vector $v_i \in \mathbb{R}^k$ is learned for each coordinate of the input $x$. These models have been applied with great success in recommendation systems and related applications with sparse input. Quadratic FMs can be fit in time linear in the size of the data, the degree of the polynomial being fit, and the length $k$ of the vectors $v_i$, but FMs can only represent nonlinear interactions that can be written as symmetric homogeneous polynomials (plus a constant). In particular, FMs cannot represent polynomials that involve monomials containing variables raised to powers higher than one. For instance, $f(x) = x_1^2$ cannot be represented as a Factorization Machine. Another drawback to Factorization Machines is that explicitly evaluating the sums involved for a degree-$q$ FM requires $O(k d^q)$ operations. A computational manipulation allows quadratic FMs to be fit in linear time, and it has been claimed that similar manipulations exist for higher-order FMs [15], but no such generalizations have been documented. Perhaps for this reason, only second-order FMs have been used in the literature.
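For completeness, the computational manipulation alluded to above is the rearrangement $\sum_{i<j} \langle v_i, v_j \rangle x_i x_j = \tfrac{1}{2} \sum_{s=1}^{k} \big[ (\sum_i V_{is} x_i)^2 - \sum_i V_{is}^2 x_i^2 \big]$ from [15]; the sketch below (our code, with hypothetical function names) checks it against the naive pairwise sum.

```python
import numpy as np

def fm_pairwise_naive(V, x):
    """Quadratic FM interaction term, summed over all pairs i < j: O(d^2 k)."""
    d = len(x)
    return sum(V[i] @ V[j] * x[i] * x[j] for i in range(d) for j in range(i + 1, d))

def fm_pairwise_fast(V, x):
    """Rendle's rearrangement of the same sum: O(d k)."""
    return 0.5 * float(((V.T @ x) ** 2).sum() - ((V ** 2).T @ (x ** 2)).sum())

rng = np.random.default_rng(1)
V = rng.normal(size=(20, 4))   # one length-4 factor per input coordinate
x = rng.normal(size=20)
assert np.isclose(fm_pairwise_naive(V, x), fm_pairwise_fast(V, x))
```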

The random feature maps approach to polynomial-based learning [14] exploits the concentration of measure phenomenon to directly construct a random low-dimensional feature map that approximately decomposes the kernel. Kar and Karnick provided the first random feature map approach to polynomial regression in [7]; their key observation is that the degree-$q$ homogeneous polynomial kernel $k(x, y) = \langle x, y \rangle^q$ can be approximated with random features of the form
$$\phi_w(x) = \prod_{s=1}^{q} \langle w_s, x \rangle,$$
where the vectors $w_s \in \{\pm 1\}^d$ are vectors of random signs; that is, random feature maps of the form
$$\phi(x) = \frac{1}{\sqrt{m}} \big( \phi_{w^{(1)}}(x), \ldots, \phi_{w^{(m)}}(x) \big)$$
satisfy the condition $\mathbb{E}\, \langle \phi(x), \phi(y) \rangle = k(x, y)$ necessary for the random feature map approach. Pagh and Pham provided a qualitatively different random feature map for polynomial kernel machines, based on the tensorization of fast hashing transforms [13]. These TensorSketch feature maps often outperform Kar–Karnick feature maps, but both methods require a large number of random features $m$. The CRAFTMaps approach of [6] combines either TensorSketch or Kar–Karnick random feature maps with fast Johnson–Lindenstrauss transforms to significantly reduce the number of features needed for good performance.
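A minimal sketch (ours) of the Kar–Karnick construction for the homogeneous degree-$q$ kernel; the function name is hypothetical, and the feature count is chosen only to make the approximation visible.

```python
import numpy as np

def kar_karnick_features(X, m, q, rng):
    """m random features for the degree-q homogeneous polynomial kernel:
    each feature is a product of q inner products with random sign vectors."""
    n, d = X.shape
    Z = np.ones((n, m))
    for _ in range(q):
        W = rng.choice([-1.0, 1.0], size=(d, m))  # fresh random signs per factor
        Z *= X @ W
    return Z / np.sqrt(m)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm rows, as in Section 6
q = 3
K = (X @ X.T) ** q                               # exact kernel matrix
Z = kar_karnick_features(X, m=20000, q=q, rng=rng)
print(np.linalg.norm(Z @ Z.T - K) / np.linalg.norm(K))  # shrinks as m grows
```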

Of these methods, the Tensor Machine approach introduced in this paper is most similar to the Factorization Machine and Kar–Karnick random feature map approaches to polynomial learning.

3 Tensors and polynomials

To motivate Tensor Machines, which are introduced in the next section, we first review the connection between polynomials and tensors. For simplicity we consider only homogeneous degree-$q$ polynomials: the monomials of such a polynomial all have degree exactly $q$. We show that Kar–Karnick predictors and Factorization Machines correspond to specific tensor decompositions.

Recall that a tensor is a multidimensional array, $T \in \mathbb{R}^{d_1 \times \cdots \times d_q}$. The number of indices into the array, $q$, is also called the degree of the tensor. The inner product of two conformal tensors $T$ and $S$ is obtained by treating them as two vectors:
$$\langle T, S \rangle = \sum_{i_1, \ldots, i_q} T_{i_1 \cdots i_q} S_{i_1 \cdots i_q}.$$
The Segre outer product of vectors $w_1, \ldots, w_q$ is the tensor $w_1 \otimes \cdots \otimes w_q$ that satisfies
$$(w_1 \otimes \cdots \otimes w_q)_{i_1 \cdots i_q} = (w_1)_{i_1} \cdots (w_q)_{i_q}.$$
The degree-$q$ self-outer product of a vector $x$ is denoted by $x^{\otimes q}$.

Given a decomposition for a tensor $T$ of the form
$$T = \sum_{j=1}^{r} w_1^{(j)} \otimes \cdots \otimes w_q^{(j)},$$
where $r$ is minimal, $r$ is called the rank of the tensor. A tensor is supersymmetric if there exists a decomposition of the form
$$T = \sum_{j=1}^{r} \lambda_j \, \big(w^{(j)}\big)^{\otimes q}.$$

The tensor $x^{\otimes q}$ comprises all the degree-$q$ monomials in the variable $x$, so any degree-$q$ homogeneous polynomial $f$ satisfies $f(x) = \langle T, x^{\otimes q} \rangle$ for some degree-$q$ tensor $T$; likewise, any degree-$q$ tensor determines a homogeneous polynomial of degree $q$. This equivalence between homogeneous polynomials and tensors allows us to attack the problem of polynomial learning as one of learning a tensor. The Kar–Karnick random feature maps and Factorization Machine approaches can both be viewed through this lens.
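A quick numerical check of this correspondence (ours, using numpy), for a rank-one degree-3 tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 4, 3
x = rng.normal(size=d)
w = [rng.normal(size=d) for _ in range(q)]

# Rank-one degree-3 tensor w1 ⊗ w2 ⊗ w3 and the self-outer product x ⊗ x ⊗ x.
T = np.einsum('i,j,k->ijk', *w)
X3 = np.einsum('i,j,k->ijk', x, x, x)

# <w1 ⊗ w2 ⊗ w3, x^{⊗3}> equals the product of the inner products <w_s, x>.
lhs = np.sum(T * X3)
rhs = np.prod([ws @ x for ws in w])
assert np.isclose(lhs, rhs)
```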

A single Kar–Karnick random polynomial feature corresponds to a rank-one tensor:
$$\phi_w(x) = \prod_{s=1}^{q} \langle w_s, x \rangle = \big\langle w_1 \otimes \cdots \otimes w_q, \; x^{\otimes q} \big\rangle.$$
Accordingly, predictors generated by the random feature maps approach with the Kar–Karnick feature map satisfy
$$f(x) = \sum_{j=1}^{m} a_j \big\langle w_1^{(j)} \otimes \cdots \otimes w_q^{(j)}, \; x^{\otimes q} \big\rangle = \Big\langle \sum_{j=1}^{m} a_j \, w_1^{(j)} \otimes \cdots \otimes w_q^{(j)}, \; x^{\otimes q} \Big\rangle$$
for some coefficient vector $a \in \mathbb{R}^m$. That is, Kar–Karnick predictors correspond to tensors in the span of $m$ randomly sampled degree-$q$ rank-one tensors.

Factorization Machines fit predictors whose nonlinear part has the form
$$\sum_{i_1 < i_2 < \cdots < i_q} \langle v_{i_1}, v_{i_2}, \ldots, v_{i_q} \rangle \, x_{i_1} x_{i_2} \cdots x_{i_q},$$
where the factors $v_i$ are vectors in $\mathbb{R}^k$ and
$$\langle v_{i_1}, \ldots, v_{i_q} \rangle = \sum_{s=1}^{k} (v_{i_1})_s \cdots (v_{i_q})_s$$
is the generalization of the inner product of two vectors. Let $V \in \mathbb{R}^{d \times k}$ be the matrix with rows comprising the vectors $v_i$, and define the $d$-dimensional vector $u_s$ to be the $s$th column of $V$. It follows that FM predictors can be expressed (up to a constant rescaling of each degree) in the form
$$f(x) = \Big\langle \sum_{s=1}^{k} u_s^{\otimes q} + \Delta, \; x^{\otimes q} \Big\rangle,$$
where the tensor $\Delta$ is defined so that its nonzero entries cancel the entries of $\sum_s u_s^{\otimes q}$ that correspond to coefficients of monomials, like $x_1^2 x_2$, that contain a variable raised to a power larger than one. The vectors $v_i$ are learned during training (the tensor $\Delta$ is implicitly defined by these vectors).

4 Tensor Machines

Thus, the Kar–Karnick random feature map approach searches for a low-rank tensor corresponding to the target polynomial, and the Factorization Machine approach attempts to provide a supersymmetric low-rank tensor plus sparse tensor decomposition of the target polynomial. The Kar–Karnick approach has the drawback that it tries to find a good approximation in the span of a set of random rank-one tensors, so many such basis elements must be chosen to ensure that an unknown target can be represented well. Factorization Machines circumvent this issue by directly learning the tensor decomposition, but impose strict constraints on the form of the fitted polynomial that are not appropriate for general learning applications.

As an alternative to Kar–Karnick random feature maps and Factorization Machines, we propose to fit Tensor Machines by directly optimizing over rank-one tensors to form a low-rank approximation to the target tensor (i.e., the target polynomial). Since the targets are not, in general, homogeneous polynomials, we learn a different set of rank-one factors for each degree up to the target degree $q$.

More precisely, Tensor Machines are functions of the form
$$f(x) = w_0 + \langle w_1, x \rangle + \sum_{p=2}^{q} \sum_{j=1}^{r} \prod_{s=1}^{p} \big\langle w^{(p,j,s)}, x \big\rangle,$$
obtained as minimizers of the following proxy to the kernel machine objective (1):
$$\min_{w_0, \, w_1, \, \{w^{(p,j,s)}\}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) \; + \; \lambda \Big( \|w_1\|_2^2 + \sum_{p=2}^{q} \sum_{j=1}^{r} \sum_{s=1}^{p} \big\| w^{(p,j,s)} \big\|_2^2 \Big). \qquad (2)$$

By construction, Tensor Machines couple the expressiveness of the Kar–Karnick random feature maps model with the parsimony of Factorization Machines.
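To make the model class concrete, the following sketch (ours, not the authors' reference implementation) evaluates a Tensor Machine with the parameterization above; the parameter layout and names are our own.

```python
import numpy as np

def tm_predict(x, w0, w1, factors):
    """Evaluate a Tensor Machine: bias + linear term + for every degree p and
    rank index j, the product of the p inner products <w^(p,j,s), x>."""
    out = w0 + w1 @ x
    for W in factors:                  # W has shape (r, p, d) for one degree p
        out += np.prod(W @ x, axis=1).sum()
    return out

rng = np.random.default_rng(0)
d, q, r = 10, 3, 2
w0 = 0.1
w1 = rng.normal(size=d)
factors = [0.01 * rng.normal(size=(r, p, d)) for p in range(2, q + 1)]
x = rng.normal(size=d)
print(tm_predict(x, w0, w1, factors))
```

Fitting then amounts to minimizing (2) over $w_0$, $w_1$, and the factor arrays with a batch quasi-Newton solver (TM-Batch) or a stochastic one (TM-SFO), as described in Section 6.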

5 Generalization Error

In this section we argue that fitting Tensor Machines using empirical risk minimization makes efficient use of the training data. We show that the observed risk of Tensor Machines converges to the expected risk at the optimal rate, thus indicating that empirical risk minimization is an efficient method for finding locally optimal Tensor Machines (assuming the optimization method avoids saddle points). For convenience, we consider a variant of Tensor Machines where the norms of the vectors are constrained:

$$\begin{aligned} \text{minimize} \quad & \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) \\ \text{subject to} \quad & \big\| w^{(p,j,s)} \big\|_2 \le B \ \text{ for all } p, j, s, \qquad \|w_1\|_2 \le B. \end{aligned}$$

This formulation is not equivalent to the formulation given in (2), but by taking $B$ to be the norm of the largest constituent vector in the fitted Tensor Machine, any bound on the generalization error of this formulation applies to the generalization error of Tensor Machines obtained by minimizing (2).

Recall that the Rademacher complexity of a function class $\mathcal{F}$ measures how well random noise can be approximated using a function from $\mathcal{F}$ [4].

Definition 1 (Rademacher complexity)

Given a function class $\mathcal{F}$ and i.i.d. random variables $x_1, \ldots, x_n \sim \mathcal{D}$, the Rademacher complexity of $\mathcal{F}$, denoted $\mathcal{R}_n(\mathcal{F})$, is defined as
$$\mathcal{R}_n(\mathcal{F}) = \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(x_i),$$
where the $\epsilon_i$ are independent Rademacher random variables (random variables taking values $\pm 1$ with equal probability).
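For intuition (this is not part of the paper's analysis), the empirical Rademacher complexity of a simple class can be estimated by Monte Carlo directly from this definition; for linear functions with $\|w\|_2 \le B$ the supremum is available in closed form, and the estimate visibly decays like $n^{-1/2}$.

```python
import numpy as np

def empirical_rademacher_linear(X, B, trials, rng):
    """Monte Carlo estimate of E_eps sup_{||w|| <= B} (1/n) sum_i eps_i <w, x_i>.
    For this class the supremum equals (B/n) * ||sum_i eps_i x_i||_2."""
    n = X.shape[0]
    vals = []
    for _ in range(trials):
        eps = rng.choice([-1.0, 1.0], size=n)
        vals.append(B / n * np.linalg.norm(eps @ X))
    return np.mean(vals)

rng = np.random.default_rng(0)
for n in (100, 400, 1600):
    X = rng.normal(size=(n, 10))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm datapoints
    print(n, empirical_rademacher_linear(X, B=1.0, trials=200, rng=rng))
    # the estimate decays roughly like 1/sqrt(n), matching the discussion below
```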

Well-known results (e.g. [4, Theorems 7 and 8]) state that, with high probability over the training data, the observed classification and regression risks are within $O\big(\mathcal{R}_n(\mathcal{F}) + n^{-1/2}\big)$ of the true classification and regression risks. In fact, the optimal rate of convergence of the observed risk to the expected risk is $O(n^{-1/2})$, which can only be achieved when the Rademacher complexity of the hypothesis class is itself $O(n^{-1/2})$ [11].

Our main observation is that the Rademacher complexity of Tensor Machines is $O(n^{-1/2})$, so the empirically observed estimate of the Tensor Machine risk converges to the expected risk at the optimal rate.

Theorem 5.1

Let $\mathcal{F}_{q,r,B}$ denote the class of degree-$q$, rank-$r$ Tensor Machines on $\mathbb{R}^d$ with constituent vectors constrained to lie in the ball of radius $B$:
$$\mathcal{F}_{q,r,B} = \Big\{ x \mapsto w_0 + \langle w_1, x \rangle + \sum_{p=2}^{q} \sum_{j=1}^{r} \prod_{s=1}^{p} \langle w^{(p,j,s)}, x \rangle \;:\; |w_0| \le B, \ \|w_1\|_2 \le B, \ \|w^{(p,j,s)}\|_2 \le B \Big\}.$$

Let $x_1, \ldots, x_n$ be distributed i.i.d. according to $\mathcal{D}$, and assume $\|x\|_2 \le 1$ almost surely. The Rademacher complexity, with respect to $\mathcal{D}$, of this hypothesis class satisfies
$$\mathcal{R}_n(\mathcal{F}_{q,r,B}) \le \frac{C_{q,r,B,d}}{\sqrt{n}},$$
where $C_{q,r,B,d}$ is a constant that depends on $q$, $r$, $B$, and $d$, but not on $n$.

This result follows from reformulating the Rademacher complexity as the spectral norm of a Rademacher sum of data-dependent tensors and then applying a recent bound from [12] on the spectral norm of random tensors. A full proof is provided in the appendix.
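Purely for intuition (not part of the proof), the quantity at the heart of this argument, the spectral norm of $\frac{1}{n} \sum_i \epsilon_i x_i^{\otimes q}$, can be lower-bounded numerically by alternating maximization without ever forming the $d^q$ tensor; the sketch below is ours and the function name is hypothetical.

```python
import numpy as np

def tensor_rademacher_norm_estimate(X, q, eps, iters=50):
    """Heuristic lower bound on the spectral norm of (1/n) sum_i eps_i x_i^{⊗q},
    computed by alternating maximization over the unit vectors u_1, ..., u_q in
    sup <T, u_1 ⊗ ... ⊗ u_q>, using only inner products with the data."""
    n, d = X.shape
    U = [np.ones(d) / np.sqrt(d) for _ in range(q)]
    for _ in range(iters):
        for s in range(q):
            # weight of each x_i after contracting along every mode except s
            coeff = eps / n
            for t in range(q):
                if t != s:
                    coeff = coeff * (X @ U[t])
            g = X.T @ coeff                       # best direction for mode s
            U[s] = g / (np.linalg.norm(g) + 1e-12)
    value = eps / n
    for t in range(q):
        value = value * (X @ U[t])
    return value.sum()

rng = np.random.default_rng(0)
n, d, q = 500, 20, 3
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
eps = rng.choice([-1.0, 1.0], size=n)
print(tensor_rademacher_norm_estimate(X, q, eps))
# grows smaller as n increases, consistent with the O(n^{-1/2}) bound
```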

6 Empirical Evaluations

In this section, we evaluate the performance of Tensor Machines on several real-world regression and classification datasets and demonstrate their attractiveness in learning polynomial features relative to other state-of-the-art algorithms.

6.1 Experimental setup

The nonconvex optimization problem (2) is central to fitting Tensor Machines. We consider two solvers for (2). The first uses the implementation of L-BFGS provided in Mark Schmidt's minFunc MATLAB package for optimization (the 2012 release, retrieved from http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html) [16]. Since L-BFGS is a batch optimization algorithm, we also investigate the use of SFO, a stochastic quasi-Newton solver designed to work with minibatches [20]. We use the reference implementation of SFO provided by Sohl-Dickstein (https://github.com/Sohl-Dickstein/Sum-of-Functions-Optimizer/blob/master/README.md). We refer to the two algorithms used to fit Tensor Machines as TM-Batch and TM-SFO, respectively.

The choice of initialization strongly influences the performance of both TM-Batch and TM-SFO. Accordingly, we used initial points sampled from a $\mathcal{N}(0, \sigma^2)$ distribution. The variance $\sigma^2$, as well as the regularization parameter $\lambda$ in (2), are set by tuning.

We compared TM-Batch and TM-SFO against several recent algorithms for learning polynomials: CRAFTMaps [6], Basis Learner [9] and Apple [1]. To provide a baseline, we also used Kernel Ridge Regression (KRR) to learn polynomials; in cases where the training set contained more than 40,000 points, we randomly chose a subset of size 40,000 to perform KRR. For CRAFTMaps, the up-projection dimensionality is set to be a constant factor times the down-projection dimensionality. We did not obtain reasonable predictions using the original implementation of Apple in vowpal wabbit, so we implemented the model in MATLAB and used this implementation to train and test the model, but reported the running time of the VW implementation (with the same choice of parameters). The remaining methods are implemented in pure MATLAB. The experiments were conducted on a machine with four 6-core Intel Xeon 2 GHz processors and 128 GB RAM.

Because the performance of nonlinear learning algorithms can be very data-dependent, we tested the methods on a collection of publicly available datasets to provide a broad characterization of the behavior of these algorithms. A summary of the basic properties of these datasets can be found in Table 1. For each dataset, the target degree $q$ was chosen to be the value that minimized the KRR test error. We preprocessed the data by first normalizing the input features to have similar magnitudes (so that each column of the training matrix has unit Euclidean norm), then scaling each datapoint to have unit Euclidean norm.
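The preprocessing just described amounts to two normalizations. A minimal numpy sketch (ours), assuming datapoints are stored as rows:

```python
import numpy as np

def preprocess(X_train, X_test, eps=1e-12):
    """Scale each feature (column) by its norm on the training set, then
    scale each datapoint (row) to have unit Euclidean norm."""
    col_norms = np.linalg.norm(X_train, axis=0) + eps
    X_train = X_train / col_norms          # each training column now has unit norm
    X_test = X_test / col_norms            # apply the same column scaling to the test set
    X_train /= np.linalg.norm(X_train, axis=1, keepdims=True) + eps
    X_test /= np.linalg.norm(X_test, axis=1, keepdims=True) + eps
    return X_train, X_test
```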

For regression tasks, we used the squared loss, and for binary classification tasks we used the logistic loss. For regression tasks, the test error is reported as $\|\hat{y} - y\|_2 / \|y\|_2$, where $\hat{y}$ and $y$ are the prediction and ground truth, respectively; the classification inaccuracy (fraction of misclassified test points) is reported for classification tasks.

Name Train/Test split $d$ Type $q$
Indoor 19937/11114 520 regression 3
Year 463715/51630 90 regression 2
Census 18186/2273 119 regression 2
Slice 42291/10626 384 regression 5
Buzz 466600/116650 77 regression 3
Gisette 6000/1000 5000 binary 3
Adult 32561/16281 122 binary 3
Forest 522910/58102 54 binary 4
eBay search 500000/100000 478 binary 3
Cor-rna 59535/157413 8 binary 4
Table 1: Description of datasets. The target degree was selected by minimizing the KRR test error on each dataset (or a subset).

Each algorithm is governed by several interacting parameters, but for each algorithm we identify one major parameter: the number of iterations for TM-Batch, the number of epochs for TM-SFO, the number of features for CRAFTMaps, the layer width for Basis Learner, and the expansion parameter for Apple. We choose optimal values for the non-major parameters through 10-fold cross-validation on the training set. We varied the major parameter over a wide range and recorded the test errors and running times for the different values until the performance of the algorithms saturated. By saturation, we mean that the improvement in the test error between successive settings of the major parameter is a negligible fraction of $\mathrm{err}_{\mathrm{KRR}}$, the test error of KRR. For each combination of parameters, the average test errors and running times over several independent trials are reported.

6.2 Overall performance

To compare their performance, we applied TM-Batch, TM-SFO, CRAFTMaps, Basis Learner, and Apple to the datasets listed in Table 1. We did not evaluate Factorization Machines because efficient algorithms for fitting FM models are only available in the literature for the case $q = 2$. It is unclear whether FMs with higher values of $q$ can be fit efficiently.

We present the computation/prediction accuracy tradeoffs of the considered algorithms in Figure 1. The data plotted are the test errors and running times of the algorithms relative to those of KRR:
$$\mathrm{relerr} = \frac{\mathrm{err}}{\mathrm{err}_{\mathrm{KRR}}} \qquad \text{and} \qquad \mathrm{reltime} = \frac{\mathrm{time}}{\mathrm{time}_{\mathrm{KRR}}}.$$
The median values of relerr and reltime, as well as the corresponding first and third quartiles, are shown.

The detailed results can be found in Table 2. The performance of TM-Batch is consistent across the datasets in the sense that it provides reliable predictions in a fairly short amount of time. TM-SFO converges to lower quality solutions than TM-Batch, but has lower median and variance in runtime. CRAFTMaps and Basis Learner take significantly more time to yield solutions almost as accurate as the two Tensor Machine algorithms. As one might expect, due to its greedy nature, Apple is the fastest of the algorithms (the time of the vowpal wabbit implementation is reported), but of all the algorithms, TM-Batch and TM-SFO deliver the lowest worst-case relative errors.

Name KRR TM-Batch TM-SFO CRAFTMaps Basis Learner Apple
Indoor 0.00961/50 0.00802/3.2(4) 0.0182/3.9(4) 0.00559/2.7(300) 0.00241/97 0.00712/6.5
Year 0.00485/310 0.00565/85(5) 0.00484/74(5) 0.00494/39(200) 0.00496/67 0.00489/60
Census 0.0667/46 0.0774/12(5) 0.0802/45(5) 0.0767/7.3(1000) 0.0788/8.1 0.0950/7.3
Slice 0.0118/441 0.0229/80(3) 0.0213/73(3) 0.0465/707(7000) 0.0501/1034 0.0788/46
Buzz 0.407/624 0.407/568(2) 0.409/440(2) 0.408/2105(1200) 0.373/2472 0.496/325
Gisette 0.027/6.6 0.0266/16(4) 0.0254/29(4) 0.0300/63(11500) 0.0270/62 0.0240/32
Adult 0.150/182 0.149/0.9(4) 0.151/2.0(4) 0.154/17(700) 0.149/ 30 0.150/0.8
Forest 0.148/361 0.184/657(5) 0.178/569(5) 0.196/1023(750) 0.180/3494 0.219/14
eBay search 0.197/446 0.197/612(5) 0.192/349(5) 0.281/1054(800) 0.281/1642 0.269/122
Cor-rna 0.0446/423 0.0453/8.4(3) 0.0489/7.2(3) 0.0462/19(200) 0.0493/1.7 0.0489/4.2
Table 2: Test error/running time of each method on the datasets. The rank parameter used in TM-Batch and TM-SFO and the number of random features used in CRAFTMaps are listed in parentheses. For Apple, the test error is computed using a MATLAB implementation of Apple, but the running time is that of the vowpal wabbit framework with the same choice of parameters.

Both TM solvers required at most five rank-one TM features for each individual degree fit, across all datasets (see the rank parameters reported in Table 2). Since only one dataset benefited from a quintic fit, the TM models required only on the order of $qrd$ parameters to fit each dataset; this should be compared to the CRAFTMaps models, which required hundreds to thousands of random features (see Table 2), generally produced fits with higher error than the TM solvers, and required longer solution times.

6.3 Effect of rank parameter

To investigate the effect of the rank parameter $r$ on the test error of Tensor Machines, we evaluated both TM-Batch and TM-SFO on the Census and Slice datasets. On Census, since the target degree is $2$, we evaluate Factorization Machines (FM) as well. The results are shown in Figure 2.

As expected, increasing $r$ leads to smaller test errors. On Slice, where the target degree is $5$, the gap between the performance of TMs and KRR is relatively large (almost a factor of 2), while on Census, where the target degree is $2$, the gap is smaller. Interestingly, on Census, increasing the rank beyond a moderate value does not lead to higher accuracy. We also observe on Census that the error of TMs is lower than that of FMs; this behaviour was also observed on the other datasets for which a quadratic fit was appropriate.

(a) Census
(b) Slice
Figure 2: Test error of TM-Batch and TM-SFO with different rank parameters on the Census ($q = 2$) and Slice ($q = 5$) datasets. The test error of kernel ridge regression (KRR) is also plotted. On Census, we also evaluate Factorization Machines (FM).

6.4 Time-accuracy tradeoffs

To assess the time-accuracy tradeoffs of the various algorithms, in Figure 3 we plot the test error vs the training time for the Slice and Forest datasets. The training time is determined by varying the settings of each algorithm’s major parameter.

TM-Batch and TM-SFO compare favorably against the other methods: they either produce a much lower error than the other methods (on Slice) or reach a comparably low error much faster (on Forest). Similar patterns were observed on most of the datasets we considered.

(a) Slice
(b) Forest
Figure 3: Evaluation of the time-accuracy tradeoffs of the algorithms on the Slice ($q = 5$) and Forest ($q = 4$) datasets. The rank parameters for TM-Batch and TM-SFO on Slice and Forest are $3$ and $5$, respectively.

6.5 Scalability of TM-SFO on eBay search dataset

Since SFO needs to access only a small mini-batch of data points per update, it is suitable for fitting TMs on datasets which cannot fit in memory all at once. Here we explore the scalability of TM-SFO on a private eBay search dataset of 1.2 million data points with dimension $d = 478$. Each data point comprises the feature vectors associated with a pair of items that were returned as the results of a search on the eBay website and were subsequently visited. The goal is to fit a polynomial classifier that predicts which of the two items was clicked on first. In our experiments, we randomly selected 100,000 data points to be the test set; training sets of variable size were selected from the remainder.

For comparison, we also evaluate CRAFTMaps on the same task. In TM-SFO, we fix the rank parameter to be $5$ and the number of epochs to the value used for this dataset in the experiments of Section 6.2. In CRAFTMaps, we fix the number of random features to be $800$. These parameters are the optimal settings from the experiments used to generate Table 2. We report both the classification error and training time as the size of the training set grows.

Figure 4: Scalability of TM-SFO on the eBay search dataset ($d = 478$, $q = 3$). The number of epochs is fixed as described above. The plot shows the test error (black solid line) and training time (red dashed line) as a function of the number of training points used.

The results are shown in Figure 4. We observe that the running time of TM-SFO grows almost linearly with the size of the training set, while that of CRAFTMaps grows superlinearly. Also, as expected, because CRAFTMaps chooses the hypothesis space independently of the target, increasing the size of the training set without also increasing the size of the model does not give much gain in performance past 100,000 training points. We see that the target-adaptive nature of TMs endows TM-SFO with two advantages. First, for a fixed amount of training data the errors of TMs are significantly lower than those of CRAFTMaps. Second, because the hypothesis space evolves as a function of the target and training data, the test error of TM-SFO exhibits noticeable decay up to a training set size of about 500,000 points, long past the point where the CRAFTMaps test error has saturated.

References

  • [1] A. Agarwal, A. Beygelzimer, D. Hsu, J. Langford, and M. Telgarsky. Scalable Nonlinear Learning with Adaptive Polynomial Expansions. In Advances in Neural Information Processing Systems 27, 2014.
  • [2] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang. Learning sparse polynomial functions. In Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms, 2014.
  • [3] F. R. Bach and M. I. Jordan. Predictive low-rank decomposition for kernel methods. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
  • [4] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research, 2002.
  • [5] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2002.
  • [6] R. Hamid, Y. Xiao, A. Gittens, and D. DeCoste. Compact Random Feature Maps. In Proceedings of the 31st International Conference on Machine Learning, 2014.
  • [7] P. Kar and H. Karnick. Random Feature Maps for Dot Product Kernels. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, 2012.
  • [8] M. Kocaoglu, K. Shanmugam, A. G. Dimakis, and A. Klivans. Sparse Polynomial Learning and Graph Sketching. In Advances in Neural Information Processing Systems 27, 2014.
  • [9] R. Livni, S. Shalev-Shwartz, and O. Shamir. An algorithm for learning polynomials. Preprint arXiv:1304.7045, 2014.
  • [10] R. Livni, S. Shalev-Shwartz, and O. Shamir. On the Computational Efficiency of Training Neural Networks. In Advances in Neural Information Processing Systems 27, 2014.
  • [11] S. Mendelson. Lower bounds for the empirical minimization algorithm. IEEE Transactions on Information Theory, 2008.
  • [12] N. H. Nguyen, P. Drineas, and T. D. Tran. Tensor sparsification via a bound on the spectral norm of random tensors. Information and Inference, 2013. arXiv:1005.4732.
  • [13] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013.
  • [14] A. Rahimi and B. Recht. Random Features for Large-Scale Kernel Machines. In Advances in Neural Information Processing Systems 20, 2007.
  • [15] S. Rendle. Factorization machines. In IEEE 10th International Conference on Data Mining, 2010.
  • [16] R. H. Byrd, J. Nocedal, and R. B. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63, 1994.
  • [17] W. Rudin. Principles of Mathematical Analysis. McGraw–Hill, 3 edition, 1976.
  • [18] T. D. Sanger, R. S. Sutton, and C. J. Matheus. Iterative Construction of Sparse Polynomial Approximation. In Advances in Neural Information Processing Systems, 1992.
  • [19] A. J. Smola and B. Schölkopf. Sparse Greedy Matrix Approximation for Machine Learning. In Proceedings of the 17th International Conference on Machine Learning, 2000.
  • [20] J. Sohl-Dickstein, B. Poole, and S. Ganguli. Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In Proceedings of the 31st International Conference on Machine Learning, 2014.
  • [21] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Proceedings of the 14th Annual Conference on Neural Information Processing Systems, 2001.
  • [22] J. Yang, V. Sindhwani, H. Avron, and M. Mahoney. Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels. In Proceedings of The 31st International Conference on Machine Learning, 2014.
  • [23] J. Yang, V. Sindhwani, Q. Fan, H. Avron, and M. Mahoney. Random Laplace Feature Maps for Semigroup Kernels on Histograms. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [24] Z. Yang, A. J. Smola, L. Song, and A. G. Wilson. À la carte — learning fast kernels. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.
  • [25] F. X. Yu, S. Kumar, H. Rowley, and S.-F. Chang. Compact Nonlinear Maps and Circulant Extensions. Preprint arXiv:1503.03893, 2015.

Appendix 0.A Proof of Theorem 5.1

Theorem 5.1 is a corollary of the following bound on the Rademacher complexity of a rank-one Tensor Machine.

Theorem 0.A.2

Let $\mathcal{P}_{q,B}$ denote the class of degree-$q$, rank-one Tensor Machine features on $\mathbb{R}^d$ with constituent vectors constrained to lie in the ball of radius $B$:
$$\mathcal{P}_{q,B} = \Big\{ x \mapsto \prod_{s=1}^{q} \langle w_s, x \rangle \;:\; \|w_s\|_2 \le B, \ s = 1, \ldots, q \Big\}.$$

Let $x_1, \ldots, x_n$ be distributed i.i.d. according to $\mathcal{D}$, and assume $\|x\|_2 \le 1$ almost surely. The Rademacher complexity with respect to $\mathcal{D}$ of this hypothesis class satisfies
$$\mathcal{R}_n(\mathcal{P}_{q,B}) \le \frac{C_{q,B,d}}{\sqrt{n}},$$
where $C_{q,B,d}$ is a constant that depends on $q$, $B$, and $d$, but not on $n$.

Proof (Proof of Theorem 5.1)

Every function in $\mathcal{F}_{q,r,B}$ can be written as the sum of a constant, a linear function of the form $\langle w_1, x \rangle$ with $\|w_1\|_2 \le B$, and $r$ functions from each of the rank-one classes $\mathcal{P}_{p,B}$, $p = 2, \ldots, q$. The Rademacher complexity of the class of constant functions is zero, and a straightforward calculation establishes that the Rademacher complexity of linear functions of the specified form is no larger than $B/\sqrt{n}$. It follows from a simple structural result on Rademacher complexities [4, Theorem 12] that
$$\mathcal{R}_n(\mathcal{F}_{q,r,B}) \le \frac{B}{\sqrt{n}} + r \sum_{p=2}^{q} \mathcal{R}_n(\mathcal{P}_{p,B}).$$
Applying Theorem 0.A.2 to each of the rank-one classes, we have that
$$\mathcal{R}_n(\mathcal{F}_{q,r,B}) \le \Big( B + r \sum_{p=2}^{q} C_{p,B,d} \Big) \frac{1}{\sqrt{n}},$$
which establishes the claimed bound.

Proof (Proof of Theorem 0.A.2)

Given the data points $x_1, \ldots, x_n$, let $T_i = x_i^{\otimes q}$, so that every $f \in \mathcal{P}_{q,B}$ satisfies $f(x_i) = \langle w_1 \otimes \cdots \otimes w_q, T_i \rangle$. From the definition of the Rademacher complexity we have that
$$\mathcal{R}_n(\mathcal{P}_{q,B}) = \mathbb{E}_\epsilon \sup_{\|w_1\|_2, \ldots, \|w_q\|_2 \le B} \Big\langle w_1 \otimes \cdots \otimes w_q, \; \frac{1}{n} \sum_{i=1}^{n} \epsilon_i T_i \Big\rangle,$$
where the $\epsilon_i$ are independent Rademacher variables.

Define $T_\epsilon = \frac{1}{n} \sum_{i=1}^{n} \epsilon_i T_i$. Recall the definition of the spectral norm of an order-$q$ tensor,
$$\|T\|_2 = \sup_{\|u_1\|_2 = \cdots = \|u_q\|_2 = 1} \langle T, \, u_1 \otimes \cdots \otimes u_q \rangle.$$
It follows from this definition that
$$\mathcal{R}_n(\mathcal{P}_{q,B}) \le B^q \, \mathbb{E}_\epsilon \|T_\epsilon\|_2. \qquad (3)$$

For simplicity, we drop the subscript $\epsilon$ from $T_\epsilon$ in the following. Note that $\mathbb{E}[T] = 0$, since Rademacher variables are mean-zero. We now apply Theorem 2 of [12], which bounds the expected spectral norm of the difference between a tensor Rademacher sum and its expectation by the sum of the expected maximum Euclidean lengths of the one-dimensional slices through the tensor. Replacing the innermost sum in that bound with a maximum gives an estimate involving the expected size of the largest entry of $T$:
$$\mathbb{E}_\epsilon \|T\|_2 \le c_{q,d} \; \mathbb{E}_\epsilon \max_{i_1, \ldots, i_q} \big| T_{i_1 \cdots i_q} \big|, \qquad (4)$$
where $c_{q,d}$ depends on $q$ and $d$ but not on $n$.

Next we bound the maximum entry of $T$ by the Frobenius norm of $T$ and apply Jensen's inequality to obtain
$$\mathbb{E}_\epsilon \max_{i_1, \ldots, i_q} \big| T_{i_1 \cdots i_q} \big| \le \mathbb{E}_\epsilon \|T\|_F \le \big( \mathbb{E}_\epsilon \|T\|_F^2 \big)^{1/2} = \Big( \frac{1}{n^2} \sum_{i=1}^{n} \|T_i\|_F^2 \Big)^{1/2}.$$
Recalling the definition of the $T_i$, and noting that $\|v_1 \otimes \cdots \otimes v_q\|_F = \prod_{s} \|v_s\|_2$ for any rank-one tensor, so that $\|T_i\|_F = \|x_i\|_2^q \le 1$ almost surely, it follows that
$$\mathbb{E}_\epsilon \max_{i_1, \ldots, i_q} \big| T_{i_1 \cdots i_q} \big| \le \frac{1}{\sqrt{n}}. \qquad (5)$$
From equations (3), (4), and (5), we conclude that
$$\mathcal{R}_n(\mathcal{P}_{q,B}) \le \frac{B^q \, c_{q,d}}{\sqrt{n}},$$
which establishes the theorem with $C_{q,B,d} = B^q c_{q,d}$.