1 Introduction
Kernel machines are one of the most popular and widely adopted paradigms in machine learning and data analysis. This success is due to the fact that an appropriately chosen nonlinear kernel often succeeds in capturing nonlinear structures inherent to the problem without forming the highdimensional features necessary to explicitly delineate those structures. Unfortunately, the cost of kernel methods scales like
, where is the number of datapoints. Because of this high computational cost, it is often the case that kernel methods cannot exploit all the information in large datasets; indeed, even the cost of forming and storing the kernel matrix can be prohibitive on large datasets.As a consequence, methods for approximately fitting nonlinear kernel methods have drawn much attention in recent years. In particular, explicit random feature maps have been proposed to make largescale kernel machines practical [14, 7, 13, 6, 22, 23]. The idea behind this method is to randomly choose an dimensional feature map that satisfies where is the kernel of interest [14]. Let be the matrix whose rows comprise the application of to datapoints, then the kernel matrix has the lowrank approximation This approximation considerably reduces the costs of both fitting models, because one can work with the by matrix instead of , and of forming the input, because one need form only instead of
The success of this approximation hinges on choosing so that is close to
with high probability. One easy way to accomplish this is to take
where the random features all satisfy The concentration of measure phenomenon then ensures that is small with high probability. In practice, one must use a large number of random features to achieve performance comparable with the exact kernel: must be on the order of tens or even hundreds of thousands [6]. The reason for this is simple: since knowledge of the target function is not used in generating the feature map to ensure good performance enough random features must be selected to get good approximations to arbitrary functions in the reproducing kernel Hilbert space associated with Thus under the random feature map paradigm, one trades off a massive reduction in the cost of optimization for the necessity of generating a large number of random features, each of which is rather uninformative about any given target.Implicit in the formulation of random feature maps is the assumption that the chosen class of features is expressive enough to cover the function space of interest, yet simple enough that features can be sampled randomly in an efficient manner. Given this assumption, it seems natural to question whether it is possible to directly select a much smaller number of features relevant to the given target in a computational efficient manner.
Targetspecific optimization vs targetagnostic randomization
Given a kernel and a parametrized family of features satisfying where the expectation is with respect to some distribution on can optimization over this hypothesis class to select features give a parsimonious representation of a known target, with the same or less computational effort and less error than targetagnostic random sampling of ?
This question has recently been answered in the affirmative for the case of the radial basis kernel [24, 25]. In this work, we provide a positive answer to the above question in the case of polynomial kernels and the Kar–Karnick random features introduced in [7] (see Figure 1). Polynomial kernels are of interest because, after the radial basis kernel, they are arguably the most widely used kernels in machine learning applications. The Weierstrass approximation theorem guarantees than any smooth function can be arbitrarily accurately approximated by a polynomial [17][Chapter 7], so in principle a polynomial kernel of sufficiently high degree can be used to accurately approximate any smooth target function.
In this work, we use a natural connection between tensors and polynomials to introduce a restricted hypothesis class of polynomials corresponding to lowrank tensors. In the Tensor Machine paradigm, learning consists of learning a regularized lowrank decomposition of a hidden tensor corresponding to the target polynomial. Given training points in dimensional input space, the computational cost of fitting a degree TM with a rank approximation for each degree (using a firstorder algorithm) is , while the cost of fitting a predictor using the Kar–Karnick random feature maps and kernel regression approach of [7, 6] is . In practice the required for accurate predictions is typically on the order of thousands; by way of comparison, we show empirically that TMs typically require less than to achieve a comparable approximation error!
We show experimentally that Tensor Machines strike a favorable balance between expressivity and parsimony, and provide an analysis of Tensor Machines that establishes favorable generalization properties. The objective function for fitting Tensor Machines is nonconvex, but we show empirically that it is sufficiently structured that the process of fitting a TM is robust, efficient, and requires very few features. We demonstrate that our algorithm exhibits a more favorable time–accuracy tradeoff when compared to the random feature map approach to polynomial regression [7, 13, 6], as well as to the polynomial network algorithm of [9], and the recently introduced Apple algorithm for online sparse polynomial regression [1].
2 Prior Work
Consider the estimation problem of fitting a polynomial function
to i.i.d. training points drawn from the same unknown distribution, formulated as(1) 
where is the reproducing kernel Hilbert space of all polynomials of degree at most and
is a specified loss function. An exact solution (to within a specified numerical precision) to this problem can be obtained in
time using classical kernel methods.One approach in the literature has been to couple kernel methods with various techniques for approximating the kernel matrix. The underlying assumption is that the kernel matrix is numerically lowrank, with rank ; typically these methods reduce the cost of kernel methods from to Nyström approximations, sparse greedy approximations, and incomplete Cholesky factorizations fall into this class [21, 19, 5]. In [3], Bach and Jordan observed that prior algorithms in this class did not exploit all
the knowledge inherent in supervised learning tasks: namely, they did not exploit knowledge of the targets (classification labels or regression values). They showed that by exploiting knowledge of the target, one can construct lowrank approximations to the kernel matrix that have significantly smaller rank, with a computational cost that remains
Another approach to polynomialbased supervised learning relies upon the modeling assumption that the desired target function can be approximated well in the subspace of consisting of sparse polynomials. Recall that a sparse degree polynomial in variables is one in which only a few of the possible monomials have nonzero coefficients (there are exponentially in many such monomials). The early work of Sanger et al. attempts to learn the monomials relevant to the target in an online manner by augmenting the current polynomial with interaction features [18]
. The recent Apple algorithm of Agarwal et al. attempts to learn a sparse polynomial, also in an online manner, using a different heuristic that selects monomials from the current polynomial to be used in forming the next term of the monomial
[1]. The algorithms presented in [2] and [8] provide theoretical guarantees for fitting sparse polynomials, but their computational costs scale undesirably for largescale learning.Polynomial fitting has also been tackled using the neural network paradigm. In
[9], Livni et al. provide an algorithm for learning polynomial functions as deep neural networks which has the property that the training error is guaranteed to decrease at each iteration. This algorithm has the desirable properties that the network learns a linear combination of a set of polynomials that are constructed in a targetdependent way, and that the degree of the polynomial does not have to be specified in advance: instead, additional layers can be added to the network until the desired error threshold has been reached, with each layer increasing the degree of the predictor by one. Unfortunately, this algorithm requires careful tuning of the hyperparameters (number of layers, and the width of each layer). In the subsequent work
[10], Livni et al. provide an algorithm for fitting cubic polynomials in an iterative manner using rankone tensor approximations. It can be shown that, in fact, this algorithm greedily fits cubic polynomials in the Tensor Machine class we propose in this paper.Factorization Machines combine the expressivity of polynomial regression with the ability of factorization models to infer relationships from sparse training data [15]. Quadratic Factorization Machines, as introduced by Rendle, are models of the form
where a vector
is learned for each coordinate of the input These models have been applied with great success in recommendation systems and related applications with sparse input Quadratic FMs can be fit in time linear in the size of the data, the degree of the polynomial being fit, and the length of the vectors , but can only represent nonlinear interactions that can be written as symmetric homogeneous polynomials (plus a constant). In particular, FMs cannot represent polynomials that involve monomials containing variables raised to powers higher than one. For instance, cannot be represented as a Factorization Machine. Another drawback to Factorization Machines is that explicitly evaluating the sums involved for a degree FM requires operations. A computational manipulation allows quadratic FMs to be fit in linear time, and it has been claimed that similar manipulations for higherorder FMS ([15]), but no such generalizations have been documented. Perhaps for this reason, only secondorder FMs have been used in the literature.The random feature maps approach to polynomialbased learning [14] exploits the concentration of measure phenomenon to directly construct a random lowdimensional feature map that approximately decomposes Kar and Karnick provided the first random feature map approach to polynomial regression in [7]; their key observation is that the degree homogeneous polynomial kernel can be approximated with random features of the form
where the vectors are vectors of random signs; that is, random feature maps of the form
satisfy the condition necessary for the random feature map approach. Pagh and Pham provided a qualitatively different random feature map for polynomial kernel machines, based on the tensorization of fast hashing transforms [13]. These TensorSketch feature maps often outperform Kar–Karnick feature maps, but both methods require a large The CRAFTMaps approach of [6] combines either TensorSketch or Kar–Karnick random feature maps with fast Johnson–Lindenstrauss transforms to significantly reduce the number of features needed for good performance.
Of these methods, the Tensor Machine approach introduced in this paper is most similar to the Factorization Machine and Kar–Karnick random feature map approaches to polynomial learning.
3 Tensors and polynomials
To motivate Tensor Machines, which are introduced in the next section, we first review the connection between polynomials and tensors. For simplicity we consider only homogeneous degree polynomials: the monomials of such a polynomial all have degree We show that Kar–Karnick predictors and Factorizaton Machines correspond to specific tensor decompositions.
Recall that a tensor is a multidimensional array,
The number of indices into the array, , is also called the degree of the tensor. The inner product of two conformal tensors and is obtained by treating them as two vectors:
The Segre outer product of vectors is the tensor that satisfies
The degree selfouter product of a vector is denoted by
Given a decomposition for a tensor of the form
where is minimal, is called the rank of the tensor. A tensor is supersymmetric if there exists a decomposition of the form
The tensor comprises all the degree monomials in the variable so any degree homogeneous polynomial satisfies for some degree tensor likewise, any degree tensor determines a homogenous polynomial of degree This equivalence between homogeneous polynomials and tensors allows us the attack the problem of polynomial learning as one of learning a tensor. The Kar–Karnick random feature maps and Factorization Machine approaches can both be viewed through this lens.
A single Kar–Karnick random polynomial feature corresponds to a rankone tensor:
Accordingly, predictors generated by the random feature maps approach with the Kar–Karnick feature map satisfy
for some coefficient vector That is, Kar–Karnick predictors correspond to tensors in the span of randomly sampled degree rankone tensors.
Factorization Machines fit predictors of the form
where the factors are vectors in and
is the generalization of the inner product of two vectors. Let be the matrix with rows comprising the vectors and define the dimensional vector to be the th column of It follows that FMs can be expressed in the form
where the tensor is defined so that its nonzero entries cancel the entries of that correspond to coefficients of monomials like that contain a variable raised to a power larger than one. The vectors are learned during training (the tensor is implicitly defined by these vectors).
4 Tensor Machines
Thus, the Kar–Karnick random feature map approach searches for a lowrank tensor corresponding to the target polynomial and the Factorization Machine approach attempts to provide a supersymmetric lowrank tensor plus sparse tensor decomposition of the target polynomial. The Kar–Karnick approach has the drawback that it tries to find a good approximation in the span of a set of random rankone tensors, so many such basis elements must be chosen to ensure that an unknown target can be represented well. Factorization Machines circumvent this issue by directly learning the tensor decomposition, but impose strict constraints on the form of the fitted polynomial that are not appropriate for general learning applications.
As an alternative to Kar–Karnick random feature maps and factorization machines, we propose to instead fit Tensor Machines by directly optimizing over rankone tensors to directly form a lowrank approximation to the target tensor (i.e., the target polynomial). Since the targets are not, in general, homogenous polynomials, we learn a different set of rankone factors for each degree up to .
More precisely, Tensor Machines are functions of the form
obtained as minimizers to the following proxy to the kernel machine objective (1):
(2) 
By construction, Tensor Machines couple the expressiveness of the Kar–Karnick random feature maps model with the parsimony of Factorization Machines.
5 Generalization Error
In this section we argue that fitting Tensor Machines using empirical risk minimization makes efficient use of the training data. We show that the observed risk of Tensor Machines converges to the expected risk at the optimal rate, thus indicating that empirical risk minimization is an efficient method for finding locally optimal Tensor Machines (assuming the optimization method avoids saddle points). For convenience, we consider a variant of Tensor Machines where the norms of the vectors are constrained:
minimize  
subject to  
This formulation is not equivalent to the formulation given in (2), but by taking to be the norm of the largest constituent vector in the fitted Tensor Machine, any bound on the generalization error of this formulation applies to the generalization error of Tensor Machines obtained by minimizing (2).
Recall that the Rademacher complexity of a function class measures how well random noise can be approximated using a function from [4].
Definition 1 (Rademacher complexity)
Given a function class
and i.i.d. random variables
, the Rademacher complexity of , , is defined aswhere the are independent Rademacher random variables (random variables taking values with equal probability).
Wellknown results (e.g. [4][Theorems 7 and 8]) state that, with high probability over the training data, the observed classification and regression risks are within of the true classification and regression risks. In fact, the optimal rate of convergence for the observed risk to the expected risk is , which can only be achieved when the Rademacher complexity of the hypothesis class is [11].
Our main observation is that the Rademacher complexity of a Tensor Machines grows at the rate , so the empirically observed estimate of the Tensor Machine risk converges to the expected risk at the optimal rate.
Theorem 5.1
Let denote the class of degree, rank Tensor Machines on with constituent vectors constrained to lie in the ball of radius :
Let be distributed according to , and assume almost surely. The Rademacher complexity, with respect to , of this hypothesis class satisfies
where is a constant.
This result follows from reformulating the Rademacher complexity as the spectral norm of a Rademacher sum of datadependent tensors and then applying a recent bound from [12] on the spectral norm of random tensors. A full proof is provided in the appendix.
6 Empirical Evaluations
In this section, we evaluate the performance of Tensor Machines on several realworld regression and classification datasets and demonstrate their attractiveness in learning polynomial features relative to other stateoftheart algorithms.
6.1 Experimental setup
The nonconvex optimization problem (2) is central to fitting Tensor Machines. We consider two solvers for (2). The first uses the implementation of LBFGS provided in Mark Schmidt’s minFunc ^{1}^{1}1The 2012 release, retrieved from http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html MATLAB package for optimization [16]. Since LBFGS is a batch optimization algorithm, we also investigate the use of SFO, a stochastic quasiNewton solver designed to work with minibatches [20]. We use the reference implementation of SFO provided by SohlDickstein ^{2}^{2}2https://github.com/SohlDickstein/SumofFunctionsOptimizer/blob/master/README.md. We refer to the two algorithms used to fit Tensor Machines as TMBatch and TMSFO, respectively.
The choice of initialization strongly influences the performance of both TMBatch and TMSFO. Accordingly, we used initial points sampled from a
distribution. The variance
as well as the regularization parameter in (2) are set by tuning.We compared TMBatch and TMSFO against several recent algorithms for learning polynomials: CRAFTMaps [6], Basis Learner [9] and Apple [1]. To provide a baseline, we also used Kernel Ridge Regression (KRR) to learn polynomials; in cases where the training set contained more than 40,000 points, we randomly chose a subset of size 40,000 to perform KRR. For CRAFTMaps, the upprojection dimensionality is set to be times the downprojection dimensionality. We did not obtain reasonable predictions using the original implementation of Apple in vowpal wabbit, so we implemented the model in MATLAB and used this implementation to train and test the model, but reported the running time of the VW implementation (with the same choice of parameters). The remaining methods are implemented in pure MATLAB. The experiments were conducted on a machine with four 6core Intel Xeon 2 GHz processors and 128 GB RAM.
Because the performance of nonlinear learning algorithms can be very datadependent, we tested the methods on a collection of publicly available datasets to provide a broad characterization of the behavior of these algorithms. A summary of the basic properties of these datasets can be found in Table 1. For each dataset, the target degree was chosen to be the value that minimized the KRR test error. We proprocessed the data by first normalizing the input features to have similar magnitudes — viz., so that each column of the training matrix has unit Euclidean norm — then scaling each datapoint to have unit Euclidean norm.
For regression tasks, we used the squared loss, and for binary classification tasks we used the logistic loss. For regression tasks, the test error is reported as where and are the prediction and ground truth, respectively; inaccuracy is reported for classification tasks.
Name  Train/Test split  Type  

Indoor  19937/11114  520  regression  3 
Year  463715/51630  90  regression  2 
Census  18186/2273  119  regression  2 
Slice  42291/10626  384  regression  5 
Buzz  466600/116650  77  regression  3 
Gisette  6000/1000  5000  binary  3 
Adult  32561/16281  122  binary  3 
Forest  522910/58102  54  binary  4 
eBay search  500000/100000  478  binary  3 
Corrna  59535/157413  8  binary  4 
Each algorithm is governed by several interacting parameters, but for each algorithm we identify one major parameter: the number of iterations for TMBatch, the number of epochs for TMSFO, the number of features for CRAFTMaps, the layer width for Basis Learner, and the expansion parameter for Apple. We choose optimal values for the nonmajor parameters through a 10fold crossvalidation on the training set. We varied the major parameter over a wide range and recorded the test error and running times for the different values until the performance of the algorithms saturated. By saturation, we mean that the improvement in the test error is less than
where denotes the test error of KRR. For each combination of parameters, the average test errors and running times over independent trials are reported.6.2 Overall performance
To compare their performance, we applied TMBatch, TMSFO, CRAFTMaps, Basis Learner, and Apple to the datasets listed in Table 1. We did not evaluate Factorization Machines because efficient algorithms for fitting FM models are only available in the literature for the case . It is unclear whether FMs with higher values of can be fit efficiently.
We present the computation/prediction accuracy tradeoffs of the considered algorithms in Figure 1. The data plotted are the test errors and running times of the algorithms relative to those of KRR:
and
The median values of relerr and reltime as well as the corresponding first and third quartiles are shown.
The detailed results can be found in Table 2. The performance of TMBatch is consistent across the datasets in the sense that it provides reliable predictions in a fairly short amount of time. TMSFO converges to lower quality solutions than TMBatch, but has lower median and variance in runtime. CRAFTMaps and Basis Learner take significantly more time to yield solutions almost as accurate as the two Tensor Machine algorithms. As one might expect, due to its greedy nature, Apple is the fastest of the algorithms (the time of the vowpal wabbit implementation is reported), but of all the algorithms, TMBatch and TMSFO deliver the lowest worstcase relative errors.
Name  KRR  TMBatch  TMSFO  CRAFTMaps  Basis Learner  Apple 

Indoor  0.00961/50  0.00802/3.2(4)  0.0182/3.9(4)  0.00559/2.7(300)  0.00241/97  0.00712/6.5 
Year  0.00485/310  0.00565/85(5)  0.00484/74(5)  0.00494/39(200)  0.00496/67  0.00489/60 
Census  0.0667/46  0.0774/12(5)  0.0802/45(5)  0.0767/7.3(1000)  0.0788/8.1  0.0950/7.3 
Slice  0.0118/441  0.0229/80(3)  0.0213/73(3)  0.0465/707(7000)  0.0501/1034  0.0788/46 
Buzz  0.407/624  0.407/568(2)  0.409/440(2)  0.408/2105(1200)  0.373/2472  0.496/325 
Gisette  0.027/6.6  0.0266/16(4)  0.0254/29(4)  0.0300/63(11500)  0.0270/62  0.0240/32 
Adult  0.150/182  0.149/0.9(4)  0.151/2.0(4)  0.154/17(700)  0.149/ 30  0.150/0.8 
Forest  0.148/361  0.184/657(5)  0.178/569(5)  0.196/1023(750)  0.180/3494  0.219/14 
eBay search  0.197/446  0.197/612(5)  0.192/349(5)  0.281/1054(800)  0.281/1642  0.269/122 
Corrna  0.0446/423  0.0453/8.4(3)  0.0489/7.2(3)  0.0462/19(200)  0.0493/1.7  0.0489/4.2 
Both TM solvers required less than rankone TM features for each individual degree fit for all datasets. Since only one dataset benefited from a quintic fit, the TM models required at most parameters to fit each dataset; this should be compared to the CRAFTMaps models, which required at least parameters to fit each dataset, generally produced fits with higher error than the TM solvers, and required longer solution times.
6.3 Effect of rank parameter
To investigate the effect of the rank parameter on the test error of Tensor Machines, we evaluated both TMBatch and TMSFO on the Census and Slice datasets. On Census, since the target degree is , we evaluate Factorization Machines (FM) as well. The results are shown in Figure 2.
As expected, increasing leads to smaller test errors. On Slice, where the target degree , the gap between the performance of TMs and KRR is relatively large (almost a factor of 2), while on Census, where , the gap is slighter.Interestingly, on Census, increasing rank does not lead to a higher accuracy after . We also observe on Census that the error of TMs is lower than that of FMs; this behaviour was also observed on the other datasets when
6.4 Timeaccuracy tradeoffs
To assess the timeaccuracy tradeoffs of the various algorithms, in Figure 3 we plot the test error vs the training time for the Slice and Forest datasets. The training time is determined by varying the settings of each algorithm’s major parameter.
TMBatch and TMSFO compare favorably against the other methods: they either produce a much lower error than the other methods (on Slice) or reach a considerably low error much faster (on Forest). Similar patterns were observed on most of the datasets we considered.
6.5 Scalability of TMSFO on eBay search dataset
Since SFO needs to access only a small minibatch of data points per update, it is suitable for fitting TMs on datasets which cannot fit in memory all at once. Here we explore the scalability of TMSFO on a private eBay search dataset of 1.2 million data points with dimension . Each data point
comprises the feature vectors associated with a pair of items that were returned as the results of a search on the eBay website and were subsequently visited. The goal is to fit a polynomial for the task of classifying which item was clicked on first:
or . In our experiments, we randomly selected 100,000 data points to be the test set; training sets of variable size were selected from the remainder.For comparison, we also evaluate CRAFTMaps on the same task. In TMSFO, we fix the rank parameter to be and number of epochs to be . In CRAFTMaps, we fix the number of random features to be . These parameters are the optimal settings from the experiments used to generate Table 2. We report both the classification error and training time as the size of training set grows.
The results are shown in Figure 4. We observe that the running time of TMSFO grows almost linearly with the size of the training set, while that of CRAFTMaps grows superlinearly. Also, as expected, because CRAFTMaps choose the hypothesis space independently of the target, increasing the size of the training set without also increasing the size of the model does not give much gain in performance past 100,000 training points. We see that the targetadaptive nature of TMs endows TMSFO with two advantages. First, for a fixed amount of training data the errors of TMs are significantly lower than those of CRAFTMaps. Second, because the hypothesis space evolves as a function of the target and training data, the training error of TMSFO exhibits noticeable decay up to a training set size of about 500,000 points, long past the point where the CRAFTMaps test error has saturated.
References
 [1] A. Agarwal, A. Beygelzimer, D. Hsu, J. Langford, and M. Telgarsky. Scalable Nonlinear Learning with Adaptive Polynomial Expansions. In Advances in Neural Information Processing Systems 27, 2014.
 [2] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang. Learning sparse polynomial functions. In Proceedings of the 25th Annual ACMSIAM Symposium on Discrete Algorithms, 2014.
 [3] F. R. Bach and M. I. Jordan. Predictive lowrank decomposition for kernel methods. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
 [4] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research, 2002.
 [5] S. Fine and K. Scheinberg. Efficient SVM training using lowrank kernel representations. Journal of Machine Learning Research, 2002.
 [6] R. Hamid, Y. Xiao, A. Gittens, and D. DeCoste. Compact Random Feature Maps. In Proceedings of The 31st International Conference on Machine Learning, 2013.

[7]
P. Kar and H. Karnick.
Random Feature Maps for Dot Product Kernels.
In
Proceedings of the 15th International Conference on Artificial Intelligence and Statistics
, 2012.  [8] M. Kocaoglu, K. Shanmugam, A. G. Dimakis, and A. Klivans. Sparse Polynomial Learning and Graph Sketching. In Advances in Neural Information Processing Systems 27, 2014.
 [9] R. Livni, S. ShalevShwartz, and O. Shamir. An algorithm for learning polynomials. Preprint arXiv:1304.7045, 2014.
 [10] R. Livni, S. ShalevShwartz, and O. Shamir. On the Computational Efficiency of Training Neural Networks. In Advances in Neural Information Processing Systems 27, 2014.
 [11] S. Mendelson. Lower bounds for the empirical minimization algorithm. IEEE Transactions on Information Theory, 2008.
 [12] N. H. Nguyen, P. Drineas, and T. D. Tran. Tensor sparsification via a bound on the spectral norm of random tensors. Information and Inference, abs/1005.4732, 2013.
 [13] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013.
 [14] A. Rahimi and B. Recht. Random Features for LargeScale Kernel Machines. In Advances in Neural Information Processing Systems 20, 2007.
 [15] S. Rendle. Factorization machines. In IEEE 10th International Conference on Data Mining, 2010.
 [16] B. H. Richard, N. Jorge, and R. B. Schnabel. Representations of quasiNewton matrices and their use in limited memory methods. Mathematical Programming, 63, 1994.
 [17] W. Rudin. Principles of Mathematical Analysis. McGraw–Hill, 3 edition, 1976.
 [18] T. D. Sanger, R. S. Sutton, and C. J. Matteus. Iterative Construction of Sparse Polynomial Approximation. In Advances in Neural Information Processing Systems, 1992.
 [19] A. J. Smola and B. Schölkopf. Sparse Greedy Matrix Approximation for Machine Learning. In Proceedings of the 17th International Conference on Machine Learning, 2000.
 [20] J. SohlDickstein, B. Poole, and S. Ganguli. Fast largescale optimization by unifying stochastic gradient and quasiNewton methods. In Proceedings of the 31st International Conference on Machine Learning, 2014.
 [21] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Proceedings of the 14th Annual Conference on Neural Information Processing Systems, 2001.
 [22] J. Yang, V. Sindhwani, H. Avron, and M. Mahoney. QuasiMonte Carlo Feature Maps for ShiftInvariant Kernels. In Proceedings of The 31st International Conference on Machine Learning, 2014.

[23]
J. Yang, V. Sindhwani, Q. Fan, H. Avron, and M. Mahoney.
Random laplace feature maps for semigroup kernels on histograms.
In
IEEE Conference on Computer Vision and Pattern Recognition
, 2014.  [24] Z. Yang, A. J. Smola, L. Song, and A. G. Wilson. À la carte — learning fast kernels. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.
 [25] F. X. Yu, S. Kumar, H. Rowley, and S.F. Chang. Compact Nonlinear Maps and Circulant Extensions. Preprint arXiv:1503.03893, 2015.
Appendix 0.A Proof of Theorem 5.1
Theorem 5.1 is a corollary of the following bound on the Rademacher complexity of a rankone Tensor Machine.
Theorem 0.A.2
Let denote the class of degree rankone Tensor Machines on with constituent vectors constrained to lie in the ball of radius :
Let be distributed according to , and assume almost surely. The Rademacher complexity with respect to of this hypothesis class satisfies
where is a constant.
Proof (Proof of Theorem 5.1)
Every function in can be written as the sum of a constant, a linear function of the form with , and functions from each of The Rademacher complexity of is zero, and a straightforward calculation establishes that the Rademacher complexity of linear functions of the specified form is no larger than It follows from a simple structural result on Rademacher complexities [4][Theorem 12] that
Applying Theorem 0.A.2, we have that
Proof (Proof of Theorem 0.a.2)
Given the data points , let
so that From the definition of we have that
where .
Define . Recall the definition of the spectral norm of an order tensor
It follows from this definition that
(3) 
For simplicity, we drop the subscript from in the following. Note that , since Rademacher variables are meanzero. We now apply Theorem 2 of [12], which bounds the expected spectral norm of the difference between a tensor Rademacher sum and its expectation with the sum of the expected maximum Euclidean lengths of onedimensional slices through the tensor:
We replace the innermost sum with a maximum to obtain an estimate involving the expected size of the largest entry in
(4) 
Next we bound the maximum entry of with the Frobenius norm of and apply Jensen’s inequality to obtain
Recalling the definition of we have that
It is readily established that
for any rankone tensor, so it follows that
(5) 