1 Introduction
Interactions between features play an important role in many classification and regression tasks. One of the simplest approaches to leveraging such interactions consists in explicitly augmenting feature vectors with products of features (monomials), as in polynomial regression. Although fast linear model solvers can be used (Chang et al., 2010; Sonnenburg & Franc, 2010), an obvious drawback of this kind of approach is that the number of parameters to estimate scales as $O(d^t)$, where $d$ is the number of features and $t$ is the order of interactions considered. As a result, it is usually limited to second- or third-order interactions. Another popular approach consists in using a polynomial kernel so as to implicitly map the data via the kernel trick. The main advantage of this approach is that the number of parameters to estimate in the model is independent of $d$ and $t$. However, the cost of storing and evaluating the model is now proportional to the number of training instances. This is sometimes called the curse of kernelization (Wang et al., 2010). Common ways to address this issue include the Nyström method (Williams & Seeger, 2001), random features (Kar & Karnick, 2012) and sketching (Pham & Pagh, 2013; Avron et al., 2014).
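To make the parameter-growth argument concrete, the number of degree-$t$ monomials over $d$ features (with replacement) is $\binom{d+t-1}{t} = O(d^t)$ for fixed $t$. A minimal sketch (function name is ours, not from the paper):

```python
from math import comb

def num_monomials(d, t):
    # Number of degree-t monomials over d features, with replacement:
    # C(d + t - 1, t), which grows as O(d^t) for fixed t.
    return comb(d + t - 1, t)

print(num_monomials(1000, 2))  # 500500 parameters for pairwise terms
print(num_monomials(1000, 3))  # 167167000 for third-order terms
```

Already at $d = 1000$, explicit third-order expansion requires over $10^8$ parameters, which is why such approaches are limited to low orders.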
In this paper, in order to leverage feature interactions in possibly very high-dimensional data, we consider models which predict the output $\hat{y}(x) \in \mathbb{R}$ associated with an input vector $x \in \mathbb{R}^d$ by
$$\hat{y}(x) := \sum_{s=1}^k \lambda_s \, K(p_s, x), \qquad (1)$$
where $\lambda \in \mathbb{R}^k$, $P = [p_1, \dots, p_k] \in \mathbb{R}^{d \times k}$, $K$ is a kernel and $k \in \mathbb{N}$ is a hyper-parameter. More specifically, we focus on two specific choices of $K$ which allow us to use feature interactions: the homogeneous polynomial and the ANOVA kernels. Our contributions are as follows. We show (Section 3) that choosing one kernel or the other allows us to recover polynomial networks (PNs) (Livni et al., 2014) and, surprisingly, factorization machines (FMs) (Rendle, 2010, 2012). Based on this new view, we show important properties of PNs and FMs. Notably, we show for the first time that the objective function of arbitrary-order FMs is multiconvex (Section 4). Unfortunately, the objective function of PNs is not multiconvex. To remedy this problem, we propose a lifted approach, based on casting parameter estimation as a low-rank tensor estimation problem (Section 5.1). Combined with a symmetrization trick, this approach leads to a multiconvex problem, for both PNs and FMs (Section 5.2). We demonstrate our approach on regression and recommender system tasks.
Notation. We denote vectors, matrices and tensors using lower-case, upper-case and calligraphic bold letters, e.g., $p$, $P$ and $\mathcal{P}$. We denote the set of real $m$-th order cubical tensors by $\mathbb{R}^{d \times \cdots \times d}$ and the set of symmetric real tensors by $\mathbb{S}^{d \times \cdots \times d}$. We use $\langle \cdot, \cdot \rangle$ to denote the vector, matrix and tensor inner product. Given $p \in \mathbb{R}^d$, we define a symmetric rank-one tensor by $p^{\otimes m} := p \otimes \cdots \otimes p$, where $(p^{\otimes m})_{j_1, \dots, j_m} = p_{j_1} \cdots p_{j_m}$. We use $[d]$ to denote the set $\{1, \dots, d\}$.
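For concreteness, model (1) can be sketched in a few lines of code; the kernel shown is the homogeneous polynomial kernel defined in Section 3, and all names are ours:

```python
import numpy as np

def homo_poly_kernel(p, x, m=2):
    # Homogeneous polynomial kernel: H^m(p, x) = <p, x>^m.
    return np.dot(p, x) ** m

def predict(x, lams, P, kernel=homo_poly_kernel):
    # Model (1): yhat(x) = sum_s lambda_s * K(p_s, x),
    # where the basis vectors p_s are the columns of P.
    return sum(lam * kernel(P[:, s], x) for s, lam in enumerate(lams))

x = np.array([1.0, 2.0])
lams = np.array([2.0])
P = np.array([[1.0], [1.0]])   # single basis p_1 = (1, 1)
print(predict(x, lams, P))     # 2 * (1*1 + 1*2)^2 = 18.0
```

The number of parameters, $k(d+1)$, is independent of the interaction order, in contrast to explicit polynomial expansion.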
2 Related work
2.1 Polynomial networks
Polynomial networks (PNs) (Livni et al., 2014) of degree $m = 2$ predict the output associated with $x \in \mathbb{R}^d$ by
$$\hat{y}_{\mathrm{PN}}(x) := \langle \lambda, \sigma(P^{\mathsf{T}} x) \rangle = \sum_{s=1}^k \lambda_s \, \sigma(\langle p_s, x \rangle), \qquad (2)$$
where $\lambda \in \mathbb{R}^k$, $P \in \mathbb{R}^{d \times k}$, and $\sigma(u) := u^2$ is evaluated element-wise. Intuitively, the right-hand term can be interpreted as a feedforward neural network with one hidden layer of $k$ units and with activation function $\sigma$. Livni et al. (2014) also extend (2) to the $m = 3$ case and show theoretically that PNs can approximate feedforward networks with sigmoidal activation. A similar model was independently shown to perform well on dependency parsing (Chen & Manning, 2014). Unfortunately, the objective function of PNs is nonconvex. In Section 5, we derive a multiconvex objective based on low-rank symmetric tensor estimation, suitable for training arbitrary-order PNs.
2.2 Factorization machines
One of the simplest ways to leverage feature interactions is polynomial regression (PR). For example, for second-order interactions, in this approach we compute predictions by
$$\hat{y}_{\mathrm{PR}}(x) := \langle w, x \rangle + \sum_{j' > j} W_{j, j'} \, x_j x_{j'}, \qquad (3)$$
where $w \in \mathbb{R}^d$ and $W \in \mathbb{R}^{d \times d}$. Obviously, model size in PR does not scale well w.r.t. $d$. The main idea of (second-order) factorization machines (FMs) (Rendle, 2010, 2012) is to replace $W$ with a factorized matrix $PP^{\mathsf{T}}$:
$$\hat{y}_{\mathrm{FM}}(x) := \langle w, x \rangle + \sum_{j' > j} \langle p_j, p_{j'} \rangle \, x_j x_{j'}, \qquad (4)$$
where $P = [p_1, \dots, p_d]^{\mathsf{T}} \in \mathbb{R}^{d \times k}$. FMs have become increasingly popular for efficiently modeling feature interactions in high-dimensional data; see (Rendle, 2012) and references therein. In Section 4, we show for the first time that the objective function of arbitrary-order FMs is multiconvex.
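The reason the factorized form (4) scales to high dimensions is the well-known identity $\sum_{j' > j} \langle p_j, p_{j'} \rangle x_j x_{j'} = \frac{1}{2} \sum_f \big[ (P^{\mathsf{T}} x)_f^2 - ((P \circ P)^{\mathsf{T}} (x \circ x))_f \big]$, which brings the evaluation cost from $O(d^2 k)$ down to $O(dk)$ (Rendle, 2012). A sketch with names of our choosing, checked against direct evaluation of (4):

```python
import itertools
import numpy as np

def fm_predict(w, P, x):
    # Second-order FM, eq. (4), computed in O(dk) via the identity
    # sum_{j<j'} <p_j, p_j'> x_j x_j'
    #   = 0.5 * sum_f [ (P^T x)_f^2 - ((P*P)^T (x*x))_f ].
    Px = P.T @ x
    pairwise = 0.5 * (float(np.sum(Px ** 2)) - float(np.sum((P ** 2).T @ (x ** 2))))
    return float(w @ x) + pairwise

def fm_predict_naive(w, P, x):
    # Direct evaluation of eq. (4), for checking only: O(d^2 k).
    pairs = sum(P[j] @ P[jp] * x[j] * x[jp]
                for j, jp in itertools.combinations(range(len(x)), 2))
    return float(w @ x) + float(pairs)

rng = np.random.default_rng(0)
w, P, x = rng.normal(size=5), rng.normal(size=(5, 3)), rng.normal(size=5)
assert np.isclose(fm_predict(w, P, x), fm_predict_naive(w, P, x))
```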
3 Polynomial and ANOVA kernels
In this section, we show that the prediction functions used by polynomial networks and factorization machines can be written using (1) for a specific choice of kernel.
The polynomial kernel is a popular kernel for using combinations of features. The kernel is defined as
$$\mathcal{P}^m(p, x) := (\gamma + \langle p, x \rangle)^m, \qquad (5)$$
where $m \in \mathbb{N}$ is the degree and $\gamma > 0$ is a hyper-parameter. We define the homogeneous polynomial kernel by
$$\mathcal{H}^m(p, x) := \langle p, x \rangle^m. \qquad (6)$$
Let $p, x \in \mathbb{R}^d$. Then,
$$\mathcal{H}^m(p, x) = \sum_{j_1=1}^d \cdots \sum_{j_m=1}^d (p_{j_1} \cdots p_{j_m})(x_{j_1} \cdots x_{j_m}). \qquad (7)$$
We thus see that $\mathcal{H}^m$ uses all monomials of degree $m$ (i.e., all combinations of features with replacement).
A much lesser-known kernel is the ANOVA kernel (Stitson et al., 1997; Vapnik, 1998). Following (Shawe-Taylor & Cristianini, 2004, Section 9.2), the ANOVA kernel of degree $m$, where $2 \le m \le d$, can be defined as
$$\mathcal{A}^m(p, x) := \sum_{j_1 < \cdots < j_m} (p_{j_1} \cdots p_{j_m})(x_{j_1} \cdots x_{j_m}). \qquad (8)$$
As a result, $\mathcal{A}^m$ uses only monomials composed of distinct features (i.e., feature combinations without replacement). For later convenience, we also define $\mathcal{A}^0(p, x) := 1$ and $\mathcal{A}^1(p, x) := \langle p, x \rangle$.
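Definition (8) can be transcribed directly by summing over strictly increasing index tuples. This brute-force version (our own naming) is expensive but useful as a reference implementation:

```python
import itertools
import numpy as np

def anova_naive(p, x, m):
    # ANOVA kernel, eq. (8): sum over index tuples j_1 < ... < j_m,
    # i.e. feature combinations *without* replacement.
    return float(sum(np.prod([p[j] * x[j] for j in idx])
                     for idx in itertools.combinations(range(len(p)), m)))

p = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 1.0, 1.0])
# Pairs (1,2), (1,3), (2,3): 1*2 + 1*3 + 2*3 = 11
print(anova_naive(p, x, 2))   # 11.0
```

Note that the empty product makes `anova_naive(p, x, 0)` return 1.0, consistent with the convention $\mathcal{A}^0(p, x) := 1$.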
With $\mathcal{H}^m$ and $\mathcal{A}^m$ defined, we are now in a position to state the following lemma: PNs (2) and FMs (4) can be written in the form (1), with
$$\hat{y}_{\mathrm{PN}}(x) = \sum_{s=1}^k \lambda_s \, \mathcal{H}^2(p_s, x) \qquad (9)$$
$$\hat{y}_{\mathrm{FM}}(x) = \langle w, x \rangle + \sum_{s=1}^k \mathcal{A}^2(\bar{p}_s, x), \qquad (10)$$
where $\bar{p}_s$ denotes the $s$-th column of $P$. That is, PNs use the homogeneous polynomial kernel, and FMs use the ANOVA kernel with $\lambda = \mathbf{1}$. The relation easily extends to higher orders. This new view allows us to state results that will be very useful in the next sections. The first one is that $\mathcal{H}^m$ and $\mathcal{A}^m$ are homogeneous functions, i.e., they satisfy
$$\mathcal{H}^m(\lambda p, x) = \lambda^m \, \mathcal{H}^m(p, x) \quad \text{and} \quad \mathcal{A}^m(\lambda p, x) = \lambda^m \, \mathcal{A}^m(p, x), \quad \forall \lambda \in \mathbb{R}. \qquad (11)$$
Another key property of $\mathcal{A}^m$ is multilinearity.¹
¹A function is called multilinear (resp. multiconvex) if it is linear (resp. convex) w.r.t. $p_1, \dots, p_d$ separately.
Lemma 2
Multilinearity of $\mathcal{A}^m$ w.r.t. $p_j$
Let $p, x \in \mathbb{R}^d$ and $j \in [d]$. Then,
$$\mathcal{A}^m(p, x) = \mathcal{A}^m(p_{\neg j}, x_{\neg j}) + p_j x_j \, \mathcal{A}^{m-1}(p_{\neg j}, x_{\neg j}), \qquad (12)$$
where $p_{\neg j}$ denotes the $(d-1)$-dimensional vector with $p_j$ removed, and similarly for $x_{\neg j}$.
That is, everything else kept fixed, $\mathcal{A}^m(p, x)$ is an affine function of $p_j$, $\forall j \in [d]$. The proof is given in Appendix B.1.
Assuming $p$ is dense and $x$ is sparse, the cost of naively computing $\mathcal{A}^m(p, x)$ by (8) is $O(n_z(x)^m)$, where $n_z(x)$ is the number of nonzero features in $x$. To address this issue, we will make use of the following lemma for computing $\mathcal{A}^m(p, x)$ in only $O(m \, n_z(x) + m^2)$ time.
Lemma 3
Efficient computation of ANOVA kernel
$$\mathcal{A}^m(p, x) = \frac{1}{m} \sum_{t=1}^{m} (-1)^{t+1} \, \mathcal{A}^{m-t}(p, x) \, \mathcal{D}^t(p, x), \qquad (13)$$
where we defined $\mathcal{D}^t(p, x) := \sum_{j=1}^d (p_j x_j)^t$ and recall that $\mathcal{A}^0(p, x) := 1$.
See Appendix B.2 for a derivation.
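Lemma 3 is a Newton-identity-style recursion: writing $a^t := \mathcal{A}^t(p, x)$, each value is obtained from lower-degree kernel values and the power sums $\mathcal{D}^s(p, x)$. A sketch (our naming), checked against the brute-force definition (8):

```python
import itertools
import numpy as np

def anova_recursive(p, x, m):
    # Lemma 3: a^t = (1/t) * sum_{s=1}^t (-1)^(s+1) * a^(t-s) * D^s,
    # with a^0 := 1 and D^s(p, x) := sum_j (p_j x_j)^s.
    px = np.asarray(p) * np.asarray(x)
    D = [float(np.sum(px ** s)) for s in range(m + 1)]  # D[0] is unused
    a = [1.0] + [0.0] * m
    for t in range(1, m + 1):
        a[t] = sum((-1) ** (s + 1) * a[t - s] * D[s]
                   for s in range(1, t + 1)) / t
    return a[m]

def anova_naive(p, x, m):
    # Brute-force eq. (8), for checking only.
    return float(sum(np.prod([p[j] * x[j] for j in idx])
                     for idx in itertools.combinations(range(len(p)), m)))

p, x = [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]
print(anova_recursive(p, x, 2))   # 35.0  (sum of pairwise products)
print(anova_recursive(p, x, 3))   # 50.0  (sum of triple products)
assert np.isclose(anova_recursive(p, x, 3), anova_naive(p, x, 3))
```

Only the $m$ power sums touch the data, hence the near-linear dependence on $n_z(x)$ when $x$ is sparse.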
4 Direct approach
Let us denote the training set by $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$ and $y \in \mathbb{R}^n$. The most natural approach to learn models of the form (1) is to directly choose $\lambda$ and $P$ so as to minimize some error function
$$D(\lambda, P) := \sum_{i=1}^n \ell\big(y_i, \hat{y}(x_i)\big), \qquad (14)$$
where $\ell$ is a convex loss function. Note that (14) is a convex objective w.r.t. $\lambda$, regardless of $K$. However, it is in general nonconvex w.r.t. $P$. Fortunately, when $K = \mathcal{A}^m$, we can show that (14) is multiconvex in $\lambda$ and the rows of $P$ (Theorem 1). The proof is given in Appendix B.3. As a corollary, the objective function of FMs of arbitrary order is thus multiconvex. Theorem 1 suggests that we can minimize (14) efficiently when $K = \mathcal{A}^m$ by solving a succession of convex problems w.r.t. $\lambda$ and the rows of $P$. We next show that when $m$ is odd, we can just fix $\lambda = \mathbf{1}$ without loss of generality.
Lemma 4
When is it useful to fit $\lambda$?
Let $\lambda \in \mathbb{R}$, $p \in \mathbb{R}^d$ and $\tilde{p} := \lambda^{\frac{1}{m}} p$ (real-valued whenever $\lambda \ge 0$ or $m$ is odd). Then
$$\lambda \, \mathcal{H}^m(p, x) = \mathcal{H}^m(\tilde{p}, x) \qquad (15)$$
$$\lambda \, \mathcal{A}^m(p, x) = \mathcal{A}^m(\tilde{p}, x). \qquad (16)$$
The result stems from the fact that $\mathcal{H}^m$ and $\mathcal{A}^m$ are homogeneous functions. If we define $\tilde{p} := \lambda^{\frac{1}{m}} p$, then we obtain $\mathcal{H}^m(\tilde{p}, x) = \lambda \, \mathcal{H}^m(p, x)$ if $m$ is odd, and similarly for $\mathcal{A}^m$. That is, $\lambda$ can be absorbed into $p$ without loss of generality. When $m$ is even, a negative $\lambda$ cannot be absorbed unless we allow complex numbers. Because FMs fix $\lambda = \mathbf{1}$, Lemma 4 shows that the class of functions that FMs can represent is possibly smaller than that of our framework.
5 Lifted approach
5.1 Conversion to lowrank tensor estimation problem
If we set $K = \mathcal{H}^m$ in (14), the resulting optimization problem is neither convex nor multiconvex w.r.t. $P$. In (Blondel et al., 2015), for $m = 2$, it was proposed to cast parameter estimation as a low-rank symmetric matrix estimation problem. A similar idea was used in the context of phase retrieval in (Candès et al., 2013). Inspired by these works, we propose to convert the problem of estimating $\lambda$ and $P$ to that of estimating a low-rank symmetric tensor $\mathcal{W}$. Combined with a symmetrization trick, this approach leads to an objective that is multiconvex, for both $K = \mathcal{H}^m$ and $K = \mathcal{A}^m$ (Section 5.2).
We begin by rewriting the kernel definitions using rank-one tensors. For $\mathcal{H}^m$, it is easy to see that
$$\mathcal{H}^m(p, x) = \langle p^{\otimes m}, x^{\otimes m} \rangle. \qquad (17)$$
For $\mathcal{A}^m$, we need to ignore irrelevant monomials. For convenience, we introduce the following notation:
$$\langle \mathcal{S}, \mathcal{X} \rangle_{>} := \sum_{j_1 < \cdots < j_m} s_{j_1, \dots, j_m} \, x_{j_1, \dots, j_m}. \qquad (18)$$
We can now concisely rewrite the ANOVA kernel as
$$\mathcal{A}^m(p, x) = \langle p^{\otimes m}, x^{\otimes m} \rangle_{>}. \qquad (19)$$
Our key insight is described in the following lemma.
Lemma 5
Link between tensors and kernel expansions
Let $\mathcal{W}$ have a symmetric outer product decomposition (Comon et al., 2008)
$$\mathcal{W} = \sum_{s=1}^k \lambda_s \, p_s^{\otimes m}. \qquad (20)$$
Let $\hat{y}_{\mathcal{H}}(x) := \langle \mathcal{W}, x^{\otimes m} \rangle$ and $\hat{y}_{\mathcal{A}}(x) := \langle \mathcal{W}, x^{\otimes m} \rangle_{>}$. Then,
$$\hat{y}_{\mathcal{H}}(x) = \sum_{s=1}^k \lambda_s \, \mathcal{H}^m(p_s, x) \qquad (21)$$
$$\hat{y}_{\mathcal{A}}(x) = \sum_{s=1}^k \lambda_s \, \mathcal{A}^m(p_s, x) \qquad (22)$$
The result follows immediately from (17) and (19), and from the linearity of $\langle \cdot, \cdot \rangle$ and $\langle \cdot, \cdot \rangle_{>}$. Given $\mathcal{W}$, let us define the following objective functions
$$L_{\mathcal{H}}(\mathcal{W}) := \sum_{i=1}^n \ell\big(y_i, \langle \mathcal{W}, x_i^{\otimes m} \rangle\big) \qquad (23)$$
$$L_{\mathcal{A}}(\mathcal{W}) := \sum_{i=1}^n \ell\big(y_i, \langle \mathcal{W}, x_i^{\otimes m} \rangle_{>}\big). \qquad (24)$$
If $\mathcal{W}$ is decomposed as in (20), then from Lemma 5, we obtain $L_{\mathcal{H}}(\mathcal{W}) = D(\lambda, P)$ for $K = \mathcal{H}^m$, and $L_{\mathcal{A}}(\mathcal{W}) = D(\lambda, P)$ for $K = \mathcal{A}^m$. This suggests that we can convert the problem of learning $\lambda$ and $P$ to that of learning a symmetric tensor $\mathcal{W}$ of (symmetric) rank $k$. Thus, the problem of finding a small number of bases and their associated weights is converted to that of learning a low-rank symmetric tensor. Following (Candès et al., 2013), we call this approach lifted. Intuitively, we can think of $\mathcal{W}$ as a tensor that contains the weights of the degree-$m$ monomials used for prediction. For instance, when $m = 3$, $\mathcal{W}_{i, j, k}$ is the weight corresponding to the monomial $x_i x_j x_k$.
5.2 Multiconvex formulation
Estimating a low-rank symmetric tensor is in itself a difficult nonconvex problem for arbitrary integer $m$. Nevertheless, based on a symmetrization trick, we can convert the problem to a multiconvex one, which we can easily minimize by alternating minimization. We first present our approach for the case $m = 2$ to give intuitions, then explain how to extend it to $m \ge 3$.
Intuition with the second-order case. For the case $m = 2$, we need to estimate a low-rank symmetric matrix $W$. Naively parameterizing $W = P \operatorname{diag}(\lambda) P^{\mathsf{T}}$ and solving for $\lambda$ and $P$ does not lead to a multiconvex formulation for the case $K = \mathcal{H}^2$. This is due to the fact that the parametrization is quadratic in $P$. Our key idea is to parametrize $W = \mathrm{symm}(UV^{\mathsf{T}})$, where $U, V \in \mathbb{R}^{d \times r}$ and $\mathrm{symm}(M) := \frac{1}{2}(M + M^{\mathsf{T}})$ is the symmetrization of $M$. We then minimize w.r.t. $U$ and $V$.
The main advantage is that both $\langle \mathrm{symm}(UV^{\mathsf{T}}), xx^{\mathsf{T}} \rangle$ and $\langle \mathrm{symm}(UV^{\mathsf{T}}), xx^{\mathsf{T}} \rangle_{>}$ are bilinear in $U$ and $V$. This implies that the objective is biconvex in $U$ and $V$ and can therefore be efficiently minimized by alternating minimization. Once we have obtained $W^\star = \mathrm{symm}(UV^{\mathsf{T}})$, we can optionally compute its (reduced) eigendecomposition $W^\star = \sum_{s=1}^k \lambda_s \, p_s p_s^{\mathsf{T}}$, then apply (21) or (22) to obtain the model in kernel expansion form.
Extension to the higher-order case. For $m \ge 3$, we now estimate a low-rank symmetric tensor $\mathcal{W} = \mathrm{symm}(\mathcal{T})$, where $\mathcal{T}$ is low-rank and $\mathrm{symm}(\mathcal{T})$ is the symmetrization of $\mathcal{T}$ (cf. Appendix A.2). We decompose $\mathcal{T}$ using $m$ matrices of size $d \times r$. Let us call these matrices $U^{(1)}, \dots, U^{(m)}$ and their columns $u_s^{(t)}$. Then the decomposition of $\mathcal{T}$ can be expressed as a sum of rank-one tensors
$$\mathcal{T} = \sum_{s=1}^{r} u_s^{(1)} \otimes \cdots \otimes u_s^{(m)}. \qquad (25)$$
Due to the multilinearity of (25) w.r.t. $U^{(1)}, \dots, U^{(m)}$, the objective function is multiconvex in $U^{(1)}, \dots, U^{(m)}$.
Computing predictions efficiently. When $\mathcal{W} = \mathrm{symm}(\mathcal{T})$, predictions are computed by $\hat{y}(x) = \langle \mathrm{symm}(\mathcal{T}), x^{\otimes m} \rangle$ (or $\langle \mathrm{symm}(\mathcal{T}), x^{\otimes m} \rangle_{>}$). To compute them efficiently, we use the following lemma.
Lemma 6
Symmetrization does not affect inner product
$$\langle \mathrm{symm}(\mathcal{T}), x^{\otimes m} \rangle = \langle \mathcal{T}, x^{\otimes m} \rangle \qquad (26)$$
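For the $m = 2$ case, Lemma 6 is easy to check numerically: $x^{\otimes 2} = xx^{\mathsf{T}}$ is symmetric, so $\langle M^{\mathsf{T}}, xx^{\mathsf{T}} \rangle = \langle M, xx^{\mathsf{T}} \rangle$ and symmetrizing $M$ changes nothing. A small sketch (names ours):

```python
import numpy as np

def symm(M):
    # Matrix (m = 2) case of the symmetrization defined in Appendix A:
    # symm(M) = (M + M^T) / 2.
    return 0.5 * (M + M.T)

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))    # generic non-symmetric matrix
x = rng.normal(size=4)
X = np.outer(x, x)             # rank-one symmetric tensor x ⊗ x
# Lemma 6: <symm(M), x ⊗ x> = <M, x ⊗ x>.
assert np.isclose(np.sum(symm(M) * X), np.sum(M * X))
```

This is what lets alternating minimization work with the non-symmetric factors $U, V$ while still representing a symmetric model.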
6 Regularization
In some applications, the number of bases $k$ or the rank constraint $r$ are not enough for obtaining good generalization performance, and it is necessary to consider additional forms of regularization. For the lifted objective with $L = L_{\mathcal{H}}$ or $L = L_{\mathcal{A}}$, we use the typical Frobenius-norm regularization
$$\min_{U, V \in \mathbb{R}^{d \times r}} \; L\big(\mathrm{symm}(UV^{\mathsf{T}})\big) + \frac{\beta}{2} \big( \|U\|_F^2 + \|V\|_F^2 \big), \qquad (28)$$
where $\beta > 0$ is a regularization hyper-parameter. For the direct objective, we introduce the new regularization
$$\min_{\lambda \in \mathbb{R}^k, \, P \in \mathbb{R}^{d \times k}} \; D(\lambda, P) + \beta \sum_{s=1}^k |\lambda_s| \, \|p_s\|^2. \qquad (29)$$
This allows us to regularize $\lambda$ and $P$ with a single hyper-parameter. Let us define the following nuclear norm penalized objective:
$$\min_{M \in \mathbb{R}^{d \times d}} \; L(M) + \beta \|M\|_*. \qquad (30)$$
We can show that (28), (29) and (30) are equivalent in the following sense.
Theorem 2
Equivalence of regularized problems
Let $K = \mathcal{H}^2$ or $K = \mathcal{A}^2$. Then the problems are equivalent in the sense that
$$\min \, (28) = \min \, (29) = \min \, (30), \qquad (31)$$
where $r$ and $k$ are assumed large enough, i.e., at least the rank of an optimal solution $M^\star$ of (30).
The proof is given in Appendix C. Our proof relies on the variational form of the nuclear norm and is thus limited to $m = 2$. One of the key ingredients of the proof is to show that the minimizer of (30) is always a symmetric matrix. In addition to Theorem 2, from (Abernethy et al., 2009), we also know that every local minimum of (28) gives a global solution of (30), provided that $r \ge \mathrm{rank}(M^\star)$. Proving a similar result for (29) is left for future work. When $m \ge 3$, as used in our experiments, a squared Frobenius-norm penalty on $P$ (direct objective) or on $U^{(1)}, \dots, U^{(m)}$ (lifted objective) works well in practice, although we lose the theoretical connection with the nuclear norm.
7 Coordinate descent algorithms
We now describe how to learn the model parameters by coordinate descent (CD), which is a state-of-the-art learning-rate-free solver for multiconvex problems (e.g., Yu et al. (2012)). In the following, we assume that $\ell$ is smooth.
Direct objective with $K = \mathcal{A}^m$ for $m \in \{2, 3\}$. First, we note that minimizing (29) w.r.t. $\lambda$ can be reduced to a standard regularized convex objective via a simple change of variable. Hence we focus on the minimization w.r.t. $P$.
Let us denote the elements of $P$ by $p_{js}$. Then, our algorithm cyclically performs the following Newton-like update for all $j \in [d]$ and $s \in [k]$:
$$p_{js} \leftarrow p_{js} - \mu_{js}^{-1} \, \nabla_{js}, \qquad (32)$$
where $\nabla_{js}$ denotes the derivative of the objective (29) w.r.t. $p_{js}$ and $\mu_{js}$ the corresponding second derivative. Note that when $\ell$ is the squared loss, the above is equivalent to a Newton update and is the exact coordinate-wise minimizer, since by multilinearity the objective is then quadratic in $p_{js}$.
The key challenge in using CD is computing $\nabla_{js}$ efficiently. Let us denote the elements of $x_i$ by $x_{ji}$. Using Lemma 3, we obtain $\frac{\partial \mathcal{A}^2(p_s, x_i)}{\partial p_{js}} = x_{ji} \, \mathcal{D}^1(p_s, x_i) - p_{js} x_{ji}^2$, and a similar expression for $\mathcal{A}^3$. If, for all $i \in [n]$ and for $s$ fixed, we maintain the quantities $\mathcal{D}^t(p_s, x_i)$ (i.e., keep them in sync after every update of $p_{js}$), then computing the derivative takes $O(1)$ time per sample. Hence the cost of one epoch, i.e., updating all the elements of $P$ once, is $O(n_z(X) k)$. Complete details and pseudo-code are given in Appendix D.1. To our knowledge, this is the first CD algorithm capable of training third-order FMs. Supporting arbitrary $m$ is an important future work.
Lifted objective with $K = \mathcal{H}^m$. Recall that we want to learn the matrices $U^{(1)}, \dots, U^{(m)}$, whose columns we denote by $u_s^{(t)}$. Our algorithm cyclically performs the following Newton-like update for all $t \in [m]$, $s \in [r]$ and $j \in [d]$:
$$u_{js}^{(t)} \leftarrow u_{js}^{(t)} - \mu_{jst}^{-1} \, \nabla_{jst}, \qquad (33)$$
where $\nabla_{jst}$ denotes the derivative of the objective w.r.t. $u_{js}^{(t)}$ and $\mu_{jst}$ the corresponding second derivative. The main difficulty is computing $\nabla_{jst}$ efficiently. If, for all $i \in [n]$ and for $s$ and $t$ fixed, we maintain the partial products $\prod_{t' \neq t} \langle u_s^{(t')}, x_i \rangle$, then the cost of computing the derivative is $O(1)$ per sample. Hence the cost of one epoch is $O(m \, r \, n_z(X))$, the same as SGD. Complete details are given in Appendix D.2.
Convergence. The above updates monotonically decrease the objective value. Convergence to a stationary point is guaranteed following (Bertsekas, 1999, Proposition 2.7.1).
8 Inhomogeneous polynomial models
The algorithms presented so far are designed for the homogeneous kernels $\mathcal{H}^m$ and $\mathcal{A}^m$. These kernels only use monomials of the same degree $m$. However, in many applications, we would like to use monomials up to some degree. In this section, we propose a simple idea to do so using the algorithms presented so far, unmodified. Our key observation is that we can easily turn homogeneous polynomials into inhomogeneous ones by augmenting the dimensions of the training data with dummy features.
We begin by explaining how to learn inhomogeneous polynomial models using $\mathcal{H}^m$. Let us denote $\tilde{p} := [p; \gamma] \in \mathbb{R}^{d+1}$ and $\tilde{x} := [x; 1] \in \mathbb{R}^{d+1}$. Then, we obtain
$$\mathcal{H}^m(\tilde{p}, \tilde{x}) = (\langle p, x \rangle + \gamma)^m. \qquad (34)$$
Therefore, if we prepare the augmented training set $\tilde{x}_1, \dots, \tilde{x}_n$, the problem of learning a model of the form $\sum_{s=1}^k \lambda_s (\langle p_s, x \rangle + \gamma_s)^m$ can be converted to that of learning a rank-$k$ symmetric tensor using the method presented in Section 5. Note that the parameter $\gamma_s$ is automatically learned from data for each basis $p_s$.
Next, we explain how to learn inhomogeneous polynomial models using $\mathcal{A}^m$. Using Lemma 2, we immediately obtain for $\tilde{p} = [p; \gamma]$ and $\tilde{x} = [x; 1]$:
$$\mathcal{A}^m(\tilde{p}, \tilde{x}) = \mathcal{A}^m(p, x) + \gamma \, \mathcal{A}^{m-1}(p, x). \qquad (35)$$
For instance, when $m = 2$, we obtain
$$\mathcal{A}^2(\tilde{p}, \tilde{x}) = \mathcal{A}^2(p, x) + \gamma \, \langle p, x \rangle. \qquad (36)$$
Therefore, if we prepare the augmented training set $\tilde{x}_1, \dots, \tilde{x}_n$, we can easily learn a combination of the linear kernel and the second-order ANOVA kernel using the methods presented in Section 4 or Section 5. Note that (35) only states the relation between two ANOVA kernels of consecutive degrees. Fortunately, we can also apply (35) recursively. Namely, by adding $m - 1$ dummy features, we can sum the kernels from $\mathcal{A}^m$ down to $\mathcal{A}^1$ (i.e., the linear kernel).
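Identity (36) is easy to verify numerically: the dummy feature fixed to 1 pairs the extra parameter $\gamma$ with every original feature, producing exactly the linear term $\gamma \langle p, x \rangle$. A sketch (names ours):

```python
import itertools
import numpy as np

def anova_naive(p, x, m):
    # Brute-force ANOVA kernel, eq. (8).
    return float(sum(np.prod([p[j] * x[j] for j in idx])
                     for idx in itertools.combinations(range(len(p)), m)))

p = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, -2.0])
gamma = 0.7
p_aug = np.append(p, gamma)   # extra learnable parameter
x_aug = np.append(x, 1.0)     # dummy feature fixed to 1
lhs = anova_naive(p_aug, x_aug, 2)
rhs = anova_naive(p, x, 2) + gamma * float(np.dot(p, x))   # eq. (36)
assert np.isclose(lhs, rhs)
```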
9 Experimental results
In this section, we present experimental results, focusing on regression tasks. The datasets are described in Appendix E. In all experiments, we set $\ell$ to the squared loss.
9.1 Direct optimization: is it useful to fit $\lambda$?
As explained in Section 4, there is no benefit to fitting $\lambda$ when $m$ is odd, since $p_1, \dots, p_k$ can absorb $\lambda_1, \dots, \lambda_k$. This is however not the case when $m$ is even: $p_1, \dots, p_k$ can absorb absolute values but not negative signs (unless complex numbers are allowed for the parameters). Therefore, when $m$ is even, the class of functions we can represent with models of the form (1) is possibly smaller if we fix $\lambda = \mathbf{1}$ (as done in FMs).
To check that this is indeed the case, on the diabetes dataset, we minimized (29) with $K = \mathcal{A}^m$ and $m \in \{2, 4\}$ in three ways:

a) minimize w.r.t. both $\lambda$ and $P$ alternatingly,

b) fix $\lambda_s = 1$ for $s \in [k]$ and minimize w.r.t. $P$,

c) fix $\lambda_s = 1$ or $\lambda_s = -1$ with probability $\frac{1}{2}$ each and minimize w.r.t. $P$.

We initialized the elements of $P$ randomly, for all $j \in [d]$ and $s \in [k]$. Our results are shown in Figure 1. For $m = 2$, we use CD and for $m = 4$, we use L-BFGS. Note that since (29) is convex w.r.t. $\lambda$, a) is insensitive to the initialization of $\lambda$ as long as we fit $\lambda$ before $P$. Not surprisingly, fitting $\lambda$ allows us to achieve a smaller objective value. This is especially apparent when $m = 4$. However, the difference is much smaller when $m = 2$. We give intuitions as to why this is the case in Section 10.
We emphasize that this experiment was designed to confirm that fitting $\lambda$ does indeed improve the representation power of the model when $m$ is even. In practice, it is possible that fixing $\lambda = \mathbf{1}$ reduces overfitting and thus improves the generalization error. However, this highly depends on the data.
9.2 Direct vs. lifted optimization
In this section, we compare the direct and lifted optimization approaches on high-dimensional data when $m = 2$. To compare the two approaches fairly, we propose the following initialization scheme. Recall that, at the end of the day, both approaches essentially learn a low-rank symmetric matrix: $\mathrm{symm}(UV^{\mathsf{T}})$ for lifted and $P \operatorname{diag}(\lambda) P^{\mathsf{T}}$ for direct optimization. This suggests that we can easily convert the matrices $U, V$ used for initializing lifted optimization to $\lambda$ and $P$ by computing the (reduced) eigendecomposition of $\mathrm{symm}(UV^{\mathsf{T}})$. Note that because we solve the lifted optimization problem by coordinate descent, $UV^{\mathsf{T}}$ is never symmetric, and therefore the rank of $\mathrm{symm}(UV^{\mathsf{T}})$ is usually twice that of $UV^{\mathsf{T}}$. Hence, in practice, we have $k = 2r$. In our experiment, we compared four methods: the lifted objective solved by CD, and the direct objective solved by CD, L-BFGS and SGD. For lifted optimization, we initialized the elements of $U$ and $V$ randomly. For direct optimization, we obtained $\lambda$ and $P$ as explained above. Results on the E2006-tfidf high-dimensional dataset are shown in Figure 2. For $K = \mathcal{A}^2$, we find that Lifted (CD) and Direct (CD) have similar convergence speed, and both outperform Direct (L-BFGS). For $K = \mathcal{H}^2$, we find that Lifted (CD) outperforms both Direct (L-BFGS) and Direct (SGD). Note that we did not implement Direct (CD) for $K = \mathcal{H}^2$, since the direct optimization problem is then not coordinate-wise convex, as explained in Section 5.
9.3 Recommender system experiment
To confirm the ability of the proposed framework to infer the weights of unobserved feature interactions, we conducted experiments on Last.fm and MovieLens 1M, two standard recommender system datasets. Following (Rendle, 2012), matrix factorization can be reduced to FMs by creating a dataset of pairs $(x_i, y_i)$, where $x_i$ contains the one-hot encodings of the user and the item, and $y_i \in \mathbb{R}$ is the corresponding rating (i.e., the number of training instances equals the number of ratings). We compared four models:
a) $\mathcal{A}^2$ (augment): $\hat{y}(x) = \sum_{s=1}^k \lambda_s \, \mathcal{A}^2(\tilde{p}_s, \tilde{x})$, with $\tilde{x} = [x; 1]$,

b) $\mathcal{A}^2$ (linear combination): $\hat{y}(x) = \langle w, x \rangle + \sum_{s=1}^k \lambda_s \, \mathcal{A}^2(p_s, x)$,

c) $\mathcal{H}^2$ (augment): $\hat{y}(x) = \sum_{s=1}^k \lambda_s \, \mathcal{H}^2(\tilde{p}_s, \tilde{x})$, and

d) $\mathcal{H}^2$ (linear combination): $\hat{y}(x) = \langle w, x \rangle + \sum_{s=1}^k \lambda_s \, \mathcal{H}^2(p_s, x)$,

where $w \in \mathbb{R}^d$ is a vector of first-order weights, estimated from the training data. Note that b) and d) are exactly the same as FMs and PNs, respectively. Results are shown in Figure 3. We see that $\mathcal{A}^2$ tends to outperform $\mathcal{H}^2$ on these tasks. We hypothesize that this is the case because the features are binary (cf. the discussion in Section 10). We also see that simply augmenting the features, as suggested in Section 8, is comparable to or better than learning additional first-order feature weights, as done in FMs and PNs.
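The reduction from matrix factorization to FMs used above can be sketched as follows: each rating becomes one training pair whose feature vector concatenates the one-hot encodings of the user and the item, so the only surviving pairwise term is the user-item cross product (all names are ours):

```python
import numpy as np

def onehot_pair(user, item, n_users, n_items):
    # x concatenates one-hot encodings of the user and the item:
    # d = n_users + n_items, exactly two nonzero entries.
    x = np.zeros(n_users + n_items)
    x[user] = 1.0
    x[n_users + item] = 1.0
    return x

x = onehot_pair(user=1, item=0, n_users=3, n_items=2)
print(x)                          # [0. 1. 0. 1. 0.]
# With a single basis p, A^2(p, x) reduces to p[user] * p[n_users + item]:
p = np.arange(5, dtype=float)     # hypothetical basis vector
cross = sum(p[j] * x[j] * p[jp] * x[jp]
            for j in range(5) for jp in range(j + 1, 5))
assert np.isclose(cross, p[1] * p[3])   # the matrix-factorization term
```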
9.4 Low-budget nonlinear regression experiment
In this experiment, we demonstrate the ability of the proposed framework to reach good regression performance with a small number of bases $k$. We compared:

a) proposed with $K = \mathcal{A}^2$ (with augmented features),

b) proposed with $K = \mathcal{H}^2$ (with augmented features),

c) the Nyström method with a polynomial kernel, and

d) random selection: choose $p_1, \dots, p_k$ uniformly at random from the training set and fit only $\lambda$ in (1).

For a) and b), we used the lifted approach. For a fair comparison in terms of model size (number of floats used), we set $k = 2r$. Results on the abalone, cadata and cpusmall datasets are shown in Figure 4. We see that i) the proposed framework reaches the same performance as kernel ridge regression with far fewer bases than the other methods, and ii) $\mathcal{H}^2$ tends to outperform $\mathcal{A}^2$ on these tasks. Similar trends were observed for other degrees $m$.
10 Discussion
Ability to infer weights of unobserved interactions. In our view, one of the strengths of PNs and FMs is their ability to infer the weights of unobserved feature interactions, unlike traditional kernel methods. To see why, recall that in kernel methods, predictions are computed by $\hat{y}(x) = \sum_{i=1}^n \alpha_i \, K(x_i, x)$. When $K = \mathcal{H}^m$ or $K = \mathcal{A}^m$, by Lemma 5, this is equivalent to $\langle \mathcal{W}, x^{\otimes m} \rangle$ or $\langle \mathcal{W}, x^{\otimes m} \rangle_{>}$ if we set $\mathcal{W} = \sum_{i=1}^n \alpha_i \, x_i^{\otimes m}$. Thus, in kernel methods, the weight associated with the monomial $x_{j_1} \cdots x_{j_m}$ can be written as a linear combination of the training data's monomials:
$$\mathcal{W}_{j_1, \dots, j_m} = \sum_{i=1}^n \alpha_i \, x_{j_1 i} \cdots x_{j_m i}. \qquad (37)$$
Assuming binary features, the weights of monomials that were never observed in the training set are zero. In contrast, in PNs and FMs, we have $\mathcal{W} = \sum_{s=1}^k \lambda_s \, p_s^{\otimes m}$ and therefore the weight associated with $x_{j_1} \cdots x_{j_m}$ becomes
$$\mathcal{W}_{j_1, \dots, j_m} = \sum_{s=1}^k \lambda_s \, p_{j_1 s} \cdots p_{j_m s}. \qquad (38)$$
Because the parameters are shared across monomials, PNs and FMs are able to interpolate the weights of monomials that were never observed in the training set. This is the key property which makes it possible to use them for recommender system tasks. In future work, we plan to apply PNs and FMs to biological data, where this property should be very useful, e.g., for inferring higher-order interactions between genes.
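The contrast between (37) and (38) can be checked numerically for $m = 2$: under the factorized parameterization, every pairwise weight $\mathcal{W}_{j, j'} = \sum_s \lambda_s p_{j s} p_{j' s}$ is generically nonzero, even for feature pairs that never co-occurred in training (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 2
lam = rng.normal(size=k)
P = rng.normal(size=(d, k))
# Eq. (38) with m = 2: W = sum_s lambda_s * p_s p_s^T.
W = sum(lam[s] * np.outer(P[:, s], P[:, s]) for s in range(k))
# Parameter sharing defines a weight for *every* pair (j, j'),
# including pairs never observed together in the training data --
# unlike eq. (37), where such weights are exactly zero.
assert np.allclose(W, W.T)   # W is a symmetric matrix of pairwise weights
print(W[0, 3])               # nonzero weight for an arbitrary pair
```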
ANOVA kernel vs. polynomial kernel. One of the key properties of the ANOVA kernel is its multilinearity w.r.t. the elements of $p$ (Lemma 2). This is the key difference with $\mathcal{H}^m$, and it is what makes the direct optimization objective multiconvex when $K = \mathcal{A}^m$ (Theorem 1). However, because we need to ignore irrelevant monomials, computing the kernel and its gradient is more challenging. Deriving efficient training algorithms for arbitrary $m$ is an important future work.
In our experiments in Section 9.1, we showed that fixing $\lambda = \mathbf{1}$ works relatively well when $m = 2$. To see intuitively why this is the case, note that fixing $\lambda = \mathbf{1}$ is equivalent to constraining the weight matrix to be positive semidefinite, i.e., $W = PP^{\mathsf{T}}$ for some $P \in \mathbb{R}^{d \times k}$. Next, observe that we can rewrite the prediction function as
$$\hat{y}_{\mathcal{A}}(x) = \langle PP^{\mathsf{T}}, xx^{\mathsf{T}} \rangle_{>} = \langle M \circ PP^{\mathsf{T}}, xx^{\mathsf{T}} \rangle, \qquad (39)$$
where $M$ is a mask which sets the diagonal and lower-diagonal elements of $PP^{\mathsf{T}}$ to zero. We therefore see that when using $\mathcal{A}^2$, we are learning a strictly upper-triangular matrix, parametrized by $PP^{\mathsf{T}}$. Importantly, the matrix $M \circ PP^{\mathsf{T}}$ is not positive semidefinite. This is what gives the model some degree of freedom, even though $PP^{\mathsf{T}}$ is positive semidefinite. In contrast, when using $\mathcal{H}^2$, if we fix $\lambda = \mathbf{1}$, then we have
$$\hat{y}_{\mathcal{H}}(x) = \langle PP^{\mathsf{T}}, xx^{\mathsf{T}} \rangle = \|P^{\mathsf{T}} x\|^2 \ge 0, \qquad (40)$$
and therefore the model is unable to predict negative values.
Empirically, we showed in Section 9.4 that $\mathcal{H}^2$ outperforms $\mathcal{A}^2$ for low-budget nonlinear regression. In contrast, we showed in Section 9.3 that $\mathcal{A}^2$ outperforms $\mathcal{H}^2$ for recommender systems. The main difference between the two experiments is the nature of the features used: continuous for the former and binary for the latter. For binary features, squared features are redundant with first-order features (since $x_j^2 = x_j$) and are therefore not expected to help improve accuracy. On the contrary, they might introduce a bias towards first-order features. We hypothesize that the ANOVA kernel is in general a better choice for binary features, although this needs to be verified by more experiments, for instance on natural language processing (NLP) tasks.
Direct vs. lifted optimization. The main advantage of direct optimization is that we only need to estimate $\lambda$ and $P$, and therefore the number of parameters to estimate is independent of the degree $m$. Unfortunately, the approach is neither convex nor multiconvex when using $\mathcal{H}^m$. In addition, the regularized objective (29) is nonsmooth w.r.t. $\lambda$. In Section 5, we proposed to reformulate the problem as one of low-rank symmetric tensor estimation and used a symmetrization trick to obtain a multiconvex smooth objective function. Because this objective involves the estimation of $m$ matrices of size $d \times r$, we need to set $k = mr$ for a fair comparison with the direct objective in terms of model size. When $K = \mathcal{A}^m$, we showed that the direct objective is readily multiconvex. However, an advantage of our lifted objective when $K = \mathcal{A}^m$ is that it is convex w.r.t. larger blocks of variables than the direct objective.
11 Conclusion
In this paper, we revisited polynomial networks (Livni et al., 2014) and factorization machines (Rendle, 2010, 2012) from a unified perspective. We proposed direct and lifted optimization approaches and showed their equivalence in the regularized case for $m = 2$. With respect to PNs, we proposed the first CD solver with support for arbitrary integer $m$. With respect to FMs, we made several novel contributions, including making a connection with the ANOVA kernel, proving important properties of the objective function, and deriving the first CD solver for third-order FMs. Empirically, we showed that the proposed algorithms achieve excellent performance on nonlinear regression and recommender system tasks.
Acknowledgments
This work was partially conducted as part of “Research and Development on Fundamental and Applied Technologies for Social Big Data”, commissioned by the National Institute of Information and Communications Technology (NICT), Japan. We also thank Vlad Niculae, Olivier Grisel, Fabian Pedregosa and Joseph Salmon for their valuable comments.
References
 Abernethy et al. (2009) Abernethy, Jacob, Bach, Francis, Evgeniou, Theodoros, and Vert, JeanPhilippe. A new approach to collaborative filtering: Operator estimation with spectral regularization. J. Mach. Learn. Res., 10:803–826, 2009.
 Avron et al. (2014) Avron, Haim, Nguyen, Huy, and Woodruff, David. Subspace embeddings for the polynomial kernel. In Advances in Neural Information Processing Systems 27, pp. 2258–2266. 2014.
 Bertsekas (1999) Bertsekas, Dimitri P. Nonlinear programming. Athena scientific Belmont, 1999.

 Blondel et al. (2015) Blondel, Mathieu, Fujino, Akinori, and Ueda, Naonori. Convex factorization machines. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2015.
 Candès et al. (2013) Candès, Emmanuel J., Eldar, Yonina C., Strohmer, Thomas, and Voroninski, Vladislav. Phase retrieval via matrix completion. SIAM Journal on Imaging Sciences, 6(1):199–225, 2013.
 Chang et al. (2010) Chang, YinWen, Hsieh, ChoJui, Chang, KaiWei, Ringgaard, Michael, and Lin, ChihJen. Training and testing lowdegree polynomial data mappings via linear svm. Journal of Machine Learning Research, 11:1471–1490, 2010.
 Chen & Manning (2014) Chen, Danqi and Manning, Christopher D. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 1, pp. 740–750, 2014.
 Comon et al. (2008) Comon, Pierre, Golub, Gene, Lim, LekHeng, and Mourrain, Bernard. Symmetric tensors and symmetric tensor rank. SIAM Journal on Matrix Analysis and Applications, 30(3):1254–1279, 2008.

 Kar & Karnick (2012) Kar, Purushottam and Karnick, Harish. Random feature maps for dot product kernels. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 583–591, 2012.
 Livni et al. (2014) Livni, Roi, Shalev-Shwartz, Shai, and Shamir, Ohad. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pp. 855–863, 2014.
 Mazumder et al. (2010) Mazumder, Rahul, Hastie, Trevor, and Tibshirani, Robert. Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11:2287–2322, 2010.
 Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 Pham & Pagh (2013) Pham, Ninh and Pagh, Rasmus. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th KDD conference, pp. 239–247, 2013.
 Rendle (2010) Rendle, Steffen. Factorization machines. In Proceedings of International Conference on Data Mining, pp. 995–1000. IEEE, 2010.
 Rendle (2012) Rendle, Steffen. Factorization machines with libfm. ACM Transactions on Intelligent Systems and Technology (TIST), 3(3):57–78, 2012.
 Shawe-Taylor & Cristianini (2004) Shawe-Taylor, John and Cristianini, Nello. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
 Sonnenburg & Franc (2010) Sonnenburg, Sören and Franc, Vojtech. Coffin: A computational framework for linear svms. In Proceedings of the 27th International Conference on Machine Learning, pp. 999–1006, 2010.
 Stitson et al. (1997) Stitson, Mark, Gammerman, Alex, Vapnik, Vladimir, Vovk, Volodya, Watkins, Chris, and Weston, Jason. Support vector regression with anova decomposition kernels. Technical report, Royal Holloway University of London, 1997.
 Vapnik (1998) Vapnik, Vladimir. Statistical learning theory. Wiley, 1998.
 Wang et al. (2010) Wang, Z., Crammer, K., and Vucetic, S. Multiclass pegasos on a budget. In Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 1143–1150, 2010.
 Williams & Seeger (2001) Williams, Christopher K. I. and Seeger, Matthias. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pp. 682–688, 2001.
 Yu et al. (2012) Yu, HsiangFu, Hsieh, ChoJui, Si, Si, and Dhillon, Inderjit S. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In ICDM, pp. 765–774, 2012.
Appendix A Symmetric tensors
A.1 Background
Let $\mathbb{R}^{d_1 \times \cdots \times d_m}$ be the set of real $m$-th order tensors. In this paper, we focus on cubical tensors, i.e., tensors with $d_1 = \cdots = d_m = d$. We denote the set of $m$-th order cubical tensors by $\mathbb{R}^{d \times \cdots \times d}$. We denote the elements of $\mathcal{T} \in \mathbb{R}^{d \times \cdots \times d}$ by $t_{j_1, \dots, j_m}$, where $j_1, \dots, j_m \in [d]$.
Let $\pi$ be a permutation of $[m]$. Given $\mathcal{T} \in \mathbb{R}^{d \times \cdots \times d}$, we define $\pi(\mathcal{T})$ as the tensor such that
$$\pi(\mathcal{T})_{j_1, \dots, j_m} = t_{j_{\pi(1)}, \dots, j_{\pi(m)}}. \qquad (41)$$
In other words, $\pi(\mathcal{T})$ is a copy of $\mathcal{T}$ with its axes permuted. This generalizes the concept of transpose to tensors.
Let $\Pi_m$ be the set of all permutations of $[m]$. We say that a tensor $\mathcal{T}$ is symmetric if and only if
$$\pi(\mathcal{T}) = \mathcal{T} \quad \forall \pi \in \Pi_m. \qquad (42)$$
We denote the set of symmetric tensors by $\mathbb{S}^{d \times \cdots \times d}$.
Given $\mathcal{T} \in \mathbb{R}^{d \times \cdots \times d}$, we define the symmetrization of $\mathcal{T}$ by
$$\mathrm{symm}(\mathcal{T}) := \frac{1}{m!} \sum_{\pi \in \Pi_m} \pi(\mathcal{T}). \qquad (43)$$
Note that when $\mathcal{T} \in \mathbb{S}^{d \times \cdots \times d}$, then $\mathrm{symm}(\mathcal{T}) = \mathcal{T}$.
Given $p \in \mathbb{R}^d$, we define a symmetric rank-one tensor by $p^{\otimes m} := p \otimes \cdots \otimes p \in \mathbb{S}^{d \times \cdots \times d}$, whose elements are $(p^{\otimes m})_{j_1, \dots, j_m} = p_{j_1} \cdots p_{j_m}$.