Higher-Order Factorization Machines

07/25/2016 ∙ by Mathieu Blondel, et al. ∙ 0

Factorization machines (FMs) are a supervised learning approach that can use second-order feature combinations even when the data is very high-dimensional. Unfortunately, despite increasing interest in FMs, there exists to date no efficient training algorithm for higher-order FMs (HOFMs). In this paper, we present the first generic yet efficient algorithms for training arbitrary-order HOFMs. We also present new variants of HOFMs with shared parameters, which greatly reduce model size and prediction times while maintaining similar accuracy. We demonstrate the proposed approaches on four different link prediction tasks.



There are no comments yet.


page 1

page 2

page 3

page 4

Code Repositories


TensorFlow implementation of an arbitrary order Factorization Machine

view repo


GSOC 2017 Project, Apache Organization, ontain This repo contains the TensorFlow implementation of Factorization Machines of System ML.

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Factorization machines (FMs) fm ; libfm are a supervised learning approach that can use second-order feature combinations efficiently even when the data is very high-dimensional. The key idea of FMs is to model the weights of feature combinations using a low-rank matrix. This has two main benefits. First, FMs can achieve empirical accuracy on a par with polynomial regression or kernel methods but with smaller and faster to evaluate models fm_icml . Second, FMs can infer the weights of feature combinations that were not observed in the training set. This second property is crucial for instance in recommender systems, a domain where FMs have become increasingly popular libfm ; fm_context . Without the low-rank property, FMs would fail to generalize to unseen user-item interactions.

Unfortunately, although higher-order FMs (HOFMs) were briefly mentioned in the original work of fm ; libfm , there exists to date no efficient algorithm for training arbitrary-order HOFMs. In fact, even just computing predictions given the model parameters naively takes polynomial time in the number of features. For this reason, HOFMs have, to our knowledge, never been applied to any problem. In addition, HOFMs, as originally defined in fm ; libfm

, model each degree in the polynomial expansion with a different matrix and therefore require the estimation of a large number of parameters.

In this paper, we propose the first efficient algorithms for training arbitrary-order HOFMs. To do so, we rely on a link between FMs and the so-called ANOVA kernel fm_icml . We propose linear-time dynamic programming algorithms for evaluating the ANOVA kernel and computing its gradient. Based on these, we propose stochastic gradient and coordinate descent algorithms for arbitrary-order HOFMs. To reduce the number of parameters, as well as prediction times, we also introduce two new kernels derived from the ANOVA kernel, allowing us to define new variants of HOFMs with shared parameters. We demonstrate the proposed approaches on four different link prediction tasks.

2 Factorization machines (FMs)

Second-order FMs. Factorization machines (FMs) fm ; libfm are an increasingly popular method for efficiently using second-order feature combinations in classification or regression tasks even when the data is very high-dimensional. Let and , where is a rank hyper-parameter. We denote the rows of by and its columns by , for and , where . Then, FMs predict an output

from a vector



An important characteristic of (1) is that it considers only combinations of distinct features (i.e., the squared features are ignored). The main advantage of FMs compared to naive polynomial regression is that the number of parameters to estimate is instead of . In addition, we can compute predictions in time111We include the constant factor for fair later comparison with arbitrary-order HOFMs. using


where indicates element-wise product convex_fm . Given a training set and , and can be learned by minimizing the following non-convex objective



is a convex loss function and

are hyper-parameters. The popular libfm library libfm implements efficient stochastic gradient and coordinate descent algorithms for obtaining a stationary point of (3). Both algorithms have a runtime complexity of

per epoch.

Higher-order FMs (HOFMs). Although no training algorithm was provided, FMs were extended to higher-order feature combinations in the original work of fm ; libfm . Let , where is the order or degree of feature combinations considered, and is a rank hyper-parameter. Let be the row of . Then -order HOFMs can be defined as


where we defined (sum of element-wise products). The objective function of HOFMs can be expressed in a similar way as for (3):


where are hyper-parameters. To avoid the combinatorial explosion of hyper-parameter combinations to search, in our experiments we will simply set and . While (4) looks quite daunting, fm_icml recently showed that FMs can be expressed from a simpler kernel perspective. Let us define the ANOVA222The name comes from the ANOVA decomposition of functions. wahba_book ; vapnik_book kernel vapnik_book of degree by


For later convenience, we also define and . Then it is shown that


where is the column of . This perspective shows that we can view FMs and HOFMs as a type of kernel machine whose “support vectors” are learned directly from data. Intuitively, the ANOVA kernel can be thought as a kind of polynomial kernel that uses feature combinations without replacement (i.e., of distinct features). A key property of the ANOVA kernel is multi-linearity fm_icml :


where denotes the -dimensional vector with removed and similarly for . That is, everything else kept fixed, is an affine function of . Although no training algorithm was provided, fm_icml showed based on (8) that, although non-convex, the objective function of arbitrary-order HOFMs is convex in and in each row of , separately.

Interpretability of HOFMs. An advantage of FMs and HOFMs is their interpretability. To see why this is the case, notice that we can rewrite (4) as


where we defined . Intuitively, is a low-rank

-way tensor which contains the weights of feature combinations of degree

. For instance, when , is the weight of . Similarly to the ANOVA decomposition of functions, HOFMs consider only combinations of distinct features (i.e., for ).

This paper. Unfortunately, there exists to date no efficient algorithm for training arbitrary-order HOFMs. Indeed, computing (6) naively takes , i.e., polynomial time. In the following, we present linear-time algorithms. Moreover, HOFMs, as originally defined in fm ; libfm require the estimation of matrices . Thus, HOFMs can produce large models when is large. To address this issue, we propose new variants of HOFMs with shared parameters.

3 Linear-time stochastic gradient algorithms for HOFMs

The kernel view presented in Section 2 allows us to focus on the ANOVA kernel as the main “computational unit” for training HOFMs. In this section, we develop dynamic programming (DP) algorithms for evaluating the ANOVA kernel and computing its gradient in only time.

Evaluation. The main observation (see also (kernel_book, , Section 9.2)) is that we can use (8) to recursively remove features until computing the kernel becomes trivial. Let us denote a subvector of by and similarly for . Let us introduce the shorthand . Then, from (8),


For convenience, we also define since and since there does not exist any -combination of features in a dimensional vector.

Table 1: Example of DP table

The quantity we want to compute is . Instead of naively using recursion (10), which would lead to many redundant computations, we use a bottom-up approach and organize computations in a DP table. We start from the top-left corner to initialize the recursion and go through the table to arrive at the solution in the bottom-right corner. The procedure, summarized in Algorithm 1, takes time and memory.

Gradients. For computing the gradient of w.r.t. , we use reverse-mode differentiation autodiff

(a.k.a. backpropagation in a neural network context), since it allows us to compute the entire gradient in a single pass. We supplement each variable

in the DP table by a so-called adjoint , which represents the sensitivity of w.r.t. . From recursion (10), except for edge cases, influences and

. Using the chain rule, we then obtain


Similarly, we introduce the adjoint . Since influences , we have


We can run recursion (11) in reverse order of the DP table starting from . Using this approach, we can compute the entire gradient w.r.t. in time and memory. The procedure is summarized in Algorithm 2.

  Input: ,
  for  do
      for  do
      end for
  end for
Algorithm 1 Evaluating in
  Input: , ,
  for  do
      for  do
      end for
  end for
Algorithm 2 Computing in

Stochastic gradient (SG) algorithms. Based on Algorithm 1 and 2, we can easily learn arbitrary-order HOFMs using any gradient-based optimization algorithm. Here we focus our discussion on SG algorithms. If we alternatingly minimize (5) w.r.t , then the sub-problem associated with degree is of the form


where are fixed offsets which account for the contribution of degrees other than to the predictions. The sub-problem is convex in each row of fm_icml . A SG update for (13) w.r.t. for some instance can be computed by , where is a learning rate and where we defined . Because evaluating and computing its gradient both take , the cost per epoch, i.e., of visiting all instances, is . When , this is the same cost as the SG algorithm implemented in libfm.

Sparse data. We conclude this section with a few useful remarks on sparse data. Let us denote the support of a vector by and let us define . It is easy to see from (8) that the gradient and have the same support, i.e., . Another useful remark is that , provided that , where is the number of non-zero elements in . Hence, when the data is sparse, we only need to iterate over non-zero features in Algorithm 1 and 2. Consequently, their time and memory cost is only and thus the cost per epoch of SG algorithms is .

4 Coordinate descent algorithm for arbitrary-order HOFMs

We now describe a coordinate descent (CD) solver for arbitrary-order HOFMs. CD is a good choice for learning HOFMs because their objective function is coordinate-wise convex, thanks to the multi-linearity of the ANOVA kernel fm_icml . Our algorithm can be seen as a generalization to higher orders of the CD algorithms proposed in libfm ; fm_icml .

An alternative recursion. Efficient CD implementations typically require maintaining statistics for each training instance, such as the predictions at the current iteration. When a coordinate is updated, the statistics then need to be synchronized. Unfortunately, the recursion we used in the previous section is not suitable for a CD algorithm because it would require to store and synchronize the DP table for each training instance upon coordinate-wise updates. We therefore turn to an alternative recursion:


where we defined . Note that the recursion was already known in the context of traditional kernel methods (c.f., (vapnik_book, , Section 11.8)) but its application to HOFMs is novel. Since we know that and , we can use (14) to compute , then , and so on. The overall evaluation cost for arbitrary is .

Coordinate-wise derivatives. We can apply reverse-mode differentiation to recursion (14) in order to compute the entire gradient (c.f., Appendix C). However, in CD, since we only need the derivative of one variable at a time, we can simply use forward-mode differentiation:


where . The advantage of (15) is that we only need to cache for . Hence the memory complexity per sample is only instead of for (10).

Use in a CD algorithm. Similarly to fm_icml , we assume that the loss function is -smooth and update the elements of in cyclic order by , where we defined


The update guarantees that the objective value is monotonically non-increasing and is the exact coordinate-wise minimizer when is the squared loss. Overall, the total cost per epoch, i.e., updating all coordinates once, is , where is the time it takes to compute (15). Assuming have been previously cached, for , computing (15) takes operations. For fixed , if we unroll the two loops needed to compute (15), modern compilers can often further reduce the number of operations needed. Nevertheless, this quadratic dependency on means that our CD algorithm is best for small , typically .

5 HOFMs with shared parameters

HOFMs, as originally defined in fm ; libfm , model each degree with separate matrices . Assuming that we use the same rank for all matrices, the total model size of -order HOFMs is therefore . Moreover, even when using our DP algorithm, the cost of computing predictions is . Hence, HOFMs tend to produce large, expensive-to-evaluate models. To reduce model size and prediction times, we introduce two new kernels which allow us to share parameters between each degree: the inhomogeneous ANOVA kernel and the all-subsets kernel. Because both kernels are derived from the ANOVA kernel, they share the same appealing properties: multi-linearity, sparse gradients and sparse-data friendliness.

5.1 Inhomogeneous ANOVA kernel

It is well-known that a sum of kernels is equivalent to concatenating their associated feature maps (kernel_book, , Section 3.4). Let . To combine different degrees, a natural kernel is therefore


The kernel uses all feature combinations of degrees up to . We call it inhomogeneous ANOVA kernel, since it is an inhomogeneous polynomial of . In contrast, is homogeneous. The main difference between (17) and (7) is that all ANOVA kernels in the sum share the same parameters. However, to increase modeling power, we allow each kernel to have different weights .

Evaluation. Due to the recursive nature of Algorithm 1, when computing , we also get for free. Indeed, lower-degree kernels are available in the last column of the DP table, i.e., . Hence, the cost of evaluating (17) is time. The total cost for computing is instead of for .

Learning. While it is certainly possible to learn and by directly minimizing some objective function, here we propose an easier solution, which works well in practice. Our key observation is that we can easily turn into by adding dummy values to feature vectors. Let us denote the concatenation of with a scalar by and similarly for . From (8), we easily obtain


Similarly, if we apply (8) twice, we obtain:


Applying the above to and , we obtain


More generally, by adding dummy features to and , we can convert to . Because is learned, this means that we can automatically learn . These weights can then be converted to by “unrolling” recursion (8). Although simple, we show in our experiments that this approach works favorably compared to directly learning and . The main advantage of this approach is that we can use the same software unmodified (we simply need to minimize (13) with the augmented data). Moreover, the cost of computing the entire gradient by Algorithm 2 using the augmented data is just compared to for HOFMs with separate parameters.

5.2 All-subsets kernel

We now consider a closely related kernel called all-subsets kernel (kernel_book, , Definition 9.5):


The main difference with the traditional use of this kernel is that we learn . Interestingly, it can be shown that , where is the number of non-zero features in . Hence, the kernel uses all combinations of distinct features up to order with uniform weights. Even if is very large, the kernel can be a good choice if each training instance contains only a few non-zero elements. To learn the parameters, we simply substitute with in (13). In SG or CD algorithms, all it entails is to substitute with . For computing , it is easy to verify that and therefore we have


Therefore, the main advantage of the all-subsets kernel is that we can evaluate it and compute its gradient in just time. The total cost for computing is only .

6 Experimental results

6.1 Application to link prediction

Problem setting. We now demonstrate a novel application of HOFMs to predict the presence or absence of links between nodes in a graph. Formally, we assume two sets of possibly disjoint nodes of size and , respectively. We assume features for the two sets of nodes, represented by matrices and . For instance, can represent user features and movie features. We denote the columns of and by and , respectively. We are given a matrix , whose elements indicate presence (positive sample) or absence (negative sample) of link between two nodes and . We denote the number of positive samples by . Using this data, our goal is to predict new associations. Datasets used in our experiments are summarized in Table 2. Note that for the NIPS and Enzyme datasets, .

Dataset Columns of Columns of
NIPS nips_data 4,140 Authors 2,037 13,649
Enzyme yamanishi_2005 2,994 Enzymes 668 325
GD imc 3,954 Diseases 3,209 3,209 Genes 12,331 25,275
Movielens 100K movielens 21,201 Users 943 49 Movies 1,682 29
Table 2: Datasets used in our experiments. For a detailed description, c.f. Appendix A.

Conversion to a supervised problem. We need to convert the above information to a format FMs and HOFMs can handle. To predict an element of , we simply form to be the concatenation of and and feed this to a HOFM in order to compute a prediction . Because HOFMs use feature combinations in , they can learn the weights of feature combinations between and . At training time, we need both positive and negative samples. Let us denote the set of positive and negative samples by . Then our training set is composed of pairs, where .

Models compared.

  • [topsep=0pt,itemsep=1ex,partopsep=0ex,parsep=0.3ex,leftmargin=2ex]

  • HOFM: as defined in (4) and as originally proposed in fm ; libfm . We minimize (5) by alternating minimization of (13) for each degree.

  • HOFM-shared: . We learn and using the simple augmented data approach described in Section 5.1 (HOFM-shared-augmented). Inspired by SimpleMKL simple_mkl , we also report results when learning and directly by minimizing subject to and (HOFM-shared-simplex).

  • All-subsets: . As explained in Section 5.2, this model is equivalent to the HOFM-shared model with and .

  • Polynomial network: . This model can be thought as factorization machine variant that uses a polynomial kernel instead of the ANOVA kernel (c.f., livni ; fm_icml ; tensor_machines ).

  • Low-rank bilinear regression: , where and . Such model was shown to work well for link prediction in link_prediction and imc . We learn and by minimizing .

Experimental setup and evaluation. In this experiment, for all models above, we use CD rather than SG to avoid the tuning of a learning rate hyper-parameter. We set

to be the squared loss. Although we omitted it from our notation for clarity, we also fit a bias term for all models. We evaluated the compared models using the area under the ROC curve (AUC), which is the probability that the model correctly ranks a positive sample higher than a negative sample. We split the

positive samples into for training and for testing. We sample the same number of negative samples as positive samples for training and use the rest for testing. We chose from by cross-validation and following link_prediction we empirically set . Throughout our experiments, we initialized the elements of randomly by .

NIPS Enzyme GD Movielens 100K
HOFM () 0.856 0.880 0.717 0.778
HOFM () 0.875 0.888 0.717 0.786
HOFM () 0.874 0.887 0.717 0.786
HOFM () 0.874 0.887 0.717 0.786
HOFM-shared-augmented () 0.858 0.876 0.704 0.778
HOFM-shared-augmented () 0.874 0.887 0.704 0.787
HOFM-shared-augmented () 0.836 0.824 0.663 0.779
HOFM-shared-augmented () 0.824 0.795 0.600 0.621
HOFM-shared-simplex () 0.716 0.865 0.721 0.701
HOFM-shared-simplex () 0.777 0.870 0.721 0.709
HOFM-shared-simplex () 0.758 0.870 0.721 0.709
HOFM-shared-simplex () 0.722 0.869 0.721 0.709
All-subsets 0.730 0.840 0.721 0.714
Polynomial network () 0.725 0.879 0.721 0.761
Polynomial network () 0.789 0.853 0.719 0.696
Polynomial network () 0.782 0.873 0.717 0.708
Polynomial network () 0.543 0.524 0.648 0.501
Low-rank bilinear regression 0.855 0.694 0.611 0.718
Table 3: Comparison of area under the ROC curve (AUC) as measured on the test sets.

Results are indicated in Table 3. Overall the two best models were HOFM and HOFM-shared-augmented, which achieved the best scores on 3 out of 4 datasets. The two models outperformed low-rank bilinear regression on 3 out 4 datasets, showing the benefit of using higher-order feature combinations. HOFM-shared-augmented achieved similar accuracy to HOFM, despite using a smaller model. Surprisingly, HOFM-shared-simplex did not improve over HOFM-shared-augmented except on the GD dataset. We conclude that our augmented data approach is convenient yet works well in practice. All-subsets and polynomial networks performed worse than HOFM and HOFM-shared-augmented, except on the GD dataset where they were the best. Finally, we observe that HOFM were quite robust to increasing , which is likely a benefit of modeling each degree with a separate matrix.

6.2 Solver comparison


(a) Convergence when


(b) Convergence when


(c) Convergence when


(d) Scalability w.r.t. degree
Figure 1: Solver comparison for minimizing (13) when varying the degree on the NIPS dataset with and . Results on other datasets are in Appendix B.

We compared AdaGrad adagrad , L-BFGS and coordinate descent (CD) for minimizing (13) when varying the degree on the NIPS dataset with and . We constructed the data in the same way as explained in the previous section and added dummy features, resulting in sparse samples of dimension . For AdaGrad and L-BFGS, we computed the (stochastic) gradients using Algorithm 2. All solvers used the same initialization.

Results are indicated in Figure 1. We see that our CD algorithm performs very well when but starts to deteriorate when , in which case L-BFGS becomes advantageous. As shown in Figure 1 d), the cost per epoch of AdaGrad and L-BFGS scales linearly with , a benefit of our DP algorithm for computing the gradient. However, to our surprise, we found that AdaGrad is quite sensitive to the learning rate . AdaGrad diverged for and the largest value to work well was . This explains why AdaGrad did not outperform CD despite the lower cost per epoch. In the future, it would be useful to create a CD algorithm with a better dependency on .

7 Conclusion and future directions

In this paper, we presented the first training algorithms for HOFMs and introduced new HOFM variants with shared parameters. A popular way to deal with a large number of negative samples is to use an objective function that directly maximize AUC link_prediction ; rendle_bpr . This is especially easy to do with SG algorithms because we can sample pairs of positive and negative samples from the dataset upon each SG update. We therefore expect the algorithms developed in Section 3 to be especially useful in this setting. Recently, difacto proposed a distributed SG algorithm for training second-order FMs. It should be straightforward to extend this algorithm to HOFMs based on our contributions in Section 3. Finally, it should be possible to integrate Algorithm 1 and 2

into a deep learning framework such as TensorFlow

tensorflow , in order to easily compose ANOVA kernels with other layers (e.g., convolutional).


  • (1) M. Abadi et al.

    TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.

  • (2) A. G. Baydin, B. A. Pearlmutter, and A. A. Radul. Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767, 2015.
  • (3) M. Blondel, A. Fujino, and N. Ueada. Convex factorization machines. In Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2015.
  • (4) M. Blondel, M. Ishihata, A. Fujino, and N. Ueada. Polynomial networks and factorization machines: New insights and efficient training algorithms. In Proceedings of International Conference on Machine Learning (ICML), 2016.
  • (5) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
  • (6) GroupLens. http://grouplens.org/datasets/movielens/, 1998.
  • (7) M. Li, Z. Liu, A. Smola, and Y.-X. Wang. Difacto – distributed factorization machines. In Proceedings of International Conference on Web Search and Data Mining (WSDM), 2016.
  • (8) R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.
  • (9) A. K. Menon and C. Elkan. Link prediction via matrix factorization. In Machine Learning and Knowledge Discovery in Databases, pages 437–452. 2011.
  • (10) N. Natarajan and I. S. Dhillon. Inductive matrix completion for predicting gene–disease associations. Bioinformatics, 30(12):i60–i68, 2014.
  • (11) V. Y. Pan. Structured Matrices and Polynomials: Unified Superfast Algorithms. Springer-Verlag New York, Inc., 2001.
  • (12) A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. Simplemkl. Journal of Machine Learning Research, 9:2491–2521, 2008.
  • (13) S. Rendle. Factorization machines. In Proceedings of International Conference on Data Mining, pages 995–1000. IEEE, 2010.
  • (14) S. Rendle. Factorization machines with libfm. ACM Transactions on Intelligent Systems and Technology (TIST), 3(3):57–78, 2012.
  • (15) S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. In

    Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence

    , pages 452–461, 2009.
  • (16) S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme. Fast context-aware recommendations with factorization machines. In SIGIR, pages 635–644, 2011.
  • (17) S. Roweis. http://www.cs.nyu.edu/~roweis/data.html, 2002.
  • (18) J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
  • (19) V. Vapnik. Statistical learning theory. Wiley, 1998.
  • (20) G. Wahba. Spline models for observational data, volume 59. Siam, 1990.
  • (21) Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics, 21:i468–i477, 2005.
  • (22) J. Yang and A. Gittens. Tensor machines for learning target-specific polynomial features. arXiv preprint arXiv:1504.01697, 2015.

Appendix A Dataset descriptions

  • [topsep=0pt,itemsep=1ex,partopsep=0ex,parsep=1ex,leftmargin=2ex]

  • NIPS: co-author graph of authors at the first twelve editions of NIPS, obtained from [17]. For this dataset, as well as the Enzyme dataset below, we have . The co-author graph comprises authors represented by bag-of-words vectors of dimension (words used by authors in their publications). The number of positive samples is .

  • Enzyme: metabolic network obtained from [21]. The network comprises enzymes represented by three sets of features: a -dimensional vector of phylogenetic information, a -dimensional vector of gene expression information and a -dimensional vector of gene location information. We concatenate the three sets of information to form feature vectors of dimension . Original enzyme similarity scores are between and

    . We binarize the scores using

    as threshold. The resulting number of positive samples is .

  • GD: human gene-disease association data obtained from [10]. The bipartite graph is comprised of diseases and genes. We represent each disease using a vector of dimensions, whose elements are similarity scores obtained from MimMiner. The study [10] also used bag-of-words vectors describing each disease but we found these to not help improve performance both for baselines and proposed methods. We represent each gene using a vector of features, which are the concatenation of similarity scores obtained from HumanNet and gene-phenotype associations from 8 other species. See [10] for a detailed description of the features. The number of positive samples is .

  • Movielens 100K: recommender system data obtained from [6]. The bipartite graph is comprised of users and

    movies. For users, we convert age, gender, occupation and living area (first digit of zipcode) to a binary vector using a one-hot encoding. For movies, we use the release year and genres. The resulting vectors are of dimension

    and , respectively. Original ratings are between and . We binarize the ratings using as threshold, resulting in positive samples.

Appendix B Additional experiments

b.1 Solver comparison

We also compared AdaGrad, L-BFGS and coordinate descent (CD) on the Enzyme, Gene-Disease (GD) and Movielens 100K datasets. Results are indicated in Figure 2, 3 and 4, respectively.






Figure 2: Solver comparison for minimizing (13) on the Enzyme dataset. We set to the values with best test-set performance, which were , and , respectively. We set .






Figure 3: Solver comparison for minimizing (13) on the GD dataset. We set to the values with best test-set performance, which were , and , respectively. We set .






Figure 4: Solver comparison for minimizing (13) on the Movielens 100K dataset. We set to the values with best test-set performance, which were , and , respectively. We set .

b.2 Recommender system experiments

As we explained in Section 5.2, the all-subsets kernel can be a good choice if the number of non-zero elements per sample is small. To verify this assumption, we ran experiments on two recommender system tasks: Movielens 1M and Last.fm. We used the exact same setting as in [4, Section 9.3]. For each rating , the corresponding was set to the concatenation of the one-hot encodings of the user and item indices. We compared the following models:

  • FM:

  • FM-augmented: where

  • All-subsets:

  • Polynomial networks: (c.f. [4] for more details)

Results are indicated in Figure 5. We see that All-subsets performs relatively well on these tasks.


(a) Movielens 1M


(b) Last.FM
Figure 5: Model comparison on two recommender system datasets.

Appendix C Reverse-mode differentiation on the alternative recursion

We now describe how to apply reverse-mode differentiation to the alternative recursion (14) in order to compute the entire gradient efficiently. Let us introduce the shorthands and . We can then write the recursion as


For concreteness, let us illustrate the recursion for . We have


We see that influences , and influences and . Likewise, influences , influences and , and influences , and . Let us denote the adjoints and . For general , summing over quantities that influences and , we obtain


Let us denote the adjoint of by . We know that directly influences only and therefore


Assuming that and have been previously computed, which takes , the procedure for computing the gradient can be summarized as follows:

  1. Initialize