tffm
TensorFlow implementation of an arbitrary order Factorization Machine
view repo
Factorization machines (FMs) are a supervised learning approach that can use second-order feature combinations even when the data is very high-dimensional. Unfortunately, despite increasing interest in FMs, there exists to date no efficient training algorithm for higher-order FMs (HOFMs). In this paper, we present the first generic yet efficient algorithms for training arbitrary-order HOFMs. We also present new variants of HOFMs with shared parameters, which greatly reduce model size and prediction times while maintaining similar accuracy. We demonstrate the proposed approaches on four different link prediction tasks.
READ FULL TEXT VIEW PDF
In this paper, we introduce the notion of motif closure and describe
hig...
read it
Various factorization-based methods have been proposed to leverage
secon...
read it
Factorization machines (FMs) are machine learning predictive models base...
read it
Factorization Machines (FM) are powerful class of models that incorporat...
read it
High-order parametric models that include terms for feature interactions...
read it
To date, the instability of prognostic predictors in a sparse high
dimen...
read it
Higher-order mutation has the potential for improving major drawbacks of...
read it
TensorFlow implementation of an arbitrary order Factorization Machine
GSOC 2017 Project, Apache Organization, ontain This repo contains the TensorFlow implementation of Factorization Machines of System ML.
Factorization machines (FMs) fm ; libfm are a supervised learning approach that can use second-order feature combinations efficiently even when the data is very high-dimensional. The key idea of FMs is to model the weights of feature combinations using a low-rank matrix. This has two main benefits. First, FMs can achieve empirical accuracy on a par with polynomial regression or kernel methods but with smaller and faster to evaluate models fm_icml . Second, FMs can infer the weights of feature combinations that were not observed in the training set. This second property is crucial for instance in recommender systems, a domain where FMs have become increasingly popular libfm ; fm_context . Without the low-rank property, FMs would fail to generalize to unseen user-item interactions.
Unfortunately, although higher-order FMs (HOFMs) were briefly mentioned in the original work of fm ; libfm , there exists to date no efficient algorithm for training arbitrary-order HOFMs. In fact, even just computing predictions given the model parameters naively takes polynomial time in the number of features. For this reason, HOFMs have, to our knowledge, never been applied to any problem. In addition, HOFMs, as originally defined in fm ; libfm
, model each degree in the polynomial expansion with a different matrix and therefore require the estimation of a large number of parameters.
In this paper, we propose the first efficient algorithms for training arbitrary-order HOFMs. To do so, we rely on a link between FMs and the so-called ANOVA kernel fm_icml . We propose linear-time dynamic programming algorithms for evaluating the ANOVA kernel and computing its gradient. Based on these, we propose stochastic gradient and coordinate descent algorithms for arbitrary-order HOFMs. To reduce the number of parameters, as well as prediction times, we also introduce two new kernels derived from the ANOVA kernel, allowing us to define new variants of HOFMs with shared parameters. We demonstrate the proposed approaches on four different link prediction tasks.
Second-order FMs. Factorization machines (FMs) fm ; libfm are an increasingly popular method for efficiently using second-order feature combinations in classification or regression tasks even when the data is very high-dimensional. Let and , where is a rank hyper-parameter. We denote the rows of by and its columns by , for and , where . Then, FMs predict an output
from a vector
by(1) |
An important characteristic of (1) is that it considers only combinations of distinct features (i.e., the squared features are ignored). The main advantage of FMs compared to naive polynomial regression is that the number of parameters to estimate is instead of . In addition, we can compute predictions in time^{1}^{1}1We include the constant factor for fair later comparison with arbitrary-order HOFMs. using
(2) |
where indicates element-wise product convex_fm . Given a training set and , and can be learned by minimizing the following non-convex objective
(3) |
where
is a convex loss function and
are hyper-parameters. The popular libfm library libfm implements efficient stochastic gradient and coordinate descent algorithms for obtaining a stationary point of (3). Both algorithms have a runtime complexity ofper epoch.
Higher-order FMs (HOFMs). Although no training algorithm was provided, FMs were extended to higher-order feature combinations in the original work of fm ; libfm . Let , where is the order or degree of feature combinations considered, and is a rank hyper-parameter. Let be the row of . Then -order HOFMs can be defined as
(4) |
where we defined (sum of element-wise products). The objective function of HOFMs can be expressed in a similar way as for (3):
(5) |
where are hyper-parameters. To avoid the combinatorial explosion of hyper-parameter combinations to search, in our experiments we will simply set and . While (4) looks quite daunting, fm_icml recently showed that FMs can be expressed from a simpler kernel perspective. Let us define the ANOVA^{2}^{2}2The name comes from the ANOVA decomposition of functions. wahba_book ; vapnik_book kernel vapnik_book of degree by
(6) |
For later convenience, we also define and . Then it is shown that
(7) |
where is the column of . This perspective shows that we can view FMs and HOFMs as a type of kernel machine whose “support vectors” are learned directly from data. Intuitively, the ANOVA kernel can be thought as a kind of polynomial kernel that uses feature combinations without replacement (i.e., of distinct features). A key property of the ANOVA kernel is multi-linearity fm_icml :
(8) |
where denotes the -dimensional vector with removed and similarly for . That is, everything else kept fixed, is an affine function of . Although no training algorithm was provided, fm_icml showed based on (8) that, although non-convex, the objective function of arbitrary-order HOFMs is convex in and in each row of , separately.
Interpretability of HOFMs. An advantage of FMs and HOFMs is their interpretability. To see why this is the case, notice that we can rewrite (4) as
(9) |
where we defined . Intuitively, is a low-rank
-way tensor which contains the weights of feature combinations of degree
. For instance, when , is the weight of . Similarly to the ANOVA decomposition of functions, HOFMs consider only combinations of distinct features (i.e., for ).This paper. Unfortunately, there exists to date no efficient algorithm for training arbitrary-order HOFMs. Indeed, computing (6) naively takes , i.e., polynomial time. In the following, we present linear-time algorithms. Moreover, HOFMs, as originally defined in fm ; libfm require the estimation of matrices . Thus, HOFMs can produce large models when is large. To address this issue, we propose new variants of HOFMs with shared parameters.
The kernel view presented in Section 2 allows us to focus on the ANOVA kernel as the main “computational unit” for training HOFMs. In this section, we develop dynamic programming (DP) algorithms for evaluating the ANOVA kernel and computing its gradient in only time.
Evaluation. The main observation (see also (kernel_book, , Section 9.2)) is that we can use (8) to recursively remove features until computing the kernel becomes trivial. Let us denote a subvector of by and similarly for . Let us introduce the shorthand . Then, from (8),
(10) |
For convenience, we also define since and since there does not exist any -combination of features in a dimensional vector.
1 | |||||
0 | |||||
0 | |||||
0 |
The quantity we want to compute is . Instead of naively using recursion (10), which would lead to many redundant computations, we use a bottom-up approach and organize computations in a DP table. We start from the top-left corner to initialize the recursion and go through the table to arrive at the solution in the bottom-right corner. The procedure, summarized in Algorithm 1, takes time and memory.
Gradients. For computing the gradient of w.r.t. , we use reverse-mode differentiation autodiff
(a.k.a. backpropagation in a neural network context), since it allows us to compute the entire gradient in a single pass. We supplement each variable
in the DP table by a so-called adjoint , which represents the sensitivity of w.r.t. . From recursion (10), except for edge cases, influences and. Using the chain rule, we then obtain
(11) |
Similarly, we introduce the adjoint . Since influences , we have
(12) |
We can run recursion (11) in reverse order of the DP table starting from . Using this approach, we can compute the entire gradient w.r.t. in time and memory. The procedure is summarized in Algorithm 2.
Stochastic gradient (SG) algorithms. Based on Algorithm 1 and 2, we can easily learn arbitrary-order HOFMs using any gradient-based optimization algorithm. Here we focus our discussion on SG algorithms. If we alternatingly minimize (5) w.r.t , then the sub-problem associated with degree is of the form
(13) |
where are fixed offsets which account for the contribution of degrees other than to the predictions. The sub-problem is convex in each row of fm_icml . A SG update for (13) w.r.t. for some instance can be computed by , where is a learning rate and where we defined . Because evaluating and computing its gradient both take , the cost per epoch, i.e., of visiting all instances, is . When , this is the same cost as the SG algorithm implemented in libfm.
Sparse data. We conclude this section with a few useful remarks on sparse data. Let us denote the support of a vector by and let us define . It is easy to see from (8) that the gradient and have the same support, i.e., . Another useful remark is that , provided that , where is the number of non-zero elements in . Hence, when the data is sparse, we only need to iterate over non-zero features in Algorithm 1 and 2. Consequently, their time and memory cost is only and thus the cost per epoch of SG algorithms is .
We now describe a coordinate descent (CD) solver for arbitrary-order HOFMs. CD is a good choice for learning HOFMs because their objective function is coordinate-wise convex, thanks to the multi-linearity of the ANOVA kernel fm_icml . Our algorithm can be seen as a generalization to higher orders of the CD algorithms proposed in libfm ; fm_icml .
An alternative recursion. Efficient CD implementations typically require maintaining statistics for each training instance, such as the predictions at the current iteration. When a coordinate is updated, the statistics then need to be synchronized. Unfortunately, the recursion we used in the previous section is not suitable for a CD algorithm because it would require to store and synchronize the DP table for each training instance upon coordinate-wise updates. We therefore turn to an alternative recursion:
(14) |
where we defined . Note that the recursion was already known in the context of traditional kernel methods (c.f., (vapnik_book, , Section 11.8)) but its application to HOFMs is novel. Since we know that and , we can use (14) to compute , then , and so on. The overall evaluation cost for arbitrary is .
Coordinate-wise derivatives. We can apply reverse-mode differentiation to recursion (14) in order to compute the entire gradient (c.f., Appendix C). However, in CD, since we only need the derivative of one variable at a time, we can simply use forward-mode differentiation:
(15) |
where . The advantage of (15) is that we only need to cache for . Hence the memory complexity per sample is only instead of for (10).
Use in a CD algorithm. Similarly to fm_icml , we assume that the loss function is -smooth and update the elements of in cyclic order by , where we defined
(16) |
The update guarantees that the objective value is monotonically non-increasing and is the exact coordinate-wise minimizer when is the squared loss. Overall, the total cost per epoch, i.e., updating all coordinates once, is , where is the time it takes to compute (15). Assuming have been previously cached, for , computing (15) takes operations. For fixed , if we unroll the two loops needed to compute (15), modern compilers can often further reduce the number of operations needed. Nevertheless, this quadratic dependency on means that our CD algorithm is best for small , typically .
HOFMs, as originally defined in fm ; libfm , model each degree with separate matrices . Assuming that we use the same rank for all matrices, the total model size of -order HOFMs is therefore . Moreover, even when using our DP algorithm, the cost of computing predictions is . Hence, HOFMs tend to produce large, expensive-to-evaluate models. To reduce model size and prediction times, we introduce two new kernels which allow us to share parameters between each degree: the inhomogeneous ANOVA kernel and the all-subsets kernel. Because both kernels are derived from the ANOVA kernel, they share the same appealing properties: multi-linearity, sparse gradients and sparse-data friendliness.
It is well-known that a sum of kernels is equivalent to concatenating their associated feature maps (kernel_book, , Section 3.4). Let . To combine different degrees, a natural kernel is therefore
(17) |
The kernel uses all feature combinations of degrees up to . We call it inhomogeneous ANOVA kernel, since it is an inhomogeneous polynomial of . In contrast, is homogeneous. The main difference between (17) and (7) is that all ANOVA kernels in the sum share the same parameters. However, to increase modeling power, we allow each kernel to have different weights .
Evaluation. Due to the recursive nature of Algorithm 1, when computing , we also get for free. Indeed, lower-degree kernels are available in the last column of the DP table, i.e., . Hence, the cost of evaluating (17) is time. The total cost for computing is instead of for .
Learning. While it is certainly possible to learn and by directly minimizing some objective function, here we propose an easier solution, which works well in practice. Our key observation is that we can easily turn into by adding dummy values to feature vectors. Let us denote the concatenation of with a scalar by and similarly for . From (8), we easily obtain
(18) |
Similarly, if we apply (8) twice, we obtain:
(19) |
Applying the above to and , we obtain
(20) |
More generally, by adding dummy features to and , we can convert to . Because is learned, this means that we can automatically learn . These weights can then be converted to by “unrolling” recursion (8). Although simple, we show in our experiments that this approach works favorably compared to directly learning and . The main advantage of this approach is that we can use the same software unmodified (we simply need to minimize (13) with the augmented data). Moreover, the cost of computing the entire gradient by Algorithm 2 using the augmented data is just compared to for HOFMs with separate parameters.
We now consider a closely related kernel called all-subsets kernel (kernel_book, , Definition 9.5):
(21) |
The main difference with the traditional use of this kernel is that we learn . Interestingly, it can be shown that , where is the number of non-zero features in . Hence, the kernel uses all combinations of distinct features up to order with uniform weights. Even if is very large, the kernel can be a good choice if each training instance contains only a few non-zero elements. To learn the parameters, we simply substitute with in (13). In SG or CD algorithms, all it entails is to substitute with . For computing , it is easy to verify that and therefore we have
(22) |
Therefore, the main advantage of the all-subsets kernel is that we can evaluate it and compute its gradient in just time. The total cost for computing is only .
Problem setting. We now demonstrate a novel application of HOFMs to predict the presence or absence of links between nodes in a graph. Formally, we assume two sets of possibly disjoint nodes of size and , respectively. We assume features for the two sets of nodes, represented by matrices and . For instance, can represent user features and movie features. We denote the columns of and by and , respectively. We are given a matrix , whose elements indicate presence (positive sample) or absence (negative sample) of link between two nodes and . We denote the number of positive samples by . Using this data, our goal is to predict new associations. Datasets used in our experiments are summarized in Table 2. Note that for the NIPS and Enzyme datasets, .
Dataset | Columns of | Columns of | |||||
---|---|---|---|---|---|---|---|
NIPS nips_data | 4,140 | Authors | 2,037 | 13,649 | |||
Enzyme yamanishi_2005 | 2,994 | Enzymes | 668 | 325 | |||
GD imc | 3,954 | Diseases | 3,209 | 3,209 | Genes | 12,331 | 25,275 |
Movielens 100K movielens | 21,201 | Users | 943 | 49 | Movies | 1,682 | 29 |
Conversion to a supervised problem. We need to convert the above information to a format FMs and HOFMs can handle. To predict an element of , we simply form to be the concatenation of and and feed this to a HOFM in order to compute a prediction . Because HOFMs use feature combinations in , they can learn the weights of feature combinations between and . At training time, we need both positive and negative samples. Let us denote the set of positive and negative samples by . Then our training set is composed of pairs, where .
Models compared.
[topsep=0pt,itemsep=1ex,partopsep=0ex,parsep=0.3ex,leftmargin=2ex]
HOFM-shared: . We learn and using the simple augmented data approach described in Section 5.1 (HOFM-shared-augmented). Inspired by SimpleMKL simple_mkl , we also report results when learning and directly by minimizing subject to and (HOFM-shared-simplex).
All-subsets: . As explained in Section 5.2, this model is equivalent to the HOFM-shared model with and .
Polynomial network: . This model can be thought as factorization machine variant that uses a polynomial kernel instead of the ANOVA kernel (c.f., livni ; fm_icml ; tensor_machines ).
Low-rank bilinear regression: , where and . Such model was shown to work well for link prediction in link_prediction and imc . We learn and by minimizing .
Experimental setup and evaluation. In this experiment, for all models above, we use CD rather than SG to avoid the tuning of a learning rate hyper-parameter. We set
to be the squared loss. Although we omitted it from our notation for clarity, we also fit a bias term for all models. We evaluated the compared models using the area under the ROC curve (AUC), which is the probability that the model correctly ranks a positive sample higher than a negative sample. We split the
positive samples into for training and for testing. We sample the same number of negative samples as positive samples for training and use the rest for testing. We chose from by cross-validation and following link_prediction we empirically set . Throughout our experiments, we initialized the elements of randomly by .NIPS | Enzyme | GD | Movielens 100K | |
HOFM () | 0.856 | 0.880 | 0.717 | 0.778 |
HOFM () | 0.875 | 0.888 | 0.717 | 0.786 |
HOFM () | 0.874 | 0.887 | 0.717 | 0.786 |
HOFM () | 0.874 | 0.887 | 0.717 | 0.786 |
HOFM-shared-augmented () | 0.858 | 0.876 | 0.704 | 0.778 |
HOFM-shared-augmented () | 0.874 | 0.887 | 0.704 | 0.787 |
HOFM-shared-augmented () | 0.836 | 0.824 | 0.663 | 0.779 |
HOFM-shared-augmented () | 0.824 | 0.795 | 0.600 | 0.621 |
HOFM-shared-simplex () | 0.716 | 0.865 | 0.721 | 0.701 |
HOFM-shared-simplex () | 0.777 | 0.870 | 0.721 | 0.709 |
HOFM-shared-simplex () | 0.758 | 0.870 | 0.721 | 0.709 |
HOFM-shared-simplex () | 0.722 | 0.869 | 0.721 | 0.709 |
All-subsets | 0.730 | 0.840 | 0.721 | 0.714 |
Polynomial network () | 0.725 | 0.879 | 0.721 | 0.761 |
Polynomial network () | 0.789 | 0.853 | 0.719 | 0.696 |
Polynomial network () | 0.782 | 0.873 | 0.717 | 0.708 |
Polynomial network () | 0.543 | 0.524 | 0.648 | 0.501 |
Low-rank bilinear regression | 0.855 | 0.694 | 0.611 | 0.718 |
Results are indicated in Table 3. Overall the two best models were HOFM and HOFM-shared-augmented, which achieved the best scores on 3 out of 4 datasets. The two models outperformed low-rank bilinear regression on 3 out 4 datasets, showing the benefit of using higher-order feature combinations. HOFM-shared-augmented achieved similar accuracy to HOFM, despite using a smaller model. Surprisingly, HOFM-shared-simplex did not improve over HOFM-shared-augmented except on the GD dataset. We conclude that our augmented data approach is convenient yet works well in practice. All-subsets and polynomial networks performed worse than HOFM and HOFM-shared-augmented, except on the GD dataset where they were the best. Finally, we observe that HOFM were quite robust to increasing , which is likely a benefit of modeling each degree with a separate matrix.
We compared AdaGrad adagrad , L-BFGS and coordinate descent (CD) for minimizing (13) when varying the degree on the NIPS dataset with and . We constructed the data in the same way as explained in the previous section and added dummy features, resulting in sparse samples of dimension . For AdaGrad and L-BFGS, we computed the (stochastic) gradients using Algorithm 2. All solvers used the same initialization.
Results are indicated in Figure 1. We see that our CD algorithm performs very well when but starts to deteriorate when , in which case L-BFGS becomes advantageous. As shown in Figure 1 d), the cost per epoch of AdaGrad and L-BFGS scales linearly with , a benefit of our DP algorithm for computing the gradient. However, to our surprise, we found that AdaGrad is quite sensitive to the learning rate . AdaGrad diverged for and the largest value to work well was . This explains why AdaGrad did not outperform CD despite the lower cost per epoch. In the future, it would be useful to create a CD algorithm with a better dependency on .
In this paper, we presented the first training algorithms for HOFMs and introduced new HOFM variants with shared parameters. A popular way to deal with a large number of negative samples is to use an objective function that directly maximize AUC link_prediction ; rendle_bpr . This is especially easy to do with SG algorithms because we can sample pairs of positive and negative samples from the dataset upon each SG update. We therefore expect the algorithms developed in Section 3 to be especially useful in this setting. Recently, difacto proposed a distributed SG algorithm for training second-order FMs. It should be straightforward to extend this algorithm to HOFMs based on our contributions in Section 3. Finally, it should be possible to integrate Algorithm 1 and 2
into a deep learning framework such as TensorFlow
tensorflow , in order to easily compose ANOVA kernels with other layers (e.g., convolutional).TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence
, pages 452–461, 2009.[topsep=0pt,itemsep=1ex,partopsep=0ex,parsep=1ex,leftmargin=2ex]
NIPS: co-author graph of authors at the first twelve editions of NIPS, obtained from [17]. For this dataset, as well as the Enzyme dataset below, we have . The co-author graph comprises authors represented by bag-of-words vectors of dimension (words used by authors in their publications). The number of positive samples is .
Enzyme: metabolic network obtained from [21]. The network comprises enzymes represented by three sets of features: a -dimensional vector of phylogenetic information, a -dimensional vector of gene expression information and a -dimensional vector of gene location information. We concatenate the three sets of information to form feature vectors of dimension . Original enzyme similarity scores are between and
. We binarize the scores using
as threshold. The resulting number of positive samples is .GD: human gene-disease association data obtained from [10]. The bipartite graph is comprised of diseases and genes. We represent each disease using a vector of dimensions, whose elements are similarity scores obtained from MimMiner. The study [10] also used bag-of-words vectors describing each disease but we found these to not help improve performance both for baselines and proposed methods. We represent each gene using a vector of features, which are the concatenation of similarity scores obtained from HumanNet and gene-phenotype associations from 8 other species. See [10] for a detailed description of the features. The number of positive samples is .
Movielens 100K: recommender system data obtained from [6]. The bipartite graph is comprised of users and
movies. For users, we convert age, gender, occupation and living area (first digit of zipcode) to a binary vector using a one-hot encoding. For movies, we use the release year and genres. The resulting vectors are of dimension
and , respectively. Original ratings are between and . We binarize the ratings using as threshold, resulting in positive samples.We also compared AdaGrad, L-BFGS and coordinate descent (CD) on the Enzyme, Gene-Disease (GD) and Movielens 100K datasets. Results are indicated in Figure 2, 3 and 4, respectively.
As we explained in Section 5.2, the all-subsets kernel can be a good choice if the number of non-zero elements per sample is small. To verify this assumption, we ran experiments on two recommender system tasks: Movielens 1M and Last.fm. We used the exact same setting as in [4, Section 9.3]. For each rating , the corresponding was set to the concatenation of the one-hot encodings of the user and item indices. We compared the following models:
FM:
FM-augmented: where
All-subsets:
Polynomial networks: (c.f. [4] for more details)
Results are indicated in Figure 5. We see that All-subsets performs relatively well on these tasks.
We now describe how to apply reverse-mode differentiation to the alternative recursion (14) in order to compute the entire gradient efficiently. Let us introduce the shorthands and . We can then write the recursion as
(23) |
For concreteness, let us illustrate the recursion for . We have
(24) |
We see that influences , and influences and . Likewise, influences , influences and , and influences , and . Let us denote the adjoints and . For general , summing over quantities that influences and , we obtain
(25) |
Let us denote the adjoint of by . We know that directly influences only and therefore
(26) |
Assuming that and have been previously computed, which takes , the procedure for computing the gradient can be summarized as follows:
Initialize
Comments
There are no comments yet.