We propose the convex factorization machine (CFM), which is a convex variant of the widely used Factorization Machines (FMs). Specifically, we employ a linear+quadratic model and regularize the linear term with the $\ell_2$-regularizer and the quadratic term with the trace norm regularizer. Then, we formulate the CFM optimization as a semidefinite programming problem and propose an efficient optimization procedure with Hazan's algorithm. A key advantage of CFM over existing FMs is that it can find a globally optimal solution, while FMs may get a poor locally optimal solution since the objective function of FMs is non-convex. In addition, the proposed algorithm is simple yet effective and can be implemented easily. Finally, CFM is a general factorization method and can also be used for other factorization problems, including multi-view matrix factorization and tensor completion problems. Through synthetic and Movielens datasets, we first show that the proposed CFM achieves results competitive with FMs. Furthermore, in a toxicogenomics prediction task, we show that CFM outperforms a state-of-the-art tensor factorization method.
In recommendation tasks, including movie recommendation and news article recommendation, the data are represented in matrix form, $A \in \mathbb{R}^{|U| \times |I|}$, where $A$ is extremely sparse. Matrix factorization (MF), which imputes missing entries of a matrix under a low-rank constraint, is widely used in recommendation systems for news recommendation, protein-protein interaction prediction, transfer learning, social media user modeling, multi-view learning, and modeling text document collections, among others [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].

Recently, a general framework of MF called the factorization machines (FMs) has been proposed [11, 12, 13]. FMs have been applied to many regression and classification problems, including the display advertising challenge (https://www.kaggle.com/c/criteo-display-ad-challenge), and they show state-of-the-art performance. The key contribution of FMs is that they reformulate recommendation problems as regression problems, where the input $x \in \mathbb{R}^d$ is a feature vector that indicates the $k$-th user and the $k'$-th item, and the output $y$ is the rating of the user-item pair:
$$x = [\underbrace{0 \cdots 0\ 1\ 0 \cdots 0}_{|U|}\ \underbrace{0 \cdots 0\ 1\ 0 \cdots 0}_{|I|}]^\top \in \mathbb{R}^d, \qquad y = [A]_{k,k'}.$$
Here, $d = |U| + |I|$ is the dimensionality of $x$, $[A]_{k,k'}$ is the score of the $k$-th user and $k'$-th item, and $n$ is the number of non-zero elements. The goal of FMs is to find a model that predicts $y$ given an input $x$.
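For FMs, the following linear + feature interaction model is employed:
$$f(x; w_0, w, G) = w_0 + w^\top x + \sum_{\ell=1}^{d} \sum_{\ell'=\ell+1}^{d} g_\ell^\top g_{\ell'}\, x_\ell x_{\ell'},$$
where $w_0 \in \mathbb{R}$, $w \in \mathbb{R}^d$, and $G = [g_1, \ldots, g_d] \in \mathbb{R}^{m \times d}$ are model parameters ($m \ll d$). Since only the $k$-th user and $k'$-th item elements of the input vector are non-zero, the model can also be written as
$$f(x; w_0, w, G) = w_0 + w_k + w_{|U|+k'} + g_k^\top g_{|U|+k'},$$
where $|U|$ is the number of users. This is equivalent to the matrix factorization model with global, user, and item biases. Moreover, since FMs solve the matrix completion problem through regression, it is easy to utilize side information, such as user and article meta-information, by simply concatenating it to $x$.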
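The equivalence stated above can be checked numerically. The following is a minimal sketch, assuming the standard FM model (bias, linear term, and pairwise interactions weighted by inner products of latent vectors); the toy sizes, the random parameters, and the helper name `fm_predict` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def fm_predict(x, w0, w, G):
    """FM output: w0 + w^T x + sum_{l<l'} (g_l^T g_l') x_l x_l'.

    Uses the identity sum_{l<l'} g_l^T g_l' x_l x_l'
        = 0.5 * (||G x||^2 - sum_l ||g_l||^2 x_l^2).
    """
    Gx = G @ x                      # sum_l g_l x_l, shape (m,)
    sq = (G ** 2) @ (x ** 2)        # per-factor sum_l G[j, l]^2 x_l^2, shape (m,)
    return w0 + w @ x + 0.5 * np.sum(Gx ** 2 - sq)

rng = np.random.default_rng(0)
n_users, n_items, m = 4, 5, 3       # toy sizes (assumed)
d = n_users + n_items
w0 = 0.1
w = rng.normal(size=d)
G = rng.normal(size=(m, d))

k, k2 = 2, 3                        # 0-based user and item indices
x = np.zeros(d)
x[k] = 1.0                          # one-hot user block
x[n_users + k2] = 1.0               # one-hot item block

fm_value = fm_predict(x, w0, w, G)
# Matrix factorization with global, user, and item biases:
mf_value = w0 + w[k] + w[n_users + k2] + G[:, k] @ G[:, n_users + k2]
```

For a one-hot user/item input the two values agree, which is why side information can be added simply by appending extra columns to `x`.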
For regression problems, the model parameters are estimated by solving the following optimization problem:
$$\min_{w_0, w, G}\ \sum_{i=1}^{n} \big(y_i - f(x_i; w_0, w, G)\big)^2 + \lambda_1 w_0^2 + \lambda_2 \|w\|_2^2 + \lambda_3 \|G\|_F^2,$$
where $f(x_i; w_0, w, G)$ is the FM prediction for the $i$-th sample, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are regularization parameters, and $\|\cdot\|_F$ is the Frobenius norm. In [12], stochastic gradient descent (SGD), alternating least squares (ALS), and Markov Chain Monte Carlo (MCMC) based approaches were proposed. These optimization approaches work well in practice if the regularization parameters and the initial solution are set appropriately. However, since the loss function is non-convex with respect to $G$, the optimization can converge to a poor local optimum (mode). The MCMC-based approach tends to obtain a better solution than ALS and SGD. However, it requires running the sampler long enough to explore different local modes.

In this paper, we propose the convex factorization machine (CFM). We employ the linear+quadratic model, Eq. (1), and estimate the linear parameter $w$ and the quadratic parameter $W$ such that the squared loss between the output and the model prediction is minimized. More specifically, we regularize the linear parameter with the $\ell_2$-regularizer and the quadratic parameter with the trace norm regularizer. Then, we formulate the CFM optimization problem as a semidefinite programming problem and solve it with Hazan’s algorithm [14], which is a Frank-Wolfe algorithm [15, 16]. A key advantage of the proposed method over existing FMs is that CFM can find a globally optimal solution, while FMs can get poor locally optimal solutions. Moreover, our proposed CFM framework is a general variant of convex matrix factorization with nuclear norm regularization, and the CFM algorithm is simple and can be implemented easily. Finally, since CFM is a general factorization framework, it can be easily applied to other factorization problems, including multi-view factorization problems [17]. We demonstrate the effectiveness of the proposed method first through synthetic and real-world datasets. Then, we show that the proposed method outperforms a state-of-the-art multi-view factorization method on toxicogenomics data.
Contribution: The contributions of this paper are summarized below:
We formulate the FM problem as a semidefinite programming problem, which is a convex formulation.
We show that the proposed CFM framework includes the matrix factorization with nuclear norm regularization [18] as a special case.
We propose a simple yet efficient optimization procedure for the semidefinite programming problem using Hazan’s algorithm [14].
We apply the proposed CFM to a toxicogenomics prediction task, where it outperforms a state-of-the-art method.
In this section, we propose the convex factorization machine (CFM) for regression problems.
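We suppose that we are given $n$ independent and identically distributed (i.i.d.) paired samples $\{(x_i, y_i)\}_{i=1}^{n}$, with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, drawn from a joint distribution with density $p(x, y)$. We denote $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ as the input data and $y = [y_1, \ldots, y_n]^\top \in \mathbb{R}^n$ as the output real-valued vector.

The goal of this paper is to find a model that predicts $y$ given an input $x$.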
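We employ the following model:
$$f(x; w_0, w, W) = w_0 + w^\top x + \sum_{\ell=1}^{d} \sum_{\ell'=\ell+1}^{d} [W]_{\ell,\ell'}\, x_\ell x_{\ell'} = w_0 + w^\top x + \frac{1}{2} \mathrm{tr}\big(W (x x^\top - \mathrm{diag}(x \circ x))\big), \tag{1}$$
where $w_0 \in \mathbb{R}$, $w \in \mathbb{R}^d$, $W \in \mathbb{R}^{d \times d}$ is a positive semi-definite matrix, $\mathrm{tr}(\cdot)$ is the trace operator, $\circ$ is the elementwise product, and $\mathrm{diag}(x \circ x) \in \mathbb{R}^{d \times d}$ is the diagonal matrix with $x \circ x$ on its diagonal. The difference between the FMs model and Eq. (1) is that the interaction weight $g_\ell^\top g_{\ell'}$ is parametrized as $[W]_{\ell,\ell'}$.
The model can equivalently be written as
$$f(x; w_0, w, W) = \begin{bmatrix} w_0 & w^\top & \mathrm{vec}(W)^\top \end{bmatrix} \begin{bmatrix} 1 \\ x \\ \frac{1}{2} \mathrm{vec}(x x^\top - \mathrm{diag}(x \circ x)) \end{bmatrix},$$
where $\mathrm{vec}(\cdot)$ is the vectorization operator. Since the model is linear in its parameters, the optimization problem is jointly convex with respect to both $w$ and $W$ if we employ a convex loss function such as the squared loss or the logistic loss.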
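As an illustration of why the model is linear in its parameters, the sketch below evaluates the linear+quadratic model in three ways: as an explicit sum over feature pairs, in the trace form with $x x^\top - \mathrm{diag}(x \circ x)$, and as a single inner product between a stacked parameter vector and a stacked feature vector. The function names and random toy data are assumptions made for this example only.

```python
import numpy as np

def cfm_pairwise(x, w0, w, W):
    """Explicit sum over feature pairs l < l'."""
    d = x.shape[0]
    inter = sum(W[l, lp] * x[l] * x[lp]
                for l in range(d) for lp in range(l + 1, d))
    return w0 + w @ x + inter

def cfm_trace(x, w0, w, W):
    """Trace form: 0.5 * tr(W (x x^T - diag(x o x)))."""
    M = np.outer(x, x) - np.diag(x * x)
    return w0 + w @ x + 0.5 * np.trace(W @ M)

def cfm_linear(x, w0, w, W):
    """Inner product between stacked parameters and stacked features."""
    M = np.outer(x, x) - np.diag(x * x)
    theta = np.concatenate(([w0], w, W.ravel()))
    phi = np.concatenate(([1.0], x, 0.5 * M.ravel()))
    return theta @ phi

rng = np.random.default_rng(1)
d = 6
B = rng.normal(size=(d, 2))
W = B @ B.T                         # symmetric positive semi-definite, rank 2
x = rng.normal(size=d)
w = rng.normal(size=d)
w0 = 0.3

vals = np.array([cfm_pairwise(x, w0, w, W),
                 cfm_trace(x, w0, w, W),
                 cfm_linear(x, w0, w, W)])
```

All three values coincide for symmetric `W`, so any convex loss of the prediction is convex in `(w0, w, W)` jointly.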
We formulate the optimization problem of CFM as a semidefinite programming problem:
s.t. | (2) |
where
and and are regularization parameters. $\|W\|_{\mathrm{tr}}$ is the trace norm of the interaction matrix $W$, defined as
$$\|W\|_{\mathrm{tr}} = \sum_{i} \sigma_i,$$
where $\sigma_i$ is the $i$-th singular value of $W$. The trace norm is also referred to as the nuclear norm [19]. Since the singular values are non-negative, the trace norm can be regarded as the $\ell_1$ norm on the singular values. Thus, by imposing the trace norm, we can make $W$ low-rank.

To derive a simple yet effective optimization algorithm, we first eliminate the linear parameters from the optimization problem Eq. (2) and convert the problem to a convex optimization problem with respect to the interaction matrix $W$. Specifically, we take the derivative of the objective function with respect to the linear parameters and obtain an analytical solution for them:
where
is the model corresponding to the quadratic term of such that , ,
is the identity matrix. Note that,
depends on the unknown parameter .Plugging back into the objective function of Eq.(2.3), we can rewrite the objective function as
(3) |
where
, , and .
Once is obtained by solving Eq. (3), we can get the estimated linear parameter as
Relation to Matrix Factorization with Nuclear Norm Regularization: The constraint on can be written as
where , , and . Furthermore, for the CFM setting, the -th user and -th item rating is modeled as
Based on Lemma 1, the optimization problem Eq. (2.3) is equivalent to
(4) |
where
and $\Omega$ is the set of observed values in the user-item matrix. If we set the bias and linear terms to zero, the optimization problem is equivalent to matrix factorization with nuclear norm regularization [18]; that is, CFM includes convex matrix factorization as a special case. Since we would like to have a low-rank user-item matrix for recommendation, Eq. (2) is a natural formulation for convex FMs. Note that, even though CFM resembles the matrix factorization of [18], the MF method cannot incorporate side information, while CFM can handle side information by concatenating it to the input vector. That is, the MF method [18] and CFM are intrinsically different.
For optimizing the interaction matrix, we adopt Hazan’s algorithm [14]. It only needs to compute a leading eigenvector of a sparse matrix in each iteration, and thus it scales well to large problems. Moreover, the proposed CFM update formula is extremely simple, and hence useful for practitioners. Hazan’s algorithm for CFM is summarized in Algorithm 1.

Derivative computation: The objective function can be equivalently written as
Then, is given as
where we use . Since the derivative is written as
, the eigenvalue decomposition can be obtained without storing
in memory. Moreover, since the matrix is a sparse matrix, we can efficiently obtain the leading eigenvector by the Lanczos method. We can use a standard eigenvalue decomposition package to compute the approximate eigenvector by the ”approxEV” function. For example in Matlab, we can obtain the approximate eigenvector by the function ), where is the corresponding eigenvalue.The proposed CFM optimization requires a matrix inversion (i.e., ) for computing in , and it is not feasible if the dimensionality is large. For example in user-item recommendation task, the total dimensionality of the input can be the number of users + the number of items. In such cases, the dimensionality can be or more. However, fortunately, the input matrix is extremely sparse, and we can efficiently compute by using a conjugate gradient method whose time complexity is .
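The leading-eigenvector step described above can be sketched without ever forming the $d \times d$ matrix. This is a hedged illustration, not the authors' Matlab code: it assumes the matrix has the form $X\,\mathrm{diag}(r)\,X^\top$ with a sparse input matrix $X$ and a per-sample weight vector $r$ (hypothetical names, since the exact expression is not recoverable here), and it uses SciPy's `eigsh`, a Lanczos-type solver, in place of the Matlab routine mentioned in the text.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(3)
d, n = 50, 200                                   # toy sizes (assumed)
# Sparse inputs, one column per sample.
X = sp.random(d, n, density=0.05, random_state=3, format="csr")
r = rng.normal(size=n)                           # hypothetical per-sample weights

def matvec(v):
    # (X diag(r) X^T) v evaluated right to left; only sparse
    # matrix-vector products, no d x d matrix is stored.
    v = np.ravel(v)
    return X @ (r * (X.T @ v))

op = LinearOperator((d, d), matvec=matvec, dtype=np.float64)
# Largest algebraic eigenvalue and its eigenvector.
vals, vecs = eigsh(op, k=1, which="LA")
lam, v = vals[0], vecs[:, 0]
```

Each matrix-vector product costs time proportional to the number of non-zeros in `X`, which is what makes the eigenvector step affordable for sparse data.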
can be written as
where . Since the number of samples tends to be larger than the dimensionality in factorization machine settings, becomes full rank. Namely, we can safely make the regularization parameter . In such case, is given as
where we use . The is obtained by solving
(5) |
where can be efficiently obtained by a conjugate gradient method with time complexity . Thus, we can compute without computing the matrix inverse . To further speed up conjugate gradient method, we use a preconditioner and the previous solution as the initial solution.
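To illustrate how a conjugate gradient solver avoids the explicit matrix inverse, the sketch below solves a regularized normal-equation system of the form $(X X^\top + \lambda I)\, w = b$ using only sparse matrix-vector products. The system, the value of `lam`, and the variable names are assumptions chosen for this example; the preconditioner mentioned in the text is omitted, and the warm start is shown only as the `x0` argument.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

d, n, lam = 40, 300, 1.0                         # toy sizes and ridge weight (assumed)
X = sp.random(d, n, density=0.05, random_state=4, format="csr")
rng = np.random.default_rng(4)
b = X @ rng.normal(size=n)                       # right-hand side, shape (d,)

def matvec(v):
    # (X X^T + lam * I) v without forming X X^T.
    v = np.ravel(v)
    return X @ (X.T @ v) + lam * v

A = LinearOperator((d, d), matvec=matvec, dtype=np.float64)

# x0 would be the previous iterate's solution in an iterative outer loop;
# here we simply start from zero.
w, info = cg(A, b, x0=np.zeros(d))               # info == 0 signals convergence
```

In an outer loop, passing the previous solution as `x0` typically reduces the number of inner iterations, which matches the warm-start idea described above.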
Finally, we compute as
The diagonal elements of are the differences between the observed outputs and the model predictions at the -th iteration. Note that, in our CFM optimization, we eliminate and only optimize for ; however, the is implicitly estimated in Hazan’s algorithm.
Complexity: Each iteration of Algorithm 1 includes computing an approximate leading eigenvector of a sparse matrix with non-zero elements and an estimation of , which require computation using the Lanczos algorithm and computation using the conjugate gradient method, respectively. Thus, the entire computational complexity of the proposed method is , where is the total number of iterations in Hazan’s algorithm.
Optimal step size estimation: Hazan’s algorithm is guaranteed to converge to a global optimum when the step size prescribed in [18] is used. However, this step size is slow to converge in practice. Instead, we choose the step size that maximally decreases the objective function. The optimal step size can be obtained by solving the following equation:
Taking the derivative with respect to and solving the problem for , we have
(6) |
The computation of involves the matrix inversion of . However, by using the same technique as in the derivative computation, we can efficiently compute .
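The interplay between the rank-one update and an exact step size can be seen on a toy problem. The sketch below is not the CFM objective: it minimizes $\|W - A\|_F^2$ over positive semi-definite matrices with a fixed trace `eta`, using a Frank-Wolfe (Hazan-style) update with a closed-form line search. The objective, the dense `eigh` call (used here instead of an approximate Lanczos solver), and the name `hazan_toy` are all assumptions made for illustration.

```python
import numpy as np

def hazan_toy(A, eta, n_iter=200):
    """Frank-Wolfe on {W PSD, tr(W) = eta} for f(W) = ||W - A||_F^2."""
    d = A.shape[0]
    W = np.eye(d) * (eta / d)                    # feasible start: PSD with trace eta
    history = []
    for _ in range(n_iter):
        grad = 2.0 * (W - A)
        _, vecs = np.linalg.eigh(-grad)
        v = vecs[:, -1]                          # leading eigenvector of -grad
        D = eta * np.outer(v, v) - W             # direction towards the rank-one vertex
        denom = np.sum(D * D)
        if denom < 1e-15:
            break
        # Exact minimizer of f(W + alpha * D) over alpha, clipped to [0, 1].
        alpha = np.clip(np.sum((A - W) * D) / denom, 0.0, 1.0)
        W = W + alpha * D
        history.append(np.sum((W - A) ** 2))
    return W, history

rng = np.random.default_rng(5)
B = rng.normal(size=(6, 2))
A = B @ B.T                                      # target: PSD, rank 2
eta = np.trace(A)                                # so that A itself is feasible
W, history = hazan_toy(A, eta)
```

Because every iterate is a convex combination of trace-`eta` positive semi-definite matrices, feasibility is maintained automatically and no projection step is needed.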
Update : When the input dimension is large, storing the feature-feature interaction matrix is not possible. To avoid the memory problem, we update as
where . Thus, we only need to store and at the -th iteration. In practice, Hazan’s algorithm converges with (see experiment section), so the required memory for Hazan’s algorithm is reasonable.
Prediction: Let us define such that . Then, we can efficiently compute the output as
The time complexities of the terms are , , and , respectively.
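A prediction routine that never materializes the interaction matrix might look as follows. This is a sketch under stated assumptions: the interaction matrix is taken to be a weighted sum of stored rank-one terms, $W = \sum_t c_t v_t v_t^\top$, and the model is the linear+quadratic form with the diagonal contribution removed. The names `V`, `c`, and `cfm_predict_lowrank` are hypothetical.

```python
import numpy as np

def cfm_predict_lowrank(x, w0, w, V, c):
    """Prediction with W = sum_t c[t] * V[:, t] V[:, t]^T kept in factored form."""
    proj = V.T @ x                               # v_t^T x for every stored vector, shape (T,)
    full = np.sum(c * proj ** 2)                 # x^T W x
    diag = np.sum(c * ((V ** 2).T @ (x ** 2)))   # sum_l [W]_ll x_l^2
    return w0 + w @ x + 0.5 * (full - diag)

rng = np.random.default_rng(6)
d, T = 8, 3                                      # toy sizes (assumed)
V = rng.normal(size=(d, T))
c = rng.uniform(0.1, 1.0, size=T)                # non-negative weights keep W PSD
x = rng.normal(size=d)
w = rng.normal(size=d)
w0 = -0.2

pred = cfm_predict_lowrank(x, w0, w, V, c)

# Dense reference, for checking only.
W = (V * c) @ V.T
dense = w0 + w @ x + 0.5 * np.trace(W @ (np.outer(x, x) - np.diag(x * x)))
```

Memory then grows with the number of stored vectors rather than with the square of the input dimension.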
First of all, the same problem setting as in our work has been addressed quite recently and independently of our work [20]. The key difference between the proposed method and [20] is that our approach is based on a single convex optimization problem for the interaction term. The approach of [20] uses a block-coordinate descent (BCD) algorithm for optimization, optimizing the linear and quadratic terms alternately. That is, they alternately solve the following two update equations until convergence:
while our proposed approach is simply given as
Hence, the BCD algorithm needs to iterate the sub-problem for until convergence for obtaining the globally optimal solution.
Let us employ an algorithm for the trace norm minimization in BCD; then the entire complexity is where is the BCD iteration and is the iteration of the sub-problem. On the other hand, our algorithm’s complexity is . Another difference is that our optimization approach includes the matrix factorization with nuclear norm regularization as a special case, while it is unclear whether the same holds for the formulation [20]. Finally, our CFM approach is very easy to implement; the core part of the proposed algorithm can be written within 20 lines in Matlab. Note, the BCD based approach is more general than our CFM framework; it can be used for other loss functions such as logistic loss and it does not require the positive definiteness condition for .
The convex variant of matrix factorization has been widely studied in the machine learning community [21, 22, 23, 24, 25, 26, 27, 28]. The key idea of the convex approach is to use the trace norm regularizer, and the optimization problem is given as
(7) |
where is the set of observed value in , if and 0 otherwise, and is the Frobenius norm. Since Eq.(7) and Eq.(4) are equivalent when , the convex matrix factorization can be regarded as a special case of CFM.
To optimize Eq. (7), the singular value thresholding (SVT) method has been proposed [29, 30], where SVT converges faster in ( is an approximate error). However, the SVT approach requires computing a full eigenvalue decomposition, which is computationally expensive for large datasets. To deal with large data, Frank-Wolfe based approaches have been proposed, including Hazan’s algorithm [18], corrective refitting [31], and active subspace selection [32]. However, these approaches cannot incorporate user and item biases. Furthermore, it is not straightforward to incorporate side information to deal with cold start problems (i.e., recommending an item to a user who has no click information).
To handle cold start problems, collective matrix factorization (collective MF) has been proposed [33]. The key idea of collective MF is to incorporate side information into matrix factorization. More specifically, we prepare a user × user-meta matrix (e.g., gender, age, etc.) and an item × item-meta matrix (e.g., item category, item title, etc.) in addition to a user-item matrix. Then, we factorize all the matrices together. A convex variant of collective MF called convex collective matrix factorization (CCMF) has been proposed [34]. CCMF employs the convex collective norm, which is a generalization of the trace norm to several matrices. Recently, Hazan’s algorithm was introduced to CCMF [9]. More importantly, it has been theoretically justified that CCMF can give better performance in cold start settings. Since FMs can incorporate side information, FMs and CCMF are closely related. In fact, CFM can utilize side information and can learn the user and item bias terms together; it can be regarded as a generalized variant of CCMF.
We evaluate the proposed method on one synthetic dataset, Movielens data (single matrix), and toxicogenomics data (two-view tensor).
In this paper, we compare CFM with ridge regression, FM (SGD), FM (MCMC), and FM (ALS), where FM (MCMC) is a state-of-the-art FM optimization method. Ridge regression corresponds to the factorization machine with only the linear term, which is also a strong baseline. To estimate FM models, we use the publicly available libFM package (http://www.libfm.org). For all experiments, the number of latent dimensions of FMs is set to 20, which performs well in practice. For FM (ALS), we experimentally set the regularization parameters as and . The initial matrices (for CFM) and (for FMs) are randomly set (this is the default setting of the libFM package). For CFM, we implemented the algorithm in Matlab. We experimentally set , and it works for our experiments. For all experiments, we use a server with a 16-core 1.6GHz CPU and 24GB of memory.

When evaluating the performance of CFM and FMs, we use the root mean squared error (RMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n_{\mathrm{test}}} \sum_{i=1}^{n_{\mathrm{test}}} \big(y_i^{*} - \hat{y}_i\big)^2},$$
where $y_i^{*}$ and $\hat{y}_i$ are the true and estimated target values, respectively, and $n_{\mathrm{test}}$ is the number of test samples.
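For concreteness, the RMSE metric can be computed as in the short sketch below; the function name `rmse` and the toy inputs are assumptions for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and estimated targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

perfect = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # identical vectors
off = rmse([0.0, 0.0], [3.0, 4.0])                 # sqrt((9 + 16) / 2)
```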
First, we illustrate how the proposed CFM behaves using a synthetic dataset.
In this experiment, we randomly generate input vectors as , and output values as
where
We use samples for training and samples for testing. We run the experiments times with randomly selected training and test samples and report the average RMSE scores. Figure 1 shows the test RMSE for CFM and FMs. As can be seen, the proposed CFM achieves the lowest RMSE values with a small number of iterations, while FMs need many iterations to obtain reasonable performance.
Next, we evaluate our proposed method on the Movielens 100K, 1M, 10M, and 20M datasets [35] (see Table 1 for dataset details). In these experiments, we randomly split the observations into 75% for training and 25% for testing. We run the recommendation experiments on three random splits, which is the same experimental setting as in [20], and report the average RMSE score.
Dataset | Users | Items | Dimensionality (users + items) | Ratings |
---|---|---|---|---|
Movielens 100K | 943 | 1,682 | 2,625 | 100,000 |
Movielens 1M | 6,040 | 3,900 | 9,940 | 1,000,209 |
Movielens 10M | 82,248 | 10,681 | 92,929 | 10,000,054 |
Movielens 20M | 138,493 | 27,278 | 165,771 | 20,000,263 |
Dataset | CFM | CFM (BCD) | Ridge | ||||
---|---|---|---|---|---|---|---|
100K | 0.915 | 0.93 | 1.078 | 1.242 | 0.905 | 0.901 | 0.936 |
1M | 0.866 | 0.85 | 0.943 | 0.981 | 0.877 | 0.846 | 0.899 |
10M | 0.810 | 0.82 | 0.827 | 0.873 | 0.831 | 0.778 | 0.855 |
20M | 0.802 | n/a | 0.821 | 0.852 | 0.803 | 0.768 | 0.850 |
For CFM, the regularization parameter is experimentally set to (for 100K), (for 1M), (for 10M), and (for 20M), respectively. For FMs, the rank is set to , which gives overall good performance. To investigate the effect of the initialization parameter, we initialize FM (MCMC) with two parameters and
, which are the standard deviation of the random variable for initializing
. We also report the RMSE of the CFM method of [20] for reference.

Figure 2 shows the training and test RMSE of CFM with the optimal step size and CFM with the default step size for the Movielens datasets. For both methods, the training and test RMSE converge within a small number of iterations. Overall, the optimal step size based approach converges faster than the one based on the default step size. Figure 3 shows the RMSE over computational time (seconds). For large datasets, CFM achieves reasonable performance in less than an hour. In Table 2, we show the RMSE comparison of the proposed CFM with FMs. As we expected, CFM compares favorably with FM (SGD) and FM (ALS), since FM (SGD) and FM (ALS) can easily be trapped at poor locally optimal solutions. Moreover, our CFM method also compares favorably with CFM (BCD) [20]. On the other hand, FM (MCMC) can obtain better performance than the CFMs (both our formulation and [20]) on these datasets if we set an appropriate initialization parameter. This is because MCMC tends to avoid poor locally optimal solutions if we run the sampler long enough. That is, since the objective function of FMs is non-convex and has more flexibility than the convex formulation, it can converge to a better solution than CFM if we initialize FMs well.
Next, we evaluated our proposed method on a toxicogenomics dataset [17]. The dataset contains three sets of matrices representing gene expression and toxicity responses of a set of drugs. The first set, called Gene Expression, represents the differential expression of 1106 genes in three different cancer types in response to a collection of 78 drugs (i.e., ). The second set, Toxicity, contains three dose-dependent toxicity profiles of the corresponding 78 drugs over the three cancers (i.e., ). The gene expression data of the three cancers (Blood, Breast, and Prostate) come from the Connectivity Map [36] and were processed to obtain differential expression of treatment vs. control. As a result, the expression scores represent positive or negative regulation with respect to the untreated level. The toxicity screening data, from the NCI-60 database [37], summarize the toxicity of drug treatments in three variables, GI50, LC50, and TGI, representing the 50% growth inhibition, 50% lethal concentration, and total growth inhibition levels. The data were conformed to represent dose-dependent toxicity profiles for the doses used in the corresponding gene expression dataset.
Predicting both gene and toxicity matrices: We compared our proposed method with existing state-of-the-art methods. In this experiment, we randomly split the observations into 50% for training ( elements) and 50% for testing ( elements), which is exactly the same data as used in [17]. We run the prediction experiments on 100 random splits [17] and report the average relative MSE score, which is defined as
where is the target score vector, is the estimated score vector, and is the mean of elements in , is the number of views. In this experiment, the number of views is . Since the number of elements in view 1 and view 2 are different, the relative MSE score is more suitable than the root MSE score. We compare our proposed method with ARDCP [38], CP [39], Group Factor Analysis (GFA) [40], and Bayesian Multi-view Tensor Factorization (BMTF) [17]. BMTF is a state-of-the-art multi-view factorization method.
For CFM, we first concatenate all view matrices as
and use this matrix for learning. The regularization parameter is experimentally set to . To deal with multi-view data, we form the input and output of CFM as
Table 3 shows the average relative MSE of the methods. As can be seen, the proposed method outperforms the state-of-the-art methods.
Statistic | CFM (multi-view) | BMTF (multi-view) | GFA (multi-view) | ARDCP (multi-view) | CP (multi-view) | ARDCP (single-view) | CP (single-view) |
---|---|---|---|---|---|---|---|
Mean | 0.4037 | 0.4811 | 0.5223 | 0.8919 | 5.3713 | 0.6438 | 5.0699 |
StdError | 0.0163 | 0.0061 | 0.0041 | 0.0027 | 0.0310 | 0.0047 | 0.0282 |
Predicting toxicity matrices using Gene expression data: We further evaluated the proposed CFM on the toxicity prediction task. For this experiment, we randomly split the observations of the toxicity matrices into 50% for training ( elements) and 50% for testing ( elements). Then, we used the gene expression matrices as side information for predicting the toxicity matrices. More specifically, we designed two types of features from the gene expression data:
Mean of $k$-nearest neighbor similarities: We first find the $k$ nearest neighbors of each drug target, where the Gaussian kernel is used for similarity computation. Then, we average the similarities to these $k$ nearest neighbors.
Standard deviation of $k$-nearest neighbor similarities: Similarly to the mean feature, we first find the $k$-nearest neighbor similarities and then compute their standard deviation.
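The two features described above could be computed along the following lines. This is a hedged sketch: the kernel width `sigma`, the exclusion of each drug from its own neighbor list, and the layout of the expression matrix `E` (one row per drug) are assumptions, and `knn_similarity_features` is a hypothetical helper name.

```python
import numpy as np

def knn_similarity_features(E, k, sigma=1.0):
    """Mean and standard deviation of Gaussian-kernel similarities
    to the k most similar other rows of E (requires k <= n_rows - 1)."""
    sq = np.sum(E ** 2, axis=1)
    # Squared Euclidean distances, clipped at 0 against round-off.
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (E @ E.T), 0.0)
    S = np.exp(-dist2 / (2.0 * sigma ** 2))      # Gaussian kernel similarities
    np.fill_diagonal(S, -np.inf)                 # assumption: a row is not its own neighbor
    top = np.sort(S, axis=1)[:, -k:]             # k largest similarities per row
    return top.mean(axis=1), top.std(axis=1)

rng = np.random.default_rng(7)
E = rng.normal(size=(10, 5))                     # toy expression profiles (assumed)
mean_feat, std_feat = knn_similarity_features(E, k=3, sigma=2.0)
```

Each drug thus contributes two extra scalar features that can be appended to its input vector.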
Then, we used these features as
We run the prediction experiments on 100 random splits and report the average RMSE score (Table 4). ‘CFM’ denotes CFM without any additional features. It is clear that the performance of CFM improves by simply adding manually designed features. Thus, we can improve the prediction performance of CFM by designing new features, which is useful for various prediction tasks on biological data.
CFM | CFM (+mean/std features) | CFM (+mean feature) | |||||
---|---|---|---|---|---|---|---|
Mean | 0.5624 | 0.5199 | 0.5207 | 0.5215 | 0.5269 | 0.5234 | 0.5231 |
StdError | 0.0501 | 0.0464 | 0.0451 | 0.0450 | 0.0466 | 0.0454 | 0.0450 |
We proposed the convex factorization machine (CFM), which is a convex variant of factorization machines (FMs). Specifically, we formulated the CFM optimization problem as a semidefinite program (SDP) and solved it with Hazan’s algorithm. A key advantage of the proposed method over FMs is that CFM can find a globally optimal solution, while FMs can get poor locally optimal solutions since they are non-convex approaches. The derived algorithm is simple and can be easily implemented. We also showed the connection between CFM and convex factorization methods. Through synthetic and real-world experiments, we showed that the proposed CFM achieves results competitive with state-of-the-art methods. Moreover, for a toxicogenomics prediction task, CFM outperformed a state-of-the-art multi-view tensor factorization method.
In future work, we will extend the proposed method to distributed computation. Another important challenge is to improve the convergence properties of the proposed method.
A rank minimization heuristic with application to minimum order system approximation.
In ACC, 2001.