Which Factorization Machine Modeling is Better: A Theoretical Answer with Optimal Guarantee

01/30/2019, by Ming Lin et al., University of Michigan

Factorization machine (FM) is a popular machine learning model for capturing second order feature interactions. The optimal learning guarantee of FM and its generalized version has not yet been developed. For a rank-k generalized FM with d-dimensional input, the previous best known sampling complexity is O[k^3 d·polylog(kd)] under the Gaussian distribution. This bound is sub-optimal compared with the information-theoretic lower bound O(kd). In this work, we aim to tighten this bound towards optimal and to generalize the analysis to sub-gaussian distributions. We prove that when the input data satisfies the so-called τ-Moment Invertible Property, the sampling complexity of the generalized FM can be improved to O[k^2 d·polylog(kd)/τ^2]. When the second order self-interaction terms are excluded in the generalized FM, the bound can be further improved to O[kd·polylog(kd)], which is optimal up to logarithmic factors. Our analysis also suggests that the positive semi-definite constraint in the conventional FM is redundant, as it does not improve the sampling complexity while making the model difficult to optimize. We evaluate our improved FM model on a real-time high precision GPS signal calibration task to validate its superiority.


1 Introduction

Factorization machine (FM) [16, 2, 8, 9, 23, 21, 14, 13] is a popular linear regression model that captures the second order feature interactions. It has been found effective in various applications, including recommendation systems [16], CTR prediction [9], computational medicine [11], social networks [7], and so on. Intuitively speaking, the second order feature interactions consider the factors that jointly affect the output. On the theoretical side, FM is closely related to symmetric matrix sensing [10, 5, 22] and phase retrieval [6]. While the conventional FM only considers the second order feature interactions, it is possible to extend the conventional FM to a higher order functional space, which leads to the Polynomial Network model [4]. FM is a cornerstone of modern machine learning research as it bridges linear regression and high order polynomial regression. It is therefore important to understand the theoretical foundation of FM.

Given an instance x ∈ R^d, the conventional FM assumes that the label y of x is generated by

y = ⟨w*, x⟩ + x^⊤ M* x ,    (1)

where w* ∈ R^d and M* ∈ R^{d×d} are the first order and the second order coefficients respectively. In the original FM paper [16], the authors additionally assumed that M* is generated from a low-rank positive semi-definite (PSD) matrix with all its diagonal elements subtracted. That is,

M* = Z* − diag(Z*) ,  where Z* ⪰ 0 and rank(Z*) ≤ k .    (2)

Eq. (2) consists of two parts. We call the first part the PSD constraint and the second part the diagonal-zero constraint. Our key question in this work is whether the FM model (1) can be learned with about O(kd) observations and how the two additional constraints help the generalization ability of FM.
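To make the two constraints concrete, the following is a minimal NumPy sketch of the prediction rule implied by Eq. (1)-(2); the variable names are ours, not the paper's.

```python
import numpy as np

def fm_predict(x, w, U):
    """Conventional FM prediction in the form of Eq. (1) under constraint Eq. (2).

    x : (d,)   input instance
    w : (d,)   first order coefficients
    U : (d, k) latent factors; Z = U U^T is the low-rank PSD matrix
    """
    Z = U @ U.T                    # PSD constraint: Z is positive semi-definite
    M = Z - np.diag(np.diag(Z))    # diagonal-zero constraint: subtract the diagonal
    return float(x @ w + x @ M @ x)

rng = np.random.default_rng(0)
d, k = 8, 3
x, w, U = rng.normal(size=d), rng.normal(size=d), rng.normal(size=(d, k))
print(fm_predict(x, w, U))
```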

Although the FM has been widely applied, there is little research exploring its theoretical properties to answer the above key question. A naive analysis directly following the sampling complexity of the linear model would suggest O(d^2) samples to recover M*, which is too loose. When w* = 0 and M* is symmetric, Eq. (1) is equivalent to the symmetric matrix sensing problem. [5] proved an O(kd) sampling complexity for this special case on well-bounded sub-gaussian distributions using trace-norm convex programming under a restricted isometry (RIP) type condition. [22] developed a conditional gradient descent solver to recover M*. However, when w* ≠ 0, the above methods and theoretical results are no longer applicable. [3] considered a convexified formulation of FM. Their FM model requires solving a trace-norm regularized loss function, which is computationally expensive. They did not provide statistical learning guarantees for the convexified FM.

To the best of our knowledge, the most recent research dealing with the theoretical properties of the FM is [12]. In their study, the authors argued that the two constraints proposed in [16] can be removed if x is sampled from the standard Gaussian distribution. However, their analysis heavily relies on the rotation invariance of the Gaussian distribution and therefore cannot be generalized to non-Gaussian cases. Even on the Gaussian distribution, the sampling complexity given by their analysis is O[k^3 d·polylog(kd)], which is worse than the information-theoretic lower bound O(kd). It is still an open question whether the FM can be learned with O(kd) samples and whether both constraints in the original formulation are necessary to make the model learnable.

In this work, we answer the above questions affirmatively. We show that when the data is sampled from a sub-gaussian distribution that satisfies the so-called τ-Moment Invertible Property (MIP), the generalized FM (without constraints) can be learned with O[k^2 d·polylog(kd)/τ^2] samples. The PSD constraint is not necessary to achieve this sharp bound. Actually, the PSD constraint is harmful, as it introduces an asymmetric bias on the label value (see the Experiments section). The optimal O[kd·polylog(kd)] sampling complexity is achievable if we further constrain the diagonal elements of M* to be zero. This is not an artificial constraint: an information-theoretic limitation prevents us from recovering the diagonal elements of M* on general sub-gaussian distributions. Finally, inspired by our theoretical results, we propose an improved version of FM, called iFM, which removes the PSD constraint and inherits the diagonal-zero constraint from the conventional modeling. Unlike the generalized FM, the sampling complexity of iFM does not depend on the MIP constant of the data distribution.

The remainder of this paper is organized as follows. We revisit the modeling of the FM in Section 2 and show that the conventional modeling is sub-optimal when considered in a more general framework. In Section 3 we present the learning guarantee of the generalized FM on sub-gaussian distributions. We propose a high order moment elimination technique to overcome a difficulty in our convergence analysis. Based on our theoretical results, we propose the improved model iFM. Section 4 conducts numerical experiments on synthetic and real-world datasets to validate the superiority of iFM over the conventional FM. Section 5 concludes this work.

2 A Revisit of Factorization Machine

In this section, we revisit the modeling design of the FM and its variants. We briefly review the original formulation of the conventional FM to raise several questions about its optimality. We then highlight previous studies trying to establish the theoretical foundation of the FM modeling. Based on the above survey, we motivate our study and present our main results in the next section.

In their original paper, [16] assumes that the feature interaction coefficients in the FM can be embedded in a k-dimensional latent space. That is,

y = ⟨w*, x⟩ + Σ_{i≠j} ⟨v_i, v_j⟩ x_i x_j ,    (3)

where v_i ∈ R^k is the latent vector of the i-th feature. The original formulation Eq. (3) is equivalent to Eq. (1) with constraint Eq. (2). While the low-rank assumption is standard in the matrix sensing literature, the PSD and the diagonal-zero constraints are not. A critical question is whether the two additional constraints are necessary or removable. Indeed, we have strong reasons to remove both constraints.

The reasons to remove the diagonal-zero constraint are straightforward. First, there is no theoretical result so far to motivate this constraint. Secondly, subtracting the diagonal elements makes the second order derivative of the model w.r.t. x non-PSD. This raises many technical difficulties in optimization and learning theory, as many research works assume convexity in their analysis.

The PSD constraint in the original FM modeling is the second term we wish to remove. Let us temporarily forget about the diagonal-zero constraint and focus on the PSD constraint only. Obviously, relaxing the PSD factorization UU^⊤ to a general low-rank factorization UV^⊤ makes the model more flexible. A more serious problem of the PSD constraint is that it implicitly assumes that the label is more likely to be "positive". This introduces an asymmetric bias on the distribution of y. To see this, suppose Z* = QΛQ^⊤, where Q is the eigenvector matrix of Z*. We call Q the second order feature mapping matrix induced by Z*, since x^⊤ Z* x = (Q^⊤ x)^⊤ Λ (Q^⊤ x). The eigenvalue matrix Λ gives the weights of the mapped features Q^⊤ x. As Z* is constrained to be PSD, the weights of the mapped features cannot be negative. In other words, the PSD constraint prevents the model from learning patterns of the negative class. Please check the Experiments section for more concrete examples.

Another issue raised by the PSD constraint is the difficulty in optimization. Suppose we choose least squares as the loss function in FM. By enforcing Z = UU^⊤, the loss function is a fourth order polynomial of U. This makes the initialization of U difficult, since the scale of the initial U affects the convergence rate. Clearly we cannot initialize U = 0, since the gradient w.r.t. U would be zero. On the other hand, we cannot initialize U with a large norm, otherwise the problem becomes ill-conditioned: the spectral norm of the second order derivative w.r.t. U grows with the scale of U, therefore the (local) condition number depends on the initialization. In practice, it is usually difficult to figure out the optimal scale of U, resulting in vanishing or exploding gradient norms. If we decouple UU^⊤ into UV^⊤, we can initialize U and V without facing this dilemma. Then, by alternating gradient descent, the decoupled FM model is easy to optimize.
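The following toy check (our code, squared loss, no first order term) illustrates the initialization issue: with the coupled parameterization M = UU^⊤ the gradient vanishes at U = 0, while with the decoupled M = UV^⊤ the gradient w.r.t. V is non-zero even when V = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 2
x = rng.normal(size=d)
y = 1.0

# Coupled parameterization: loss(U) = (y - x^T U U^T x)^2
U0 = np.zeros((d, k))
grad_U = -2 * (y - x @ U0 @ U0.T @ x) * 2 * np.outer(x, x) @ U0
print(np.linalg.norm(grad_U))      # 0.0 -> cannot start from U = 0

# Decoupled parameterization: loss(U, V) = (y - x^T U V^T x)^2
U = rng.normal(size=(d, k))
V0 = np.zeros((d, k))
grad_V = -2 * (y - x @ U @ V0.T @ x) * np.outer(x, x) @ U
print(np.linalg.norm(grad_V))      # non-zero -> alternating updates can proceed
```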

In summary, the theoretical foundation of the FM is still not well-developed. On one hand, it is unclear whether the conventional FM modeling is optimal and on the other hand, there is strong motivation to modify the conventional formulation based on heuristic intuition. This inspires our study of the optimal modeling of the FM driven by theoretical analysis which is presented in the next section.

3 Main Results

In this section, we present our main results on the theoretical guarantees of the FM and its improved version iFM. We first give a sharp complexity bound for the generalized FM on sub-gaussian distributions. We show that the recovery error of the diagonal elements of M* depends on the so-called τ-MIP condition of the data distribution. The sampling complexity bound can be improved to optimal by imposing the diagonal-zero constraint.

We introduce a few more notations needed in this section. Suppose x ∈ R^d has coordinate-wise sub-gaussian entries with zero mean and unit variance. The element-wise third order moment of x is denoted as m_3 and the fourth order moment as m_4. All training instances are sampled identically and independently (i.i.d.). Denote the feature matrix X = [x_1, x_2, …, x_n] and the label vector y = [y_1, y_2, …, y_n]^⊤. diag(·) denotes the diagonal function. For any two matrices A and B, we denote their Hadamard product as A∘B; the element-wise squared matrix is defined by A^{∘2} = A∘A. For a non-negative real number δ, we write E_δ for a perturbation matrix whose spectral norm is upper bounded by δ. The i-th largest singular value of a matrix A is σ_i(A); we abbreviate σ_i = σ_i(M*). To abbreviate our high probability bounds, given a failure probability η, we use polylog(·) and Õ(·) to denote polynomial logarithmic factors in (d, k, 1/η) and any other necessary variables that do not change the polynomial order of the upper bounds.

3.1 Limitation of The Generalized FM

In order to derive the theoretically optimal FM models, we begin with the most general formulation of FM, that is, with no constraint except low-rank:

y = ⟨w*, x⟩ + x^⊤ M* x ,  where rank(M*) ≤ k .    (4)

Clearly M* must be symmetric, but for now this does not matter. Eq. (4) is called the generalized FM [12]. It is proved in [12] that when x is sampled from the Gaussian distribution, Eq. (4) can be learned with O[k^3 d·polylog(kd)] training samples. Although this bound is not optimal, [12] showed the possibility of removing Eq. (2) on the Gaussian distribution. However, their result no longer holds true on non-Gaussian distributions. In the following, we will show that the learning guarantee for the generalized FM on sub-gaussian distributions is much more complex than the Gaussian case.

Our first important observation is that model (4) is not learnable on every sub-gaussian distribution.

Proposition 1.

When x follows the symmetric Bernoulli distribution taking values in {−1, +1}, the generalized FM is not learnable.

The above observation is easy to verify: since x_i^2 = 1 when x_i ∈ {−1, +1}, the diagonal of M* contributes only a constant trace term to the label. Therefore at least the diagonal elements of M* cannot be recovered at all. Proposition 1 shows that there is an information-theoretic limitation to learning the generalized FM on sub-gaussian distributions. In our analysis, we find that this limitation is related to a property of the data distribution which we call the Moment Invertible Property (MIP).
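A quick numerical check of this fact (our code): for x ∈ {−1, +1}^d, the labels produced by M and by the same M with its diagonal moved into a constant offset are identical, so no amount of data can distinguish the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.normal(size=(d, d)); M = (M + M.T) / 2   # symmetric second order coefficients

# Two models that differ only in how the diagonal of M is represented
M_off = M - np.diag(np.diag(M))                  # diagonal zeroed out
offset = np.trace(M)                             # constant absorbed from the diagonal

X = rng.choice([-1.0, 1.0], size=(1000, d))      # symmetric Bernoulli inputs
y1 = np.einsum('ni,ij,nj->n', X, M, X)
y2 = np.einsum('ni,ij,nj->n', X, M_off, X) + offset
print(np.max(np.abs(y1 - y2)))                   # 0 up to floating point error
```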

Definition 2 (Moment Invertible Property).

A zero mean, unit variance sub-gaussian distribution with third and fourth order moments m_3 and m_4 is called τ-Moment Invertible if |m_4 − m_3^2 − 1| ≥ τ for some constant τ > 0.
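Assuming the moment condition takes the form sketched in Definition 2 above (our reading of the statement), the MIP constant can be estimated from samples as follows; for the standard Gaussian it is about 2, while for the symmetric Bernoulli distribution of Proposition 1 it degenerates to 0.

```python
import numpy as np

def mip_constant(samples):
    """Empirical tau = |m4 - m3^2 - 1| for a zero-mean, unit-variance distribution
    (the moment quantity assumed in Definition 2)."""
    m3 = np.mean(samples ** 3)
    m4 = np.mean(samples ** 4)
    return abs(m4 - m3 ** 2 - 1.0)

rng = np.random.default_rng(0)
n = 200_000
print(mip_constant(rng.normal(size=n)))                  # ~2 for the standard Gaussian
print(mip_constant(rng.choice([-1.0, 1.0], size=n)))     # 0 for symmetric Bernoulli
```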

With the MIP condition, the following theorem shows that the generalized FM is learnable via alternating gradient descent.

Theorem 3.

Suppose x is sampled from a τ-MIP sub-gaussian distribution with y generated by Eq. (4). Then with probability at least 1 − η, there is an alternating gradient descent based method whose recovery error after t iterations decreases geometrically in t,

provided n ≥ O[k^2 d·polylog(kd)/τ^2].

Theorem 3 is our key result. In Theorem 3, we measure the quality of our estimation by the recovery error, that is, the distance between the estimated (w, M) and the ground truth (w*, M*). The recovery error decreases linearly along the steps of the alternating iteration. The sampling complexity is on the order of O[k^2 d·polylog(kd)/τ^2]. This bound delivers two messages. First, when the distribution is close to the Gaussian distribution, τ is a constant bounded away from zero and the bound is controlled by O[k^2 d·polylog(kd)]. This result improves the previous O[k^3 d·polylog(kd)] bound given by [12] for the Gaussian distribution. Secondly, when τ is small, the sampling complexity is proportional to 1/τ^2. The sampling complexity even tends to infinity when the data follows the binary Bernoulli distribution, where the MIP constant degenerates to zero. Therefore, the τ-MIP condition provides a sufficient condition to make the generalized FM learnable.

We have not yet given any details about the alternating gradient descent algorithm mentioned in Theorem 3. We find that it is difficult to prove the convergence rate following the conventional alternating gradient descent framework. To address this difficulty, we develop a high order moment elimination technique for the convergence analysis in the next subsection.

3.2 Alternating Gradient Descent with High Order Moment Elimination

Input: mini-batch size; number of total updates T; training instances {(x_i, y_i)}; rank k.
Output: w^(T), U^(T), V^(T).
1:  Retrieve a mini-batch of training instances to estimate the third and fourth order moments m_3 and m_4.
2:  Compute the quantities defined in Eq. (8) and (9).
3:  Initialize w^(0), U^(0) and V^(0), using the top-k singular vectors for the spectral initialization of the factors.
4:  for t = 1, 2, …, T do
5:     Retrieve a mini-batch of training instances, compute the corrected quantity in Eq. (7), and update w^(t) and U^(t).
6:     Compute the corrected quantity in Eq. (10) and update V^(t).
7:  end for
8:  Output: w^(T), U^(T), V^(T).
Algorithm 1 Alternating Gradient Descent with High Order Moment Elimination

In this subsection, we will construct an alternating gradient descent algorithm which achieves the convergence rate and the sampling complexity in Theorem 3. We first show that the conventional alternating gradient descent cannot be applied directly to prove Theorem 3. Then a high order moment elimination technique is proposed to overcome the difficulty.

The generalized FM defined in Eq. (4) can be written in the matrix form

y = X^⊤ w* + A(M*) ,    (5)

where the operator A is defined by [A(M)]_i = x_i^⊤ M x_i. The adjoint operator of A is A*(y) = Σ_i y_i x_i x_i^⊤. To recover (w*, M*), we minimize the square loss function

L(w, U, V) = (1/2n) ‖y − X^⊤ w − A(UV^⊤)‖_2^2 .    (6)
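In code, the operator A, its adjoint, and the loss of Eq. (6) look as follows (our sketch; mini-batching and the moment correction introduced below are ignored, and X is stored row-wise).

```python
import numpy as np

def A(X, M):
    """Rank-one measurement operator: [A(M)]_i = x_i^T M x_i, with X of shape (n, d)."""
    return np.einsum('ni,ij,nj->n', X, M, X)

def A_adjoint(X, y):
    """Adjoint operator: A*(y) = sum_i y_i x_i x_i^T."""
    return np.einsum('n,ni,nj->ij', y, X, X)

def square_loss(X, y, w, U, V):
    """Square loss of Eq. (6) with the factorized parameterization M = U V^T."""
    residual = y - X @ w - A(X, U @ V.T)
    return 0.5 * np.mean(residual ** 2)
```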

A straightforward idea to prove Theorem 3 is to show that the alternating gradient descent converges. However, we find that this is difficult in our problem. To see this, consider the expected gradient of the loss with respect to U at step t. In previous studies, one expects this expectation to point towards the ground truth. This is no longer the case in our problem: the expectation contains additional perturbation terms involving the third and fourth order moments m_3 and m_4. For non-Gaussian distributions, these two perturbation terms could dominate the gradient norm. Similarly, the gradient with respect to w is biased by high order moment terms.

The difficulty of following the conventional gradient descent analysis inspires us to look for a new convergence analysis technique. The perturbation terms consist of high order moments of the sub-gaussian variable x. It might be possible to construct another sequence of high order moments to eliminate these perturbation terms. We call this idea the high order moment elimination method. The next question is whether the desired moments exist and how to construct them efficiently. Unfortunately, this is impossible in general. A sufficient condition to ensure the existence of the elimination sequence is that the data distribution satisfies the τ-MIP condition.

To construct an elimination sequence, we define a family of auxiliary moment functions of the data. In expectation, these functions involve the same high order moments as the perturbation terms. This inspires us to find a linear combination of them that eliminates the perturbation. The solution to this linear combination equation is

(7)

where

(8)
(9)

Similarly, to eliminate the high order moments in the gradient with respect to V, we construct

(10)

The overall construction is given in Algorithm 1.

We briefly prove that the construction in Algorithm 1 eliminates the high order moments in the gradients, from which a global linear convergence rate follows immediately. Please check the appendix for details. We will omit the iteration superscript in the variables when it raises no confusion.

First we show that the measurement operator A is conditionally-independent restricted isometric after shifting its expectation (Shift CI-RIP). The proof can be found in Appendix B.

Theorem 4 (Shift CI-RIP).

Suppose x is sampled from a sub-gaussian distribution with zero mean and unit variance. Fix a rank-k matrix M. Then, with probability at least 1 − η, the shifted operator is nearly isometric on M,

provided n ≥ O[kd·polylog(kd)].

Theorem 4 is the main theorem in our analysis. The key ingredient of our proof is to apply the matrix Bernstein inequality together with an improved version of the sub-gaussian Hanson-Wright inequality proved by [18]. Please check Appendix B for more details.

Based on the shifted CI-RIP condition of the operator A, we prove the following perturbation bounds.

Lemma 5.

For a sufficiently large number of samples, with probability at least 1 − η, the empirical moment quantities used in Algorithm 1 concentrate around their population counterparts.

Lemma 5 shows that the empirical quantities used in Algorithm 1 are all concentrated around their expectations with no more than Õ(kd) samples. To finish our construction, we need to bound the deviation of the estimated elimination coefficients in Eq. (8) and (9) from their expectations. This is done in the following lemma.

Lemma 6.

Suppose the distribution of x is τ-MIP with τ > 0. Then in Algorithm 1,

provided

Lemma 6 shows that the estimated elimination coefficients are accurate as long as the sample size is large enough. The matrix inversion in their definition requires that the τ-MIP condition be satisfied with τ > 0.

We are now ready to show that the corrected measurement operators are almost isometric.

Lemma 7.

Under the same settings as in Theorem 3, with probability at least 1 − η,

provided

where .

Lemma 7 shows that the corrected operators are almost isometric when the number of samples is sufficiently large. The proof of Lemma 7 consists of two steps. First, we replace each operator or matrix with its expectation plus a small perturbation as given in Lemma 5 and Lemma 6. Then Lemma 7 follows after simplification. Theorem 3 is obtained by combining Lemma 7 with a standard alternating gradient descent analysis. Please check Appendix F for the complete proof.

3.3 Improved Factorization Machine

Theorem 3 shows that learning the generalized FM is hard on non-Gaussian distributions. Especially, when the data distribution has a very small MIP constant τ, the sampling complexity to recover M* in Eq. (4) can be as large as O[k^2 d·polylog(kd)/τ^2]. The recovery is even impossible on distributions such as the symmetric Bernoulli distribution. Clearly, a well-defined learnable model should not depend on the τ-MIP condition.

Indeed, the bound given by Theorem 3 is quite sharp. It explains well why we cannot recover the diagonal of M* on the Bernoulli distribution. Therefore it is unlikely that the dependency on τ can be removed by designing a better elimination sequence in Algorithm 1. After examining the proof of Theorem 3 carefully, we find that the only reason our bound contains the 1/τ^2 factor is that the diagonal elements of M* are allowed to be non-zero. If we constrain the diagonals of M* and of the estimate UV^⊤ to be zero, the high order moment terms in the expected gradient vanish, and then we do not need to eliminate them during the alternating iteration. This greatly simplifies our convergence analysis as we only need Theorem 4, which now becomes

(11)

Eq. (11) already shows that the operator A is almost isometric on zero-diagonal low-rank matrices, which immediately implies the linear convergence rate of alternating gradient descent. As a direct corollary of Theorem 4, the sampling complexity can be improved to O[kd·polylog(kd)], which is optimal up to logarithmic factors. Inspired by these observations, we propose to learn the following FM model

y ≈ ⟨w, x⟩ + Σ_{i≠j} ⟨u_i, v_j⟩ x_i x_j ,    (12)

where u_i and v_j are the i-th and j-th rows of U, V ∈ R^{d×k} respectively.

We call the above model the Improved Factorization Machine (iFM). The iFM model is a trade-off between the conventional FM model and the generalized FM model. It removes the PSD constraint by decoupling UU^⊤ into UV^⊤ as in the generalized FM model, but keeps the diagonal-zero constraint of the conventional FM model. Unlike the conventional FM model, the iFM model is proposed in a theory-driven way. The decoupling of the factors makes iFM easy to optimize, while the diagonal-zero constraint makes it learnable with the optimal sampling complexity. In the next section, we verify the above discussion with numerical experiments.
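A minimal training sketch for iFM under the squared loss is given below (our code; plain alternating gradient descent without the spectral initialization used in Algorithm 1, and without moment elimination, which the zero-diagonal constraint makes unnecessary per the discussion above).

```python
import numpy as np

def ifm_predict(X, w, U, V):
    """iFM prediction of Eq. (12): linear term plus off-diagonal bilinear term."""
    M = U @ V.T
    M = M - np.diag(np.diag(M))                      # diagonal-zero constraint
    return X @ w + np.einsum('ni,ij,nj->n', X, M, X)

def ifm_fit(X, y, k, lr=0.01, iters=500, seed=0):
    """Alternating gradient descent on the mean squared error (a sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    U = 0.01 * rng.normal(size=(d, k))
    V = 0.01 * rng.normal(size=(d, k))
    for _ in range(iters):
        for target in ('U', 'V', 'w'):               # update each block in turn
            r = ifm_predict(X, w, U, V) - y          # residuals with current blocks
            G = np.einsum('n,ni,nj->ij', r, X, X) / n
            G = G - np.diag(np.diag(G))              # gradient through the zeroed diagonal
            if target == 'U':
                U = U - lr * 2 * (G @ V)
            elif target == 'V':
                V = V - lr * 2 * (G.T @ U)
            else:
                w = w - lr * 2 * (X.T @ r) / n
    return w, U, V
```

Note that the only difference from fitting the generalized FM is that the diagonal of the gradient matrix is zeroed out, mirroring the diagonal-zero constraint on UV^⊤.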

4 Experiments

We first use synthetic data in subsection 4.1 to show the modeling power of iFM and the PSD bias of the conventional FM. In subsection 4.2 we apply iFM to a real-world problem, the vTEC estimation task, to demonstrate its superiority over baseline methods.

4.1 Synthetic Data

Figure 1: RMSE curves of iFM vs. FM. Panels: (a) labels generated by the conventional FM; (b) the labels in (a) with inverted sign; (c) labels generated from a second order matrix with both positive and negative eigenvalues.

In this subsection, we construct numerical examples to support our theoretical results. To this end, we fix the dimension d and the rank k, and sample the ground truth coefficients from a Gaussian distribution. We randomly sample a training set and a testing set. In Figure 1, we report the convergence curves of iFM and FM on the testing set, averaged over 10 trials. The x-axis is the iteration step and the y-axis is the root mean square error (RMSE) on the testing set. In Figure 1(a), we generate the labels following the conventional FM assumption. Both iFM and FM converge well. In Figure 1(b), we flip the sign of the labels in (a). While iFM still converges well, FM cannot model the sign flip. This example shows why we should avoid using the conventional FM both in theory and in practice: even a simple sign flip can make the model under-fit the data. In Figure 1(c), we generate M* with both positive and negative eigenvalues. Again, the conventional FM cannot fit this data, as the distribution of the labels is now symmetric in both directions.
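The construction behind Figure 1(c) can be sketched as follows (our code, with arbitrary small dimensions): the ground-truth second order matrix mixes positive and negative eigenvalues, which the conventional PSD-constrained FM under-fits in the experiments above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 4, 2000

# Ground truth M* = Q diag(s) Q^T with a mixed-sign spectrum
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
s = np.array([2.0, 1.0, -1.0, -2.0])
M_star = Q @ np.diag(s) @ Q.T
M_star = M_star - np.diag(np.diag(M_star))        # keep the diagonal at zero (our choice)
w_star = rng.normal(size=d)

X = rng.normal(size=(n, d))
y = X @ w_star + np.einsum('ni,ij,nj->n', X, M_star, X)
# Labels y carry second order patterns of both signs; flipping y (panel (b)) has the same effect.
```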

4.2 vTEC Estimation

Ridge LASSO ElasticNet Kernel FM iFM
TestDay1 0.02161 0.02260 0.02222 0.02570 0.02178 0.02164
TestDay2 0.01430 0.01461 0.01454 0.01683 0.01439 0.01404
TestDay3 0.01508 0.01524 0.01496 0.01875 0.01484 0.01484
TestDay4 0.01449 0.01487 0.01460 0.01564 0.01432 0.01423
TestDay5 0.01610 0.01579 0.01612 0.01744 0.01606 0.01567
TestDay6 0.01487 0.01491 0.01483 0.01684 0.01470 0.01459
TestDay7 0.01828 0.01849 0.01827 0.02209 0.01830 0.01805
TestDay8 0.01461 0.01552 0.01519 0.01629 0.01466 0.01453
TestDay9 0.01657 0.01646 0.01639 0.02096 0.01646 0.01636
Average 0.01621 0.01650 0.01635 0.01895 0.01617 0.01599
RTK 62.67% 62.18% 63.07% 52.09% 63.10% 64.44%
Table 1: RMSE & RTK of iFM v.s. Baseline Methods

In this subsection, we demonstrate the superiority of iFM in a real-world application, the vertical Total Electron Content (vTEC) estimation. The vTEC is an important descriptive parameter of the ionosphere of the Earth. It measures the total number of electrons integrated along a perpendicular path from space to the Earth. One important application is real-time high precision GPS signal calibration. The accuracy of the GPS system heavily depends on the pseudo-range measurement between the satellite and the receiver. The major error in the pseudo-range measurement is caused by the vTEC, which varies dynamically.

In order to estimate the vTEC, we build a triangle mesh grid system in an anonymous region. Each node in the grid is a ground station equipped with a dual-frequency high precision GPS receiver. The distance between two nodes is around 75 kilometers. Each station sends out resolved GPS data every second. Formally, our system solves an online regression problem. The system receives a set of data points at every time step measured in seconds. Each data point represents an ionospheric pierce point of a satellite-receiver pair. The first two dimensions are the latitude and the longitude of the pierce point. The third and the fourth dimensions are the zenith angle and the azimuth angle respectively. We omit the time superscript below as we always stay within the same time window. In order to build a localized prediction model, we encode each data point into a high dimensional vector. First we collect data points for 60 seconds. Then we cluster the collected points into clusters via the K-means algorithm. Denote the cluster centers of K-means by {c_j}, and suppose the i-th data point belongs to the j-th cluster. The first two dimensions of the i-th data point are then encoded relative to its cluster center c_j. Finally, each data point is encoded into a high dimensional sparse vector.
Evaluation

It is important to note that our problem does not fit the conventional machine learning framework. We only have a validation set for model selection and an evaluation set to evaluate the model performance. We introduce the notions of training stations and testing stations, which correspond to the "training set" and "testing set" in the conventional machine learning framework. However, please note that they are not exactly the same concepts.

To evaluate the performance of our models, we randomly select one ground station as the testing station. Around the testing station, we choose several ground stations as training stations to learn the online prediction model. Suppose the online prediction model is f, which maps an encoded pierce point to the corresponding vTEC value. In our system, we are only given the double differences of the vTEC values due to the signal resolving process. Suppose two satellites s_1, s_2 and two ground stations r_1, r_2 are connected in the data-link graph, and let p(s, r) denote the ionospheric pierce point between satellite s and station r. The observed double difference is given by

DD = [ f(p(s_1, r_1)) − f(p(s_1, r_2)) ] − [ f(p(s_2, r_1)) − f(p(s_2, r_2)) ] .

Since f is an unknown function, we need to approximate it by a learned model f̂. Once f̂ is learned from the training stations, we can apply it to predict the double difference where either r_1 or r_2 is the testing station.
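As a concrete illustration, the double difference can be computed from the four single-path predictions as below (our sketch; f_hat is the learned model f̂ that maps an encoded pierce point to its predicted vTEC).

```python
def double_difference(f_hat, p_s1_r1, p_s1_r2, p_s2_r1, p_s2_r2):
    """Between-satellite difference of the between-station vTEC differences."""
    d1 = f_hat(p_s1_r1) - f_hat(p_s1_r2)   # station difference for satellite s1
    d2 = f_hat(p_s2_r1) - f_hat(p_s2_r2)   # station difference for satellite s2
    return d1 - d2
```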

Once we get the vTEC estimation, we use it to calibrate the GPS signal and finally compute the geometric coordinates of the user. The RTK ratio measures the quality of the positioning service. It is a real number representing the probability of successful positioning with an accuracy within one centimeter. The RTK ratio is computed from a commercial system that is much slower than the computation of RMSE.

Dataset and Results

We select a ground station at the region center as the testing station. Around the testing station, 16 stations are selected as training stations. We collect 5 consecutive days of data as the validation set for parameter tuning. The following 9 days of data are used as the evaluation set. We update the prediction model every 60 seconds. The learned model is then used to predict the double differences relating to the testing station. We compare the predicted double differences to the true values detected by the testing station. The number of valid satellites in our experiment is around 10 to 20.

In Table 1, we report the root mean square error (RMSE) over the 9-day period. The dates are denoted as TestDay1 to TestDay9 for anonymity. Five baseline methods are evaluated: Ridge regression, LASSO, ElasticNet, kernel ridge regression (Kernel) with the RBF kernel, and the conventional factorization machine (FM). More computationally expensive models such as deep neural networks are not feasible for our online system. For Ridge, LASSO, Kernel and ElasticNet, the regularization parameters are tuned over a logarithmic grid. The regularization parameters of FM and iFM are tuned over the same grid, and the rank of the factorization is tuned over a small candidate set. We use Scikit-learn [15] and fastFM [1] to implement the baseline methods.
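The baseline sweep can be reproduced in spirit with scikit-learn as below (our sketch; the actual tuning grids are not given above).

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

def tune_baselines(X_val, y_val):
    """Grid-search each baseline on the validation data; returns the best estimators."""
    grids = {
        'Ridge': (Ridge(), {'alpha': np.logspace(-4, 2, 7)}),
        'LASSO': (Lasso(max_iter=10000), {'alpha': np.logspace(-4, 2, 7)}),
        'ElasticNet': (ElasticNet(max_iter=10000), {'alpha': np.logspace(-4, 2, 7)}),
        'Kernel': (KernelRidge(kernel='rbf'), {'alpha': np.logspace(-4, 2, 7)}),
    }
    best = {}
    for name, (model, grid) in grids.items():
        search = GridSearchCV(model, grid, scoring='neg_root_mean_squared_error', cv=3)
        best[name] = search.fit(X_val, y_val).best_estimator_
    return best
```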

In Table 1, we observe that iFM is uniformly better than the baseline methods. The second-to-last row reports the root squared error averaged over the whole evaluation period. The 95% confidence intervals are small in our experiment. The optimal rank of FM is 2 and the optimal rank of iFM is 6. We note that FM is better than the first order linear models since it captures the second order information; this indicates that the second order information is indeed helpful.

In the last row of Table 1, we report the RTK ratio averaged over the 9 days. We find that the RTK ratio improves considerably even with a small improvement in the vTEC estimation. This is because the error of the vTEC estimation is propagated and magnified in the RTK computation pipeline. The RTK ratio of iFM is about 1.77% better than that of Ridge regression and more than 12% better than that of kernel regression. Compared to FM, it is 1.34% better. We conclude that iFM achieves overall better performance and the improvement is statistically significant.

5 Conclusion

We study the learning guarantees of the FM solved by alternating gradient descent on sub-gaussian distributions. We find that the conventional modeling of the factorization machine might be sub-optimal in capturing negative second order patterns. We prove that the constraints in the conventional FM can be removed, resulting in a generalized FM model learnable with O[k^2 d·polylog(kd)/τ^2] samples. The sampling complexity can be improved to the optimal O[kd·polylog(kd)] with the diagonal-zero constraint. Our theoretical analysis shows that the optimal modeling of high order linear models does not always agree with heuristic intuition. We hope this work will inspire future research on non-convex high order machines with solid theoretical foundations.

6 Acknowledgments

This research is supported in part by NSF (III-1539991). The high precision GPS dataset is provided by Qianxun Spatial Intelligence Inc., China. We thank Dr. Wotao Yin from the University of California, Los Angeles, and the anonymous reviewers for their insightful comments.

References

  • [1] Immanuel Bayer. fastFM: A Library for Factorization Machines. Journal of Machine Learning Research, 17(184):1–5, 2016.
  • [2] Immanuel Bayer, Xiangnan He, Bhargav Kanagal, and Steffen Rendle. A generic coordinate descent framework for learning from implicit feedback. In Proceedings of the 26th International Conference on World Wide Web, WWW ’17, pages 1341–1350, 2017.
  • [3] Mathieu Blondel, Akinori Fujino, and Naonori Ueda. Convex Factorization Machines. In Machine Learning and Knowledge Discovery in Databases, number 9285 in Lecture Notes in Computer Science. 2015.
  • [4] Mathieu Blondel, Masakazu Ishihata, Akinori Fujino, and Naonori Ueda. Polynomial Networks and Factorization Machines: New Insights and Efficient Training Algorithms. In Proceedings of The 33rd International Conference on Machine Learning, pages 850–858, 2016.
  • [5] T. Tony Cai and Anru Zhang. ROP: Matrix recovery via rank-one projections. The Annals of Statistics, 43(1):102–138, 2015.
  • [6] E. Candes, Y. Eldar, T. Strohmer, and V. Voroninski. Phase Retrieval via Matrix Completion. SIAM Journal on Imaging Sciences, 6(1):199–225, 2013.
  • [7] Liangjie Hong, Aziz S. Doumith, and Brian D. Davison. Co-factorization Machines: Modeling User Interests and Predicting Individual Decisions in Twitter. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 557–566, New York, NY, USA, 2013.
  • [8] Yuchin Juan, Damien Lefortier, and Olivier Chapelle. Field-aware factorization machines in a real-world online advertising system. In Proceedings of the 26th International Conference on World Wide Web Companion, WWW ’17 Companion, pages 680–688, 2017.
  • [9] Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. Field-aware Factorization Machines for CTR Prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 43–50, 2016.
  • [10] Richard Kueng, Holger Rauhut, and Ulrich Terstiege. Low rank matrix recovery from rank one measurements. Applied and Computational Harmonic Analysis, 42(1):88–116, 2017.
  • [11] Kaixiang Lin, Jianpeng Xu, Inci M. Baytas, Shuiwang Ji, and Jiayu Zhou. Multi-Task Feature Interaction Learning. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1735–1744, 2016.
  • [12] Ming Lin and Jieping Ye. A non-convex one-pass framework for generalized factorization machine and rank-one matrix sensing. In Advances in Neural Information Processing Systems, pages 1633–1641, 2016.
  • [13] Xiao Lin, Wenpeng Zhang, Min Zhang, Wenwu Zhu, Jian Pei, Peilin Zhao, and Junzhou Huang. Online compact convexified factorization machine. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 1633–1642, 2018.
  • [14] Luo Luo, Wenpeng Zhang, Zhihua Zhang, Wenwu Zhu, Tong Zhang, and Jian Pei. Sketched follow-the-regularized-leader for online factorization machine. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1900–1909, 2018.
  • [15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [16] Steffen Rendle. Factorization machines. In IEEE 10th International Conference On Data Mining, pages 995–1000, 2010.
  • [17] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. 2017.
  • [18] Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18, 2013.
  • [19] G. W. Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic Press, Boston, 1st edition, 1990.
  • [20] Terence Tao. Topics in Random Matrix Theory, volume 132. American Mathematical Society, 2012.
  • [21] Makoto Yamada, Wenzhao Lian, Amit Goyal, Jianhui Chen, Kishan Wimalawarne, Suleiman A Khan, Samuel Kaski, Hiroshi Mamitsuka, and Yi Chang. Convex factorization machine for toxicogenomics prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1215–1224, 2017.
  • [22] Alp Yurtsever, Madeleine Udell, Joel A. Tropp, and Volkan Cevher. Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage. arXiv:1702.06838 [math, stat], 2017.
  • [23] Huan Zhao, Quanming Yao, Jianda Li, Yangqiu Song, and Dik Lun Lee. Meta-graph based recommendation fusion over heterogeneous information networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, pages 635–644, 2017.

Appendix A Preliminary

The ψ2-Orlicz norm of a random sub-gaussian variable X is defined by

‖X‖_{ψ2} = inf{ t > 0 : E exp(X^2 / t^2) ≤ C } ,

where C is an absolute constant (conventionally C = 2). For a random sub-gaussian vector x ∈ R^d, its ψ2-Orlicz norm is

‖x‖_{ψ2} = sup_{v ∈ S^{d−1}} ‖⟨x, v⟩‖_{ψ2} ,

where S^{d−1} is the unit sphere.

The following theorem gives the matrix Bernstein’s inequality [17].

Theorem 8 (Matrix Bernstein’s inequality).

Let X_1, …, X_n be independent, mean zero, d_1 × d_2 random matrices with ‖X_i‖ ≤ K for all i. Denote

σ^2 = max( ‖Σ_i E[X_i X_i^⊤]‖ , ‖Σ_i E[X_i^⊤ X_i]‖ ) .

Then for any t ≥ 0, we have

P( ‖Σ_i X_i‖ ≥ t ) ≤ 2(d_1 + d_2) exp( −c min( t^2/σ^2 , t/K ) ) ,

where c is a universal constant. Equivalently, with probability at least 1 − η,

‖Σ_i X_i‖ ≤ C [ σ √(log((d_1 + d_2)/η)) + K log((d_1 + d_2)/η) ] .

When E[X_i] ≠ 0, replacing X_i with X_i − E[X_i], the inequality still holds true.

The following Hanson-Wright inequality for sub-gaussian variables is given in [18].

Theorem 9 (Sub-gaussian Hanson-Wright inequality).

Let x = (x_1, …, x_d) ∈ R^d be a random vector with independent, mean zero, sub-gaussian coordinates satisfying max_i ‖x_i‖_{ψ2} ≤ K. Then given a fixed matrix A, for any t ≥ 0,

P( | x^⊤ A x − E[x^⊤ A x] | > t ) ≤ 2 exp( −c min( t^2 / (K^4 ‖A‖_F^2) , t / (K^2 ‖A‖) ) ) ,

where ‖A‖_F and ‖A‖ denote the Frobenius and spectral norms of A and c is a universal positive constant. Equivalently, with probability at least 1 − η,

| x^⊤ A x − E[x^⊤ A x] | ≤ C [ K^2 ‖A‖_F √(log(1/η)) + K^2 ‖A‖ log(1/η) ] .

Truncation trick

As Bernstein’s inequality requires boundness of the random variable, we use the truncation trick in order to apply it on unbounded random matrices. First we condition on the tail distribution of random matrices to bound the norm of a fixed random matrix. Then we take union bound over all

random matrices in the summation. The union bound will result in an extra penalty in the sampling complexity which can be absorbed into or . Please check [20] for more details.

Appendix B Proof of Theorem 4

Define . Recall that

Denote

In order to apply the matrix Bernstein inequality, we have

The third inequality follows from the Hanson-Wright inequality and a fact established in Appendix C (proof of Lemma 5).

And

The last inequality follows from the definition.

And