1 Introduction
Factorization machine (FM) [16, 2, 8, 9, 23, 21, 14, 13] is a popular regression model that extends the linear model to capture second-order feature interactions. It has been found effective in various applications, including recommendation systems [16], CTR prediction [9], computational medicine [11], social networks [7], and so on. Intuitively speaking, the second-order feature interactions model factors that jointly affect the output. On the theoretical side, FM is closely related to symmetric matrix sensing [10, 5, 22] and phase retrieval [6]. While the conventional FM only considers second-order feature interactions, it can be extended to higher-order functional spaces, which leads to the Polynomial Network model [4]. FM is a cornerstone of modern machine learning research as it bridges linear regression and high-order polynomial regression. It is therefore important to understand the theoretical foundation of FM.
Given an instance $x \in \mathbb{R}^d$, the conventional FM assumes that the label $y$ of $x$ is generated by
$$ y = x^\top w^* + x^\top M^* x \qquad (1) $$
where $w^* \in \mathbb{R}^d$ and $M^* \in \mathbb{R}^{d \times d}$ are the first-order and the second-order coefficients respectively. In the original FM paper [16], the authors additionally assumed that $M^*$ is generated from a low-rank positive semidefinite (PSD) matrix $Z^*$ with all its diagonal elements subtracted. That is,
$$ M^* = Z^* - \mathrm{diag}(Z^*), \qquad Z^* \succeq 0, \quad \mathrm{rank}(Z^*) \le k \qquad (2) $$
Eq. (2) consists of two parts. We call the first part the PSD constraint and the second part the diagonal-zero constraint. Our key question in this work is whether the FM model (1) can be learned from $n$ observations and how the two additional constraints affect the generalization ability of FM.
Although the FM has been widely applied, there is little research exploring its theoretical properties to answer the above key question. A naive analysis directly following the sampling complexity of the linear model would suggest $O(d^2)$ samples to recover $M^*$, which is too loose. When $w^* = 0$ and $M^*$ is symmetric, Eq. (1) reduces to the symmetric matrix sensing problem. [5] proved a sampling complexity bound for this special case on well-bounded subgaussian distributions using trace-norm convex programming under the RIP condition. [22] developed a conditional gradient descent solver to recover $M^*$. However, these methods and theoretical results are no longer applicable to the general FM model. [3] considered a convexified formulation of FM. Their FM model requires solving a trace-norm regularized loss function, which is computationally expensive, and they did not provide statistical learning guarantees for the convexified FM.
To the best of our knowledge, the most recent research on the theoretical properties of the FM is [12]. In that study, the authors argued that the two constraints proposed in [16] can be removed if $x$ is sampled from the standard Gaussian distribution. However, their analysis heavily relies on the rotation invariance of the Gaussian distribution and therefore cannot be generalized to non-Gaussian cases. Even restricted to the Gaussian distribution, the sampling complexity given by their analysis is worse than the information-theoretic lower bound $\Omega(kd)$. It is still an open question whether the FM can be learned with $\tilde{O}(kd)$ samples and whether both constraints in the original formulation are necessary to make the model learnable.
In this work, we answer the above questions affirmatively. We show that when the data is sampled from a subgaussian distribution satisfying the so-called Moment Invertible Property (MIP), the generalized FM (without constraints) can be learned with $\tilde{O}(kd)$ samples. The PSD constraint is not necessary to achieve this sharp bound. Actually, the PSD constraint is harmful, as it introduces an asymmetric bias on the label (see the Experiments section). The optimal sampling complexity is achievable if we further constrain the diagonal elements of $M^*$ to be zero. This is not an artificial constraint: there is an information-theoretic limitation that prevents us from recovering the diagonal elements of $M^*$ on general subgaussian distributions. Finally, inspired by our theoretical results, we propose an improved version of FM, called iFM, which removes the PSD constraint and inherits the diagonal-zero constraint from the conventional modeling. Unlike the generalized FM, the sampling complexity of the iFM does not depend on the MIP constant of the data distribution.
The remainder of this paper is organized as follows. We revisit the modeling of the FM in Section 2 and show that the conventional modeling is suboptimal when considered in a more general framework. In Section 3 we present the learning guarantee of the generalized FM on subgaussian distributions. We propose the high-order moment elimination technique to overcome a difficulty in our convergence analysis. Based on our theoretical results, we propose the improved model iFM. Section 4 conducts numerical experiments on synthetic and real-world datasets to validate the superiority of iFM over the conventional FM. Section 5 concludes this work.
2 A Revisit of Factorization Machine
In this section, we revisit the modeling design of the FM and its variants. We briefly review the original formulation of the conventional FM to raise several questions about its optimality. We then highlight previous studies trying to establish the theoretical foundation of the FM modeling. Based on the above survey, we motivate our study and present our main results in the next section.
In the original paper, Rendle [16] assumes that the feature interaction coefficients in the FM can be embedded in a $k$-dimensional latent space. That is,
$$ y = x^\top w^* + \sum_{i \neq j} \langle v_i, v_j \rangle x_i x_j \qquad (3) $$
where $v_i \in \mathbb{R}^k$ is the latent vector of the $i$-th feature. The original formulation Eq. (3) is equivalent to Eq. (1) with constraint Eq. (2), where $Z^* = VV^\top$ and $V = [v_1, \dots, v_d]^\top$. While the low-rank assumption is standard in the matrix sensing literature, the PSD and the diagonal-zero constraints are not. A critical question is whether the two additional constraints are necessary or removable. Indeed, we have strong reasons to remove both constraints.
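To make the two equivalent forms concrete, here is a minimal NumPy sketch of the conventional FM prediction, in both a pairwise form and the matrix form of Eqs. (1)–(2). The function names are ours, and the $\sum_{i \neq j}$ convention for the interaction term is an assumption consistent with the matrix form above.

```python
import numpy as np

def fm_predict_pairwise(x, w, V):
    """Conventional FM: y = <w, x> + sum_{i != j} <v_i, v_j> x_i x_j.

    Computed in O(dk) time via the identity
    sum_{i != j} <v_i, v_j> x_i x_j = ||V^T x||^2 - sum_i ||v_i||^2 x_i^2.
    """
    linear = x @ w
    interaction = ((V.T @ x) ** 2).sum() - ((V ** 2).sum(axis=1) * x ** 2).sum()
    return linear + interaction

def fm_predict_matrix(x, w, V):
    """Equivalent matrix form: y = x^T w + x^T M x with
    M = Z - diag(Z) and Z = V V^T (low-rank, PSD)."""
    Z = V @ V.T
    M = Z - np.diag(np.diag(Z))      # the diagonal-zero constraint
    return x @ w + x @ M @ x
```

The pairwise form never materializes the $d \times d$ matrix, which is why FM scales to high-dimensional sparse data.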
The reasons to remove the diagonal-zero constraint are straightforward. First, there is no theoretical result so far to motivate this constraint. Secondly, subtracting the diagonal elements makes the second-order derivative of the loss with respect to $V$ non-PSD. This raises many technical difficulties in optimization and learning theory, as many research works assume convexity in their analysis.
The PSD constraint in the original FM modeling is the second term we wish to remove. Let us temporarily forget about the diagonal-zero constraint and focus on the PSD constraint only. Obviously, relaxing the PSD requirement on $Z^*$ to an arbitrary low-rank matrix makes the model more flexible. A more serious problem of the PSD constraint is that it implicitly assumes that the label is more likely to be "positive", which introduces an asymmetric bias about the distribution of $y$. To see this, suppose $Z^* = U \Lambda U^\top$ where $U$ is the eigenvector matrix of $Z^*$. We call $U$ the second-order feature mapping matrix induced by $Z^*$, since $x^\top Z^* x = (U^\top x)^\top \Lambda (U^\top x)$. The eigenvalue matrix $\Lambda$ gives the weights of the mapped features $U^\top x$. As $Z^*$ is constrained to be PSD, these weights cannot be negative. In other words, the PSD constraint prevents the model from learning patterns of the negative class. Please check the Experiments section for more concrete examples.
Another issue raised by the PSD constraint is the difficulty in optimization. Suppose we choose the least-squares loss in FM. By enforcing $Z = VV^\top$, the loss function is a fourth-order polynomial of $V$. This makes the initialization of $V$ difficult, since the scale of the initial $V$ affects the convergence rate. Clearly we cannot initialize $V = 0$, since the gradient with respect to $V$ would then be zero. On the other hand, we cannot initialize $V$ with a large norm, otherwise the problem becomes ill-conditioned: the spectral norm of the second-order derivative with respect to $V$ is proportional to $\|V\|^2$, so the (local) condition number depends on $V$. In practice, it is usually difficult to figure out the optimal scale of $V$, resulting in vanishing or exploding gradient norms. If we decouple $Z = UV^\top$, the loss is quadratic in $U$ (or $V$) when the other factor is fixed. Then, by alternating gradient descent, the decoupled FM model is easy to optimize.
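The optimization issue above can be illustrated numerically. The sketch below (a hypothetical setup with a least-squares loss; all sizes and names are ours) contrasts the coupled parametrization $Z = VV^\top$, whose loss is a fourth-order polynomial of $V$, with the decoupled parametrization $Z = UV^\top$, whose loss is quadratic in $U$ once $V$ is fixed:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 8, 2, 200
X = rng.standard_normal((n, d))
M_true = rng.standard_normal((d, k)) @ rng.standard_normal((k, d))
y = np.einsum('ni,ij,nj->n', X, M_true, X)

def loss_decoupled(U, V):
    """Least-squares loss with Z = U V^T: quadratic in U for any fixed V."""
    pred = np.einsum('ni,ij,nj->n', X, U @ V.T, X)
    return 0.5 * np.mean((pred - y) ** 2)

def loss_coupled(V):
    """Least-squares loss with Z = V V^T: a fourth-order polynomial of V."""
    pred = np.einsum('ni,ij,nj->n', X, V @ V.T, X)
    return 0.5 * np.mean((pred - y) ** 2)
```

Along any fixed direction, the decoupled loss is an exact quadratic in the step size while the coupled loss is a quartic; this is why each alternating step of the decoupled model is a well-behaved least-squares problem regardless of the initialization scale.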
In summary, the theoretical foundation of the FM is still not well-developed. On one hand, it is unclear whether the conventional FM modeling is optimal; on the other hand, there is strong motivation to modify the conventional formulation based on heuristic intuition. This inspires our theoretically driven study of the optimal modeling of the FM, presented in the next section.
3 Main Results
In this section, we present our main results on the theoretical guarantees of the FM and its improved version, iFM. We first give a sharp complexity bound for the generalized FM on subgaussian distributions. We show that the recovery error of the diagonal elements of $M^*$ depends on the so-called MIP condition of the data distribution. The sampling complexity bound can then be improved to optimal by the diagonal-zero constraint.
We introduce a few more notations needed in this section. Suppose $x \in \mathbb{R}^d$ has independent coordinate-wise subgaussian entries with zero mean and unit variance. The element-wise third-order moment of $x$ is denoted $\mu_3$ and the fourth-order moment $\mu_4$. All training instances are sampled identically and independently (i.i.d.). Denote the feature matrix $X = [x_1, x_2, \dots, x_n]$ and the label vector $y = [y_1, \dots, y_n]^\top$. $\mathrm{diag}(\cdot)$ denotes the diagonal function. For any two matrices $A$ and $B$, we denote their Hadamard product by $A \circ B$. The element-wise squared matrix is defined by $A^{\circ 2} = A \circ A$. For a nonnegative real number $\delta$, we write $\Delta_\delta$ for some perturbation matrix whose spectral norm is upper bounded by $\delta$. The $i$-th largest singular value of a matrix $A$ is $\sigma_i(A)$. To abbreviate our high-probability bounds, given a failure probability $\eta$, we use the symbols $\mathrm{polylog}(\cdot)$ and $\tilde{O}(\cdot)$ to denote polynomial logarithmic factors in $\eta$ and any other necessary variables that do not change the polynomial order of the upper bounds.
3.1 Limitation of The Generalized FM
In order to derive theoretically optimal FM models, we begin with the most general formulation of FM, that is, with no constraint except low rank:
$$ y = x^\top w^* + x^\top M^* x, \qquad \mathrm{rank}(M^*) \le k \qquad (4) $$
Clearly $M^*$ can be taken to be symmetric, but for now this does not matter. Eq. (4) is called the generalized FM [12]. It is proved in [12] that when $x$ is sampled from the Gaussian distribution, Eq. (4) can be learned from polynomially many training samples. Although this bound is not optimal, [12] showed the possibility of removing Eq. (2) on the Gaussian distribution. However, their result no longer holds on non-Gaussian distributions. In the following, we show that the learning guarantee for the generalized FM on subgaussian distributions is much more complex than in the Gaussian case.
Our first important observation is that model (4) is not always learnable on all subgaussian distributions.
Proposition 1.
When $x$ follows the symmetric Bernoulli (Rademacher) distribution, that is, each coordinate takes values in $\{\pm 1\}$ with equal probability, the generalized FM is not learnable.
The above observation is easy to verify: since each coordinate satisfies $x_i^2 \equiv 1$, the diagonal contribution $x^\top \mathrm{diag}(M^*) x = \mathrm{tr}(M^*)$ is a constant. Therefore, at least the diagonal elements of $M^*$ cannot be recovered at all. Proposition 1 shows that there is an information-theoretic limitation to learning the generalized FM on subgaussian distributions. In our analysis, we find that this limitation is related to a property of the data distribution which we call the Moment Invertible Property (MIP).
Definition 2 (Moment Invertible Property).
A zero-mean, unit-variance subgaussian distribution is called Moment Invertible if $|\mu_4 - \mu_3^2 - 1| \ge c_M$ for some constant $c_M > 0$.
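Assuming the MIP quantity is $|\mu_4 - \mu_3^2 - 1|$ as in the definition above (a reconstruction; the helper name `mip_constant` is ours), it can be estimated empirically for common unit-variance distributions:

```python
import numpy as np

def mip_constant(samples):
    """Empirical |mu4 - mu3^2 - 1| for a zero-mean, unit-variance sample."""
    mu3 = np.mean(samples ** 3)
    mu4 = np.mean(samples ** 4)
    return abs(mu4 - mu3 ** 2 - 1.0)

rng = np.random.default_rng(0)
n = 1_000_000
gauss = rng.standard_normal(n)                          # mu3 = 0, mu4 = 3
rademacher = rng.choice([-1.0, 1.0], size=n)            # mu3 = 0, mu4 = 1
uniform = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n)   # mu3 = 0, mu4 = 9/5
```

The Gaussian distribution gives a MIP constant of about 2, while the symmetric Bernoulli distribution gives 0 and thus fails the MIP condition, matching Proposition 1.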
With the MIP condition, the following theorem shows that the generalized FM is learnable via alternating gradient descent.
Theorem 3.
Suppose $x$ is sampled from a MIP subgaussian distribution and $y$ is generated by Eq. (4). Then with probability at least $1 - \eta$, there is an alternating gradient descent based method that achieves recovery error $\epsilon$ after $O(\log(1/\epsilon))$ iterations, provided the number of samples $n$ is sufficiently large.
Theorem 3 is our key result. In Theorem 3, we measure the quality of our estimation by the recovery error of $(w^*, M^*)$. The recovery error decreases linearly along the alternating iterations. The sampling complexity bound delivers two messages. First, when the distribution is close to the Gaussian distribution, the MIP constant is bounded away from zero and the bound is controlled by $\tilde{O}(kd)$. This result improves the previous bound given by [12] for the Gaussian distribution. Secondly, when the MIP constant is small, the sampling complexity grows inversely with it. The sampling complexity even tends to infinity when the data follows the symmetric Bernoulli distribution, where $\mu_4 - \mu_3^2 - 1 = 0$. Therefore the MIP condition provides a sufficient condition for the generalized FM to be learnable.
We have not yet given any detail of the alternating gradient descent algorithm mentioned in Theorem 3. We find it difficult to prove the convergence rate following the conventional alternating gradient descent framework. To address this difficulty, we use a high-order moment elimination technique in the convergence analysis of the next subsection.
3.2 Alternating Gradient Descent with High-Order Moment Elimination
In this subsection, we construct an alternating gradient descent algorithm that achieves the convergence rate and the sampling complexity in Theorem 3. We first show that the conventional alternating gradient descent analysis cannot be applied directly to prove Theorem 3. Then a high-order moment elimination technique is proposed to overcome the difficulty.
The generalized FM defined in Eq. (4) can be written in the matrix form
$$ y = X^\top w^* + \mathcal{A}(M^*) \qquad (5) $$
where the operator $\mathcal{A}$ is defined by $[\mathcal{A}(M)]_i = x_i^\top M x_i$. The adjoint operator of $\mathcal{A}$ is $\mathcal{A}^*(z) = \sum_{i=1}^n z_i x_i x_i^\top$. To recover $(w^*, M^*)$, we minimize the squared loss function
$$ \min_{w, M} \ \frac{1}{2n} \left\| X^\top w + \mathcal{A}(M) - y \right\|_2^2 \qquad (6) $$
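In code, the measurement operator, its adjoint, and the squared loss with its gradients can be sketched as follows (an illustrative NumPy implementation with our own function names; here the rows of `X` are the instances):

```python
import numpy as np

def A_op(X, M):
    """Quadratic measurement operator: [A(M)]_i = x_i^T M x_i."""
    return np.einsum('ni,ij,nj->n', X, M, X)

def A_adj(X, z):
    """Adjoint operator: A*(z) = sum_i z_i x_i x_i^T."""
    return np.einsum('n,ni,nj->ij', z, X, X)

def sq_loss_and_grads(X, y, w, M):
    """Squared loss (1/2n) ||X w + A(M) - y||^2 and its gradients."""
    n = X.shape[0]
    r = X @ w + A_op(X, M) - y       # residual
    grad_w = X.T @ r / n
    grad_M = A_adj(X, r) / n         # gradient w.r.t. M, via the adjoint
    return 0.5 * np.mean(r ** 2), grad_w, grad_M
```

The gradient with respect to $M$ is exactly the adjoint operator applied to the residual, which is the quantity analyzed in the expected-gradient computation below.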
A straightforward idea to prove Theorem 3 is to show that alternating gradient descent converges directly. However, this turns out to be difficult in our problem. To see this, one can compute the expected gradient of the loss with respect to the factor being updated at step $t$. In previous studies, the expected gradient points toward the ground truth. However, this is no longer the case in our problem: the expectation is dominated by additional perturbation terms involving the third- and fourth-order moments of $x$. For non-Gaussian distributions, these perturbation terms can dominate the gradient norm. Similarly, the gradient with respect to $w$ is biased by high-order moment terms.
The difficulty in following the conventional gradient descent analysis inspires us to look for a new convergence analysis technique. The perturbation terms consist of high-order moments of the subgaussian variable $x$. It might be possible to construct another sequence of high-order moment terms to eliminate these perturbations. We call this idea the high-order moment elimination method. The next questions are whether the desired moments exist and how to construct them efficiently. Unfortunately, this is impossible in general. A sufficient condition to ensure the existence of the elimination sequence is that the data distribution satisfies the MIP condition.
To construct an elimination sequence, define the following auxiliary functions:
Notice that, in expectation,
This inspires us to find a linear combination of these functions that eliminates the perturbation terms. The solution of this linear combination equation is
(7)  
where
(8)  
(9) 
Similarly, to eliminate the high-order moments in the gradient with respect to $w$, we construct
(10)  
The overall construction is given in Algorithm 1.
We briefly show that the construction in Algorithm 1 eliminates the high-order moments, from which a global linear convergence rate follows immediately. Please check the appendix for details. We omit the iteration superscript when no confusion arises.
First we show that $\mathcal{A}$ is conditionally-independent restricted isometric after shifting its expectation (shifted CI-RIP). The proof can be found in Appendix B.
Theorem 4 (Shift CIRIP).
Suppose $x$ has independent, zero-mean, unit-variance subgaussian coordinates. Fix a rank-$k$ matrix $M$. Then with probability at least $1 - \eta$, the shifted operator is nearly isometric on $M$, provided the number of samples is sufficiently large.
Theorem 4 is the main theorem in our analysis. The key ingredient of our proof is to apply the matrix Bernstein inequality together with an improved version of the subgaussian Hanson-Wright inequality proved by [18]. Please check Appendix B for more details.
Based on the shifted CI-RIP condition of the operator $\mathcal{A}$, we prove the following perturbation bounds.
Lemma 5.
For a sufficiently large number of samples, with probability at least $1 - \eta$,
Lemma 5 shows that the relevant empirical operators are concentrated around their expectations given near-linearly many samples. To finish our construction, we need to bound the deviation of the empirical moment estimates from their expectations. This is done in the following lemma.
Lemma 6.
Lemma 6 shows that the empirical moment estimates converge as long as the sample size is large enough. The matrix inversion in the definition of the elimination sequence requires that the MIP condition be satisfied.
We are now ready to show that the constructed operators are almost isometric.
Lemma 7.
Lemma 7 shows that the constructed operators are almost isometric when the number of samples is large enough. The proof of Lemma 7 consists of two steps. First, we replace each operator or matrix with its expectation plus a small perturbation, as given in Lemma 5 and Lemma 6. Then Lemma 7 follows after simplification. Theorem 3 is obtained by combining Lemma 7 with the alternating gradient descent analysis. Please check Appendix F for the complete proof.
3.3 Improved Factorization Machine
Theorem 3 shows that learning the generalized FM is hard on non-Gaussian distributions. In particular, when the data distribution has a very small MIP constant, the sampling complexity to recover $M^*$ in Eq. (4) becomes arbitrarily large. The recovery is even impossible on distributions such as the symmetric Bernoulli distribution. Clearly, a well-defined learnable model should not depend on the MIP condition.
Indeed, the bound given by Theorem 3 is quite sharp: it explains well why we cannot recover $M^*$ on the symmetric Bernoulli distribution. Therefore it is unlikely that the dependency can be removed by designing a better elimination sequence in Algorithm 1. After examining the proof of Theorem 3 carefully, we find that the only reason our bound contains the MIP constant is that the diagonal elements of $M$ are allowed to be nonzero. If we constrain the diagonal elements of $M$ to be zero, the problematic diagonal term in the expected gradient vanishes, and we no longer need to eliminate it during the alternating iteration. This greatly simplifies our convergence analysis, as we only need Theorem 4, which now becomes
(11)
Eq. (11) shows that $\mathcal{A}$ is almost isometric on diagonal-zero low-rank matrices, which immediately implies the linear convergence rate of alternating gradient descent. As a direct corollary of Theorem 4, the sampling complexity can be improved to $\tilde{O}(kd)$, which is optimal up to logarithmic factors. Inspired by these observations, we propose to learn the following FM model:
$$ y = x^\top w + x^\top \left( UV^\top - \mathrm{diag}(UV^\top) \right) x, \qquad U, V \in \mathbb{R}^{d \times k} \qquad (12) $$
We call the above model the Improved Factorization Machine (iFM). The iFM model is a trade-off between the conventional FM model and the generalized FM model. It drops the PSD constraint by decoupling $Z = UV^\top$ as in the generalized FM model, but keeps the diagonal-zero constraint of the conventional FM model. Unlike the conventional FM model, the iFM model is proposed in a theoretically driven way. The decoupling of $U$ and $V$ makes the iFM easy to optimize, while the diagonal-zero constraint makes it learnable with the optimal sampling complexity. In the next section, we verify the above discussion with numerical experiments.
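A minimal sketch of iFM under these modeling choices (hypothetical hyperparameters and function names; for brevity we use a simultaneous-update variant of the alternating gradient descent scheme, not the exact algorithm analyzed above):

```python
import numpy as np

def ifm_predict(X, w, U, V):
    """iFM prediction: y = x^T w + x^T (U V^T - diag(U V^T)) x, per row of X."""
    M = U @ V.T
    M = M - np.diag(np.diag(M))          # the diagonal-zero constraint
    return X @ w + np.einsum('ni,ij,nj->n', X, M, X)

def ifm_fit(X, y, k, steps=1000, lr=0.01, seed=0):
    """Gradient descent on the squared loss with decoupled factors U and V."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    U = 0.1 * rng.standard_normal((d, k))    # small init: loss is bilinear,
    V = 0.1 * rng.standard_normal((d, k))    # so the scale is not critical
    for _ in range(steps):
        r = ifm_predict(X, w, U, V) - y                 # residual
        G = np.einsum('n,ni,nj->ij', r, X, X) / n       # gradient w.r.t. M
        np.fill_diagonal(G, 0.0)                        # respect diag-zero
        U, V = U - lr * (G @ V), V - lr * (G.T @ U)     # update both factors
        w -= lr * (X.T @ r) / n
    return w, U, V
```

Because $UV^\top$ is not forced to be PSD, the learned interaction matrix can carry both positive and negative eigenvalues, which is exactly the flexibility the PSD-constrained FM lacks.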
4 Experiments
We first use synthetic data in subsection 4.1 to show the modeling power of iFM and the PSD bias of the conventional FM. In subsection 4.2 we apply iFM to a real-world problem, the vTEC estimation task, to demonstrate its superiority over baseline methods.
4.1 Synthetic Data
In this subsection, we construct numerical examples to support our theoretical results. To this end, the true model parameters are sampled from the Gaussian distribution, and we randomly sample a training set and a testing set of instances. In Figure 1, we report the convergence curves of iFM and FM on the testing set averaged over 10 trials. The x-axis is the iteration step and the y-axis is the Root Mean Square Error (RMSE). In Figure 1(a), we generate labels following the conventional FM assumption. Both iFM and FM converge well. In Figure 1(b), we flip the sign of the labels in (a). While iFM still converges well, FM cannot model the sign flip. This example shows why we should avoid the conventional FM both in theory and in practice: even a simple sign-flipping operation can make the model underfit the data. In Figure 1(c), we generate the interaction matrix with both positive and negative eigenvalues. Again the conventional FM cannot fit this data, as the distribution of $y$ is now symmetric in both directions.
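The three synthetic settings can be sketched as follows (hypothetical sizes; the label-generation details reflect our reading of the figure descriptions above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 3, 2000
X = rng.standard_normal((n, d))

# (a) labels under the conventional FM assumption: Z = V V^T is PSD
V = rng.standard_normal((d, k))
Z = V @ V.T
M_psd = Z - np.diag(np.diag(Z))
y_a = np.einsum('ni,ij,nj->n', X, M_psd, X)

# (b) the same labels with flipped sign: equivalent to replacing Z by -Z,
# which is negative semidefinite and outside the conventional FM model class
y_b = -y_a

# (c) an interaction matrix with both positive and negative eigenvalues
M_mixed = rng.standard_normal((d, k)) @ rng.standard_normal((k, d))
M_mixed = M_mixed - np.diag(np.diag(M_mixed))
y_c = np.einsum('ni,ij,nj->n', X, M_mixed, X)
```

Note that in setting (c) the symmetrized interaction matrix has zero trace, so it necessarily carries eigenvalues of both signs, which a PSD-constrained model cannot represent.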
4.2 vTEC Estimation
Ridge  LASSO  ElasticNet  Kernel  FM  iFM  

TestDay1  0.02161  0.02260  0.02222  0.02570  0.02178  0.02164 
TestDay2  0.01430  0.01461  0.01454  0.01683  0.01439  0.01404 
TestDay3  0.01508  0.01524  0.01496  0.01875  0.01484  0.01484 
TestDay4  0.01449  0.01487  0.01460  0.01564  0.01432  0.01423 
TestDay5  0.01610  0.01579  0.01612  0.01744  0.01606  0.01567 
TestDay6  0.01487  0.01491  0.01483  0.01684  0.01470  0.01459 
TestDay7  0.01828  0.01849  0.01827  0.02209  0.01830  0.01805 
TestDay8  0.01461  0.01552  0.01519  0.01629  0.01466  0.01453 
TestDay9  0.01657  0.01646  0.01639  0.02096  0.01646  0.01636 
Average  0.01621  0.01650  0.01635  0.01895  0.01617  0.01599 
RTK  62.67%  62.18%  63.07%  52.09%  63.10%  64.44% 
In this subsection, we demonstrate the superiority of iFM in a real-world application, vertical Total Electron Content (vTEC) estimation. The vTEC is an important descriptive parameter of the Earth's ionosphere: it measures the total number of electrons integrated along a perpendicular path from space to the ground. One important application is real-time high-precision GPS signal calibration. The accuracy of the GPS system heavily depends on the pseudorange measurement between the satellite and the receiver, and the major error in the pseudorange measurement is caused by the vTEC, which is dynamic.
In order to estimate the vTEC, we build a triangle mesh grid system in an anonymous region. Each node in the grid is a ground station equipped with a dual-frequency high-precision GPS receiver. The distance between two nodes is around 75 kilometers. Each station sends out resolved GPS data every second. Formally, our system solves an online regression problem: at every time step, measured in seconds, the system receives data points, each representing an ionospheric pierce point of a satellite-receiver pair. The first two dimensions of a data point are the latitude and the longitude of the pierce point; the third and fourth dimensions are the zenith angle and the azimuth angle respectively. We omit the time superscript below, as we always stay within the same time window. In order to build a localized prediction model, we encode each data point into a high-dimensional vector. First we collect data points for 60 seconds. Then we cluster the collected points into clusters via the K-means algorithm and record the cluster centers. The first two dimensions of each data point are then encoded with respect to the center of the cluster to which it belongs. Finally, each data point is encoded into a high-dimensional vector.
Evaluation
It is important to note that our problem does not fit the conventional machine learning framework. We only have a validation set for model selection and an evaluation set to measure model performance. We introduce the training stations and the testing station, which correspond to the "training set" and "testing set" in the conventional machine learning framework; however, please be advised that they are not exactly the same concepts.
To evaluate the performance of our models, we randomly select one ground station as the testing station. Around the testing station, we choose ground stations as training stations to learn the online prediction model, which maps each encoded data point to its corresponding vTEC value. In our system, we are only given the double differences of the vTEC values, due to the signal resolving process: when two satellites and two ground stations are connected in the data-link graph, the observed double difference is a combination of the vTEC values at the ionospheric pierce points of the four satellite-station pairs. Since the true vTEC is an unknown function, we approximate it with our prediction model. Once the model is learned from the training stations, we apply it to predict the double differences in which one of the stations is the testing station.
Once we obtain the vTEC estimation, we use it to calibrate the GPS signal and finally compute the geometric coordinates of the user. The RTK ratio measures the quality of the positioning service: it is the probability of successful positioning with an accuracy of at least one centimeter. The RTK ratio is computed by a commercial system, which is much slower than computing the RMSE.
Dataset and Results
We select a ground station at the region center as the testing station. Around it, 16 stations are selected as training stations. We collect 5 consecutive days of data as the validation set for parameter tuning, and the following 9 days of data are used as the evaluation set. We update the prediction model every 60 seconds. The learned model is used to predict the double differences relating to the testing station, and we compare the predicted double differences to the true values observed at the testing station. The number of valid satellites in our experiment is around 10 to 20.
In Table 1
, we report the root-mean-square error (RMSE) over the 9-day period. The dates are denoted TestDay1 to TestDay9 for anonymity. Five baseline methods are evaluated: Ridge Regression, LASSO, ElasticNet, Kernel Ridge Regression (Kernel) with the RBF kernel, and the conventional Factorization Machine (FM). More computationally expensive models, such as deep neural networks, are not feasible for our online system. The regularization parameters of Ridge, LASSO, Kernel, ElasticNet, FM and iFM are tuned over logarithmic grids, and the rank of the interaction matrix is tuned over a small candidate set. We use Scikit-learn [15] and fastFM [1] to implement the baseline methods.
In Table 1, we observe that iFM is uniformly better than the baseline methods. The second-to-last row reports the error averaged over the evaluation period; the 95% confidence intervals are tight in our experiment. The optimal rank of FM is 2 and the optimal rank of iFM is 6. We note that FM is better than the first-order linear models since it captures second-order information, which indicates that the second-order information is indeed helpful.
In the last row of Table 1, we report the RTK ratio averaged over the 9 days. We find that the RTK ratio improves considerably even with a small improvement in vTEC estimation. This is because the error of the vTEC estimation is broadcast and magnified in the RTK computation pipeline. The RTK ratio of iFM is about 1.77% better than that of Ridge regression and more than 12% better than that of Kernel regression. Compared with FM, it is 1.34% better. We conclude that iFM achieves overall better performance and the improvement is statistically significant.
5 Conclusion
We study the learning guarantees of the FM solved by alternating gradient descent on subgaussian distributions. We find that the conventional modeling of the factorization machine might be suboptimal in capturing negative second-order patterns. We prove that the constraints in the conventional FM can be removed, resulting in a generalized FM model learnable with $\tilde{O}(kd)$ samples. The sampling complexity can be improved to the optimal with the diagonal-zero constraint. Our theoretical analysis shows that the optimal modeling of high-order linear models does not always agree with heuristic intuition. We hope this work will inspire future research on non-convex high-order machines with solid theoretical foundations.
6 Acknowledgments
This research is supported in part by NSF (III-1539991). The high-precision GPS dataset is provided by Qianxun Spatial Intelligence Inc., China. We thank Dr. Wotao Yin from the University of California, Los Angeles, and the anonymous reviewers for their insightful comments.
References
 [1] Immanuel Bayer. fastFM: A Library for Factorization Machines. Journal of Machine Learning Research, 17(184):1–5, 2016.
 [2] Immanuel Bayer, Xiangnan He, Bhargav Kanagal, and Steffen Rendle. A generic coordinate descent framework for learning from implicit feedback. In Proceedings of the 26th International Conference on World Wide Web, WWW ’17, pages 1341–1350, 2017.
 [3] Mathieu Blondel, Akinori Fujino, and Naonori Ueda. Convex Factorization Machines. In Machine Learning and Knowledge Discovery in Databases, number 9285 in Lecture Notes in Computer Science. 2015.
 [4] Mathieu Blondel, Masakazu Ishihata, Akinori Fujino, and Naonori Ueda. Polynomial Networks and Factorization Machines: New Insights and Efficient Training Algorithms. In Proceedings of The 33rd International Conference on Machine Learning, pages 850–858, 2016.
 [5] T. Tony Cai and Anru Zhang. ROP: Matrix recovery via rankone projections. The Annals of Statistics, 43(1):102–138, 2015.
 [6] E. Candes, Y. Eldar, T. Strohmer, and V. Voroninski. Phase Retrieval via Matrix Completion. SIAM Journal on Imaging Sciences, 6(1):199–225, 2013.
 [7] Liangjie Hong, Aziz S. Doumith, and Brian D. Davison. Co-factorization Machines: Modeling User Interests and Predicting Individual Decisions in Twitter. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 557–566, New York, NY, USA, 2013.
 [8] Yuchin Juan, Damien Lefortier, and Olivier Chapelle. Field-aware factorization machines in a real-world online advertising system. In Proceedings of the 26th International Conference on World Wide Web Companion, WWW '17 Companion, pages 680–688, 2017.
 [9] Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. Field-aware Factorization Machines for CTR Prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 43–50, 2016.
 [10] Richard Kueng, Holger Rauhut, and Ulrich Terstiege. Low rank matrix recovery from rank one measurements. Applied and Computational Harmonic Analysis, 42(1):88–116, 2017.
 [11] Kaixiang Lin, Jianpeng Xu, Inci M. Baytas, Shuiwang Ji, and Jiayu Zhou. MultiTask Feature Interaction Learning. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1735–1744, 2016.
 [12] Ming Lin and Jieping Ye. A non-convex one-pass framework for generalized factorization machine and rank-one matrix sensing. In Advances in Neural Information Processing Systems, pages 1633–1641, 2016.
 [13] Xiao Lin, Wenpeng Zhang, Min Zhang, Wenwu Zhu, Jian Pei, Peilin Zhao, and Junzhou Huang. Online compact convexified factorization machine. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 1633–1642, 2018.
 [14] Luo Luo, Wenpeng Zhang, Zhihua Zhang, Wenwu Zhu, Tong Zhang, and Jian Pei. Sketched follow-the-regularized-leader for online factorization machine. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1900–1909, 2018.
 [15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 [16] Steffen Rendle. Factorization machines. In IEEE 10th International Conference On Data Mining, pages 995–1000, 2010.

 [17] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. 2017.
 [18] Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18, 2013.
 [19] G. W. Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic Press, Boston, 1st edition, 1990.

 [20] Terence Tao. Topics in Random Matrix Theory, volume 132. American Mathematical Society, 2012.
 [21] Makoto Yamada, Wenzhao Lian, Amit Goyal, Jianhui Chen, Kishan Wimalawarne, Suleiman A. Khan, Samuel Kaski, Hiroshi Mamitsuka, and Yi Chang. Convex factorization machine for toxicogenomics prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1215–1224, 2017.
 [22] Alp Yurtsever, Madeleine Udell, Joel A. Tropp, and Volkan Cevher. Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage. arXiv:1702.06838 [math, stat], 2017.
 [23] Huan Zhao, Quanming Yao, Jianda Li, Yangqiu Song, and Dik Lun Lee. Meta-graph based recommendation fusion over heterogeneous information networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 635–644, 2017.
Appendix A Preliminary
The Orlicz $\psi_2$ norm of a subgaussian random variable $X$ is defined by
$$ \|X\|_{\psi_2} = \inf \left\{ t > 0 : \mathbb{E} \exp(X^2 / t^2) \le 2 \right\} . $$
For a random subgaussian vector $x \in \mathbb{R}^n$, its Orlicz norm is
$$ \|x\|_{\psi_2} = \sup_{v \in S^{n-1}} \| \langle x, v \rangle \|_{\psi_2} $$
where $S^{n-1}$ is the unit sphere.
The following theorem gives the matrix Bernstein’s inequality [17].
Theorem 8 (Matrix Bernstein’s inequality).
Let $X_1, \dots, X_N$ be independent, mean-zero, $d \times d$ random matrices with $\|X_i\| \le K$ almost surely. Denote
$$ \sigma^2 = \Big\| \sum_{i=1}^N \mathbb{E}[X_i^2] \Big\| . $$
Then for any $t \ge 0$, we have
$$ \Pr \left( \Big\| \sum_{i=1}^N X_i \Big\| \ge t \right) \le 2d \exp \left( - \frac{c\, t^2}{\sigma^2 + K t} \right) $$
where $c$ is a universal constant. Equivalently, with probability at least $1 - \eta$,
$$ \Big\| \sum_{i=1}^N X_i \Big\| \le C \left( \sigma \sqrt{\log(d/\eta)} + K \log(d/\eta) \right) . $$
When the matrices are rectangular of size $d_1 \times d_2$, replacing $d$ with $d_1 + d_2$, the inequality still holds true.
The following HansonWright inequality for subgaussian variables is given in [18] .
Theorem 9 (Subgaussian HansonWright inequality).
Let $x = (x_1, \dots, x_n)$ be a random vector with independent, mean-zero, subgaussian coordinates, and let $K = \max_i \|x_i\|_{\psi_2}$. Then given a fixed matrix $A$, for any $t \ge 0$,
$$ \Pr \left( \left| x^\top A x - \mathbb{E}[ x^\top A x ] \right| > t \right) \le 2 \exp \left( - c \min \left( \frac{t^2}{K^4 \|A\|_F^2}, \frac{t}{K^2 \|A\|} \right) \right) $$
where $c$ is a universal positive constant. Equivalently, with probability at least $1 - \eta$,
$$ \left| x^\top A x - \mathbb{E}[x^\top A x] \right| \le C \left( K^2 \|A\|_F \sqrt{\log(1/\eta)} + K^2 \|A\| \log(1/\eta) \right) . $$
Truncation trick
As Bernstein's inequality requires boundedness of the random variables, we use the truncation trick in order to apply it to unbounded random matrices. First we condition on the tail event to bound the norm of a fixed random matrix; then we take a union bound over all random matrices in the summation. The union bound results in an extra logarithmic penalty in the sampling complexity, which can be absorbed into the $\mathrm{polylog}$ factors. Please check [20] for more details.
Appendix B Proof of Theorem 4
Define . Recall that
Denote
In order to apply the matrix Bernstein inequality, we bound the spectral norm and the variance of each summand. The third inequality follows from the Hanson-Wright inequality and the fact established in Appendix C (proof of Lemma 5). The last inequality follows from the definition of the truncation level.