1 Introduction
Originally introduced in [16], the Factorization Machine (FM) is a model class that combines the advantages of linear models, such as Support Vector Machines (SVM) [5], with those of factorization models. Like a linear model, FM is a general predictor that learns a weight vector for any real-valued feature vector. In addition, FM learns a pairwise interaction matrix over all pairs of variables, so it can estimate interactions even on highly sparse data (e.g., recommender systems) where linear models fail. The interaction matrix is represented by factorized parameters whose latent dimension is much smaller than the original dimension of the instances. This brings several benefits. First, it acts as a form of regularization: the rank of the interaction matrix is bounded by the latent dimension, and the number of parameters is much smaller than that of the full matrix. Second, it allows the prediction score of FM to be computed in linear time, so FMs can be optimized directly. Because of these advantages, FM can be applied to any supervised learning task, including classification, regression, and recommendation. Moreover, FM can mimic most factorization models
[19, 17], including standard matrix factorization [22], SVD++ [10], timeSVD++ [11], and PITF (Pairwise Interaction Tensor Factorization)
[20], simply through feature engineering. This property makes FM suitable for many application domains where factorization models are appropriate. In practice, FM achieves accuracy as good as the best specialized models on the Netflix and KDD Cup 2012 challenges [18]. Although the original Factorization Machine is successfully applied to optimize prediction accuracy [16], it is not guaranteed to optimize ranking performance for recommender systems [3, 4]. Recently, the Pairwise Ranking Factorization Machine (PRFM) algorithm [15] was proposed to directly optimize the Area Under the ROC Curve (AUC). However, AUC is not suitable for top-N recommendation tasks [14], where accuracy at the top of the list matters more than accuracy at lower positions, as captured by rank-biased metrics such as Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR) [12]. LambdaFM [25] was therefore proposed to directly optimize rank-biased metrics, using the core idea of LambdaRank [1] that top pairs are assigned higher importance. Empirical results show that LambdaFM generally outperforms PRFM on different ranking metrics. Although FM and its variants are successfully applied to many problems, one usually needs to run the algorithm many times to choose the rank properly, which is clearly inefficient for large-scale datasets.
Motivated by the above observations, we would like to design an algorithm that can adaptively search for a proper latent dimension on different datasets without retraining. To achieve this goal, we adopt the boosting technique, which was proposed to improve the performance (e.g., AUC, NDCG, MRR) of models by combining multiple weak models [8, 24], and propose an Adaptive Boosting framework for Factorization Machines (AdaFM). Specifically, AdaFM works in rounds to build multiple component FMs based on dynamically weighted training datasets, which are linearly combined to construct a strong FM. In this way, AdaFM gradually and adaptively increases its latent dimension according to its performance until the performance saturates. As the component FM, we can choose the original FM, PRFM, or LambdaFM, according to the performance measure we would like to optimize. To verify the performance of the proposed framework, we conduct an extensive set of experiments on several large-scale real-world datasets. Encouraging empirical results show that the proposed algorithms are more effective than other state-of-the-art Factorization Machines.
The rest of the paper is organized as follows. Section 2 presents the proposed framework and algorithms. Section 3 discusses our experimental results and Section 4 concludes our work.
2 Adaptive Boosting Factorization Machine
In this section, we first introduce the problem setting and the Factorization Machine. Then we present our Adaptive Boosting Factorization Machine framework, followed by several specific algorithms.
2.1 Problem Settings
Our goal is to learn a function $f: \mathbb{R}^d \rightarrow \mathcal{Y}$ based on a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$, where $\mathbf{x}_i \in \mathbb{R}^d$ is the feature vector of the $i$-th instance and $y_i \in \mathcal{Y}$ is its label. Different choices of $\mathcal{Y}$ correspond to different problems. For example, when $\mathcal{Y} = \{-1, +1\}$, we can treat the problem as a classification problem.
2.1.1 Factorization Machine
To learn a reasonable $f$, the Factorization Machine (FM) can be adopted. Specifically, a second-order FM predicts the output for an instance $\mathbf{x}$ using the following simple equation:
$$f(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j \qquad (1)$$
where $x_i$ is the $i$-th element of $\mathbf{x}$, $\langle \cdot, \cdot \rangle$ denotes the inner product, and the model parameters to be learnt consist of $w_0 \in \mathbb{R}$, $\mathbf{w} \in \mathbb{R}^d$, and $V = [\mathbf{v}_1, \ldots, \mathbf{v}_d]^\top \in \mathbb{R}^{d \times k}$,
where $k$ is a usually prefixed parameter that defines the rank of the factorization.
Intuitively, the vector $\mathbf{w}$, the linear part of the model, contains the weights of individual features for predicting the output, while the positive semidefinite matrix $W = VV^\top$, the factorization part, captures all pairwise interactions between the variables. Using the factorized parametrization instead of a full matrix $W$ rests on the assumption that the effect of pairwise interactions has low rank. This explicit low-rank assumption helps reduce overfitting, and allows FM to estimate reliable parameters even on highly sparse data. In addition, it reduces the number of parameters to be learnt from $O(d^2)$ to $O(kd)$, and allows the prediction to be computed efficiently by using
$$\sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \Big[ \big(\mathbf{x}^\top V_{\cdot f}\big)^2 - (\mathbf{x} \odot \mathbf{x})^\top (V_{\cdot f} \odot V_{\cdot f}) \Big] \qquad (2)$$
where $\odot$ is the elementwise product and $V_{\cdot f}$ denotes the $f$-th column of $V$. So FM can be computed efficiently with cost $O(kd)$ instead of the $O(kd^2)$ incurred when implemented naively.
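As a concrete illustration, the identity in Eq. (2) can be implemented directly. The following NumPy sketch (function and variable names are our own choices, not the paper's) checks the $O(kd)$ evaluation against the naive pairwise double loop:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM prediction in O(k*d) time via Eq. (2).

    x  : (d,) feature vector
    w0 : scalar bias
    w  : (d,) linear weights
    V  : (d, k) factor matrix; row i is the latent vector v_i
    """
    linear = w0 + w @ x
    s = V.T @ x                       # (k,) -- sum_i v_{i,f} x_i for each factor f
    s2 = (V ** 2).T @ (x ** 2)        # (k,) -- sum_i v_{i,f}^2 x_i^2
    return linear + 0.5 * np.sum(s ** 2 - s2)

def fm_predict_naive(x, w0, w, V):
    """O(k*d^2) reference: explicit sum over all pairs i < j."""
    d = len(x)
    pair = sum((V[i] @ V[j]) * x[i] * x[j]
               for i in range(d) for j in range(i + 1, d))
    return w0 + w @ x + pair
```

Both functions compute the same score; only the cost differs.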
Given the above parametric FM function, we now take $\mathcal{Y} = \{-1, +1\}$ as a concrete example, which can be treated as a classification problem. In order to learn the optimal parameters for FM, we need to introduce a loss function $\ell(f(\mathbf{x}), y)$ to measure the performance of $f$ on $\mathcal{D}$. One popular choice is the well-known logistic regression loss, $\ell(f(\mathbf{x}), y) = \log(1 + \exp(-y f(\mathbf{x})))$, which measures how much the function $f$ violates the desired constraint. Under these settings, FM is formulated as
$$\min_{\Theta} \sum_{i=1}^{n} \log\big(1 + \exp(-y_i f(\mathbf{x}_i))\big) + \frac{\lambda}{2} \|\Theta\|^2 \qquad (3)$$
where $\Theta = \{w_0, \mathbf{w}, V\}$. The parameter $\lambda$ trades off the regularization term against the empirical loss.
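The objective (3) is typically minimized by stochastic gradient descent. Below is a minimal sketch of one SGD step using the multilinearity gradients of FM (the function name, learning rate, and regularization constant are our assumptions, not values from the paper):

```python
import numpy as np

def sgd_step(x, y, w0, w, V, lr=0.01, reg=1e-4):
    """One SGD step on the logistic loss l = log(1 + exp(-y * f(x))).

    Uses the FM multilinearity gradients:
      df/dw0   = 1
      df/dw_i  = x_i
      df/dV_if = x_i * (s_f - V_if * x_i), with s_f = sum_j V_jf x_j.
    """
    s = V.T @ x                                       # (k,)
    f = w0 + w @ x + 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    g = -y / (1.0 + np.exp(y * f))                    # dl/df
    w0 = w0 - lr * g
    w -= lr * (g * x + reg * w)
    V -= lr * (g * (np.outer(x, s) - V * (x ** 2)[:, None]) + reg * V)
    return w0, w, V
```

Repeating this step over randomly drawn examples approximately minimizes (3).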
2.1.2 Pairwise Ranking Factorization Machine
Although traditional FM can be applied to many problems where interactions are hard to estimate, it is usually designed to approximately minimize classification error or regression loss, which is not appropriate for ranking tasks, where what matters is the rank order rather than the prediction scores themselves.
To solve this task, the Pairwise Ranking Factorization Machine (PRFM) was proposed. In PRFM, the dataset is first transformed into a set of pairs $\mathcal{P} = \{(\mathbf{x}_i, \mathbf{x}_j, y_{ij})\}$, where $y_{ij} = +1$ if $y_i > y_j$ and $y_{ij} = -1$ otherwise. Then the objective function of PRFM is defined as
$$\min_{\Theta} \sum_{(\mathbf{x}_i, \mathbf{x}_j) \in \mathcal{P}} \log\Big(1 + \exp\big(-y_{ij}\,(f(\mathbf{x}_i) - f(\mathbf{x}_j))\big)\Big) + \frac{\lambda}{2}\|\Theta\|^2 \qquad (4)$$
where the loss is again the logistic regression loss and $\lambda$ is a regularization parameter. Intuitively, the PRFM model assigns higher scores to positive instances than to negative instances, which is equivalent to approximately maximizing a concave lower bound of the AUC performance measure. In practice, PRFM does work much better than FM on recommendation tasks measured by AUC.
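The two quantities PRFM works with can be sketched as follows (function names are ours): the logistic pairwise surrogate appearing in (4), and the empirical AUC that it lower-bounds:

```python
import numpy as np

def prfm_pair_loss(score_pos, score_neg):
    """Logistic surrogate for one (positive, negative) pair:
    l = log(1 + exp(-(f(x+) - f(x-)))).
    This is a smooth upper bound on the 0/1 ranking error of the pair."""
    return np.log1p(np.exp(-(score_pos - score_neg)))

def auc(scores_pos, scores_neg):
    """Empirical AUC: fraction of correctly ordered (pos, neg) pairs."""
    correct = sum(sp > sn for sp in scores_pos for sn in scores_neg)
    return correct / (len(scores_pos) * len(scores_neg))
```

Driving the pairwise loss down over all pairs drives the fraction of correctly ordered pairs, i.e., the AUC, up.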
2.1.3 Lambda Factorization Machine
Although PRFM achieves significantly higher AUC than traditional FM, in PRFM an incorrect pairwise ordering at the bottom of the list impacts the objective just as much as one at the top. This makes it unsuitable for top-N recommendation tasks, where accuracy at the top of the list matters more to recommendation quality than accuracy at lower positions. This can be further explained through rank-biased metrics, such as NDCG and MRR [12], which assign higher weights to accuracy at top positions.
To address this issue, LambdaFM is proposed to directly optimize rank-biased metrics, using the core idea of LambdaRank that different pairs are assigned different importance according to their positions in the list. Specifically, three strategies are proposed in LambdaFM. The first is the Static Sampler, in which item $i$ is assigned a sampling probability
$$p_i \propto \exp\left(-\frac{r(i)}{|I| \cdot \rho}\right), \quad \rho \in (0, 1] \qquad (5)$$
where $r(i)$ represents the rank of item $i$ among all items $I$ according to its overall popularity, and $\rho$ is a parameter. The second is the Dynamic Sampler, which first draws $m$ samples uniformly from the unobserved item set $I \setminus I_u^+$, where $I_u^+$ is the item set clicked by user $u$, and then samples one item $j$ according to the distribution
$$p_j \propto \exp\left(-\frac{r_u(j)}{m \cdot \rho}\right) \qquad (6)$$
where $r_u(j)$ is the rank of item $j$ among the $m$ sampled items ordered by their current prediction scores. Different from the first two samplers, which push non-positive items with high ranks down from the top positions, the third strategy pulls positive items with low ranks up from the bottom positions. Specifically, for a pair of positive and non-positive items $(i, j)$, a rank-aware weight is assigned to it, where the weight is
$$\lambda_{ij} = \sum_{s=1}^{r_u(i)} \frac{1}{s} \qquad (7)$$
where $r_u(i)$ denotes the rank of positive item $i$ in user $u$'s list. However, it is impractical to compute $r_u(i)$ exactly for large-scale datasets. To remedy this issue, an approximate method repeatedly draws an item $j$ from $I \setminus I_u^+$ until one is obtained such that $f(\mathbf{x}_{u,j}) > f(\mathbf{x}_{u,i}) - \epsilon$, where $\epsilon$ is a positive margin value. Let $T$ denote the number of sampling trials before obtaining such an item; then $r_u(i) \approx \lfloor (|I| - 1)/T \rfloor$. Empirical results show that the three variants of LambdaFM generally outperform PRFM in terms of different ranking metrics, such as NDCG.
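The Static Sampler above can be sketched as follows, under the assumption (following the LambdaFM paper) that the sampling probability decays exponentially with the item's popularity rank; the function and parameter names are ours:

```python
import numpy as np

def static_sampler_probs(pop_ranks, rho=0.3):
    """Popularity-based static sampler, Eq. (5).

    pop_ranks : popularity rank r(i) of each item (0 = most popular)
    rho       : decay parameter; smaller rho concentrates sampling
                more sharply on popular items (assumed form).
    Returns a normalized probability vector over the items.
    """
    n = len(pop_ranks)
    logits = -np.asarray(pop_ranks, dtype=float) / (n * rho)
    p = np.exp(logits)
    return p / p.sum()
```

Sampling a negative item then amounts to `np.random.choice(n, p=probs)`, so popular items, which are likely to be true negatives ranked too high, are selected more often.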
2.2 Algorithm
The proposed Adaptive Boosting Factorization Machine (AdaFM) framework aims to provide a general way to optimize loss functions defined on various ranking metrics.
To introduce the proposed algorithm, we briefly describe the problem with some notation. Specifically, let $U$ be the whole set of users and $I$ the whole set of items; our goal is to utilize the interactions between $U$ and $I$ to recommend to a target user $u$ a list of items that he may prefer. In training, a set of users $U$ is given. Each user $u$ is associated with a list of retrieved items $I_u$ and a list of labels $\mathbf{y}_u$, where $y_{u,i}$ denotes the rank of item $i$ for user $u$. A feature vector $\mathbf{x}_{u,i}$ is created from each user-item pair $(u, i)$. Thus the training set can be represented as $S = \{(\mathbf{x}_{u,i}, y_{u,i}) \mid u \in U,\ i \in I_u\}$. For a user $u$, we denote his historical items by $I_u^+$ and define $I_u^- = I \setminus I_u^+$.
2.2.1 AdaFM
Our objective is to learn a Factorization Machine $f$ such that, for each user $u$, $f$ assigns prediction scores to the items in $I_u$ that induce a ranked list as close as possible to $\mathbf{y}_u$. To achieve this goal, we introduce a function $\pi(u, f)$ to denote the ranked list of items in $I_u$ for $u$ induced by the learnt model $f$. Specifically, for user $u$, $\pi(u, f)$ is defined as a bijection from $\{1, \ldots, |I_u|\}$ to itself, where the $i$-th element of $\pi(u, f)$ denotes the rank of item $i$.
Then the learning process is to maximize some performance measure of the match between $\pi(u, f)$ and $\mathbf{y}_u$ over all users $u \in U$. Specifically, we use a general function $E(\pi(u, f), \mathbf{y}_u) \in [0, 1]$ to denote the ranking accuracy associated with each user $u$ and his item list. The ranking accuracy in terms of a ranking metric, e.g., MAP, on the training data is then rewritten as
$$\max_{f} \sum_{u \in U} E\big(\pi(u, f), \mathbf{y}_u\big)$$
To maximize the ranking accuracy, we propose to minimize the following loss function:
$$\min_{f \in \mathcal{F}} \sum_{u \in U} \Big[1 - E\big(\pi(u, f), \mathbf{y}_u\big)\Big]$$
where $\mathcal{F}$ is the set of all possible FMs. Observe that this minimization is equivalent to maximizing the performance measure. However, since $E$ is a non-continuous function of $f$, the loss defined above is difficult to optimize. To solve this issue, we propose to minimize its upper bound as follows:
$$\min_{f \in \mathcal{F}} \sum_{u \in U} \exp\Big(-E\big(\pi(u, f), \mathbf{y}_u\big)\Big)$$
The primary idea of applying boosting to Factorization Machines is to learn a set of component FMs and then create an ensemble of the components to predict the users' preferences on items. Specifically, we use a linear combination of component FMs as the final AdaFM model:
$$f(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})$$
where $h_t$ is the $t$-th component FM, with small rank $k_t$, and $\alpha_t > 0$ is a weight assigned to $h_t$ that determines its contribution to the final model. Therefore, for $f$ we can get an equivalent formulation as:
$$f(\mathbf{x}) = \bar{w}_0 + \sum_{i=1}^{d} \bar{w}_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle \bar{\mathbf{v}}_i, \bar{\mathbf{v}}_j \rangle x_i x_j$$
where $\bar{w}_0 = \sum_t \alpha_t w_0^{(t)}$, $\bar{w}_i = \sum_t \alpha_t w_i^{(t)}$, and $\bar{\mathbf{v}}_i = [\sqrt{\alpha_1}\,\mathbf{v}_i^{(1)}; \ldots; \sqrt{\alpha_T}\,\mathbf{v}_i^{(T)}]$ is the concatenation of the scaled component factor vectors.
This implies that the learnt $f$ is still a Factorization Machine, whose rank is $\sum_{t=1}^{T} k_t$.
In the training process, AdaFM runs for $T$ rounds, and one component FM is created at each round. At the $t$-th round, given the former $t-1$ components, the optimization problem is converted to
$$\min_{\alpha_t, h_t} \sum_{u \in U} \exp\Big(-E\big(\pi(u, f_{t-1} + \alpha_t h_t), \mathbf{y}_u\big)\Big)$$
where $f_{t-1} = \sum_{s=1}^{t-1} \alpha_s h_s$.
To solve the above optimization, we first create an optimal component by using a re-weighting strategy, which assigns a dynamic weight $\beta_u$ to each user $u$. At each round, AdaFM increases the weights of the observed users whose item lists are not ranked well by the ensemble of components created so far. The learning of the next component will then pay more attention to those "hard" users. Once the component is given, the optimal combination weight can be solved for. Finally, the details of AdaFM are summarized in Algorithm 1.
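The training loop just described can be sketched as follows. Since the paper's exact formulas for the combination weight and the user re-weighting live in Algorithm 1 (not reproduced here), the AdaRank-style updates below are an assumption, and all function names are ours:

```python
import numpy as np

def adafm_train(users, train_component, eval_model, combine, T=5):
    """Sketch of the AdaFM boosting loop (cf. Algorithm 1).

    train_component(beta) -> component FM h_t fit under user weights beta
    eval_model(model, u)  -> ranking accuracy E in [0, 1] for user u
    combine(hs, alphas)   -> ensemble model sum_t alpha_t * h_t
    """
    n = len(users)
    beta = np.full(n, 1.0 / n)            # initial uniform user weights
    components, alphas = [], []
    for t in range(T):
        h = train_component(beta)
        acc = np.array([eval_model(h, u) for u in users])
        # Component weight: larger when the weighted accuracy is high
        # (AdaRank-style formula, assumed here).
        alpha = 0.5 * np.log((beta @ (1 + acc)) / max(beta @ (1 - acc), 1e-12))
        components.append(h)
        alphas.append(alpha)
        # Re-weight users: emphasize those the current ensemble ranks poorly.
        ensemble = combine(components, alphas)
        ens_acc = np.array([eval_model(ensemble, u) for u in users])
        beta = np.exp(-ens_acc)
        beta /= beta.sum()
    return components, alphas
```

Each round adds one small-rank component, so the effective rank of the ensemble grows until the ranking performance saturates.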
In Algorithm 1, a key step uses the Component Algorithm (CA), whose inputs are the dataset $S$, the weights $\{\beta_u\}$, the latent factor $k$, and the performance measure $E$; its output is an FM model $h$ with latent factor $k$, obtained by maximizing
$$\sum_{u \in U} \beta_u \, E\big(\pi(u, h), \mathbf{y}_u\big)$$
The specific algorithm to solve this problem is presented in the next subsection.
2.2.2 Component Algorithm
To construct component FMs, we can adopt the original FM, PRFM, or LambdaFM model. Specifically, for each user $u$ and item $i$, we use the score of the FM on the feature vector $\mathbf{x}_{u,i}$ to model the relation between user $u$ and item $i$, as follows:
$$h(\mathbf{x}_{u,i}) = w_0 + \sum_{p=1}^{d} w_p x_p + \sum_{p=1}^{d} \sum_{q=p+1}^{d} \langle \mathbf{v}_p, \mathbf{v}_q \rangle x_p x_q \qquad (8)$$
where $x_p$ is the $p$-th element of $\mathbf{x}_{u,i}$ and the parameters are as in Eq. (1). At each round, the accuracy of the component is evaluated by the ranking performance measure weighted by the user weights $\beta_u$. The optimal component is then obtained by consistently optimizing the weighted ranking measure.
When AUC is chosen as the ranking metric, PRFM is selected as the component algorithm. Given the weight distribution $\{\beta_u\}$, the accuracy of the component measured by weighted AUC is defined as follows:
$$\mathrm{WAUC}(h) = \sum_{u \in U} \beta_u \, \frac{1}{|I_u^+|\,|I_u^-|} \sum_{i \in I_u^+} \sum_{j \in I_u^-} \mathbb{I}\big[R_u(i) < R_u(j)\big]$$
where $\mathbb{I}[\cdot]$ is the indicator function and $R_u(i)$ denotes the rank position of item $i$ in the list ranked by $h$ for user $u$. Maximizing the weighted AUC is equivalent to minimizing the following loss function:
$$L(h) = \sum_{u \in U} \beta_u \sum_{i \in I_u^+} \sum_{j \in I_u^-} \mathbb{I}\big[h(\mathbf{x}_{u,i}) \le h(\mathbf{x}_{u,j})\big]$$
To solve this problem, we replace the indicator function with a convex surrogate, i.e., the logistic regression loss function, as follows:
$$\tilde{L}(h) = \sum_{u \in U} \beta_u \sum_{i \in I_u^+} \sum_{j \in I_u^-} \ell\big(h(\mathbf{x}_{u,i}) - h(\mathbf{x}_{u,j})\big)$$
where $\ell(z) = \log(1 + \exp(-z))$. The optimal component can be found by optimizing the following objective function:
$$\min_{\Theta} \tilde{L}(h) + \frac{\lambda}{2}\|\Theta\|^2$$
where $\lambda$ is a regularization parameter. The problem above can be solved by stochastic gradient descent, which first uniformly samples one user $u$ from all the users, then samples a pair $(i, j)$ from $I_u^+ \times I_u^-$, and finally updates the model by
$$\Theta \leftarrow \Theta - \eta \Big( \beta_u \, \nabla_{\Theta}\, \ell\big(h(\mathbf{x}_{u,i}) - h(\mathbf{x}_{u,j})\big) + \lambda \Theta \Big)$$
where $\eta$ is the learning rate. To calculate the gradient of the objective with respect to $\Theta$, we first derive the gradient of $h$ using the property of multilinearity:
$$\frac{\partial h(\mathbf{x})}{\partial \theta} = \begin{cases} 1 & \text{if } \theta = w_0 \\ x_p & \text{if } \theta = w_p \\ x_p \sum_{q=1}^{d} v_{q,f}\, x_q - v_{p,f}\, x_p^2 & \text{if } \theta = v_{p,f} \end{cases}$$
Then, if we denote $\delta_{u,ij} = \ell'\big(h(\mathbf{x}_{u,i}) - h(\mathbf{x}_{u,j})\big)$, the stochastic gradient with respect to each parameter $\theta$ can be computed as
$$\nabla_{\theta} = \beta_u \, \delta_{u,ij} \left( \frac{\partial h(\mathbf{x}_{u,i})}{\partial \theta} - \frac{\partial h(\mathbf{x}_{u,j})}{\partial \theta} \right)$$
When NDCG is chosen as the performance metric, LambdaFM is selected as the component algorithm. In this case, we adopt the lambda sampling strategies [25] instead of uniform sampling, i.e., the popularity-based Static Sampler (5), the Rank-Aware Dynamic Sampler (6), and the Rank-Aware Weighted Approximation (7).
Finally, the algorithm for building the component is summarized in Algorithm 2.
3 Experiments
In this section, we report a comprehensive suite of experimental results that help evaluate the performance of our proposed AdaFM algorithm on several recommendation tasks. The experiments are designed to answer the following open questions: (1) Is the proposed boosting approach effective in significantly improving ranking performance? (2) Does the weak learner's latent dimension have a great effect on ranking performance?
3.1 Experimental Testbed
Table 1: Statistics of the datasets after preprocessing.

Datasets  #Users  #Items  #Entries
Yelp  17,526  85,539  875,955
Lastfm  992  60,000  759,391
Yahoo  2,450  6,518  107,334
We evaluate our proposed algorithm against several baselines on three publicly available Collaborative Filtering (CF) datasets: Yelp (https://www.yelp.com/dataset_challenge, user-venue pairs), Lastfm (http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm1K.html, user-music pairs), and Yahoo music (https://webscope.sandbox.yahoo.com/catalog.php?datatype=r, user-music pairs). To speed up the experiments, we perform the following sampling strategies on these datasets. For Yelp, we filter out the users with fewer than 20 interactions. For Yahoo, we derive a smaller dataset by randomly sampling a subset of users and items from the original dataset. The statistics of the datasets after preprocessing are summarized in Table 1.
To test the performances of our proposed AdaFM framework under different optimization targets, we adopt two standard ranking metrics: Area Under ROC Curve (AUC) and Normalized Discounted Cumulative Gain (NDCG).
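For reference, a minimal implementation of NDCG@k follows; it uses the common exponential-gain form, which is an assumption since the paper does not spell out its gain function (the AUC computation was sketched in Section 2.1.2):

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one ranked list.

    relevances : relevances[i] is the graded relevance of the item the
                 model placed at rank i+1.
    Returns DCG@k divided by the ideal DCG@k (0.0 if all relevances are 0).
    """
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = np.sum((2.0 ** rel - 1.0) * discounts)
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idiscounts = 1.0 / np.log2(np.arange(2, len(ideal) + 2))
    idcg = np.sum((2.0 ** ideal - 1.0) * idiscounts)
    return dcg / idcg if idcg > 0 else 0.0
```

A perfectly ordered list scores 1.0, and misplacing a relevant item near the top costs more than misplacing it near the bottom, which is exactly the rank bias AdaFM's NDCG variants target.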
Table 2: NDCG results of the compared algorithms.

Datasets  FM  PRFM  LFMS  AdaFMS  LFMD  AdaFMD  LFMW  AdaFMW
Yelp  0.204  0.205  0.217  0.225  0.215  0.228  0.221  0.227
Yahoo  0.382  0.383  0.386  0.407  0.392  0.408  0.395  0.410
3.2 Comparison Algorithms
Our proposed AdaFM is a general framework for improving the performance of FM-derived algorithms. Thus, we compare the performance of the following FM-derived algorithms and their corresponding enhanced models under our proposed AdaFM framework.


The original FM, which is designed for the rating prediction task, and its enhanced model using AdaFM, which we name AdaFMO for short;

Pairwise Ranking FM (PRFM), which aims to maximize the AUC metric, and its adaptive version (AdaFMP);

LambdaFM, which is designed to maximize the NDCG metric. We use three different sampling strategies to form the training pairs, i.e., the Static sampler, the Dynamic sampler, and the Rank-aware sampler, as described in Section 2.1.3, and name them LFMS, LFMD, and LFMW, respectively. We also name their adaptive versions AdaFMS, AdaFMD, and AdaFMW, respectively.
3.3 Hyperparameter Settings
The main parameters to be tuned in our experiments are as follows:
Learning rate $\eta$: For the base learners, we first apply 5-fold cross-validation to find the best $\eta$ for FM, and then use the same $\eta$ for PRFM, LambdaFM, and AdaFM.
Latent dimension $k$: In order to compare the performance of AdaFM and the base learners, we simply choose the latent dimension of AdaFM's weak learners from a small candidate set, and vary the latent dimension of the FM-derived baselines over a wider range.
Regularization $\lambda$: FM-derived algorithms have several regularization parameters, including $\lambda_w$ and $\lambda_v$, which represent the regularization parameters of $\mathbf{w}$ and $V$, respectively. During the experiments, we select the best value of $\lambda$ for each FM-derived algorithm. For simplicity, we restrict $\lambda_w$ and $\lambda_v$ to the same value $\lambda$.
Distribution coefficient $\rho$: $\rho$ controls the sampling probability of LambdaFM and is usually affected by the data distribution. Thus, we select the best value of $\rho$ for LFMS, LFMD, and LFMW.
Table 3: AUC results of the compared algorithms.

Datasets  FM  AdaFMO  PRFM  AdaFMP
Yelp  0.911  0.914  0.915  0.916
Lastfm  0.826  0.845  0.843  0.864
Yahoo  0.925  0.936  0.929  0.942
3.4 Performance Evaluation
3.4.1 AUC Optimization
We start by evaluating the effectiveness of our proposed AdaFM framework on the AUC maximization task. The detailed results are presented in Figures 1 and 2 and Table 3. Several insightful observations can be made.
First, combining Adaptive Boosting with FM improves the final results. As shown in Figure 1, when FM is used as the weak learner, we obtain a 2.32% improvement over FM on the Lastfm dataset. And as shown in Figure 2, when PRFM is used as the weak learner, we obtain a 2.49% improvement over PRFM on Lastfm.
Second, AdaFM achieves better results with fewer parameters. This is clearly evident in Figures 1 and 2. For example, on all datasets, AdaFM with four weak learners (each with latent dimension 2) achieves comparable or even better results than the base FM and PRFM with larger latent dimensions. These results are encouraging: in cases where the base FM and PRFM are stuck in a local optimum, our proposed boosting framework can help achieve better results.
Last but not least, as shown in Table 3, AdaFMP achieves the best results on all datasets. This shows that when a better weak learner is used, i.e., PRFM in our case, AdaFM achieves better results, further demonstrating the effectiveness of our boosting framework.
3.4.2 NDCG Optimization
We proceed to evaluate the effectiveness of our AdaFM framework on the NDCG maximization task. We use LambdaFM with different samplers as our baselines, since these are designed to optimize the NDCG metric. More specifically, we consider three variants of LambdaFM, i.e., LFMS, LFMD, and LFMW, with their corresponding boosted versions, i.e., AdaFMS, AdaFMD, and AdaFMW. As shown in Table 2, LambdaFM is better than FM and PRFM, as LambdaFM is designed to optimize the NDCG metric. But our AdaFM methods outperform all three variants of LambdaFM: LFMS, LFMD, and LFMW. Specifically, on the Yelp dataset, compared with the original algorithms, AdaFMS, AdaFMD, and AdaFMW obtain 3.6%, 6.04%, and 2.7% improvements, respectively. On the Yahoo dataset, AdaFMS, AdaFMD, and AdaFMW obtain 4.8%, 2.19%, and 3.8% improvements, respectively.
3.5 Effect of Latent Dimension
In this section, we study whether the weak learner's latent dimension affects the final results of our proposed AdaFM.
From the experiments in Figures 1, 2, 3, and 4, we find that: (1) as the number of weak learners increases, the performance of our proposed AdaFM first increases and then becomes stable, regardless of the latent dimension of the weak learners; (2) AdaFM tends to have similar performance even when the latent dimensions of the weak learners differ. For example, the AUC of AdaFMO2 and AdaFMO3 both increase with the number of weak learners on Lastfm (Figure 1(c)), yet they achieve quite similar AUC after a certain number of weak learners (0.845 vs. 0.844). This finding indicates that it is easy to tune the model parameters of our proposed AdaFM in practice.
4 Conclusions
In this paper, we first proposed a novel Adaptive Boosting framework for Factorization Machines (AdaFM), which combines the advantages of adaptive boosting and FM. Our proposed AdaFM is a general framework that can be used to improve the performance of existing FM-derived algorithms, e.g., FM, PRFM, and LambdaFM. We then presented the details of how to combine the adaptive boosting technique with FM-derived models. We finally performed thorough experiments to evaluate our models on three real public datasets. The results demonstrate that AdaFM improves prediction performance in both the AUC and NDCG maximization tasks.
References
 [1] Christopher J. C. Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with non-smooth cost functions. In Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 193–200, 2006.
 [2] Chen Cheng, Fen Xia, Tong Zhang, Irwin King, and Michael R. Lyu. Gradient boosting factorization machines. In RecSys’14, pages 265–272. ACM, 2014.
 [3] Corinna Cortes and Mehryar Mohri. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems 16 [Neural Information Processing Systems, NIPS 2003, December 8-13, 2003, Vancouver and Whistler, British Columbia, Canada], pages 313–320, 2003.
 [4] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the 2010 ACM Conference on Recommender Systems, RecSys 2010, Barcelona, Spain, September 26-30, 2010, pages 39–46, 2010.
 [5] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2010.
 [6] Nigel Duffy and David Helmbold. Boosting methods for regression. Machine Learning, 47(2-3):153–200, 2002.
 [7] Yoav Freund, Raj D. Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
 [8] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory, pages 23–37. Springer, 1995.
 [9] Xiaotian Jiang, Zhendong Niu, Jiamin Guo, Ghulam Mustafa, Zihan Lin, Baomi Chen, and Qian Zhou. Novel boosting frameworks to improve the performance of collaborative filtering. In ACML’13, pages 87–99, 2013.
 [10] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, pages 426–434, 2008.
 [11] Yehuda Koren. Collaborative filtering with temporal dynamics. Commun. ACM, 53(4):89–97, 2010.
 [12] Tie-Yan Liu. Learning to Rank for Information Retrieval. Springer, 2011.

 [13] Yong Liu, Peilin Zhao, Aixin Sun, and Chunyan Miao. A boosting algorithm for item recommendation with implicit feedback. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 1792–1798, 2015.
 [14] Brian McFee and Gert R. G. Lanckriet. Metric learning to rank. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 775–782, 2010.
 [15] Runwei Qiang, Feng Liang, and Jianwu Yang. Exploiting ranking factorization machines for microblog retrieval. In 22nd ACM International Conference on Information and Knowledge Management, CIKM'13, San Francisco, CA, USA, October 27 – November 1, 2013, pages 1783–1788, 2013.
 [16] Steffen Rendle. Factorization machines. In ICDM 2010, The 10th IEEE International Conference on Data Mining, Sydney, Australia, 14-17 December 2010, pages 995–1000, 2010.
 [17] Steffen Rendle. Factorization machines with libfm. ACM TIST, 3(3):57, 2012.
 [18] Steffen Rendle. Scaling factorization machines to relational data. PVLDB, 6(5):337–348, 2013.
 [19] Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, and Lars Schmidt-Thieme. Fast context-aware recommendations with factorization machines. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011, pages 635–644, 2011.
 [20] Steffen Rendle and Lars Schmidt-Thieme. Pairwise interaction tensor factorization for personalized tag recommendation. In Proceedings of the Third International Conference on Web Search and Web Data Mining, WSDM 2010, New York, NY, USA, February 4-6, 2010, pages 81–90, 2010.
 [21] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.
 [22] Nathan Srebro, Jason D. M. Rennie, and Tommi S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17 [Neural Information Processing Systems, NIPS 2004, December 13-18, 2004, Vancouver, British Columbia, Canada], pages 1329–1336, 2004.
 [23] Yanghao Wang, Hailong Sun, and Richong Zhang. Adamf: Adaptive boosting matrix factorization for recommender system. In WAIM’14, pages 43–54. Springer, 2014.
 [24] Jun Xu and Hang Li. Adarank: a boosting algorithm for information retrieval. In SIGIR’07, pages 391–398. ACM, 2007.
 [25] Fajie Yuan, Guibing Guo, Joemon M. Jose, Long Chen, Haitao Yu, and Weinan Zhang. LambdaFM: Learning optimal ranking with factorization machines using lambda surrogates. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, pages 227–236, 2016.