1. Introduction
The algorithm selection problem for Collaborative Filtering (CF) (Shi et al., 2014) has so far been investigated via Metalearning (MtL) (Adomavicius and Zhang, 2012; Ekstrand and Riedl, 2012; Griffith et al., 2012; Matuszyk and Spiliopoulou, 2014; Cunha et al., 2016, 2017b, 2017a). The problem is modeled using a set of features (i.e., metafeatures) that describe the problem domain, together with the performance of algorithms according to a specific measure, which describes their behavior. Learning algorithms are then used to learn the mapping between the metafeatures and the performance, yielding a model (i.e., a metamodel) which can be used to predict the best algorithms for a new problem.
However, the definition of suitable metafeatures is a hard problem. This is especially difficult in the CF problem, where there is no clear separation between independent and dependent variables. So far, there have been several examples of statistical and/or information-theoretical approaches (Adomavicius and Zhang, 2012; Ekstrand and Riedl, 2012; Griffith et al., 2012; Matuszyk and Spiliopoulou, 2014; Cunha et al., 2016) and even landmarking approaches (Cunha et al., 2017b), which have produced interesting results. However, the merits of metafeatures continue to be questioned, since it is difficult to understand whether they actually contain useful information or whether the results are dictated by noise or chance. Hence, we look towards another approach, which does not use metafeatures explicitly to train the metamodel.
The approach proposed in this work is to use CF algorithms to select CF algorithms, which we name CF4CF. The problem is addressed by considering users and items as the datasets and algorithms, respectively. The performance of all algorithms on a particular dataset is leveraged and converted into ratings. Thus, a proper rating matrix can be built using performance data only. A CF algorithm can then be used to create a metamodel, which makes it possible to predict the best ranking of algorithms for a new problem. In the prediction step specifically, when no data is available regarding the algorithm performance, CF4CF uses subsampling landmarkers (performance estimations on a sample of the original dataset) to obtain initial ratings. CF4CF is then responsible for predicting the remaining ratings and converting the outcome into a ranking of algorithms.
As far as the authors know, this paper’s contribution, CF4CF, is the first approach to use CF algorithms to recommend CF algorithms. Furthermore, this is also the first attempt at CF algorithm selection which does not explicitly use metafeatures in the trained model. Beyond the interest of showing that the algorithm selection problem can be tackled without metafeatures, this work is particularly important because it allows the merits of traditional MtL and the novel CF4CF approach to be compared. To this end, this work compares metalevel accuracy and impact on the baselevel for both learning strategies and shows that CF4CF is a suitable alternative for algorithm selection, performing as well as or better than traditional MtL.
This document is organized as follows: Section 2 presents the related work on Metalearning for CF; Section 3 presents the core contributions of this work, CF4CF and the unified evaluation framework; and Section 4 explains the experimental procedure. In Section 5, the proposed approach is evaluated and discussed, and Section 6 presents the conclusions and future work.
2. Related Work
Although the use of MtL for CF has already been investigated (Adomavicius and Zhang, 2012; Ekstrand and Riedl, 2012; Griffith et al., 2012; Matuszyk and Spiliopoulou, 2014), the approaches proposed have limited scope: the set of datasets, recommendation algorithms and metafeatures studied is always suitable, but never complete. An extensive overview of their positive and negative aspects can be found in a recent survey (Cunha et al., 2018). More recent work in CF algorithm selection has extended the contributions to the area, in particular with regard to the metafeatures considered, which systematize the data characteristics used in earlier works (Cunha et al., 2016). This work, which we consider the state of the art in CF algorithm selection, proposes a systematic approach for metafeature extraction. It leverages a framework which requires three main elements: an object, a function and a post-function. The framework applies the function to the object and, afterwards, the post-function to the outcome in order to derive the final metafeature. Thus, any metafeature can be represented as object.function.postfunction (Pinto et al., 2016).
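To make the object/function/post-function composition concrete, the following is a minimal Python sketch. It assumes a toy rating matrix stored as nested lists with None for missing ratings; the helper names (rows, columns, metafeature), the rating values and the definition of sparsity as the fraction of missing cells are illustrative assumptions, not taken from the cited framework.

```python
from statistics import mean

# Toy rating matrix (made-up values): rows are users, columns are items,
# and None marks a missing rating.
R = [
    [5, 3, None, 1],
    [4, None, 2, 1],
    [1, 1, None, 5],
]

def rows(matrix):
    """Object 'U': the observed ratings of each user (one list per row)."""
    return [[v for v in row if v is not None] for row in matrix]

def columns(matrix):
    """Object 'I': the observed ratings of each item (one list per column)."""
    n_items = len(matrix[0])
    return [[row[j] for row in matrix if row[j] is not None] for j in range(n_items)]

def metafeature(groups, function, postfunction):
    """Apply `function` to every group, then `postfunction` to the outcomes."""
    return postfunction([function(group) for group in groups])

features = {
    "U.mean.min": metafeature(rows(R), mean, min),     # lowest user mean rating
    "I.count.min": metafeature(columns(R), len, min),  # fewest ratings on any item
    "I.sum.max": metafeature(columns(R), sum, max),    # largest item rating sum
}

# Assumption: sparsity is taken as the fraction of missing cells.
n_observed = sum(len(user) for user in rows(R))
features["sparsity"] = 1 - n_observed / (len(R) * len(R[0]))
```

Each key follows the object.function.postfunction naming, so adding a metafeature amounts to choosing a different combination of the three elements.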
The objects to be used in the framework are CF’s rating matrix R, and its rows U (users) and columns I (items). The functions considered to characterize these objects are: original ratings (ratings), count the number of elements (count), mean value (mean) and sum of values (sum). The post-functions are maximum, minimum, mean, standard deviation, median, mode, entropy, Gini index, skewness and kurtosis. Additionally, the set includes the number of users, items and ratings, and the matrix sparsity. This results in 74 metafeatures, which were reduced by correlation feature selection, ending up with:
nusers, R.ratings.kurtosis, R.ratings.sd, I.count.kurtosis, I.count.min, I.mean.entropy, I.sum.skewness, U.sum.entropy, U.mean.min, sparsity, U.sum.kurtosis, U.mean.skewness. As an example, R.ratings.kurtosis represents the kurtosis of the distribution of all ratings in matrix R.
3. CF4CF
This paper introduces a novel approach to tackle the CF algorithm selection problem, named CF4CF. Figure 1 presents the procedure.
Notice the process is organized in two main steps: train and predict. The training stage leverages the algorithm performance data, builds a rating matrix and trains a CF model. In the prediction stage, algorithm performance from subsampling landmarkers is leveraged to create the initial ratings of the active dataset. The active dataset is then submitted to the previously trained CF model to obtain ratings for the missing algorithms. Afterwards, the final ranking of algorithms is calculated. The next sections present in detail the steps outlined in this overview.
3.1. Build the Rating Matrix
Recall that CF requires three elements: users, items and ratings. As this work aims at recommending CF algorithms for CF datasets, the natural adaptation is to consider the users and items as datasets and algorithms, respectively. Hence, to build the rating matrix we consider the set of datasets, where each dataset acts as a user, and the set of algorithms, where each algorithm acts as an item. To complete the matrix, one needs to provide the available ratings. However, in the algorithm selection problem there is no explicit assignment of ratings by each dataset to the algorithms. To solve this issue, we model the preferences using the performance of algorithms on the datasets. The idea is to treat how good an algorithm is for a particular dataset as the preference that dataset holds for the algorithm.
Our approach works by converting the rankings into ratings. This conversion makes it straightforward to take advantage of CF algorithms. Formally, consider a ranking of algorithms for a specific dataset. Such a ranking is created by sorting the algorithms in decreasing order of performance. To convert the ranking into a specific rating scale, the following transformation is applied to each ranking position:
(1) 
The rating values are then . The matrix is completed by converting all rankings of algorithms for all datasets.
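The conversion from performance to rankings and then to ratings can be sketched in Python as follows. This is a hedged illustration only: it assumes a linear mapping from ranking position onto a 1 to 5 rating scale, which is not necessarily the exact form of Equation 1, and the function names and NDCG values are made up for the example.

```python
def performance_to_ranking(performance):
    """Sort algorithms in decreasing order of performance."""
    return sorted(performance, key=performance.get, reverse=True)

def ranking_to_ratings(ranking, r_min=1.0, r_max=5.0):
    """Map ranking positions linearly onto [r_min, r_max] (assumed mapping).

    The first position receives r_max and the last position receives r_min.
    """
    n = len(ranking)
    if n == 1:
        return {ranking[0]: r_max}
    step = (r_max - r_min) / (n - 1)
    return {alg: r_max - step * position for position, alg in enumerate(ranking)}

# Hypothetical NDCG values for one dataset (illustrative only).
ndcg = {"BPRMF": 0.21, "WBPRMF": 0.18, "SMRMF": 0.09, "WRMF": 0.25, "MostPopular": 0.12}

ranking = performance_to_ranking(ndcg)
ratings = ranking_to_ratings(ranking)
```

Applying the same two functions to every dataset yields one row of the rating matrix per dataset.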
3.2. Train the CF model
Notice the previous step outputs a complete rating matrix, since we have a preference of every dataset towards every algorithm. Although CF4CF uses a complete matrix, which is not the case in most CF problems, all available CF algorithms can be used in CF4CF. In the worst-case scenario, one just needs to sample the rating matrix to create missing data, so that algorithms such as Matrix Factorization are able to operate. This is in fact a major advantage: since CF does not require all ratings to be provided, it is theoretically possible to achieve good performance with less information than MtL requires, which may translate into significant savings in computational resources. The experimental procedure assesses these assumptions by varying a parameter that sets the number of ratings sampled per dataset to build the matrix.
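Sampling a fixed number of ratings per dataset from a complete rating matrix could look like the sketch below. The dictionary-of-dictionaries layout, the function name sample_ratings and the rating values are assumptions made for illustration; the source does not prescribe a data structure or a sampling routine.

```python
import random

def sample_ratings(rating_matrix, n_ratings, seed=0):
    """Keep `n_ratings` randomly chosen ratings per dataset; drop the rest."""
    rng = random.Random(seed)
    sampled = {}
    for dataset, ratings in rating_matrix.items():
        # sorted() gives a reproducible list of algorithm names to sample from.
        kept = rng.sample(sorted(ratings), min(n_ratings, len(ratings)))
        sampled[dataset] = {alg: ratings[alg] for alg in kept}
    return sampled

# Complete toy matrix: every dataset rates every algorithm (made-up ratings).
complete = {
    "dataset_a": {"BPRMF": 4, "WBPRMF": 3, "SMRMF": 1, "WRMF": 5, "MostPopular": 2},
    "dataset_b": {"BPRMF": 5, "WBPRMF": 4, "SMRMF": 2, "WRMF": 3, "MostPopular": 1},
}

sparse = sample_ratings(complete, n_ratings=3)
```

Varying n_ratings from 1 up to the number of algorithms reproduces the range between a very sparse matrix and the complete one.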
3.3. Build the Active Dataset
Having built the model, one now moves to the prediction stage. However, due to domain constraints, one must introduce changes to the traditional prediction procedure. Recall that if a new dataset is considered, it is reasonable to assume that there is no performance estimate for any algorithm. In this case, CF4CF cannot work properly, since it would have no data to provide to the CF model. This work proposes to deal with this problem using subsampling landmarkers, which consist in estimating the algorithm performance on small data samples and using these estimates as initial input for the CF model.
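A subsampling landmarker can be sketched as drawing a random fraction of the rating instances and running each evaluation on that sample. The code below is a minimal illustration under stated assumptions: the 10% fraction matches the experimental setup of this work, while the function names, the tuple layout of the instances and the toy scoring function (a stand-in for a real CF training and evaluation run) are hypothetical.

```python
import random

def subsample(instances, fraction=0.1, seed=0):
    """Draw a random sample containing `fraction` of the instances."""
    rng = random.Random(seed)
    size = max(1, round(len(instances) * fraction))
    return rng.sample(instances, size)

def landmarkers(instances, evaluators, fraction=0.1, seed=0):
    """Estimate each algorithm's performance on the sample only."""
    sample = subsample(instances, fraction, seed)
    return {name: evaluate(sample) for name, evaluate in evaluators.items()}

# Toy (user, item, rating) instances; the values are made up.
instances = [(u, i, 1 + (u * i) % 5) for u in range(20) for i in range(10)]

# Placeholder evaluator: a real setup would train and evaluate a CF algorithm here.
evaluators = {"toy_scorer": lambda sample: sum(r for _, _, r in sample) / len(sample)}

sample = subsample(instances)
estimates = landmarkers(instances, evaluators)
```

The resulting estimates play the role of performance values for the new dataset and can be turned into a ranking in the same way as the full performance data.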
Thus, in order to build the active dataset representation, this procedure leverages the subsampling landmarkers and processes them via sampling and rating conversion procedures. Formally, let us consider the complete ranking of algorithms for a specific dataset, obtained from subsampling landmarkers rather than the original performance values. Since we aim to use some of these values as initial ratings for the CF model, we first sample this ranking. Considering how the number of ratings provided directly affects the performance of CF models, it is important to understand the effect of sampling different amounts of ratings. We address this issue by treating the number of sampled landmarkers as a parameter in our experiments. Lastly, the sampled ranking is converted into ratings, also using Equation 1.
3.4. Predict Ratings and Calculate Ranking
Having obtained the active dataset representation, one uses the previously trained CF model to obtain the predictions for the remaining algorithms. Notice that CF algorithms only consider items for which the active user has not provided any feedback. Hence, in our case, CF will produce ratings for the remaining algorithms in a straightforward way.
Notice, however, that the algorithm selection problem requires a complete ranking of algorithms to be predicted. To tackle this issue, we propose to aggregate the predictions with the initial ratings. Hence, the full set of predicted ratings is the union of the initial ratings and the ratings predicted by the CF model.
At this point, the only step remaining is to convert the ratings into rankings. To do so, one sorts the ratings in decreasing order and replaces them by the respective ranking position. By fixing the algorithm positions, one obtains a representation which allows ranking accuracy measures to be used directly and, by extension, CF4CF to be compared with MtL.
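The prediction stage can be sketched end to end with a small user-based nearest-neighbour model. This is only an illustration under assumptions: cosine similarity over co-rated algorithms, a similarity-weighted average for prediction, the value of k, the function names and all ratings are choices made for the example and are not specified by the source.

```python
import math

def cosine(u, v):
    """Cosine similarity over the algorithms rated by both datasets."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[a] * v[a] for a in common)
    norm_u = math.sqrt(sum(u[a] ** 2 for a in common))
    norm_v = math.sqrt(sum(v[a] ** 2 for a in common))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_ranking(train, active, k=2):
    """Predict missing ratings, merge them with the initial ones, then rank.

    Assumes `active` holds at least one initial rating.
    """
    algorithms = sorted({a for ratings in train.values() for a in ratings})
    scored = sorted(((cosine(active, r), r) for r in train.values()),
                    key=lambda pair: pair[0], reverse=True)[:k]
    full = dict(active)  # initial ratings are kept as they are
    for alg in algorithms:
        if alg in full:
            continue  # only algorithms without feedback are predicted
        pairs = [(sim, r[alg]) for sim, r in scored if alg in r]
        weight = sum(sim for sim, _ in pairs)
        if weight > 0:
            full[alg] = sum(sim * x for sim, x in pairs) / weight
        else:
            # Fallback: plain mean of neighbour ratings, else of the initial ratings.
            values = [x for _, x in pairs] or list(active.values())
            full[alg] = sum(values) / len(values)
    ranking = sorted(full, key=full.get, reverse=True)
    return ranking, full

# Made-up training ratings (datasets as users, algorithms as items).
train = {
    "d1": {"WRMF": 5, "BPRMF": 4, "WBPRMF": 3, "SMRMF": 2, "MostPopular": 1},
    "d2": {"WRMF": 5, "BPRMF": 3, "WBPRMF": 4, "SMRMF": 1, "MostPopular": 2},
    "d3": {"WRMF": 1, "BPRMF": 2, "WBPRMF": 3, "SMRMF": 4, "MostPopular": 5},
}
# Initial ratings of the active dataset, e.g. from three sampled landmarkers.
active = {"WRMF": 5, "BPRMF": 4, "MostPopular": 1}

ranking, full = predict_ranking(train, active)
```

In this toy run the two most similar datasets are d1 and d2, so the two missing algorithms receive ratings close to what those datasets assigned, and the final ranking covers all five algorithms.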
4. Experimental setup
4.1. Baselevel
The baselevel component is concerned with the traditional CF problem and it is exactly the same for both CF4CF and MtL. Here, several dimensions are considered: datasets, algorithms and evaluation measures. The 38 datasets used come from different domains, namely Amazon Reviews, BookCrossing, Flixter, Jester, MovieLens, MovieTweetings, Tripadvisor, Yahoo! and Yelp. Table 1 presents all domains and datasets used and a summary of their characteristics.
| Domain | Dataset(s) | #Users | #Items | #Ratings | Ref. |
| --- | --- | --- | --- | --- | --- |
| Amazon | App, Auto, Baby, Beauty, CD, Clothes, Food, Game, Garden, Health, Home, Instrument, Kindle, Movie, Music, Office, Pet, Phone, Sport, Tool, Toy, Video | [7k - 311k] | [2k - 267k] | [11k - 574k] | (McAuley and Leskovec, 2013) |
| Bookcrossing | Bookcrossing | 8k | 29k | 40k | (Ziegler et al., 2005) |
| Flixter | Flixter | 15k | 22k | 813k | (Zafarani and Liu, 2009) |
| Jester | Jester1, Jester2, Jester3 | [2.3k - 2.5k] | [96 - 100] | [61k - 182k] | (Goldberg et al., 2001) |
| Movielens | 100k, 1m, 10m, 20m, latest | [94 - 23k] | [1k - 17k] | [10k - 2M] | (GroupLens, 2016) |
| MovieTweetings | RecSys2014, latest | [2.5k - 3.7k] | [4.8k - 7.4k] | [21k - 39k] | (Dooms et al., 2013) |
| Tripadvisor | Tripadvisor | 78k | 11k | 151k | (Wang et al., 2011) |
| Yahoo! | Movies, Music | [613 - 764] | [4k - 4.6k] | [22k - 31k] | (Yahoo!, 2016) |
| Yelp | Yelp | 55k | 46k | 212k | (Yelp, 2016) |
The CF algorithms used in this work are variations of Matrix Factorization (MF) methods: BPRMF (Rendle et al., 2009), which performs a pairwise classification task, optimizing AUC using Stochastic Gradient Descent (SGD); WBPRMF (Rendle et al., 2009), which is a variation of BPRMF that includes a sampling mechanism that promotes low scored items; SMRMF (Weimer et al., 2008), which is another variation of BPRMF that replaces the optimization formula in SGD by a soft margin ranking loss inspired by SVM classifiers; WRMF (Hu et al., 2008), which uses Alternating Least Squares (ALS) instead of SGD and introduces user/item bias to regularize the process; and lastly the baseline algorithm MostPopular, which ranks items by how often they have been seen in the past. Since these algorithms tackle a Top-N recommendation problem, all algorithms are evaluated using NDCG (to assess ranking accuracy) and AUC (to evaluate classification accuracy) using 10-fold cross-validation. No parameter optimization was done, to prevent bias towards any algorithm.
4.2. Metalevel
CF4CF uses only algorithm performance as input data. While the results obtained from the baselevel are used as training data, the prediction stage requires subsampling landmarkers to be calculated. To do so, every dataset is randomly sampled down to 10% of its instances. Then, these samples are submitted to the same baselevel evaluation procedure to obtain performance estimations for all algorithms in all evaluation measures. In the case of MtL, each dataset is simply described by the state-of-the-art metafeatures (Cunha et al., 2016) presented in Section 2. The algorithm performance is used to create rankings of algorithms to be used as targets for this predictive procedure. This means MtL is addressed using Label Ranking (LR) (Hüllermeier et al., 2008; Vembu and Gärtner, 2010). Recall that CF4CF is designed to use any CF algorithm. However, in order to provide the fairest comparison possible between MtL and CF, this work uses two algorithms with the same bias: user-based CF (Sarwar et al., 2000) and kNN for LR (Soares, 2015), both based on Nearest Neighbours. These algorithms are referred to as KNNCF and KNNLR.
The evaluation in algorithm selection comprises two tasks: metaaccuracy and impact on the baselevel performance. While the first aims to assess how similar the predicted and real rankings of algorithms are, the second investigates how the algorithms recommended by the metamodels actually perform on average for all datasets. To assess the metaaccuracy, this work uses the ranking accuracy measure Kendall’s Tau with leave-one-out cross-validation. To assess the impact on the baselevel, the analysis calculates the average performance for different thresholds. These thresholds refer to the number of algorithms from the predicted ranking which are considered for analysis. Hence, if the threshold is 1, only the first recommended algorithm is used. If the threshold is 2, both the first and second algorithms are used, and the performance is the best of both recommended algorithms.
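Both evaluation tasks can be sketched for a single dataset as follows. The sketch assumes rankings without ties (so the plain form of Kendall's Tau applies); the function names, algorithm labels and performance values are made up for illustration, and averaging over datasets is left out.

```python
from itertools import combinations

def kendall_tau(true_ranking, predicted_ranking):
    """Kendall's Tau between two tie-free rankings of the same algorithms."""
    pos_true = {alg: i for i, alg in enumerate(true_ranking)}
    pos_pred = {alg: i for i, alg in enumerate(predicted_ranking)}
    concordant = discordant = 0
    for a, b in combinations(true_ranking, 2):
        # A pair is concordant when both rankings order it the same way.
        same_order = (pos_true[a] - pos_true[b]) * (pos_pred[a] - pos_pred[b]) > 0
        if same_order:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def baselevel_impact(predicted_ranking, performance, threshold):
    """Best true performance among the first `threshold` recommended algorithms."""
    return max(performance[alg] for alg in predicted_ranking[:threshold])

# Hypothetical true performance and rankings for one dataset.
performance = {"A": 0.30, "B": 0.25, "C": 0.20, "D": 0.10}
true_ranking = ["A", "B", "C", "D"]
predicted_ranking = ["B", "A", "C", "D"]

tau = kendall_tau(true_ranking, predicted_ranking)
impact_at_1 = baselevel_impact(predicted_ranking, performance, threshold=1)
impact_at_2 = baselevel_impact(predicted_ranking, performance, threshold=2)
```

Here a single swapped pair out of six gives a Tau of about 0.67, and the baselevel impact improves from 0.25 to 0.30 once the second recommended algorithm is also considered.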
5. Results
5.1. Rating Matrix Sparsity
The first analysis aims at understanding the effect of the number of ratings sampled per dataset to build the rating matrix (Section 3.2). To do so, different matrices were created by sampling the complete matrix, and CF4CF models were then trained upon them. The results in terms of Kendall’s Tau are presented in Figure 2.
The results show CF4CF is equal or better than the baseline and MtL for and , respectively. This shows CF4CF is able to provide good recommendations using only 4 ratings per baselevel dataset. However, the results also show that CF4CF is only better than MtL for , meaning the full rating matrix is the only to consistently beat MtL. To obtain optimal results and provide fair comparison against MtL, we use a complete rating matrix in the remaining experiments.
5.2. Metaaccuracy
This analysis assesses the effect that the number of sampled landmarkers has on the overall performance of CF4CF. The Kendall’s Tau results are presented in Figure 3.
The results CF4CF is better than the baseline for for both NDCG and AUC metatargets, but it only reaches comparable performance with regards to MtL for in NDCG and in AUC. Furthermore, CF4CF can outperform MtL but only for NDCG for . This means CF4CF is a suitable alternative to MtL, which in fact can perform better when 4 subsampling landmarkers are used to feed the CF metamodel.
5.3. Impact on the baselevel performance
The results the impact on the baselevel performance are presented in Figure 4. Notice the results presented refer to .
The experimental results show CF4CF outperforms both the baseline and MtL for and for the the NDCG and AUC metatargets, respectively. These results show CF4CF makes better predictions than the competing approaches for the first thresholds in each problem, i.e. CF4CF is more accurate than MtL for the top positions in the predicted rankings of algorithms.
6. Conclusions
This work introduced a novel algorithm selection approach, CF4CF, which takes advantage of Collaborative Filtering to recommend rankings of Collaborative Filtering algorithms. The procedure uses the algorithm performance as rating information to train the metamodel and uses subsampling landmarkers converted into ratings in the prediction stage. The proposed approach is the first known solution of its kind. According to the experimental results, CF4CF is a good alternative to MtL, and even better in some cases. CF4CF is able to perform equally to MtL using less data from algorithm performance in the rating matrix; it can outperform MtL when using 4 subsampling landmarkers in conjunction with a CF model; and it is able to have a higher impact in the rankings of algorithms recommended in the top positions. These observations allow us to conclude that (1) CF4CF is better at predicting rankings of CF algorithms, (2) the CF algorithm it recommends has a higher impact on the baselevel performance and (3) subsampling landmarkers are a suitable solution to provide initial ratings. Future work directions include improving CF4CF performance by testing different ways to leverage data for training and testing, extending the experimental setup to other recommendation areas and algorithms, and leveraging both metafeatures and ratings in a hybrid solution for CF algorithm selection.
Acknowledgments
This work is financed by the Portuguese funding institution FCT - Fundação para a Ciência e a Tecnologia through the PhD grant SFRH/BD/117531/2016.
References
 Adomavicius and Zhang (2012) Gediminas Adomavicius and Jingjing Zhang. 2012. Impact of data characteristics on recommender systems performance. ACM Management Information Systems 3, 1 (2012), 1–17.

 Cunha et al. (2017a) Tiago Cunha, Carlos Soares, and André C.P.L.F. Carvalho. 2017a. Metalearning for Context-aware Filtering: Selection of Tensor Factorization Algorithms. In Proceedings of the Eleventh ACM Conference on Recommender Systems (RecSys ’17). ACM, New York, NY, USA, 14–22. https://doi.org/10.1145/3109859.3109899
 Cunha et al. (2016) Tiago Cunha, Carlos Soares, and André de Carvalho. 2016. Selecting Collaborative Filtering algorithms using Metalearning. In ECML-PKDD. 393–409.
 Cunha et al. (2017b) Tiago Cunha, Carlos Soares, and Andre de Carvalho. 2017b. Recommending Collaborative Filtering algorithms using subsampling landmarkers. In Discovery Science. 189–203.
 Cunha et al. (2018) Tiago Cunha, Carlos Soares, and André C.P.L.F. de Carvalho. 2018. Metalearning and Recommender Systems: A literature review and empirical study on the algorithm selection problem for Collaborative Filtering. Information Sciences 423 (2018), 128–144.
 Dooms et al. (2013) Simon Dooms, Toon De Pessemier, and Luc Martens. 2013. MovieTweetings: a Movie Rating Dataset Collected From Twitter. In CrowdRec at RecSys 2013.
 Ekstrand and Riedl (2012) Michael Ekstrand and John Riedl. 2012. When Recommenders Fail: Predicting Recommender Failure for Algorithm Selection and Combination. ACM RecSys (2012), 233–236.
 Goldberg et al. (2001) Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. 2001. Eigentaste: A Constant Time Collaborative Filtering Algorithm. Information Retrieval 4, 2 (2001), 133–151.
 Griffith et al. (2012) Josephine Griffith, Colm O’Riordan, and Humphrey Sorensen. 2012. Investigations into user rating information and accuracy in collaborative filtering. In ACM SAC. 937–942.
 GroupLens (2016) GroupLens. 2016. MovieLens datasets. (2016). http://grouplens.org/datasets/movielens/
 Hu et al. (2008) Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In IEEE International Conference on Data Mining. 263–272.
 Hüllermeier et al. (2008) Eyke Hüllermeier, Johannes Fürnkranz, Weiwei Cheng, and Klaus Brinker. 2008. Label ranking by learning pairwise preferences. Artificial Intelligence 172, 16-17 (2008), 1897–1916.
 Matuszyk and Spiliopoulou (2014) Pawel Matuszyk and Myra Spiliopoulou. 2014. Predicting the Performance of Collaborative Filtering Algorithms. In Web Intelligence, Mining and Semantics. 38:1–38:6.
 McAuley and Leskovec (2013) Julian McAuley and Jure Leskovec. 2013. Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text. In ACM Conference on Recommender Systems. 165–172.
 Pinto et al. (2016) Fábio Pinto, Carlos Soares, and João Mendes-Moreira. 2016. Towards automatic generation of Metafeatures. In PAKDD. 215–226.
 Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. 452–461.
 Sarwar et al. (2000) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2000. Analysis of Recommendation Algorithms for E-Commerce. In ACM Electronic Commerce. 158–167.
 Shi et al. (2014) Yue Shi, Martha Larson, and Alan Hanjalic. 2014. Collaborative Filtering beyond the User-Item Matrix. Comput. Surveys 47, 1 (2014), 1–45.
 Soares (2015) Carlos Soares. 2015. labelrank: Predicting Rankings of Labels. (2015). https://cran.r-project.org/package=labelrank
 Vembu and Gärtner (2010) Shankar Vembu and Thomas Gärtner. 2010. Label ranking algorithms: A survey. In Preference Learning. 45–64.
 Wang et al. (2011) Hongning Wang, Yue Lu, and ChengXiang Zhai. 2011. Latent Aspect Rating Analysis Without Aspect Keyword Supervision. In ACM SIGKDD. 618–626.
 Weimer et al. (2008) Markus Weimer, Alexandros Karatzoglou, and Alex Smola. 2008. Improving Maximum Margin Matrix Factorization. Machine Learning 72, 3 (2008), 263–276.
 Yahoo! (2016) Yahoo! 2016. Webscope datasets. (2016). https://webscope.sandbox.yahoo.com/
 Yelp (2016) Yelp. 2016. Yelp Dataset Challenge. (2016). https://www.yelp.com/dataset_challenge
 Zafarani and Liu (2009) R. Zafarani and H. Liu. 2009. Social Computing Data Repository at ASU. (2009). http://socialcomputing.asu.edu
 Ziegler et al. (2005) Cai-Nicolas Ziegler, Sean M McNee, Joseph A Konstan, and Georg Lausen. 2005. Improving Recommendation Lists Through Topic Diversification. In Proceedings of the 14th International Conference on World Wide Web. 22–32.