1 Introduction
The task of recommending the best algorithms for a new given problem, also known as algorithm selection, is widely studied in Machine Learning (ML)
(Brazdil et al, 2003; Prudêncio and Ludermir, 2004; Smith-Miles, 2008). One of its most popular approaches, Metalearning (MtL), looks for a function to map metafeatures (characteristics extracted from a dataset representing the problem) to metatargets (the performance of a group of algorithms when applied to this dataset) (Brazdil et al, 2009). This function, learned via a ML algorithm, can be used to recommend algorithms for new datasets. One of the main concerns in MtL is the design of metafeatures that are informative regarding algorithm performance (Vanschoren, 2010). MtL has been successfully used in many ML tasks. However, since each ML task has its specificities, different sets of metafeatures may be necessary for each task. Hence, an essential part of the work of a MtL practitioner is the design of hand-tailored metafeatures suitable for the task at hand. This has resulted in large collections of metafeatures for tasks like regression (Amasyali and Ersoy, 2009), classification (Gama and Brazdil, 1995; Kalousis and Hilario, 2001) and Collaborative Filtering (Cunha et al, 2018a, b).
In Collaborative Filtering (CF) algorithm selection, several informative metafeatures have been proposed (Cunha et al, 2018b), ranging from rating matrix characteristics (Adomavicius and Zhang, 2012; Ekstrand and Riedl, 2012; Griffith et al, 2012; Matuszyk and Spiliopoulou, 2014; Cunha et al, 2016; Collins et al, 2018) to performance estimates on data samples (Cunha et al, 2017b). However, these metafeatures have an important limitation: all are tailor-made and therefore depend on the practitioner's experience and perspective on the problem. Unlike previous works, this paper investigates how useful metafeatures can be designed while minimizing human interference. Representational Learning (RL) (Bengio et al, 2013) uses ML algorithms and domain knowledge to learn alternative and potentially richer representations for a given problem, in order to enhance predictive performance in other ML tasks. Examples of successful applications are text classification (Bengio et al, 2013) and image recognition (He et al, 2016). However, to the best of our knowledge, this approach has never been used for algorithm selection tasks.
In this paper, we use a RL approach to automatically design metafeatures for the problem of algorithm selection in CF. The proposed solution is inspired by distributed representations (Lecun et al, 2015), which represent each problem entity by its underlying relationships with other lower-granularity elements. An example is word2vec (Mikolov et al, 2013), which represents each word in a text using a set of neighbouring words. This paper investigates the hypothesis that there is a distributed representation technique able to create a latent representation of the CF problem, which can produce alternative metafeatures. This representation is created using graph2vec (Narayanan et al, 2017), a technique inspired by word2vec. The proposed procedure, cf2vec, has 4 essential steps: 1) convert the CF matrix into a graph, 2) reduce the problem complexity via graph sampling, 3) learn the distributed representations and 4) train a metamodel using the learned representations as metafeatures. We evaluate cf2vec against state-of-the-art CF metafeatures and show that its performance is comparable, while requiring significantly less data and virtually no human input.
This document is organized as follows: Section 2 presents a literature review on MtL and RL for the investigated problem; Section 3 introduces the proposed technique cf2vec; Section 4 describes the experimental setup used to validate cf2vec, while Section 5 reports the experimental analysis conducted. Finally, Section 6 presents the main conclusions, discusses cf2vec’s limitations and introduces directions for future work.
2 Related Work
2.1 Metalearning
MtL attempts to model algorithm performance in terms of problem characteristics (Vanschoren, 2010). One of its main applications is algorithm selection, first conceptualized by Rice (1976). Figure 1 presents the conceptualized framework, which defines several search spaces: the problem, feature, algorithm and performance spaces, represented by P, F, A and Y, respectively. A problem is described as: for a given instance x ∈ P, with features f(x) ∈ F, find the selection mapping S(f(x)) into A, such that the selected algorithm α ∈ A maximizes the performance mapping y(α(x)) ∈ Y (Rice, 1976). Hence, algorithm selection can be formulated as a learning task whose goal is to learn a metamodel able to recommend algorithms for a new task.
The first algorithm selection approaches for CF appeared only recently (Adomavicius and Zhang, 2012; Ekstrand and Riedl, 2012; Griffith et al, 2012; Matuszyk and Spiliopoulou, 2014), but had low representativeness: both the experimental setup and the nature and diversity of the metafeatures were very limited. More advanced and extensively validated CF metafeatures have since been proposed:

Rating Matrix metafeatures (Cunha et al, 2016): these characteristics describe several rating matrix properties using a systematic metafeature generation framework (Pinto et al, 2016). This collection of metafeatures combines sets of objects, functions and post-functions. The proposed metafeatures use three objects (the rating matrix and its rows and columns), four functions (original ratings, number of ratings, mean rating value and sum of ratings) and several post-functions (maximum, minimum, mean, standard deviation, median, mode, entropy, Gini index, skewness and kurtosis).

Subsampling landmarkers (Cunha et al, 2017b): these metafeatures are created using performance estimates on random samples extracted from the original datasets. First, random samples are extracted from each CF dataset. Next, CF algorithms are trained on these samples and their performance is assessed using different evaluation measures. The outcome is one subsampling landmarker per algorithm/evaluation measure pair.

Graph metafeatures (Cunha et al, 2018a): this approach models CF as a graph and takes advantage of the systematic metafeature extraction procedure (Pinto et al, 2016) and the hierarchical decomposition of complex data structures (Cunha et al, 2017a). This makes it possible to define the important levels (graph, node, pairwise and subgraph) to be characterized using Graph Theory measures.

Comprehensive metafeatures (Cunha et al, 2018a): this collection aggregates the metafeatures from all previous approaches. Correlation Feature Selection is used to retain the most significant metafeatures.
2.2 Representational Learning
Although there are alternatives, like probabilistic models and manifold learning (Bengio, 2011; Bengio et al, 2013), the classical RL technique is the Autoencoder (Bourlard and Kamp, 1988; Lecun, 1987). Figure 2 shows its architecture, simplified for easier readability. Autoencoders can, theoretically, be used in any ML task, including CF. In fact, they have been used to provide recommendations (Sedhain et al, 2015; Strub et al, 2016; Wu et al, 2016). These works learn latent representations for each user and/or item, which are in turn used to make the recommendations. However, cf2vec needs a latent representation able to describe an entire dataset, like metafeatures do. Hence, these approaches are not useful for our purposes.
A better alternative is distributed representations (Lecun et al, 2015). As the name suggests, each entity is represented by a pattern of activity distributed over many elements, and each element participates in the representation of many different entities (Rumelhart et al, 1986). In essence, they also represent the input as a real-valued vector, but using a different network architecture. The techniques most relevant to our problem are discussed next.
word2vec (Mikolov et al, 2013) assumes that two words are similar (and have similar representations) if they have similar contexts. In this case, the context refers to a predefined number of neighbouring words. One architecture proposed to learn these representations is the skip-gram, which predicts the surrounding words given the current word. Figure 3 shows how the skip-gram works.
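For concreteness, the skip-gram's training data can be generated with a short sketch (`skipgram_pairs` is a hypothetical helper written for illustration; libraries such as gensim handle this internally):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) skip-gram training pairs: for each word,
    every word within `window` positions counts as a context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "on", "mat"], window=1)
# with window=1, ("cat", "the") and ("cat", "sat") are among the pairs
```

The network is then trained to predict the context word from the center word for each pair, which forces words sharing contexts towards similar vectors.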
doc2vec (Le and Mikolov, 2014) learns distributed representations for sequences of words with different lengths (i.e. paragraphs, documents, etc.). One of the introduced algorithms, Paragraph Vector Distributed Bag of Words (PV-DBOW), allows a straightforward adaptation of word2vec's skip-gram: instead of predicting context words based on a current word, the neural network now predicts sequences of words belonging to a particular document. A variation of this technique is available in graph2vec (Narayanan et al, 2017): by considering each graph as a document, it is able to represent each graph by its underlying nodes. The process has two stages: 1) create rooted subgraphs in order to generate the vocabulary and 2) train the PV-DBOW skip-gram model. This technique is discussed in detail in Section 3.3.2.
3 Distributed Representations as CF metafeatures
This section introduces the main contribution of this work: cf2vec. Its essential steps are: 1) convert the CF matrix into a graph, 2) reduce the problem complexity via graph sampling, 3) learn the distributed representations and 4) train a metamodel using them as alternative metafeatures.
3.1 Convert CF matrix into graph
CF is usually described by a rating matrix R, relating a set of users U and a set of items I. Each element of this matrix is the feedback provided by a user for an item. Figure 3(a) shows a toy example of a rating matrix.
To use graph2vec, the input elements must be graphs. Since Cunha et al (2018a) have shown that a CF rating matrix can be seen as an adjacency matrix, the problem can be stated as follows: consider a bipartite graph G = (U, I, E), whose node sets U and I represent users and items, respectively. The edges E connect elements of the two groups and represent the feedback provided by users to items. The edges can be weighted in order to represent preference values (ratings). Figure 3(b) shows the conversion of the toy example from Figure 3(a).
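The conversion above can be sketched in a few lines (an illustrative sketch assuming, for simplicity, that the sparse rating matrix is stored as a dictionary keyed by (user, item) pairs):

```python
def matrix_to_bipartite(ratings):
    """Convert a sparse rating matrix, given as {(user, item): rating},
    into a weighted bipartite graph stored as an adjacency dictionary.
    Node names are prefixed to keep the user and item partitions disjoint."""
    graph = {}
    for (user, item), rating in ratings.items():
        u, i = f"u:{user}", f"i:{item}"
        graph.setdefault(u, {})[i] = rating
        graph.setdefault(i, {})[u] = rating  # undirected weighted edge
    return graph

toy = {("alice", "matrix"): 5, ("alice", "heat"): 3, ("bob", "heat"): 4}
g = matrix_to_bipartite(toy)
# g has 4 nodes: u:alice, u:bob, i:matrix, i:heat
```

Each rating becomes one weighted edge between a user node and an item node, so the graph preserves exactly the information in the rating matrix.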
3.2 Sampling graphs
An important aspect of metafeature design is the effort required (Vanschoren, 2010): if extracting the metafeatures is slower than training and evaluating all algorithms on the new problem, then they are useless. Considering that CF graphs can reach quite large sizes, this is a pressing issue and motivates the need to reduce the problem dimensionality. Since the goal is not the actual time required, but rather reducing the amount of data to be processed, the focus lies on finding the minimum amount of data that still maintains high predictive performance.
Thus, an intermediate (but not mandatory) step is added: graph sampling. In order to find a distributed representation as closely related as possible to the entire graph, a sampling technique able to preserve the graph's structural properties must be chosen. According to Leskovec and Faloutsos (2006), a good choice is the random walk. It performs multiple explorations of graph paths until N nodes are reached and uses all of them to obtain the respective subgraph.
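A minimal random-walk sampler might look as follows (an illustrative sketch assuming an adjacency-dictionary graph representation; the 15% restart probability is an arbitrary choice for the sketch, not a value from this work):

```python
import random

def random_walk_sample(graph, n_nodes, seed=0):
    """Sample a subgraph by random walks until `n_nodes` distinct nodes
    have been visited, restarting from a random node when a walk gets
    stuck or with a small restart probability."""
    rng = random.Random(seed)
    nodes = list(graph)
    visited = set()
    current = rng.choice(nodes)
    while len(visited) < min(n_nodes, len(nodes)):
        visited.add(current)
        neighbours = list(graph[current])
        if not neighbours or rng.random() < 0.15:
            current = rng.choice(nodes)   # restart to avoid being trapped
        else:
            current = rng.choice(neighbours)
    # keep only the edges whose endpoints were both sampled
    return {n: {m: w for m, w in graph[n].items() if m in visited}
            for n in visited}

g = {"a": {"b": 1}, "b": {"a": 1, "c": 1}, "c": {"b": 1, "d": 1}, "d": {"c": 1}}
sub = random_walk_sample(g, 3)
```

Because the walk follows existing edges, the sampled subgraph tends to preserve local connectivity, which is the structural property the representation step relies on.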
3.3 Learn distributed representation
Taking advantage of graph2vec's agnostic nature, the problem can be defined as follows: given a set of CF graphs and a positive integer δ (the distributed representation size), the aim is to learn a δ-dimensional distributed representation for every graph. Hence, this process creates a matrix of distributed representations Φ, which can be regarded as the metafeature representations of all considered graphs. Two steps are required: 1) extract rooted subgraphs and 2) learn matrix Φ.
3.3.1 Extract rooted subgraphs
A rooted subgraph is composed of the set of nodes (and corresponding edges) around a node n that are reachable in d hops. Learning the distributed representation requires the extraction of rooted subgraphs for all nodes. Thus, the process must be applied to every node in the graph.
Rooted subgraphs in graph2vec are generated using the Weisfeiler-Lehman relabeling procedure (Shervashidze et al, 2011). Beyond inspecting neighbouring nodes, it is also able to incorporate information about the neighbours into a single node's name. As a result, it creates a rich textual description for every graph. To do so, it iteratively traverses each node, appending the labels of all its neighbours to the current node label. Next, it replaces the original node labels by new compressed names, each representing a neighbourhood structure. The process repeats until d hops are reached. Every rooted subgraph can then be represented by a numeric vector holding the frequency with which each node label (original or compressed) appears in it, similarly to one-hot encoding.
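The relabeling step can be sketched as follows (a simplified Weisfeiler-Lehman pass for illustration; the actual graph2vec implementation additionally collects the intermediate labels as the subgraph "words" of the vocabulary):

```python
def wl_relabel(adj, labels, hops):
    """One Weisfeiler-Lehman relabeling pass per hop: each node's new label
    combines its own label with the sorted labels of its neighbours, and the
    combined signature is compressed into a short integer id."""
    compressed = {}
    for _ in range(hops):
        new_labels = {}
        for node in adj:
            signature = (labels[node],
                         tuple(sorted(labels[n] for n in adj[node])))
            if signature not in compressed:
                compressed[signature] = len(compressed)
            new_labels[node] = compressed[signature]
        labels = new_labels
    return labels

# path graph a - b - c, all nodes starting with the same label
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
out = wl_relabel(adj, {n: 0 for n in adj}, hops=1)
# after one hop, the endpoints a and c share a label; b gets a different one
```

Nodes with identical d-hop neighbourhood structures end up with the same compressed label, which is what lets the skip-gram treat structurally similar subgraphs as the same "word".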
3.3.2 Learn matrix Φ
Now that a graph vocabulary exists, the skip-gram model can be used directly. As can be seen in Figure 5, each graph is represented by its identifier and connected to its context rooted subgraphs. Training such a neural network learns similar distributed representations for graphs with similar rooted subgraphs. The authors believe that this similarity also correlates with algorithm performance, as happens with other metafeatures. Hence, it is suitable for algorithm selection.
To learn the weights, one must train the network. The learning process, based on Stochastic Gradient Descent, iteratively performs the following steps until convergence: 1) feed-forward from the input to the output layer, 2) application of a softmax classifier to compare the output layer's weights with the subgraph representations and 3) backpropagation of the errors through the network. Doing so, it learns the matrices Φ and C, which hold the distributed representations and the contexts, respectively. Notice that the skip-gram is trained using Negative Sampling, which does not use all subgraphs belonging to a graph. Instead, it takes advantage of a few random subgraphs that do not belong to the graph. This way, training is more efficient.
3.4 Learn metamodel
Notice that matrix Φ can easily be used as metafeatures. Thus, every problem is described by δ independent variables (the respective row of matrix Φ) and the dependent variables (the respective ranking of algorithms). Obtaining these pairs for all problems allows the creation of a metadatabase like the one in Figure 6.
Formally, the submission of all problems to cf2vec produces the metafeatures Φ. To create the dependent variables, each problem is associated with the respective ranking of algorithms, based on the performance values for a specific evaluation measure. This ranking assumes a static ordering of the algorithms (for instance, alphabetical order) and is composed of a permutation of rank values, indicating, for each position, the respective rank. A learning algorithm is then used to induce a metamodel. To make predictions, the metamodel is applied to the metafeatures extracted from a new problem in order to predict its best ranking of algorithms.
Considering how the problem is modelled, the natural solution is Label Ranking (LR) (Hüllermeier et al, 2008; Vembu and Gärtner, 2010). Thus, the algorithm selection problem for CF using LR is: for every dataset, with metafeatures associated with the respective ranking of algorithms, find the selection mapping into the permutation space, such that the selected ranking of algorithms maximizes the performance mapping.
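A minimal KNN sketch for Label Ranking is shown below (an illustrative aggregation strategy, predicting a ranking by averaging the rank vectors of the nearest neighbours and re-ranking the averages; `meta_db` and `knn_label_ranking` are hypothetical names, and the actual meta-learner used later may aggregate differently):

```python
import math

def knn_label_ranking(meta_db, query, k=3):
    """Predict a ranking for `query` from the k nearest meta-examples
    (Euclidean distance on metafeatures), by averaging their rank vectors
    and converting the averages back into a valid permutation.
    meta_db: list of (metafeature_vector, rank_vector) pairs, where
    rank_vector[j] is the rank of algorithm j (1 = best)."""
    neighbours = sorted(meta_db, key=lambda ex: math.dist(ex[0], query))[:k]
    n_alg = len(neighbours[0][1])
    avg = [sum(ex[1][j] for ex in neighbours) / k for j in range(n_alg)]
    # re-rank the averaged positions into a permutation (1 = best)
    order = sorted(range(n_alg), key=lambda j: avg[j])
    ranks = [0] * n_alg
    for pos, j in enumerate(order):
        ranks[j] = pos + 1
    return ranks

meta_db = [([0.0, 0.0], [1, 2, 3]),
           ([0.1, 0.0], [1, 3, 2]),
           ([5.0, 5.0], [3, 2, 1])]
pred = knn_label_ranking(meta_db, [0.05, 0.0], k=2)
```

With the toy metadatabase above, the query is close to the first two meta-examples, so the prediction follows their (averaged) rankings rather than the distant third one.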
4 Experimental setup
Any MtL-based approach to algorithm selection has two well-defined levels: the base-level (a conventional ML task applying ML algorithms to problem-related datasets) and the meta-level (applying ML algorithms to metadatasets). In this work, the base-level is a CF task and the meta-level is the application of ML algorithms for Label Ranking. From this point onward, base-learners and meta-learners refer to the algorithms used in the base-level and in the meta-level, respectively. Next, the experimental setup used in this work is presented.
4.1 Collaborative Filtering
In the base-level, CF base-learners are applied to CF datasets and evaluated using CF assessment measures. This work uses 38 datasets, described in Table 1 alongside a summary of their statistics, namely the number of users, items and ratings. Due to space restrictions, the datasets are identified by acronyms: Amazon (AMZ), Bookcrossing (BC), Flixter (FL), Jester (JT), MovieLens (ML), MovieTweetings (MT), TripAdvisor (TA), Yahoo! (YH) and Yelp (YE).
Dataset  #users  #items  #ratings  Reference 
AMZapps  132391  24366  264233  (McAuley and Leskovec, 2013) 
AMZautomotive  85142  73135  138039  
AMZbaby  53188  23092  91468  
AMZbeauty  121027  76253  202719  
AMZcd  157862  151198  371275  
AMZclothes  311726  267503  574029  
AMZfood  76844  51139  130235  
AMZgames  82676  24600  133726  
AMZgarden  71480  34004  99111  
AMZhealth  185112  84108  298802  
AMZhome  251162  123878  425764  
AMZinstruments  33922  22964  50394  
AMZkindle  137107  131122  308158  
AMZmovies  7278  1847  11215  
AMZmusic  47824  47313  83863  
AMZoffice  90932  39229  124095  
AMZpet Supplies  74099  33852  123236  
AMZphones  226105  91289  345285  
AMZsports  199052  127620  326941  
AMZtools  121248  73742  192015  
AMZtoys  134291  94594  225670  
AMZvideo  42692  8882  58437  
BC  7780  29533  39944  (Ziegler et al, 2005) 
FL  14761  22040  812930  (Zafarani and Liu, 2009) 
JT1  2498  100  181560  (Goldberg et al, 2001) 
JT2  2350  100  169783  
JT3  2493  96  61770  
ML100k  94  1202  9759  (GroupLens, 2016) 
ML10m  6987  9814  1017159  
ML1m  604  3421  106926  
ML20m  13849  16680  2036552  
MLlatest  22906  17133  2111176  
MTlatest  3702  7358  39097  (Dooms et al, 2013) 
MTRS14  2491  4754  20913  
TA  77851  10590  151030  (Wang et al, 2011) 
YHmovies  764  4078  22135  (Yahoo!, 2016) 
YHmusic  613  4620  30852  
YE  55233  46045  211627  (Yelp, 2016) 
The experiments were carried out with MyMediaLite (Gantner et al, 2011). Two CF tasks were addressed: Rating Prediction and Item Recommendation. While the first aims to predict the rating a user would assign to a new instance, the second aims to recommend a ranked list of items. Since the tasks are different, so are the base-learners and evaluation measures required.
The following CF base-learners were used for Rating Prediction: Matrix Factorization (MF), Biased MF (BMF) (Salakhutdinov and Mnih, 2008), Latent Feature Log Linear Model (LFLLM) (Menon and Elkan, 2010), SVD++ (Koren, 2008), 3 versions of the Sigmoid Asymmetric Factor Model (SIAFM, SUAFM and SCAFM) (Paterek, 2007), User Item Baseline (UIB) (Koren, 2010) and Global Average (GA). For Item Recommendation, the base-learners chosen were BPRMF (Rendle et al, 2009), Weighted BPRMF (WBPRMF) (Rendle et al, 2009), Soft Margin Ranking MF (SMRMF) (Weimer et al, 2008), WRMF (Hu et al, 2008) and Most Popular (MP). These base-learners were selected because all are Matrix Factorization algorithms, well known for their predictive power and computational efficiency.
In the Item Recommendation experiments, the base-learners were evaluated using NDCG and AUC, while for Rating Prediction, NMAE and RMSE were used. All experiments were performed using 10-fold cross-validation. To prevent bias in favour of any base-learner, the hyperparameters were not tuned.
4.2 Metalearning with Label Ranking
The meta-level uses meta-learners to map metafeatures to metatargets. This work investigates two types of metafeatures:

Comprehensive metafeatures (Cunha et al, 2018a): chosen to represent standard MtL approaches, since they achieve the best performance and represent the most diverse set of problem characteristics.

cf2vec metafeatures: distributed representations learned with the proposed procedure. One important issue to address is hyperparameter optimization, since different settings produce different representations. This work pays special attention to the representation size δ and the number of context subgraphs c, since the analogous hyperparameters were shown to be the most important in (Mikolov et al, 2013). Nevertheless, all hyperparameters are tuned using grid search (Bergstra and Bengio, 2012).
The multicriteria metatarget procedure used in the latest related work on CF algorithm selection (Cunha et al, 2018a) is replicated here. The authors introduce a novel way to model the metatargets, which creates a single ranking of algorithms by considering more than one evaluation measure. This decision was made because it creates fairer rankings and reduces the number of algorithm selection problems to investigate.
This work uses only one meta-learner, in order to simplify the presentation of results. To that end, KNN (Soares, 2015) was chosen due to its superior predictive performance in CF algorithm selection (Cunha et al, 2018a). The experiments use the Average Rankings algorithm (Brazdil and Soares, 2000) as baseline. Metamodels are evaluated using Kendall's tau and leave-one-out cross-validation, and tuned using grid search (Bergstra and Bengio, 2012).
5 Results and Discussion
Here, a set of Research Questions (RQs) is posed to empirically compare the merits of cf2vec's distributed representations against CF metafeatures.
RQ 1. Which is the best N setting in cf2vec?
This analysis investigates the effect of N (the number of nodes sampled per graph) on Kendall's tau performance, which measures how similar the true and predicted rankings of CF algorithms are, averaged over all datasets considered. Figure 7 shows the distribution of Kendall's tau scores for all cf2vec metamodels, for each value of N. The results also show the performance obtained with the Comprehensive Metafeatures (CM) and Average Rankings (AR).
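Kendall's tau itself can be computed with a short sketch (a tie-free version for illustration; library implementations also handle ties):

```python
def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same items: the fraction
    of concordant minus discordant item pairs, ranging in [-1, 1]."""
    n = len(rank_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# identical rankings give 1.0; fully reversed rankings give -1.0
assert kendall_tau([1, 2, 3], [1, 2, 3]) == 1.0
assert kendall_tau([1, 2, 3], [3, 2, 1]) == -1.0
```

A value of 1 thus means the metamodel predicted the ranking of CF algorithms perfectly, while 0 means the prediction is no better than random pair ordering.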
According to these results:

cf2vec creates informative representations: this is supported by the fact that all its metamodels perform better than the baseline AR.

cf2vec is never better than CM: although its performance comes very close to CM's, this threshold is never beaten.

There is a best value of N: although the performances are quite similar, it is visible that the best and median performances increase up to a certain N, but suffer a small decrease afterwards.
RQ 2. How is cf2vec's performance affected by graph2vec's hyperparameters?
This analysis focuses on two hyperparameters: δ and c, the representation size and the number of context subgraphs, respectively. All other hyperparameters are disregarded since no obvious patterns emerged for them. Figures 8 and 9 present the Kendall's tau performance for all cf2vec metamodels built with the best N setting found in RQ1.
According to these results:

The performances for δ are stable: although the best and worst performances fluctuate slightly, the median values remain the same. These are surprising results, which may be explained by the limited grid search settings. However, considering the reduced number of meta-examples and the curse of dimensionality, it would be difficult to improve these results.

Hyperparameter c has a significant impact on the predictive performance: for both metatargets, performance increases up to a certain value of c and decreases soon after. Overall, lower numbers of context subgraphs lead to better performance.
RQ 3. How does the performance of the best cf2vec metamodel compare with the best metamodel induced with Comprehensive Metafeatures?
To select the best cf2vec hyperparameter settings, the best performance on both CF problems must be found. To illustrate how the performance is distributed, Figure 10 presents the best Kendall's tau performances for both problems. The metamodels are identified by their δ and c hyperparameters.
The results show that a small group of metamodels occupies the vast majority of the points that simultaneously maximize the performance on both tasks. Among these, the best hyperparameter setting corresponds to the point that best balances both performances. The metamodel trained with these hyperparameter settings is henceforth used as cf2vec's representative.
To understand how cf2vec competes with the other strategies, a statistical significance test was used: Critical Difference (CD) diagrams (Demšar, 2006). Each strategy is represented by its best metamodel's performances on all datasets. These are used to rank the meta-learners and to assess whether the differences between them are statistically significant. The CD interval, calculated with a Friedman test, connects all meta-learners for which there is no statistically significant difference. Figure 11 shows these results.
The results show that there is no statistically significant difference between CM and cf2vec and that both are better than the baseline. Thus, the proposed approach is not only suitable to the task at hand, but it is also as good as the best CF metafeatures.
RQ 4. What is the impact on the baselevel performance achieved by cf2vec?
Since differences in base-level performance within a ranking can be quite costly, it is essential to assess each meta-learner by the base-level predictive performance of its predicted rankings. To do so, each position in the predicted ranking of algorithms is replaced by the respective base-learner's performance. This performance vector is then normalized and averaged over all datasets. Figure 12 shows these results for both CF problems. The number of ranking positions differs because each problem has a different number of algorithms. The results are presented as percentages to facilitate interpretation.
The results show that while CM is better at the top of the ranking in Item Recommendation, for all remaining positions cf2vec and CM have the same performance. In Rating Prediction, however, cf2vec is better than CM in some positions and equal in the others. These results show that cf2vec obtains a comparable (and sometimes even higher) base-level performance.
RQ 5. Is there a clear relation between metafeatures (either CM or cf2vec representations) and CF algorithm performance?
Considering the similarity in predictive performance of both types of metafeatures in all previous analyses, it is important to understand whether there are clear relationships between them and the metatargets. Such relations can potentially explain the results obtained.
The literature on distributed representations often resorts to t-SNE (Van Der Maaten and Hinton, 2008) to explore high-dimensional datasets. However, two limitations make this technique less than ideal in our setup: 1) the procedure is stochastic, hence the representations are not static, and 2) the process, being especially designed for large numbers of data points, is not suited to our problem, where only 38 data points exist. Hence, PCA was used to visualize the high-dimensional metafeatures in a two-dimensional map. To enrich the results, the ranking of base-learners for each dataset is shown using a colour gradient assigned based on metatarget similarity. This highlights clear patterns between metafeatures and metatargets: if similar (or the same) metatargets are assigned to two similar datasets (placed near one another), there is a clear pattern. Figures 13 and 14 illustrate the results.
The results show that both metafeatures work well in two cases:

Same domain and similar metatargets: most datasets from the same domain have clearly visible patterns in the mappings between metafeatures and metatargets. This occurs for the AMZ and JT domains.

Different domains but similar metatargets: some datasets from different domains, and sharing similar metatargets, are close to each other. This happens for the BC and FL datasets in Item Recommendation and for the YE and FL datasets in Rating Prediction.
The previous observations refer to the easily predictable meta-instances. The fact that both types of metafeatures are able to map these instances close together is a good reason why they perform well for the majority of datasets. However, some problems were found:

Anomalies: some points are close to others without any apparent reason. This occurs for the TA dataset in both CF problems and for the YE dataset in the Item Recommendation problem. This may happen because the current metafeatures are not good enough to characterize these datasets.

Same domain but different metatargets: some datasets from the same domain appear close to each other, yet their rankings are significantly different. This occurs for the ML, YH and MT datasets. A possible reason is the difficulty of the metamodel in correctly predicting the rankings of algorithms. This difficulty can potentially be reduced by tuning the meta-learner hyperparameters and by choosing meta-learners with different biases.
The low occurrence of these problems can explain the high predictive performance obtained. However, their occurrence points out the need for further studies. One important issue lies in the need to find more, and more diverse, datasets and base-learners to complement these observations.
Finally, although cf2vec shows interesting and useful patterns, CM seems to be generally better at mapping the difficult problems. This may be the missing indicator that justifies the differences in predictive performance. Therefore, although the hypothesis of using distributed representations as alternative metafeatures is validated, it is clear that other techniques may be needed to surpass the performance of hand-designed metafeatures tailored to this empirical setup. Nevertheless, the research direction presented points to promising new algorithm selection solutions in CF and other ML tasks.
6 Conclusions
This paper introduced a novel technique for CF metafeature design: cf2vec. It adapts a known distributed representation technique, graph2vec, to the context of CF algorithm selection. To do so, the procedure converts CF datasets into graphs, reduces the problem complexity via graph sampling, learns the distributed representations and uses them as alternative metafeatures. The experiments carried out show that cf2vec is competitive with the state-of-the-art collection of CF metafeatures, with no statistically significant differences. The main advantages of cf2vec are: the metafeatures are generated with virtually no human intervention, the process can be tuned to adjust the metafeatures to the experimental setup, and the procedure reaches the same performance as the state of the art while requiring a smaller amount of data. However, cf2vec also has limitations. These are discussed next, along with suggestions to deal with them:

Representation Learning: this work used graph2vec as the main procedure to learn the distributed representations. The choice was supported by its theoretical applicability to this work and motivated by existing related work on CF graph metafeatures (Cunha et al, 2018a). However, other techniques can be considered, such as Autoencoders (for instance, adapting Image Processing techniques using Convolutional Neural Networks (LeCun et al, 1990) to the CF domain, given the similarity between images and rating matrices) and RL techniques specially designed for CF (Sedhain et al, 2015; Strub et al, 2016; Wu et al, 2016) (but redesigned to describe the whole dataset rather than each user; alternatively, these representations could be used directly to perform algorithm selection at the user level). Despite the clear motivation to use either, none is yet ready to be applied to the general CF algorithm selection problem, and both require modifications.
Hyperparameter Tuning: according to the experimental results, the proposed technique is sensitive to the value of some hyperparameters. Hence, the metafeature extraction process requires training multiple graph2vec models to find the best one. Although this study tried to indicate the best hyperparameter values (which may be used as default), it is essential to understand that a different experimental setup may require a different hyperparameter setting to achieve optimal results.

Predictive Performance: the presented experimental setup, although extensive, may still be the reason why the proposed metafeatures are not significantly better than the state-of-the-art. Factors such as an insufficient number of datasets and base-learners, imbalanced data (not enough metaexamples for all metatargets considered) and the lack of hyperparameter optimisation in the base-learners may influence the experimental results. However, the authors would like to acknowledge how difficult it still is to obtain a more suitable experimental setup in the CF domain.

Metafeature Importance: most previous works in CF algorithm selection have investigated which metafeatures are the most relevant. Although the metafeatures used became increasingly complex and harder to interpret, they still hinted at which data properties were important. In this case, since the metafeatures created are latent, it is impossible to perform the same analysis. Therefore, until procedures capable of extracting meaning from latent features become available, the analysis is limited to the one presented in this work.
References
 Adomavicius and Zhang (2012) Adomavicius G, Zhang J (2012) Impact of data characteristics on recommender systems performance. ACM Trans Manag Inf Syst 3(1):1–17
 Amasyali and Ersoy (2009) Amasyali F, Ersoy O (2009) A study of meta learning for regression. Tech. rep., Purdue University
 Bengio (2011) Bengio Y (2011) Deep Learning of Representations for Unsupervised and Transfer Learning. JMLR: Workshop and Conference 7:1–20
 Bengio et al (2013) Bengio Y, Courville A, Vincent P (2013) Representation Learning: A Review and New Perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
 Bergstra and Bengio (2012) Bergstra J, Bengio Y (2012) Random Search for Hyper-Parameter Optimization. JMLR 13:281–305
 Bourlard and Kamp (1988) Bourlard H, Kamp Y (1988) Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics 59(4):291–294
 Brazdil and Soares (2000) Brazdil P, Soares C (2000) A Comparison of Ranking Methods for Classification Algorithm Selection. In: de Mántaras R, Plaza E (eds) Machine Learning: ECML 2000, Springer Berlin Heidelberg, pp 63–75
 Brazdil et al (2003) Brazdil P, Soares C, Costa J (2003) Ranking Learning Algorithms: Using IBL and Meta-Learning on Accuracy and Time. Machine Learning 50(3):251–277
 Brazdil et al (2009) Brazdil P, Giraud-Carrier C, Soares C, Vilalta R (2009) Metalearning: Applications to Data Mining, 1st edn. Springer Publishing
 Collins et al (2018) Collins A, Beel J, Tkaczyk D (2018) One-at-a-time: A Meta-Learning Recommender-System for Recommendation. arXiv e-prints arXiv:1805.12118
 Cunha et al (2016) Cunha T, Soares C, Carvalho A (2016) Selecting Collaborative Filtering algorithms using Metalearning. In: ECML-PKDD, pp 393–409
 Cunha et al (2017a) Cunha T, Soares C, Carvalho A (2017a) Metalearning for Context-aware Filtering: Selection of Tensor Factorization Algorithms. In: ACM RecSys, pp 14–22
 Cunha et al (2017b) Cunha T, Soares C, Carvalho A (2017b) Recommending Collaborative Filtering algorithms using landmarkers. In: Discovery Science, pp 189–203
 Cunha et al (2018a) Cunha T, Soares C, Carvalho A (2018a) Algorithm Selection for Collaborative Filtering: the influence of graph metafeatures and multicriteria metatargets. arXiv e-prints pp 1–25, arXiv:1807.09097
 Cunha et al (2018b) Cunha T, Soares C, Carvalho A (2018b) Metalearning and Recommender Systems: A literature review and empirical study on the algorithm selection problem for Collaborative Filtering. Information Sciences 423:128–144
 Demšar (2006) Demšar J (2006) Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 7:1–30
 Dooms et al (2013) Dooms S, De Pessemier T, Martens L (2013) MovieTweetings: a Movie Rating Dataset Collected From Twitter. In: CrowdRec at ACM RecSys
 Ekstrand and Riedl (2012) Ekstrand M, Riedl J (2012) When Recommenders Fail: Predicting Recommender Failure for Algorithm Selection. ACM RecSys pp 233–236
 Gama and Brazdil (1995) Gama J, Brazdil P (1995) Characterization of Classification Algorithms. Lecture Notes in Computer Science 990:189–200
 Gantner et al (2011) Gantner Z, Rendle S, Freudenthaler C, SchmidtThieme L (2011) MyMediaLite: A Free Recommender System Library. In: ACM RecSys, pp 305–308
 Goldberg et al (2001) Goldberg K, Roeder T, Gupta D, Perkins C (2001) Eigentaste: A Constant Time Collaborative Filtering Algorithm. Information Retrieval 4(2):133–151
 Goodfellow et al (2016) Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press
 Griffith et al (2012) Griffith J, O’Riordan C, Sorensen H (2012) Investigations into rating information and accuracy in collaborative filtering. In: ACM SAC, pp 937–942
 GroupLens (2016) GroupLens (2016) MovieLens. URL http://grouplens.org/datasets/movielens/
 He et al (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE Comput Soc Conf Comput Vis Pattern Recognit, pp 770–778
 Hu et al (2008) Hu Y, Koren Y, Volinsky C (2008) Collaborative Filtering for Implicit Feedback Datasets. In: IEEE Int. Conf. on Data Mining, pp 263–272
 Hüllermeier et al (2008) Hüllermeier E, Fürnkranz J, Cheng W, Brinker K (2008) Label ranking by learning pairwise preferences. Artificial Intelligence 172(16–17):1897–1916
 Kalousis and Hilario (2001) Kalousis A, Hilario M (2001) Feature Selection for Metalearning. In: Advances in Knowledge Discovery and Data Mining, pp 222–233
 Koren (2008) Koren Y (2008) Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: ACM SIGKDD, pp 426–434
 Koren (2010) Koren Y (2010) Factor in the Neighbors: Scalable and Accurate Collaborative Filtering. ACM Trans Knowl Discov Data 4(1):1–24
 Le and Mikolov (2014) Le Q, Mikolov T (2014) Distributed Representations of Sentences and Documents. In: Int. Conf. on Machine Learning, pp II-1188–II-1196
 LeCun (1987) LeCun Y (1987) PhD thesis: Modeles connexionnistes de l'apprentissage (connectionist learning models). Université P. et M. Curie (Paris 6)
 LeCun et al (1990) LeCun Y, Boser B, Denker J, Henderson D, Howard R, Hubbard W, Jackel L (1990) Handwritten Digit Recognition with a Back-Propagation Network. In: Adv. in Neural Information Processing Sys., Morgan Kaufmann, pp 396–404
 LeCun et al (2015) LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
 Leskovec and Faloutsos (2006) Leskovec J, Faloutsos C (2006) Sampling from large graphs. In: ACM SIGKDD, pp 631–636
 Matuszyk and Spiliopoulou (2014) Matuszyk P, Spiliopoulou M (2014) Predicting the Performance of Collaborative Filtering. In: Web Intelligence, Mining and Semantics, pp 38:1–6
 McAuley and Leskovec (2013) McAuley J, Leskovec J (2013) Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text. In: ACM RecSys, pp 165–172
 Menon and Elkan (2010) Menon AK, Elkan C (2010) A log-linear model with latent features for dyadic prediction. In: ICDM, pp 364–373
 Mikolov et al (2013) Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient Estimation of Word Representations in Vector Space. arXiv e-prints pp 1–12, arXiv:1301.3781
 Narayanan et al (2017) Narayanan A, Chandramohan M, Venkatesan R, Chen L, Liu Y, Jaiswal S (2017) graph2vec: Learning Distributed Representations of Graphs. arXiv e-prints pp 1–8, arXiv:1707.05005
 Paterek (2007) Paterek A (2007) Improving regularized singular value decomposition for collaborative filtering. In: KDD cup and workshop, pp 2–5
 Pinto et al (2016) Pinto F, Soares C, MendesMoreira J (2016) Towards automatic generation of Metafeatures. In: PAKDD, pp 215–226
 Prudêncio and Ludermir (2004) Prudêncio RBC, Ludermir TB (2004) Metalearning approaches to selecting time series models. Neurocomputing 61:121–137
 Rendle et al (2009) Rendle S, Freudenthaler C, Gantner Z, Schmidt-Thieme L (2009) BPR: Bayesian Personalized Ranking from Implicit Feedback. In: Conference on Uncertainty in Artificial Intelligence, pp 452–461
 Rice (1976) Rice J (1976) The Algorithm Selection Problem. Adv in Computers 15:65–118
 Rumelhart et al (1986) Rumelhart DE, McClelland JL, PDP Research Group C (eds) (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, Cambridge, MA, USA
 Salakhutdinov and Mnih (2008) Salakhutdinov R, Mnih A (2008) Probabilistic Matrix Factorization. In: Advances in Neural Information Processing Systems, pp 1257–1264
 Schmidhuber (2015) Schmidhuber J (2015) Deep learning in neural networks: An overview. Neural Networks 61:85–117
 Sedhain et al (2015) Sedhain S, Menon AK, Sanner S, Xie L (2015) AutoRec: Autoencoders Meet Collaborative Filtering. In: WWW, pp 111–112
 Shervashidze et al (2011) Shervashidze N, Schweitzer P, van Leeuwen EJ, Mehlhorn K, Borgwardt KM (2011) Weisfeiler-Lehman Graph Kernels. JMLR 12:2539–2561
 Smith-Miles (2008) Smith-Miles K (2008) Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Comput Surv 41(1):6:1–6:25
 Soares (2015) Soares C (2015) labelrank: Predicting Rankings of Labels. URL https://cran.r-project.org/package=labelrank
 Strub et al (2016) Strub F, Mary J, Gaudel R (2016) Hybrid Recommender System based on Autoencoders. arXiv e-prints arXiv:1606.07659
 Van Der Maaten and Hinton (2008) Van Der Maaten LJP, Hinton GE (2008) Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research 9:2579–2605
 Vanschoren (2010) Vanschoren J (2010) Understanding machine learning performance with experiment databases. PhD thesis, Katholieke Universiteit Leuven
 Vembu and Gärtner (2010) Vembu S, Gärtner T (2010) Label ranking algorithms: A survey. In: Preference Learning, Springer-Verlag, pp 45–64
 Wang et al (2011) Wang H, Lu Y, Zhai C (2011) Latent Aspect Rating Analysis Without Aspect Keyword Supervision. In: ACM SIGKDD, pp 618–626
 Weimer et al (2008) Weimer M, Karatzoglou A, Smola A (2008) Improving Maximum Margin Matrix Factorization. Machine Learning 72(3):263–276
 Wu et al (2016) Wu Y, DuBois C, Zheng AX, Ester M (2016) Collaborative Denoising Auto-Encoders for Top-N Recommender Systems. In: WSDM, pp 153–162
 Yahoo! (2016) Yahoo! (2016) Webscope datasets. URL https://webscope.sandbox.yahoo.com/
 Yelp (2016) Yelp (2016) Yelp Dataset. URL https://www.yelp.com/dataset_challenge
 Zafarani and Liu (2009) Zafarani R, Liu H (2009) Social computing data repository at ASU. URL http://socialcomputing.asu.edu
 Ziegler et al (2005) Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving Recommendation Lists Through Topic Diversification. In: WWW, pp 22–32