cf2vec: Collaborative Filtering algorithm selection using graph distributed representations

09/17/2018 ∙ by Tiago Cunha, et al. ∙ Universidade de São Paulo ∙ Universidade do Porto

Algorithm selection using Metalearning aims to find mappings from problem characteristics (i.e. metafeatures) to relative algorithm performance, in order to predict the best algorithm(s) for new datasets. Therefore, it is of the utmost importance that the metafeatures used are informative. In Collaborative Filtering, recent research has created an extensive collection of such metafeatures. However, since these are created based on the practitioner's understanding of the problem, they may not capture the most relevant aspects necessary to properly characterize the problem. We propose to overcome this problem by taking advantage of Representation Learning, which is able to create alternative problem characterizations by having the data, instead of the practitioner's opinion, guide the design of the representation. Our hypothesis states that such alternative representations can be used to replace standard metafeatures, hence leading to a more robust approach to Metalearning. We propose a novel procedure specially designed for Collaborative Filtering algorithm selection. The procedure models Collaborative Filtering as graphs and extracts distributed representations using graph2vec. Experimental results show that the proposed procedure creates representations that are competitive with state-of-the-art metafeatures, while requiring significantly less data and virtually no human input.




1 Introduction

The task of recommending the best algorithms for a new given problem, also known as algorithm selection, is widely studied in Machine Learning (ML) (Brazdil et al, 2003; Prudêncio and Ludermir, 2004; Smith-Miles, 2008). One of its most popular approaches, Metalearning (MtL), looks for a function to map metafeatures (characteristics extracted from a dataset representing the problem) to metatargets (the performance of a group of algorithms when applied to this dataset) (Brazdil et al, 2009). This function, learned via a ML algorithm, can be used to recommend algorithms for new datasets.

One of the main concerns in MtL is the design of metafeatures that are informative regarding algorithm performance (Vanschoren, 2010). MtL has been successfully used in many ML tasks. However, since each ML task has its specificities, different sets of metafeatures may be necessary for each task. Hence, an essential part of the work of a MtL practitioner is the design of hand tailored metafeatures suitable for the task at hand. This has resulted in large collections of metafeatures for tasks like regression (Amasyali and Ersoy, 2009), classification (Gama and Brazdil, 1995; Kalousis and Hilario, 2001) and Collaborative Filtering (Cunha et al, 2018a, b).

In Collaborative Filtering (CF) algorithm selection approaches, several informative metafeatures have been proposed (Cunha et al, 2018b), ranging from rating matrix characteristics (Adomavicius and Zhang, 2012; Ekstrand and Riedl, 2012; Griffith et al, 2012; Matuszyk and Spiliopoulou, 2014; Cunha et al, 2016; Collins et al, 2018) to performance estimates on data samples (Cunha et al, 2017b). However, these metafeatures have an important limitation: all are tailor-made and therefore depend on the practitioner’s experience and perspective on the problem. Unlike previous works, this paper investigates how useful metafeatures can be designed while minimizing human interference.

Representational Learning (RL) (Bengio et al, 2013) uses ML algorithms and domain knowledge to learn alternative and potentially richer representations for a given problem to enhance predictive performance in other ML tasks. Examples of successful applications are text classification (Bengio et al, 2013) and image recognition (He et al, 2016). However, to the best of our knowledge, this approach has never been used for algorithm selection tasks.

In this paper we use a RL approach to automatically design metafeatures for the problem of algorithm selection in CF. The proposed solution is inspired by distributed representations (Lecun et al, 2015), which represent each problem entity by its underlying relationships with other lower-granularity elements. An example is word2vec (Mikolov et al, 2013), which represents each word in a text using a set of neighbouring words. This paper investigates the hypothesis that there is a distributed representation technique able to create a latent representation of the CF problem, which can produce alternative metafeatures. This representation is created using graph2vec (Narayanan et al, 2017), a technique inspired by word2vec. The proposed procedure, cf2vec, has 4 essential steps: 1) to convert the CF matrix into a graph, 2) to reduce the problem complexity via graph sampling, 3) to learn the distributed representations and 4) to train a metamodel using the learned representations as metafeatures. We evaluate cf2vec against state-of-the-art CF metafeatures and show its performance to be comparable, while requiring significantly less data and virtually no human input.

This document is organized as follows: Section 2 presents a literature review on MtL and RL for the investigated problem; Section 3 introduces the proposed technique cf2vec; Section 4 describes the experimental setup used to validate cf2vec, while Section 5 reports the experimental analysis conducted. Finally, Section 6 presents the main conclusions, discusses cf2vec’s limitations and introduces directions for future work.

2 Related Work

2.1 Metalearning

MtL attempts to model algorithm performance in terms of problem characteristics (Vanschoren, 2010). One of its main applications is algorithm selection, first conceptualized by Rice (1976). Figure 1 presents the conceptualized framework, which defines several search spaces: problem, feature, algorithm and performance, represented by $P$, $F$, $A$ and $Y$. A problem is described as: for a given instance $x \in P$, with features $f(x) \in F$, find the mapping $S(f(x))$ into $A$, such that the selected algorithm $\alpha \in A$ maximizes $y(\alpha(x)) \in Y$ (Rice, 1976). Hence, algorithm selection can be formulated as a learning task whose goal is to learn a metamodel able to recommend algorithms for a new task.

Figure 1: Rice’s Algorithm Selection conceptual framework (Smith-Miles, 2008), relating the problem, feature, algorithm and performance spaces through the selection mapping.

The first algorithm selection approaches for CF appeared recently (Adomavicius and Zhang, 2012; Ekstrand and Riedl, 2012; Griffith et al, 2012; Matuszyk and Spiliopoulou, 2014), but had low representativeness: the experimental setup and nature and diversity of metafeatures were very limited. More advanced and extensively validated CF metafeatures have been proposed:

  • Rating Matrix metafeatures (Cunha et al, 2016): these characteristics describe several rating matrix properties using a systematic metafeature generation framework (Pinto et al, 2016). This collection of metafeatures combines sets of objects, functions and post-functions. The metafeatures proposed use three objects (rating matrix and its rows and columns), four functions (original ratings, number of ratings, mean rating value and sum of ratings) and eleven post-functions (maximum, minimum, mean, standard deviation, median, mode, entropy, gini, skewness and kurtosis).

  • Subsampling landmarkers (Cunha et al, 2017b): these metafeatures are created using performance estimates on random samples extracted from the original datasets. First, random samples are extracted for each CF dataset. Next, CF algorithms are trained on these samples and their performance assessed using different evaluation measures. The outcome is a subsampling landmarker for each pair algorithm/evaluation measure.

  • Graph metafeatures (Cunha et al, 2018a): this approach models CF as a graph and takes advantage of the systematic metafeature extraction procedure (Pinto et al, 2016) and the hierarchical decomposition of complex data structures (Cunha et al, 2017a). This makes it possible to define important levels (graph, node, pairwise and subgraph), which are then characterized using Graph Theory characteristics.

  • Comprehensive metafeatures (Cunha et al, 2018a): this collection aggregates all metafeatures from all previous approaches. Correlation Feature Selection is used to obtain the most significant metafeatures.
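The object/function/post-function scheme behind the Rating Matrix metafeatures can be illustrated with a minimal sketch. It is not the implementation used in the cited works: the toy matrix, the helper names and the reduced sets of functions and post-functions are assumptions made only for illustration.

```python
from statistics import mean, median, pstdev

# Toy rating matrix (illustrative): rows are users, columns are items,
# None marks a missing rating.
ratings = [
    [5, 3, None, 1],
    [4, None, None, 1],
    [None, 1, 5, 4],
]

def observed(values):
    """Keep only the ratings that are actually present."""
    return [v for v in values if v is not None]

# Objects: the whole matrix (one element), its rows and its columns.
objects = {
    "matrix": [[v for row in ratings for v in row]],
    "rows": [list(row) for row in ratings],
    "columns": [list(col) for col in zip(*ratings)],
}

# Functions: summarise each element of an object with a single number.
functions = {
    "count": lambda vals: len(observed(vals)),
    "mean": lambda vals: mean(observed(vals)),
    "sum": lambda vals: sum(observed(vals)),
}

# Post-functions: aggregate the per-element numbers into one metafeature.
post_functions = {
    "max": max,
    "min": min,
    "mean": mean,
    "sd": pstdev,
    "median": median,
}

def extract_metafeatures():
    """Combine every object, function and post-function into named metafeatures."""
    feats = {}
    for o_name, elements in objects.items():
        for f_name, func in functions.items():
            # Skip elements without any observed rating.
            values = [func(e) for e in elements if observed(e)]
            for p_name, post in post_functions.items():
                feats[f"{o_name}.{f_name}.{p_name}"] = post(values)
    return feats

metafeatures = extract_metafeatures()
```

With 3 objects, 3 functions and 5 post-functions, this sketch yields 45 metafeatures, for instance `rows.count.max` (the largest number of ratings given by any user).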

2.2 Representational Learning

Although there are alternatives, like probabilistic models and manifold learning (Bengio, 2011; Bengio et al, 2013), the classical RL technique is the Autoencoder (Bourlard and Kamp, 1988; Lecun, 1987). Figure 2 shows its architecture, simplified for easier readability.

Figure 2: Autoencoder architecture, with an input layer, a hidden layer and an output layer. Autoencoders are obtained by training a neural network to reproduce the input vector in the output vector using a hidden layer with fewer neurons than the output layer. To do so, the network learns two functions: an encoding function and a decoding function. Since this hidden layer is able to preserve useful properties of the data, it can represent the input (Goodfellow et al, 2016; Lecun et al, 2015; Schmidhuber, 2015).

Autoencoders can, theoretically, be used in any ML task, including CF. In fact, they have been used to provide recommendations (Sedhain et al, 2015; Strub et al, 2016; Wu et al, 2016). These works learn latent representations for each user and/or item, which are in turn used to make the recommendations. However, cf2vec needs a latent representation able to describe the entire dataset, like the metafeatures do. Hence, these are not useful for our purposes.

A better alternative is to use distributed representations (Lecun et al, 2015). As the name suggests, each entity is represented by a pattern of activity distributed over many elements, and each element participates in the representation of many different entities (Rumelhart et al, 1986). In essence, they also represent the input as a real-valued vector, but using a different network architecture. The most significant techniques for our problem are discussed next.

word2vec (Mikolov et al, 2013) assumes that two words are similar (and have similar representations) if they have similar contexts. In this case, the context refers to a predefined amount of neighboring words. One architecture proposed to learn these representations is the skipgram, which predicts surrounding words given the current word. Figure 3 shows how skipgram works.

Figure 3: Skipgram architecture used in word2vec (Mikolov et al, 2013). Each target word, represented as a one-hot encoding over a vocabulary, is connected to a hidden layer. This hidden layer, where the distributed representations are, has a predefined size. Each distributed representation is connected to the previous and next context words. The network weights are updated until a learning stop criterion is reached.
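To make the notion of context concrete, the following minimal sketch enumerates the (target word, context word) pairs that a skipgram model is trained to predict. The function name, the `window` parameter and the example sentence are illustrative assumptions, not part of word2vec's actual code.

```python
def skipgram_pairs(tokens, window):
    """Return every (target, context) pair within `window` positions of each token."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["users", "rate", "items"]
pairs = skipgram_pairs(sentence, window=1)
# pairs == [("users", "rate"), ("rate", "users"), ("rate", "items"), ("items", "rate")]
```

Words that occur with similar context words produce similar training pairs, which is why they end up with similar representations.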

doc2vec (Le and Mikolov, 2014) learns distributed representations for sequences of words with different lengths (i.e. paragraphs, documents, etc.). One of the introduced algorithms, Paragraph Vector Distributed Bag of Words (PV-DBOW), allows a straightforward adaptation of word2vec’s skipgram: instead of predicting context words based on a current word, the neural network now predicts sequences of words belonging to a particular document. A variation of this technique is available in graph2vec (Narayanan et al, 2017): by considering each graph as a document, it is able to represent each graph by its underlying nodes. The process has two stages: 1) create rooted subgraphs in order to generate a vocabulary and 2) train the PV-DBOW skipgram model. This technique will be discussed in detail in Section 3.3.2.

3 Distributed Representations as CF metafeatures

This section introduces the main contribution of this work: cf2vec. Next, its essential steps are presented: 1) to convert the CF matrix into a graph, 2) to reduce the problem complexity via graph sampling, 3) to learn the distributed representations and 4) to train a metamodel with alternative metafeatures.

3.1 Convert CF matrix into graph

CF is usually described by a rating matrix, representing a set of users and a set of items. Each element of this matrix is the feedback provided by a user for an item. Figure 4(a) shows a toy example of a rating matrix.

(a) Rating Matrix. Rows represent users, while columns represent items. Some cells hold the rating assigned by a user to an item.
(b) Bipartite Graph. The graph has two node subsets, representing users and items. Ratings are weighted edges between nodes of both subsets.
Figure 4: Toy example for two different CF representations.

To use graph2vec, the input elements must be graphs. Since Cunha et al (2018a) have shown that a CF rating matrix can be seen as an adjacency matrix, the problem can be stated as: consider a bipartite graph whose two node subsets represent users and items, respectively. The edges connect elements of the two groups and represent the feedback provided by users to items. The edges can be weighted in order to represent preference values (ratings). Figure 4(b) shows the conversion of the toy example from Figure 4(a).
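The conversion can be sketched in a few lines of plain Python. The toy matrix, the node naming scheme (`u0`, `i0`, ...) and the function name are assumptions made for illustration only.

```python
def ratings_to_bipartite(ratings):
    """Turn a rating matrix (None = no rating) into a weighted bipartite graph.

    Returns the user nodes, the item nodes and a dict that maps each
    (user, item) edge to its rating, used as the edge weight.
    """
    users = [f"u{r}" for r in range(len(ratings))]
    items = [f"i{c}" for c in range(len(ratings[0]))]
    edges = {}
    for r, row in enumerate(ratings):
        for c, value in enumerate(row):
            if value is not None:
                edges[(users[r], items[c])] = value
    return users, items, edges

# Hypothetical toy matrix: 3 users, 3 items.
toy = [
    [5, None, 3],
    [None, 4, None],
    [1, 2, None],
]
users, items, edges = ratings_to_bipartite(toy)
```

Every edge joins a user node to an item node, so the resulting graph is bipartite by construction.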

3.2 Sampling graphs

An important part of metafeature design is the effort required (Vanschoren, 2010): if extracting the metafeatures is slower than training and evaluating all algorithms on the new problem, then it is useless. Since CF graphs can become quite large, this is a pressing issue, which motivates the need to reduce the problem dimensionality. Since one is not interested in the actual time required, but rather in reducing the amount of data to be processed in order to reduce the time needed, the focus lies on finding the minimum amount of data that still maintains a high predictive performance.

Thus, an intermediate (but not mandatory) step is added: graph sampling. In order to find a distributed representation as closely related as possible to the entire graph, a sampling technique able to preserve the graph structural properties must be chosen. According to (Leskovec and Faloutsos, 2006), a good choice is random walk. It performs multiple explorations of graph paths until nodes are reached and uses all of them to obtain the respective subgraph.
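A minimal sketch of random walk sampling is given below. It assumes an undirected graph stored as a dict of neighbour sets; the restart probability, the `patience` rule for choosing a new start node and all names are illustrative assumptions rather than the exact configuration used in this work.

```python
import random

def random_walk_sample(adjacency, n_nodes, restart_prob=0.15, patience=1000, seed=0):
    """Walk the graph until `n_nodes` distinct nodes are visited, then
    return the subgraph induced by the visited nodes."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    target = min(n_nodes, len(nodes))
    start = current = rng.choice(nodes)
    visited = {start}
    stale = 0  # steps taken since the last newly visited node
    while len(visited) < target:
        neighbours = adjacency[current]
        if stale >= patience or not neighbours:
            # The walk is stuck: restart it from a new random node.
            start = current = rng.choice(nodes)
            stale = 0
        elif rng.random() < restart_prob:
            current = start  # fly back to the start node
        else:
            current = rng.choice(sorted(neighbours))
        if current in visited:
            stale += 1
        else:
            visited.add(current)
            stale = 0
    return {u: {v for v in adjacency[u] if v in visited} for u in visited}

# Hypothetical bipartite toy graph (users u*, items i*).
graph = {
    "u0": {"i0", "i1"}, "u1": {"i0"}, "u2": {"i1", "i2"},
    "i0": {"u0", "u1"}, "i1": {"u0", "u2"}, "i2": {"u2"},
}
sample = random_walk_sample(graph, n_nodes=4, seed=1)
```

The returned subgraph keeps only the edges whose two endpoints were both visited, so it can be passed on to the next step in place of the full graph.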

3.3 Learn distributed representation

Taking advantage of graph2vec’s agnostic nature, the problem can be defined as follows: given a set of CF graphs and a positive integer (i.e., the distributed representation size), one aims to learn a distributed representation of that size for every graph. Hence, this process creates a matrix of distributed representations, which can be regarded as a metafeature representation for all considered graphs. Two steps are required: 1) to extract rooted subgraphs and 2) to learn the matrix of distributed representations.

3.3.1 Extract rooted subgraphs

A rooted subgraph is composed by the set of nodes (and corresponding edges) around node that are reachable in hops. Learning the distributed representation requires the extraction of rooted subgraphs for all nodes. Thus, the process must be applied to nodes, in which .

Rooted subgraphs in graph2vec are generated using the Weisfeiler-Lehman relabeling procedure (Shervashidze et al, 2011). Beyond being able to inspect neighboring nodes, it is also able to incorporate information about the neighbors in a single node’s name. As a result, it creates a rich textual description for every graph. To do so, it iteratively traverses each original node and uses the labels of all its neighbors to form the current node label. Next, it replaces the original node labels by new compressed names, which represent a neighborhood structure. The process repeats until the desired number of hops is reached. Every rooted subgraph can be represented by a numeric vector with the frequency with which each node (original or compressed) appears in the representation, similar to one-hot encoding.
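The relabeling can be sketched as follows. This is a simplified illustration, not graph2vec's actual code: using the node degree as the initial label and the `c0`, `c1`, ... compressed names are assumptions made here.

```python
from collections import Counter

def wl_labels(adjacency, hops):
    """Weisfeiler-Lehman relabeling: return, for every node, its labels at depths 0..hops."""
    # Assumption: unlabeled nodes start with their degree as label.
    current = {node: str(len(adjacency[node])) for node in adjacency}
    history = {node: [current[node]] for node in adjacency}
    compressed = {}  # (depth, own label, sorted neighbour labels) -> short name
    for depth in range(1, hops + 1):
        new = {}
        for node in sorted(adjacency):
            neighbour_labels = tuple(sorted(current[n] for n in adjacency[node]))
            signature = (depth, current[node], neighbour_labels)
            if signature not in compressed:
                compressed[signature] = f"c{len(compressed)}"
            new[node] = compressed[signature]
        current = new
        for node in adjacency:
            history[node].append(current[node])
    return history

def graph_document(adjacency, hops):
    """Count how often each (original or compressed) label occurs in the graph."""
    history = wl_labels(adjacency, hops)
    return Counter(label for labels in history.values() for label in labels)

# Path graph a - b - c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
doc = graph_document(path, hops=1)
# doc == Counter({"1": 2, "c0": 2, "2": 1, "c1": 1})
```

Nodes `a` and `c` have the same neighborhood structure and therefore receive the same compressed name, whereas `b` receives a different one. The resulting label counts play the role of the words of a document.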

3.3.2 Learn matrix

Now that there is a graph vocabulary, the skipgram model can be used straightforwardly. As can be seen in Figure 5, each graph is represented by its identifier and connected to context rooted subgraphs. Training such a neural network makes it possible to learn similar distributed representations for graphs with similar rooted subgraphs. The authors believe that this relationship also relates to algorithm performance, similarly to what happens with other metafeatures. Hence, it is suitable for algorithm selection.

Figure 5: Skipgram architecture used in graph2vec (Narayanan et al, 2017).

To learn the weights, one must train the network. The learning process, based on Stochastic Gradient Descent, iteratively performs these steps until convergence is achieved: 1) feedforward weights from the input to the output layer, 2) application of a softmax classifier to compare the output layer’s weights with the subgraph representations and 3) backpropagation of the errors through the network. Doing so, it learns two matrices, which represent the distributed representations and the context matrices, respectively. Notice that the skipgram is trained using Negative Sampling, which does not use all subgraphs belonging to a graph. Instead, it takes advantage of a few random subgraphs that do not belong to the graph. This way, training is more efficient.
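The training loop can be illustrated with a minimal pure-Python sketch of PV-DBOW with Negative Sampling. It is only a sketch under stated assumptions: the function name, the hyperparameter defaults and the uniform sampling of negatives are hypothetical choices, and real implementations are considerably more elaborate.

```python
import math
import random

def train_graph_embeddings(graphs, dim=8, negatives=2, epochs=50, lr=0.05, seed=0):
    """graphs: dict mapping a graph id to its list of rooted-subgraph labels.

    Returns one `dim`-dimensional vector per graph, trained with SGD so that
    each graph vector scores its own subgraphs high and sampled foreign
    subgraphs low.
    """
    rng = random.Random(seed)
    vocab = sorted({s for subgraphs in graphs.values() for s in subgraphs})
    # Graph vectors start small and random, context vectors start at zero.
    emb = {g: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for g in sorted(graphs)}
    ctx = {s: [0.0] * dim for s in vocab}
    for _ in range(epochs):
        for g in sorted(graphs):
            own = set(graphs[g])
            foreign = [s for s in vocab if s not in own]
            for positive in graphs[g]:
                k = min(negatives, len(foreign))
                # One positive example plus k subgraphs that do not belong to g.
                samples = [(positive, 1.0)] + [(s, 0.0) for s in rng.sample(foreign, k)]
                update = [0.0] * dim
                for s, label in samples:
                    score = sum(a * b for a, b in zip(emb[g], ctx[s]))
                    grad = 1.0 / (1.0 + math.exp(-score)) - label
                    for d in range(dim):
                        update[d] += grad * ctx[s][d]         # gradient w.r.t. the graph vector
                        ctx[s][d] -= lr * grad * emb[g][d]    # update the context vector
                for d in range(dim):
                    emb[g][d] -= lr * update[d]
    return emb

# Hypothetical vocabulary: g1 and g2 share their subgraphs, g3 does not.
toy_graphs = {"g1": ["a", "b"], "g2": ["a", "b"], "g3": ["c", "d"]}
embeddings = train_graph_embeddings(toy_graphs, dim=4)
```

Each row of the learned graph-vector table corresponds to one graph, which is what allows it to be used directly as a set of metafeatures.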

3.4 Learn metamodel

Notice that the learned matrix can be easily used as metafeatures. Thus, every problem is described by independent variables (the corresponding row of the matrix) and dependent variables (the respective ranking of algorithms). Obtaining these pairs for all problems allows the creation of a metadatabase like the one in Figure 6.


Figure 6: Metadatabase. Organized into training and prediction data (top, bottom) and independent and dependent variables (left, right).

Formally, the submission of all problems (i.e. ) to cf2vec produces the metafeatures . To create the dependent variables, each problem is associated with the respective ranking of algorithms , based on the performance values for a specific evaluation measure . This ranking considers a static ordering of the algorithms (using for instance an alphabetical order) and is composed by a permutation of values . These values indicate, for each position , the respective ranking. A learning algorithm is then used to induce a metamodel. In order to make predictions, the metamodel can be applied to metafeatures extracted from a new problem to predict its best ranking of algorithms .
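The construction of a metatarget can be sketched as follows. The performance values below are invented purely for illustration, and the function name and tie-breaking rule are assumptions.

```python
def performance_to_ranking(performance, higher_is_better=True):
    """Convert per-algorithm performance into a ranking vector.

    The vector follows a static alphabetical ordering of the algorithms and
    holds, for each algorithm, its rank (1 = best). Ties are broken
    alphabetically.
    """
    order = sorted(performance)  # static ordering of the algorithms
    by_quality = sorted(order, key=lambda a: performance[a], reverse=higher_is_better)
    position = {a: r + 1 for r, a in enumerate(by_quality)}
    return order, [position[a] for a in order]

# Hypothetical scores for three Item Recommendation baselearners.
scores = {"WRMF": 0.66, "BPRMF": 0.61, "MP": 0.48}
algorithms, ranking = performance_to_ranking(scores)
# algorithms == ["BPRMF", "MP", "WRMF"], ranking == [2, 3, 1]
```

For error measures such as RMSE, where lower values are better, `higher_is_better=False` would be passed instead.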

Considering how the problem is modelled, the ideal solution is Label Ranking (LR) (Hüllermeier et al, 2008; Vembu and Gärtner, 2010). Thus, the algorithm selection problem for CF using LR is: for every dataset , with features associated with the respective rankings , find the selection mapping into the permutation space , such that the selected ranking of algorithms maximizes the performance mapping .

4 Experimental setup

Any MtL-based problem for algorithm selection has two well defined levels: the baselevel (a conventional ML task applying ML algorithms to problem-related datasets) and the metalevel (apply ML algorithms to metadatasets). In this work, the base level is a CF task and the metalevel is the application of ML algorithms for Label Ranking. From this point onward baselearners and metalearners are the algorithms used in the baselevel and in the metalevel, respectively. Next, the experimental setup used in this work is presented.

4.1 Collaborative Filtering

In the baselevel, CF baselearners are applied to CF datasets and evaluated using CF assessment measures. This work uses 38 datasets, described in Table 1 alongside a summary of their statistics, namely the number of users, items and ratings. Due to space restrictions, the datasets are identified by acronyms: Amazon (AMZ), Bookcrossing (BC), Flixster (FL), Jester (JT), MovieLens (ML), MovieTweetings (MT), TripAdvisor (TA), Yahoo! (YH) and Yelp (YE).

Dataset #users #items #ratings Reference
AMZ-apps 132391 24366 264233 (McAuley and Leskovec, 2013)
AMZ-automotive 85142 73135 138039
AMZ-baby 53188 23092 91468
AMZ-beauty 121027 76253 202719
AMZ-cd 157862 151198 371275
AMZ-clothes 311726 267503 574029
AMZ-food 76844 51139 130235
AMZ-games 82676 24600 133726
AMZ-garden 71480 34004 99111
AMZ-health 185112 84108 298802
AMZ-home 251162 123878 425764
AMZ-instruments 33922 22964 50394
AMZ-kindle 137107 131122 308158
AMZ-movies 7278 1847 11215
AMZ-music 47824 47313 83863
AMZ-office 90932 39229 124095
AMZ-pet Supplies 74099 33852 123236
AMZ-phones 226105 91289 345285
AMZ-sports 199052 127620 326941
AMZ-tools 121248 73742 192015
AMZ-toys 134291 94594 225670
AMZ-video 42692 8882 58437
BC 7780 29533 39944 (Ziegler et al, 2005)
FL 14761 22040 812930 (Zafarani and Liu, 2009)
JT1 2498 100 181560 (Goldberg et al, 2001)
JT2 2350 100 169783
JT3 2493 96 61770
ML100k 94 1202 9759 (GroupLens, 2016)
ML10m 6987 9814 1017159
ML1m 604 3421 106926
ML20m 13849 16680 2036552
ML-latest 22906 17133 2111176
MT-latest 3702 7358 39097 (Dooms et al, 2013)
MT-RS14 2491 4754 20913
TA 77851 10590 151030 (Wang et al, 2011)
YH-movies 764 4078 22135 (Yahoo!, 2016)
YH-music 613 4620 30852
YE 55233 46045 211627 (Yelp, 2016)
Table 1: Summary dataset description.

The experiments were carried out with MyMediaLite (Gantner et al, 2011). Two CF tasks were addressed: Rating Prediction and Item Recommendation. While the first aims to predict the rating a user would assign to a new instance, the second aims to recommend a ranked list of items. Since the tasks are different, so are the baselearners and evaluation measures required.

The following CF baselearners were used for Rating Prediction: Matrix Factorization (MF), Biased MF (BMF) (Salakhutdinov and Mnih, 2008), Latent Feature Log Linear Model (LFLLM) (Menon and Elkan, 2010), SVD++ (Koren, 2008), 3 versions of Sigmoid Asymmetric Factor Model (SIAFM, SUAFM and SCAFM) (Paterek, 2007), User Item Baseline (UIB) (Koren, 2010) and Global Average (GA). For Item Recommendation, the baselearners chosen were BPRMF (Rendle et al, 2009), Weighted BPRMF (WBPRMF) (Rendle et al, 2009), Soft Margin Ranking MF (SMRMF) (Weimer et al, 2008), WRMF (Hu et al, 2008) and Most Popular (MP). The baselearners were selected based on the fact that all are Matrix Factorization algorithms, well known for their predictive power and computational efficiency.

In the Item Recommendation experiments the baselearners were evaluated using NDCG and AUC, while for Rating Prediction, NMAE and RMSE were used. All experiments were performed using 10-fold cross-validation. To prevent bias in favour of any baselearner, the hyperparameters were not tuned.

4.2 Metalearning with Label Ranking

The metalevel uses metalearners to map metafeatures to metatargets. This work investigates two types of metafeatures:

  • Comprehensive metafeatures (Cunha et al, 2018a): chosen to represent standard MtL approaches, since they achieve the best performance and represent the most diverse set of problem characteristics.

  • cf2vec metafeatures: distributed representations learned from the proposed procedure. One important issue to address is hyperparameter optimization, since different settings produce different representations. This work pays special attention to the representation size and the amount of context subgraphs, since they were shown to be the most important in (Mikolov et al, 2013). However, all hyperparameters are tuned using grid search (Bergstra and Bengio, 2012).

The multicriteria metatargets procedure used in the latest related work on CF algorithm selection (Cunha et al, 2018a) is replicated here. The authors introduce a novel way to model the metatargets, which is able to create a single ranking of algorithms by considering more than one evaluation measure. This decision was made because it yields fairer rankings and reduces the number of algorithm selection problems investigated.

This work uses only one metalearner, in order to simplify the presentation of results. To that end, KNN (Soares, 2015) was chosen due to its superior predictive performance in CF algorithm selection (Cunha et al, 2018a). The experiments use as baseline the Average Rankings algorithm (Brazdil and Soares, 2000). Metamodels are evaluated using Kendall’s Tau and leave-one-out cross-validation and tuned using grid search (Bergstra and Bengio, 2012).
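The evaluation protocol can be illustrated with a minimal sketch. It is a simplification and not the implementation cited above: the nearest-neighbour metalearner below aggregates neighbours' rankings by average rank, and the toy metafeatures and rankings are invented for illustration.

```python
from itertools import combinations
from statistics import mean

def kendall_tau(r1, r2):
    """Kendall's tau between two rankings without ties (1 = identical, -1 = reversed)."""
    pairs = list(combinations(range(len(r1)), 2))
    score = 0
    for i, j in pairs:
        prod = (r1[i] - r1[j]) * (r2[i] - r2[j])
        score += (prod > 0) - (prod < 0)  # +1 if concordant, -1 if discordant
    return score / len(pairs)

def aggregate(rankings):
    """Average the rank of each algorithm and turn the averages back into a ranking."""
    n = len(rankings[0])
    avg = [sum(r[k] for r in rankings) / len(rankings) for k in range(n)]
    order = sorted(range(n), key=lambda k: avg[k])
    result = [0] * n
    for pos, k in enumerate(order, start=1):
        result[k] = pos
    return result

def knn_predict(train_x, train_y, x, k):
    """Predict a ranking by aggregating the rankings of the k nearest datasets."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(train_x[i], x))
    nearest = sorted(range(len(train_x)), key=dist)[:k]
    return aggregate([train_y[i] for i in nearest])

def leave_one_out(X, Y, k):
    """Mean Kendall's tau of the KNN metamodel and of the Average Rankings baseline."""
    knn_scores, ar_scores = [], []
    for i in range(len(X)):
        train_x, train_y = X[:i] + X[i + 1:], Y[:i] + Y[i + 1:]
        knn_scores.append(kendall_tau(Y[i], knn_predict(train_x, train_y, X[i], k)))
        ar_scores.append(kendall_tau(Y[i], aggregate(train_y)))
    return mean(knn_scores), mean(ar_scores)

# Hypothetical metadatabase: 4 datasets, 1 metafeature, rankings of 3 algorithms.
X = [[0.0], [0.1], [1.0], [1.1]]
Y = [[1, 2, 3], [1, 2, 3], [3, 2, 1], [3, 2, 1]]
knn_tau, ar_tau = leave_one_out(X, Y, k=1)
```

In this toy setting the metafeature separates the two groups of datasets perfectly, so the nearest-neighbour metamodel scores 1.0, while the baseline, which ignores metafeatures, scores -1.0.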

5 Results and Discussion

Here, a set of Research Questions (RQs) are posed to empirically compare the merits of cf2vec’s distributed representations against CF metafeatures.

RQ 1. Which is the best setting in cf2vec?

This analysis investigates the effect of the amount of nodes sampled per graph on Kendall’s tau performance, which measures how similar the true and predicted rankings of CF algorithms are, averaged over all datasets considered. Figure 7 shows the distribution of Kendall’s tau scores for all cf2vec metamodels, with . The results also show the performance obtained with Comprehensive Metafeatures (CM) and Average Rankings (AR).

Figure 7: Kendall’s tau in terms of (amount of nodes sampled per graph).

According to these results:

  • cf2vec creates informative representations: this is supported by the fact that all their performances are better than the baseline AR.

  • cf2vec is never better than CM: although the performance results come very close to CM’s, this threshold is never beaten.

  • The best setting is : although the performances are quite similar, it is visible that the best and median performances increase until this threshold, but suffer a small decrease afterwards.

RQ 2. How is cf2vec's performance affected by graph2vec’s hyperparameters?

This analysis focusses on two hyperparameters: and , the representation size and the amount of context subgraphs, respectively. All other hyperparameters are disregarded since no obvious patterns emerged. Figures 8 and 9 present Kendall’s tau performance for all cf2vec metamodels built with , since this proved to be the best setting.

Figure 8: Kendall’s tau in terms of (distributed representation size).
Figure 9: Kendall’s tau in terms of (amount of context subgraphs).

According to these results:

  • The performances across representation sizes are stable: although the best and worst performances slightly fluctuate, the median values remain the same. These are surprising results, which may be explained by limited grid search settings. However, considering the reduced amount of meta-examples and the curse of dimensionality, it would be difficult to improve these results.

  • The amount of context subgraphs has a significant impact on the predictive performance: both metatargets increase their performance until . Soon after, their performances decrease. However, lower amounts of context subgraphs lead to better performance (look how perform better than ).

RQ 3. How does the performance of the best cf2vec metamodel compare with the best metamodel induced with Comprehensive Metafeatures?

To select the best cf2vec hyperparameter settings, the best performance on both CF problems must be found. To illustrate how the performance is distributed, Figure 10 presents the best Kendall’s tau performances for both problems. The metamodels are identified by their and hyperparameters.

Figure 10: Performance scatterplot in both CF problems.

The results show that metamodels with occupy the vast majority of performances that simultaneously maximize the performance on both tasks. Among these, the best hyperparameter settings correspond to the performance point placed at , in which . The metamodel trained with this hyperparameter settings is henceforth used as cf2vec’s representative.

To understand how cf2vec competes with other strategies, a statistical significance test was used: Critical Difference (CD) diagrams (Demšar, 2006). Each strategy is represented by its best metamodel’s performances for all datasets. These are used here to rank several metalearners and to assess whether the differences are statistically significant. The CD interval created, which is calculated with Friedman’s test, connects all metalearners for which there is no statistically significant difference. Figure 11 shows these results.

Figure 11: Critical difference diagram for the best hyperparameter settings.
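For reference, the width of a CD interval can be reproduced with a short sketch. It assumes the usual post-hoc Nemenyi procedure described by Demšar (2006), a significance level of 0.05 and three compared strategies (CM, cf2vec and AR); whether these match the exact settings behind Figure 11 is an assumption.

```python
import math

# Critical values q_alpha of the two-tailed Nemenyi test at alpha = 0.05,
# indexed by the number of compared strategies (Demšar, 2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}

def average_ranks(scores):
    """scores[d][s] is the score of strategy s on dataset d (higher is better).

    Returns the average rank of each strategy; tied strategies share the mean rank.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda s: -row[s])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            rank = (pos + end) / 2 + 1  # mean of the tied positions, 1-based
            for s in order[pos:end + 1]:
                totals[s] += rank
            pos = end + 1
    return [t / len(scores) for t in totals]

def critical_difference(k, n):
    """CD = q_alpha * sqrt(k (k + 1) / (6 n)) for k strategies and n datasets."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6.0 * n))

cd = critical_difference(k=3, n=38)  # roughly 0.54
```

Under these assumptions, two strategies whose average ranks over the 38 datasets differ by less than about 0.54 would be connected in the diagram.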

The results show that there is no statistically significant difference between CM and cf2vec and that both are better than the baseline. Thus, the proposed approach is not only suitable to the task at hand, but it is also as good as the best CF metafeatures.

RQ 4. What is the impact on the baselevel performance achieved by cf2vec?

Since differences in baselevel performances in a ranking can be quite costly, it is essential to assess each metalearner by the baselevel predictive performance of its predicted rankings. To do so, each threshold in the predicted ranking of algorithms is replaced by the respective baselearner’s performance. This performance vector is then normalized and averaged by all datasets. Figure 12 shows these results for both CF problems. The amount of thresholds is different because each problem has a different amount of algorithms. The results are presented in percentage in order to facilitate interpretation.

Figure 12: Impact on the baselevel performance.
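The computation described above can be sketched as follows. The normalization used here (dividing by the best baselearner performance on each dataset, assuming higher is better) and the toy numbers are assumptions made for illustration; the text does not specify the exact normalization.

```python
def performance_at_thresholds(predicted_ranking, performance):
    """Replace each position of a predicted ranking by the true performance
    of the algorithm placed there, as a percentage of the best performance.

    predicted_ranking[a] is the predicted rank (1 = best) of algorithm a and
    performance[a] is its measured baselevel performance.
    """
    by_rank = sorted(range(len(predicted_ranking)), key=lambda a: predicted_ranking[a])
    best = max(performance)
    return [100.0 * performance[a] / best for a in by_rank]

def average_over_datasets(vectors):
    """Average the per-threshold percentages over all datasets."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical example with two datasets and three algorithms.
dataset_1 = performance_at_thresholds([2, 1, 3], [0.5, 1.0, 0.25])   # [100.0, 50.0, 25.0]
dataset_2 = performance_at_thresholds([1, 2, 3], [0.25, 0.5, 0.125])  # [50.0, 100.0, 25.0]
impact = average_over_datasets([dataset_1, dataset_2])
# impact == [75.0, 75.0, 25.0]
```

A metamodel that always places the best baselearner first would reach 100% at the first threshold.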

The results show that while CM is better for in Item Recommendation, for all remaining thresholds, cf2vec and CM have the same performance. However, in Rating Prediction, cf2vec is better than CM for and equal for . These results show that cf2vec obtains a comparable (and even higher) baselevel performance.

RQ 5. Is there a clear relation between metafeatures (either CM or cf2vec representations) and CF algorithm performance?

Considering the similarity in predictive performance for both types of metafeatures in all previous analyses, it is important to understand whether there are clear relationships between them and the metatargets. These relations can potentially explain the results obtained.

The literature on distributed representations often refers to t-SNE (Van Der Maaten and Hinton, 2008) to explore high-dimensional datasets. However, two limitations make this technique less than ideal in our setup: 1) the procedure is stochastic, hence the representations are not static and 2) the process, being specially designed for a large amount of data points, is not suited to our problem, where only 38 data points exist. Hence, PCA was used to visualize the high-dimensional metafeatures in a two-dimensional map. To enrich the results, the ranking of baselearners for each dataset is shown using a colour gradient which is assigned based on metatarget similarity. This highlights clear patterns between metafeatures and metatargets: if similar (or the same) metatargets are assigned to two similar datasets (placed near one another), there is a clear pattern. Figures 13 and 14 illustrate the results.

Figure 13: PCA visualization for Item Recommendation problem.
Figure 14: PCA visualization for Rating Prediction problem.

The results show that both metafeatures work well in two cases:

  • Same domain and similar metatargets: most datasets from the same domain have clearly visible patterns in the mappings between metafeatures and metatargets. This occurs for the AMZ and JT domains.

  • Different domains but similar metatargets: some datasets from different domains, and sharing similar metatargets, are close to each other. This happens for the BC and FL datasets in Item Recommendation and for the YE and FL datasets in Rating Prediction.

The previous observations refer to the easily predictable meta-instances. The fact that both types of metafeatures are able to properly map the instances together is a good reason to explain why they perform well for the majority of datasets. However, some problems were found:

  • Anomalies: some points are close to others without any apparent reason. This occurs for the TA dataset in both CF problems and for the YE dataset in the Item Recommendation problem. This may have occurred because the current metafeatures are not good enough to characterize these datasets.

  • Same domain but different metatargets: some datasets from the same domain appear close, yet their rankings are significantly different. This occurs for the ML, YH and MT datasets. A possible reason is the difficulty the metamodel has in correctly predicting the rankings of algorithms. This difficulty can potentially be reduced by tuning the metalearner hyperparameters and by choosing metalearners with different bias.
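
The two problem cases above were identified by visual inspection. As an illustration only, a similar check can be sketched programmatically: flag pairs of datasets that lie close together in the two-dimensional map but whose baselearner rankings disagree. Kendall's tau is used here as one possible rank-agreement measure, and the thresholds `max_dist` and `min_tau`, the dataset names and the coordinates are all hypothetical.

```python
import math
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same algorithms (no ties assumed)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(rank_a) * (len(rank_a) - 1) / 2
    return (concordant - discordant) / pairs


def inconsistent_neighbours(coords, rankings, max_dist, min_tau):
    """Return dataset pairs that are close in the map but whose rankings disagree."""
    flagged = []
    for a, b in combinations(sorted(coords), 2):
        dist = math.dist(coords[a], coords[b])
        tau = kendall_tau(rankings[a], rankings[b])
        if dist <= max_dist and tau < min_tau:
            flagged.append((a, b, round(dist, 3), round(tau, 3)))
    return flagged


# Hypothetical 2D coordinates and baselearner rankings (1 = best) for four datasets.
coords = {"D1": (0.0, 0.0), "D2": (0.2, 0.1), "D3": (4.0, 3.5), "D4": (0.1, 0.3)}
rankings = {
    "D1": [1, 2, 3, 4],
    "D2": [4, 3, 2, 1],  # close to D1 but with the opposite ranking
    "D3": [1, 2, 3, 4],
    "D4": [1, 2, 4, 3],  # close to D1 with a nearly identical ranking
}
for pair in inconsistent_neighbours(coords, rankings, max_dist=0.5, min_tau=0.0):
    print(pair)
```

Pairs reported by such a check correspond to the situations where neighbouring meta-instances carry conflicting metatargets, which are the cases a metalearner is most likely to mispredict.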

The low occurrence of these problems can explain the high predictive performance obtained. However, their occurrence points out the need for further studies. One important issue is the need to find a larger and more diverse set of datasets and baselearners to complement such observations.

Finally, although cf2vec shows interesting and useful patterns, CM seems to be generally better at mapping the difficult problems. This may be the missing indicator that justifies the differences in predictive performance. Therefore, although the hypothesis of using distributed representations as alternatives to metafeatures is validated, it is clear that other techniques may be needed to surpass the performance of hand-designed metafeatures specially designed for this empirical setup. Nevertheless, the research direction presented points to promising new algorithm selection solutions in CF and other ML tasks.

6 Conclusions

This paper introduced a novel technique for CF metafeature design: cf2vec. cf2vec adapts graph2vec, a known distributed representation technique, to the context of CF algorithm selection. To do so, the procedure converts CF datasets into graphs, reduces the problem complexity via graph sampling, learns distributed representations and uses them as alternative metafeatures. The experiments carried out show that cf2vec is competitive with the state-of-the-art collection of CF metafeatures, with no statistically significant differences. The main advantages of cf2vec are: the metafeatures are automatically generated with virtually no human intervention, the process can be tuned to adjust the metafeatures to the experimental setup, and the procedure reaches the same performance as the state-of-the-art while requiring a smaller amount of data. However, cf2vec also has limitations. These are discussed next, along with suggestions to deal with them:

  • Representation Learning: this work used graph2vec as the main procedure to learn the distributed representations. The choice was supported by its theoretical applicability to this work and motivated by existing related work on CF graph metafeatures (Cunha et al, 2018a). However, other techniques can be considered, such as Autoencoders (for instance, adapting Image Processing techniques using Convolutional Neural Networks (LeCun et al, 1990) to the CF domain, given the similarity between images and rating matrices) and RL techniques specially designed for CF (Sedhain et al, 2015; Strub et al, 2016; Wu et al, 2016) (but adapted to describe the whole dataset rather than each user; alternatively, these representations can be used directly to perform algorithm selection at the user level). Despite the clear motivation to use either, neither is yet ready to be applied to the general CF algorithm selection problem, and both require modifications.

  • Hyperparameter Tuning: according to the experimental results, the proposed technique is sensitive to the value of some hyperparameters. Hence, the metafeature extraction process requires training multiple graph2vec models to find the best one. Although this study tried to indicate the best hyperparameter values (which may be used as default), it is essential to understand that a different experimental setup may require a different hyperparameter setting to achieve optimal results.

  • Predictive Performance: the presented experimental setup, although extensive in nature, may still be the reason why the proposed metafeatures are not significantly better than the state-of-the-art. Factors such as an insufficient number of datasets and baselearners, imbalanced data (not enough meta-examples for all metatargets considered) and lack of hyperparameter optimisation in the baselearners may influence the experimental results. However, the authors would like to acknowledge how difficult it still is to obtain a more suitable experimental setup in the CF domain.

  • Metafeature Importance: most previous works in CF algorithm selection have investigated which metafeatures are the most relevant. Although those metafeatures became increasingly complex and harder to interpret, they still hinted at which data properties were important. In this case, since the metafeatures created are latent, it is impossible to perform the same analysis. Therefore, until procedures able to extract meaning from latent features surface, the analysis is limited to the one presented in this work.
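
To make the first two steps of the procedure summarized at the start of this section more concrete (converting a CF dataset into a graph and reducing it by sampling), the following is a minimal sketch under stated assumptions. It models ratings as an undirected bipartite user-item graph and samples it with a random walk with restarts; the sampling strategy, the `restart_prob` value and the toy ratings are assumptions made for illustration and are not taken from the experimental setup. Learning the embedding itself (graph2vec) is not shown.

```python
import random
from collections import defaultdict


def ratings_to_graph(ratings):
    """Build an undirected bipartite graph: user and item nodes joined by rating edges."""
    adj = defaultdict(dict)
    for user, item, rating in ratings:
        u, i = ("u", user), ("i", item)  # prefixes keep user and item ids apart
        adj[u][i] = rating
        adj[i][u] = rating
    return adj


def random_walk_sample(adj, target_nodes, restart_prob=0.15, seed=0, max_steps=100_000):
    """Collect nodes with a random walk with restarts; return the induced subgraph."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    start = rng.choice(nodes)
    current, visited = start, {start}
    goal = min(target_nodes, len(nodes))
    steps = 0
    # max_steps guards against walks trapped in a small connected component.
    while len(visited) < goal and steps < max_steps:
        steps += 1
        if rng.random() < restart_prob:
            current = start
        else:
            current = rng.choice(sorted(adj[current]))
        visited.add(current)
    # Keep only the edges whose two endpoints were both visited.
    return {n: {m: r for m, r in adj[n].items() if m in visited} for n in visited}


# Hypothetical (user, item, rating) triples.
ratings = [
    (1, "a", 5), (1, "b", 3),
    (2, "b", 4), (2, "c", 2),
    (3, "c", 1), (3, "a", 4),
]
graph = ratings_to_graph(ratings)
sample = random_walk_sample(graph, target_nodes=4)
print(len(graph), "nodes in the full graph,", len(sample), "nodes in the sample")
```

The sampled subgraph is what would then be passed to the representation learner, so the sample size is one of the knobs that trades data volume against the quality of the resulting metafeatures.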


References

  • Adomavicius and Zhang (2012) Adomavicius G, Zhang J (2012) Impact of data characteristics on recommender systems performance. ACM Trans Manag Inf Syst 3(1):1–17
  • Amasyali and Ersoy (2009) Amasyali F, Ersoy O (2009) A study of meta learning for regression. Tech. rep., Purdue University
  • Bengio (2011) Bengio Y (2011) Deep Learning of Representations for Unsupervised and Transfer Learning. JMLR: Workshop and Conference 7:1–20

  • Bengio et al (2013) Bengio Y, Courville A, Vincent P (2013) Representation Learning: Review and New Perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
  • Bergstra and Bengio (2012) Bergstra J, Bengio Y (2012) Random Search for Hyper-Parameter Optimization. JMLR 13:281–305
  • Bourlard and Kamp (1988) Bourlard H, Kamp Y (1988) Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics 59(4):291–294

  • Brazdil and Soares (2000) Brazdil P, Soares C (2000) A Comparison of Ranking Methods for Classification Algorithm Selection. In: de Mántaras R, Plaza E (eds) Machine Learning: ECML 2000, Springer Berlin Heidelberg, pp 63–75
  • Brazdil et al (2003) Brazdil P, Soares C, Costa J (2003) Ranking Learning Algorithms: Using IBL and Meta-Learning on Accuracy and Time. Machine Learning 50(3):251–277
  • Brazdil et al (2009) Brazdil P, Giraud-Carrier C, Soares C, Vilalta R (2009) Metalearning: Applications to Data Mining, 1st edn. Springer Publishing
  • Collins et al (2018) Collins A, Beel J, Tkaczyk D (2018) One-at-a-time: A Meta-Learning Recommender-System for Recommendation. ArXiv e-prints arXiv:1805.12118
  • Cunha et al (2016) Cunha T, Soares C, Carvalho A (2016) Selecting Collaborative Filtering algorithms using Metalearning. In: ECML-PKDD, pp 393–409
  • Cunha et al (2017a) Cunha T, Soares C, Carvalho A (2017a) Metalearning for Context-aware Filtering: Selection of Tensor Factorization Alg. In: ACM RecSys, pp 14–22

  • Cunha et al (2017b) Cunha T, Soares C, Carvalho A (2017b) Recommending Collaborative Filtering algorithms using landmarkers. In: Discovery Science, pp 189–203
  • Cunha et al (2018a) Cunha T, Soares C, Carvalho A (2018a) Algorithm Selection for Collaborative Filtering: the influence of graph metafeatures and multicriteria metatargets. ArXiv e-prints pp 1–25, arXiv:1807.09097
  • Cunha et al (2018b) Cunha T, Soares C, Carvalho A (2018b) Metalearning and Recommender Systems: A literature review and empirical study on the algorithm selection problem for Collaborative Filtering. Information Sciences 423:128–144
  • Demšar (2006) Demšar J (2006) Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 7:1–30
  • Dooms et al (2013) Dooms S, De Pessemier T, Martens L (2013) MovieTweetings: a Movie Rating Dataset Collected From Twitter. In: CrowdRec at ACM RecSys
  • Ekstrand and Riedl (2012) Ekstrand M, Riedl J (2012) When Recommenders Fail: Predicting Recommender Failure for Algorithm Selection. ACM RecSys pp 233–236
  • Gama and Brazdil (1995) Gama J, Brazdil P (1995) Characterization of Classification Algorithms. Lecture Notes in Computer Science 990:189–200
  • Gantner et al (2011) Gantner Z, Rendle S, Freudenthaler C, Schmidt-Thieme L (2011) MyMediaLite: A Free Recommender System Library. In: ACM RecSys, pp 305–308
  • Goldberg et al (2001) Goldberg K, Roeder T, Gupta D, Perkins C (2001) Eigentaste: A Constant Time Collaborative Filtering Algorithm. Information Retrieval 4(2):133–151
  • Goodfellow et al (2016) Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press
  • Griffith et al (2012) Griffith J, O’Riordan C, Sorensen H (2012) Investigations into rating information and accuracy in collaborative filtering. In: ACM SAC, pp 937–942
  • GroupLens (2016) GroupLens (2016) MovieLens. URL
  • He et al (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE Comput Soc Conf Comput Vis Pattern Recognit, pp 770–778

  • Hu et al (2008) Hu Y, Koren Y, Volinsky C (2008) Collaborative Filtering for Implicit Feedback Datasets. In: IEEE Int. Conf. on Data Mining, pp 263–272
  • Hüllermeier et al (2008) Hüllermeier E, Fürnkranz J, Cheng W, Brinker K (2008) Label ranking by learning pairwise preferences. Artificial Intelligence 172(16-17):1897–1916

  • Kalousis and Hilario (2001) Kalousis A, Hilario M (2001) Feature Selection for Meta-learning. In: Advances in Knowledge Discovery and Data Mining, pp 222–233
  • Koren (2008) Koren Y (2008) Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: ACM SIGKDD, pp 426–434, 62
  • Koren (2010) Koren Y (2010) Factor in the Neighbors: Scalable and Accurate Collaborative Filtering. ACM Trans Knowl Discov Data 4(1):1–24
  • Le and Mikolov (2014) Le Q, Mikolov T (2014) Distributed Representations of Sentences and Documents. In: Int. Conf. on Machine Learning, pp II-1188–II-1196
  • Lecun (1987) Lecun Y (1987) PhD thesis: Modeles connexionnistes de l’apprentissage (connectionist learning models). Universite P. et M. Curie (Paris 6)
  • LeCun et al (1990) LeCun Y, Boser B, Denker J, Henderson D, Howard R, Hubbard W, Jackel L (1990) Handwritten Digit Recognition with Back-Propagation Network. In: Adv. in Neural Information Processing Sys., Morgan Kaufmann, pp 396–404
  • Lecun et al (2015) Lecun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
  • Leskovec and Faloutsos (2006) Leskovec J, Faloutsos C (2006) Sampling from large graphs. In: ACM SIGKDD, pp 631–636
  • Matuszyk and Spiliopoulou (2014) Matuszyk P, Spiliopoulou M (2014) Predicting the Performance of Collaborative Filtering. In: Web Intelligence, Mining and Semantics, pp 38:1–6
  • McAuley and Leskovec (2013) McAuley J, Leskovec J (2013) Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text. In: ACM RecSys, pp 165–172
  • Menon and Elkan (2010) Menon AK, Elkan C (2010) A log-linear model with latent features for dyadic prediction. In: ICDM, pp 364–373
  • Mikolov et al (2013) Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient Estimation of Word Representations in Vector Space. ArXiv e-prints pp 1–12, arXiv:1301.3781
  • Narayanan et al (2017) Narayanan A, Chandramohan M, Venkatesan R, Chen L, Liu Y, Jaiswal S (2017) graph2vec: Learning Distributed Representations of Graphs. ArXiv e-prints pp 1–8, arXiv:1707.05005
  • Paterek (2007) Paterek A (2007) Improving regularized singular value decomposition for collaborative filtering. In: KDD cup and workshop, pp 2–5
  • Pinto et al (2016) Pinto F, Soares C, Mendes-Moreira J (2016) Towards automatic generation of Metafeatures. In: PAKDD, pp 215–226
  • Prudêncio and Ludermir (2004) Prudêncio RBC, Ludermir TB (2004) Meta-learning approaches to selecting time series models. Neurocomputing 61:121–137
  • Rendle et al (2009) Rendle S, Freudenthaler C, Gantner Z, Schmidt-Thieme L (2009) BPR: Bayesian Personalized Ranking from Implicit Feedback. In: Conference on Uncertainty in Artificial Intelligence, pp 452–461
  • Rice (1976) Rice J (1976) The Algorithm Selection Problem. Adv in Computers 15:65–118
  • Rumelhart et al (1986) Rumelhart DE, McClelland JL, PDP Research Group C (eds) (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, Cambridge, MA, USA
  • Salakhutdinov and Mnih (2008) Salakhutdinov R, Mnih A (2008) Probabilistic Matrix Factorization. In: Advances in Neural Information Processing Systems, pp 1257–1264
  • Schmidhuber (2015) Schmidhuber J (2015) Deep learning in neural networks: An overview. Neural Networks 61:85–117
  • Sedhain et al (2015) Sedhain S, Menon AK, Sanner S, Xie L (2015) AutoRec: Autoencoders Meet Collaborative Filtering. In: WWW, pp 111–112
  • Shervashidze et al (2011) Shervashidze N, Schweitzer P, van Leeuwen EJ, Mehlhorn K, Borgwardt KM (2011) Weisfeiler-Lehman Graph Kernels. JMLR 12:2539–2561
  • Smith-Miles (2008) Smith-Miles K (2008) Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Comput Surv 41(1):6:1–6:25
  • Soares (2015) Soares C (2015) labelrank: Predicting Rankings of Labels. URL
  • Strub et al (2016) Strub F, Mary J, Gaudel R (2016) Hybrid Recommender System based on Autoencoders. ArXiv e-prints arXiv:1606.07659
  • Van Der Maaten and Hinton (2008) Van Der Maaten LJP, Hinton GE (2008) Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research 9:2579–2605

  • Vanschoren (2010) Vanschoren J (2010) Understanding machine learning performance with experiment databases. PhD thesis, Katholieke Universiteit Leuven
  • Vembu and Gärtner (2010) Vembu S, Gärtner T (2010) Label ranking algorithms: A survey. In: Preference Learning, Springer-Verlag, pp 45–64
  • Wang et al (2011) Wang H, Lu Y, Zhai C (2011) Latent Aspect Rating Analysis Without Aspect Keyword Supervision. In: ACM SIGKDD, pp 618–626
  • Weimer et al (2008) Weimer M, Karatzoglou A, Smola A (2008) Improving Maximum Margin Matrix Factorization. Machine Learning 72(3):263–276
  • Wu et al (2016) Wu Y, DuBois C, Zheng AX, Ester M (2016) Collaborative Denoising Auto-Encoders for Top-N Recommender Systems. In: WSDM, pp 153–162
  • Yahoo! (2016) Yahoo! (2016) Webscope datasets. URL
  • Yelp (2016) Yelp (2016) Yelp Dataset. URL
  • Zafarani and Liu (2009) Zafarani R, Liu H (2009) Social computing data repository at ASU. URL
  • Ziegler et al (2005) Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving Recommendation Lists Through Topic Diversification. In: WWW, pp 22–32