Graph recommender system
This work formulates a novel song recommender system as a matrix completion problem that benefits from collaborative filtering through Non-negative Matrix Factorization (NMF) and content-based filtering via total variation (TV) on graphs. The graphs encode both playlist proximity information and song similarity, using a rich combination of audio, meta-data and social features. As we demonstrate, our hybrid recommendation system is very versatile and incorporates several well-known methods while outperforming them. In particular, we show on real-world data that, with respect to two evaluation metrics, our model outperforms models based solely on low-rank information, on graph-based information, or on a combination of both.
Recommending movies on Netflix, friends on Facebook, or jobs on LinkedIn are tasks that have attracted increasing interest in recent years. Low-rank matrix factorization techniques were among the winners of the famous Netflix prize, which involved explicit user ratings as input. Similar techniques were soon used to solve implicit feedback problems, where item preferences are implied, for example, by the actions of a user [2, 3]. Regarding song and playlist recommendation specifically, various techniques have been proposed, ranging from pure content-based methods to hybrid models. A comprehensive review of related algorithms can be found in [6, 7]. Recently, graph regularization was proposed in order to enhance the quality of matrix completion [8, 9, 10].
The contributions of this paper are as follows:
A mathematically sound hybrid system that benefits from collaborative and content-based filtering.
A well-defined iterative optimization scheme based on proximal splitting methods.
Numerical experiments that demonstrate the performance of the proposed recommender system.
Suppose we are given $n$ playlists, each containing some of $m$ songs. We define a matrix $C \in \{0,1\}^{n \times m}$ as in [13, 3], with $C_{ij} = 1$ if playlist $i$ contains song $j$, and $C_{ij} = 0$ otherwise. We also define a weight mask $\Omega \in \{\epsilon, 1\}^{n \times m}$ that takes a “confidence” value of one where $C_{ij} = 1$, and a small value $\epsilon$ otherwise. This follows the example of implicit feedback problems, since a zero in matrix $C$ does not mean that the corresponding song is irrelevant to the playlist, but that it is less likely to be relevant.
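The construction of the binary playlist-song matrix and its confidence mask can be sketched as follows. This is a minimal illustration in plain Python; the function name `build_matrices`, the toy playlists and the value 0.1 for the small confidence are assumptions made for the example, not values taken from the text.

```python
def build_matrices(playlists, n_songs, eps):
    """Return (C, Omega) as lists of rows.

    playlists: list of iterables of song indices in range(n_songs).
    eps: small confidence given to unobserved entries (illustrative value).
    """
    C, Omega = [], []
    for songs in playlists:
        members = set(songs)
        # Binary membership row: 1 if the playlist contains the song.
        row = [1 if j in members else 0 for j in range(n_songs)]
        C.append(row)
        # Confidence 1 for observed entries, eps for the zeros.
        Omega.append([1.0 if c == 1 else eps for c in row])
    return C, Omega


# Two toy playlists over four songs.
playlists = [[0, 2], [1, 2, 3]]
C, Omega = build_matrices(playlists, n_songs=4, eps=0.1)
```

Keeping the mask separate from the data matrix lets the same fitting code treat the zeros as weak evidence rather than as hard negatives.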
The goal of the training step is to find an approximate low-rank representation $AB \approx C$, where $A \in \mathbb{R}_+^{n \times r}$ and $B \in \mathbb{R}_+^{r \times m}$ are non-negative and the rank $r$ is small. This problem is known as Non-negative Matrix Factorization (NMF) and has drawn a lot of attention since the seminal work on the topic. The advantage of NMF over other factorization techniques is that the approximation is based only on adding factors, a property explained as learning the parts of objects, in this case the playlists. NMF comes at the cost of being NP-hard, so sophisticated regularization is important for finding a good local minimum. In our problem we use outside information given by the songs and playlists graphs to give structure to the factors $A$ and $B$. Our model is formulated as
$$\min_{A, B \geq 0} \; \mathrm{KL}\big(\Omega \circ (C \,\|\, AB)\big) + \theta_A \|A\|_{\mathrm{TV}\,\mathcal{G}_A} + \theta_B \|B\|_{\mathrm{TV}\,\mathcal{G}_B}, \qquad (1)$$
where $\circ$ is the pointwise multiplication operator and $\theta_A, \theta_B \in \mathbb{R}_+$. We use a weighted Kullback-Leibler (KL) divergence as a distance measure between $C$ and $AB$, which has been shown to be more accurate than the Frobenius norm for various NMF settings. The second term is the TV of the rows of $A$ on the playlists graph, so penalizing it promotes piecewise constant signals. The third term plays the same role for the columns of $B$ on the songs graph. Overall, the proposed model leverages the works of [9, 16] and extends them to graphs using the TV semi-norm.
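To make the data-fitting term concrete, the sketch below evaluates a weighted KL divergence between a binary matrix and a low-rank product. It assumes the generalized KL form commonly used for NMF, namely the sum of w * (c log(c/x) - c + x) with the convention 0 log 0 = 0. The helper names and the toy matrices are hypothetical.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of A (n x r) and B (r x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def weighted_kl(C, X, Omega):
    """Weighted generalized KL divergence between C and X (assumed form).

    Entries with c == 0 contribute only w * x (convention 0 * log 0 = 0).
    X must be strictly positive wherever C is positive.
    """
    total = 0.0
    for c_row, x_row, w_row in zip(C, X, Omega):
        for c, x, w in zip(c_row, x_row, w_row):
            log_term = c * math.log(c / x) if c > 0 else 0.0
            total += w * (log_term - c + x)
    return total


# Toy example: 2 playlists, 2 songs, rank 1.
C = [[1, 0], [1, 1]]
Omega = [[1.0, 0.1], [1.0, 1.0]]
A = [[1.0], [1.0]]
B = [[1.0, 0.5]]
loss = weighted_kl(C, matmul(A, B), Omega)
```

In this toy case the unobserved entry is predicted as 0.5 but is only lightly penalized because of its small confidence, while the under-predicted observed entry dominates the loss.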
Graph Regularization with Total Variation. In our NMF-based recommender, each playlist $i$ is represented in a low-dimensional space by a row $A_i$ of the matrix $A$. In order to learn better low-rank representations of the playlists, we also impose the pairwise similarities of the playlists on their corresponding low-rank representations. We can see this from the definition of the TV regularization term, $\|A\|_{\mathrm{TV}\,\mathcal{G}_A} = \sum_{i} \sum_{i' \sim i} \omega^A_{ii'} \|A_i - A_{i'}\|_1$. Hence, when two playlists $i$ and $i'$ are similar, they are well connected on the graph and the weight $\omega^A_{ii'}$ of the edge connecting them is large. Any large distance between the corresponding low-dimensional representations $(A_i, A_{i'})$ is then penalized, forcing them to stay close in the low-dimensional space. In a similar way, each song $j$ is represented in a low-dimensional space by a column $B_j$ of the matrix $B$. If two songs $(j, j')$ are close on the songs graph, so will be $(B_j, B_{j'})$ under the graph regularization $\|B\|_{\mathrm{TV}\,\mathcal{G}_B}$.
A similar idea has been used before by incorporating the graph information through Tikhonov regularization, i.e. with the Dirichlet energy term $\frac{1}{2} \sum_{j} \sum_{j' \sim j} \omega^B_{jj'} \|B_j - B_{j'}\|_2^2$. However, the latter promotes smooth changes between the columns of $B$, while penalizing the graph TV term promotes piecewise constant signals with potentially sharp transitions between columns $B_j$ and $B_{j'}$. This is advantageous in applications where well-separated classes are sought, for example in clustering, or in our recommendation system, where similar playlists might belong to different categories.
As we demonstrate in Sec. 4, the use of the graphs of songs and playlists significantly improves the recommendations, and the results are better when the more forgiving TV term is used instead of Tikhonov regularization.
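The difference between the two regularizers can be illustrated on a tiny example. The sketch below compares graph TV and Dirichlet energy on a path graph with four nodes, for a signal with one sharp jump and for a smooth ramp with the same end points. It assumes each undirected edge is listed once with unit weight; the function names and the toy signals are hypothetical.

```python
def graph_tv(X, edges):
    """Graph total variation: sum over edges of w * ||X[i] - X[j]||_1."""
    return sum(w * sum(abs(a - b) for a, b in zip(X[i], X[j])) for i, j, w in edges)


def dirichlet_energy(X, edges):
    """Dirichlet energy: half the sum over edges of w * ||X[i] - X[j]||_2^2."""
    return 0.5 * sum(w * sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for i, j, w in edges)


# Path graph 0 - 1 - 2 - 3 with unit weights, one row per node.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
sharp = [[0.0], [0.0], [1.0], [1.0]]          # piecewise constant, one jump
ramp = [[0.0], [1.0 / 3], [2.0 / 3], [1.0]]   # smooth transition

tv_sharp, tv_ramp = graph_tv(sharp, edges), graph_tv(ramp, edges)
dir_sharp, dir_ramp = dirichlet_energy(sharp, edges), dirichlet_energy(ramp, edges)
```

Both signals have the same TV, so the TV penalty does not discourage the sharp jump, whereas the Dirichlet energy is three times larger for the jump than for the ramp and therefore favors smoothing across the class boundary.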
Primal-dual optimization. Optimization problem (1) is globally non-convex, but convex with respect to each of the two factors when the other is held fixed. A standard strategy is thus to optimize one factor with the other fixed, then swap their roles, and repeat until convergence. We describe here the proposed optimization algorithm for one factor with the other held fixed, based on [18, 12, 16]. The same algorithm can be applied to the other factor with the roles exchanged. Let us rewrite problem (1) as:
where , . Let us now introduce the proximal terms and the time steps , :
The iterative scheme is thus for :
where prox is the proximal operator  and . For our problem we have chosen the standard Arrow-Hurwicz time steps and , where is here the operator norm.
where shrink is the soft shrinkage operator. Note that the same algorithm could be used for Tikhonov regularization, i.e. replacing the graph TV term by the Dirichlet energy, by just changing the first proximal operator (10). Previously, this regularization has been used along with a symmetric version of the KL divergence; however, the latter has no analytic solution, unlike the one we use in this work. As a result, that objective function does not fit an efficient primal-dual optimization scheme like the one we propose. We thus choose to keep the non-symmetric KL model, denoted as GNMF in this paper, in order to compare TV against Tikhonov regularization.
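The soft shrinkage operator mentioned above is the standard soft-thresholding map, i.e. the proximal operator of a scaled absolute value. A minimal sketch is given below; the function names and the test values are chosen for illustration only, and how the operator is plugged into the full iteration is not shown here.

```python
def shrink(x, threshold):
    """Soft shrinkage: move x toward zero by `threshold`, clamping at zero."""
    if x > threshold:
        return x - threshold
    if x < -threshold:
        return x + threshold
    return 0.0


def shrink_all(values, threshold):
    """Apply soft shrinkage elementwise to a list of numbers."""
    return [shrink(v, threshold) for v in values]


shrunk = shrink_all([3.0, -0.5, -3.0, 1.0], 1.0)
```

Values whose magnitude does not exceed the threshold are set exactly to zero, which is what produces sparse differences and hence piecewise constant signals under TV regularization.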
Recommending songs. Once we have learned the two factor matrices by solving (1), we wish to recommend a new playlist given a few songs (see Fig. 1). We also want to make real-time recommendations, so we design a fast recommender function as follows:
Given the songs , we first find a good representation of the query on the learned low-rank space of playlists by solving a regularized least squares problem:
The latter enjoys an analytic solution that is cheap to compute, as the rank of the factorization is small.
The recommended playlist can benefit from the playlists that have similar representations as the one of the query, thus we use the weighted sum as the representation of the recommended playlist in the low dimensional space. Here the weights are defined as and depend on the distance of from other playlists representations, while . The final recommended playlist uses the low-rank representation :
Note finally that the recommended playlist is not binary, but has continuous values that serve as song rankings.
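The three steps above (fit the query in the low-rank space, average nearby playlist representations, map back to song scores) can be sketched as follows. This is only an illustration under stated assumptions: the query is fitted by ridge regression on the query songs alone, the weights are a normalized Gaussian-type kernel of the squared distance, and the values of `lam` and `sigma`, as well as all function names, are hypothetical rather than the exact formulas of the method.

```python
import math


def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def recommend(query_songs, A, B, lam=0.1, sigma=1.0):
    """Return (song scores, playlist weights) for a list of query song indices."""
    r = len(B)
    # Step 1: ridge fit of the query (target 1 on each query song), r x r system.
    gram = [[sum(B[p][j] * B[q][j] for j in query_songs) + (lam if p == q else 0.0)
             for q in range(r)] for p in range(r)]
    rhs = [sum(B[p][j] for j in query_songs) for p in range(r)]
    a_query = solve(gram, rhs)
    # Step 2: weights that decay with distance to each learned playlist, summing to 1.
    dist2 = [sum((x - y) ** 2 for x, y in zip(a_query, row)) for row in A]
    raw = [math.exp(-d / sigma) for d in dist2]
    total = sum(raw)
    weights = [w / total for w in raw]
    a_rec = [sum(w * row[p] for w, row in zip(weights, A)) for p in range(r)]
    # Step 3: map the averaged representation back to continuous song scores.
    scores = [sum(a_rec[p] * B[p][j] for p in range(r)) for j in range(len(B[0]))]
    return scores, weights


# Toy factors: two playlist types, four songs (songs 0-1 vs songs 2-3).
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
scores, weights = recommend([0], A, B)
```

Because the linear system has only as many unknowns as the rank, the fit stays cheap even for a large song catalogue, which is what makes real-time queries plausible.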
Playlists Graph. The playlists graph naturally encodes pairwise similarities between playlists. The set of nodes of this graph is the set of playlists, and the edge weight measures the proximity between two playlists. A large weight implies a strong proximity between the playlists. In this work, the edge weight of the playlists graph uses both “outside” information, i.e. the meta-data, and “inside” information, i.e. the songs that form the playlists. As meta-data, we use the predefined Art of the Mix playlist categories with which users label their mixes. The edge weight of the playlists graph is thus defined as follows:
$$\omega^A_{ii'} = \gamma_1 \, \delta_{\mathrm{cat}_i = \mathrm{cat}_{i'}} + \gamma_2 \, \mathrm{sim}_{\cos}(C_i, C_{i'}),$$
where $\mathrm{cat}$ stands for playlist category, $C_i$ is the $i$-th row of matrix $C$ and $\mathrm{sim}_{\cos}(C_i, C_{i'}) = \frac{C_i^\top C_{i'}}{\|C_i\| \, \|C_{i'}\|}$ is the cosine similarity between the song vectors of the two playlists. In our case, the cosine similarity is the ratio between the number of songs in common and the square root of the product of the lengths of the two playlists. The two positive parameters $\gamma_1, \gamma_2$ allow us to weight the importance of the playlist labels against their element-wise similarity. To control the edge density in each category and to give more flexibility to our recommendation model, we keep only a random subset of the edges between nodes of the same category. The fraction that constitutes a good compromise is found experimentally, see Sec. 4.
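For binary playlist vectors, the cosine similarity described above reduces to a simple set computation. The sketch below implements it and combines it with a same-category indicator. The combination as a weighted sum, the parameter names `gamma_cat` and `gamma_cos`, and their values are assumptions for illustration; the random subsampling of same-category edges is omitted.

```python
import math


def cosine_similarity(songs_a, songs_b):
    """Songs in common divided by the square root of the product of the lengths."""
    a, b = set(songs_a), set(songs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def edge_weight(cat_a, cat_b, songs_a, songs_b, gamma_cat, gamma_cos):
    """Assumed form: weighted sum of a category match and the cosine similarity."""
    same_category = 1.0 if cat_a == cat_b else 0.0
    return gamma_cat * same_category + gamma_cos * cosine_similarity(songs_a, songs_b)


# Two playlists sharing 2 songs, of lengths 3 and 4.
sim = cosine_similarity([1, 2, 3], [2, 3, 4, 5])
w = edge_weight("rock", "rock", [1, 2, 3], [2, 3, 4, 5], gamma_cat=0.3, gamma_cos=0.7)
```

With this form, two playlists of the same category stay connected even when they share no songs, which is how the meta-data can compensate for very sparse playlists.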
The quality of the playlist graph is measured by partitioning the graph using the standard Louvain method. The number of partitions is given automatically by cutting the modularity dendrogram where the modularity is maximal. For the graph used in Sec. 4, adding the meta-data information, by connecting a fraction of the playlist pairs within each category, increases the modularity over that obtained with the cosine similarity only.
The second graph used in our model is the graph of song similarity. It is created from a mixture of Echonest features extracted from the audio signal, which we combine with meta-data information and social features of the track. Table 1 gives an overview of the features used to create the song graph.
| Feature | Description |
| --- | --- |
| High Level Features | |
| acousticness | Acoustic or electric? |
| valence | Is the song positive or negative? |
| energy | How energetic is the song? |
| liveness | Is it a “live” recording? |
| speechiness | How many spoken words? |
| danceability | Is the song danceable? |
| instrumentalness | Is the song instrumental? |
| artist discovery | How unexpectedly popular is the artist? |
| artist familiarity | How familiar is the artist? |
| artist hotttnesss | Is the artist currently popular? |
| song hotttnesss | Is the song currently popular? |
| song currency | How recently has it become popular? |
| Temporal Echonest Features | |
| statistics on echonest segments | Described in |
| genre | ID3 genre extracted from tags given by the LastFM API |
In order to improve the quality of our audio features, we trained a Large Margin Nearest Neighbors model on the song genres extracted from the LastFm associated terms (tags). To extract real music genres, we use the Levenshtein distance between those terms, weighted by their popularity (according to LastFm), and the music genres defined in the ID3 tags.
Finally, the songs graph is created using the $k$ nearest neighbors (here $k = 5$) of each song, with an edge weight that depends on the distance between the two songs. The scale parameter of the graph is set to the average distance of the neighbors. The obtained graph has a high modularity and is quite pure with respect to song genres, with around 65% accuracy using an unsupervised
In this section we validate our approach by comparing our model against three different recommender systems on a real-world dataset. Our test dataset is extracted from the Art-of-the-Mix corpus created by McFee et al., from which we extract the previously described features.
Assessing the quality of a music recommender system is well known to be a challenging problem. In this work, we use a typical metric for recommender systems with implicit feedback, the Mean Percentage Ranking (MPR), together with the playlist category accuracy, that is, the percentage of the recommended songs that have already been used in past playlists from the requested category.
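As an illustration of the first metric, the sketch below computes a Mean Percentage Ranking in its usual implicit-feedback form: each relevant song receives a percentile rank (0% for the top-ranked song, 100% for the last), and these ranks are averaged over all relevant songs. Lower values are better, and random rankings give about 50%. The function names and toy scores are hypothetical, and binary relevance is assumed.

```python
def percentile_ranks(scores):
    """Map each item to its percentile position in the descending score order."""
    m = len(scores)
    order = sorted(range(m), key=lambda j: -scores[j])
    ranks = [0.0] * m
    for position, j in enumerate(order):
        ranks[j] = position / (m - 1) if m > 1 else 0.0
    return ranks


def mean_percentage_ranking(score_lists, relevant_lists):
    """Average percentile rank (in percent) of the relevant items over all queries."""
    total, count = 0.0, 0
    for scores, relevant in zip(score_lists, relevant_lists):
        ranks = percentile_ranks(scores)
        for j in relevant:
            total += ranks[j]
            count += 1
    return 100.0 * total / count


# One query over four songs; songs 0 and 2 are the relevant ones.
mpr = mean_percentage_ranking([[0.9, 0.1, 0.5, 0.3]], [[0, 2]])
```

Here the relevant songs are ranked first and second out of four, so their percentile ranks are 0 and 1/3, giving an MPR of about 16.7%.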
Models. We first compare our model against a purely graph-based approach, labeled Cosine only. For a given input, this model computes the $k$ closest playlists using cosine similarity. Songs are recommended by computing a histogram of all the songs contained in these playlists, weighted by the cosine similarity weight, as defined by eq. (11). The second model is NMF using KL divergence, labeled NMF. The last model, GNMF, described in Sec. 2, is based on the KL divergence with Tikhonov regularization using the graphs of our model.
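The Cosine only baseline described above can be sketched in a few lines. The version below finds the closest playlists by cosine similarity and ranks songs by a similarity-weighted histogram. The function names, the tie-breaking by song index, the choice to keep the query songs in the output, and the toy data are assumptions made for the example.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sets of song indices."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def cosine_only_recommend(query, playlists, k, n_out):
    """Rank songs by a histogram over the k closest playlists, weighted by similarity."""
    query = set(query)
    scored = sorted(((cosine(query, set(p)), idx) for idx, p in enumerate(playlists)),
                    reverse=True)[:k]
    histogram = defaultdict(float)
    for sim, idx in scored:
        for song in playlists[idx]:
            histogram[song] += sim
    # Highest weight first; ties broken by song index for determinism.
    ranked = sorted(histogram, key=lambda s: (-histogram[s], s))
    return ranked[:n_out]


playlists = [[1, 2, 3], [2, 3, 4], [7, 8]]
top = cosine_only_recommend([2, 3], playlists, k=2, n_out=3)
```

This baseline uses no learned factors at all, so it can only recommend songs that already co-occur with the query in some training playlist.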
Queries. We test our model with three different types of queries. In all cases, a query contains a fixed number of input songs, and the system returns the top-ranked output songs as a playlist using eq. (11). The first type of query, Random, contains songs chosen completely at random from all categories and is used solely as a comparison baseline. The second type, Test, randomly picks songs from a playlist of the test set. Lastly, Sampled contains randomly chosen songs from a given category. It simulates a recommender system driven by playlist categories chosen by a user.
Training. We train our model using a randomly selected subset of the playlists. As our model is not jointly convex, initialization may change the performance of the system, so we use the now-standard NNDSVD technique to get a good approximate solution. In all our experiments a single value of the rank performs well, which is expected given how few non-zero values each row has. The best set of regularization parameters is found by a grid search using queries on the validation set. In order to prevent overfitting, we perform early stopping as soon as the MPR on the validation set stops improving.
Validation set. We create the “playlists” of the validation set by creating artificial queries from the different playlist categories. That is, for each category we randomly pick songs that have been previously used in user-made playlists labeled by the given category.
Results. The performance of the different models in terms of playlist category accuracy and MPR is reported in Table 2 and Table 3, respectively. As expected, for random category queries all models fail to return playlists from the categories of the input songs. At the same time, the performance of NMF as collaborative filtering without the graph information is poor. This can be explained by the sparsity of the dataset, whose rows contain only a few non-zero elements, i.e. a density of only 0.11-0.46%. Collaborative filtering models are known to perform better as more observed ratings become available. The cosine model performs better in terms of category accuracy, as it directly uses the cosine distance between the input query and playlists from pure categories. However, its high MPR value shows that our model, albeit more complex, achieves better song recommendations.
In this work we introduce a novel flexible song recommender system that combines collaborative filtering with playlist and song proximity information encoded by graphs. We use a primal-dual based optimization scheme to achieve a highly parallelizable algorithm with the potential to scale up to very large datasets. We choose graph TV instead of Tikhonov regularization and demonstrate the model’s superiority by comparing our system against three other recommendation models on a real music playlists dataset.