With the increasing popularity of online music streaming services, the task of selecting relevant and personalized content in large music catalogs becomes important to avoid choice overload (Bollen et al., 2010). An open challenge in the area of personalization for music recommender systems is known as automatic playlist continuation (APC), where the task is to recommend tracks that are likely to be selected as additional tracks for an existing playlist. In APC it is important to recommend relevant content while, at the same time, respecting the characteristics of the original playlist (Schedl et al., 2018). For example, the recommended songs for the continuation of a playlist that consists of Christmas songs should be other Christmas songs.
To promote progress in the area of APC, the RecSys Challenge 2018 focused on this task. The challenge was organized by Spotify, the University of Massachusetts, Amherst, and Johannes Kepler University, Linz, and was open for submissions from January to July 2018. In the competition, participants were challenged to create a recommendation system for APC using a dataset of one million playlists created by Spotify users in North America.
The competition task was to generate a list of 500 tracks as playlist continuation for each of the 10000 playlists in the challenge dataset. The playlists in the challenge set were divided into ten challenge categories, based on the number of seed tracks (the tracks that are already known to be present in the playlist) and the availability of the playlist title.
This manuscript describes the solution proposed by Team Latte. The main idea behind the proposed approach is to construct several collaborative filters, based on the co-occurrence of tracks with other tracks, artists, albums, and words in the playlist title. Furthermore, the solution adopts a specialized optimization strategy, where the weights of each collaborative filter are optimized locally within each challenge category using an optimization method called Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011). The final recommendation scores are produced as a weighted sum of the individual collaborative filtering components, and then post-processed using several heuristic strategies.
The remainder of this paper is organized as follows: Section 2 describes the data provided during the competition. Section 3 describes the proposed framework based on several collaborative filters and their combination. Section 4 details the model optimization and selection procedures used to combine the collaborative filters, and discusses results on internal validation sets. Finally, Section 5 concludes the paper.
The dataset consists of one million playlists created by Spotify users and distributed as the Million Playlist Dataset (MPD) for exclusive use in the competition. The MPD dataset includes information about the playlist (title, identification, number of artists, playlist duration) and track information (album name, identification, artist, track duration, track name) for every playlist. A complete description of the dataset can be found in (RecSys Challenge 2018, 2018b).
In addition to the MPD dataset, the organizers provided the challenge dataset, i.e. an official test dataset that contains partial information about 10000 playlists: the playlist title and/or a number of seed tracks (a subset of tracks present in the playlist). The aim of the challenge was to generate a list of 500 recommended tracks for each of these playlists, based on the partial information available. Additionally, each playlist contained a number of holdout tracks that were known only to the organizers of the challenge. The submissions of the participants were evaluated and ranked based on the correspondence between the recommended tracks and the holdout tracks (RecSys Challenge 2018, 2018a).
The playlists in the challenge dataset can be divided into ten distinct challenge categories based on the type of the provided partial information (RecSys Challenge 2018, 2018a), namely:
G1: Playlists with a title only
G2: Playlists with a title and the first track
G3: Playlists with a title and the first five tracks
G4: Playlists with first five tracks (no title)
G5: Playlists with a title and the first ten tracks
G6: Playlists with first ten tracks (no title)
G7: Playlists with a title and the first 25 tracks
G8: Playlists with a title and 25 random tracks
G9: Playlists with a title and the first 100 tracks
G10: Playlists with a title and 100 random tracks
Table 1 shows the distribution of playlists, the average number of holdout tracks, the number of unique seed tracks, and the number of unique artists present in the challenge dataset, grouped according to the number of seed tracks.
In this section, we describe the framework for automatic playlist continuation (APC). The framework consists of three stages. First, in the collaborative filtering stage, several different collaborative filtering models are built. Then, in the composition stage, the predictions of these models are combined into a single relevance prediction per playlist-track combination. Finally, in the playlist continuation stage, the recommendations are generated. We now describe these stages in detail.
3.1. Collaborative Filtering Stage
The task of collaborative filtering is to predict the utility of items (tracks) to a particular context (playlist) based on vector similarities between these entities extracted from data (Breese et al., 1998). This context can be based on different aspects of a playlist. For example, in item-item collaborative filtering, the context is based on the tracks that are already present in the playlist. A total of four collaborative filtering models were built in order to capture different contexts:
- track-track model:
models the relevance of a given track for a given playlist based on the set of tracks that are currently present in the playlist. This model is a traditional item-item collaborative filtering model.
- word-track model:
models the relevance of a given track for a given playlist based on the name of the playlist. This collaborative filter models the relation between words in the playlist name and the occurrence of tracks in playlists whose title contains these words. The words are extracted by splitting the playlist name on the space character, transforming the results to lowercase, and removing punctuation marks.
- album-track model:
models the relevance of a given track for a given playlist based on the albums of the tracks that are currently present in the playlist. This collaborative filter models the relation between the set of albums of the tracks that are currently in the playlist and the occurrence of tracks when these albums are in the playlist.
- artist-track model:
models the relevance of a given track for a given playlist based on the artists that created the tracks that are currently present in the playlist. This collaborative filter models the relation between the set of artists of the tracks that are currently in the playlist and the occurrence of tracks when these artists are in the playlist.
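As an illustration of how such co-occurrence-based filters could be built, the following minimal sketch counts how often a context item (a track, word, album, or artist) appears in the same playlist as a track, and scores candidate tracks by summing cosine similarities over the seed context. The playlist layout (dictionaries with `name` and `tracks` keys), the function names, and the choice of cosine similarity on binary playlist vectors are assumptions made for this example, not the exact implementation used in our submission.

```python
import math
import string
from collections import Counter, defaultdict


def tokenize_title(title):
    """Split on spaces, lowercase, and strip punctuation (word-track context)."""
    table = str.maketrans("", "", string.punctuation)
    words = (w.lower().translate(table) for w in title.split(" "))
    return [w for w in words if w]


def build_cooccurrence(playlists, context_of):
    """Count co-occurrences between context items and tracks over all playlists.

    `context_of` maps a playlist to its context items, e.g. its tracks,
    title words, albums, or artists (hypothetical callback).
    """
    cooc = defaultdict(Counter)  # context item -> Counter of co-occurring tracks
    context_freq = Counter()     # number of playlists containing a context item
    track_freq = Counter()       # number of playlists containing a track
    for playlist in playlists:
        tracks = set(playlist["tracks"])
        contexts = set(context_of(playlist))
        track_freq.update(tracks)
        context_freq.update(contexts)
        for c in contexts:
            cooc[c].update(tracks)
    return cooc, context_freq, track_freq


def score_tracks(seed_contexts, cooc, context_freq, track_freq, exclude=()):
    """Sum the cosine similarity between every seed context item and each track."""
    excluded = set(exclude)
    scores = Counter()
    for c in set(seed_contexts):
        for track, n in cooc.get(c, {}).items():
            if track not in excluded:
                scores[track] += n / math.sqrt(context_freq[c] * track_freq[track])
    return scores
```

For the word-track model one would pass `context_of=lambda p: tokenize_title(p["name"])`; for the track-track model, `context_of=lambda p: p["tracks"]` together with `exclude=seed_tracks`, so that seed tracks do not recommend themselves.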
3.2. Composition Stage
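The output of every collaborative filter is combined in a final ranking model $R$ using a weighted sum given by:
$$R(p, t) = w_{tt}\, s_{tt}(p, t) + w_{wt}\, s_{wt}(p, t) + w_{alt}\, s_{alt}(p, t) + w_{art}\, s_{art}(p, t) \tag{1}$$
where $s_{tt}$, $s_{wt}$, $s_{alt}$, and $s_{art}$ denote the scores that the track-track, word-track, album-track, and artist-track models assign to track $t$ for playlist $p$, and $w_{tt}$, $w_{wt}$, $w_{alt}$, and $w_{art}$ are real-valued weights in the range $[0, 1]$.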
The best configuration of weights is found using an optimization procedure, such as Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011). We experiment with two types of weighting schemes: 1) global weights (optimized over all instances) and 2) local weights (optimized separately for each challenge category). We describe the procedure to determine the four weights in detail in Section 4.
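The composition step can be sketched as follows. The component names, the dictionary-based score representation, and the helper `weights_for` are assumptions introduced for this illustration; the sketch only shows how a weighted sum with either global or category-specific weights could be computed.

```python
def compose_scores(component_scores, weights):
    """Combine per-track scores of several collaborative filters by a weighted sum.

    component_scores: {component name: {track: score}}
    weights:          {component name: weight in [0, 1]}
    """
    combined = {}
    for name, scores in component_scores.items():
        w = weights.get(name, 0.0)  # components without a weight are ignored
        for track, s in scores.items():
            combined[track] = combined.get(track, 0.0) + w * s
    return combined


def weights_for(category, local_weights, global_weights):
    """Use the category-specific weights when available, else the global ones."""
    return local_weights.get(category, global_weights)
```

With local weights, `weights_for` would be called once per playlist with the challenge category of that playlist, and the result passed to `compose_scores`.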
3.3. Playlist Continuation Stage
To determine the recommended tracks for a given playlist, we keep the tracks whose composed relevance score is larger than zero and sort them in descending order of that score, where the score is computed with the weights found in the composition stage. However, fewer than 500 tracks may have a score larger than zero, in which case the requirement of recommending 500 tracks would not be satisfied. To improve the order of tracks in the recommendation ranking and to guarantee a total of 500 recommended tracks for every playlist, we apply two post-processing steps.
The first post-processing step aims at completing the albums that are already present in the playlist. This is motivated by the fact that a considerable number of playlists in the dataset contain exactly the tracks of a single album, and we found the composed ranking model to be insufficient to detect this scenario and complete the album for playlists that contain a high number of tracks from the same album. When the ratio of the number of tracks to the number of distinct albums currently in the playlist exceeds a tunable threshold, we first recommend all the remaining tracks from that album before recommending tracks based on the composed relevance score.
As a second post-processing step, to fulfill the requirement of recommending exactly 500 tracks, we pad the list of recommended tracks with the most popular tracks in the dataset, in decreasing order of overall frequency, until the list contains exactly 500 tracks.
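The playlist continuation stage with its two post-processing steps could look roughly like the sketch below. Several details are assumptions made for the example: seed tracks are never recommended again, album completion considers the albums of the seed tracks in order of how many seed tracks they contribute, and the lookup structures `album_of`, `album_tracks`, and `popular_tracks` are hypothetical inputs.

```python
from collections import Counter


def continue_playlist(seed_tracks, scores, album_of, album_tracks,
                      popular_tracks, ratio_threshold, n=500):
    """Return up to n recommended tracks for a playlist.

    scores:         {track: composed relevance score}
    album_of:       {track: album}
    album_tracks:   {album: list of its tracks}
    popular_tracks: tracks sorted by decreasing overall frequency
    """
    seen = set(seed_tracks)  # assumption: seed tracks are not recommended again
    recs = []

    def add(track):
        if track not in seen and len(recs) < n:
            seen.add(track)
            recs.append(track)

    # Post-processing step 1: album completion, applied when the playlist has
    # many tracks per distinct album.
    album_counts = Counter(album_of[t] for t in seed_tracks)
    if album_counts and len(seed_tracks) / len(album_counts) > ratio_threshold:
        for album, _ in album_counts.most_common():
            for track in album_tracks[album]:
                add(track)

    # Main ranking: tracks with a positive score, in descending order of score.
    ranked = sorted((t for t, s in scores.items() if s > 0),
                    key=lambda t: scores[t], reverse=True)
    for track in ranked:
        add(track)

    # Post-processing step 2: pad with the most popular tracks.
    for track in popular_tracks:
        add(track)
    return recs
```

Keeping a single `add` helper ensures that the three sources of recommendations never produce duplicates and that the list never exceeds the requested length.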
4. Model selection
We evaluate the model instantiations using a combination of three measures: R-precision, NDCG, and CLICKS, which are the same three measures that are used by the RecSys challenge organizers to score the submissions. We select the best performing model instantiation for submission.
In this section we present the evaluation measures, the procedure for optimizing the models’ parameters, and the procedure and results for selecting the best model. The framework was implemented in Python and can be found at (Teinemaa et al., 2018) under an open-source license.
4.1. Evaluation measures
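The R-precision, defined as:
$$\text{R-precision} = \frac{\left| G \cap R_{1:|G|} \right|}{|G|} \tag{2}$$
where $G$ is the set of ground truth (holdout) tracks, and $R$ is the ordered list of recommended tracks, with $R_{1:|G|}$ denoting its first $|G|$ entries. The notation $|\cdot|$ denotes the number of elements in the set.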
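The normalized discounted cumulative gain (NDCG), defined as:
$$\text{NDCG} = \frac{\text{DCG}}{\text{IDCG}} \tag{3}$$
where DCG is the discounted cumulative gain of the recommended list, in which a relevant track contributes less the lower it is ranked, and IDCG is the DCG of the ideal ranking in which all relevant tracks are placed at the top of the list.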
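The Recommended Songs CLICKS metric, which mimics a Spotify feature for track recommendation where ten tracks are presented to the user at a time as suggestions to complete the playlist. This metric captures the number of refreshes needed before a relevant track is encountered, and is defined as:
$$\text{CLICKS} = \left\lfloor \frac{\arg\min_i \{ R_i : R_i \in G \} - 1}{10} \right\rfloor \tag{4}$$
where $G$ is the set of holdout tracks and $R_i$ is the $i$-th recommended track.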
While NDCG and CLICKS were calculated based on the track-level agreement between the holdout tracks and the recommended tracks, R-precision was calculated on the artist-level agreement. In other words, it was considered sufficient if the artist of a recommended track matched the artist of a holdout track.
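For reference, the three measures can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the official evaluation code: the NDCG uses the common $1/\log_2(i+1)$ discount with binary relevance, which may differ in detail from the organizers' implementation, the value returned by `clicks` when no relevant track is found (51 by default) is an assumption, and the artist-level matching described above is not reproduced; the functions simply compare whichever identifiers are passed in.

```python
import math


def r_precision(recommended, holdout):
    """Fraction of holdout items found among the first |holdout| recommendations."""
    holdout = set(holdout)
    if not holdout:
        return 0.0
    top = recommended[:len(holdout)]
    return len(holdout.intersection(top)) / len(holdout)


def ndcg(recommended, holdout):
    """Binary-relevance NDCG with a 1/log2(rank + 1) discount (assumed form)."""
    holdout = set(holdout)
    # enumerate() is 0-based, so rank + 1 becomes i + 2.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, track in enumerate(recommended) if track in holdout)
    ideal_hits = min(len(holdout), len(recommended))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def clicks(recommended, holdout, page_size=10, not_found=51):
    """Number of refreshes of a ten-track page before the first relevant track."""
    holdout = set(holdout)
    for i, track in enumerate(recommended):
        if track in holdout:
            return i // page_size
    return not_found  # assumed penalty when no relevant track is recommended
```

All three functions expect `recommended` to be an ordered list without duplicates and `holdout` to be any collection of ground truth identifiers.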
We tested three instantiations of the proposed framework, namely:
composition via global weights;
composition via local weights, without album completion;
composition via local weights, where the album completion threshold is optimized through the same procedure as optimizing the weights.
As a baseline, we compared the results to a simple popularity-based model, where the recommendation list is created based on the overall popularity of songs in a non-personalized manner.
4.3. Model Optimization
For each of the tested approaches, the weights for combining the collaborative filters needed to be optimized. Furthermore, in the variant with album completion, the track-to-album ratio threshold was optimized. To this end, we extracted an optimization dataset containing 10k playlists from the MPD. As with the original challenge dataset, we divided these playlists via random sampling into ten distinct categories that match the challenge categories (see Section 2). The statistics of the dataset can be seen in Table 2.
The optimization process was set to maximize the NDCG metric (Equation 3) and was executed using Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011), which is a type of Sequential Model-Based Global Optimization (SMBO) (Hutter et al., 2011) algorithm. We use the TPE implementation that is available in the Python library Hyperopt (Bergstra et al., 2013). The TPE optimization process was set to run for 100 iterations, and the search space of each weight was defined as a uniform random variable ranging from 0 to 1.
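A minimal sketch of this optimization setup with Hyperopt is given below. The helper names, the component labels, and the `evaluate` callback (assumed to return the NDCG of one playlist for a given weight dictionary) are hypothetical and introduced only for illustration. Since Hyperopt minimizes its objective, the negative mean NDCG is used as the loss.

```python
def make_objective(playlists, evaluate):
    """Build a loss function: negative mean NDCG over the given playlists."""
    def objective(weights):
        values = [evaluate(playlist, weights) for playlist in playlists]
        return -sum(values) / len(values)
    return objective


def optimize_weights(objective,
                     components=("track", "word", "album", "artist"),
                     max_evals=100):
    """Search weights in [0, 1] with TPE for the given number of iterations."""
    # Imported lazily so that the objective can be used without Hyperopt.
    from hyperopt import fmin, hp, tpe

    # One uniform(0, 1) variable per collaborative filter weight.
    space = {name: hp.uniform(name, 0.0, 1.0) for name in components}
    # fmin returns a dict mapping each label to its best value found.
    return fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=max_evals)


def optimize_per_category(playlists_by_category, evaluate, **kwargs):
    """Local weights: run a separate TPE search for every challenge category."""
    return {
        category: optimize_weights(make_objective(playlists, evaluate), **kwargs)
        for category, playlists in playlists_by_category.items()
    }
```

Global weights correspond to a single call of `optimize_weights` on all optimization playlists, whereas local weights repeat the search once per challenge category. An album completion threshold could be added to the search space in the same way as an additional variable.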
Table 3 shows the optimized best sets of weights separately for each category (used in the local weights composition) and global weights (used in the global weights composition).
4.4. Model Selection
In order to select the best model from the proposed framework in an offline manner (without making an official submission), we extracted a validation dataset containing 10k playlists from the MPD. Again, we divided these playlists into ten distinct challenge categories via random sampling. The statistics of the dataset can be seen in Table 4. The validation set was used as a proxy for the challenge leaderboard, guiding model selection and improvements.
This subsection presents and discusses the results of our experiments.
Figure 1 shows the performance (in terms of NDCG, CLICKS, and RPREC) for the tested instantiations of the framework and the baseline popularity model. Note that in all cases, the composed collaborative model performs better than the popularity model. The model with local weights and album completion is the best performing model and was the selected strategy for our final submission.
The results of the final model on both the validation set and the challenge set are presented in Table 5. The Leaderboard score is the score given by the submission website, calculated by the organizers based on the recommended tracks and the holdout tracks (the ground truth values not available to participants) in the challenge dataset.
To further analyse the performance of the final model within different challenge categories, Table 6 presents the results for the composed model in each of these categories in the validation set. (Note that the overall scores are slightly different from those above, since this detailed evaluation was executed with training on 400k playlists and a total of 100k tracks only, to reduce the computations.) We can see in this table that the model performs considerably better in the groups where the seed tracks were selected randomly from the playlist. The performance is lowest in the category where only the playlist title was provided as input.
In the 2018 RecSys challenge, teams competed in the task of automatic playlist continuation. To simulate different scenarios in the playlist continuation task, a challenge dataset was provided with ten different types of seed information (called challenge categories). Our solution was based on combining multiple collaborative filters that each capture different aspects of a playlist, and we combined them using a Tree-structured Parzen Estimator optimization approach in which the weights were optimized locally for each of the challenge categories. The solution strategy shows promising results, ranking our team in position 12 out of 112 teams in the final competition leaderboard.
- RecSys Challenge 2018 (2018a) RecSys Challenge 2018. 2018a. Challenge Set Readme. https://recsys-challenge.spotify.com/challenge_readme
- RecSys Challenge 2018 (2018b) RecSys Challenge 2018. 2018b. The Million Playlist Dataset. https://recsys-challenge.spotify.com/readme
- Bergstra et al. (2013) James Bergstra, Dan Yamins, and David D Cox. 2013. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference. Citeseer, 13–20.
- Bergstra et al. (2011) James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems. 2546–2554.
- Bollen et al. (2010) Dirk Bollen, Bart P Knijnenburg, Martijn C Willemsen, and Mark Graus. 2010. Understanding choice overload in recommender systems. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 63–70.
- Breese et al. (1998) John S Breese, David Heckerman, and Carl Kadie. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 43–52.
- Hutter et al. (2011) Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization. Springer, 507–523.
- Schedl et al. (2018) Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, and Mehdi Elahi. 2018. Current challenges and visions in music recommender systems research. International Journal of Multimedia Information Retrieval 7, 2 (2018), 95–116.
- Teinemaa et al. (2018) Irene Teinemaa, Niek Tax, Carlos Bentes, Maksym Semikin, Meri L Treimann, and Christian Safka. 2018. RecSys Challenge 2018 Team Latte Repository. https://github.com/irhete/recsys-challenge-2018