Automatic Playlist Continuation through a Composition of Collaborative Filters

by Irene Teinemaa, et al.
University of Tartu
TU Eindhoven

The RecSys Challenge 2018 focused on automatic playlist continuation, i.e., the task was to recommend additional music tracks for playlists based on the playlist's title and/or a subset of the tracks that it already contains. The challenge is based on the Spotify Million Playlist Dataset (MPD), containing the tracks and the metadata from one million real-life playlists. This paper describes the automatic playlist continuation solution of team Latte, which is based on a composition of collaborative filters that each capture different aspects of a playlist, where the optimal combination of those collaborative filters is determined using a Tree-structured Parzen Estimator (TPE). The solution obtained the 12th place out of 112 participating teams in the final leaderboard. Team Latte participated in the main track of the RecSys Challenge 2018.


1. Introduction

With the increasing popularity of online music streaming services, the task of selecting relevant and personalized content in large music catalogs becomes important to avoid choice overload (Bollen et al., 2010). An open challenge in the area of personalization for music recommender systems is known as automatic playlist continuation (APC), where the task is to recommend tracks that are likely to be selected as additional tracks for an existing playlist. In APC it is important to recommend relevant content while, at the same time, respecting the characteristics of the original playlist (Schedl et al., 2018). For example, the recommended songs for the continuation of a playlist that consists of Christmas songs should be other Christmas songs.

To promote progress in the area of APC, the RecSys Challenge 2018 focused on this task. The challenge was organized by Spotify, the University of Massachusetts, Amherst, and Johannes Kepler University, Linz, and was open for submissions from January to July 2018. In the competition, participants were challenged to create a recommendation system for APC using a dataset of one million playlists that have been created by Spotify users in North America.

The competition task was to generate a list of 500 tracks as playlist continuation for each of the 10000 playlists in the challenge dataset. The playlists in the challenge set were divided into ten challenge categories, based on the number of seed tracks (the tracks that are already known to be present in the playlist) and the availability of the playlist title.

This manuscript describes the solution proposed by Team Latte. The main idea behind the proposed approach is to construct several collaborative filters, based on the co-occurrence of tracks with other tracks, artists, albums, and words in the playlist title. Furthermore, the solution adopts a specialized optimization strategy, where the weights of each collaborative filter are optimized locally within each challenge category using an optimization method called Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011). The final recommendation scores are produced as a weighted sum of the individual collaborative filtering components, and then post-processed using several heuristic strategies.

The remainder of this paper is organized as follows: Section 2 describes the data provided during the competition. Section 3 describes the proposed framework based on several collaborative filters and their combination. Section 4 further details the model optimization and selection procedures to combine the collaborative filters, and discusses some results on internal validation sets. Finally, in Section 5 we conclude the paper.

2. Dataset

The dataset consists of one million playlists created by Spotify users and distributed as the Million Playlist Dataset (MPD) for exclusive use in the competition. The MPD includes information about the playlist (title, identification, number of artists, playlist duration) and track information (album name, identification, artist, track duration, track name) for every playlist. A complete description of the dataset can be found in (RecSys Challenge 2018, 2018b).

In addition to the MPD, the organizers provided the challenge dataset, i.e., an official test dataset that contains partial information about 10000 playlists: the playlist title and/or a number of seed tracks (a subset of tracks present in the playlist). The aim of the challenge was to generate a list of 500 recommended tracks for each of these playlists, based on the partial information available. Additionally, each playlist contained a number of holdout tracks that were known only to the organizers of the challenge. The submissions of the participants were evaluated and ranked based on the correspondence between the recommended tracks and the holdout tracks (RecSys Challenge 2018, 2018a).

The playlists in the challenge dataset can be divided into ten distinct challenge categories based on the type of the provided partial information (RecSys Challenge 2018, 2018a), namely:

  • G1: Playlists with a title only

  • G2: Playlists with a title and the first track

  • G3: Playlists with a title and the first five tracks

  • G4: Playlists with first five tracks (no title)

  • G5: Playlists with a title and the first ten tracks

  • G6: Playlists with first ten tracks (no title)

  • G7: Playlists with a title and the first 25 tracks

  • G8: Playlists with a title and 25 random tracks

  • G9: Playlists with a title and the first 100 tracks

  • G10: Playlists with a title and 100 random tracks

Table 1 shows the distribution of playlists, the average number of holdout tracks, the number of unique seed tracks, and the number of unique artists present in the challenge dataset, grouped according to the number of seed tracks K.

Seed tracks   Playlists   Avg. holdout tracks   Unique seed tracks   Unique artists
K=0           1000        29                    0                    0
K=1           1000        23                    932                  715
K=5           2000        55                    6790                 2762
K=10          2000        53                    11877                4096
K=25          2000        126                   22507                6253
K=100         2000        88                    53552                11517
Table 1. Challenge Dataset Statistics

3. Framework

In this section, we describe the framework for automatic playlist continuation (APC). The framework consists of three stages. First, several collaborative filtering models are built in the collaborative filtering stage. Then, the predictions of these models are combined into a single relevance prediction per playlist-track combination in the composition stage. Finally, the recommendations are generated in the playlist continuation stage. We now describe these stages in detail.

3.1. Collaborative Filtering Stage

The task of collaborative filtering is to predict the utility of items (tracks) to a particular context (playlist) based on vector similarities between these entities extracted from data (Breese et al., 1998). This context can be based on different aspects of a playlist. For example, in item-item collaborative filtering, the context is based on the tracks that are already present in the playlist. A total of four collaborative filtering models were built in order to capture different contexts:

track-track model:

models the relevance of a given track for a given playlist based on the set of tracks that are currently present in the playlist. This model is a traditional item-item collaborative filtering model.

word-track model:

models the relevance of a given track for a given playlist based on the name of the playlist. This collaborative filter models the relation between the words in the playlist title and the tracks that occur in playlists whose title contains these words. The words are extracted by splitting the playlist name on the space character (i.e., ' '), transforming the results to lowercase, and removing punctuation marks.

album-track model:

models the relevance of a given track for a given playlist based on the albums of the tracks that are currently present in the playlist. This collaborative filter models the relation between the set of albums of the tracks that are currently in the playlist and the occurrence of tracks when these albums are in the playlist.

artist-track model:

models the relevance of a given track for a given playlist based on the artists that created the tracks that are currently present in the playlist. This collaborative filter models the relation between the set of artists of the tracks that are currently in the playlist and the occurrence of tracks when these artists are in the playlist.
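The four models share one pattern: count how often a track occurs together with a context item (a seed track, a title word, an album, or an artist). The following is a minimal sketch of that pattern for the word-track and track-track cases. It assumes raw co-occurrence counts as a stand-in for the similarity measure, which the text does not specify, and the playlist layout (keys `title` and `tracks`) and all function names are hypothetical.

```python
import string
from collections import Counter, defaultdict


def title_to_words(title):
    """Split on the space character, lowercase, and remove punctuation marks."""
    table = str.maketrans("", "", string.punctuation)
    words = (w.lower().translate(table) for w in title.split(" "))
    return [w for w in words if w]


def fit_cooccurrence(playlists, context_fn):
    """Count how often each track occurs in playlists that contain a context item.

    playlists  : iterable of dicts with (hypothetical) keys "title" and "tracks"
    context_fn : maps a playlist to its context items (words, tracks, ...)
    """
    counts = defaultdict(Counter)
    for playlist in playlists:
        tracks = set(playlist["tracks"])
        for item in set(context_fn(playlist)):
            counts[item].update(tracks)
    return counts


def score_tracks(counts, context_items):
    """Sum the co-occurrence counts of every candidate track over the context."""
    scores = Counter()
    for item in context_items:
        scores.update(counts.get(item, {}))
    return scores


# Toy data, not taken from the MPD.
playlists = [
    {"title": "Christmas Hits!", "tracks": ["t1", "t2", "t3"]},
    {"title": "christmas party", "tracks": ["t2", "t3", "t4"]},
    {"title": "Workout", "tracks": ["t5", "t1"]},
]

word_track = fit_cooccurrence(playlists, lambda p: title_to_words(p["title"]))
track_track = fit_cooccurrence(playlists, lambda p: p["tracks"])

word_scores = score_tracks(word_track, title_to_words("Christmas"))
track_scores = score_tracks(track_track, ["t1"])
```

An album-track or artist-track variant would only change `context_fn` so that it returns the albums or artists of the playlist's tracks.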

3.2. Composition Stage

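The outputs of the collaborative filters are combined into a final ranking model using a weighted sum:

$$ s(p, t) = w_{tt}\, s_{tt}(p, t) + w_{wt}\, s_{wt}(p, t) + w_{al}\, s_{al}(p, t) + w_{ar}\, s_{ar}(p, t) \tag{1} $$

where $s_{tt}$, $s_{wt}$, $s_{al}$, and $s_{ar}$ denote the relevance scores that the track-track, word-track, album-track, and artist-track models assign to track $t$ for playlist $p$, and $w_{tt}$, $w_{wt}$, $w_{al}$, and $w_{ar}$ are real-valued weights in the range $[0, 1]$.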

The best configuration of weights is found using an optimization procedure, namely the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011). We experiment with two types of weighting schemes: 1) global weights (optimized over all instances) and 2) local weights (optimized separately for each challenge category). We describe the procedure to determine the four weights in detail in Section 4.
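The weighted sum and the difference between global and local weights can be illustrated with a short sketch. The model names, the dictionary layout, and all numeric values below are illustrative assumptions and are not taken from the paper.

```python
def compose_scores(model_scores, weights):
    """Weighted sum of per-model relevance scores for a single playlist.

    model_scores : dict mapping model name -> {track id: score}
    weights      : dict mapping model name -> weight in [0, 1]
    """
    combined = {}
    for name, scores in model_scores.items():
        w = weights.get(name, 0.0)
        for track, score in scores.items():
            combined[track] = combined.get(track, 0.0) + w * score
    return combined


# Illustrative scores of the four collaborative filters for one playlist.
model_scores = {
    "track_track": {"t1": 0.5, "t2": 0.25},
    "word_track": {"t2": 1.0},
    "album_track": {"t3": 0.5},
    "artist_track": {"t1": 1.0, "t3": 0.5},
}

# Local weighting keeps one weight vector per challenge category;
# global weighting would use a single vector for all playlists.
local_weights = {
    "5_with_title": {
        "track_track": 1.0,
        "word_track": 0.5,
        "album_track": 0.25,
        "artist_track": 0.5,
    },
}

combined = compose_scores(model_scores, local_weights["5_with_title"])
```

With local weights, the category of the playlist selects the weight vector before the sum is computed; the composition itself is unchanged.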

3.3. Playlist Continuation Stage

To determine the recommended tracks for a given playlist, we keep the tracks whose composed relevance score is larger than zero and sort them in descending order of this score, computed with the weights found in the composition stage. However, fewer than 500 tracks may have a score larger than zero, in which case the requirement of recommending 500 tracks would not be satisfied. To improve the order of tracks in the recommendation ranking and to guarantee a total of 500 recommended tracks for every playlist, we apply two post-processing steps.

The first post-processing step aims at completing the albums that are already present in the playlist. This is motivated by the fact that a reasonable number of playlists in the dataset contain exactly the tracks of a single album, and we found the composed model to be insufficient to detect this scenario and complete the album for playlists that contain a high number of tracks from the same album. When the ratio of the number of tracks to the number of distinct albums currently in the playlist exceeds a threshold (a tunable parameter), we first recommend all the remaining tracks of those albums before recommending tracks based on the composed score.

As a second post-processing step, to fulfill the requirement of recommending exactly 500 tracks, we pad the list of recommended tracks with the most popular tracks in the dataset, in decreasing order of overall frequency, until the list contains exactly 500 tracks.
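The playlist continuation stage can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the function and argument names are hypothetical, seed tracks are assumed to be excluded from the recommendations, and album completion is applied to every album in the playlist once the ratio exceeds the threshold.

```python
def continue_playlist(seed_tracks, scores, track_album, album_tracks,
                      popular_tracks, ratio_threshold, n=500):
    """Return up to n recommended tracks for one playlist.

    scores         : {track id: composed relevance score}
    track_album    : {track id: album id}
    album_tracks   : {album id: list of track ids on that album}
    popular_tracks : track ids in decreasing order of overall frequency
    """
    used = set(seed_tracks)  # assumption: seed tracks are never recommended
    recs = []

    def extend(candidates):
        for track in candidates:
            if len(recs) >= n:
                return
            if track not in used:
                used.add(track)
                recs.append(track)

    # Step 1: album completion when the track-to-album ratio is high.
    albums = list(dict.fromkeys(track_album[t] for t in seed_tracks))
    if albums and len(seed_tracks) / len(albums) > ratio_threshold:
        for album in albums:
            extend(album_tracks[album])

    # Ranked tracks with a positive composed score, best first.
    ranked = sorted((t for t, s in scores.items() if s > 0),
                    key=lambda t: -scores[t])
    extend(ranked)

    # Step 2: pad with the most popular tracks until n tracks are reached.
    extend(popular_tracks)
    return recs


recs = continue_playlist(
    seed_tracks=["a1", "a2", "a3"],
    scores={"x": 0.9, "a5": 0.4, "y": 0.0, "z": 0.7},
    track_album={"a1": "A", "a2": "A", "a3": "A"},
    album_tracks={"A": ["a1", "a2", "a3", "a4", "a5"]},
    popular_tracks=["p1", "x", "p2", "p3"],
    ratio_threshold=2,
    n=6,
)
```

In the toy call, three seed tracks from one album give a ratio of 3, so the two remaining album tracks are placed first, followed by the positively scored tracks and then the popularity padding.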

4. Model selection

We evaluate the model instantiations using a combination of three measures: R-precision, NDCG, and CLICKS. These are the same three measures that the RecSys Challenge organizers use to score the submissions. We select the best performing model instantiation for submission.

In this section we present the evaluation measures, the procedure for optimizing the models' parameters, and the procedure and results for selecting the best model. The framework was implemented in Python and can be found at (Teinemaa et al., 2018) under an open source license.

4.1. Evaluation measures

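R-precision is defined as:

$$ \text{R-precision} = \frac{\left| G \cap R_{1:|G|} \right|}{|G|} \tag{2} $$

where $G$ is the set of ground truth (holdout) tracks, $R$ is the ranked list of recommended tracks, and $R_{1:|G|}$ denotes its first $|G|$ entries. The notation $|\cdot|$ denotes the number of elements in the set.

The normalized discounted cumulative gain (NDCG) is defined as the discounted cumulative gain (DCG) of the recommended list divided by the ideal DCG (IDCG), i.e., the DCG of a perfectly ranked list:

$$ \text{NDCG} = \frac{\text{DCG}}{\text{IDCG}} \tag{3} $$

The Recommended Songs CLICKS metric mimics a Spotify feature for track recommendation where ten tracks are presented at a time to the user as suggestions to complete the playlist. This metric captures the number of refreshes needed before a relevant track is encountered, and is defined as:

$$ \text{CLICKS} = \left\lfloor \frac{\arg\min_i \{ R_i : R_i \in G \} - 1}{10} \right\rfloor \tag{4} $$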
While NDCG and CLICKS were calculated based on the track-level agreement between the holdout tracks and the recommended tracks, R-precision was calculated on the artist-level agreement. In other words, it was considered sufficient if the artist of a recommended track matched the artist of a holdout track.
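The three measures can be sketched at track level as follows. Several details are assumptions rather than statements from the paper: the NDCG discount uses the common log2(i + 1) convention with binary relevance, which may differ from the exact formula used by the organizers, and the fallback value of 51 for CLICKS when no relevant track is found follows a recollection of the challenge rules. The artist-level matching described above is not implemented here.

```python
import math


def r_precision(recommended, holdout):
    """Share of holdout tracks found among the first |G| recommendations."""
    g = set(holdout)
    top = recommended[:len(g)]
    return len(g.intersection(top)) / len(g)


def ndcg(recommended, holdout):
    """Binary-relevance NDCG; assumes a log2(i + 1) discount at rank i."""
    g = set(holdout)
    dcg = sum(1.0 / math.log2(i + 1)
              for i, track in enumerate(recommended, start=1) if track in g)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, len(g) + 1))
    return dcg / idcg


def clicks(recommended, holdout, page_size=10, not_found=51):
    """Number of refreshes of a ten-track page before a relevant track appears.

    not_found is an assumed fallback when no recommended track is relevant.
    """
    g = set(holdout)
    for i, track in enumerate(recommended, start=1):
        if track in g:
            return (i - 1) // page_size
    return not_found


recommended = ["a", "b", "c", "d"]
holdout = ["b", "x"]
rp = r_precision(recommended, holdout)
nd = ndcg(recommended, holdout)
cl = clicks(recommended, holdout)
```

R-precision and NDCG are higher for better recommendations, whereas CLICKS is lower for better recommendations.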

4.2. Approaches

We tested three instantiations of the proposed framework, namely:

  • composition via global weights;

  • composition via local weights, without album completion;

  • composition via local weights, where the album completion threshold is optimized through the same procedure as optimizing the weights.

As a baseline, we compared the results to a simple popularity-based model, where the recommendation list is created based on the overall popularity of songs in a non-personalized manner.

4.3. Model Optimization

For each of the tested approaches, the weights for combining the collaborative filters needed to be optimized. Furthermore, in the variant with album completion, the track-to-album ratio threshold was optimized. To this end, we extracted an optimization dataset containing 10k playlists sampled from the MPD. Similarly to the original challenge dataset, we divided these playlists into ten distinct categories that match the challenge categories (see Section 2) via random sampling. The statistics of the dataset can be seen in Table 2.

Seed tracks   Playlists   Avg. holdout tracks   Unique seed tracks   Unique artists
K=0           1000        38                    0                    0
K=1           1000        37                    942                  758
K=5           2000        33                    7548                 3496
K=10          2000        33                    13487                5127
K=25          2000        32                    27185                8789
K=100         2000        53                    76648                18242
Table 2. Optimization Dataset Statistics
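One way to derive such a dataset from full MPD playlists is to hide part of each sampled playlist according to its category. The sketch below is an assumption about how this could be done, not a description of the authors' code: the category names follow Table 3, while the playlist layout, the function names, and the eligibility rule (a playlist needs more tracks than seed tracks) are hypothetical.

```python
import random

# (category name, number of seed tracks, title kept, seeds drawn at random)
CATEGORIES = [
    ("title_only", 0, True, False),
    ("1_with_title", 1, True, False),
    ("5_with_title", 5, True, False),
    ("5_no_title", 5, False, False),
    ("10_with_title", 10, True, False),
    ("10_no_title", 10, False, False),
    ("25_first", 25, True, False),
    ("25_random", 25, True, True),
    ("100_first", 100, True, False),
    ("100_random", 100, True, True),
]


def mask_playlist(playlist, n_seeds, keep_title, random_seeds, rng):
    """Hide part of a full playlist to imitate one challenge category."""
    tracks = playlist["tracks"]
    if random_seeds:
        seed_idx = sorted(rng.sample(range(len(tracks)), n_seeds))
    else:
        seed_idx = list(range(n_seeds))
    seed_set = set(seed_idx)
    return {
        "title": playlist["title"] if keep_title else None,
        "seeds": [tracks[i] for i in seed_idx],
        "holdout": [t for i, t in enumerate(tracks) if i not in seed_set],
    }


def build_split(playlists, per_category, seed=0):
    """Randomly assign playlists to categories and mask them accordingly."""
    rng = random.Random(seed)
    pool = list(playlists)
    rng.shuffle(pool)
    split = {}
    for name, n_seeds, keep_title, random_seeds in CATEGORIES:
        chosen, rest = [], []
        for playlist in pool:
            # A playlist qualifies only if some tracks remain to hold out.
            if len(chosen) < per_category and len(playlist["tracks"]) > n_seeds:
                chosen.append(playlist)
            else:
                rest.append(playlist)
        pool = rest
        split[name] = [mask_playlist(p, n_seeds, keep_title, random_seeds, rng)
                       for p in chosen]
    return split


# Toy playlists with 120 tracks each, not taken from the MPD.
toy_playlists = [
    {"title": f"playlist {i}", "tracks": [f"p{i}_t{j}" for j in range(120)]}
    for i in range(40)
]
split = build_split(toy_playlists, per_category=2)
```

Each playlist is used in at most one category, so the optimization set contains no duplicates across categories.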

The optimization process was set to maximize the NDCG metric (Equation 3) and was executed using the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011), which is a type of Sequential Model-Based Global Optimization (SMBO) (Hutter et al., 2011) algorithm. We use the TPE implementation that is available in the Python library Hyperopt (Bergstra et al., 2013). The TPE optimization process was set to run for 100 iterations, and the search space of each weight was defined as a uniform random variable ranging from 0 to 1.
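A setup of this kind could look as follows with Hyperopt. This is a minimal sketch, not the authors' code: `evaluate_ndcg` is a hypothetical callable that returns the mean NDCG on the optimization set for a given weight dictionary, and the weight names are illustrative. Because Hyperopt minimizes its objective, the NDCG is negated.

```python
WEIGHT_NAMES = ["track_track", "word_track", "album_track", "artist_track"]


def make_objective(evaluate_ndcg):
    """Turn an NDCG evaluator into a loss, since Hyperopt minimizes."""
    def objective(weights):
        return -evaluate_ndcg(weights)
    return objective


def optimize_weights(evaluate_ndcg, max_evals=100):
    """Search the weights with TPE; each weight is uniform on [0, 1]."""
    # Imported here so that make_objective can be used without Hyperopt.
    from hyperopt import fmin, hp, tpe

    space = {name: hp.uniform(name, 0.0, 1.0) for name in WEIGHT_NAMES}
    return fmin(fn=make_objective(evaluate_ndcg), space=space,
                algo=tpe.suggest, max_evals=max_evals)


def toy_ndcg(weights):
    """Stand-in evaluator that peaks when every weight equals 0.5."""
    return 1.0 - sum((weights[name] - 0.5) ** 2 for name in WEIGHT_NAMES)


loss_at_peak = make_objective(toy_ndcg)({name: 0.5 for name in WEIGHT_NAMES})

# With Hyperopt installed, the search itself would be started with:
# best_weights = optimize_weights(toy_ndcg, max_evals=100)
```

For local weights, `optimize_weights` would be called once per challenge category with an evaluator restricted to the playlists of that category; for global weights, it would be called once over all playlists.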

Table 3 shows the optimized best sets of weights separately for each category (used in the local weights composition) and global weights (used in the global weights composition).

title_only -
1_with_title 1
5_no_title 2
5_with_title 2
10_no_title 2
10_with_title 2
25_first 2
25_random 2
100_first 2
100_random 3
global -
Table 3. Optimized weights

4.4. Model Selection

In order to select the best model from the proposed framework in an offline manner (without making an official submission), we extracted a validation dataset containing 10k playlists sampled from the MPD. Again, we divided these playlists into ten distinct challenge categories via random sampling. The statistics of the dataset can be seen in Table 4. The validation set was used as a proxy for the challenge leaderboard, guiding model selection and improvements.

Seed tracks   Playlists   Avg. holdout tracks   Unique seed tracks   Unique artists
K=0           1000        38                    0                    0
K=1           1000        38                    943                  737
K=5           2000        34                    7451                 3427
K=10          2000        33                    13455                5197
K=25          2000        33                    26864                8512
K=100         2000        53                    80171                18570
Table 4. Validation Dataset Statistics

4.5. Results

This subsection presents and discusses the results of our experiments.

Figure 1 shows the performance (in terms of NDCG, CLICKS, and RPREC) for the tested instantiations of the framework and the baseline popularity model. Note that in all cases, the composed collaborative model performs better than the popularity model. The model with local weights and album completion is the best performing model and was the selected strategy for our final submission.

Figure 1. Performance of different models on validation set

The results of the final model on both the validation set and the challenge set are presented in Table 5. The Leaderboard score is the score given by the submission website, calculated by the organizers based on the recommended tracks and the holdout tracks (the ground truth values not available to participants) in the challenge dataset.

Metric   Validation   Leaderboard
RPREC    0.150587     0.203652
NDCG     0.288921     0.361175
CLICKS   5.6156       2.0240
Table 5. Results of Composed Model

To further analyse the performance of the final model within different challenge categories, Table 6 presents the results for the composed model in each of these categories in the validation set. (Note that the overall scores differ slightly from those reported above, since this detailed evaluation was executed with training on 400k playlists and a total of 100k tracks only, to reduce the computations.) We can see in this table that the model performs considerably better in the groups where the seed tracks were selected randomly from the playlist. The performance is lowest in the category where only the playlist title was provided as input.

Table 6. Results by challenge category (model trained on 400k playlists, 100k tracks)

5. Conclusion

In the 2018 RecSys challenge, teams competed in the task of automatic playlist continuation. To simulate different challenges in the playlist continuation task, a challenge dataset was provided with ten different types of seed information (called challenge categories). Our solution was based on combining multiple different collaborative filters that each capture different aspects of a playlist, and we combined them using a Tree-structured Parzen Estimator optimization approach where we optimized the weights locally for each of the challenge categories. The solution strategy shows promising results, ranking our team in position 12 out of 112 teams in the final competition leaderboard.


  • RecSys Challenge 2018 (2018a) RecSys Challenge 2018. 2018a. Challenge Set Readme.
  • RecSys Challenge 2018 (2018b) RecSys Challenge 2018. 2018b. The Million Playlist Dataset.
  • Bergstra et al. (2013) James Bergstra, Dan Yamins, and David D Cox. 2013. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference. Citeseer, 13–20.
  • Bergstra et al. (2011) James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems. 2546–2554.
  • Bollen et al. (2010) Dirk Bollen, Bart P Knijnenburg, Martijn C Willemsen, and Mark Graus. 2010. Understanding choice overload in recommender systems. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 63–70.
  • Breese et al. (1998) John S Breese, David Heckerman, and Carl Kadie. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 43–52.
  • Hutter et al. (2011) Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization. Springer, 507–523.
  • Schedl et al. (2018) Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, and Mehdi Elahi. 2018. Current challenges and visions in music recommender systems research. International Journal of Multimedia Information Retrieval 7, 2 (2018), 95–116.
  • Teinemaa et al. (2018) Irene Teinemaa, Niek Tax, Carlos Bentes, Maksym Semikin, Meri L Treimann, and Christian Safka. 2018. RecSys Challenge 2018 Team Latte Repository.