Evaluating Music Recommendations with Binary Feedback for Multiple Stakeholders

by Sasha Stoikov, et al.
Cornell University

High quality user feedback data is essential to training and evaluating a successful music recommendation system, particularly one that has to balance the needs of multiple stakeholders. Most existing music datasets suffer from noisy feedback and self-selection biases inherent in the data collected by music platforms. Using the Piki Music dataset of 500k ratings collected over a two-year time period, we evaluate the performance of classic recommendation algorithms on three important stakeholders: consumers, well-known artists and lesser-known artists. We show that a matrix factorization algorithm trained on both likes and dislikes performs significantly better compared to one trained only on likes for all three stakeholders.




1. Introduction

Music recommendation algorithms play a major role in what music gets listened to on streaming platforms (Mehrotra et al., 2020b; Millecamp et al., 2018). This in turn influences which artists make a living from streaming and which ones do not. Understanding the mechanisms that cause an algorithm to push certain artists ahead of others is increasingly urgent. In this paper, we focus on three stakeholders of a music recommendation system: music consumers, well-known artists and lesser-known artists.

Much has been written about competing classes of algorithms and metrics. Unfortunately, the growing number of proposed metrics makes it difficult to answer straightforward questions related to the satisfaction of the stakeholders of a music streaming platform. For example, what proportion of recommendations is actually liked by users? Do the algorithms serve well-known artists better than lesser-known artists? Are music consumers more or less satisfied with well-known or lesser-known recommendations? There is often a tension between metrics that aim to measure familiarity, relevance and predictability, and metrics that aim to measure fairness, diversity and serendipity. Combining these metrics into objective functions that represent the interests of the aforementioned stakeholders is challenging.

At the heart of this problem are three common assumptions in the data used for training and testing recommendation algorithms: (i) unheard songs are assumed to be disliked, though in reality unheard songs are often excellent; (ii) played songs are assumed to be liked, though they may have been passively listened to on a playlist; and (iii) rated songs are assumed to be randomly presented to users, though they are often self-selected by the user. All of these assumptions tend to favor well-known artists, who are more often heard, more often recommended on playlists and more often remembered at the search bar. To mitigate these concerns we built Piki (www.piki.nyc), a music discovery tool that collects ratings through a binary choice (like/dislike) on song clips that cannot be searched or skipped. We present the Piki Music dataset with the goal of enabling researchers and practitioners from the RecSys community to explore the above challenges. We highlight the following contributions:

  • We collect the Piki Music dataset in a way that incentivizes users to give truthful binary ratings to songs, thus mitigating the self-selection and noisy feedback biases inherent in most recommendation datasets (Section 3).

  • We quantify the value of a dislike by training a matrix factorization algorithm on this binary dataset. We find that it performs significantly better than when trained on positive-only feedback for all three stakeholders (Section 4).

2. Related work

Datasets with Explicit and Implicit User feedback

Datasets consisting of user-item interactions are typically elicited in two ways: explicitly or implicitly. Explicit feedback refers to an action that a user performs with the intention of giving their opinion on the quality of an item, for example, giving a rating to a watched movie. Two well-known datasets for recommendations, the Netflix Prize dataset (Bennett et al., 2007) and the Movielens dataset (Harper and Konstan, 2015), are examples of explicit feedback datasets. Both consist of 1-5 star ratings for millions of user-item pairs. Explicit feedback is voluntary, so this kind of dataset is often subject to self-selection bias: if there is no obligation to rate, users tend to rate only when they feel very strongly about an item, and rate more items they like than items they dislike. Unbalanced explicit training datasets have been documented in the literature and some flattening techniques have been shown to improve recommendation quality (Mansoury et al., 2021).

Implicit feedback refers to an action that a user performs with an intention other than giving their opinion on the quality of the item, such as clicks, purchases, etc. Implicit feedback collected by streaming apps is commonly used to train modern recommendation systems (Hu et al., 2008). However, researchers have expressed concerns about training and evaluating algorithms using this type of feedback data (Wen et al., 2019; Lu et al., 2018). A major problem with implicit feedback datasets for music recommendations, such as those obtained on Lastfm (Bertin-Mahieux et al., 2011) or Spotify (Brost et al., 2019), is that they may measure a noisy signal of the user’s true preferences. If a user listens to a song, that action may be interpreted as positive feedback, though the song may have been passively played as background music (Wen et al., 2019). Moreover, implicit feedback typically consists of “positive-only” data. To train and test an algorithm, missing user-item interactions are typically labeled as negative feedback, which only adds more noise to the training data. One approach to deal with implicit data is to label songs that are less often played as negative feedback  (Sánchez and Bellogín, 2018). However, few datasets with explicit positive and negative user preferences are publicly available, making it hard to investigate the limitations of implicit feedback.

Recommendations for multi-stakeholders

A recent paper calls for a paradigm shift from user-centric metrics towards modeling the value that recommendation systems bring to their multiple stakeholders (Jannach and Bauer, 2020). The authors identify potential stakeholders as consumers, producers, platforms and society at large and call for evaluation designs aligned with the goals of each stakeholder. The way forward, they suggest, is to design better evaluation methods that combine multiple goals and generalize beyond domain-specific applications. Researchers have investigated recommendations from the multi-stakeholder perspective by optimizing over multiple objectives such as diversity, relevance, fairness and satisfaction (Mehrotra et al., 2020a, b). They highlight an inherent tension between relevant recommendations that are similar to past consumption and diverse recommendations that are outside of a user’s echo chamber. Fairness in compensating for the popularity bias has gathered particular attention (Abdollahpouri et al., 2020; Kowald et al., 2019; Celma and Cano, 2008). Some authors suggest addressing multiple stakeholders by modeling profit-aware recommendations (Abdollahpouri et al., 2019; Abdollahpouri and Essinger, 2017), e.g., recommendations that are linked to sales. However, a challenge is the lack of publicly available data with multi-stakeholder characteristics, since such data tends to be sensitive.

In the music domain, lesser-known artists have expressed many concerns, which include reaching an audience, transparency in recommendations, localizing discovery, gender balance and popularity bias, according to a qualitative study (Ferraro et al., 2021). In what follows, we take a quantitative approach to some of these concerns with the Piki Music Dataset.

3. Piki Music Dataset

Through the Piki interface (Fig. 1), we collect binary data while incentivizing users to provide feedback in a way that is aligned with their individual tastes. The Piki Music dataset currently consists of 2,723 anonymized users, 66,532 anonymized songs and 500K binary ratings, and data collection is ongoing. Figure 2 illustrates the distribution of like rates across users and songs. The columns of the dataset are as follows:

Figure 1. The Piki Music App interface. The dislike button is unlocked after 3 seconds, the like button after 6 seconds and the superlike button after 12 seconds. This incentivizes users to vote truthfully: to dislike is easy, but in order to like, a user must invest time in the song.
Figure 2. The distribution of like ratios across users and songs.
  • timestamp: a datetime variable

  • user_id: an anonymized user id

  • song_id: an anonymized song id

  • liked: this is the binary indicator, 1 if the song is liked, or 0 if the song is disliked. Note that the feedback consists of 39% likes and 61% dislikes. The superlike indicator, labeled 2, is included in the data, though we treat it as a like in our experiments.

  • personalized: this is 1 if the song was recommended based on the user’s previous choices or 0 if the song was selected randomly. Note that the songs presented are 66% personalized and 34% random. We have included this flag in the dataset to allow mitigation of the recommendation bias of the data, though this question is not within the scope of our study.

  • spotify_popularity: this is the popularity of the song’s artist, a value between 0 and 100, with 100 being the most popular. It is published by Spotify for each artist through their publicly available API (https://developer.spotify.com/documentation/web-api/reference/category-artists). The average value of the Spotify popularity in our dataset is 52, so we classify a song as coming from a well-known artist if the value is above this mean and from a lesser-known artist if it is below the mean. Note that this threshold corresponds to artists that have approximately 350,000 monthly listeners, which on average generates around $2000 per month, assuming a solo artist without a label.
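As a minimal sketch of how the columns above might be used, the following Python snippet binarizes the liked column (treating a superlike, labeled 2, as a like) and labels each rating as coming from a well-known or lesser-known artist by comparing spotify_popularity with the mean. The sample rows are invented for illustration and are not taken from the dataset, the helper names are hypothetical, and assigning a value exactly equal to the mean to the lesser-known group is an assumption.

```python
from statistics import mean

# Hypothetical sample rows using the dataset's column names (values are invented).
ratings = [
    {"user_id": 1, "song_id": 10, "liked": 1, "personalized": 1, "spotify_popularity": 80},
    {"user_id": 1, "song_id": 11, "liked": 0, "personalized": 0, "spotify_popularity": 30},
    {"user_id": 2, "song_id": 10, "liked": 2, "personalized": 1, "spotify_popularity": 80},
    {"user_id": 2, "song_id": 12, "liked": 0, "personalized": 0, "spotify_popularity": 50},
]


def binarize(liked):
    """Map like (1) and superlike (2) to 1, and dislike (0) to 0."""
    return 1 if liked >= 1 else 0


def label_artist_group(rows):
    """Attach a binary label and an artist group based on the mean popularity."""
    threshold = mean(r["spotify_popularity"] for r in rows)
    labeled = []
    for r in rows:
        # Assumption: a value exactly at the mean counts as lesser-known.
        group = "well_known" if r["spotify_popularity"] > threshold else "lesser_known"
        labeled.append({**r, "label": binarize(r["liked"]), "artist_group": group})
    return labeled, threshold


labeled, threshold = label_artist_group(ratings)
```

In this toy sample the threshold is the sample mean rather than the value of 52 reported for the full dataset.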

3.1. Binary data collection

Users on Piki provide explicit binary feedback by liking or disliking music clips. Users do not have access to a search bar and thus cannot control what songs they will hear. They are asked to like or dislike 30-second music video clips. Figure 1 shows how the interface presents the songs in batches, much like a social media story. The binary nature of the Piki Music dataset addresses our first concern with the training data: we do not need to treat unheard songs as disliked, since we have a set of explicitly disliked songs with which to train and test the algorithm.

3.2. Incentives to vote truthfully

The dislike button is unlocked after 3 seconds (see the first and second images in Fig. 1), the like button is unlocked after 6 seconds (see the 3rd image in Fig. 1) and the superlike button is unlocked after 12 seconds (see the 4th image in Fig. 1). The clip starts 40 seconds into the song and when the clip is over, 30 seconds later, the user may replay the clip before rating it. This mechanism aligns the users’ ratings with their preferences. The 3 second lock period for the dislike button ensures that all songs are given a fair chance. The 6 and 12 second lock periods guarantee that positive ratings are backed up by a meaningful time investment in the song. Only songs that truly capture a user’s attention get liked or superliked. Piki users are rewarded with micropayments each time they complete a set of ratings. Both the timing and the financial rewards help mitigate our second data concern, namely that liked songs are actively liked, not just played passively on a playlist.
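The unlock schedule described above can be sketched as a small lookup. This is an illustrative reading of the mechanism, not the app's actual code; the function name is hypothetical, and treating a button as available at exactly its unlock time is an assumption.

```python
# Unlock delays in seconds, as described for the Piki interface.
UNLOCK_SECONDS = {"dislike": 3, "like": 6, "superlike": 12}


def available_actions(elapsed_seconds):
    """Return the rating buttons unlocked after `elapsed_seconds` of playback."""
    # Assumption: a button becomes available exactly at its unlock time.
    return [
        action
        for action, delay in UNLOCK_SECONDS.items()
        if elapsed_seconds >= delay
    ]
```

For example, a user who has listened for 7 seconds can dislike or like a clip but cannot yet superlike it, which is what makes a positive rating more costly than a negative one.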

4. Experiments

We split the dataset into a training set $D_{\text{train}}$ and an evaluation set $D_{\text{eval}}$ using random sampling according to an 80%/20% split stratified by user, and average the results across 5 runs. Algorithms are trained on $D_{\text{train}}$ and scores are computed on interactions in $D_{\text{eval}}$.
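A per-user random split of this kind could look roughly like the sketch below. It is not the authors' code: the function name, the rounding rule for each user's evaluation share and the use of one seed per run are assumptions, and the toy ratings are invented.

```python
import random
from collections import defaultdict


def split_by_user(ratings, eval_fraction=0.2, seed=0):
    """Randomly hold out roughly `eval_fraction` of each user's ratings."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row["user_id"]].append(row)

    train, evaluation = [], []
    for rows in by_user.values():
        rng.shuffle(rows)  # shuffles the per-user list built above
        n_eval = round(len(rows) * eval_fraction)
        evaluation.extend(rows[:n_eval])
        train.extend(rows[n_eval:])
    return train, evaluation


# Toy data: 3 users with 10 ratings each (invented for illustration).
toy = [
    {"user_id": u, "song_id": s, "liked": (u + s) % 2}
    for u in range(3)
    for s in range(10)
]

# Five runs, each with its own seed, so that metrics can be averaged across runs.
splits = [split_by_user(toy, seed=run) for run in range(5)]
```

Stratifying by user keeps every user represented in both sets, so each user has training interactions from which a latent vector can be learned.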

4.1. Training matrix factorization algorithms

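Given a set of users $U$, songs $I$ and ratings $R$, collaborative filtering aims to learn $k$-dimensional latent user vectors $P \in \mathbb{R}^{|U| \times k}$ and latent item vectors $Q \in \mathbb{R}^{|I| \times k}$ from the sparse user-item rating matrix through singular value decomposition (SVD) (Paterek, 2007). Predicted user preference scores are given by the dot products between user and item vectors:

$$\hat{r}_{ui} = P_u Q_i^{\top}$$

A classic algorithm from this framework is the Weighted Regularized Matrix Factorization (WRMF) (Hu et al., 2008). Specifically, WRMF optimizes the following objective:

$$\min_{P, Q} \sum_{(u,i)} w_{ui} \left( r_{ui} - \hat{r}_{ui} \right)^2 + \lambda \left( \|P\|_F^2 + \|Q\|_F^2 \right)$$

where $r_{ui}$ is the ground-truth preference score of user $u$ for item $i$, $w_{ui}$ is the weight put on each observation, $\|\cdot\|_F$ denotes the Frobenius norm used to regularize the user and item matrices, and $\lambda$ is the regularization parameter. Note that with implicit feedback datasets, $r_{ui}$ is assumed to be in $\{0, 1\}$, where a click is treated as 1 and missing data from the rating matrix is treated as 0.

In a setting where negative feedback is available, a generalized framework has been proposed to incorporate negative feedback during training (Wen et al., 2019). The objective function can be written as:

$$\min_{P, Q} \; w_p \sum_{(u,i) \in \mathcal{P}} \left( r_{ui} - \hat{r}_{ui} \right)^2 + w_n \sum_{(u,i) \in \mathcal{N}} \left( r_{ui} - \hat{r}_{ui} \right)^2 + w_m \sum_{(u,i) \in \mathcal{M}} \left( r_{ui} - \hat{r}_{ui} \right)^2 + \lambda \left( \|P\|_F^2 + \|Q\|_F^2 \right)$$

where $\mathcal{P}$, $\mathcal{N}$ and $\mathcal{M}$ refer to positive, negative and missing user feedback, and $w_p$, $w_n$ and $w_m$ are the weights assigned to the corresponding sets of user feedback. The target $r_{ui}$ is 1 on $\mathcal{P}$ and 0 on $\mathcal{N}$ and $\mathcal{M}$. We highlight two types of weight schemas:

  • WRMF with Likes: $w_p, w_m > 0$, $w_n = 0$. This means that we only sample from the positive and missing feedback, while ignoring the negative feedback.

  • WRMF with Likes and Dislikes: $w_p, w_n > 0$, $w_m = 0$. This means that we only sample from positive and negative feedback, without the need to sample from missing data.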
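To make the two weight schemas concrete, the sketch below evaluates a weighted squared-error objective over positive, negative and missing feedback for a fixed set of latent factors. It is a plain-Python illustration rather than the OpenRec implementation used in the experiments; the function names, the toy factor values and the choice of unit weights for the non-zero terms are all assumptions.

```python
def predict(p_u, q_i):
    """Predicted preference score: dot product of a user and an item vector."""
    return sum(a * b for a, b in zip(p_u, q_i))


def wrmf_loss(P, Q, positives, negatives, missing, w_p, w_n, w_m, lam):
    """Weighted squared error over three feedback sets plus L2 regularization.

    P and Q map user and song ids to latent vectors; each feedback set is a
    list of (user, song) pairs. Targets are 1 for positives and 0 otherwise.
    """
    def squared_error(pairs, target):
        return sum((target - predict(P[u], Q[i])) ** 2 for u, i in pairs)

    data_term = (
        w_p * squared_error(positives, 1.0)
        + w_n * squared_error(negatives, 0.0)
        + w_m * squared_error(missing, 0.0)
    )
    reg_term = lam * (
        sum(x * x for vec in P.values() for x in vec)
        + sum(x * x for vec in Q.values() for x in vec)
    )
    return data_term + reg_term


# Toy latent factors with k = 2 (invented values).
P = {"u1": [1.0, 0.0]}
Q = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
positives, negatives, missing = [("u1", "a")], [("u1", "c")], [("u1", "b")]

# WRMF with Likes: the disliked song "c" contributes nothing to the loss.
likes_only = wrmf_loss(P, Q, positives, negatives, missing,
                       w_p=1, w_n=0, w_m=1, lam=0.0)

# WRMF with Likes and Dislikes: the disliked song "c" is penalized for its score of 0.5.
likes_and_dislikes = wrmf_loss(P, Q, positives, negatives, missing,
                               w_p=1, w_n=1, w_m=0, lam=0.0)
```

In this toy case the likes-only loss is zero even though the model gives a disliked song a score of 0.5, while the likes-and-dislikes loss is 0.25, so only the second schema receives a training signal from the dislike.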

4.2. Implementations

We implemented the WRMF algorithm based on the OpenRec library (Yang et al., 2018). We used the Adagrad optimizer with a learning rate of 0.01 and a batch size of 512 to train the model. The regularization parameter $\lambda$ is tuned on a validation set held out from the training set, and early stopping is performed. We used the same latent dimension $k$ for both user and item latent factors. The code for the experiments and the Piki Music Dataset are public (https://github.com/sstoikov/piki-music-dataset).

4.3. Performance metrics

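The evaluation dataset $D_{\text{eval}}$ can be segmented into $\mathcal{P}_{\text{eval}}$, $\mathcal{N}_{\text{eval}}$ and $\mathcal{M}_{\text{eval}}$, referring to positive, negative and missing user feedback. Popular metrics like Recall@$K$ and Precision@$K$ aim to quantify the relevance of playlists of length $K$. However, they implicitly assume that $\mathcal{N}_{\text{eval}}$ and $\mathcal{M}_{\text{eval}}$ are indistinguishable from each other.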

For a trained algorithm, all scores above a given threshold are classified as recommendations. Without loss of generality, we use the median of scores from model outputs as the threshold in our experiments. We use $S$ to denote the songs recommended by an algorithm from the evaluation set $D_{\text{eval}}$. The set of recommended songs from the well-known artists is $S_W$ and the set of recommended songs from the lesser-known artists is $S_L$, so that $S = S_W \cup S_L$. For each song in $S$, we have a corresponding binary rating in $\{0, 1\}$ from the consumers. For simplicity, we use $y_S$ to represent the vector of binary ratings on songs in $S$. Similarly, we have $y_{S_W}$ and $y_{S_L}$.

  • Consumers: the proportion of the recommended songs from the evaluation set that are actually liked by users: $\|y_S\|_1 / |S|$. The intuition is that a higher precision is aligned with a better user experience for the consumers.

  • Well-known artists: the proportion of recommendations coming from the well-known artists that are actually liked by users: $\|y_{S_W}\|_1 / |S_W|$. A higher precision leads to more effective song exposure for well-known artists.

  • Lesser-known artists: the proportion of recommendations coming from the lesser-known artists that are actually liked by users: $\|y_{S_L}\|_1 / |S_L|$. A higher precision leads to more effective song exposure for lesser-known artists.
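The three stakeholder metrics can be computed along the following lines. This is a sketch under stated assumptions rather than the evaluation code used for Table 1: the function name and the input layout are hypothetical, the evaluation rows are invented, and placing a popularity value exactly at the threshold in the lesser-known group is an assumption.

```python
from statistics import median


def stakeholder_precision(scored, popularity_threshold):
    """Compute like rates among recommended songs for each stakeholder.

    `scored` is a list of dicts with keys: score (model output), liked (0 or 1)
    and spotify_popularity. Songs scoring above the median are recommended.
    """
    threshold = median(row["score"] for row in scored)
    recommended = [r for r in scored if r["score"] > threshold]
    well_known = [r for r in recommended
                  if r["spotify_popularity"] > popularity_threshold]
    lesser_known = [r for r in recommended
                    if r["spotify_popularity"] <= popularity_threshold]

    def like_rate(rows):
        # Returns None when a group receives no recommendations.
        return sum(r["liked"] for r in rows) / len(rows) if rows else None

    return {
        "consumers": like_rate(recommended),
        "well_known": like_rate(well_known),
        "lesser_known": like_rate(lesser_known),
    }


# Invented evaluation rows for illustration.
scored = [
    {"score": 0.9, "liked": 1, "spotify_popularity": 80},
    {"score": 0.8, "liked": 0, "spotify_popularity": 70},
    {"score": 0.7, "liked": 1, "spotify_popularity": 20},
    {"score": 0.6, "liked": 1, "spotify_popularity": 30},
    {"score": 0.3, "liked": 0, "spotify_popularity": 90},
    {"score": 0.2, "liked": 1, "spotify_popularity": 10},
    {"score": 0.1, "liked": 0, "spotify_popularity": 40},
    {"score": 0.0, "liked": 0, "spotify_popularity": 60},
]
result = stakeholder_precision(scored, popularity_threshold=52)
```

Because every recommended song in the evaluation set carries an explicit like or dislike, each precision is a plain like rate and no assumption about unrated songs is needed.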

Model    Well-known artists (%)    Lesser-known artists (%)    Consumers (%)
Popularity (recommend well-known) -
Anti-popularity (recommend lesser-known) -
WRMF with Likes
WRMF with Likes and Dislikes
Table 1. Performance measured by precision for different stakeholders under WRMF models.

4.4. Results

We evaluate performance for the three stakeholders using the following models:

  • Popularity: A naive baseline that always recommends the more popular songs from well-known artists, i.e., the recommended set $S$ equals the set $S_W$ of songs from well-known artists.

  • Anti-popularity: A naive baseline that always recommends the less popular songs from lesser-known artists, i.e., the recommended set $S$ equals the set $S_L$ of songs from lesser-known artists.

  • WRMF with Likes: A matrix factorization algorithm trained on likes (Hu et al., 2008). It makes the assumption that missing data are negatives.

  • WRMF with Likes and Dislikes: A matrix factorization algorithm trained on likes and dislikes using the framework proposed in (Wen et al., 2019).

The Popularity recommender achieves a precision of 41.1% for consumers, while the Anti-popularity recommender results in a lower precision of 36.3%. By comparison, WRMF with Likes improves the consumer metric by 21.9% (Table 1). Moreover, we find that WRMF with Likes and Dislikes outperforms WRMF with Likes for all three stakeholders. The consumer metric was lifted by 18.9%, with an increase of 18.1% for well-known artists and 19.5% for lesser-known artists. This highlights the importance of binary feedback in improving the training and evaluation of recommenders.

5. Conclusion

We present the Piki Music Dataset and argue that it was collected in a way that addresses many of the biases of other publicly available datasets. More importantly, since the ratings are binary (in the form of likes and dislikes), we can define performance metrics for recommendation algorithms from the perspective of various stakeholders.

There are a few directions that we think future researchers using this dataset may want to explore. In the spirit of the Netflix challenge, we encourage researchers to test the accuracy of more advanced RecSys algorithms on the dataset. For example, it would be interesting to determine if a neural recommender performs better than matrix factorization for consumers, well-known artists or lesser-known artists. It may also be valuable to explore other metrics to measure the interests of the stakeholders in this study. Researchers may also be interested in other ways to segment the artist stakeholders, across genres or other metadata associated with the songs. We are particularly interested in modeling how the algorithm’s objectives are tied to the business objectives of other important stakeholders such as streaming platforms and record labels.


References

  • H. Abdollahpouri, G. Adomavicius, R. Burke, I. Guy, D. Jannach, T. Kamishima, J. Krasnodebski, and L. Pizzato (2019) Beyond personalization: research directions in multistakeholder recommendation. arXiv preprint arXiv:1905.01986.
  • H. Abdollahpouri, R. Burke, and M. Mansoury (2020) Unfair exposure of artists in music recommendation. arXiv preprint arXiv:2003.11634.
  • H. Abdollahpouri and S. Essinger (2017) Multiple stakeholders in music recommender systems. arXiv preprint arXiv:1708.00120.
  • J. Bennett, S. Lanning, et al. (2007) The Netflix prize. In Proceedings of KDD Cup and Workshop, Vol. 2007, pp. 35.
  • T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere (2011) The million song dataset.
  • B. Brost, R. Mehrotra, and T. Jehan (2019) The music streaming sessions dataset. In The World Wide Web Conference, pp. 2594–2600.
  • O. Celma and P. Cano (2008) From hits to niches?: or how popular artists can bias music recommendation and discovery. Proc. of the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition.
  • A. Ferraro, X. Serra, and C. Bauer (2021) What is fair? Exploring the artists’ perspective on the fairness of music streaming platforms. arXiv preprint arXiv:2106.02415.
  • F. M. Harper and J. A. Konstan (2015) The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5 (4). ISSN 2160-6455.
  • Y. Hu, Y. Koren, and C. Volinsky (2008) Collaborative filtering for implicit feedback datasets. ICDM ’08, USA, pp. 263–272. ISBN 9780769535029.
  • D. Jannach and C. Bauer (2020) Escaping the McNamara fallacy: towards more impactful recommender systems research. AI Magazine 41 (4), pp. 79–95.
  • D. Kowald, M. Schedl, and E. Lex (2019) The unfairness of popularity bias in music recommendation: a reproducibility study. arXiv preprint arXiv:1912.04696.
  • H. Lu, M. Zhang, and S. Ma (2018) Between clicks and satisfaction: study on multi-phase user preferences and satisfaction for online news reading. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, New York, NY, USA, pp. 435–444. ISBN 9781450356572.
  • M. Mansoury, R. Burke, and B. Mobasher (2021) Flatter is better: percentile transformations for recommender systems. ACM Transactions on Intelligent Systems and Technology (TIST) 12 (2), pp. 1–16.
  • R. Mehrotra, B. Carterette, Y. Li, Q. Yao, C. Gao, J. Kwok, Q. Yang, and I. Guyon (2020a) Advances in recommender systems: from multi-stakeholder marketplaces to automated recsys. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3533–3534.
  • R. Mehrotra, N. Xue, and M. Lalmas (2020b) Bandit based optimization of multiple objectives on a music streaming platform. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3224–3233.
  • M. Millecamp, N. N. Htun, Y. Jin, and K. Verbert (2018) Controlling Spotify recommendations: effects of personal characteristics on music recommender user interfaces. In Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization, pp. 101–109.
  • A. Paterek (2007) Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, Vol. 2007, pp. 5–8.
  • P. Sánchez and A. Bellogín (2018) Measuring anti-relevance: a study on when recommendation algorithms produce bad suggestions. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys ’18, New York, NY, USA, pp. 367–371. ISBN 9781450359016.
  • H. Wen, L. Yang, and D. Estrin (2019) Leveraging post-click feedback for content recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19, New York, NY, USA, pp. 278–286. ISBN 9781450362436.
  • L. Yang, E. Bagdasaryan, J. Gruenstein, C. Hsieh, and D. Estrin (2018) OpenRec: a modular framework for extensible and adaptable recommendation algorithms. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 664–672.