Representation, Exploration and Recommendation of Music Playlists

07/01/2019
by   Piyush Papreja, et al.
Arizona State University

Playlists have become a significant part of our listening experience because of the digital cloud-based services such as Spotify, Pandora, Apple Music. Owing to the meteoric rise in the usage of playlists, recommending playlists is crucial to music services today. Although there has been a lot of work done in playlist prediction, the area of playlist representation hasn't received that level of attention. Over the last few years, sequence-to-sequence models, especially in the field of natural language processing, have shown the effectiveness of learned embeddings in capturing the semantic characteristics of sequences. We can apply similar concepts to music to learn fixed length representations for playlists and use those representations for downstream tasks such as playlist discovery, browsing, and recommendation. In this work, we formulate the problem of learning a fixed-length playlist representation in an unsupervised manner, using Sequence-to-sequence (Seq2seq) models, interpreting playlists as sentences and songs as words. We compare our model with two other encoding architectures for baseline comparison. We evaluate our work using the suite of tasks commonly used for assessing sentence embeddings, along with a few additional tasks pertaining to music, and a recommendation task to study the traits captured by the playlist embeddings and their effectiveness for the purpose of music recommendation.


1 Introduction

In this age of cloud-based music streaming services such as Spotify [1], Pandora [2], and Apple Music [3], with millions of songs at their fingertips, users have grown accustomed to a) immediate fulfillment of their music demands, and b) an extended experience [4]. Recommendation engines serve the first aspect of this change in user behavior by helping users find new music based on their preferences. Playlists handle the second aspect, the need for an extended experience, which is achieved by sustaining the mood of the songs in a playlist. For example, Spotify has over two billion playlists [5] created for every kind of mood (sad, happy, angry, calm, etc.), activity (running, workout, studying, etc.), and genre (blues, rock, pop, etc.).

Over the past couple of years, the playlist recommendation task has become synonymous with playlist prediction/creation and continuation rather than playlist discovery. However, playlist discovery forms a significant part of the overall playlist recommendation pipeline, as it is an effective way to help users discover existing playlists on the platform by leveraging nearest-neighbor techniques. The aim of this work is to create an end-to-end pipeline for learning playlist embeddings which can be directly used for recommendation purposes. In the past few years, sequence-to-sequence learning [6] has been widely used to learn effective sentence embeddings in applications like neural machine translation [7]. We make use of the relationship playlist:songs :: sentences:words, and take inspiration from research in natural language processing to model playlist embeddings the way sentences are embedded.

1.1 Why playlist embeddings?

Current research pertaining to playlists is in the areas of automatic playlist generation [8] [9] [10] and continuation [11] [12] [13]. Multiple solutions have been proposed to address these problems, such as reinforcement learning [14] and Recurrent Neural Network-based models [4] for playlist generation and playlist continuation tasks. However, great success has been achieved in the field of natural language processing using the power of learned embeddings. These fixed-length embeddings are easier to use and manipulate, and can be used for tasks such as machine translation and query-and-search. A case can be made for using similar methods for music playlists, as the semantic properties captured can be leveraged to provide good-quality recommendations. Playlist embeddings can be easily integrated with other sources of information such as a word2vec [15] model, content-analysis-based models [16], or a combination of both, thus providing multi-modal recommendation. Another use case for projecting playlists onto an embedding space is easier browsing through the entire corpus (as MusicBox [17] does for songs). The space can be used to discover playlists that do not fit exactly into one genre and are hence difficult to find using the conventional query-and-search method. Lastly, variational sequence-to-sequence models [18], along with added contextual information, can be used for generating playlists from the embedding space [19].

1.2 Why unsupervised learning?

One of the major challenges when working with playlists is the lack of labeled data. Natural language processing has many popular supervised datasets, such as SNLI [20], Microsoft's paraphrase detection dataset [21], and the SICK dataset [22], that are used to learn sentence embeddings that capture distinct discriminative characteristics which would not be possible without the labeled data. In the absence of annotated playlist datasets, we resort to unsupervised learning to model playlist representations.

1.3 Our Contribution

In this work, we learn playlist embeddings using unsupervised learning and perform a comprehensive analysis of the embeddings learned by different encoders. We consider two kinds of models for this work: a) Seq2seq models and b) Bag of Words (BoW) models. We compare different embedding sizes, and architectures for each of the models, exploring their effect on the quality of resulting representation. We evaluate the embedding models by testing for the extent of genre information captured in the playlist embedding, along with other standard sentence-based characteristics such as length, order, content information. We also evaluate the effectiveness of the proposed embedding models using a recommendation task. To the best of our knowledge, our work is the first attempt at modeling and extensively analyzing compact playlist representations for playlist recommendation purposes, with inspirations from natural language processing.

2 Related Work

With regard to this work, natural language processing and music have important similarities. Both have sequential structure in their constituent parts: words in a sentence are akin to audio segments in a song or songs in a playlist. Both have semantic relationships between the elements of the sequence. Due to these similarities, many works employ techniques from natural language processing by translating the problem at hand into one already solved there, like [23], which uses this analogy for evaluating automatically generated playlists.
Embedding models are often used alongside the aforementioned approach in Music Information Retrieval (MIR) to project the data onto a compact space. Zheleva et al [24] create a statistical model for capturing user taste using Latent Dirichlet Allocation (LDA) [25]. Collaborative Filtering (CF) [26] is a widely used method for recommendation, but it suffers from shortcomings: it does not consider the order of items in the list, and it offers no way to adjust the search results based on the query [27]. Similar to CF, there have been many neural network based works [27] [28] [29] which project the corpus and the user profiles to a low-dimensional vector space and then recommend items based on cosine similarity between the query and the corpus items. For this work, however, we are concerned with works which do not have users in the loop. Volkovs et al [12] create a playlist embedding, albeit with a task-tailored objective function for automatic playlist continuation. Yang et al [13] use a custom autoencoder with the aim of making it easier to include multiple modalities in the input. Ludewig et al [30] use tf-idf [31] to create playlist embeddings. Aalto [32] uses eigenvectors from the playlist to create a representation and then compares playlists using cosine similarity. However, one of the major limitations of this approach is that it doesn't take into account the order of items in the playlist.


We also briefly look at the work done on seq2seq learning, since we apply this technique in our work. Cho et al [33] used an RNN-based network to create a fixed-length representation of variable-length source and target sequences. However, vanilla RNNs struggle to capture long-term dependencies. Sutskever et al [6] improved on this by using LSTM units instead of vanilla RNN units, and by using separate networks for the encoder and decoder to increase the model capacity. A further improvement was made in [7], which enabled the model to translate even longer sequences by introducing the attention mechanism, a technique that lets the decoder make use of the source sequence when making each output prediction.
Our work differs from previous works in that our aim is to leverage the playlist embeddings to recommend existing playlists, making playlists the prime recommendation items. In addition, we focus on creating a framework of evaluation tasks for playlist embeddings which, being source agnostic, can be used to evaluate other models in the future as well.

3 Sequence to Sequence Learning

Here we briefly describe the RNN Encoder-Decoder framework, proposed first in [33] and later improved in [6], upon which our model is based. Given a sequence of input vectors $x = (x_1, \dots, x_T)$, the encoder reads this sequence and outputs a vector called the context vector. The context vector represents a compressed version of the input sequence, which is then fed to the decoder, which predicts tokens of the target sequence. One of the significant limitations of this approach was that the model was not able to capture long-term dependencies for relatively longer sequences [34]. This problem was partially mitigated in [6] by using LSTM [35] units instead of vanilla RNN units and by feeding the input sequence in reversed order.
Bahdanau et al [7] introduced the attention mechanism to address this problem; it focuses on a specific portion of the input sequence when predicting the output at a particular time step. The attention mechanism ensures the encoder doesn't have to encode all the information into a single context vector. In this setting, the context vector $c_i$ is calculated as a weighted sum of the $T$ encoder hidden states $h_1, \dots, h_T$:

$$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j \qquad (1)$$

where $\alpha_{ij}$ is calculated as follows:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})} \qquad (2)$$

where $e_{ij} = a(s_{i-1}, h_j)$, $s_{i-1}$ is the decoder state at time step $i-1$, and $h_j$ is the encoder state at time step $j$. $a$ is the alignment model, which scores how well the output at time step $i$ aligns with the input at time step $j$. The alignment model $a$ is a shallow feed-forward neural network which is trained along with the rest of the network.
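The attention step described above (alignment scores, a softmax over encoder positions, then a weighted sum of encoder states) can be illustrated with a minimal NumPy sketch. The additive scoring form and the parameter names `W_s`, `W_h`, and `v` are assumptions made for illustration; the text only states that the alignment model is a shallow feed-forward network.

```python
import numpy as np

def additive_attention(s_prev, H, W_s, W_h, v):
    """Return (context, weights) for a single decoder step.

    s_prev: previous decoder state, shape (d_s,)
    H:      encoder states stacked row-wise, shape (T, d_h)
    W_s, W_h, v: parameters of a hypothetical one-layer alignment network
    """
    # One alignment score per encoder position j: v^T tanh(W_s s + W_h h_j)
    scores = np.tanh(s_prev @ W_s + H @ W_h) @ v
    # Softmax over encoder positions (shifted by the max for stability)
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # Context vector: attention-weighted sum of the encoder states
    context = weights @ H
    return context, weights

# Toy dimensions and random parameters, purely for illustration
rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 4, 3, 6
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
W_s = rng.normal(size=(d_s, d_a))
W_h = rng.normal(size=(d_h, d_a))
v = rng.normal(size=d_a)
context, weights = additive_attention(s_prev, H, W_s, W_h, v)
```

The weights are non-negative and sum to one, so the context vector always lies in the convex hull of the encoder states.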

4 Embedding Models

We consider variants of two main kinds of models for learning playlist representations, which we present in this section:

4.1 Bag-of-Words Models (BoW):

BoW models create a representation of the sequence by simply aggregating the constituent items representations. They don’t take into account the order of the items in the sequence. We use these models for baseline comparison.

  1. Base Bag-of-Words Model (base-BoW): Given a sentence consisting of words $w_1, \dots, w_n$, the sentence embedding is calculated as the arithmetic mean of its constituent word embeddings. The effectiveness of this approach coupled with the simplicity of computation makes it a very competitive baseline for comparison.

  2. Weighted Bag-of-Words Model (weighted-BoW): Introduced in [36], it uses a weighted averaging scheme to get the sentence embedding vectors, followed by their modification using singular-value decomposition (SVD). This method of generating sentence embeddings proves to be a stronger baseline than traditional averaging.
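The two baselines can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the configuration used in the experiments: the weight `a / (a + p(song))`, the default `a=1e-3`, and the removal of the first singular direction are assumptions about how the weighted scheme of [36] works, and the toy songs and frequencies are made up.

```python
import numpy as np

def base_bow(song_vectors):
    """Playlist embedding as the arithmetic mean of its song embeddings."""
    return np.mean(song_vectors, axis=0)

def weighted_bow(playlists, song_emb, song_prob, a=1e-3):
    """Frequency-weighted average per playlist, followed by an SVD step.

    The weight a / (a + p(song)) and the default value of a are assumptions.
    """
    X = np.stack([
        np.mean([a / (a + song_prob[s]) * song_emb[s] for s in p], axis=0)
        for p in playlists
    ])
    # First right singular vector of the matrix of playlist embeddings
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]
    # Subtract each playlist's projection onto that dominant direction
    return X - np.outer(X @ u, u)

# Hypothetical toy corpus: three songs with 2-d embeddings
song_emb = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.0, 1.0]),
    "c": np.array([1.0, 1.0]),
}
song_prob = {"a": 0.5, "b": 0.3, "c": 0.2}  # made-up relative frequencies
playlists = [["a", "b"], ["b", "c"], ["a", "c"]]

mean_emb = base_bow([song_emb[s] for s in playlists[0]])
weighted_emb = weighted_bow(playlists, song_emb, song_prob)
```

Neither function looks at the position of a song in the playlist, which is why these models cannot encode song order.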

4.2 Seq2seq Models (Seq2seq):

These models are based on the RNN Encoder-Decoder framework discussed in Section 3. Given a sequence of words $w_1, \dots, w_T$, an RNN encoder generates hidden states $h_1, \dots, h_T$. The attention mechanism takes in these hidden states along with the decoder states and outputs context vectors $c_i$, which are then fed to the decoder to predict the outputs $y_i$.

  1. Base Unidirectional Encoder (base-Seq2seq): We use a unidirectional RNN and the global attention model [37] as our base seq2seq model.

  2. Bidirectional Encoder (bi-Seq2seq): For this model, the encoder hidden state $h_t$ at time step $t$ is the concatenation of the hidden states of a forward RNN and a backward RNN that read the sequence in opposite directions. We use global attention for this model as well.

5 Experimental Setup

5.1 Corpus

5.1.1 Data Source

We created the corpus by downloading publicly available playlists from Spotify using the Spotify developer API [38]. We downloaded 1 million playlists from Spotify, consisting of both user-created playlists and Spotify-curated ones (data statistics are given in the supplementary material).

5.1.2 Data Filtering

As part of cleaning up the data before training, we follow [39] in discarding the less frequent and less relevant items from our dataset. Tracks occurring in fewer than 3 playlists are discarded. Playlists with less than 30% of their tracks left after this step are also removed. All duplicate tracks within playlists are removed. Finally, only playlists with lengths in the range [10-5000] are retained and the rest are discarded. This leaves us with a total of 745,543 unique playlists, 2,470,756 unique tracks, and 2680 unique genres.
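The filtering rules above can be sketched as a single pass over the corpus. The order of operations here (de-duplicating first, then measuring playlist length after rare tracks are dropped) is an assumption, since the text does not pin it down; the function and parameter names are illustrative.

```python
from collections import Counter

def filter_corpus(playlists, min_track_playlists=3, min_kept_fraction=0.3,
                  min_len=10, max_len=5000):
    """Apply the track-frequency, retention, and length filters."""
    # Remove duplicate tracks within each playlist, preserving order
    deduped = [list(dict.fromkeys(p)) for p in playlists]
    # Number of playlists each track occurs in
    counts = Counter(t for p in deduped for t in p)
    kept = []
    for p in deduped:
        if not p:
            continue
        # Drop tracks that occur in too few playlists
        frequent = [t for t in p if counts[t] >= min_track_playlists]
        # Drop playlists that lost too many of their tracks
        if len(frequent) < min_kept_fraction * len(p):
            continue
        # Keep only playlists whose remaining length is in range
        if min_len <= len(frequent) <= max_len:
            kept.append(frequent)
    return kept

# Toy corpus with a lowered length threshold so the example stays small
toy = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "b", "e"], ["x", "y", "z"]]
filtered = filter_corpus(toy, min_len=2)
```

In the toy corpus only tracks "a" and "b" occur in three playlists, so the fourth playlist loses all of its tracks and is removed.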

5.1.3 Data Labeling: Genre Assignment

The songs in our dataset do not have genre labels; however, artists do. Despite there being a 1:1 mapping between artist and song, we do not use the artist genres directly as song genres because:

  • An artist can have songs of different genres.

  • Since genres are subjective in their nature (rock v/s soft-rock v/s classic rock), having such high number of genres for songs would result in an ambiguity between the genres with respect to empirical evaluation (classification) and add to the complexity of the problem.

Hence, we aim to bring down the number of genres such that they are relatively mutually disjoint (such as rock, electronic, classical etc.)

To achieve this, we train a word2vec model [15] on our corpus to get song embeddings, which capture the semantic characteristics (such as genre) of the songs by virtue of their co-occurrence in the corpus (for details about the word2vec configuration, please refer to the supplementary material). Separate models are trained for embedding sizes $k \in \{500, 750, 1000\}$. For each embedding size, the resulting song embeddings are then clustered into 200 clusters. This number was chosen to get the maximum feasible number of clusters (and the minimum number of songs per cluster) while keeping the annotation effort feasible. For each cluster:

  • Artist genre is applied to each corresponding song and a genre-frequency (count) dictionary is created. A sample genre-count dictionary for cluster with 17 songs would look like {rock: 5, indie-rock:3, blues: 2, soft-rock: 7}

  • From this dictionary, the genre having a clear majority is assigned as the genre for all the songs in that cluster. This was a subjective decision; for example, a dictionary of {rock: 5, indie-rock: 3, blues: 2, soft-rock: 7} is assigned the genre rock.

  • All the songs in a cluster with no clear genre majority are discarded for annotation.
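The per-cluster annotation steps above can be sketched as follows. The authors describe the majority decision as subjective, so the numeric threshold `min_share` and the sub-genre-to-parent mapping `PARENT` are hypothetical stand-ins, not the rule actually used.

```python
from collections import Counter

# Hypothetical mapping of sub-genres to parent genres
PARENT = {"indie-rock": "rock", "soft-rock": "rock"}

def cluster_genre(genre_counts, parent=PARENT, min_share=0.5):
    """Return the majority parent genre of a cluster, or None if unclear."""
    merged = Counter()
    for genre, n in genre_counts.items():
        # Fold each sub-genre count into its parent genre
        merged[parent.get(genre, genre)] += n
    top, top_n = merged.most_common(1)[0]
    # Accept only if the top genre exceeds the (assumed) share threshold
    return top if top_n / sum(merged.values()) > min_share else None

sample = {"rock": 5, "indie-rock": 3, "blues": 2, "soft-rock": 7}
label = cluster_genre(sample)
unclear = cluster_genre({"rock": 2, "blues": 2, "latin": 2})
```

With the sample dictionary from the text, the rock family accounts for 15 of 17 songs, so the cluster is labeled rock; the evenly split cluster gets no label and would be discarded.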

Based on the observed genre distribution in the data, and as a result of clustering sub-genres (such as soft-rock) into parent genres (such as rock), the genres finally chosen for annotating the clusters are: Rock, Metal, Blues, Country, Classical, Electronic, Hip Hop, Reggae and Latin. To validate our approach, we train a classifier on our dataset of annotated song embeddings. With the training and test sets kept separate, we achieve 94% test accuracy (result for embedding size 750; comparable results are achieved for other sizes). We also generate a t-SNE plot of the annotated songs to further validate our approach, as shown in Figure 1.
For playlist-genre annotation, only playlists having all of their songs annotated are considered. This leaves us with 339,998 playlists in total. Further, only those playlists are assigned genres for which more than 70% of the songs agree on a genre, resulting in 35,527 genre-annotated playlists.
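The playlist-level rule (all songs annotated, and more than 70% of them agreeing on one genre) is simple enough to sketch directly. The function name and the toy song identifiers are illustrative.

```python
from collections import Counter

def playlist_genre(songs, song_genre, min_agreement=0.7):
    """Return the playlist genre, or None if the playlist is not annotatable."""
    # Every song in the playlist must carry a genre annotation
    if not songs or any(s not in song_genre for s in songs):
        return None
    genre, n = Counter(song_genre[s] for s in songs).most_common(1)[0]
    # Strictly more than 70% of the songs must agree on the genre
    return genre if n / len(songs) > min_agreement else None

# Toy annotations: songs s0..s7 are rock, s8..s9 are blues
song_genre = {f"s{i}": "rock" for i in range(8)}
song_genre.update({"s8": "blues", "s9": "blues"})

agreed = playlist_genre([f"s{i}" for i in range(10)], song_genre)  # 8 of 10 rock
split = playlist_genre(["s0", "s1", "s8", "s9"], song_genre)       # 2 of 4 rock
missing = playlist_genre(["s0", "unknown"], song_genre)            # unannotated song
```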

Figure 1: t-SNE plot for genre-annotated songs for embedding size 750, with 1000 sampled songs for each genre

5.2 Training: Configuration

  1. weighted-BoW Model: For the base configuration of the model, we fix the value of the weighting parameter that controls the weight given to each word in the corpus. We then experiment with a range of other values for this parameter.

  2. Seq2seq Models: All of our Seq2seq encoders use a 3-layer network, with the hidden state size $k$ as the controlled parameter. We experimented with both LSTM and GRU units. We tried both the Adam and SGD optimizers; SGD performed generally worse than Adam, hence the details for SGD are not included. We also set the maximum gradient norm to 1 to prevent exploding gradients.

5.3 Training: Task

We train seq2seq models on our dataset as autoencoders where the target sequence is the same as the source sequence (in our case, a playlist), and the goal of the model is to reconstruct the input sequence. For more training details, please refer to the supplementary material.

Figure 2: Evaluation (embedding probing) task results with respect to the encoder hidden state size. Missing bars are for cases where it was not possible to train an embedding of that size due to memory constraints.

6 Evaluation Tasks

Ideally, a good playlist embedding should encode information about the genre of the songs it contains, the order of the songs, the length of the playlist, and the songs themselves. For details about the experimental setup for evaluation, please refer to the supplementary material. We divide our experiments into two parts:

  1. Embedding Probing Tasks: The first part examines the effectiveness of the embeddings in encoding information such as genre, length, song-order etc.

  2. Recommendation Task: The second part evaluates the quality of embeddings for the purpose of recommendation.

6.1 Embedding Probing Tasks

6.1.1 Genre-Related Tasks

  • Genre Prediction Task (GPred-Task)

    This task measures to what extent a playlist embedding encodes the genre-related information of the songs it contains. Given a playlist embedding, the goal of the classifier is to predict the correct genre for the playlist. The task is formulated as multi-class classification, with nine output classes. Training samples are weighted-by-class as the dataset is skewed with the majority class (electronic) having 18,138 samples while the minority class (classical) just having 75 samples.


  • Genre Diversity Prediction Task (GDPred-Task)
    This task measures the extent to which the playlist embedding captures the homogeneity/diversity of the songs it contains with regard to their genre. Given a playlist embedding, the goal of the classifier is to predict the number of genres spanned by the songs in that playlist. The task is also formulated as multi-class classification, with 3 output classes: low diversity (up to 3 genres), medium diversity (4-6 genres), and high diversity (more than 6 genres).
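A short sketch of how labels and class weights for these genre tasks might be derived. The diversity boundaries follow the genre-count statistics given in the supplementary material; the inverse-frequency weighting formula is an assumption, since the text only says that training samples are weighted by class.

```python
from collections import Counter

def diversity_class(song_genres):
    """Map the number of distinct genres in a playlist to a diversity class."""
    n = len(set(song_genres))
    if n <= 3:
        return "low"
    if n <= 6:
        return "medium"
    return "high"

def balanced_class_weights(labels):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    counts = Counter(labels)
    total, k = len(labels), len(counts)
    return {c: total / (k * n) for c, n in counts.items()}

low = diversity_class(["rock", "rock", "blues"])
high = diversity_class(["rock", "metal", "blues", "country",
                        "latin", "reggae", "classical"])
# A skewed toy label set: the rare class receives the larger weight
class_weights = balanced_class_weights(["electronic"] * 3 + ["classical"])
```

Under this weighting, a class that is three times rarer contributes three times more per sample to the loss, which counteracts skew such as that between the electronic and classical classes.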

6.1.2 Playlist Length Prediction Task (PLen-Task)

This task measures to what extent the playlist embedding encodes its length. Given a playlist embedding, the goal of the classifier is to predict the length (number of songs) of the original playlist. Following [40], the task is formulated as multi-class classification, with ten output classes (spanning the range [30-250]) corresponding to equal-width length bins. Training samples are class-weighted as the dataset is unbalanced, with the majority class (lengths 30-50 songs) having 78,015 samples and the minority class (230-250 songs) having just 1098 samples.

6.1.3 Song-content Task (SC-Task)

The Song Content (SC) task closely follows the Word Content (WC) task described by [41], which tests whether it is possible to recover information about the original words in a sentence from its embedding. We pick 750 mid-frequency songs (the middle 750 songs in our corpus of songs sorted by their occurrence count), and sample equal numbers of playlists that contain one and only one of these songs. We formulate it as a 750-way classification problem where the aim of the classifier is to predict which of the 750 songs a playlist contains, given the playlist embedding.

6.1.4 Song-order Related Tasks

  • Bigram-Shift Task (BShift-Task)
    Text sentences are governed by the grammatical rules of the language. These rules govern the existence of certain bigrams (e.g. will do, will go) as well as the absence of others (e.g. try will). In natural language processing, the Bigram Shift (BShift) experiment, introduced in [40], measures the extent to which word-order information is encoded in sentence embeddings. The task is formulated as a binary classification problem, where the aim of the classifier is to distinguish original sentences from the corpus from sentences in which two adjacent words have been inverted. We apply the same formulation to playlists, inverting two adjacent songs.

  • Permute Classification Task
    Through this task we aim to answer the question: can the proposed embedding models capture song order, and if so, to what extent? We split this task into two sub-tasks: i) the Shuffle task, and ii) the Reversal task. In the Shuffle task, for each playlist in our task-specific dataset (for details of the dataset and the experiment, please refer to the supplementary material), we randomly shuffle a fraction of the songs in that playlist. We shuffle the playlists in two ways: a) by selecting a random block in the playlist and shuffling just that block (shuffle type-1), and b) by randomly selecting songs from the playlist and shuffling them (shuffle type-2). We then train a binary classifier to distinguish between an original and a permuted playlist. The Reversal task is similar to the Shuffle task, except that the randomly selected sub-sequence of songs is reversed instead of shuffled. We further extend both tasks with a slight modification: a playlist that appears in permuted form is not also included in its original form, and vice versa, so that the dataset contains no complementary pairs.
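The three permutation schemes described above (block shuffle, scattered shuffle, block reversal) can be sketched with the standard library. The helper names, the minimum block size of two songs, and the rounding of the permuted fraction are assumptions made for illustration.

```python
import random

def _block_bounds(n, fraction, rng):
    """Pick a random contiguous block covering roughly `fraction` of n items."""
    k = min(n, max(2, round(fraction * n)))
    start = rng.randrange(n - k + 1)
    return start, start + k

def shuffle_block(playlist, fraction, rng):
    """Shuffle type-1: shuffle the songs inside one random contiguous block."""
    start, end = _block_bounds(len(playlist), fraction, rng)
    block = playlist[start:end]
    rng.shuffle(block)
    return playlist[:start] + block + playlist[end:]

def shuffle_scattered(playlist, fraction, rng):
    """Shuffle type-2: pick random positions and shuffle the songs among them."""
    n = len(playlist)
    k = min(n, max(2, round(fraction * n)))
    positions = sorted(rng.sample(range(n), k))
    songs = [playlist[i] for i in positions]
    rng.shuffle(songs)
    out = list(playlist)
    for i, s in zip(positions, songs):
        out[i] = s
    return out

def reverse_block(playlist, fraction, rng):
    """Reversal task: reverse one random contiguous block."""
    start, end = _block_bounds(len(playlist), fraction, rng)
    return playlist[:start] + playlist[start:end][::-1] + playlist[end:]

rng = random.Random(0)
playlist = [f"song{i}" for i in range(10)]
type1 = shuffle_block(playlist, 0.5, rng)
type2 = shuffle_scattered(playlist, 0.5, rng)
rev = reverse_block(playlist, 1.0, rng)
```

A random shuffle can occasionally return the original order, so a real dataset builder would likely re-draw in that case to avoid mislabeled examples.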

6.2 Recommendation Task

This task is set up to evaluate our proposed approach for recommendation purposes by measuring the extent to which the playlist space created by the embedding models is relevant, in terms of the similarity of genre and length information of closely-lying playlists.
We populate an embedding space with the playlist embeddings and index it for approximate nearest-neighbor search using Spotify's ANNOY library [42]. A playlist is randomly selected in the space, and the returned search results are compared with the queried playlist in terms of genre and length information. There are nine possible genre labels, as in the GPred-Task. For comparing length, classes are created as described in Section 6.1.2. An average of 100 precision values for each query is considered. High precision signifies that the embeddings capture information relevant to playlist recommendation.
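The precision measurement for a single query can be sketched as below. Exact brute-force cosine search in NumPy is swapped in for the approximate index so that the example is self-contained; the function name and the toy embeddings are illustrative.

```python
import numpy as np

def precision_at_k(embeddings, labels, query_idx, k):
    """Fraction of the k nearest playlists that share the query's label."""
    # Normalize rows so that the dot product equals cosine similarity
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X[query_idx]
    # Exclude the query itself from its own result list
    sims[query_idx] = -np.inf
    top = np.argsort(-sims)[:k]
    return float(np.mean(labels[top] == labels[query_idx]))

# Toy space: three "rock" playlists near one axis, two "latin" near the other
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                       [0.0, 1.0], [0.1, 0.9]])
labels = np.array(["rock", "rock", "rock", "latin", "latin"])

p_rock = precision_at_k(embeddings, labels, query_idx=0, k=2)
p_latin = precision_at_k(embeddings, labels, query_idx=3, k=2)
```

The same function works for the length comparison by passing length-class labels instead of genre labels.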

7 Results

Model                 GPred-Task   GDPred-Task   PLen-Task   SC-Task
base-BoW Model        96.8         79            34          39.6
weighted-BoW Model    97.5*        80.05*        33.4        44.3*
base-Seq2seq Model    76.6         75.8          70.7        15.3
bi-Seq2seq Model      84.9         76.2          71.9*       21.7
Table 1: Evaluation (Embedding Probing) task accuracies for the embedding models. The best result for each task is marked with an asterisk.

The results for the embedding probing tasks are presented in Table 1. In this section we provide a detailed description of our experimental results along with their analysis and insights. For each of the discussed tests (genre, length, and content), we investigate the performance of different embedding models across multiple embedding lengths. Results are shown in Figure 2.

7.1 Embedding Probing Task Results

7.1.1 Genre Task Results

For the GPred-Task, BoW models outperform the Seq2seq models. This can be attributed to the reasoning that the playlist vector created by averaging the constituent songs is embedded in the space of the songs as their centroid. Since the genre of the playlist is the genre of its songs, the BoW models outperform the seq2seq at genre prediction. For the Seq2seq models, the performance appears to improve with increasing RNN hidden state size in the encoder, as larger embedding sizes generally have more space to encode sequence information. In the GDPred-Task as well, BoW-based models perform better than the Seq2seq models with the weighted-BoW model achieving 80% accuracy while Seq2seq models achieve an accuracy of 76%.

7.1.2 Length Task Results

For the PLen-Task, the Seq2seq models perform quite well, achieving close to 72% accuracy while the BoW models perform quite poorly managing just around 35% accuracy. The performance of the Seq2seq models doesn’t come as any surprise as it has been widely studied that seq2seq models are able to capture such characteristics of the sentences. Poor performance of the BoW models however is indeed surprising as BoW models have been shown to perform comparatively better on this task [40] [41].

7.1.3 Song Content Task Results

For the SC-Task, the seq2seq models performed poorly compared to the BoW models. However, our results closely match the results for the WC Task for the unsupervised models in [41]. The authors cite the inability of the model to capture the content-based information due to the complexity of the way the information is encoded by the model.

Figure 3: Classification accuracy vs Permute proportion for the Permute-classification task

7.1.4 Song Order Tasks Results

For the BShift-Task, we get an accuracy of about 49% for all the embedding models (both BoW and seq2seq), meaning the classifier is unable to distinguish the original playlists from the bigram-inverted ones. For the Permute Classification Task, as seen in Figure 3, the base Seq2seq model is able to correctly distinguish the permuted playlists from the original playlists as the proportion of the permutation is increased. Even for our extension of the tasks, where complementary playlist pairs are not added to the dataset, the classifier can still distinguish between the original and the permuted playlists. BoW models, on the other hand, cannot distinguish between the original and permuted playlists, making seq2seq models better at capturing the song order in a playlist.

7.2 Recommendation Task Results

The Recommendation task, as shown in Fig. 4, captures some interesting insights about the effectiveness of different models for capturing different characteristics:

  • High precision values demonstrate the relevance of the playlist embedding space in terms of playlist-genre and length similarity of closely-lying playlists, which is the first and foremost expectation from a recommendation system.

  • BoW models capture genre information better than seq2seq models, confirming our genre-related findings from Section 7.1.1. They perform better because BoW playlist embeddings, calculated as the arithmetic mean of song embeddings, lie in the song space where genre annotation happens.

  • Length information is captured better by the Seq2seq models, confirming our length-related findings from Section 7.1.2.

  • The skewness of the graphs (especially for length-recommendation) can be attributed to the imbalance in the dataset.

Figure 4: Precision-recall graph plotted for Genre and Length Recommendation task
BoW Seq2seq
Genre Information ✓✓
Song Information ✓✓
Length Information ✓✓
Order Information X ✓✓
Compute Complexity Required ✓✓
Table 2: Encoder models performance comparison table. ✓✓indicates better performance of the model compared to the other model. X indicates inability of the model to perform the task.

8 Results: Making Sense of Everything

As shown in Table 2, the evaluation task results capture some interesting details about the nature and capabilities of the models we use for this work to create playlist embeddings. BoW models perform quite well for genre and content-related tasks whereas seq2seq models capture the length and song-order information better. Going by these observations, we can use an ensemble of BoW and seq2seq models for building a playlist recommendation engine, having the BoW model create an initial subset of recommendable playlists based on genre information from the corpus, and seq2seq models refine that list to find the most similar playlists in terms of song-order and length information.
Another important point to consider is that the seq2seq model would not work for playlists consisting entirely of songs from outside the corpus on which the model is trained. The reason is that the current model derives the semantic characteristics of the songs and the playlists based solely on the co-occurrence of the songs in the corpus. To mitigate this problem, additional information needs to be injected into the system, such as audio content information, lyric-based information, or a combination of both, so that the encoder can learn semantic characteristics of songs independent of the neighborhood in which they occur.

9 Conclusion

We have presented a seq2seq based approach for learning playlist embeddings, which can be used for tasks such as playlist discovery and recommendation. First, we define the problem of learning a playlist-embedding and describe how we formulate it as a seq2seq-based problem. We compare our proposed model with two BoW-based models on a number of tasks chosen from the field of natural language processing (like sentence length prediction, bi-gram shift experiment etc.) as well as music (genre prediction and genre-diversity prediction). We show that our proposed approach is effective in capturing the semantic properties of playlists. We also evaluate our approach using a recommendation task, showing the effectiveness of the learned embeddings for recommendation purposes. Our approach can also be extended for learning even better playlist-representations by integrating content-based (lyrics, audio etc.) song-embedding models, and for generating new playlists by using variational sequence models.

References

1 Data Statistics

The data distribution roughly follows Zipf's law. A Zipf plot of our corpus is shown in Figure 1.

Figure 1: Zipf plot for the corpus. Log scale used for frequency and rank

1.1 Playlists Length Statistics

More than half of the playlists (401,880) are of length less than 50 and another 246,427 (33%) of the playlists have lengths in the range [50-150]. Statistics related to the length of the playlists are given in Table 1.

Playlist Corpus Length statistics
Mean Std Median Min. Max.
83.3 133 45 10 5000
Table 1: Corpus Length Statistics

1.2 Genre Homogeneity/Diversity Statistics

Out of 745,543 playlists, 49,164 have all of their songs genre-annotated. Of these 49,164 playlists, 24,422 (49%) have at most 3 genres in total across all of their songs, 23,162 (47%) have more than 3 and at most 6 genres, and 1,580 (3%) have more than 6 genres.

2 Word-2-vec: Set up and Training

We train a word2vec model on our corpus to obtain the song embeddings, using the Gensim [43] implementation. We use the Skip-gram algorithm with negative sampling, with the number of negative samples set to 5 and a window size of 5. The minimum occurrence threshold for words is set to 5. We train word vectors of sizes k ∈ {500, 750, 1000}.
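As a rough sketch, the settings above map onto Gensim's `Word2Vec` constructor as shown below. The parameter names follow Gensim 4.x (where the dimensionality argument is `vector_size`); the exact call used in this work is not given in the text, and the helper name `train_song_embeddings` and the representation of playlists as lists of song-id strings are assumptions for illustration.

```python
# Hyperparameters as stated in the text; keyword names follow Gensim >= 4.0.
WORD2VEC_PARAMS = {
    "sg": 1,         # 1 selects the Skip-gram algorithm
    "negative": 5,   # number of negative samples
    "window": 5,     # context window size
    "min_count": 5,  # songs occurring fewer than 5 times are ignored
}
EMBEDDING_SIZES = (500, 750, 1000)


def train_song_embeddings(playlists, vector_size):
    """Train Skip-gram song vectors (hypothetical helper).

    playlists: iterable of playlists, each a list of song-id strings
    (playlists play the role of sentences, songs the role of words).
    """
    # Imported lazily so that the configuration above can be inspected
    # without Gensim installed.
    from gensim.models import Word2Vec

    model = Word2Vec(sentences=playlists, vector_size=vector_size, **WORD2VEC_PARAMS)
    return model.wv  # KeyedVectors: song id -> embedding
```

One model would be trained per value in `EMBEDDING_SIZES`, for example `train_song_embeddings(playlists, 500)`.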

3 Experimental Setup

3.1 Training

We used the OpenNMT-lua library [44] as the Neural Machine Translation implementation and an AWS p3.16xlarge instance for conducting our experiments. Only playlists with length less than 50 are considered. Also, words with an occurrence count of less than 20 are discarded. This brings the vocabulary size down to 82,259 words and the number of playlists in the training set to 377,362. Training is done for 15 epochs, and perplexity is used as the evaluation metric for the model. The batch size was varied in the range [32-128] depending on the hidden state size and the memory constraints.
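The two filtering steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the text does not say in which order the filters were applied or how discarded songs were handled, so the sketch filters by length first and then maps rare songs to a placeholder token. The function name `preprocess` and the `<unk>` token are hypothetical.

```python
from collections import Counter


def preprocess(playlists, max_len=50, min_count=20, unk="<unk>"):
    """Keep playlists shorter than max_len and map rare songs to a placeholder.

    Assumptions (not stated in the text): length filtering happens before
    counting, and rare songs are replaced by `unk` rather than removed.
    """
    # Keep only playlists with length strictly less than max_len.
    kept = [p for p in playlists if len(p) < max_len]
    # Count song occurrences over the remaining playlists.
    counts = Counter(song for p in kept for song in p)
    # The vocabulary holds songs that occur at least min_count times.
    vocab = {song for song, c in counts.items() if c >= min_count}
    mapped = [[song if song in vocab else unk for song in p] for p in kept]
    return mapped, vocab
```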

3.1.1 Song Order experiments

  • Bigram shift: To create the dataset for this experiment, we selected 55,265 playlists whose lengths lie in the range [50-100]. For each of these playlists, we created an additional playlist with two adjacent songs swapped. This resulted in a balanced dataset of 110,530 playlists.

  • Permute shuffle: To create the dataset for this experiment, we selected 38,168 playlists whose lengths lie in the range [50-100].
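The bigram-shift construction described in the first item above can be sketched as follows. The position of the swapped pair is chosen uniformly at random here, which is an assumption, since the text does not say how the pair was selected. The function name and the 0/1 label encoding are likewise illustrative.

```python
import random


def make_bigram_shift_dataset(playlists, seed=0):
    """Return (playlist, label) pairs: label 0 = original, 1 = one adjacent pair swapped.

    Each playlist must contain at least two songs. The swap position is
    drawn at random (an assumption, not specified in the text).
    """
    rng = random.Random(seed)
    dataset = []
    for playlist in playlists:
        dataset.append((list(playlist), 0))
        i = rng.randrange(len(playlist) - 1)  # index of the left song in the pair
        swapped = list(playlist)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        dataset.append((swapped, 1))
    return dataset
```

Each input playlist contributes exactly one original and one modified copy, so the result is balanced by construction and twice the size of the input, consistent with 55265 playlists yielding 110530 examples.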

3.2 Evaluation

A neural network with one hidden layer, implemented in Keras [45], is used for all experiments. The network is tailored to each task. Sklearn's [46] class weight computation utility is used to compute class weights. We use categorical cross-entropy loss with a softmax nonlinearity for all multi-class classification tasks, and binary cross-entropy loss with a sigmoid nonlinearity for the binary classification tasks.
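To make the class-weighting step concrete, the sketch below reproduces in plain Python the "balanced" heuristic offered by scikit-learn, in which each class receives the weight n_samples / (n_classes * class_count). Whether this particular mode was used in our experiments is not stated above, so treat it as an assumption; the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Mirrors the 'balanced' heuristic; rarer classes get larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
```

A dictionary of this form, keyed by integer class index, is the shape that Keras accepts through the `class_weight` argument of `fit`, so that errors on under-represented classes contribute more to the loss.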

4 Other Tried Experiments

4.1 Genre Prediction (Multi-Label Prediction)

This task (GMLPred-Task) is an extension of the GPred-Task; its aim is to predict all the genres of a playlist given its embedding. The task is formulated as multi-label classification, with the same nine output classes as used in the GPred-Task. Results are given in Table 2.

Evaluation results for Genre Prediction Task
base-BoW weighted-BoW base seq2seq bi-seq2seq
82.5 84 80.8 82.3
Table 2: Evaluation accuracies for Multi-Label Genre Prediction task.

4.1.1 Paraphrase detection Evaluation Task

We attempted the paraphrase detection evaluation task [21] by sampling playlists from the corpus and pairing each sampled playlist with:

  1. A shuffled portion of the songs of the original playlist as the entailment playlist.

  2. A completely different playlist with non-overlapping songs as the contradictory playlist.

We found that the model was able to tag the paired playlists correctly with very little difficulty. We realized that in our case, since the entailed playlists are always shorter than the original playlists, the model can rely on length alone to make the prediction while completely ignoring the content of the playlists. This experiment in particular points to the lack of supervised datasets in music.

4.1.2 Genre Switch Prediction

A genre switch is defined here as a change in genre going from one song to the next. A homogeneous playlist would have fewer such genre switches than a more diverse playlist. The aim of this task (GSPred-Task) is to predict the number of genre switches given a playlist embedding. For this task, the absolute number of switches in a playlist is normalized by dividing it by the length of the playlist, so that the final label lies between 0 and 1. The task is then formulated as multi-class classification, with the three output classes being low-switch playlist (0-0.34), mid-switch playlist (0.34-0.67) and high-switch playlist (0.67-1.00). Results are given in Table 3.

Evaluation results for Genre-shift Prediction task
base-BoW weighted-BoW base seq2seq bi-seq2seq
76.7 77.9 58.9 56
Table 3: Evaluation accuracies for Genre-shift Prediction task.
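The label used for the genre-switch task can be computed as in the following sketch. It assumes that each song carries a single genre tag and uses 0.34 and 0.67 as the cut points between the low, mid and high classes; how values falling exactly on a boundary are assigned is also an assumption, and the function names are illustrative.

```python
def switch_ratio(genres):
    """Number of genre changes between consecutive songs, divided by playlist length.

    genres: non-empty list with one genre tag per song (an assumption).
    """
    switches = sum(1 for a, b in zip(genres, genres[1:]) if a != b)
    return switches / len(genres)


def switch_class(ratio, low=0.34, high=0.67):
    """Map a normalized switch ratio to a class name (boundary handling assumed)."""
    if ratio < low:
        return "low"
    if ratio < high:
        return "mid"
    return "high"
```

For example, a four-song playlist tagged rock, rock, pop, rock has two switches and a ratio of 0.5, which falls in the mid-switch class.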

4.2 LSTM vs GRU comparison

For training our models, we experimented with LSTM and GRU units. For almost all the experiments across all models and embedding sizes, LSTM units performed better than GRU units. The exception was the Permute task, where the GRU-based models outperformed the LSTM-based models for all embedding sizes.