1 Introduction
Online digital content providers, such as media streaming services and ecommerce websites, usually have immense catalogs of items. To prevent users from having to search manually through the entire catalog, recommender systems help to filter out items users might like to watch, listen to, buy… and are often based on characteristics of both users and items. One type of recommendation algorithms is collaborative filtering, which is generally based on ratings for items given by users. However, such explicit ratings are not always available. For example, in what way does clicking on an item represent how much the user likes this item? Implicit feedback models are therefore used here, but they require careful parameter tuning. Next to this, systems that model users based on aggregate historical consumption will often ignore the notion of sequentiality. In the case of music consumption, for example, it has been investigated that a user’s listening behavior can be described by a trajectory in time along different artists and genres with periods of fixations and transitions Figueiredo:2016ut .
Recently, recommenders have also been built on top of item embeddings Barkan:2016wm ; Liang:2016wj
. Such embeddings, or vector representations, are generally learned using item cooccurrence measures inspired by recent advances in language modeling, e.g. word2vec and related algorithms
Mikolov:2013uz . The problem with this approach is that we are left without an adequate user representation, and the question remains how to derive such a representation based on the given item embeddings.In this work we focus on creating user representations in the context of new music discovery on online streaming platforms. We start from given latent embeddings of the items in the catalog and we represent users as a function of their item consumption. For this, we propose that a user’s listening behavior is a sequential process and is therefore governed by an underlying timebased model. For example, think of a user listening to an artist’s album for some time and then transitioning to the next album, or to a compilation playlist of the same musical genre. To model the dynamics of a user’s listening pattern we use recurrent neural networks, which are currently among the stateoftheart in many data series processing tasks, such as natural language Sutskever:2014ty ; Bahdanau:2015vz and speech processing Graves:2014vz . We do not presuppose any of the item embedding properties, such that our model is generally applicable to any type of item representation. In the next section we will explain the problem setting and highlight related work regarding music recommendation and deep learning. In Section 3 we will describe our methodology, after which we perform an initial data analysis in Section 4. We conclude with the results of our experiments in Section 5. A complete table with the used symbols in this article is given in Appendix A.
2 Motivation and Related Work
Ever since the launch of the Netflix Prize competition in 2006 Bennett:2007uv , research in recommender systems, and a particular subset called collaborative filtering, has spiked. The basis of modern collaborative filtering lies in latentfactor models, such as matrix factorization Koren:2009jg . Herein, a lowdimensional latent vector is learned for each item and user based on rating similarities, and in the most basic scheme the dot product between a user vector and item vector is learned to represent the rating of item by user :
(1) 
This setting is based on the entire user and item history, and does not take into account that a user’s taste might shift over time. Koren et al. Koren:2009jg mention that such temporal effects can easily be brought into the model by adding timedependent biases for each item and user:
(2) 
Dror et al. Dror:2011bb extend on this work by introducing additional biases for albums, artists, genres and user sessions, but these biases now represent a global rather than a temporal effect of a user’s preference towards an item. This way we can for example model to what extent the average rating for a specific album is higher or lower compared to other albums. Although the models of Koren et al. and Dror et al. are capable of representing a user’s overall preference towards an item and how this preference shifts in time, it cannot immediately explain why a user would rate item higher or lower after having consumed items , and . That is, a user’s preference can depend on what he or she has consumed in the immediate past. Basic collaborative filtering techniques do not explicitly model this sequential aspect of item consumption and the effect on what future items will be chosen.
Next to this, standard collaborative filtering and matrix factorization techniques are mostly fit for explicit feedback settings, i.e. they are based on positive as well as negative item ratings provided by the users. In more general cases where we deal with views, purchases, clicks… we only have positive feedback signals, which are binary, nondiscriminative, sparse, and are inherently noisy Hu:2008el
. Standard and tested techniques to factorize a sparse matrix of useritem interactions are singular value decomposition (SVD) and nonnegative matrix factorization (NMF)
Anonymous:H3H5BbuI ; Lee:2000ti . In the context of implicit feedback, however, missing values do not necessarily imply negative samples. Pan et al. Pan:2008kb therefore formulate this as a socalled one class classification problem, in which observed interactions are attributed higher importance than nonobserved ones through different weighting schemes, or through careful negative sampling. Hu et al. Hu:2008el on the other hand construct an implicit matrix factorization algorithm, based on the singular value decomposition of the useritem matrix, which differs from the standard algorithm by attaching higher confidence on items with a large number of interactions during optimization. Johnson Johnson:2014tfuses the ideas from Hu et al. and devises a probabilistic framework to model the probability of a user consuming an item using logistic functions.
The implicitfeedback models calculate global recommendations and do not exploit temporal information to decide which items the user might be interested in. Recently, Figueiredo et al. Figueiredo:2016ut
have shown that users on online music streaming services follow a trajectory through the catalog, thereby focusing their attention to a particular artist for a certain time span before continuing to the next artist. For this they use a combination of different Markov models to describe the inter (‘switch’) and intraartist (‘fixation’) transitions. In other work, Moore et al.
Moore:2013tj learn user and song embeddings in a latent vector space and model playlists using a firstorder Markov model, for which they allow the user and song vectors to drift over time in the latent space.Compared to Markov models, which inherently obey the Markov property, recent work has shown that recurrent neural networks (RNNs) are able to learn longterm data dependencies, can process variablelength time series, have great representational power, and can be learned through gradientbased optimization. They can effectively model the nonlinear temporal dynamics of text, speech and audio VanDenOord:2016uo ; Sercu:2016ub ; Karpathy:2015wu , so they are ideal candidates for sequential item recommendation. In a general RNN, at each time step a new input sample is taken to update the hidden state :
(3) 
in which is a nonlinear function, e.g. sigmoid ,
or a rectifier (ReLU and variants)
Maas:2013tn. To counter vanishing gradients during backpropagation and to be able to learn longterm dependencies, recurrent architectures such as long shortterm memories (LSTMs) and gated recurrent units (GRUs) have been proposed, both with comparable performances
Hochreiter:1997fq ; Greff:2015wv ; Chung:2014wf . These models use a gating mechanism, e.g. an LSTM introduces input () and forget () gates that calculate how much of the input is taken in and to what extent the hidden state should be updated, and an output gate () that leaks bits of the internal cell state () to the output:(4) 
in which is the elementwise vector multiplication. GRUs only have a reset () and update () gate, get rid of the cell state, and have less parameters overall:
(5) 
Very recently there have been research efforts in using RNNs for item recommendation. Hidasi et al. Hidasi:2015uq use RNNs to recommend items by predicting the next item interaction. The authors use onehot item encodings as input and produce scores for every item in the catalog, on which a ranking loss is defined. The task can thus be compared to a classification problem. For millions of items, this quickly leads to scalability issues, and the authors resort to popularitybased sampling schemes to resolve this. Such models typically take a long time to converge, and special care needs to be taken not to introduce a popularity bias, since popular items will occur more frequently in the training data. The work by Tan et al. Tan:2016vy is closely related to the previous approach, and they also state that making a prediction for each item in the catalog is slow and intractable for many items. Instead, lowdimensional item embeddings can be predicted at the output in a regression task, a notion we will extend on in Section 3.
A popular method to learn item embeddings is the word2vec suite by Mikolov et al. Mikolov:2013uz with both Continuous BagofWords and SkipGram variants. In this, a corpus of item lists is fed into the model, which learns distributed, lowdimensional vector embeddings for each item in the corpus. Word2vec and variants have already been applied to item recommendation, e.g. Barkan et al. Barkan:2016wm formulate a word2vec variant to learn item vectors in a set consumed by a user, Liang Liang:2016wj devise a word2vecbased CoFactor model that unifies both matrix factorization and item embedding learning, and Ozsoy Ozsoy:2016tm learns embeddings for places visited by users on Foursquare to recommend new sites to visit. These works show that a word2vecbased recommender system can outperform traditional matrix factorization and collaborative filtering techniques on a variety of tasks. In the work by Tan et al. item embeddings are predicted and at the same time learned by the model itself, a practice that generally deteriorates the embedding quality: in the limit, the embeddings will all collapse to a degenerate, noninformative solution, since in this case the loss will be minimal. Also, they minimize the cosine distance during training, which we found to decrease performance a lot.
In the coming sections, we will train RNNs to predict songs a user might listen to in the future as a tool to model users on online music streaming platforms. For this, we will predict preexisting item embeddings—about which we will not make any assumptions in the model—and our task is therefore a regression rather than a classification problem. This approach is closely related to the work by van den Oord et al. VanDenOord:2013tp in which collaborative latent vectors are predicted based on raw audio signals, and also related to the work by Hill et al. Hill:2016wg who learn to project dictionary definition representations onto existing word embeddings. Regarding sequential item recommendation, the related works by Hidasi et al. and Tan et al. mentioned above both perform item recommendation within user sessions, i.e. uninterrupted and coherent sequences of item interactions, which can last from ca. 10 minutes to over an hour. Here, the prediction time scale is very shortterm, and since consumed items within a user session are usually more similar than across user sessions, it is generally easier perform item recommendation on this short time scale. In this work we will explore recommending songs for shortterm as well as longterm consumption. To recommend songs on the long term, we will need to be able to model a user’s behavior across session boundaries.
3 RNNs for Music Discovery
In this section we will explain the details of our approach to use RNNs as a means to model users on online music streaming platforms and to recommend new songs in a music discovery context. Since we aim towards models that can be used efficiently in largescale recommendation pipelines, we require the models to be trained within a reasonable time frame. Furthermore, sampling from the model should be efficient. Small models with little parameters are typically wanted to satisfy both requirements.
The entire recommendation pipeline for one specific user is given in Figure 1. The basic building blocks of the pipeline are song vectors, which have been learned using the songs in the catalog. The general idea is then to capture and predict a taste vector for each user. These taste vectors are the output of an RNN that sequentially aggregates song vectors from the user’s listening history, and can therefore be regarded as a representation of the user’s musical taste. The taste vector can subsequently be used to generate song recommendations by querying a tree data structure for the nearby song vectors.
Since we construct realvalued taste vectors, the RNN solves a regression task rather than a classification task, as argued in Section 2
. Directly predicting item embeddings is a regression problem that requires predicting a limited set of realvalued outputs, as opposed to a classifier with as many outputs as the number of items. The computational footprint of these models is typically smaller than the classifiers. They are usually learned faster, and are not per se biased towards popular items. One of the main advantages is that any type of item embeddings and embedding combinations, along with other features, can be used to learn the regression model.
We break the recommendation pipeline into three separate stages which we will cover below. First, we learn lowdimensional embeddings for each song in the streaming catalog using word2vec (§3.1). Then, we use an RNN to generate taste vectors for all users in the song embedding space (§3.2). Finally, we use the taste vector to query songs in the nearby space to generate recommendations for all users (§3.3).
3.1 Learning song embeddings
In the first stage of the recommendation pipeline we learn latent vector representations for the top most popular songs in the catalog. For this, we use Google’s word2vec suite as explained in Section 2; more specifically we use the Continuous BagofWords (CBoW) algorithm with negative sampling. As input to the word2vec algorithm we take usercreated playlists of songs. In this, each playlist is considered as an ordered ‘document’ of songs. By scanning all playlists in a windowed fashion, word2vec will learn a distributed vector representation with dimensionality for every song . Details regarding training data for word2vec will be highlighted in Section 4.
3.2 Learning user taste vectors
In the second pipeline stage we use RNNs to produce user taste vectors based on song listening history. The network takes a sequence of song vectors of dimensionality as input and produces a single taste vector with the same dimensionality . Let’s denote the set of all users by , the ordered sequence of song vectors user listened to by , and the predicted taste vector by . The RNN then produces:
(6) 
in which represents the function the RNN computes with parameters .
To learn a semantically rich user taste vector that is able to generate adequate recommendations, ideally this taste vector should be able to capture how a user’s listening behavior is evolving over time. We therefore train the RNN to predict a song the user is going to listen to in the future. More specifically, for a particular user , we take the first consecutive songs this user has listened to, and we try to predict a future song vector , for some strictly positive value of
. As a loss function, we use the
distance between the predicted taste vector and the true future song vector:(7) 
We experimented with other distance functions, such as cosine distance and a weighted mixture of cosine and distance. The cosine distance between two vectors and is given by:
(8) 
The distance, nevertheless, gave the best results in the experiments.
To determine the best value of the prediction offset we consider two separate prediction tasks: shortterm and longterm prediction. The idea is that it is generally easier to predict a song a user is going to listen to next—e.g. think of a user shuffling an artist album, or listening to a rock playlist—than it is to predict the 50th song he is going to listen to in the future. The underlying dynamics of a shortterm and longterm prediction model will therefore be different. For example, the shortterm model will intuitively be more focused on the last few tracks that were played, while the longterm model will generally look at a bigger timeframe. During training we sample a value of
for every input sequence from a discrete uniform distribution. More specifically, for the shortterm model,
is sampled from , and is sampled from for the longterm model. Random sampling of the prediction offset for every new input sequence reduces chances of overfitting and also increases model generalization. The entire training procedure is sketched in Algorithm 1. In here, we use a stochastic minibatch version of the gradientbased Adam algorithm to update the RNN weights Kingma:2015ku . We have also experimented with setting fixed to 1 for the shortterm model, but this leads to very nearsighted models that make predictions almost solely based on the last played song. For example, first listening to 100 classical songs and then to 1 pop song would lead to pop song predictions by the RNN.3.3 Recommending songs
The output of the RNN is a user taste vector that resides in the same vector space as all songs in the catalog. Since we trained the RNN to produce these taste vectors to lie close to future songs a user might play—in terms of distance—we can query for nearby songs in that vector space to generate new song recommendations. To scale the search for nearby songs in practice, we construct an Annoy^{1}^{1}1github.com/spotify/annoy tree datastructure with all song vectors. The Annoy tree will iteratively divide the vector space in regions using a localitysensitive hashing random projection technique Charikar:2002km , which facilitates approximate nearest neighbor querying. To generate suggested recommendations for a particular user, we query the Annoy tree using the user’s taste vector to find the top closest catalog songs in the vector space.
3.4 Incorporating play context
In general we do not only know the order in which a user plays particular songs, but we also know the context in which the songs have been played. By context we mean a playlist, an album page, whether the user deliberately clicked on the song, etc. This additional information can be very useful, e.g. we can imagine that a user clicking on a song is a stronger indicator of the user’s taste than when the song is played automatically in an album or playlist after the previous song has finished playing. Since the RNN can process any combination of arbitrary embeddings and features, we can supply a context vector as extra input. The context vector is in this case concatenated with the song vector at each time step to produce the user taste vector:
(9) 
in which we use the symbol to denote vector concatenation. To construct a context vector we consider the ordered set of all possible contexts, denote as the ’th context in , and as the play context for a particular song , e.g. . The context vector for a song
is then constructed using a onehot encoding scheme:
(10) 
in which is a onehot vector of length with a single at position , and in which is the indicator function that evaluates to if and to otherwise. We also include the time difference between playing the current song and the last played song. The final context vector for the ’th song played by user then becomes:
(11) 
in which is the time difference in seconds between playing song and , evaluating to if does not exist.
3.5 User and model updates
The recommendation pipeline we described in this section can be used in both static and dynamic contexts. In dynamic contexts, the model should be updated frequently so that recommendations can immediately reflect the user’s current listening behavior. This is easily done in our model: for every song the user listens to we retrieve its song vector that we use to update the hidden state of the RNN, after which we calculate a new taste vector to generate recommendations. This requires that we keep track of the current RNN states for every user in the system, which is a small overhead. In more static contexts, in which recommendations are generated on a regular basis for all users (every day, week…), there is no need to update the RNN for every song a user listens to. Here, we retrieve the entire user listening history–or the last songs–which we feed to a newly initialized RNN. We therefore do not need to remember the RNN states for every user.
All recommendation modules that are deployed in practical environments have to be updated regularly in order to reflect changes in the item catalog and user behavior. This is no different for the framework we present here. In order to perform a full model update, we subsequently train word2vec on the playlists in the catalog, retrain the RNN model on the new song vectors, and populate the Annoy tree with the same song vectors. Only retraining word2vec is not sufficient since we almost never end up in the same song embedding space. If we use dynamic user updates, note that we also have to do one static update after retraining word2vec and the RNN, since the remembered RNN states will have become invalid.
4 Data Gathering and Analysis
The first stage in the recommendation pipeline considers the learning of semantically rich song vectors, for which we use the word2vec algorithm. In this section we explain how we gather data to train this word2vec model, and we perform a preliminary analysis on the song vectors. Finally we detail the construction of the train and test data for the RNNs.
4.1 Training word2vec
To create training data for the word2vec model we treat usercreated playlists as documents and songs within these playlists as individual words, as mentioned in Section 3. For this, we gather all usercreated playlists on the Spotify music streaming platform. In these playlists we only consider the top most popular tracks; the other tracks are removed. In our experiments we set to 6 million, which makes up for most of the streams on the Spotify platform. After filtering out unpopular tracks, we further only consider playlists with a length larger than 10 and smaller than 5000. We also restrict ourselves to playlists which contain songs from at least 3 different artists and 3 different albums, to make sure there is enough variation and diversity in the playlists. After applying the filtering above, we arrive at a corpus of million playlists in total. In the following step, the playlists are fed to the word2vec suite, where we use the CBoW algorithm with negative sampling. We go through the entire playlist corpus once during execution of the algorithm in order to produce vectors with dimensionality , a number that we empirically determined and produces good results.
4.2 Data processing and filtering
In Section 5 we will train different RNN versions, for which we will use both playlist and user listening data. Playlists are usually more contained regarding artists and musical genres compared to listening data. Modeling playlist song sequences with RNNs will therefore be easier, but we can miss out important patterns that appear in actual listening data, especially if a user listens to a wide variety of genres.
For the playlist data we extract chunks of 110 consecutive songs, i.e. 60 songs to feed to the RNN and the next 50 songs are used as ground truth for the prediction. Regarding the user listening, which is captured in the first half of 2016 on the Spotify platform, we only keep songs which have a song vector associated with them; other songs are removed from the user’s history. We also save context information for every played song; for this, we consider the following 13 Spotifyrelated contexts: collection, library, radio, own playlist, shared playlist, curated playlist, search, browse, artist, album, chart, track, clicked, and unknown for missing context information. To extract RNN training sequences we take chunks of 150 consecutive songs, for which the first 100 songs are again used as input to the RNN and the last 50 as ground truth. We allow more songs as input to the RNN compared to the playlist training data since user listening data is generally more diverse than playlist data, as mentioned before.
Since users often return to the same songs or artists over and over again, we apply additional filtering to eliminate those songs. This will greatly improve the RNN’s generalization, and will counter what we call the ‘easy prediction bias’, as it is too easy to predict songs a user has already listened to. This filtering is not needed for playlist data, since a song only appears once in a playlist most of the times. The filtering rules we include, are:

The last 50 songs should be unique;

The last 50 songs should not appear in the first 100 songs;

The last 50 artists should be unique;

The last 50 artists should not appear in the first 100 artists.
Note that we only remove similar songs and artists from the ground truth labels in the dataset, and that we leave the first 100 songs in the user’s history intact. That is, the same song can still appear multiple times in these first 100 songs, thereby steering the user’s preference towards this particular song and influencing the predictions made by the RNN. For both playlist and listening data we gather 300,000 train sequences, 5,000 validation sequences and 5,000 test sequences.
4.3 User data analysis
To analyze the gathered data, we take all 5,000 listening history sequences in the test set, and we calculate pairwise cosine distances between the song vectors in these sequences. We measure both the vector distance between all possible song pairs in a listening sequence, as well as the distance only between subsequent songs. The distances are given as box plots in Figure 2, in which the whiskers are drawn at 1.5 times the interquartile range. We see that the all pairs median distance is larger than the subsequent pairs median distance by around 0.25, indeed confirming the higher correlation between subsequent songs. We also plot the number of song transitions within each user’s listening history that have a cosine distance larger than 1, meaning that the next song is more different from the current song than it is similar to it. The histogram for this is shown in Figure 3. Most listening histories have little such song transitions. The median number is seven, which points, on average, to listening periods of 21 similar songs before transitioning to a more different region in the song space, which in turn corresponds to coherent listening sessions of about 1 to 1.5 hours long.
5 Experiments
In this section we discuss the results of various experiments and we show examples of song recommendations produced by different RNN models. In the following we first design the best RNN architecture by experimentation, after which we explain the baselines against which the RNN models will be tested. All experiments are run on an Amazon AWS instance, 32core Intel Xeon 2.60GHz CPU, 64GB RAM and Nvidia GRID K520 GPU with cuDNN v4. The RNN models are implemented in Lasagne^{2}^{2}2github.com/Lasagne/Lasagne
and Theano
AlRfou:2016uc .5.1 Network architecture
The neural network architecture we propose consists of a number of recurrent layers followed by a number of dense layers. The number of input dimensions is , which is the dimensionality of a single song vector. For the recurrent layers we consider both LSTMs and GRUs, since such models are stateoftheart in various NLP applications and can handle longterm dependencies. We use standard implementations of these layers as described in Section 2, with a hidden state dimensionality of all layers equal to . We will describe how the optimal number of recurrent layers and the optimal value of are chosen.
After the recurrent layers we consider two dense layers. We empirically set the dimensionality of the first dense layer to , and we use a leaky ReLU nonlinearity with a leakiness of 0.01 Maas:2013tn . Since we are predicting a user taste vector in the same song space, the output dimensionality of the last dense layer is
, which is the same as the input dimensionality. In the last layer we use a linear activation function, since we are predicting raw vector values.
In a first experiment we set
to 50, we switch the recurrent layer type between LSTM and GRU, and we also vary the number of recurrent layers from 1 up to 3. We record train loss, validation loss, average gradient update time for a single batch and average prediction time for a single sequence. For this experiment we use the gathered playlist data, and we train a shortterm model for which we report the best values across 15 epochs of training. The results are shown in Table
1. We see that both models are very comparable in terms of validation loss, and that the optimal number of recurrent layers is two. The LSTM model performs slightly better than the GRU model in this case, but the GRU model predicts more than 50% faster than the LSTM model, which can be important in largescale commercial production environments. We observe comparable results for listening history prediction as well as longterm prediction. We therefore pick the twolayer GRU model as the optimal architecture.Train loss  Validation loss  Update [ms]  Prediction [ms]  

LSTM, 1 layer  258.8  271.2  20  4.4 
LSTM, 2 layers  257.3  270.2  39  8.7 
LSTM, 3 layers  258.1  271.0  58  13.0 
GRU, 1 layer  259.0  271.0  16  2.2 
GRU, 2 layers  257.3  270.5  30  4.1 
GRU, 3 layers  257.5  270.6  45  6.1 
validation loss  Cosine validation loss  

20  287.3  0.464 
30  276.4  0.425 
40  271.5  0.408 
50  270.5  0.406 
60  270.0  0.406 
70  270.4  0.406 
80  270.5  0.407 
90  270.1  0.405 
100  270.4  0.406 
Next we perform experiments to determine the optimal hidden state size . We train the twolayer GRU architecture from above for a shortterm playlist prediction, and we vary from 20 to 100 in steps of 10. The results are shown in Table 2. The validation loss is the highest at 20, and is minimal at values of 50 and 60. The loss remains more or less stable if we increasing the hidden state size to 100. Since larger hidden state sizes imply slower prediction and train times, we choose as the optimal hidden state size. These observations are also valid for longterm prediction and for listening history data. The final architecture is displayed in Table 3. It turns out that this architecture is also nearoptimal for longterm playlist prediction as well as for user listening data. Since the number of network parameters is low and given the large amount of training data, we do not need additional regularization.
We also train shortterm user listening RNNs with additional play context information at the input, as described in Section 3.4, and with the same configurations as in Tables 1 and 2. The optimal configuration is a 2layer architecture with equal to 100, but the differences in performance for are minimal. Furthermore, the performance gain compared to the same model without context information is nonsignificant. We will therefore disregard play context models in the rest of the experiments.
5.2 Baselines
We will compare the performance of different RNN models against several baseline models. The general idea here is again that a user’s listening behavior is governed by underlying temporal dynamics. We consider three types of baseline models: an exponential discount model, a weightbased model, and a classification model. The first two types are aggregation models, which means that they take an arbitrary number of song vectors as input and produce one output vector by mathematically combining the song vectors, e.g. through summing or taking a maximum across dimensions. Aggregation of distributed vectors is a popular practice in NLP applications and deep learning in general since it does not require any training when the vector space changes dosSantos:2014tr ; Collobert:2011tk . In our case however, the danger of aggregation is that sometimes songs from different genres are summed together, so that we can end up in ‘wrong’ parts of the song space. For example, if we aggregate the song vectors of a user listening to a mix of classical and pop music, we might arrive in a space where the majority of songs is jazz, which in turn will lead to bad recommendations. The third type of baseline is based on the work by Hidasi et al. Hidasi:2015uq as mentioned in Section 2. In this, we will not predict an output vector to query the Annoy LSH tree, but we will output probability scores for all items in the catalog. The top items can then immediately be recommended.
Layer type (no. of dimensions) and nonlinearity  

Input (40)  
1  GRU (50) 
sigmoid (gates); tanh (hidden update)  
2  GRU (50) 
sigmoid (gates); tanh (hidden update)  
3  Fully connected dense (200) 
Leaky ReLU, leakiness  
4  Fully connected dense (40) 
Linear activation 
Exponential discount model
In the exponential discount model we make the assumption that a user’s taste is better reflected by the recent songs he has listened to than by songs he has listened to a while ago. We model this type of temporal dynamics by the following exponentially decaying weight model:
(12) 
In this, we consider the song history of user which has length , and we weigh every vector by a power of , the discount factor. Setting results in no discounting and leads to a simple sum of the song vectors, while setting focuses more attention on recently played songs. The smaller , the more substantial this contribution becomes compared to songs played a longer time ago. If , the user taste vector is essentially the vector of the last played song.
Weightbased model
This model is based on the weighted word embedding aggregation technique by De Boom et al. DeBoom:2016gm . As in the exponential discount model, we will multiply each song vector by a weight :
(13) 
in which we gather all weights in a dimensional vector . Now, instead of fixing the weights to an exponential regime, we will learn the weights through the same gradient descent procedure as in Algorithm 1. In this algorithm we replace the RNN loss by the following weightbased loss:
(14) 
We include the last term as weight regularization, and we empirically set to 0.001. Apart from this regularization term, we do not imply any restrictions on or relations between the weights. We train a weightbased model on user listening data for both short and longterm predictions, and the resulting weights are plotted in Figure 4. For the shortterm prediction the weights are largest for more recent tracks and decrease the further we go back in the past. This confirms the hypothesis that current listening behavior is largely dependent on recent listening history.For the longterm model we largely observe the same trend, but the weights are noisier and generally larger than the shortterm weights. Also, the weights increase again in magnitude for the first 10 tracks in the sequence. This may signify that a portion of a user’s longterm listening behavior can be explained by looking at songs, genres or artists he has listened to in the past and returns to after a while.
Classification model
As mentioned above, we loosely rely on the work by Hidasi et al. Hidasi:2015uq to construct a classification baseline model. In this work, the items are encoded as a onehot representation, and the output of the model is a probability score for every item in the catalog. To be able to fairly compare between all other baselines, and to help scale the model, we slightly adapt it and use the word2vec vectors as input instead of the onehot item encodings. We employ the same neural network model as in Table 3 and at the output we add an additional dense layer with output dimensionality and softmax activation function. The memory footprint of this softmax model thus substantially increases by around 938MiB, compared to the model in Table 3
. Given the large output dimensionality, we also experimented with a twostage hierarchical softmax layer
Goodman:2001dp , but the computational improvements were only marginal and the model performed worse.We train the softmax classification model with two different loss functions. First, we consider the categorical crossentropy loss in the case there is only one target:
(15) 
In this loss function, is the RNN output index of the target song to be predicted, and is the output of the RNN after a softmax nonlinearity given the input vectors . The second loss function is a pairwise ranking loss used in the Bayesian Personalized Ranking (BPR) scheme by Rendle et al.Rendle:2009wp . This loss function evaluates the score of the positive target against randomly sampled negative targets:
(16) 
In this, is the number of negative samples that we set fixed to 100,
is the sigmoid function, and
is again the output index of the positive sample. Note that we use a sigmoid nonlinearity rather than a softmax. In practice we also add an
regularization term on the sum of the positive output value and negative sample values. To generate negative samples, we sample song IDs from a Zeta or Zipf distribution with parameter , which we checked empirically on the song unigram distribution:(17) 
in which is the Riemann zeta function. We resample whenever a song appears in a user’s listening data to make sure the sample is truly negative.
Hidasi et al. reported better stability using BPR loss compared to crossentropy loss, but our samplingbased training procedure from Algorithm 1 did not produce any unstable networks for both loss functions. We trained shortterm and longterm networks on the filtered listening history data using both crossentropy and BPR, and all models took around 2.5 days to converge. By comparison, training until convergence on the same hardware only took 1.5 hours for the models presented in this work.
5.3 Results
In this section we will display several performance metrics on the test set of user listening histories. After all, the music recommendations will be based on what a user has listened to in the past. We have trained four RNN models: a playlist shortterm (rPST) and longterm (rPLT) RNN, and a user history shortterm (rHST) and longterm (rHLT) RNN. We also report metrics for five baselines: a shortterm (bWST) and longterm (bWLT) weightbased model, and exponential discount models with . In the following we will perform a forward analysis to evaluate how well a taste vector is related to future song vectors, which will show the predictive capacity of the different models. We will also do a backwards analysis to study on what part of the listening history sequences the different models tend to focus. We conclude with results on a song prediction task.
Forward analysis
In the forward analysis we take the first 100 songs of a user’s listening history, which we use to generate the taste vector. This taste vector is then compared to the next 50 song vectors in the listening history in terms of cosine distance. That is, for each user in the test set we calculate the sequence , for . Figure 5 shows a plot of these sequences averaged over all users in the test set. The overall trend of every model is that the cosine distance increases if we compare the taste vector to songs further in the future. This is not surprising since it is generally easier to predict nearby songs than it is to predict songs in the far future, because the former are usually more related to the last played songs. We see that the and rPST model have comparable performance. They have low cosine distance for the first few tracks, but this quickly starts to rise, and they both become the worst performing models for longterm prediction. All other models, apart from rHST and rHLT, behave similarly, with rPLT being slightly better and the model slightly worse than all others. The two best performing models are rHST and rHLT. Until future track 20, the rHST model gives the lowest cosine distance, and rHLT is significantly the best model after that. Since playlists are typically more coherent than listening histories—e.g. they often contain entire albums or sometimes only songs by the same artist—this can explain why the playlisttrained RNNs, and especially rPST, perform not that well in this analysis. Another general trend is that ST models typically perform better than their LT counterparts in the very near future. And at some point the LT model becomes better than the ST model and is a better predictor on the long term. Finally, among all baselines, we also observe that bWST is the best performing shortterm model, and bWLT performs best to predict on the long term, which is not surprising since the weightbased models are a generalization of the discounting models. Note that the classification models remain absent, because in this case the output of the RNN is not a user taste vector.
Backwards analysis
In this analysis we again take the first 100 songs of a user’ listening history, which we use to generate a taste vector. We then compare this taste vector to these first 100 songs, i.e. the songs that generated the taste vector. We thus look back in time to gain insights as to what parts of the listening history contribute most or least to the taste vector. For this we calculate the sequence , for , and Figure 6 plots this sequence for each model averaged over all users in the test set. We see that the rPST and models are very focused on the last songs that were played, and the average cosine distance increases rapidly the further we go back in history: for songs 1 until 80 they are the worst performing. These models will typically be very nearsighted in their predictions, that is, the song recommendations will mostly be based on the last 10 played tracks. This is again due to the fact that playlists are very coherent, and predicting a nearfuture track can be done by looking at the last tracks alone. The rHST and bWST models also show a similar behavior, but the difference in cosine distance for tracks in the near and far history is not as large compared to rPST. The listening history RNNs, both rHST and rHLT, produce an overall high cosine distance. These models are therefore not really tied to or focused on particular songs in the user’s history. It is interesting to note that the plot for rHLT and bWLT is a nearflat line, so that the produced taste vector lies equally far from all songs in terms of cosine distance. In comparison, the taste vector, which is actually just a sum of all songs, produces a Ushaped plot, which is a behavior similar to the longterm weights in Figure 4. If we would attribute more weight to the first and last few tracks, we would end up with a flatter line. The plot also has a Ushape, but the minimum is shifted more towards the recent listening history. Note again that the classification models are absent in this analysis for the same reason as specified above.
Precision@k
In this section we calculate the precision of actual song recommendations. We again take 100 songs from a user’s history which we use to generate a taste vector. Then, as described in Section 3.3, we query the word2vec space for the nearest songs in the catalog in terms of cosine distance. We will denote the resulting set as . The value is then the fraction of how many songs in actually appear in the user’s next tracks:
(18)  
We can also generalize this to :
(19)  
Here we disregard the user’s first next tracks, since it is often easier to predict the immediate next tracks than it is to predict tracks further in the future. For the next results, we also slightly alter the definition of : given the fact that no song in occurs in for all users , as described in Section 4.2, we only regard the nearest songs of that do not appear in . For the classification models, we simply take the top songs with the highest scores, and compare them to the ground truth. If we denote these top songs by , we can reuse the same definition of from above.
The results of the experiments are shown in Table 4. In bold we mark the best performing model for each task, and the second best model is underlined. The overall precisions are quite low, but given that we aggressively filtered the listening data (Section 4.2), the task is rather difficult. The historybased RNNs clearly perform best in all tasks. Generally, for , and all shortterm models outperform the longterm models. But once we skip the first 25 songs, which are easier to predict, the longterm models take over, which shows that listening behavior indeed changes over time. The performance of the playlist RNNs and weightbased models are comparable to the exponential discount models, which we already saw in Figure 5.
All four classification models that we trained achieved the same precision score of 0 percent, so we listed them as one entry in Table 4. They were not able to correctly guess any of the 50 future songs a user might listen to. We can think of many reasons why we see this result. First, the output dimensionality of 6 million is extremely large, which makes it difficult to discriminate between different but comparable items. The RecSys Challenge 2015 dataset, used in both the works of Hidasi et al. and Tan et al., only has around 37,500 items to predict, which is 160 times less than 6 million. Second, the number of weights in the classification model is orders of magnitudes larger than for the regression model, which causes learning to be much harder. And third, the data in the works by Hidasi et al. and Tan et al. comes from user sessions, which are mostly contained and coherent sequences of songs a user listens to within a certain time span, see also Section 2. The listening history dataset in our work goes across user sessions to be able to recommend on the long term, which makes it much more difficult to model the temporal dynamics. This is reflected in the overall low precision accuracies.
A note on scalability
As indicated in the title of this article, our methodology should allow for scalable music discovery. The training procedure—i.e. training word2vec and the RNN—is not timecritical and can be trained offline. Despite the fact that these procedures could be parallelized when needed Ji:2016va , we will focus on the recommendation part of the system itself, which is more timecritical.
Since every user can be treated independently, the entire pipeline we have proposed in Figure 1 is ’embarrassingly parallel’ and can therefore be scaled up to as many computational nodes as available. Retrieving song vectors comes down to a dictionary lookup in constant time . Calculating user taste vectors through the RNN is linear in the number of historical song vectors we consider. An extensive study of the scalability of Annoy, the last part in the pipeline, is beyond the scope of this paper, and poses a tradeoff between accuracy and performance: more tree nodes inspected leads generally to more accurate nearest neighbors, but a slower retrieval time (approximately linear in the number of inspected nodes)^{3}^{3}3There is an excellent web article by Radim Rehurek from 2014 which studies this in depth, see raretechnologies.com/performanceshootoutofnearestneighboursquerying. Retrieving nearest neighbors using 10 trees and inspected nodes only takes on average 2.6ms on our system, which is in same order of magnitude compared to the RNN prediction times given in Table 1.
Combining all the above, sampling a taste vector from the rHLT RNN and retrieving the top 50 closest songs from the Annoy LSH tree over 1,000 runs takes on average 58ms on our system, while retrieving the top 50 songs from the BPR RNN takes on average 754ms, which is 13 times slower.
Precision (%)  

rPST  1.64  2.39  2.81  1.15  0.94 
rPLT  1.27  1.98  2.64  1.30  1.06 
rHST  2.03  2.95  3.72  1.85  1.53 
rHLT  1.40  2.31  3.25  1.89  1.63 
bWST  1.67  2.37  2.94  1.28  1.04 
bWLT  1.32  1.95  2.62  1.31  1.06 
1.94  2.63  3.00  1.20  0.96  
1.56  2.20  2.77  1.30  1.05  
1.16  1.78  2.41  1.24  1.02  
Classification  0.00  0.00  0.00  0.00  0.00 
6 Conclusions
We modeled users on largescale online music streaming platforms for the purpose of new music discovery. We sequentially processed a user’s listening history using recurrent neural networks in order to predict a song he or she will listen to in the future. For this we treated the problem as a regression rather than classification task, in which we predict continuousvalued vectors instead of distinct classes. We designed a shortterm and longterm prediction model, and we trained both versions on playlist data as well as filtered user listening history data. The best performing models were chosen to be as small and efficient as possible in order to fit in largescale production environments. Incorporating extra play context features did not significantly improve the models. We performed a set of experimental analyses for which we conclude that the historybased models outperform the playlistbased and all baseline models, and we especially pointed out the advantages of using the regression approach over the classification baseline models. We also saw that there is indeed a difference between shortterm and longterm listening behavior. In this work we modeled these with different models. One possible line of future work would be to design a single sequencetosequence model that captures both short and long term time dependencies to predict the entire future listening sequence Sutskever:2014ty .
Acknowledgments
Cedric De Boom is funded by a PhD grant of the Research Foundation  Flanders (FWO). We greatly thank Nvidia for its donation of a Tesla K40 and Titan X GPU to support the research of the IDLab group at Ghent University.
References
 (1) AlRfou, R., Alain, G., Almahairi, A., al, e.: Theano  A Python framework for fast computation of mathematical expressions. arXiv.org (2016)

(2)
Bahdanau, D., Cho, K., Bengio, Y.: Neural Machine Translation by Jointly Learning to Align and Translate.
In: ICLR (2015)  (3) Barkan, O., Koenigstein, N.: Item2vec  Neural Item Embedding for Collaborative Filtering. RecSys Posters (2016)
 (4) Bennett, J., Lanning, S.: The Netflix Prize. In: Proceedings of KDD cup and workshop (2007)

(5)
Charikar, M.: Similarity estimation techniques from rounding algorithms.
In: STOC (2002)  (6) Chung, J., Gülçehre, Ç., Cho, K., Bengio, Y.: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv.org (2014)

(7)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural Language Processing (Almost) from Scratch.
The Journal of Machine Learning Research (2011)  (8) De Boom, C., Van Canneyt, S., Demeester, T., Dhoedt, B.: Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters (2016)
 (9) Dror, G., Koenigstein, N., Koren, Y.: Yahoo! music recommendations  modeling music ratings with temporal dynamics and item taxonomy. In: RecSys (2011)
 (10) Figueiredo, F., Ribeiro, B., Faloutsos, C., Andrade, N., Almeida, J.M.: Mining Online Music Listening Trajectories. In: ISMIR (2016)
 (11) Goodman, J.: Classes for fast maximum entropy training. In: 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221 (2001)
 (12) Graves, A., Jaitly, N.: Towards EndToEnd Speech Recognition with Recurrent Neural Networks. In: ICML (2014)
 (13) Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R., Schmidhuber, J.: LSTM: A Search Space Odyssey. arXiv.org (2015)
 (14) Hidasi, B., Karatzoglou, A., Baltrunas, L., Tikk, D.: Sessionbased Recommendations with Recurrent Neural Networks. arXiv.org (2016)
 (15) Hill, F., Cho, K., Korhonen, A., Bengio, Y.: Learning to Understand Phrases by Embedding the Dictionary. TACL (2016)
 (16) Hochreiter, S., Schmidhuber, J.: Long shortterm memory. Neural Computation (1997)
 (17) Hu, Y., Koren, Y., Volinsky, C.: Collaborative filtering for implicit feedback datasets. 2008 Eighth IEEE International … (2008)
 (18) Ji, S., Satish, N., Li, S., Dubey, P.: Parallelizing Word2Vec in MultiCore and ManyCore Architectures. CoRR (2016)
 (19) Johnson, C.C.: Logistic matrix factorization for implicit feedback data. In: NIPS 2014 Workshop on Distributed Machine Learning … (2014)
 (20) Karpathy, A., Johnson, J., FeiFei, L.: Visualizing and Understanding Recurrent Networks. arXiv.org (2015)
 (21) Kingma, D., Ba, J.: Adam: A Method for Stochastic Optimization. In: ICLR (2015)
 (22) Koren, Y., Bell, R., Volinsky, C.: Matrix Factorization Techniques for Recommender Systems. Computer (2009)
 (23) Lee, D.D., Seung, H.S.: Algorithms for Nonnegative Matrix Factorization. NIPS 2000 (2000)
 (24) Liang, D., Altosaar, J., Charlin, L.: Factorization Meets the Item Embedding: Regularizing Matrix Factorization with Item Cooccurrence. In: ICML Workshop (2016)
 (25) Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: ICML (2013)

(26)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed Representations of Words and Phrases and their Compositionality.
In: NIPS 2013: Advances in neural information processing systems (2013)  (27) Moore, J.L., Chen, S., Turnbull, D., Joachims, T.: Taste Over Time  The Temporal Dynamics of User Preferences. In: ISMIR (2013)
 (28) Ozsoy, M.G.: From Word Embeddings to Item Recommendation arXiv.org (2016)
 (29) Pan, R., Zhou, Y., Cao, B., Liu, N.N., Lukose, R., Scholz, M., Yang, Q.: OneClass Collaborative Filtering. In: 2008 Eighth IEEE International Conference on Data Mining (2008)
 (30) Paterek, A.: Improving regularized singular value decomposition for collaborative filtering. In: KDDCup 2007 (2007)
 (31) Rendle, S., Freudenthaler, C., Gantner, Z., SchmidtThieme, L.: BPR  Bayesian Personalized Ranking from Implicit Feedback. UAI (2009)

(32)
dos Santos, C.N., Gatti, M.: Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts.
In: COLING 2014, the 25th International Conference on Computational Linguistics (2014)  (33) Sercu, T., Goel, V.: Advances in Very Deep Convolutional Neural Networks for LVCSR. In: Interspeech (2016)
 (34) Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to Sequence Learning with Neural Networks. In: NIPS 2014 (2014)
 (35) Tan, Y.K., Xu, X., Liu, Y.: Improved Recurrent Neural Networks for Sessionbased Recommendations. arXiv.org (2016)
 (36) Van Den Oord, A., Dieleman, S., Schrauwen, B.: Deep contentbased music recommendation. In: NIPS (2013)
 (37) Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A Generative Model for Raw Audio. arXiv.org (2016)
Appendix A Table of Symbols
In order of appearance:
User vector for user at time
Item vector for item at time
rating of item by user
Global average rating
Rating bias of user at time
Rating bias of item at time
Hidden state at time
Cell state at time
Forget gate at time
Output gate at time
Reset gate at time
Update gate at time
Weight matrices for gate
Weight vector for gate
Bias for gate
Nonlinear function
Sigmoid function
Elementwise multiplication operator
Number of songs in the catalog
Embedding dimensionality
Set of all users on the platform
Ordered sequence of song vectors user listened to
Taste vector of user
RNN function with parameters
Loss function
L2 norm
Cosine distance
Uniform distribution between and
Dataset of song sequences
Minimum and maximum sampling offsets
Learning rate
Context vector for user
Vector concatenation operator
Ordered set of contexts on the Spotify platform
’th context in
set of contexts for song
Onehot vector of length with a 1 at position
Indicator function: 1 if , else 0
Time difference between playing songs and
Hidden dimensionality
Discount factor
Weightbased model function with weights
Regularization term
Riemann zeta function
Zipf probability density function with parameter
Short and longterm playlist RNN Short and longterm user listening history RNN Short and longterm weightbased model
Comments
There are no comments yet.