Attentive Neural Architecture Incorporating Song Features For Music Recommendation

11/20/2018 ∙ by Noveen Sachdeva, et al. ∙ IIIT Hyderabad

Recommender systems are an integral part of music sharing platforms. Often the aim of these systems is to increase the time the user spends on the platform, which gives them high commercial value. Systems that aim at increasing the average time a user spends on the platform often need to recommend songs that the user might want to listen to next at each point in time. This is different from recommendation systems that try to predict an item which might be of interest to the user at some point in the user's lifetime but not necessarily in the very near future. Predicting the next song the user might like requires some kind of modeling of the user's interests at the given point in time. Attentive neural networks have exploited the sequence in which the items were selected by the user to model the implicit short-term interests of the user for the task of next item prediction; however, we feel that the features of the songs occurring in the sequence could also convey important information about the short-term user interest which the items alone cannot. In this direction, we propose a novel attentive neural architecture which, in addition to the sequence of items selected by the user, uses the features of these items to better learn the user's short-term preferences and recommend the next song to the user.




1. Introduction

There has recently been an intense focus on recommendation systems by the Information Retrieval community because of their commercial value and their ability to provide a better experience to the user while interacting with a large database of items. Often there are a very large number of items in the database that might be of interest to the user, to the extent that the user might not even know they exist. Hence, they need to be presented to the user as recommendations. For example, on websites which sell different kinds of products and have a huge catalog, users might prefer not to browse for the items they might like and instead have them recommended by the system, which saves the user time and effort and thus creates a pleasant experience.

The content of the items chosen by the user is often an indication of the items that might be of interest to the user. In the case of music, these interests might not be constant and might change with time. In a recent work, Gupta et al. (own_work, ) try to model the short-term preferences of the user for music recommendation. They use Last.fm (last_fm, ) tags to find the song features important to the user instead of content derived from the audio. Last.fm tags look promising in describing the contents of a song and also provide a lot more information about the song which could be very hard to derive from either the audio or the metadata of the song. We agree with the claim that Last.fm tags could very well be used to model the song features which might be of interest to the user. However, a better similarity function than the one used by Gupta et al. could be learned, providing better performance. Gupta et al. also claim that it is the group of items that occur together which matters for recommendation and not the exact sequence in which they occur.

Motivated by this claim and by the search for a better similarity function, we apply attentive neural networks to the problem of next item prediction. Attentive neural networks give a different weight to each item in the sequence, and the weights need not follow the order of the items. The third-last item selected by the user could get more weight than the last item selected by the user, and hence the choice of attentive neural networks takes the claim into account. We also introduce a content attention component, which deals with the tags of the items, assuming these tags can indeed model the short-term interests of the user. This component takes as input the tags of the items selected by the user in the recent past.

2. Related Works

Recommender systems are a well-researched topic and a wide variety of systems have been developed. We cover some of them here to provide context to the reader.

2.1. Collaborative Filtering

Collaborative filtering exploits the user-item interactions to find similar users based on the number of identical items they selected. A variant is item-level collaborative filtering (cf2, ), wherein two items selected by the same user are considered to be similar.

There have been improvements to collaborative filtering such as matrix factorization (cf1, ) of the user-item matrix into the user feature matrix and the item feature matrix. Further, ranking algorithms such as Bayesian personalized ranking (bpr, ) provide better and more personalized recommendations to users.

2.2. Content Based Recommendation

Content-based systems recommend items based on the similarity of their content to the items already selected by the user (content1, ; content2, ). If the content of a song is similar to that of the songs the user likes, then that song is more likely to be recommended to the user. For example, there are systems which recommend songs based on the melody of the song (content4, ). Another example, which also assumes that tags can be sufficient to model the features of the songs that might be of importance to the user, is the work of Liang et al. (content3, ), which generates a latent vector for each song based on its semantic tags and then applies collaborative filtering to provide recommendations to the users.

2.3. Sequence Based Recommendation

Recommendation can be modeled as a sequence prediction problem, and the first attempt at it was by Brafman et al. (seq3, ). The initial attempts were based on simple models such as Markov chains, which have since been improved. One such improvement is having a personalized Markov chain for each user (seq1, ). With the popularity of recurrent neural networks, they have been applied (seq2, ) to the problem of next item prediction and have performed much better than the other systems. With the success of attentive neural networks in fields such as language and speech processing, they have been applied to recommender systems as well (attentive_1, ). Our model applies attention to the sequence of items as well as to the content of those items. Two context vectors are computed independently in the model: one which gives a context based solely on the items and another which gives a context based only on the tags of the items.

2.4. Hybrid Recommender Systems

Hybrid systems combine two or more techniques in order to provide better recommendations. Yoshii et al. (hybrid1, ) proposed a system wherein the recommendations are based on the ratings as well as the content, which is modeled using the polyphonic timbres of the songs. Hariri et al. (hariri, ) applied topic modeling: they model the sequence of songs heard by the user as a sequence of topics and then try to predict the next topic and the next song in that topic. The transitions between topics are learned from a collection of playlists. Gupta et al. (own_work, ) propose a hybrid model which takes into account the different songs played together and the tags of the songs. The approach is able to tell, at any given point in time, the features of the songs the user is interested in. Ikeda et al. (hybrid10, ) build a system which bases its recommendations on the transition of acoustic features over the songs. It tries to generate a sequence of songs over which the transition of acoustic features is smooth.

3. Our Method

Figure 1. Attentive Neural Network Architecture for Next Song Prediction

We present an attentive neural architecture to tackle the problem of next item prediction. It can include the tags of the items and models the short-term user interests based on the features of the items as well as the items themselves. We now present the formal problem statement that we tackle in this paper.

Predicting Next Song: Given the sequence of songs heard by the user, $(s_1, s_2, \dots, s_n)$, and the tag set $T_i$ for each song $s_i$, predict $s_{n+1}$.

3.1. Proposed Solution

The architecture we propose is shown in Figure 1. The output of the model is the probability of each item occurring next, given the items that occurred in the user history, $P(s_{n+1} \mid s_1, \dots, s_n)$. The first component receives as input the one-hot encodings of the songs which occurred before the song to be predicted. The second component receives the one-hot encodings of all the tags of the items occurring before the song to be predicted. The song-embedding layer maps the one-hot representations of the songs to a vector space, and the resulting vectors are fed to a Bi-GRU in the first component. Similarly, the tags of each song are converted to their distributed representations using another embedding layer. For each song, the average of the distributed representations of all its tags is fed to a Bi-GRU in the second component. For both components, the hidden states are given as input to an attention layer, where the attention score, or weight, of each hidden state is computed. The output of the attention layer is the context vector, which is the sum of the hidden states of the Bi-GRU weighted by the attention scores. The context vectors coming from both components are concatenated and fed to a smaller-dimension non-linear dense layer, using ReLU as the activation function. The output of this dense layer is then fed to another dense layer followed by a softmax operation, which gives the probability of each song being the next song. Below we present the equations for a better understanding of the model. Let $S$ be the set of all the songs.


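$$e^{s}_{i} = E_{S}\, x^{s}_{i}, \qquad x^{s}_{i} \in \{0,1\}^{|S|}, \quad E_{S} \in \mathbb{R}^{d_s \times |S|}$$

where $x^{s}_{i}$ is the one-hot representation of the $i$-th song, $E_{S}$ is the embedding layer, $d_s$ is the length of the embedded song vector and $S$ is the set of all songs.

$$e^{t}_{i,j} = E_{T}\, x^{t}_{i,j}, \qquad x^{t}_{i,j} \in \{0,1\}^{|T|}, \quad E_{T} \in \mathbb{R}^{d_t \times |T|}$$

where $x^{t}_{i,j}$ is the one-hot representation of the $j$-th tag of the $i$-th song, $E_{T}$ is the embedding layer, $d_t$ is the length of the embedded tag vector and $T$ is the set of all tags.

$$\bar{e}^{t}_{i} = \frac{1}{n_i} \sum_{j=1}^{n_i} e^{t}_{i,j}$$

where $n_i$ is the number of tags associated with the $i$-th song, and $\bar{e}^{t}_{i}$ is the average of the embedding vectors of all the tags associated with the $i$-th song.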
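The tag-averaging step described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function name `average_tag_embedding`, the dictionary-based embedding table, and the zero-vector fallback for a song with no tags are all assumptions made for the example.

```python
def average_tag_embedding(tag_ids, tag_embeddings, dim):
    """Average the embedding vectors of one song's tags.

    tag_embeddings maps a tag id to a list of `dim` floats (hypothetical layout).
    A song without tags gets a zero vector; this fallback is an assumption.
    """
    if not tag_ids:
        return [0.0] * dim
    vectors = [tag_embeddings[t] for t in tag_ids]
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy embedding table with two tags, each embedded in 2 dimensions.
toy_embeddings = {0: [1.0, 0.0], 1: [3.0, 2.0]}
song_vector = average_tag_embedding([0, 1], toy_embeddings, dim=2)
```

Averaging gives every song a fixed-length input for the tag Bi-GRU regardless of how many tags it carries.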

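The hidden states of the two Bi-GRUs, $h^{s}_{i}$ and $h^{t}_{i}$, which are fed to the attention layers, are simply the concatenations of the two individual unidirectional hidden states: $h^{s}_{i} = [\overrightarrow{h}^{s}_{i} ; \overleftarrow{h}^{s}_{i}]$ and $h^{t}_{i} = [\overrightarrow{h}^{t}_{i} ; \overleftarrow{h}^{t}_{i}]$ respectively.

Both attention layers output a context vector which is a weighted sum of all the hidden states. $c^{s}$ is the context vector computed from the song component of the model and $c^{t}$ is the context vector computed from the tag component of the model.

$$c^{s} = \sum_{i=1}^{n} \alpha^{s}_{i}\, h^{s}_{i}, \qquad c^{t} = \sum_{i=1}^{n} \alpha^{t}_{i}\, h^{t}_{i}$$

where $\alpha^{s}_{i}$ and $\alpha^{t}_{i}$ are the attention weights assigned to the $i$-th hidden state of the song and tag components respectively.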
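As a rough illustration of how a context vector is formed from the hidden states, the sketch below scores each hidden state, normalizes the scores with a softmax, and returns the weighted sum. The scoring function (a dot product with a query vector) is an assumption made for this example, since the text does not specify how the attention scores are computed; `attention_context` and `query` are hypothetical names.

```python
import math


def softmax(scores):
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(hidden_states, query):
    """Return (context vector, attention weights) for a list of hidden states."""
    # Assumed scoring function: dot product of each hidden state with `query`.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return context, weights


# Three toy hidden states; the query favours the first and third positions.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attention_context(hidden, query=[2.0, 0.0])
```

In this toy case the first hidden state receives as much weight as the last one, which shows that the weights are not tied to the position of an item in the sequence.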


Both context vectors, $c^{s}$ and $c^{t}$, are then concatenated, resulting in a final context vector, $c = [c^{s} ; c^{t}]$, which is then fed to a dense layer using the standard equations.

$$m = \mathrm{ReLU}(W_{m}\, c + b_{m})$$

where $W_{m}$ and $b_{m}$ are the weights and bias of the dense layer. $m$ is nothing but a vector representation of $c$ in a smaller-dimensional vector space, which significantly reduces the training time because the following dense layer has a huge dimension (the number of songs). The final output is a dense layer of the size of the total number of songs followed by a softmax function, which gives the probability of occurrence of each song given the user's history.

$$\hat{y} = \mathrm{softmax}(W_{o}\, m + b_{o})$$
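The two dense layers can be sketched as follows. This is a toy illustration with hand-written list arithmetic rather than the PyTorch layers the paper uses; all function names, the weight layout (one row per output unit), and the tiny dimensions are assumptions chosen for readability.

```python
import math


def dense(x, weights, bias):
    """Affine layer: one row of `weights` per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(z):
    shift = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def next_song_probabilities(c_song, c_tag, w_mid, b_mid, w_out, b_out):
    context = c_song + c_tag                     # concatenate the two context vectors
    middle = relu(dense(context, w_mid, b_mid))  # small non-linear middle layer
    return softmax(dense(middle, w_out, b_out))  # one probability per song


# Toy sizes: context of length 2 + 2, middle layer of length 2, catalogue of 3 songs.
probs = next_song_probabilities(
    [1.0, 0.0], [0.0, 1.0],
    w_mid=[[1.0, 0.0, 0.0, 1.0], [0.0, -1.0, -1.0, 0.0]], b_mid=[0.0, 0.0],
    w_out=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], b_out=[0.0, 0.0, 0.0],
)
```

The output weight matrix has one row per song, so its size grows with the catalogue; a short middle layer keeps the number of columns in that matrix small.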


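Negative log likelihood was used as the loss function and the optimization problem becomes:

$$\min_{E_{S},\, E_{T}} \; -\sum_{u} \sum_{i} \log P\left(s^{u}_{i+1} \mid s^{u}_{1}, \dots, s^{u}_{i};\; E_{S}, E_{T}\right)$$

where $u$ is a user session in the dataset and $s^{u}_{i+1}$ is the actual song which occurs after the given songs $s^{u}_{1}, \dots, s^{u}_{i}$. $E_{S}$ and $E_{T}$ are the matrices consisting of song and tag embeddings respectively. We iterate over all the sessions in the dataset and all time steps in those sessions.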
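For a single session, the negative log likelihood can be computed as in the sketch below. It assumes the model has already produced one probability distribution over the catalogue per time step; the function name and the list-of-lists layout are hypothetical.

```python
import math


def negative_log_likelihood(step_probabilities, next_songs):
    """Sum of -log p(actual next song) over the time steps of one session."""
    return -sum(math.log(probs[song]) for probs, song in zip(step_probabilities, next_songs))


# Two time steps over a toy catalogue of two songs (indices 0 and 1).
loss = negative_log_likelihood([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

Summing this quantity over all sessions gives the objective that is minimized during training.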

4. Experiments

4.1. Dataset

The dataset was a subset taken from the Last.fm dataset (last_fm, ). Each log in the dataset consisted of a user id, song name, artist name and time stamp. We performed experiments on a subset consisting of the 6-month histories of all the users, and the tags for each song were retrieved using the public Last.fm API. The user histories were divided into sessions as done by Gupta et al. (own_work, ). The first 70 percent of the sessions of each user (in order of occurrence) were put in the training set and the last 30 percent in the test set. Sessions having fewer than 5 songs were discarded.
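The chronological split can be sketched as below. The sketch assumes that sessions are already built and ordered by time for each user, that short sessions are discarded before the split, and that the 70 percent cut is rounded down; the text does not state these details, and the function and variable names are hypothetical.

```python
def split_sessions(sessions_by_user, train_percent=70, min_songs=5):
    """Per-user chronological split of sessions into training and test sets."""
    train, test = [], []
    for sessions in sessions_by_user.values():
        # Assumption: sessions with fewer than `min_songs` songs are dropped first.
        kept = [s for s in sessions if len(s) >= min_songs]
        cut = len(kept) * train_percent // 100  # integer arithmetic, rounds down
        train.extend(kept[:cut])
        test.extend(kept[cut:])
    return train, test


# One toy user with ten sessions of five songs each and one short session.
toy_history = {"user_a": [list(range(i, i + 5)) for i in range(10)] + [[1, 2]]}
train_sessions, test_sessions = split_sessions(toy_history)
```

Splitting per user keeps each user's earliest sessions in training and the most recent ones for testing.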

Description                  Value
Total Logs                   3553321
Total Users                  759
Total Sessions               110410
Total Unique Songs           386046
Total Unique Tags            487844
Average Songs Per Session    32.18
Average Logs Per User        4681.58

Table 1. Dataset Statistics

Model     k=10     k=20     k=30     k=40     k=50
POP       0.85     0.97     1.24     1.69     2.14
BPR-MF    7.34     8.13     8.56     8.98     9.27
SSCF      13.69    17.12    19.66    21.30    22.34
RNN       14.42    16.26    16.74    17.09    17.38
SBRS      19.15    26.14    28.83    30.35    31.40
SABR      26.36    28.61    29.97    31.72    32.47
STABR     28.95    30.85    31.90    32.65    34.26

Table 2. Results

4.2. Baselines

The architecture is tested against the following baselines:

  1. POP: The most popular items in the training set are recommended to the users.

  2. BPR-MF: A matrix factorization based model which ranks items differently for each user (bpr, ). The implementation by MyMediaLite was used with default parameters except for the number of features, which was set to 100 for best results. We report the mean over 5 runs for this model.

  3. Session Based Collaborative Filtering (SSCF): Instead of building a user-item matrix, this system builds a session-item matrix and recommends items by finding the sessions in the database that are similar to the active session, based on the songs which have already occurred in the current session. The similar sessions were found based on the last 5 songs heard by the user, and the results are reported based on the 100 nearest sessions.

  4. RNN: In this method, the sequence of items occurring together is fed to a recurrent neural network that tries to predict the next item at each time step. All sequences in the training set are used to learn the model, and to get the next recommendation, all the songs heard by the user until that point are fed to the network. We used the implementation provided by the authors of (umap, ), based on mini-batch stochastic gradient descent. We set the batch size to 20 and used the categorical cross-entropy loss function, with 100 hidden units for the RNN and a learning rate of 0.1.

  5. Subsession Based Recommender System (SBRS): This method was proposed by Gupta et al. (own_work, ). In this method, short-term user preferences are found using the tags of the songs the user heard. The user history is divided into small windows of constant preference, and songs are recommended based on the windows in the training set that are similar to the active window.
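To make the session-based collaborative filtering baseline more concrete, here is a minimal sketch of a nearest-session recommender. The similarity measure (number of shared songs with the last 5 songs of the active session) and the overlap-weighted voting are assumptions made for illustration; the text only states that the last 5 songs and the 100 nearest sessions were used. The function name is hypothetical.

```python
from collections import Counter


def sscf_recommend(active_session, training_sessions, k, window=5, neighbours=100):
    """Recommend k songs from the training sessions closest to the active one."""
    recent = set(active_session[-window:])
    scored = []
    for session in training_sessions:
        overlap = len(recent & set(session))  # assumed similarity: shared songs
        if overlap:
            scored.append((overlap, session))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    votes = Counter()
    for overlap, session in scored[:neighbours]:
        # Songs already heard in the active session are not recommended again.
        for song in set(session) - set(active_session):
            votes[song] += overlap
    return [song for song, _ in votes.most_common(k)]


toy_training = [["a", "b", "c", "x"], ["a", "b", "x", "y"], ["q", "r", "s"]]
recommended = sscf_recommend(["a", "b", "c"], toy_training, k=2)
```

Here song "x" appears in both similar sessions and is ranked above "y", which appears in only one of them.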

4.3. Training & Testing

We use the mini-batch stochastic gradient descent (SGD) algorithm coupled with Adagrad (ada, ) and a learning rate of 0.05 to train each model. A batch size of 32 was used; the tag embeddings were of length 25 and the song embeddings of length 50. The length of the middle layer was set to 50, and that of the output layer was equal to the number of songs in our dataset, 386046. Dropout regularization with a 0.1 discard probability was used for both the middle and the output layers. We trained the model on a single GTX 1080Ti GPU, and the proposed model was implemented using PyTorch (pytorch, ).

For testing the models, we adopt the same methodology as followed by Gupta et al. (own_work, ). We iterate through the test histories of the users, predicting the next song in the history while giving the songs up to that point in time as input to the system. We report hit ratio@k (hit_ratio, ), where k is the number of songs in the predicted set. We tested two systems based on attentive neural networks. The first uses only the component which takes the songs into account and not the tags, and is referred to as SABR (Song Attention Based Recommendation); the second uses both components and is referred to as STABR (Song and Tag Attention Based Recommendation).
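The evaluation loop described above can be sketched as follows. The sketch assumes that prediction starts from the second song of each test session and that the metric is the fraction of steps at which the actual next song is among the k predicted songs; `recommend` stands for any model exposing a hypothetical `recommend(prefix, k)` interface.

```python
def hit_ratio_at_k(recommend, test_sessions, k):
    """Fraction of next-song predictions whose top-k list contains the actual song."""
    hits = 0
    trials = 0
    for session in test_sessions:
        for t in range(1, len(session)):
            predicted = recommend(session[:t], k)  # songs heard so far as input
            if session[t] in predicted:
                hits += 1
            trials += 1
    return hits / trials if trials else 0.0


def toy_recommend(prefix, k):
    # Hypothetical model: always predicts the song id after the last one heard.
    return [prefix[-1] + 1][:k]


score = hit_ratio_at_k(toy_recommend, [[1, 2, 3, 5]], k=1)
```

The toy model is right for the second and third songs and wrong for the fourth, so two of the three predictions count as hits.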

5. Results and Conclusions

The results are shown in Table 2. Attentive neural networks perform better than all the baseline models at every value of k, and among the attentive neural networks, the one with the tag component (STABR) gives a consistent gain over the one without it (SABR), for example 28.95 versus 26.36 at k=10. This shows that the tags are indeed powerful in modeling the short-term user preferences, and the neural network probably learns a better similarity function than the one proposed by Gupta et al., hence the gain.
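The size of the gain from the tag component can be read directly off the SABR and STABR rows of the results table. The short sketch below only recomputes differences from those reported numbers; the variable names are illustrative.

```python
k_values = [10, 20, 30, 40, 50]
sabr = [26.36, 28.61, 29.97, 31.72, 32.47]   # song attention only
stabr = [28.95, 30.85, 31.90, 32.65, 34.26]  # song and tag attention

# Absolute gain of STABR over SABR at each k, in the units of the table.
absolute_gain = {k: round(b - a, 2) for k, a, b in zip(k_values, sabr, stabr)}
# Gain relative to the SABR score, in percent.
relative_gain = {k: round(100 * (b - a) / a, 1) for k, a, b in zip(k_values, sabr, stabr)}
```

On these numbers the tag component adds between 0.93 and 2.59 points, with the largest gain at k=10 and the smallest at k=40.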


References

  • (1) A. V. D. Oord, S. Dieleman, and B. Schrauwen: Deep content-based music recommendation. In NIPS, pp. 2643–2651, 2013.
  • (2) Badrul Sarwar, George Karypis, Joseph Konstan and John Riedl: Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, WWW '01, pp. 285–295, New York, NY, USA, 2001.
  • (3) Brian McFee, Luke Barrington, and Gert R. G. Lanckriet: Learning content similarity for music recommendation. IEEE Transactions on Audio, Speech & Language Processing, 20(8), 2012.
  • (4) D. Lee, S. E. Park, M. Kahng, S. Lee, and S.-g. Lee: Exploiting contextual information from event logs for personalized recommendation. In Computer and Information Science, Studies in Computational Intelligence, pp. 121–139. Springer, 2010.
  • (5) Dawen Liang, Minshu Zhan, and Daniel PW Ellis: Content-aware collaborative music recommendation using pre-trained neural networks. In Proceedings of the 16th International Society for Music Information Retrieval Conference, ISMIR 2015, Malaga, Spain, October 26-30, 2015.
  • (6) Duchi, John, Hazan, Elad, and Singer, Yoram: Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
  • (7) Fang-Fei Kuo and Man-Kwan Shan: A personalized music filtering system based on melody style classification. In 2002 IEEE International Conference on Data Mining, 2002. Proceedings., pp. 649–652, 2002.
  • (8) Gupta K., Sachdeva N., Pudi V. (2018) Explicit Modelling of the Implicit Short Term User Preferences for Music Recommendation. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds) Advances in Information Retrieval. ECIR 2018. Lecture Notes in Computer Science, vol 10772. Springer, Cham.
  • (9) Hariri, N., Mobasher, B., Burke, R.: Context-aware music recommendation based on latent topic sequential patterns. Sixth ACM Conference on Recommender Systems, pp. 131–138. Dublin (2012).
  • (10) J. Li, P. Ren, Z. Chen, Z. Ren, and J. Ma: Neural attentive session-based recommendation. CIKM, 2017.
  • (11), Retrieved date: 2017/04/23
  • (12) R. I. Brafman, D. Heckerman, and G. Shani: Recommendation as a stochastic sequential decision problem. In ICAPS, pp. 164–173, 2003.
  • (13) R. Devooght and H. Bersini: Long and Short-Term Recommendations with Recurrent Neural Networks. Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization (2017), pp. 13–21.
  • (14) Shobu Ikeda, Kenta Oku, Kyoji Kawagoe: Music Playlist Recommendation Using Acoustic-Feature Transitions. Proceedings of the Ninth International C* Conference on Computer Science & Software Engineering, July 20-22, 2016, Porto, Portugal.
  • (15) S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme: BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 452–461. AUAI Press, 2009.
  • (16) S. Rendle, C. Freudenthaler and L. Schmidt-Thieme: Factorizing personalized Markov chains for next-basket recommendation. In WWW, pp. 811–820. ACM, 2010.
  • (17)
  • (18) Yoshii, Kazuyoshi and Goto, Masataka and Komatani, Kazunori and Ogata, Tetsuya and Okuno, Hiroshi G: Hybrid Collaborative and Content-based Music Recommendation Using Probabilistic Model with Latent User Preferences. Proceedings of the International Conference on Music Information Retrieval, 2006.
  • (19) Y. Hu, Y. Koren, and C. Volinsky: Collaborative filtering for implicit feedback datasets. In ICDM, pp. 263–272, 2008.
  • (20) Y. Zhang, H. Dai, C. Xu, J. Feng, T. Wang, J. Bian, B. Wang, and T. Liu: Sequential click prediction for sponsored search with recurrent neural networks. In AAAI, pp. 1369–1375, 2014.