1 Introduction
Professional music curators and DJs are usually able to carefully select and order songs into a list that gives listeners an excellent listening experience. For a music radio with a specific topic, they can collect songs related to the topic and arrange them in a smooth sequence. By considering the preferences of users, curators can also find what users like and recommend several lists of songs to them. However, different people have different preferences regarding diversity, popularity, and so on. It would therefore be valuable to refine playlists on the fly according to the preferences of each user. Moreover, as online music streaming services grow, the demand for efficient and effective music playlist recommendation keeps increasing. Automatic and personalized music playlist generation thus becomes a critical issue.
However, it is infeasible and expensive for editors to generate suitable playlists daily or hourly for all users based on their preferences about trends, novelty, diversity, etc. Therefore, most previous works address the problem under particular assumptions. McFee et al. [14] consider playlist generation as a language modeling problem and solve it with statistical techniques. Unfortunately, statistical methods do not perform well on small datasets. Pampalk et al. [16] generate playlists by exploiting explicit user behaviors such as skipping. However, they do not provide a systematic way to handle implicit user preferences on playlists.
As a result, to generate personalized playlists automatically and flexibly, we develop a novel and scalable music playlist generation system. The system consists of three main steps. First, we adopt Chen et al.'s work [4] to generate baseline playlists based on the preferences of users about songs. In detail, given the relationship between users and songs, we first construct a corresponding bipartite graph. With the user-song graph, we can calculate embedding features of songs and thus obtain the baseline playlist for each song by finding its k-nearest neighbors. Second, by formulating baseline playlists as sequences of words, we can pretrain an RNN language model (RNNLM) to obtain better initial parameters for the subsequent optimization with policy gradient reinforcement learning. We adopt RNNLM not only because it is better than traditional statistical methods at learning how information progresses through a sequence in many generation tasks, but also because neural networks can be combined with reinforcement learning to achieve better performance [10]. Finally, given preferences from user profiles and the pretrained parameters, we can generate personalized playlists by exploiting techniques of policy gradient reinforcement learning with corresponding reward functions. Combining these training steps, the experimental results show that we can generate personalized playlists that satisfy different preferences of users with ease.
Our contributions are summarized as follows:

We design an automatic playlist generation framework, which is able to provide timely recommended playlists for online music streaming services.

We remodel music playlist generation into a sequence prediction problem using RNNLM which is easily combined with policy gradient reinforcement learning method.

The proposed method can flexibly generate suitable personalized playlists according to user profiles using corresponding optimization goals in policy gradient.
The rest of this paper is organized as follows. In Section 2, we introduce several related works about playlist generation and recommendation. In Section 3, we provide essential prior knowledge of our work related to policy gradient. In Section 4, we introduce the details of our proposed model, attention RNNLM with concatenation (ACRNNLM). In Section 5, we show the effectiveness of our method and conclude our work in Section 6.
2 Related Work
Given a list of songs, previous works try to rearrange them into better song sequences [1, 12, 5, 3]. First, they construct a song graph by considering the songs in a playlist as vertices and the relevance of audio features between songs as edges. Then they find a Hamiltonian path with certain properties, such as smooth transitions between songs [3], to create a new sequencing of the songs. User feedback is also an important consideration when generating playlists [16, 13, 6, 7]. By considering several properties of the songs that users have recently played and liked, such as tempo, loudness, topics, and artists, the authors of [6, 7] provide personalized playlists based on the properties users favor. Pampalk et al. [16] consider skip behaviors as negative signals, and the proposed approach automatically chooses the next song according to audio features while avoiding skipped songs. Maillet et al. [13] provide a more interactive way for users: users can manipulate the weights of tags to express high-level music characteristics and obtain the corresponding playlists they want. To better integrate user behavior into playlist generation, several works combine playlist generation algorithms with reinforcement learning techniques [19, 11]. Xing et al. [19] first introduce exploration into traditional collaborative filtering to learn the preferences of users. Liebman et al. [11] bring the formulation of Markov Decision Process into the playlist generation framework and design algorithms that learn representations of user preferences based on hand-crafted features. By using these representations, they can generate personalized playlists for users.
Beyond playlist generation, several works adopt the concept of playlist generation to facilitate recommendation systems. Given a set of songs, Vargas et al. [18] propose several scoring functions, such as diversity and novelty, and retrieve the top-K songs with the highest scores for each user as the resulting recommended list of songs. Chen et al. [4] propose a query-based music recommendation system that allows users to select a preferred song as a seed song to obtain related songs as a recommended playlist.
3 Policy Gradient Reinforcement Learning
Reinforcement learning has attracted wide public attention since Silver et al. [17] proposed a general reinforcement learning algorithm that enables an agent to achieve superhuman performance in many games. Reinforcement learning has also been successfully applied to many other problems, such as dialogue generation, that are modeled as a Markov Decision Process (MDP).
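A Markov Decision Process is usually denoted by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where

$\mathcal{S}$ is a set of states

$\mathcal{A}$ is a set of actions

$\mathcal{P}(s, a, s') = \Pr[s' \mid s, a]$ is the transition probability that action $a$ in state $s$ will lead to state $s'$

$\mathcal{R}(s, a) = \mathbb{E}[r \mid s, a]$ is the expected reward that an agent will receive when the agent does action $a$ in state $s$

$\gamma \in [0, 1]$ is the discount factor representing the importance of future rewards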
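Policy gradient is a reinforcement learning algorithm to solve MDP problems. Modeling an agent with parameters $\theta$, the goal of this algorithm is to find the best $\theta$ of a policy $\pi_\theta(s, a) = \Pr[a \mid s, \theta]$ measured by the average reward per time-step
$$J(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(s, a)\, \mathcal{R}(s, a) \tag{1}$$
where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain induced by $\pi_\theta$.
Usually, we assume that $\pi_\theta(s, a)$ is differentiable with respect to its parameters $\theta$, i.e., $\partial \pi_\theta(s, a) / \partial \theta$ exists, and solve the optimization problem in Eq. (1) by gradient ascent. Formally, given a small enough step size $\alpha$, we update the parameters $\theta$ by
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) \tag{2}$$
where
$$\nabla_\theta J(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(s, a)\, \nabla_\theta \log \pi_\theta(s, a)\, \mathcal{R}(s, a) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(s, a)\, \mathcal{R}(s, a)\right] \tag{3}$$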
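The gradient-ascent update described above can be sketched on a toy problem. The following is a minimal illustration, not the system's implementation: it assumes a single-state problem (so the stationary distribution drops out), a softmax policy over three actions, and made-up rewards; the names `policy_gradient_step`, `theta`, and `rewards` are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def policy_gradient_step(theta, rewards, alpha=0.1):
    """One exact gradient-ascent step on J(theta) = sum_a pi_theta(a) * R(a).

    Assumption: a single state, so pi_theta depends on the action only.
    For a softmax policy, grad log pi(a) = one_hot(a) - pi, hence
    E_a[grad log pi(a) * R(a)] = pi * (R - J).
    """
    pi = softmax(theta)
    expected_reward = pi @ rewards
    grad = pi * (rewards - expected_reward)
    return theta + alpha * grad

theta = np.zeros(3)                   # uniform initial policy
rewards = np.array([0.0, 1.0, 0.2])   # toy rewards, one per action
for _ in range(500):
    theta = policy_gradient_step(theta, rewards)
pi = softmax(theta)                   # probability mass moves to the best action
```

In practice the expectation is estimated from sampled state-action pairs rather than computed exactly, which is the setting the playlist model operates in.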
4 The Proposed Model
The proposed model consists of two main components. We first introduce the structure of the proposed RNN-based model in Section 4.1. Then in Section 4.2, we formulate the problem as a Markov Decision Process and solve the formulated problem by policy gradient to generate refined playlists.
4.1 Attention RNN Language Model
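An RNNLM estimates the probability of the next token $w_t$ given the preceding tokens $w_{1:t-1}$ of a sequence with a recurrent function
$$h_t = f(h_{t-1}, w_{t-1}) \tag{4}$$
and an output function, usually softmax,
$$\Pr[w_t = v_i \mid w_{1:t-1}] = \frac{\exp(W_{v_i}^\top h_t + b_{v_i})}{\sum_{k} \exp(W_{v_k}^\top h_t + b_{v_k})} \tag{5}$$
where the implementation of the function $f$ depends on which kind of RNN cell we use, $h_t \in \mathbb{R}^{D}$, $W \in \mathbb{R}^{D \times V}$ with the column vector $W_{v_i}$ corresponding to a word $v_i$, and $b \in \mathbb{R}^{V}$ with the scalar $b_{v_i}$ corresponding to a word $v_i$ ($D$ is the number of units in the RNN, and $V$ is the number of unique tokens in all sequences). We then update the parameters of the RNNLM by maximizing the log-likelihood on a set of sequences with size $N$, $\{s_1, \dots, s_N\}$, and the corresponding tokens, $\{w_1^{s_i}, \dots, w_{|s_i|}^{s_i}\}$.
$$\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=2}^{|s_n|} \log \Pr[w_t^{s_n} \mid w_{1:t-1}^{s_n}] \tag{6}$$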
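As a rough illustration of the recurrent function, the softmax output function, and the log-likelihood objective described above, the sketch below scores one token sequence. It is written under stated assumptions: a plain tanh cell stands in for the recurrent function (the experiments use LSTM cells), parameters are random, and all names (`recurrent_step`, `next_token_probs`, `sequence_log_likelihood`) are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def recurrent_step(h_prev, prev_token, E, U):
    # Stand-in for the recurrent function f: a plain tanh cell (assumption).
    return np.tanh(E[prev_token] + U @ h_prev)

def next_token_probs(h, W, b):
    # Softmax output over all V tokens; W has shape (D, V), b has shape (V,).
    return softmax(W.T @ h + b)

def sequence_log_likelihood(tokens, E, U, W, b):
    """Sum of log-probabilities of each token given the tokens before it."""
    h = np.zeros(U.shape[0])
    total = 0.0
    for prev_token, target in zip(tokens[:-1], tokens[1:]):
        h = recurrent_step(h, prev_token, E, U)
        total += np.log(next_token_probs(h, W, b)[target])
    return total

rng = np.random.default_rng(0)
D, V = 8, 5                             # hidden units, vocabulary size (toy values)
E = rng.normal(size=(V, D))             # token embeddings
U = rng.normal(size=(D, D)) * 0.1       # recurrent weights
W = rng.normal(size=(D, V))             # output weights
b = np.zeros(V)                         # output biases
playlist = [0, 3, 1, 4]                 # a toy sequence of song ids
log_lik = sequence_log_likelihood(playlist, E, U, W, b)
probs = next_token_probs(np.zeros(D), W, b)  # zero state and zero bias give a uniform distribution
```

Training would average this quantity over all sequences and maximize it with respect to the parameters.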
4.1.1 Attention in RNNLM
The attention mechanism in sequence-to-sequence models has been proven effective in image caption generation, machine translation, dialogue generation, and other fields. Previous work also indicates that attention is even more effective in RNNLM [15].
In the attention RNN language model (ARNNLM), given the hidden states from time $t - C_{ws}$ to $t$, denoted as $h_{t-C_{ws}:t}$, where $C_{ws}$ is the attention window size, we want to compute a context vector $c_t$ as a weighted sum of the hidden states $h_{t-C_{ws}:t-1}$ and then encode the context vector $c_t$ into the original hidden state $h_t$.
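The idea of a context vector as a weighted sum of the hidden states in the attention window can be sketched as follows. The scoring and encoding functions are not recoverable from this text, so the sketch assumes dot-product scores and a tanh layer over the concatenation of the context vector and the current hidden state; `attention_context`, `encode_context`, and `W_c` are hypothetical names.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()

def attention_context(window_states, h_t):
    """Context vector as a weighted sum of the previous hidden states.

    window_states: (C, D) hidden states inside the attention window.
    h_t:           (D,)   current hidden state.
    Assumption: dot-product scores normalized by softmax.
    """
    scores = window_states @ h_t
    weights = softmax(scores)
    context = weights @ window_states
    return context, weights

def encode_context(context, h_t, W_c):
    # Assumed encoding: tanh of a linear map over [context; h_t], W_c is (D, 2D).
    return np.tanh(W_c @ np.concatenate([context, h_t]))

rng = np.random.default_rng(1)
C, D = 3, 4                              # window size and hidden size (toy values)
window = rng.normal(size=(C, D))
h_t = rng.normal(size=D)
context, weights = attention_context(window, h_t)
h_att = encode_context(context, h_t, rng.normal(size=(D, 2 * D)))
```

Because the weights sum to one, the context vector always lies in the convex hull of the hidden states in the window.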
4.1.2 Our Attention RNNLM with concatenation
In our work, $s$ and $w$ are playlists and songs, respectively, obtained by adopting Chen et al.'s work [4]. More specifically, given a seed song $w_1^s$ for a playlist $s$, we find the top-k approximate nearest neighbors of $w_1^s$ to form a list of songs $\{w_1^s, \dots, w_{|s|}^s\}$.
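The construction of a baseline playlist from a seed song can be sketched as below. This is only an illustration: it uses exact brute-force Euclidean search rather than an approximate nearest-neighbor index, the embeddings are made up, and `baseline_playlist` is a hypothetical helper.

```python
import numpy as np

def baseline_playlist(seed, embeddings, k):
    """Return the seed song followed by its k nearest songs in embedding space.

    Assumption: exact Euclidean search over all songs; an approximate
    nearest-neighbor index would replace this step at scale.
    """
    dists = np.linalg.norm(embeddings - embeddings[seed], axis=1)
    dists[seed] = np.inf                      # never return the seed itself
    neighbors = np.argsort(dists)[:k]
    return [seed] + neighbors.tolist()

# Toy 2-D song embeddings; row index is the song id.
embeddings = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [0.2, 0.1],
])
playlist = baseline_playlist(seed=0, embeddings=embeddings, k=2)
```

Each such list then plays the role of one token sequence when pretraining the language model.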
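The proposed attention RNNLM with concatenation (ACRNNLM) is shown in fig:model. We pad $w_{1:t}$ to $w_{1:T}$ and concatenate the corresponding hidden states $h_{1:T}$ as the input of our RNNLM's output function in Eq. (5), where $T$ is the maximum number of songs we consider in one playlist. Therefore, our output function becomes
$$\Pr[w_t = v_i \mid w_{1:t-1}] = \frac{\exp(W_{v_i}^\top \bar{h} + b_{v_i})}{\sum_{k} \exp(W_{v_k}^\top \bar{h} + b_{v_k})} \tag{11}$$
where $W \in \mathbb{R}^{DT \times V}$, $b \in \mathbb{R}^{V}$, and
$$\bar{h} = [h_1; h_2; \dots; h_T] \in \mathbb{R}^{DT} \tag{12}$$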
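A minimal sketch of the padding-and-concatenation step described above follows. It assumes that positions beyond the current length are padded with zero vectors, which the text does not specify; `padded_concat` and `concat_output_probs` are hypothetical names, and the parameters are random.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()

def padded_concat(hidden_states, T):
    """Zero-pad t hidden states (each of size D) to T steps and flatten to D*T."""
    hidden_states = np.asarray(hidden_states, dtype=float)
    t, D = hidden_states.shape
    padded = np.zeros((T, D))     # assumption: padding uses zero vectors
    padded[:t] = hidden_states
    return padded.reshape(-1)

def concat_output_probs(hidden_states, W, b, T):
    # Softmax over V songs; W has shape (D*T, V), b has shape (V,).
    return softmax(W.T @ padded_concat(hidden_states, T) + b)

rng = np.random.default_rng(2)
D, T, V = 4, 5, 6                 # toy sizes
W = rng.normal(size=(D * T, V))
b = np.zeros(V)
states = rng.normal(size=(2, D))  # two songs generated so far
flat = padded_concat(states, T)
probs = concat_output_probs(states, W, b, T)
```

With this layout the output layer sees every earlier hidden state at a fixed position, at the cost of an output matrix that grows linearly with the maximum playlist length.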
4.2 Policy Gradient
We exploit policy gradient to optimize the policy gradient objective introduced in Section 3, with the problem formulated as follows.
4.2.1 Action
An action is a song id, which is a unique representation of each song, that the model is about to generate. The set of actions in our problem is finite since we would like to recommend limited range of songs.
4.2.2 State
A state is the list of songs we have already recommended, including the seed song.
4.2.3 Policy
A policy takes the form of our ACRNNLM and is defined by its parameters $\theta$.
4.2.4 Reward
The reward is a weighted sum of several reward functions. In the following, we formulate four important metrics for playlist generation. The policy of our pretrained ACRNNLM is denoted as $\pi_{\theta_{RNN}}$ with parameters $\theta_{RNN}$, and the policy of our ACRNNLM optimized by policy gradient is denoted as $\pi_{\theta_{RL}}$ with parameters $\theta_{RL}$.
 Diversity

represents the variety in a recommended list of songs. Several playlists generated by Chen et al.'s work [4] are composed of songs by the same artist or from the same album. Such playlists are not regarded as good recommendations because of their low diversity. Therefore, we formulate the measurement of diversity by the Euclidean distance between the embedding of the last song in the existing playlist and that of the predicted song.
(13)
where is the euclidean distance between the embeddings of and , and is a parameter that represents the euclidean distance that we want the model to learn.
 Novelty

is also important for a playlist generation system. We would like to recommend something new to users instead of something familiar. Unlike previous works, our model can easily generate playlists with novelty by applying a corresponding reward function. As a result, we model the reward of novelty as a weighted sum of normalized playcounts in periods of time [20].
(14)
where is the weight of a time period, , with a constraint , is playcounts of the song , and is the set of actions. Note that songs with less playcounts have higher value of , and vice versa.
 Freshness

is a subjective metric for personalized playlist generation. For example, the latest songs are usually more interesting to young people, while older people may prefer old-school songs. Here, we arbitrarily choose one direction of optimization for the agent to show the feasibility of our approach.
(15)
where is the release year of the song .
 Coherence

is the major feature we should consider, to avoid situations in which the generated playlists are highly rewarded but lack a cohesive listening experience. We therefore consider the policy of our pretrained language model, which is well trained on coherent playlists, as a good enough generator of coherent playlists.
(16)
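Combining the above reward functions, our final reward for the action $a$ is
$$R(s, a) = \gamma_1 R_1(s, a) + \gamma_2 R_2(s, a) + \gamma_3 R_3(s, a) + \gamma_4 R_4(s, a) \tag{17}$$
where $R_1$, $R_2$, $R_3$, and $R_4$ are the diversity, novelty, freshness, and coherence rewards in Eqs. (13)-(16), and the selection of the weights $\gamma_1$, $\gamma_2$, $\gamma_3$, and $\gamma_4$ depends on different applications.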
Note that although we only list four reward functions here, the optimization goal can be easily extended by a linear combination of more reward functions.
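The linear combination of reward functions can be sketched as follows. The four reward functions here are toy placeholders, not the formulas of this section, and `combined_reward` is a hypothetical helper; the weights follow the RLCOMBINE setting listed in Table 1.

```python
def combined_reward(state, action, reward_fns, weights):
    """Weighted sum of several reward functions evaluated on (state, action)."""
    if len(reward_fns) != len(weights):
        raise ValueError("need exactly one weight per reward function")
    return sum(w * fn(state, action) for w, fn in zip(weights, reward_fns))

# Toy placeholder rewards (assumptions, not the paper's definitions).
def toy_diversity(state, action):
    return 1.0 if action not in state else 0.0

def toy_novelty(state, action):
    return 0.5

def toy_freshness(state, action):
    return 0.25

def toy_coherence(state, action):
    return -1.0

weights = (0.2, 0.2, 0.2, 0.4)   # diversity, novelty, freshness, coherence
reward = combined_reward(
    state=[10, 11],              # song ids already in the playlist
    action=12,                   # candidate next song id
    reward_fns=[toy_diversity, toy_novelty, toy_freshness, toy_coherence],
    weights=weights,
)
```

Adding a further objective only requires appending one more function and one more weight, which is the extensibility the note above refers to.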
5 Experiments and Analysis
In the following experiments, we first introduce the details of the dataset and evaluation metrics in Section 5.1 and the training details in Section 5.2. In Section 5.3, we compare pretrained RNNLMs with different combinations of mechanisms by perplexity to show that our proposed ACRNNLM is more effective and efficient than the others. To demonstrate the effectiveness of our proposed method, ACRNNLM combined with reinforcement learning, we adopt three standard metrics, diversity, novelty, and freshness (cf. Section 5.1), to validate our models in Section 5.4. Moreover, we demonstrate in Section 5.5 that the properties of the generated playlists can be manipulated flexibly and with little effort. Finally, in Section 5.6, we discuss the design of reward functions for given preferred properties.
5.1 Dataset and Evaluation Metrics
The playlist dataset is provided by KKBOX Inc., which is a regional leading music streaming company. It consists of playlists, each of which is composed of songs. There are unique songs in total.
To validate our proposed approach, we use the following metrics.
 Perplexity

is calculated based on the song probability distributions as follows.
$$\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{n=1}^{N} \sum_{x} q(x) \log p(x)\right)$$
where $N$ is the number of training samples, $x$ is a song in our song pool, $p$ is the predicted song probability distribution, and $q$ is the song probability distribution in the ground truth.
 Diversity

is computed as the number of distinct unigrams of artists scaled by the total length of each playlist, which is measured by Distinct-1 [9].
 Novelty

is designed for recommending something new to users [20]. The more novel a playlist is, the lower the metric is.
 Freshness

is directly measured by the average release year of songs in each playlist.
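Three of the metrics above are simple enough to sketch directly. The code below is an illustration under assumptions: perplexity is computed against one-hot ground truth (one true next song per sample), and the function names are hypothetical. The novelty metric is omitted because its exact form is not recoverable from this text.

```python
import math

def perplexity(predicted_dists, true_songs):
    """exp of the average negative log-probability assigned to the true songs.

    Assumption: the ground-truth distribution is one-hot per sample.
    """
    nll = -sum(math.log(dist[song]) for dist, song in zip(predicted_dists, true_songs))
    return math.exp(nll / len(true_songs))

def distinct_1(artists):
    """Number of distinct artists divided by the playlist length."""
    return len(set(artists)) / len(artists)

def freshness(release_years):
    """Average release year of the songs in a playlist."""
    return sum(release_years) / len(release_years)

uniform = [[0.25, 0.25, 0.25, 0.25]] * 3     # a model that knows nothing about 4 songs
ppl = perplexity(uniform, [0, 1, 2])
div = distinct_1(["a", "b", "a", "c"])
fresh = freshness([2000, 2010])
```

A uniform predictor over four songs has a perplexity of four, which gives an intuition for the scale of the metric.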
5.2 Training Details
In the pretraining and reinforcement learning stage, we use 4 layers and 64 units per layer in all RNNLM with LSTM units, and we choose for all RNNLM with padding and concatenation. The optimizer we use is Adam[8]. The learning rates for pretraining stage and reinforcement learning stage are empirically set as 0.001 and 0.0001, respectively.
5.3 Pretrained Model Comparison
In this section, we compare the training error of RNNLMs combined with different mechanisms. The RNNLM with attention is denoted as ARNNLM, the RNNLM with concatenation is denoted as CRNNLM, and the RNNLM with attention and concatenation is denoted as ACRNNLM. fig:model_comparison reports the training error of the different RNNLMs as log-perplexity, which is equal to the negative log-likelihood, over the training steps. Here one training step means that we update our parameters with one batch. As shown in fig:model_comparison, the training error of our proposed model, ACRNNLM, not only decreases faster than that of the other models but also reaches the lowest value at the end of training. Therefore, we adopt ACRNNLM as our pretrained model.
It is worth noting that the pretrained model is developed for two purposes. One is to provide a good basis for further optimization, and the other is to estimate the transition probabilities of songs in the reward function. Therefore, we simply select the model with the lowest training error, both to be optimized by policy gradient and to serve as the estimator of transition probabilities in the coherence reward (cf. Eq. (16)).
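Table 1: Weights of the reward functions for each model.

| Model | Diversity weight | Novelty weight | Freshness weight | Coherence weight |
| --- | --- | --- | --- | --- |
| RLDIST | 0.5 | 0.0 | 0.0 | 0.5 |
| RLNOVELTY | 0.0 | 0.5 | 0.0 | 0.5 |
| RLYEAR | 0.0 | 0.0 | 0.5 | 0.5 |
| RLCOMBINE | 0.2 | 0.2 | 0.2 | 0.4 |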
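Table 2: Diversity, novelty, and freshness of the playlists generated by each model.

| Model | Diversity | Novelty | Freshness |
| --- | --- | --- | --- |
| Embedding [4] | 0.32 | 0.19 | 2007.97 |
| ACRNNLM | 0.39 | 0.20 | 2008.41 |
| RLDIST | 0.44 | 0.20 | 2008.37 |
| RLNOVELTY | 0.42 | 0.05 | 2012.89 |
| RLYEAR | 0.40 | 0.19 | 2006.23 |
| RLCOMBINE | 0.49 | 0.18 | 2007.64 |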
5.4 Playlist Generation Results
As shown in Table 2, to evaluate our method, we compare six models on three important features of a playlist generation system: diversity, novelty, and freshness (cf. Section 5.1). The models are described as follows. Embedding represents the model of Chen et al.'s work [4], which constructs song embeddings from the relationships between users and songs and then finds approximate nearest neighbors for each song. RLDIST, RLNOVELTY, RLYEAR, and RLCOMBINE are models that are pretrained and then optimized by the policy gradient method (cf. Section 4.2.4) with the different weights shown in Table 1.
The experimental results show that, for a single objective such as diversity, our models can accurately generate playlists with the corresponding property. For example, RLYEAR generates playlists consisting of songs with earlier release years than those of Embedding and ACRNNLM. Moreover, even when we train our model with multiple reward functions, we can still obtain better playlists than Embedding and ACRNNLM. A sample result is shown in fig:sample_playlist.
From Table 2, we demonstrate that by using appropriate reward functions, our approach can generate playlists to fit the corresponding needs easily. We can systematically find more songs from different artists (RLDIST), more songs heard by fewer people (RLNOVELTY), or more old songs for some particular groups of users (RLYEAR).
5.5 Flexibly Manipulating Playlist Properties
After showing that our approach can easily fit several needs, we further investigate the influence of the novelty weight on the resulting playlists. In this section, several models are trained with different values of this weight to show the variation in novelty of the resulting playlists, while the remaining settings and the number of training steps are kept fixed. As shown in fig:pop_progression, the novelty score generally decreases as the weight increases, although the model may sometimes find the optimal policy earlier than expected. Nevertheless, in general, our approach not only lets the model generate more novel songs but also makes the extent of novelty controllable. Besides automation, this kind of flexibility is also important in applications.
Take an online music streaming service as an example. When the service provider wants to recommend playlists to a user who usually listens to non-mainstream but familiar songs, i.e., songs with a moderate novelty score, it is more suitable to generate playlists consisting of songs whose novelty scores are all close to that score than playlists that mix songs with much lower and much higher novelty scores. Since users usually have different preferences on each property, automatically generating playlists that fit the needs of each user, such as novelty, becomes indispensable. The experimental results verify that our proposed approach can support such an application.
5.6 Limitation on Reward Function Design
When we define a reward function for a property, we should carefully avoid bias from the state $s$. In other words, a reward function should be specific to the feature we want. One common issue is that the reward function may be influenced by the number of songs in the state $s$. For example, in our experiments, we adopt Distinct-1 as the metric for diversity. However, we cannot adopt Distinct-1 directly as our reward function, because it is scaled by the length of the playlist, so all actions taken from states with fewer songs would benefit. This difference between the diversity reward and Distinct-1 is the reason that RLDIST does not achieve the best performance in Distinct-1 (cf. Table 2). In summary, we should design reward functions carefully, and sometimes we may need to formulate another approximate objective function to avoid biases.
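The length bias described above can be seen in a small example. The sketch below is an illustration with made-up artist labels: it scores the same kind of action, appending a song by a new artist, from a short state and from a longer state, using Distinct-1 directly as a hypothetical reward.

```python
def distinct_1(artists):
    """Number of distinct artists divided by the playlist length."""
    return len(set(artists)) / len(artists)

# The same kind of action (adding a song by a new artist "C")
# taken from a short state and from a longer state.
short_state = ["A"]
long_state = ["A", "B", "A", "B", "A"]

reward_short = distinct_1(short_state + ["C"])   # 2 distinct artists over 2 songs
reward_long = distinct_1(long_state + ["C"])     # 3 distinct artists over 6 songs
```

Although the action is equally desirable in both cases, the short state receives twice the reward, so a reward defined this way would depend on the state length rather than on the action alone.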
6 Conclusions and Future Work
In this paper, we develop a playlist generation system which is able to generate personalized playlists automatically and flexibly. We first formulate playlist generation as a language modeling problem. Then by exploiting the techniques of RNNLM and reinforcement learning, the proposed approach can flexibly generate suitable playlists for different preferences of users.
In our future work, we will further investigate the possibility of automatically generating playlists by considering qualitative feedback. For online music streaming service providers, professional music curators can give qualitative feedback on generated playlists so that developers can improve the quality of the playlist generation system. Qualitative feedback such as 'songs from diverse artists but similar genres' is easier to quantify, so we can design suitable reward functions and generate the corresponding playlists with our approach. However, other feedback such as 'falling in love playlist' is more difficult to quantify. Therefore, we will further adopt audio features and explicit tags of songs in order to provide a better playlist generation system.
References
 [1] Masoud Alghoniemy and Ahmed H. Tewfik. A network flow model for playlist generation. In Proceedings of International Conference on Multimedia and Expo, pages 329–332, 2001.
 [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
 [3] Rachel M. Bittner, Minwei Gu, Gandalf Hernandez, Eric J. Humphrey, Tristan Jehan, P. Hunter McCurry, and Nicola Montecchio. Automatic playlist sequencing and transitions. In Proceedings of the 18th International Conference on Music Information Retrieval, pages 472–478, 2017.
 [4] Chih-Ming Chen, Ming-Feng Tsai, Yu-Ching Lin, and Yi-Hsuan Yang. Query-based music recommendations via preference embedding. In Proceedings of the ACM Conference Series on Recommender Systems, pages 79–82, 2016.
 [5] Arthur Flexer, Dominik Schnitzer, Martin Gasser, and Gerhard Widmer. Playlist generation using start and end songs. In Proceedings of the 9th International Society for Music Information Retrieval Conference, pages 173–178, 2008.
 [6] Negar Hariri, Bamshad Mobasher, and Robin Burke. Context-aware music recommendation based on latent-topic sequential patterns. In Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys '12, pages 131–138, New York, NY, USA, 2012. ACM.
 [7] Dietmar Jannach, Lukas Lerche, and Iman Kamehkhosh. Beyond "hitting the hits": Generating coherent music playlist continuations with the right tracks. In Proceedings of the 9th ACM Conference on Recommender Systems, RecSys '15, pages 187–194, New York, NY, USA, 2015. ACM.
 [8] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [9] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. CoRR, abs/1510.03055, 2015.
 [10] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. CoRR, abs/1606.01541, 2016.
 [11] Elad Liebman and Peter Stone. DJ-MC: A reinforcement-learning agent for music playlist recommendation. CoRR, abs/1401.1880, 2014.
 [12] Beth Logan. Content-based playlist generation: Exploratory experiments. 2002.
 [13] François Maillet, Douglas Eck, Guillaume Desjardins, and Paul Lamere. Steerable playlist generation by learning song similarity from radio station playlists. In Proceedings of the 10th International Society for Music Information Retrieval Conference, 2009.
 [14] Brian McFee and Gert Lanckriet. The natural language of playlists. In Proceedings of the 12th International Society for Music Information Retrieval Conference, pages 537–542, 2011.

 [15] Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. Coherent dialogue with attention-based language models. In The Thirty-First AAAI Conference on Artificial Intelligence, 2016.
 [16] Elias Pampalk, Tim Pohle, and Gerhard Widmer. Dynamic playlist generation based on skipping behavior. In Proceedings of the 6th International Society for Music Information Retrieval Conference, pages 634–637, 2005.
 [17] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. 2017.
 [18] Saúl Vargas. New approaches to diversity and novelty in recommender systems. In Proceedings of the Fourth BCS-IRSG Conference on Future Directions in Information Access, FDIA'11, pages 8–13, Swindon, UK, 2011. BCS Learning & Development Ltd.
 [19] Zhe Xing, Xinxi Wang, and Ye Wang. Enhancing collaborative filtering music recommendation by balancing exploration and exploitation. In Proceedings of the 15th International Society for Music Information Retrieval Conference, pages 445–450, 2014.
 [20] Yuan Cao Zhang, Diarmuid Ó Séaghdha, Daniele Quercia, and Tamas Jambor. Auralist: Introducing serendipity into music recommendation. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM ’12, pages 13–22, New York, NY, USA, 2012. ACM.