The emergence of large streaming services has significantly increased the amount of multimedia content consumed by Internet users over the last decade. Moreover, the consumption pattern is also changing from a pull-based approach, where users actively select the content, to a push-based approach, where streaming services continuously send content to users. In both approaches, a significant challenge is selecting the right content for a user, as large-scale services often offer millions of objects. Recommendation systems tackle this problem by recommending objects that are more likely to be enjoyed by users. Nevertheless, understanding user tastes and preferences continues to be a difficult task.
In the context of music, streaming services such as Spotify and LastFM offer users playlists: sequences of songs with some notion of similarity and cohesion that are likely to be enjoyed by a listener. Such playlists can be crafted manually by artists, experts or listeners, and be fixed, determining a particular song sequence. Users then choose among such playlists, aided by a recommendation system, as the number of playlists can also be very large. Alternatively, playlists can be generated automatically and dynamically, and even be personalized to a given user. In such a scenario, the user listens to a continuous sequence of songs chosen by a recommendation system.
Despite the heterogeneity among different techniques for recommending content, a fundamental ingredient is the notion of object similarity. In particular, given two objects (e.g., two songs), the goal is to provide a number that indicates their relationship or similarity under one or more criteria. One well-explored approach, known as content-based, uses signal processing techniques, such as spectrum analysis and time series correlations, to analyze the objects and determine their similarity. Another more recent and promising approach, known as context-based, leverages meta-data concerning the objects to determine their similarity [16, 14]. The potential of context-based approaches comes from the continuous generation of meta-data by users. While some meta-data is fixed and inherent to the object (e.g., the artist of a song), other meta-data is constantly being generated by users (e.g., the number of users that have listened to two given songs). This information can be leveraged to design more effective measures of similarity that will then drive better recommendation systems.
While information from various sources can be collected and used to design context-based techniques, the sequences of content consumed by users are a particularly powerful source of information. Such sequences reveal a user’s preferences along with a sequential ordering, since content is consumed in that order. In the context of music, playlists embody such sequences and have been leveraged to design a measure of similarity between songs [13, 15]. This work takes the same approach but generalizes prior work to consider direction and multi-hop influence. In the proposed technique, object similarity decays with the number of objects in between them, in the sense that two objects that follow each other are more similar than otherwise. Moreover, similarity is not symmetric, in the sense that an object may closely follow another in a sequence, but the reverse is not necessarily true.
Another kind of information explored to design context-based techniques is meta-data inherent to objects. This can be used, for example, to stratify the objects into classes to better assess their similarity. This work leverages this approach but combines it with sequences of objects. In particular, a sequence of objects can be translated into a sequence of attributes associated with the objects, giving rise to multiple sequences for different kinds of meta-data. In the context of music, a playlist can be translated into a sequence of artists (i.e., each song has an artist) or a sequence of genres (i.e., each song has a genre) or any other kind of meta-data, such as language, country or year. The proposed similarity metric can then be applied to such translated sequences to construct a similarity metric for different kinds of meta-data.
While an effective similarity metric is important, object recommendation also requires a model. A recently proposed and promising approach is the use of graph models, where objects are nodes in the graph and edges encode relationships among them [15, 3, 18]. Random walks on such graphs are often used to generate or recommend content. The approach proposed here constructs multiple directed weighted graphs, each corresponding to a different kind of meta-data of the objects, where edge weights correspond to the similarity between nodes. These graphs form a hierarchy according to the different meta-data. Moreover, biased random walks are placed on each graph and are used to generate content. However, these random walks are coupled and walk synchronously on their respective graphs. Intuitively, context provided by a meta-data graph constrains transitions on other graphs, allowing for more effective object generation. In the context of music, a transition in the genre graph to “rock” will enforce that the song graph must transition to a song of that genre.
The main contributions of this work are as follows:
- A novel measure for object similarity that leverages sequences of objects. The measure is not symmetric, capturing the inherent direction of sequences, and captures influences at all gaps, beyond just neighboring objects.
- The translation of an object sequence into sequences of different kinds of meta-data, for which the same similarity measure can be constructed.
- A hierarchical model of multiple graphs (one for each kind of meta-data) that leverages similarities among the meta-data to generate sequences. Biased random walks, one on each graph, walk synchronously and constrain one another to generate sequences.
The proposed model is applied to a large music playlist dataset. The playlists are used to generate three kinds of graphs (genres, artists and tracks) that are then used to generate random playlists. The model is fully parametrized from the playlists, requiring no external parameters or tuning. Part of the dataset is used for parametrization and the model is evaluated on generating actual (never before seen) playlists, outperforming both a simple similarity metric and a single graph.
While the proposed model has been applied to music objects, it can be used in a myriad of other contexts, such as short videos, books, and movies, as long as objects follow a sequence and have meta-data.
The remainder of this work is organized as follows. Section II introduces the proposed similarity measure and object generation model. Section III describes the dataset, while Section IV presents a characterization of the graphs as well as the results of the model. Last, Sections V and VI present a brief discussion of related work and some final remarks.
II Similarity Measure and Graph Hierarchy
Consider a finite set $O$ of distinct objects where each object has $m$ different attributes, namely $a_1, \ldots, a_m$. Let $A_k$ denote the set of possible values for attribute $a_k$, with $k = 1, \ldots, m$. For example, $O$ can be the set of songs in a digital repository (e.g., all songs in Spotify), and the attributes can be the title, singer, and genre of a song.
Consider a sequence of objects $S = (o_1, o_2, \ldots, o_n)$ where each object $o_t$ is an element of $O$. In the context of songs, a sequence can be a music playlist constructed by a user. An important assumption in what follows is that such a sequence encodes some kind of association or similarity between the objects. In particular, two objects that frequently appear close to each other in the sequence are more strongly related (or similar) than two objects that never appear close to each other. For example, in music playlists, songs that frequently appear together tend to be related or similar in the musical sense. This intuition is leveraged to construct a similarity metric between the objects. Moreover, since objects have multiple attributes, a similarity metric can be constructed for each attribute.
II-A Similarity metric
Consider two objects $x, y \in O$ and a sequence $S = (o_1, \ldots, o_n)$. Let $t_i(x)$ denote the time of the $i$-th appearance of object $x$ in $S$. In particular, $o_{t_i(x)} = x$, where $t_i(x) < t_{i+1}(x)$. Consider the set of intervals between appearances of objects $x$ and $y$ in $S$. In particular, for an appearance of $x$ at time $t$, let $T_y(t) = \{t' > t : o_{t'} = y\}$ denote the set of times that $y$ appears after $t$. Thus, when considering the $i$-th appearance of $x$, we have $T_y(i) = T_y(t_i(x))$, with a slight abuse of notation. Note that $T_y(i)$ can be empty, which occurs when $y$ does not appear after $x$ in the sequence.
The similarity between two objects can now be defined in terms of their appearances, as follows:
$$s(x, y) = \sum_{i} \sum_{t \in T_y(i)} f\big(t - t_i(x)\big) \qquad (1)$$
where $f$ is a positive but monotonically decreasing function. Note that all appearances of $y$ that occur after an appearance of $x$ contribute to increase their similarity. However, appearances that are close in the sequence contribute more, since $f$ is assumed to be decreasing. Moreover, note that the metric is not symmetric and $s(x, y)$ is likely different from $s(y, x)$. Also, note that an object can have similarity with itself, but this is not necessarily large.
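The definition above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `sequence_similarity` and the default decay `1 / gap` are assumptions made here for the example, not choices confirmed by the text.

```python
from collections import defaultdict

def sequence_similarity(seq, decay=lambda gap: 1.0 / gap):
    """Directed, multi-hop similarity built from one sequence.

    sim[x][y] accumulates decay(gap) for every occurrence of y that
    appears `gap` positions after an occurrence of x.
    The default decay (1 / gap) is an assumed example of a positive,
    decreasing function.
    """
    sim = defaultdict(lambda: defaultdict(float))
    for i, x in enumerate(seq):
        for j in range(i + 1, len(seq)):
            sim[x][seq[j]] += decay(j - i)
    return sim

playlist = ["a", "b", "c", "a"]
sim = sequence_similarity(playlist)
print(sim["a"]["b"], sim["b"]["a"])  # 1.0 0.5
```

In this toy playlist, "b" directly follows "a", while "a" only follows "b" two positions later, so the two directions receive different scores. The repeated "a" also gives the object a (small) similarity with itself.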
The sequence of objects can be converted into a sequence of values of a given attribute of the objects. In particular, when considering attribute $a_k$, we can define $S^k = (a_k(o_1), \ldots, a_k(o_n))$, for $k = 1, \ldots, m$, where $a_k(o)$ returns the value of the $k$-th attribute of object $o$ and $o_1, \ldots, o_n$ are the objects of the sequence. Thus, $S^k$ can be used to define the similarity between two values of the $k$-th attribute, by simply replacing the notions used in Equation 1 for objects with their respective notions for attribute values. In this way, the sequence of objects can be used to determine similarities for all attributes; let $s_k(u, v)$ denote the similarity between the values $u$ and $v$ of the $k$-th attribute.
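As a small illustration of this translation, the sketch below maps a track sequence to an artist sequence and applies the same directed similarity to it. The helper names, the track-to-artist mapping, and the decay `1 / gap` are hypothetical and serve only the example.

```python
def translate(seq, attribute_of):
    """Map each object in seq to its attribute value, preserving order."""
    return [attribute_of[obj] for obj in seq]

def directed_similarity(seq, decay=lambda gap: 1.0 / gap):
    """Same directed, multi-hop similarity, keyed by ordered pairs.
    The decay (1 / gap) is an assumed example of a decreasing function."""
    sim = {}
    for i, x in enumerate(seq):
        for j in range(i + 1, len(seq)):
            key = (x, seq[j])
            sim[key] = sim.get(key, 0.0) + decay(j - i)
    return sim

artist_of = {"t1": "A", "t2": "B", "t3": "A"}   # hypothetical meta-data
tracks = ["t1", "t2", "t3"]
artists = translate(tracks, artist_of)           # ['A', 'B', 'A']
artist_sim = directed_similarity(artists)
```

Because distinct tracks can share an attribute value, self-similarity appears naturally at the attribute level: here artist "A" follows itself at gap 2.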
II-B Similarity networks and hierarchy
The above similarity metric can be used to construct a directed weighted network. In particular, let $G = (V, E)$ denote a directed graph where $V = O$ denotes the set of vertices (the objects) and $E = \{(x, y) : s(x, y) > 0\}$ is the set of edges, with $s(x, y)$ the similarity from object $x$ to object $y$. Moreover, the weight of edge $(x, y)$ is given by $s(x, y)$. Let $N(x)$ denote the set of outgoing neighbors of $x$, namely $N(x) = \{y : (x, y) \in E\}$. Moreover, let $w^{+}(x) = \sum_{y} s(x, y)$ and $w^{-}(x) = \sum_{y} s(y, x)$ denote the total outgoing and incoming weight of node $x$. Note that the network depends on the sequence $S$.
The same kind of network can be constructed for each attribute. In particular, let $G_k = (V_k, E_k)$ denote a directed graph associated with attribute $k$, where $V_k$ denotes the set of vertices (the values of attribute $k$) and $E_k = \{(u, v) : s_k(u, v) > 0\}$ is the set of edges, with $s_k$ the similarity between attribute values. Thus, there are $m$ networks, one for each attribute associated with the objects.
These independent networks can be structured into a hierarchy, as follows. Assume that the network size (in number of nodes) increases with the attributes (i.e., assume that the attributes are sorted such that $|V_1| \le |V_2| \le \cdots \le |V_m|$). Note that the top of the hierarchy is defined by network $G_1$. Since every object in the sequence has its attributes defined, each object uniquely maps to a node in each network. Thus, the networks are all coupled by objects in the sequence $S$. Moreover, note that if an edge $(x, y) \in E$ exists, then the corresponding edge also exists in all other networks, namely $(a_k(x), a_k(y)) \in E_k$ for all $k$, where $a_k(x)$ is the value of the $k$-th attribute of $x$ (by construction of the similarity function). Figure 1 illustrates such a scenario.
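A minimal sketch of such a network as a nested dictionary is given below. Accumulating weights over several playlists, the decay `1 / gap`, and the helper names are assumptions made for illustration; the text itself defines the network for a single sequence.

```python
from collections import defaultdict

def build_graph(sequences, decay=lambda gap: 1.0 / gap):
    """Adjacency dict: graph[x][y] = accumulated similarity from x to y.
    Summing over several sequences is an assumption of this sketch."""
    graph = defaultdict(dict)
    for seq in sequences:
        for i, x in enumerate(seq):
            for j in range(i + 1, len(seq)):
                y = seq[j]
                graph[x][y] = graph[x].get(y, 0.0) + decay(j - i)
    return graph

def out_weight(graph, x):
    """Total weight of the edges leaving x."""
    return sum(graph.get(x, {}).values())

def in_weight(graph, x):
    """Total weight of the edges entering x."""
    return sum(nbrs.get(x, 0.0) for nbrs in graph.values())

g = build_graph([["a", "b", "c"], ["b", "c"]])
print(g["b"]["c"], out_weight(g, "a"), in_weight(g, "c"))  # 2.0 1.5 2.5
```

The same function can be applied to a translated sequence (artists, genres) to obtain the network of each attribute.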
II-C Sequence generation
The networks and hierarchy previously defined can be used to generate random biased sequences of objects. The key idea is to leverage the hierarchy to constrain the randomness and thus generate more meaningful sequences. The generation follows biased random walks in each network that are coupled and synchronized, and where the bias is given by edge weights (similarity). In particular, let $X(t)$ denote the location of the random walk in the object network at time $t$. Let $N_x(t)$ denote the set of outgoing neighbors of node $x$ that are enabled at time $t$ (how $N_x(t)$ is determined is discussed below). Then, the transition probability is given by:
$$P[X(t+1) = y \mid X(t) = x] = \frac{s(x, y)}{\sum_{z \in N_x(t)} s(x, z)}, \qquad y \in N_x(t),$$
where $s(x, y)$ is the weight (similarity) of the edge from $x$ to $y$.
Note that transitions are more likely to objects that are more similar, with a linear dependence between similarity and probability across the set of possible transitions.
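The linear dependence between similarity and transition probability can be illustrated with the following sketch. The function names and the dictionary representation of edge weights are assumptions of the example; `enabled` stands for the set of currently enabled neighbors.

```python
import random

def transition_probs(weights, enabled):
    """Normalise the similarities of the enabled out-neighbours.

    weights : dict mapping neighbour -> similarity (edge weight)
    enabled : set of neighbours allowed at this step
    """
    allowed = {y: w for y, w in weights.items() if y in enabled and w > 0}
    total = sum(allowed.values())
    return {y: w / total for y, w in allowed.items()} if total else {}

def step(weights, enabled, rng=random):
    """Sample the next node, or return None if no transition is enabled."""
    probs = transition_probs(weights, enabled)
    if not probs:
        return None
    nodes = list(probs)
    return rng.choices(nodes, weights=[probs[n] for n in nodes], k=1)[0]

probs = transition_probs({"b": 3.0, "c": 1.0, "d": 4.0}, {"b", "c"})
print(probs)  # {'b': 0.75, 'c': 0.25}
```

Node "d" has the largest weight but is not enabled, so it receives no probability mass; the remaining weights are renormalised over the enabled set.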
The set $N_x(t)$ of enabled neighbors is determined according to the random walks in the other networks. Consider the random walk in the highest level of the hierarchy, in network $G_1$. This random walk is unconstrained and moves freely according to the bias determined by the edge weights. The random walks of all networks are coupled and walk in lockstep. Let $X_k(t)$ denote the state of the random walk at time $t$ in network $G_k$, for $k = 1, \ldots, m$. In particular, each $X_k$ takes a step after the random walk in the level immediately above has taken a step. Thus, $X_1$ takes a step and moves to $X_1(t+1)$; given this transition, $X_2$ takes a step and moves to $X_2(t+1)$, and so on. The possible transitions in a given layer are constrained by the transition in the layer immediately above. In particular, given $X_{k-1}(t+1)$, $X_k$ can only transition to attribute values that have appeared in objects that also have the attribute value given by $X_{k-1}(t+1)$. Effectively, this constrains the outgoing neighbors of $X_k(t)$, and thus determines the enabled set.
In the music example, the top level of the hierarchy can be the genre network, where nodes are music genres. The next level of the hierarchy can be the artist network, where nodes are musicians or bands. Finally, the bottom layer of the hierarchy is the track network, where nodes are specific songs. In order to generate a sequence of songs, the random walk in the genre network will make a transition. This will constrain the transitions of the random walk in the artist network, such that only transitions to artists of the genre selected above are enabled. After this transition is made, the random walk in the track network will take a step, but now constrained to the artist selected in the layer above. In particular, only transitions to songs from this artist are enabled. Once this transition occurs, the process repeats and the random walk in the genre network takes a step. Note that the hierarchy constrains the randomness that is further biased by the similarity metric in each network.
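One lockstep step of the coupled walks in the genre, artist and track example can be sketched as follows. This is a simplified illustration under stated assumptions: each node is assumed to have a single parent in the level above (`parent_of`), the toy graphs are invented, and returning `None` on a dead end is a choice of this sketch, since the text does not say how that case is handled.

```python
import random

def weighted_choice(weights, rng):
    """Pick a key with probability proportional to its weight."""
    nodes = list(weights)
    return rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]

def coupled_step(graphs, parent_of, state, rng):
    """One synchronous step of the coupled walks.

    graphs[k]    : adjacency dict of level k (0 = top), graphs[k][u][v] = weight
    parent_of[k] : maps a level-k node to its level k-1 value (unused for k = 0)
    state        : tuple with the current node of each level
    """
    new_state = []
    for k, graph in enumerate(graphs):
        out = graph.get(state[k], {})
        if k > 0:
            # Enable only neighbours whose parent is the node just chosen above.
            out = {v: w for v, w in out.items()
                   if parent_of[k][v] == new_state[k - 1]}
        if not out:
            return None  # dead end (handling is an assumption of this sketch)
        new_state.append(weighted_choice(out, rng))
    return tuple(new_state)

# Hypothetical toy hierarchy: genre -> artist -> track.
genre_graph = {"rock": {"pop": 1.0}}
artist_graph = {"A": {"B": 2.0, "C": 1.0}}
track_graph = {"t1": {"t2": 1.0, "t3": 5.0}}
parent_of = [None, {"B": "pop", "C": "rock"}, {"t2": "B", "t3": "C"}]

rng = random.Random(0)
nxt = coupled_step([genre_graph, artist_graph, track_graph], parent_of,
                   ("rock", "A", "t1"), rng)
print(nxt)  # ('pop', 'B', 't2')
```

Track "t3" has the largest edge weight from "t1", yet it is never chosen here: the genre walk moved to "pop", which enables only artist "B" and therefore only track "t2".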
In order to illustrate an application of the proposed framework, a dataset of playlists called Art of the Mix 2011 was chosen. This data is an expansion of the Art of the Mix (AotM, http://www.artofthemix.org/) collection of Ellis et al., which contains unique playlists with varying sizes, each having its own genre such as ROCK, POP, etc.
III-A Augmenting dataset
Due to the ratio between the number of playlists and the number of distinct tracks in the data (), it was necessary to augment data without changing much of its characteristics such as the order of tracks appearance. Considering a playlist as a sequence , two approaches were employed in the augmentation: randomly remove one element of each playlist and randomly select a pivot to split and recombine the playlist. The first approach leads to a sequence while the second leads to a sequence where is the size of the sequence and is the pivot. The first was repeated times and the second only once, generating a new dataset with playlists, times bigger than the original.
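The two augmentation operations can be sketched as below. Reading "split and recombine" as swapping the two parts around the pivot is an assumption of this sketch, as are the function names; the number of repetitions is left to the caller.

```python
import random

def drop_one(playlist, rng):
    """Remove one randomly chosen element, keeping the order of the rest."""
    i = rng.randrange(len(playlist))
    return playlist[:i] + playlist[i + 1:]

def split_and_recombine(playlist, rng):
    """Pick a pivot and swap the two parts around it.
    (Assumed interpretation of 'split and recombine'.)"""
    if len(playlist) < 2:
        return list(playlist)
    p = rng.randrange(1, len(playlist))
    return playlist[p:] + playlist[:p]

rng = random.Random(0)
original = ["a", "b", "c", "d", "e"]
dropped = drop_one(original, rng)
rotated = split_and_recombine(original, rng)
```

Both operations preserve the relative order of tracks within each retained segment, which is the property the augmentation is meant to keep.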
III-B Assigning tracks’ genre
Since the genre provided in the database is associated with a playlist instead of the tracks within it, a pre-processing step is necessary to assign a genre to each track.
Consider an object with attributes , and defined as genre, artist and track respectively. Let denotes the number of times object appears in a playlist with genre . The value of is assigned as . Among the possible values for genre, there is a value ”MIXED GENRE” denoted as which is very frequent and adds no information about the real genre of a track and artist. Thus, if , is assigned as . After this, the size of each set was , and .
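A possible reading of this pre-processing step is sketched below: each object receives the genre of the playlists in which it appears most often, and the uninformative mixed label is skipped whenever a specific genre is available. The fallback to the most frequent specific genre, the tie-breaking, and the names used are assumptions of this sketch.

```python
from collections import Counter, defaultdict

MIXED = "MIXED GENRE"

def assign_genres(playlists):
    """playlists: iterable of (genre, [objects]) pairs. Returns {object: genre}.

    Each object gets its most frequent playlist genre; MIXED is used only
    when no specific genre was ever observed (assumed fallback rule).
    """
    counts = defaultdict(Counter)
    for genre, objects in playlists:
        for obj in objects:
            counts[obj][genre] += 1
    assigned = {}
    for obj, counter in counts.items():
        ranked = [g for g, _ in counter.most_common()]
        specific = [g for g in ranked if g != MIXED]
        assigned[obj] = specific[0] if specific else MIXED
    return assigned

data = [
    ("ROCK", ["t1", "t2"]),
    (MIXED, ["t1", "t3"]),
    (MIXED, ["t1"]),
    ("POP", ["t2"]),
    ("ROCK", ["t2"]),
]
genres = assign_genres(data)
```

Track "t1" appears most often in mixed playlists but is still assigned "ROCK", its most frequent specific genre.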
IV-A Network characterization
The networks generated according to the proposed model present specific topological characteristics regarding the weight distribution and the influence that different decaying functions have on it. Figure 2 shows the distribution of the sum of each vertex’s incoming and outgoing edge weights. Although there is no significant difference between the two distributions, a very small number of vertices have very large associated weights. This means that there are some popular songs that appear very often in playlists.
Figure 3 shows the influence of the decaying function of Equation 1 on the edge weight distribution. This model parameter allows one to choose how much importance is given to distant objects in the sequences.
Another important structural aspect – known as Giant Connected Component – is present in our networks as it is in many other real world complex networks. In the tracks graph, of the vertices are in the largest connected component; in the artists graph, and the genre graph all vertices are in the same connected component.
IV-B Playlist evaluation
The playlists dataset was split in train and test in three different proportions: , and . Also, two models were built using the proposed framework: one with 3 hierarchy levels (genres, artists, tracks) named hierarchical and the other with a single level (tracks) named multi hop, both using . They were also compared to the model proposed in , named single hop, which uses an undirected, single layered track network where
In order to compare the three models, we used a language modelling inspired approach, proposed in , which evaluates how likely each model is to produce naturally occurring playlists according to:
where is the model, is a sample of playlists from the test set and is the likelihood of playlist being generated by the model . Thus, we can say that is a better model of the data than if .
The likelihood is defined as
When evaluating the probability of a sequence of objects , , it may occur that in which case . In this case, does not yield any meaningful value, as the likelihood of this sequence will be
since one transition in the sequence is not possible. In order to allow the computation of the likelihood for sequences with a non-existing transition, the sequence probability was modified and inspired by Laplace Smoothing used in Natural Language Processing. This new sequence probability is defined as follows:
Note that when , the numerator becomes the constant . The obtained results for each model considering the different splits of the train and test are shown in Figure 4.
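The effect of smoothing on the likelihood can be illustrated with the sketch below. It uses plain additive (Laplace-style) smoothing with a constant `alpha` and scores only the transitions of a playlist; the exact smoothed formula, the treatment of the first track, and the normalisation used in the evaluation are not reproduced here, so all of these are assumptions of the sketch.

```python
import math

def smoothed_transition_prob(graph, x, y, n_nodes, alpha=1.0):
    """Additively smoothed probability of moving from x to y.
    An unseen transition still gets the constant alpha in the numerator."""
    out = graph.get(x, {})
    return (out.get(y, 0.0) + alpha) / (sum(out.values()) + alpha * n_nodes)

def log_likelihood(graph, playlist, n_nodes, alpha=1.0):
    """Sum of log transition probabilities along the playlist."""
    return sum(math.log(smoothed_transition_prob(graph, x, y, n_nodes, alpha))
               for x, y in zip(playlist, playlist[1:]))

def average_log_likelihood(graph, playlists, n_nodes, alpha=1.0):
    """Mean log-likelihood over a sample of test playlists."""
    return sum(log_likelihood(graph, p, n_nodes, alpha)
               for p in playlists) / len(playlists)

graph = {"a": {"b": 3.0}}           # hypothetical trained weights
seen = log_likelihood(graph, ["a", "b"], n_nodes=3)
unseen = log_likelihood(graph, ["a", "c"], n_nodes=3)
```

Without smoothing, the playlist containing the unseen transition from "a" to "c" would have probability zero and a log-likelihood of minus infinity; with smoothing it receives a finite, lower score than the seen transition.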
The results show that the proposed model (multi hop) outperforms the single hop by at least five orders of magnitude. This indicates the effectiveness of the proposed generalization, which considers not only immediate adjacencies in the track sequence but also transitions between two songs that are not neighbors in the playlist. Moreover, when constrained by the hierarchical structure, the model achieves even better results, supporting the hypothesis that the objects’ inherent hierarchy is important for sequence generation.
V Related Work
Recommendation systems for multimedia objects are an important research area that has been widely explored over the last decades. Within this area, music recommendation has also received much attention recently [4, 7], in part due to the emergence of large-scale streaming services, such as Spotify and LastFM. Despite significant progress, music recommendation continues to be a hard problem, and was the theme of a challenge at the ACM Conference on Recommender Systems (RecSys) in 2018.
While there is a large number of approaches and algorithms for multimedia recommendation, they often fall into one of two broad categories: content-based [9, 17] and context-based [16, 14, 8]. Within the context-based approach there are various graph-based techniques where a graph of objects is constructed and then leveraged for recommendations, often using random walks [3, 15, 18].
From a theoretical perspective, Gopal et al. discuss many challenges in the hierarchical classification problem and propose a Bayesian hierarchical model using multivariate logistic regression. Based on this idea, Ben-Elazar et al. proposed an algorithm that is used in Microsoft’s Groove music service. The approach leverages a variational Bayes technique for learning the parameters of a hierarchical model that integrates genre, sub-genre, artist and global information while also incorporating personalization for user-specific preferences.
The approach proposed here is closely related to the methodology proposed by Ragno et al., where an undirected weighted graph is constructed from playlists and a biased random walk is used to generate recommendations. Their similarity metric simply counts the number of times two songs have appeared next to each other in the playlists, and this count is used as the edge weight. The current work generalizes this methodology by introducing multiple graphs that induce a hierarchy and a similarity metric that is not symmetric (directed graph) and leverages multiple hops between two objects in a sequence of objects. The current work is also closely related to the recent work of Ueda et al., where a single directed graph is constructed with nodes that represent different objects such as tracks, genres and artists. Their methodology uses a single biased random walk to generate recommendations, but this bias is an external parameter. Moreover, object similarity is not considered in the construction of the graph and recommendation requires extensive computation (in order to compute the next most likely track). The approach proposed here more naturally encodes the relationships between the different kinds of meta-data (in a hierarchy), which are then used to constrain random walk transitions. Moreover, this approach is quite general and can be readily applied to other kinds of objects, beyond music.
Making good recommendations for multimedia objects is an important and challenging task that continues to draw attention from academia and industry despite over a decade of progress. The many approaches that have emerged in the literature indicate that successful recommendations require an effective similarity metric that assesses the similarity between two objects, as well as models that leverage current context to constrain the recommendations. This work proposed a novel approach to both ingredients.
The proposed similarity measure is constructed from just a sequence of objects and assigns similarity inversely proportional to the distance between objects in the sequence. More appearances lead to a larger similarity, as do closer appearances in the sequence. This notion can be applied to any meta-data associated with the objects (giving rise to a sequence of meta-data values), effectively constructing a suite of similarity measures.
The proposed model consists of a hierarchy of graphs, each corresponding to one kind of meta-data of the objects. The weights of these directed graphs correspond to the similarity between the meta-data values (and are not symmetric). In order to generate a recommendation, each graph has a random walk, and the walks move coupled and synchronously. Thus, a transition in a given layer of the hierarchy constrains the possible transitions in the layer immediately below, which in turn will constrain the transitions in the layer below it.
The proposed model was applied to a music dataset containing 1 million playlists where a hierarchy with three layers was constructed: genre, artist, and track. Results indicate that the model can generate actual (never before seen) playlists with an accuracy that is at least 5 orders of magnitude higher than two alternative approaches.
Last, although this model has been applied to the context of music, it could also be applied to any other context as long as objects in a sequence exert some notion of cohesion with respect to their distances in the sequence and also have meta-data associated with them. Examples are short videos or books, but further analysis is needed to assess this generality.
[1] (2017) Groove radio: a Bayesian hierarchical model for personalized playlist generation. In ACM International Conference on Web Search and Data Mining (WSDM), pp. 445–453.
[2] (2013) Recommender systems survey. Elsevier Knowledge-Based Systems 46, pp. 109–132.
[3] (2010) Movie recommendation using random walks over the contextual graph. In International Workshop on Context-Aware Recommender Systems (CARS).
[4] (2015) Automated generation of music playlists: survey and experiments. ACM Computing Surveys (CSUR) 47 (2), pp. 26.
[5] (2018) RecSys challenge 2018: automatic music playlist continuation. In ACM Conference on Recommender Systems (RecSys), pp. 527–528.
[6] (2012) Bayesian models for large-scale hierarchical classification. In Advances in Neural Information Processing Systems (NIPS), pp. 2411–2419.
[7] (2013) A survey of music similarity and recommendation from music context data. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 10 (1), pp. 2.
[8] (2016) Collaborative music similarity and recommendation. In Music Similarity and Retrieval, pp. 179–211.
[9] (2002) Content-based playlist generation: exploratory experiments. In International Symposium for Music Information Retrieval (ISMIR).
[10] (2011) The natural language of playlists. In International Symposium for Music Information Retrieval (ISMIR), pp. 537–541.
[11] (2012) Hypergraph models of playlist dialects. In International Symposium for Music Information Retrieval (ISMIR), Vol. 12, pp. 343–348.
[12] (2002) The quest for ground truth in musical artist similarity.
[13] (2000) A taxonomy of musical genres. In Content-Based Multimedia Information Access, Volume 2, pp. 1238–1245.
[14] (2017) Improving context-aware music recommender systems: beyond the pre-filtering approach. In ACM International Conference on Multimedia Retrieval, pp. 201–208.
[15] (2005) Inferring similarity between music objects with application to playlist generation. In International Workshop on Multimedia Information Retrieval (SIGMM), pp. 73–80.
[16] (2011) Web-scale multimedia analysis: does content matter? IEEE MultiMedia 18 (2), pp. 12–15.
[17] (2010) Music recommendation using content and context information mining. IEEE Intelligent Systems 25 (1), pp. 16–26.
[18] (2018) A contextual random walk model for automated playlist generation. In IEEE/WIC/ACM International Conference on Web Intelligence (WI), pp. 367–374.