1 Introduction
At the heart of music creation lies the interaction between an artist and their public. This interactivity is lost with artefacts such as vinyl records or CDs, which turn listening into a rather passive activity and break the feedback from fans to artists. The advent of the digital era, which culminates today in the ubiquity of music streaming, is reopening the possibility of a feedback cycle by providing intuitive ways for listeners to interact with a song and the ability for platforms to record such interactions. Indeed, proactive listening behaviours (skipping, scrubbing, …) are becoming so prominent [6, 7] – and, as we shall see, predictable – that their analysis promises to reveal deep insights about the music, which could in turn fuel the music-making process.
A recent study [8] has shown that the skip profile of a song – i.e. the measure of the skipping rate as a function of the position in the song – is both very specific to the song and highly stable across time and geographical regions, as if it were some intrinsic property of the music. More generally, it was repeatedly observed [7, 8] that skip profiles exhibit a universal U-shaped pattern with three phases:

a high skip rate at the beginning of the song followed by a sharp power-law decay, as it takes up to a few seconds for people to decide whether or not they want to listen to the song,

a permanent regime with low skipping rate interspersed with spikes, whose temporal positions have been shown in [8] to correspond to salient events such as musical transitions (the beginning of a chorus, the appearance of a voice or a new instrument, a variation in intensity…),

an increase in the skipping rate as the end of the song approaches, as users look forward to the next song.
Motivated by these results and by the apparent universality of the above patterns across songs, geographical regions and time periods, we develop a model for skipping behaviour as a function of some (a priori unknown) underlying events in the song. In particular, we find that the patterns constitutive of the three phases above (namely, the initial decay, the spikes in permanent regime, and the increase at the end) can be accurately modelled by some universal response functions (also called kernels), that are only parametrized by their timings and amplitudes. Notably, we find that the distribution of response times to musical events decays slowly at large times, consistent with a number of behavioural studies that have shown that human response times on various tasks can be approximated by a lognormal distribution [9, 2, 5, 4, 14, 10]. Fitting this model to real data enables one to perform a quantitative and interpretable comparison of skip profiles, which could be further analyzed by artists who wish to dissect how their songs are being received by their public – for example, by measuring the fraction of listeners that are lost over a particular musical transition.
We start by introducing the model for skipping behaviour in Section 2 and show that the timings of the kernels indeed correspond to musical events – typically transitions – as first observed in [8]. In Section 3, we test empirically the modelling hypothesis that the shape of the skip response triggered by musical events does not depend on the musical genre or the listening context. In Section 4, we perform a day-by-day analysis of the model parameters, and show that they are a very stable and distinctive characteristic of a song, which again confirms the findings of [8] as well as the relevance and stability of our model. A qualitative and quantitative analysis across various characteristics of the music (such as genre, stream count and listening context) is additionally performed in Appendix C.
2 A model for skipping behaviour
2.1 Skip profiles as Poisson processes
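For a given song of length $T$ with a total of $N$ streams, let $n(t)$ be the number of sessions that were still active at time $t$, and $s(t)$ the associated skip profile, such that the number of skips in the interval $[t_1, t_2]$ is equal to:
$$\int_{t_1}^{t_2} s(t)\,dt = n(t_1) - n(t_2). \tag{1}$$
We model the skip profile as a Poisson process with time-varying intensity $\lambda(t)$, such that the fraction of sessions that were active at time $t$ but became inactive (i.e. where the user skipped) between $t$ and $t + dt$ is $\lambda(t)\,dt$. Given a parametrization $\lambda_\theta$ of the model space, the optimal set of parameters $\theta^*$ can then be found by applying the following optimization program:
$$\theta^* = \arg\max_{\theta} \, \log P(\lambda_\theta \mid s), \tag{2}$$
where the posterior log-probability of the intensity $\lambda_\theta$ can be written as:
$$\log P(\lambda_\theta \mid s) = \log P(s \mid \lambda_\theta) + \log P(\lambda_\theta) + \text{const}. \tag{3}$$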
The model prior, which acts as a regularization term, depends on the modelling hypotheses, which we will describe in Section 2.3. The log-likelihood term, on the other hand, can be derived directly from the Poisson hypothesis above.
2.2 Log-likelihood
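Let us assume that the skip decisions are independent (this is true for different people, probably less so for different streams by the same person) and note $s_i$ the skip profile corresponding to the $i$-th listening session (such that $s = \sum_i s_i$) and $t_i$ the time at which session $i$ stopped (with $t_i = T$ if the song was not skipped, $T$ being the length of the song). Then, the probability of observing the skip profile $s$ given an intensity $\lambda$ is given by:
$$P(s \mid \lambda) = \prod_{i=1}^{N} \lambda(t_i)^{\mathbb{1}[t_i < T]} \, \exp\left(-\int_0^{t_i} \lambda(t)\,dt\right). \tag{4}$$
The log-likelihood can therefore be written as:
$$\log P(s \mid \lambda) = \int_0^{T} \left[ s(t) \log \lambda(t) - n(t)\,\lambda(t) \right] dt, \tag{5}$$
where $n(t)$ is the number of sessions still active at time $t$. Maximizing the log-likelihood therefore amounts to maximizing the integral in the above equation. With no modelling constraints, the above expression is maximized for:
$$\hat{\lambda}(t) = \frac{s(t)}{n(t)}, \tag{6}$$
which corresponds to the detrended skip profile (i.e. the skip profile normalized by the number of users who are still active at the time considered). We call $\hat{\lambda}$ the empirical skip intensity. For a given model intensity $\lambda_\theta$, the modelling error is therefore $\hat{\lambda} - \lambda_\theta$.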
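As an illustration, the empirical skip intensity described above (skips in a time bin divided by the number of sessions still active at the start of that bin) can be estimated from raw session stop times. The following is a minimal sketch, assuming a hypothetical input format in which each session is summarized by its stop time and a session that reached the end of the song has a stop time equal to the song length; the function name and the binning are illustrative choices, not part of the original study.

```python
import numpy as np

def empirical_skip_intensity(stop_times, song_length, bin_width=1.0):
    """Return bin starts, skips per bin, sessions active at bin start, and skips / active."""
    stop_times = np.asarray(stop_times, dtype=float)
    edges = np.arange(0.0, song_length + bin_width, bin_width)
    # Assumption: sessions that reached the end of the song are not counted as skips.
    skipped = stop_times[stop_times < song_length]
    skips, _ = np.histogram(skipped, bins=edges)
    # Number of sessions still active at the start of each bin.
    active = np.array([(stop_times >= t).sum() for t in edges[:-1]])
    with np.errstate(divide="ignore", invalid="ignore"):
        intensity = np.where(active > 0, skips / (active * bin_width), 0.0)
    return edges[:-1], skips, active, intensity

# Eight sessions of a 4-second toy song: three skips, five full listens.
t, s, n, lam = empirical_skip_intensity([0.5, 0.5, 1.5, 4, 4, 4, 4, 4], song_length=4.0)
```

Dividing by the number of active sessions rather than by the total number of streams is what makes the estimate a hazard rate: late bins are not artificially deflated by listeners who have already left.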
2.3 Events-responses modelling of skips
We now wish to add a prior over the model space to reflect the structure first observed in [8] and mentioned in the introduction. To this end, we model the intensity as:
(7) 
where:

is a power-law kernel associated with the beginning of the song, which accounts for the high initial skipping rate and the subsequent decay.

is the kernel associated with musical event , where is the temporal position of the event and is the magnitude of the shock. The expression of the kernel that is used in the following sections is:
(8) where
is the sigmoid function and the parameter values are
s, s, and s. As we will see in Section 3, these parameter values fit the data consistently across genres and listening contexts. Note that the initial rise happens on a time scale of s, which is consistent with a recent controlled study which measured the time it takes humans to make musical aesthetic judgments [1]. A graphical representation of is shown on Fig. 1. 
is a power-law kernel associated with the end of the song, which accounts for the increase in skipping rate as the end of the song approaches.
We assume no prior on the amplitude and exponent parameters, and consider the following prior on the temporal positions of the events:
(9) 
where
is a hyperparameter that adjusts the balance between the likelihood loss and the prior loss. This expression for the log-probability encourages sparsity for the temporal positions of events, as events that are close by are encouraged to merge into a single event. This will allow us to initialize the gradient descent with a large number of events, which will subsequently merge into a smaller number of events during the training process. (Intuitively, the prior should be high for but low for , as merged events should be encouraged while observing two transitions in a short amount of time should be penalized. However, when optimizing via gradient descent, this would lead to a potential barrier that prevents events from being merged, which is why we prefer the more regular function in Eq. (9).)
The final optimization program can therefore be written as:
(10)
where and:
(11) 
We place the initial event times on a grid with 10s spacing, such that the number of events is larger than the actual number of events. Thanks to the shape of the prior (Eq. (9)), the ’s will gradually merge as they reach their final positions. Figure 2 shows two examples of detrended skip profiles (i.e. the empirical Poisson intensity) and the associated model intensity , with the events detected by the optimization process. Note that the final number of detected events differs between the two songs, as the ’s have successfully migrated towards the true event times – and then merged.
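To make the events-responses decomposition concrete, the sketch below builds a model intensity as the sum of a start-of-song power law, an end-of-song power law and one response kernel per event. It is only an illustration under stated assumptions: the kernel used here (a sigmoid onset multiplied by a slow inverse decay) and its parameters `rise`, `width` and `decay` are hypothetical stand-ins rather than the expression and fitted values of Eq. (8), and the `intercept` argument and all function names are likewise illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def event_kernel(t, rise=2.0, width=0.5, decay=10.0):
    """Hypothetical event response: sigmoid onset, slow inverse decay, zero before the event."""
    t = np.asarray(t, dtype=float)
    tt = np.maximum(t, 0.0)  # clip so that the sigmoid argument stays bounded below
    shape = sigmoid((tt - rise) / width) / (1.0 + tt / decay)
    return np.where(t >= 0.0, shape, 0.0)

def power_law(t, amplitude, exponent, offset):
    """Power-law kernel amplitude * (t + offset) ** (-exponent), evaluated for t >= 0."""
    return amplitude * (np.maximum(t, 0.0) + offset) ** (-exponent)

def model_intensity(t, length, start, end, event_times, event_amps, intercept=0.0):
    """Sum of start kernel, end kernel (in time-to-end) and one kernel per event."""
    t = np.asarray(t, dtype=float)
    lam = intercept + power_law(t, *start) + power_law(length - t, *end)
    for t_k, a_k in zip(event_times, event_amps):
        lam = lam + a_k * event_kernel(t - t_k)
    return lam

# One event at 60 s in a 180 s song; all numbers are made up for the example.
t = np.linspace(0.0, 180.0, 181)
lam = model_intensity(t, 180.0, (0.1, 1.0, 1.0), (0.05, 1.0, 5.0), [60.0], [0.01])
```

Because every event shares the same kernel shape, each event contributes only two free parameters (its time and its amplitude), which is what keeps the parametrization sparse.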
3 Universality of the skips patterns
In this section, we wish to confirm empirically the assumptions made following Eq. (7) about the shapes of the various kernels.
Initial decay
Fig. 3 shows the normalized decay for the first 30s for 100 popular songs from the Spotify catalogue, showing a remarkable stability across tracks, with a decay exponent and an offset s – thus confirming the relevance of the power-law kernel shape for the initial decay. Note that the very first seconds correspond to a rather different mechanism and deviate from the power law, so we exclude the first 2s from the fit.
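A power-law fit of this kind can be sketched as follows. This is a minimal illustration, assuming a decay of the form a * (t + t0) ** (-alpha) and fitting it by least squares in log-log space over a small grid of candidate offsets; the fitting procedure, the function name and the offset grid are assumptions made for the example and not necessarily the method used to produce Fig. 3. The first 2s are excluded, as in the text.

```python
import numpy as np

def fit_power_law_decay(t, intensity, offsets=None, t_min=2.0, t_max=30.0):
    """Fit intensity ~ a * (t + t0) ** (-alpha) on [t_min, t_max]; return (a, alpha, t0)."""
    if offsets is None:
        offsets = np.linspace(0.0, 5.0, 51)  # candidate offsets t0, in seconds
    t = np.asarray(t, dtype=float)
    y = np.asarray(intensity, dtype=float)
    mask = (t >= t_min) & (t <= t_max) & (y > 0)
    log_y = np.log(y[mask])
    best = None
    for t0 in offsets:
        x = np.log(t[mask] + t0)
        # Straight-line fit in log-log space: log y = slope * log(t + t0) + log a.
        slope, log_a = np.polyfit(x, log_y, 1)
        residual = np.sum((slope * x + log_a - log_y) ** 2)
        if best is None or residual < best[0]:
            best = (residual, np.exp(log_a), -slope, t0)
    _, a, alpha, t0 = best
    return a, alpha, t0

# Synthetic check: recover the parameters of a noiseless power law.
t = np.arange(0.0, 30.5, 0.5)
a, alpha, t0 = fit_power_law_decay(t, 0.3 * (t + 1.5) ** -1.2)
```

Fitting in log space gives equal weight to relative errors at early and late times, which suits a quantity that spans more than one order of magnitude over 30s.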
Spikes
We now turn our attention to the event kernel . For every event found by the model, we compute the difference between the empirical skip intensity and the optimal intensity profile found by the model from which we have removed the kernel associated with this very event, defined as:
(12) 
This gives us the theoretical intensity corresponding to every event except the one under consideration (cf. Figure 4). We then analyze this difference – which we call the empirical signature of the event – to confirm whether its shape matches that of the kernels used in our model. We analyze both the full signature (hatched area on Figure 4) and the cropped signature where we discard everything that happens after the next event (pink area on Figure 4). The advantage of the cropped signature is that it limits the influence of surrounding kernels on the shape of the empirical signature: in that sense, it is closer to a “true” signature of the event. The results are plotted on Figure 5, showing that both the cropped and the full skip signatures match the shape of with good accuracy. We show in Appendix B, Figures 10 and 11 that this remains true across genres and for various listening contexts (free or premium subscription, and whether or not the song was played based on a recommendation by the platform). Note that this does not directly prove that the chosen kernel shape is the “true” shape, if such a shape exists: only that it provides a good basis for decomposing the skip curve. However, a similar analysis for different kernel shapes is performed in Appendix A, most of which are unable to reproduce the above fit quality. This confirms the relevance of the above shape for the event kernels, with a linear increase in the first few seconds after the event followed by a slow decay. This essentially shows two things:

skips are triggered by punctual events – or, more precisely, they almost exclusively depend on the time elapsed since the beginning of a continuous section,

the skip rate associated with a continuous section dies off slowly. One interpretation could be that the perception of time in music varies: the longer users have been listening to a continuous musical segment, the weaker their perception of time – whereas abrupt transitions, on the contrary, awaken this perception. This idea is illustrated on Figure 6, which compares the chronology of events in real and perceived times – i.e. the time in which the skip rate would be constant – such that most of the perceived time corresponds to the very first seconds of the track where the attention (and the skip rate) are maximal, while more monotonous sections are contracted.
This sigmoid-power-law shape is very close to a lognormal shape, which has been documented by several studies as characteristic of human response times in various contexts [2, 9, 5, 4, 14, 10] and which has been used extensively in subsequent response time modelling [13, 11, 12, 3]. By confirming these results at an unprecedented scale, the present study shows that music is no exception.
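The notion of perceived time mentioned above, i.e. a time axis on which the skip rate is constant, can be illustrated with the standard time change for Poisson processes: mapping each instant to the cumulative intensity up to that instant turns a process with time-varying intensity into one with constant rate. The sketch below is an illustration of that idea only; it assumes an intensity sampled on a time grid and normalizes the result to [0, 1], and it is not claimed to be the exact construction behind Figure 6.

```python
import numpy as np

def perceived_time(t, intensity):
    """Map real time to normalized cumulative intensity (trapezoidal integration)."""
    t = np.asarray(t, dtype=float)
    lam = np.asarray(intensity, dtype=float)
    increments = 0.5 * (lam[1:] + lam[:-1]) * np.diff(t)
    tau = np.concatenate([[0.0], np.cumsum(increments)])
    return tau / tau[-1]  # 0 at the start of the track, 1 at the end

# With a decaying skip rate, the first half of the track takes up
# more than half of the perceived time.
t = np.linspace(0.0, 10.0, 11)
tau = perceived_time(t, 1.0 / (1.0 + t))
```

Sections with a high skip rate are stretched on this axis and quiet sections are contracted, which matches the qualitative picture of monotonous passages being compressed.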
Final increase
The pattern associated with the final increase is less clear-cut and universal than the patterns associated with the beginning of the song and with the events within the song. Since the average pattern seems to correspond to a power-law kernel with exponent , we keep a power-law kernel for modelling the final increase, as it substantially improves the overall fit of the skip curve. However, there is no clear indication of a universal pattern for the final increase.
4 Stability of the spikes
It was suggested in [8] that the skip profile of a song can be seen as a fingerprint of the song, as it is highly stable across geographical regions and time periods while being highly specific to the song. In the previous sections we have shown that the skip curve can be parametrized by a very small number of parameters – essentially, the timings and amplitudes associated with a discrete number of events. In this section, we test the hypothesis that the set of parameters associated with the events can be used as a compact fingerprint of a song, which would confirm the findings in [8] and validate our model at the same time. We thus need to answer the following question: given two sets of and event parameters and , what is the probability that they correspond to the same song?
4.1 Posterior probability
We take a Bayesian approach, and use the following notations:

is the probability that songs 1 and 2 associated with parameters and are the same song. This is the quantity we ultimately wish to compute.

is the probability of observing if songs 1 and 2 were the same,

is the probability of observing if songs 1 and 2 were different,

is the prior probability of two songs being the same,

is the prior probability of two songs being different.
We can then write, using Bayes' rule:
(13)  
where is a monotonically increasing function and we have noted the ratio between the prior probability of the two songs being different and the prior probability of the two songs being the same (typically,
). We make a naive Bayes assumption that events are independent, so that we can write:
(14) 
where we make the hypothesis that musical events in a song happen with a Poisson rate , and we have noted the empirical distribution of amplitudes . Similarly, we make the hypothesis that the joint probability of observing and if event in song 1 matches event in song 2 can be expressed as:
(15) 
and we write:
(16)  
with the additional notations for “event corresponds to event ”, for “there is no event in song 2 that corresponds to ” (and conversely) and where denotes the probability that no event is observed in song 2 if event is observed in song 1 (given that song 1 matches song 2). Putting the above equations together, one obtains:
(17)  
where the first line inside the parentheses gives a boost for events with matching timing and amplitudes, while the second line is a penalty for every event in one song that does not correspond to an event in the other song.
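The step from a likelihood ratio to a posterior probability follows directly from Bayes' rule: if the observations are some factor more likely under "same song" than under "different songs", and r is the prior ratio P(different) / P(same), then the posterior is 1 / (1 + r / factor). The sketch below illustrates this with hypothetical names (`log_likelihood_ratio`, `prior_ratio`), assuming the score is available as a natural-log likelihood ratio; it works in log space so that very large or very small scores do not overflow.

```python
import math

def posterior_same_song(log_likelihood_ratio, prior_ratio):
    """P(same | obs) from log[P(obs | same) / P(obs | different)] and r = P(different) / P(same)."""
    # posterior = 1 / (1 + r * exp(-log_ratio)) = sigmoid(log_ratio - log r)
    z = log_likelihood_ratio - math.log(prior_ratio)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# With a prior of one matching pair per million, the evidence must outweigh
# the prior by the same factor to reach a posterior of one half.
p = posterior_same_song(math.log(1e6), prior_ratio=1e6)
```

The posterior is a monotonically increasing function of the score, so ranking pairs of songs by score or by posterior gives the same order whatever the prior.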
4.2 Matching the events
Figure 7: Matching grid for two non-matching songs (values are in log-scale), with the optimal path in white. Three matches are found (white stars) between pairs of events (1, 1), (2, 2) and (4, 3). The other events are matched with empty intervals (even rows/columns, represented with white circles), resulting in a low posterior probability of match between the two songs. Note that the pair (5, 4) is not matched because of the difference in amplitude.
Note that writing and assumes that events from songs 1 and 2 have previously been matched. In this section, we propose to find the optimal alignment between and that maximizes . We approach this maximization problem by noticing that the different configurations for the product in Eq. (17) can be viewed as the paths from the bottom left corner to the top right corner of a grid of shape , where the even rows (resp. columns) correspond to the intervals between the events in song 1 (resp. song 2) and the odd rows (resp. columns) correspond to the events in song 1 (resp. song 2). A given path therefore matches the th event in song 1 with the th event in song 2 () if it goes through the position . If on the contrary the th row is matched to an even index (corresponding to an interval), then . The values associated with each position on the grid correspond to the factors in Eq. (17), namely:
the value is associated to the intersection of row and column ,

the value (resp. ) is associated to the intersections of row and columns for all ’s (resp. column and rows for all ’s),

the value 0 is associated to the intersections of all even rows and columns.
The optimization problem can be solved by dynamic programming, by considering all the paths from to that satisfy if is odd (a punctual event can only have one counterpart) and if is even (an interval can have several counterparts) – and similarly for . An example of such a grid for two non-matching songs with the associated optimal path is represented on Figure 7. The optimal product value to use in Eq. (17) is then the product of the values along the optimal path.
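The dynamic program can be sketched in a simplified but equivalent-in-spirit form: rather than walking an explicit grid with alternating event and interval rows, the version below uses the classic sequence-alignment recursion, in which each event is either paired with an event of the other song (match score) or left unmatched (gap penalty). This is an assumption-laden illustration, not the implementation used in the study: the function name, the log-score inputs and the toy numbers are all hypothetical, and all scores are assumed finite.

```python
import math

def best_alignment(match_score, gap1, gap2):
    """Maximize the total log-score of an order-preserving matching of two event lists.

    match_score[i][j]: log-score for pairing event i of song 1 with event j of song 2.
    gap1[i], gap2[j]: log-penalties for leaving an event of song 1 / song 2 unmatched.
    """
    n, m = len(gap1), len(gap2)
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if i > 0:
                candidates.append((best[i - 1][j] + gap1[i - 1], "skip1"))
            if j > 0:
                candidates.append((best[i][j - 1] + gap2[j - 1], "skip2"))
            if i > 0 and j > 0:
                candidates.append((best[i - 1][j - 1] + match_score[i - 1][j - 1], "match"))
            for value, step in candidates:
                if value > best[i][j]:
                    best[i][j], move[i][j] = value, step
    # Backtrack from the last cell to recover the matched pairs (0-based indices).
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip1":
            i -= 1
        else:
            j -= 1
    return best[n][m], pairs[::-1]

# Toy example with five events in song 1 and four in song 2.
scores = [[-10.0] * 4 for _ in range(5)]
scores[0][0] = scores[1][1] = scores[3][2] = 2.0
scores[4][3] = -5.0  # close in time but mismatched amplitude
total, pairs = best_alignment(scores, [-1.0] * 5, [-1.0] * 4)
```

In this toy example the last pair of events is left unmatched because pairing them costs more than the two gap penalties, which mirrors the situation described for the pair (5, 4) in Figure 7.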
4.3 Results
We compute the matching probabilities for the daily skip curves of a set of matching songs and a set of different songs. The log-products (also called the scores) and the corresponding probabilities with an arbitrary prior are represented on Figure 8. The distributions for matching and non-matching tracks are clearly distinct, showing that the set of timing/amplitude parameters is both highly specific to a song and stable across time, and thereby confirming the relevance and stability of our model. Note that one could use this result to fine-tune the model hyperparameters (in particular the weight of the prior loss ) by maximizing some distance between the two distributions.
Conclusion
We have developed a simple yet powerful model for skipping behaviour, in which skips follow a Poisson process with time-varying intensity modelled as a sum of temporal responses (kernels), which are triggered by discrete events that happen inside the song, namely: (i) the beginning of the song, which triggers a power-law decaying kernel and usually accounts for the majority of skips, (ii) musical events (typically transitions) within the song, which each trigger a quick increase in skipping rate followed by a slow decay, and (iii) the end of the song, which is anticipated with a rise in skips. This events-responses decomposition of skip profiles provides us with a natural framework to quantitatively assess the impact of musical events on listening behaviour, and confirms the idea developed in [8] that skips are for the most part reactions to salient musical events. This suggests that the perception of time when listening to music is highly dependent on the variety of the music, as though users were progressively anaesthetized by long monotonous sections and abruptly awoken by unexpected events. Moreover, the temporal profile of these reactions appears to be consistent across songs, suggesting some universal way for humans to react to musical surprises. The stability across time and geographical areas of the magnitude of the reactions to specific events further suggests that it should be possible to understand what in the music motivates people to skip – and to what extent. We leave this question for future work.
Acknowledgements
The author would like to thank Nicola Montecchio for helping with gathering the data, Pierre Roy for insightful discussions and François Pachet for his overall support.
References
[1] Amy M Belfi, Anna Kasdan, Jess Rowland, Edward A Vessel, G Gabrielle Starr, and David Poeppel. Rapid timing of musical aesthetic judgments. Journal of Experimental Psychology: General, 2018.
[2] György Buzsáki and Kenji Mizuseki. The log-dynamic brain: how skewed distributions affect network operations. Nature Reviews Neuroscience, 15(4):264, 2014.
[3] Jean-Paul Fox, RH Klein Entink, and Willem J van der Linden. Modeling of responses and response times with the package cirt. Journal of Statistical Software, 20(7):1–14, 2007.
[4] John G Holden. Fractal characteristics of response time variability. Ecological Psychology, 14(1-2):53–86, 2002.
[5] John G Holden, Guy C Van Orden, and Michael T Turvey. Dispersion of response times reveals cognitive dynamics. Psychological Review, 116(2):318, 2009.
[6] Paul Lamere. The drop machine. https://musicmachinery.com/2015/06/16/thedropmachine/, Accessed: 2019-03-20.
[7] Paul Lamere. The skip. https://musicmachinery.com/2014/05/02/theskip/, Accessed: 2019-03-20.
[8] Nicola Montecchio, Pierre Roy, and François Pachet. The skipping behavior of users of music streaming services and its relation to musical structure. arXiv preprint arXiv:1903.06008, 2019.
[9] Deborah L Schnipke and David J Scrams. Representing response-time information in item banks, volume 97. Law School Admission Council, 1999.
[10] David Thissen. Timed testing: An approach using item response theory. In New horizons in testing, pages 179–203. Elsevier, 1983.
[11] Rolf Ulrich and Jeff Miller. Information processing models generating lognormally distributed reaction times. Journal of Mathematical Psychology, 37(4):513–525, 1993.
[12] Wim J van der Linden. A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2):181–204, 2006.
[13] Wim J Van Der Linden, David J Scrams, and Deborah L Schnipke. Using response-time constraints to control for differential speededness in computerized adaptive testing. Applied Psychological Measurement, 23(3):195–210, 1999.
[14] Guy C Van Orden, John G Holden, and Michael T Turvey. Self-organization of cognitive performance. Journal of Experimental Psychology: General, 132(3):331, 2003.
Appendix A Qualitative analysis of kernel shapes
We proceed to a qualitative comparison of several event kernel shapes (cf. Eq. (11)). As in Section 3, we are interested in the empirical event signatures (as defined in Eq. (12)) that are obtained for each kernel shape. We look at both the full signature and the cropped signature (which is less biased by future events, and so more reliable in theory). As already mentioned, the fact that the empirical signatures correspond to the kernel shape only proves that the kernel provides a good decomposition basis for the skip curve (i.e. that the model is consistent), not that the kernel is the “true” one (if it exists). However, a mismatch shows that the kernel shape is not even a good decomposition basis, and can be ruled out as the “true” shape. The full and cropped empirical signatures for 8 different kernel shapes are shown on Figure 9 below. A basic qualitative analysis of the results follows.

Figure 8(a): Inverse kernel.

Kernel shape: .

Parameters: s, s, , s.

Comments: this is the kernel used in the main text. The full and cropped empirical signatures match the kernel, which thus provides a good basis for decomposition.


Figure 8(b): Inverse kernel with sharp onset.

Kernel shape: .

Parameters: s, s, , s.

Comments: this kernel is similar to the inverse power kernel above, albeit with a sharper initial onset. The empirical signatures show a slower initial onset than the kernel, which would be better fitted by the previous inverse kernel with . This confirms that the previous kernel better accounts for the initial onset.


Figure 8(c): Exponential kernel with 25s decay.

Kernel shape: .

Parameters: s, s, , s.

Comments: This kernel is similar to the inverse power kernel but with a quicker long-term decay. The quality of fit is quite similar to that of the inverse power kernel. We can see by looking at the cropped empirical signature that the measurement uncertainty gets large after 30-40s, which makes it difficult to compare the fit quality with that of the inverse kernel at these time scales.


Figure 8(d): Exponential kernel with 33s decay.

Kernel shape: .

Parameters: s, s, , s.

Comments: A slower decay leads to a poorer fit than with the previous exponential kernel. In particular, the empirical signatures decay faster than the kernel after 5s. This is consistent with the fact that this kernel decays too slowly in the first 30s.


Figure 8(e): Exponential kernel with 20s decay.

Kernel shape: .

Parameters: s, s, , s.

Comments: We observe the inverse phenomenon from the slower exponential kernel, with an empirical signature that decays more slowly than the kernel around 10-15s. This is consistent with the fact that this kernel decays too quickly, in particular in the first seconds.


Figure 8(f): Exponential kernel with 10s decay.

Kernel shape: .

Parameters: s, s, , s.

Comments: We observe the same phenomenon as for the previous kernel, but much more accentuated. The empirical signature is now largely outside the statistical error band.


Figure 8(g): Inverse square kernel #1.

Kernel shape: .

Parameters: s, s, , s.

Comments: The shape of the kernel poorly matches the empirical signature around 5-10s. The crossover between the onset and the decay isn’t properly captured.


Figure 8(h): Inverse square kernel #2.

Kernel shape: .

Parameters: s, s, , s.

Comments: Same as above, with a faster decay. Poor fit overall.

Because of the amount of computations required to test each kernel, we have been limited in our comparative study. However, we can draw a few conclusions from these experiments:

Not all kernels are consistent for modelling the skip curves. In fact, most of the kernels produce statistically significant biases in the empirical signatures.

It is possible to link the kernel shapes with the biases observed. In particular, one can tell how to modify the kernel onset to better fit the data, and whether the following decay is too fast or too slow.

After 30-40s, the statistical error bars widen, and it becomes impossible to discriminate between reasonably-shaped kernels.
We can conclude that the “true” kernel, if it exists, resembles the inverse power kernel (1) and the exponential kernel (3) for the first 30-40s. Whether the decay is faster or slower after that is impossible to settle. Importantly, these results remain valid across various partitions of the data (genres, listening contexts, etc.). This shows that there is some universality in the temporal response to musical events.
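The empirical signatures compared in this appendix can be computed with a few lines of code. The sketch below follows the verbal definition given in Section 3 (empirical intensity minus the fitted model from which the kernel of the event under consideration has been removed) and supports both the full and the cropped variants. It assumes that the empirical intensity, the fitted model and the contribution of the event are available as arrays on a common time grid; the function and argument names are illustrative.

```python
import numpy as np

def empirical_signature(t, empirical, model_total, event_contribution,
                        event_time, next_event_time=None):
    """Residual attributed to one event, as a function of time since that event."""
    t = np.asarray(t, dtype=float)
    # Model intensity with the kernel of this event removed.
    without_event = np.asarray(model_total, dtype=float) - np.asarray(event_contribution, dtype=float)
    signature = np.asarray(empirical, dtype=float) - without_event
    keep = t >= event_time
    if next_event_time is not None:  # cropped signature: stop at the next event
        keep &= t < next_event_time
    return t[keep] - event_time, signature[keep]

# Sanity check on synthetic data: when the model fits perfectly, the cropped
# signature of an event is exactly that event's kernel.
t = np.arange(0.0, 10.0)
contribution = np.where(t >= 3.0, 1.0 / (1.0 + np.maximum(t - 3.0, 0.0)), 0.0)
model = 0.1 + contribution
lag, sig = empirical_signature(t, model, model, contribution, 3.0, next_event_time=7.0)
```

Averaging such signatures over many events, after normalizing each by its fitted amplitude, is one natural way to obtain curves that can be compared with a candidate kernel shape.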
Appendix B Analysis of the empirical signature across genres and contexts
We proceed to an analysis of the empirical event signatures across genres and listening contexts, to confirm whether the shape chosen for the event kernels in Eq. (11) remains consistent across a wide range of conditions. Figure 10 shows the analysis across a number of genres (R&B, Rap, Dance & House, Rock, Indie Rock, Pop). In all cases, the kernel shape closely matches the empirical event signature, showing consistency across genres. Figure 11 proceeds to the same analysis across listening contexts (free or premium subscription, and whether or not the song was played based on a recommendation by the platform), with similar results.
Appendix C Cross-sectional analysis of skip profiles
We can use the sparse parametrization of skip profile curves to compare skip profiles quantitatively across a number of contexts.


Stream count
Figure 11(a) shows the parameter distributions as a function of stream count. Most parameters are quite stable, as a horizontal line would almost fit in the confidence interval. A slight trend might be visible for and , which tend to increase with the stream count. The parameters relative to the beginning of the track are overall more stable than the parameters relative to the end of the track . The intercept is consistently very small.
Duration
Figure 11(b) shows the parameter distributions as a function of the duration of the song. Here again, parameters are quite stable, except which decreases with increasing durations, showing that the anticipation behaviour changes with the duration of the track – it is steeper for shorter tracks.
Genres
We finally proceed to an analysis by genres, which confirms the universality of the parameters across genres. Two notable deviations can be observed: (i) classical and jazz songs have a much lower skip rate than other genres, which reflects on the distribution of and , (ii) one can guess a bimodal distribution for , with half of the songs being more heavily skipped at the end than the other half.