1 Introduction
Online streaming services such as Pandora, Netflix, and Youtube constantly seek to increase their share of online attention by keeping their users as engaged as possible with their own services [14]. A well known challenge is how to measure the users’ engagement, and what are the key components that create an engaging streaming service. One important engagement metric is the amount of time spent by users within the service. When users access streaming services, they usually watch videos, movies, online TV or listen to music, and after a while they leave the service. We refer to the user interaction from the moment they start the service to the moment they leave as a user session, and the time spent during one session as user session length [21].
In this paper, we aim to predict the user session length using real world datasets from two music streaming platforms, i.e. predicting, at the beginning of the session, the amount of time they will spend listening music. Understanding and modeling the factors that can affect the session length is of great use for various downstream tasks. In fact it allows recommender systems to tune the explore vs exploit parameters for each user. In addition, having an accurate estimate of the users’ session lengths allows the streaming service to adjust ad pacing per user. Ads can be rescheduled in a way to keep the revenue target (i.e. total number of ads presented) as well as improve user experience.
Predicting the length of user sessions is very challenging as the authors reported in [21]. First, user sessions can end for many different external reasons that have nothing to do with quality of the streaming services, such as moving to subway, reaching home, and most of these contextual covariates are not easily accessible because of technological or privacy reasons [21]. Second, a certain amount of users are casual users in the sense that they only use the streaming services a few times per month, which makes the problem of estimating session lengths for those users hard. Furthermore, a lack of predictive covariates makes it harder to correctly predict what is going to be the next session length for a user.
A first approach toward session length analysis and prediction [21] is based on a Boosting algorithm. Most of the other related research were focused on modeling the time spent after clicking a search result [4, 13] or advertisement [2]. The best models are based on survival analysis, mainly because data are censored in these applications. In fact, after clicking of a search result or an ad, users can either turn back to the search page or can abandon the final page, instead of turning back to the main search page. This is not the case of session length data that is not censored^{1}^{1}1Observations are called censored when the information about their survival time is incomplete. – this opens the door to using a suite of methods based on regression modeling, shrinkage and relevant generalizations. Furthermore, in the case of web search or ad click, the user enters a query or click and checks the results, therefore the intersection between search or ad title and the landing page give highly predictive features [13]. In the case of a streaming service, the user interaction can be very low, but still they can have very long sessions ("leanback" behavior) [21].
In this paper, we propose a novel framework inspired by hierarchical Bayesian modeling and shrinkage principles that allows us to express session lengths in terms of userspecific latent variables. These variables are estimated via a joint learning framework which is rather broad in scope – we use Bayes, Empirical Bayes and MAP estimation techniques–particular choices are based on computational tractability considerations. Informally speaking, our models learn by borrowing strengths across all users and making use of rich covariate information. We also propose models that can incorporate outliers in data. A salient aspect of our framework is its modularity – it includes stateoftheart models as special cases, and naturally allows for a hierarchy of flexible generalizations. Hence, it allows the practitioner to glean insights about the problem, by assessing incremental gains in predictive accuracy associated with different generalizations and the incorporation of covariateinformation. We present tailored algorithmic approaches based on modern large convex optimization techniques to address the computational challenges. We summarize some of our key findings in this paper:

We outperform a baseline estimator by a margin of to and the stateofthe art in session length prediction [21] up to in terms of Mean Absolute Error measured in seconds on different real world datasets.

We show that some of our proposed prediction models can be (a) more accurate than state of the art (with 12% relative improvement in prediction performance), (b) between to times faster in training time; and (c)  times faster in prediction time, reaching around 1ms.

We provide a modular framework specifically for this problem that allow more flexible generalizations,
2 Key idea
The main focus of this paper is the prediction of user session length at the moment of login into a streaming service. For this purpose, we exploit the users past interaction with streaming services, previous sessions lengths, build a set of features (as we will describe in Section 4.1), and we set a learning framework for prediction purposes. We proceed by using a Gaussian approximation for the distribution of the log of a users’ session length (Section 4.1
). As we will see, this will help us in creating a wellgrounded and tractable inferential framework. Since we want a prediction framework that is performing well for very active users (with many sessions) and lessfrequent users (with only a few sessions per month), we propose a formal framework inspired by hierarchical Bayesian shrinkage ideas that allows us to model the sessionlength of a user in terms of userspecific latent variables. Fundamental principles of Bayesian shrinkage and estimation encourage users to borrow strengths across each other, including (but not limited to) covariate information. This (i) ameliorates the high variance (and hence low predictive accuracy) of userspecific maximum likelihood estimates for lessfrequent users; and (ii) leads to an overall boost in prediction accuracy for more frequent users as well. Bayesian decision theory and empirical Bayes methodology
[7] provides a formal justification of our framework. The notion of shrinkage that we undertake here is quite broad: It applies to cases with or without covariate information; by a flexible choice of priors on the latent variables, we can also build models robust to heavy tailed errors. We show that the stateoftheart model [21] on this particular problem is a special case of our framework. For flexible models and/or priors when Bayes estimators become computationally demanding, we recommend MAP estimation for computational efficiency, this includes the case for robust modeling with a Huber loss [10]. We resort to techniques in modern large scale convex optimization to achieve computational scalability and efficiency.3 Mathematical Framework
We formally develop the inferential framework to address the session length prediction problem. We denote the total number of users by . For every , let be the number of past sessions of user , and be the time spent by user in the th session. We will work with log of session length
as our response as this gives a better approximation to the Gaussian distribution (See Table
1). We note that similar (variance stabilizing) transformations^{2}^{2}2We note that it is also possible to finetune the transformation by considering a family of the form: for ; and optimizing over to obtain the closest approximation to a Gaussian distribution in terms of the KolmogorovSmirnov goodness of fit measure (for example). However, we do not pursue this approach here; as it leads to marginal gains in eventual prediction performance. are often used in the empirical Bayes literature [7] so that one can take recourse to the rich literature in Gaussian estimation theory. We denote bya vector of logsessionlengths of user
; the total number of sessions across all users; and is a vector of logsessionlengths of all users across all sessions. In addition, covariate information persession is available (See Section 4.1). For a given user, some of these features are fixed across sessions (age, gender…) and others depend upon the session (network, device…). Let denote covariates corresponding to the th session of user . denotes a matrix with rows that are stacked on top of each other. The th row of corresponds to th entry of the response vector . The columns of were all meancentered and standardized to have unit norm.A natural point estimate of the time spent by the th user may be taken to be the average time spent by the user in past sessions – however, we see that this does not lead to good predictions due to the high variance associated with users with few sessions. This behavior is not surprising and occurs even in the simple Gaussian sequence model as explained in Section 3.1.
3.1 Review of Bayes, Empirical Bayes and MAP
This section provides a brief review of Bayes, Empirical Bayes (EB) and Maximum A posteriori (MAP) estimation in the context of the Gaussian sequence model [7]. The exposition in this section lays the foundation for generalizations to more flexible structural models that we present subsequently.
The Bayes Estimator: We consider a latent Gaussian vector where, ; that gives rise to an observable Gaussian vector such that for all . Note that the posterior distribution of is given by a dimensional multivariate Gaussian with mean and covariance , i.e., where, Recall that the Bayes estimator is the mean of the posterior and given by:
(1) 
We remind the reader that the Bayes estimator shrinks each observation towards the mean of the prior distribution – this is to be contrasted with the usual maximum likelihood (ML) estimator, that does not shrink ’s. The Bayes estimator has smaller risk than the ML estimator (Theorem 1).
Empirical Bayes (EB) estimator: The Bayes estimator depends upon the unknown hyperparameter ; which needs to be estimated from data. The “Empirical Bayes" (EB) framework [7] achieves this goal by using a datadriven plugin estimator for in (1) – this leads to an EB estimator for .
The basic EB framework obtains an unbiased estimator for
based on the marginal distribution ofUsing standard properties of Gamma and inverseGamma distributions, it follows that
(where, ) is an unbiased estimator of . This leads to an EB estimate , which has smaller risk (Theorem 1) than the ML estimator.Theorem 1
[7] For , the EB estimator has smaller risk (defined as ) than the ML estimator , i.e., for any . The risks of the EB and Bayes estimators are comparable, with a relative ratio:
We make the following remarks: (i) Theorem 1 states that the price to pay for not knowing is rather small, and as becomes large the Bayes and EB estimators are similar. (ii) Instead of taking an unbiased estimator for as above, one can also take a consistent estimator which might be easier to obtain for more general models (see Section 3.2). For more general models, a good plugin estimate for may be obtained based on validation tuning. The framework above provides important guidance regarding a range of good choices of thereby reducing the computational cost associated with the search for tuning parameters.
MAP estimation: As an alternative to the Bayes/EB estimator, we can also consider the MAP estimator, a mode of the posterior likelihood . Here the MAP estimate () coincides with the Bayes estimator. The MAP and Bayes estimators are not the same in general. For flexible priors/models, computing Bayes estimators can become computationally challenging and one may need to resort to (intractable) high dimensional MCMC computations. In these situations (see Section 3.3), the MAP estimator may be easier to compute from a practical viewpoint. For all models used in this paper, we observe that MAP computation can be tractably performed via convex optimization.
3.2 Model 1: Modeling user effects
We present a hierarchical shrinkage framework for predicting userspecific session lengths, generalizing the framework in Section 3.1.
Suppose that the logsession lengths of the
th user are normally distributed with mean
; and these latent variables are generated from a centered Gaussian distribution. The (random) is the th user effect. This leads to the following hierarchical model:(2) 
generalizing the model in Section 3.1 to the case with multiple replications per user. The posterior distribution is given by:
where, and is the mean of the vector . The Bayes estimator of is given by the posterior mean . Here, the MAP estimator of coincides with the Bayes estimator as well. We note that the Bayes/MAP estimators in this example bear similarities with the model in Section 3.1 – we shrink the observed mean of each user towards the global mean of the prior distribution: this lowers the variance of the estimator at the cost of (marginally) increasing the bias. The amount of shrinkage depends upon the number of sessions of the th user via the factor . In particular, the shrinkage effect will be larger for users with a small number of sessions.
Estimating the hyperparameters: The estimators above depend upon hyperparameters via , which is unknown and needs to be estimated from data. In the spirit of an EB estimator we obtain a plugin estimator for . To this end we use the marginal distribution of , which follows where, has diagonal entries equal to and offdiagonal entries equal to . Consequently, is an unbiased estimator for the covariance matrix . In particular, if then the estimators
(3) 
are unbiased estimators of and (respectively). To see this, note that is obtained by taking the average of all the offdiagonal entries of the matrix . Similarly, corresponds to the average of the diagonal entries. Estimators in (3) are based solely on observations from the th user; and can have high variance if is small (which is the case for less heavy users). Hence, we aggregate the estimators across all users to obtain improved estimators of given by:
(4)  
Using laws of large numbers, one can verify that
(and ) are consistent estimators for (and ). Interestingly, this holds under an asymptotic regime where, but remains bounded – this regime is relevant for our problem since there are many users with few/moderate number of sessions. We emphasize that even if is large but the ’s are small, shrinkage plays an important role and leads to estimators with smaller risk than the usual maximum likelihood estimator for. The plugin estimators suggested above lead to consistent estimators for the Bayes and EB estimators. This framework provides guidance regarding the choice of the tuning parameters in practice (and reduces the searchspace associated with hyperparameter tuning).
3.3 Model 2: Modeling with covariates
We describe a generalization of Model 1 that incorporates user and devicespecific covariates (See Section 4.1 for details). Our hierarchical model is now given by:
(5)  
The above represents a generative model with latent variables – where, denotes a vector of regression coefficients corresponding to the covariates ; and explains the residual userspecific effect of user . Both the latent variables are normally distributed with mean zero; and given these parameters, ’s are normally distributed with mean . This model is more structured and also more flexible than Model 1 in that the th user effect has two components: a global regressionbased response (this depends upon both the user and the session); and a residual component . We now derive the EB and MAP estimators.
Let us define and the latent vector . Model (5) can be reformulated as:
where, is such that its first entries correspond to , its th entry is ; and all remaining entries are 0. If be the matrix obtained by row concatenation of the ’s; then the posterior distribution of is given by
where, the matrix and the regularization parameter . The Bayes estimate of is given as:
and can be derived from the components of . In this model, the MAP estimator coincides with the Bayes estimator, and can be computed as , where, is the convex function:
(6) 
and , are hyperparameters. In Section 3.4, we propose Algorithm 1 to minimize Problem (6).
An empirical Bayes estimator of can be computed by using datadriven estimators for the hyperparameters. We can obtain consistent estimators of the hyperparameters following the derivation in Section 3.2. Since this derivation ^{3}^{3}3Note that for Model (5) the marginal distribution of is a multivariate Gaussian with mean zero and covariance matrix , which is a function of and . Following Section 3.2, we have . We can then derive consistent estimators of based on functionals of and entries of . is quite tedious, we do not report it here. In practice, we recommend tuning on a validation set (where, is taken to be in the neighborhood of the values suggested by Section 3.2 pertaining to Model 1). As we discuss in Section 3.5, this does not add significantly to the overall computational cost, as our algorithm effectively uses warmstart continuation [9] across different tuning parameter choices.
We now move beyond the Gaussian prior setup considered so far and consider a Laplace prior on . In this case, and the models we consider subsequently, Bayes estimators are difficult to compute due to highdimensional integration that require MCMC computations. With computational tractability in mind, we will resort to MAP estimation for these models.
Laplace prior on : Motivated by regularization techniques [20] popularly used in sparse modeling, we propose a Laplace prior on – the corresponding MAP estimators lead to sparse, interpretable models [9, 20]. Here, computing the Bayes estimator becomes challenging and requires MCMC computation. However, the MAP estimator is particularly appealing from a statistical and computational viewpoint; and given by , where,
(7) 
is a convex function with hyperparameters . The tuning parameters are chosen based on a validation set. Section 3.4 presents an algorithmic framework based on first order convex optimization methods [17, 22] for optimizing Problem (7) – our proposed algorithm leads to significant computational gains compared to offtheshelf implementations.
3.3.1 Nonparametric modeling with covariates
The framework presented above is quite modular–it allows for flexible generalizations, allowing a practitioner to experiment with several modeling ramifications, and understand their incremental value (prediction accuracy visavis computation time) in the context of the particular application/dataset.
Recall that the basic model put forth by Model 2 is where, . We propose to generalize this linear ‘link’ by incorporating flexible nonparametric models for the covariates, as follows:
(8) 
where, is a flexible nonparametric function of the covariates. For example, we can train
via Gradient Boosting Trees (GBT)
[9]as our nonparametric model
^{4}^{4}4We also experimented with classical Classification and Regression trees as well as random forests, but the best predictive models were obtained via GBT.
. Trees introduce nonlinearity and higher order interactions among features and can fit complex models. By adjusting tuning parameters like learning rate, maximal treedepth, number of boosting iterations, etc, they control the biasvariance tradeoff and hence the generalization ability of a model. Given a continuous response and covariates ; GBT creates an additive function of the form where, ’s are trees of a certain depth and is the learning rate – the components are learned incrementally via steepest descent on the least squares loss with possible early stopping. This imparts regularization and improves prediction accuracy.Summarizing the general framework: In summary, our framework assumes that we have access to an oracle that solves the following optimization problem
(9) 
with a regularizer that restricts the family . Problem (9) encompasses the different models that we have discussed thus far: e.g., model (7) (here, and ), model (6) (here, and ); and GBT.
For flexible nonparametric models, a MAP estimator can be obtained by minimizing the negative loglikelihood of the posterior distribution jointly w.r.t and . This entails minimizing the negative loglikelihood of the posterior distribution – this is given by the function (up to constants) as follows:
(10) 
where, . Section 3.4 presents an algorithmic framework for minimizing (10) to obtain estimates .
We note that for the class of the models in Section 3.3.1, it is not clear how to tractably construct and compute Bayes/EB estimators–we thus focus on MAP estimation; and note that the associated tasks can be cast as tractable convex optimization problems.
3.3.2 Some Special Cases
As we have noted before, an important contribution of this paper is to propose a general modeling framework – so that a practitioner can glean insights from data analyzing the incremental gains available from different modeling components. To this end, we note that if all the residual user effects are set to zero (i.e, for all ), then we can use covariates alone to model the user effects. In the case of model (6) with this is referred to as Ridge in Section 4.3. Furthermore, if we learn via boosting (with ) then we recover the model proposed in [21] (denoted as SIGIR2017 in Section 4) as a special case. Predictive performances of these models are presented in Section 4.
3.3.3 Robustifying against outliers
The models described above assume a normal distribution (see (8)) — in reality, to be more resistant to outliers in the data it is useful to relax this assumption to account for heavier tails [10] in the error distribution. To this end, with computational tractability in mind, we propose a scheme which is a simple and elegant modification to our framework, by assuming a stylized decomposition of the link in (8). To this end, we write
(11) 
and place an additional prior on ’s – they are all drawn from where, is the Laplace density. The MAP estimator for this joint model requires minimizing the following convex function
(12)  
w.r.t. the variables . It is not immediately clear why an estimator available from Problem (12) has robustness properties. To this end, Theorem 2 establishes a crisp characterization of Problem (12) in terms of minimizing a Huber loss [10] on the residuals where, the Huberloss is given by
The Huber loss is quadratic for small values of (controlled by the parameter ) and linear for larger values – thereby making it more resistant to outliers in the space. The Huber loss remains relatively agnostic to the size of the residuals, therefore offering a robust approach to regression [10]. If is small,
resembles the least absolute deviation loss function—this makes it more suitable for the
MAE metric used for evaluation in Section 4.Theorem 2
Minimizing Problem (12) w.r.t is equivalent to minimizing (below) w.r.t. :
(13) 
Proof 1
We use a variational representation of the Huber loss
(14) 
To derive identity (14), we compute a minimizer of in (14) via softthresholding: (where, ). We plugin the value of into and upon some simplification obtain (14). The proof of the theorem follows by applying (14) to Problem (12), where, we minimize wrt to obtain criterion (13) involving (and not ).
The above development: decomposition (11) and hence criterion (12) nicely falls within the general hierarchical framework discussed in this paper. In fact all models described before this section can be cast as special instances of Problem (12) by setting and stylized choices of . Our numerical experiments suggest that a finite nonzero choice of leads to the best outofsample prediction performance, thereby suggesting the importance of doing robust modeling in this application. In addition, our model is nicely amenable to the computational methods discussed in Section 3.4. This further underlines the flexibility of our overall framework – even if our basic assumption relies on Gaussian errors at the core, simple hierarchical modeling decompositions of the latent variables make it flexible enough to accommodate adversarial corruptions in the data.
3.4 Computation via Convex Optimization
All the estimation problems alluded to above can be cast as convex optimization problems; for which we resort to modern computational methods [17, 22]. To compute the estimators mentioned in Section 3.3, we need to minimize Problem (12). To this end, we use a blockcoordinate descent scheme [22]: at iteration , we minimize (12) w.r.t. , followed by a minimization w.r.t the latent vectors . The algorithm is summarized below.
Algorithm 1: BlockCoordinateDescent for MAP estimation
Input: , tuning parameters, tolerance ; initialization
.
Output: An estimate
, minimizing Problem (10)
(1) Repeat Steps 2 to 5 until for .
(2) Let be a solution of the optimization problem (10) with held fixed at respectively – this is equivalent to solving (9) with where .
(3) Update the residuals Estimate userspecific effects via: (with respectively set to ). This is a closedform update: .
(4) Update the vector: , corresponding to the sparse corruptions. The closed form update is ; where, .
(5) Set the value of .
Algorithm 1 applies to a fixed choice of the hyperparameters. We need to consider a sequence of hyperparameters to obtain the best model based on the minimization of prediction error (see Section 4) on a validation set. In the case of models (6) and (7) estimates of can be computed over a grid of parameters by using warmstarts across different tuning parameters–to this end, the EB estimators provide a good ballpark estimate of relevant tuning parameters. Section 3.5 describes specialized algorithms that are found to speed up the computations pertaining to models (6) and (7) when compared to offtheshelf implementations of these algorithms. We note that GBT does not benefit from warmstart continuation across hyperparameters. For Algorithm 1, we use a stopping criterion of (Step 1) in the experiments.
3.5 Computational Considerations
We consider certain algorithmic enhancements for Algorithm 1 that lead to important savings when the number of sessions become large (of the order of millions). We focus on the critical Step 2 of Algorithm 1 when and corresponds to the ridge or regularization – this leads to a problem of the form:
(15) 
where . Indeed, in these instances, we found out that the default implementation of Python’s scikitlearn package [18] was prohibitively slow for our purpose, and hence careful attention to algorithmic details seemed necessary (details below). We derived new algorithms for (15) with an eye towards caching numerical linear algebraic factorizations, exploiting warmstarts, etc; as we describe below.
3.5.1 regression subproblem
When , the ridge estimator has an analytical expression:
(16) 
which needs to be computed for several tuning parameters, and iterations. To reduce the computational cost, we obtain an equivalent expression for via the eigendecomposion of given by, , with
being a diagonal matrix with eigenvalues
– this has a cost of in addition to the cost of computing (and they can both be done once, offline). This leads to which can be computed with cost . In our experiments (Section 4) is small, which leads to a cost that is linear in . The predicted values can be computed with an additional cost of . Note that computing estimator (16) for different values of the tuning parameter does not require additional eigendecompositions – this is critical in making the overall algorithm efficient, especially when training across multiple values of the hyperparameter.3.5.2 regression subproblem
When , Problem (15) becomes equivalent to a Lasso estimator – we emphasize that scikitlearn’s implementation of Lasso became rather expensive for our purposes since it could not effectively exploit warmstarts and cached matrix computations. This motivated us to consider our own implementation, based on proximal gradient descent [3, 17]. To this end, since , we precomputed^{5}^{5}5
Note that this computation is also required for the ridge regression model.
and considered a dimensional quadratic optimization problem of the form:(17) 
where, the smooth part of has Lipschitzcontinuous gradient – that is, it satisfies for (recall, ’s are eigenvalues of ). A proximal gradient algorithm for (17) performs the following updates:
(18) 
till convergence. Note that can be computed via softthresholding, i.e., where, for a vector the th coordinate of the softthresholding operator is given by . Note that the objective function is strongly convex^{6}^{6}6Note that is convex for , i.e., the minimum eigenvalue of – this means that is strongly convex with strong convexity parameter .; and hence sequence converges to an suboptimal solution to Problem (17) in iterations [17] – i.e., it enjoys a linear convergence rate. Every iteration of (18) has a cost of (arising from the computation of and the softthresholding operation). In addition, computing costs (this is computed once at Step 2 of Algorithm 1). Problem (17) needs to be computed for several tuning parameters and iterations (of Algorithm 1) – this does not add much to the overall runtime as the proximal gradient algorithm can be effectively warmstarted – this is found to speedup convergence in practice.
3.5.3 GBT subproblem
When the optimization in Step 2 involves performing GBT, the runtimes increase substantially (See Section 4.5). Unlike the models in Sections 3.5.1,3.5.2; GBT is computationally intensive and needs to be done for every iteration of Algorithm 1. Unlike the optimization based algorithms for regression as described above, GBT does not naturally accommodate warmstarts across iterations, and/or tuning parameters.
4 Experiments
Here, we evaluate the effectiveness of our prediction model with respect to several baselines and state of the art session length prediction solutions. We proceed by describing our datasets, our evaluation framework, the comparisons, and then present the results.
4.1 Datasets
We used two different real world datasets of users listening to music, namely, PMusic and lastfm. PMusic is a sample of user interaction data from a major music streaming service in United States, and lastfm is a publicly available dataset from last.fm [5]. We defined the user sessions as periods of continuous listening, interrupted if the user stop or pause the music for more than minutes [21]. For PMusic we gathered data from a small subset of PMusic users for a period of months (FebruaryMay ) resulting in sessions ^{7}^{7}7Due to confidentiality we can not report the number of users for this dataset.. lastfm public dataset was gathered between to and it contains sessions for different users. Table 1 reports some statistics about the user session length in the two datasets. For the log values, we first take the log transform of the raw data, as mentioned in the modeling part, and then normalize. An interesting finding is that mean and median are quite different for the raw data in both datasets. In fact, as reported in [21], Weibull distributions give a better fit to user session lengths, while after a logtransformation, the data can be reasonably modeled via normal distributions, which is what our modeling framework requires for tractable inference.
Stats  PMusic  lastfm 

25th quantile(raw) 

median(raw)  
mean(raw)  
75th quantile(raw)  
25th quantile(log)  
median(log)  
mean(log)  
75th quantile(log) 
Feature Engineering. For all the sessions in PMusic we create two kinds of features, namely, userbased and contextual as in [21]. Table 4 reports some of the covariates used in our models. As userbased features we consider "gender (the gender of the user), age (the age of the user), subscription_status (whether the user is adsupported)", these features are fixed for a given user. As contextual features we consider "device (the device used for the session), network (the type of network used for the session), absence_time (time elapsed since the user’s previous session), previous_duration (the duration of the user’s previous session)".
We refine this set of features to include additional contextual features to [21], this is mainly to lower the variance of the past sessions, and introduce nonlinearity. We consider as additional features "avg_user_duration( average user session length in training set), log_avg_user_duration (logarithm of avg_user_duration), log_absence_time (logarithm of absence_time), log_previous_duration (logarithm of previous_duration), session_time (whether the user session started in morning or afternoon)". For lastfm dataset the "age, subscription_status, device, network are missed.
Feature  Description 

gender  gender of the user 
age  age of the user 
subscription_status  whether the user is adsupported 
device  device used for the session 
network  type of network used for the session 
absence_time  time elapsed since the previous session 
previous_duration  duration of the previous session 
avg_user_duration  average user session length (training) 
session_time  session started in morning or afternoon 
4.2 Evaluation
We sort our dataset by chronological order, use the first for the training set, for the validation set, and the rest for the test set. Additionally we require each user in the validation or test set to appear at least once in the training set. The final datasets for PMusic and lastfm have respectively in total , and
sessions. For the models that need parameter tuning, we first train the models on the training set for each set of the parameters. Then we use the validation set to pick the best set of parameters. Finally, we use that set of parameters for training on the combined set of training and validation, and predict on the test set. For the evaluation metric of our session length prediction model, we use
Normalized Mean Absolute Error measured in seconds, averaged over all the test sessions and normalized by the Baseline model which by our definition has MAE . MAE is a good metric due the possibility of important errors resulting from very large session length. More formally, let be the number of sessions in the test set and be the time spent by user on his th session, where is a test session of user , and be the predicted value then:
4.3 Comparisons
We compare our model with several baselines and state of the art methods. In particular we have considered the following:

Baseline. The baseline model is the peruser mean session length, i.e., we compute for each user the mean session length in the training set and use the value as a prediction value for all the test sessions of the same user.

SIGIR2017. This is method in [21] that is using a modified version of boosting algorithm. Our tuned models have a number of trees in , with depth and use a learning rate in .
Baseline can be interpreted as a natural baseline and SIGIR2017 is the state of the art in this particular application. Among the models proposed in this paper (cf Section 3), we consider the following in the experiments:
Similar to the Baseline model, Model1 does not consider covariates. Model1 however, performs shrinkage on the userspecific effects, and thus any gain in predictive accuracy (as evidenced in Table 3) is due to shrinkage. Ridge considers only covariates and does not include the residual userspecific effects. The rest of the models use features regarding the context and user, and additional userspecific effects. All versions of Model3 allow us to build models that are less sensitive to outliers, hence any performance boost in MAE over its Model2counterpart can be attributed to robustness. As described in Section 5 we do not have censored data, and we are interested in making point predictions on user sessionlengths, therefore survival analysis models are not suitable for our scenario — hence they are not included in our comparisons.
4.4 Effectiveness
We report the results regarding the effectiveness of our model. Table 3 reports the results of the Normalized MAE on all the models in Section 4.3. By borrowing strength across users, Model1 improves over the Baseline even without using any covariateinformation. SIGIR2017, the model presented in [21] is benefiting from the usage of the covariates, and it is clearly better than Model1. XGBoost which relies (solely on) covariates, has poor performance on both the datasets. Model2GBT by combining hierarchical shrinkage with flexible modeling of covariates, reaches a significantly lower MAE than SIGIR2017. This observation shows the importance of the user effect in our hierarchical modeling framework. Model2L2 is performing quite well in all the datasets and only considers hyperparameters. We did not observe any gain in MAE by using an penalization, though the models were sparse (in ) when compared to regularization. Model3GBT has the lowest MAE for all the datasets — thereby suggesting the usefulness of using a robust model for training purposes. For the PMusic dataset, the robustification strategy leads to good improvements: Model3L2 has MAE whereas, its nonrobust counterpart Model2L2 has MAE . Further improvements are possible by using nonparametric modeling of the covariates. Overall, our Model3GBT seems to be the best in terms of prediction in both datasets. Model2L1 or Model2L2 are close to SIGIR2017 for PMusic and better for lastfm and as we see in Table 6 they are actually much faster in training time.
Models  MAE PMusic  MAE lastfm 

Baseline no covariates  
XGBoost  
SIGIR2017  
Model1 no covariates  
Ridge  
Model2L1  
Model2L2  
Model3L2  
Model2GBT  
Model3GBT  0.871  0.811 
Feature Importance. By centering and normalizing the columns of the matrix of covariates , the absolute values of the coefficients of for Ridge or Model3L2 suggest the relative importances of the features. Table 4 reports the highest absolute values of the coefficients for the Ridge estimator and for Model3L2 estimator for PMusic dataset.
Ridge  log_avg_user_duration  0.516 
absence_time  
avg_user_duration  
device=smartphone  
device=Web  
Model3L2  device=smartphone  
device=Web  
avg_user_duration  
absence_time  
log_avg_user_duration 
Device and timerelated features appear as the most relevant features. The two most important features for Ridge correspond to logarithm of the average user session length in training set and the absence time since last session. In addition, considering user effect on Model3L2 lowers the magnitude of timerelated features, even though they still appear in the top ones.
Performance Breakdown by Sessions per User. We perform a break down of users into three different types by quantiles of number of sessions per user in the training set. Table 5 reports the normalized MAE for these three groups. The results show a monotonic decrease in terms of gain in performance (with respect to the Baseline) for all the models with number of sessions per user. This serves as a validation of an important message presented through our modeling framework – shrinkage is more critical for less active users when compared to the Baseline. In fact, even for the more frequent users, we observe that shrinkage helps when compared to the Baseline (but the gains are less pronounced). Both Model2GBT/Model3GBT outperform the state of art SIGIR2017 for all three cutoffs chosen. The gains obtained by Model3GBT over Model2GBT for PMusic dataset can be primarily attributed to the robustness to outliers. Model2GBT/Model3GBT perform better than SIGIR2017 by modeling the userspecific effects – this gain seems to be most prominent for heavier users for both PMusic and lastfm datasets.
Model  

PMusic  
Model1  
Ridge  
SIGIR2017  
Model2GBT  
Model3GBT  0.767  0.792  0.875 
lastfm  
Model1  
Ridge  0.606  
SIGIR2017  
Model2GBT  0.606  0.775  
Model3GBT  0.775  0.816 
Normalized MAE, restricted to people in the first decile, the first two deciles or the last
th deciles of the training set. (Hierarchical) shrinkage has a more prominent effect over the Baseline model (with MAE=1) for users with fewer sessions.4.5 Efficiency
We further investigate the efficiency of our model by looking at the training time and time of prediction for our best models and state of art solution. We implemented everything using python ^{9}^{9}9Our code will be released, we are removing the link to github due to double blind process.. We run the experiments in a MacBook Pro with GHz Intel Core i with 8 GB of RAM. For each model we fixed a set of parameters to tune. We then run each model sequentially, each time with different fixed parameters. Table 6 reports the average running time (across the number of runs done for tuning purposes) and the time to predict all the test instances for our best models and SIGIR2017. The most important thing to note here is that Model3L2 can be trained between to times faster than any tree based method such as SIGIR2017, and it has an MAE that is % to % better than SIGIR2017. This is mainly because the Model3L2 (and Model2L2) computations benefit from warmstarts (see Section 3.4), so the increase in number of tuning parameters do not increase the run time as in the case of SIGIR2017 and Model3GBT that are based on Gradient Boosting Trees. They are also at least to times faster in prediction time, and therefore they can be used for realtime prediction with time constraints. While Model3GBT is slow in training time, it shows better MAE performance compared to other methods. However, we note that these models are meant for offline training in the context of the application herein–while it is important to have algorithms that are fast, they are not of critical importance as in tasks where realtime learning is of foremost importance.
Models  MAE  Train. Time  Pred. Time 

PMusic  
SIGIR2017  
Model2L2  
Model3L2  
Model3GBT  
lastfm  
SIGIR2017  
Model2L2  
Model3L2  
Model3GBT 
5 Related work
Session length is an important metric serving as a proxy for user engagement. Therefore the solutions and evaluation are tailored to similar duration based engagement metric such as dwelltime prediction.
Dwelltime. Liu et al. in [16] presented one of the first studies on dwelltime for web search. Kim et al. in [13] proposed a dwelltime based user satisfaction prediction model in the web search context. Lalmas et al. in [15] proposed new way to improve ad ranking. Barbieri et al. in [2] propose to use survival forest [11] using landing page and user feature in the ad context to estimate the dwelltime and incorporate it in the ad ranking system as a quality score. Vasiloudis et al. in [21] presented recently a first session length prediction model in the music streaming context using survival analysis and gradient boosting [6]
. In that work, the authors show that in the case of media streaming services the probability of a session to end evolves differently for different users. In particular
of the users exhibit “negativeaging” length distributions, i.e. sessions that become less likely to end as they grow longer. Although not directly comparable, we report that this percentage is going to for dwelltime on a web page after a search, i.e., after clicking on a search result the more you stay the more you are likely to stay on the clicked page. Finally, Jing et al. in [12]presented a neural network based model combined with survival analysis for recommendation purpose and absence time prediction at the same time. In our work we compared against the most recent related work done by Vasiloudis
et al. [21]. It is worth mentioning that while dwelltime can benefit from survival analysis, because a user can click on a search engine result and never turn back, in our case we don’t have censored data, therefore our is a regression problem.Empirical Bayes and MAP estimation:
The statistical models proposed in this paper are inspired by Empirical Bayes methods that are wellknown in statistics community, dating back to
[19, 8]. In theory, when the model is true, these estimators are known to lead to estimators with better prediction accuracy when compared to (unregularized) maximum likelihood estimators. Empirical Bayes estimators offer an appealing tradeoff between frequentist and Bayesian modeling [7]— in that expensive MCMC computations may be avoided by a clever combination of guided hyperparameter tuning and numerical optimization (to obtain MAP estimates for example). We also consider more flexible models wherein MAP (maximum a posteriori) estimation becomes a pragmatic choice from a computational viewpoint. To our knowledge, this is the first time such a methodology is used in the context of user session length prediction.
6 Conclusions and future work
In this paper, we presented a new hierarchical modeling framework, inspired by core Bayesian modeling principles to predict the amount of time people will spent in a streaming service, and in particular listening to streaming music. We also propose modern convex optimization algorithms for enhanced computational efficiency and tractable inference. Our family of flexible models is meant to provide a practitioner insights regarding the incremental gains in predictive accuracy with enhanced modeling components. We focused on predicting the amount of time a user might spend on a platform, at the beginning of the user session.
In our experimental section, we have shown that our method is performing better than the state of the art in this context. Furthermore, we have shown that our model is better for heavy users as well as for users with few sessions. Due to the flexibility of our models, we can achieve lower prediction error for users with many sessions. We can also flexibly incorporate the choice of loss functions that are more robust to outliers in the data. Our results show that our models can lead to an improvement of up to % in MAE on reallife data. Some of our models can be to times faster (with a to % improvement in MAE) in training time; and to times faster in prediction time.
So far we have always investigated a scenario where the model is learned offline, and it tries to predict the user session length with only few static and offline extractable features. In the future we aim to extend our model to an online version. Furthermore we want to investigate the session length prediction utility within advertising or recommender systems context.
References
 [1]
 Barbieri et al. [2016] Nicola Barbieri, Fabrizio Silvestri, and Mounia Lalmas. 2016. Improving PostClick User Engagement on Native Ads via Survival Analysis. In Proceedings of the 25th International Conference on World Wide Web (WWW ’16). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 761–770. https://doi.org/10.1145/2872427.2883092
 Beck and Teboulle [2009] Amir Beck and Marc Teboulle. 2009. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM journal on imaging sciences 2, 1 (2009), 183–202.
 Borisov et al. [2016] Alexey Borisov, Ilya Markov, Maarten de Rijke, and Pavel Serdyukov. 2016. A Contextaware Time Model for Web Search. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’16). ACM, New York, NY, USA, 205–214. https://doi.org/10.1145/2911451.2911504
 Celma [2010] O. Celma. 2010. Music Recommendation and Discovery in the Long Tail. Springer.
 Chen and Guestrin [2016] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
 Efron [2012] Bradley Efron. 2012. Largescale inference: empirical Bayes methods for estimation, testing, and prediction. Vol. 1. Cambridge University Press.
 Efron and Morris [1972] B Efron and C Morris. 1972. Limiting the risk of Bayes and empirical Bayes estimators. II. The empirial Bayes case. J. Amer. Statist. Assoc. 67 (1972), 130–139.
 Friedman et al. [2001] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The elements of statistical learning. Vol. 1. Springer series in statistics New York.
 Huber [2011] Peter J Huber. 2011. Robust statistics. In International Encyclopedia of Statistical Science. Springer, 1248–1251.
 Ishwaran et al. [2008] Hemant Ishwaran, Udaya B. Kogalur, Eugene H. Blackstone, and Michael S. Lauer. 2008. Random Survival Forests. The Annals of Applied Statistics 2, 3 (2008), 841–860. http://www.jstor.org/stable/30245111
 Jing and Smola [2017] How Jing and Alexander J. Smola. 2017. Neural Survival Recommender. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM ’17). ACM, New York, NY, USA, 515–524. https://doi.org/10.1145/3018661.3018719
 Kim et al. [2014] Youngho Kim, Ahmed Hassan, Ryen W. White, and Imed Zitouni. 2014. Modeling Dwell Time to Predict Clicklevel Satisfaction. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM ’14). ACM, New York, NY, USA, 193–202. https://doi.org/10.1145/2556195.2556220
 Lagun and Lalmas [2016] Dmitry Lagun and Mounia Lalmas. 2016. Understanding User Attention and Engagement in Online News Reading. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM ’16). ACM, New York, NY, USA, 113–122. https://doi.org/10.1145/2835776.2835833
 Lalmas et al. [2015] Mounia Lalmas, Janette Lehmann, Guy Shaked, Fabrizio Silvestri, and Gabriele Tolomei. 2015. Promoting Positive PostClick Experience for InStream Yahoo Gemini Users. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15). ACM, New York, NY, USA, 1929–1938. https://doi.org/10.1145/2783258.2788581
 Liu et al. [2010] Chao Liu, Ryen W. White, and Susan Dumais. 2010. Understanding Web Browsing Behaviors Through Weibull Analysis of Dwell Time. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’10). ACM, New York, NY, USA, 379–386. https://doi.org/10.1145/1835449.1835513
 Nesterov [2013] Yu Nesterov. 2013. Gradient methods for minimizing composite functions. Mathematical Programming 140, 1 (2013), 125–161.

Pedregosa et al. [2011]
F. Pedregosa, G.
Varoquaux, A. Gramfort, V. Michel,
B. Thirion, O. Grisel,
M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, J.
Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and
E. Duchesnay. 2011.
Scikitlearn: Machine Learning in Python.
Journal of Machine Learning Research 12 (2011), 2825–2830.  Robbins [1955] Herbert Robbins. 19541955. An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. I. UC Press, 157–163.
 Tibshirani [1996] Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996), 267–288.
 Vasiloudis et al. [2017] Theodore Vasiloudis, Hossein Vahabi, Ross Kravitz, and Valery Rashkov. 2017. Predicting Session Length in Media Streaming. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17). ACM, New York, NY, USA, 977–980. https://doi.org/10.1145/3077136.3080695
 Wright [2015] Stephen J Wright. 2015. Coordinate descent algorithms. Mathematical Programming 151, 1 (2015), 3–34.