Retrospective Higher-Order Markov Processes for User Trails

04/20/2017 ∙ by Tao Wu, et al. ∙ Purdue University 0

Users form information trails as they browse the web, checkin with a geolocation, rate items, or consume media. A common problem is to predict what a user might do next for the purposes of guidance, recommendation, or prefetching. First-order and higher-order Markov chains have been widely used methods to study such sequences of data. First-order Markov chains are easy to estimate, but lack accuracy when history matters. Higher-order Markov chains, in contrast, have too many parameters and suffer from overfitting the training data. Fitting these parameters with regularization and smoothing only offers mild improvements. In this paper we propose the retrospective higher-order Markov process (RHOMP) as a low-parameter model for such sequences. This model is a special case of a higher-order Markov chain where the transitions depend retrospectively on a single history state instead of an arbitrary combination of history states. There are two immediate computational advantages: the number of parameters is linear in the order of the Markov chain and the model can be fit to large state spaces. Furthermore, by providing a specific structure to the higher-order chain, RHOMPs improve the model accuracy by efficiently utilizing history states without risks of overfitting the data. We demonstrate how to estimate a RHOMP from data and we demonstrate the effectiveness of our method on various real application datasets spanning geolocation data, review sequences, and business locations. The RHOMP model uniformly outperforms higher-order Markov chains, Kneser-Ney regularization, and tensor factorizations in terms of prediction accuracy.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

User trails record sequences of activities when individuals interact with the Internet and the world. Such data come from various applications when users write a product review (mcauley2013amateurs, ), checkin at a physical location (cho2011friendship, ; yang2013fine, ), visit a webpage, or listen to a song (celma2009music, ). Understanding the properties and predictability of these data helps improve many downstream applications including overall user experiences, recommendations, and advertising (figueiredo2016tribeflow, ; awad2012prediction, ). We study the prediction problem and our goal is to estimate a model to describe and predict a set of user trails.

Markov chains are one of the most commonly studied models for this type of data. For these models, each checkin place, website, or song is a state. Users transition among these states following Markov rules. In a first-order Markov model, the transition behavior to the next state of the sequence only depends on the current state. Higher-order Markov models include a more-realistic dependence on a larger number of previous states, and multiple recent studies found that first-order Markov chains do not fully capture the user behaviors in web browsing, transportation and communication networks 

(rosvall2014memory, ; chierichetti2012web, ). Furthermore ignoring the effects of second-order Markov dynamics has significant negative consequences for downstream applications including community detection, ranking, and information spreading (rosvall2014memory, ; Benson-2016-motif-spectral, ).

The downside to higher-order Markov models is that the number of parameters grows exponentially with the order. (If there are states and we model steps of history, there are parameters.) So, even if we could accurately learn the parameters, it is already challenging to even store them. (Some practical techniques include low-rank and sparse approximations, but these pose their own problems.) Second, since the number of model parameters grows rapidly, the amount of training data required also grows exponentially with the order  (chierichetti2012web, ). Acquiring such huge amounts of training data is usually impossible. Lastly, determining the amount of history to use itself is hard (peres2005two, ), and selecting a large value of could severely overfit the data, thus making the learned model less reliable.

Strategies to resolve the above issues of higher-order Markov chains include variable order Markov chain (buhlmann1999variable, ) where the order length is a variable that can have different values for different states. There is a fitting algorithm that can automatically determine an appropriate order for each state, however it requires substantial computation time (ron1994learning, ) which restricts it to applications with only a small number of states (borges2007evaluating, ; deshpande2004selective, ; chierichetti2012web, ). Smoothing and regularization methods (chen1996empirical, ) like Kneser-Ney smoothing and Witten-Bell smoothing are additional approaches to make the higher-order Markov chain more robust. These methods are widely applied in language models for predicting unseen transitions. We will compare against the behavior of the Kneser-Ney smoothing in our experiments and show that our method has a number of advantages.

In this paper we propose the retrospective higher-order Markov process (RHOMP) as a simplified, special case of a higher-order Markov chain (Section 3). In this type of Markov model, a user retrospectively choses a state from the past steps of history, and then transitions as a first-order chain conditional on that state from history. This assumption helps to restrict the total number of parameters and protect the model from overfitting the correlations between history states. Specifically, this model corresponds to choosing

different first order Markov chain transition matrices, one for each step of history, as well as an associated probability distribution. Consequently, the number of parameters grows linearly with the size of history while preserving the higher-order nature. We also show there are important connections between our model and the class of pairwise-interaction tensor factorization models proposed by Rendle et al. 

(rendle2010pairwise, ; rendle2010factorizing, ) (Section 3.2).

We design an algorithm to select an optimal model from training data via maximum likelihood estimation (MLE). For the second-order case with two steps of history, this yields a constrained convex optimization problem with a single hyperparameter

. We derive a projected gradient descent (duchi2008efficient, ) algorithm to solve it. It requires only a few iterations to converge and each iteration is linear in the training data. We select the hyperparameter by fitting a polynomial to the likelihood function as a function of the parameter and select the global minimum. Thus, our RHOMP process does not require any parameter tuning and is scalable to applications with tens of thousands of states. In addition, both the process of updating the gradients and model parameters parallelize over the training data.

We evaluate the effectiveness of RHOMP models in experiments  111Code and data for this paper are available at: with real datasets including product reviews, online music streaming, photo locations, and checkin business types (Section 5.1

). We primarily compare algorithms in terms of their ability to predict information from testing data and use precision and mean reciprocal rank as the two main evaluation metrics. These experiments and results show that the RHOMP model achieves superior prediction results in all datasets (Section 

5.2) compared with first and second order chains. For even higher-order chains, RHOMP shows stable performance with one exception (Section 5.4) where the data only has short sequences.

Remark. Recently Kumar et. (kumar2017linear, ) proposed the Linear Additive Markov Process (LAMP) that is closely related to our framework. Specifically our RHOMP model has the same formulation as the generazlied extention GLAMP from the paper (kumar2017linear, ). We learned about this paper as we were finalizing our submission to arXiv. The papers share a number of related technical results about the models and we discovered the related work (markov1971extension, ; zhang2015spatiotemporal, ; melnyk2006memory, ; usatenko2009random, ) based on their manuscript. The main difference is that in this paper we focus on the general form that allows to learn different Markov chains for each step of history. In addition we connect the RHOMP model with a particular tensor factorization to a higher-order Markov chain.

2. Preliminaries

We begin by formally reviewing the problem of user trail prediction. Then we will review relevant background on Markov chain models.

2.1. Problem Formulation

We denote a user trail as a sequence over a discrete state space with each element . Here is the total number of states. The sequence can represent, for instance, a user’s music listening history with each state denoting a song/artist, or a user’s checkin history from social network with each state denoting a location. Given a specific user trail up to time : with , the task is to predict the next state at time based on a large set of user trails for training: , where each is an individual trail.

2.2. Markov Chain Methods

An th order Markov chain is defined as a stochastic process on the state space: with the property that the next transition only depends on the last steps. Formally,

An -order transition tensor with size characterizes the above Markov chain, with denoting the probability of transitioning to state given the current history states . The model with is called the first-order Markov chain and similarly it can be described by an transition matrix .

In order to use a Markov chain for the prediction problem, we need to estimate the transition matrix . Given a set of users trails , the maximum likelihood estimator (MLE) of the probability for a first order chain is given by (chierichetti2012web, ):

where denotes the number of instances that the states and were consecutive in all trails. For the case of higher-order Markov chain, it is well-known that any higher-order () Markov chain is equivalent to a first-order Markov chain by taking a Cartesian product of its state space. This simplifies the parameter estimations and we may replace the original states with the Cartesian product states:

where now counts the number of instances of the sequence in the training data.

Returning to the prediction task itself, Markov chain methods take as input the history states of a trail and lookup the probabilities for all future states in the matrix or tensor . This becomes a ranked list of states with the highest probability on top.

3. Retrospective Higher-Order Markov Processes

The goal of the retrospective higher-order Markov process (RHOMP) is to strike a balance between the simplicity of the first order Markov model and the high-parameter complexity of the higher-order Markov model. Nevertheless, it is important for the model to account for higher-order behaviors because these are necessary to capture many types of user behaviors (rosvall2014memory, ; chierichetti2012web, ). Towards that end, the RHOMP model describes a structured higher-order Markov chain that results in a compact low-parameter description of possible user behaviors. We describe this formally for the case of a second-order history (and discuss largely notational extensions to higher-order chains in Section 3.4).

3.1. The Retrospective Process

The specific structure that a RHOMP describes is a retrospectively first-order Markov property. For some intuition, suppose that a web surfer had visited a search-query result page and then clicked the first link. In the RHOMP model, the user will first determine if they are going to continue browsing from the search-result page or the first link—hence users have the power to retrospect over history. Once that decision has been made, the user will behave in a first-order Markovian fashion that depends on if the user returned to the previous state or remained on the current state. Formally, suppose that the chain has recently visited states and . The RHOMP is a two-stage process that first selects a single history state. Since there are only two states, we model this selection as a weighted coin-toss where the probability of picking is and so picking happens with probability . Once we have the history state, then the RHOMP transitions according to a transition matrix that is specific to that step of the history. Thus

where models the transitions from the current state (when those are selected) and models the transitions from the previous state (when those are selected). See Figure 1 for illustration. We summarize this in the following definition:

Definition 0 ()

Given and two stochastic matrices , a second-order retrospective higher-order Markov process will transition from state with history state as follows: (i) with probability it transitions according to with the current state , and (ii) with probability it transitions according to with the previous state .

Figure 1. An illustration of Markov chain methods and our proposed RHOMP model.

This model has a number of useful features. For instance, it is easy to compute the stationary distribution as the following theorem shows.

Theorem 2 ()

Let be a second-order RHOMP model. Consider the stationary distribution in terms of the long-term fraction of time the process spends in a state:

Such a distribution always exists. Moreover, it is unique if is an irreducible matrix.

Proof 0 ()

Because the RHOMP is a special case of a second-order chain, we can use the relationship with the first-order chain on the Cartesian product space to establish that a distribution always exists. This follows because the long-term distribution of a first-order, finite-state space Markov chain always exists (though there could be multiple such distributions) (taylor2014introduction, ). Let for all be any limiting distribution of the product state space, and be either of the corresponding marginal distribution such that or . Note that both of these marginals result in the same distribution because we use the long time average to define . Then we have:

where is defined as . So the limiting distribution follows , and it is unique if the corresponding Markov chain is irreducible.

In Section 3.3, we show how to compute a maximum likelihood estimate of and from data.

3.2. A Tensor Factorization Perspective

We originally derived this type of RHOMP via a tensor factorization approach, but then realized that the retrospective interpretation is more direct and helpful. Nevertheless, we believe there are fruitful connections established by the tensor factorization approach. Consider the transition tensor of a second-order Markov chain: is a -mode, , non-negative tensor such that


This imposes a set of equality constraints. If we wanted to use traditional low-rank tensor approximations such as PARAFAC or Tucker (kolda2009tensor, ) to study large datasets, then we would need to add a large number of constraints to the fitting algorithms in order to ensure that the factorization results in a stochastic tensor that we could use for a second order Markov chain. This approach was extremely challenging.

Instead, consider a pairwise interaction tensor factorization (PITF) as proposed by Rendle et al. (rendle2010pairwise, ) with the following form:


where matrices . We notice that last term in (2) is the interaction between the current state and the previous state , and it contributes only a constant determined by the pair . In the applications of prediction, we can drop this term because it does not affect the relative ranking for the future state . So the factorization model becomes:


with .

To see the relationship with our RHOMPs, denote and with . Then the result of a PITF factorization with stochastic constraints is:


It is easy to verify that if both and are stochastic matrices, then the corresponding tensor is a transition tensor following (1). The following theorem shows that from any nonnegative and , we can construct such stochastic matrices.

Theorem 3 ()

Assuming there exist nonnegative matrices and such that the transition tensor can be decomposed in the form of (4), then there exist and stochastic matrices such that .

Proof 0 ()

Denote and for all . Because for all , we have , and . If then the original matrices and are stochastic. Otherwise we can set

where and are stochastic. Then we have

So forms a valid factorization for , the bound on follows from from (4).

Consequently, the RHOMP form also arises from the PITF approach when constrained to model stochastic tensors.

3.3. Parameter Optimization

In this section we will apply the principle of maximum likelihood to estimate the model parameters of a RHOMP (i.e., ) directly from data. An alternative would be to estimate the higher-order Markov chain and use the PITF factorization as discussed in the previous section. Working directly on the RHOMP model from data has two advantages: first, the estimate corresponds exactly with the model, rather than estimate and approximate; and second, the direct approach is faster.

We first show how to compute a maximum likelihood estimate with fixed and then discuss how to pick . Recall that is the total count of transitions moving from to with previous state in the training data. With fixed , the log likelihood of all transitions from the set of user trails is:


Our goal is to find a pair of stochastic matrices which minimizes the negative log likelihood, which gives us the following optimization problem:


This optimization problem is convex as the following theorem shows.

Theorem 4 ()

The negation of the log likelihood function in (5) is convex and so is the feasible region of pairs of stochastic matrices. Thus any local minima solution is also the solution for global mimima.

Proof 0 ()

First we verify the feasible domain of stochastic pairs is convex. We can check that given and two stochastic matrices , the linear combination

is also a stochastic matrix. This applies element-wise to the pair to verify the claim.

Now given two sets of stochastic matrices and and the corresponding linear combination we have

So (6) is a convex problem.

We now derive the projected gradient descent algorithm for (6), which is summarized in Algorithm 1. This involves

  1. [leftmargin=0.3in]

  2. First update and based on their gradients.

  3. Since and are no longer stochastic due to the above updates, the projection step is applied to project the updated and back to (i.e., the stochastic property).

The gradients over and are:


We accomplish the projection step using the algorithm from (duchi2008efficient, )

. Note that for the sake of simplicity we present the projection step by sorting the vector

, but there is a more efficient method based on divide and concur (duchi2008efficient, ) which is linear cost to the number non-zeros in . However in practice sorting is fast as the vector is very sparse.

Overall each iteration takes linear time in the number of unique triples in the sequence data. This is upper bounded by the size of input data. We also note that the procedure of computing the gradients and updating , which dominates the majority of the computation, can be paralleled.

0:  parameter , step size and transition counts
1:  Initialize with , with and
2:  repeat
3:     Compute the gradient matrices based on  (7)
4:      and
5:     for each column vector of and  do
6:        Sort the non-zeros of into :
7:        Find
8:        Define
9:        Update with
10:     end for
11:     if objective value decreases then
13:     else
14:        ; re-run this iteration with updated
15:     end if
16:  until converge
Algorithm 1 Max. Likelihood Estimate of a 2nd-order RHOMP

Choosing . To determine the value of hyperparameter , we conduct a few trials with chosen between . Then based on the value of the objective function, we calculate the best value of

from a polynomial interpolation of the likelihood function. Specifically

is selected as Chebyshev nodes . Getting the global minimum of a polynomial interpolant can be done efficiently, and polynomials can approximate arbitrary continuous functions, which renders this a pragmatic choice. Another approach for selecting the value of is to conduct cross validation with grid search. However a different objective is needed as we could run into unseen transitions in the validation set and the likelihood would go to . Alternatively we can use a measurement like precision instead of likelihood. The main advantage of cross validation is its ability to prevent overfitting. In our experiment we find this problem does not occur, so we drop this procedure as it requires substantially more computation.

3.4. Higher-order Cases Beyond Second Order

The ideas discussed in the above sections also work for the higher-order cases with . The RHOMP model becomes:

where for , and matrices for are stochastic. Similarly the log likelihood function can be derived as well as the gradient over each . The projected gradient descent algorithm is then applied to update each stochastic matrix , with a per-iteration complexity bounded by the size of the training data.

The biggest difference is that we are no longer able to determine the hyperparameters in a simple fashion as the polynomial interpolation is only computationally efficient for one or two parameters. To address this issue, recall that in Section 3.1 we proposed the model as a retrospective walk, where the walker has probability to step back steps into their history and then transition according to . Our proposal is to use a single hyperparameter to model a decaying probability of looking back into the history:

(This distribution describes a truncated geometric random variable.) In our experiments for the second-order case the optimal

for every dataset. This offers a single step of evidence for this assumption. This can be chosen either by the procedure of polynomial interpolation or simply using the optimal value from a second-order factorization model . We apply the latter approach in our experiments for RHOMP with .

4. Related Work

Modeling User Trails. Early work in (pirolli1999distributions, )

characterized the user path patterns on the web with the tools of Markov chains. Other advanced methods include hidden Markov models (HMM) 

(eddy1996hidden, ), variable length Markov chains (buhlmann1999variable, ) and association rules (awad2012prediction, ). However the computations associated with the above methods limit them from being used in datasets with more than a few thousand states. More recent work considers the sequence prediction task with personalization, such as collaborative filtering methods (salakhutdinov2008bayesian, ; shi2014collaborative, ) where the behavior of similar users is utilized to help the prediction, factorizing personalized Markov chains (rendle2010factorizing, ), and TribeFlow (figueiredo2016tribeflow, ). Other than the prediction problem, clustering and visualization (cadez2003model, ), sequence classification (xing2010brief, ), metric embedding (chen2012playlist, ; chen2013multi, ) and hypotheses comparison (singer2015hyptrails, ) have also been studied. In the context of this work, we seek to improve the performance of the classic and simple Markov model by studying a structured variation.

Random Walk Models. Since our model is a special case of a higher-order Markov chain, we note that there are relationships with a variety of enhanced Markov models. First our RHOMP model defines a specific form of the Additive Markov Process (AMP) (markov1971extension, ), where the transition probability is a summation of a series of memory functions that are restricted on the next state and one history state each. Applications of the AMP include LAMP (kumar2017linear, ) (see Section 1), the gravity models (zhang2015spatiotemporal, ), and some dynamical systems in physics (melnyk2006memory, ; usatenko2009random, ) where the memory function is empirically estimated for the application of binary state. In addition to the AMP, recent innovations include new recovery results on mixture of Markov chains (gupta2016mixtures, ) (a special case of HMM), which assumes a small set of Markov chains that model various classes of latent indent; and the spacey random walk (benson2016spacey, ; benson2015tensor, ; wu2016general, ) as a non-Markovian stochastic process that utilizes higher-order information based on the empirical occupation of states.

Tensor Factorization. As already discussed, our work is directly related to the pairwise interaction tensor factorization (PITF) method proposed by Rendle in (rendle2010pairwise, ; rendle2010factorizing, ), where the task is to generate tag recommendations given the combination. The PITF model is learned from a binary tensor of triple by bootstrap sampling from pairwise ranking constrains. Our work differs in the aspect of problem formulation, model construction and parameter optimization. The RHOMP model is also a special case of both the canonical/PARAFAC and Tucker decompositions (kolda2009tensor, ).

5. Experiments

# states # transitions # trails
LastFM 17,341 2,902,035 195,499
BeerAdvocate 2,324 1,348,903 35,629
BrightKite 11,465 400,340 125,437
Flickr 7,608 1,212,674 97,563
FourSQ 344 198,503 1,480
Table 1. Dataset characteristics in terms of the number of states, transitions and trails

We evaluate our RHOMP method on the ability to predict subsequent states in a user trail in terms of precision and mean reciprocal rank (MRR) on five different types of data (Section 5.1). We then present the results of a second-order (i.e., ) RHOMP compared with baseline methods in Section 5.2 and study over-fitting of the training data in Section 5.3. Then we study what happens for higher-order (i.e., ) models in Section 5.4. In all cases, the RHOMP model offers a considerable improvement to existing methods.

MC1 MC2 Kneser1 Kneser2 PITF LME RHOMP
Table 2. Mean Reciprocal Rank (MRR) results of various methods on all datasets. Bold indicates the best mean performance, and

entries are the standard deviations over 5 trials. Our proposed RHOMP (

) has the best performance in all datasets.
Figure 2. Relative precision results on all datasets with . We use Kneser1 as the baseline, and the relative precision is calculated as the precision ratio to that of Kneser1. The error bars in the figure are the standard deviations over 5 trials. The numbers in the bottom and the top of the figures denote the absolute precisions for the Kneser1 and our RHOMP method respectively. We see that our RHOMP has noticeable improvements over other methods in most datasets.

5.1. Datasets and Evaluations Setup

The real datasets we use in our experiments cover several applications including: product reviews, online music streaming, checkin locations of social network and photo uploads. Every dataset is publicly available. For all the datasets self-loops are removed as we are mostly interested in predicting a non-trivial transition. Also we only consider states that show up more than times. Simple statistics on each dataset are summarized in Table 1, and we now describe them individually.

LastFM (celma2009music, ) is a music streaming and recommendation website ( We generate user trails as listening histories regarding different artists over a one-year period (2008-05-01 to 2009-05-01).

BeerAdvocate (mcauley2013amateurs, ) consists of beer reviews spanning more than 10 years up to November 2011 from We study the user trail as reviews over different brewers.

BrightKite (cho2011friendship, ) was a location-based social networking website where users shared their locations by checking-in. We study the trails of location id.

Flickr (thomee2015new, ) contains 100 million Flickr photos/videos provided by Yahoo! Webscope. We extract the user trail based on geolocation (restricted to USA) of each upload after 2008-01-01. Each longitude and latitude is mapped into a grid of approximate 10km by 10km, which constitutes the state.

FourSQ is a location based check-in dataset created by Yang et al (yang2013fine, ) which contains checkins from New York City from 24 October 2011 to 20 February 2012. We extract checkin place category (e.g., bus station, hotel, bank) as state.

For experimental methods, we consider the following:

MC1, MC2 are the first-order and second-order Markov chain methods respectively, where the transition matrix is estimated based on maximum likelihood.

Kneser1, Kneser2 are the interpolated Kneser-Ney smoothing methods (chen1996empirical, )

applied on the first-order and second-order Markov chain methods respectively. This is one of the best smoothing methods for n-gram language models, where it enables higher-order Markov chain transitions to unseen n-grams. We set the discounting parameter as

by the method of leaving one out (chen1996empirical, ), where and denote the number of n-grams that appear exactly once and twice respectively

PITF is the pairwise interaction tensor factorization method (rendle2010pairwise, ) computed on the higher-order Markov chain estimate. Because we use ranking, we consider general positive and negative entries as valid for the factorization. We implement the fitting method ourselves to handle the sparsity in our data. As suggested in the paper (rendle2010pairwise, ), the hyperparameters are and with initialization from . We set the rank number as of the total number of states, which is enough to accurately capture the user behavior (rendle2010pairwise, )

. The number of iterations for the stochastic gradient descent is 10,000,000.

LME is short for Latent Markov Embedding (chen2012playlist, )

. It is an machine learning algorithm that embeds states into Euclidean space based on a regularized maximum likelihood principle. We set the dimension

and use default values all other parameters (e.g., learning rate, epsilon). (We tried various values of spanning from to , we find as increases the performance also gets better, for the improvements are negligible. So we use to make the algorithm efficient.) We use the authors’ implementations.

RHOMP is our proposed method in this paper. We use initial step size as , and set as the algorithm termination criterion when the relative improvement over log likelihood is below this point. For the hyperparameter we use Chebyshev nodes for the interpolation.

The datasets are randomly split into a training set () and testing set () based on keeping whole trails together. And for each dataset we conduct experiments over random repetitions and present the average results. For evaluations we use precision over top outputs to measure the accuracy of each method. It is calculated over all individual transitions in the testing set as

Besides precision, which measures the accuracy of the top outputs from algorithms, we also provide results on Mean Reciprocal Rank (MRR). The reciprocal rank of an output is the inverse of the rank of the ground truth answer and MRR measures the overall ranking compared to the groundtruth. For both measures, we want large scores close to 1.

5.2. General Results

First we compare our RHOMP () with other baseline methods in terms of precision and MRR score.

MRR score. Table 2 depicts the results on the MRR score. In all datasets, RHOMP has the highest score. From the table we see that MC1 outperforms the LME method. The LME has the advantage of embedding the states into Euclidean space for applications like visualization or clustering. However the embedding could potentially cause the information loss, thus make the prediction less accurate. And we notice that MC2 has the lowest scores in many cases (i.e., BeerAdvocate, Flickr and FourSQ datasets), and the MRR scores drop compared to MC1. The Kneser-Ney smoothing modification makes the MC2 estimate more robust, and in most cases outperforms the MC1, although such advantage is limited compared to that from our RHOMP method. The PITF method is also not competitive.

Precision score. Figure 2 shows the algorithms performances in terms of relative precision. Many of the observations from Table 2 on the MRR score also apply here. In addition we find MC2 is often able to provide one accurate output, so the relative precision () is actually quite good in most cases. However as increase the relative precision drops rapidly due to the fact that MC2 is not able to generate a few more reliable outputs. This limits the application of MC2 because in the task of recommendation, it is important for the algorithm to generate a few instead of one unique candidate state. Another observation is that the results of PITF over different trials are often more volatile because of its underlying stochastic gradient descent solver. We also find that for some datasets (e.g., BeerAdvocate and FourSQ) the relative precisions of our RHOMP decrease as increases. The reason is that as increases, the prediction task itself becomes easier, so it is hard to maintain the same advantage (i.e., constant relative precision). Same reason for the fact the inferior methods like LME and PITF can catch up as increases.

Algorithm Runtime. Table 3 shows the runtime for each method. The RHOMP approach takes slightly more time to train than Kneser-Ney methods, but has faster prediction and testing. It is slower than the pure MC methods, but much faster than PITF, LME.

Kneser1 Kneser2 PITF LME RHOMP
Table 3. Algorithm runtime (in minutes) for the three large datasets in terms of training time (left) and testing time (right). The experiments are run on a single-core of a 2.5Ghz Xeon CPU. Both MC1 and MC2 ran in under a minute.

5.3. Analysis on Overfitting

MC1 MC2 Kneser1 Kneser2 PITF LME RHOMP
Table 4. Precision (k=3) results for testing set (the left number) vs train set (the right number) that we use to estimate overfitting. Bold denotes the highest testing result. We judge the overfitting effects as MC2, Kneser2 MC1, Kneser1, RHOMP LME, PITF. But LME and PITF have poor test performances.

One of the reasons we propose the RHOMP method is to improve the higher-order Markov chain method in the aspect of overfitting. In this section we analyze the results in detail and give an explanation on the performances of different methods.

First we show the comparison between training and testing performance in Table 4. We present the result using precision with as it is representative of the remaining results. Both PITF and LME had the least overfitting effect as the testing and training precisions are very close. However, their testing precisions are also low. The training precision of MC2 is the highest for all datasets. But these are often more than times of the corresponding testing precisions. So MC2 is a highly overfitting method. Kneser2 also has comparatively high training precision since it is a second-order method and tends to fit the training data well. But the performance on testing set is better than MC2 as it uses lower-order information to smooth the output. The methods MC1, Kneser1 and RHOMP have a good training and testing balance, and among them, our RHOMP has superior testing performances.

Next we analyze the performance on individual states to help understand the behaviors of different algorithms. We sort all the states from high to low based on the total number of counts of each state in the training set. Our aim is to look at how testing accuracy correlates with these counts. Figure 3 shows the precision () comparisons (i.e., MC1 vs MC2 vs RHOMP and Kneser1 vs Kneser2 vs RHOMP) on the Flickr dataset based on counts of the states. We aggregate small sets of states based on their counts into baskets of at least 1000 transitions and 5 states. We find that all methods show precision drops when predicting infrequent states, with MC2 being affected most. Here, RHOMP does the best out of all methods, which reflects its ability to avoid overfitting.

Figure 3. State-wise precision () comparison on MC1 vs MC2 vs RHOMP (left figure) and Kneser1 vs Kneser2 vs RHOMP (right figure) on the Flickr dataset. Each marker represents the average precision over a group of states. The curves are fit from the scatter points based on Locally Weighted Scatterplot Smoothing (LOWESS).
Figure 4. Relative precision () vs order of the methods: MC, Kneser-Ney smoothing and RHOMP. The relative precision is the precision ratio to that from MC1 of the corresponding datasets. Note that the y-axis may not be scaled linearly to make the figures more clear.

5.4. Analysis on Higher-order Approaches

In the previous sections, we analyze the results for first and second-order approaches. Now we study the behavior as the order varies. Figure 4 shows change in performance as the order increases for the three frameworks: MC, Kneser-Ney smoothing and RHOMP. For the cases when the history states length is smaller than the order, we use the approach with the correct order to generate the prediction.

For the MC framework, higher-order approaches make the prediction less accurate. This occurs because these methods overfit the training data and there are more ways to overfit for a higher-order chain. For the Kneser-Ney smoothing approaches, in most cases (except BeerAdvocate dataset) there are improvements moving from first-order to second-order. However the improvements are slight. For order , there are usually either no clear improvements or small performance dips. The reason is that as the order increase, the higher-order transition become very sparse, and could easily encounter an unseen higher-order state. So in this case the algorithm will frequently seek the prediction from a lower-order approach.

For the RHOMP framework, there are improvements for each dataset when moving from MC1 to RHOMP with order , and for order , the results further improve. Compared to MC and Kneser-Ney smoothing frameworks, The RHOMP is more robust in terms of not decreasing the precision as order increases, with the exception of BrightKite dataset. In BrightKite, the average trail length is around , so there is insufficient information to train higher-order models and we lack the lower-order fallback in Kneser-Ney.

6. Summary & Future Work

In this paper we study the problem of modeling user trails, which encode useful information for downstream applications of user experiences, recommendations and advertising. We propose a new class of structured higher-order Markov chains which we call the retrospective higher-order Markov process (RHOMP). This model preserves the higher-order nature of user trails without risks of overfitting the data. A RHOMP can be estimated from data via a projected gradient descent algorithm we propose for maximum likelihood estimation (MLE). In the experiments, we find that RHOMP is superior in terms of precision and mean reciprocal rank compared to other methods. Also RHOMP is robust for higher-order chains when there is data available.

There are several directions to extend this work. First it would be interesting to explore other forms of retrospection that allow more interaction between the history states. (Note that the current approach in this paper selects a single state during the retrospective process). This will allow to model the case when certain combined history states have strong evidence in terms of transition patterns. Second it would also be useful to extend this framework in terms of personalization. This can be achieved by a tensor factorization approach or a collaborative filtering method. Lastly we also would like to embed time information into our prediction either by modeling the event time directly or using it as a side information to help generate a non-stationary process where the random walk behavior could change overtime.

Acknowledgements. This work was supported by NSF IIS-1422918, CAREER award CCF- 1149756, Center for Science of Information STC, CCF-093937; DOE award DE-SC0014543; and the DARPA SIMPLEX program.


  • [1] M. A. Awad and I. Khalil. Prediction of user’s web-browsing behavior: Application of Markov model. IEEE T. Syst. Man Cy. B, 42(4):1131–1142, 2012.
  • [2] A. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.
  • [3] A. R. Benson, D. F. Gleich, and J. Leskovec.

    Tensor spectral clustering for partitioning higher-order network structures.

    In SDM, pages 118–126, 2015.
  • [4] A. R. Benson, D. F. Gleich, and L.-H. Lim. The spacey random walk: A stochastic process for higher-order data. SIAM Rev., page To appear, 2017.
  • [5] J. Borges and M. Levene. Evaluating variable-length Markov chain models for analysis of user web navigation sessions. IEEE T. Knowl. Data En., 19(4), 2007.
  • [6] P. Bühlmann, A. J. Wyner, et al. Variable length Markov chains. The Annals of Statistics, 27(2):480–513, 1999.
  • [7] I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White. Model-based clustering and visualization of navigation patterns on a web site. Data Mining and Knowledge Discovery, 7(4):399–424, 2003.
  • [8] Ò. Celma Herrada. Music recommendation and discovery in the long tail. 2009.
  • [9] S. Chen, J. L. Moore, D. Turnbull, and T. Joachims. Playlist prediction via metric embedding. In KDD, pages 714–722, 2012.
  • [10] S. Chen, J. Xu, and T. Joachims. Multi-space probabilistic sequence modeling. In KDD, pages 865–873, 2013.
  • [11] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In ACL, pages 310–318, 1996.
  • [12] F. Chierichetti, R. Kumar, P. Raghavan, and T. Sarlos. Are web users really Markovian? In WWW, pages 609–618, 2012.
  • [13] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: user movement in location-based social networks. In KDD, pages 1082–1090, 2011.
  • [14] M. Deshpande and G. Karypis. Selective Markov models for predicting web page accesses. ACM T. Internet Techno., 4(2):163–184, 2004.
  • [15] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l 1-ball for learning in high dimensions. In ICML, pages 272–279, 2008.
  • [16] S. R. Eddy. Hidden Markov models. Current opinion in structural biology, 6(3):361–365, 1996.
  • [17] F. Figueiredo, B. Ribeiro, J. M. Almeida, and C. Faloutsos. TribeFlow: Mining & predicting user trajectories. In WWW, pages 695–706, 2016.
  • [18] R. Gupta, R. Kumar, and S. Vassilvitskii. On mixtures of Markov chains. In NIPS, pages 3441–3449, 2016.
  • [19] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev., 51(3):455–500, 2009.
  • [20] R. Kumar, M. Raghu, T. Sarlós, and A. Tomkins. Linear additive markov processes. In WWW, pages 411–419, 2017.
  • [21] A. Markov.

    Extension of the limit theorems of probability theory to a sum of variables connected in a chain.

  • [22] J. J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In WWW, pages 897–908, 2013.
  • [23] S. Melnyk, O. Usatenko, and V. Yampol’skii. Memory functions of the additive markov chains: applications to complex dynamic systems. Physica A: Statistical Mechanics and its Applications, 361(2):405–415, 2006.
  • [24] Y. Peres and P. Shields. Two new Markov order estimators. arXiv preprint math/0506080, 2005.
  • [25] P. L. Pirolli and J. E. Pitkow. Distributions of surfers’ paths through the world wide web: Empirical characterizations. World Wide Web, 2(1-2):29–45, 1999.
  • [26] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme. Factorizing personalized Markov chains for next-basket recommendation. In WWW, pages 811–820, 2010.
  • [27] S. Rendle and L. Schmidt-Thieme. Pairwise interaction tensor factorization for personalized tag recommendation. In WSDM, pages 81–90, 2010.
  • [28] D. Ron, Y. Singer, and N. Tishby. Learning probabilistic automata with variable memory length. In

    Proceedings of the seventh annual conference on Computational learning theory

    , pages 35–46. ACM, 1994.
  • [29] M. Rosvall, A. V. Esquivel, A. Lancichinetti, J. D. West, and R. Lambiotte. Memory in network flows and its effects on spreading dynamics and community detection. Nature communications, 5, 2014.
  • [30] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML, pages 880–887. ACM, 2008.
  • [31] Y. Shi, M. Larson, and A. Hanjalic. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Computing Surveys (CSUR), 47(1):3, 2014.
  • [32] P. Singer, D. Helic, A. Hotho, and M. Strohmaier. Hyptrails: A bayesian approach for comparing hypotheses about human trails on the web. In WWW, pages 1003–1013, 2015.
  • [33] H. M. Taylor and S. Karlin. An introduction to stochastic modeling. Academic press, 2014.
  • [34] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817, 1(8), 2015.
  • [35] O. Usatenko. Random finite-valued dynamical systems: additive Markov chain approach. Cambridge Scientific Publishers, 2009.
  • [36] T. Wu, A. R. Benson, and D. F. Gleich. General tensor spectral co-clustering for higher-order data. In NIPS, pages 2559–2567, 2016.
  • [37] Z. Xing, J. Pei, and E. Keogh. A brief survey on sequence classification. ACM Sigkdd Explorations Newsletter, 12(1):40–48, 2010.
  • [38] D. Yang, D. Zhang, Z. Yu, and Z. Yu. Fine-grained preference-aware location search leveraging crowdsourced digital footprints from LBSNs. In UbiComp, pages 479–488, 2013.
  • [39] J.-D. Zhang and C.-Y. Chow. Spatiotemporal sequential influence modeling for location recommendations: A gravity-based approach. ACM Transactions on Intelligent Systems and Technology (TIST), 7(1):11, 2015.