1. Introduction
User trails record sequences of activities when individuals interact with the Internet and the world. Such data come from various applications when users write a product review (mcauley2013amateurs, ), checkin at a physical location (cho2011friendship, ; yang2013fine, ), visit a webpage, or listen to a song (celma2009music, ). Understanding the properties and predictability of these data helps improve many downstream applications including overall user experiences, recommendations, and advertising (figueiredo2016tribeflow, ; awad2012prediction, ). We study the prediction problem and our goal is to estimate a model to describe and predict a set of user trails.
Markov chains are one of the most commonly studied models for this type of data. For these models, each checkin place, website, or song is a state. Users transition among these states following Markov rules. In a firstorder Markov model, the transition behavior to the next state of the sequence only depends on the current state. Higherorder Markov models include a morerealistic dependence on a larger number of previous states, and multiple recent studies found that firstorder Markov chains do not fully capture the user behaviors in web browsing, transportation and communication networks
(rosvall2014memory, ; chierichetti2012web, ). Furthermore ignoring the effects of secondorder Markov dynamics has significant negative consequences for downstream applications including community detection, ranking, and information spreading (rosvall2014memory, ; Benson2016motifspectral, ).The downside to higherorder Markov models is that the number of parameters grows exponentially with the order. (If there are states and we model steps of history, there are parameters.) So, even if we could accurately learn the parameters, it is already challenging to even store them. (Some practical techniques include lowrank and sparse approximations, but these pose their own problems.) Second, since the number of model parameters grows rapidly, the amount of training data required also grows exponentially with the order (chierichetti2012web, ). Acquiring such huge amounts of training data is usually impossible. Lastly, determining the amount of history to use itself is hard (peres2005two, ), and selecting a large value of could severely overfit the data, thus making the learned model less reliable.
Strategies to resolve the above issues of higherorder Markov chains include variable order Markov chain (buhlmann1999variable, ) where the order length is a variable that can have different values for different states. There is a fitting algorithm that can automatically determine an appropriate order for each state, however it requires substantial computation time (ron1994learning, ) which restricts it to applications with only a small number of states (borges2007evaluating, ; deshpande2004selective, ; chierichetti2012web, ). Smoothing and regularization methods (chen1996empirical, ) like KneserNey smoothing and WittenBell smoothing are additional approaches to make the higherorder Markov chain more robust. These methods are widely applied in language models for predicting unseen transitions. We will compare against the behavior of the KneserNey smoothing in our experiments and show that our method has a number of advantages.
In this paper we propose the retrospective higherorder Markov process (RHOMP) as a simplified, special case of a higherorder Markov chain (Section 3). In this type of Markov model, a user retrospectively choses a state from the past steps of history, and then transitions as a firstorder chain conditional on that state from history. This assumption helps to restrict the total number of parameters and protect the model from overfitting the correlations between history states. Specifically, this model corresponds to choosing
different first order Markov chain transition matrices, one for each step of history, as well as an associated probability distribution. Consequently, the number of parameters grows linearly with the size of history while preserving the higherorder nature. We also show there are important connections between our model and the class of pairwiseinteraction tensor factorization models proposed by Rendle et al.
(rendle2010pairwise, ; rendle2010factorizing, ) (Section 3.2).We design an algorithm to select an optimal model from training data via maximum likelihood estimation (MLE). For the secondorder case with two steps of history, this yields a constrained convex optimization problem with a single hyperparameter
. We derive a projected gradient descent (duchi2008efficient, ) algorithm to solve it. It requires only a few iterations to converge and each iteration is linear in the training data. We select the hyperparameter by fitting a polynomial to the likelihood function as a function of the parameter and select the global minimum. Thus, our RHOMP process does not require any parameter tuning and is scalable to applications with tens of thousands of states. In addition, both the process of updating the gradients and model parameters parallelize over the training data.We evaluate the effectiveness of RHOMP models in experiments ^{1}^{1}1Code and data for this paper are available at: https://github.com/wutao27/RHOMP. with real datasets including product reviews, online music streaming, photo locations, and checkin business types (Section 5.1
). We primarily compare algorithms in terms of their ability to predict information from testing data and use precision and mean reciprocal rank as the two main evaluation metrics. These experiments and results show that the RHOMP model achieves superior prediction results in all datasets (Section
5.2) compared with first and second order chains. For even higherorder chains, RHOMP shows stable performance with one exception (Section 5.4) where the data only has short sequences.Remark. Recently Kumar et. (kumar2017linear, ) proposed the Linear Additive Markov Process (LAMP) that is closely related to our framework. Specifically our RHOMP model has the same formulation as the generazlied extention GLAMP from the paper (kumar2017linear, ). We learned about this paper as we were finalizing our submission to arXiv. The papers share a number of related technical results about the models and we discovered the related work (markov1971extension, ; zhang2015spatiotemporal, ; melnyk2006memory, ; usatenko2009random, ) based on their manuscript. The main difference is that in this paper we focus on the general form that allows to learn different Markov chains for each step of history. In addition we connect the RHOMP model with a particular tensor factorization to a higherorder Markov chain.
2. Preliminaries
We begin by formally reviewing the problem of user trail prediction. Then we will review relevant background on Markov chain models.
2.1. Problem Formulation
We denote a user trail as a sequence over a discrete state space with each element . Here is the total number of states. The sequence can represent, for instance, a user’s music listening history with each state denoting a song/artist, or a user’s checkin history from social network with each state denoting a location. Given a specific user trail up to time : with , the task is to predict the next state at time based on a large set of user trails for training: , where each is an individual trail.
2.2. Markov Chain Methods
An th order Markov chain is defined as a stochastic process on the state space: with the property that the next transition only depends on the last steps. Formally,
An order transition tensor with size characterizes the above Markov chain, with denoting the probability of transitioning to state given the current history states . The model with is called the firstorder Markov chain and similarly it can be described by an transition matrix .
In order to use a Markov chain for the prediction problem, we need to estimate the transition matrix . Given a set of users trails , the maximum likelihood estimator (MLE) of the probability for a first order chain is given by (chierichetti2012web, ):
where denotes the number of instances that the states and were consecutive in all trails. For the case of higherorder Markov chain, it is wellknown that any higherorder () Markov chain is equivalent to a firstorder Markov chain by taking a Cartesian product of its state space. This simplifies the parameter estimations and we may replace the original states with the Cartesian product states:
where now counts the number of instances of the sequence in the training data.
Returning to the prediction task itself, Markov chain methods take as input the history states of a trail and lookup the probabilities for all future states in the matrix or tensor . This becomes a ranked list of states with the highest probability on top.
3. Retrospective HigherOrder Markov Processes
The goal of the retrospective higherorder Markov process (RHOMP) is to strike a balance between the simplicity of the first order Markov model and the highparameter complexity of the higherorder Markov model. Nevertheless, it is important for the model to account for higherorder behaviors because these are necessary to capture many types of user behaviors (rosvall2014memory, ; chierichetti2012web, ). Towards that end, the RHOMP model describes a structured higherorder Markov chain that results in a compact lowparameter description of possible user behaviors. We describe this formally for the case of a secondorder history (and discuss largely notational extensions to higherorder chains in Section 3.4).
3.1. The Retrospective Process
The specific structure that a RHOMP describes is a retrospectively firstorder Markov property. For some intuition, suppose that a web surfer had visited a searchquery result page and then clicked the first link. In the RHOMP model, the user will first determine if they are going to continue browsing from the searchresult page or the first link—hence users have the power to retrospect over history. Once that decision has been made, the user will behave in a firstorder Markovian fashion that depends on if the user returned to the previous state or remained on the current state. Formally, suppose that the chain has recently visited states and . The RHOMP is a twostage process that first selects a single history state. Since there are only two states, we model this selection as a weighted cointoss where the probability of picking is and so picking happens with probability . Once we have the history state, then the RHOMP transitions according to a transition matrix that is specific to that step of the history. Thus
where models the transitions from the current state (when those are selected) and models the transitions from the previous state (when those are selected). See Figure 1 for illustration. We summarize this in the following definition:
Definition 0 ()
Given and two stochastic matrices , a secondorder retrospective higherorder Markov process will transition from state with history state as follows: (i) with probability it transitions according to with the current state , and (ii) with probability it transitions according to with the previous state .
This model has a number of useful features. For instance, it is easy to compute the stationary distribution as the following theorem shows.
Theorem 2 ()
Let be a secondorder RHOMP model. Consider the stationary distribution in terms of the longterm fraction of time the process spends in a state:
Such a distribution always exists. Moreover, it is unique if is an irreducible matrix.
Proof 0 ()
Because the RHOMP is a special case of a secondorder chain, we can use the relationship with the firstorder chain on the Cartesian product space to establish that a distribution always exists. This follows because the longterm distribution of a firstorder, finitestate space Markov chain always exists (though there could be multiple such distributions) (taylor2014introduction, ). Let for all be any limiting distribution of the product state space, and be either of the corresponding marginal distribution such that or . Note that both of these marginals result in the same distribution because we use the long time average to define . Then we have:
where is defined as . So the limiting distribution follows , and it is unique if the corresponding Markov chain is irreducible.
In Section 3.3, we show how to compute a maximum likelihood estimate of and from data.
3.2. A Tensor Factorization Perspective
We originally derived this type of RHOMP via a tensor factorization approach, but then realized that the retrospective interpretation is more direct and helpful. Nevertheless, we believe there are fruitful connections established by the tensor factorization approach. Consider the transition tensor of a secondorder Markov chain: is a mode, , nonnegative tensor such that
(1) 
This imposes a set of equality constraints. If we wanted to use traditional lowrank tensor approximations such as PARAFAC or Tucker (kolda2009tensor, ) to study large datasets, then we would need to add a large number of constraints to the fitting algorithms in order to ensure that the factorization results in a stochastic tensor that we could use for a second order Markov chain. This approach was extremely challenging.
Instead, consider a pairwise interaction tensor factorization (PITF) as proposed by Rendle et al. (rendle2010pairwise, ) with the following form:
(2) 
where matrices . We notice that last term in (2) is the interaction between the current state and the previous state , and it contributes only a constant determined by the pair . In the applications of prediction, we can drop this term because it does not affect the relative ranking for the future state . So the factorization model becomes:
(3) 
with .
To see the relationship with our RHOMPs, denote and with . Then the result of a PITF factorization with stochastic constraints is:
(4) 
It is easy to verify that if both and are stochastic matrices, then the corresponding tensor is a transition tensor following (1). The following theorem shows that from any nonnegative and , we can construct such stochastic matrices.
Theorem 3 ()
Assuming there exist nonnegative matrices and such that the transition tensor can be decomposed in the form of (4), then there exist and stochastic matrices such that .
Proof 0 ()
Denote and for all . Because for all , we have , and . If then the original matrices and are stochastic. Otherwise we can set
where and are stochastic. Then we have
So forms a valid factorization for , the bound on follows from from (4).
Consequently, the RHOMP form also arises from the PITF approach when constrained to model stochastic tensors.
3.3. Parameter Optimization
In this section we will apply the principle of maximum likelihood to estimate the model parameters of a RHOMP (i.e., ) directly from data. An alternative would be to estimate the higherorder Markov chain and use the PITF factorization as discussed in the previous section. Working directly on the RHOMP model from data has two advantages: first, the estimate corresponds exactly with the model, rather than estimate and approximate; and second, the direct approach is faster.
We first show how to compute a maximum likelihood estimate with fixed and then discuss how to pick . Recall that is the total count of transitions moving from to with previous state in the training data. With fixed , the log likelihood of all transitions from the set of user trails is:
(5) 
Our goal is to find a pair of stochastic matrices which minimizes the negative log likelihood, which gives us the following optimization problem:
(6) 
This optimization problem is convex as the following theorem shows.
Theorem 4 ()
The negation of the log likelihood function in (5) is convex and so is the feasible region of pairs of stochastic matrices. Thus any local minima solution is also the solution for global mimima.
Proof 0 ()
First we verify the feasible domain of stochastic pairs is convex. We can check that given and two stochastic matrices , the linear combination
is also a stochastic matrix. This applies elementwise to the pair to verify the claim.
Now given two sets of stochastic matrices and and the corresponding linear combination we have
So (6) is a convex problem.
We now derive the projected gradient descent algorithm for (6), which is summarized in Algorithm 1. This involves

[leftmargin=0.3in]

First update and based on their gradients.

Since and are no longer stochastic due to the above updates, the projection step is applied to project the updated and back to (i.e., the stochastic property).
The gradients over and are:
(7) 
We accomplish the projection step using the algorithm from (duchi2008efficient, )
. Note that for the sake of simplicity we present the projection step by sorting the vector
, but there is a more efficient method based on divide and concur (duchi2008efficient, ) which is linear cost to the number nonzeros in . However in practice sorting is fast as the vector is very sparse.Overall each iteration takes linear time in the number of unique triples in the sequence data. This is upper bounded by the size of input data. We also note that the procedure of computing the gradients and updating , which dominates the majority of the computation, can be paralleled.
Choosing . To determine the value of hyperparameter , we conduct a few trials with chosen between . Then based on the value of the objective function, we calculate the best value of
from a polynomial interpolation of the likelihood function. Specifically
is selected as Chebyshev nodes . Getting the global minimum of a polynomial interpolant can be done efficiently, and polynomials can approximate arbitrary continuous functions, which renders this a pragmatic choice. Another approach for selecting the value of is to conduct cross validation with grid search. However a different objective is needed as we could run into unseen transitions in the validation set and the likelihood would go to . Alternatively we can use a measurement like precision instead of likelihood. The main advantage of cross validation is its ability to prevent overfitting. In our experiment we find this problem does not occur, so we drop this procedure as it requires substantially more computation.3.4. Higherorder Cases Beyond Second Order
The ideas discussed in the above sections also work for the higherorder cases with . The RHOMP model becomes:
where for , and matrices for are stochastic. Similarly the log likelihood function can be derived as well as the gradient over each . The projected gradient descent algorithm is then applied to update each stochastic matrix , with a periteration complexity bounded by the size of the training data.
The biggest difference is that we are no longer able to determine the hyperparameters in a simple fashion as the polynomial interpolation is only computationally efficient for one or two parameters. To address this issue, recall that in Section 3.1 we proposed the model as a retrospective walk, where the walker has probability to step back steps into their history and then transition according to . Our proposal is to use a single hyperparameter to model a decaying probability of looking back into the history:
(This distribution describes a truncated geometric random variable.) In our experiments for the secondorder case the optimal
for every dataset. This offers a single step of evidence for this assumption. This can be chosen either by the procedure of polynomial interpolation or simply using the optimal value from a secondorder factorization model . We apply the latter approach in our experiments for RHOMP with .4. Related Work
Modeling User Trails. Early work in (pirolli1999distributions, )
characterized the user path patterns on the web with the tools of Markov chains. Other advanced methods include hidden Markov models (HMM)
(eddy1996hidden, ), variable length Markov chains (buhlmann1999variable, ) and association rules (awad2012prediction, ). However the computations associated with the above methods limit them from being used in datasets with more than a few thousand states. More recent work considers the sequence prediction task with personalization, such as collaborative filtering methods (salakhutdinov2008bayesian, ; shi2014collaborative, ) where the behavior of similar users is utilized to help the prediction, factorizing personalized Markov chains (rendle2010factorizing, ), and TribeFlow (figueiredo2016tribeflow, ). Other than the prediction problem, clustering and visualization (cadez2003model, ), sequence classification (xing2010brief, ), metric embedding (chen2012playlist, ; chen2013multi, ) and hypotheses comparison (singer2015hyptrails, ) have also been studied. In the context of this work, we seek to improve the performance of the classic and simple Markov model by studying a structured variation.Random Walk Models. Since our model is a special case of a higherorder Markov chain, we note that there are relationships with a variety of enhanced Markov models. First our RHOMP model defines a specific form of the Additive Markov Process (AMP) (markov1971extension, ), where the transition probability is a summation of a series of memory functions that are restricted on the next state and one history state each. Applications of the AMP include LAMP (kumar2017linear, ) (see Section 1), the gravity models (zhang2015spatiotemporal, ), and some dynamical systems in physics (melnyk2006memory, ; usatenko2009random, ) where the memory function is empirically estimated for the application of binary state. In addition to the AMP, recent innovations include new recovery results on mixture of Markov chains (gupta2016mixtures, ) (a special case of HMM), which assumes a small set of Markov chains that model various classes of latent indent; and the spacey random walk (benson2016spacey, ; benson2015tensor, ; wu2016general, ) as a nonMarkovian stochastic process that utilizes higherorder information based on the empirical occupation of states.
Tensor Factorization. As already discussed, our work is directly related to the pairwise interaction tensor factorization (PITF) method proposed by Rendle in (rendle2010pairwise, ; rendle2010factorizing, ), where the task is to generate tag recommendations given the combination. The PITF model is learned from a binary tensor of triple by bootstrap sampling from pairwise ranking constrains. Our work differs in the aspect of problem formulation, model construction and parameter optimization. The RHOMP model is also a special case of both the canonical/PARAFAC and Tucker decompositions (kolda2009tensor, ).
5. Experiments
# states  # transitions  # trails  

LastFM  17,341  2,902,035  195,499 
BeerAdvocate  2,324  1,348,903  35,629 
BrightKite  11,465  400,340  125,437 
Flickr  7,608  1,212,674  97,563 
FourSQ  344  198,503  1,480 
We evaluate our RHOMP method on the ability to predict subsequent states in a user trail in terms of precision and mean reciprocal rank (MRR) on five different types of data (Section 5.1). We then present the results of a secondorder (i.e., ) RHOMP compared with baseline methods in Section 5.2 and study overfitting of the training data in Section 5.3. Then we study what happens for higherorder (i.e., ) models in Section 5.4. In all cases, the RHOMP model offers a considerable improvement to existing methods.
MC1  MC2  Kneser1  Kneser2  PITF  LME  RHOMP  

LastFM  
BeerAdvocate  
BrightKite  
Flickr  
FourSQ 
entries are the standard deviations over 5 trials. Our proposed RHOMP (
) has the best performance in all datasets.5.1. Datasets and Evaluations Setup
The real datasets we use in our experiments cover several applications including: product reviews, online music streaming, checkin locations of social network and photo uploads. Every dataset is publicly available. For all the datasets selfloops are removed as we are mostly interested in predicting a nontrivial transition. Also we only consider states that show up more than times. Simple statistics on each dataset are summarized in Table 1, and we now describe them individually.
LastFM (celma2009music, ) is a music streaming and recommendation website (last.fm). We generate user trails as listening histories regarding different artists over a oneyear period (20080501 to 20090501).
BeerAdvocate (mcauley2013amateurs, ) consists of beer reviews spanning more than 10 years up to November 2011 from beeradvocate.com. We study the user trail as reviews over different brewers.
BrightKite (cho2011friendship, ) was a locationbased social networking website where users shared their locations by checkingin. We study the trails of location id.
Flickr (thomee2015new, ) contains 100 million Flickr photos/videos provided by Yahoo! Webscope. We extract the user trail based on geolocation (restricted to USA) of each upload after 20080101. Each longitude and latitude is mapped into a grid of approximate 10km by 10km, which constitutes the state.
FourSQ is a location based checkin dataset created by Yang et al (yang2013fine, ) which contains checkins from New York City from 24 October 2011 to 20 February 2012. We extract checkin place category (e.g., bus station, hotel, bank) as state.
For experimental methods, we consider the following:
MC1, MC2 are the firstorder and secondorder Markov chain methods respectively, where the transition matrix is estimated based on maximum likelihood.
Kneser1, Kneser2 are the interpolated KneserNey smoothing methods (chen1996empirical, )
applied on the firstorder and secondorder Markov chain methods respectively. This is one of the best smoothing methods for ngram language models, where it enables higherorder Markov chain transitions to unseen ngrams. We set the discounting parameter as
by the method of leaving one out (chen1996empirical, ), where and denote the number of ngrams that appear exactly once and twice respectivelyPITF is the pairwise interaction tensor factorization method (rendle2010pairwise, ) computed on the higherorder Markov chain estimate. Because we use ranking, we consider general positive and negative entries as valid for the factorization. We implement the fitting method ourselves to handle the sparsity in our data. As suggested in the paper (rendle2010pairwise, ), the hyperparameters are and with initialization from . We set the rank number as of the total number of states, which is enough to accurately capture the user behavior (rendle2010pairwise, )
. The number of iterations for the stochastic gradient descent is 10,000,000.
LME is short for Latent Markov Embedding (chen2012playlist, )
. It is an machine learning algorithm that embeds states into Euclidean space based on a regularized maximum likelihood principle. We set the dimension
and use default values all other parameters (e.g., learning rate, epsilon). (We tried various values of spanning from to , we find as increases the performance also gets better, for the improvements are negligible. So we use to make the algorithm efficient.) We use the authors’ implementations.RHOMP is our proposed method in this paper. We use initial step size as , and set as the algorithm termination criterion when the relative improvement over log likelihood is below this point. For the hyperparameter we use Chebyshev nodes for the interpolation.
The datasets are randomly split into a training set () and testing set () based on keeping whole trails together. And for each dataset we conduct experiments over random repetitions and present the average results. For evaluations we use precision over top outputs to measure the accuracy of each method. It is calculated over all individual transitions in the testing set as
Besides precision, which measures the accuracy of the top outputs from algorithms, we also provide results on Mean Reciprocal Rank (MRR). The reciprocal rank of an output is the inverse of the rank of the ground truth answer and MRR measures the overall ranking compared to the groundtruth. For both measures, we want large scores close to 1.
5.2. General Results
First we compare our RHOMP () with other baseline methods in terms of precision and MRR score.
MRR score. Table 2 depicts the results on the MRR score. In all datasets, RHOMP has the highest score. From the table we see that MC1 outperforms the LME method. The LME has the advantage of embedding the states into Euclidean space for applications like visualization or clustering. However the embedding could potentially cause the information loss, thus make the prediction less accurate. And we notice that MC2 has the lowest scores in many cases (i.e., BeerAdvocate, Flickr and FourSQ datasets), and the MRR scores drop compared to MC1. The KneserNey smoothing modification makes the MC2 estimate more robust, and in most cases outperforms the MC1, although such advantage is limited compared to that from our RHOMP method. The PITF method is also not competitive.
Precision score. Figure 2 shows the algorithms performances in terms of relative precision. Many of the observations from Table 2 on the MRR score also apply here. In addition we find MC2 is often able to provide one accurate output, so the relative precision () is actually quite good in most cases. However as increase the relative precision drops rapidly due to the fact that MC2 is not able to generate a few more reliable outputs. This limits the application of MC2 because in the task of recommendation, it is important for the algorithm to generate a few instead of one unique candidate state. Another observation is that the results of PITF over different trials are often more volatile because of its underlying stochastic gradient descent solver. We also find that for some datasets (e.g., BeerAdvocate and FourSQ) the relative precisions of our RHOMP decrease as increases. The reason is that as increases, the prediction task itself becomes easier, so it is hard to maintain the same advantage (i.e., constant relative precision). Same reason for the fact the inferior methods like LME and PITF can catch up as increases.
Algorithm Runtime. Table 3 shows the runtime for each method. The RHOMP approach takes slightly more time to train than KneserNey methods, but has faster prediction and testing. It is slower than the pure MC methods, but much faster than PITF, LME.
Kneser1  Kneser2  PITF  LME  RHOMP  

LastFM  
BrightKite  
Flickr 
5.3. Analysis on Overfitting
MC1  MC2  Kneser1  Kneser2  PITF  LME  RHOMP  

LastFM  
BeerAdvocate  
BrightKite  
Flickr  
FourSQ 
One of the reasons we propose the RHOMP method is to improve the higherorder Markov chain method in the aspect of overfitting. In this section we analyze the results in detail and give an explanation on the performances of different methods.
First we show the comparison between training and testing performance in Table 4. We present the result using precision with as it is representative of the remaining results. Both PITF and LME had the least overfitting effect as the testing and training precisions are very close. However, their testing precisions are also low. The training precision of MC2 is the highest for all datasets. But these are often more than times of the corresponding testing precisions. So MC2 is a highly overfitting method. Kneser2 also has comparatively high training precision since it is a secondorder method and tends to fit the training data well. But the performance on testing set is better than MC2 as it uses lowerorder information to smooth the output. The methods MC1, Kneser1 and RHOMP have a good training and testing balance, and among them, our RHOMP has superior testing performances.
Next we analyze the performance on individual states to help understand the behaviors of different algorithms. We sort all the states from high to low based on the total number of counts of each state in the training set. Our aim is to look at how testing accuracy correlates with these counts. Figure 3 shows the precision () comparisons (i.e., MC1 vs MC2 vs RHOMP and Kneser1 vs Kneser2 vs RHOMP) on the Flickr dataset based on counts of the states. We aggregate small sets of states based on their counts into baskets of at least 1000 transitions and 5 states. We find that all methods show precision drops when predicting infrequent states, with MC2 being affected most. Here, RHOMP does the best out of all methods, which reflects its ability to avoid overfitting.
5.4. Analysis on Higherorder Approaches
In the previous sections, we analyze the results for first and secondorder approaches. Now we study the behavior as the order varies. Figure 4 shows change in performance as the order increases for the three frameworks: MC, KneserNey smoothing and RHOMP. For the cases when the history states length is smaller than the order, we use the approach with the correct order to generate the prediction.
For the MC framework, higherorder approaches make the prediction less accurate. This occurs because these methods overfit the training data and there are more ways to overfit for a higherorder chain. For the KneserNey smoothing approaches, in most cases (except BeerAdvocate dataset) there are improvements moving from firstorder to secondorder. However the improvements are slight. For order , there are usually either no clear improvements or small performance dips. The reason is that as the order increase, the higherorder transition become very sparse, and could easily encounter an unseen higherorder state. So in this case the algorithm will frequently seek the prediction from a lowerorder approach.
For the RHOMP framework, there are improvements for each dataset when moving from MC1 to RHOMP with order , and for order , the results further improve. Compared to MC and KneserNey smoothing frameworks, The RHOMP is more robust in terms of not decreasing the precision as order increases, with the exception of BrightKite dataset. In BrightKite, the average trail length is around , so there is insufficient information to train higherorder models and we lack the lowerorder fallback in KneserNey.
6. Summary & Future Work
In this paper we study the problem of modeling user trails, which encode useful information for downstream applications of user experiences, recommendations and advertising. We propose a new class of structured higherorder Markov chains which we call the retrospective higherorder Markov process (RHOMP). This model preserves the higherorder nature of user trails without risks of overfitting the data. A RHOMP can be estimated from data via a projected gradient descent algorithm we propose for maximum likelihood estimation (MLE). In the experiments, we find that RHOMP is superior in terms of precision and mean reciprocal rank compared to other methods. Also RHOMP is robust for higherorder chains when there is data available.
There are several directions to extend this work. First it would be interesting to explore other forms of retrospection that allow more interaction between the history states. (Note that the current approach in this paper selects a single state during the retrospective process). This will allow to model the case when certain combined history states have strong evidence in terms of transition patterns. Second it would also be useful to extend this framework in terms of personalization. This can be achieved by a tensor factorization approach or a collaborative filtering method. Lastly we also would like to embed time information into our prediction either by modeling the event time directly or using it as a side information to help generate a nonstationary process where the random walk behavior could change overtime.
Acknowledgements. This work was supported by NSF IIS1422918, CAREER award CCF 1149756, Center for Science of Information STC, CCF093937; DOE award DESC0014543; and the DARPA SIMPLEX program.
References
 [1] M. A. Awad and I. Khalil. Prediction of user’s webbrowsing behavior: Application of Markov model. IEEE T. Syst. Man Cy. B, 42(4):1131–1142, 2012.
 [2] A. Benson, D. F. Gleich, and J. Leskovec. Higherorder organization of complex networks. Science, 353(6295):163–166, 2016.

[3]
A. R. Benson, D. F. Gleich, and J. Leskovec.
Tensor spectral clustering for partitioning higherorder network structures.
In SDM, pages 118–126, 2015.  [4] A. R. Benson, D. F. Gleich, and L.H. Lim. The spacey random walk: A stochastic process for higherorder data. SIAM Rev., page To appear, 2017.
 [5] J. Borges and M. Levene. Evaluating variablelength Markov chain models for analysis of user web navigation sessions. IEEE T. Knowl. Data En., 19(4), 2007.
 [6] P. Bühlmann, A. J. Wyner, et al. Variable length Markov chains. The Annals of Statistics, 27(2):480–513, 1999.
 [7] I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White. Modelbased clustering and visualization of navigation patterns on a web site. Data Mining and Knowledge Discovery, 7(4):399–424, 2003.
 [8] Ò. Celma Herrada. Music recommendation and discovery in the long tail. 2009.
 [9] S. Chen, J. L. Moore, D. Turnbull, and T. Joachims. Playlist prediction via metric embedding. In KDD, pages 714–722, 2012.
 [10] S. Chen, J. Xu, and T. Joachims. Multispace probabilistic sequence modeling. In KDD, pages 865–873, 2013.
 [11] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In ACL, pages 310–318, 1996.
 [12] F. Chierichetti, R. Kumar, P. Raghavan, and T. Sarlos. Are web users really Markovian? In WWW, pages 609–618, 2012.
 [13] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: user movement in locationbased social networks. In KDD, pages 1082–1090, 2011.
 [14] M. Deshpande and G. Karypis. Selective Markov models for predicting web page accesses. ACM T. Internet Techno., 4(2):163–184, 2004.
 [15] J. Duchi, S. ShalevShwartz, Y. Singer, and T. Chandra. Efficient projections onto the l 1ball for learning in high dimensions. In ICML, pages 272–279, 2008.
 [16] S. R. Eddy. Hidden Markov models. Current opinion in structural biology, 6(3):361–365, 1996.
 [17] F. Figueiredo, B. Ribeiro, J. M. Almeida, and C. Faloutsos. TribeFlow: Mining & predicting user trajectories. In WWW, pages 695–706, 2016.
 [18] R. Gupta, R. Kumar, and S. Vassilvitskii. On mixtures of Markov chains. In NIPS, pages 3441–3449, 2016.
 [19] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev., 51(3):455–500, 2009.
 [20] R. Kumar, M. Raghu, T. Sarlós, and A. Tomkins. Linear additive markov processes. In WWW, pages 411–419, 2017.

[21]
A. Markov.
Extension of the limit theorems of probability theory to a sum of variables connected in a chain.
1971.  [22] J. J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In WWW, pages 897–908, 2013.
 [23] S. Melnyk, O. Usatenko, and V. Yampol’skii. Memory functions of the additive markov chains: applications to complex dynamic systems. Physica A: Statistical Mechanics and its Applications, 361(2):405–415, 2006.
 [24] Y. Peres and P. Shields. Two new Markov order estimators. arXiv preprint math/0506080, 2005.
 [25] P. L. Pirolli and J. E. Pitkow. Distributions of surfers’ paths through the world wide web: Empirical characterizations. World Wide Web, 2(12):29–45, 1999.
 [26] S. Rendle, C. Freudenthaler, and L. SchmidtThieme. Factorizing personalized Markov chains for nextbasket recommendation. In WWW, pages 811–820, 2010.
 [27] S. Rendle and L. SchmidtThieme. Pairwise interaction tensor factorization for personalized tag recommendation. In WSDM, pages 81–90, 2010.

[28]
D. Ron, Y. Singer, and N. Tishby.
Learning probabilistic automata with variable memory length.
In
Proceedings of the seventh annual conference on Computational learning theory
, pages 35–46. ACM, 1994.  [29] M. Rosvall, A. V. Esquivel, A. Lancichinetti, J. D. West, and R. Lambiotte. Memory in network flows and its effects on spreading dynamics and community detection. Nature communications, 5, 2014.
 [30] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML, pages 880–887. ACM, 2008.
 [31] Y. Shi, M. Larson, and A. Hanjalic. Collaborative filtering beyond the useritem matrix: A survey of the state of the art and future challenges. ACM Computing Surveys (CSUR), 47(1):3, 2014.
 [32] P. Singer, D. Helic, A. Hotho, and M. Strohmaier. Hyptrails: A bayesian approach for comparing hypotheses about human trails on the web. In WWW, pages 1003–1013, 2015.
 [33] H. M. Taylor and S. Karlin. An introduction to stochastic modeling. Academic press, 2014.
 [34] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.J. Li. The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817, 1(8), 2015.
 [35] O. Usatenko. Random finitevalued dynamical systems: additive Markov chain approach. Cambridge Scientific Publishers, 2009.
 [36] T. Wu, A. R. Benson, and D. F. Gleich. General tensor spectral coclustering for higherorder data. In NIPS, pages 2559–2567, 2016.
 [37] Z. Xing, J. Pei, and E. Keogh. A brief survey on sequence classification. ACM Sigkdd Explorations Newsletter, 12(1):40–48, 2010.
 [38] D. Yang, D. Zhang, Z. Yu, and Z. Yu. Finegrained preferenceaware location search leveraging crowdsourced digital footprints from LBSNs. In UbiComp, pages 479–488, 2013.
 [39] J.D. Zhang and C.Y. Chow. Spatiotemporal sequential influence modeling for location recommendations: A gravitybased approach. ACM Transactions on Intelligent Systems and Technology (TIST), 7(1):11, 2015.
Comments
There are no comments yet.