Section 1 Introduction
In this paper we explore the problem of characterizing a database of sequential behaviors (i.e., sequences). An example of such sequential behaviors is the web browsing behavior of Internet users. Table 1 shows some user-browsing data excerpted from the web server logs of msnbc.com. Each row of the table is an ordered list of discrete symbols, each of which represents one behavior made by a user. Here a behavior is described by the category of the web page requested by the user. For example, User 1 first browses a ‘frontpage’ page, then visits a ‘news’ page, followed by a ‘travel’ page and other pages denoted by dots. This forms a typical sequential behavior for a user, and other individuals have similar sequential behaviors. Over the last decade, many efforts have been made to characterize such sequential behaviors for further analysis.
Significant progress has been made on behavior modeling in the field of Sequential Pattern Mining (SPM). A pattern is an expression describing a subset of the data. Sequential pattern mining, first introduced by Agrawal and Srikant [1], discovers frequently occurring behaviors or subsequences as patterns. Several algorithms, such as Generalized Sequential Patterns (GSP), SPADE [21] and PrefixSpan [11, 12], have been proposed to mine sequential patterns efficiently. Generally speaking, SPM techniques aim at discovering comprehensible sequential patterns in data. This approach is descriptive and lacks a well-founded theory for applying the discovered patterns to further data analysis tasks, such as sequence classification and behavior modeling.
In the statistical and machine learning community, researchers try to characterize sequential behaviors using probabilistic models. Probabilistic models not only describe the generative process of the sequential behaviors but also have a predictive property, which is helpful for further analytic tasks such as prediction of future behaviors. One widely used representative model is the hidden Markov model (HMM). Usually, each sequence is modeled as an HMM. In other words, the dynamics of each sequence is represented by a list of deterministic parameters (i.e., an initial prior vector and a transition matrix), and there is no generative probabilistic model for these numbers. This leads to several problems when we directly extend HMMs to modeling a database of sequences: (1) the number of parameters for the HMMs grows linearly with the number of sequences, which leads to a serious problem of over-fitting, and (2) it is not clear how to assign probability to a sequence outside of the training set. Although a strategy for modeling multiple sequences has been suggested, it simply ignores the differences in parameters between sequences and assumes that the dynamics of all sequences can be characterized by one set of deterministic parameters. This could alleviate the problem of over-fitting to some extent, but may overlook the individual characteristics of each sequence, which may in turn reduce the accuracy of behavior modeling.
The goal of this paper is to characterize a database of sequential behaviors while preserving the essential statistical relationships for each individual sequence and the whole database, and avoiding the problem of over-fitting. To achieve this goal, we propose a generative model that has both sequence-level and database-level variables to model behavioral sequences comprehensively and effectively.
The paper is organized as follows: Section 2 formalizes the problem studied in this paper, followed by the proposed approach in Section 3. Section 4 reviews the probabilistic models related to this paper. Experimental results for several data mining tasks on three real-world data sets are reported in Section 5. Finally, Section 6 concludes the paper and discusses some possible future directions.
Section 2 Problem Statement
Here we use terms such as “behaviors”, “sequences” and “database” to describe a database of sequential behaviors. This is helpful for understanding the probabilistic model derived from the data. It is important to note that the model proposed in this paper is also applicable to other sequential data of a similar form. In this paper, vectors are denoted by lowercase bold Roman or Greek letters, and all vectors are assumed to be column vectors unless stated otherwise. Uppercase bold Roman letters denote matrices, while all other letters are assumed to be scalars.
A database is a collection of sequences denoted by .
A sequence () is an ordered list of behaviors denoted by , where () is the behavior in the sequence . The behaviors in the sequence are ordered by increasing time when behaviors are made.
A behavior (, ) is the basic unit of sequential behaviors, defined to be a 1-of- vector such that and (for all ), which represents an item from a vocabulary indexed by . Each index represents one type of behaviors, such as browsing a ‘travel’ web page as shown in Table 1.
Given a database of sequential behaviors, the problem of characterizing behaviors is to derive a probabilistic model which preserves the statistical relationships in the sequences and tends to assign high likelihood to “similar” sequences.
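The data representation defined above can be illustrated with a minimal sketch. It encodes an ordered list of behavior labels as 1-of-V vectors, one per behavior. The function names `one_hot` and `encode_sequence` and the three-item vocabulary are assumptions made for illustration only; they are not part of the paper.

```python
from typing import Dict, List


def one_hot(index: int, size: int) -> List[int]:
    """Return a 1-of-`size` vector with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1 if j == index else 0 for j in range(size)]


def encode_sequence(symbols: List[str], vocabulary: List[str]) -> List[List[int]]:
    """Encode an ordered list of behavior labels as a list of 1-of-V vectors."""
    lookup: Dict[str, int] = {label: i for i, label in enumerate(vocabulary)}
    return [one_hot(lookup[s], len(vocabulary)) for s in symbols]


# Hypothetical vocabulary restricted to three of the page categories.
vocabulary = ["frontpage", "news", "travel"]
# First three behaviors of User 1 as described in Section 1.
user_1 = ["frontpage", "news", "travel"]
encoded = encode_sequence(user_1, vocabulary)
```

A database is then simply a list of such encoded sequences, possibly of different lengths.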
Section 3 The Proposed Model
3.1 The Graphical Model
The basic idea of the Latent Dirichlet Hidden Markov Models (LDHMMs) is that the dynamics of each sequence is assumed to be reflected through a hidden Markov chain parameterized by the corresponding initial prior vector , transition matrix and a state-dependent emission matrix . (Each hidden state () can be represented by a 1-of- vector, similar to the form of a behavior , and has possible hidden states, where is the number of possible hidden states and is usually set empirically.) Then , ( () is the th row vector of ) and ( () is the th row vector of ) can be seen as a lower-dimensional representation of the dynamics of the sequence. The distributions of these parameters for all sequences are then further governed by database-level Dirichlet hyper-parameters, i.e., , and , where is a matrix whose th row vector is and is a matrix whose th row vector is . To be more specific, for a database of sequential behaviors , the generative process is as follows (please refer to Appendix A for details of the Dirichlet (Dir) and multinomial distributions):
1. Generate hyper-parameters , and .
2. For each sequence index ,
   (a) Generate , and
   (b) For the first time stamp in the sequence :
       i. Generate an initial hidden state .
       ii. Generate a behavior from , a multinomial probability conditioned on the hidden state and .
   (c) For each of the other time stamps in the sequence ():
       i. Generate a hidden state from .
       ii. Generate a behavior from .
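The generative process above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation (which the paper says was written in MATLAB). The argument names `alpha_pi`, `alpha_A` and `beta` stand for the three database-level Dirichlet hyper-parameters, behaviors are returned as vocabulary indices rather than 1-of-V vectors, and the sequence length is passed in explicitly.

```python
import random
from typing import List, Tuple


def sample_dirichlet(alpha: List[float], rng: random.Random) -> List[float]:
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs: List[float], rng: random.Random) -> int:
    """Draw one index with probability proportional to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_sequence(
    alpha_pi: List[float],        # hyper-parameter of the initial prior vector
    alpha_A: List[List[float]],   # one Dirichlet parameter row per hidden state
    beta: List[List[float]],      # one Dirichlet parameter row per hidden state
    length: int,
    rng: random.Random,
) -> Tuple[List[int], List[int]]:
    """Sample sequence-level parameters, then hidden states and behaviors."""
    # Sequence-level parameters, sampled once per sequence.
    pi = sample_dirichlet(alpha_pi, rng)
    A = [sample_dirichlet(row, rng) for row in alpha_A]
    B = [sample_dirichlet(row, rng) for row in beta]
    states: List[int] = []
    behaviors: List[int] = []
    z = sample_categorical(pi, rng)            # initial hidden state
    for n in range(length):
        if n > 0:
            z = sample_categorical(A[z], rng)  # Markov transition
        states.append(z)
        behaviors.append(sample_categorical(B[z], rng))  # state-dependent emission
    return states, behaviors


rng = random.Random(0)
# Hypothetical setting: 2 hidden states, 3 behavior types, flat hyper-parameters.
database = [
    generate_sequence([1.0, 1.0], [[1.0, 1.0]] * 2, [[1.0, 1.0, 1.0]] * 2, 5, rng)
    for _ in range(3)
]
```

Each call draws a fresh set of sequence-level parameters, so two sequences in the same database share only the hyper-parameters.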
Accordingly, the graphical model of LDHMMs is shown in Figure 1. As the graph shows, there are three levels of modeling in LDHMMs. The hyper-parameters , and are database-level variables, assumed to be sampled once in the process of generating a database. The variables , and () are sequence-level variables, sampled once per sequence. Finally, the variables and are behavior-level variables, sampled once for each behavior in each sequence.
3.2 Learning the Model
In this section, our goal is to learn the deterministic hyper-parameters of the LDHMMs given a database , which is to maximize the likelihood function given by
Direct optimization of the above equation is very difficult because of the latent variables involved, so we instead optimize its lower bound, given by Jensen’s inequality as follows:
where is assumed to be a variational distribution approximating the posterior distribution of the latent variables given , and can be decomposed as . Specifically, is a variational distribution approximating for the sequence .
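The role of Jensen's inequality can be checked numerically on a toy problem. The sketch below uses a single discrete latent variable with three values, which is far simpler than the LDHMM latent structure. The joint probabilities are hypothetical numbers chosen for illustration. It shows that the bound never exceeds the log marginal likelihood and becomes tight when the variational distribution equals the exact posterior.

```python
import math
from typing import List


def log_marginal(joint: List[float]) -> float:
    """log p(x) = log sum_z p(x, z) for a discrete latent variable z."""
    return math.log(sum(joint))


def lower_bound(joint: List[float], q: List[float]) -> float:
    """Jensen bound: sum_z q(z) * [log p(x, z) - log q(z)]."""
    return sum(
        qz * (math.log(pz) - math.log(qz))
        for pz, qz in zip(joint, q)
        if qz > 0.0  # terms with q(z) = 0 contribute nothing
    )


joint = [0.10, 0.30, 0.05]                      # hypothetical p(x, z), z = 0, 1, 2
q_uniform = [1.0 / 3.0] * 3                     # an arbitrary variational choice
posterior = [p / sum(joint) for p in joint]     # exact posterior p(z | x)

gap_uniform = log_marginal(joint) - lower_bound(joint, q_uniform)
gap_posterior = log_marginal(joint) - lower_bound(joint, posterior)
```

The gap is the KL divergence from the variational distribution to the true posterior, which is why improving the variational distribution tightens the bound.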
Then the lower bound of the likelihood function becomes a function of , , and . Obtaining the optimal , and is still difficult because of the involvement of . Thus, we propose a variational EM-based algorithm for learning the hyper-parameters of the LDHMMs, summarized in Algorithm 1, which is guaranteed not to decrease the lower bound after each iteration . To be more specific, the variational EM algorithm is a two-stage iterative optimization technique which alternates between the E-step (i.e., optimization with respect to ) and the M-step (optimization with respect to the hyper-parameters) in lines 1 to 7. In each iteration, the E-step (lines 1-4) fixes the hyper-parameters and optimizes with respect to for each sequence, while the M-step (line 5) fixes and optimizes with respect to the hyper-parameters. In this manner, the optimal hyper-parameters , and are obtained when the iterations converge in line 7. It is also important to note that the approximate posteriors of the sequence-level parameters (i.e., ) are learned as by-products of the E-steps.
The following two sections discuss the details of the E-step and M-step procedures in Algorithm 1 and give two different implementations.
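The alternating structure of the variational EM algorithm can be sketched as a generic loop. This is only a control-flow skeleton under assumptions: the callbacks `e_step` and `m_step`, the tolerance `tol` and the stopping rule on the change of the bound are hypothetical, and the toy callbacks at the end are stand-ins that merely exercise the loop; they are not the LDHMM updates.

```python
from typing import Any, Callable, List, Sequence, Tuple


def variational_em(
    sequences: Sequence[Any],
    hyper: Any,
    e_step: Callable[[Any, Any], Tuple[Any, float]],
    m_step: Callable[[Sequence[Any], List[Any]], Any],
    tol: float = 1e-6,
    max_iter: int = 100,
) -> Tuple[Any, List[Any], int]:
    """Alternate per-sequence E-steps and a database-level M-step."""
    previous = float("-inf")
    var_params: List[Any] = []
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # E-step: hyper-parameters fixed, one variational update per sequence.
        results = [e_step(seq, hyper) for seq in sequences]
        var_params = [r[0] for r in results]
        bound = sum(r[1] for r in results)
        # M-step: variational parameters fixed, update the hyper-parameters.
        hyper = m_step(sequences, var_params)
        # Stop when the bound no longer changes appreciably.
        if abs(bound - previous) < tol:
            break
        previous = bound
    return hyper, var_params, iteration


# Toy stand-ins: the "variational parameter" of a sequence is its mean, and
# the "hyper-parameter" is the average of those means.
def toy_e_step(seq: List[float], mu: float) -> Tuple[float, float]:
    q = sum(seq) / len(seq)
    return q, -(q - mu) ** 2


def toy_m_step(seqs: Sequence[List[float]], qs: List[float]) -> float:
    return sum(qs) / len(qs)


hyper, var_params, n_iter = variational_em(
    [[1.0, 3.0], [5.0, 7.0]], 0.0, toy_e_step, toy_m_step
)
```

The per-sequence variational parameters returned alongside the hyper-parameters correspond to the by-products of the E-step mentioned above.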
3.3 The E step: Variational Inference of Latent Variables
In this section, we provide the details of the E step, which is to estimate for () given the observed sequence and fixed hyper-parameters. This process is usually termed variational inference [3, 9, 13, 14].
Here we consider two different implementations of variational inference based on different decompositions of :
A fully-factorized (FF) form.
A partially-factorized (PF) form.
As shown in Figure 2, the FF form assumes:
As shown in Figure 2, the PF form assumes:
3.3.1 E Step: the FF Form
The variational inference process of the FF form yields the following iterations. Please refer to Appendix B.1 for the details of the derivation of the updating formulas.
Fixed and and , Update ()
is updated by:
() is updated by:
is updated by:
Fixed , , , Update
Fixed , and , Update
Fixed , and , Update
For simplicity, the E step for the FF form can be summarized in Procedure Estep-ff.
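The update formulas themselves are given in the paper's Appendix B.1 and are not reproduced here. As a hedged illustration of the ingredients that mean-field updates of this kind typically rely on, the sketch below implements two standard results for a Dirichlet variational factor: the expectation of the log of each component, and the conjugate update that adds expected counts to the prior parameters. The function names are hypothetical, the digamma function is approximated with a recurrence and an asymptotic series because the Python standard library does not provide it, and the exact LDHMM updates may differ in detail.

```python
import math
from typing import List


def digamma(x: float) -> float:
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 10.0:
        # Shift upward using psi(x) = psi(x + 1) - 1/x.
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x
    result -= inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result


def expected_log(gamma: List[float]) -> List[float]:
    """E[log theta_k] under Dir(gamma): psi(gamma_k) - psi(sum_j gamma_j)."""
    total = digamma(sum(gamma))
    return [digamma(g) - total for g in gamma]


def posterior_dirichlet(alpha: List[float], expected_counts: List[float]) -> List[float]:
    """Conjugate update: prior Dirichlet parameters plus expected counts."""
    return [a + c for a, c in zip(alpha, expected_counts)]


# Hypothetical example: flat prior over 2 states and soft counts from q(z).
gamma = posterior_dirichlet([1.0, 1.0], [2.0, 0.5])
log_weights = expected_log(gamma)
```

In a fully factorized scheme, quantities such as `log_weights` would feed the update of each hidden-state factor, and the resulting soft counts would feed back into the Dirichlet factors until convergence.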
3.3.2 E Step: the PF Form
The variational inference process of the PF form yields the following iterations. Please refer to Appendix B.2 for the details of the derivation of the updating formulas.
Fixed and and , Update
Fixed , , , Update
where and .
Fixed , and , Update
Fixed , and , Update
where , and .
The E-step for the PF form can also be summarized as a procedure similar to Procedure Estep-ff by replacing the corresponding updating formulas. We omit it here for conciseness.
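Because the PF form keeps the chain structure over the hidden states, inference over them is usually carried out with a forward-backward pass, as in variational Bayesian HMMs. The sketch below is a generic scaled forward-backward routine, not the paper's exact procedure. The arguments `init`, `trans` and `emit` are assumed non-negative weights; in a variational setting they would typically be exponentiated expected log-parameters and need not sum to one. The numbers in the usage example are hypothetical.

```python
import math
from typing import List, Tuple


def forward_backward(
    init: List[float],
    trans: List[List[float]],
    emit: List[List[float]],
    obs: List[int],
) -> Tuple[List[List[float]], float]:
    """Return per-step state marginals and the log normalizer of the chain."""
    K, N = len(init), len(obs)
    alpha = [[0.0] * K for _ in range(N)]
    beta = [[1.0] * K for _ in range(N)]
    scale = [0.0] * N
    # Forward pass with per-step rescaling to avoid underflow.
    for n in range(N):
        for k in range(K):
            if n == 0:
                prior = init[k]
            else:
                prior = sum(alpha[n - 1][j] * trans[j][k] for j in range(K))
            alpha[n][k] = prior * emit[k][obs[n]]
        scale[n] = sum(alpha[n])
        alpha[n] = [a / scale[n] for a in alpha[n]]
    # Backward pass reusing the same scaling constants.
    for n in range(N - 2, -1, -1):
        for k in range(K):
            beta[n][k] = sum(
                trans[k][j] * emit[j][obs[n + 1]] * beta[n + 1][j] for j in range(K)
            ) / scale[n + 1]
    marginals = [[alpha[n][k] * beta[n][k] for k in range(K)] for n in range(N)]
    log_norm = sum(math.log(s) for s in scale)
    return marginals, log_norm


init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 1]
marginals, log_norm = forward_backward(init, trans, emit, obs)
```

Each pass costs on the order of the sequence length times the square of the number of hidden states, which is the term that distinguishes the PF form from the FF form in the complexity discussion.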
3.3.3 Discussion of computational complexity
The computational complexity of the E-step, which approximately infers the posterior distributions of , and () given the hyper-parameters and the observed behaviors, is similar for the PF and FF forms. Specifically, the computational complexity of inferring the approximate posteriors of , and () is the same for the two forms, proportional to , and , respectively, where is the number of hidden states, is the number of E-step iterations, and N is the maximum length of all sequences. However, the computational cost of approximate inference of the posterior of () differs slightly between the two forms. The computational complexity for the PF form is proportional to while its counterpart for the FF form is proportional to . Thus, the overall computational complexities of the PF and FF forms are and , respectively. The two forms clearly have comparable computational cost, and the FF form is slightly faster.
3.4 The M Step: Fixed Variational Variables, Estimate Hyper-parameters
In this section, we provide the details of the M-step, which is to estimate hyper-parameters given the observed sequence and fixed variational variables for (). In particular, it maximizes the lower bound of the log-likelihood with respect to respective hyper-parameters as follows:
3.4.1 The FF Form
The updating equation is given by:
Procedure NewtonRaphson1 summarizes the above algorithm, which is an iterative process of updating the value of . To be more specific, at the beginning of each iteration, the variables are calculated by Equations 13-16 and is set to 1 in lines 2 and 3. Then line 4 updates by Equation 17 and line 5 checks whether the updated falls outside the feasible range. If so, line 6 reduces by a factor of 0.5 and line 7 updates again, repeating until it becomes valid. In line 8, is updated as for the next iteration.
The M-step can be summarized in Procedure Mstep.
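The step-halving safeguard described above can be illustrated on a one-dimensional toy problem. The sketch below is not the paper's update (Equations 13-17 are not reproduced here). It applies a Newton step to a hypothetical concave objective f(a) = c log(a) - a, whose maximizer is a = c, and halves the step whenever the candidate leaves the feasible range a > 0. The function name and the objective are assumptions made for illustration.

```python
from typing import Callable


def damped_newton_positive(
    gradient: Callable[[float], float],
    hessian: Callable[[float], float],
    start: float,
    tol: float = 1e-10,
    max_iter: int = 100,
) -> float:
    """Newton iterations for a positive scalar, halving steps that leave a > 0."""
    value = start
    for _ in range(max_iter):
        direction = -gradient(value) / hessian(value)  # Newton direction
        step = 1.0                                     # begin with a full step
        candidate = value + step * direction
        while candidate <= 0.0:                        # outside the feasible range
            step *= 0.5                                # reduce by a factor of 0.5
            candidate = value + step * direction
        if abs(candidate - value) < tol:
            return candidate
        value = candidate
    return value


c = 2.0  # hypothetical constant; the maximizer of c*log(a) - a is a = c
optimum = damped_newton_positive(
    gradient=lambda a: c / a - 1.0,
    hessian=lambda a: -c / (a * a),
    start=10.0,
)
```

Starting from 10, the full Newton step would land on a negative value, so the step is halved several times before a valid update is accepted; later iterations take full steps and converge quickly.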
3.4.2 The PF Form
The process is the same as the above process.
Section 4 Comparison to Related Models
In this section, we compare our proposed LDHMMs to existing models that can model sequential behaviors, and our aim is to show the key difference between them.
4.1 Hidden Markov Models
As mentioned in Section 1, HMMs ignore the differences in parameters between sequences and assume that all sequences share the same parameters , and , as shown in their graphical representation in Figure 3. By contrast, LDHMMs assume that the parameters representing the dynamics of each sequence are different and add hyper-parameters over these parameters to avoid over-fitting.
4.2 Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) [5] is a generative probabilistic model for a set of discrete data, which can be used for modeling the sequential behaviors considered in this paper. Its graphical representation is shown in Figure 3. It can be seen from the figure that LDA simply ignores the dynamics in the hidden state space, while LDHMMs consider the Markov relationships between hidden states.
4.3 Variational Bayesian HMMs (VBHMMs)
The graphical model of VBHMMs proposed in [2, 16] is shown in Figure 3. It differs from the proposed LDHMMs in two respects. First, VBHMMs assume that all sequences share the same parameters , and , while LDHMMs assume that each sequence has its own parameters , and to characterize its dynamics. Second, although VBHMMs use a variational inference algorithm similar to the PF form inference method described in Appendix B.2 to learn the posterior distribution of the parameters, they assume the hyper-parameters are known and do not provide any algorithm for learning them. By contrast, the proposed LDHMMs adopt a variational EM framework to learn the hyper-parameters as well as the posterior distribution of the parameters.
4.4 A Hidden Markov Model Variant (HMMV)
As shown in Figure 3, HMMV [4] has a graphical model very similar to that of LDHMMs. However, it assumes that all sequences share the same parameters and . By contrast, LDHMMs treat these parameters individually for each sequence, in order to capture the individual characteristics of each sequence. Another difference is that HMMV assumes known hyper-parameters, whereas we provide a variational EM-based algorithm to estimate them, as mentioned earlier.
Section 5 Empirical Study
In this section, we apply the proposed LDHMMs to several data mining tasks, namely sequential behavior modeling and sequence classification. To be more specific, we first use two publicly available data sets of web-browsing logs to study sequential behavior modeling. Second, we adopt a publicly available biological sequence data set to study the problem of sequence classification. All algorithms were implemented in MATLAB (the code will be made publicly available on the first author’s web site soon) and run on a 2.9GHz 20MB L3 Cache Intel Xeon E5-2690 (8 Cores) cluster node with 32GB 1600MHz ECC DDR3-RAM (Quad Channel), running a Red Hat Enterprise Linux 6.2 (64bit) operating system.
5.1 Sequential Behavior Modeling
5.1.1 Data Sets
The Entree Data Set
This data set (available at http://archive.ics.uci.edu/ml/datasets/Entree+Chicago+Recommendation+Data) records users’ interactions with the Entree Chicago restaurant recommendation system from September 1996 to April 1999. The sequential behaviors of each user are his/her interactions with the system, i.e., navigation operations. The characters L-T encode 8 navigation operations as shown in Table 2. We use a subset of 422 sequences whose lengths vary from 20 to 59.
Table 2: Navigation operations in the Entree data set.
| Code | Navigation operation |
|------|----------------------|
| L | browse from one restaurant to another in a list |
| M | search for a similar but cheaper restaurant |
| N | search for a similar but nicer one |
| P | search for a similar but more traditional one |
| Q | search for a similar but more creative one |
| R | search for a similar but more lively one |
| S | search for a similar but quieter one |
| T | search for a similar but different cuisine one |
The MSNBC Data Set
This data set (available at http://archive.ics.uci.edu/ml/datasets/MSNBC.com+Anonymous+Web+Data) describes the page visits of users who visited msnbc.com on September 28, 1999. Visits are recorded at the level of URL category and in temporal order. The 17 categories are ‘frontpage’, ‘news’, ‘tech’, ‘local’, ‘opinion’, ‘on-air’, ‘misc’, ‘weather’, ‘health’, ‘living’, ‘business’, ‘sports’, ‘summary’, ‘bbs’ (bulletin board service), ‘travel’, ‘msn-news’, and ‘msn-sports’. Each sequence in the data set corresponds to the page viewing behaviors of a user during that twenty-four hour period. Each behavior recorded in the sequence corresponds to the category of the page requested by the user. We use a subset of 31071 sequences whose lengths vary from 20 to 100.
5.1.2 Evaluation Metrics
We learned the LDHMMs with the two proposed learning forms (denoted as LDHMMs-ff and LDHMMs-pf) and the related models (i.e., HMMs, LDA, VBHMMs and HMMV) on the above two data sets to compare the generalization performance of these models. Our goal is to achieve high likelihood on a held-out test set. Thus, we computed the log-likelihood of a held-out test set to evaluate the models given the learned deterministic hyper-parameters/parameters. In particular, for LDHMMs, we first learned their deterministic database-level hyper-parameters according to Algorithm 1 using the training data; then approximately inferred the sequence-level parameters of the test data by applying Procedure Estep-ff with the learned hyper-parameters; and finally computed the log-likelihood of the test data as in Equation B.1 using the learned hyper-parameters and inferred parameters. For the other models, we used similar processes adapted to their learning and inference algorithms. A higher log-likelihood indicates better generalization performance. We performed 10-fold cross validation on the above two data sets. Specifically, we split the data into 10 folds. Each time we held out 1 fold of the data for testing and trained the models on the remaining 9 folds, and this process was repeated 10 times. We report the averaged results of the 10-fold cross validation in the following.
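The cross-validation protocol described above can be sketched as an index split. This is a minimal illustration, assuming a random shuffle with a fixed seed; the paper does not state how the folds were formed, and the function name `k_fold_indices` is hypothetical.

```python
import random
from typing import List, Tuple


def k_fold_indices(
    n_items: int, n_folds: int = 10, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return (train, test) index lists, holding out one fold at a time."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into folds of nearly equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        splits.append((train, test))
    return splits


# 422 is the number of Entree sequences used in the paper.
splits = k_fold_indices(422, n_folds=10)
```

For each split, a model would be trained on the `train` sequences and the held-out log-likelihood computed on the `test` sequences, with the ten results averaged at the end.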
5.1.3 Comparison of Log-likelihood on the Test Data Set
The results for different numbers of hidden states on the Entree data set are shown in Figure 4. As seen from the chart, LDHMMs-pf consistently performs best and LDHMMs-ff generally has the second best performance (it is occasionally slightly worse than HMMs). A similar trend can be observed in Figure 5. Both LDHMMs-ff and LDHMMs-pf perform better than the other models, while LDHMMs-pf has slightly better performance. This may be because the PF form gives a more accurate approximation on these data sets. In summary, the proposed LDHMMs have better generalization performance than the other models. To further validate the statistical significance of our experiments, we also performed a paired t-test (2-tailed) between LDHMMs-pf, LDHMMs-ff and the other models over the perplexity of the experimental results. All the resulting p-values are less than 0.01, which indicates that the improvements of LDHMMs over the other models are statistically significant.
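The test statistic behind such a paired comparison can be computed as follows. This sketch returns only the t statistic and the degrees of freedom; obtaining the two-tailed p-value requires the Student t distribution, which the Python standard library does not provide. The function name is hypothetical, and the per-fold scores in the usage example are made-up numbers, not results from the paper.

```python
import math
import statistics
from typing import List, Tuple


def paired_t_statistic(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for two matched samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    mean = statistics.mean(diffs)
    # Sample standard deviation; it is zero if all differences are equal,
    # in which case the statistic is undefined.
    sd = statistics.stdev(diffs)
    return mean / (sd / math.sqrt(len(diffs))), len(diffs) - 1


# Hypothetical per-fold scores for two models (illustrative values only).
model_a = [2.0, 4.0, 6.0]
model_b = [1.0, 2.0, 3.0]
t_value, dof = paired_t_statistic(model_a, model_b)
```

With 10-fold cross validation each model contributes ten matched scores, so the statistic would be compared against a t distribution with 9 degrees of freedom.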
5.1.4 Comparison of Computational time for the two forms
Since the computational complexity of the related models is much lower than that of the proposed model due to their simpler structures, here we focus on comparing the two proposed forms. Figure 6 shows the comparison of training time on the two data sets. Qualitatively speaking, the two approaches have similar computational time. Sometimes, however, the PF form is faster than the FF form, which seems to contradict our theoretical analysis in Section 3.3.3. In practice, the stopping criterion used in the EM algorithm may cause the iterations to stop earlier. Since the PF form may converge faster than the FF form does, it may need fewer E and M steps, and thus may finish sooner than the FF form in those cases.
5.1.5 Visualization of LDHMMs
It is also important to obtain an intuitive understanding of the complex model learned. LDHMMs have database-level hyper-parameters, (i.e., , and ), which can be seen as database-level characteristics of the sequences; sequence-level variational parameters (i.e., , and ), which can be seen as sequence-level characteristics of each individual sequence. To visualize LDHMMs, we plot Hinton Diagrams for these parameters, each of which is represented by a square whose size is associated with the magnitude. Figure 7 shows a sample visualization from the Entree data set when . The left diagrams represent , and from the top to the bottom; the right diagrams represent , and from the top to the bottom for a sample sequence from the data set. It is clear from the picture that the individual sequence displays slightly different characteristics from the whole database.
5.2 Sequence Classification
5.2.1 Data Sets
This data set (available at http://archive.ics.uci.edu/ml/datasets/Molecular+Biology+%28Splice-junction+Gene+Sequences%29) consists of 3 classes of DNA sequences. One class is made up of 767 sequences belonging to the exon/intron boundaries, referred to as the EI class; another class of 768 sequences belongs to the intron/exon boundaries, referred to as the IE class; the third class of 1655 sequences does not belong to either of the above classes, referred to as the N class.
5.2.2 Evaluation Metrics
We conducted 3 binary classification experiments (i.e., EI vs IE, EI vs N and IE vs N) using 10-fold cross validation. For each class , we learned a separate model of the sequences in that class, where is the number of training sequences of class . An unseen sequence was classified by picking. To eliminate the influence of , we varied its value and obtained the corresponding area under the ROC curve (AUC) [7], which is widely used for classification performance comparison.
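The AUC can be computed directly from per-sequence scores without fixing any threshold. The sketch below uses the rank-based definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. It assumes, for illustration, that each test sequence is scored by the difference between its log-likelihoods under the two class models; the function name and the scores in the usage example are hypothetical.

```python
from typing import List


def auc(scores: List[float], labels: List[int]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0      # correctly ordered pair
            elif p == q:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical scores, e.g. log p(x | class 1 model) - log p(x | class 0 model).
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
value = auc(scores, labels)
```

Sweeping a decision threshold over the scores traces out the ROC curve, so the AUC summarizes performance across all threshold settings at once.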
5.2.3 Comparison of AUC on the Test Data Set
Table 3 reports the averaged results of the 10-fold cross validation, with the best results for each number of hidden states in bold. Surprisingly, our proposed LDHMMs do not significantly dominate the other models. A possible explanation is that generative models are not optimized for classification, and thus more accurate modeling does not result in a significant improvement in classification performance. This problem may be alleviated by combining the model with a discriminative classifier, which is further discussed in Section 6. Nevertheless, LDHMMs have very competitive performance compared to the best models in all cases.
[Table 3: averaged AUC for the EI vs IE, EI vs N and IE vs N tasks by number of hidden states; values not available in this copy.]
Section 6 Conclusions and Future Work
Statistical modeling of sequential data has been studied for many years in machine learning and data mining. In this paper, we propose LDHMMs to characterize a database of sequential behaviors. Rather than assuming all the sequences share the same parameters as in traditional models, such as HMMs and VBHMMs, we explicitly assign sequence-level parameters to each sequence and database-level hyper-parameters to the whole database. The experimental results show that our model outperforms the other state-of-the-art models in predicting unseen sequential behaviors from web browsing logs and is competitive in classifying the unseen biological sequences.
In this paper, we assume that the observed sequences can only have one behavior at one time stamp, which is not practical in many application domains. For example, in the field of customer transaction analysis, one customer may buy several items at one time stamp. Thus, one possible future direction is to generalize LDHMMs to cater for the above scenarios. Additionally, it is also interesting to investigate the combination of our model with discriminative classifiers, such as support vector machine (SVM), to further improve the classification performance. This is because, similar to LDA, our model can be naturally seen as a dimensionality reduction method for feature extraction.
Section 7 Acknowledgements
We thank the anonymous reviewers.
[1] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the Eleventh International Conference on Data Engineering, pages 3–14. IEEE.
[2] M. J. Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, 2003.
[3] C. Bishop. Pattern recognition and machine learning. Information Science and Statistics. Springer, New York, 2006.
[4] S. Blasiak and H. Rangwala. A hidden Markov model variant for sequence classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, volume 2, pages 1192–1197. AAAI Press, 2011.
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
[6] M. Evans, N. Hastings, and B. Peacock. Statistical distributions. Measurement Science and Technology, 12(1):117, 2000.
[7] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
[8] Z. Ghahramani. On structured variational approximations. University of Toronto Technical Report, CRG-TR-97-1, 1997.
[9] Z. Ghahramani and M. Beal. Variational inference for Bayesian mixtures of factor analysers. Advances in Neural Information Processing Systems, 12:449–455, 2000.
[10] Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):831–864, 2000.
[11] J. Han, J. Pei, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proceedings of the 17th International Conference on Data Engineering, pages 215–224.
[12] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 8(1):53–87, 2004.
[13] T. Jaakkola and M. Jordan. Variational methods for inference and estimation in graphical models. PhD thesis, 1997.
[14] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[15] S. Kotz, N. Balakrishnan, and N. Johnson. Continuous multivariate distributions, models and applications, volume 334. Wiley-Interscience, 2000.
[16] D. MacKay. Ensemble learning for hidden Markov models. Technical report, Cavendish Laboratory, University of Cambridge, 1997.
[17] T. Minka. Estimating a Dirichlet distribution, 2000.
[18] P. K. Novak, N. Lavrac, and G. I. Webb. Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. The Journal of Machine Learning Research, 10:377–403, 2009.
[19] G. Piateski and W. Frawley. Knowledge discovery in databases. MIT Press, 1991.
[20] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Readings in Speech Recognition, 53(3):267–296, 1990.
[21] M. J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1):31–60, 2001.
Appendix A Distributions
A.1 Dirichlet Distribution
A.2 Multinomial Distribution
A 1-of- vector multinomial random variable ( and ) has the following probability distribution:
where the parameter is a -dimensional vector with components and .
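For reference, the two distributions in this appendix can be evaluated with the Python standard library. The sketch below uses the textbook forms: the Dirichlet log-density with its gamma-function normalizer, and the probability of a 1-of-K vector as the product of each parameter component raised to the corresponding entry of the vector. The function names are hypothetical, and the notation may differ from the paper's, whose formulas are not reproduced here.

```python
import math
from typing import List


def dirichlet_log_pdf(theta: List[float], alpha: List[float]) -> float:
    """Log-density of Dir(alpha) at a point theta on the probability simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))


def one_of_k_pmf(x: List[int], mu: List[float]) -> float:
    """Probability of a 1-of-K vector x under parameter mu: prod_k mu_k ** x_k."""
    prob = 1.0
    for xk, mk in zip(x, mu):
        prob *= mk ** xk
    return prob


# With a flat Dirichlet over 3 components the density is constant (equal to 2).
flat_density = math.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))
# The probability of selecting the second of three items.
second_item = one_of_k_pmf([0, 1, 0], [0.2, 0.5, 0.3])
```

Because exactly one entry of a 1-of-K vector equals one, the product reduces to the single parameter component selected by that entry.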
Appendix B Variational Inference
B.1 The FF Form
Here we expand the expression of for Variational Inference for the FF form. for , and are usually assumed to be Dirichlet distributions governed by parameters , and . is assumed to be multinomial distributions governed by (). Then Equation 2 can be expanded as follows:
Fixed and and , Update
As a functional of and add Lagrange multipliers: