Section 1 Introduction
In this paper we explore the problem of characterizing a database of sequential behaviors (i.e., sequences). An example of such sequential behaviors is the web browsing behavior of Internet users. Table 1 shows some user browsing data excerpted from the web server logs of msnbc.com. Each row of the table is an ordered list of discrete symbols, each of which represents one behavior made by a user; here a behavior is described by the category of the web page the user requested. For example, User 1 first browses a ‘frontpage’ page, then visits a ‘news’ page, followed by a ‘travel’ page and other pages denoted by dots. These form a typical behavioral sequence for one user, and other individuals have similar sequences. Over the last decade, many efforts have been made to characterize such sequential behaviors for further analysis.
Significant progress on behavior modeling has been made in the field of Sequential Pattern Mining (SPM). A pattern is an expression describing a subset of the data [19]. Sequential pattern mining, first introduced by Agrawal and Srikant [1], discovers frequently occurring behaviors or subsequences as patterns. Several algorithms, such as Generalized Sequential Patterns (GSP) [1], SPADE [21] and PrefixSpan [11, 12], have been proposed to mine sequential patterns efficiently. Generally speaking, SPM techniques aim at discovering comprehensible sequential patterns in data; they are descriptive [18] and lack well-founded theory for applying the discovered patterns to further data analysis tasks, such as sequence classification and behavior modeling.
In the statistical and machine learning communities, researchers characterize sequential behaviors using probabilistic models. Probabilistic models not only describe the generative process of the sequential behaviors but also have a predictive property that is helpful for further analytic tasks, such as predicting future behaviors. One representative and widely used model is the hidden Markov model (HMM) [20]. Usually, each sequence is modeled as an HMM; in other words, the dynamics of each sequence are represented by a list of deterministic parameters (i.e., an initialization prior vector and a transition matrix), and there is no generative probabilistic model for these numbers. This leads to several problems when we directly extend HMMs to model a database of sequences: (1) the number of parameters for the HMMs grows linearly with the number of sequences, which leads to serious overfitting, and (2) it is not clear how to assign probability to a sequence outside the training set. Although
[20] suggests a strategy for modeling multiple sequences, it simply ignores the differences in parameters between sequences and assumes that the dynamics of all sequences can be characterized by one set of deterministic parameters. This may alleviate overfitting to some extent, but at the same time it may overlook the individual characteristics of individual sequences, which can further degrade the accuracy of behavior modeling.

Table 1: Sample user browsing behaviors from msnbc.com.

User  Sequential Behaviors
1  frontpage  news  travel  …
2  news  news  news  …
3  frontpage  news  frontpage  …
4  frontpage  news  news  …
The goal of this paper is to characterize a database of sequential behaviors in a way that preserves the essential statistical relationships of each individual sequence and of the whole database, while avoiding overfitting. To achieve this goal, we propose a generative model with both sequence-level and database-level variables to model behavioral sequences comprehensively and effectively.
The paper is organized as follows: Section 2 formalizes the problem studied in this paper, and Section 3 describes the proposed approach. Section 4 then reviews the probabilistic models related to this paper. After that, experimental results on several data mining tasks over three real-world data sets are reported in Section 5. Finally, Section 6 concludes the paper and discusses possible future directions.
Section 2 Problem Statement
Here we use terminology such as “behaviors”, “sequences” and “database” to describe a database of sequential behaviors; this is helpful for understanding the probabilistic model derived on the data. It is important to note that the model proposed in this paper is also applicable to other sequential data of similar form. In this paper, vectors are denoted by lowercase bold Roman or Greek letters and are assumed to be column vectors unless otherwise stated; uppercase bold Roman letters denote matrices, and letters in other cases denote scalars.

A database is a collection of sequences.

A sequence is an ordered list of behaviors, ordered by the time at which the behaviors were made.

A behavior is the basic unit of sequential behaviors, defined to be a one-hot indicator vector representing an item from a behavior vocabulary. Each index represents one type of behavior, such as browsing a ‘travel’ web page as shown in Table 1.
Given a database of sequential behaviors, the problem of characterizing behaviors is to derive a probabilistic model which preserves the statistical relationships in the sequences and tends to assign high likelihood to “similar” sequences.
Section 3 The Proposed Model
3.1 The Graphical Model
The basic idea of Latent Dirichlet Hidden Markov Models (LDHMMs) is that the dynamics of each sequence are assumed to be reflected through a hidden Markov chain (each hidden state is a one-hot indicator vector, similar in form to a behavior, and the number of possible hidden states is usually set empirically), parameterized by a corresponding initial prior vector, a transition matrix and a state-dependent emission matrix. These sequence-level parameters can be seen as a lower-dimensional representation of the dynamics of the sequence. Their distributions across all sequences are in turn governed by database-level Dirichlet hyperparameters. To be more specific, for a database of sequential behaviors, the generative process is as follows (see Appendix A for details of the Dirichlet and multinomial distributions):

Generate hyperparameters , and .

For each sequence index ,

Generate , and

For the first time stamp in the sequence :

Generate an initial hidden state .

Generate a behavior from , a multinomial probability conditioned on the hidden state and .


For each of the other time stamps in the sequence:

Generate a hidden state from .

Generate a behavior from .
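The generative process above can be sketched as follows. This is a minimal illustration under assumed notation, not the authors' implementation: the names `alpha`, `beta`, `gamma` for the database-level Dirichlet hyperparameters and the function name are our own, and we treat the hyperparameters as fixed deterministic quantities, as in the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_sequence(alpha, beta, gamma, T):
    """One draw from the (assumed) LDHMM generative process.

    alpha: Dirichlet prior over the initial-state distribution (length K)
    beta:  K x K matrix whose rows are Dirichlet priors for transition rows
    gamma: K x V matrix whose rows are Dirichlet priors for emission rows
    """
    K, V = gamma.shape
    pi = rng.dirichlet(alpha)                          # sequence-level initial prior
    A = np.array([rng.dirichlet(b) for b in beta])     # sequence-level transition matrix
    B = np.array([rng.dirichlet(g) for g in gamma])    # sequence-level emission matrix
    states, behaviors = [], []
    z = rng.choice(K, p=pi)                            # initial hidden state
    for _ in range(T):
        states.append(z)
        behaviors.append(rng.choice(V, p=B[z]))        # emit behavior given state
        z = rng.choice(K, p=A[z])                      # transition to next state
    return states, behaviors
```

Each call first draws the sequence-level parameters from the database-level Dirichlets, then runs an ordinary HMM forward simulation with them.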


Accordingly, the graphical model of LDHMMs is shown in Figure 1. As the graph itself shows, there are three levels of modeling in LDHMMs. The hyperparameters are database-level variables, assumed to be sampled once in the process of generating the database. The sequence-level variables are sampled once per sequence. Finally, the hidden states and behaviors are behavior-level variables, sampled once for each behavior in each sequence.
3.2 Learning the Model
In this section, our goal is to learn the deterministic hyperparameters of the LDHMMs given a database, that is, to maximize the likelihood function given by
(1) 
Direct optimization of the above equation is difficult owing to the latent variables, so we instead optimize a lower bound obtained from Jensen's inequality as follows [3]:
(2) 
where the variational distribution approximates the posterior distribution of the latent variables and can be decomposed sequence by sequence; specifically, each factor is a variational distribution approximating the posterior for its corresponding sequence.
Then the lower bound of the likelihood function becomes a function of the variational distributions and the hyperparameters. Obtaining the optimal hyperparameters directly is still difficult because of the variational distributions, so we propose a variational EM-based algorithm for learning the hyperparameters of the LDHMMs, summarized in Algorithm 1 and guaranteed to increase the likelihood bound after each iteration [3]. More specifically, variational EM is a two-stage iterative optimization technique that alternates the E-step (optimization with respect to the variational distributions) and the M-step (optimization with respect to the hyperparameters) in lines 1 to 7. In each iteration, the E-step (lines 1-4) fixes the hyperparameters and optimizes the variational distribution of each sequence, while the M-step (line 5) fixes the variational distributions and optimizes the bound with respect to the hyperparameters. The optimal hyperparameters are obtained when the iterations converge in line 7. It is also important to note that the approximate posteriors of the sequence-level parameters are learned as by-products of the E-steps.
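The alternating structure of Algorithm 1 can be sketched generically as below. This is only a skeleton: `e_step` and `m_step` are placeholders standing in for the per-sequence variational inference of Section 3.3 and the hyperparameter updates of Section 3.4.

```python
def variational_em(data, hyper, e_step, m_step, max_iter=100, tol=1e-4):
    """Generic variational EM skeleton (a sketch, not the paper's code).

    e_step(seq, hyper) -> variational posterior for one sequence
    m_step(data, q)    -> (new hyperparameters, lower bound value)
    """
    bound_prev = -float("inf")
    q = []
    for _ in range(max_iter):
        # E-step: per-sequence variational posteriors given fixed hyperparameters
        q = [e_step(seq, hyper) for seq in data]
        # M-step: re-estimate hyperparameters given fixed variational posteriors
        hyper, bound = m_step(data, q)
        # the bound increases monotonically [3]; stop once the gain is small
        if bound - bound_prev < tol:
            break
        bound_prev = bound
    return hyper, q
```

Because each step maximizes the same lower bound with the other block of variables held fixed, the bound never decreases across iterations.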
The following two sections discuss the details of the E-step and M-step procedures in Algorithm 1 and present two different implementations.
3.3 The E step: Variational Inference of Latent Variables
In this section, we provide the details of the E-step, which estimates the variational posteriors for each sequence given the observed data and fixed hyperparameters; this process is usually termed variational inference [3, 9, 13, 14]. Here we consider two different implementations of variational inference based on different decompositions of the variational distribution:

A fully-factorized (FF) form.

A partially-factorized (PF) form.
As shown in Figure 2, the FF form assumes:
This is inspired by the standard mean-field approximation in [13, 14].
As shown in Figure 2, the PF form assumes:
(3) 
and no further assumption is made on the remaining factor. This is inspired by the approaches proposed in [16, 8, 2, 10], which preserve the conditional dependency between the hidden-state variables.
3.3.1 E Step: the FF Form
The variational inference process of the FF form yields the following iterative updates. Please refer to Appendix B.1 for the derivation of the update formulas.
Fixed and and , Update ()
is updated by:
(4) 
() is updated by:
(5) 
is updated by:
(6) 
Fixed , , , Update
(7) 
Fixed , and , Update
(8) 
Fixed , and , Update
(9) 
For simplicity, the E step for the FF form can be summarized in Procedure LABEL:Procedure:Estepff.
3.3.2 E Step: the PF Form
The variational inference process of the PF form yields the following iterative updates. Please refer to Appendix B.2 for the derivation of the update formulas.
Fixed and and , Update
Fixed , , , Update
(10) 
where and .
Fixed , and , Update
(11) 
Fixed , and , Update
(12) 
where , and .
The E-step for the PF form can also be summarized as a procedure similar to Procedure LABEL:Procedure:Estepff by substituting the corresponding update formulas; we omit it here for conciseness.
3.3.3 Discussion of computational complexity
The computational complexity of the E-step, which approximately infers the posterior distributions of the sequence-level parameters and hidden states given the hyperparameters and the observed behaviors, is similar for the PF and FF forms. Specifically, the cost of inferring the approximate posteriors of the sequence-level parameters is the same for the two forms, growing with the number of hidden states, the number of E-step iterations, and the maximum sequence length N. However, the cost of approximately inferring the posterior of the hidden states differs slightly between the two forms. Overall, the two forms have comparable computational cost, with the FF form slightly faster.
3.4 The M Step: Fixed Variational Variables, Estimate Hyperparameters
In this section, we provide the details of the M-step, which estimates the hyperparameters given the observed sequences and the fixed variational parameters. In particular, it maximizes the lower bound of the log-likelihood with respect to the respective hyperparameters as follows:
3.4.1 The FF Form
Update
Maximizing the bound with respect to this hyperparameter can be done with an iterative linear-time Newton-Raphson algorithm [5, 17]. Define the following variables:
(13) 
(14) 
(15) 
(16) 
The updating equation is given by:
(17) 
Procedure LABEL:Procedure:NewtonRaphson1 summarizes the above algorithm, an iterative process for updating the hyperparameter value. More specifically, at the beginning of each iteration, lines 2 and 3 compute the variables defined in Equations 13-16 and initialize the step size to 1. Line 4 then updates the value by Equation 17, and line 5 checks whether the updated value falls outside the feasible range; if so, line 6 reduces the step size by a factor of 0.5 and line 7 re-updates the value, repeating until it becomes valid. Line 8 then records the value for the next iteration.
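For concreteness, the linear-time Newton-Raphson update for a Dirichlet's parameters (following Minka [17]), with the step-halving safeguard described above, can be sketched as follows. This is our own illustrative implementation under assumed notation, not the paper's code: `logp_bar` denotes the averaged sufficient statistic (the mean of the log-components over the observed probability vectors), and the digamma/trigamma functions are approximated numerically from `math.lgamma` to keep the sketch self-contained.

```python
import math
import numpy as np

def _digamma(x, h=1e-5):
    # numerical psi(x) via a central difference of log-gamma (adequate here)
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def _trigamma(x, h=1e-4):
    # numerical psi'(x) via a second central difference of log-gamma
    return (math.lgamma(x + h) - 2.0 * math.lgamma(x) + math.lgamma(x - h)) / (h * h)

def dirichlet_mle_newton(logp_bar, n, max_iter=200, tol=1e-8):
    """Linear-time Newton-Raphson estimate of Dirichlet parameters [17].

    logp_bar: length-K mean of log p_k over the n observed probability vectors.
    The Hessian is diagonal-plus-rank-one, so the Newton step costs O(K).
    """
    K = len(logp_bar)
    alpha = np.ones(K)
    for _ in range(max_iter):
        a0 = alpha.sum()
        g = n * (_digamma(a0) - np.array([_digamma(a) for a in alpha]) + logp_bar)
        q = -n * np.array([_trigamma(a) for a in alpha])   # diagonal of the Hessian
        z = n * _trigamma(a0)                              # rank-one component
        b = (g / q).sum() / (1.0 / z + (1.0 / q).sum())
        step = (g - b) / q                                 # H^{-1} g in linear time
        eta = 1.0
        while np.any(alpha - eta * step <= 0):             # halve step until feasible,
            eta *= 0.5                                     # mirroring lines 5-7 above
        new_alpha = alpha - eta * step
        if np.max(np.abs(new_alpha - alpha)) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha
```

The step-halving loop plays the role of the feasibility check in the procedure: the Newton direction is kept, but the step is shrunk until all components stay positive.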
Update
Update
Similarly, the estimation of the remaining hyperparameters can be done by Procedure LABEL:Procedure:NewtonRaphson1 with the corresponding changes to Equations 13-17.
The Mstep can be summarized in Procedure LABEL:Procedure:Mstep.
3.4.2 The PF Form
The M-step of the PF form is identical to that of the FF form.
Section 4 Comparison to Related Models
In this section, we compare the proposed LDHMMs to existing models for sequential behaviors, highlighting the key differences between them.
4.1 Hidden Markov Models
As mentioned in Section 1, HMMs ignore differences in parameters between sequences and assume that all sequences share the same parameters, as shown in their graphical representation in Figure 3. By contrast, LDHMMs assume that the parameters representing the dynamics of each sequence are different and place hyperparameters over these parameters to avoid overfitting.
4.2 Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) [5] is a generative probabilistic model for a set of discrete data, which can be used for modeling sequential behaviors considered in this paper. Its graphical representation is shown in Figure 3. It can be seen from the figure that LDA simply ignores the dynamics in the hidden state space while LDHMMs consider the Markov relationships between hidden states.
4.3 Variational Bayesian HMMs (VBHMMs)
The graphical model of VBHMMs, proposed in [2, 16], is shown in Figure 3. The differences from the proposed LDHMMs are as follows. Firstly, VBHMMs assume all sequences share the same parameters, whereas LDHMMs give each sequence its own parameters to characterize its dynamics. Secondly, although VBHMMs use a variational inference algorithm similar to the PF-form inference method described in Section B.2 to learn the posterior distribution of the parameters, they assume the hyperparameters are known and provide no algorithm for learning them. By contrast, the proposed LDHMMs adopt a variational EM framework to learn the hyperparameters as well as the posterior distribution of the parameters.
4.4 A Hidden Markov Model Variant (HMMV)
As shown in Figure 3, HMMV [4] has a graphical model very similar to that of LDHMMs. However, it assumes that the sequences share certain parameters, whereas LDHMMs treat these parameters individually for each sequence to capture their individual characteristics comprehensively. Another difference is that HMMV assumes known hyperparameters, whereas we provide a variational EM-based algorithm to estimate them, as mentioned earlier.
Section 5 Empirical Study
In this section, we apply the proposed LDHMMs to several data mining tasks, such as sequential behavior modeling and sequence classification. To be more specific, we first use two publicly available data sets of web-browsing logs to study sequential behavior modeling; we then adopt a publicly available biological sequence data set to study sequence classification. All algorithms were implemented in MATLAB (the code will be made publicly available on the first author's web site) and run on a 2.9GHz Intel Xeon E5-2690 (8 cores, 20MB L3 cache) cluster node with 32GB 1600MHz ECC DDR3 RAM (quad channel), running the Red Hat Enterprise Linux 6.2 (64-bit) operating system.
5.1 Sequential Behavior Modeling
5.1.1 Data Sets
The Entree Data Set
This data set (available at http://archive.ics.uci.edu/ml/datasets/Entree+Chicago+Recommendation+Data) records users' interactions with the Entree Chicago restaurant recommendation system from September 1996 to April 1999. The sequential behaviors of each user are his/her interactions with the system, i.e., navigation operations. The characters L to T encode 8 navigation operations, as shown in Table 2. We use a subset of 422 sequences whose lengths vary from 20 to 59.
Code  Navigation Operations 

L  browse from one restaurant to another in a list 
M  search for a similar but cheaper restaurant 
N  search for a similar but nicer one 
P  search for a similar but more traditional one 
Q  search for a similar but more creative one 
R  search for a similar but more lively one 
S  search for a similar but quieter one 
T  search for a similar but different cuisine one 
The MSNBC Data Set
This data set (available at http://archive.ics.uci.edu/ml/datasets/MSNBC.com+Anonymous+Web+Data) describes the page visits of users who visited msnbc.com on September 28, 1999. Visits are recorded at the level of URL category (the 17 categories are ‘frontpage’, ‘news’, ‘tech’, ‘local’, ‘opinion’, ‘onair’, ‘misc’, ‘weather’, ‘health’, ‘living’, ‘business’, ‘sports’, ‘summary’, ‘bbs’ (bulletin board service), ‘travel’, ‘msn-news’, and ‘msn-sports’) and in temporal order. Each sequence in the data set corresponds to the page-viewing behaviors of one user during that twenty-four-hour period, and each behavior corresponds to the category of the page the user requested. We use a subset of 31,071 sequences whose lengths vary from 20 to 100.
5.1.2 Evaluation Metrics
We trained LDHMMs with the two proposed learning forms (denoted LDHMMs-ff and LDHMMs-pf) and the related models (i.e., HMMs, LDA, VBHMMs and HMMV) on the above two data sets to compare their generalization performance. Our goal is to achieve high likelihood on a held-out test set, so we computed the log-likelihood of a held-out test set given the learned deterministic hyperparameters/parameters. In particular, for LDHMMs, we first learned the deterministic database-level hyperparameters according to Algorithm 1 on the training data; then approximately inferred the sequence-level parameters of the test data by applying Procedure LABEL:Procedure:Estepff with the learned hyperparameters; and finally computed the log-likelihood of the test data as in Equation B.1 using the learned hyperparameters and inferred parameters. For the other models, we used analogous processes adjusted to their learning and inference algorithms. A higher log-likelihood indicates better generalization performance. We performed 10-fold cross validation: we split the data into 10 folds, each time holding out 1 fold for testing and training on the remaining 9 folds, repeating this process 10 times. We report the averaged results of the 10-fold cross validation in the following.
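The evaluation protocol above amounts to a standard 10-fold cross-validation loop over held-out log-likelihood. A model-agnostic sketch, where `fit` and `loglik` are hypothetical callables standing in for each model's training and scoring routines:

```python
import numpy as np

def cross_validated_loglik(sequences, fit, loglik, n_folds=10, seed=0):
    """Average held-out log-likelihood over n_folds folds (a generic sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(sequences))
    folds = np.array_split(idx, n_folds)
    scores = []
    for k in range(n_folds):
        test = [sequences[i] for i in folds[k]]
        train = [sequences[i] for j, f in enumerate(folds) if j != k for i in f]
        model = fit(train)                                  # e.g. Algorithm 1
        scores.append(sum(loglik(model, s) for s in test))  # held-out score
    return float(np.mean(scores))
```

For LDHMMs, `fit` would run the variational EM of Algorithm 1 and `loglik` would run the E-step inference on the held-out sequence before evaluating the bound.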
5.1.3 Comparison of Log-likelihood on the Test Data Set
The results for different numbers of hidden states on the Entree data set are shown in Figure 4. As seen from the chart, LDHMMs-pf consistently performs best and LDHMMs-ff generally performs second best (only slightly worse than HMMs in some cases). A similar trend can be observed in Figure 5: both LDHMMs-ff and LDHMMs-pf perform better than the other models, with LDHMMs-pf slightly ahead, presumably because the PF form yields a more accurate approximation on these data sets. In summary, the proposed LDHMMs have better generalization performance than the other models. To validate the statistical significance of our experiments, we also performed paired t-tests (2-tailed) between LDHMMs-pf, LDHMMs-ff and the other models over the perplexity results. All p-values are below 0.01, indicating that the improvements of LDHMMs over the other models are statistically significant.
5.1.4 Comparison of Computational Time for the Two Forms
Since the computational complexity of the related models is much lower than that of the proposed model due to their simpler structures, here we focus on comparing the two proposed forms. Figure 6 shows the training time on the two data sets. Qualitatively speaking, the two approaches take similar time, but the PF form is sometimes faster than the FF form, which seems to contradict our theoretical analysis in Section 3.3.3. In practice, however, the stopping criterion of the EM algorithm may end the iterations early; since the PF form may converge in fewer E and M steps than the FF form, it can finish faster in those cases.
5.1.5 Visualization of LDHMMs
It is also helpful to obtain an intuitive understanding of the learned model. LDHMMs have database-level hyperparameters, which can be seen as database-level characteristics of the sequences, and sequence-level variational parameters, which can be seen as characteristics of each individual sequence. To visualize LDHMMs, we plot Hinton diagrams for these parameters, in which each value is represented by a square whose size reflects its magnitude. Figure 7 shows a sample visualization from the Entree data set. The left diagrams represent the database-level hyperparameters from top to bottom; the right diagrams represent the corresponding sequence-level parameters for a sample sequence from the data set. It is clear from the picture that the individual sequence displays slightly different characteristics from the whole database.
5.2 Sequence Classification
5.2.1 Data Sets
This data set (available at http://archive.ics.uci.edu/ml/datasets/Molecular+Biology+%28Splicejunction+Gene+Sequences%29) consists of 3 classes of DNA sequences: 767 sequences at exon/intron boundaries (the EI class), 768 sequences at intron/exon boundaries (the IE class), and 1655 sequences belonging to neither of the above (the N class).
5.2.2 Evaluation Metrics
We conducted 3 binary classification experiments (i.e., EI vs IE, EI vs N and IE vs N) using 10-fold cross validation. For each class, we learned a separate model from the training sequences of that class. An unseen sequence was then classified by comparing the likelihoods assigned by the class models. To eliminate the influence of the decision threshold, we varied its value and computed the corresponding area under the ROC curve (AUC) [7], which is widely used for comparing classification performance.
5.2.3 Comparison of AUC on the Test Data Set
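Sweeping the threshold and measuring the area under the ROC curve is equivalent to computing the Mann-Whitney statistic over the two classes' scores (here the scores would be, for example, the log-likelihood ratios of the two class models). A small sketch with our own function name:

```python
import numpy as np

def auc_from_scores(scores_pos, scores_neg):
    """AUC via the Mann-Whitney statistic, equivalent to sweeping the
    decision threshold over the scores [7]. Ties count as one half."""
    s_pos = np.asarray(scores_pos, dtype=float)
    s_neg = np.asarray(scores_neg, dtype=float)
    greater = (s_pos[:, None] > s_neg[None, :]).sum()   # correctly ordered pairs
    ties = (s_pos[:, None] == s_neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(s_pos) * len(s_neg))
```

An AUC of 1.0 means the positive class's scores always exceed the negative class's; 0.5 corresponds to chance-level separation.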
Table 3 reports the averaged results of the 10-fold cross validation; the best result for each number of hidden states is in bold. Surprisingly, our proposed LDHMMs do not significantly dominate the other models. A possible explanation is that generative models are not optimized for classification, so more accurate modeling does not necessarily yield significantly better classification performance. This problem may be alleviated by combining the model with a discriminative classifier, which will be further discussed in Section 6. Nevertheless, LDHMMs are very competitive with the best models in all cases.
Dataset  Hidden States  LDA  HMM  HMMV  VBHMM  LDHMM-ff  LDHMM-pf
EI vs IE  2
EI vs IE  3
EI vs IE  4
EI vs N  2
EI vs N  3
EI vs N  4
IE vs N  2
IE vs N  3
IE vs N  4
Section 6 Conclusions and Future Work
Statistical modeling of sequential data has been studied for many years in machine learning and data mining. In this paper, we propose LDHMMs to characterize a database of sequential behaviors. Rather than assuming all sequences share the same parameters, as in traditional models such as HMMs and VBHMMs, we explicitly assign sequence-level parameters to each sequence and database-level hyperparameters to the whole database. The experimental results show that our model outperforms other state-of-the-art models in predicting unseen sequential behaviors from web browsing logs and is competitive in classifying unseen biological sequences.
In this paper, we assume that an observed sequence has exactly one behavior at each time stamp, which is not realistic in many application domains; for example, in customer transaction analysis, a customer may buy several items at one time stamp. One possible future direction is therefore to generalize LDHMMs to such scenarios. Additionally, it would be interesting to combine our model with discriminative classifiers, such as support vector machines (SVMs), to further improve classification performance, since, similar to LDA, our model can naturally be seen as a dimensionality-reduction method for feature extraction.
Section 7 Acknowledgements
We thank the anonymous reviewers.
References
[1] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the Eleventh International Conference on Data Engineering, pages 3-14. IEEE, 1995.
[2] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, 2003.
[3] C. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York, 2006.
[4] S. Blasiak and H. Rangwala. A hidden Markov model variant for sequence classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, volume 2, pages 1192-1197. AAAI Press, 2011.
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, 2003.
[6] M. Evans, N. Hastings, and B. Peacock. Statistical Distributions. Measurement Science and Technology, 12(1):117, 2000.
[7] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861-874, 2006.
[8] Z. Ghahramani. On structured variational approximations. Technical Report CRG-TR-97-1, University of Toronto, 1997.
[9] Z. Ghahramani and M. Beal. Variational inference for Bayesian mixtures of factor analysers. Advances in Neural Information Processing Systems, 12:449-455, 2000.
[10] Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):831-864, 2000.
[11] J. Han, J. Pei, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proceedings of the 17th International Conference on Data Engineering, pages 215-224. IEEE, 2001.
[12] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 8(1):53-87, 2004.
[13] T. Jaakkola and M. Jordan. Variational methods for inference and estimation in graphical models. PhD thesis, 1997.
[14] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.
[15] S. Kotz, N. Balakrishnan, and N. Johnson. Continuous Multivariate Distributions, Models and Applications, volume 334. Wiley-Interscience, 2000.
[16] D. MacKay. Ensemble learning for hidden Markov models. Technical report, Cavendish Laboratory, University of Cambridge, 1997.
[17] T. Minka. Estimating a Dirichlet distribution. Technical report, 2000.
[18] P. K. Novak, N. Lavrac, and G. I. Webb. Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. The Journal of Machine Learning Research, 10:377-403, 2009.
[19] G. Piateski and W. Frawley. Knowledge Discovery in Databases. MIT Press, 1991.
[20] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Readings in Speech Recognition, pages 267-296, 1990.
[21] M. J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1):31-60, 2001.
Appendix A Distributions
A.1 Dirichlet Distribution
A $K$-dimensional Dirichlet random variable $\theta$ (with $\theta_i \geq 0$ and $\sum_{i=1}^{K} \theta_i = 1$) has the following probability density [15]:

(18)  $p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}$,

where the parameter $\alpha$ is a $K$-dimensional vector with components $\alpha_i > 0$, and $\Gamma(\cdot)$ is the Gamma function.
A.2 Multinomial Distribution
A 1-of-$K$ multinomial random variable $x$ (with $x_i \in \{0, 1\}$ and $\sum_{i=1}^{K} x_i = 1$) has the following probability distribution [6]:

(19)  $p(x \mid \mu) = \prod_{i=1}^{K} \mu_i^{x_i}$,

where the parameter $\mu$ is a $K$-dimensional vector with components $\mu_i \geq 0$ and $\sum_{i=1}^{K} \mu_i = 1$.
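For reference, the log-densities of the two distributions above (Equations 18 and 19), which appear repeatedly in the variational bound, can be evaluated directly. A small sketch with our own function names, assuming numpy arrays:

```python
import math
import numpy as np

def dirichlet_logpdf(theta, alpha):
    """log p(theta | alpha) for the Dirichlet density of Equation 18."""
    return (math.lgamma(alpha.sum())
            - sum(math.lgamma(a) for a in alpha)
            + ((alpha - 1.0) * np.log(theta)).sum())

def multinomial_logpmf(x, mu):
    """log p(x | mu) for a 1-of-K indicator vector x, as in Equation 19."""
    return (x * np.log(mu)).sum()
```

With a uniform prior (all Dirichlet parameters equal to 1) the density is constant, equal to the inverse volume of the simplex, which gives a quick sanity check.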
Appendix B Variational Inference
B.1 The FF Form
Here we expand the lower-bound expression for variational inference in the FF form. The variational distributions over the sequence-level parameters are assumed to be Dirichlet distributions, and the variational distribution over the hidden states is assumed to be multinomial. Then Equation 2 can be expanded as follows:
where
(20) 
(21) 
(22) 
(23) 
(24) 
(25) 
(26) 
(27) 
(28) 
(29) 
Fixed and and , Update
Viewing the bound as a functional of the variational parameter and adding Lagrange multipliers to enforce normalization:
(30) 
Setting the derivative to zero yields the maximizing value of the variational parameter as in Equation 4. Similarly, the update equations for the remaining variational parameters can be obtained as Equations 5 and 6.
Using a similar technique as above, we can fix the other variables in turn and derive the remaining update equations.