Latent Theme Dictionary Model for Finding Co-occurrent Patterns in Process Data

by Guanhua Fang, et al.

Process data, temporally ordered categorical observations, are of recent interest due to their increasing abundance and the desire to extract useful information from them. A process is a collection of time-stamped events of different types, recording how an individual behaves in a given time period. Process data are too complex, in terms of size and irregularity, for classical psychometric models to be applicable, at least directly, and, consequently, it is desirable to develop new ways for modeling and analysis. We introduce herein a latent theme dictionary model (LTDM) for processes that identifies co-occurrent event patterns and individuals with similar behavioral patterns. Theoretical properties are established under certain regularity conditions for the likelihood-based estimation and inference. A non-parametric Bayes LTDM algorithm using the Markov chain Monte Carlo method is proposed for computation. Simulation studies show that the proposed approach performs well in a range of situations. The proposed method is applied to an item in the 2012 Programme for International Student Assessment with interpretable findings.




1 Introduction

Process data are temporally ordered data with categorical observations. Such data are ubiquitous in e-commerce (online purchases), social networking services, and computer-based educational assessments. In large-scale computer-based tests, analyzing process data has gained much attention and has become a core task in the next generation of assessment; see, for example, the 2012 and 2015 Programme for International Student Assessment (PISA) (OECD, 2014b, 2016), the 2012 Programme for International Assessment of Adult Competencies (PIAAC) (Goodman et al., 2013), and Assessment and Teaching of 21st Century Skills (ATC21S) (Griffin et al., 2012). In such technology-rich tests, there are problem-solving items which require the examinee to perform a number of actions before submitting final answers. These actions and their corresponding times are sequentially recorded and saved in a log file. Such log file data can provide extra information about an examinee's latent traits that is not available in traditional paper-based tests, in which only the final responses (correct/incorrect) are used.

Like item response theory (IRT; Lord, 2012) models and diagnostic classification models (DCMs; Templin et al., 2010), it is important to study personal traits and calibrate items when analyzing process data. However, process data are much more complicated in the sense that events happen at irregular time points and the event sequence length varies from examinee to examinee. Distinct examinees may have different reaction speeds and tend to perform different actions to complete the task. In addition, some examinees may find the optimal strategy in a very short time, while others may explore the item for longer before reaching the final answer. Therefore, we need to classify examinees into multiple groups based on their distinct behaviors. Moreover, some events may often appear sequentially; that is, the existence of co-occurrent event patterns is another important feature of process data. Thus, we need to extract key event patterns, which may help us understand students' learning status better.

There is a recent literature on statistical analysis of process data in terms of classification and feature extraction.

Suh et al. (2010) and Hong et al. (2011) worked on information cascades, extracting a set of features describing past information and feeding these features into a machine learning classifier.

Luo et al. (2015) considered a multidimensional Hawkes process, inducing structural constraints on the triggering functions to obtain low-rank feature representations. Lian et al. (2015) proposed a hierarchical Gaussian process model using past events as kernel features to allow flexible arrival patterns. He and von Davier (2016) performed a study to extract and detect robust sequential action patterns associated with success or failure on a problem-solving item. Xu and Zha (2017) introduced a Dirichlet mixture Hawkes process model for clustering individuals based on raw event sequences. Xu et al. (2018) considered a latent class model for analyzing student behaviors in solving computer-based tasks. Qiao and Jiao (2018) applied various classification methods to a dataset from PISA 2012 to improve classification accuracy.

Despite these efforts, statistical modeling and analysis of process data is still in its infancy. Most classification methods for process data are over-simplified, ignoring event sequentiality and time irregularity. The feature extraction methods are based on counts and hence lack statistical interpretation. Moreover, very little has been done in terms of systematically exploring the structure of process data, especially in simultaneously classifying individuals and extracting features.

This paper proposes a latent theme dictionary model (LTDM). It is a latent class-type model with two layers of latent structure: an underlying latent class structure for examinees and a latent association pattern structure for event types. To incorporate the temporal nature of the data, a survival time model with both event-specific parameters and a person-specific frailty is assumed for the gap times between two consecutive events. The challenging issue of model identifiability is dealt with by using a special dictionary structure and Kruskal's fundamental result on the unique decomposition of three-dimensional arrays (Kruskal, 1977). We propose a non-parametric Bayes algorithm (NB-LTDM), which can be used to identify the pattern dictionary, classify individuals, and estimate model parameters simultaneously.

The rest of the paper is organized as follows. In Section 2, we describe process data and introduce two useful statistical models, the latent class model (LCM) and the theme dictionary model (TDM). In Section 3, we propose a new latent theme dictionary model, which combines LCM and TDM and incorporates the time structure. In Section 4, we provide theoretical results regarding model identifiability and estimation consistency. In Section 5, we discuss computational issues and propose the NB-LTDM algorithm. Simulation results are given in Section 6. In Section 7, we apply the proposed method to a real data set and present interpretable findings. Finally, concluding remarks are given in Section 8.

2 Preliminary

2.1 Process Data

The process data here refer to a sequence of ordered events (actions) coupled with time stamps. For an examinee, his/her observed data may be denoted by , , where is the th event and is its corresponding time stamp. We have and , where is the set of all possible different event types. For notational simplicity, we write and .

We use the “Traffic” item from PISA 2012 (OECD, 2014b) as a motivating example to illustrate various concepts and notation. PISA is a worldwide study that evaluates the educational performance of different countries and economies. The “Traffic” item asks the examinee to operate on a computer to complete the task, i.e., to locate a meeting point that is within 15 minutes of three places: Silver, Lincoln, and Nobel. There are two correct answers, “Park” and “Silver”, for this task. Figure 1 shows the initial state of the computer screen. There are 16 destinations and 23 roads on the map. The examinee can click a road to highlight it, re-click a highlighted road to unhighlight it, and use the “RESET” button to remove all highlighted roads. The “Total Time” box shows the time for traveling on the highlighted roads; once a road is clicked, the corresponding time is added to this box. Each action and its corresponding time are sequentially saved in the log file during the process of completing the item. A typical example of the action process of one specific examinee and its cleaned version are shown in Tables 4 and 5. In this case, . After removing unneeded rows (“START_ITEM”, “END_ITEM”, “Click”, “SELECT”), we can see that there are 16 meaningful actions performed by this examinee, as listed in Table 5. His/her observed data are

Figure 1: Map for the Traffic item interface; the big blue numbers are the road labels and the small black numbers represent the travel times on those roads.

2.2 Latent Class Model and Theme Dictionary Model

To set the stage for proposing our new model, we first introduce two related, simpler models: the latent class model (LCM; Gibson, 1959) and the theme dictionary model (TDM; Deng et al., 2014). LCM is widely used in clinical trials, psychometrics, and machine learning (Goodman, 1974; Vermunt and Magidson, 2002; Templin et al., 2010). It relates a set of observed variables to a latent variable taking discrete values. The latent variable is often used to indicate the class label. LCM assumes a local independence structure, i.e.,

where are observed variables and is a discrete latent variable with density . Then, the joint marginal distribution of is
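The LCM marginal just stated is a finite mixture: a class is drawn with some probability, and the observed variables are conditionally independent given the class. The following minimal sketch computes that marginal; the parameter layout (`pi` for class probabilities, `cond[c][j][v]` for the conditional probability of variable j taking value v in class c) is our own illustrative choice, not the paper's notation:

```python
def lcm_marginal(x, pi, cond):
    # P(X = x) = sum_c pi[c] * prod_j P(X_j = x_j | class c)
    # pi: class probabilities; cond[c][j][v]: P(X_j = v | class c)
    total = 0.0
    for c, w in enumerate(pi):
        prob = w
        for j, v in enumerate(x):
            prob *= cond[c][j][v]
        total += prob
    return total
```

For two binary observed variables and two classes, the four marginal probabilities sum to one, as they must for a proper mixture.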

TDM (Deng et al., 2014) deals with observations called events. It is used for identifying associated event patterns. The problem of finding event associations is also known as market basket analysis, which has been a popular research topic since the 1990s (Piatetsky-Shapiro, 1991; Hastie et al., 2005; Chen et al., 2005). Under TDM, a pattern is a combination of several events. A collection of distinct patterns forms a dictionary, . A sentence, , is a combination of patterns, and a collection of multiple sentences forms a document. In TDM, we only observe the sentence , but do not know which patterns it consists of. That is to say, could be divided into associations of multiple patterns in many different ways. Each possible association is called a separation of . We use the following simple example to illustrate this.

Example 1

There are three event types, and . The patterns are which forms a dictionary. Suppose is one observed sentence. According to , all possible separations for are , and .

TDM does not take event ordering into account. For example, is the same as . Consequently, patterns are also unordered. For instance, patterns and are viewed as the same. TDM postulates that a pattern appears in a sentence at most once. Let

be the probability of pattern that appears in a sentence. The probability distribution of one separation

for observation is defined to be

Since the separation is not observed, the marginal probability of is

where is the set of all separations for . Furthermore, sentences are assumed to be independent of each other, i.e., for , the probability could be written as
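As a small computational sketch of the marginal above: given the enumerated separations of a sentence and per-pattern appearance probabilities, the marginal sums, over separations, the probability that exactly the patterns in that separation appear (each pattern appearing independently and at most once). The dictionary entries and probabilities below are hypothetical illustrations, not values from the paper:

```python
def tdm_marginal(separations, p):
    # p: dict mapping pattern -> probability the pattern appears in a sentence
    # P(separation) = prod_{pat in sep} p[pat] * prod_{pat not in sep} (1 - p[pat])
    # marginal P(sentence) = sum over all separations of the sentence
    total = 0.0
    for sep in separations:
        prob = 1.0
        for pat in p:
            prob *= p[pat] if pat in sep else (1.0 - p[pat])
        total += prob
    return total
```

For instance, with two single-event patterns each appearing with probability 0.5, a sentence admitting the two singleton separations has marginal probability 0.25 + 0.25 = 0.5.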


3 Latent Theme Dictionary Model

In this section, we propose the latent theme dictionary model (LTDM). We treat the whole event process as a sequence of sentences, where each sentence is an ordered subsequence of events. In doing this, we effectively reduce the raw data length by splitting the original long sequence into multiple shorter subsequences. This complexity reduction enables us to model sentences rather than the entire, more complicated process.

Specifically, we assume and can be divided into sentence sequence, i.e. and , where

are called the event sentence and the time sentence, respectively. We assume that some events often appear together, which can be viewed as event patterns. We use to represent a pattern in which events appear sequentially. The length of this pattern is , and it will be called an -gram. An event sentence can be represented as a sequence of patterns,

where is the number of patterns that contains. Note that an event sentence can be separated into different pattern sequences. We use to denote the set of all possible pattern separations for . We let be the number of distinct event patterns of length . (A pattern of length is also viewed as an event.) We call the pattern dictionary, i.e., the set of all distinct patterns, and write for its cardinality. Obviously, . is the maximum length of patterns. Next, we use a toy example to illustrate the relation between an event sentence and its separations.

Example 2

Assume that there are three events, and , and a pattern dictionary . Let be a sentence. According to , all possible separations for would be , and . Hence, .

Note that, in our setting, a separation is an ordered sequence of patterns. Specifically, separation is in but separation is not.
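Enumerating the ordered separations of an event sentence against a dictionary can be done with a short recursion: try every pattern that matches a prefix of the sentence, then recurse on the remainder. The events and dictionary in the usage example below are hypothetical stand-ins, since the concrete symbols of Example 2 are not reproduced here:

```python
def separations(sentence, dictionary):
    # Enumerate all ordered pattern separations of an event sentence.
    # sentence: tuple of events; dictionary: list of patterns (tuples of events).
    if not sentence:
        return [[]]  # one way to separate the empty sentence
    out = []
    for pat in dictionary:
        k = len(pat)
        if tuple(sentence[:k]) == tuple(pat):
            for rest in separations(sentence[k:], dictionary):
                out.append([tuple(pat)] + rest)
    return out
```

For example, with dictionary {(A), (B), (C), (A,B)}, the sentence (A, B, C) admits exactly two ordered separations: (A)(B)(C) and (A,B)(C).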

Suppose that the entire population consists of different classes of people, but their class labels are unknown. We use variable to denote the latent class to which the person belongs. We make the usual local independence assumption, i.e.,


This leads to


where is the probability of th latent class. Since each observed event sentence may have different pattern separations (i.e. two different pattern sequences can lead to the same event sequence), we have


It follows


for each sentence . Additionally, we assume the event and time are conditionally independent given the separation and latent class label. It leads to


We further assume


where and is the number of patterns in ; and we assume


By (5) - (8) and , we have


For modeling time, we specify

We further assume that and are person-specific random effects.


Here, we view as the within-pattern intensity and as the between-pattern intensity. If an examinee performs consecutive actions faster when the actions are within the same pattern, then the within-pattern intensity is larger than the between-pattern intensity. Lastly, we assume that the number of sentences for each subject follows a Poisson distribution, i.e.,


In a nutshell, LTDM is a data-driven model for learning event patterns and population clusters simultaneously. Under this model, the whole event sequence is viewed as a set of sentences, and each sentence is an ordered sequence of patterns. An important set of parameters, , measures how often people from different classes use distinct patterns.
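The generative story assembled above (draw a latent class, a Poisson number of sentences, a set of patterns per sentence, and exponential gap times with within- versus between-pattern intensities scaled by person-specific gamma frailties) can be sketched as below. The gamma shapes, the exponential gap-time form, and the rule that a pattern's first event uses the between-pattern intensity are assumptions filling in details the text leaves unspecified:

```python
import numpy as np

def simulate_examinee(pi, p, dictionary, lam_w, lam_b, mu, rng):
    # pi[c]: class probabilities; p[c][k]: prob. that class c uses pattern k
    c = rng.choice(len(pi), p=pi)
    n_sent = rng.poisson(mu)                  # number of sentences
    theta_w = rng.gamma(2.0, 1.0)             # person-specific frailties
    theta_b = rng.gamma(2.0, 1.0)             # (gamma shapes are assumed)
    events, times, t = [], [], 0.0
    for _ in range(n_sent):
        # each pattern appears in a sentence at most once, independently
        used = [k for k in range(len(dictionary)) if rng.random() < p[c][k]]
        for k in rng.permutation(used):
            for j, e in enumerate(dictionary[k]):
                # first event of a pattern: between-pattern intensity;
                # later events of the pattern: within-pattern intensity
                rate = (lam_b * theta_b) if j == 0 else (lam_w * theta_w)
                t += rng.exponential(1.0 / rate)
                events.append(e)
                times.append(t)
    return c, events, times
```

The output is one examinee's latent class together with the simulated event and time sequences, which could then be split back into sentences.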

Next, we construct the likelihood function for LTDM. We use to denote the total number of subjects and the subscript to denote the th individual. We assume that examinees are independent of each other. Then, the complete likelihood of has the following expression,

Here, we use and to represent the gamma priors for random effects and respectively and . Furthermore, by summing over / integrating out the unobserved latent variables, we have


4 Identifiability

Latent class models often face identifiability issues. There is an existing literature on the identifiability of latent variable models; see Allman et al. (2009), Xu (2017), and the references therein. The theme dictionary model also has an identifiability issue, as seen in the example in Section 3. In this section, we address identifiability of the dictionary and the model parameters in the proposed model.

We first define the model space ,

where , , is a simplex which is .

  A0. For each class and each event , there exists a positive integer such that any sentence consisting of events in , with length longer than , is not a sub-sentence of any sentence in . Here, is the set of all units.

We next introduce the definition of model identifiability.

Definition 1

We say a LTDM is identifiable, if for any model that satisfies

and , , we must have

Here, we use superscript to denote the true model (parameters/dictionary). We use to denote the set of all possible sentences generated by model , that is, where is the set of sentences generated by the pattern set of Class , . means equals up to a permutation of class labels.

We want to point out that in general is too large to be identifiable without additional constraints. In other words, we need to identify a restricted model space such that almost every is identifiable. This is also known as generic identifiability. For this purpose, we need the following conditions.

  A1. For each pattern in dictionary , at least one of the following holds: (1) ; (2) when .

  A2. There exists a partition of 1-grams such that for any , and , sentences and admit only one separation. The cardinalities of the three sets satisfy , , and .

  A3. For any -gram with , there exists a subset of 1-grams satisfying: (1) for any , sentence does not admit other separations containing an -gram or -gram other than ; (2) the cardinality of is greater than or equal to .

We let and define to be

With these, we have the following theorem.

Theorem 1

Under Conditions A0 - A3, almost every model in is identifiable.

Here, we take the measure over to be the product measure, with the Lebesgue measure on the corresponding parameter space and the counting measure on . One immediate result, stated in Corollary 1, is that if a dictionary satisfies Condition A1, then no other dictionary has the same . This also implies the identifiability of TDMs, since a TDM can be non-identifiable without additional assumptions; see the example in the discussion.

Corollary 1

For the 1-class case, if satisfies Condition A1, then for any .

Assuming and are known, we define the maximum likelihood estimator

where denotes and is a compact subset of . We further assume the following.

  A4. and for .

Under Condition A4 together with the identifiability conditions, the true dictionary and model parameters can be estimated consistently.

Theorem 2

Under Conditions A0 - A4, we have that

for any as goes to infinity, where is some permutation of the latent class labels.

We end this section by commenting on the conditions. Condition A0 (which can be removed if we consider only the 1-class latent model) restricts the model space. It says that a proper model is required to have, to some extent, different patterns across different classes. For example, suppose that people from Class 1 have patterns and people from Class 2 only have pattern . Then, we cannot distinguish Class 2 from Class 1, since the sentence set of Class 2 is a subset of that of Class 1. Conditions A1 - A3 pertain to the underlying pattern dictionary. Specifically, Condition A1 puts constraints on patterns to avoid having too many repeated events; as a result, the pattern space is reduced. Consider the case where there are two dictionaries, and . Obviously, these two dictionaries could generate the same sentence set while . Condition A2 puts restrictions on 1-grams such that not all combinations of 1-grams are considered patterns, which ensures that the pattern frequencies can be identified. Condition A3 requires that each -gram does not overlap with other patterns to some extent and thus can be identified. Lastly, Condition A4 is natural: it requires that each pattern appears frequently in at least one of the classes and that no class has size zero.

5 Computational Method

Although LTDM postulates a parametric form, the cardinality of the true dictionary () and the number of latent classes () are unknown in practice. Therefore, three challenges remain in terms of computation: (1) finding the true underlying patterns; (2) clustering people into the right groups; and (3) computational complexity. We propose a novel non-parametric Bayes LTDM (NB-LTDM) algorithm, described below, to address these issues.

  • NB-LTDM Algorithm

  • Initialization: Randomly choose a large ; sample personal latent labels

    from the uniform distribution on

    ; sample parameters uniformly on [0,1]; sample from the Dirichlet distribution; sample from , and sample all augmented variables accordingly. For the initial dictionary , we put all 1-grams into the dictionary and randomly add -grams, for each .

  • Output: : the number of classes, : the dictionary, estimates for other parameters.

  • The algorithm takes the following iterative steps until Markov chain becomes stable.

    • For each latent class, we calculate the most frequently used -grams within that group. Among those, we keep the top patterns for each -gram.

    • Add the above new grams to the current dictionary and remove repeated ones.

    • [Split] Split the event sequences according to the current dictionary.

    • [Sample] Sample split for each event sequence from the corresponding possible candidates.

    • [Inner part] Use slice Gibbs sampling schemes to update the following variables:

      • Update parameters , update augmented variables, update separations , update latent labels , update parameter , update , update prior parameters.

    • Prune dictionary: for each action pattern in the current dictionary, calculate the evidence probability . Discard those patterns with evidence probability smaller than .
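The pruning step above can be sketched as follows, assuming the evidence probability of a pattern is estimated by the fraction of retained MCMC draws in which the pattern is used in at least one sampled separation (a simplification of the actual computation):

```python
def prune_dictionary(dictionary, appear_counts, n_draws, eps):
    # Keep a pattern only if its estimated evidence probability
    # (fraction of draws in which it appears) exceeds the threshold eps.
    return [pat for pat in dictionary
            if appear_counts.get(pat, 0) / n_draws > eps]
```

A pattern never used in any draw has evidence zero and is discarded, while frequently used patterns survive to the next iteration.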

The NB-LTDM algorithm is an exploratory method. It assumes a stick-breaking prior (Sethuraman, 1994) on the latent class probabilities to avoid specifying the number of classes beforehand. It trims the dictionary by keeping patterns with high evidence levels and discarding those with weak signals. Parameters are estimated by a Markov chain Monte Carlo (MCMC) method together with a slice sampler (Dunson and Xing, 2009). This relieves us from computing the marginal likelihood directly, which would require massive computation to integrate out the latent variables and . The tuning parameter is a threshold that filters out less frequent patterns; is the initial number of -grams; is a searching parameter that controls the number of new patterns added to the current dictionary. The inner part consists of the main loop for updating parameters using the non-parametric Bayes method; the details are provided in Appendix B. The final number of latent classes is estimated as with threshold . The estimated pattern dictionary is . We use posterior means for the other parameters.
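The stick-breaking prior mentioned above can be sampled in truncated form as in this sketch; the truncation level K and concentration parameter alpha are illustrative choices, not the paper's settings:

```python
import numpy as np

def stick_breaking(alpha, K, rng):
    # Truncated stick-breaking construction (Sethuraman, 1994):
    # v_k ~ Beta(1, alpha), pi_k = v_k * prod_{j<k} (1 - v_j)
    v = rng.beta(1.0, alpha, size=K)
    pi = np.empty(K)
    remaining = 1.0
    for k in range(K):
        pi[k] = v[k] * remaining
        remaining *= 1.0 - v[k]
    pi[-1] += remaining   # fold the leftover stick into the last atom
    return pi
```

With a moderate alpha, most of the mass concentrates on the first few atoms, which is what lets the sampler effectively choose a small number of classes without fixing it in advance.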

6 Simulation Study

We provide the simulation results in this section. In particular, we consider two simulation settings which are described as follows.

  1. In the first simulation setting, we consider a dictionary consisting of -grams, -grams and -grams. We set , , and set , . Other model parameters are set as follows: , , . Pattern probabilities are provided in Table 1.

  2. In the second simulation setting, we consider a dictionary including patterns up to -grams. We let , , and set , . Other model parameters are set as follows: , , . Pattern probabilities are provided in Table 2.

Class   1-10: 1-gram   11-20: 1-gram   21-30: 2-gram   31-40: 2-gram   41-45: 3-gram   46-50: 3-gram
1       0.3            0               0.2             0               0               0
2       0              0.3             0               0.2             0               0
3       0.2            0.2             0.05            0.05            0.001           0.001
4       0.05           0.05            0               0               0.3             0
5       0              0               0.03            0.03            0               0.3

Table 1: Simulation setting 1

Class   1-15: 1-gm   15-30: 1-gm   31-35: 2-gm   36-50: 2-gm   50-60: 2-gm   61-70: 3-gm   71-75: 3-gm   76-85: 4-gm   86-90: 4-gm
1       0.15         0             0             0             0             0.06          0.06          0             0
2       0            0.15          0.06          0.06          0.06          0             0             0             0
3       0.05         0.05          0.05          0.001         0.001         0.05          0.001         0.001         0.001
4       0            0             0.03          0.03          0             0             0             0.05          0
5       0.04         0.04          0             0             0             0             0             0             0.1

Table 2: Simulation setting 2

For each setting, we run 50 replications. We let the threshold be 0.05 and 0.01 for settings 1 and 2, respectively. We set in the first 20 iterations and afterwards. The simulation results are provided in Table 3. On average, more than 96 percent of the true patterns are correctly identified in both settings. The false recovery rates are around 11 percent and 7 percent under the two settings, respectively. The 3- and 4-gram hitting rates are all above 89 percent, which shows that our algorithm can find -grams well even for large . Clustering results are also shown in Table 3. The root mean square errors are small relative to the true parameters, indicating that the estimated class sizes are very close to the true ones. Overall, the proposed method performs well in both simulation studies.
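The recovery metrics reported in Table 3 can be computed along these lines; this is a plausible reading of "correct recovery" and "false recovery" as set overlaps between the true and estimated dictionaries, not necessarily the paper's exact definition:

```python
def recovery_rates(true_dict, est_dict):
    # correct recovery: fraction of true patterns found;
    # false recovery: fraction of estimated patterns that are spurious
    true_set, est_set = set(true_dict), set(est_dict)
    correct = len(true_set & est_set) / len(true_set)
    false = len(est_set - true_set) / max(len(est_set), 1)
    return correct, false
```

Averaging these two rates over the 50 replications would yield summary numbers of the kind shown in Table 3.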

            Correct recovery   False recovery   2-gram hitting   3-gram hitting   4-gram hitting
Setting 1   96.4 %             11.0 %           99.8 %           89.2 %           -
Setting 2   96.0 %             6.6 %            99.8 %           91.3 %           92.8 %

            C1      C2      C3      C4      C5
Setting 1   0.392   0.298   0.200   0.051   0.051
  RMSE      0.036   0.013   0.013   0.010   0.009
Setting 2   0.304   0.302   0.179   0.110   0.111
  RMSE      0.021   0.009   0.055   0.044   0.047

Table 3: Recovery accuracy under two simulation settings
event_number event time event_value
2 click 24.60 paragraph01
3 ACER_EVENT 27.70 00000000010000000000000
4 click 27.70 hit_NobelLee
5 ACER_EVENT 28.60 00000001010000000000000
6 click 28.60 hit_MarketLee
7 ACER_EVENT 29.40 00000001110000000000000
8 click 29.40 hit_MarketPark
9 ACER_EVENT 30.50 00000001110000000001000
29 ACER_EVENT 46.00 00110000100000000001010
30 click 46.00 hit_MarketPark
31 ACER_EVENT 47.70 00110001100000000001010
32 click 47.70 hit_MarketLee
33 ACER_EVENT 48.70 00110001110000000001010
34 click 48.70 hit_NobelLee
35 Q3_SELECT 54.70 Park
36 END_ITEM 66.20 NULL
Table 4: The log file of an examinee.
event_number time event type
1 27.70 10
2 28.60 8
3 29.40 9
14 46.00 9
15 47.70 8
16 48.70 10
Table 5: The cleaned version of log data.

7 Real Data Analysis

In this section, we apply the proposed model to the “Traffic” item from PISA 2012, as described in Section 2. To do so, we clean the data as follows. In the raw data, each event corresponding to the map is a 0-1 vector with 23 entries, and two consecutive vectors differ at only one position. We take their difference and represent the event as the index at which the two consecutive vectors differ. We view highlighting and unhighlighting as two different thinking patterns showing distinct knowledge states of the examinee. As such, we treat a sentence as an event subsequence in which the examinee either consecutively highlighted roads or consecutively unhighlighted roads. That is to say, a new sentence starts once the examinee switched from highlighting to unhighlighting (or vice versa), or clicked “RESET”. The time sentence is taken to be the time sequence of the corresponding event sentence. An example of this data transformation is shown in Table 4. In our case, the observed data is and . The corresponding observed event and time sentence sequences are and . In addition, we remove examinees who did not answer all traffic items or did not take any actions, leaving 10048 examinees. On average, each individual had about 10.4 sentences and clicked around 28.4 roads.
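The cleaning procedure just described can be sketched as below. The helper assumes the map state starts as the all-zero vector and that consecutive ACER_EVENT vectors differ in exactly one position, as stated above; on the first three state vectors of Table 4 it recovers the event types 10, 8, 9 listed in Table 5:

```python
def to_events(states):
    # states: list of 23-length 0/1 tuples from the log (ACER_EVENT rows)
    events = []
    for prev, cur in zip(states, states[1:]):
        idx = [i for i in range(len(prev)) if prev[i] != cur[i]]
        assert len(idx) == 1          # consecutive vectors differ at one position
        i = idx[0]
        events.append((i + 1, cur[i] == 1))  # (road label, highlighted?)
    return events

def to_sentences(events):
    # start a new sentence whenever the examinee switches between
    # highlighting and unhighlighting
    sentences, cur = [], []
    for road, hi in events:
        if cur and hi != cur[-1][1]:
            sentences.append([r for r, _ in cur])
            cur = []
        cur.append((road, hi))
    if cur:
        sentences.append([r for r, _ in cur])
    return sentences
```

Handling the “RESET” button (which also starts a new sentence) would require one extra branch keyed on that log entry.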

We choose , , and fit the proposed model. The number of classes turns out to be 6, and the numbers of identified patterns under different are provided in Figure 4. Detailed results for each class are provided in Table 6, from which we can clearly see that an efficient student has a higher chance of completing the task; that is, he or she takes fewer actions but uses many action patterns. This finding coincides with the motivation for designing the Traffic item, which tests the abilities of “Exploring and understanding” and “Planning and executing” (OECD, 2014a).

Furthermore, from Figures 2 and 3, each of the six classes can be interpreted as follows. People in Class 1 preferred to perform action patterns , , and . These patterns link the three original places to the correct answer “Park”, which explains why people in Class 1 took the smallest number of actions. Examinees in Class 2 frequently took action patterns , , and in addition to those used by people in Class 1. This shows that people in this class often first highlighted roads and then unhighlighted them, which explains why they took more actions than the first class. Individuals in Class 3 were most likely to have patterns , , . These patterns are paths that connect “Nobel” or “Lincoln” to “Silver”, which is the other correct answer. However, “Silver” is also one of the original places, which makes it harder for examinees to decide whether to keep this answer or find another one; this explains why people in Class 3 had lower correct proportions. Examinees in Class 4 showed more frequent use of patterns and compared with others and also used some of the patterns that people in Class 1 did. Notice that and are partial paths from “Nobel” to “Park”; this explains why people in this class had a lower probability of answering correctly than the first class. Examinees from Class 5 had similar pattern trends to those from Class 6. Neither class used the top patterns very often, which explains why people from these two classes took many more actions than the other four classes. The difference between Classes 5 and 6 is that individuals in Class 5 had a higher probability of taking patterns related to “Park”, while people from Class 6 had a higher chance of taking patterns related to “Silver”. This, to some extent, explains why people in Class 6 had the least chance of answering correctly. Finally, we conclude that examinees are classified based on their number of sentences, their number of actions, and the frequency of the patterns they used. An examinee is more likely to successfully complete the task if he/she plans ahead (fewer sentences) and chooses a good strategy (more patterns).

Figure 2: Ten most frequent 2-gram patterns.
Figure 3: Ten most frequent 3-gram patterns.
Figure 4: Number of patterns under different .
Top Patterns in Each Class
C1 C2 C3 C4 C5 C6
30.6 % 17.2 % 16.9 % 15.2 % 12.4 % 7.6 %
Correct 0.976 0.917 0.633 0.842 0.779 0.570
Avg Sent. 6.9 10.2 11.0 7.7 18.6 16.3
Avg Event. 20.7 29.1 29.0 20.9 47.5 40.3
“20, 9” 0.178 0.058 0.032 8e-4 0.058 0.034
“3, 4” 0.006 0.053 0.034 0.009 0.021 0.014
“10, 8” 0.015 0.011 0.013 0.093 0.016 0.015
“8, 10” 0.011 0.006 0.015 0.069 0.010 0.014
“9, 20” 0.089 0.034 0.017 1e-4 0.035 0.020
“3, 4, 22” 0.131 0.067 0.021 0.102 0.028 0.020
“6, 19, 16” 0.097 0.069 0.037 0.055 0.054 0.037
“10, 8, 9” 0.041 0.085 0.033 0.049 0.026 0.019
“10, 5, 15” 0.107 0.042 0.021 0.012 0.033 0.026
“16, 19, 6” 0.061 0.040 0.019 0.035 0.029 0.018
“21, 14, 22” 0.011 0.069 0.013 0.005 0.015 0.011
Table 6: The table contains real data results, including clustering information, most frequent patterns and estimated class-specific parameters.

8 Discussion

In this paper, we propose a novel statistical model, the latent theme dictionary model, for process data. The proposed model can be used to cluster the population and extract co-occurrent patterns simultaneously. Along with the model, we propose the NB-LTDM algorithm based on non-parametric Bayes methods. The algorithm extracts co-occurrent patterns and chooses the number of clusters automatically from the data, without specification in advance. In addition, we provide theoretical results showing the identifiability of the proposed model and the consistency of the proposed estimators. As shown through the 2012 PISA “Traffic” item, our model offers good interpretations of students' strategies for solving a complex problem.

Our approach easily incorporates domain knowledge. If certain patterns are selected by experts, we can simply add them to the dictionary. On the other hand, if some patterns are known to be impossible or meaningless, we can delete the separations containing them. Moreover, the proposed model can be used in a broad range of applications. For example, it can be applied in text mining and speech pattern recognition, where articles and speeches could be clustered based on their word patterns. It can also be applied to user behavior studies, where users' frequent daily action patterns can be extracted and a user preference database can thus be built.

Appendix A Proof

To prove the main theoretical results, we start by introducing two lemmas, which play key roles in dictionary and parameter identifiability. The proof of Lemma 2 is presented at the end of this section.

Lemma 1 (Kruskal (1977))

Suppose $A, B, C, \bar{A}, \bar{B}, \bar{C}$ are six matrices with $R$ columns. Suppose there exist integers $\alpha$, $\beta$, and $\gamma$ such that $\alpha + \beta + \gamma \geq 2R + 2$; in addition, every $\alpha$ columns of $A$ are linearly independent, every $\beta$ columns of $B$ are linearly independent, and every $\gamma$ columns of $C$ are linearly independent. Define the triple product $[A, B, C]$ to be the three-way array with entries $[A, B, C]_{ijk} = \sum_{r=1}^{R} A_{ir} B_{jr} C_{kr}$. Suppose that the two triple products are equal, $[A, B, C] = [\bar{A}, \bar{B}, \bar{C}]$. Then, there exists a column permutation matrix $P$ such that $\bar{A} = A P \Lambda_A$, $\bar{B} = B P \Lambda_B$, and $\bar{C} = C P \Lambda_C$, where $\Lambda_A$, $\Lambda_B$, $\Lambda_C$ are diagonal matrices with $\Lambda_A \Lambda_B \Lambda_C = I$ (the identity). A column permutation matrix acts on the right-hand side of another matrix and permutes its columns.
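As a small numerical companion to the lemma (an illustrative helper, not part of the paper), the triple product of three matrices with a common number of columns can be computed in one step:

```python
import numpy as np

def triple_product(A, B, C):
    # three-way array [A, B, C] with entries sum_r A[i, r] * B[j, r] * C[k, r]
    return np.einsum('ir,jr,kr->ijk', A, B, C)
```

Kruskal's lemma then says that, under the stated rank conditions, equality of two such arrays forces the factor matrices to agree up to a common column permutation and compensating diagonal rescalings.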

Lemma 2

Under Conditions A0 and A1, and assuming that , any pattern must belong to .

[Proof of Theorem 1] To prove that almost every is identifiable, it suffices to show that there exists a set such that any model belonging to is identifiable and . Here, we use to denote the complement of the set .

For , we prove identifiability through the following steps. (1) Dictionary identifiability: we first prove that there is no other model with a different dictionary such that . (2) identifiability: we prove that there is no other model with a different leading to the same marginal distribution of the number of sentences. (3) Generic identifiability: we show that there is no other model with distinct leading to the same marginal distribution of sentences if . (4) Gamma identifiability: we show that the parameters of the personal random effects are identifiable.

For dictionary identifiability, we point out that a key step was missing in the proof of Deng et al. (2014); we therefore use a different approach. By Lemma 2, any pattern in must also belong to , which implies . In other words, no other dictionary of smaller size yields the same observation set. Therefore, identifiability of the dictionary is established under our setting.

For identifiability, we can see the marginal distribution of and is


First, we show that parameter is identifiable, that is, if for all , then = . We know that the marginal distribution of is

Hence, implies . Therefore,

This shows that .

For generic identifiability, we consider the marginal distribution of and ,


We show that parameters are identifiable. Typically, we choose and have


Further, we consider in , which is the set of all 1-grams in and the combinations of two or three 1-grams admitting one possible separation. Suppose there are two sets of parameters, and , such that . According to Condition A1, we obtain

if (17)
if (18)

where we write , and for . In addition, if , then we have another equation


It is not hard to write equations (17) - (20) in terms of tensor products of matrices, that is,



Here, is a by diagonal matrix with its -th element equal to . Generically speaking, the column ranks of matrices and are greater than or equal to , and the column rank of is at least two. In other words, there exists a set