Process data are temporally ordered data with categorical observations. Such data are ubiquitous in e-commerce (online purchases), social networking services, and computer-based educational assessments. In large-scale computer-based tests, analyzing process data has gained much attention and has become a core task in the next generation of assessment; see, for example, the 2012 and 2015 Programme for International Student Assessment (PISA) (OECD, 2014b, 2016), the 2012 Programme for International Assessment of Adult Competencies (PIAAC) (Goodman et al., 2013), and Assessment and Teaching of 21st Century Skills (ATC21S) (Griffin et al., 2012). In such technology-rich tests, there are problem-solving items that require the examinee to perform a number of actions before submitting final answers. These actions and their corresponding times are sequentially recorded and saved in a log file. Such log file data can provide extra information about an examinee’s latent structure that is not available in traditional paper-based tests, in which only the final responses (correct / incorrect) are used.
It is also important to study personal traits and calibrate items in the analysis of process data. However, process data are much more complicated in the sense that events happen at irregular time points and the event sequence length varies from examinee to examinee. Distinct examinees may have different reaction speeds and tend to perform varied actions to complete the task. In addition, some examinees may find the optimal strategy in a very short time, while others may explore the item for longer before reaching their final answers. Therefore, we need to classify examinees into multiple groups based on their distinct behaviors. Moreover, some events may often appear together in sequence; that is, the existence of co-occurrent event patterns is another important feature of process data. Thus, we need to extract key event patterns, which may help us understand students’ learning status better and more deeply.
There is a recent literature on statistical analysis of process data in terms of classification and feature extraction. Suh et al. (2010) and Hong et al. (2011) worked on information cascades, extracting a set of features describing past information and feeding these features into a machine learning classifier. Luo et al. (2015) considered a multidimensional Hawkes process, imposing structural constraints on the triggering functions to obtain low-rank feature representations. Lian et al. (2015) proposed a hierarchical Gaussian process model that uses past events as kernel features to allow flexible arrival patterns. He and von Davier (2016) performed a study to extract and detect robust sequential action patterns associated with success or failure on a problem-solving item. Xu and Zha (2017) introduced a Dirichlet mixture Hawkes process model for clustering individuals based on raw event sequences. Xu et al. (2018) considered a latent class model for analyzing student behaviors in solving computer-based tasks. Qiao and Jiao (2018) applied various classification methods to a dataset from PISA 2012 to achieve better accuracy.
Despite these efforts, statistical modeling and analysis of process data is still in its infancy. Most classification methods for process data are over-simplified in that they ignore event sequentiality and time irregularity. The feature extraction methods are based on counts and hence lack statistical interpretation. Moreover, very little has been done to explore the structure of process data systematically, especially to classify individuals and extract features simultaneously.
This paper proposes a latent theme dictionary model (LTDM). It is a latent class-type model with two layers of latent structure: an underlying latent class structure for examinees and a latent association pattern structure for event types. To incorporate the temporal nature of the data, a survival time model with both an event-specific parameter and a person-specific frailty is assumed for the gap times between two consecutive events. The challenging issue of model identifiability is dealt with by using a special dictionary structure and Kruskal’s fundamental result on the unique decomposition of three-way arrays (Kruskal, 1977). We propose a non-parametric Bayes algorithm (NB-LTDM), which can be used to identify the pattern dictionary, classify individuals, and estimate model parameters simultaneously.
The rest of the paper is organized as follows. In Section 2, we describe process data and introduce two useful statistical models, the latent class model (LCM) and the theme dictionary model (TDM). In Section 3, we propose a new latent theme dictionary model, which combines LCM and TDM and incorporates time structure. In Section 4, we provide theoretical results regarding model identifiability and estimation consistency. In Section 5, we discuss computational issues and propose the NB-LTDM algorithm. The simulation results are given in Section 6. In Section 7, we apply the proposed method to a real data set and obtain interpretable results. Finally, concluding remarks are given in Section 8.
2.1 Process Data
The process data here refer to a sequence of ordered events (actions) coupled with time stamps. For an examinee, his/her observed data may be denoted by , , where is the th event and is its corresponding time stamp. We have and , where is the set of all possible different event types. For notational simplicity, we write and .
We use the “Traffic” item from PISA 2012 (OECD, 2014b) as a motivating example to illustrate various concepts and notation. PISA is a worldwide study that evaluates the educational performance of different countries and economies. The “Traffic” item asks the examinee to operate a computer to complete the task, i.e., to locate a meeting point that is within 15 minutes of three places, Silver, Lincoln and Nobel. There are two correct answers for this task, “Park” and “Silver”. Figure 1 shows the initial state of the computer screen. There are 16 destinations and 23 roads in the map. The examinee can click a road to highlight it, re-click a highlighted road to unhighlight it, and use the “RESET” button to remove all highlighted roads. The “Total Time” box shows the time for traveling on the highlighted roads. Once a road is clicked, the corresponding time is added to this box. Each action and its corresponding time are sequentially saved in the log file during the process of completing the item. A typical example of the action process of one specific examinee and its cleaned version are shown in Tables 4 and 5. In this case, . After removing unneeded rows (“START_ITEM”, “END_ITEM”, “Click”, “SELECT”), we can see that there are 16 meaningful actions performed by this examinee, as listed in Table 5. His/her observed data are
2.2 Latent Class Model and Theme Dictionary Model
To set the stage for proposing our new model, we first introduce two related simpler models, the latent class model (LCM, Gibson (1959)) and the theme dictionary model (TDM, Deng et al. (2014)). LCM is widely used in clinical trials, psychometrics and machine learning (Goodman, 1974; Vermunt and Magidson, 2002; Templin et al., 2010). It relates a set of observed variables to a latent variable taking discrete values. The latent variable is often used to indicate the class label. LCM assumes a local independence structure, i.e.,
where are observed variables and is a discrete latent variable with density . Then, the joint marginal distribution of is
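For orientation, the textbook form of these two displays can be written as follows. The symbols $Y_1,\dots,Y_J$ (observed variables), $Z$ (latent class) and $\eta$ (its probability mass function) are assumed names chosen for this sketch and need not match the notation used elsewhere in the paper.

```latex
% Local independence: the observed variables are independent given the class.
P(Y_1 = y_1, \dots, Y_J = y_J \mid Z = z) \;=\; \prod_{j=1}^{J} P(Y_j = y_j \mid Z = z)

% Marginal distribution: sum the class-conditional product over the latent classes.
P(Y_1 = y_1, \dots, Y_J = y_J) \;=\; \sum_{z} \eta(z) \prod_{j=1}^{J} P(Y_j = y_j \mid Z = z)
```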
TDM (Deng et al., 2014) deals with observations called events. It is used for identifying associated event patterns. The problem of finding event associations is also known as market basket analysis, which has been a popular research topic since 1990s (Piatetsky-Shapiro, 1991; Hastie et al., 2005; Chen et al., 2005). Under TDM, a pattern is a combination of several events. A collection of distinct patterns forms a dictionary, . A sentence, , is the combination of patterns. The collection of multiple sentences forms a document. In TDM, we only observe sentence , but do not know which patterns it consists of. That is to say, could be divided into the association of multiple patterns in many different ways. For each possible association, we call it a separation of . We use the following simple example to illustrate this.
There are three event types, and . The patterns are which forms a dictionary. Suppose is one observed sentence. According to , all possible separations for are , and .
TDM does not take event ordering into account. For example, is the same as . Consequently, patterns are also unordered. For instance, patterns and are viewed as the same one. TDM postulates that a pattern appears in a sentence at most once. Letfor observation is defined to be
Since the separation is not observed, the marginal probability of is
where is the set of all separations for . Furthermore, sentences are assumed to be independent of each other, i.e., for , the probability could be written as
3 Latent Theme Dictionary Model
In this section, we propose a latent theme dictionary model (LTDM). We treat the whole event process as a sequence of sentences, where each sentence is an ordered subsequence of events. In doing so, we effectively reduce the raw data length by splitting the original long sequence into multiple shorter subsequences. This complexity reduction enables us to model sentences instead of the whole process, which is more complicated.
Specifically, we assume and can be divided into sentence sequence, i.e. and , where
are called the event sentence and time sentence, respectively. We assume that some events often appear together, which can be viewed as event patterns. We use to represent a pattern in which events appear sequentially. Therefore, the length of this pattern is , and it will be called a -gram. Event sentence can be represented as a sequence of patterns,
where is the number of patterns that contains. Note that an event sentence can be separated into different pattern sequences. We use to denote the set of all possible pattern separations for . We let be the number of different event patterns of length . (A pattern of length is also viewed as an event.) We call , a pattern dictionary, which is the set of all distinct patterns. We write as its cardinality. Obviously, . is the maximum length of patterns. Next, we use a toy example to illustrate the relation between an event sentence and its separations.
Assume that there are three events, and , and a pattern dictionary . Let be a sentence. According to , all possible separations for would be , and . Hence, .
Note that, in our setting, a separation is also an ordered sequence of patterns. Specifically, separation is in but separation is not in .
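The enumeration of ordered separations can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `separations`, the toy dictionary and the toy sentence are all hypothetical, and patterns are represented as tuples of event labels.

```python
from typing import List, Sequence, Set, Tuple

Pattern = Tuple[str, ...]


def separations(sentence: Sequence[str], dictionary: Set[Pattern]) -> List[List[Pattern]]:
    """Return every way to cut `sentence` into consecutive patterns from `dictionary`.

    The order of events is preserved, so each separation is an ordered
    sequence of patterns whose concatenation equals the sentence.
    """
    sentence = tuple(sentence)
    if not sentence:
        return [[]]  # the empty sentence has exactly one (empty) separation
    max_len = max(len(p) for p in dictionary)
    results: List[List[Pattern]] = []
    for k in range(1, min(max_len, len(sentence)) + 1):
        head = sentence[:k]
        if head in dictionary:
            # Recurse on the remainder and prepend the matched pattern.
            for tail in separations(sentence[k:], dictionary):
                results.append([head] + tail)
    return results


# Hypothetical dictionary: three 1-grams and two 2-grams.
toy_dictionary = {("a",), ("b",), ("c",), ("a", "b"), ("b", "c")}
toy_sentence = ("a", "b", "c")
toy_separations = separations(toy_sentence, toy_dictionary)
for sep in toy_separations:
    print(sep)
```

With this toy input the sentence admits three separations, and a reordered sequence such as `[("c",), ("a", "b")]` is not among them because it does not reproduce the event order.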
Suppose that the entire population consists of different classes of people, but their class labels are unknown. We use variable to denote the latent class to which the person belongs. We make the usual local independence assumption, i.e.,
This leads to
where is the probability of th latent class. Since each observed event sentence may have different pattern separations (i.e. two different pattern sequences can lead to the same event sequence), we have
for each sentence . Additionally, we assume the event and time are conditionally independent given the separation and latent class label. It leads to
We further assume
where and is the number of patterns in ; and we assume
For modeling time, we specify
We further assume that and are person-specific random effects.
Here, we view as the within-pattern intensity and as the between-pattern intensity. If an examinee performs consecutive actions faster when the actions are within the same pattern, then is larger than . Lastly, we assume that the number of sentences for each subject follows a Poisson distribution, i.e.,
In a nutshell, LTDM is a data-driven model for learning event patterns and population clusters simultaneously. Under this model, the whole event sequence is viewed as a sequence of sentences, and each sentence is an ordered sequence of different patterns. An important set of parameters, , measures how often people from different classes use distinct patterns.
Next we construct the likelihood function for LTDM. We shall use to denote the total number of subjects and subscript to denote th individual. We assume examinees are independent from each other. Then, the complete likelihood of has the following expression,
Here, we use and to represent the gamma priors for random effects and respectively and . Furthermore, by summing over / integrating out the unobserved latent variables, we have
Latent class models (LCMs) often face identifiability issues. There is an existing literature on the identifiability of latent variable models; see Allman et al. (2009), Xu (2017), and references therein. The theme dictionary model also has an identifiability issue, as seen in the example in Section 3. In this section, we address the identifiability of the dictionary and the model parameters in the proposed model.
We first define the model space ,
where , , is a simplex which is .
For each class and each event , there exists a positive integer such that for any sentence , consisting of events in , with length longer than , it is not a sub-sentence of any sentence in . Here, is the set of all units.
We next introduce the definition of model identifiability.
We say an LTDM is identifiable, if for any model that satisfies
and , , we must have
Here, we use superscript to denote the true model (parameters/dictionary). We use to denote the set of all possible sentences generated by model , that is, where is the set of sentences generated by the pattern set of Class , . means equals up to a permutation of class labels.
We want to point out that in general is too large to be identifiable without additional constraints. In other words, we need to identify a restricted model space such that almost every is identifiable. This is also known as generic identifiability. For this purpose, we need the following conditions.
For each pattern in dictionary , at least one of the following holds: (1) ; (2) when .
There exists a partition of 1-grams such that for any , and , sentences and sentence admit only one separation. Cardinalities of three sets satisfy: , and .
For any -gram with , there exists , a subset of 1-grams, such that (1) for any , sentence does not admit other separations containing -gram or -gram other than ; (2) the cardinality of is greater than or equal to .
We let and define to be
With these, we have the following theorem.
Under Conditions A0 - A3, for almost surely any model in , is identifiable.
Here, we take the measure over to be the product measure, with being Lebesgue measures on the corresponding parameter spaces and being the counting measure on . One immediate result, stated in Corollary 1, is that if a dictionary satisfies Condition A1, then there is no other dictionary which has the same . This also implies the identifiability of TDMs, since a TDM could be non-identifiable without any additional assumption; see an example in the discussion.
For the 1-class case, if satisfies Condition A1, then for any .
Assuming and are known, we define the maximum likelihood estimator
where denotes and is a compact subset of . We further assume the following.
and for .
Under A4 and together with identifiability conditions, the true dictionary and model parameters can be estimated consistently.
Under Conditions A0 - A4, we have that
for any as goes to infinity, where is some permutation function of latent class.
We end this section by commenting on the conditions. Condition A0 restricts the model space. (If we only consider a 1-class latent model, then Condition A0 can be removed.) It says that a proper model is required to have different patterns across different classes to some extent. For example, suppose that people from Class 1 have patterns and people from Class 2 only have pattern . Then, we cannot distinguish Class 2 from Class 1, since the sentence set of Class 2 is a subset of that of Class 1. Conditions A1 - A3 pertain to the underlying pattern dictionary. Specifically, Condition A1 puts constraints on patterns to avoid having too many repeated events. As a result, the pattern space is reduced. Consider the case where there are two dictionaries, and . Obviously, these two dictionaries could generate the same sentence set while . Condition A2 puts restrictions on 1-grams such that not all combinations of 1-grams are considered as patterns, which ensures that the pattern frequency can be identified. Condition A3 requires that each -gram does not overlap with other patterns to some extent and thus can be identified. Lastly, Condition A4 is natural, since it requires that each pattern appear frequently in at least one of the classes and that the size of each class be nonzero.
5 Computational Method
Though LTDM postulates a parametric form, we do not know the cardinality of true dictionary () and the number of latent classes () in practice. Therefore, three challenges remain in terms of computation, namely, (1) finding the true underlying patterns; (2) clustering people into the right groups; and (3) computational complexity. We propose a novel non-parametric Bayes - LTDM (NB-LTDM) algorithm as described below to address these issues.
Initialization: Randomly choose a large ; sample personal latent labels from the uniform distribution on ; sample parameters uniformly on [0,1]; sample from the Dirichlet distribution; sample from , and sample all augmented variables accordingly. For the initial dictionary , we put all 1-grams into the dictionary and randomly add -grams, for each .
Output: : the number of classes, : the dictionary, estimates for other parameters.
The algorithm iterates the following steps until the Markov chain becomes stable.
For each latent class, we calculate the most frequently used -grams within that group. Among those, we keep top patterns for each -grams. ()
Add the new grams above to the current dictionary and remove duplicates.
[Split] Split the event sequences according to the current dictionary.
[Sample] Sample split for each event sequence from the corresponding possible candidates.
[Inner part] Use slice Gibbs sampling schemes to update the following variables:
Update parameters , update augmented variables, update separations , update latent labels , update parameter , update , update prior parameters.
Prune dictionary: for each action pattern in the current dictionary, calculate the evidence probability . Discard those patterns with evidence probability smaller than .
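The dictionary-growing step (count frequent k-grams within each class, keep the top ones, merge them into the dictionary without duplicates) might be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `top_kgrams_by_class`, the per-sentence class labels, and the toy data are all hypothetical, and the pruning by evidence probability is not shown because its formula is not reproduced here.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple

Pattern = Tuple[str, ...]


def top_kgrams_by_class(
    sentences: Sequence[Sequence[str]],
    labels: Sequence[int],
    k: int,
    top: int,
) -> Dict[int, List[Pattern]]:
    """For each class label, return the `top` most frequent contiguous k-grams."""
    counts: Dict[int, Counter] = {}
    for sentence, label in zip(sentences, labels):
        counter = counts.setdefault(label, Counter())
        # Slide a window of width k over the sentence.
        for i in range(len(sentence) - k + 1):
            counter[tuple(sentence[i:i + k])] += 1
    return {label: [gram for gram, _ in c.most_common(top)] for label, c in counts.items()}


# Hypothetical sentences with current class assignments (one label per sentence).
toy_sentences = [("3", "4", "22"), ("3", "4", "5"), ("6", "19", "16")]
toy_labels = [0, 0, 1]
candidates = top_kgrams_by_class(toy_sentences, toy_labels, k=2, top=1)

# Merge into the current dictionary; a set removes repeated patterns automatically.
current_dictionary: Set[Pattern] = {("3",), ("4",), ("3", "4")}
for grams in candidates.values():
    current_dictionary |= set(grams)
```

Representing the dictionary as a set makes the "remove duplicates" step implicit: the 2-gram `("3", "4")` is proposed again by class 0 but appears only once after the merge.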
The NB-LTDM algorithm is an exploratory method. It assumes a stick-breaking prior (Sethuraman, 1994) on the latent class probabilities to avoid specifying the number of classes beforehand. It trims the dictionary by keeping patterns with a high evidence level and discarding those with weak signals. Parameters are estimated by a Markov chain Monte Carlo (MCMC) method together with a slice sampler (Dunson and Xing, 2009). This relieves us from computing the marginal likelihood directly, which would require massive computation to integrate out the latent variables and . The tuning parameter is a threshold that filters out less frequent patterns; is the initial number of -grams; is a search parameter that controls the number of new patterns added to the current dictionary. The inner part consists of the main loop for updating parameters using the non-parametric Bayes method. The details of this part are provided in Appendix B. The final number of latent classes is estimated as with threshold . The estimated pattern dictionary is . We use posterior means for the other parameters.
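To make the stick-breaking prior concrete, the sketch below draws class probabilities from a truncated stick-breaking construction. It is only an illustration: the concentration parameter `alpha`, the truncation level, and the rule of counting classes whose weight exceeds a threshold are assumptions made for this example, and the exact estimator used in the paper is not reproduced here.

```python
import random
from typing import List


def stick_breaking_weights(alpha: float, truncation: int, rng: random.Random) -> List[float]:
    """Draw class probabilities from a truncated stick-breaking construction.

    Each step breaks off a Beta(1, alpha) fraction of the remaining stick;
    the last component receives whatever is left so the weights sum to one.
    """
    weights: List[float] = []
    remaining = 1.0
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)
    return weights


rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, truncation=20, rng=rng)

# Assumed rule for illustration: count the classes whose weight exceeds a threshold.
threshold = 0.05
n_effective_classes = sum(1 for w in weights if w > threshold)
```

Because the weights decay quickly for moderate `alpha`, most of the probability mass falls on a few classes, which is why a generous truncation level can be chosen without fixing the number of classes in advance.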
6 Simulation Study
We provide the simulation results in this section. In particular, we consider two simulation settings which are described as follows.
In the first simulation setting, the dictionary consists of -grams, -grams and -grams. We set , , and set , . Other model parameters are set as follows: , , . Pattern probability is provided in Table 1.
In the second simulation setting, the dictionary includes patterns up to -grams. We let , , and set , . Other model parameters are set as follows: , , . Pattern probability is provided in Table 2.
For each setting, we run 50 replications. We let threshold be 0.05 and 0.01 for setting 1 and setting 2, respectively. We set in the first 20 iterations and set afterwards. The simulation results are provided in Table 3. On average, more than 96 percent of true patterns are correctly identified in both settings. The false recovery rate is around 20 percent and 5 percent under the two settings, respectively. The -gram hit rates are all above 92 percent for and 4. This shows that our algorithm can find -grams well even for large . Clustering results are also shown in Table 3. Root mean square errors are small relative to the true parameters, indicating that the estimated class sizes are very close to the true ones. Overall, the proposed method performs well in both simulation studies.
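One plausible way to compute the pattern recovery summaries is sketched below. The definitions are assumptions made for illustration (hit rate as the share of true patterns that are recovered, false recovery as the share of estimated patterns that are not true patterns); the paper's exact definitions may differ, and the function name `pattern_recovery` is hypothetical.

```python
from typing import Iterable, Tuple

Pattern = Tuple[str, ...]


def pattern_recovery(
    true_dictionary: Iterable[Pattern],
    estimated_dictionary: Iterable[Pattern],
) -> Tuple[float, float]:
    """Return (hit rate, false recovery rate) under the assumed definitions."""
    true_set = set(true_dictionary)
    est_set = set(estimated_dictionary)
    hit_rate = len(true_set & est_set) / len(true_set)
    # Guard against an empty estimated dictionary.
    false_recovery = len(est_set - true_set) / len(est_set) if est_set else 0.0
    return hit_rate, false_recovery


# Hypothetical dictionaries: three of four true patterns found, one spurious pattern.
true_patterns = [("a",), ("b",), ("a", "b"), ("b", "c")]
estimated_patterns = [("a",), ("b",), ("a", "b"), ("c", "a")]
hit_rate, false_recovery = pattern_recovery(true_patterns, estimated_patterns)
```

Restricting both sets to patterns of a given length before calling the function would give a per-length hit rate.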
7 Real Data Analysis
In this section, we apply the proposed model to the “Traffic” item from PISA 2012 described in Section 2. To do so, we clean the data as follows. In the raw data, each event corresponding to the map is a 0-1 vector with 23 entries. Notice that two consecutive vectors differ at only one position. We take their difference and represent the event as the index at which the two consecutive vectors differ. We view highlighting and unhighlighting as two different thinking patterns showing distinct knowledge status of the examinee. As such, we treat a sentence as an event subsequence in which the examinee either consecutively highlighted roads or consecutively unhighlighted roads. That is to say, a new sentence starts once the examinee switches from highlighting to unhighlighting roads or vice versa, or clicks “reset”. The time sentence is taken to be the time sequence of the corresponding event sentence. An example of this data transformation is shown in Table 4. In our case, the observed data are and . The corresponding observed event and time sentence sequences are and . In addition, we remove examinees who did not answer all traffic items or did not take any actions, which leaves 10048 examinees. On average, each individual had about 10.4 sentences and clicked around 28.4 roads.
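The cleaning rule described above can be sketched as follows. This is a simplified illustration, not the actual preprocessing script: the `RESET` marker, the function name `to_event_sentences`, the 1-based road indices, and the assumption that a reset clears the map and only closes the current sentence are all hypothetical, and time stamps are omitted for brevity.

```python
from typing import List, Sequence, Tuple, Union

RESET = "RESET"  # hypothetical marker for a click on the reset button
State = Union[Tuple[int, ...], str]


def to_event_sentences(states: Sequence[State], n_roads: int = 23) -> List[List[int]]:
    """Convert consecutive 0-1 map states into event sentences of road indices.

    A new sentence starts when the examinee switches between highlighting and
    unhighlighting, or when the reset marker is encountered.
    """
    sentences: List[List[int]] = []
    current: List[int] = []
    current_direction = None
    previous = (0,) * n_roads
    for state in states:
        if state == RESET:
            if current:
                sentences.append(current)
            current, current_direction = [], None
            previous = (0,) * n_roads  # assumed: reset clears all highlighted roads
            continue
        changed = [i for i in range(n_roads) if state[i] != previous[i]]
        if len(changed) != 1:
            raise ValueError("consecutive states must differ at exactly one position")
        index = changed[0]
        direction = "highlight" if state[index] == 1 else "unhighlight"
        if current and direction != current_direction:
            sentences.append(current)  # direction switched: close the sentence
            current = []
        current.append(index + 1)  # assumed 1-based road index
        current_direction = direction
        previous = state
    if current:
        sentences.append(current)
    return sentences


# Toy log with four roads: highlight 1, highlight 2, unhighlight 2, reset, highlight 3.
toy_states = [(1, 0, 0, 0), (1, 1, 0, 0), (1, 0, 0, 0), RESET, (0, 0, 1, 0)]
toy_event_sentences = to_event_sentences(toy_states, n_roads=4)
```

For the toy log the result is three sentences: roads 1 and 2 highlighted in a row, road 2 unhighlighted, and road 3 highlighted after the reset. Carrying the time stamps along in parallel lists would give the matching time sentences.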
We choose , , and fit the proposed model. The number of classes turns out to be 6, and the numbers of identified patterns under different are provided in Figure 4. Detailed results for each class are provided in Table 6, from which we can clearly see that an efficient student has a higher chance of completing the task. That is to say, he/she takes fewer actions but uses many action patterns. This finding coincides with the motivation for designing the Traffic item, which tests the abilities of “Exploring and understanding” and “Planning and executing” (OECD, 2014a).
Furthermore, from Figures 2 and 3, we find that each of the six classes can be interpreted as follows. People in Class 1 preferred to perform action patterns , , and . These patterns link the three original places to the correct answer “Park”, which explains why people in Class 1 took the smallest number of actions. Examinees in Class 2 frequently took action patterns , , and in addition to the patterns used by people in Class 1. This shows that people in this class often first highlighted roads and then unhighlighted them, which explains why they took more actions than the first class. Individuals in Class 3 were most likely to have patterns , , . These patterns are paths that connect “Nobel” or “Lincoln” to “Silver”, which is the other correct answer. However, “Silver” is also one of the original places; this makes it harder for examinees to decide whether to keep this answer or find another one, which explains why people in Class 3 had lower proportions of correct answers. Examinees in Class 4 used patterns and more frequently than others and also used some of the patterns that people in Class 1 did. Notice that and are partial paths from “Nobel” to “Park”; this explains why people in this class had a lower probability of answering correctly than the first class. Examinees from Class 5 had pattern trends similar to those from Class 6. Neither group used the top patterns very often, which explains why people from these two classes took many more actions than the other four classes. The difference between Classes 5 and 6 is that individuals in Class 5 had a higher probability of taking patterns related to “Park”, while people from Class 6 had a higher chance of having patterns related to “Silver”. This, to some extent, explains why people in Class 6 had the lowest chance of answering correctly. Finally, we conclude that examinees are classified based on their number of sentences, their number of actions, and the frequency of the patterns they used.
An examinee is more likely to successfully complete the task if he/she plans ahead (fewer sentences) and chooses a good strategy (more patterns).
Top Patterns in Each Class
| Pattern | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 |
| Class proportion | 30.6 % | 17.2 % | 16.9 % | 15.2 % | 12.4 % | 7.6 % |
| “3, 4, 22” | 0.131 | 0.067 | 0.021 | 0.102 | 0.028 | 0.020 |
| “6, 19, 16” | 0.097 | 0.069 | 0.037 | 0.055 | 0.054 | 0.037 |
| “10, 8, 9” | 0.041 | 0.085 | 0.033 | 0.049 | 0.026 | 0.019 |
| “10, 5, 15” | 0.107 | 0.042 | 0.021 | 0.012 | 0.033 | 0.026 |
| “16, 19, 6” | 0.061 | 0.040 | 0.019 | 0.035 | 0.029 | 0.018 |
| “21, 14, 22” | 0.011 | 0.069 | 0.013 | 0.005 | 0.015 | 0.011 |
In this paper, we propose a novel statistical model, the latent theme dictionary model, for process data. The proposed model can be used to cluster the population and extract co-occurrent patterns simultaneously. Along with the model, we propose the NB-LTDM algorithm, which is based on a non-parametric Bayes method. The algorithm allows us to extract co-occurrent patterns and to choose the number of clusters automatically from the data, without specifying it in advance. In addition, we provide theoretical results showing the identifiability of the proposed model and the consistency of the proposed estimators. As shown through the 2012 PISA “Traffic” item, our model yields interpretable results on students’ strategies for solving complex problems.
Domain knowledge is easy to incorporate into our approach. If certain patterns are selected by experts, we can simply add them to the dictionary. On the other hand, if some patterns are known to be impossible or meaningless, we can delete the separations that include those patterns. Moreover, the proposed model can be used in a broad range of applications. For example, it can be applied in text mining and speech pattern recognition, where different articles and speeches could be clustered based on their word patterns. It can also be applied in user behavior studies, where users’ frequent daily action patterns can be extracted and a user preference database can thus be built.
Appendix A Proof
To prove the main theoretical results, we start by introducing two lemmas that play key roles in dictionary and parameter identifiability. The proof of Lemma 2 is presented at the end of this section.
Lemma 1 (Kruskal (1977))
Suppose are six matrices with columns. There exist integers , , and such that . In addition, every columns of are linearly independent, every columns of are linearly independent, and every columns of are linearly independent. Define a triple product to be a three-way array where . Suppose that the following two triple products are equal, . Then there exists a column permutation matrix such that , where are diagonal matrices whose product is the identity. A column permutation matrix is a matrix that acts on the right-hand side of another matrix and permutes the columns of that matrix.
Under Conditions A0 and A1, assume that . Then any pattern must belong to .
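The lemma says that a triple product determines its factors only up to a common column permutation and diagonal rescalings whose product is the identity. The sketch below checks the easy direction numerically, namely that such transformations leave the triple product unchanged. The helper names and the small matrices are hypothetical and chosen only for illustration.

```python
from typing import List

Matrix = List[List[float]]


def triple_product(a: Matrix, b: Matrix, c: Matrix) -> List[List[List[float]]]:
    """Three-way array T[i][j][l] = sum_k a[i][k] * b[j][k] * c[l][k]."""
    n_cols = len(a[0])
    return [
        [
            [sum(a[i][k] * b[j][k] * c[l][k] for k in range(n_cols)) for l in range(len(c))]
            for j in range(len(b))
        ]
        for i in range(len(a))
    ]


def permute_and_scale(m: Matrix, perm: List[int], scales: List[float]) -> Matrix:
    """Column k of the result is scales[k] times column perm[k] of m."""
    return [[scales[k] * row[perm[k]] for k in range(len(perm))] for row in m]


# Hypothetical 2 x 2 factor matrices.
m1 = [[1.0, 2.0], [3.0, 5.0]]
m2 = [[2.0, 1.0], [1.0, 4.0]]
m3 = [[1.0, 3.0], [2.0, 1.0]]

perm = [1, 0]  # the same column permutation is applied to all three factors
n1 = permute_and_scale(m1, perm, [2.0, 4.0])
n2 = permute_and_scale(m2, perm, [0.5, 0.25])
n3 = permute_and_scale(m3, perm, [1.0, 1.0])  # the scales multiply to 1 column by column

original = triple_product(m1, m2, m3)
transformed = triple_product(n1, n2, n3)
```

The two arrays agree entry by entry, so no argument based on the triple product alone can rule out this kind of relabeling and rescaling; the lemma states that, under the rank conditions, nothing else is left undetermined.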
[Proof of Theorem 1] To prove that for almost every is identifiable, it suffices to show that there exists a set such that any model belonging to
is identifiable and . Here, we use notation to indicate the complement of set .
For , we prove its identifiability through the following steps. (1) Dictionary identifiability: we first prove that there is no other model with different dictionary such that . (2) identifiability: we prove that there is no other model with different leading to the same marginal distribution of number of sentences. (3) generic identifiability: we show that there is no other model with distinct leading to the same marginal distribution of sentences if . (4) Gamma identifiability: we show that the parameters for personal effect are identifiable.
For the dictionary identifiability, we would like to point out that a key step is missing in the proof of Deng et al. (2014). Therefore, we use another approach to prove it. By Lemma 2, we know that any pattern in must also belong to , which implies . In other words, no other dictionary can generate the same observation set with a smaller dictionary size. Therefore, identifiability of the dictionary is established under our setting.
For identifiability, we can see the marginal distribution of and is
First, we show that parameter is identifiable, that is, if for all , then = . We know that the marginal distribution of is
Hence, implies . Therefore,
This shows that .
For generic identifiability, we consider the marginal distribution of and ,
We show that parameters are identifiable. Typically, we choose and have
Further, we consider in , which is the set of all 1-grams in and the combinations of two or three 1-grams admitting one possible separation. Suppose there are two sets of parameters, and such that . According to Condition A1, we could get
where we write , and for . In addition, if , then we have another equation
) in terms of tensor products of matrices, that is,
Here, is a by diagonal matrix with its -th element equal to . Generically speaking, column ranks of matrix and are greater than or equal to , and column rank of is at least two. In other words, there exists a set