Problem solving is one of the key skills for people in a rapidly changing world (OECD, 2017). Computer-based items have recently become popular for assessing problem-solving skills. In such items, problem-solving scenarios can be conveniently simulated through human-computer interfaces and the problem-solving processes can be easily recorded for analysis.
In 2012, several computer-based items were designed and deployed in the Programme for the International Assessment of Adult Competencies (PIAAC) to measure adults’ competency in problem solving in technology-rich environments (PSTRE). Screenshots of the interface of a released PSTRE item (retrieved from https://piaac-logdata.tba-hosting.de/) are shown in Figures 1–3. The opening page of the item, displayed in Figure 1, consists of two panels. The left panel contains item instructions and navigation buttons, while the right panel is the main medium for interaction. In this example, the right panel is a web browser showing a job search website. The task is to find all job listings that meet the criteria described in the instructions. The dropdown menus and radio buttons can be used to narrow down the search range. Once the “Find Jobs” button is clicked, jobs that meet the selected criteria are listed on the web page as shown in Figure 2. Participants can read the detailed information about a listing by clicking “More about this position”. Figure 3 is the detailed information page of the first listing in Figure 2. If a listing is considered to meet all the requirements, it can be saved by clicking the “SAVE this listing” button. When a participant works on a problem, the entire response process is recorded in the log files in addition to the final response outcome (correct/incorrect). For example, if a participant selected “Photography” and “7 days” in the two dropdown menus, clicked the “Part-time” radio button, then clicked “Find Jobs”, read the detailed information of the first listing and saved it, then a sequence of actions, “Start, Dropdown1_Photography, Dropdown2_7, Part-time, Find_Jobs, Click_W1, Save, Next”, is recorded in the log files. (This example item is not used in real practice. The coding of the actions and the action sequence described above were created for illustration purposes.)
The entire action sequence constitutes a single observation of process data. It tracks all the major actions the participants took when they interacted with the browsing environment.
The process responses contain substantially more comprehensive information about respondents than traditional item responses, which are often dichotomous (correct/incorrect) or polytomous (partial credit). On the other hand, to what extent this information is useful for educational and cognitive assessments and how to systematically make full use of it are largely unknown. One of the difficulties in analyzing process data is coping with its nonstandard format. Each process is a sequence of categorical variables (mouse clicks and keystrokes) and its length varies across observations. As a result, existing models for traditional item responses such as item response theory (IRT) models are not directly applicable to process data. Although some models have been extended to incorporate item response time (Klein Entink et al., 2009; S. Wang et al., 2018; Zhan et al., 2018), similar extensions for response processes are difficult.
Another challenge in analyzing process data comes from the wide diversity of human behaviors. Signals from behavioral patterns in response processes are often attenuated by a large number of noisy actions. The 1-step and 2-step lagged correlations of the response processes are often close to zero, indicating that models capturing only short-term dependence are often inadequate.
The rich variety of computer-based items also adds to the difficulty of developing general methods for process data. The computer interfaces involved in the PSTRE items in PIAAC 2012 include web browsers, mail clients, and spreadsheets. The required tasks in these items also vary greatly. In some recent developments in process data analysis, such as Greiff et al. (2016) and Kroehne and Goldhammer (2018), process data are first summarized into several variables according to domain knowledge, and then their relationships with other variables of interest are investigated by conventional statistical methods. The design of the summary variables is usually item-specific and requires a thorough understanding of respondents’ cognitive processes during human-computer interaction. Thus these approaches are too “expensive” to apply to even a moderate number of diverse items such as the PSTRE items in PIAAC 2012. He and von Davier (2016) adopted the concept of n-grams from natural language processing to explore the association between action sequence patterns and traditional item responses. The sequence patterns extracted by their procedure depend on the coding of log files and are often of limited capacity since they only consider consecutive actions.
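To make the n-gram idea concrete, the following minimal Python sketch counts runs of consecutive actions in a single sequence. It only illustrates what an action n-gram is; it is not the procedure of He and von Davier (2016), and the function name `action_ngrams` and the example sequence are hypothetical.

```python
from collections import Counter


def action_ngrams(seq, n):
    """Count all runs of n consecutive actions in one sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


# Hypothetical action sequence using the illustrative coding from above.
seq = ["Start", "Find_Jobs", "Click_W1", "Save",
       "Find_Jobs", "Click_W1", "Next"]
bigrams = action_ngrams(seq, 2)
```

Because only adjacent actions enter each n-gram, patterns separated by intervening actions are not captured, which is the limitation noted above.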
In this paper, we propose a generic method to extract features from process data. The extracted features play a role similar to that of the latent variables in item response theory (Lord, 1980; Lord and Novick, 1968). The proposed method does not rely on prior knowledge of the items or on the coding of the log files. Therefore, it is applicable to a wide range of process data with little item-specific processing effort. In the case study, we applied the proposed method to 14 PSTRE items in PIAAC 2012. These items vary widely in many aspects, including the content of the problem-solving task and their overall difficulty levels.
The main component of the proposed feature extraction method is an autoencoder (Goodfellow et al., 2016, Chapter 14). It is a class of artificial neural networks that tries to reproduce the input in its output. Autoencoders are often used for dimension reduction (Hinton and Salakhutdinov, 2006) and data denoising (Vincent et al., 2008, 2010; Lu et al., 2013; Li et al., 2015; Yousefi-Azar et al., 2017). They first map the input to a low-dimensional vector, from which they then try to reconstruct the input. Once a good autoencoder is found for a dataset, the low-dimensional vector contains comprehensive information about the original data and thus can be used as features summarizing the response processes.
With the proposed method, we extract features from each of the PSTRE items in PIAAC 2012 and explore the extracted feature space of process data. We show that the extracted features from response processes contain more information than the traditional item responses. We find that the prediction of many variables, including literacy and numeracy scores and a variety of background variables, can be considerably improved once the process features are incorporated.
Recurrent neural networks have been applied to knowledge tracing, where deep knowledge tracing models were shown to predict students’ performance on the next exercise from their exercise trajectories more accurately than traditional methods. Bosch and Paquette (2017) discussed several neural network architectures that can be used for analyzing interaction log data. They extracted features for detecting student boredom by modeling the relations of student behaviors in two time intervals. The log file data used there were aggregated into a more regular form. Ding et al. (2019) also studied the problem of extracting features from students’ learning processes using autoencoders. The learning processes considered there have a fixed number of steps, and the data in each step were preprocessed into raw features of fixed dimension.
The rest of the paper is organized as follows. In Section 2, we introduce the action sequence autoencoder and the feature extraction procedure for process data. The proposed procedure is applied to simulated processes in Section 3 to demonstrate how extracted features reveal the latent structure in response processes. Section 4 presents a case study of process data from PSTRE items of PIAAC to show that response processes contain more information than traditional responses. Some concluding remarks are made in Section 5.
2 Feature Extraction by Action Sequence Autoencoder
We adopt the following setting throughout this paper. Let $\mathcal{A} = \{a_1, \ldots, a_N\}$ denote the set of possible actions for an item, where $N$ is the total number of distinct actions and each element in $\mathcal{A}$ is a unique action. A response process can be represented as a sequence of actions, $\mathbf{s} = (s_1, \ldots, s_T)$, where $s_t \in \mathcal{A}$ for $t = 1, \ldots, T$ and $T$ denotes the length of the process, i.e., the total number of actions that a respondent took to solve the problem. An action sequence can be equivalently represented as a $T \times N$ binary matrix $\mathbf{S} = (S_{tj})$ whose $t$-th row gives the dummy variable representation of the action at time step $t$. More specifically, $S_{tj}$ being one indicates that the $t$-th action of the sequence is action $a_j$. There is one and only one element being one in each row. All other elements are zeros. In the rest of this article, $\mathbf{S}$ is used interchangeably with $\mathbf{s}$ for referring to an action sequence.
The length of a response process is likely to vary widely across respondents. As a result, the matrix representations of response processes from different respondents will have different numbers of rows. For a set of $n$ processes, $\mathbf{s}_1, \ldots, \mathbf{s}_n$ (equivalently, $\mathbf{S}_1, \ldots, \mathbf{S}_n$), the length of $\mathbf{s}_i$ (the number of rows in $\mathbf{S}_i$) is denoted by $T_i$, for $i = 1, \ldots, n$. The main motivation for developing a feature extraction method for process data is to compress the nonstandard data of varying dimension into vectors of a common dimension to facilitate subsequent standard statistical analysis.
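The binary matrix representation can be sketched in a few lines of Python. This is only an illustration: the function name `to_binary_matrix` is hypothetical, and the action pool reuses the illustrative coding of the job-search example rather than any real log-file coding.

```python
def to_binary_matrix(seq, actions):
    """Return the dummy-variable matrix of an action sequence.

    seq     : list of actions (one row per time step)
    actions : list of the distinct actions; its order fixes the columns
    """
    index = {a: j for j, a in enumerate(actions)}
    matrix = []
    for action in seq:
        row = [0] * len(actions)
        row[index[action]] = 1  # exactly one entry per row equals one
        matrix.append(row)
    return matrix


# Illustrative action pool and sequence (hypothetical coding).
actions = ["Start", "Dropdown1_Photography", "Dropdown2_7", "Part-time",
           "Find_Jobs", "Click_W1", "Save", "Next"]
seq = ["Start", "Dropdown1_Photography", "Find_Jobs",
       "Click_W1", "Save", "Next"]
S = to_binary_matrix(seq, actions)
```

Sequences of different lengths yield matrices with different numbers of rows but the same number of columns, which is the nonstandard format the feature extraction method is meant to compress.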
The main component of our feature extraction method is an autoencoder (Goodfellow et al., 2016, Chapter 14). It is a type of artificial neural network whose output tries to reproduce the input. A trivial solution to this task is to link the input and the output through an identity function, but it provides little insight into the data. Autoencoders employ special structures in the mapping from the input to the output so that nontrivial reconstructions are formed to unveil the underlying low-dimensional structure. As illustrated in Figure 4, an autoencoder consists of two components, an encoder $\phi$ and a decoder $\psi$. The encoder transforms a complex and high-dimensional input $\mathbf{S}$ into a low-dimensional vector $\boldsymbol{\theta}$. Then the decoder reconstructs the input from $\boldsymbol{\theta}$. Since the low-dimensional vector is in a standard and simpler format and contains adequate information to restore the original data, autoencoders are often used for dimension reduction and feature extraction.
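The encoder and the decoder are often specified as families of functions, $\phi(\cdot; \boldsymbol{\eta})$ and $\psi(\cdot; \boldsymbol{\xi})$, respectively, where $\boldsymbol{\eta}$ and $\boldsymbol{\xi}$ are parameters to be estimated by minimizing the discrepancy between the inputs and the outputs of the autoencoder. To be more specific, letting $\hat{\mathbf{S}}_i = \psi(\phi(\mathbf{S}_i; \boldsymbol{\eta}); \boldsymbol{\xi})$ denote the output for input $\mathbf{S}_i$, $i = 1, \ldots, n$, the parameters $\boldsymbol{\eta}$ and $\boldsymbol{\xi}$ are estimated by minimizing
$$F(\boldsymbol{\eta}, \boldsymbol{\xi}) = \sum_{i=1}^{n} L(\mathbf{S}_i, \hat{\mathbf{S}}_i), \tag{1}$$
where $L$ is a loss function measuring the difference between the reconstructed data $\hat{\mathbf{S}}_i$ and the original data $\mathbf{S}_i$. Once estimates $\hat{\boldsymbol{\eta}}$ and $\hat{\boldsymbol{\xi}}$ are obtained, the latent representation or the features of an action sequence $\mathbf{S}$ can be computed by $\hat{\boldsymbol{\theta}} = \phi(\mathbf{S}; \hat{\boldsymbol{\eta}})$.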
To draw an analogy with IRT models or other latent variable models, one may consider $\hat{\boldsymbol{\theta}}$, the output of the encoder $\phi$, to be an estimator of the latent variables based on the responses, and the decoder $\psi$ to be the item response function that specifies the response distribution corresponding to a latent vector. For an IRT model, the estimator and the item response function are often coherent in the sense that the estimator is determined by the item response function. For an autoencoder, both $\phi$ and $\psi$ are parameterized and estimated from the data. There is no coherence guarantee between them. This is one of the theoretical drawbacks of autoencoders. Nonetheless, we hope that the parametric families for $\phi$ and $\psi$ are flexible enough that they can be consistently estimated with large samples, so that approximate coherence is automatically achieved.
Based on the above discussion, a crucial step in the application of autoencoders is to specify an encoder and a decoder that are suitable for the data to be compressed. In the remainder of this section, we will describe an autoencoder that performs well for response processes.
2.2 Recurrent Neural Network
To facilitate the presentation, we first provide a brief introduction to the recurrent neural networks (RNNs), a pivotal component of the encoder and the decoder of the action sequence autoencoder.
RNNs form a class of artificial neural networks that deal with sequences. Unlike traditional artificial neural networks such as multi-layer feed-forward networks (Patterson and Gibson, 2017, Chapter 2) that treat an input as a simple vector, RNNs have a special structure to utilize the sequential information in the data. As depicted in Figure 5, the basic structure of RNNs has three components: inputs, hidden states, and outputs, each of which is a multivariate time series. The inputs $\mathbf{x}_1, \ldots, \mathbf{x}_T$ are $K$-dimensional vectors. The hidden states $\mathbf{m}_1, \ldots, \mathbf{m}_T$ are also $K$-dimensional and can be viewed as the memory that helps process the input information sequentially. The hidden state evolves as the input evolves. Each $\mathbf{m}_t$ summarizes what has happened up to time $t$ by integrating the current information $\mathbf{x}_t$ with the previous memory $\mathbf{m}_{t-1}$, that is, $\mathbf{m}_t$ is a function of $\mathbf{x}_t$ and $\mathbf{m}_{t-1}$,
$$\mathbf{m}_t = f(\mathbf{x}_t, \mathbf{m}_{t-1}), \tag{2}$$
for $t = 1, \ldots, T$. The initial hidden state $\mathbf{m}_0$ is often set to be the zero vector. To extract from memory the information that is useful for subsequent tasks, a $K$-dimensional output vector $\mathbf{y}_t$ is produced as a function of the hidden state $\mathbf{m}_t$ at each time step $t$,
$$\mathbf{y}_t = g(\mathbf{m}_t). \tag{3}$$
Both $f$ and $g$ are often specified as parametric families of functions with parameters to be estimated from data.
To summarize, an RNN makes use of the current input $\mathbf{x}_t$ and a summary of previous information $\mathbf{m}_{t-1}$ to produce an updated summary $\mathbf{m}_t$, which in turn produces an output $\mathbf{y}_t$ at each time step $t$. An RNN is not a probabilistic model. It does not specify the probability distribution of the input $\mathbf{x}_t$ or the output $\mathbf{y}_t$ given the hidden state $\mathbf{m}_t$. It is essentially a deterministic nonlinear function that takes a sequence of vectors and outputs another sequence of vectors. Each output vector summarizes the useful information in the input vectors up to the current time step. We will write the function induced by an RNN as $R(\cdot; \boldsymbol{\gamma})$, where $\boldsymbol{\gamma}$ collects the parameters in $f$ and $g$. Letting $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_T)^\top$ and $\mathbf{Y} = (\mathbf{y}_1, \ldots, \mathbf{y}_T)^\top$ respectively denote the inputs and the outputs of the RNN, we have $\mathbf{Y} = R(\mathbf{X}; \boldsymbol{\gamma})$. We use a subscript $t$ to denote the output vector at time step $t$, that is, $\mathbf{y}_t = R_t(\mathbf{X}; \boldsymbol{\gamma})$.
RNNs can process sequences of different lengths. Note that the functions $f$ and $g$ in (2) and (3) are the same across time steps. Therefore, the total number of parameters of an RNN does not depend on the number of time steps.
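The recursion described above can be sketched in plain Python. The specific forms of the hidden-state update and the output function (a `tanh` of linear maps) and all weight values are assumptions chosen for illustration; the LSTM and GRU units used later have more elaborate updates.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_forward(xs, W_x, W_m, W_y):
    """Run a basic RNN over a sequence of K-dimensional inputs.

    Hidden update (assumed form): m_t = tanh(W_x x_t + W_m m_{t-1})
    Output (assumed form):        y_t = tanh(W_y m_t)
    The same weights are reused at every time step.
    """
    K = len(W_m)
    m = [0.0] * K  # the initial hidden state is the zero vector
    ys = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_x, x), matvec(W_m, m))]
        m = [math.tanh(p) for p in pre]
        ys.append([math.tanh(p) for p in matvec(W_y, m)])
    return ys


# Hypothetical 2-dimensional weights, shared by sequences of any length.
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_m = [[0.2, 0.0], [-0.4, 0.6]]
W_y = [[1.0, 0.5], [-0.5, 1.0]]
short = rnn_forward([[1.0, 0.0], [0.0, 1.0]], W_x, W_m, W_y)
longer = rnn_forward([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], W_x, W_m, W_y)
```

The two calls use the same three weight matrices for inputs of length two and three, and the outputs for the shared prefix coincide because each output depends only on the inputs up to the current step.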
Various choices of $f$ and $g$ have been proposed to compute the hidden states and the outputs. The two most widely used ones are the long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997) and the gated recurrent unit (GRU) (Cho et al., 2014). They are designed to mitigate the vanishing or exploding gradient problem of a basic RNN (Bengio et al., 1994). We will use both designs in the RNN component of our action sequence autoencoder. The detailed expressions of the LSTM unit and the GRU are given in the appendix.
2.3 Action Sequence Autoencoder
The action sequence autoencoder used for extracting features from process data takes a sequence of actions as the input process and outputs a reconstructed sequence. The diagram in Figure 6 illustrates the structure of the action sequence autoencoder. In what follows, we elaborate on the encoding and decoding mechanisms.
The encoder of the action sequence autoencoder takes a sequence of actions and outputs a $K$-dimensional vector as a compressed summary of the input action sequence. Working with action sequences directly is often challenging because of the categorical nature of the actions. To overcome this obstacle, we associate each action $a_j$ in the action pool $\mathcal{A}$ with a $K$-dimensional latent vector $\mathbf{e}_j$ that will be estimated from the data. These latent vectors describe the attributes of actions and will be used to summarize the information contained in the sequence. The method of mapping categorical variables to continuous latent attributes is often called the embedding method. It is widely used in machine learning applications such as neural machine translation and knowledge graph completion (Bengio et al., 2003; Mikolov et al., 2013; Kraft et al., 2016).
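The first operation of our encoder is to transform the input sequence $\mathbf{s} = (s_1, \ldots, s_T)$ into a corresponding sequence of latent vectors $(\mathbf{e}_{j_1}, \ldots, \mathbf{e}_{j_T})$, where $j_t$ is the index of the action in $\mathcal{A}$ at time step $t$, that is, $s_t = a_{j_t}$ for $t = 1, \ldots, T$. With the binary matrix representation $\mathbf{S}$ of the action sequence, the embedding step of the encoder is simply a matrix multiplication $\mathbf{X} = \mathbf{S}\mathbf{E}$, where $\mathbf{E}$ is an $N \times K$ matrix whose $j$-th row is the latent vector $\mathbf{e}_j^\top$ for action $a_j$, and the rows of $\mathbf{X}$ form the latent vector sequence corresponding to the original action sequence $\mathbf{s}$.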
Given the latent vector sequence, the encoder uses an RNN to summarize the information. Since our goal is to compress the entire response process into a single $K$-dimensional vector, only the last output vector $\boldsymbol{\theta}_T$ of the RNN is kept to serve as a summary of information. Therefore, the output of the encoder, i.e., the latent representation of the input sequence, is $\boldsymbol{\theta} = \boldsymbol{\theta}_T$.
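To summarize, the encoder of our action sequence autoencoder is
$$\phi(\mathbf{S}; \boldsymbol{\eta}) = R_T(\mathbf{S}\mathbf{E}; \boldsymbol{\gamma}_E), \tag{4}$$
where $\boldsymbol{\eta}$ represents all the parameters, including the embedding matrix $\mathbf{E}$ and the parameter vector $\boldsymbol{\gamma}_E$ of the encoder RNN. The encoding procedure consists of the following three steps.
1. An observed action sequence $\mathbf{S}$ is transformed into a sequence of latent vectors by the embedding method: $\mathbf{X} = \mathbf{S}\mathbf{E}$.
2. The latent vector sequence is processed by the encoder RNN to obtain another sequence of vectors $(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_T)$, where $\boldsymbol{\theta}_t = R_t(\mathbf{X}; \boldsymbol{\gamma}_E)$ for $t = 1, \ldots, T$.
3. The last output of the RNN is kept as the latent representation, namely, $\boldsymbol{\theta} = \boldsymbol{\theta}_T$.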
Each of the three steps corresponds to an arrow in the encoder part of Figure 6.
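The three encoding steps can be sketched as follows. This is a minimal illustration under stated assumptions: a plain `tanh` RNN whose output equals its hidden state stands in for the LSTM or GRU unit, the parameters are random rather than estimated, and the names `encode`, `E`, `W_x`, and `W_m` are hypothetical.

```python
import math
import random


def encode(seq_idx, E, W_x, W_m):
    """Embed an action sequence and return the last RNN output.

    seq_idx : list of action indices, one per time step
    E       : embedding matrix (row j is the latent vector of action j)
    The recurrent unit is a plain tanh RNN whose output is its hidden
    state; this stands in for the LSTM/GRU units and is an assumption.
    """
    K = len(E[0])
    m = [0.0] * K
    for j in seq_idx:
        x = E[j]  # step 1: embedding lookup (a row of the product S E)
        m = [math.tanh(sum(W_x[k][l] * x[l] for l in range(K)) +
                       sum(W_m[k][l] * m[l] for l in range(K)))
             for k in range(K)]  # step 2: recurrent update
    return m  # step 3: keep only the last output


random.seed(0)
N, K = 5, 3  # hypothetical numbers of actions and features
E = [[random.gauss(0, 1) for _ in range(K)] for _ in range(N)]
W_x = [[random.gauss(0, 0.5) for _ in range(K)] for _ in range(K)]
W_m = [[random.gauss(0, 0.5) for _ in range(K)] for _ in range(K)]
theta_short = encode([0, 2, 4], E, W_x, W_m)
theta_long = encode([0, 1, 1, 3, 2, 4], E, W_x, W_m)
```

Both sequences are mapped to vectors of the same length even though they contain different numbers of actions, which is the property that makes the encoder output usable as a feature vector.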
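The decoder of the action sequence autoencoder reconstructs an action sequence $\mathbf{s}$, or equivalently, its binary matrix representation $\mathbf{S}$, from $\boldsymbol{\theta}$. First, a different RNN is used to expand the latent representation $\boldsymbol{\theta}$ into a sequence of vectors, each of which contains the information of the action at the corresponding time step. As $\boldsymbol{\theta}$ is the only information available for the reconstruction, the input of the decoder RNN is the same for each of the $T$ time steps. Writing it in matrix form, the input of the decoder RNN is $\mathbf{1}_T \boldsymbol{\theta}^\top$, where $\mathbf{1}_T$ is the $T$-dimensional vector of ones. After the decoder RNN’s processing, we obtain a sequence of $K$-dimensional vectors $(\mathbf{y}_1, \ldots, \mathbf{y}_T)^\top = R(\mathbf{1}_T \boldsymbol{\theta}^\top; \boldsymbol{\gamma}_D)$, where $\boldsymbol{\gamma}_D$ is the parameter vector of the decoder RNN. Each $\mathbf{y}_t$ contains the information for the action taken at time step $t$.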
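Recall that each row of $\mathbf{S}$ is the dummy variable representation of the action taken at the corresponding time step. Each row essentially specifies a degenerate categorical distribution on $\mathcal{A}$, with the action that is actually taken having probability one and all the other actions having probability zero. With this observation, the task of restoring the action at step $t$ becomes constructing the probability distribution of the action taken at step $t$ from $\mathbf{y}_t$. The multinomial logit model (MLM) can be used in the decoder to achieve this. To be more specific, the probability of taking action $a_j$ at time $t$ is
$$\hat{S}_{tj} = \frac{\exp(b_j + \mathbf{y}_t^\top \boldsymbol{\beta}_j)}{\sum_{k=1}^{N} \exp(b_k + \mathbf{y}_t^\top \boldsymbol{\beta}_k)}, \tag{5}$$
where $b_j$ and $\boldsymbol{\beta}_j$ are parameters to be estimated from the data. Note that the parameters in (5) do not depend on $t$. That is, the decoder uses the same MLM to compute the probability distribution of $s_t$ from $\mathbf{y}_t$ for $t = 1, \ldots, T$. As a result, the reconstructed sequence is $\hat{\mathbf{S}} = (\hat{S}_{tj})$ and the decoder can be written as
$$\psi(\boldsymbol{\theta}; \boldsymbol{\xi}) = \hat{\mathbf{S}}, \quad \text{with } \hat{S}_{tj} \text{ given by (5) and } \mathbf{y}_t = R_t(\mathbf{1}_T \boldsymbol{\theta}^\top; \boldsymbol{\gamma}_D), \tag{6}$$
where the parameter vector $\boldsymbol{\xi}$ consists of the parameter vector $\boldsymbol{\gamma}_D$ of the decoder RNN and $b_j$, $\boldsymbol{\beta}_j$, $j = 1, \ldots, N$.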
If we have an ideal autoencoder that reconstructs the input perfectly, the probability distribution specified by each row of $\hat{\mathbf{S}}$ will concentrate all its probability mass on the action that is actually taken. In practice, it is very unlikely that such a perfect autoencoder can be constructed. Usually, every action in the action set will be assigned a positive probability in the reconstructed probability distribution. For a given set of response processes, we want to choose the parameters in the encoder and the decoder so that the reconstructed probability distribution concentrates as much probability mass on the actual action as possible.
To summarize, as depicted in the decoder part of Figure 6, the decoding procedure of the action sequence autoencoder consists of the following three steps.
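1. The latent representation $\boldsymbol{\theta}$ is replicated $T$ times to form the matrix $\mathbf{1}_T \boldsymbol{\theta}^\top$.
2. The decoder RNN takes $\mathbf{1}_T \boldsymbol{\theta}^\top$ and outputs a sequence of vectors $(\mathbf{y}_1, \ldots, \mathbf{y}_T)$, each element of which contains the information of the action at the corresponding step.
3. The probability distribution of $s_t$ is computed according to the MLM from $\mathbf{y}_t$ at each time step $t$.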
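The decoding steps above can be sketched as follows. As with any illustration of this kind, the details are assumptions: a plain `tanh` RNN replaces the LSTM or GRU unit, the parameter values are made up, and the names `decode`, `softmax`, `b`, and `B` are hypothetical.

```python
import math


def softmax(scores):
    """Multinomial logit probabilities, computed stably."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode(theta, T, W_x, W_m, b, B):
    """Reconstruct T categorical distributions from one latent vector.

    theta : latent representation (length K)
    b, B  : MLM intercepts (length N) and coefficients (N rows, K columns)
    A plain tanh RNN replaces the LSTM/GRU unit (an assumption).
    """
    K = len(theta)
    m = [0.0] * K
    S_hat = []
    for _ in range(T):  # the same input theta is fed at every step
        m = [math.tanh(sum(W_x[k][l] * theta[l] for l in range(K)) +
                       sum(W_m[k][l] * m[l] for l in range(K)))
             for k in range(K)]
        scores = [b[j] + sum(B[j][l] * m[l] for l in range(K))
                  for j in range(len(b))]
        S_hat.append(softmax(scores))  # same MLM at every time step
    return S_hat


# Hypothetical parameter values with two features and three actions.
theta = [0.4, -0.7]
W_x = [[0.9, 0.1], [-0.2, 0.5]]
W_m = [[0.3, -0.6], [0.7, 0.2]]
b = [0.0, 0.1, -0.1]
B = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
S_hat = decode(theta, 4, W_x, W_m, b, B)
```

Although the input is identical at every step, the rows of `S_hat` can differ because the hidden state of the decoder RNN changes from step to step.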
In order to extract good features for a given set of response processes, we need to construct an action sequence autoencoder that reconstructs the response processes as well as possible. The discrepancy between an action sequence $\mathbf{S}$ and its reconstructed version $\hat{\mathbf{S}}$ can be measured by the following loss function
$$L(\mathbf{S}, \hat{\mathbf{S}}) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{N} S_{tj} \log \hat{S}_{tj}. \tag{7}$$
Note that, for a given $t$, only one of $S_{t1}, \ldots, S_{tN}$ is non-zero. The loss function is smaller if the distribution specified by $(\hat{S}_{t1}, \ldots, \hat{S}_{tN})$ is more concentrated on the action that is actually taken at step $t$. The best action sequence autoencoder for describing a given set of response processes is the one that minimizes the total reconstruction loss defined in (1).
Notice that (7) is in the same form as the log-likelihood function of categorical distributions. By using this loss function, we implicitly define a probabilistic model for the response processes. That is, given the latent representation $\boldsymbol{\theta}$, $s_t$ follows a categorical distribution on $\mathcal{A}$ with probability vector $(\hat{S}_{t1}, \ldots, \hat{S}_{tN})$. The decoder of the action sequence autoencoder specifies the functional form of the probability vector in terms of $\boldsymbol{\theta}$ and $\boldsymbol{\xi}$.
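A minimal sketch of such a reconstruction loss is given below. It computes a cross-entropy between the one-hot rows of an action sequence and the reconstructed probability rows; averaging over time steps is an assumption of this sketch, and the function name and toy inputs are hypothetical.

```python
import math


def reconstruction_loss(S, S_hat):
    """Average cross-entropy between one-hot rows and predicted rows."""
    T = len(S)
    total = 0.0
    for row, probs in zip(S, S_hat):
        for s, p in zip(row, probs):
            if s == 1:  # only the action actually taken contributes
                total -= math.log(p)
    return total / T


# Two time steps, three possible actions (toy values).
S = [[1, 0, 0], [0, 0, 1]]
sharp = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
flat = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
loss_sharp = reconstruction_loss(S, sharp)
loss_flat = reconstruction_loss(S, flat)
```

The reconstruction that puts more probability mass on the actions actually taken (`sharp`) has the smaller loss, in line with the discussion above.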
Based on the above discussion, we extract features from response processes through the following procedure.
Procedure 1 (Feature extraction for process data).
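1. Find a minimizer, $(\hat{\boldsymbol{\eta}}, \hat{\boldsymbol{\xi}})$, of the objective function $F(\boldsymbol{\eta}, \boldsymbol{\xi}) = \sum_{i=1}^{n} L(\mathbf{S}_i, \psi(\phi(\mathbf{S}_i; \boldsymbol{\eta}); \boldsymbol{\xi}))$ by stochastic gradient descent through the following steps.
   (a) Initialize the parameters $\boldsymbol{\eta}$ and $\boldsymbol{\xi}$.
   (b) Randomly generate $i$ from $\{1, \ldots, n\}$ and update $\boldsymbol{\eta}$ and $\boldsymbol{\xi}$ with $\boldsymbol{\eta} - \alpha \, \partial L(\mathbf{S}_i, \hat{\mathbf{S}}_i) / \partial \boldsymbol{\eta}$ and $\boldsymbol{\xi} - \alpha \, \partial L(\mathbf{S}_i, \hat{\mathbf{S}}_i) / \partial \boldsymbol{\xi}$, respectively, where $\alpha$ is a predetermined small positive number.
   (c) Repeat step (b) until convergence.
2. Calculate $\hat{\boldsymbol{\theta}}_i = \phi(\mathbf{S}_i; \hat{\boldsymbol{\eta}})$, for $i = 1, \ldots, n$. Each column of $\hat{\boldsymbol{\Theta}} = (\hat{\boldsymbol{\theta}}_1, \ldots, \hat{\boldsymbol{\theta}}_n)^\top$ is a raw feature of the response processes.
3. Perform principal component analysis (PCA) on $\hat{\boldsymbol{\Theta}}$. The principal components are the principal features of the response processes.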
In Step 1, the optimization problem is solved by stochastic gradient descent (SGD) (Robbins and Monro, 1951). In Step 1b, a fixed step size is used for updating the parameters. Data-dependent step sizes such as those proposed in Duchi et al. (2011), Zeiler (2012), and Kingma and Ba (2014) can be easily adapted for the optimization problem.
Neural networks are often over-parametrized. To prevent overfitting, validation-based early stopping (Prechelt, 2012) is often used when estimating the parameters of complicated neural networks such as our action sequence autoencoder. With this technique, the optimization algorithm, in our case SGD, is not run until convergence. A parameter value that is obtained before convergence and performs well on the validation set is used as an estimate of the minimizer. To perform early stopping, a dataset is split into a training set and a validation set. A chosen optimization algorithm is run only on the training set for a number of epochs. An epoch consists of $n_t$ iterations, where $n_t$ is the size of the training set. At the end of each epoch, the objective function is evaluated on the validation set. The parameter value that produces the lowest validation loss is used as an estimate of the minimizer. We adopt this technique when constructing the action sequence autoencoder. The feature extraction procedure with validation-based early stopping is summarized in Procedure 2.
Procedure 2 (Feature extraction with validation-based early stopping).
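1. Find a minimizer, $(\hat{\boldsymbol{\eta}}, \hat{\boldsymbol{\xi}})$, of the objective function $F(\boldsymbol{\eta}, \boldsymbol{\xi})$ by stochastic gradient descent with validation-based early stopping through the following steps.
   (a) Randomly split $\{1, \ldots, n\}$ into a training index set $\Omega_t$ of size $n_t$ and a validation index set $\Omega_v$ of size $n_v$.
   (b) Initialize the parameters $\boldsymbol{\eta}$ and $\boldsymbol{\xi}$ and calculate $F_v^* = \sum_{i \in \Omega_v} L(\mathbf{S}_i, \psi(\phi(\mathbf{S}_i; \boldsymbol{\eta}); \boldsymbol{\xi}))$.
   (c) Randomly permute the indices in $\Omega_t$ and denote the result as $(i_1, \ldots, i_{n_t})$.
   (d) For $k = 1, \ldots, n_t$, update $\boldsymbol{\eta}$ and $\boldsymbol{\xi}$ with $\boldsymbol{\eta} - \alpha \, \partial L(\mathbf{S}_{i_k}, \hat{\mathbf{S}}_{i_k}) / \partial \boldsymbol{\eta}$ and $\boldsymbol{\xi} - \alpha \, \partial L(\mathbf{S}_{i_k}, \hat{\mathbf{S}}_{i_k}) / \partial \boldsymbol{\xi}$, respectively.
   (e) Calculate $F_v = \sum_{i \in \Omega_v} L(\mathbf{S}_i, \psi(\phi(\mathbf{S}_i; \boldsymbol{\eta}); \boldsymbol{\xi}))$. If $F_v$ is smaller than $F_v^*$, let $\hat{\boldsymbol{\eta}} = \boldsymbol{\eta}$ and $\hat{\boldsymbol{\xi}} = \boldsymbol{\xi}$ and update $F_v^*$ with $F_v$.
   (f) Repeat steps (c), (d), and (e) sufficiently many times.
2. Calculate $\hat{\boldsymbol{\theta}}_i = \phi(\mathbf{S}_i; \hat{\boldsymbol{\eta}})$, for $i = 1, \ldots, n$. Each column of $\hat{\boldsymbol{\Theta}} = (\hat{\boldsymbol{\theta}}_1, \ldots, \hat{\boldsymbol{\theta}}_n)^\top$ is a raw feature of the response processes.
3. Perform principal component analysis (PCA) on $\hat{\boldsymbol{\Theta}}$. The principal components are the principal features of the response processes.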
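The control flow of Step 1 in Procedure 2 can be sketched on a toy problem. In the sketch below, a one-parameter least-squares model stands in for the action sequence autoencoder so that the steps stay visible; the model, the data, the 4:1 split, and the function name `sgd_early_stopping` are all assumptions made for illustration.

```python
import random


def sgd_early_stopping(data, n_epochs=50, alpha=0.05, seed=0):
    """Fit y = w * x by SGD, keeping the best validation-loss parameter.

    A one-parameter least-squares model stands in for the autoencoder
    so that the control flow of the procedure stays visible.
    """
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    n_val = len(data) // 5
    val, train = data[:n_val], data[n_val:]      # step (a): random split

    def val_loss(w):
        return sum((y - w * x) ** 2 for x, y in val)

    w = 0.0                                      # step (b): initialize
    best_w, best_loss = w, val_loss(w)
    for _ in range(n_epochs):                    # step (f): repeat
        rng.shuffle(train)                       # step (c): permute
        for x, y in train:                       # step (d): SGD updates
            w -= alpha * (-2.0 * x * (y - w * x))
        loss = val_loss(w)                       # step (e): validate
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


# Toy data generated with true slope 2 (hypothetical).
gen = random.Random(1)
xs = [gen.uniform(-1, 1) for _ in range(200)]
data = [(x, 2.0 * x + gen.gauss(0, 0.1)) for x in xs]
w_hat, loss_hat = sgd_early_stopping(data)
```

The returned parameter is the one with the lowest validation loss among the values visited at the ends of epochs, not necessarily the value at the last epoch.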
The proposed feature extraction procedure requires the number of features to be extracted, $K$, as an input. In general, if $K$ is too small, the action sequence autoencoder does not have enough flexibility to capture the structure of the response processes. On the other hand, if $K$ is too big, the extracted features contain too much redundant information, causing overfitting and instability in downstream analyses. We adopt the $k$-fold cross-validation procedure (Stone, 1974) to choose a suitable $K$ in the analyses presented in Sections 3 and 4.
We perform principal component analysis on the raw features in the last step of the proposed feature extraction procedure to seek feature interpretations. As we will show in the case study, the first several principal features usually have clear interpretations even though the meaning of the actions is not taken into account in the feature extraction procedure.
Since the extracted features have a standard format, they can be easily incorporated in (generalized) linear models and many other well-developed statistical procedures. As we will show in the sequel, the extracted features contain a substantial amount of information about the action sequences. They can be used as surrogates of the action sequences to study how response processes are related to the respondents’ latent traits and other quantities of interest.
3.1 Experiment Settings
In this section, we apply the proposed feature extraction method to simulated response processes of an item with 26 possible actions. Each action in the item is denoted by an upper-case English letter. In other words, we define $\mathcal{A} = \{\mathrm{A}, \mathrm{B}, \ldots, \mathrm{Z}\}$. All the sequences used in the study start with A and end with Z, meaning that A and Z represent the start and the end of an item, respectively.
In our simulation study, action sequences are generated from Markov chains. That is, given the first $t$ actions in a response process, $s_1, \ldots, s_t$, the distribution from which $s_{t+1}$ is generated depends only on $s_t$. A Markov chain is determined by its probability transition matrix $\mathbf{P} = (p_{ij})_{1 \le i, j \le N}$, where $p_{ij} = P(s_{t+1} = a_j \mid s_t = a_i)$. Because of the special meaning of actions A and Z, there should not be transitions from other actions to A or from Z to other actions. As a result, the probability transition matrices used in our simulation study have the constraints that $p_{i1} = 0$ for $i = 1, \ldots, N$, and $p_{NN} = 1$. To construct a probability transition matrix, we only need to specify the elements in its upper right $(N-1) \times (N-1)$ submatrix. Given a transition matrix $\mathbf{P}$, we start a sequence with A and generate all subsequent actions according to $\mathbf{P}$ until Z appears.
Two simulation scenarios are devised in our experiments to impose latent class structures on the generated response processes. In Scenario I, two latent groups are formed by generating action sequences from two different Markov chains. Let $\mathbf{P}^{(1)}$ and $\mathbf{P}^{(2)}$ denote the probability transition matrices of the two chains. A set of $n$ sequences is obtained by generating $n/2$ sequences according to $\mathbf{P}^{(1)}$ and the remaining $n/2$ sequences according to $\mathbf{P}^{(2)}$. Both $\mathbf{P}^{(1)}$ and $\mathbf{P}^{(2)}$ are randomly generated and then fixed to generate all sets of response processes. To generate $\mathbf{P}^{(1)}$, we first construct an $(N-1) \times (N-1)$ matrix $\mathbf{U}$ whose elements are independent samples from a uniform distribution on a fixed interval. Then the upper right submatrix of $\mathbf{P}^{(1)}$ is computed from $\mathbf{U}$ by transforming each row of $\mathbf{U}$ into a probability vector, so that its entries are nonnegative and sum to one. The transition matrix $\mathbf{P}^{(2)}$ is obtained similarly.
In Scenario II, half of the action sequences in a set are generated from $\mathbf{P}^{(1)}$ as in Scenario I. The other half is obtained by reversing the actions between A and Z in each of the generated sequences. For example, if (A, B, C, Z) is a generated sequence, then the corresponding reversed sequence is (A, C, B, Z). The difference between the two latent groups formed in this scenario is more subtle than that in Scenario I, as a sequence and its reversed version cannot be distinguished by the marginal counts of the actions in $\mathcal{A}$.
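The data-generating mechanism of the two scenarios can be sketched as follows. The constraints on the transition matrix and the reversal follow the description above; the softmax row transformation, the sampling interval `[-1, 1]`, and all function names are assumptions of this sketch and are not confirmed by the text.

```python
import math
import random
import string

ACTIONS = list(string.ascii_uppercase)  # "A" starts and "Z" ends a sequence
N = len(ACTIONS)


def random_transition_matrix(rng, low=-1.0, high=1.0):
    """Build a transition matrix satisfying the A/Z constraints.

    Rows of the upper right block are softmax transforms of uniform
    draws; the softmax and the interval are assumptions.
    """
    P = [[0.0] * N for _ in range(N)]
    for i in range(N - 1):
        u = [rng.uniform(low, high) for _ in range(N - 1)]
        total = sum(math.exp(v) for v in u)
        for j in range(N - 1):
            P[i][j + 1] = math.exp(u[j]) / total  # never back to "A"
    P[N - 1][N - 1] = 1.0  # "Z" is absorbing
    return P


def generate_sequence(P, rng):
    """Start at "A" and follow P until "Z" appears."""
    state, seq = 0, ["A"]
    while state != N - 1:
        state = rng.choices(range(N), weights=P[state])[0]
        seq.append(ACTIONS[state])
    return seq


def reverse_middle(seq):
    """Reverse the actions strictly between the leading A and final Z."""
    return [seq[0]] + seq[-2:0:-1] + [seq[-1]]


rng = random.Random(0)
P1 = random_transition_matrix(rng)
seq = generate_sequence(P1, rng)
reversed_seq = reverse_middle(seq)
```

A sequence and its reversed version contain exactly the same actions with the same frequencies, so only the order of the actions can separate the two groups in Scenario II.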
We consider three choices of $n$: 500, 1000, and 2000. One hundred sets of action sequences are generated for each simulation scenario and each choice of $n$. Procedure 2 is applied to each dataset. Both LSTM and GRU are considered for the recurrent unit in the autoencoder. For each choice of the recurrent unit, the number of features to be extracted is chosen from a set of candidate values by five-fold cross-validation.
We investigate the ability of the extracted features to preserve the information in action sequences by examining their performance in reconstructing variables derived from action sequences. The variables to be reconstructed are indicators of the appearance of an action or an action pair in a sequence. Rare actions and action pairs that appear fewer than a threshold number of times in a dataset are not taken into consideration. We model the relationship between the indicators and the extracted features through logistic regression. For each dataset, the $n$ sequences are split into training and test sets in the ratio of 4:1. A logistic regression model is estimated for each indicator on the training set and its prediction performance is evaluated on the test set by the proportion of correct predictions, i.e., prediction accuracy. The average prediction accuracy over all the considered indicators is recorded for each dataset and each choice of the recurrent unit.
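The derived variables used as reconstruction targets can be sketched as follows. Two details are assumptions of this sketch rather than statements of the text: an action pair is taken to mean two consecutive actions, and rarity is measured by the number of sequences in which a feature appears. The function name and the toy sequences are hypothetical.

```python
def appearance_indicators(sequences, min_count=1):
    """Build 0/1 indicators for actions and consecutive action pairs.

    Returns (features, matrix): matrix[i][k] is 1 if feature k occurs
    in sequence i. Features occurring in fewer than `min_count`
    sequences are dropped; treating pairs as consecutive actions and
    counting occurrences per sequence are assumptions.
    """
    per_seq = []
    for seq in sequences:
        found = set(seq)                 # single actions
        found.update(zip(seq, seq[1:]))  # consecutive pairs (tuples)
        per_seq.append(found)
    counts = {}
    for found in per_seq:
        for f in found:
            counts[f] = counts.get(f, 0) + 1
    features = sorted((f for f, c in counts.items() if c >= min_count),
                      key=str)
    matrix = [[int(f in found) for f in features] for found in per_seq]
    return features, matrix


sequences = [list("ABCZ"), list("ACBZ"), list("ABZ")]
features, X = appearance_indicators(sequences, min_count=2)
```

Each column of `X` is one binary response that a logistic regression on the extracted features would try to predict.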
To study how well the extracted features unveil the latent group structures in response processes, we build a logistic regression model to classify the action sequences according to the extracted features. The training and test sets are split similarly as before and the prediction accuracy on the test set is recorded for evaluation.
|Scenario||$n$||Reconstruction Accuracy||Reconstruction Accuracy||Group Accuracy||Group Accuracy|
|I||500||0.88 (0.005)||0.87 (0.006)||0.99 (0.010)||1.00 (0.007)|
| ||1000||0.90 (0.003)||0.90 (0.004)||0.99 (0.005)||0.99 (0.006)|
| ||2000||0.91 (0.002)||0.91 (0.003)||0.99 (0.005)||0.99 (0.005)|
|II||500||0.88 (0.006)||0.88 (0.006)||0.86 (0.033)||0.87 (0.031)|
| ||1000||0.90 (0.004)||0.91 (0.005)||0.86 (0.021)||0.86 (0.021)|
| ||2000||0.91 (0.002)||0.92 (0.003)||0.87 (0.027)||0.87 (0.016)|
Table 1: Mean (standard deviation) of prediction accuracy in the simulation study. The two columns under each accuracy measure correspond to the two choices of the recurrent unit (LSTM and GRU).
Table 1 reports the results of our simulation study. A few observations can be made from Table 1. First, the accuracy for reconstructing the appearance of actions and action pairs is high in both simulation scenarios, indicating that the extracted features preserve a substantial amount of information in the original action sequences. The reconstruction accuracy improves slightly as $n$ increases. Including more action sequences provides more information for estimating the autoencoder in Step 1 of Procedure 2, thus producing better features. A larger sample size can also lead to a better fit of the logistic models that relate features to derived variables. Both effects contribute to the improvement of action and action pair reconstruction.
Second, in both simulation scenarios, the extracted features can distinguish the two latent groups well. In Scenario I, the two groups can be separated almost perfectly. Since the group difference in Scenario II is more subtle, the accuracy in classifying the two groups is lower than that in Scenario I, but still more than 85% of the sequences can be classified correctly. To further look at how the extracted features unveil the latent structure of action sequences, we plot two principal features for one of the datasets with 2000 sequences for each scenario in Figure 7. The left panel presents the first two principal features for Scenario I. The group structure is clearly shown and the two groups can be roughly separated by a horizontal line at 0. The right panel of Figure 7 displays the plot of the first and fourth principal features for Scenario II. Again the two groups can be clearly separated.
Last, the extracted features for the two choices of the recurrent unit in the action sequence autoencoder are comparable in terms of both reconstruction and group structure identification. A GRU has a simpler structure and fewer parameters than an LSTM unit with the same latent dimension. In this sense, GRU is more efficient for our action sequence modeling.
4 Case Study
The process data used in this study contain 11,464 respondents’ response processes for the PSTRE items in PIAAC 2012. There are 14 PSTRE items in total. In our data, 7,620 respondents answered 7 items and 3,645 respondents answered all 14 items. For each of the 14 items, there are around 7,500 respondents. For each respondent-item pair, both the response process (action sequence) and the final response outcome were recorded. The original final outcomes for some items are polytomous. We simplify them into binary outcomes, with fully correct responses labelled as 1 and all others as 0.
The 14 PSTRE items in PIAAC 2012 vary in content, task complexity, and difficulty. Table 2 summarizes some basic descriptive statistics of the items: the number of respondents, the number of possible actions, the average sequence length, and the percentage of correct responses. There are three types of interaction environments: email client, spreadsheet, and web browser. Some items, such as U01a and U01b, have a single environment, while others, such as U02 and U23, involve multiple environments. U06a is the simplest item in terms of the number of possible actions and the average response length, but only about one fourth of the participants answered it correctly. Items U02 and U04a are the most difficult: only around 10% of the respondents correctly completed the given tasks. The tasks in these two items are relatively complicated, with a few hundred possible actions and more than 40 actions needed to finish the task. Given the wide variety of items, manually extracting important features of process data based on experts’ understanding of the items is time-consuming, whereas the proposed automatic method can be easily applied to all these items.
| Item | Description | Respondents | Possible actions | Average sequence length | Correct % |
|---|---|---|---|---|---|
| U01a | Party Invitations - Can/Cannot Come | 7620 | 207 | 24.8 | 54.5 |
| U01b | Party Invitations - Accommodations | 7670 | 249 | 52.9 | 49.3 |
| U06a | Sprained Ankle - Site Evaluation Table | 7622 | 47 | 10.8 | 26.4 |
| U06b | Sprained Ankle - Reliable/Trustworthy Site | 7612 | 98 | 16.0 | 52.3 |
| U07 | Digital Photography Book Purchase | 7549 | 125 | 18.6 | 46.0 |
| U11b | Locate E-mail - File 3 E-mails | 7528 | 236 | 30.9 | 20.1 |
| U19a | Club Membership - Member ID | 7556 | 373 | 26.9 | 69.4 |
| U19b | Club Membership - Eligibility for Club President | 7558 | 458 | 21.3 | 46.3 |

Note: the columns give the number of respondents, the number of possible actions, the average sequence length, and the percentage of correct responses for each item.
We extract features from the response processes for each of the 14 items using the proposed procedure. The number of features is chosen from a set of candidate values by five-fold cross-validation. Adam (Kingma & Ba, 2014) with a prespecified step size is used for optimizing the objective function in Step 1 of Procedure 2. The algorithm is run for 100 epochs with validation-based early stopping, where 10% of the processes are randomly sampled to form the validation set for each item.
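The training schedule described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the stopping rule (keep the state from the epoch with the lowest validation loss), the seed, and the callables `train_one_epoch` and `validation_loss` are assumptions introduced here.

```python
import random


def split_validation(n, valid_frac=0.1, seed=0):
    """Randomly hold out a fraction of the n processes as a validation set.
    Returns (training indices, validation indices)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_valid = int(round(valid_frac * n))
    return idx[n_valid:], idx[:n_valid]


def run_with_early_stopping(train_one_epoch, validation_loss, max_epochs=100):
    """Run up to max_epochs epochs and keep the model state from the epoch
    with the lowest validation loss (one simple form of early stopping).

    train_one_epoch(epoch) -> model state after that epoch (hypothetical hook)
    validation_loss(state) -> loss of that state on the validation set
    """
    best_loss, best_epoch, best_state = float("inf"), -1, None
    for epoch in range(max_epochs):
        state = train_one_epoch(epoch)
        loss = validation_loss(state)
        if loss < best_loss:
            best_loss, best_epoch, best_state = loss, epoch, state
    return best_epoch, best_state
```

A stricter variant would stop as soon as the validation loss has not improved for a fixed number of epochs; the text does not say which rule was used.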
Although the proposed method does not utilize the meaning of the actions for feature extraction, many of the principal features, especially the first several, have clear interpretations. Table 3 gives a partial list of feature interpretations.
| Interface | Feature interpretations |
|---|---|
| Email Client | viewing emails and folders, moving emails, creating new folders, typing emails |
| Spreadsheet | using sort, using search, clicking drop-down menu |
| Web Browser | clicking relevant links, clicking irrelevant links |
| All Interfaces | sequence length, using actions related to the task, switching working environments, selecting answers, answer submission |
The first or the second principal feature of each item is usually related to respondents’ attentiveness. An inattentive respondent tends to move to the next item without meaningful interactions with the computer environment. In contrast, an attentive respondent typically tries to understand and complete the task by exploring the environment. Attentiveness in the response process can thus be reflected in the length of the process. We call the principal feature that has the largest absolute correlation with the logarithm of the process length the attentiveness feature. In our case, the attentiveness feature is the second principal feature for item U06a and the first for all other items. For all the items, the absolute correlation between the attentiveness feature and the logarithm of sequence length is higher than 0.85. To make a higher attentiveness feature correspond to a more attentive respondent, we modify the attentiveness features by multiplying each of them by the sign of its correlation with the logarithm of process length. For a given pair of items, we select the respondents who responded to both items and calculate the correlation between the two modified attentiveness features. These correlations are all positive and range from 0.30 to 0.70, implying that respondents who are inattentive in one item tend to be inattentive in another item.
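The selection and sign adjustment of the attentiveness feature can be sketched in a few lines. The function names and the toy data layout (one row of principal features per respondent) are illustrative assumptions, not part of the original analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences
    (assumes neither sequence is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def attentiveness_feature(features, lengths):
    """features: one row per respondent, each row a list of principal features.
    lengths: the corresponding action sequence lengths.

    Returns the index of the principal feature with the largest absolute
    correlation with log sequence length, and that feature multiplied by the
    sign of the correlation so that larger values mean longer processes."""
    log_len = [math.log(length) for length in lengths]
    n_features = len(features[0])
    cors = [pearson([row[k] for row in features], log_len)
            for k in range(n_features)]
    best = max(range(n_features), key=lambda k: abs(cors[k]))
    sign = 1.0 if cors[best] >= 0 else -1.0
    return best, [sign * row[best] for row in features]
```

Correlating the modified features of two items, restricted to respondents who answered both, then gives the between-item correlations reported above.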
The feature space of the respondents with correct responses is usually very different from that of the respondents with incorrect responses. As an illustration, in Figure 8, we plot the first two principal features of U01b for the two groups of respondents separately. The two clouds of points clearly have very distinct shapes. The non-oval shape of the clouds suggests that the feature space is highly non-linear, so a multivariate normal distribution is not a suitable choice to describe the joint feature space. The scales of the two plots in Figure 8 are also different. The variation of the features of correct respondents is much smaller than that of incorrect respondents. The main reason for this phenomenon is that there are more ways to solve the problem incorrectly than correctly. Item U01b requires the respondents to create a new folder and to move some emails to the new folder. Among the incorrect respondents, some moved emails but didn’t create a new folder, while some created a new folder but didn’t move the emails correctly. There are also some respondents who didn’t respond seriously: they took fewer than five actions before moving to the next item. As shown in the right panel of Figure 8, respondents with similar behaviors are located close to each other in the feature space.
4.3 Reconstruction of Derived Variables
We demonstrate in this subsection that the features extracted from the proposed procedure retain a substantial amount of information of the response processes. To be more specific, we show that various variables directly derived from the processes can be reconstructed by the extracted features.
We define a derived variable as a binary variable indicating whether an action or a combination of actions appears in the process. For example, whether the first dropdown menu is set to “Photography” is a derived variable of the item described in the introduction. The binary response outcome is also a derived variable since it is entirely determined by the response process. In our data, 93 derived variables, including 14 item response outcomes, are considered.
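As a small illustration, the sketch below computes derived variables for the example action sequence given in the introduction. The helper names are hypothetical, and reading "a combination of actions" as "all of the listed actions appear somewhere in the sequence" is one possible interpretation.

```python
def contains_action(sequence, action):
    """1 if `action` appears anywhere in the action sequence, else 0."""
    return int(action in sequence)


def contains_all(sequence, actions):
    """1 if every action in `actions` appears in the sequence, else 0.
    Order and adjacency are ignored in this reading of 'combination'."""
    return int(all(a in sequence for a in actions))


# The illustrative action sequence from the introduction.
example = ["Start", "Dropdown1_Photography", "Dropdown2_7", "Part-time",
           "Find_Jobs", "Click_W1", "Save", "Next"]

photography_selected = contains_action(example, "Dropdown1_Photography")
searched_and_saved = contains_all(example, ["Find_Jobs", "Save"])
```

Each derived variable is thus a deterministic function of the sequence, which is why it can serve as a check on how much of the sequence the features retain.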
Similar to the simulation study, we examine how well the derived variables can be reconstructed through a prediction procedure. We use logistic regression to model the relation between a derived variable and the principal features of the corresponding item. For each derived variable, 80% of the respondents are randomly sampled to form the training set and the remaining 20% form the test set. We fit the model on the training set and predict the derived variable for each respondent in the test set. Specifically, the derived variable is predicted as 1 if the fitted probability is greater than 0.5 and 0 otherwise. Prediction accuracy on the entire test set is calculated for evaluation.
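The prediction procedure can be sketched with a plain-Python logistic regression. This is only an illustration under stated assumptions: the fitting method (unpenalized gradient descent with a fixed learning rate), the seed, and all function names are choices made here, and the features are assumed to be on a moderate scale so that the fixed learning rate is stable.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def linear_score(w, x):
    """Intercept w[0] plus the inner product of w[1:] and x."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Gradient descent on the mean logistic loss.
    Returns [intercept, coefficient_1, ..., coefficient_p]."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(linear_score(w, xi)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict(w, X, threshold=0.5):
    """Predict 1 when the fitted probability exceeds the threshold."""
    return [int(sigmoid(linear_score(w, xi)) > threshold) for xi in X]


def holdout_accuracy(X, y, train_frac=0.8, seed=0):
    """Random 80/20 split, fit on the training part, accuracy on the rest."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    w = fit_logistic([X[i] for i in train], [y[i] for i in train])
    pred = predict(w, [X[i] for i in test])
    return sum(p == y[i] for p, i in zip(pred, test)) / len(test)
```

Here `X` would hold the principal features of one item (one row per respondent) and `y` the corresponding derived variable.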
As shown in Table 4, the prediction accuracy is higher than 0.80 for all the derived variables. For 75 out of 93 variables, the accuracy is higher than 0.90. Thirty-five variables are predicted nearly perfectly. These results show that the extracted features carry a substantial amount of the information in the action sequences. We demonstrate in the remaining subsections that the extracted features are useful for assessing respondents’ competency and behaviors.
|Accuracy||(0.85, 0.90]||(0.90, 0.95]||(0.95, 0.975]||(0.975, 1.00]|
4.4 Variable Prediction Based on a Single Item
Item responses (both outcome and process) of an item reflect respondents’ latent traits, which affect their overall performance in a test. Therefore, each item response should have some predictive power for the responses to other items and for overall competency. Process data contain more detailed information about respondents’ behaviors than a single binary outcome. We expect prediction based on the response process to be more accurate than prediction based solely on the final outcome. In this subsection, we assess the information in the response processes of a single item via the prediction of the binary response outcomes of other items as well as the numeracy and literacy scores.
Given the final outcome and the response process of an item, say item $j$, we model their relation with the predicted variable by a generalized linear model

$$g(\mu) = \mathbf{x}_j^\top \boldsymbol{\beta}, \tag{9}$$

where $\mu$ is the expectation of the predicted variable, $g$ is the link function, $\mathbf{x}_j$ is a vector of covariates related to item $j$, which will be specified later, and $\boldsymbol{\beta}$ is the coefficient vector. If the predicted variable is the binary outcome of item $j'$, $g$ is the logit link and $\mu$ is the probability of answering the item correctly. If the predicted variable is the literacy or numeracy score, $g$ is the identity link and (9) becomes linear regression.
Let $z_j$ denote the binary outcome and let $\boldsymbol{\theta}_j$ denote the features extracted from the response process of item $j$. We consider two choices of $\mathbf{x}_j$ for a given predicted variable, $\mathbf{x}_j = (1, z_j)^\top$ and $\mathbf{x}_j = (1, z_j, \boldsymbol{\theta}_j^\top)^\top$. The first choice only uses the binary outcome for prediction. The second uses both the outcome and the response process. We call the models with these two choices of covariates the baseline model and the process model, respectively. It turns out that the information in the baseline model is very limited, especially when the correct rate of item $j$ is close to 0 or 1.
For a given predicted variable, two thirds of the available respondents are randomly sampled to form the training set. The remaining one third are evenly split to form the validation set and the test set. Both the baseline model and the process model are fit on the training set. We add penalties on the coefficient vector in the process model to avoid overfitting. The penalty parameter is chosen by examining the prediction performance of the resulting model on the validation set. Specifically, a process model is fitted for each candidate value of the penalty parameter. The one that produces the best prediction performance on the validation set is chosen as the final process model for comparison with the baseline model. The evaluation criterion is prediction accuracy for outcome prediction and out-of-sample $R^2$ for score prediction. $R^2$ is defined to be the square of the Pearson correlation between the predicted and true values. A higher $R^2$ indicates better prediction performance.
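The evaluation criterion for score prediction and the choice of the penalty parameter can be sketched as below. The function names are hypothetical, and `validation_score` stands for whatever routine fits a process model with a given penalty and returns its validation performance.

```python
import math


def squared_correlation(predicted, observed):
    """Square of the Pearson correlation between predicted and true values.
    Assumes neither sequence is constant."""
    n = len(observed)
    mp, mo = sum(predicted) / n, sum(observed) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(predicted, observed))
    vp = sum((p - mp) ** 2 for p in predicted)
    vo = sum((o - mo) ** 2 for o in observed)
    return cov * cov / (vp * vo)


def choose_penalty(candidates, validation_score):
    """Return the candidate penalty value with the best (highest)
    validation score; validation_score is a caller-supplied function."""
    return max(candidates, key=validation_score)
```

Because the criterion is a squared correlation, it measures how well predictions track the true scores up to a linear rescaling, and it does not penalize bias in the predictions.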
4.4.1 Outcome Prediction Results
Figure 9 presents the results of outcome prediction. The plot in the left panel gives the improvement in the out-of-sample prediction accuracy of the process model over that of the baseline model for all item pairs. Each entry corresponds to an ordered pair of items, one used as the predictor and the other as the item whose outcome is predicted. For many item pairs, adding the features extracted from process data improves the prediction. To further examine the improvements, for the task of predicting the outcome of one item by another, we calculate the prediction accuracy separately for the respondents who answered the predictor item correctly and for those who answered it incorrectly. The improvements for these two groups are plotted in the middle and the right panels of Figure 9, respectively. The improvement is larger for the incorrect group, both in the number of item pairs that show improvement and in the magnitude of the improvement. As mentioned previously, the incorrect response processes are more diverse than the correct ones and thus provide more information about the respondents. Misunderstanding the item requirements and a lack of basic computer skills often lead to an incorrect response. Carelessness and inattentiveness are also possible causes of an incorrect answer. These differences can be reflected in the extracted features, as illustrated in Figure 8. Therefore, including these features in the model helps the prediction more for the incorrect group than for the correct group.
4.4.2 Numeracy and Literacy Prediction Results
Numeracy and literacy score prediction results are displayed in Figure 10. In the left panel, we plot the $R^2$ of the process model against that of the baseline model. For both literacy and numeracy, regardless of the item used for prediction, the process model produces a higher $R^2$ than the baseline model. Although the PSTRE items are not designed for measuring these two competencies, the response processes are helpful for predicting the scores. To further examine the results, for each item-score pair, we again group the respondents according to their item response outcome and calculate the $R^2$ of the process model for the two groups separately. The $R^2$ for the incorrect group is plotted against that for the correct group in the right panel of Figure 10. Similar to the outcome prediction, the prediction performance for the incorrect group is usually much better than that for the correct group, since action sequences corresponding to incorrect answers are often more diverse and informative than those corresponding to correct answers.
4.5 Prediction Based on Multiple Items
In this subsection, we examine how the improvement in prediction performance brought by process data aggregates as more items are incorporated in the prediction. The variables of interest are age, gender, and literacy and numeracy scores.
We only consider the 3,645 respondents who responded to all 14 PSTRE items in this experiment. The respondents are randomly split into training, validation, and test sets. The sizes of the three sets are 2645, 500, and 500, respectively. The split is fixed for estimating and evaluating all models in this experiment.
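A fixed split of this kind can be reproduced with a seeded shuffle. The sketch below is illustrative only; the seed value and function name are assumptions.

```python
import random


def fixed_split(n, n_valid=500, n_test=500, seed=2019):
    """Shuffle respondent indices once with a fixed seed and cut them into
    training, validation and test sets. Reusing the same seed gives the same
    split for every model that is fitted and evaluated."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    test = idx[:n_test]
    valid = idx[n_test:n_test + n_valid]
    train = idx[n_test + n_valid:]
    return train, valid, test


train, valid, test = fixed_split(3645)
```

With 3,645 respondents this yields 2,645 training, 500 validation, and 500 test respondents, matching the sizes stated above.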
We still consider model (9) for prediction. The logit link, i.e., logistic regression, is used for gender prediction and linear regression for the other variables. In this experiment, the covariate vector $\mathbf{x}$ incorporates information from multiple items. Given a predicted variable and a set of available items, a baseline model and a process model are considered. For the baseline model, $\mathbf{x}$ consists of only the final outcomes, while for the process model it also includes the first 20 principal features for each available item. Let $S$ denote the set of the indices of available items. The predictor for the baseline model is $\mathbf{x} = (1, z_j, j \in S)$ and that of the process model is $\mathbf{x} = (1, z_j, \boldsymbol{\theta}_j, j \in S)$, where $z_j$ is the binary outcome of item $j$ and $\boldsymbol{\theta}_j$ is the vector of the first 20 principal features for item $j$. We start from an empty item set and add one item to the set at a time. That is, for a given predicted variable, a sequence of 14 baseline models and 14 process models is fitted. The order in which items are added to the model is determined by forward Akaike information criterion (AIC) selection for the 14 outcomes on the training set. Specifically, for a given $k$, $S_k$ contains the items whose outcomes are the first $k$ variables selected by the forward AIC selection among all 14 outcomes. We use prediction accuracy as the evaluation criterion for gender prediction and $R^2$ for the other variables.
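The construction of the covariates for the nested sequence of models can be sketched as follows. The item order is taken as given (it would come from the forward AIC selection), and the function names, data layout, and item identifiers in the example are illustrative assumptions.

```python
def nested_item_sets(order):
    """Given the order in which items are added, return the item sets of
    size 1, 2, ..., len(order)."""
    return [order[:k] for k in range(1, len(order) + 1)]


def baseline_covariates(outcomes, item_set):
    """Covariates of the baseline model for one respondent:
    the binary outcomes of the available items."""
    return [outcomes[j] for j in item_set]


def process_covariates(outcomes, features, item_set, n_features=20):
    """Covariates of the process model for one respondent: for each available
    item, its binary outcome followed by its first n_features principal
    features."""
    x = []
    for j in item_set:
        x.append(outcomes[j])
        x.extend(features[j][:n_features])
    return x


# Toy respondent with three items and 25 extracted features per item.
order = ["U01a", "U07", "U19a"]  # hypothetical selection order
outcomes = {"U01a": 1, "U07": 0, "U19a": 1}
features = {j: [0.0] * 25 for j in order}
item_sets = nested_item_sets(order)
x_base = baseline_covariates(outcomes, item_sets[1])
x_proc = process_covariates(outcomes, features, item_sets[1])
```

With $k$ available items the baseline model has $k$ covariates while the process model has $21k$, which is why penalization or a held-out validation set matters more for the latter.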
4.5.1 Numeracy and Literacy Prediction Results
In Figure 11, the $R^2$ for predicting literacy and numeracy scores is plotted against the number of available items. For both the process model and the baseline model, the prediction of the numeracy and literacy scores improves as responses from more items become available. Regardless of the number of available items, the process model outperforms the baseline model in both literacy and numeracy score prediction, although the difference becomes smaller as the number of available items increases. Notice that the $R^2$ of the process model based on only two items roughly equals the $R^2$ of the baseline model based on four items. These results imply that properly incorporating process data in data analysis can exploit the information in items more efficiently and that the incorporation is especially beneficial when only a small number of items are available.
The PSTRE item responses have some predictive power for literacy and numeracy. This is not surprising, as literacy and numeracy are related to the understanding of the PSTRE item descriptions and materials. In our case study, PSTRE items are more related to literacy than to numeracy: the $R^2$ of a literacy score model is usually higher than that of the corresponding numeracy score model. The number of items needed in the process model to achieve an $R^2$ similar to that obtained in the baseline model with all 14 items is five for literacy and eight for numeracy.
4.5.2 Background Variable Prediction Results
Figure 12 presents the results for predicting age and gender. Adding more items to the baseline model barely improves the $R^2$ for predicting age, while in the process model the $R^2$ increases as more items are included and is about twice as high as that of the baseline model when all 14 items are included. These results show that respondents of different ages behave differently in solving PSTRE items and that response processes can reveal the differences significantly better than final outcomes. A closer examination of the action sequences shows that younger respondents are more likely to use drag-and-drop actions to move emails, while older respondents tend to move emails by using the email menu (left panel of Figure 13). Also, older respondents are less likely to use “Search” in the spreadsheet environment (right panel of Figure 13).
As for gender, the highest prediction accuracy of the baseline models is 0.55, which is only 0.02 higher than the proportion of female respondents in the test set. The prediction accuracy of the process model is almost always higher than that of the corresponding baseline model and it can be as high as 0.63. These observations imply that female and male respondents have similar performance in PSTRE items in terms of final outcomes, but there are some differences in their response processes. In our data, male respondents are more likely to use sorting tools in spreadsheet environment as shown in Table 5. The p-value for the test of independence between gender and whether “Sort” is used is less than for the three items with spreadsheet environment.
5 Concluding Remarks
In this article, we presented a method to extract latent features from response processes. The key step of the method is to build an action sequence autoencoder for a set of response processes. We showed through a case study of the process data of PSTRE items in PIAAC 2012 that the extracted features improve the prediction of response outcomes, and literacy and numeracy scores.
It is possible to build neural networks that predict a response variable directly from an action sequence. These neural networks are often a combination of an RNN and a feed-forward neural network. With this approach, we may need to fit a separate model for each response variable, and each of these models involves RNNs. Fitting models with RNN components is generally computationally expensive because of their recurrent structure. With the feature extraction method, we only need to fit a single model (the action sequence autoencoder) that involves RNNs, and then fit a (generalized) linear model or a feed-forward neural network for each variable of interest. The prediction performance of the two approaches is often comparable. The approach without feature extraction may perform worse than the approach with feature extraction due to overfitting.
Computer log files of interactive items often include time stamps of actions. The time elapsed between two consecutive actions may also provide extra information about respondents and can be useful in educational and cognitive assessments. The current action sequence autoencoder does not make use of this information. Further study on incorporating time information in the analysis of process data is a potential future direction.
The authors would like to thank Educational Testing Service and Qiwei He for providing the data, and Hok Kan Ling for cleaning it.
References
- Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb), 1137–1155.
- Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
- Bosch, N., & Paquette, L. (2017). Unsupervised deep autoencoders for feature extraction with educational data. Deep Learning with Educational Data Workshop at the 10th International Conference on Educational Data Mining.
- Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (pp. 1724–1734). Association for Computational Linguistics. doi:10.3115/v1/D14-1179
- Deng, L., Seltzer, M. L., Yu, D., Acero, A., Mohamed, A., & Hinton, G. (2010). Binary coding of speech spectrograms using a deep auto-encoder. Interspeech 2010. International Speech Communication Association. CiteSeerX: 10.1.1.185.1908
- Ding, M., Yang, K., Yeung, D. Y., & Pong, T. C. (2019). Effective feature learning with unsupervised learning for improving the predictive models in massive open online courses. Proceedings of the 9th International Conference on Learning Analytics & Knowledge (pp. 135–144). New York, NY, USA: ACM. doi:10.1145/3303772.3303795
- Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
- Greiff, S., Niepel, C., Scherer, R., & Martin, R. (2016). Understanding students’ performance in a computer-based assessment of complex problem solving: An analysis of behavioral data from computer-generated log files. Computers in Human Behavior, 61, 36–46. doi:10.1016/j.chb.2016.02.095
- He, Q., & von Davier, M. (2016). Analyzing process data from problem-solving items with n-grams: Insights from a computer-based large-scale assessment. In Y. Rosen, S. Ferrara, & M. Mosharraf (Eds.), Handbook of research on technology tools for real-world skill development (pp. 749–776). Hershey, PA: Information Science Reference. doi:10.4018/978-1-4666-9441-5.ch029
- Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. doi:10.1126/science.1127647
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. doi:10.1162/neco.1997.9.8.1735
- Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Klein Entink, R. H., Fox, J. P., & van der Linden, W. J. (2009). A multivariate multilevel approach to the modeling of accuracy and speed of test takers. Psychometrika, 74(1), 21. doi:10.1007/s11336-008-9075-y
- Kraft, P., Jain, H., & Rush, A. M. (2016). An embedding model for predicting roll-call votes. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 2066–2070). Association for Computational Linguistics. doi:10.18653/v1/D16-1221
- Kroehne, U., & Goldhammer, F. (2018). How to conceptualize, represent, and analyze log data from technology-based assessments? A generic framework and an application to questionnaire items. Behaviormetrika, 45(2), 527–563. doi:10.1007/s41237-018-0063-y
- Li, J., Luong, T., & Jurafsky, D. (2015). A hierarchical neural autoencoder for paragraphs and documents. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Vol. 1, pp. 1106–1115). Association for Computational Linguistics.
- Lord, F. M. (1980). Applications of item response theory to practical testing problems. New York, NY: Routledge.
- Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
- Lu, X., Tsao, Y., Matsuda, S., & Hori, C. (2013). Speech enhancement based on deep denoising autoencoder. Interspeech (pp. 436–440).
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26 (pp. 3111–3119). Curran Associates, Inc.
- OECD. (2017). The nature of problem solving. doi:10.1787/9789264273955-en
- Patterson, J., & Gibson, A. (2017). Deep learning: A practitioner’s approach. O’Reilly Media, Inc.
- Piech, C., Bassen, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L. J., & Sohl-Dickstein, J. (2015). Deep knowledge tracing. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (pp. 505–513). Curran Associates, Inc.
- Prechelt, L. (2012). Early stopping - but when? In G. Montavon, G. B. Orr, & K. R. Müller (Eds.), Neural networks: Tricks of the trade: Second edition (pp. 53–67). Berlin, Heidelberg: Springer Berlin Heidelberg. doi:10.1007/978-3-642-35289-8_5
- Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407. doi:10.1214/aoms/1177729586
- Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36(2), 111–133.
- Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. A. (2008). Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th International Conference on Machine Learning (pp. 1096–1103).
- Wang, L., Sy, A., Liu, L., & Piech, C. (2017). Deep knowledge tracing on programming exercises. Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale (pp. 201–204). New York, NY, USA: ACM. doi:10.1145/3051457.3053985
- Wang, S., Zhang, S., Douglas, J., & Culpepper, S. (2018). Using response times to assess learning progress: A joint model for responses and response times. Measurement: Interdisciplinary Research and Perspectives, 16(1), 45–58. doi:10.1080/15366367.2018.1435105
- Yousefi-Azar, M., Varadharajan, V., Hamey, L., & Tupakula, U. (2017). Autoencoder-based feature learning for cyber security applications. 2017 International Joint Conference on Neural Networks (pp. 3854–3861). doi:10.1109/IJCNN.2017.7966342
- Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.
- Zhan, P., Jiao, H., & Liao, D. (2018). Cognitive diagnosis modelling incorporating item response times. British Journal of Mathematical and Statistical Psychology, 71(2), 262–286. doi:10.1111/bmsp.12114
Appendix A Structures of the LSTM Unit and the GRU
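A.1 LSTM Unit
In its standard form, with input $x_t$, output $h_t$, and cell state $c_t$, the LSTM unit computes the hidden states and outputs in time step $t$ as follows

$$\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}$$

where $\odot$ denotes element-wise multiplication, $f_t$, $i_t$, $o_t$, and $c_t$ are called the forget gate, input gate, output gate, and cell state of an LSTM unit, respectively, and $W_f, W_i, W_o, W_c, U_f, U_i, U_o, U_c$ and $b_f, b_i, b_o, b_c$ are parameters. Both $\sigma$ (the logistic sigmoid) and $\tanh$ are element-wise activation functions.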
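A.2 GRU
In its standard form (Cho et al., 2014), with input $x_t$ and hidden state $h_t$, the GRU computes the hidden states and outputs in time step $t$ as follows

$$\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z), \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r), \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr), \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t,
\end{aligned}$$

where $\odot$ denotes element-wise multiplication, $z_t$ and $r_t$ are called the update gate and reset gate of a GRU, respectively, and $W_z, W_r, W_h, U_z, U_r, U_h$ and $b_z, b_r, b_h$ are parameters. Both $\sigma$ (the logistic sigmoid) and $\tanh$ are element-wise activation functions.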