Computer-based problem-solving items have become prevalent in large-scale assessments. These items are developed to measure skills related to problem solving in work and personal life. Thanks to the human-computer interface, it is possible to record the entire problem-solving process, as is the case for scientific inquiry items in the Programme for International Student Assessment (PISA) and Problem Solving in Technology-Rich Environments (PSTRE) items in the Programme for the International Assessment of Adult Competencies (PIAAC). Responses to such items are complex and often take the form of a process. More precisely, the record of each item response contains a sequence of ordered and time-stamped actions.
An example of a PIAAC PSTRE item is shown in Figures 1–3. Figure 1 displays the main page of the item. The left panel of the main page provides item instructions. In this item, respondents are asked to identify websites that do not require registration or a fee from those listed in the web browser in the right panel. Respondents can visit a website by clicking its link. Figures 2 and 3 show the web pages of the first and the second links, respectively. Further information about the second website can be found by clicking on the “Learn More” button shown in Figure 3. If a website is considered useful, it can be bookmarked either by using the menu item “Bookmark” or by clicking the bookmark icon in the toolbar. Suppose that a respondent completes the task by clicking on the first link, reading the first website, going back to the main page, clicking on the second link, and bookmarking the second website by clicking the bookmark icon. All these actions are recorded in the log file in order. The sequence “Start, Click_W1, Back, Click_W2, Toolbar_Bookmark, Next” constitutes a response process.
In this paper, we present a generic method to extract useful information regarding participants from their response processes. Latent variable or latent class models have been used in the literature to summarize item responses, but existing models and methods are not directly applicable to response processes. The analysis of process data is difficult for several reasons. First, response processes are in a nonstandard format. A response process is a sequence of actions and each action is a categorical variable. In addition, process length varies across individuals. Because of the nonstandard format, classic latent variable models such as item response theory models (Lord, 1980) and cognitive diagnosis models (Rupp et al., 2010) do not apply to process data. Second, computer-based assessments and their log files cover a large variety of items. Every human-computer interface generates log files. The sheer number and variety of items make confirmatory analysis practically infeasible: it is too expensive to perform confirmatory analysis for each potential human-computer interface and then verify it empirically. Furthermore, the cognitive process of human-computer interaction is not thoroughly understood, which adds to the difficulty of confirmatory analysis. Lastly, response processes are often very noisy. For instance, the lagged correlations of action occurrences are often very close to zero; that is, response processes behave like white noise from an autoregressive process viewpoint.
The assessment of data beyond traditional responses has been studied previously. It has been shown that item response time can reveal test-taker response behaviors that are helpful for test design (van der Linden, 2008; Qian et al., 2016). Models have been proposed to perform cognitive assessments using both traditional responses and response time (Klein Entink et al., 2009; Wang et al., 2018; Zhan et al., 2018). The study of process data is at a more preliminary stage. Most works, such as Greiff et al. (2016) and Kroehne and Goldhammer (2018), first summarized process data into several variables and then investigated their relationship with other variables of interest using standard statistical methods. The design of these summary variables is usually item specific and thus hard to generalize. He and von Davier (2015, 2016) explored the association between action sequence patterns and traditional responses using n-grams. Although the procedure of extracting n-gram features is generic, the sequence patterns under consideration are limited.
The objective of the present analysis is to perform exploratory analysis on process data. In particular, we propose a generic method to extract features (latent variables) from response processes. The proposed method does not rely on prior knowledge of the items or the response processes and is applicable essentially to all process responses. We apply it to all 14 PIAAC PSTRE items that cover a range of human-computer interfaces.
The basic technique of our proposed method is multidimensional scaling (Borg & Groenen, 2005). It constructs features based on the relative differences among individuals. Though numerous variants of multidimensional scaling (MDS) exist, their common goal is to locate objects in a vector space according to their pairwise dissimilarities in such a way that similar objects are close together, while less similar objects are far apart. MDS has been used for data visualization and dimension reduction in cognitive diagnosis, test analysis, and many other areas of psychometrics (Skager et al., 1966; Karni & Levin, 1972; Subkoviak, 1975; Shoben, 1983; Meyer & Reynolds, 2018). In the context of process data analysis, if the differences between two processes can be properly summarized by a dissimilarity measure, then the coordinates obtained from MDS can be treated as features storing information about the original processes. With a proper rotation, each feature describes the variation of a certain ability or behavior pattern among the group of respondents.
We use a prediction procedure to demonstrate that response processes contain more information than traditional item responses. We denote the features extracted from response processes by $\theta$. For each response process, there is a binary response, denoted by $y$, indicating whether the respondent has successfully accomplished the task. To compare the information contained in $\theta$ and $y$, we adopt a third variable, denoted by $z$ (such as numeracy score, literacy score, etc.), and inspect the prediction of $z$ based on $\theta$ and that based on $y$. In the empirical analysis of PSTRE in PIAAC, we find that the prediction based on $\theta$ outperforms that based on $y$ for a wide range of variables including assessment scores, basic demographic variables, and some background variables.
The rest of this paper is organized as follows. In Section 2, we introduce a dissimilarity measure for action sequences and describe the proposed feature extraction procedure. A simulation study is presented in Section 3 to demonstrate the procedure and how the latent structure of action sequences is reflected in the extracted features. In Section 4, we show through a case study of PIAAC PSTRE item response processes that features extracted from process data contain much richer information than binary responses. Section 5 contains some concluding remarks.
2 Feature Extraction via Multidimensional Scaling
Consider a problem-solving item in which a student takes a number of actions to complete a task. We use $\mathcal{A} = \{a_1, \ldots, a_N\}$ to denote the set of possible actions of this item, where $N$ is the number of distinct actions. A response process is a sequence of actions $\mathbf{s} = (s_1, \ldots, s_L)$, where each $s_l$ is an action in $\mathcal{A}$ and $L$ is the process length, i.e., the number of actions taken in the response process. An action in $\mathcal{A}$ may appear multiple times or never appear in $\mathbf{s}$. We observe the response processes of $n$ students and use the subscript $i$ to index different observations: $\mathbf{s}_i$, $i = 1, \ldots, n$. The process length also varies among individuals; we use $L_i$ to denote the length of $\mathbf{s}_i$. The heterogeneous length of response processes for the same item is one of the technical difficulties in process data analysis. In what follows, we describe a procedure that transforms response processes of heterogeneous length into latent vectors of homogeneous dimension that may be used for standard analysis.
The core of the procedure is MDS, which has been widely used as a data visualization and dimension reduction tool in many fields including psychometrics (Takane, 2006). The goal of MDS is to locate objects in a vector space according to their pairwise dissimilarities in such a way that similar objects are close together, while dissimilar objects are far apart. We begin the discussion with a description of a dissimilarity measure between discrete action sequences. This measure is key to the subsequent application of multidimensional scaling, as it summarizes the variation among response processes. An appropriate dissimilarity measure should accommodate three characteristics of response processes. First, process data is a collection of discrete processes on which arithmetic calculation cannot be performed. Second, processes from different respondents are of very different lengths. Third, the order of actions matters. Although the order of actions may not affect the final outcome of the task, it reflects respondents’ habits and personality.
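Based on these considerations, we adopt the following dissimilarity measure. Let $\mathbf{s}_i$ and $\mathbf{s}_j$ be two action sequences of lengths $L_i$ and $L_j$. Define the dissimilarity between $\mathbf{s}_i$ and $\mathbf{s}_j$ as

$$d(\mathbf{s}_i, \mathbf{s}_j) = \frac{f(\mathbf{s}_i, \mathbf{s}_j) + g(\mathbf{s}_i, \mathbf{s}_j)}{L_i + L_j}, \tag{1}$$

where $f(\mathbf{s}_i, \mathbf{s}_j)$ quantifies the dissimilarity among the actions that appear in both $\mathbf{s}_i$ and $\mathbf{s}_j$, and $g(\mathbf{s}_i, \mathbf{s}_j)$ is the count of actions appearing in only one of $\mathbf{s}_i$ and $\mathbf{s}_j$.

We now provide the precise definition of $f$ and $g$. For an action $a$, let $\mathbf{s}_i^a$ be a sequence consisting of the chronologically ordered positions of $a$ in sequence $\mathbf{s}_i$. The length of $\mathbf{s}_i^a$, $n_i^a$, is the number of times that $a$ appears in $\mathbf{s}_i$. We use $s_i^a(k)$ to denote the $k$th element of $\mathbf{s}_i^a$, namely, the position of the $k$th appearance of $a$ in $\mathbf{s}_i$. For two sequences $\mathbf{s}_i$ and $\mathbf{s}_j$, let $C_{ij}$ denote the set of actions that appear in both $\mathbf{s}_i$ and $\mathbf{s}_j$, and $U_{ij}$ denote the set of actions that appear in $\mathbf{s}_i$ but not in $\mathbf{s}_j$. Then $f$ and $g$ are defined as

$$f(\mathbf{s}_i, \mathbf{s}_j) = \frac{\sum_{a \in C_{ij}} \sum_{k=1}^{K_{ij}^a} \left| s_i^a(k) - s_j^a(k) \right|}{\max\{L_i, L_j\}}, \tag{2}$$

$$g(\mathbf{s}_i, \mathbf{s}_j) = \sum_{a \in U_{ij}} n_i^a + \sum_{a \in U_{ji}} n_j^a, \tag{3}$$

where $K_{ij}^a = \min\{n_i^a, n_j^a\}$. This dissimilarity measure was first proposed in Gómez-Alonso and Valls (2008) to measure differences in tourists’ itineraries and differences in websites visited by certain users.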
We use a simple example to demonstrate how the dissimilarity is calculated. Consider a set of four possible actions and two sequences, and . Since and appear in both sequences, . Action appears in at positions 1 and 3 and appears in in position 2, so and . The difference between and in the appearance of is . Similarly, we can find , and the difference in the appearance of is . Therefore, . Since and with appearing twice in and appearing once in , . According to (1), .
The calculation of the dissimilarity described in (1) does not require inputs of informative behavior patterns or the meaning of each action. This is crucial for our automated feature extraction procedure at the exploratory stage of the analysis.
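The following Python sketch illustrates how such a dissimilarity can be computed. It assumes the form used by Gómez-Alonso and Valls (2008): the position differences of shared actions are summed and divided by the longer sequence length, the occurrences of unshared actions are counted, and the total is divided by the sum of the two lengths. The function names and the two example sequences are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def positions(seq):
    """Map each action to the ordered list of its 1-based positions in seq."""
    pos = defaultdict(list)
    for idx, action in enumerate(seq, start=1):
        pos[action].append(idx)
    return pos


def dissimilarity(s1, s2):
    """Dissimilarity between two non-empty action sequences (assumed form)."""
    p1, p2 = positions(s1), positions(s2)
    common = set(p1) & set(p2)

    # f: position differences of shared actions, matched appearance by
    # appearance up to the smaller number of appearances.
    f = 0.0
    for a in common:
        k = min(len(p1[a]), len(p2[a]))
        f += sum(abs(p1[a][i] - p2[a][i]) for i in range(k))
    f /= max(len(s1), len(s2))

    # g: number of occurrences of actions that appear in only one sequence.
    g = sum(len(p1[a]) for a in set(p1) - common)
    g += sum(len(p2[a]) for a in set(p2) - common)

    return (f + g) / (len(s1) + len(s2))


# Hypothetical example: "a" and "b" are shared, "c" and "d" are not.
s1 = ["a", "c", "a", "b", "c"]
s2 = ["b", "a", "d"]
print(dissimilarity(s1, s2))  # approximately 0.475
```

In this example the shared actions contribute $(|1-2| + |4-1|)/5 = 0.8$, the unshared actions contribute $2 + 1 = 3$, and the result is $3.8/8$. The function uses only the positions of actions, never their meaning.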
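For action sequences $\mathbf{s}_1, \ldots, \mathbf{s}_n$, let an $n \times n$ symmetric matrix $D = (d_{ij})$ denote their dissimilarity matrix, where $d_{ij}$ measures the dissimilarity between $\mathbf{s}_i$ and $\mathbf{s}_j$, $i, j = 1, \ldots, n$. Higher dissimilarities indicate larger differences and the dissimilarity between two identical objects is zero, namely, $d_{ii} = 0$ for $i = 1, \ldots, n$. MDS maps each action sequence to a latent vector in the $K$-dimensional Euclidean space such that the distances between latent vectors approximate the dissimilarities. Mathematically, applying MDS to $n$ objects with dissimilarity matrix $D$ essentially minimizes

$$F(\Theta) = \sum_{i < j} \left( d_{ij} - \hat{d}_{ij} \right)^2 \tag{4}$$

with respect to $\Theta = (\theta_1, \ldots, \theta_n)$, where $\theta_i$ is the latent vector of $\mathbf{s}_i$ in $\mathbb{R}^K$ and

$$\hat{d}_{ij} = \| \theta_i - \theta_j \|.$$

Many algorithms have been proposed to solve the optimization problem. For simplicity, we use stochastic gradient descent (Robbins & Monro, 1951) to minimize (4).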
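As an illustration, a least-squares MDS objective of this kind can be minimized by stochastic gradient descent that visits one pair of objects at a time. The Python sketch below is a minimal version under stated assumptions: the objective is the sum of squared differences between the given dissimilarities and Euclidean distances of the latent vectors, and the step size, number of epochs, and initialization are arbitrary illustrative choices rather than the settings used in the paper.

```python
import math
import random


def distance(u, v):
    """Euclidean distance between two latent vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def stress(D, theta):
    """Sum over pairs i < j of (d_ij - ||theta_i - theta_j||)^2."""
    n = len(D)
    return sum((D[i][j] - distance(theta[i], theta[j])) ** 2
               for i in range(n) for j in range(i + 1, n))


def mds_sgd(D, K, n_epochs=200, lr=0.05, seed=0):
    """Fit K-dimensional latent vectors to a dissimilarity matrix D."""
    rng = random.Random(seed)
    n = len(D)
    theta = [[rng.gauss(0.0, 0.1) for _ in range(K)] for _ in range(n)]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    for _ in range(n_epochs):
        rng.shuffle(pairs)
        for i, j in pairs:
            dist = distance(theta[i], theta[j])
            if dist < 1e-12:
                continue  # gradient undefined for coincident points
            # Gradient of (d_ij - dist)^2 w.r.t. theta_i is
            # -2 (d_ij - dist) (theta_i - theta_j) / dist; theta_j gets the opposite.
            coef = 2.0 * lr * (D[i][j] - dist) / dist
            for k in range(K):
                step = coef * (theta[i][k] - theta[j][k])
                theta[i][k] += step
                theta[j][k] -= step
    return theta


# Three objects that are mutually one unit apart (an equilateral triangle).
D = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
theta = mds_sgd(D, K=2)
print(stress(D, theta))
```

Each row of the returned list is the latent vector of one sequence; the solution is only determined up to rotation, reflection, and translation.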
Combining the calculation of the dissimilarity matrix and MDS, we present the feature extraction procedure for process data.
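Procedure 1 (Feature extraction for process data).
1. Compute the dissimilarity matrix $D = (d_{ij})$ of the $n$ response processes using the dissimilarity measure (1).
2. Obtain raw features by minimizing (4).
3. Obtain $K$ principal features by performing principal component analysis on the raw features.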
Procedure 1 extracts features with homogeneous dimension from action sequences with heterogeneous length. These features have a standard form and, as we will show in the simulation and case study, contain compressed information of the original sequences. Therefore, they can be easily incorporated as a surrogate of the action sequences in well-developed statistical models such as (generalized) linear models to study how process data reflects respondents’ latent traits and how it is related to other quantities of interest. We will demonstrate how these can be achieved in the next two sections.
Principal component analysis is performed in Step 3 of Procedure 1 mainly for seeking interpretations of the features. As we will show in the case study, the first several principal features usually have clear interpretations, although the feature extraction procedure does not take into account the meaning of actions.
Procedure 1 requires the specification of $K$, the number of features to be extracted. If $K$ is too small, there are not enough features to characterize the variation of action sequences, leading to substantial information loss in the extracted features. On the other hand, if $K$ is too large, some features can be redundant and can cause overfitting and instability in subsequent analyses. A suitable $K$ can be chosen by $m$-fold cross-validation. We randomly split the pairwise dissimilarities into $m$ subsets. For each candidate value of $K$ and each subset of dissimilarities, we perform MDS using the rest of the dissimilarities and calculate the discrepancy between the estimated and true dissimilarities for the subset. The value of $K$ that produces the smallest total discrepancy among the $m$ subsets is chosen as the number of features to be extracted. This cross-validation procedure is summarized as follows.
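Procedure 2 (Choose $K$ by cross-validation).
1. Randomly split $\Omega = \{(i, j): 1 \le i < j \le n\}$ into $m$ subsets $\Omega_1, \ldots, \Omega_m$.
2. For each candidate value of $K$ and each $q$ in $\{1, \ldots, m\}$, obtain $\hat{\theta}_i^{(K,q)}$, $i = 1, \ldots, n$, by minimizing
$$\sum_{(i,j) \in \Omega \setminus \Omega_q} \left( d_{ij} - \hat{d}_{ij} \right)^2$$
with respect to $\Theta = (\theta_1, \ldots, \theta_n)$, where $\hat{d}_{ij} = \| \theta_i - \theta_j \|$.
3. For each candidate value of $K$, calculate
$$CV(K) = \sum_{q=1}^{m} \sum_{(i,j) \in \Omega_q} \left( d_{ij} - \| \hat{\theta}_i^{(K,q)} - \hat{\theta}_j^{(K,q)} \| \right)^2.$$
4. Choose the $K$ that produces the smallest $CV(K)$.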
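The cross-validation idea described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the paper: the pairs of objects are split into folds, MDS is fitted by a simple pairwise stochastic gradient descent on the retained pairs, and the squared discrepancy on the held-out pairs is accumulated for each candidate dimension. The function names, step size, and epoch count are assumptions.

```python
import math
import random


def distance(u, v):
    """Euclidean distance between two latent vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def fit_mds(D, pairs, K, rng, n_epochs=100, lr=0.05):
    """Fit K-dimensional latent vectors using only the listed pairs."""
    n = len(D)
    theta = [[rng.gauss(0.0, 0.1) for _ in range(K)] for _ in range(n)]
    pairs = list(pairs)
    for _ in range(n_epochs):
        rng.shuffle(pairs)
        for i, j in pairs:
            dist = distance(theta[i], theta[j])
            if dist < 1e-12:
                continue
            coef = 2.0 * lr * (D[i][j] - dist) / dist
            for k in range(K):
                step = coef * (theta[i][k] - theta[j][k])
                theta[i][k] += step
                theta[j][k] -= step
    return theta


def choose_K(D, candidates, m=5, seed=0):
    """Return the candidate K with the smallest held-out discrepancy."""
    rng = random.Random(seed)
    n = len(D)
    all_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    rng.shuffle(all_pairs)
    folds = [all_pairs[q::m] for q in range(m)]  # m roughly equal subsets
    cv = {}
    for K in candidates:
        total = 0.0
        for q in range(m):
            train = [p for r in range(m) if r != q for p in folds[r]]
            theta = fit_mds(D, train, K, rng)
            total += sum((D[i][j] - distance(theta[i], theta[j])) ** 2
                         for i, j in folds[q])
        cv[K] = total
    best = min(cv, key=cv.get)
    return best, cv
```

Calling `choose_K(D, [10, 20, 30])` on a dissimilarity matrix `D` returns the selected dimension together with the discrepancy of every candidate, which makes it easy to inspect how flat the criterion is around the minimum.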
3 Simulation Study
In this section, we demonstrate the proposed feature extraction procedure on simulated data.
3.1 Data Generation
Twenty-six possible actions ($N = 26$) are considered in our simulations. Each possible action is denoted by an upper-case English letter, namely, $\mathcal{A} = \{a_1, \ldots, a_N\}$ with $a_1 = \text{A}$ and $a_N = \text{Z}$. We use A and Z to denote the start and the end of an item. As a result, each action sequence starts with A and ends with Z.
3.2 Experiment Settings
We consider two strategies for generating action sequences. With strategy I, a set of $n$ action sequences is generated from the previous Markov model under two different transition matrices, with $n/2$ sequences for each matrix. Action sequences generated from this strategy have a latent group structure. Sequences generated from the same transition matrix form a group and tend to be similar. The two probability transition matrices used in this strategy are randomly generated. Each matrix is generated by first constructing an $N \times N$ matrix $U$. The elements of $U$ are generated independently from a uniform distribution on a fixed interval. Then the probability transition matrix $P$ is computed from $U$.
In strategy II, each of action sequences is generated from a unique probability transition matrix. To construct these matrices, we first obtain a uniform matrix as in strategy I. Then we draw independent samples, , from and compute the core matrix for the th sequence according to
With this strategy, sequences whose sampled values are similar resemble each other. In other words, the sampled value serves as a continuous latent variable determining the characteristics of the sequences.
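To make the data-generating idea concrete, the Python sketch below simulates action sequences from a Markov chain over the letters A to Z. It is only an illustration under assumptions that are not taken from the paper: the uniform interval is set to $[-1, 1]$, each row of the uniform matrix is turned into probabilities by a softmax, transitions back to the start action A are disallowed, and a maximum length guards against very long sequences.

```python
import math
import random
import string

ACTIONS = list(string.ascii_uppercase)  # "A" marks the start, "Z" the end


def random_transition_matrix(rng, low=-1.0, high=1.0):
    """Build a random transition matrix (assumed construction: row softmax)."""
    N = len(ACTIONS)
    P = []
    for _ in range(N):
        u = [rng.uniform(low, high) for _ in range(N)]
        w = [math.exp(x) for x in u]
        w[0] = 0.0  # assumption: the start action "A" is never revisited
        total = sum(w)
        P.append([x / total for x in w])
    return P


def generate_sequence(P, rng, max_len=200):
    """Generate one sequence that starts with "A" and ends with "Z"."""
    seq = ["A"]
    state = 0
    while ACTIONS[state] != "Z" and len(seq) < max_len - 1:
        state = rng.choices(range(len(ACTIONS)), weights=P[state])[0]
        seq.append(ACTIONS[state])
    if seq[-1] != "Z":
        seq.append("Z")  # force termination if the length cap was reached
    return seq


# Two groups of sequences, one transition matrix per group (as in strategy I).
rng = random.Random(0)
P1 = random_transition_matrix(rng)
P2 = random_transition_matrix(rng)
group1 = [generate_sequence(P1, rng) for _ in range(5)]
group2 = [generate_sequence(P2, rng) for _ in range(5)]
```

A continuous latent structure in the spirit of strategy II could be obtained by letting a per-sequence random value modify the matrix before the softmax, but the exact form used in the paper is not reproduced here.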
We consider three choices of $n$: 200, 500, and 1000. For each strategy and each choice of $n$, we generate 100 sets of action sequences and extract features according to Procedure 1. The number of features to be extracted is chosen by the five-fold cross-validation described in Procedure 2.
To show that the extracted features retain the information in action sequences, we derive several variables from the action sequences for each dataset and examine how well these derived variables can be reconstructed from the extracted features. Good reconstruction performance indicates that a significant amount of the information in action sequences is preserved in the extracted features. The derived variables are indicators describing whether a unigram or a bigram appears in a sequence. We do not consider indicators for unigrams and bigrams that appear fewer than a threshold number of times in a dataset. Logistic regression is used to reconstruct the derived variables from the extracted features. For each dataset, the sequences are split into training and test sets in the ratio of 4:1. A logistic regression model is estimated for each derived variable on the training set and its prediction performance is evaluated on the test set. The average prediction accuracy and the worst prediction accuracy among all the derived variables are recorded for each dataset.
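The derived indicator variables can be constructed along the following lines. This Python sketch is illustrative: the function names are hypothetical, and `min_count` stands in for the frequency threshold, whose value is an assumption here.

```python
from collections import Counter


def ngrams(seq, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def ngram_indicators(sequences, n, min_count=1):
    """Return the retained n-grams and a 0/1 indicator row per sequence.

    An n-gram is retained if it occurs at least min_count times in total.
    """
    counts = Counter(g for s in sequences for g in ngrams(s, n))
    kept = sorted(g for g, c in counts.items() if c >= min_count)
    rows = []
    for s in sequences:
        present = set(ngrams(s, n))
        rows.append([int(g in present) for g in kept])
    return kept, rows


sequences = [["A", "B", "Z"], ["A", "C", "B", "Z"], ["A", "Z"]]
kept, rows = ngram_indicators(sequences, n=2, min_count=2)
print(kept)  # [('B', 'Z')]
print(rows)  # [[1], [1], [0]]
```

Each column of `rows` is one binary derived variable that can then be used as the response in a logistic regression on the extracted features.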
To inspect the ability of the extracted features to unveil the latent structures in action sequences, we build a logistic regression model to identify the group structure from the extracted features for datasets generated from strategy I, and a linear regression model of the continuous latent variable on the extracted features for datasets generated from strategy II. The models are fitted on the training set. The logistic model of group identity is evaluated by the prediction accuracy on the test set, while the linear regression model of the latent variable is evaluated by the out-of-sample $R^2$ ($\mathrm{OSR}^2$), the square of the correlation between the predicted and true values. By analogy with the in-sample $R^2$ in linear regression, a higher $\mathrm{OSR}^2$ indicates better prediction performance.
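Since the out-of-sample criterion is defined as the squared correlation between predicted and true values on the test set, it can be computed directly. The Python sketch below follows that definition; the function name and the convention of returning 0 when either input is constant (where the correlation is undefined) are assumptions made for illustration.

```python
def osr2(y_true, y_pred):
    """Squared Pearson correlation between true and predicted test values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0.0 or var_p == 0.0:
        return 0.0  # assumption: treat an undefined correlation as 0
    return cov * cov / (var_t * var_p)


print(osr2([1.0, 2.0, 3.0], [1.0, 3.0, 2.0]))  # 0.25
```

Because it is a squared correlation, this criterion is invariant to shifting and rescaling the predictions, unlike a residual-based $R^2$.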
Figures 4 and 5 display the results for datasets generated by strategies I and II, respectively. The left and middle panels of both figures present the average and worst prediction accuracy for derived variables. Under all the settings, for almost all datasets, the averaged prediction accuracy is greater than 0.9 and the worst prediction accuracy is greater than 0.7. These results demonstrate that the derived variables can be reconstructed well and imply that a significant amount of information in action sequences is compressed into the extracted features.
The right panel of Figure 4 presents the prediction accuracy for group identity. For most of the datasets, the prediction accuracy is higher than 0.9, indicating that group structures in action sequences can be identified very accurately by the extracted features. The right panel of Figure 5 gives the $\mathrm{OSR}^2$ for predicting the continuous latent variable. It shows that continuous latent characteristics in action sequences are captured well by the features extracted from Procedure 1, as the correlation between the predicted and true values is higher than 0.8 for most of the datasets.
To take a closer look at how the extracted features reveal the latent structure of action sequences, in Figure 6, we plot the first two principal features for one dataset of 1000 sequences under each strategy. For the dataset generated from strategy I (left panel in Figure 6), the group structure is clearly shown in the figure and the two groups can be roughly separated by a horizontal line at 0. The data shown in the right panel of Figure 6 is generated from strategy II. It is evident that sequences located closer have similar latent characteristics.
4 Case Study
The data considered in this study comes from the PIAAC 2012 survey from five countries: the United Kingdom, Ireland, Japan, the Netherlands and the United States. There are 14 PSTRE items and 11,464 respondents in the dataset in total. Each person responded to all or a subset of the 14 items. There are 7,620 respondents who answered 7 items and 3,645 respondents who answered all 14 items. For each item, there were around 7,500 respondents. Altogether there are 106,096 respondent-item pairs. Both the response process and the response outcome (correct or incorrect) were recorded for each pair.
Table 1 summarizes some basic descriptive statistics of the dataset by item, where $n$ denotes the number of respondents, $N$ is the number of possible actions, $\bar{L}$ stands for the average process length, and Correct % is the percentage of correct responses. The 14 items vary in content, task complexity, and difficulty. Items U02 and U04a are the most difficult items, as only around 10% of respondents had the correct answer. The tasks of these two items are also relatively complicated, requiring more than 40 actions on average and having a large number of possible actions. U06a is the simplest item in terms of task complexity since respondents took only 10.8 actions on average to finish the task and the item has the fewest possible actions. Despite this simplicity, less than 30% of respondents answered U06a correctly. The variety of items necessitates automatic methods to extract features from process data and to avoid identifying important actions and patterns manually, which is time consuming and requires extra work if the coding is changed.
Table 1: Descriptive statistics of the PSTRE items.

| Item | Description | $n$ | $N$ | $\bar{L}$ | Correct % |
|---|---|---|---|---|---|
| U01a | Party Invitations - Can/Cannot Come | 7620 | 207 | 24.8 | 54.5 |
| U01b | Party Invitations - Accommodations | 7670 | 249 | 52.9 | 49.3 |
| U06a | Sprained Ankle - Site Evaluation Table | 7622 | 47 | 10.8 | 26.4 |
| U06b | Sprained Ankle - Reliable/Trustworthy Site | 7612 | 98 | 16.0 | 52.3 |
| U07 | Digital Photography Book Purchase | 7549 | 125 | 18.6 | 46.0 |
| U11b | Locate E-mail - File 3 E-mails | 7528 | 236 | 30.9 | 20.1 |
| U19a | Club Membership - Member ID | 7556 | 373 | 26.9 | 69.4 |
| U19b | Club Membership - Eligibility for Club President | 7558 | 458 | 21.3 | 46.3 |

Note: $n$ = number of respondents; $N$ = number of possible actions; $\bar{L}$ = average process length; Correct % = percentage of correct responses.
4.2 Feature Interpretation
We extracted features for each of the 14 items by Procedure 1. The number of features is chosen from a set of candidate values by five-fold cross-validation, and the selected number for each item is given in the second column of Table 2.
Many of the principal features, especially the first several ones, have clear interpretations. We find the interpretation of a feature by examining the characteristics of the action sequences corresponding to the two extremes of the feature and then confirm it by calculating the correlation between the feature and a variable constructed according to the interpretation. Table 2 lists the interpretation of the first three principal features for each item.
The first principal feature of each item usually indicates attentiveness. An inattentive respondent often tries to skip a task directly or submits an answer by guessing randomly without meaningful interactions with the simulated environment, while an attentive respondent usually tries to understand and to complete the task by exploring the environment, thus taking more actions. Attentiveness in response process can be reflected in the process length. In Table 2, the numbers in the parentheses after the interpretation of the first principal feature of each item give the absolute value of the correlation between the first principal feature and the logarithm of the process length. For 13 out of 14 items, the absolute correlation is higher than 0.85. To explore the relation between the 14 first principal features, we multiply the features by the sign of their correlation with the corresponding process length. With the redirection, a higher first principal feature indicates a more attentive respondent. For a given pair of items, we calculate the correlation between their first principal features among the respondents who responded to both items. These correlations range from 0.36 to 0.74, implying that the respondents who tend to skip one item are likely to skip other items as well.
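The sign alignment described above can be sketched in Python as follows. The sketch computes the Pearson correlation between a feature and the logarithm of the process length (the quantity reported in Table 2) and flips the feature if that correlation is negative. The function names are hypothetical, and using the log length rather than the raw length to determine the sign is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def align_with_length(feature, lengths):
    """Flip the feature so that it correlates positively with log length.

    Returns the aligned feature and the absolute correlation.
    """
    r = pearson(feature, [math.log(L) for L in lengths])
    sign = 1.0 if r >= 0 else -1.0
    return [sign * v for v in feature], abs(r)


aligned, abs_r = align_with_length([-1.0, -2.0, -3.0], [2, 8, 32])
print(aligned)  # [1.0, 2.0, 3.0]
```

After this redirection, larger values of the first principal feature correspond to longer processes for every item, so the features of different items can be correlated with each other in a consistent direction.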
Some other features reveal whether the respondent understands the requirements of items. For example, item U11b requires respondents to classify emails in the “Save” folder. The second feature of U11b reflects if a respondent was working on the correct folder. Similarly, item U01b requires creating a new folder. The second feature of this item is related to whether this requirement is followed.
There are also features related to respondents’ information and computer technology skills. Examples include the second feature of U03a, indicating whether search or sort tools are used, and the second feature of U04a, reflecting whether window split is used to avoid frequent switching between windows.
Table 2: Interpretation of the first three principal features of each item.

| Item | Number of features | Feature | Interpretation |
|---|---|---|---|
| U01a | 50 | 1 | Attentiveness in item response process (0.68) |
| | | 2 | Intensity of mail and folder viewing actions |
| | | 3 | Intensity of mail moving actions |
| U01b | 30 | 1 | Attentiveness in item response process (0.96) |
| | | 2 | Intensity of creating new folders actions |
| | | 3 | Intensity of mail moving actions |
| U02 | 50 | 1 | Attentiveness in item response process (0.94) |
| | | 2 | Intensity of mail moving actions |
| | | 3 | Intensity of mail viewing actions |
| U03a | 70 | 1 | Attentiveness in item response process (0.86) |
| | | 2 | Intensity of search and sort actions |
| | | 3 | Times of answer submission |
| U04a | 70 | 1 | Attentiveness in item response process (0.98) |
| | | 2 | Intensity of switching environments |
| | | 3 | Intensity of arranging tables actions |
| U06a | 60 | 1 | Attentiveness in item response process (0.91) |
| | | 2 | Intensity of clicking radio buttons |
| | | 3 | Chance of classifying a website as useful |
| U06b | 20 | 1 | Attentiveness in item response process (0.94) |
| | | 2 | Intensity of selecting answers |
| | | 3 | Intensity of choosing website 2 against choosing website 4 |
| U07 | 100 | 1 | Attentiveness in item response process (0.96) |
| | | 2 | Intensity of actions related to website 6 |
| | | 3 | Intensity of actions related to website 3 |
| U11b | 40 | 1 | Attentiveness in item response process (0.94) |
| | | 2 | Intensity of actions related to email in save folder |
| | | 3 | Intensity of mail moving actions |
| U16 | 70 | 1 | Attentiveness in item response process (0.95) |
| | | 2 | Intensity of “Other_Keypress” |
| | | 3 | Intensity of email viewing against email replying |
| U19a | 40 | 1 | Attentiveness in item response process (0.91) |
| | | 2 | Intensity of typing emails |
| | | 3 | Intensity of ticking and clicking email environment button |
| U19b | 50 | 1 | Attentiveness in item response process (0.89) |
| | | 2 | Intensity of sorting actions |
| | | 3 | Number of checked boxes |
| U21 | 50 | 1 | Attentiveness in item response process (0.92) |
| | | 2 | Intensity of making reservations |
| | | 3 | Number of games selected |
| U23 | 40 | 1 | Attentiveness in item response process (0.87) |
| | | 2 | Click customer service against clicking not needed links |
| | | 3 | Obtain Authorization number or not |

Note: The number in parentheses is the absolute value of the correlation between the first principal feature and the logarithm of the sequence length.
4.3 Reconstruction of Derived Variables
In this subsection, we demonstrate that the extracted features contain a substantial amount of information of the action sequences by showing that some key variables derived from the action sequences can be reconstructed from the features.
Derived variables are binary variables indicating whether certain actions or patterns appear in the action sequences. For the example item described in the introduction, whether the first link is clicked is a derived variable. Item response outcomes (correct or incorrect) can also be treated as derived variables since they are entirely determined by the action sequences. In the PIAAC data, besides the item response outcomes, 79 derived variables are recorded for the 14 items. The following experiment examines how well the 93 (79 + 14) derived variables can be reconstructed from the features extracted by Procedure 1.
For a given item, let $z$ denote a generic binary derived variable and $\theta$ be the vector of principal features extracted from its response process. We consider the logistic regression model for each derived variable

$$\log \frac{p}{1 - p} = \beta_0 + \beta^\top \theta, \tag{7}$$

where $p$ is the probability of $z = 1$, and $\beta_0$ and $\beta$ are the regression coefficients. For each derived variable, the respondents for whom the variable is available are randomly divided into a training set and a test set in the ratio 4:1. The logistic regression model (7) is fitted on the training set, and the value of the derived variable in the test set is predicted as 1 if the fitted probability is greater than 0.5, and 0 otherwise. The prediction performance is evaluated by prediction accuracy.
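The evaluation scheme described above (a 4:1 random split, a logistic regression fitted on the training part, and a 0.5 threshold on the test part) can be sketched in Python as follows. This is a minimal illustration using plain gradient descent; the function names, learning rate, and iteration count are assumptions, and in practice a standard statistical package would be used for the fit.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(X, z, n_iter=500, lr=0.5):
    """Fit an intercept b0 and coefficients b by gradient descent."""
    n, K = len(X), len(X[0])
    b0, b = 0.0, [0.0] * K
    for _ in range(n_iter):
        g0, g = 0.0, [0.0] * K
        for x, zi in zip(X, z):
            err = sigmoid(b0 + sum(bk * xk for bk, xk in zip(b, x))) - zi
            g0 += err
            for k in range(K):
                g[k] += err * x[k]
        b0 -= lr * g0 / n
        b = [bk - lr * gk / n for bk, gk in zip(b, g)]
    return b0, b


def predict(b0, b, X):
    """Predict 1 when the fitted probability exceeds 0.5, else 0."""
    return [1 if sigmoid(b0 + sum(bk * xk for bk, xk in zip(b, x))) > 0.5 else 0
            for x in X]


def split_accuracy(X, z, seed=0):
    """Random 4:1 train/test split; return test-set prediction accuracy."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    cut = len(idx) * 4 // 5
    train, test = idx[:cut], idx[cut:]
    b0, b = fit_logistic([X[i] for i in train], [z[i] for i in train])
    pred = predict(b0, b, [X[i] for i in test])
    return sum(p == z[i] for p, i in zip(pred, test)) / len(test)
```

Here each row of `X` plays the role of a respondent's feature vector and `z` holds the binary derived variable; `split_accuracy(X, z)` returns the proportion of test respondents whose derived variable is predicted correctly.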
Figure 7 presents a histogram of the prediction accuracy for the 93 derived variables. For most of the variables, the model constructed from the extracted features has more than 90% accuracy. This result confirms that the features extracted by Procedure 1 are a comprehensive summary of the response processes.
Given that the features contain information about action sequences, a natural question is whether these features are useful for assessing respondents’ competency and understanding their behavior. We will try to answer this question in the remainder of this section.
4.4 Cross-Item Outcome Prediction
In this section, we explore if the features obtained from the process data of one item are helpful to predict the outcomes of another item. Intuitively, if the extracted features characterize the behavioral patterns and/or intellectual levels of respondents, which affect their performance in general, then these features should be able to tell more about whether the respondents can answer other items correctly than a single binary outcome.
Let $y_j$ denote the outcome of item $j$ and $\theta_j$ denote the features extracted from item $j$, $j = 1, \ldots, 14$. We model the relation between the outcome of item $j'$ and the outcome and the features of item $j$ by a logistic regression

$$\log \frac{p_{j'}}{1 - p_{j'}} = \beta_0 + \beta^\top x_j,$$

where $p_{j'}$ is the probability of $y_{j'} = 1$ and $x_j$ is a vector of covariates of item $j$. If process data is not taken into account, only $y_j$ provides information about $y_{j'}$, and $x_j = y_j$. In this case, the available information for telling the outcome of item $j'$ is very limited, especially when the correct rate of item $j$ is close to 0 or 1. If process data is collected, then the features $\theta_j$ extracted according to Procedure 1 provide another source of information and we could use them in the covariates $x_j$ from item $j$. We call it the baseline model if it only incorporates the outcome in $x_j$ and the process model if it utilizes the features extracted from process data.
Given that we want to model the outcome of item $j'$ based on the information provided in item $j$, respondents who responded to both items are randomly split into training, validation, and test sets in the ratio 4:1:1. Both the baseline model and the process model are fitted on the training set. To avoid overfitting in the process model, penalties on the coefficients are incorporated. The process model is fitted on the training set for a grid of penalty parameters. The fitted process model that corresponds to the penalty parameter producing the highest prediction accuracy on the validation set is chosen to compare with the baseline model. The prediction accuracy of the process model for all combinations of $j$ and $j'$ is plotted against the prediction accuracy of the corresponding baseline model in the left panel of Figure 8. For most of the item pairs, the prediction accuracy is improved when the features extracted from process data are utilized, implying that the information in the process data is helpful in predicting the performance of respondents.
To take a closer look at the results, the middle and right panels of Figure 8 compare prediction accuracy separately for those who answered item $j$ correctly and incorrectly. The improvement in prediction accuracy is more obvious for the “incorrect” group. The main reason is that the action sequences corresponding to the incorrect responses usually provide more information about the respondents. There are usually more ways to answer a question incorrectly than correctly. An incorrect response may be the consequence of misunderstanding the item requirements or a lack of basic computer skills. It may also result from the respondents’ carelessness or inattentiveness. These varieties are reflected in the response processes and thus in the extracted features. As an illustration, the histograms of the first principal feature of item U01a stratified by the respondents’ outcomes of U01a and U01b are plotted in Figure 9. In the U01a incorrect group, there is a significant difference in the feature distributions for those who answered U01b correctly and incorrectly, while the two distributions are almost identical in the U01a correct group. Recall that the first principal feature describes the respondents’ attentiveness. Among the respondents who answered U01a incorrectly, those with lower feature values lack attentiveness. By including the features in the model, we are able to identify them and know that they are unlikely to answer U01b and other items correctly.
4.5 Score Prediction
The 14 interactive items in PIAAC were designed to study PSTRE skills. Respondents’ competency in literacy and numeracy was measured using items designed specifically for these two scales. We show in this subsection that process data from problem-solving items can shed light on respondents’ proficiency in other scales. To explore the relation between the score of a specific scale and the problem-solving items, we consider a linear model in which the score is a linear function of a vector of predictors plus Gaussian random noise. The predictors are related to one or more problem-solving items and will be specified later.
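In symbols, and under notation assumed here purely for illustration (the score $z$, intercept $\beta_0$, coefficient vector $\boldsymbol{\beta}$, predictor vector $\mathbf{x}$, noise $\epsilon$, and variance $\sigma^2$ are our labels and may differ from the original notation), such a model can be written as

```latex
z = \beta_0 + \boldsymbol{\beta}^\top \mathbf{x} + \epsilon,
\qquad \epsilon \sim N(0, \sigma^2).
```

Whether the intercept is written separately or absorbed into the predictor vector is a matter of convention and does not change the model.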
4.5.1 Score Prediction Using a Single Item
In the first experiment, we model the scores based on the information provided in a single item. In the baseline model, the predictors include only the binary outcome of the item. In the process model, the predictors also include the features extracted from the item’s process data. For each of the 14 problem-solving items, the respondents are randomly split into training, validation, and test sets in the ratio 4:1:1. Both the baseline and the process model are fitted on the training set for literacy and numeracy scores separately. To avoid overfitting, penalties are placed on the coefficients in the process model, which is fitted for a grid of penalty parameters. The penalty parameter that produces the best prediction performance on the validation set is selected to obtain the final estimated process model. Prediction performance is then evaluated on the test set.
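The contrast between the two sets of predictors can be sketched with a small penalized least-squares fit. The form of the penalty is not stated in this section, so the ridge (L2) penalty below is an assumption, as are the function names and the toy data; the solver is a plain Gaussian elimination kept short for illustration.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_fit(X, z, lam):
    """Least squares with an L2 penalty lam on every coefficient except the intercept."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    p = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    for i in range(1, p):
        A[i][i] += lam  # assumed ridge penalty; the intercept is left unpenalized
    b = [sum(r[i] * zi for r, zi in zip(rows, z)) for i in range(p)]
    return solve(A, b)


# Toy data (hypothetical): binary outcome and one process feature per respondent.
outcome = [0, 1, 0, 1]
feature = [0.0, 1.0, 2.0, 3.0]
score = [2 + 3 * y + 0.5 * f for y, f in zip(outcome, feature)]

baseline_X = [[y] for y in outcome]                        # outcome only
process_X = [[y, f] for y, f in zip(outcome, feature)]     # outcome plus feature

baseline_coef = ridge_fit(baseline_X, score, lam=0.0)
process_coef = ridge_fit(process_X, score, lam=1.0)
```

In practice the penalty value would be chosen on the validation set rather than fixed in advance.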
The left panel of Figure 10 presents the prediction performance of the baseline model and the process model for all combinations of score and item. For both literacy and numeracy scores, including information from process data is beneficial to score prediction. Although the problem-solving items are not designed to measure numeracy and literacy in PIAAC, process data can provide information leading to substantial improvements in the prediction of these two scores.
The right panel of Figure 10 presents the prediction performance of the process model stratified by the outcome of the item. As in the outcome prediction of the previous subsection, the prediction performance for the respondents who answered an item incorrectly is usually much better than that for those who answered correctly, since action sequences corresponding to incorrect answers often carry more information than those corresponding to correct answers.
4.5.2 Score Prediction Using Multiple Items
In the second experiment, we examine how the improvement in score prediction brought by process data changes as the number of available items increases. We consider only the 3,645 respondents who responded to all 14 problem-solving items. Among these respondents, 2,645 are randomly assigned to the training set, 500 to the validation set, and 500 to the test set. For each score, two models, a baseline model and a process model, are considered for a given set of available items. For the baseline model, the linear predictor consists of the binary outcomes of the available items. For the process model, in addition to the binary outcomes, the linear predictor includes the first 20 principal features of each available item. The set of available items is determined by forward Akaike information criterion (AIC) selection of the outcomes on the training set. Specifically, for a given number of available items, the set contains the items whose outcomes are the first to be selected by forward AIC selection among all 14 outcomes. For a given score, a sequence of baseline models and process models is fitted on the training set. As in the previous subsection, a penalty is placed on the coefficients of the process models to avoid overfitting, and the penalty parameter is selected based on prediction performance on the validation set.
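The way forward selection induces nested sets of available items can be sketched as follows. This is an illustration under assumptions: the selection is run until all candidates are ordered (no stopping rule), the AIC is the Gaussian form up to an additive constant, and `aic_of` is a hypothetical callable that returns the AIC of a model fitted with a given subset of outcomes.

```python
import math


def gaussian_aic(rss, n, k):
    """AIC of a Gaussian linear model up to an additive constant: n*log(RSS/n) + 2k."""
    return n * math.log(rss / n) + 2 * k


def forward_order(candidates, aic_of):
    """Order candidates greedily: at each step add the one giving the lowest AIC."""
    selected, remaining = [], list(candidates)
    while remaining:
        best = min(remaining, key=lambda c: aic_of(selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


def nested_item_sets(order):
    """The first k selected items form the available set, for k = 1, ..., len(order)."""
    return [order[:k] for k in range(1, len(order) + 1)]
```

Because each set extends the previous one by a single item, the sequence of fitted models is directly comparable as the number of available items grows.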
Figure 11 presents the prediction performance of the baseline model and the selected process model on the test set. Regardless of the number of items available, the process model outperforms the baseline model in both literacy and numeracy score prediction. The improvement is more pronounced for literacy. The performance of the process model with only five items is comparable to that of the baseline model with all 14 items. In the process of completing the task in a problem-solving item, respondents need to comprehend the item description and the provided materials, so the outcomes and the action sequences of problem-solving items can reflect respondents’ literacy competency to some extent. Our experiment shows that process data can provide more information than binary outcomes. Properly incorporating process data in data analysis can exploit the information from items more efficiently.
5 Concluding Remarks
In this article, we present a method to extract informative latent variables from process data and illustrate the method via simulation studies and a case study of PIAAC 2012 data. The latent variables in the process data are extracted by an automatic procedure involving MDS of the dissimilarity matrix among response processes. The dissimilarity measure used in this article is just one of the possible choices. Other choices such as Levenshtein distance (Levenshtein, 1966) and optimal symbol alignment distance (Herranz et al., 2011) can also be used and give similar results. However, these measures are often more computationally demanding.
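As an illustration of one alternative mentioned above, the Levenshtein distance between two action sequences can be computed with the standard dynamic program below. The function names are ours, and the second sequence is a hypothetical respondent who skips the first website; the action labels follow the example given in the introduction.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # delete x
                            curr[j - 1] + 1,       # insert y
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def dissimilarity_matrix(sequences):
    """Pairwise Levenshtein distances, usable as input to MDS."""
    n = len(sequences)
    return [[levenshtein(sequences[i], sequences[j]) for j in range(n)]
            for i in range(n)]


s1 = ["Start", "Click_W1", "Back", "Click_W2", "Toolbar_Bookmark", "Next"]
s2 = ["Start", "Click_W2", "Toolbar_Bookmark", "Next"]
D = dissimilarity_matrix([s1, s2])
```

Each pairwise distance costs time proportional to the product of the two sequence lengths, which is one reason such measures become demanding when many long response processes are compared.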
The respondents of our process data came from five different countries and they varied in age, gender, and many other demographic variables. The extracted features and the prediction procedure can also be used to study the difference in behavior patterns of different demographic groups. We will pursue this direction in future research.
Time stamps of actions are also available in process data. The time elapsed between the occurrences of two actions may provide additional information about respondents and can be useful in cognitive assessments. The current dissimilarity measure does not make use of this information. Further study on incorporating response time information in the analysis of process data is a potential future direction.
The authors would like to thank Educational Testing Service for providing the data, and Hok Kan Ling for cleaning it.
- Borg, I., & Groenen, P. J. (2005). Modern multidimensional scaling: Theory and applications. New York, NY: Springer Science & Business Media. https://doi.org/10.1007/0-387-28981-X
- Gómez-Alonso, C., & Valls, A. (2008). A similarity measure for sequences of categorical data based on the ordering of common elements. In V. Torra & Y. Narukawa (Eds.), Modeling decisions for artificial intelligence (pp. 134–145). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-88269-5_13
- Greiff, S., Niepel, C., Scherer, R., & Martin, R. (2016). Understanding students’ performance in a computer-based assessment of complex problem solving: An analysis of behavioral data from computer-generated log files. Computers in Human Behavior, 61, 36–46. https://doi.org/10.1016/j.chb.2016.02.095
- He, Q., & von Davier, M. (2015). Identifying feature sequences from process data in problem-solving items with n-grams. In L. A. van der Ark, D. M. Bolt, W. C. Wang, J. A. Douglas, & S. M. Chow (Eds.), Quantitative psychology research (pp. 173–190). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-19977-1_13
- He, Q., & von Davier, M. (2016). Analyzing process data from problem-solving items with n-grams: Insights from a computer-based large-scale assessment. In Y. Rosen, S. Ferrara, & M. Mosharraf (Eds.), Handbook of research on technology tools for real-world skill development (pp. 749–776). Hershey, PA: Information Science Reference. https://doi.org/10.4018/978-1-4666-9441-5.ch029
- Herranz, J., Nin, J., & Sole, M. (2011). Optimal symbol alignment distance: A new distance for sequences of symbols. IEEE Transactions on Knowledge and Data Engineering, 23(10), 1541–1554. https://doi.org/10.1109/TKDE.2010.190
- Karni, E. S., & Levin, J. (1972). The use of smallest space analysis in studying scale structure: An application to the California Psychological Inventory. Journal of Applied Psychology, 56(4), 341. https://doi.org/10.1037/h0032934
- Klein Entink, R., Fox, J. P., & van der Linden, W. J. (2009). A multivariate multilevel approach to the modeling of accuracy and speed of test takers. Psychometrika, 74(1), 21. https://doi.org/10.1007/s11336-008-9075-y
- Kroehne, U., & Goldhammer, F. (2018). How to conceptualize, represent, and analyze log data from technology-based assessments? A generic framework and an application to questionnaire items. Behaviormetrika, 45(2), 527–563. https://doi.org/10.1007/s41237-018-0063-y
- Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.
- Lord, F. M. (1980). Applications of item response theory to practical testing problems. New York, NY: Routledge.
- Meyer, E. M., & Reynolds, M. R. (2018). Scores in space: Multidimensional scaling of the WISC-V. Journal of Psychoeducational Assessment, 36(6), 562–575. https://doi.org/10.1177/0734282917696935
- Qian, H., Staniewska, D., Reckase, M., & Woo, A. (2016). Using response time to detect item preknowledge in computer-based licensure examinations. Educational Measurement: Issues and Practice, 35(1), 38–47. https://doi.org/10.1111/emip.12102
- Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407. https://doi.org/10.1214/aoms/1177729586
- Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York, NY: Guilford Press.
- Shoben, E. J. (1983). Applications of multidimensional scaling in cognitive psychology. Applied Psychological Measurement, 7(4), 473–490. https://doi.org/10.1177/014662168300700406
- Skager, R. W., Schultz, C. B., & Klein, S. P. (1966). The multidimensional scaling of a set of artistic drawings: Perceived structure and scale correlates. Multivariate Behavioral Research, 1(4), 425–436. https://doi.org/10.1207/s15327906mbr0104_2
- Subkoviak, M. J. (1975). The use of multidimensional scaling in educational research. Review of Educational Research, 45(3), 387–423. https://doi.org/10.3102/00346543045003387
- Takane, Y. (2006). 11 Applications of multidimensional scaling in psychometrics. Handbook of Statistics, 26, 359–400. https://doi.org/10.1016/S0169-7161(06)26011-5
- van der Linden, W. J. (2008). Using response times for item selection in adaptive testing. Journal of Educational and Behavioral Statistics, 33(1), 5–20. https://doi.org/10.3102/1076998607302626
- Wang, S., Zhang, S., Douglas, J., & Culpepper, S. (2018). Using response times to assess learning progress: A joint model for responses and response times. Measurement: Interdisciplinary Research and Perspectives, 16(1), 45–58. https://doi.org/10.1080/15366367.2018.1435105
- Zhan, P., Jiao, H., & Liao, D. (2018). Cognitive diagnosis modelling incorporating item response times. British Journal of Mathematical and Statistical Psychology, 71(2), 262–286. https://doi.org/10.1111/bmsp.12114