Assessment Modeling: Fundamental Pre-training Tasks for Interactive Educational Systems

01/01/2020 ∙ by Youngduck Choi, et al. ∙ 0

Interactive Educational Systems (IESs) have developed rapidly in recent years to address the issue of quality and affordability of education. Analogous to other domains in AI, there are specific AIEd tasks for which labels are scarce. For instance, labels like exam score and grade are considered important in educational and social contexts. However, obtaining these labels is costly as they require student actions taken outside the system. Likewise, while student events like course dropout and review correctness are automatically recorded by IESs, they are few in number as the events occur sporadically in practice. A common way of circumventing the label-scarcity problem is the pre-train/fine-tune method. Accordingly, existing works pre-train a model to learn representations of the contents of learning items. However, such methods fail to utilize the available student interaction data and to model student learning behavior. To this end, we propose assessment modeling, a class of fundamental pre-training tasks for IESs. An assessment is a feature of student-system interactions which can act as a pedagogical evaluation, such as student response correctness or timeliness. Assessment modeling is the prediction of assessments conditioned on the surrounding context of interactions. Although it is natural to pre-train on interactive features available in large amounts, narrowing down the prediction targets to assessments keeps relevance to label-scarce educational problems while reducing irrelevant noise. To the best of our knowledge, this is the first work investigating appropriate pre-training methods for predicting educational features from student-system interactions. While the effectiveness of different combinations of assessments is open for exploration, we suggest assessment modeling as a guiding principle for selecting proper pre-training tasks for label-scarce educational problems.






1. Introduction

Figure 1. Comparison between content-based and interaction-based approaches. Content-based approaches learn representations of contents in learning items (e.g. exercises). On the other hand, interaction-based approaches model student learning behaviors in interactive educational systems.

An Interactive Educational System (IES) interacts with students to assess them and design individualized optimal learning paths. The observations of student behaviors automatically collected by IESs are large in scale, and thus power data-driven approaches to many Artificial Intelligence in Education (AIEd) tasks and further improve the quality of IESs. However, there are important tasks with insufficient data, which prevents the relevant models from attaining their full potential. One such case is when a task depends on labels external to the IES that are rare in number. For example, the labels for a grade prediction model in an in-class IES are only collected at the end of the class. In other cases, the data points are collected by IESs but are simply few in number because the events occur sporadically in practice. For instance, actions reviewing previously solved exercises may be outnumbered by actions solving unseen exercises, since a student is likely to invest more time in solving new exercises than in reviewing.

To circumvent the lack of data, transfer learning has been studied in the AIEd literature (Hunt et al., 2017; Ding et al., 2019). However, this method works only when the available data and the task to complete have the same form. Instead, the pre-train/fine-tune paradigm can be considered. In this paradigm, a model is first pre-trained on an unsupervised auxiliary task with abundant data. Then, the model is slightly modified and trained on the main tasks with possibly scarce data. This approach has been successful in other AI fields including Natural Language Processing (NLP), computer vision and motion planning (Devlin et al., 2018; Studer et al., 2019; Schneider et al., 2019). In this line of thought, content-based pre-training methods have been studied (Huang et al., 2017; Sung et al., 2019; Yin et al., 2019). However, the pre-trained models fail to capture student learning behavior as student interactions are not considered.

In this paper, we propose assessment modeling, a class of fundamental pre-training tasks for general IESs. Here, an assessment is any feature of student-system interactions which can act as a criterion for pedagogical evaluation. Examples of assessments are the time a student spent on each exercise and the correctness of the student's response to a given exercise. While a wide range of interactive features is available, narrowing down the prediction targets to assessments keeps relevance to label-scarce educational problems while reducing irrelevant noise. Also, most assessments are available in large amounts as they are automatically collected by IESs.

Inspired by the recent success of bidirectional representations in the NLP domain (Devlin et al., 2019), we develop an assessment model with a deep bidirectional Transformer encoder. In the pre-training phase, we randomly select a portion of entries in a sequence of interactions and mask the corresponding assessments. Then, we train the deep bidirectional Transformer encoder based assessment model to predict the masked assessments conditioned on the surrounding interactions. After pre-training, we replace the last layer of the model with a layer corresponding to each downstream task, and all parameters of the model are then fine-tuned to the downstream tasks.

We empirically evaluate assessment modeling as a pre-training task on EdNet (Choi et al., 2019), a large-scale dataset collected by an active mobile education application, Santa, which has 1M users and 72M response data points gathered since 2016. The results show that assessment modeling provides substantial performance gains in downstream AIEd tasks. In particular, we obtain improvements in mean absolute error and accuracy over the previous state-of-the-art models for exam score and review correctness prediction, respectively.

In summary, our contributions are as follows:

  • We propose assessment modeling, a class of fundamental pre-training tasks for general IESs. Assessment modeling as a pre-training task allows the model to keep relevance to label-scarce educational problems while reducing irrelevant noise.

  • We give formal definitions of knowledge tracing and assessment modeling in a form that is quantifiable and objective from a particular IES design.

  • Inspired by the recent success of bidirectional representations in the NLP domain (Devlin et al., 2019), we develop an assessment model with a deep bidirectional Transformer encoder.

  • We empirically verify that assessment modeling as a pre-training task achieves improvements in mean absolute error and accuracy over the previous state-of-the-art models for exam score and review correctness prediction, respectively.

2. Related Works

2.0.1. Artificial Intelligence in Education

AIEd supports education through different AI technologies, including machine learning and deep learning, to “promote the development of adaptive learning environments and other AIEd tools that are flexible, inclusive, personalised, engaging, and effective”

(Luckin et al., 2016). A large body of work has been devoted to the development of AI models for AIEd tasks, including knowledge tracing (Corbett and Anderson, 1994; Piech et al., 2015), question analysis (Yin et al., 2019; Liu et al., 2018), student score/grade prediction (Iqbal et al., 2017; Morsy and Karypis, 2017; Almasri et al., 2019; Patil et al., 2017), educational content recommendation (Liu et al., 2019) and many more. With these models, an IES constantly updates the state of each student through continual feedback and provides appropriate learning experiences to learners.

2.0.2. Pre-training Methods in Education

Pre-training is the act of training a model on an unsupervised auxiliary task before the supervised main task (Erhan et al., 2010). Models using pre-training have been shown to outperform existing models in various fields including NLP (Devlin et al., 2018), Computer Vision (Studer et al., 2019), Speech Recognition (Schneider et al., 2019), Medical Science (Choi et al., 2016) and many more.

Pre-training techniques have also been applied to educational tasks with substantial performance improvements. For example, (Hunt et al., 2017) predict whether a student will graduate based on the student's general academic information, such as SAT/ACT scores or courses taken during college. They predict the graduation of 465 engineering students by first pre-training on the data of 6,834 students in other departments with the TrAdaBoost algorithm (Dai et al., 2007). (Ding et al., 2019) suggest two transfer learning methods, named Passive-AE transfer and Active-AE transfer, to predict student dropout in Massive Open Online Courses (MOOCs). Their experimental results show that both methods increase prediction accuracy, with Passive-AE transfer more effective for transfer learning across the same subject and Active-AE transfer more effective for transfer learning across different subjects.

Most of the pre-training methods used in interactive educational systems are NLP tasks on the content of learning materials. For example, the short answer grading model suggested in (Sung et al., 2019) uses pre-trained BERT to ameliorate the limited amount of student-answer pair data. The authors took a pre-trained, uncased BERT-base model and fine-tuned it on the SciEntsBank dataset and two psychology domain datasets. The resulting model outperformed existing grading models. The Test-aware Attention-based Convolutional Neural Network (TACNN) (Huang et al., 2017) is a model that utilizes the semantic representations of text materials (document, question and options) to predict exam question difficulty (i.e. the percentage of examinees with a wrong answer for a particular question). TACNN uses pre-trained word2vec embeddings (Mikolov et al., 2013) to represent word tokens. By applying convolutional neural networks to the sequence of text tokens and an attention mechanism to the series of sentences, the model quantifies the difficulty of the question.

QuesNet (Yin et al., 2019) is a question embedding model pre-trained with the context information of question data. Since existing pre-training methods in NLP are not applicable to the heterogeneous data in a question (such as images and metadata), the authors suggest the Holed Language Model (HLM), a pre-training task parallel to BERT's masked language model. HLM differs from BERT's task, however, in that it predicts each input based on the values of the other inputs aggregated in the Bi-LSTM layer of QuesNet, while BERT randomly masks the existing sequence with a certain probability. Also, QuesNet introduces another task called the Domain-Oriented Objective (DOO), the prediction of the correctness of the answer supplied with a question, to capture high-level logical information. QuesNet sums the two losses for HLM and DOO to form the final training loss. Compared to other baseline models, QuesNet shows the best performance in three downstream tasks: knowledge mapping, difficulty estimation and score prediction.

3. Assessment Modeling

3.1. Formal Definition of Assessment Modeling

Figure 2. The relationship of features predicted in general AIEd tasks, knowledge tracing and assessment modeling. Assessment modeling predicts the distribution of assessments, the subset of interactive features which can act as pedagogical evaluation criteria. Note that predicting an exam score (exam_score), a grade (grade) and whether a student will pass to get a certificate (certification) are tasks outside knowledge tracing.

Recall that knowledge tracing is the task of modeling a student's knowledge state based on their learning activities over time. Although the task is considered fundamental and has been studied extensively, it has not been defined formally and explicitly in the existing literature. In this subsection, we first define knowledge tracing in a form that is quantifiable and objective from a particular IES design. Subsequently, we introduce the definition of assessment modeling, which addresses the educational value of the label being predicted.

A learning session in an IES consists of a series of interactions $I_1, I_2, \ldots, I_T$ between a student and the system, where each interaction $I_t$ is represented as a set of features automatically collected by the system. The features represent diverse aspects of the learning activities provided by the system, such as the exercises or lectures being used, and the corresponding student actions. Under these notations, the definitions of knowledge tracing and assessment modeling are as follows:

Definition 3.1 ().

Knowledge tracing is the task of predicting a feature $f^i_t$ of the student in the $t$'th interaction $I_t$ given the sequence of interactions $I_1, \ldots, I_T$. That is, the prediction of

$$P(f^i_t \mid I_1 \setminus M_1, \ldots, I_T \setminus M_T),$$

where $M_j$ is the set of features masked when the feature $f^i_t$ is guessed. This is to mask input features not available at prediction time, so that a model does not 'cheat' when guessing $f^i_t$. In particular, the feature itself is always masked, so that $f^i_t \in M_t$.

This definition follows the prior works on knowledge tracing models (Piech et al., 2015; Zhang et al., 2017; Huang et al., 2019; Pandey and Karypis, 2019). Although a common set-up of knowledge tracing models is to predict a feature conditioned only on past interactions, we define knowledge tracing as a prediction task that is also conditioned on future interactions, to acknowledge the recent success of bidirectional architectures in knowledge tracing (Lee et al., 2019).

Example 3.2 (Knowledge tracing).

Knowledge tracing is typically instantiated as response correctness prediction, where the interaction consists of a consumed exercise and a corresponding response correctness (Piech et al., 2015; Zhang et al., 2017; Huang et al., 2019; Pandey and Karypis, 2019; Lee et al., 2019). In this setup, only the response correctness of the last interaction is predicted and features related to predicted student response are masked. Following our definition of knowledge tracing, the task can be extended further to predict diverse interactive features such as:

  • offer_selection: Whether a student accepts to study the offered learning items.

  • start_time: The time when a student starts to solve an exercise.

  • inactive_time: The duration for which a student is inactive in a learning session.

  • platform: Whether a student responds to each exercise on web browser or mobile app.

  • payment: Whether a student purchases charged items.

  • event: Whether a student participates in application events.

  • longest_answer: Whether a student selected the answer choice with the longest description.

  • response_correctness: Whether a student responds correctly to a given exercise.

  • timeliness: Whether a student responds to each exercise under the time limit recommended by domain experts.

  • course_dropout: Whether a student drops out of the entire class.

  • elapsed_time: The total time a student took to solve a given exercise.

  • lecture_complete: Whether a student completes studying a video lecture offered to him/her.

  • review_correctness: Whether a student responds correctly to a previously solved exercise.

In the aforementioned example, features like response_correctness and timeliness directly evaluate the educational value of a student's interaction, while it is somewhat debatable whether platform and longest_answer are also capable of addressing such quality. Accordingly, we define the concepts of assessment and assessment modeling as follows.

Definition 3.3 ().

An assessment $a_t$ of the $t$'th interaction $I_t$ is a feature of $I_t$ which can act as a criterion for pedagogical evaluation. The collection of assessments is a subset of the available features of $I_t$. Assessment modeling is the prediction of the assessment $a_t$ for some $t$ from the interactions $I_1, \ldots, I_T$. That is, the prediction of

$$P(a_t \mid I_1 \setminus M_1, \ldots, I_T \setminus M_T).$$

Example 3.4 (Assessments).

Among the interactive features listed in Example 3.2, we consider response_correctness, timeliness, course_dropout, elapsed_time, lecture_complete and review_correctness as assessments. For example, response_correctness is the primary assessment, as whether a student responded correctly to each exercise provides strong evidence for mastery of the concepts required to solve the exercise properly. Likewise, timeliness serves as an assessment since a student responding to each exercise within the time limit is expected to have high proficiency in the skills and knowledge necessary to solve the exercise. Figure 2 depicts the relationship between assessments and general knowledge tracing features.

3.2. Assessment Modeling as Pre-training Tasks

Figure 3. Not all interactive features are collected equally often. For example, payment, event, course_dropout and review_correctness are interactive features obtained more sporadically than other features. Among the sporadic interactive features, course_dropout and review_correctness are assessments.
Figure 4. Assessment modeling is a highly appropriate pre-training method for label-scarce educational problems. First, most assessments are available in large amounts as they are automatically collected from student-system interactions (Interactive). Also, since assessments share educational aspects, narrowing down the scope of prediction targets to assessments reduces noise irrelevant to the problems (Educational).
Figure 5. Possible pre-train/fine-tune scenarios. We may pre-train a model to predict start_time, response_correctness, timeliness and review_correctness for estimating exam_score (red). Likewise, a model pre-trained to predict start_time and response_correctness can be trained to predict review_correctness (green). However, pre-training a non-educational interactive feature like start_time is not effective for the label-scarce educational problems (dotted lines).

In this subsection, we provide examples of important yet scarce educational labels and argue that assessment modeling enables effective prediction of such labels.

Example 3.5 (Non-Interactive Educational Features).

In many applications, an IES is often integrated as part of a larger learning process. Accordingly, the ultimate evaluation of the learning process is mostly done independently from the IES. For example, the academic abilities of students are measured by course grades or standardized exams, and the ability to perform a complicated job or task is assured by professional certificates. Such labels hold great importance due to the pedagogical and social need for consistent evaluation of students' abilities. However, obtaining the labels is often challenging due to their scarcity compared to features automatically collected from student-system interactions. We give the following examples (see Figure 2).

  • exam_score: A student's standardized exam score.

  • grade: A student's final grade in an enrolled course.

  • certification: A professional certification obtained by completing educational programs or examinations.

Example 3.6 (Sporadic Assessments).

All assessments are collected by IESs automatically, but some are few in number because the events occur rarely in practice, which introduces a data scarcity problem. For example, it is natural for students to invest more time in learning new concepts than in reviewing previously studied materials. course_dropout and review_correctness are examples of sporadic assessments (Figure 3).

To overcome the lack of the aforementioned labels, we consider the pre-train/fine-tune paradigm, which leverages the patterns of data available in large amounts. In this paradigm, a model is first trained on an auxiliary task relevant to the tasks of interest with large-scale data. With the pre-trained parameters as initial values, the model is slightly modified and finally trained on the main tasks. This approach has been successful in AI fields like NLP, computer vision and speech recognition (Devlin et al., 2018; Studer et al., 2019; Schneider et al., 2019). Accordingly, existing methods in AIEd pre-train on the contents of learning materials, like the word sequence of a passage, but such methods do not capture student behaviors and only utilize a single feature from the data.

Instead, one may pre-train on the different features automatically collected by IESs (see Figure 4). However, training on every available feature is computationally intractable and may introduce irrelevant noise. To this end, assessment modeling narrows down the prediction targets to assessments, the interactive features that are also educational. Since multiple assessments are available, a wide variety of pre-train/fine-tune pairs can be explored for effective assessment modeling (see Figure 5). This raises the open-ended questions of which assessments to pre-train for label-scarce educational problems and how to pre-train multiple assessments. In Section 4, we explore these questions for exam score (a non-interactive educational feature) and review correctness (a sporadic assessment) prediction, respectively. Experimental results support that assessment modeling provides effective pre-training tasks for label-scarce educational problems.

3.3. Assessment Modeling with Deep Bidirectional Transformer Encoder

Figure 6. Proposed pre-train/fine-tune approach. In the pre-training phase, we train an assessment model to predict assessments conditioned on past and future interactions. After the pre-training phase, we fine-tune all parameters in the model to predict the labels of downstream tasks.

While there are several possible options for the architecture of the assessment model, we adopt the deep bidirectional Transformer encoder proposed in (Devlin et al., 2019) for the following reasons. First, (Pandey and Karypis, 2019) showed that the self-attention mechanism of the Transformer (Vaswani et al., 2017) is effective for the knowledge tracing task. The Transformer-based knowledge tracing model proposed in (Pandey and Karypis, 2019) achieved state-of-the-art performance on several datasets. Second, the deep bidirectional Transformer encoder model and pre-train/fine-tune method proposed in (Devlin et al., 2019) achieved state-of-the-art results on several NLP tasks. While (Devlin et al., 2019) conducted experimental studies in the NLP domain, the method is also applicable to other domains with slight modifications.

Figure 6 depicts our proposed pre-train/fine-tune approach. In the pre-training phase, we train a deep bidirectional Transformer encoder based assessment model to predict the assessments conditioned on past and future interactions. After the pre-training phase, we replace the last layer of the assessment model with a layer appropriate for each downstream task and fine-tune all parameters in the model to predict the labels of the downstream tasks. We provide detailed descriptions of our proposed assessment model in the following subsections.

3.3.1. Input Representation

The first layer in the assessment model maps each interaction to an embedding vector. Firstly, we embed the following attributes:

  • exercise_id: We assign a latent vector unique to each exercise.

  • exercise_category: Each exercise has its own category tag that represents the type of the exercise. We assign a latent vector to each tag.

  • position: The relative position of the interaction in the input sequence. We use the learned positional embedding from (Gehring et al., 2017) instead of the sinusoidal positional encoding used in (Vaswani et al., 2017).

As shown in Example 3.2, an IES collects diverse interactive features that can potentially be used for assessment modeling. However, using all possible interactive features for assessment modeling is computationally intractable and does not guarantee the best results when the assessment model is fine-tuned on downstream tasks. For our experimental studies, we narrow down the scope of interactive features to the ones available from an exercise-response pair, the simplest but most widely considered interaction in the knowledge tracing task. In particular, we embed the following interactive features:

  • start_time: The month, day and hour of the absolute time when a student started to solve each exercise are recorded. We assign a latent vector to every possible combination of month, day and hour.

  • response_correctness: The value is 1 if a student response is correct and 0 otherwise. We assign a latent vector corresponding to each possible value 0 and 1.

  • timeliness: The value is 1 if a student responds within a specified time limit and 0 otherwise. We assign a latent vector corresponding to each possible value 0 and 1.

  • longest_answer: The value is 1 if a student selects the answer choice with the longest description and 0 otherwise. We assign a latent vector corresponding to each possible value 0 and 1.

Among the interactive features above, only response_correctness and timeliness serve as assessments. However, we study various combinations of the above interactive features for pre-training in Section 4.

Let $e_t$ be the sum of the embedding vectors of exercise_id, exercise_category and position. Likewise, let $c_t$ and $s_t$ be the embedding vectors of response_correctness and timeliness respectively. Then, the representation of interaction $I_t$ is $v_t = e_t + c_t + s_t$.
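As a concrete sketch of this input representation (the table sizes, dimension and variable names below are illustrative assumptions, not values from the paper), each interaction embedding is a sum of per-attribute lookups:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # illustrative embedding dimension

# Illustrative embedding tables, one row per possible value of each attribute.
exercise_emb = rng.normal(size=(100, d_model))  # exercise_id
category_emb = rng.normal(size=(10, d_model))   # exercise_category
position_emb = rng.normal(size=(512, d_model))  # learned positional embedding
correct_emb = rng.normal(size=(2, d_model))     # response_correctness in {0, 1}
timely_emb = rng.normal(size=(2, d_model))      # timeliness in {0, 1}

def embed_interaction(exercise_id, category, position, correctness, timeliness):
    """Represent one interaction as the sum of its attribute embeddings."""
    return (exercise_emb[exercise_id] + category_emb[category]
            + position_emb[position] + correct_emb[correctness]
            + timely_emb[timeliness])

v = embed_interaction(exercise_id=42, category=3, position=0,
                      correctness=1, timeliness=0)
```

In a trained model, the tables would be learned parameters rather than random arrays; the sketch only shows how the per-attribute vectors combine into one representation per interaction.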

3.3.2. Masking

Inspired by the masked language model proposed in (Devlin et al., 2019), we use the following method to mask the assessments in a student interaction sequence. Firstly, we mask a fraction $\gamma$ of the interactions in a sequence, chosen at random. If the $t$-th interaction is chosen, we replace the corresponding input embedding with (1) a mask embedding for a fraction $p$ of the time and (2) the embedding of an assessment chosen uniformly at random from the sequence for the remaining $1 - p$ fraction of the time. Here mask is a learned vector that represents masking. We determine $\gamma$ and $p$ through ablation studies in Section 4.
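The masking procedure can be sketched as follows; the default fractions are illustrative placeholders (the paper determines them by ablation), and assessments are represented as plain tokens rather than embeddings:

```python
import random

MASK = "<mask>"  # stands in for the learned mask embedding

def mask_assessments(assessments, gamma=0.15, p_mask=0.9, seed=0):
    """BERT-style masking of an assessment sequence.

    A fraction `gamma` of positions is chosen at random; each chosen
    position is replaced by the mask token with probability `p_mask`,
    and otherwise by an assessment drawn uniformly at random from the
    same sequence.
    """
    rnd = random.Random(seed)
    masked = list(assessments)
    targets = {}  # position -> true assessment the model must predict
    for t, a in enumerate(assessments):
        if rnd.random() < gamma:
            targets[t] = a
            masked[t] = MASK if rnd.random() < p_mask else rnd.choice(assessments)
    return masked, targets
```

The pre-training objective then scores the model only on the positions recorded in `targets`, as in the masked language model.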

3.3.3. Model Architecture

After the interactions are embedded and masked accordingly, they enter a series of Transformer encoder blocks, each consisting of a multi-head self-attention layer followed by position-wise feed-forward networks. All the layer inputs have dimension $d_{model}$. The first encoder block takes the sequence of interactions embedded in the latent space and returns a series of vectors of the same length and dimension. For all $i > 1$, the $i$'th block takes the output of the $(i-1)$'th block as input and returns a series of vectors accordingly. We describe the architecture of each block as follows.

The multi-head self-attention layer takes a series of vectors $x_1, \ldots, x_T$. Each vector is projected to a latent space by projection matrices $W^Q$, $W^K$ and $W^V$:

$$q_t = W^Q x_t, \quad k_t = W^K x_t, \quad v_t = W^V x_t. \qquad (3)$$

Here $q_t$, $k_t$ and $v_t$ are the query, key and value of $x_t$ respectively. The output of the self-attention is then obtained as a weighted sum of values, with coefficients determined by the dot products between queries and keys:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V. \qquad (4)$$

Models with self-attention layers often use multiple heads to jointly attend to information from different representation subspaces. Following this, we apply attention $h$ times to the same query-key-value entries with different projection matrices:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O. \qquad (5)$$

Here, each $\mathrm{head}_i$ is equal to the output of self-attention in Equation 4 with the corresponding projection matrices $W^Q_i$, $W^K_i$ and $W^V_i$ in Equation 3. We use the linear map $W^O$ to aggregate the attention results.

After we compute the resulting value in Equation 5, we apply position-wise feed-forward networks to add non-linearity to the model. Also, we apply skip connections (He et al., 2016) and layer normalization (Ba et al., 2016) to the outputs of the feed-forward networks.
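A minimal NumPy sketch of one encoder block follows (the sequence length, dimension, head count and weight shapes are illustrative assumptions; a real implementation would use learned weights and a deep-learning framework):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(X, Wq, Wk, Wv, Wo):
    # Per-head query/key/value projections, concatenation, then
    # aggregation by the output projection W^O.
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_block(X, Wq, Wk, Wv, Wo, W1, W2):
    # Self-attention and position-wise feed-forward sublayers, each with
    # a skip connection followed by layer normalization.
    X = layer_norm(X + multi_head(X, Wq, Wk, Wv, Wo))
    ffn = np.maximum(0, X @ W1) @ W2  # ReLU feed-forward network
    return layer_norm(X + ffn)

rng = np.random.default_rng(0)
T, d_model, h = 5, 8, 2        # sequence length, model dimension, heads
d_k = d_model // h
X = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
Wo = rng.normal(size=(h * d_k, d_model))
W1 = rng.normal(size=(d_model, 4 * d_model))
W2 = rng.normal(size=(4 * d_model, d_model))
out = encoder_block(X, Wq, Wk, Wv, Wo, W1, W2)
```

Stacking several such blocks, each consuming the previous block's output, yields the encoder described above.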

Assume that the last encoder block returns the series of vectors $h_1, \ldots, h_T$. For pre-training, the prediction for the $t$'th interaction is made by applying a linear layer with a softmax activation function to $h_t$. The final output $\hat{a}_t$ is the estimated probability distribution over the four possible combinations of the assessments (response_correctness, timeliness): $(0, 0)$, $(0, 1)$, $(1, 0)$ or $(1, 1)$. The overall loss is defined as

$$L = \sum_t m_t \cdot \mathrm{CE}(\hat{a}_t, a_t),$$

where $\mathrm{CE}(\hat{a}_t, a_t)$ is the cross-entropy between the estimated distribution $\hat{a}_t$ and the actual one-hot distribution of $a_t$. The value $m_t$ is a flag that represents whether the $t$-th question is masked ($m_t = 1$) or not ($m_t = 0$). In other words, the total loss is the sum of cross-entropy losses for masked questions.
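The pre-training loss can be sketched as follows (function and argument names are our own, chosen for illustration):

```python
import numpy as np

def masked_loss(probs, labels, mask_flags):
    """Sum of per-position cross-entropy losses, counted only at
    positions that were masked during pre-training.

    probs[t]      -- predicted distribution over the four
                     (response_correctness, timeliness) combinations
    labels[t]     -- index of the true combination at position t
    mask_flags[t] -- 1 if position t was masked, else 0
    """
    total = 0.0
    for p, y, m in zip(probs, labels, mask_flags):
        if m:
            total -= np.log(p[y])  # cross-entropy against a one-hot target
    return total
```

For example, with a uniform prediction at one masked position and any prediction at an unmasked one, only the masked position contributes to the loss.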

The input embedding layer and encoder blocks are shared between pre-training and fine-tuning. For fine-tuning, we replace the linear layers applied to each $h_t$ in pre-training by a single linear layer that combines all entries of $h_1, \ldots, h_T$ to fit the output to the respective downstream task.

4. Experiments

4.1. Label-Scarce Educational Problems

We apply assessment modeling to exam score (a non-interactive educational feature) and review correctness (a sporadic assessment) prediction. The tasks are described in detail as follows.

4.1.1. Exam Score Prediction

Exam score prediction (ES) is the estimation of a student's scores on standardized exams, such as TOEIC and SAT, based on the student's interaction history in an educational system. ES is one of the most important AIEd tasks, as it is crucial for both students and educational systems to assess performance on a standardized measure. Because a substantial amount of human effort is required to develop or take the tests, the number of data points available for exam scores is considerably less than that of student interactions automatically collected by educational systems. With a reliable ES model, a student's ability can be estimated on a universally accepted scale by the interactive educational system with considerably less effort. ES differs from student response prediction (e.g. the prediction of an assessment $a_t$) because standardized tests are taken in a controlled environment with specific methods independent of the interactive educational system.

4.1.2. Review Correctness Prediction

Assume that a student incorrectly responds to an exercise and receives the corresponding feedback. The goal of review correctness prediction (RC) is to predict whether the student will respond to the exercise correctly if she encounters it again. The significance of this AIEd task is that it can assess the educational effect of an exercise on a particular student in a specific situation. In particular, the correctness probability estimated by this task represents the student's expected marginal gain in knowledge as she goes through the future learning process. For example, if the correctness probability is high, it is likely that the student will obtain the relevant knowledge in the future even if her initial response was incorrect.

Part             | 1–4                | 5  | 6  | 7
Time limit (sec) | audio duration + 8 | 25 | 50 | 55
Table 1. Time Limits

4.2. Dataset

We use the public EdNet dataset obtained from Santa, a mobile AI tutoring service for preparing for the TOEIC Listening and Reading Test (Choi et al., 2019). The test consists of two timed sections, Listening Comprehension (LC) and Reading Comprehension (RC), with a total of 100 exercises, consisting of 4 and 3 parts respectively. The final test score ranges from 0 to 990 in increments of 5. Once a user solves an exercise, Santa provides educational feedback on the response, such as explanations or commentaries on the exercise. EdNet is the collection of user interactions with multiple-choice exercises collected over the last four years. The user-exercise interaction data consists of six columns: user id, exercise id, user response, exercise part, received time and time taken. We describe each column as follows. Firstly, the user (resp. exercise) id identifies each unique user (resp. exercise). The user response is recorded as 1 if the user's response is correct and 0 otherwise. The part the exercise belongs to is recorded as the exercise part. Finally, the absolute time when the user received the exercise and the time taken by the user to respond are recorded respectively. In the dataset, 627,347 users solved more than one problem. The size of the exercise set is 16,175. The total row count of the dataset is 72,907,005.

4.2.1. Dataset for Pre-training

For pre-training, we first reconstruct the interaction timeline of each user by gathering the responses of that user in increasing order of absolute received time. For each interaction, the correctness assessment is the recorded user response, and the timeliness assessment is 1 if the user responded within the time limit recommended by TOEIC experts (Table 1) and 0 otherwise. We exclude the interactions of users involved in any of the downstream tasks from pre-training to avoid learning the downstream data beforehand. After processing, the data consists of 251,989 users with a total of 49,944,086 interactions.
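Deriving the timeliness assessment from the limits in Table 1 can be sketched as below. The function and field names are our own, and the mapping assumes parts 1–4 share the audio-based limit while parts 5–7 use the fixed reading limits.

```python
# Sketch: timeliness = 1 iff the response arrived within the
# expert-recommended limit for the exercise's part (Table 1).
# Names and the audio-duration handling are illustrative assumptions.
TIME_LIMITS = {5: 25.0, 6: 50.0, 7: 55.0}  # reading parts, in seconds

def timeliness(part: int, time_taken: float, audio_duration: float = 0.0) -> int:
    if part <= 4:                        # listening parts: audio duration + 8 s
        limit = audio_duration + 8.0
    else:                                # reading parts: fixed limits
        limit = TIME_LIMITS[part]
    return 1 if time_taken <= limit else 0
```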

4.2.2. Dataset for Label-Scarce Educational Problems

For exam score prediction, we aggregate the real TOEIC scores reported by users of Santa. The reports are scarce because a user has to register for the exam, take it, and report the score at their own expense. Santa offered a corresponding reward to users who reported their scores. A total of 2,594 score reports were collected over 6 months, far fewer than the 72M exercise responses. For our experiment, the data is divided into a training set (1,302 users, 1,815 labels), a validation set (244 users, 260 labels), and a test set (466 users, 519 labels).

For review correctness prediction, we scan each student's timeline for exercises that have been solved at least twice. That is, if an exercise appears more than once in a student's interaction sequence, we find the first two interactions involving that exercise. The sequence of interactions up to and including the first encounter is taken as the input, and the correctness of the response at the second encounter is taken as the label. A total of 1,084,413 labeled sequences are generated after pre-processing. For our experiment, the data is divided into a training set (28,016 users, 747,222 labels), a validation set (4,003 users, 112,715 labels), and a test set (8,004 users, 224,476 labels).
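The labeled-sequence construction described above can be sketched as follows, assuming interactions are given as (exercise_id, correctness) tuples; the function name and tuple layout are illustrative.

```python
def review_labels(sequence):
    """Sketch: given a user's time-ordered interactions as
    (exercise_id, correct) tuples, return (input_prefix, label) pairs.
    The prefix ends at the first encounter of a repeated exercise and
    the label is the correctness at its second encounter."""
    first_seen = {}   # exercise_id -> index of first encounter
    used = set()      # exercises whose label pair is already emitted
    pairs = []
    for j, (ex, correct) in enumerate(sequence):
        if ex in first_seen and ex not in used:
            i = first_seen[ex]
            pairs.append((sequence[: i + 1], correct))  # label = 2nd-encounter correctness
            used.add(ex)          # only the first two encounters are used
        elif ex not in first_seen:
            first_seen[ex] = j
    return pairs
```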

4.3. Setup

4.3.1. Models

Our model consists of two encoder blocks with a latent space dimension of 256. The model takes 100 interactions as input. For comparison, the following pre-training methods are also applied to encoder networks with the same architecture. Since the existing pre-training methods embed the content of each exercise, we replace the exercise-ID embedding in our model with the respective exercise embedding for fine-tuning.

  • Word2Vec (Mikolov et al., 2013) is one of the most well-known word embedding models and is used in (Huang et al., 2019) for exercise embedding. The model is pre-trained on the Google News dataset containing around 100 billion words. The embedding of each exercise is the average of the embedding vectors of all words appearing in the exercise description. The embedded vectors are not fine-tuned.

  • BERT (Devlin et al., 2018) is a state-of-the-art pre-training method featuring the Transformer architecture with a masked language model. As above, we embed each exercise by averaging the representation vectors of the words in its description, without fine-tuning.

  • For QuesNet (Yin et al., 2019), the image and meta-data embeddings are not used because the exercises in our experiment consist of text only. The model follows the architecture (a bi-directional LSTM followed by self-attention) and the pre-training tasks (the Holed Language Model and the Domain-Oriented Objective) suggested in the original paper.

4.3.2. Metrics

We use the following evaluation metrics to evaluate model performance on each downstream task. For exam score prediction, we compute the Mean Absolute Error (MAE), the average absolute difference between predicted and real exam scores. For review correctness prediction, we use accuracy (ACC).
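Both metrics are standard; a minimal sketch for completeness:

```python
def mae(preds, targets):
    # Mean Absolute Error: average magnitude of score prediction errors
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def accuracy(preds, targets):
    # fraction of binary predictions that match the labels
    return sum(int(p == t) for p, t in zip(preds, targets)) / len(preds)
```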

4.3.3. Training Details

The strategy we use for finding the optimal model parameters is the following. For pre-training, the model weights are first initialized with the Xavier uniform distribution and trained for a fixed number of epochs. After each epoch, the model parameters are stored as a checkpoint. We then fine-tune each checkpoint on the downstream task for a specified number of epochs, likewise storing the model after each epoch of fine-tuning. Among all resulting downstream-task models, the one with the best result on the validation set is chosen and evaluated on the test set. We use the Adam optimizer. The batch size is 256 for pre-training and 128 for fine-tuning.
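The checkpoint-selection strategy amounts to a grid search over (pre-training epoch, fine-tuning epoch) pairs. The sketch below captures that loop with hypothetical stand-in callables for the actual training and validation steps.

```python
def select_best(init_params, pretrain_epochs, finetune_epochs,
                pretrain_step, finetune_step, validate):
    """Sketch of the selection loop: snapshot the parameters after every
    pre-training epoch, fine-tune each snapshot epoch by epoch, and keep
    the model with the best validation score. All callables here are
    hypothetical stand-ins for real training/validation routines."""
    best_score, best_params = float("-inf"), None
    params = init_params
    for _ in range(pretrain_epochs):
        params = pretrain_step(params)        # one epoch of pre-training
        ft = params                           # start fine-tuning from snapshot
        for _ in range(finetune_epochs):
            ft = finetune_step(ft)            # one epoch of fine-tuning
            score = validate(ft)
            if score > best_score:
                best_score, best_params = score, ft
    return best_params, best_score
```

With toy numeric "parameters", the loop simply returns whichever checkpoint the validation function scores highest.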

The labels available for exam score prediction are scarce. To alleviate this, we apply the following data augmentation when fine-tuning our model: given an original interaction sequence with its score label, we repeatedly keep each element with 50% probability to generate augmented sequences that share the same score label.
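The 50% subsampling augmentation can be sketched as below; the function name and signature are our own.

```python
import random

def augment(sequence, n_augmented, p_keep=0.5, seed=0):
    """Sketch of the augmentation: each augmented copy keeps every
    interaction independently with probability p_keep, preserving order;
    the exam score label is shared by all copies."""
    rng = random.Random(seed)
    return [
        [x for x in sequence if rng.random() < p_keep]
        for _ in range(n_augmented)
    ]
```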

Pre-training method   ES (MAE)  RC (ACC)
Without pre-training  64.37     0.648
Word2Vec              77.93     0.649
BERT                  70.19     0.646
QuesNet               64.00     0.642
Assessment Modeling   49.84     0.656
Table 2. Experimental results on exam score prediction (ES, MAE; lower is better) and review correctness prediction (RC, ACC; higher is better)

4.4. Experimental Results

4.4.1. Model Performance

The experimental results of the 5 pre-trained models on the two aforementioned downstream tasks are shown in Table 2. On both downstream tasks, our model with assessment modeling outperforms all baseline models. Compared to the model without pre-training, the MAE for exam score prediction is reduced by 14.53 and the ACC for review correctness prediction is improved by 0.8 points. Our model also performs better than the other pre-trained models, demonstrating that assessment modeling is more suitable for knowledge tracing tasks than content-based pre-training approaches. Interestingly, some of the pre-trained models even perform worse than the model without pre-training, which underlines the importance of choosing pre-training tasks relevant to the downstream tasks.

Pre-trained labels        ES     RC
correctness               54.28  0.657
timeliness                61.98  0.648
correctness + timeliness  49.84  0.656
Table 3. Assessment Modeling Tasks

Masking Positions  ES     RC
Front              50.26  0.656
Back               51.91  0.657
Random             49.84  0.656
Table 4. Masking Positions

4.4.2. Effect of Assessment Modeling Labels

Recall that our model is pre-trained to predict the whole assessment, i.e., both the correctness and the timeliness of each response. We demonstrate the importance of pre-training on multiple assessments by comparing our model with two variations pre-trained to predict only correctness or only timeliness. The results are shown in Table 3. For exam score prediction, pre-training the model to predict the whole assessment works best. This is because the exam score depends not only on the student's answers but also on the time spent on each exercise. However, review correctness prediction only concerns the correctness of the user's response and does not involve timeliness. This explains why the model pre-trained to predict only response correctness performs slightly better than the model pre-trained to predict both components of the assessment.

4.4.3. Effect of Masking Positions

As mentioned in Section 3, we use a BERT-like masking method for pre-training to capture bi-directional information. We compare this with other possible masking approaches, masking the first 60% of interactions (Front) or the last 60% (Back), and the results are shown in Table 4. For exam score prediction, our approach with random masking is more effective than the masking methods with fixed positions. For review correctness prediction, there is no big difference among the three models, although the model with the last 60% of interactions masked performs slightly better than the others.
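The three masking schemes compared above can be sketched as follows, where a True entry marks a position whose assessment the model must predict; the function name and interface are illustrative.

```python
import random

def make_mask(seq_len, proportion=0.6, scheme="random", seed=0):
    """Return a boolean mask over interaction positions.
    'front' masks the first 60%, 'back' the last 60%,
    and 'random' a random 60% (BERT-like)."""
    k = int(seq_len * proportion)
    if scheme == "front":
        masked = set(range(k))
    elif scheme == "back":
        masked = set(range(seq_len - k, seq_len))
    else:  # random positions
        masked = set(random.Random(seed).sample(range(seq_len), k))
    return [i in masked for i in range(seq_len)]
```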

      Masked prop.  Masking rate  Blocks  Dim  ES     RC
Base  0.6           1.0           2       256  49.84  0.656
(A)   0.2                                      53.87  0.655
      0.4                                      51.79  0.655
      0.8                                      52.01  0.657
(B)                 0.0                        52.25  0.655
                    0.5                        52.51  0.655
(C)                               1       128  50.79  0.657
                                  4       512  51.85  0.653
Table 5. Ablation study. Rows in (A) vary the proportion of masked interactions, rows in (B) vary the masking rate, and rows in (C) vary the model architecture (number of encoder blocks and dimension of the latent space)

4.4.4. Ablation Study

We conduct several ablation experiments to understand how each property of our model affects its performance. First, we observe the effect of the masking hyper-parameters: the proportion of masked interactions and the masking rate. Then, we vary the size of the model: the number of encoder blocks and the dimension of the latent space. As a result, the model with a masked proportion of 0.6, a masking rate of 1.0, 2 encoder blocks, and a latent dimension of 256 performs the best.

5. Conclusion

We introduced assessment modeling, a class of fundamental pre-training tasks for IESs. Our experiments show the effectiveness of assessment modeling as pre-training for label-scarce educational problems like exam score and review correctness prediction. Future work will explore assessments beyond the correctness and timeliness of responses and investigate further label-scarce educational problems.