TopicResponse: A Marriage of Topic Modelling and Rasch Modelling for Automatic Measurement in MOOCs

07/29/2016 ∙ by Jiazhen He, et al. ∙ The University of Melbourne

This paper explores the suitability of using automatically discovered topics from MOOC discussion forums for modelling students' academic abilities. The Rasch model from psychometrics is a popular generative probabilistic model that relates latent student skill, latent item difficulty, and observed student-item responses within a principled, unified framework. According to scholarly educational theory, discovered topics can be regarded as appropriate measurement items if (1) students' participation across the discovered topics is well fit by the Rasch model, and if (2) the topics are interpretable to subject-matter experts as being educationally meaningful. Such Rasch-scaled topics, with associated difficulty levels, could be of potential benefit to curriculum refinement, student assessment and personalised feedback. The technical challenge that remains, is to discover meaningful topics that simultaneously achieve good statistical fit with the Rasch model. To address this challenge, we combine the Rasch model with non-negative matrix factorisation based topic modelling, jointly fitting both models. We demonstrate the suitability of our approach with quantitative experiments on data from three Coursera MOOCs, and with qualitative survey results on topic interpretability on a Discrete Optimisation MOOC.




1 Introduction

Massive Open Online Courses (MOOCs) have attracted wide attention due to the promise of delivering education at scale. This new learning environment produces a variety of data (e.g., demographic data, student engagement, and forum activities), which offers new opportunities for understanding student learning. While quizzes and assignments have dominated summative assessment, the many sources of rich student engagement data generated in MOOC platforms present new views on student learning and avenues for formative feedback. This paper explores whether students’ participation across automatically discovered MOOC forum topics is suitable for modelling academic ability.

Our work is inspired by the importance of forum discussions as an active learning activity, and by recent research on quantitative measurement of student learning in the education community. In particular: 1) MOOC discussion forums, as the main platform for student-instructor and student-student interactions, are important for gaining insight into student learning. 2) Recent research in education (Milligan, 2015) suggests that a distinctive and complex learning skill is required to promote learning in MOOCs. Educators are interested in whether and how possession of this complex learning skill may be evidenced by latent, complex patterns of engagement, rather than by traditional assessment tools such as quizzes and assignments. 3) To validate such a hypothesis, measurement theory can be used (Rasch, 1993; Wright and Masters, 1982). A set of items is handcrafted from forum activities (e.g., “contributed a post attracting votes from others” and “made repeated thread visits in more than half the weeks”), and calibrated (e.g., deleted or changed) to fit a measurement model, providing evidence as to whether the set of items is appropriate for measuring the complex learning skill (Milligan, 2015). This process is human-intensive and time-consuming, as reflected by Figure 1.

Figure 1: Workflow for devising items manually versus automatically discovering topics as items for measurement. Traditionally, a set of items is handcrafted from MOOC forum behaviours, and then the students’ dichotomous responses on the items are examined using the Rasch model. If the model fits well, then the students and items can be compared on an inferred scale (the ruler). Otherwise the items are refined (changed, added or deleted) manually until the model fits. The process of handcrafting and calibration is time-consuming. Instead, we aim to automatically generate topics from discussion posts as items that fit the Rasch model by design.

Driven by these observations, we investigate whether students’ participation in automatically discovered forum topics can be used as an instrument to model students’ ability. If students’ participation across the discovered topics fits a measurement model (in this paper, the Rasch model) in terms of statistical effectiveness, and the topics are interpretable to subject-matter experts in terms of qualitative effectiveness, then the discovered topics can be regarded as useful items for measurement. The resulting scaled topics, endowed with estimated difficulty levels, can assist in subsequent curriculum refinement, student assessment, and personalised feedback.

The technical challenge, then, is to automatically discover topics such that students’ participation across them fits the Rasch model. He et al (2016) adapted topic modelling of students’ online forum postings such that students’ participation across the discovered topics conforms to the Guttman scale. However, the Guttman scale is widely regarded as overly idealised and impractical in the real world. In contrast, the Rasch model, one of the simplest item response theory (IRT) models and the basis for many extensions, has been widely used in education and psychology. It is a generative probabilistic model that represents student responses as noisy observations arising from latent student abilities and item difficulties. It can be viewed as a stochastic counterpart to the Guttman scale, permitting measurement error. Under the Guttman scale, if a person’s ability level is higher than an item’s difficulty, the person will answer the item correctly; under the Rasch model, there is instead a certain probability of an incorrect response. While the Guttman scale only permits ordering of persons and items, the Rasch model locates them on a scale and hence also supports meaningful differences (Scholten, 2011). The algorithm proposed for the Guttman scale (He et al, 2016) does not adapt readily to Rasch modelling. Instead we propose the TopicResponse algorithm, which simultaneously performs non-negative matrix factorisation and Rasch model fitting. The main contributions of this paper include:

  • The first study that combines topic modelling with Rasch modelling in psychometric testing: generating topics that measure students’ academic abilities based on online forum postings;

  • An algorithm TopicResponse fitting NMF and Rasch models simultaneously, for which we provide a proof of convergence; and

  • Quantitative experiments on three Coursera MOOCs covering a broad swath of disciplines, establishing statistical effectiveness of our algorithm, and qualitative results on a Discrete Optimisation MOOC, supporting interpretability.

We review related work in Section 2. In Section 3, we present preliminaries and formalise our problem. Our algorithm is introduced in Section 4, and evaluated in Section 5. Section 6 concludes the paper.

2 Related Work

Many studies have focused on item response theory (IRT) or MOOC data analysis, but automatic discovery of items for measurement in MOOCs has received little attention. The work most relevant to this paper is (He et al, 2016), where NMF-based topic modelling is adapted and used for Guttman scaling (Guttman, 1950) in order to measure students’ latent abilities based on their MOOC forum posts. A major drawback of that work is that the Guttman scale is regarded as the most restrictive IRT model and is overly idealised: it neither serves as the basis of more sophisticated (probabilistic) models, nor is it practical in the real world as a deterministic model. While the Guttman scale only models the ordering of persons and items, the (probabilistic) Rasch model permits interpretation of the differences between items and people (Scholten, 2011). The Rasch model is a generative model that represents student responses as noisy observations of latent student abilities in relation to item difficulties. The algorithm for Guttman scaling (He et al, 2016) does not naturally extend to incorporate Rasch modelling.

2.1 Item Response Theory (IRT)

The field of IRT studies statistical models for measurement in education and psychology. Such models specify the probability of a person’s response on an item as a mathematical function of the person’s and item’s latent attributes. A principal goal of IRT is to create a scale on which persons and items can be placed and compared meaningfully. IRT has been used for computerised adaptive testing (CAT), which aims to assess individuals’ trait levels accurately and efficiently, and underlies tests such as the Scholastic Aptitude Test (SAT) and the Graduate Record Examination (GRE). Chen et al (2005) proposed a personalised e-learning system based on IRT, considering course material difficulty and learner ability.

As a statistical model, IRT has recently attracted attention in machine learning. Bergner et al (2012) applied model-based collaborative filtering to estimate the parameters of IRT models, treating IRT as a type of collaborative filtering task in which user-item interactions are factorised into user and item parameters. Bachrach et al (2012) proposed a probabilistic graphical model that jointly models the difficulties of questions, the abilities of participants and the correct answers to questions in aptitude testing and crowdsourcing settings. In MOOCs, Champaign et al (2014) investigated the correlations between resource use and students’ skill and relative skill improvement as measured by IRT, and Colvin et al (2014) analysed pre-post test questions using IRT to compare learning in a MOOC and in a blended on-campus course. Past work has tended to focus on using already-devised items to measure student ability under IRT models, whereas we are interested in automatically discovering content-based items that are characteristic of measurement in MOOCs (Milligan, 2015).

2.2 MOOC Forums

MOOC forums have attracted great interest recently, due to the availability of rich textual data and social behaviour. Various studies have been conducted, such as sentiment analysis, community finding, question recommendation, and answer and intervention prediction.

Wen et al (2014) use sentiment analysis to monitor students’ trending opinions towards the course and to correlate sentiment with dropouts over time using survival analysis. Yang et al (2015) predict students’ confusion during learning activities as expressed in discussion forums, using discussion behaviour and clickstream data; they further explore the impact of confusion on student dropout. Ramesh et al (2015) predict sentiment in MOOC forums using hinge-loss Markov random fields. Gillani et al (2014) find communities using Bayesian Non-Negative Matrix Factorisation. Yang et al (2014) recommend questions of interest to students by designing a context-aware matrix factorisation model considering constraints on students and questions. MOOC forum data has also been leveraged in the task of predicting accepted answers to forum questions (Jenders et al, 2016) and predicting instructor intervention (Chaturvedi et al, 2014). Despite the variety of studies, little machine learning research has explored forum discussions for the purpose of measurement in MOOCs.

3 Preliminaries and Problem Formulation

We choose NMF as the basic approach to discover forum topics due to the interpretability of the topics produced and the extensibility of its optimisation formulation. As the IRT model for measurement, we focus on the Rasch model for dichotomous data due to its popularity, and because it is the basis for many extensions in education and psychology. We next overview the Rasch model for dichotomous data and NMF, and then define our problem.

3.1 Rasch Model

The Rasch model (Wright and Masters, 1982; Bond and Fox, 2001) for dichotomous data (correct/incorrect, agree/disagree responses) specifies the probability of a person’s positive response (correct, agree) on an item as a logistic function of the difference between the person’s ability and the item’s difficulty,

P(x_{ni} = 1) = exp(θ_n − δ_i) / (1 + exp(θ_n − δ_i)),

where latent θ_n denotes person n’s ability, latent δ_i denotes item i’s difficulty, x_{ni} denotes person n’s observed random response on item i, and P(x_{ni} = 1) is the probability of this response being positive. This probability is best illustrated with the Item Characteristic Curve (ICC), as depicted in Figure 2 and commonly used in the field of IRT. The higher a person’s ability is relative to the difficulty of an item, the higher the probability of a positive response on that item. When a person’s ability equals an item’s difficulty on the latent scale, positive responses are observed with probability 0.5.
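As a concrete illustration, the logistic response probability above can be computed directly (a minimal sketch; the function and variable names are ours):

```python
import math

def rasch_probability(ability, difficulty):
    """P(positive response) under the dichotomous Rasch model:
    exp(ability - difficulty) / (1 + exp(ability - difficulty))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Ability equal to difficulty gives probability exactly 0.5.
print(rasch_probability(1.0, 1.0))  # 0.5
# A more able person is more likely to respond positively to the same item.
print(rasch_probability(2.0, 1.0) > rasch_probability(0.0, 1.0))  # True
```

Note that only the difference between ability and difficulty matters, which is why the two parameters live on a single shared scale.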

Figure 2: The Item Characteristic Curves for three items (item 1 the easiest, item 3 the most difficult). A person with ability equal to item 2’s difficulty has 0.5 probability of responding positively on item 2, a higher probability on the easiest item 1, and a lower probability on the most difficult item 3.

The latent measurement scale is analogous to the ruler shown in Figure 1, where persons and items are placed together and can be compared meaningfully. The Rasch model provides a way to construct the ruler using persons’ responses on items. Persons and items are located along the scale according to their abilities and difficulties respectively.

The Rasch model can be viewed as a stochastic counterpart to the Guttman scale. For example, in Figure 1, person 1 and person 2 will both have a positive response on item 1 under a Guttman scale. Under a Rasch scale, there are instead certain probabilities that person 1 and person 2 respond positively on item 1, with the more able person 2’s probability being higher. This error model leads to a higher level of measurement scale: the interval scale, on which we can tell how much more able person 2 is than person 1. From the Guttman scale, by comparison, we can tell that person 2 is better than person 1 but not by how much.

             Item 1   Item 2   Item 3   Item 4   Item 5   Proportion   Ability
                                                          correct
Person 1        1        0        0        0        0       0.20        -1.39
Person 2        1        1        0        0        0       0.60         0.41
Person 3        0        1        1        0        0       0.60         0.41
Person 4        1        0        1        1        0       0.67         0.71
Person 5        1        1        1        0        1       0.80         1.39
Proportion
correct       0.80     0.33     0.33     0.20     0.20
Difficulty   -1.39     0.71     0.71     1.39     1.39
Table 1: An example of items for measuring basic mathematical ability, students’ responses, initial item difficulty estimates and student ability estimates.

Table 1 further illustrates our setup, with an example of items for measuring basic mathematical ability, alongside hypothetical students’ responses. The initial estimates (see Equations 6 and 7 below) for item difficulties and person abilities are produced on a logit scale. For example, if person 1 responds to the items positively 20% of the time and negatively 80% of the time, then the person’s initial ability estimate is approximately

θ_1 ≈ ln(0.20 / 0.80) ≈ -1.39,

obtained by taking the natural logarithm of the odds ratio for a positive response.
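These initial logit estimates are easy to reproduce in code; a small sketch, assuming no extreme (0% or 100%) proportions:

```python
import math

def initial_ability(p_correct):
    # theta = ln(p / (1 - p)): log-odds of a positive response
    return math.log(p_correct / (1.0 - p_correct))

def initial_difficulty(q_correct):
    # delta = ln((1 - q) / q): items answered correctly by fewer people are harder
    return math.log((1.0 - q_correct) / q_correct)

print(round(initial_ability(0.20), 2))     # -1.39, as for person 1 in Table 1
print(round(initial_difficulty(0.80), 2))  # -1.39, the easiest item
print(round(initial_difficulty(0.20), 2))  # 1.39, the hardest items
```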


3.1.1 Rasch Estimation

Given an observed response matrix (x_{ni}) (e.g., Table 1), a basic goal is to estimate the person and item parameters θ and δ. The most common estimation methods are based on maximum-likelihood estimation, including joint maximum-likelihood (JML) estimation, conditional maximum-likelihood (CML) estimation, and marginal maximum-likelihood (MML) estimation (Baker and Kim, 2004). In this paper, we focus on JML.

Under the assumption that a sample of persons is drawn independently at random from a population possessing a latent skill attribute, and under the local-independence assumption that a person’s responses to different items are statistically independent, the probability of an observed data matrix with I items and N persons is the product of the probabilities of the individual responses, and is given by the joint likelihood function

Λ(θ, δ) = ∏_{n=1}^{N} ∏_{i=1}^{I} exp(x_{ni}(θ_n − δ_i)) / (1 + exp(θ_n − δ_i)).
The log-likelihood function is then

ℓ(θ, δ) = Σ_{n=1}^{N} Σ_{i=1}^{I} [ x_{ni}(θ_n − δ_i) − ln(1 + exp(θ_n − δ_i)) ].
The parameters of the Rasch model can be estimated by joint maximum likelihood, i.e., by maximising this expression using Newton-Raphson (Bertsekas, 1999), which yields the following iterative solutions for θ_n and δ_i,

θ_n ← θ_n + (r_n − Σ_i P_{ni}) / Σ_i P_{ni}(1 − P_{ni}),

δ_i ← δ_i − (s_i − Σ_n P_{ni}) / Σ_n P_{ni}(1 − P_{ni}),

where P_{ni} = P(x_{ni} = 1), r_n = Σ_i x_{ni} is person n’s total score, and s_i = Σ_n x_{ni} is item i’s total score.
The convergence to a local optimum (with suitable step sizes) is guaranteed. The initial estimate of θ_n can be obtained by first calculating the proportion p_n = r_n / I of items that person n responded to correctly, and then taking the natural logarithm of the odds of person n’s correct response, as shown in Table 1, which can be formalised as follows:

θ_n^(0) = ln(p_n / (1 − p_n)),    (6)

where r_n denotes the number of items that person n responded to positively. Similarly, the initial estimate of δ_i can be obtained by

δ_i^(0) = ln((1 − q_i) / q_i),    (7)

where s_i denotes the number of persons who responded correctly on item i, and q_i = s_i / N denotes the proportion of persons who responded correctly on item i.

For those items receiving no correct responses (s_i = 0), or no incorrect responses (s_i = N), some implementations of the Rasch model delete the item, while other implementations handle the situation as follows (Baker and Kim, 2004), replacing the extreme item score before applying Eq. (7):

s_i ← ε  if s_i = 0,    s_i ← N − ε  if s_i = N,

where ε is a small number (e.g., 1.0 is used in our experiments). These pseudo counts are similar to frequentist Laplace corrections, or (weak) uniform Bayesian priors.
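Putting Section 3.1.1 together, the JML procedure (initialise via Eqs. (6) and (7), then iterate Newton-Raphson) can be sketched as follows. The damping factor `step` is our assumption, standing in for the "suitable step sizes" mentioned above, and `eps` implements the pseudo-count correction:

```python
import numpy as np

def rasch_jml(X, n_iter=100, eps=1.0, step=0.5):
    """Sketch of joint maximum-likelihood (JML) estimation for the Rasch model.

    X is a binary persons-by-items response matrix. `eps` is the pseudo count
    used to repair extreme (all-correct / all-incorrect) margins, and `step`
    damps the Newton-Raphson updates.
    """
    N, I = X.shape
    r = np.clip(X.sum(axis=1).astype(float), eps, I - eps)  # person scores
    s = np.clip(X.sum(axis=0).astype(float), eps, N - eps)  # item scores
    theta = np.log(r / (I - r))          # Eq. (6)-style initial abilities
    delta = np.log((N - s) / s)          # Eq. (7)-style initial difficulties
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))
        V = P * (1.0 - P)                # response variances
        theta += step * (r - P.sum(axis=1)) / V.sum(axis=1)
        delta -= step * (s - P.sum(axis=0)) / V.sum(axis=0)
        delta -= delta.mean()            # anchor the scale at mean difficulty 0
    return theta, delta
```

Centring the difficulties each iteration fixes the origin of the latent scale, which is otherwise only identified up to a constant shift.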

3.1.2 Evaluating Model Fit

A set of items is said to measure a latent attribute on an interval scale when there is a close fit between data and model. The model-data fit is typically examined using infit and outfit statistics, two types of mean-square error statistics conveying information about the error in the estimates for each individual item and person.

Outfit and infit test statistics are defined for each item and person, to test the fit of items and persons under the Rasch model by carefully summarising the Rasch residuals. The Rasch residuals are the differences between the observed responses and the expected responses according to the Rasch model. Formally, the expected response of person n on item i under the Rasch model is E_{ni} = P(x_{ni} = 1). The residual between the observation and the expected response is then x_{ni} − E_{ni}. Standardised residuals are often used to assess the fit of a single person-item response:

z_{ni} = (x_{ni} − E_{ni}) / √(W_{ni}),

where W_{ni} = E_{ni}(1 − E_{ni}) denotes the variance of x_{ni}.

The outfit of item i summarises the squared standardised residuals, averaged over persons,

Outfit_i = (1/N) Σ_{n=1}^{N} z_{ni}².

Typical treatments assume the standardised residuals z_{ni} approximately follow a unit normal distribution. Their sum of squares therefore approximately follows a χ² distribution. Dividing this sum by its degrees of freedom yields a mean-square value, with an expectation of 1.0 and taking values in the range 0 to infinity.

Outfit is sensitive to unexpected responses to items, e.g., lucky guesses (e.g., a person responds 111001) or careless sequences of mistakes (e.g., a person responds 010100) (Linacre, 2002). Since outfit is sensitive to very unexpected observations (outliers), infit was devised to be more sensitive to the overall pattern of responses (Linacre, 2006). Infit is an information-weighted form of outfit: it weights the observations by their statistical information (model variance), which is larger for targeted observations and smaller for extreme observations (Bond and Fox, 2001). In this paper, we focus on infit. Formally, the infit of item i is given by

Infit_i = Σ_{n=1}^{N} W_{ni} z_{ni}² / Σ_{n=1}^{N} W_{ni}.

Both outfit and infit have an expected value of 1.0. Values larger than 1.0 indicate underfitting, i.e., the data is less predictable than the model expects, while values less than 1.0 indicate overfitting, i.e., observations are highly predictable (Wright et al, 1994). Conventionally, the acceptable range is taken to be [0.7, 1.3] or [0.8, 1.2], depending on the application.
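Both fit statistics are simple functions of the standardised residuals; a minimal sketch (`V` here is the model variance P(1 − P), and all names are ours):

```python
import numpy as np

def item_fit(X, theta, delta):
    """Outfit (unweighted) and infit (information-weighted) mean squares per item."""
    P = 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))
    V = P * (1.0 - P)                  # variance of each response
    Z = (X - P) / np.sqrt(V)           # standardised residuals
    outfit = (Z ** 2).mean(axis=0)     # average squared residual over persons
    infit = (V * Z ** 2).sum(axis=0) / V.sum(axis=0)
    return outfit, infit
```

Values near 1.0 indicate good item fit; the conventional acceptance band quoted above is [0.7, 1.3].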

3.2 Non-Negative Matrix Factorisation (NMF)

Given a non-negative matrix X ∈ R^{M×N} and a positive integer K, NMF factorises X into the product of a non-negative matrix W ∈ R^{M×K} and a non-negative matrix H ∈ R^{K×N} such that

X ≈ WH.

A commonly-used measure for quantifying the quality of this approximation is the Frobenius norm between X and WH. Thus, NMF involves solving

min_{W ≥ 0, H ≥ 0} ‖X − WH‖_F².

This objective function is convex in W and H separately, but not together. Therefore standard convex solvers are not expected to find a global optimum in general. The multiplicative update algorithm (Lee and Seung, 2001) is commonly used to find a local optimum, where W and H are updated by a multiplicative factor that depends on the quality of the approximation.
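The multiplicative updates of Lee and Seung (2001) are only a few lines in practice; a sketch (the small constant in the denominators is our numerical-safety assumption):

```python
import numpy as np

def nmf_multiplicative(X, K, n_iter=500, seed=0):
    """Multiplicative updates for min ||X - WH||_F^2 subject to W, H >= 0."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, K)) + 1e-3
    H = rng.random((K, N)) + 1e-3
    for _ in range(n_iter):
        # Each factor is scaled by the ratio of the "attractive" and
        # "repulsive" parts of its gradient, preserving non-negativity.
        H *= (W.T @ X) / (W.T @ W @ H + 1e-12)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H
```

Because each update multiplies by a non-negative factor, W and H stay non-negative throughout, and the objective is non-increasing.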

Figure 3: Example matrices: word-student X, word-topic W, topic-student H.

In the present MOOC setting, we focus on the students who contributed posts or comments in forums. For each student, we aggregate all posts or comments that they contributed. Each student is represented by a bag of words, as shown in the example word-student matrix X in Figure 3, where M represents the number of words and N represents the number of students. Using NMF, the word-student matrix X can be factorised into two non-negative matrices: a word-topic matrix W and a topic-student matrix H. For each student, the corresponding column vector of X is approximated by a linear combination of the columns of W, weighted by the components of the corresponding column of H. Therefore, each column vector of W can be regarded as a topic, and the memberships of students in these topics are encoded by H, as shown in Figure 3.

3.3 Problem Statement

We seek to explore the feasibility of automatically discovering forum discussion topics for measuring students’ academic abilities in MOOCs, as quantified by the Rasch model. Our central tenet is that topics can be regarded as useful items for measuring a latent skill if student responses to these topics are well fit by the Rasch model, and if the topics are interpretable to domain experts as educationally relevant. Therefore, we need to discover topics from students’ posts and comments in MOOC forums in such a way that students’ participation across these topics fits the Rasch model. A student’s item response records whether the student posted on the corresponding topic. After discovery, topics must then be further assessed for interpretability by domain experts. Our goal is decision support.

In particular, under the NMF framework, a word-student matrix X can be factorised into two non-negative matrices: a word-topic matrix W and a topic-student matrix H. Our application requires that the topic-student matrix H be a) binary, ensuring that the response of a student to a topic is dichotomous; b) useful for measuring students’ academic abilities; and c) well fit by the Rasch model. NMF provides an elegant framework for incorporating these constraints by adding novel regularisation terms, as detailed in the next section. A glossary of the symbols most used in this paper is given in Table 2.

Symbol      Description
M           the number of words
N           the number of students
K           the number of topics
X           word-student matrix
W           word-topic matrix
H           topic-student matrix
G           matrix of the ideal numbers of distinct topics posted by students
E           all-ones matrix (of context-appropriate size)
g_n         student n’s grade
δ           item difficulty vector
θ           student ability vector
h_{kn}      binary response (0 or 1) of person n to item (topic) k
x_{ni}      observed response of person n to item i
P_{ni}      the probability of a positive response of person n to item i
W_{ni}      variance of x_{ni}
z_{ni}      standardised residual
λ_1,…,λ_4   regularisation coefficients
Table 2: Glossary of symbols

4 The TopicResponse Algorithm: Joint NMF-Rasch Estimation

To favour topics that fit the Rasch model, we jointly optimise both the NMF and Rasch models, which yields the objective function

min_{W ≥ 0, H ≥ 0} ‖X − WH‖_F² − λ_1 ℓ(θ, δ; H),

where ℓ is the log-likelihood function maximised in Rasch estimation (computed on the topic responses encoded by H), and λ_1 is a user-specified parameter controlling the trade-off between the quality of factorisation and Rasch estimation.

Weak supervision of item responses.

The fit between student topic responses and the Rasch model will provide statistical evidence of measuring skill attainment. However, it is difficult to conclude what the topics are measuring without domain knowledge. To favour topics that can be used to measure students’ academic abilities, we impose a constraint on H based on student grades, which provide an indicator of students’ abilities (we discuss sources of auxiliary grade information below). In particular, we assume the following relationship between the ideal number of distinct topics that each student contributes and their grade g_n (normalised to [0, 1]):

G_{1n} = (K − 1) g_n + 1,

where G is a 1 × N matrix denoting the ideal number of distinct topics posted by students. For example, under K items, a student scoring g_n should post on (K − 1) g_n + 1 topics. The minimum and maximum numbers of different topics that a student posts are thus 1 and K respectively. This is motivated by the initialisation of θ and δ as illustrated in Section 3.1.1, where positive responses on 0 or all K topics are undesirable.

This supervision constraint is markedly weaker than the similar constraint found in (He et al, 2016), as demonstrated in Figure 4. He et al (2016) leverage the student grade to exactly determine the item responses for the Guttman scale. The Guttman scale, as a deterministic model, requires that if a student gets a difficult item correct, they also respond correctly to all easier items. This assumption is very restrictive, and rarely makes sense in practice. The Rasch model allows errors in the responses, and our constraint only restricts the number of distinct topics posted by a student, rather than the exact response pattern.

Most (MOOC) courses conduct multiple forms of assessment throughout the duration of teaching, for example weekly quizzes, take-home assignments, mid-term tests, projects, and presentations. In the large-scale MOOC context, such evaluations may be peer-assessed. Students often enter courses with some cumulative grade-point average that may be (loosely) predictive of future performance. Any of these readily-available sources of student information could reasonably be used to seed G. Even final course grades could be used, particularly when the ultimate application of TopicResponse is not measuring students, but refining curriculum.
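As an illustration of seeding G from such grade information, here is a hypothetical mapping from a grade normalised to [0, 1] to an ideal number of distinct topics in {1, …, K}. The linear form is our assumption; the text only fixes the minimum at 1 and the maximum at K:

```python
import numpy as np

def grade_guided_targets(grades, K):
    """Hypothetical seeding of G: linearly map grades in [0, 1]
    to an ideal number of distinct topics between 1 and K."""
    g = np.asarray(grades, dtype=float)
    return np.rint(1.0 + (K - 1) * g)   # round to a whole number of topics

print(grade_guided_targets([0.0, 0.5, 1.0], 7))  # [1. 4. 7.]
```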

Figure 4: An example of the supervision target G in the Guttman scale and in the Rasch model.

In order to encourage satisfaction of the soft constraint on topic responses, we introduce a regularisation term on H, namely ‖EH − G‖_F², where E is the 1 × K all-ones matrix (so EH counts each student’s distinct topic participation when H is binary).

Quantising & Regularising the Response Matrix.

We introduce the regularisation term ‖H‖_F², commonly used to prevent overfitting in NMF. To encourage binary solutions, we impose an additional regularisation term ‖H ∘ H − H‖_F², where the operator ∘ denotes the Hadamard (element-wise) product. Binary matrix factorisation (BMF) is a variation of NMF in which the input matrix and the two factorised matrices are all binary. Our approach is inspired by those of Zhang et al (2007) and Zhang et al (2010). Our added term equals Σ_{k,n} (h_{kn}² − h_{kn})², which is minimised by (and only by) binary H.
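A quick numerical check of the binary-encouraging term described above: it vanishes exactly on 0/1 matrices and is strictly positive otherwise (a small sketch):

```python
import numpy as np

def binary_penalty(H):
    # ||H o H - H||_F^2: each entry contributes (h^2 - h)^2,
    # which is zero iff h is exactly 0 or 1
    return float(np.sum((H * H - H) ** 2))

print(binary_penalty(np.array([[0.0, 1.0], [1.0, 0.0]])))      # 0.0
print(binary_penalty(np.array([[0.5, 1.0], [1.0, 0.0]])) > 0)  # True
```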

TopicResponse Model.

We have the following regularisations:

  • ‖EH − G‖_F² to encourage a grade-guided H;

  • ‖H‖_F² to prevent overfitting; and

  • ‖H ∘ H − H‖_F² to encourage a binary item-response solution.

These terms together with joint NMF-Rasch estimation yield the final objective

J(W, H, θ, δ) = ‖X − WH‖_F² − λ_1 ℓ(θ, δ; H) + λ_2 ‖EH − G‖_F² + λ_3 ‖H‖_F² + λ_4 ‖H ∘ H − H‖_F²,

where λ_1, …, λ_4 are user-specified regularisation parameters, with primal program

min_{W, H, θ, δ} J(W, H, θ, δ)   s.t.   W ≥ 0, H ≥ 0.    (13)

TopicResponse Fitting Procedure.

A local optimum of program (13) is achieved via the iteration

W ← W ∘ (XHᵀ) / (WHHᵀ),    (14)

H ← H ∘ [∇_H J]⁻ / [∇_H J]⁺,    (15)

θ_n ← θ_n + (r_n − Σ_i P_{ni}) / Σ_i P_{ni}(1 − P_{ni}),    (16)

δ_i ← δ_i − (s_i − Σ_n P_{ni}) / Σ_n P_{ni}(1 − P_{ni}),    (17)

where multiplication ∘ and division are element-wise, and [A]⁺ and [A]⁻ denote the positive part and negative part of matrix A respectively (so that A = [A]⁺ − [A]⁻). We next describe how these update rules are derived.

The update rules (16) and (17) can be obtained using Newton’s method. The update rules (14) and (15) can be derived via the Karush-Kuhn-Tucker (KKT) conditions necessary for local optimality. First we construct the unconstrained Lagrangian

L = J(W, H, θ, δ) − tr(ΦWᵀ) − tr(ΨHᵀ),

where φ_{mk} ≥ 0 and ψ_{kn} ≥ 0 are the Lagrangian dual variables for the inequality constraints w_{mk} ≥ 0 and h_{kn} ≥ 0 respectively, and Φ, Ψ denote their corresponding matrices. The KKT condition of stationarity requires that the derivatives of L with respect to W and H vanish at a local optimum (W, H):

∇_W J − Φ = 0,    ∇_H J − Ψ = 0.

Complementary slackness, Φ ∘ W = 0 and Ψ ∘ H = 0, then implies:

(∇_W J) ∘ W = 0,    (∇_H J) ∘ H = 0.

These two equations lead to the update rules (14) and (15). Regarding the update rules (14), (15), (16) and (17) we have the following theorem:

Theorem 1

The objective function of TopicResponse program (13) is non-increasing under update rules (14), (15), (16) and (17).

This result guarantees that the updates of W, H, θ and δ eventually converge, and that the obtained solution will be a local optimum. The proof of Theorem 1 is given in the Appendix.

0:  Input: X, G, K, λ_1, λ_2, λ_3, λ_4;
0:  Output: a topic-student matrix H, item difficulties δ, person abilities θ;
1:  Initialise W, H using NMF;
2:  Normalise W, H following (Zhang et al, 2007, 2010);
3:  Initialise θ, δ based on Eq. (6) and Eq. (7);
4:  repeat
5:     Update W, H, θ, δ iteratively based on Eq. (14) to Eq. (17);
6:  until converged
7:  return H, δ, θ;
Algorithm 1 TopicResponse

Our overall approach, TopicResponse, is described as Algorithm 1. W and H are initialised using plain NMF (Lee and Seung, 1999, 2001), then normalised (Zhang et al, 2007, 2010). θ and δ are initialised based on Eq. (6) and Eq. (7), with the responses taken from the (binarised) topic-student matrix H. At optimisation completion, estimates for topics, item difficulties and person abilities are obtained together. Code for TopicResponse is available from the authors’ websites.

5 Experiments

We report on extensive experiments evaluating the effectiveness of TopicResponse on real MOOCs. In our experiments, we use the first offerings of three Coursera MOOCs from education, economics and computer science offered by The University of Melbourne: Assessment and Teaching of 21st Century Skills delivered in 2014, Principles of Macroeconomics delivered in 2013, and Discrete Optimisation delivered in 2013. We denote these three courses by EDU, ECON and OPT respectively.

5.1 Dataset Preparation

We focus on the students who contributed posts or comments in forums. For each student, we aggregate all the posts and comments that they contributed. After stemming and removing stop words, a word-student matrix with tf-idf values normalised to [0,1] is produced. The statistics of words and students before and after preprocessing, the dominant words, and the sparsity of the word-student matrix (the percentage of non-zero values) for the three MOOCs are displayed in Table 3.
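The preprocessing step can be illustrated with a toy, pure-Python tf-idf sketch (this is not the authors' pipeline: stemming and the [0,1] normalisation are omitted, the stop-word list is a placeholder, and the idf smoothing is one common choice):

```python
import math
from collections import Counter

def word_student_matrix(docs, stop_words=frozenset({"the", "a", "to"})):
    """Toy word-student tf-idf matrix: one aggregated document per student."""
    tokens = [[w for w in d.lower().split() if w not in stop_words] for d in docs]
    vocab = sorted({w for t in tokens for w in t})
    df = Counter(w for t in tokens for w in set(t))   # document frequency
    n = len(docs)
    matrix = []
    for w in vocab:
        idf = math.log((1 + n) / (1 + df[w]))         # smoothed idf
        matrix.append([t.count(w) / max(len(t), 1) * idf for t in tokens])
    return vocab, matrix  # rows are words, columns are students
```

Words appearing in every student's aggregated posts receive zero idf weight, which is one reason the resulting matrices are so sparse.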

5.2 Baseline and Evaluation Metrics

We compare our algorithm TopicResponse with the baseline algorithm Grade-Guided NMF (GG-NMF), which minimises the following objective function:

min_{W ≥ 0, H ≥ 0} ‖X − WH‖_F² + λ_2 ‖EH − G‖_F² + λ_3 ‖H‖_F² + λ_4 ‖H ∘ H − H‖_F²,

i.e., the TopicResponse objective without the Rasch log-likelihood term.

MOOC #Students #Words #Words after preprocessing Dominant words Word-student matrix sparsity
EDU 1,749 28,931 18,391 student, learn, skill, work, teacher, use, assess, teach, problem, collabor 0.59%
ECON 1,551 26,370 21,412 gdp, would, econom, think, product, good, one, economi, increas, invest 0.50%
OPT 1,092 19,284 16,128 use, solut, get, time, one, tri, python, work, optim, would 0.85%
Table 3: Statistics of our three Coursera MOOC datasets.

A local optimum can be obtained using the Karush-Kuhn-Tucker conditions. Like TopicResponse, GG-NMF regularises H by considering the students’ grades as an indicator of academic ability. The difference is that TopicResponse optimises the Rasch estimation and NMF simultaneously, while in GG-NMF the students’ topic responses are obtained first and then passed through the Rasch model. We evaluate the two algorithms in terms of the following metrics.

Quality of factorisation. We measure ‖X − WH‖_F² so as to record how well the factorisation approximates the word-student matrix.

Measuring student academic ability. Quality of the grade-based constraint on students’ topic participation: ‖EH − G‖_F².

Negative log-likelihood. The log-likelihood measures the fit of the Rasch model to the entire dataset. For convenience, we report the negative log-likelihood, which should be minimised: smaller is better. This measure is our main focus for Rasch fit, as it is important to examine model-level fit before looking at item-level fit.
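The model-level fit reported here is just the negated Rasch log-likelihood; a sketch of how it can be computed stably:

```python
import numpy as np

def rasch_nll(X, theta, delta):
    """Negative Rasch log-likelihood of a binary response matrix;
    smaller means better model-level fit."""
    Z = theta[:, None] - delta[None, :]
    # per-response log-likelihood: x*z - log(1 + exp(z)); logaddexp avoids overflow
    return float(-(X * Z - np.logaddexp(0.0, Z)).sum())
```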

Item infit. As illustrated in Section 3.1.2, item infit examines the fit of a particular item, with non-fitting items suitable for further refinement. We use the conventional acceptance range of [0.7, 1.3].

Param. Values Explored (Default)
Table 4: Hyperparameter settings.

Figure 5: Negative log-likelihood as goodness of fit; smaller is better.

5.3 Hyperparameter Settings

Table 4 presents the parameter values used for our parameter sensitivity experiments, where the default values shown in boldface are used in experiments unless noted otherwise.

5.4 Main Results for GG-NMF and TopicResponse

In the first group of experiments, we examine the performance of GG-NMF (baseline) and TopicResponse in terms of negative log-likelihood, the quality of the factorisation approximation, and the grade-guided soft constraint. For GG-NMF, the factorisation and the Rasch estimation are separated: the topic-student response matrix is first obtained using GG-NMF, and then taken as input to Rasch estimation. For TopicResponse, the negative log-likelihood is optimised jointly with the factorisation. The parameters are set to the boldface default values in Table 4. Figure 5 displays the negative log-likelihood of GG-NMF and TopicResponse.

It can be seen that TopicResponse yields a superior negative log-likelihood, implying a better fit between the topic-student response matrix and the Rasch model. TopicResponse therefore provides greater confidence that item-level fit statistics, such as infit, will be acceptable. Jointly optimising the matrix factorisation and the Rasch estimation brings us closer to a global optimum.
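As a sketch of this distinction, the joint objective minimised by TopicResponse can be thought of as a weighted combination of the NMF reconstruction error, the Rasch negative log-likelihood on topic responses, and the grade-guided penalty. The weights `lam` and `mu`, the thresholded binarisation of topic responses, and the squared-error grade term below are our assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def joint_objective(X, W, H, ability, difficulty, grades, lam=0.1, mu=1.0):
    """Illustrative TopicResponse-style objective: NMF reconstruction
    error + lam * Rasch NLL on binarised topic responses + mu * a
    grade-guided penalty tying total topic participation to grades.
    The exact terms and weights in the paper may differ."""
    recon = np.linalg.norm(X - W @ H, "fro") ** 2
    # Binarise topic participation into student-by-topic responses (assumption)
    R = (H.T > H.mean()).astype(float)
    p = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
    nll = -np.sum(R * np.log(p) + (1 - R) * np.log(1 - p))
    # Tie each student's overall participation to their grade (assumption)
    grade_term = np.sum((H.sum(axis=0) - grades) ** 2)
    return recon + lam * nll + mu * grade_term
```

GG-NMF, by contrast, would minimise the reconstruction and grade terms first, and only afterwards fit the Rasch parameters to the resulting topic responses.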

Figure 6: Performance of GG-NMF and TopicResponse in terms of (a) quality of factorisation and (b) quality of the grade-guided constraint; smaller is better.

We present the results on quality of approximation and on the grade-guided supervision term in Figure 6. From these plots, we see that TopicResponse obtains a superior constraint value without sacrificing approximation performance (while also obtaining the excellent negative log-likelihoods above). This again demonstrates that optimising the factorisation and the Rasch estimation globally can be superior to optimising them separately. We therefore conclude that TopicResponse is preferable to GG-NMF, and focus on results for TopicResponse in the remainder of our experiments.

5.5 Item Infit, Item Difficulty and Student Ability

We further examine the infit of each item, which indicates whether the set of topics conforms to the Rasch model and is appropriate for measurement. As discussed in Section 3.1.2, the conventional acceptable range of infit is 0.7 to 1.3. As an example, Figure 7 shows the item infit for the OPT MOOC. The infit of each item lies in the acceptable range, with most very close to the ideal expected value of 1.0, indicating that the set of topics conforms to the Rasch model and is appropriate for measuring student ability.

Figure 7: Item infit histogram for OPT MOOC; infit closer to 1 is better.
Figure 8: Histograms of OPT MOOC student ability location (top) and item difficulty location (bottom).

Additionally, we examine item difficulties and student abilities. Figure 8 displays histograms of item difficulty and student ability along a common scale. According to the Rasch model, the higher a student's ability relative to the difficulty of a topic, the higher the probability that the student posts on that topic. Most students with low ability (around -2 logits) engage only with the "easiest" topic (topic 1, with difficulty -2.3 logits), which concerns general problem solving; in other words, these students are likely to post only on topic 1 and unlikely to post on any other topic. By comparison, the most able students, with abilities around 2 logits, contribute to all topics with high probability.
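The Rasch response probability underlying this reading of Figure 8 is a simple logistic function of the ability-difficulty gap; the snippet below plugs in the easiest and hardest topic difficulties reported in Table 5.

```python
import math

def rasch_prob(ability, difficulty):
    """Rasch model: P(student posts on topic) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A low-ability student (-2 logits) against the easiest and hardest
# topic difficulties from Table 5 (OPT MOOC):
p_easy = rasch_prob(-2.0, -2.30)  # topic 1: ~0.57, likely to post
p_hard = rasch_prob(-2.0, 1.73)   # topic 10: ~0.02, very unlikely to post
```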

5.6 Topic Interpretation and Discussion

We qualitatively examine topic interpretation, in order to assess educational meaningfulness. Well-scaled topics can potentially be used for curriculum refinement. Table 5 presents the topics generated using TopicResponse, alongside inferred difficulties. Topics were interpreted by an instructor who teaches a similar course. As not all topics are course-content-related, we envision that instructors would examine the discovered topics before using them for curriculum refinement or other actions. Additionally, the inferred student abilities and topic difficulties could potentially be used for personalised feedback, tailoring course-content or forum-discussion topics to each student's individual ability level. For example, as shown in Figure 8, most students (those of lowest ability) only discuss problem solving in general. If they cannot obtain sufficient help from forum discussions, they may be prone to drop out without further topic exploration. Therefore, when intervening with at-risk students, it is advisable to leverage the discovered topics to better target interventions. Such services may help prevent dropout in the early stages of a course, when most dropouts typically occur.
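One way such personalised feedback might be operationalised (our illustrative heuristic, not the paper's system) is to surface topics whose difficulty sits just above a student's estimated ability, so that recommendations are challenging but reachable; the window width is an assumed parameter.

```python
def recommend_topics(ability, topic_difficulties, window=1.0):
    """Return indices of topics whose difficulty lies within `window`
    logits above the student's ability: challenging but reachable.
    Illustrative heuristic only; the threshold is an assumption."""
    return [i for i, d in enumerate(topic_difficulties)
            if ability < d <= ability + window]

# Inferred difficulties from Table 5 (OPT MOOC), topics 1-10:
difficulties = [-2.30, -0.93, -0.63, -0.44, 0.23, 0.31, 0.33, 0.52, 1.17, 1.73]
mid = recommend_topics(0.0, difficulties)  # -> topics 5-8 (indices 4-7)
```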

No. Topics Interpretation Inferred difficulty
1 use time problem get solut one optim algorithm tri work Solving in general -2.30
2 cours thank would lectur realli great assign good like think Course feedback -0.93
3 python use run program solver java matlab instal command work Python/Java/Matlab (How to start) -0.63
4 problem thank solut get grade knapsack got feedback optim solv How the knapsack problem is solved and graded -0.44
5 memori dp use column bb implement solv algorithm bound tabl Comparing algorithms memory/time 0.23
6 color node graph random edg greedi opt search swap iter Graph coloring 0.31
7 item valu weight capac estim take solut calcul best list Knapsack problem 0.33
8 file pi line solver data submit lib urllib2 solveit open Using solvers 0.52
9 video http class load lecture org problem coursera optimization 001 Platform 1.17
10 submit assign assignment error messag view assignment_id detail class coursera Assignment submission 1.73
Table 5: Topics and difficulty levels, by TopicResponse on OPT MOOC.

5.7 Parameter Sensitivity

To validate the robustness of TopicResponse to parameter settings, we conducted a series of sensitivity experiments using the parameter settings in Table 4, examining negative log-likelihood, factorisation quality, and the grade-guided constraint. Due to space limitations, we report here results for the Rasch regularisation parameter on the OPT MOOC; the reader is referred to Appendix B for results on the remaining regularisation parameters and the number of topics on all three MOOCs.

Effect of the Rasch Regularisation Parameter.

As can be seen in Figure 9, as the Rasch regularisation weight is increased, TopicResponse performs better in terms of negative log-likelihood and worse on the other three metrics, due to the regularisation on the Rasch model. By contrast, the performance of GG-NMF does not change, as there is no regularisation term in its Rasch estimation. Overall, TopicResponse performs well when this weight varies between 0.1 and 0.2.

Figure 9: Performance of GG-NMF and TopicResponse on OPT with varying Rasch regularisation weight; panels show (a) negative log-likelihood and (b)-(d) the three other metrics on OPT.

6 Conclusion and Future Work

We have examined the suitability of content-based items (topics) discovered from MOOC forum discussions for modelling student abilities. Our central tenet is that topics can be regarded as useful items for measuring latent skills if student responses to these topics fit the Rasch item-response theory model, and if the discovered topics are furthermore interpretable to domain experts. We propose to jointly optimise NMF and Rasch modelling in order to discover Rasch-scaled topics. We provide quantitative validation on three Coursera MOOCs, demonstrating that TopicResponse yields a better global fit to the Rasch model (lower negative log-likelihood) and maintains good factorisation quality, while measuring students' academic abilities (as reflected by the grade-guided constraint on students' topic participation). We also provide a qualitative examination of topic interpretation, with inferred difficulty levels, on a Discrete Optimisation MOOC. The goodness-of-fit results and our qualitative examination together suggest potential applications in curriculum refinement, student assessment and personalised feedback.

We opted to study the relatively simple Rasch model, as it forms the basis of many subsequent models in the literature. One direction for extension is that any model (like Rasch) that fits parameters via maximum-likelihood estimation (or risk minimisation in general) can be augmented with NMF as additional regularisation. For example, such an extension should be straightforward for polychotomous observations, hierarchical models of latent skills, models that include more flexible per-student variation, and so on. These represent fruitful directions for future research. Another possible extension could involve augmenting the matrices in the NMF or Rasch objective terms with manually crafted items, to make effective use of prior knowledge.

We thank Jeffrey Chan for discussions related to this work, and the anonymous reviewers and editor for their thoughtful feedback. This work is supported by Data61, and the Australian Research Council (DE160100584).


  • Bachrach et al (2012) Bachrach Y, Graepel T, Minka T, Guiver J (2012) How to grade a test without knowing the answers—a Bayesian graphical model for adaptive crowdsourcing and aptitude testing. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp 1183–1190
  • Baker and Kim (2004) Baker FB, Kim SH (2004) Item response theory: Parameter estimation techniques. CRC Press
  • Bergner et al (2012) Bergner Y, Droschler S, Kortemeyer G, Rayyan S, Seaton D, Pritchard DE (2012) Model-based collaborative filtering analysis of student response data: Machine-learning item response theory. International Educational Data Mining Society
  • Bertsekas (1999) Bertsekas DP (1999) Nonlinear programming. Athena Scientific
  • Bond and Fox (2001) Bond TG, Fox CM (2001) Applying the Rasch model: Fundamental measurement in the human sciences. Lawrence Erlbaum Associates Publishers
  • Champaign et al (2014) Champaign J, Colvin KF, Liu A, Fredericks C, Seaton D, Pritchard DE (2014) Correlating skill and improvement in 2 MOOCs with a student’s time on tasks. In: Proceedings of the First ACM Conference on Learning@Scale Conference, ACM, pp 11–20
  • Chaturvedi et al (2014) Chaturvedi S, Goldwasser D, Daumé III H (2014) Predicting instructor’s intervention in MOOC forums. In: ACL (1), pp 1501–1511
  • Chen et al (2005) Chen CM, Lee HM, Chen YH (2005) Personalized e-learning system using item response theory. Computers & Education 44(3):237–255
  • Colvin et al (2014) Colvin KF, Champaign J, Liu A, Fredericks C, Pritchard DE (2014) Comparing learning in a MOOC and a blended on-campus course. In: Educational Data Mining 2014
  • Gillani et al (2014) Gillani N, Eynon R, Osborne M, Hjorth I, Roberts S (2014) Communication communities in MOOCs. arXiv preprint arXiv:1403.4640
  • Guttman (1950) Guttman L (1950) The basis for scalogram analysis. In: Stouffer S (ed) Measurement and Prediction: The American Soldier, Wiley, New York
  • He et al (2016) He J, Rubinstein BI, Bailey J, Zhang R, Milligan S, Chan J (2016) MOOCs meet measurement theory: A topic-modelling approach. In: Thirtieth AAAI Conference on Artificial Intelligence
  • Jenders et al (2016) Jenders M, Krestel R, Naumann F (2016) Which answer is best? Predicting accepted answers in MOOC forums. WWW’2016 Companion
  • Lee and Seung (1999) Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791
  • Lee and Seung (2001) Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems, pp 556–562
  • Linacre (2002) Linacre JM (2002) What do infit and outfit, mean-square and standardized mean. Rasch Measurement Transactions 16(2):878
  • Linacre (2006) Linacre JM (2006) Misfit diagnosis: Infit outfit mean-square standardized. Retrieved June 1, 2006
  • Milligan (2015) Milligan S (2015) Crowd-sourced learning in MOOCs: Learning analytics meets measurement theory. In: Proceedings of the Fifth International Conference on Learning Analytics And Knowledge, ACM, pp 151–155
  • Ramesh et al (2015) Ramesh A, Kumar SH, Foulds J, Getoor L (2015) Weakly supervised models of aspect-sentiment for online course discussion forums. In: Annual Meeting of the Association for Computational Linguistics (ACL)
  • Rasch (1993) Rasch G (1993) Probabilistic models for some intelligence and attainment tests. ERIC
  • Scholten (2011) Scholten AZ (2011) Admissible statistics from a latent variable perspective. Theory & Psychology 18:111–117
  • Wen et al (2014) Wen M, Yang D, Rose C (2014) Sentiment analysis in MOOC discussion forums: What does it tell us? In: Educational Data Mining 2014
  • Wright and Masters (1982) Wright BD, Masters GN (1982) Rating Scale Analysis. Rasch Measurement. ERIC
  • Wright et al (1994) Wright BD, Linacre JM, Gustafson J, Martin-Lof P (1994) Reasonable mean-square fit values. Rasch measurement transactions 8(3):370
  • Yang et al (2014) Yang D, Adamson D, Rosé CP (2014) Question recommendation with constraints for massive open online courses. In: Proceedings of the 8th ACM Conference on Recommender systems, ACM, pp 49–56
  • Yang et al (2015) Yang D, Wen M, Howley I, Kraut R, Rose C (2015) Exploring the effect of confusion in discussion forums of massive open online courses. In: Proceedings of the Second (2015) ACM Conference on Learning@ Scale, ACM, pp 121–130
  • Zhang et al (2007) Zhang Z, Ding C, Li T, Zhang X (2007) Binary matrix factorization with applications. In: Seventh IEEE International Conference on Data Mining (ICDM 2007), IEEE, pp 391–400
  • Zhang et al (2010) Zhang ZY, Li T, Ding C, Ren XW, Zhang XS (2010) Binary matrix factorization for analyzing gene expression data. Data Mining and Knowledge Discovery 20(1):28–52

Appendix A Proof of Theorem 1

The update rules are derived using the Newton-Raphson method (Bertsekas, 1999), where convergence to a local optimum is guaranteed. Here, we focus on the proof of one update rule; the other can be proved similarly. We closely follow the procedure described by Lee and Seung (2001), in which an auxiliary function, similar to that used in the Expectation-Maximization (EM) algorithm, is used for the proof.

Definition 2 (Lee and Seung 2001)

G(h, h') is an auxiliary function for F(h) if the conditions

G(h, h') ≥ F(h),    G(h, h) = F(h)

are satisfied.

Lemma 1 (Lee and Seung 2001)

If G is an auxiliary function for F, then F is non-increasing under the update

h^(t+1) = argmin_h G(h, h^(t)).

The result follows from noting F(h^(t+1)) ≤ G(h^(t+1), h^(t)) ≤ G(h^(t), h^(t)) = F(h^(t)). ∎

For any element u of the factor matrix, let F_u denote the part of the objective in Eq. (12) relevant to u. Since the update is essentially element-wise, it is sufficient to show that each F_u is non-increasing under the update rule of Eq. (15). To prove this, we define an auxiliary function for F_u as follows.

Lemma 2

The function G(u, u') defined in Eq. (19) is an auxiliary function for F_u(u).


It is obvious that G(u, u) = F_u(u), so we need only prove that G(u, u') ≥ F_u(u). Comparing G(u, u') with the second-order Taylor series expansion of F_u(u) about u', this inequality is equivalent to a bound on the coefficient of the quadratic term, which follows from the non-negativity of the factor matrices. Thus G(u, u') ≥ F_u(u), as claimed. ∎

Replacing G in Eq. (18) by the auxiliary function of Eq. (19), and setting its gradient to zero, results in the update rule in Eq. (15). Since Eq. (19) is an auxiliary function, the objective is non-increasing under this update rule.
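The monotone descent guaranteed by the auxiliary-function argument can be observed empirically with the plain Lee and Seung (2001) multiplicative updates; this sketch covers only the unregularised Frobenius NMF case, not the full TopicResponse update with Rasch and grade terms.

```python
import numpy as np

def nmf_multiplicative(X, k, iters=100, seed=0, eps=1e-9):
    """Lee & Seung (2001) multiplicative updates for ||X - WH||_F^2.
    Each update minimises an auxiliary function, so the objective is
    non-increasing, mirroring the proof structure above."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    obj = []
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # auxiliary-function update for H
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # auxiliary-function update for W
        obj.append(np.linalg.norm(X - W @ H, "fro") ** 2)
    return W, H, obj
```

Running this on any non-negative matrix yields a non-increasing objective trace, the empirical counterpart of Lemma 1.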

Appendix B Experimental Results of Parameter Sensitivity on the Regularisation Parameters and the Number of Topics

a) Effect of the first regularisation parameter: As can be seen from Figure 11, GG-NMF and TopicResponse are not sensitive to this parameter, performing stably as it varies. TopicResponse consistently performs better in terms of negative log-likelihood while maintaining comparable performance on the other three metrics.

b) Effect of the second regularisation parameter: Figure 12 shows that GG-NMF and TopicResponse perform well in terms of factorisation quality and the grade-guided constraint across a wide range of values. The factorisation error worsens as the parameter increases, but changes little compared to the other metrics. As the parameter increases, the negative log-likelihood of both GG-NMF and TopicResponse degrades, with TopicResponse consistently better than GG-NMF. Overall, values around 1.0 work well for both methods.

c) Effect of the third regularisation parameter: GG-NMF and TopicResponse again perform well in terms of factorisation quality and the grade-guided constraint across a wide range of values. As with the previous parameter, the factorisation error is not affected significantly. TopicResponse consistently achieves better negative log-likelihood than GG-NMF. Overall, values around 1.0 work well for both methods.

d) Effect of the number of topics: Figure 14 shows that TopicResponse consistently outperforms GG-NMF in terms of negative log-likelihood, with slightly worse performance on the other three metrics. This is reasonable: TopicResponse has more constraints, and hence is less likely to perform as well as the less constrained GG-NMF on those metrics. Overall, both GG-NMF and TopicResponse perform well when the number of topics is set to 10 or 15. We choose 10, since a smaller number of topics is easier to analyse.

Figure 10: Performance of GG-NMF and TopicResponse as the parameter varies; panels show negative log-likelihood and the three other metrics on EDU, ECON and OPT.
Figure 11: Performance of GG-NMF and TopicResponse as the parameter varies; panels show negative log-likelihood and the three other metrics on EDU, ECON and OPT.
(a) Negative log-likelihood on EDU
(b) Negative log-likelihood on ECON
(c) Negative log-likelihood on OPT
(d) Factorisation quality on EDU
(e) Factorisation quality on ECON