1 Introduction
Massive Open Online Courses (MOOCs) have attracted wide attention due to the promise of delivering education at scale. This new learning environment produces a variety of data (e.g., demographic data, student engagement, and forum activities), which offers new opportunities for understanding student learning. While quizzes and assignments have dominated summative assessment, the many sources of rich student engagement data generated in MOOC platforms present new views on student learning and avenues for formative feedback. This paper explores whether students’ participation across automatically discovered MOOC forum topics is suitable for modelling academic ability.
Our work is inspired by the importance of forum discussions as an active learning activity, and by recent research on quantitative measurement of student learning in the education community. In particular: 1) MOOC discussion forums, as the main platform for student-instructor and student-student interactions, are important for gaining insights into student learning. 2) Recent research in education (Milligan, 2015) suggests that a distinctive and complex learning skill is required to promote learning in MOOCs. Educators are interested in whether and how the possession of this complex learning skill may be evidenced by latent complex patterns of engagement, instead of by traditional assessment tools such as quizzes and assignments. 3) In order to validate such a hypothesis, measurement theory can be used (Rasch, 1993; Wright and Masters, 1982). A set of items is handcrafted from forum activities (e.g., “contributed a post attracting votes from others” and “made repeated thread visits in more than half the weeks”), and calibrated (e.g., deleted or changed) to fit a measurement model, providing evidence as to whether the set of items is appropriate for measuring the complex learning skill (Milligan, 2015). This process is human-intensive and time-consuming, as reflected in Figure 1.
Driven by these observations, we investigate whether students’ participation in automatically discovered forum topics can be used as an instrument to model students’ ability. If students’ participation across the discovered topics fits a measurement model (in this paper, we use the Rasch model) in terms of statistical effectiveness, and the topics are interpretable to subject-matter experts by way of qualitative effectiveness, then the discovered topics can be regarded as useful items for measurement. The resulting scaled topics, endowed with estimated difficulty levels, can assist in subsequent curriculum refinement, student assessment, and personalised feedback.
The technical challenge, then, is to automatically discover topics such that students’ participation across them fits the Rasch model. He et al (2016) adapted topic modelling of students’ online forum postings such that students’ participation across the resulting topics conforms to the Guttman scale. However, the Guttman scale is widely regarded as overly idealised and impractical in the real world. In contrast, the Rasch model, one of the simplest item response theory (IRT) models and the basis for many extensions, has been widely used in education and psychology. It is a generative probabilistic model that represents student responses as noisy observations of latent student abilities related to item difficulties. It can be viewed as a stochastic counterpart to the Guttman scale, permitting measurement error: if a person’s ability level is higher than an item’s difficulty, the person will answer the item correctly under the Guttman scale, while under the Rasch model there remains a certain probability of an incorrect response. While the Guttman scale only permits ordering of persons and items, the Rasch model places both at locations on the scale and hence also supports meaningful differences (Scholten, 2011). The algorithm proposed for the Guttman scale (He et al, 2016) does not adapt readily to Rasch modelling. Instead we propose the TopicResponse algorithm, which simultaneously performs nonnegative matrix factorisation and Rasch model fitting. The main contributions of this paper include:
The first study that combines topic modelling with Rasch modelling in psychometric testing: generating topics that measure students’ academic abilities based on online forum postings;

An algorithm TopicResponse fitting NMF and Rasch models simultaneously, for which we provide a proof of convergence; and

Quantitative experiments on three Coursera MOOCs covering a broad swath of disciplines, establishing statistical effectiveness of our algorithm, and qualitative results on a Discrete Optimisation MOOC, supporting interpretability.
2 Related Work
Many studies have focused on item response theory (IRT) or MOOC data analysis, but automatic discovery of items for measurement in MOOCs has received little attention. The work most relevant to this paper is (He et al, 2016), where NMF-based topic modelling is adapted and used for Guttman scaling (Guttman, 1950) in order to measure students’ latent abilities based on their MOOC forum posts. A major drawback of that work is that the Guttman scale is regarded as the most restrictive IRT model and is overly idealised: it neither serves as the basis of more sophisticated (probabilistic) models, nor is it practical in the real world as a deterministic model. While the Guttman scale only models the ordering of persons and items, the (probabilistic) Rasch model permits the interpretation of differences between items and people (Scholten, 2011). The Rasch model is a generative model that represents student responses as noisy observations of latent student abilities in relation to item difficulties. The algorithm for Guttman scaling (He et al, 2016) does not naturally extend to incorporate Rasch modelling.
2.1 Item Response Theory (IRT)
The field of IRT studies statistical models for measurement in education and psychology. Such models specify the probability of a person’s response on an item as a mathematical function of the person’s and item’s latent attributes. A principal goal of IRT is to create a scale on which persons and items can be placed and compared meaningfully. IRT has been used for computerised adaptive testing (CAT), which aims to assess individuals’ trait levels accurately and efficiently, and underpins tests such as the Scholastic Aptitude Test (SAT) and the Graduate Record Examination (GRE). Chen et al (2005) proposed a personalised e-learning system based on IRT that considers course material difficulty and learner ability.
As a statistical model, IRT has attracted attention in machine learning recently.
Bergner et al (2012) applied model-based collaborative filtering to estimate the parameters of IRT models, considering IRT as a type of collaborative filtering task where the user-item interactions are factorised into user and item parameters. Bachrach et al (2012) proposed a probabilistic graphical model that jointly models the difficulties of questions, the abilities of participants, and the correct answers to questions in aptitude testing and crowdsourcing settings. In MOOCs, Champaign et al (2014) investigated the correlations between resource use and students’ skill and relative skill improvement as measured by IRT. Colvin et al (2014) analysed pre-post test questions using IRT to compare learning in a MOOC and a blended on-campus course. Past work has tended to focus on using already-devised items to measure student ability under IRT models, while we are interested in automatically discovering content-based items that are characteristic of measurement in MOOCs (Milligan, 2015).

2.2 MOOC Forums
MOOC forums have been of great interest recently, due to the availability of rich textual data and social behaviour. Various studies have been conducted, including sentiment analysis, community finding, question recommendation, and answer & intervention prediction.
Wen et al (2014) use sentiment analysis to monitor students’ trending opinions towards the course and to correlate sentiment with dropouts over time using survival analysis. Yang et al (2015) predict students’ confusion during learning activities as expressed in discussion forums, using discussion behaviour and clickstream data; they further explore the impact of confusion on student dropout. Ramesh et al (2015) predict sentiment in MOOC forums using hinge-loss Markov random fields. Gillani et al (2014) find communities using Bayesian Non-Negative Matrix Factorisation. Yang et al (2014) recommend questions of interest to students by designing a context-aware matrix factorisation model considering constraints on students and questions. MOOC forum data has also been leveraged in the task of predicting accepted answers to forum questions (Jenders et al, 2016) and predicting instructor intervention (Chaturvedi et al, 2014). Despite the variety of studies, little machine learning research has explored forum discussions for the purpose of measurement in MOOCs.

3 Preliminaries and Problem Formulation
We choose NMF as the basic approach to discover forum topics due to the interpretability of the topics produced, and the extensibility of its optimisation formulation. For the IRT model for measurement, we focus on the Rasch model for dichotomous data due to its popularity, and due to being the basis for many extensions in education and psychology. We next overview the Rasch model for dichotomous data and NMF, and then define our problem.
3.1 Rasch Model
The Rasch model (Wright and Masters, 1982; Bond and Fox, 2001) for dichotomous data (correct/incorrect, agree/disagree responses) specifies the probability of a person’s positive response (correct, agree) on an item as a logistic function of the difference between the person’s ability and the item’s difficulty,

P(X_{ni} = 1) = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)},   (1)

where latent $\beta_n$ denotes person $n$’s ability, latent $\delta_i$ denotes item $i$’s difficulty, $X_{ni}$ denotes person $n$’s observed random response on item $i$, and $P(X_{ni} = 1)$ is the probability of this response being positive. This probability is best illustrated with the Item Characteristic Curve (ICC) as depicted in Figure 2 and commonly used in the field of IRT. It can be seen that the higher a person’s ability is, relative to the difficulty of an item, the higher the probability of a positive response on that item. When a person’s ability is equal to an item’s difficulty on the latent scale, positive responses are observed with 0.5 probability.
The latent measurement scale is analogous to the ruler shown in Figure 1, where persons and items are placed together and can be compared meaningfully. The Rasch model provides a way to construct the ruler using persons’ responses on items. Persons and items are located along the scale according to their abilities and difficulties respectively.
The Rasch model can be viewed as a stochastic counterpart to the Guttman scale. For example, in Figure 1, person 1 and person 2 will both have positive responses on item 1 under a Guttman scale. Under the Rasch model, by contrast, there are certain probabilities that person 1 and person 2 will produce positive responses on item 1, with person 2’s probability being higher. This error model leads to a higher level of measurement scale: the interval scale, where we can tell how much more able person 2 is compared to person 1. From the Guttman scale, by comparison, we can tell that person 2 is better than person 1 but not by how much.
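To make Eq. (1) concrete, here is a minimal NumPy sketch (our illustration, not code from the paper) of the Rasch response probability; the function name and the example abilities and difficulty are hypothetical.

```python
import numpy as np

def rasch_prob(beta, delta):
    """P(X_ni = 1) under the Rasch model, Eq. (1): a logistic function
    of the difference between ability beta and item difficulty delta."""
    return 1.0 / (1.0 + np.exp(-(beta - delta)))

# A less able and a more able person facing the same item of difficulty 0:
# both may respond positively, but the abler person is more likely to.
print(rasch_prob(-1.0, 0.0))  # ~0.27
print(rasch_prob(1.0, 0.0))   # ~0.73
```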
Table 1: Example items for measuring basic mathematical ability, with hypothetical students’ responses, proportions correct, and initial ability and difficulty estimates (in logits).

             Item 1  Item 2  Item 3  Item 4  Item 5  Proportion correct  Ability
Person 1       1       0       0       0       0          0.20           -1.39
Person 2       1       1       0       0       0          0.60            0.41
Person 3       0       1       1       0       0          0.60            0.41
Person 4       1       0       1       1       0          0.67            0.71
Person 5       1       1       1       0       1          0.80            1.39
Proportion
correct       0.80    0.33    0.33    0.20    0.20
Difficulty   -1.39    0.71    0.71    1.39    1.39
Table 1 further illustrates our setup, with an example of items for measuring basic mathematical ability, alongside hypothetical students’ responses. The initial estimates (see Equations 6 and 7 below) for item difficulties and person abilities are produced on a logit scale. For example, if person 1 responds to the items positively 20% of the time and negatively 80% of the time, then the person’s initial ability estimate is approximately $\ln(0.20/0.80) \approx -1.39$, obtained by taking the natural logarithm of the odds ratio for a positive response.

3.1.1 Rasch Estimation
Given an observed response matrix $X = (x_{ni})$ (e.g., Table 1), a basic goal is to estimate the person and item parameters $\beta$ and $\delta$. The most common estimation methods are based on maximum-likelihood estimation, including joint maximum-likelihood (JML) estimation, conditional maximum-likelihood (CML) estimation and marginal maximum-likelihood (MML) estimation (Baker and Kim, 2004). In this paper, we focus on JML.
Under the assumption that a sample of persons is drawn independently at random from a population possessing a latent skill attribute, and the assumption of local independence (that a person’s responses to different items are statistically independent), the probability of an observed data matrix with $I$ items and $N$ persons is the product of the probabilities of the individual responses, and is given by the joint likelihood function

P(X \mid \beta, \delta) = \prod_{n=1}^{N} \prod_{i=1}^{I} \frac{\exp\left(x_{ni} (\beta_n - \delta_i)\right)}{1 + \exp(\beta_n - \delta_i)}.   (2)

The log-likelihood function is then

\ell(\beta, \delta) = \sum_{n=1}^{N} \sum_{i=1}^{I} \left[ x_{ni} (\beta_n - \delta_i) - \log\left(1 + \exp(\beta_n - \delta_i)\right) \right].   (3)
The parameters of the Rasch model can be estimated by joint maximum likelihood (maximisation of this expression) using Newton-Raphson (Bertsekas, 1999), which yields the following iterative solutions for $\beta_n$ and $\delta_i$,

\beta_n \leftarrow \beta_n + \frac{r_n - \sum_{i} P_{ni}}{\sum_{i} P_{ni} (1 - P_{ni})},   (4)

\delta_i \leftarrow \delta_i - \frac{s_i - \sum_{n} P_{ni}}{\sum_{n} P_{ni} (1 - P_{ni})},   (5)

where $P_{ni} = P(X_{ni} = 1)$, $r_n = \sum_i x_{ni}$ is person $n$’s raw score, and $s_i = \sum_n x_{ni}$ is item $i$’s raw score.
The convergence to a local optimum (with suitable step sizes) is guaranteed. The initial estimate of $\beta_n$ can be obtained by first calculating the proportion $p_n = r_n / I$ of items that person $n$ responded to correctly, and then taking the natural logarithm of the odds of person $n$’s correct response, as shown in Table 1, which can be formalised as follows:

\beta_n^{(0)} = \ln \frac{p_n}{1 - p_n} = \ln \frac{r_n}{I - r_n},   (6)

where $r_n$ denotes the number of items that person $n$ responded to positively. Similarly, the initial estimate of $\delta_i$ can be obtained by

\delta_i^{(0)} = \ln \frac{1 - p_i}{p_i} = \ln \frac{N - s_i}{s_i},   (7)

where $s_i$ denotes the number of persons who responded correctly on item $i$, and $p_i = s_i / N$ denotes the proportion of persons who responded correctly on item $i$.
For those items receiving no correct responses ($s_i = 0$), or no incorrect responses ($s_i = N$), some implementations of the Rasch model will delete the item, while others handle the situation as follows (Baker and Kim, 2004): $s_i$ is replaced by $\epsilon$ when $s_i = 0$, and by $N - \epsilon$ when $s_i = N$ (and analogously $r_n$ is replaced by $\epsilon$ or $I - \epsilon$ for persons with zero or perfect scores), where $\epsilon$ is a small number (e.g., 1.0 is used in our experiments). These pseudo-counts are similar to frequentist Laplace corrections, or (weak) uniform Bayesian priors.
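As a worked illustration of this estimation procedure, the sketch below (our reconstruction, not the authors’ code) combines the log-odds initialisation of Eqs. (6) and (7), the pseudo-count correction for extreme scores, and Newton-Raphson updates in the style of Eqs. (4) and (5); centring the difficulties at zero is a standard identifiability convention that we assume here.

```python
import numpy as np

def jml_rasch(X, iters=100, eps=1.0):
    """Sketch of joint maximum-likelihood (JML) Rasch estimation.
    X: N x I binary response matrix (persons x items)."""
    N, I = X.shape
    r = X.sum(axis=1).astype(float)   # person raw scores r_n
    s = X.sum(axis=0).astype(float)   # item raw scores s_i
    # Pseudo-count correction for zero or perfect scores (Sec. 3.1.1).
    r = np.clip(r, eps, I - eps)
    s = np.clip(s, eps, N - eps)
    beta = np.log(r / (I - r))        # Eq. (6): initial abilities
    delta = np.log((N - s) / s)       # Eq. (7): initial difficulties
    for _ in range(iters):
        P = 1.0 / (1.0 + np.exp(-(beta[:, None] - delta[None, :])))
        W = P * (1.0 - P)             # response variances
        beta = beta + (r - P.sum(axis=1)) / W.sum(axis=1)    # Eq. (4)
        delta = delta - (s - P.sum(axis=0)) / W.sum(axis=0)  # Eq. (5)
        delta = delta - delta.mean()  # anchor the scale (assumed convention)
    return beta, delta
```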
3.1.2 Evaluating Model Fit
A set of items is said to measure a latent attribute on an interval scale when there is a close fit between data and model. The modeldata fit is typically examined using infit and outfit statistics—two types of mean square error statistics—conveying information about the error in the estimates for each individual item and person.
Outfit and infit test statistics are defined for each item and person to test the fit of items and persons under the Rasch model, by carefully summarising the Rasch residuals. The Rasch residuals are the differences between the observed responses and the expected responses according to the Rasch model. Formally, the expected response of person $n$ on item $i$ under the Rasch model is $E_{ni} = P(X_{ni} = 1)$ (abbreviated to $P_{ni}$). The residual between the observation and the expected response is then $x_{ni} - E_{ni}$. Standardised residuals are often used to assess the fit of a single person-item response,

z_{ni} = \frac{x_{ni} - E_{ni}}{\sqrt{W_{ni}}},   (8)

where $W_{ni} = P_{ni} (1 - P_{ni})$ denotes the variance of the response $X_{ni}$. The outfit of item $i$ summarises the squared standardised residuals, averaged over persons,

u_i = \frac{1}{N} \sum_{n=1}^{N} z_{ni}^2.   (9)
Typical treatments assume the standardised residuals $z_{ni}$ approximately follow a unit normal distribution. Their sum of squares therefore approximately follows a $\chi^2$ distribution. Dividing this sum by its degrees of freedom yields a mean-square value, with an expectation of 1.0 and taking values in the range 0 to infinity.
Outfit is sensitive to unexpected responses to items, e.g., lucky guesses (e.g., a person responds 111001) or careless sequences of mistakes (e.g., a person responds 010100) (Linacre, 2002). Since outfit is sensitive to very unexpected observations (outliers), infit was devised to be more sensitive to the overall pattern of responses (Linacre, 2006). Infit is an information-weighted form of outfit: it weights the observations by their statistical information (model variance), which is larger for targeted observations and smaller for extreme observations (Bond and Fox, 2001). In this paper, we focus on infit. Formally, the infit of item $i$ is given by

v_i = \frac{\sum_{n=1}^{N} W_{ni} z_{ni}^2}{\sum_{n=1}^{N} W_{ni}} = \frac{\sum_{n=1}^{N} (x_{ni} - E_{ni})^2}{\sum_{n=1}^{N} W_{ni}}.   (10)
Both outfit and infit have the expected value of 1.0. Values larger than 1.0 indicate model underfitting, i.e., data is less predictable than the model expects, while values less than 1.0 indicate overfitting, i.e., observations are highly predictable (Wright et al, 1994). Conventionally, the acceptable range is usually taken to be [0.7,1.3] or [0.8,1.2] depending on application.
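The fit statistics above translate directly into a few lines of NumPy; the following sketch (our illustration, under the notation reconstructed in this section) computes per-item outfit and infit.

```python
import numpy as np

def item_fit(X, beta, delta):
    """Item outfit (Eq. 9) and infit (Eq. 10) for binary responses X."""
    P = 1.0 / (1.0 + np.exp(-(beta[:, None] - delta[None, :])))  # E_ni
    W = P * (1.0 - P)                    # variances W_ni
    Z = (X - P) / np.sqrt(W)             # standardised residuals, Eq. (8)
    outfit = (Z ** 2).mean(axis=0)       # unweighted mean square
    infit = ((X - P) ** 2).sum(axis=0) / W.sum(axis=0)  # information-weighted
    return outfit, infit

# Items whose infit falls outside a conventional band such as [0.7, 1.3]
# are flagged as candidates for refinement.
```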
3.2 Non-Negative Matrix Factorisation (NMF)
Given a nonnegative matrix $V \in \mathbb{R}^{M \times N}$ and a positive integer $K < \min(M, N)$, NMF factorises $V$ into the product of a nonnegative matrix $W \in \mathbb{R}^{M \times K}$ and a nonnegative matrix $H \in \mathbb{R}^{K \times N}$ such that $V \approx WH$. A commonly-used measure for quantifying the quality of this approximation is the Frobenius norm between $V$ and $WH$. Thus, NMF involves solving

\min_{W \geq 0, H \geq 0} \|V - WH\|_F^2.   (11)
This objective function is convex in $W$ and in $H$ separately, but not jointly, so standard convex solvers are not expected to find a global optimum in general. The multiplicative update algorithm (Lee and Seung, 2001) is commonly used to find a local optimum, where $W$ and $H$ are updated by multiplicative factors that depend on the quality of the approximation.
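For concreteness, here is a minimal sketch of the multiplicative-update algorithm for program (11); the random initialisation and the small constant guarding the divisions are our assumptions, not details from the paper.

```python
import numpy as np

def nmf(V, K, iters=200, tol=1e-9):
    """Lee-Seung multiplicative updates minimising ||V - WH||_F^2.
    V: nonnegative M x N matrix; returns W (M x K) and H (K x N)."""
    M, N = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((M, K))
    H = rng.random((K, N))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + tol)  # tol guards against 0/0
        W *= (V @ H.T) / (W @ H @ H.T + tol)
    return W, H
```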
In the present MOOC setting, we focus on the students who contributed posts or comments in forums. For each student, we aggregate all posts or comments that they contributed. Each student is represented by a bag of words, as shown in the example word-student matrix in Figure 3, where $M$ represents the number of words and $N$ represents the number of students. Using NMF, a word-student matrix $V$ can be factorised into two nonnegative matrices: a word-topic matrix $W$ and a topic-student matrix $H$. For each student, the corresponding column vector of $V$ is approximated by a linear combination of the columns of $W$, weighted by the components of the corresponding column of $H$. Therefore, each column vector of $W$ can be regarded as a topic, and the memberships of students in these topics are encoded by $H$, as shown in Figure 3.

3.3 Problem Statement
We seek to explore the feasibility of automatically discovering forum discussion topics for measuring students’ academic abilities in MOOCs, as quantified by the Rasch model. Our central tenet is that topics can be regarded as useful items for measuring a latent skill if student responses to these topics are well fit by the Rasch model, and if the topics are interpretable to domain experts for educational relevance. Therefore, we need to discover topics from students’ posts and comments in MOOC forums in such a way that students’ participation across these topics fits the Rasch model. A student’s item response records whether the student posts on the corresponding topic or not. After discovery, topics must then be further assessed for interpretability by domain experts. Our goal is decision support.
In particular, under the NMF framework, a word-student matrix $V$ can be factorised into two nonnegative matrices: a word-topic matrix $W$ and a topic-student matrix $H$. Our application requires that the topic-student matrix $H$ be a) binary, ensuring that the response of a student to a topic is dichotomous; b) useful for measuring students’ academic abilities; and c) well fit by the Rasch model. NMF provides an elegant framework for incorporating these requirements by adding novel regularisation terms, as detailed in the next section. A glossary of the symbols most used in this paper is given in Table 2.
Symbol  Description

$M$  the number of words
$N$  the number of students
$K$  the number of topics
$V$  word-student matrix
$W$  word-topic matrix
$H$  topic-student matrix
$D$  matrix of students’ ideal numbers of distinct topics posted
$E$  all-ones matrix of size $1 \times K$
$g_n$  student $n$’s grade
$\delta$  item difficulty vector
$\beta$  student ability vector
$x_{ni}$  binary response (0 or 1) of person $n$ to item $i$
$X_{ni}$  observed random response of person $n$ to item $i$
$P_{ni}$  the probability of a positive response of person $n$ to item $i$
$W_{ni}$  variance of $X_{ni}$
$z_{ni}$  standardised residual
$\lambda, \lambda_1, \lambda_2, \lambda_3$  regularisation coefficients
4 The TopicResponse Algorithm: Joint NMF-Rasch Estimation
To favour topics that fit the Rasch model, we jointly optimise both the NMF and Rasch models, which yields the objective function

\min_{W \geq 0, H \geq 0} \|V - WH\|_F^2 - \lambda \, \ell(H; \beta, \delta),

where $\ell$ is the log-likelihood function of Eq. (3) maximised in Rasch estimation, with the responses given by $H$, and $\lambda$ is a user-specified parameter controlling the trade-off between the quality of factorisation and Rasch estimation.
Weak supervision of item responses.
The fit between student topic responses and the Rasch model will provide statistical evidence of measuring skill attainment. However, it is difficult to conclude what the topics are measuring without domain knowledge. To favour topics that can be used to measure students’ academic abilities, we impose a constraint on $H$ based on student grades, which provide an indicator of students’ abilities (we discuss sources of auxiliary grade information below). In particular, we assume the following relationship between the ideal number $d_n$ of distinct topics that each student contributes to and their grade $g_n \in [0, 1]$:

d_n = 1 + (K - 1) \, g_n,

where $D = (d_n) \in \mathbb{R}^{1 \times N}$ is the matrix denoting the ideal number of distinct topics posted by students. For example, under $K$ items, a student scoring $g_n$ should post on $d_n = 1 + (K - 1) g_n$ topics. The minimum and maximum numbers of different topics that a student posts are thus 1 and $K$ respectively. This is motivated by the initialisation of $\beta$ and $\delta$ as illustrated in Section 3.1.1, where positive responses on 0 or all $K$ topics are undesirable.
This supervision constraint is markedly weaker than a similar constraint found in (He et al, 2016), as demonstrated in Figure 4. He et al (2016) leverage the student grade to exactly determine the item responses for the Guttman scale. The Guttman scale, as a deterministic model, requires that if a student gets a difficult item correct, they also achieve correct responses on all easier items. This assumption is very restrictive, and rarely makes sense in practice. The Rasch model allows errors in the responses, and our constraint only restricts the number of distinct topics posted by a student, rather than the exact response pattern.
Most (MOOC) courses conduct multiple forms of assessment throughout the duration of teaching, for example weekly quizzes, take-home assignments, midterm tests, projects, and presentations. In the large-scale MOOC context, such evaluations may be peer-assessed. Students often enter courses with some cumulative grade-point average that may be (loosely) predictive of future performance. Any of these readily-available sources of student information could reasonably be used to seed $D$. Even final course grades could be used, particularly when the ultimate application of TopicResponse is not measuring students, but refining curriculum.
In order to encourage satisfaction of the soft constraint on topic responses, we introduce a regularisation term on $H$, namely $\|EH - D\|_F^2$, where $E$ is the all-ones matrix of size $1 \times K$, so that $EH$ records each student’s number of topic responses.
Quantising & Regularising the Response Matrix.
We introduce the regularisation term $\|H\|_F^2$, commonly used to prevent overfitting in NMF. To encourage binary solutions, we impose an additional regularisation term $\|H \circ H - H\|_F^2$, where the operator $\circ$ denotes the Hadamard (element-wise) product. Binary matrix factorisation (BMF) is a variation of NMF in which the input matrix and the two factorised matrices are all binary; our approach is inspired by those of Zhang et al (2007) and Zhang et al (2010). Our added term equals $\sum_{k,n} h_{kn}^2 (h_{kn} - 1)^2$, which is minimised by (only) binary $H$.
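A quick numerical check (our illustration) shows how this penalty behaves: it vanishes exactly on binary entries and grows as entries drift away from {0, 1}.

```python
import numpy as np

def binary_penalty(H):
    """||H o H - H||_F^2 = sum of h^2 (h - 1)^2 over all entries."""
    return np.sum((H * H - H) ** 2)

print(binary_penalty(np.array([0.0, 1.0, 1.0])))  # 0.0: already binary
print(binary_penalty(np.array([0.5, 0.9, 0.1])))  # ~0.079: fractional entries
```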
TopicResponse Model.
We have the following regularisations:

$\|EH - D\|_F^2$ to encourage a grade-guided $H$;

$\|H\|_F^2$ to prevent overfitting; and

$\|H \circ H - H\|_F^2$ to encourage a binary item-response solution.
These terms, together with the joint NMF-Rasch estimation, yield the final objective

F(W, H, \beta, \delta) = \|V - WH\|_F^2 - \lambda \, \ell(H; \beta, \delta) + \lambda_1 \|EH - D\|_F^2 + \lambda_2 \|H\|_F^2 + \lambda_3 \|H \circ H - H\|_F^2,   (12)

where $\lambda_1$, $\lambda_2$, $\lambda_3$ are user-specified regularisation parameters, with primal program

\min_{W, H, \beta, \delta} F(W, H, \beta, \delta) \quad \text{subject to} \quad W \geq 0, \ H \geq 0.   (13)
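Under the notation reconstructed above, the objective of Eq. (12) can be evaluated as in the following sketch; the function is our illustration rather than the authors’ released code, with symbol names mirroring the glossary in Table 2.

```python
import numpy as np

def objective(V, W, H, beta, delta, D, lam, lam1, lam2, lam3):
    """Value of F(W, H, beta, delta) in Eq. (12).
    V: M x N, W: M x K, H: K x N, beta: N, delta: K, D: 1 x N."""
    B = beta[None, :] - delta[:, None]            # b_kn = beta_n - delta_k
    loglik = np.sum(H * B - np.log1p(np.exp(B)))  # Rasch log-likelihood, Eq. (3)
    E = np.ones((1, H.shape[0]))                  # all-ones 1 x K
    return (np.linalg.norm(V - W @ H) ** 2
            - lam * loglik
            + lam1 * np.linalg.norm(E @ H - D) ** 2
            + lam2 * np.linalg.norm(H) ** 2
            + lam3 * np.sum((H * H - H) ** 2))
```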
TopicResponse Fitting Procedure.
A local optimum of program (13) is achieved via the iteration

W \leftarrow W \circ \frac{V H^\top}{W H H^\top},   (14)

H \leftarrow H \circ \frac{2 W^\top V + \lambda [B]^{+} + 2 \lambda_1 E^\top D + 6 \lambda_3 \, H \circ H}{2 W^\top W H + \lambda [B]^{-} + 2 \lambda_1 E^\top E H + 2 \lambda_2 H + 4 \lambda_3 \, H \circ H \circ H + 2 \lambda_3 H},   (15)

\beta_n \leftarrow \beta_n + \frac{\sum_k h_{kn} - \sum_k P_{nk}}{\sum_k P_{nk} (1 - P_{nk})},   (16)

\delta_k \leftarrow \delta_k - \frac{\sum_n h_{kn} - \sum_n P_{nk}}{\sum_n P_{nk} (1 - P_{nk})},   (17)

where fractions over matrices denote element-wise division, $B \in \mathbb{R}^{K \times N}$ has entries $b_{kn} = \beta_n - \delta_k$, and $[A]^{+} = (|A| + A)/2$ and $[A]^{-} = (|A| - A)/2$ denote the positive part and negative part of a matrix $A$ respectively. We next describe how these update rules are derived.
The update rules (16) and (17) can be obtained using Newton’s method, exactly as in Section 3.1.1 but with the responses given by $H$. The update rules (14) and (15) can be derived via the Karush-Kuhn-Tucker (KKT) conditions necessary for local optimality. First we construct the unconstrained Lagrangian

L = F(W, H, \beta, \delta) - \mathrm{tr}(\Phi W^\top) - \mathrm{tr}(\Psi H^\top),

where $\phi_{mk}$ and $\psi_{kn}$ are the Lagrangian dual variables for the inequality constraints $w_{mk} \geq 0$ and $h_{kn} \geq 0$ respectively, and $\Phi$, $\Psi$ denote their corresponding matrices. The KKT condition of stationarity requires that the derivative of $L$ with respect to $W$ and $H$ vanishes at a local optimum; e.g., for $H$:

\frac{\partial F}{\partial H} - \Psi = 0.

Complementary slackness, $\psi_{kn} h_{kn} = 0$, implies:

\left( \frac{\partial F}{\partial H} \right)_{kn} h_{kn} = 0.

These two equations lead to the updating rules (14) and (15). Regarding the update rules (14), (15), (16) and (17), we have the following theorem:
Theorem 1
The objective $F$ of Eq. (12) is non-increasing under the update rules (14), (15), (16) and (17).

This result guarantees that the update rules of $W$, $H$, $\beta$ and $\delta$ eventually converge, and that the obtained solution will be a local optimum. The proof of Theorem 1 is given in the Appendix.
Algorithm.
Our overall approach, TopicResponse, is described as Algorithm 1. $W$ and $H$ are initialised using plain NMF (Lee and Seung, 1999, 2001), then normalised (Zhang et al, 2007, 2010). $\beta$ and $\delta$ are initialised based on Eq. (6) and Eq. (7), with the responses $x$ replaced by $H$. At optimisation completion, estimates for topics, item difficulties and person abilities are obtained together. Code for TopicResponse is available from the authors’ websites.
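To tie the pieces together, here is a hedged end-to-end sketch of Algorithm 1 as we have reconstructed it, reusing the `nmf` sketch from Section 3.2. The max-normalisation, the final rounding of $H$ to binary responses, and all default parameter values are our assumptions standing in for details not recoverable from the text.

```python
import numpy as np  # assumes nmf() from the Section 3.2 sketch is in scope

def topic_response(V, grades, K, lam=0.1, lam1=1.0, lam2=1.0, lam3=1.0,
                   iters=200, tol=1e-9):
    """Sketch of TopicResponse: joint NMF-Rasch fitting, updates (14)-(17).
    V: M x N word-student matrix; grades: length-N array in [0, 1]."""
    M, N = V.shape
    W, H = nmf(V, K)                                 # plain-NMF initialisation
    H = H / (H.max(axis=1, keepdims=True) + tol)     # scale rows into [0, 1]
    D = (1.0 + (K - 1) * grades)[None, :]            # ideal topic counts d_n
    E = np.ones((1, K))
    # Rasch initialisation, Eqs. (6)-(7), with responses given by H.
    r = np.clip(H.sum(axis=0), 1.0, K - 1.0)
    s = np.clip(H.sum(axis=1), 1.0, N - 1.0)
    beta, delta = np.log(r / (K - r)), np.log((N - s) / s)
    for _ in range(iters):
        B = beta[None, :] - delta[:, None]           # b_kn = beta_n - delta_k
        Bp, Bn = np.maximum(B, 0.0), np.maximum(-B, 0.0)
        P = 1.0 / (1.0 + np.exp(-B))
        # Multiplicative updates (14)-(15): nonnegative gradient terms in
        # the denominator, the remainder in the numerator.
        W *= (V @ H.T) / (W @ H @ H.T + tol)
        num = 2 * W.T @ V + lam * Bp + 2 * lam1 * E.T @ D + 6 * lam3 * H * H
        den = (2 * W.T @ W @ H + lam * Bn + 2 * lam1 * E.T @ (E @ H)
               + 2 * lam2 * H + 4 * lam3 * H ** 3 + 2 * lam3 * H + tol)
        H *= num / den
        # Newton-Raphson Rasch updates (16)-(17) on the current responses.
        Wv = P * (1.0 - P)
        beta += (H.sum(axis=0) - P.sum(axis=0)) / Wv.sum(axis=0)
        delta -= (H.sum(axis=1) - P.sum(axis=1)) / Wv.sum(axis=1)
    return W, (H > 0.5).astype(int), beta, delta
```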
5 Experiments
We report on extensive experiments evaluating the effectiveness of TopicResponse on real MOOCs. In our experiments, we use the first offerings of three Coursera MOOCs from education, economics and computer science offered by The University of Melbourne: Assessment and Teaching of 21st Century Skills delivered in 2014, Principles of Macroeconomics delivered in 2013, and Discrete Optimisation delivered in 2013. We denote these three courses by EDU, ECON and OPT respectively.
5.1 Dataset Preparation
We focus on the students who contributed posts or comments in forums. For each student, we aggregate all the posts and comments that they contributed. After stemming and removing stop words, a word-student matrix with tf-idf values normalised to [0,1] is produced. The statistics of words and students before and after preprocessing, the dominant words, and the sparsity of the word-student matrix (the percentage of non-zero values) for the three MOOCs are displayed in Table 3.
5.2 Baseline and Evaluation Metrics
We compare our algorithm TopicResponse with the baseline algorithm Grade-Guided NMF (GGNMF), which minimises the objective function of Eq. (12) without the Rasch log-likelihood term, i.e.,

\min_{W \geq 0, H \geq 0} \|V - WH\|_F^2 + \lambda_1 \|EH - D\|_F^2 + \lambda_2 \|H\|_F^2 + \lambda_3 \|H \circ H - H\|_F^2.
MOOC  #Students  #Words  #Words after preprocessing  Dominant words  Word-student matrix sparsity

EDU  1,749  28,931  18,391  student, learn, skill, work, teacher, use, assess, teach, problem, collabor  0.59%
ECON  1,551  26,370  21,412  gdp, would, econom, think, product, good, one, economi, increas, invest  0.50%
OPT  1,092  19,284  16,128  use, solut, get, time, one, tri, python, work, optim, would  0.85%
A local optimum can be obtained using the Karush-Kuhn-Tucker conditions. Like TopicResponse, GGNMF regularises $H$ by considering the students’ grades as an indicator of academic ability. The difference is that TopicResponse optimises the Rasch estimation and NMF simultaneously, while in GGNMF the students’ topic responses are first obtained, and only then passed through the Rasch model.
We evaluate the two algorithms in terms of the following metrics.
Quality of factorisation. We measure $\|V - WH\|_F^2$ so as to record how well the factorisation approximates the word-student matrix.
Measuring student academic ability. Quality of the grade-based constraint on students’ topic participation: $\|EH - D\|_F^2$.
Negative log-likelihood. The log-likelihood measures the fit of the Rasch model to the entire dataset. For convenience, we look at the negative log-likelihood, which should be minimised: smaller is better. This measure is our main focus for Rasch, as it is important to examine the model-level fit before looking at item-level fit.
Item infit. As illustrated in Section 3.1.2, item infit examines the fit of a particular item, with non-fitting items suitable for further refinement. We use the conventional acceptance range of [0.7, 1.3].
5.3 Hyperparameter Settings
Table 4 presents the parameter values used for our parameter sensitivity experiments, where the default values shown in boldface are used in experiments unless noted otherwise.
5.4 Main Results for GGNMF and TopicResponse
In the first group of experiments, we examine the performance of GGNMF (baseline) and TopicResponse in terms of negative log-likelihood, the quality of factorisation given by $\|V - WH\|_F^2$, and the supervision soft constraint $\|EH - D\|_F^2$. For GGNMF, the factorisation and Rasch estimation are separated: the topic-student response matrix $H$ is first obtained using GGNMF, and then taken as input to Rasch estimation. For TopicResponse, the negative log-likelihood is optimised together with the factorisation. The parameters are set using the boldface default values in Table 4. Figure 5 displays the negative log-likelihood of GGNMF and TopicResponse.
It can be seen that TopicResponse yields superior negative log-likelihood, implying a better fit between the topic-student response matrix and the Rasch model. TopicResponse therefore provides greater confidence that item-level fit statistics, such as infit, will be acceptable. Jointly optimising the matrix factorisation and Rasch estimation appears to bring solutions closer to global optima.
We present the results on the quality of approximation $\|V - WH\|_F^2$ and the supervision term $\|EH - D\|_F^2$ in Figure 6. From these plots, we can see that without sacrificing approximation performance in terms of $\|V - WH\|_F^2$, TopicResponse obtains a superior $\|EH - D\|_F^2$ (while obtaining excellent negative log-likelihoods as above). This performance again demonstrates that optimising the factorisation and Rasch estimation jointly can be superior to optimising them separately. We therefore conclude that TopicResponse is preferable to GGNMF; we focus on results for TopicResponse in the remainder of our experiments.
5.5 Item Infit, Item Difficulty and Student Ability
We further examine the infit of each item, which indicates whether the set of topics conforms to the Rasch model and is appropriate for measurement. As illustrated in Section 3.1.2, a conventional acceptable range of infit is 0.7 to 1.3. As an example, we show the item infit on the OPT MOOC in Figure 7. We can see that the infit of each item is in the acceptable range, with most very close to the (ideal) expected value of 1.0, indicating that the set of topics conforms to the Rasch model and is appropriate for measuring student ability.
Additionally, we examine item difficulties and student abilities. Figure 8 displays the histograms of item difficulty and student ability along a common scale. According to the Rasch model, the higher a person’s ability relative to the difficulty of a topic, the higher the probability that the person posts on that topic. It can be seen that most students, with low ability (around -2 logits), only dominate the “easiest” topic (topic 1, with difficulty -2.3 logits); this topic concerns general problem solving. In other words, these students are likely to post only on topic 1, and unlikely to post on other topics. By comparison, the most able students, with abilities around 2 logits, contribute to all the topics with high probability.
5.6 Topic Interpretation and Discussion
We qualitatively examine topic interpretation, in order to assess educational meaningfulness. Well-scaled topics can potentially be used for curriculum refinement. Table 5 presents the topics generated using TopicResponse, alongside inferred difficulties. Topics are interpreted by an instructor who teaches a similar course. As the topics are not all course content-related, we envision that instructors examine discovered topics prior to using them for refining curriculum or taking other actions. Additionally, the inferred student ability levels and topic difficulty levels could potentially be used for personalised feedback, by tailoring appropriate topics of course content or forum discussion to students with their individual ability level taken into account. For example, most students (those with the lowest ability) only discuss problem solving in general, as shown in Figure 8. If they cannot obtain sufficient help from forum discussions, they may be prone to drop out without further topic exploration. Therefore, in intervening with at-risk students, it is advisable to leverage the discovered topics to better focus intervention measures. Such services may be useful in preventing dropout in the early stages of a course (when most dropouts typically occur).
No.  Topics  Interpretation  Inferred difficulty

1  use time problem get solut one optim algorithm tri work  Solving in general  -2.30
2  cours thank would lectur realli great assign good like think  Course feedback  -0.93
3  python use run program solver java matlab instal command work  Python/Java/Matlab (how to start)  -0.63
4  problem thank solut get grade knapsack got feedback optim solv  How the knapsack problem is solved and graded  -0.44
5  memori dp use column bb implement solv algorithm bound tabl  Comparing algorithms’ memory/time  -0.23
6  color node graph random edg greedi opt search swap iter  Graph coloring  0.31
7  item valu weight capac estim take solut calcul best list  Knapsack problem  0.33
8  file pi line solver data submit lib urllib2 solveit open  Using solvers  0.52
9  video http class load lecture org problem coursera optimization 001  Platform  1.17
10  submit assign assignment error messag view assignment_id detail class coursera  Assignment submission  1.73
5.7 Parameter Sensitivity
To validate the robustness of TopicResponse to parameter settings, a series of sensitivity experiments were conducted. The parameter settings are shown in Table 4. Negative log-likelihood, $\|V - WH\|_F^2$, $\|EH - D\|_F^2$, and item infit are examined in these experiments. Due to space limitations, we report here results for $\lambda$ on the OPT MOOC. The reader is referred to Appendix B for results on the parameters $\lambda_1$, $\lambda_2$, $\lambda_3$ and $K$ on all three MOOCs.
Effect of Parameter $\lambda$.
As can be seen in Figure 9, as $\lambda$ is increased TopicResponse performs better in terms of negative log-likelihood, and worse in terms of the other three metrics, due to the increased weight on the Rasch model. By contrast, the performance of GGNMF does not change, as there is no Rasch term in its objective. Overall, TopicResponse performs well when $\lambda$ varies between 0.1 and 0.2.
6 Conclusion and Future Work
We have examined the suitability of content-based items (topics) discovered from MOOC forum discussions for modelling student abilities. Our central tenet is that topics can be regarded as useful items for measuring latent skills if student responses to these topics fit the Rasch item-response theory model, and if the discovered topics are further interpretable to domain experts. We propose to jointly optimise NMF and Rasch modelling, in order to discover Rasch-scaled topics. We provide a quantitative validation on three Coursera MOOCs, demonstrating that TopicResponse yields a better global fit to the Rasch model (observed as lower negative log-likelihood) and maintains good quality of factorisation approximation, while measuring the students’ academic abilities (reflected by the grade-guided constraint on students’ participation in topics). We also provide a qualitative examination of topic interpretation with inferred difficulty levels on a Discrete Optimisation MOOC. The results on goodness of fit and our qualitative examination together suggest potential applications in curriculum refinement, student assessment and personalised feedback.
We opted to study the relatively simple Rasch model, as it forms the basis of very many subsequent models in the literature. One direction for extension is that any model (like Rasch) that fits parameters via maximum-likelihood estimation (or risk minimisation in general) can be augmented with NMF as an additional regulariser. For example, such an extension should be straightforward for polychotomous observations, hierarchical models on latent skills, models that include more flexible per-student variation, etc. These represent fruitful directions for future research. Another possible extension could involve augmenting the matrices in the NMF or Rasch objective terms with manually-crafted items, to make effective use of prior knowledge.
Acknowledgements.
We thank Jeffrey Chan for discussions related to this work, and the anonymous reviewers and editor for their thoughtful feedback. This work is supported by Data61, and the Australian Research Council (DE160100584).

References
 Bachrach et al (2012) Bachrach Y, Graepel T, Minka T, Guiver J (2012) How to grade a test without knowing the answers—a Bayesian graphical model for adaptive crowdsourcing and aptitude testing. In: Proceedings of the 29th International Conference on Machine Learning (ICML12), pp 1183–1190
 Baker and Kim (2004) Baker FB, Kim SH (2004) Item response theory: Parameter estimation techniques. CRC Press
 Bergner et al (2012) Bergner Y, Droschler S, Kortemeyer G, Rayyan S, Seaton D, Pritchard DE (2012) Modelbased collaborative filtering analysis of student response data: Machinelearning item response theory. International Educational Data Mining Society
 Bertsekas (1999) Bertsekas DP (1999) Nonlinear programming. Athena Scientific
 Bond and Fox (2001) Bond TG, Fox CM (2001) Applying the Rasch model: Fundamental measurement in the human sciences. Lawrence Erlbaum Associates Publishers
 Champaign et al (2014) Champaign J, Colvin KF, Liu A, Fredericks C, Seaton D, Pritchard DE (2014) Correlating skill and improvement in 2 MOOCs with a student’s time on tasks. In: Proceedings of the First ACM Conference on Learning@Scale Conference, ACM, pp 11–20
 Chaturvedi et al (2014) Chaturvedi S, Goldwasser D, Daumé III H (2014) Predicting instructor’s intervention in MOOC forums. In: ACL (1), pp 1501–1511
 Chen et al (2005) Chen CM, Lee HM, Chen YH (2005) Personalized elearning system using item response theory. Computers & Education 44(3):237–255
 Colvin et al (2014) Colvin KF, Champaign J, Liu A, Fredericks C, Pritchard DE (2014) Comparing learning in a MOOC and a blended oncampus course. In: Educational Data Mining 2014
 Gillani et al (2014) Gillani N, Eynon R, Osborne M, Hjorth I, Roberts S (2014) Communication communities in MOOCs. arXiv preprint arXiv:1403.4640
 Guttman (1950) Guttman L (1950) The basis for scalogram analysis. In: Stouffer S (ed) Measurement and Prediction: The American Soldier, Wiley, New York
 He et al (2016) He J, Rubinstein BI, Bailey J, Zhang R, Milligan S, Chan J (2016) MOOCs meet measurement theory: A topic-modelling approach. In: Thirtieth AAAI Conference on Artificial Intelligence
 Jenders et al (2016) Jenders M, Krestel R, Naumann F (2016) Which answer is best? Predicting accepted answers in MOOC forums. In: WWW 2016 Companion
 Lee and Seung (1999) Lee DD, Seung HS (1999) Learning the parts of objects by nonnegative matrix factorization. Nature 401(6755):788–791
 Lee and Seung (2001) Lee DD, Seung HS (2001) Algorithms for nonnegative matrix factorization. In: Advances in Neural Information Processing Systems, pp 556–562
 Linacre (2002) Linacre JM (2002) What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions 16(2):878
 Linacre (2006) Linacre JM (2006) Misfit diagnosis: Infit outfit mean-square standardized. Retrieved June 1, 2006
 Milligan (2015) Milligan S (2015) Crowdsourced learning in MOOCs: Learning analytics meets measurement theory. In: Proceedings of the Fifth International Conference on Learning Analytics And Knowledge, ACM, pp 151–155
 Ramesh et al (2015) Ramesh A, Kumar SH, Foulds J, Getoor L (2015) Weakly supervised models of aspectsentiment for online course discussion forums. In: Annual Meeting of the Association for Computational Linguistics (ACL)
 Rasch (1993) Rasch G (1993) Probabilistic models for some intelligence and attainment tests. ERIC
 Scholten (2011) Scholten AZ (2011) Admissible statistics from a latent variable perspective. Theory & Psychology 18:111–117
 Wen et al (2014) Wen M, Yang D, Rose C (2014) Sentiment analysis in MOOC discussion forums: What does it tell us? In: Educational Data Mining 2014
 Wright and Masters (1982) Wright BD, Masters GN (1982) Rating Scale Analysis. Rasch Measurement. ERIC
 Wright et al (1994) Wright BD, Linacre JM, Gustafson J, Martin-Lof P (1994) Reasonable mean-square fit values. Rasch Measurement Transactions 8(3):370
 Yang et al (2014) Yang D, Adamson D, Rosé CP (2014) Question recommendation with constraints for massive open online courses. In: Proceedings of the 8th ACM Conference on Recommender systems, ACM, pp 49–56
 Yang et al (2015) Yang D, Wen M, Howley I, Kraut R, Rose C (2015) Exploring the effect of confusion in discussion forums of massive open online courses. In: Proceedings of the Second (2015) ACM Conference on Learning@Scale, ACM, pp 121–130
 Zhang et al (2007) Zhang Z, Ding C, Li T, Zhang X (2007) Binary matrix factorization with applications. In: Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, IEEE, pp 391–400
 Zhang et al (2010) Zhang ZY, Li T, Ding C, Ren XW, Zhang XS (2010) Binary matrix factorization for analyzing gene expression data. Data Mining and Knowledge Discovery 20(1):28–52
Appendix A Proof of Theorem 1
The update rules for $\beta$ and $\delta$ are derived using the Newton-Raphson method (Bertsekas, 1999), where the convergence to a local optimum is guaranteed. Here, we focus on the proof for the update rule of $H$; the update rule for $W$ can be proved similarly. We closely follow the procedure described in (Lee and Seung, 2001), where an auxiliary function similar to that used in the Expectation-Maximization (EM) algorithm is employed.
Definition 2 (Lee and Seung 2001)
$G(h, h')$ is an auxiliary function for $F(h)$ if the conditions $G(h, h') \geq F(h)$ and $G(h, h) = F(h)$ are satisfied.

Lemma 1 (Lee and Seung 2001)
If $G$ is an auxiliary function for $F$, then $F$ is non-increasing under the update

h^{(t+1)} = \arg\min_h G(h, h^{(t)}).   (18)

Proof
The result follows from noting $F(h^{(t+1)}) \leq G(h^{(t+1)}, h^{(t)}) \leq G(h^{(t)}, h^{(t)}) = F(h^{(t)})$.∎
For any element $h_{kn}$ in $H$, let $F_{kn}$ denote the part of $F$ in Eq. (12) relevant to $h_{kn}$. Since the update is essentially element-wise, it is sufficient to show that each $F_{kn}$ is non-increasing under the update rule of Eq. (15). To prove this, we define the auxiliary function regarding $h_{kn}$ as follows.
Lemma 2
The function

G(h, h_{kn}^{(t)}) = F_{kn}(h_{kn}^{(t)}) + F'_{kn}(h_{kn}^{(t)}) (h - h_{kn}^{(t)}) + \frac{D_{kn}(h_{kn}^{(t)})}{2} (h - h_{kn}^{(t)})^2,   (19)

where $D_{kn}(h_{kn}^{(t)})$ denotes the ratio of the denominator of the update rule (15) at entry $(k, n)$ to $h_{kn}^{(t)}$, is an auxiliary function for $F_{kn}$.
Proof
It is obvious that $G(h, h) = F_{kn}(h)$. So we need only prove that $G(h, h_{kn}^{(t)}) \geq F_{kn}(h)$. Considering the Taylor series expansion of $F_{kn}(h)$ at $h_{kn}^{(t)}$, this inequality is equivalent to showing that the quadratic coefficient $D_{kn}(h_{kn}^{(t)})$ dominates the corresponding curvature terms of the expansion. Comparing $D_{kn}(h_{kn}^{(t)})$ term by term with the second-order (and, for the quartic binarisation term, higher-order) contributions of each component of $F_{kn}$ establishes the required bound. Thus $G(h, h_{kn}^{(t)}) \geq F_{kn}(h)$, as claimed.∎
Appendix B Experimental Results of Parameter Sensitivity on the Regularisation Parameters $\lambda_1$, $\lambda_2$, $\lambda_3$ and the Number of Topics $K$
a) Effect of Parameter $\lambda_1$: As we can see from Figure 11, GGNMF and TopicResponse are not sensitive to $\lambda_1$, performing stably as $\lambda_1$ varies. TopicResponse consistently performs better in terms of negative log-likelihood while maintaining comparable performance on the other three metrics.
b) Effect of Parameter $\lambda_2$: It can be seen from Figure 12 that GGNMF and TopicResponse perform well in terms of $\|V - WH\|_F^2$ (Figures 12(d) to 12(f)) and $\|EH - D\|_F^2$ (Figures 12(j) to 12(l)) over a wide range of $\lambda_2$ values. As $\lambda_2$ increases, the negative log-likelihood of both GGNMF and TopicResponse degrades, with TopicResponse consistently performing better than GGNMF, while the remaining metrics change comparatively little. Overall, $\lambda_2$ values around 1.0 work well for both GGNMF and TopicResponse.
c) Effect of Parameter $\lambda_3$: It can be seen from Figure 13 that GGNMF and TopicResponse perform well in terms of $\|V - WH\|_F^2$ (Figures 13(d) to 13(f)) and $\|EH - D\|_F^2$ (Figures 13(j) to 13(l)) over a wide range of $\lambda_3$ values. Similar to $\lambda_2$, $\lambda_3$ does not affect the remaining metrics significantly, and TopicResponse consistently achieves better negative log-likelihood than GGNMF. Overall, $\lambda_3$ values around 1.0 work well for both GGNMF and TopicResponse.
d) Effect of the number of topics $K$: It can be seen from Figure 14 that TopicResponse consistently outperforms GGNMF in terms of negative log-likelihood, while performing slightly worse on the other three metrics. This is reasonable, as TopicResponse optimises a more heavily constrained objective, and hence is less likely to match the less constrained GGNMF on those metrics. Overall, GGNMF and TopicResponse perform well when $K$ is set to 10 or 15. We choose $K = 10$, since a smaller number of topics is easier to analyse.