Variational Item Response Theory: Fast, Accurate, and Expressive

02/01/2020 ∙ by Mike Wu, et al. ∙ Stanford University 7

Item Response Theory is a ubiquitous algorithm used around the world to understand humans based on their responses to questions in fields as diverse as education, medicine and psychology. However, for medium to large datasets, contemporary solutions pose a tradeoff: either have bayesian, interpretable, accurate estimates or have fast computation. We introduce variational inference and deep generative models to Item Response Theory to offer the best of both worlds. The resulting algorithm is (a) orders of magnitude faster when inferring on the classical model, (b) naturally extends to more complicated input than binary correct/incorrect, and more expressive deep bayesian models of responses. Applying this method to five large-scale item response datasets from cognitive science and education, we find improvements in imputing missing data and better log likelihoods. The open-source algorithm is immediately usable.



There are no comments yet.


page 1

page 2

page 3

page 4

Code Repositories


A PyTorch implementation of "Variational Item Response Theory: Fast Accurate, and Expressive"

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The task of estimating human ability from stochastic responses to a series of questions has been studied since the 1950s in thousands of papers and spans several domains. The corresponding Item Response Theory algorithms are used every day around the world, and are currently deployed in many human critical contexts including college admissions test, school-system assessment, survey analysis, popular questionnaires, and medical diagnosis.

Despite a large body of literature, contemporary IRT methods fall short – it remains surprisingly difficult to estimate human ability from stochastic responses. One crucial bottleneck is that the state of the art Bayesian inference algorithms are prohibitively slow to run on reasonable sized datasets even with very rudimentary models of response. This leaves practitioners with a choice: either have interpretable, nuanced Bayesian models with appropriate inference or have timely computation.

In the field of artificial intelligence a revolution in deep generative models via “variational inference" have demonstrated an impressive ability to perform fast inference on complex Bayesian models. In this paper, we present a novel application of variational AI to the problem of IRT. We present necessary derivations for the corresponding inference algorithms and apply the algorithms both to synthetic experimentation as well as real world datasets.

Specifically, our contributions are as follows:

  1. Variational inference for IRT: We derive a novel optimization objective to perform inference for Item Response Theory. To do so, we derive a variational posterior distribution over ability and item characteristics that is (importantly) amortized over subjects.

  2. Faster inference: We find our inference approach to be orders of magnitude faster than previous inference techniques and be usable on much larger datasets with no loss in accuracy.

  3. More expressive:

    Our inference approach is naturally compatible with deep generative models and, as such, we enable the novel extension of Bayesian IRT models to use neural network based representations for inputs, predictions, and student ability. We develop the first deep generative IRT models.

  4. Real world application: We demonstrate the impact of faster, expressive inference by applying our algorithms to datasets including: PISA, DuoLingo and Gradescope. We achieve up to 200 times speedup and show improved accuracy at imputing hidden responses.

The potential real world impact of this extension is two-fold. More accurate models can be constructed for existing applications of Item Response Theory, such as for visual acuity tests [piech2019stanford], or educational assessments such as PISA [organisation2016pisa]. Second, Item Response Theory can be extended for students who produce work where their responses go beyond binary correct/incorrect (imagine students writing text, drawing pictures, or even learning to code) encouraging educators to give students more engaging items without having to give up Bayesian inference.

2 Background

We briefly review the many variations of item response theory and the fundamental principles of approximate inference, focusing on modern variational inference.

2.1 Item Response Theory

Imagine answering a series of multiple choice questions. For example, consider a personality survey, a homework assignment, or a medical examination. As a subject, selecting a response to each question is an interaction between your ability and the characteristics of the question, such as its difficulty. The goal in examination analysis is to gauge this unknown ability of each student and the unknown item characteristics based only on his or her responses. Early procedures [edgeworth1888statistics]

defaulted to very simple methods such as counting the number of correct responses, ignoring differences in question quality. In reality, we understand that not all questions are created equal: some may be hard to understand while others may test more difficult concepts. To capture these nuances, Item Response Theory (IRT) was developed as a mathematical framework to reason formally about people’s responses to items. The core principle underlying IRT is that the probability of a correct response to an item is a logistic function of only the person ability and item characteristics.

Being widely popular, the IRT model plays an impactful role in many large institutions. It is the preferred method for estimating ability in several state assessments in the United States, for international assessments gauging educational competency across countries [harlen2001assessment], and for the National Assessment of Educational Programs (NAEP), a large-scale measurement of literacy and reading comprehension in the US [ravitch1995national]. Beyond education, IRT is a classical method widely used in cognitive science and psychology, especially with regards to studies of language acquisition and development [hartshorne2018critical, magdalena2016ratings, frank2017wordbank, braginsky2015developmental].

IRT has many forms, of which we review a few:

Figure 1:

Graphical models for the (a) 1PL, (b) 2PL, and (c) 3PL Item Response Theories. Observed variables are shaded whereas latent variables are white. Arrows represent dependency between random variables and each rectangle represents a plate. Note nested plate as there is a response for each person and item.

The simplest class of IRT summarizes the ability of a person with a single parameter. This class contains three versions: 1PL, 2PL, and 3PL IRT, each of the differ by the number of free variables used to characterize an item. The 1PL IRT model, also called the Rasch model [rasch1960studies], is given in Eq. 1,


where persons and items with being the response by the -th person to the -th question. Each item in the 1PL model is characterized by a single number representing difficulty, . Notice that the functional form of the 1PL model is a logistic function. Thus, a higher difficulty requires a higher ability in order to respond correctly. The 2PL IRT model adds a discrimination parameter, for each item that controls the slope (or scale) of the logistic curve whereas the 3PL IRT model also adds a pseudo-guessing parameter, for each item that sets the asymptotic minimum of the logistic curve. We can expect a higher discrimination to more heavily separate people of low and high ability. We can also interpret pseudo-guessing as the probability of success if the respondent were to make an educated guess on this question. 111We default to 2PL as the pseudo-guessing parameter introduces several invariances in the model. This requires far more data (in the millions) to infer ability accurately, as measured by our synthetic experiments. For practitioners, we warn against using 3PL for small to medium datasets. The 2PL and 3PL IRT models are shown in Eq. 2. Fig. 1 shows the graphical models for each of the IRTs.


where for 2PL and for 3PL.

How good of an approximation is IRT to the true underlying function for human cognitive reasoning? The answer varies, and is dependent by the dimensionality of the construct. For instance, if we are measuring a person’s understanding on elementary arithmetic, then a single dimension may suffice in capturing the majority of the variance: the simplicity of the domain is limited to four basic operations. However, if we are instead measuring a person’s general mathematics ability, a single real number no longer seems sufficient. Even if we bound the domain to middle school mathematics, there are several factors that contribute to “mathematical understanding" (e.g. proficiency in algebra versus geometry). Summarizing a person with a single number in this setting would result in a fairly loose approximation.

For these cases where multiple facets of ability (e.g. intellectual versus socio-emotional) contribute to performance, we consider multidimensional item response theory [ackerman1994using, reckase2009multidimensional, mcdonald2000basis]. For simplicitly, we show only 2PL multidimensional IRT:


where we use bolded notation to represent a

dimensional vector. Notice that the discrimination is also a vector of equal size to ability. The 1PL and 2PL MIRT models follow from Eq. 


In this paper, we make the following simplifying assumptions common in standard item response theory [reckase2009multidimensional]: (1) person abilities cannot change over the course of answering questions, (2) the response by a person to one test item is independent of the responses to any other item, (3) the relation between ability and item characteristics is linear, and (4) the responses are binary. Here, we will explore both nonlinear generalizations to IRT as well as polytomous responses. Other extensions to knowledge tracing, teslets, nonparameteric and unfolding models should be straightforward and are left as future work.

In practice, we are given a dataset of responses and must “fit" one of the many IRT models. In other words, given a matrix of observed responses, we want to infer the ability of all persons (multi-dimensional or not) and the characteristics of all items presented. Often, a portion of the entries are missing. For instance, a student may have skipped a question on the exam or perhaps different versions of an assessment were given to different population groups. In the next section, we provide a brief overview of inference in item response theory followed by a technical introduction to modern approximate inference techniques.

2.2 Inference in Item Response Theory

We compare and contrast the three popular methods used to do inference for IRT in research and industry.

Joint Maximum Likelihood

The most straightforward approach is to jointly pick the most likely ability and item features given the observed responses. We optimize the following objective:


with any standard optimizer e.g. stochastic gradient descent. The symbol

represents all item features e.g. for 2PL. Eq. 5 is often called the Joint Maximum Likelihood Estimator [beguin2001mcmc, embretson2013item], abbreviated MLE. MLE poses inference as a supervised regression problem in which we choose the most likely unknown variables to match known dependent variables. While MLE is simple to understand and implement, we lack any measure of uncertainty which have important consequences when responses are missing.

Marginal Maximum Likelihood

In classic works, many have pointed that when using MLE, the number of unknown parameters (that we must optimize for) unhealthily increases with the number of persons [bock1981marginal, haberman1977maximum]. In particular, [haberman1977maximum] shows that in practical settings with a finite (and often small) number of items, standard convergence theorems do not hold for MLE as the number of persons grows.

To remedy this, the authors instead view ability as a nuisance parameter: one that should be marginalized out and removed from consideration [bock1981marginal, bock1988full]

. Brock et. al. introduces an Expectation-Maximization (EM)

[dempster1977maximum] algorithm to iterate between (1) updating beliefs about item characteristics and (2) using the updated beliefs to define a marginal distribution (without ability) by numerical integration of . Appropriately, this algorithm is referred to as Maximum Marginal Likelihood Estimation, abbreviated as EM. Eq. 7 shows the E and M steps for EM.


where represents the iteration count. We often choose to be a simple prior distribution like standard Normal. In general, the integral in the E-step is intractable: EM uses a Gaussian-Hermite quadrature to discretely approximate . See [harwell1988item] for a closed form expression for in the M step. As a property of EM, this method finds the maximum a posteriori (MAP) estimates for item characteristics. Interestingly EM does not infer ability as it is “ignored" in the model, which may be a limitation in some use cases. The common practice is to use EM to infer item characteristics, then fit ability using a second auxiliary model. In practice, EM has grown to be very popular in industry especially as it is fast for small to medium sized datasets. However, we expect that EM may scale poorly to large datasets and multi-dimensional IRT models as numerical integration requires far more points to properly measure high dimensional volume.

Bayesian Item Response Theory

EM begins to incorporate uncertainty into its inference by using a prior distribution along with the conditional, thereby by Bayes rule, is proportional to the posterior. However, it draws only the most likely sample from the posterior, which ignores variance and may struggle with multi-modal distributions. Ideally, we would like to draw i.i.d. samples from the posterior or perhaps even derive a closed form expression. To achieve the former, MCMC methods have been applied to item response theory. While theoretically appealing, these approaches have not yet gained widespread popularity due to their cost, often several magnitudes more expensive than EM or MLE. In the next subsection, we review MCMC and explore a more modern approach to approxmiate Bayesian inference: variational methods. We hypothesize that variational approaches may preserve some of the theoretical benefits of Bayesian IRT without the overt cost of MCMC.

2.3 Approximate Inference

We will present approximate inference in a general manner and revisit its exact applications to IRT in Sec. 3.


be a joint distribution over a set of latent variables

and observed variables . The task of inference is to compute posterior beliefs over the latent variables after incorporating evidence into the prior distribution: . This quantity, however, is usually intractable as the marginal likelihood requires solving an infeasible integral. Since we are also not guaranteed to know the form of , even direct sampling is difficult. Thus, we are forced to approximate the posterior instead.

Sampling Methods

Largely, there are two classes of approximate inference techniques. The first, known as Markov Chain Monte Carlo (MCMC) sampling

[hastings1970monte, gelfand1990sampling] draws samples from the desired distribution by constructing a Markov chain carefully designed such that is the equilibrium distribution. By mixing the chain, the more steps taken, the more closely the distibution of drawn samples matches the true posterior. A primary critique of the MCMC algorithm is that it struggles in high dimensions where samples drawn from the current proposal distribution are commonly rejected, resulting in very slow mixing. To address this issue, the Hamiltonian Monte Carlo (HMC) algorithm [neal2011mcmc, neal1994improved, hoffman2014no] improves upon the MCMC algorithm by choosing the next sample in a more sophisticated manner (rather than a random walk procedure). It does so by solving a Hamiltonian dynamical system (via numerical integration) to define the next proposal distribution. Intuitively, doing so reduces the correlation between successive samples, ensuring a consistently higher probability of acceptance via properties of the Hamiltonian dynamics. We recommend reading [hoffman2014no] for a good review of HMC. The strengths of these algorithms are that the samples are (mostly) guaranteed to be drawn from the true posterior ; correlation issues aside, we can be confident in using the samples for downstream computation. However, the main drawbacks of sampling algorithms are computation — mixing an MCMC chain or using the Leapfrog integrator require signficant compute, meaning these algorithms scale poorly to large datasets and high dimensions. In the context of IRT, with new datasets of upwards of a million people, such limitations can be debilitating.

Given their lengthy existence, MCMC based approaches have been the de-factor approach in psychology, cognitive science, and education. However, there exist a second class of approximate inference techniques that have gained significant attention in the machine learning community. Next, we provide a careful review of this second class of algorithms, named

variational inference.

Variational Methods

In variational inference [jordan1999introduction, wainwright2008graphical, blei2017variational], or VI, we introduce a family of tractable distributions over the latent variables (such that we can easily sample and score). We wish to find the member that minimizes the Kullback-Leibler (KL) divergence between itself and the true posterior:


where are parameters that define each distribution. For example,

would be the mean and scale for a Gaussian distribution. Since the “best" approximate posterior

depends on the observed variables, its parameters have as a dependent variable. To be clear, there is one approximate posterior for every possible value of the observed variables.

Frequently, we need to do inference for many different values of the observed variables . Let be an empirical distribution over the observed variables, which is equivalent to the marginal if the generative model is correctly specified. Then, the average quality of the variational approximations is measured by


In practice, is unknown but assume access to a training dataset of examples i.i.d. sampled from . Since we never need to score, this is sufficient to evaluate Eq. 9. We have essentially converted inference to an optimization problem of picking the best parameters to maximize an objective function. We introduce a few additional details of VI that will be useful in the Methods.


As presented in Eq. 9, we must learn an approximate posterior for each . For a large dataset , this can quickly grow to be unwieldly. One such solution to this scalability problem is amortization [gershman2014amortized], which reframes the per-observation optimization problem as a supervised regression task. Consider learning a single deterministic mapping to predict or equivalently as a function of the observation . Often, we choose to be a conditional distribution, denoted by .

The benefit of amortization is a large reduction in computational cost: the number of parameters is vastly smaller than learning a per-observation posterior. Additionally, if we manage to learn a good regressor, then the amortized approximate posterior could generalize to new observations

unseen in training. This strength has made this approach popular with modern latent variable models like Variational Autoencoders


The drawback of this approach is that it introduces an amortization gap, since we are technically using a less flexible inference model. Instead of Eq. 9, we instead get the following optimization problem,


The gap comes from the cost of sharing the variational parameters over all examples in the dataset (mathematically represented as pulling the out of the outer expectation) as opposed to individual parameters per example.

Model Learning

So far we have assumed a fixed generative model . However, often we can only specify a family of possible models parameterized by . The symmetric challenge (to approximate inference) is to choose whose model best explains the evidence. Naturally, we do so by maximizing the log marginal likelihood of the data


Using what we did in Eq. 10, we derive the Evidence Lower Bound (ELBO) with as our inference model


We can jointly optimize and to maximize the ELBO. In practice, we often parameterize and with deep neural networks, as is common with the VAE [kingma2013auto].

Stochastic Gradient Estimation

The gradient of the ELBO (Eq. 12) with respect to and are:


Eq. 13 can be estimated using Monte Carlo samples. However, as it stands, Eq. 14 is difficult to estimate as we cannot distribute the gradient inside the inner expectation. For certain families , we can use a reparameterization trick.

Reparameterization Estimators

Reparameterization is the technique of removing sampling from the gradient computation graph [kingma2013auto, rezende2014stochastic]. In particular, if we can reduce sampling to sampling from a parameter-free distribution plus a deterministic function application, , then we may rewrite Eq. 14 as


which now can be estimated efficiently by Monte Carlo (the gradient is inside the expectation). A benefit of reparameterization over alternative estimators (e.g. score estimator [mnih2014neural] or REINFORCE [williams1992simple]) is lower variance while remaining unbiased. A common example is if is Gaussian and we choose to be , then .

Normalizing Flows

A critique of VI is the need to specify the family of distributions

to be reparameterizable, of which the most popular is the Normal family. If for example, the true posterior is multi-modal, our best Normal approximation will only capture one mode. To access a richer family of distributions, we consider transforming samples from our Normal distribution to a more complex one.

Let be a continuous and invertible mapping. Given , define . Then,

has the following probability distribution:


where is the Jacobian determinant. such mappings can be composed together: . In this case, the log likelihood of a transformed sample is

Then, , an arbitrarily complex (potentially multi-modal) distribution. Computing the Jacobian determinant is intractable for most functions . Several works [rezende2015variational, tomczak2016improving, berg2018sylvester] have posed tractable families of functions for in which the Jacobian determinant is either 1 or a computable sparse matrix. In our discussion, we will explore planar flows [rezende2015variational] with VI.


where and are learnable parameters; is an elementwise nonlinearity and its derivative.

3 Methods

Having viewed the major principles of variational inference, we are ready to present variational inference for IRT, which will adapt many of the same ideas above.

Figure 2: Graphical models with reverse arrows showing inference for VIRTU (a) and VIRTU (b).

We assume that we are given an empirical dataset of binary responses from persons to items. This dataset may or may not contain missing entries. We use the index to enumerate different persons and the index to enumerate different items. For example, refers to the (possibly multidimensional) ability for person . Likewise, we will use the notation to represent all the characteristics of item . For instance, for a 3PL unidimensional IRT model, we have . First, we derive two novel lower bound on the log joint distribution for a single person where are the -th person’s responses to all questions. The generative model, unless otherwise specified, is an IRT model and thus includes no additional parameters.

Theorem 3.1

Let be the ability for person and be the characteristics for item . We use the shorthand notation . Let be the binary response for item by person . We write . If we defined the following terms,

then the following inequality is true:

where the joint distribution factors as .

See Sec. A for proof.

Corollary 3.1.1

Using the definitions in Thm. 3.1, if the variational distribution factors as follows:

Then the following lower bound holds

where .

Near identical to proof of Thm. 3.1. See Sec. A. The primary difference between Thm. 3.1 and Coro. 3.1.1 is how the variational joint distribution factors. As practitioners, we are free to choose the form of . In Thm. 3.1, we assume and are independent conditioned on whereas we do not make that assumption in Coro. 3.1.1. In the supplement, we provide evidence that the independence assumption is too strong and decreases the quality of inference dramatically. We call the lower bound in Coro. 3.1.1 VI. We will call the lower bound in Thm. 3.1 VI to make its assumptions clear. Fig. 2 shows graphical models distinguishing the two. Mathematically, we can write the VI objective as:


where is an empirical distribution over person responses to the items. As mentioned, we are only given a dataset of persons responding to items. However, such a dataset is sufficient for computing the VI objective. The gradients of the VI loss with respect to can be estimated with Monte Carlo samples as:


Like in Eq. 15, we may wish to move the gradient inside the KL divergences by reparameterization to reduce variance. To allow easy reparameterization, we define all variational distributions as Normal distributions with diagonal covariance. In practice, we find that estimating the gradient in Eq. 20 with a single simple is sufficient. With this setup, the VI loss can be optimized using stochastic gradient descent to learn an amortized inference model that maximizes the marginal probability of observed data.

The posterior needs to be robust to missing data as often not every person answers every question. To do so, we explore the following design:


If we assume each component is Gaussian, then is Gaussian as well, being a Product-Of-Experts [hinton1999products, wu2018multimodal]. If item is missing, we replace its term in the product with the prior, representing no added information. We found this design to outperform averaging over non-missing entries: .

4 Related Work

We have already covered a variety of methods for parameter estimation in IRT such as MLE, EM, and MCMC. The benefits and drawbacks of these methods are well-documented [lindenHandbookItemResponse2017], so we need not discuss them here. Instead, we focus specifically on methods that utilize deep neural networks or variational inference to estimate IRT parameters.

We are aware of two approaches that incorporate deep neural networks into Item Response Theory: Deep-IRT [yeung2019deep] and DIRT [cheng2019dirt]. Deep-IRT is a modification of the Dynamic Key-Value Memory Network (DKVMN) [zhang2017dynamic] that treats data as longitudinal, processing items one-at-a-time using a recurrent architecture. Deep-IRT produces point estimates of ability and item difficulty at each time step, which are then passed into a 1PL IRT function to produce the probability of answering the item correctly. The main difference between DIRT and Deep-IRT is the choice of neural network: instead of the DKVMN, DIRT uses an LSTM with attention [vaswani2017attention]. In our experiments, we compare our approach to Deep-IRT and find that we outperform it by up to 30% on the accuracy of missing response imputation.

While variational inference has been suggested as a promising alternative to other inference approaches for IRT [lindenHandbookItemResponse2017], there has been surprisingly little work in this area. In an exploration of Bayesian prior choice for IRT estimation, Natesan et al. [natesan2016bayesian] posed a variational approximation to the posterior:


We note a few differences between Eq. 22 and our variational approach: First, we amortize over persons, meaning our variational posterior is dependent on the response of person to items. This is critical to good performance as it prevents overfitting to the observed dataset (one can imagine that using Eq. 22 is problematic for values of or that were not seen in training) and more importantly, Eq. 22 requires fitting a new posterior for each person and item . This would result in marginal speed ups (if any) compared to more expensive inference techniques. Second, Eq. 22 has the same form as VI, which we show in the supplement to dramatically degrade performance. The insight to decompose instead of assuming independence is both theoretically [webb2018faithful] and experimentally supported.

Curi et al. [curi2019interpretable] used a Variational Autoencoder (VAE) to estimate IRT parameters in a 28-question synthetic dataset. However, this approach only modeled ability as a latent variable, ignoring the full joint posterior incorporating item features. In this work, our analogue to the VAE builds on the IRT graphical model, incorporating both ability and item characteristics in a principled manner. This could explain why Curi et. al. report the VAE requiring substantially more data to recover the true parameters when compared to MCMC whereas we find comparable performance of our variational lower bound to MCMC.

(a) Synthetic: Correlation with True Ability
(b) Synthetic: Missing Data Imputation
(c) Real World: Missing Data
(d) Real World: Time Cost
(e) Nonlinear: Missing Data
(f) Nonlinear: Log Likelihoods
(g) Posterior Predictive Checks
Figure 3: (a) Correlation of inferred ability with true ability (known in synthetic setting) for VI, MLE, and HMC. The y-axis reports the log seconds (e.g. 10 log seconds is 6.1 hours). (b) Missing data imputation (shown as printed numbers) for VI, MLE, EM, and HMC in a synthetic dataset. (c) Missing data imputation for real world data as well as (d) cost in log seconds. (e) Missing data imputation and (f)log likelihoods of training data for various nonlinear IRT models trained with VI. (e) Samples statistics from the predictive posterior defined using HMC and VI. Subfigures in red show the average number of items answered correctly for each person. Subigures in yellow show the average number of persons who answered each item correctly.

5 Datasets

We explore one toy dataset to build intuiton and five large scale applications of IRT to measure ability in either language acquision or educational assessments.

# Persons # Items Missing Data?
CritLangAcq 669498 95 N
WordBank 5520 797 N
DuoLingo 2587 2125 Y
Gradescope 1254 98 Y
PISA 519334 183 Y
Table 1: Dataset Statistics

Synthetic IRT

To sanity check that VI performs as well as other inference techniques, we synthetically generate a dataset of responses using a 2PL IRT model: sample , where the prior distributions are spherical Gaussian (recall that

contains all item characteristics). Given ability and item characteristics, IRT-2PL determines a Bernoulli distribution over responses to item

by person . We sample once from this Bernoulli distribution to “generate" an observation. We vary and to stress test VI.

Second Language Acquisition

This dataset contains native and non-native English speakers answering questions to a grammar quiz222The quiz can be found at The data is publically available at, which upon completion would return a prediction of the user’s native language. Using social media, over half a million users of varying ages and demographics completed the quiz. Quiz questions often contain both visual and linguistic components. For instance, a quiz question could ask the user to “choose the image where the dog is chased by the cat" and present two images of animals where only one of image agrees with the caption. Every response is thus binary, marked as correct or incorrect. In total, there are 669,498 persons with 95 items and no missing data. The creators of this dataset use it to study the presence of absence of a “critical period" for second language acquisition [hartshorne2018critical]. We will refer to this dataset as CritLangAcq.

WordBank: Vocabulary Development

The MacArthur-Bates Communicative Development Inventories (CDIs) are a widely used metric for early language acquisition in children, testing concepts in vocabulary comprehension, production, gestures, and grammar. The WordBank [frank2017wordbank] database archives several independently collected CDI datasets across several languages and research The database consists of a matrix of persons against vocabulary words where the entry is 1 if person has knowledge of word when tested and 0 otherwise. If child was never quizzed on word , we consider the entry missing. For simplicitly, we ignore difference in children demographics and native languages (each child is treated as an independent entry). In total, there are 5,520 children responding to 797 items.

DuoLingo: App-Based Language Learning

We examine the 2018 DuoLingo Shared Task on Second Language Acquisition Modeling (SLAM) [settles2018second]. This dataset contains anonymized user data from the popular education application, DuoLingo. In the application, users must choose the correct vocabulary word among a list of distractor words. We focus on subset of native English speakers learning Spanish and only consider lesson sessions. Each user has a timeseries of responses to a list of vocabulary words, each of which is shown several times. We repurpose this dataset for IRT: the goal being to infer the user’s language proficiency from his or her errors. As such, we average over all times a user has seen each vocabulary item. For example, if the user was presented “habla" 10 times and correctly identified the word 5 times, he or she would be given a response score of 0.5. We then round it to be 0 or 1. We will revisit a continuous version in the discussion. After processing, we are left with 2587 users and 2125 vocabulary words with missing data as users frequently drop out. We ignore user and syntax features.

Gradescope: Course Exam Data

Gradescope [singh2017gradescope] is a popular course application that assists teachers in grading student assignments. This dataset contains in total 105,218 reponses from 6,607 assignments in 2,748 courses and 139 schools. All assignments are instructor-uploaded, fixed-template assignments, with at least 3 questions, with the majority being examinations. We focus on course 102576, randomly chosen from the courses with the most students. In the supplement, we provide results for 5 other courses from Gradescope (Table 12

). We remove students who did not respond to any questions. We binarize the response scores by rounding up partial credit. In total, there are 1254 students with 98 items with missing entries for course 102576.

PISA 2015: International Assessment

The Programme for International Student Assessment (PISA) is an international exam that measures 15-year-old students’ reading, mathematics, and science literacy every three years. It is run by the Organization for Economic Cooperation and Development (OECD). The OECD released anonymized data from PISA ’15 for students from 80 countries and education We focus on the science component. Using IRT to access student performance is part of the pipeline the OECD uses to compute holistic literacy scores for different countries. As part of our processing, we binarize responses, rounding any partial credit to 1. In total, there are 519,334 students and 183 questions. Not every student answers every question as many versions of the computer exam exist.

6 Fast and Accurate Inference

We will show that VI is as accurate as HMC but nearly as fast as MLE/EM, making Bayesian IRT a realistic, even optimal, option for modern applications.

6.1 Evaluation

We compare VI to HMC, EM666We use the popular MIRT package in R for EM with 61 points for numerical integration., and MLE using unidimensional IRT-2PL. For evaluation, we measure the wall-clock cost of each inference algorithm. However, speed only matters assuming good performance. To measure the latter, we use various approaches, the most popular being accuracy of imputing missing data by artificially holding out 10% of the responses. We use the inferred ability and item characteristics to generate responses, thereby populating missing entries. This is a very general metric and can be used in almost all cases. For the synthetic dataset specifically, because we know the true ability, we can measure the correlation between the it and the inferred ability under each algorithm (with the exception of EM). A correlation of 1.0 would indicate perfect inference. For real world data, we cannot do the same so we settle for a proxy. To compare HMC and VI in particular, we use posterior predictive checks. Recall that the predictive posterior is defined as:

With HMC, we are given samples from the posterior. With VI, replace with . Given samples for all such persons to , we measure (1) the average number of items answered correctly per person and (2) the average number of persons who answered item correctly. This practice is common in statistical psychology and cognitive science [sinharay2006posterior].

6.2 Results and Analysis

With synthetic experiments we are free to vary and to extremes to stress test the inference algorithms: first, we range from 100 to 1.5 million persons and 10 to 6k items ; second, we vary the number of dimensions from 1 to 5. Fig. 3(a,b) plot the show the correlation and missing data accuracy of VI, MLE, HMC, and EM. In both subplots, the y-axis represents log seconds: a small difference in log seconds is a dramatic difference in wall clock time. For example, 5 log seconds is about 3 minutes, 10 log seconds is about 6 hours, and 15 log seconds is about 900 hours.

We find several promising patterns. First, the computational cost of VI is only slightly higher than MLE (the backwards pass of VI is more expensive each step), both of which are magnitudes lower than HMC. For instance, with 1.56 million persons, HMC takes 217 hours whereas VI takes 800 seconds. Similarly with 6250 times, HMC takes 4.3 hours whereas VI takes 385 seconds. Although EM is known for being cheap, it does not scale as well as VI for large datasets. For example, with 1.56M persons, EM takes 4.3 hours; with 6250 items, EM takes 43 minutes. With low to medium sized datasets, EM remains the fastest. However, this speed comes at a price. Properly marginalizing out ability is difficult to do numerically, especially as the number of dimensions increases. In all settings, we find the EM’s accuracy of imputing missing data to be lower by 10%. As the dimensionality increases, the performance of EM continues to drop, approaching 59% in 5d whereas VI, MLE, and HMC have an accuracy of 79%. HMC and VI are almost identical in performance in both correlation and missing data accuracy, which suggests that here, the variational posterior is a good approximation of the true one. The performance of MLE is close to VI and HMC although consistently lower by 0.5 to 1% in accuracy. This is likely due to MLE overfitting to the observed responses as it is a greedy algorithm.

We also find the limitations of VI. Because we chose to amortize inference over persons, VI will not perform very well given very few individuals. For instance,with 100 people, VI only achieves 0.56 correlation with the true ability where MLE and HMC do much better. However, amortized VI does not require too large of a dataset: with 500 people, we find equal performance between the inference algorithms. Under scarce data, we would recommend not amortizing VI (Eq. 8).

Fig. 3(c) show the accuracy of missing data imputation on real world datasets. We find the same patterns as we did in the synthetic experiments. Running HMC on CritLangAcq or PISA takes roughly 120 hours whereas VI takes 50 minutes for CritLangAcq and 5 hours for PISA, the latter being more expensive because of additional computation required for missing data. In comparison, EM is sometimes faster than VI (e.g. Gradescope, PISA) and sometimes slower. With respect to accuracy, VI and HMC are again identical, outperforming EM by up to 8% in missing data imputation. Interestingly, we find the “overfitting" of MLE to be more pronounced here. If we focus on the DuoLingo and Gradescope, the two datasets with pre-existing large portions of missing values, MLE is surpassed by EM, with VI achieving an accuracy 10% higher. Finally, Fig. 3(e) shows posterior predictive checks between VI and HMC. We find that the two algorithms strongly agree about the average number of correct persons and items in all datasets. The lowest agreement occurs with DuoLingo: since the number of items is greater than the number of persons, there is naturally more uncertainty.

In summary, through careful experimentation on synthetic and real world data, we find VI to be fast and accurate, matching HMC in performance and MLE in speed.

7 Deep Item Response Theory

IRT is a surprisingly good model for item responses despite its simplicitly. However, we may not believe the true idiosyncracies of every individual’s response is captured by a logistic function. If we were able to use nonlinear functions, we may be able to more faithfully capture the density of the data. We test this intuition here by merging item response theory with the modern machinery of deep learning.

7.1 Nonlinear Generalizations of IRT

We have assumed that is a deterministic IRT model defining a Bernoulli distribution for the response to each item. However, nothing limits us from replacing it with a more expressive generative model. With the speed and versatility of VI, we can support much more heavily parameterized models. Here, we consider three different alternatives with varying levels of expressivity that help define a class of more powerful nonlinear IRT.

Learning a Linking Function

We replace the logistic function in standard IRT with a nonlinear linking function. As such, it preserves the linear relationships between items and persons. We call this VI (Link). For person and item , the 2PL link generative model is:


where is a nonlinear transformation followed by a sigmoid to constrain the output to be between .

Learning a Neural Network

Here, we no longer preserve the linear relationships between items and persons and instead feed the ability and item characteristics directly into a neural network, which will combine the inputs nonlinearly. We call this version VI (Deep). For person and item , the deep generative model is:


where again

includes a Sigmoid function at the very end to preserve the correct output signatures. This is an even more expressive model than VI (Link).

Learning a Residual Correction

Although clearly a powerful model, we might fear that VI (Deep) becomes too uninterpretable. So, for the third and final nonlinear model, we use the standard IRT but add a nonlinear residual component that can correct for any inaccuracies. We call this version VI (Residual). For person and item , the 2PL residual generative model is:


During optimization, we initialize the weights of the residual network to 0, thus ensuring its initial output is 0. This encourages the model to stay close to IRT, using the residual only when necessary.

Since now has free parameters, we can jointly optimize it with the variational parameters as in Sec. 2.3. The gradient of the VI loss with respect to is straightforward to estimate by sampling without any need for reparameterization. Precisely, is equal to:

Again, a single sample should suffice.

7.2 Evaluation

A better generative model should better explain the data density. In other words, observations in the training dataset should have low surprisal to the model. To capture this intution, we estimate the log marginal likelihood of the training dataset. A higher number (closer to 0) is better. Because VI is amortized, we can evaluate the quality of inference on unseen data points. We held-out a portion of the generated dataset for a separate test dataset. With questions, there are unique combinations for responses. We designed the test dataset to contain combinations unseen in training. An undesirable model might overfit to the training combinations, thereby getting poor log likelihoods on the test set. For a single person, the log marginal likelihood of his or her responses can be computed as:


We use 1000 samples to estimate Eq. 26.

Additionally, we measure accuracy on missing data imputation as we did for the speed experiments above. A more powerful generative model that is more descriptive of the data should be better fill in missing values.

7.3 Results and Analysis

Fig. 3(d) show two plots: the left compares the accuracy of imputing missing data; the right compares the log likelihoods of observed data. We find a consistent trend: the more powerful generative models achieve a higher log likelihood (closer to 0) and a higher accuracy. This is perhaps unsurprising as a higher log likelihood indicates that the model better describes the data density, which would be make missing data imputation easier. In particular, we find very large increases in log likelihood moving from IRT to Link, spanning 100 to 500 log points depending on the dataset. Further, from Link to Deep and Residual, we find another increase of 100 to 200 log points. In some cases, we find Residual to outperform Deep despite that the two are equally parameterized, suggesting that initialization with IRT finds better local optima. These gains in log likelihood translate to a consistent 1 to 2% increase in accuracy for Deep/Residual over IRT.

We also compare our various deep generative IRT models with purely deep learning approaches, namely Deep-IRT [zhang2017dynamic]. Unlike traditional IRT models, Deep-IRT was built for knowledge tracing and assumed sequential responses. To make our datasets amenable to Deep-IRT, we assume an ordering of responses from to . As shown in Fig. 3(d), our models outperform Deep-IRT in all 5 datasets by as much as 30% in missing data imputation (see WordBank).

8 Discussion

We explore some features of VI and possible extensions that may be interesting for several real world uses.

8.1 Importance of Amortization

We stress the importance of amortization in our variational approach, both in terms of generalization and efficiency. Consider a dataset with 100 items for a total of possible responses for any given person. Even with a dataset of a million persons, almost all the possible responses remain unseen. Without amortization, we fit a variational posterior for each “unique" individual as characterized by their responses. Thus, we will only be able to do inference for the observed dataset. If we were to deploy such a system, it is quite likely that we receive a previously unseen set of responses, making inference difficult. With amortization, as we are learning a regressor from responses to posterior parameters, we may be able to generalize to a larger subset of the possible responses using only the million observed responses for training. In the supplement (Table 10), we show results measuring the log likelihoods of unseen responses under our inference model, in which we find similar log likelihoods to observed responses, suggesting a notion of generalization.

Dataset Amortized Un-Amortized
CritLangAcq 2.8k 43.2k
WordBank 176.4 657.1
DuoLingo 429.9 717.9
Gradescope 114.5 511.1
PISA 25.2k 125.8k
Table 2: Time Costs with and without Amoritzation

For efficiency, we find that amortized VI is much cheaper computationally, especially in circumstances in which the number of possible response vectors is very large (e.g. for CritLangAcq). Table 2 compares the wall clock time (sec.) for the 5 datasets. While VI is comparable to MLE and EM, unamortized VI is 2 to 15 times more expensive.

8.2 Extending to Polytomous Responses

Thus far, we have been only working with binary responses, true or false. However, many questionnaires and examinations are not binary: responses can be multiple choice (e.g. Likert scale) or even real valued (e.g. 92% on a course test). As we have posed IRT as a generative model, we have prescribed a Bernoulli distribution over the -th person’s response to the -th item. Yet nothing prevents us from choosing a different distribution, such as Categorical for multiple choice or Normal for real-values. The DuoLingo dataset contains partial credit, computed as a fraction of times an individual gets a word correct. A more granular treatment of these polytomous values should yield a more faithful model that can better capture the differences between people. So, we retrain DuoLingo where is a (truncated) Normal distribution over responses with fixed variance.

Inf. Alg. Train Test
VI (Link)
VI (Deep)
VI (Res.)
Table 3: DuoLingo with Polytomous Responses

Table 3 show the log densities: we observe the same patterns among the nonlinear models. In principle, modeling continuous outputs should be more powerful as it can distinguish between a person with response and a person with response whereas binarized models could not. In summary, we emphasize the flexibility of our generative framework to support varying data formats.

8.3 Beyond Gaussian Posterior Assumptions

In our experiments, we have chosen the variational family to be Gaussian with diagonal covariance, mainly for use. However, for certain distributions, such a simple family may not suffice. There is a rich body of work in the machine learning community dedicated to richer posterior distributions for variational inference [rezende2015variational, tomczak2016improving, berg2018sylvester, huang2018neural, kingma1606improving]. For demonstration, we compare log likelihoods using a Gaussian posterior versus a transformed posterior using planar flows [rezende2015variational].

Dataset VI VI-NF
Table 4: Effect of Richer Variational Posteriors

We compose 10 invertible flows to samples from both and . We can then rewrite the VI objective (Eq. 19) with planar flows as:


where the components are defined as:

By Eq. 16 and 18, we know equals where is the usual Gaussian posterior. A similar definition holds for . By the form of the planar flow (Eq. 18), there is a closed form expression for the Jacobian determinant. Since the expectation is over in Eq. 27, we can reparameterize to estimate gradients. One can show that VI-NF is still a lower bound on the expected log marginal . Table 4 shows results for a 2PL IRT generative model using VI. We find that VI-NF has better log likelihoods across all datasets. Future work may investigate more complex flow functions.

8.4 Interpreting the Linking Function

With nonlinear models, we face an unfortunate tradeoff between interpretability and expressivity. In domains like education, practitioners greatly value the interpretability of IRT where predictions can be directly attributed to ability or item features. With VI (Deep), our most expressive model, predictions use a black box neural network, making it hard to understand the interactions between persons and items.

Figure 4: Learned link functions for (a) CritLangAcq, (b) WordBank, (c) DuoLingo, (d) Gradescope, and (e) PISA. The dotted black line shows the default logistic function.

Fortunately, with VI (Link), we can maintain a degree of interpretability along with power. The “Link" generative model is identical to IRT, only differing in the linking function (i.e. item response function). Each subfigure in Fig. 4 shows the learned response function for one of the real world datasets; the dotted black line represents the standard linking function, a sigmoid. We find three types linking functions: (1) for Gradescope and PISA, the learned function stays near a Sigmoid. In these cases, we find the log densities for VI (Link) to be close to VI (IRT) in Table 10; (2) For WordBank and CritLangAcq, the response function is no longer monotonic. In fact, it closely resembles an unfolding model [liu2018fitting, andrich1993hyperbolic], which encodes a more nuanced interaction between ability and item characteristics: higher scores are related to higher ability only if the ability and item characteristics are “nearby" in latent space. (3) For DuoLingo, we find a piecewise function that resembles a sigmoid for positive values and a negative linear function for negative values. In cases (2) and (3) we find much greater differences between VI (IRT) and VI (Link) in Table 10. For DuoLingo, VI (Link) matches the log density of more expressive models, suggesting that most of the benefit of nonlinearity is exactly in the linking function.

8.5 When to use what?

An outstanding question a practitioner might have after reading this paper is: Which inference algorithm is best for what setting? Not wanting to add to the deluge of IRT strategies, we humbly suggest the following policy: If your data size under 10k, use HMC (or EM if speed is of utmost importance); if your data size is under 100k, use VI (or MLE if no missing data); anything beyond 100k, use VI and consider nonlinear IRT. If you are working with multidimensional IRT, use VI or HMC. If the number of items is past 500, use VI (or MLE if no missing data).

9 Conclusion

Item Response Iheory is a paradigm for reasoning about the scoring of tests, surveys, and similar instruments for measuring aptitude. Inferring ability and item characteristics poses a technical challenge: balancing efficiency against accuracy. While the most principled approach, Bayesian IRT, has existed, it posed such a large computational cost that it was unrealistic to use. In this paper, we posed an alternative algorithm for Bayesian IRT that was magnitudes faster and as performant as MCMC. The effectivness of variational inference led to a discussion on nonlinear IRT, which we found gave performance gains.


Appendix A Proofs of Theorem and Corollary

In this section, we will restate the main Theorem and Corollary from the main text as well as provide proofs of the statements.

Theorem A.1

Let be the ability for person and be the characteristics for item . We use the shorthand notation . Let be the binary response for item by person . We write . If we defined the following terms,

then the inequality is true:

where .

Begin by expanding the marginal distribution and applying Jensen’s Inequality.

Focus on below.