Bayesian inference in educational measurement often demands that we sample from posterior distributions of the form:
is a vector of binary observations,is a prior density, and111We suppress dependence on parameters when there is no risk of confusion.
is a logistic cumulative distribution function (cdf) with locationand (positive) scale parameter such that the posterior depends on the observations only via Dawid (1979)
. This is the problem this article is about. Note that the same problem is encountered in the Bayesian analysis of (dynamic) logistic regression models, negative-binomial regression models, auto-logistic models<e.g.,¿Besag1975, and the logistic stick-breaking representations of multinomial distributions <e.g.,¿LindermanEtAl2015.
The problem is ubiquitous in Bayesian item response theory (IRT) modeling and must be considered, for instance, when represents student ability in a two-parameter logistic (2PL) model (Birnbaum, 1968, p. 400) and we wish to produce so-called plausible values; i.e., sample from the posterior of ability Mislevy (1991); Marsman, Maris . (2016). In fact, any full-conditional posterior distribution of the 2PL (e.g., the posterior of the item parameters given ability) is of the same form (see Appendix). The same is true for multi-dimensional IRT models. Thus, if we solve the problem, we can start building Gibbs samplers for most, if not all, IRT models used in practice.
A solution was presented by MarsmanIsing in a paper on Bayesian inference for the Ising network model222The Ising model originates in statistical physics and is popular for image analysis <e.g¿[8.2.3]MarinRobert14. It is gaining popularity as a model for network psychometrics; see Epskamp16 and MarsmanEtAlMBR. A general and fundamental discussion of the relation between the Ising model and IRT can be found in Kruis16. Lenz (1920); Ising (1925) and their sampling algorithm is the topic of the present paper. Our goal is to prove that the efficiency of the algorithm improves when the number of observations becomes large. This property makes the algorithm particularly suited for large scale applications in educational measurement. Where MarsmanIsing state their algorithm in just a few phrases we will make an effort to explain how and why its works. For ease of presentation, we choose the familiar context of plausible values. We then discuss the efficiency of the algorithm with large samples and end with a discussion.
2 A Lemma
The following innocuous lemma is the basis for the algorithm.
Consider a set of independent random variables
independent random variableseach with a potentially unique distribution function , where . Define, for , the random indicator variables
Then, for ,
The next sections will explain the significance of Lemma 1
and, to this aim, its notation will be used throughout the text. Observe that the joint distributionin the lemma is of the same form as the posterior distribution in Equation 1. For our present purpose, we choose to be the prior distribution, and assume that , for , is logistic so that corresponds to the target posterior ; defined by Equations 1 and 2. The random variables will be called auxiliary variables. The letters , and are reserved for indexes with running from one to and and from zero to . We use the original indexes when we refer to entries of . For example, the second entry of will be written as . This makes it easy to remember that .
3 About the Algorithm
We explain how the algorithm is used to generate plausible values starting with the simple case where all items in the test are equivalent (Rasch) items and the item response functions (IRF), , are standard logistic cdfs. The problem is then the following: A student produces an item response pattern and our aim is to sample from the posterior of ability:
where we assume a standard logistic prior . Note that, even in this simple case, the posterior is not a known distribution and the problem is not trivial and many have therefore turned to normal-ogive models.
3.1 The Simplest Case
A posterior can be defined as the collection of values of that generate the observations <cf.¿[section 3.1]Rubin84. This powerful idea implies that we can sample from the posterior if we can simulate data <e.g.,¿MarsmanEtAl2017. In IRT, simulating data is simple and entails sampling from the joint distribution by first drawing an ability from and then responses from . On this occasion we are not interested in the data but in the ability that generated the data. To wit, if an ability generates a response pattern , it is a plausible value for a student with correct responses.
Unfortunately, it is likely that and we generate a plausible value for the wrong student. A solution is to simulate a number of response patterns and keep the ability that generated a response pattern whose sum equals . To this effect, we use Lemma 1 which provides the following algorithm to simulate response patterns from one sample of auxiliary variables:
Algorithm 1 implements the method of composition Tanner (1996) as we described it earlier. That is, we draw a sample of auxiliary variables each of which is then called on to be the ability and used to generate item responses. We illustrate this with a small example.
Example 1 (Toy Example).
For example, if we sample we generate four response patterns:
Note that, obviously, we do not generate each of the possible response pattern but we do generate each possible score.
Finally, we observe that we can actually get our plausible value without calculating the response patterns. The sum is, by definition, the number of values smaller than . This means that each generated response pattern has a different sum and there is always one whose sum corresponds to . Noting that this response pattern is generated by the th smallest of the auxiliary variables we find the following procedure to generate a plausible value for the student with a test-score :
Note that this algorithm is simple and computationally cheap. Only a partial sort is required to find the smallest of set of numbers and computation time is of order . For later reference, we call Algorithm 2 the Sum-Matched Approximate Bayes (SM-AB) algorithm: The name will be explained in the next section.
3.2 The General Case
In practice, the items will differ in difficulty and/or discrimination and we now consider what happens when the items follow a 2PL. This means that the item response functions are logistic as in Equation 2 with the discrimination and the easiness parameter of item .
3.2.1 The Sum-Matched Approximate Bayes Algorithm:
Formally, Algorithm 2 draws a value from a proposal, , such that . A sample of auxiliary variables determines which proposal is selected. This means that the proposal is random and the algorithm produces an independent sample from a mixture of proposals as illustrated with the following example.
Example 2 (Toy Example Cont.).
Suppose that and we select which is drawn from a proposal . If we repeat this a number of times we might:
so that is: drawn from proposal: ⋮ ⋮
Note that each sample of auxiliary variables generates a unique response pattern and each of these is a random permutation of the observed response pattern .
If the auxiliary variables are identically distributed, all proposals are the same and we always sample from the target posterior. This is no longer true when the prior is not standard logistic and/or the items are not equivalent. Specifically, if , is drawn from:
where . We see that the prior trades places with the IRF of the th item so that, once more, we sample a plausible value for the wrong student; that is, a student who scored correct answers on a (slightly) different test. If, for instance, is a normal prior, we would sample from a posterior where the prior is logistic and the th item is a normal-ogive item (Lord Novick, 1968, pp. 365-366). A further complication is added when the items differ in discrimination and the sum is no longer sufficient. Even if we would not sample from the correct posterior unless the weighted sums match; i.e., .
It follows that in any realistic situation, is most likely drawn from a proposal that approximates the target posterior. The name we gave to Algorithm 2 is intended to emphasize that each approximate proposal is “matched” to the target posterior on the sum; i.e., the number of correct answers.
3.2.2 The Sum-Matched Metropolis-Hastings Algorithm:
The SM-AB algorithm produces an independent sample from an approximate posterior and although experience suggests that it works quite well (see Figure 1), it is not exact. As a remedy, MarsmanIsing add a Metropolis-Hastings step Metropolis . (1953); Hastings (1970) and construct a dependent sample from the target posterior. This gives us the Sum-Matched Metropolis-Hastings (SM-MH) algorithm:
Draw an ability and a response pattern from
The MH algorithm produces a sample from a Markov chain. The acceptance probability functionis chosen such that the following detailed balance condition holds:
and the chain converges to the target posterior for any initial value . This means that the SM-MH algorithm produces a sequence of values such that, as increases, the distribution of these values more closely approximates the target posterior. Note that the MH algorithm works with any proposal even if it is random Tierney (1994).
After some algebra, detailed in the Appendix, we find that the acceptance ratio can be written as:
with, for ,
Note that the MH step adds little computational complexity and the expressions remain simple for large values of .
The acceptance ratio indicates how probable the new value is with respect to the current value, according to the target posterior. If , it depends only on the difference between the weighted sums of the generated and the observed response pattern. If , it also accounts for the fact that item has traded places with the prior.
If, for example, we have Rasch items and assume a logistic prior
for any , so that we always accept if . If all items are equivalent, we accept regardless of and the MH algorithm reduces to the SM-AB algorithm.
4 Simulation Experiments
In this section, we present some simulation results to evaluate the computational burden and the auto-correlation in samples produced by the SM-MH algorithm. We use the R-code given in the Appendix on a ASUS R301L laptop with an Intel i5 CPU with a clock speed of 2.2 GHz and 3.8 gigabytes of memory running Ubuntu.
4.1 Computational Complexity
The complexity of the algorithm is determined by the generation and sorting of the auxiliary variables. As mentioned above, finding the th smallest value of a set of numbers does not require a full sort. In fact, it can be done in linear time using the quickselect algorithm Hoare (1961) which means that computation time increases linearly with as illustrated333For each value of we simulate a thousand response patterns assuming a 2PL. For each pattern one iteration is performed. We take the average of the computing times. in Figure 2.
Note that loops are time-consuming when performed in R. Especially when is large, significant computational advantages can be obtained by coding (parts of) the algorithm in a compiled language (e.g., C, C++, Fortran, Pascal). Comparing, for example, the right with the left-hand panel in Figure 2 shows that a C implementation is about thirty times faster.
4.2 Autocorrelation and Convergence
If we accept all proposed values in the first two iterations, convergence is almost immediate (see Figure 1). After these first iterations, acceptance rates remain high. Figure 3 gives an impression of typical autocorrelations with observations. The left panel shows that, when the items are Rasch items, the SM-MH algorithm comes close to generating an independent and identically distributed sample, with virtually no autocorrelation whatsoever. The right panel shows that autocorrelations are higher when items differ in discrimination. Even then, Figure 4 suggests that autocorrelations vanish when increases and acceptance probabilities go to one. It is the burden of the next sections to prove that this is indeed the case.
5 Efficiency of the SM-AB Algorithm when Increases
In this section, we will prove that, as increases, the SM-AB algorithm will increasingly produce proposals that are accepted in the MH step. The SM-AB and the SM-MH will become equal and the cost of producing a sample of observations from the posterior approximates that of generating a sample of observations under the posited model; i.e., one response pattern444If we use simulation we would on average need to generate response patterns, a number that would increase rapidly if and decreases to zero..
We first demonstrate that the proposal value converges to the true value. We then demonstrate that acceptance rates in the SM-MH algorithm will increase in proportion to the square root of .
Here is a second lemma that proves that the proposal value from the SM-AB algorithm will converge (strongly) to the true value of .
Consider a set of independent random variables . Define a second, set of random variables such that , for all . Then,
where and denotes the th order statistic of the ; i.e., the th smallest with the integer part of .
Define the random function (and similarly ) to be
such that corresponds to . We find the difference between the random functions and to be characterized by a simple random walk:
where . is a simple random walk since and implies
The result follows. ∎
Although Lemma 2 is admittedly technical, it has a simple interpretation. To wit, “nature” uses Algorithm 1 and samples from a distribution , and item responses . The SM-AB algorithm imitates nature. It samples a second set of independent variables such that and offers the th order statistic as a proposal value. Lemma 2 states that this proposal value will be arbitrary close to for large . A numerical illustration is shown in Figure 5 where we plotted several chains produced by the SM-AB algorithm.
5.2 Convergence Rate
The confidence bounds in Figure 5 are implied by the classical Hoeffding inequality Hoeffding (1963). Hoeffding’s inequality, in its simplest form, deals with independent but not necessarily identically distributed random variables that are all defined on the unit interval. Hoeffding’s inequality puts a bound on the difference between the mean value of these variables and its expected value :
for an arbitrary positive real number . As a simple illustration consider copies of a Bernoulli random variable with success probability . The expected value of the mean of these random variables , and (5) implies that
for all positive real numbers . In other words, the probability that lies in an arbitrary small interval around its expected value goes to one if goes to infinity. Or, if we fix and define , can be found with probability in an interval of width around its mean. If we have a test taker with ability that answers a set of 2PL items, the mean of the response has expected value
Figure 6 shows an example where we plotted the mean together with the 95% confidence interval (i.e., ) around its expected value for different values of . Note that the width of the confidence interval decreases in proportion to the square root of ; e.g., multiplying by four halves the interval.
Hoeffding’s inequality, together with the monotonicity of as a function of , implies that an estimator obtained by solving
converges to the true value of if goes to infinity. This is illustrated in Figure 7 in which the same person, items and responses are used as in Figure 6. The confidence bounds are obtained by inverting as function of and evaluating the resulting function in the bounds implied by Hoeffding’s inequality555Inverting was done using the method of false-position. As with the mean, the width of the confidence interval decreases in proportion to the square root of .
We can estimate using a generated response pattern instead of the observed one. To this aim, we determine by solving , where , and denotes the expected value of the mean under the proposal. While is a monotone function of , it is not equal to the one under the model (7) because, if , one item response function is replaced by the prior cdf. It will be clear, however, that the difference becomes negligible as goes to infinity so that converges to and the confidence bounds apply to both estimators. Thus, the confidence bounds in Figure 5 are based on Hoeffding’s inequality, and 95% of the realizations of the SM-AB algorithm stay withing these bounds and are ever closer to the true value of as increases.
We have discussed a MH algorithm proposed by MarsmanIsing and have proven that it becomes more efficient when more data becomes available and scales-up to handle large problems.
MarsmanIsing introduce their algorithm is a few phrases and present no derivation. We have found that the complete formal underpinning for the algorithm is provided by two lemmas. It is worth noting that these lemmas are valid for any distribution of the auxiliary variables which implies that the SM-algorithms can be used, and with minimal changes, for any IRT model where the item response functions are distribution functions as in a normal-ogive model. Lemma 2, is of independent interest because it implies that ability can be estimated consistently from the number-correct score under any monotone IRT model.
That the algorithm becomes better with more data is an important property that, to the best of our knowledge, is not shared by any of the existing samplers; e.g., PatzJunker99, or MarisMaris02. The most widely used MCMC algorithm in educational measurement is probably the data-augmentation (DA) Gibbs sampler proposed by albert92; see also AlbertChib93 and Polson13. Compared to the SM-algorithms, the DA Gibbs sampler has two important disadvantages. First, it is restricted to normal-ogive models and forces the user to adopt normal priors. Second, DA causes auto-correlation and this auto-correlation will not diminish as more data are observed. Especially with large datasets666Albert (page 259) notes that the choice of initial values does not seem crucial. In 1992 he obviously did not have the opportunity to try large examples., good starting values are essential for the algorithm to reach the posterior support whereas the SM-AB algorithm takes us there in a few iteration.
In closing, we will sketch a possible extension for future research. That is, we adapt the sampler for models where response patterns are restricted to lie in a subset . To this effect, we adapt the MH step, as it stands, and multiply in (3) by:
where is the normalizing constant in the proposal, and is the normalizing constant when . This extension of the algorithm is useful because would allow us to handle multi-stage designs and/or polytomous item responses.
- Albert (1992) albert92Albert, J. 1992. Bayesian Estimation of Normal Ogive Item Response Curves Using Gibbs Sampling Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics173251–269.
- Albert Chib (1993) AlbertChib93Albert, J. Chib, S. 1993. Bayesian Analysis of Binary and Polytomous Response Data Bayesian analysis of binary and polytomous response data. Journal of the American Statistical Association88422669–679.
- Besag (1975) Besag1975Besag, J. 1975. Statistical Analysis of Non-Lattice Data Statistical analysis of non-lattice data. The Statistician243179–195.
- Birnbaum (1968) Birnbaum68Birnbaum, A. 1968. Some Latent Trait Models and their Use in Inferring an Examinee’s Ability Some latent trait models and their use in inferring an examinee’s ability. F. Lord M. Novick (), Statistical Theories of Mental Test Scores Statistical theories of mental test scores ( 395–479). Reading, MAAddison-Wesley.
- Dawid (1979) Dawid79Dawid, A. 1979. Conditional Independence in Statistical Theory Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B (Methodological)4111–31.
- Eddelbuettel Francois (2011) RcppEddelbuettel, D. Francois, R. 2011. Rcpp: Seamless R and C++ integration Rcpp: Seamless R and C++ integration. Journal of Statistical Software4081–18.
- Epskamp . (In press) Epskamp16Epskamp, S., Maris, G., Waldorp, L. Borsboom, D. In press. Network Psychometrics Network psychometrics. P. Irwing, D. Hughes T. Booth (), Handbook of Psychometrics. Handbook of psychometrics. New-YorkWiley.
- Hastings (1970) hastings70Hastings, W. 1970. Monte Carlo Sampling Methods Using Markov Chains and Their Applications Monte Carlo sampling methods using Markov chains and their applications. Biometrika57197–109.
- Hoare (1961) Hoare61Hoare, C. 1961. Algorithm 65: Find Algorithm 65: Find. Communications of the ACM47321–322.
- Hoeffding (1963) Hoeffding63Hoeffding, W. 1963. Probability Inequalities for Sums of Bounded Random Variables Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association5830113–30.
- Ising (1925) Ising25Ising, E. 1925. Beitrag zur Theorie des Ferromegnetismus Beitrag zur theorie des ferromegnetismus. Zeitschrift für Physik311253–258.
- Kruis Maris (2016) Kruis16Kruis, J. Maris, G. 2016. Three Representations of the Ising Model Three representations of the Ising model. Scientific Reports634175.
- Lenz (1920) Lenz20Lenz, W. 1920. Beiträge zum Verstandnis der Magnetischen Eigenschaften in Festen Korpern Beiträge zum verstandnis der magnetischen eigenschaften in festen korpern. Physikalische Zeitschrift21613–615.
- Linderman . (2015) LindermanEtAl2015Linderman, S., Johnson, M. Adams, R. 2015. Dependent Multinomial Models Made Easy: Stick-Breaking with the Polya-Gamma Augmentation Dependent multinomial models made easy: Stick-breaking with the Polya-Gamma augmentation. C. Cortes, N. Lawrence, D. Lee, M. Sugiyama R. Garnett (), Advances in Neural Information Processing Systems 28 Advances in neural information processing systems 28 ( 3456–3464). Curran Associates, Inc.
- Lord Novick (1968) LordNovick68Lord, F. Novick, M. 1968. Statistical Theories of Mental Test Scores Statistical theories of mental test scores. Reading, MAAddison-Wesley.
- Marin Robert (2014) MarinRobert14Marin, JM. Robert, C. 2014. Bayesian Essentials with R Bayesian essentials with R. New-YorkSpinger.
- Maris Maris (2002) MarisMaris02Maris, G. Maris, E. 2002. A MCMC-Method for Models with Continuous Latent Responses A MCMC-method for models with continuous latent responses. Psychometrika673335–350.
- Marsman, Borsboom . (2016) MarsmanEtAlMBRMarsman, M., Borsboom, D., Kruis, J., Epskamp, S., van Bork, R., Waldorp, L.Maris, G. 2016. An Introduction to Network Psychometrics: Relating Ising network models to item response theory models An introduction to network psychometrics: Relating Ising network models to item response theory models. Manuscript submitted for publication.
- Marsman . (2015) MarsmanIsingMarsman, M., Maris, G., Bechger, T. Glas, C. 2015. Bayesian Inference for Low-Rank Ising Networks Bayesian inference for low-rank Ising networks. Scientific Reports590501–7.
- Marsman . (2017) MarsmanEtAl2017Marsman, M., Maris, G., Bechger, T. Glas, C. 2017. Turning Simulation into Estimation: Generalized Exchange Algorithms for Exponential Family Models Turning simulation into estimation: Generalized exchange algorithms for exponential family models. PLoS One1211–15.
- Marsman, Maris . (2016) MarsmanPVMarsman, M., Maris, G., Bechger, TM. Glas, CAW. 2016. What Can we Learn From Plausible Values? What can we learn from plausible values? Psychometrika812274–289.
- Metropolis . (1953) metropolis53Metropolis, N., Rosenbluth, A., Rosenbluth, M. Teller, A. 1953. Equation of State Calculations by Fast Computing Machines Equation of state calculations by fast computing machines. The Journal of Chemical Physics2161087–1092.
- Mislevy (1991) mislevy91Mislevy, R. 1991. Randomization-Based Inference About Latent Variables from Complex Samples Randomization-based inference about latent variables from complex samples. Psychometrika562177–196.
- Patz Junker (1999) PatzJunker99Patz, R. Junker, B. 1999. A Straightforward Approach to Markov Chain Monte Carlo Methods for Item Response Models A straightforward approach to Markov Chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics242146–178.
- Polson . (2013) Polson13Polson, NG., Scott, JG. Windle, J. 2013. Bayesian Inference for Logistic Models Using Polya-Gamma Latent Variables Bayesian Inference for Logistic Models Using Polya-Gamma Latent Variables. Journal of the American Statistical Association1081339–1349.
- R Core Team (2016) RR Core Team. 2016. R: A Language and Environment for Statistical Computing R: A language and environment for statistical computing . Vienna, Austria. https://www.R-project.org/
- Rubin (1984) Rubin84Rubin, D. 1984. Bayesianly justifiable and relevant frequency calculations for the applied statistician Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics121151–1172.
- Tanner (1996) Tanner96Tanner, M. 1996. Tools for Statistical Inference Tools for statistical inference (Third ). New yorkSpringer-Verlag.
- Tierney (1994) Tierney94Tierney, L. 1994. Markov Chains for Exploring Posterior Distributions Markov chains for exploring posterior distributions. The Annals of Statistics2241701–1762.
7.1 The Acceptance Probability Function for the MH Algorithm
Let denote the current parameter value, the proposal value so that
We now find the expression for when the , for are logistic as in (2). The case for is relatively simple and we focus on deriving expressions for the case where .
First, ignoring terms unrelated to , we find that
where . We will simplify this further. Since ,
It follows that:
Hence, we find that:
where the rest-term R
How complex the rest-term is depends on the prior. If the prior is logistic with parameters and ,
Let . It follows that, for , and with a logistic prior:
where runs from to . Thus, if the items are Rasch items and the prior is logistic with scaling parameter and mean we find that
for any given that the sum-scores are matched.
7.2 R Code
As a courtesy to the reader we provide a minimal set of R-scripts R Core Team (2016) that read as pseudo-code for readers who program in other languages.
At the time of writing, the quickselect algorithm is not yet implemented in R. A C++ implementation can be made available via the Rcpp-package Eddelbuettel Francois (2011).
Note that the output is a list with the k-th smallest element and its original index; one-based for use in R.
7.2.2 The SM-MH Algorithm
The following is a straightforward implementation assuming a normal prior and index .
7.2.3 A Gibbs Sampler for the 2PL
Let be the number of persons, and the number of items. The following script performs one iteration of a Gibbs sampler for the 2PL.