DIF Statistical Inference and Detection without Knowing Anchoring Items

by Yunxiao Chen, et al.

Establishing the invariance property of an instrument (e.g., a questionnaire or test) is a key step in establishing its measurement validity. Measurement invariance is typically assessed by differential item functioning (DIF) analysis, i.e., detecting DIF items whose response distribution depends not only on the latent trait measured by the instrument but also on the group membership. DIF analysis is confounded by the group difference in the latent trait distributions. Many DIF analyses require knowledge of several anchor items that are DIF-free and then draw inference on whether each of the remaining items is a DIF item, where the anchor items are used to calibrate the latent trait distributions. In this paper, we propose a new method for DIF analysis under a multiple-indicators, multiple-causes (MIMIC) model for DIF. This method adopts a minimal $L_1$ norm condition for identifying the latent trait distributions. It can not only accurately estimate the DIF effects of individual items without requiring prior knowledge about an anchor set, but also draw valid statistical inference, which yields accurate detection of DIF items. The inference results further allow us to control the type-I error for DIF detection. We conduct simulation studies to evaluate the performance of the proposed method and compare it with the anchor-set-based likelihood ratio test approach and the LASSO approach. The proposed method is further applied to analyzing the three personality scales of the Eysenck Personality Questionnaire-Revised (EPQ-R).








1 Introduction

Measurement invariance refers to the psychometric equivalence of an instrument (e.g., a questionnaire or test) across several specified groups, such as gender and ethnicity. The lack of measurement invariance suggests that the instrument has different structures or meanings to different groups, leading to biases in measurements (Millsap, 2012).

Measurement invariance is typically assessed by differential item functioning (DIF) analysis of item response data, which aims to detect the measurement non-invariant items (i.e., DIF items). More precisely, a DIF item has a response distribution that depends not only on the latent trait measured by the instrument but also on respondents' group membership. Therefore, the detection of a DIF item involves comparing the item responses of different groups, conditioning on the latent traits. The complexity of the problem lies in the fact that individuals' latent trait levels cannot be directly observed but are measured by the instrument, which may itself contain DIF items. In addition, different groups may have different latent trait distributions. This problem thus involves identifying the latent trait and then conducting the conditional group comparison given individuals' latent trait levels.

Many statistical methods have been developed for DIF analysis. Traditional methods for DIF analysis require prior knowledge about a set of DIF-free items, which is known as the anchor set. This anchor set is used to identify the true latent trait. These methods can be classified into two categories. Methods in the first category (Mantel and Haenszel, 1959; Dorans and Kulick, 1986; Swaminathan and Rogers, 1990; Shealy and Stout, 1993; Zwick et al., 2000; Zwick and Thayer, 2002; May, 2006; Soares et al., 2009; Frick et al., 2015) do not explicitly assume an item response theory (IRT) model, and methods in the second category (Thissen, 1988; Lord, 2012; Kim et al., 1995; Raju, 1988, 1990; Cai et al., 2011; Thissen, 2001; Oort, 1998; Steenkamp and Baumgartner, 1998; Cao et al., 2017; Woods et al., 2013; Tay et al., 2015, 2016) are developed based on IRT models. Compared with non-IRT-based methods, an IRT-based method defines the DIF problem more clearly, at the price of potential model mis-specification. Specifically, an IRT model represents the latent trait as a latent variable and further characterizes the item-specific DIF effects by modeling each item response distribution as a function of the latent variable and group membership.

The DIF problem is well-characterized by a multiple indicators, multiple causes (MIMIC) IRT model, which is a structural equation model that was originally developed for continuous indicators (Zellner, 1970; Goldberger, 1972) and later extended to categorical item response data (Muthen, 1985, 1986; Muthen et al., 1991; Muthen and Lehman, 1985). A MIMIC model for DIF consists of a measurement component and a structural component. The measurement component models how the item responses depend on the measured psychological trait and respondents’ group membership. The structural component models the group-specific distributions of the psychological trait. The anchor set imposes zero constraints on item-specific parameters in the measurement component, making the model and thus the latent trait identifiable. Consequently, the DIF effects of the rest of the items can be tested by drawing statistical inference on the corresponding parameters under the identified model.

Anchor-set-based methods rely heavily on a correctly specified set of DIF-free items. The mis-specification of some anchor items can lead to invalid statistical inference results (Kopf et al., 2015b). To address this issue, item purification methods (Candell and Drasgow, 1988; Clauser et al., 1993; Fidalgo et al., 2000; Wang and Yeh, 2003; Wang and Su, 2004; Wang et al., 2009; Kopf et al., 2015b, a) have been proposed, which iteratively select an anchor set by stepwise model selection methods. More recently, regularized estimation methods (Magis et al., 2015; Tutz and Schauberger, 2015; Huang, 2018; Belzak and Bauer, 2020; Bauer et al., 2020; Schauberger and Mair, 2020) have been proposed that use LASSO-type regularized estimation procedures for simultaneous model selection and parameter estimation. Moreover, Yuan et al. (2021) proposed a visualization method for the detection of DIF under the Rasch model (Rasch, 1960). Their method is based on testing differential item pair functioning, which does not require prior information about an anchor set. Unfortunately, unlike many anchor-set-based methods with a correctly specified anchor set, none of these methods provides valid statistical inference for testing the null hypothesis that "item $j$ is DIF-free", for each item $j$. That is, the type-I error for testing the hypothesis cannot be guaranteed to be controlled at a pre-specified significance level.

This paper proposes a new method that can accurately estimate the DIF effects without requiring prior knowledge about an anchor set. It further draws statistical inference on the DIF effects of individual items, yielding valid confidence intervals and $p$-values. The point estimation and statistical inference lead to accurate detection of the DIF items, for which the type-I error of DIF detection can be controlled by the inference results. The method is proposed under a MIMIC model with a two-parameter logistic (Birnbaum, 1968) IRT measurement model and a linear structural model. The key to this method is a minimal $L_1$ norm assumption for identifying the true model. Methods are developed for estimating the model parameters, and for obtaining confidence intervals and $p$-values. Procedures for the detection of DIF items are further developed. Our method is compared to the likelihood ratio test method (Thissen et al., 1993) that requires an anchor set, and the LASSO-based approach recently proposed in Belzak and Bauer (2020). We begin with a statistical formulation of the DIF problem under a MIMIC model, followed by a review of related methods, and then a description of the proposed method.

2 A MIMIC Formulation of DIF

Consider $N$ individuals answering $J$ items. Let $Y_{ij} \in \{0, 1\}$ be a binary random variable denoting individual $i$'s response to item $j$. Let $y_{ij}$ be the observed value, i.e., the realization, of $Y_{ij}$. For ease of exposition, we use $Y_i = (Y_{i1}, \dots, Y_{iJ})$ to denote the response vector of individual $i$. The individuals are from two groups, indicated by $x_i \in \{0, 1\}$, where 0 and 1 are referred to as the reference and focal groups, respectively. We further introduce a latent variable $\theta_i$, which represents the latent trait that the items are designed to measure. DIF occurs when the distribution of $Y_{ij}$ depends not only on $\theta_i$ but also on $x_i$. More precisely, DIF occurs if $Y_{ij}$ is not conditionally independent of $x_i$, given $\theta_i$. Although it is seemingly a simple group comparison problem, DIF analysis is non-trivial due to the latency of $\theta_i$. In particular, the distribution of $\theta_i$ may depend on the value of $x_i$, which confounds the DIF analysis. In what follows, we describe a MIMIC model framework for DIF analysis, under which the relationship among $Y_i$, $\theta_i$, and $x_i$ is characterized. It is worth pointing out that this framework can be generalized to account for more complex DIF situations; see the Discussion Section.

2.1 Measurement Model

The two-parameter logistic (2PL) model (Birnbaum, 1968) is widely used to model binary item responses (e.g., wrong/right or absent/present). In the absence of DIF, the 2PL model assumes a logistic relationship between $Y_{ij}$ and $\theta_i$, which is independent of the value of $x_i$. That is,

$$P(Y_{ij} = 1 \mid \theta_i, x_i) = \frac{\exp(a_j \theta_i + b_j)}{1 + \exp(a_j \theta_i + b_j)}, \quad (1)$$

where the slope parameter $a_j$ and intercept parameter $b_j$ are typically known as the discrimination and easiness parameters, respectively. The right-hand side of (1) as a function of $\theta_i$ is known as the 2PL item response function. When the items potentially suffer from DIF, the item response functions may depend on the group membership $x_i$. In that case, the item response function can be modeled as

$$P(Y_{ij} = 1 \mid \theta_i, x_i) = \frac{\exp(a_j \theta_i + b_j + \gamma_j x_i)}{1 + \exp(a_j \theta_i + b_j + \gamma_j x_i)}, \quad (2)$$

where $\gamma_j$ is an item-specific parameter that characterizes its DIF effect. More precisely,

$$\exp(\gamma_j) = \frac{P(Y_{ij} = 1 \mid \theta_i = \theta, x_i = 1)\,/\,P(Y_{ij} = 0 \mid \theta_i = \theta, x_i = 1)}{P(Y_{ij} = 1 \mid \theta_i = \theta, x_i = 0)\,/\,P(Y_{ij} = 0 \mid \theta_i = \theta, x_i = 0)}.$$

That is, $\exp(\gamma_j)$ is the odds ratio for comparing two individuals from the two groups who have the same latent trait level. Item $j$ is DIF-free under this model when $\gamma_j = 0$. We further make the local independence assumption as in most IRT models. That is, $Y_{i1}$, ..., $Y_{iJ}$ are assumed to be conditionally independent, given $\theta_i$ and $x_i$.
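As a concrete illustration of (2), the sketch below codes the 2PL item response function with a group shift and verifies the odds-ratio interpretation of the DIF parameter; all parameter values are arbitrary choices for the example, not values from the paper.

```python
import numpy as np

def irf_2pl_dif(theta, x, a, b, gamma):
    """2PL item response function with a DIF effect:
    P(Y = 1 | theta, x) = exp(a*theta + b + gamma*x) / (1 + exp(a*theta + b + gamma*x))."""
    z = a * theta + b + gamma * x
    return 1.0 / (1.0 + np.exp(-z))

# The DIF parameter acts as a log odds ratio: at any fixed trait level theta,
# odds(focal group) / odds(reference group) = exp(gamma).
a, b, gamma, theta = 1.5, -0.2, 0.4, 0.3   # arbitrary illustrative values
p1 = irf_2pl_dif(theta, 1, a, b, gamma)    # focal group
p0 = irf_2pl_dif(theta, 0, a, b, gamma)    # reference group
odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
```

Setting `gamma = 0` makes the two groups share a single response curve, recovering the DIF-free model (1).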

2.2 Structural Model

A structural model specifies the distribution of $\theta_i$, which may depend on the group membership. Specifically, we assume the conditional distribution of $\theta_i$ given $x_i$ to follow a normal distribution,

$$\theta_i \mid x_i \sim N(\mu x_i, 1).$$

Note that the latent trait distribution for the reference group is set to a standard normal distribution to identify the location and scale of the latent trait. A similar assumption is typically adopted in IRT models for a single group of individuals.

The MIMIC model for DIF combines the above measurement and structural models, for which a path diagram is given in Figure 1. The marginal likelihood function for this MIMIC model takes the form

$$L(\Xi) = \prod_{i=1}^{N} \int \left[ \prod_{j=1}^{J} \frac{\exp\{y_{ij}(a_j \theta + b_j + \gamma_j x_i)\}}{1 + \exp(a_j \theta + b_j + \gamma_j x_i)} \right] \phi(\theta - \mu x_i)\, d\theta, \quad (3)$$

where $\Xi = (a_1, \dots, a_J, b_1, \dots, b_J, \gamma_1, \dots, \gamma_J, \mu)$ denotes all the fixed model parameters and $\phi(\cdot)$ is the standard normal density function.
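The integral in the marginal likelihood (3) has no closed form, but it can be evaluated numerically. The sketch below is a minimal Gauss-Hermite quadrature implementation under the model above; the simulated dataset exists only to exercise the function, and its parameter values are illustrative.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss  # nodes/weights for weight exp(-t^2/2)

def marginal_loglik(Y, x, a, b, gamma, mu, n_quad=31):
    """Log marginal likelihood of the MIMIC model: the latent trait is
    integrated out by Gauss-Hermite quadrature under theta | x ~ N(mu*x, 1)."""
    nodes, weights = hermegauss(n_quad)
    w = weights / weights.sum()          # normalized: quadrature rule for N(0, 1)
    ll = 0.0
    for i in range(Y.shape[0]):
        theta = nodes + mu * x[i]        # re-center nodes at the group-specific mean
        z = np.outer(theta, a) + b + gamma * x[i]                # n_quad x J
        log_cond = (Y[i] * z - np.logaddexp(0.0, z)).sum(axis=1) # log P(Y_i | theta)
        ll += np.log(w @ np.exp(log_cond))
    return ll

# simulate a small dataset from the model to exercise the function
rng = np.random.default_rng(0)
N, J = 40, 4
a = rng.uniform(1.0, 2.0, J)
b = rng.uniform(-1.0, 1.0, J)
gamma = np.array([0.0, 0.0, 0.0, 0.5])   # one DIF item
x = rng.integers(0, 2, N)
theta = rng.normal(0.5 * x, 1.0)
z = np.outer(theta, a) + b + gamma * x[:, None]
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-z)))
ll = marginal_loglik(Y, x, a, b, gamma, mu=0.5)
```

In practice, MML estimation maximizes this quantity over all parameters; the quadrature approximation stabilizes quickly as the number of nodes grows.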

The goal of DIF analysis is to detect the DIF items, i.e., the items for which $\gamma_j \neq 0$. Unfortunately, without further assumptions, this problem is ill-posed due to the non-identifiability of the model. We discuss this identifiability issue below.

Figure 1: The path diagram of a MIMIC model for DIF analysis. The subscript $i$ is omitted for simplicity. The dashed lines from $x$ to the item responses indicate the DIF effects.

2.3 Model Identifiability

Without further assumptions, the above MIMIC model is not identifiable. That is, for any constant $c$, the model remains equivalent if we simultaneously replace $\gamma_j$ and $\mu$ by $\gamma_j + c\, a_j$ and $\mu - c$, respectively, and keep the $a_j$ and $b_j$ unchanged. This identifiability issue arises because all the items are allowed to suffer from DIF, resulting in an unidentified latent trait. In other words, without further assumptions, it is impossible to disentangle the DIF effects and the difference between the two groups' latent trait distributions. As will be discussed below, statistical methods for DIF analysis make certain assumptions to ensure identifiability.
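This equivalence can be checked numerically: shifting the latent trait by $c$ per unit of group membership and absorbing the shift into the DIF parameters leaves every item's linear predictor, and hence the response distribution, unchanged. A minimal sketch with arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 1.5, -0.2                  # item parameters, unchanged by the transformation
gamma, mu, c = 0.4, 0.5, 0.3      # arbitrary DIF effect, group mean shift, and constant c

x = rng.integers(0, 2, size=10_000)
theta = rng.normal(mu * x, 1.0)   # trait under the original parameterization

# Equivalent model: trait theta - c*x has mean (mu - c)*x, and the DIF
# effect becomes gamma + c*a, while a and b stay the same.
theta_alt = theta - c * x
z_original = a * theta + b + gamma * x
z_alt = a * theta_alt + b + (gamma + c * a) * x
```

The two linear predictors agree for every individual, so the two parameterizations imply identical likelihoods; only an external constraint, such as an anchor set or the ML1 condition introduced later, can pin down $c$.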

3 Related Works

Many of the IRT-based DIF analyses (Thissen et al., 1986; Thissen, 1988; Thissen et al., 1993) require prior knowledge about a subset of DIF-free items, which are known as the anchor items. More precisely, we denote this known subset by $\mathcal{A} \subset \{1, \dots, J\}$. Then under the MIMIC model described above, the constraints $\gamma_j = 0$ are imposed for all $j \in \mathcal{A}$ in the estimation. With these zero constraints, the parameters cannot be freely transformed, and thus the above MIMIC model becomes identifiable. Therefore, for each non-anchor item $j \notin \mathcal{A}$, the hypothesis $\gamma_j = 0$ can be tested, for example, by a likelihood ratio test. The DIF items can then be detected based on the statistical inference of these hypothesis tests.

The validity of the anchor-item-based analyses relies on the assumption that the anchor items are truly DIF-free. If the anchor set includes one or more DIF items, then the results can be misleading (Kopf et al., 2015b). To address the issue brought by the mis-specification of the anchor set, item purification methods (Candell and Drasgow, 1988; Clauser et al., 1993; Fidalgo et al., 2000; Wang and Yeh, 2003; Wang and Su, 2004; Wang et al., 2009; Kopf et al., 2015b, a)

have been proposed that iteratively purify the anchor set. These methods conduct model selection using a stepwise procedure to select the anchor set, implicitly assuming that there exists a reasonably large set of DIF-free items. Then DIF is assessed by hypothesis testing given the selected anchor set. This type of method also has several limitations. First, the model selection results may be sensitive to the choice of the initial set of anchor items and the specific stepwise procedure (e.g., forward or backward selection), which is a common issue with stepwise model selection procedures (e.g., stepwise variable selection for linear regression). Second, the model selection step has uncertainty. As a result, there is no guarantee that the selected anchor set is completely DIF-free, and furthermore, the post-selection statistical inference of items may not be valid in the sense that the type-I error may not be controlled at the targeted significance level.

Regularized estimation methods (Magis et al., 2015; Tutz and Schauberger, 2015; Huang, 2018; Belzak and Bauer, 2020; Bauer et al., 2020; Schauberger and Mair, 2020) have also been proposed for identifying the anchor items, which also implicitly assume that many items are DIF-free. These methods do not require prior knowledge about anchor items, and simultaneously select the DIF-free items and estimate the model parameters using a LASSO-type penalty (Tibshirani, 1996). Under the above MIMIC model, a regularized estimation procedure solves the following optimization problem,

$$\min_{\Xi}\; -\log L(\Xi) + \lambda \sum_{j=1}^{J} |\gamma_j|, \quad (4)$$

where $\lambda > 0$ is a tuning parameter that determines the sparsity level of the estimated parameters. Generally speaking, a larger value of $\lambda$ leads to a sparser estimate $\hat{\gamma}^{(\lambda)} = (\hat{\gamma}_1^{(\lambda)}, \dots, \hat{\gamma}_J^{(\lambda)})$. A regularization method (Belzak and Bauer, 2020) solves the optimization problem (4) for a sequence of $\lambda$ values, and then selects the tuning parameter based on the Bayesian Information Criterion (BIC; Schwarz, 1978). Let $\hat{\lambda}$ be the selected tuning parameter. Items for which $\hat{\gamma}_j^{(\hat{\lambda})} \neq 0$ are classified as DIF items and the rest are classified as DIF-free items. While the regularization methods are computationally more stable than stepwise model selection in the item purification methods, they also suffer from some limitations. First, they involve solving non-smooth optimization problems like (4) for different tuning parameter values, which is computationally intensive and requires tailored computation code that is not available in most statistical packages/software for DIF analysis. Second, these methods may be sensitive to the choice of the tuning parameter. Although methods and theories have been developed in the statistics literature to guide the selection of the tuning parameter, there is no consensus on how the tuning parameter should be chosen, leaving ambiguity in the application of these methods. Finally, as a common issue of regularized estimation methods, obtaining valid statistical inference from these methods is not straightforward. That is, regularized estimation like (4) does not provide a valid $p$-value for testing the null hypothesis $\gamma_j = 0$. In fact, post-selection inference after regularized estimation was conducted in Bauer et al. (2020), where the type-I error could not be controlled at the targeted level under some simulation scenarios.

4 Proposed Method

In what follows, we propose a new method for DIF analysis that does not require prior knowledge about anchor items. As will be shown in the rest of the paper, the proposed method can not only accurately detect the DIF items, but also provide valid statistical inference for testing the hypotheses $\gamma_j = 0$.

4.1 Model Identification via a Minimal $L_1$ Condition

We impose a condition for identifying the true model in the same spirit as the sparsity assumption that is implicitly imposed in the regularized estimation and item purification methods. Recall that the MIMIC model remains equivalent if we simultaneously replace $\gamma_j$ and $\mu$ by $\gamma_j + c\, a_j$ and $\mu - c$, respectively, and keep the $a_j$ and $b_j$ unchanged. Let $a_j^*$, $b_j^*$, $\gamma_j^*$, and $\mu^*$ denote the true model parameters. To identify the true model, we require the following minimal $L_1$ (ML1) condition to hold:

$$\sum_{j=1}^{J} |\gamma_j^*| \le \sum_{j=1}^{J} |\gamma_j^* + c\, a_j^*|, \quad (5)$$

for all $c \in \mathbb{R}$. This assumption implies that, among all models that are equivalent to the true model, the true parameter vector $\gamma^* = (\gamma_1^*, \dots, \gamma_J^*)$ has the smallest $L_1$ norm. Since the $L_1$ norm measures the sparsity level of a vector (Tibshirani, 1996), the ML1 condition tends to hold when $\gamma^*$ is sparse. To illustrate this point, we show the function $f(c) = \sum_{j=1}^{J} |\gamma_j^* + c\, a_j^*|$ in Figure 2, where $J = 25$, $a_j^* = 1$ for all $j$, and $\gamma_j^* = 0$ and 1 when $j \le 20$ and $j > 20$, respectively.

Figure 2: Function $f(c) = \sum_{j=1}^{J} |\gamma_j^* + c\, a_j^*|$, where $J = 25$, $a_j^* = 1$ for all $j$, and $\gamma_j^* = 0$ and 1 for $j \le 20$ and $j > 20$, respectively. The minimal value of $f(c)$ is achieved when $c = 0$.
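The behavior of $f(c)$ is easy to verify numerically. The sketch below evaluates it for a hypothetical configuration in which most items are DIF-free (20 items with zero DIF effect, 5 items with effect 1, all slopes equal to 1); because the DIF-free items dominate, any shift $c \neq 0$ inflates the $L_1$ norm and the minimum sits at $c = 0$.

```python
import numpy as np

# hypothetical true parameters: 20 DIF-free items, 5 DIF items with effect 1
a_star = np.ones(25)
gamma_star = np.concatenate([np.zeros(20), np.ones(5)])

def f(c):
    """L1 norm of the DIF parameters of the equivalent model indexed by c."""
    return np.abs(gamma_star + c * a_star).sum()

grid = np.linspace(-2.0, 2.0, 401)
c_min = grid[np.argmin([f(c) for c in grid])]   # grid minimizer, essentially 0
```

Here $f(c) = 20|c| + 5|1 + c|$: moving $c$ away from zero helps the 5 DIF items at most 5 units per unit of $c$ but costs the 20 DIF-free items 20 units, so sparsity of the true $\gamma^*$ makes the ML1 condition hold.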

4.2 Parameter Estimation

We now propose a procedure for estimating the model under the ML1 condition. This procedure is described in Algorithm 1 below.

Algorithm 1.
  • Step 1: Solve the following MML estimation problem

    $(\hat{a}, \hat{b}, \hat{\gamma}, \hat{\mu}) = \arg\max_{\Xi} \log L(\Xi), \quad \text{subject to } \gamma_1 = 0. \quad (6)$

  • Step 2: Solve the optimization problem

    $\hat{c} = \arg\min_{c \in \mathbb{R}} \sum_{j=1}^{J} |\hat{\gamma}_j + c\, \hat{a}_j|. \quad (7)$

  • Output: The ML1 estimate $\tilde{\gamma}_j = \hat{\gamma}_j + \hat{c}\, \hat{a}_j$, $j = 1, \dots, J$, and $\tilde{\mu} = \hat{\mu} - \hat{c}$.

We provide some remarks about these steps. The estimator (6) in Step 1 can be viewed as the MML estimator of the MIMIC model, treating item 1 as an anchor item. We emphasize that the constraint $\gamma_1 = 0$ in Step 1 is an arbitrary but mathematically convenient constraint for ensuring the estimability of the MIMIC model when solving (6). It does not require item 1 to be truly a DIF-free item. This constraint can be replaced by any equivalent constraint, for example, $\gamma_2 = 0$, without affecting the final estimation result. Step 2 finds the transformation that leads to the ML1 solution among all the models that are equivalent to the estimated model from Step 1. The optimization problem (7) is convex and takes the same form as the Least Absolute Deviations (LAD) objective function in median regression (Koenker, 2005). Consequently, it can be solved using standard statistical packages/software for quantile regression. In particular, the R package "quantreg" (Koenker et al., 2018) is used in our simulation study and real data analysis.
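When all estimated slopes are positive, the Step 2 objective can be rewritten as $\sum_j \hat{a}_j\, |c - (-\hat{\gamma}_j/\hat{a}_j)|$, a weighted-median problem that can be solved exactly without an iterative solver. The sketch below is one such implementation (the paper itself uses the quantreg package; the numeric values are arbitrary illustrations of a Step-1 output with item 1 constrained to zero).

```python
import numpy as np

def ml1_transform(gamma_hat, a_hat):
    """Solve min_c sum_j |gamma_hat_j + c * a_hat_j| for positive slopes
    via a weighted median, and return the transformed DIF estimates."""
    points = -gamma_hat / a_hat               # kink locations of the LAD objective
    order = np.argsort(points)
    cum = np.cumsum(a_hat[order])
    k = np.searchsorted(cum, 0.5 * cum[-1])   # first index reaching half the weight
    c_hat = points[order][k]
    return gamma_hat + c_hat * a_hat, c_hat

# hypothetical Step-1 estimates: three near-zero items, two clear DIF items
gamma_hat = np.array([0.0, 0.02, -0.03, 0.85, 0.78])
a_hat = np.array([1.2, 1.5, 1.1, 1.8, 1.4])
gamma_tilde, c_hat = ml1_transform(gamma_hat, a_hat)
```

Because the objective is piecewise linear and convex in $c$, its minimum is attained at one of the kinks $-\hat{\gamma}_j/\hat{a}_j$, which is exactly what the weighted median picks out.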

The ML1 condition (5), together with some additional regularity conditions, guarantees the consistency of the above ML1 estimator. That is, $(\hat{a}, \hat{b}, \tilde{\gamma}, \tilde{\mu})$ will converge to $(a^*, b^*, \gamma^*, \mu^*)$ as the sample size $N$ grows to infinity. This result is formalized in Theorem 1, with its proof given in the Appendix.

Theorem 1.

Let $(a^*, b^*, \gamma^*, \mu^*)$ be the true model parameters, and let $(a^\dagger, b^\dagger, \gamma^\dagger, \mu^\dagger)$ be the true parameter values of the equivalent MIMIC model with constraint $\gamma_1 = 0$. Assume this equivalent model satisfies the standard regularity conditions in Theorem 5.14 of van der Vaart (2000) that concern the consistency of maximum likelihood estimation. Further, assume that the ML1 condition (5) holds. Then $(\hat{a}, \hat{b}, \tilde{\gamma}, \tilde{\mu}) \to (a^*, b^*, \gamma^*, \mu^*)$ in probability, as $N \to \infty$.

We further discuss the connection between the proposed estimator and the regularized estimator (4). Note that $\tilde{\gamma}$ is the one with the smallest $L_1$ norm among all equivalent estimators that maximize the likelihood function (3). When the solution path of (4) is smooth in $\lambda$ and the solution to the ML1 problem (7) is unique, it is easy to see that $\tilde{\gamma}$ is the limit of $\hat{\gamma}^{(\lambda)}$ as the positive tuning parameter $\lambda$ converges to zero. In other words, the proposed estimator can be viewed as a limiting version of the LASSO estimator (4). According to Theorem 1, this limiting version of the LASSO estimator is sufficient for the consistent estimation of the true model.

Note that the consistency in Theorem 1 further implies that the DIF items can be consistently selected by a simple hard-thresholding approach. Similar hard-thresholding methods perform well for variable selection in regression models (Meinshausen and Yu, 2009). Given our ML1 estimate $\tilde{\gamma}$ and a pre-specified threshold $\delta > 0$, this hard-thresholding method classifies the items for which $|\tilde{\gamma}_j| \le \delta$ as the non-DIF items and the rest as DIF items. Theorem 2 below describes the result on selection consistency.

Theorem 2.

Let $\tilde{\gamma}_1, \dots, \tilde{\gamma}_J$ be the estimators of the DIF parameters returned by Algorithm 1. For any fixed $\delta$ satisfying $0 < \delta < \min\{|\gamma_j^*| : \gamma_j^* \neq 0\}$, the probability $P\big(1_{\{|\tilde{\gamma}_j| > \delta\}} = 1_{\{\gamma_j^* \neq 0\}}\big)$ converges to 1, as the sample size $N$ goes to infinity, for all $j = 1, \dots, J$.

In practice, the value of $\delta$ can be chosen by BIC, similar to the choice of $\lambda$ for the LASSO procedure in Belzak and Bauer (2020). That is, we consider a sequence of $\delta$ values. For each $\delta$, the hard-thresholding method is applied to obtain an estimated set of non-DIF items. We then refit the MIMIC model with the parameters $\gamma_j$ fixed to zero for $j$ in this estimated non-DIF item set, and compute the corresponding BIC value. We choose the $\delta$ that yields the smallest BIC value. The final classification of the items is given by the results based on the chosen $\delta$.
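The separation property behind Theorem 2 is easy to visualize: once the estimation error is much smaller than the smallest nonzero $|\gamma_j^*|$, any threshold strictly between the two classifies every item correctly. A minimal simulation sketch, in which root-$N$ Gaussian noise stands in for the ML1 estimator's sampling error (the true parameter configuration is a hypothetical 20 DIF-free / 5 DIF split):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma_star = np.concatenate([np.zeros(20), np.full(5, 0.8)])  # 5 DIF items
delta = 0.4           # fixed threshold: 0 < delta < min nonzero |gamma*|

results = {}
for N in (500, 5000):
    noise = rng.normal(0.0, 1.0 / np.sqrt(N), size=gamma_star.size)
    gamma_tilde = gamma_star + noise            # stand-in for the ML1 estimate
    flagged = np.abs(gamma_tilde) > delta       # hard-thresholding rule
    results[N] = np.array_equal(flagged, gamma_star != 0)
```

As $N$ grows the noise shrinks, so the event that every item is classified correctly has probability tending to one, which is exactly the statement of Theorem 2.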

4.3 Statistical Inference

The statistical inference of the individual parameters $\gamma_j$ is of particular interest in the DIF analysis. In fact, without the bias brought by the regularization tuning parameter, we can draw valid statistical inference on the DIF parameters $\gamma_j$.

Note that the uncertainty of $\tilde{\gamma}$ is inherited from $(\hat{a}, \hat{b}, \hat{\gamma}, \hat{\mu})$, where $\sqrt{N}(\hat{\Xi} - \Xi^\dagger)$ asymptotically follows a mean-zero multivariate normal distribution by the large-sample theory for maximum likelihood estimation, with $\Xi^\dagger$ denoting the parameter vector of the equivalent model satisfying the constraint $\gamma_1 = 0$ (note that this is a degenerate multivariate normal distribution since $\hat{\gamma}_1$ is constrained to zero); see the Appendix for more details. We denote this multivariate normal distribution by $N(0, \Sigma)$, where a consistent estimator of $\Sigma$, denoted by $\hat{\Sigma}$, can be obtained based on the marginal likelihood. We define a function

$$g(\Xi) = \gamma + c(\Xi)\, a, \quad \text{where } c(\Xi) = \arg\min_{c \in \mathbb{R}} \sum_{j=1}^{J} |\gamma_j + c\, a_j|,$$

and $a = (a_1, \dots, a_J)$ and $\gamma = (\gamma_1, \dots, \gamma_J)$ denote the corresponding sub-vectors of $\Xi$. Note that the function $g$ maps an arbitrary parameter vector of the MIMIC model to the DIF parameter vector of the equivalent ML1 parameterization. To draw statistical inference, we need the distribution of

$$\tilde{\gamma} - \gamma^* = g(\hat{\Xi}) - g(\Xi^\dagger).$$

By the asymptotic distribution of $\hat{\Xi}$, we know that the distribution of $g(\hat{\Xi}) - g(\Xi^\dagger)$ can be approximated by that of $g(\Xi^\dagger + Z/\sqrt{N}) - g(\Xi^\dagger)$, and the latter can be further approximated by $g(\hat{\Xi} + Z/\sqrt{N}) - g(\hat{\Xi})$, where $Z$ follows the normal distribution $N(0, \hat{\Sigma})$. Therefore, although the function $g$ does not have an analytic form, we can approximate the distribution of $\tilde{\gamma} - \gamma^*$ by Monte Carlo simulation. We summarize this procedure in Algorithm 2 below.

Algorithm 2.
  • Input: The number of Monte Carlo samples $M$ and significance level $\alpha$.

  • Step 1: Generate $M$ i.i.d. samples from a multivariate normal distribution with mean $\hat{\Xi}$ and covariance matrix $\hat{\Sigma}/N$. We denote these samples as $\Xi^{(1)}$, ..., $\Xi^{(M)}$.

  • Step 2: Obtain $\gamma^{(m)} = g(\Xi^{(m)})$, for $m = 1, \dots, M$.

  • Step 3: Obtain the $\alpha/2$ and $(1-\alpha/2)$ quantiles of the empirical distribution of $\gamma_j^{(1)}, \dots, \gamma_j^{(M)}$, denoted by $\hat{q}_{j,\alpha/2}$ and $\hat{q}_{j,1-\alpha/2}$, respectively.

  • Output: The level-$(1-\alpha)$ confidence interval for $\gamma_j^*$ is given by $[\hat{q}_{j,\alpha/2},\; \hat{q}_{j,1-\alpha/2}]$. In addition, the $p$-value for a two-sided test of $\gamma_j^* = 0$ is given by $2\min\{\hat{F}_j(0),\, 1 - \hat{F}_j(0)\}$, where $\hat{F}_j$ denotes the empirical distribution function of $\gamma_j^{(1)}, \dots, \gamma_j^{(M)}$.

Algorithm 2 only involves sampling from a multivariate normal distribution and solving a convex optimization problem based on the LAD objective function, both of which are computationally efficient. The value of $M$ is set to 10,000 in our simulation study and 50,000 in the real data example below.
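The Monte Carlo machinery of Algorithm 2 can be sketched end to end. Everything below is illustrative: the point estimates, the item count, and the diagonal stand-in for the estimated covariance are hypothetical choices, and the interval and $p$-value are simple percentile-style constructions; in practice $\hat{\Sigma}$ comes from the observed information of the marginal likelihood.

```python
import numpy as np

def g(gamma, a):
    """Map (gamma, a) to the DIF parameters of the equivalent minimal-L1 model.
    The objective is piecewise linear in c, so the minimum is attained at one
    of the kinks -gamma_j / a_j; checking all of them solves it exactly."""
    cands = -gamma / a
    objs = np.abs(gamma[None, :] + cands[:, None] * a[None, :]).sum(axis=1)
    return gamma + cands[np.argmin(objs)] * a

rng = np.random.default_rng(0)
J, M, alpha = 25, 5000, 0.05
a_hat = np.full(J, 1.5)                                   # hypothetical estimates
gamma_hat = np.concatenate([np.zeros(20), np.full(5, 0.8)])
se = 0.05                                                 # illustrative per-entry SE

samples = np.empty((M, J))
for m in range(M):                                        # Steps 1-2: perturb and map
    samples[m] = g(gamma_hat + rng.normal(0, se, J),
                   a_hat + rng.normal(0, se, J))

ci_lo = np.quantile(samples, alpha / 2, axis=0)           # Step 3: quantiles
ci_hi = np.quantile(samples, 1 - alpha / 2, axis=0)
F0 = (samples <= 0).mean(axis=0)                          # empirical CDF at zero
pvals = 2 * np.minimum(F0, 1 - F0)                        # two-sided p-values
```

Under this toy setup, the intervals for the 20 null items straddle zero and their $p$-values are large, while the 5 DIF items are flagged with $p$-values near zero.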

The $p$-values can be used to control the type-I error, i.e., the probability of mistakenly detecting a non-DIF item as a DIF one. To control the item-specific type-I errors at a pre-specified level $\alpha$ (e.g., $\alpha = 0.05$), we detect the items for which the corresponding $p$-values are below $\alpha$. Besides the type-I error, we may also consider the False Discovery Rate (FDR) for DIF detection (Bauer et al., 2020) to account for multiple comparisons, where the FDR is defined as the expected ratio of the number of falsely detected (i.e., non-DIF) items to the total number of detections. To control the FDR, the Benjamini-Hochberg (B-H) procedure (Benjamini and Hochberg, 1995) can be applied to the $p$-values.
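The B-H step-up rule itself takes only a few lines; a minimal sketch:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: reject the k smallest p-values,
    where k is the largest index with p_(k) <= q * k / m."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    k = np.nonzero(passed)[0].max() + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject
```

Note the step-up nature of the rule: an ordered $p$-value can be rejected even when it exceeds its own threshold, provided some larger ordered $p$-value passes its threshold.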

5 Simulation Study

This section conducts simulation studies to evaluate the performance of the proposed method and compare it with the likelihood ratio test (LRT) method (Thissen, 1988) and the LASSO method (Bauer et al., 2020). Note that the LRT method requires a known anchor item set. Correctly specified anchor item sets with different sizes will be considered when applying the LRT method.

In the simulation, we set the number of items $J = 25$, and consider two settings for the sample size, $N = 500$ and 1000. The parameters of the true model are set as follows. First, the discrimination parameters $a_j^*$ are set between 1 and 2, and the easiness parameters $b_j^*$ are set between $-1$ and 1. Their true values are given in Table 1. The observations are split into two groups of equal size, indicated by $x_i = 0$ and 1. The parameter $\mu^*$ in the structural model is set to 0.5, so that the latent trait distributions are standard normal and $N(0.5, 1)$ for the reference and focal groups, respectively. We consider two settings for the DIF parameters $\gamma_j^*$, one with smaller absolute DIF parameter values and the other with larger absolute DIF parameter values. Their true values are given in Table 1. For both sets of the DIF parameters, the ML1 condition is satisfied. The combinations of settings for the sample sizes and DIF parameters lead to four settings in total. For each setting, 100 independent datasets are generated.

We first evaluate the accuracy of the proposed estimator given by Algorithm 1. Table 2 shows the mean squared error (MSE) for $\tilde{\mu}$ and the average MSEs for the $\hat{a}_j$s, $\hat{b}_j$s, and $\tilde{\gamma}_j$s, which are obtained by averaging the corresponding MSEs over the items. As we can see, these MSEs and average MSEs are small in magnitude and decrease as the sample size increases under each setting. This observation aligns with our consistency result in Theorem 1.

We then compare the proposed method and the LRT method in terms of their performance on statistical inference. Specifically, we focus on whether the FDR can be controlled when applying the B-H procedure to the $p$-values obtained from the two methods. The comparison results are given in Table 3. As we can see, the FDR is controlled below the targeted level of 5% for both the proposed method and the LRT method with 1, 5, and 10 anchor items, under all settings. In addition, the FDR levels given by the proposed method are similar to those given by the LRT method.

When anchor items are known, the standard error can be computed for each estimated $\gamma_j$, and thus the corresponding Wald interval can be constructed. We compare the coverage rates of the confidence intervals given by Algorithm 2 and the Wald intervals that are based on five anchor items. The results are shown in Figure 3. We see that the coverage rates from both methods are comparable across all settings and are close to the 95% targeted level. Note that these coverage rates are calculated based on only 100 replicated datasets, and thus may be slightly affected by Monte Carlo error.

Furthermore, we compare the proposed hard-thresholding procedure with the LASSO procedure in terms of the accuracy of DIF item detection. For the hard-thresholding procedure, 20 values of the threshold $\delta$ are considered, equally spaced over a fixed interval under each of the small and large DIF settings. For the LASSO procedure, 20 values are similarly considered for the tuning parameter $\lambda$ under the small and large DIF settings, respectively. The optimal values of $\delta$ and $\lambda$ are chosen by BIC, which yield the selections of DIF items for the two procedures, respectively. The selection accuracy is evaluated by two metrics, the true positive rate (TPR) and the false positive rate (FPR), where the TPR is the expected ratio of the number of correctly detected DIF items to the number of truly DIF items, and the FPR is the expected ratio of the number of falsely detected DIF items to the number of non-DIF items. Table 4 shows the TPR and FPR estimated based on the 100 simulation replications. As we can see, the two methods have comparable TPR and FPR under all settings. It is worth noting that the hard-thresholding method is computationally much faster than the LASSO method, since it does not involve maximizing a regularized likelihood under different tuning parameters.

Finally, we compare the detection power of the different methods based on receiver operating characteristic (ROC) curves. For a given method, an ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR) at different threshold settings. More specifically, ROC curves are constructed for the hard-thresholding and LASSO methods by varying the corresponding tuning parameters $\delta$ and $\lambda$. ROC curves are also constructed by thresholding the $p$-values from the proposed method and the LRT method with 1, 5, and 10 anchor items, respectively. Note that for the LRT method, the TPR and FPR are calculated based on the non-anchor items. For each method, an average ROC curve is obtained based on the 100 replications, for which the area under the ROC curve (AUC) is calculated. A larger AUC value indicates better detection power. The AUC values for the different methods across our simulation settings are given in Table 5. According to the AUC values, the two proposed procedures (i.e., thresholding the $p$-values from Algorithm 2 and the hard-thresholding procedure) and the LRT method with 10 anchor items have similar detection power and perform better than the rest. That is, without knowing any anchor items, the proposed procedures perform better than the LRT method that knows 1 or 5 anchor items, and perform similarly to the LRT method that knows 10 anchor items. The superior performance of the proposed procedures comes from the use of the ML1 condition, which identifies the model parameters using information from all the items. We also see that the LASSO procedure performs similarly to the proposed procedures under the large DIF settings, but is less accurate under the small DIF settings. This may be because the model selection consistency of the LASSO procedure requires the so-called irrepresentable conditions (Zhao and Yu, 2006; van de Geer and Bühlmann, 2009), which are strong conditions that may not be satisfied by the current MIMIC model under the small DIF settings.
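The AUC used above has a convenient rank-based form: it equals the probability that a randomly chosen DIF item receives a higher score (e.g., $|\tilde{\gamma}_j|$, or one minus its $p$-value) than a randomly chosen non-DIF item, with ties counted as one half. A minimal sketch, with hypothetical scores standing in for one simulation replication:

```python
import numpy as np

def roc_auc(scores, is_dif):
    """AUC via the Mann-Whitney identity: P(score of a DIF item exceeds the
    score of a non-DIF item), counting ties as one half."""
    pos, neg = scores[is_dif], scores[~is_dif]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (pos.size * neg.size)

# hypothetical scores for 20 non-DIF and 5 DIF items
is_dif = np.array([False] * 20 + [True] * 5)
scores = np.abs(np.where(is_dif, 0.8, 0.0)
                + np.random.default_rng(0).normal(0, 0.1, 25))
auc = roc_auc(scores, is_dif)
```

An AUC of 1 means some threshold separates the two classes perfectly, while 0.5 corresponds to a scoring rule no better than chance.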

Figure 3: Scatter plots of the coverage rates of the 95% confidence intervals for the $\gamma_j$'s. The x-axes and y-axes are labeled with item numbers and coverage rates, respectively. Panels (a)-(d) correspond to our proposed method, and panels (e)-(h) correspond to the Wald intervals constructed with five anchor items.
Item number  $a_j^*$  $b_j^*$  $\gamma_j^*$ (small DIF)  $\gamma_j^*$ (large DIF)
1 1.30 0.80 0.00 0.00
2 1.40 0.20 0.00 0.00
3 1.50 -0.40 0.00 0.00
4 1.70 -1.00 0.00 0.00
5 1.60 1.00 0.00 0.00
6 1.30 0.80 0.00 0.00
7 1.40 0.20 0.00 0.00
8 1.50 -0.40 0.00 0.00
9 1.70 -1.00 0.00 0.00
10 1.60 1.00 0.00 0.00
11 1.30 0.80 0.00 0.00
12 1.40 0.20 0.00 0.00
13 1.50 -0.40 0.00 0.00
14 1.70 -1.00 0.00 0.00
15 1.60 1.00 0.00 0.00
16 1.30 0.80 -0.40 -1.00
17 1.40 0.20 0.40 1.30
18 1.50 -0.40 -0.50 -0.90
19 1.70 -1.00 0.40 1.20
20 1.60 1.00 0.40 1.00
21 1.30 0.80 -0.40 -1.00
22 1.40 0.20 0.40 1.30
23 1.50 -0.40 -0.50 -0.90
24 1.70 -1.00 0.40 1.20
25 1.60 1.00 0.40 1.00
Table 1: Discrimination, easiness and DIF parameter values used in the simulation studies.
Small DIF Large DIF
0.032 0.016 0.032 0.017
0.038 0.018 0.038 0.018
0.058 0.027 0.061 0.029
0.026 0.011 0.026 0.013

Table 2: Average mean squared errors of the estimated parameters in the simulation studies. Mean squared errors are first evaluated by averaging over the 100 replications and then averaged across the 25 items to obtain the average mean squared errors for the discrimination, easiness, and DIF parameters. The mean squared error of the group difference parameter for the latent trait is also presented.
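The averaging scheme described in the caption of Table 2 can be sketched as follows; the arrays `est` and `truth` are hypothetical toy numbers, not the simulation output.

```python
# Sketch of Table 2's averaging: MSEs are first averaged over replications
# for each item, then across items.
# est[r][j] is the estimate for item j in replication r; truth[j] is the truth.

def average_mse(est, truth):
    n_rep = len(est)
    n_item = len(truth)
    # per-item MSE over replications
    per_item = [sum((est[r][j] - truth[j]) ** 2 for r in range(n_rep)) / n_rep
                for j in range(n_item)]
    # average across items
    return sum(per_item) / n_item

truth = [1.0, -0.5]
est = [[1.1, -0.4], [0.9, -0.6]]  # 2 replications, 2 items (toy data)
print(round(average_mse(est, truth), 6))  # prints 0.01
```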
Small DIF Large DIF Small DIF Large DIF
p-value based method 0.014 0.027 0.022 0.024
LRT (1 anchor item) 0.015 0.027 0.020 0.020
LRT (5 anchor items) 0.023 0.034 0.019 0.025
LRT (10 anchor items) 0.015 0.023 0.011 0.013

Table 3: Comparison of the FDR of the proposed p-value based method and the LRT method with 1, 5, and 10 anchor items, respectively, at the FDR control level of 5%. The values are averaged over 100 replications.
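For reference, the Benjamini-Hochberg (B-H) step-up procedure used for the FDR control can be sketched as follows; the p-values are toy numbers for illustration.

```python
# Sketch of the Benjamini-Hochberg step-up procedure for controlling the FDR
# at level alpha when flagging DIF items based on their p-values.

def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices of rejected hypotheses (detected DIF items)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank k with p_(k) <= (k / m) * alpha
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])

# Toy p-values: the first three are small (candidate DIF items).
p = [0.001, 0.004, 0.012, 0.2, 0.6, 0.9]
print(benjamini_hochberg(p, alpha=0.05))  # prints [0, 1, 2]
```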
Small DIF Large DIF Small DIF Large DIF
TPR Hard-thresholding 0.453 0.984 0.602 1.000
LASSO 0.431 0.986 0.610 1.000
FPR Hard-thresholding 0.044 0.037 0.033 0.022
LASSO 0.038 0.041 0.038 0.021

Table 4: Comparison of the TPR and FPR of the proposed hard-thresholding method and the LASSO method. The optimal thresholds for the hard-thresholding method and the optimal penalties for the LASSO method are selected using the BIC criterion. The results are averaged over 100 replications.
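A minimal sketch of the hard-thresholding step with BIC-based threshold selection is given below. The quadratic pseudo-likelihood is a stand-in for the profiled likelihood of the MIMIC model, which is not reproduced here, and all numeric values are hypothetical.

```python
# Sketch of hard-thresholding: DIF estimates with absolute value at most c
# are set to zero, and c is chosen by a BIC-type criterion.
import math

def hard_threshold(gamma_hat, c):
    return [g if abs(g) > c else 0.0 for g in gamma_hat]

def bic(gamma, gamma_hat, n):
    # Stand-in: squared-error fit term plus a log(n) penalty per nonzero entry.
    # The actual procedure uses the marginal likelihood of the MIMIC model.
    fit = n * sum((g - gh) ** 2 for g, gh in zip(gamma, gamma_hat))
    k = sum(g != 0.0 for g in gamma)
    return fit + k * math.log(n)

gamma_hat = [0.02, -0.03, 0.45, -0.50, 0.01]  # toy estimates
n = 500                                        # toy sample size
candidates = sorted(abs(g) for g in gamma_hat)
best_c = min(candidates,
             key=lambda c: bic(hard_threshold(gamma_hat, c), gamma_hat, n))
print(hard_threshold(gamma_hat, best_c))  # prints [0.0, 0.0, 0.45, -0.5, 0.0]
```

In this toy example the BIC-type criterion zeroes out the three small estimates and keeps the two large ones, mirroring how the threshold is selected in the simulation study.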
Small DIF Large DIF Small DIF Large DIF
p-value based 0.809 0.997 0.926 1.000
Hard-thresholding 0.825 0.996 0.914 1.000
LASSO 0.789 0.994 0.837 0.999
LRT (1 anchor item) 0.696 0.958 0.831 0.994
LRT (5 anchor items) 0.793 0.990 0.909 1.000
LRT (10 anchor items) 0.801 0.992 0.927 1.000

Table 5: Comparison of the AUC of the proposed p-value based method, the hard-thresholding method, the LASSO method, and the LRT method with 1, 5, and 10 anchor items, respectively.

6 Application to EPQ-R Data

DIF methods have been commonly used for assessing the measurement invariance of personality tests (e.g., Escorial and Navas (2007), Millsap (2012), Thissen et al. (1986)). In this section, we apply the proposed method to the Eysenck Personality Questionnaire-Revised (EPQ-R, Eysenck et al. (1985)), a personality test that has been intensively studied and applied worldwide (Fetvadjiev and van de Vijver, 2015). The EPQ-R has three scales that measure the Psychoticism (P), Neuroticism (N) and Extraversion (E) personality traits, respectively. We analyze the long forms of the three personality scales, which consist of 32, 24, and 23 items, respectively. Each item has binary responses of “yes” and “no”, indicated by 1 and 0, respectively. This analysis is based on data from an EPQ-R study of 1432 participants in the United Kingdom. Among these participants, 823 are females and 609 are males. Females and males are indicated by 0 and 1, respectively. We study DIF caused by gender. The three scales are analyzed separately using the proposed methods.

The results are shown in Tables 6 - 8 and Figure 4. Specifically, Tables 6 - 8 present the p-values for testing the null hypothesis of no DIF, together with the detection results, for the P, E, and N scales, respectively. In each table, the items are ordered by increasing p-values. The items marked with “F” are the ones detected by the B-H procedure with FDR level 0.05, and those marked with “H” are the ones detected by the hard-thresholding procedure whose threshold is chosen by BIC. The item IDs are consistent with those in Appendix 1 of Eysenck et al. (1985), where the item descriptions are given. The three panels of Figure 4 further give the point estimate and confidence interval of the DIF parameter for each item, for the three scales, respectively. Under the current model parameterization, a positive DIF parameter means that a male participant is more likely to answer “yes” to the item than a female participant, given that they have the same personality trait level. We note that the absolute values of the estimated DIF parameters are all below 1, suggesting that there are no items with very large gender-related DIF effects.
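The sign interpretation above can be illustrated with a small sketch of a 2PL item response function with a DIF effect in the intercept. The parameter names `a`, `d`, and `gamma` (discrimination, easiness, and DIF effect) and the numeric values are assumptions for illustration, not estimates from the EPQ-R data.

```python
# Illustration: under a 2PL model with a DIF effect in the intercept,
# a positive DIF parameter raises the "yes" probability for group x = 1
# relative to group x = 0 at the same trait level.
import math

def prob_yes(theta, x, a, d, gamma):
    """P(Y = 1 | trait theta, group x) for one item."""
    return 1.0 / (1.0 + math.exp(-(a * theta + d + gamma * x)))

# Same trait level, positive DIF parameter: group x = 1 (males) is more
# likely to answer "yes" than group x = 0 (females).
p_female = prob_yes(theta=0.0, x=0, a=1.5, d=-0.4, gamma=0.4)
p_male = prob_yes(theta=0.0, x=1, a=1.5, d=-0.4, gamma=0.4)
print(p_male > p_female)  # prints True
```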

From Tables 6 - 8, we see that all three scales have some items whose p-values are close to zero, suggesting that gender DIF may exist across the three scales. We list some example items with the smallest p-values for each of the three scales. For the P scale, items 7 “Would being in debt worry you?”, 14 “Do you dislike people who don’t know how to behave themselves?” and 34 “Do you have enemies who want to harm you?” have the smallest p-values. As shown in Figure 4, the estimated DIF parameters are positive for items 14 and 34 and negative for item 7. For the E scale, items 63 “Do you nearly always have a ‘ready answer’ when people talk to you?”, 90 “Do you like plenty of bustle and excitement around you?” and 36 “Do you have many friends?” have the smallest p-values, where item 63 has a positive estimated DIF parameter and items 90 and 36 have negative ones. For the N scale, the top three items are 8 “Do you ever feel ‘just miserable’ for no reason?”, 84 “Do you often feel lonely?” and 70 “Do you often feel life is very dull?”, where the estimated DIF parameter is negative for item 8 and positive for items 84 and 70.

From Tables 6 through 8, we see that the selection based on the B-H procedure with FDR level 0.05 and that based on the hard-thresholding approach are largely consistent with each other. In particular, items selected by the B-H procedure are also selected by the hard-thresholding approach. More precisely, the B-H procedure detects 10, 11, and 11 items as DIF items from the P, E, and N scales, respectively, and the hard-thresholding procedure selects 11, 12, and 12 items, respectively. We remark that the hard-thresholding procedure is based on the point estimates of the DIF parameters and thus does not always select the items with the smallest p-values. For example, it detects item 91 as a DIF item but does not detect item 88, even though item 91 has a larger p-value. This is because item 91 has a larger absolute estimated DIF effect than item 88.

Figure 4: Plots of 95% confidence intervals for the DIF parameters on the P, N, and E scales. The red horizontal lines denote zero. Items are arranged in order of increasing p-values.
Item 7 FH 14 FH 34 FH 95 FH 81 FH 2 FH 73 FH 30 FH
p-value 0.00000 0.00002 0.00004 0.00008 0.00014 0.00050 0.00534 0.00646
Item 37 FH 9 FH 88 91  H 29 56 41 99
p-value 0.00736 0.00750 0.01800 0.05182 0.06756 0.11788 0.16000 0.20586
Item 5 79 12 68 21 18 96 64
p-value 0.28952 0.29974 0.30520 0.31474 0.32266 0.32318 0.37490 0.44788
Item 85 25 42 48 59 54 50 75
p-value 0.48100 0.53366 0.72276 0.81756 0.90470 0.92302 0.93352 0.94764
Table 6: P-values for testing the null hypothesis of no DIF for the items in the P scale. Note that the items are ordered by increasing p-values. Items selected by the B-H procedure with FDR controlled at 5% and by the proposed hard-thresholding method are marked with “F” and “H”, respectively, beside the item numbers.
Item 63 FH 90 FH 36 FH 6 FH 67 FH 33 FH 61 FH 94 FH
p-value 0.00014 0.00108 0.00114 0.00194 0.00262 0.00340 0.00578 0.00734
Item 78 FH 51 FH 58 FH 11  H 28 20 55 1
p-value 0.01206 0.01260 0.02264 0.04164 0.10804 0.12628 0.12706 0.41380
Item 40 16 24 69 47 45 72
p-value 0.62640 0.64666 0.76164 0.77082 0.83650 0.84300 0.92682
Table 7: P-values for testing the null hypothesis of no DIF for the items in the E scale. Note that the items are ordered by increasing p-values. Items selected by the B-H procedure with FDR controlled at 5% and by the proposed hard-thresholding procedure are marked with “F” and “H”, respectively, beside the item numbers.
Item 8 FH 84 FH 70 FH 87 FH 22 FH 17 FH 74 FH 83 FH
p-value 0.00002 0.00010 0.00014 0.00014 0.00022 0.00026 0.00042 0.00376
Item 92 FH 52 FH 60 FH 26 H 38 46 13 100
p-value 0.00450 0.00828 0.02028 0.03514 0.07220 0.11180 0.14794 0.25704
Item 43 80 65 31 3 35 97 76
p-value 0.32968 0.33684 0.63190 0.63214 0.77634 0.82176 0.88542 0.93842
Table 8: P-values for testing the null hypothesis of no DIF for the items in the N scale. Note that the items are ordered by increasing p-values. Items selected by the B-H procedure with FDR controlled at 5% and by the proposed hard-thresholding procedure are marked with “F” and “H”, respectively, beside the item numbers.

7 Discussion

This paper proposes a new method for DIF analysis under a MIMIC model framework. It can accurately estimate the DIF effects of individual items without requiring prior knowledge about an anchor item set, and it also provides valid p-values. The point estimates and p-values are further used for the detection of DIF items. According to our simulation results, the proposed p-value-based and hard-thresholding procedures perform better than the LASSO method of Belzak and Bauer (2020) under the small DIF settings and comparably under the large DIF settings. In addition, the p-value-based methods accurately control the item-specific type-I errors and the FDR. Finally, the proposed method is applied to the three scales of the Eysenck Personality Questionnaire-Revised to study gender-related DIF. For each of the three long forms of the P, N, and E scales, around 10 items are detected by the proposed procedures as potential DIF items. The psychological mechanism behind these DIF effects is worth future investigation. An R package implementing the proposed procedures has been developed and will be published online upon the acceptance of this paper.

The proposed method has several advantages over the LASSO method. First, the proposed method does not require a tuning parameter to estimate the model parameters, whereas the LASSO method involves choosing a tuning parameter for the regularization term. It is thus more straightforward for practitioners to use. Second, we do not need to solve optimization problems that maximize a regularized likelihood function under different tuning parameter choices. The proposed method is therefore computationally less intensive, since optimization involving a regularized likelihood function is non-trivial due to both the integral with respect to the latent variables and the non-smooth penalty term. Finally, the proposed method provides valid statistical inference, which is more difficult for the LASSO method due to the uncertainty associated with the model selection step.

8 Limitations and Future Directions

The current work has some limitations, which offer opportunities for future research. First, the current method only considers DIF in the intercept parameters but not in the slope (i.e., discrimination) parameters of the 2PL model. To generalize the current method to further account for DIF in the slope parameters, an additional set of DIF parameters needs to be introduced and a new ML1 condition needs to be imposed. The estimation procedure also needs to be generalized accordingly.

Second, as is true for all simulation studies, we cannot examine all possible conditions that might occur in applied settings. Additional simulation studies will be conducted in future research to better understand the performance of the proposed method. In particular, the sample size, the number of items, the group sizes, and the distribution of the DIF items can be varied.

Finally, the current work focuses on the type-I error and FDR as error metrics that concern falsely detecting non-DIF items as DIF items. In many applications of measurement invariance, it may be of more interest to consider an error metric that concerns the false detection of DIF items as DIF-free. Suitable error metrics, as well as methods for controlling such error metrics, remain to be proposed.


  • Bauer et al. (2020) Bauer, D. J., Belzak, W. C., and Cole, V. T. (2020). Simplifying the assessment of measurement invariance over multiple background variables: Using regularized moderated nonlinear factor analysis to detect differential item functioning. Structural Equation Modeling: A Multidisciplinary Journal, 27(1):43–55.
  • Belzak and Bauer (2020) Belzak, W. and Bauer, D. J. (2020). Improving the assessment of measurement invariance: Using regularization to select anchor items and identify differential item functioning. Psychological Methods, 25(6):673–690.
  • Benjamini and Hochberg (1995) Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57:289–300.
  • Birnbaum (1968) Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In Lord, F. M. and Novick, M. R., editors, Statistical Theories of Mental Test Scores, pages 395–479. Addison Wesley, Reading, MA.
  • Cai et al. (2011) Cai, L., Du Toit, S., and Thissen, D. (2011). IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling [Computer software]. Chicago, IL: Scientific Software International.
  • Candell and Drasgow (1988) Candell, G. L. and Drasgow, F. (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12(3):253–260.
  • Cao et al. (2017) Cao, M., Tay, L., and Liu, Y. (2017). A Monte Carlo study of an iterative Wald test procedure for DIF analysis. Educational and Psychological Measurement, 77(1):104–118.
  • Clauser et al. (1993) Clauser, B., Mazor, K., and Hambleton, R. K. (1993). The effects of purification of matching criterion on the identification of DIF using the Mantel-Haenszel procedure. Applied Measurement in Education, 6(4):269–279.
  • Dorans and Kulick (1986) Dorans, N. J. and Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the scholastic aptitude test. Journal of Educational Measurement, 23(4):355–368.
  • Escorial and Navas (2007) Escorial, S. and Navas, M. J. (2007). Analysis of the gender variable in the Eysenck Personality Questionnaire–revised scales using differential item functioning techniques. Educational and Psychological Measurement, 67(6):990–1001.
  • Eysenck et al. (1985) Eysenck, S. B., Eysenck, H. J., and Barrett, P. (1985). A revised version of the psychoticism scale. Personality and Individual Differences, 6(1):21–29.
  • Fetvadjiev and van de Vijver (2015) Fetvadjiev, V. H. and van de Vijver, F. J. (2015). Measures of personality across cultures. In Measures of Personality and Social Psychological Constructs, pages 752–776. Elsevier.
  • Fidalgo et al. (2000) Fidalgo, A., Mellenbergh, G. J., and Muñiz, J. (2000). Effects of amount of DIF, test length, and purification type on robustness and power of Mantel-Haenszel procedures. Methods of Psychological Research Online, 5(3):43–53.
  • Frick et al. (2015) Frick, H., Strobl, C., and Zeileis, A. (2015). Rasch mixture models for DIF detection: A comparison of old and new score specifications. Educational and Psychological Measurement, 75(2):208–234.
  • Goldberger (1972) Goldberger, A. S. (1972). Structural equation methods in the social sciences. Econometrica: Journal of the Econometric Society, 40:979–1001.
  • Huang (2018) Huang, P. H. (2018). A penalized likelihood method for multi-group structural equation modelling. British Journal of Mathematical and Statistical Psychology, 71(3):499–522.
  • Kim et al. (1995) Kim, S. H., Cohen, A. S., and Park, T. H. (1995). Detection of differential item functioning in multiple groups. Journal of Educational Measurement, 32(3):261–276.
  • Koenker (2005) Koenker, R. (2005). Quantile Regression. Econometric Society Monographs. Cambridge University Press.
  • Koenker et al. (2018) Koenker, R., Portnoy, S., Ng, P. T., Zeileis, A., Grosjean, P., and Ripley, B. D. (2018). Package ‘quantreg’. Cran R-project. org.
  • Kopf et al. (2015a) Kopf, J., Zeileis, A., and Strobl, C. (2015a). Anchor selection strategies for DIF analysis: Review, assessment, and new approaches. Educational and Psychological Measurement, 75(1):22–56.
  • Kopf et al. (2015b) Kopf, J., Zeileis, A., and Strobl, C. (2015b). A framework for anchor methods and an iterative forward approach for DIF detection. Applied Psychological Measurement, 39(2):83–103.
  • Lord (2012) Lord, F. M. (2012). Applications of item response theory to practical testing problems. Routledge.
  • Louis (1982) Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 44(2):226–233.
  • Magis et al. (2015) Magis, D., Tuerlinckx, F., and De Boeck, P. (2015). Detection of differential item functioning using the lasso approach. Journal of Educational and Behavioral Statistics, 40(2):111–135.
  • Mantel and Haenszel (1959) Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4):719–748.
  • May (2006) May, H. (2006). A multilevel Bayesian item response theory method for scaling socioeconomic status in international studies of education. Journal of Educational and Behavioral Statistics, 31(1):63–79.
  • Meinshausen and Yu (2009) Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246–270.
  • Millsap (2012) Millsap, R. E. (2012). Statistical approaches to measurement invariance. Routledge.
  • Muthen (1985) Muthen, B. (1985). A method for studying the homogeneity of test items with respect to other relevant variables. Journal of Educational Statistics, 10(2):121–132.
  • Muthen (1986) Muthen, B. (1986). Some Uses of Structural Equation Modeling in Validity Studies: Extending IRT to External Variables Using SIMS Results. Center for the Study of Evaluation.
  • Muthen et al. (1991) Muthen, B., Kao, C. F., and Burstein, L. (1991). Instructionally sensitive psychometrics: Application of a new IRT-based detection technique to mathematics achievement test items. Journal of Educational Measurement, 28(1):1–22.
  • Muthen and Lehman (1985) Muthen, B. and Lehman, J. (1985). Multiple group IRT modeling: Applications to item bias analysis. Journal of Educational Statistics, 10(2):133–142.
  • Oort (1998) Oort, F. J. (1998). Simulation study of item bias detection with restricted factor analysis. Structural Equation Modeling: A Multidisciplinary Journal, 5(2):107–124.
  • Raju (1988) Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53(4):495–502.
  • Raju (1990) Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14(2):197–207.
  • Rasch (1960) Rasch, G. (1960). Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests.
  • Schauberger and Mair (2020) Schauberger, G. and Mair, P. (2020). A regularization approach for the detection of differential item functioning in generalized partial credit models. Behavior Research Methods, 52(1):279–294.
  • Schwarz (1978) Schwarz, G. (1978). The Bayesian information criterion. Annals of Statistics, 6:461–464.
  • Shealy and Stout (1993) Shealy, R. and Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58(2):159–194.
  • Soares et al. (2009) Soares, T. M., Gonçalves, F. B., and Gamerman, D. (2009). An integrated Bayesian model for DIF analysis. Journal of Educational and Behavioral Statistics, 34(3):348–377.
  • Steenkamp and Baumgartner (1998) Steenkamp, J.-B. E. and Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25(1):78–90.
  • Swaminathan and Rogers (1990) Swaminathan, H. and Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4):361–370.
  • Tay et al. (2016) Tay, L., Huang, Q., and Vermunt, J. K. (2016). Item response theory with covariates (IRT-C) assessing item recovery and differential item functioning for the three-parameter logistic model. Educational and Psychological Measurement, 76(1):22–42.
  • Tay et al. (2015) Tay, L., Meade, A. W., and Cao, M. (2015). An overview and practical guide to IRT measurement equivalence analysis. Organizational Research Methods, 18(1):3–46.
  • Thissen (1988) Thissen, D. (1988). Use of item response theory in the study of group differences in trace lines. Test validity.
  • Thissen (2001) Thissen, D. (2001). Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning. Chapel Hill: University of North Carolina at Chapel Hill.
  • Thissen et al. (1986) Thissen, D., Steinberg, L., and Gerrard, M. (1986). Beyond group-mean differences: The concept of item bias. Psychological Bulletin, 99:118.
  • Thissen et al. (1993) Thissen, D., Steinberg, L., and Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In Holland, P. W. and Wainer, H., editors, Differential Item Functioning. Lawrence Erlbaum, Hillsdale, NJ.
  • Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288.
  • Tutz and Schauberger (2015) Tutz, G. and Schauberger, G. (2015). A penalty approach to differential item functioning in Rasch models. Psychometrika, 80(1):21–43.
  • van de Geer and Bühlmann (2009) van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3(2009):1360–1392.
  • van der Vaart (2000) van der Vaart, A. W. (2000). Asymptotic statistics, volume 3. Cambridge University Press.
  • Wang et al. (2009) Wang, W. C., Shih, C. L., and Yang, C. C. (2009). The MIMIC method with scale purification for detecting differential item functioning. Educational and Psychological Measurement, 69(5):713–731.
  • Wang and Su (2004) Wang, W. C. and Su, Y. H. (2004). Effects of average signed area between two item characteristic curves and test purification procedures on the DIF detection via the Mantel-Haenszel method. Applied Measurement in Education, 17(2):113–144.
  • Wang and Yeh (2003) Wang, W. C. and Yeh, Y. L. (2003). Effects of anchor item methods on differential item functioning detection with the likelihood ratio test. Applied Psychological Measurement, 27(6):479–498.
  • Woods et al. (2013) Woods, C. M., Cai, L., and Wang, M. (2013). The Langer-improved Wald test for DIF testing with multiple groups: Evaluation and comparison to two-group IRT. Educational and Psychological Measurement, 73(3):532–547.
  • Yuan et al. (2021) Yuan, K., Liu, H., and Han, Y. (2021). Differential item functioning analysis without a priori information on anchor items: QQ plots and graphical test. Psychometrika, 86:345–377.
  • Zellner (1970) Zellner, A. (1970). Estimation of regression relationships containing unobservable independent variables. International Economic Review, 11:441–454.
  • Zhao and Yu (2006) Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563.
  • Zwick and Thayer (2002) Zwick, R. and Thayer, D. T. (2002). Application of an empirical Bayes enhancement of Mantel-Haenszel differential item functioning analysis to a computerized adaptive test. Applied Psychological Measurement, 26(1):57–76.
  • Zwick et al. (2000) Zwick, R., Thayer, D. T., and Lewis, C. (2000). Using loss functions for DIF detection: An empirical Bayes approach. Journal of Educational and Behavioral Statistics, 25(2):225–247.

Appendix A Proof of Theorems

Proof of Theorem 1.

Since MIMIC model with constraint is identifiable, by classical asymptotic theory for MLE (van der Vaart, 2000), we have converges in probability to That is, as , for any , we must have with probability tending to 1 that , and , for any . Denote as a function of Similarly, denote . Let and , respectively. We seek to establish that will converge in probability to as First note that by regularity conditions, there exists such that Then, there must exist such that Furthermore, note is clearly continuous and convex in , so consistency will follow if can be shown to converge point-wise to that is uniquely minimized at the true value (typically uniform convergence is needed, but point-wise convergence of convex functions implies their uniform convergence on compact subsets). Following the model identifiability and the ML1 condition (5), is unique. To see this, suppose for contradiction that there exist and such that and and First note that for all Then by model identifiability, there exists such that So we have


Hence, and . If ML1 condition (5) holds, then and This contradicts the assumption Hence, must be unique. For any

Take , it follows that for any fixed , as . Moreover, following from the uniqueness of and the continuity and the convexity of in , we must have as

Note that , , , for all From the model identifiability and the ML1 condition (5), we know that , , , for all Since , as , it follows directly from Slutsky's theorem that , as . ∎

Proof of Theorem 2.

Following from the results of Theorem 1, we have as for all i.e., for any , First consider the case when , we can simply take , then we have .

Next we consider the case when First note that implies that as for all . That is, for any Let By the assumption that , we know that Take Then,

Hence, we have

which then implies that

Combining the two cases, we have shown for all ,

Appendix B Asymptotic Distribution of

Since the model is identifiable with the constraint and all the regularity conditions in Theorem 5.39 of van der Vaart (2000) are satisfied, the maximum likelihood estimator is asymptotically normal. In practice, we use the inverse of the observed Fisher information matrix, denoted by , which is a consistent estimator of the asymptotic covariance matrix, to draw Monte Carlo samples. Below, we give procedures to evaluate the observed Fisher information matrix from the marginal log-likelihood.

Following the notations in the main article, we first provide the complete data log-likelihood function,

Since the latent trait is treated as a normally distributed random variable, we work with the marginal log-likelihood function,

Note that the observed Fisher information matrix cannot be directly obtained from the marginal log-likelihood due to the intractable integral. Instead, we apply the Louis identity (Louis, 1982) to evaluate the observed Fisher information matrix. Let and denote the gradient vector and the negative of the Hessian matrix of the complete data log-likelihood function, respectively. Then, by the Louis identity, the observed Fisher information matrix can be expressed as
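For reference, the Louis (1982) identity in its standard form reads as follows, where S and B denote the gradient and the negative Hessian of the complete-data log-likelihood (notation assumed here for exposition), y denotes the observed data, and the expectations are taken with respect to the posterior distribution of the latent variables given y:

```latex
I_{\mathrm{obs}}
  = \mathbb{E}\left[ B \mid \mathbf{y} \right]
  - \mathbb{E}\left[ S S^{\top} \mid \mathbf{y} \right]
  + \mathbb{E}\left[ S \mid \mathbf{y} \right]
    \mathbb{E}\left[ S \mid \mathbf{y} \right]^{\top}.
```

At the maximum likelihood estimate the last term vanishes, since the conditional expectation of the complete-data score equals the observed-data score, which is zero there.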

Denote . Then, in particular,

Furthermore, note that is a (3J+1) by (3J+1) matrix whose only non-zero entries are