1 Introduction
Measurement invariance refers to the psychometric equivalence of an instrument (e.g., a questionnaire or test) across several specified groups, such as gender and ethnicity. The lack of measurement invariance suggests that the instrument has different structures or meanings to different groups, leading to biases in measurements (Millsap, 2012).
Measurement invariance is typically assessed by differential item functioning (DIF) analysis of item response data that aims to detect the measurement noninvariant items (i.e. DIF items). More precisely, a DIF item has a response distribution that depends not only on the latent trait measured by the instrument but also on respondents’ group membership. Therefore, the detection of a DIF item involves comparing the item responses of different groups, conditioning on the latent traits. The complexity of the problem lies in that individuals’ latent trait levels cannot be directly observed but are measured by the instrument that may contain DIF items. In addition, different groups may have different latent trait distributions. This problem thus involves identifying the latent trait, and then conducting the conditional group comparison given individuals’ latent trait levels.
Many statistical methods have been developed for DIF analysis. Traditional methods for DIF analysis require prior knowledge about a set of DIFfree items, which is known as the anchor set. This anchor set is used to identify the true latent trait. These methods can be classified into two categories. Methods in the first categories
(Mantel and Haenszel, 1959; Dorans and Kulick, 1986; Swaminathan and Rogers, 1990; Shealy and Stout, 1993; Zwick et al., 2000; Zwick and Thayer, 2002; May, 2006; Soares et al., 2009; Frick et al., 2015) do not explicitly assume an item response theory (IRT) model, and methods in the second categories (Thissen, 1988; Lord, 2012; Kim et al., 1995; Raju, 1988, 1990; Cai et al., 2011; Woods et al., 2013; Thissen, 2001; Oort, 1998; Steenkamp and Baumgartner, 1998; Cao et al., 2017; Woods et al., 2013; Tay et al., 2015, 2016) are developed based on IRT models. Compared with nonIRTbased methods, an IRTbased method defines the DIF problem more clearly, at the price of potential model misspecification. Specifically, an IRT model represents the latent trait as a latent variable and further characterizes the itemspecific DIF effects by modeling each item response distribution as a function of the latent variable and group membership.The DIF problem is wellcharacterized by a multiple indicators, multiple causes (MIMIC) IRT model, which is a structural equation model that was originally developed for continuous indicators (Zellner, 1970; Goldberger, 1972) and later extended to categorical item response data (Muthen, 1985, 1986; Muthen et al., 1991; Muthen and Lehman, 1985). A MIMIC model for DIF consists of a measurement component and a structural component. The measurement component models how the item responses depend on the measured psychological trait and respondents’ group membership. The structural component models the groupspecific distributions of the psychological trait. The anchor set imposes zero constraints on itemspecific parameters in the measurement component, making the model and thus the latent trait identifiable. Consequently, the DIF effects of the rest of the items can be tested by drawing statistical inference on the corresponding parameters under the identified model.
Anchorsetbased methods rely heavily on a correctly specified set of DIFfree items. The misspecification of some anchor items can lead to invalid statistical inference results (Kopf et al., 2015b). To address this issue, item purification methods (Candell and Drasgow, 1988; Clauser et al., 1993; Fidalgo et al., 2000; Wang and Yeh, 2003; Wang and Su, 2004; Wang et al., 2009; Kopf et al., 2015b, a) have been proposed, which iteratively select an anchor set by stepwise model selection methods. More recently, regularized estimation methods (Magis et al., 2015; Tutz and Schauberger, 2015; Huang, 2018; Belzak and Bauer, 2020; Bauer et al., 2020; Schauberger and Mair, 2020) have been proposed that use LASSOtype regularized estimation procedures for simultaneous model selection and parameter estimation. Moreover, Yuan et al. (2021) proposed a visualization method for the detection of DIF under the Rasch model (Rasch, 1960)
. Their methods are based on testing differential item pair functioning, which does not require prior information of an anchor set. Unfortunately, unlike many anchorsetbased methods with a correctly specified anchor set, all these methods do not provide valid statistical inference for testing the null hypothesis of “item
is DIFfree”, for each item . That is, the typeI error for testing the hypothesis cannot be guaranteed to be controlled at a prespecified significance level.This paper proposes a new method that can accurately estimate the DIF effects without requiring prior knowledge. It further draws statistical inference on the DIF effects of individual items, yielding valid confidence intervals and pvalues. The point estimation and statistical inference lead to accurate detection of the DIF items, for which the typeI error of DIF detection can be controlled by the inference results. The method is proposed under a MIMIC model with a twoparameter logistic (Birnbaum, 1968) IRT measurement model and a linear structural model. The key to this method is a minimal norm assumption for identifying the true model. Methods are developed for estimating the model parameters, and obtaining confidence intervals and values. Procedures for the detection of DIF items are further developed. Our method is compared to the likelihood ratio test method (Thissen et al., 1993) that requires an anchor set, and the LASSObased approach recently proposed in Belzak and Bauer (2020). We begin with a statistical formulation of the DIF problem under a MIMIC model, followed by a review of related methods, and then a description of the proposed method.
2 A MIMIC Formulation of DIF
Consider individuals answering items. Let
be a binary random variable, denoting individual
’s response to item . Let be the observed value, i.e., the realization, of . For the ease of exposition, we useto denote the response vector of individual
. The individuals are from two groups, indicated by , where 0 and 1 are referred to as the reference and focal groups, respectively. We further introduce a latent variable , which represents the latent trait that the items are designed to measure. DIF occurs when the distribution of depends not only but also . More precisely, DIF occurs if is not conditionally independent of , given . Seemingly a simple group comparison problem, DIF analysis is nontrivial due to the latency of . In particular, the distribution of may depend on the value of , which confounds the DIF analysis. In what follows, we describe a MIMIC model framework for DIF analysis, under which the relationship among , , and is characterized. It is worth pointing out that this framework can be generalized to account for more complex DIF situations; see the Discussion Section.2.1 Measurement Model
The twoparameter logistic (2PL) model (Birnbaum, 1968) is widely used to model binary item responses (e.g., wrong/right or absent/present). In the absence of DIF, the 2PL model assumes a logistic relationship between and , which is independent of the value of . That is,
(1) 
where the slope parameter and intercept parameter are typically known as the discrimination and easiness parameters, respectively. The right hand side of (1) as a function of is known as the 2PL item response function. When the items potentially suffer from DIF, then the item response functions may depend on the group membership . In that case, the item response function can be modeled as
(2) 
where is an itemspecific parameter that characterizes its DIF effect. More precisely,
That is,
is the odds ratio for comparing two individuals from two groups who have the same latent trait level. Item
is DIFfree under this model when . We further make the local independence assumption as in most IRT models. That is, , …, are assumed to be conditionally independent, given and .2.2 Structural Model
A structural model specifies the distribution of , which may depend on the group membership. Specifically, we assume the conditional distribution of given
to follow a normal distribution,
Note that the latent trait distribution for the reference group is set to a standard normal distribution to identify the location and scale of the latent trait. A similar assumption is typically adopted in IRT models for a single group of individuals.
The MIMIC model for DIF combines the above measurement and structural models, for which a path diagram is given in Figure 1. The marginal likelihood function for this MIMIC model takes the form
(3) 
where denotes all the fixed model parameters.
The goal of DIF analysis is to detect the DIF items, i.e., the items for which . Unfortunately, without further assumptions, this problem is illposed due to the nonidentifiability of the model. We discuss this identifiabibility issue below.
2.3 Model Identifiability
Without further assumptions, the above MIMIC model is not identifiable. That is, for any constant , the model remains equivalent, if we simultaneously replace and by and , respectively, and keep unchanged. This identifiability issue is due to that all the items are allowed to suffer from DIF, resulting in an unidentified latent trait. In other words, without further assumptions, it is impossible to disentangle the DIF effects and the difference between the two groups’s latent trait distributions. As will be discussed below, statistical methods for DIF analysis make certain assumptions to ensure the identifiability.
3 Related Works
Many of the IRTbased DIF analyses (Thissen et al., 1986; Thissen, 1988; Thissen et al., 1993) require prior knowledge about a subset of DIFfree items, which are known as the anchor items. More precisely, we denote this known subset by . Then under the MIMIC model described above, it implies that the constraints are imposed for all in the estimation. With these zero constraints, the parameters cannot be freely transformed, and thus, the above MIMIC model becomes identifiable. Therefore, for each nonanchor item , the hypothesis of can be tested, for example, by a likelihood ratio test. The DIF items can then be detected based on the statistical inference of these hypothesis tests.
The validity of the anchoritembased analyses relies on the assumption that the anchor items are truly DIFfree. If the anchor set includes one or more DIF items, then the results can be misleading (Kopf et al., 2015b). To address the issue brought by the misspecification of the anchor set, item purification methods (Candell and Drasgow, 1988; Clauser et al., 1993; Fidalgo et al., 2000; Wang and Yeh, 2003; Wang and Su, 2004; Wang et al., 2009; Kopf et al., 2015b, a)
have been proposed that iteratively purify the anchor set. These methods conduct model selection using a stepwise procedure to select the anchor set, implicitly assuming that there exists a reasonably large set of DIF items. Then DIF is assessed by hypothesis testing given the selected anchor set. This type of methods also has several limitations. First, the model selection results may be sensitive to the choice of the initial set of anchor items and the specific stepwise procedure (e.g., forward or backward selection), which is a common issue with stepwise model selection procedures (e.g., stepwise variable selection for linear regression). Second, the model selection step has uncertainty. As a result, there is no guarantee that the selected anchor set is completely DIFfree, and furthermore, the postselection statistical inference of items may not be valid in the sense that the typeI error may not be controlled at the targeted significance level.
Regularized estimation methods (Magis et al., 2015; Tutz and Schauberger, 2015; Huang, 2018; Belzak and Bauer, 2020; Bauer et al., 2020; Schauberger and Mair, 2020) have also been proposed for identifying the anchor items, which also implicitly assumes that many items are DIFfree. These methods do not require prior knowledge about anchor items, and simultaneously select the DIFfree items and estimate the model parameters using a LASSOtype penalty (Tibshirani, 1996). Under the above MIMIC model, a regularized estimation procedure solves the following optimization problem,
(4) 
where is a tuning parameter that determines the sparsity level of the estimated parameters. Generally speaking, a larger value of leads to a more sparse vector A regularization method (Belzak and Bauer, 2020) solves the optimization problem (4) for a sequence of values, and then selects the tuning parameter based on the Bayesian Information Criterion (Schwarz, 1978). Let be the selected tuning parameter. Items for which are classified as DIF items and the rest are classified as DIFfree items. While the regularization methods are computationally more stable than stepwise model selection in the item purification methods, they also suffer from some limitations. First, they involve solving nonsmooth optimization problems like (4) for different tuning parameter values, which is computationally intensive and requires tailored computation code that is not available in most statistical packages/software for DIF analysis. Second, these methods may be sensitive to the choice of the tuning parameter. Although methods and theories have been developed in the statistics literature to guide the selection of the tuning parameter, there is no consensus on how the tuning parameter should be chosen, leaving ambiguity in the application of these methods. Finally, as a common issue of regularized estimation methods, obtaining valid statistical inference from these methods is not straightforward. That is, regularized estimation like (4) does not provide a valid value for testing the null hypothesis . In fact, postselection inference after regularized estimation was conducted in Bauer et al. (2020), where the type I error cannot be controlled at the targeted level under some simulation scenarios.
4 Proposed Method
In what follows, we propose a new method for DIF analysis that does not require prior knowledge about anchor items. As will be shown in the rest, the proposed method can not only accurately detect the DIF items, but also provide valid statistical inference for testing the hypotheses of .
4.1 Model Identification via a Minimal Condition
We impose a condition for identifying the true model in the same spirit as the sparsity assumption that is implicitly imposed in the regularized estimation and item purification methods. Recall that the MIMIC model remains equivalent, if we simultaneously replace and by and , respectively, and keep unchanged. Let , , , and denote the true model parameters. To identify the true model, we require the following minimal (ML1) condition to hold
(5) 
for all . This assumption implies that, among all models that are equivalent to the true model, the true parameter vector has the smallest norm. Since the norm measures the sparsity level of a vector (Tibshirani, 1996), the ML1 condition tends to hold when is sparse. To illustrate this point, we show the function in Figure 2, where , for all , and 1 when and , respectively.
4.2 Parameter Estimation
We now propose a procedure for estimating the model under the ML1 condition. This procedure is described in Algorithm 1 below.
Algorithm 1.

Step 1: Solve the following MML estimation problem
(6) 
Step 2: Solve the optimization problem
(7) 
Output: The ML1 estimate , .
We provide some remarks about these steps. The estimator (6) in Step 1 can be viewed as the MML estimator of the MIMIC model, treating item 1 as an anchor item. We emphasize that the constraint in Step 1 is an arbitrary but mathematically convenient constraint for ensuring the estimability of the MIMIC model when solving (6). It does not require item 1 to be truely a DIFfree item. This constraint can be replaced by any equivalent constraint, for example, , while not affecting the final estimation result. Step 2 finds the transformation that leads to the ML1 solution among all the models that are equivalent to the estimated model from Step 1. The optimization problem (7) is convex that takes the same form as the Least Absolute Deviations (LAD) objective function in median regression (Koenker, 2005)
. Consequently, it can be solved using standard statistical packages/software for quantile regression. In particular, the R package “
quantreg” (Koenker et al., 2018) is used in our simulation study and real data analysis.The ML1 condition (5), together with some additional regularity conditions, guarantees the consistency of the above ML1 estimator. That is, will converge to as the sample size grows to infinity. This result is formalized in Theorem 1, with its proof given in the Appendix.
Theorem 1.
Let be the true model parameters, and be the true parameter values of the equivalent MIMIC model with constraint . Assume this equivalent model satisfies the standard regularity conditions in Theorem 5.14 of van der Vaart (2000) that concerns the consistency of maximum likelihood estimation. Further, assume that the ML1 condition (5) holds. Then , as .
We further discuss the connection between the proposed estimator and the regularized estimator (4). Note that is the one with the smallest among all equivalent estimators that maximize the likelihood function (3). When the solution path of (4) is smooth and the solution to the ML1 problem (7) is unique, it is easy to see that is the limit of when the positive tuning parameter converges to zero. In other words, the proposed estimator can be viewed as a limiting version of the LASSO estimator (4). According to Theorem 1, this limiting version of the LASSO estimator is sufficient for the consistent estimation of the true model.
Note that the consistency in Theorem 1 further implies that the DIF items can be consistently selected by a simple hardthresholding approach. Similar hardthresholding methods perform well for variable selection in regression models (Meinshausen and Yu, 2009). Given our ML1 estimate and a prespecified threshold , this hardthresholding method classifies the items for which as the nonDIF items and the rest as DIF items. Theorem 2 below describes the result on selection consistency.
Theorem 2.
Let be estimators of the DIF parameters returned by Algorithm 1. For any fixed satisfying
, the probability
converges to 1, as the sample size goes to infinity, for all .
In practice, the value of can be chosen by BIC, similar to the choice of for the LASSO procedure in Belzak and Bauer (2020). That is, we consider a sequence of values. For each , the hardthresholding method is applied to obtain an estimated set of nonDIF items. We then refit the MIMIC model with the parameters fixed to zero for in this estimated nonDIF item set, and compute the corresponding BIC value. We choose the that yields the smallest BIC values. The final classification of the items is given by the results based on the chosen .
4.3 Statistical Inference
The statistical inference of individual parameters is of particular interest in the DIF analysis. In fact, without the bias brought by the regularization tuning parameter, we can draw valid statistical inference on the DIF parameters .
Note that the uncertainty of is inherited from , where asymptotically follows a meanzero multivariate normal distribution^{1}^{1}1Note that this is a degenerated multivariate normal distribution since . by the largesample theory for maximum likelihood estimation; see Appendix for more details. We denote this multivariate normal distribution by , where a consistent estimator of , denoted by , can be obtained based on the marginal likelihood. We define a function
where . Note that the function maps an arbitrary parameter vector of the MIMC model to the parameter of the equivalent ML1 parameter vector. To draw statistical inference, we need the distribution of
By the asymptotic distribution of , we know that the distribution of can be approximated by that of , and the latter can be further approximated by , where follows a normal distribution . Therefore, although function does not have an analytic form, we can approximate the distribution of by Monte Carlo simulation. We summarize this procedure in Algorithm 2 below.
Algorithm 2.

Input: The number of Monte Carlo samples and significance level .

Step 1: Generate i.i.d. samples from a multivariate normal distribution with mean and covariance matrix . We denote these samples as , …, .

Step 2: Obtain , for .

Step 3: Obtain the and quantiles of the empirical distribution of , denoted by and , respectively.

Output: Level confidence interval for is given by . In addition, the value for a twosided test of is given by
Algorithm 2 only involves sampling from a multivariate normal distribution and solving a convex optimization problem based on the LAD objective function, both of which are computationally efficient. The value of is set to 10,000 in our simulation study and 50,000 in the real data example below.
The pvalues can be used to control the typeI error, i.e., the probability of mistakenly detecting a nonDIF item as a DIF one. To control the itemspecific typeI errors to be below a prespecified threshold (e.g., ), we detect the items for which the corresponding pvalues are below . Besides the typeI error, we may also consider the False Discovery Rate (FDR) for DIF detection (Bauer et al., 2020) to account for multiple comparisons, where the FDR is defined as the expected ratio of the number of nonDIF items to the total number of detections. To control the FDR, the BenjaminiHochberg (BH) procedure (Benjamini and Hochberg, 1995) can be employed to the pvalues.
5 Simulation Study
This section conducts simulation studies to evaluate the performance of the proposed method and compare it with the likelihood ratio test (LRT) method (Thissen, 1988) and the LASSO method (Bauer et al., 2020). Note that the LRT method requires a known anchor item set. Correctly specified anchor item sets with different sizes will be considered when applying the LRT method.
In the simulation, we set the number of items , and consider two settings for the sample sizes, , and 1000. The parameters of the true model are set as follows. First, the discrimination parameters are set between 1 and 2, and the easiness parameters are set between and 1. Their true values are given in Table 1. The observations are split into groups of equal sizes, indicated by , and 1. The parameter in the structural model is set to 0.5, so that the latent trait distribution is standard normal and for the reference and focal groups, respectively. We consider two settings for the DIF parameters, one with smaller absolute DIF parameter values, and the other with larger absolute DIF parameter values. Their true values are given in Table 1. For both sets of the DIF parameters, the ML1 condition is satisfied. The combinations of settings for the sample sizes and DIF parameters lead to four settings in total. For each setting, 100 independent datasets are generated.
We first evaluate the accuracy of the proposed estimator given by Algorithm 1. Table 2 shows the mean squared errors (MSE) for and the average MSEs for s, s, and s that are obtained by averaging the corresponding MSEs over the items. As we can see, these MSEs and average MSEs are small in magnitude and decrease as the sample size increases under each setting. This observation aligns with our consistency result in Theorem 1.
We then compare the proposed method and the LRT method in terms of their performances on statistical inference. Specifically, we focus on whether FDR can be controlled when applying the BH procedure to the pvalues obtained from the two methods. The comparison results are given in Table 3. As we can see, FDR is controlled to be below the targeted level for the proposed method and the LRT method with 1, 5, and 10 anchor items, under all settings. In addition, the FDR levels given by the proposed method are similar to those given by the LRT method.
When anchor items are known, the standard error can be computed for each estimated
and thus the corresponding Wald interval can be constructed. We compare the coverage rates of the confidence intervals given by Algorithm 2 and the Wald intervals that are based on five anchor items. The results are shown in Figure 3. We see that the coverage rates from both methods are comparable across all settings and are close to the 95% targeted level. Note that these coverage rates are calculated based on only 100 replicated datasets, which may be slightly affected by the Monte Carlo errors.Furthermore, we compare the proposed hardthresholding procedure with the LASSO procedure, in terms of the accuracy in the detection of DIF items. For the hardthresholding procedure, 20 values of threshold are considered that are equally spaced in and under the small and large DIF settings, respectively. For the LASSO procedure, 20 values are considered for the tuning parameter that are equally spaced in and under the small and large DIF settings, respectively. The optimal values of and are chosen by BIC, which yield the selection of DIF items for the two procedures, respectively. The selection accuracy is evaluated by two metrics, the true positive rate (TPR) and false positive rate (FPR), where the TPR is the expected ratio of the number of correctly detected DIF items to the number of truely DIF items, and the FPR is the expected ratio of the number of falsely detected DIF items to the number of nonDIF items. Table 4 shows the TPR and FPR that are estimated based on the 100 simulation replications. As we can see, the two methods have comparable TPR and FPR under all settings. It is worth noting that the hardthresholding method is computationally much faster than the LASSO method, since it does not involve maximizing a regularized likelihood under different tuning parameters.
Finally, we compare the detection power of different methods based on the receiver operating characteristic (ROC) curves. For a given method, an ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR) at different threshold settings. More specifically, ROC curves are constructed for the hardthresholding and the LASSO methods by varying the corresponding tuning parameters and . ROC curves are also constructed by thresholding the pvalues from the proposed method and the LRT method with 1, 5, and 10 anchor items, respectively. Note that for the LRT method, the TPR and FPR are calculated based on the nonanchor items. For each method, an average ROC curve is obtained based on the 100 replications, for which the area under the ROC curve (AUC) is calculated. A larger AUC value indicates the better detection power. The AUC values for different methods across our simulation settings are given in Table 5. According to the AUC values, the two proposed procedures (i.e., thresholding the pvalues from Algorithm 2 and the hardthresholding procedure) and the LRT method with 10 anchor items have similar detection power and perform better than the rest. That is, without knowing any anchor items, the proposed procedures perform better than the LRT method that knows 1 or 5 anchor items, and perform similarly as the LRT method that knows 10 anchor items. The superior performance of the proposed procedures is brought by the use of the ML1 condition, which identifies the model parameters using information from all the items. We also see that the LASSO procedure performs similarly as the proposed procedures under the large DIF settings, but is less accurate under the small DIF settings. This may be due to that the model selection consistency of the LASSO procedure requires the socalled irrepresentable conditions (Zhao and Yu, 2006; van de Geer and Bühlmann, 2009), which are strong conditions that may not be satisfied by the current MIMIC model under the small DIF settings.
Item number  (small DIF)  (large DIF)  

1  1.30  0.80  0.00  0.00 
2  1.40  0.20  0.00  0.00 
3  1.50  0.40  0.00  0.00 
4  1.70  1.00  0.00  0.00 
5  1.60  1.00  0.00  0.00 
6  1.30  0.80  0.00  0.00 
7  1.40  0.20  0.00  0.00 
8  1.50  0.40  0.00  0.00 
9  1.70  1.00  0.00  0.00 
10  1.60  1.00  0.00  0.00 
11  1.30  0.80  0.00  0.00 
12  1.40  0.20  0.00  0.00 
13  1.50  0.40  0.00  0.00 
14  1.70  1.00  0.00  0.00 
15  1.60  1.00  0.00  0.00 
16  1.30  0.80  0.40  1.00 
17  1.40  0.20  0.40  1.30 
18  1.50  0.40  0.50  0.90 
19  1.70  1.00  0.40  1.20 
20  1.60  1.00  0.40  1.00 
21  1.30  0.80  0.40  1.00 
22  1.40  0.20  0.40  1.30 
23  1.50  0.40  0.50  0.90 
24  1.70  1.00  0.40  1.20 
25  1.60  1.00  0.40  1.00 
Small DIF  Large DIF  

0.032  0.016  0.032  0.017  
0.038  0.018  0.038  0.018  
0.058  0.027  0.061  0.029  
0.026  0.011  0.026  0.013  

Small DIF  Large DIF  Small DIF  Large DIF  
pvalue based method  0.014  0.027  0.022  0.024 
LRT (1 anchor item)  0.015  0.027  0.020  0.020 
LRT (5 anchor items)  0.023  0.034  0.019  0.025 
LRT (10 anchor items)  0.015  0.023  0.011  0.013 

Small DIF  Large DIF  Small DIF  Large DIF  
TPR  Hardthresholing  0.453  0.984  0.602  1.000 
LASSO  0.431  0.986  0.610  1.000  
FPR  Hardthresholding  0.044  0.037  0.033  0.022 
LASSO  0.038  0.041  0.038  0.021  

Small DIF  Large DIF  Small DIF  Large DIF  
pvalue based  0.809  0.997  0.926  1.000 
Hardthresholding  0.825  0.996  0.914  1.000 
LASSO  0.789  0.994  0.837  0.999 
LRT (1 anchor item)  0.696  0.958  0.831  0.994 
LRT (5 anchor items)  0.793  0.990  0.909  1.000 
LRT (10 anchor items)  0.801  0.992  0.927  1.000 

6 Application to EPQR Data
DIF methods have been commonly used for assessing the measurement invariance of personality tests (e.g., Escorial and Navas (2007), Millsap (2012), Thissen et al. (1986)). In this section, we apply the proposed method to the Eysenck Personality QuestionnaireRevised (EPQR, Eysenck et al. (1985)), a personality test that has been intensively studied and received applications worldwide (Fetvadjiev and van de Vijver, 2015). The EPQR has three scales that measure the Psychoticism (P), Neuroticism (N) and Extraversion (E) personality traits, respectively. We analyze the long forms of the three personality scales that consist of 32, 24, and 23 items, respectively. Each item has binary responses of “yes” and “no” that are indicated by 1 and 0, respectively. This analysis is based on data of an EPQR study collected from 1432 participants in the United Kingdom. Among these participants, 823 are females and 609 are males. Females and males are indicated by and 1, respectively. We study the DIF caused by gender. The three scales are analyzed separately using the proposed methods.
The results are shown through Tables 6  8, and Figure 4. Specifically, Tables 6  8 present the pvalues for testing and the detection results for the P, E, N scales, respectively. For each table, the items are ordered by the pvalues in an increasing order. The items indicated by “F” are the ones detected by the BH procedure with FDR level 0.05, and those indicated by “H” are the ones detected by the hardthresholding procedure whose threshold is chosen by BIC. The item IDs are consistent with those in Appendix 1 of Eysenck et al. (1985), where the item descriptions are given. The three panels of Figure 4 further give the point estimate and confidence interval for each parameter, for the three scales, respectively. Under the current model parameterization, a positive DIF parameter means that a male participant is more likely to answer “yes” to the item than a female participant, given that they have the same personality trait level. We note that the absolute values of are all below 1, suggesting that there are no items with very large genderrelated DIF effects.
From Tables 6  8, we see that all the three scales have some items whose pvalues are close to zero, suggesting gender DIF may exists across the three scales. We list some example items with the smallest pvalues for each of the three scales. For the P scale, items 7 “Would being in debt worry you?”, 14 “Do you dislike people who don’t know how to behave themselves?” and 34 “Do you have enemies who want to harm you?” have the smallest pvalues. As shown in Figure 4, the estimated DIF parameters are positive for items 14 and 34 and negative for item 7. For the E scale, items 63 “Do you nearly always have a ‘ready answer’ when people talk to you?”, 90 “Do you like plenty of bustle and excitement around you?” and 36 “Do you have many friends?” have the smallest pvalues, where item 63 has a positive and items 90 and 36 have negative s. For the N scale, the top three items are 8 “Do you ever feel ‘just miserable’ for no reason?”, 84 “Do you often feel lonely?” and 70 “Do you often feel life is very dull?”, where is negative for item 8 and positive for items 84 and 70.
From Tables 6 through 8, we see that the selection based on the BH procedure with FDR level 0.05 and that based on the hardthresholding approach are largely consistent with each other. In particular, items selected by the BH procedure are also selected by the hardthresholding approach. More precisely, the BH procedure detects 10, 11, and 11 items as DIF items from the P, E, and N scales, respectively, and the hardthrehsolding procedure selects 11, 12, and 12 items, respectively. We remark that the hardthresholding procedure is based on the point estimate of the items and thus does not always select the items with the smallest pvalues. For example, it detects item 91 as a DIF item but does not detects item 88, though item 91 has a larger pvalue. This is because item 91 has a smaller absolute value of than item 88.
Item  7 FH  14 FH  34 FH  95 FH  81 FH  2 FH  73 FH  30 FH 

pvalue  0.00000  0.00002  0.00004  0.00008  0.00014  0.00050  0.00534  0.00646 
Item  37 FH  9 FH  88  91 H  29  56  41  99 
pvalue  0.00736  0.00750  0.01800  0.05182  0.06756  0.11788  0.16000  0.20586 
Item  5  79  12  68  21  18  96  64 
pvalue  0.28952  0.29974  0.30520  0.31474  0.32266  0.32318  0.37490  0.44788 
Item  85  25  42  48  59  54  50  75 
pvalue  0.48100  0.53366  0.72276  0.81756  0.90470  0.92302  0.93352  0.94764 
Item  63 FH  90 FH  36 FH  6 FH  67 FH  33 FH  61 FH  94 FH 

pvalue  0.00014  0.00108  0.00114  0.00194  0.00262  0.00340  0.00578  0.00734 
Item  78 FH  51 FH  58 FH  11 H  28  20  55  1 
pvalue  0.01206  0.01260  0.02264  0.04164  0.10804  0.12628  0.12706  0.41380 
Item  40  16  24  69  47  45  72  
pvalue  0.62640  0.64666  0.76164  0.77082  0.83650  0.84300  0.92682 
Item  8 FH  84 FH  70 FH  87 FH  22 FH  17 FH  74 FH  83 FH 

pvalue  0.00002  0.00010  0.00014  0.00014  0.00022  0.00026  0.00042  0.00376 
Item  92 FH  52 FH  60 FH  26 H  38  46  13  100 
pvalue  0.00450  0.00828  0.02028  0.03514  0.07220  0.11180  0.14794  0.25704 
Item  43  80  65  31  3  35  97  76 
pvalue  0.32968  0.33684  0.63190  0.63214  0.77634  0.82176  0.88542  0.93842 
7 Discussion
This paper proposes a new method for DIF analysis under a MIMIC model framework. It can accurately estimate the DIF effects of individual items without requiring prior knowledge about an anchor item set and also provide valid pvalues. The point estimate and pvalues are further used for the detection of DIF items. According to our simulation results, the proposed pvaluebased and hardthresholding procedures
have better and comparable performance under the small and large DIF settings, respectively, comparing with the LASSO method of Belzak and Bauer (2020).
In addition, the pvalue based methods accurately control the itemspecific typeI errors and the FDR.
Finally, the proposed method is applied to the three scales of the Eysenck Personality QuestionnaireRevised to study the genderrelated DIF.
For each of the three long forms of the P, N, E scales, around 10 items are detected by the proposed procedures as potential DIF items. The psychological mechanism of these DIF effects is worth future investigation.
An R package has been developed for the proposed procedures that will be published online upon the acceptance of this paper.
The proposed method has several advantages over the LASSO method. First, the proposed method does not require a tuning parameter to estimate the model parameters, while the LASSO method involves choosing the tuning parameter for the regularization term. It is thus more straightforward to use for the practitioners. Second, we do not need to solve the optimization problems that involve maximizing a regularized likelihood function under different tuning parameter choices. Therefore, the proposed method is computationally less intensive, since the optimization involving a regularized likelihood function is nontrivial due to both the integral with respect to the latent variables and the nonsmooth penalty term. Finally, the proposed method provides valid statistical inference, which is more difficult for the LASSO method due to the uncertainty associated with the model selection step.
8 Limitations and Future Directions
The current work has some limitations, which offers opportunities for future research. First, the current method only considers DIF associated with the intercept parameters but not the slope parameters (i.e., discrimination parameters) in the 2PL model. To generalize the current method to further account for DIF in the slope parameters, an additional set of DIF parameters need to be introduced and a new ML1 condition needs to be imposed. The estimation procedure also needs to be generalized accordingly.
Second, as is true for all simulation studies, we cannot examine all possible conditions that might occur in applied settings. Additional simulation studies will be conducted in future research to better understand the performance of the proposed method. In particular, sample sizes, item sizes, group sizes and distribution of the DIF items can be varied and tested.
Finally, the current work focuses on the typeI error and FDR as error metrics that concern falsely detecting nonDIF items as DIF items. In many applications of measurement invariance, it may be of more interest to consider an error metric that concerns the false detection of DIF items as DIFfree. Suitable error metrics, as well as methods for controlling such error metrics, remain to be proposed.
References
 Bauer et al. (2020) Bauer, D. J., Belzak, W. C., and Cole, V. T. (2020). Simplifying the assessment of measurement invariance over multiple background variables: Using regularized moderated nonlinear factor analysis to detect differential item functioning. Structural Equation Modeling: A Multidisciplinary Journal, 27(1):43–55.
 Belzak and Bauer (2020) Belzak, W. and Bauer, D. J. (2020). Improving the assessment of measurement invariance: Using regularization to select anchor items and identify differential item functioning. Psychological Methods, 25(6):673–690.
 Benjamini and Hochberg (1995) Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57:289–300.
 Birnbaum (1968) Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In Lord, F. M. and Novick, M. R., editors, Statistical Theories of Mental Test Scores, pages 395–479. Addison Wesley, Reading, MA.
 Cai et al. (2011) Cai, L., Du Toit, S., and Thissen, D. (2011). Irtpro: Flexible, multidimensional, multiple categorical irt modeling [computer software]. Chicago, IL: Scientific Software International.
 Candell and Drasgow (1988) Candell, G. L. and Drasgow, F. (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12(3):253–260.
 Cao et al. (2017) Cao, M., Tay, L., and Liu, Y. (2017). A Monte Carlo study of an iterative Wald test procedure for DIF analysis. Educational and Psychological Measurement, 77(1):104–118.
 Clauser et al. (1993) Clauser, B., Mazor, K., and Hambleton, R. K. (1993). The effects of purification of matching criterion on the identification of DIF using the MantelHaenszel procedure. Applied Measurement in Education, 6(4):269–279.
 Dorans and Kulick (1986) Dorans, N. J. and Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the scholastic aptitude test. Journal of Educational Measurement, 23(4):355–368.
 Escorial and Navas (2007) Escorial, S. and Navas, M. J. (2007). Analysis of the gender variable in the Eysenck Personality Questionnaire–revised scales using differential item functioning techniques. Educational and Psychological Measurement, 67(6):990–1001.
 Eysenck et al. (1985) Eysenck, S. B., Eysenck, H. J., and Barrett, P. (1985). A revised version of the psychoticism scale. Personality and Individual Differences, 6(1):21–29.
 Fetvadjiev and van de Vijver (2015) Fetvadjiev, V. H. and van de Vijver, F. J. (2015). Measures of personality across cultures. In Measures of Personality and Social Psychological Constructs, pages 752–776. Elsevier.
 Fidalgo et al. (2000) Fidalgo, A., Mellenbergh, G. J., and Muñiz, J. (2000). Effects of amount of DIF, test length, and purification type on robustness and power of MantelHaenszel procedures. Methods of Psychological Research Online, 5(3):43–53.
 Frick et al. (2015) Frick, H., Strobl, C., and Zeileis, A. (2015). Rasch mixture models for DIF detection: A comparison of old and new score specifications. Educational and Psychological Measurement, 75(2):208–234.
 Goldberger (1972) Goldberger, A. S. (1972). Structural equation methods in the social sciences. Econometrica: Journal of the Econometric Society, 40:979–1001.
 Huang (2018) Huang, P. H. (2018). A penalized likelihood method for multigroup structural equation modelling. British Journal of Mathematical and Statistical Psychology, 71(3):499–522.
 Kim et al. (1995) Kim, S. H., Cohen, A. S., and Park, T. H. (1995). Detection of differential item functioning in multiple groups. Journal of Educational Measurement, 32(3):261–276.
 Koenker (2005) Koenker, R. (2005). Quantile Regression. Econometric Society Monographs. Cambridge University Press.
 Koenker et al. (2018) Koenker, R., Portnoy, S., Ng, P. T., Zeileis, A., Grosjean, P., and Ripley, B. D. (2018). Package ‘quantreg’. Cran Rproject. org.
 Kopf et al. (2015a) Kopf, J., Zeileis, A., and Strobl, C. (2015a). Anchor selection strategies for DIF analysis: Review, assessment, and new approaches. Educational and Psychological Measurement, 75(1):22–56.
 Kopf et al. (2015b) Kopf, J., Zeileis, A., and Strobl, C. (2015b). A framework for anchor methods and an iterative forward approach for DIF detection. Applied Psychological Measurement, 39(2):83–103.
 Lord (2012) Lord, F. M. (2012). Applications of item response theory to practical testing problems. Routledge.
 Louis (1982) Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 44(2):226–233.
 Magis et al. (2015) Magis, D., Tuerlinckx, F., and De Boeck, P. (2015). Detection of differential item functioning using the lasso approach. Journal of Educational and Behavioral Statistics, 40(2):111–135.
 Mantel and Haenszel (1959) Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4):719–748.
 May (2006) May, H. (2006). A multilevel Bayesian item response theory method for scaling socioeconomic status in international studies of education. Journal of Educational and Behavioral Statistics, 31(1):63–79.

Meinshausen and Yu (2009)
Meinshausen, N. and Yu, B. (2009).
Lassotype recovery of sparse representations for highdimensional data.
The Annals of Statistics, 37(1):246–270.  Millsap (2012) Millsap, R. E. (2012). Statistical approaches to measurement invariance. Routledge.
 Muthen (1985) Muthen, B. (1985). A method for studying the homogeneity of test items with respect to other relevant variables. Journal of Educational Statistics, 10(2):121–132.
 Muthen (1986) Muthen, B. (1986). Some Uses of Structural Equation Modeling in Validity Studies: Extending IRT to External Variables Using SIMS Results. Center for the Study of Evaluation.
 Muthen et al. (1991) Muthen, B., Kao, C. F., and Burstein, L. (1991). Instructionally sensitive psychometrics: Application of a new IRTbased detection technique to mathematics achievement test items. Journal of Educational Measurement, 28(1):1–22.
 Muthen and Lehman (1985) Muthen, B. and Lehman, J. (1985). Multiple group IRT modeling: Applications to item bias analysis. Journal of Educational Statistics, 10(2):133–142.
 Oort (1998) Oort, F. J. (1998). Simulation study of item bias detection with restricted factor analysis. Structural Equation Modeling: A Multidisciplinary Journal, 5(2):107–124.
 Raju (1988) Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53(4):495–502.
 Raju (1990) Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14(2):197–207.
 Rasch (1960) Rasch, G. (1960). Studies in mathematical psychology: I. probabilistic models for some intelligence and attainment tests.
 Schauberger and Mair (2020) Schauberger, G. and Mair, P. (2020). A regularization approach for the detection of differential item functioning in generalized partial credit models. Behavior Research Methods, 52(1):279–294.
 Schwarz (1978) Schwarz, G. (1978). The Bayesian information criterion. Annals of Statistics, 6:461–464.
 Shealy and Stout (1993) Shealy, R. and Stout, W. (1993). A modelbased standardization approach that separates true bias/DIF from group ability differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58(2):159–194.
 Soares et al. (2009) Soares, T. M., Gonçalves, F. B., and Gamerman, D. (2009). An integrated Bayesian model for DIF analysis. Journal of Educational and Behavioral Statistics, 34(3):348–377.
 Steenkamp and Baumgartner (1998) Steenkamp, J.B. E. and Baumgartner, H. (1998). Assessing measurement invariance in crossnational consumer research. Journal of Consumer Research, 25(1):78–90.

Swaminathan and
Rogers (1990)
Swaminathan, H. and Rogers, H. J. (1990).
Detecting differential item functioning using logistic regression procedures.
Journal of Educational measurement, 27(4):361–370.  Tay et al. (2016) Tay, L., Huang, Q., and Vermunt, J. K. (2016). Item response theory with covariates (IRTC) assessing item recovery and differential item functioning for the threeparameter logistic model. Educational and Psychological Measurement, 76(1):22–42.
 Tay et al. (2015) Tay, L., Meade, A. W., and Cao, M. (2015). An overview and practical guide to IRT measurement equivalence analysis. Organizational Research Methods, 18(1):3–46.
 Thissen (1988) Thissen, D. (1988). Use of item response theory in the study of group differences in trace lines. Test validity.
 Thissen (2001) Thissen, D. (2001). Software for the computation of the statistics involved in item response theory likelihoodratio tests for differential item functioning. Chapel Hill: University of North Carolina at Chapel Hill.
 Thissen et al. (1986) Thissen, D., Steinberg, L., and Gerrard, M. (1986). Beyond groupmean differences: The concept of item bias. Psychological Bulletin, 99:118.
 Thissen et al. (1993) Thissen, D., Steinberg, L., and Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models.
 Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288.
 Tutz and Schauberger (2015) Tutz, G. and Schauberger, G. (2015). A penalty approach to differential item functioning in Rasch models. Psychometrika, 80(1):21–43.
 van de Geer and Bühlmann (2009) van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3(2009):1360–1392.
 van der Vaart (2000) van der Vaart, A. W. (2000). Asymptotic statistics, volume 3. Cambridge University Press.
 Wang et al. (2009) Wang, W. C., Shih, C. L., and Yang, C. C. (2009). The MIMIC method with scale purification for detecting differential item functioning. Educational and Psychological Measurement, 69(5):713–731.
 Wang and Su (2004) Wang, W. C. and Su, Y. H. (2004). Effects of average signed area between two item characteristic curves and test purification procedures on the DIF detection via the MantelHaenszel method. Applied Measurement in Education, 17(2):113–144.
 Wang and Yeh (2003) Wang, W. C. and Yeh, Y. L. (2003). Effects of anchor item methods on differential item functioning detection with the likelihood ratio test. Applied Psychological Measurement, 27(6):479–498.
 Woods et al. (2013) Woods, C. M., Cai, L., and Wang, M. (2013). The Langerimproved Wald test for DIF testing with multiple groups: Evaluation and comparison to twogroup IRT. Educational and Psychological Measurement, 73(3):532–547.
 Yuan et al. (2021) Yuan, K., Liu, H., and Han, Y. (2021). Differential item functioning analysis without a priori information on anchor items: QQ plots and graphical test. Psychometrika, 86:345–377.
 Zellner (1970) Zellner, A. (1970). Estimation of regression relationships containing unobservable independent variables. International Economic Review, 11:441–454.

Zhao and Yu (2006)
Zhao, P. and Yu, B. (2006).
On model selection consistency of lasso.
Journal of Machine Learning Research
, 7:2541–2563.  Zwick and Thayer (2002) Zwick, R. and Thayer, D. T. (2002). Application of an empirical Bayes enhancement of MantelHaenszel differential item functioning analysis to a computerized adaptive test. Applied Psychological Measurement, 26(1):57–76.

Zwick et al. (2000)
Zwick, R., Thayer, D. T., and Lewis, C. (2000).
Using loss functions for DIF detection: An empirical Bayes approach.
Journal of Educational and Behavioral Statistics, 25(2):225–247.
Appendix A Proof of Theorems
Proof of Theorem 1.
Since MIMIC model with constraint is identifiable, by classical asymptotic theory for MLE (van der Vaart, 2000), we have converges in probability to That is, as , for any , we must have with probability tending to 1 that , and , for any . Denote as a function of Similarly, denote . Let and , respectively. We seek to establish that will converge in probability to as First note that by regularity conditions, there exists such that Then, there must exist such that Furthermore, note is clearly continuous and convex in , so consistency will follow if can be shown to converge pointwise to that is uniquely minimized at the true value (typically uniform convergence is needed, but pointwise convergence of convex functions implies their uniform convergence on compact subsets). Following the model identifiability and the ML1 condition (5), is unique. To see this, suppose for contradiction that there exist and such that and and First note that for all Then by model identifiability, there exists such that So we have
and
Hence, and . If ML1 condition (5) holds, then and This contradicts the assumption Hence, must be unique.
For any
Take , it follows that for any fixed , as . Moreover, following from the uniqueness of and the continuity and the convexity of in , we must have as
Note that , , , for all From the model identifiability and the ML1 condition (5), we know that , , , for all Since , as , it follows directly from the Slutsky’s Theorem that , as . ∎
Proof of Theorem 2.
Following from the results of Theorem 1, we have as for all i.e., for any , First consider the case when , we can simply take , then we have .
Next we consider the case when First note that implies that as for all . That is, for any Let By the assumption that , we know that Take Then,
Hence, we have
which then implies that
Combining the two cases, we have shown for all ,
∎
Appendix B Asymptotic Distribution of
Since the model is identifiable with constraint and all the regularity conditions in Theorem 5.39 of van der Vaart (2000) are satisfied, hence, by Theorem 5.39 in van der Vaart (2000), N in distribution as In practice, we use the inverse of the observed Fisher information matrix, denoted by , which is a consistent estimator of , to draw Monte Carlo samples. Below, we give procedures to evaluate from the marginal loglikelihood.
Following the notations in the main article, we first provide the complete data loglikelihood function,
Since is considered as a random variable such that N, so we will work with the marginal loglikelihood function,
Note that the observed Fisher information matrix cannot be directly obtained from the due to the intractable integral. Instead, we apply the Louis Identity (Louis, 1982) to evaluate the observed Fisher information matrix. Let and denote the gradient vector and the negative of the hessian matrix of the complete data loglikelihood function, respectively. Then by the Louis Identity, can be expressed as
Denote . Then, in particular,
Furthermore, note that is a (3J+1) by (3J +1) matrix with the only nonzero entries,
Comments
There are no comments yet.