# Joint Maximum Likelihood Estimation for High-dimensional Exploratory Item Response Analysis

Multidimensional item response theory is widely used in education and psychology for measuring multiple latent traits. However, exploratory analysis of large-scale item response data with many items, respondents, and latent traits is still a challenge. In this paper, we consider a high-dimensional setting that both the number of items and the number of respondents grow to infinity. A constrained joint maximum likelihood estimator is proposed for estimating both item and person parameters, which yields good theoretical properties and computational advantage. Specifically, we derive error bounds for parameter estimation and develop an efficient algorithm that can scale to very large datasets. The proposed method is applied to a large scale personality assessment data set from the Synthetic Aperture Personality Assessment (SAPA) project. Simulation studies are conducted to evaluate the proposed method.

## Authors

• 16 publications
• 9 publications
• 6 publications
07/19/2019

### A Note on Exploratory Item Factor Analysis by Singular Value Decomposition

In this note, we revisit a singular value decomposition (SVD) based algo...
09/03/2020

### Unfolding-Model-Based Visualization: Theory, Method and Applications

Multidimensional unfolding methods are widely used for visualizing item ...
09/20/2021

### Machine Learning-Based Estimation and Goodness-of-Fit for Large-Scale Confirmatory Item Factor Analysis

We investigate novel parameter estimation and goodness-of-fit (GOF) asse...
03/30/2021

### A Tensor-EM Method for Large-Scale Latent Class Analysis with Clustering Consistency

Latent class models are powerful statistical modeling tools widely used ...
04/07/2019

### An IRT-based Model for Omitted and Not-reached Items

Missingness is a common occurrence in educational assessment and psychol...
06/17/2020

### Multidimensional Bayesian IRT Model for Hierarchical Latent Structures

It is reasonable to consider, in many cases, that individuals' latent tr...
10/21/2021

### DIF Statistical Inference and Detection without Knowing Anchoring Items

Establishing the invariance property of an instrument (e.g., a questionn...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Multidimensional item response theory (MIRT) models have been widely used in psychology and education for measuring multiple latent traits based on dichotomous or polytomous items. The concept of MIRT dates back to McDonald (1967), Lord et al. (1968) and Reckase (1972), and is closely related to linear factor analysis (e.g. Anderson, 2003) that models continuous multivariate data. We refer the readers to Reckase (2009) for a review of MIRT models. An important application of MIRT models is exploratory item factor analysis, which serves to uncover a set of latent traits underlying a battery of psychological items. Such exploratory analysis facilitates the understanding of large scale response data with many items, respondents and possibly many latent traits.

A key step in MIRT based analysis is parameter estimation, which can be numerically challenging under the high-dimensional setting, i.e., when the number of items, the sample size, and the dimension of the latent space are large. The most popular method for estimating item parameters is the maximum marginal likelihood estimator (MMLE). In this approach, respondent specific parameters (i.e., their latent trait levels) are treated as random effects and the marginal likelihood function of item parameters is obtained by integrating out the random effects. Item parameter estimates are obtained by maximizing the marginal likelihood function. This approach typically involves evaluating a -dimensional integral, where is the number of latent traits. When is moderately large (say, ), evaluating the integral can be numerically challenging. Many approaches have been proposed to approximate the integral, including adaptive Gaussian quadrature methods (e.g. Schilling and Bock, 2005), Monte Carlo integration (e.g. Meng and Schilling, 1996), fully Bayesian estimation methods (e.g. Béguin and Glas, 2001; Bolt and Lall, 2003; Edwards, 2010), and data augmented stochastic approximation algorithm (e.g. Cai, 2010a, b). However, even with these state-of-the-art algorithms, the computation is time-consuming.

An alternative approach to parameter estimation in MIRT models is the joint maximum likelihood estimator (JMLE; see, e.g. Embretson and Reise, 2000). This approach treats both the person and item parameters as fixed effects (i.e., model parameters). Parameter estimates are obtained by maximizing the joint likelihood function of both person and item parameters. Traditionally, the JMLE is less preferred to the MMLE. This is partly because, under the usual asymptotic setting where the number of respondents grows to infinity and the number of items is fixed, the number of parameters in the joint likelihood function also diverges, resulting in inconsistent estimate of item parameters (Neyman and Scott, 1948; Andersen, 1973; Haberman, 1977; Ghosh, 1995) even for simple IRT models.

In this paper, we propose a variant of joint maximum likelihood estimator, which adds constraints to the parameter space. This constrained joint maximum likelihood estimator (CJMLE) is computationally efficient and statistically solid under the high-dimensional setting. In terms of computation, an alternating minimization (AM) algorithm with projected gradient decent update is proposed, which can be parallelled efficiently. Specifically, we implement this parallel computing algorithm in R with core functions written in C++ through Open Multi-Processing (OpenMP, Dagum and Menon, 1998) that can scale to very large data. For example, according to our simulation study, the algorithm can fit a data set with 25,000 respondents, 1000 items, and 10 latent traits in 12 minutes on a single Intel(R) machine (Core(TM) i7CPU@5650U@2.2 GHz) with eight threads (OpenMP enabled). Theoretically, we provide error bounds on parameter estimation, under the setting that both the numbers of items and respondents grow to infinity. Specifically, we show that the item parameters can be consistently estimated up to a scaling and a rotation (possibly oblique). Consequently, the latent structure of test items may be investigated by using appropriate rotational methods (see e.g. Browne, 2001). Both our computational algorithm and theoretical framework are new to the psychometrics literature.

Our theoretical framework assumes a diverging number of items, which is suitable when analyzing large scale data. To the best of our knowledge, such an asymptotic setting has not received enough attention, except in Haberman (1977, 2004) and Chiu et al. (2016). Our theoretical analysis applies to a general MIRT model that includes the multidimensional two-parameter logistic model (Reckase and McKinley, 1983; Reckase, 2009) as a special case, while the analyses in Haberman (1977, 2004) and Chiu et al. (2016) are limited to the unidimensional Rasch model (Rasch, 1960; Lord et al., 1968) and cognitive diagnostic models (Rupp et al., 2010), respectively. Our technical tools for studying the properties of the CJMLE include theoretical developments in matrix completion theory (e.g. Candès and Plan, 2010; Candès and Tao, 2010; Candès and Recht, 2012; Davenport et al., 2014) and matrix perturbation theory (e.g. Stewart and Sun, 1990).

We apply the proposed method to a data set from the Synthetic Aperture Personality Assessment (SAPA) project (Condon and Revelle, 2015) that is collected to evaluate the structure of personality constructs in the temperament domain. Applying traditional exploratory item factor analysis methods to this data set is challenging because (1) the data set is of large scale, containing 23,681 respondents, 696 items, and possibly many latent traits, (2) the responses are not continuous and thus linear factor analysis is inappropriate, and (3) the data contain “massive missingness” by design that is missing completely at random. The proposed method turns out to be suitable for analyzing the data set. Specifically, the prediction power of the MIRT model and CJMLE is studied and the latent structure is investigated and interpreted.

The rest of the paper is organized as follows. In Section 2, we propose the CJMLE and an algorithm for its computation. Theoretical properties of CJMLE are developed in Section 3, followed by simulation studies and real data analysis in Sections 4 and 5. Finally, discussions are provided in Section 6. Proofs of our theoretical results are provided in Appendix.

## 2 Method

### 2.1 Model

To simplify the presentation, we focus on binary response data. We point out that our framework can be extended to ordinal and nominal responses without much difficulty. Let indicates respondents and indicates items. Each respondent is represented by a

-dimensional latent vector

and each item is represented by parameters . Let be the response from respondent to item , which is assumed to follow distribution

 P(Yij=1|θi,aj)=exp(a⊤jθi)1+exp(a⊤jθi). (1)

This model can be viewed as an extension of the multidimensional two-parameter logistic model (M2PL; Reckase and McKinley, 1983; Reckase, 2009). That is, if is set to be 1 for all , then (1) becomes

 P(Yij=1|θi,aj)=exp(aj1+∑Kk=2aj2θi2)1+exp(aj1+∑Kk=2aj2θi2),

which takes the same form as a multidimensional two-parameter logistic model with latent traits. Similar to many latent factor models, Model (1) is also rotational and scaling invariant, as will be discussed in Section 3.

### 2.2 Constrained Joint Maximum Likelihood Estimator

The traditional way of estimating item parameters in an item response theory model is through the marginal maximum likelihood estimator. That is, the latent vector s are further assumed to be independent and identically distributed samples from some distribution . Let

be the observed value of random variable

. The marginal likelihood function is a function of , defined as

 LM(a1,...,aJ) =N∏i=1∫J∏j=1P(Yij=1|θi,aj)yij(1−P(Yij=1|θi,aj))1−yijF(dθi) (2) =N∏i=1∫J∏j=1exp(a⊤jθiyij)1+exp(a⊤jθi)F(dθi).

Marginal maximum likelihood estimates (MMLE) of , …, are then obtained by maximizing . Appropriate constraints may be imposed on , …, to ensure identifiability. In this approach, the latent trait s are treated as random effects, while the item parameters , …, are treated as fixed effects. When is large, the marginal maximum likelihood estimator suffers from computational challenge brought by the -dimensional integration in (2). The computation becomes time-consuming when is moderately large (say when ) even with the state-of-art algorithms.

An alternative approach is the joint maximum likelihood estimator (JMLE; see e.g. Embretson and Reise, 2000). In this approach, both s and s are treated as fixed effects and their estimates are obtained by maximizing the joint likelihood function

 LJ(θ1,...,θN,a1,...,aJ) (3) =N∏i=1J∏j=1exp(a⊤jθiyij)1+exp(a⊤jθi).

Traditionally, the JMLE is less preferred to the MMLE. This is partly because, under the usual asymptotic setting where the number of respondents grows to infinity and the number of items is fixed, the number of parameters in also diverges, resulting in inconsistent estimate of item parameters even for the Rasch model that takes a simple form (Andersen, 1973; Haberman, 1977; Ghosh, 1995).

Under the high-dimensional setting, , , and can all be large, so that MMLE is computationally infeasible. Interestingly, JMLE becomes a good choice. Intuitively, the number of parameters in the joint likelihood is in the order , which can be substantially smaller than the number of observed responses (in the order of ) when is relatively small. Consequently, under reasonable regularities, JMLE yields satisfactory statistical properties, as will be discussed in Section 3. In fact, Haberman (1977) shows that under the Rasch model, the estimation of both respondent parameters and item parameters is consistent, when and grow to infinity and converges to 0. Unfortunately, the results of Haberman (1977) relies on the simple form of the Rasch model and thus can hardly be generalized to the general model setting as in (1).

Computationally, (3) can be optimized efficiently. That is because maximizing (3) is equivalent to maximizing its logarithm, which can be written as

 lJ(θ1,...,θN,a1,...,aJ) =logLJ(θ1,...,θN,a1,...,aJ) (4) =N∑i=1J∑j=1a⊤jθiyij−log(1+exp(a⊤jθi)).

Due to its factorization, is a biconvex function of and , meaning that is a convex function of when is given and vise versa. As a consequence, (4) can be optimized by iteratively maximizing with respect to one given the other. This optimization program can be further speeded up by parallel computing, thanks to the special structure of . See Section 2.4 for further discussions.

Following the above discussion, we propose a constrained joint maximum likelihood estimator, defined as

 (^θ1,...,^θN,^a1,...,^aJ)= \operatornamewithlimitsargminθ1,...,θN,a1,...,aJ−lJ(θ1,...,θN,a1,...,aJ), (5) s.t. ∥θi∥2≤C,∀i, ∥aj∥2≤C,∀j,

where denotes the Euclidian norm of a vector and is a positive tuning parameter that imposes regularization on the magnitude of each and . When , (5) becomes exactly a JMLE. The constraints in (5) avoid overfitting when the number of model parameters is large.

This estimator has the property of rotational invariance. To see this, we rewrite the joint likelihood function in the matrix form. Let , , and . Then the joint log-likelihood function can be rewritten as

 lJ(Θ,A)=lJ(M)=N∑i=1J∑j=1mijyij−log(1+exp(mij)).

The rotational invariance property of the CJMLE is described by the following proposition.

###### Proposition 1.

Let be a minimizer of (5). Then for any orthogonal matrix , is also a minimizer of (5).

The CJMLE is even invariant up to a scaling and an oblique rotation if the estimate satisfies for all or for all . We summarize it by the following proposition.

###### Proposition 2.

Let be a minimizer of (5). Suppose that for all or for all . Then there exists an invertible

non-identity matrix

, such that is still a minimizer of (5).

Like the tuning parameter in ridge regression, our tuning parameter

also controls the bias-variance trade-off of estimation. Roughly speaking, for a large value of

, the true parameters and are likely to satisfy the constraints. Thus, the estimation tends to have a small bias. In the meanwhile, the variance may be large. When is small, the bias tends to be large and variance tends to be small.

### 2.3 Practical Consideration: Missing Responses

In practice, each respondent may only respond to a small proportion of items, possibly due to the data collection design. For example, in the SAPA data that are analyzed in Section 5, each respondent is only administered a small random subset of 696 items (containing on average 86 items) to avoid high time costs to respondents.

Our CJMLE can be modified to handle missing responses, while maintaining its computational advantage. In addition, as will be shown in Section 3, under a reasonable regularity condition on the data missingness mechanism, the CJMLE still has good statistical properties. Let denote the non-missing responses, where if item is administered to respondent . Then the joint likelihood function becomes

 lΩJ(Θ,A)=lΩJ(ΘA⊤)=∑i,j:ωij=1mijyij−log(1+exp(mij)),

and our CJMLE becomes

 (^Θ,^A) =\operatornamewithlimitsargminΘ,A−lΩJ(ΘA⊤). (6) s.t. ∥θi∥2≤C,∀i, ∥aj∥2≤C,∀j.

If for all and , no response is missing and (6) is the same as (5). From now on, we focus on the analysis of (6), which includes (5) as a special case when for all and .

### 2.4 Algorithm

We develop an alternating minimization algorithm for solving (6) that is often used to fit low rank models (see e.g. Udell et al., 2016). This algorithm can be efficiently paralleled. In this algorithm, we assume the number of latent traits and tuning parameter are known. In practice, when having no knowledge about and , they are chosen by a T-fold (e.g., ) cross validation.

To handle the constraints in (6), a projected gradient descent update is used in each iteration. A projected gradient descent update is as follows. Let be a -dimensional vector. We define projection operator:

 ProxC(y)=\operatornamewithlimitsargminx:∥x∥2≤C∥y−x∥2=⎧⎨⎩yif ∥y∥2≤C;√C∥y∥yif ∥% y∥2>C. (7)

Here, returns a feasible point (i.e., the constraint ) that is most close to . Consider optimization problem

 minx f(x) (8) s.t. ∥x∥2≤C,

where is a differentiable convex function. Denote the gradient of by . Then a projected gradient descent update at is defined as

 x(1)=ProxC(x(0)−ηg(x(0))),

where is a step size decided by line search. Due to the projection, . Furthermore, it can be shown that for sufficiently small , , when satisfies mild regularity conditions and . We refer the readers to Parikh et al. (2014) for further details about this projected gradient descent update.

###### Algorithm 1 (Alternating Minimization algorithm for CJMLE).
1. (Initialization) Input responses , nonmissing response indicator , dimension of latent factor , constraint parameter , iteration number , and initial value and .

2. (Alternating minimization) At th iteration, perform parallel computation for each block:

• For each respondent , update

 θ(m)i=ProxC(θ(m−1)i−ηgi(θ(m−1)i,A(m−1))), (9)

 li(θ,A(m−1))=∑j:ωij=1−yijK∑k=1θka(m−1)jk+log(1+exp(K∑k=1θka(m−1)jk))

with respect to at . is a step size chosen by line search, which guarantees that .

• For each item , update

 a(m)j=ProxC(a(m−1)j−η~gj(a(m−1)j,Θ(m))), (10)

 ~lj(a,Θ(m−1))=∑i:ωij=1−yijK∑k=1akθ(m)ik+log(1+exp(K∑k=1akθ(m)ik))

with respect to at . is a step size chosen by line search, which guarantees that .

3. (Output) Iteratively perform Step 2 until convergence. Output , , where is the last iteration number.

The algorithm guarantees that the joint likelihood function increases in each iteration. The parallel computing in step 2 of the algorithm is implemented through OpenMP, which greatly speeds up the computation even on a single machine with multiple cores. The efficiency of this parallel algorithm is further amplified, when running on a computer cluster with many machines.

## 3 Theory

The proposed CJMLE has good statistical properties under the high-dimensional setting, when and both grow to infinity. We assume the number of latent traits is fixed and known and the true parameters and satisfy

1. and for all , .

We first derive an error bound for , where and are the true parameters. It requires an assumption on the mechanism of data missingness – missing completely at random.

1. s in are independent and identically distributed Bernoulli random variables with

 P(ωij=1)=nNJ.

Note that under assumption A1 and A2, we observe on average responses, that is,

 EN∑i=1J∑j=1ωij=n.

With assumptions A1 and A2, we have the following theorem.

###### Theorem 1.

Suppose that assumptions A1 and A2 are satisfied. Further assume that Then there exist absolute constants and , such that

 1NJ∥^Θ^A⊤−Θ∗(A∗)⊤∥2F≤C2CeC√J+Nn (11)

is satisfied with probability

.

We call the left hand side of (11) the scaled Frobenius loss. We provide some remarks on Theorem 1. First, under the asymptotic regime that and both grow to infinity, the probability that (11) is satisfied converges to 1. Second, the left hand side of (11) is a reasonable scaling of the squared Frobenius norm of the error . This is because is a matrix and thus has terms. When both and grow to infinity, the unscaled may diverge with a positive probability. Third, when as assumed, the right hand size of (11) converges to 0. In particular, if no response is missing, then and the right hand size of (11) converges to 0 with a speed . Fourth, the constant plays a role in the right hand size of (11). For any satisfying A1, the larger being used in (6), the larger the upper bound in (11). Ideally, we’d like to use the smallest that satisfies A2. Finally, the error bound in Theorem 1 can be further used to quantify the error of predicting respondents’ responses to items that have not been administered due to the missing data design.

We then study the recovery of , as it is of our interest to explore the factor structure of items via the loading matrix. However, due to the rotational and scaling invariance properties, it is not appropriate to measure the closeness between and by any matrix norm directly. Following the canonical approach in matrix perturbation theory (e.g. Stewart and Sun, 1990), we consider to measure their closeness by the largest principal angle between the linear spaces and that are spanned by the column vectors of and , respectively. Specifically, for two linear spaces and , the largest principal angle is defined as

 sin∠(U,V)≜maxu∈U:u≠0minv∈V:v≠0sin∠(u,v),

where is the angle between two vectors. In fact, and equality is satisfied when .

If the largest principal angle between and is 0, then , meaning that there exist a rank- matrix , such that . In other words, and are equivalent up to a scaling and an oblique rotation. Roughly speaking, if the largest principal angle is close to zero and has a simple loading structure, the simple structure of may be found by applying a rotational method to , such as the promax rotation (Hendrickson and White, 1964).

In what follows, we provide a bound on . We make the following assumption.

1. There exists a positive constant , such as the

th largest singular value of

, denoted by , satisfying

 σ∗K≥C3√NJ,

when and grow to infinity.

###### Theorem 2.

Suppose that assumptions A1, A2, and A3 are satisfied and . Then

 sin∠(V∗,^V)≤2C3(C2CeC)12(J+Nn)14.

is satisfied with probability .

Theorem 2 has a straightforward corollary.

###### Corollary 1.

Under the same assumptions of Theorem 2, converges to 0 in probability as .

We point out that Assumption A3 is mild. The following proposition implies that assumption A3 is satisfied with high probability when s and s are independent samples from certain distributions.

###### Proposition 3.

Suppose that s are i.i.d. samples with mean 0 and variance and s are i.i.d. samples with mean 0 and variance , then there exists a constant , in probability, when and grow to infinity.

## 4 Simulation

### 4.1 Study I

We evaluate the performance of CJMLE and verify the theoretical results by simulation studies. The number of items takes value 300, 400, …, 1000 and for each given , sample size . The number of latent traits are considered and are assumed to be known. We sample

s from a truncated normal distribution,

 θik∼TN(μ,σ,l,u),

where the underlying normal distribution has mean

and the truncation interval . This implies that . The loading parameter s are generated from a mixture distribution. Let be a random indicator, satisfying . If , is set to be 0 and if , is sampled from the truncated normal distribution . Under this setting, about of the ’s are 0, which mimics a simple loading structure where each item does not measure all the latent traits. Similarly, . Missing data are considered by setting , where is set to be , , and , corresponding to the situations of “massive missing”, “moderate missing”, and “no missing”. To provide some intuition, when and , only about 11% and 43% of the entries of are not missing when and , respectively. For each , , and , 50 independent replications are generated.

We verify the theoretical results. To guarantee the satisfaction of assumption A1, we set . Results are shown in Figures 1 and 2. In the two panels of Figures 1, the -axis records the value of and the -axis represents the scaled Frobenius loss . The lines in Figure 1 are obtained by connecting the median of the scaled Frobenius loss among 50 replications for different s, given and . For each setting of , , and

, an interval is also plotted that shows the upper and lower quartiles of the scaled Frobenius loss. According to Figure

1, for each missing data situation, the scaled Frobenius loss tends to converge to 0 as and simultaneously grow. In addition, we observe that the less missing data, the smaller the scaled Frobenius loss. Results from Figure 1 are consistent with Theorem 1.

In Figure 2, the -axis records the number of items and the -axis records the largest principal angle between the column spaces of and , . Similar as above, the median and upper and lower quartiles of the largest principal angle among 50 independent replications are presented. According to Figure 2, the largest principal angle between the two spaces and tend to converge to 0 when both and diverge, for given and . In addition, having less missing data yields a smaller principal angle. Results from Figure 2 are consistent with Theorem 2.

### 4.2 Study II

We then investigate the situation when is fixed and grows. Specifically, we consider , , and grows from 1,000 to 10,000. The generation of and is the same as in Study I and is still set to be . For ease of presentation, we only show the “no missing” case (). The patterns are similar in the presence of missing data. Results are shown in Figure 3. The left panel of Figure 3 plots sample size (-axis) versus scaled Frobenius loss (-axis) and the right panel plots sample size (-axis) versus largest principal angle (-axis). According to Figure 3, when is fixed and grows, both the scaled Frobenius loss and the largest principal angles first decrease and then tend to stabilize around some positive values. This result implies that the CJMLE may not perform well when either or is small.

## 5 Real Data Analysis

We apply the proposed method to analyzing selected personality data111The data set is available from http://dx.doi.org/10.7910/DVN/SD7SVE. from the Synthetic Aperture Personality Assessment project in Condon and Revelle (2015). The SAPA Project is a collaborative online data collection tool for assessing psychological constructs across multiple domains of personality. This data set was collected to evaluate the structure of personality constructs in the temperament domain. The data set contains respondents’ responses to personality assessment items. The items are from either the International Personality Item Pool (Goldberg, 1999; Goldberg et al., 2006) or the Eysenck Personality Questionnaire - Revised (Eysenck et al., 1985)

. The data contain “massive missingness” by its design and the missingness mechanism can be classified as missing completely at random. The mean number of items to which respondents answered is 86.1 (sd = 58.7; median = 71). The mean number of administrations for each item is 2,931 (sd = 781; median = 2,554). All the items are on a six-category rating scale and in this analysis we dichotomize them by truncating at the median response of each item. The readers are referred to

We first explore the latent dimensionality of data. Specifically, we consider the dimension , 5, 10, 15, 20 and 25 and the constraint tuning parameter , 4, 6, 8, and 10. The combination of and that best predicts the missing responses are investigated by five-fold cross validation. Let be the indicator matrix of non-missing responses. We randomly split the non-missing responses into five non-overlapping sets that are of equal sizes, indicated by , satisfying . Moreover, we denote , indicating the data excluding set . For given and , we find the CJMLE for each (), denoted by , by solving (6). Then the fitted model with parameters is used to predict non-missing responses in set . The prediction accuracy is measured by the following log-likelihood based cross-validation (CV) error:

 −5∑t=1⎧⎪ ⎪⎨⎪ ⎪⎩∑i,j: ω(t)ij=1yijlog^m(t)ij+(1−yij)log(1−^m(t)ij)⎫⎪ ⎪⎬⎪ ⎪⎭,

where are entries of the matrix . The combination of and that yields smaller CV error is preferred, as the corresponding model tends to better predict unobserved responses. Results are shown in Figures 4 and 5. Except when , for each value of , the log-likelihood based CV error first decreases and then increases and achieve the minima at . When , the CV error decreases as increases from 2 to 10. This may be due to a bias-variance trade-off, where bias decreases and variance increases as increases. When , the bias term may dominate the variance, due to stringent unidimensionality assumption. In addition, the CV errors when is much higher than the smallest CV errors under other choices of , as shown in Figure 5. It means that a unidimensional model is inadequate to capture the underlying structure of the 696 items. Figure 6 shows the smallest CV error when , 10, 15, 20, and 25. Specifically, the smallest CV error is achieved when and , meaning that in terms of prediction, 15 latent traits tend to best characterize this personality item pool in the temperament domain.

## 6 Discussion

In this paper, we propose a constrained maximum likelihood estimator for analyzing large scale item response data, which allows for the presence of high percentage of missing responses. It differs from the traditional JMLE, by adding constraint on the Euclidian norms of both the item and respondent parameters. An efficient parallel computing algorithm is proposed that is scalable even on a single machine to large data sets with tens of thousands of respondents, thousands of items, and more than ten latent traits, with good timings. The CJMLE also has good statistical properties. In particular, we provide error bounds on the parameter estimates and show that the linear space spanned by the column vectors of the factor loading matrix can be consistently recovered, under mild regularity conditions and the high-dimensional setting that both the numbers of items and respondents grow to infinity. This result implies that the true loading structure can be learned by applying proper rotational methods to the estimated loading matrix, when the true loading matrix has a simple structure (i.e., each latent trait only measures a subset of the items). These theoretical developments are consistent with results from the simulation studies. Our simulation results also imply that the high-dimensional setting that both the numbers of respondents and items grow to infinity is important. When either the sample size or the number of items is small, the CJMLE may not work well. The proposed model is applied to analyzing large scale data from the Synthetic Aperture Personality Assessment project. It is found that a model with 15 latent traits has the best prediction power based on results from cross validation and the majority of the traits seem to be homogeneous and correspond to well-known personality facets.

The proposed method may be extended along several directions. First, the current algorithm and theory focus on binary responses. It is worth extending them to multidimensional IRT models for ordinal response data, such as the generalized partial credit model (Yao and Schwarz, 2006) which is an extension of the multidimensional two-parameter logistic model to analyzing ordinal responses. Second, even after applying rotational methods, the obtained factor loading matrix may still be dense and difficult to interpret. To better pursue a simple loading structure, it may be helpful to further impose regularization of the factor loading parameters, as introduced in Sun et al. (2016). Third, it is also important to incorporate respondent specific covariates in the analysis, so that the relationship between baseline covariates and psychological latent traits can be investigated. Finally, the current theoretical framework assumes the number of latent traits as known. Theoretical results on estimating the number of latent traits based on the CJMLE are also of interest, which may provide guidance on choosing the number of latent traits.

## 7 Appendix

### 7.1 Proof of Theorem 1

###### Proof.

The proof of Theorem 1 is similar to that of (Davenport et al., 2014,  Theorem 1). Thus, we only state the main steps and omit the details. Let

 ¯lJ(M)=lJ(M)−lJ(0), (12)

where is an matrix whose entries are all zero. Then, we have the following lemma from Davenport et al. (2014).

###### Lemma 1 (Lemma A.1 of Davenport et al. (2014)).

There exist constant and such that for all , and ,

 P(sup∥M∥∗≤α√rNJ|¯lJ(M)−E¯lJ(M)|≥C0αLγ√r√n(N+J)+NJlog(N+J))≤C1N+J, (13)

where under the settings of the current paper and denotes the nuclear norm of a matrix.

Let and in the above lemma, we have

 P(sup∥M∥∗≤C√KNJ|¯lJ(M)−E¯lJ(M)|≥C0C√K√n(N+J)+NJlog(N+J))≤C1N+J. (14)

Define

 H={M=(mij)1≤i≤N,1≤j≤J:mij=a⊤jθi,∥θi∥2≤C and ∥aj∥2≤C, for all i,j}. (15)

Note that if , then

 ∥M∥∗≤√NJ√rank(M)∥M∥∞≤C√KNJ. (16)

Thus, from (14), we further have

 P(supM∈H|¯lJ(M)−E¯lJ(M)|≥C0C√K√n(N+J)+NJlog(N+J)) (17) ≤ P(sup∥M∥∗≤C√KNJ|¯lJ(M)−E¯lJ(M)|≥C0C√K√n(N+J)+NJlog(N+J)) (18) ≤ C1N+J. (19)

We use the following result, which is a slight modification of the last equation on p.210 of Davenport et al. (2014).

 nD(M∗∥^M)≤2supM∈H|¯lJ(M)−E¯lJ(M)|, (20)

where ,

denotes the Kullback-Leibler divergence between the joint distribution of

when the model parameters are and . In addition, we have the following inequality, which is a direct application of Lemma A.2 in Davenport et al. (2014) and the third equation on page 211 of Davenport et al. (2014). For any such that ,

 ∥M1−M2∥2F≤8βCNJD(M1∥M2), (21)

with under our settings. Combining (19), (20) and (21), we can see that with probability ,

 ∥^M−M∗∥2F≤16βCNJn×C0C√K√n(N+J)+NJlog(N+J). (22)

Note that for , . Thus, we can rearrange terms in the above inequality and simplify it as

 1NJ∥^M−M∗∥2F≤64CeC√K(J+N)n×√1+NJlog(N+J)n(N+J). (23)

For , we further have

 NJlog(N+J)n(N+J)≤NJ(N+J)2≤14. (24)

Combine the above equation with (23) and note that is assumed fixed, we complete the proof. ∎

### 7.2 Proof of Theorem 2

###### Proof.

Denote and . Let be the singular values of and , …,