Joint Maximum Likelihood Estimation for High-dimensional Exploratory Item Response Analysis

12/19/2017
by Yunxiao Chen, et al.

Multidimensional item response theory is widely used in education and psychology for measuring multiple latent traits. However, exploratory analysis of large-scale item response data with many items, respondents, and latent traits remains a challenge. In this paper, we consider a high-dimensional setting in which both the number of items and the number of respondents grow to infinity. A constrained joint maximum likelihood estimator is proposed for estimating both item and person parameters, and it enjoys good theoretical properties and computational advantages. Specifically, we derive error bounds for parameter estimation and develop an efficient algorithm that can scale to very large data sets. The proposed method is applied to a large-scale personality assessment data set from the Synthetic Aperture Personality Assessment (SAPA) project. Simulation studies are conducted to evaluate the proposed method.

1 Introduction

Multidimensional item response theory (MIRT) models have been widely used in psychology and education for measuring multiple latent traits based on dichotomous or polytomous items. The concept of MIRT dates back to McDonald (1967), Lord et al. (1968), and Reckase (1972), and is closely related to linear factor analysis (e.g. Anderson, 2003), which models continuous multivariate data. We refer the readers to Reckase (2009) for a review of MIRT models. An important application of MIRT models is exploratory item factor analysis, which serves to uncover a set of latent traits underlying a battery of psychological items. Such exploratory analysis facilitates the understanding of large-scale response data with many items, many respondents, and possibly many latent traits.

A key step in MIRT-based analysis is parameter estimation, which can be numerically challenging under the high-dimensional setting, i.e., when the number of items, the sample size, and the dimension of the latent space are all large. The most popular method for estimating the item parameters is the maximum marginal likelihood estimator (MMLE). In this approach, the respondent-specific parameters (i.e., their latent trait levels) are treated as random effects, and the marginal likelihood function of the item parameters is obtained by integrating out the random effects. Item parameter estimates are obtained by maximizing the marginal likelihood function. This approach typically involves evaluating a $K$-dimensional integral, where $K$ is the number of latent traits. When $K$ is moderately large, evaluating the integral is numerically challenging. Many approaches have been proposed to approximate the integral, including adaptive Gaussian quadrature methods (e.g. Schilling and Bock, 2005), Monte Carlo integration (e.g. Meng and Schilling, 1996), fully Bayesian estimation methods (e.g. Béguin and Glas, 2001; Bolt and Lall, 2003; Edwards, 2010), and data-augmented stochastic approximation algorithms (e.g. Cai, 2010a, b). However, even with these state-of-the-art algorithms, the computation is time-consuming.

An alternative approach to parameter estimation in MIRT models is the joint maximum likelihood estimator (JMLE; see, e.g. Embretson and Reise, 2000). This approach treats both the person and item parameters as fixed effects (i.e., model parameters). Parameter estimates are obtained by maximizing the joint likelihood function of both the person and item parameters. Traditionally, the JMLE is less preferred than the MMLE. This is partly because, under the usual asymptotic setting where the number of respondents grows to infinity and the number of items is fixed, the number of parameters in the joint likelihood function also diverges, resulting in inconsistent estimates of the item parameters (Neyman and Scott, 1948; Andersen, 1973; Haberman, 1977; Ghosh, 1995), even for simple IRT models.

In this paper, we propose a variant of the joint maximum likelihood estimator that adds constraints to the parameter space. This constrained joint maximum likelihood estimator (CJMLE) is computationally efficient and statistically sound under the high-dimensional setting. In terms of computation, an alternating minimization (AM) algorithm with a projected gradient descent update is proposed, which can be parallelized efficiently. Specifically, we implement this parallel computing algorithm in R with core functions written in C++ through Open Multi-Processing (OpenMP; Dagum and Menon, 1998), and it scales to very large data sets. For example, according to our simulation study, the algorithm can fit a data set with 25,000 respondents, 1,000 items, and 10 latent traits in 12 minutes on a single machine (Intel(R) Core(TM) i7-5650U CPU @ 2.2 GHz) with eight threads (OpenMP enabled). Theoretically, we provide error bounds on parameter estimation under the setting that both the number of items and the number of respondents grow to infinity. Specifically, we show that the item parameters can be consistently estimated up to a scaling and a rotation (possibly oblique). Consequently, the latent structure of the test items may be investigated by applying appropriate rotational methods (see e.g. Browne, 2001). Both our computational algorithm and theoretical framework are new to the psychometrics literature.

Our theoretical framework assumes a diverging number of items, which is suitable for analyzing large-scale data. To the best of our knowledge, such an asymptotic setting has received little attention, except in Haberman (1977, 2004) and Chiu et al. (2016). Our theoretical analysis applies to a general MIRT model that includes the multidimensional two-parameter logistic model (Reckase and McKinley, 1983; Reckase, 2009) as a special case, while the analyses in Haberman (1977, 2004) and Chiu et al. (2016) are limited to the unidimensional Rasch model (Rasch, 1960; Lord et al., 1968) and to cognitive diagnostic models (Rupp et al., 2010), respectively. Our technical tools for studying the properties of the CJMLE include theoretical developments in matrix completion theory (e.g. Candès and Plan, 2010; Candès and Tao, 2010; Candès and Recht, 2012; Davenport et al., 2014) and matrix perturbation theory (e.g. Stewart and Sun, 1990).

We apply the proposed method to a data set from the Synthetic Aperture Personality Assessment (SAPA) project (Condon and Revelle, 2015), which was collected to evaluate the structure of personality constructs in the temperament domain. Applying traditional exploratory item factor analysis methods to this data set is challenging because (1) the data set is of large scale, containing 23,681 respondents, 696 items, and possibly many latent traits, (2) the responses are not continuous, so that linear factor analysis is inappropriate, and (3) the data contain "massive missingness" by design, with responses missing completely at random. The proposed method turns out to be suitable for analyzing this data set. Specifically, the predictive power of the MIRT model and the CJMLE is studied, and the latent structure is investigated and interpreted.

The rest of the paper is organized as follows. In Section 2, we propose the CJMLE and an algorithm for its computation. Theoretical properties of the CJMLE are developed in Section 3, followed by simulation studies and real data analysis in Sections 4 and 5. Finally, discussions are provided in Section 6. Proofs of our theoretical results are provided in the Appendix.

2 Method

2.1 Model

To simplify the presentation, we focus on binary response data. We point out that our framework can be extended to ordinal and nominal responses without much difficulty. Let $i = 1, \ldots, N$ index respondents and $j = 1, \ldots, J$ index items. Each respondent is represented by a $K$-dimensional latent vector $\theta_i = (\theta_{i1}, \ldots, \theta_{iK})^\top$ and each item is represented by parameters $a_j = (a_{j1}, \ldots, a_{jK})^\top$. Let $Y_{ij}$ be the response from respondent $i$ to item $j$, which is assumed to follow the distribution

P(Y_{ij} = 1 \mid \theta_i, a_j) = \frac{\exp(\theta_i^\top a_j)}{1 + \exp(\theta_i^\top a_j)}.   (1)

This model can be viewed as an extension of the multidimensional two-parameter logistic model (M2PL; Reckase and McKinley, 1983; Reckase, 2009). That is, if $\theta_{i1}$ is set to be 1 for all $i$, then (1) becomes

P(Y_{ij} = 1 \mid \theta_i, a_j) = \frac{\exp(a_{j1} + \sum_{k=2}^{K} a_{jk}\theta_{ik})}{1 + \exp(a_{j1} + \sum_{k=2}^{K} a_{jk}\theta_{ik})},

which takes the same form as a multidimensional two-parameter logistic model with $K - 1$ latent traits, where $a_{j1}$ plays the role of an intercept. Similar to many latent factor models, model (1) is also rotational and scaling invariant, as will be discussed in Section 3.
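To make the model concrete, the following R sketch simulates responses from model (1); the sizes $N$, $J$, $K$ and the parameter distributions are illustrative choices of ours, not settings used in the paper.

```r
# Minimal sketch: simulate binary responses from model (1),
# P(Y_ij = 1 | theta_i, a_j) = exp(theta_i' a_j) / (1 + exp(theta_i' a_j)).
# N, J, K and the parameter distributions are illustrative assumptions.
set.seed(1)
N <- 2000; J <- 100; K <- 3
Theta <- cbind(1, matrix(rnorm(N * (K - 1)), N, K - 1))  # theta_i1 = 1 gives the M2PL special case
A     <- matrix(rnorm(J * K, sd = 0.5), J, K)
M <- Theta %*% t(A)                     # N x J matrix of logits theta_i' a_j
P <- 1 / (1 + exp(-M))                  # success probabilities under model (1)
Y <- matrix(rbinom(N * J, 1, P), N, J)  # observed binary responses
```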

2.2 Constrained Joint Maximum Likelihood Estimator

The traditional way of estimating the item parameters in an item response theory model is through the marginal maximum likelihood estimator. That is, the latent vectors $\theta_i$ are further assumed to be independent and identically distributed samples from some distribution $F$. Let $y_{ij}$ be the observed value of the random variable $Y_{ij}$. The marginal likelihood function is a function of $a_1, \ldots, a_J$, defined as

L^M(a_1, \ldots, a_J) = \prod_{i=1}^{N} \int \prod_{j=1}^{J} \frac{\exp(y_{ij}\,\theta^\top a_j)}{1 + \exp(\theta^\top a_j)} \, dF(\theta).   (2)

Marginal maximum likelihood estimates (MMLE) of $a_1$, …, $a_J$ are then obtained by maximizing $L^M(a_1, \ldots, a_J)$. Appropriate constraints may be imposed on $a_1$, …, $a_J$ to ensure identifiability. In this approach, the latent traits $\theta_i$ are treated as random effects, while the item parameters $a_1$, …, $a_J$ are treated as fixed effects. When $K$ is large, the marginal maximum likelihood estimator suffers from the computational challenge brought by the $K$-dimensional integration in (2). The computation becomes time-consuming when $K$ is moderately large, even with state-of-the-art algorithms.

An alternative approach is the joint maximum likelihood estimator (JMLE; see e.g. Embretson and Reise, 2000). In this approach, both the $\theta_i$s and the $a_j$s are treated as fixed effects, and their estimates are obtained by maximizing the joint likelihood function

L(\theta_1, \ldots, \theta_N, a_1, \ldots, a_J) = \prod_{i=1}^{N} \prod_{j=1}^{J} \frac{\exp(y_{ij}\,\theta_i^\top a_j)}{1 + \exp(\theta_i^\top a_j)}.   (3)

Traditionally, the JMLE is less preferred than the MMLE. This is partly because, under the usual asymptotic setting where the number of respondents grows to infinity and the number of items is fixed, the number of parameters in $L$ also diverges, resulting in inconsistent estimates of the item parameters even for the Rasch model, which takes a simple form (Andersen, 1973; Haberman, 1977; Ghosh, 1995).

Under the high-dimensional setting, $N$, $J$, and $K$ can all be large, so that the MMLE is computationally infeasible. Interestingly, the JMLE becomes a good choice. Intuitively, the number of parameters in the joint likelihood is of the order $(N + J)K$, which can be substantially smaller than the number of observed responses (of the order $NJ$) when $K$ is relatively small. Consequently, under reasonable regularity conditions, the JMLE yields satisfactory statistical properties, as will be discussed in Section 3. In fact, Haberman (1977) shows that under the Rasch model, the estimation of both the respondent parameters and the item parameters is consistent when $N$ and $J$ grow to infinity at suitable relative rates. Unfortunately, the results of Haberman (1977) rely on the simple form of the Rasch model and thus can hardly be generalized to the general model setting in (1).

Computationally, (3) can be optimized efficiently. That is because maximizing (3) is equivalent to maximizing its logarithm, which can be written as

l(\theta_1, \ldots, \theta_N, a_1, \ldots, a_J) = \sum_{i=1}^{N} \sum_{j=1}^{J} \big\{ y_{ij}\,\theta_i^\top a_j - \log(1 + \exp(\theta_i^\top a_j)) \big\}.   (4)

Due to its factorization, $-l$ is a biconvex function of the $\theta_i$s and the $a_j$s, meaning that $-l$ is a convex function of the $\theta_i$s when the $a_j$s are given, and vice versa. As a consequence, (4) can be optimized by iteratively maximizing with respect to one set of parameters given the other. This optimization procedure can be further sped up by parallel computing, thanks to the special structure of $l$: given the $a_j$s, the maximization decouples into $N$ separate problems, one for each respondent, and given the $\theta_i$s, it decouples into $J$ separate problems, one for each item. See Section 2.4 for further discussion.

Following the above discussion, we propose a constrained joint maximum likelihood estimator, defined as

(\hat\theta_1, \ldots, \hat\theta_N, \hat a_1, \ldots, \hat a_J) = \mathop{\arg\min}\ -l(\theta_1, \ldots, \theta_N, a_1, \ldots, a_J), \quad \text{subject to } \|\theta_i\| \leq C,\ \|a_j\| \leq C \text{ for all } i, j,   (5)

where $\|\cdot\|$ denotes the Euclidean norm of a vector and $C$ is a positive tuning parameter that regularizes the magnitudes of the $\theta_i$s and $a_j$s. When $C = \infty$, (5) becomes exactly the JMLE. The constraints in (5) avoid overfitting when the number of model parameters is large.

This estimator has the property of rotational invariance. To see this, we rewrite the joint likelihood function in matrix form. Let $\Theta = (\theta_1, \ldots, \theta_N)^\top$, $A = (a_1, \ldots, a_J)^\top$, and $M = \Theta A^\top = (m_{ij})_{N \times J}$. Then the joint log-likelihood function can be rewritten as

l(\Theta, A) = \sum_{i=1}^{N} \sum_{j=1}^{J} \big\{ y_{ij}\,m_{ij} - \log(1 + \exp(m_{ij})) \big\},

which depends on $\Theta$ and $A$ only through $M = \Theta A^\top$. The rotational invariance property of the CJMLE is described by the following proposition.

Proposition 1.

Let $(\hat\Theta, \hat A)$ be a minimizer of (5). Then for any orthogonal matrix $Q$, $(\hat\Theta Q, \hat A Q)$ is also a minimizer of (5).

The CJMLE is even invariant up to a scaling and an oblique rotation if the estimate satisfies $\|\hat\theta_i\| < C$ for all $i$ or $\|\hat a_j\| < C$ for all $j$. We summarize this in the following proposition.

Proposition 2.

Let $(\hat\Theta, \hat A)$ be a minimizer of (5). Suppose that $\|\hat\theta_i\| < C$ for all $i$ or $\|\hat a_j\| < C$ for all $j$. Then there exists an invertible non-identity matrix $D$, such that $(\hat\Theta D, \hat A (D^{-1})^\top)$ is still a minimizer of (5).

Like the tuning parameter in ridge regression, our tuning parameter $C$ also controls the bias-variance trade-off of the estimation. Roughly speaking, for a large value of $C$, the true parameters are likely to satisfy the constraints. Thus, the estimator tends to have a small bias, while its variance may be large. When $C$ is small, the bias tends to be large and the variance tends to be small.

2.3 Practical Consideration: Missing Responses

In practice, each respondent may respond to only a small proportion of the items, possibly due to the data collection design. For example, in the SAPA data analyzed in Section 5, each respondent is administered only a small random subset of the 696 items (86 items on average), to avoid a high time cost to respondents.

Our CJMLE can be modified to handle missing responses while maintaining its computational advantage. In addition, as will be shown in Section 3, under a reasonable regularity condition on the data missingness mechanism, the CJMLE still has good statistical properties. Let $\Omega = (\omega_{ij})_{N \times J}$ denote the indicators of non-missing responses, where $\omega_{ij} = 1$ if item $j$ is administered to respondent $i$ and $\omega_{ij} = 0$ otherwise. Then the joint likelihood function becomes

L(\Theta, A) = \prod_{(i,j):\, \omega_{ij} = 1} \frac{\exp(y_{ij}\,\theta_i^\top a_j)}{1 + \exp(\theta_i^\top a_j)},

and our CJMLE becomes

(\hat\Theta, \hat A) = \mathop{\arg\min}_{\Theta, A}\ -\sum_{(i,j):\, \omega_{ij} = 1} \big\{ y_{ij}\,\theta_i^\top a_j - \log(1 + \exp(\theta_i^\top a_j)) \big\}, \quad \text{subject to } \|\theta_i\| \leq C,\ \|a_j\| \leq C \text{ for all } i, j.   (6)

If $\omega_{ij} = 1$ for all $i$ and $j$, no response is missing and (6) is the same as (5). From now on, we focus on the analysis of (6), which includes (5) as the special case where $\omega_{ij} = 1$ for all $i$ and $j$.
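As an illustration, the joint log-likelihood restricted to the non-missing entries can be evaluated as in the following R sketch; the function and argument names are ours, with Omega denoting the indicator matrix defined above.

```r
# Sketch: joint log-likelihood over non-missing entries, as in (6).
# Y is the N x J response matrix, Omega the indicator matrix (1 = administered).
joint_loglik <- function(Theta, A, Y, Omega) {
  M <- Theta %*% t(A)
  # numerically stable evaluation of y * m - log(1 + exp(m))
  ll <- Y * M - (pmax(M, 0) + log1p(exp(-abs(M))))
  sum(ll[Omega == 1])
}
```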

2.4 Algorithm

We develop an alternating minimization algorithm for solving (6), of the kind often used to fit low-rank models (see e.g. Udell et al., 2016). This algorithm can be efficiently parallelized. In the algorithm, we assume the number of latent traits $K$ and the tuning parameter $C$ are known. In practice, when there is no knowledge about $K$ and $C$, they can be chosen by T-fold (e.g., $T = 5$) cross-validation.

To handle the constraints in (6), a projected gradient descent update is used in each iteration, which works as follows. Let $x$ be a $K$-dimensional vector. We define the projection operator

\mathrm{Prox}_C(x) = \begin{cases} x, & \|x\| \leq C, \\ Cx/\|x\|, & \|x\| > C. \end{cases}   (7)

Here, $\mathrm{Prox}_C(x)$ returns the feasible point (i.e., one satisfying the constraint $\|x\| \leq C$) that is closest to $x$. Consider the optimization problem

\min_x f(x), \quad \text{subject to } \|x\| \leq C,   (8)

where $f$ is a differentiable convex function. Denote the gradient of $f$ at $x$ by $\nabla f(x)$. Then a projected gradient descent update at the current point $x^{(t)}$ is defined as

x^{(t+1)} = \mathrm{Prox}_C\big( x^{(t)} - \eta \nabla f(x^{(t)}) \big),

where $\eta$ is a step size decided by line search. Due to the projection, $x^{(t+1)}$ remains feasible. Furthermore, it can be shown that for a sufficiently small $\eta$, $f(x^{(t+1)}) \leq f(x^{(t)})$ when $f$ satisfies mild regularity conditions. We refer the readers to Parikh et al. (2014) for further details about the projected gradient descent update.
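A minimal R sketch of the projection operator in (7) and of one projected gradient descent step for (8) follows; the names proj_ball and pgd_step are ours, and the simple backtracking rule stands in for the line search, whose exact form is not spelled out above.

```r
# Projection onto the Euclidean ball of radius C, as in (7).
proj_ball <- function(x, C) {
  nrm <- sqrt(sum(x^2))
  if (nrm <= C) x else (C / nrm) * x  # rescale onto the boundary if infeasible
}

# One projected gradient descent step for (8) with backtracking:
# f is the convex objective and grad returns its gradient.
pgd_step <- function(x, f, grad, C, eta = 1, shrink = 0.5) {
  g <- grad(x)
  repeat {
    x_new <- proj_ball(x - eta * g, C)
    if (f(x_new) <= f(x) || eta < 1e-10) break  # accept a non-increasing step
    eta <- eta * shrink                         # otherwise shrink the step size
  }
  x_new
}
```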

Algorithm 1 (Alternating minimization algorithm for the CJMLE).
  1. (Initialization) Input the responses $(y_{ij})$, the non-missing response indicator $\Omega$, the dimension of the latent factor $K$, the constraint parameter $C$, the iteration number $t = 1$, and initial values $\Theta^{(0)}$ and $A^{(0)}$.

  2. (Alternating minimization) At the $t$-th iteration, perform parallel computation for each block:

    • For each respondent $i$, update

      \theta_i^{(t)} = \mathrm{Prox}_C\big( \theta_i^{(t-1)} - \eta_i \nabla_i \big),   (9)

      where $\nabla_i$ is the gradient of $-\sum_{j:\, \omega_{ij} = 1} \{ y_{ij}\,\theta^\top a_j^{(t-1)} - \log(1 + \exp(\theta^\top a_j^{(t-1)})) \}$ with respect to $\theta$ at $\theta_i^{(t-1)}$, and $\eta_i$ is a step size chosen by line search, which guarantees that the block objective does not increase.

    • For each item $j$, update

      a_j^{(t)} = \mathrm{Prox}_C\big( a_j^{(t-1)} - \eta_j \nabla_j \big),   (10)

      where $\nabla_j$ is the gradient of $-\sum_{i:\, \omega_{ij} = 1} \{ y_{ij}\,(\theta_i^{(t)})^\top a - \log(1 + \exp((\theta_i^{(t)})^\top a)) \}$ with respect to $a$ at $a_j^{(t-1)}$, and $\eta_j$ is a step size chosen by line search, which guarantees that the block objective does not increase.

  3. (Output) Iteratively perform Step 2 until convergence. Output $\Theta^{(T)} = (\theta_1^{(T)}, \ldots, \theta_N^{(T)})^\top$ and $A^{(T)} = (a_1^{(T)}, \ldots, a_J^{(T)})^\top$, where $T$ is the last iteration number.

The algorithm guarantees that the joint likelihood function does not decrease in any iteration. The parallel computation in Step 2 of the algorithm is implemented through OpenMP, which greatly speeds up the computation even on a single machine with multiple cores. The efficiency of this parallel algorithm is further amplified when it runs on a computer cluster with many machines.
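The following R sketch assembles these pieces into a compact serial version of Algorithm 1, reusing proj_ball from the sketch above; the small fixed step size and the random initialization are simplifying assumptions of ours, in place of the line search and user-supplied starting values.

```r
# Serial sketch of Algorithm 1 (the paper parallelizes the two inner updates
# across respondents and items with OpenMP). Assumes proj_ball() from above.
cjmle <- function(Y, Omega, K, C, iters = 200, eta = 0.01) {
  N <- nrow(Y); J <- ncol(Y)
  Y[Omega == 0] <- 0                       # zero out missing entries
  Theta <- matrix(rnorm(N * K, sd = 0.1), N, K)
  A     <- matrix(rnorm(J * K, sd = 0.1), J, K)
  for (t in seq_len(iters)) {
    # respondent block: gradient of the log-likelihood in Theta; rows decouple
    P <- 1 / (1 + exp(-Theta %*% t(A)))
    Theta <- Theta + eta * (Omega * (Y - P)) %*% A
    Theta <- t(apply(Theta, 1, proj_ball, C = C))  # project each theta_i, as in (9)
    # item block: gradient in A; rows decouple
    P <- 1 / (1 + exp(-Theta %*% t(A)))
    A <- A + eta * t(Omega * (Y - P)) %*% Theta
    A <- t(apply(A, 1, proj_ball, C = C))          # project each a_j, as in (10)
  }
  list(Theta = Theta, A = A)
}
```

Each row update in the two blocks is independent of the others, which is exactly the structure the OpenMP implementation exploits.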

3 Theory

The proposed CJMLE has good statistical properties under the high-dimensional setting where $N$ and $J$ both grow to infinity. We assume the number of latent traits $K$ is fixed and known, and that the true parameters $\Theta^* = (\theta_1^*, \ldots, \theta_N^*)^\top$ and $A^* = (a_1^*, \ldots, a_J^*)^\top$ satisfy

  A1. $\|\theta_i^*\| \leq C$ and $\|a_j^*\| \leq C$ for all $i = 1, \ldots, N$ and $j = 1, \ldots, J$.

We first derive an error bound for $\hat\Theta\hat A^\top - \Theta^*(A^*)^\top$, where $\Theta^*$ and $A^*$ are the true parameters. It requires an assumption on the mechanism of data missingness: missing completely at random.

  A2. The $\omega_{ij}$s in $\Omega$ are independent and identically distributed Bernoulli random variables with $P(\omega_{ij} = 1) = p$.

Note that under assumptions A1 and A2, we observe $NJp$ responses on average, that is, $E\big(\sum_{i=1}^{N}\sum_{j=1}^{J} \omega_{ij}\big) = NJp$.

With assumptions A1 and A2, we have the following theorem.

Theorem 1.

Suppose that assumptions A1 and A2 are satisfied. Further assume that $NJp \geq (N \vee J)\log(N \vee J)$. Then there exist constants $C_1$ and $C_2$, where $C_1$ depends only on $C$ and $C_2$ is absolute, such that

\frac{1}{NJ}\,\big\| \hat\Theta\hat A^\top - \Theta^*(A^*)^\top \big\|_F^2 \leq C_1 \sqrt{\frac{K(N \vee J)}{NJp}}   (11)

is satisfied with probability at least $1 - C_2/(N + J)$.

We call the left-hand side of (11) the scaled Frobenius loss. We provide some remarks on Theorem 1. First, under the asymptotic regime where $N$ and $J$ both grow to infinity, the probability that (11) is satisfied converges to 1. Second, the left-hand side of (11) is a reasonable scaling of the squared Frobenius norm of the error $\hat\Theta\hat A^\top - \Theta^*(A^*)^\top$. This is because the error is an $N \times J$ matrix, so its squared Frobenius norm is a sum of $NJ$ terms. When both $N$ and $J$ grow to infinity, the unscaled norm may diverge with a positive probability. Third, when $NJp \geq (N \vee J)\log(N \vee J)$ as assumed, the right-hand side of (11) converges to 0. In particular, if no response is missing, then $p = 1$ and the right-hand side of (11) converges to 0 at the speed $(N \wedge J)^{-1/2}$. Fourth, the tuning parameter $C$ plays a role in the right-hand side of (11) through the constant $C_1$: for any true parameters satisfying A1, the larger the $C$ used in (6), the larger the upper bound in (11). Ideally, we would like to use the smallest $C$ such that A1 is satisfied. Finally, the error bound in Theorem 1 can be further used to quantify the error of predicting respondents' responses to items that have not been administered due to the missing-data design.

We then study the recovery of $A^*$, as it is of interest to explore the factor structure of the items via the loading matrix. However, due to the rotational and scaling invariance properties, it is not appropriate to measure the closeness between $\hat A$ and $A^*$ by a matrix norm directly. Following the canonical approach in matrix perturbation theory (e.g. Stewart and Sun, 1990), we measure their closeness by the largest principal angle $\angle(\mathrm{col}(\hat A), \mathrm{col}(A^*))$ between the linear spaces $\mathrm{col}(\hat A)$ and $\mathrm{col}(A^*)$ that are spanned by the column vectors of $\hat A$ and $A^*$, respectively. Specifically, for two linear spaces $\mathcal{V}$ and $\mathcal{W}$ of the same dimension, the largest principal angle is defined as

\angle(\mathcal{V}, \mathcal{W}) = \max_{v \in \mathcal{V},\, v \neq 0}\ \min_{w \in \mathcal{W},\, w \neq 0}\ \angle(v, w),

where $\angle(v, w)$ is the angle between two vectors. In fact, $0 \leq \angle(\mathcal{V}, \mathcal{W}) \leq \pi/2$, and $\angle(\mathcal{V}, \mathcal{W}) = 0$ when $\mathcal{V} = \mathcal{W}$.

If the largest principal angle between $\mathrm{col}(\hat A)$ and $\mathrm{col}(A^*)$ is 0, then $\mathrm{col}(\hat A) = \mathrm{col}(A^*)$, meaning that there exists an invertible $K \times K$ matrix $D$ such that $\hat A = A^* D$. In other words, $\hat A$ and $A^*$ are equivalent up to a scaling and an oblique rotation. Roughly speaking, if the largest principal angle is close to zero and $A^*$ has a simple loading structure, the simple structure of $A^*$ may be found by applying a rotational method to $\hat A$, such as the promax rotation (Hendrickson and White, 1964).
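For concreteness, the largest principal angle can be computed from the singular values of the product of orthonormal bases of the two column spaces, and a rotation such as promax can then be applied to the estimate; the following R sketch (with names of our choosing) illustrates both steps.

```r
# Largest principal angle between col(A_hat) and col(A_star): the cosine of
# the largest angle equals the smallest singular value of Q1' Q2.
largest_principal_angle <- function(A_hat, A_star) {
  Q1 <- qr.Q(qr(A_hat))           # orthonormal basis of col(A_hat)
  Q2 <- qr.Q(qr(A_star))          # orthonormal basis of col(A_star)
  s  <- svd(crossprod(Q1, Q2))$d
  acos(min(pmin(s, 1)))           # in radians; pmin guards against rounding error
}

# A simple-structure rotation of the estimated loadings, e.g. promax:
# A_rot <- stats::promax(A_hat)$loadings
```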

In what follows, we provide a bound on $\sin\angle(\mathrm{col}(\hat A), \mathrm{col}(A^*))$. We make the following assumption.

  A3. There exists a positive constant $\gamma$ such that the $K$-th largest singular value of $M^* = \Theta^*(A^*)^\top$, denoted by $\sigma_K(M^*)$, satisfies $\sigma_K(M^*) \geq \gamma\sqrt{NJ}$ when $N$ and $J$ grow to infinity.

Theorem 2.

Suppose that assumptions A1, A2, and A3 are satisfied and the assumptions of Theorem 1 hold. Then

\sin\angle(\mathrm{col}(\hat A), \mathrm{col}(A^*)) \leq \frac{C_3}{\gamma}\left( \frac{K(N \vee J)}{NJp} \right)^{1/4}

is satisfied with probability at least $1 - C_2/(N + J)$, where $C_3$ is a constant depending only on $C$ and $C_2$ is as in Theorem 1.

Theorem 2 has a straightforward corollary.

Corollary 1.

Under the same assumptions as Theorem 2, $\sin\angle(\mathrm{col}(\hat A), \mathrm{col}(A^*))$ converges to 0 in probability as $N, J \to \infty$.

We point out that assumption A3 is mild. The following proposition implies that assumption A3 is satisfied with high probability when the $\theta_i^*$s and $a_j^*$s are independent samples from certain distributions.

Proposition 3.

Suppose that the $\theta_i^*$s are i.i.d. samples from a distribution with mean 0 and positive definite covariance matrix $\Sigma_\theta$, and the $a_j^*$s are i.i.d. samples from a distribution with mean 0 and positive definite covariance matrix $\Sigma_a$. Then there exists a constant $\gamma > 0$ such that $\sigma_K(M^*) \geq \gamma\sqrt{NJ}$ in probability, when $N$ and $J$ grow to infinity.

4 Simulation

4.1 Study I

We evaluate the performance of the CJMLE and verify the theoretical results by simulation studies. The number of items $J$ takes values 300, 400, …, 1000, and for each given $J$, the sample size is $N = 25J$. The numbers of latent traits $K = 5$ and $K = 10$ are considered and assumed to be known. We sample the $\theta_{ik}^*$s independently from a truncated normal distribution, where the underlying normal distribution has mean 0 and standard deviation 1 and the truncation interval is $[-C/\sqrt{K}, C/\sqrt{K}]$. This implies that $\|\theta_i^*\| \leq C$. The loading parameters $a_{jk}^*$ are generated from a mixture distribution. Let $\xi_{jk}$ be a random indicator satisfying $P(\xi_{jk} = 1) = 0.5$. If $\xi_{jk} = 0$, $a_{jk}^*$ is set to be 0, and if $\xi_{jk} = 1$, $a_{jk}^*$ is sampled from the truncated normal distribution above. Under this setting, about half of the $a_{jk}^*$s are 0, which mimics a simple loading structure where each item does not measure all the latent traits. Similarly, $\|a_j^*\| \leq C$. Missing data are considered by setting $P(\omega_{ij} = 1) = p$, where $p$ takes three levels corresponding to the situations of "massive missing", "moderate missing", and "no missing" ($p = 1$). To provide some intuition, when $J = 1000$ and $N = 25{,}000$, only about 11% and 43% of the entries of the response matrix are not missing under the massive and moderate missing settings, respectively. For each $J$, $K$, and missingness level, 50 independent replications are generated.
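The following R sketch mirrors this data-generating process; the truncation bound $C/\sqrt{K}$, the mixing probability of 0.5, and the observation rate p_obs are illustrative stand-ins for the exact settings.

```r
# Sketch of the Study I data generation (illustrative settings).
rtnorm <- function(n, lo, hi) {          # truncated standard normal, inverse-CDF method
  qnorm(runif(n, pnorm(lo), pnorm(hi)))
}
simulate_study1 <- function(N, J, K, C, p_nonzero = 0.5, p_obs = 1) {
  bound <- C / sqrt(K)                   # row norms then satisfy assumption A1
  Theta <- matrix(rtnorm(N * K, -bound, bound), N, K)
  xi    <- matrix(rbinom(J * K, 1, p_nonzero), J, K)  # zeros mimic a simple structure
  A     <- xi * matrix(rtnorm(J * K, -bound, bound), J, K)
  P     <- 1 / (1 + exp(-Theta %*% t(A)))
  Y     <- matrix(rbinom(N * J, 1, P), N, J)
  Omega <- matrix(rbinom(N * J, 1, p_obs), N, J)      # missing completely at random
  list(Y = Y, Omega = Omega, Theta = Theta, A = A)
}
```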

Figure 1: The scaled Frobenius loss under different simulation settings; panels (a) and (b) correspond to $K = 5$ and $K = 10$ (dashed line: no missing; dotted line: moderate missing; solid line: massive missing).

Figure 2: The largest principal angle between $\mathrm{col}(\hat A)$ and $\mathrm{col}(A^*)$ under different simulation settings; panels (a) and (b) correspond to $K = 5$ and $K = 10$ (dashed line: no missing; dotted line: moderate missing; solid line: massive missing).

We now verify the theoretical results. To guarantee the satisfaction of assumption A1, we set the constraint parameter $C$ in (6) to the same constant used in generating the true parameters. Results are shown in Figures 1 and 2. In the two panels of Figure 1, the $x$-axis records the value of $J$ and the $y$-axis represents the scaled Frobenius loss $\|\hat\Theta\hat A^\top - \Theta^*(A^*)^\top\|_F^2/(NJ)$. The lines in Figure 1 are obtained by connecting the medians of the scaled Frobenius loss among the 50 replications for different $J$s, given $K$ and the missingness level. For each setting of $J$, $K$, and missingness level, an interval is also plotted that shows the upper and lower quartiles of the scaled Frobenius loss. According to Figure 1, for each missing-data situation, the scaled Frobenius loss tends to converge to 0 as $N$ and $J$ simultaneously grow. In addition, we observe that the less missing data, the smaller the scaled Frobenius loss. The results in Figure 1 are consistent with Theorem 1.

In Figure 2, the $x$-axis records the number of items $J$ and the $y$-axis records the largest principal angle $\angle(\mathrm{col}(\hat A), \mathrm{col}(A^*))$ between the column spaces of $\hat A$ and $A^*$. Similarly to the above, the median and the upper and lower quartiles of the largest principal angle among the 50 independent replications are presented. According to Figure 2, the largest principal angle between the two spaces tends to converge to 0 when both $N$ and $J$ diverge, for given $K$ and missingness level. In addition, having less missing data yields a smaller principal angle. The results in Figure 2 are consistent with Theorem 2.

4.2 Study II

We then investigate the situation where $J$ is fixed and $N$ grows. Specifically, we fix the number of items $J$ and the number of latent traits $K$, and let $N$ grow from 1,000 to 10,000. The generation of $\Theta^*$ and $A^*$ is the same as in Study I, and $C$ is set as in Study I. For ease of presentation, we only show the "no missing" case ($p = 1$); the patterns are similar in the presence of missing data. Results are shown in Figure 3. The left panel of Figure 3 plots the sample size $N$ ($x$-axis) versus the scaled Frobenius loss ($y$-axis), and the right panel plots $N$ ($x$-axis) versus the largest principal angle ($y$-axis). According to Figure 3, when $J$ is fixed and $N$ grows, both the scaled Frobenius loss and the largest principal angle first decrease and then tend to stabilize around some positive values. This result implies that the CJMLE may not perform well when either $N$ or $J$ is small.

Figure 3: The scaled Frobenius loss (panel a) and the largest principal angle (panel b) when $J$ and $K$ are fixed and $N$ varies from 1,000 to 10,000, under the "no missing" setting.

5 Real Data Analysis

We apply the proposed method to analyzing selected personality data from the Synthetic Aperture Personality Assessment (SAPA) project in Condon and Revelle (2015); the data set is available from http://dx.doi.org/10.7910/DVN/SD7SVE. The SAPA project is a collaborative online data collection tool for assessing psychological constructs across multiple domains of personality. This data set was collected to evaluate the structure of personality constructs in the temperament domain, and it contains 23,681 respondents' responses to 696 personality assessment items. The items are from either the International Personality Item Pool (Goldberg, 1999; Goldberg et al., 2006) or the Eysenck Personality Questionnaire - Revised (Eysenck et al., 1985). The data contain "massive missingness" by design, and the missingness mechanism can be classified as missing completely at random. The mean number of items answered by a respondent is 86.1 (SD = 58.7; median = 71), and the mean number of administrations of an item is 2,931 (SD = 781; median = 2,554). All items are on a six-category rating scale; in this analysis, we dichotomize them by truncating at the median response of each item. The readers are referred to Condon and Revelle (2015) for more detailed information about this data set.

We first explore the latent dimensionality of the data. Specifically, we consider the dimension $K = 1$, 5, 10, 15, 20, and 25 and the constraint tuning parameter $C = 2$, 4, 6, 8, and 10. The combination of $K$ and $C$ that best predicts the missing responses is found by five-fold cross-validation. Let $\Omega$ be the indicator matrix of non-missing responses. We randomly split the non-missing responses into five non-overlapping sets of equal size, indicated by $\Omega_1, \ldots, \Omega_5$ and satisfying $\Omega = \Omega_1 + \cdots + \Omega_5$. Moreover, we denote $\Omega_{-v} = \Omega - \Omega_v$, indicating the data excluding set $v$. For given $K$ and $C$, we find the CJMLE based on each $\Omega_{-v}$ ($v = 1, \ldots, 5$), denoted by $(\hat\Theta^{(v)}, \hat A^{(v)})$, by solving (6). Then the fitted model with parameters $(\hat\Theta^{(v)}, \hat A^{(v)})$ is used to predict the non-missing responses in set $v$. The prediction accuracy is measured by the following log-likelihood based cross-validation (CV) error:

\mathrm{CV}(K, C) = -\sum_{v=1}^{5} \sum_{(i,j):\, \omega^{(v)}_{ij} = 1} \big\{ y_{ij}\log \hat p^{(v)}_{ij} + (1 - y_{ij})\log(1 - \hat p^{(v)}_{ij}) \big\},

where the $\omega^{(v)}_{ij}$s are the entries of $\Omega_v$ and the $\hat p^{(v)}_{ij}$s are the entries of the predicted probability matrix obtained by applying the logistic transform entrywise to $\hat\Theta^{(v)}(\hat A^{(v)})^\top$. The combination of $K$ and $C$ that yields a smaller CV error is preferred, as the corresponding model tends to better predict unobserved responses. Results are shown in Figures 4 and 5. Except when $K = 1$, for each value of $K$, the log-likelihood based CV error first decreases and then increases as $C$ grows, achieving its minimum at an intermediate value of $C$; when $K = 1$, the CV error decreases as $C$ increases from 2 to 10. This may be due to a bias-variance trade-off, where bias decreases and variance increases as $C$ increases. When $K = 1$, the bias term may dominate the variance, due to the stringent unidimensionality assumption. In addition, the CV errors when $K = 1$ are much higher than the smallest CV errors under the other choices of $K$, as shown in Figure 5. This means that a unidimensional model is inadequate to capture the underlying structure of the 696 items. Figure 6 shows the smallest CV error for each $K = 5$, 10, 15, 20, and 25. The smallest CV error is achieved when $K = 15$, meaning that in terms of prediction, 15 latent traits tend to best characterize this personality item pool in the temperament domain.
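The following R sketch outlines this cross-validation procedure, reusing the cjmle() sketch from Section 2.4; the fold construction and the error formula follow the description above, while the code organization is our own.

```r
# Five-fold cross-validation over the non-missing entries, for given K and C.
cv_error <- function(Y, Omega, K, C, nfolds = 5) {
  obs  <- which(Omega == 1)                      # indices of observed responses
  fold <- sample(rep_len(seq_len(nfolds), length(obs)))
  err  <- 0
  for (v in seq_len(nfolds)) {
    Omega_train <- Omega
    Omega_train[obs[fold == v]] <- 0             # hold out set v
    fit  <- cjmle(Y, Omega_train, K, C)          # fit on the remaining folds
    P    <- 1 / (1 + exp(-fit$Theta %*% t(fit$A)))
    held <- obs[fold == v]
    err  <- err - sum(Y[held] * log(P[held]) +
                      (1 - Y[held]) * log(1 - P[held]))
  }
  err                                            # log-likelihood based CV error
}
```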

Figure 4: Log-likelihood based cross-validation error under different combinations of $K$ and $C$.
Figure 5: Log-likelihood based cross-validation error under different combinations of $K$ and $C$.
Figure 6: Smallest log-likelihood based cross-validation error for $K = 5$, 10, 15, 20, and 25.

We then explore the estimated factor loading matrix $\hat A$, where $K = 15$ and the corresponding $C$ are chosen based on the CV error above. The loading matrix after the promax rotation has a good interpretation. Specifically, we present the 10 items with the highest absolute value of loadings on each latent trait, as shown in Tables 1 and 2, where "(+)" and "(-)" denote positive and negative loadings. The latent traits are ordered based on the total sum of squares (TSS) of their loadings. In fact, the items that load heavily on latent traits 2-12 seem to be about "compassion", "extroversion", "depression", "conscientiousness", "intellect", "irritability", "boldness", "intellect", "perfection", "traditionalism", and "honesty", respectively. Except for latent traits 6 and 9, which both seem to be about "intellect", all the other traits among traits 2-12 seem to be about different well-known personality facets. The items that load heavily on latent traits 1 and 13-15 seem to be heterogeneous, and these latent traits are worth further investigation.

Trait 1 (TSS = 348.5):
() Am not interested in other people’s problems.
() Can’t be bothered with other’s needs.
() Don’t understand people who get emotional.
() Am not really interested in others.
() Pay too little attention to details.
() Don’t put my mind on the task at hand.
() Am seldom bothered by the apparent suffering of strangers.
() Am not easily amused.
() Shirk my duties.
() Rarely notice my emotional reactions.

Trait 2 (TSS = 279.4):
() Feel others’ emotions.
() Am sensitive to the needs of others.
() Feel sympathy for those who are worse off than myself.
() Sympathize with others’ feelings.
() Have a soft heart.
() Inquire about others’ well-being.
() Sympathize with the homeless.
() Am concerned about others.
() Know how to comfort others.
() Suffer from others’ sorrows.

Trait 3 (TSS = 231.1):
() Keep in the background.
() Am mostly quiet when with other people.
() Prefer to be alone.
() Dislike being the center of attention.
() Want to be left alone.
() Seek quiet.
() Keep others at a distance.
() Tend to keep in the background on social occasions.
() Hate being the center of attention.
() Am afraid to draw attention to myself.

Trait 4 (TSS = 217.2):
() Have once wished that I were dead.
() Often feel lonely.
() Am often down in the dumps.
() Find life difficult.
() Feel a sense of worthlessness or hopelessness.
() Dislike myself.
() Am happy with my life.
() Seldom feel blue.
() Rarely feel depressed.
() Suffer from sleeplessness.

Trait 5 (TSS = 215.4):
() Get to work at once.
() Complete my duties as soon as possible.
() Get chores done right away.
() Find it difficult to get down to work.
() Have difficulty starting tasks.
() Keep things tidy.
() Get started quickly on doing a job.
() Start tasks right away.
() Leave a mess in my room.
() Need a push to get started.

Trait 6 (TSS = 206.1):
() Think quickly.
() Can handle complex problems.
() Nearly always have a “ready answer” when talking to people.
() Can handle a lot of information.
() Am quick to understand things.
() Catch on to things quickly.
() Formulate ideas clearly.
() Know immediately what to do.
() Come up with good solutions.
() Am hard to convince.

Trait 7 (TSS = 198.3):
() Rarely show my anger.
() Get irritated easily.
() Lose my temper.
() Get angry easily.
() Can be stirred up easily.
() When my temper rises, I find it difficult to control.
() Get easily agitated.
() It takes a lot to make me feel angry at someone.
() Rarely get irritated.
() Seldom get mad.

Trait 8 (TSS = 178.9):
() Would never go hang gliding or bungee jumping.
() Like to do frightening things.
() Love dangerous situations.
() Seek adventure.
() Am willing to try anything once.
() Do crazy things.
() Take risks that could cause trouble for me.
() Take risks.
() Never go down rapids in a canoe.
() Act wild and crazy.

Trait 9 (TSS = 175.4):
() Need a creative outlet.
() Don’t pride myself on being original.
() Do not have a good imagination.
() Enjoy thinking about things.
() Do not like art.
() Have a vivid imagination.
() See myself as an average person.
() Consider myself an average person.
() Am just an ordinary person.
() Believe in the importance of art.

Table 1: Results from analyzing SAPA data (Part I): Top 10 items with highest absolute value of loadings on each of latent traits 1-9.
Trait 10 (TSS = 147.5):
() Want everything to add up perfectly.
() Dislike imperfect work.
() Want everything to be “just right.”
() Demand quality.
() Have an eye for detail.
() Want every detail taken care of.
() Avoid mistakes.
() Being in debt is worrisome to me.
() Am exacting in my work.
() Dislike people who don’t know how to behave themselves.

Trait 11 (TSS = 135.4):
() Believe in one true religion.
() Don’t consider myself religious.
() Believe that there is no absolute right and wrong.
() Tend to vote for liberal political candidates.
() Think marriage is old-fashioned and should be done away with.
() Tend to vote for conservative political candidates.
() Believe one has special duties to one’s family.
() Like to stand during the national anthem.
() Believe that we should be tough on crime.
() Believe laws should be strictly enforced.

Trait 12 (TSS = 134.0):
() Use flattery to get ahead.
() Tell people what they want to hear so they do what I want.
() Would never take things that aren’t mine.
() Don’t pretend to be more than I am.
() Tell a lot of lies.
() Play a role in order to impress people.
() Switch my loyalties when I feel like it.
() Return extra change when a cashier makes a mistake.
() Use others for my own ends.
() Not regret taking advantage of someone impulsively.

Trait 13 (TSS = 120.3):
() Like to take my time.
() Like a leisurely lifestyle.
() Would never make a high-risk investment.
() Let things proceed at their own pace.
() Always know why I do things.
() Always admit it when I make a mistake.
() Have read the great literary classics.
() Am more easy-going about right and wrong than most people.
() Value cooperation over competition.
() Don’t know much about history.

Trait 14 (TSS = 117.5):
() Am interested in science.
() Trust what people say.
() Find political discussions interesting.
() Don’t [worry about] political and social problems.
() Would not enjoy being a famous celebrity.
() Believe that we coddle criminals too much.
() Enjoy intellectual games.
() Believe that people are basically moral.
() Like to solve complex problems.
() Trust people to mainly tell the truth.

Trait 15 (TSS = 111.7):
() Dislike loud music.
() Like telling jokes and funny stories to my friends.
() Seldom joke around.
() Prefer to eat at expensive restaurants.
() Laugh aloud.
() Most things taste the same to me.
() People spend too much time safeguarding their future with savings and insurance.
() Amuse my friends.
() Love my enemies.
() Am not good at deceiving other people.

Table 2: Results from analyzing SAPA data (Part II): Top 10 items with highest absolute value of loadings on each of latent traits 10-15.

6 Discussion

In this paper, we propose a constrained joint maximum likelihood estimator for analyzing large-scale item response data, which allows for the presence of a high percentage of missing responses. It differs from the traditional JMLE by adding constraints on the Euclidean norms of both the item and the respondent parameters. An efficient parallel computing algorithm is proposed that scales, even on a single machine, to large data sets with tens of thousands of respondents, thousands of items, and more than ten latent traits. The CJMLE also has good statistical properties. In particular, we provide error bounds on the parameter estimates and show that the linear space spanned by the column vectors of the factor loading matrix can be consistently recovered, under mild regularity conditions and the high-dimensional setting that both the number of items and the number of respondents grow to infinity. This result implies that the true loading structure can be learned by applying proper rotational methods to the estimated loading matrix, when the true loading matrix has a simple structure (i.e., each latent trait is measured by only a subset of the items). These theoretical developments are consistent with the results from the simulation studies. Our simulation results also imply that the high-dimensional setting, in which both the number of respondents and the number of items grow to infinity, is important: when either the sample size or the number of items is small, the CJMLE may not work well. The proposed method is applied to analyzing large-scale data from the Synthetic Aperture Personality Assessment project. It is found that a model with 15 latent traits has the best prediction power based on the results from cross-validation, and the majority of the traits seem to be homogeneous and correspond to well-known personality facets.

The proposed method may be extended along several directions. First, the current algorithm and theory focus on binary responses. It is worth extending them to multidimensional IRT models for ordinal response data, such as the generalized partial credit model (Yao and Schwarz, 2006), which extends the multidimensional two-parameter logistic model to ordinal responses. Second, even after applying rotational methods, the obtained factor loading matrix may still be dense and difficult to interpret. To better pursue a simple loading structure, it may be helpful to further impose regularization on the factor loading parameters, as introduced in Sun et al. (2016). Third, it is also important to incorporate respondent-specific covariates in the analysis, so that the relationship between baseline covariates and psychological latent traits can be investigated. Finally, the current theoretical framework assumes the number of latent traits to be known. Theoretical results on estimating the number of latent traits based on the CJMLE are also of interest, as they may provide guidance on choosing the number of latent traits.

7 Appendix

7.1 Proof of Theorem 1

Proof.

The proof of Theorem 1 is similar to that of Theorem 1 in Davenport et al. (2014). Thus, we only state the main steps and omit the details. Let

(12)

where is an matrix whose entries are all zero. Then, we have the following lemma from Davenport et al. (2014).

Lemma 1 (Lemma A.1 of Davenport et al. (2014)).

There exist constants and such that, for all , and ,

(13)

where under the settings of the current paper and denotes the nuclear norm of a matrix.

Applying the above lemma with suitable choices of its arguments, we have

(14)

Define

(15)

Note that if , then

(16)

Thus, from (14), we further have

(17)
(18)
(19)

We use the following result, which is a slight modification of the last equation on p. 210 of Davenport et al. (2014):

(20)

where denotes the Kullback-Leibler divergence between the joint distributions of the responses when the model parameters are and , respectively. In addition, we have the following inequality, which is a direct application of Lemma A.2 in Davenport et al. (2014) and the third equation on page 211 of Davenport et al. (2014). For any such that ,

(21)

with under our settings. Combining (19), (20) and (21), we can see that with probability ,

(22)

Note that for , . Thus, we can rearrange terms in the above inequality and simplify it as

(23)

For , we further have

(24)

Combining the above equation with (23) and noting that $K$ is assumed to be fixed, we complete the proof. ∎

7.2 Proof of Theorem 2

Proof.

Denote and . Let be the singular values of and , …,