1 Introduction
Multivariate data analysis (Anderson; 2003) is one of the most important topics in modern statistics and is widely encountered in disciplines including psychology, education, marketing, economics, medicine, engineering, geography, and biology. Latent factor models, originating from the seminal work of Spearman (1904) on human intelligence, play a key role in multivariate data analysis (Anderson; 2003; Skrondal and Rabe-Hesketh; 2004; Bartholomew et al.; 2008, 2011). Latent factor models capture and interpret the common dependence among multiple manifest variables through the use of low-dimensional latent factors.
A key application of latent factor models is in psychological measurement, where the latent factors are interpreted as psychological traits (e.g., cognitive abilities, personality traits, and psychopathic traits) that are not directly observable. By modeling the relationship between the latent factors and the manifest variables, the ultimate goal of latent factor analysis in psychological measurement is to make statistical inference on the individual-specific latent traits. Specifically, after fitting a latent factor model, each individual is assigned latent factor scores as estimates of his/her levels on the corresponding psychological traits. Decisions are then made based on the factor scores, such as deciding whether a student has mastered a certain skill, ranking the students according to their proficiency levels on a skill, describing the personality profile of an individual, and diagnosing whether a patient suffers from a certain mental health disorder.
Due to the confirmatory nature of psychological measurement, item design information, i.e. how each item is associated with latent traits, is often available a priori and used in fitting a latent factor model. Such information is incorporated into the model through constraints on the model parameters, resulting in a structured latent factor model. A good item design is key to measurement validity, guaranteeing that the model-based measurement reflects what it is supposed to measure (Wainer and Braun; 2013).
This paper develops statistical theory and methods for design and scoring in psychological measurement, which are central problems of psychological measurement theory (AERA et al.; 2014). Motivated by large-scale assessments, we adopt an asymptotic regime in which the sample size and the number of items grow to infinity. Under this regime, we provide insights into the design of measurement based on our theoretical development on the structural identifiability of latent factors, a concept proposed in this paper that is central to large-scale measurement. This notion of identifiability is different from the classic definition of identifiability and may shed light on the identifiability of infinite-dimensional models for non-i.i.d. data that are widely encountered in modern statistics, including in network analysis and spatio-temporal statistics. Moreover, necessary and sufficient conditions are established for the structural identifiability of a latent factor, which formalize the intuition that a latent factor can be identified when it is measured by sufficiently many items that distinguish it from the other factors. This result explains why the “simple structure” design is popular in psychological measurement (Cattell; 2012). Our asymptotic results also provide a theoretical guarantee for the use of estimated factor scores in making decisions (e.g. classification and ranking) when the corresponding latent factors are structurally identifiable.
The rest of the paper is organized as follows. In Section 2, we introduce a generalized latent factor modeling framework, within which our research questions are formulated. In Section 3, we discuss the structural identifiability of latent factors and the relationship between structural identifiability and estimability, and provide an estimator that consistently estimates the structurally identifiable latent factors. Further implications of our theoretical results for large-scale measurement are provided in Section 4, and extensions of our results to more complex settings are discussed in Section 5. A new perturbation bound on linear subspaces, which is useful for the statistical analysis of low-rank matrix estimation, is presented in Section 6. Our theoretical results are verified by simulation studies in Section 7. Finally, concluding remarks are provided in Section 8. The proofs of all the technical results are provided in the supplement.
2 Structured Latent Factor Analysis
2.1 Generalized Latent Factor Model
Consider that there are $N$ individuals and $J$ manifest variables (e.g. test items). Let $Y_{ij}$ be a random variable denoting the $i$th individual’s value on the $j$th manifest variable and let $y_{ij}$ be its realization. For example, in educational tests, the $Y_{ij}$s could be binary responses from the examinees, indicating whether the answers are correct or not. We further assume that each individual $i$ is associated with a $K$-dimensional latent vector, denoted as $\theta_i = (\theta_{i1}, \ldots, \theta_{iK})^\top$, and each manifest variable $j$ is associated with $K$ parameters $a_j = (a_{j1}, \ldots, a_{jK})^\top$. We give two concrete contexts. Consider an educational test of mathematics, with $K = 3$ dimensions of “algebra”, “geometry”, and “calculus”. Then $\theta_{i1}$, $\theta_{i2}$, and $\theta_{i3}$ represent individual $i$’s proficiency levels on algebra, geometry, and calculus, respectively. In the measurement of Big Five personality factors (Goldberg; 1993), $K = 5$ personality factors are considered, including “openness to experience”, “conscientiousness”, “extraversion”, “agreeableness”, and “neuroticism”. Then $\theta_{i1}$, …, $\theta_{i5}$ represent individual $i$’s levels on the continuums of the five personality traits. The manifest parameters $a_j$ can be understood as the regression coefficients when regressing the $Y_{ij}$s on the $\theta_i$s, $i = 1, \ldots, N$. The manifest parameters $a_j$ are also known as the factor loadings in the factor analysis literature (e.g. Bartholomew et al.; 2008) and the discrimination parameters in the item response theory literature (e.g. Embretson and Reise; 2000). In many applications of latent factor models, especially in psychology and education, the estimation of both the $\theta_i$s and the $a_j$s is of interest (e.g. Bartholomew et al.; 2008).
Our development is under a generalized latent factor model framework (Skrondal and Rabe-Hesketh; 2004), which extends the generalized linear model framework (McCullagh and Nelder; 1989) to latent factor analysis. This general modeling framework allows for different types of manifest variables. Specifically, we assume that the distribution of $Y_{ij}$ given $\theta_i$ and $a_j$ is a member of the exponential family with natural parameter
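$$m_{ij} = a_j^\top \theta_i = a_{j1}\theta_{i1} + \cdots + a_{jK}\theta_{iK} \tag{1}$$
and possibly a scale (i.e. dispersion) parameter $\phi$. More precisely, the density/probability mass function takes the form:
$$f(y \mid a_j, \theta_i, \phi) = \exp\left( \frac{y\, m_{ij} - b(m_{ij})}{\phi} + c(y, \phi) \right), \tag{2}$$
where $b(\cdot)$ and $c(\cdot, \cdot)$ are prespecified functions that depend on the member of the exponential family. Given $\theta_i$ and $a_j$, $i = 1, \ldots, N$ and $j = 1, \ldots, J$, we assume all $Y_{ij}$s are independent. Consequently, the likelihood function, in which the $\theta_i$s and $a_j$s are treated as fixed effects, can be written as
$$L(\theta_1, \ldots, \theta_N, a_1, \ldots, a_J, \phi) = \prod_{i=1}^{N} \prod_{j=1}^{J} \exp\left( \frac{y_{ij}\, m_{ij} - b(m_{ij})}{\phi} + c(y_{ij}, \phi) \right). \tag{3}$$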
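This likelihood function is known as the joint likelihood function in the literature of latent variable models (Skrondal and Rabe-Hesketh; 2004). Since the likelihood function depends on the person parameters and the manifest parameters only through the natural parameters in (1), it has rotational and scaling indeterminacy. That is, the likelihood remains unchanged when every person vector is premultiplied by an invertible matrix and every manifest parameter vector is premultiplied by the transpose of the inverse of that matrix, where the matrix can be any invertible matrix.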
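The indeterminacy described above can be checked numerically. The following is a minimal sketch, not part of the paper: the names `theta`, `a`, and `G`, the sizes, and the particular invertible matrix are illustrative assumptions. It verifies that the matrix of natural parameters, and hence any likelihood that depends on the parameters only through it, is unchanged by the transformation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, K = 5, 4, 2                      # hypothetical sizes
theta = rng.normal(size=(N, K))        # row i holds the person vector theta_i
a = rng.normal(size=(J, K))            # row j holds the loading vector a_j

# Natural parameters: M[i, j] = a_j^T theta_i
M = theta @ a.T

# Any invertible K x K matrix; this particular choice is arbitrary.
G = np.array([[2.0, 1.0],
              [0.0, 0.5]])

theta_new = theta @ G.T                # theta_i -> G theta_i
a_new = a @ np.linalg.inv(G)           # a_j -> (G^{-1})^T a_j
M_new = theta_new @ a_new.T

print(np.allclose(M, M_new))
```

Because `M_new` equals `M`, data alone cannot distinguish `(theta, a)` from `(theta_new, a_new)`; this is why additional constraints, such as the design information introduced later, are needed.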
We remark that in the existing literature of latent factor models, there is often an intercept term indexed by in the specification of (1), which can be easily realized under our formulation by constraining , for all . In that case, serves as the intercept term.
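This family of models contains special cases, such as the linear factor analysis model (e.g. Anderson; 2003; Bartholomew et al.; 2008), the multidimensional item response theory model for binary responses that plays a key role in educational assessment (e.g. Reckase; 2009), and the Poisson model that is widely used to analyze multivariate count data (e.g. Moustaki and Knott; 2000). We list their forms below.

Linear Factor Analysis: $Y_{ij} \mid \theta_i, a_j \sim N(a_j^\top \theta_i, \sigma^2)$, where the scale parameter $\phi = \sigma^2$.

Multidimensional Item Response Theory (MIRT): $Y_{ij} \mid \theta_i, a_j \sim \mathrm{Bernoulli}\left( \frac{\exp(a_j^\top \theta_i)}{1 + \exp(a_j^\top \theta_i)} \right)$, where the scale parameter $\phi = 1$.

Poisson Factor Model: $Y_{ij} \mid \theta_i, a_j \sim \mathrm{Poisson}\left( \exp(a_j^\top \theta_i) \right)$, where the scale parameter $\phi = 1$.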
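To make the three special cases concrete, the sketch below simulates responses from each of them. It assumes the standard canonical-link forms (a normal mean, a logistic success probability, and a log-linear Poisson rate, all driven by the inner product of the loading and person vectors); the function name `simulate_responses` and its arguments are hypothetical.

```python
import numpy as np

def simulate_responses(theta, a, model, rng, sigma=1.0):
    """Draw an N x J response matrix from one of three latent factor models.

    theta: (N, K) person parameters; a: (J, K) manifest parameters.
    model: "linear", "mirt", or "poisson" (names chosen for this sketch).
    """
    m = theta @ a.T                      # natural parameters, shape (N, J)
    if model == "linear":
        # Normal response with mean m and standard deviation sigma
        return m + sigma * rng.normal(size=m.shape)
    if model == "mirt":
        # Binary response with logistic success probability
        p = 1.0 / (1.0 + np.exp(-m))
        return rng.binomial(1, p)
    if model == "poisson":
        # Count response with rate exp(m)
        return rng.poisson(np.exp(m))
    raise ValueError(f"unknown model: {model}")

rng = np.random.default_rng(0)
theta = rng.normal(size=(50, 3))
a = rng.normal(scale=0.5, size=(8, 3))
Y_binary = simulate_responses(theta, a, "mirt", rng)
Y_count = simulate_responses(theta, a, "poisson", rng)
```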
Usually, the likelihood function (3) is not used for maximum likelihood analysis. This is possibly because, under the conventional asymptotic setting in which the number of manifest variables is fixed and the sample size grows to infinity, the number of parameters in (3) diverges due to the growing number of person parameters, resulting in inconsistent maximum likelihood estimation. This phenomenon was first pointed out in Neyman and Scott (1948) and was further investigated in subsequent developments, including Andersen (1970), Haberman (1977), Fischer (1981), and Ghosh (1995). Consequently, in generalized latent factor analysis, the person parameters are typically assumed to be random effects (i.e., independent and identically distributed samples from a distribution) and are integrated out from the joint likelihood function (3), while the manifest parameters are still regarded as fixed effects. The resulting likelihood function is known as the marginal likelihood.
The analysis of this paper focuses on the joint likelihood function (3), where both the person parameters and the manifest parameters are treated as fixed effects, under an asymptotic setting in which both the sample size and the number of manifest variables grow to infinity. For ease of exposition, we assume the scale parameter is known in the rest of the paper, while pointing out that it is straightforward to extend all the results to the case where it is unknown.
This asymptotic regime is motivated by large-scale assessments in psychology and education, where both the sample size and the number of manifest variables can be very large. Moreover, quantifying the estimation accuracy of the person parameters, which is a main focus of latent factor analysis in psychological and educational measurement, is more straightforward and thus easier to interpret under the fixed-effect point of view. We point out that a similar asymptotic setting has been adopted in Haberman (1977) for the analysis of the Rasch model (Rasch; 1960), a simple unidimensional latent factor model that is widely used in educational measurement.
2.2 Confirmatory Structure
In this paper, we consider a confirmatory setting where domain knowledge is available for the manifest variables. For example, in a personality assessment in psychology, the personality factor that each item measures is prespecified. For instance, the item “I am the life of the party” measures the trait “extraversion” and the item “I get stressed out easily” measures the trait “neuroticism” in a Big Five personality test. In an educational assessment, the item design is also prespecified in the test blueprint. For example, one item may measure both algebra and geometry and another may measure calculus solely. Such information is typically reflected by constraints on the manifest parameters $a_j$. Specifically, for each item $j$, there is a prespecified vector $q_j = (q_{j1}, \ldots, q_{jK})$ with $q_{jk} \in \{0, 1\}$, where $q_{jk} = 1$ means that latent factor $k$ is measured by manifest variable $j$ and thus no constraint is imposed on $a_{jk}$, and $q_{jk} = 0$ implies that manifest variable $j$ does not depend on latent factor $k$ and thus $a_{jk}$ is set to 0.
Intuitively, a good design leads to superior measurement results. This intuition is formalized in this paper through asymptotic analysis, which establishes a relationship between the design information given by the design vectors and the structural identifiability of the latent factors, which is defined in Section 3.
2.3 Research Questions
This paper focuses on the identifiability of the latent factors. To define the identifiability, consider a population of people where and a population of manifest variables where . A latent factor is a hypothetical construct, defined by the person population. More precisely, it is determined by the individual latent factor scores of the entire person population, denoted by , where denotes the true latent factor score of person on latent factor and denotes the set of vectors with countably infinite real number components. The identifiability of the th latent factor then is equivalent to the identifiability of a vector in
under the distribution of an infinite dimensional random matrix,
.
The above setting is natural in the context of large-scale measurement, but is a nonstandard asymptotic setting in statistics. Under this setting, this paper addresses three research questions that are central to modern measurement theory. First, how should the identifiability of latent factors be suitably formalized? Second, under what design are the latent factors identifiable? Third, what is the relationship between identifiability and estimability? In other words, we want to know whether and to what extent the scores of an identifiable latent factor can be recovered from data.
The identifiability of latent factor models is an important problem in statistics. Research on this topic dates back to Anderson and Rubin (1956) and has received much attention from statisticians under both low- and high-dimensional settings (e.g. Anderson; 1984; Bai et al.; 2012). To the best of our knowledge, this is the first work characterizing the relationship between measurement design information (reflected by constraints) and the identifiability of latent factor models under a high-dimensional setting. In addition, our developments apply to a general model class. Both the incorporation of design information and the general model form make our problem technically challenging, involving the asymptotic analysis of a non-convex optimization problem. As will be shown in the rest of the paper, we tackle these challenges by proving useful probabilistic error bounds and by developing perturbation bounds on the intersection of linear subspaces.
2.4 Preliminaries
In this section, we fix some notations used throughout this paper.
Notations.


: the set of all positive integers.

: the set of vectors with countably infinite real number components.

: the set of all the real matrices with countably infinite rows and columns.

: the set of all the binary matrices with countably infinite rows and columns.

: the parameter matrix for the person population, .

: the parameter matrix for the manifest variable population, .

: the design matrix for the manifest variable population, .

: the vector or matrix with all components being 0.

: the first components of a vector .

: the submatrix of a matrix formed by rows and columns , where .

: the first components of the th column of a matrix .

: the th column of a matrix .

: the Euclidean norm of a vector .

: the sine of the angle between two vectors,
where , , and the function takes value , and when is positive, zero, and negative, respectively.

: the Frobenius norm of a matrix , .

: the spectral norm of matrix
, i.e., the largest singular value of matrix.

: the singular values of a matrix , in a descending order.

: a function mapping from to , defined as
(4) 
: the cardinality of a set .

: the projection of a vector onto a subset .
3 Structural Identifiability and Theoretical Results
3.1 Structural Identifiability
We first formalize the definition of structural identifiability. For two vectors with countably infinite components , we define
(5) 
which quantifies the angle between two vectors and in . In particular, we say the angle between and is zero when is zero.
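As a finite-sample illustration of this angle measure, the sketch below computes the sine of the angle between two vectors and evaluates it on growing prefixes of two long vectors, mimicking the limiting quantity in (5). The formula used, the square root of one minus the squared cosine, is the usual definition and is assumed here; the function name `sin_angle` and the example vectors are hypothetical.

```python
import numpy as np

def sin_angle(v, w):
    """Sine of the angle between two nonzero vectors (always in [0, 1])."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    cos = (v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))
    cos = np.clip(cos, -1.0, 1.0)        # guard against rounding error
    return float(np.sqrt(1.0 - cos ** 2))

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)
z = -3.0 * w + 0.05 * rng.normal(size=10_000)   # nearly collinear with w

# The measure ignores both scale and sign, so these values stay small.
for n in (10, 100, 1_000, 10_000):
    print(n, sin_angle(w[:n], z[:n]))
```

Since the measure is invariant to rescaling and sign flips, a value near zero says only that two score vectors agree up to a multiplicative constant, which matches the scale indeterminacy of the model.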
Definition 1 (Structural identifiability of a latent factor).
Consider the th latent factor, where , and a nonempty parameter space for . We say the th latent factor is structurally identifiable in the parameter space if for any , implies .
We point out that the parameter space is essentially determined by the design information s. As will be shown shortly in this section, a good design imposes suitable constraints on the parameter space, which further ensures the structural identifiability of the latent factors. This definition of identifiability avoids the consideration of the scale of the latent factor, which is not uniquely determined as the distribution of data only depends on . Moreover, the sine measure is a canonical way to quantify the distance between two linear spaces that has been used in, for example, the well-known sine theorems for matrix perturbation (Davis; 1963; Wedin; 1972). As will be shown in the sequel, this definition of structural identifiability naturally leads to a relationship between identifiability and estimability and has important implications for psychological measurement.
We now characterize the structural identifiability under suitable regularity conditions. We consider a design matrix for the manifest variable population, where . Throughout the paper, we consider design matrices satisfying the following stability assumption.

The limit
exists for any subset . In addition, .
Note that is the proportion of manifest variables that are associated with and only with latent factors in . In addition, implies that there are few irrelevant manifest variables. We also make the following assumption on the generalized latent factor model, that is satisfied under most of the widely used models, including the linear factor model, MIRT model, and the Poisson factor model listed above.

The natural parameter space .
Under the above assumptions, Theorem 1 provides a necessary and sufficient condition on the design matrix for the structural identifiability of the th latent factor. This result is established within the parameter space ,
(6) 
where is a positive constant, the function is defined in (4) and
(7) 
denotes the set of manifest variables that are associated with and only with latent factors in . Discussions on the parameter space are provided after the statement of Theorem 1.
Theorem 1.
Under Assumptions A1 and A2, the th latent factor is structurally identifiable in if and only if
(8) 
where we define if for all that contains .
The following proposition guarantees that the parameter space is nontrivial.
Proposition 1.
For all satisfying A1, .
We further remark on the parameter space . First, requires some regularities on each and () and the matrix satisfying the constraints imposed by () for all . It further requires that there is enough variation among people, quantified by , where the function is defined in (4). Note that this requirement is mild, in the sense that if s are i.i.d. with a strictly positive definite covariance matrix, then
a.s., according to the strong law of large numbers. Furthermore,
requires that each group of items (categorized by ) contains sufficient information if appearing frequently (). Similar to the justification for , can also be justified by considering that s are i.i.d. following a certain distribution for .
We provide two examples to assist with understanding Theorem 1. If and , then the second latent factor is not structurally identifiable, even if it is associated with infinitely many manifest variables. In addition, having many manifest variables with a simple structure ensures the structural identifiability of a latent factor. That is, if , then the th factor is structurally identifiable.
3.2 Identifiability and estimability
It is well known that for a fixed-dimensional parametric model with i.i.d. samples, the identifiability of the model parameter is necessary for the existence of a consistent estimator. We extend this result to the infinite-dimensional parameter space under the current setting. We start with a generalized definition for the consistency of estimating a latent factor. An estimator given individuals and items is denoted by , which only depends on for all .
Definition 2 (Consistency for estimating latent factor ).
The sequence of estimators is said to consistently estimate the latent factor if
(9) 
for all .
The next proposition establishes the necessity of the structural identifiability of a latent factor on its estimability.
Proposition 2.
If latent factor is not structurally identifiable in , then there does not exist a consistent estimator for latent factor .
3.3 Estimation and Its Consistency
We further show that the structural identifiability and estimability are equivalent under our setting. For ease of exposition, let be the true design matrix in satisfying assumption A1. In addition, let be the true parameters for the person and the manifest variable populations. We provide an estimator such that
when satisfies (8), which leads to the structural identifiability of latent factor under Theorem 1. Specifically, we consider the following estimator
(10)  
where , is any constant greater than in the definition of , and imposes the constraint on . Note that maximizing is equivalent to maximizing the joint likelihood (3), due to the natural exponential family form. The next theorem provides an error bound on .
Theorem 2.
Under Assumptions A1 and A2 and , there exists , , and such that for all and , with probability
(11) 
Moreover, if satisfies (8) and thus latent factor is structurally identifiable, then
(12) 
with probability , where is a constant independent of and .
Proposition 2, and Theorems 1 and 2 together imply that the structural identifiability and estimability over are equivalent, which is summarized in the following corollary.
Corollary 1.
Under Assumptions A1 and A2, there exists an estimator such that in for all if and only if the design matrix satisfies (8).
Remark 1.
The error bound (11) holds even when one or more latent factors are not structurally identifiable. In particular, (11) holds when removing the constraint from (10), which corresponds to the exploratory factor analysis setting where no design matrix is prespecified (or in other words, for all and ; see e.g. Chen et al.; 2017).
Remark 2.
The proposed estimator (10) and its error bound are related to low-rank matrix completion (e.g. Candès and Plan; 2010; Davenport et al.; 2014), where a bound similar to (11) can typically be derived. The key differences are that (a) the research on matrix completion is only interested in the estimation of , while the current paper focuses on the estimation of , which is a fundamental problem of psychological measurement, and (b) our results are derived under a generalized latent factor model that covers many models.
We end this section by providing an alternating minimization algorithm (Algorithm 1) for solving the optimization program (10), which is computationally efficient through our parallel computing implementation using Open Multi-Processing (OpenMP; Dagum and Menon; 1998). Specifically, we adopt a projected gradient descent update (e.g. Parikh and Boyd; 2014) to handle the constraints, where the projections have closed-form solutions. Similar algorithms have been considered in other works, such as Udell et al. (2016) and Zhu et al. (2016), for solving optimization problems with respect to low-rank matrices. Convergence properties of this type of algorithm have also been studied (e.g. Zhao et al.; 2015).
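The alternating projected gradient idea can be sketched as follows for binary data under the logistic (MIRT) model. This is an illustrative single-threaded sketch, not the paper's Algorithm 1 or its OpenMP implementation. It assumes that the constraint set consists of a Euclidean norm bound on every person and loading vector plus the zero pattern of a binary design matrix `Q`; the names `fit_joint_mle`, `project_rows`, `radius`, and `step`, and all default values, are hypothetical.

```python
import numpy as np

def project_rows(X, radius):
    """Project each row of X onto the Euclidean ball of the given radius."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X * np.minimum(1.0, radius / np.maximum(norms, 1e-12))

def loglik(Y, theta, a):
    """Joint log-likelihood of binary Y under the logistic model."""
    m = theta @ a.T
    return float(np.sum(Y * m - np.logaddexp(0.0, m)))

def fit_joint_mle(Y, Q, radius=5.0, step=0.5, n_iter=300, seed=0):
    """Alternate projected gradient ascent steps on theta and a."""
    rng = np.random.default_rng(seed)
    N, J = Y.shape
    K = Q.shape[1]
    theta = rng.normal(scale=0.1, size=(N, K))
    a = rng.normal(scale=0.1, size=(J, K)) * Q
    for _ in range(n_iter):
        # Person step: gradient of the log-likelihood in theta, then projection.
        resid = Y - 1.0 / (1.0 + np.exp(-(theta @ a.T)))
        theta = project_rows(theta + step * (resid @ a) / J, radius)
        # Item step: zeroing by Q and then rescaling is the exact projection
        # onto the intersection of the zero pattern and the norm ball.
        resid = Y - 1.0 / (1.0 + np.exp(-(theta @ a.T)))
        a = project_rows((a + step * (resid.T @ theta) / N) * Q, radius)
    return theta, a
```

Both projections have closed forms, in line with the description above. The gradient scalings by `J` and `N` are a convenience to keep one step size reasonable for both blocks and are a design choice of this sketch.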
4 Further Implications
In this section, we discuss the implications of the above results on largescale measurement.
4.1 On the Design of Tests
According to Theorems 1 and 2, the key to the structural identifiability and consistent estimation of factor is
(13) 
which provides insights on the measurement design. First, it implies that the “simple structure” design that is advocated in psychological measurement is a safe design. Under the simple structure design, each manifest variable is associated with one and only one factor. If each latent factor is associated with many manifest variables that only measure factor , or more precisely , (13) is satisfied.
Second, our result implies that a simple structure is not necessary for a good measurement design. A latent factor can still be identified even when it is always measured together with some other factors. For example, consider the matrix in Table 1. Under this design, all three factors satisfy (13) even when there is no item measuring a single latent factor.
Third, (13) is not satisfied when there exists a and . That is, almost all manifest variables that are associated with factor are also associated with factor , in the asymptotic sense. Consequently, one cannot distinguish factor from factor , making factor structurally unidentifiable. We point out that in this case, factor may still be structurally identifiable; for example, when .
Finally, (13) is also not satisfied when . It implies that the factor is not structurally identifiable when the factor is not measured by a sufficient number of manifest variables.
Table 1 (columns: items 1 to 6; rows: latent factors 1 to 3)
Item:      1  2  3  4  5  6
Factor 1:  1  1  0  1  1  0
Factor 2:  1  0  1  1  0  1
Factor 3:  0  1  1  0  1  1
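The design discussion in this subsection can be turned into a small check. The sketch below assumes that condition (13) says the following: among the item types (sets of factors measured by an item) that contain factor k and occur with positive proportion, the intersection is exactly the singleton containing k. This reading is inferred from the four points above. The finite-sample proportion stands in for the limiting proportion, and the function name `identifiable_factors` is hypothetical.

```python
from collections import Counter

def identifiable_factors(design_rows, min_prop=0.0):
    """Check the intersection condition for each factor.

    design_rows: one 0/1 tuple per item, entry k says whether the item
    measures factor k + 1. Returns {factor label: True/False}.
    """
    n_items = len(design_rows)
    n_factors = len(design_rows[0])
    # Item type = set of (1-based) factors that the item measures.
    types = Counter(
        frozenset(k + 1 for k, q in enumerate(row) if q) for row in design_rows
    )
    result = {}
    for k in range(1, n_factors + 1):
        sets = [S for S, c in types.items() if k in S and c / n_items > min_prop]
        # With no qualifying item type, the factor is treated as not identifiable.
        result[k] = bool(sets) and frozenset.intersection(*sets) == frozenset({k})
    return result

# Design of Table 1, written item by item (each tuple is one column of the table).
table1 = [(1, 1, 0), (1, 0, 1), (0, 1, 1)] * 2
print(identifiable_factors(table1))

# Two factors: half the items measure factor 1 only, half measure both.
nested = [(1, 0), (1, 1)] * 3
print(identifiable_factors(nested))
```

In the second design every item that measures factor 2 also measures factor 1, so factor 2 fails the check while factor 1 passes, in line with the third point above.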
4.2 Properties of Estimated Factor Scores
A useful result.
Let be the true parameters for the person and the manifest variable populations. The following corollary is derived from Theorem 2 that establishes a relationship between the true person parameters and their estimates. This result is the key to the rest of the results in this section.
Corollary 2.
If Assumptions A1 and A2 hold and (8) is satisfied for some , then there exists a sequence of random variables , such that
(14) 
Remark 3.
Corollary 2 follows directly from (12). It provides an alternative view on how approximates . Since the likelihood function depends on and only through , the scale of is not identifiable even when it is structurally identifiable. This phenomenon is intrinsic to latent variable models (e.g. Skrondal and Rabe-Hesketh; 2004). Corollary 2 states that and are close in Euclidean distance after proper normalization. The normalized vectors and are both of unit length. The value of depends on the angle between and . Specifically, if and otherwise. In practice, especially in psychological measurement, can typically be determined by additional domain knowledge.
On the distribution of person population.
In psychological measurement, the distribution of true factor scores is typically of interest, which may provide an overview of the population on the constructs being measured. Corollary 2 implies the following proposition on the empirical distribution of the factor scores.
Proposition 3.
We point out that the normalization in (15) is reasonable. Consider a random design setting where
s are i.i.d. samples from some distribution with a finite second moment. Then
converges weakly to the distribution of , where is a random variable following the same distribution. Proposition 3 then implies that when factor is structurally identifiable and both and are large, the empirical distribution of approximates the empirical distribution of accurately, up to a scaling. Specifically, for any 1-Lipschitz function , is a consistent estimator for according to the definition of Wasserstein distance. Furthermore, Corollary 2 states that under the regularity conditions, , implying that , for all . That is, most of the s will fall into a small neighborhood of the corresponding s.
On ranking consistency.
The estimated factor scores may also be used to rank individuals along a certain construct. In particular, in educational testing, the ranking provides an ordering of the students’ proficiency in a certain ability (e.g., calculus, algebra, etc.). Our results also imply the validity of the ranking along a latent factor when it is structurally identifiable and and are sufficiently large. More precisely, we have the following proposition.
Proposition 4.
Suppose Assumptions A1 and A2 are satisfied and furthermore (8) is satisfied for factor . Consider and , the normalized versions of and as defined in (15). In addition, assume that there exists a constant such that for any sufficiently small and sufficiently large ,
(16) 
Then,
(17) 
where is the number of inconsistent pairs according to the ranks of and .
We point out that (16) is a mild regularity condition on the empirical distribution . It requires that the probability mass under does not concentrate in any small neighborhood, which further implies that the pairs of individuals who are difficult to distinguish along factor , i.e., s that and are close, take only a small proportion among all the pairs. In fact, it can be shown that (16) is true with probability tending to 1 as grows to infinity, when s are i.i.d. samples from a distribution with a bounded density function. Proposition 4 then implies that if we rank the individuals using (assuming can be consistently estimated based on other information), the proportion of incorrectly ranked pairs converges to 0. Note that is known as the Kendall’s tau distance (Kendall and Gibbons; 1990), a widely used measure for ranking consistency.
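For reference, the number of inconsistent pairs (the Kendall's tau distance) between two score vectors can be computed as in the sketch below. This is a plain quadratic-time illustration; the function name `kendall_tau_distance` and the example scores are hypothetical, and the estimated scores are assumed to have already been multiplied by the appropriate sign.

```python
import numpy as np

def kendall_tau_distance(x, y):
    """Number of pairs (i, i') that x and y order in opposite directions."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = np.sign(x[:, None] - x[None, :])
    dy = np.sign(y[:, None] - y[None, :])
    discordant = (dx * dy) < 0
    # Count each unordered pair once by keeping the strict upper triangle.
    return int(np.sum(np.triu(discordant, k=1)))

true_scores = np.array([-1.2, -0.3, 0.4, 1.5])
est_scores = np.array([-0.9, 0.5, 0.1, 1.1])      # swaps the middle two people
n = len(true_scores)
n_pairs = n * (n - 1) // 2
proportion = kendall_tau_distance(true_scores, est_scores) / n_pairs
```

Dividing by the total number of pairs gives the proportion of incorrectly ranked pairs discussed above. Flipping the sign of the estimated scores reverses every strictly ordered pair, which is why the sign has to be fixed before ranking.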
On classification consistency.
Another common practice of utilizing estimated factor scores is to classify individuals into two or more groups along a certain construct. For example, in an educational mastery test, it is of interest to classify examinees into “mastery” and “non-mastery” groups according to their proficiency in a certain ability (Lord; 1980; Bartroff et al.; 2008). In measuring psychopathology, it is common to classify respondents into “diseased” and “non-diseased” groups based on a mental health disorder. We justify the validity of making classification based on the estimated factor score.
Proposition 5.
Considering two prespecified thresholds and is the well-known indifference zone formulation of educational mastery test (e.g. Bartroff et al.; 2008). In that context, examinees with are classified into the “mastery” group and those with are classified into the “non-mastery” group. The interval is known as the indifference zone, within which no decision is made. Proposition 5 then implies that when factor is structurally identifiable, the classification error tends to 0 as both and grow to infinity.
5 Extensions
5.1 Generalized latent factor models with intercepts
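As mentioned in Section 2.1, intercepts can be easily incorporated in the generalized latent factor model by restricting the first latent factor score to equal 1 for every individual. The loadings on the first factor are then the intercept parameters, and every manifest variable is associated with the first factor. Consequently, the limiting proportion of manifest variables is zero for any item type that excludes the first factor, and thus the remaining latent factors are not structurally identifiable according to Theorem 1. Interestingly, these factors are still structurally identifiable if we restrict to the following parameter space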
(19) 
which requires that and are asymptotically orthogonal, for all .
Proposition 6.
Under Assumptions A1 and A2, and assuming that for all and , the th latent factor is structurally identifiable in if and only if
(20) 
for .
The next proposition guarantees that is also nonempty.
Proposition 7.
For all satisfying A1 and for all , and in addition , then .
Remark 4.
When having intercepts in the model, similar consistency results can be established for the estimator
(21)  
5.2 Extension to Missing Values
Our estimator can also handle missing data, which are often encountered in practice. Let be the indicator matrix of non-missing values, where if is observed and if is missing. When data are missing completely at random, the joint likelihood function becomes