1 Introduction
Hierarchical Latent Attribute Models (HLAMs) are a popular family of discrete latent variable models widely used in multiple scientific disciplines, including cognitive diagnosis in educational assessments [25, 37, 20, 33, 12], psychiatric diagnosis of mental disorders [36, 13], and epidemiological and medical measurement studies [39, 38]. Based on subjects' responses (often binary) to a set of items, HLAMs enable fine-grained inference on subjects' statuses of an underlying set of discrete latent attributes, and further allow for clustering the population into subgroups based on the inferred attribute patterns. In HLAMs, each latent attribute is often assumed binary and carries a specific scientific meaning. For example, in an educational assessment, the observed responses are students' correct or wrong answers to a set of test items, and the latent attributes indicate students' binary states of mastery or deficiency of certain cognitive abilities measured by the assessment [25, 37, 33].
HLAMs have connections to other multivariate discrete latent variable models, including latent tree graphical models [11, 2, 24, 30], restricted Boltzmann machines [21, 22, 34, 26, 23] and restricted Boltzmann forests (RBForests) [27], and latent feature models [15, 7, 29, 43], but with the following two key differences. First, the observed variables are assumed to have certain structured dependence on the latent attributes. This dependence is summarized by a binary structural matrix that encodes scientific interpretations. Second, HLAMs incorporate a hierarchical structure among the latent attributes. For instance, in cognitive diagnosis, the possession of certain attributes is often assumed to be a prerequisite for possessing some others [28, 35]. Such hierarchical structures differ from latent tree models in that the latter use a probabilistic graphical model to describe the hierarchical tree structure among latent variables, whereas in an HLAM the hierarchy is a directed acyclic graph (DAG) encoding hard constraints on allowable configurations of latent attributes. This type of hierarchical constraint in HLAMs has a similar flavor to that of the RBForests proposed in [27], though the DAG-structure constraints in an HLAM are more flexible than the forest-structure (i.e., group of trees) constraints in an RBForest (see Example 1).

One major issue in applications of HLAMs is that the structural matrix and the attribute hierarchy often suffer from potential misspecification by domain experts in confirmatory-type applications, or are even entirely unknown in exploratory-type applications. A key question is then how to efficiently learn both the structural matrix and the attribute hierarchy from noisy observations. More fundamentally, it is an important yet open question whether and when the latent structural matrix and the attribute hierarchy are identifiable. Identifiability of HLAMs has a close connection to the uniqueness of tensor decompositions, as the probability distribution of an HLAM can be written as a mixture of higher-order tensors.
In particular, HLAMs can be viewed as a special family of restricted latent class models, with the binary structural matrix imposing constraints on the model parameters. However, related works on the identifiability of latent class models and the uniqueness of tensor decompositions, such as [1, 3, 4, 8], cannot be directly applied to HLAMs due to the constraints induced by the structural matrix. To tackle identifiability under such structural constraints, [42, 40, 41, 18, 16, 17] recently proposed identifiability conditions for latent attribute models. However, [42, 40, 41, 18] considered latent attribute models without any attribute hierarchy; [16] assumed both the structural matrix and the true configurations of attribute patterns are known a priori; and [17] considered the problem of learning the set of truly existing attribute patterns but assumed the structural matrix is correctly specified beforehand. Establishing identifiability without assuming any knowledge of the structural matrix or the attribute hierarchy is a technically much more challenging task and remains unaddressed in the literature. Moreover, computationally, the existing methods for learning latent attribute models [10, 41, 17] cannot simultaneously estimate the structural matrix and the attribute hierarchy.

This paper has two main contributions. First, we address the challenging identifiability issue of HLAMs. We develop sufficient and almost necessary conditions for identifying the attribute hierarchy, the structural matrix, and the related model parameters in an HLAM. Second, we develop a scalable algorithm to estimate the latent structure and attribute hierarchy of an HLAM. Specifically, we propose a novel approach to simultaneously estimating the structural matrix and performing dimension reduction of attribute patterns. The superior performance of the proposed algorithm is demonstrated in various settings of synthetic data and in an application to an educational assessment dataset.
2 Model Setup
This section introduces the model setup of HLAMs in detail. We first introduce some notation. For an integer $m$, denote $[m] = \{1, 2, \ldots, m\}$. For two vectors $\boldsymbol a = (a_1, \ldots, a_m)$ and $\boldsymbol b = (b_1, \ldots, b_m)$ of the same length, denote $\boldsymbol a \succeq \boldsymbol b$ if $a_i \geq b_i$ for all $i \in [m]$, and denote $\boldsymbol a \not\succeq \boldsymbol b$ otherwise. Define the operations "$\preceq$" and "$\not\preceq$" similarly. For a set $S$, denote its cardinality by $|S|$. Denote the $K \times K$ identity matrix by $I_K$. Denote the binary indicator function by $\mathbb{1}(\cdot)$, and the sigmoid function by $\sigma(x) = 1/(1 + e^{-x})$.

An HLAM consists of two types of subject-specific binary variables: the observed responses $\boldsymbol R = (R_1, \ldots, R_J) \in \{0,1\}^J$ to $J$ items, and the latent attribute pattern $\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_K) \in \{0,1\}^K$ of $K$ attributes. First consider the latent attributes. Attribute $k$ is said to be a prerequisite of attribute $l$, denoted by $k \to l$, if any pattern $\boldsymbol\alpha$ with $\alpha_k = 0$ and $\alpha_l = 1$ is "forbidden" to exist. This is a common assumption in applications such as cognitive diagnosis [28, 35]. A subject's latent pattern $\boldsymbol\alpha$ is assumed to follow a categorical distribution with population proportion parameters $\boldsymbol p = (p_{\boldsymbol\alpha} : \boldsymbol\alpha \in \{0,1\}^K)$, with $p_{\boldsymbol\alpha} \geq 0$ and $\sum_{\boldsymbol\alpha} p_{\boldsymbol\alpha} = 1$. In particular, any pattern not respecting the hierarchy is deemed impossible to exist, with population proportion $p_{\boldsymbol\alpha} = 0$. An attribute hierarchy is a set of prerequisite relations between the attributes, which we denote by $\mathcal E = \{k \to l : k \text{ is a prerequisite of } l\}$. Any hierarchy $\mathcal E$ induces a set of allowable configurations of attribute patterns, which we denote by $\mathcal A(\mathcal E) \subseteq \{0,1\}^K$. The set $\mathcal A(\mathcal E)$ is a proper subset of $\{0,1\}^K$ if $\mathcal E \neq \varnothing$. So an attribute hierarchy determines the sparsity pattern of the vector of proportion parameters $\boldsymbol p$.

Example 1.
Figure 1 presents several hierarchies together with the sizes of the associated sets $\mathcal A(\mathcal E)$, where a dotted arrow from $k$ to $l$ indicates $k \to l$. The attribute hierarchy in an HLAM is generally a DAG. In the literature, the RBForests proposed in [27] also introduce hard constraints on the allowable configurations of the binary hidden (latent) variables in a restricted Boltzmann machine (RBM). The modeling goal of RBForests is to make computing the probability mass function of the observed variables tractable, while not having to limit the number of latent variables. Specifically, in an RBForest, latent variables are grouped in several full and complete binary trees of a certain depth, with the variables in a tree respecting the following constraints: if a latent variable takes value 0, then all latent variables in its left subtree must take one fixed value, while if it takes value 1, all latent variables in its right subtree must take that fixed value (the value being 0 in the paper [27]). The attribute hierarchy model in an HLAM has a similar spirit to RBForests, and actually includes the RBForests as a special case. For instance, the hierarchy in Figure 1(d) is equivalent to a tree of depth 3 in an RBForest. HLAMs allow for more general attribute hierarchies to encourage better interpretability. Another key difference between HLAMs and RBForests is the different joint model of the observed variables and the latent ones. An RBForest is an extension of an RBM, and they both use the same energy function, whereas HLAMs model the distribution differently, as specified below.
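To make the hierarchy-induced hard constraints concrete, the following brute-force sketch enumerates the allowable pattern set $\mathcal A(\mathcal E)$ for a small $K$. The function name and the 0-based attribute indexing are our own illustration, not from the paper.

```python
from itertools import product

def allowable_patterns(K, hierarchy):
    """Enumerate the set A(E) of attribute patterns respecting a hierarchy.

    `hierarchy` is a set of pairs (k, l) meaning attribute k is a
    prerequisite of attribute l (k -> l).  Attributes are 0-indexed here.
    """
    patterns = []
    for alpha in product([0, 1], repeat=K):
        # A pattern is forbidden if some attribute is possessed
        # while one of its prerequisites is not.
        if all(not (alpha[k] == 0 and alpha[l] == 1) for (k, l) in hierarchy):
            patterns.append(alpha)
    return patterns

# Hierarchy "attribute 1 is a prerequisite of attributes 2 and 3" among
# K = 3 attributes (0-indexed: 0 -> 1 and 0 -> 2):
A = allowable_patterns(3, {(0, 1), (0, 2)})
print(A)  # only 5 of the 2^3 = 8 binary patterns survive
```

With a nonempty hierarchy, the allowable set is a strict subset of $\{0,1\}^K$, which is exactly the sparsity constraint imposed on the proportion vector $\boldsymbol p$.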
On top of the model of the latent attributes, an HLAM uses a $J \times K$ binary structural matrix $Q = (q_{j,k})$ to encode the structural relationship between the $J$ items and the $K$ attributes. In cognitive diagnostic assessments, the matrix $Q$ is often specified by domain experts to summarize which cognitive abilities each test item targets [25, 37, 14]. Specifically, $q_{j,k} = 1$ if and only if the response $R_j$ to the $j$th item has statistical dependence on the latent attribute $\alpha_k$; that is, the conditional distribution of $R_j$ given $\boldsymbol\alpha$ only depends on the "parent" latent attributes of $R_j$, namely those $\alpha_k$ with $q_{j,k} = 1$. The structural matrix naturally induces a bipartite graph connecting the latent and the observed variables, with edges corresponding to the entries of "1" in $Q$. Figure 2 presents an example of a structural matrix and its corresponding directed graphical model between the latent attributes and the observed variables. The solid edges from the latent attributes to the observed variables are specified by $Q$. As can also be seen from the graphical model, the observed responses to the $J$ items are conditionally independent given the latent attribute pattern $\boldsymbol\alpha$.
In the psychometrics literature, various HLAMs adopting the $Q$-matrix concept have been proposed with the goal of diagnosing targeted attributes [25, 36, 37, 20, 12]; they are often called cognitive diagnosis models. In this work, we focus on two popular and basic types of modeling assumptions under this framework; as will be revealed soon, these two types of assumptions also have close connections to Boolean matrix decomposition [31, 32]. Specifically, the HLAMs considered in this paper assume a logical ideal response $\Gamma_{j, \boldsymbol\alpha}$ given an attribute pattern $\boldsymbol\alpha$ and an item loading vector $\boldsymbol q_j$ (the $j$th row of $Q$) in the noiseless case. Item-level noise parameters are then further introduced to account for the uncertainty of observations. The following are two popular ways to define the ideal response.
The first model is the AND-model, which assumes a conjunctive "and" relationship among the binary attributes. The ideal response of attribute pattern $\boldsymbol\alpha$ to item $j$ takes the form

$$\Gamma_{j, \boldsymbol\alpha} = \prod_{k=1}^K \alpha_k^{q_{j,k}} = \mathbb{1}(\boldsymbol\alpha \succeq \boldsymbol q_j). \quad \text{(AND-model ideal response)} \quad (1)$$

To interpret, $\Gamma_{j, \boldsymbol\alpha}$ in (1) indicates whether an attribute pattern $\boldsymbol\alpha$ possesses all the attributes specified by the item loading vector $\boldsymbol q_j$. This conjunctive relationship is often assumed for the diagnosis of students' mastery or deficiency of skill attributes in educational assessments, where $\Gamma_{j, \boldsymbol\alpha}$ naturally indicates whether a student with pattern $\boldsymbol\alpha$ has mastered all the attributes required by test item $j$. The uncertainty of the responses is further modeled by the item-specific Bernoulli parameters

$$\theta_j^+ = \mathbb P(R_j = 1 \mid \Gamma_{j, \boldsymbol\alpha} = 1), \qquad \theta_j^- = \mathbb P(R_j = 1 \mid \Gamma_{j, \boldsymbol\alpha} = 0), \quad (2)$$

where $\theta_j^+ > \theta_j^-$ is assumed for identifiability. For each item $j$, the ideal response $\Gamma_{j, \boldsymbol\alpha}$, viewed as a function of the attribute patterns, divides the patterns into two latent classes, $\{\boldsymbol\alpha : \Gamma_{j, \boldsymbol\alpha} = 1\}$ and $\{\boldsymbol\alpha : \Gamma_{j, \boldsymbol\alpha} = 0\}$; for these two latent classes, respectively, the item parameters $\theta_j^+$ and $\theta_j^-$ quantify the noise levels of the response to item $j$ deviating from the ideal response. Note that $\mathbb P(R_j = 1 \mid \boldsymbol\alpha)$ equals either $\theta_j^+$ or $\theta_j^-$. Denote the item parameter vectors by $\boldsymbol\theta^+ = (\theta_1^+, \ldots, \theta_J^+)$ and $\boldsymbol\theta^- = (\theta_1^-, \ldots, \theta_J^-)$. The model defined by (1) and (2) is called the Deterministic Input Noisy output "And" (DINA) model in cognitive diagnosis [25].
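As an illustration of (1) and (2), a minimal sketch of the AND-model response mechanism (function names are ours; the paper itself defines no code):

```python
import numpy as np

def ideal_response_and(alpha, q):
    """AND-model ideal response in (1): equals 1 iff the pattern alpha
    possesses every attribute required by the item loading vector q,
    i.e. alpha >= q entrywise."""
    alpha, q = np.asarray(alpha), np.asarray(q)
    return int(np.all(alpha >= q))

def response_prob(alpha, q, theta_pos, theta_neg):
    """P(R_j = 1 | alpha) as in (2): theta_j^+ for "capable" patterns,
    theta_j^- otherwise; theta_pos > theta_neg is assumed."""
    return theta_pos if ideal_response_and(alpha, q) else theta_neg

q = [1, 1, 0]                                  # item requires attributes 1 and 2
print(ideal_response_and([1, 1, 0], q))        # 1: all required attributes mastered
print(ideal_response_and([1, 0, 1], q))        # 0: attribute 2 is missing
print(response_prob([1, 1, 0], q, 0.9, 0.2))   # 0.9
```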
The second model is the OR-model, which assumes the following ideal response:

$$\Gamma_{j, \boldsymbol\alpha} = 1 - \prod_{k=1}^K (1 - \alpha_k)^{q_{j,k}}. \quad \text{(OR-model ideal response)} \quad (3)$$

Such a disjunctive relationship is often assumed in psychiatric measurement. Assumptions (3) and (2) yield the Deterministic Input Noisy output "Or" (DINO) model in cognitive diagnosis [36]. In the Boolean matrix factorization literature, a similar model was proposed by [31, 32]. Adapted to the terminology here, [32] assumes the ideal response takes the form $\Gamma_{j, \boldsymbol\alpha} = \mathbb{1}(\sum_{k=1}^K q_{j,k}\, \alpha_k > 0)$, which is equivalent to (3), while [32] constrains all the item-level noise parameters to be the same.
One can see from the last equivalent formulation of the OR-model that its ideal response is symmetric in the two vectors $\boldsymbol\alpha$ and $\boldsymbol q_j$, while for the AND-model this is not the case. There is an interesting duality [10] between the AND-model and the OR-model: the OR-model ideal response of a pattern $\boldsymbol\alpha$ equals one minus the AND-model ideal response of the complement pattern $\boldsymbol 1 - \boldsymbol\alpha$. Due to this duality, we will focus on the asymmetric AND-model in the sequel without loss of generality.
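The duality can be checked exhaustively for a small $K$; a brief sketch (identifiers are illustrative):

```python
from itertools import product

def gamma_and(alpha, q):
    # AND-model (1): 1 iff alpha >= q entrywise.
    return int(all(a >= b for a, b in zip(alpha, q)))

def gamma_or(alpha, q):
    # OR-model (3): 1 iff alpha possesses at least one required attribute.
    return int(any(a == 1 and b == 1 for a, b in zip(alpha, q)))

# Duality: Gamma^OR(alpha, q) = 1 - Gamma^AND(1 - alpha, q) for every alpha, q.
K = 3
for alpha in product([0, 1], repeat=K):
    for q in product([0, 1], repeat=K):
        flipped = tuple(1 - a for a in alpha)
        assert gamma_or(alpha, q) == 1 - gamma_and(flipped, q)
print("duality verified on all", 2**K * 2**K, "(alpha, q) pairs")
```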
3 Theory of model identifiability
This section presents the main theoretical result on model identifiability. Denote the ideal response matrix by $\Gamma(Q, \mathcal A)$. The matrix $\Gamma(Q, \mathcal A)$ has rows indexed by the $J$ items and columns indexed by the attribute patterns in $\mathcal A$, and its $(j, \boldsymbol\alpha)$th entry is defined to be the ideal response $\Gamma_{j, \boldsymbol\alpha}$ in (1). Given an attribute hierarchy $\mathcal E$ and the resulting $\mathcal A = \mathcal A(\mathcal E)$, two structural matrices $Q$ and $Q'$ are said to be equivalent if $\Gamma(Q, \mathcal A) = \Gamma(Q', \mathcal A)$; we also write this as $Q \overset{\mathcal A}{\sim} Q'$ (or simply $Q \sim Q'$). The following example illustrates how an attribute hierarchy determines a set of equivalent matrices.
Example 2.
Consider the attribute hierarchy $\mathcal E = \{1 \to 2,\ 1 \to 3\}$ in Figure 2, which results in $\mathcal A(\mathcal E) = \{(000), (100), (110), (101), (111)\}$. Then the identity matrix $I_3$ is equivalent to the following matrices under $\mathcal E$:

$$I_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix} \sim \begin{pmatrix} 1 & 0 & 0 \\ * & 1 & 0 \\ * & 0 & 1 \end{pmatrix}, \quad (4)$$

where the "$*$"s in the third matrix above indicate unspecified values, any of which can be either 0 or 1. This equivalence is due to the fact that attribute 1 serves as the prerequisite for both attributes 2 and 3, so any item loading vector measuring attribute 2 or 3 is equivalent to a modified one that also measures attribute 1, in terms of classifying the patterns in $\mathcal A(\mathcal E)$ into the two categories $\{\boldsymbol\alpha : \Gamma_{j, \boldsymbol\alpha} = 1\}$ and $\{\boldsymbol\alpha : \Gamma_{j, \boldsymbol\alpha} = 0\}$.

The following main theorem establishes identifiability conditions for an HLAM. See Supplement A for its proof.
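The equivalence in Example 2 can be verified numerically. The sketch below builds $\Gamma(Q, \mathcal A)$ via NumPy broadcasting and compares two structural matrices; function names are our own.

```python
import numpy as np

def gamma_matrix(Q, patterns):
    """Ideal response matrix Gamma(Q, A) under the AND-model:
    rows indexed by items, columns by the attribute patterns in A."""
    Q = np.asarray(Q)
    A = np.asarray(patterns)          # |A| x K array of allowable patterns
    # Entry (j, a) is 1 iff pattern a possesses all attributes required by item j.
    return (A[None, :, :] >= Q[:, None, :]).all(axis=2).astype(int)

def equivalent(Q1, Q2, patterns):
    return np.array_equal(gamma_matrix(Q1, patterns), gamma_matrix(Q2, patterns))

# Hierarchy 1 -> 2 and 1 -> 3 (Example 2); the 5 allowable patterns:
A = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1)]
I3 = np.eye(3, dtype=int)
Q2 = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1]])   # prerequisites added
print(equivalent(I3, Q2, A))   # True: both classify the patterns in A identically
```

Note that the two matrices differ entrywise yet induce the same ideal response matrix, which is precisely the equivalence relation used in Theorem 1.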
Theorem 1.
Consider an HLAM under the AND-model assumption with a structural matrix $Q$ and an attribute hierarchy $\mathcal E$.
(i) The parameters $(\boldsymbol\theta^+, \boldsymbol\theta^-, \boldsymbol p)$ are jointly identifiable if the true $Q$ satisfies the following conditions.

Condition A. The $Q$ contains a $K \times K$ submatrix $Q^0$; and if we set $q_{j,k}$ to "0" in $Q^0$ for any row $j$ and any attribute $k$ that is a prerequisite of the attribute measured by row $j$, the resulting matrix equals $I_K$ up to column permutation.
(Assume without loss of generality that the first $K$ rows of $Q$ form $Q^0$, and denote the remaining $(J - K) \times K$ submatrix of $Q$ by $Q'$.)

Condition B. For any item $j > K$ and any attribute $k$ with $q_{j,k} = 0$, we set $q_{j,k}$ to "1" and obtain a modified matrix $\widetilde Q$. The $\Gamma(\widetilde Q, \mathcal A)$ contains distinct column vectors.

Condition C. For any item $j > K$ and any attribute $k$ with $q_{j,k} = 1$, we set $q_{j,k}$ to "0" and obtain a modified matrix $\overline Q$, with ideal response entries $\overline\Gamma_{j, \boldsymbol\alpha}$. The $\overline\Gamma$ satisfies $\overline\Gamma_{j, \cdot} \neq \Gamma_{j, \cdot}$ for all such $j$ and $k$.
To identify $(\boldsymbol\theta^+, \boldsymbol\theta^-, \boldsymbol p)$, Condition A is necessary; moreover, Conditions A, B, and C are necessary and sufficient when there exists no attribute hierarchy and $p_{\boldsymbol\alpha} > 0$ for all $\boldsymbol\alpha \in \{0,1\}^K$.
(ii) In addition to Conditions A–C, if $Q$ is constrained to contain an $I_K$ submatrix, then $(\boldsymbol\theta^+, \boldsymbol\theta^-, \boldsymbol p)$ are identifiable, and $Q$ can be identified up to the equivalence class under the true $\mathcal A$. On the other hand, it is indeed necessary for $Q$ to contain an $I_K$ to ensure that an arbitrary $\mathcal A$ is identifiable.
When estimating an HLAM with the goal of recovering the ideal response structure and the model parameters, Theorem 1(i) guarantees that Conditions A, B, and C suffice and are close to being necessary. When the goal is further to uniquely determine the attribute hierarchy from the identified ideal response structure, the additional condition that $Q$ contains an $I_K$ submatrix becomes necessary. This phenomenon can be better understood by relating it to the identification criteria for the factor loading matrix in factor analysis models [5, 6]: the loading matrix there is often required to include an identity submatrix or satisfy certain rank constraints, since otherwise it is identifiable only up to a matrix transformation. We would also like to point out that developing identifiability theory for HLAMs that can have arbitrarily complex hierarchies is more difficult than the case without hierarchy, and hence Theorem 1 is a significant technical advancement over previous works [19, 17].
We next present an example as an illustration of Theorem 1.
Example 3.
Consider the attribute hierarchy among $K = 3$ attributes as in Figure 2. The following structural matrix $Q$ satisfies Conditions A, B, and C in Theorem 1; in particular, the first 3 rows of $Q$ serve as $Q^0$ in Condition A. We call the two types of modifications of $Q$ described in Conditions B and C "Operation 1" and "Operation 2", respectively. In the following equation, the matrix entries modified by Operations 1 and 2 are highlighted, and the resulting $\widetilde Q$ and $\overline Q$ indeed satisfy the requirements in Conditions B and C. So the HLAM associated with $Q$ is identifiable.
4 A scalable algorithm for estimating HLAMs
This section presents an efficient two-step algorithm for structure learning of HLAMs. The EM algorithm is popular for estimating latent variable models; however, for HLAMs it needs to evaluate, in each E step, the posterior probabilities of all $2^K$ configurations of the $K$-dimensional attribute patterns, so it is computationally intractable for moderate to large $K$, with complexity exponential in $K$. To resolve this issue, [17] recently proposed an efficient two-step algorithm to estimate latent patterns under a fixed and known $Q$ and established its statistical consistency properties. Here we propose a two-step algorithm that improves upon the idea of [17] and is able to simultaneously learn the structural matrix and the latent patterns. Our new first step jointly estimates $Q$ and performs dimension reduction of the latent patterns in a scalable way, avoiding the exponential dependence on $K$. Then, based on the estimated $Q$ and candidate patterns, our second step imposes further regularization on the proportions of patterns to extract the set of truly existing patterns and the corresponding attribute hierarchy.
For a sample of size $N$, denote by $R = (R_{i,j})$ the $N \times J$ matrix containing the subjects' response vectors as rows, and denote by $A = (\boldsymbol\alpha_1^\top, \ldots, \boldsymbol\alpha_N^\top)^\top$ the $N \times K$ matrix containing the subjects' latent attribute patterns as rows. Our first step treats $A$ as random effect variables with noninformative marginal distributions and $(Q, \boldsymbol\theta^+, \boldsymbol\theta^-)$ as fixed effect parameters. The log-likelihood of the complete data $(R, A)$ is as follows under the AND-model:

$$\ell(Q, \boldsymbol\theta^+, \boldsymbol\theta^-;\, R, A) = \sum_{i=1}^N \sum_{j=1}^J \Big\{ \Gamma_{j, \boldsymbol\alpha_i} \big[ R_{i,j} \log\theta_j^+ + (1 - R_{i,j}) \log(1 - \theta_j^+) \big] + (1 - \Gamma_{j, \boldsymbol\alpha_i}) \big[ R_{i,j} \log\theta_j^- + (1 - R_{i,j}) \log(1 - \theta_j^-) \big] \Big\}. \quad (5)$$
We develop a stochastic EM algorithm for structure learning. In particular, in the E step we use Gibbs samples of $(A, Q)$ to stochastically approximate the target posterior expectation; in the M step we then update the estimates of the item parameters $(\boldsymbol\theta^+, \boldsymbol\theta^-)$. We call the algorithm EM with Alternating Direction Gibbs (ADG-EM), as each E step iteratively draws Gibbs samples of $A$ (along the direction of updating attribute patterns) and $Q$ (along the direction of updating item loadings). The details of ADG-EM are presented in Algorithm 1. In practice we draw multiple samples of $(A, Q)$ in each E step, with the first few as burn-in; we find that a small number of samples usually suffices for good performance. Algorithm 1 has the desirable property of performing dimension reduction to obtain a set of candidate patterns, as can be seen from its last step of including all the unique row vectors of the matrix $A$ in the candidate set. This is because the matrix $A$ has size $N \times K$, which means the number of selected candidate patterns can be at most $N$, no matter how large $2^K$ might be. Indeed, in the experimental setting with $K = 15$ in Section 5, there are $2^{15} = 32768$ possible patterns, while the proposed algorithm successfully reduces the candidate set to several hundred patterns (see Table 1), and then estimates the true latent structure with good accuracy and scalability.
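As a rough illustration of the alternating-direction idea, the sketch below performs one Gibbs sweep over the entries of $A$ and then of $Q$. It is a deliberate simplification of the paper's Algorithm 1: we use a single shared noise pair (theta^+, theta^-) instead of item-specific parameters, and we naively recompute the full complete-data log-likelihood (5) per entry flip (a real implementation would update only the affected terms).

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma(A, Q):
    # Ideal responses under the AND-model: entry (i, j) = 1(alpha_i >= q_j).
    return (A[:, None, :] >= Q[None, :, :]).all(axis=2)

def loglik(R, A, Q, tp, tn):
    # Complete-data log-likelihood (5), with a shared noise pair for brevity.
    p = np.where(gamma(A, Q), tp, tn)          # N x J success probabilities
    return np.sum(R * np.log(p) + (1 - R) * np.log(1 - p))

def gibbs_sweep(R, A, Q, tp, tn):
    """One alternating-direction Gibbs sweep: resample each entry of the
    pattern matrix A, then each entry of the loading matrix Q, from its
    full conditional given everything else."""
    N, K = A.shape
    J = Q.shape[0]
    for M, n_rows in ((A, N), (Q, J)):         # A-direction first, then Q-direction
        for r in range(n_rows):
            for k in range(K):
                ll = np.empty(2)
                for v in (0, 1):               # log-likelihood under each value
                    M[r, k] = v
                    ll[v] = loglik(R, A, Q, tp, tn)
                prob1 = 1.0 / (1.0 + np.exp(ll[0] - ll[1]))   # sigmoid of the gap
                M[r, k] = int(rng.random() < prob1)
    return A, Q

# Tiny demo on synthetic binary data (shapes only, not a full fit):
R = rng.integers(0, 2, size=(20, 5)).astype(float)
A = rng.integers(0, 2, size=(20, 3))
Q = rng.integers(0, 2, size=(5, 3))
A, Q = gibbs_sweep(R, A, Q, tp=0.8, tn=0.2)
print(A.shape, Q.shape)
```

The dimension-reduction property then follows for free: collecting the unique rows of the sampled $A$ yields at most $N$ candidate patterns regardless of $2^K$.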
After using Algorithm 1 to obtain the estimated structural matrix $\widehat Q$ and a set of candidate latent patterns $\widehat{\mathcal A}_{\mathrm{cand}}$, we further impose a penalty on the proportion parameters of these candidate patterns for sparse estimation. Motivated by [17], the second stage maximizes the following objective function:

$$\max_{\boldsymbol p}\ \ \ell(\boldsymbol p;\, R, \widehat Q, \widehat{\mathcal A}_{\mathrm{cand}}) + \lambda \sum_{\boldsymbol\alpha \in \widehat{\mathcal A}_{\mathrm{cand}}} \log\big( \max(p_{\boldsymbol\alpha}, \rho) \big), \quad (6)$$

where $\ell(\boldsymbol p;\, R, \widehat Q, \widehat{\mathcal A}_{\mathrm{cand}})$ is the marginal log-likelihood of the responses, $\lambda$ is a tuning parameter encouraging sparsity of $\boldsymbol p$, and the truncation at $\rho$ for some small $\rho > 0$ is used to avoid the singularity of the log function at zero. Note that the $\widehat Q$ estimated by Algorithm 1 implicitly appears in (6), because it determines the ideal responses of the patterns to the items and hence determines whether a $\mathbb P(R_j = 1 \mid \boldsymbol\alpha)$ equals $\theta_j^+$ or $\theta_j^-$. To maximize (6), we apply the Penalized EM (PEM) algorithm proposed in [17] to obtain the set of selected latent patterns $\widehat{\mathcal A}$. The PEM algorithm has per-E-step complexity linear in $|\widehat{\mathcal A}_{\mathrm{cand}}|$, thanks to the dimension reduction of the ADG-EM algorithm in the first stage. We also use the Extended Bayesian Information Criterion (EBIC) [9] to select the tuning parameter $\lambda$ and obtain the best set of attribute patterns $\widehat{\mathcal A}$. Finally, the attribute hierarchy can be determined by examining the partial order between the columns of the binary matrix containing the selected patterns in $\widehat{\mathcal A}$ as rows. Denote the columns of this matrix by $\bar{\boldsymbol a}_k$, $k \in [K]$. Specifically, if $\bar{\boldsymbol a}_l \preceq \bar{\boldsymbol a}_k$, then the prerequisite relation $k \to l$ should be included in $\widehat{\mathcal E}$. Combined with the proposed Algorithm 1, the final output is $(\widehat Q, \widehat{\mathcal A}, \widehat{\mathcal E})$, including the structural matrix, the latent patterns, and the attribute hierarchy.
We remark that it is straightforward to handle missing data in an HLAM and still perform structure learning. Indeed, it suffices to replace the objective functions (5) and (6), which sum over all indices $(i, j)$, by the corresponding sums over $(i, j) \in \Omega$, where $\Omega$ is the set of indices corresponding to the observed entries of $R$. Supplement B contains more details on computation.
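Handling missingness thus amounts to masking the sum in (5); a small sketch under our own naming, with a shared noise pair for brevity:

```python
import numpy as np

def loglik_observed(R, mask, A, Q, tp, tn):
    """Log-likelihood restricted to observed entries: the sum in (5) runs
    only over indices (i, j) with mask[i, j] == True.  Missing entries of
    R may hold any placeholder value; the mask excludes them."""
    gam = (A[:, None, :] >= Q[None, :, :]).all(axis=2)   # AND-model ideal responses
    p = np.where(gam, tp, tn)
    terms = R * np.log(p) + (1 - R) * np.log(1 - p)
    return terms[mask].sum()

# Two subjects, two single-attribute items, one missing response at (0, 1):
R = np.array([[1., 0.], [0., 1.]])
mask = np.array([[True, False], [True, True]])
A = np.array([[1, 0], [0, 1]])
Q = np.array([[1, 0], [0, 1]])
val = loglik_observed(R, mask, A, Q, tp=0.8, tn=0.2)
print(val)
```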
5 Experiments with synthetic and real data
Synthetic data.
We perform simulations in two different settings, setting (I) with a relatively small number of items and setting (II) with a relatively large number of items. Two different numbers of attributes, $K = 10$ and $K = 15$, are considered. We next specify the structural matrices used to generate the synthetic data: they are built from a base block that contains an identity submatrix together with two copies of a banded loading matrix, and under each $K$ the full structural matrix vertically stacks an appropriate number of such blocks. The algorithm is implemented in Matlab. For all scenarios, 200 independent runs are carried out. The second-stage PEM algorithm is always run over a range of values of $\lambda$, from which EBIC selects the best. Figure 3 presents two particular hierarchies among the attributes, the diamond and the tree, together with the hierarchy estimation results. More extensive simulation results are presented in Table 1. In the table, the column "acc" records the average accuracy of estimating the structural matrix up to the equivalence class under $\mathcal A(\mathcal E)$, as illustrated in Example 2; "TPR" denotes the True Positive Rate, the average proportion of true patterns that are selected in $\widehat{\mathcal A}$; and "FDR" denotes the False Discovery Rate, the average proportion of selected patterns in $\widehat{\mathcal A}$ that are not truly existing patterns, so the table reports 1−FDR. The statistical variations of the results presented in Table 1 are included in Supplement B.
Results in Table 1 not only demonstrate the algorithm's excellent performance but also provide interesting insight into the differences between the two settings, (I) with fewer items and (II) with more items. In setting (I), the first-stage ADG-EM algorithm tends to produce a relatively large number of candidate patterns (though well below the sample size $N$, even for $K = 15$), and the second-stage PEM algorithm significantly reduces the number of patterns, usually yielding a selected set close to the truth. In contrast, in setting (II), Algorithm 1 itself usually succeeds in reducing the number of candidate patterns, giving a candidate set close to the true pattern set, and the PEM algorithm does not seem to improve the selection results much in such scenarios. One explanation for this phenomenon is that in the small-item case there are not enough items "measuring" the subjects' latent attributes, so the ADG-EM algorithm is not very sure about which false attribute patterns to exclude (notably, ADG-EM does not tend to exclude truly existing patterns), and the further regularization of patterns in the PEM algorithm is necessary and helpful; in the large-item case, the many items provide enough information about the subjects, and hence ADG-EM can be more confident about discarding nonexisting patterns. Inspired by this observation, we also apply the ADG-EM algorithm to the task of factorization and reconstruction of large noisy binary matrices; Supplement C contains some interesting simulation results.
| noise | K | acc (I) | TPR (I) | 1−FDR (I) | size (I) | acc (II) | TPR (II) | 1−FDR (II) | size (II) |
|---|---|---|---|---|---|---|---|---|---|
| lower | 10 | 1.00 | 1.00 | 0.96 | 10 (113) | 1.00 | 1.00 | 1.00 | 10 (10) |
| | 10 | 1.00 | 1.00 | 0.96 | 10 (166) | 1.00 | 1.00 | 0.68 | 15 (16) |
| | 15 | 1.00 | 1.00 | 0.95 | 15 (120) | 1.00 | 1.00 | 1.00 | 15 (15) |
| | 15 | 1.00 | 0.99 | 0.94 | 16 (179) | 1.00 | 1.00 | 0.80 | 19 (20) |
| higher | 10 | 0.98 | 0.91 | 0.90 | 10 (272) | 0.99 | 0.99 | 0.97 | 10 (11) |
| | 10 | 0.99 | 1.00 | 0.88 | 10 (851) | 0.97 | 0.94 | 0.62 | 15 (28) |
| | 15 | 0.99 | 0.96 | 0.95 | 15 (309) | 1.00 | 1.00 | 0.99 | 15 (16) |
| | 15 | 0.99 | 0.99 | 0.89 | 15 (894) | 0.99 | 0.98 | 0.71 | 21 (41) |

The "(I)" and "(II)" column groups correspond to the two simulation settings; within each $K$, the two rows correspond to the two hierarchies of Figure 3, and the "size" columns report the size of the selected pattern set with the size of the candidate set in parentheses.
Real data.
We use the proposed method to analyze real data from a large-scale educational assessment, the Trends in International Mathematics and Science Study (TIMSS). This dataset is part of the TIMSS 2011 Austrian data, which was also used in [14] to analyze students' abilities in mathematical sub-competences and can be found in the R package CDM. It includes the responses of Austrian fourth grade students to a set of mathematics items. The number of attributes is prespecified in [14], together with a tentative $Q$-matrix. One structure specific to such large-scale assessments is that only a subset of all the items in the entire study is presented to each student [14]. This results in many missing values in the data matrix, and the considered dataset has a high missing rate.
After first running the ADG-EM algorithm with missing data, the number of candidate patterns obtained is far smaller than the $2^K$ possible patterns. Figure 4(a) presents the results of the second-stage PEM algorithm. The smallest EBIC value selects a sparse solution, with the estimated latent patterns in $\widehat{\mathcal A}$ presented in Figure 4(b). The hierarchy corresponding to $\widehat{\mathcal A}$ in Figure 4(c) reveals that the attribute "Geometry Reasoning" has the largest number of prerequisites, and in general, attributes related to either "reasoning" or "geometry" appear to be higher-level skills in the hierarchy.
6 Conclusion
We have proposed transparent conditions on the structural matrix for identifying an HLAM and developed a scalable algorithm for estimating an HLAM. The algorithm shows strong empirical performance on both small- and large-scale structure learning tasks. This work focuses on basic types of HLAMs that have two item-specific parameters per item; it would be interesting to generalize the theory and algorithm to other latent attribute models, such as those considered in [17]. More broadly, this work makes an attempt to bridge the two fields of psychometrics and machine learning. In psychometrics, various latent attribute models have recently been proposed that carry good scientific interpretability in the underlying latent structure; in machine learning, relevant latent variable models, including the RBM and its extensions, are popular and enjoy computational efficiency. This work sheds light on further research that combines strengths from both fields to analyze large and complex datasets from educational and psychological assessments.
References
 [1] E. S. Allman, C. Matias, and J. A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37:3099–3132, 2009.
 [2] A. Anandkumar, K. Chaudhuri, D. Hsu, S. M. Kakade, L. Song, and T. Zhang. Spectral methods for learning multivariate latent tree structure. In Advances in Neural Information Processing Systems, pages 2025–2033, 2011.
 [3] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.
 [4] A. Anandkumar, D. Hsu, M. Janzamin, and S. Kakade. When are overcomplete topic models identifiable? Uniqueness of tensor tucker decompositions with structured sparsity. Journal of Machine Learning Research, 16:2643–2694, 2015.
 [5] T. W. Anderson. An introduction to multivariate statistical analysis. John Wiley & Sons, New York, 2009.
 [6] J. Bai and K. Li. Statistical analysis of factor models of high dimension. The Annals of Statistics, 40(1):436–465, 2012.
 [7] J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West. Bayesian nonparametric latent feature models. Bayesian Statistics, 8:1–25, 2007.
 [8] A. Bhaskara, M. Charikar, and A. Vijayaraghavan. Uniqueness of tensor decompositions with applications to polynomial identifiability. In Conference on Learning Theory, pages 742–778, 2014.
 [9] J. Chen and Z. Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771, 2008.
 [10] Y. Chen, J. Liu, G. Xu, and Z. Ying. Statistical analysis of Q-matrix based diagnostic classification models. Journal of the American Statistical Association, 110(510):850–866, 2015.
 [11] M. J. Choi, V. Y. Tan, A. Anandkumar, and A. S. Willsky. Learning latent tree graphical models. Journal of Machine Learning Research, 12(May):1771–1812, 2011.
 [12] J. de la Torre. The generalized DINA model framework. Psychometrika, 76:179–199, 2011.
 [13] J. de la Torre, L. A. van der Ark, and G. Rossi. Analysis of clinical data from a cognitive diagnosis modeling framework. Measurement and Evaluation in Counseling and Development, 51(4):281–296, 2018.
 [14] A. C. George and A. Robitzsch. Cognitive diagnosis models in R: A didactic. The Quantitative Methods for Psychology, 11(3):189–205, 2015.
 [15] Z. Ghahramani and T. L. Griffiths. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pages 475–482, 2006.
 [16] Y. Gu and G. Xu. Partial identifiability of restricted latent class models. Annals of Statistics, accepted, arXiv:1803.04353, 2018.
 [17] Y. Gu and G. Xu. Learning attribute patterns in highdimensional structured latent attribute models. Journal of Machine Learning Research, accepted, arXiv:1904.04378, 2019.
 [18] Y. Gu and G. Xu. The sufficient and necessary condition for the identifiability and estimability of the DINA model. Psychometrika, 84(2):468–483, 2019.
 [19] Y. Gu and G. Xu. Sufficient and necessary conditions for the identifiability of the Q-matrix. Statistica Sinica, accepted, doi:10.5705/ss.202018.0410, 2019.
 [20] R. A. Henson, J. L. Templin, and J. T. Willse. Defining a family of cognitive diagnosis models using loglinear models with latent variables. Psychometrika, 74:191–210, 2009.

 [21] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
 [22] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 [23] G. E. Hinton and R. R. Salakhutdinov. Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems, pages 1607–1614, 2009.
 [24] D. Hsu, S. M. Kakade, and P. S. Liang. Identifiability and unmixing of latent parse trees. In Advances in Neural Information Processing Systems, pages 1511–1519, 2012.
 [25] B. W. Junker and K. Sijtsma. Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25(3):258–272, 2001.
 [26] H. Larochelle and Y. Bengio. Classification using discriminative restricted Boltzmann machines. In Proceedings of the 25th International Conference on Machine Learning, pages 536–543. ACM, 2008.
 [27] H. Larochelle, Y. Bengio, and J. Turian. Tractable multivariate binary density estimation and the restricted Boltzmann forest. Neural computation, 22(9):2285–2307, 2010.
 [28] J. P. Leighton, M. J. Gierl, and S. M. Hunka. The attribute hierarchy method for cognitive assessment: A variation on Tatsuoka’s rulespace approach. Journal of Educational Measurement, 41(3):205–237, 2004.
 [29] K. Miller, M. I. Jordan, and T. L. Griffiths. Nonparametric latent feature models for link prediction. In Advances in Neural Information Processing Systems, pages 1276–1284, 2009.

 [30] R. Mourad, C. Sinoquet, N. L. Zhang, T. Liu, and P. Leray. A survey on latent tree models and applications. Journal of Artificial Intelligence Research, 47:157–203, 2013.
 [31] S. Ravanbakhsh, B. Póczos, and R. Greiner. Boolean matrix factorization and noisy completion via message passing. In Proceedings of the 33rd International Conference on Machine Learning, pages 945–954, 2016.
 [32] T. Rukat, C. C. Holmes, M. K. Titsias, and C. Yau. Bayesian Boolean matrix factorisation. In Proceedings of the 34th International Conference on Machine Learning, pages 2969–2978. JMLR.org, 2017.
 [33] A. A. Rupp, J. Templin, and R. A. Henson. Diagnostic measurement: Theory, methods, and applications. Guilford Press, 2010.
 [34] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine learning, pages 791–798. ACM, 2007.
 [35] J. Templin and L. Bradshaw. Hierarchical diagnostic classification models: A family of models for estimating and testing attribute hierarchies. Psychometrika, 79(2):317–339, 2014.
 [36] J. L. Templin and R. A. Henson. Measurement of psychological disorders using cognitive diagnosis models. Psychological methods, 11(3):287, 2006.
 [37] M. von Davier. A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61:287–307, 2008.
 [38] Z. Wu, L. Casciola-Rosen, A. Rosen, and S. L. Zeger. A Bayesian approach to restricted latent class models for scientifically-structured clustering of multivariate binary outcomes. arXiv preprint arXiv:1808.08326, 2018.
 [39] Z. Wu, M. DeloriaKnoll, and S. L. Zeger. Nested partially latent class models for dependent binary data; estimating disease etiology. Biostatistics, 18(2):200–213, 2017.
 [40] G. Xu. Identifiability of restricted latent class models with binary responses. The Annals of Statistics, 45:675–707, 2017.
 [41] G. Xu and Z. Shang. Identifying latent structures in restricted latent class models. Journal of the American Statistical Association, 113(523):1284–1295, 2018.
 [42] G. Xu and S. Zhang. Identifiability of diagnostic classification models. Psychometrika, 81:625–649, 2016.
 [43] I. E. Yen, W.-C. Lee, S.-E. Chang, A. S. Suggala, S.-D. Lin, and P. Ravikumar. Latent feature lasso. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3949–3957. JMLR.org, 2017.
Supplementary Material
This supplementary material is organized as follows. Supplement A presents the proof of the main theorem, Theorem 1. Supplement B provides further details on computation: the EBIC criterion in part B.1, algorithms handling missing data in part B.2, and the experiments in Section 5 of the main text in part B.3. Supplement C includes simulation results on large noisy binary matrix factorization/reconstruction and structural matrix estimation. The MATLAB code implementing the algorithms and reproducing the experimental results is included in a separate zip archive.
Supplement A: Proof of the main theorem
We introduce some notation and technical preparations before presenting the proof. Denote an arbitrary response vector by $r = (r_1, \ldots, r_J)^\top \in \{0, 1\}^J$. For an HLAM with the AND-model assumption in (1), the probability mass function of the binary response vector $R$ is
(7) $\quad P(R = r) = \sum_{\alpha \in \mathcal{A}} p_{\alpha} \prod_{j=1}^{J} \theta_{j,\alpha}^{\, r_j} (1 - \theta_{j,\alpha})^{1 - r_j}.$
The item parameter $\theta_{j,\alpha} = P(R_j = 1 \mid \alpha)$ depends on the attribute pattern $\alpha$ only through the ideal response $\Gamma_{j,\alpha} = \prod_{k:\, q_{j,k} = 1} \alpha_k$, so the structural matrix $Q$ implicitly appears in the above expression through defining the ideal responses $\Gamma_{j,\alpha}$'s.
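As a concrete illustration of the mixture pmf in (7), the following sketch evaluates it for a toy AND-model. The structural matrix `Q`, the item parameters `theta_pos`/`theta_neg`, and the uniform proportions are illustrative assumptions, not values from the paper.

```python
import itertools
import numpy as np

# Toy AND-model: J = 3 items, K = 2 attributes (illustrative values).
Q = np.array([[1, 0],
              [0, 1],
              [1, 1]])                      # structural matrix
theta_pos = np.array([0.8, 0.9, 0.85])     # P(R_j = 1 | Gamma_{j,alpha} = 1)
theta_neg = np.array([0.2, 0.1, 0.15])     # P(R_j = 1 | Gamma_{j,alpha} = 0)
J, K = Q.shape

attr_patterns = [np.array(a) for a in itertools.product([0, 1], repeat=K)]
p = np.full(len(attr_patterns), 1.0 / len(attr_patterns))  # proportions p_alpha

def item_probs(alpha):
    """theta_{j,alpha}: the ideal response Gamma_{j,alpha} = prod_{k: q_jk=1} alpha_k
    decides which of the two item parameters applies."""
    gamma = np.all(alpha >= Q, axis=1)     # 1 iff alpha possesses every required attribute
    return np.where(gamma, theta_pos, theta_neg)

def pmf(r):
    """P(R = r) = sum_alpha p_alpha prod_j theta^{r_j} (1 - theta)^{1 - r_j}, as in (7)."""
    r = np.asarray(r)
    total = 0.0
    for pa, alpha in zip(p, attr_patterns):
        th = item_probs(alpha)
        total += pa * np.prod(np.where(r == 1, th, 1 - th))
    return total

# Sanity check: the pmf sums to one over all 2^J response patterns.
total = sum(pmf(r) for r in itertools.product([0, 1], repeat=J))
```

Note that only the ideal responses $\Gamma_{j,\alpha}$, and hence `Q`, distinguish the attribute patterns inside the mixture, matching the remark above.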
We next define a useful technical quantity, a marginal probability matrix $T(\theta)$, as follows. The rows of $T(\theta)$ are indexed by all the possible response patterns $r \in \{0, 1\}^J$ and the columns by all the possible attribute patterns $\alpha \in \{0, 1\}^K$. The $(r, \alpha)$th entry of the matrix is the marginal probability that a subject with attribute pattern $\alpha$ provides positive responses to the set of items $\{j : r_j = 1\}$. Namely, denoting a random response vector by $R$ and viewing $r$ as a fixed vector, there is
(8) $\quad T_{r, \alpha}(\theta) = P(R \succeq r \mid \alpha) = \prod_{j:\, r_j = 1} \theta_{j,\alpha}.$
Denote the row vector of the matrix $T(\theta)$ corresponding to response pattern $r$ by $T_{r, \cdot}(\theta)$, and denote the column vector corresponding to attribute pattern $\alpha$ by $T_{\cdot, \alpha}(\theta)$. From the above definition (8), it is not hard to see that two sets of parameters $(\theta, p)$ and $(\bar{\theta}, \bar{p})$ yield the same distribution of $R$ if and only if $T(\theta)\, p = T(\bar{\theta})\, \bar{p}$. This implies that we can focus on the structure of the matrix $T$, and prove identifiability by showing that $T(\theta)\, p = T(\bar{\theta})\, \bar{p}$ gives $(\theta, p) = (\bar{\theta}, \bar{p})$ under certain conditions.
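The marginal probability matrix of (8) can be assembled numerically and checked against brute-force summation of the pmf: the vector $T(\theta)\,p$ stacks the marginals $P(R \succeq r)$. The toy model below (a hypothetical `Q`, item parameters, and uniform proportions) is assumed purely for illustration.

```python
import itertools
import numpy as np

# Toy AND-model, same illustrative setup as before.
Q = np.array([[1, 0], [0, 1], [1, 1]])
theta_pos = np.array([0.8, 0.9, 0.85])
theta_neg = np.array([0.2, 0.1, 0.15])
J, K = Q.shape
resp_patterns = list(itertools.product([0, 1], repeat=J))
attr_patterns = [np.array(a) for a in itertools.product([0, 1], repeat=K)]
p = np.full(len(attr_patterns), 1.0 / len(attr_patterns))

def item_probs(alpha):
    gamma = np.all(alpha >= Q, axis=1)          # AND-model ideal responses
    return np.where(gamma, theta_pos, theta_neg)

# T_{r,alpha} = P(R >= r | alpha) = prod_{j: r_j = 1} theta_{j,alpha}
T = np.array([[np.prod(item_probs(a)[np.array(r) == 1]) for a in attr_patterns]
              for r in resp_patterns])

def pmf(r):
    r = np.asarray(r)
    return sum(pa * np.prod(np.where(r == 1, item_probs(a), 1 - item_probs(a)))
               for pa, a in zip(p, attr_patterns))

# T(theta) p should equal the brute-force marginals sum_{r' >= r} P(R = r').
marginals = T @ p
brute = np.array([sum(pmf(rp) for rp in resp_patterns
                      if all(rp[j] >= r[j] for j in range(J)))
                  for r in resp_patterns])
```

The row for the all-zero response pattern is identically one, since $P(R \succeq \mathbf{0} \mid \alpha) = 1$ for every attribute pattern.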
The matrix $T$ has another nice algebraic property, established in [40], that will be frequently used in the later proof. We restate it here. The $T(\cdot)$ can be viewed as a map applied to the collection of item parameters $\theta = (\theta_{j,\alpha})$. For an arbitrary $J$-dimensional vector $\theta^{*} = (\theta_1^{*}, \ldots, \theta_J^{*})^\top$, there exists a $2^J \times 2^J$ invertible matrix $D(\theta^{*})$ that only depends on $\theta^{*}$ such that
(9) $\quad T(\theta - \theta^{*}) = D(\theta^{*})\, T(\theta),$
where the $(r, \alpha)$th entry of $T(\theta - \theta^{*})$ is $\prod_{j:\, r_j = 1} (\theta_{j,\alpha} - \theta_j^{*})$.
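This property can be checked numerically. Expanding the products $\prod_{j: r_j = 1}(\theta_{j,\alpha} - \theta_j^{*})$ suggests the candidate $D(\theta^{*})_{r, r'} = \mathbf{1}\{r' \preceq r\} \prod_{j:\, r_j = 1,\, r'_j = 0} (-\theta_j^{*})$, which is lower-triangular with unit diagonal under the subset ordering of response patterns, hence invertible. The sketch below verifies the identity for random parameter values; all numeric choices are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
J, n_alpha = 3, 4
resp_patterns = list(itertools.product([0, 1], repeat=J))
Theta = rng.uniform(0.1, 0.9, size=(J, n_alpha))    # theta_{j,alpha}, illustrative
theta_star = rng.uniform(0.0, 0.5, size=J)          # arbitrary J-dim vector theta*

def t_mat(Th):
    """T(theta): entry (r, alpha) is prod_{j: r_j = 1} Th[j, alpha]."""
    T = np.ones((len(resp_patterns), Th.shape[1]))
    for i, r in enumerate(resp_patterns):
        for j in range(J):
            if r[j] == 1:
                T[i] *= Th[j]
    return T

# Candidate D(theta*): D_{r,r'} = 1{r' <= r} * prod_{j: r_j=1, r'_j=0} (-theta*_j)
D = np.zeros((len(resp_patterns), len(resp_patterns)))
for i, r in enumerate(resp_patterns):
    for k, rp in enumerate(resp_patterns):
        if all(rp[j] <= r[j] for j in range(J)):    # only r' <= r contribute
            D[i, k] = np.prod([-theta_star[j] for j in range(J)
                               if r[j] == 1 and rp[j] == 0])
```

Each row of $D(\theta^{*})\,T(\theta)$ then reproduces the binomial-type expansion of the shifted products, which is exactly (9).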
There is one basic fact about any attribute hierarchy $\mathcal{E}$ and the resulting set of allowable attribute patterns $\mathcal{A}$: the all-zero and all-one attribute patterns $\mathbf{0}$ and $\mathbf{1}$ always belong to the $\mathcal{A}$ induced by an arbitrary $\mathcal{E}$. This is because any prerequisite relation among attributes would not rule out the existence of the pattern possessing no attributes or the pattern possessing all attributes.
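A quick enumeration illustrates this fact for a hypothetical chain hierarchy on three attributes (an assumed example, not taken from the paper): an edge (a, b) means attribute a is a prerequisite for attribute b, so a pattern is allowable exactly when every possessed attribute's prerequisites are also possessed.

```python
import itertools

# Hypothetical prerequisite chain 0 -> 1 -> 2 on K = 3 attributes.
K = 3
edges = [(0, 1), (1, 2)]

def respects_hierarchy(alpha):
    # Possessing attribute b requires possessing its prerequisite a.
    return all(alpha[a] >= alpha[b] for a, b in edges)

# The induced set of allowable attribute patterns A.
A = [alpha for alpha in itertools.product([0, 1], repeat=K)
     if respects_hierarchy(alpha)]
```

For the chain, A consists of (0,0,0), (1,0,0), (1,1,0), and (1,1,1); in particular, the all-zero and all-one patterns survive any choice of prerequisite edges.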
Proof of Theorem 1.
The proofs of part (i) and part (ii) are presented as follows.
Proof of part (i).
We first show the sufficiency of the conditions stated in Theorem 1 for identifiability of the model parameters. Since the first of these conditions is satisfied, from now on we assume without loss of generality that
(10) 
We next show that if for any ,