Item response theory models commonly posit the “local” independence assumption that patterns of response vectors to items administered in achievement tests are due solely to probabilistic functions of some latent ability vector. This assumption is quite strong in that model modifications must be made to accommodate group clusters, such as students clustered within classrooms. It also necessitates careful evaluation of whether item parameters are invariant across student attributes such as age, gender, race, and primary language. A method for analyzing complex item response data utilizing an item response network was recently proposed by Jin & Jeon (2019). The approach, termed the network item response model (NIRM), is an exploratory method that creates a multiplex of item and person networks and estimates positions of both items and persons in a single latent space. The model was developed such that it relaxes the common local independence assumption in traditional item response models. Thus, clustering of item responses due to extraneous factors does not have negative implications for estimation, but rather is captured and represented by relative distances in an item latent space.
1.1 Description of Network Item Response Models
Latent space modeling approaches (Hoff et al., 2002; Handcock et al., 2007; Raftery et al., 2012) have been used successfully in the analysis of social networks to represent the networks as a set of latent positions in Euclidean space. Recently, Gollini & Murphy (2016) extended the latent space modeling literature by using multiple network views, assumed to arise from the same latent space, to estimate that latent space. The network item response model adopts this approach to represent persons and items in the same latent space. There are two specifications that enable this approach. First, the network item response model conceptualizes pairwise person and item data as a set of networks. Typical item response datasets have the form $X = \{x_{ki}\}$, an $N \times P$ matrix where $x_{ki}$ represents the binary response of the $k$th person to the $i$th item. The network item response model utilizes a different unit of analysis than the typical rows and columns representation of the data. Instead, fitting the model requires constructing two sets of binary adjacency matrices. The two sets contain $N$ and $P$ binary adjacency matrices, hereafter referred to as networks. There is one network for each person and one network for each item. The elements of these networks represent the interactions of persons within items and items within persons. There are multiple choices for the construction of these networks, which are discussed in a later section.
The full data likelihood for the network item response model is the joint probability of observing both sets of networks, $Y$ and $U$, which is a function of the latent space $Z \in \mathbb{R}^{N \times D}$ (where $D$ is the number of dimensions), intercept terms $\beta_i$ for each item $i$, and intercept terms, $\theta_k$, for each person $k$.
The second specification is that the network item response model enables a joint modeling of latent item and person spaces. Accomplishing this requires a mapping from one space to the other. Jin & Jeon (2019) used a mapping that assumed the item space is a function of the person space. Specifically, the item latent positions were equal to the average person positions of those who endorsed the item (e.g. answered it correctly for educational tests, or responded in the affirmative for psychological tests). Other mappings are possible, and these are also discussed in a later section.
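The full data likelihood described earlier in this section can be sketched in code. The sketch below assumes the logistic latent space form of Hoff et al. (2002), in which the log-odds of a tie equal an intercept minus the Euclidean distance between two positions. The function names (`network_loglik`, `nirm_loglik`) and argument names (`Y`, `U`, `Z`, `W`, `beta`, `theta`) are illustrative assumptions, not notation taken from the original articles.

```python
import numpy as np

def network_loglik(net, positions, intercept):
    """Bernoulli log-likelihood of one binary network, counting each pair once."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))       # pairwise Euclidean distances
    eta = intercept - dist                         # assumed log-odds of a tie
    iu = np.triu_indices(net.shape[0], k=1)        # upper triangle, no diagonal
    y, e = net[iu], eta[iu]
    return float(np.sum(y * e - np.logaddexp(0.0, e)))

def nirm_loglik(Y, U, Z, W, beta, theta):
    """Joint log-likelihood of item networks Y (P, N, N) and person networks U (N, P, P).

    Z holds person positions (N, D); W holds item positions (P, D).
    """
    item_part = sum(network_loglik(Y[i], Z, beta[i]) for i in range(len(Y)))
    person_part = sum(network_loglik(U[k], W, theta[k]) for k in range(len(U)))
    return item_part + person_part
```

The number of pairwise terms is what makes evaluation expensive: there are $P \cdot N(N-1)/2$ terms from the item networks and $N \cdot P(P-1)/2$ terms from the person networks.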
A critical point in differentiating NIRM from other models is the unit of analysis, which is a pairwise summary of item and person information rather than the individual person-level data itself. In general, it is not possible to completely recover the original data matrix from its network representation, and therefore NIRM is not directly equivalent to many other model-based procedures for investigating psychological and educational data, such as Item Response Theory (IRT) or network psychometric models such as the Ising Model for binary data or the Gaussian Graphical Model (GGM) for continuous data.
To date there are only two articles regarding the NIRM approach to blending networks and item response models. The first was the introduction of the model (Jin & Jeon, 2019) and the second extended the model to allow for hierarchical dependencies (Jin et al., 2018). The depth of technical details in those articles prevented a thorough case study in the use of this model. This article has four aims. First, this article will provide a detailed description of all analysis decisions that need to be made prior to using NIRM, including some new decision points that were not considered in Jin & Jeon (2019). Second, we will provide solutions for incorporating new data into an existing NIRM. Adding new individuals, new items, or both is common in educational assessment; in IRT, for example, these processes are termed ability estimation and linking. However, the analogues to ability estimation and linking have not previously been discussed for NIRMs, and this article will enable these activities for a previously estimated NIRM. Third, we will present an analysis of a dataset using NIRM. The case study shows the utility of the NIRM by demonstrating its use on an example dataset containing item-level responses to an assessment for high school statistics students, which was administered to classrooms across several different schools. This analysis will include providing interpretation from known properties of the items (e.g. content area) as well as illuminating potential issues with the item responses. We present the description of analysis decisions and the real data analysis separately because the real data analysis represents one combination of analysis decisions, and a discussion of each decision intertwined with the analysis would confound the aims of both. Finally, this article will describe the distinctions between NIRM and other commonly used models for item responses in education and psychology. Specifically, this article will compare and contrast the NIRM, the Ising model, and IRT models.
2 Practical Recommendations when Using NIRM
Prior to using NIRM, there are several determinations that need to be made. The first of these is the selection of the dimensionality of the latent space. Then, one must decide how to construct the item networks, which is effectively a choice on defining interesting margins in a contingency table for every pair of persons and items. The final step is selecting a linkage between the item latent space and the person latent space. The remainder of this section will discuss methods for estimation of NIRM models, optional and necessary steps for post-processing the results, and a guide for interpreting and visualizing the model.
2.1 Selection of the Dimension of the Latent Space
This step is conceptually similar to selecting the number of factors in factor analysis or IRT. As with the number-of-factors problem, there is no fully satisfactory solution because the true data generation process is likely much more complex than a simple function of a low-dimensional latent space. Moreover, commonly applied rules for selecting the number of latent dimensions in factor analysis, such as the scree test (Cattell, 1966), parallel analysis (Horn, 1965), and the “eigenvalue greater than one” rule (Kaiser, 1960), do not apply to NIRM. Likewise, NIRM does not presently have analogues to common measures of fit in factor analysis or structural equation modeling such as TLI (Tucker & Lewis, 1973), standardized root-mean-square residual (SRMR), or RMSEA (Steiger & Lind, 1980; Browne & Cudeck, 1993). That said, there are two considerations for selecting the number of dimensions in the latent Euclidean space. The first criterion is ease of interpretation of the resultant latent space. One purpose of NIRM is to project summary information about pairwise responses of persons and items in order to obtain information about item properties. To this end, selecting the minimum number of dimensions is a worthy goal. Therefore it is often reasonable to select 2 or 3 dimensions to ease the process of visualizing the latent spaces. The second criterion is the risk of “underfactoring”. NIRM is conceptually similar to multidimensional scaling in that it attempts to identify an optimal representation of distances between persons, rather than quantify magnitudes of a latent trait as in IRT. If the selected number of dimensions is too low for this purpose, the visualized distances may not be an accurate representation of the true distances. The implications of this have been acknowledged in the literature many times over, but it bears repeating here. While approaches from the factor analytic literature may not provide useful criteria, we acknowledge that, due to conceptual similarities, approaches from parameterized versions of multidimensional scaling give insight into selecting the number of dimensions for NIRM.
Specifically, we note it is possible to place a diffuse prior over the integer number of dimensions up to a pre-specified maximum and select the choice that optimizes a corresponding information criterion (Oh & Raftery, 2001). However, the application of this is not straightforward. Relatedly, Oh & Raftery (2001) note several issues with Bayesian MDS that are directly applicable to NIRM. Namely, for identical Euclidean distances, the coordinates of the latent positions must get closer to the origin as the number of dimensions increases, unless all the extra coordinates are equal to 0. This is a direct consequence of the curse of dimensionality (Zimek et al., 2012). Overall, this is a complex issue, but starting with a low number of dimensions such as 2 or 3 can serve as a good rule of thumb.
2.2 Construction of Item Networks
As previously mentioned, the unit of analysis for NIRM is different from IRT or Ising Models. From the original response dataset, we construct two sets of binary networks for pairwise representations of persons and items. That is, for each item $i$, we have an $N \times N$ binary network, $Y_i$. Likewise, for each person $k$, we have a $P \times P$ network, $U_k$. Each element of networks $Y_i$ or $U_k$ is constructed as a function of the appropriate elements of the original data matrix, $X$. For example, in network $Y_i$, the element in the $k$th row and the $l$th column (i.e. element $y_{i,kl}$) is a function of the elements from the $k$th and $l$th rows of the $i$th column in data matrix $X$ (i.e. $x_{ki}$ and $x_{li}$).
The original article on NIRM discussed only a single method for constructing the networks from a matrix of item responses. It conceptualized each element of the matrix as an encoding of whether or not both elements of comparison took a value of 1. That is, each element of one of the networks was the product of two elements of the original data matrix, represented in Equation 2.
$$y_{i,kl} = x_{ki}\, x_{li}, \qquad u_{k,ij} = x_{ki}\, x_{kj}. \tag{2}$$
That is, because of the binary nature of the data, this product takes a value of 1 when both elements are 1, and 0 otherwise. While described as an interaction between the two variables, taking the product is simply one method of classifying pairwise responses into a set of operational taxonomic units (Sokal & Sneath, 1961).
Strictly speaking, any substantively interesting enumeration of the elements of Table 1 can be used as an element of analysis. In the original article, the operational unit was taken to be positively matched concordant pairs. This led to the above specification. The psychological interpretation of this enumeration might be that an affirmative response indicates presence of the trait, but the absence of an affirmative response doesn’t necessarily indicate absence of the trait. This enumeration is in fact quite common in psychological classifications, as diagnostic criteria may declare a threshold on the presence of a fixed number of criteria to imply a diagnosis, but the absence of any of the non-endorsed criteria is not counter-evidence to a diagnosis. Another example of this is the literature on positive and negative affect, in which the absence of one does not imply the presence of the other. This selection of how to enumerate the data is somewhat analogous to the selection of $\{0, 1\}$ encoding or $\{-1, 1\}$ encoding in Ising models (Haslbeck et al., 2018), with the distinction being that the selection within the Ising model results in statistically equivalent models but in NIRM it does not.
In other fields, such as educational measurement, the presence or absence of a correct response is taken as evidence of a high or low magnitude of a trait, respectively. Here, we code similarity of responses as an indicator function that takes value 1 if respondents have the same endorsement and 0 otherwise. For binary data, this may be mathematically represented as below.
$$y_{i,kl} = \mathbb{1}(x_{ki} = x_{li}), \qquad u_{k,ij} = \mathbb{1}(x_{ki} = x_{kj}). \tag{3}$$
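Both network encodings can be built directly from the response matrix. The following is a minimal sketch; the function name `build_networks`, the `encoding` labels, and the choice to zero out the diagonal (self-pairs) are assumptions made for illustration.

```python
import numpy as np

def build_networks(X, encoding="product"):
    """Build item networks Y (P, N, N) and person networks U (N, P, P)
    from a binary N x P response matrix X."""
    X = np.asarray(X, dtype=int)

    def product(a):          # 1 only when both responses are 1
        return np.outer(a, a)

    def match(a):            # 1 when both responses agree (0/0 or 1/1)
        return (a[:, None] == a[None, :]).astype(int)

    if encoding == "product":
        pair = product
    elif encoding == "match":
        pair = match
    else:
        raise ValueError("encoding must be 'product' or 'match'")

    Y = np.stack([pair(X[:, i]) for i in range(X.shape[1])])  # one per item
    U = np.stack([pair(X[k, :]) for k in range(X.shape[0])])  # one per person
    for net in (*Y, *U):
        np.fill_diagonal(net, 0)   # assumption: self-pairs are not modeled
    return Y, U
```

For example, a person who answers two items incorrectly contributes a 0 to that item pair under the product encoding but a 1 under the matching encoding, which is why the two encodings do not yield equivalent models.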
Furthermore, while the network of item responses must contain binary information, there is nothing that prevents dummy-coding a single item or collection of items such that meaningful binary encodings are included in the item response networks. One example of this would be the “multiple-mark” or “multiple TRUE/FALSE” item type (Frisbie & Sweeney, 1982), which contains a clustered set of check-box type responses with a “mark-all-that-apply” type prompt. Traditionally, item response analysis techniques have struggled with this item type for three reasons. First, the clustered set of options nested under a single prompt is a prototypical example of the type of items that cause local dependence (Yen, 1984). Second, the common choice for analyzing these data types is to either treat the sum score of the individually marked elements as a single ordinal item or to apply a scoring rule that accounts for or penalizes over/under marking and then again treat the outcome as a single ordinal item. Clearly, this ignores information encoded in the pattern of responses. The third struggle with the item type follows from the first and second. If analyzing the sub-items separately and analyzing a sum score are both sub-optimal, then the next best solution would be to analyze the whole patterns of responses from the mark-M item. However, the number of patterns for this item type is $2^M$, where $M$ is the number of sub-items. For most seemingly reasonable numbers of sub-items, $2^M$ is much too large to expect a non-zero number of observations of each pattern.
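As a small illustration of the dummy-coding idea, the sketch below expands one multiple-mark response into one binary value per option, so that each option can enter the item networks as its own binary column. The option labels and the helper name `expand_multiple_mark` are hypothetical.

```python
def expand_multiple_mark(marked, options):
    """Dummy-code one multiple-mark response as one 0/1 value per option."""
    return [1 if opt in marked else 0 for opt in options]

options = ["A", "B", "C", "D", "E"]            # hypothetical sub-items
row = expand_multiple_mark({"A", "D"}, options)
n_patterns = 2 ** len(options)                  # number of possible mark patterns
```

With only five options there are already 32 possible patterns, which shows how quickly whole-pattern analysis becomes sparse.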
2.3 Choosing the linkage between latent spaces
The item and person latent spaces are jointly estimated, but there is a misalignment in the dimensionality of the item networks and person networks. Jin & Jeon (2019) resolve this problem by assuming that one space can be defined as a function of the other. Logically, this makes sense as both sets of networks are constructed from the same dataset, $X$. There are many possible choices of linkages between the two latent spaces. Yet, Jin & Jeon (2019) mentioned only a single method of linking the two, which defined the latent position of item $i$, $w_i$, to be the average of the latent positions $z_k$ of the respondents who answered item $i$ correctly. That is, the chosen mathematical link is:
$$w_i = \frac{\sum_{k=1}^{N} x_{ki}\, z_k}{\sum_{k=1}^{N} x_{ki}}.$$
To give further intuition on this choice of link, consider that each person contributes their own item network to the analysis. This network contains a binary similarity measure for each item pair within individuals, and both the person’s tendency to have elements of their item network take values of 1 (i.e., $\theta_k$) and the distance between items determine the probability of a link. Then, each person’s network contributes to determining the distance between the items, controlling for that person’s tendency to answer pairs of items similarly. Thus, it is reasonable that an average of the positions of the persons who answered an item correctly defines the location of that item.
However, there is nothing preventing additional methods of resolving the dimensional mismatch and linking the two spaces. A clear alternative would be the assumption that this influence goes in the other direction. That is, the respondent latent space is a function of the item latent space, and a reasonable approximation of a person’s latent position is the average of the positions of the items which they answered correctly.
This formulation may not always be appropriate. When the number of items is small and many items have low proportions of positive endorsement, the accuracy of estimating the positions of items is severely limited because few individuals contribute to estimating the positions of the items.
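The two linkage directions described above amount to weighted averages of positions, with the binary responses acting as weights. A minimal sketch follows; the function names are assumptions, and no guard against items or persons with zero endorsements is included, so those rows evaluate to NaN.

```python
import numpy as np

def items_from_persons(X, Z):
    """Item position = average position of the persons who endorsed the item.

    X is the N x P binary response matrix; Z is the N x D person positions.
    """
    X = np.asarray(X, dtype=float)
    return (X.T @ Z) / X.sum(axis=0)[:, None]

def persons_from_items(X, W):
    """Person position = average position of the items the person endorsed.

    W is the P x D matrix of item positions.
    """
    X = np.asarray(X, dtype=float)
    return (X @ W) / X.sum(axis=1)[:, None]
```

An item endorsed by only one or two respondents is placed using very little information under the first linkage, which is the accuracy concern raised above for sparse endorsement.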
Jin & Jeon (2019, p. 244) provide a detailed discussion of Bayesian estimation of NIRM via a Metropolis-Hastings algorithm. Unfortunately, the computational complexity of this algorithm is high, and its order depends on whether the person space is a function of the item space or the item space is a function of the person space. Preliminary investigations into variational approximations to this estimation procedure via an EM algorithm have found that this approach may be too slow to be useful. This might appear contrary to the results of Gollini & Murphy (2016), but the distinction between that context and the NIRM is that the applied analysis in Gollini & Murphy (2016) jointly modeled three network views, while NIRM requires modeling at least $N + P$ network views (one per person and one per item). Consequently, the cost of the EM algorithm grows with both the selected number of quadrature points and the selected dimension of the NIRM. This grows very quickly and becomes infeasible even in moderately sized samples. Thus, the Bayesian procedure remains the method of choice for estimating NIRM.
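In the Bayesian procedure, one must select prior information for $\beta_i$, $\theta_k$, and $z_k$ or $w_i$. Normal priors are suggested for all first order parameters, i.e.
$$\beta_i \sim N(0, \sigma_\beta^2), \qquad \theta_k \sim N(0, \sigma_\theta^2), \qquad z_k \sim \text{MVN}_D(\mathbf{0}, \sigma_z^2 I_D) \;\; \text{or} \;\; w_i \sim \text{MVN}_D(\mathbf{0}, \sigma_w^2 I_D).$$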
Note that the choice of $\mathbf{0}$ as a prior mean for $z_k$ or $w_i$ is arbitrary, as the mean contributes no information to the distances between latent positions. Additionally, the choice of a diagonal covariance matrix is arbitrary. However, we note that if this specification is made, then the latent space variance, $\sigma_z^2$ or $\sigma_w^2$, can be estimated by assigning a hyper-prior. This provides useful information, as the variance of the latent spaces can be viewed as a pseudo effect size of the model. That is, a large value of the latent space variance results in larger distances between person and item positions in the network and proportionally lessens the influence of the marginal tendencies of 1’s in the person networks ($\theta_k$) and item networks ($\beta_i$).
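An inverse-gamma distribution is recommended as the hyper-prior for estimating the effects of the distances on the network. That is, we recommend an inverse-gamma hyper-prior on the latent space variance, $\sigma^2 \sim \text{Inv-Gamma}(a, b)$, with hyper-parameters $a$ and $b$.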
At present, there has been little investigation into the effects of the selection of hyper-parameters for this model. In other Bayesian analyses, it is common to select the hyper-parameters such that the prior expectation of the variance equals 1. While this does have influence over the contribution of the latent space, we suspect that said influence is minimal. This is because in a distance-based approach the amount of information about the latent space variance grows with the number of pairs of persons or items rather than with the number of persons or items; even at low sample sizes this information can quickly overwhelm the prior.
Clearly, in a Bayesian analysis it is important to check convergence of the parameter sampling chains. This is straightforward for the person and item intercept terms ($\theta_k$ and $\beta_i$). However, because Euclidean distance is invariant under translation, rotation, and reflection, one must take extra care in diagnosing the results of the latent positions. First, analyzing the sampling chain of an individual’s latent position without post-processing will likely show strong serial correlations. This is because the lack of identification of the latent positions allows the latent space to freely transform during sampling. To correct for this, Procrustes matching can be applied to the MCMC samples. More appropriately, one can select pairs of individuals or items, calculate the distance between the pair at each sampling iteration, and plot the resulting sampling chain of the distance to diagnose sampling issues with the distance metrics. Naturally, one may opt to perform this procedure for a sample of pairs due to the large number of pairs, i.e. $N(N-1)/2$ pairs of persons.
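Both post-processing ideas above can be sketched briefly. The code below is a minimal illustration, not the procedure of any particular package: `procrustes_align` solves the orthogonal Procrustes problem with an SVD after centering (no rescaling), and `distance_chain` extracts the chain of distances for one pair. The array layout `(iterations, N, D)` for the stored draws is an assumption.

```python
import numpy as np

def procrustes_align(sample, target):
    """Translate and rotate/reflect `sample` (N x D) onto `target` (N x D)."""
    S = sample - sample.mean(axis=0)
    t_mean = target.mean(axis=0)
    T = target - t_mean
    U_, _, Vt = np.linalg.svd(S.T @ T)
    R = U_ @ Vt                  # orthogonal matrix minimising ||S R - T||
    return S @ R + t_mean

def distance_chain(draws, a, b):
    """Distance between positions a and b at every MCMC iteration.

    `draws` has shape (iterations, N, D).
    """
    return np.linalg.norm(draws[:, a, :] - draws[:, b, :], axis=1)
```

Because distances are unchanged by translation, rotation, and reflection, `distance_chain` can be applied to the raw draws, whereas trace plots of individual coordinates are only meaningful after each draw has been aligned to a common reference configuration.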
For visualization purposes, it may be useful to rotate a matrix of latent positions. For example, the matrix of estimated latent person positions can be centered at the origin and rotated to its principal axes orientation (Borg & Groenen, 2005), giving the positions zero mean and a diagonal covariance matrix. Specifically, let $\hat{Z}$ be the centered $N \times D$ matrix of estimated positions. Then, let $V$ be the $D \times D$ matrix of eigenvectors of $\hat{Z}^\top \hat{Z}$. Preferably, the columns of $V$ are ordered by the magnitude of the corresponding eigenvalues.
Then, the transformation $\hat{Z}V$ rotates the latent space such that the dimensions are in decreasing order of their contribution to the observed networks. There are infinitely many other choices for rotation matrices, and similar topics have been evaluated in great depth in the exploratory factor analysis literature. However, many rotation methods have the aim of achieving a “simple structure” in which many factor loadings of a single item are zero. This is not a fruitful aim when rotating latent positions, as a zero-valued position on a single dimension does not have any inherent meaning. However, target rotation (Browne, 1972) may add insight into the latent positions for meaningfully specified target matrices. For example, specifying a target matrix such that certain subgroups of items obtain a meaningful pattern of positive and negative signs could create a rotation such that the orthant in which an item lies is substantively meaningful.
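The principal axes rotation described above can be sketched in a few lines. The function name `principal_axes` is an assumption; the eigendecomposition uses `numpy.linalg.eigh`, which returns eigenvalues in ascending order, so the columns are reordered explicitly.

```python
import numpy as np

def principal_axes(Z_hat):
    """Center estimated positions and rotate them to their principal axes."""
    Zc = Z_hat - Z_hat.mean(axis=0)              # center at the origin
    evals, evecs = np.linalg.eigh(Zc.T @ Zc)     # symmetric eigendecomposition
    order = np.argsort(evals)[::-1]              # largest eigenvalue first
    return Zc @ evecs[:, order]
```

Since the eigenvector matrix is orthogonal, all pairwise distances are preserved; only the orientation of the plot changes.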
Interpretation for the person ($\theta_k$) and item ($\beta_i$) intercept terms partially depends on the network encoding selected in previous stages of the analysis. In any case, the intercept term reflects the density of the associated person or item networks, but of course that density has different interpretations depending on the encoding. If networks are constructed as in Equation 2, then $\theta_k$ reflects the tendency of person $k$ to answer pairs of items in the “1” category and $\beta_i$ reflects the tendency for pairs of individuals to both respond in the “1” category to item $i$. Conversely, if networks are constructed as in Equation 3, then $\theta_k$ reflects the tendency of person $k$ to answer pairs of items in a concordant manner (i.e. both “0” or both “1”) and $\beta_i$ reflects the tendency for pairs of individuals to respond in a concordant manner to item $i$, again, both “0” or both “1”. Typically, latent person positions cannot be interpreted directly; they must be interpreted as a function of distance from either other persons or other items. In practice, one can inspect the distance from a person to a centroid of other persons or a centroid of items. In the context of a single classroom, an individual with a large distance to other students in the same classroom likely has largely different strengths and/or deficiencies on the material compared to other students. In psychology, large distances may be interpreted as individuals currently residing in a different developmental state (Jin & Jeon, 2019) than others, or perhaps having a largely different set of psychological symptoms than others. In the latter case, and given proper considerations for the source of the sample, clusters of individuals may represent relatively stable patterns of symptoms that may otherwise be in flux, i.e. a collection of low energy states.
There are two ways to visualize the results of a NIRM. The first is to construct plots visualizing the two latent spaces, or perhaps a single plot superimposing the item latent space on the person latent space. An example of these can be seen in Figure 1, which represents the NIRM estimated from the well-known LSAT6 dataset. In these plots, one can attempt to interpret the positions of items and persons relative to each other.
Importantly, large distances imply largely different patterns of responding. It is also important to visually inspect for clusters of items; these occur when portions of the sample have responded in a highly consistent manner to them. On the other hand, “outliers” that are far away from others are also informative, as these are items which display highly irregular responding relative to other items. These may be items that are irrelevant from a content perspective, or they may indicate a miskeyed item in educational testing.
Another approach to visualizing a NIRM is to represent the items as a network of nodes where distances between items are the weights of the edges of the network. However, representing distances as edges in the network may be misleading, because other network methods consider larger values (or at least, more positive values) to be indicative of higher “connectedness”. Instead, it would be better to represent positions with a measure of similarity. An example of this can be found in Figure 2, and these plots can easily be generated with the qgraph package in R (Epskamp et al., 2012).
There are many choices for selecting an appropriate similarity metric. A natural first choice would be cosine similarity, as it is related to Euclidean distance. However, unlike Euclidean distance, cosine similarity is not invariant to translations of the latent space. Instead, one may select an order-preserving distance-to-similarity transformation, that is, a monotonically decreasing function of the Euclidean distance between $w_i$ and $w_j$, the latent positions of items $i$ and $j$, that maps distances into a bounded similarity range. Several such transformations are suitable choices.
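To make the idea of an order-preserving distance-to-similarity transformation concrete, the sketch below shows two common monotone transformations. Both are assumptions chosen for illustration and are not necessarily the transformations used to produce the figures in this article: `sim_exp` uses $\exp(-d)$ and `sim_inverse` uses $1/(1+d)$, each mapping distances in $[0, \infty)$ to similarities in $(0, 1]$.

```python
import numpy as np

def pairwise_distances(W):
    """Euclidean distance between every pair of rows of W (P x D)."""
    diff = W[:, None, :] - W[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sim_exp(d):
    """Assumed transform: exp(-d), equal to 1 at distance 0, decreasing in d."""
    return np.exp(-d)

def sim_inverse(d):
    """Assumed transform: 1 / (1 + d), equal to 1 at distance 0, decreasing in d."""
    return 1.0 / (1.0 + d)
```

Either similarity matrix could then be passed as a weighted adjacency matrix to a network plotting tool, so that closer items are drawn with stronger edges.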
3 Adding new data
Given an estimated NIRM, it may be of interest to estimate the positions of new additions to the data. For example, one may want to estimate the positions of persons in a newly collected sample to compare positions to the old sample. This is analogous to ability estimation in IRT. Likewise, if data were collected on one or more new items from individuals in the same sample, it might be of interest to place those items in the existing latent space. This is analogous to linking item parameters of an IRT model.
Unfortunately, estimating new positions in NIRM is not trivial. In first-round estimation of NIRM, the latent space itself is estimated, which is a function of the sample and the items that were included in the analysis. The naive solution is to combine the old data and new data into a single dataset and estimate the full model on the whole data. This may not be a desirable solution, as estimation is generally slow. Fortunately, there are some alternatives. These alternatives depend on the analysis choices that were made when estimating the original model. Specifically, they depend on the choice of the linkage between the latent spaces and on whether one is adding new item locations or new person locations. These are outlined in Figure 3.
3.1 Approximating new positions
Importantly, there are some quickly calculated approximations for estimating new positions for persons or items when the latent space has been estimated as a function of the opposite space. For example, if the person latent space has been estimated as a function of the item latent space, then there is a quick approximation for estimating the locations of a new sample of individuals. Similarly, if the item latent space has been estimated as a function of the person latent space, then there is also a quick approximation for estimating the locations of new items.
Relying on the link between the latent spaces, a new set of positions can be estimated by averaging the latent positions for the items which were answered correctly. This approximation is straightforward to calculate and when the number of items with correct responses is sufficiently large, it converges to the true value of the latent positions.
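Therefore, if the item latent space is a function of the person latent space, this approximation may be used:
$$\hat{w}_{new} = \frac{\sum_{k=1}^{N} x_{k,new}\, \hat{z}_k}{\sum_{k=1}^{N} x_{k,new} + \epsilon}.$$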
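Likewise, if the person latent space is a function of the item latent space, one can estimate new person positions with the following equation:
$$\hat{z}_{new} = \frac{\sum_{i=1}^{P} x_{new,i}\, \hat{w}_i}{\sum_{i=1}^{P} x_{new,i} + \epsilon}.$$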
Here, $\epsilon$ is an arbitrarily small number that prevents division by 0, which may occur if a response vector contains no correct responses. However, this approximation loses accuracy when very few items have been answered correctly, and in the particular degenerate case of a zero sum-score it will define the latent person vector to be $\mathbf{0}$. While seemingly not useful as an approximation in this case, zero correct responses indicates that the participant is equally distant to all sets of items.
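The averaging approximation with the small constant in the denominator can be sketched as follows. The function name `approx_new_positions` and the default `eps` value are assumptions; the same function serves both directions depending on which positions are held fixed.

```python
import numpy as np

def approx_new_positions(X_new, fixed_positions, eps=1e-8):
    """Average the fixed positions over the entries of each row that equal 1.

    For new persons: X_new is (new persons x P) and fixed_positions is W (P x D).
    For new items:   X_new is (new items x N), i.e. the transposed responses,
                     and fixed_positions is Z (N x D).
    """
    X_new = np.asarray(X_new, dtype=float)
    totals = X_new.sum(axis=1, keepdims=True)
    return (X_new @ fixed_positions) / (totals + eps)   # eps avoids 0 / 0
```

A row with no endorsements has a zero numerator, so its approximate position is the origin, matching the degenerate case discussed above.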
If these approximations are not useful, the following sections describe sampling procedures for estimating new positions, assuming the latent spaces are fixed to their posterior estimates from the first sample. One benefit of these assumptions is that rotational and translational indeterminacy is no longer a factor, and thus the samples do not require post-processing.
In all cases, the selection of prior information can remain the same, with the exception of the hyper-prior on the variance of the latent spaces; this hyper-prior may be removed and the latent space variances can be fixed to their posterior estimates.
3.2 Approximating new person intercepts
Intercepts for new persons may not be substantively interesting, as they mostly provide a correction for base rates (i.e. they are analogous to main effects) when estimating person positions. They have similar interpretations to the person and item location parameters of the Rasch model. As in the Rasch model, they provide little additional information above person sum scores, except to give an inverse logit transformed representation of the probability of concordant pairs. However, if these are of interest, a simple approximation is available. With $s_k = \sum_i x_{ki}$ denoting the sum score of person $k$, approximate intercepts for new persons are given by
$$\hat{\theta}_{new} = \frac{\sum_{k=1}^{N} \hat{\theta}_k\, \mathbb{1}(s_k = s_{new})}{\sum_{k=1}^{N} \mathbb{1}(s_k = s_{new})}, \tag{4}$$
where $\mathbb{1}(\cdot)$ is an indicator function that evaluates to 1 if its argument is TRUE and 0 otherwise. This approximation is not useful when attempting to map a new sum score to an intercept when that value of the sum score was not observed in previous samples. Unfortunately, there is little to be done about these approximate estimators in that case. It is also not uncommon for estimators to give non-useful results for some special cases, such as patterns of all 1’s or 0’s in 2PL IRT models (De Ayala, 2013, p. 26). However, we provide the approximations in the hope that they may be useful in some circumstances, as it is expected these models may be used in an exploratory fashion. For cases when the approximations will not do, we describe a set of conditional posteriors that, with added assumptions, can be used to estimate parameters of persons and items for new data.
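The sum-score matching approximation for new person intercepts can be sketched as below. The function name `approx_new_intercepts` is an assumption, as is the choice to return NaN for sum scores that were never observed in the previous sample.

```python
import numpy as np

def approx_new_intercepts(X_old, theta_old, X_new):
    """Average the old intercepts of persons sharing each new person's sum score."""
    s_old = np.asarray(X_old).sum(axis=1)
    s_new = np.asarray(X_new).sum(axis=1)
    theta_old = np.asarray(theta_old, dtype=float)
    out = np.full(len(s_new), np.nan)        # NaN marks unseen sum scores
    for j, s in enumerate(s_new):
        match = s_old == s                   # indicator of equal sum scores
        if match.any():
            out[j] = theta_old[match].mean()
    return out
```

Returning NaN makes the unobserved sum-score case explicit, so that those persons can be routed to the sampling procedure instead.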
3.3 Incorporating new persons
When new data are added for new persons on the same items, we may wish to estimate the parameters corresponding to the new individuals. Here, we are attempting to estimate two new sets of information. The first is the new person intercept. Assuming that the new individuals are from the same latent space, the conditional posterior for the new person intercept is provided below.
where the fixed quantities are taken from the initial NIRM calibration. One also must estimate the locations for the new persons. The conditional posterior for new person positions is given in Equation 6. This requires both the new and old datasets and the parameter estimates from the previous model. Historically, it has been considered cumbersome to retain old data to estimate new portions of a model; however, with modern data retention practices we hope this is currently less of an issue. Unfortunately, it is also an unavoidable tradeoff with distance-based models.
Similarly to the full data likelihood, the above posterior contains a high number of product terms. For every item, there is a product over all pairs of individuals. That term is then multiplied by additional product terms, each of which is itself a product over all pairs of items. However, we may drop all terms that do not contain the new person's intercept and position, as the remaining parameters are assumed fixed in this sample. Therefore, we now have
With these conditional posteriors, constructing a Markov Chain Monte Carlo (MCMC) sampler is straightforward. It can proceed by drawing a random starting value for the new person position from the prior distribution and setting a starting value for the new person intercept by taking the average of the intercepts of persons that have the same sum score in the previous sample, for example using Equation 4 from above. Note that when estimating positions for multiple new persons, the new persons do not contribute to each other’s estimation, as the latent space is assumed fixed and including all new individuals would update the latent space. This is in line with ability estimation practices in IRT. With the exception of online calibration procedures (e.g., Ban et al., 2001), ability and item parameters are not simultaneously updated.
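A sampler of the kind described above can be sketched as a random-walk Metropolis scheme. This is a minimal sketch under several stated assumptions, not the authors' implementation: ties follow the logistic form with log-odds equal to an intercept minus a distance, networks use the product encoding, priors are normal with hypothetical standard deviations `sd_theta` and `sd_z`, and the old person positions `Z`, item positions `W`, and item intercepts `beta` are held fixed at their earlier estimates.

```python
import numpy as np

def log_bern(y, eta):
    """Sum of Bernoulli log-probabilities for outcomes y with log-odds eta."""
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def sample_new_person(x_new, X_old, Z, W, beta, theta_start,
                      n_iter=2000, step=0.3, sd_theta=2.0, sd_z=1.0, seed=0):
    """Random-walk Metropolis draws of one new person's intercept and position."""
    rng = np.random.default_rng(seed)
    x_new, X_old = np.asarray(x_new), np.asarray(X_old)
    Z, W, beta = np.asarray(Z, float), np.asarray(W, float), np.asarray(beta, float)
    iu = np.triu_indices(len(x_new), k=1)
    u_new = np.outer(x_new, x_new)[iu]            # new person's item network
    d_items = np.linalg.norm(W[:, None] - W[None, :], axis=-1)[iu]
    y_new = X_old * x_new[None, :]                # ties to each old person, per item

    def logpost_theta(th):
        return log_bern(u_new, th - d_items) - 0.5 * (th / sd_theta) ** 2

    def logpost_z(z):
        d = np.linalg.norm(Z - z, axis=1)         # distance to every old person
        eta = beta[None, :] - d[:, None]          # (old persons x items)
        return log_bern(y_new, eta) - 0.5 * np.sum(z ** 2) / sd_z ** 2

    theta = float(theta_start)
    z = rng.normal(0.0, sd_z, size=Z.shape[1])    # start position from the prior
    draws_theta = np.empty(n_iter)
    draws_z = np.empty((n_iter, Z.shape[1]))
    for t in range(n_iter):
        prop = theta + step * rng.normal()
        if np.log(rng.uniform()) < logpost_theta(prop) - logpost_theta(theta):
            theta = prop
        prop = z + step * rng.normal(size=z.shape)
        if np.log(rng.uniform()) < logpost_z(prop) - logpost_z(z):
            z = prop
        draws_theta[t], draws_z[t] = theta, z
    return draws_theta, draws_z
```

Each new person is sampled separately against the fixed old sample, so new persons never enter one another's likelihood terms. Because the old positions are fixed, the draws need no Procrustes post-processing.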
3.4 Incorporating new items
This section describes incorporating data from new items into NIRM. These processes all involve fixing some portion of the previously estimated model. In that way, they are similar to various types of fixed-parameter calibration procedures from the IRT literature (e.g., Kim, 2006). There are two scenarios for incorporating new items. In case 1, some or all of the present sample has been administered new items. This may happen in longitudinal studies or learning environments. In case 2, a new sample has been administered both the complete set of items from the original model and a new set of items for which new positions must be estimated. Clearly, the former is a simpler endeavor than the latter. In either case, we must update the item intercept using the following conditional posterior.
Starting values can be obtained by a process analogous to the one used above for the person intercepts. If only item data have been added for the original sample, the person positions used above may be treated as fixed. If the newly added data represent a sample of new persons with responses to the old, common items and new “pre-test” items, person parameters must be updated using the methods described in the previous section. This process includes a decision point: should the new items contribute to estimating person positions? If yes, then persons in the new and old samples with identical response patterns on the common items will not have identical positions in the latent space. If no, the person positions will be identical, but the new item positions will only reflect distances from the latent space defined by the original dataset. The former option generally seems more tenable; it makes sense to partially update person positions in light of new information.
Estimating the locations of new items again depends on the type of new data. For case 1, the positions of individuals have been fixed prior to estimation. In case 2, the positions of individuals are estimated concurrently. This somewhat complicates the MCMC sampling scheme, but the procedure for placing the new item positions in the latent space in either case can make use of the conditional posterior in Equation 8 below.
A similar approach to that in Equation 7 can be applied to limit the number of calculations in the sampling procedure. Dropping all pairs of items that do not contain at least one new item is sufficient for placing the new items in the latent space. It is possible to further simplify computation by only including pairs of items where exactly one item is new. The distinction between these two options is that restricting to pairs with exactly one new item amounts to not updating the latent space at all, whereas retaining every pair with at least one new item performs a partial update of the item latent space in which the positions of the new items take both new and old items into account. Care must be taken when deciding between these two options. If the item latent space is not updated and new items are simply “placed” (or “positioned”) into the old latent space, then (1) the distances between the locations of the new items may not be an accurate reflection of their true distances and (2) clusters (or the lack thereof) of new items may be difficult to evaluate. Generally, partially updating the latent space with the new items should be preferred.
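The two ways of restricting the item pairs can be made concrete with a short sketch. The function name, the mode labels, and the item labels are hypothetical and chosen only for illustration; the only difference between the two sets is whether pairs made up of two new items are retained.

```python
from itertools import combinations


def item_pairs(old_items, new_items, mode):
    """Return the item pairs kept in the product terms.

    mode="at_least_one": keep pairs containing one or two new items.
    mode="exactly_one":  keep only pairs of one new and one old item.
    """
    if mode not in ("at_least_one", "exactly_one"):
        raise ValueError("unknown mode: " + repr(mode))
    new = set(new_items)
    kept = []
    for j, k in combinations(list(old_items) + list(new_items), 2):
        n_new = (j in new) + (k in new)
        if n_new == 1 or (n_new == 2 and mode == "at_least_one"):
            kept.append((j, k))
    return kept


# Hypothetical example: three calibrated items and two new items.
pairs_any = item_pairs([1, 2, 3], [4, 5], "at_least_one")
pairs_one = item_pairs([1, 2, 3], [4, 5], "exactly_one")
print(len(pairs_any), len(pairs_one))  # 7 6
```

With three old and two new items, the only pair that separates the two options is (4, 5), the pair of new items.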
4 Illustrative Data Analysis
In this section, we illustrate the workflow of estimating a NIRM for student assessment data. The R code used in the application is provided in the online repository associated with this study.
4.1 Data Background
The AP-CAT (Advanced Placement - Computerized Adaptive Testing) project was funded by the National Science Foundation to develop formative assessments for high-school AP statistics classes. This project required item bank calibration, development of an online system for data collection, and pilot testing of multiple fixed-form and CAT assessments, with the ultimate goal of providing granular diagnostic feedback to students via detailed score reports. During the pilot phases of the project, students in several classrooms took up to 5 assessments spaced throughout the year-long AP statistics course. The item development phase included an expert panel who associated items with the content areas of AP statistics, which at that point in time had 4 content domains.
This study used data from the 2018-19 cohort of the AP-CAT Project, a sample of 441 participants. All participants completed the appropriate parental consent and assent forms to be eligible for participation in this study. Participants received access to the AP-CAT platform at no cost as part of this study. As part of the study, students completed several self-report surveys on learning-related constructs and a set of linear test assignments that contributed to the course grade. The subsample (N = 368) of those who completed the standardized assignment at the second time point was included in the analysis.
The assessment used in this illustrative example contained 27 common items across all teachers whose classrooms participated in this study; some teachers also taught multiple sections. Some teachers elected not to administer all items because their classrooms had not covered the related content at that time point. The assessment contained items from 3 of the 4 main topic areas, with the majority balanced between areas 2 and 3, as content area 1 had been assessed in the assessment administered at the first time point. The breakdown of item content area is further described in Table 2, and example items are provided in the online repository.
To analyze the data, we assume the distances between items and persons can be approximated with a latent space comprising 2 dimensions; this was selected both for ease of visualization and because the majority of the items come from two content domains. It is not necessary for the dimension of the latent space to match any “true” number of dimensions. Rather, increasing the number of dimensions would ease post-hoc detection of item clusters (e.g., by spectral clustering of the latent space) by increasing the number of directions in which separation may occur. This is a tradeoff that comes at the cost of limiting visualization of the clusters in the latent space.
We selected the “all concordant pairs” specification (Equation 3) of the item networks rather than the “positive concordant pairs” representation (Equation 2) because we believe that the absence of a correct response indicates a lack of proficiency in the attributes required to respond to the question. Thus, it is desirable for an incorrect response to contribute to an increased distance from that item’s position. The linkage between the networks was chosen such that the person space is a function of the item latent space (or equivalently, the item positions). This selection was informed by the adequately large number of items, which allows person positions to reflect a large degree of variability in the individual patterns of response. We used the Bayesian estimation procedure outlined in Jin & Jeon (2019, p. 244) and the Estimation section above. The parameters of the prior distributions were chosen so that the intercept parameters had a prior mean of 0 and a prior variance of 10. We elected to estimate the contributions of the latent spaces by placing an Inv-Gamma hyperprior on the variability of the latent spaces.
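The difference between the two network specifications can be sketched for a single response vector. The encoding below is an assumption based on the names of the specifications (a tie under “positive concordant pairs” requires both responses to be 1, whereas a tie under “all concordant pairs” requires only that the two responses agree); the function name and the example vector are hypothetical.

```python
def pair_network(responses, spec="all"):
    """Build a binary adjacency matrix over the elements of one response vector.

    spec="positive": tie when both responses are 1.
    spec="all":      tie when the two responses agree (both 1 or both 0).
    """
    if spec not in ("positive", "all"):
        raise ValueError("unknown spec: " + repr(spec))
    n = len(responses)
    network = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a == b:
                continue  # no self-ties
            if spec == "positive":
                tie = responses[a] == 1 and responses[b] == 1
            else:
                tie = responses[a] == responses[b]
            network[a][b] = int(tie)
    return network


# One hypothetical person's responses to four items.
x = [1, 0, 0, 1]
positive = pair_network(x, "positive")
all_concordant = pair_network(x, "all")
print(positive[1][2], all_concordant[1][2])  # 0 1
```

Items 2 and 3 were both answered incorrectly, so they are tied only under the “all concordant pairs” encoding. Applying the same function to a column of the data (one item's responses across persons) would give the corresponding person network under the same assumption.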
The analysis took 39 minutes to complete using 4 threads of a 4-core Intel i7-4810 CPU @ 2.80GHz with 16GB of RAM. We ran a chain of length 15,000; the first 5,000 iterations were discarded as burn-in and a thinning rate of 5 was applied, leaving 2,000 total samples. Posterior distribution summaries for the person intercept parameters are grouped by sum score and are provided in Table 4.
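The arithmetic behind the retained sample count can be checked directly. Which iteration within each block of 5 is kept is an assumption here; the count does not depend on it.

```python
chain_length = 15000  # total MCMC iterations
burn_in = 5000        # iterations discarded at the start
thin = 5              # keep every 5th post-burn-in iteration

# Zero-based indices of the retained iterations.
kept = list(range(burn_in, chain_length, thin))
print(len(kept))  # 2000
```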
The person intercept estimates have a curvilinear relationship with the sum score because, when the “all concordant pairs” network specification is used, the person intercept represents the consistency of the pattern rather than the pattern itself. Similarly, item intercept estimates are provided in Table 5, and tend to increase as the proportion correct on the items moves away from the center.
Person positions are displayed in Figure 4, with item positions superimposed.
The first thing to notice about the person positions plot is the single large cluster of person positions near the center of the figure. These participants all have highly similar sum scores (represented by color) and form a group with very small distances to each other. For example, participants 30 and 352 are on the upper left edge of the cluster in the center. Because of their proximity, they likely have very similar response patterns. However, given that they have the same sum score (24, denoted by the same color) but different positions, these participants must have slight differences in their response patterns. Indeed, the pattern for participant 30 is “111110101111111011111111111” and the pattern for participant 352 is “011111111101111111011111111”. They differ on 6 items (items 1, 6, 8, 11, 16, and 19, counting by position in the pattern); the pattern for only those 6 items is “100101” for participant 30 and “011010” for participant 352.
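The comparison of the two response patterns can be reproduced with a few lines of code. The pattern strings are copied from the text; indexing items by their position in the string is an assumption.

```python
p30 = "111110101111111011111111111"
p352 = "011111111101111111011111111"

# Sum scores are the number of 1s in each pattern.
score30 = p30.count("1")
score352 = p352.count("1")

# 1-based positions at which the two patterns disagree.
diff_items = [i + 1 for i, (a, b) in enumerate(zip(p30, p352)) if a != b]

# Responses restricted to the disagreeing positions.
sub30 = "".join(p30[i - 1] for i in diff_items)
sub352 = "".join(p352[i - 1] for i in diff_items)

print(score30, score352)  # 24 24
print(diff_items)         # [1, 6, 8, 11, 16, 19]
print(sub30, sub352)      # 100101 011010
```

The two participants agree on 21 of the 27 items, which is consistent with their nearby but not identical positions.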
The next thing to inspect is the contrast between items that have large distances from one another. For example, items 7 and 9 are on opposite sides of the plot about the x-axis. Item 7 had 193 participants answer the item correctly and item 9 had 195 participants answer the item correctly. The proportions correct are very similar, but the two items were not estimated to reside in similar locations. This can be easily understood from viewing the contingency table for the items, which shows 170 concordant pairs and 198 discordant pairs. Compare this to items 8 and 9, which have 193 concordant pairs, while item 8 has 194 correct responses. This three-way comparison demonstrates how proximity reflects patterns in the contingency tables that may be difficult to spot without a high degree of effort. Table 3 displays the three-way contingency table for items 7, 8, and 9.
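The counts quoted for items 7 and 9 are enough to recover the full two-way table, assuming all 368 participants responded to both items (which is consistent with 170 + 198 = 368). The cell symbols below are introduced only for this illustration: $n_{11}$ and $n_{00}$ count participants who answered both items correctly or both incorrectly, and $n_{10}$ and $n_{01}$ count those who answered only item 7 or only item 9 correctly.

$$
\begin{aligned}
C &= n_{11} + n_{00} = n_{11} + (N - n_7 - n_9 + n_{11}) = N - n_7 - n_9 + 2 n_{11}, \\
170 &= 368 - 193 - 195 + 2 n_{11} \;\Rightarrow\; n_{11} = 95, \quad n_{00} = 170 - 95 = 75, \\
n_{10} &= 193 - 95 = 98, \quad n_{01} = 195 - 95 = 100, \quad n_{10} + n_{01} = 198 .
\end{aligned}
$$

So although roughly half of the sample answers each item correctly, only about a quarter (95 of 368) answers both correctly, which is why the two items are placed far apart.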
Next, we check for outlier items in the latent space. From visual inspection, it appears that item 1 has the largest distances from the other items. This is easily verified by calculating the average distance from every item to every other item. For example, in R this would be colMeans(as.matrix(dist(W))), where W is the matrix containing the item positions; the as.matrix() call is needed because dist() returns a "dist" object rather than a matrix. Indeed, item 1 has the largest average distance to the other items at 1.9 units. Panel B of Figure 5 contains a histogram of average item-rest distances. Panel A of the same figure displays a histogram of all distances. The large distance of item 1 to the rest of the items is cause for suspicion about the item's fit with the rest of the items. The item with the largest distance from item 1 is item 23; those two have 281 discordant pairs and only 86 concordant pairs. Here we may attempt to infer the source of these differences. Given an educational test, we suspect that the low connectedness between those two nodes may be due to either different content areas or a miskeyed response. Inspecting the items, both items 1 and 23 are from content area 3. Thus it is not likely that the items have different patterns due to different knowledge requirements. However, the second possibility was found to be plausible: item 1 contains multiple response options that test takers may consider to be correct.
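The outlier check can be sketched as follows. This is an illustration in Python rather than the study's R code; the positions in W are hypothetical, and the average is taken over the other items only, so an item's zero distance to itself is excluded.

```python
from math import dist


def mean_distance_to_rest(positions):
    """Average Euclidean distance from each position to all other positions."""
    n = len(positions)
    return [
        sum(dist(p, q) for k, q in enumerate(positions) if k != j) / (n - 1)
        for j, p in enumerate(positions)
    ]


# Hypothetical two-dimensional item positions; the last one sits far away.
W = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (3.0, 4.0)]
avg = mean_distance_to_rest(W)
outlier = max(range(len(W)), key=lambda j: avg[j])
print(outlier)  # 3
```

Sorting the items by this average gives a simple ranking of which items sit farthest from the rest of the latent space.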
Figure 6 displays the network of item similarities (determined by inverse-exponentiated distances, i.e., exp(−d) for a distance d). In this figure, positions in the network are determined by choosing the layout = "spring" option in the qgraph function in the R package of the same name (Epskamp et al., 2012). The highly similar items are all also highly interconnected, with many of the items that have at least one high value of similarity tending to have many additional high-similarity connections. The next section will compare NIRM to two other commonly used psychometric models, and fit versions of these models to the same dataset used in this illustrative example.
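Converting latent distances into similarities can be sketched as below, reading “inverse-exponentiated” as exp(−distance). That reading, the zero diagonal (no self-loops), and the example positions are assumptions made for illustration.

```python
from math import dist, exp


def similarity_matrix(positions):
    """Similarity between two items is exp(-distance); the diagonal is 0."""
    return [
        [0.0 if j == k else exp(-dist(p, q)) for k, q in enumerate(positions)]
        for j, p in enumerate(positions)
    ]


# Hypothetical item positions in a two-dimensional latent space.
W = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
S = similarity_matrix(W)
print(round(S[0][1], 4))  # 0.3679
```

Similarities lie between 0 and 1 and shrink quickly with distance, so a weighted network drawn from such a matrix is dominated by the closest item pairs.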
5 Comparisons to Other Models
To compare and contrast Item Response Theory and the Ising Model with the network item response model, we first briefly describe both.
5.1 Item Response Theory
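We first consider the multidimensional version of the 2-parameter logistic (2PL) IRT model. In this model, the probability that student $i$ yields a correct response to item $j$ is:

$$P(x_{ij} = 1 \mid \boldsymbol{\theta}_i) = \frac{\exp(\mathbf{a}_j^{\top} \boldsymbol{\theta}_i + d_j)}{1 + \exp(\mathbf{a}_j^{\top} \boldsymbol{\theta}_i + d_j)}$$

Here, $\boldsymbol{\theta}_i$ represents the latent ability vector, $\mathbf{a}_j$ is the vector of discrimination parameters, and $d_j$ is the item intercept. For typical models, the full data likelihood for $N$ persons and $J$ items has the following form:

$$L = \prod_{i=1}^{N} \prod_{j=1}^{J} P(x_{ij} = 1 \mid \boldsymbol{\theta}_i)^{x_{ij}} \left[ 1 - P(x_{ij} = 1 \mid \boldsymbol{\theta}_i) \right]^{1 - x_{ij}}$$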
Within the form of the likelihood, i.e., the double product over items and persons, we find the built-in assumptions of independent persons (e.g., random sampling) and conditional independence of items. The latter of these two is the well-known local independence assumption. This set of assumptions is quite strict in that it requires explicitly adding terms to the model in order to account for various types of dependency. See, for example, mixture IRT for unobserved person clusters (Rost, 1990), multiple group IRT for known person clusters (Bock & Zimowski, 1997), bifactor models for known item clusters caused by a known construct (Gibbons et al., 2009), and testlet models for known item clusters caused by an unknown construct (DeMars, 2006). Clearly, there is no shortage of techniques to accommodate independence assumption violations in IRT.
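The double product in the likelihood can be made concrete with a small sketch of the multidimensional 2PL model in slope-intercept form. The function names and the toy parameter values are hypothetical; the log-likelihood is simply a double sum over persons and items, which is where the independence assumptions enter.

```python
from math import exp, log


def p_correct(theta, a, d):
    """2PL probability of a correct response: logistic(a . theta + d)."""
    z = sum(a_k * t_k for a_k, t_k in zip(a, theta)) + d
    return 1.0 / (1.0 + exp(-z))


def log_likelihood(X, thetas, a_params, d_params):
    """Sum of Bernoulli log-probabilities over every person-item pair."""
    total = 0.0
    for x_row, theta in zip(X, thetas):
        for x, a, d in zip(x_row, a_params, d_params):
            p = p_correct(theta, a, d)
            total += x * log(p) + (1 - x) * log(1.0 - p)
    return total


# Hypothetical data: two persons, two items, two latent dimensions.
X = [[1, 0], [0, 1]]
thetas = [[0.0, 0.0], [0.0, 0.0]]
a_params = [[1.0, 0.5], [0.8, 1.2]]
d_params = [0.0, 0.0]
ll = log_likelihood(X, thetas, a_params, d_params)
print(round(ll, 4))  # -2.7726
```

With all abilities and intercepts at zero, every response has probability 0.5, so the log-likelihood is 4 × log(0.5). No term in the sum involves two persons or two items at once.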
5.2 Latent Ising Model
In the binary networks used in the Ising model, variables take values $\{-1, +1\}$ as opposed to $\{0, 1\}$ due to the traditional use in physics, where the network nodes represented spins on a lattice. The Ising model, as used in psychological research, has the following form:

$$P(X = x) = \frac{1}{Z} \exp\left( \sum_{i} \tau_i x_i + \sum_{\langle i, j \rangle} \omega_{ij} x_i x_j \right)$$

where the summation over $\langle i, j \rangle$ is over all pairs of nodes that are connected neighbors on the graph, and $Z$ is the normalizing constant, i.e., the sum of the exponentiated term over all possible patterns of the binary variables. Nodes are considered to be independent conditional on the values of the other nodes to which they are “connected”. In traditional Ising models, only nodes that were adjacent on the lattice were considered to be connected. However, in psychometric networks it is desirable to consider all nodes to be eligible for connection, and then estimate which nodes are connected and to what degree. The model above contains main effects and interaction terms. The main effects determine the tendency for node $i$ to take positive (if $\tau_i > 0$) or negative values (if $\tau_i < 0$). If two nodes have a positive interaction ($\omega_{ij} > 0$), then those nodes will have a tendency to take the same values. Likewise, negative interaction terms ($\omega_{ij} < 0$) represent a tendency for items to take opposite values. Importantly, these terms are interpreted similarly to regression coefficients in that the effect only has meaning at fixed values of other variables. Stated differently, the interaction term for nodes $i$ and $j$ has the preceding interpretation conditional on the values of the other nodes.
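For a very small network, the Ising distribution can be computed by brute-force enumeration, which makes the roles of the main effects and interactions easy to see. The parameter names (tau for main effects, omega for pairwise interactions) and the toy values are assumptions for illustration, and enumeration is feasible only for a handful of nodes.

```python
from itertools import product
from math import exp


def ising_distribution(tau, omega):
    """Probability of every -1/+1 pattern; omega[i][j] is read for i < j."""
    n = len(tau)
    weights = {}
    for x in product((-1, 1), repeat=n):
        h = sum(tau[i] * x[i] for i in range(n))
        h += sum(
            omega[i][j] * x[i] * x[j]
            for i in range(n)
            for j in range(i + 1, n)
        )
        weights[x] = exp(h)
    z = sum(weights.values())  # normalizing constant over all patterns
    return {x: w / z for x, w in weights.items()}


# Three nodes, no main effects, one positive interaction between nodes 0 and 1.
tau = [0.0, 0.0, 0.0]
omega = [[0.0, 0.8, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
probs = ising_distribution(tau, omega)
p_same = sum(p for x, p in probs.items() if x[0] == x[1])
print(round(p_same, 3))  # 0.832
```

With a positive interaction of 0.8, nodes 0 and 1 take the same value about 83% of the time, while node 2, which has no connections, is unaffected.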
Without discussing this further here, we note that Marsman et al. (2018) provided a clever linkage between Ising models and multidimensional 2-parameter logistic models that shows each Ising model has a statistically equivalent IRT model. While this is true, Epskamp et al. (2017) make clear that the interpretations and implications of each model are drastically different. IRT models represent a common cause model, in which the responses to items are caused by common underlying factor(s). Conversely, Ising models represent another paradigm in which observable features form a network of nodes connected by causal relations. However, both models are effectively multivariate generalized linear models. In IRT, it is latent variables that are used to predict the outcome. In the Ising model, all variables and all possible interactions of variables are used as predictors.
NIRM, on the other hand, does not attempt to predict the response value of a single item; instead, it chooses some collection of elements of the contingency tables for all items/persons to predict. From this perspective, NIRM is attempting to answer a different question than either IRT or Ising models: what is an optimal representation of distances between item and person positions that, conditional on marginal tendencies, best replicates the observed summaries of pairwise responses? In particular, for the second of the two pairwise encoding procedures discussed, there is no way to recover the data in row-column form once the networks have been constructed; that is, concordant pairs of all types are treated equally and all discordant pairs are treated equally. This understanding of how NIRM, the Ising model, and IRT conceptualize pairwise responses is key to understanding their differences and is discussed more deeply in the next section.
5.4 Response Pattern Similarities
One driver in the development of network approaches in psychometrics is the pursuit of alternative explanations of positive correlations among tests of ability. That is, the prevailing models that account for the so-called positive manifold (Spearman, 1925) of responses contain a scalar or vector of latent trait scores that causes all response patterns. On the other hand, the Ising model explains the positive manifold as a consequence of “a network of mutually reinforcing entities connected by causal relations” (Marsman et al., 2018). Both Ising models and IRT models imply an explanation of why response patterns may be similar to one another. These explanations may also be very similar. If binary response patterns are the same for two people, the IRT explanation would be that the two individuals likely have similar values for their latent trait scores and, given that, they independently generated similar response patterns. Similarly, for two identical patterns generated from an Ising model perspective, the explanation would be that, due to the connections (interactions) present in the network, some patterns are more likely than others. Identical patterns are then a product of independent draws from a categorical distribution with unequal probabilities.
Considering another angle, why might two items have a large number of concordant pairs? In IRT, the number of concordant pairs from two items is solely a function of the item parameters. That is, for a fixed sample size N, the number of concordant pairs can be written as
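One way to express this, offered as a sketch in assumed notation, is as an expected count under local independence. Here $P_j(\boldsymbol{\theta})$ denotes the probability of a correct response to item $j$ at ability $\boldsymbol{\theta}$, $g(\boldsymbol{\theta})$ is an assumed population ability density (for example, standard normal), and both (1, 1) and (0, 0) response pairs are counted as concordant:

$$
E[C_{jk}] = N \int \Big[ P_j(\boldsymbol{\theta}) P_k(\boldsymbol{\theta}) + \big(1 - P_j(\boldsymbol{\theta})\big)\big(1 - P_k(\boldsymbol{\theta})\big) \Big] \, g(\boldsymbol{\theta}) \, d\boldsymbol{\theta}
$$

Once the ability distribution is fixed, the integrand depends on the data only through the parameters of items $j$ and $k$.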
In the Ising model, the number of concordant pairs for a set of items is a function of the model parameters, and the model implies that the corresponding probability is the same for all possible patterns of the other variables. This implication is a byproduct of the absence of latent variables in the model, or rather, of the Ising model representing a pairwise Markov random field of the binary items. In NIRM, similar response patterns are caused by proximity of persons in the Euclidean latent space. Technically, the distance between each pair of persons contributes to the pairs of responses. All said, the Ising model and IRT models result in different findings when fit to a dataset. To make the distinction clear, we estimate the Ising model and a unidimensional 2PL model on the same data from the illustrative example. The Ising model was estimated with the IsingFit function (using default settings) in the package of the same name (van Borkulo et al., 2016).
As previously mentioned, Ising models are frequently visualized via a network plot of the nodes, which is displayed in Figure 7. This estimated model has many unconnected nodes, with only a few clusters of connections. The estimated thresholds for the Ising model are in Table 6. The interpretation of the Ising model visualization is quite different from that of the NIRM. In the Ising model, the connections represent a tendency to covary after taking into account the values of all other variables. In contrast, in NIRM, a connection represents a tendency to have a relatively large number of concordant pairs after taking into account the positions of all other variables.
The 2PL model was estimated using the mirt package in R (Chalmers, 2012). The loadings and intercepts are also provided in Table 6. It is worth noting that there are some red flags in the values of the parameter estimates for the IRT model. First, Item 1 has a negative (yet rather close to zero) factor loading. For educational data, this is rarely expected; any in-depth analysis would probe into the data further to understand this issue. Similarly, Item 27 has an unusually large intercept (6.83), reflective of the near-perfect proportion correct of the item (.98). For large-scale testing programs, this item would be considered far too easy to provide useful information and would be discarded. Both of these findings from the IRT model were corroborated by evidence from the NIRM. However, the NIRM and the Ising model led to somewhat different substantive conclusions. This is not to say that either conclusion is wrong, but the information received is certainly different between the two.
6 Discussion and Conclusions
This study intended to show the application of network item response models to student assessment data by introducing the model and providing in-depth discussion of the multiple decision points that occur throughout the model building procedure. However, this discussion is not complete, as some questions surrounding network item response models are currently unanswered. One open question in NIRM is a procedure for evaluating model fit. Given the Bayesian estimation procedure, posterior predictive checks are always possible. However, more standardized metrics would be desirable so that cut-off values can be determined. While quantifying the latent spaces by estimating their variability can capture their contribution, it is unknown how this metric responds to increases in dimension, number of items, and number of persons. A question related to model fit is the sample size required for estimating NIRM. The present study had a sample size of slightly less than 400. The first study on NIRM (Jin & Jeon, 2019) analyzed a real dataset that had roughly 300 participants and provided a simulation study comparing NIRM to a mixture IRT model. In that study, at least, a sample of 300 was adequate to outperform mixture IRT in detecting person clusters. However, more research is needed.
There are several other open questions in NIRM. For example, this manuscript provided two specifications for linkage between the latent spaces and two specifications for constructing the item networks. The differences between the 4 combinations of these choices have not been evaluated via simulation study. Two further concepts that may require further study are identification and regularization. It has been stated multiple times that the positions in NIRM are inherently meaningless. However, it may be possible to identify the locations in the model by fixing the positions of a small number of response patterns (e.g., d + 1) to certain positions in the latent space, for example, by constraining the response pattern of all 1’s to occur at a prespecified point. Furthermore, estimators of Euclidean distance may be positively biased when the true distance is in the neighborhood of 0. This may be concerning when attempting to cluster items, as inflated estimates of distance make the clustering process more difficult. A possible solution is to use a prior of the form described by Madigan & Raftery (1994), and this may also be a direction of future study.
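The positive bias of distance estimates near zero can be illustrated with a small simulation. This is only a sketch under simple assumptions (independent Gaussian noise added to two true positions in two dimensions), not the NIRM posterior; the function name, noise level, and seed are arbitrary.

```python
import random
from math import dist
from statistics import mean


def mean_estimated_distance(true_distance, noise_sd, n_rep=5000, seed=1):
    """Average distance between two noisy position estimates."""
    rng = random.Random(seed)
    a_true = (0.0, 0.0)
    b_true = (true_distance, 0.0)
    estimates = []
    for _ in range(n_rep):
        a_hat = tuple(c + rng.gauss(0.0, noise_sd) for c in a_true)
        b_hat = tuple(c + rng.gauss(0.0, noise_sd) for c in b_true)
        estimates.append(dist(a_hat, b_hat))
    return mean(estimates)


near_zero = mean_estimated_distance(0.0, 0.2)  # true distance is 0
far_apart = mean_estimated_distance(3.0, 0.2)  # true distance is 3
print(near_zero > 0.0, abs(far_apart - 3.0) < 0.1)
```

Because a distance cannot be negative, noise in the positions can only push an estimate of a zero distance upward, whereas for well-separated points the noise largely averages out. This is the situation in which a shrinkage-type prior might help.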
We have shown that modeling networks constructed from item responses in student assessment data can provide interesting insights into the structure of the data. First, it is a useful procedure for detecting items that do not display strong conformance to other items. The modeling procedure is particularly useful due to the ease with which the resulting latent spaces can be visualized. Compared to an IRT model, in which one must calculate summary information about parameter estimates, clusters of individuals and individual outliers are easily spotted from visual inspection of the latent space plots. Interestingly, even though under specific conditions IRT models are equivalent to Ising models, in this case they led to different conclusions.
References
- Ban, J. C., Hanson, B. A., Wang, T., Yi, Q., & Harris, D. J. (2001). A comparative study of on-line pretest item-calibration/scaling methods in computerized adaptive testing. Journal of Educational Measurement, 38(3), 191–212.
- Bock, R. D., & Zimowski, M. F. (1997). Multiple group IRT. In Handbook of modern item response theory (pp. 433–448). Springer.
- Borg, I., & Groenen, P. J. (2005). Modern multidimensional scaling: Theory and applications. Springer Science & Business Media.
- Browne, M. W. (1972). Oblique rotation to a partially specified target. British Journal of Mathematical and Statistical Psychology.
- Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 111–135). Beverly Hills, CA: Sage.
- Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2), 245–276.
- Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. doi:10.18637/jss.v048.i06
- De Ayala, R. J. (2013). The theory and practice of item response theory. Guilford Publications.
- DeMars, C. E. (2006). Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43(2), 145–168.
- Epskamp, S., Cramer, A. O., Waldorp, L. J., Schmittmann, V. D., & Borsboom, D. (2012). qgraph: Network visualizations of relationships in psychometric data. Journal of Statistical Software, 48(4), 1–18.
- Epskamp, S., Kruis, J., & Marsman, M. (2017). Estimating psychopathological networks: Be careful what you wish for. PLoS ONE, 12(6).
- Frisbie, D. A., & Sweeney, D. C. (1982). The relative merits of multiple true-false achievement tests. Journal of Educational Measurement, 29–35.
- Gibbons, R. D., Rush, A. J., & Immekus, J. C. (2009). On the psychometric validity of the domains of the PDSQ: An illustration of the bi-factor item response theory model. Journal of Psychiatric Research, 43(4), 401–410.
- Gollini, I., & Murphy, T. B. (2016). Joint modeling of multiple network views. Journal of Computational and Graphical Statistics, 25(1), 246–265.
- Handcock, M. S., Raftery, A. E., & Tantrum, J. M. (2007). Model-based clustering for social networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(2), 301–354.
- Haslbeck, J., Epskamp, S., Marsman, M., & Waldorp, L. (2018). Interpreting the Ising model: The input matters. arXiv preprint arXiv:1811.02916.
- Hoff, P. D., Raftery, A. E., & Handcock, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460), 1090–1098.
- Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2), 179–185.
- Jin, I. H., & Jeon, M. (2019). A doubly latent space joint model for local item and person dependence in the analysis of item response data. Psychometrika, 84(1), 236–260.
- Jin, I. H., Jeon, M., Schweinberger, M., & Lin, L. (2018). Hierarchical network item response modeling for discovering differences between innovation and regular school systems in Korea. arXiv preprint arXiv:1810.07876.
- Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20(1), 141–151.
- Kim, S. (2006). A comparative study of IRT fixed parameter calibration methods. Journal of Educational Measurement, 43(4), 355–381.
- Madigan, D., & Raftery, A. E. (1994). Model selection and accounting for model uncertainty in graphical models using Occam’s window. Journal of the American Statistical Association, 89(428), 1535–1546.
- Marsman, M., Borsboom, D., Kruis, J., Epskamp, S., van Bork, R., Waldorp, L., ... Maris, G. (2018). An introduction to network psychometrics: Relating Ising network models to item response theory models. Multivariate Behavioral Research, 53(1), 15–35.
- Oh, M. S., & Raftery, A. E. (2001). Bayesian multidimensional scaling and choice of dimension. Journal of the American Statistical Association, 96(455), 1031–1044.
- Raftery, A. E., Niu, X., Hoff, P. D., & Yeung, K. Y. (2012). Fast inference for the latent space network model using a case-control approximate likelihood. Journal of Computational and Graphical Statistics, 21(4), 901–919.
- Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14(3), 271–282.
- Sokal, R. R., & Sneath, P. H. (1961). Principles of numerical taxonomy.
- Spearman, C. (1925). Some issues in the theory of “g” (including the Law of Diminishing Returns). Nature Publishing Group.
- Steiger, J., & Lind, J. (1980). Statistically-based tests for the number of common factors. Paper presented at the annual meeting of the Psychometric Society.
- Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38(1), 1–10.
- van Borkulo, C., & Epskamp, S., with contributions from Alexander Robitzsch. (2016). IsingFit: Fitting Ising models using the eLasso method. R package version 0.3.1. https://CRAN.R-project.org/package=IsingFit
- Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8(2), 125–145.
- Zimek, A., Schubert, E., & Kriegel, H. P. (2012). A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 5(5), 363–387.