 # Inequality Constrained Multilevel Models

Multilevel or hierarchical data structures can occur in many areas of research, including economics, psychology, sociology, agriculture, medicine, and public health. Over the last 25 years, there has been increasing interest in developing suitable techniques for the statistical analysis of multilevel data, and this has resulted in a broad class of models known under the generic name of multilevel models. Generally, multilevel models are useful for exploring how relationships vary across higher-level units taking into account the within and between cluster variations. Research scientists often have substantive theories in mind when evaluating data with statistical models. Substantive theories often involve inequality constraints among the parameters to translate a theory into a model. This chapter shows how the inequality constrained multilevel linear model can be given a Bayesian formulation, how the model parameters can be estimated using a so-called augmented Gibbs sampler, and how posterior probabilities can be computed to assist the researcher in model selection.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Multilevel Models

### 1.1. Introduction

In many areas of research, datasets have a multilevel or hierarchical structure. By hierarchy we mean that units at a certain level are grouped or clustered into, or nested within, higher-level units. The “level” signifies the position of a unit or observation within the hierarchy. This implies that the data are collected in groups or clusters. Examples of clusters are families, schools, and firms. In each of these examples a cluster is a collection of units on which observations can be made. In the case of schools, we can have three levels in the hierarchy with pupils (level 1) within classes (level 2) within schools (level 3). The key thing that defines a variable as being a level is that its units can be regarded as a random sample from a wider population of units. For example, considering a multilevel data structure of pupils within classes within schools, the pupils are a random sample from a wider population of pupils and the classrooms are a random sample from a wider population of classrooms. Likewise the schools are a random sample from a wider population of schools. Data can then be collected at the pupil level (for example, a test score), at the classroom level (for example, teacher experience in years), and at the school level (for example, school’s mean socioeconomic status). Variables like gender and social class are not levels. This is because they have a small fixed number of categories. For example, gender has only two categories, male and female. There is no wider population of gender categories that male and female are a random sample from. Another usual form of clustering arises when data are measured repeatedly on the same unit, for instance a patient. In this case the measurements from each patient would be at level 1 and the patients would be at level 2.

In all cases the elements of a cluster share some common characteristics. Therefore, the observations within a cluster tend to be more alike than observations from different clusters, that is, they are correlated. For instance, in the pupils within classrooms example, pupils in the same classroom share some common characteristics (e.g., they have the same teachers); thus the test scores of pupils within a classroom will tend to be more alike than test scores from different classrooms. Multilevel data therefore have two sources of variation: In addition to the variation within clusters, the heterogeneity between clusters introduces an additional source of variation. Therefore, any analysis methods used should take the within cluster and between cluster variation into account. Because data can be clustered at more than a single level (e.g., pupils within classrooms within schools), data clustered at a single level (e.g., pupils within classrooms) are referred to as two-level data and the statistical models for the analyses are referred to as two-level models.

Multilevel or hierarchical data structures can occur in many areas of research, including economics, psychology, sociology, agriculture, medicine, and public health. Over the last 25 years, there has been increasing interest in developing suitable techniques for the statistical analysis of multilevel data, and this has resulted in a broad class of models known under the generic name of models. Generally, multilevel models are useful for exploring how relationships vary across higher-level units taking into account the within and between cluster variations. Considering an example of two-level data obtained on pupils within schools, there are two possible ways to deal with the data: either to focus separately on the pupils or on the schools. Focusing on the pupils by pooling

together the data from all the schools ignores differences between schools and thus suppresses variation that can be important. Ignoring the clustering will generally cause standard errors of regression coefficients to be underestimated. On the other hand, focusing on schools by analyzing the data of each school separately ignores a lot of information and consequently renders low power for inferences. Multilevel modeling offers a compromise between these two extremes and enables researchers to obtain correct inferences.

### 1.2. The Multilevel Model

In this chapter we will confine ourselves to two-level models for continuous data, with one single outcome or response variable that has been measured at the lowest level and explanatory variables (or covariates) that have been measured at levels

and . For the sake of consistency, level and level units will be referred to as individuals and groups, respectively. Stated otherwise, individuals will be nested within groups.

To fix ideas, suppose we have groups and individuals in each group such that the total number of individuals is . Furthermore, assume that one covariate has been measured at the individual level and one covariate has been measured at the group level and an outcome variable has been measured on each individual. As an illustration suppose we have data on mathematics grades from high school students from classes as well as information on student socioeconomic background and teacher experience in years. In this case, each of the classrooms would be a and the students would be the . Furthermore, would be the student level outcome variable “math grade,” would be student “socioeconomic status,” and would be “teacher experience” in years. Our interest is in modeling the outcome variable in terms of the individual level variable and the group level variable using a multilevel model. At the individual level, for individual (where for group ) within group ( groups in the sample) and , we have the following model:

 (1) ykj=π1j+π2jakj+εkj.

In (1), is the intercept, is the regression coefficient for the covariate , and is the residual error term. The residual errors

are assumed to have a normal distribution with mean

and variance

. Model (1) implies that each group has its own regression equation with an intercept and a slope . The next step in the modeling is to explain the variation of the regression coefficients and by introducing variables at group level:

 (2) π1j = β1+β2wj+u1j, (3) π2j = β3+β4wj+u2j,

where and are random residual error terms at group level. Note that in (2) and (3), the regression coefficients (’s) do not vary across groups and that is why they have no subscript on them. Since they apply to all groups, they are sometimes referred to as effects. Furthermore, all between group variation left in the coefficients after predicting them with the group variable is assumed to be random residual variation (at group level) which is captured by the terms and .

Substituting (2) and (3) into (1) renders the linear - regression model:

 (4) ykj=β1+β2wj+β3akj+β4akjwj+u1j+u2jakj+εkj.

The right-hand side of model (4) has two parts to it: a part , where the coefficients are fixed, and a part . Note that in practice one can have several covariates measured at both individual and group level. Therefore, model (4

) can be written in a slightly more general form using vector notation:

 (5) ykj=\boldmathxkj\boldmathβT+% \boldmathzkj\boldmathuTj+εkj,

where is a vector of predictors (including main effects at levels 1 and 2 as well as interactions between level 1 and level 2 covariates) having coefficients . Furthermore, is a vector of predictors having random effects at the group level and is an error term. In the example above, = (), = , = (), and . The vector of predictors will usually be a subset of the fixed-effects predictors , although this is not a necessary requirement. The random terms and are assumed to be mutually independent and normally distributed:

 (6) \boldmathuT∼N(\boldmath0,% \boldmathV),  εkj∼N(0,σ2),

where is the variance-covariance matrix of the random effects and is the residual variance. Thus, we can see that multilevel models provide a natural way to decompose complex patterns of variability associated with hierarchical structure.

In a frequentist analysis, estimation of parameters in the linear multilevel model is carried out by maximizing the likelihood function. To this end, direct maximization using the Newton-Raphson or Expectation-Maximization (EM) algorithm can be performed. For discussions on the methods, techniques, and issues involved in multilevel modeling in general, the interested reader is referred to

[5, 11, 12, 14, 18, 24, 26]. This chapter is intended to illustrate model selection for inequality constrained two-level models. A Bayesian approach will be used for parameter estimation and model selection . Bayesian estimation in multilevel models (without constraints on the model parameters) has also been implemented in the statistical package MLwiN .

## 2. Informative Inequality Constrained Hypotheses

Research scientists often have substantive theories in mind when evaluating data with statistical models. Substantive theories often involve inequality constraints among the parameters to translate a theory into a model; that is, a parameter or conjunction of parameters is expected to be larger or smaller than another parameter or conjunction of parameters. Stated otherwise and using as a generic representation of a parameter, we have that or for some two parameters and . Additionally, inequality constraints also play a pivotal role when competing theories are presented as an expression of a multitude of initial plausible explanations regarding a certain phenomenon on which data are collected. Consider the following examples on two common multilevel models: school effects models and individual growth models. These examples will be the thrust of Sections 4 and 5.

###### Example 1.

An educational researcher is interested in the effect of certain student and school level variables on mathematical achievement (mathach), and has obtained a dataset on students within schools. A students’ ethnic background (min), student socioeconomic status (ses), a schools’ average student socioeconomic status (mses), and the dichotomy between Catholic (cat) and public (pub) schools are hypothesized to be defining variables for the explanation of math achievement (cf. [2, 5, 7, 8, 23]). A possible formulation of the two-level model in the form (5) might be

 mathachkj = β1catj+β2pubj+β3msesj+β4catjseskj + β5pubjseskj+β6msesjseskj+β7minkj + u1j+u2jseskj+εkj.

The reason for assigning an indicator variable to both the Catholic and public category of the constituent dichotomy is because this will enable one to estimate the regression coefficients corresponding to the covariates and and their interactions with other covariates rather than estimating contrasts.

The researcher can think of different plausible models regarding the direction and (relative) strength of the effects of the mentioned variables on the response math achievement. Subsequently, the researcher expresses the idea that students in Catholic schools have higher math achievement than those in public schools {}. Certain sociological work found that students belonging to a minority have lower math achievement than students not belonging to an ethnic minority {}. Additionally, the researcher has the expectation that math achievement is positively related to socioeconomic status and that the effect of student socioeconomic status on mathematical achievement is more pronounced in public schools than in Catholic schools {}. These theories allow for several plausible models of differing complexity and with differing theoretical implications. The question of interest becomes: Which of the plausible models best fits the data?

###### Example 2.

A researcher in child and adolescent psychology has obtained observational data on substance abuse collecting multiple waves of data on adolescents. This researcher sets out to assess the effects of alcoholic intake among peers (peer) and the fact that the adolescent has alcoholic (coa) or nonalcoholic (ncoa) parents on the development of adolescent alcohol use (alcuse) (cf. [6, 24]). The model can be formulated as

 alcusekj = β1coaj+β2ncoaj+β3peerj+β4coajtkj+β5ncoajtkj + β6peerjtkj+u1j+u2jtkj+εkj,

where is a time variable.

For these data, competing theories abound in the researchers’ mind. A first plausible theory for him or her could be that adolescents with an alcoholic parent are more prone to have a higher alcoholic intake at baseline {}, as well as over time {}. A second plausible theory amends the first, with the additional expectation that for initial alcoholic intake, the effect of an alcoholic parent will be more influential than peer alcoholic intake {}, whereas for the time-dependent increase in alcoholic intake, peers will be more influential {}. The question of interest is: Which of the theories best fits the data?

The researchers’ hypotheses are in fact informative, as they are hypotheses in which one explicitly defines direction or (relative) strength of relationships based on prior information for usage in confirmatory data analysis. Informative hypotheses have a direct connection to model translations of theory. For instance, the researcher from Example 2 would be interested in the following two hypotheses that have been arrived at by translating substantive theories via constraints on model parameters:

 H1: {β1>β2},β3,{β4>β5},β6 versus H2: {β1>β2},{β1>β3},{β6>β4>β5}.

The pertinent question is: Given and , which of the two hypotheses has more support from the data?

A researcher might bring the classical or frequentist statistical viewpoint to bear on the central question of interest. One would then normally proceed to specify the traditional null hypothesis, which assumes that none of the covariate variables are associated with the response variable of interest against the alternative that at least one covariate variable is associated with the response variable:

 H0:all βi equal 0versusH3:not all βi equal 0.

There are several problems related to this procedure that leads one to infer little information regarding the actual hypotheses of interest, being and . Generally, in the usual frequentist sharp null hypothesis test setting, the researcher often starts from the idea that holds and then tests

using an appropriate test statistic. If we assume

, the vector containing all , is away from the zero vector , with but very small, then by the consistency of the testing procedure, the rejection of becomes the sure event for sufficiently large . One could then actually choose in accordance with the rejection of . More specifically, if the null hypothesis is rejected, no information is gained regarding the fit of the inequality constrained hypothesis of interest. Note that the research questions of actual interest are not directly incorporated into the alternative hypothesis. Post hoc directional tests are then usually employed with certain corrections on the maintained significance level to assess the inequalities deemed interesting in the actual research hypothesis. If one considers above, these post hoc tests would amount to assessing:

 (7) H01:β1=β2versusH11:β1−β2>0andH02: β4=β5versusH12:β4−β5>0.

The researcher is left with the situation in which several test results (those for the omnibus test and the post hoc tests) have to be combined to evaluate a single model translated theory. Such a situation may eventually force the researcher to make arbitrary choices. For example, how would one evaluate the situation where not all directional alternatives are accepted, or when the rather arbitrary significance threshold is surpassed by an arbitrarily small amount? Such problems abound especially in the social sciences where it is not uncommon to find situations where power is sufficient for obtaining significance somewhere while being insufficient to identify any specific effect . The power gap between a single test and a collection of tests often renders the situation in which the omnibus test proves significant in the sense that the obtained p-value is smaller than or equal to the pre-specified significance level, while the individual post hoc tests lack power such that successive testing efforts may find erratic patterns of “significant” p-values.

If the null hypothesis is not rejected when testing against , there is still a possibility that it could be rejected when testing it against the hypotheses of interest, namely and . Inequality constraints contain information, in the form of truncations of the parameter space, and when properly incorporated, more efficient inferences can result. To gain power, one could therefore specify inequality constrained alternatives more in tune with substantive theoretical beliefs, instead of the traditional alternative . This way the null hypothesis, if rejected, will be rejected in favor of the constrained alternative. Our researcher would then embark on testing

 H0: β1=β2=β4=β5=0 versus H4: β1−β2⩾0, β4−β5⩾0, and β1, β2, β4, and β5 do not all equal 0 and H0: β1=β2=β3=β4=β5=β6=0 versus H5: β1−β2⩾0, β1−β3⩾0, β6−β4⩾0, β4−β5⩾0, and β1, β2, β3, β4, β5 and β6 do not all equal 0

respectively, in order to convey more information regarding the model translated theories of interest. Yet again, there are certain problems that render the information to be inferred from these omnibus tests to be limited.

First, there is an important difference between tests of the form (2) and tests of the form (7). The former states that a directional effect is present when the alternative is accepted, but it does not give which of the constituent directional effects is significant. For such an evaluation one needs to resort to tests of the latter form, which takes us back to the problems associated with combining several test results to evaluate a single model translated theory as discussed earlier. Moreover, for complex models and multivariate settings there may not generally be optimal solutions for frequentist inequality constrained testing alternatives such as those in (2). The interested reader is referred to [1, 22] for overviews on the possibilities of frequentist inequality constrained hypothesis testing. But even if these frequentist alternatives were available, the researcher would still run into a problem when wanting to evaluate which theory or plausible model fits the data best. One possibility is to test the null hypothesis against each of the theories in the form of inequality constrained alternatives. This would help one to obtain some evidence for the support for each of the separate theories, but it would still not answer the question concerning which theory is best. It is very well possible that in all of the tests the null hypothesis is rejected in favor of the inequality constrained alternative.

To assess the researchers’ substantive theory in light of the available data, one needs to directly compare the constrained alternatives. This involves the simultaneous evaluation of multiple model translated theories, and for such an exercise, no frequentist possibilities are available. Therefore, Bayesian model selection is posed as an alternative to hypothesis testing. Posterior probabilities can be computed for all models under consideration, which enables the direct comparison of both nested and non-nested models. The incorporation of inequality constrained theory evaluation in a Bayesian computational framework has been formulated for multilevel models in . In the next section it will be shown how the inequality constrained multilevel linear model can be given a Bayesian formulation, how the model parameters can be estimated using a so-called augmented Gibbs sampler, and how posterior probabilities can be computed to assist the researcher in model selection. Those wishing to skip this section may find general information regarding Bayesian estimation and model selection in Chapters 3 and 4. Subsequently, the two examples described above will be analyzed in the inequality constrained Bayesian framework to elaborate model selection among competing inequality constrained model translated theories. This will be done in Sections 4 and 5. The chapter will be concluded with a discussion in Section 6.

## 3. Bayesian Estimation and Model Selection

### 3.1. Introduction

In Bayesian analysis, model specification has two parts to it:

1. The likelihood function

, which defines the probability distribution of the observed data

conditional on the unknown (model) parameters

2. The prior distribution of the model parameters .

Bayesian inference proceeds via specification of a posterior distribution for , which is obtained by multiplying the likelihood and the prior distribution:

 (9) p(\boldmathθ|\boldmathD)=f(\boldmathD|\boldmathθ)p(\boldmathθ)m(% \boldmathD)∝f(\boldmathD|\boldmathθ)p(\boldmathθ),

where is the marginal distribution of . The posterior distribution contains the state of knowledge about the model parameters given the observed data and the knowledge formalized in the prior distribution. Random draws from the posterior distribution are then used for inferences and predictions. In the sequel it will be explained how samples can be drawn from the posterior distribution.

For (5), the likelihood is

 (10) J∏j=1∫\boldmathuj⎧⎨⎩Nj∏k=11√2πσexp⎛⎝−(ykj−\boldmathxkj\boldmathβT−\boldmathzkj\boldmathu% Tj)2σ2⎞⎠⎫⎬⎭p(\boldmathuj∣\boldmath0,\boldmathV) d\boldmathuj,

where = (), = (, , ), and is a normal distribution with mean and covariance matrix .

Suppose we have a total of competing hypotheses or model translated theories for , where is the encompassing model (in the remainder of the text we will use the terms “hypothesis” and “model” interchangeably). The encompassing model is one where no constraints are put on the (model) parameters and therefore all other models are nested in . If denotes the prior distribution of , then it follows that the prior distribution of for is

 (11) p(\boldmathθ|Hs)=p(\boldmathθ|H1)I\boldmathθ∈Hs∫p(\boldmathθ|H1)I\boldmathθ∈Hsd\boldmathθ.

The indicator function if the parameter values are in accordance with the restrictions imposed by model , and otherwise. Equation (11) indicates that for each model under investigation, the constraints imposed on the model parameters are accounted for in the prior distribution of the respective model. Using independent prior distributions for each of the model parameters, the prior distribution of the unconstrained encompassing model can be written as the product

 (12) p(\boldmathθ|H1)=p(\boldmathβ)×p(\boldmathV)×p(σ2),

where , , and are the prior distributions of , , and , respectively. In order to obtain a conjugate model specification, normal priors will be used for the fixed effects , a scaled inverse prior for , and an inverse Wishart prior for . It follows that for the unconstrained encompassing model , the posterior distribution of the parameters in is proportional to the product of (10) and (12).

In what follows, it is explained how prior distributions for , , and will be specified. As mentioned in Chapter 4 (see also [15, 17]), the encompassing prior should not favor the unconstrained or any of the constrained models. Because all constraints are on the parameters in the vector , each of the s will be assigned the same prior distribution. In general, the estimate for the regression coefficient

in a linear regression model with no covariates,

, where is the dependent variable and is an error term, is the mean of (i.e., ). Each of the parameters in will therefore be assigned a normal distribution with mean equal to the mean of the response variable (from the data) and a large variance chosen so that the prior has minimal influence on the posterior distribution of the parameter. The prior distribution of will also be data based – will be assigned a scaled inverse -distribution with degree of freedom and scale equal to the variance of the response variable. Lastly, will be assigned an inverse Wishart prior distribution with degrees of freedom and as scale matrix the identity matrix where is the dimension of . Estimating the covariance matrix is challenging especially when . This is because each of the correlations (between the components of in (6)) has to fall in the interval and must also be positive definite. Setting the degrees of freedom to

ensures that each of the correlations has a uniform distribution on

(). Although setting the degrees of freedom to ensures that the resulting model is reasonable for the correlations, it is quite constraining for the estimation of the variance terms in . Therefore, when , it is recommended to model using a scaled inverse Wishart distribution. The interested reader is referred to  for more details on the implementation.

### 3.2. Estimation

In this section it is explained how samples can be obtained from the posterior distribution of

and how they can be used for inferences. With conjugate prior specifications, in (

12), the full conditional distributions of and are inverse Wishart and scaled inverse distributions, respectively, and the full conditional distribution of each parameter in the vector of fixed effects is a normal distribution.

The Gibbs sampler (see, for example, [9, 15, 25]), which is an iterative procedure, can be used to sample from the conditional distribution of each model parameter – the set of unknown parameters is partitioned and then each parameter (or group of parameters) is estimated conditional on all the others. To sample from the posterior distribution of the encompassing model described in Section 3.1, first initial values are assigned to each of the model parameters. Next, Gibbs sampling proceeds in four steps, namely:

• Sample for from where

 \boldmathΦj=\boldmathΣjσ2Nj∑k=1\boldmathzTkj(ykj−\boldmathxkj\boldmathβT)

and

 \boldmathΣj=⎡⎢⎣∑Njk=1\boldmathzTkj\boldmathzkjσ2+\boldmathV−1⎤⎥⎦−1.
• If the prior distribution of

is an inverse chi-square distribution with degrees of freedom

and scale , then sample from a scaled inverse -distribution with degrees of freedom and scale

 γω2+J∑j=1Nj∑k=1(ykj−\boldmath% xkj\boldmathβT−\boldmathzkj% \boldmathuTj)2.
• If the prior distribution of is an inverse Wishart distribution with degrees of freedom and scale matrix , sample from an inverse Wishart distribution with degrees of freedom and scale matrix

 J∑j=1\boldmathuj\boldmathuTj+% \boldmathT.
• Let = . If the prior distribution of is a normal distribution with mean and variance , then sample from a normal distribution with mean

 μpτ2p+σ−2∑Jj=1∑Njk=1[ykj−∑Pi=1i≠pβixikj−∑Qq=1uqjzqkj]xpkjτ−2p+σ−2∑Jj=1∑Njk=1x2pkj

and variance

 ⎡⎢⎣1τ2p+∑Jj=1∑Njk=1x2pkjσ2⎤⎥⎦−1.

Effectively, the Gibbs sampler starts with initial values for all the parameters and then updates the parameters in turn by sampling from the conditional posterior distribution of each parameter. Iterating the above four steps produces a sequence of simulations , , , ; , , , ; , , , ; and so on until the sequence has converged. The first set of iterations, referred to as the burn-in, must be discarded since they depend on the arbitrary starting values. See Chapter 3 and references therein for more information on convergence diagnostics for the Gibbs sampler.

After convergence, samples drawn from the posterior distribution can be used to obtain parameter estimates, posterior standard deviations, and central credibility intervals. See, for example,

. To elaborate, suppose that and we have a sample , from the posterior distribution. To estimate the posterior mean of , a researcher would use

 (13) 1BB∑b=1β(b)1,

and a central credibility interval (CCI) for would be obtained by taking the empirical and quantiles of the sample of values. Furthermore, estimates of functions of parameters can also be obtained. For instance, suppose an estimate for the posterior mean of and a credibility interval is required. This is easily obtained by taking the difference , , and using the computed values to obtain the posterior mean and credibility interval. Samples from the posterior distribution can also be used to draw histograms to display the distributions of parameters and functions of parameters.

### 3.3. Model Selection

If and

denote the prior probability and marginal likelihood of model

, respectively, then the posterior model probability (PMP) of is

 (14) PMP(Hs | \boldmathD)=m(\boldmathD | Hs)p(Hs)∑Ss′=1m(\boldmathD | Hs′)p(Hs′).

The method of encompassing priors (see [15, 17] and Chapter 4), can be used to obtain posterior probabilities for each model under investigation. If and are the proportions of the prior and posterior distributions of that are in agreement with the constraints imposed by model

, then the Bayes factor

comparing to is the quantity . Note that for each constrained model , the quantities and provide information about the complexity (“size” of the parameter space) and fit of , respectively. Subsequently, if is the encompassing model and assuming that each model is a priori equally likely, it follows that

 (15) PMP(Hs|\boldmathD)=BFs1BF11+BF21+⋯+BFS1,

for each and . In practice, therefore, one only needs to specify the prior distribution and correspondingly the posterior distribution of the encompassing model. Next, samples are drawn from the specified prior and posterior distributions, which are then used to determine the quantities and . Subsequently, posterior probabilities can be computed using (15) and the model with the highest posterior probability is considered to be the one that gets the highest support from the data. If the model with the highest posterior probability is one of the constrained models, then parameter estimates for the model can be obtained using the Gibbs sampling procedure presented in Section 3.2 with an extra step, namely that the ’s are sampled from truncated normal distributions (see Chapter 3).

Note that if a diffuse encompassing prior is used, then for the class of models with strict inequality constraints, such as or , the PMPs obtained will not be sensitive to the prior specification. However for models with equality constraints, such as or , PMPs strongly depend on the actual specification of the encompassing prior. For details on this, the interested reader is referred to Chapter 4 and [15, 16, 17]. In this chapter, models with equality constraints are not considered, so sensitivity of PMPs to the choice of encompassing prior is not an issue.

## 4. School Effects Data Example

### 4.1. Data

The data used in this section are a subsample of the 1982 High School and Beyond Survey.111This data collection provides the second wave of data in a longitudinal, multi-cohort study of American youth conducted by the National Opinion Research Center on behalf of the National Center for Education Statistics. In the first wave, conducted in 1980, data were collected from 58,270 high school students and 1015 secondary schools by self-enumerated questionnaires, personal and telephone interviews, and mailback questionnaires. It includes information on 7,185 students nested within 160 schools. Data were obtained from http://www.ats.ucla.edu/stat/paperexamples/singer/default.htm.

The data set includes the following variables:

1. : The response variable, which is a standardized measure of mathematics achievement. The variable has mean , standard deviation , and range to 24.99.

2. : A composite and centered indicator of student socioeconomic status. It was a composite of parental education, parental occupation, and income. The variable ses has mean 0.00014, standard deviation , and range to 2.69.

3. : A student level dummy variable that was coded as

if the student belonged to a minority and otherwise. Numbers of minority and nonminority students were 1974 and 5211, respectively.

4. : School level variable indicating the average of student ses values within each school. As ses was centered around its mean a score of 0 can be interpreted as indicating a school with average (in fact average average) student ses values, whereas and 1 indicate schools with below average and above average student ses values respectively. The variable mses has mean , standard deviation , and range to 0.83.

5. : School level dichotomous variable where indicates a Catholic school and indicates a public school. Numbers of Catholic and public schools were 70 and 90, respectively.

Let and respectively represent the math achievement and student socioeconomic status for the kth student in the th school . Let be an indicator variable defined to be 1 if subject in school belongs to an ethnic minority, and 0 otherwise. Furthermore, let and be school level indicator variables defined to be 1 if a school is Catholic or public, respectively, and otherwise. It should be noted that the variable is equivalent to the original variable sector. The reason for defining a new indicator variable pub is because in a regression model, this will make it possible to estimate the regression coefficients corresponding to the covariates and and their interactions with other covariates rather than estimating contrasts. Furthermore, defining variables in this way enables one to put constraints on the model parameters. Finally, let represent the continuous school level variable meanses.

### 4.2. Theory and Models

Research into child and adolescent mathematical achievement has spurred a vast stream of sociological, psychological, and educational literature; see, for example, [2, 5, 7, 8, 23]. Van den Berg, Van Eerde, and Klein  conducted research into the mathematical skills of ethnic minorities in the Dutch elementary school system. They concluded that children from ethnic minorities have less mathematical ability/maturity than children from the native Dutch population. These effects were, in their view, attributable to a language barrier and the differential use of educational skills between the home and the school environment. These effects are expected to persist throughout high school. Gamoran  found that Catholic schools produce higher overall math achievement in comparison to public schools. The (partial) explanation for this was found in the manner in which Catholic schools implement academic tracking. In addition, [5, 23] have indicated that higher math achievement occurs in schools where the average student socioeconomic status is higher. It is these expectations we want to express in a set of informative hypotheses.

Assuming a linear relationship between a student’s mathematics achievement, ses and min, the relationship can be modeled using

 mathachkj=π1j+π2jseskj+π3jminkj+εkj,

where

 π1j = β1catj+β2pubj+β3msesj+u1j, π2j = β4catj+β5pubj+β6msesj+u2j, π3j = β7,

and with

 \boldmathu=(u1j,u2j)T∼N(\boldmath0,\boldmathV),  εkj∼N(0,σ2).

Thus, the school-specific intercepts () and effects () are related to the type of school and average socioeconomic status of the school. Note that the coefficient does not vary across schools. To keep things simple we are assuming it has the same value for each school . Making the coefficient differ for each school, say by having = , would give rise to a covariance matrix for . Effectively, the extra term introduces three new variance components, namely cov(), cov(), and var() that have to be estimated from the data.

The following competing inequality constrained model translated theories will be compared:

 H1 : β1, β2, β3, β4, β5, β6, β7, H2 : {β1>β2}, β3, β4, β5, β6, β7, H3 : β1, β2, β3, β4, β5, β6, β7<0, H4 : {β1>β2}, β3, β4, β5, β6, β7<0, H5 : {β1>β2}, β3, {β4<β5}, β6, β7<0, H6 : {β1>β2}, β3, {β4>β5}, β6, β7<0.

Model 1 is the unconstrained encompassing model. Model 2 expresses the idea that students in Catholic schools have higher math achievement than those in public schools . Model 3 expresses the viewpoint that students belonging to a minority will have lower math achievement than students not belonging to an ethnic minority. As is an indicator variable defined to be 1 if subject in school belongs to an ethnic minority, the previous expectation means that should be negative, so that . Model 4 combines the viewpoints in models 2 and 3, namely that student in Catholic schools perform better than those in public schools and that students belonging to ethnic minorities perform worse than those not belonging to an ethnic minority. Model 5 expresses the viewpoints of model 4, with the additional expectation that the slopes for ses are higher in public compared to Catholic schools . Lastly, model 6 expresses the viewpoints of model 4, with the additional expectation that the slopes for ses are higher in Catholic compared to public schools .

### 4.3. Results

As mentioned before, Bayesian analysis requires specification of prior distributions for all unknown parameters in the encompassing model (). For all analyses diffuse priors were used. The regression coefficients were each given normal prior distributions with mean and variance (that is, standard deviation ). What this means is that each of the coefficients is expected to be in the range (, ), and if the estimates are in this range, the prior distribution is providing very little information in the inference. Because the outcome and all predictors have variation that is of the order of magnitude , we do not expect to obtain coefficients much bigger than 20, so prior distributions with standard deviation are noninformative. The variance covariance matrix was given an inverse Wishart prior distribution with degrees of freedom and as scale matrix a identity matrix. Lastly, was given a scaled inverse prior distribution with degree of freedom and scale .

To obtain posterior model probabilities for the competing models, samples (after a burn-in of ) were drawn from the prior and posterior distributions of the encompassing model (), respectively. For each of the constrained models , the proportion of samples from prior and posterior in agreement with the constraints on were used to estimate the posterior probabilities of each model. Table 1 shows the resulting estimated posterior probabilities, which express prior knowledge (model translated theories using inequality constraints) being brought up to date with empirical data. As can be seen in Table 1, gets most support from the data suggesting that, on average, students in Catholic schools have higher math achievement than those in public schools and that student level socioeconomic status is positively associated with mathematics achievement with public schools having higher slopes than Catholic schools. This is in line with the findings in . Lastly, model 5 also suggests that students from an ethnic minority have lower math achievement than those who are not from a minority. These findings are similar to what was observed in a sample of children from the Netherlands . It is worthwhile to note that models 2 and 3 are nested in model 5, implying that in a sense there is more evidence to support model 5 than just the PMP of . Stated otherwise, if models and were not part of the competing set of models, the PMP of model would have been bigger than .

Subsequently, estimates for parameters of model were obtained using constrained Gibbs sampling. Posterior distributions of the model parameters were monitored for iterations after a burn-in of and were summarized by posterior means, standard deviations, and central credibility intervals. These are displayed in Table 2.

Relating the estimates to the theories behind model , it can be concluded that controlling for all other predictors in the model:

1. Average predicted score for mathematics achievement is higher for Catholic than public schools. The average predicted mathematics achievement scores for students who are not minorities in schools with are 14.33 and 12.67 for Catholic and public schools, respectively.

2. Students belonging to ethnic minorities have lower mathematics achievement than those who are not from minorities. The coefficient for implies that the average predicted difference in mathematics achievement scores between students from minorities and nonminorities is 2.76.

3. Student level ses is positively associated with mathematics achievement with public schools having higher slopes than Catholic schools; for schools with average student values (i.e., ), each extra unit of corresponds to an increase of and in average mathematics achievement for public and Catholic schools, respectively. Furthermore, in both Catholic and public schools, the student level ses effect on math achievement increases with increasing meanses. Stated otherwise, the importance of as a predictor for math achievement is more pronounced for schools with higher values of .

## 5. Individual Growth Data Example

### 5.1. Data

As part of a larger study regarding substance abuse, Curran, Stice, and Chassin  collected 3 waves of longitudinal data on 82 adolescents. Beginning at age 14, each year the adolescents completed a 4-item instrument that sought to assess their alcohol consumption during the previous year. Using an 8-point scale (ranging from 0 = “not at all”, to 7 = “every day”), the adolescents described the frequency with which they (1) drank beer or wine, (2) drank hard liquor, (3) had 5 or more drinks in a row, and (4) got drunk. The data were obtained from http://www.ats.ucla.edu/stat/examples/alda/.

The dataset includes the following variables:

1. : The dependent variable. This (continuous) variable was generated by computing the square root of the mean of participants’ responses across its constituent variables (the frequency with which the adolescents (1) drank beer or wine, (2) drank hard liquor, (3) had 5 or more drinks in a row, and (4) got drunk). The variable has mean 0.92 and standard deviation 1.06 (range 0 to 3.61).

2. : Variable indicating age of adolescent.

3. : A measure of alcohol use among the adolescent’s peers. This predictor was based on information gathered during the initial wave of data collection. Participants used a 6-point scale (ranging from 0 = “none”, to 5 = “all”) to estimate the proportion of their friends who (1) drank alcohol occasionally and (2) drank alcohol regularly. This continuous variable was generated by computing the square root of the mean of participants’ responses across its constituent variables. The variable has mean 1.02 and standard deviation (range 0 to 2.53)

4. : A dichotomous variable where a indicates that an adolescent is a child of an alcoholic parent. Of the 246 adolescents, 111 are children of alcoholic parents and the rest are children of nonalcoholic parents.

Now let and be the response (alcohol use) and age, respectively, for the jth subject at age . Next, let , where denotes the standard deviation of . It follows that corresponds to the baseline age of . Also, let and be indicator variables defined to be 1 if the subject is the child of an alcoholic or not the child of an alcoholic parent, respectively, and 0 otherwise. Additionally, let be the centered and scaled measure of alcohol use among the adolescent’s peers obtained by subtracting the mean and dividing by two standard deviations. In regression models that include both binary and continuous predictors, scaling the continuous predictors by dividing by standard deviations rather than standard deviation ensures comparability in the coefficients of the binary and continuous predictors [10, 11]. Note that for interactions between two continuous variables, say and , each of the variables is scaled before taking their product; that is, the interaction term is not obtained by scaling (). It is the product of and , where and denote the mean and standard deviation of , respectively.

### 5.2. Theory and Models

Previous longitudinal latent growth models have been used to examine the relation between changes in adolescent alcohol use and changes in peer alcohol use. Curran, Stice, and Chassin  found that peer alcohol use was predictive of increases in adolescent alcohol use. Furthermore, Singer and Willett  have shown that adolescents with an alcoholic parent tended to drink more alcohol as compared to those whose parents were not alcoholics. Additionally, it is expected that with regard to initial adolescent alcohol use, an alcoholic parent may be of more influence than peers, whereas for rate of change with regard to alcohol intake, peers may have more influence. It is these expectations we want to investigate in a model and accompanying informative hypotheses.

Assuming that the profiles of each subject can be represented by a linear function of time, the model can be written as

 alcusekj=π1j+π2jtkj+εkj,

where

 π1j = β1coaj+β2ncoaj+β3speerj+u1j, π2j = β4coaj+β5ncoaj+β6speerj+u2j,

and

 \boldmathu=(u1j,u2j)T∼N(\boldmath0,\boldmathV),  εkj∼N(0,σ2).

Thus, the subject-specific intercepts () and time effects () are related to peer alcohol use and whether parent(s) is/are alcoholic or not.

The following competing models will be compared:

 H1 : β1, β2, β3, β4, β5, β6, H2 : {β1>β2}, β3, β4, β5, β6, H3 : {β1>β3}, β2, {β4<β6}, β5, H4 : {β1>β2}, β3, {β4>β5}, β6.

Model is the unconstrained model. Model expresses the theory that adolescents with an alcoholic parent are more prone to higher alcohol use at baseline . Model expresses the theory that with regard to an adolescent’s alcohol use, parents have more influence than peers at baseline , whereas over time peers have more influence . Model expresses the theory that adolescents with an alcoholic parent are more prone to higher alcohol use at baseline , as well as over time .

### 5.3. Results

The prior distributions for the parameters in the encompassing model were specified as follows. The regression coefficients were each given normal prior distributions with mean and variance . The variance covariance matrix V was given an inverse Wishart prior distribution with degrees of freedom and a identity matrix as scale matrix. Turning to the prior on , we used a scaled inverse -distribution with degree of freedom and scale . Subsequently, samples (after a burn-in of ) were drawn from the prior and the posterior distributions of the encompassing model, respectively. For each of the models , , and , the proportion of samples from prior and posterior distribution of in agreement with the constraints on were used to estimate the posterior probabilities of each model. These are displayed in Table 3.

The posterior probabilities suggest that the support in the data is highest for model . Subsequently, estimates for parameters of model were obtained using constrained Gibbs sampling. Posterior distributions of the model parameters were monitored for iterations after a burn-in of and were summarized by posterior means, standard deviations, and central credibility intervals, which are presented in Table 4. Looking at the PMPs for models and in Table 3 suggests that model is not much worse than . In Table 4, the estimate for is less than that of ; this is opposite to the constraint of model . This suggests that the reason why model has a higher PMP than model is because the constraint on the parameters and in model is not in accordance with the data, whereas model does not put any constraints on these parameters.

Based on the estimates in Table 4, the following can be concluded:

1. Controlling for peer alcohol use, baseline (age = ), adolescent alcohol use was higher in children of alcoholics than in children with nonalcoholic parents. The difference in average baseline alcohol use was with central credibility interval ().

2. Since is the coefficient for