# Marginal log-linear models and mediation analysis

We review some little-known results about marginal log-linear models, derive some new ones and show how they may be relevant to mediation analysis within logistic regression. In particular, we elaborate on the relation between interaction parameters defined within different marginal distributions and describe an algorithm for estimating the same interaction parameters within different marginals.

11/02/2017


## 1 Introduction

Marginal log-linear models, Bergsma and Rudas (2002), were conceived to construct discrete multivariate distributions subject to restrictions imposed, simultaneously, on different marginals. Consider the simple context where $X$ denotes a treatment, $W$ one or more variables which might be affected by $X$ and may influence the response $Y$ which, for simplicity, we assume to be binary. In this context, we might be interested in the marginal distributions $XY$ and $XW$ in addition to the joint distribution $XWY$.

### 1.1 Notations and preliminary results

A list of variables, say $X,W,Y$, shortened as $XWY$, will be used to denote both a marginal distribution and the interaction among the variables in the list; let $\mathcal{I}$, $\mathcal{M}$ denote two such lists with $\mathcal{I}\subseteq\mathcal{M}$; $\boldsymbol{\lambda}_{\mathcal{I};\mathcal{M}}$ will denote the log-linear interactions defined within the marginal $\mathcal{M}$, coded either as contrasts between adjacent categories (Ac) or with respect to a reference category (Rc) depending on the context; in both cases, variables in $\mathcal{M}\setminus\mathcal{I}$ will be set to the initial or reference category coded as 0. When $X$ and $W$ are quantitative, the linear logistic model including the $XW$ interaction has the form

$$\log\frac{P(Y=1\mid X=x,W=w)}{P(Y=0\mid X=x,W=w)}=\beta_0+\beta_X x+\beta_W w+\beta_{XW}xw \qquad (1)$$

where the value of the interaction parameter $\beta_{XW}$ depends on whether the Ac or the Rc coding is used.
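To make this concrete, the following sketch (a toy construction of this note, not part of the original text) builds a 2×2×2 joint table from known coefficients of (1) under the Rc coding and recovers them as differences of log odds; the helper `logistic_params` and the uniform distribution over $(X,W)$ are assumptions.

```python
import numpy as np

# Hypothetical sketch: recover the coefficients of the linear logistic
# model (1) from a 2x2x2 joint table p[x, w, y] under the reference
# category (Rc) coding; beta_XW is a difference of log odds ratios.
def logistic_params(p):
    logit = np.log(p[:, :, 1] / p[:, :, 0])   # logit(x, w); P(x, w) cancels
    b0 = logit[0, 0]
    bX = logit[1, 0] - logit[0, 0]
    bW = logit[0, 1] - logit[0, 0]
    bXW = logit[1, 1] - logit[1, 0] - logit[0, 1] + logit[0, 0]
    return b0, bX, bW, bXW

# Build a joint table from known parameters, with P(X=x, W=w) = 1/4.
b = dict(b0=-0.5, bX=0.8, bW=0.3, bXW=0.4)
p = np.empty((2, 2, 2))
for x in (0, 1):
    for w in (0, 1):
        eta = b['b0'] + b['bX']*x + b['bW']*w + b['bXW']*x*w
        p[x, w, 1] = 0.25 / (1 + np.exp(-eta))
        p[x, w, 0] = 0.25 - p[x, w, 1]
```

Because the marginal probability of each $(x,w)$ cell cancels in the logit, the coefficients are recovered exactly from the joint table.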

To introduce the mixed parametrization, recall that in a general multi-way table with $k$ cells, the saturated model may be parameterized as

$$\log\boldsymbol{p}=\boldsymbol{G}\boldsymbol{\theta}-\boldsymbol{1}_k\log\left[\boldsymbol{1}_k'\exp(\boldsymbol{G}\boldsymbol{\theta})\right], \qquad (2)$$

where $\boldsymbol{G}$ is a design matrix with linearly independent columns which do not span the unitary vector and $\boldsymbol{\theta}$ is a vector of log-linear (canonical) parameters. Let $\boldsymbol{L}$ be the left inverse of $\boldsymbol{G}$ such that $\boldsymbol{L}\boldsymbol{G}=\boldsymbol{I}$; then (2) may be inverted as $\boldsymbol{\theta}=\boldsymbol{L}\log\boldsymbol{p}$; note the one-to-one correspondence between the rows of $\boldsymbol{L}$, the columns of $\boldsymbol{G}$ and the log-linear parameters. Define the vector of mean parameters $\boldsymbol{\mu}=\boldsymbol{G}'\boldsymbol{p}$; clearly there is a one-to-one correspondence between elements of $\boldsymbol{\mu}$ and $\boldsymbol{\theta}$. Let $\boldsymbol{G}_{\mathcal{I}}$ be the collection of columns of $\boldsymbol{G}$ that correspond to the set of interactions in $\mathcal{I}$; then the vector $\boldsymbol{\mu}_{\mathcal{I}}=\boldsymbol{G}_{\mathcal{I}}'\boldsymbol{p}$ has the same size as $\boldsymbol{\theta}_{\mathcal{I}}$.

Given a partition of the collection of all possible interactions for the joint distribution into two disjoint sets $\mathcal{I}$, $\mathcal{M}$, the mixed parametrization (Barndorff-Nielsen, 1978, pag. 121-22) is made of $(\boldsymbol{\mu}_{\mathcal{I}},\boldsymbol{\theta}_{\mathcal{M}})$ and has the following properties:

###### Lemma 1.

(i) there is a one-to-one mapping between $(\boldsymbol{\mu}_{\mathcal{I}},\boldsymbol{\theta}_{\mathcal{M}})$ and $\boldsymbol{p}$, (ii) the two components of the mixed parametrization are variation independent and (iii) the expected information matrix is block diagonal.

The following results on the differential properties of the mixed parametrization will be used later: let $\mathcal{I}\subseteq\mathcal{M}$ and let $\boldsymbol{p}_{\mathcal{M}}$ denote the distribution within the marginal $\mathcal{M}$; let $\boldsymbol{\mu}_{\mathcal{I}}=\boldsymbol{G}_{\mathcal{I}}'\boldsymbol{p}_{\mathcal{M}}$; then we have (see Forcina, 2012, Lemmas 3, 4):

###### Lemma 2.
$$\boldsymbol{C}_{\mathcal{I}}=\frac{\partial\boldsymbol{\mu}_{\mathcal{I}}}{\partial\boldsymbol{\theta}_{\mathcal{I}}'}=\frac{\partial\boldsymbol{\mu}_{\mathcal{I}}}{\partial\boldsymbol{\eta}_{\mathcal{I},\mathcal{M}}'}=\boldsymbol{G}_{\mathcal{I}}'\,\boldsymbol{\Omega}(\boldsymbol{p})\,\boldsymbol{G}_{\mathcal{I}};$$

in addition, $\boldsymbol{C}_{\mathcal{I}}$ is symmetric and positive definite if the elements of $\boldsymbol{p}$ are strictly positive.
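The objects above can be illustrated on a 2×2 table; in the following minimal sketch the design matrix `G`, its left inverse `L` and the parameter values are toy assumptions, used only to check numerically the inversion of (2) and the expression for $\boldsymbol{C}_{\mathcal{I}}$ in Lemma 2 with $\boldsymbol{\Omega}(\boldsymbol{p})=\mathrm{diag}(\boldsymbol{p})-\boldsymbol{p}\boldsymbol{p}'$.

```python
import numpy as np

# Toy 2x2 table with cells ordered (00, 01, 10, 11) under Rc coding.
G = np.array([[0, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [1, 1, 1]], dtype=float)        # columns: A, B, AB
L = np.array([[-1, 0, 1, 0],
              [-1, 1, 0, 0],
              [ 1, -1, -1, 1]], dtype=float)  # L @ G = I and L @ 1 = 0

theta = np.array([0.2, -0.4, 0.7])            # canonical parameters
eta = G @ theta
p = np.exp(eta) / np.exp(eta).sum()           # saturated model (2)

assert np.allclose(L @ np.log(p), theta)      # inversion: theta = L log p

mu = G.T @ p                                  # mean parameters mu = G'p
Omega = np.diag(p) - np.outer(p, p)
C = G.T @ Omega @ G                           # C = G' Omega(p) G as in Lemma 2
assert np.allclose(C, C.T)                    # symmetric
assert np.all(np.linalg.eigvalsh(C) > 0)      # positive definite when p > 0
```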

## 2 Main results

It is well known that the parameters in the marginal logistic models for $XY$, $XW$ and $WY$ do not determine those in (1); the mixed parametrization makes it possible to sharpen this result as follows:

###### Proposition 1.

(i) The parameters of the three logistic regression models defined on the marginals $XY$, $XW$, $WY$ are variation independent from $\boldsymbol{\theta}_{XWY}$. (ii) If $\boldsymbol{\theta}_{XWY}$ is given, then the parameters of the three marginals determine uniquely the joint distribution.

Proof: the log-linear parameters within the marginals are uniquely determined by the set of mean parameters $(\boldsymbol{\mu}_{XY},\boldsymbol{\mu}_{XW},\boldsymbol{\mu}_{WY})$, which are variation independent from $\boldsymbol{\theta}_{XWY}$. The above list of mean parameters together with $\boldsymbol{\theta}_{XWY}$ constitutes a mixed parametrization of the joint distribution; thus (ii) follows from Lemma 1.

###### Remark 1.

In principle, under (ii), the parameters in (1) could be written as functions of the mean parameters; algorithm A2 in Forcina (2012) provides an efficient and accurate numerical alternative.
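The following is a generic Newton-type sketch of such a numerical inversion on a 2×2 table, not algorithm A2 of Forcina (2012) itself: given target mean parameters for the main effects and a fixed interaction, it solves for the corresponding canonical parameters; the function names and the toy design matrix are assumptions of this note.

```python
import numpy as np

# Hedged sketch: invert the mixed parametrization of a 2x2 table by
# Newton iteration, i.e. find the canonical parameters whose mean
# parameters match a target while the interaction theta_AB is held fixed.
G = np.array([[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 1, 1]], float)

def probs(theta):
    eta = G @ theta
    e = np.exp(eta - eta.max())
    return e / e.sum()

def invert_mixed(mu_target, theta_AB, iters=50):
    theta = np.array([0.0, 0.0, theta_AB])
    for _ in range(iters):
        p = probs(theta)
        mu = (G.T @ p)[:2]                    # mean parameters of A, B
        Omega = np.diag(p) - np.outer(p, p)
        C = (G.T @ Omega @ G)[:2, :2]         # Jacobian d mu / d theta'
        theta[:2] += np.linalg.solve(C, mu_target - mu)
    return theta

# Usage: recover known canonical parameters from their mean parameters.
theta_true = np.array([0.3, -0.2, 0.5])
mu_t = (G.T @ probs(theta_true))[:2]
theta_hat = invert_mixed(mu_t, 0.5)
```

The Jacobian used in the update is exactly the matrix $\boldsymbol{C}_{\mathcal{I}}$ of Lemma 2, which is what makes the Newton step well defined when all cell probabilities are positive.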

For the model in (1), Stanghellini and Doretti (2019) derived an expression for the difference between $\beta_X$ and the regression coefficient of $X$ in the linear logistic model defined within the marginal $XY$ distribution. For the case of a multivariate discrete distribution on a set of binary random variables, an expression for the difference between the same interaction parameters defined within two different marginals was derived by Evans (2015), Theorem 3.1. In the Appendix we rewrite the latter result in the case where $X$ and $W$ are discrete and show that, by setting $\mathcal{I}=XY$ and $\mathcal{M}=XWY$, the two results are essentially equivalent.

The following provides some additional insights into the relation between interaction parameters defined within different marginals:

###### Proposition 2.

Suppose that $\boldsymbol{\lambda}_{\mathcal{I};\mathcal{M}}$ and $\boldsymbol{\theta}_{\mathcal{I}}$ have the same size; then

$$\frac{\partial\boldsymbol{\lambda}_{\mathcal{I};\mathcal{M}}}{\partial\boldsymbol{\theta}_{\mathcal{I}}'}=\frac{\partial\boldsymbol{\lambda}_{\mathcal{I};\mathcal{M}}}{\partial\boldsymbol{\mu}_{\mathcal{I}}'}\,\frac{\partial\boldsymbol{\mu}_{\mathcal{I}}}{\partial\boldsymbol{\theta}_{\mathcal{I}}'}=\boldsymbol{I}. \qquad (3)$$

Proof: Follows from Lemma 2.

In the special case considered there, Proposition 2 simply says that the two sets of parameters are variation independent, which is implicit in the derivation in Stanghellini and Doretti (2019). Additional features of the result are clarified in the example below.

###### Example 1.

Consider an $XWY$ distribution where $X,Y$ are binary and $W$ has a given number of categories; suppose we have two probability distributions with all log-linear parameters being equal, except for one of them. Then the difference between corresponding pairs of marginal interactions is equal to the difference between the two values of that parameter.

It is well known that we cannot impose log-linear restrictions on the $XY$ interactions both in the marginal and in the joint distribution; for a formal argument see Bergsma and Rudas (2002). However, Colombi and Forcina (2014) proved a result that, within the Rc coding and assuming that $W$ has several categories, may be stated as follows:

###### Proposition 3.

Within $XWY$, the marginal log-linear parametrization with elements

$$(\boldsymbol{\lambda}_{X;XY},\ \boldsymbol{\lambda}_{Y;XY},\ \boldsymbol{\lambda}_{XY;XY},\ \boldsymbol{\lambda}_{W;XW},\ \boldsymbol{\lambda}_{XW;XW},\ \boldsymbol{\theta}_{XY},\ \boldsymbol{\theta}_{WY},\ \bar{\boldsymbol{\theta}}_{XWY})$$

where $\bar{\boldsymbol{\theta}}_{XWY}$ is obtained from $\boldsymbol{\theta}_{XWY}$ by deleting all elements corresponding to a fixed value of $W$, is a smooth parametrization of the saturated model.

In words, if we want to define (and possibly constrain) the $XY$ interactions both in the marginal and in the joint, we need to remove a subset of the $XWY$ interactions corresponding to a fixed value of $W$. This may be seen as added flexibility in the modelling process: if we are interested in imposing constraints on the interactions both in the marginal and in the joint, the price to pay is that we cannot model a subset of the higher-order interactions. This feature is illustrated in the next section.

## 3 Application

### 3.1 The data

The data come from the NCDS, a UK cohort study that included everybody born in the UK from March 3rd to March 9th, 1958. Several variables concerning the parents and the child are recorded; a full description of the data set is available at http://cls.ucl.ac.uk/cls-studies/1958-national-child-development-study. In this simplified analysis, we consider the number of years of schooling for each parent, the parents' concern about the education of the child shown at different stages (as recorded by the teachers), the weekly income of the parents and the academic qualification reached by the child, an ordered categorical variable with four categories. The issue of interest is the effect of parents' education on that of the child. Intuitively, parents' education might affect income, which in turn may offer better chances to the child. In addition, more educated parents might show more concern, being more aware of the importance of education. Direct effects may work through the atmosphere inside the family, like having books at home and meeting more educated friends.

For simplicity, the analysis below is restricted to the sample of 2161 daughters; the response $Y$ indicates whether the child obtained at least a high school degree; income and concern are dichotomized at the median. The exposure $X$ is a categorical variable with four levels obtained by splitting at the quartiles the following measure of parents' education

$$\tilde X=E_m+E_f-\mid E_m-E_f\mid/3,$$

where $E_m$, $E_f$ denote the number of years of schooling of mother and father and $\mid E_m-E_f\mid/3$ is a penalty for unequally educated parents. We also assume there are two mediators: $W$, the father's weekly income (that of the mother was ignored, having a large number of missing values) and $V$, an average measure of the concern shown by the parents at different stages, as recorded by teachers.
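A possible implementation of the exposure construction might look as follows; the sample values of `Em` and `Ef` are made up for illustration and the quartile split is an assumption consistent with the four levels described above.

```python
import numpy as np

# Hypothetical sketch of the exposure construction: combine the years of
# schooling of mother (Em) and father (Ef), subtract a penalty for
# unequally educated parents, then split at the quartiles into 4 levels.
def exposure(Em, Ef):
    x_tilde = Em + Ef - np.abs(Em - Ef) / 3
    q = np.quantile(x_tilde, [0.25, 0.5, 0.75])
    return np.digitize(x_tilde, q)            # levels 0..3

# Made-up data for illustration only.
Em = np.array([10, 12, 16, 9, 11, 14, 13, 8])
Ef = np.array([11, 10, 15, 9, 16, 12, 13, 10])
levels = exposure(Em, Ef)
```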

### 3.2 Two alternative models

We compare two alternative models, both parameterized with the adjacent coding; because all variables except $X$ are binary, assuming that, say, the adjacent interactions are constant in $x$ is equivalent to assuming that the corresponding logits are linear functions of $x$. However, because the evidence against linearity in $X$ was rather strong, the dependence on $X$ was left unconstrained.

M1: Define the overall effect of $X$ on $Y$ in the corresponding marginal distribution; in addition, model the effect of $X$ on the mediators in their marginal. Define all other interactions within the joint distribution, including the higher order interactions; the parameters already in the model determine the interactions which cannot be modeled. Then we constrain to 0 the remaining higher order interactions; this model fits well, with a deviance of 7.82 and 7 dof. Parameter estimates and standard errors for the interaction parameters involving the exposure are given in Table 1.

M2: Define the effects of $X$ on $Y$ within the marginal as above and all other effects within the joint; next, constrain to 0 the same interactions in the marginal as above and the corresponding interactions in the joint. This model, which is the closest analog to the one considered above, has a deviance of 13.04 with the same number of dof. Estimates and standard errors for the dependence of $Y$ on $X$ are displayed in Table 1.

The effect of $X$ on $Y$ is strongest in going from 2 to 3; the same holds for the marginal effect of $X$ on the mediators. Within M2 the effects of $X$ conditional on the mediators are roughly similar to the corresponding ones under M1.

If we assume that there are no unobserved confounders, the estimated joint distribution under M1 allows one to compute an estimate of the natural direct and indirect effects of parents' education on the academic qualification of the daughter, Pearl (2014), by changing $X$ from one category to the next (see VanderWeele et al., 2013, equations (1) and (2)). Results are in Table 3 with standard errors estimated by bootstrap; the direct effect is always the largest component of the total, though going from 0 to 1 does not seem to matter.

## Appendix

### Rephrasing Robin Evans' result

Let $\mathcal{N}\subset\mathcal{M}$ be two nested marginals and $\mathcal{R}=\mathcal{M}\setminus\mathcal{N}$; assume that we define interactions as contrasts relative to the reference category coded as 0; we also use the convention that, when the values of the conditioning variables are not given, they are fixed to the reference value; the derivation below is, essentially, a re-writing of Evans (2015). Let $\lambda_{\mathcal{I};\mathcal{M}}(\boldsymbol{x}_{\mathcal{I}})$ denote the log-linear interaction among the variables in $\mathcal{I}$ computed within the marginal $\mathcal{M}$ and fixed at the value $\boldsymbol{x}_{\mathcal{I}}$.

###### Lemma 3.
$$\lambda_{\mathcal{I};\mathcal{M}}(\boldsymbol{x}_{\mathcal{I}})-\lambda_{\mathcal{I};\mathcal{N}}(\boldsymbol{x}_{\mathcal{I}})=\sum_{\mathcal{J}\subseteq\mathcal{I}}(-1)^{\mid\mathcal{I}\setminus\mathcal{J}\mid}\log p_{\mathcal{R}\mid\mathcal{N}}(\boldsymbol{0}_{\mathcal{R}};\boldsymbol{x}_{\mathcal{J}},\boldsymbol{0}_{\mathcal{N}\setminus\mathcal{J}}), \qquad (4)$$

where the conditional probabilities on the right-hand side are of the event $\mathcal{R}=\boldsymbol{0}_{\mathcal{R}}$, with the conditioning set split into a component $\mathcal{J}$ taking the original values $\boldsymbol{x}_{\mathcal{J}}$ and the remaining variables fixed to 0.

Proof: Start from the expansion of $\lambda_{\mathcal{I};\mathcal{M}}(\boldsymbol{x}_{\mathcal{I}})$, add and subtract $\lambda_{\mathcal{I};\mathcal{N}}(\boldsymbol{x}_{\mathcal{I}})$ and write the difference between the two in terms of conditional probabilities:

$$\lambda_{\mathcal{I};\mathcal{M}}(\boldsymbol{x}_{\mathcal{I}})=\sum_{\mathcal{J}\subseteq\mathcal{I}}(-1)^{\mid\mathcal{I}\setminus\mathcal{J}\mid}\log p_{\mathcal{M}}(\boldsymbol{x}_{\mathcal{J}},\boldsymbol{0}_{\mathcal{M}\setminus\mathcal{J}})=\sum_{\mathcal{J}\subseteq\mathcal{I}}(-1)^{\mid\mathcal{I}\setminus\mathcal{J}\mid}\log\frac{p_{\mathcal{M}}(\boldsymbol{x}_{\mathcal{J}},\boldsymbol{0}_{\mathcal{N}\setminus\mathcal{J}},\boldsymbol{0}_{\mathcal{R}})}{p_{\mathcal{N}}(\boldsymbol{x}_{\mathcal{J}},\boldsymbol{0}_{\mathcal{N}\setminus\mathcal{J}})}+\lambda_{\mathcal{I};\mathcal{N}}(\boldsymbol{x}_{\mathcal{I}}).$$

We now apply Lemma 3 to the special case where $\mathcal{I}=XY$, $\mathcal{M}=XWY$, $Y$ is binary and $X$, $W$ are discrete; to simplify notation, let $\bar p_W(x,y)=P(W=0\mid X=x,Y=y)$; in addition, because $\mathcal{M}$ is the joint distribution, replace $\lambda_{XY;XWY}$ with $\lambda_{XY}$.

###### Corollary 1.
$$\lambda_{XY}(x,y)-\lambda_{XY;XY}(x,y)=\log\frac{\bar p_W(0,0)\,\bar p_W(x,y)}{\bar p_W(0,y)\,\bar p_W(x,0)} \qquad (5)$$

this may also be expressed in terms of log-linear parameters defined within the joint distribution as

$$\lambda_{XY}(x,y)-\lambda_{XY;XY}(x,y)=-\log\frac{1+\sum_{w>0}\exp[\lambda_W(w)]}{1+\sum_{w>0}\exp[\lambda_W(w)+\lambda_{WX}(w,x)]}+\log\frac{1+\sum_{w>0}\exp[\lambda_W(w)+\lambda_{WY}(w,y)]}{1+\sum_{w>0}\exp[\lambda_W(w)+\lambda_{WX}(w,x)+\lambda_{WY}(w,y)+\lambda_{WXY}(w,x,y)]}$$

Proof. The first part follows from Lemma 3 by noting that, because $\mathcal{I}=XY$ has just two elements, the expansion contains four terms which can be arranged into the form of a log odds ratio. For the second part, first write the conditional distribution of $W$ given $X$ and $Y$ as a multinomial and then apply (3) in Colombi and Forcina (2014) for expanding interactions conditional on $X,Y$ into a sum of higher order interactions.
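Equation (5) can be checked numerically on an arbitrary joint distribution; in the sketch below the helper functions are hypothetical names for the interactions computed under the Rc coding with conditioning variables fixed at 0.

```python
import numpy as np

# Numerical check of (5) on a random joint distribution p[x, w, y] with
# X, Y binary and W having 3 categories.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 1.0, size=(2, 3, 2))
p /= p.sum()

def lam_XY_joint(p, x, y):            # lambda_XY within the joint (W = 0)
    return np.log(p[x, 0, y] * p[0, 0, 0] / (p[x, 0, 0] * p[0, 0, y]))

def lam_XY_marginal(p, x, y):         # lambda_XY;XY within the XY marginal
    q = p.sum(axis=1)
    return np.log(q[x, y] * q[0, 0] / (q[x, 0] * q[0, y]))

def pbar_W(p, x, y):                  # P(W = 0 | X = x, Y = y)
    return p[x, 0, y] / p.sum(axis=1)[x, y]

x, y = 1, 1
lhs = lam_XY_joint(p, x, y) - lam_XY_marginal(p, x, y)
rhs = np.log(pbar_W(p, 0, 0) * pbar_W(p, x, y)
             / (pbar_W(p, 0, y) * pbar_W(p, x, 0)))
assert np.isclose(lhs, rhs)
```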

### Log-linear versus logistic parameterizations

For what follows, it might be useful to recall how, under the corner point coding, log-linear parameters may be mapped into the corresponding logistic parameters. When the dependent variable, like $Y$, is binary, we have

$$\log\frac{P(Y=1\mid X=x,W=w)}{P(Y=0\mid X=x,W=w)}=\lambda_Y+\lambda_{XY}(x)+\lambda_{WY}(w)+\lambda_{XWY}(x,w),$$

with the convention that a log-linear parameter is 0 whenever at least one of its arguments is 0. Having assumed that $W$ is multinomial with, possibly, more than two categories, its logits may be written as

$$\log\frac{P(W=w\mid X=x,Y=y)}{P(W=0\mid X=x,Y=y)}=\lambda_W(w)+\lambda_{XW}(w,x)+\lambda_{WY}(w,y)+\lambda_{XWY}(x,w,y).$$

### The results of Stanghellini and Doretti

As above, let $Y$ be binary and $X$, $W$ be discrete; equation (A2) in Stanghellini and Doretti (2019) may be written as

$$\log\frac{P(W=w\mid Y=1,X=x)}{P(W=w\mid Y=0,X=x)}=\log\frac{P(Y=1\mid X=x,W=w)}{P(Y=0\mid X=x,W=w)}-\log\frac{P(Y=1\mid X=x)}{P(Y=0\mid X=x)}$$

which follows by expanding the left-hand side as

$$\log\frac{P(W=w\mid Y=1,X=x)}{P(W=w\mid Y=0,X=x)}=\log\frac{P(W=w,Y=1,X=x)}{P(W=w,Y=0,X=x)}-\log\frac{P(Y=1,X=x)}{P(Y=0,X=x)}$$

and noting that logits may be computed equivalently either on the joint or conditional distribution.
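This identity is easy to verify numerically on an arbitrary joint distribution; the array layout `p[x, w, y]` below is an assumption of this sketch.

```python
import numpy as np

# Numerical check of identity (A2): the log ratio of the conditional
# mediator distributions across the two levels of Y equals the change in
# the logit of Y due to conditioning on W.
rng = np.random.default_rng(1)
p = rng.uniform(0.05, 1.0, size=(2, 3, 2))   # X binary, W ternary, Y binary
p /= p.sum()

x, w = 1, 2
# left-hand side: log P(W=w | Y=1, X=x) - log P(W=w | Y=0, X=x)
lhs = (np.log(p[x, w, 1] / p[x, :, 1].sum())
       - np.log(p[x, w, 0] / p[x, :, 0].sum()))
# right-hand side: conditional logit of Y minus marginal logit of Y given X
rhs = (np.log(p[x, w, 1] / p[x, w, 0])
       - np.log(p[x, :, 1].sum() / p[x, :, 0].sum()))
assert np.isclose(lhs, rhs)
```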

To derive an extension of their (A3) to non-binary $W$, first swap the conditioning:

$$\log\frac{P(W=w\mid X=x,Y=y)}{P(W=0\mid X=x,Y=y)}=\log\frac{P(Y=y\mid X=x,W=w)}{P(Y=y\mid X=x,W=0)}+\log\frac{P(W=w\mid X=x)}{P(W=0\mid X=x)}$$

next expand the first term on the right-hand side by adding and subtracting $\log P(Y=0\mid X=x,W=w)$ and $\log P(Y=0\mid X=x,W=0)$,

$$\log\frac{P(Y=y\mid X=x,W=w)}{P(Y=y\mid X=x,W=0)}=\log\frac{P(Y=0\mid X=x,W=w)}{P(Y=0\mid X=x,W=0)}+y\left[\log\frac{P(Y=1\mid X=x,W=w)}{P(Y=0\mid X=x,W=w)}-\log\frac{P(Y=1\mid X=x,W=0)}{P(Y=0\mid X=x,W=0)}\right].$$

Thus the analog of the log-linear expansion in their (A3) is

$$\log\frac{P(W=w\mid X=x,Y=y)}{P(W=0\mid X=x,Y=y)}=y\left[\lambda_{WY}(w)+\lambda_{XWY}(x,w)\right]+\log\frac{1+\exp(\lambda_Y+\lambda_{XY}(x))}{1+\exp(\lambda_Y+\lambda_{XY}(x)+\lambda_{WY}(w)+\lambda_{XWY}(x,w))}+\lambda_W(w)+\lambda_{WX}(w,x),$$

which is equivalent to (1) in the special case when $X$ and $W$ are both binary variables.

### Acknowledgments

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. The author would like to thank Elena Stanghellini for suggesting the problem and for several helpful comments.

## References

• Barndorff-Nielsen (1978) Barndorff-Nielsen, O.E., 1978. Information and exponential families. Wiley, New York.
• Bergsma and Rudas (2002) Bergsma, W.P., Rudas, T., 2002. Marginal models for categorical data. Annals of Statistics 30, 140–159.
• Colombi and Forcina (2014) Colombi, R., Forcina, A., 2014. A class of smooth models satisfying marginal and context specific conditional independencies. J. Multivariate Analysis 126, 75–85.

• Evans (2015) Evans, R.J., 2015. Smoothness of marginal log-linear parameterizations. Electronic Journal of Statistics 9, 475–491.
• Forcina (2012) Forcina, A., 2012. Smoothness of conditional independence models for discrete data. J. Multivariate Analysis 106, 49–56.
• Pearl (2014) Pearl, J., 2014. Interpretation and identification of causal mediation. Psychological Methods 19, 459.
• Stanghellini and Doretti (2019) Stanghellini, E., Doretti, M., 2019. On marginal and conditional parameters in logistic regression models. Biometrika 106, 732–739.
• VanderWeele et al. (2013) VanderWeele, T., Vansteelandt, S., Robins, J., 2013. Effect decomposition in the presence of an exposure-induced mediator-outcome confounder. Epidemiology 25, 300–306.