The modern framework for statistical modelling of incomplete data was introduced by Rubin , where a vector of binary response random variables was introduced and conditions were given under which inferences could be based on the marginal density alone. Note that and were used in  to denote what we have called and , respectively.
Intrinsic to this approach is the partitioning of a realisation of into values that are observed and values that are missing according to some response pattern . In  the subscripts ‘’ and ‘’ were introduced to denote this partition. These were replaced with the subscripts in and in  and and in [3, 9]. Over three decades the latter notation has become the de facto standard in the exposition of statistical methods for incomplete data typically aimed at practicing statisticians and other researchers, so it is important for there to be a clear understanding of what it means.
1.2 Two notions of missingness
“Here to keep the notation simple we will be somewhat imprecise in our treatment of these complications. …
… The actual observed data consists of the values of the variables . The distribution of the observed data is obtained by integrating out of the joint density of and . That is,
In the extract above, the authors stated that their intention was to keep the notation simple. But setting , encodes missingness into the notation as attributes of the data vector instead of as the vector , in the mathematical relation . This, in fact, significantly complicates rather than simplifies the notation, particularly in regard to the domain of the marginal density, , for . In the product of functions on the right hand side of (5.11), the factor is shorthand for the composition of functions , where is the projection sending a realisation of to the realisation of . When unpacked this way, the notation in (5.11) specifies that
This is not straightforward to interpret because the formal mathematical relation of missingness exists in the domain of , and this missingness relation is not preserved by the projection .
Note that the failure of to preserve the missingness relation is not simply because is a many-to-one function. Even if the domain of is restricted to include only pairs pertaining to a specific response pattern , the two missingness relationships still differ. The formal definition of ‘observed’ and ‘missing’ encoded in on the right hand side of (1) is an absolute concept: every data item in the range of is stamped irrevocably either as ‘observed’ or ‘missing’. On the left hand side, however, ‘observed’ and ‘missing’ mean ‘observed this time’ and ‘missing this time’, respectively. This is a different concept which at the meta-mathematical level is inconsistent with at the stochastic level: in the density functions and , the notations and denote arbitrary realisations, which entails holding fixed the response pattern determining the partition of while at the same time allowing to vary (in contradiction of the stochastic relationship encoded in ). To distinguish between these different concepts we define them formally.
Definition 1 (Formal Missingness). Given a response pattern , we call formally observed and formally missing with respect to if the values over which and range occur only in two tuples or three tuples .
Definition 2 (Temporal Missingness). Given a response pattern , we call temporally observed and temporally missing with respect to if the values over which and range occur either in two tuples or they include two tuples or three tuples with .
Informally, the distinction between formal and temporal missingness is that the former is what is defined formally by the relation , whereas with the latter the data variables have been partitioned according to some response pattern simply for the purpose of considering either or from a particular point of view, and there is no requirement or expectation that the formal relationship is, or can be, preserved.
Note 1.2.1. The notation and used in definitions 1 and 2 is ambiguous. For example, it is not completely clear in definition 2 that the partition of into and is held fixed according to and not according to . Extended notation to allow these two notions of missingness to be distinguished more clearly will be introduced in Section 2.
Note 1.2.2. Variables in the conditional of the marginal distribution for are temporally missing and observed, respectively, whereas variables in the conditional density are formally missing and observed. Failing to distinguish this notationally requires to be variable in the range of while simultaneously holding the response pattern fixed, which conflicts with the meaning of missingness defined by .
Note 1.2.3. When formal missingness is intended, makes sense only as one part of a pair , and this pair denotes a stochastic function more general than a random vector. On the other hand, when temporal missingness is intended, both and denote marginal distributions of that are each mixtures of formally observable and formally unobservable values.
1.3 Two different functions
A statement that is equivalent to a Missing at Random (MAR) assumption is often written in the following (or a similar) way:
Despite the function on the right hand side of (2) being denoted ‘’, technically the functions being compared are on the left hand side and on the right hand side, where is the restriction of the projection to the domain of and denotes the function derived from the marginal density for . Note that because these functions have different domains. This mathematical distinction is a minor technicality, but the distinction is important stochastically. We will illustrate this shortly, but firstly we distinguish between these two functions by giving them each separate notation:
The stochastic difference between and is that realisations drawn according to the former come from the range of the projection , but realisations of the latter come from the domain of . That is, the realisations come from different sides of equation (1). In particular, an update to a realisation according to has the form of a three tuple with the response pattern remaining unchanged. However, an update to the same realisation according to has the form of a two tuple , and to maintain consistency with , a subsequent updating of the response pattern to according to the response mechanism is required to complete the triple .
Due to this stochastic difference between and , it is important to emphasise that the correct statement of equation (2) is that:
1.4 Conceptual difficulties for the reader
The difference between and and the failure in the literature to distinguish between these densities and between variables which are formally missing versus temporally missing create unnecessary potential conceptual difficulties for a reader, and this can make it difficult for a reader to obtain a coherent conceptual picture of how the related statistical methods work. We outline some of these difficulties below.
Difficulty 1. The construction of the distribution requires identification of variables and in the domain of the marginal density for , and this requires the reader to deal with two inconsistent definitions of missingness simultaneously that are not distinguished in the notation: temporal, , pertaining to the marginal distribution for and formal, , as defined by .
Note that difficulty 1 is not due to the encoding of ‘observed’ and ‘missing’ into the labels ‘’ and ‘’, as opposed to ‘’ and ‘’ used in , but rather because the same labels are used both in the domain and in the range of the projection in equation (1).
Difficulty 2. If denotes the particular realised values of , then the distribution is the wrong distribution conceptually for ignorable multiple imputation.
As we noted in Section 1.3, an update to according to arises as a two tuple and requires an update to the response pattern to form a completed three tuple to maintain consistency with . Therefore, a sequence of imputations drawn according to that is consistent with has the form:
This contrasts with a sequence of imputations drawn according to which conceptually has the correct form:
Difficulty 3. Standard conventions for interpreting mathematical notation lead to ‘’ in the notation ‘’ being interpreted as the density and not the density as is required by equation (2) (see equation (5)).
Note that difficulty 3 does not apply to equation (2) because the context allows the reader to interpret the function on the right hand side correctly as (if the reader examines the notation carefully). However, this is definitely not the case with the standalone notation ‘’, and it is this latter notation which permeates much of the published literature on ignorable multiple imputation methodology.
Difficulty 4. Failure to distinguish between formal and temporal missingness clashes with the standard statistical convention of inferring the identity of a density function through the denotation of the variables in its domain.
It is common to infer from the notation ‘’ for a joint density that ‘’ denotes a marginal density. However, the notation ‘’ is ambiguous because the interpretation of as formally observed leads to one function , but the interpretation of as temporally observed leads to a different function with a different domain.
1.5 Additional limitations and notational inconsistencies
Omitting from the notation the dependence of ‘’ and ‘’ on a specific response pattern implicitly assumes that is the only response pattern of interest to the reader. This prevents the expression of the mathematical relationships between response patterns that exist within equation (2). Understanding these relationships at a conceptual level is useful for a reader to comprehend the primary implications of a MAR assumption in practice where one response pattern per unit is observed, and several different response patterns are realised overall.
The use of uppercase letters to denote both variable realisations of random vectors and the random vectors themselves is common in the literature on incomplete data methods. This is contrary to the recommendations in . It is also another potential source of conceptual confusion for readers because the notation ‘’ ordinarily would be understood to mean the composition of the density function with the random variable , whereas a density function is something that is integrated to calculate probabilities.
The use of a capital ‘’ to denote a probability density function also seems fairly common in the literature on methods for incomplete data. This too is contrary to widely understood usage of the notation where a capital denotes the probability measure and is a function of events (subsets of outcomes), whereas the density is a corresponding function of outcomes which is integrated over subsets to calculate values for . This is a further potential source of confusion for readers.
1.6 What we do
2 Notation for
2.1 Random Vectors
Throughout, denotes a random vector modelling the observed and unobserved data comprising all units in the study jointly, and denotes a random vector of binary response random variables of the same dimension as , where ‘1’ means observed. Joint distributions for the pair of random vectors will be referred to as full distributions.
Note 2.1.1. We have no need to distinguish between vectors interpreted as column matrices versus row matrices, and so for our purposes we do not give vectors a column matrix interpretation and dispense with the common ‘’ and ‘’ notations.
Note 2.1.2. Typically a data analyst thinks of a given as comprising a rectangular matrix with each column pertaining to a specific ‘variable’ (for example, blood pressure) and each row pertaining to a specific unit (for example, an individual in the study). In our notation, the data matrix is shaped so that there is a single row with the data for the various units placed side by side in sets of columns.
2.2 Sample Spaces
Let be the set of distinct response patterns with denoting the ‘all ones’ vector corresponding to the complete cases. For convenience, we let denote the ‘all zeros’ vector corresponding to non-participants, where it may or may not be the case that for some . (We exclude so as to avoid ever having .) Note that the dot product gives the number of values observed when the response pattern is realised and, in particular, gives the number of variables in (and also in ). Let be the set of realisable datasets, where a realisable dataset contains complete data including all values that may or may not be observable.
Let be the full sample space of realisable pairs of datasets and response patterns, where for . When the subscript of is omitted, we denote by . Let and denote the projections and , respectively.
Realisations which represent a specific realisable dataset or response pattern only are denoted and , respectively.
2.3 Projections on and
For , let and denote the projections extracting from each vector the vectors of its observed and unobserved values, respectively, according to the response pattern . (In logic, ‘’ is commonly used for negation.) By convention we set . To apply these projections correctly over , we define the following mappings
and use an abbreviated notation to refer to the images of under these mappings:
Additionally, for and set
Note 2.3.1. The notations in (8) and (9) and on the right hand sides of (10)–(13) may seem unwieldy. Note that these notations are needed solely for the purpose of carefully defining the four symbols , , and . It is only these latter four symbols that are needed for working with densities for the distributions for themselves.
Note 2.3.2. The vectors and have length while the vectors and have length . Note that these lengths vary from response pattern to response pattern.
Note 2.3.3. The projections and apply solely on the range of and are always consistent with the missingness relation . Each response pattern gives projections and on , and these are pieced together over all response patterns to give a single pair of functions on all of .
Note 2.3.4. The projections and apply on either or as the context dictates. Each gives a distinct pair of projections and on all of or all of , as the case may be. In the latter case, these and are consistent with on and inconsistent with elsewhere on . The ‘’ in ‘’ and ‘’ can be taken to mean ‘temporally’ or ‘this time’.
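The behaviour of the two kinds of projection described in this section can be sketched in code. The following Python fragment is a minimal illustration only: the function names `observed` and `unobserved`, the example dataset `y` and the patterns `r` and `s` are assumptions introduced here and are not part of the notation of the paper. Vectors are represented as tuples and a response pattern as a tuple of 0s and 1s, with 1 meaning observed.

```python
def observed(y, r):
    """Entries of y at the positions where r is 1 (observed)."""
    return tuple(v for v, ri in zip(y, r) if ri == 1)


def unobserved(y, r):
    """Entries of y at the positions where r is 0 (not observed)."""
    return tuple(v for v, ri in zip(y, r) if ri == 0)


# Hypothetical realised pair: a dataset y together with its own pattern r.
y = (5.1, 7.3, 2.0)
r = (1, 0, 1)

# Formal split: the partition follows the pattern realised with y.
formal = (observed(y, r), unobserved(y, r))      # ((5.1, 2.0), (7.3,))

# Temporal split: a fixed pattern s is applied whatever pattern was realised.
s = (1, 1, 0)
temporal = (observed(y, s), unobserved(y, s))    # ((5.1, 7.3), (2.0,))

# The lengths of the two parts are given by a dot product with the
# all-ones vector and by its complement.
n_obs = sum(ri * 1 for ri in r)                  # 2
n_mis = len(r) - n_obs                           # 1
```

In this sketch the temporal split labels the value 7.3 as observed even though it was not observed under the realised pattern `r`, which is the kind of mismatch described in Note 2.3.4.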
2.4 Observable Data Events
Given , we call
the observed data event corresponding to . The set consists of all datasets which have the same observed values as (as defined by the response pattern ). For a fixed , the events in (14) partition , and over all they give a partition of . These observable data events are the classes of the equivalence relation defined by setting for all , if, and only if, and .
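The partition property can be checked directly on a small finite example. The sketch below assumes a toy setting of two binary variables and three response patterns; the names `observed`, `observed_data_event` and `is_partition` are hypothetical helpers introduced for illustration.

```python
from itertools import product


def observed(y, r):
    """Entries of y at the positions where r is 1 (observed)."""
    return tuple(v for v, ri in zip(y, r) if ri == 1)


# Hypothetical toy setting: two binary variables, three response patterns.
datasets = list(product((0, 1), repeat=2))
patterns = [(1, 1), (1, 0), (0, 1)]


def observed_data_event(y, r):
    """All datasets sharing the observed values of y under pattern r."""
    target = observed(y, r)
    return frozenset(z for z in datasets if observed(z, r) == target)


# For each fixed pattern, collect the distinct observed data events.
events = {r: {observed_data_event(y, r) for y in datasets} for r in patterns}


def is_partition(blocks, universe):
    """True if the blocks are pairwise disjoint and cover the universe."""
    covered = [z for block in blocks for z in block]
    return len(covered) == len(set(covered)) and set(covered) == set(universe)
```

Under the complete-case pattern every event is a single dataset, while under a pattern with one unobserved variable each event contains the two datasets that agree on the observed value.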
2.5 Density Functions
We specify full distributions for through density functions , with probabilities being determined by integration: for any for which a probability can be defined (see  or  for details). Note we suppress the dominating measure in the notation. Two different ways of factorizing are useful:
The first factorization in (15) is called a selection model factorization of , and the factor is called the response mechanism. The second factorization in (15) is called a pattern-mixture factorization, and for each , we call the conditional density the pattern mixture component pertaining to .
Note 2.5.1. Technically, the symbols , , and denote density functions and , , , and denote real numbers. Because it is common in statistics to use the same symbol to denote different densities, for example a joint density and a marginal density , we adopt the usual convention and often refer to density functions by their values.
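That the two factorizations describe the same full density can be illustrated on a discrete toy model. The sketch below is an assumption-laden example rather than anything taken from the paper: the uniform `marginal`, the `mechanism` depending on the first variable, and all other names are hypothetical. Exact fractions are used so that equality can be tested without rounding error.

```python
from fractions import Fraction as F
from itertools import product

datasets = list(product((0, 1), repeat=2))
patterns = [(1, 1), (1, 0)]

# Selection model ingredients (both hypothetical).
marginal = {y: F(1, 4) for y in datasets}


def mechanism(r, y):
    """Probability of response pattern r given the full dataset y."""
    p_complete = F(3, 4) if y[0] == 1 else F(1, 2)
    return p_complete if r == (1, 1) else 1 - p_complete


# Full density built from the selection model factorization.
full = {(y, r): marginal[y] * mechanism(r, y)
        for y in datasets for r in patterns}

# Pattern-mixture ingredients recovered from the same full density.
pattern_prob = {r: sum(full[(y, r)] for y in datasets) for r in patterns}
component = {(y, r): full[(y, r)] / pattern_prob[r]
             for y in datasets for r in patterns}

# Both factorizations reproduce the same full density.
agree = all(component[(y, r)] * pattern_prob[r] == full[(y, r)]
            for (y, r) in full)
```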
3 The observable data distribution
To apply likelihood theory to incomplete data, from the model for the full data one must construct a model for just the observable data. This involves specifying a set of outcomes and a set of events for the observable data, and to each full density , a corresponding density on the set of outcomes for the observable data. As a first demonstration of the use of the notation in Section 2, here we give an explicit construction for this probability space together with a step-by-step derivation of the density given in (5.11) in the extract quoted in Section 1.
The outcomes can be taken to be either the set of observable data events or the range of the map because there is a one-to-one correspondence between and . The latter seems to be preferred [3, 4, 12]:
This is an irregularly-shaped set because as noted in Section 2 the vectors typically have different lengths for different response patterns.
Under the one-to-one correspondence between and , events in correspond to unions of observable data events in . Restricting to observable data events gives the density for the probability distribution on :
This can be seen to be the required density simply by pulling events in back to unions of observable data events in and integrating over these corresponding events for (by applying iterated integrals as per Fubini’s Theorem;  p 101). Note that we use and not in ‘’ because the integrand is defined on all of and the variables integrated out of are different for each subset .
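For a discrete toy model the integration reduces to a sum, and the construction can be sketched as follows. The full density `full`, the helper `observed` and the dictionary `observable` are hypothetical names, and the model itself (uniform data, response depending on the first variable) is an assumption made purely for illustration. Each outcome of the observable data is labelled by a pair consisting of the observed values and the response pattern, so the keys have different lengths for different patterns.

```python
from fractions import Fraction as F
from itertools import product


def observed(y, r):
    """Entries of y at the positions where r is 1 (observed)."""
    return tuple(v for v, ri in zip(y, r) if ri == 1)


datasets = list(product((0, 1), repeat=2))
patterns = [(1, 1), (1, 0)]


def full(y, r):
    """Hypothetical full density: uniform data, response depends on y[0]."""
    p_complete = F(3, 4) if y[0] == 1 else F(1, 2)
    return F(1, 4) * (p_complete if r == (1, 1) else 1 - p_complete)


# Sum the full density over each observable data event. The variables
# summed out differ from pattern to pattern.
observable = {}
for r in patterns:
    for y in datasets:
        key = (observed(y, r), r)
        observable[key] = observable.get(key, 0) + full(y, r)
```

The resulting dictionary has six outcomes, four from the complete-case pattern and two from the pattern with one unobserved variable, and its values sum to one.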
The response-pattern-dependent processing being performed in the construction of the density in (17) does not correlate well with the selection-model factorization for , and this can make the construction seem a little opaque. An alternative derivation is possible starting with a pattern-mixture factorization for .
One way to do this is to start from , restrict to : , marginalize to :
and then put the pieces together over all of : . Alternatively, for each one can marginalise over all of :
restrict to : , and then put the pieces together over all of :
Note 3.1. In (19), for a given the density is a marginal density of with domain . There are of these distributions. On the other hand, there is only one density with domain . For a given , the function agrees with on the set , but comparison of these two functions on the rest of their domains is not well defined.
Note 3.2. Because (16) is irregularly shaped and not a Cartesian product, the stochastic function obtained by composing with is not a random vector. Tsiatis ( page 13) calls these ‘random quantities’. Stochastic functions more general than random vectors are called ‘random objects’ by Ash ( page 178) and ‘random elements’ by Shorack ( page 90). To be applicable to incomplete data, the likelihood theory must be sufficiently general to cover these random quantities. See  pages 563–567 for a sufficiently general likelihood theory for the case of IID data.
Note 3.3. If is interpreted as formally observed and considered to vary over response patterns, then it denotes the composition of with . As was noted in Section 1.2, when interpreted this way alone is insufficient to model the observable data. This is because there is potential for clashes between the ranges from distinct response patterns. That is, we may have with on the right hand side of (16).
4 Temporally observed and temporally missing variables are formally mixed
As a second example of the use of the notation defined in Section 2, we give a formal demonstration that the random vectors and each comprise mixtures of formally observable and formally unobservable data.
To do this neatly, we define a partial-order on as follows (see Figure 1 for the definition of a partial order): for each let denote the projection with domain extracting the coordinate of each response pattern. Then for define
In words, if, and only if, all values that are defined to be observed according to pattern are defined to be observed according to pattern . It is straightforward to check that this relation is reflexive, transitive and anti-symmetric.
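The three properties can be verified exhaustively for a small number of variables. The sketch below assumes that the relation is the coordinatewise comparison of binary patterns described in words above; the function name `leq` and the choice to enumerate all eight patterns on three variables are assumptions made for illustration.

```python
from itertools import product


def leq(r, s):
    """True if every value observed under r is also observed under s."""
    return all(ri <= si for ri, si in zip(r, s))


patterns = list(product((0, 1), repeat=3))   # all 8 patterns on 3 variables

reflexive = all(leq(r, r) for r in patterns)
antisymmetric = all(
    r == s
    for r in patterns for s in patterns
    if leq(r, s) and leq(s, r)
)
transitive = all(
    leq(r, t)
    for r in patterns for s in patterns for t in patterns
    if leq(r, s) and leq(s, t)
)

# The order is only partial: some pairs of patterns are not comparable.
incomparable = (not leq((1, 0, 0), (0, 1, 0))
                and not leq((0, 1, 0), (1, 0, 0)))
```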
Figure 1 illustrates when has three variables (and all possible patterns):
Let and consider a full density for as in (15) factored into selection model and pattern-mixture forms, . Marginalising the latter factorization over all response patterns gives the marginal density for as a mixture of the pattern-mixture components:
Letting and substituting into both sides of (22) gives:
Now in the sum on the right-hand side of (23), the terms for which all the entries of are labelled as formally missing according to are those with response patterns satisfying (according to the partial order defined in (21)). Similarly, the terms for which all the entries of are labelled as formally observed according to are those with response patterns satisfying . By anti-symmetry, the only component on the right-hand side of (23) for which all labelling of the values is formally correct is the single component with . Hence, provided contains at least two response patterns, one of and is a mixture of formally observable and formally unobservable data. (This shows that at least one of and is mixed. In most cases, this will be true of both.)
5 Derivation of the MAR Identity
Definition 5.1. Given factorised in selection model form together with observed data , we say that the response mechanism is Missing at Random (MAR) with respect to if is a constant function on .
Note that Rubin  defines MAR for a model . This can be accommodated by requiring that MAR hold with respect to for all densities in . Everywhere MAR (in ) is accommodated by requiring that MAR hold with respect to all observed data events (for all densities in ).
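Definition 5.1 can be checked mechanically on a discrete toy model by testing whether the response mechanism takes a single value on an observed data event. In the sketch below the two mechanisms, the helper `observed` and the function `is_mar` are hypothetical and are introduced only to illustrate the definition. One mechanism depends on a variable that is observed under the pattern considered and the other on a variable that is not.

```python
from fractions import Fraction as F
from itertools import product


def observed(y, r):
    """Entries of y at the positions where r is 1 (observed)."""
    return tuple(v for v, ri in zip(y, r) if ri == 1)


datasets = list(product((0, 1), repeat=2))


def mech_on_first(r, y):
    """Hypothetical mechanism depending only on y[0]."""
    p_complete = F(3, 4) if y[0] == 1 else F(1, 2)
    return p_complete if r == (1, 1) else 1 - p_complete


def mech_on_second(r, y):
    """Hypothetical mechanism depending only on y[1]."""
    p_complete = F(3, 4) if y[1] == 1 else F(1, 2)
    return p_complete if r == (1, 1) else 1 - p_complete


def is_mar(mechanism, y, r):
    """True if the mechanism, evaluated at pattern r, is constant on the
    observed data event determined by (y, r)."""
    target = observed(y, r)
    values = {mechanism(r, z) for z in datasets if observed(z, r) == target}
    return len(values) == 1
```

With observed data `((0, 1), (1, 0))`, where only the first variable is observed, `is_mar(mech_on_first, (0, 1), (1, 0))` holds, whereas `is_mar(mech_on_second, (0, 1), (1, 0))` fails because the mechanism varies with the unobserved second variable.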
Let be as in (15) and let be a partially-observed realisation drawn according to . Partitioning into observable and unobservable components as defined by gives
Note that the ‘’ in denotes the function (see (4)) and not the function . Factorizing the joint density for the values on each side of (24) into the product of a marginal and a conditional density, and then rearranging (provided all required denominators are non-zero) gives:
In (25) the function denotes the composition of the marginal density with the projection (suitably restricted).
6 Further analysis of the MAR Identity
As a final example of use of the notation in Section 2, we examine the MAR identity (27) more closely. For a fixed , the domain of the densities in this equality is the observed data event . When restricted to this event, gives a bijection onto a corresponding subset of . Combining the inverse of this bijection with (27) gives
For notational simplicity, we relabel the response patterns, if necessary, so that . Conditioning on the variables in (23) yields
When the data comprise IID draws with differing response patterns across units, holding fixed in (30) and letting vary shows that associations on the left hand side for which data are never observed are partially observed on the right hand side amongst units with response patterns different from . This key feature of MAR is obscured in the notation on the right hand side of (2).
Missing data is a common problem across a broad range of medical and public health research, and in other fields of empirical research as well. Consequently, there is a broad range of stakeholders with an interest in being able to read and understand the literature on the relevant statistical methods. We have identified two significant ambiguities in the use of notation in this literature which we suggest undermine its purpose to disseminate the requisite information in a clear and logically coherent manner.
The first involves failure to distinguish between two different relationships between data vectors and response indicators, one in the domain and one in the range of the projection . Three distinct relationships are implicit in the original work : an ‘either/or’ notion through an extended random vector in which an investigator observed either a data value or a missing value symbol ‘’, but not both; a ‘temporal’ partition of the marginal distribution of the data, and according to the response pattern that was observed ‘this time’; and the ‘absolute’ relation arising implicitly through specification of the joint density for this random vector. The equating of the latter two relationships is not evident in  and seems to have entered into the literature later, either prior to or through .
We have noted the presence in  of the manual transference of the missingness relation to the codomain of through the notation ‘’ and ‘’, and note that this practice has persisted in the literature for more than three decades. We suggest that it circumvents the mathematical framework established through because is not a ‘missingness preserving’ transformation. This renders the literature opaque because missingness is not an intrinsic attribute of a data vector , and the relationship between data vectors and response indicators defined by does not exist in the marginal distribution for . Specifically, the notations and at the meta-mathematical level contradict at the stochastic level because and denote variables in the domains of the respective densities, and conceptually this requires holding the response pattern fixed and allowing to vary in contradiction of the stochastic relationship .
We have also explained how the equating of formal and temporal missingness through the same notation for both results in ambiguity in the notation : in the former case, makes sense only as part of an ordered pair and denotes a stochastic function more general than a random vector (or random variable), and in the latter case, and each denote random vectors that are marginal distributions of and each are mixtures of formally observed and formally missing values (as defined by ). Additionally, we have explained how this equating of different relationships conflicts with the statistical convention of identifying a density function through the notation used for the variables in its domain. Specifically, the same notation ‘