# What does it mean for data to be ‘observed’ or ‘missing’?

In statistical modelling of incomplete data, missingness is encoded as a relation between datasets $Y$ and response patterns $R$. We identify two different meanings of ‘observed’ and ‘missing’ implicit in this framework, only one of which is consistent with the definition formally encoded in $(Y, R)$. Notation that has been used in the literature for more than three decades fails to distinguish between these two concepts, rendering the notations ‘$f(y_{obs}, y_{mis})$’ and ‘$f(y_{mis} \mid y_{obs})$’ conceptually contradictory. Additionally, the same notation ‘$f(y_{mis} \mid y_{obs})$’ is used to refer to two densities with different domains. These densities can be considered to be equivalent mathematically, but conceptually they are not interchangeable as distributions because of their differing relationships to $(Y, R)$. Only one of these distributions is consistent with $(Y, R)$, and standard conventions for the interpretation of mathematical notation lead to the wrong choice conceptually for ignorable multiple imputation. We introduce formal definitions and notational improvements to treat these and other ambiguities, and we demonstrate their use through several example derivations.


## 1 Introduction

### 1.1 Background

The modern framework for statistical modelling of incomplete data was introduced by Rubin. Alongside the vector of data random variables $Y$, a vector of binary response random variables $R$ was introduced, and conditions were given under which inferences could be based on the marginal density for $Y$ alone. Note that different symbols were used in the original work to denote what we have called $Y$ and $R$.

Intrinsic to this approach is the partitioning of a realisation $y$ of $Y$ into values that are observed and values that are missing according to some response pattern $r$. Subscripts were introduced in the original work to denote this partition. These were later replaced with other subscripts, and then with ‘obs’ and ‘mis’ in [3, 9]. Over three decades the latter notation has become the de facto standard in the exposition of statistical methods for incomplete data typically aimed at practising statisticians and other researchers, so it is important for there to be a clear understanding of what it means.

### 1.2 Two notions of missingness

The following is an extract from  (pp 89–90; also  pp 118-119):

“Here to keep the notation simple we will be somewhat imprecise in our treatment of these complications. …

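… The actual observed data consists of the values of the variables $(Y_{obs}, R)$. The distribution of the observed data is obtained by integrating $Y_{mis}$ out of the joint density of $Y = (Y_{obs}, Y_{mis})$ and $R$. That is,

$$f(Y_{obs}, R \mid \theta, \psi) = \int f(Y_{obs}, Y_{mis} \mid \theta)\, f(R \mid Y_{obs}, Y_{mis}, \psi)\, dY_{mis}. \tag{5.11}$$”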

In the extract above, the authors stated that their intention was to keep the notation simple. But setting $Y = (Y_{obs}, Y_{mis})$ encodes missingness into the notation as attributes of the data vector $Y$ instead of as the vector $R$ in the mathematical relation $(Y, R)$. This, in fact, significantly complicates rather than simplifies the notation, particularly in regard to the domain of the marginal density, $f$, for $Y$. In the product of functions on the right hand side of (5.11), the factor $f(Y_{obs}, Y_{mis} \mid \theta)$ is shorthand for the composition of functions $f \circ \pi_Y$, where $\pi_Y$ is the projection sending a realisation $(y, r)$ of $(Y, R)$ to the realisation $y$ of $Y$. When unpacked this way, the notation in (5.11) specifies that

$$(Y_{obs}, Y_{mis}) = \pi_Y(Y_{obs}, Y_{mis}, R). \tag{1}$$

This is not straightforward to interpret because the formal mathematical relation of missingness exists in the domain of $\pi_Y$, and this missingness relation is not preserved by the projection $\pi_Y$.

Note that the failure of $\pi_Y$ to preserve the missingness relation is not simply because $\pi_Y$ is a many-to-one function. Even if the domain of $\pi_Y$ is restricted to include only pairs $(y, r)$ pertaining to a specific response pattern $r$, the two missingness relationships still differ. The formal definition of ‘observed’ and ‘missing’ encoded in $(Y, R)$ on the right hand side of (1) is an absolute concept: every data item in the range of $(Y, R)$ is stamped irrevocably either as ‘observed’ or ‘missing’. On the left hand side, however, ‘observed’ and ‘missing’ mean ‘observed this time’ and ‘missing this time’, respectively. This is a different concept which at the meta-mathematical level is inconsistent with $(Y, R)$ at the stochastic level: in the density functions $f(y_{obs}, y_{mis})$ and $f(y_{mis} \mid y_{obs})$, the notations $y_{obs}$ and $y_{mis}$ denote arbitrary realisations, which entails holding fixed the response pattern determining the partition of $y$ while at the same time allowing $y$ to vary (in contradiction of the stochastic relationship encoded in $(Y, R)$). To distinguish between these different concepts we define them formally.

Definition 1 (Formal Missingness). Given a response pattern $r$, we call $y_{obs}$ formally observed and $y_{mis}$ formally missing with respect to $r$ if the values over which $y_{obs}$ and $y_{mis}$ range occur only in two tuples $(y_{obs}, r)$ or three tuples $(y_{obs}, y_{mis}, r)$.

Definition 2 (Temporal Missingness). Given a response pattern $r$, we call $y_{obs}$ temporally observed and $y_{mis}$ temporally missing with respect to $r$ if the values over which $y_{obs}$ and $y_{mis}$ range occur either in two tuples $(y_{obs}, y_{mis})$ or they include two tuples $(y_{obs}, r')$ or three tuples $(y_{obs}, y_{mis}, r')$ with $r' \neq r$.

Informally, the distinction between formal and temporal missingness is that the former is what is defined formally by the relation $(Y, R)$, whereas with the latter the data variables have been partitioned according to some response pattern $r$ simply for the purpose of considering the data from a particular point of view, and there is no requirement or expectation that the formal relationship $(Y, R)$ is, or can be, preserved.

Note 1.2.1. The notation $y_{obs}$ and $y_{mis}$ used in Definitions 1 and 2 is ambiguous. For example, it is not completely clear in Definition 2 that the partition of $y$ into $y_{obs}$ and $y_{mis}$ is held fixed according to $r$ and not according to $r'$. Extended notation to allow these two notions of missingness to be distinguished more clearly will be introduced in Section 2.

Note 1.2.2. Variables in the conditional of the marginal distribution for  are temporally missing and observed, respectively, whereas variables in the conditional density are formally missing and observed. Failing to distinguish this notationally requires to be variable in the range of while simultaneously holding the response pattern fixed, which conflicts with the meaning of missingness defined by .

Note 1.2.3. When formal missingness is intended, makes sense only as one part of a pair , and this pair denotes a stochastic function more general than a random vector. On the other hand, when temporal missingness is intended, both and denote marginal distributions of that are each mixtures of formally observable and formally unobservable values.

### 1.3 Two different functions f(ymis|yobs)

A statement that is equivalent to a Missing at Random (MAR) assumption is often written in the following (or a similar) way:

$$p(y_{mis} \mid y_{obs}, r) = f(y_{mis} \mid y_{obs}). \tag{2}$$

Sometimes (2) is assumed to hold for just the realised observed values  and at other times it is assumed to hold for all possible observable values under repeated sampling from  (see  for details).

Despite the function on the right hand side of (2) being denoted ‘’, technically the functions being compared are on the left hand side and on the right hand side, where is the restriction of the projection to the domain of and denotes the function derived from the marginal density for . Note that because these functions have different domains. This mathematical distinction is a minor technicality, but the distinction is important stochastically. We will illustrate this shortly, but firstly we distinguish between these two functions by giving them each separate notation:

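$$\begin{aligned}
f^{(T)}(y_{mis} \mid y_{obs}) &:= f(y_{mis} \mid y_{obs}) && (3) \\
f^{(F)}(y_{mis} \mid y_{obs}) &:= f(- \mid -)(\pi_Y(y_{mis}, y_{obs}, r)). && (4)
\end{aligned}$$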

The stochastic difference between $f^{(T)}$ and $f^{(F)}$ is that realisations drawn according to the former come from the range of the projection $\pi_Y$, but realisations of the latter come from the domain of $\pi_Y$. That is, the realisations come from different sides of equation (1). In particular, an update to a realisation according to $f^{(F)}$ has the form of a three tuple $(y_{obs}, y_{mis}, r)$ with the response pattern $r$ remaining unchanged. However, an update to the same realisation according to $f^{(T)}$ has the form of a two tuple $(y_{obs}, y_{mis})$, and to maintain consistency with $(Y, R)$, a subsequent updating of the response pattern according to the response mechanism is required to complete the triple.

Due to this stochastic difference between $f^{(T)}$ and $f^{(F)}$, it is important to emphasise that the correct statement of equation (2) is that:

$$p(y_{mis} \mid y_{obs}, r) = f^{(F)}(y_{mis} \mid y_{obs}). \tag{5}$$

### 1.4 Conceptual difficulties for the reader

The difference between $f^{(T)}$ and $f^{(F)}$, and the failure in the literature to distinguish between these densities and between variables which are formally missing versus temporally missing, create unnecessary conceptual difficulties that can prevent a reader from obtaining a coherent conceptual picture of how the related statistical methods work. We outline some of these difficulties below.

Difficulty 1. The construction of the distribution requires identification of variables and in the domain of the marginal density  for , and this requires the reader to deal with two inconsistent definitions of missingness simultaneously that are not distinguished in the notation: temporal, , pertaining to the marginal distribution for  and formal, , as defined by .

Note that difficulty 1 is not due to the encoding of ‘observed’ and ‘missing’ into the labels ‘’ and ‘’, as opposed to ‘’ and ‘’ used in , but rather because the same labels are used both in the domain and in the range of the projection  in equation (1).

Difficulty 2. If $\tilde{y}_{obs}$ denotes the particular realised values of $Y_{obs}$, then the distribution $f^{(T)}(y_{mis} \mid \tilde{y}_{obs})$ is the wrong distribution conceptually for ignorable multiple imputation.

As we noted in Section 1.3, an update to the missing values according to $f^{(T)}$ arises as a two tuple and requires an update to the response pattern to form a completed three tuple to maintain consistency with $(Y, R)$. Therefore, a sequence of imputations drawn according to $f^{(T)}$ that is consistent with $(Y, R)$ has the form:

$$(\tilde{y}_{?}, y^{(1)}_{?}, r^{(1)}),\ (\tilde{y}_{?}, y^{(2)}_{?}, r^{(2)}),\ \ldots,\ (\tilde{y}_{?}, y^{(m)}_{?}, r^{(m)}). \tag{6}$$

This contrasts with a sequence of imputations drawn according to $f^{(F)}$, which conceptually has the correct form:

$$(\tilde{y}_{obs}, y^{(1)}_{mis}, \tilde{r}),\ (\tilde{y}_{obs}, y^{(2)}_{mis}, \tilde{r}),\ \ldots,\ (\tilde{y}_{obs}, y^{(m)}_{mis}, \tilde{r}). \tag{7}$$

Difficulty 3. Standard conventions for interpreting mathematical notation lead to ‘$f$’ in the notation ‘$f(y_{mis} \mid y_{obs})$’ being interpreted as the density $f^{(T)}$ and not the density $f^{(F)}$ as is required by equation (2) (see equation (5)).

Note that difficulty 3 does not apply to equation (2) because the context allows the reader to interpret the function on the right hand side correctly as $f^{(F)}$ (if the reader examines the notation carefully). However, this is definitely not the case with the standalone notation ‘$f(y_{mis} \mid y_{obs})$’, and it is this latter notation which permeates much of the published literature on ignorable multiple imputation methodology.

Difficulty 4. Failure to distinguish between formal and temporal missingness clashes with the standard statistical convention of inferring the identity of a density function through the denotation of the variables in its domain.

It is common to infer from the notation ‘$f(y_{obs}, y_{mis})$’ for a joint density that ‘$f(y_{obs})$’ denotes a marginal density. However, the notation ‘$f(y_{obs})$’ is ambiguous because the interpretation of $y_{obs}$ as formally observed leads to one function, but the interpretation of $y_{obs}$ as temporally observed leads to a different function with a different domain.

### 1.5 Additional limitations and notational inconsistencies

Omitting from the notation the dependence of ‘obs’ and ‘mis’ on a specific response pattern $r$ implicitly assumes that $r$ is the only response pattern of interest to the reader. This prevents the expression of the mathematical relationships between response patterns that exist within equation (2). Understanding these relationships at a conceptual level is useful for a reader to comprehend the primary implications of a MAR assumption in practice, where one response pattern per unit is observed and several different response patterns are realised overall.

The use of uppercase letters to denote both variable realisations of random vectors and the random vectors themselves is common in the literature on incomplete data methods. This is contrary to recommended practice. It is also another potential source of conceptual confusion for readers because the notation ‘$f(Y)$’ ordinarily would be understood to mean the composition $f \circ Y$ of the density function with the random variable $Y$, whereas a density function is something that is integrated to calculate probabilities for $Y$.

The use of a capital ‘$P$’ to denote a probability density function also seems fairly common in the literature on methods for incomplete data. This too is contrary to the widely understood usage in which a capital $P$ denotes the probability measure and is a function of events (subsets of outcomes), whereas the density is a corresponding function of outcomes which is integrated over subsets to calculate values for $P$. This is a further potential source of confusion for readers.

### 1.6 What we do

In Section 2 we introduce extended notation to distinguish between formal and temporal missingness, and we demonstrate how this allows the difficulties discussed in Section 1.4 and the limitations discussed in Section 1.5 to be overcome. We do this through several example derivations given in Sections 3 to 6.

## 2 Notation for (Y, R)

### 2.1 Random Vectors

Throughout, $Y$ denotes a random vector modelling the observed and unobserved data comprising all units in the study jointly, and $R$ denotes a random vector of binary response random variables of the same dimension as $Y$, where ‘1’ means observed. Joint distributions for the pair of random vectors $(Y, R)$ will be referred to as full distributions.

Note 2.1.1. We have no need to distinguish between vectors interpreted as column matrices versus row matrices, and so for our purposes we do not give vectors a column matrix interpretation and we dispense with the common transpose notations.

Note 2.1.2. Typically a data analyst thinks of a given dataset as comprising a rectangular matrix with each column pertaining to a specific ‘variable’ (for example, blood pressure) and each row pertaining to a specific unit (for example, an individual in the study). In our notation, the data matrix is shaped so that there is a single row with the data for the various units placed side by side in sets of columns.

### 2.2 Sample Spaces

Let be the set of distinct response patterns with denoting the ‘all ones’ vector corresponding to the complete cases. For convenience, we let denote the ‘all zeros’ vector corresponding to non-participants, where it may or may not be the case that for some . (We exclude so as to avoid ever having .) Note that the dot product gives the number of values observed when the response pattern is realised and, in particular, gives the number of variables in  (and also in ). Let be the set of realisable datasets, where a realizable dataset contains complete data including all values that may or may not be observable.

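Let $\Omega = \bigcup_{j=1}^{k} \Omega_j$ be the full sample space of realisable pairs of datasets and response patterns, where $\Omega_j = \mathcal{Y} \times \{r_j\}$ for $j = 1, 2, \ldots, k$. When the subscript of $r_j$ is omitted, we denote $\Omega_j$ by $\Omega_r$. Let $\pi_Y$ and $\pi_R$ denote the projections $\Omega \to \mathcal{Y}$ and $\Omega \to \mathcal{R}$, respectively.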

Realisations which represent a specific realisable dataset or response pattern only are denoted and , respectively.

### 2.3 Projections on Y and Ωj

For , let and denote the projections extracting from each vector the vectors of its observed and unobserved values, respectively, according to the response pattern . (In logic, ‘’ is commonly used for negation.) By convention we set . To apply these projections correctly over , we define the following mappings

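$$\begin{aligned}
o : \mathcal{R} &\to \{\pi(r) \circ \pi_Y : \Omega_r \to \mathcal{Y}^{\pi(r)}\} && (8) \\
m : \mathcal{R} &\to \{\pi(\neg r) \circ \pi_Y : \Omega_r \to \mathcal{Y}^{\pi(\neg r)}\} && (9)
\end{aligned}$$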

and use an abbreviated notation to refer to the images of under these mappings:

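$$\begin{aligned}
y^{ob(r)} &:= (y, r)^{o(\pi_R(y, r))} && (10) \\
y^{mi(r)} &:= (y, r)^{m(\pi_R(y, r))}. && (11)
\end{aligned}$$

$$\begin{aligned}
y^{to(r)} &:= \begin{cases} y^{\pi(r)} & \text{over } \mathcal{Y} \\ (y, r_j)^{o(\pi_R(y, r))} & \text{over } \Omega \end{cases} && (12) \\
y^{tm(r)} &:= \begin{cases} y^{\pi(\neg r)} & \text{over } \mathcal{Y} \\ (y, r_j)^{m(\pi_R(y, r))} & \text{over } \Omega. \end{cases} && (13)
\end{aligned}$$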

Note 2.3.1. The notations in (8) and (9) and on the right hand sides of (10)–(13) may seem unwieldy. Note that these notations are needed solely for the purpose of carefully defining the four symbols $y^{ob(r)}$, $y^{mi(r)}$, $y^{to(r)}$ and $y^{tm(r)}$. It is only these latter four symbols that are needed for working with densities for the distributions for $(Y, R)$ themselves.

Note 2.3.2. The vectors $y^{ob(r)}$ and $y^{to(r)}$ have length $r \cdot r$ while the vectors $y^{mi(r)}$ and $y^{tm(r)}$ have length $r_1 \cdot r_1 - r \cdot r$. Note that these lengths vary from response pattern to response pattern.
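As an illustration of how a response pattern splits a data vector, the following Python sketch extracts the observed and unobserved entries and checks the lengths described in Note 2.3.2. The function names `observed_part`, `missing_part` and `dot`, and the example values, are assumptions made for illustration only; they are not notation from this paper.

```python
from typing import Sequence, Tuple


def observed_part(y: Sequence[float], r: Sequence[int]) -> Tuple[float, ...]:
    """Entries of y at the positions where the response pattern r equals 1."""
    return tuple(value for value, flag in zip(y, r) if flag == 1)


def missing_part(y: Sequence[float], r: Sequence[int]) -> Tuple[float, ...]:
    """Entries of y at the positions where the response pattern r equals 0."""
    return tuple(value for value, flag in zip(y, r) if flag == 0)


def dot(a: Sequence[int], b: Sequence[int]) -> int:
    """Ordinary dot product of two binary vectors."""
    return sum(x * z for x, z in zip(a, b))


y = (5.0, 7.0, 9.0)      # hypothetical complete dataset
r = (1, 0, 1)            # hypothetical response pattern
r_all_ones = (1, 1, 1)   # the 'all ones' pattern (complete cases)

y_ob = observed_part(y, r)
y_mi = missing_part(y, r)

# The observed part has length r.r; the unobserved part has the remaining length.
assert len(y_ob) == dot(r, r)
assert len(y_mi) == dot(r_all_ones, r_all_ones) - dot(r, r)
```

The same two functions can be applied with a pattern other than the one actually realised, which corresponds to partitioning the data ‘this time’ rather than according to the formal relation.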

Note 2.3.3. The projections $ob(r)$ and $mi(r)$ apply solely on the range of $(Y, R)$ and are always consistent with the missingness relation $(Y, R)$. Each response pattern $r$ gives projections $ob(r)$ and $mi(r)$ on $\Omega_r$, and these are pieced together over all response patterns to give a single pair of functions on all of $\Omega$.

Note 2.3.4. The projections $to(r)$ and $tm(r)$ apply on either $\mathcal{Y}$ or $\Omega$ as the context dictates. Each $r$ gives a distinct pair of projections $to(r)$ and $tm(r)$ on all of $\mathcal{Y}$ or all of $\Omega$, as the case may be. In the latter case, $to(r)$ and $tm(r)$ are consistent with $(Y, R)$ on $\Omega_r$ and inconsistent with $(Y, R)$ elsewhere on $\Omega$. The ‘t’ in ‘to’ and ‘tm’ can be taken to mean ‘temporally’ or ‘this time’.

Note 2.3.5. The notation ‘$f(y^{tm(r)} \mid y^{to(r)})$’ is ambiguous because as defined by (12) and (13) this can denote either the function $f^{(T)}$ (see (3)) defined on $\mathcal{Y}$ or a function (not $f^{(F)}$) defined on all of $\Omega$. However, the notation ‘$f(y^{mi(r)} \mid y^{ob(r)})$’ is unambiguous because by (10) and (11) this must denote $f^{(F)}$.

### 2.4 Observable Data Events

Given $(y, r) \in \Omega$, we call

$$\Omega_{(y, r)} = \{(y^{ob(r)}, y_*^{mi(r)}, r) : y_* \in \mathcal{Y}\} \tag{14}$$

the observed data event corresponding to $(y, r)$. The set $\Omega_{(y, r)}$ consists of all datasets which have the same observed values as $y$ (as defined by the response pattern $r$). For a fixed $r$, the events in (14) partition $\Omega_r$, and over all $r$ they give a partition of $\Omega$. These observable data events are the classes of the equivalence relation defined by setting $(y, r) \sim (y', r')$ for all $(y, r), (y', r') \in \Omega$ if, and only if, $r = r'$ and $y^{ob(r)} = y'^{ob(r)}$.
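The partition into observable data events can be made concrete on a small finite sample space. The sketch below is a hedged illustration: the binary datasets, the three response patterns, and the names `observed_part`, `same_event` and `events` are all assumptions chosen for the example, not part of the paper's notation.

```python
from collections import defaultdict
from itertools import product


def observed_part(y, r):
    """Entries of y at the positions where the response pattern r equals 1."""
    return tuple(value for value, flag in zip(y, r) if flag == 1)


def same_event(pair_a, pair_b):
    """Equivalence relation: same response pattern and same observed values."""
    (y_a, r_a), (y_b, r_b) = pair_a, pair_b
    return r_a == r_b and observed_part(y_a, r_a) == observed_part(y_b, r_b)


# Hypothetical finite sample space: two binary variables, three response patterns.
datasets = list(product((0, 1), repeat=2))
patterns = [(1, 1), (1, 0), (0, 1)]
omega = [(y, r) for r in patterns for y in datasets]

# Group the pairs (y, r) into equivalence classes keyed by (observed values, pattern).
events = defaultdict(list)
for y, r in omega:
    events[(observed_part(y, r), r)].append((y, r))

# Every pair lands in exactly one class, so the classes partition the sample space.
assert sum(len(members) for members in events.values()) == len(omega)
```

Under the complete-case pattern each class is a single point, while under a pattern with one unobserved variable each class contains one pair per possible value of that variable.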

### 2.5 Density Functions

We specify full distributions for through density functions , with probabilities being determined by integration: for any for which a probability can be defined (see  or  for details). Note we suppress the dominating measure in the notation. Two different ways of factorizing are useful:

$$h(y, r) = f(y)\, g(r \mid y) = p(r)\, p(y \mid r). \tag{15}$$

The first factorization in (15) is called a selection model factorization of $h$, and the factor $g(r \mid y)$ is called the response mechanism. The second factorization in (15) is called a pattern-mixture factorization, and for each $r \in \mathcal{R}$, we call the conditional density $p(y \mid r)$ the pattern mixture component pertaining to $r$.
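The two factorizations in (15) can be checked numerically on a small discrete example. The sketch below is an assumption-laden illustration, not a construction from the paper: the probabilities in `f`, the response mechanism `g`, and the two response patterns are invented for the example, and exact rational arithmetic is used so that the equalities hold without rounding.

```python
from fractions import Fraction
from itertools import product

# Hypothetical setting: two binary variables and two response patterns.
datasets = list(product((0, 1), repeat=2))
patterns = [(1, 1), (1, 0)]

# Assumed marginal density f(y) for the data.
f = {
    (0, 0): Fraction(1, 10),
    (0, 1): Fraction(2, 10),
    (1, 0): Fraction(3, 10),
    (1, 1): Fraction(4, 10),
}


def g(r, y):
    """Assumed response mechanism g(r | y); it depends on the first variable only."""
    prob_complete = Fraction(3, 4) if y[0] == 1 else Fraction(1, 4)
    return prob_complete if r == (1, 1) else 1 - prob_complete


# Selection model factorization: h(y, r) = f(y) g(r | y).
h = {(y, r): f[y] * g(r, y) for y in datasets for r in patterns}

# Pattern-mixture factorization: p(r) and p(y | r) are derived from h.
p_r = {r: sum(h[(y, r)] for y in datasets) for r in patterns}
p_y_given_r = {(y, r): h[(y, r)] / p_r[r] for y in datasets for r in patterns}

# Both factorizations describe the same full density.
assert sum(h.values()) == 1
assert all(h[(y, r)] == p_r[r] * p_y_given_r[(y, r)] for y in datasets for r in patterns)
```

Starting from the selection model and deriving the pattern-mixture components is one of two possible directions; a model could equally be specified through `p_r` and `p_y_given_r` first.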

Note 2.5.1. Technically, the symbols , , and denote density functions and , , , and denote real numbers. Because it is common in statistics to use the same symbol to denote different densities, for example a joint density and a marginal density , we adopt the usual convention and often refer to density functions by their values.

## 3 The observable data distribution

To apply likelihood theory to incomplete data, from the model for the full data one must construct a model for just the observable data. This involves specifying a set of outcomes and a set of events for the observable data, and to each full density $h$, a corresponding density on the set of outcomes for the observable data. As a first demonstration of the use of the notation in Section 2, here we give an explicit construction for this probability space together with a step-by-step derivation of the density given in (5.11) in the extract quoted in Section 1.

The outcomes can be taken to be either the set of observable data events or the range of the map because there is a one-to-one correspondence between  and . The latter seems to be preferred [3, 4, 12]:

$$\Omega^{ob} := \bigcup_{j=1}^{k} \left(\mathcal{Y}^{to(r_j)} \times \{r_j\}\right). \tag{16}$$

This is an irregularly-shaped set because, as noted in Section 2, the vectors $y^{to(r)}$ typically have different lengths for different response patterns.

Under the one-to-one correspondence between  and , events in correspond to unions of observable data events in

. Restricting to observable data events gives the density for the probability distribution on

:

 (17)

This can be seen to be the required density simply by pulling events in back to unions of observable data events in  and integrating over these corresponding events for  (by applying iterated integrals as per Fubini’s Theorem;  p 101). Note that we use and not in ‘’ because the integrand is defined on all of and the variables integrated out of are different for each subset .

The response-pattern-dependent processing being performed in the construction of the density in (17) does not correlate well with the selection-model factorization for $h$, and this can make the construction seem a little opaque. An alternative derivation is possible starting with a pattern-mixture factorization for $h$.

One way to do this is to start from , restrict to : , marginalize to :

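$$\begin{aligned}
h(y^{ob(r_j)}, r_j) &= \int p(r_j)\, p(y^{mi(r_j)}, y^{ob(r_j)} \mid r_j)\, dy^{mi(r_j)} \\
&= p(r_j) \int p(y^{mi(r_j)}, y^{ob(r_j)} \mid r_j)\, dy^{mi(r_j)}
\end{aligned} \tag{18}$$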

and then put the pieces together over all of : . Alternatively, for each one can marginalisze over all of :

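$$\begin{aligned}
h(y^{to(r_j)}, r) &= \int p(r)\, p(y^{tm(r_j)}, y^{to(r_j)} \mid r)\, dy^{tm(r_j)} \\
&= p(r) \int p(y^{tm(r_j)}, y^{to(r_j)} \mid r)\, dy^{tm(r_j)} \\
&= p(r)\, p(y^{to(r_j)} \mid r),
\end{aligned} \tag{19}$$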

restrict to : , and then put the pieces together over all of :

$$h(y^{ob(r)}, r) = p(r)\, p(y^{ob(r)} \mid r). \tag{20}$$
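For a discrete model the integrals above become sums, and the observable data density can be computed directly. The following sketch is a minimal illustration under assumed values: the density `f`, the response mechanism `g`, and the names `h_ob` and `p_ob_given_r` are hypothetical. It builds the density on the irregularly-shaped set of observable outcomes by summing the unobserved entries out of the full density, and compares the result with the pattern-mixture form in (20).

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def observed_part(y, r):
    """Entries of y at the positions where the response pattern r equals 1."""
    return tuple(value for value, flag in zip(y, r) if flag == 1)


# Hypothetical setting: two binary variables and two response patterns.
datasets = list(product((0, 1), repeat=2))
patterns = [(1, 1), (1, 0)]
f = {(0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10),
     (1, 0): Fraction(3, 10), (1, 1): Fraction(4, 10)}


def g(r, y):
    """Assumed response mechanism g(r | y); it depends on the first variable only."""
    prob_complete = Fraction(3, 4) if y[0] == 1 else Fraction(1, 4)
    return prob_complete if r == (1, 1) else 1 - prob_complete


h = {(y, r): f[y] * g(r, y) for y in datasets for r in patterns}
p_r = {r: sum(h[(y, r)] for y in datasets) for r in patterns}

# Observable data density: sum the unobserved entries out of h, pattern by pattern.
h_ob = defaultdict(Fraction)
for (y, r), value in h.items():
    h_ob[(observed_part(y, r), r)] += value

# Pattern-mixture route: p(y_ob | r) is the marginal of p(y | r) over the unobserved entries.
p_ob_given_r = defaultdict(Fraction)
for (y, r), value in h.items():
    p_ob_given_r[(observed_part(y, r), r)] += value / p_r[r]

# Both routes give the same density, and it sums to one over the observable outcomes.
assert all(h_ob[key] == p_r[key[1]] * p_ob_given_r[key] for key in h_ob)
assert sum(h_ob.values()) == 1
```

The keys of `h_ob` have different lengths for different response patterns, which is the discrete counterpart of the irregular shape of the set of observable outcomes.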

Note 3.1. In (19), for a given the density is a marginal density of  with domain . There are of these distributions. On the other hand, there is only one density with domain . For a given , the function agrees with on the set , but comparison of these two functions on the rest of their domains is not well defined.

Note 3.2. Because (16) is irregularly shaped and not a Cartesian product, the stochastic function obtained by composing with is not a random vector. Tsiatis ( page 13) calls these ‘random quantities’. Stochastic functions more general than random vectors are called ‘random objects’ by Ash ( page 178) and ‘random elements’ by Shorack ( page 90). To be applicable to incomplete data, the likelihood theory must be sufficiently general to cover these random quantities. See  pages 563–567 for a sufficiently general likelihood theory for the case of IID data.

Note 3.3. If is interpreted as formally observed and considered to vary over response patterns, then it denotes the composition of with . As was noted in Section 1.2, when interpreted this way alone is insufficent to model the observable data. This is because there is potential for clashes between the ranges from distinct response pattern. That is, we may have with on the right hand side of (16).

## 4 Temporally observed and temporally missing variables are formally mixed

As a second example of the use of the notation defined in Section 2, we give a formal demonstration that the random vectors $Y^{to(r)}$ and $Y^{tm(r)}$ each comprise mixtures of formally observable and formally unobservable data.

To do this neatly, we define a partial order $\le_p$ on $\mathcal{R}$ as follows (see Figure 1 for the definition of a partial order): for each $l \in \{1, 2, \ldots, r_1 \cdot r_1\}$ let $\pi_l$ denote the projection with domain $\mathcal{R}$ extracting the $l$-th coordinate of each response pattern. Then for $r_i, r_j \in \mathcal{R}$ define

$$r_i \le_p r_j \iff \pi_l(r_i) \le \pi_l(r_j) \text{ for all } l \in \{1, 2, \ldots, r_1 \cdot r_1\}. \tag{21}$$

In words, $r_i \le_p r_j$ if, and only if, all values that are defined to be observed according to pattern $r_i$ are defined to be observed according to pattern $r_j$. It is straightforward to check that this relation is reflexive, transitive and anti-symmetric.

Figure 1 illustrates $\le_p$ when $Y$ has three variables (and all possible patterns):

$$\mathcal{R} = \{\, r_1 = (1,1,1),\ r_2 = (0,1,1),\ r_3 = (1,0,1),\ r_4 = (0,0,1),\ r_5 = (1,1,0),\ r_6 = (0,1,0),\ r_7 = (1,0,0),\ r_8 = (0,0,0) \,\}.$$
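The coordinatewise relation in (21) is simple to check by brute force on the eight patterns listed above. The sketch below is illustrative only; the function name `leq_p` and the use of exhaustive enumeration are assumptions made for the example.

```python
from itertools import product


def leq_p(r_i, r_j):
    """Coordinatewise comparison: everything observed under r_i is observed under r_j."""
    return all(a <= b for a, b in zip(r_i, r_j))


# All eight response patterns for three variables.
patterns = list(product((0, 1), repeat=3))

# Exhaustively check the three defining properties of a partial order.
reflexive = all(leq_p(r, r) for r in patterns)
antisymmetric = all(
    r == s for r in patterns for s in patterns if leq_p(r, s) and leq_p(s, r)
)
transitive = all(
    leq_p(r, t)
    for r in patterns for s in patterns for t in patterns
    if leq_p(r, s) and leq_p(s, t)
)

# Some pairs are not comparable in either direction, so the order is only partial.
incomparable = [
    (r, s) for r in patterns for s in patterns
    if not leq_p(r, s) and not leq_p(s, r)
]
```

For example, the patterns (0,1,1) and (1,0,1) each observe a value that the other does not, so neither precedes the other.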

Let $r \in \mathcal{R}$ and consider a full density $h$ for $(Y, R)$ as in (15) factored into selection model and pattern-mixture forms, $h(y, r) = f(y)\, g(r \mid y) = p(r)\, p(y \mid r)$. Marginalising the latter factorization over all response patterns gives the marginal density for $Y$ as a mixture of the pattern-mixture components:

$$f(y) = \sum_{j=1}^{k} p(r_j)\, p(y \mid r_j). \tag{22}$$

Letting $y = (y^{tm(r)}, y^{to(r)})$ and substituting into both sides of (22) gives:

$$f(y^{tm(r)}, y^{to(r)}) = \sum_{j=1}^{k} p(r_j)\, p(y^{tm(r)}, y^{to(r)} \mid r_j). \tag{23}$$

Now in the sum on the right-hand side of (23), the terms for which all the entries of $y^{tm(r)}$ are labelled as formally missing according to $r_j$ are those with response patterns satisfying $r_j \le_p r$ (according to the partial order defined in (21)). Similarly, the terms for which all the entries of $y^{to(r)}$ are labelled as formally observed according to $r_j$ are those with response patterns satisfying $r \le_p r_j$. By anti-symmetry, the only component on the right-hand side of (23) for which all labelling of the values is formally correct is the single component with $r_j = r$. Hence, provided $\mathcal{R}$ contains at least two response patterns, one of $Y^{to(r)}$ and $Y^{tm(r)}$ is a mixture of formally observable and formally unobservable data. (This shows that at least one of $Y^{to(r)}$ and $Y^{tm(r)}$ is mixed. In most cases, this will be true of both.)

## 5 Derivation of the MAR Identity

As a third demonstration of use of the notation defined in Section 2, we give a derivation of equation (2).

Definition 5.1. Given factorised in selection model form together with observed data , we say that the response mechanism is Missing at Random (MAR) with respect to if is a constant function on .

Note that Rubin  defines MAR for a model . This can be accommodated by requiring that MAR hold with respect to for all densities in . Everywhere MAR (in ) is accommodated by requiring that MAR hold with respect to all observed data events (for all densities in ).

Let $h$ be as in (15) and let $(y, r)$ be a partially-observed realisation drawn according to $h$. Partitioning $y$ into observable and unobservable components as defined by $r$ gives

$$p(r)\, p(y^{mi(r)}, y^{ob(r)} \mid r) = f(y^{mi(r)}, y^{ob(r)})\, g(r \mid y). \tag{24}$$

Note that the ‘’ in denotes the function  (see (4)) and not the function . Factorizing the joint density for the values on each side of (24) into the product of a marginal and a conditional density, and then rearranging (provided all required denominators are non-zero) gives:

$$p(y^{mi(r)} \mid y^{ob(r)}, r) = \frac{f(y^{mi(r)} \mid y^{ob(r)})\, f(y^{ob(r)})\, g(r \mid y)}{p(r)\, p(y^{ob(r)} \mid r)}. \tag{25}$$

In (25) the function denotes the composition of the marginal density with the projection  (suitably restricted).

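If $g$ is MAR with respect to $(y, r)$, then over $\Omega_{(y, r)}$ the only non-constant factor on the right hand side is $f(y^{mi(r)} \mid y^{ob(r)})$. Integrating both sides with respect to the variables $y^{mi(r)}$ and rearranging gives

$$p(y^{ob(r)} \mid r) = \frac{1}{p(r)}\, f(y^{ob(r)})\, g(r \mid y) \tag{26}$$

because each of the conditional densities in (25) integrates to one over $y^{mi(r)}$. Substituting (26) back into (25) then gives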

$$p(y^{mi(r)} \mid y^{ob(r)}, r) = f(y^{mi(r)} \mid y^{ob(r)}). \tag{27}$$
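The identity (27) can be verified numerically on a small discrete model in which the response mechanism is constant on every observed data event. The sketch below is a hedged illustration: the density `f`, the mechanism `g` (which depends only on the first variable, a variable observed under both patterns), and the dictionary names are assumptions introduced for the example.

```python
from fractions import Fraction
from itertools import product

# Hypothetical setting: two binary variables; the second may be unobserved.
datasets = list(product((0, 1), repeat=2))
patterns = [(1, 1), (1, 0)]
f = {(0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10),
     (1, 0): Fraction(3, 10), (1, 1): Fraction(4, 10)}


def g(r, y):
    """Assumed response mechanism; it depends on the always-observed first variable only."""
    prob_complete = Fraction(3, 4) if y[0] == 1 else Fraction(1, 4)
    return prob_complete if r == (1, 1) else 1 - prob_complete


h = {(y, r): f[y] * g(r, y) for y in datasets for r in patterns}

r = (1, 0)  # first variable observed, second variable unobserved
lhs = {}    # p(y_mi | y_ob, r), computed from the full density h
rhs = {}    # f(y_mi | y_ob), computed from the marginal density f alone
for y1, y2 in datasets:
    lhs[(y1, y2)] = h[((y1, y2), r)] / sum(h[((y1, v), r)] for v in (0, 1))
    rhs[(y1, y2)] = f[(y1, y2)] / sum(f[(y1, v)] for v in (0, 1))

# Under this mechanism the two conditional densities agree on every observed data event.
assert lhs == rhs
```

If `g` were changed to depend on the second variable, the equality would generally fail for the pattern in which that variable is unobserved.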

## 6 Further analysis of the MAR Identity

As a final example of use of the notation in Section 2, we examine the MAR identity (27) more closely. For a fixed $(y, r)$, the domain of the densities in this equality is the observed data event $\Omega_{(y, r)}$. When restricted to this event, $\pi_Y$ gives a bijection onto a corresponding subset of $\mathcal{Y}$. Combining the inverse of this bijection with (27) gives

$$f(y^{tm(r)} \mid y^{to(r)}) = f(y^{mi(r)} \mid y^{ob(r)}) = p(y^{mi(r)} \mid y^{ob(r)}, r). \tag{28}$$

For notational simplicity, we relabel the response patterns, if necessary, so that $r = r_k$. Conditioning on the variables $y^{to(r)}$ in (23) yields

$$f(y^{tm(r)} \mid y^{to(r)}) = p(r)\, p(y^{mi(r)} \mid y^{ob(r)}, r) + \sum_{j=1}^{k-1} p(r_j)\, p(y^{tm(r)} \mid y^{to(r)}, r_j). \tag{29}$$

Substituting (28) into (29) and rearranging then gives:

$$p(y^{mi(r)} \mid y^{ob(r)}, r) = \frac{1}{1 - p(r)} \sum_{j=1}^{k-1} p(r_j)\, p(y^{tm(r)} \mid y^{to(r)}, r_j). \tag{30}$$

When the data comprise IID draws with differing response patterns across units, holding fixed in (30) and letting vary shows that associations on the left hand side for which data are never observed are partially observed on the right hand side amongst units with response patterns different from . This key feature of MAR is obscured in the notation on the right hand side of (2).

## 7 Discussion

Missing data is a common problem across a broad range of medical and public health research, and in other fields of empirical research as well. Consequently, there is a broad range of stakeholders with an interest in being able to read and understand the literature on the relevant statistical methods. We have identified two significant ambiguities in the use of notation in this literature which we suggest undermine its purpose to disseminate the requisite information in a clear and logically coherent manner.

The first involves failure to distinguish between two different relationships between data vectors and response indicators, one in the domain and one in the range of the projection $\pi_Y$. Three distinct relationships are implicit in the original work: an ‘either/or’ notion through an extended random vector in which an investigator observed either a data value or a missing value symbol, but not both; a ‘temporal’ partition of the marginal distribution of the data according to the response pattern that was observed ‘this time’; and the ‘absolute’ relation arising implicitly through specification of the joint density for this random vector. The equating of the latter two relationships is not evident in the original work and seems to have entered into the literature later.

We have noted the presence, in the extract quoted in Section 1.2, of the manual transference of the missingness relation to the codomain of $\pi_Y$ through the notation ‘$Y_{obs}$’ and ‘$Y_{mis}$’, and note that this practice has persisted in the literature for more than three decades. We suggest that it circumvents the mathematical framework established through $(Y, R)$ because $\pi_Y$ is not a ‘missingness preserving’ transformation. This renders the literature opaque because missingness is not an intrinsic attribute of a data vector $y$, and the relationship between data vectors and response indicators defined by $(Y, R)$ does not exist in the marginal distribution for $Y$. Specifically, the notations $f(y_{obs}, y_{mis})$ and $f(y_{mis} \mid y_{obs})$ at the meta-mathematical level contradict $(Y, R)$ at the stochastic level because $y_{obs}$ and $y_{mis}$ denote variables in the domains of the respective densities, and conceptually this requires holding the response pattern fixed and allowing $y$ to vary in contradiction of the stochastic relationship $(Y, R)$.

We have also explained how the equating of formal and temporal missingness through the same notation for both results in ambiguity in the notation $y_{obs}$: in the former case,

makes sense only as part of an ordered pair (

and denotes a stochastic function more general than a random vector (or random variable), and in the latter case, and  each denote random vectors that are marginal distributions of  and each are mixtures of formally observed and formally missing values (as defined by ). Additionally, we have explained how this equating of different relationships conflicts with the statistical convention of identifying a density function through the notation used for the variables in its domain. Specifically, the same notation ‘