1. Introduction
Empirical researchers often face a missing data problem, also called selection or censoring. Because of missing data, the observed data on an outcome variable correspond to draws from the law of the outcome conditional on nonmissingness. Most of the time, the law of interest is the unconditional one. But the researcher can also be interested in the law of the outcome variable for the population that does not reveal the value of the outcome (the censored one). For example, surveys rely on a sample drawn at random and the estimators require the observation of all sampled units. In practice, some data are missing and those estimators cannot be computed. A common practice is to rely on imputations: the missing observations are replaced by artificial ones so that the estimator can eventually be computed. In the presence of endogenous censoring, the law conditional on censoring is the relevant one for imputation.
It is usual to assume that the data is Missing at Random (henceforth MAR, see [12]), in which case there are variables which are never missing such that the law of the outcome conditional on them and nonmissingness is the same as the law of the outcome conditional on them and missingness. Under such an assumption, the estimable conditional law is the same as the one which is unconditional on missingness. As a consequence, the researcher does not need a model for the joint law of the outcome and selection, and the selection can be ignored. In survey sampling, the sampling frame can be based on variables available for the whole population, for example, if it involves stratification. In this case, those variables are natural candidates for conditioning variables for MAR to hold. In practice, there is noncompliance: the researcher often does not have observations for all sampled units. Though the original sampling law is known, the additional layer of missing data can be viewed as a second selection mechanism conditional on the first one. The law of this second selection mechanism is unknown to the statistician. Oftentimes it can be suspected that units reveal the value of a variable partly depending on the value of that variable, and the MAR assumption does not hold. This is a type of endogeneity issue commonly studied in econometrics. For example, wages are only observed for those who work. Firms only carry out investment decisions if the net discounted value is nonnegative. An individual might be less willing to answer a question on their salary because it is not a typical one (either low or high). We expect a strong heterogeneity in the mechanism that drives individuals to not reveal the value of a variable.
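The difference between MAR and endogenous selection can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the binary stratum `x`, the Gaussian outcome, and the response probabilities 0.9 and 0.3 are hypothetical choices, not quantities from this paper. Within the stratum `x = 1`, it compares the mean outcome of respondents with the mean outcome of all sampled units.

```python
import random
import statistics


def stratum_gap(n, nmar, seed=0):
    """Respondent mean minus full-sample mean of y within the stratum x = 1."""
    rng = random.Random(seed)
    all_y, resp_y = [], []
    for _ in range(n):
        x = rng.random() < 0.5                 # always-observed stratum indicator (hypothetical)
        y = rng.gauss(2.0 if x else 0.0, 1.0)  # outcome, observed only for respondents
        if nmar:
            p = 0.9 if y < 1.0 else 0.3        # response depends on the outcome itself
        else:
            p = 0.9 if x else 0.3              # response depends on x only: MAR given x
        responds = rng.random() < p
        if x:
            all_y.append(y)
            if responds:
                resp_y.append(y)
    return statistics.fmean(resp_y) - statistics.fmean(all_y)


mar_gap = stratum_gap(100_000, nmar=False)
nmar_gap = stratum_gap(100_000, nmar=True)
print(f"gap under MAR given x:        {mar_gap:+.3f}")
print(f"gap under endogenous missing: {nmar_gap:+.3f}")
```

Under the MAR design the gap is close to zero, so conditioning on `x` is enough. Under the second design the respondents are systematically different even within the stratum, which is the situation studied in the rest of the paper.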
When the MAR assumption no longer holds, the selection mechanism cannot be ignored. Identification of the law of the outcome or the law conditional on missingness usually relies on the specification of a model for the vector formed by the outcome and a model for the selection. The alternative approach is to follow the partial identification route and recognize that the parameters of interest, which are functionals of these laws, lie in sets. The Tobit and generalized Tobit models (the latter also called the Heckman selection model, see [11]) are classical parametric selection models to handle endogenous censoring. The generalized Tobit model involves a system of two equations: one for the outcome and one for the selection. Each of these equations involves an error term and these errors are dependent, hence the endogeneity. Identification in such systems relies on some variables which appear in the selection equation, are not measurable with respect to the sigma-field generated by the variables in the outcome equation, and do not have an effect on the errors. So these variables have an effect on the selection but not on the outcome. They are called instrumental variables or simply instruments.
This paper presents nonparametric models in Section 3. We explain in Section 4 that having a one dimensional error term appearing in an additively separable form in the selection equation implies the so-called instrument monotonicity. Instrument monotonicity was introduced in [2]. It has strong identification power but at the same time leads to unrealistic selection equations, as we detail in Section 4. To overcome this issue, we present in Section 5 selection equations where the error in the selection equation is multidimensional and appears in a non additively separable fashion. The baseline specification is a model where the selection equation involves an index with random coefficients. We show that we can rely on a nonparametric model for these random coefficients. Finally, Section 6 presents a method to obtain a confidence interval around a nonlinear statistic like the Gini index with survey data in the presence of non MAR^{1} missing data when we suspect that some instruments are nonmonotonic. These confidence intervals account for both the uncertainty due to survey sampling and the one due to missing data.
^{1} The terminology nonignorable (see [12]) is also used but strictly speaking it is defined for parametric models and requires parameter spaces to be rectangles. This is why we do not use this terminology in this paper.
2. Preliminaries
Bold letters are used for vectors and matrices and capital letters for random elements. denotes the indicator function, the derivative with respect to the variable , the inner product in the Euclidean space, the Euclidean norm, the spherical measure on the unit sphere in the Euclidean space. We write when we want to make clear that the Euclidean space is . We write a.e. for almost everywhere.
All random elements are defined on the same probability space with probability and is the expectation. The support of a function or random vector is denoted by . We denote by the support of the conditional law of given when it makes sense. For a random vector , is its density with respect to a measure which will be clear in the text and is its dimension. We use the notation for a conditional density and for the conditional expectation function evaluated at. Equalities between random variables are understood almost surely. Random vectors appearing in models and whose realisations are not in the observed data are called unobservable.
3. Models with One Unobservable for Endogenous Censoring
In this paper, the researcher is interested in features of the law of a variable given . She has censored observations of , uncensored observations of a vector of which is a subvector, and is a binary variable equal to 1 when is not censored and 0 otherwise. Inference on the conditional law of given is possible if and are independent given , namely if, for every bounded continuous function ,
(1) 
in which case
(2) 
and we conclude by the law of iterated expectations. Condition (1) is called Missing at Random. When it holds without the conditioning on , it is called Missing Completely at Random (MCAR, see [12]).
We consider cases where the researcher does not know that a specific uncensored vector is such that (1) holds. Then is partly based on , even conditionally. This situation is called Not Missing at Random (NMAR, see [12]). In the language of econometrics, this is called endogenous censoring or selection.
Important parametric models rely on as a model equation for the variable of interest, and are unknown parameters, and are independent, and is a standard normal random variable. In the Tobit model, for a given threshold . In the Heckman selection model (see [11])
(3)  
(4)  
(5)  
(6) 
(3) is called the selection equation. The law of given and , hence of given is identified and the model parameters can be estimated by maximum likelihood. Some functionals of the conditional law of given can be estimated for some semiparametric extensions. For example, the conditional mean function can be obtained by estimating a regression model with an additional regressor which is a function of . This leads to the interpretation that the endogeneity can be understood as a missing regressor problem.
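The "missing regressor" interpretation can be checked numerically in the textbook Gaussian case. The sketch below assumes a specific form of the model, which is a labelled assumption since equations (3)-(6) are not reproduced here: outcome `beta0 + sigma * eps`, selection indicator equal to 1 when `gamma0 + gamma1 * z + eta >= 0`, and `(eps, eta)` standard bivariate normal with correlation `rho`, independent of `z`. Under these assumptions the mean of the observed outcomes at a given `z` equals `beta0 + rho * sigma * lambda(gamma0 + gamma1 * z)`, where `lambda` is the inverse Mills ratio. All parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def inverse_mills(t):
    """Inverse Mills ratio phi(t) / Phi(t) of the standard normal law."""
    return STD_NORMAL.pdf(t) / STD_NORMAL.cdf(t)


def selected_means(n, beta0, sigma, rho, gamma0, gamma1, seed=0):
    """Simulate the assumed model and average the observed outcomes for each z."""
    rng = random.Random(seed)
    selected = {-1.0: [], 0.0: [], 1.0: []}
    for _ in range(n):
        z = rng.choice((-1.0, 0.0, 1.0))         # discrete instrument, for simplicity
        eps = rng.gauss(0.0, 1.0)
        eta = rho * eps + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        if gamma0 + gamma1 * z + eta >= 0.0:     # the outcome is observed
            selected[z].append(beta0 + sigma * eps)
    return {z: fmean(ys) for z, ys in selected.items()}


params = dict(beta0=1.0, sigma=1.0, rho=0.6, gamma0=0.0, gamma1=1.0)
simulated = selected_means(300_000, **params)
for z, mean_obs in simulated.items():
    correction = params["rho"] * params["sigma"] * inverse_mills(
        params["gamma0"] + params["gamma1"] * z)
    print(f"z={z:+.0f}  observed mean={mean_obs:.3f}  "
          f"beta0 + correction={params['beta0'] + correction:.3f}")
```

The correction term shrinks as the selection probability grows, which is why the naive mean of the observed outcomes is most distorted where few units respond.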
A more general model is
(7)  
(8)  
(9)  
(10) 
Equation (7) is the selection equation or missing mechanism. This model is quite general and clearly . By applying the nondecreasing CDF of on both sides of the inequality, it yields the same conditional law of given as
where and the law of are unknown.
Remark 1.
If we replace (8) by and are independent given , assumption MAR holds by taking a vector whose components are those of and .
Condition (8) allows for dependence between and and to be partly based on , even conditionally. It provides an alternative identification strategy. Indeed we can check that, for all bounded continuous function ,
(11) 
This is a key element to obtain the law of given because
(12)  
(13) 
But (11) also allows one to obtain the law of given and censorship ().
(14) 
Remark 2.
Similar computations are given for a binary treatment effect model in [1] for effects that depend on an average (i.e. for all ) rather than the whole law as above. There the integrand is called the local instrumental variable.
The vector is called a vector of instrumental variables. By (8), has a direct effect on via which is non trivial but it does not have an effect on given .
Condition (10) is strong. First, the support of should be infinite, so in practice at least one variable in should be continuous. Second, the variation of should be large enough to move the selection probability from 0 to 1. For all , there should exist a fraction of the population (based on the value of their ) who reveal their with probability larger than and a fraction of the population who do not reveal their with probability larger than . This is a "large support" assumption. Using (13) for identification is called identification at infinity. It does not deliver an efficient method for estimation because it only makes use of the subsample for which is close to 1. In contrast, (12) can be used to form estimators which use all the data. (9) and (10) were not required in the parametric Tobit and Heckman selection models. The task of finding which satisfies (8) was already difficult, but working with the nonparametric model requires those additional stringent assumptions.
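The idea behind identification at infinity, and its cost in terms of sample size, can be seen in a simulation. The sketch below is a hypothetical design, not the paper's model: the outcome is `1 + eps`, selection occurs when `z + eta >= 0`, the instrument `z` is Gaussian with a large standard deviation, and `(eps, eta)` are correlated standard normals. It compares the mean of all observed outcomes with the mean restricted to units whose selection probability exceeds 0.99.

```python
import math
import random
from statistics import NormalDist, fmean


def simulate(n=200_000, rho=0.6, seed=0):
    """Return (mean of observed y, mean of observed y with P(S=1|z) > 0.99, subsample share)."""
    rng = random.Random(seed)
    cdf = NormalDist().cdf
    observed, near_one = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 3.0)                  # instrument with large support
        eps = rng.gauss(0.0, 1.0)
        eta = rho * eps + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        y = 1.0 + eps                            # unconditional mean of y is 1.0
        if z + eta >= 0.0:                       # y is revealed; P(S=1 | z) = Phi(z)
            observed.append(y)
            if cdf(z) > 0.99:
                near_one.append(y)
    return fmean(observed), fmean(near_one), len(near_one) / n


naive_mean, mean_at_infinity, share_used = simulate()
print(f"mean of all observed outcomes:      {naive_mean:.3f}")
print(f"mean where P(S=1|z) > 0.99:         {mean_at_infinity:.3f}")
print(f"share of the sample actually used:  {share_used:.3f}")
```

The restricted mean is close to the true value 1.0 while the naive one is not, but only a fraction of the sample contributes, which illustrates the inefficiency mentioned above.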
4. Monotonicity
In this section, we show that the above nonparametric specification is not as general as one might think. From a modelling perspective, it is equivalent (see [14]) to the so-called instrument monotonicity introduced in [2].
For the sake of exposition, assume that is discrete. For and individuals that we index by , such that , we have . Suppose now that we could change exogenously (by experimental assignment) to in leaving unchanged the unobserved characteristics for . The corresponding of those individuals are shifted monotonically. Indeed, we have either (1) or (2) . In case (1),
while in case (2),
This instrument monotonicity condition has been formalized in [2].
Consider a missing data problem in a survey where , is the identity of a pollster, and when the surveyed individual replies and else . The identity of the pollster could be Mr A (z=0) or Mrs B (z=1). This qualifies as an instrument because, usually, the identity of the pollster can have an effect on the response but not on the value of the surveyed variable. Suppose the missing data model is any of those from Section 3 and pollster B has a higher response rate than pollster A. Then, in the hypothetical situation where all individuals surveyed by Mr A had been surveyed by Mrs B, those who responded to Mr A respond to Mrs B and some who did not respond to Mr A respond to Mrs B, but no one who responded to Mr A would fail to respond to Mrs B. This last type of individuals corresponds to the so-called defiers in the terminology of [2]: those for which when and when . There, instrument monotonicity means that there are no defiers.
Remark 3.
The terminology also calls compliers those who did not respond to Mr A but who would respond to Mrs B, never takers those who would respond to neither, and always takers those who would respond to both.
The absence of defiers can be unrealistic. For example, some surveyed individuals may answer a pollster because they feel comfortable with him or her. They can share traits which the statistician does not observe: for example, in the conversation they could realize that they share the same interests or went to the same school.
5. A Random Coefficients Model for the Selection Equation
[14] showed that monotonicity is equivalent to modelling the selection equation as an additively separable latent index model with a single unobservable. In (7) the index is and is the unobservable. A nonadditively separable model takes the form . [1] calls a benchmark nonadditively separable model with multiple unobservables a selection model where the selection equation is a random coefficients binary choice model. A random coefficients latent index model takes the form , where and are independent. The multiple unobservables are the coefficients and play the role of above. The model is nonadditively separable due to the products. The random intercept absorbs the usual mean zero error and deterministic intercept. The random slopes can be interpreted as the tastes for the characteristic . The components of can be dependent.
To gain intuition, assume that is discrete. For and individuals such that , we have
Suppose that the first component of takes positive and negative values with positive probability, that we change exogenously to in by only changing the first component, and that we leave unchanged the unobserved characteristics for . This model allows for populations of compliers (those for which the first component of is positive) and defiers (those for which the first component of is negative).
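The contrast between the two selection equations can be made concrete by counting response types in a simulation with a binary instrument, as in the pollster example. The sketch below is illustrative only: the threshold values `-0.3` and `0.4` and the Gaussian laws of the random intercept `a` and random slope `b` are hypothetical. Each simulated individual is classified from the pair of potential responses at z=0 and z=1.

```python
import random


def classify(s0, s1):
    """Type of an individual from the potential responses at z=0 and z=1."""
    if s0 and s1:
        return "always taker"
    if not s0 and not s1:
        return "never taker"
    return "complier" if s1 else "defier"


def count_types(potential_responses, n=50_000, seed=0):
    rng = random.Random(seed)
    counts = {"always taker": 0, "never taker": 0, "complier": 0, "defier": 0}
    for _ in range(n):
        s0, s1 = potential_responses(rng)
        counts[classify(s0, s1)] += 1
    return counts


def single_unobservable(rng):
    # Additively separable index with one scalar unobservable eta:
    # respond at z when g(z) >= eta, with hypothetical g(0) = -0.3 and g(1) = 0.4.
    eta = rng.gauss(0.0, 1.0)
    return (-0.3 >= eta, 0.4 >= eta)


def random_coefficients(rng):
    # Random coefficients index: respond at z when a + b * z >= 0.
    # The slope b takes both signs with positive probability.
    a = rng.gauss(-0.3, 1.0)
    b = rng.gauss(0.7, 1.0)
    return (a >= 0.0, a + b >= 0.0)


monotone_counts = count_types(single_unobservable)
random_coef_counts = count_types(random_coefficients)
print("single unobservable:", monotone_counts)
print("random coefficients:", random_coef_counts)
```

With a single unobservable no defier can appear, whatever the law of `eta`, because both potential responses compare the same `eta` with ordered thresholds. With a random slope, individuals with a negative slope can respond at z=0 and not at z=1.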
A parametric model for a selection equation specifies a parametric law for . A parametric model for a selection model specifies a joint law of given . The model parameters can be estimated by maximum likelihood. The components of given could be modelled as dependent. is a vector of latent variables and the likelihood involves integrals over . As for the usual Logit or Probit models, a scale normalization is usually introduced for identification. Indeed
for all . A nonparametric model allows the law of given to be a nonparametric class. Parametric and nonparametric models are particularly interesting when they allow for discrete mixtures so that there can be different groups of individuals such as the compliers, defiers, always takers, and never takers. But estimating a parametric model with latent variables which are drawn from multivariate mixtures can be a difficult exercise. In contrast, nonparametric estimators can be easy to compute.
5.1. Scaling to Handle Genuine Non Instrument Monotonicity
In this section, we rely on the approach used in the first version of [5] in the context of treatment effects models. This is based on the normalisation in [9, 10]. The vector of instrumental variables is of dimension . For scale normalization, we define
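The reason a scale normalization is harmless can be checked directly: the selection indicator only depends on the sign of the index, and a sign is unchanged when the coefficient vector or the regressor vector is multiplied by a positive scalar. The sketch below assumes, as in [9], that both vectors are rescaled to unit Euclidean norm; the exact normalization used in this section is not reproduced here, and the numerical vectors are hypothetical.

```python
import math


def normalize(v):
    """Rescale a nonzero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def selects(coefficients, regressors):
    """Selection indicator: 1{<coefficients, regressors> >= 0}."""
    return sum(c * r for c, r in zip(coefficients, regressors)) >= 0.0


# Hypothetical draws: (random intercept, random slopes) and (1, instruments).
cases = [
    ([0.8, -2.5, 1.1], [1.0, 0.4, 1.7]),
    ([-3.0, 0.2, 0.5], [1.0, 2.0, -1.0]),
    ([0.1, 0.1, -4.0], [1.0, -0.5, 0.3]),
]
for gamma, z_tilde in cases:
    raw = selects(gamma, z_tilde)
    scaled = selects(normalize(gamma), normalize(z_tilde))
    print(raw, scaled)   # the two indicators coincide
```

After normalization both vectors live on the unit sphere, which is why the objects of this section are functions on the sphere.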
We introduce some additional notations. When is an integrable function on , we denote by the function (by a density argument) and the hemispherical transform (see [13]) of is defined as
This is a circular convolution in dimension
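A numerical illustration on the circle may help fix ideas. The sketch below assumes the usual definition from [13] and [9], in which the hemispherical transform of `f` at a point `x` of the sphere is the integral of `f` over the hemisphere of points `y` with nonnegative inner product with `x`; the normalization of the measure is an assumption. In dimension 2 a point is an angle, and the integral is approximated by a midpoint rule (the function name and grid size `m` are illustrative).

```python
import math


def hemispherical_transform(f, phi, m=20_000):
    """Midpoint-rule approximation of the integral of f over the half circle centred at angle phi."""
    step = 2.0 * math.pi / m
    total = 0.0
    for k in range(m):
        theta = (k + 0.5) * step
        if math.cos(theta - phi) >= 0.0:   # inner product of the two unit vectors is >= 0
            total += f(theta) * step
    return total


phi = 0.7
h_constant = hemispherical_transform(lambda t: 1.0, phi)          # length of a half circle
h_even = hemispherical_transform(lambda t: math.cos(2.0 * t), phi)
h_odd = hemispherical_transform(lambda t: math.cos(t), phi)
print(f"constant 1 : {h_constant:.4f}  (pi = {math.pi:.4f})")
print(f"cos(2t)    : {h_even:.4f}")
print(f"cos(t)     : {h_odd:.4f}  (2 cos(phi) = {2.0 * math.cos(phi):.4f})")
```

The function `cos(2t)` is even on the circle (it takes the same value at antipodal points) and integrates to zero, and its transform vanishes. The odd function `cos(t)` is mapped to a multiple of itself, so odd parts are the ones that can be recovered by inversion.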
The null space of consists of the integrable functions which are even (by a density argument) and integrate to 0 on . is injective when acting on the cone of nonnegative almost everywhere functions in or such that a.e. (see [9, 10]). This means that cannot be nonzero at two antipodal points of . We denote by the unbounded inverse operator. We now present a formula for the inverse. For an integrable function , we denote by the function . If is continuous and , then
(15) 
and, if , then
(16) 
where
for all and , are orthogonal polynomials on for the weight . The Gegenbauer polynomials can be obtained by the recursion , for while , and
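The Gegenbauer polynomials are cheap to evaluate by recursion. The sketch below uses the standard three-term recurrence C_0(t) = 1, C_1(t) = 2 nu t, and (k+1) C_{k+1}(t) = 2 (k + nu) t C_k(t) - (k + 2 nu - 1) C_{k-1}(t), for which the polynomials are orthogonal on [-1, 1] with weight (1 - t^2)^(nu - 1/2). The function name and the argument name `nu` are illustrative, and the value of `nu` to use in (16) is left as an input.

```python
def gegenbauer(n, nu, t):
    """Evaluate the Gegenbauer polynomial of degree n and parameter nu at t by recursion."""
    if n == 0:
        return 1.0
    previous, current = 1.0, 2.0 * nu * t
    for k in range(1, n):
        # (k + 1) C_{k+1} = 2 (k + nu) t C_k - (k + 2 nu - 1) C_{k-1}
        previous, current = current, (
            2.0 * (k + nu) * t * current - (k + 2.0 * nu - 1.0) * previous
        ) / (k + 1)
    return current


# nu = 1/2 gives the Legendre polynomials and nu = 1 the Chebyshev polynomials
# of the second kind, which provides two easy checks.
print(gegenbauer(2, 0.5, 0.5))   # Legendre P_2(0.5) = (3 * 0.25 - 1) / 2
print(gegenbauer(2, 1.0, 0.5))   # Chebyshev U_2(0.5) = 4 * 0.25 - 1
```

Running the recursion forward costs a number of operations linear in the degree, which matters when a series such as the one in (16) is truncated at a large index.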
Remark 4.
We assume
(17)  
(18)  
(19)  
(20)  
(21)  
(22) 
This specification allows for non instrument monotonicity for all instruments. Condition (20) is very demanding because it means that is the whole space for all . For further reference, we use the notation . This can be relaxed as in [6] by working in specific nonparametric classes yielding quasianalyticity.
Remark 5.
Proceeding as in [6, 7], we could allow an index of the form where are instrumental variables and is multidimensional of arbitrary dimension but has a sparse random series expansion on some classes of functions. Also, the conditional law of , given , for all , can have a support which is a subspace of the whole space. This means that a nonparametric random coefficients linear index already captures a large class of nonadditively separable models with multiple unobservables.
We can show using (19), (20), and (21), that for a.e. and ,
(23) 
By (23), backing out is an inverse problem. However, there is a particular difficulty which is that the lefthand side is only defined (and estimable) on . We obtain the following theorem which states that can be identified at infinity.
Denote by the continuous and odd function defined, for all in the interior of , by
by for all on the boundary of , and by for all in the interior of . This function is nonparametrically identified by Theorem 1.
By (24), for all ,
(25) 
This is now a bona fide ill-posed inverse problem and the inversion can be obtained by (15)-(16).
Theorem 2.
Proof.
This is because
and, for all ,
∎
As a result, the parameter in Theorem 2 is nonparametrically identified and the argument does not involve identification at infinity. This gives, by integration, another expression for than that of Theorem 1, one which does not rely on identification at infinity. By taking to be the function identically equal to 1, we obtain for a.e. .
From this expression, one can obtain an estimator by plug-in and smoothing. One possible smoothing technique is to replace the sum over by a sum up to a truncation parameter. In the approach in [9], there is an additional damping of the high frequencies by an infinitely differentiable filter with compact support. The needlet estimator in [10] also builds on this idea. In the case of the estimation of , [10] provides the minimax lower bounds for more general losses and an adaptive estimator based on thresholding the coefficients of a needlet expansion with a data-driven level of hard thresholding.
The root nonparametrically identified in Theorem 2 allows one to obtain the law of given and censorship ()
(26) 
where is nonparametrically identified. Estimation can be carried out by the plug-in principle.
5.2. Alternative Scaling Under a Weak Version of Monotonicity
In this section, we denote by the general linear group over and assume
(27) 
We denote by , , , and . This yields
Assume also (19),
(28) 
(29) 
(30) 
(31) 
(32) 
Condition (31) is slightly stronger than necessary. Conditions implying that certain functions are quasianalytic, hence allowing to have some heavy tails, are sufficient (see [6]).
By (19),
is the cumulative distribution function of a linear functional of a random vector and for all
in the interior of ,
So such invertible matrices are identified.
The vector of random coefficients in the linear index structure clearly satisfies (22). For this reason, we consider the specification of the previous section more general. There is instrument monotonicity in , though not for . This is a weak type of monotonicity because it is possible that there is instrument monotonicity for none of the instrumental variables in the original scale. This is the approach presented in the other versions of [5]. It is shown in [5] that the equation
where , and are unknown functions, can also be transformed by reparametrization into
(33) 
and that the unknown functions are identified by similar arguments as for the additive model for a regression function.
Proof.
Let . We have by (19), for a.e. ,
Hence, by (30), is nonparametrically identified. Moreover, for all and a.e. ,
(34) 
the lefthand side is nonparametrically identified and the righthand side is the Fourier transform of the law of
conditional on at . We conclude by (31) and (32). ∎
It is possible to turn the identification argument using (34) into an estimation procedure as in [7].
Remark 6.
Remark 7.
Remark 8.
In a binary treatment effect model the outcome can be written as . and are the potential outcomes without and with treatment. They are unobservable. A selection model can be viewed as a degenerate case where a.s. Quantities similar to the root in Theorem 2 have been introduced in [5]. They are for the marginals of the potential outcomes for . An extension of the Marginal Treatment Effect in [1] to multiple unobservables and for laws is the Conditional on Unobservables Distribution of Treatment Effects .
6. Application to Missing Data in Surveys
When making inference with survey data, the researcher has available data on a vector of characteristics for units belonging to a random subset of a larger finite population . The law used to draw can depend on variables available for the whole population, for example from a census. We assume that the researcher is interested in a parameter which could be computed if we had the values of a variable for all units of index . This can be an inequality index, for example the Gini index, and the wealth of household . In the absence of missing data, the statistician can produce a confidence interval for , making use of the data for the units and the available knowledge on the law . We assume that the cardinality of is fixed and equal to . When is a total, it is usual to rely on an unbiased estimator, an estimator of its variance, and a Gaussian approximation. For more complex parameters, linearization is often used to approximate moments. The estimators usually rely on the survey weights . For example, an estimator of the Gini index is
(35) 
where . The estimators of the variance of the estimators are more complex to obtain and we assume there is a numerical procedure to obtain them. Inference is based on the approximation
(36) 
where is a standard normal random variable and is an estimator of the variance of .
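As an illustration of the complete-data case, the sketch below computes a weighted plug-in Gini index and a Gaussian confidence interval built from a variance estimate. It is a sketch under assumptions: the Gini estimator is written in its mean-difference form (weighted mean absolute difference divided by twice the weighted mean), which may differ in finite samples from the expression in (35), and the variance passed to `gaussian_interval` is a made-up number standing in for the numerical procedure mentioned above.

```python
from statistics import NormalDist


def weighted_gini(values, weights):
    """Plug-in Gini index: weighted mean absolute difference over twice the weighted mean."""
    total_weight = sum(weights)
    mean = sum(w * y for y, w in zip(values, weights)) / total_weight
    mean_abs_diff = sum(
        wi * wj * abs(yi - yj)
        for yi, wi in zip(values, weights)
        for yj, wj in zip(values, weights)
    ) / total_weight ** 2
    return mean_abs_diff / (2.0 * mean)


def gaussian_interval(estimate, variance, level=0.95):
    """Symmetric interval estimate +/- q * sqrt(variance) from the Gaussian approximation."""
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = q * variance ** 0.5
    return estimate - half_width, estimate + half_width


wealth = [12.0, 30.0, 7.5, 110.0, 45.0]          # hypothetical sampled households
survey_weights = [150.0, 90.0, 200.0, 40.0, 80.0]
gini_hat = weighted_gini(wealth, survey_weights)
lower, upper = gaussian_interval(gini_hat, variance=0.0004)  # hypothetical variance estimate
print(f"Gini estimate {gini_hat:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

The double sum is quadratic in the sample size, which is acceptable for a sketch; sorting the values gives an equivalent computation in quasi-linear time.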
In practice, this is not possible when some of the s are missing. There is a distinction between total nonresponse, where the researcher discards the data for some units or it is not available, and partial nonresponse. Let us ignore total nonresponse which is usually dealt with using reweighting and calibration and focus on partial nonresponse. We consider a case where can be missing for some units , while all other variables are available for all units . We rely on a classical formalism where the vector of surveyed variables and of those used to draw , for each unit , are random draws from a superpopulation. In this formalism the parameter for all indices of households in the population and are random and we shall now use capital letters for them. Let and be random variables, where if and if unit reveals the value of given , and and be random vectors which will play a different role.
It is classical to rely on imputations to handle the missing data. This means that we replace missing data by artificial values obtained from a model forming predictions or simulating from a probability law and inject them in a formula like (35). In [3] we discuss the use of the Heckman selection model when we suspect that the data is not missing at random. This relies on a parametric model for the partially missing outcome which is prone to criticism. Also, as this paper has shown, such a model relies on instrument monotonicity, an assumption which is too strong to be realistic.
It is difficult to analyze theoretically the effect of such imputations. For example when the statistic is nonlinear in the s (e.g. (35)) then using predictions can lead to distorted statistics. It is also tricky to make proper inference when one relies on imputations. One way to proceed is to rely on a hierarchical model as in [4]. There the imputation model is parametric and we adopted the Bayesian paradigm for two reasons. The first is to account for parameter uncertainty and the second is to replace maximum likelihood with high dimensional integrals by a Markov chain Monte Carlo algorithm (a Gibbs sampler). The hierarchical approach also allows additional layers, for example to account for model uncertainty. The Markov chain produces sequences of values for each for in the posterior distribution given , the choice of which is discussed afterwards. Subsequently we get a path of
(37) 
where is a standard normal random variable independent from given . (38) is derived from (36). The variables are those making the missing mechanism corresponding to relative to MAR^{2}. The last values of the sample path for allow one to form credible sets by adjusting the set so that the frequency that exceeds , where is a confidence level. is the so-called burn-in. These confidence sets account for error due to survey sampling, parameter uncertainty, and nonresponse. They can be chosen from the quantiles of the distribution, to minimize the volume of the set, etc.
^{2} They can be those used by the survey statistician to draw if any (and usually made available) to handle a total nonresponse which is MAR via imputations.
We now consider our nonparametric model of endogenous selection which allows for nonmonotonicity of the instrumental variables to handle a missing mechanism corresponding to which is NMAR. For simplicity, we assume away parameter uncertainty and total nonresponse. The variables in Section 5 can be variables that are good predictors for . They are not needed to obtain valid inference but can be useful to make confidence intervals smaller. However, the selection corresponding to the binary variables relative to the outcomes given follows an NMAR mechanism. The (multiple) imputation approach becomes: for
(i) Draw an i.i.d. sample of for from the law of given , , and , an independent standard normal , and set for where are the uncensored observations,
(ii) Compute
(38)
The confidence interval is formed from the sample for a given confidence level.
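The two steps above can be sketched as a short loop. This is only a schematic illustration under stated assumptions: `draw_missing` stands for a sampler from the (estimated) law of the outcome given the covariates and censorship, `variance_estimator` stands for the numerical variance procedure, and the statistic is a weighted plug-in Gini index in mean-difference form. All three, as well as the lognormal sampler and the constant variance used in the usage lines, are hypothetical placeholders rather than the estimators of Section 5.

```python
import math
import random


def weighted_gini(values, weights):
    """Plug-in Gini index: weighted mean absolute difference over twice the weighted mean."""
    total_weight = sum(weights)
    mean = sum(w * y for y, w in zip(values, weights)) / total_weight
    mean_abs_diff = sum(
        wi * wj * abs(yi - yj)
        for yi, wi in zip(values, weights)
        for yj, wj in zip(values, weights)
    ) / total_weight ** 2
    return mean_abs_diff / (2.0 * mean)


def imputation_interval(values, weights, draw_missing, variance_estimator,
                        n_draws=200, level=0.95, seed=0):
    """Interval from the empirical quantiles of the imputed and perturbed statistics."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Step (i): impute each missing value (None), keep the uncensored observations.
        completed = [draw_missing(rng, i) if y is None else y
                     for i, y in enumerate(values)]
        epsilon = rng.gauss(0.0, 1.0)
        # Step (ii): statistic on the completed data plus Gaussian sampling noise.
        estimate = weighted_gini(completed, weights)
        draws.append(estimate + math.sqrt(variance_estimator(completed, weights)) * epsilon)
    draws.sort()
    lower = draws[math.floor((1.0 - level) / 2.0 * n_draws)]
    upper = draws[math.ceil((1.0 + level) / 2.0 * n_draws) - 1]
    return lower, upper


wealth = [10.0, None, 30.0, 20.0, None, 55.0]    # None marks a censored outcome
weights = [1.0] * len(wealth)
lower, upper = imputation_interval(
    wealth, weights,
    draw_missing=lambda rng, i: rng.lognormvariate(3.0, 0.5),   # hypothetical sampler
    variance_estimator=lambda completed, w: 0.001,              # hypothetical constant
)
print(f"interval for the Gini index: [{lower:.3f}, {upper:.3f}]")
```

The interval widens both with the sampling variance and with the spread of the imputation law, which is how the two sources of uncertainty are combined.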
References
 [1] Heckman, J. J., Vytlacil, E.: Structural equations, treatment effects, and econometric policy evaluation. Econometrica. 73, 669–738 (2005)
 [2] Imbens, G. W., Angrist, J. D.: Identification and estimation of local average treatment effects. Econometrica. 62, 467–475 (1994)
 [3] Gautier, E.: Eléments sur la sélection dans les enquêtes et sur la nonréponse non ignorable. Actes des Journées de Méthodologie Statistique (2005)
 [4] Gautier, E.: Hierarchical Bayesian estimation of inequality measures with nonrectangular censored survey data with an application to wealth distribution of the French households. Annals of Applied Statistics. 5, 1632–1656 (2011)
 [5] Gautier, E., Hoderlein, S.: A triangular treatment effect model with random coefficients in the selection equation. https://arxiv.org/abs/1109.0362 (2015)
 [6] Gaillac, C., Gautier, E.: Identification in some random coefficients models when regressors have limited variation. Working paper (2019)
 [7] Gaillac, C., Gautier, E.: Adaptive estimation in the linear random coefficients model when regressors have limited variation. https://arxiv.org/abs/1905.06584 (2019)
 [8] Gaillac, C., Gautier, E.: Estimates for the SVD of the truncated Fourier transform on and stable analytic continuation. https://arxiv.org/abs/1905.11338 (2019)
 [9] Gautier, E., Kitamura, Y.: Nonparametric estimation in random coefficients binary choice models. Econometrica. 81, 581–607 (2013)
 [10] Gautier, E., Le Pennec, E.: Adaptive estimation in the nonparametric random coefficients binary choice model by needlet thresholding. Electron. J. Statist. 12, 277–320 (2018)
 [11] Heckman, J. J.: Sample selection bias as a specification error. Econometrica. 47, 153–161 (1979)
 [12] Little, R. J. A., Rubin, D. B.: Statistical Analysis with Missing Data. Wiley (2002)
 [13] Rubin, B.: Inversion and characterization of the hemispherical transform. J. Anal. Math. 77, 105–128 (1999)
 [14] Vytlacil, E.: Independence, monotonicity, and latent index models: an equivalence result. Econometrica. 70, 331–341 (2002)