Estimating the joint distribution with data Missing At Random (MAR) is a hard task. Usually, one applies strictly parametric methods, mostly relying on members of the exponential family such as the multivariate normal distribution. Its parameters can be determined by the Expectation Maximization (EM) algorithm ([DempsterEM]). However, the misspecification error in the case of non-Gaussian data might be considerable. We can extend the normality assumption and assume a Gaussian copula model, freeing us from restrictions on the shape of the marginals. Figure 1
is based on such a distribution for the bivariate case. Here, the green lines show the underlying marginal cumulative distribution functions for the first (left) and the second component (right) of a two-dimensional random vector (X1, X2), respectively. We then generated observations from (X1, X2) and calculated the corresponding empirical cumulative distribution functions (ecdf) (orange lines). To assess the influence of missing values, we artificially chose some values to be Missing At Random (MAR) and recalculated the ecdfs based on the observed data points only (blue lines). In particular, the missingness in the left column, displaying X1, is Missing Completely At Random (MCAR), while the missingness in the right column, displaying X2, depends on X1.
[SongEMCopula] propose an EM algorithm for this setting. However, their approach has two weaknesses.
The presented algorithm simplifies by assuming that the marginals and the copula can be estimated separately in the M-step (compare Equation (6) in [SongEMCopula] and Equation (10) in this paper).
As a parametrization of the marginals is necessary to apply the EM algorithm, they offer two options to model the marginals.
Fix the marginal distributions as the ecdfs of the observed data points and learn only the copula through the EM algorithm. This can lead to a heavily biased estimation, as the right-hand side of Figure 1 shows. The blue line, depicting the ecdf of the observed data points of X2, differs clearly from the underlying marginal cumulative distribution function of X2, which is drawn in green.
Use parametric assumptions on the marginals. However, a Kolmogorov-Smirnov test with the correct underlying distribution of X2, applied to the observed data points, reveals a vanishingly small p-value. Hence we would reject the (true) null hypothesis at every reasonable significance level. Thus, this approach presumes a priori knowledge about the marginals, which rarely exists in practice.
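This effect is easy to reproduce. The sketch below is our own illustration, not the paper's setup: it uses standard normal marginals, a hard threshold-on-the-first-component missingness rule, and the asymptotic 5% critical value of the one-sample KS statistic. Dropping the second component whenever the first one is large makes the ecdf of the observed values reject the true marginal:

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in x])

rng = np.random.default_rng(0)
n, rho = 2000, 0.8
z1 = rng.standard_normal(n)
z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# MAR: z2 is removed whenever z1 exceeds 0, so its missingness
# depends on the (fully observed) first component only.
observed_z2 = np.sort(z2[z1 <= 0])
m = len(observed_z2)

# One-sample KS statistic of the observed z2 against the true N(0, 1) marginal.
cdf_vals = norm_cdf(observed_z2)
D = max(np.max(np.arange(1, m + 1) / m - cdf_vals),
        np.max(cdf_vals - np.arange(0, m) / m))

# D is far above the asymptotic 5% critical value, so the (true)
# marginal is rejected when judged by the observed values alone.
print(D, 1.36 / np.sqrt(m))
```

The stronger the dependence and the missingness rule, the larger the discrepancy, mirroring the right-hand side of Figure 1.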
The aim of the present paper is to overcome these obstacles. Our contributions are twofold.
We present a mathematically rigorous application of the EM algorithm in the Gaussian copula model. Similarly to complete data approaches, it estimates the marginals and the copula separately. However, the two steps are carried out repeatedly.
We propose a semiparametric approach for the marginal distributions. It allows us to learn their shape without possessing any a priori knowledge about them.
The structure of the paper is as follows. In Section 2 we review some background knowledge about copulas and the Gaussian copula in particular. We proceed by presenting the method (Section 3). In Section 4 we investigate its performance in a simulation study, where we also discuss the data-generating process behind Figure 1. Lastly, we conclude with a discussion and an outlook in Section 5. All technical aspects and proofs of this paper are given in Appendix A.
2 The Gaussian Copula Model
In the following we consider a d-dimensional data set x^(1), …, x^(n) of size n, where x^(1), …, x^(n) are i.i.d. samples from a d-dimensional random vector X with joint distribution function F and marginal distribution functions F_1, …, F_d. The parameters of the marginals we denote as θ = (θ_1, …, θ_d), where θ_j can be a vector itself and it is the parameter of F_j, so we write F_j(·; θ_j).
For an observation x, we define obs(x) as the index set of the observed and mis(x) as the index set of the missing columns. Hence, obs(x) ∪ mis(x) = {1, …, d} and obs(x) ∩ mis(x) = ∅. x_obs(x) and x_mis(x) are then the subvectors containing the observable and unobservable values of the vector x. R is the random vector indicating if an entry is missing, where R_j = 1 if x_j is missing. Further, we define φ to be the density function and Φ to be the distribution function of the one-dimensional standard normal distribution. Φ_{Σ,μ} stands for the distribution function of a k-variate normal distribution with covariance Σ and mean μ. To simplify the notation, we define Φ_Σ := Φ_{Σ,0}. For a matrix A, the entry of the i-th row and the j-th column we denote as A_{i,j}, and for index sets I, J, A_{I,J} is the submatrix of A with row numbers in I and column numbers in J.
Throughout, we assume F to be strictly increasing in every component and continuous. Therefore, for all j, F_j is strictly increasing and continuous and so is the existing inverse function F_j^{-1}.
Sklar’s theorem [Sklar] decomposes F into its marginals F_1, …, F_d and its dependency structure C, by

F(x_1, …, x_d) = C(F_1(x_1), …, F_d(x_d)).

C is a so-called copula, which means a d-dimensional distribution function with support [0, 1]^d, whose marginal distributions are uniform. Hence, every distribution belongs to an equivalence class, which contains all the distributions with the identical copula. In this paper we focus on the so-called Gaussian copulas, where

C_Σ(u_1, …, u_d) = Φ_Σ(Φ^{-1}(u_1), …, Φ^{-1}(u_d))
and the correlation matrix Σ is of full rank. We can see that every equivalence class contains exactly one multivariate normal random vector Z with mean 0 and Σ_{j,j} = 1 for all j. Beyond all multivariate normal distributions that share the same correlation structure, every equivalence class contains other distributions that do not have Gaussian marginals. Hence, the Gaussian copula model provides us with an extension of the normality assumption. Take a random vector X from such an equivalence class. Under the transformation

Z_j := Φ^{-1}(F_j(X_j)), j = 1, …, d,

Z = (Z_1, …, Z_d) is normally distributed with mean 0 and covariance Σ. The two-step approaches given in [GenestSemiparametricEstimation] and [liu2012high] use this property and apply the following scheme:
Find consistent estimates F̂_1, …, F̂_d for the marginal distributions.
Find Σ̂ by estimating the covariance of the random vector (Φ^{-1}(F̂_1(X_1)), …, Φ^{-1}(F̂_d(X_d))).
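For complete data, this two-step scheme takes only a few lines. The sketch below is our own illustration, using the common n/(n+1) rescaling of the ecdf so the pseudo-observations stay strictly inside the open unit interval:

```python
import numpy as np
from statistics import NormalDist

def two_step_correlation(X):
    """Complete-data two-step estimate: ecdf -> normal scores -> correlation."""
    n, d = X.shape
    inv_cdf = NormalDist().inv_cdf
    Z = np.empty_like(X, dtype=float)
    for j in range(d):
        # Ranks give the ecdf; the n/(n+1) rescaling keeps the
        # pseudo-observations strictly inside (0, 1).
        ranks = np.argsort(np.argsort(X[:, j])) + 1
        Z[:, j] = [inv_cdf(r / (n + 1)) for r in ranks]
    return np.corrcoef(Z, rowvar=False)

rng = np.random.default_rng(1)
n, rho = 5000, 0.6
z1 = rng.standard_normal(n)
z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
# Non-Gaussian marginals: the copula (and hence rho) is invariant
# under strictly increasing transformations of the components.
X = np.column_stack([np.exp(z1), z2**3])
Sigma_hat = two_step_correlation(X)
print(Sigma_hat[0, 1])  # close to the true 0.6
```

Because the normal scores only depend on the ranks, the estimate is unaffected by the (unknown) shapes of the marginals, which is exactly what fails once observations are MAR rather than complete.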
From now on, assume that the marginals of X have existing density functions f_1, …, f_d. Then, using Equation (3) and a change of variables, we can derive the joint density function (see [SongGaussianCopulaBasics])

f(x_1, …, x_d) = |Σ|^{-1/2} exp(-(1/2) z^T (Σ^{-1} - I_d) z) ∏_{j=1}^{d} f_j(x_j), (4)

where z = (Φ^{-1}(F_1(x_1)), …, Φ^{-1}(F_d(x_d)))^T. As in the case of the multivariate normal distribution, we can read off conditional independencies ([liu2012high]) from the inverse of the covariance matrix Σ^{-1}, using the property

(Σ^{-1})_{i,j} = 0 if and only if X_i and X_j are conditionally independent given all other components. (5)

Σ^{-1} is called the precision matrix. In order to slim down the notation, we define for a subset I ⊆ {1, …, d}
The conditional density functions have a closed form.
Proposition 2.1 (Conditional Distribution of Gaussian copula).
Let and be such that .
The conditional density of is given by
where , , and
is normally distributed with mean and covariance .
The expectation of with respect to the density can be expressed by
Using Proposition 2.1 it can be seen that the conditional distribution’s copula is Gaussian as well. More importantly, we can derive an algorithm for sampling data points from the conditional distribution:
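A minimal sketch of such a conditional sampler (our own illustration; the marginal CDFs and quantile functions are assumed known and passed in explicitly): map the observed entries to normal scores, draw the missing scores from the conditional multivariate normal of Proposition 2.1, and map them back through the marginal quantile functions.

```python
import numpy as np
from statistics import NormalDist

nd = NormalDist()

def sample_missing(x, obs, Sigma, cdfs, quantiles, rng, size=1):
    """Draw x_mis given x_obs under a Gaussian copula with correlation Sigma.

    cdfs / quantiles hold the marginal CDFs F_j and quantile functions
    F_j^{-1} per column (assumed known here, purely for illustration)."""
    d = Sigma.shape[0]
    mis = [j for j in range(d) if j not in obs]
    # 1) Map the observed entries to normal scores.
    z_obs = np.array([nd.inv_cdf(cdfs[j](x[j])) for j in obs])
    # 2) Conditional normal distribution of the missing scores.
    A = Sigma[np.ix_(mis, obs)] @ np.linalg.inv(Sigma[np.ix_(obs, obs)])
    mean = A @ z_obs
    cov = Sigma[np.ix_(mis, mis)] - A @ Sigma[np.ix_(obs, mis)]
    z_mis = rng.multivariate_normal(mean, cov, size=size)
    # 3) Map the scores back through the marginal quantile functions.
    return np.array([[quantiles[j](nd.cdf(z)) for j, z in zip(mis, row)]
                     for row in z_mis])

# Bivariate example with standard normal marginals: x1 = 1 observed, x2 missing.
Sigma = np.array([[1.0, 0.7],
                  [0.7, 1.0]])
rng = np.random.default_rng(2)
draws = sample_missing([1.0, None], obs=[0], Sigma=Sigma,
                       cdfs=[nd.cdf, nd.cdf],
                       quantiles=[nd.inv_cdf, nd.inv_cdf],
                       rng=rng, size=20000)
print(draws.mean())  # near the conditional mean 0.7 * 1.0 = 0.7
```

In the bivariate example the draws concentrate around the conditional mean ρ·z_obs with variance 1 − ρ², as Proposition 2.1 prescribes.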
The very last step follows from Proposition 2.1, as for any measurable set it holds that:
3 The EM Algorithm in the Gaussian Copula Model
3.1 The EM Algorithm
Let be an incomplete data set following a parametric distribution with parameter and corresponding density function , where observations are MAR. The EM algorithm [DempsterEM] finds a local optimum of the log-likelihood function
After choosing a start value , it does so by iterating the following two steps.
For our purpose, there exist two extensions of interest.
In many cases there is no closed formula for the right-hand side of Equation (6). If it is possible to sample from the conditional distribution, one can use Monte Carlo integration [WeiStochasticEM] as an approximation; this variant is called the Monte Carlo EM algorithm.
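As a toy illustration of this approximation (our own example, not from the paper): for a bivariate standard normal vector with correlation ρ, the conditional expectation E[Z2² | Z1 = z] has the closed form ρ²z² + (1 − ρ²), which the Monte Carlo average recovers:

```python
import numpy as np

rng = np.random.default_rng(3)
rho, z_obs, M = 0.5, 1.2, 100_000           # illustrative values
# Conditional law of the missing score given the observed one (bivariate case).
cond_mean, cond_sd = rho * z_obs, np.sqrt(1 - rho**2)
samples = cond_mean + cond_sd * rng.standard_normal(M)

g = lambda z: z**2                          # any integrand appearing in the E-step
mc = g(samples).mean()                      # Monte Carlo approximation
exact = cond_mean**2 + cond_sd**2           # closed form, for comparison only
print(mc, exact)
```

The approximation error shrinks at the usual 1/sqrt(M) Monte Carlo rate, which is why increasing the sample size in later iterations (as discussed in Section 3.4) pays off.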
3.2 Applying the ECM-Algorithm on the Gaussian Copula Model
As we need a full parametrization of the Gaussian copula model for the EM algorithm, we choose parametric marginal distributions with densities . According to Equation (4), the joint density with respect to the parameters and has the form
where . Section 3.3 will describe how we can keep the flexibility for the marginals despite the parametrization. But first we outline the EM algorithm for general parametric marginal distributions.
The first and the last summand depend only on and , respectively. Thus, of special interest is the second summand, for which we observe by Proposition 2.1
and . At this point, [SongEMCopula] neglect that, in general,
holds, and hence (10) depends not only on but also on . This has led us to reconsider their approach, as we describe below.
We have found that the joint optimization with respect to and is difficult, as there is no closed-form solution for (9). We circumvent this problem by sequentially optimizing with respect to and , applying the ECM algorithm. The maximization routine is the following.
The reader might notice that this is a two-step approach, consisting of first estimating the copula encoded in and then estimating the marginals by finding . However, both steps are executed iteratively, which is typical for the EM algorithm.
where depends on and . Considering all observations, we search for
with as in Proposition A.1, but for observation . The maximizer depends on the statistic only. Generally, the maximization can be formalized as a convex optimization problem, which can be solved by gradient descent. However, the properties of this estimator are not well understood (for example, a scaling of by leads to a different solution; see Appendix A.3). To overcome this issue, we instead approximate the solution by the correlation matrix
where is the diagonal matrix with entries . This was also proposed in [GuoCorrelation, Section 2.2].
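The normalization itself is one line; a sketch (the matrix S stands for a hypothetical intermediate maximizer):

```python
import numpy as np

def to_correlation(S):
    # Sigma_hat = D^{-1/2} S D^{-1/2}, where D holds the diagonal of S.
    d = 1.0 / np.sqrt(np.diag(S))
    return S * np.outer(d, d)

S = np.array([[2.0, 0.8],
              [0.8, 0.5]])
Sigma_hat = to_correlation(S)
print(Sigma_hat)  # unit diagonal; off-diagonal entry 0.8 / sqrt(2.0 * 0.5) = 0.8
```

Rescaling to unit diagonal also removes the scaling ambiguity mentioned above, since multiplying S by a constant leaves the result unchanged.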
Maximizing with respect to
We now focus on finding , which maximizes
with respect to . As there is in general no closed form for the expectations, we approximate them with Monte Carlo integration. Again, we start by considering a single observation to simplify the terms. Employing Algorithm 1, we obtain samples from the distribution of under the parameters and . We set . Then,
Hence, considering all observations, we set
We emphasize that not only the sampling but also the evaluation and the calculation of derivatives of the right-hand side of Equation (13) can be parallelized.
The idea of using Monte Carlo integration is similar to the Monte Carlo EM algorithm introduced by [WeiStochasticEM]. However, note that we only use the Monte Carlo samples to update the parameters of the marginal distributions .
We would also like to point out some interesting aspects of Equations (12) and (13):
The summand describes how well the marginal distributions fit to the (one-dimensional) data.
The first summand adjusts for the dependence structure in the data. If all observations at step are assumed to be independent, then and this term is .
More generally, the derivative depends on if and only if . That means, if implies conditional independence of column and given all other columns (Equation (5)), the optimal can be found without considering .
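A small numerical check of this property (our own example): for three variables forming a Gaussian "chain", where corr(Z1, Z3) factors as corr(Z1, Z2) · corr(Z2, Z3), the (1, 3) entry of the precision matrix vanishes:

```python
import numpy as np

# Correlation matrix of a Gaussian "chain" Z1 - Z2 - Z3, where
# corr(Z1, Z3) = corr(Z1, Z2) * corr(Z2, Z3).
a, b = 0.6, 0.5
Sigma = np.array([[1.0,   a,   a * b],
                  [a,     1.0, b    ],
                  [a * b, b,   1.0  ]])
Omega = np.linalg.inv(Sigma)  # precision matrix
# The (1, 3) entry vanishes: columns 1 and 3 are conditionally
# independent given column 2 under the Gaussian copula.
print(Omega[0, 2])
```

In the light of the remark above, the marginal parameter of column 1 could then be optimized without any reference to column 3.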
3.3 Modelling with Semiparametric Marginals
The algorithm of Section 3.2 depends on a parametrization of the marginals. We have seen in Section 1 that we cannot deduce the parametric family of the marginal distributions from the observed data points. If there is a priori knowledge about the parametrization of the marginals, then we can plug it into the formulae of Section 3.2. However, as this prior knowledge about the marginals rarely exists (they might not even belong to a parametric family), we propose using semiparametric mixture models. In particular, we use a parametrization of the form
is a hyperparameter and the ordering of the mixture components ensures the identifiability of the distribution ([MclachlanFiniteMixtures]).
Using mixture models for density estimation is not new (e.g. [MclachlanFiniteMixtures], [HwangNonparametric], [ScottMultidimensionalDensityEstimation]). As [ScottMultidimensionalDensityEstimation] note, mixture models vary between being parametric and non-parametric, where flexibility increases with the number of components. From a theoretical perspective it is reasonable to choose Gaussian mixture models, as the density functions arising from mixtures of Gaussians are dense in the set of all density functions ([MclachlanFiniteMixtures, Section 3.2]). This flexibility and the provided parametrization make mixture models a natural choice for the marginals.
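For concreteness, here is a sketch of such a Gaussian mixture marginal (density and CDF); the particular component form, with per-component means and scales, is our own simplification of Equation (14):

```python
import numpy as np
from statistics import NormalDist

def mixture_pdf(x, pis, mus, sigmas):
    # Density of sum_k pi_k * N(mu_k, sigma_k^2).
    x = np.atleast_1d(x).astype(float)[:, None]
    comp = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return (pis * comp).sum(axis=1)

def mixture_cdf(x, pis, mus, sigmas):
    # Mixture CDF: the same convex combination of the component CDFs.
    comps = [NormalDist(m, s) for m, s in zip(mus, sigmas)]
    return np.array([sum(p * c.cdf(v) for p, c in zip(pis, comps))
                     for v in np.atleast_1d(x)])

# A bimodal example: two components with ordered means.
pis = np.array([0.3, 0.7])
mus = np.array([-1.0, 2.0])
sigmas = np.array([0.5, 1.0])
grid = np.linspace(-15.0, 15.0, 6001)
mass = (mixture_pdf(grid, pis, mus, sigmas) * (grid[1] - grid[0])).sum()
print(mass)  # the mixture density integrates to (approximately) 1
```

Already with two components such a marginal can represent skewed or bimodal shapes that no single exponential-family member captures.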
3.4 A Blueprint of the Algorithm
The complete algorithm can be summarized as follows:
For the Monte Carlo EM algorithm, [WeiStochasticEM] propose to stabilize the parameters with a rather small and to increase it substantially in the later steps of the algorithm. This seems reasonable for line 2 of Algorithm 2 as well.
If we have no a priori knowledge about the marginals and we model them semiparametrically, we propose to choose such that the cumulative distribution functions of the mixture models fit to the ecdf of the observed data points. The number of components in the mixture model can then be chosen such that its corresponding provides a good approximation. For
and estimate the standard deviation using the observed data points only.
4 Simulation Study
In order to evaluate the method, we conduct a simulation study. In particular, we elaborate on the issues arising from an MAR mechanism, which lead to biased estimators for the marginals when deploying a two-step approach such as [GenestSemiparametricEstimation].
We emphasize that our data-generating process is fundamentally different from simulation studies based on an MCAR mechanism. In fact, with MCAR there is no need for an iterative approach like the EM algorithm. As the marginals can be estimated consistently by the ecdfs of the observed data points, we can use covariance estimators for MCAR (as in [LouniciCovarianceEstimation]) on
where is the ecdf of the existing observations of column . Alternatively, one can employ the method of [WangCopulaPrecisionEstimation].
4.1 The Data-Generating Process
We consider a two-dimensional data set (we would have liked to include the setup of the simulation study of [SongEMCopula]; however, the missing mechanism can neither be extracted from the paper nor did the authors provide it on request) with a priori unknown marginals F_1 and F_2, whose copula is a Gaussian copula with correlation parameter ρ. The marginals are chosen to be a normal distribution and a t-distribution. The data matrix contains the (complete) observations of the random vector. We enforce the following missing-data mechanism:
Remove every entry in the data matrix independently with a fixed probability. The resulting data matrix (with missing entries) we denote as .
If x_{i,1} and x_{i,2} are both observed, remove x_{i,2} with probability
We call the resulting data matrix .
The resulting data set is MAR. Besides , the parameters and control how many entries are absent in the final data set. Assuming that , and are not too large, the ecdf of the observed values of X2 is shifted to the left compared to the underlying distribution function (changing the signs of and/or may change the direction of the shift, but the situation is analogous). We used this procedure to generate Figure 1 with , , and . We observe that we could estimate the marginal distribution of X1 using the ecdf of the observed data, but not that of X2.
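The mechanism can be sketched as follows. All concrete choices here (normal marginals, the MCAR probability, and a logistic form for the X1-dependent removal probability) are our own illustrative stand-ins for the elided parameters:

```python
import numpy as np

rng = np.random.default_rng(4)
n, rho = 10000, 0.5
# Complete bivariate sample with a Gaussian copula (normal marginals here
# purely for illustration; the study uses a normal and a t marginal).
x1 = rng.standard_normal(n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)

p, a, b = 0.1, -1.0, 2.0                 # hypothetical mechanism parameters
miss = rng.random((n, 2)) < p            # step 1: MCAR removal of every entry
both = ~miss.any(axis=1)
# Step 2: where both entries survived, remove x2 with a probability that
# depends on the observed x1 (a logistic form as one plausible choice).
miss[:, 1] |= both & (rng.random(n) < 1 / (1 + np.exp(-(a + b * x1))))

# The ecdf of the observed x2 is shifted to the left: large x1 (and with it
# large x2) values are removed more often.
print(x2[~miss[:, 1]].mean(), x2.mean())
```

Because step 2 only ever conditions on observed x1 values, the mechanism is MAR but not MCAR, exactly the regime in which the two-step estimators break down.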
4.2 Adapting the EM Algorithm
For the setup in Section 4.1 we choose , for which we saw sufficient flexibility of the marginal distributions. is then chosen by fitting the marginals to the existing observations.
As a start value, we set the identity matrix. We observed that with , stabilizes after around steps. Cautiously, we run steps before we increase to , for which we run another 5 steps. We stop the algorithm when the condition is fulfilled.
We investigate four different settings of the setup in Section 4.1. We vary the correlation and the missing-mechanism parameters . As a competitor, we compare our method with the method of [SongEMCopula], where the marginal distributions are estimated from the observed data only. We call this algorithm the "Simplified COPula Estimator", or SCOPE for short. As a gold standard, we compute only the covariance structure by applying an EM algorithm to the Gaussian observations . The idea is to eliminate the difficulty of finding estimators for the marginals.
For every one of those four settings we run 1000 simulations. To evaluate the methods, we look at two different aspects.
First, we compare the estimators for . The results are depicted in Figure 4. We see that no method is clearly superior in estimating . As even knowing the marginal distributions does not lead to substantially better estimators, we deduce that (at least in our setting) the quality of the estimators for the marginals is almost negligible for obtaining a good estimator for the copula.
Second, we calculate a two-sample Kolmogorov-Smirnov (KS) test statistic as described in [FasanoKSTest, Section 5]. For that, we each time draw data points from the learned joint distribution. We describe the details of the sampling procedure in Appendix A.4. With those draws, we calculate the two-sample KS test statistic between the real distribution and the competitors. The results are depicted in Figure 5. It shows that the quality of the learned joint distribution depends highly on the estimation of the marginals; see [thurowimputing] for a similar approach in the univariate setting.
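For reference, here is a sketch of a two-sample statistic in the spirit of [FasanoKSTest]. It is our own simplified O(n²) variant, not the exact procedure of the paper: the maximal difference of the two samples' quadrant fractions around every data point, averaged over the two choices of anchoring sample:

```python
import numpy as np

def ks2d_2s(A, B):
    """Two-sample 2D KS statistic in the spirit of Fasano & Franceschini:
    largest difference between the two samples' quadrant fractions, over
    quadrants anchored at every data point (simplified O(n^2) sketch)."""
    def max_diff(points):
        d = 0.0
        for px, py in points:
            for sx in (1.0, -1.0):
                for sy in (1.0, -1.0):
                    fa = np.mean((sx * (A[:, 0] - px) > 0) & (sy * (A[:, 1] - py) > 0))
                    fb = np.mean((sx * (B[:, 0] - px) > 0) & (sy * (B[:, 1] - py) > 0))
                    d = max(d, abs(fa - fb))
        return d
    # Average the statistic over the two choices of anchoring sample.
    return 0.5 * (max_diff(A) + max_diff(B))

rng = np.random.default_rng(5)
A = rng.standard_normal((300, 2))
B = rng.standard_normal((300, 2)) + np.array([1.0, 0.0])  # shifted second sample
print(ks2d_2s(A, A), ks2d_2s(A, B))  # 0.0 for identical samples; clearly positive here
```

Unlike a comparison of the two marginal ecdfs, this statistic is sensitive to misfit in the joint distribution, which is why it separates the methods in Figure 5.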
Additionally, we see that the benefit of the proposed algorithm is larger in the case of high correlation. This is in line with the intuition that if the correlation vanishes, the two random variables X1 and X2 are independent. Thus, the missingness probability and X2 are independent. (Note that this differs from the case where , and hence the missingness probability is conditionally independent of given .) In that case, we can estimate the marginal of X2 using the ecdf of the observed data points. Hence, for small correlation, SCOPE is almost consistent. An illustration can be found in Figure 3.
5 Discussion and Outlook
In this paper we have investigated the estimation of the Gaussian copula and the marginals from an incomplete data set. We have shown that if the data is Missing At Random, a consistent estimate of a marginal distribution depends on the copula and the other marginals. Further, we derived a rigorous EM algorithm based on Monte Carlo integration that can still be applied. It works by iteratively finding the marginal distributions and the copula and is hence similar to known methods for complete data sets.
However, the EM algorithm relies on a complete parametrization of the marginals. In case there is no prior knowledge about them, we presented the novel idea of employing semiparametric mixture models. Although this is practically always a misspecification of the marginal distributions, our simulation study revealed that the combination of the EM algorithm and the semiparametric marginals delivers better estimates of the joint distribution than the currently used algorithms of [SongEMCopula] and [ZhaoUdellCopulaEM]. The intuition behind the algorithm is the following:
Given a current state of knowledge about the copula and the marginals , we can derive a probabilistic approximation of the location of the missing values. This is achieved by generating samples for the missing values given the observed values. With the information gained on the location of the data points and , we can derive new estimators for the marginals , leading to updated scores , which can then in turn be utilized to find an improved copula .
We note that the focus of the paper is on estimating the joint distribution without precise specification of its subsequent use. Therefore, we have not discussed imputation methods (see, e.g., [rubin1996multiple], [van2018flexible], [ramosaj2019predicting], [ramosaj2020cautionary]) in more detail. However, some researchers have used the Gaussian copula model as a device for Multiple Imputation (MI) with some success ([ZhaoUdellCopulaEM], [HollenbachWrongEM], [HouariWrongMarginals]), although none of the mentioned approaches recognizes the importance of modeling the marginals. The resulting complete data sets can be used for inference. This seems odd, since we can derive all statistics from the learned joint distribution. However, it is not unusual to use a different model for imputation than for the analysis ([SchaferMI]), partly because no fully parametric model is used for the analysis. In that case, the proposed procedure finds reasonable draws from the conditional distribution of the missing values given the observed data. The findings of Section 4 translate into better draws for the missing values.
Besides that, the proposed procedure is interesting for applications focusing on marginal distributions. At the Battery Cell Competence Center, BMW builds battery cells on a prototype scale. A key performance indicator (KPI) is assigned to every battery cell at the end of the production line. However, some products do not reach the end of the production process because a decision system relying on observed measurements sorts out pieces. With the proposed method, one can assess the distribution of the KPI over all products and evaluate the performance of the decision system without intervening in the production process. As the data is confidential, the concrete application cannot be shown here.
With respect to future research, different aspects might be worth investigating:
Maximize Equation (13) with respect to sequentially. The derivative with respect to only one specific is faster to calculate and the involved terms demand less memory. It is again an application of the ECM algorithm of [MengECM].
Include the weights and bandwidths of the mixture models (Equation (14)) in the parameters and examine other kernels, such as the Epanechnikov kernel.
Develop methods to select in Equation (14) (similar to [MclachlanFindg] for complete data sets).
Generalize the approach to different parametric copulas. In particular, when the copula is derived from the multivariate distribution function of an exponential-family distribution with marginals , the algorithm of Section 3 also applies and there is a solution that depends on an expected statistic only.
Given the numerous opportunities for future research and the promising results of our method, we are looking forward to more interesting contributions in the field of semiparametric density estimation in the case of missing data.
Appendix A
A.1 Proof of Conditional Distribution
Proof of Proposition 2.1.
We prove the statements in the order of the proposition, which is a multivariate generalization of [kaarik].
Inspect the conditional density function:
Using well-known factorization lemmas based on the Schur complement (see, for example, [MurphyML, Section 4.3.4]) applied to , we encounter
The distribution of
follows with a change of variable argument. Using Equation (15), we observe for any measurable set
where, in the second equation, we used the transformation and the fact that
This proof is analogous to the one above and we finally obtain
A.2 Closed-form Solution of the E-Step for
Assume w.l.o.g. that and let