1 Introduction
We consider the problem of reconstructing a sequence of multidimensional real vectors where some components of some vectors are missing. As an example, consider a speech utterance that is corrupted by another sound signal: at a given instant some speech spectral bands are corrupted by the other signal and can be considered missing. A given spectral band may be missing at some instants and present at other instants. The reconstruction problem here is to obtain the value of the missing spectral bands given the present ones, for the whole utterance. In the particular case when the components or variables that are missing are the same for all vectors of the sequence, the reconstruction problem becomes a regression or mapping approximation (of the missing variables given the present ones, or the Ys given the Xs).
Regression methods (e.g. a neural net) usually estimate univalued mappings (one-to-one or many-to-one), where a given value of the present variables results in a unique value for the missing variables. This works well when there is an underlying function (in its mathematical sense) that uniquely determines the missing variables given the present ones; but this is not the case generally, as for inverse mappings (resulting from inverting a forward, univalued map). For example, the position of the end-effector of a robot arm is uniquely determined by the angles at its joints, but not vice versa (inverse kinematics). When the missing data pattern varies along the sequence, univalued mappings will occur intertwined with multivalued mappings (one-to-many). We depart from the traditional point of view and define multivalued mappings as the basis of our reconstruction method: in one particular vector of the sequence, there may be more than one value for its missing components that is compatible with the value of its present components. A flexible way of generating such multivalued mappings (of an arbitrary subset of variables onto another subset) is via conditional distributions of a joint density model. However, some of these values will not be compatible with the rest of the sequence, considered globally: for example, they may result in physically impossible discontinuities of the inverse kinematics of the robot arm. To break the ambiguity of which reconstructed value to choose at each vector, we assume some prior information is available that constrains the sequence, in particular its continuity: one must choose the reconstructed values such that the resulting sequence is as continuous as possible.
Thus, we assume that the data have two kinds of redundancy: (1) the vectors lie in a low-dimensional manifold and so knowledge of some vector components constrains the remaining ones (pointwise redundancy); and (2) consecutive vectors lie near each other (across-the-sequence redundancy).
The rest of the paper is organised as follows. Section 2 defines the data reconstruction problem. Section 3 explains how we derive multivalued functional relationships from conditional distributions. Section 4 explains how to use prior information to constrain the multivalued mappings thus derived, and how to find the optimal reconstruction. Section 5 summarises the reconstruction method. Sections 6–7 give experimental results. The remaining sections discuss the method, review related work and conclude the paper.
A preliminary version of this work appeared as a conference publication (Carreira-Perpiñán, 2000b).
2 Definition of the problem of data reconstruction
Generally, we define the problem of data reconstruction as follows: given a data set where part of the data are missing, reconstruct the whole data set to minimise an error criterion. Let us examine in detail the elements of this definition.
The data set
The data set must have some structure in order that reconstruction be possible, i.e., there must be some dependencies between the different vectors in the set that give rise to redundancy. A typical example is sequential data in which consecutive vectors are close to each other (of course, not all sequences satisfy this). Such data can be considered as the result of sampling in time a vector that is a continuous function of the time. We can generalise this notion as follows. Assume $\{x_n\}_{n=1}^N$ are samples from a continuous function $f$ of an independent variable $t$ at points $\{t_n\}_{n=1}^N$. We call $t$ the experimental conditions; $t$ can be the time (when the sample was taken), the position in space (where it was taken), etc. Thus, $x_n = f(t_n)$ is the sample point obtained at condition $t_n$. We call $x$ the observed variables. We observe $\{x_n\}_{n=1}^N$ but not necessarily $\{t_n\}_{n=1}^N$, and $f$ is unknown. Table 1 gives some examples. $f$ gives to the collection $\{x_n\}_{n=1}^N$ the structure or redundancy that allows the reconstruction of missing data.
Example of problem                 Experimental conditions    Observed variables
Trajectory of a mobile point       time, 1D                   position of the point
Spoken utterance                   time, 1D                   speech feature vector, 13D
Wind field on the ocean surface    spatial coordinates, 2D    wind velocity vector, 2D
Colour image                       spatial coordinates, 2D    RGB level, 3D
We now have three dimensionalities: the dimensionality $D$ of the space of the vectors $x$ that we observe; the intrinsic dimensionality $L$ of the manifold where the vectors are constrained to live; and the dimensionality $e$ of the variable $t$, corresponding to the batch of data that we want to reconstruct. These dimensionalities verify $e \le L \le D$. For example (fig. 1), imagine an ant that walks on the trunk of a tree ($L = 2$) and take as $x$ the position in Euclidean space ($D = 3$) and as $t$ the time ($e = 1$); a given trajectory of the ant will be 1D, but the region that the ant is allowed to be on (the trunk) is 2D and in principle we may find the ant anywhere in that 2D region (either by taking a very long trajectory or by taking many different trajectories).
In the rest of this paper we assume $D > 1$. The case $D = 1$ does not allow us to look at the relationship between variables, since there is only one, and therefore we cannot make use of the redundancy derived from a low-dimensional representation. We will also assume sequential data unless otherwise noted, i.e., $t$ is the time (or some other 1D variable), although the treatment can be generalised to any dimensionality $e$. We will write $\{x_n\}_{n=1}^N$ to denote a sequential data set, where $n$ indicates a temporal order in the data, reserving the notation $\{x_i\}_{i=1}^N$ for data sets which need not have any sequential order (in which case $i$ is just a label).
The pattern of missing data
That part of the data are missing means that some of the variables have missing values. We say that the value of a given scalar variable is present if such value was measured; otherwise, we say it is missing. Abusing the language, we will also speak of present (missing) variables to mean variables whose values are present (missing). The reasons for a value to be missing are multiple: the value may not have been measured; the value was measured but may have got lost, erased or corrupted; and so on, depending on the particular problem.
When the values are corrupted in various amounts rather than just being either missing or present, one could consider further categories of uncertainty in a value. This can be a beneficial strategy for, e.g., recognition of occluded speech (Cooke et al., 2001). However, for the purposes of data reconstruction, it is not clear what one should do with a value that is neither present (which must not be modified) nor missing (which must be filled in). Therefore we will stick to the present/missing dichotomy. We will also assume that each value has been classified as either present or missing, even though in some applications this task may not be trivial.
We can represent the pattern of missing data associated with the data set $\{x_n\}_{n=1}^N$ with an $N \times D$ binary matrix (or mask) $M = (m_{nd})$ such that $m_{nd} = 1$ if the value of $x_{nd}$ is present and $m_{nd} = 0$ otherwise. The matrix $M$ acts then as a multiplicative mask on the complete data set, i.e., the data set where all values are present, as represented in figure 2. We obtain the problem of regression (or mapping approximation) in the particular case where $m_{nd}$ is independent of $n$ (in which case the mask of fig. 2 has a columnar structure). We will use the term regression-type missing data pattern if the pattern of missing data is constant over the sequence and general missing data pattern if it varies. We will use the term complete data set to mean the data set as if no values were missing and reconstructed data set to mean the data set where the missing values have been filled in by a reconstruction algorithm.
The algorithm described in this paper is applicable to any pattern of missing data, irrespective of why or how that pattern came about. However, if the missing data are not missing completely at random (i.e., the pattern of missing data depends on the actual data; Little and Rubin, 1987), then any information about the mechanism of missing data should be taken into account if possible, since it may further constrain the values that the missing variables may take.
Reconstruction of the whole data set
To reconstruct the whole data set means to find a single estimate of each missing value. Generally, we define four types of reconstruction according to the combinations of the following characteristics:

The number of candidate reconstructions of a given entity that are provided: single or multiple reconstruction.

The entity that is being reconstructed: pointwise (or local) for reconstruction of a vector $x_n$ given information present in $x_n$ only, and global for reconstruction of the whole sequence or data set given information present in it.
Using this terminology, we seek a single global reconstruction of the data set. A method that provides single pointwise reconstruction can only provide single global reconstruction; standard regression and mapping approximation are examples of such methods. But single global reconstruction can be attained by an appropriate combination of a collection of multiple pointwise reconstructions; the method proposed in this paper does this.
Error criterion
In this paper we use the square difference between the true and the reconstructed vectors (averaged over the sequence) as a measure of the reconstruction error. Other criteria are also possible.
Notation
We use the following notation to select components (variables) of a vector. If $x = (x_1,\dots,x_D)^T$ is a $D$-dimensional real vector and $I \subseteq \{1,\dots,D\}$ is a set of indices, then $x_I$ represents the vector composed of those components of $x$ whose indices are in $I$. For example, if $x = (x_1, x_2, x_3)^T$ and $I = \{1,3\}$ then $x_I$ is $(x_1, x_3)^T$. This notation is convenient to represent arbitrary patterns of missing data, where the present or missing variables at point $x_n$ are indicated by sets $P_n$ and $M_n$ satisfying $P_n \cup M_n = \{1,\dots,D\}$ and $P_n \cap M_n = \emptyset$. Abusing the notation, we may sometimes write $P_n$ or $M_n$ to mean $x_{P_n}$ or $x_{M_n}$, respectively. Often, we will also use $x_P$ and $x_M$ to refer to the present and missing variables, respectively.
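As an aside, this index-set notation maps directly onto array indexing in most numerical languages. A minimal sketch in Python/NumPy (our own illustration, using 0-based indices; the names `x`, `I`, `P` and `M` mirror the notation above):

```python
import numpy as np

# Hypothetical example of the index-set notation: for a D-dimensional
# vector x and an index set I, x_I is the subvector of the components
# of x whose indices are in I.  NumPy's integer-array indexing
# implements this directly.
x = np.array([2.0, 5.0, 7.0])          # a 3-dimensional vector
I = [0, 2]                              # the index set {1, 3}, 0-based
x_I = x[I]                              # the subvector (x_1, x_3)

# Present/missing index sets for one data point: P and M partition
# {0, ..., D-1}, so together they recover the whole vector.
D = x.shape[0]
P = [1]                                 # indices of present components
M = sorted(set(range(D)) - set(P))      # indices of missing components
assert sorted(P + M) == list(range(D))  # P and M form a partition
```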
3 Deriving multivalued functional relationships from conditional distributions
Any kind of reconstruction is based on a functional relationship of what is missing given what is present. This section discusses two central ideas. The first one is that one can define a functional relationship ($x_M$ as a function of $x_P$) from the conditional distribution $p(x_M | x_P)$ of $x_M$ given $x_P$ by picking representative points of this distribution. The second one is that underlying a multimodal conditional distribution there (often) is a multivalued functional relationship and that it is wrong to summarise such a distribution with its expected value. Instead, we propose the use of all the modes of a conditional distribution to define a multivalued functional relationship and thus to define multiple pointwise reconstruction. We assume that the conditional distribution comes from a probability density function (pdf) $p(x)$ for all the observed variables $x$. In general, we use the terms predictive distribution for a distribution containing information about the missing variables (perhaps different from the conditional) and candidate (pointwise) reconstructions for the values used to fill in the missing variables (perhaps different from the modes).
3.1 Informative, or sparse, distributions
A conditional distribution which consists of several sharp peaks conveys information about a functional relationship in that the probability mass is concentrated around a few points. We can construct a multivalued mapping by picking several particularly informative points of the domain of $x_M$ (see fig. 3). In general, we say that a $D$-dimensional pdf (possibly the distribution of some variables conditioned on the values of some other variables) is informative or sparse if the probability mass is concentrated around a low-dimensional manifold. Conversely, if the probability mass is spread over a $D$-dimensional region then we say that the pdf is uninformative. We thus state the principle that highly informative distributions can be assimilated to (possibly multivalued) functional relationships.
As we define it, the concept of distribution sparseness is a probabilistic extension of the concept of a low-dimensional manifold. Thus, a $D$-dimensional pdf concentrated around a point, curve or surface is sparse for $D \ge 1$, $D \ge 2$ and $D \ge 3$, respectively, and so on. Our definition of sparseness is vague, since the term "concentrated around" is relative: just how much probability mass and how near the low-dimensional manifold? For the purpose of deriving functional relationships from conditional distributions this definition suffices. Carreira-Perpiñán (2001) discusses the issue of measuring sparseness quantitatively.
A conditional distribution $p(x_M | x_P)$ may be informative for some values of $x_P$ and uninformative for other values of $x_P$. If we require that $p(x_M | x_P)$ be informative for any value of $x_P$, then the joint density must be itself an informative distribution, since $p(x) = p(x_M | x_P)\, p(x_P)$.
3.2 Unimodal distributions: optimality of the mean
For unimodal distributions, it makes sense to summarise the distribution with a single point. Which point to choose depends on what error criterion we want to minimise. If we pick the point $\hat{x}$ that minimises the average distance (with respect to the distribution of $x$) to every other point in the domain of $x$, $E[d(x, \hat{x})]$, then the optimal point is the mean of $x$ if $d$ is given by the $L_2$ norm (Bishop, 1995, pp. 201–205) and the median if $d$ is given by the $L_1$ norm (DeGroot, 1986, pp. 209–210). For symmetric distributions the mean, median and mode coincide, but for skewed distributions they differ. However, the median does not have a natural generalisation to more than 1D. A number of approaches exist that derive univalued mappings from the conditional distribution, usually via the mean (see section 9.2).

3.3 Multimodal distributions: the mean considered harmful
The mean of a multimodal distribution can lie on an area of low probability, thus being a highly unlikely representative of the distribution. Worse still, it may lie outside of the support of the random variable when such support is not a convex set, since the mean is a convex sum itself (and this can happen even if the distribution is unimodal); this fact has been pointed out in the context of inverse problems (Jordan, 1990; Jordan and Rumelhart, 1992). In spite of this, the mean remains the most common choice of representative point of a multimodal distribution, possibly due to being optimal with respect to the $L_2$ norm (as long as one is constrained to choose a single point) and to its ease of calculation.

A better choice is the modes. Unlike the mean, any mode by definition must lie inside the support of the random variable and have a locally high probability density. In general, calculating the modes of a multivariate distribution is difficult because one does not know how many modes there are and because computing each one of them cannot be done analytically. However, for Gaussian mixtures, Carreira-Perpiñán (2000a) gives algorithms for finding all the modes.¹ The algorithms can also compute error bars for each mode, thus estimating locally the error, though this will not be used here.

¹The algorithms work by starting a hill-climbing search from every centroid. A Gaussian mixture in two or more dimensions can have more modes than components even if all the components have the same, isotropic covariance (Carreira-Perpiñán and Williams, 2003a, b), so the algorithms can miss some modes. However, this situation is very infrequent.
But then how does one deal with a multivalued choice? In the absence of additional information, all we can do is to keep all the modes, since any of them is a likely representative of the distribution. This implies defining a multivalued functional relationship. Thus, we propose to select all the modes of the conditional distribution as representative points of it.
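To make the mode-finding step concrete, here is a minimal sketch in the spirit of the hill-climbing strategy just described, assuming a homoscedastic, isotropic Gaussian mixture (for which the fixed-point update below is the Gaussian mean-shift step). The function name `gmm_modes` and the duplicate-merging tolerance are our own choices, not the paper's:

```python
import numpy as np

def gmm_modes(means, sigma, weights, tol=1e-8, max_iter=500):
    """Find modes of an isotropic Gaussian mixture (all components with
    covariance sigma^2 * I) by hill-climbing from every centroid.
    Returns one row per distinct mode found."""
    means = np.asarray(means, dtype=float)
    weights = np.asarray(weights, dtype=float)
    modes = []
    for x in means.copy():
        for _ in range(max_iter):
            # responsibilities p(component | x)
            d2 = np.sum((means - x) ** 2, axis=1)
            r = weights * np.exp(-0.5 * d2 / sigma ** 2)
            r = r / r.sum()
            x_new = r @ means                 # fixed-point (mean-shift) update
            if np.linalg.norm(x_new - x) < tol:
                x = x_new
                break
            x = x_new
        # keep the converged point only if it is not a duplicate mode
        if not any(np.linalg.norm(x - m) < 1e-4 for m in modes):
            modes.append(x)
    return np.array(modes)
```

For two well-separated components the search started from each centroid stays near that centroid, so two modes are returned; as the footnote above notes, such a search can in principle miss modes, though rarely.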
3.4 Sampling the predictive distribution
Another method to pick representative points of a distribution is to sample from it. This makes sense when there are few present variables, so that the missing variables are so underdetermined that they can take values on a continuous manifold: in this case, the modes of a conditional distribution are effectively a particular sample from that manifold. Computationally, sampling can also be more attractive than finding all the modes.
However, when the missing variables are constrained to take a finite number of values, this does not make much sense. Unless the conditional distribution is very sharply peaked around its modes, sampling will return values that by definition are corrupted by noise: sometimes they may fall in areas far from the main probability mass body or ignore low-probability bumps (which may represent infrequent but legitimate reconstructions). A further, serious problem is how to set the number of samples to obtain, which will almost certainly underestimate or overestimate locally the true number of values that the missing variables can take. Missing a correct pointwise reconstruction or generating a wrong one may affect the global reconstruction, not just the local one, via the continuity constraint (see section 4). Our experiments show that the sampling strategy performs consistently worse than both the mean- and mode-based approaches.
3.5 Joint density model of the observed variables
For a generic missing data pattern we need to be able to obtain the conditional distribution for any combination of missing and present variables, which requires estimating a joint density model of all the observed variables. The joint density embodies the relation of any subset of variables with any other subset of variables; all we need is to compute the appropriate conditional distribution, which in turn requires a marginalisation: $p(x_M | x_P) = p(x_M, x_P)/p(x_P)$ with $p(x_P) = \int p(x_M, x_P)\, dx_M$. Therefore, we are free to choose the method by which we estimate the joint density as long as the estimator allows easy marginalisation. The density model should be estimated offline using a training set different from the one that is to be reconstructed. Typically, the training set will have no missing data, although even if it does, it is possible to train the model using an EM algorithm (e.g. Ghahramani and Jordan, 1994; McLachlan and Krishnan, 1997).
A suitable class of density models are continuous latent variable models, since the pointwise redundancy implies an intrinsic low dimensionality of the observed variables (Carreira-Perpiñán, 2001). The distribution of the observed variables is sparse in the sense of section 3.1. The density model must be able to represent multimodal densities. This discards linear-normal models (factor analysis and principal component analysis). It should also allow an easy computation of conditional distributions. Useful models include the generative topographic mapping (GTM; Bishop, Svensén, and Williams, 1998b), mixtures of factor analysers (Ghahramani and Hinton, 1996) and mixtures of principal component analysers (Tipping and Bishop, 1999). For all these models the density of the observed variables has the form of a (constrained) Gaussian mixture and the number of tunable parameters can be kept reasonably low while keeping the ability to represent a broad class of multimodal densities. We can also directly use Gaussian mixtures or (nonparametric) kernel density estimation, both of which have the property of universal density approximation (Titterington et al., 1985; Scott, 1992). In this paper we use the GTM; we refer the reader to the original papers (Bishop et al., 1998a, b) for details on this model.

Hereafter we will assume that the joint density has the form of a Gaussian mixture, whose parameters were estimated from a data sample in an unsupervised way. Computing conditional distributions is then straightforward (for a diagonal Gaussian mixture, this requires little more than crossing out rows and columns). Finding all the modes can be done with the algorithms of Carreira-Perpiñán (2000a). In the particular case where the pattern of missing data is constant, one can just model the appropriate conditional distribution rather than the joint density, of course (section 9.4).
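To illustrate the "crossing out rows and columns" remark, here is a minimal sketch (our own naming and argument layout) of conditioning a diagonal-covariance Gaussian mixture on the present variables. The conditional is again a diagonal Gaussian mixture: each component keeps its mean and variance entries for the missing variables, and the mixing proportions are reweighted by how well each component explains the present values:

```python
import numpy as np

def condition_gmm(weights, means, variances, P, x_P):
    """Condition a diagonal-covariance Gaussian mixture on x_P.
    weights: (K,) mixing proportions; means, variances: (K, D) per
    component; P: indices of present variables; x_P: their values.
    Returns the weights, means and variances of p(x_M | x_P)."""
    D = means.shape[1]
    M = [d for d in range(D) if d not in set(P)]
    # per-component log-likelihood of the present values, log N(x_P; mu_iP, S_iP)
    diff2 = (x_P - means[:, P]) ** 2
    log_lik = -0.5 * np.sum(diff2 / variances[:, P]
                            + np.log(2 * np.pi * variances[:, P]), axis=1)
    new_w = weights * np.exp(log_lik)
    new_w = new_w / new_w.sum()             # posterior mixing proportions
    # with diagonal covariances, the conditional component parameters are
    # simply the entries corresponding to the missing variables
    return new_w, means[:, M], variances[:, M]
```

A component whose mean is far from the observed present values receives a vanishing posterior weight, which is how the conditional distribution becomes sharply multimodal only where the joint density supports it.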
4 Use of prior information to constrain multivalued mappings
So far we have exploited the redundancy between component variables of a given data point (via the conditional distribution) to constrain the possible values of the missing variables, but this can still result in multivalued mappings, as we have seen. In the absence of any additional information, the answer to the reconstruction problem would be those multivalued mappings. We now turn to the case of using extra information about the problem to constrain the possible values so that we obtain a unique global reconstruction of a data set.
4.1 Continuity constraints
For many practical cases, additional information comes from the redundancy between data points and depends on the experimental conditions. The most usual such constraint is given by continuity of the variables as a function of the experimental conditions: nearby values of $t$ produce nearby values of $x$, or "$\lVert x(t + \Delta t) - x(t) \rVert$ is small" according to a distance dependent on the problem. In this case, typically $t$ is a physical variable such as the time (or space) and then $x$ is a measured vector that depends continuously on it, resulting in a continuous trajectory (or field). In general, we can define constraints on the trajectory by bounding via a norm the derivatives of the function $x(t)$ (if they exist), numerically approximating each derivative by a finite-difference scheme in terms of the available measures at $t_1,\dots,t_N$ and applying suitable boundary conditions at the trajectory ends (e.g. to consider open or closed trajectories). Continuity (which penalises sharp jumps) results from the first derivative, while smoothness (which penalises sharp turns) results from the second. This results in the norm of a linear combination of points near $x_n$. Actual examples are given in section 4.3.
The norm depends on the problem, but typically will be the Euclidean or squared Euclidean norm. If different variables have different units or scales it may be convenient to use a weighted norm. The squared Euclidean norm has the computational advantage that it separates additively along each variable and so for a constant missing data pattern (the regression-type pattern) we need only consider the missing variables (since the present ones contribute a constant additive term). It also results in constraints that are quadratic forms on the missing variables, though this makes no difference in the constraint minimisation.
Another type of constraint that has often been found useful in inverse problems is a quadratic constraint $\frac{1}{2} k \lVert x - a \rVert^2$, which can be interpreted physically as an energy in mechanical systems. In particular, it corresponds to the potential energy of a harmonic oscillator with resting position at $a$ and restoring constant $k$.
4.2 Constraint by forward mapping
In the particular case where the reconstruction problem is a mapping inversion problem, we can use the known forward (direct) mapping $f$ as a further constraint. The forward mapping maps the missing variables onto the present ones: $x_P = f(x_M)$, where $P$ and $M$ are independent of $n$ (i.e., the same for all data points). Thus, given the values of $x_P$ and given a candidate reconstruction $\hat{x}_M$ (for example, $\hat{x}_M$ could be one of the modes of $p(x_M | x_P)$), the distance between $x_P$ and $f(\hat{x}_M)$ should be as small as possible.

In the ideal case where the procedure that provides candidate reconstructions (in this paper, the modes of the conditional distribution) was perfect, this constraint would have no effect, since every $\hat{x}_M$ would exactly map onto $x_P$. In reality, correct reconstructions will give small, but nonzero, differences between $f(\hat{x}_M)$ and $x_P$, while incorrect reconstructions (such as spurious modes) will generally give a much larger difference. Thus, the constraint by forward mapping can help to discard such incorrect reconstructions.
4.3 Minimisation of a global constraint
The constraints introduced above are by definition local and generally take the form of the (squared) norm of a linear combination of neighbouring points; e.g. $\lVert x_{n+1} - x_n \rVert$. Now we derive from such local constraints a global constraint that involves the whole data set. This way we define an objective function depending on all missing variables and then find the combination of candidate pointwise reconstructions that minimises it. We will then have a single global reconstruction that should be a good approximation to the complete data set. The role of the global constraint in breaking the nonuniqueness of the reconstruction is analogous to that of regularising operators for ill-posed problems (Tikhonov and Arsenin, 1977).
We can define a global constraint by adding the local constraints for each point in the sequence, thus obtaining:

$C_1 = \sum_{n=1}^{N-1} \lVert x_{n+1} - x_n \rVert$    (1)

$C_2 = \sum_{n=2}^{N-1} \lVert x_{n+1} - 2 x_n + x_{n-1} \rVert$    (2)

$C_3 = \sum_{n=1}^{N} \frac{1}{2} k \lVert x_n - a \rVert^2$    (3)

$C_4 = \sum_{n=1}^{N} \lVert x_{P_n} - f(x_{M_n}) \rVert$    (4)

In eqs. (1)–(2) we have used first- and second-order forward differences in $C_1$ and $C_2$, respectively, assuming that the experimental condition variable $t$ is sampled regularly; if this is not the case, then one should weight term $n$ by the corresponding spacing $t_{n+1} - t_n$. Note that $C_1$ is the length of the polygonal trajectory passing through $x_1,\dots,x_N$. We can define a global constraint as a linear combination of constraints such as those above, but in this paper we will concentrate on $C_1$.
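For concreteness, the first-order constraint (the length of the polygonal trajectory through the sequence) can be computed as follows; a sketch with our own naming, using the unweighted Euclidean norm:

```python
import numpy as np

def path_length(X):
    """First-order global continuity constraint: the length of the
    polygonal trajectory through the rows of X (one point per row),
    i.e. the sum of Euclidean distances between consecutive points."""
    diffs = np.diff(X, axis=0)                   # x_{n+1} - x_n
    return np.sum(np.linalg.norm(diffs, axis=1))
```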
The global constraint $C$ is a function of the missing variables $x_{M_1},\dots,x_{M_N}$ only. Thus we arrive at the following minimisation problem: minimise $C(x_{M_1},\dots,x_{M_N})$, where the search space is the Cartesian product of the sets of candidate reconstructions for each point (i.e., each set contains the modes of $p(x_{M_n} | x_{P_n})$). This is a combinatorial optimisation problem that can be expressed as finding the shortest path in a layered graph, as represented in figure 4. Calling $K_n$ the number of candidate pointwise reconstructions at point $n$, the total number of paths is $\prod_{n=1}^N K_n$, which in an average case is of exponential order in $N$. Fortunately, there are efficient algorithms both for global (exact) and local (approximate) minimisation.
4.4 Global minimisation: dynamic programming
The problem of finding the shortest path in a layered graph is a particular case of that of finding the shortest path in an acyclic graph and can be conveniently solved by dynamic programming (Bellman, 1957; Bertsekas, 1987). We can apply dynamic programming because for this problem the following principle of optimality for a decision policy holds: regardless of the policy adopted at previous stages, the remaining decisions must constitute an optimal policy, where here a stage is a layer in the graph and a policy is a sequence of decisions (i.e., a sequence of chosen nodes). This leads us to an algorithm where the decision of what link to choose is taken locally (at each layer $n$), but one partial path per node of the current layer must be kept. Figure 5 gives the forward recursion version of the dynamic programming algorithm (starting from layer $1$) for a global continuity constraint $C_1$. Due to the symmetry of the problem, a backward algorithm (starting from layer $N$) is equivalent. The algorithm requires the following definitions:
$\nu_{n,k}$ : the $k$th candidate pointwise reconstruction for $x_n$, for $k = 1,\dots,K_n$ (where $K_n$ is the number of candidates at point $n$); thus $\nu_{n,k}$ is node $k$ of layer $n$ in the graph.

$\pi_{n,k}$ : a minimal-length path from layer $1$ to node $k$ of layer $n$, for $n = 1,\dots,N$ and $k = 1,\dots,K_n$.

$\lambda_{n,k}$ : the total length of $\pi_{n,k}$, i.e., the sum of the lengths of its links.
We disregard the unlikely case of ties, where the minimisation operation may return several values of $k$.
The dynamic programming algorithm examines each link in the graph (i.e., each pair of nodes in adjacent layers) only once, and so finds the globally shortest path in time proportional to the total number of links.
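A minimal sketch of this forward recursion (our own naming: `layers[n]` holds the candidate pointwise reconstructions of point n, one per row, and the edge cost is the Euclidean distance between candidates in adjacent layers):

```python
import numpy as np

def shortest_layered_path(layers):
    """Dynamic-programming search for the shortest path through a
    layered graph.  Keeps one best partial path per node of the
    current layer (forward recursion), then backtracks.  Returns the
    index of the chosen node in each layer."""
    cost = np.zeros(len(layers[0]))          # best cost to reach each node
    back = []                                # backpointers, one array per layer
    for prev, cur in zip(layers, layers[1:]):
        # pairwise distances between previous-layer and current-layer nodes
        d = np.linalg.norm(prev[:, None, :] - cur[None, :, :], axis=2)
        total = cost[:, None] + d            # cost of extending each partial path
        back.append(np.argmin(total, axis=0))
        cost = np.min(total, axis=0)
    # trace the optimal path backwards from the best final node
    path = [int(np.argmin(cost))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```

Each pair of adjacent layers is examined once, matching the efficiency remark above.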
4.5 Local minimisation: greedy algorithm
A more intuitive and slightly faster algorithm is obtained as a greedy version of dynamic programming: at each layer, it simply selects the minimal-cost edge (i.e., the closest node). The starting point can be any node in any layer, but to improve the chances of getting a good path, it is better to start in a layer having very few nodes (hopefully just one). The algorithm greedily proceeds from the starting layer leftwards to layer $1$ and rightwards to layer $N$, since all edges are undirected. Unlike dynamic programming, this algorithm need only keep a single path at any time rather than one path per node of the current layer, but it does not necessarily find a path of globally minimal cost. Our experiments show that it usually leads to poor solutions, not just in terms of a high value of the constraint, but also in yielding a high reconstruction error of the data set (which is our ultimate criterion). Such solutions are sensitive to the choice of starting layer. Also, the greedy algorithm has a tendency to obtain reconstructed trajectories that retrace themselves at turning points and have abrupt jumps.
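For comparison, a sketch of this greedy variant (our own naming; `layers[n]` holds the candidate reconstructions of point n, one per row). It keeps a single path and extends it in both directions from the starting node:

```python
import numpy as np

def greedy_path(layers, start_layer=0, start_node=0):
    """Greedy counterpart of the dynamic-programming search: starting
    from one node, repeatedly pick the closest node in the adjacent
    layer, rightwards then leftwards.  Keeps a single path, so it may
    miss the globally shortest one."""
    N = len(layers)
    choice = [None] * N
    choice[start_layer] = start_node
    # proceed rightwards from the start layer
    for n in range(start_layer + 1, N):
        prev = layers[n - 1][choice[n - 1]]
        choice[n] = int(np.argmin(np.linalg.norm(layers[n] - prev, axis=1)))
    # and leftwards
    for n in range(start_layer - 1, -1, -1):
        nxt = layers[n + 1][choice[n + 1]]
        choice[n] = int(np.argmin(np.linalg.norm(layers[n] - nxt, axis=1)))
    return choice
```

On graphs where an early cheap edge leads into an expensive region, this local rule commits to the wrong branch, which is the failure mode described above.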
5 Summary of the method
We can now summarise our reconstruction method (see also fig. 8). The first step is done offline and involves estimating a Gaussian-mixture joint density model of the observed variables, using a complete data set (we use GTM in this paper). At reconstruction time, we are given a sequence with missing data. Then:

For each vector $x_n$ in the sequence, $n = 1,\dots,N$:

Compute the conditional distribution of the missing variables given the present ones. This is a Gaussian mixture too.

Find all the modes of this conditional distribution. These are the candidate reconstructions for $x_{M_n}$.


Minimise the trajectory length $C_1$ over the sets of candidate reconstructions using dynamic programming to yield the reconstructed sequence.
6 Experiments: 2D toy example
In this section we study the performance of the reconstruction method with a 2D toy problem with observed variables $(x_1, x_2)$. The mapping $x_1 \to x_2$ is many-to-one and so easy to estimate by traditional methods, but the mapping $x_2 \to x_1$ is one-to-many.
We consider a forward, nonlinear mapping that results in 1D data observed in two dimensions; see fig. 6A. The forward mapping is injective only in parts of the domain and so the inverse mapping is sometimes multivalued. The task is to reconstruct a (possibly noisy) sampled trajectory of 2D points such as that of fig. 6A with missing data. The reconstruction error is computed as the average squared error between the original, complete trajectory and the trajectory reconstructed by a particular method.
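The error measure just described can be sketched as follows (our own naming; one sequence point per row):

```python
import numpy as np

def reconstruction_error(X_true, X_rec):
    """Average squared reconstruction error over the sequence: the
    mean over n of the squared Euclidean distance between the original
    and the reconstructed vector."""
    return np.mean(np.sum((X_true - X_rec) ** 2, axis=1))
```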
Five types of mask are considered:

where $x_2$ is always missing (regression $x_1 \to x_2$, 50% missing data);

where $x_1$ is always missing (regression $x_2 \to x_1$, 50% missing data);

three masks where any of $x_1$, $x_2$ may be missing at random, with three different proportions of missing data, respectively.
The first two are regression-type masks, while the remaining three are general missing data patterns. By applying the mask to a complete trajectory, we obtain a trajectory with missing data (see fig. 2).
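As an illustration of a missing-at-random mask of the general type, here is a small sketch (the trajectory, the missing fraction and the NaN encoding of missing values are our own illustrative assumptions, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # a complete 2D trajectory (illustrative)
p_missing = 0.3                          # overall fraction of missing values

# Binary mask: 1 = present, 0 = missing (missing completely at random).
mask = (rng.random(X.shape) >= p_missing).astype(int)

# Represent the incomplete trajectory with NaNs at the missing entries;
# the mask acts multiplicatively in the sense that present values pass
# through unchanged while missing ones are blanked out.
X_missing = np.where(mask == 1, X, np.nan)
assert np.isnan(X_missing).sum() == (mask == 0).sum()
```

A regression-type mask would instead zero out the same column of `mask` in every row.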
As training data for the joint density model, we generated a shuffled (i.e., without sequential order) training set with points sampled from the curve with additive normal noise. We used it to train three models² (fig. 6B):

²We gratefully acknowledge the use of Matlab code by Markus Svensén (to train GTM) and Ian Nabney and Christopher Bishop (Netlab, to train MLPs), both freely available at http://www.ncrg.aston.ac.uk.

A factor analyser with one factor. We use this as a linearGaussian density model baseline.

A multilayer perceptron (MLP) with a single hidden layer, trained to minimise the squared reconstruction error using stochastic gradient descent and small, random starting values for the weights. We use this as a universal mapping approximator baseline.

A 1D GTM with a grid of points and Gaussian basis functions of width equal to the separation between basis function centres (see Bishop et al., 1998b). This results in a Gaussian mixture whose components all have the same, isotropic covariance. We use this to implement mean- and mode-based methods.
We compared the following reconstruction methods based on GTM:

Single pointwise reconstruction by the conditional mean (mean; section 3.2).

Single pointwise reconstruction by the global mode of the conditional distribution (gmode).

Single pointwise reconstruction by a random mode, all modes being taken as equally likely (rmode).

Single pointwise reconstruction by the closest mode (cmode), i.e., the mode of the conditional distribution that is closest in Euclidean distance to the true value of the original sequence (unknown in practice, of course). The cmode gives a lower bound of the reconstruction error achievable by any mode-based method (gmode, rmode, grmode, dpmode) and tells us how much usable reconstruction information is contained in the conditional modes.

Single pointwise reconstruction by the mode of the conditional distribution that is closest in Euclidean distance to the previously reconstructed point (grmode), i.e., a greedy minimisation of the continuity constraint (section 4.5).

Multiple pointwise reconstruction by the modes of the conditional distribution, followed by dynamic programming minimisation of the continuity constraint to select the global reconstruction (dpmode; section 4.4). The continuity constraint is based on the unweighted Euclidean distance, eq. (1). This is the method we advocate in section 5.

Like dpmode, except that if the conditional distribution is unimodal we use its mean rather than its mode (meandp). This is intended to account for skewed unimodal conditional distributions.

Multiple pointwise reconstruction by samples of the conditional distribution (sampdp; section 3.4), followed by dynamic programming minimisation of the continuity constraint to select the global reconstruction. We took the number of samples slightly larger than the maximal number of inverse branches of the mapping, so that all the branches have a chance to contribute but without facilitating the appearance of outliers.
Additionally, we compared with fa (factor analysis), for which the mean- and mode-based methods coincide (the density being symmetric and unimodal); and with mlp (the multilayer perceptron), but only for the two regression-type masks, which correspond to regression patterns of missing data.
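The dynamic programming step used by dpmode (and sampdp) amounts to a shortest-path search on a layered graph whose layer t holds the candidate pointwise reconstructions of point t. The following is an illustrative implementation under the unweighted Euclidean continuity constraint, not the authors' code:

```python
import numpy as np

def dp_select(candidates):
    """Given a list of candidate reconstructions per sequence point
    (candidates[t] is an array of D-dimensional points, e.g. the modes of
    the conditional distribution at point t), pick one candidate per point
    so that the total Euclidean length of the resulting trajectory is
    minimal. Standard dynamic programming on a layered graph: the work is
    proportional to the number of links between consecutive layers."""
    T = len(candidates)
    cost = [np.zeros(len(candidates[0]))]
    back = []
    for t in range(1, T):
        # pairwise Euclidean distances between layer t-1 and layer t
        d = np.linalg.norm(candidates[t - 1][:, None, :] -
                           candidates[t][None, :, :], axis=2)
        total = cost[-1][:, None] + d
        back.append(total.argmin(axis=0))   # best predecessor of each node
        cost.append(total.min(axis=0))
    # backtrack the shortest path from the best final node
    idx = [int(cost[-1].argmin())]
    for b in reversed(back):
        idx.append(int(b[idx[-1]]))
    idx.reverse()
    return [candidates[t][idx[t]] for t in range(T)]
```

On a toy sequence whose candidates lie on two branches, the search selects the branch-consistent path rather than switching branches, which is exactly the behaviour the continuity constraint is meant to enforce.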
[Fig. 6 (panels A–I): training set; noisy trajectory; mapping; GTM density and latent manifold; FA density and latent manifold; conditional distributions under GTM and FA, with their modes and means; masks; original trajectory and reconstructions by mean, fa, mlp, gmode, grmode, dpmode, cmode and sampdp.]
We ran a number of experiments, of which we discuss a representative selection (fig. 6, table 2). The same basic results were confirmed with many randomly generated training sets, masks and trajectories to be reconstructed. Additional experiments are given in Carreira-Perpiñán (2001), including further masks (e.g. data missing in blocks) and models (e.g. full-covariance Gaussian mixtures, MLP ensembles). We report various aspects of the results next.
Method comparison
Based on these experiments we can draw the following conclusions about the methods:

As expected, fa is always much worse than mean for any mask, since the forward mapping is nonlinear; and for both masks and , mlp was practically equal to the mean.

Since our GTM model is very close to the true density, the mean approximates the forward mapping (not shown in fig. 6) extremely well, this mapping being univalued. It fails with the inverse mapping (fig. 6E), which is multivalued: the univalued mean mapping travels through the midpoints of the inverse branches, blending them into a single branch. Because of the symmetry of the forward mapping, the midpoint of these branches always happens to coincide with one of the branches, and so the result is better than it would be in a general case lacking symmetry (where the mean will not be a valid inverse). The mean also fails for general masks (e.g. fig. 6F), although, as predicted by the theory, in terms of average reconstruction error it is still the best method based on single pointwise reconstruction (the others being gmode and rmode).

Both gmode and rmode result in discontinuous mappings, with frequent branch switches, but unlike the mean they always provide valid inverses, because they track branches. gmode generally outperforms rmode; the latter can be considered the chance baseline for single pointwise reconstruction by the modes.

cmode achieves practically zero reconstruction error for all masks considered, vastly outperforming the mean too (except, marginally, sometimes with mask , where mean is optimal). This demonstrates that the modes of a good conditional distribution contain information that can potentially achieve nearzero reconstruction error; the problem lies in the selection of a good constraint that discards the wrong modes.

Much of the modes’ information is recovered by dpmode, which outperforms or equals any other method, including mean, for any mask. Even for the forward mapping, where mean is guaranteed to be optimal on the average, dpmode still performs as well as the mean (it actually outperforms it in table 2A, row , but this is an isolated instance). Its performance is degraded only slightly for very high amounts of missing data (e.g. mask ), where the other methods incur huge errors. For regression problems (masks and ), dpmode may perform worse than mean in two situations analysed below: nonsmooth density models and oversampled trajectories.

There is virtually no difference between meandp and dpmode. This is due to the fact that the training set contained isotropic noise, so that when the conditional distribution is unimodal, it is approximately symmetric and its mean and mode nearly coincide. For more complex types of noise meandp may slightly improve over dpmode.

Both grmode and sampdp result most times in wrongly reconstructed trajectories that retrace themselves and contain shortcuts between branches. For grmode the reason is the inability to backtrack out of a wrong solution, although for general missing data patterns its performance is not much worse than that of dpmode. For sampdp there are two reasons: the inability to find a priori a good value for the number of samples (to force all mapping branches to be represented, we also tried a very high value; the resulting trajectories were smoother but still wrong), so that suboptimal candidate reconstructions are generated and/or correct ones are missed; and the appearance of wrong trajectory reconstructions with a low value of the global constraint. Therefore, despite the computational economy of these approaches, they are not recommended.
Denoising
A noisy trajectory is reconstructed as a smooth trajectory because by reducing a conditional distribution to a point (single pointwise reconstruction) or a point per branch (multiple pointwise reconstruction) all variation is eliminated for the given values of the present variables. In fact, a large part of the reconstruction error in table 2 is due to the noise in the original trajectory, which has been removed from the reconstructed one.
Regression is harder than varying patterns of missing data
For methods based on global constraint minimisation, in particular dpmode, a varying missing data pattern helps to break the ambiguity. The reason is the changing structure of the candidate reconstructions for varying patterns of missing data. When the pattern of missing data is constant (regression-type) and the conditional distribution has spurious modes, it is possible to have long runs of wrong candidate reconstructions that give a trajectory segment shorter than the correct one (even though there may be long jumps where the conditional distribution becomes unimodal). For varying patterns of missing data the spatial structure of these candidates typically changes dramatically from one point to the next. Thus, the runs of wrongly reconstructed points are much shorter, and when concatenated they give a longer trajectory than the correct one. This can be seen in table 2 for dpmode: large errors appear only for nonsmooth density models (table 2B; see below) or oversampling for the regression-type patterns, but never for general ones, even when a large fraction of the values is missing. Thus, the dpmode method is very robust for varying patterns of missing data, even with not very good density models, oversampling or large amounts of missing data.
Table 2: average squared reconstruction error for each method and mask.

A: toy problem (columns mean to meandp use the GTM model)

| Mask | fa | mlp | mean | gmode | rmode | cmode | grmode | dpmode | sampdp | meandp |
|------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
|      | 3.8011 | 0.0129 | 0.0196 | 0.0120 | 0.0120 | 0.0120 | 0.0120 | 0.0120 | 0.2027 | 0.0196 |
|      | 4.2702 | 2.1633 | 2.1184 | 2.0878 | 7.7086 | 0.0129 | 0.7529 | 0.0129 | 2.0003 | 0.0129 |
|      | 15.5385 |       | 14.6874 | 62.8203 | 36.1892 | 0.0069 | 0.2534 | 0.1936 | 5.9703 | 0.2059 |
|      | 11.4116 |       | 9.7848 | 31.4508 | 14.8545 | 0.0058 | 0.1512 | 0.0746 | 4.6827 | 0.0867 |
|      | 2.9049 |        | 1.7891 | 6.5224 | 3.1629 | 0.0040 | 0.0191 | 0.0066 | 0.6252 | 0.0062 |

B: toy problem (nonsmooth GTM model)

| Mask | mean | gmode | rmode | cmode | grmode | dpmode | sampdp | meandp |
|------|--------|--------|--------|--------|--------|--------|--------|--------|
|      | 0.0956 | 0.1659 | 2.0207 | 0.1609 | 2.1708 | 1.1226 | 0.3732 | 1.1338 |
|      | 2.2376 | 3.3735 | 8.1796 | 0.1048 | 3.8005 | 0.1093 | 2.1427 | 0.1094 |
|      | 14.8596 | 58.0260 | 31.1647 | 0.1566 | 1.0583 | 0.2272 | 8.5859 | 0.2285 |
|      | 9.7731 | 28.7667 | 21.0178 | 0.1168 | 0.5550 | 0.1181 | 2.5967 | 0.1198 |
|      | 1.8337 | 6.1947 | 7.9984 | 0.0510 | 0.0984 | 0.0510 | 0.6293 | 0.0517 |

C: robot arm problem

| Mask | fa | mlp | mean | gmode | rmode | cmode | grmode | dpmode | sampdp | meandp |
|------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
|      | 0.0130 | 0.0007 | 0.0014 | 0.0017 | 0.0017 | 0.0017 | 0.0017 | 0.0017 | 0.0027 | 0.0014 |
|      | 0.7690 | 0.6394 | 0.6767 | 1.2998 | 1.4603 | 0.0057 | 3.4837 | 0.3230 | 0.4602 | 0.3230 |
|      | 0.7369 |        | 0.7084 | 1.9592 | 1.7930 | 0.0072 | 0.2395 | 0.0928 | 0.1366 | 0.0928 |
|      | 0.4136 |        | 0.3655 | 1.1159 | 1.1873 | 0.0045 | 0.4156 | 0.1092 | 0.0717 | 0.1094 |
|      | 0.2297 |        | 0.1486 | 0.1539 | 0.2371 | 0.0029 | 0.0343 | 0.0074 | 0.0147 | 0.0080 |
Nonsmooth density models and spurious modes
In the previous experiments we have used a nearly ideal density model (the GTM of fig. 6B): it approximates the true density almost exactly, and so any conditional distribution has the right number of modes, at the right locations. Gaussian mixtures, being a superposition of localised bumps, have a tendency to develop ripple on an otherwise smooth density, as seen with a GTM model with fewer components (fig. 6G). Although this density estimate is worse than that of fig. 6B in terms of log-likelihood, it still represents the density qualitatively well. However, the mixture components do not coalesce enough in some regions. This results in wavy conditional distributions having more modes than they should (fig. 6H). Some of the true modes along the trajectory have unfolded into a few modes scattered in a small area around the true mode. A very good reconstruction is still possible, since some of these modes are very close to the true one, as evidenced by the low reconstruction error of the cmode. The mean also achieves a reconstruction error about as low as with the smoother model, being largely insensitive to the ripple. But for one of the regression masks the error for dpmode is now much larger than with the smoother model (fig. 6I). The problem is that this crowd of spurious modes may well allow wrong reconstructed trajectories that have a lower global constraint value (that are shorter) due to shortcuts, which appear as horizontal and nearly vertical segments in fig. 6I. The parameter that governs this behaviour is the ratio between the extent of the mode scatter inside a conditional distribution and the sampling period of the trajectory: the larger the scatter, the more likely interference with neighbouring trajectory points becomes. (Owing to the geometry of this particular example, no spurious modes appear in the conditional distribution of the other regression mask, and so its error remains low.)
The error for general missing data patterns remains low for the same reason as before: the subsets of missing variables usually change from point to point, and thus the probability of getting a run of several points whose conditional distributions have spurious modes decreases. In general, spurious modes can also appear when using Gaussian components with separate, full-covariance parameters (which, besides, are much harder to train due to log-likelihood singularities).
Over and undersampling
We experimented with very small and very large values of the sampling rate of the trajectory for method dpmode (or equivalently, with very large and very small values of the number of points in the trajectory, respectively). A very small sampling rate is one close to the Nyquist rate; a very large sampling rate is one whose period is much smaller than the noise. For undersampling, dpmode still reconstructs the trajectory well down to a certain number of points, but starts finding wrongly reconstructed trajectories below that, particularly for the worse GTM models. This is clearly due to a lack of enough information to reconstruct the trajectory. More surprising are the results for oversampling: dpmode can give wrongly reconstructed trajectories that retrace themselves and have shortcuts for the inverse mask (and, for nonsmooth models, for the forward one too), although it still reconstructs well for the general masks. The reason is that the oversampled trajectory is polygonally very long but is not long in terms of actual displacement: being a random walk superimposed on a slow drift, it twists around itself many times in a small region. Thus, if there are multiple pointwise candidate reconstructions, there often exist shorter trajectories containing multiple retracings of a branch segment and infrequent branch switchings. The quality of the density model is not really at fault here: it is a characteristic of the global constraint chosen. We found that the trajectories were correctly reconstructed if we used the squared Euclidean distance in the constraint (instead of the Euclidean distance), the reason being that long jumps are then penalised more.
Other effects
The reconstructed trajectories tend to show a slight error at the trajectory turns, e.g. in fig. 6D for dpmode and grmode, or in fig. 6I for mean. The error consists of cutting short through the turns (for all methods) for one mask and, less noticeably, of a spike right at the tip of the turns (for grmode and dpmode) for another. The “cutting-short” effect is due to slight imperfections of the GTM density estimate: the Gaussian components interact more strongly in the convex side of the turn, pile up there and bias the mean (see fig. 6H); cutting through turns of the original trajectory also saves trajectory length. The spike is the premature blending of two inverse branches into one: as the two branches approach their intersection point, the bumps associated with the two respective modes of the conditional distribution interact and blend into a single unimodal bump before the intersection point. In both cases the effect is larger the larger the component variance is; in turn, this variance is larger the noisier the training set is and the fewer components are used.

With general missing data patterns, the case of all variables missing at a point in the sequence results in two different behaviours. Single pointwise reconstruction methods prescribe reconstructing them with a fixed value: the mean of the joint density model for fa and mean, and its global mode for gmode. This produces large jumps to that fixed value and thus inflates the reconstruction error (fig. 6F). Multiple pointwise reconstruction methods prescribe reconstructing them with all the modes of the joint density model. This adds more flexibility and reduces the reconstruction error, since the jumps are now to one of those modes and are therefore shorter. Even better results are obtained by using all centroids instead of the modes, particularly for very smooth density models where the components coalesce strongly, decreasing the number of modes. Strictly, though, the case of all variables missing is just the most extreme case of a range of missingness patterns.
Summary
The main result is that, for a good density model and if continuity holds, dpmode (our method) can greatly improve over the traditional methods mean (the conditional mean) and mlp (the universal mapping approximator), approaching the limit of cmode (which is close to zero error) for all patterns of missing data; and it is particularly successful for general patterns of missing data, even for poor density models, oversampled trajectories or large amounts of missing data. This means that the modes contain important information about the possible options for predicting the missing values, and that applying the continuity constraint allows that information to be recovered.
If the density model is not smooth, the conditional distribution presents spurious modes which may give rise to wrong solutions of the dynamic programming search. In this case, the reconstruction error with dpmode for regression problems (, ) usually exceeds that of the conditional mean. For general patterns of missing data (–) the error increase is small. The mean is barely affected in any case.
Finally, oversampling seems: (1) not to affect the dpmode for general missing data patterns (for both smooth and nonsmooth density models); (2) to introduce quantisation errors for forward (univalued) mappings but only in some areas, with the overall reconstruction being correct (the smoother the model, the lower the error); and (3) to severely degrade the quality of the reconstruction for inverse multivalued mappings due to shortcuts and retracings (for both smooth and nonsmooth density models).
7 Experiments: robot arm inverse kinematics
The inverse kinematics of a robot arm is a prototypical example of a mapping inversion problem, with a well-defined forward mapping and a multivalued inverse mapping. We consider a two-joint, planar arm that has often been used in the pattern recognition literature (e.g. Bishop, 1994; Rohwer and van der Rest, 1996). The forward mapping gives the position in Cartesian coordinates of the end effector (the hand of the robot arm) given the angles at its joints, where the actuator space maps to the work space (i.e., the region reachable by the end effector), with constant link lengths; see fig. 6. The transformation from the desired end-effector position to the corresponding joint angles (inverse kinematics) can be obtained analytically for this simple arm (Asada and Slotine, 1986) and is in general bivalued (“elbow-up” and “elbow-down” configurations). Studies of trajectory formation have considered sophisticated cost functions such as energy, torque, jerk or acceleration (Nelson, 1983), all expressible in terms of quadratic functions of derivatives. Moreover, since the forward mapping is known, we could further use a forward-mapping constraint (see section 4.2). Although we could favourably use these, we will show that a very simple cost function, continuity in the joint space, is enough to recover the trajectory.
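For illustration, the standard forward and (bivalued) inverse kinematics of such a two-joint planar arm can be written as follows; the link lengths below are arbitrary placeholders, since the paper's values are not shown in this excerpt:

```python
import numpy as np

L1, L2 = 0.8, 0.2   # illustrative link lengths, not the paper's values

def forward(theta1, theta2):
    """Forward kinematics: joint angles -> end-effector Cartesian position."""
    x = L1 * np.cos(theta1) + L2 * np.cos(theta1 + theta2)
    y = L1 * np.sin(theta1) + L2 * np.sin(theta1 + theta2)
    return x, y

def inverse(x, y):
    """Analytic inverse kinematics: in general two solutions, the
    'elbow-up' and 'elbow-down' configurations."""
    c2 = (x * x + y * y - L1 * L1 - L2 * L2) / (2 * L1 * L2)
    c2 = np.clip(c2, -1.0, 1.0)          # guard against rounding outside [-1, 1]
    sols = []
    for s in (+1.0, -1.0):               # elbow up / elbow down
        theta2 = s * np.arccos(c2)
        theta1 = np.arctan2(y, x) - np.arctan2(L2 * np.sin(theta2),
                                               L1 + L2 * np.cos(theta2))
        sols.append((theta1, theta2))
    return sols
```

Both returned configurations map back to the same end-effector position under the forward mapping, which is exactly the two-branch ambiguity that the continuity constraint must resolve along a trajectory.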
The experimental methodology is analogous to that of the toy problem. A training set of points was generated by sampling the actuator space uniformly, then applying the forward mapping and finally adding normal spherical noise (see fig. 7). We trained the following models: an MLP with a single hidden layer of units, a factor analyser, and a GTM (resulting in a Gaussian mixture with equal, isotropic components). The numbers of parameters of the MLP and the GTM were similar. We applied the same methods and masks as in the toy example, with the two regression-type masks meaning the forward and inverse regressions, respectively. For sampdp, we generated samples per conditional distribution. We manually designed a continuous trajectory in actuator space and obtained the corresponding trajectory in work space by applying the forward mapping; we then added small normal noise to all values to obtain the sequence.

The practically interesting problem (the inverse mask) is to recover the joint angles from the end-effector positions so that impossible movements of the arm (discontinuities in angle space) are avoided. The reconstruction results, given in table 2C, show behaviour similar to that observed in the toy experiments: dpmode beats the other methods (in particular the mean and the mlp, both of which perform very similarly) and its performance is often close to the bound of cmode, even for high amounts of missing data. The largest error for dpmode occurs for the inverse mapping, which confirms that regression problems, when multivalued, are harder than general missing data patterns. All methods perform equally well in the univalued, forward mapping. We observed that the conditional distributions had more modes than the two inverse branches, which means that the density model is not smooth, although in this case this does not seem to affect the dpmode.
The large error of gmode is mainly due to choosing the wrong branch in parts of the trajectory and making discontinuous jumps when rejoining the correct one. Note that the error measure reported by e.g. Bishop (1994), who used a gmode-type method, is low when the reconstructed value maps well to the given one (i.e., is a valid inverse), but it disregards discontinuities of the trajectory, which show up as high values of the continuity constraint (i.e., a valid inverse may not be the right inverse to use at a given point).
These results demonstrate that, in this problem, the continuity constraint alone can allow good reconstruction with dpmode. However, since the forward mapping is known, a forward-mapping constraint could also be incorporated to further improve the reconstruction quality. Further advantages of our method in the inverse kinematics problem include the fact that it can encode multiple (unlimited) learned trajectories; that the trajectory length (number of states) is unlimited; and that the trajectory can have any topology. This makes the method useful for localisation or path planning.
8 Discussion
8.1 Distributions over trajectories
Our algorithm operates in two decoupled stages: first generate a set of local candidates, then solve a combinatorial optimisation problem to find the global reconstruction. One way to merge both concepts is to define a grand density over the whole sequence in a constructive way: first generate a point from the joint density of sec. 3.5; then generate the next point subject to being in the data manifold but near the previous one. The latter term is simply a Gaussian centred at the previous point, whose covariance would be related to the speed at which the curve is traversed, and represents the continuity constraint. And so on for the remaining points. In general, we generate a sequence from the (normalised) product of both terms. This results in a random walk (the continuity term) constrained to the data manifold (the density term). The distance of the continuity constraint (i.e., the covariance matrix of the Gaussian) determines the “sampling period” of the sequence. We may now attack the problem of global reconstruction directly, by choosing a representative point of the grand density (in the sense of section 3), which will give us a likely reconstruction of the trajectory.
If we maximise the logarithm of the grand density over the missing variables , we find an objective function
(5) 
which has the standard form of a fitness term (on the left) and a constraint term (on the right) with a weight, as e.g. in the elastic net (Durbin et al., 1989), or its generalisation to more sophisticated constraint priors (Carreira-Perpiñán and Goodhill, 2003). Note that we do not maximise over the parameters of the model, but over the missing variables; in particular, this means there are no singularities, because the objective function is bounded above. We can obtain the method we have proposed in this paper as a limit case of eq. (5): if the density term is taken as a sum of deltas centred at the modes of the conditional distributions, then the search space is restricted to those modes and the optimisation operates only on the right-side term (our continuity constraint).
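Based on the surrounding description (a log-density fitness term plus a continuity penalty with a weight), equation (5) plausibly has the form below; here $\lambda$ and $d(\cdot,\cdot)$ are our notation for the trade-off weight and the continuity distance, so this is a hedged reconstruction rather than the paper's exact formula:

```latex
E\bigl(\mathbf{x}^{(m)}_1,\dots,\mathbf{x}^{(m)}_N\bigr)
  \;=\; \sum_{t=1}^{N} \log p(\mathbf{x}_t)
  \;-\; \lambda \sum_{t=2}^{N} d\bigl(\mathbf{x}_t,\mathbf{x}_{t-1}\bigr)
```

Maximising the first sum favours points on the data manifold, while the second sum penalises discontinuous jumps between consecutive points.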
An objective function over trajectories opens the door for multiple global reconstruction as defined in section 2. Further, we might think that this grand distribution could be unimodal, since there should be less ambiguity in the global reconstruction than in the local ones. If so, we could take the mean and be free from local-maxima problems. But the truth is that the grand density may have many local maxima, in which each point tends to a mode of its conditional distribution; this will certainly be the case if such conditional distributions are sharply peaked or the sequence is long, and computing the mean would be difficult. Perhaps an optimisation based on annealing would be helpful here.
Another characteristic of this method is that the candidate pointwise reconstructions (the modes) are weighted by their respective density values, while in our method all modes are equally weighted. This may bias the reconstruction towards highly likely pointwise reconstructions at the expense of the continuity constraint. Finally, we are also left with the choice of the trade-off parameter. An implementation and evaluation of this method is left for future research.
8.2 Computational complexity
Reconstructing a data set of vectors requires two separate computations: (1) implicitly constructing the layered graph of fig. 4, i.e., computing the multiple pointwise reconstructions of each point; and (2) finding the shortest path on the graph (we do not consider the cost of estimating the joint density model, since this is done offline with a training set and the resulting density can be reused for reconstructing many different data sets). With dynamic programming, the complexity is given by the total number of links in the graph, which for an average case where every layer has the same number of nodes is quadratic in that number per sequence point. This is very fast and is always dominated by the mode finding. Analysing the complexity of the latter is difficult (see Carreira-Perpiñán, 2001, 2000a for details). In general, and as confirmed experimentally, a crucial factor is the amount of missing data, since this directly affects the number of modes found at each point of the sequence. It is possible to accelerate the mode search by discarding low-probability mixture components in the conditional distribution (see Carreira-Perpiñán, 2000a, 2001; we obtained substantial speedups with only a small increase in reconstruction error); or by reducing the number of mixture components, at the potential cost of a less accurate density model.
8.3 Choice of density model: robustness and smoothness
The modes are a key aspect of our approach. However, the mode is not a robust statistic of a distribution, since small variations in the distribution shape can have large effects on the location and number of modes. This is related to the smoothness of the density model mentioned in section 6: with finite mixtures of localised functions, spurious modes can appear as ripple superimposed on a smoothly varying function. These spurious modes can occur in regions where the mixture components are sparsely distributed and have little interaction; and their probability value can be as high as that of true modes, which rules out the use of a rejection threshold to filter the spurious ones out. Such spurious modes can lead the dynamic programming search to a wrong trajectory with a large reconstruction error, although we observed this only in regression problems, not in general missing data patterns. For regression problems, especially mapping inversion, applying a forward mapping constraint (where the forward mapping may be known exactly or derived by a mapping approximator) should prevent spurious modes from forming part of the reconstructed sequence, because spurious modes, by definition, will give a high value for the constraint.
To prevent spurious modes from entering the constraint minimisation at all, a possible solution is to smooth the density model, either globally (at estimation time) or locally (at mode-finding time, i.e., smoothing the conditional distribution). If the density is a Gaussian mixture with spherical components, then smoothing it by convolution with a Gaussian kernel is equivalent simply to adding the squared kernel width to each component variance. Global smoothing can also be achieved, at a higher computational cost, by increasing the number of components (in GTM the mixture centroids are not trainable parameters and so we incur no loss of generalisation). Alternatively, one can regularise the density by adding a term to the log-likelihood that penalises low variance parameters. In all these cases, selecting how much to smooth so that important modes are not removed is crucial.
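The convolution rule above (smoothing a spherical Gaussian mixture with a Gaussian kernel of width h adds h² to every component variance) can be checked numerically. The 1D example and grid-based mode count below are our own illustration, including the trade-off it warns about: enough smoothing merges genuine modes, not just spurious ones.

```python
import numpy as np

def mixture_pdf(x, centres, var, weights):
    """Density of a 1D Gaussian mixture with equal variances, at points x."""
    x = np.asarray(x)[:, None]
    comp = np.exp(-(x - centres) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return comp @ weights

def count_modes(x, f):
    """Count interior local maxima of f sampled on the grid x."""
    return int(np.sum((f[1:-1] >= f[:-2]) & (f[1:-1] > f[2:])))

# Two equal components with std 0.5, separated by 1.5 (> 2*std): bimodal.
centres = np.array([0.0, 1.5])
weights = np.array([0.5, 0.5])
x = np.linspace(-5.0, 6.5, 2001)
modes_raw = count_modes(x, mixture_pdf(x, centres, 0.25, weights))
# Smoothing with kernel width h = 1 adds h^2 = 1 to each variance,
# so the separation falls below 2*std and the two modes merge.
modes_smooth = count_modes(x, mixture_pdf(x, centres, 0.25 + 1.0, weights))
```

With these settings the raw mixture is bimodal while the smoothed one is unimodal, showing why the amount of smoothing must be chosen so that only the ripple, and not the important modes, is removed.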
A related problem is that of obtaining a Gaussian mixture that represents well the data with a small number of components, for which many methods exist (Figueiredo and Jain, 2002). However, it is likely that, in their efforts to reduce the number of components, these methods will result in nonsmooth models.
Note that the use of Gaussian mixtures can be considered optimal with respect to avoiding spurious modes, in that convolving an arbitrary function with a Gaussian kernel never (in 1D) or almost never (in higher dimension) results in the creation of new modes, which is not true of any other kernel, as is known in scale-space theory (Lindeberg, 1994; Carreira-Perpiñán and Williams, 2003b, a). This, the easy computation of conditional distributions and the availability of algorithms for finding all modes (Carreira-Perpiñán, 2000a) make Gaussian mixtures very attractive in the present framework, in spite of the fact that the Gaussian distribution (unlike long-tailed distributions) is not robust to outliers in the training set (Huber, 1981).

8.4 Dynamical, sequential and time series modelling
Our approach does not attempt to model the temporal evolution of the system. The joint density model is estimated statically. The temporal aspect of the data appears indirectly and a posteriori, through the application of the continuity constraints to select a trajectory. In this respect, our approach differs from that of dynamical systems or from models based on Markovian assumptions, such as Kalman filters (Harvey, 1991) or hidden Markov models (Rabiner and Juang, 1993).

The fact that the duration or speed of the trajectory plays no role in our algorithm makes it invariant to time warping. That is, the dynamic programming algorithm depends only on the values of the observed variables but not on the experimental conditions, and so is independent of the speed at which the trajectory is traversed. It is also independent of direction, since it can be run forwards (from the first point to the last) or backwards with the same result. Therefore, our reconstruction algorithm does not depend on the particular parametrisation of the trajectory, but just on its geometric shape. This is important in the case of speech: it is well known that hidden Markov models are very sensitive to time warpings, i.e., fast or slow speech styles, where the trajectory in speech feature space is the same but is traversed quickly or slowly, respectively. Our reconstruction method should then be robust to time warpings.
Time series prediction can be seen as a reconstruction problem where the past values are present and the future ones are missing. However, our method is not useful here: the trivial application of a continuity constraint would lead to repeating the last observed value, i.e., a constant sequence.
8.5 Bump-finding rather than mode-finding
Besides not being a robust statistic, using a mode as a reconstructed point is not appropriate in general because locally the optimal value (in the squared-error sense) is the mean. That is, if a function is multivalued it will have several branches; in a neighbourhood around a branch the function becomes univalued, and so the mean over that branch would be optimal. This suggests that, when the conditional distribution is multimodal, we should look for bumps associated with the correct values and take the means of these bumps as reconstructed values instead of the modes, where by bumps we mean fairly concentrated regions where the density is comparatively high. A possible approach to select the bumps would be to decompose the density as a mixture of bump densities, each localised but possibly asymmetrical. If the density is modelled by a mixture of Gaussians (as is our case) then the decomposition could be attained by regrouping Gaussian components with a clustering algorithm. This approach would replace the mode-finding procedure with a (probably much faster) grouping and averaging procedure.
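A sketch of this bump idea, assuming a simple single-linkage grouping of component centres; the clustering criterion and the distance threshold are our choices, not prescribed by the text:

```python
import numpy as np

def bump_means(centres, weights, thresh):
    """Regroup the components of a Gaussian mixture into 'bumps' by
    single-linkage clustering of their centres (components closer than
    thresh end up in the same bump), then return the weighted mean of
    each bump as a candidate reconstructed value."""
    centres = np.asarray(centres, float)
    weights = np.asarray(weights, float)
    n = len(centres)
    labels = list(range(n))               # union-find: one cluster per component

    def find(i):
        while labels[i] != i:
            i = labels[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(centres[i] - centres[j]) < thresh:
                labels[find(i)] = find(j)    # merge the two clusters
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    means = []
    for idx in groups.values():
        w = weights[idx]
        means.append((w[:, None] * centres[idx]).sum(axis=0) / w.sum())
    return means
```

Four components forming two tight groups yield two bump means, each at the weighted centre of its group, in place of whatever number of modes the raw mixture happens to have.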
8.6 Reconstruction as a preprocessing step
If the missing data reconstruction is a preprocessing step for some other procedure that ordinarily operates on the complete data, then the whole procedure may be suboptimal but faster than marginalising over the missing variables. For example, in a classification task such as speech recognition, one wants to compute $p(c|\mathbf{x})$, where $c$ is a phoneme class and $\mathbf{x}$ an acoustic feature vector (Rabiner and Juang, 1993). Using a hidden Markov model, such probabilities can be computed for every point in an utterance and an optimal transcription obtained with the Viterbi algorithm. However, if some features $\mathbf{x}_M$ are deemed to be missing (due to the presence of noise, for example), then the correct thing to do is to use $p(c|\mathbf{x}_P)$, i.e., to marginalise over the missing variables the joint distribution of what is unknown given what is known:
$$p(c|\mathbf{x}_P) = \int p(c, \mathbf{x}_M|\mathbf{x}_P)\, d\mathbf{x}_M = \int p(c|\mathbf{x}_M, \mathbf{x}_P)\, p(\mathbf{x}_M|\mathbf{x}_P)\, d\mathbf{x}_M.$$
If we reconstruct the data as $\hat{\mathbf{x}}_M$ and then use $p(c|\hat{\mathbf{x}}_M, \mathbf{x}_P)$ instead, we are implicitly wasting all the information contained in the distribution $p(\mathbf{x}_M|\mathbf{x}_P)$, effectively replacing it with a delta function at $\hat{\mathbf{x}}_M$. Cooke et al. (2001) have demonstrated empirically the superiority of the marginalisation approach for classification in the context of recognition of occluded speech.
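A toy sketch of the difference between the two approaches, using a made-up two-class model with diagonal-Gaussian class-conditional densities (not the hidden Markov model of the text; all names and numbers are illustrative):

```python
import numpy as np

def gauss(x, mu, var):
    """Product of independent 1-D Gaussian densities (diagonal covariance)."""
    x, mu, var = map(np.asarray, (x, mu, var))
    return np.prod(np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var))

# Two classes that differ only in feature 1.
mus = {"a": [0.0, 0.0], "b": [0.0, 3.0]}
var = [1.0, 1.0]
prior = {"a": 0.5, "b": 0.5}

x_present = 0.2   # feature 0 observed; feature 1 missing

# Marginalisation: with a diagonal covariance, integrating out the missing
# feature simply drops it from the likelihood.
post_marg = {c: prior[c] * gauss([x_present], [mus[c][0]], [var[0]]) for c in mus}

# Plug-in: impute the missing feature (here, naively, with class "a"'s mean)
# and evaluate the full likelihood; the imputed value biases the posterior.
x_hat = 0.0
post_plug = {c: prior[c] * gauss([x_present, x_hat], mus[c], var) for c in mus}

norm = lambda d: {c: v / sum(d.values()) for c, v in d.items()}
post_marg, post_plug = norm(post_marg), norm(post_plug)
```

Here the marginalised posterior is correctly uninformative (the classes are identical in the observed feature), while the plug-in posterior is confidently, and spuriously, in favour of the class whose mean was used for imputation.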
However, strictly what we have shown is that reconstructing and then classifying is worse only when the reconstruction is done on a point-by-point basis, i.e., considering the speech frames independent from each other, which they are not. Thus, there may indeed be a benefit in using a global, utterance-wide constraint to reconstruct the whole speech segment (ideally recovering the original speech) and then classifying it; in other words, reconstructing $\mathbf{x}_{M_n}$ not just from $\mathbf{x}_{P_n}$, but from the whole present sequence $\{\mathbf{x}_{P_1}, \dots, \mathbf{x}_{P_N}\}$, as proposed in our method.
8.7 Reconstruction via dimensionality reduction
Continuous latent variable models are a convenient probabilistic formulation of the problem of dimensionality reduction (see Carreira-Perpiñán, 2001 for a review). Here, the density in the space of the observed variables $\mathbf{x}$ is represented as $p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}$, where $\mathbf{z}$ are the latent variables (with prior distribution $p(\mathbf{z})$) and $p(\mathbf{x}|\mathbf{z})$ is a noise model on a low-dimensional manifold defined by a mapping $\mathbf{f}$ from latent to data space, such as $\mathbf{x} = \mathbf{f}(\mathbf{z})$. Dimensionality reduction of an observed point $\mathbf{x}$ is achieved by taking a representative point (typically the mean) of the posterior distribution in latent space $p(\mathbf{z}|\mathbf{x})$.
If a latent variable model is used as density model in our method, one would expect that there is a sequence in latent space that corresponds to the sequence in observed space. Thus one could think of performing missing data reconstruction via dimensionality reduction. That is, if at some point in the sequence the variables $\mathbf{x}_M$ are missing and $\mathbf{x}_P$ are present, we first reduce dimensionality by picking a representative point $\hat{\mathbf{z}}$ of $p(\mathbf{z}|\mathbf{x}_P)$ and then map that point onto observed space using the mapping $\mathbf{f}$. This will not work well except when $p(\mathbf{z}|\mathbf{x}_P)$ is sharply unimodal, that is, when $\mathbf{x}_P$ strongly constrains $\mathbf{z}$ to lie in a small region. But usually $p(\mathbf{z}|\mathbf{x}_P)$ will be multimodal, and therefore this merely translates the problem of finding a multivalued relationship into that of a multivalued dimensionality reduction! Besides, the dimensionality reduction approach will produce a value not just for $\mathbf{x}_M$ but also for $\mathbf{x}_P$, which may differ from the actually observed value of $\mathbf{x}_P$.
In fact,
$$p(\mathbf{x}_M|\mathbf{x}_P) = \int p(\mathbf{x}_M|\mathbf{z}, \mathbf{x}_P)\, p(\mathbf{z}|\mathbf{x}_P)\, d\mathbf{z} = \int p(\mathbf{x}_M|\mathbf{z})\, p(\mathbf{z}|\mathbf{x}_P)\, d\mathbf{z},$$
where in the last equality we have used the fact that the observed variables are conditionally independent given the latent ones. Therefore, the procedure is equivalent to collapsing $p(\mathbf{z}|\mathbf{x}_P)$ onto a delta function centred on $\hat{\mathbf{z}}$, throwing away all the probability mass not at $\hat{\mathbf{z}}$. For this same reason, in general it is not convenient to apply the continuity constraints to the latent variables rather than to the observed ones.
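To make concrete the point that the dimensionality-reduction route also produces a value for the present variables that need not match what was observed, here is a minimal linear-Gaussian (factor-analysis-style) sketch; the model, its dimensions and all variable names are hypothetical illustrations, not the paper's setup:

```python
import numpy as np

# Linear-Gaussian latent model: x = W z + noise, z ~ N(0, I).
W = np.array([[1.0], [1.0]])       # 2-D data, 1-D latent space
noise_var = 0.1

# Feature 0 is present, feature 1 is missing.
x_p = np.array([2.0])
W_p = W[:1]                        # loading rows for the present variables

# Posterior p(z | x_P) is Gaussian with covariance
# S = (I + W_p' W_p / noise_var)^{-1} and mean S W_p' x_P / noise_var.
S = np.linalg.inv(np.eye(1) + W_p.T @ W_p / noise_var)
z_mean = (S @ W_p.T @ x_p) / noise_var

# Map the representative latent point back to data space: this yields a
# value for the present variable too, and it differs from the observed x_P.
x_recon = W @ z_mean
```

Even in this unimodal case, where picking the posterior mean is reasonable, the reconstructed value of the present variable is shrunk towards the prior and no longer equals the observation.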
However, if we do want to reduce the dimensionality of a sequence with missing data, we can cast this as a reconstruction problem where the missing variables are the latent variables $\mathbf{z}$ and the present variables are the present observed variables $\mathbf{x}_P$. We can then apply the ideas of this paper to the predictive distribution $p(\mathbf{z}|\mathbf{x}_P)$.
8.8 Underconstrained functions
When $\mathbf{x}_M$ is underconstrained given $\mathbf{x}_P$, it can take any value in a continuous manifold of dimension one or more, rather than taking values in a countable set (of dimension zero). This typically happens when there are too few present variables (at some point in the sequence). Geometrically, if the possible joint values of $\mathbf{x}_P$ and $\mathbf{x}_M$ span a manifold $\mathcal{M}$ in data space, then the set of possible values for $\mathbf{x}_M$ given a fixed value of $\mathbf{x}_P$ is the intersection of $\mathcal{M}$ with the coordinate hyperplane defined by that fixed value. The geometric approach has two disadvantages: it requires solving nonlinear systems of equations; and it disregards the noise, i.e., the stochastic variability of the data around and inside that manifold.

In the probabilistic framework the information about the data manifold is embedded in the joint probability distribution of the observed variables, noise is taken care of, and the only mathematical operations needed are conditioning (therefore marginalising) and finding all modes, which are computationally efficient for Gaussian mixtures. Using a Gaussian mixture as density model provides a partial but working solution to the underconstrained case, because the number of modes is finite if the Gaussian mixture is finite. Thus, the modes act as a finite sample of such a manifold, and a quantisation error appears. This error can be reduced by using more mixture components, but at a cost that grows exponentially with the intrinsic dimensionality of the data.

In the extreme case where all variables are missing at a point, the candidate modes are the modes of the joint density function, which can be computed once and stored for subsequent points where all variables are missing, to save computer time. If the density is a Gaussian mixture, another possibility with nil computational cost is simply to use all the component centroids, since in principle they should all lie in high-density areas of the data space. This will also produce a finer discretisation of the data manifold, since there should be fewer modes than centroids.
8.9 Unbounded horizon problems
There are practical reconstruction problems where the data set to be reconstructed is infinite or long enough that the user periodically demands partial reconstruction; for example, in continuous speech with missing data, the user should receive reconstructed speech in real time, which requires that past speech be frozen once reconstructed, passed to the user and discarded for reconstruction of future speech. In operations research problems such as inventory control this is called an unbounded horizon problem.
The greedy algorithm requires no modification for unbounded horizon problems, but we do not recommend it for the reasons of section 4.5. The dynamic programming algorithm requires a finite sequence. A simple approach is to split the data stream into chunks (perhaps coinciding with user requests), reconstruct them separately and concatenate the reconstructed chunks. This risks discontinuities at the splitting points and a suboptimal reconstruction of the whole stream, though for long enough chunks this may not be a problem.
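The chunking scheme can be sketched in a few lines; `reconstruct` stands for any finite-sequence reconstruction routine (such as the dynamic programming algorithm), and the fixed chunk length is an arbitrary illustrative choice:

```python
def reconstruct_stream(stream, reconstruct, chunk_len=100):
    """Unbounded-horizon sketch: split the incoming sequence into fixed-size
    chunks, reconstruct each one independently, and concatenate the results.
    Discontinuities may appear at the seams between chunks."""
    out = []
    for start in range(0, len(stream), chunk_len):
        out.extend(reconstruct(stream[start:start + chunk_len]))
    return out

# Toy check with an identity "reconstruction" routine.
restored = reconstruct_stream(list(range(10)), lambda chunk: chunk, chunk_len=4)
```

Overlapping the chunks slightly and discarding the overlap would be one simple way to soften the seams, at extra computational cost.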
Note that points where there is a unique pointwise reconstruction (a single candidate) effectively split the layered graph into separate subgraphs (as at the single-candidate node in fig. 4). That is, whenever such a point is reached, the reconstructed trajectory for earlier points can be frozen (to its optimal value) and the dynamic programming algorithm restarted from scratch, saving computer time and storage. Depending on the particular application and on the amount of missing data, such zero-uncertainty points may be common; in speech, one likely example is silent frames, which are easily detected by thresholding the frame energy.
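The splitting at zero-uncertainty points might be sketched as follows, representing each frame by its list of candidate reconstructions (the representation and function name are ours):

```python
def split_at_unique(candidates_per_frame):
    """Frames with exactly one candidate reconstruction split the layered
    graph into independent subgraphs: dynamic programming can be run on each
    segment separately, freezing the trajectory at the unique frames."""
    segments, current = [], []
    for t, cands in enumerate(candidates_per_frame):
        current.append(t)
        if len(cands) == 1:          # zero-uncertainty frame: freeze here
            segments.append(current)
            current = [t]            # next segment restarts at this frame
    if len(current) > 1:             # trailing segment, if any
        segments.append(current)
    return segments

# Frames 1 and 4 have a single candidate, so the graph splits there.
segs = split_at_unique([[1, 2], [5], [3, 4], [6, 7], [8]])
```

Each returned segment shares its boundary frame with the next, so the frozen value at the boundary seeds the following dynamic programming run.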
8.10 Extensions
Our approach has considered 1-D constraints (i.e., sequential data). It would be interesting to consider multidimensional constraints, e.g. reflecting spatial structure rather than (or as well as) temporal structure. An application where the experimental conditions are 2-D is the recovery of wind fields from satellite scatterometer measurements (Nabney et al., 2000). The strategy of section 4.1 of defining constraints based on the norm of finite-difference approximations of derivatives can be readily generalised to multidimensional experimental conditions (see Carreira-Perpiñán and Goodhill, 2003 for a related analysis in the context of elastic nets). An important problem with constraints of dimension higher than one is the curse of dimensionality: the complexity of the multidimensional dynamic programming algorithm grows exponentially with the constraint dimension (Durbin et al., 1998, pp. 141–143). Thus, global minimisation will not be feasible except for very small dimensions. Further research is necessary to develop efficient heuristic approximations to multidimensional dynamic programming.
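A first-difference continuity penalty of the kind referred to above might be sketched as follows (first differences only; the function name is ours):

```python
import numpy as np

def continuity_penalty(traj):
    """Sum over t of the squared norm of the first difference x_t - x_{t-1},
    a discretised version of the integral of the squared derivative along
    the sequence. On a multidimensional grid of experimental conditions the
    same idea would sum squared differences along each grid dimension."""
    traj = np.asarray(traj, dtype=float)
    return float(np.sum(np.diff(traj, axis=0) ** 2))

p = continuity_penalty([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
```

Higher-order differences (penalising curvature rather than length) fit the same pattern, with `np.diff(traj, n=2, axis=0)`.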
Our approach uses a continuity constraint. However, isolated discontinuities may occur in sequences that are otherwise continuous (as a function of the experimental conditions). The discontinuities can be intrinsic to the nature of the problem or caused by undersampling. They pose challenging modelling difficulties, since they can confuse the dynamic programming search and cause a wrong global reconstruction. A possible approach is to use a robust local constraint, where the penalty saturates past a certain value of the reconstruction error.
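One possible saturating local constraint is a truncated quadratic; the cap value below is an arbitrary illustration, not a recommendation from the text:

```python
def robust_penalty(d, cap=1.0):
    """Robust local constraint: quadratic in the local reconstruction error d
    for small errors, saturating at `cap` beyond sqrt(cap), so that an
    isolated discontinuity contributes only a bounded amount to the global
    objective instead of dominating the dynamic programming search."""
    return min(d * d, cap)

small, big = robust_penalty(0.5), robust_penalty(10.0)
```

With this penalty, a genuine jump in the sequence costs no more than `cap`, so the search is free to accept it rather than distort the surrounding reconstruction.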
9 Related work
The key aspects of our approach are the use of a joint density model (learnt in an unsupervised way); the use of the modes of the conditional distribution as multiple pointwise candidate reconstructions; the mode search in Gaussian mixtures; the definition of a geometric trajectory measure derived from continuity constraints, and its minimisation by dynamic programming. Some of these ideas have been applied earlier in the literature in different contexts. However, we are not aware of any work that attempts to solve the missing data reconstruction problem in its full generality.
9.1 Statistical approaches to missing data and imputation methods
We have dealt with the problem "given a data set with missing data, reconstruct it" and we have assumed that a model for the data was available (perhaps obtained from a complete training set). The problem "given a data set with missing data, estimate parameters (and standard errors, p-values, tests, etc.) of a model of the data, or more generally, make inferences about the population from which the data come" has been the main concern of the statistical literature on missing data (Little and Rubin, 1987; Little, 1992; Schafer, 1997). Such inferences must be done incorporating the missing data uncertainty; otherwise one will obtain standard errors that are too small. The pattern of missing data, given by the indicator matrix of section 2, is considered a random variable. Calling $\mathbf{X}_P$ the present part of the data and $\mathbf{X}_M$ the missing part, the present data are $\{\mathbf{X}_P, \mathbf{M}\}$ and the complete data $\{\mathbf{X}, \mathbf{M}\}$. If a joint distribution of $\mathbf{X}$ and $\mathbf{M}$ with parameters