Robust Prediction when Features are Missing

12/16/2019 ∙ by Xiuming Liu, et al. ∙ 0

Predictors are learned using past training data containing features which may be unavailable at the time of prediction. We develop a prediction approach that is robust against unobserved outliers of the missing features, based on the optimality properties of a predictor which has access to these features. The robustness properties of the approach are demonstrated on real and synthetic data.




I Introduction

A common task in statistical machine learning and signal processing is to predict outcomes based on features and , using past training data drawn from an unknown distribution.

In certain problems, however, not all features in the training data are available at the time of prediction. For instance, in medical diagnosis, certain features are more expensive or time-consuming to obtain than others, and are therefore not available at an early stage of assessment. Other features are only observable after the outcome has occurred. We let denote features missing at the time of prediction and consider the task of predicting given only the observable features .

A direct approach would discard all past training data about and predict only on the basis of the association between and . By contrast, missing data in statistics is commonly tackled by means of imputation [8, 1]. An indirect approach is then to predict using both and an imputed . However, as we show in Section III, this turns out to be equivalent to the direct approach. For both approaches, learning a linearly parameterized predictor that minimizes the mean squared error (MSE) is shown to perform poorly in events where the missing features occur in the tails of the marginal distribution . Robust statistics has typically focused on problems with contaminated training data [5] or heavy-tailed noise distributions [7, 2]. For the latter, Student-t distributions are often adopted in regression models in order to achieve robust estimation of model parameters, such that outlying training data samples are downweighted.

Our concern in this paper is robust prediction in the event of outlying missing features. Specifically, we achieve robustness using a weighted combination of optimistic and conservative predictors, which are derived in Section III. The approach of switching between modes during extreme events can be found in econometrics [4] and signal processing [9, 3], but has not been considered for prediction with missing features. We demonstrate the robustness properties of the proposed approach on both synthetic and real data sets.

Notation: We define a pseudonorm , where , and the sample mean of as .

Fig. 1: Illustration of MSE conditioned on a missing scalar feature along with . Minimizing the overall MSE () may lead to high outlier occurring in the tails of . By contrast, a suboptimal predictor () can mitigate the outlier events. The MSE is lower bounded by an oracle predictor which observes .

II Problem formulation

We consider scenarios in which

  • and are correlated,

  • the dimension of is greater than that of ,

and study the class of linearly parameterized predictors , where . Without loss of generality we consider to be centered and subsequent results are, moreover, trivially extended to arbitrary features by replacing with a function .

The mean squared-error of a predictor is


where the expectation is with respect to . We consider how the missing feature affects the prediction performance. The tails of the distribution of are contained in the region


such that (see appendix), which for small corresponds to the probability of an outlier event. We can now decompose into an outlier and an inlier , respectively. As Figure 1 illustrates, prediction performance can degrade significantly for outlier events.
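As a rough numerical illustration of the tail region and the outlier/inlier split, the sketch below assumes that a missing feature vector (called `z` here) counts as an outlier when its squared Mahalanobis-type norm is at least `d/delta`, with `d` the dimension of `z`. This threshold form, the variable names, the heavy-tailed toy data and the synthetic prediction errors are all illustrative assumptions and are not taken from the text. Under this assumption, Markov's inequality bounds the fraction of outliers by `delta`, and the overall MSE is the probability-weighted sum of the outlier and inlier MSEs.

```python
import numpy as np

def outlier_mask(Z, delta):
    """Flag rows of Z whose squared Mahalanobis-type norm is at least d/delta.

    Assumption: the tail region is {z : z^T S^{-1} z >= d/delta}, where S is the
    empirical covariance of the centered z and d is its dimension.
    """
    Zc = Z - Z.mean(axis=0)
    n, d = Zc.shape
    S = Zc.T @ Zc / n
    # Row-wise quadratic forms z_i^T S^{-1} z_i
    m2 = np.einsum("ij,ij->i", Zc @ np.linalg.inv(S), Zc)
    return m2 >= d / delta

def split_mse(errors, mask):
    """Return (outlier MSE, inlier MSE, overall MSE) of the prediction errors."""
    sq = errors ** 2
    return sq[mask].mean(), sq[~mask].mean(), sq.mean()

rng = np.random.default_rng(0)
Z = rng.standard_t(df=3, size=(5000, 2))   # heavy-tailed toy features
delta = 0.1
mask = outlier_mask(Z, delta)
frac = mask.mean()                         # empirical outlier probability

# Hypothetical prediction errors whose spread grows with |z|
errors = rng.normal(size=5000) * (1.0 + np.abs(Z[:, 0]))
mse_out, mse_in, mse_all = split_mse(errors, mask)

print(f"outlier fraction {frac:.4f} (bound {delta})")
print(f"outlier MSE {mse_out:.2f}, inlier MSE {mse_in:.2f}, overall {mse_all:.2f}")
```

Because the covariance is estimated from the same samples, the mean of the quadratic forms equals `d`, so the bound on the outlier fraction holds in-sample as well.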

Using training samples, our goal is to formulate a robust predictor that will reduce the outlier without incurring a significant increase of the inlier .

III Predictors

If the feature were known, the optimal linearly parameterized predictor is given by




Its prediction errors are then uncorrelated with the features, that is,


which renders the predictor robust against outlier events for and , respectively. In the case of missing features , we begin by considering predictors which satisfy either one of the orthogonality properties (5).

III-A Optimistic predictor

The predictors that satisfy the first equality in (5) are given by all parameter vectors in


This set, however, consists of a single element which is also the minimizer of (1) [6]. That is,


We denote the resulting predictor as ‘optimistic’ with respect to the missing , because it does not attempt to satisfy the second equality in (5).

Remark: Given that and are correlated, we may consider using the MSE-optimal linear predictor


to impute the missing feature. An indirect approach would then be to use in (3). This, however, is equivalent to the above predictor, as follows from applying the block matrix inversion lemma in (4).
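The equivalence between the direct and the imputation-based approach can be checked numerically. The sketch below uses assumed names (`X` for the observed features, `Z` for the missing ones, `y` for the outcome) and toy data. It fits all models by ordinary least squares on the same sample, in which case the two parameter vectors coincide up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dx, dz = 500, 4, 2
X = rng.normal(size=(n, dx))
Z = X[:, :dz] + 0.5 * rng.normal(size=(n, dz))   # z is correlated with x
y = X @ rng.normal(size=dx) + Z @ rng.normal(size=dz) + rng.normal(size=n)

# Direct approach: regress y on x only.
alpha_direct, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictor with access to both x and z (fitted on the training data).
coef, *_ = np.linalg.lstsq(np.hstack([X, Z]), y, rcond=None)
a, b = coef[:dx], coef[dx:]

# Linear imputation of z from x: z_hat = C^T x with C of shape (dx, dz).
C, *_ = np.linalg.lstsq(X, Z, rcond=None)

# Indirect approach: plug z_hat into the predictor that uses x and z.
alpha_indirect = a + C @ b

x_new = rng.normal(size=dx)
pred_direct = x_new @ alpha_direct
pred_indirect = x_new @ a + (x_new @ C) @ b

print(np.max(np.abs(alpha_direct - alpha_indirect)))
```

The identity follows from the normal equations of the joint fit, so imputing the missing feature linearly adds no information beyond what the direct predictor already uses.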

III-B Conservative predictor

The predictors that satisfy the second equality in (5) are given by all parameter vectors in


This set is a -dimensional subspace of and therefore we consider the parameter that minimizes (1), viz.


We denote the resulting predictor as ‘conservative’ with respect to the missing , because it satisfies only the second equality in (5).

Remark: Comparing the error of with that of in (3), the excess MSE can be expressed as


where and is a residual term (see appendix). Note that the first term is weighted by the dispersion of . The constraint enforces the first term in (11) to be zero. This leaves degrees of freedom to minimize the second term. By contrast, minimizes the sum of both terms.

III-C Robust predictor

Satisfying only one of the equalities in (5) comes at a cost: the optimistic yields robustness against outlying but not and, conversely, the conservative yields robustness against outlying but not . Since both equalities can only be satisfied by the infeasible predictor (3), we propose a predictor that interpolates between optimistic and conservative modes using the side information that provides about outliers in the missing features . That is, we propose to learn the adaptive parameter vector


such that the predictor becomes robust against outliers in both and .

IV Learning robust predictor

Learning the robust predictor implies finding finite-sample approximations of , and in (12) using training samples .

We begin by defining , which yields the empirical counterpart of (7):


Similarly, for (10) we note that the empirical counterpart of the constraint in (9) is


All vectors that satisfy the constraint can therefore be parameterized as

where is the orthogonal projection matrix onto the null space of . This yields the empirical counterpart of (10),


where is a minimizer of the convex function .
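One way to compute the two empirical parameter vectors is sketched below. It reads the constraint as requiring that the sample cross-covariance between the missing features and the prediction errors be zero, and parameterizes all feasible vectors as a particular solution plus a component in the null space of the constraint matrix. The names `X`, `Z`, `y`, `optimistic` and `conservative`, the use of a pseudoinverse for the particular solution, and the toy data are assumptions made for illustration.

```python
import numpy as np

def optimistic(X, y):
    """Least-squares fit of y on the observed features x."""
    alpha, *_ = np.linalg.lstsq(X, y, rcond=None)
    return alpha

def conservative(X, Z, y):
    """Least-squares fit subject to Z^T (y - X alpha) = 0."""
    n = len(y)
    A = Z.T @ X / n                            # constraint matrix (dz x dx)
    c = Z.T @ y / n
    A_pinv = np.linalg.pinv(A)
    alpha_part = A_pinv @ c                    # particular solution of A alpha = c
    Pi = np.eye(X.shape[1]) - A_pinv @ A       # projector onto the null space of A
    # Minimize ||y - X (alpha_part + Pi theta)||^2 over theta
    theta, *_ = np.linalg.lstsq(X @ Pi, y - X @ alpha_part, rcond=None)
    return alpha_part + Pi @ theta

rng = np.random.default_rng(2)
n, dx, dz = 1000, 3, 1
X = rng.normal(size=(n, dx))
Z = X[:, :dz] + rng.standard_t(df=3, size=(n, dz))   # heavy-tailed missing feature
y = X @ np.array([1.0, -0.5, 0.3]) + 2.0 * Z[:, 0] + rng.normal(size=n)

alpha_opt = optimistic(X, y)
alpha_con = conservative(X, Z, y)
mse_opt = np.mean((y - X @ alpha_opt) ** 2)
mse_con = np.mean((y - X @ alpha_con) ** 2)
print(alpha_opt, alpha_con, mse_opt, mse_con)
```

On the training sample, the optimistic errors are orthogonal to the columns of `X`, the conservative errors are orthogonal to the columns of `Z`, and the conservative fit cannot have a lower training MSE than the optimistic one.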

Next, we consider learning a model of the probability of an outlier event, , conditioned on . Using the definition (2), we predict an outlier event with the scalar

where is the empirical version of (8). The conditional outlier probability is modeled using a standard logistic function,

The model parameters and are learned from the training data by minimizing the standard cross-entropy criterion


This approach takes into account the inherent uncertainty of predicting an outlying from . An example of a fitted model is illustrated in Figure 2.
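A minimal sketch of fitting such a logistic outlier model is given below. It assumes a scalar missing feature, a score equal to the squared standardized linear imputation of that feature, and outlier labels defined by a threshold of `1/delta` on the squared standardized feature. These choices, the plain gradient-descent fit and all names are illustrative assumptions rather than the exact procedure of the paper.

```python
import numpy as np

def sigmoid(t):
    # Numerically stable logistic function
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def fit_outlier_logistic(s, labels, steps=2000, lr=1.0):
    """Fit Pr(outlier | s) = sigmoid(w0 + w1 * s) by gradient descent on the
    mean cross-entropy. The score is standardized internally for stable steps."""
    mu, sd = s.mean(), s.std()
    u = (s - mu) / sd
    v0 = v1 = 0.0
    for _ in range(steps):
        p = sigmoid(v0 + v1 * u)
        v0 -= lr * np.mean(p - labels)
        v1 -= lr * np.mean((p - labels) * u)
    # Map back to offset and slope on the original score scale
    return v0 - v1 * mu / sd, v1 / sd

def cross_entropy(w0, w1, s, labels):
    logits = w0 + w1 * s
    # -[l log p + (1 - l) log(1 - p)] = log(1 + exp(logit)) - l * logit
    return np.mean(np.logaddexp(0.0, logits) - labels * logits)

rng = np.random.default_rng(3)
n = 4000
t = rng.standard_t(df=3, size=n)                 # shared heavy-tailed latent variable
X = np.column_stack([t + 0.5 * rng.normal(size=n), rng.normal(size=n)])
z = t + 0.5 * rng.normal(size=n)                 # scalar feature, missing at test time

delta = 0.1
zc = z - z.mean()
labels = (zc ** 2 / zc.var() >= 1.0 / delta).astype(float)   # outlier events

C, *_ = np.linalg.lstsq(X, zc, rcond=None)       # linear imputation of z from x
s = (X @ C) ** 2 / zc.var()                      # scalar outlier score

w0, w1 = fit_outlier_logistic(s, labels)
loss = cross_entropy(w0, w1, s, labels)
print(w0, w1, loss)
```

A positive slope means that a large imputed magnitude raises the predicted outlier probability, which is the side information that the adaptive predictor relies on.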

Fig. 2: Outlier and inlier events for in the training data along with fitted logistic model of conditional probability. In this example, the learned offset and slope parameters were and , respectively.

In sum, we learn a robust predictor with an adaptive parameter vector


using samples, as described in Algorithm 1.

Algorithm 1: Learning robust predictor
1: Input: Training data and
2: Compute via (13)
3: Compute via (15)
4: For each , form
5: Learn via (16)
6: Output: in (17)
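The final step, combining the two learned parameter vectors into an adaptive one, can be sketched as follows. The sketch assumes that the conservative vector is weighted by the modeled outlier probability and the optimistic vector by its complement. The function `robust_predict`, the score function and all numerical values are hypothetical placeholders chosen only to show the mechanics.

```python
import numpy as np

def robust_predict(x, alpha_opt, alpha_con, score, w0, w1):
    """Predict with a convex combination of optimistic and conservative parameters.

    Assumption: the weight on the conservative vector is the logistic outlier
    probability evaluated at the scalar score of x.
    """
    p_out = 0.5 * (1.0 + np.tanh(0.5 * (w0 + w1 * score(x))))   # stable sigmoid
    alpha = (1.0 - p_out) * alpha_opt + p_out * alpha_con
    return float(alpha @ x), float(p_out)

# Hypothetical learned quantities
alpha_opt = np.array([1.0, 0.5])
alpha_con = np.array([0.2, 0.9])
w0, w1 = -3.0, 1.0
score = lambda x: x[0] ** 2          # placeholder for the outlier score of x

x_in = np.array([0.1, 1.0])          # small score: mostly optimistic mode
x_out = np.array([3.0, 1.0])         # large score: mostly conservative mode

pred_in, p_in = robust_predict(x_in, alpha_opt, alpha_con, score, w0, w1)
pred_out, p_out = robust_predict(x_out, alpha_opt, alpha_con, score, w0, w1)
print(pred_in, p_in, pred_out, p_out)
```

Since the weights lie between zero and one, each prediction falls between the purely optimistic and the purely conservative prediction for the same input.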

Remark: In the case of high-dimensional features and one may use regularized methods, such as ridge regression, Lasso, or the tuning-free SPICE method [10], to learn , and .

V Experimental results

We evaluate the robustness of the proposed predictor using both synthetic and real data.

V-A Synthetic data

Consider the following data-generating process of , , and :


where are -dimensional t-distributed latent variables with degrees of freedom and are white Gaussian processes of corresponding dimensions.

We evaluate predictors of a new outcome given only , where the vector is learned from sample training data. Specifically, we evaluate the optimistic , conservative and proposed predictors with respect to outliers for missing features with heavy tails (). A comparison of the conditional MSE functions for the learned predictors is given in Fig. 3, where it is seen that the robust predictor smoothly interpolates between two modes.

Fig. 3: Conditional MSE functions of the missing feature for different predictors. Intervals correspond to the percentiles and lines represent the median over 50 Monte Carlo runs. The robust predictor significantly reduces errors in the tails of , while incurring only a small increase in errors for inlier in comparison with the optimistic .
Fig. 4: The box plot of and in the case of and . The boxes show the distribution of outlier and inlier from 50 simulations using training and test samples.

When averaging over , the MSE for each training dataset is illustrated in Fig. 4. We see that the robust predictor drastically reduces the outlier , while leading to a small increase in the inlier . The differences in MSEs, when averaged across all training datasets, are summarized in Table I, which demonstrates the robustness of the proposed approach.

TABLE I: Changes in averaged MSE compared to in [%], when and . Average of 50 simulations using test samples.

V-B Air quality data

Next, we demonstrate the proposed method using real-world air quality data. Nitrogen oxides (NOx) emitted by fossil-fuel vehicles are a major air pollutant in urban environments, with negative impacts on the health of inhabitants.

The aim here is to predict the daily average NOx concentration, denoted , based on the NOx and ozone (O3) measurements from previous days. That is, is of dimension and contains the daily average NOx and O3 levels from past days. In the training data, we also have access to , the O3 concentration at the same time as the outcome . This feature is correlated with and . For prediction of a new outcome, however, is a missing feature.

The dataset contains 10 years of daily average NOx and O3 measurements from 2006-01-01 to 2015-12-31. The data is split into 7 years of training data (2006-2012) and 3 years of test data (2013-2015). Considering outliers when , Table II shows that we are able to reduce the outlier by 10% while incurring a minimal increase of the inlier .

TABLE II: Real-world NOx and ozone data with . Changes in MSE compared to in [%].

VI Conclusion

Based on orthogonality properties of an optimal oracle predictor, we developed a prediction method that is robust against outliers of the missing features. It is formulated as a convex combination of optimistic and conservative predictors, and requires only specifying the intended outlier level against which it must be robust. The ability of the robust predictor to suppress outlier errors, while incurring a minor increase in inlier errors, was demonstrated using both simulated and real-world datasets.

Appendix A Proofs

A-A Probability bound for (2)

The probability bound for an event follows readily from a Chebychev-type inequality:

A-B MSE decomposition (11)

The outcome can always be decomposed as


The random variable is orthogonal to and , and consequently to the residual . Thus we can express the prediction error as

Squaring this expression and taking the expectation yields (11). Similarly, inserting it into the constraint in (9) yields .


  • [1] O. Chapelle, B. Schölkopf, and A. Zien (2006) Semi-supervised learning. MIT Press, Cambridge, Mass. Cited by: §I.
  • [2] J. Christmas and R. Everson (2010) Robust autoregression: Student-t innovations using variational Bayes. IEEE Transactions on Signal Processing 59 (1), pp. 48–57. Cited by: §I.
  • [3] Y. Ephraim and W. J. Roberts (2005) Revisiting autoregressive hidden Markov modeling of speech signals. IEEE Signal Processing Letters 12 (2), pp. 166–169. Cited by: §I.
  • [4] J. D. Hamilton and R. Susmel (1994) Autoregressive conditional heteroskedasticity and changes in regime. Journal of Econometrics 64 (1-2), pp. 307–333. Cited by: §I.
  • [5] P. J. Huber (1964) Robust estimation of a location parameter. The Annals of Mathematical Statistics, pp. 73–101. Cited by: §I.
  • [6] T. Kailath, A. H. Sayed, and B. Hassibi (2000) Linear estimation. Prentice Hall, Upper Saddle River, NJ. ISBN 9780130224644. Cited by: §III-A.
  • [7] K. L. Lange, R. J. Little, and J. M. Taylor (1989) Robust statistical modeling using the t distribution. Journal of the American Statistical Association 84 (408), pp. 881–896. Cited by: §I.
  • [8] R. J. Little and D. B. Rubin (2019) Statistical analysis with missing data. Vol. 793, John Wiley & Sons. Cited by: §I.
  • [9] A. Poritz (1982) Linear predictive hidden Markov models and the speech signal. In ICASSP’82. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 7, pp. 1291–1294. Cited by: §I.
  • [10] D. Zachariah and P. Stoica (2015) Online hyperparameter-free sparse estimation method. IEEE Transactions on Signal Processing 63 (13), pp. 3348–3359. Cited by: §IV.