# Robust Prediction when Features are Missing

Predictors are learned using past training data containing features which may be unavailable at the time of prediction. We develop a prediction approach that is robust against unobserved outliers of the missing features, based on the optimality properties of a predictor which has access to these features. The robustness properties of the approach are demonstrated on real and synthetic data.

## I Introduction

A common task in statistical machine learning and signal processing is to predict outcomes $y$ based on features $x$ and $z$, using past training data drawn from an unknown distribution

$$(x_i, z_i, y_i) \sim p(x, z, y), \quad i = 1, \dots, n.$$

In certain problems, however, not all features in the training data are available at the time of prediction. For instance, in medical diagnosis, certain features are more expensive or time-consuming to obtain than others, and are therefore not available at an early stage of assessment. Other features are only observable after the outcome has occurred. We let $z$ denote features missing at the time of prediction and consider the task of predicting $y$ given only the observable features $x$.

A direct approach would be to discard all past training data about $z$, and predict $y$ only on the basis of the association between $x$ and $y$. By contrast, missing data in statistics is commonly tackled by means of imputation [8, 1]. An indirect approach is then to predict $y$ using both $x$ and an imputed $\hat{z}$. However, as we show in Section III, this turns out to be equivalent to the direct approach. For both approaches, learning a linearly parameterized predictor that minimizes the mean squared error (MSE) is shown to perform poorly in events when the missing features occur in the tails of the marginal distribution $p(z)$. Robust statistics has typically focused on problems with contaminated training data [5] or heavy-tailed noise distributions [7, 2]. For the latter, Student-t distributions are often adopted in regression models in order to achieve robust estimation of model parameters such that outlying training data samples are downweighted.

Our concern in this paper is robust prediction in the event of outlying missing features. Specifically, we achieve robustness using a weighted combination of optimistic and conservative predictors, which are derived in Section III. The approach of switching between modes during extreme events can be found in econometrics [4] and signal processing [9, 3], but is not considered for prediction with missing features. We demonstrate the robustness properties of the proposed approach in both synthetic and real data sets.

Notation: We define a pseudonorm $\|x\|_W = \sqrt{x^\top W x}$, where $W \succeq 0$, and the sample mean of $x$ as $E_n[x] = \frac{1}{n}\sum_{i=1}^{n} x_i$.

## II Problem formulation

We consider scenarios in which

• $x$ and $z$ are correlated,

• the dimension of $x$ is greater than that of $z$,

and study the class of linearly parameterized predictors $\hat{y}(x; w) = w^\top x$, where $w$ is a parameter vector of the same dimension as $x$. Without loss of generality we consider the variables $x$, $z$ and $y$ to be centered. Subsequent results are, moreover, trivially extended to arbitrary features by replacing $x$ with a function of $x$.

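The mean squared error of a predictor is

$$\text{MSE}(w) \triangleq E\big[\,|y - \hat{y}(x; w)|^2\,\big], \tag{1}$$

where the expectation is with respect to the data distribution $p(x, z, y)$. We consider how the missing feature $z$ impacts the prediction performance. The tails of the distribution of $z$ are contained in the region

$$Z_\alpha = \big\{ z : z^\top (E[zz^\top])^{-1} z \geq q/\alpha \big\}, \tag{2}$$

where $q$ denotes the dimension of $z$, such that $\Pr\{z \in Z_\alpha\} \leq \alpha$ (see appendix), which for small $\alpha$ corresponds to the probability of an outlier event. We can now decompose

$$\text{MSE} = \Pr\{z \in Z_\alpha\}\,\text{MSE}_\alpha + \Pr\{z \notin Z_\alpha\}\,\text{MSE}_{1-\alpha},$$

into an outlier $\text{MSE}_\alpha$ and an inlier $\text{MSE}_{1-\alpha}$, respectively. As Figure 1 illustrates, prediction performance can degrade significantly for outlier events.

Using $n$ training samples, our goal is to formulate a robust predictor that will reduce the outlier $\text{MSE}_\alpha$ without incurring a significant increase of the inlier $\text{MSE}_{1-\alpha}$.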
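As a rough illustration of the tail region (2) and the outlier/inlier split of the MSE, the following sketch flags outlying samples and evaluates both error components on toy data. The helper names `outlier_mask` and `decompose_mse`, the toy data-generating choices and the use of sample moments in place of expectations are assumptions made for this example only.

```python
import numpy as np

def outlier_mask(z, alpha):
    """Flag rows of z (shape (n, q)) that fall in the tail region Z_alpha of (2)."""
    n, q = z.shape
    second_moment = z.T @ z / n                      # sample stand-in for E[zz^T]
    # Quadratic form z_i^T (E[zz^T])^+ z_i for every sample i
    quad = np.einsum("ij,jk,ik->i", z, np.linalg.pinv(second_moment), z)
    return quad >= q / alpha

def decompose_mse(y, y_hat, mask):
    """Return (overall MSE, outlier MSE_alpha, inlier MSE_{1-alpha})."""
    sq_err = (y - y_hat) ** 2
    mse_out = sq_err[mask].mean() if mask.any() else 0.0
    mse_in = sq_err[~mask].mean() if (~mask).any() else 0.0
    return sq_err.mean(), mse_out, mse_in

# Hypothetical toy data: heavy-tailed scalar z and a single observed feature x
rng = np.random.default_rng(0)
z = rng.standard_t(df=3, size=(5000, 1))
x = z + rng.normal(size=(5000, 1))
y = z[:, 0] + x[:, 0] + rng.normal(size=5000)

w = np.linalg.lstsq(x, y, rcond=None)[0]             # least-squares fit on x only
mask = outlier_mask(z, alpha=0.05)
mse, mse_out, mse_in = decompose_mse(y, x @ w, mask)
p_out = mask.mean()                                  # empirical Pr{z in Z_alpha}
```

With sample moments, the weighted sum `p_out * mse_out + (1 - p_out) * mse_in` reproduces `mse`, mirroring the decomposition above.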

## III Predictors

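If the feature $z$ were known, the optimal linearly parameterized predictor would be given by

$$\hat{y}^\star(x, z) = \alpha^\top x + \beta^\top z, \tag{3}$$

where

$$\begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} E[xx^\top] & E[xz^\top] \\ E[zx^\top] & E[zz^\top] \end{bmatrix}^{-1} \begin{bmatrix} E[xy] \\ E[zy] \end{bmatrix}. \tag{4}$$

Its prediction errors are then uncorrelated with the features, that is,

$$E[x(y - \hat{y})] = 0 \quad \text{and} \quad E[z(y - \hat{y})] = 0, \tag{5}$$

which renders the predictor robust against outlier events for $x$ and $z$, respectively. In the case of missing features $z$, we begin by considering predictors which satisfy either one of the orthogonality properties (5).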

### III-A Optimistic predictor

The predictors that satisfy the first equality in (5) are given by all parameter vectors in

$$W_o = \big\{ w : E[x(y - w^\top x)] = 0 \big\}. \tag{6}$$

This set, however, consists of a single element which is also the minimizer of (1) [6]. That is,

$$w_o \equiv \arg\min_{w} \text{MSE}(w) = (E[xx^\top])^{-1} E[xy]. \tag{7}$$

We denote the resulting predictor as ‘optimistic’ with respect to the missing $z$, because it does not attempt to satisfy the second equality in (5).

Remark: Given that $x$ and $z$ are correlated, we may consider using the MSE-optimal linear predictor

$$\hat{z} = E[zx^\top](E[xx^\top])^{-1} x \tag{8}$$

to impute the missing feature. An indirect approach would then be to use $\hat{z}$ in (3), but this is equivalent to the above predictor. That is, $\hat{y}^\star(x, \hat{z}) \equiv w_o^\top x$, which follows from the block matrix inversion lemma applied to (4).
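The equivalence stated in this remark can be checked numerically. The sketch below is only an illustration: it replaces the expectations in (4), (7) and (8) by sample moments, and the toy data, dimensions and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, q = 2000, 3, 1                      # hypothetical sample size and dimensions
z = rng.normal(size=(n, q))
x = z @ np.ones((q, d)) + rng.normal(size=(n, d))
y = z[:, 0] + x.sum(axis=1) + rng.normal(size=n)

# Sample moments stand in for the expectations in (4), (7) and (8)
Sxx, Sxz, Szz = x.T @ x / n, x.T @ z / n, z.T @ z / n
sxy, szy = x.T @ y / n, z.T @ y / n

# Oracle coefficients (alpha, beta) from the joint normal equations (4)
joint = np.block([[Sxx, Sxz], [Sxz.T, Szz]])
coef = np.linalg.solve(joint, np.concatenate([sxy, szy]))
a, b = coef[:d], coef[d:]

w_o = np.linalg.solve(Sxx, sxy)           # direct (optimistic) predictor (7)
z_hat = x @ np.linalg.solve(Sxx, Sxz)     # imputed features, (8) applied to each row

y_direct = x @ w_o
y_indirect = x @ a + z_hat @ b            # oracle form (3) with z replaced by z_hat
```

Under these assumptions `y_direct` and `y_indirect` agree up to floating-point error, because the first block row of (4) gives $\alpha + (E[xx^\top])^{-1}E[xz^\top]\beta = w_o$.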

### III-B Conservative predictor

The predictors that satisfy the second equality in (5) are given by all parameter vectors in

$$W_c = \big\{ w : E[z(y - w^\top x)] = 0 \big\}. \tag{9}$$

This set is a subspace of the parameter space whose dimension is that of $x$ minus that of $z$, and therefore we consider the parameter that minimizes (1), viz.

$$w_c = \arg\min_{w \in W_c} \text{MSE}(w). \tag{10}$$

We denote the resulting predictor as ‘conservative’ with respect to the missing $z$, because it satisfies only the second equality in (5).

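Remark: Comparing the error of $w^\top x$ with that of $\hat{y}^\star$ in (3), the excess MSE can be expressed as

$$\text{MSE}(w) - \text{MSE}^\star = \|\Gamma(\alpha - w) + \beta\|^2_{E[zz^\top]} + \|\alpha - w\|^2_{E[\tilde{x}\tilde{x}^\top]} \geq 0, \tag{11}$$

where $\Gamma = (E[zz^\top])^{-1} E[zx^\top]$ and $\tilde{x} = x - \Gamma^\top z$ is a residual term (see appendix). Note that the first term is weighted by the dispersion of $z$. The constraint $w \in W_c$ enforces the first term in (11) to be zero. This leaves the remaining degrees of freedom of $w$ to minimize the second term. By contrast, $w_o$ minimizes the sum of both terms.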

### III-C Robust predictor

Satisfying only one of the equalities in (5) comes at a cost: the optimistic $w_o$ yields robustness against outlying $x$ but not $z$ and, conversely, the conservative $w_c$ yields robustness against outlying $z$ but not $x$. Since both equalities can only be satisfied by the infeasible predictor (3), we propose a predictor that interpolates between optimistic and conservative modes using the side information that $x$ provides about outliers in the missing features $z$. That is, we propose to learn the adaptive parameter vector

$$w(x) = \Pr\{z \notin Z_\alpha \mid x\}\, w_o + \Pr\{z \in Z_\alpha \mid x\}\, w_c, \tag{12}$$

such that the predictor $w(x)^\top x$ becomes robust against outliers in both $x$ and $z$.

## IV Learning robust predictor

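Learning the robust predictor implies finding finite-sample approximations of $w_o$, $w_c$ and $\Pr\{z \in Z_\alpha \mid x\}$ in (12) using $n$ training samples $\{(x_i, z_i, y_i)\}_{i=1}^{n}$.

We begin by defining $\text{MSE}_n(w) = E_n[\,|y - w^\top x|^2\,]$, which yields the empirical counterpart of (7):

$$\hat{w}_o = \arg\min_{w} \text{MSE}_n(w) = (E_n[xx^\top])^{\dagger} E_n[xy]. \tag{13}$$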

Similarly, for (10) we note that the empirical counterpart of the constraint in (9) is

$$E_n[z(y - w^\top x)] = 0 \;\Leftrightarrow\; E_n[zx^\top]\, w = E_n[zy]. \tag{14}$$

All vectors that satisfy the constraint can therefore be parameterized as

$$w(\theta) = (E_n[zx^\top])^{\dagger} E_n[zy] + \Pi \theta,$$

where $\Pi$ is the orthogonal projection matrix onto the null space of $E_n[zx^\top]$. This yields the empirical counterpart of (10),

$$\hat{w}_c = w(\hat{\theta}), \tag{15}$$

where $\hat{\theta}$ is a minimizer of the convex function $\text{MSE}_n(w(\theta))$.
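A minimal NumPy sketch of (13) to (15) could look as follows. The function names are hypothetical, and solving for $\hat{\theta}$ with an ordinary least-squares call is one possible choice for minimizing the convex criterion, not necessarily the one used by the authors.

```python
import numpy as np

def fit_optimistic(x, y):
    """Empirical optimistic parameter (13): pseudo-inverse solution of the normal equations."""
    n = len(y)
    return np.linalg.pinv(x.T @ x / n) @ (x.T @ y / n)

def fit_conservative(x, z, y):
    """Empirical conservative parameter (15) under the linear constraint (14)."""
    n, d = x.shape
    A = z.T @ x / n                       # E_n[z x^T], shape (q, d)
    b = z.T @ y / n                       # E_n[z y]
    A_pinv = np.linalg.pinv(A)
    w_part = A_pinv @ b                   # particular solution of A w = b
    Pi = np.eye(d) - A_pinv @ A           # orthogonal projector onto the null space of A
    # Minimize MSE_n(w_part + Pi @ theta), a least-squares problem in theta
    theta = np.linalg.lstsq(x @ Pi, y - x @ w_part, rcond=None)[0]
    return w_part + Pi @ theta
```

When $E_n[zx^\top]$ has full row rank, the returned vector satisfies (14) exactly, and its training MSE cannot fall below that of the optimistic solution because it is a constrained minimizer of the same criterion.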

Next, we consider learning a model of the probability of an outlier event, $z \in Z_\alpha$, conditioned on $x$. Using the definition (2), we predict an outlier event with the scalar

$$\delta(x) = \sqrt{\hat{z}^\top(x)\, E_n[zz^\top]^{\dagger}\, \hat{z}(x)} \geq 0,$$

where $\hat{z}(x)$ is the empirical version of (8). The conditional outlier probability is modeled using a standard logistic function,

$$\widehat{\Pr}\{z \in Z_\alpha \mid x\} = \frac{1}{1 + \exp\big(\kappa(\delta(x) - \delta_0)\big)}.$$

The model parameters $\kappa$ and $\delta_0$ are learned from the training data by minimizing the standard cross-entropy criterion

$$\min_{\kappa, \delta_0} \; -E_n\Big[ I(z \in Z_\alpha) \ln \widehat{\Pr}\{z \in Z_\alpha \mid x\} + I(z \notin Z_\alpha) \ln\big(1 - \widehat{\Pr}\{z \in Z_\alpha \mid x\}\big) \Big]. \tag{16}$$

This approach takes into account the inherent uncertainty of predicting an outlying $z$ from $x$. An example of a fitted model is illustrated in Figure 2.
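The outlier model can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions: the function names are hypothetical, the cross-entropy (16) is minimized by plain gradient descent on the equivalent parameterization with logit $a + b\,\delta(x)$, and the iteration count is arbitrary. The source does not specify which optimizer is used.

```python
import numpy as np

def outlier_score(x, x_train, z_train):
    """Scalar delta(x) built from the empirical linear imputation of z, cf. (8)."""
    n = len(x_train)
    Sxx = x_train.T @ x_train / n
    Sxz = x_train.T @ z_train / n
    Szz = z_train.T @ z_train / n
    z_hat = x @ np.linalg.pinv(Sxx) @ Sxz            # one imputed z per row of x
    quad = np.einsum("ij,jk,ik->i", z_hat, np.linalg.pinv(Szz), z_hat)
    return np.sqrt(np.maximum(quad, 0.0))

def fit_outlier_model(delta, is_outlier, n_iter=2000):
    """Fit (kappa, delta0) by gradient descent on the cross-entropy (16)."""
    t = is_outlier.astype(float)
    phi = np.column_stack([np.ones_like(delta), delta])   # logit = a + b * delta
    step = 4.0 / np.mean(np.sum(phi ** 2, axis=1))         # conservative step size
    theta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(phi @ theta)))
        theta -= step * phi.T @ (p - t) / len(t)
    a, b = theta
    return -b, -a / b     # kappa, delta0 in 1 / (1 + exp(kappa * (delta - delta0)))

def outlier_probability(delta, kappa, delta0):
    """Logistic model of Pr{z in Z_alpha | x} as a function of delta(x)."""
    return 1.0 / (1.0 + np.exp(kappa * (delta - delta0)))
```

In the form written above, a larger $\delta(x)$ raises the modeled outlier probability only if the fitted $\kappa$ is negative, so the sign of $\kappa$ is left to the data rather than fixed in advance.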

In sum, we learn a robust predictor with an adaptive parameter vector

$$\hat{w}(x) = \widehat{\Pr}\{z \notin Z_\alpha \mid x\}\, \hat{w}_o + \widehat{\Pr}\{z \in Z_\alpha \mid x\}\, \hat{w}_c \tag{17}$$

using $n$ samples, as described in Algorithm 1.
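Given the two learned parameter vectors and an estimated outlier probability for each test point, the adaptive predictor (17) reduces to a per-sample convex combination. The function name and argument layout below are assumptions made for illustration.

```python
import numpy as np

def robust_predict(x, w_o, w_c, p_outlier):
    """Predict with the adaptive weights of (17).

    x          : array of shape (n, d), observed features
    w_o, w_c   : arrays of shape (d,), optimistic and conservative parameters
    p_outlier  : array of shape (n,), estimated Pr{z in Z_alpha | x} per row
    """
    # Row i holds (1 - p_i) * w_o + p_i * w_c
    w = np.outer(1.0 - p_outlier, w_o) + np.outer(p_outlier, w_c)
    return np.sum(w * x, axis=1)
```

At the extremes the predictor falls back to the pure modes: `p_outlier = 0` gives the optimistic prediction and `p_outlier = 1` the conservative one.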

Remark: In the case of high-dimensional features $x$ and $z$, one may use regularized methods, such as ridge regression, Lasso, or the tuning-free Spice method [10], to learn the quantities in (17).

## V Experimental results

We evaluate the robustness of the proposed predictor using both synthetic and real data.

### V-A Synthetic data

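Consider the following data-generating process of $x$, $z$, and $y$:

$$z \sim \text{St}(0, 1, \nu_z), \quad x = \mathbf{1} z + u + \epsilon_x, \quad y = z + \mathbf{1}^\top x + \epsilon_y, \tag{18}$$

where $u$ is a t-distributed latent variable of the same dimension as $x$, and $\epsilon_x$ and $\epsilon_y$ are white Gaussian processes of corresponding dimensions.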
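A possible simulation of (18) is sketched below. The dimension `d`, the degrees of freedom `nu_z` and `nu_u`, unit noise variances, and drawing the components of $u$ independently are all assumed values chosen for illustration; they are not taken from the source.

```python
import numpy as np

def generate_data(n, d=5, nu_z=3.0, nu_u=10.0, seed=0):
    """Draw n samples (x, z, y) following the structure of (18).

    d, nu_z, nu_u and the unit noise scales are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_t(nu_z, size=(n, 1))            # heavy-tailed missing feature
    u = rng.standard_t(nu_u, size=(n, d))            # latent t-distributed term
    x = z @ np.ones((1, d)) + u + rng.normal(size=(n, d))
    y = z[:, 0] + x.sum(axis=1) + rng.normal(size=n)
    return x, z, y

x_train, z_train, y_train = generate_data(n=2000)
```

Smaller values of `nu_z` produce heavier tails in $z$ and hence more pronounced outlier events.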

We evaluate predictors of a new outcome $y$ given only $x$, where the vector $w$ is learned from $n$ training samples. Specifically, we evaluate the optimistic $\hat{w}_o$, conservative $\hat{w}_c$ and proposed $\hat{w}(x)$ predictors with respect to outliers for missing features with heavy tails (small $\nu_z$). A comparison of the conditional MSE functions for the learned predictors is given in Fig. 3, where it is seen that the robust predictor smoothly interpolates between the two modes.

When averaging over $x$, the MSE for each training dataset is illustrated in Fig. 4. We see that the robust predictor drastically reduces the outlier $\text{MSE}_\alpha$, while leading to a small increase in the inlier $\text{MSE}_{1-\alpha}$. The differences in MSEs, when averaged across all training datasets, are summarized in Table I, which demonstrates the robustness of the proposed approach.

### V-B Air quality data

Next, we demonstrate the proposed method using real-world air quality data. Nitrogen oxides (NO$_x$) emitted by fossil fuel vehicles are a major air pollutant in urban environments, with negative impacts on the health of inhabitants.

The aim here is to predict the daily average NO$_x$ concentration, denoted $y$, based on the NO$_x$ and ozone (O$_3$) measurements from previous days. That is, $x$ contains the daily average NO$_x$ and O$_3$ levels from the preceding days. In the training data, we also have access to $z$, the O$_3$ concentration at the same time as the outcome $y$. This feature is correlated with $x$ and $y$. For prediction of a new outcome, however, $z$ is a missing feature.

The dataset contains 10 years of daily average NO$_x$ and O$_3$ measurements from 2006-01-01 to 2015-12-31. The data is split into 7 years of training data (2006-2012) and 3 years of test data (2013-2015). Considering outlier events $z \in Z_\alpha$, Table II shows that we are able to reduce the outlier $\text{MSE}_\alpha$ by 10% while incurring a minimal increase of the inlier $\text{MSE}_{1-\alpha}$.

## VI Conclusion

Based on orthogonality properties of an optimal oracle predictor, we developed a prediction method that is robust against outliers of the missing features. It is formulated as a convex combination of optimistic and conservative predictors, and requires only specifying the intended outlier level $\alpha$ against which it must be robust. The ability of the robust predictor to suppress outlier errors, while incurring a minor increase in inlier errors, was demonstrated using both simulated and real-world datasets.

## Appendix A Proofs

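### A-A Probability bound for (2)

The probability bound for an event $z \in Z_\alpha$ follows readily from a Chebyshev-type inequality:

$$\Pr\{z \in Z_\alpha\} = \int_{Z_\alpha} p(z)\, dz \leq \int_{Z_\alpha} \Big[ \frac{\alpha}{q}\, z^\top (E[zz^\top])^{-1} z \Big] p(z)\, dz \leq \alpha \int \Big[ \frac{z^\top (E[zz^\top])^{-1} z}{q} \Big] p(z)\, dz = \alpha.$$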

### A-B MSE decomposition (11)

The outcome can always be decomposed as

$$y = \alpha^\top x + \beta^\top z + v,$$

where $v = y - \hat{y}^\star(x, z)$. The random variable $v$ is orthogonal to $x$ and $z$, and consequently to the residual $\tilde{x}$. Thus we can express the prediction error as

$$y - w^\top x = [\Gamma(\alpha - w) + \beta]^\top z + (\alpha - w)^\top \tilde{x} + v.$$

Squaring this expression and taking the expectation yields (11). Similarly, inserting it into the constraint in (9) yields $\Gamma(\alpha - w) + \beta = 0$.
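The decomposition (11) can be verified numerically when expectations are replaced by sample moments, since the orthogonality relations then hold exactly in-sample. The toy data, dimensions and variable names in this sketch are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, q = 1000, 3, 1                      # hypothetical sample size and dimensions
z = rng.standard_t(5, size=(n, q))
x = z @ np.ones((q, d)) + rng.normal(size=(n, d))
y = z[:, 0] + x.sum(axis=1) + rng.normal(size=n)

Sxx, Sxz, Szz = x.T @ x / n, x.T @ z / n, z.T @ z / n

# Oracle coefficients (alpha, beta) from (4) and the oracle error MSE*
coef = np.linalg.solve(np.block([[Sxx, Sxz], [Sxz.T, Szz]]),
                       np.concatenate([x.T @ y / n, z.T @ y / n]))
a, b = coef[:d], coef[d:]
mse_star = np.mean((y - x @ a - z @ b) ** 2)

Gamma = np.linalg.solve(Szz, Sxz.T)       # Gamma = (E[zz^T])^{-1} E[zx^T], shape (q, d)
x_res = x - z @ Gamma                     # residual x~ = x - Gamma^T z, row-wise
S_res = x_res.T @ x_res / n

w = rng.normal(size=d)                    # an arbitrary parameter vector
c = Gamma @ (a - w) + b
lhs = np.mean((y - x @ w) ** 2) - mse_star
rhs = c @ Szz @ c + (a - w) @ S_res @ (a - w)
```

Here `lhs` and `rhs` coincide up to rounding, and both are nonnegative, in line with (11).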

## References

• [1] O. Chapelle, B. Schölkopf, and A. Zien (2006) Semi-supervised learning. MIT Press, Cambridge, Mass. Cited by: §I.
• [2] J. Christmas and R. Everson (2010) Robust autoregression: Student-t innovations using variational Bayes. IEEE Transactions on Signal Processing 59 (1), pp. 48–57. Cited by: §I.
• [3] Y. Ephraim and W. J. Roberts (2005) Revisiting autoregressive hidden Markov modeling of speech signals. IEEE Signal Processing Letters 12 (2), pp. 166–169. Cited by: §I.
• [4] J. D. Hamilton and R. Susmel (1994) Autoregressive conditional heteroskedasticity and changes in regime. Journal of Econometrics 64 (1-2), pp. 307–333. Cited by: §I.
• [5] P. J. Huber (1964) Robust estimation of a location parameter. The Annals of Mathematical Statistics, pp. 73–101. Cited by: §I.
• [6] T. Kailath, A. H. Sayed, and B. Hassibi (2000) Linear estimation. Prentice Hall, Upper Saddle River, NJ. ISBN 9780130224644. Cited by: §III-A.
• [7] K. L. Lange, R. J. Little, and J. M. Taylor (1989) Robust statistical modeling using the t distribution. Journal of the American Statistical Association 84 (408), pp. 881–896. Cited by: §I.
• [8] R. J. Little and D. B. Rubin (2019) Statistical analysis with missing data. Vol. 793, John Wiley & Sons. Cited by: §I.
• [9] A. Poritz (1982) Linear predictive hidden Markov models and the speech signal. In ICASSP'82, IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 7, pp. 1291–1294. Cited by: §I.
• [10] D. Zachariah and P. Stoica (2015) Online hyperparameter-free sparse estimation method. IEEE Transactions on Signal Processing 63 (13), pp. 3348–3359. Cited by: §IV.