A common task in statistical machine learning and signal processing is to predict an outcome based on features, using past training data drawn from an unknown distribution.
In certain problems, however, not all features in the training data are available at the time of prediction. For instance, in medical diagnosis, certain features are more expensive or time-consuming to obtain than others, and are therefore not available at an early stage of assessment. Other features are only observable after the outcome has occurred. We denote the features that are missing at the time of prediction separately, and consider the task of predicting the outcome given only the observable features.
A direct approach would be to discard all past training data about the missing features and predict only on the basis of the association between the observable features and the outcome. By contrast, missing data in statistics is commonly tackled by means of imputation [8, 1]. An indirect approach is then to predict using both the observable features and an imputed value of the missing ones. However, as we show in Section II, this turns out to be equivalent to the direct approach. For both approaches, learning a linearly parameterized predictor that minimizes the mean squared error (MSE) is shown to perform poorly in events where the missing features occur in the tails of their marginal distribution. Robust statistics has typically focused on problems with contaminated training data or heavy-tailed noise distributions [7, 2]. For the latter, Student-t distributions are often adopted in regression models in order to achieve robust estimation of model parameters such that outlying training samples are downweighted.
Our concern in this paper is robust prediction in the event of outlying missing features. Specifically, we achieve robustness using a weighted combination of optimistic and conservative predictors, which are derived in Section III. The approach of switching between modes during extreme events can be found in econometrics and signal processing [9, 3], but has not been considered for prediction with missing features. We demonstrate the robustness properties of the proposed approach on both synthetic and real data sets.
Notation: We define a pseudonorm , where , and the sample mean of as .
II Problem formulation
We consider scenarios in which the observable and missing features are correlated, and the dimension of the observable features is greater than that of the missing ones. We study the class of linearly parameterized predictors. Without loss of generality, we consider the features to be centered; subsequent results are, moreover, trivially extended to arbitrary features by replacing them with a function of the features.
The mean squared error of a predictor is
where the expectation is with respect to the unknown distribution. We consider how the missing feature impacts the prediction performance. The tails of its distribution are contained in the region
such that (see appendix), which for small values corresponds to the probability of an outlier event. We can now decompose the MSE
into an outlier MSE and an inlier MSE, respectively. As Figure 1 illustrates, prediction performance can degrade significantly for outlier events.
Using training samples, our goal is to formulate a robust predictor that will reduce the outlier MSE without incurring a significant increase of the inlier MSE.
If the missing feature were known, the optimal linearly parameterized predictor would be given by
Its prediction errors are then uncorrelated with the features, that is,
which renders the predictor robust against outlier events for the observable and missing features, respectively. In the case of missing features, we begin by considering predictors that satisfy either one of the orthogonality properties in (5).
III-A Optimistic predictor
The predictors that satisfy the first equality in (5) are given by all parameter vectors in
We denote the resulting predictor as ‘optimistic’ with respect to the missing , because it does not attempt to satisfy the second equality in (5).
Remark: Given that the observable and missing features are correlated, we may consider using the MSE-optimal linear predictor
to impute the missing feature. An indirect approach would then be to use the imputed value in (3); this, however, is equivalent to the above predictor, as follows from the block matrix inversion lemma applied in (4).
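This equivalence can be checked numerically in the finite-sample setting. The sketch below (variable names and dimensions are hypothetical, and ordinary least squares stands in for the population quantities) fits an oracle linear model on both feature blocks, imputes the missing block linearly from the observed one, and compares against regressing on the observed block alone:

```python
import numpy as np

rng = np.random.default_rng(0)
n, du, dv = 500, 5, 2

# Hypothetical data: U observed, V missing at prediction time, y outcome
U = rng.standard_normal((n, du))
A = rng.standard_normal((du, dv))
V = U @ A + 0.5 * rng.standard_normal((n, dv))
y = U @ rng.standard_normal(du) + V @ rng.standard_normal(dv) \
    + 0.1 * rng.standard_normal(n)

# Oracle least-squares fit on the full features [U, V]
X = np.hstack([U, V])
theta = np.linalg.lstsq(X, y, rcond=None)[0]

# Indirect approach: impute V linearly from U, then apply the oracle fit
A_hat = np.linalg.lstsq(U, V, rcond=None)[0]
y_indirect = U @ theta[:du] + (U @ A_hat) @ theta[du:]

# Direct approach: regress y on U alone
theta_u = np.linalg.lstsq(U, y, rcond=None)[0]
y_direct = U @ theta_u

# In-sample, the two predictions coincide exactly (both equal the
# projection of y onto the column space of U)
print(np.allclose(y_indirect, y_direct))
```

The in-sample identity follows because the imputation projects the missing block onto the column space of the observed one, so the combined prediction collapses to the same projection as the direct fit.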
III-B Conservative predictor
The predictors that satisfy the second equality in (5) are given by all parameter vectors in
This set is a -dimensional subspace of and therefore we consider the parameter that minimizes (1), viz.
We denote the resulting predictor as ‘conservative’ with respect to the missing , because it satisfies only the second equality in (5).
Remark: Comparing the error of with that of in (3), the excess MSE can be expressed as
where the weighting matrix and residual term are given in the appendix. Note that the first term is weighted by the dispersion of the missing feature. The constraint forces the first term in (11) to zero. This leaves the remaining degrees of freedom to minimize the second term. By contrast, the oracle predictor minimizes the sum of both terms.
III-C Robust predictor
Satisfying only one of the equalities in (5) comes at a cost: the optimistic predictor yields robustness against outlying observable features but not missing ones and, conversely, the conservative predictor yields robustness against outlying missing features but not observable ones. Since both equalities can only be satisfied by the infeasible predictor (3), we propose a predictor that interpolates between the optimistic and conservative modes using the side information that the observable features provide about outliers in the missing features. That is, we propose to learn the adaptive parameter vector
such that the predictor becomes robust against outliers in both the observable and missing features.
IV Learning the robust predictor
Learning the robust predictor requires finding finite-sample approximations of the quantities in (12) using the training samples.
We begin by defining , which yields the empirical counterpart of (7):
All vectors that satisfy the constraint can therefore be parameterized as
where is the orthogonal projection matrix onto the null space of . This yields the empirical counterpart of (10),
where is a minimizer of the convex function .
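The constrained estimate can be sketched numerically. The code below assumes the empirical constraint takes the form V'(y − Uθ) = 0 (residuals uncorrelated in-sample with the missing features), which the stripped equations suggest; all names are hypothetical. It uses an SVD-based null-space basis in place of the projection-matrix parameterization, which is an equivalent way of sweeping the same solution set:

```python
import numpy as np

rng = np.random.default_rng(0)
n, du, dv = 400, 6, 2

# Hypothetical training data: U observed features, V missing at prediction time
U = rng.standard_normal((n, du))
V = U @ rng.standard_normal((du, dv)) + rng.standard_normal((n, dv))
y = U @ rng.standard_normal(du) + V @ rng.standard_normal(dv) \
    + rng.standard_normal(n)

# Constraint: (V'U) theta = V'y, an underdetermined system since dv < du
M = V.T @ U
theta0 = np.linalg.pinv(M) @ (V.T @ y)       # particular solution

# Basis for the null space of M: right singular vectors with zero singular value
_, _, Vh = np.linalg.svd(M)
N = Vh[dv:].T                                # shape (du, du - dv)

# Minimize the empirical MSE over the remaining du - dv degrees of freedom
z = np.linalg.lstsq(U @ N, y - U @ theta0, rcond=None)[0]
theta_cons = theta0 + N @ z

# The orthogonality constraint holds at the solution (up to round-off)
print(np.abs(V.T @ (y - U @ theta_cons)).max())
```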
Next, we consider learning a model of the probability of an outlier event, , conditioned on . Using the definition (2), we predict an outlier event with the scalar
where is the empirical version of (8). The conditional outlier probability is modeled using a standard logistic function,
The model parameters and are learned from the training data by minimizing the standard cross-entropy criterion
This approach takes into account the inherent uncertainty of predicting an outlying missing feature from the observable ones. An example of a fitted model is illustrated in Figure 2.
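The outlier-probability model can be sketched as follows, assuming (as the text indicates) that outlier labels come from the tail region in (2), the score is a quadratic form in the linearly imputed feature, and the logistic parameters are fit by cross-entropy; the generative model, dimensions, and learning rate here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, du, dv, eps = 2000, 4, 1, 0.05

# Hypothetical data: missing feature V is correlated with U and heavy-tailed
U = rng.standard_normal((n, du))
V = U @ rng.standard_normal((du, dv)) + rng.standard_t(3, size=(n, dv))

# Outlier labels via the tail region (2): v' Cov(v)^{-1} v > dv / eps
Cv_inv = np.linalg.inv(np.cov(V, rowvar=False).reshape(dv, dv))
b = (np.einsum('ij,jk,ik->i', V, Cv_inv, V) > dv / eps).astype(float)

# Scalar outlier score computed from the linearly imputed missing feature
A_hat = np.linalg.lstsq(U, V, rcond=None)[0]
V_hat = U @ A_hat
z = np.einsum('ij,jk,ik->i', V_hat, Cv_inv, V_hat)

# Fit p(outlier | u) = sigmoid(a*z + c) by cross-entropy gradient descent
a, c = 0.0, 0.0
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-np.clip(a * z + c, -30, 30)))
    g = p - b                     # gradient of cross-entropy w.r.t. the logit
    a -= 0.1 * np.mean(g * z)
    c -= 0.1 * np.mean(g)

p = 1.0 / (1.0 + np.exp(-np.clip(a * z + c, -30, 30)))
loss = -np.mean(b * np.log(p + 1e-12) + (1 - b) * np.log(1 - p + 1e-12))
```

Since the cross-entropy is convex in the two scalar parameters, plain gradient descent with a small step size suffices here.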
In sum, we learn a robust predictor with an adaptive parameter vector
using samples, as described in Algorithm 1.
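An end-to-end sketch of the procedure is given below. It follows the structure described in this section rather than the paper's exact Algorithm 1 (which is not reproduced here), and all names, dimensions, and hyperparameters are hypothetical: fit the optimistic and conservative parameter vectors, fit the logistic outlier model, then blend the two modes per test point with the estimated outlier probability:

```python
import numpy as np

def fit_robust(U, V, y, eps=0.05, iters=3000, lr=0.1):
    """Sketch of the robust predictor: returns a function mapping U -> y_hat."""
    du, dv = U.shape[1], V.shape[1]

    # Optimistic mode: plain least squares of y on the observable features
    th_opt = np.linalg.lstsq(U, y, rcond=None)[0]

    # Conservative mode: least squares subject to V'(y - U theta) = 0
    M = V.T @ U
    th0 = np.linalg.pinv(M) @ (V.T @ y)
    N = np.linalg.svd(M)[2][dv:].T               # null-space basis of M
    z_ls = np.linalg.lstsq(U @ N, y - U @ th0, rcond=None)[0]
    th_con = th0 + N @ z_ls

    # Outlier score from the linearly imputed missing feature
    Cv_inv = np.linalg.inv(np.cov(V, rowvar=False).reshape(dv, dv))
    A_hat = np.linalg.lstsq(U, V, rcond=None)[0]
    score = lambda Um: np.einsum('ij,jk,ik->i', Um @ A_hat, Cv_inv, Um @ A_hat)

    # Logistic model of the conditional outlier probability
    b = (np.einsum('ij,jk,ik->i', V, Cv_inv, V) > dv / eps).astype(float)
    s = score(U)
    a, c = 0.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(a * s + c, -30, 30)))
        a -= lr * np.mean((p - b) * s)
        c -= lr * np.mean(p - b)

    def predict(Unew):
        p = 1.0 / (1.0 + np.exp(-np.clip(a * score(Unew) + c, -30, 30)))
        # Adaptive parameter vector: conservative when an outlier is likely
        theta = (1 - p)[:, None] * th_opt + p[:, None] * th_con
        return np.sum(theta * Unew, axis=1)

    return predict

# Minimal usage with synthetic data
rng = np.random.default_rng(2)
U = rng.standard_normal((500, 5))
V = U @ rng.standard_normal((5, 2)) + rng.standard_t(3, size=(500, 2))
y = U @ rng.standard_normal(5) + V @ rng.standard_normal(2)
predictor = fit_robust(U, V, y)
y_hat = predictor(U[:10])
```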
V Experimental results
We evaluate the robustness of the proposed predictor using both synthetic and real data.
V-A Synthetic data
Consider the following data-generating process of , , and :
where the latent variable is t-distributed with the indicated degrees of freedom and the noise terms are white Gaussian processes of corresponding dimensions.
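One plausible instantiation of such a data-generating process is sketched below; the dimensions, degrees of freedom, and mixing matrices are hypothetical, since the paper's exact specification was lost in extraction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, du, dv, nu = 1000, 5, 2, 3     # sample size, dimensions, degrees of freedom

# Heavy-tailed latent variable driving the missing features
latent = rng.standard_t(nu, size=(n, dv))

# Observable features, missing features, and outcome with white Gaussian noise
U = rng.standard_normal((n, du))
V = U @ rng.standard_normal((du, dv)) + latent
y = U @ rng.standard_normal(du) + V @ rng.standard_normal(dv) \
    + rng.standard_normal(n)
```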
We evaluate predictors of a new outcome given only the observable features, where the parameter vector is learned from the training data. Specifically, we evaluate the optimistic, conservative, and proposed predictors with respect to outliers for missing features with heavy tails. A comparison of the conditional MSE functions for the learned predictors is given in Fig. 3, where it is seen that the robust predictor smoothly interpolates between the two modes.
When averaging over the feature distribution, the MSE for each training dataset is illustrated in Fig. 4. We see that the robust predictor drastically reduces the outlier MSE, while leading to a small increase in the inlier MSE. The differences in MSE, averaged across all training datasets, are summarized in Table I, which demonstrates the robustness of the proposed approach.
V-B Air quality data
Next, we demonstrate the proposed method using real-world air quality data. Nitrogen oxides (NOx) emitted by fossil-fuel vehicles are a major air pollutant in urban environments, with negative impacts on the health of inhabitants.
The aim here is to predict the daily average NOx concentration based on the NOx and ozone (O3) measurements from previous days. That is, the observable feature vector contains the daily average NOx and O3 levels from past days. In the training data, we also have access to the O3 concentration at the same time as the outcome. This feature is correlated with the outcome and the observable features. For the prediction of a new outcome, however, it is a missing feature.
The dataset contains 10 years of daily average NOx and O3 measurements, from 2006-01-01 to 2015-12-31. The data is split into 7 years of training data (2006–2012) and 3 years of test data (2013–2015). Considering outlier events, Table II shows that we are able to reduce the outlier MSE by 10% while incurring a minimal increase of the inlier MSE.
VI Conclusion
Based on orthogonality properties of an optimal oracle predictor, we developed a prediction method that is robust against outliers in the missing features. It is formulated as a convex combination of optimistic and conservative predictors, and requires only specifying the intended outlier level against which it must be robust. The ability of the robust predictor to suppress outlier errors, while incurring a minor increase in inlier errors, was demonstrated using both simulated and real-world datasets.
Appendix A Proofs
A-A Probability bound for (2)
The probability bound for an outlier event follows readily from a Chebyshev-type inequality:
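The argument can be reconstructed as a Markov-type bound, assuming (the original symbols were lost in extraction) that the tail region in (2) has the form $\mathcal{R}_\varepsilon = \{v : v^\top \mathbf{C}_v^{-1} v > d_v/\varepsilon\}$, where $\mathbf{C}_v$ is the covariance of the centered missing feature and $d_v$ its dimension:

```latex
\Pr\{v \in \mathcal{R}_\varepsilon\}
  = \Pr\Big\{ v^\top \mathbf{C}_v^{-1} v > \tfrac{d_v}{\varepsilon} \Big\}
  \le \frac{\varepsilon}{d_v}\, \mathbb{E}\big[ v^\top \mathbf{C}_v^{-1} v \big]
  = \frac{\varepsilon}{d_v}\, \operatorname{tr}\big( \mathbf{C}_v^{-1} \mathbf{C}_v \big)
  = \varepsilon,
```

by Markov's inequality applied to the nonnegative variable $v^\top \mathbf{C}_v^{-1} v$, whose mean equals $\operatorname{tr}(\mathbf{C}_v^{-1}\mathbf{C}_v) = d_v$.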
A-B MSE decomposition (11)
References

- [1] O. Chapelle, B. Schölkopf, and A. Zien (Eds.) (2006) Semi-Supervised Learning. MIT Press, Cambridge, MA.
- [2] J. Christmas and R. Everson (2010) Robust autoregression: Student-t innovations using variational Bayes. IEEE Transactions on Signal Processing 59 (1), pp. 48–57.
- [3] Y. Ephraim and W. J. J. Roberts (2005) Revisiting autoregressive hidden Markov modeling of speech signals. IEEE Signal Processing Letters 12 (2), pp. 166–169.
- [4] J. D. Hamilton and R. Susmel (1994) Autoregressive conditional heteroskedasticity and changes in regime. Journal of Econometrics 64 (1–2), pp. 307–333.
- [5] P. J. Huber (1964) Robust estimation of a location parameter. The Annals of Mathematical Statistics 35 (1), pp. 73–101.
- [6] T. Kailath, A. H. Sayed, and B. Hassibi (2000) Linear Estimation. Prentice Hall, Upper Saddle River, NJ.
- [7] K. L. Lange, R. J. A. Little, and J. M. G. Taylor (1989) Robust statistical modeling using the t distribution. Journal of the American Statistical Association 84 (408), pp. 881–896.
- [8] R. J. A. Little and D. B. Rubin (2019) Statistical Analysis with Missing Data. Vol. 793, John Wiley & Sons.
- [9] A. B. Poritz (1982) Linear predictive hidden Markov models and the speech signal. In ICASSP'82, IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 7, pp. 1291–1294.
- [10] D. Zachariah and P. Stoica (2015) Online hyperparameter-free sparse estimation method. IEEE Transactions on Signal Processing 63 (13), pp. 3348–3359.