I Introduction
A common task in statistical machine learning and signal processing is to predict outcomes
based on features and , using past training data drawn from an unknown distribution. In certain problems, however, not all features in the training data are available at the time of prediction. For instance, in medical diagnosis, certain features are more expensive or time-consuming to obtain than others, and are therefore not available at an early stage of assessment. Other features are only observable after the outcome has occurred. We let denote features missing at the time of prediction and consider the task of predicting given only the observable features .
A direct approach would discard all past training data about , and predict only on the basis of the association between and . By contrast, missing data in statistics is commonly tackled by means of imputation [8, 1]. An indirect approach is then to predict using both and an imputed . However, as we show in Section II, this turns out to be equivalent to the direct approach. For both approaches, learning a linearly parameterized predictor that minimizes the mean squared error (MSE) is shown to perform poorly in events when the missing features occur in the tails of the marginal distribution . Robust statistics has typically focused on problems with contaminated training data [5] or heavy-tailed noise distributions [7, 2]. For the latter, Student-t distributions are often adopted in regression models in order to achieve robust estimation of model parameters such that outlying training data samples are down-weighted.
Our concern in this paper is robust prediction in the event of outlying missing features. Specifically, we achieve robustness using a weighted combination of optimistic and conservative predictors, which are derived in Section III. The approach of switching between modes during extreme events can be found in econometrics [4] and signal processing [9, 3], but has not been considered for prediction with missing features. We demonstrate the robustness properties of the proposed approach on both synthetic and real data sets.
Notation: We define a pseudonorm , where , and the sample mean of as .
II Problem formulation
We consider scenarios in which

and are correlated,

the dimension of is greater than that of ,
and study the class of linearly parameterized predictors , where . Without loss of generality, we take to be centered. The subsequent results extend trivially to arbitrary features by replacing with a function .
The mean squared error (MSE) of a predictor is
(1) 
where the expectation is with respect to . We consider how the missing feature impacts on the prediction performance. The tails of the distribution of are contained in the region
(2) 
such that (see appendix), which for small
corresponds to the probability of an outlier event. We can now decompose the MSE into an outlier MSE and an inlier MSE, respectively. As Figure 1 illustrates, prediction performance can degrade significantly for outlier events.
Using training samples, our goal is to formulate a robust predictor that will reduce the outlier MSE without incurring a significant increase of the inlier MSE.
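The tail region in (2) and its probability bound can be checked numerically. The following is a minimal sketch, not the authors' code: the symbols of (2) are not legible in this copy, so it assumes that the missing features are called `z`, that the region has the form {z : z' Σ_z⁻¹ z ≥ d_z/δ}, and that the data are hypothetical Student-t draws.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heavy-tailed missing features z (Student-t, 3 degrees of freedom).
n, d_z = 5000, 2
z = rng.standard_t(df=3, size=(n, d_z))
z = z - z.mean(axis=0)                 # center the samples

# Empirical covariance (1/n normalization) and squared Mahalanobis-type norm.
sigma_z = z.T @ z / n
stat = np.einsum("ij,jk,ik->i", z, np.linalg.inv(sigma_z), z)

def outlier_mask(stat, d_z, delta):
    """Flag samples in the assumed tail region {z : z' Sigma^-1 z >= d_z / delta}."""
    return stat >= d_z / delta

delta = 0.05
is_outlier = outlier_mask(stat, d_z, delta)
outlier_fraction = is_outlier.mean()
print(f"fraction flagged: {outlier_fraction:.4f} (bound: {delta})")
```

Because the statistic averages to `d_z` over the sample, the flagged fraction cannot exceed `delta`, whatever the shape of the distribution. The bound is typically loose, so the realized outlier rate is often well below `delta`.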
III Predictors
If the feature were known, the optimal linearly parameterized predictor would be given by
(3) 
where
(4) 
Its prediction errors are then uncorrelated with the features, that is,
(5) 
which renders the predictor robust against outlier events for and , respectively. In the case of missing features , we begin by considering predictors which satisfy either one of the orthogonality properties (5).
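The orthogonality properties in (5) are easy to verify on sample moments. The sketch below assumes the notation `y` (outcome), `x` (observed features) and `z` (missing features), which is not recoverable from this copy, and uses hypothetical Gaussian data. The oracle is fitted by least squares on the stacked features.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_x, d_z = 2000, 4, 2

# Hypothetical correlated features: z depends linearly on x plus noise.
x = rng.normal(size=(n, d_x))
z = x @ rng.normal(size=(d_x, d_z)) + rng.normal(size=(n, d_z))
y = x @ rng.normal(size=d_x) + z @ rng.normal(size=d_z) + rng.normal(size=n)
x, z, y = x - x.mean(0), z - z.mean(0), y - y.mean()

# Oracle: least squares on the stacked features [x, z] (sample analogue of (3)-(4)).
u = np.hstack([x, z])
coef = np.linalg.solve(u.T @ u / n, u.T @ y / n)
alpha_star, beta_star = coef[:d_x], coef[d_x:]

# Orthogonality (5): residuals are uncorrelated with both x and z.
resid = y - u @ coef
corr_x = x.T @ resid / n
corr_z = z.T @ resid / n
print(np.abs(corr_x).max(), np.abs(corr_z).max())
```

Both correlations vanish up to floating-point error, which is the sample version of the two equalities in (5).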
III-A Optimistic predictor
The predictors that satisfy the first equality in (5) are given by all parameter vectors in
(6) 
This set, however, consists of a single element which is also the minimizer of (1) [6]. That is,
(7) 
We denote the resulting predictor as ‘optimistic’ with respect to the missing , because it does not attempt to satisfy the second equality in (5).
Remark: Given that and are correlated, we may consider using the MSE-optimal linear predictor
(8) 
to impute the missing feature. An indirect approach would then be to use in (3); this, however, is equivalent to the above predictor, as can be shown using the block matrix inversion lemma in (4).
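The equivalence stated in this remark can be illustrated numerically. The sketch below again assumes the names `y`, `x`, `z` and hypothetical Gaussian data. It compares the weights of the direct predictor with the effective weights obtained by plugging a linear imputation of `z` into the oracle.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_x, d_z = 1000, 3, 2
x = rng.normal(size=(n, d_x))
z = x @ rng.normal(size=(d_x, d_z)) + rng.normal(size=(n, d_z))
y = x @ rng.normal(size=d_x) + z @ rng.normal(size=d_z) + rng.normal(size=n)
x, z, y = x - x.mean(0), z - z.mean(0), y - y.mean()

# Direct ("optimistic") predictor: regress y on x only.
alpha_o = np.linalg.solve(x.T @ x, x.T @ y)

# Indirect predictor: impute z linearly from x, then plug into the oracle.
impute = np.linalg.solve(x.T @ x, x.T @ z)        # (d_x, d_z) map, z_hat = x @ impute
u = np.hstack([x, z])
coef = np.linalg.solve(u.T @ u, u.T @ y)
alpha_star, beta_star = coef[:d_x], coef[d_x:]
alpha_indirect = alpha_star + impute @ beta_star  # effective weights on x

print(np.abs(alpha_o - alpha_indirect).max())
```

The two weight vectors agree to machine precision, so imputing the missing feature with a linear predictor adds nothing beyond the direct regression on `x`.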
III-B Conservative predictor
The predictors that satisfy the second equality in (5) are given by all parameter vectors in
(9) 
This set is a dimensional subspace of and therefore we consider the parameter that minimizes (1), viz.
(10) 
We denote the resulting predictor as ‘conservative’ with respect to the missing , because it satisfies only the second equality in (5).
Remark: Comparing the error of with that of in (3), the excess MSE can be expressed as
(11) 
where and is a residual term (see appendix). Note that the first term is weighted by the dispersion of . The constraint enforces the first term in (11) to be zero. This leaves degrees of freedom to minimize the second term. By contrast, minimizes the sum of both terms.
III-C Robust predictor
Satisfying only one of the equalities in (5) comes at a cost: the optimistic yields robustness against outlying but not and, conversely, the conservative yields robustness against outlying but not . Since both equalities can only be satisfied by the infeasible predictor (3), we propose a predictor that interpolates between optimistic and conservative modes using the side information that provides about outliers in the missing features . That is, we propose to learn the adaptive parameter vector
(12) 
such that the predictor becomes robust against outliers in both and .
IV Learning robust predictor
Learning the robust predictor requires finding finite-sample approximations of , and in (12) using training samples .
We begin by defining , which yields the empirical counterpart of (7):
(13) 
Similarly, for (10) we note that the empirical counterpart of the constraint in (9) is
(14) 
All vectors that satisfy the constraint can therefore be parameterized as
where is the orthogonal projection matrix onto the null space of . This yields the empirical counterpart of (10),
(15) 
where is a minimizer of the convex function .
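The null-space parameterization described above can be sketched as follows. This is one possible implementation under assumed notation (`x`, `z`, `y`, and a hypothetical helper `conservative_weights`). It picks a particular solution of the empirical constraint with a pseudoinverse and fits the remaining free directions by least squares.

```python
import numpy as np

def conservative_weights(x, z, y):
    """Minimize the empirical MSE subject to mean(z * (y - x @ alpha)) = 0."""
    n, d_x = x.shape
    c = z.T @ x / n                      # constraint matrix, shape (d_z, d_x)
    b = z.T @ y / n
    c_pinv = np.linalg.pinv(c)
    alpha_0 = c_pinv @ b                 # one particular solution of c @ alpha = b
    proj = np.eye(d_x) - c_pinv @ c      # orthogonal projector onto null(c)
    # Free directions: alpha = alpha_0 + proj @ theta; fit theta by least squares.
    theta, *_ = np.linalg.lstsq(x @ proj, y - x @ alpha_0, rcond=None)
    return alpha_0 + proj @ theta

rng = np.random.default_rng(3)
n, d_x, d_z = 1500, 4, 1
x = rng.normal(size=(n, d_x))
z = x @ rng.normal(size=(d_x, d_z)) + rng.normal(size=(n, d_z))
y = x @ rng.normal(size=d_x) + z @ rng.normal(size=d_z) + rng.normal(size=n)
x, z, y = x - x.mean(0), z - z.mean(0), y - y.mean()

alpha_c = conservative_weights(x, z, y)
alpha_o = np.linalg.solve(x.T @ x, x.T @ y)

constraint_gap = z.T @ (y - x @ alpha_c) / n
mse_c = np.mean((y - x @ alpha_c) ** 2)
mse_o = np.mean((y - x @ alpha_o) ** 2)
print(constraint_gap, mse_o, mse_c)
```

The constraint gap is zero up to rounding. The training MSE of the conservative weights is never below that of the unconstrained least-squares weights, which is the price paid for errors that are uncorrelated with `z`.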
Next, we consider learning a model of the probability of an outlier event, , conditioned on . Using the definition (2), we predict an outlier event with the scalar
where is the empirical version of (8). The conditional outlier probability is modeled using a standard logistic function,
The model parameters and are learned from the training data by minimizing the standard cross-entropy criterion
(16) 
This approach takes into account the inherent uncertainty of predicting an outlying from . An example of a fitted model is illustrated in Figure 2.
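A small sketch of such a logistic fit is given below. The exact scalar statistic and the parameter names are not legible in this copy, so the code assumes a scalar missing feature `z`, a score built from the squared linear imputation of `z`, an outlier label defined by the threshold 1/δ on the normalized squared value of `z`, and plain gradient descent as the optimizer. All of these are illustrative choices.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(w, s, labels):
    p = np.clip(sigmoid(w[0] + w[1] * s), 1e-12, 1 - 1e-12)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def fit_logistic(s, labels, steps=2000, lr=1.0):
    """Plain gradient descent on the cross-entropy; s is assumed standardized."""
    w = np.zeros(2)
    design = np.column_stack([np.ones_like(s), s])
    for _ in range(steps):
        grad = design.T @ (sigmoid(design @ w) - labels) / len(s)
        w -= lr * grad
    return w

rng = np.random.default_rng(4)
n, delta = 4000, 0.1
x = rng.normal(size=n)
z = x + 0.5 * rng.standard_t(df=3, size=n)   # scalar missing feature, correlated with x
z = z - z.mean()
x = x - x.mean()

labels = (z**2 / z.var() >= 1.0 / delta).astype(float)  # training-set outlier indicator
z_hat = x * (x @ z) / (x @ x)                           # linear imputation of z from x
score = z_hat**2 / z.var()
s = (score - score.mean()) / score.std()                # standardized score

w = fit_logistic(s, labels)
loss_init = cross_entropy(np.zeros(2), s, labels)
loss_fit = cross_entropy(w, s, labels)
print(w, loss_init, loss_fit)
```

Standardizing the score keeps the curvature of the criterion bounded, so a unit step size decreases the loss at every iteration. In practice any off-the-shelf logistic regression routine could replace `fit_logistic`.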
In sum, we learn a robust predictor with an adaptive parameter vector
(17) 
using samples, as described in Algorithm 1.
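The combination step in (17) amounts to a per-sample convex blend of two weight vectors. The sketch below assumes that the estimated outlier probability weights the conservative vector and its complement weights the optimistic one. The function name `robust_predict` and all numbers are hypothetical.

```python
import numpy as np

def robust_predict(x_new, alpha_o, alpha_c, p_outlier):
    """Blend the two weight vectors per sample and predict.

    x_new: (m, d) features; p_outlier: (m,) estimated outlier probabilities.
    """
    p = np.asarray(p_outlier)[:, None]
    alpha = (1.0 - p) * alpha_o + p * alpha_c     # (m, d), one weight vector per row
    return np.sum(alpha * x_new, axis=1)

alpha_o = np.array([1.0, -2.0])
alpha_c = np.array([0.5, 0.0])
x_new = np.array([[1.0, 1.0], [2.0, -1.0], [0.0, 3.0]])
p_outlier = np.array([0.0, 1.0, 0.5])

y_hat = robust_predict(x_new, alpha_o, alpha_c, p_outlier)
print(y_hat)  # [-1.  1. -3.]
```

With probability 0 the prediction coincides with the optimistic one, with probability 1 it coincides with the conservative one, and intermediate values interpolate linearly between them.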
Remark: In the case of high-dimensional features and , one may use regularized methods, such as ridge regression, Lasso, or the tuning-free SPICE method [10], to learn , and .
V Experimental results
We evaluate the robustness of the proposed predictor using both synthetic and real data.
V-A Synthetic data
Consider the following data-generating process of , , and :
(18) 
where are -dimensional t-distributed latent variables with degrees of freedom and are white Gaussian processes of corresponding dimensions.
We evaluate predictors of a new outcome given only , where the vector is learned from sample training data. Specifically, we evaluate the optimistic , conservative and proposed predictors with respect to outliers for missing features with heavy tails (). A comparison of the conditional MSE functions for the learned predictors is given in Fig. 3, where it is seen that the robust predictor smoothly interpolates between two modes.
When averaging over , the MSE for each training dataset is illustrated in Fig. 4. We see that the robust predictor drastically reduces the outlier MSE, while leading to a small increase in the inlier MSE. The differences in MSEs, when averaged over across all training datasets, are summarized in Table I, which demonstrates the robustness of the proposed approach.
[Table I: rows labeled 100, 500, and 1000; the table entries are not recoverable in this copy.]
V-B Air quality data
Next, we demonstrate the proposed method using real-world air quality data. Nitrogen oxides (NOx) emitted by fossil-fuel vehicles are a major air pollutant in urban environments, with negative impacts on the health of inhabitants.
The aim here is to predict the daily average NOx concentration, denoted , based on the NOx and ozone (O3) measurements from previous days. That is, is of dimension and contains the daily average NOx and O3 levels from past days. In the training data, we also have access to , the O3 concentration at the same time as outcome . This feature is correlated with and . For prediction of a new outcome, however, is a missing feature.
The dataset contains 10 years of daily average NOx and O3 measurements from 2006-01-01 to 2015-12-31. The data is split into 7 years of training data (2006–2012) and 3 years of test data (2012–2015). Considering outliers when , Table II shows that we are able to reduce the outlier MSE by 10% while incurring a minimal increase of the inlier MSE.
[Table II: rows labeled 7, 28, and 56; the table entries are not recoverable in this copy.]
VI Conclusion
Based on orthogonality properties of an optimal oracle predictor, we developed a prediction method that is robust against outliers of the missing features. It is formulated as a convex combination of optimistic and conservative predictors, and requires only specifying the intended outlier level against which it must be robust. The ability of the robust predictor to suppress outlier errors, while incurring a minor increase in inlier errors, was demonstrated using both simulated and real-world datasets.
Appendix A Proofs
A-A Probability bound for (2)
The probability bound for an event follows readily from a Chebychev-type inequality:
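The display that followed this sentence is missing from this copy. One way such a bound can be obtained is Markov's inequality applied to a quadratic form. The notation is assumed rather than recovered from the source: centered missing features $z$ of dimension $d_z$ with covariance $\Sigma_z$, outlier level $\delta$, and a tail region of the form $\{z : z^\top \Sigma_z^{-1} z \ge d_z/\delta\}$.

```latex
\begin{align}
\Pr\left\{ z^\top \Sigma_z^{-1} z \ge \frac{d_z}{\delta} \right\}
  &\le \frac{\delta}{d_z}\, \mathrm{E}\!\left[ z^\top \Sigma_z^{-1} z \right] \\
  &= \frac{\delta}{d_z}\, \mathrm{tr}\!\left( \Sigma_z^{-1}\, \mathrm{E}\!\left[ z z^\top \right] \right)
   = \frac{\delta}{d_z}\, \mathrm{tr}\!\left( I_{d_z} \right)
   = \delta .
\end{align}
```

The first step is Markov's inequality for the nonnegative variable $z^\top \Sigma_z^{-1} z$. The second uses the cyclic property of the trace and $\mathrm{E}[z z^\top] = \Sigma_z$.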
A-B MSE decomposition (11)
The outcome can always be decomposed as
where
. The random variable
is orthogonal to and , and consequently to the residual . Thus we can express the prediction error as
Squaring this expression and taking the expectation yields (11). Similarly, inserting it into the constraint in (9) yields .
References
 [1] (2006; 2010) Semi-supervised learning. MIT Press, Cambridge, Mass.
 [2] (2010) Robust autoregression: Student-t innovations using variational Bayes. IEEE Transactions on Signal Processing 59 (1), pp. 48–57.
 [3] (2005) Revisiting autoregressive hidden Markov modeling of speech signals. IEEE Signal Processing Letters 12 (2), pp. 166–169.
 [4] (1994) Autoregressive conditional heteroskedasticity and changes in regime. Journal of Econometrics 64 (1–2), pp. 307–333.
 [5] (1964) Robust estimation of a location parameter. The Annals of Mathematical Statistics, pp. 73–101.
 [6] (2000) Linear estimation. Prentice Hall, Upper Saddle River, NJ. ISBN 9780130224644; 0130224642.
 [7] (1989) Robust statistical modeling using the t distribution. Journal of the American Statistical Association 84 (408), pp. 881–896.
 [8] (2019) Statistical analysis with missing data. Vol. 793, John Wiley & Sons.
 [9] (1982) Linear predictive hidden Markov models and the speech signal. In ICASSP'82, IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 7, pp. 1291–1294.
 [10] (2015) Online hyperparameter-free sparse estimation method. IEEE Transactions on Signal Processing 63 (13), pp. 3348–3359.