1 Introduction
Regression analysis is a statistical method for finding an unknown function that relates predictor variables to a real-valued response variable, given paired measurements of the predictor and response variables. Typically, the values of the predictor variables are given, and the conditional expectation or probability of the response variable for a given value of the predictor variables is estimated using the given measurements. In practice, the response variable measurements may contain outliers that do not follow the pattern of the other measurements. These outliers can produce a misleading outcome in regression analysis when conventional regression approaches are applied, unless the outliers are taken into account. Robust regression is an approach to overcome this limitation (Huber, 2011). This paper is concerned with a robust regression method.
This work is motivated by the need to analyze time-dependent measurements of environmental factors such as temperature and relative humidity at a specific geographic site over a three-month time period using the Luna Innovations corrosion and coatings evaluation system (CorRES^{™}), a multichannel electrochemical monitoring device that continuously records environmental parameters including temperature and relative humidity. These environmental variables may be used to predict more complex parameters related to atmospheric chemistry that affect degradation of aerospace materials. A sophisticated atmospheric monitoring system that incorporates gas monitoring sensors manufactured by Airpointer^{®} was used over the same three-month time period to obtain response variable measurements for model development. Data sets from these gas monitoring systems may contain a small percentage of outlier data associated with system startup, user operation, periodic calibration, and other uncontrollable factors such as power fluctuation. Establishing an accurate regression model with less influence from the outliers is crucial for understanding the patterns of the environmental factors relevant to degradation of aerospace materials. We are particularly interested in a robust Gaussian process regression with data collected by the CorRES and Airpointer monitoring systems, mainly because of the generality and flexibility of Gaussian process regression.
A Gaussian process (GP) regression is a Bayesian nonparametric approach for regression analysis (Rasmussen and Williams, 2006). In this approach, a GP prior is placed on an unknown regression function to express the prior belief about the function, and a Gaussian likelihood function is popularly used to model data as noisy observations of the unknown function with Gaussian noise. It is well known that the Gaussian likelihood is very sensitive to outliers in data, in that the mean and variance estimates are significantly affected by the outliers (Jylänki et al., 2011). This is mainly because the Gaussian density decays quickly in its tails, so data in the tail regions have very low likelihood values. To address this issue, different likelihood models induced from probability distributions with heavy-tailed behaviors have been tried, including the Student-t likelihood (Jylänki et al., 2011; Shah et al., 2014; Ranjan et al., 2016), the Laplace likelihood (Kuss, 2006; Ranjan et al., 2016), the Gaussian mixture likelihood (Naish-Guzman and Holden, 2008; Daemi et al., 2019), and a data-dependent noise model (Goldberg et al., 1998). Replacing the Gaussian likelihood with a non-Gaussian likelihood makes the derivation of the posterior distribution difficult, so many numerical approximation schemes have been sought at the cost of expensive computation, e.g., Markov chain Monte Carlo sampling and several variants of expectation propagation (Minka, 2001).
In this paper, we propose a simpler treatment of outliers. We regard outliers as noisy and biased observations of an underlying regression function, where the ‘bias’ means the deviation of the conditional expectation of the response variable from the regression function. Accordingly, we introduce a bias term in the likelihood that explains the deviation of outliers from the regression trend for a robust GP regression. This differs clearly from the existing approaches that try to explain outliers with different noise models. In our approach, the unknown bias term is estimated jointly with the hyperparameters of a covariance function of the GP. Once the bias is estimated, the posterior inference of the unknown regression function follows the standard GP regression solution procedure, given in an analytically closed form. This simple idea provides the posterior mean and variance estimates of the regression function in an analytically closed form, as in the standard GP regression approach, so it is computationally efficient. We will study how the posterior mean and variance estimates behave in comparison with the more complex approximation models discussed in the literature.
The remainder of this paper is organized as follows. Section 2 describes two bias models that link biased observations to an underlying regression function and details how the model parameters are estimated jointly with the other hyperparameters of the GP regression. We also explain in the same section how the robust GP estimates can be achieved with the bias models. In Section 3, we investigate the numerical performance of the proposed method in comparison with some existing robust GP approaches for various simulation scenarios of different outlier proportions, noise levels, and input dimensions of data. The numerical evaluation is extended in Section 4 with real environmental sensor data containing different patterns of outliers. We finally conclude this paper in Section 5.
2 Method
Consider a robust regression problem of estimating an unknown regression function $f$ that relates a $d$-dimensional predictor $x \in \mathbb{R}^d$ to a real response $y$, using its $n$ observations, $\{(x_i, y_i): i = 1, \dots, n\}$. The observed response $y_i$ is assumed a noisy and biased observation of $f(x_i)$,
$$y_i = f(x_i) + \delta_i + \epsilon_i, \quad i = 1, \dots, n,$$
where $\delta_i$ is the bias of $y_i$ deviating from $f(x_i)$, and $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ is a random white noise, independent of $f$. We regard outliers as data with large bias $\delta_i$. Notice that we do not use bold font for the multivariate predictor $x$ and reserve bold font for the collection of the observed predictor locations, $\boldsymbol{X} = (x_1, \dots, x_n)$. Some other notations for future use are described as follows. Let $\boldsymbol{y} = (y_1, \dots, y_n)^T$ denote the corresponding response vector, and let $\boldsymbol{\delta} = (\delta_1, \dots, \delta_n)^T$ denote a vector of the unknown bias values.
In GP regression, the unknown function $f$ is assumed a realization of a Gaussian process with zero mean and covariance function $k(\cdot, \cdot)$. The covariance function popularly used for $k$ is a parametric covariance function such as the Matérn covariance function or the exponential covariance function. We denote the parameters of the covariance by $\theta$. Given the GP prior, the observations $\boldsymbol{y}$ are used to obtain the posterior predictive distribution of $f_* = f(x_*)$ at an unobserved location $x_*$. The joint distribution of $(\boldsymbol{y}, f_*)$ follows a multivariate normal distribution,
$$\begin{pmatrix} \boldsymbol{y} \\ f_* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \boldsymbol{\delta} \\ 0 \end{pmatrix}, \begin{pmatrix} \boldsymbol{K}_{\boldsymbol{X}\boldsymbol{X}} + \sigma^2 \boldsymbol{I} & \boldsymbol{k}_{\boldsymbol{X}*} \\ \boldsymbol{k}_{*\boldsymbol{X}} & k_{**} \end{pmatrix} \right), \qquad (1)$$
where $\boldsymbol{k}_{*\boldsymbol{X}} = \boldsymbol{k}_{\boldsymbol{X}*}^T$, $k_{**} = k(x_*, x_*)$, and $\boldsymbol{K}_{\boldsymbol{X}\boldsymbol{X}}$ is an $n \times n$ matrix with $(i,j)$th entry $k(x_i, x_j)$. The subscripts on $\boldsymbol{K}$, $\boldsymbol{k}$ and $k$ indicate the two sets of locations between which the covariance is computed, and we have abbreviated the subscript $\{x_*\}$ as $*$.
If the parameters $\boldsymbol{\delta}$, $\sigma^2$ and $\theta$ are given, or their estimates $\hat{\boldsymbol{\delta}}$, $\hat{\sigma}^2$ and $\hat{\theta}$ are given, we can apply the Gaussian conditioning formula to the joint distribution (1) to achieve the posterior predictive distribution of $f_*$ given $\boldsymbol{y}$,
$$f_* \mid \boldsymbol{y} \sim \mathcal{N}(\mu_*, \sigma_*^2), \quad \mu_* = \boldsymbol{k}_{*\boldsymbol{X}} (\boldsymbol{K}_{\boldsymbol{X}\boldsymbol{X}} + \hat{\sigma}^2 \boldsymbol{I})^{-1} (\boldsymbol{y} - \hat{\boldsymbol{\delta}}), \quad \sigma_*^2 = k_{**} - \boldsymbol{k}_{*\boldsymbol{X}} (\boldsymbol{K}_{\boldsymbol{X}\boldsymbol{X}} + \hat{\sigma}^2 \boldsymbol{I})^{-1} \boldsymbol{k}_{\boldsymbol{X}*}. \qquad (2)$$
The predictive mean $\mu_*$ is taken to be the point prediction of $f$ at location $x_*$, and its uncertainty is measured by the predictive variance $\sigma_*^2$. In the subsections below, we will explain how to estimate the parameters.
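For concreteness, the conditioning step in (2) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation (their experiments used MATLAB): the squared-exponential kernel, the function names, and the toy hyperparameter values are our own choices.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of row-vector locations."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, Xstar, delta_hat, noise_var, **kern):
    """Posterior mean/variance of f at Xstar: subtract the estimated bias
    from y, then apply the standard GP conditioning formula."""
    K = sq_exp_kernel(X, X, **kern) + noise_var * np.eye(len(X))
    Ks = sq_exp_kernel(Xstar, X, **kern)        # k_{*X}
    Kss = sq_exp_kernel(Xstar, Xstar, **kern)   # k_{**}
    L = np.linalg.cholesky(K)                   # stable solves via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - delta_hat))
    V = np.linalg.solve(L, Ks.T)
    mean = Ks @ alpha
    var = np.diag(Kss) - (V ** 2).sum(axis=0)
    return mean, var
```

With the bias estimate subtracted from the responses, the remaining computation is identical to standard GP regression, which is why the proposed approach retains a closed-form posterior.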
2.1 Constant Bias Model
We first regard the unknown bias $\boldsymbol{\delta}$ as a constant vector to estimate, and the likelihood function becomes a Gaussian likelihood with data-dependent means,
$$\boldsymbol{y} \mid \boldsymbol{\delta} \sim \mathcal{N}(\boldsymbol{\delta}, \; \boldsymbol{K}_{\boldsymbol{X}\boldsymbol{X}} + \sigma^2 \boldsymbol{I}),$$
which is different from the data-dependent noise variance model (Goldberg et al., 1998). The bias vector $\boldsymbol{\delta}$ can be estimated jointly with the other parameters, the noise variance $\sigma^2$ and the covariance parameter $\theta$. The negative log likelihood function of the parameters is
$$\mathcal{L}(\boldsymbol{\delta}, \sigma^2, \theta) = \frac{1}{2} (\boldsymbol{y} - \boldsymbol{\delta})^T (\boldsymbol{K}_{\boldsymbol{X}\boldsymbol{X}} + \sigma^2 \boldsymbol{I})^{-1} (\boldsymbol{y} - \boldsymbol{\delta}) + \frac{1}{2} \log |\boldsymbol{K}_{\boldsymbol{X}\boldsymbol{X}} + \sigma^2 \boldsymbol{I}| + \frac{n}{2} \log 2\pi.$$
Estimating the parameters by simply minimizing the negative log likelihood is not a great idea. First of all, the problem is not well posed, since the negative log likelihood function has infinitely many minima. The estimate $\hat{\boldsymbol{\delta}}$ is easily overfit to $\boldsymbol{y}$, and the fit to the residual $\boldsymbol{y} - \hat{\boldsymbol{\delta}}$ gives an oversmoothed estimate of $f$, as illustrated in Figure 1.
We place a certain regularization on $\boldsymbol{\delta}$ for the well-posedness of the problem. We assume that outliers occur only occasionally, so the bias term should be sparse with many zero elements and very few nonzero elements. We therefore add an $\ell_1$ regularization on the bias term to the negative log likelihood,
$$\mathcal{L}_\lambda(\boldsymbol{\delta}, \sigma^2, \theta) = \mathcal{L}(\boldsymbol{\delta}, \sigma^2, \theta) + \lambda \|\boldsymbol{\delta}\|_1,$$
where $\|\cdot\|_1$ is the $\ell_1$ norm. The regularized negative log likelihood function is minimized to optimize $\boldsymbol{\delta}$, $\sigma^2$ and $\theta$ jointly. These regularized ML estimates give a better posterior estimate of $f$, as illustrated in Figure 2.
In fact, adding the $\ell_1$ regularization on $\boldsymbol{\delta}$ to the negative log likelihood function is equivalent to assuming a zero-mean Laplace prior on each $\delta_i$,
$$p(\delta_i) = \frac{\lambda}{2} \exp(-\lambda |\delta_i|),$$
and then taking a maximum a posteriori probability (MAP) estimate of $\boldsymbol{\delta}$. Please note that with the Laplace prior, the negative log joint density of $\boldsymbol{y}$ and $\boldsymbol{\delta}$ is proportional to $\mathcal{L}_\lambda$,
$$-\log p(\boldsymbol{y}, \boldsymbol{\delta}) = \mathcal{L}(\boldsymbol{\delta}, \sigma^2, \theta) + \lambda \|\boldsymbol{\delta}\|_1 + C = \mathcal{L}_\lambda(\boldsymbol{\delta}, \sigma^2, \theta) + C, \qquad (3)$$
where $C = -n \log(\lambda/2)$ is independent of $\boldsymbol{\delta}$, $\sigma^2$ and $\theta$. Therefore, minimizing $\mathcal{L}_\lambda$ gives a solution that corresponds to the MAP estimate of $\boldsymbol{\delta}$.
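To make the role of the $\ell_1$ penalty concrete, here is a small sketch of the bias-vector subproblem. With the covariance hyperparameters held fixed (the paper optimizes them jointly; fixing them is our simplification), minimizing the penalized quadratic over the bias vector is a lasso-type problem that proximal gradient descent (ISTA) solves via soft-thresholding. The function names and toy settings are ours.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding operator, the proximal map of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def estimate_bias_l1(K, noise_var, y, lam, n_iter=3000):
    """ISTA minimization of 0.5*(y-d)' S^{-1} (y-d) + lam*||d||_1,
    where S = K + noise_var*I (kernel hyperparameters held fixed)."""
    S = K + noise_var * np.eye(len(y))
    S_inv = np.linalg.inv(S)
    step = 1.0 / np.linalg.eigvalsh(S_inv).max()  # 1 / Lipschitz constant
    delta = np.zeros_like(y)
    for _ in range(n_iter):
        grad = S_inv @ (delta - y)                # gradient of the quadratic term
        delta = soft_threshold(delta - step * grad, step * lam)
    return delta
```

Because a smooth trend is cheap under the GP covariance while an isolated spike is expensive, the optimizer assigns a nonzero bias only to observations that deviate sharply from the trend, leaving the rest exactly zero.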
2.2 Random Bias Model
In this section, we consider the unknown bias as a random quantity. We first consider a Gaussian prior distribution for the random quantity,
$$\delta_i \sim \mathcal{N}(\mu_{b,i}, \sigma_{b,i}^2), \quad i = 1, \dots, n, \qquad (4)$$
where $\mu_{b,i}$ is the mean of the bias, and $\sigma_{b,i}^2$ is the variance of the bias. In this case, the conditional distribution of $\boldsymbol{y}$ given $\boldsymbol{X}$ is
$$\boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}_b, \; \boldsymbol{K}_{\boldsymbol{X}\boldsymbol{X}} + \sigma^2 \boldsymbol{I} + \boldsymbol{\Sigma}_b),$$
where $\boldsymbol{\mu}_b = (\mu_{b,1}, \dots, \mu_{b,n})^T$ and $\boldsymbol{\Sigma}_b = \mathrm{diag}(\sigma_{b,1}^2, \dots, \sigma_{b,n}^2)$. This overparameterized model is not a great choice, because the number of parameters surpasses the number of data, which incurs a non-uniqueness issue in the parameter estimation.
We consider the more restricted models described below. Depending on the restriction, the model has different likelihoods. For example, when $\mu_{b,i} = 0$ and $\sigma_{b,i}^2 = \sigma_b^2$ for all $i$, the likelihood is reduced to a standard Gaussian likelihood, because
$$\boldsymbol{y} \sim \mathcal{N}(\boldsymbol{0}, \; \boldsymbol{K}_{\boldsymbol{X}\boldsymbol{X}} + \tilde{\sigma}^2 \boldsymbol{I}),$$
where $\tilde{\sigma}^2 = \sigma^2 + \sigma_b^2$. Obviously, we will not consider this case. When $\sigma_{b,i}^2 = 0$ for all $i$, the likelihood becomes similar to one for the constant bias model which we already discussed in the previous section, because
$$\boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}_b, \; \boldsymbol{K}_{\boldsymbol{X}\boldsymbol{X}} + \sigma^2 \boldsymbol{I}).$$
When $\mu_{b,i} = \mu_b$ for all $i$, the likelihood is similar to a Gaussian likelihood with data-dependent noise variances, because
$$\boldsymbol{y} \sim \mathcal{N}(\mu_b \boldsymbol{1}, \; \boldsymbol{K}_{\boldsymbol{X}\boldsymbol{X}} + \sigma^2 \boldsymbol{I} + \boldsymbol{\Sigma}_b).$$
In this section, we choose this last model as the random bias model and discuss the parameter estimation for it. Please note that, by letting $\mu_b = 0$, this data-dependent noise variance model is similar to the one discussed in Goldberg et al. (1998), where the data-dependent noise variance was regressed with a log-Gaussian process prior. Here we treat each $\sigma_{b,i}^2$ as a parameter to estimate.
To discuss how we estimate the variance parameters, let $\boldsymbol{\Theta} = \{\mu_b, \sigma^2, \theta, \sigma_{b,1}^2, \dots, \sigma_{b,n}^2\}$. Given the set of the parameters, the distribution of $\boldsymbol{y}$ is
$$\boldsymbol{y} \sim \mathcal{N}(\mu_b \boldsymbol{1}, \; \boldsymbol{K}_{\boldsymbol{X}\boldsymbol{X}} + \sigma^2 \boldsymbol{I} + \boldsymbol{\Sigma}_b), \qquad (5)$$
where $\boldsymbol{\Sigma}_b$ is a diagonal matrix with $\sigma_{b,i}^2$ as its $i$th diagonal element. The negative log likelihood function of the parameter set, $\mathcal{L}(\boldsymbol{\Theta})$, can be written with $\boldsymbol{C} = \boldsymbol{K}_{\boldsymbol{X}\boldsymbol{X}} + \sigma^2 \boldsymbol{I} + \boldsymbol{\Sigma}_b$ as
$$\mathcal{L}(\boldsymbol{\Theta}) = \frac{1}{2} (\boldsymbol{y} - \mu_b \boldsymbol{1})^T \boldsymbol{C}^{-1} (\boldsymbol{y} - \mu_b \boldsymbol{1}) + \frac{1}{2} \log |\boldsymbol{C}| + \frac{n}{2} \log 2\pi.$$
Minimizing the negative log likelihood directly for the parameter set can cause the overestimation of the bias parameters $\mu_b$ and $\sigma_{b,i}^2$, leading to an oversmoothed estimate as we saw in the previous section. We regularize $\mu_b$ with its squared $\ell_2$ norm and regularize each $s_i = \log \sigma_{b,i}^2$ with its linear and negative exponential terms,
$$\mathcal{L}_\gamma(\boldsymbol{\Theta}) = \mathcal{L}(\boldsymbol{\Theta}) + \frac{\mu_b^2}{2\sigma_\mu^2} + \sum_{i=1}^n \left( \alpha s_i + \beta e^{-s_i} \right).$$
The regularization term for $s_i$ is actually derived from the negative log density of an inverse-gamma distribution. When $\sigma_{b,i}^2 \sim \mathrm{IG}(\alpha, \beta)$, its density function is proportional to
$$(\sigma_{b,i}^2)^{-\alpha-1} \exp\left( -\frac{\beta}{\sigma_{b,i}^2} \right),$$
so the negative log density of $s_i = \log \sigma_{b,i}^2$ is $\alpha s_i + \beta e^{-s_i}$ up to an additive constant. In addition, the regularization term for $\mu_b$ is proportional to the negative log density of a normal distribution, $\mu_b \sim \mathcal{N}(0, \sigma_\mu^2)$.
The negative log joint density of $\boldsymbol{y}$, $\mu_b$ and $\boldsymbol{\Sigma}_b$ is
$$-\log p(\boldsymbol{y}, \mu_b, \boldsymbol{\Sigma}_b) = \mathcal{L}_\gamma(\boldsymbol{\Theta}) + C(\alpha, \beta, \sigma_\mu^2), \qquad (6)$$
where $C(\alpha, \beta, \sigma_\mu^2)$ is independent of $\boldsymbol{\Theta}$. Therefore, minimizing $\mathcal{L}_\gamma$ with respect to the parameter set gives the MAP estimates of $\mu_b$ and $\boldsymbol{\Sigma}_b$. The result of the parameter estimation is illustrated in Figure 3 with the same toy example used in the previous section. The estimates of the $\sigma_{b,i}^2$'s are highly concentrated on one value except for the two outliers, where the variance estimates are significantly higher to mitigate the effect of the outliers on the estimate of $f$.
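The objective of the random bias model can be sketched directly: a Gaussian marginal likelihood whose covariance carries an extra per-observation variance on the diagonal, plus an inverse-gamma penalty on each bias variance and a Gaussian penalty on the bias mean. This is our own illustrative sketch; the log-variance parameterization and the placeholder tuning values are assumptions.

```python
import numpy as np

def regularized_nll(s, mu_b, K, noise_var, y, alpha=1.0, beta=0.1, var_mu=1.0):
    """Regularized negative log likelihood of the random bias model:
    y ~ N(mu_b*1, K + noise_var*I + diag(exp(s))), plus an inverse-gamma
    penalty alpha*s + beta*exp(-s) on each log bias variance s[i] and a
    Gaussian penalty on mu_b. (alpha, beta, var_mu are placeholder values.)"""
    n = len(y)
    C = K + noise_var * np.eye(n) + np.diag(np.exp(s))
    r = y - mu_b
    _, logdet = np.linalg.slogdet(C)
    nll = 0.5 * (r @ np.linalg.solve(C, r) + logdet + n * np.log(2 * np.pi))
    reg = mu_b ** 2 / (2 * var_mu) + np.sum(alpha * s + beta * np.exp(-s))
    return nll + reg
```

Raising the bias variance at an outlying observation sharply reduces the data-fit term while incurring only a modest penalty, which is exactly the mechanism that lets the model absorb outliers without distorting the regression function.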
2.3 Tuning Parameter Selection
Both the constant bias model and the random bias model come with tuning parameters, $\lambda$ for the former and $(\alpha, \beta, \sigma_\mu^2)$ for the latter. We can choose the tuning parameters iteratively with the regularized ML estimation discussed in the previous sections. Let $(\hat{\boldsymbol{\delta}}, \hat{\sigma}^2, \hat{\theta})$ denote the estimates achieved by minimizing $\mathcal{L}_\lambda$ for an initial choice of $\lambda$. We plug the estimates into the negative log joint density in (3),
$$G(\lambda) = \mathcal{L}(\hat{\boldsymbol{\delta}}, \hat{\sigma}^2, \hat{\theta}) + \lambda \|\hat{\boldsymbol{\delta}}\|_1 - n \log \frac{\lambda}{2}. \qquad (7)$$
We minimize $G(\lambda)$ to update $\lambda$. This minimization and the minimization of $\mathcal{L}_\lambda$ are iteratively solved until the value of $\lambda$ converges. This iteration is actually equivalent to the coordinate descent algorithm minimizing the negative log joint density of $\boldsymbol{y}$, $\boldsymbol{\delta}$ and $\lambda$.
We perform the selection of the tuning parameters $(\alpha, \beta, \sigma_\mu^2)$ similarly. Let $\hat{\boldsymbol{\Theta}}$ denote the estimates achieved by minimizing $\mathcal{L}_\gamma$ for an initial choice of the tuning parameters. Plugging the estimates into the negative log joint density in (6), we have
$$H(\alpha, \beta, \sigma_\mu^2) = \mathcal{L}_\gamma(\hat{\boldsymbol{\Theta}}) + C(\alpha, \beta, \sigma_\mu^2), \qquad (8)$$
where $C(\alpha, \beta, \sigma_\mu^2)$ collects the normalization constants of the priors on $\mu_b$ and the $\sigma_{b,i}^2$'s, which depend on the tuning parameters but not on $\boldsymbol{\Theta}$. We minimize $H$ to update $(\alpha, \beta, \sigma_\mu^2)$. This minimization and the minimization of $\mathcal{L}_\gamma$ are iteratively solved until the tuning parameter values converge.
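As a small numerical check of the tuning-parameter update for the constant bias model: if the Laplace prior's normalizing constant is included (this constant is our reconstruction), the only part of the negative log joint density that varies with $\lambda$ is $\lambda \|\hat{\boldsymbol{\delta}}\|_1 - n \log(\lambda/2)$, and its minimizer has the closed form $\lambda = n / \|\hat{\boldsymbol{\delta}}\|_1$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def lambda_objective(lam, delta_l1, n):
    """Negative log joint density as a function of lambda only: with the
    current estimates fixed, only the Laplace prior terms vary with lambda."""
    return lam * delta_l1 - n * np.log(lam / 2.0)

# Numerical minimization should agree with the closed-form update n/||delta||_1.
delta_l1, n = 3.7, 300   # illustrative values for ||delta_hat||_1 and sample size
res = minimize_scalar(lambda_objective, bounds=(1e-6, 1e6),
                      args=(delta_l1, n), method="bounded")
```

Having a closed-form update for $\lambda$ makes each outer iteration of the coordinate descent essentially free; the cost is dominated by the inner minimization over the model parameters.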
3 Simulated Examples
In this section, we present the numerical outcomes of our two proposed bias models for a comprehensive set of simulation scenarios, comparing them with the fractional EP approximation with the Student-t likelihood (Jylänki et al., 2011) and the MCMC approach with the Gaussian scale mixture likelihood (Gelman et al., 2013). We use two synthetic datasets, one with one input dimension and another with ten input dimensions. For each dataset, eight simulation scenarios are generated with different outlier patterns, and 15 replicated experiments were performed for each scenario. Below are the details of how we create the scenarios.
The first dataset with one input dimension is generated from a fixed underlying regression function $f$. The 300 training samples, denoted as $\{(x_i, y_i): i = 1, \dots, 300\}$, are randomly sampled with the inputs drawn uniformly over the input domain and
$$y_i = f(x_i) + e_i,$$
where $e_i$ is a random variable following a mixture of two Gaussian distributions. We used $e_i \sim (1-p)\,\mathcal{N}(0, \sigma^2) + p\,\mathcal{N}(\mu_o, \sigma_o^2)$, where $p$ quantifies the fraction of outliers, $\mu_o$ is the mean bias of the outliers, and $\sigma_o^2$ is the variance around the mean. We varied the fraction of outliers $p$ and considered different values of the outlier mean $\mu_o$. We also considered $\sigma_o$ to be either one sixth or one twelfth of $\mu_o$. If it is one sixth, the samples from $\mathcal{N}(\mu_o, \sigma_o^2)$ slightly overlap with those from $\mathcal{N}(0, \sigma^2)$, so the outliers are less distinct from the normal data. If it is one twelfth, the two are less likely to overlap. Combining the possible values, we have eight different test scenarios in total. The 1000 test samples, denoted as $\{(x_t, y_t): t = 1, \dots, 1000\}$, are randomly sampled in the same manner.
The second dataset has ten input dimensions. This synthetic dataset was first introduced by Friedman (1991) and has been used in many papers on robust GP approaches (Kuss, 2006). In the dataset, the underlying regression function at a 10-dimensional input $x = (x_1, \dots, x_{10})$ depends only on the first five input dimensions,
$$f(x) = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5. \qquad (9)$$
The 300 training samples and the 1000 test samples are generated following the same procedure used in the first dataset.
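A sketch of the scenario generation for the 10D case, assuming the two-component Gaussian mixture noise described above; the particular parameter values below are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def friedman(X):
    """Friedman (1991) function; only the first five of the ten inputs matter."""
    return (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
            + 10 * X[:, 3] + 5 * X[:, 4])

def make_data(n, p=0.1, sigma=1.0, mu_o=15.0, ratio=6.0, seed=0):
    """Sample n inputs on [0,1]^10 and add mixture noise:
    N(0, sigma^2) w.p. 1-p, N(mu_o, (mu_o/ratio)^2) w.p. p (outliers)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, 10))
    is_out = rng.uniform(size=n) < p
    noise = np.where(is_out,
                     rng.normal(mu_o, mu_o / ratio, size=n),
                     rng.normal(0.0, sigma, size=n))
    return X, friedman(X) + noise, is_out
```

Setting `ratio` to 6 or 12 reproduces the "one sixth" versus "one twelfth" spread of the outlier component relative to its mean.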
For each test scenario, we evaluated the four methods: our constant bias model (COB), our random bias model (RAB), the fractional EP approximation with the Student-t likelihood (EP), and the MCMC approach with the Gaussian scale mixture likelihood (MCMC). We collected the total computation time (TIME) of the four methods to judge their computational efficiency. For comparison of prediction accuracy, we use two measures on the test data, denoted as $\{(x_t, y_t): t = 1, \dots, T\}$. The first measure is the mean squared error (MSE),
$$\mathrm{MSE} = \frac{1}{T} \sum_{t=1}^{T} (y_t - \mu_t)^2, \qquad (10)$$
which measures the accuracy of the mean prediction $\mu_t$ at location $x_t$. The second one is the negative log predictive density (NLPD),
$$\mathrm{NLPD} = \frac{1}{T} \sum_{t=1}^{T} \left[ \frac{(y_t - \mu_t)^2}{2\sigma_t^2} + \frac{1}{2} \log \left( 2\pi \sigma_t^2 \right) \right], \qquad (11)$$
which considers the accuracy of the predictive variance $\sigma_t^2$ as well as the mean prediction $\mu_t$. These two criteria have been used broadly in the GP regression literature. A smaller value of MSE or NLPD indicates better performance.
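The two criteria are straightforward to compute from a method's predictive means and variances on the test set; a minimal sketch (function names ours):

```python
import numpy as np

def mse(y, mu):
    """Mean squared error of the predictive means, the (10)-style average."""
    return np.mean((y - mu) ** 2)

def nlpd(y, mu, var):
    """Average Gaussian negative log predictive density: penalizes inaccurate
    means and over- or under-confident predictive variances alike."""
    return np.mean(0.5 * np.log(2 * np.pi * var) + (y - mu) ** 2 / (2 * var))
```

Unlike MSE, NLPD grows when a method reports variances that are too small for its actual errors, which is exactly how the variance underestimation of the EP approach shows up in the results below.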
For this numerical analysis, we used the GPstuff version 4.7 MATLAB package (Vanhatalo et al., 2013) for the MCMC and EP methods. We let all hyperparameters be optimized rather than fixed. We implemented our two proposed approaches in MATLAB. For all four methods, we applied the same squared exponential covariance function. The results for the two datasets are presented in the two subsections below.
3.1 1D Synthetic Dataset
This 1D example is mainly used to illustrate and compare the predictive means and variances of the four compared methods in a low dimensional case. Figure 4 shows the outcomes of the four methods for one representative scenario. All four methods perform comparably, although the EP approach underestimates the predictive variance for this example, giving narrower confidence intervals.
We performed a more quantitative analysis on the 15 replicated outcomes of MSE, NLPD and TIME for each of the eight simulation scenarios. The MSE performance is summarized using the bar charts in Figure 5. The average MSE values over the 15 replicated outcomes are very close among the four methods, but the variations of the MSE values differ significantly. The random bias model showed larger variations than the other methods, mainly due to a few badly performing cases. Both the MCMC and EP approaches were quite stable. However, the MCMC approach showed a couple of significantly bad cases, as seen in Figure 5(g), mainly due to MCMC sampling variations. The EP approach showed three significantly bad cases, as seen in Figure 5(f), which occurred when the EP converged slowly relative to the well-performing cases. The outcomes from the constant bias model were less varied and more consistent than those from the other methods.
The NLPD performance was similar to the MSE performance, except that the EP's average NLPD values are significantly higher than the corresponding values for the other compared methods; see, for example, Figures 6(b) and 6(d). In these two scenarios, the MSE performance of the EP was similar to those of the other methods, so the differences in the NLPD are mainly due to the differences in the predictive variance estimates. As illustrated in Figure 4, the EP underestimates the predictive variances for some scenarios. Besides, the variations in the NLPD values for the EP are larger for these two scenarios. Again, the random bias model showed relatively larger variations in the NLPD values for a couple of the simulated scenarios.
The computation times of the four methods are quite different. The MCMC approach takes more computation time than the other methods, and the EP is the second slowest. Our proposed constant bias model is the fastest among the four methods. The constant bias model provides a very efficient robust GP regression solution with strong and less varying MSE and NLPD performances, which is the major benefit of using the proposed constant bias model.
3.2 10D Friedman Dataset
We use this example to see how the four compared methods perform for higher dimensional inputs. The Friedman data has 10 input dimensions, five of which are unrelated to the response variable but add complexity to the data analysis. The main interest here is how the accuracy and computation time of the four compared methods are affected by the increased input dimension. As in the previous example, we collected the MSE, NLPD and TIME metrics for the four compared methods.
Figure 8 summarizes the MSE values of the four methods for the eight test scenarios. Different from the 1D dataset, the averages and variations of the MSE values differ significantly among the four compared methods. Our proposed constant bias model (COB) was the best performer for all eight scenarios with significant gaps to the other methods; its average MSE values are the lowest while the variations of the MSE values are relatively small. Our proposed random bias model (RAB) was consistently the second best, and the MCMC approach was the third best. The EP approach showed a significant deterioration of performance as the input dimension increased. A major benefit of using the COB is that its MSE performance changes much less between the 1D dataset and this dataset. For example, its average MSE with the 1D data was 0.2649 while that with this 10D dataset was 0.4208, but the gaps are much more significant for the other three methods, as shown in Table 1.
                       RAB     COB     MCMC    EP
  1D Synthetic Data    0.2994  0.2674  0.2601  0.2598
  10D Friedman Dataset 1.9074  0.4209  2.2994  7.8828
Figure 9 compares the NLPD performances of the four methods. The NLPD performance is quite comparable to the MSE performance. The proposed COB is the best performer, with the lowest average NLPD values and small variations. The NLPDs of the RAB and MCMC were comparable.
Figure 10 compares the total computation times. The COB is the fastest. It is interesting to see that the EP slowed down significantly for this higher dimensional example, becoming slower than the MCMC. Besides, the variation in the computation times was very significant for the EP, mainly because the convergence of the expectation propagation algorithm varied significantly over the 15 replicated experiments.
The main message from the two simulated studies is clear. The predictive power of the proposed COB method is very competitive and stable, which does not depend significantly on training samples and input dimensions of the dataset. Its computation time is faster than the existing approximation methods by more than one order of magnitude, and it is also less dependent on the input dimension.
4 Real Examples – Measurement of Environmental Variables
In this section, we use environmental data collected by two monitoring systems deployed by the U.S. Air Force for development of atmospheric corrosion models of aluminum alloys (AA). We considered temperature and humidity from the CorRES^{™} sensing system, and nitrogen oxides (NO_x) and ozone (O_3) gas concentrations in parts per billion (ppb) from the Airpointer^{®} system. Both systems were placed at an outdoor environmental test site operated by the U.S. Naval Research Laboratory in Key West, FL. In this study, the variables of time, ambient temperature in degrees Celsius, and relative humidity are used as three-dimensional predictors, and each of the gas concentrations is used as a response variable. Multiple causes for the gas concentration outlier data were identified, including loss of communications from sensors and data produced during prescribed user calibrations. The NO_x data have more intermittent outliers, while the O_3 data contain more clustered outliers (Figure 11). For the O_3, the outliers deviate strongly from the normal data. Although outliers in the gas concentration data are only a small fraction of the total data, a computationally efficient process for accounting for these outliers increases the utility of the environmental monitoring systems and data sets. The capability of the robust GP regression to account for response variable outliers in the gas monitoring data sets was determined relative to the existing approaches for these two distinct NO_x and O_3 outlier scenarios.
Another major contrast with the simulated datasets is that this real dataset contains more measurements. In total, the dataset has 1,400 measurements for each of the gas concentrations. We manually identified outliers, as indicated with red circles in Figure 11. There are five outliers for the NO_x gas and 45 outliers for the O_3 gas. For each of the gas types, we randomly selected 10% or 20% of the 1,400 records, excluding the outliers, to serve as a test dataset; the test dataset thus contains only normal data. The training data contains the whole 1,400 records, with the records selected for the test dataset substituted by outliers randomly sampled from those we manually identified. Dropping our proposed random bias model (RAB) from this comparison due to its poor performance for this example, we evaluate the three remaining methods in terms of the three performance metrics, MSE, NLPD and TIME. In this study, we applied the exponential kernel for all of the compared methods.
Table 2 summarizes the outcomes with the MSE, NLPD and TIME evaluations. The MCMC approach worked best for the NO_x data, and the proposed COB model performed close to the MCMC approach, although there is a meaningful gap between them. Looking at their predictive means and variances as shown in Figure 12, we can see that both models work reasonably, but the COB model oversmooths the wiggly fluctuations of the NO_x measurements. The EP approach did not work very well, producing a flat predictive mean and an overly wide predictive variance around the mean, mainly due to the poor convergence of its hyperparameter optimization for this example.
For the O_3 measurements, the proposed COB model worked better in MSE, but the MCMC approach worked better in NLPD. Figure 13 shows the predictive means and variances of the three compared methods. Our proposed model better captures the overall trend of the normal data, while the MCMC approach is influenced by outliers at some test locations. The EP failed to find the mean line of the normal data, instead following the outlier data below the normal mean line.
All in all, the proposed COB model works better than or close to the best performing method for all of the simulated and real data cases, with much shorter computation time. The proposed model can be a very time-economic option for robust GP regression with great prediction accuracy and robustness.
  Dataset (Percentage Outliers)  Metric  COB       MCMC        EP
  NO_x (10%)                     MSE     0.1104    0.0436      0.5024
                                 NLPD    0.3605    0.7021      2.0554
                                 TIME    51 sec    1,070 sec   206 sec
  NO_x (20%)                     MSE     0.1974    0.0978      0.5267
                                 NLPD    0.6078    0.2193      1.3589
                                 TIME    53 sec    961 sec     175 sec
  O_3 (10%)                      MSE     9.2467    9.3090      160.2542
                                 NLPD    13.1931   6.4755      3.9824
                                 TIME    40 sec    983 sec     168 sec
  O_3 (20%)                      MSE     5.5220    8.0989      126.9505
                                 NLPD    4.3783    2.2006      6.8298
                                 TIME    50 sec    833 sec     991 sec
5 Conclusion
We presented a new computationally efficient method for robust GP regression. The approach regards data as noisy and biased observations of an underlying regression function. Accordingly, the likelihood linking data to the regression function includes bias terms. We modeled the bias terms in two different ways: biases as unknown fixed parameters and biases as unknown random quantities. We estimated the parameters of the resulting bias models with the proposed regularized maximum likelihood estimation. We detailed how the regularized likelihood functions are formulated and how the tuning parameters are chosen. After the bias model estimation, the robust GP regression can be achieved following the standard GP procedure. This bias modeling approach is very simple and computationally efficient, while providing excellent prediction accuracy and robustness to outliers. We examined its numerical performance in a comprehensive simulation study. For most of the simulation cases, the proposed approach outperformed the existing approaches of the fractional EP approximation with the Student-t likelihood and the MCMC approach with the Gaussian scale mixture likelihood. The performance of the proposed approach was also less sensitive to the proportion of outliers and the noise level of data, as well as the dimension of data. The computation time of the proposed approach was faster than the existing approaches by at least one order of magnitude. The new approach has also shown better tradeoffs between computation time and prediction accuracy than the existing approaches in our numerical study with environmental sensor data. We believe that the proposed bias model is a very attractive solution for robust GP regression.
We would like to acknowledge support for this work from the Air Force Research Laboratory Materials and Manufacturing Directorate and members of the AFRL Corrosion Integrated Product Team. The work was conducted under U.S. Federal Government Contract No. FA865015D5405.
References
 Daemi et al. (2019) Atefeh Daemi, Yousef Alipouri, and Biao Huang. Identification of robust Gaussian process regression with noisy input using EM algorithm. Chemometrics and Intelligent Laboratory Systems, 191:1–11, 2019.
 Friedman (1991) Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–67, 1991.
 Gelman et al. (2013) Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.
 Goldberg et al. (1998) Paul W Goldberg, Christopher KI Williams, and Christopher M Bishop. Regression with inputdependent noise: A Gaussian process treatment. In Advances in Neural Information Processing Systems, pages 493–499, 1998.
 Huber (2011) Peter J Huber. Robust Statistics. Springer, 2011.

 Jylänki et al. (2011) Pasi Jylänki, Jarno Vanhatalo, and Aki Vehtari. Robust Gaussian process regression with a Student-t likelihood. Journal of Machine Learning Research, 12(Nov):3227–3257, 2011.
 Kuss (2006) Malte Kuss. Gaussian process models for robust regression, classification, and reinforcement learning. PhD thesis, Technische Universität, 2006.
 Minka (2001) Thomas P Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.
 Naish-Guzman and Holden (2008) Andrew Naish-Guzman and Sean Holden. Robust regression with twinned Gaussian processes. In Advances in Neural Information Processing Systems, pages 1065–1072, 2008.
 Ranjan et al. (2016) Rishik Ranjan, Biao Huang, and Alireza Fatehi. Robust Gaussian process modeling using EM algorithm. Journal of Process Control, 42:125–136, 2016.
 Rasmussen and Williams (2006) C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
 Shah et al. (2014) Amar Shah, Andrew Wilson, and Zoubin Ghahramani. Studentt processes as alternatives to Gaussian processes. In Artificial Intelligence and Statistics, pages 877–885, 2014.
 Vanhatalo et al. (2013) Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen, Pasi Jylänki, Ville Tolvanen, and Aki Vehtari. GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research, 14(Apr):1175–1179, 2013.