Anomaly Detection and Removal Using Non-Stationary Gaussian Processes

07/02/2015 · Steven Reece, et al.

Abstract

This paper proposes a novel Gaussian process approach to fault removal in time-series data. Fault removal does not delete the faulty signal data but, instead, massages the fault from the data. We assume that only one fault occurs at any one time and model the signal with two separate non-parametric Gaussian process models, one for the physical phenomenon and one for the fault. In order to facilitate fault removal we introduce the Markov Region Link kernel for handling non-stationary Gaussian processes. This kernel is piece-wise stationary but guarantees that functions generated by it, and their derivatives (when required), are everywhere continuous. We apply this kernel to the removal of drift and bias errors in faulty sensor data and also to the recovery of EEG signals corrupted by EOG artifacts.


I Introduction

Gaussian processes (GPs) are experiencing a resurgence of interest. Current applications are in diverse fields such as geophysics, medical imaging, multi-sensor fusion [5] and sensor placement [1]. A GP is often thought of as a “Gaussian over functions” [7]. It can be used to construct a distribution over functions via a prior on the function’s values. The prior is specified through a positive-definite kernel, which determines the covariance between two outputs as a function of their corresponding inputs. A GP is fully described by its mean and covariance functions. Suppose we have a set of training data drawn from a noisy process:

    y_i = f(x_i) + \epsilon_i    (1)

where $f$ is the real process and $\epsilon_i$ is zero-mean Gaussian with variance $\sigma^2$. For convenience both inputs and outputs are aggregated into $X = \{x_i\}$ and $Y = \{y_i\}$ respectively. The GP estimates the value of the function $f$ at sample locations $X_*$. The basic GP regression equations are given in [7]:

    m(X_*) = K(X_*, X)\,[K(X, X) + \sigma^2 I]^{-1}\, Y    (2)
    C(X_*) = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma^2 I]^{-1}\, K(X, X_*)    (3)

where $m(X_*)$ is the marginal posterior mean at $X_*$ and $C(X_*)$ is the corresponding covariance. The prior covariance at $X_*$ is $K(X_*, X_*)$, where the covariance matrix $K(X, X')$ has elements $K_{ij} = k(x_i, x'_j)$. The function $k$ is called the kernel function. These functions come with parameters, $\theta$, which can be tuned to suit the application. The kernel function is chosen according to known properties of the underlying process, such as smoothness and stationarity. There are many existing kernels to choose from (see, for example, [7]). We will develop a new kernel for our fault recovery algorithm. The prior mean is traditionally set to zero and we follow this convention; however, the results in this paper can be readily generalised to non-zero prior means. The term $\sigma^2 I$ in Eqns 2 and 3 captures the observation noise.
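
As a concrete illustration, the following sketch implements Eqns 2 and 3 directly. The squared exponential kernel and the names X, Y, Xs and sigma are our own illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

def sq_exp(x, xp, h=1.0, l=1.0):
    # Squared exponential kernel k(x, x') with output scale h and length-scale l.
    return h**2 * np.exp(-0.5 * (x[:, None] - xp[None, :])**2 / l**2)

def gp_regress(X, Y, Xs, sigma=0.1):
    # Eqns (2) and (3): marginal posterior mean and covariance at test inputs Xs.
    K = sq_exp(X, X) + sigma**2 * np.eye(len(X))          # K(X,X) + sigma^2 I
    Ks = sq_exp(Xs, X)                                    # K(X*, X)
    mean = Ks @ np.linalg.solve(K, Y)                     # Eqn (2)
    cov = sq_exp(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)  # Eqn (3)
    return mean, cov

# Usage: noisy samples of a smooth function, estimated on a fine grid.
X = np.linspace(0.0, 1.0, 20)
Y = np.sin(2 * np.pi * X) + 0.1 * np.random.randn(20)
mean, cov = gp_regress(X, Y, np.linspace(0.0, 1.0, 100))
```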

The GP parameters $\theta$ (which include $\sigma$ and the hyperparameters associated with the covariance function) can be inferred from the data through Bayes’ rule:

    p(\theta \mid Y, X) \propto p(Y \mid X, \theta)\, p(\theta)    (4)

The parameters are usually given a vague prior distribution $p(\theta)$.
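
A sketch of how Eqn 4 might be evaluated in practice, assuming the squared exponential kernel above. The flat prior over a log-spaced grid of length-scales is our illustrative choice; with a vague prior the posterior is proportional to the marginal likelihood, so maximising it gives the MAP hyperparameters.

```python
import numpy as np

def log_marginal_likelihood(theta, X, Y):
    # log p(Y | X, theta) for a zero-mean GP; theta = (output scale, length-scale, noise).
    h, l, sigma = theta
    K = h**2 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / l**2)
    K += sigma**2 * np.eye(len(X))
    sign, logdet = np.linalg.slogdet(K)
    return -0.5 * (Y @ np.linalg.solve(K, Y) + logdet + len(X) * np.log(2 * np.pi))

X = np.linspace(0.0, 1.0, 20)
Y = np.sin(2 * np.pi * X) + 0.1 * np.random.randn(20)
grid = [(1.0, l, 0.1) for l in np.logspace(-2, 1, 30)]   # flat prior over a grid
best = max(grid, key=lambda th: log_marginal_likelihood(th, X, Y))
```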

The paper is organised as follows. Section II demonstrates how fault recovery can be achieved using non-stationary Gaussian processes. Section III reviews current covariance functions for modelling non-stationary processes. Then, Section IV presents the problem description as a piece-wise stationary problem with boundary constraints at the onset of faults. Section V presents the Markov Region Link (MRL) kernel for functions which are continuous at region boundaries, and this is extended to cases where function derivatives are continuous at region boundaries in Section VI. In Sections VII and VIII we demonstrate the efficacy of our approach on simulated data from a faulty sensor target estimation problem as well as a dataset involving EEG signals corrupted by EOG artifacts. Finally, we conclude in Section IX.

II Gaussian Process Fault Recovery

We assume that a faulty signal is the linear composition of the real physical phenomenon of interest and a deterministic offset induced by a fault either in the sensor measurement or in the physical phenomenon itself. Again, the signal is assumed to be drawn from a noisy process:

    y_i = f(x_i) + g(x_i) + \epsilon_i    (5)

where $f$ is the real process, $g$ is the fault process and $\epsilon_i$ is zero-mean Gaussian with variance $\sigma^2$.

Again, both inputs and outputs are aggregated into $X$ and $Y$ respectively. The GP regression equations, (2) and (3), are modified so that both the real and fault processes can be inferred separately:

    m_f(X_*) = K_f(X_*, X)\, K^{-1}\, Y,
    C_f(X_*) = K_f(X_*, X_*) - K_f(X_*, X)\, K^{-1}\, K_f(X, X_*),
    m_g(X_*) = K_g(X_*, X)\, K^{-1}\, Y,
    C_g(X_*) = K_g(X_*, X_*) - K_g(X_*, X)\, K^{-1}\, K_g(X, X_*),

where $K = K_f(X, X) + K_g(X, X) + \sigma^2 I$. $m_f(X_*)$ is the marginal posterior mean at $X_*$ and $C_f(X_*)$ is the corresponding covariance of the real process. $m_g(X_*)$ and $C_g(X_*)$ are the marginal posterior mean and corresponding covariance of the fault process, respectively. The prior covariances at $X_*$ are $K_f(X_*, X_*)$ and $K_g(X_*, X_*)$ for the real process and the fault process, respectively, where, again, each covariance matrix is generated by a kernel function. We may assign different kernels to the real and fault processes.
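
A minimal sketch of these modified regression equations, assuming the caller supplies the kernel matrices; the names Kf and Kg mirror the prior covariances of the real and fault processes.

```python
import numpy as np

def separate_processes(Kf, Kg, Kf_s, Kg_s, Y, sigma):
    # Posterior means of the real process f and fault process g at test points.
    # Kf, Kg: prior covariances at the training inputs; Kf_s, Kg_s: cross
    # covariances K_f(X*, X) and K_g(X*, X).
    K = Kf + Kg + sigma**2 * np.eye(len(Y))  # shared signal covariance
    alpha = np.linalg.solve(K, Y)
    return Kf_s @ alpha, Kg_s @ alpha        # m_f(X*), m_g(X*)
```

Subtracting the inferred fault mean from the observations then "massages" the fault from the data while leaving the underlying signal estimate intact.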

We shall consider two kinds of fault in this paper, drift and bias (see Figure 1), although our approach can be extended to arbitrary fault types. Both faults are temporary: they have a start time and an end time, and the fault process is assumed to be zero outside this time interval. Drift faults are gradual: starting from zero error they grow over time and then either shrink back to zero gradually or snap back to zero instantaneously. Bias faults are severe and immediate: they induce a significant error in the signal at onset which persists until the fault subsides, at which point the signal snaps back onto the real process. The drift fault is continuous at onset whereas the bias fault is discontinuous.

Fig. 1: Typical real process and faulty observations for (a) bias and (b) drift faults.

In order to model the drift fault we develop a kernel which guarantees that the fault is continuous at onset. We call this kernel the continuous, conditionally independent (or CCI) kernel. The CCI kernel has applications outside fault recovery: this non-stationary kernel splices two or more locally stationary kernels together whilst preserving function continuity throughout. Consequently, we present this kernel thoroughly, here for the first time.

III Non-Stationary Kernels

Many applications use stationary covariance functions for which the kernel is a function of the distance between the input points. Stationary covariance functions are appealing due to their intuitive interpretation and their relative ease of construction. Unfortunately, stationary kernels are not applicable in problems where the model hyperparameters (e.g. length-scale, amplitude) or the kernel family vary with the input. Consequently, non-stationary GP kernels have been proposed, such as the neural network kernel [11] and the Gibbs kernel [2].

Methods for deriving non-stationary kernels from stationary kernels have also been proposed. Perhaps the earliest approach was to assume a stationary random field within a moving window [3]. This approach works well when the non-stationarity is smooth and gradual. It fails when sharp changes in the kernel structure occur. An alternative solution is to introduce an arbitrary non-linear mapping (or warping) of the input and then apply a stationary covariance function in the warped space [9]. Unfortunately, this approach does not handle sharp changes between different locally applied kernels very well [4]. The mixture of GPs [10] approach uses the EM algorithm to simultaneously assign GP mixtures to locations and optimise their hyperparameters. Although the mixture approach can use arbitrary local GP kernels, it does not guarantee function continuity over GP kernel transitions. Paciorek [6] proposes a non-stationary GP kernel which guarantees continuity over region boundaries. Unfortunately, this approach requires that the local, stationary kernels belong to the same family.

III-A Example: The Gibbs Non-Stationary Kernel

Gibbs [2, 7] derived the covariance function:

    k(x, x') = \prod_{d=1}^{D} \left( \frac{2\, l_d(x)\, l_d(x')}{l_d(x)^2 + l_d(x')^2} \right)^{1/2} \exp\left( - \sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{l_d(x)^2 + l_d(x')^2} \right)    (6)

where each length-scale, $l_d(x)$, is an arbitrary positive function of $x$ and $D$ is the dimensionality of $x$. If the length-scale varies rapidly then the covariance drops off quite sharply due to the pre-factor in Eqn 6. As a consequence the inferred function estimate can be quite uncertain at length-scale boundaries. This is demonstrated in Figure 2(a), for which the length-scale changes abruptly between the region to the left of the boundary and the region to the right.
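
The following is a 1D sketch of Eqn 6; the step-change length-scale function is our illustrative choice, not part of the kernel definition.

```python
import numpy as np

def gibbs_kernel(x, xp, lengthscale):
    # Gibbs non-stationary kernel in 1D (Eqn 6 with D = 1); lengthscale is an
    # arbitrary positive function of the input.
    lx = lengthscale(x)[:, None]
    lxp = lengthscale(xp)[None, :]
    pre = np.sqrt(2.0 * lx * lxp / (lx**2 + lxp**2))
    return pre * np.exp(-(x[:, None] - xp[None, :])**2 / (lx**2 + lxp**2))

# A length-scale that changes abruptly at the boundary, as in Figure 2.
step_l = lambda x: np.where(x < 0.0, 1.0, 0.1)
t = np.linspace(-1.0, 1.0, 50)
K = gibbs_kernel(t, t, step_l)
```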

Fig. 2: Part (a) shows a function and its estimate obtained using the Gibbs non-stationary Gaussian process kernel. Part (b) shows a random sample drawn from the Gibbs non-stationary kernel, typically exhibiting a discontinuity where the length-scale changes.

Further, the Gibbs kernel does not guarantee that functions generated by the kernel are continuous. Figure 2(b) shows a typical sample drawn from the posterior Gaussian distribution represented in Figure 2(a).

III-B Example: Warping of the Input Space

This example demonstrates the limitation of modelling piece-wise stationary functions by warping the input space as proposed by [9]. Figures 3 and 4 show a continuous wedge function with low and high signal-to-noise ratio (SNR) respectively. The figures also show the mean and first standard deviation of two models:

  • a warped squared exponential kernel;

  • two functions, each generated from a squared exponential kernel on either side of the boundary, and continuous there.

Fig. 3: Low SNR: The left panel shows the ground truth (red line), observations (red dots) and GP estimate with mean (black line) and first standard deviation (grey region). The other panels show the distributions over the length-scales inferred for the piece-wise models on either side of the boundary.
Fig. 4: High SNR: The left panel shows the ground truth (red line), observations (red dots) and GP estimate with mean (black line) and first standard deviation (grey region). The other panels show the distributions over the length-scales inferred for the piece-wise models on either side of the boundary.

For low SNR the warping approach can smooth over features of high curvature such as the wedge apex. For high SNR the warping kernel produces a “bell” estimate as it is forced to fit a smooth kernel at the apex.

In many applications a completely different GP model may be required in different regions of the input space, and the family of non-stationary covariance functions currently available in the literature may be too restrictive for these problems, especially when there are function continuity conditions at region boundaries. We will show how arbitrary stationary GP kernels can be combined to form non-stationary GP covariance priors which preserve function continuity. We shall call the new kernel the Markov Region Link (MRL) kernel.

IV Problem Description

We will assume that a domain can be partitioned such that within regions (tiles) the process is stationary [4]. Furthermore, each region may be modelled by kernels from different families. For example, one region may be modelled by a Matérn kernel whereas a neighbouring region may be modelled by a mixture of squared exponential and periodic kernels. We do not assume that the functions generated by these kernels are independent between regions and, although we desire sharply changing GP kernels or hyperparameters at region boundaries, we would also like to preserve function continuity at the boundaries.

Two regions are labelled 1 and 2 and collectively they form the global region. A function over the global region is inferred at sample locations $X_*$ given training data at locations $X$. However, the training data locations are partitioned between the regions and the region boundary (in 1D problems the region boundary is often referred to as the change-point). Let $X_i$ be the locations internal to region $i$ and let $B$ be the boundary. Then $X = X_1 \cup B \cup X_2$.

We will assume that the function can be modelled using a stationary GP in each region and endeavour to design a global GP covariance prior which preserves the individual region kernels. We will also endeavour to preserve continuity where desired across region boundaries including, for example, function continuity or continuity of function derivatives. Thus, the necessary conditions are:

  1. The global kernel $K$ preserves the individual region kernels $K_i$. That is, $K(x, x') = K_i(x, x')$ for all $x, x'$ in region $i$ and all regions $i$.

  2. The global kernel preserves function continuity, or derivative continuity, at the boundary.

Proposition 1

If two regions, labelled 1 and 2, are joined at the boundary $B$ and a function defined over the regions is modelled by $K_1$ in region 1 and $K_2$ in region 2, and the function is continuous at the boundary, then:

    K_1(B, B) = K_2(B, B) = K(B, B).

The boundary covariance, $K(B, B)$, is a hyperparameter which can be learned from the training data.

V The Markov Region Link Kernel

We assume that the processes internal to each region are conditionally independent given the process at the boundary $B$. The corresponding graphical model is shown in Figure 5, where $f_1$ and $f_2$ are the processes internal to the regions labelled 1 and 2 and $f_B$ is the process at the boundary.

Fig. 5: Non-stationary GP prior graphical model.

The process in region 1 and at the boundary is modelled using the GP kernel $K_1$. The rows and columns of $K_1$ correspond to the stacked vector $(f_B, f_1)$:

    K_1 = \begin{pmatrix} K_1(B, B) & K_1(B, X_1) \\ K_1(X_1, B) & K_1(X_1, X_1) \end{pmatrix}.

Similarly, the process in region 2 and at the boundary is modelled using the GP kernel $K_2$, where the rows and columns correspond to the stacked vector $(f_B, f_2)$:

    K_2 = \begin{pmatrix} K_2(B, B) & K_2(B, X_2) \\ K_2(X_2, B) & K_2(X_2, X_2) \end{pmatrix}.

Of course, if the kernels both accurately model the prior covariance of the process at the boundary then:

    K_1(B, B) = K_2(B, B) = K(B, B).

So we condition both $f_1$ and $f_2$ on $f_B$ to yield conditional covariances $V_1$ and $V_2$ respectively:

    V_1 = K_1(X_1, X_1) - K_1(X_1, B)\, K(B, B)^{-1}\, K_1(B, X_1)

and

    V_2 = K_2(X_2, X_2) - K_2(X_2, B)\, K(B, B)^{-1}\, K_2(B, X_2).

The global prior covariance is then:

    K = \begin{pmatrix} K(B, B) & K_1(B, X_1) & K_2(B, X_2) \\ K_1(X_1, B) & K_1(X_1, X_1) & K(X_1, X_2) \\ K_2(X_2, B) & K(X_2, X_1) & K_2(X_2, X_2) \end{pmatrix},

where the rows and columns correspond to the stacked vector $(f_B, f_1, f_2)$. The cross-terms, $K(X_1, X_2)$, are:

    K(X_1, X_2) = E[\hat{f}_1 \hat{f}_2^T],

where $\hat{f}_1$ and $\hat{f}_2$ are the region function values conditioned on the function at the boundary:

    \hat{f}_1 = K_1(X_1, B)\, K(B, B)^{-1}\, f_B    (7)
    \hat{f}_2 = K_2(X_2, B)\, K(B, B)^{-1}\, f_B    (8)

Since $E[f_B f_B^T] = K(B, B)$ then:

    K(X_1, X_2) = K_1(X_1, B)\, K(B, B)^{-1}\, K_2(B, X_2).

As a corollary of this approach we can derive a Gaussian process kernel for 1D signals with a change-point at $x_B$:

Corollary 1

If $K_1$ and $K_2$ are two stationary GP kernels (not necessarily from the same family) which model region $x \le x_B$ and region $x \ge x_B$ respectively, and $\theta_1$ and $\theta_2$ are their hyperparameters, then:

    K(x, x') = \begin{cases} K_1(x, x'; \theta_1) & \text{if } x, x' \le x_B, \\ K_2(x, x'; \theta_2) & \text{if } x, x' \ge x_B, \\ K_1(x, x_B; \theta_1)\, k_{BB}^{-1}\, K_2(x_B, x'; \theta_2) & \text{if } x \le x_B \le x', \end{cases}

where:

    k_{BB} = K_1(x_B, x_B; \theta_1) = K_2(x_B, x_B; \theta_2).
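
A sketch of the Corollary 1 construction, under the assumption (required by Proposition 1) that both kernels agree on the boundary variance at $x_B$; the scalar kernels k1 and k2 are supplied by the caller.

```python
import numpy as np

def mrl_changepoint(x, xp, k1, k2, xB):
    # Markov Region Link kernel for 1D inputs with a change-point at xB.
    # k1 and k2 are scalar stationary kernels; Proposition 1 requires that
    # they agree on the boundary variance, k1(xB, xB) == k2(xB, xB).
    kBB = k1(xB, xB)
    K = np.empty((len(x), len(xp)))
    for i, xi in enumerate(x):
        for j, xj in enumerate(xp):
            if xi <= xB and xj <= xB:
                K[i, j] = k1(xi, xj)          # both inputs in region 1
            elif xi >= xB and xj >= xB:
                K[i, j] = k2(xi, xj)          # both inputs in region 2
            else:
                a, b = min(xi, xj), max(xi, xj)
                K[i, j] = k1(a, xB) * k2(xB, b) / kBB  # cross term via the boundary
    return K

# Two squared exponential kernels with different length-scales, equal variance at the boundary.
se = lambda l: (lambda a, b: np.exp(-0.5 * (a - b)**2 / l**2))
t = np.linspace(-1.0, 1.0, 40)
K = mrl_changepoint(t, t, se(1.0), se(0.1), xB=0.0)
```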

To demonstrate the new kernel we return to the problem in Figure 2. Using identical hyperparameters and observations as in Figure 2 the function estimate obtained using the Markov Region Link approach is shown in Figure 6.

Fig. 6: Function and estimate obtained using conditionally independent function segments described by stationary Gibbs Gaussian process kernels joined at the boundary.

VI Derivative Boundary Conditions

So far, we have developed covariance functions which preserve function continuity at the boundary. The approach can be extended to assert function derivative continuity at the boundary. The covariance between a function and any of its derivatives can be determined from the GP kernel [7]. For example, the prior covariance between the function and its first derivative is:

    \mathrm{Cov}[f(x), f'(x')] = \partial_{x'} k(x, x'),

where $f'(x') = \partial f(x') / \partial x'$ and $\partial_{x'} k$ is the partial derivative of the kernel with respect to its second argument. The covariance between the derivatives is:

    \mathrm{Cov}[f'(x), f'(x')] = \partial_x \partial_{x'} k(x, x').    (9)

The derivative variance at $x$ can be obtained by setting $x'$ to $x$ in Eqn 9. In our notation $\partial_x$ denotes the partial derivative with respect to the first argument, in this case $x$, and $\partial_x \partial_{x'}$ denotes double differentiation with respect to both $x'$ and then $x$.
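
For the squared exponential kernel these derivative blocks have closed forms; the sketch below computes them under that assumption.

```python
import numpy as np

def se_derivative_blocks(x, xp, h=1.0, l=1.0):
    # k(x,x'), Cov(f(x), f'(x')) and Cov(f'(x), f'(x')) for the squared
    # exponential kernel; the last block is Eqn (9) evaluated in closed form.
    r = x[:, None] - xp[None, :]
    k = h**2 * np.exp(-0.5 * r**2 / l**2)
    k_f_df = (r / l**2) * k                    # d k / d x'
    k_df_df = (1.0 / l**2 - r**2 / l**4) * k   # d^2 k / (d x d x')
    return k, k_f_df, k_df_df
```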

These relationships can be used to define non-stationary GP covariance priors which impose continuous function derivatives at region boundaries. The prior mean and covariance for both the regional and global priors are augmented to include the function derivative. For example, if the first derivative is added to the prior then the prior covariances for regions 1 and 2 become (we shall use a prime to denote the augmented covariances):

    K'_1 = \begin{pmatrix} K_1(B, B) & \partial_{x'} K_1(B, B) & K_1(B, X_1) \\ \partial_x K_1(B, B) & \partial_x \partial_{x'} K_1(B, B) & \partial_x K_1(B, X_1) \\ K_1(X_1, B) & \partial_{x'} K_1(X_1, B) & K_1(X_1, X_1) \end{pmatrix}

and:

    K'_2 = \begin{pmatrix} K_2(B, B) & \partial_{x'} K_2(B, B) & K_2(B, X_2) \\ \partial_x K_2(B, B) & \partial_x \partial_{x'} K_2(B, B) & \partial_x K_2(B, X_2) \\ K_2(X_2, B) & \partial_{x'} K_2(X_2, B) & K_2(X_2, X_2) \end{pmatrix}.

The rows and columns in $K'_1$ correspond to the stacked vector $(f_B, f'_B, f_1)$, where $f'_B$ denotes the function derivative at $B$. Similarly, the rows and columns in $K'_2$ correspond to the stacked vector $(f_B, f'_B, f_2)$. Consequently, we can use the approach outlined in Section V to construct a global prior for processes which are conditionally independent in each region. This is done by defining $K'(B, B)$ as follows:

    K'(B, B) = \begin{pmatrix} K(B, B) & \partial_{x'} K(B, B) \\ \partial_x K(B, B) & \partial_x \partial_{x'} K(B, B) \end{pmatrix}

and using $K'_1$ and $K'_2$ defined above in place of $K_1$ and $K_2$ in Section V.

Fig. 7: Function estimated using two stationary Gibbs kernels joined at the boundary with the constraint that the function's first derivative is continuous there.

Figure 7 shows the effect that the derivative constraint can have on the function estimate. Two stationary Gibbs kernels are used, with the same length-scales on either side of the boundary as in Figure 2. Clearly, the approach which imposes a continuous first derivative on the GP model produces a tighter function estimate at the boundary.

VII Application 1: Target Estimation with Faulty Sensors

We shall use a GP to estimate a target’s position over time as it is tracked by a simple sensor. However, the sensor is faulty and outputs readings which have either drifted gradually from the truth or undergone a sudden jolt away from the truth resulting in a fixed bias in the reading (see Figure 1).

The proposed filter algorithm operates on-line and infers a posterior distribution for the target's current position using observations of its previous locations. Smoothing from future observations is not considered. The target’s trajectory is described by the process $f$ and the sensor is subject to occasional faults. The sensor’s fault process, $g$, can be either a short-term fixed bias or it can drift over a period of time. The, possibly faulty, observation at time $t$ is:

    y_t = f(t) + g(t) + \epsilon_t,

where $\epsilon_t$ is zero-mean Gaussian with variance $\sigma^2$. We wish to estimate the target location $f(t)$ over time.

The processes $f$ and $g$ are modelled by GP kernels $K_f$ and $K_g$ respectively. We will assume that $f$ is stationary and we will use a simple squared exponential kernel to model the target dynamics. However, the fault is intermittent: it starts at time $T_s$ and ends at $T_e$. We model the fault process using a non-stationary kernel. Firstly, $K_g$ is zero over times $t < T_s$ and $t > T_e$, for which there is no fault. For a bias fault, we assume that the bias is a fixed offset and thus assert:

    K_g(t, t') = \lambda^2,    T_s \le t, t' \le T_e,

where $\lambda$ is a scale parameter representing the magnitude of the bias. We assume that the drift is gradual and build the drift model using a squared exponential kernel:

    K_g(t, t') = h^2 \exp\left( -\frac{(t - t')^2}{2 w^2} \right),

that will be modified shortly to accommodate a drift error starting from zero. The parameters $h$ and $w$ are the output and input scales, respectively, and again, $T_s \le t, t' \le T_e$.

The bias fault causes a jump in the observation sequence when the fault sets in at $T_s$. However, the drift fault is gradual: it is zero at $T_s$ and discontinuous at $T_e$, when the fault disappears (see Figure 1). We use the Markov Region Link kernel to construct the drift fault prior covariance and impose the continuity boundary condition at $T_s$. Using the approach set out in Section V the prior covariance for the drift fault becomes the block matrix:

    K_g = \begin{pmatrix} 0 & 0 & 0 \\ 0 & \Sigma & 0 \\ 0 & 0 & 0 \end{pmatrix}.

The first row and column are zero matrices for times less than $T_s$, corresponding to the period before the fault starts. The last row and column are zero matrices for times greater than $T_e$, corresponding to times after the fault has stopped. The central block, $\Sigma$, is the prior covariance over time samples during which the sensor is faulty. Continuity of the fault process at $T_s$ imposes $g(T_s) = 0$. Values for $\Sigma$ are obtained using Corollary 1 in Section V, with region 1 the (identically zero) pre-fault process, region 2 the squared exponential drift model, and the boundary at $T_s$.

The bias kernel is more straightforward:

    K_g = \begin{pmatrix} 0 & 0 & 0 \\ 0 & \lambda^2 \mathbf{1} & 0 \\ 0 & 0 & 0 \end{pmatrix},

where $\mathbf{1}$ is a matrix of ones and the rows and columns are interpreted as for the drift kernel above.
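
A sketch of these block-structured fault priors, assuming time samples t and a fault window [Ts, Te]; note the drift version omits the zero-at-onset boundary constraint described above, for brevity.

```python
import numpy as np

def bias_fault_cov(t, Ts, Te, lam=1.0):
    # Bias fault prior: constant covariance lam^2 between all pairs of times
    # inside the fault window [Ts, Te]; zero everywhere else.
    active = ((t >= Ts) & (t <= Te)).astype(float)
    return lam**2 * np.outer(active, active)

def drift_fault_cov(t, Ts, Te, h=1.0, w=1.0):
    # Drift fault prior: squared exponential inside the window, zero elsewhere.
    # (The zero-at-onset constraint from Section V is omitted here for brevity.)
    active = ((t >= Ts) & (t <= Te)).astype(float)
    K = h**2 * np.exp(-0.5 * (t[:, None] - t[None, :])**2 / w**2)
    return K * np.outer(active, active)
```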

Figure 8 shows typical tracks, observations and GP track estimates. The target trajectory and observation sequence are randomly generated. Notice that the algorithm has successfully corrected for the faulty observations.

Fig. 8: Typical target trajectory and faulty observations for (a) bias and (b) drift faults.

Parallel faults may also be modelled using our approach. A parallel fault comprises more than one basic fault occurring at the same time. In the following example the sensor measurement drifts to a new fixed bias offset. The fault can be modelled as a drift fault, without fault correction, followed immediately by a bias fault.

A SICK laser range sensor is used to track a person. The sensor is subject to a knock as it tracks the person. This knock results in the sensor drifting for a period before coming to rest. The sensor data thus exhibits a drift followed by a constant bias. Figure 9 shows a typical bearing range scan at a single time instance during the tracking procedure. Figure 10 shows the data obtained from our sensor (transformed to the laboratory centred Cartesian coordinates) and also the target’s true trajectory obtained by using objects in the environment with known location.

Fig. 9: Typical SICK sensor range/bearing scan of the laboratory.
Fig. 10: True and observed target trajectory obtained using SICK range sensor.

We use the MRL kernel to splice together the drift kernel and the bias kernel, and assert that the target observations are continuous at the transition from drift to bias fault. Both the transition time and the kernel-induced variance at that time are parameters of our model.

Figure 11 shows the ground truth and the one-standard-deviation estimate obtained when no fault is assumed, as the person moves from right to left. The magnitude of the error is under-estimated.

Fig. 11: GP target trajectory estimate obtained by ignoring fault.

Figure 12 shows the GP on-line estimated trajectory as well as an estimate of the fault. Both estimates are correctly determined in this case.

Fig. 12: GP target trajectory estimate obtained after recovering from faults.

VIII Application 2: EOG Artifact Removal From EEG Signals

In this example we detect and remove EOG artifacts from EEG signals by modelling the artifacts as drift-type faults. The recovery of the EEG signal is often treated as a blind source separation problem [8] in which ICA separately identifies the artifact-free EEG signal (which we refer to as EEG*) and the EOG signal. In our approach we use explicit models for the EEG* and EOG signals and stipulate that these signals are independent.

The “observed” EEG signal, $y$, is a mixture of EEG* and EOG artifact signals. The EEG* signal, $f$, is modelled as the combination of a smooth function, $u$ (generated by a GP with prior covariance $K_f$), and i.i.d. residuals $r_u$. The EOG artifact, $g$, is modelled as a piece-wise smooth function, $v$ (generated from a GP “fault” model with prior covariance $K_g$), and, similarly, i.i.d. residuals $r_v$:

    y = f + g,    f = u + r_u,    g = v + r_v,

where $u \sim N(0, K_f)$, $v \sim N(0, K_g)$, $r_u \sim N(0, \sigma_u^2 I)$, $r_v \sim N(0, \sigma_v^2 I)$ and $I$ is the identity matrix. The random vectors $u$, $v$, $r_u$ and $r_v$ are assumed to be mutually independent. The residuals, $r_u$ and $r_v$, are considered to be part of the signals and are therefore not treated as noise and are not filtered out.

We use a simple squared exponential kernel to model the EEG* signal. Typically the EOG signal is zero everywhere except within a small time window. Within this window the EOG artifact can be modelled as two smooth functions (not necessarily monotonic) which join at a spike near the centre of the window, so that the EOG signal is continuous but not differentiable at the join. Thus, the EOG prior covariance is chosen to be zero everywhere except between the artifact's start and end times, $T_s$ and $T_e$, respectively. We use the methods outlined in Section V to build the EOG artifact prior covariance. Between the start and end times the EOG artifact signal is modelled by two piece-wise squared exponential kernels joined mid-way between $T_s$ and $T_e$ so that they are continuous at the mid-point. Furthermore, the EOG signal is zero at $T_s$ and $T_e$.
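
The sketch below assembles such an EOG artifact prior from two squared exponential pieces joined at the mid-point spike via the MRL cross term of Section V; the scale values are illustrative, and the zero constraints at $T_s$ and $T_e$ themselves are omitted for brevity.

```python
import numpy as np

def eog_artifact_cov(t, Ts, Te, h=1.0, w=0.05):
    # Two squared exponential pieces joined (continuously but not smoothly) at
    # the mid-point spike; zero outside the artifact window [Ts, Te].
    tm = 0.5 * (Ts + Te)                       # mid-point spike location
    se = lambda a, b: h**2 * np.exp(-0.5 * (a - b)**2 / w**2)
    kBB = se(tm, tm)                           # shared boundary variance at the spike
    n = len(t)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if not (Ts <= t[i] <= Te and Ts <= t[j] <= Te):
                continue                       # artifact is zero outside the window
            if (t[i] - tm) * (t[j] - tm) >= 0:
                K[i, j] = se(t[i], t[j])       # both times on the same side of the spike
            else:
                K[i, j] = se(min(t[i], t[j]), tm) * se(tm, max(t[i], t[j])) / kBB
    return K
```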

The following GP equations determine the mean and covariance for the hidden variable $u$:

    \hat{u} = K_f\, K^{-1}\, y,    C_u = K_f - K_f\, K^{-1}\, K_f,

where $K = K_f + K_g + (\sigma_u^2 + \sigma_v^2) I$. Similar expressions can be obtained for $v$.

To track the EEG signal our algorithm determines the posterior over the EEG* signal sequentially over time, where $t$ is the current time and $1, \ldots, t-1$ are the previous times at which data points were obtained. Marginalising the hidden smooth functions and applying the standard Gaussian identities for conditional means and variances yields closed-form expressions for the posterior mean and variance of the artifact-free EEG signal, and similar reasoning leads to the corresponding expressions for the EOG artifact signal. These expressions determine the proportion of the EEG signal residual that is assigned to the EOG artifact signal and to the artifact-free EEG signal (EEG*).

Our model requires eight hyperparameters, collectively referred to as $\theta$: the scale heights and scale lengths for the GP models (we assume that both parts of the EOG model have the same scale height and length), the artifact start and end times, $T_s$ and $T_e$, and the residual variances $\sigma_u^2$ and $\sigma_v^2$. The likelihood used in Equation 4 to determine a distribution over the hyperparameter values is given by:

    p(y \mid \theta) = N(y;\, 0,\, K_f + K_g + (\sigma_u^2 + \sigma_v^2) I).

The hyperparameters are marginalised using Monte-Carlo sampling.
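
One way to realise this Monte-Carlo marginalisation is self-normalised importance sampling from the hyperparameter prior; the interface below (predict, log_likelihood, sample_prior) is our own framing, not the paper's.

```python
import numpy as np

def mc_marginalised_mean(predict, log_likelihood, sample_prior, n_samples=100):
    # Draw hyperparameter samples from the prior, weight each by its likelihood,
    # and average the resulting GP predictions (self-normalised importance weights).
    thetas = [sample_prior() for _ in range(n_samples)]
    logw = np.array([log_likelihood(th) for th in thetas])
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return sum(wi * predict(th) for wi, th in zip(w, thetas))
```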

Figure 13 shows a typical EEG signal which is corrupted by EOG artifacts. It also shows the one-standard-error confidence intervals for the artifact-free EEG* signal and the EOG artifact obtained using our algorithm. Figure 14 shows the mean difference between the original EEG signal and the inferred EEG* signal, indicating the expected proportion of the original signal that is retained in the EEG*.

Fig. 13: EEG signal (crosses) and 1 standard error confidence intervals for the EEG* (left panel) and EOG (right panel) signals obtained using the GP mixture model approach.
Fig. 14: Original EEG signal (dots) and difference (line) between original signal and the mean EEG* obtained using the GP mixture model approach.

IX Conclusions

This paper has presented a novel approach for constructing non-stationary Gaussian processes across region boundaries (i.e. change-points for 1D signals). It builds piece-wise stationary prior covariance matrices from stationary Gaussian process kernels and, where appropriate, asserts function continuity or continuity of any function derivative at the region boundaries. The approach has been successfully demonstrated on sensor fault detection and recovery and also on EEG signal tracking.

References

  • [1] R. Garnett, M. A. Osborne, and S. J. Roberts. Bayesian optimization for sensor set selection. In 9th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN’10), Stockholm, Sweden, pages 209–219, 2010.
  • [2] M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, Department of Physics, Cambridge University, 1997.
  • [3] T. C. Haas. Kriging and automated variogram modeling within a moving window. Atmospheric Environment, 24A:1759–1769, 1990.
  • [4] H.-M. Kim, B. K. Mallick, and C. C. Holmes. Analyzing nonstationary spatial data using piecewise Gaussian processes. Journal of the American Statistical Association (Theory and Methods), 100(470):653–668, June 2005.
  • [5] M. Osborne, A. Rogers, A. Ramchurn, S. Roberts, and N. R. Jennings. Towards real-time information processing of sensor network data using computationally efficient multi-output Gaussian processes. In IPSN 2008: International Conference on Information Processing in Sensor Networks, St. Louis, Missouri, 2008.
  • [6] C. J. Paciorek and M. J. Schervish. Nonstationary covariance functions for Gaussian process regression. In Proc. of the Conf. on Neural Information Processing Systems (NIPS). MIT Press, 2004.
  • [7] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
  • [8] S. J. Roberts, R. Everson, I. Rezek, P. Anderer, and A. Schlogl. Tracking ICA for eye-movement artefact removal. In Proc. EMBEC’99, Vienna, 1999.
  • [9] P. D. Sampson and P. Guttorp. Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 87:108–119, 1992.
  • [10] V. Tresp. Mixtures of Gaussian processes. In Proc. of the Conf. on Neural Information Processing Systems (NIPS 13), pages 654–660, 2001.
  • [11] C. K. I. Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216, 1998.