Robust Anomaly Detection Using Semidefinite Programming

04/03/2015
by   Jose A. Lopez, et al.
Northeastern University

This paper presents a new approach, based on polynomial optimization and the method of moments, to the problem of anomaly detection. The proposed technique only requires information about the statistical moments of the normal-state distribution of the features of interest and compares favorably with existing approaches (such as Parzen windows and 1-class SVM). In addition, it provides a succinct description of the normal state. Thus, it leads to a substantial simplification of the anomaly detection problem when working with higher-dimensional datasets.


1. Introduction

Detecting anomalies in data is a task that engineers and scientists of every discipline encounter on a fairly regular basis. Common approaches to identifying anomalous observations include: estimating the probability density function of the "normal" state using density-estimation methods, computing the Mahalanobis distance between a point and a sample distribution, using machine-learning methods to learn the normal state and perform 1-class classification, and using "whiteness" tests to detect when the differences between new samples and predictions (of linear models) become statistically colored. However, a disadvantage encountered while estimating the distribution of the normal state is that one never knows whether the available samples are representative of all the normal behaviors. This is especially true when there is a dearth of data with which to construct a reliable probability density estimate. In this paper, we address this challenge using recent results from the method of moments and polynomial optimization. The main idea is to use information about the statistical moments of the normal-state distribution of a given set of features to compute an upper bound on the probability of a given observation of a random variable with this distribution. The observation is then deemed anomalous when this bound falls below a given threshold. While in principle computing this bound is a challenging infinite-dimensional problem, as shown in the sequel, concepts from the theory of moments allow recasting it as a computationally tractable finite-dimensional optimization.

The generalized moment problem (GMP), defined below, has tremendous modeling power and has found use in a wide range of applications in areas such as probability, financial economics, engineering, and operations research [11]. However, its wider application has been held back because, in its fullest generality, the GMP is intractable [9]. On the other hand, recent advances in algebraic geometry have rendered the GMP tractable when the problem instance is described using polynomials [7, 11].

Formally, the GMP is given by:

$$\rho = \sup_{\mu \in \mathcal{M}(K)_+} \int_K f \, d\mu \quad \text{s.t.} \quad \int_K h_j \, d\mu \;\lesseqgtr\; \gamma_j, \quad j \in J \tag{1}$$

where $K$ is a Borel subset of $\mathbb{R}^n$, $\mathcal{M}(K)$ is the space of finite signed Borel measures on $K$, $\mathcal{M}(K)_+$ is the positive cone containing the finite Borel measures on $K$, $\{\gamma_j\}_{j \in J}$ is a set of real numbers, and $f, h_j$ are integrable with respect to all $\mu \in \mathcal{M}(K)_+$ for every $j \in J$. The $\lesseqgtr$ stands for either an inequality or equality constraint.

Lasserre [7, 11, 9, 10] showed that semidefinite programming (SDP) can be used to obtain a (often finite) sequence of relaxations which can be solved efficiently and which converge from above to the optimal solution when the problem has polynomial data, thereby giving us a tool with which to attempt very difficult (i.e., non-convex) medium-sized problems, given the present state of SDP solvers. The monotonic convergence means that the $d$-th relaxation is useful even when optimality has not been achieved, since it provides the user with an upper (i.e., optimistic) bound on $\rho$ in (1).

In the sequel, we will go over the preliminaries of the moments approach, present the aforementioned Lasserre relaxations, and demonstrate how to use them to detect anomalous data samples.

2. Preliminaries

For a real symmetric matrix $A$, $A \succeq 0$ means $A$ is positive semidefinite. Let $x = (x_1, \ldots, x_n)$, $\alpha \in \mathbb{N}^n$, and let

$$v_d(x) = \left(1,\; x_1, \ldots, x_n,\; x_1^2,\; x_1 x_2, \ldots, x_n^2,\; \ldots,\; x_1^d, \ldots, x_n^d\right)^T \tag{2}$$

be the column vector containing the canonical polynomial basis monomials $x^\alpha = x_1^{\alpha_1} \cdots x_n^{\alpha_n}$ for polynomials of the form $p(x) = \sum_\alpha p_\alpha x^\alpha$ with dimension $n$ and degree at most $d$. $\mathbb{R}[x]$ denotes the set of polynomials in $x$ with real coefficients.

With $|\alpha| = \sum_{i=1}^n \alpha_i$, let $\mathbb{N}^n_d = \{\alpha \in \mathbb{N}^n : |\alpha| \le d\}$ and let $y = (y_\alpha)$ denote a finite sequence of real variables. For a polynomial $p \in \mathbb{R}[x]$, with coefficients $(p_\alpha)$, let $L_y$ denote the linear map

$$p \;\mapsto\; L_y(p) = \sum_\alpha p_\alpha y_\alpha \tag{3}$$

which associates $p$ with the moment variables $y_\alpha$.
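As a concrete illustration, in one dimension $L_y$ is simply an inner product between the coefficient vector and the moment sequence. The following is a toy MATLAB snippet with our own illustrative values, not taken from the paper:

  % L_y(p) for n = 1: p(x) = 3 - 2x + x^2, moments y = (y_0, y_1, y_2).
  p  = [3; -2; 1];      % coefficients p_alpha in the basis (1, x, x^2)
  y  = [1; 0.1; 1.5];   % moment variables y_alpha (illustrative values)
  Lp = p' * y;          % L_y(p) = sum_alpha p_alpha * y_alpha = 4.3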

2.1. Moment and Localizing Matrices

The $d$-th order moment matrix $M_d(y)$ is indexed by the monomials of degree up to $d$. That is, each entry of the matrix is given by

$$M_d(y)(\alpha, \beta) = y_{\alpha + \beta} \tag{4}$$

where $\alpha, \beta \in \mathbb{N}^n_d$. For example, in the case $n = 2$ and $d = 2$, if $\alpha = (1, 0)$ and $\beta = (0, 1)$, then $M_d(y)(\alpha, \beta) = y_{(1,1)}$. Usually the degree-zero index is suppressed, since it is understood that the first row and column of $M_d(y)$ contain the elements of $y$ of degree at most $d$. The entire matrix is given below for illustration:

$$M_2(y) = \begin{pmatrix}
y_{00} & y_{10} & y_{01} & y_{20} & y_{11} & y_{02}\\
y_{10} & y_{20} & y_{11} & y_{30} & y_{21} & y_{12}\\
y_{01} & y_{11} & y_{02} & y_{21} & y_{12} & y_{03}\\
y_{20} & y_{30} & y_{21} & y_{40} & y_{31} & y_{22}\\
y_{11} & y_{21} & y_{12} & y_{31} & y_{22} & y_{13}\\
y_{02} & y_{12} & y_{03} & y_{22} & y_{13} & y_{04}
\end{pmatrix} \tag{5}$$

Thus, it is clear that $M_d(y) = L_y\!\left(v_d(x)\, v_d(x)^T\right)$, with $L_y$ applied entrywise.

The classical problem of moments is concerned with determining whether, for a given moment sequence $y = (y_\alpha)$, there is a measure $\mu$ so that $y_\alpha = \int x^\alpha \, d\mu$ for each $\alpha$. If so, we say $y$ has a representing measure $\mu$, which is called determinate if it is unique. Loosely speaking, checking the existence of such a measure involves testing whether $M_d(y)$ is positive semidefinite. Readers interested in the technical details are referred to [11] and the references therein.
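In one dimension the moment matrix has a Hankel structure, so it can be assembled directly. The following MATLAB sketch (our own illustrative values) builds $M_2(y)$ and applies the loose positive-semidefiniteness check just described:

  % M_d(y) for n = 1 is Hankel: entry (i,j) holds y_{(i-1)+(j-1)}.
  d  = 2;
  y  = [1; 0.1; 1.5; 0.2; 4.1];             % y_0 ... y_{2d} (illustrative)
  Md = hankel(y(1:d+1), y(d+1:2*d+1));      % (d+1) x (d+1) moment matrix
  isPSD = min(eig(Md)) >= -1e-9;            % loose check: could y have a representing measure?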

For a given polynomial $q(x) = \sum_\gamma q_\gamma x^\gamma$, the localizing matrix $M_d(q\,y)$ is given by

$$M_d(q\,y)(\alpha, \beta) = \sum_\gamma q_\gamma \, y_{\gamma + \alpha + \beta} \tag{6}$$

This is equivalent to shifting $M_d(y)$ by the monomials of $q$. For example, with $n = 1$, $d = 1$, and a quadratic $q(x) = 1 - x^2$ (any polynomial works the same way), we have

$$M_1(q\,y) = \begin{pmatrix} y_0 - y_2 & y_1 - y_3\\ y_1 - y_3 & y_2 - y_4 \end{pmatrix} \tag{7}$$

Localizing matrices are used to specify support constraints on $\mu$, as described in the next section.
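Continuing the one-dimensional MATLAB sketch above (again with our own illustrative values), the localizing matrix for $q(x) = 1 - x^2$ is obtained by shifting the same Hankel index pattern:

  % M_1(q y) for n = 1, q(x) = 1 - x^2: entry (i,j) = y_{i+j-2} - y_{i+j}.
  y   = [1; 0.1; 1.5; 0.2; 4.1];   % y_0 ... y_4, as above
  idx = hankel(1:2, 2:3);          % index pattern (i,j) -> i+j-1
  Mq  = y(idx) - y(idx+2);         % L_y(q(x) * x^{i-1} * x^{j-1}), cf. (7)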

2.2. Upper-bounds Over Semi-algebraic Sets

The material in this subsection comes from [11] and [8], which discuss the problem of finding bounds on the probability that a random variable $x$ belongs to a set $K$, given some of its moments $\gamma = (\gamma_\alpha)$.

Suppose $K$ is of the form

$$K = \{x \in \mathbb{R}^n : g_i(x) \ge 0, \; i = 1, \ldots, m\} \tag{8}$$

where the $g_i$ are given polynomial constraints. In terms of the GMP, this problem can be expressed as

$$\rho = \sup_{\mu \in \mathcal{M}(\mathbb{R}^n)_+} \int \mathbf{1}_K \, d\mu \quad \text{s.t.} \quad \int x^\alpha \, d\mu = \gamma_\alpha, \quad \alpha \in \Gamma \tag{9}$$

where $\mathbf{1}_K$ is the indicator function of the set $K$. To solve this problem using SDP methods, Lasserre provided the following relaxations $\mathbf{Q}_d$:

$$\rho_d = \sup_y \; y_0 \quad \text{s.t.} \quad M_d(y) \succeq 0, \quad M_{d-d_i}(g_i\, y) \succeq 0, \; i = 1, \ldots, m, \quad M_d(\gamma - y) \succeq 0 \tag{10}$$

where $d_i = \lceil \deg(g_i)/2 \rceil$ and $\Gamma$ is the set of indices that correspond to the known moment information.
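To make the relaxation concrete, below is a minimal one-dimensional sketch of $\mathbf{Q}_d$ in CVX/SeDuMi, the toolchain used later in Section 4. The variable names (gam, c, eps0) and the test data are ours, and the single constraint $g(x) = \epsilon^2 - (x - c)^2 \ge 0$ describes an interval around a measurement $c$; this is a sketch under those assumptions, not the authors' code:

  % Degree-d relaxation (10) for n = 1, solved with CVX (SDP mode).
  rng(0);
  N    = 300;
  data = randn(N,1);                               % samples from the "normal" state
  d    = 3;                                        % relaxation order: moments up to 2d
  gam  = arrayfun(@(k) mean(data.^k), (0:2*d)');   % empirical moments, cf. (12)

  c = 2.5; eps0 = 0.1;                             % measurement and neighborhood radius
  % K = {x : g(x) >= 0}, g(x) = eps0^2 - (x-c)^2, coefficients in (1, x, x^2):
  g = [eps0^2 - c^2; 2*c; -1];

  idx  = hankel(1:d+1, d+1:2*d+1);                 % Hankel index pattern for M_d
  idxl = hankel(1:d,   d:2*d-1);                   % same pattern for M_{d-1}(g y)

  cvx_begin sdp quiet
      variable y(2*d+1)
      maximize( y(1) )                             % y_0: mass the measure places on K
      y(idx) >= 0;                                 % M_d(y) is PSD
      gam(idx) - y(idx) >= 0;                      % M_d(gam - y) is PSD
      g(1)*y(idxl) + g(2)*y(idxl+1) + g(3)*y(idxl+2) >= 0;   % M_{d-1}(g y) is PSD
  cvx_end

  rho_d = cvx_optval;                              % upper bound on Prob(x in K)

For a standard-normal training set and a measurement at $c = 2.5$ with a small radius, the resulting bound is small, which is exactly the behavior the detector exploits.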

The following result assures us that $\rho_d$ approaches the optimal value $\rho$ from above and tells us when optimality has been reached. The result is included for completeness, as we will generally not require an optimal solution. This is an advantage because we do not have to worry about extracting optimizers from a moment matrix that satisfies the rank condition in Theorem 1 (there is currently no reliable method of extracting optimizers from a moment matrix).

Theorem 1 ([11]).

Let $\rho_d$ be the optimal value of the semidefinite program $\mathbf{Q}_d$. Then:

(a): For every $d$, $\rho_d \ge \rho$, and moreover, $\rho_d \downarrow \rho$ as $d \to \infty$.

(b): If $\rho_d$ is attained at an optimal solution $y^*$ which satisfies

$$\operatorname{rank} M_d(y^*) = \operatorname{rank} M_{d-v}(y^*), \qquad v = \max_i d_i, \tag{11}$$

then $\rho_d = \rho$.
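A numerical check of condition (11) for the one-dimensional CVX sketch above might look as follows (here $v = \max_i d_i = 1$ since $\deg g = 2$; the rank tolerance is our own choice, and y, d are the values from that sketch):

  % Test the rank condition (11); after cvx_end, y holds numeric moments.
  tol  = 1e-6;
  Md   = hankel(y(1:d+1), y(d+1:2*d+1));     % M_d(y*)
  Mdm1 = hankel(y(1:d),   y(d:2*d-1));       % M_{d-1}(y*)
  flat = rank(Md, tol) == rank(Mdm1, tol);   % true => rho_d = rho by Theorem 1(b)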

2.3. Lower Bounds

To compute lower bounds, one can use the complement of $K$, instead of $K$, in $\mathbf{Q}_d$ and subtract the resulting bound from 1 [11]. However, in the approach presented here, the lower bounds do not provide much information because the sets we work with are small compared to the range of the data, which means the lower bound will usually be zero.

3. Robust Anomaly Detection

One approach to anomaly detection is to estimate the probability density function (PDF) and select a threshold below which incoming samples are classified as anomalous. Another is to train a 1-class support vector machine (SVM) and use it to classify samples. Here, we compare against both of these standard techniques. To quantify performance, we use threshold-independent receiver operating characteristic (ROC) curves and the area-under-curve (AUC) metric; this is a standard way of assessing binary-classifier performance. Both ROC and AUC are readily obtained using the MATLAB command perfcurve. The kernel density estimates are obtained using the MATLAB command ksdensity (and the KDE toolbox for higher-dimensional examples [6]). For the 1-class SVM, we use the MATLAB command fitcsvm with the automatic-tuning and "standardize data" options enabled.

We would like to emphasize, however, that our goal is not to estimate the probability density function. Our goal is simpler: to use the upper bound $\rho_d$ on the probability of the observation to decide whether or not it is anomalous. To use $\rho_d$ for anomaly detection, we select $\epsilon$ to specify a neighborhood $K$ which contains an incoming measurement. The neighborhood can be a sphere or a box (indeed, $K$ in $\mathbf{Q}_d$ can be any set of the form in (8)) and is specified in the constraints of $\mathbf{Q}_d$. The $\epsilon$ can be selected as a function of the measurement noise, or otherwise small compared to the expected range of the data; in fact, $\epsilon$ can be set to zero if one wishes. In practice, we use $\epsilon$ to account for measurement noise.
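Putting the pieces together, the sketch below (our own wiring, reusing gam, d, idx, idxl, and eps0 from the Section 2.2 snippet, with made-up test points and labels) scores a small batch of samples and feeds the resulting upper bounds to perfcurve:

  % Score test points with the SDP above; low rho_d suggests an anomaly.
  xtest  = [-0.2; 0.1; 2.5; 3.0];           % mixed inliers and outliers
  scores = zeros(size(xtest));
  for k = 1:numel(xtest)
      c = xtest(k);
      g = [eps0^2 - c^2; 2*c; -1];          % K recentered at the k-th test point
      cvx_begin sdp quiet
          variable y(2*d+1)
          maximize( y(1) )
          y(idx) >= 0;
          gam(idx) - y(idx) >= 0;
          g(1)*y(idxl) + g(2)*y(idxl+1) + g(3)*y(idxl+2) >= 0;
      cvx_end
      scores(k) = cvx_optval;               % upper bound for the k-th point
  end
  labels = [1; 1; 0; 0];                    % 1 = inlier, 0 = outlier (illustrative)
  [fpr, tpr, ~, auc] = perfcurve(labels, scores, 1);

Thresholding the scores (flag a point as anomalous when its bound falls below some tau) gives the detector; sweeping tau traces out the ROC curve.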

3.1. Moment Estimation and Data Whitening

A key advantage of the proposed method is that the information from a data set enters only through the moment estimates, which are computed as

$$\hat{\gamma}_\alpha = \frac{1}{N} \sum_{i=1}^{N} \left(x^{(i)}\right)^\alpha \tag{12}$$

where $x^{(1)}, \ldots, x^{(N)}$ are the training samples and $(x^{(i)})^\alpha = \prod_j (x^{(i)}_j)^{\alpha_j}$.

It is difficult to estimate higher-order moments, since whatever noise is present in the data is raised to higher powers as well. In view of Equation (12), one can see that if the data is poorly scaled or biased in some coordinate, $\hat{\gamma}_\alpha$ will grow quickly in those directions; this results in fewer usable moments for those directions and may cause numerical problems for the SDP solver. Thus, it is sometimes useful to use the data-whitening technique discussed in [1] to transform the data so that it has zero mean and unit variance. This may be useful in lower-dimensional instances, where it is possible to use these higher-order moments without the size of the moment matrices exceeding computational resources.

Data whitening is a simple procedure that involves subtracting the mean of the data and multiplying by a transformation matrix, in the multivariate case, or dividing by the standard deviation in the univariate case. Incoming samples are then processed the same way, using the stored mean and transformation, before obtaining an upper-bound probability estimate.
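A minimal sketch of this preprocessing step (one standard whitening variant, cf. [1]; the toy data and names are ours):

  % Whiten training data to zero mean and unit covariance; store mu and W
  % so incoming samples can be processed identically.
  X  = randn(200, 2) * [2 0.5; 0 1] + 3;        % toy data: correlated, biased
  mu = mean(X, 1);
  [V, D] = eig(cov(X));                          % covariance eigendecomposition
  W  = V * diag(1 ./ sqrt(diag(D)));             % whitening transform
  Xw = bsxfun(@minus, X, mu) * W;                % whitened training data
  xnew   = [3.5, 2.8];                           % an incoming sample
  xnew_w = (xnew - mu) * W;                      % same stored transform applied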

4. Examples

In this section we present four examples. (We used MATLAB R2013b/R2014a, CVX [2], and SeDuMi [12] to solve them.) The first is a 1-D example that shows that our approach compares favorably with traditional kernel-density-based anomaly detectors and with the 1-class SVM. We will demonstrate that higher-order moments contain useful information, and that this information lets our approach best the other two, especially when there are fewer data points to learn from and when the dimensionality of the data increases. These points become apparent through the first three examples, which increase in dimension. The final example shows how our method can be used to recover contours from a noisy binary image.

4.1. Example 1: A Bimodal Distribution

Consider the bimodal distribution [3] in Figure 1. This distribution poses a challenge to density-estimation tools like Parzen windows because the two modes are best estimated using different window sizes. Figure 1 shows the distribution and the test points, which consist of 300 inliers and 50 outliers.

Figure 1. Bimodal distribution

Table 1 shows the AUC values for our experiments, which use increasing numbers of moments for the moments classifier (the last four columns). The reader can see that AUC generally increases as more moments are used and that the performance is better with fewer points. The difference, however, is small because this is a 1-D example.

N    SVM     Parzen  Moments classifier (increasing moment order →)
50   0.9821  0.9686  0.8701  0.9715  0.9940  0.9929
100  0.9907  0.9807  0.8457  0.9882  0.9984  0.9957
300  0.9953  0.9781  0.8268  0.9783  0.9951  0.9945
Table 1. AUC Summary
Figure 2. ROC curves when 50 points are used for training
Figure 3. ROC curves when 300 points are used for training

4.2. Example 2: 2D Distribution

For higher-dimensional datasets, the complexity of estimating PDFs increases greatly. Indeed, MATLAB, arguably the most popular engineering software tool, does not provide a density-estimation function for multidimensional datasets.

Consider the distribution below:

(13)

This distribution has a "P" shape, and the closed portion presents a problem for the kernel and SVM approaches.

Figure 4 shows the probability surface and the 350 test points which include 50 outliers.

Figure 4. 2D distribution and test points

Table 2 shows the results of our experiments, again using increasing numbers of moments for the moments classifier (the last four columns). As before, performance increases as more moment information is included, except this time the difference in performance is greater because the data has an additional dimension. This trend continues in the third example.

N    SVM     Parzen  Moments classifier (increasing moment order →)
100  0.9252  0.7843  0.6330  0.9722  0.9883  0.9916
150  0.9437  0.8045  0.6575  0.9751  0.9903  0.9947
300  0.9435  0.8335  0.6436  0.9705  0.9863  0.9933
Table 2. AUC Summary
Figure 5. ROC curves when 100 points are used for training
Figure 6. ROC curves when 300 points are used for training

4.3. Example 3: Swiss roll Distribution

The following distribution is the well-known "Swiss roll" distribution. The equations below (the standard parametrization) were used to create the test and training samples:

$$x_1 = t \cos t \tag{14}$$
$$x_2 = h \tag{15}$$
$$x_3 = t \sin t \tag{16}$$

with $t$ and $h$ drawn uniformly from fixed ranges. Figure 7 shows the test points, which consist of 300 inliers and 50 outliers.

Figure 7. Swiss-roll distribution

Table 3 shows the familiar trend of increasing AUC with additional moments, and excellent performance even with just 100 training samples. This example illustrates how the other two standard techniques degrade as dimensionality increases. As with the previous examples, increasing numbers of moments were used for the moments classifier, and the recommended automatic-tuning options were used for the other techniques.

N    SVM     Parzen  Moments classifier (increasing moment order →)
100  0.7887  0.8487  0.8099  0.9194  0.9228
300  0.8001  0.9085  0.8201  0.9480  0.9903
Table 3. AUC Summary
Figure 8. ROC curves when 100 points are used for training
Figure 9. ROC curves when 300 points are used for training

4.4. Example 4: Contour Detection

The following example is included to show the potential of this method in image-processing applications. The contour image is from [5, 4]. In this example, we attempt to recover a contour corrupted by 5% noise. We calculated the Haralick energy, in two directions and three scales, for 9×9 neighborhoods of each pixel of the clean image, computed the corresponding moment estimates, and then computed the upper-bound probability surface on the noisy image. The three resulting probability images were then averaged. Since there are many more white pixels, the contour pixels should have a lower probability upper bound. Further, as the Haralick energy is the sum of the squared elements of the co-occurrence matrix, computed for each 9×9 window, we also expect the noisy pixels to have higher probability than the contour pixels. The reader can see that this is the case in the left subfigure of Figure 11. After inverting, thresholding, and post-processing with morphological operations, we obtained the contour shown on the right of Figure 11, again using just the information contained in the moments.

Figure 10. Clean and noisy images
Figure 11. Upper-bound image and recovered contour

5. Conclusions

This paper illustrates the ability of very recent developments in polynomial optimization and generalized moment problems to provide an elegant, computationally tractable solution to anomaly detection problems. The proposed approach uses information from a set of moments of a probability distribution to directly provide an upper bound on the probability of observing a particular sample (modulo measurement noise), without explicitly modeling the distribution. A salient feature of this technique is that all the data, whether obtained from 1 or 1000 observations, enter the problem through a finite sequence of moments that can be updated as more data is collected, without affecting the computational complexity (since new observations affect the values, but not the size, of the moment matrices). Furthermore, the detectors obtained by solving $\mathbf{Q}_d$ reached peak performance with fewer samples than the kernel and SVM methods.

Our examples implicitly assume that the observations contain enough information to make good decisions, that is, we have assumed that the samples are good features. When anomalous observations are available, it may be possible to construct optimal features to separate classes – in the same way support (or relevance) vector machine methods find separating curves – by maximizing the distance between the moment vectors of the two classes. We think this is an interesting direction for future research.

References

  • [1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
  • [2] CVX Research, Inc. CVX: MATLAB software for disciplined convex programming, version 2.0. http://cvxr.com/cvx, August 2012.
  • [3] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley, second edition, 2001.
  • [4] Pedro Felzenszwalb and John G Oberlin. Multiscale fields of patterns. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 82–90. 2014.
  • [5] Pedro F. Felzenszwalb and John G. Oberlin. Multiscale fields of patterns. CoRR, abs/1406.0924, 2014.
  • [6] Alexander Ihler and Mike Mandel. KDE Toolbox for MATLAB. http://www.ics.uci.edu/~ihler/code/kde.html.
  • [7] Jean B Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796–817, 2001.
  • [8] Jean B Lasserre. Bounds on measures satisfying moment conditions. Annals of Applied Probability, pages 1114–1137, 2002.
  • [9] Jean B Lasserre. A semidefinite programming approach to the generalized problem of moments. Mathematical Programming, 112(1):65–92, 2008.
  • [10] Jean B Lasserre. Moments and sums of squares for polynomial optimization and related problems. Journal of Global Optimization, 45(1):39–61, 2009.
  • [11] Jean-Bernard Lasserre. Moments, positive polynomials and their applications, volume 1. World Scientific, 2009.
  • [12] J.F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11–12:625–653, 1999. Version 1.05 available from http://fewcal.kub.nl/sturm.