1. Introduction
Detecting anomalies in data is a task that engineers and scientists of every discipline encounter on a fairly regular basis. Common approaches to identifying anomalous observations include: estimating the probability density function of the "normal" state using density-estimation methods, computing the Mahalanobis distance between a point and a sample distribution, using machine-learning methods to learn the normal state and perform one-class classification, using "whiteness" tests to detect when the differences between new samples and predictions (of linear models) become statistically colored, etc. However, a disadvantage encountered while estimating the distribution of the normal state is that one never knows if the available samples are representative of all the normal behaviors. This is especially true when there is a dearth of data with which to construct a reliable probability density estimate. In this paper, we address this challenge through the use of recent results from the method of moments and polynomial optimization. The main idea is to use information about the statistical moments of the normal-state distribution of a given set of features to compute an upper bound on the probability of a given observation of a random variable with this distribution. The observation is then deemed to be anomalous when this bound falls below a given threshold. While in principle computing this bound is a challenging infinite-dimensional problem, as shown in the sequel, the use of concepts from the theory of moments allows for recasting it as a computationally tractable finite-dimensional optimization.
The generalized moment problem (GMP), defined below, has tremendous modeling power and has found use in a wide range of applications in areas such as probability, financial economics, engineering, and operations research [11]. However, its wider application has been held back because, in its fullest generality, the GMP is intractable [9]. On the other hand, recent advances in algebraic geometry have rendered the GMP tractable when the problem instance is described using polynomials [7, 11].
Formally, the GMP is given by:
(1) $\rho = \sup_{\mu \in \mathcal{M}(K)_+} \int_K f \, d\mu \quad \text{s.t.} \quad \int_K h_j \, d\mu \;\lesseqgtr\; \gamma_j, \quad j \in J,$
where $K$ is a Borel subset of $\mathbb{R}^n$, $\mathcal{M}(K)$ is the space of finite signed Borel measures on $K$, $\mathcal{M}(K)_+$ is the positive cone of finite Borel measures on $K$, $\{\gamma_j\}_{j \in J}$ is a set of real numbers, and $f, h_j$ are integrable with respect to all $\mu \in \mathcal{M}(K)_+$ for every $j \in J$. The symbol $\lesseqgtr$ stands for either an inequality or equality constraint.
Lasserre [7, 11, 9, 10] showed that semidefinite programming (SDP) can be used to obtain a sequence of relaxations which can be solved efficiently and which converges from above (often finitely) to the optimal solution when the problem has polynomial data, thereby giving us a tool with which to attempt to solve very difficult (i.e., nonconvex) medium-sized problems, given the present state of SDP solvers. The monotonic convergence means that the $d$th relaxation is useful even when optimality has not been achieved, since it provides the user with an upper (i.e., optimistic) bound on $\rho$ in (1).
In the sequel, we will go over the preliminaries of the moments approach, present the aforementioned Lasserre relaxations, and demonstrate how to use them to detect anomalous data samples.
2. Preliminaries
For a real symmetric matrix $A$, $A \succeq 0$ means $A$ is positive semidefinite. Let $x \in \mathbb{R}^n$, $\alpha \in \mathbb{N}^n$, and let $v_d(x)$
be the column vector containing the canonical polynomial basis monomials
(2) $v_d(x) = \left(1, x_1, \ldots, x_n, x_1^2, x_1 x_2, \ldots, x_n^d\right)^T$
for polynomials of the form $p(x) = \sum_\alpha p_\alpha x^\alpha$, where $x^\alpha = x_1^{\alpha_1} \cdots x_n^{\alpha_n}$, with dimension $n$ and degree at most $d$. $\mathbb{R}[x]$ denotes the set of polynomials in $x$ with real coefficients.
With $\alpha \in \mathbb{N}^n$, let $y = (y_\alpha)$ denote a finite sequence of real variables. For a polynomial $p \in \mathbb{R}[x]$, with coefficients $(p_\alpha)$, let $L_y$ denote the linear map
(3) $L_y(p) = \sum_\alpha p_\alpha \, y_\alpha,$
which associates with each monomial $x^\alpha$ the moment variable $y_\alpha$.
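The map in (3) is simply a dot product between polynomial coefficients and moment values. A minimal sketch in Python follows (the paper's experiments use MATLAB, but the idea is language-agnostic); the data structures and numeric values here are our own illustration, not the paper's:

```python
# Illustrative sketch (not from the paper): the linear map L_y.
# A moment sequence y and a polynomial p are both stored as dicts
# keyed by exponent tuples alpha.

def L_y(p, y):
    """L_y(p) = sum over alpha of p_alpha * y_alpha."""
    return sum(coeff * y[alpha] for alpha, coeff in p.items())

# Hypothetical 1-D example: p(x) = 1 + 2x + 3x^2 applied to the
# moments E[1] = 1, E[x] = 0, E[x^2] = 2.
y = {(0,): 1.0, (1,): 0.0, (2,): 2.0}
p = {(0,): 1.0, (1,): 2.0, (2,): 3.0}
print(L_y(p, y))  # 1*1 + 2*0 + 3*2 = 7.0
```

When the $y_\alpha$ are the actual moments of a measure, $L_y(p)$ is exactly the expectation $\int p \, d\mu$.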
2.1. Moment and Localizing Matrices
The $d$th order moment matrix $M_d(y)$ is indexed by the exponents $\alpha, \beta \in \mathbb{N}^n$ of the monomials of order up to $d$. That is, each entry of the matrix is given by
(4) $M_d(y)(\alpha, \beta) = L_y(x^\alpha x^\beta) = y_{\alpha + \beta},$
where $|\alpha|, |\beta| \leq d$. For example, in the case $n = 2$ and $d = 2$, if $\alpha = (1, 0)$ and $\beta = (0, 1)$, then $M_2(y)(\alpha, \beta) = y_{(1,1)}$. Usually the explicit indexing is suppressed (since it is understood that the first row and column of $M_d(y)$ contain the moments of order at most $d$). The entire matrix $M_2(y)$ is given below for illustration.
(5) \[ M_2(y) = \begin{bmatrix} y_{00} & y_{10} & y_{01} & y_{20} & y_{11} & y_{02} \\ y_{10} & y_{20} & y_{11} & y_{30} & y_{21} & y_{12} \\ y_{01} & y_{11} & y_{02} & y_{21} & y_{12} & y_{03} \\ y_{20} & y_{30} & y_{21} & y_{40} & y_{31} & y_{22} \\ y_{11} & y_{21} & y_{12} & y_{31} & y_{22} & y_{13} \\ y_{02} & y_{12} & y_{03} & y_{22} & y_{13} & y_{04} \end{bmatrix} \]
Thus, it is clear that $M_d(y) = L_y\!\left(v_d(x) \, v_d(x)^T\right)$, with $L_y$ applied entrywise.
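A short Python sketch (an illustration, not the paper's MATLAB code) of how $M_d(y)$ is assembled from a moment sequence; the function name and the example moments (those of a standard normal variable) are our own choices:

```python
import numpy as np
from itertools import combinations_with_replacement

def moment_matrix(y, n, d):
    """Assemble M_d(y): entry (alpha, beta) is y_{alpha+beta}."""
    # enumerate exponent tuples alpha with |alpha| <= d, lowest order first
    exps = []
    for total in range(d + 1):
        for combo in combinations_with_replacement(range(n), total):
            e = [0] * n
            for var in combo:
                e[var] += 1
            exps.append(tuple(e))
    M = np.empty((len(exps), len(exps)))
    for i, a in enumerate(exps):
        for j, b in enumerate(exps):
            M[i, j] = y[tuple(ai + bi for ai, bi in zip(a, b))]
    return M

# 1-D example: moments of N(0,1) up to order 4 are 1, 0, 1, 0, 3.
y = {(k,): m for k, m in enumerate([1.0, 0.0, 1.0, 0.0, 3.0])}
M = moment_matrix(y, n=1, d=2)          # 3x3 Hankel matrix
print(np.linalg.eigvalsh(M).min() > 0)  # True: consistent with a representing measure
```

In the univariate case $M_d(y)$ is a Hankel matrix, and its positive semidefiniteness is the classical necessary condition in the problem of moments discussed next.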
The classical problem of moments is concerned with determining if, for a given moment sequence $y = (y_\alpha)$, there is a measure $\mu$ so that $y_\alpha = \int x^\alpha \, d\mu$ for each $\alpha$. If so, we say $y$ has a representing measure, which is called determinate if it is unique. Loosely speaking, checking the existence of such a measure involves testing if $M_d(y)$ is positive semidefinite. Readers interested in the technical details are referred to [11] and the references therein.
For a given polynomial $q \in \mathbb{R}[x]$ with coefficients $(q_\gamma)$, the localizing matrix $M_d(q \, y)$ is given by
(6) $M_d(q \, y)(\alpha, \beta) = L_y\!\left(q(x) \, x^\alpha x^\beta\right) = \sum_\gamma q_\gamma \, y_{\gamma + \alpha + \beta}.$
This is equivalent to shifting $M_d(y)$ by the monomials of $q$. For example, with $n = 2$ and $q(x) = 1 - x_1^2$, we have
(7) \[ M_1(q \, y) = \begin{bmatrix} y_{00} - y_{20} & y_{10} - y_{30} & y_{01} - y_{21} \\ y_{10} - y_{30} & y_{20} - y_{40} & y_{11} - y_{31} \\ y_{01} - y_{21} & y_{11} - y_{31} & y_{02} - y_{22} \end{bmatrix}. \]
Localizing matrices are used to specify support constraints on $\mu$, as described in the next section.
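In one dimension the shift in (6) is easy to see directly. The following sketch (our illustration; the polynomial and moments are hypothetical choices) builds $M_1(q\,y)$ for $q(x) = 1 - x^2$ from the moments of the uniform distribution on $[-1, 1]$, a measure whose support satisfies $q(x) \geq 0$:

```python
import numpy as np

# Illustrative 1-D localizing matrix: entry (i, j) of M_d(q y) is
# L_y(q(x) x^{i+j}) = sum_k q_k * y_{i+j+k}.

def localizing_matrix_1d(y, d, q):
    """y: moments y_0..y_{2d+deg q}; q: coefficient list, q[k] = coeff of x^k."""
    M = np.empty((d + 1, d + 1))
    for i in range(d + 1):
        for j in range(d + 1):
            M[i, j] = sum(c * y[i + j + k] for k, c in enumerate(q))
    return M

# Moments of the uniform distribution on [-1, 1]: y_k = 1/(k+1) for even k, 0 for odd k.
y = [1 / (k + 1) if k % 2 == 0 else 0.0 for k in range(6)]
M = localizing_matrix_1d(y, d=1, q=[1.0, 0.0, -1.0])   # q(x) = 1 - x^2
# Since the support satisfies q(x) >= 0, the localizing matrix is PSD:
print(np.linalg.eigvalsh(M).min() >= 0)  # True
```

Conversely, if a moment sequence came from a measure with mass outside $\{q \geq 0\}$, the localizing matrix can fail to be positive semidefinite, which is exactly how support constraints enter the SDP relaxations below.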
2.2. Upper Bounds Over Semialgebraic Sets
The material in this subsection comes from [11] and [8], which discuss the problem of finding bounds on the probability that some random variable $X$ belongs to a set $\Delta$, given some of its moments $(\gamma_\alpha)$.
Suppose $\Delta$ is of the form
(8) $\Delta = \{x \in \mathbb{R}^n : g_i(x) \geq 0, \; i = 1, \ldots, m\},$
where $g_i \in \mathbb{R}[x]$ are given polynomial constraints. In terms of the GMP, this problem can be expressed as
(9) $\rho = \sup_{\mu \in \mathcal{M}(\mathbb{R}^n)_+} \int \mathbb{1}_\Delta \, d\mu$
s.t. $\int x^\alpha \, d\mu = \gamma_\alpha, \quad \alpha \in \Gamma,$
where $\mathbb{1}_\Delta$ is the indicator function of the set $\Delta$. To solve this problem using SDP methods, Lasserre provided the following relaxations $Q_d$:
(10) $\rho_d = \sup_{y, z} \; z_0 \quad \text{s.t.} \quad M_d(z) \succeq 0, \;\; M_{d - v_i}(g_i \, z) \succeq 0, \; i = 1, \ldots, m, \;\; M_d(y) \succeq 0, \;\; y_\alpha + z_\alpha = \gamma_\alpha, \; \alpha \in \Gamma,$
where $v_i = \lceil \deg g_i / 2 \rceil$ and $\Gamma$ is the set of indices that correspond to the known moment information.
The following result assures us that $\rho_d$ approaches the optimal value $\rho$ from above and tells us when optimality has been reached. This result is included for completeness, as we will generally not require an optimal solution. This is an advantage because we do not have to worry about extracting optimizers from a moment matrix that has satisfied the rank condition in Theorem 1 (there is currently no reliable method of extracting optimizers from a moment matrix).
Theorem 1 ([11]).
Let $\rho_d$ be the optimal value of the semidefinite program $Q_d$. Then:
(a) For every $d$, $\rho_d \geq \rho$, and moreover, $\rho_d \downarrow \rho$ as $d \to \infty$.
(b) If $\rho_d$ is attained at an optimal solution $z^*$ which satisfies
(11) $\operatorname{rank} M_d(z^*) = \operatorname{rank} M_{d - v}(z^*), \quad v = \max_i v_i,$
then $\rho_d = \rho$.
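To build intuition for what such a bound delivers, consider the smallest possible instance: when only the mean and variance are known, Chebyshev's inequality already yields a valid (though generally weaker than the SDP) upper bound on the probability of a small interval $\Delta$ away from the mean. The sketch below is purely illustrative; the function name, data, and thresholds are ours, not the paper's:

```python
import numpy as np

# Minimal illustration (not the full SDP relaxation Q_d): with only the mean
# and variance of the normal state, Chebyshev's inequality gives a valid
# upper bound P(X in Delta) <= var / t^2 whenever Delta = [x0-eps, x0+eps]
# lies at distance t = |x0 - mean| - eps > 0 from the mean. The relaxations
# Q_d generalize this to many moments and multivariate data.

def chebyshev_upper_bound(mean, var, x0, eps):
    t = abs(x0 - mean) - eps
    if t <= 0:
        return 1.0                      # Delta overlaps the mean: trivial bound
    return min(1.0, var / t ** 2)

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=10_000)
m, v = data.mean(), data.var()

print(chebyshev_upper_bound(m, v, x0=0.5, eps=0.1))        # near the bulk: bound is 1
print(chebyshev_upper_bound(m, v, x0=6.0, eps=0.1) < 0.05)  # far out: small bound, anomalous
```

The SDP bound plays the same role as this two-moment bound, but it tightens monotonically as moments are added, per part (a) of the theorem.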
2.3. Lower Bounds
To compute lower bounds, one can use the complement of $\Delta$, instead of $\Delta$, in $Q_d$ and subtract the resulting upper bound from 1 [11]. However, in the approach presented here, the lower bounds do not provide much information because the sets $\Delta$ that we will work with are small compared to the range of the data, which means the lower bound will usually be zero.
3. Robust Anomaly Detection
One approach to anomaly detection is to estimate the probability density function (PDF) and select a threshold below which incoming samples are classified as anomalous. Another is to train a one-class support vector machine (SVM) and use it to classify samples. Here, we will compare against both of these standard techniques. To quantify performance, we will use the threshold-independent receiver operating characteristic (ROC) curve and the area under the curve (AUC) metric; this is a standard way of assessing binary classifier performance. Both ROC and AUC are readily obtained using the MATLAB command perfcurve. The kernel density estimates are obtained using the MATLAB command ksdensity (and the KDE toolbox for higher-dimensional examples [6]). For the one-class SVM, we use the MATLAB command fitcsvm with the automatic tuning and "standardized data" options enabled.
We would like to emphasize to the reader, however, that our goal is not to estimate the probability density function. Our goal is simpler: to use the upper bound on the probability of the observation to decide whether or not it is anomalous. To use $Q_d$ for anomaly detection, we select $\Delta$ to be a neighborhood of size $\epsilon$ which contains an incoming measurement. The neighborhood can be a sphere or a box ($\Delta$ can be any set of the form in (8)) and is specified via the constraints in $Q_d$. The size $\epsilon$ can be selected as a function of the measurement noise, or otherwise small compared to the expected range of the data. In fact, $\epsilon$ can be set to zero if one wishes. In practice, we would like to use $\epsilon$ to account for measurement noise.
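The decision rule itself is a one-liner: declare a sample anomalous when its probability upper bound falls below a threshold. For the threshold-free comparison, AUC can be computed directly from the scores as a Mann-Whitney statistic (the quantity perfcurve reports). A small sketch with hypothetical scores (ours, for illustration only):

```python
# Sketch of the evaluation used throughout Section 4. Each test point gets an
# upper bound on the probability of its neighborhood; it is declared anomalous
# when the bound falls below a threshold tau. AUC is threshold-free and equals
# the probability that a random normal point scores above a random anomaly.

def auc(scores_normal, scores_anomalous):
    """Mann-Whitney estimate of the area under the ROC curve."""
    wins = ties = 0
    for s in scores_normal:
        for t in scores_anomalous:
            if s > t:
                wins += 1
            elif s == t:
                ties += 1
    n_pairs = len(scores_normal) * len(scores_anomalous)
    return (wins + 0.5 * ties) / n_pairs

# Hypothetical upper-bound scores: inliers get large bounds, outliers small ones.
normal = [0.9, 0.8, 0.95, 0.7]
anomalies = [0.05, 0.2, 0.75]
print(auc(normal, anomalies))  # 11/12, approximately 0.9167
```

An AUC of 1 means some threshold separates the two classes perfectly; 0.5 is chance level.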
3.1. Moment Estimation and Data Whitening
A key advantage of the proposed method is that the information from a data set $\{x^{(i)}\}_{i=1}^N$ enters through the moment estimates, which are computed using Equation (12):
(12) $\hat{y}_\alpha = \frac{1}{N} \sum_{i=1}^{N} \left(x^{(i)}\right)^\alpha.$
It is known that it is difficult to estimate higher-order moments, since whatever noise is present in the data will be raised to higher powers as well. In view of Equation (12), one can see that if the data is poorly scaled or biased in some coordinate, the estimates $\hat{y}_\alpha$ will increase quickly in those directions, which results in fewer usable moments for those directions and may cause numerical problems for the SDP solver. Thus, it is sometimes useful to use the data whitening technique discussed in [1]
to transform the data so that it has zero mean and unit variance. This may be useful in lower-dimensional instances, where it is possible to use these higher-order moments without the size of the moment matrices exceeding computational resources.
Data whitening is a simple procedure that involves subtracting the mean of the data and multiplying by a transformation matrix, in the multivariate case, or dividing by the standard deviation in the univariate case. Incoming samples are then processed the same way, using the stored mean and transformation, before obtaining an upperbound probability estimate.
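The two steps, whitening and empirical moment estimation per Equation (12), can be sketched as follows (an illustration in Python with our own function names and synthetic data; the paper's pipeline is in MATLAB):

```python
import numpy as np

# Sketch of Equation (12) plus whitening: estimate multivariate moments
# y_alpha = (1/N) * sum_i prod_k x_{i,k}^{alpha_k} from whitened samples.

def whiten(X):
    """Zero-mean, unit-covariance transform; returns data, mean, and transform."""
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)
    w, V = np.linalg.eigh(cov)                  # eigendecomposition of covariance
    W = V @ np.diag(1.0 / np.sqrt(w)) @ V.T     # inverse square root of covariance
    return (X - mu) @ W, mu, W

def empirical_moment(X, alpha):
    """Equation (12): sample average of x^alpha over the rows of X."""
    return np.prod(X ** np.asarray(alpha), axis=1).mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 2)) * [3.0, 0.5] + [10.0, -2.0]   # poorly scaled data
Z, mu, W = whiten(X)
print(empirical_moment(Z, (1, 0)))   # ~0: whitened mean
print(empirical_moment(Z, (2, 0)))   # ~1: unit variance
```

Incoming samples would be transformed with the stored `mu` and `W` before querying the upper bound, exactly as described above.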
4. Examples
In this section we present four examples (we used MATLAB R2013b/R2014a, CVX [2], and SeDuMi [12] to solve them). The first is a 1-D example showing that our approach compares favorably with traditional kernel-density-based anomaly detectors and with the one-class SVM. We will demonstrate that higher-order moments contain useful information, and that this information helps the proposed method best the other two approaches, especially when there are fewer data points to learn from and when the dimensionality of the data increases. These points will become apparent through the first three examples, which increase in dimension. The final example shows how our method can be used to recover contours from a noisy binary image.
4.1. Example 1: A Bimodal Distribution
Consider the bimodal distribution [3] in Figure 1. This distribution poses a challenge to density estimation tools like Parzen windows because the two modes are estimated best using different window sizes.
Table 1 shows the AUC values for our experiments. The reader can see that the AUC of the moments classifier generally increases as more moments are used, and that its advantage over the other methods is most pronounced when there are fewer training points. The difference, however, is small because this is a 1-D example.
N    SVM     Parzen   Moments classifier (increasing number of moments)
50   0.9821  0.9686   0.8701   0.9715   0.9940   0.9929
100  0.9907  0.9807   0.8457   0.9882   0.9984   0.9957
300  0.9953  0.9781   0.8268   0.9783   0.9951   0.9945
4.2. Example 2: 2D Distribution
For higher-dimensional datasets, the complexity of estimating PDFs increases greatly. Indeed, MATLAB, arguably the most popular engineering software tool, does not provide a density-estimation function for multidimensional datasets.
Consider the two-dimensional distribution defined in (13). This distribution has a "P" shape, and the closed portion presents a problem for the kernel and SVM approaches.
Figure 4 shows the probability surface and the 350 test points which include 50 outliers.
Table 2 shows the results of our experiments. Again, the reader can see how the performance increases as more moment information is included, except this time the difference in performance is greater because the data has another dimension. This trend will continue in the third example.
N    SVM     Parzen   Moments classifier (increasing number of moments)
100  0.9252  0.7843   0.6330   0.9722   0.9883   0.9916
150  0.9437  0.8045   0.6575   0.9751   0.9903   0.9947
300  0.9435  0.8335   0.6436   0.9705   0.9863   0.9933
4.3. Example 3: Swiss Roll Distribution
The third distribution is the well-known "Swiss roll" distribution; Equations (14)-(16) were used to create the test and training samples. Figure 7 shows the test points, which consist of 300 inliers and 50 outliers.
Table 3 shows the familiar trend of increasing AUC with additional moments, and excellent performance even with just 100 training samples. This example illustrates how the other two standard techniques degrade as dimensionality increases. As with the previous examples, the moments classifier used the same settings, and the recommended automatic tuning options were used for the other techniques.
N    SVM     Parzen   Moments classifier (increasing number of moments)
100  0.7887  0.8487   0.8099   0.9194   0.9228
300  0.8001  0.9085   0.8201   0.9480   0.9903
4.4. Example 4: Contour Detection
The following example is included to show the potential of this method in image-processing applications. The contour image is from [5, 4]. In this example, we attempt to recover a contour corrupted by 5% noise. We calculated the Haralick energy, in two directions and three scales, for 9x9 neighborhoods of each pixel of the clean image, computed the moment estimates, and then computed the upper-bound probability surface from the noisy image. The three resulting probability images were then averaged. Since there are many more white pixels, the contour pixels should have a lower probability upper bound. Further, as the Haralick energy is the sum of the squared elements of the co-occurrence matrix, computed for each 9x9 window, we also expect the noisy pixels to have higher probability than the contour pixels. The reader can see that this is the case, as shown in the left subfigure of Figure 11. After inverting, thresholding, and post-processing with morphological operations, we obtained the contour shown on the right of Figure 11, again using just the information contained in the moments.
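For readers unfamiliar with the Haralick energy feature, the following self-contained sketch (our illustration, not the paper's MATLAB pipeline) computes it for a small window directly from the gray-level co-occurrence matrix (GLCM):

```python
import numpy as np

# Illustrative sketch: the Haralick "energy" of a window is the sum of
# squared entries of its normalized gray-level co-occurrence matrix (GLCM).
# Flat regions concentrate mass in one GLCM entry (energy near 1); textured
# or noisy regions spread it out (lower energy), which is why contour and
# noise pixels separate in this feature.

def glcm_energy(window, offset=(0, 1), levels=2):
    """Energy of the co-occurrence matrix of a small integer-valued window."""
    dr, dc = offset
    glcm = np.zeros((levels, levels))
    rows, cols = window.shape
    for r in range(max(0, -dr), min(rows, rows - dr)):
        for c in range(max(0, -dc), min(cols, cols - dc)):
            glcm[window[r, c], window[r + dr, c + dc]] += 1
    glcm /= glcm.sum()
    return (glcm ** 2).sum()

flat = np.zeros((9, 9), dtype=int)           # all-white 9x9 patch
noisy = np.arange(81).reshape(9, 9) % 2      # alternating binary patch
print(glcm_energy(flat))    # 1.0: a single co-occurrence pair dominates
print(glcm_energy(noisy) < 1.0)
```

In the experiment above, such energies (over two directions and three scales) are the features whose moments feed the upper-bound computation.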
5. Conclusions
This paper illustrates the ability of very recent developments in polynomial optimization and generalized moment problems to provide an elegant, computationally tractable solution to anomaly detection problems. The proposed approach uses information from a set of moments of a probability distribution to directly provide an upper bound on the probability of observing a particular sample (modulo measurement noise), without explicitly modeling the distribution. A salient feature of this technique is that all the data, whether obtained from 1 or 1,000 observations, enter the problem through a finite sequence of moments that can be updated as more data are collected, without affecting the computational complexity (since new observations affect the values, but not the size, of the moment matrices). Furthermore, the detectors obtained by solving $Q_d$ reached peak performance with fewer samples than the kernel and SVM methods.
Our examples implicitly assume that the observations contain enough information to make good decisions; that is, we have assumed that the samples are good features. When anomalous observations are available, it may be possible to construct optimal features to separate classes, in the same way support (or relevance) vector machine methods find separating curves, by maximizing the distance between the moment vectors of the two classes. We think this is an interesting direction for future research.
References
 [1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
 [2] Inc. CVX Research. CVX: Matlab software for disciplined convex programming, version 2.0. http://cvxr.com/cvx, August 2012.
 [3] Richard O. Duda, Peter E. (Peter Elliot) Hart, and David G. Stork. Pattern classification. Wiley, second edition, 2001.
 [4] Pedro Felzenszwalb and John G Oberlin. Multiscale fields of patterns. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 82–90. 2014.
 [5] Pedro F. Felzenszwalb and John G. Oberlin. Multiscale fields of patterns. CoRR, abs/1406.0924, 2014.
 [6] Alexander Ihler and Mike Mandel. KDE toolbox for MATLAB. http://www.ics.uci.edu/~ihler/code/kde.html.
 [7] Jean B Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796–817, 2001.
 [8] Jean B Lasserre. Bounds on measures satisfying moment conditions. Annals of Applied Probability, pages 1114–1137, 2002.
 [9] Jean B Lasserre. A semidefinite programming approach to the generalized problem of moments. Mathematical Programming, 112(1):65–92, 2008.
 [10] Jean B Lasserre. Moments and sums of squares for polynomial optimization and related problems. Journal of Global Optimization, 45(1):39–61, 2009.
 [11] Jean-Bernard Lasserre. Moments, positive polynomials and their applications, volume 1. World Scientific, 2009.
 [12] J.F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11–12:625–653, 1999. Version 1.05 available from http://fewcal.kub.nl/sturm.