Efficient Nonlinear RX Anomaly Detectors

12/07/2020 ∙ by José A. Padrón-Hidalgo, et al.

Current anomaly detection algorithms are typically challenged by a trade-off between accuracy and efficiency: the more accurate nonlinear detectors are typically slow and not scalable. In this letter, we propose two families of techniques to improve the efficiency of the standard kernel Reed-Xiaoli (RX) method for anomaly detection by approximating the kernel function either with data-independent random Fourier features or with a data-dependent basis obtained via the Nyström approach. We compare all methods on real multi- and hyperspectral images. We show that the proposed efficient methods have a lower computational cost and perform similarly to (or outperform) the standard kernel RX algorithm, thanks to their implicit regularization effect. Last but not least, the Nyström approach yields an improved power of detection.


I Introduction

Anomaly detection (AD), as a remote sensing (RS) research topic, is gaining importance because of the need to process the large numbers of images acquired from satellite and airborne platforms [15]. AD aims to detect small portions of the image that do not belong to the background of the scene. Unlike target detection, AD does not use known target spectra, and anomalies are assumed to be rare and at the tail of the background distribution.

Among the many detector algorithms found in the literature, the Reed-Xiaoli (RX) detector [13] is widely used due to its good performance and simplicity. The RX detector identifies target pixels that are spectrally different from the image background based on the Mahalanobis distance. For RX to be effective, anomalous targets must be sufficiently small relative to the background, which is assumed to follow a Gaussian distribution. However, it has been shown that the Gaussian assumption fails, for example, in hyperspectral images or with complex feature relations, especially at the tails of the distribution [8]. As a result, nonlinear versions of RX have been introduced to mitigate this problem, and the kernel RX (KRX) detector was proposed in [7] to cope with complex and nonlinear backgrounds. However, the KRX algorithm has not been widely adopted in practice because, being a kernel method, its memory and computational costs increase at least quadratically with the number of pixels. This poses the perennial problem of accuracy versus usability in nonlinear detectors in general, and in kernel anomaly detectors in particular.

In this study, we focus on improving the space (memory) and time efficiency of the KRX anomaly detector. Kernel-based anomaly detectors provide excellent detection performance since they are able to characterize nonlinear backgrounds [2]. To undertake this challenge, we propose efficient techniques based on random Fourier features and low-rank approximations to obtain improved performance over the KRX algorithm. We reported our initial efforts using the random Fourier features approach in [10].

In the literature, both local and global RX-based detectors have been proposed. In local AD [13], pixels in a sliding window are used as input data. Despite their adaptation to local relations, the detection power of local methods has recently been shown to be low [3, 8]. Conversely, in global AD all image pixels are used to estimate the background statistics. Targets of various sizes and shapes can thereby be detected, although detecting small targets can be difficult. In this study, all methods are used in a global setting for the sake of simplicity.

II RX-Based Anomaly Detection

Among the various AD methods proposed in the literature, one of the most frequently used anomaly detectors is the Reed-Xiaoli (RX) [13]. In this section, we explain the RX method and its kernelized version, the KRX anomaly detector.

II-A RX Anomaly Detector

We consider an acquired image reshaped in matrix form as $\mathbf{X} \in \mathbb{R}^{n \times d}$, where $n$ is the number of pixels and $d$ is the total number of channels acquired by the sensor. For simplicity, let us assume that $\mathbf{X}$ is a centered data matrix. The RX detector characterizes the background in terms of the covariance matrix $\mathbf{C} = \frac{1}{n}\mathbf{X}^\top\mathbf{X}$. The detector calculates the squared Mahalanobis distance between a test pixel $\mathbf{x}_*$ and the background as follows:

$\mathrm{RX}(\mathbf{x}_*) = \mathbf{x}_*^\top \mathbf{C}^{-1} \mathbf{x}_*.$   (1)

In a global AD setting, as discussed here, $\mathbf{C}$ can be efficiently computed using all the image pixels, since the dimensionality of the image is much lower than the number of pixels ($d \ll n$). In a local AD setting, by contrast, $\mathbf{C}$ needs to be computed for each image pixel using the centered pixels in a window centered at that pixel [8].
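As an illustration, a minimal NumPy sketch of the global RX detector in (1) follows; the function and variable names are ours, not taken from the paper's released code:

```python
import numpy as np

def rx_score(X):
    """Global RX: squared Mahalanobis distance of every pixel to the background.

    X is the (n, d) image matrix: n pixels, d spectral bands.
    """
    Xc = X - X.mean(axis=0)            # center the data
    C = Xc.T @ Xc / Xc.shape[0]        # d x d background covariance
    Cinv = np.linalg.inv(C)
    # score_i = x_i^T C^{-1} x_i for each centered pixel x_i
    return np.einsum('ij,jk,ik->i', Xc, Cinv, Xc)
```

Note that only a $d \times d$ matrix is inverted, which is why the linear RX scales well with the number of pixels.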

II-B KRX Anomaly Detector

It is known that the linear RX is computationally efficient and leads to an optimal solution when the pixels in $\mathbf{X}$ follow a Gaussian distribution. However, real-life problems are not always Gaussian distributed, which calls for more flexible models. Kernel methods are a possible solution because they can capture higher-order (nonlinear) feature relations while still relying on linear algebra operations [2].

In order to develop the kernel RX, let us consider mapping the pixels in the image to a higher-dimensional Hilbert feature space $\mathcal{H}$ by means of the feature map $\boldsymbol{\phi}: \mathbb{R}^d \to \mathcal{H}$. The mapped data matrix is now denoted as $\boldsymbol{\Phi} = [\boldsymbol{\phi}(\mathbf{x}_1), \ldots, \boldsymbol{\phi}(\mathbf{x}_n)]^\top$. Let us define a kernel function $k(\mathbf{x},\mathbf{x}')$ that, by virtue of the Riesz representation theorem, can evaluate (reproduce) the dot product between samples in $\mathcal{H}$, i.e. $k(\mathbf{x},\mathbf{x}') = \langle \boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}(\mathbf{x}')\rangle_{\mathcal{H}}$. To estimate how anomalous a pixel under test $\mathbf{x}_*$ is, we first map it to $\boldsymbol{\phi}_* = \boldsymbol{\phi}(\mathbf{x}_*)$ and apply the RX formula in (1) with the feature-space covariance $\mathbf{C}_{\boldsymbol{\phi}} = \frac{1}{n}\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$:

$\mathrm{KRX}(\mathbf{x}_*) = \boldsymbol{\phi}_*^\top \mathbf{C}_{\boldsymbol{\phi}}^{-1} \boldsymbol{\phi}_*,$   (2)

which, after some linear algebra, can be expressed in terms of kernel matrices [6, 2]:

$\mathrm{KRX}(\mathbf{x}_*) = n\, \mathbf{k}_*^\top \mathbf{K}^{-2}\, \mathbf{k}_*,$   (3)

where $\mathbf{k}_* = \boldsymbol{\Phi}\boldsymbol{\phi}_* \in \mathbb{R}^n$ contains the similarities between $\mathbf{x}_*$ and all points in $\mathbf{X}$ using $k$, and $\mathbf{K} = \boldsymbol{\Phi}\boldsymbol{\Phi}^\top \in \mathbb{R}^{n \times n}$ stands for the kernel matrix containing all data similarities [7]. Note that, as in the linear RX method, the KRX also requires centering the data (now in $\mathcal{H}$), which can be easily done.¹ Hereafter we assume that all kernel matrices are centered.

¹ Centering in feature space can be easily done implicitly via the simple kernel matrix operation $\tilde{\mathbf{K}} = \mathbf{H}\mathbf{K}\mathbf{H}$, where $\mathbf{H}_{ij} = \delta_{ij} - \frac{1}{n}$ and $\delta_{ij}$ represents the Kronecker delta, which equals one if $i = j$ and zero otherwise.
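A compact NumPy sketch of the centered KRX score in (3) follows. This is an illustrative implementation only: the RBF kernel choice matches the paper, but the jitter term (added to stabilize the inverse) and all names are our own assumptions:

```python
import numpy as np

def rbf(A, B, sigma):
    """RBF kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def krx_score(X, X_test, sigma=1.0, jitter=1e-6):
    """KRX(x_*) = n * k_*^T K^{-2} k_* with implicit feature-space centering."""
    n = X.shape[0]
    K = rbf(X, X, sigma)
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    Kc = H @ K @ H                                  # centered kernel matrix
    # centered test columns: H (k_* - K 1/n)
    kc = H @ (rbf(X, X_test, sigma) - K.mean(axis=1, keepdims=True))
    Ki = np.linalg.pinv(Kc + jitter * np.eye(n))   # stabilized (pseudo)inverse
    v = Ki @ kc                                     # K^{-1} k_*
    return n * np.sum(v * v, axis=0)                # n ||K^{-1} k_*||^2 = n k^T K^{-2} k
```

The $n \times n$ pseudoinverse in this sketch is exactly the bottleneck the remainder of the letter sets out to remove.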

Note that constructing and inverting a kernel matrix for large $n$ poses a huge computational cost. A simple strategy to alleviate this problem is to draw $m$ samples randomly ($m \ll n$) and use them in the standard KRX, which is here referred to as simple sub-sampling RX (SRX) and defined as

$\mathrm{SRX}(\mathbf{x}_*) = m\, \mathbf{k}_{m*}^\top \mathbf{K}_{mm}^{-2}\, \mathbf{k}_{m*},$   (4)

where $\mathbf{X}_m$ is a data matrix of $m$ pixels sampled from $\mathbf{X}$, $\mathbf{k}_{m*} \in \mathbb{R}^m$ contains the similarities between $\mathbf{x}_*$ and $\mathbf{X}_m$, and $\mathbf{K}_{mm} \in \mathbb{R}^{m \times m}$ is the kernel matrix containing the data similarities between the points in $\mathbf{X}_m$.

III Efficient Techniques for Kernel RX

Kernel methods are able to fit nonlinear problems, but they do not scale well when the number of samples grows. We propose using feature-map and low-rank approximation approaches to improve the efficiency of the KRX detector. We study the following approximations to the KRX method: random Fourier features (RRX), previously studied by the authors in [10]; orthogonal random features (ORX); a naive low-rank approximation (LRX); and the Nyström low-rank approximation (NRX).

III-A Randomized Feature Map Approaches

III-A1 Random Fourier Features (RFF)

An outstanding result in the recent kernel methods literature applies a classical definition from harmonic analysis to kernel approximation and scalability [12]. Bochner's theorem states that a continuous shift-invariant kernel $k(\mathbf{x},\mathbf{x}') = k(\mathbf{x}-\mathbf{x}')$ on $\mathbb{R}^d$ is positive definite (p.d.) if and only if $k$ is the Fourier transform of a non-negative measure. If a shift-invariant kernel is properly scaled, its Fourier transform $p(\mathbf{w})$ is a proper probability distribution. This property is used to approximate kernel functions with linear projections on $D$ random features, $k(\mathbf{x},\mathbf{x}') \approx \mathbf{z}(\mathbf{x})^\top \mathbf{z}(\mathbf{x}')$, where the frequencies $\mathbf{w}_i$ are randomly sampled from the data-independent distribution $p(\mathbf{w})$ [12]. Note that we can define a $D$-dimensional randomized feature map $\mathbf{z}: \mathbb{R}^d \to \mathbb{R}^D$, which can be explicitly constructed as

$\mathbf{z}(\mathbf{x}) = \sqrt{\tfrac{2}{D}}\,\big[\cos(\mathbf{w}_1^\top\mathbf{x} + b_1), \ldots, \cos(\mathbf{w}_D^\top\mathbf{x} + b_D)\big]^\top,$

with $\mathbf{w}_i \sim \mathcal{N}(\mathbf{0}, \sigma^{-2}\mathbf{I})$ and $b_i \sim \mathcal{U}[0, 2\pi]$, to approximate the Radial Basis Function (RBF) kernel $k(\mathbf{x},\mathbf{x}') = \exp(-\|\mathbf{x}-\mathbf{x}'\|^2/(2\sigma^2))$.

Therefore, given $n$ data points (pixels), the kernel matrix can be approximated with the explicitly mapped data $\mathbf{Z} = [\mathbf{z}(\mathbf{x}_1), \ldots, \mathbf{z}(\mathbf{x}_n)]^\top \in \mathbb{R}^{n \times D}$ as $\mathbf{K} \approx \mathbf{Z}\mathbf{Z}^\top$. However, we do not use such an approach in Equation (3), which would lead to a mere approximation with extra computational cost. Instead, we run the linear RX in Equation (1) with the points explicitly mapped onto random Fourier features, which reduces to

$\mathrm{RRX}(\mathbf{x}_*) = \mathbf{z}(\mathbf{x}_*)^\top \mathbf{C}_z^{-1}\, \mathbf{z}(\mathbf{x}_*), \qquad \mathbf{C}_z = \tfrac{1}{n}\mathbf{Z}^\top\mathbf{Z},$   (5)

and leads to a nonlinear randomized RX (RRX) [10] that approximates the KRX. Essentially, we map the original data into a nonlinear space through the explicit mapping $\mathbf{z}(\cdot)$ to a $D$-dimensional space (instead of the potentially infinite-dimensional feature space $\mathcal{H}$), and then use the linear RX formula. This allows us to control the space and time complexity explicitly through $D$, as one only has to store matrices of size $n \times D$ and invert matrices of size $D \times D$ (see Table I). Typically, the parameter $D$ satisfies $D \ll n$ in practical applications.
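The RRX pipeline can be sketched as follows. This is our own minimal implementation under the assumptions above (RBF kernel, Gaussian frequencies); the small regularizer added to the $D \times D$ covariance is an extra choice of ours for numerical stability:

```python
import numpy as np

def rff_map(X, W, b):
    """Random Fourier feature map z(x) = sqrt(2/D) cos(W x + b)."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

def rrx_score(X, X_test, sigma=1.0, D=200, seed=0):
    """RRX: linear RX run on D-dimensional RFF-mapped pixels (approximates KRX)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(D, X.shape[1]))  # w_i ~ N(0, sigma^-2 I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    Z = rff_map(X, W, b)
    mu = Z.mean(axis=0)
    Zc, Ztc = Z - mu, rff_map(X_test, W, b) - mu             # center in feature space
    C = Zc.T @ Zc / Zc.shape[0] + 1e-8 * np.eye(D)           # D x D covariance
    V = np.linalg.solve(C, Ztc.T)
    return np.sum(Ztc.T * V, axis=0)                         # z^T C^{-1} z per test pixel
```

Only the $n \times D$ feature matrix is stored and a $D \times D$ system is solved, in line with the complexity argument above.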

III-A2 Orthogonal Random Features (ORF)

RFF has become a very practical solution to the bottleneck in kernel methods when $n$ grows. In RFF, the frequencies are sampled from a particular pdf and act as a basis. This, however, may lead to features that are linearly dependent and thus geometrically cover less space. Imposing orthogonality on the basis can be a remedy to this issue, which has led to the Orthogonal Random Features (ORF) approach [20]. The linear transformation matrix of ORF is $\mathbf{W}_{\mathrm{ORF}} = \frac{1}{\sigma}\mathbf{S}\mathbf{Q}$, where $\mathbf{Q}$ is a uniformly distributed random orthogonal matrix. The set of rows of $\mathbf{Q}$ forms a basis in $\mathbb{R}^d$. $\mathbf{S}$ is a diagonal matrix whose diagonal entries are sampled i.i.d. from the $\chi$-distribution with $d$ degrees of freedom; $\mathbf{S}$ makes the norms of the rows of $\mathbf{W}_{\mathrm{ORF}}$ and of the RFF frequency matrix identically distributed. Theoretical results show that ORF achieves lower approximation error than RFF for the RBF kernel [20]. This approach follows the above RFF philosophy, and the final anomaly score is now:

$\mathrm{ORX}(\mathbf{x}_*) = \mathbf{z}_o(\mathbf{x}_*)^\top \big(\tfrac{1}{n}\mathbf{Z}_o^\top\mathbf{Z}_o\big)^{-1} \mathbf{z}_o(\mathbf{x}_*),$   (6)

where each frequency $\mathbf{w}_i$ is a row of $\mathbf{W}_{\mathrm{ORF}}$, $\mathbf{Z}_o$ is the matrix formed by the mappings of each element in the dataset, and $\mathbf{z}_o(\mathbf{x}_*)$ is the mapping of the pixel to be tested.
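The ORF frequency matrix can be sketched as below. This is a simplified construction of ours for the square case $D = d$ (for $D > d$ one typically stacks several independent blocks); the QR factorization of a Gaussian matrix yields an approximately Haar-distributed orthogonal $\mathbf{Q}$:

```python
import numpy as np

def orf_matrix(d, sigma, rng):
    """Orthogonal random features: W_ORF = (1/sigma) * S Q.

    Q is a random orthogonal matrix; S is diagonal with entries drawn from the
    chi-distribution with d degrees of freedom, so the row norms of W_ORF match
    those of an i.i.d. Gaussian RFF matrix in distribution.
    """
    G = rng.normal(size=(d, d))
    Q, _ = np.linalg.qr(G)                      # random orthogonal basis
    s = np.sqrt(rng.chisquare(df=d, size=d))    # chi(d)-distributed row norms
    return (s[:, None] * Q) / sigma
```

The resulting rows are mutually orthogonal and are then used exactly like the RFF frequencies in the cosine feature map, yielding the ORX score in (6).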

III-A3 Nyström Approximation

The Nyström method selects a subset $\mathbf{X}_m$ of $m$ samples to construct a low-rank approximation of the kernel matrix [18]. This method approximates the kernel function as $k(\mathbf{x},\mathbf{x}') \approx \mathbf{k}_m(\mathbf{x})^\top \mathbf{K}_{mm}^{-1}\, \mathbf{k}_m(\mathbf{x}')$, where $\mathbf{k}_m(\mathbf{x}) \in \mathbb{R}^m$ contains the similarities between $\mathbf{x}$ and the points in $\mathbf{X}_m$, and $\mathbf{K}_{mm}$ stands for the kernel matrix between the points in $\mathbf{X}_m$. Therefore, $\mathbf{k}_*$ can be expressed as:

$\mathbf{k}_* \approx \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\,\mathbf{k}_{m*},$   (7)

where $\mathbf{K}_{nm} \in \mathbb{R}^{n \times m}$ is the matrix containing the similarities between the points in $\mathbf{X}$ and the points in $\mathbf{X}_m$. The similarities are computed using the standard RBF kernel function $k(\mathbf{x},\mathbf{x}') = \exp(-\|\mathbf{x}-\mathbf{x}'\|^2/(2\sigma^2))$.

Using the definition given in (7), the Nyström method approximates the kernel matrix as

$\mathbf{K} \approx \hat{\mathbf{K}} = \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{nm}^\top.$   (8)

By plugging (7) and (8) into (3), one can define the low-rank approximation of KRX:

$\mathrm{NRX}(\mathbf{x}_*) = n\, \mathbf{k}_{m*}^\top \mathbf{K}_{mm}^{-1}\mathbf{K}_{nm}^\top \big(\hat{\mathbf{K}}^{+}\big)^2\, \mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\,\mathbf{k}_{m*}.$   (9)

Since $\mathbf{K}_{nm}$ is not a square matrix, $\hat{\mathbf{K}}$ is rank deficient, and we propose to use the pseudoinverse $\hat{\mathbf{K}}^{+}$ instead of the inverse of $\hat{\mathbf{K}}$. By doing this, most of the terms cancel, leading to a more compact equation for the NRX:

$\mathrm{NRX}(\mathbf{x}_*) = n\, \mathbf{k}_{m*}^\top \big(\mathbf{K}_{nm}^\top\mathbf{K}_{nm}\big)^{-1}\mathbf{k}_{m*}.$   (10)

Note that NRX involves the inversion of an $m \times m$ matrix, which is much more efficient than KRX. In addition, the Nyström approach is more generic than the random Fourier feature approaches, as it allows one to approximate any positive semidefinite kernel, not just shift-invariant kernels. Furthermore, this approximation is data-dependent (i.e. the basis functions are a subset of the estimation data itself), which can translate into better results [19].
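A sketch of the compact NRX score in (10), with a uniformly random sub-sample, is given below. This is our own minimal implementation: the jitter term is an added stabilizer, and kernel centering is omitted for brevity:

```python
import numpy as np

def rbf(A, B, sigma):
    """RBF kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def nrx_score(X, X_test, m=50, sigma=1.0, jitter=1e-10, seed=0):
    """NRX(x_*) = n * k_m*^T (K_nm^T K_nm)^{-1} k_m*."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xm = X[rng.choice(n, size=min(m, n), replace=False)]  # m random pixels
    Knm = rbf(X, Xm, sigma)                               # n x m similarities
    km = rbf(X_test, Xm, sigma).T                         # m x n_test similarities
    A = Knm.T @ Knm + jitter * np.eye(Knm.shape[1])       # m x m matrix
    return n * np.sum(km * np.linalg.solve(A, km), axis=0)
```

Only the $m \times m$ matrix $\mathbf{K}_{nm}^\top\mathbf{K}_{nm}$ is inverted, so the cost grows linearly with the number of pixels $n$.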

III-A4 Connection to reduced-set methods

Reduced-set techniques have been successfully used to obtain sparse kernel methods and low-rank approximations of multivariate kernel methods [1]. This methodology can be applied to approximate KRX, which leads to Equation (10). In this approach, the data matrix $\mathbf{X}$ is subsampled into $\mathbf{X}_m$, $m \ll n$, and mapped into $\boldsymbol{\Phi}_m$. Using (2), we obtain the LRX formula:

$\mathrm{LRX}(\mathbf{x}_*) = n\, \boldsymbol{\phi}_*^\top \boldsymbol{\Phi}_m^\top \big(\boldsymbol{\Phi}_m \boldsymbol{\Phi}^\top \boldsymbol{\Phi}\, \boldsymbol{\Phi}_m^\top\big)^{+}\, \boldsymbol{\Phi}_m \boldsymbol{\phi}_*.$   (11)

Identifying $\mathbf{k}_{m*} = \boldsymbol{\Phi}_m \boldsymbol{\phi}_*$ and $\mathbf{K}_{nm} = \boldsymbol{\Phi}\boldsymbol{\Phi}_m^\top$, (11) leads to:

$\mathrm{LRX}(\mathbf{x}_*) = n\, \mathbf{k}_{m*}^\top \big(\mathbf{K}_{nm}^\top\mathbf{K}_{nm}\big)^{+}\, \mathbf{k}_{m*},$   (12)

which differs from (10) only in the use of the pseudoinverse instead of the inverse; when $\mathbf{K}_{nm}^\top\mathbf{K}_{nm}$ is full rank, the two coincide. In the following and in the experiments, we will use only NRX instead of LRX, as both are mathematically equivalent.
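A quick numeric sanity check of this family of scores: when the sub-sample is the full data set ($m = n$) and the kernel matrix is well conditioned, the compact Nyström score $n\,\mathbf{k}_*^\top(\mathbf{K}^\top\mathbf{K})^{-1}\mathbf{k}_*$ coincides with the (uncentered) KRX score $n\,\mathbf{k}_*^\top\mathbf{K}^{-2}\mathbf{k}_*$. A sketch under these assumptions (synthetic data and a deliberately narrow RBF kernel, so $\mathbf{K}$ stays well conditioned):

```python
import numpy as np

def rbf(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
Xt = rng.normal(size=(5, 3))
K = rbf(X, X, 0.7)                      # small sigma keeps K well conditioned
k = rbf(X, Xt, 0.7)                     # n x n_test cross-similarities
n = X.shape[0]
# uncentered KRX score: n * k^T K^{-2} k
krx = n * np.sum(k * np.linalg.solve(K, np.linalg.solve(K, k)), axis=0)
# compact Nystrom score with m = n: n * k^T (K^T K)^{-1} k
nrx = n * np.sum(k * np.linalg.solve(K.T @ K, k), axis=0)
assert np.allclose(krx, nrx)
```

This illustrates why the sub-sample size $m$ acts as the single knob trading fidelity to KRX against cost.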

III-B Space and Time Complexity

Table I gives the theoretical computational complexity of the benchmark methods (RX, KRX, SRX) and the proposed methods (RRX, ORX, NRX) presented in this paper. In this study, we assume $n \gg d$, since we aim to deal with big-data settings; besides, KRX is already sufficiently efficient when $n$ is small. As seen in Table I, RX provides the best efficiency; thus, it should be employed for scenes where the data is Gaussian distributed, whereas KRX and the proposed KRX approximations should be used for nonlinear distributions. Clearly, KRX is the least efficient of the nonlinear methods, and it is not applicable to big data. Feature-map methods (RRX and ORX) provide the best computational efficiency for nonlinear (i.e. non-Gaussian) distributions, while low-rank approximation methods (LRX and NRX) are also efficient yet relatively slower than the feature-map methods. Thus, one should choose the proper method based on the image distribution characteristics [14, 5], the detection performance requirements, and the computational resource limitations. These conclusions are assessed experimentally in the following section.

Method      Space     Time
RX          O(nd)     O(nd² + d³)
RRX & ORX   O(nD)     O(ndD + nD² + D³)
NRX         O(nm)     O(nm² + m³)
KRX         O(n²)     O(n³)

Each time entry accounts for computing the transformation of the image into the nonlinear space, building the matrix (covariance, kernel, etc.), and computing its inverse; $n$ is the number of pixels, $d$ the number of bands, $D$ the number of random features, and $m$ the number of sub-samples.

TABLE I: Memory and time complexity for all methods.

IV Experimental Results

Fig. 5: Images with anomalies (outlined in yellow) in four scenarios: (a) hot spots corresponding to latent fires at the World Trade Center (WTC) in NYC (anomalous pixels cover 0.23% of the image), (b) an urban area in Gainesville where the anomalies are vehicles (0.52%), (c) a Quickbird multispectral image acquired over Denver, where the anomalies are roofs in an urbanized area (1.6%), and (d) a beach scene over San Diego, USA, captured by the AVIRIS sensor, where the anomalies are ships (2.02%).

This section analyzes the performance of the proposed nonlinear RX anomaly detection methods. We performed tests on four real examples, and assessed robustness using the area under the curve (AUC) of receiver operating characteristic (ROC) curves. We provide illustrative source code for all methods at http://isp.uv.es/code/fastrx.html

IV-A Data collection and experimental setup

We collected multispectral and hyperspectral images acquired by the Quickbird and AVIRIS sensors. Fig. 5 showcases the scenes used in the experiments. The AD scenarios consider anomalies related to latent fires, vehicles, urbanization (roofs), and ships [3, 4, 11]. Table II summarizes relevant attributes of the datasets, such as the sensor and the spatial and spectral resolution.

Parameter estimation is required for the RX, KRX, RRX, ORX and NRX methods. First of all, the KRX method and its proposed variants involve the optimization of the $\sigma$ parameter of the RBF kernel. For the feature-map approaches (RRX and ORX), the number of bases $D$ should be optimized, whereas for the low-rank approximation (NRX), the number of random sub-samples $m$ should be optimized.

Images       Sensor     Size       Bands  Resolution
WTC          AVIRIS     200 x 200  224    1.7 m
Gainesville  AVIRIS     100 x 100  190    3.5 m
Denver       Quickbird  500 x 684  4      1-4 m
San Diego    AVIRIS     100 x 100  193    7.5 m
TABLE II: Attributes of the images used in the experiments.

We adopted a cross-validation scheme to select all the involved parameters: the number of Fourier bases $D$, the rank $m$, and the RBF parameter $\sigma$. Parameters were selected over a range of different data sizes.

IV-B Numerical comparison

We report AUC results averaged over multiple runs (standard deviations were always low and hence are not reported). Figure 6 shows that the nonlinear methods improve detection over the linear RX, and that NRX outperforms the other approximations in three out of the four images. The AUC values of KRX are related to the inversion of a relatively large matrix. This raises the issue of poorly estimated matrices (with a huge condition number), which are also computationally expensive to invert ($O(n^3)$). All the proposed fast kernel RX methods, however, have the advantage of solving both issues. Firstly, thanks to the cross-validation procedure, an estimate of the optimal number of features (RRX, ORX) or samples (NRX) can be obtained, allowing the intrinsic dimensionality of the mapped data to be better captured. In a previous work [9], we showed that optimizing the number of frequencies in random Fourier feature approaches acts as an efficient regularizer, leading to better estimates with a reduced number of frequencies. Secondly, the fast versions are able to obtain better AUC performance at a fraction of the cost (see Fig. 6).

Fig. 6: ROC curves in linear scale for all scenes (WTC, Gainesville, Denver, San Diego). The numbers in the legend display the AUC values for each method.
Fig. 7: CPU execution time versus the AUC values; crosses correspond to different rank values for the Denver image.

IV-C On the computational efficiency

          WTC         Gainesville  Denver      San Diego
top:      RX (0.94)   RX (0.96)    RX (0.56)   RX (0.92)
bottom:   NRX (0.97)  NRX (0.98)   NRX (0.75)  NRX (0.96)
Fig. 8: Anomaly detection maps at the best thresholds (top: best linear RX results, bottom: best nonlinear RX method; AUC values in parentheses).

Figure 7 illustrates the trade-off between computational execution time and AUC. The crosses indicate different values of the rank (the $D$ or $m$ parameter), and the number of pixels was fixed. The optimal parameters estimated for KRX are used for the fast approaches. KRX has the best AUC values in all the images. NRX and SRX are more sensitive to the rank values. RRX and ORX are almost insensitive to the rank, but their results do not improve when the rank increases, which limits their performance. The combination of lower spectral information and the ambiguity of the class (note that the anomaly class 'urbanized' can be confused with the pervasive class 'urban') makes the Quickbird scene a very difficult problem (lower AUCs). In this situation, as the rank parameter of the NRX method grows, it approximates the KRX algorithm. In Figure 8, the RX detector (top row) is shown against the best detector obtained (bottom row). The best result in AUC was achieved by the NRX in all the images. It is worth noting the good detection results achieved by the NRX in all the scenes, which can be visually compared with the linear version.

V Conclusions

In this letter, we introduced a family of efficient nonlinear anomaly detection algorithms based on the RX method. Building on the theory of reproducing kernels, we proposed several efficient methods: the kernel Reed-Xiaoli (KRX) detector was improved using fast techniques based on feature maps and low-rank approximations. Among all methods, the Nyström and the equivalent low-rank (LRX) approximation achieve the best results and yield a more efficient and accurate nonlinear RX method to be applied in practice. For future research, we plan to study the behaviour of fast approximations for alternative KRX variants [16, 17]. Note that the presented methodologies for fast KRX are also applicable to other kernel anomaly detectors, to local settings, and to real-time detection.

References

  • [1] J. Arenas-Garcia, K. B. Petersen, G. Camps-Valls, and L. K. Hansen (2013) Kernel multivariate analysis framework for supervised subspace learning: a tutorial on linear and kernel multivariate methods. IEEE Signal Processing Magazine 30 (4), pp. 16–29.
  • [2] G. Camps-Valls and L. Bruzzone (2009) Kernel Methods for Remote Sensing Data Analysis. Wiley & Sons, UK.
  • [3] Q. Guo, R. Pu, and J. Cheng (2016) Anomaly detection from hyperspectral remote sensing imagery. Geosciences 6 (4).
  • [4] X. Kang, X. Zhang, S. Li, K. Li, J. Li, and J. A. Benediktsson (2017) Hyperspectral anomaly detection with attribute and edge-preserving filters. IEEE Transactions on Geoscience and Remote Sensing 55 (10), pp. 5600–5611.
  • [5] N. Keshava (2004) Distance metrics and band selection in hyperspectral processing with applications to material identification and spectral libraries. IEEE Transactions on Geoscience and Remote Sensing 42.
  • [6] H. Kwon and N. Nasrabadi (2006) A comparative analysis of kernel subspace target detectors for hyperspectral imagery. EURASIP Journal on Advances in Signal Processing 2007, pp. 29250.
  • [7] H. Kwon and N. M. Nasrabadi (2005) Kernel RX-algorithm: a nonlinear anomaly detector for hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing 43 (2), pp. 388–397.
  • [8] S. Matteoli, M. Diani, and G. Corsini (2010) A tutorial overview of anomaly detection in hyperspectral images. IEEE Aerospace and Electronic Systems Magazine 25 (7), pp. 5–28.
  • [9] P. Morales-Álvarez, A. Pérez-Suay, R. Molina, and G. Camps-Valls (2018) Remote sensing image classification with large-scale Gaussian processes. IEEE Transactions on Geoscience and Remote Sensing 56.
  • [10] F. Nar, A. Pérez-Suay, J. A. Padrón-Hidalgo, and G. Camps-Valls (2018) Randomized RX for target detection. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 4237–4240.
  • [11] J. A. Padrón-Hidalgo, V. Laparra, N. Longbotham, and G. Camps-Valls (2019) Kernel anomalous change detection for remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 57 (10), pp. 7743–7755.
  • [12] A. Rahimi and B. Recht (2008) Random features for large-scale kernel machines. In Neural Information Processing Systems (NIPS).
  • [13] I. S. Reed and X. Yu (1990) Adaptive multiple-band CFAR detection of an optical pattern with unknown spectral distribution. IEEE Transactions on Acoustics, Speech, and Signal Processing 38, pp. 1760–1770.
  • [14]
  • [15] D. W. J. Stein, S. G. Beaven, L. E. Hoff, E. M. Winter, A. P. Schaum, and A. D. Stocker (2002) Anomaly detection from hyperspectral imagery. IEEE Signal Processing Magazine 19 (1), pp. 58–69.
  • [16] J. Theiler and G. Grosklos (2016) Problematic projection to the in-sample subspace for a kernelized anomaly detector. IEEE Geoscience and Remote Sensing Letters 13, pp. 485–489.
  • [17] J. Theiler and G. Grosklos (2016) Cracks in KRX: when more distant points are less anomalous. In 8th IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS).
  • [18] C. K. I. Williams and M. Seeger (2001) Using the Nyström method to speed up kernel machines. In Neural Information Processing Systems (NIPS), pp. 682–688.
  • [19] T. Yang, Y. Li, M. Mahdavi, R. Jin, and Z. Zhou (2012) Nyström method vs random Fourier features: a theoretical and empirical comparison. In Neural Information Processing Systems (NIPS), pp. 476–484.
  • [20] F. X. X. Yu, A. T. Suresh, K. M. Choromanski, D. N. Holtmann-Rice, and S. Kumar (2016) Orthogonal random features. In Neural Information Processing Systems (NIPS), pp. 1975–1983.