Anomaly detection (AD), as a remote sensing (RS) research topic, is gaining importance because of the need for processing the large number of images acquired from satellite and airborne platforms. AD aims to detect small portions of the image that do not belong to the background of the scene. Unlike target detection, AD does not use known target spectra; anomalies are assumed to be rare and to lie at the tail of the background distribution.
Among the many detector algorithms found in the literature, the Reed-Xiaoli (RX) detector is widely used due to its good performance and simplicity. The RX detector determines target pixels that are spectrally different from the image background based on the Mahalanobis metric. For RX to be effective, anomalous targets must be sufficiently small compared to the background, which is assumed to follow a Gaussian distribution. However, it has been shown that the Gaussian assumption fails, for example, in hyperspectral images or with complex feature relations, especially at the tails of the distribution. As a result, nonlinear versions of RX have been introduced to mitigate this problem, and the kernel RX (KRX) detector was proposed to cope with complex and nonlinear backgrounds. However, the KRX algorithm has not been widely adopted in practice because, being a kernel method, its memory and computational cost increase at least quadratically with the number of pixels. This poses the perennial problem of accuracy versus usability in nonlinear detectors in general and kernel anomaly detectors in particular.
In this study, we focus on improving the space (memory) and time efficiency of the KRX anomaly detector. Kernel-based anomaly detectors provide excellent detection performance since they are able to characterize nonlinear backgrounds. To undertake this challenge, we propose to use efficient techniques based on random Fourier features and low-rank approximations to improve the performance of the KRX algorithm. We reported our initial efforts with the random Fourier features approach in a previous conference contribution.
In the literature, local and global RX-based detectors have been proposed. In local AD, pixels in a sliding window are used as input data. Despite their adaptation to local relations, their detection power has recently been shown to be low [3, 8]. Conversely, in global AD all image pixels are used to estimate the statistics. Thereby, targets of various sizes and shapes can be detected, although the detection of small targets can be difficult. In this study, all the methods are used in a global setting for the sake of simplicity.
II. RX-Based Anomaly Detection
Among the various AD methods proposed in the literature, one of the most frequently used anomaly detectors is the Reed-Xiaoli (RX) detector. In this section, we explain the RX method and its kernelized version, the KRX anomaly detector.
II-A. RX Anomaly Detector
We consider an acquired image reshaped in matrix form as $\mathbf{X}\in\mathbb{R}^{n\times d}$, where $n$ is the number of pixels and $d$ is the total number of channels acquired by the sensor. For simplicity, let us assume that $\mathbf{X}$ is a centered data matrix. The RX detector characterizes the background in terms of the covariance matrix $\mathbf{\Sigma}=\frac{1}{n}\mathbf{X}^\top\mathbf{X}$. The detector calculates the squared Mahalanobis distance between a test pixel $\mathbf{x}_*$ and the background as follows:

$$D_{\mathrm{RX}}(\mathbf{x}_*)=\mathbf{x}_*^\top\mathbf{\Sigma}^{-1}\mathbf{x}_* . \qquad (1)$$
In a global AD setting, as discussed here, $\mathbf{\Sigma}$ can be efficiently computed using all the image pixels, since the dimensionality of the image is much lower than the number of pixels ($d\ll n$). In a local AD setting, by contrast, $\mathbf{\Sigma}$ needs to be computed for each image pixel using the centered pixels in a window centered at that pixel.
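Under the definitions above, global RX reduces to one covariance estimate plus a per-pixel quadratic form. A minimal NumPy sketch (the ridge term `eps` is our addition for numerical stability, not part of the original formulation):

```python
import numpy as np

def rx_score(X, eps=1e-8):
    """Global RX anomaly scores: squared Mahalanobis distance of each pixel
    to the background. X has shape (n_pixels, d_bands)."""
    Xc = X - X.mean(axis=0)                           # center the data matrix
    cov = Xc.T @ Xc / Xc.shape[0]                     # d x d background covariance
    cov_inv = np.linalg.inv(cov + eps * np.eye(cov.shape[1]))
    # per-pixel quadratic form x^T Sigma^{-1} x
    return np.einsum('ij,jk,ik->i', Xc, cov_inv, Xc)
```

Note that only a $d\times d$ matrix is inverted, which is why the global RX scales so well when $d\ll n$.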
II-B. KRX Anomaly Detector
It is known that the linear RX is computationally efficient and leads to an optimal solution when the pixels in $\mathbf{X}$ follow a Gaussian distribution. However, real-life problems are not always Gaussian distributed, which requires more flexible models. Kernel methods are a possible solution because they can capture higher-order (nonlinear) feature relations while still using linear algebra operations.
In order to develop the kernel RX, let us consider a mapping of the pixels in the image to a higher dimensional Hilbert feature space $\mathcal{H}$ by means of the feature map $\boldsymbol{\phi}(\cdot)$. The mapped data matrix is now denoted as $\mathbf{\Phi}$. Let us define a kernel function $k(\mathbf{x},\mathbf{y})$ that, by virtue of the Riesz representation theorem, can evaluate (reproduce) the dot product between samples in $\mathcal{H}$, i.e. $k(\mathbf{x},\mathbf{y})=\langle\boldsymbol{\phi}(\mathbf{x}),\boldsymbol{\phi}(\mathbf{y})\rangle$. To estimate how anomalous a pixel under test $\mathbf{x}_*$ is, we first map it, $\boldsymbol{\phi}(\mathbf{x}_*)$, and apply the RX formula in (1) in feature space, which kernelizes to

$$D_{\mathrm{KRX}}(\mathbf{x}_*)=\mathbf{k}_*^\top\mathbf{K}^{-1}\mathbf{k}_* ,$$
where $\mathbf{k}_*=[k(\mathbf{x}_*,\mathbf{x}_1),\ldots,k(\mathbf{x}_*,\mathbf{x}_n)]^\top$ contains the similarities between $\mathbf{x}_*$ and all points in $\mathbf{X}$, and $\mathbf{K}$ stands for the kernel matrix containing all data similarities, $[\mathbf{K}]_{ij}=k(\mathbf{x}_i,\mathbf{x}_j)$. Note that, as in the linear RX method, the KRX also requires centering the data (now in $\mathcal{H}$), which can be easily done implicitly via the simple kernel matrix operation $\tilde{\mathbf{K}}=\mathbf{H}\mathbf{K}\mathbf{H}$, where $[\mathbf{H}]_{ij}=\delta_{ij}-\frac{1}{n}$ and $\delta_{ij}$ represents the Kronecker delta, equal to one if $i=j$ and zero otherwise. Hereafter we assume that all kernel matrices are centered.
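The implicit feature-space centering can be sketched directly on the kernel matrix; assuming the standard double-centering operation (our reconstruction of the footnote), with $\mathbf{1}_n$ the $n\times n$ matrix whose entries are all $1/n$:

```python
import numpy as np

def center_kernel(K):
    """Implicit centering in feature space:
    Kc = K - 1n K - K 1n + 1n K 1n, with 1n the n x n matrix of 1/n."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n
```

Rows and columns of the centered matrix sum to zero, consistent with the mapped samples having zero mean in $\mathcal{H}$.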
Note that constructing and inverting a kernel matrix for large $n$ poses a huge computational cost. A simple strategy to alleviate this problem is to draw $m$ samples randomly ($m\ll n$) and use them in the standard KRX, which is here referred to as simple sub-sampling RX (SRX) and defined as

$$D_{\mathrm{SRX}}(\mathbf{x}_*)=\mathbf{k}_m^\top\mathbf{K}_{mm}^{-1}\mathbf{k}_m ,$$

where $\mathbf{Z}\in\mathbb{R}^{m\times d}$ is a data matrix sampled from $\mathbf{X}$, $\mathbf{k}_m$ contains the similarities between $\mathbf{x}_*$ and $\mathbf{Z}$, and $\mathbf{K}_{mm}$ is the kernel matrix containing the data similarities between the points in $\mathbf{Z}$.
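As a baseline for the approximations that follow, SRX can be sketched in a few lines. We assume the RBF kernel; the ridge `eps` and the parameter defaults are our choices, and feature-space centering is omitted for brevity:

```python
import numpy as np

def rbf(A, B, sigma):
    """RBF kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def srx_score(X, m=50, sigma=1.0, eps=1e-6, seed=0):
    """Sub-sampled KRX (SRX): the standard KRX run on m randomly drawn pixels."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(X.shape[0], size=m, replace=False)]
    Kmm = rbf(Z, Z, sigma)                       # m x m kernel on the subsample
    km = rbf(X, Z, sigma)                        # similarities of every pixel to Z
    Ki = np.linalg.inv(Kmm + eps * np.eye(m))    # regularized m x m inverse
    return np.einsum('ij,jk,ik->i', km, Ki, km)  # per-pixel k_m^T Kmm^{-1} k_m
```

Only an $m\times m$ matrix is ever built and inverted, at the price of ignoring most of the background statistics.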
III. Efficient Techniques for Kernel RX
Kernel methods are able to fit nonlinear problems, but they do not scale well when the number of samples grows. We propose using feature-map and low-rank approximation approaches to improve the efficiency of the KRX detector. We study the following approximations to the KRX method: random Fourier features (RRX), previously studied by the authors, orthogonal random features (ORX), a naive low-rank approximation (LRX), and the Nyström low-rank approximation (NRX).
III-A. Randomized Feature Map Approaches
III-A1. Random Fourier Features (RFF)
An outstanding result in the recent kernel methods literature exploits a classical definition in harmonic analysis for kernel approximation and scalability. Bochner's theorem states that a continuous shift-invariant kernel $k(\mathbf{x},\mathbf{y})=k(\mathbf{x}-\mathbf{y})$ on $\mathbb{R}^d$ is positive definite (p.d.) if and only if $k$ is the Fourier transform of a non-negative measure. If a shift-invariant kernel is properly scaled, its Fourier transform

$$p(\mathbf{w})=\frac{1}{(2\pi)^d}\int e^{-j\mathbf{w}^\top\boldsymbol{\delta}}\,k(\boldsymbol{\delta})\,d\boldsymbol{\delta}$$

is a proper probability distribution. This property is used to approximate kernel functions with linear projections on a number $D$ of random features:

$$k(\mathbf{x},\mathbf{y})=\int_{\mathbb{R}^d} p(\mathbf{w})\,e^{j\mathbf{w}^\top(\mathbf{x}-\mathbf{y})}\,d\mathbf{w}\approx\frac{1}{D}\sum_{i=1}^{D}e^{j\mathbf{w}_i^\top\mathbf{x}}\,e^{-j\mathbf{w}_i^\top\mathbf{y}},$$

where the $\mathbf{w}_i\in\mathbb{R}^d$ are randomly sampled from the data-independent distribution $p(\mathbf{w})$. Note that we can define a $D$-dimensional randomized feature map $\mathbf{z}:\mathbb{R}^d\to\mathbb{R}^D$, which can be explicitly constructed as

$$\mathbf{z}(\mathbf{x})=\sqrt{\tfrac{2}{D}}\,\big[\cos(\mathbf{w}_1^\top\mathbf{x}+b_1),\ldots,\cos(\mathbf{w}_D^\top\mathbf{x}+b_D)\big]^\top,\qquad b_i\sim\mathcal{U}(0,2\pi),\;\mathbf{w}_i\sim\mathcal{N}(\mathbf{0},\sigma^{-2}\mathbf{I}),$$

to approximate the Radial Basis Function (RBF) kernel.
Therefore, given $n$ data points (pixels), the kernel matrix can be approximated with the explicitly mapped data, $\mathbf{Z}=[\mathbf{z}(\mathbf{x}_1),\ldots,\mathbf{z}(\mathbf{x}_n)]^\top$, and will be denoted as $\hat{\mathbf{K}}=\mathbf{Z}\mathbf{Z}^\top$. However, we do not use such an approach in Equation (3), which would lead to a mere approximation with extra computational cost. Instead, we run the linear RX in Equation (1) with the points explicitly mapped onto random Fourier features, which reduces to

$$D_{\mathrm{RRX}}(\mathbf{x}_*)=\mathbf{z}(\mathbf{x}_*)^\top\Big(\tfrac{1}{n}\mathbf{Z}^\top\mathbf{Z}\Big)^{-1}\mathbf{z}(\mathbf{x}_*)$$

and leads to a nonlinear randomized RX (RRX) that approximates the KRX. Essentially, we map the original data into a nonlinear space through the explicit mapping $\mathbf{z}(\cdot)$ to a $D$-dimensional space (instead of the potentially infinite-dimensional feature space $\mathcal{H}$), and then use the linear RX formula. This allows us to control the space and time complexity explicitly through $D$, as one only has to store matrices of size $n\times D$ and invert matrices of size $D\times D$ (see Table I). Typically, the parameter $D$ satisfies $D\ll n$ in practical applications.
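Under the RBF assumption, the RRX pipeline is just the explicit random map followed by the linear RX. A compact sketch (parameter defaults and the `eps` ridge are our choices):

```python
import numpy as np

def rff_map(X, D, sigma, seed=0):
    """Random Fourier features z(x) = sqrt(2/D) cos(W x + b) approximating
    the RBF kernel: rows of W ~ N(0, sigma^-2 I), b ~ U(0, 2*pi)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / sigma, size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def rrx_score(X, D=100, sigma=1.0, eps=1e-8, seed=0):
    """Randomized RX (RRX): linear RX on the explicit D-dimensional map,
    so only a D x D matrix is ever inverted."""
    Z = rff_map(X, D, sigma, seed)
    Zc = Z - Z.mean(axis=0)                       # center in the mapped space
    cov = Zc.T @ Zc / Zc.shape[0]                 # D x D covariance
    ci = np.linalg.inv(cov + eps * np.eye(D))
    return np.einsum('ij,jk,ik->i', Zc, ci, Zc)
```

The dot products of the mapped points, $\mathbf{z}(\mathbf{x})^\top\mathbf{z}(\mathbf{y})$, converge to the RBF kernel values as $D$ grows.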
III-A2. Orthogonal Random Features (ORF)
RFF has become a very practical solution to the kernel-method bottleneck when $n$ grows. In RFF, the frequencies $\mathbf{w}_i$ are sampled from a particular pdf and act as a basis. This, however, may lead to features that are linearly dependent and thus geometrically cover less space. Imposing orthogonality on the basis can remedy this issue, which has led to Orthogonal Random Features (ORF). The linear transformation matrix of ORF is $\mathbf{W}_{\mathrm{ORF}}=\frac{1}{\sigma}\mathbf{S}\mathbf{Q}$, where $\mathbf{Q}$ forms an orthonormal basis in $\mathbb{R}^d$ and $\mathbf{S}$ is a diagonal matrix with diagonal entries sampled i.i.d. from the $\chi$-distribution with $d$ degrees of freedom. $\mathbf{S}$ makes the norms of the rows of $\mathbf{W}_{\mathrm{ORF}}$ and of the RFF frequency matrix identically distributed. Theoretical results show that ORF achieves lower approximation error than RFF for the RBF kernel. This approach follows the above RFF philosophy, and the final anomaly score is now

$$D_{\mathrm{ORX}}(\mathbf{x}_*)=\mathbf{z}_{\mathrm{ORF}}(\mathbf{x}_*)^\top\Big(\tfrac{1}{n}\mathbf{Z}_{\mathrm{ORF}}^\top\mathbf{Z}_{\mathrm{ORF}}\Big)^{-1}\mathbf{z}_{\mathrm{ORF}}(\mathbf{x}_*),$$

where each frequency $\mathbf{w}_i$ is a row of $\mathbf{W}_{\mathrm{ORF}}$, $\mathbf{Z}_{\mathrm{ORF}}$ is the matrix formed by the mappings of each element in the dataset, and $\mathbf{z}_{\mathrm{ORF}}(\mathbf{x}_*)$ is the mapping of the pixel under test.
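For the square case $D=d$, the ORF transformation can be sketched as follows; for $D>d$ one would stack independent blocks. The function name and defaults are ours, and the orthonormal basis is obtained from the QR decomposition of a Gaussian matrix, a common construction:

```python
import numpy as np

def orf_map(X, sigma, seed=0):
    """Orthogonal random features for the RBF kernel (square case D = d):
    W_ORF = (1/sigma) S Q, with Q orthonormal and S diagonal with entries
    drawn from the chi distribution with d degrees of freedom."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthonormal basis
    S = np.sqrt(rng.chisquare(d, size=d))          # chi(d) row norms
    W = (S[:, None] * Q) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi, size=d)
    return np.sqrt(2.0 / d) * np.cos(X @ W.T + b)
```

ORX then applies the linear RX of Section II to these features, exactly as RRX does with the RFF map.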
III-A3. Nyström Approximation
The Nyström method selects a subset $\mathbf{Z}$ of $m$ samples to construct a low-rank approximation of the kernel matrix $\mathbf{K}$. This method approximates the kernel function as

$$\hat{k}(\mathbf{x},\mathbf{y})=\mathbf{k}_{\mathbf{x}}^\top\mathbf{K}_{mm}^{-1}\mathbf{k}_{\mathbf{y}} , \qquad (7)$$

where $\mathbf{k}_{\mathbf{x}}$ contains the similarities between $\mathbf{x}$ and the points in $\mathbf{Z}$, and $\mathbf{K}_{mm}$ stands for the kernel matrix between the points in $\mathbf{Z}$. The similarities are computed using the standard RBF kernel function $k(\mathbf{x},\mathbf{y})=\exp\left(-\|\mathbf{x}-\mathbf{y}\|^2/(2\sigma^2)\right)$.
Using the above definition given in (7), the Nyström method approximates the kernel matrix as

$$\hat{\mathbf{K}}=\mathbf{K}_{nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mn},$$

where $\mathbf{K}_{nm}\in\mathbb{R}^{n\times m}$ contains the similarities between the points in $\mathbf{X}$ and the points in $\mathbf{Z}$, while $\mathbf{K}_{mn}=\mathbf{K}_{nm}^\top$. Since $\mathbf{K}_{nm}$ is not a square matrix, $\hat{\mathbf{K}}$ is rank deficient, and we propose to use the pseudoinverse instead of the inverse of $\hat{\mathbf{K}}$. By doing this, most of the terms cancel, leading to a more compact equation for the NRX.
Note that NRX involves the inversion of an $m\times m$ matrix, which is much more efficient than the KRX. In addition, the Nyström approach is more generic than the random Fourier feature approaches, as it allows one to approximate all positive semidefinite kernels, not just shift-invariant ones. Furthermore, this approximation is data-dependent (i.e. the basis functions are a subset of the estimation data itself), which could translate into better results.
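The Nyström route can be sketched as follows. Since the letter's exact compact NRX expression is not reproduced above, this is our reading of the construction rather than a verbatim implementation: we score against the pseudoinverse of the approximated kernel matrix, using the identity $(\mathbf{C}\mathbf{W}^{-1}\mathbf{C}^\top)^{+}=\mathbf{C}(\mathbf{C}^\top\mathbf{C})^{-1}\mathbf{W}(\mathbf{C}^\top\mathbf{C})^{-1}\mathbf{C}^\top$, valid when $\mathbf{C}$ has full column rank and $\mathbf{W}$ is positive definite, so that only $m\times m$ systems are inverted:

```python
import numpy as np

def rbf(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nrx_score(X, m=30, sigma=1.0, seed=0):
    """Nystrom-based KRX sketch: approximate K by C W^{-1} C^T from m
    sampled pixels and score against its pseudoinverse."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(X.shape[0], size=m, replace=False)]
    C = rbf(X, Z, sigma)                 # n x m cross-similarities (K_nm)
    W = rbf(Z, Z, sigma)                 # m x m kernel on the subsample (K_mm)
    G = np.linalg.inv(C.T @ C)           # m x m inverse Gram of C
    M = G @ W @ G                        # core so that Khat^+ = C M C^T
    # k_i, the similarity of pixel i to all pixels, is row i of the full
    # kernel matrix; the full matrix is kept here only for clarity.
    P = rbf(X, X, sigma) @ C             # n x m projections C^T k_i
    return np.einsum('ij,jk,ik->i', P, M, P)
```

A production version would avoid materializing the full $n\times n$ kernel matrix; the point of the sketch is that no $n\times n$ inverse is ever taken.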
III-A4. Connection to reduced-set methods
Reduced-set techniques have been successfully used to obtain sparse kernel methods and low-rank approximations of multivariate kernel methods. This methodology can be applied to approximate the KRX, which leads to equation (10). In this approach, the data matrix $\mathbf{X}$ is subsampled into $\mathbf{Z}\in\mathbb{R}^{m\times d}$, $m\ll n$, and mapped into $\mathbf{\Phi}_m$; using (2), we obtain the LRX formula.
Identifying the corresponding terms, (11) leads to an expression that differs from (10) only in the inverse of $\mathbf{K}_{mm}$; when $\mathbf{K}_{mm}$ is full rank, the two are the same. In the following and in the experiments, we will use only NRX instead of LRX, as both are mathematically equivalent.
III-B. Space and Time Complexity
Table I gives the theoretical computational complexity of the benchmark methods (RX, KRX, SRX) and the proposed methods (RRX, ORX, NRX) presented in this paper. In this study, we assume $d\ll n$ since we aim to deal with big data settings. Besides, KRX becomes sufficiently efficient only when the image is small. As seen in Table I, RX provides the best efficiency; thus, it should be employed for scenes where the data is Gaussian distributed, while KRX and the proposed KRX approximations should be used for nonlinear distributions. Clearly, KRX is the least efficient of all and is not applicable to big data. Feature-map methods, e.g. RRX and ORX, provide the best computational efficiency for nonlinear (i.e. non-Gaussian) distributions, while low-rank approximation methods, e.g. LRX and NRX, are also efficient yet relatively slower than the feature-map methods. Thus, one should choose the proper method based on the image distribution characteristics [14, 5], detection performance requirements, and computational resource limitations. These conclusions are assessed experimentally in the following section.
[Table I: space and time complexity of the compared detectors. Notation: $\mathbf{Z}$ is the transformation of the image into a nonlinear space; $\mathbf{M}$ denotes the matrix to be inverted (covariance, kernel, etc.) and $\mathbf{M}^{-1}$ its inverse.]
IV. Experimental Results
This section analyzes the performance of the proposed nonlinear RX anomaly detection methods. We performed tests on four real examples and assessed robustness using the area under the curve (AUC) of receiver operating characteristic (ROC) curves. We provide illustrative source code for all methods at http://isp.uv.es/code/fastrx.html.
IV-A. Data collection and experimental setup
We collected multispectral and hyperspectral images acquired by the Quickbird and AVIRIS sensors. Fig. 5 showcases the scenes used in the experiments. The AD scenarios consider anomalies related to latent fires, vehicles, urbanization (roofs), and ships [3, 4, 11]. Table II summarizes relevant attributes of the datasets, such as sensor, spatial resolution, and spectral resolution.
Parameter estimation is required for RX, KRX, RRX, ORX, and NRX. First of all, the KRX method and its proposed variants involve the optimization of the $\sigma$ parameter of the RBF kernel. For the feature-map approaches (RRX and ORX), the number of basis functions, $D$, should be optimized, whereas for the low-rank approximations (NRX), the number of random sub-samples, $m$, should be optimized.
| Scene | Sensor | Size (pixels) | Bands | Spatial resolution |
|---|---|---|---|---|
| WTC | AVIRIS | 200 x 200 | 224 | 1.7 m |
| Gainesville | AVIRIS | 100 x 100 | 190 | 3.5 m |
| Denver | Quickbird | 500 x 684 | 4 | 1 m-4 m |
| San Diego | AVIRIS | 100 x 100 | 193 | 7.5 m |
We adopted a cross-validation scheme to select all the involved parameters: the number of Fourier bases $D$, the rank $m$, and the RBF parameter $\sigma$. Parameters were selected using different data sizes.
IV-B. Numerical comparison
We report the AUC results for all cases averaged over several runs (standard deviations were always small and hence are not reported). Figure 6 shows that the nonlinear methods improve detection over the linear RX and that NRX outperforms the other approximations in three out of the four images. The AUC values of KRX are related to the inversion of a relatively big matrix. This raises the issue of poorly estimated matrices (with a huge condition number), which are also computationally expensive to invert. However, all the proposed fast kernel RX methods have the advantage of addressing both issues. Firstly, thanks to the cross-validation procedure, an estimate of the optimal number of features (RRX, ORX) or samples (NRX) can be obtained, allowing us to better capture the intrinsic dimensionality of the mapped data. In a previous work, we showed that optimizing the number of frequencies in random Fourier feature approaches acts as an efficient regularizer, leading to better estimates with a reduced number of frequencies. Secondly, the fast versions are able to obtain better performance in the AUC metric at a fraction of the cost (see Fig. 6).
IV-C. On the computational efficiency
[Fig. 8 panel labels: RX AUCs of 0.94, 0.96, 0.56, and 0.92 (top row) versus NRX AUCs of 0.97, 0.98, 0.75, and 0.96 (bottom row).]
Figure 7 illustrates the trade-off between computational execution time and AUC. The crosses indicate different values of the rank (the $m$ or $D$ parameters), and the number of pixels was fixed. The optimal parameters estimated for KRX are used for the fast approaches. KRX has the best AUC values in all the images. NRX and SRX are more sensitive to the rank value. RRX and ORX are almost insensitive to the rank, but their results do not improve when the rank increases, thus limiting their performance. The combination of lower spectral information and the ambiguity of the class (note that the anomaly class 'urbanized' can be confused with the pervasive class 'urban') makes the Quickbird scene a very difficult problem (lower AUCs). In this situation, as the rank parameter of the NRX method grows, it approximates the KRX algorithm. In Figure 8, the RX detector (top row) is shown against the best detector obtained (bottom row). The best result in AUC was achieved by the NRX in all the images. It is worth mentioning the good detection results achieved by the NRX in all the scenes, which can be visually compared with the linear version.
In this letter, we introduced a family of efficient nonlinear anomaly detection algorithms based on the RX method. We used the theory of reproducing kernels and proposed several efficient methods. The kernel Reed-Xiaoli (KRX) detector was improved using efficient and fast techniques based on feature maps and low-rank approximations. Among all methods, the Nyström (NRX) and the equivalent low-rank (LRX) approximations achieve the best results and yield a more efficient and accurate nonlinear RX method to be applied in practice. For future research, we plan to study the behaviour of fast approximations for alternative KRX variants [16, 17]. Note that the presented methodologies for fast KRX are also applicable to other kernel anomaly detectors, in local settings, and for real-time detection.
- Kernel multivariate analysis framework for supervised subspace learning: a tutorial on linear and kernel multivariate methods. IEEE Signal Processing Magazine 30 (4), pp. 16-29.
- (2009) Kernel methods for remote sensing data analysis. Wiley & Sons, UK.
- (2016) Anomaly detection from hyperspectral remote sensing imagery. Geosciences 6 (4).
- (2017) Hyperspectral anomaly detection with attribute and edge-preserving filters. IEEE Transactions on Geoscience and Remote Sensing 55 (10), pp. 5600-5611.
- (2004) Distance metrics and band selection in hyperspectral processing with applications to material identification and spectral libraries. IEEE Transactions on Geoscience and Remote Sensing 42.
- (2006) A comparative analysis of kernel subspace target detectors for hyperspectral imagery. EURASIP Journal on Advances in Signal Processing 2007, pp. 29250.
- (2005) Kernel RX-algorithm: a nonlinear anomaly detector for hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing 43 (2), pp. 388-397.
- (2010) A tutorial overview of anomaly detection in hyperspectral images. IEEE Aerospace and Electronic Systems Magazine 25 (7), pp. 5-28.
- (2018) Remote sensing image classification with large-scale Gaussian processes. IEEE Transactions on Geoscience and Remote Sensing 56.
- (2018) Randomized RX for target detection. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 4237-4240.
- (2019) Kernel anomalous change detection for remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 57 (10), pp. 7743-7755.
- (2008) Random features for large-scale kernel machines. In Neural Information Processing Systems (NIPS).
- (1990) Adaptive multiple-band CFAR detection of an optical pattern with unknown spectral distribution. IEEE Transactions on Acoustics, Speech, and Signal Processing 38, pp. 1760-1770.
- (2002) Anomaly detection from hyperspectral imagery. IEEE Signal Processing Magazine 19 (1), pp. 58-69.
- (2016) Problematic projection to the in-sample subspace for a kernelized anomaly detector. IEEE Geoscience and Remote Sensing Letters 13, pp. 485-489.
- (2016) Cracks in KRX: when more distant points are less anomalous. In 8th IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS).
- (2001) Using the Nyström method to speed up kernel machines. In Neural Information Processing Systems (NIPS), pp. 682-688.
- (2012) Nyström method vs random Fourier features: a theoretical and empirical comparison. In Neural Information Processing Systems (NIPS), pp. 476-484.
- (2016) Orthogonal random features. In Neural Information Processing Systems (NIPS), pp. 1975-1983.