I Introduction
Anomaly detection (AD), as a remote sensing (RS) research topic, is gaining importance because of the need to process the large numbers of images acquired from satellite and airborne platforms [15]. AD aims to detect small portions of an image that do not belong to the background of the scene. Unlike target detection, AD does not use known target spectra; anomalies are assumed to be rare and to lie at the tail of the background distribution.
Among the many detection algorithms found in the literature, the Reed-Xiaoli (RX) detector [13] is widely used due to its good performance and simplicity. The RX detector flags target pixels that are spectrally different from the image background based on the Mahalanobis distance. For RX to be effective, anomalous targets must be sufficiently small compared to the background, which is assumed to follow a Gaussian distribution. However, it has been shown that the Gaussian assumption fails, for example, in hyperspectral images or with complex feature relations, especially at the tails of the distribution [8]. As a result, nonlinear versions of RX have been introduced to mitigate this problem, and the kernel RX (KRX) detector was proposed in [7] to cope with complex and nonlinear backgrounds. However, the KRX algorithm has not been widely adopted in practice because, being a kernel method, its memory and computational cost increase at least quadratically with the number of pixels. This poses the perennial problem of accuracy versus usability in nonlinear detectors in general, and in kernel anomaly detectors in particular.
In this study, we focus on improving the space (memory) and time efficiency of the KRX anomaly detector. Kernel-based anomaly detectors provide excellent detection performance since they are able to characterize nonlinear backgrounds [2]. To undertake this challenge, we propose efficient techniques based on random Fourier features and low-rank approximations to improve the performance of the KRX algorithm. We reported our initial efforts on the random Fourier features approach in [10].
In the literature, both local and global RX-based detectors have been proposed. In local AD [13], the pixels in a sliding window are used as input data. Despite their adaptation to local relations, local detectors have recently been shown to have low detection power [3, 8]. Conversely, global AD uses all image pixels to estimate the background statistics. Targets of various sizes and shapes can thereby be detected, although small targets can be harder to find. In this study, all methods are used in a global setting for the sake of simplicity.
II RX-Based Anomaly Detection
Among the various AD methods proposed in the literature, one of the most frequently used anomaly detectors is the Reed-Xiaoli (RX) detector [13]. In this section, we explain the RX method and its kernelized version, the KRX anomaly detector.
II-A RX Anomaly Detector
We consider an acquired image reshaped in matrix form as $\mathbf{X}\in\mathbb{R}^{n\times d}$, where $n$ is the number of pixels and $d$ is the total number of channels acquired by the sensor. For simplicity, let us assume that $\mathbf{X}$ is a centered data matrix. The RX detector characterizes the background in terms of the covariance matrix $\mathbf{C}=\frac{1}{n}\mathbf{X}^\top\mathbf{X}$. The detector calculates the squared Mahalanobis distance between a test pixel $\mathbf{z}\in\mathbb{R}^d$ and the background as follows:
$\mathrm{RX}(\mathbf{z}) = \mathbf{z}^\top\,\mathbf{C}^{-1}\,\mathbf{z}.$  (1)
In a global AD setting, as discussed here, $\mathbf{C}$ can be efficiently computed using all the image pixels, since the dimensionality of the image is much lower than the number of pixels ($d\ll n$). In a local AD setting, by contrast, $\mathbf{C}$ needs to be computed for each image pixel using the centered pixels in a window centered at that pixel [8].
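As a minimal sketch of the global RX computation above (NumPy; the function name `rx_scores` is our choice, not from the paper's released code), the score of every pixel can be obtained at once:

```python
import numpy as np

def rx_scores(X):
    """Global RX: squared Mahalanobis distance of every pixel to the
    background, estimated from all image pixels (X is n x d)."""
    Xc = X - X.mean(axis=0)              # center the data matrix
    C = Xc.T @ Xc / Xc.shape[0]          # d x d background covariance
    Ci = np.linalg.inv(C)                # cheap to invert, since d << n
    # z^T C^{-1} z for every (centered) pixel z, fully vectorized
    return np.einsum('nd,de,ne->n', Xc, Ci, Xc)
```

An anomalous pixel then simply corresponds to a large score, thresholded at the desired false-alarm rate.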
II-B KRX Anomaly Detector
It is known that the linear RX is computationally efficient and leads to an optimal solution when the pixels in $\mathbf{X}$ follow a Gaussian distribution. However, real-life problems are not always Gaussian distributed, and this requires more flexible models. Kernel methods are a possible solution because they can capture higher-order (nonlinear) feature relations while still relying on linear algebra operations [2].
In order to develop the kernel RX, let us consider mapping the pixels in the image to a higher-dimensional Hilbert feature space $\mathcal{H}$ by means of the feature map $\bm{\phi}:\mathbb{R}^d\to\mathcal{H}$. The mapped data matrix is now denoted as $\bm{\Phi}$. Let us define a kernel function $k(\mathbf{x},\mathbf{x}')$ that, by virtue of the Riesz representation theorem, can evaluate (reproduce) the dot product between samples in $\mathcal{H}$, i.e., $k(\mathbf{x},\mathbf{x}')=\bm{\phi}(\mathbf{x})^\top\bm{\phi}(\mathbf{x}')$. To estimate how anomalous a pixel under test $\mathbf{z}$ is, we first map it to $\bm{\phi}(\mathbf{z})$ and apply the RX formula in (1) as
$\mathrm{RX}_{\mathcal{H}}(\mathbf{z}) = \bm{\phi}(\mathbf{z})^\top\,\mathbf{C}_{\mathcal{H}}^{-1}\,\bm{\phi}(\mathbf{z}),\quad \mathbf{C}_{\mathcal{H}}=\tfrac{1}{n}\bm{\Phi}^\top\bm{\Phi},$  (2)
which, after some linear algebra, can be expressed in terms of kernel matrices [6, 2]:
$\mathrm{KRX}(\mathbf{z}) = \mathbf{k}_z^\top\,\mathbf{K}^{-2}\,\mathbf{k}_z,$  (3)
where $\mathbf{k}_z=\bm{\Phi}\bm{\phi}(\mathbf{z})\in\mathbb{R}^{n}$ contains the similarities between $\mathbf{z}$ and all points in $\mathbf{X}$ using $k$, and $\mathbf{K}=\bm{\Phi}\bm{\Phi}^\top\in\mathbb{R}^{n\times n}$ stands for the kernel matrix containing all data similarities [7]. Note that, as in the linear RX method, the KRX also requires centering the data (now in $\mathcal{H}$), which can be easily done implicitly via the simple kernel matrix operation $\tilde{\mathbf{K}}=\mathbf{H}\mathbf{K}\mathbf{H}$, where $\mathbf{H}_{ij}=\delta_{ij}-\frac{1}{n}$ and $\delta_{ij}$ represents the Kronecker delta ($\delta_{ij}=1$ if $i=j$ and zero otherwise). Hereafter we assume that all kernel matrices are centered.
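The implicit feature-space centering just described amounts to a single matrix sandwich; a small sketch (NumPy; `center_kernel` is our name for it):

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix in feature space: K <- H K H,
    with H_ij = delta_ij - 1/n (identity minus the averaging matrix)."""
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return H @ K @ H
```

After centering, every row and column of the kernel matrix sums to zero, mirroring a zero-mean data matrix in the (implicit) feature space.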
Note that constructing and inverting a kernel matrix of large $n$ poses a huge computational cost. A simple strategy to alleviate this problem is to draw $r$ samples at random ($r\ll n$) and use them in the standard KRX, which is here referred to as simple subsampling (SRX) and defined as
$\mathrm{SRX}(\mathbf{z}) = \mathbf{k}_{rz}^\top\,\mathbf{K}_r^{-2}\,\mathbf{k}_{rz},$  (4)
where $\mathbf{X}_r\in\mathbb{R}^{r\times d}$ is a data matrix sampled from $\mathbf{X}$, $\mathbf{k}_{rz}\in\mathbb{R}^{r}$ contains the similarities between $\mathbf{z}$ and $\mathbf{X}_r$, and $\mathbf{K}_r\in\mathbb{R}^{r\times r}$ is the kernel matrix containing the data similarities between the points in $\mathbf{X}_r$.
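The SRX score in (4) can be sketched as follows (NumPy; the RBF kernel is assumed, kernel centering is omitted for brevity, and the small ridge added before inversion is our numerical-stability choice, not part of the method):

```python
import numpy as np

def rbf(A, B, sigma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def srx_scores(X, Z, r, sigma, seed=0):
    """Simple subsampling KRX: k_rz^T K_r^{-2} k_rz using r random pixels."""
    rng = np.random.default_rng(seed)
    Xr = X[rng.choice(len(X), size=r, replace=False)]
    Kr = rbf(Xr, Xr, sigma)                        # r x r kernel matrix
    Kz = rbf(Z, Xr, sigma)                         # test-vs-sample similarities
    M = np.linalg.inv(Kr @ Kr + 1e-6 * np.eye(r))  # ridge for stability
    return np.einsum('mr,rs,ms->m', Kz, M, Kz)
```

Only an $r\times r$ matrix is ever inverted, which is the whole point of the subsampling strategy.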
III Efficient Techniques for Kernel RX
Kernel methods are able to fit nonlinear problems, but they do not scale well when the number of samples grows. We propose using feature-map and low-rank approximation approaches to improve the efficiency of the KRX detector. We study the following approximations to the KRX method: random Fourier features (RRX), previously studied by the authors in [10]; orthogonal random features (ORX); a naive low-rank approximation (LRX); and the Nyström low-rank approximation (NRX).
III-A Randomized Feature Map Approaches
III-A1 Random Fourier Features (RFF)
An outstanding result in the recent kernel methods literature applies a classical definition from harmonic analysis to kernel approximation and scalability [12]. Bochner's theorem states that a continuous shift-invariant kernel $k(\mathbf{x},\mathbf{x}')=k(\mathbf{x}-\mathbf{x}')$ on $\mathbb{R}^d$ is positive definite (p.d.) if and only if $k$ is the Fourier transform of a non-negative measure. If a shift-invariant kernel is properly scaled, its Fourier transform $p(\mathbf{w})$ is a proper probability distribution. This property is used to approximate kernel functions with linear projections on $D$ random features as
$k(\mathbf{x}-\mathbf{x}') \approx \tfrac{1}{D}\sum_{i=1}^{D} e^{\mathrm{j}\mathbf{w}_i^\top\mathbf{x}}\,e^{-\mathrm{j}\mathbf{w}_i^\top\mathbf{x}'},$
where the frequencies $\mathbf{w}_i$ are randomly sampled from the data-independent distribution $p(\mathbf{w})$ [12]. Note that we can define a $D$-dimensional randomized feature map $\bm{\psi}:\mathbb{R}^d\to\mathbb{R}^{D}$, which can be explicitly constructed as
$\bm{\psi}(\mathbf{x}) = \sqrt{2/D}\,\big[\cos(\mathbf{w}_1^\top\mathbf{x}+b_1),\ldots,\cos(\mathbf{w}_D^\top\mathbf{x}+b_D)\big]^\top,$
with $\mathbf{w}_i\sim\mathcal{N}(\mathbf{0},\sigma^{-2}\mathbf{I})$ and $b_i\sim\mathcal{U}(0,2\pi)$, to approximate the radial basis function (RBF) kernel.
Therefore, given $n$ data points (pixels), the kernel matrix can be approximated with the explicitly mapped data $\bm{\Psi}=[\bm{\psi}(\mathbf{x}_1),\ldots,\bm{\psi}(\mathbf{x}_n)]^\top\in\mathbb{R}^{n\times D}$, and will be denoted as $\hat{\mathbf{K}}=\bm{\Psi}\bm{\Psi}^\top$. However, we do not use such an approach in Equation (3), which would lead to a mere approximation with extra computational cost. Instead, we run the linear RX in Equation (1) with the points explicitly mapped onto random Fourier features, which reduces to
$\mathrm{RRX}(\mathbf{z}) = \bm{\psi}(\mathbf{z})^\top\,\hat{\mathbf{C}}^{-1}\,\bm{\psi}(\mathbf{z}),\quad \hat{\mathbf{C}}=\tfrac{1}{n}\bm{\Psi}^\top\bm{\Psi},$  (5)
and leads to a nonlinear randomized RX (RRX) [10] that approximates the KRX. Essentially, we map the original data into a nonlinear space through the explicit mapping to a $D$-dimensional space (instead of the potentially infinite-dimensional feature space $\mathcal{H}$), and then use the linear RX formula. This allows us to control the space and time complexity explicitly through $D$, as one only has to store matrices of size $n\times D$ and invert matrices of size $D\times D$ (see Table I). Typically, the parameter satisfies $D\ll n$ in practical applications.
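A sketch of the RRX pipeline (NumPy; the cosine RFF construction is the standard one, and the small ridge added to the covariance is our numerical-stability choice):

```python
import numpy as np

def rff_map(X, W, b):
    """Explicit D-dimensional random Fourier feature map for the RBF kernel."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

def rrx_scores(X, sigma, D=200, seed=0):
    """Randomized RX: the linear RX of eq. (1) applied to RFF-mapped pixels."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(D, X.shape[1]))  # w ~ N(0, I/sigma^2)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    Z = rff_map(X, W, b)
    Zc = Z - Z.mean(axis=0)                      # centering, now in the RFF space
    C = Zc.T @ Zc / len(Zc) + 1e-8 * np.eye(D)   # D x D covariance (small ridge)
    return np.einsum('nd,dn->n', Zc, np.linalg.solve(C, Zc.T))
```

Note that only a $D\times D$ system is solved, regardless of the number of pixels $n$.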
III-A2 Orthogonal Random Features (ORF)
RFF has become a very practical solution to the bottleneck of kernel methods when $n$ grows. In RFF, the frequencies are sampled from a particular probability density function and act as a basis. This, however, may lead to features that are linearly dependent and thus geometrically cover less space. Imposing orthogonality on the basis can be a remedy to this issue, which has led to orthogonal random features (ORF) [20]. The linear transformation matrix of ORF is
$\mathbf{W}_{\mathrm{ORF}} = \tfrac{1}{\sigma}\,\mathbf{S}\mathbf{Q},$
where $\mathbf{Q}$ is a uniformly distributed random orthogonal matrix. The set of rows of $\mathbf{Q}$ forms a basis in $\mathbb{R}^d$. $\mathbf{S}$ is a diagonal matrix with diagonal entries sampled i.i.d. from the $\chi$-distribution with $d$ degrees of freedom; $\mathbf{S}$ makes the norms of the rows of $\mathbf{W}_{\mathrm{ORF}}$ and of the RFF frequency matrix identically distributed. Theoretical results show that ORF achieves lower approximation error than RFF for the RBF kernel [20]. This approach follows the above RFF philosophy, and the final anomaly score is now
$\mathrm{ORX}(\mathbf{z}) = \bm{\psi}_{\mathrm{ORF}}(\mathbf{z})^\top\,\hat{\mathbf{C}}_{\mathrm{ORF}}^{-1}\,\bm{\psi}_{\mathrm{ORF}}(\mathbf{z}),$  (6)
where each frequency is a row of $\mathbf{W}_{\mathrm{ORF}}$, $\bm{\Psi}_{\mathrm{ORF}}$ is the matrix formed by the mappings of each element in the dataset (so that $\hat{\mathbf{C}}_{\mathrm{ORF}}=\tfrac{1}{n}\bm{\Psi}_{\mathrm{ORF}}^\top\bm{\Psi}_{\mathrm{ORF}}$), and $\bm{\psi}_{\mathrm{ORF}}(\mathbf{z})$ is the mapping of the pixel under test.
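The ORF frequency matrix itself is simple to generate; a hedged sketch (NumPy; `orf_matrix` is our name, and a single $d\times d$ block is shown — for $D>d$ features one stacks several independent blocks):

```python
import numpy as np

def orf_matrix(d, sigma, seed=0):
    """One d x d block of ORF frequencies: W = (1/sigma) S Q, where Q is a
    uniformly random orthogonal matrix and S is diagonal with entries drawn
    from the chi distribution with d degrees of freedom."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))        # random orthogonal basis
    S = np.diag(np.sqrt(rng.chisquare(df=d, size=d)))   # chi-distributed row norms
    return (S @ Q) / sigma
```

These rows replace the i.i.d. Gaussian frequencies of RFF in the same cosine map, yielding the ORX score of (6).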
III-A3 Nyström Approximation
The Nyström method selects a subset of $r$ samples $\mathbf{X}_r$ to construct a low-rank approximation of the kernel matrix [18]. This method approximates the kernel function as $k(\mathbf{x},\mathbf{x}')\approx\mathbf{k}_{rx}^\top\mathbf{K}_r^{-1}\mathbf{k}_{rx'}$, where $\mathbf{k}_{rx}\in\mathbb{R}^{r}$ contains the similarities between $\mathbf{x}$ and all points in $\mathbf{X}_r$, and $\mathbf{K}_r$ stands for the kernel matrix between the points in $\mathbf{X}_r$. Therefore, $\mathbf{k}_z$ can be expressed as:
$\mathbf{k}_z \approx \mathbf{K}_{nr}\,\mathbf{K}_r^{-1}\,\mathbf{k}_{rz},$  (7)
where $\mathbf{K}_{nr}\in\mathbb{R}^{n\times r}$ is a matrix which contains the similarities between the points in $\mathbf{X}$ and the points in $\mathbf{X}_r$. The similarities are computed using the standard RBF kernel function $k(\mathbf{x},\mathbf{x}')=\exp\!\big(-\|\mathbf{x}-\mathbf{x}'\|^2/(2\sigma^2)\big)$.
Using the above definition given in (7), the Nyström method approximates the kernel matrix as
$\mathbf{K} \approx \hat{\mathbf{K}} = \mathbf{K}_{nr}\,\mathbf{K}_r^{-1}\,\mathbf{K}_{nr}^\top.$  (8)
By plugging (7) and (8) into (3), one can define the low-rank approximation of KRX:
$\mathrm{NRX}(\mathbf{z}) = \big(\mathbf{K}_{nr}\mathbf{K}_r^{-1}\mathbf{k}_{rz}\big)^\top\,\hat{\mathbf{K}}^{-2}\,\big(\mathbf{K}_{nr}\mathbf{K}_r^{-1}\mathbf{k}_{rz}\big),$  (9)
where $\hat{\mathbf{K}}$ is $n\times n$ while its rank is at most $r$. Since $\mathbf{K}_{nr}$ is not a square matrix, $\hat{\mathbf{K}}$ is rank deficient, and we propose to use the pseudoinverse instead of the inverse of $\hat{\mathbf{K}}$. By doing this, most of the terms cancel, leading to a more compact equation for the NRX:
$\mathrm{NRX}(\mathbf{z}) = \mathbf{k}_{rz}^\top\,(\mathbf{K}_{nr}^\top\mathbf{K}_{nr})^{-1}\,\mathbf{K}_r\,(\mathbf{K}_{nr}^\top\mathbf{K}_{nr})^{-1}\,\mathbf{k}_{rz}.$  (10)
Note that NRX involves the inversion of an $r\times r$ matrix, which is much more efficient than KRX. In addition, the Nyström approach is more generic than the random Fourier feature approaches, as it allows one to approximate any positive semidefinite kernel, not just shift-invariant kernels. Furthermore, this approximation is data-dependent (i.e., the basis functions are a subset of the estimation data itself), which can translate into better results [19].
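A sketch of the compact NRX computation (NumPy; the function names are ours, and a small ridge stands in for the pseudoinverse as a numerical-stability choice):

```python
import numpy as np

def rbf(A, B, sigma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nrx_scores(X, Z, r, sigma, seed=0):
    """Nystrom-based KRX (compact form): only the n x r and r x r kernel
    blocks are ever built, and only r x r matrices are inverted."""
    rng = np.random.default_rng(seed)
    Xr = X[rng.choice(len(X), size=r, replace=False)]
    Knr = rbf(X, Xr, sigma)                              # n x r cross-kernel
    Kr = rbf(Xr, Xr, sigma)                              # r x r kernel on the subset
    Krz = rbf(Z, Xr, sigma)                              # test pixels vs. the subset
    P = np.linalg.inv(Knr.T @ Knr + 1e-8 * np.eye(r))    # ridge for stability
    M = P @ Kr @ P                                       # the r x r core of eq. (10)
    return np.einsum('mr,rs,ms->m', Krz, M, Krz)
```

The kernel matrix of size $n\times n$ is never formed, which is what makes the detector applicable to large images.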
III-A4 Connection to reduced-set methods
Reduced-set techniques have been successfully used to obtain sparse kernel methods and low-rank approximations of multivariate kernel methods [1]. This methodology can be applied to approximate the KRX, which also leads to Equation (10). In this approach, the data matrix $\mathbf{X}$ is subsampled into $\mathbf{X}_r$, $r\ll n$, and mapped into $\bm{\Phi}_r$, which, by using (2), yields the LRX formula:
$\mathrm{LRX}(\mathbf{z}) = \bm{\phi}(\mathbf{z})^\top\,(\bm{\Phi}_r^\top\bm{\Phi}_r)^{\dagger}\,\bm{\phi}(\mathbf{z}),$  (11)
Identifying $\mathbf{K}_r=\bm{\Phi}_r\bm{\Phi}_r^\top$ and $\mathbf{k}_{rz}=\bm{\Phi}_r\bm{\phi}(\mathbf{z})$, (11) leads to:
$\mathrm{LRX}(\mathbf{z}) = \mathbf{k}_{rz}^\top\,\mathbf{K}_r^{-2}\,\mathbf{k}_{rz},$  (12)
which differs from (10) only in the matrix whose inverse is taken, and when $\mathbf{K}_{nr}$ is full rank the two are the same. In the following and in the experiments, we will use only NRX instead of LRX, as both are mathematically equivalent.
III-B Space and Time Complexity
Table I gives the theoretical computational complexity of the benchmark methods (RX, KRX, SRX) and of the proposed methods (RRX, ORX, NRX) presented in this paper. In this study, we assume $d\ll n$, since we aim to deal with big-data settings. Besides, KRX becomes sufficiently efficient when the number of pixels is small. As seen in Table I, RX provides the best efficiency; thus, it should be employed for scenes where the data are Gaussian distributed. However, KRX and the proposed KRX approximations should be used for nonlinear distributions. Clearly, KRX is the least efficient compared to the proposed approximations, and it is also not applicable to big data. Feature-map methods, e.g., RRX and ORX, provide the best computational efficiency for nonlinear (i.e., non-Gaussian) distributions, while low-rank approximation methods, e.g., LRX and NRX, are also efficient yet relatively slower than the feature-map methods. Thus, one should choose the proper method based on the image distribution characteristics [14, 5], detection performance requirements, and computational resource limitations. These conclusions are assessed experimentally in the following section.
Method      | Space (data, matrix)  | Time (matrix, inverse)
RX          | O(nd), O(d^2)         | O(nd^2), O(d^3)
RRX & ORX   | O(nD), O(D^2)         | O(nD^2), O(D^3)
NRX         | O(nr), O(r^2)         | O(nr^2), O(r^3)
KRX         | O(nd), O(n^2)         | O(n^2 d), O(n^3)
"Data" refers to the (possibly mapped) image matrix; "matrix" is the matrix to be inverted (covariance or kernel), with the cost of computing it and of its inverse reported separately.
IV Experimental Results
This section analyzes the performance of the proposed nonlinear RX anomaly detection methods. We performed tests on four real examples and assessed robustness using the area under the curve (AUC) of the receiver operating characteristic (ROC) curves. We provide illustrative source code for all methods at http://isp.uv.es/code/fastrx.html.
IV-A Data collection and experimental setup
We collected multispectral and hyperspectral images acquired by the Quickbird and AVIRIS sensors. Fig. 5 showcases the scenes used in the experiments. The AD scenarios consider anomalies related to latent fires, vehicles, urbanization (roofs), and ships [3, 4, 11]. Table II summarizes relevant attributes of the datasets, such as sensor, size, and spatial and spectral resolution.
Parameter estimation is required for RX, KRX, RRX, ORX, and NRX. First of all, the KRX method and its proposed variants involve the optimization of the lengthscale parameter $\sigma$ of the RBF kernel. For the feature-map approaches (RRX and ORX), the number of bases $D$ should be optimized, whereas for the low-rank approximation (NRX), the number of random subsamples $r$ should be optimized.
Images      | Sensor    | Size      | Bands | Resolution
WTC         | AVIRIS    | 200 x 200 | 224   | 1.7 m
Gainesville | AVIRIS    | 100 x 100 | 190   | 3.5 m
Denver      | Quickbird | 500 x 684 | 4     | 1-4 m
San Diego   | AVIRIS    | 100 x 100 | 193   | 7.5 m
We adopted a cross-validation scheme to select all the involved parameters: the number of Fourier bases $D$, the rank $r$, and the RBF parameter $\sigma$. We selected parameters using different data sizes.
IV-B Numerical comparison
We report AUC results averaged over multiple runs (standard deviations were always negligible and hence are not reported). Figure 6 shows that the nonlinear methods improve detection over the linear RX, and that NRX outperforms the other approximations in three out of the four images. The AUC values of KRX are affected by the inversion of a relatively large matrix. This raises the issue of poorly estimated matrices (with huge condition numbers), which are also computationally expensive to invert ($\mathcal{O}(n^3)$). The proposed fast kernel RX methods, however, alleviate both issues. Firstly, thanks to the cross-validation procedure, an estimate of the optimal number of features (RRX, ORX) or samples (NRX) can be obtained, allowing the intrinsic dimensionality of the mapped data to be better captured. In a previous work [9], we showed that optimizing the number of frequencies in random Fourier feature approaches acts as an efficient regularizer, leading to better estimates with a reduced number of frequencies. Secondly, the fast versions obtain better AUC performance at a fraction of the cost (see Fig. 6).
IV-C On the computational efficiency
(Fig. 8: detection maps of RX (top row) vs. NRX (bottom row); AUCs — WTC: RX 0.94, NRX 0.97; Gainesville: RX 0.96, NRX 0.98; Denver: RX 0.56, NRX 0.75; San Diego: RX 0.92, NRX 0.96.)
Figure 7 illustrates the trade-off between execution time and AUC. The crosses indicate different values of the rank (the $r$ or $D$ parameters), with the number of pixels fixed. The optimal parameters estimated for KRX are used for the fast approaches. KRX has the best AUC values in all the images. NRX and SRX are more sensitive to the rank value. RRX and ORX are almost insensitive to the rank, but their results do not improve as the rank increases, which limits their performance. The combination of lower spectral information and the ambiguity of the class (note that the anomaly class 'urbanized' can be confused with the pervasive class 'urban') makes the Quickbird scene a very difficult problem (lower AUCs). In this situation, as the rank parameter of the NRX method grows, it approaches the KRX algorithm. In Figure 8, the RX detector (top row) is shown against the best detector obtained (bottom row). The best result in AUC was achieved by NRX in all the images. It is worth mentioning the good detection results achieved by NRX in all the scenes, which can be visually compared with the linear version.
V Conclusions
In this letter, we introduced a family of efficient nonlinear anomaly detection algorithms based on the RX method. We used the theory of reproducing kernels and proposed several efficient methods. The kernel Reed-Xiaoli (KRX) detector was improved using efficient and fast techniques based on feature maps and low-rank approximations. Among all methods, the Nyström approximation and the equivalent low-rank (LRX) approximation achieve the best results and yield a more efficient and accurate nonlinear RX method for use in practice. For future research, we plan to study the behaviour of fast approximations for alternative KRX variants [16, 17]. Note that the presented methodologies for fast KRX are also applicable to other kernel anomaly detectors, to local settings, and to real-time detection.
References

[1] (2013) Kernel multivariate analysis framework for supervised subspace learning: a tutorial on linear and kernel multivariate methods. IEEE Signal Processing Magazine 30(4), pp. 16-29.
[2] (2009) Kernel Methods for Remote Sensing Data Analysis. Wiley & Sons, UK.
[3] (2016) Anomaly detection from hyperspectral remote sensing imagery. Geosciences 6(4).
[4] (2017) Hyperspectral anomaly detection with attribute and edge-preserving filters. IEEE Transactions on Geoscience and Remote Sensing 55(10), pp. 5600-5611.
[5] (2004) Distance metrics and band selection in hyperspectral processing with applications to material identification and spectral libraries. IEEE Transactions on Geoscience and Remote Sensing 42.
[6] (2006) A comparative analysis of kernel subspace target detectors for hyperspectral imagery. EURASIP Journal on Advances in Signal Processing 2007, pp. 29250.
[7] (2005) Kernel RX-algorithm: a nonlinear anomaly detector for hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing 43(2), pp. 388-397.
[8] (2010) A tutorial overview of anomaly detection in hyperspectral images. IEEE Aerospace and Electronic Systems Magazine 25(7), pp. 5-28.
[9] (2018) Remote sensing image classification with large-scale Gaussian processes. IEEE Transactions on Geoscience and Remote Sensing 56.
[10] (2018) Randomized RX for target detection. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 4237-4240.
[11] (2019) Kernel anomalous change detection for remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 57(10), pp. 7743-7755.
[12] (2008) Random features for large-scale kernel machines. In Neural Information Processing Systems (NIPS).
[13] (1990) Adaptive multiple-band CFAR detection of an optical pattern with unknown spectral distribution. IEEE Transactions on Acoustics, Speech, and Signal Processing 38, pp. 1760-1770.
[14]
[15] (2002) Anomaly detection from hyperspectral imagery. IEEE Signal Processing Magazine 19(1), pp. 58-69.
[16] (2016) Problematic projection to the in-sample subspace for a kernelized anomaly detector. IEEE Geoscience and Remote Sensing Letters 13, pp. 485-489.
[17] (2016) Cracks in KRX: when more distant points are less anomalous. In 8th IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS).
[18] (2001) Using the Nyström method to speed up kernel machines. In Neural Information Processing Systems (NIPS), pp. 682-688.
[19] (2012) Nyström method vs random Fourier features: a theoretical and empirical comparison. In Neural Information Processing Systems (NIPS), pp. 476-484.
[20] (2016) Orthogonal random features. In Neural Information Processing Systems (NIPS), pp. 1975-1983.