I Introduction
“… Nature almost surely operates by combining chance with necessity, randomness with determinism…”
–Eric Chaisson, Epic of Evolution: Seven Ages of the Cosmos
Earthobservation (EO) satellites provide a unique source of information to address some of the challenges of the Earth system science [1]. Current EO applications for image classification have to deal with a huge amount of heterogeneous and complex data sources.
The superspectral Copernicus Sentinels [2, 3], as well as the planned EnMAP [4], HyspIRI [5], PRISMA [6] and FLEX [7], will soon provide unprecedented data streams to be analyzed. Very high resolution (VHR) sensors like Quickbird, Worldview2 and the recent Worldview3 [8] also pose big challenges to data processing. The challenge is not only attached to optical sensors. Infrared sounders, like the Infrared Atmospheric Sounding Interferometer (IASI) [9] sensor on board the MetOp satellite series, impose even larger constraints: the orbital period of Metop satellites (101 minutes), the large spectral resolution (8461 spectral channels between 645 cm and 2760 cm), and the spatial resolution (601530 samples) of the IASI instrument yield several hundreds of gigabytes of data to be processed daily. The IASI mission delivers approximately spectra per day, which gives a rate of about 29 Gbytes/day to be processed. EO radar images also increased in resolution, and current platforms such as ERS1/2, ENVISAT, RadarSAT1/2, TerraSARX, and CosmoSkyMED give raise to extremely fine resolution data that call for advanced scalable processing methods. Besides, we should not forget the availability of the extremely large remote sensing data archives^{1}^{1}1The Earth Observing System Data and Information System (EOSDIS) for example is managing around 4 terabytes daily, and the flow of data to users is about 20 terabytes daily. already collected by several past missions. In addition, we should be also prepared for the near future in diversity and complementarity of sensors^{2}^{2}2Follow the links for an uptodate list of current ESA, EUMETSAT, JAXA, CNSA and NASA EO missions.. These large scale data problems require enhanced processing techniques that should be accurate, robust and fast. Standard classification algorithms cannot cope with this new scenario efficiently.
In the last decade, kernel methods have dominated the field of remote sensing image classification [10, 11]. In particular, a kernel method called support vector machine (SVM, [12, 13, 14, 15, 16]) was gradually introduced in the field, and quickly became a standard for image classification. Further SVM developments considered the simultaneous integration of spatial, spectral and temporal information [17, 18, 19, 20, 21], the richness of hyperspectral imagery [16, 22], and exploited the power of clusters of computers [23, 24]. Undoubtedly, kernel methods have been the most widely studied classifiers, and became the preferred choice for users and practitioners. However, they are still not widely adopted in real practice because of the high computational cost when dealing with large scale problems. Roughly speaking, given examples available for training, kernel machines need to store kernel matrices of size , and to process them using standard linear algebra tools (matrix inversion, factorization, eigendecomposition, etc.) that typically scale cubically, . This is an important constraint that hampers their applicability to large scale EO data processing.
An alternative kernel classifier to SVM is the Gaussian Process classifier (GPC) [25]. GPC has appealing theoretical properties, as it approaches the classification problem with a solid probabilistic treatment, and very good performance in practice. The GPC method was originally introduced in the field of remote sensing in [26], where very good capabilities for land cover classification from multi/hyperspectral imagery were illustrated. Since then, GPC has been widely used in practice and extended to many settings: hyperspectral image classification [27], semantic annotation of highresolution remote sensing images [28], change detection problems with semisupervised GPC [29]
, or classification of images with the help of user’s intervention in active learning schemes
[30, 31]. Unfortunately, like any other kernel method, its computational cost is very large. This is probably the reason why GPC has not yet been widely adopted by the geoscience and remote sensing community in large scale classification scenarios, despite its powerful theoretical background and excellent performance in practice.
In GP for classification we face two main problems. First, the nonconjugate observation model for classification (usually based on the sigmoid, probit, or Heaviside step functions) renders the calculation of the marginal distribution needed for inference impossible. The involved integrals are not analytically tractable, so one has to resort to numerical methods or approximations [25]
. One could rely on Markov Chain Monte Carlo (MCMC) methods, but they are computationally far too expensive. By assuming a Gaussian approximation to the posterior of the latent variables, one can use the Laplace approximation (LA) and the (more accurate) expectation propagation (EP)
[32, 33]. The observation model can also be bounded, leading to the variational inference approach [34] that we use in this paper. Once the nonconjugacy of the observation model has been solved, the second problem is the inversion of huge matrices, which yields the unbearable complexity. Notice that this is the only difficulty that appears when GP is used for regression, where the observation model can be analytically integrated out. This efficiency problem could be addressed with recent Sparse GP approximations based on inducing points and approximate inference [35], but they come at the price of a huge number of parameters to estimate.
In this paper, we introduce two alternative pathways to perform large scale remote sensing image classification with GPC. First, following the ideas in [36], we approximate the squared exponential (SE) kernel matrix of GPC by a linear one based on projections over a reduced set of random Fourier features (RFF). This novel method is referred to as RFFGPC. It allows us to work in the primal space of features, which significantly reduces the computational cost of large scale applications. In fact, a recent similar approach allows for using millions of examples in SVMbased land cover classification and regression problems [37]. The solid theoretical ground and the good empirical RFFGPC performance make it a very useful method to tackle large scale problems in Earth observation. However, RFFGPC can only approximate (theoretically and in practice) a predefined kernel (the SE in this work), and the approximation does not necessarily lead to a discriminative kernel. These shortcomings motivate our second methodological proposal: we introduce a novel approach to avoid randomly sampling a number of Fourier frequencies, and instead we propose learning the optimal ones within the variational approach. Therefore, Fourier frequencies are no longer randomly sampled and fixed, but latent variables estimated directly from data via variational inference. We refer to this method as VFFGPC (Variational Fourier Features). The performance of RFFGPC and VFFGPC is illustrated in large and medium size realworld remote sensing image classification problems: (1) classification of clouds over landmarks from a long time series of Seviri/MSG (Spinning Enhanced Visible and Infrared Imager, Meteosat Second Generation) remote sensing images, and (2) cloud detection using IASI and AVHRR (Advanced Very High Resolution Radiometer) infrared sounding data, respectively. Excellent empirical results support the proposed large scale methods in both accuracy and computational efficiency. In particular, the extraordinary performance of VFFGPC in the medium size data set justifies its use not only as a large scale method, but also as a generalpurpose and scalable classification tool capable of learning an appropriate discriminative kernel.
The remainder of the paper is organized as follows. Section II reviews the RFF approximation and introduces it into GPC, deriving RFFGPC and the more sophisticated VFFGPC. Section III introduces the two realworld remote sensing data sets used for the experimental validation. Section IV presents the experimental results comparing the two proposed methods and standard GPC in terms of accuracy and efficiency. Section V concludes the paper with some remarks and future outlook.
Ii Large Scale Gaussian Process Classification
Gaussian Processes (GP) [38] is a probabilistic stateoftheart model for regression and classification tasks. In the geostatistic community, GP for regression is usually referred to as kriging. For inputoutput data pairs , a GP models the underlying dependence from a functionspace perspective, i.e. introducing latent variables
that jointly follow a normal distribution
. The kernel function encodes the sort of functions favored, and is the socalled kernel matrix. The observation model of the output given the latent variable depends on the problem at hand. In binary classification (i.e. when), the (nonconjugate) logistic observation model is widely used. It is given by the sigmoid function as
.Iia Random Fourier Features
The main issue with large scale applications of GP is its cost at the training phase, which comes from the kernel matrix inversion. The work [36] presents a general methodology (based on Bochner’s theorem [39]) to approximate any positivedefinite shiftinvariant kernel by a linear one. This is achieved by explicitly projecting the original dimensional data onto random Fourier features , whose linear kernel approximates . This linearity will enable us to work in the primal space of features and substitute matrix inversions by ones, resulting in a total computational cost. In largescale applications, one can set a , and thus the obtained complexity represents an important benefit over the original . Moreover, the complexity at test is also reduced from to , even becoming independent on .
In this work we use the wellknown SE (or Gaussian) kernel . Following the methodology in [36], this kernel can be linearly approximated as
(1) 
where
(2) 
and the Fourier frequencies must be sampled from a normal distribution . As explained in [36, Claim 1], this approximation exponentially improves with the number of Fourier frequencies used (and also exponentially worsens with , the original dimension of ). However, obviously, increasing in our methods will go at the cost of increasing the and complexities. Other kernels different from the SE one could be used, but that would imply sampling from a different distribution.
Our novel RFFGPC method considers a standard Bayesian linear model over these new features . Such a linear model corresponds to GP classification with the linear kernel [40, Chapter 6]. Since approximates the SE kernel , our RFFGPC constitutes an approximation to GP classification with SE kernel. However, RFFGPC is well suited for large scale applications, as it works in the primal space of features and thus presents a (resp. ) train (resp. test) complexity.
Notice that RFFGPC needs to sample the Fourier frequencies from
from the beginning, whereas hyperparameters
and must be estimated during the learning process (just as in standard GP classification). In order to uncouple and , in the sequel we consider the equivalent features(3) 
with now sampled from and fixed. Notice that we have collectively denoted .
At this point, it is natural to consider other possibilities for the Fourier frequencies , rather than just randomly sample and fix them from the beginning. The proposed VFFGPC model treats them as hyperparameters to be estimated (so as to maximize the likelihood of the observed data), just like and in the case of RFFGPC. This makes VFFGPC more expressive, flexible, and tailored to the data, although it may no longer constitute an approximation to the SE kernel (for which the must be normally distributed). More specifically, VFFGPC does start with an approximated SE kernel (since the are initialized with a normal distribution), but the maximum a posteriori (MAP) optimization on makes it learn a new kernel which may no longer approximate a SE one. Therefore, for VFFGPC we also use as in eq. (3), with now both and to be estimated. Interestingly, VFFGPC extends the Sparse Spectrum Gaussian Process model originally introduced for regression in [41], to GP classification.
More specifically, the authors there also sparsify the SE kernel by working on the primal space of and Fourier features, see [41, Equation 5]. However, our classification setting involves a sigmoidbased (logistic) observation model for the output given the latent variable (see the next eq. (4)), whereas in regression this is just given by a normal distribution. Therefore, VFFGPC needs to additionally deal with the nonconjugacy of the sigmoid, which motivates the variational bound of eq. (6) and the consequent variational inference procedure described in Section IIC. It is interesting to realize that VFFGPC is introduced here as a natural extension of RFFGPC, whereas there is not a regression analogous for RFFGPC in [41].
Another possibility for the Fourier frequencies would be to estimate them (just as in VFFGPC) but considering alternative prior distributions (which means utilizing alternative kernels). Moreover, instead of maximum a posteriori inference, we could address the marginalization of the Fourier frequencies . Alternatively, to promote sparsity, the use of Gaussian Scale Models (GSM) [42] could also be investigated. These possibilities will be explored in future work, and here we will concentrate on RFFGPC and VFFGPC.
IiB Models formulation
As anticipated in previous section, RFFGPC and VFFGPC are standard Bayesian linear models working on the explicitly mapped features of eq. (3). In the case of RFFGPC, is sampled from at the beginning and fixed, with to be estimated. In the case of VFFGPC, both and are estimated, with a prior over . In order to derive both methods in a unified way, will denote for RFFGPC and both for VFFGPC.
Since we are dealing with binary classification, we consider the standard logistic observation model
(4) 
where . For the weights we utilize the prior normal distribution , with to be estimated, see eq. (1).
For an observed dataset , the joint p.d.f. reads
(5) 
where we collectively denote and . For the sake of brevity, from now on we will systematically omit the conditioning on .
IiC Variational inference
Given the observed dataset , in this section we seek point estimates of and by maximizing the marginal likelihood (in VFFGPC, the additional prior yields maximum a posteriori inference, instead of maximum likelihood one, for ). After that, we obtain (an approximation to) the posterior distribution . Due to the nonconjugate observation model, the required integrals will be mathematically intractable, and we will resort to the variational inference approximation [40, Section 10.6].
First, notice that integrating out in eq. (5) is not analytically possible due to the sigmoid functions in the observation model . To overcome this problem, we use the variational bound
(6) 
which is true for any real numbers and where [40, Section 10.6]. Applying it to every factor of , we have the following lower bound
(7) 
Here we write for the projecteddata matrix (which depends on ), is the diagonal matrix , , and the term only depends on . The key is that this lower bound for is conjugate with the normal prior (since it is the exponential of a quadratic function on ), and thus it allows us integrating out in eq (5). In exchange, we have introduced additional hyperparameters that will need to be estimated along with and .
Therefore, substituting for its bound, eq. (5) can be lower bounded as
(8) 
where we have denoted
(9) 
Now it is clear that we can marginalize out , and iteratively estimate the optimal values of , and as those that maximize (or equivalently ) for the observed . Starting at , , and , we can calculate , , and for . In the case of , we make use of the local maximum condition . From there, it is not difficult to prove that the optimal value satisfies [40]
(10) 
where the square and square root of a vector are understood as elementwise. In the case of and , we use nonlinear conjugate gradient methods [43] and obtain (notice that for VFFGPC we can collapse and , removing the prior on )
(11) 
Once the hyperparameters , , and have been estimated by , , and respectively, we need to compute the posterior . As before, this is mathematically intractable due to the sigmoids in the observation model. Therefore, we again resort to the variational bound in eq. (8) to get an optimal approximation to the posterior . Namely, we do it by minimizing (an upper bound of) the KL divergence between both distributions:
Thus, the minimum is reached for , with and calculated in eq. (9) using , , and .
In summary, at training time, our methods RFFGPC and VFFGPC run iteratively until convergence of the hyperparameters , , and to their optimal values , , and (see Algorithm 1). The computations involved there suppose a computational complexity of (which equals when ), whereas standard GPC scales as . At test time, the probability of class for a previously unseen instance is:
(12) 
with being the sigmoid function. Whereas GPC presents a computational cost of for each test instance, eq. (IIC) implies a complexity of in the case of our methods. In particular, notice that this is independent on the number of training instances. These significant reductions in computational cost (both at training and test) make our proposal suitable for large scale and realtime applications in general, and in EO applications in particular.
Finally, regarding the convergence of the proposed methods, we cannot theoretically guarantee the convergence to a global optimum (only a local one), since we are using conjugate gradient methods to solve the nonconvex optimization problem in eq. (IIC). However, from a practical viewpoint, we have experimentally checked that both methods have a satisfactory similar convergence pattern. Namely, in the first iterations, the hyperparameters experiment more pronounced changes, widely exploring the hyperparameters space. Then, once they reach a local optimum vicinity, these variations become smaller. Eventually, the hyperparameters values hardly change and the stop criterion is satisfied.
Iii Data Collection and Preprocessing
This section introduces the datasets used for comparison purposes in the experiments. We considered (1) a continuous year of MSG data involving several hundred thousands of labeled pixels for cloud classification; and (2) a mediumsize manually labeled dataset used to create the operational IASI cloud mask.
Iiia Cloud detection over landmarks with Seviri/MSG
We focus on the problem of cloud identification over landmarks using Seviri MSG data. This satellite mission constitutes a fundamental tool for weather forecasting, providing images of the full Earth disc every 15 minutes. Matching the landmarks accurately is of paramount importance in image navigation and registration models and geometric quality assessment in the Level 1 instrument processing chain. Detection of clouds over landmarks is an essential step in the MSG processing chain, as undetected clouds are one of the most significant sources of error in landmark matching (see Fig. 1).
The dataset used in the experiments was provided by EUMETSAT, and contains Seviri/MSG Level 1.5 acquisitions for 200 landmarks of variable size for a whole year (2010). Landmarks mainly cover coastlines, islands, or inland waters. We selected all multispectral images from a particular landmark location, Dakhla (Western Sahara), which involves 35,040 MSG acquisitions with a fixed resolution of pixels. In addition, Level 2 cloud products were provided for each landmark observation, so the Level 2 cloud mask [44] is used as the best available ‘ground truth’ to validate the results. We framed the problem for this particular landmark as a pixelwise classification one.
A total amount of features were extracted from the images, involving band ratios, spatial, contextual and morphological features, and discriminative cloud detection scores. In particular, we considered: 7 channels converted to top of atmosphere (ToA) reflectance (R1, R2, R3, R4) and brightness temperature (BT7, BT9, BT10), 3 band ratios, and 6 spatial features. On the one hand, the three informative band ratios were: (i) a cloud detection ratio, ; (ii) a snow index, ; and (iii) the NDVI, . On the other hand, the six spatial features were obtained by applying average filters of sizes and
, as well as a standard deviation filter of size
, on both bands R1 and BT9.Based on previous studies [44, 45], and in order to simplify the classification task, the different illumination conditions (and hence difficulty) over the landmarks are studied by splitting the day into four ranges (subproblems) according to the solar zenith angle (SZA) values: high (SZASZA), mid (SZASZA°), low (80°SZA90°), and night (SZA90°). Therefore, different classifiers are developed for each SZA range.
The final amount of pixels available for each illumination condition is for high, mid, and night, and for low. Moreover, each problem has different dimensionality: all the features were used for the three daylight problems, and was used for the night one (some bands and ratios are meaningless at night).
IiiB Cloud detection with the IASI/AVHRR data
The IAVISA dataset is part of the study “IASI/AVHRR Visual Scenes Analysis and Cloud Detection” (http://www.brockmannconsult.de/iavisainfoweb/), whose aim is to improve the IASI cloud detection by optimizing the coefficients used for a predefined set of cloud tests performed in the IASI Level 2 processing. The dataset was derived by visual analysis of globally distributed data, and served as input for the optimization and validation of the IASI cloud detection. Each collected IASI sample was classified concerning its cloudiness based on the visual inspection of the AVHRR Level 1B inside the IASI footprint. Each sample classifies a single IASI instantaneous field of view (IFOV) as being cloudfree (clear sky, 28%), partly cloudy low (26%), partly cloudy high (26%), or cloudy (20%). For the sake of simplicity, here we focus on discriminating between cloudy and cloudfree pixels.
In order to ensure the representativeness of the dataset for the natural variability of clouds, labeling further considered additional conditions depending on: 1) the surface type (see Table I), 2) the climate zone (Köppen classification over land, geographical bands over sea), 3) the season and 4) day/night discrimination. First, the surface type database used as ancillary information was the IGBP (International GeosphereBiosphere Programme) scene types in the CERES/SARB (Clouds and the Earth’s Radiant Energy System, Surface and Atmospheric Radiation Budget) surface map. The 18class map was used to identify surface properties of a given region. The distribution of the surface types in the map is given in Table I, showing the even distribution of clouds across land cover types which ensures representativeness (natural variability) of the database. Second, the different climate zones were sampled as follows: tropical (), dry (), temperate (), cold (), and polar zones (). Third, seasonality was also taken into account, and yielded the following distribution: Spring (), Summer (), Autumn (), and Winter (). The global sampling and some cloudy and cloudfree chips are shown in Fig. 2. The final database consists of instances and original features, which have been summarized to
more informative directions through a standard Principal Component Analysis.
ID  Surface Type  Samples 

1  Evergreen Needle Forrest  603 
2  Evergreen Broad Forrest  876 
3  Deciduous Needle Forrest  89 
4  Deciduous Broad Forrest  324 
5  Mixed Forest  602 
6  Closed Shrubs  329 
7  Open Shrubs  1484 
8  Woody Savannas  768 
9  Savannas  656 
10  Grassland  886 
11  Wetlands  147 
12  Crops  1294 
13  Urban  44 
14  Crop/Mosaic  1436 
15  Snow/Ice  1443 
16  Barren/Desert  1379 
17  Water  12150 
18  Tundra  413 
Iv Experiments
In this section, we empirically demonstrate the performance of the proposed methods in the two real problems described above. Moreover, we carry out an exhaustive comparison to GPC with SE kernel (in the sequel, GPC for brevity).
In order to provide a deeper understanding of our methods, different values of (number of Fourier frequencies) will be used. Different sizes of the training dataset will be also considered, in order to analyze the scalability of the methods and to explore the tradeoff between accuracy and computational cost. When increasing (respectively, ), we will add training instances (respectively, Fourier frequencies) to those already used for lower values of (respectively, ). We provide numerical (in terms of recognition rate), computational (training and test times), and visual (by inspecting classification maps) assessments.
Iva Cloud detection in the landmarks dataset
In this large scale problem, the number of training examples is selected as for RFFGPC and VFFGPC, and for GPC. Notice that the improvement at training computational cost ( for the proposed methods against for GPC) enables us to consider much greater training datasets for our methods. In fact, as we will see in Figure 3, even with RFFGPC and VFFGPC are computationally cheaper than GPC with . Indeed, higher values of are not considered for GPC to avoid exceeding the (already expensive) seconds of training CPU time needed with just . Regarding the number of Fourier frequencies, we use .
The experimental results, which include predictive performance (test overall accuracy), training CPU time, and test CPU time, are shown in Figure 3. Every single value is the average of five independent runs under the same setting. Namely, for each illumination condition, a test dataset is fixed and five different balanced training datasets are defined with the remaining data. Notice that a general first observation across Figure 3 suggests that higher accuracies are obtained for higher illumination conditions (SZA).
Figure 3 reveals an overwhelming superiority of RFFGPC and VFFGPC over standard GPC: our proposed methods achieve a higher predictive performance while investing substantially lower training CPU time. This is very clear from the third column plots, where for any blue point we can find orange and yellow points which are placed more northwest (i.e. higher test OA and lower CPU training time). Furthermore, the fourth column shows an equally extraordinary reduction in test CPU time (production time), where the proposed methods are more than times faster than GPC. In particular, this makes RFFGPC and VFFGPC better suited than standard GPC for realtime EO applications.
Regarding the practical differences between RFFGPC and VFFGPC, we observe that RFFGPC is faster (at training) whereas VFFGPC is more accurate. This is a natural consequence of their theoretical formulations: the estimation of the Fourier frequencies in VFFGPC makes it more flexible and expressive, but involves a heavier training. Therefore, in this particular problem, the final practical choice between the two proposed methods would depend on the relative importance that the user assigns to test accuracy (where VFFGPC stands out) and training cost (where RFFGPC does so). In terms of test cost, both methods are very similar, as expected from the identical theoretical test complexity. The independence of this quantity on is also intuitively reflected in the experiments, with all the RFFGPC and VFFGPC lines collapsing onto a single one in the fourth column plots of Figure 3.
At this point, it is worth to analyze a bit further the role of in the performance of our methods. Recall (Section IIA) that RFFGPC is an approximation to GPC, with an error that exponentially decreases with the ratio between the dimensions of the projected Fourier features space and the original one. Therefore, it is theoretically expected that the performance of RFFGPC increases with , becoming equivalent to GPC when . Actually, this is supported by the first column of Figure 3. Moreover, since our problem here presents a low ( for high, mid, and low, and for night), it is natural that RFFGPC with just already gets very similar (even better in some cases) results to standard GPC with the same (yet far much faster, compare GPC and RFF/VFF for ). In the case of VFFGPC, where the Fourier frequencies are model parameters to be estimated, the number is directly related to the complexity of the model. Therefore, its increase should not always mean a higher performance in test OA, since large values may provoke overfitting to the training dataset (this will be clear in the next dataset, whereas it does not occur in LANDMARKS). Furthermore, unlike RFFGPC, the performance of VFFGPC is not directly affected by .
It is also reasonable to expect that both test OA and training CPU time increase with the training dataset size . More specifically, and from a practical perspective in which the computational resources are finite, the first column in Figure 3 shows that test OA becomes stalled when only one of or increases. However, greater improvements in test OA are achieved when and are jointly increased. Notice that this is also justifiable from a theoretical viewpoint: the higher the dimensionality of the projected Fourier features space (which is ), the larger number of examples are required to identify the separation between classes.
IvB Cloud detection with the IAVISA dataset
As explained in Section IIIB, this problem involves a total amount of instances. We performed fivefold crossvalidation, which produces five pairs of training/test datasets with (approximately) instances each. Results are then averaged. Since RFFGPC and VFFGPC are conceived for largescale applications (they scale linearly with , recall their training cost), they will not be utilized with values of lower than this training dataset size of (even GPC is able to cope with this size). Indeed, in the case of GPC we use the values . Regarding the number of Fourier frequencies , we consider the grid . The experimental results, which include the same metrics as those used for the previous problem on landmarks, are shown in Figure 4.
In this case, we again observe a clear outperformance of VFFGPC against GPC: it achieves higher test OA while requiring less training and test CPU times. Moreover, the improvement in test OA is greater than , and train/test CPU times are around and times lower respectively. However, unlike in the previous problem of cloud detection over landmarks, RFFGPC does not exhibit such a clear superiority over GPC in this application. Whereas it does drastically decrease the train/test CPU times, it is not able to reach the test OA of GPC with . Therefore, in practice, the optimal choice for this application is VFFGPC. RFFGPC would only be recommended if the training CPU time is a very strong limitation.
The main reason why RFFGPC is not completely competitive in this problem is its theoretical scope: as an efficient approximation to GPC, it is conceived for large scale applications which are out of the reach of standard GPC. If the size of the problem allows for using GPC (as in this case), then RFFGPC will only provide a more efficient alternative (less training and test CPU times), but its predictive performance will be always below that of GPC. Moreover, the difference in this performance is directly influenced by the original dimension of data (recall that the kernel approximation behind RFFGPC exponentially degrades with , Section IIA). This is precisely a second hurdle that RFFGPC finds in IAVISA: the high makes RFFGPC with the full dataset be quite far from the corresponding GPC at predictive performance (test OA). In conclusion, the ideal setting for RFFGPC is a large scale problem (high ) with few features (low ), precisely the opposite to the IAVISA dataset.
Interestingly, VFFGPC bypasses these limitations of RFFGPC by learning a new kernel and not just approximating the SE one. First, VFFGPC is not just a GP adaptation wellsuited for large scale applications, but a generalpurpose, expressive, and very competitive kernelbased classifier that scales well with the number of training instances. Second, as it does not rely on the kernel approximation, VFFGPC is not affected by the original dimension of data. Both ideas are empirically supported by the results obtained in IAVISA.
The first plot of Figure 4 shows that the predictive performance of VFFGPC does not necessarily improves by increasing . This is the expected behavior from the theoretical formulation of VFFGPC, where the Fourier frequencies are parameters to be estimated. Thus, a higher amount of them confers VFFGPC a greater flexibility to learn hidden patterns in the training dataset, but also the possibility to overfit very particular structures of it which do not generalize to the test set. This is the classical problem of the model complexity in machine learning, and it is further illustrated in Figure 5. Together with the first plot in Figure 4, it shows the paradigmatic behaviour of train and test performance in presence of overfitting: train OA grows with the model complexity (great flexibility allows for learning very particular structures of the training set, even reaching a of train OA), whereas test OA initially grows (the first patterns are part of the ground truth and thus general to the test set) but then goes down (when the learned information is too specific to the training set). Notice that this overfitting phenomena did not occur at LANDMARKS, where test OA monotonically increased with . In addition to the different nature of the problems, the training dataset size plays a crucial role at this: smaller datasets (like IAVISA) are more prone to overfitting than larger ones (LANDMARKS) under the same model complexity.
Finally, it is worth noting that VFFGPC achieves its maximum test OA when using just Fourier frequencies. This reflects (i) a not very sophisticated internal structure of the IAVISA dataset (since just directions are enough to correctly classify of the data), and (ii) the VFFGPC capability to learn those discriminative directions from data. In particular, this shows that VFFGPC can be used not only as a classifier, but also as a method that learns the most relevant discriminative directions in a dataset. Unfortunately, RFFGPC is not able to benefit from these privileged directions that may exist in some datasets, since it randomly samples and fix the Fourier frequencies from the beginning.
IvC Explicit classification maps for cloud detection
The last two sections were dedicated to thoroughly analyze the performance of the proposed methods, empirically understand their behavior, weaknesses, and strengths, and compare them against GPC. In order to illustrate the explicit cloud detection behind the experiments, here we provide several explanatory classification maps obtained by the best model (in terms of predictive performance) for the LANDMARKS dataset: VFFGPC with and .
The classification maps are obtained for the whole year 2010 at the Dakhla landmark, with a total of 34940 satellite acquisitions. The acquired window size is
pixels. Relying on the proposed feature extraction procedure, we trained the four necessary models (high, mid, low, night), and then proceed to predict over the whole available amount of chips acquired in the 2010 year.
^{3}^{3}3A full video with all the classification maps is available at http://decsai.ugr.es/vip/software.html and http://isp.uv.es/code/vff.html. RFFGPC and VFFGPC codes are also provided.In Figure 6, several chips are provided with the aim of illustrating different behaviors. In the first situation (first row), we can see a characteristic error of the L2 cloud mask, which sometimes tends to wrongly label the coastline pixels as cloudy. However, VFFGPC leads to a better classification, identifying just one cloudy pixel and thus avoiding this negative coastline effect^{4}^{4}4As a clarification note, the coastline pixels were removed from the training dataset by applying a carefully designed morphological filter around coastlines.. In the second row, the visual channels show large clouds crossing the landmark. In the bottomright of the image, a long cloud is unlabeled in the L2 cloud mask but correctly detected by VFFGPC. While being formally accounted as an error, such discrepancy is actually positive for our method. Moreover, VFFGPC shows an interesting cloudsensitive behaviour at the topleft cloudy mass, identifying a larger cloudy area than that provided by EUMETSAT. This is a desirable propensity in cloud detection applications, where we prefer to identify larger clouds (and then thoroughly analyze them) rather than missing some of them. In the third row, the RGB channel allows for visually identifying three main cloudy masses at the landmark. The L2 mask poorly labels the central cloudy band, and does not detect the lower cloud. Both deficiencies are overcome by VFFGPC. Finally, the fourth chip shows a huge cloudy mass that is undetected by the L2 mask but is correctly identified by VFFGPC.
Therefore, although VFFGPC is trained with an imperfect ground truth, we observe that it is able to bypass some of these deficiencies, and exhibits a desirable cloudsensitive behavior. This improvement can be also related to the particular design of the training datasets, splitting the problem into four different cases depending on the illumination conditions.
V Conclusions and Future Work
We presented two efficient approximations to Gaussian process classification to cope with big data classification problems in EO. The first one, RFFGPC, performs standard GP classification by means of a fast approximation to the kernel (covariance) via random Fourier features. The advantage of the method is mainly computational, as the training cost is instead of the induced by the direct inversion of the kernel matrix (the test cost is also reduced to be independent on , from to ). The RFF method approximates the squared exponential (SE) covariance with Fourier features randomly sampled in the whole spectral domain. The solid theoretical grounds and good empirical performance makes it a very useful method to tackle large scale problems. Actually, the use of RFF has been exploited before in other settings, from classification with SVMs to regression with the KRR. However, we emphasize two main shortcomings. Firstly, the RFF approach can only approximate (theoretically and in practice) a predefined kernel (the SE one in this work). Secondly, by sampling the Fourier domain from a Gaussian, one has no control about the expressive power of the representation since some frequency components of the signal can be better represented than others. As a consequence, the approximated kernel may not have good discrimination capabilities. Noting these two problems, we proposed here our second methodology: a variational GP classifier (VFFGPC) which goes one step beyond by optimizing over the Fourier frequencies. It is shown to be not just a GP adaptation wellsuited for large scale applications, but a whole novel, generalpurpose, and very competitive kernelbased classifier that scales well (linearly, as RFFGPC) with the number of training instances.
We illustrated the performance of the algorithms in two real remote sensing problems of large and medium size. In the first case study, a challenging problem dealt with the identification of clouds over landmarks using Seviri/MSG imagery. The problem involved several hundred thousands data points for training the classifiers. In the second case study, we used the IAVISA dataset, which exploits IASI/AVHRR data to identify clouds with the IASI infrared sounding data. Compared to the original GPC, the experimental results show a high competitiveness in accuracy, a remarkable decrease in computational cost, and an excellent tradeoff between both.
These results encourage us to expand the experimentation to additional problems, trying to exploit the demonstrated potential of VFFGPC when dealing with any value of (training data set size) and (original dimension of the data). Other prior distributions and inference methods, as explained at the end of Section IIA, will be also explored in the future.
References
 [1] M. Berger, J. Moreno, J. A. Johannessen, P. Levelt, and R. Hanssen, “ESA’s sentinel missions in support of earth system science,” Rem. Sens. Env., vol. 120, pp. 84–90, 2012.
 [2] M. Drusch, U. Del Bello, S. Carlier, O. Colin, V. Fernandez, F. Gascon, B. Hoersch, C. Isola, P. Laberinti, P. Martimort, A. Meygret, F. Spoto, O. Sy, F. Marchese, and P. Bargellini, “Sentinel2: ESA’s Optical HighResolution Mission for GMES Operational Services,” Rem. Sens. Env., vol. 120, pp. 25–36, 2012.
 [3] C. Donlon, B. Berruti, A. Buongiorno, M.H. Ferreira, P. Féménias, J. Frerick, P. Goryl, U. Klein, H. Laur, C. Mavrocordatos, J. Nieke, H. Rebhan, B. Seitz, J. Stroede, and R. Sciarra, “The Global Monitoring for Environment and Security (GMES) Sentinel3 mission,” Remote Sensing of Environment, vol. 120, pp. 37–57, 2012.
 [4] T. Stuffler, C. Kaufmann, S. Hofer, K. Farster, G. Schreier, A. Mueller, A. Eckardt, H. Bach, B. Penné, U. Benz, and R. Haydn, “The EnMAP hyperspectral imagerAn advanced optical payload for future applications in Earth observation programmes,” Acta Astronautica, vol. 61, no. 16, pp. 115–120, 2007.
 [5] D. Roberts, D. Quattrochi, G. Hulley, S. Hook, and R. Green, “Synergies between VSWIR and TIR data for the urban environment: An evaluation of the potential for the Hyperspectral Infrared Imager (HyspIRI) Decadal Survey mission,” Rem. Sens. Env., vol. 117, pp. 83–101, 2012.
 [6] D. Labate, M. Ceccherini, A. Cisbani, V. De Cosmo, C. Galeazzi, L. Giunti, M. Melozzi, S. Pieraccini, and M. Stagi, “The PRISMA payload optomechanical design, a high performance instrument for a new hyperspectral mission,” Acta Astronautica, vol. 65, no. 910, pp. 1429–1436, 2009.
 [7] S. Kraft, U. Del Bello, M. Drusch, A. Gabriele, B. Harnisch, and J. Moreno, “On the demands on imaging spectrometry for the monitoring of global vegetation fluorescence from space,” in Proceedings of SPIE  The International Society for Optical Engineering, vol. 8870, 2013.
 [8] N. Longbotham, F. Pacifici, B. Baugh, and G. CampsValls, “Prelaunch assessment of worldview3 information content,” Lausanne, Switzerland, June 2014, pp. 24–27.
 [9] B. Tournier, D. Blumstein, F. Cayla, , and G. Chalon, “IASI level 0 and 1 processing algorithms description,” in Proc. of ISTCXII Conference, 2002.
 [10] G. CampsValls and L. Bruzzone, Eds., Kernel methods for Remote Sensing Data Analysis. UK: Wiley & Sons, Dec 2009.
 [11] G. CampsValls, D. Tuia, L. GómezChova, and J. Malo, Eds., Remote Sensing Image Processing. Morgan & Claypool, Sept 2011.
 [12] C. Huang, L. S. Davis, and J. R. G. Townshend, “An assessment of support vector machines for land cover classification,” Int. J. Rem. Sens., vol. 23, no. 4, pp. 725–749, 2002.
 [13] G. CampsValls, L. GómezChova, J. Calpe, E. Soria, J. D. Martín, L. Alonso, and J. Moreno, “Robust support vector method for hyperspectral data classification and knowledge discovery,” IEEE Trans. Geosc. Rem. Sens., vol. 42, no. 7, pp. 1530–1542, Jul 2004.
 [14] F. Melgani and L. Bruzzone, “Classification of hyperspectral remote sensing images with support vector machines,” IEEE Trans. Geosci. Rem. Sens., vol. 42, no. 8, pp. 1778–1790, 2004.
 [15] G. M. Foody and J. Mathur, “A relative evaluation of multiclass image classification by support vector machines,” IEEE Trans. Geosci. Rem. Sens., pp. 1–9, Jul 2004.
 [16] G. CampsValls and L. Bruzzone, “Kernelbased methods for hyperspectral image classification,” IEEE Trans. Geosc. Rem. Sens., vol. 43, no. 6, pp. 1351–1362, Jun 2005.
 [17] J. Benediktsson, J. Palmason, and J. Sveinsson, “Classification of hyperspectral data from urban areas based on extended morphological profiles,” IEEE Trans. Geosc. Rem. Sens., vol. 43, pp. 480–490, 2005.
 [18] M. Fauvel, J. Benediktsson, J. Chanussot, and J. Sveinsson, “Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles,” IEEE Trans. Geosc. Rem. Sens., vol. 46, no. 11, pp. 3804–3814, 2008.
 [19] F. Pacifici, M. Chini, and W. Emery, “A neural network approach using multiscale textural metrics from very highresolution panchromatic imagery for urban landuse classification,” Remote Sens. Environ., vol. 113, no. 6, pp. 1276–1292, 2009.
 [20] D. Tuia, F. Pacifici, M. Kanevski, and W. Emery, “Classification of very high spatial resolution imagery using mathematical morphology and support vector machines,” IEEE Trans. Geosci. Rem. Sens., vol. 47, no. 11, pp. 3866 –3879, Nov 2009.
 [21] G. CampsValls, L. GómezChova, J. MuñozMarí, J. L. RojoÁlvarez, and M. MartínezRamón, “Kernelbased framework for multitemporal and multisource remote sensing data classification and change detection,” IEEE Trans. Geosc. Rem. Sens., vol. 46, no. 6, pp. 1822–1835, 2008.
 [22] A. Plaza, J. A. Benediktsson, J. Boardman, J. Brazile, L. Bruzzone, G. CampsValls, J. Chanussot, M. Fauvel, P. Gamba, A. Gualtieri, and J. Tilton, “Recent advances in techniques for hyperspectral image processing,” Remote Sens. Environ., vol. 113, pp. 110–122, 2009.
 [23] J. Plaza, R. Pérez, A. Plaza, P. Martínez, and D. Valencia, “Parallel morphological/neural processing of hyperspectral images using heterogeneous and homogeneous platforms,” Cluster Comput., vol. 11, pp. 17–32, 2008.
 [24] J. MuñozMarí, A. Plaza, J. Gualtieri, and G. CampsValls, “Parallel programming and applications in grid, P2P and networking systems,” in Parallel Implementation of SVM in Earth Observation Applications, F. Xhafa, Ed. UK: IOS Press, 2009.
 [25] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. New York: The MIT Press, 2006.
 [26] Y. Bazi and F. Melgani, “Gaussian process approach to remote sensing image classification,” IEEE Trans. Geosc. Rem. Sens., vol. 48, no. 1, pp. 186–197, Jan 2010.
 [27] M. Y. Yang, W. Liao, B. Rosenhahn, and Z. Zhang, “Hyperspectral image classification using gaussian process models,” in 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), July 2015, pp. 1717–1720.
 [28] K. Chen, P. Jian, Z. Zhou, J. Guo, and D. Zhang, “Semantic annotation of highresolution remote sensing images via gaussian process multiinstance multilabel learning,” IEEE Geoscience and Remote Sensing Letters, vol. 10, no. 6, pp. 1285–1289, Nov 2013.
 [29] K. Chen, Z. Zhou, C. Huo, X. Sun, and K. Fu, “A semisupervised contextsensitive change detection technique via gaussian process,” IEEE Geoscience and Remote Sensing Letters, vol. 10, no. 2, pp. 236–240, March 2013.
 [30] P. Ruiz, J. Mateos, G. CampsValls, R. Molina, and A. Katsaggelos, “Bayesian active remote sensing image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 4, pp. 2186–2196, April 2014.
 [31] L. Kalantari, P. Gader, S. Graves, and S. A. Bohlman, “Oneclass gaussian process for possibilistic classification using imaging spectroscopy,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 7, pp. 967–971, July 2016.

[32]
T. Minka, “Expectation propagation for approximate bayesian inference,” in
Proceedings of the Seventeenth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI01). San Francisco, CA: Morgan Kaufmann, 2001, pp. 362–369.  [33] M. Kuss and C. Rasmussen, “Assessing approximate inference for binary Gaussian process classification,” Machine learning research, vol. 6, pp. 1679–1704, Oct 2005.
 [34] Z. Chen, S. D. Babacan, R. Molina, and A. K. Katsaggelos, “Variational bayesian methods for multimedia problems,” IEEE Transaction on Multimedia, vol. 16, no. 4, pp. 1000–10 017, 2014.
 [35] A. Damianou, “Deep Gaussian Processes and Variational Propagation of Uncertainty,” Ph.D. dissertation, University of Sheffield, 2015.
 [36] A. Rahimi and B. Recht, “Random features for largescale kernel machines,” in Advances in neural information processing systems, 2007, pp. 1177–1184.
 [37] A. PérezSuay, L. GómezChova, J. Amorós, and G. CampsValls, “Randomized kernels for large scale earth observation applications,” Remote Sensing of Environment, 2017.
 [38] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. New York: The MIT Press, 2006.
 [39] W. Rudin, Fourier analysis on groups. John Wiley & Sons, 2011.
 [40] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
 [41] M. LázaroGredilla, J. QuiñoneroCandela, C. E. Rasmussen, A. R. FigueirasVidal et al., “Sparse spectrum gaussian process regression,” Journal of Machine Learning Research, vol. 11, no. Jun, pp. 1865–1881, 2010.

[42]
S. Babacan, R. Molina, M. Do, and A. Katsaggelos, “Bayesian blind
deconvolution with general sparse image priors,” in
European Conference on Computer Vision (ECCV)
, 2012, pp. 341–355.  [43] R. Fletcher and C. M. Reeves, “Function minimization by conjugate gradients,” The computer journal, vol. 7, no. 2, pp. 149–154, 1964.
 [44] M. Derrien and H. Le Gléau, “MSG/SEVIRI cloud mask and type from SAFNWC,” International Journal of Remote Sensing, vol. 26, no. 21, pp. 4707–4732, 2005.
 [45] J. Hocking, P. N. Francis, and R. Saunders, “Cloud detection in meteosat second generation imagery at the met office,” Universitat de València, Tech. Rep. 540, 2010.
Comments
There are no comments yet.