Introduction
Earth observation (EO) by airborne and satellite remote sensing and insitu observations play a fundamental role in monitoring our planet. In the last decade, machine learning has attained outstanding results in the estimation of biogeophysical variables from the acquired images at local and global scales in a timeresolved manner. Gaussian processes (GPs) [1], as flexible nonparametric models to find functional relationships, have excelled in EO problems in recent years, mainly introduced for model inversion and emulation of complex codes [2]. GPs provide not only accurate estimates but also principled uncertainty estimates for the predictions. Besides, GPs can easily accommodate multimodal data coming from different sensors and from multitemporal acquisitions. Due to their solid Bayesian formalism, GPs can include prior physical knowledge about the problem, and allow for a formal treatment of uncertainty quantification and error propagation.
In remote sensing, we often deal with radiative transfer models (RTMs) which implement the equations of energy transfer. These codes are needed for modelling, understanding, and predicting some variables of interest related to the state of the land cover, water bodies and atmosphere. An RTM operating in forward mode generates a multidimensional radiance observation seen by the sensor given a multidimensional
parameter state vector
, see Fig. 1. Running forward simulations yields a lookuptable (LUT) of inputoutput pairs, . Solving the inverse problem implies learning the function using to return an estimate each time a new satellite observation is acquired. GPs have been used to learn both the often costly forward model as well as the inverse model . Learning the forward model allows for faster simulations, while learning an inverse model has allowed to provide physicallymeaningful, spatiallyexplicit, and temporallyresolved maps of variables of interest.Despite great advances in forward and inverse modelling, GP models still have to face important challenges, such as the high computational cost involved or the derivation of faithful confidence intervals. More importantly, we posit that GP models should evolve towards
datadriven physicsaware models that respect signal characteristics, be consistent with elementary laws of physics, and move from pure regression to observational causal inference.Advances in GP inverse modelling
The most important shortcoming of GPs is their high computational cost and the memory requirements, which grows cubically and quadratically with the number of training points, respectively. Recently, a great progress has been made in constructing scalable versions of GPs, demonstrating their utility in big data regimes [3].
An important challenge in Earth observation relates to the fact that data comes with complex nonlinearities, levels and sources of noise, and nonstationarities. Standard GPs often assume homoscedastic noise and use stationary kernels though. The current stateoftheart GP to deal with heteroscedastic noise makes use of a marginalized variational approximation
[2]. The method has resulted in excellent performance in estimating biophysical parameters (chlorophylla content in plants and water bodies) from acquired reflectances. In many EO applications one transforms the observed variable to linearize or Gaussianize the data via parametric transforms. A warped GP model has allowed learning a nonparametric optimal transformation from data, and has shown very good results in predicting vegetation parameters (chlorophyll, leaf area index, and fractional vegetation cover) from hyperspectral images [4]. Another common problem in remote sensing is that of ensuring consistency across products: estimating several related variables simultaneously can incorporate their relations in a single model. A recent latent force model (LFM) GP can encode ordinary/partial differential equations governing the system, and has allowed to monitoring crops, estimate multiple vegetation covariates simultaneously, and deal with missing observations due to the presence of clouds or sensor acquisition problems
[5].Making inferences with GPs is not only about obtaining pointwise estimates but also faithful uncertainty estimates, essential to perform error propagation. Inference should also contemplate extrapolation analysis as an ambitious farend goal. Besides, note that we ultimately aim to characterize model error by comparing simulators to reality, calibrate models by proper estimation of (hyper)parameters, and make uncertainty statements about the world that combine models, data, and their corresponding errors. We think that the Bayesian formalism of GPs is the natural framework to tackle these yet unresolved problems.
Advances in GP forward modelling
Surrogate modelling, also known as emulation, based on GPs is gaining popularity in remote sensing. Emulators are essentially statistical models that learn to mimic the RTM code using a representative dataset . GPs have largely dominated the field for decades and have provided excellent accuracy and physical consistency as studied via sensitivity analysis in the context of vegetation and atmosphere models in [2]. Once the GP model is trained, one can readily perform fast forward simulations, which in turn allows improved inversion. However, replacing an RTM with a GP model requires running expensive evaluations of first. Recent more efficient alternatives construct an approximation to starting with a set of support points selected iteratively [5]
. This topic is related to active learning and Bayesian optimization, which might push results further in accuracy and sparsity, especially when modelling complex codes.
RTMs are the result of many decades of scientific research and continuous development, so they often include ad hoc
rules, heuristics, and nondifferentiable links that hamper analytic treatment. Emulation allows to account for input errors, derive predictive variance estimates, infer sensitivity values of parameters, calculate Jacobians, and perform uncertainty propagation and quantification analytically. Besides, a lot of physical knowledge used for designing RTMs could be translated in designing priors (e.g. physically plausible parameter values). These excellent capabilities have not been widely exploited in EO applications though.
Towards physicsaware GP modelling
The GP framework allows us to include constraints and priors adapted to signal features such as nonstationarity, circularity, spatialtemporal relations, colourednoise processes, and noni.i.d. relations. Nevertheless, datadriven GP models should be further constrained to provide physicallyplausible predictions. Recent approaches consider designing joint observationsimulation crosscovariances [5]. Recently we suggested a full framework for hybrid modelling with machine learning [6], which could be formalized within the GP probabilistic framework too.
Learning dynamical physical systems is very challenging. Recent regression approaches have learned the governing equations of nonlinear dynamical systems from data, such as the Lorenz, NavierStokes and Schrödinger equations. Models typically impose sparsity and hierarchical modelling, but also a GP probabilistic approach has excelled in discovering ordinary and partial differential, integrodifferential, and fractional order operators [7].
The integration of physics into GP models does not only achieve improved generalization but, more importantly, endorses these greybox models with consistency and faithfulness. As a byproduct, the hybridization process has an interesting regularization effect, as physics discards implausible models and promotes simpler structures.
From regression to causation
Understanding is more challenging than predicting, especially when no interventional studies can be conducted, as in the Earth sciences. Causal inference from observational data to estimate causal graphical models has become a mature science with effective machine learning methods to deal with both time series and nontime ordered data, see [8, 9]
and references therein. Causal inference methods can be classified roughly into conditional independence or constraintbased approaches and structural causal models. Constraintbased causal discovery algorithms iteratively infer graphical models utilizing conditional independence testing. In
[10] a GPbased conditional independence test is combined with a scalable causal discovery algorithm allowing to infer highdimensional graphical models from time series data. Constraintbased algorithms only allow to infer causal graphical models up to a Markov equivalence class. Utilizing additional assumptions, such as on the noise distribution or functional dependence, the class of structural causal models [8] allows to infer causal directionality in such undecidable Markov equivalent cases. Further GPbased causal discovery methods include [11] where a GP model was used as a prior to capture the timevarying causal association in a nonparametric manner, while in [12] GPs were exploited as an efficient prewhitening step to deal with noniid observations so common in remote sensing. Recently, [4, 2] introduced the WGP regression in additive noise models to account for postnonlinear effects and heteroscedastic noise respectively, and applied it successfully to a set of geoscience and remote sensing bivariate problems. Some important challenges in causal inference for the Earth science are still to be solved: how to scale GP models to deal with millions of points, missing data and time aggregation as well as time subsampling, and complex spatialtemporal dependency structures. Testing scientific hypotheses, comparing modelvsdata causal graphs, and assessing the impacts of extreme events, are just some exciting further avenues of research.Funding
GCV would like to acknowledge the support from the European Research Council (ERC) under the ERC Consolidator Grant 2014 project SEDAL (grant agreement 647423).
References
 [1] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, New York, 2006.
 [2] G. CampsValls, J. Verrelst, J. MuñozMarí, V. Laparra, F. MateoJiménez, and J. GómezDans. A survey on Gaussian Processes for Earth observation data analysis. IEEE Geoscience and Remote Sensing Magazine, 4(2):58–78, 2016.

[3]
J. Hensman, N. Fusi, and N.D. Lawrence.
Gaussian processes for big data.
In
Conference on Uncertainty in Artificial Intelligence
, pages 282–290. auai.org, 2013.  [4] A. Mateo, J. MuñozMarí, A. PérezSuay, and G. CampsValls. Warped Gaussian Processes in Remote Sensing Parameter Estimation and Causal Inference. IEEE Geoscience and Remote Sensing Letters, 15(11):1647–1651, 2018.
 [5] G. CampsValls, D. Svendsen, L. Martino, J. MuñozMarí, V. Laparra, M. CamposTaberner, and D. Luengo. Physicsaware Gaussian processes in remote sensing. Applied Soft Computing, 68:69–82, Jul 2018.
 [6] M. Reichstein, G. CampsValls, B. Stevens, J. Denzler, N. Carvalhais, M. Jung, and Prabhat. Deep learning and process understanding for datadriven Earth system science. Nature, 566:195–204, Feb 2019.
 [7] M. Raissi, P. Perdikaris, and G.E. Karniadakis. Machine learning of linear differential equations using gaussian processes. Journal of Computational Physics, 348:683–693, Aug 2017.
 [8] J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference  Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. MIT, Cambridge, MA, USA, 2017.
 [9] K. Zhang, B. Schölkopf, P. Spirtes, and C. Glymour. Learning causality and causalityrelated learning: Some recent progress. National Science Review, 5(1):26–29, 2018.
 [10] Jakob Runge, Peer Nowack, Marlene Kretschmer, Seth Flaxman, and Dino Sejdinovic. Detecting causal associations in large nonlinear time series datasets. Sci. Adv., 2019. arXiv: 1702.07007v2.
 [11] B. Huang, K. Zhang, and B. Schölkopf. Identification of timedependent causal model: A gaussian process treatment. In 24th International Joint Conference on Artificial Intelligence, Machine Learning Track, pages 3561–3568, Palo Alto, California USA, 2015. AAAI Press.
 [12] S.R. Flaxman, D.B. Neill, and A.J. Smola. Gaussian processes for independence tests with noniid data in causal inference. ACM TIST, 7(2):22:1–22:23, 2016.