Monotonic Gaussian Process Flow

05/30/2019 · by Ivan Ustyuzhaninov et al.

We propose a new framework for imposing monotonicity constraints in a Bayesian non-parametric setting. Our approach is based on numerical solutions of stochastic differential equations [Hegde et al., 2019]. We derive a non-parametric model of monotonic functions that allows for interpretable priors and principled quantification of hierarchical uncertainty. We demonstrate the efficacy of the proposed model by providing results competitive with other probabilistic models of monotonic functions on a number of benchmark functions. In addition, we consider the utility of a monotonicity constraint in hierarchical probabilistic models, such as deep Gaussian processes. These typically suffer from difficulties in modelling and propagating uncertainty throughout the hierarchy, which can lead to hidden layers collapsing to point estimates. We address this by constraining hidden layers to be monotonic and present novel procedures for learning and inference that maintain uncertainty. We illustrate the capacity and versatility of the proposed framework on the task of temporal alignment of time-series data, where it is beneficial to preserve the uncertainty in the temporal warpings.


1 Introduction

Monotonic regression is the task of inferring the relationship between a dependent variable and an independent variable when the relationship is known to be monotonic. Monotonic functions are prevalent in areas as diverse as physical sciences Lavine:1995 , marine biology Hall:2001 , geology Haslett:2008 , public health Dette:2006 , design of computer networking systems Golchi:2015 , economics Canini:2016 , insurance Durot:2018 , biology Lorenzi:2019 ; Nader:2019 , meteorology Kim:2018 and many others. Furthermore, monotonicity constraints have been used in 2-layer models in the context of warped inputs; examples of where such a construction is beneficial include Bayesian optimisation of non-stationary functions Snoek:2014 and temporal warps of time-series data in mixed effects models Kaiser:2018 ; Raket:2016 .

Extensive study by the statistics Ramsay:1988 ; Sill:1997 and, more recently, machine learning communities has resulted in a variety of frameworks. While many traditional approaches use parametric constrained splines, contemporary methods have considered monotonicity in the context of continuous random processes, mostly based on Gaussian processes (GPs) Rasmussen:2005 . As a non-parametric Bayesian model, a GP is an attractive foundation on which to build flexible and theoretically sound models; however, imposing monotonicity constraints on a GP has proven to be problematic Lin:2014 ; Riihimaki:2010 .

Monotonicity also appears in the context of hierarchical models. Discriminative classifiers benefit from hierarchical structures to collapse the input domain to encode invariances. This can be embodied as a hierarchy of non-injective (many-to-one) mappings. In contrast, if we wish to build generative probabilistic models (e.g. unsupervised representation learning) we seek to explain the observed data. We, therefore, want to transform a (simple) input distribution to a (complicated) data distribution. In a hierarchical model, this necessitates a composition of injective mappings for all but the output layer; this property is met if the hidden layers in the model comprise monotonic transformations. Fundamentally, discrimination and modelling benefit from hierarchies in opposite ways: the former benefits from collapsing the input domain, while the latter benefits from hierarchical structure that creates more flexible transformations.

We might go further and consider how many of these injective layers should be required in a hierarchy. In the presence of additional mid-hierarchy marginal information or domain-specific knowledge of compositional priors, e.g. Kaiser:2018 , the hierarchy allows us to exploit our prior beliefs. In the absence of such information (e.g. unsupervised learning), multiple injective layers would only arise if the individual layers were insufficiently expressive to capture the transformation (as the composition of injective transforms is itself injective). This motivates the study of non-parametric models of monotonic (injective) functions, which can represent a wide variety of transformations and therefore serve as a general-purpose first layer in a hierarchical model.

In this work we propose a novel non-parametric Bayesian model of monotonic functions that is based on recent work on differential equations (DEs). At the heart of such models is the idea of approximating the derivatives of a function rather than studying the function directly. DE models have gained a lot of popularity recently and they have been successfully applied in conjunction with both neural networks Chen:2018 and GPs Heinonen:2018 ; Yildiz:2018a ; Yildiz:2018 . We consider a recently proposed framework, called differential GP flows Hedge:2019 , that performs classification and regression by learning a Stochastic Differential Equation (SDE) transformation of the input space. It admits an expressive yet computationally convenient parametrisation using GPs while the uniqueness theorem guarantees monotonicity.

In this paper we formulate a novel approach that is guaranteed to lead to monotonic samples from the flow field, and we further present a hierarchical model that propagates uncertainty throughout without collapsing the distribution. We test our proposed approach on a set of regression benchmarks and provide an illustrative example of the 2-layer model where the first layer corresponds to a smooth warping of time and the second layer models a sequence of time-series observations.

2 Related work

Splines

Many classical approaches to monotonic regression rely on spline smoothing: given a basis of monotone spline functions, the underlying function is approximated using a non-negative linear combination of these basis functions, and the monotonicity constraints are satisfied in the entire domain Wahba:1978 by construction. For example, Ramsay Ramsay:1998 considers a family of functions defined by the differential equation $D^2 f = w\,Df$, which contains the strictly monotone twice-differentiable functions, and approximates $w$ using a basis of M-splines and I-splines. Shively et al. Shively:2009 consider a finite approximation using quadratic splines and a set of constraints on the coefficients that ensure isotonicity at the interpolation knots. The use of piecewise linear splines was explored by Haslett and Parnell Haslett:2008 , who use additive i.i.d. gamma increments and a Poisson process to locate the interpolation knots; this leads to a process with a random number of piecewise linear segments of random length, both of which are marginalised analytically. Further examples of spline-based approaches rely on cubic splines Wolberg:2002 , mixtures of cumulative distribution functions Bornkamp:2009 , and an approximation of the unknown regression function using Bernstein polynomials Curtis:2011 .

Gaussian process

A GP is a stochastic process which is fully specified by its mean function $m(x)$ and its covariance function $k(x, x')$, such that any finite set of random variables $f(x_1), \dots, f(x_N)$ has a joint Gaussian distribution Rasmussen:2005 . GPs provide a robust method for modelling non-linear functions in a Bayesian non-parametric framework; ordinarily one considers a GP prior over the function and combines it with a suitable likelihood to derive a posterior estimate for the function given data.

Monotonic Gaussian processes

A common approach is to ensure that the monotonicity constraints are satisfied at a finite number of input points. For example, Da Veiga and Marrel DaVeiga:2012 use a truncated multi-normal distribution and an approximation of conditional expectations at discrete locations, while Maatouk Maatouk:2017 and Lopez-Lopera et al. Lopez-Lopera:2019 proposed finite-dimensional approximations based on deterministic basis functions evaluated at a set of knots. Another popular approach, proposed by Riihimäki and Vehtari Riihimaki:2010 , is based on including derivative information at a number of input locations by forcing the derivative process to be positive at these locations. Extensions to this approach include both adaptations to new application domains Golchi:2015 ; Lorenzi:2019 ; Siivola:2016 and new inference schemes Golchi:2015 . However, these approaches do not guarantee monotonicity as they impose constraints at a finite number of points only. Lin and Dunson Lin:2014 propose another GP-based approach that relies on projecting sample paths from a GP onto the space of monotone functions using the pooled adjacent violators algorithm, which does not impose smoothness. Furthermore, the projection operation complicates the inference of the parameters of the GP and produces distorted credible intervals. Lenk and Choi Lenk:2017 design shape-restricted functions by enforcing that the derivatives of the functions are squared Gaussian processes, approximating the GP using a series expansion with the Karhunen-Loève representation and numerical integration. Andersen et al. Andersen:2018 construct monotonic functions using a differential equation formulation where the solution to the DE is constructed using a composition of a GP and a non-negative function; we refer to this method as the transformed GP.

3 Background

Any random process can be defined through its finite-dimensional distributions Oksendal:1992 . This implies that modelling a certain set of functions $f$ as trajectories of such a process requires their definition through the finite-dimensional joint distributions $p_{x_1, \dots, x_N}\big(f(x_1), \dots, f(x_N)\big)$. Constraining the functions to be monotonic necessitates choosing a family of joint probability distributions that satisfies the monotonicity constraint:

$$p\big(f(x_1) \le f(x_2) \le \dots \le f(x_N)\big) = 1 \quad \text{for all } x_1 < x_2 < \dots < x_N. \qquad \text{(MC)}$$

One way to achieve this is to truncate commonly used joint distributions (e.g. Gaussian); however, inference in such models is computationally challenging Maatouk:2017 . Another approach is to define a random process with monotone trajectories by construction (e.g. a compound Poisson process); however, this often requires making simplifying assumptions on the trajectories (and therefore on the joint distributions). In this work we use solutions of stochastic differential equations (SDEs) to define a random process with monotonic trajectories by construction while avoiding strong simplifying assumptions.

Gaussian process flows

Our model builds on the general framework for modelling functions as SDE solutions introduced in Hedge:2019 . The approach is based on considering the following SDE (where $W_t$ is the Wiener process):

$$\mathrm{d}x_t = \mu(x_t, t)\,\mathrm{d}t + \sqrt{\Sigma(x_t, t)}\,\mathrm{d}W_t \qquad (1)$$

Its solution for an initial condition $x_0$ is a random process indexed by “time” $t$. For a fixed time $T$, $x_T$ is a random variable that depends on the initial condition (we denote it $x_T(x_0)$), and we may define a mapping of an arbitrary initial condition to such solutions at time $T$: $x_0 \mapsto x_T(x_0)$, effectively defining distributions over functions (similar to Gaussian processes, for example). The family of such distributions is parametrised by the functions $\mu$ (drift) and $\Sigma$ (diffusion), which are defined in Hedge:2019 using a derived sparse Gaussian process Titsias:2009 (hence the name GP flows).

Specifically, assume we have inputs $x_n \in \mathbb{R}^D$ ($n = 1, \dots, N$) and a single-output GP $f$, with $f : \mathbb{R}^{D+1} \to \mathbb{R}$ since $f$ is a function of both $x$ and time $t$. We specify the GP via a set of inducing outputs $U = \{u_m\}_{m=1}^M$, $u_m \in \mathbb{R}$, corresponding to inducing input locations $Z = \{z_m\}_{m=1}^M$, $z_m \in \mathbb{R}^{D+1}$, in a manner similar to Titsias:2009 . The predictive posterior distribution of such a GP is:

$$f(x) \mid U \sim \mathcal{N}\big(K_{xZ} K_{ZZ}^{-1} U,\; K_{xx} - K_{xZ} K_{ZZ}^{-1} K_{Zx}\big). \qquad (2)$$

The SDE functions are defined to be $\mu(x, t) = K_{xZ} K_{ZZ}^{-1} U$ and $\Sigma(x, t) = K_{xx} - K_{xZ} K_{ZZ}^{-1} K_{Zx}$, implying that Eq. (1) is completely defined by a GP and its set of inducing points $(Z, U)$. Inferring the posterior of $U$ is intractable in closed form, hence it is approximated by a variational distribution $q(U)$, the parameters of which are optimised by maximising the likelihood lower bound $\mathcal{L}$:

$$\log p(\mathbf{y}) \ge \mathcal{L} = \mathbb{E}_{q(U)}\,\mathbb{E}_{p(x_T \mid x_0, U)}\big[\log p(\mathbf{y} \mid x_T)\big] - \mathrm{KL}\big(q(U) \,\|\, p(U)\big). \qquad (3)$$

The expectation is approximated by sampling the numerical approximations of the SDE solutions, which is particularly convenient to do with $\mu$ and $\Sigma$ defined through a GP posterior, because sampling such an SDE solution only requires generating samples from the posterior of the GP given the inducing points (see Hedge:2019 for a thorough discussion of this procedure).
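To make the sampling procedure concrete, the following is a minimal NumPy sketch of one Euler-Maruyama draw of the SDE in Eq. (1), with the drift and diffusion given by the sparse GP posterior of Eq. (2). It is an illustration under assumed settings (RBF kernel with unit variance, hand-picked inducing points), not the TensorFlow implementation used in the paper, and each input point receives its own independent noise here; the joint sampling required for monotonicity is introduced in Sec. 4.

```python
# Minimal sketch of a GP-flow draw: drift/diffusion from a sparse GP posterior,
# integrated with Euler-Maruyama. Kernel and inducing points are assumptions.
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between rows of A (n, d) and B (m, d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def sparse_posterior(Xt, Z, U, jitter=1e-6):
    """Marginal posterior mean/variance of the flow GP at rows of Xt, given (Z, U)."""
    Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))
    Kxz = rbf(Xt, Z)
    A = np.linalg.solve(Kzz, Kxz.T).T                              # K_xz K_zz^{-1}
    mean = A @ U
    var = np.maximum(1.0 - np.einsum('ij,ij->i', A, Kxz), 1e-12)   # k(x, x) = 1 here
    return mean, var

def euler_maruyama(x0, Z, U, T=1.0, n_steps=20, rng=None):
    """Numerical approximation of the SDE solution x_T for initial values x0."""
    rng = np.random.default_rng() if rng is None else rng
    dt, x = T / n_steps, np.atleast_1d(np.array(x0, dtype=float))
    for k in range(n_steps):
        Xt = np.column_stack([x, np.full_like(x, k * dt)])         # GP input is (x, t)
        mu, var = sparse_posterior(Xt, Z, U)                       # drift and diffusion, Eq. (2)
        x = x + mu * dt + np.sqrt(var * dt) * rng.standard_normal(x.shape)
    return x

# Illustrative usage: flow a few inputs through one draw of the SDE.
rng = np.random.default_rng(0)
Z = np.stack(np.meshgrid(np.linspace(-2, 2, 5), np.linspace(0, 1, 3)), -1).reshape(-1, 2)
U = 0.3 * rng.standard_normal(len(Z))
print(euler_maruyama(np.linspace(-2, 2, 5), Z, U, rng=rng))
```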

4 Monotonic Gaussian process flows

We now describe our monotonic model that can be used to model observations directly, or as a first layer in a 2-layer hierarchical model in which the monotonic transformation can be considered a warping of the input space. This will be considered in Sec. 5. We begin with a colloquial outline in two parts before a formal specification of our result.

  1. A smooth differential equation may be considered as a flow field and the propagation of particles as trajectories under fluid flow. A fundamental property of such flows is that one can never cross the streams of the flow field. Therefore, if particles are evolved simultaneously under a flow field their ordering cannot be permuted; this gives rise to a monotonicity constraint.

  2. A stochastic differential equation, however, introduces random perturbations into the flow field so particles evolving independently could jump across flow lines and change their ordering. However, a single, coherent draw from the SDE will always produce a valid flow field (the flow field will simply change between draws). Thus, particles evolving jointly under a single draw will still evolve under a valid flow field and therefore never permute.

SDE solutions are monotonic functions of initial values

It transpires that the joint distribution of $\big(x_T(x_0^1), \dots, x_T(x_0^N)\big)$, where $x_T(x_0)$ is the solution of Eq. (1) for an initial value $x_0$ as defined in Sec. 3, with initial conditions $x_0^1 < \dots < x_0^N$, satisfies (MC). This is a consequence of a general result that SDE solutions are unique and continuous under certain regularity assumptions (see, for example, Theorem 5.2.1 in Oksendal:1992 ) for any initial value (i.e. the random variable $x_T(x_0)$ is a unique and continuous function of $x_0$ for any element $\omega$ of the sample space $\Omega$).

Using this result we conclude that if we have two initial conditions $a_0$ and $b_0$ such that $a_0 < b_0$, the corresponding solutions at some time $T$ also obey this ordering, $a_T < b_T$. Indeed, were that not the case, the continuity of the solutions implies that there exists some time $t^*$ such that $a_{t^*} = b_{t^*}$ (i.e. the trajectories corresponding to $a_0$ and $b_0$ cross), resulting in the SDE having two different solutions for the same initial condition (namely the continuations of $a$ and of $b$ from time $t^*$), violating the uniqueness result.

The above argument implies that individual solutions (i.e. solutions for a single draw $\omega$) of SDEs at a fixed time $T$, $x_0 \mapsto x_T(x_0)$, are monotonic functions of the initial conditions, and they provide a random process with monotonic trajectories. The actual prior distribution of such trajectories depends on the exact form of the functions $\mu$ and $\Sigma$ in Eq. (1) (e.g. if $\Sigma \equiv 0$, the SDE is simply an ordinary DE and $x_T$ is a deterministic function of $x_0$, meaning that the prior distribution consists of a single function). A prior distribution over $\mu$ and $\Sigma$ thus induces a prior distribution over the monotonic functions $x_T(\cdot)$, and inference in this model consists of computing the posterior distribution of these functions conditioned on an observed noisy sample from a monotonic function.

Notable differences to Hedge:2019

We define $\mu$ and $\Sigma$ using a GP with inducing points as in Sec. 3. In Hedge:2019 , a regular GP is placed on top of the SDE solutions, so $p(\mathbf{y} \mid x_T)$ in Eq. (3) is a GP with a Gaussian likelihood. In contrast, since we are modelling monotonic functions and the solutions $x_T$ are monotonic functions of $x_0$, we define $p(\mathbf{y} \mid x_T)$ to be directly the observation likelihood. A critical difference in our inference procedure is that samples during the evolution of the SDE must be drawn jointly from the GP posterior over the flow field. This ensures that they are taken from the same instantaneous realisation of the stochastic flow field, and thus the monotonicity constraint is preserved (because the above argument relies on considering the SDE solutions for the same $\omega$, i.e. a specific realisation of the random flow field).
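To illustrate the difference, the following minimal NumPy sketch (an illustration under assumed kernel, inducing points and integration settings, not the paper's TensorFlow implementation) evolves a set of initial conditions jointly: at every Euler-Maruyama step the perturbation is drawn from the full posterior covariance over all current particle positions, so all particles experience one coherent realisation of the flow field and their ordering is preserved up to discretisation error.

```python
# Sketch of the joint sampling scheme: all particles are perturbed by a single
# coherent draw from the flow-field posterior at every Euler-Maruyama step.
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def joint_posterior(Xt, Z, U, jitter=1e-6):
    """Joint posterior (mean vector, full covariance) of the flow GP at rows of Xt."""
    Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))
    Kxz = rbf(Xt, Z)
    A = np.linalg.solve(Kzz, Kxz.T).T                              # K_xz K_zz^{-1}
    return A @ U, rbf(Xt, Xt) - A @ Kxz.T + jitter * np.eye(len(Xt))

def monotonic_flow_sample(x0, Z, U, T=1.0, n_steps=20, rng=None):
    """One joint draw x_T(x0) for all initial conditions x0 simultaneously."""
    rng = np.random.default_rng() if rng is None else rng
    dt, x = T / n_steps, np.array(x0, dtype=float)
    for k in range(n_steps):
        Xt = np.column_stack([x, np.full_like(x, k * dt)])         # GP input is (x, t)
        mean, cov = joint_posterior(Xt, Z, U)
        eps = np.linalg.cholesky(cov) @ rng.standard_normal(len(x))  # one coherent field draw
        x = x + mean * dt + eps * np.sqrt(dt)
    return x

rng = np.random.default_rng(1)
Z = np.stack(np.meshgrid(np.linspace(-2, 2, 5), np.linspace(0, 1, 3)), -1).reshape(-1, 2)
U = 0.3 * rng.standard_normal(len(Z))
xT = monotonic_flow_sample(np.linspace(-2, 2, 30), Z, U, rng=rng)
print(np.all(np.diff(xT) >= 0))   # ordering of the initial conditions is preserved
```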

5 Two-layer model with monotonic input warping

Potential issues with deep GPs

Hierarchical models, and specifically deep GPs, are useful for modelling complex data (e.g. exhibiting multiple different length-scales) that are not well-fitted by regular GPs. We argue, however, that deep GPs have two issues restricting the range of potential applications, which we address in this work for a specific setting of a 2-layer model:

  1. Unconstrained deep GPs have many degenerate configurations (e.g. one of the layers collapsing to a constant function). Such configurations are often local minima; therefore, careful initialisation and optimisation are required to avoid them.

  2. All but one layer in a deep GP often collapse to deterministic transformations, and the uncertainty is preserved only in the remaining layer Havasi:2018 . This happens because typical inference schemes (e.g. Damianou:2013 ; Salimbeni:2017 ) make an independence assumption between the layers, meaning that if the uncertainty were preserved in at least two layers, the inputs to the upper one (of the two) would be uncertain, leading to worse data fits.

We address the first issue by constraining the first layer to be monotonic. Such transformations preserve the rank of the kernel matrix, i.e. $\operatorname{rank} K(X, X) = \operatorname{rank} K(w(X), w(X))$, where $w$ is a monotonic function. A monotonic first layer can be considered part of a composite kernel for the output GP, which allows us to think of the entire model as a GP with a transformed kernel, while unconstrained deep GPs are not necessarily actual GPs, resulting in a variety of possible degenerate solutions.

To address the second issue, we must model dependencies between the transformations in the first and second layers. Intuitively, we might consider the following toy example of a 2-layer model: the first layer models the distribution over a set of two functions $\{f_a, f_b\}$, the second one models the distribution over $\{g_a, g_b\}$, and the true function we are trying to fit can be represented either as $g_a \circ f_a$ or as $g_b \circ f_b$. The distributions over these functions in the two layers need to be dependent (so that both functions in each individual draw from the joint distribution are either $a$-functions or both are $b$-functions); otherwise samples from the model can be compositions of $a$- and $b$-functions, which do not match the true function. If the inference scheme does not allow for such dependencies, it is impossible to make all the samples match the true function unless the scheme collapses to one of the modes. This corresponds to always using either $a$-functions or $b$-functions without keeping any uncertainty resulting from the two different solutions Havasi:2018 . Next we describe how this intuition translates to 2-layer models composed of a monotonic flow and a regular GP.

5.1 GP on top of the monotonic flow output

Instead of directly using the SDE solutions as the model of the outputs, we add an additional transformation on top of them that permits modelling non-monotonic functions (as the output transformation is not constrained to be monotonic). We choose this transformation to be a standard Gaussian process with a stationary kernel, which, as described above, enables us to think of the monotonic flow as part of a composite output GP kernel: a warping of the input space that satisfies the stationarity assumptions of the output GP.

Specifically, we introduce an output function $g$ and a set of inducing points $V = \{v_j\}_{j=1}^{M_g}$ corresponding to inducing locations $W = \{w_j\}_{j=1}^{M_g}$. Given the noisy observations $\mathbf{y}$, the marginal likelihood is:

$$p(\mathbf{y}) = \int p(\mathbf{y} \mid \mathbf{g})\, p(\mathbf{g} \mid x_T, V)\, p(V)\, p(x_T \mid U)\, p(U)\; \mathrm{d}\mathbf{g}\, \mathrm{d}V\, \mathrm{d}x_T\, \mathrm{d}U, \qquad (4)$$

where $\mathbf{g}$ is the finite-dimensional evaluation of the output GP at the flow outputs $x_T$, $p(\mathbf{y} \mid \mathbf{g})$ is the observational noise likelihood, and $p(\mathbf{g} \mid x_T, V)$ is the output GP posterior given the inducing points $V$.

To compute $p(\mathbf{y})$, which is intractable in closed form, we use variational inference and introduce a distribution $q(U, V)$. Similarly to Eq. (3), we obtain a lower bound:

$$\log p(\mathbf{y}) \ge \mathbb{E}_{q(U, V)}\Big[\mathbb{E}_{p(x_T \mid x_0, U)}\,\mathbb{E}_{p(\mathbf{g} \mid x_T, V)}\big[\log p(\mathbf{y} \mid \mathbf{g})\big]\Big] - \mathrm{KL}\big(q(U, V) \,\|\, p(U)\, p(V)\big). \qquad (5)$$

The two inner expectations are over the distribution of the SDE solutions (monotonic flow outputs) and over the output GP evaluated at the flow outputs; we estimate them empirically by sampling (see Sec. 3). Assuming a factorisation of the variational distribution, $q(U, V) = q(U)\,q(V)$, would allow one to marginalise the inducing points analytically and remove the outer expectation. However, as argued above, modelling dependencies between $U$ and $V$ is the key component in preserving the uncertainty in both layers. Therefore, to permit any form of the joint $q(U, V)$, we approximate the outer expectation in Eq. (5) by sampling as well. Overall, we use the following procedure to compute the lower bound in Eq. (5):

  1. Draw samples $(U^{(s)}, V^{(s)}) \sim q(U, V)$, $s = 1, \dots, S$,

  2. For each of these samples, draw the monotonic flow samples $x_T^{(s)}$ using the SDE numerical integration Hedge:2019 and the output GP samples $\mathbf{g}^{(s)}$ from $p(\mathbf{g} \mid x_T^{(s)}, V^{(s)})$,

  3. Empirically estimate the expectation in Eq. (5) as $\frac{1}{S} \sum_{s=1}^{S} \log p(\mathbf{y} \mid \mathbf{g}^{(s)})$.

This inference scheme is similar to stochastic variational inference for deep GPs Salimbeni:2017 and GP flows Hedge:2019 , the main difference being that we not only sample the outputs of the model but also, jointly, the inducing points required to sample the outputs. This adds additional stochasticity to the computations, but allows us to explicitly model the dependencies between $U$ and $V$ in $q(U, V)$.
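The following is a structural sketch of this estimator under strong simplifying assumptions: the flow and output-GP samplers are crude stand-ins (a simple monotone warp and an interpolant, not the actual model) and the KL term is omitted. It is intended only to show the order of the sampling steps, in particular that $U$ and $V$ are drawn jointly before anything downstream is sampled.

```python
# Structural sketch of the Monte-Carlo estimate of the data-fit term in Eq. (5);
# the samplers below are stand-ins assumed only for this illustration.
import numpy as np

rng = np.random.default_rng(0)
N, M_u, M_v, S = 40, 8, 8, 16                 # data points, inducing sizes, MC samples
y = np.sin(np.linspace(0, 3, N)) + 0.1 * rng.standard_normal(N)   # toy observations
noise_var = 0.1 ** 2

# Joint variational distribution q(U, V) = N(mean, L L^T) over the concatenated
# inducing outputs; `mean` and `L` are the variational parameters being optimised.
mean = np.zeros(M_u + M_v)
L = 0.1 * np.eye(M_u + M_v)

def sample_flow_outputs(U):
    """Stand-in for a monotonic-flow draw x_T(x_0) given flow inducing outputs U."""
    x0 = np.linspace(0, 3, N)
    return x0 + 0.1 * np.interp(x0, np.linspace(0, 3, len(U)), np.cumsum(np.abs(U)))

def sample_output_gp(xT, V):
    """Stand-in for a draw of the output GP at the warped inputs xT, given V."""
    return np.interp(xT, np.linspace(xT.min(), xT.max(), len(V)), V)

def gaussian_loglik(y, g, var):
    return -0.5 * np.sum((y - g) ** 2 / var + np.log(2 * np.pi * var))

estimates = []
for _ in range(S):
    UV = mean + L @ rng.standard_normal(M_u + M_v)   # step 1: joint draw of (U, V)
    U, V = UV[:M_u], UV[M_u:]
    xT = sample_flow_outputs(U)                      # step 2: flow sample given U
    g = sample_output_gp(xT, V)                      # step 2: output-GP sample given V
    estimates.append(gaussian_loglik(y, g, noise_var))
print(np.mean(estimates))                            # step 3: empirical average
```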

5.2 Modelling dependencies between layers

Correlations between inducing points

We study how the dependencies between the layers can avoid uncertainty collapse in a 2-layer model by considering the variational joint distribution of the inducing points to be either factorised, $q(U, V) = q(U)\,q(V)$ with Gaussian $q(U)$ and $q(V)$ (we refer to this as the base case), or a joint Gaussian with a full covariance matrix. The reparametrisation trick Kingma:2013 for the Gaussian distribution permits computation of the gradients with respect to the variational parameters and the use of the stochastic estimate of the lower bound in Eq. (5) to optimise them.
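As a small illustration (with assumed shapes and parameter values; the paper's implementation is in TensorFlow, where the same transform makes the sample differentiable with respect to the variational parameters), the reparametrised joint sample of the concatenated inducing outputs can be written as a deterministic transform of standard normal noise:

```python
# Sketch of a reparametrised joint Gaussian q(U, V) over the concatenated
# inducing outputs; all shapes and values here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
M_u, M_v = 8, 8                                        # flow / output-GP inducing sizes

mean = np.zeros(M_u + M_v)                             # variational mean
scale = np.tril(0.1 * rng.standard_normal((M_u + M_v, M_u + M_v)), k=-1) \
        + 0.5 * np.eye(M_u + M_v)                      # lower-triangular scale, cov = scale @ scale.T

eps = rng.standard_normal(M_u + M_v)                   # parameter-free noise
UV = mean + scale @ eps                                # deterministic in (mean, scale)
U, V = UV[:M_u], UV[M_u:]

# In the factorised "base case" the off-diagonal block of `scale` linking U and V
# would be zero; it is exactly this block that lets the two layers share uncertainty.
```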

Direct dependency between flow output and output GP

The inducing points for the monotonic flow and for the output GP are conditionally independent given the flow outputs. This arises because each draw of the output GP inducing points needs to be consistent with the draw of the flow output (such that it is mapped to the observations by means of the inducing points), whereas the way we generated this draw from the first layer becomes irrelevant once we have it. Thus, modelling $q(U, V)$ directly does not take into account this conditional structure, which leads to increased variance of $V$, and in turn to increased variance in the output GP, analogous to GPs with uncertain inputs Girard:2003 . Consequently, the variance of the first layer is decreased during optimisation to obtain reduced variance in the output and a better fit of the data. However, this comes at the cost of less uncertainty in the outputs of the first layer.

We propose to model the dependency between the flow outputs and the output-GP inducing points directly, using the result for the optimal variational distribution of inducing points in a single-layer GP (Titsias:2009 , Eq. (10)). We can then define a variational distribution $q(V \mid x_T)$ for a fixed $x_T$, which we use to sample from the 2-layer model as follows:

  1. Sample $U \sim q(U)$ and the corresponding flow outputs $x_T$,

  2. Using the generated $x_T$, sample $V \sim q(V \mid x_T)$ and use it to generate samples from the output GP $p(\mathbf{g} \mid x_T, V)$.

This procedure allows us to match the samples from the flow and the output GP and, as we show in Sec. 6, it results in more uncertainty being preserved in the first layer compared to the cases of not modeling these dependencies (base case) or modelling the dependencies between the inducing points directly (case with correlated inducing points).
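A hedged sketch of this conditional sampling is given below: for a fixed draw of the flow outputs, we form the optimal variational distribution of the inducing outputs of a single-layer sparse GP regression from the warped inputs to the observations (Titsias:2009 , Eq. (10)) and sample $V$ from it. The kernel, inducing locations, noise level and the stand-in flow draw are illustrative assumptions, not the paper's settings.

```python
# Sketch of sampling V consistently with a fixed flow draw xT, via the optimal
# variational distribution of sparse-GP inducing outputs (Titsias, 2009, Eq. (10)).
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def optimal_inducing_posterior(xT, y, W, noise_var, jitter=1e-6):
    """Mean and covariance of q(V | xT, y) for output-GP inducing locations W."""
    Kmm = rbf(W, W) + jitter * np.eye(len(W))
    Kmn = rbf(W, xT)
    Sigma = np.linalg.inv(Kmm + Kmn @ Kmn.T / noise_var)
    m = Kmm @ Sigma @ Kmn @ y / noise_var
    C = Kmm @ Sigma @ Kmm
    return m, 0.5 * (C + C.T)                           # symmetrise before sampling

# Illustrative usage: a toy monotonic warp stands in for one flow sample xT.
rng = np.random.default_rng(0)
x0 = np.linspace(0, 1, 50)
xT = x0 + 0.2 * np.sin(np.pi * x0)                      # stand-in flow draw (monotonic)
y = np.sin(4 * xT) + 0.1 * rng.standard_normal(50)      # toy observations
W = np.linspace(0, 1.2, 10)                             # output-GP inducing locations
m, C = optimal_inducing_posterior(xT, y, W, noise_var=0.1 ** 2)
V = rng.multivariate_normal(m, C)                       # V matched to this particular xT
```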

6 Experiments

| Method | flat | sinusoidal | step | linear | exponential | logistic |
|---|---|---|---|---|---|---|
| GP | 15.1 | 21.9 | 27.1 | 16.7 | 19.7 | 25.5 |
| GP projection Lin:2014 | 11.3 | 21.1 | 25.3 | 16.3 | 19.1 | 22.4 |
| Regression splines Shively:2009 | 19.7 | 22.9 | 28.5 | 24.0 | 21.3 | 19.4 |
| GP approximation Maatouk:2017 | 18.2 | 20.6 | 41.1 | 15.8 | 20.8 | 21.0 |
| GP with derivatives Riihimaki:2010 | 16.5 ± 5.1 | 19.9 ± 2.9 | 68.6 ± 5.5 | 16.3 ± 7.6 | 27.4 ± 6.5 | 30.1 ± 5.7 |
| Transformed GP Andersen:2018 (VI-full) | 16.4 ± 4.5 | 20.6 ± 5.9 | 52.5 ± 3.6 | 11.6 ± 5.8 | 17.5 ± 7.3 | 17.1 ± 6.2 |
| Monotonic Flow (ours) | 16.8 ± 3.2 | 17.9 ± 4.2 | 20.5 ± 5.0 | 13.2 ± 6.7 | 14.4 ± 4.8 | 18.1 ± 5.0 |

Table 1: Root-mean-square error ± SD of 20 trials for data sets of size 100
Regression

First, we test the monotonic flow model on the task of estimating 1D monotonic curves from noisy data. We use a set of benchmark functions from previous studies Lin:2014 ; Maatouk:2017 ; Shively:2009 . Examples of three such functions are shown in Fig. 1; the exact equations are in the Supplementary material. The training data is generated by evaluating these functions at equally spaced points and adding i.i.d. Gaussian noise. We note that many real-life datasets that benefit from including monotonicity constraints have similar trends and high levels of noise (e.g. Curtis:2011 ; Haslett:2008 ; Kim:2018 ). Following the literature, we used the root-mean-square error (RMSE) to evaluate the performance of the model.
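For concreteness, the following sketch reproduces the evaluation protocol under assumed settings; the particular benchmark function, number of points, noise level and the isotonic-regression stand-in for a fitted model are illustrative choices, not the paper's exact ones.

```python
# Illustrative data generation and RMSE evaluation for a monotonic benchmark.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
n_points, noise_sd = 100, 0.1                       # assumed experiment settings

def logistic(x):                                    # stand-in monotonic benchmark
    return 1.0 / (1.0 + np.exp(-10.0 * (x - 0.5)))

x = np.linspace(0, 1, n_points)                     # equally spaced inputs
y = logistic(x) + noise_sd * rng.standard_normal(n_points)   # noisy observations

f_hat = IsotonicRegression().fit_transform(x, y)    # any monotonic fit goes here
rmse = np.sqrt(np.mean((f_hat - logistic(x)) ** 2)) # error against the noise-free function
print(rmse)
```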

100 data points

In Table 1 we provide the results obtained by fitting different monotonic models to data sets containing 100 points. As baselines we include: GPs with monotonicity information Riihimaki:2010 (implementation available from https://research.cs.aalto.fi/pml/software/gpstuff/), transformed GPs Andersen:2018 (implementation provided in personal communications), and other results reported in the literature. We report the RMSE means and SDs from 20 trial runs with different random noise samples and show example fits in the bottom row of Fig. 1. This figure contains the means of the predicted curves over the 20 trials with the best parameter values (each trial contains a different sample of standard Gaussian random noise). We plot samples as opposed to the mean and the standard deviation as, due to the monotonicity constraint, samples are more informative than sample statistics. For the GP with monotonicity information we place the virtual points equidistantly in the range of the data and report the best RMSEs over the numbers of virtual points considered. For the transformed GP we report the best results over the choices of boundary conditions and the number of terms in the approximation. For both models we use a squared exponential kernel. Our method depends on the integration time, the kernel and the number of inducing points; for this experiment we consider two kernel options, squared exponential and ARD Matérn. The lowest RMSEs are achieved by the flow and the transformed GP. Overall, our method performs very competitively, achieving the best results on 3 functions and being within the standard deviation of the best result on all others. We note that the training data is very noisy (see Fig. 1); therefore, using prior monotonicity assumptions yields significantly improved results over a regular GP.

| Method | flat | sinusoidal | step | linear | exponential | logistic |
|---|---|---|---|---|---|---|
| Transformed GP Andersen:2018 (VI-full) | 18.5 ± 14.4 | 40.0 ± 17.5 | 101.9 ± 11.4 | 37.4 ± 22.8 | 52.9 ± 11.9 | 51.7 ± 19.6 |
| Monotonic Flow (ours) | 21.7 ± 15.0 | 39.1 ± 13.0 | 164.5 ± 10.7 | 30.8 ± 12.0 | 32.8 ± 17.9 | 43.2 ± 15.2 |

Table 2: Root-mean-square error ± SD of 20 trials for data sets of size 15
15 data points

In Table 2 and Fig. 1 (top row) we provide the comparison of the flow and the transformed GP in a setting where only 15 data points are available. Our fully non-parametric model is able to recover the structure in the data significantly better than the transformed GP, which usually reverts to a nearly linear fit on all functions. This might be explained by the fact that the transformed GP is a parametric approximation of a monotonic GP, and the more parameters included, the larger the variety of functions it can model. However, estimating a large (w.r.t. the dataset size) number of parameters is challenging given a small set of noisy observations. The monotonic flow tends to underestimate the value of the function on the left side of the domain and overestimate it on the right. The mean of our prior for the monotonic flow with a stationary flow GP kernel is the identity function, so given a small set of noisy observations, the predictive posterior mean quickly reverts to the prior distribution near the edges of the data interval.

Figure 1: Mean fits over 20 trials with different random noise as estimated by the flow and the transformed GP Andersen:2018 (the noise samples are identical for both methods; we plot the data from one trial). Panels: (a) linear, (b) exponential and (c) logistic functions with 15 data points; (d) linear, (e) exponential and (f) logistic functions with 100 data points.

Figure 2: Layers of 2-layer models fitted to a chirp function. From left to right we plot the overall fit to the noisy observations (black dots), the warp produced by the monotonic first layer, and the fit in the warped coordinates. The rows show the base model, the model with correlated inducing points, and the model with optimal inducing points.
Composite regression

Next, we consider the task of fitting a 2-layer model to noisy data from a chirp function. Imposing monotonicity constraints on the first layer allows us to warp the inputs to the second layer in such a way that the observations, as a function of the warped inputs, can be modelled using a standard stationary kernel. Fig. 2 shows the fitted function, the warps produced by the monotonically constrained first layer, and the fitted function in the warped coordinates (i.e. samples of the output GP plotted against flow samples). The base model (factorised variational distribution of inducing points) keeps most of the uncertainty in the second layer while the flow nearly collapses to a point estimate. Correlating the inducing points allows the model to distribute the uncertainty across the two layers, providing a hierarchical decomposition of the uncertainty. Using the optimal inducing points, however, provides a wide range of possible warps without compromising the overall quality of the fit to the observed data.

Alignment application

A monotonicity constraint in the first layer is desirable in mixed effects models where the first layer corresponds to a warping of space or time that does not allow permutations. We take a set of temporally misaligned sequences (of possibly different lengths) and want to recover the warps that produced these observed sequences. For a detailed description of this problem, see Kazlauskaite:2019 . In this example, the output GPs fit the observed sequences while the monotonic layers warp the input space such that re-sampling the output GPs at fixed, evenly spaced inputs gives temporally aligned sequences. The baseline Kazlauskaite:2019 uses MAP estimates and thus retains no uncertainty about the warps. Fig. 3 contrasts the point estimates of Kazlauskaite:2019 with our proposed 2-layer construction, which captures the uncertainty in the warps in the first layer while also fitting the observed data well.

Figure 3: Given noisy observations from temporally warped sequences, we compare the warps (first layer outputs) and the fits to the data for our 2-layer model and for Kazlauskaite:2019 .

7 Conclusion

We have proposed a novel non-parametric model of monotonic functions based on a random process with monotonic trajectories that confers improved performance over the state-of-the-art. Many real-life regression tasks deal with functions that are known to be monotonic, and explicitly imposing this constraint helps uncover the structure in the data, especially when the observations are noisy or data is scarce. We have also demonstrated that monotonicity constraints can be used to guard against degenerate solutions and improve the interpretability of multi-layer GP-based models. We believe this is an exciting avenue to pursue for future work and hope that this model will be a valuable tool in further research on uncertainty propagation in compositional models.

Acknowledgments

This work has been supported by EPSRC CDE (EP/L016540/1) and CAMERA (EP/M023281/1) grants as well as the Royal Society. The authors are grateful to Markus Kaiser and Garoe Dorta Perez for their insight and feedback on this work, and to Michael Andersen for sharing the implementation of Transformed GPs. IK would like to thank the Frostbite Physics team at EA.

References

  • (1) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Zh. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, Sh. Moore, D. Murray, Ch. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. Software available from tensorflow.org.
  • (2) M. R. Andersen, E. Siivola, G. Riutort-Mayol, and A. Vehtari. A non-parametric probabilistic model for monotonic functions. “All Of Bayesian Nonparametrics” Workshop at NeurIPS, 2018.
  • (3) B. Bornkamp and K. Ickstadt. Bayesian nonparametric estimation of continuous monotone functions with applications to dose-response analysis. Biometrics, 65(1):198–205, 2009.
  • (4) K. Canini, A. Cotter, M.R. Gupta, M. Milani Fard, and J. Pfeifer. Fast and flexible monotonic functions with ensembles of lattices. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
  • (5) R.T.Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • (6) S. McKay Curtis and S. K. Ghosh. A variable selection approach to monotonic regression with bernstein polynomials. Journal of Applied Statistics, 38(5):961–976, 2011.
  • (7) S. Da Veiga and A. Marrel. Gaussian process modeling with inequality constraints. Annales de la Faculté des sciences de Toulouse : Mathématiques, Ser. 6, 21(3):529–555, 2012.
  • (8) A. Damianou and N. Lawrence. Deep gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.
  • (9) H. Dette and R. Scheder. Strictly monotone and smooth nonparametric regression for two or more variables. Canadian Journal of Statistics, 34(4):535–561, 2006.
  • (10) C. Durot and H.P. Lopuhaä. Limit theory in monotone function estimation. Statistical Science, 33(4):547–567, 2018.
  • (11) A. Girard, C. E. Rasmussen, J. Q. Candela, and R. Murray-Smith. Gaussian process priors with uncertain inputs application to multiple-step ahead time series forecasting. In Advances in Neural Information Processing Systems 15. 2003.
  • (12) S. Golchi, D.R. Bingham, H. Chipman, and D.A. Campbell. Monotone emulation of computer experiments. SIAM-ASA Journal on Uncertainty Quantification, 3(1):370–392, 2015.
  • (13) P. Hall and L.-S. Huang. Nonparametric kernel regression subject to monotonicity constraints. Annals of Statistics, 29(3):624–647, 2001.
  • (14) J. Haslett and A. Parnell. A simple monotone process with application to radiocarbon-dated depth chronologies. Journal of the Royal Statistical Society. Series C, 57:399–418, 2008.
  • (15) M. Havasi, J. M. Hernández-Lobato, and J. J. Murillo-Fuentes. Inference in deep gaussian processes using stochastic gradient hamiltonian monte carlo. In Advances in Neural Information Processing Systems (NeurIPS). 2018.
  • (16) P. Hegde, M. Heinonen, H. Lähdesmäki, and S. Kaski. Deep learning with differential gaussian process flows. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • (17) M. Heinonen, C. Yildiz, H. Mannerström, J. Intosalmi, and H. Lähdesmäki. Learning unknown ode models with gaussian processes. In International Conference on Machine Learning (ICML), 2018.
  • (18) M. Kaiser, C. Otte, T. Runkler, and C. H. Ek. Bayesian alignments of warped multi-output gaussian processes. In Advances in Neural Information Processing Systems (NeurIPS). 2018.
  • (19) I. Kazlauskaite, C. H. Ek, and N.F.D. Campbell. Gaussian process latent variable alignment learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • (20) D.H. Kim, H. Ryu, and Y. Kim. Nonparametric bayesian modeling for monotonicity in catch ratio. Communications in Statistics: Simulation and Computation, 47(4):1056–1065, 2018.
  • (21) D. P Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on Representation Learning (ICLR), 2014.
  • (22) M. Lavine and A. Mockus. A nonparametric bayes method for isotonic regression. Journal of Statistical Planning and Inference, 46(2):235–248, 1995.
  • (23) P.J. Lenk and T. Choi. Bayesian analysis of shape-restricted functions using gaussian process priors. Statistica Sinica, 27(1):43–69, 2017.
  • (24) L. Lin and D.B. Dunson. Bayesian monotone regression using gaussian process projection. Biometrika, 101(2):303–317, 2014.
  • (25) A. F. Lopez-Lopera, ST John, and N. Durrande. Gaussian process modulated cox processes under linear inequality constraints. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • (26) M. Lorenzi, M. Filippone, G.B. Frisoni, D.C. Alexander, and S. Ourselin. Probabilistic disease progression modeling to characterize diagnostic uncertainty: Application to staging and prediction in alzheimer’s disease. NeuroImage, 190:56–68, 2019.
  • (27) H. Maatouk. Finite-dimensional approximation of gaussian processes with inequality constraints. arXiv:1706.02178, 2017.
  • (28) C. A. Nader, N. Ayache, P. Robert, and M. Lorenzi. Monotonic gaussian process for spatio-temporal trajectory separation in brain imaging data. arXiv:1902.10952, 2019.
  • (29) B. Øksendal. Stochastic Differential Equations (3rd Ed.): An Introduction with Applications. Springer-Verlag, 1992.
  • (30) L. L. Raket, B. Grimme, G. Schöner, Ch. Igel, and B. Markussen. Separating timing, movement conditions and individual differences in the analysis of human movement. PLOS Computational Biology, 12(9):1–27, 09 2016.
  • (31) J.O. Ramsay. Monotone regression splines in action. Statistical Science, 3(4):425–441, 1988.
  • (32) J.O. Ramsay. Estimating smooth monotone functions. Journal of the Royal Statistical Society. Series B, 60(2):365–375, 1998.
  • (33) C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. 2005.
  • (34) J Riihimäki and A Vehtari. Gaussian processes with monotonicity information. Journal of Machine Learning Research, 9:645–652, 2010.
  • (35) H. Salimbeni and M. Deisenroth. Doubly stochastic variational inference for deep gaussian processes. In Advances in Neural Information Processing Systems (NeurIPS). 2017.
  • (36) Th. S. Shively, Th. W. Sager, and S. G. Walker. A bayesian approach to nonparametric monotone function estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(1):159–175, 2009.
  • (37) E. Siivola, J. Piironen, and A. Vehtari. Automatic monotonicity detection for gaussian processes. arXiv:1610.05440, 2016.
  • (38) J. Sill and Y.S. Abu-Mostafa. Monotonicity hints. In Advances in Neural Information Processing Systems (NeurIPS), 1997.
  • (39) J. Snoek, K. Swersky, R. Zemel, and R.P. Adams. Input warping for bayesian optimization of non-stationary functions. In International Conference on Machine Learning (ICML), 2014.
  • (40) M. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
  • (41) G. Wahba. Improper priors, spline smoothing and the problem of guarding against model errors in regression. Journal of the Royal Statistical Society. Series B, 49, 07 1978.
  • (42) G. Wolberg and I. Alfy. An energy-minimization framework for monotonic cubic spline interpolation. Journal of Computational and Applied Mathematics, 143(2):145–188, 2002.
  • (43) C. Yildiz, M. Heinonen, J. Intosalmi, H. Mannerstrom, and H. Lahdesmaki. Learning stochastic differential equations with gaussian processes without gradient matching. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2018.
  • (44) C. Yildiz, M. Heinonen, and H. Lähdesmäki. A nonparametric spatio-temporal SDE model. In Spatiotemporal Workshop at NeurIPS, 2018.

Supplementary material

s.1 Implementation details

Our model is implemented in TensorFlow [1]. For the evaluations in Tables 1 and 2 we train for a fixed number of iterations with a learning rate that is reduced by a constant factor when the objective does not improve for a set number of iterations. For the numerical solutions of the SDE, we use the Euler-Maruyama solver with a fixed number of time steps, as proposed in [16].

Computational complexity

The computational complexity of drawing a sample from the monotonic flow model is the number of steps of the numerical SDE solver multiplied by the cost of computing the sparse GP posterior for the inputs given the inducing points plus the cost of drawing a joint sample from this posterior ($\mathcal{O}(NM^2)$ and $\mathcal{O}(N^3)$ respectively for $N$ inputs and $M$ inducing points). We typically draw only a small number of samples to limit the computational cost.

Non-Gaussian noise

The inference procedures for the monotonic flow and for the 2-layer model can be easily applied to arbitrary likelihoods, because they are based on stochastic variational inference and do not require the closed form integrals of the likelihood density.

s.2 Functions for evaluating the monotonic flow model

The functions we use for evaluations are the following:

  • flat function

  • sinusoidal function

  • step function

  • linear function

  • exponential function

  • logistic function

For the experiments shown in Fig. 3 we generate the data points at linearly spaced inputs.

s.3 Example flow paths

In Fig. S1 we show the flow paths corresponding to our monotonic model fitted on noisy data generated using the logistic function (Sec. S.2). The first three panels show different samples of the entire flow field for points inside and outside the data domain. The right-most panel shows paths sampled at 3 input points. The flow warps the equally spaced inputs to the corresponding outputs.

Figure S1: First 3 figures from the left show sample flow fields while the figure on the right shows samples of paths at 3 inputs.

s.4 Transformed GP in a 2-layer setting

We have considered using Transformed GPs [2] in a two-layer model. The first layer is modelled by the Transformed GP and is therefore constrained to be monotonic, and the second layer is a standard GP. This setting is analogous to the one explained in the main part of the paper. Results for two different numbers of terms in the approximation are shown in Fig. S2. We correlated the parameters of the monotonic GP approximation and the inducing points of the output GP to keep the uncertainty in the first layer, similar to modelling the correlation between the inducing points of the flow and the output GP in Sec. 5 in the main part of the paper.

Nevertheless, we observed that this model either fails to fit the chirp function well or finds a degenerate solution with almost no uncertainty in the first layer. We believe this might be due to poor local minima, which become harder to avoid as the number of parameters in the GP approximation increases. One way to avoid them would be careful initialisation; however, this is not trivial (whereas the monotonic flow is naturally initialised to the identity function when the flow vector field is close to zero).

Figure S2: A 2-layer model fitted using the Transformed GP [2] and an output GP (similar to the model defined in Sec. 5.2 in the main part of the paper). From left to right, we plot the fitted function, the first layer of the model, and the fitted function in the warped coordinates. The rows correspond to different numbers of terms in the approximation of the kernel.

s.5 Comparison to SGHMC

In a recent paper, Havasi et al. [15] discuss the issue of the intermediate layers in a DGP collapsing to a single mode (as discussed in the main part of the paper, we found that this is true for DSVI [35] as well as for MAP estimates [19]). Their Hamiltonian Monte Carlo-based stochastic inference scheme is able to estimate non-Gaussian posteriors, and can capture multi-modality and estimate uncertainty in the intermediate layers of a DGP. In order to compare the uncertainty estimates with our 2-layer setting, we fitted a 2-layer DGP model on data identical to that in Fig. 2 in the main part of the paper using the author-provided implementation. Fig. S3 shows the fitted model for different random initialisations of the SGHMC. We note that the first layer, referred to as the warp, is not constrained, unlike in our method. The output shown in the top row is comparable to the results produced by our method. However, the other outputs have a short lengthscale in both layers and thus seem to overfit. In these experiments we use the parameter values from the original paper (in particular, num posterior samples = 100, posterior sample spacing = 50) and reduce the learning rate while increasing the number of iterations accordingly.

Figure S3: A 2-layer DGP fitted using SGHMC [15]. From left to right, we plot the fitted function, the first layer of the model, and the fitted function in the warped coordinates. The rows correspond to random initialisations of SGHMC; the data is identical in all cases.