Deep neural nets are in widespread use across machine learning applications. They owe their unprecedented expressive power to the repeated application of a function that non-linearly transforms the input pattern. Furthermore, if the transformation is designed as a ResNet module [resnet], the processing pipeline can be viewed as an ODE system discretized across even time intervals [neural_ode]. Rephrasing this model in terms of a continuous-time ODE is referred to as a Neural ODE. While the generalization capabilities of Neural ODEs have been closely investigated [neural_ode], their success as a Bayesian inference building block remains unexplored. To investigate this question, we devise a generic Bayesian neural model that solves an SDE [oksendal] as an intermediate step to model the flow of the activation maps. Our method differs from earlier work in that we model the drift and diffusion functions of an SDE as Bayesian Neural Nets (BNNs), instead of the mean and covariance functions of a Gaussian Process (GP) posterior predictive [diffgp] or a vanilla neural net with a fixed dropout rate global to the synaptic connections [nsde]. Other attempts at coupling SDEs with neural networks consist of finding unknown parameters of otherwise known functions [sde_blackbox].
The contributions of our work are as follows: i) we build a Neural SDE by assigning two separate and potentially overlapping BNNs to the drift and diffusion terms, ii) we show how SGLD can naturally be used to infer the resulting model as an alternative to variational inference, and iii) we illustrate how crucial uncertainty-aware learning is for time series modeling with Neural ODEs.
2 Stochastic Differential Equations
An SDE can be expressed in the following generic form:
$$\mathrm{d} x_t = f(x_t, t)\,\mathrm{d} t + L(x_t, t)\,\mathrm{d} \beta_t.$$
The equation is governed by the drift $f(x_t, t)$, which models the deterministic dynamics, and the diffusion $L(x_t, t)$, which models the stochasticity in the system. Further, $\mathrm{d} t$ represents the time increment and $\beta_t$ is a Wiener process. There does not exist any closed-form solution to generic SDEs, hence numerical approximation techniques are employed. Possibly the most popular of these is the Euler–Maruyama discretization method, which suggests the following update rule:
$$x_{t + \Delta t} = x_t + f(x_t, t)\,\Delta t + L(x_t, t)\,\Delta \beta_t,$$
where $\Delta \beta_t \sim \mathcal{N}(0, \Delta t)$. The same approximation holds when the variable $x_t \in \mathbb{R}^D$ is a vector. In this case the diffusion term $L(x_t, t)$ is a matrix-valued function of the input and time, and the corresponding $\beta_t$ is modeled as $D$ independent Wiener processes, $\Delta \beta_t \sim \mathcal{N}(\mathbf{0}, \Delta t\, \mathbb{I}_D)$, where $\mathbb{I}_D$ is a $D$-dimensional identity matrix [oksendal].
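As a minimal sketch of the update rule above (the function names and the toy mean-reverting drift and constant diffusion are our own illustrative choices, not taken from the paper), Euler–Maruyama simulation of a scalar SDE looks as follows:

```python
import numpy as np

def euler_maruyama(f, L, x0, T, n_steps, rng):
    """Simulate one path of dx_t = f(x_t, t) dt + L(x_t, t) dbeta_t."""
    dt = T / n_steps
    xs = [x0]
    x, t = x0, 0.0
    for _ in range(n_steps):
        dbeta = rng.normal(0.0, np.sqrt(dt))   # Wiener increment ~ N(0, dt)
        x = x + f(x, t) * dt + L(x, t) * dbeta
        t += dt
        xs.append(x)
    return np.array(xs)

# Toy dynamics: the drift pulls x_t towards 1, with small constant diffusion.
rng = np.random.default_rng(0)
path = euler_maruyama(lambda x, t: 1.0 - x, lambda x, t: 0.1,
                      x0=0.0, T=5.0, n_steps=500, rng=rng)
```

Decreasing the step size $\Delta t = T/\texttt{n\_steps}$ trades computation for discretization accuracy.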
3 Differential Bayesian Neural Nets
Assume for brevity that we are given a supervised learning problem, i.e. we aim to find a mapping from inputs $x$ to outputs $y$. We pose the below probabilistic model:
$$\theta \sim p(\theta), \qquad \phi \sim p(\phi), \qquad h_0 \sim \delta(h_0 - x), \qquad y \sim p(y \mid h_T).$$
The last step above is a likelihood suitable to the learning setup, $\delta(\cdot)$ is a Dirac delta evaluated on the input observation $x$, and $T$ is some chosen value that represents the duration of the flow, namely the model capacity. The critical intermediate step of the model is the stochastic process on the continuous-time activation maps $h_t$, $t \in [0, T]$, which we refer to as the Differential Bayesian Neural Net (DBNN):
$$h_T = h_0 + \int_0^T f_\theta(h_t, t)\,\mathrm{d} t + \int_0^T L_\phi(h_t, t)\,\mathrm{d} \beta_t,$$
where $\theta$ and $\phi$ are the synaptic weights of BNNs $f_\theta(\cdot, \cdot): \mathbb{R}^D \times \mathbb{R} \rightarrow \mathbb{R}^D$ on the drift vector and $L_\phi(\cdot, \cdot): \mathbb{R}^D \times \mathbb{R} \rightarrow \mathbb{R}^{D \times R}$ on the diffusion matrix, respectively, for some rank $R \le D$. The distributions $p(\theta)$ and $p(\phi)$ are priors on the BNN weights, hence their properties are known or designable a priori. The process $\beta_t$ is Brownian motion implied by a Wiener process with, without loss of generality, zero mean and unit covariance, and the related operation around $L_\phi$ is the Itô integral [oksendal]. Note that these BNNs may have shared weights, i.e. $\theta \cap \phi \neq \emptyset$. The dynamics of the resultant stochastic process are given by the below stochastic differential equation:
$$\mathrm{d} h_t = f_\theta(h_t, t)\,\mathrm{d} t + L_\phi(h_t, t)\,\mathrm{d} \beta_t.$$
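To make the construction concrete, the following hedged sketch (the two-layer nets, the standard-normal weight prior, and all dimensions are our own illustrative choices, not the paper's architecture) draws one weight sample per BNN and unrolls Euler–Maruyama steps of the resulting SDE:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, R = 3, 16, 3        # state dim, hidden width, diffusion rank (illustrative)

def sample_weights(shapes, rng):
    """Draw one weight sample from a standard-normal prior."""
    return [rng.normal(0.0, 1.0, size=s) for s in shapes]

def mlp(x, weights):
    """A minimal two-layer net standing in for the drift/diffusion BNNs."""
    W1, W2 = weights
    return np.tanh(x @ W1) @ W2

# One weight sample each for the drift net and the diffusion net.
theta = sample_weights([(D + 1, H), (H, D)], rng)
phi = sample_weights([(D + 1, H), (H, D * R)], rng)

def em_step(h, t, dt, rng):
    """One Euler-Maruyama step of dh = f(h, t) dt + L(h, t) dbeta."""
    inp = np.concatenate([h, [t]])          # condition both nets on time
    drift = mlp(inp, theta)                 # drift vector in R^D
    diff = mlp(inp, phi).reshape(D, R)      # diffusion matrix in R^{D x R}
    dbeta = rng.normal(0.0, np.sqrt(dt), size=R)
    return h + drift * dt + diff @ dbeta

h = np.zeros(D)
for k in range(100):
    h = em_step(h, k * 0.01, 0.01, rng)
```

Sharing layers between the two nets would correspond to the overlapping-weights case mentioned above.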
The process $h_t$ does not have a closed-form solution and, for a general neural net architecture, does not even have an expressible density function. However, it is possible to take approximate samples from it by a discretization rule such as Euler–Maruyama. As a work-around, we first marginalize the stochastic process out of the likelihood by Monte Carlo integration:
$$p(y \mid x, \theta, \phi) \approx \frac{1}{S} \sum_{s=1}^{S} p\big(y \mid h_T^{(s)}\big),$$
where $h_T^{(s)}$ is the realization at the final time point $T$ of the $s$-th Euler–Maruyama draw. This approximation appears in the literature as the simulated likelihood method [applied_sde]. Having integrated out the stochastic process, the rest is a plain approximate posterior inference problem on $\theta$ and $\phi$. The sample-driven solution to the stochastic process
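As an illustrative sketch of the simulated likelihood (the function names, the toy drift and diffusion, and the Gaussian observation model are our assumptions, not fixed by the paper), the estimate averages the observation density over terminal-state draws:

```python
import numpy as np

def em_terminal(x0, f, L, T, n_steps, rng):
    """Terminal state of one Euler-Maruyama draw of
    dh = f(h, t) dt + L(h, t) dbeta, started at h_0 = x0."""
    dt = T / n_steps
    h, t = x0, 0.0
    for _ in range(n_steps):
        h = h + f(h, t) * dt + L(h, t) * rng.normal(0.0, np.sqrt(dt))
        t += dt
    return h

def simulated_likelihood(y, x0, f, L, T, n_steps, n_draws, obs_std, rng):
    """Monte Carlo estimate of p(y | x0) as the average of p(y | h_T^(s))
    over S terminal-state draws, with a Gaussian observation model."""
    draws = np.array([em_terminal(x0, f, L, T, n_steps, rng)
                      for _ in range(n_draws)])
    dens = (np.exp(-0.5 * ((y - draws) / obs_std) ** 2)
            / (obs_std * np.sqrt(2.0 * np.pi)))
    return dens.mean()
```

With mean-reverting dynamics pulling the state towards 1, the estimate assigns far higher likelihood to an observation near 1 than to one far from it.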
integrates naturally into a Markov Chain Monte Carlo (MCMC) scheme. We choose Stochastic Gradient Langevin Dynamics (SGLD) [sgld] with a block decay structure [psgld] to benefit from the loss gradient. Our training scheme is detailed in Algorithm 1.
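As a hedged sketch of the underlying sampler (plain SGLD, not the block-decay variant used in Algorithm 1), a single update takes half a gradient step on the log posterior and injects Gaussian noise whose variance equals the step size:

```python
import numpy as np

def sgld_step(theta, grad_log_post, step_size, rng):
    """One SGLD update: half a gradient step on the log posterior
    plus injected Gaussian noise of variance step_size."""
    noise = rng.normal(0.0, np.sqrt(step_size), size=theta.shape)
    return theta + 0.5 * step_size * grad_log_post(theta) + noise

# Toy check: sampling a standard-normal posterior, grad log p(theta) = -theta.
rng = np.random.default_rng(0)
theta = np.zeros(1)
samples = []
for _ in range(50_000):
    theta = sgld_step(theta, lambda th: -th, 0.01, rng)
    samples.append(theta[0])
```

In practice the gradient is a minibatch estimate and the step size is annealed, which is what makes the chain asymptotically sample from the posterior.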
We compare our method DBNN against Neural SDE (N-SDE) [nsde], its closest and most recent relative, which applies a fixed-rate dropout on the neural net of the diffusion matrix and uses RMSE as the loss function. We thus evaluate how making both the drift and diffusion neural nets fully Bayesian and using a modified variant of SGLD for posterior weight inference improves the results.
Time series modeling.
In the first experiment, one draw of the Vasicek model [vasicek] with equally spaced observations is given. We specify the model as the mean-reverting SDE $\mathrm{d} x_t = \kappa (\mu - x_t)\,\mathrm{d} t + \sigma\,\mathrm{d} \beta_t$. It starts from the initial point $(0, 0)$, converges to $1$, and then oscillates around it. Figure 1(a) plots the results. Our method is capable of modeling the underlying dynamics and reflects the noisy nature of the data. In contrast, the N-SDE approach results in excessively smooth predictions and uncalibrated uncertainty scores. Figure 1(b)
shows results for non-equally spaced data. The ground truth is the centered sigmoid function, from which 20 noisy observations have been sampled. DBNN is capable of representing the predictive uncertainty and shows increasing uncertainty in the interpolation and extrapolation areas. Although N-SDE learns an accurate predictive mean, its uncertainty scores show little to no correlation with the observed data. In both experiments we observed that N-SDE did not converge for some dropout rates. Additionally, we found N-SDE to be sensitive to the choice of time dependence for the drift and diffusion. Our results demonstrate the necessity of properly accounting for uncertainty during training, as we do in DBNN, in order to obtain well-calibrated predictive uncertainty.
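The ground-truth process of the first experiment can be reproduced in a few lines; the concrete parameter values below are illustrative choices of ours, set only so that the process starts at the origin and reverts to the mean level 1, not the values used in the paper:

```python
import numpy as np

# Illustrative Vasicek parameters: reversion rate, mean level, volatility.
kappa, mu, sigma = 1.0, 1.0, 0.2
dt, n_steps, n_paths = 0.01, 500, 2000

rng = np.random.default_rng(0)
x = np.zeros(n_paths)                            # all paths start at the origin
for k in range(n_steps):
    dbeta = rng.normal(0.0, np.sqrt(dt), size=n_paths)
    x += kappa * (mu - x) * dt + sigma * dbeta   # Euler-Maruyama step

# Empirical predictive moments at the terminal time t = 5.
term_mean, term_std = x.mean(), x.std()
```

The sample mean approaches $\mu = 1$ while the spread settles near the stationary value $\sigma / \sqrt{2\kappa}$, which is exactly the kind of oscillation around the mean that a well-calibrated model should capture.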
For regression, we place an additional linear layer $y = W h_T + b$ above $h_T$ in order to match the output dimensionality. Since we can estimate the properties of the distribution of $h_T$, with mean $m$ and covariance $\Sigma$, we propagate both moments through the linear layer. The predictive mean is thus modeled as $W m + b$ and the predictive variance as $W \Sigma W^\top$. It is possible to design the diffusion matrix as diagonal, assuming uncorrelated activation map dimensions. Further, it can be parameterized by assigning the DBNN output to its Cholesky decomposition, or it can take any other structure of the form $L L^\top$ with $L \in \mathbb{R}^{D \times R}$. When choosing $R < D$, it is possible to heavily reduce the number of learnable parameters. Table 1 shows results for the UCI benchmark datasets. We use the experiment setup (network architecture and train/test splitting schemes) defined in [pbp]. Further, we choose the hyperparameters of N-SDE as in [dropout]. DBNN brings either an improved or a competitive fit on test data in all data sets. Modeling correlated noise also improves the results in most data sets.
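The moment propagation through the top linear layer can be sketched as follows; the Gaussian stand-in for the terminal-state draws, the dimensions, and all names are illustrative assumptions of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
D, S = 4, 1000                              # activation dim, number of draws
W = rng.normal(size=(1, D))                 # top linear layer, scalar output
b = np.zeros(1)

# Stand-in for S Euler-Maruyama draws of the terminal state h_T.
h_T = rng.normal(loc=1.0, scale=0.3, size=(S, D))

m = h_T.mean(axis=0)                        # sample mean of h_T
Sigma = np.cov(h_T, rowvar=False)           # sample covariance of h_T

pred_mean = W @ m + b                       # E[W h_T + b]
pred_var = W @ Sigma @ W.T                  # Cov[W h_T + b]
```

Because the layer is linear, both moments propagate exactly; restricting $\Sigma$ to a diagonal or low-rank $L L^\top$ structure only changes how the covariance is parameterized, not this propagation rule.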
We extend Neural ODEs to a fully Bayesian setting. The model flows an input observation through stochastic dynamics, where both the drift and the diffusion follow a BNN. The posterior on the BNN weights is approximated by a modified variant of SGLD. The resultant model, called DBNN, outperforms the recent N-SDE in a number of time series prediction and regression tasks. Our model benefits from the natural flexibility of using a variety of possible network designs for the drift and diffusion functions, as long as the input and output dimensions remain the same. Thus the model is easily extendable to other tasks, such as image segmentation or reinforcement learning.