Differential Bayesian Neural Nets

12/02/2019
by Andreas Look, et al.
Bosch

Neural Ordinary Differential Equations (N-ODEs) are a powerful building block for learning systems, which extend residual networks to a continuous-time dynamical system. We propose a Bayesian version of N-ODEs that enables well-calibrated quantification of prediction uncertainty, while maintaining the expressive power of their deterministic counterpart. We assign Bayesian Neural Nets (BNNs) to both the drift and the diffusion terms of a Stochastic Differential Equation (SDE) that models the flow of the activation map in time. We infer the posterior on the BNN weights using a straightforward adaptation of Stochastic Gradient Langevin Dynamics (SGLD). We illustrate significantly improved stability on two synthetic time series prediction tasks and report better model fit on UCI regression benchmarks with our method when compared to its non-Bayesian counterpart.


1 Introduction

Deep neural nets are in widespread use across machine learning applications. They owe their unprecedented expressive power to the repeated application of a function that non-linearly transforms the input pattern. Furthermore, if the transformation is designed to be a ResNet module [resnet], the processing pipeline can be viewed as an ODE system discretized across evenly spaced time intervals [neural_ode]. Rephrasing this model in terms of a continuous-time ODE is referred to as a Neural ODE. While the generalization capabilities of Neural ODEs have been closely investigated by [neural_ode], their success as a Bayesian inference building block remains unexplored. To explore this question, we devise a generic Bayesian neural model that solves an SDE [oksendal] as an intermediate step to model the flow of the activation maps. Our method differs from earlier work in that we model the drift and diffusion functions of an SDE as Bayesian Neural Nets (BNNs), instead of the mean and covariance functions of a Gaussian Process (GP) posterior predictive [diffgp] or a vanilla neural net with a fixed dropout rate global to the synaptic connections [nsde]. Other attempts to couple SDEs with neural networks consist of finding unknown parameters of otherwise known functions [sde_blackbox].

The contributions of our work are as follows: i) we build a Neural SDE by assigning two separate and potentially overlapping BNNs to the drift and diffusion terms, ii) we show how SGLD can naturally be used to infer the resulting model as an alternative to variational inference, and iii) we illustrate how crucial uncertainty-aware learning is for time series modeling with Neural ODEs.

2 Stochastic Differential Equations

An SDE can be expressed in the following generic form:

$$dh_t = f(h_t, t)\,dt + L(h_t, t)\,dW_t. \qquad (1)$$

The equation is governed by the drift $f(h_t, t)$, which models the deterministic dynamics, and the diffusion $L(h_t, t)$, which models the stochasticity in the system. Further, $dt$ represents the time increment and $W_t$ is a Wiener process. There does not exist a closed-form solution to generic SDEs, hence numerical approximation techniques are employed. Possibly the most popular of these is the Euler-Maruyama discretization method, which suggests the following update rule:

$$h_{t+\Delta t} = h_t + f(h_t, t)\,\Delta t + L(h_t, t)\,\Delta W_t, \qquad (2)$$

where $\Delta W_t \sim \mathcal{N}(0, \Delta t)$. The same approximation holds when the variable $h_t \in \mathbb{R}^D$ is a vector. In this case the diffusion term $L(h_t, t) \in \mathbb{R}^{D \times P}$ is a matrix-valued function of the input and time, and the corresponding $W_t$ is modeled as $P$ independent Wiener processes with $dW_t \sim \mathcal{N}(0, I_P\,dt)$, where $I_P$ is the $P$-dimensional identity matrix [oksendal].
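As a concrete illustration of Eq. (2), the following is a minimal NumPy sketch of the Euler-Maruyama scheme; the drift and diffusion functions, step size, and horizon are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def euler_maruyama(drift, diffusion, h0, T, dt, seed=0):
    """Simulate one approximate path of dh_t = f(h_t, t) dt + L(h_t, t) dW_t (Eq. 2)."""
    rng = np.random.default_rng(seed)
    h = np.asarray(h0, dtype=float)            # current state h_t, shape (D,)
    path = [h.copy()]
    for t in np.arange(0.0, T, dt):
        L = diffusion(h, t)                    # matrix-valued diffusion, shape (D, P)
        dW = rng.normal(0.0, np.sqrt(dt), size=L.shape[1])  # P independent Wiener increments
        h = h + drift(h, t) * dt + L @ dW
        path.append(h.copy())
    return np.stack(path)

# Illustrative placeholders: a 2-D linear drift and constant diffusion matrix.
f = lambda h, t: -h                            # drift f(h_t, t)
L = lambda h, t: 0.1 * np.eye(2)               # diffusion L(h_t, t)
path = euler_maruyama(f, L, h0=[1.0, -1.0], T=1.0, dt=0.01)
```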

3 Differential Bayesian Neural Nets

Assume for brevity that we are given a supervised learning problem, i.e. we aim to find a mapping from inputs $x$ to outputs $y$. We pose the probabilistic model below:

$$\theta_f \sim p(\theta_f), \qquad \theta_g \sim p(\theta_g),$$
$$h_0 \mid x \sim \delta(h_0 - x),$$
$$h_T \mid h_0, \theta_f, \theta_g \sim \mathrm{DBNN}(h_0, T; \theta_f, \theta_g),$$
$$y \mid h_T \sim p(y \mid h_T).$$

Figure 1: Illustration of our algorithm. First an input $x$ is passed through the DBNN. The resulting distribution over $h_T$ is then used to calculate the likelihood $p(y \mid h_T)$.

The last step above is a likelihood suitable to the learning setup, the first is a Dirac delta evaluated on the input observation, and $T$ is some chosen horizon that represents the duration of the flow, namely the model capacity. The critical intermediate step of the model is the stochastic process on the continuous-time activation maps $h_t$, which we refer to as the Differential Bayesian Neural Net (DBNN):

$$h_T = h_0 + \int_0^T f_{\theta_f}(h_t, t)\,dt + \int_0^T L_{\theta_g}(h_t, t)\,dW_t,$$

where $\theta_f$ and $\theta_g$ are the synaptic weights of BNNs on the drift vector $f_{\theta_f}(h_t, t) \in \mathbb{R}^D$ and on the diffusion matrix $L_{\theta_g}(h_t, t) \in \mathbb{R}^{D \times P}$, respectively, for some rank $P \le D$. The distributions $p(\theta_f)$ and $p(\theta_g)$ are priors on the BNN weights, hence their properties are known or designable a priori. The process $W_t$ is Brownian motion implied by a Wiener process with zero mean and unit covariance without loss of generality, and the related operation around $L_{\theta_g}$ is the Itô integral [oksendal]. Note that these BNNs may have shared weights, i.e. $\theta_f \cap \theta_g \neq \emptyset$. The dynamics of the resultant stochastic process are given by the stochastic differential equation

$$dh_t = f_{\theta_f}(h_t, t)\,dt + L_{\theta_g}(h_t, t)\,dW_t.$$
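To make this concrete, the following PyTorch sketch defines a drift net and a diffusion net whose weights are drawn from an isotropic Gaussian prior, and integrates the resulting SDE with Euler-Maruyama. The architecture, prior scale, and dimensions are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

D, P, T, dt = 4, 2, 1.0, 0.05        # assumed state dim, diffusion rank, flow time, step size

def make_mlp(out_dim):
    # Small MLP on the concatenated state and time; the architecture is an illustrative choice.
    return nn.Sequential(nn.Linear(D + 1, 32), nn.Tanh(), nn.Linear(32, out_dim))

drift_net = make_mlp(D)              # f_{theta_f}: (h_t, t) -> R^D
diff_net = make_mlp(D * P)           # L_{theta_g}: (h_t, t) -> R^{D x P}

def sample_weights(net, prior_std=0.1):
    # Draw one realization of the BNN weights from an isotropic Gaussian prior.
    with torch.no_grad():
        for p in net.parameters():
            p.copy_(torch.randn_like(p) * prior_std)

def dbnn_draw(h0):
    # One approximate draw of h_T: sample weights, then integrate with Euler-Maruyama.
    sample_weights(drift_net)
    sample_weights(diff_net)
    h = h0
    for t in torch.arange(0.0, T, dt):
        inp = torch.cat([h, torch.full((h.shape[0], 1), float(t))], dim=-1)
        f = drift_net(inp)                              # drift vector, shape (B, D)
        L = diff_net(inp).view(-1, D, P)                # diffusion matrix, shape (B, D, P)
        dW = torch.randn(h.shape[0], P, 1) * dt ** 0.5  # Wiener increments
        h = h + f * dt + (L @ dW).squeeze(-1)
    return h

h_T = dbnn_draw(torch.zeros(8, D))   # flow a batch of 8 inputs to time T
```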

The process $h_t$ does not have a closed-form solution and, for general neural net architectures, not even an expressible density function. However, it is possible to draw approximate samples from it with a discretization rule such as Euler-Maruyama. As a work-around, we therefore first marginalize the stochastic process out of the likelihood by Monte Carlo integration,

$$p(y \mid x, \theta_f, \theta_g) \approx \frac{1}{S} \sum_{s=1}^{S} p\big(y \mid h_T^{(s)}\big),$$

where $h_T^{(s)}$ is the $T$-th time point realization of the $s$-th Euler-Maruyama draw. This approximation appears in the literature as the simulated likelihood method [applied_sde]. Having integrated out the stochastic process, the rest is a plain approximate posterior inference problem on $(\theta_f, \theta_g)$. The sample-driven solution to the stochastic process integrates naturally into a Markov Chain Monte Carlo (MCMC) scheme. We choose Stochastic Gradient Langevin Dynamics (SGLD) [sgld] with a block decay structure [psgld] to benefit from the loss gradient. Our training scheme is detailed in Algorithm 1.

  Inputs: Initial weights θ = {θ_f, θ_g}, Decay rate γ, Flow time T, Minibatch size M, Iteration count K
  Outputs: BNN weights θ
  for k = 1, ..., K do
     Sample minibatch B of size M
     for (x, y) in B do
        h_0 ← x
        for s = 1, ..., S do
           for t = 0, Δt, ..., T − Δt do
              h_{t+Δt}^(s) ← h_t^(s) + f_{θ_f}(h_t^(s), t) Δt + L_{θ_g}(h_t^(s), t) ΔW_t
           end for
        end for
        Accumulate the simulated likelihood p(y | x, θ) from the draws h_T^(1), ..., h_T^(S)
     end for
     Update θ by an SGLD step on the minibatch log posterior, with step size decayed by γ
     if the burn-in phase is over then
        Store θ as a posterior sample
     end if
  end for
Algorithm 1: DBNN Inference
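Below is a minimal sketch of the SGLD weight update that Algorithm 1 relies on. The step-size schedule, prior scale, and dummy loss are placeholders, and the block decay structure of [psgld] is omitted for brevity; this is not the paper's exact implementation.

```python
import torch

def sgld_step(params, neg_log_lik, step_size, prior_std=1.0, dataset_size=1, batch_size=1):
    """One SGLD update on the approximate log posterior of the BNN weights.

    neg_log_lik is the minibatch negative log simulated likelihood; its gradient is
    rescaled by dataset_size / batch_size, the prior is taken as N(0, prior_std^2 I),
    and Gaussian noise with variance step_size is injected.
    """
    grads = torch.autograd.grad((dataset_size / batch_size) * neg_log_lik, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            grad_log_post = -g - p / prior_std ** 2                # d/dp [log lik + log prior]
            p.add_(0.5 * step_size * grad_log_post
                   + torch.randn_like(p) * step_size ** 0.5)       # Langevin noise

# Tiny illustrative usage with a dummy quadratic "negative log-likelihood".
theta = torch.zeros(3, requires_grad=True)
for k in range(100):
    nll = ((theta - 1.0) ** 2).sum()
    sgld_step([theta], nll, step_size=1e-2, dataset_size=100, batch_size=10)
```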

4 Experiments

We compare our method DBNN against the Neural SDE (N-SDE) [nsde], its closest and most recent relative, which applies fixed-rate dropout on the neural net of the diffusion matrix and uses RMSE as the loss function. Hence, we evaluate how making both the drift and diffusion neural nets fully Bayesian and using a modified variant of SGLD for posterior weight inference improves the results.

Figure 2: Time series prediction results for DBNN and N-SDE with fixed dropout diffusion [nsde]. (a) Prediction based on noisy and equally spaced observations; the underlying ground truth function is the stochastic Vasicek model. (b) Prediction based on noisy and randomly distributed observations; the underlying ground truth function is the centered sigmoid function.

Time series modeling.

In the first experiment, one draw of the Vasicek model [vasicek] with equally spaced observations is given. We specify the model as a Vasicek SDE of the form $dh_t = \kappa(\mu - h_t)\,dt + \sigma\,dW_t$; the draw starts from the initial point (0, 0), converges to the long-term mean 1, and then oscillates around it. Figure 2a plots the results. Our method is capable of modeling the underlying dynamics and reflects the noisy nature of the data. In contrast, the N-SDE approach results in excessively smooth predictions and uncalibrated uncertainty scores. Figure 2b shows results for non-equally spaced data. The ground truth is the centered sigmoid function, from which 20 noisy observations have been sampled. DBNN is capable of representing the predictive uncertainty and shows increasing uncertainty in the interpolation and extrapolation areas. Although N-SDE learns an accurate predictive mean, its uncertainty scores show little to no correlation with the observed data. In both experiments we observed that N-SDE did not converge for certain dropout rates. Additionally, we found N-SDE to be sensitive to the choice of time dependence for drift and diffusion. Our results demonstrate the necessity of properly accounting for uncertainty during training, as we do in DBNN, in order to obtain well-calibrated predictive uncertainty.
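For reference, a short NumPy sketch of drawing a Vasicek-type path with Euler-Maruyama is given below; the parameter values are illustrative guesses, not the ones used to generate the experiment's ground truth.

```python
import numpy as np

rng = np.random.default_rng(0)
kappa, mu, sigma, dt, n_steps = 1.0, 1.0, 0.1, 0.05, 200   # illustrative parameters

h = np.zeros(n_steps + 1)            # path starts at the initial point (0, 0)
for i in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt))
    h[i + 1] = h[i] + kappa * (mu - h[i]) * dt + sigma * dW
# The path converges towards the long-term mean mu = 1 and oscillates around it.
```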

Regression.

For regression, we place an additional linear layer above $h_T$ in order to match the output dimensionality. Since we can estimate the properties of the distribution $p(h_T)$, with mean $m_T$ and covariance $\Sigma_T$, we propagate both moments through the linear layer. The predictive mean is thus modeled as $A m_T + b$ and the predictive variance as $A \Sigma_T A^\top$, with $A$ and $b$ the weights and bias of the linear layer (see the sketch below). It is possible to design $\Sigma_T$ as a diagonal matrix, assuming uncorrelated activation map dimensions. Further, $\Sigma_T$ can be parameterized by assigning the DBNN output to its Cholesky factor, or it can take any other structure of the form $LL^\top$ with $L \in \mathbb{R}^{D \times P}$. When choosing $P < D$, it is possible to heavily reduce the number of learnable parameters. Table 1 shows results for the UCI benchmark datasets. We use the experiment setup (network architecture and train/test splitting schemes) defined in [pbp]. Further, we choose the hyperparameters of N-SDE as in [dropout]. DBNN brings either improved or competitive fit on test data in all data sets. Modeling correlated noise also improves the results in most data sets.
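The moment propagation through the final linear layer can be sketched as follows; the shapes, the placeholder draws of $h_T$, and the diagonal alternative noted in the comments are illustrative assumptions consistent with the description above, not the exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, out_dim, S = 4, 1, 100                      # assumed state dim, output dim, number of draws

# Placeholder Euler-Maruyama draws of h_T (S samples of dimension D).
h_T_samples = rng.normal(size=(S, D))
m_T = h_T_samples.mean(axis=0)                 # estimated mean of p(h_T)
Sigma_T = np.cov(h_T_samples, rowvar=False)    # estimated covariance of p(h_T)
# A diagonal alternative assumes uncorrelated activation map dimensions:
# Sigma_T = np.diag(h_T_samples.var(axis=0))

# Linear output layer y = A h_T + b; propagate both moments through it.
A = rng.normal(size=(out_dim, D))
b = np.zeros(out_dim)
pred_mean = A @ m_T + b                        # predictive mean
pred_var = A @ Sigma_T @ A.T                   # predictive variance (before observation noise)
```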

|  | boston | energy | concrete | wine_red | kin8mn | power | naval | protein |
|---|---|---|---|---|---|---|---|---|
| N | 506 | 768 | 1,030 | 1,599 | 8,192 | 9,568 | 11,934 | 45,730 |
| d | 13 | 8 | 8 | 22 | 8 | 4 | 26 | 9 |
| PBP [pbp] | -2.57(0.09) | -2.04(0.02) | -3.16(0.02) | -0.97(0.01) | 0.90(0.01) | -2.84(0.01) | 3.73(0.01) | -2.97(0.00) |
| Dropout [dropout] | -2.46(0.06) | -1.99(0.02) | -3.04(0.02) | -0.93(0.01) | 0.95(0.01) | -2.80(0.01) | 3.80(0.01) | -2.89(0.00) |
| N-SDE [nsde] (Dropout) | -2.48(0.03) | -1.35(0.01) | -3.05(0.03) | -0.97(0.01) | 0.94(0.02) | -2.82(0.01) | 3.83(0.03) | -2.89(0.00) |
| DBNN (Diagonal) | -2.47(0.04) | -1.60(0.09) | -3.05(0.03) | -0.93(0.02) | 1.06(0.01) | -2.81(0.01) | 2.78(0.00) | -2.85(0.01) |
| DBNN (Cholesky) | -2.45(0.03) | -1.22(0.05) | -3.05(0.03) | -0.92(0.02) | 1.08(0.01) | -2.80(0.00) | 2.97(0.09) | -2.81(0.00) |

Table 1: Test log likelihood values on 8 UCI benchmark datasets (standard errors in parentheses); N is the number of instances and d the input dimensionality.

5 Conclusion

We extend Neural ODEs to a fully Bayesian setting. The model flows an input observation through stochastic dynamics, where both the drift and the diffusion follow a BNN. The posterior on the BNN weights is approximated by a modified variant of SGLD. The resultant model, called DBNN, outperforms the recent N-SDE in a number of time series prediction and regression tasks. Our model benefits from the natural flexibility of using a variety of possible network designs for the drift and diffusion nets, as long as the input and output dimensions remain the same. Thus the model is easily extendable to other tasks, such as image segmentation or reinforcement learning.

References