1 Introduction
BNNs combine powerful function approximators with the ability to model uncertainty, making them useful in domains where (i) training data is expensive or limited, or (ii) inaccurate predictions are prohibitively costly and decision-making must be informed by our level of confidence (MacKay, 1995; Neal, 1995). Domain experts often have prior knowledge about the modeled function, and the ability to encode such information on top of training data can thus improve performance. However, BNNs define prior distributions over parameters, whose high dimensionality and lack of interpretability make the incorporation of functional beliefs close to impossible.
We present an interpretable approach for incorporating prior functional information into BNNs in the form of constraints, while staying consistent with the Bayesian framework. We then apply our method to two domains where the ability to encode such constraints is crucial: (i) prediction of clinical actions in health care, where constraints prevent unsafe actions for certain physiological inputs, and (ii) human motion prediction, where joint positions are constrained by anatomically feasible ranges.
Our contributions are: (a) we introduce constraint priors, capable of incorporating both negative constraints (where the function cannot be) and positive constraints (where the function should be), applicable with any black-box inference algorithm normally used with BNNs, and (b) we demonstrate the application of constraint priors with a variety of suitable inference methods on toy problems as well as two large and high-dimensional real-world data sets.
2 Related Work
Most closely related to our work, Lorenzi & Filippone (2018) considered function-space equality and inequality constraints of deep probabilistic models. However, they focused on deep Gaussian processes (DGPs) rather than BNNs, and on low-dimensional data from simulated ODE systems, whereas we consider high-dimensional real-world settings. They also do not consider classification settings.
Hafner et al. (2018) specify a Gaussian function prior with the goal of preventing overconfident BNN predictions out-of-distribution. In contrast, we use "positive constraints" to guide the function where it should be. Also related are functional BNNs by Sun et al. (2019), where variational inference is performed in function space using a stochastic process model. Their view is more general and, accordingly, more complex to optimize, while we focus on constraints in specific regions of the input-output space.
3 Background
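A conventional BNN, operating in the function (or input-output) space $\mathcal{X} \times \mathcal{Y}$, typically has a prior over parameters $p(\mathbf{W})$, where $\mathbf{W}$ are the neural network weights and biases. Given data $\mathcal{D}$, we perform inference to obtain the posterior $p(\mathbf{W} \mid \mathcal{D})$. The posterior predictive for the output $y'$ for some new input $x'$ is obtained by integrating over the posterior distribution of $\mathbf{W}$:
$$p(y' \mid x', \mathcal{D}) = \int p(y' \mid x', \mathbf{W})\, p(\mathbf{W} \mid \mathcal{D})\, d\mathbf{W} \tag{1}$$
The space of $\mathbf{W}$ is high-dimensional and the relationship between the weights and the function is non-intuitive. As such, the prior $p(\mathbf{W})$ is often trivially chosen as an isotropic Gaussian:
$$p(\mathbf{W}) = \mathcal{N}(\mathbf{W};\, \mathbf{0},\, \sigma^2 \mathbf{I}) \tag{2}$$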
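In practice, the integral in (1) is usually approximated by averaging the likelihood over posterior samples of the weights. The following is a minimal sketch of that Monte Carlo average together with the isotropic Gaussian log-prior of (2). The one-weight `forward` function, the Gaussian observation noise `noise_std`, and all function names are illustrative assumptions and are not taken from the paper.

```python
import math

def gaussian_pdf(y, mean, std):
    """Density of N(mean, std^2) evaluated at y."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def forward(w, x):
    """Hypothetical stand-in for the BNN forward pass (a single weight)."""
    return w * x

def posterior_predictive(y_new, x_new, weight_samples, noise_std=1.0):
    """Monte Carlo estimate of p(y'|x', D): average of p(y'|x', W)
    over samples W drawn from the posterior p(W|D)."""
    densities = [gaussian_pdf(y_new, forward(w, x_new), noise_std)
                 for w in weight_samples]
    return sum(densities) / len(densities)

def log_isotropic_gaussian_prior(weights, sigma=1.0):
    """log N(W; 0, sigma^2 I) for a flat list of weights."""
    n = len(weights)
    squared_norm = sum(w * w for w in weights)
    return (-0.5 * squared_norm / sigma ** 2
            - n * math.log(sigma)
            - 0.5 * n * math.log(2.0 * math.pi))
```

With a real BNN, `weight_samples` would come from an inference method such as HMC or SVGD, and `forward` would be the network itself.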
4 Output-Constrained BNNs
We consider two kinds of "expert knowledge": positive constraints define regions where a function should be, and negative constraints define regions where a function cannot be. This delineation is not arbitrary: the level of prior knowledge (strongly vs. weakly informative) and the task (regression or classification) may suggest the use of different prior constraints.
Defining constrained regions. Formally, a positive constrained region $\mathcal{C}^+$ is a set of input-output tuples $(x, y)$ defining where outputs given certain inputs should be. Conversely, a negative constrained region $\mathcal{C}^-$ is a set of tuples $(x, y)$ defining where outputs given certain inputs cannot be. We will use $\mathcal{C}$ when describing properties of constrained regions of both kinds and denote by $\mathcal{C}_x$ the inputs $x$ for all $(x, y)$ in $\mathcal{C}$ and by $\mathcal{C}_y$ the outputs $y$ for all $(x, y)$ in $\mathcal{C}$. Given this formulation, it is our goal to enforce
$$p(y' \mid x', \mathcal{D}, \mathcal{C}) \tag{3}$$
Note that (3) is simply the posterior predictive distribution conditioned on $\mathcal{C}$. The generality of this approach allows for the incorporation of very complicated yet interpretable constraints a priori, such as arbitrary equality, inequality and logical (if-then and either-or) constraints.
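Constraint prior. We connect the weight space of the BNN with constraints through the distribution:
$$g_\lambda(\Phi_{\mathbf{W}}(\mathcal{C}_x) \mid \mathcal{C}_y) \tag{4}$$
where $\Phi_{\mathbf{W}}$ is the BNN forward pass and $\lambda$ is the set of tuneable hyperparameters of $g$. Accordingly, a constraint prior can then be constructed as:
$$p_{\mathcal{C}}(\mathbf{W}) = p(\mathbf{W})\, g_\lambda(\Phi_{\mathbf{W}}(\mathcal{C}_x) \mid \mathcal{C}_y) \tag{5}$$
achieving the goal of expressing prior function knowledge in weight space while retaining the weight-space prior $p(\mathbf{W})$. Intuitively, $g_\lambda$ measures the BNN's adherence to the constrained region.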
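The construction above multiplies the usual weight prior by a term that scores how well the network output agrees with the constraints, which in log space is a sum of two terms. A minimal sketch of that composition follows, assuming a hypothetical two-weight linear `forward` function and a Gaussian closeness term for a positive constraint; the names, the toy constraint, and the value of `std` are illustrative assumptions rather than the authors' implementation.

```python
import math

def log_normal_pdf(value, mean, std):
    """Log density of N(mean, std^2) at value."""
    z = (value - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def forward(weights, x):
    """Hypothetical stand-in for the BNN forward pass: the line w0 * x + w1."""
    return weights[0] * x + weights[1]

def log_weight_prior(weights, sigma=1.0):
    """Isotropic Gaussian log-prior over the weights."""
    return sum(log_normal_pdf(w, 0.0, sigma) for w in weights)

def log_g_positive(weights, constraint_points, std=0.5):
    """Illustrative constraint term: Gaussian closeness of forward(W, x)
    to the target y for each sampled (x, y) of a positive region."""
    return sum(log_normal_pdf(forward(weights, x), y, std)
               for x, y in constraint_points)

def log_constraint_prior(weights, constraint_points, sigma=1.0, std=0.5):
    """Unnormalised log constraint prior: weight prior plus constraint term."""
    return (log_weight_prior(weights, sigma)
            + log_g_positive(weights, constraint_points, std))

# Toy positive constraint: for x in {2, 3} the output should equal 2 * x.
points = [(2.0, 4.0), (3.0, 6.0)]
satisfying = [2.0, 0.0]   # forward gives exactly 2 * x
violating = [-2.0, 0.0]   # same norm, so the weight prior alone cannot tell them apart
```

Here the weight prior assigns `satisfying` and `violating` the same density, and only the constraint term separates them.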
It remains to describe how $g_\lambda$ is defined. For positive constraints $\mathcal{C}^+$, $g_\lambda$ measures how close $\Phi_{\mathbf{W}}(\mathcal{C}_x)$ lies to $\mathcal{C}_y$, for which natural choices of distributions exist for both regression and classification. For negative constraints $\mathcal{C}^-$, we define $g_\lambda$ through the expected violation of $\mathcal{C}^-$ given $\mathbf{W}$, measured using a classifier function. Complete definitions of $g_\lambda$ for positive and negative priors are provided in Appendix A; details on inference procedures are provided in Appendix B.
5 Demonstrations on Synthetic Data
This section provides proofs of concept of OCBNNs using 2-dimensional synthetic examples. Refer to Appendix C for experimental details and Appendix D for additional results. For regression, the posteriors are visualized in black/gray for baseline BNNs, and blue for OCBNNs. Negative constrained regions are red; positive (Gaussian) constraints are green. For the classification example, the three classes are color-coded red, green and blue.
OCBNNs model uncertainty in a manner that respects constrained regions while explaining training data. Figure 1 demonstrates this for both the regression and classification settings. Correct predictions are maintained with uncertainty levels similar to the baseline, while constraints are correctly enforced and uncertainty levels change to reflect that. These examples demonstrate how OCBNNs enforce constraints without sacrificing predictive accuracy.
OCBNNs encourage correct out-of-distribution behavior. Figure 2 (left) depicts sparse data, along with out-of-distribution positive constrained regions. The posterior predictive in-distribution closely mimics the baseline, while the posterior out-of-distribution (OOD) learns to avoid the constrained region. This demonstrates that OCBNNs function well away from the data, which is important because we typically want to enforce functional constraints when there is a lack of observed training data for the model to learn from.
OCBNNs can capture posterior multimodality. While negative constraints $\mathcal{C}^-$ do not explicitly define multimodal posterior predictives, a bounded constrained region does imply that the posterior predictive might have probability mass on either side of the bounded region (i.e. for all dimensions of $\mathcal{Y}$). Figure 2 (right) demonstrates that we capture challenging posterior predictives.
6 Applications
6.1 Clinical action prediction
MIMIC-III (Johnson et al., 2016) is a benchmark database containing time series data of various physiological measurements and prescribed clinical actions for intensive care patients who stayed at the Beth Israel Deaconess Medical Center between 2001 and 2012.
Problem formulation. From the raw time-series data, we construct a balanced dataset for a time-independent classification task of hypotension management. There are 9 features representing various physiological states, such as mean blood pressure and lactate levels. The goal is to predict whether a clinical action (either vasopressor or IV fluid) should be taken.
Constraints. The constraint imposed is that for mean blood pressure below 65 units, some action should be taken, which is physiologically realistic. We apply the positive (Dirichlet) constraint prior (Appendix A) and compare it against the weights-only prior baseline. In the given data, some training points fall within the constrained region. We train our model both with and without artificially filtering out all points within the positive constrained region.
OCBNNs maintain classification accuracy while reducing physiologically infeasible constraint violations. Table 1 displays experimental results, with statistics computed from the posterior mean. In addition to standard accuracy (ACC) and F1 score, we measure the violation fraction (VIOL), which is the fraction of predictions on held-out points that violate the constraints. The results show that OCBNNs match standard BNNs on all predictive accuracy metrics, with significantly lower violation of the constrained region for the case where points originally in the constrained region are filtered out.
Table 1: Clinical action prediction results.
               filtered            unfiltered
               BNN      OCBNN      BNN      OCBNN
Train  ACC     0.745    0.741      0.881    0.878
       F1      0.805    0.801      0.882    0.880
       VIOL    0.151    0.149      N/A      N/A
Test   ACC     0.660    0.665      0.647    0.649
       F1      0.746    0.748      0.725    0.736
       VIOL    0.132    0.126      0.117    0.039
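The violation fraction can be computed directly from held-out inputs and predicted labels. The sketch below assumes that label 1 means "action taken", that the constraint is "mean blood pressure below 65 implies action", and that the denominator is the number of all held-out points; the text does not pin down these details, so they are assumptions, and the function name is hypothetical.

```python
def violation_fraction(mean_bp_values, predicted_actions, threshold=65.0):
    """Fraction of held-out predictions that violate the constraint
    'mean blood pressure < threshold implies action (label 1)'.

    Assumption: the denominator is the number of all held-out points,
    not only those inside the constrained region.
    """
    if len(mean_bp_values) != len(predicted_actions):
        raise ValueError("inputs must have equal length")
    if not mean_bp_values:
        return 0.0
    violations = sum(
        1 for bp, action in zip(mean_bp_values, predicted_actions)
        if bp < threshold and action == 0
    )
    return violations / len(mean_bp_values)
```

For example, with blood pressures `[60, 70, 50, 80]` and predictions `[0, 0, 1, 1]`, only the first point lies in the constrained region with no action predicted, so one of four predictions is a violation.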
6.2 Human motion prediction
We evaluate OCBNNs on data of humans conducting various motions, available from Kratzer (2019) and described in Kratzer et al. (2018). This data contains human upper body poses across many reaching tasks at a frame rate of 120 Hz. The poses are provided in the form of upper body joint angles.
Problem formulation. Given a subset of trajectories in Kratzer (2019), our goal is to predict joint angles 20 frames in the future from angles at the current time frame and the numerically computed joint velocities and accelerations. In the following, we limit ourselves to abduction and flexion (further denoted as Y- and Z-rotation to match the nomenclature in the original data (Kratzer, 2019)) of the left and right shoulder during right-handed reaching motions.
The joint angles in the test data were perturbed with normally distributed noise (in degrees) to simulate a scenario in which a human motion prediction model is trained on data recorded in a high-end motion capture lab, and then used to predict motion from data obtained by noisy wearable sensors.
Constraints. Several anatomical feasibility or functional range constraints for each of the joint angles could be applied, e.g. as described in Namdari et al. (2012). We derived constraints on the joint limits from the reaching motions provided in Kratzer (2019) as the empirically observed extrema across all motions, which are modeled using the negative constraint prior.
OCBNNs prevent infeasible predictions. We compare a BNN and an OCBNN using the negative prior and the empirical bounds on joint angles. Both models are compared in terms of (i) root-mean-square error of the posterior predictive mean (RMSE), (ii) held-out data log-likelihood under the posterior predictive mean and variance (HOLL), and (iii) posterior predictive violation, defined as the percentage of probability mass in an infeasible constrained region (PPVIOL) [%], each evaluated at all target points.
These metrics are summarized in Table 2. We find that OCBNNs reduce the possibility of making an infeasible prediction to less than 0.001%, substantially improving on BNNs. Figure 4 shows example motion predictions obtained with both BNN and OCBNN for five consecutive points in a test trajectory.
Table 2: Human motion prediction results.
                 BNN         OCBNN
Train  RMSE      0.929       1.252
       HOLL      1718.409    1342.602
       PPVIOL    0.046       0.000
Test   RMSE      7.320       12.127
       HOLL      101.129     683.697
       PPVIOL    18.447      0.000
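The three metrics can be sketched for a single output dimension as follows. The code assumes that the posterior predictive at each target point is summarized by a Gaussian with the predictive mean and standard deviation, and that the feasible set is an interval `[lower, upper]`; both are simplifying assumptions made for this illustration, and the function names are hypothetical.

```python
import math

def rmse(targets, means):
    """Root-mean-square error of the posterior predictive means."""
    return math.sqrt(sum((y - m) ** 2 for y, m in zip(targets, means)) / len(targets))

def heldout_log_likelihood(targets, means, stds):
    """Sum of Gaussian log densities of the targets under N(mean, std^2)."""
    total = 0.0
    for y, m, s in zip(targets, means, stds):
        z = (y - m) / s
        total += -0.5 * z * z - math.log(s) - 0.5 * math.log(2.0 * math.pi)
    return total

def normal_cdf(x, mean, std):
    """Cumulative distribution function of N(mean, std^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def pp_violation_percent(mean, std, lower, upper):
    """Percentage of N(mean, std^2) mass outside the feasible interval."""
    inside = normal_cdf(upper, mean, std) - normal_cdf(lower, mean, std)
    return 100.0 * (1.0 - inside)
```

A prediction centered on a joint limit has roughly half of its mass in the infeasible region, so `pp_violation_percent` is close to 50 in that case, whereas a tight prediction well inside the limits gives a value near 0.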
7 Discussion
OCBNNs prevent constraint violation while fitting low- and high-dimensional data.
Our results highlight that incorporating expert knowledge into OCBNNs helps enforce feasible and thus more robust predictions. Results for both datasets in Section 6 demonstrate that constraint violation metrics are reduced significantly, whereas accuracy metrics are nearly unchanged. This affirms the behavior observed in the synthetic examples in Section 5.
Training data in constrained region can outweigh prior effect.
The clinical dataset results show that the presence of data in the constrained region reduces the effect of constraint priors. This is expected and in accordance with the Bayesian framework, where the likelihood will crowd out the prior given enough training data. It also suggests that the practitioner can use OCBNNs even in situations where the constraints themselves may not be fully satisfied.
OCBNNs can facilitate data imputation.
The fact that OCBNNs model uncertainty correctly in constrained regions without losing predictive accuracy, even for high-dimensional datasets, shows that OCBNNs can encode imputation in input regions without training data. Rather than directly modifying the training set through imputation, prior beliefs about missing data can instead be formulated as constraints.
When to use which prior? In the regression setting, negative priors are weakly informative whereas positive priors tend to be strongly informative; one or both of the prior types can be used depending on domain knowledge. While the negative prior formulation does not apply to classification, this does not pose a problem, as negative and positive constraints are complements in discrete space.
8 Conclusion and Outlook
We describe OCBNNs, a formulation to incorporate expert knowledge into BNNs by prescribing positive and negative (i.e., desired and forbidden) regions, and demonstrate their application to synthetic and real-world data. We show that OCBNNs generally maintain the desirable properties of regular BNNs while their predictions follow the prescribed constraints. This makes them a promising tool for settings like healthcare, where models trained on sparse data may be augmented with expert knowledge. In addition, OCBNNs may find applications in safe reinforcement learning, e.g. in tasks where certain actions are known to have catastrophic consequences.
Acknowledgements
MG and FDV acknowledge support from AFOSR FA 95501710155. LL and WY acknowledge support from the John A. Paulson School of Engineering and Applied Sciences at Harvard University.
References
Hafner et al. (2018) Hafner, D., Tran, D., Lillicrap, T., Irpan, A., and Davidson, J. Reliable uncertainty estimates in deep neural networks using noise contrastive priors. arXiv preprint arXiv:1807.09289, 2018.
Johnson et al. (2016) Johnson, A. E., Pollard, T. J., Shen, L., Liwei, H. L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
Kratzer (2019) Kratzer, P. mocapmlrdatasets. https://github.com/charlespwd/projecttitle, 2019.
Kratzer et al. (2018) Kratzer, P., Toussaint, M., and Mainprice, J. Towards combining motion optimization and data driven dynamical models for human motion prediction. In 2018 IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids), pp. 202–208. IEEE, 2018.
Liu & Wang (2016) Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386, 2016.
Lorenzi & Filippone (2018) Lorenzi, M. and Filippone, M. Constraining the dynamics of deep probabilistic models. arXiv preprint arXiv:1802.05680, 2018.
MacKay (1995) MacKay, D. J. C. Probable networks and plausible predictions – a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.
Namdari et al. (2012) Namdari, S., Yagnik, G., Ebaugh, D. D., Nagda, S., Ramsey, M. L., Williams Jr, G. R., and Mehta, S. Defining functional shoulder range of motion for activities of daily living. Journal of Shoulder and Elbow Surgery, 21(9):1177–1183, 2012.
Neal (1995) Neal, R. M. Bayesian Learning for Neural Networks. PhD thesis, Graduate Department of Computer Science, University of Toronto, 1995.
Neal (2012) Neal, R. M. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo, 2012.
Sun et al. (2019) Sun, S., Zhang, G., Shi, J., and Grosse, R. Functional variational Bayesian neural networks. arXiv preprint arXiv:1903.05779, 2019.
Appendix A Constraint Priors
In this section, we describe the detailed functional forms of our positive and negative constraints and priors for both classification and regression settings, noting aspects important for inference.
A.1 Positive constraint prior
Since $\mathcal{C}^+$ describes the set of points that the learned function should model, the distribution $g_\lambda$ in (4) has the straightforward interpretation of measuring how closely $\Phi_{\mathbf{W}}(\mathcal{C}_x)$ lies to $\mathcal{C}_y$. Most common probability distributions as well as (possibly improper) user-defined distributions are amenable, though differentiability may be a condition for certain inference methods. In particular, natural choices of distributions exist for both regression and classification.
Regression. In the simplest setting, for which there is a known ground-truth function described perfectly by $\mathcal{C}^+$, the Gaussian distribution is a natural choice:
(6) 
Here, the inputs are drawn from a sampling distribution for $\mathcal{C}_x$, which is necessary for tractability if $\mathcal{C}_x$ is large or infinite. The sampling distribution itself can be user-defined as the domain allows, allowing for flexibility in sampling. The tuneable standard deviation of the Gaussian controls the strictness of deviation from $\mathcal{C}_y$. More generally, it is possible that there exist multiple $y$ for some $x$. This can be expressed using multimodal distributions, for example:
(7)
with user-defined mixture weights.
Classification. $\mathcal{C}_y$ describes the classes that the BNN is constrained to for the corresponding $\mathcal{C}_x$. In the discrete setting, the natural distribution is the Dirichlet. For $K$ classes,
(8)
where the Dirichlet parameters are set through a controllable penalty.
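To make the Dirichlet choice concrete, the sketch below evaluates a Dirichlet log density on the predicted class probabilities, with a large concentration parameter on allowed classes and a small one on forbidden classes. The values 10 and 0.01 are the ones reported for the clinical experiment in Appendix C.2; the function names, and the way the parameters are assembled, are assumptions made for illustration.

```python
import math

def dirichlet_log_pdf(probs, alphas):
    """Log density of Dirichlet(alphas) at a probability vector probs.
    All entries of probs must be strictly positive."""
    log_norm = math.lgamma(sum(alphas)) - sum(math.lgamma(a) for a in alphas)
    return log_norm + sum((a - 1.0) * math.log(p) for a, p in zip(alphas, probs))

def constraint_alphas(num_classes, allowed, alpha_allowed=10.0, alpha_forbidden=0.01):
    """Concentration parameters: large for allowed classes, small otherwise."""
    return [alpha_allowed if k in allowed else alpha_forbidden
            for k in range(num_classes)]

# Two classes (0 = no action, 1 = action); inside the region only class 1 is allowed.
alphas = constraint_alphas(2, allowed={1})
good = dirichlet_log_pdf([0.1, 0.9], alphas)  # most mass on the allowed class
bad = dirichlet_log_pdf([0.9, 0.1], alphas)   # most mass on the forbidden class
```

Predicted probabilities that favor the allowed class receive a much higher log density than those favoring the forbidden class, which is the behavior the positive prior relies on.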
A.2 Negative constraint prior
The negative constraint prior enforces the infeasibility of regions in function space and is constructed by placing little prior probability on high expected violation of $\mathcal{C}^-$:
(9)
In (9), a classifier function encodes softly whether or not an input-output pair $(x, y)$ is in $\mathcal{C}^-$, which allows black-box use with any inference technique:
(10) 
The definition of the classifier function assumes that the negative region is defined by sets of inequality constraints, which can define arbitrary linear and nonlinear shapes in the input-output space. Each constraint is passed through a soft indicator of whether it is satisfied, a more generally parameterizable sigmoidal activation defined as
(11)
If all constraints for at least one infeasible region are satisfied, our prior knowledge is violated and the classifier output is far from 0. Otherwise, at least one constraint of every infeasible region is violated and our prior beliefs are satisfied; the classifier output is close to 0. In contrast to other classification functions, the product of two tanh functions with different scales enables a sharp and steep overall classification of violating values and a smoother and flatter classification of satisfying values, so that gradients vanish less for constraint-satisfying, i.e. region-violating, inputs.
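The soft classifier described above can be sketched as follows. The exact functional form and scale values are not recoverable from the text here, so the product-of-two-tanh indicator, the scales `tau0` and `tau1`, the use of a maximum over regions, and the example box are all assumptions chosen only to match the qualitative description.

```python
import math

def soft_leq_zero(q, tau0=15.0, tau1=2.0):
    """Soft indicator of q <= 0 built from two tanh factors with different
    scales: close to 1 for clearly negative q, close to 0 for clearly
    positive q. The scale values are illustrative."""
    return 0.25 * (math.tanh(-tau0 * q) + 1.0) * (math.tanh(-tau1 * q) + 1.0)

def region_membership(constraint_values, tau0=15.0, tau1=2.0):
    """Soft 'all constraints satisfied' for one region (product of indicators)."""
    result = 1.0
    for q in constraint_values:
        result *= soft_leq_zero(q, tau0, tau1)
    return result

def violation(x, y, regions, tau0=15.0, tau1=2.0):
    """Soft classifier: far from 0 if (x, y) lies in at least one negative
    region. Each region is a list of functions q(x, y), where q <= 0 means
    the inequality constraint is satisfied."""
    return max(region_membership([q(x, y) for q in region], tau0, tau1)
               for region in regions)

# Hypothetical forbidden box: 0 <= x <= 1 and 2 <= y <= 3.
box = [lambda x, y: -x, lambda x, y: x - 1.0,
       lambda x, y: 2.0 - y, lambda x, y: y - 3.0]
```

A point in the middle of the box gives a value well above 0, while a point far outside the box gives a value that is numerically 0, so a prior that penalizes large values of `violation` pushes the network output away from the box.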
Appendix B Inference
Constraint priors $p_{\mathcal{C}}(\mathbf{W})$ can be substituted for the traditional prior term $p(\mathbf{W})$ with any black-box sampling or variational inference (VI) algorithm. Here, we provide a summary of the algorithms we use and describe the trivial modifications used to incorporate constraint priors. Note that the general form of $p_{\mathcal{C}}(\mathbf{W})$ is not normalized, which does not pose a problem for inference in practice.
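Hamiltonian Monte Carlo (HMC). HMC (Neal, 2012) is an MCMC method considered to be the "gold standard" in posterior sampling, although it does not scale well. We substitute $p(\mathbf{W})$ by $p_{\mathcal{C}}(\mathbf{W})$ in the potential energy term computed at each sampling iteration:
$$U(\mathbf{W}) = -\log p(\mathcal{D} \mid \mathbf{W}) - \log p_{\mathcal{C}}(\mathbf{W}) \tag{12}$$
As the presence of the constraint term increases the magnitude of the prior $p_{\mathcal{C}}(\mathbf{W})$, empirical performance typically improves by using a smaller step size than with $p(\mathbf{W})$ for the same dataset.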
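The substitution amounts to adding one extra term to the potential energy that the leapfrog integrator differentiates. The sketch below shows this with a finite-difference gradient standing in for automatic differentiation; the function names and the callback signatures are assumptions for illustration, and a full HMC sampler would additionally need momentum resampling and a Metropolis accept step.

```python
def potential_energy(weights, log_likelihood, log_prior, log_g=None):
    """Negative log of likelihood times prior; passing log_g adds the
    constraint term, omitting it gives the baseline BNN."""
    energy = -log_likelihood(weights) - log_prior(weights)
    if log_g is not None:
        energy -= log_g(weights)
    return energy

def numerical_gradient(fn, weights, h=1e-5):
    """Central finite differences (a stand-in for automatic differentiation)."""
    grad = []
    for i in range(len(weights)):
        up = list(weights)
        up[i] += h
        down = list(weights)
        down[i] -= h
        grad.append((fn(up) - fn(down)) / (2.0 * h))
    return grad

def leapfrog_step(weights, momentum, energy_fn, step_size):
    """One leapfrog step of HMC on the given potential energy."""
    grad = numerical_gradient(energy_fn, weights)
    momentum = [p - 0.5 * step_size * g for p, g in zip(momentum, grad)]
    weights = [w + step_size * p for w, p in zip(weights, momentum)]
    grad = numerical_gradient(energy_fn, weights)
    momentum = [p - 0.5 * step_size * g for p, g in zip(momentum, grad)]
    return weights, momentum
```

Because the constraint term steepens the energy surface, the same `step_size` produces larger momentum updates than for the baseline, which is consistent with the recommendation to use a smaller step size.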
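Stein Variational Gradient Descent (SVGD). SVGD (Liu & Wang, 2016) is a VI method where a set of particles (in our case, samples of $\mathbf{W}$) are optimized via functional gradient descent to mimic the true posterior. SVGD combines the efficiency of VI methods with the ability of MCMC methods to capture more expressive posterior approximations. $p(\mathbf{W})$ is substituted by $p_{\mathcal{C}}(\mathbf{W})$ in the computation of the functional gradient:
$$\hat{\phi}^*(\mathbf{W}) = \frac{1}{S} \sum_{j=1}^{S} \Big[ k(\mathbf{W}_j, \mathbf{W})\, \nabla_{\mathbf{W}_j} \big( \log p(\mathcal{D} \mid \mathbf{W}_j) + \log p_{\mathcal{C}}(\mathbf{W}_j) \big) + \nabla_{\mathbf{W}_j} k(\mathbf{W}_j, \mathbf{W}) \Big] \tag{13}$$
where $\{\mathbf{W}_j\}_{j=1}^{S}$ are the particles and $k$ is the kernel. Our implementation of SVGD uses the weighted RBF kernel and adaptive bandwidth as suggested in Liu & Wang (2016), as well as mini-batched data for tractability.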
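A single SVGD update can be sketched for scalar particles as below. To keep the example short it uses a fixed RBF bandwidth and a plain gradient step instead of the adaptive bandwidth, Adagrad, and mini-batching mentioned above; `grad_log_post` is a hypothetical callback that would return the gradient of the log likelihood plus the log constraint prior.

```python
import math

def svgd_step(particles, grad_log_post, step_size=0.1, bandwidth=1.0):
    """One SVGD update for a list of scalar particles.

    grad_log_post(x) should return the gradient of the unnormalised log
    posterior at x (likelihood term plus constraint-prior term).
    Simplifications: fixed bandwidth, plain gradient step.
    """
    n = len(particles)
    grads = [grad_log_post(x) for x in particles]
    updated = []
    for xi in particles:
        phi = 0.0
        for xj, gj in zip(particles, grads):
            k = math.exp(-(xj - xi) ** 2 / bandwidth)
            # Derivative of k(xj, xi) with respect to xj: the repulsive term
            # that keeps particles from collapsing onto a single mode.
            grad_k = -2.0 * (xj - xi) / bandwidth * k
            phi += k * gj + grad_k
        updated.append(xi + step_size * phi / n)
    return updated
```

With a single particle the repulsive term vanishes and the update reduces to gradient ascent on the log posterior; with several particles the kernel gradient spreads them out, which is what allows multimodal posteriors to be represented.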
Appendix C Experimental Details
C.1 Synthetic Examples
For all experiments, the BNN used comprises a single hidden layer with 10 nodes and Radial Basis Function (RBF) activations.
All regression plots show the posterior mean function (bold line) as well as two confidence intervals (dark and light shading).
Figure 1: (left) Regression example with negative constrained regions; the negative prior formulation is used. (right) The input space is 2-dimensional and there are 3 classes (color-coded) with 8 training points in each class, generated from Gaussian means. The constrained region is a box defined such that points within it should be classified as green. The positive prior is used. HMC (10000 burn-in, 1000 samples collected at intervals of 10) is used for both examples.
Figure 2: (left) Two positive constraints, both Gaussian. The 3 training points are arbitrarily defined. HMC (10000 burn-in, 1000 samples collected at intervals of 10) is used. (right) A boxed negative constrained region. SVGD with 75 particles is used with Adagrad.
C.2 Clinical action prediction
For all experiments, the BNN used comprises 2 hidden layers of 200 nodes each and RBF activations. SVGD is used for inference with 50 particles, 1500 iterations, Adagrad optimization, and a suitable batch size. The size of the full dataset is 298K; this reduces to 125K when points in the constrained region are filtered out. Details on the prior formulation can be found in Appendix A. The Dirichlet parameter is set to 10 for allowed classes and 0.01 for forbidden classes.
C.3 Human motion prediction
For these experiments, the BNN used comprises 2 hidden layers of 100 nodes each and RBF activations. For inference, we again used SVGD and Adagrad with 50 particles and 1000 iterations. The negative prior was evaluated using samples from the constrained input region, see Eq. (9).
We randomly chose a subset of 10 right-handed reaching trajectories from Kratzer (2019). This data was randomly split into 5 training and 5 test trajectories, which amounts to 243 Markov states for training and 142 states for evaluation. Given this problem setting, the regression task had 12-dimensional inputs and 4-dimensional targets. The number of training trajectories was kept low to increase sparsity and the difficulty of successful robust generalization.
Appendix D Additional Results
D.1 Additional Synthetic Examples
Figure 5 shows additional examples for out-of-distribution and multimodal behavior. (left) Out-of-distribution negative constraints. The training points are identical to those in the left plot of Figure 2. HMC (10000 burn-in, 1000 samples collected at intervals of 10) is used. (right) Multimodal positive constraints, given by two positive functions defined on the same domain. The training points were arbitrarily defined. An equally-weighted mixture of two Gaussians is used as the positive constraint prior. SVGD with 75 particles and Adagrad are used.