BNNs combine powerful function approximators with the ability to model uncertainty, making them useful in domains where (i) training data is expensive or limited, or (ii) inaccurate predictions are prohibitively costly and decision-making must be informed by our level of confidence (MacKay, 1995; Neal, 1995). Domain experts often have prior knowledge about the modeled function, and the ability to encode such information alongside training data can thus improve performance. However, BNNs define prior distributions over parameters, whose high dimensionality and lack of interpretability make incorporating functional beliefs all but impossible.
We present an interpretable approach for incorporating prior functional information into BNNs in the form of constraints, while staying consistent with the Bayesian framework. We then apply our method to two domains where the ability to encode such constraints is crucial: (i) prediction of clinical actions in health care, where constraints prevent unsafe actions for certain physiological inputs, and (ii) human motion prediction, where joint positions are constrained by anatomically feasible ranges.
Our contributions are: (a) we introduce constraint priors, capable of incorporating both negative constraints (where the function cannot be) and positive constraints (where the function should be), applicable with any black-box inference algorithm normally used with BNNs, and (b) we demonstrate the application of constraint priors with a variety of suitable inference methods on toy problems as well as two large and high-dimensional real-world data sets.
2 Related Work
Most closely related to our work, Lorenzi & Filippone (2018) considered function-space equality and inequality constraints for deep probabilistic models. However, they focused on deep Gaussian processes (DGPs) rather than BNNs, and on low-dimensional data from simulated ODE systems, whereas we consider high-dimensional real-world settings. They also do not consider classification settings.
Hafner et al. (2018) specify a Gaussian function prior with the goal of preventing overconfident BNN predictions out-of-distribution. In contrast, we use "positive constraints" to guide the function where it should be. Also related are the functional BNNs of Sun et al. (2019), where variational inference is performed in function space using a stochastic process model. Their view is more general (and accordingly more complex to optimize), while we focus on constraints in specific regions of the input-output space.
3 Background

A conventional BNN, operating in the function (or input-output) space $\mathcal{X} \times \mathcal{Y}$, typically has a prior over parameters $p(\mathbf{w})$, where $\mathbf{w}$ are the neural network weights and biases. Given data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$, we perform inference to obtain the posterior

$p(\mathbf{w} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w}).$

The posterior predictive for the output $y^*$ for some new input $x^*$ is obtained by integrating over the posterior distribution of $\mathbf{w}$:

$p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, \mathbf{w})\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w}.$
The space of $\mathbf{w}$ is high-dimensional and the relationship between the weights and the function they parameterize is non-intuitive. As such, the prior is often trivially chosen as an isotropic Gaussian: $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \sigma_{\mathbf{w}}^2 I)$.
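To make this notation concrete, the following sketch draws weights from an isotropic Gaussian prior and forms a Monte Carlo estimate of the predictive mean and uncertainty for a small single-hidden-layer network. The architecture, weight layout, and all names here are illustrative, not the models used in our experiments.

```python
import numpy as np

def forward(w, x, hidden=10):
    # Unpack a single-hidden-layer network from the flat weight vector w.
    # Layout (illustrative): [W1 (1 x hidden), b1, W2 (hidden x 1), b2].
    W1 = w[:hidden].reshape(1, hidden)
    b1 = w[hidden:2 * hidden]
    W2 = w[2 * hidden:3 * hidden].reshape(hidden, 1)
    b2 = w[3 * hidden]
    h = np.tanh(x.reshape(-1, 1) @ W1 + b1)   # hidden activations
    return (h @ W2).ravel() + b2              # one scalar output per input

def posterior_predictive(w_samples, x_star):
    # Monte Carlo approximation of p(y* | x*, D): average the network
    # output over weight samples drawn from (an approximation of) the
    # posterior p(w | D); here we use prior samples for illustration.
    outputs = np.stack([forward(w, x_star) for w in w_samples])
    return outputs.mean(axis=0), outputs.std(axis=0)

# Isotropic Gaussian prior p(w) = N(0, sigma^2 I): draw weight samples.
rng = np.random.default_rng(0)
dim = 3 * 10 + 1                       # parameter count for hidden=10
prior_samples = rng.normal(0.0, 1.0, size=(200, dim))
mean, std = posterior_predictive(prior_samples, np.array([0.0, 0.5]))
```

Replacing the prior samples with posterior samples (e.g. from HMC or SVGD, Appendix B) turns the same averaging step into the posterior predictive.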
4 Output-Constrained BNNs
We consider two kinds of “expert knowledge”: positive constraints define regions where a function should be, and negative constraints define regions where a function cannot be. This delineation is not arbitrary — the level of prior knowledge (strongly vs. weakly informative) and the task (regression or classification) may suggest the use of different prior constraints.
Defining constrained regions Formally, a positive constrained region $\mathcal{C}^+ \subseteq \mathcal{X} \times \mathcal{Y}$ is a set of input-output tuples defining where outputs given certain inputs should be. Conversely, a negative constrained region $\mathcal{C}^- \subseteq \mathcal{X} \times \mathcal{Y}$ is a set of tuples defining where outputs given certain inputs cannot be. We will use $\mathcal{C}$ when describing properties of constrained regions of both kinds and denote $\mathcal{C}_y(x) = \{y : (x, y) \in \mathcal{C}\}$ for all $x$ in $\mathcal{C}_x$, where $\mathcal{C}_x$ is the projection of $\mathcal{C}$ onto the input space. Given this formulation, it is our goal to enforce the constraints in the predictive distribution

$p(y^* \mid x^*, \mathcal{D}, \mathcal{C}). \qquad (3)$
Note that (3) is simply the posterior predictive distribution conditioned on $\mathcal{C}$. The generality of this approach allows for the incorporation of very complicated yet interpretable constraints a priori, such as arbitrary equality, inequality and logical (if-then and either-or) constraints.
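As a concrete (and entirely hypothetical) illustration of such region definitions, box-shaped constrained regions and their either-or composition can be expressed in a few lines; the bounds below are illustrative, not those used in our experiments:

```python
def in_box_region(x, y, x_lo, x_hi, y_lo, y_hi):
    # Membership test for a box-shaped constrained region
    # C = {(x, y) : x_lo <= x <= x_hi and y_lo <= y <= y_hi}.
    return (x_lo <= x <= x_hi) and (y_lo <= y <= y_hi)

def violates_negative(x, y, regions):
    # A prediction violates prior knowledge if (x, y) lies in ANY
    # negative region -- an either-or composition of box constraints.
    return any(in_box_region(x, y, *r) for r in regions)

# Hypothetical negative region: for inputs in [0, 1], outputs in [2, 3]
# are declared infeasible.
regions = [(0.0, 1.0, 2.0, 3.0)]
print(violates_negative(0.5, 2.5, regions))  # inside the region
print(violates_negative(0.5, 1.0, regions))  # feasible output
```

Logical if-then constraints arise the same way: the input interval is the "if", the output set is the "then".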
Constraint prior We connect the weight space of the BNN with constraints through a distribution $g_{\mathcal{C}}(\mathbf{w})$ that evaluates the network outputs on the constrained region, where $\Phi_{\mathbf{w}}$ is the BNN forward pass and $\theta$ is the set of tuneable hyperparameters of $g_{\mathcal{C}}$. Accordingly, a constraint prior can then be constructed as

$p_{\mathcal{C}}(\mathbf{w}) \propto p(\mathbf{w})\, g_{\mathcal{C}}(\mathbf{w}),$

achieving the goal of expressing prior function knowledge in weight space while retaining the weight-space prior $p(\mathbf{w})$. Intuitively, $g_{\mathcal{C}}(\mathbf{w})$ measures the BNN's adherence to the constrained region.

It remains to describe how $g_{\mathcal{C}}$ is defined. For positive constraints $\mathcal{C}^+$, $g_{\mathcal{C}^+}(\mathbf{w})$ measures how close $\Phi_{\mathbf{w}}(x)$ lies to $\mathcal{C}^+_y(x)$, for which natural choices of distributions exist for both regression and classification. For negative constraints $\mathcal{C}^-$, we define $g_{\mathcal{C}^-}(\mathbf{w})$ through the expected violation of $\mathcal{C}^-$ given $\mathbf{w}$, using a classifier function. Complete definitions of $g_{\mathcal{C}}$ for positive and negative priors are provided in Appendix A; details on inference procedures are provided in Appendix B.
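In log space, the constraint prior composes additively with the weight prior. The sketch below shows this composition; the quadratic penalty used for `log_g` is a stand-in for illustration only, not the actual adherence term defined in Appendix A.

```python
import numpy as np

def log_isotropic_gaussian(w, sigma=1.0):
    # log p(w) for the isotropic Gaussian weight prior (up to a constant).
    return -0.5 * np.sum(w ** 2) / sigma ** 2

def log_constraint_prior(w, log_g, sigma=1.0):
    # Unnormalized log of the constraint prior
    #   p_C(w) ∝ p(w) * g_C(w),
    # i.e. log p_C(w) = log p(w) + log g_C(w) + const.
    return log_isotropic_gaussian(w, sigma) + log_g(w)

# Stand-in adherence term: penalize weights whose (stand-in) statistic
# exceeds a bound. In practice, log_g evaluates the BNN forward pass on
# samples drawn from the constrained region.
log_g = lambda w: -np.maximum(np.sum(w) - 1.0, 0.0) ** 2
w = np.zeros(5)
lp = log_constraint_prior(w, log_g)
```

Because only the sum of log terms is needed, any black-box inference method that consumes an unnormalized log density can use $p_{\mathcal{C}}$ unchanged.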
5 Demonstrations on Synthetic Data
This section provides proofs of concept of OC-BNNs using 2-dimensional synthetic examples. Refer to Appendix C for experimental details and Appendix D for additional results. For regression, the posteriors are visualized in black/gray for baseline BNNs, and blue for OC-BNNs. Negative constrained regions are red; positive (Gaussian) constraints are green. For the classification example, the three classes are color-coded red, green and blue.
OC-BNNs model uncertainty in a manner that respects constrained regions while explaining training data. Figure 1 demonstrates this for both the regression and classification settings. Predictions on the training data are maintained with uncertainty levels similar to the baseline, while the constraints are correctly enforced and the uncertainty within constrained regions changes to reflect them. These examples demonstrate how OC-BNNs enforce constraints without sacrificing predictive accuracy.
OC-BNNs encourage correct out-of-distribution behavior. Figure 2 (left) depicts sparse data, along with out-of-distribution positive constrained regions. The posterior predictive in-distribution closely mimics the baseline, while the posterior out-of-distribution (OOD) learns to avoid the constrained region. This demonstrates that OC-BNNs function well away from the data, which is important because we typically want to enforce functional constraints when there is a lack of observed training data for the model to learn from.
OC-BNNs can capture posterior multimodality. While negative constraints do not explicitly define multimodal posterior predictives, a bounded constrained region does imply that the posterior predictive might have probability mass on either side of the bounded region (i.e. for all dimensions of the output). Figure 2 (right) demonstrates that we capture such challenging posterior predictives.
6.1 Clinical action prediction
MIMIC-III (Johnson et al., 2016) is a benchmark database containing time-series data of various physiological measurements and prescribed clinical actions for intensive care patients who stayed at the Beth Israel Deaconess Medical Center between 2001 and 2012.
Problem Formulation From the raw time-series data, we construct a balanced dataset for a time-independent classification task of hypotension management. There are 9 features representing various physiological states, such as mean blood pressure and lactate levels. The goal is to predict whether a clinical action (either vasopressor or IV fluid) should be taken.
Constraints The constraint imposed is that for mean blood pressure less than 65 units, some action should be taken, which is physiologically realistic. We apply the positive (Dirichlet) constraint prior (Appendix A), as well as the weights-only prior baseline. In the given data, some training points fall within the constrained region. We train our model both with and without artificially filtering out all points within the positive constrained region.
OC-BNNs maintain classification accuracy while reducing physiologically infeasible constraint violations. Table 1 displays experimental results, with statistics computed from the posterior mean. In addition to standard accuracy (ACC) and F1 score, we measure the violation fraction (VIOL), which is the fraction of predictions on held-out points that violate the constraints. The results show that OC-BNNs match standard BNNs on all predictive accuracy metrics, with significantly lower violation of the constrained region for the case where points originally in the constrained region are filtered out.
6.2 Human motion prediction
We evaluate OC-BNNs on data of humans performing various motions, available from Kratzer (2019) and described in Kratzer et al. (2018). The data contains human upper-body poses across many reaching tasks, recorded at a frame rate of 120 Hz. The poses are provided in the form of upper-body joint angles.
Problem formulation Given a subset of the trajectories in Kratzer (2019), our goal is to predict joint angles 20 frames in the future from the angles at the current time frame and the numerically computed joint velocities and accelerations. In the following, we limit ourselves to abduction and flexion (denoted as Y- and Z-rotation to match the nomenclature of the original data) of the left and right shoulder during right-handed reaching motions.
The joint angles in the test data were perturbed with normally distributed noise (in degrees) to simulate a scenario in which a human motion prediction model is trained on data recorded in a high-end motion capture lab, and then used to predict motion from data obtained by noisy wearable sensors.
Constraints Several anatomical feasibility or functional range constraints on each of the joint angles could be applied, e.g. as described by Namdari et al. (2012). We derived joint-limit constraints from the reaching motions provided in Kratzer (2019) as the empirically observed extrema across all motions, which we model using the negative constraint prior.
OC-BNNs prevent infeasible predictions. We compare a BNN and an OC-BNN using the negative prior and the empirical bounds on joint angles. Both models are compared in terms of (i) RMSE of the posterior predictive mean (RMSE), (ii) held-out data log-likelihood under the posterior predictive mean and variance (HO-LL), and (iii) posterior predictive violation, defined as the percentage of posterior predictive probability mass in an infeasible constrained region (PP-VIOL, in %), each evaluated at all target points.
These metrics are summarized in Table 2. We find that OC-BNNs reduce the possibility of making an infeasible prediction to less than 0.001%, substantially improving on BNNs. Figure 4 shows exemplary motion predictions obtained with both BNN and OC-BNN for five consecutive points in a test trajectory.
7 Discussion

OC-BNNs prevent constraint violation while fitting low- and high-dimensional data. Our results highlight that incorporating expert knowledge into OC-BNNs helps enforce feasible and thus more robust predictions. Results for both datasets in Section 6 demonstrate that constraint-violation metrics are reduced significantly, whereas accuracy metrics are nearly unchanged. This affirms the behavior observed in the synthetic examples in Section 5.
Training data in the constrained region can outweigh the prior's effect. The clinical dataset results show that the presence of data in $\mathcal{C}$ reduces the effect of constraint priors. This is expected and in accordance with the Bayesian framework, where the likelihood crowds out the prior given enough training data; it also suggests that practitioners can use OC-BNNs even in situations where the constraints themselves may not be fully satisfied.
OC-BNNs can facilitate data imputation. The fact that OC-BNNs model uncertainty correctly in constrained regions without losing predictive accuracy, even for high-dimensional datasets, shows that OC-BNNs can encode imputation in input regions without training data. Rather than directly modifying the training set through imputation, prior beliefs about missing data can instead be formulated as constraints.
When to use which prior? In the regression setting, negative priors are weakly informative whereas positive priors tend to be strongly informative – one or both of the prior types can be used depending on domain knowledge. While the negative prior formulation does not apply to classification cases, this does not pose a problem as negative and positive constraints are complements in discrete space.
8 Conclusion and Outlook
We describe OC-BNNs, a formulation to incorporate expert knowledge into BNNs by prescribing positive and negative (i.e., desired and forbidden) regions, and demonstrate their application to synthetic and real-world data. We show that OC-BNNs generally maintain the desirable properties of regular BNNs while their predictions follow the prescribed constraints. This makes them a promising tool for settings like healthcare, where models trained on sparse data may be augmented with expert knowledge. In addition, OC-BNNs may find applications in safe reinforcement learning, e.g. in tasks where certain actions are known to have catastrophic consequences.
Acknowledgements

MG and FDV acknowledge support from AFOSR FA 9550-17-1-0155. LL and WY acknowledge support from the John A. Paulson School of Engineering and Applied Sciences at Harvard University.
References

- Hafner et al. (2018) Hafner, D., Tran, D., Lillicrap, T., Irpan, A., and Davidson, J. Reliable uncertainty estimates in deep neural networks using noise contrastive priors. arXiv preprint arXiv:1807.09289, 2018.
- Johnson et al. (2016) Johnson, A. E., Pollard, T. J., Shen, L., Li-wei, H. L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
- Kratzer (2019) Kratzer, P. mocap-mlr-datasets. https://github.com/charlespwd/project-title, 2019.
- Kratzer et al. (2018) Kratzer, P., Toussaint, M., and Mainprice, J. Towards combining motion optimization and data driven dynamical models for human motion prediction. In 2018 IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids), pp. 202–208. IEEE, 2018.
- Liu & Wang (2016) Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386, 2016.
- Lorenzi & Filippone (2018) Lorenzi, M. and Filippone, M. Constraining the dynamics of deep probabilistic models. arXiv preprint arXiv:1802.05680, 2018.
- MacKay (1995) MacKay, D. J. C. Probable networks and plausible predictions – a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.
- Namdari et al. (2012) Namdari, S., Yagnik, G., Ebaugh, D. D., Nagda, S., Ramsey, M. L., Williams Jr, G. R., and Mehta, S. Defining functional shoulder range of motion for activities of daily living. Journal of shoulder and elbow surgery, 21(9):1177–1183, 2012.
- Neal (1995) Neal, R. M. Bayesian Learning for Neural Networks. PhD thesis, Graduate Department of Computer Science, University of Toronto, 1995.
- Neal (2012) Neal, R. M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2012.
- Sun et al. (2019) Sun, S., Zhang, G., Shi, J., and Grosse, R. Functional variational bayesian neural networks. arXiv preprint arXiv:1903.05779, 2019.
Appendix A Constraint Priors
In this section, we describe the detailed functional forms of our positive and negative constraints and priors for both classification and regression settings, noting aspects important for inference.
A.1 Positive constraint prior
Since $\mathcal{C}^+$ describes the set of points that the learned function should model, $g_{\mathcal{C}^+}(\mathbf{w})$ has the straightforward interpretation of measuring how closely $\Phi_{\mathbf{w}}(x)$ lies to $\mathcal{C}^+_y(x)$. Most common probability distributions as well as (possibly improper) user-defined distributions are amenable, though differentiability may be a condition for certain inference methods. In particular, natural choices of distributions exist for both regression and classification.
Regression In the simplest setting, for which there is a known ground-truth function described perfectly by $\mathcal{C}^+$, the Gaussian distribution is a natural choice:

$g_{\mathcal{C}^+}(\mathbf{w}) = \mathbb{E}_{x \sim p_{\mathcal{C}^+_x}}\!\left[\mathcal{N}\!\left(\Phi_{\mathbf{w}}(x);\, y_{\mathcal{C}}(x),\, \sigma_{\mathcal{C}}^2\right)\right],$

where $p_{\mathcal{C}^+_x}$ is a sampling distribution for $x \in \mathcal{C}^+_x$, which is necessary for tractability if $\mathcal{C}^+_x$ is large or infinite. $p_{\mathcal{C}^+_x}$ itself can be user-defined as the domain allows, allowing for flexibility in sampling. $\sigma_{\mathcal{C}}$ is the tuneable standard deviation of the Gaussian, controlling the strictness of deviation from $y_{\mathcal{C}}(x)$. More generally, it is possible that there exist multiple feasible outputs $y_{\mathcal{C},k}(x)$ for some $x$. This can be expressed using multimodal distributions, for example:

$g_{\mathcal{C}^+}(\mathbf{w}) = \mathbb{E}_{x \sim p_{\mathcal{C}^+_x}}\!\left[\sum_k \pi_k\, \mathcal{N}\!\left(\Phi_{\mathbf{w}}(x);\, y_{\mathcal{C},k}(x),\, \sigma_{\mathcal{C},k}^2\right)\right],$

where $\pi_k$ are the user-defined mixture weights.
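A sketch of the mixture-of-Gaussians positive constraint term for a single input; the modes, weights, and standard deviation below are hypothetical values for illustration, not those of any experiment in the paper:

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    # Density of N(mu, sigma^2) evaluated at y.
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def positive_prior_density(y_pred, modes, weights, sigma):
    # Mixture-of-Gaussians positive constraint term: the density of the
    # network output y_pred = Phi_w(x) under a mixture whose modes are the
    # allowed output values for this x; `weights` are the user-defined pi_k.
    return sum(pi * gaussian_pdf(y_pred, mu, sigma)
               for pi, mu in zip(weights, modes))

# Hypothetical constraint with two allowed branches at y = -1 and y = +1.
d_on_mode = positive_prior_density(1.0, modes=[-1.0, 1.0],
                                   weights=[0.5, 0.5], sigma=0.2)
d_between = positive_prior_density(0.0, modes=[-1.0, 1.0],
                                   weights=[0.5, 0.5], sigma=0.2)
# Outputs near an allowed mode receive much higher prior mass than
# outputs between the modes, which is what induces multimodal predictives.
```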
Classification $\mathcal{C}^+_y(x)$ describes the classes that the BNN is constrained to for the corresponding $x \in \mathcal{C}^+_x$. In the discrete setting, the natural distribution is the Dirichlet. For $K$ classes,

$g_{\mathcal{C}^+}(\mathbf{w}) = \mathbb{E}_{x \sim p_{\mathcal{C}^+_x}}\!\left[\mathrm{Dir}\!\left(\Phi_{\mathbf{w}}(x);\, \alpha\right)\right],$

where $\alpha_k$ is large for permitted classes and $\alpha_k = \epsilon \ll 1$ for forbidden classes, for some controllable penalty $\epsilon$.
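A sketch of the Dirichlet positive constraint for classification; the concentration values 10 and 0.01 follow the settings reported in Appendix C, while the class layout and probability vectors are illustrative:

```python
import numpy as np

def dirichlet_logpdf(p, alpha):
    # Unnormalized log-density of Dirichlet(alpha) evaluated at the
    # probability vector p (the BNN's softmax output for a constrained x).
    return np.sum((alpha - 1.0) * np.log(p))

# Hypothetical 3-class problem where class 1 is the allowed class:
# large concentration for the allowed class, small for forbidden ones.
alpha = np.array([0.01, 10.0, 0.01])
good = dirichlet_logpdf(np.array([0.05, 0.90, 0.05]), alpha)  # mostly class 1
bad = dirichlet_logpdf(np.array([0.90, 0.05, 0.05]), alpha)   # forbidden class
```

Softmax outputs concentrated on the allowed class receive a much higher log-density, so the constraint prior rewards weight settings that classify constrained inputs correctly.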
A.2 Negative constraint prior
The negative constraint prior enforces the infeasibility of regions in function space and is constructed by placing little prior probability on a high expected violation of $\mathcal{C}^-$:

$g_{\mathcal{C}^-}(\mathbf{w}) = \exp\!\left(-\gamma\, \mathbb{E}_{x \sim p_{\mathcal{C}^-_x}}\!\left[\,\mathfrak{c}\!\left(x, \Phi_{\mathbf{w}}(x)\right)\right]\right). \qquad (9)$
In (9), $\mathfrak{c}(x, y)$ is a classifier function that encodes softly whether or not $(x, y)$ is in $\mathcal{C}^-$, which allows black-box use with any inference technique. The definition of $\mathfrak{c}$ assumes that the negative region is defined by sets of inequality constraints, i.e. $\mathcal{C}^- = \bigcup_j \mathcal{C}^-_j$ with $\mathcal{C}^-_j = \{(x, y) : f_{j,l}(x, y) \le 0 \text{ for all } l\}$, which can define arbitrary linear and nonlinear shapes in the input-output space. The soft indicator $\tau$ of whether a constraint of the form $f(x, y) \le 0$ is satisfied is a more generally parameterizable sigmoidal activation defined as

$\tau(z) = \tfrac{1}{4}\left(1 - \tanh(s_1 z)\right)\left(1 - \tanh(s_2 z)\right), \qquad s_1 > s_2 > 0.$

If all constraints for at least one infeasible region are satisfied, our prior knowledge is violated and $\mathfrak{c}$ is far from 0. Otherwise, at least one constraint of every infeasible region is violated and our prior beliefs are satisfied; $\mathfrak{c}$ is close to 0. Contrary to other classification functions, the product of two tanh functions with different scales enables a sharp and steep overall classification of violating values inside $\mathcal{C}^-$ and a smoother, flatter classification for satisfying values, making gradients less prone to vanishing for constraint-satisfying, i.e. region-violating, inputs.
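The tanh-product soft indicator and its composition over the inequalities of one infeasible region can be sketched as follows; the scale values `s1` and `s2` are illustrative assumptions, not the values used in our experiments:

```python
import numpy as np

def soft_satisfied(z, s1=10.0, s2=1.0):
    # Soft indicator for a constraint of the form f(x, y) <= 0, built as a
    # product of two tanh factors with different scales: the sharper factor
    # (s1) gives a steep transition at z = 0, the flatter one (s2) keeps
    # gradients alive away from the boundary. Close to 1 when z << 0
    # (constraint satisfied), close to 0 when z >> 0 (constraint violated).
    return 0.25 * (1.0 - np.tanh(s1 * z)) * (1.0 - np.tanh(s2 * z))

def region_violation(f_values):
    # Prior knowledge is violated only when ALL inequalities defining one
    # infeasible region hold simultaneously: take the product of the
    # per-constraint soft indicators.
    return np.prod([soft_satisfied(f) for f in f_values])

inside = region_violation([-2.0, -3.0])   # all constraints hold -> near 1
outside = region_violation([-2.0, 4.0])   # one constraint fails -> near 0
```

Because every factor is smooth, the resulting violation score is differentiable in the network output and therefore usable with gradient-based inference.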
Appendix B Inference
Constraint priors $p_{\mathcal{C}}(\mathbf{w})$ can be substituted for the traditional prior term $p(\mathbf{w})$ with any black-box sampling or variational inference (VI) algorithm. Here, we provide a summary of the algorithms we use and describe the trivial modifications used to incorporate constraint priors. Note that the general form of $p_{\mathcal{C}}(\mathbf{w})$ is not normalized, which does not pose a problem for inference in practice.
Hamiltonian Monte Carlo (HMC) HMC (Neal, 2012) is an MCMC method considered the "gold standard" in posterior sampling, even though it is not scalable. We substitute $p(\mathbf{w})$ by $p_{\mathcal{C}}(\mathbf{w})$ in the potential energy term computed at each sampling iteration:

$U(\mathbf{w}) = -\log p(\mathcal{D} \mid \mathbf{w}) - \log p_{\mathcal{C}}(\mathbf{w}).$
As the presence of $g_{\mathcal{C}}$ increases the magnitude of the prior term, empirical performance typically improves by using a smaller step size than with $p(\mathbf{w})$ for the same dataset.
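The modified potential energy and the gradient a leapfrog integrator would consume can be sketched as below; the quadratic likelihood, prior, and adherence terms are stand-ins for illustration, not an actual BNN:

```python
import numpy as np

def potential_energy(w, log_lik, log_prior, log_g):
    # U(w) = -log p(D|w) - log p_C(w), with the constraint prior
    # log p_C(w) = log p(w) + log g_C(w) (up to normalization).
    return -(log_lik(w) + log_prior(w) + log_g(w))

def grad(f, w, eps=1e-5):
    # Finite-difference gradient of U(w); HMC's leapfrog steps use this
    # (in practice one would use autodiff instead).
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

# Stand-in terms (quadratics) just to exercise the energy computation.
log_lik = lambda w: -np.sum((w - 1.0) ** 2)
log_prior = lambda w: -0.5 * np.sum(w ** 2)
log_g = lambda w: -np.sum(np.maximum(w - 2.0, 0.0) ** 2)
U = potential_energy(np.zeros(3), log_lik, log_prior, log_g)
g = grad(lambda w: potential_energy(w, log_lik, log_prior, log_g),
         np.zeros(3))
```

Since only $U$ and its gradient change, no other part of the HMC machinery needs modification.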
Stein Variational Gradient Descent (SVGD) SVGD (Liu & Wang, 2016) is a VI method where a set of particles (in our case, weight samples $\{\mathbf{w}_i\}_{i=1}^n$) is optimized via functional gradient descent to mimic the true posterior. SVGD combines the efficiency of VI methods with the ability of MCMC methods to capture more expressive posterior approximations. $p(\mathbf{w})$ is substituted by $p_{\mathcal{C}}(\mathbf{w})$ in the computation of the functional gradient:

$\phi(\mathbf{w}_i) = \frac{1}{n} \sum_{j=1}^{n} \left[ k(\mathbf{w}_j, \mathbf{w}_i)\, \nabla_{\mathbf{w}_j} \log\!\left(p(\mathcal{D} \mid \mathbf{w}_j)\, p_{\mathcal{C}}(\mathbf{w}_j)\right) + \nabla_{\mathbf{w}_j} k(\mathbf{w}_j, \mathbf{w}_i) \right].$
Our implementation of SVGD uses the weighted RBF kernel with adaptive bandwidth suggested by Liu & Wang (2016), as well as mini-batched data for tractability.
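A minimal SVGD sketch under these definitions; the target score here is a stand-in standard Gaussian rather than a BNN posterior, and the fixed kernel bandwidth replaces the adaptive heuristic:

```python
import numpy as np

def svgd_step(W, score, h=1.0, lr=0.1):
    # One SVGD update. Each particle w_i moves along the functional gradient
    #   phi(w_i) = (1/n) sum_j [ k(w_j, w_i) * score(w_j)
    #                            + grad_{w_j} k(w_j, w_i) ],
    # where score(w) is the gradient of the (constraint-prior) log posterior.
    n = W.shape[0]
    diffs = W[:, None, :] - W[None, :, :]          # diffs[i, j] = w_i - w_j
    K = np.exp(-np.sum(diffs ** 2, axis=-1) / h)   # RBF kernel matrix
    S = np.stack([score(w) for w in W])            # score at every particle
    # grad wrt w_j of k(w_j, w_i) = (2/h) * K[i, j] * (w_i - w_j):
    # the repulsive term that keeps particles spread apart.
    repulsion = (2.0 / h) * np.einsum('ij,ijd->id', K, diffs)
    return W + lr * (K @ S + repulsion) / n

# Stand-in target: a standard Gaussian, whose score is simply -w.
# Particles drift toward the mode while the kernel term spreads them out.
rng = np.random.default_rng(1)
W = rng.normal(3.0, 0.5, size=(20, 2))
mean_norm_before = np.linalg.norm(W.mean(axis=0))
for _ in range(50):
    W = svgd_step(W, lambda w: -w)
mean_norm_after = np.linalg.norm(W.mean(axis=0))
```

Swapping in the gradient of $\log p(\mathcal{D}\mid\mathbf{w}) + \log p_{\mathcal{C}}(\mathbf{w})$ as `score` is the only change needed to run SVGD with the constraint prior.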
Appendix C Experimental Details
C.1 Synthetic Examples
For all experiments, the BNN used comprises a single hidden layer with 10 nodes, and Radial Basis Function (RBF) activations.
All regression plots show the posterior mean function (bold line) as well as two predictive confidence bands (dark and light shading).
Figure 1: (left) Negative constrained regions are placed around the function generating the training points. The negative prior formulation is used. (right) The input space is 2-dimensional and there are 3 classes (color-coded) with 8 training points in each class, generated from Gaussian class means. The constrained region is a box within which points should be classified as green. The positive prior is used. HMC (10000 burn-in, 1000 samples collected at intervals of 10) is used for both examples.
Figure 2: (left) Two positive constrained regions are placed out-of-distribution. Both constraints are Gaussian. The 3 training points are arbitrarily defined. HMC (10000 burn-in, 1000 samples collected at intervals of 10) is used. (right) The constrained region is a bounded box. SVGD with 75 particles is used with Adagrad.
C.2 Clinical action prediction
For all experiments, the BNN used comprises 2 hidden layers of 200 nodes each and RBF activations. SVGD is used for inference with 50 particles, 1500 iterations, Adagrad optimization, and a suitable batch size. The size of the full dataset is 298K; this reduces to 125K when points in the constrained region are filtered out. Details on the prior formulation can be found in Appendix A. The Dirichlet parameter is set to 10 for allowed classes and 0.01 for forbidden classes.
C.3 Human motion prediction
For these experiments, the BNN used comprises 2 hidden layers of 100 nodes each and RBF activations. For inference, we again used SVGD and Adagrad with 50 particles and 1000 iterations. The negative prior used samples drawn from the constrained region, see Eq. (9).
We randomly chose a subset of 10 right-handed reaching trajectories from Kratzer (2019). This data was randomly split into 5 training and 5 test trajectories, which amounts to 243 Markov states for training and 142 states for evaluation. Given this problem setting, the regression task had 12-dimensional inputs and 4-dimensional targets. The number of training trajectories was kept low to increase sparsity and the difficulty of robust generalization.
Appendix D Additional Results
D.1 Additional Synthetic Examples
Figure 5 shows additional examples of out-of-distribution and multimodal behavior. (left) Out-of-distribution negative constraints are placed on either side of the training data. The training points are identical to those in the left plot of Figure 2. HMC (10000 burn-in, 1000 samples collected at intervals of 10) is used. (right) Multimodal positive constraints: two positive functions define the allowed outputs over part of the domain. The training points were arbitrarily defined. An equally-weighted mixture of two Gaussians is used as the positive constraint prior. SVGD with 75 particles and Adagrad are used.