The active learning setup consists of a base predictor that uses an acquisition function to choose the order in which the data points are to be labeled. Contrary to the tabula rasa ansatz of the present deep learning age, the state of the art in active learning still relies on hand-designed acquisition functions to rank the unlabeled sample space. Different studies, after evaluating various choices, have found different acquisition functions to perform best for specific applications. The critical fact is that active learning is meant to address applications where data labeling is extremely costly, and it is not possible to know the ideal acquisition function for a given application a priori. Once an acquisition function is chosen and active learning has been performed based on it, the labeling budget is already exhausted, leaving no possibility for another try with an alternative acquisition function. This limitation can only be circumvented by adapting the acquisition function to the data during the active learning process, using feedback on how the previous labeling rounds affected the model fit. For real-world scenarios to which active learning is applicable, learning the acquisition function as well is not merely an option driven by practical concerns such as avoiding handcrafting effort, but an absolute necessity stemming from epistemic limitations.
The acquisition functions in active learning are surrogate models that map a data point to a value that encodes the expected contribution of observing its label to model fit. The founding assumption of the active learning setup is that evaluating the acquisition score of a data point is substantially cheaper than acquiring its ground-truth label. Hence, the acquisition functions are expected to be both computationally cheap and maximally accurate in detecting most information-rich regions of the sample space. These competing goals are typically addressed by information-theoretic heuristics. Possibly the most frequently used acquisition heuristic is Maximum Entropy Sampling, which assigns the highest score to the data point for which the predictor reports highest entropy (i.e. uncertainty). This criterion builds on the assumption that the most valuable data point is the one the model is maximally unfamiliar about. While being maximally intuitive, this method remains agnostic to exploiting knowledge from the current model fit about how much the new label can impact the model uncertainty. Another heuristic with comparable reception, Bayesian Active Learning by Disagreement (BALD) [Houlsby et al.2012], benefits from this additional information by maximizing the mutual information between the predictor output and the model parameters. A second major vein of research approaches the active learning problem from a geometric instead of an uncertainty based perspective, e.g. via selection of a core-set [Sener and Savarese2018].
None of the aforementioned heuristics has a theoretical superiority that is sufficient to rule out all other options. Maxent strives only to access unexplored data regions. BALD does the same while also taking into account the expected effect of the newly explored label on the uncertainty of the model parameters. While some studies argue in favor of BALD due to this additional information [Srinivas et al.2012], others prefer to avoid this noisy estimate drawn from an under-trained model [Qiu et al.2017].
This paper presents a data-driven method that alleviates the consequences of the unsolved acquisition function selection problem. As prediction uncertainty is an essential input to acquisition heuristics, we choose a deep Bayesian Neural Net (BNN) as our base predictor. In order to acquire high-quality estimates of prediction uncertainty at an acceptable computational cost, we devise a deterministic approximation scheme that can both effectively train a deep BNN and calculate its posterior predictive density following a chain of closed-form operations. Next, we incorporate all the probabilistic information provided by the BNN predictions into a novel state design, which brings about another full-scale probability distribution. This distribution is then fed into a probabilistic policy network, which is trained by reinforcement feedback collected from every labeling round in order to inform the system about the success of its current acquisition function. This feedback fine-tunes the acquisition function, bringing about improved performance in the subsequent labeling rounds. Figure 1 depicts the workflow of our method.
We evaluate our method on three benchmark vision data sets from different domains and complexities: MNIST for images of handwritten digits, FashionMNIST for greyscale images of clothes, and CIFAR-10 for colored natural images. We observe in our experiments that the policy net is capable of inventing an acquisition function that outperforms all the handcrafted alternatives if the data distribution permits. In the rest of the cases, the policy net converges to the best-performing handcrafted choice, which varies across data sets and is unknown prior to the active learning experiment.
2 The Model
Our method consists of two major components: a predictor and a policy net guiding the predictor by learning a data set specific acquisition function.
As the predictor, described in Section 2.1, we use a BNN, whose posterior predictive density we use to distill the system state. The policy net, another BNN, takes this state as input to decide which data points to request labels for next. (For illustrative purposes we rely on a Central Limit Theorem approach to efficiently marginalize the weights of these BNNs. In general, any approach to training a BNN that provides trustworthy predictive uncertainties could be used.) We describe this second part of the pipeline in Section 2.2. Since we introduce a reinforcement learning based method for active learning, we refer to it as Reinforced Active Learning (RAL).
2.1 Predictor: Bayesian Neural Network
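Let $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$ be a data set consisting of $N$ tuples of feature vectors $\mathbf{x}_n$ and labels $\mathbf{y}_n \in \{0,1\}^{C}$ for a $C$-dimensional binary output label. Parameterizing an arbitrary neural network $f(\cdot\,;\theta)$ with parameters $\theta$ following some prior $p(\theta)$, we assume the following probabilistic model

$$\theta \sim p(\theta), \qquad \mathbf{y}_n \mid \theta, \mathbf{x}_n \sim \prod_{c=1}^{C} \mathcal{B}er\big(y_{nc} \mid \Phi(f_c(\mathbf{x}_n;\theta))\big),$$

where $f_c$ is the $c$th output channel of the net, $\Phi(u) = \int_{-\infty}^{u} \mathcal{N}(x \mid 0, 1)\,dx$, and $\mathcal{B}er$ is a Bernoulli distribution. The calculation of the posterior predictive

$$p(\mathbf{y}^* \mid \mathbf{x}^*, \mathcal{D}) = \int p(\mathbf{y}^* \mid \theta, \mathbf{x}^*)\, p(\theta \mid \mathcal{D})\, d\theta$$

involves the calculation of the posterior distribution on the latent variables, which can be accomplished by Bayes rule

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathbf{Y} \mid \theta, \mathbf{X})\, p(\theta)}{\int p(\mathbf{Y} \mid \theta, \mathbf{X})\, p(\theta)\, d\theta}$$

for $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and $\mathbf{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$. As this is intractable in general, we require approximate inference techniques. We aim for high-quality prediction uncertainties. A sampling-based approach is not practical for vision-scale applications where neural nets with a large parameter count are used. Instead we use variational inference (VI). In order to keep the calculations maximally tractable while benefiting from stochasticity, we formulate a normal mean-field variational posterior (which could be generalized):

$$q_\phi(\theta) = \prod_i \mathcal{N}(\theta_i \mid \mu_i, \sigma_i^2),$$

where the tuple $(\mu_i, \sigma_i^2)$ represents the variational parameter set for weight $\theta_i$ of the network and $\phi = \{(\mu_i, \sigma_i^2)\}_i$. VI approximates the true intractable posterior by optimizing $\phi$ to minimize the Kullback-Leibler (KL) divergence between $q_\phi(\theta)$ and $p(\theta \mid \mathcal{D})$, which boils down to minimizing the negative evidence lower bound

$$\mathcal{L}(\phi) = -\sum_{n=1}^{N} \mathbb{E}_{q_\phi(\theta)}\big[\log p(\mathbf{y}_n \mid \theta, \mathbf{x}_n)\big] + \mathrm{KL}\big(q_\phi(\theta)\,\|\,p(\theta)\big).$$

In this formula, the first term on the r.h.s. maximizes the data fit (i.e. minimizes the reconstruction loss), and the second term penalizes divergence of the posterior from the prior, which builds the Occam's razor principle into the preferred solution.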
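The second term of the bound has a closed form when both the variational posterior and the prior are factorized normals. A minimal sketch follows, assuming a zero-mean normal prior with a fixed precision; the function names and the argument `alpha` are hypothetical and chosen only for this illustration.

```python
import math


def kl_normal_to_prior(mu, sigma2, alpha):
    """KL( N(mu, sigma2) || N(0, 1/alpha) ) for a single weight.

    Assumption for this sketch: the prior is zero-mean with precision alpha.
    """
    return 0.5 * (alpha * (sigma2 + mu * mu) - 1.0 - math.log(alpha * sigma2))


def kl_mean_field(mus, sigma2s, alpha):
    """KL term of the bound: the per-weight terms add up under mean-field."""
    return sum(kl_normal_to_prior(m, s, alpha) for m, s in zip(mus, sigma2s))
```

Because the posterior factorizes over weights, the total penalty is a plain sum, so no sampling is needed for this part of the loss.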
The modeler has control over the model families of both $q_\phi(\theta)$ and $p(\theta)$. Hence, choosing the prior $p(\theta)$ suitably to the normally distributed $q_\phi(\theta)$ assures an analytically tractable solution for the $\mathrm{KL}$ term. (We use $p(\theta) = \prod_i \mathcal{N}(\theta_i \mid 0, \alpha^{-1})$ with a fixed precision $\alpha$.) However, the data fit term involves a nonlinear neural net, which makes it difficult to keep the calculation of the expectations tractable. A major issue is that we need to differentiate this term with respect to the variational parameters $\phi$, which appear in the density with respect to which the expectation is taken. This problem is overcome by the reparameterization trick [Kingma and Welling2014], which re-formulates $q_\phi(\theta)$ as a sampling step from a parameter-free distribution and a deterministic variable transformation:

$$\mathcal{L}(\phi) = -\sum_{n=1}^{N} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\log p(\mathbf{y}_n \mid \theta = \mu + \sigma \odot \epsilon, \mathbf{x}_n)\big] + \mathrm{KL}\big(q_\phi(\theta)\,\|\,p(\theta)\big),$$

where the variational parameters $\phi$ now appear only inside the expectation term, so we can take the gradient of the loss with respect to them and approximate the integral in the expectation by Monte Carlo sampling. Although this provides an unbiased estimator of the exact gradient, the estimator perturbs the global variables of a highly parameterized system and therefore has prohibitively high variance. The remedy is to postpone sampling one step further.
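Let the pre-activation and feature map of the $j$th neuron of layer $l$ for data point $n$ be $b_{nj}^{l}$ and $h_{nj}^{l}$, respectively. We then have

$$\theta_{ij}^{l} \sim \mathcal{N}\big(\mu_{ij}^{l}, (\sigma_{ij}^{l})^2\big), \qquad b_{nj}^{l} = \sum_{i=1}^{I} \theta_{ij}^{l}\, h_{ni}^{l-1}$$

as a repeating operation at every layer transition within a BNN. (The same line of reasoning directly applies to convolutional layers, where the sum on $i$ is performed in a sliding window fashion.) As $h_{ni}^{l-1}$ is the sampling output of layer $l-1$, it is a deterministic input to layer $l$. Consequently, $b_{nj}^{l}$ is a weighted linear sum of $I$ independent normal random variables, which is another normal random variable with

$$b_{nj}^{l} \sim \mathcal{N}\Big(\sum_{i=1}^{I} \mu_{ij}^{l}\, h_{ni}^{l-1},\ \sum_{i=1}^{I} (\sigma_{ij}^{l})^2\, (h_{ni}^{l-1})^2\Big).$$

We now take separate samples for local variables, further reducing the estimator variance stemming from the Monte Carlo integration. This is known as Variational Dropout [Kingma et al.2015], as the process performed for the expected log-likelihood term is equivalent to Gaussian dropout with rate $(\sigma_{ij}^{l})^2 / (\mu_{ij}^{l})^2$ for weight $\theta_{ij}^{l}$.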
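The step of sampling local variables instead of weights can be sketched as follows for a single neuron with deterministic inputs. This is only an illustration under the mean-field normal assumption; the function names and the list-based layout are hypothetical.

```python
import math
import random


def preactivation_moments(h_prev, mu, sigma2):
    """Mean and variance of b = sum_i theta_i * h_i for theta_i ~ N(mu_i, sigma2_i)."""
    mean = sum(m * h for m, h in zip(mu, h_prev))
    var = sum(s * h * h for s, h in zip(sigma2, h_prev))
    return mean, var


def sample_preactivation(h_prev, mu, sigma2, rng):
    """Local variant: draw one noise variable per neuron."""
    mean, var = preactivation_moments(h_prev, mu, sigma2)
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)


def sample_via_weights(h_prev, mu, sigma2, rng):
    """Global variant: draw one noise variable per weight, then sum."""
    return sum(
        (m + math.sqrt(s) * rng.gauss(0.0, 1.0)) * h
        for m, s, h in zip(mu, sigma2, h_prev)
    )
```

Both samplers target the same normal distribution for the pre-activation; the local one needs a single draw per neuron rather than one per incoming weight.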
2.1.1 Fast Dropout and the CLT Trick
Fast Dropout [Wang and Manning2013] has been introduced as a technique to perform Gaussian dropout without taking samples. The technique builds on the observation that the pre-activation $b_{nj}^{l}$ of neuron $j$ in layer $l$ for data point $n$ is essentially a random variable that consists of a sum of a typically large number of other random variables. This directly invokes the Central Limit Theorem (CLT), under which the resulting distribution approaches a normal density that can be modeled simply by matching the first two moments

$$p(b_{nj}^{l}) \approx \mathcal{N}\big(b_{nj}^{l} \mid \mathbb{E}[b_{nj}^{l}],\ \mathrm{var}[b_{nj}^{l}]\big).$$

Here, the mean and the variance are computed from the variational parameters, as determined in Equation 4. We also require the first two moments of the feature map $h = r(b)$ over $b$, for which it suffices to solve

$$\mathbb{E}[h] = \int r(b)\, p(b)\, db, \qquad \mathbb{E}[h^2] = \int r(b)^2\, p(b)\, db.$$

These two are analytically tractable for standard choices of activation functions, such as when $r$ is the ReLU activation and $p(b)$ is a normal distribution [Frey and Hinton1999]. Note that $b$ is either the linear activation of the input layer, a weighted sum of normals and hence another normal, or a hidden layer, which will then similarly follow the CLT and therefore already be approximated as a normal. Hence, the above pattern repeats throughout the entire network, allowing a tight closed-form approximation of the analytical solution. Below, we refer to this method as the CLT trick.
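For the ReLU case mentioned above, the two moments have well-known closed forms. The sketch below computes them and pushes them through one moment-matching step. The layer step assumes that weights and incoming feature maps are independent and that the summands are independent of each other; `relu_moments` and `clt_layer` are hypothetical names, not taken from the reference implementation.

```python
import math


def _pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def relu_moments(mean, var):
    """E[relu(b)] and E[relu(b)^2] for b ~ N(mean, var)."""
    if var <= 0.0:  # deterministic input
        r = max(mean, 0.0)
        return r, r * r
    s = math.sqrt(var)
    z = mean / s
    m1 = mean * _cdf(z) + s * _pdf(z)
    m2 = (mean * mean + var) * _cdf(z) + mean * s * _pdf(z)
    return m1, m2


def clt_layer(h_m1, h_m2, mu, sigma2):
    """Moment-matched pre-activation of one neuron.

    h_m1, h_m2: first and second moments of the incoming feature maps.
    mu, sigma2: variational means and variances of the incoming weights.
    """
    mean = sum(m * a for m, a in zip(mu, h_m1))
    # var[theta * h] = E[theta^2] E[h^2] - (E[theta] E[h])^2 under independence
    var = sum(
        (s + m * m) * a2 - (m * a) ** 2
        for m, s, a, a2 in zip(mu, sigma2, h_m1, h_m2)
    )
    return mean, var
```

With deterministic inputs (second moment equal to the squared first moment) the variance reduces to the sum of weight variances times squared inputs, which matches the sampled formulation of the previous section.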
2.1.2 Closed-Form Uncertainty Estimation with BNNs
Fast Dropout uses the aforementioned CLT trick only for implementing dropout. Here we extend the same method to perform variational inference by minimizing a deterministic loss, i.e. avoiding Monte Carlo sampling altogether. Even though the CLT trick has previously been used mainly for expectation propagation, its direct application to variational inference has not been investigated prior to our work. Furthermore, the state of the art in deep active learning relies on test-time dropout [Gal et al.2017], which is computationally prohibitive. Speeding up this process requires parallel computing on the end product, which imposes additional costs on the user of the model that cannot be addressed at production time. A thus far overlooked aspect of the CLT trick is that it also allows closed-form calculation of the posterior predictive density. Once training is over, we get a factorized surrogate $q_\phi(\theta)$ for our posterior. Plugging this surrogate into the predictive density formula, for a new observation $(\mathbf{x}^*, \mathbf{y}^*)$ we get

$$p(\mathbf{y}^* \mid \mathbf{x}^*, \mathcal{D}) \approx \int p(\mathbf{y}^* \mid \theta, \mathbf{x}^*)\, q_\phi(\theta)\, d\theta \approx \prod_{c=1}^{C} \int \mathcal{B}er\big(y_c^* \mid \Phi(f_c)\big)\, \mathcal{N}\big(f_c \mid m_c(\mathbf{x}^*), v_c(\mathbf{x}^*)\big)\, df_c = \prod_{c=1}^{C} \mathcal{B}er\Big(y_c^* \,\Big|\, \Phi\Big(\frac{m_c(\mathbf{x}^*)}{\sqrt{1 + v_c(\mathbf{x}^*)}}\Big)\Big),$$

where the functions $m_c(\cdot)$ and $v_c(\cdot)$ encode the cumulative map from the input layer to the moments of the top-most layer after repetitive application of the CLT trick across the layers. (Once $m_c$ and $v_c$ are computed, one could also choose a categorical likelihood and approximate the integral via sampling.) As the predictive density is tightly approximated by an analytical calculation of a known distributional form, its high-order moments are readily available for downstream tasks, which is active learning in our case.
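The closed-form predictive step can be illustrated for a single output channel. The sketch assumes a probit link and a normal approximation of the top-layer activation with mean `m` and variance `v` (both names are assumptions of this example). It relies on the standard identity that the expectation of a standard normal CDF under a normal density is again a standard normal CDF, and checks it against a simple numerical integral.

```python
import math


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_predictive(m, v):
    """Closed form of the integral of Phi(f) N(f | m, v) over f."""
    return normal_cdf(m / math.sqrt(1.0 + v))


def probit_predictive_numeric(m, v, n=20001, width=10.0):
    """Trapezoidal approximation of the same integral, for comparison only."""
    s = math.sqrt(v)
    lo, hi = m - width * s, m + width * s
    step = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        f = lo + k * step
        weight = 0.5 if k in (0, n - 1) else 1.0
        density = math.exp(-0.5 * ((f - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        total += weight * normal_cdf(f) * density
    return total * step
```

A larger variance pulls the predicted probability toward one half, which is how weight uncertainty turns into less confident predictions without any sampling.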
2.2 Guide: The Policy Net
As opposed to the standard active learning pipeline, our method is capable of adapting its acquisition scheme to the characteristics of individual data distributions. In contrast to earlier work on data-driven label acquisition, our method can perform the adaptation on the fly, i.e. while the active learning labeling rounds take place. This adaptiveness is achieved within a reinforcement learning framework, where a policy net is trained by rewards observed from the environment. We denote the collection of unlabeled points by $\mathcal{D}_u$ and the labeled ones by $\mathcal{D}_l$. The variables $N_u$ and $N_l$ denote the number of data points in each respective case.
In active learning, the label acquisition process takes place on the entire unlabeled sample set. However, a feasible reinforcement learning setup necessitates a condensed state representation. To this end, we first rank the unlabeled sample set by an information-theoretic heuristic, namely the maximum entropy criterion. As such heuristics assign similar scores to samples with similar character, consecutive samples in the ranking inevitably have high correlation. In order to break the trend and enhance diversity, we follow the ranking from the top and pick every $K$th sample until we collect $M$ samples $\{\mathbf{x}_1, \ldots, \mathbf{x}_M\}$. We adopt the related term from the Markov Chain Monte Carlo sampling literature and refer to this process as thinning. Feeding these samples into our predictor BNN (Equation 9), we attain a posterior predictive density estimate for each and distill the state of the unlabeled sample space in the following distributional form:

$$S \sim \mathcal{N}\Big(S \,\Big|\, \big[m(\mathbf{x}_1), \ldots, m(\mathbf{x}_M)\big],\ \mathrm{diag}\big(\big[v(\mathbf{x}_1), \ldots, v(\mathbf{x}_M)\big]\big)\Big),$$

where $m(\cdot)$ and $v(\cdot)$ are the mean and variance of the activation of an output neuron, calculated as in Equation 9.
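The ranking and thinning step described above can be sketched in a few lines. The function name `thin_ranking` and its arguments `step` and `m` are hypothetical; `scores` stands for any per-sample acquisition score such as predictive entropy.

```python
def thin_ranking(scores, step, m):
    """Rank indices by descending score, then keep every `step`-th one, up to `m`."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[::step][:m]


# Usage: six unlabeled points, keep every second one of the ranking, two in total.
shortlist = thin_ranking([0.1, 0.9, 0.5, 0.7, 0.3, 0.8], step=2, m=2)
```

Skipping neighbors in the ranking trades a little score for diversity, since adjacent entries tend to be near-duplicates in terms of the heuristic.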
At each labeling round, a number of data points are sampled from the set according to the probability masses assigned on them by the present policy.
The straightforward reward would be the performance of the updated predictor on a separate validation set. This, however, clashes with the constraint imposed on us by the active learning scenario. The motivating assumption is that labels are valuable and scarce, so it is not feasible to construct a separate labeled validation set large enough to give the policy net a reliable estimate of test set performance from which to calculate rewards. In our preliminary experiments, we have observed that merging the validation set with the existing training set and performing active learning on the remaining sample set consistently provides a steeper learning curve than keeping a validation set for reward calculation. Hence, we abandon this option altogether. Instead, we propose a novel reward signal
consisting of the two components detailed below. The first component assesses the improvement in data fit of the chosen point once it has been labeled. From a Bayesian perspective, a principled measure of model fit is the marginal likelihood. For a newly labeled pair the reward is
where are our respective variational posteriors before and after training with the new point. The second component, , encourages diversity across the chosen labels throughout the whole labeling round:
The policy net is a second BNN with its own set of parameters. In contrast to the classifier, which takes deterministic data points as input, the policy net takes the state, which follows a multivariate normal distribution. Feeding such a stochastic input into our deterministic inference scheme is straightforward: we use the first two moments of the state during the first moment-matching round. The output of the policy net, in turn, parameterizes a categorical distribution over the possible actions, one per shortlisted sample. In order to benefit fully from the BNN character of the policy and to marginalize over its parameters, we again follow the approach we use for the classifier: we propagate the moments, first compute a binary probability $p_{t,m}$ for taking action $m$ at time point $t$, and finally normalize these probabilities to choose the action via

$$a_t \sim \mathcal{C}at(a_t \mid \pi_t), \qquad \pi_{t,m} = \frac{p_{t,m}}{\sum_{m'} p_{t,m'}},$$

where $\mathcal{C}at$ is a categorical distribution.
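The normalization and sampling step at the end of the policy can be sketched as below. The per-action binary probabilities are taken as given inputs here; the function names are hypothetical.

```python
import random


def action_distribution(binary_probs):
    """Normalize per-action binary probabilities into a categorical distribution."""
    total = sum(binary_probs)
    if total <= 0.0:
        raise ValueError("at least one action needs positive probability")
    return [p / total for p in binary_probs]


def sample_action(binary_probs, rng):
    """Draw the index of the next point to label from the normalized distribution."""
    pi = action_distribution(binary_probs)
    return rng.choices(range(len(pi)), weights=pi, k=1)[0]
```

Sampling rather than taking the arg max keeps the policy stochastic, which a score-function gradient method needs for exploration.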
We adopt the episodic version of the standard REINFORCE algorithm [Williams1992] to train our policy net. We use a moving average over all the past rewards as the baseline. A labeling episode consists of choosing a sequence of points to be labeled, with discounted rewards, after which the BNN is retrained and the policy net takes one update step. We parameterize the policy itself by a neural network. Our method iterates between labeling episodes, training the policy net, and training the BNN until the labeling budget is exhausted. The pseudocode of our method is given in Algorithm 1.
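The bookkeeping behind one episodic REINFORCE update can be sketched as follows: discounted returns per step, minus a running-mean baseline over all rewards seen so far. This is a simplified illustration rather than the training loop of the reference implementation; the names and the value of `gamma` in any call are assumptions.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step of an episode."""
    out = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


class RunningMeanBaseline:
    """Average over all rewards observed in past episodes."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, reward):
        self.count += 1
        self.mean += (reward - self.mean) / self.count


def reinforce_weights(rewards, gamma, baseline):
    """Per-step weights that multiply grad log pi(a_t | s_t) in the update."""
    weights = [g - baseline.mean for g in discounted_returns(rewards, gamma)]
    for r in rewards:  # the baseline only sees rewards after they were used
        baseline.update(r)
    return weights
```

Subtracting the baseline leaves the gradient estimate unbiased while lowering its variance, which matters when only a handful of episodes are affordable.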
3 Experiments
As RAL is the first method to adapt its acquisition function while active learning takes place, its natural reference model is the standard active learning setup with a fixed acquisition heuristic. (See github.com/manuelhaussmann/ral for a reference PyTorch implementation of the proposed model.) We choose the two most established information-theoretic heuristics: Maximum Entropy Sampling (Maxent) and BALD. Gal et al. [Gal et al.2017] already demonstrated that BNNs (in their case with fixed Bernoulli dropout) provide acquisition functions with a better signal than the predictive uncertainty of deterministic nets. We use our own BNN formulation for RAL as well as for these baseline acquisition functions, to give them access to the same closed-form predictive uncertainty and to ensure maximal comparability between our model and the baselines through an identical architecture and training procedure for all methods. For Maxent one selects the point that maximizes the predictive entropy,

$$\mathbf{x}^* = \arg\max_{\mathbf{x}}\ \mathbb{H}\big[\mathbf{y} \mid \mathbf{x}, \mathcal{D}\big],$$

while BALD chooses the point that maximizes the expected reduction in posterior entropy, or equivalently

$$\mathbf{x}^* = \arg\max_{\mathbf{x}}\ \mathbb{H}\big[\mathbf{y} \mid \mathbf{x}, \mathcal{D}\big] - \mathbb{E}_{p(\theta \mid \mathcal{D})}\big[\mathbb{H}[\mathbf{y} \mid \mathbf{x}, \theta]\big].$$

We can compute maximum entropy as well as the first of the two BALD terms in closed form, while we calculate the second term of BALD via a sampling-based approach. We also include random sampling as a baseline, which is rather competitive on our kind of data, and evaluate on three data sets to show the adaptability of the method to the problem at hand.
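Both baseline scores can be illustrated on Monte Carlo samples of class probabilities for a single candidate point. This sketch estimates both BALD terms from samples for simplicity, whereas the text computes the first term in closed form; the function names and the list-of-lists input format are assumptions of the example.

```python
import math


def entropy(probs):
    """Shannon entropy of a discrete distribution (zero-probability terms skipped)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predictive_entropy(samples):
    """Maxent score: entropy of the sample-averaged predictive distribution."""
    n = len(samples)
    mean_probs = [sum(s[c] for s in samples) / n for c in range(len(samples[0]))]
    return entropy(mean_probs)


def bald_score(samples):
    """BALD score: predictive entropy minus the expected per-sample entropy."""
    expected = sum(entropy(s) for s in samples) / len(samples)
    return predictive_entropy(samples) - expected
```

When all parameter samples agree, the BALD score vanishes even if the shared prediction is uncertain, which is exactly where it departs from Maxent.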
To evaluate the performance of the proposed pipeline, we take as the predictor a standard LeNet5-sized model (two convolutional layers of 20 and 50 channels and two linear layers of 500 and 10 neurons) and as the guide a policy net consisting of two layers with 500 hidden neurons. We use three different image classification data sets to simulate varying difficulty while keeping the architectures and hyperparameters fixed. MNIST serves as a simple data set containing greyscale digits, FashionMNIST is a more difficult data set of greyscale clothing objects, and CIFAR-10, which requires the classification of colored natural images, is a very difficult data set given the small classifier depth. The assumption of active learning that labels are scarce and expensive also means that a large separate validation set for evaluating and fine-tuning hyperparameters is not feasible. Both nets are optimized via Adam [Kingma and Ba2015] using its suggested hyperparameters. In line with this assumption, all hyperparameters remain fixed to the common defaults in the literature. The predictor is trained for 30 epochs between labeling rounds (labeling five points per round), while the policy net gets one update step after each round. To simulate the need to avoid costly retraining after each labeling round, the predictor net is initialized to the learned parameters from the previous round, with a learning rate that decreases linearly after each round. In each experiment the state is constructed by ranking the unlabeled data according to their predictive entropy and then taking every twentieth point until the target number of points is reached. Since all three data sets pose a ten-class classification problem, the resulting input to the policy network is a normal distribution with ten dimensions per shortlisted point. We stop after having collected 400 points, starting from an initial set of 50 data points.
We summarize the results in Table 1. RAL can learn to adapt itself to the data set at hand and can always outperform the baselines. Note that our central goal in these experiments is to evaluate the relative performance of RAL and the baselines, not the absolute performance. For a real-world application one would use deeper architectures for more complex data sets, incorporate pretrained networks from similar labeled data sets, and use data augmentation to make maximal use of the small labeled data. Further benefits would come from using semi-supervised information, e.g. by assigning pseudo-labels to data points to which the classifier assigns a high predictive certainty [Wang et al.2017]. Such approaches would significantly improve the classifier performance for all models, but since they would blur the contribution of the respective acquisition function, we consciously ignore them here.
Note that although RAL uses a thinned Maxent ranking to generate its state, it can improve upon that strategy in every case. An ablation study showed that while thinning can improve over plain Maxent in some settings when used as a fixed strategy, it is not sufficient to explain the final performance difference between RAL and Maxent. REINFORCE owes its success to the bulk filtering step, which substantially facilitates the RL problem by filtering out a large portion of the search space. The simplified problem can thus be improved within a small number of episodes. More interactions with the environment would certainly bring further improvement at the expense of increasing labeling cost. We present here only a proof of concept showing that learning from this feedback can improve on feedback-free AL even within limited interaction rounds. Further algorithmic improvements are worth investigating as future work, such as applying TRPO [Schulman et al.2015] or PPO [Schulman et al.2017] in place of vanilla REINFORCE.
4 Related Work
AL methods have long been based on hard-coded, hand-designed acquisition heuristics (see [Settles2012] for a review). A first extension is to not limit oneself to one heuristic, but to learn how to choose between multiple ones, e.g. by a bandit algorithm [Baram et al.2004, Chu and Lin2016, Hsu and Lin2015] or a reinforcement learning formulation [Ebert et al.2012]. However, this still suffers from the problem of being limited to existing heuristics.
A further step to gain more flexibility is to formulate the problem as a meta-learning approach. The general idea [Fang et al.2017, Konyushkova et al.2017, Pang et al.2018] is to use a set of labeled data sets to learn a general acquisition function that can either be applied as is to the target data set or fine-tuned on a sufficiently similar set of labeled data. Our approach differs from those attempts insofar as we learn the acquisition function based solely on the target data distribution while the data is labeled. If we take the scarcity of labels seriously, we cannot allow ourselves the luxury of a separate large validation set to adapt a general heuristic. Moreover, approaches that rely on a sufficiently large separate validation set could not outperform a simple ablation in which simpler acquisition functions, which need no separate data set, instead merge that set into the labeled data they train on. This is simply because, as long as little labeled data is available, the gain from learning from extra data tends to outweigh the benefit of a more complicated acquisition function, and as soon as data becomes more abundant the effectiveness of any active learning method drops sharply. We therefore discard them from comparative analysis. A related area is the field of metareasoning [Callaway et al.2018], where an agent has to learn how to request based on a limited computational budget.
Alongside the sampling-based alternatives for BNN inference, which are already abundant and standardized [Blundell et al.2015, Gal and Ghahramani2016, Kingma et al.2015, Louizos et al.2017, Molchanov et al.2017], deterministic inference techniques are also emerging. While direct adaptations of expectation propagation (EP) are the earliest of such methods [Gast and Roth2018, Hernández-Lobato and Adams2015], they have not yet been widely adopted due to their relative instability in training. This problem arises from the fact that EP does not provide any convergence guarantees, hence an update might either improve or deteriorate the model fit even on the training data. In contrast, variational inference maximizes a lower bound on the log-marginal likelihood. Early studies exist on deterministic variational inference of BNNs [Haussmann et al.2019, Wu et al.2019]. However, neither quantifies the uncertainty quality by using the posterior predictive of their models for a downstream application. Earlier work that performs active learning with BNNs does exist [Hernández-Lobato and Adams2015, Gal et al.2017, Depeweg et al.2018]. However, all of these studies use hard-coded acquisition heuristics.
Our state construction method that forms a normal distribution from the posterior predictives of data points shortlisted by a bootstrap acquisition criterion is novel for the active learning setting. Yet, it has strong links to model-based reinforcement learning methods that propagate uncertainties through one-step predictors along the time axis [Deisenroth and Rasmussen2011].
5 Conclusion
We introduce a new reinforcement-based method for learning the labeling criterion. It is able to learn how to choose points in parallel to the labeling process itself, instead of requiring large, already labeled subsets to learn from in an offline setting beforehand. We achieve this by formulating the classification net, the policy net, and the state probabilistically. We demonstrate its ability to adapt to a variety of qualitatively different data sets, performing on par with or even outperforming handcrafted heuristics. In future work, we plan to extend the policy net with a density estimator that models the input data distribution so that it can also take the underlying geometry into account, making it less dependent on the quality of the probabilities.
- [Baram et al.2004] Y. Baram, R. Yaniv, and K. Luz. Online choice of active learning algorithms. JMLR, 2004.
- [Blundell et al.2015] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural network. In ICML, 2015.
- [Callaway et al.2018] F. Callaway, S. Gul, P.M. Krueger, T.L. Griffiths, and F. Lieder. Learning to select computations. 2018.
- [Chu and Lin2016] H. Chu and H. Lin. Can active learning experience be transferred? In ICDM, 2016.
- [Deisenroth and Rasmussen2011] M. Deisenroth and C. E. Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In ICML, 2011.
- [Depeweg et al.2018] S. Depeweg, J. Hernandez-Lobato, F. Doshi-Velez, and S. Udluft. Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In ICML, 2018.
- [Ebert et al.2012] S. Ebert, M. Fritz, and B. Schiele. Ralf: A reinforced active learning formulation for object class recognition. In CVPR, 2012.
- [Fang et al.2017] M. Fang, Y. Li, and T. Cohn. Learning how to active learn: A deep reinforcement learning approach. In EMNLP, 2017.
- [Frey and Hinton1999] B. J. Frey and G. E. Hinton. Variational learning in nonlinear gaussian belief networks. Neural Computation, 1999.
- [Gal and Ghahramani2016] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
- [Gal et al.2017] Y. Gal, R. Islam, and Z. Ghahramani. Deep Bayesian active learning with image data. In ICML, 2017.
- [Gast and Roth2018] J. Gast and S. Roth. Lightweight probabilistic deep networks. In CVPR, 2018.
- [Haussmann et al.2019] M. Haussmann, F.A. Hamprecht, and M. Kandemir. Sampling-free variational inference for bayesian neural networks by variance backpropagation. In UAI, 2019.
- [Hernández-Lobato and Adams2015] J.M. Hernández-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. In ICML, 2015.
- [Houlsby et al.2012] N. Houlsby, F. Huszar, Z. Ghahramani, and J.M Hernández-Lobato. Collaborative gaussian processes for preference learning. In NIPS, 2012.
- [Hsu and Lin2015] W. Hsu and H. Lin. Active learning by learning. In AAAI, 2015.
- [Kingma and Ba2015] D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- [Kingma and Welling2014] D.P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
- [Kingma et al.2015] D.P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In NIPS, 2015.
- [Konyushkova et al.2017] K. Konyushkova, R. Sznitman, and P. Fua. Learning active learning from data. In NIPS, 2017.
- [Louizos et al.2017] C. Louizos, K. Ullrich, and M. Welling. Bayesian compression for deep learning. In NIPS, 2017.
- [Molchanov et al.2017] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. In ICML, 2017.
- [Pang et al.2018] K. Pang, M. Dong, Y. Wu, and T. Hospedales. Meta-learning transferable active learning policies by deep reinforcement learning. arXiv preprint, 2018.
- [Qiu et al.2017] Z. Qiu, D.J. Miller, and G. Kesidis. A maximum entropy framework for semisupervised and active learning with unknown and label-scarce classes. IEEE transactions on neural networks and learning systems, 2017.
- [Schulman et al.2015] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In ICML, 2015.
- [Schulman et al.2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint, 2017.
- [Sener and Savarese2018] O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In CVPR, 2018.
- [Settles2012] B. Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2012.
- [Srinivas et al.2012] N. Srinivas, A. Krause, S.M. Kakade, and M.W. Seeger. Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 2012.
- [Wang and Manning2013] S. Wang and C. Manning. Fast dropout training. In ICML, 2013.
- [Wang et al.2017] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
- [Williams1992] R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 1992.
- [Wu et al.2019] A. Wu, S. Nowozin, E. Meeds, R. E. Turner, J.M. Hernández-Lobato, and A.L. Gaunt. Deterministic variational inference for robust bayesian neural networks. ICLR, 2019.