1 Introduction
Much of machine learning’s success stems from leveraging the wealth of data produced in recent years. However, in many cases expert knowledge is needed to provide labels, and access to these experts is limited by time and cost constraints. For example, cameras could easily provide images of the many fish that inhabit a coral reef, but an ichthyologist would be needed to properly label each fish with the relevant biological information. In such settings, active learning (AL) (Settles, 2012) enables data-efficient model training by intelligently selecting points for which labels should be requested.
Taking a Bayesian perspective, a natural approach to AL is to choose the set of points that maximally reduces the uncertainty in the posterior over model parameters (MacKay, 1992). Unfortunately, solving this combinatorial optimization problem is NP-hard. Most AL methods iteratively solve a greedy approximation, e.g. using maximum entropy (Shannon, 1948) or maximum information gain (MacKay, 1992; Houlsby et al., 2011). These approaches alternate between querying a single data point and updating the model, until the query budget is exhausted. However, as we discuss below, such greedy methods have severe limitations in modern machine learning applications, where datasets are massive and models often have millions of parameters.
A possible remedy is to select an entire batch of points at every AL iteration. Batch AL approaches dramatically reduce the computational burden caused by repeated model updates, while resulting in much more significant learning updates. Unfortunately, naively constructing a batch using traditional acquisition functions still leads to highly correlated queries (sener2018active), i.e. a large part of the budget is spent on repeatedly choosing nearby points. Despite recent interest in batch methods (sener2018active; elhamifar2013convex; guo2010active; yang2015multi), there currently exists no principled, scalable Bayesian batch AL algorithm.
In this paper, we propose a novel Bayesian batch AL approach that mitigates these issues. The key idea is to recast batch construction as optimizing a sparse subset approximation to the log posterior induced by the full dataset. This formulation of AL is inspired by recent work on Bayesian coresets (huggins2016coresets; campbell2017automated). We leverage these similarities and use the Frank-Wolfe algorithm (frank1956algorithm) to enable efficient Bayesian AL at scale. We derive interpretable closed-form solutions for linear and logistic regression models, revealing close connections to existing AL methods in these cases. By using random projections, we further generalize our algorithm to work with any model with a tractable likelihood. We demonstrate the benefits of our approach on several large-scale regression and classification tasks.
2 Background
We consider discriminative models parameterized by , mapping from inputs to a distribution over outputs . Given a labeled dataset , the learning task consists of performing inference over the parameters to obtain the posterior distribution . In the AL setting (Settles, 2012), the learner is allowed to choose the data points from which it learns. In addition to the initial dataset , we assume access to
an unlabeled pool set , and
an oracle labeling mechanism which can provide labels for the corresponding inputs.
Probabilistic AL approaches choose points by considering the posterior distribution of the model parameters. Without any budget constraints, we could query the oracle times, yielding the complete data posterior through Bayes’ rule,
(1) 
where here plays the role of the prior. While the complete data posterior is optimal from a Bayesian perspective, in practice we can only select a subset, or batch, of points due to budget constraints. From an information-theoretic perspective (MacKay, 1992), we want to query points that are maximally informative, i.e. minimize the expected posterior entropy,
(2) 
where is a query budget. Solving Eq. 2 directly is intractable, as it requires considering all possible subsets of the pool set. As such, most AL strategies follow a myopic approach that iteratively chooses a single point until the budget is exhausted. Simple heuristics, e.g. maximizing the predictive entropy (MaxEnt), are often employed (gal2017deep; sener2018active). Houlsby et al. (2011) propose BALD, a greedy approximation to Eq. 2 which seeks the point that maximizes the decrease in expected entropy:
(3) 
While such greedy strategies can be near-optimal in certain cases (golovin2011adaptive; dasgupta2005analysis), they become severely limited in large-scale settings. In particular, it is computationally infeasible to retrain the model after every acquired data point, e.g. retraining a ResNet (he2016deep) thousands of times is clearly impractical. Even if such an approach were feasible, the addition of a single point to the training set is likely to have a negligible effect on the parameter posterior distribution (sener2018active). Since the model changes only marginally after each update, subsequent queries result in acquiring similar points in data space. As a consequence, there has been renewed interest in finding tractable batch AL formulations. Perhaps the simplest approach is to naively select the highest-scoring points according to a standard acquisition function. However, such naive batch construction methods still result in highly correlated queries (sener2018active). This issue is highlighted in Fig. 1, where both MaxEnt (Fig. 1) and BALD (Fig. 1) expend a large part of the budget on repeatedly choosing nearby points.
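The correlated-query problem of naive batch construction can be illustrated with a minimal sketch. Everything here is an illustrative assumption rather than the setup behind Fig. 1: the toy one-dimensional pool, the fixed logistic predictor, and the helper names `predictive_entropy` and `naive_top_b` are all hypothetical.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of each row of an (N, C) array of class probabilities."""
    p = np.clip(probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=1)

def naive_top_b(scores, b):
    """Indices of the b highest-scoring pool points (no diversity term)."""
    return np.argsort(-scores, kind="stable")[:b]

rng = np.random.default_rng(0)
# Toy pool: 20 near-duplicate points at the decision boundary (x ~ 0)
# followed by 80 points spread over a region the model is fairly sure about.
pool_x = np.concatenate([rng.normal(0.0, 0.01, 20), rng.uniform(2.0, 6.0, 80)])
p1 = 1.0 / (1.0 + np.exp(-pool_x))          # fixed logistic predictor (assumed)
probs = np.stack([1.0 - p1, p1], axis=1)

scores = predictive_entropy(probs)          # MaxEnt acquisition scores
batch = naive_top_b(scores, b=5)
print(sorted(batch.tolist()))               # all indices fall in the tight cluster
```

Because the scores are computed once and never updated during batch construction, every selected index comes from the same tight cluster, which is the behaviour the paragraph above describes.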
3 Bayesian batch active learning as sparse subset approximation
We propose a novel probabilistic batch AL algorithm that mitigates the issues mentioned above. Our method generates batches that cover the entire data manifold (Fig. 1), and, as we will show later, are highly effective for performing posterior inference over the model parameters. Note that while our approach alternates between acquiring data points and updating the model for several iterations in practice, we restrict the derivations hereafter to a single iteration for simplicity.
The key idea behind our batch AL approach is to choose a batch , such that the updated log posterior best approximates the complete data log posterior . In AL, we do not have access to the labels before querying the pool set. We therefore take expectation w.r.t. the current predictive posterior distribution . The expected complete data log posterior is thus
(4) 
where the second equality assumes conditional independence of the outputs given the corresponding inputs. This assumption holds for the type of factorized predictive posteriors we consider, e.g. as induced by Gaussian or Multinomial likelihood models.
Batch construction as sparse approximation
Taking inspiration from Bayesian coresets (huggins2016coresets; campbell2017automated), we recast Bayesian batch construction as a sparse approximation to the expected complete data log posterior. Since the first term in Eq. 4 only depends on , it suffices to choose the batch that best approximates . Similar to campbell2017automated, we view and as vectors in function space. Letting be a weight vector indicating which points to include in the AL batch, and denoting (with slight abuse of notation), we convert the problem of constructing a batch to a sparse subset approximation problem, i.e.
(5) 
Intuitively, Eq. 5 captures the key objective of our framework: a “good” approximation to implies that the resulting posterior will be close to the (expected) posterior had we observed the complete pool set. Since solving Eq. 5 is generally intractable, in what follows we propose a generic algorithm to efficiently find an approximate solution.
Inner products and Hilbert spaces
We propose to construct our batches by solving Eq. 5 in a Hilbert space induced by an inner product between function vectors, with associated norm . Below, we discuss the choice of specific inner products. Importantly, this choice introduces a notion of directionality into the optimization procedure, enabling our approach to adaptively construct query batches while implicitly accounting for similarity between selected points.
Frank-Wolfe optimization
To approximately solve the optimization problem in Eq. 5 we follow the work of campbell2017automated, i.e. we relax the binary weight constraint to be nonnegative and replace the cardinality constraint with a polytope constraint. Let , , and be a kernel matrix with . The relaxed optimization problem is
(6) 
where we used . The polytope has vertices and contains the point . Eq. 6 can be solved efficiently using the Frank-Wolfe algorithm (frank1956algorithm), yielding the optimal weights after iterations. The complete AL procedure, Active Bayesian CoreSets with Frank-Wolfe optimization (ACS-FW), is outlined in Appendix A (see Algorithm A.1). The key computation in Algorithm A.1 (Line 6) is
(7) 
which only depends on the inner products and norms . At each iteration, the algorithm greedily selects the vector most aligned with the residual error . The weights are then updated according to a line search towards the selected vertex of the polytope (recall that the optimum of a linear objective over a polytope, as arises in each Frank-Wolfe step for Eq. 6, is attained at a vertex), which by construction is the coordinate unit vector. This corresponds to adding at most one data point to the batch in every iteration. Since the algorithm is allowed to reselect indices from previous iterations, the resulting weight vector has nonzero entries. Empirically, we find that this property leads to smaller batches as more data points are acquired.
Since it is nontrivial to leverage the continuous weights returned by the Frank-Wolfe algorithm in a principled way, the final step of our algorithm is to project the weights back to the feasible space, i.e. set if , and otherwise. While this projection step increases the approximation error, we show in Section 7 that our method is still effective in practice. We leave the exploration of alternative optimization procedures that do not require this projection step to future work.
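The Frank-Wolfe iteration and the final projection described above can be sketched as follows. This is a minimal illustration, not Algorithm A.1 itself: it assumes the relaxed objective can be written as `(1 - w)^T K (1 - w)` for a kernel matrix `K` of pairwise inner products with a strictly positive diagonal, that the polytope vertices are the coordinate vectors scaled by `sigma / sigma_n`, and that the step size comes from an exact line search clipped to `[0, 1]`. The names `acs_fw_weights`, `binarize` and `objective` are hypothetical.

```python
import numpy as np

def objective(K, w):
    """Squared norm of the residual L - L(w), written via the kernel matrix."""
    r = 1.0 - w
    return float(r @ K @ r)

def acs_fw_weights(K, b):
    """Run b Frank-Wolfe steps starting from w = 0 (assumes diag(K) > 0)."""
    N = K.shape[0]
    sigmas = np.sqrt(np.diag(K))              # per-point norms
    sigma = sigmas.sum()
    w = np.zeros(N)
    for _ in range(b):
        residual = K @ (1.0 - w)              # <L - L(w), L_n> for every n
        f = int(np.argmax(residual / sigmas)) # direction most aligned with residual
        vertex = np.zeros(N)
        vertex[f] = sigma / sigmas[f]         # assumed polytope vertex for index f
        d = vertex - w                        # search direction
        denom = float(d @ K @ d)
        if denom <= 0.0:
            break
        gamma = float(np.clip((d @ residual) / denom, 0.0, 1.0))  # line search
        w = (1.0 - gamma) * w + gamma * vertex
    return w

def binarize(w):
    """Project continuous weights back to {0, 1}: keep every point with w_n > 0."""
    return (w > 0).astype(int)

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 8))           # toy stand-in for the function vectors
K = features @ features.T
w = acs_fw_weights(K, b=10)
batch_mask = binarize(w)
print(int(batch_mask.sum()), objective(K, w) <= objective(K, np.zeros(50)))
```

Each step touches at most one new coordinate, so the binarized batch contains at most `b` points, and it can contain fewer when an index is selected more than once.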
Choice of inner products
We employ weighted inner products of the form , where we choose to be the current posterior . We consider two specific inner products with desirable analytical and computational properties; however, other choices are possible. First, we define the weighted Fisher inner product (johnson2004fisher; campbell2017automated)
(8) 
which is reminiscent of information-theoretic quantities but requires taking gradients of the expected log-likelihood terms w.r.t. the parameters (footnote 1: the entropy term in (see Eq. 4) vanishes under this norm as the gradient for is zero). In Section 4, we show that for specific models this choice leads to simple, interpretable expressions that are closely related to existing AL procedures.
An alternative choice that lifts the restriction of having to compute gradients is the weighted Euclidean inner product, which considers the marginal likelihood of data points (campbell2017automated),
(9) 
The key advantage of this inner product is that it only requires tractable likelihood computations. In Section 5 this will prove highly useful in providing a black-box method for these computations in any model (that has a tractable likelihood) using random feature projections.
Method overview
In summary, we
consider the in Eq. 4 as vectors in function space and recast batch construction as a sparse approximation to the full data log posterior from Eq. 5;
replace the cardinality constraint with a polytope constraint in a Hilbert space, and relax the binary weight constraint to nonnegativity;
solve the resulting optimization problem in Eq. 6 using Algorithm A.1;
construct the AL batch by including all points with .
4 Analytic expressions for linear models
In this section, we use the weighted Fisher inner product from Eq. 8 to derive closed-form expressions of the key quantities of our algorithm for two types of models: Bayesian linear regression and logistic regression. Although the considered models are relatively simple, they can be used flexibly to construct more powerful models that still admit closed-form solutions. For example, in Section 7 we demonstrate how using neural linear models (wilson2016deep; riquelme2018deep) allows us to perform efficient AL on several regression tasks. We consider arbitrary models and inference procedures in Section 5.
Linear regression
Consider the following model for scalar Bayesian linear regression,
(10) 
where is a factorized Gaussian prior with unit variance; extensions to richer Gaussian priors are straightforward. Given a labeled dataset , the posterior is given in closed form as with . For this model, a closed-form expression for the inner product in Eq. 8 is
(11) 
where is chosen to be the posterior . See Section B.1 for details on this derivation. We can make a direct comparison with BALD (MacKay, 1992; Houlsby et al., 2011) by treating the squared norm of a data point with itself as a greedy acquisition function (footnote 2: we only introduce to compare to other acquisition functions; in practice we use Algorithm A.1), , yielding,
(12) 
The two functions share the term , but BALD wraps the term in a logarithm whereas scales it by . The magnitude term implies that has connections to leverage scores (drineas2012fast; ma2015statistical; derezinski2018leveraged), which are used when subsampling for linear regression. In this literature, the squared norm quantifies the degree to which those covariates influence the least-squares solution. Therefore, the feature vectors with the largest norms (the most leverage) should be kept when subsampling the data. Hence, we can interpret as augmenting BALD with a leverage score. Further, ting2018optimal show that leverage scores are equivalent to influence functions, i.e. the likelihood score scaled by the inverse Fisher information, when the regression responses are integrated out. Thus, can also be viewed as combining (expected) influence functions and BALD. Dropping from makes the two quantities monotonically related, and thus equivalent under a greedy maximizer.
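The comparison between the two acquisition functions can be made concrete with a small numerical sketch. The formulas below are assumptions, since the equations are not reproduced here: BALD is taken to be the standard mutual information for Bayesian linear regression, `0.5 * log(1 + x^T Sigma x / noise_var)`, and the ACS score is taken to be `(x^T x / noise_var^2) * x^T Sigma x`, which is what a Fisher inner product of the form `E[grad L_n^T grad L_m]` yields under a unit-variance Gaussian prior. The names `blr_posterior`, `alpha_bald`, `alpha_acs` and `noise_var` are hypothetical.

```python
import numpy as np

def blr_posterior(X, y, noise_var):
    """Posterior N(mu, Sigma) over weights with prior N(0, I) and Gaussian noise."""
    d = X.shape[1]
    Sigma = np.linalg.inv(np.eye(d) + X.T @ X / noise_var)
    mu = Sigma @ X.T @ y / noise_var
    return mu, Sigma

def shared_term(Xp, Sigma):
    """x^T Sigma x for every pool point: the term both scores share."""
    return np.einsum("nd,de,ne->n", Xp, Sigma, Xp)

def alpha_bald(Xp, Sigma, noise_var):
    return 0.5 * np.log1p(shared_term(Xp, Sigma) / noise_var)

def alpha_acs(Xp, Sigma, noise_var, use_norm=True):
    sq_norm = np.sum(Xp ** 2, axis=1) if use_norm else 1.0   # leverage-like factor
    return sq_norm * shared_term(Xp, Sigma) / noise_var ** 2

rng = np.random.default_rng(0)
noise_var = 0.5
X0 = rng.normal(size=(10, 3))                 # initial labeled inputs (toy data)
y0 = X0 @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=noise_var ** 0.5, size=10)
Xp = rng.normal(size=(100, 3))                # unlabeled pool (toy data)

_, Sigma = blr_posterior(X0, y0, noise_var)
bald = alpha_bald(Xp, Sigma, noise_var)
acs = alpha_acs(Xp, Sigma, noise_var)
acs_no_norm = alpha_acs(Xp, Sigma, noise_var, use_norm=False)
print(int(np.argmax(bald)), int(np.argmax(acs_no_norm)), int(np.argmax(acs)))
```

Without the squared-norm factor, the ACS score is an increasing function of the same quantity BALD depends on, so a greedy maximizer picks the same point under both; with the factor, the two selections may differ.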
Logistic regression
Consider the following model for Bayesian logistic regression,
(13) 
where we again assume is a factorized Gaussian with unit variance. For this model, the exact parameter posterior distribution is intractable due to the nonlinear likelihood. We assume an approximation of the form , e.g. obtained by using a Laplace approximation or expectation propagation (bishop2006pattern). Since the posterior predictive is also intractable in this setting, we use the standard probit approximation with representing the standard Normal cumulative distribution function (cdf) (murphy2012machine). This choice yields a closed-form solution for Eq. 8,
(14) 
where is the bivariate Normal cdf. We again view as an acquisition function and rewrite Eq. 14 as
(15) 
where is Owen’s T function (owen1956tables). See Section B.2 for the full derivation of Eqs. 15 and 14. Eq. 15 has a simple and intuitive form that accounts for the magnitude of the input vector (again establishing a connection to leverage scores (drineas2012fast)), and a regularized term for the predictive variance.
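The probit approximation to the posterior predictive used in this section can be checked numerically. The sketch below assumes one common form of that approximation: the logistic sigmoid is replaced by a scaled Normal cdf with squared scale `pi / 8`, so that the Gaussian integral becomes `Phi(mean / sqrt(8 / pi + var))`. Whether this matches the exact expression intended here is an assumption, and the function names are hypothetical.

```python
import math
import numpy as np

LAMBDA_SQ = math.pi / 8.0   # sigmoid(a) is approximated by Phi(lambda * a)

def std_normal_cdf(z):
    """Standard Normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_predictive(mean, var):
    """Approximate E[sigmoid(a)] for a ~ N(mean, var) in closed form."""
    return std_normal_cdf(mean / math.sqrt(1.0 / LAMBDA_SQ + var))

def monte_carlo_predictive(mean, var, num_samples=200_000, seed=0):
    """Reference value obtained by sampling the logit a and averaging sigmoid(a)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(mean, math.sqrt(var), size=num_samples)
    return float(np.mean(1.0 / (1.0 + np.exp(-a))))

# (mean, variance) of the logit under the approximate posterior; toy values
cases = [(0.0, 1.0), (1.0, 2.0), (-2.0, 0.5)]
gaps = [abs(probit_predictive(m, v) - monte_carlo_predictive(m, v)) for m, v in cases]
print([round(g, 3) for g in gaps])
```

The closed form avoids sampling entirely, which is what makes analytic expressions for the inner products possible in the logistic case.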
5 Random projections for nonlinear models
In Section 4, we have derived closed-form expressions of the weighted Fisher inner product for two specific types of models. However, this approach suffers from two shortcomings. First, it is limited to models for which the inner product can be evaluated in closed form, e.g. linear regression or probit regression. Second, the resulting algorithm requires computations to construct a batch, restricting our approach to moderately sized pool sets.
We address both of these issues using random feature projections, allowing us to approximate the key quantities required for the batch construction. In Algorithm A.2, we introduce a procedure that works for any model with a tractable likelihood, scaling only linearly in the pool set size . To keep the exposition simple, we consider models in which the expectation of w.r.t. is tractable, but we stress that our algorithm could work with sampling for that expectation as well.
While it is easy to construct a projection for the weighted Fisher inner product (campbell2017automated), its dependence on the number of model parameters through the gradient makes it difficult to scale to more complex models. We therefore only consider projections for the weighted Euclidean inner product from Eq. 9, which we found to perform comparably in practice. The appropriate projection is (campbell2017automated)
(16) 
i.e. represents the dimensional projection of in Euclidean space. Given this projection, we are able to approximate inner products as dot products between vectors,
(17) 
where can be viewed as an unbiased sample estimator of using Monte Carlo samples from the posterior . Importantly, Eq. 16 can be calculated for any model with a tractable likelihood. Since in practice we only require inner products of the form , batches can be efficiently constructed in time. As we show in Section 7, this enables us to scale our algorithm up to pool sets comprising hundreds of thousands of examples.
6 Related work
Bayesian AL approaches attempt to query points that maximally reduce model uncertainty. Common heuristics for this intractable problem greedily choose points where the predictive posterior is most uncertain, e.g. maximum variance and maximum entropy (Shannon, 1948), or that maximally improve the expected information gain (MacKay, 1992; Houlsby et al., 2011). Since it is unclear how to extend these methods to the batch setting in a principled way, their scalability is limited. Recent work on improving inference for AL with deep probabilistic models (hernandez2015probabilistic; gal2017deep) used datasets with at most data points and few model updates.
There has been great interest in batch AL recently. The literature is dominated by non-probabilistic methods, which commonly trade off diversity and uncertainty. Many approaches are model-specific, e.g. for linear regression (yu2006active), logistic regression (hoi2006batch; guo2008discriminative), and k-nearest neighbors (wei2015submodularity); our method works for any model with a tractable likelihood. Others (elhamifar2013convex; guo2010active; yang2015multi) follow optimization-based approaches that require optimization over a large number of variables. As these methods scale quadratically with the number of data points, they are limited to smaller pool sets. Probabilistic batch methods mostly focus on Bayesian optimization problems (e.g. azimi2010batch; azimi2012hybrid; gonzalez2016batch). While AL with non-parametric models (kapoor2007active) could benefit from that work, scaling such models to large datasets remains challenging. Our work therefore provides the first principled, scalable and model-agnostic Bayesian batch AL approach.
Similar to us, sener2018active formulate AL as a coreset selection problem. They construct batches by solving a k-center problem, attempting to minimize the maximum distance to one of the queried data points. Since this approach heavily relies on the geometry in data space, it requires an expressive feature representation. For example, sener2018active only consider ConvNet representations learned on highly structured image data. In contrast, our work is inspired by Bayesian coresets (huggins2016coresets; campbell2017automated), which enable scalable Bayesian inference by approximating the log-likelihood of a labeled dataset with a sparse weighted subset thereof. Consequently, our method is less reliant on a structured feature space and only requires evaluating log-likelihood terms.
7 Experiments and results
We perform experiments to answer the following questions: (1) does our approach avoid correlated queries, (2) is our method competitive with greedy methods in the small-data regime, and (3) does our method scale to large datasets and models? We address questions (1) and (2) on several linear and logistic regression tasks using the closed-form solutions derived in Section 4, and question (3) on large-scale regression and classification datasets by leveraging the projections from Section 5. Full experimental details are deferred to Appendix C.
Does our approach avoid correlated queries?
In Fig. 1, we have seen that traditional AL methods are prone to correlated queries. To investigate this further, in Footnote 4 we compare batches selected by ACS-FW and BALD on a simple logistic regression task. Since BALD has no explicit batch construction mechanism, we naively choose the most informative points according to BALD. While the BALD acquisition function does not change during batch construction, rotates after each selected data point. This provides further intuition about why ACS-FW is able to spread the batch in data space, avoiding the strongly correlated queries that BALD produces.
Is our method competitive with greedy methods in the small-data regime?
We evaluate the performance of ACS-FW on several UCI regression datasets. We compare against maximum entropy (MaxEnt) and Random (footnote 5: MaxEnt and BALD (MacKay, 1992; Houlsby et al., 2011) are equivalent in this case). Starting with labeled points sampled randomly from the pool set, we use each AL method to iteratively grow the training dataset by requesting batches of size until the budget of queries is exhausted. To guarantee fair comparisons, all methods use the same neural linear model, i.e. a Bayesian linear regression model with a deterministic neural network feature extractor (riquelme2018deep). In this setting, posterior inference can be done in closed form (riquelme2018deep). The model is retrained for epochs after every AL iteration using Adam (kingma2014adam). After each iteration, we evaluate RMSE on a held-out set. Experiments are repeated for seeds, using randomized train-test splits. Further details, including architectures and learning rates, are in Appendix C.
Dataset | N | d | Random | MaxEnt | ACS-FW | MaxEnt (greedy)
yacht | | | | | |
boston | | | | | |
energy | | | | | |
power | | | | | |
year | | | | | | N/A
The results are summarized in Table 1. ACS-FW consistently outperforms Random by a large margin (unlike MaxEnt), and is mostly on par with MaxEnt on smaller datasets. While the results are encouraging, greedy procedures still often yield better results in these small-data regimes; see Table 1, MaxEnt (greedy). We conjecture that this is because single data points do have significant impact on the posterior. The benefits of using ACS-FW become clearer with increasing dataset size: as shown in Fig. 3, ACS-FW achieves much more data-efficient learning on larger datasets.
(c) seeds during AL. Error bars denote two standard errors.
Does our method scale to large datasets and models?
Leveraging the projections from Section 5, we apply ACS-FW to large-scale datasets and complex models. We demonstrate the benefits of our approach on year, a UCI regression dataset with ca. data points, and on the classification datasets cifar10, SVHN and Fashion MNIST. Greedy methods would be impractical in these cases.
For year, we again use a neural linear model, start with labeled points and allow for batches of size until the budget of queries is exhausted. We average the results over seeds, using randomized train-test splits. As can be seen in Fig. 3, our approach significantly outperforms both Random and MaxEnt during the entire AL process.
For the classification experiments, we start with (cifar10: ) labeled points and request batches of size (), up to a budget of () points. We additionally compare to BALD as well as two batch AL algorithms, namely K-Medoids and K-Center (sener2018active). Performance is measured in terms of accuracy on a holdout test set comprising (Fashion MNIST: , as is standard) points, with the remainder used for training. We use a neural linear model with a ResNet-18 (he2016deep) feature extractor, trained from scratch at every AL iteration for epochs using Adam (kingma2014adam). Since posterior inference is intractable in the multi-class setting, we resort to variational inference with mean-field Gaussian approximations (wainwright2008graphical; blundell2015weight).
Fig. 4 demonstrates that in all cases ACS-FW significantly outperforms Random, which is a strong baseline in AL (sener2018active; gal2017deep; hernandez2015probabilistic). Somewhat surprisingly, we find that the probabilistic methods (BALD and MaxEnt) provide strong baselines as well, and consistently outperform Random. We discuss this point and provide further experimental results in Appendix D. Finally, Fig. 4 demonstrates that in all cases ACS-FW performs at least as well as its competitors, including state-of-the-art non-probabilistic batch AL approaches such as K-Center. These results demonstrate that ACS-FW can usefully apply probabilistic reasoning to AL at scale, without any sacrifice in performance.
8 Conclusion and future work
We have introduced a novel Bayesian batch AL approach based on sparse subset approximations. Our methodology yields intuitive closed-form solutions, revealing its connection to BALD as well as leverage scores. Yet more importantly, our approach admits relaxations (i.e. random projections) that allow it to tackle challenging large-scale AL problems with general nonlinear probabilistic models. Leveraging the Frank-Wolfe weights in a principled way and investigating how this method interacts with alternative approximate inference procedures are interesting avenues for future work.
Acknowledgments
Robert Pinsler receives funding from iCASE grant #1950384 with support from Nokia. Jonathan Gordon, Eric Nalisnick and José Miguel Hernández-Lobato acknowledge support from Samsung. We thank Adrià Garriga-Alonso, James Requeima, Marton Havasi and Carl Edward Rasmussen for helpful feedback and discussions.
References
Houlsby et al. (2011). Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745.
MacKay (1992). Information-based objective functions for active data selection. Neural Computation 4 (4), pp. 590–604.
Settles (2012). Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6 (1), pp. 1–114.
Shannon (1948). A mathematical theory of communication. Bell System Technical Journal 27 (3), pp. 379–423.
Appendix A Algorithms
A.1 Active Bayesian coresets with Frank-Wolfe optimization (ACS-FW)
Algorithm A.1 outlines the ACS-FW procedure for a budget , vectors and the choice of an inner product (see Section 2). After computing the norms and (Lines 2 and 3) and initializing the weight vector to zero (Line 4), the algorithm performs iterations of Frank-Wolfe optimization. At each iteration, the Frank-Wolfe algorithm chooses exactly one data point (which can be viewed as nodes on the polytope) to be added to the batch (Line 6). The weight update for this data point can then be computed by performing a line search in closed form [campbell2017automated] (Line 7), and using the step size to update (Line 8). Finally, the optimal weight vector with cardinality is returned. In practice, we project the weights back to the feasible space by binarizing them (not shown; see Section 2 for more details), as working with the continuous weights directly is nontrivial.
A.2 ACS-FW with random projections
Algorithm A.2 details the process of constructing an AL batch with budget and random feature projections for the weighted Euclidean inner product from Eq. 16.
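The random feature projection can be sketched for a Bayesian linear regression model, where the expected log-likelihood under the Gaussian predictive has a simple closed form. The sketch assumes that the projection stacks evaluations of each function vector at `J` posterior samples and scales them by `1 / sqrt(J)`, so that dot products give Monte Carlo estimates of the weighted Euclidean inner product. The function names, `noise_var`, and the toy posterior are hypothetical, and this is not a transcription of Algorithm A.2.

```python
import numpy as np

def function_values(Xp, thetas, mu, Sigma, noise_var):
    """Expected log-likelihood plus predictive entropy at each sample; shape (N, J)."""
    pred_mean = Xp @ mu
    pred_var = noise_var + np.einsum("nd,de,ne->n", Xp, Sigma, Xp)
    resid = pred_mean[:, None] - Xp @ thetas.T            # predictive mean minus x^T theta_j
    exp_loglik = (-0.5 * np.log(2.0 * np.pi * noise_var)
                  - (pred_var[:, None] + resid ** 2) / (2.0 * noise_var))
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * pred_var) # Gaussian predictive entropy
    return exp_loglik + entropy[:, None]

def project(Xp, mu, Sigma, noise_var, J, rng):
    """J-dimensional projection of every pool point; dot products estimate inner products."""
    thetas = rng.multivariate_normal(mu, Sigma, size=J)   # samples from the posterior
    return function_values(Xp, thetas, mu, Sigma, noise_var) / np.sqrt(J)

def residual_scores(L_hat, w):
    """Approximate <L - L(w), L_n> for all n in O(N * J), without an N x N kernel."""
    residual = (1.0 - w) @ L_hat                          # shape (J,)
    return L_hat @ residual

rng = np.random.default_rng(0)
d, N, J = 3, 200, 16
mu, Sigma = np.zeros(d), 0.1 * np.eye(d)                  # toy posterior (assumed)
Xp = rng.normal(size=(N, d))
L_hat = project(Xp, mu, Sigma, noise_var=0.5, J=J, rng=rng)
scores = residual_scores(L_hat, np.zeros(N))
print(L_hat.shape, scores.shape)
```

Only products between the `N x J` projection matrix and vectors are needed, so the cost per Frank-Wolfe step grows linearly in the pool size rather than quadratically.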
Appendix B Closed-form derivations
B.1 Linear regression
Consider the following model for scalar Bayesian linear regression,
where denotes the prior. To avoid notational clutter we assume a factorized Gaussian prior with unit variance, but what follows is easily extended to richer Gaussian priors. Given an initial labeled dataset , the parameter posterior can be computed in closed form as
(B.18)  
and the predictive posterior is given by
(B.19) 
Using this model, we can derive a closed-form term for the inner product in Eq. 8,
where in the second equality we have taken expectation w.r.t. from Eq. B.19, and in the third equality w.r.t. from Eq. B.18. Similarly, we obtain
We can make a direct comparison with BALD by treating the squared norm of a data point with itself as an acquisition function, , yielding,
Viewing as a greedy acquisition function is reasonable as the norm of is related to the magnitude of the reduction in Eq. 5, and thus can be viewed as a proxy for greedy optimization.
This establishes a link to notions of sensitivity from the original work on Bayesian coresets [campbell2017automated, huggins2016coresets], where is the key quantity for constructing the coreset (i.e. by using it for importance sampling or Frank-Wolfe optimization).
As demonstrated in Fig. B.5, dropping from makes the two quantities monotonically related, and thus equivalent under a greedy maximizer.
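As a hedged reconstruction of the steps described in this section, the derivation can be sketched as follows. It assumes that the weighted Fisher inner product takes the form of an expected dot product of gradients under the posterior, and it uses hypothetical symbols: \sigma_0^2 for the observation noise variance, and \mu_\theta and \Sigma_\theta for the posterior mean and covariance. The entropy term does not depend on the parameters and therefore drops out of the gradient.

```latex
\begin{align*}
\nabla_\theta \mathcal{L}_n(\theta)
  &= \mathbb{E}_{y_n}\!\left[\nabla_\theta \log \mathcal{N}\!\left(y_n;\, \theta^\top x_n,\, \sigma_0^2\right)\right]
   = \frac{1}{\sigma_0^2}\, \mathbb{E}_{y_n}\!\left[y_n - \theta^\top x_n\right] x_n
   = \frac{1}{\sigma_0^2}\, x_n x_n^\top \left(\mu_\theta - \theta\right), \\
\langle \mathcal{L}_n, \mathcal{L}_m \rangle
  &= \mathbb{E}_{\theta \sim \pi}\!\left[\nabla_\theta \mathcal{L}_n(\theta)^\top \nabla_\theta \mathcal{L}_m(\theta)\right]
   = \frac{x_n^\top x_m}{\sigma_0^4}\; x_n^\top\, \mathbb{E}_{\theta \sim \pi}\!\left[(\mu_\theta - \theta)(\mu_\theta - \theta)^\top\right] x_m
   = \frac{x_n^\top x_m}{\sigma_0^4}\; x_n^\top \Sigma_\theta\, x_m .
\end{align*}
```

Setting m = n gives the squared norm, which is the product of a squared input norm and a predictive-variance term, in line with the leverage-score interpretation given in Section 4.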
B.2 Logistic regression
Consider the following model for Bayesian logistic regression,
where we again assume is a factorized Gaussian with unit variance. For this model, the exact parameter posterior distribution is intractable due to the nonlinear likelihood. We assume an approximation of the form . More importantly, the posterior predictive is also intractable in this setting. For the purpose of this derivation, we use the additional approximation
where in the second line we have plugged in our approximation to the parameter posterior, and used the well-known approximation , where represents the standard Normal cdf.
Next, we derive a closed-form approximation for the weighted Fisher inner product in Eq. 8. We begin by noting that
(B.20) 
where we define , and use as before. Next, we employ the identity [owen1956tables]
where is the bivariate Normal (with correlation ) cdf evaluated at . Plugging this, and Eq. B.22 into Eq. B.20 yields
where .
Next, we derive an expression for the squared norm, i.e.
(B.21) 
Here, we again use the approximation , and the following identity [owen1956tables]:
(B.22) 
where is Owen’s T function [owen1956tables] (footnote 6: efficient open-source implementations of numerical approximations exist, e.g. in scipy). Plugging Eq. B.22 back into Eq. B.21 and taking expectation w.r.t. the approximate posterior, we have that
Appendix C Experimental details
Computing infrastructure and source code
All experiments were run on a desktop Ubuntu 16.04 machine, with a GeForce GTX TITAN X GPU. The code to reproduce all the experiments will be released upon publication.
Hyperparameter selection
We manually tuned the hyperparameters with the goal of trading off performance and stability of the model training throughout the AL process, while keeping the protocol similar across datasets. While a more systematic hyperparameter search might yield improved results, we anticipate that the gains would be comparable across AL methods since they all share the same model and optimization procedure.
C.1 Regression experiments
Model
We use a deterministic feature extractor consisting of two fully connected hidden layers with (year: ) units, interspersed with batch norm and ReLU activation functions. Weights and biases are initialized from , where , and is the number of incoming features. We additionally apply weight decay with regularization parameter (year: ). The final layer performs exact Bayesian inference. We place a factorized zero-mean Gaussian prior with unit variance on the weights of the last layer , , and an inverse Gamma prior on the noise variance, , with (year: ). Inference with this prior can be performed in closed form, where the predictive posterior follows a Student’s t distribution [bishop2006pattern]. For the year dataset, we use projections during the batch construction of ACS-FW.
Optimization
Inputs and outputs are normalized during training to have zero mean and unit variance, and unnormalized for prediction. The network is trained for epochs with the Adam optimizer, using a learning rate of (year: ) and cosine annealing. The training batch size is adapted during the AL process as more data points are acquired: we set the batch size to the closest power of (e.g. we initially start with a batch size of ), but not more than . For energy and yacht, we deviate from this protocol to stabilize the training process, and set the batch size to .
C.2 Classification experiments
Model
We employ a deterministic feature extractor consisting of a ResNet [he2016deep], followed by one fully connected hidden layer with units with a ReLU activation function. All weights are initialized with Glorot initialization [glorot2010understanding]. We additionally apply weight decay with regularization parameter to all weights of this feature extractor. The final layer is a dense layer that returns samples using local reparametrization [kingma2015variational], followed by a softmax activation function. The mean weights of the last layer are initialized from and the log standard deviation weights are initialized from . We place a factorized zero-mean Gaussian prior with unit variance on the weights of the last layer , . Since exact inference is intractable, we perform mean-field variational inference [wainwright2008graphical, blundell2015weight] on the last layer. The predictive posterior is approximated using samples. We use projections during the batch construction of ACS-FW.
Optimization
We use data augmentation techniques during training, consisting of random cropping to 32px with padding of 4px, random horizontal flipping and input normalization. The entire network is trained jointly for epochs with the Adam optimizer, using a learning rate of , cosine annealing, and a fixed training batch size of .
Appendix D Baseline Probabilistic Methods for Active Learning
One surprising result we found in our experiments was the strong performance of the probabilistic baselines MaxEnt and BALD, especially considering that a number of previous works have reported weaker results for these methods (e.g. [sener2018active]).
Probabilistic methods rely on the parameter posterior distribution . For neural network based models, posterior inference is usually intractable and we are forced to resort to approximate inference techniques [neal2012bayesian]. We hypothesize that probabilistic AL methods are highly sensitive to the inference method used to train the approximate posterior distribution . Many works use Monte Carlo Dropout (MC-Dropout) [srivastava2014dropout] as the standard method for these approximations [gal2016dropout, gal2017deep], but commonly only use MC-Dropout on the final layer.
In our work, we find that a Bayesian multi-class classification model on the final layer of a powerful deterministic feature extractor, trained with variational inference [wainwright2008graphical, blundell2015weight], tends to lead to significant performance gains compared to using MC-Dropout on the final layer. A comparison of these two methods is shown in Fig. D.6, demonstrating that for cifar10, SVHN and Fashion MNIST a neural linear model is preferable to one trained with MC-Dropout in the AL setting. In future work, we intend to further explore the trade-offs implied by using different inference procedures for AL.