The deep neural networks (DNNs) responsible for self-driving functions in autonomous cars require exhaustive training data, covering all tasks they may potentially have to address, such as driving on a slope, parallel parking, or negotiating a roundabout. For each task that needs to be mastered, it is generally not difficult to collect unlabeled data, e.g., by recording videos or still images that are representative of the expected states of the environment in situations where the task is relevant. In contrast, labeling the data with the correct actions or decisions to be taken in response to given inputs is costly, as it typically involves annotation by a human. For example, for the task of detecting obstacles, labeling examples may involve operating the DNN in “shadow mode”, whereby the autopilot runs in the background, with a severed connection to the actuation system, whilst a human is in full control of the vehicle. If the DNN fails to detect an object, an upload of the respective samples is triggered when the human driver swerves around the obstacle. Human annotators then label the undetected object, and the labeled sample is added to the training data set corresponding to this task (de2021assessment).
As shown in Fig. 1, this paper addresses a problem formulation motivated by settings such as the one just described, in which the learning agent has unlabeled data sets for multiple related tasks and aims at preparing for fast adaptation to new tasks. In the example, the self-driving car may, for instance, need to adapt its operation based on limited labeled data from a new environment, say a new type of terrain. We focus on devising strategies for actively selecting which tasks should be labeled, say by a human annotator, so as to enhance the capacity of the learner to adapt to new, a priori unknown, tasks.
The problem is formulated as active meta-learning within a hierarchical Bayesian framework in which a vector of hyperparameters is shared among the learning tasks. The proposed approach attempts to sequentially maximize the amount of information that the learner has about the hyperparameters by actively selecting which task should be labeled next.
The methodology developed in this paper extends the principles introduced in the vast literature on active learning (felder2009active) to the higher-level problem of meta-learning. Active learning refers to the problem of selecting examples to label so as to improve the sample efficiency of a training algorithm in inferring model parameters. In contrast, active meta-learning selects data sets corresponding to distinct learning tasks in order to infer hyperparameters more efficiently, with the goal of generalizing more quickly to new tasks.
1.1 Related Work
The work most closely related to ours, at least in terms of motivation, is (kaddour2020probabilistic). In it, the authors assume that the meta-learner has access to a simulator that can be used to generate data from any new task. Tasks are identified by a latent task embedding, which is defined in a continuous space. The approach relies on the optimization of a variational model in the task embedding space, which is used to quantify how tasks relate to each other and how “surprising” new tasks are. Using the variational model, candidate tasks are scored in a low-dimensional space, and samples are generated for the selected task.
In contrast, in this paper, the meta-learner cannot generate data from arbitrary new tasks. Rather, it only has access to a finite pool of unlabeled data sets from a pre-defined set of meta-training tasks (see Fig. 1). The meta-learner chooses which task should be labeled next by adopting an information-theoretic approach that quantifies the amount of information that can be gained from a candidate task via a measure of epistemic uncertainty. We view our setting as complementary to (kaddour2020probabilistic): while the method proposed in (kaddour2020probabilistic) is of interest for applications in which it is natural to assume access to a data generator (see, e.g., (cohen2021learning) for a practical application), the problem formulation studied here appears to be more relevant for applications such as autonomous driving, in which data must be measured and the key problem is the cost of annotating unlabeled data sets.
We will use $H(\cdot \mid \cdot)$ to denote the conditional differential entropy, which for simplicity will be referred to as entropy; and we use $I(\cdot\,;\cdot \mid \cdot)$ to denote the conditional mutual information.
2 Information-Theoretic Active Meta-Learning
In the active meta-learning setting under study, shown in Fig. 1, the meta-learner aims at optimizing the hyperparameter vector of a learning algorithm so as to maximize the performance of the latter on a new, a priori unknown, task by using data from a small number of related meta-training tasks. To this end, at the beginning of the meta-learning process, the meta-learner has access to unlabeled data sets from a set of meta-training tasks. Each data set comprises covariate vectors, with no labels. At each step of the meta-learning process, the meta-learner chooses the next task to be labeled in a sequential fashion. When the meta-learner selects a task , it receives the corresponding labels , obtaining the complete data set .
At each step of the selection process, the meta-learner has obtained labels for a subset of meta-training tasks, and hence it has access to the meta-training data set . The next task to be labeled in the set is selected based on a meta-acquisition function that assigns a score to any of the remaining meta-training tasks in the set as
In (1), we have allowed for the use of a subset of the available data for the meta-training task in order to control complexity. Upon selection of a candidate task , the corresponding labeled data set is incorporated in the meta-training data set, and the procedure is repeated until the budget of meta-training tasks is exhausted.
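The sequential procedure described above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the function and argument names (`select_tasks`, `request_labels`, `score`) are hypothetical, and the meta-acquisition function is passed in as an arbitrary callable that scores a candidate task's covariates given the data labeled so far.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Covariates = Sequence[float]
Labels = Sequence[float]
LabeledData = Dict[str, Tuple[Covariates, Labels]]


def select_tasks(
    unlabeled_pool: Dict[str, Covariates],
    request_labels: Callable[[str], Labels],
    score: Callable[[Covariates, LabeledData], float],
    budget: int,
) -> List[str]:
    """Greedily label up to `budget` tasks, best-scoring task first."""
    labeled: LabeledData = {}          # meta-training data set acquired so far
    remaining = dict(unlabeled_pool)   # tasks that are still unlabeled
    order: List[str] = []
    while remaining and len(order) < budget:
        # Score every remaining task given the currently labeled data.
        best = max(remaining, key=lambda task: score(remaining[task], labeled))
        covariates = remaining.pop(best)
        # Query the annotator and add the completed data set.
        labeled[best] = (covariates, request_labels(best))
        order.append(best)
    return order
```

Any scoring rule can be plugged in through `score`, so the same loop covers both the proposed criterion and the benchmark selection rules; only the callable changes.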
We adopt a hierarchical Bayesian model whereby the model parameters for each task have a shared prior distribution determined by the common hyperparameter . This model captures the statistical relationship between the meta-training tasks, and provides a useful way to derive principled meta-learning algorithms (rothfuss2021pacoh), (jose2021epistemic). Both the model parameters and the hyperparameters are treated as random variables, with denoting the prior distribution of the hyperparameters and denoting any set of tasks. To complete the probabilistic model, we also fix a discriminative model assigning a probability (or a probability density if the label is continuous) to all values of the label, given the covariates and the model parameters.
Combining prior and likelihood, the joint distribution of labels , per-task model parameters , and hyperparameter , when conditioned on the corresponding covariates , takes the form
From (2.1), when the labels for the meta-training tasks in set have been acquired, the posterior distribution of any subset of the labels for any task is given by
where the average is taken over the marginal of the model parameter with respect to the joint posterior distribution .
2.2 Epistemic Uncertainty and BAMLD
Given the current available data set , under the assumed model (2.1), the overall average predictive uncertainty for a batch of covariates for a candidate meta-training task can be measured by the conditional entropy (houlsby2011bayesian), (xu2020minimum), (jose2021epistemic)
As we will discuss next, the first term in (2.2) is a measure of aleatoric uncertainty, which does not depend on the amount of labeled meta-training data; while the second is a measure of epistemic uncertainty, which captures the informativeness of the labels for the new task given the already labeled data sets. Based on this observation, we propose to adopt (an estimate of) the latter term as a meta-acquisition function to be used in (1).
To elaborate, let us first consider the entropy term in (2.2), which can be written as
in which is the predictive distribution under the assumed model (2.1) in the reference case in which one had access to the true hyperparameter vector . Accordingly, the first term in (2.2) captures the aleatoric uncertainty in the prediction on a new, a priori unknown, task that stems from the inherent randomness in the task generation. This term cannot be reduced by collecting data from meta-training tasks, since such data can only improve the estimate of the hyperparameter vector , and is thus irrelevant for the purposes of task selection. We note that the conditional entropy term in (2.2) is distinct from the standard aleatoric uncertainty in conventional learning, in which the model parameter , and not the hyperparameter , is assumed to be unknown (houlsby2011bayesian).
The second term in (2.2) can be written as
The mutual information (2.2) is hence the difference between two terms: the overall predictive uncertainty (4) and the predictive (aleatoric) uncertainty (2.2) corresponding to the ideal case in which the hyperparameter vector is known. Intuitively, the conditional mutual information in (2.2) represents the epistemic uncertainty arising from the limitations in the amount of available meta-training data, which causes the labels of a new meta-training task to carry useful information about the hyperparameter vector given the knowledge of the meta-training data set .
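For reference, the decomposition discussed in this subsection can be written compactly as follows. The symbols are assumptions introduced here for illustration, since the original notation is not reproduced in this text: $\theta$ is the shared hyperparameter vector, $\mathcal{D}$ the meta-training data labeled so far, and $(X, Y)$ the batch of covariates and labels of a candidate task.

```latex
\begin{align}
  \underbrace{H(Y \mid X, \mathcal{D})}_{\text{total predictive uncertainty}}
  &= \underbrace{H(Y \mid X, \theta)}_{\text{aleatoric}}
   + \underbrace{I(Y; \theta \mid X, \mathcal{D})}_{\text{epistemic}}, \\
  I(Y; \theta \mid X, \mathcal{D})
  &= H(Y \mid X, \mathcal{D}) - H(Y \mid X, \theta).
\end{align}
```

Read this way, the epistemic term is zero exactly when knowing the hyperparameter would not change the prediction for the candidate task, which is the case in which labeling that task cannot teach the meta-learner anything new about the hyperparameter.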
Given that the aleatoric uncertainty, i.e., the first term in (2.2), is independent of the amount of selected meta-training data, we propose that the meta-learner select the task with the largest epistemic uncertainty, i.e., the largest second term in (2.2), since labeling that task is expected to reduce the uncertainty about the hyperparameter vector the most. The corresponding meta-acquisition problem (1) is given as
The resulting BAMLD algorithm is summarized in Algorithm 1.
To further interpret the proposed meta-acquisition function, let us consider again the decomposition (2.2) of the conditional mutual information. The first term in (2.2) corresponds to the average log-loss of a predictor that averages over the posterior distribution of the hyperparameter, whilst the second corresponds to the average log-loss of a genie-aided prediction that knows the correct realisation of the hyperparameter , assuming a well-specified model. These two terms differ more significantly, making the conditional mutual information larger, when distinct choices of the hyperparameter vector yield markedly different predictive distributions . Thereby, the meta-learner does not merely select tasks with maximal predictive uncertainty, but rather it chooses tasks for which individual choices of the hyperparameter vector “disagree” more significantly.
BAMLD is the natural counterpart of the Bayesian Active Learning by Disagreement (BALD) method, introduced in (houlsby2011bayesian) for conventional learning. Crucially, while BALD gauges disagreement at the level of model parameters, BAMLD operates at the level of hyperparameters, marginalizing out the model parameters.
2.3 Implementation of BAMLD via Variational Inference
The posterior distribution of the hyperparameters given the selected data, which is needed to evaluate the two terms in the BAMLD meta-acquisition function (8) (see (2.2)), is generally intractable. To address this problem, we assume that an approximation of the posterior has been obtained via standard variational inference (VI) methods, yielding a variational distribution (see, e.g., (angelino2016patterns)). The first and second terms in (2.2) can then be estimated by replacing the true posterior with the variational distribution. In practice, averages over the variational distribution can be further estimated using Monte Carlo sampling. Alternatively, samples approximately distributed according to the true posterior distribution can be obtained via Monte Carlo methods, such as Stochastic-Gradient Markov Chain Monte Carlo (see, e.g., (angelino2016patterns)), or via particle-based methods like Stein Variational Gradient Descent (SVGD) (liu2016stein).
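As an illustration of the sample-based estimate, the sketch below scores a single candidate input from a handful of hyperparameter samples. It rests on simplifying assumptions that are not part of the paper: each hyperparameter sample is taken to induce a scalar Gaussian predictive distribution, given here directly by a mean and a standard deviation, and all names (`disagreement_score`, `num_draws`, and so on) are hypothetical. The score is the entropy of the equal-weight mixture, estimated by Monte Carlo, minus the average entropy of the individual components, which is available in closed form.

```python
import math
import random
from typing import Sequence


def gaussian_logpdf(y: float, mean: float, std: float) -> float:
    return -0.5 * math.log(2 * math.pi * std**2) - 0.5 * ((y - mean) / std) ** 2


def mixture_logpdf(y: float, means: Sequence[float], stds: Sequence[float]) -> float:
    """Log-density of the equal-weight Gaussian mixture (log-sum-exp for stability)."""
    logs = [gaussian_logpdf(y, m, s) for m, s in zip(means, stds)]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs)) - math.log(len(logs))


def disagreement_score(
    means: Sequence[float],
    stds: Sequence[float],
    num_draws: int = 20000,
    seed: int = 0,
) -> float:
    """Estimate (mixture entropy) - (average component entropy)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_draws):
        k = rng.randrange(len(means))        # pick a hyperparameter sample
        y = rng.gauss(means[k], stds[k])     # draw a label from its predictive
        total -= mixture_logpdf(y, means, stds)
    mixture_entropy = total / num_draws
    mean_component_entropy = sum(
        0.5 * math.log(2 * math.pi * math.e * s**2) for s in stds
    ) / len(stds)
    return mixture_entropy - mean_component_entropy
```

When all samples agree on the prediction the score is close to zero regardless of how noisy the prediction is, whereas samples that predict well-separated values yield a large score.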
3 BAMLD for Gaussian Process Regression
In this section, we instantiate BAMLD for Gaussian Process Regression (GPR). The application of BAMLD to GPR has the computational advantage that the marginalization over the model parameter required to compute the predictive distribution , which is in turn needed to evaluate the meta-acquisition function (2.2), can be obtained in closed form. We will also demonstrate in the next section that BAMLD for GPRs can be useful to improve the sample efficiency of black-box optimization via Bayesian Optimization (BO). Unlike (houlsby2011bayesian), we study Gaussian Processes (GPs) in which the GP prior is determined by a vector of hyperparameters as in deep kernel methods (hazan2015steps), rather than being fixed.
3.1 Computing the Predictive Distribution for Fixed Hyperparameters
GPRs are non-parametric models that can be used to replace the parametric joint distribution in the joint distribution (2.1) in order to efficiently evaluate the predictive distribution for a fixed hyperparameter vector appearing in the meta-acquisition function (2.2). For a given hyperparameter vector , adopting a GPR amounts to assuming the marginal (rasmussen2003gaussian)
where represents the mean vector, and is the covariance matrix with . Accordingly, GPR is parametrized by the shared hyperparameter through the mean and kernel functions defined as
respectively, where and are neural networks.
The second term in (2.2) can be approximated as
where denotes the determinant and the outer expectations can be estimated using samples from the variational posterior as discussed in the previous section.
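The determinant enters through the standard closed-form entropy of a multivariate Gaussian, stated here for reference. The symbols are assumptions made for this note: $\mu$ and $K$ denote the mean vector and covariance matrix of the Gaussian, and $n$ its dimension, i.e., the number of covariates in the batch.

```latex
\begin{equation}
  H\big[\mathcal{N}(\mu, K)\big]
  = \frac{1}{2} \log \det\!\left(2 \pi e \, K\right)
  = \frac{n}{2} \log(2 \pi e) + \frac{1}{2} \log \det K .
\end{equation}
```

The entropy does not depend on the mean, so for a fixed hyperparameter vector only the kernel-induced covariance affects this term.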
For the first term in (2.2), one similarly approximates the predictive distribution , which marginalizes over the hyperparameter vector , as
which can be again estimated via samples from the variational posterior .
In our implementation, we have estimated the expectations with respect to the variational posterior using SVGD (rothfuss2021pacoh). SVGD maintains samples , using which we can obtain the estimates
is a mixture of multivariate normal distributions.
In this section, we empirically evaluate the performance of BAMLD. We will use as benchmarks alternative selection functions directly inspired by existing methods for conventional active learning, and demonstrate that BAMLD provides state-of-the-art results in terms of predictive accuracy as well as for Bayesian optimization (BO) problems.
All implementation details are included in the Supplementary Material.
4.1 Experimental Setup
We consider two synthetic but challenging environments, one for regression and one for BO.
Following (finn2017model), for each task , the population distribution for the input-output pair is such that
where the function
depends on the task parameters . The distribution of the task parameters will be specified below.
Bayesian optimization (BO).
BO aims to find a global maximizer of a black-box function , in an iterative manner by using a minimal number of queries on the input . BO chooses the next input at which to query function , and observes noisy feedback as , with . The objective function is defined as (rothfuss2021meta)
where the parameters
define the task, and we have the unnormalized Cauchy and Gaussian probability density functions
respectively. The population distribution used to generate meta-training data for a given task is such that
follows a uniform distribution, and is given by (4.1). Furthermore, the task distribution is such that the parameters are independent and distributed as for , and the parameters are distributed as
During meta-testing, the meta-trained GP is used as a surrogate function to form a posterior belief over the function values, and, at each iteration of BO, the next query point is chosen to maximize the upper confidence bound (UCB) acquisition function, which is designed to balance exploration and exploitation (garivier2011upper). The mean and kernel parameters of the surrogate are then updated, and the procedure is repeated. This setting conforms to (rothfuss2021meta), which, however, considers only standard passive meta-learning.
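The UCB selection step can be sketched as follows. This is a hedged illustration over a finite candidate grid rather than the implementation used in the experiments: `mean_fn` and `std_fn` are hypothetical callables standing in for the posterior mean and standard deviation of the surrogate, and the exploration weight `beta` and its default value are assumptions.

```python
from typing import Callable, Sequence


def ucb_next_query(
    candidates: Sequence[float],
    mean_fn: Callable[[float], float],
    std_fn: Callable[[float], float],
    beta: float = 2.0,
) -> float:
    """Return the candidate maximizing posterior mean + beta * posterior std."""
    return max(candidates, key=lambda x: mean_fn(x) + beta * std_fn(x))
```

A larger `beta` favors inputs where the surrogate is uncertain (exploration), while `beta = 0` reduces the rule to picking the input with the highest posterior mean (exploitation).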
Apart from the baseline uniform selection of tasks, we adopt as benchmarks schemes inspired by well-studied principles of maximal predictive uncertainty and diversity.
Tasks are selected uniformly at random from the pool of available tasks, as is the case in most meta-learning papers, e.g., (finn2017model), (nichol2018firstorder).
Predictive uncertainty-based meta-acquisition.
As a counterpart of diversity-based active learning methods (yang2015multi), (wang2017incorporating), we consider a diversity-based meta-acquisition function inspired by (al2021data), which selects tasks that maximize the task diversity in the meta-training data set. To this end, tasks are first clustered using k-nearest neighbours, as applied to a fixed representation of the available data set of covariates for a given meta-training task. Here, we use the mean of the covariate vectors in set . The approach is justified by theoretical results showing that meta-learning generally benefits from task diversity, as it can decrease meta-overfitting (jose2021information).
4.3 Results for Regression
We start by comparing the performance of BAMLD for the regression problem described above with meta-training tasks in the initial pool. Fig. 2 and Fig. 3 illustrate the results in terms of the prediction root mean squared error (RMSE) as a function of the number of acquired meta-training tasks, for two distinct distributions of the parameters . BAMLD consistently outperforms the benchmark schemes in both cases, achieving the lowest RMSE. For example, BAMLD requires only meta-training tasks to achieve an RMSE of , whilst uncertainty-based acquisition, and uniform/diversity-based acquisition schemes require and tasks, respectively.
The performance of all benchmark schemes saturates after the -th task has been acquired, suggesting that, in line with results in the curriculum learning literature, the order in which tasks are acquired has an impact on performance (bengio2009curriculum), (graves2017automated). We also note that the RMSE of all schemes increases in the task environment in Fig. 3, which is characterized by the large variance of the task parameter. In such a setting, the RMSE gap between BAMLD and the benchmark schemes becomes more pronounced.
The impact of the task environment on the performance of the active meta-learning schemes is further investigated by considering a setting with heterogeneous tasks. Specifically, we assume that we have clusters comprised of tasks in each cluster, such that in the -th cluster, the amplitudes are sampled as
whilst all other task parameters are distributed as in Fig. 2. We report the results in Fig. 4. As the relative entropy increases, so does the RMSE for all meta-acquisition functions; however, BAMLD is seen to consistently outperform the benchmark meta-acquisition functions. In addition, the initial gap in RMSE between BAMLD and the benchmark schemes becomes increasingly pronounced until the task environment becomes too diverse. In such cases, meta-learning in general begins to fail to infer a useful inductive bias, irrespective of the task selection strategy.
4.4 Results for BO
For the BO problem, we report the regret , which is defined as
where denotes the best solution obtained in the optimization procedure so far, i.e., . As benchmarks, we consider the vanilla BO scheme that neglects meta-training data and assumes a GP prior with mean and a squared-exponential kernel with parameters as in (srinivas2009gaussian); as well as an ideal meta-BO scheme for which the mean and kernel are meta-optimized in the offline phase using all meta-training data sets (rothfuss2021meta). Meta-testing is done for the proposed BAMLD meta-BO scheme after selecting meta-training tasks out of the pool of tasks.
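To make the metric concrete, the sketch below computes a regret curve from a sequence of queried function values. It assumes the common simple-regret form, namely the optimal value minus the best value found so far, evaluated on noiseless function values; since the defining equation is not reproduced in this text, that form and the names used (`simple_regret`, `optimum_value`) are assumptions.

```python
from typing import List, Sequence


def simple_regret(optimum_value: float, observed_values: Sequence[float]) -> List[float]:
    """Regret after each query: optimum minus the best value found so far."""
    regrets: List[float] = []
    best_so_far = float("-inf")
    for value in observed_values:
        best_so_far = max(best_so_far, value)
        regrets.append(optimum_value - best_so_far)
    return regrets
```

By construction the curve is non-increasing, so a scheme that gets stuck in a sub-optimal solution shows up as a plateau at a positive regret value.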
Vanilla BO is seen to have the poorest performance among all schemes, getting stuck in a sub-optimal solution after a few iterations. Conversely, meta-BO finds good solutions quickly, requiring only function evaluations before the regret is minimized. A comparable regret is achieved with BAMLD meta-BO. However, BAMLD meta-BO uses only out of tasks, as opposed to all tasks used by the ideal meta-BO scheme, indicating that it is capable of achieving a positive transfer for BO with only a fraction of the available tasks.
This paper introduced BAMLD, a method for information-theoretic active meta-learning. The active meta-learner quantifies the amount of information to be gained from learning a candidate task via a measure of epistemic uncertainty at the level of hyperparameters in order to select meta-training tasks from a pool of available data sets. An instantiation for non-parametric methods, namely GPR, was provided, which enables the application to Bayesian Optimization (BO). Experimental results for regression showed that BAMLD compares favourably to alternative selection methods inspired by the literature on conventional active learning. In addition, BAMLD was shown to decrease the number of required meta-training tasks for BO, and still achieve a positive transfer – a result that is encouraging for many applications that depend on black-box optimization.