Distribution regression is the problem of learning a regression function from samples of a distribution to a single set-level label. For example, we might attempt to infer the sentiment of texts based on word-level features, to predict the label of an image based on small patches, or even perform traditional parametric statistical inference by learning a function from sets of samples to the parameter values.
Recent years have seen wide-ranging applications of this framework, including inferring summary statistics in Approximate Bayesian Computation (MitSejTeh2016), estimating Expectation Propagation messages (JitGreHeeEslLakSejSza2015), predicting the voting behaviour of demographic groups (flaxman2015ecological; flaxman2016understanding), and learning the total mass of dark matter halos from observable galaxy velocities (Ntampaka2015; Ntampaka2016). Closely related distribution classification problems also include identifying the direction of causal relationships from data (Lopez-paz2015)Yoshikawa2014; Kusner2015).
One particularly appealing approach to the distribution regression problem is to represent the input set of samples by their kernel mean embedding (described in Section 2.1
), where distributions are represented as single points in a reproducing kernel Hilbert space. Standard kernel methods can then be applied for distribution regression, classification, anomaly detection, and so on. This approach was perhaps first popularized bymuandet:smm; szabo2015two provided a recent learning-theoretic analysis.
In this framework, however, each distribution is simply represented by the empirical mean embedding, ignoring the fact that large sample sets are much more precisely understood than small ones. Most studies also use point estimates for their regressions, such as kernel ridge regression or support vector machines, thus ignoring uncertainty both in the distribution embeddings and in the regression model.
We propose a set of Bayesian approaches to distribution regression. The simplest method, similar to that of flaxman2015ecological
, is to use point estimates of the input embeddings but account for uncertainty in the regression model with simple Bayesian linear regression. Alternatively, we can treat uncertainty in the input embeddings but ignore model uncertainty with the proposed Bayesian mean shrinkage model, which builds on a recently proposed Bayesian nonparametric model of uncertainty in kernel mean embeddings(flaxman2016bayesian)
, and then use a sparse representation of the desired function in the RKHS for prediction in the regression model. This model allows for a full account of uncertainty in the mean embedding, but requires a point estimate of the regression function for conjugacy; we thus use backpropagation to obtain a MAP estimate for it as well as various hyperparameters. We then combine the treatment of the two sources of uncertainty into a fully Bayesian model and use Hamiltonian Monte Carlo for efficient inference. Depending on the inferential goals, each model can be useful. We demonstrate our approaches on an illustrative toy problem as well as a challenging real-world age estimation task.
2.1 Problem Overview
Distribution regression is the task of learning a classifier or a regression function that maps probability distributions to labels. The challenge of distribution regression goes beyond the standard supervised learning setting: we do not have access to exact input-output pairs since the true inputs, probability distributions, are observed only through samples from that distribution:
so that each bag has a label along with individual observations . We assume that the observations are i.i.d. samples from some unobserved distribution , and that the true label depends only on . We wish to avoid making any strong parametric assumptions on the . For the present work, we will assume the labels are real-valued; Appendix B shows an extension to binary classification. We typically take the observation space to be a subset of , but it could easily be a structured domain such as text or images, since we access it only through a kernel (for examples, see e.g. gartnerstructured).
We consider the standard approach to distribution regression, which relies on kernel mean embeddings and kernel ridge regression. For any positive definite kernel function , there exists a unique reproducing kernel Hilbert space (RKHS) , a possibly infinite-dimensional space of functions where evaluation can be written as an inner product, and in particular for all . Here is a function of one argument, .
Given a probability measure on , let us define the kernel mean embedding into as
Notice that serves as a high- or infinite-dimensional vector representation of . For the kernel mean embedding of into to be well-defined, it suffices that , which is trivially satisfied for all if is bounded. Analogously to the reproducing property of RKHS, represents the expectation function on : . For so-called characteristic kernels (sriperumbudur2010hilbert), every probability measure has a unique embedding, and thus completely determines the corresponding probability measure.
2.2 Estimating Mean Embeddings
For a set of samples drawn iid from , the empirical estimator of , , is given by
This is the standard estimator used by previous distribution regression approaches, which the reproducing property of shows us corresponds to the kernel
But (3) is an empirical mean estimator in a high- or infinite-dimensional space, and is thus subject to the well-known Stein phenomenon, so that its performance is dominated by the James-Stein shrinkage estimators. Indeed, muandet2014kernel studied shrinkage estimators for mean embeddings, which can result in substantially improved performance for some tasks (Ramdas2015). flaxman2016bayesian proposed a Bayesian analogue of shrinkage estimators, which we now review.
This approach consists of (1) a Gaussian Process prior on , where is selected to ensure that almost surely and (2) a normal likelihood . Here, conjugacy of the prior and the likelihood leads to a Gaussian process posterior on the true embedding , given that we have observed at some set of locations . The posterior mean is then essentially identical to a particular shrinkage estimator of muandet2014kernel, but the method described here has the extra advantage of a closed form uncertainty estimate, which we utilise in our distributional approach. For the choice of , we use a Gaussian RBF kernel , and choose either or, following flaxman2016bayesian, where is proportional to a Gaussian measure. For details of our choices, and why they are sufficient for our purposes, see Appendix A.
This model accounts for the uncertainty based on the number of samples , shrinking the embeddings for small sample sizes more. As we will see, this is essential in the context of distribution regression, particularly when bag sizes are imbalanced.
2.3 Standard Approaches to Distribution Regression
Following szabo2015two, assume that the probability distributions are each drawn randomly from some unknown meta-distribution over probability distributions, and take a two-stage approach, illustrated as in Figure 1.
Denoting the feature map by , one uses the empirical kernel mean estimator (3) to separately estimate the mean of each group:
Next, one uses kernel ridge regression (saunders1998ridge) to learn a function , by minimizing the squared loss with an RKHS complexity penalty:
Here is a “second-level” kernel on mean embeddings. If is a linear kernel on the RKHS , then the resulting method can be interpreted as a linear (ridge) regression on mean embeddings, which are themselves nonlinear transformations of the inputs. A nonlinear second-level kernel on sometimes improves performance (muandet:smm; szabo2015two).
Distribution regression as described is not scalable for even modestly-sized datasets, as computing each of the entries of the relevant kernel matrix requires time . Many applications have thus used variants of random Fourier features (rahimi2007random). In this paper we instead expand in terms of landmark points drawn randomly from the observations, yielding radial basis networks (broomhead1988radial) with mean pooling.
We consider here three different Bayesian models, with each model encoding different types of uncertainty. We begin with a non-Bayesian RBF network formulation of the standard approach to distribution regression as a baseline, before refining this approach to better propagate uncertainty in bag size, as well as model parameters.
3.1 Baseline Model
The baseline RBF network formulation we employ here is a variation of the approaches of broomhead1988radial, Que2016, law2017testing, and zaheer2017deep. As shown in Figure 2, the initial input is a minibatch consisting of several bags , each containing points. Each point is then converted to an explicit featurisation, taking the role of in (5), by a radial basis layer: is mapped to
where are landmark points. A mean pooling layer yields the estimated mean embedding corresponding to each of the bags represented in the minibatch, where .111In the implementation, we stack all of the bags into a single matrix of size for the first layer, then perform pooling via sparse matrix multiplication. Finally, a fully connected output layer gives real-valued labels
. As a loss function we use the mean square error. For learning, we use backpropagation with the Adam optimizer (Kingma2015). To regularise the network, we use early stopping on a validation set, as well as an penalty corresponding to a normal prior on .
3.2 Bayesian Linear Regression Model
The most obvious approach to adding uncertainty to the model of Section 3.1 is to encode uncertainty over regression parameters only, as follows:
This is essentially Bayesian linear regression on the empirical mean embeddings, and is closely related to the model of flaxman2015ecological. Here, we are working directly with the finite-dimensional , unlike the infinite-dimensional before. Due to the conjugacy of the model, we can easily obtain the predictive distribution , integrating out the uncertainty over . This provides us with uncertainty intervals for the predictions .
For model tuning, we can maximise the model evidence, i.e. the marginal log-likelihood (see bishop2006 for details), and use backpropagation through the network to learn and and any kernel parameters of interest.222Note that unlike the other models considered in this paper, we cannot easily do minibatch stochastic gradient descent, as the marginal log-likelihood does not decompose for each individual data point.
3.3 Bayesian Mean Shrinkage Model
A shortcoming of the prior models, and of the standard approach in szabo2015two
, is that they ignore uncertainty in the first level of estimation due to varying number of samples in each bag. Ideally we would estimate not just the mean embedding per bag, but also a measure of the sample variance, in order to propagate this information regarding uncertainty from the bag size through the model. Bayesian tools provide a natural framework for this problem.
We can use the Bayesian nonparametric prior over kernel mean embeddings (flaxman2016bayesian) described in Section 2.2, and observe the empirical embeddings at the landmark points . For , we take a fixed set of landmarks, which we can choose via -means clustering or sample without replacement (Que2016). Using the conjugacy of the model to the Gaussian process prior , we obtain a closed-form posterior Gaussian process whose evaluation at points is:
where , and denotes the set . We take the prior mean to be the average of the ; under a linear kernel , this means we shrink predictions towards the mean prediction. Note essentially controls the strength of the shrinkage: a smaller means we shrink more strongly towards . We take to be the average of the empirical covariance of across all bags, to avoid poor estimation of for smaller bags. More intuition about the behaviour of this estimator can be found in Appendix C.
Now, supposing we have normal observation error , and use a linear kernel as our second level kernel , we have:
where . Clearly, this is difficult to work with; hence we parameterise as , where is a set of landmark points for , which we can learn or fix. (Appendix D gives a motivation for this approximation using the representer theorem.) Using the reproducing property, our likelihood model becomes:
where . For fixed and we can analytically integrate out the dependence on , and the predictive distribution of a bag label becomes
The prior , where is the kernel matrix on , gives the standard regularisation on of . The log-likelihood objective becomes
We can use backpropagation to learn the parameters , , and if we wish , , and any kernel parameters. The full model is illustrated in Figure 3. This approach allows us to directly encode uncertainty based on bag size in the objective function, and gives probabilistic predictions.
3.4 Bayesian Distribution Regression
It is natural to combine the two Bayesian models above, fully propagating uncertainty in estimation of the mean embedding and of the regression coefficients
. Unfortunately, conjugate Bayesian inference is no longer available. Thus, we consider a Markov Chain Monte Carlo (MCMC) sampling based approach, and here use Hamiltonian Monte Carlo (HMC) for efficient inference, though any MCMC-type scheme would work. Whereas inference above used gradient descent to maximise the marginal likelihood, with the gradient calculated using automatic differentiation, here we use automatic differentiation to calculate the gradient of the joint log-likelihood and follow this gradient as we perform sampling over the parameters we wish to infer.
We can still exploit the conjugacy of the mean shrinkage layer, obtaining an analytic posterior over the mean embeddings. Conditional on the mean embeddings, we have a Bayesian linear regression model with parameters . We sample this model with the NUTS HMC sampler (hoffman-gelman:2012; stan-software:2014).
4 Related Work
As previously mentioned, szabo2015two provides a thorough learning-theoretic analysis of the regression model discussed in Section 2.3. This formalism considering a kernel method on distributions using their embedding representations, or various scalable approximations to it, has been widely applied (e.g. muandet:smm; Yoshikawa2014; flaxman2015ecological; JitGreHeeEslLakSejSza2015; Lopez-paz2015; MitSejTeh2016). There are also several other notions of similarities on distributions in use (not necessarily falling within the framework of kernel methods and RKHSs), as well as local smoothing approaches, mostly based on estimates of various probability metrics (Moreno2003; Jebara2004; Poczos2011; Oliva2013; poczos2013distribution; Kusner2015). For a partial overview, see sutherland2016scalable.
Other related problems of learning on instances with group-level labels include learning with label proportions (quadrianto2009estimating; patrini2014almost), ecological inference (king1997solution; gelman2001models), pointillistic pattern search (Ma2015), multiple instance learning (dietterich1997solving; kuck2005learning; zhou2009multi; krummenacher2013ellipsoidal) and learning with sets (zaheer2017deep).333For more, also see giorgiopatrini.org/nips15workshop.
There have also been some Bayesian approaches in related contexts, though most do not follow our setting where the label is a function of the underlying distribution rather than the observed sample set. kuck2005learning consider an MCMC method with group-level labels but focus on individual-level classifiers, while jackson2006improving use hierarchical Bayesian models on both individual-level and aggregate data for ecological inference.
JitGreHeeEslLakSejSza2015 and flaxman2015ecological quantify the uncertainty of distribution regression models by interpreting the kernel ridge regression on embeddings as Gaussian process regression. However, the former’s setting has no uncertainty in the mean embeddings, while the latter’s treats empirical embeddings as fixed inputs to the learning problem (as in Section 3.2).
There has also been generic work on input uncertainty in Gaussian process regression (Girard2004; Damianou2016). These methods could provide a framework towards allowing for second-level kernels in our models. One could also, though, consider regression with uncertain inputs as a special case of distribution regression, where the label is a function of the distribution’s mean and .
We will now demonstrate our various Bayesian approaches: the mean-shrinkage pooling method with (shrinkage) and with for proportional to a Gaussian measure (shrinkageC), Bayesian linear regression (BLR), and the full Bayesian distribution regression model with (BDR). We also compare the non-Bayesian baselines RBF network (Section 3.1) and freq-shrinkage, which uses the shrinkage estimator of muandet2014kernel to estimate mean embeddings. Code for our methods and to reproduce the experiments is available at https://github.com/hcllaw/bdr.
We first demonstrate the characteristics of our models on a synthetic dataset, and then evaluate them on a real life age prediction problem. Throughout, for simplicity, we take , i.e. , and – although and could be different, with learnt. Here is the standard RBF kernel. We tune the learning rate, number of landmarks, bandwidth of the kernel and regularisation parameters on a validation set. For BDR, we use weakly informative normal priors (possibly truncated at zero); for other models, we learn the remaining parameters.
5.1 Gamma Synthetic Data
We create a synthetic dataset by repeatedly sampling from the following hierarchical model, where is the label for the th bag, each has entries i.i.d. according to the given distribution, and is an added noise term which differs for the two experiments below:
In these experiments, we generate bags for training, bags for a validation set for parameter tuning, bags to use for early-stopping of the models, and bags for testing. Tuning is performed to maximize log-likelihoods for Bayesian models, MSE for non-Bayesian models. Landmark points are chosen via -means (fixed across all models). We also show results of the Bayes-optimal model, which gives true posteriors according to the data-generating process; this is the best performance any model could hope to achieve. Our learning models, which treat the inputs as five-dimensional, fully nonparametric distributions, are at a substantial disadvantage even in how they view the data compared to this true model.
, and so on. Colours represent the predictive standard deviation of each point.
Varying bag size: Uncertainty in the inputs.
In order to study the behaviour of our models with varying bag size, we fix four sizes . For each generated dataset, of the bags have , and have . Among the other half of the data, we vary the ratio of and bags to demonstrate the methods’ efficacy at dealing with varied bag sizes: we let be the overall percentage of bags with , ranging from (in which case no bags have size ) to (in which case of the overall bags have size ). Here we do not add additional noise: .
Results are shown in Figure 4. BDR and shrinkage methods, which take into account bag size uncertainty, perform well here compared to the other methods. The full BDR model very slightly outperforms the Bayesian shrinkage models in both likelihood and in mean-squared error; frequentist shrinkage slightly outperforms the Bayesian shrinkage models in MSE, likely because it is tuned for that metric. We also see that the choice of affects the results; does somewhat better.
Figure 5 demonstrates in more detail the difference between these models. It shows test set predictions of each model on the bags of different sizes. Here, we can see explicitly that the shrinkage and BDR models are able to take into account the bag size, with decreasing variance for larger bag sizes, while the BLR model gives the same variance for all outputs. Furthermore, the shrinkage and BDR models can shrink their predictions towards the mean more for smaller bags than larger ones: this improves performance on the small bags while still allowing for good predictions on large bags, contrary to the BLR model.
Fixed bag size: Uncertainty in the regression model.
The previous experiment showed the efficacy of the shrinkage estimator in our models, but demonstrated little gain from posterior inference for regression weights over their MAP estimates, i.e. there is no discernible improvement of BLR over RBF network. To isolate the effect of quantifying uncertainty in the regression model, we now consider the case where there is no variation in bag size at all and normal noise is added onto the observations. In particular we take and , and sample landmarks randomly from the training set.
Results are shown in Table 1. Here, BLR or BDR outperform all other methods on all runs, highlighting that uncertainty in the regression model is also important for predictive performance. Importantly, the BDR method performs well in this regime as well as in the previous one.
|Optimal||0.170 (0.009)||0.401 (0.018)|
|RBF network||0.235 (0.014)||–|
|shrinkage||0.237 (0.014)||0.703 (0.027)|
|shrinkageC||0.236 (0.013)||0.700 (0.029)|
|BLR||0.228 (0.012)||0.681 (0.025)|
|BDR||0.227 (0.012)||0.683 (0.025)|
5.2 IMDb-WIKI: Age Estimation
|CNN||10.25 (0.22)||3.80 (0.034)|
|RBF network||9.51 (0.20)||–|
|shrinkage||9.28 (0.20)||3.54 (0.021)|
|BLR||9.55 (0.19)||3.68 (0.021)|
We now demonstrate our methods on a celebrity age estimation problem, using the IMDb-WIKI database (Rothe-IJCV-2016) which consists of images of celebrities444We used only the IMDb images, and removed some implausible images, including one of a cat and several of people with supposedly negative age, or ages of several hundred years., with corresponding age labels. This database was constructed by crawling IMDb for images of its most popular actors and directors, with potentially many images for each celebrity over time. Rothe-IJCV-2016
use a convolutional neural network (CNN) with a VGG-16 architecture to perform 101-way classification, with one class corresponding to each age in.
We take a different approach, and assume that we are given several images of a single individual (i.e. samples from the distribution of celebrity images), and are asked to predict their mean age based on several pictures. For example, we have 757 images of Brad Pitt from age 27 up to 51, while we have only 13 images of Chelsea Peretti at ages 35 and 37. Note that 22.5% of bags have only a single image. We obtain bags, with each bag containing between and images of a particular celebrity, and the corresponding bag label calculated from the average of the age labels of the images inside each bag.
In particular, we use the representation learnt by the CNN in Rothe-IJCV-2016, where maps from the pixel space of images to the CNN’s last hidden layer. With these new representations, we can now treat them as inputs to our radial basis network, shrinkage (taking here) and BLR models. Although we could also use the full BDR model here, due to the computational time and memory required to perform proper parameter tuning, we relegate this to a later study.
We use bags for training, bags for early stopping, for validation and for testing. Landmarks are sampled without replacement from the training set.
We repeat the experiment on different splits of the data, and report the results in Table 2. The baseline CNN results give performance by averaging the predictive distribution from the model of Rothe-IJCV-2016 for each image of a bag; note that this model was trained on all of the images used here. From Table 2, we can see that the shrinkage methods have the best performance; they outperforms all other methods in all splits of the dataset, in both metrics. Non-Bayesian shrinkage again yields slightly better RMSEs, likely because it is tuned for that metric. This demonstrates that modelling bag size uncertainty is vital.
Supervised learning on groups of observations using kernel mean embeddings typically disregards sampling variability within groups. To handle this problem, we construct Bayesian approaches to modelling kernel mean embeddings within a regression model, and investigate advantages of uncertainty propagation within different components of the resulting distribution regression. The ability to take into account the uncertainty in mean embedding estimates is demonstrated to be key for constructing models with good predictive performance when group sizes are highly imbalanced. We also demonstrate that the results of a complex neural network model for age estimation can be improved by shrinkage.
Our models employ a neural network formulation to provide more expressive feature representations and learn discriminative embeddings. Doing so makes our model easy to extend to more complicated featurisations than the simple RBF network used here. By training with backpropagation, or via approximate Bayesian methods such as variational inference, we can easily ‘learn the kernel’ within our framework, for example fine-tuning the deep network of Section 5.2 rather than using a pre-trained model. We can also apply our networks to structured settings, learning regression functions on sets of images, audio, or text. Such models naturally fit into the empirical Bayes framework.
On the other hand, we might extend our model to more Bayesian feature learning by placing priors over the kernel hyperparameters, building on classic work on variational approaches (barber1998radial) and fully Bayesian inference (andrieu2001robust) in RBF networks. Such approaches are also possible using other featurisations, e.g. random Fourier features (as in Oliva2015).
Future distribution regression approaches will need to account for uncertainty in observation of the distribution. Our methods provide a strong, generic building block to do so.
Appendix A Choice of to ensure
We need to choose an appropriate covariance function , such that , where . In particular, it is for infinite-dimensional RKHSs not sufficient to define , as draws from this particular prior are no longer in (wahba1990spline) (but see below). However, we can construct
where is any finite measure on . This then ensures with probability by the nuclear dominance (lukic2001stochastic; pillai2007characterizing) for any stationary kernel . In particular, flaxman2016bayesian provides details when is a squared exponential kernel defined by
and , i.e. it is proportional to a Gaussian measure on , which provides with a non-stationary component. In this paper, we take , where and are tuning parameters, or parameters that we learn.
Here, the above holds for a general set of stationary kernels, but note that by taking a convolution of a kernel with itself, it might make the space of functions that we consider overly smooth (i.e. concentrated on a small part of ). In this work, however, we consider only the Gaussian RBF kernel . In fact, recent work (steinwart2014convergence, Theorem 4.2)
actually shows that in this case, the sample paths almost surely belong to (interpolation) spaces which are infinitesimally larger than the RKHS of the Gaussian RBF kernel. This suggests that we can chooseto be an RBF kernel with a length scale that is infinitesimally bigger than that of ; thus, in practice, taking would suffice and we do observe that it actually performs better (Fig. 4).
Appendix B Framework for Binary Classification
Suppose that our labels
, i.e. we are in a binary classification framework. Then a simple approach to accounting for uncertainty in the regression parameters is to use bayesian logistic regression, putting priors on, i.e.
however for the mean shrinkage pooling model, if we use the above , we would not be able to obtain an analytical solution for . Instead we use the probit link function, as given by:
where. Then as before we have
with and as defined in section 3.3. Hence, as before
Note here and Then expanding and rearranging
Note that since and independent normal r.v., . Let be standard normal, then we have:
Hence, we also have:
Now placing the prior , we have the following MAP objective:
Since we have an analytical solution for , we can also use this in HMC for BDR.
Appendix C Some more intuition on the shrinkage estimator
In this section, we provide some intuition behind the shrinkage estimator in section 3.3. Here, for simplicity, we choose for all bag , and , and consider the case where , i.e. . We can then see that if has eigendecomposition , with , the posterior mean is
so that large eigenvalues,, are essentially unchanged, while small eigenvalues, , are shrunk towards 0. Likewise, the posterior variance is
its eigenvalues also decrease as increases.
Appendix D Alternative Motivation for choice of
Here we provide an alternative motivation for the choice of . First, consider the following Bayesian model with a linear kernel on , where :
Now considering the log-likelihood of (supposing we have these exact embeddings), we obtain:
To avoid over-fitting, we place a Gaussian prior on , i.e. . Minimizing the negative log-likelihood over , we have:
Now this is in the form of an empirical risk minimisation problem. Hence using the representer theorem (representer), we have that:
i.e. we have a finite-dimensional problem to solve. Thus since is a linear kernel:
where can be thought of as the similarity between distributions.
Now we have the same posterior as in Section 3.3, and we would like to compute . This suggests we need to integrate out , …. But it is unclear how to perform this integration, since the follow Gaussian process distributions. Hence we can take an approximation to , i.e. , which would essentially give us a dual method with a sparse approximation to .