Multiple instance learning (MIL) is a setting which falls under the supervised learning paradigm. Within the MIL framework, there exist two different learning tasks: multiple instance classification (MIC) and multiple instance regression (MIR) [4, 11]. The former has been extensively studied in the literature while the latter has been underrepresented. This could be due to the fact that many of the data sources studied within the MIL framework are images and text, which correspond to classification tasks. The MIC problem generally consists in classifying bags into positive or negative examples, where negative bags contain only negative instances and positive bags contain at least one positive instance. A multitude of applications are covered by the MIC framework. It has been applied to medical imaging in a weakly supervised setting [20, 21], where each image is taken as a bag and sub-regions of the image are instances, to image categorization and retrieval [23, 22], and to analyzing videos, where the video is treated as the bag and the frames are the instances.
On the other hand, the MIR problem, where bag labels are real-valued, has been much less prevalent in the literature. In a regression setting, as opposed to classification, one cannot simply identify a single positive instance. Instead, one needs to estimate the contribution of each of the instances towards the bag label. The MIR problem was first introduced in the context of predicting drug activity level, and the first proposed MIR algorithm relied on the assumption that the bag's label can be fully explained by a single prime instance (prime-MIR). However, this is a simplistic assumption, as we throw away a lot of information about the distribution (e.g., variance, skewness). Instead of assuming that a single instance is responsible for the bag's label, the MIR problem has been tackled using a weighted linear combination of the instances, or a prime cluster of instances (cluster-MIR). Other works have looked at first efficiently mapping the instances in each bag to a new embedding space, and then training a regressor on the new embedded feature space. For instance, one can transform the MIR problem into a regular supervised learning problem by mapping each bag to a feature space which is characterized by a similarity measure between a bag and an instance. The resulting embedding of a bag in the new feature space represents how similar a bag is to various instances from the training set. A drawback of this approach is that the embedding space for each bag can be high-dimensional when the number of instances in the training set is large, producing many redundant and possibly uninformative features.
In this paper, we use a similar approach and compute the kernel mean embeddings for each bag. The use of kernel mean embedding in distribution regression has been applied to various real-world problems such as analyzing the 2016 US presidential election and estimating aerosol levels in the atmosphere. Intuitively, kernel mean embedding measures how similar each bag is to all the other bags from the training set. In this paper, as opposed to previous works, we do not compute the kernel mean embeddings directly on the input features (i.e., on the instances) but on the predictions made by a previous learning algorithm (e.g., a neural network). This insight comes from the fact that a simple baseline algorithm (instance-MIR) performed surprisingly well on several datasets when the regressor was a neural network with a large hidden layer. The instance-MIR algorithm essentially ignores the fact that we are in a distribution regression framework and treats each instance as a separate observation, thereby yielding a unique prediction for each instance. The final bag label is taken to be the mean or the median of the predictions for that given bag. However, using a point estimate in the original prediction space is a performance bottleneck. In this paper, we propose a novel algorithm (instance-kme-MIR), which both leverages the representational power of the instance-MIR algorithm equipped with a neural network and alleviates the aforementioned issue by mapping the predictions into a high or infinite-dimensional space, characterized by a kernel function. We test our approach on 5 remotely sensed real-world datasets.
2 Related Work
The datasets we use to test our algorithm stem from remotely sensed data (http://www.dabi.temple.edu/~vucetic/MIR.html, https://harvist.jpl.nasa.gov/papers.shtml), and have previously been described [14, 18] and studied as a distribution regression problem [19, 18, 14]. This allows us to compare the performance of our approach with the baseline instance-MIR and the current state-of-the-art. The first application (3 of the 5 datasets) consists in predicting aerosol optical depth (AOD). Aerosols are fine airborne solid particles or liquid droplets that both reflect and absorb incoming solar radiation. The second application (2 of the 5 datasets) is the prediction of county-level crop yields (wheat and corn) in Kansas between 2001 and 2005. These two applications can naturally be framed as a multiple instance regression problem. Indeed, in both applications, satellites gather noisy measurements due to the intrinsic variability within the sensors and the properties of the targeted area on Earth (e.g., surface and atmospheric effects). For the AOD prediction task, aerosols have been found to have a very small spatial variability over distances up to 100 km. For the crop data, we can reasonably assume that the yields are similar across a county and thus consider the bag label as the aggregated yield over the entire county.
The first study which investigated estimating AOD levels within a MIR setting proposed an iterative method (pruning-MIR) which prunes outlying instances from each bag and then proceeds in a similar fashion as instance-MIR. The main drawback of this approach is that it is not obvious what the pruning threshold should be, and we may thus get rid of informative instances in the process. In a subsequent work, the authors investigated a probabilistic framework (EM-MIR) by fitting a mixture model and using the expectation-maximization (EM) algorithm to learn the mixing and distribution parameters. The current state-of-the-art algorithm (attention-MIR) on the AOD datasets has been obtained by treating each bag as a set (i.e., an unordered sequence) of instances. To do so, the authors implemented an order-invariant operation characterized by a content-based attention mechanism, which then attends to the instances a selected number of times. Finally, the problem of estimating AOD levels has been tackled using kernel mean embedding directly on the input features (i.e., the instances), where the authors show that performance is robust to the kernel choice but that the hyperparameter values of the kernels are of primary importance. In this paper, however, we compute the kernel mean embeddings of the distributions of the predicted labels made by a neural network. In order to have a principled way to find the kernel parameters, authors have proposed a Bayesian kernel mean embedding model with a Gaussian process prior, from which we can obtain a closed form marginal pseudolikelihood. This marginal likelihood can then be optimized in order to find the kernel parameters.
3.1 Multiple Instance Regression
In the MIR problem, our observed dataset is $\{(\{x_{ij}\}_{j=1}^{N_i}, y_i)\}_{i=1}^{B}$, where $B$ is the number of bags, $y_i$ is the label of bag $i$, $x_{ij}$ is the $j$-th instance of bag $i$ and $N_i$ is the number of instances in bag $i$. Note that $y_i \in \mathbb{R}$, $x_{ij} \in \mathcal{X}$ and $\mathcal{X}$ is a subset of $\mathbb{R}^d$, where $d$ is the number of features in each instance. The number of features must be the same for all the instances, but the number of instances can vary from bag to bag.
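We want to learn the best mapping $f$: $\{x_{ij}\}_{j=1}^{N_i} \mapsto y_i$, $i = 1, \dots, B$. By best mapping we mean the function which minimizes the mean squared error (MSE) on bags unseen during training (e.g., on the validation set). Formally, we seek $f^{*}$ such that
$$f^{*} = \arg\min_{f \in \mathcal{F}} \frac{1}{B'} \sum_{i=1}^{B'} \Bigl(y'_i - f\bigl(\{x'_{ij}\}_{j=1}^{N'_i}\bigr)\Bigr)^2$$
from the validation data $\{(\{x'_{ij}\}_{j=1}^{N'_i}, y'_i)\}_{i=1}^{B'}$, where $B'$ is the number of unseen bags and $\mathcal{F}$ is the hypothesis space of functions under consideration.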
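As a minimal sketch of the setup above, the snippet below stores each bag as a pair of instances and label and computes the bag-level MSE of a predictor. The toy data and the `toy_predictor` function are hypothetical and only serve to illustrate that bags may hold different numbers of instances while sharing the same number of features.

```python
from statistics import mean

# Each bag is (instances, label); every instance has the same number of features,
# but the number of instances differs from bag to bag.
bags = [
    ([[0.1, 0.2], [0.3, 0.1], [0.2, 0.2]], 1.0),  # bag with 3 instances
    ([[0.9, 0.8], [1.0, 0.7]], 2.0),              # bag with 2 instances
]


def bag_mse(predict_bag, bags):
    """Mean squared error between bag labels and bag-level predictions."""
    return mean((label - predict_bag(instances)) ** 2 for instances, label in bags)


def toy_predictor(instances):
    # Hypothetical mapping from a bag to a real value:
    # twice the average of the first feature.
    return 2.0 * mean(x[0] for x in instances)


print(bag_mse(toy_predictor, bags))
```

In practice the bags passed to `bag_mse` would be the held-out ones, so that the score reflects performance on bags unseen during training.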
The two main challenges that the multiple instance regression problem poses are to find which instances are predictive of the bag's label and to efficiently summarize the information from the instances within each bag. However, the instance-MIR baseline algorithm, which we describe next, does not attempt to address these two challenges. Instead, it simply treats each instance independently and fits a regression model to all the instances separately.
3.2 Instance-MIR Algorithm
As mentioned, the instance-MIR algorithm makes predictions on all the instances before taking the mean or the median of the predictions for each bag, as the final prediction. This means that during training, all the instances have the same weights and thus contribute equally to the loss function.
Formally, our dataset is formed by pairs of instance and bag label, which can be denoted as $\{(x_{ij}, y_i)\}$ for $i = 1, \dots, B$ and $j = 1, \dots, N_i$. The final label prediction on an unseen bag can be simply calculated as
$$\hat{y}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \hat{y}_{ij},$$
where $\hat{y}_{ij} = f(x_{ij})$ is the predicted label corresponding to the $j$-th instance in bag $i$. Empirically, this method has been shown to be competitive, even though it requires models with high complexity in order to be able to effectively map many different noisy instances to the same target value. Thus, it is appropriate to take $f$ as a neural network with a large number of hidden units, as opposed to a small number.
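The instance-MIR procedure can be sketched as follows. This is an illustration only: to stay self-contained it uses scalar instances and an ordinary least squares line (`fit_line`, a hypothetical stand-in) rather than the neural network regressor discussed in the text, and the toy bags are made up.

```python
from statistics import mean, median


def flatten(bags):
    """Pair every instance with the label of its bag (the instance-MIR training set)."""
    return [(x, y) for instances, y in bags for x in instances]


def fit_line(pairs):
    """Least squares fit of y = a * x + b on scalar instances (stand-in regressor)."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    var = sum((x - x_bar) ** 2 for x in xs)
    a = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / var
    b = y_bar - a * x_bar
    return lambda x: a * x + b


def predict_bag(f, instances, aggregate=mean):
    """Predict every instance, then aggregate with the mean or the median."""
    return aggregate([f(x) for x in instances])


# Toy bags of scalar instances; every instance inherits its bag label in training,
# so all instances carry the same weight in the loss.
train_bags = [([1.0, 1.2, 0.8], 2.0), ([2.1, 1.9], 4.0), ([3.0, 3.2, 2.8, 3.0], 6.0)]
f = fit_line(flatten(train_bags))

unseen = [2.4, 2.6, 2.5]
print(predict_bag(f, unseen), predict_bag(f, unseen, aggregate=median))
```

Swapping `aggregate=mean` for `aggregate=median` gives the median variant of the final bag prediction.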
3.3 Kernel Mean Embedding
In this subsection, we briefly describe kernel mean embedding and its application to distribution regression, where the goal is to compute the kernel mean embedding of each bag. We assume that the instances $\{x_{ij}\}_{j=1}^{N_i}$ in each bag $i$ are i.i.d. samples from some unobserved distribution $P_i$, for $i = 1, \dots, B$. The idea is to adopt a two-stage procedure by first representing each set of samples (i.e., each bag) by its corresponding kernel mean embedding and then training a kernel ridge regression on those embeddings.
Formally, let $\mathcal{H}$ be a reproducing kernel Hilbert space (RKHS), which is a potentially infinite dimensional space of functions $f: \mathcal{X} \to \mathbb{R}$, and let $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a reproducing kernel function of $\mathcal{H}$. Then for $f \in \mathcal{H}$ and $x \in \mathcal{X}$, we can evaluate $f$ at $x$ as an inner product
$$f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$$
(reproducing kernel property). Then, for a probability measure $P$ on $\mathcal{X}$, we can define its kernel mean embedding as
$$\mu_P = \int_{\mathcal{X}} k(\cdot, x) \, dP(x).$$
For $\mu_P$ to be well-defined, we simply require that the norm of $\mu_P$ is finite, and so we want $k$ such that $\int_{\mathcal{X}} \sqrt{k(x, x)} \, dP(x) < \infty$. This is always true for kernel functions that are bounded (e.g., Gaussian RBF, inverse multiquadric) but may be violated for unbounded ones (e.g., polynomial). In fact, it has been shown that the kernel mean embedding approach to distribution regression does not yield satisfying results when using a polynomial kernel, due to the aforementioned violation.
However, as mentioned, we do not have access to $P$ but only observe $N$ i.i.d. samples $x_1, \dots, x_N$ drawn from it. Instead, we compute the empirical mean estimator of $\mu_P$, given by
$$\hat{\mu}_P = \frac{1}{N} \sum_{j=1}^{N} k(\cdot, x_j).$$
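To make the empirical estimator concrete, the sketch below evaluates the empirical mean embedding of a bag at a point and computes the RKHS inner product of two such embeddings, which by the reproducing property reduces to the average of all pairwise kernel values. The Gaussian RBF parametrization with bandwidth `sigma`, the scalar samples and the toy bags are assumptions made for illustration.

```python
import math


def rbf(x, z, sigma=1.0):
    """Gaussian RBF kernel on scalars; it is bounded, so the embedding is well defined."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))


def embedding_at(bag, t, kernel=rbf):
    """Empirical mean embedding of a bag, evaluated at the point t."""
    return sum(kernel(x, t) for x in bag) / len(bag)


def embedding_inner(bag_a, bag_b, kernel=rbf):
    """Inner product of two empirical embeddings: mean of all pairwise kernel values."""
    total = sum(kernel(x, z) for x in bag_a for z in bag_b)
    return total / (len(bag_a) * len(bag_b))


bag_1 = [0.1, 0.2, 0.15]
bag_2 = [0.12, 0.18]
bag_3 = [2.0, 2.1, 1.9, 2.05]

# Bags drawn from similar distributions have a larger inner product.
print(embedding_inner(bag_1, bag_2), embedding_inner(bag_1, bag_3))
```

The inner product never requires the embedding to be represented explicitly, which is what makes a potentially infinite-dimensional feature space usable in practice.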
3.4 Kernel Ridge Regression
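In kernel ridge regression (KRR), we seek to find the set of parameters $\alpha^{*}$, such that
$$\alpha^{*} = \arg\min_{\alpha} \; \frac{1}{2} \lVert y - K\alpha \rVert^2 + \frac{\lambda}{2} \alpha^\top K \alpha \qquad (4)$$
where $K$ is the kernelized Gram matrix of the dataset, and $\lambda$ is the hyperparameter controlling the amount of weight decay (i.e., regularization) on the parameters $\alpha$. In the case of KRR applied to kernel mean embedding, we have
$$K_{ij} = \kappa(\hat{\mu}_{P_i}, \hat{\mu}_{P_j}),$$
where $\kappa$ is the KRR kernel and $K \in \mathbb{R}^{B \times B}$ ($B$ is the number of bags in the training set). In this paper, we take $\kappa$ to be the linear kernel, as it simplifies the computation and has been shown to yield competitive results when compared to non-linear kernels. Thus, we have that
$$K_{ij} = \langle \hat{\mu}_{P_i}, \hat{\mu}_{P_j} \rangle_{\mathcal{H}} = \frac{1}{N_i N_j} \sum_{l=1}^{N_i} \sum_{m=1}^{N_j} k(x_{il}, x_{jm}),$$
where $x_{il}$ is the $l$-th instance of bag $i$ and $x_{jm}$ is the $m$-th instance of bag $j$. In order to make predictions on bags not seen during training, we simply compute
$$\hat{y}' = K' \alpha^{*},$$
where $\alpha^{*}$ is obtained by differentiating (4) with respect to $\alpha$, equating to 0, and solving for $\alpha$. Note that as mentioned in subsection 3.1, $B$ is the number of bags in the training set and $B'$ is the number of unseen bags (e.g., in the validation or testing set), and so $K' \in \mathbb{R}^{B' \times B}$.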
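For completeness, the differentiation step can be written out as below. This is a sketch assuming the standard KRR objective $J(\alpha)$ with a symmetric positive semi-definite Gram matrix $K$, label vector $y$ and regularization strength $\lambda > 0$; the symbols $J$ and $K'$ (the matrix of inner products between unseen and training bag embeddings) are notation chosen here for illustration.

```latex
\begin{align*}
J(\alpha) &= \tfrac{1}{2}\lVert y - K\alpha \rVert^2 + \tfrac{\lambda}{2}\,\alpha^\top K \alpha, \\
\nabla_\alpha J(\alpha) &= -K\,(y - K\alpha) + \lambda K \alpha
  = K\bigl((K + \lambda I)\,\alpha - y\bigr), \\
\alpha^{*} &= (K + \lambda I)^{-1} y
  \quad\Longrightarrow\quad \nabla_\alpha J(\alpha^{*}) = 0, \\
\hat{y}' &= K' \alpha^{*}.
\end{align*}
```

Since $K$ is positive semi-definite and $\lambda > 0$, the matrix $K + \lambda I$ is always invertible, so $\alpha^{*}$ is well defined even when $K$ itself is singular.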
4 Instance-kme-MIR Algorithm
In this section, we describe our novel algorithm (instance-kme-MIR), and discuss the choice we made for the hyperparameter values. We emphasise that the novelty in this paper is to compute the kernel mean embeddings on the predictions made by a previous learning algorithm, as opposed to previous works [13, 8, 6, 5], where the authors directly compute the kernel mean embeddings on the input features. Our algorithm can be seen as an extension of instance-MIR, where we take advantage of the representational power of neural networks (Part 1 of our algorithm), and address its performance bottleneck by computing the kernel mean embeddings on the predictions (Part 2 of our algorithm).
In our implementation (https://github.com/pinouche/Instance-kme-MIR), we choose $f$ to be a single-layer neural network, as it was shown to yield good results for the instance-MIR algorithm. We purposefully set the number of folds to be large, so that in Part 1 of our algorithm, we still train on all but one fold of the training set. It thus makes sense to use the same hyperparameter values for the neural network when comparing the baseline instance-MIR to instance-kme-MIR. For Part 2, we experimented with two different kernels (RBF and inverse multiquadric).
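A rough sketch of Part 2 is given below, under several assumptions: the instance predictions from Part 1 are taken as given (the lists in `train_preds` are made-up placeholders for out-of-fold neural network outputs), the kernel is a Gaussian RBF with a hypothetical bandwidth `sigma`, the KRR kernel is linear, and the regularization value `lam` is arbitrary. It is not the released implementation.

```python
import math


def rbf(x, z, sigma=1.0):
    """Gaussian RBF kernel between two scalar predictions."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))


def embedding_inner(preds_a, preds_b, kernel=rbf):
    """Linear KRR kernel between the mean embeddings of two bags of predictions."""
    total = sum(kernel(p, q) for p in preds_a for q in preds_b)
    return total / (len(preds_a) * len(preds_b))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x


def fit_krr(train_preds, labels, lam=1e-3, kernel=rbf):
    """Build the Gram matrix of bag embeddings and solve (K + lam * I) alpha = y."""
    n = len(train_preds)
    A = [[embedding_inner(train_preds[i], train_preds[j], kernel)
          + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    return solve(A, labels)


def predict_krr(alpha, train_preds, new_preds, kernel=rbf):
    """Bag prediction: kernel values against every training bag, weighted by alpha."""
    return sum(a * embedding_inner(new_preds, tp, kernel)
               for a, tp in zip(alpha, train_preds))


# Placeholder instance predictions standing in for the output of Part 1.
train_preds = [[0.9, 1.1, 1.0], [2.1, 1.8], [3.2, 2.9, 3.0, 3.1]]
labels = [1.0, 2.0, 3.0]
alpha = fit_krr(train_preds, labels)
print(predict_krr(alpha, train_preds, [1.9, 2.0, 2.2]))
```

Because the embeddings are computed on scalar predictions rather than on the input features, each kernel evaluation is a single scalar operation, which keeps the cost of this second stage small.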
5.1 Training Protocol
In order to fairly compare our algorithm to the current state-of-the-art [14, 18], we evaluate its performance using the same training and evaluation protocol. The protocol consists in a 5-fold cross validation, where the bags in the training set are randomly split into 5 folds, out of which 4 folds are used in training and 1 fold serves as the validation set. In turn, each of the 5 folds serves as the validation set and the 4 remaining folds as the training set. The cross validation is repeated 10 times in order to eliminate the randomness involved in choosing the folds. We use the root mean squared error (RMSE) to evaluate the performance and report our results, shown in Table 1, on 5 real-world datasets. While the baseline instance-MIR was already evaluated, we re-implement it on the 3 AOD datasets, with different hyperparameter values, and thus obtain distinct results. The validation loss reported in Table 1 below is the average loss over the 50 evaluations (10 iterations of 5-fold cross validation).
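The protocol above can be sketched as follows. The helper names (`repeated_kfold`, `evaluate`, `fit_mean`), the seed and the toy labels are hypothetical; the important points are that the split is made over bags rather than instances and that the reported score averages 5 x 10 = 50 validation RMSE values.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def repeated_kfold(n_bags, n_folds=5, n_repeats=10, seed=0):
    """Yield (train_idx, val_idx) over bags for n_repeats random n_folds splits."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_bags))
        rng.shuffle(order)
        folds = [order[i::n_folds] for i in range(n_folds)]
        for k in range(n_folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, folds[k]


def evaluate(labels, fit, n_folds=5, n_repeats=10):
    """Average validation RMSE over all n_folds * n_repeats evaluations."""
    scores = []
    for train, val in repeated_kfold(len(labels), n_folds, n_repeats):
        predict = fit(train)  # hypothetical: returns a function bag index -> prediction
        scores.append(rmse([labels[i] for i in val], [predict(i) for i in val]))
    return sum(scores) / len(scores), len(scores)


labels = [float(i % 7) for i in range(40)]  # toy bag labels


def fit_mean(train):
    """Toy model: always predict the mean label of the training bags."""
    m = sum(labels[i] for i in train) / len(train)
    return lambda i: m


print(evaluate(labels, fit_mean))
```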
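In Table 1, we display the results for 4 algorithms: the baseline instance-MIR (described in subsection 3.2), attention-MIR, EM-MIR and our novel algorithm (instance-kme-MIR), for two different kernels $k_{\mathrm{rbf}}$ and $k_{\mathrm{imq}}$, where
$$k_{\mathrm{rbf}}(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right), \qquad k_{\mathrm{imq}}(x, x') = \bigl(\lVert x - x' \rVert^2 + c^2\bigr)^{-1/2}.$$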
Note that prior to our implementation of instance-kme-MIR, the state-of-the-art results on the 5 datasets were shared between the 3 other algorithms. Now, as can be seen in Table 1, attention-MIR achieves the best results on the AOD datasets while instance-kme-MIR yields the best results on the crop datasets.
We experimented with several values for and , where and , with a constant increment for both hyperparameters. The results in Table 1 are reported for the best hyperparameter values. We found that while extreme hyperparameter values negatively impacted the performance of our algorithm, most values yielded similarly good results, which means that our algorithm is robust to hyperparameter values.
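As an illustration of such a search, the sketch below defines the two kernels in commonly used parametrizations and builds a grid with a constant increment. The parameter names `sigma` and `c`, the functional forms and the grid bounds are assumptions; they are placeholders rather than the values used in the experiments.

```python
import math


def rbf_kernel(x, z, sigma):
    """Gaussian RBF kernel on scalars (assumed parametrization with bandwidth sigma)."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))


def imq_kernel(x, z, c):
    """Inverse multiquadric kernel on scalars (assumed parametrization with offset c)."""
    return 1.0 / math.sqrt((x - z) ** 2 + c ** 2)


def constant_increment_grid(start, stop, step):
    """Values start, start + step, ... up to and including stop."""
    n = int(round((stop - start) / step))
    return [start + i * step for i in range(n + 1)]


# Placeholder search ranges, chosen only for this example.
sigmas = constant_increment_grid(0.5, 2.0, 0.5)
cs = constant_increment_grid(0.5, 2.0, 0.5)

for sigma, c in zip(sigmas, cs):
    print(sigma, rbf_kernel(0.0, 1.0, sigma), c, imq_kernel(0.0, 1.0, c))
```

Both kernels are bounded (by 1 and by 1/c respectively), which is the property required for the kernel mean embedding to be well defined.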
Instance-MIR (median) refers to the instance-MIR algorithm where the median is used to compute the final prediction for each bag, instead of the mean, as described in subsection 3.2. There does not seem to be an advantage to using one over the other, as both methods achieve very similar results. On the other hand, we can see that our algorithm consistently outperforms the baseline instance-MIR. However, note that since our algorithm makes use of the predictions made from instance-MIR (in Part 2 of Algorithm 1), we can only aim to achieve a measured improvement over the standard instance-MIR. Thus, our method is mostly beneficial in the cases where instance-MIR is the best out-of-the-box algorithm (e.g., on the 2 crop datasets). Since our algorithm computes the kernel mean embedding between scalars (i.e., between the real-valued predictions) and is robust to the values of the kernel hyperparameters, it is easy to tune and its computation cost is very close to that of instance-MIR.
In this paper, we developed a straightforward extension of the baseline instance-MIR algorithm. Our method takes advantage of the expressive power of neural networks while addressing the main weakness of instance-MIR by computing the kernel mean embeddings of the predictions. We have shown that our algorithm consistently outperforms the baseline and achieves state-of-the-art results on the 2 crop datasets. In addition, our algorithm is robust to the kernel parameter values and its performance gains come at a low computational cost.
Nonetheless, it falls short when the baseline instance-MIR does not yield satisfying results (e.g., on the 3 AOD datasets). This is because we compute the kernel mean embeddings on predictions made from the baseline instance-MIR, and we can thus only expect measured improvements from that baseline. Another drawback of our method comes from the fact that instance-MIR assigns the same weights to all the instances during training. However, the number of instances per bag may vary and it would make sense to be more confident when we make a prediction on a bag which contains a large number of instances compared to a bag with only a few instances. To tackle this issue, we could take a Bayesian approach to kernel mean embedding and explicitly express our uncertainty in the sampling variability of the groups.
Finally, as future work, we could use the attention coefficients from attention-MIR, in order to weigh the contribution of each of the instances towards the loss function. This would get rid of potentially redundant and noisy instances, thus improving the quality of the training data.
References
[1] Amores, J.: Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence 201, 81–105 (2013)
[2] Chen, Y., Bi, J., Wang, J.Z.: MILES: Multiple-instance learning via embedded instance selection. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(12), 1931–1947 (2006)
[3] Chen, Y., Wang, J.Z.: Image categorization by learning and reasoning with regions. Journal of Machine Learning Research 5(Aug), 913–939 (2004)
[4] Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence 89(1-2), 31–71 (1997)
[5] Flaxman, S., Sejdinovic, D., Cunningham, J.P., Filippi, S.: Bayesian learning of kernel embeddings. arXiv preprint arXiv:1603.02160 (2016)
[6] Flaxman, S., Sutherland, D., Wang, Y.X., Teh, Y.W.: Understanding the 2016 US presidential election using ecological inference and distribution regression with census microdata. arXiv preprint arXiv:1611.03787 (2016)
[7] Ichoku, C., Chu, D.A., Mattoo, S., Kaufman, Y.J., Remer, L.A., Tanré, D., Slutsker, I., Holben, B.N.: A spatio-temporal approach for global validation and analysis of MODIS aerosol products. Geophysical Research Letters 29(12), MOD1–1 (2002)
[8] Law, H.C.L., Sutherland, D.J., Sejdinovic, D., Flaxman, S.: Bayesian approaches to distribution regression. arXiv preprint arXiv:1705.04293 (2017)
[9] Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B.: Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning 10(1-2), 1–141 (2017)
[10] Ray, S., Craven, M.: Supervised versus multiple instance learning: An empirical comparison. In: Proceedings of the 22nd International Conference on Machine Learning. pp. 697–704. ACM (2005)
[11] Ray, S., Page, D.: Multiple instance regression. In: ICML. vol. 1, pp. 425–432 (2001)
[12] Sikka, K., Dhall, A., Bartlett, M.: Weakly supervised pain localization using multiple instance learning. In: Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on. pp. 1–8. IEEE (2013)
[13] Szabó, Z., Gretton, A., Póczos, B., Sriperumbudur, B.: Two-stage sampled learning theory on distributions. In: Artificial Intelligence and Statistics. pp. 948–957 (2015)
[14] Uriot, T.: Learning with sets in multiple instance regression applied to remote sensing. arXiv preprint arXiv:1903.07745 [stat.ML] (2019)
[15] Vinyals, O., Bengio, S., Kudlur, M.: Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391 (2015)
[16] Wagstaff, K.L., Lane, T.: Salience assignment for multiple-instance regression (2007)
[17] Wagstaff, K.L., Lane, T., Roper, A.: Multiple-instance regression with structured data. In: Data Mining Workshops, 2008. ICDMW'08. IEEE International Conference on. pp. 291–300. IEEE (2008)
[18] Wang, Z., Lan, L., Vucetic, S.: Mixture model for multiple instance regression and applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing 50(6), 2226–2237 (2012)
[19] Wang, Z., Radosavljevic, V., Han, B., Obradovic, Z., Vucetic, S.: Aerosol optical depth prediction from satellite observations by multiple instance regression. In: Proceedings of the 2008 SIAM International Conference on Data Mining. pp. 165–176. SIAM (2008)
[20] Wu, J., Yu, Y., Huang, C., Yu, K.: Deep multiple instance learning for image classification and auto-annotation. In: CVPR (2015)
[21] Xu, Y., Mo, T., Feng, Q., Zhong, P., Lai, M., I-Chao Chang, E.: Deep learning of feature representation with multiple instance learning for medical image analysis
[22] Yang, C., Lozano-Perez, T.: Image database retrieval with multiple-instance learning techniques. In: Data Engineering, 2000. Proceedings. 16th International Conference on. pp. 233–243. IEEE (2000)
[23] Zhang, Q., Goldman, S.A., Yu, W., Fritts, J.E.: Content-based image retrieval using multiple-instance learning. In: ICML. vol. 2, pp. 682–689. Citeseer (2002)