Learning with Sets in Multiple Instance Regression Applied to Remote Sensing

03/18/2019 · by Thomas Uriot, et al. · Imperial College London

In this paper, we propose a novel approach to tackle the multiple instance regression (MIR) problem. This problem arises when the data is a collection of bags, where each bag is made of multiple instances corresponding to a single real-valued label. Our goal is to train a regression model which maps the instances of an unseen bag to its label. This MIR setting is common in remote sensing applications, where there is high variability in the measurements and low geographical variability in the quantity being estimated. Our approach, in contrast to most competing methods, does not assume that there exists a prime instance responsible for the label in each bag. Instead, we treat each bag as a set (i.e., an unordered sequence) of instances and learn to map each bag to its label using all the instances in each bag. This is done by implementing an order-invariant operation characterized by a particular type of attention mechanism. This method is very flexible, as it neither requires domain knowledge nor makes assumptions about the distribution of the instances within each bag. We test our algorithm on five real-world datasets and outperform the previous state-of-the-art on three of them. In addition, we augment our feature space by adding the first few moments of each feature in each bag as extra features, and show that while the first few moments lead to higher accuracy, there is a diminishing return.







1. Introduction

Multiple instance learning (MIL) has been an active area of research of late. However, the most commonly studied MIL problem is that of multiple instance classification (MIC), where negative bags contain strictly negative instances and positive bags contain at least one positive instance. A multitude of applications are covered by the MIC framework, and it has been extensively applied to medical imaging in a weakly supervised setting (Wu et al., 2015), (Xu et al., [n. d.]), where each image is taken as a bag and sub-regions of the image are instances. The MIL framework has also been applied to analyzing videos (Sikka et al., 2013), where each video sequence is represented as a bag and each frame or segment is an instance, and to image categorization (Chen and Wang, 2004) and retrieval (Zhang et al., 2002), (Yang and Lozano-Perez, 2000).

However, the MIL problem in a regression setting, where bag labels are real-valued, has received surprisingly little attention in the literature. This may be because the main data sources in MIL are images and text, which are more often encountered in classification tasks. The main difference between MIR and MIC is that it is no longer possible to restrict oneself to finding a single positive instance, which makes the problem more open-ended. The MIR problem first appeared in (Ray and Page, 2001), where the authors assumed that each bag contained a prime instance responsible for the bag's label. They would first identify the prime instance in each bag and then fit a linear predictor on those instances. However, by only considering the prime instance, we may discard useful information contained in the non-prime instances. For example, if the instances in a bag follow a Gaussian distribution, a sensible choice for the prime instance would be the mean, but in doing so we would lose information about the variance, which could be important in predicting the label.

Instead of the prime instance assumption, two main directions for tackling the MIR problem have emerged in the literature. The first maps the instances in each bag to a new embedding space while losing as little information as possible. For instance, in Szabó et al. (2015) and Law et al. (2017), the authors use kernel mean embedding (Muandet et al., 2017) to summarize the information in each bag, while in Chen et al. (2006), the authors convert the MIR problem into a standard supervised learning problem by first mapping each bag into a feature space characterized by a similarity measure between the instances present in each bag. The second direction, rather than assuming that a single prime instance is responsible for the bag label, explicitly uses more than one instance per bag, whether as a weighted combination (Wagstaff and Lane, 2007) or as a prime cluster of instances (Wagstaff et al., 2008).

In this paper, we propose both to map information from each bag to a single vector, by computing the first few moments of each feature, and to learn a non-linear weighted combination of the instances, by implementing a neural network with an attention mechanism (Bahdanau et al., 2014). The idea is to treat each bag as an unordered sequence whose elements are the instances. Essentially, each bag is a set composed of a certain number of instances, and we want to make the learning process invariant to permutations of the instances. This is achieved by using an order-invariant aggregation operator (e.g., mean, median), which corresponds to a particular type of attention mechanism described in Vinyals et al. (2015). In general (e.g., in neural machine translation (Luong et al., 2015)), the instances used as input to the attention mechanism follow a sequential ordering, while in our case we assume that the instances are unordered and independent. The bag label is thus fully parametrized by neural networks, and the output is insensitive to the ordering of the instances.

We test our algorithm on 5 real-world datasets, stemming from remotely sensed data, which have previously been studied as a MIR problem, and we compare our results to the current state-of-the-art on those datasets. The first application consists in predicting aerosol optical depth (AOD) (Holben et al., 1998) [http://www.dabi.temple.edu/~vucetic/MIR.html]; aerosols are fine airborne solid particles or liquid droplets that both reflect and absorb incoming solar radiation. This was first attempted in Wang et al. (2008). The second application is the prediction of county-level crop yields (wheat and corn) in Kansas between 2001 and 2005 (Wagstaff and Lane, 2007) [https://harvist.jpl.nasa.gov/papers.shtml]. Remotely sensed satellite data is a setting in which the MIR problem naturally arises, for two reasons. Firstly, satellite sensors gather several noisy measurements, owing both to variability in the sensors themselves and to the properties of the targeted area on Earth (e.g., surface and atmospheric effects). Secondly, aerosols have very small spatial variability over distances up to 100 km (Ichoku et al., 2002). For the crop data, we can reasonably assume that yields are similar across a county, and thus consider each county as a bag and its overall crop yield as the bag label.

2. Related work

The aforementioned properties of low spatial variability and noisy measurements can be found in many applications related to estimating AOD, such as predicting greenhouse gas levels (e.g., water vapor (Tobin et al., 2006), carbon monoxide (Deeter et al., 2003) and ozone (Balis et al., 2007)). In addition to estimating greenhouse gases, the MIR problem also appears in applications which relate to Earth observation, such as the estimation of precipitation levels (Jobard et al., 2011), land surface temperature (Coll et al., 2005), soil moisture (Jackson et al., 2009), ice cap thickness (Tilling et al., 2018) and ocean salinity (Lagerloef et al., 2006). Finally, estimating vegetation productivity (Yang et al., 2006) and vegetation canopy (Knyazikhin et al., 1998) are applications which relate more closely to the analysis of crop yields and biomass density in general.

Estimating AOD in a MIR setup was investigated for the first time in Wang et al. (2008), where the authors proposed an iterative method (pruning-MIR) which prunes outlying instances from each bag as long as the prediction accuracy keeps increasing. To make the final bag label prediction from the remaining non-pruned instances, the authors simply make predictions for each of these instances and then take the mean or the median. The extreme case, where no pruning happens and all the instances in each bag are used separately to make predictions before taking the mean or the median, is called instance-MIR. The instance-MIR method essentially ignores the fact that the given problem can be framed in a MIR setup. In Wang et al. (2012), the authors build on their previous work on predicting AOD and investigate a probabilistic framework, fitting a mixture model and learning the parameters with the expectation-maximization (EM) algorithm. While the authors assume that a unique prime instance is responsible for the bag label, each instance still contributes to the label proportionally to its probability of being the prime instance. This is in contrast to Ray and Page (2001), where the authors also use the EM algorithm but select a unique prime instance (prime-MIR), rather than using a soft, probabilistic weighted combination of instances. The authors in Wang et al. (2012) test their EM algorithm on the aforementioned AOD and crop yield datasets and achieve state-of-the-art results. They also evaluate the performance of previous MIR algorithms: prime-MIR (Ray and Page, 2001), pruning-MIR (Wang et al., 2008), cluster-MIR (Wagstaff et al., 2008), and two other baseline algorithms, instance-MIR and aggregated-MIR. In this paper, we compare our algorithm (attention-MIR) to the results obtained in Wang et al. (2012) on both datasets, and re-implement instance-MIR and aggregated-MIR.

The estimation of crop yields in a MIR framework was studied in Wagstaff and Lane (2007) and Wagstaff et al. (2008), where the authors used county-level crop (wheat and corn) data in Kansas and California between 2001 and 2005. In their first work, the authors proposed the cluster-MIR algorithm, which improved upon prime-MIR by clustering similar instances together in order to represent the bag structure and identify the most relevant instance within each bag. In their second work, they sought to summarize each bag by a meta-instance characterized by a linear combination of all the instances; in turn, the bag label is assumed to be a linear combination of the features of that meta-instance.

In this paper, our method does not assume that the instances follow a particular distribution. Thus, we do not have to build a prior for the probability of an instance being the prime instance (Wang et al., 2012), which makes our method more flexible and readily applicable to any application domain. Furthermore, we do not assume that there exists a prime instance, and we do not limit ourselves to a linear regressor (Wagstaff et al., 2008). Instead, the bag label is fully parametrized by a neural network with an attention mechanism which can be made arbitrarily complex, and is trained end-to-end using backpropagation. This allows us to model complicated non-linear relationships between the instances themselves, as well as between the instances and the bag label. In this work, we implement a particular type of attention mechanism from Vinyals et al. (2015). On top of being order-invariant with regard to the instances and modeling complex relationships, it allows us to estimate the salience of each instance in predicting the label, by reading the attention coefficients. Neural network architectures whose output is invariant to permutations of the input have been proposed in Ravanbakhsh et al. (2016) and Qi et al. (2017), for the tasks of point-cloud classification and segmentation. While the attention mechanism has been widely used in machine translation (Vaswani et al., 2017) and image captioning (Xu et al., 2015), its use in the MIL setting has been minimal. One of the first investigations of an attention-based permutation-invariant operator (instead of, for instance, max or average pooling) in the context of MIL was conducted in Ilse et al. (2018), in order to identify regions of interest (ROIs) in image classification.

3. Data

In this section, we describe the 5 real-world remote sensing datasets used to evaluate our algorithm, out of which 3 (MODIS, MISR1, MISR2) stem from the AOD retrieval application and 2 (CORN, WHEAT) from the crop yield prediction.

                        AOD                     Crop Yield
            MODIS   MISR1   MISR2       WHEAT   CORN
#bags       1364    800     800         525     525
#instances  100     100     varying     100     100
#features   12      16      16          92      92
Table 1. Number of bags, instances per bag, and features per instance, for each dataset.

3.1. Aerosol Data

Aerosols are fine airborne solid particles or liquid droplets in air, that both reflect and absorb incoming solar radiation. They can come from both natural sources such as fog, forest fires, clouds or volcanic eruptions and from human activities like urban haze, transport (especially using diesel fuel) and coal burning. One can see in Figure 2 that AOD levels are the highest in very densely populated regions such as West Africa (Accra and Lagos), China and India.

While local aerosol pollution is harmful to us since we inhale these fine particles directly into our lungs, aerosol pollution in the atmosphere partly counteracts the effect of greenhouse gases on global warming by actually providing a net cooling force (Zhang et al., 2013). For these reasons, and to validate climate models, being able to estimate AOD from satellite measurements is a very important task.

AOD represents the total reduction of radiation caused by aerosols from the top of the atmosphere down to the surface. The estimation of AOD via satellite measurements relies on the fact that solar radiation is modified when it traverses the aerosols, and this modification can be measured through reflectance.

Figure 1. Locations of the 35 ground-based radiometers as part of the AERONET network. Image taken from Wang et al. (2012).

In the AOD datasets, the bag labels come from in-situ measurements, and the features stem from two different instruments placed on satellites: the Moderate Resolution Imaging Spectroradiometer (MODIS) and the Multi-angle Imaging SpectroRadiometer (MISR). These instruments gather information as multispectral images with a low spatial resolution per pixel, whereas, as mentioned, AOD levels can be assumed to be constant over distances of up to 100 km. In the MIR setting, the bag label (ground truth) is a ground-based measurement made by highly accurate instruments (most notably by the Aerosol Robotic Network, AERONET [https://aeronet.gsfc.nasa.gov/], a global network of ground-based radiometers scattered across the US and the globe, as shown in Figure 1 and Figure 2). The bag itself is a multispectral satellite image, where each pixel is taken as an instance.

Figure 2. Mean AOD (%) measured by the AERONET stations, at 550 nm, from 2000 to 2012. Figure taken from Ruiz-Arias et al. (2013).

For each instance (i.e, pixel), the MODIS and MISR features fall into two categories: reflectances at several spectral bands and solar angles. The solar angles are constant over an entire bag (i.e, multispectral image), while the reflectances vary due to surface and atmospheric effects within the considered area.

MODIS. As mentioned in Table 1, the MODIS dataset contains 1364 bags, where each bag consists of 100 instances representing randomly selected pixels around the corresponding AERONET site. The data were collected from the MODIS satellite instruments at 45 AERONET sites across the United States between 2002 and 2004. Each instance is made of 12 features, which are 7 MODIS reflectances at different spectral bands and 5 solar angles, and the corresponding bag label is the AERONET AOD ground measurement.

MISR1. The MISR1 dataset contains 800 bags, collected from the MISR satellite instruments at the 35 AERONET sites shown in Figure 1, between 2001 and 2004. Each bag also contains 100 instances representing randomly selected pixels within a 20 kilometer radius of the AERONET sites. Similarly to the MODIS data, each instance is made of 16 features: 12 MISR reflectances at different spectral bands and 4 solar angles. The corresponding bag labels are the AERONET AOD ground measurements.

MISR2. The MISR2 dataset is a cleaner version of the MISR1, where each of the 800 bags consists of a varying number of instances (713 out of the 800 bags contain 100 instances), representing randomly selected non-cloudy pixels within a 20 kilometer radius of the AERONET site. This is because, even though clouds are aerosols in and of themselves, they are not the aerosols we are interested in measuring and are thus an important source of noise. The features of the MISR2 are the same as in the MISR1 and in our analysis we only keep the 713 bags with 100 instances as some of the other bags have very few instances (with these bags having 54 instances each on average).

3.2. Crop Yield Data

The remotely sensed data for the crop yields also come from the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument aboard the Terra spacecraft. The WHEAT and CORN datasets consist of 525 bags, with data collected over the 5 years between 2001 and 2005. More precisely, there are 105 bags per year, corresponding to the 105 counties in the state of Kansas, so each bag represents a unique county for a given year. The bag label is simply the crop yield in bushels per acre (wheat or corn), reported by the U.S. Department of Agriculture (USDA), for each year. Similarly to the AOD data, each bag consists of 100 instances, which are randomly selected pixels of Earth surface within each county. Each instance is made of 92 features representing surface reflectance measured at 2 spectral bands (red and infrared), with 46 time points per band across the year (measured every 8 days at the same pixel location), giving 2 × 46 = 92 features.

Figure 3. (a) Corn yield (bushels per acre) and (b) radiance (watts per square meter per steradian) in each of the 105 counties of Kansas. Image taken from Wang et al. (2012).

Note that, since the pixels are randomly chosen within a county, a bag can equally contain crop pixels and non-crop pixels (e.g., forests, clouds, cities). One of the challenges in estimating crop yields is how to treat such non-crop pixels: are they mostly noise, or do they contain valuable information? Finally, as opposed to aerosols, which have low spatial variability, the bag label in the crop yield prediction is an aggregated value over the entire county. Figure 3(a) shows the corn yield for all the counties in Kansas in 2002, and Figure 3(b) displays the radiance at a particular spectral band of the MODIS instrument.

4. Methodology

4.1. Multiple Instance Regression

In the MIR problem, our observed dataset is {(X_b, y_b)}, b = 1, …, B, where B is the number of bags, y_b ∈ ℝ is the label of bag b, x_{b,i} is the i-th instance of bag b, X_b = {x_{b,1}, …, x_{b,n_b}}, and n_b is the number of instances in bag b. Note that x_{b,i} ∈ ℝ^d, where d is the number of features in each instance. The number of features must be the same for all the instances, but the number of instances can vary between bags.

We want to learn the best mapping f : X_b ↦ ŷ_b ∈ ℝ. By best mapping we mean the function which minimizes the mean squared error (MSE) on bags unseen during training (i.e., on the validation set). Formally, we seek f* such that

f* = arg min_{f ∈ H} (1/B_val) Σ_{b=1}^{B_val} (f(X_b) − y_b)²        (1)

on the validation data {(X_b, y_b)}, b = 1, …, B_val, where H is the hypothesis space of functions under consideration. To achieve this, the information carried by the instances in each bag has to be summarized while losing as little information as possible.
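As a concrete illustration, this setup and objective can be sketched in a few lines of Python; the bags, labels, and candidate mapping below are toy placeholders, not the data or model used in the paper.

```python
import numpy as np

# Hypothetical toy MIR dataset: B bags, each an (n_b x d) array of instances,
# paired with a single real-valued label y_b. Bag sizes are allowed to vary.
rng = np.random.default_rng(0)
d = 3
bags = [rng.normal(size=(int(rng.integers(5, 10)), d)) for _ in range(4)]
labels = [float(x.mean()) for x in bags]  # synthetic labels, for illustration only

def bag_mse(predict, bags, labels):
    """Mean squared error of a bag-level predictor f: R^{n_b x d} -> R."""
    preds = np.array([predict(x) for x in bags])
    return float(np.mean((preds - np.array(labels)) ** 2))

# A trivial candidate mapping: predict the mean of all instance features.
f = lambda x: float(x.mean())
assert bag_mse(f, bags, labels) == 0.0  # f recovers these synthetic labels exactly
```

In practice, f is chosen from a hypothesis space by minimizing this loss on held-out bags rather than being known in closed form.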

4.2. MIR Algorithms

In this section, we formally describe the aggregated-MIR and the instance-MIR which are two simple algorithms used as baselines against which we will compare our novel attention-MIR algorithm.

4.2.1. Aggregated-MIR

In the aggregated-MIR, each bag is treated as a single observation in the training of our regression function. In other words, our data is simply a set of bag and label pairs, denoted {(x̄_b, y_b)}, b = 1, …, B, where x̄_b = (1/n_b) Σ_i x_{b,i} is the mean of the bag.

The suitability of this algorithm increases with the number of instances per bag, since the sample mean gets closer to the true population mean. However, in practice, the number of instances in each bag is limited, and the mean remains sensitive to outliers. While the median is insensitive to outliers, neither statistic captures enough characteristics (e.g., variance, skewness and higher moments) of how the instances are distributed within each bag. In other words, too much information is lost in the process of summarizing the data in each bag. This is what motivates us to augment the feature space with higher-order moments instead of using the mean alone as a meta-instance.
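A minimal sketch of the aggregated-MIR baseline: each bag is collapsed to its mean meta-instance, and a plain least-squares regressor is fit on those means. The toy data and its linear-label assumption are for illustration only.

```python
import numpy as np

def aggregate_bags(bags):
    """Aggregated-MIR: summarize each bag by the mean of its instances."""
    return np.stack([x.mean(axis=0) for x in bags])

# Toy data where the label happens to be a linear function of the bag mean
# (an assumption made only for this illustration).
rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])
bags = [rng.normal(size=(50, 3)) for _ in range(30)]
y = np.array([x.mean(axis=0) @ w_true for x in bags])

X = aggregate_bags(bags)                    # one meta-instance per bag
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # fit a linear bag-level regressor
assert np.allclose(w, w_true)
```

Any regressor could replace the least-squares fit here; the point is that after aggregation the problem is ordinary supervised regression on B observations.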

4.2.2. Instance-MIR

The instance-MIR algorithm takes all the instances in each bag separately and makes a prediction for each instance, before taking the mean or the median of the predictions within each bag. This algorithm ignores the fact that the given problem can be solved in a MIR setup and treats each instance as an independent observation. Formally, our dataset is formed by pairs of instance and bag label, denoted {(x_{b,i}, y_b)}, b = 1, …, B, i = 1, …, n_b. The final label prediction on an unseen bag b can simply be calculated as

ŷ_b = (1/n_b) Σ_{i=1}^{n_b} ŷ_{b,i},

where ŷ_{b,i} is the predicted label corresponding to the i-th instance in bag b from the validation or testing set. Again, one could use the median of the predictions instead of the mean to compute ŷ_b. Empirically, this method has been shown to be competitive (Ray and Craven, 2005), even though it requires models with rather high capacity in order to effectively map many different noisy instances to the same target value.
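The instance-level prediction rule above can be sketched as follows; the per-instance model here is a stand-in for illustration, not the network used in our experiments.

```python
import numpy as np

def instance_mir_predict(model, bag, reduce=np.mean):
    """Instance-MIR: predict for every instance, then pool the predictions
    with the mean (or the median) to obtain the bag-level prediction."""
    return float(reduce(np.array([model(inst) for inst in bag])))

# Hypothetical per-instance model, for illustration: sum of features.
model = lambda inst: inst.sum()
bag = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [0.0, 5.0]])
assert instance_mir_predict(model, bag) == 5.0             # mean of 3, 7, 5
assert instance_mir_predict(model, bag, np.median) == 5.0  # median of 3, 7, 5
```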

4.2.3. Attention-MIR

Here, we describe our new MIR algorithm, based on a particular type of attention mechanism (Vinyals et al., 2015), where the bag label is fully parametrized by neural networks. Our model is very flexible and can be made arbitrarily complex by increasing the number of layers and the number of neurons per layer, which allows us to handle complicated structures in our input data. This algorithm satisfies the invariance property that we need when learning with sets, in that swapping two inputs in the set (i.e., swapping the instances within the bag) does not alter the encoding of the set. In the attention-MIR, our dataset is denoted {(X_b, y_b)}, b = 1, …, B, as in Section 4.1, and the predicted label ŷ_b of bag b is parametrized as

m_{b,i} = MLP_φ(x_{b,i}),  i = 1, …, n_b                                  (2)
q_t = LSTM(q*_{t−1})                                                      (3)
e_{i,t} = F(m_{b,i}, q_t)                                                 (4)
α_{i,t} = exp(e_{i,t}) / Σ_j exp(e_{j,t}),   r_t = Σ_i α_{i,t} m_{b,i}    (5)
q*_t = [q_t, r_t]                                                         (6)
ŷ_b = MLP_ψ(q*_T)                                                         (7)

where MLP_φ and MLP_ψ are multi-layer perceptrons, LSTM is a long short-term memory cell (Hochreiter and Schmidhuber, 1997), [·, ·] is the concatenation operation, and F is a function which takes two vectors and returns a scalar. We can be as creative as we want when choosing the function F and adapt it to the problem at hand, for example a dot product

F(m, q) = m⊤q   or an additive form   F(m, q) = w⊤ tanh(W₁m + W₂q).

In our experiments, we set the LSTM cell and the two MLPs to be single-layered. The LSTM cell takes no input and computes a recurrent hidden state over several loops called processing steps, indexed by t = 1, …, T. The initial hidden state q*_0 can either be learned or simply initialized to zeros. The LSTM updates its hidden state by repeatedly reading the memory vectors m_{b,i} via the attention mechanism. Intuitively, the network first attends to the instances it finds most important in (4) and (5), summarizes the information in (5), and uses it to update the hidden state of the LSTM (3), where it attends to the instances again, conditioned on the previous processing steps. The order-invariant operation happens in (5), where permuting x_{b,i} with x_{b,j} yields the same vector r_t, thanks to the summation. Similarly to the work in Wagstaff and Lane (2007), it is easy to identify the salience of each instance by simply reading off the attention coefficients α_{i,t} for each bag.

Note that with 0 processing steps (i.e., T = 0), the prediction is computed from the zero-initialized q*_0, and the attention coefficients α do not carry any information, since the LSTM has never seen the inputs x_{b,i}. In other words, the attention mechanism points blindly (i.e., randomly) at the instances. The effect of having multiple processing steps is illustrated in Figure 4 below, which depicts our end-to-end architecture, from the original inputs to the final prediction. For each processing step, the attention mechanism has access to the original inputs; it can then refine its choices as to which inputs matter, conditioned on the information from the previous steps.

Figure 4. Diagram showing our architecture described in (2)-(7), for bag .
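The core order-invariant read can be illustrated in isolation, stripped of the LSTM and the MLPs. This sketch uses a dot-product score as one possible choice of the scalar-valued scoring function; the point is that the softmax-weighted sum does not depend on the ordering of the instances.

```python
import numpy as np

def softmax(e):
    """Numerically stable softmax over a score vector."""
    e = e - e.max()
    p = np.exp(e)
    return p / p.sum()

def attention_read(M, q):
    """One attention read over a set: score each memory vector against the
    query, normalize the scores with a softmax, and return the weighted sum.
    The summation makes the result invariant to the row order of M."""
    e = M @ q           # dot-product scores, one per memory vector
    alpha = softmax(e)  # attention coefficients
    return alpha @ M    # order-invariant summary vector

rng = np.random.default_rng(2)
M = rng.normal(size=(5, 4))  # 5 instances (memory vectors), 4 features each
q = rng.normal(size=4)       # query (e.g. an LSTM hidden state)

r1 = attention_read(M, q)
r2 = attention_read(M[::-1], q)  # same instances, reversed order
assert np.allclose(r1, r2)       # permutation invariance
```

In the full model, the query is updated over several processing steps and the read vector is concatenated back into the recurrent state; here only the invariant read itself is shown.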

4.3. Augmenting the Feature Space Using Moments

As mentioned previously, the use of the mean or the median in the aggregated-MIR to summarize the information within each bag is limited and fails to capture important features of a distribution. We thus compute the first few raw sample moments, where the k-th raw sample moment is defined as

m̂_k = (1/n) Σ_{i=1}^{n} x_i^k,

which can be shown to be an unbiased estimator of the k-th raw moment of the population, where n is the sample size (here, the number of instances in the bag). The choice of using moments to characterize a distribution is natural, due to the direct interpretation of the first few moments in terms of the shape of the distribution (i.e., variance, skewness, kurtosis). Higher-order moments (beyond the fourth) require more data to yield estimates of good quality and are more subtle to interpret in terms of shape parameters of a distribution. We show in our results that while using the first few moments can lead to better predictions, higher-order moments do not bring useful additional information and even make our algorithm more prone to overfitting.
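The moment-based augmentation can be sketched as follows: for each feature of a bag, compute the first k raw sample moments and concatenate them into one meta-feature vector (a simple numpy illustration with a tiny hand-made bag).

```python
import numpy as np

def raw_moments(bag, k_max):
    """First k_max raw sample moments of each feature of a bag, concatenated:
    m_k = (1/n) * sum_i x_i^k for k = 1..k_max, computed per feature column."""
    return np.concatenate([(bag ** k).mean(axis=0) for k in range(1, k_max + 1)])

bag = np.array([[1.0, 2.0],
                [3.0, 2.0]])  # 2 instances, 2 features
m = raw_moments(bag, 2)
# First moments (means): [2.0, 2.0]; second raw moments: [(1+9)/2, (4+4)/2] = [5.0, 4.0]
assert np.allclose(m, [2.0, 2.0, 5.0, 4.0])
```

These moment vectors can then be appended to the instance features (or used as a meta-instance), as done in the experiments.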

Note that raw sample moments are simply mappings from a set of observations to a real number, and it is natural to ask whether we could simply use kernel functions instead.

5. Evaluation

In this section, we first describe the evaluation protocol, then report our results for different hyperparameter regimes, and finally compare them to the state-of-the-art obtained in Wang et al. (2012).

5.1. Training Protocol

In order to have a fair comparison between our method (attention-MIR) and the EM-MIR proposed in Wang et al. (2012), we evaluate the performance of our algorithm using the same evaluation protocol: a 5-fold cross validation, where the bags are randomly split into 5 subsets and the held-out fold is split in half to serve as validation and test sets. In turn, each of the 5 folds is taken as the validation and test set, and the 4 remaining ones are used for training. The hyperparameters are chosen to minimize the validation loss, and we report the loss (RMSE) on the test set. In doing so, we avoid reporting the error on the validation set, which can be over-optimistic (since we may overfit the validation set by trying out many hyperparameter values). Furthermore, it would be even more misleading to report the lowest validation loss when using a stochastic algorithm such as a neural network, because the minimum error on the validation set could occur by chance and not transfer to the test set. The authors in Wang et al. (2012), however, report their results on the validation set, which might be slightly over-optimistic compared to using a test set. Since our datasets are rather small, we perform the 5-fold cross validation 10 times in order to give a more accurate picture of our algorithm's performance, by eliminating the randomness involved in choosing the folds. The final loss reported in Table 2 below is simply the average of the losses computed on the test set over the 50 runs (10 iterations of 5-fold cross validation).
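The split protocol described above (10 repetitions of 5-fold cross validation, with each held-out fold halved into validation and test sets) can be sketched as follows; the fold sizes below assume 800 bags, as in MISR1, purely for illustration.

```python
import numpy as np

def repeated_splits(n_bags, n_folds=5, n_repeats=10, seed=0):
    """Yield (train, val, test) index arrays: for every repetition, shuffle
    the bags into n_folds folds, and in turn split each held-out fold in
    half into validation and test sets."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        idx = rng.permutation(n_bags)
        folds = np.array_split(idx, n_folds)
        for i, fold in enumerate(folds):
            half = len(fold) // 2
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            yield train, fold[:half], fold[half:]

splits = list(repeated_splits(800))
assert len(splits) == 50  # 10 repetitions of 5-fold cross validation
train, val, test = splits[0]
assert len(train) == 640 and len(val) == 80 and len(test) == 80
assert set(train) | set(val) | set(test) == set(range(800))
```

The reported loss is then the average test RMSE over the 50 (train, val, test) splits, with hyperparameters chosen on the validation halves.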

Note that due to time and computing resource constraints, we did not try many hyperparameter values, which means that better results could likely be obtained by doing so. The code can be used to investigate more neural architectures and other hyperparameter regimes [https://github.com/pinouche/attention-MIR].

5.2. Results

The results obtained on the 5 datasets using the attention-MIR algorithm, for different numbers of processing steps T in (2)-(7), are shown in Table 2. Note that in our case, we set the number of neurons in (4) and (7) to be equal to the LSTM size.

Datasets        #processing steps               LSTM size
            T=1     T=2     T=3     T=4
MODIS       11.1    9.09    9.05    9.31        256
MISR1       9.63    7.32    8.47    8.77        256
MISR2       7.73    6.95    6.98    7.14        256
WHEAT       9.59    5.24    5.44    5.70        512
CORN        43.6    27.0    27.6    29.3        512
Table 2. Test loss on the 5 datasets for the attention-MIR algorithm. For the 3 AOD datasets the loss is the RMSE × 100, and for the 2 CROP datasets the loss is the RMSE.

For T = 1, the attention mechanism has only been conditioned on the inputs once: the network has a single opportunity to decide which instances matter most. With T ≥ 2, the attention mechanism is able to refine its choices when assigning weights to the instances and can attend to instances other than those from the first step. We can see from Table 2 that the optimal number of processing steps is T = 2 (or T = 3 for MODIS), and that the poorest performance occurs for T = 1. For T = 4, we think that the decrease in performance might be due to the network assigning importance to noisy instances, leading to overfitting and poorer generalization.

In Table 3, we display the best obtained results from Table 2 for our attention-MIR algorithm, and compare them to the state-of-the-art previously attained on these datasets in Wang et al. (2012).

                        MODIS   MISR1   MISR2   WHEAT   CORN
Aggregated              12.5    9.74    7.61    5.63    35.76
Instance                12.0    10.7    7.94    4.96    24.57
EM (Wang et al., 2012)  9.5     7.5     7.3     4.9     26.8
Attention               9.05    7.32    6.95    5.24    27.00
Table 3. Test loss on the 5 datasets for the attention-MIR, EM-MIR and the two baseline algorithms. For the 3 AOD datasets the loss is the RMSE × 100, and for the 2 CROP datasets the loss is the RMSE.

We see that attention-MIR gives the best results on the 3 AOD datasets, while remaining competitive on the crop datasets. The weaker performance on the crop datasets can be explained by the fact that they have only 525 bags and 92 features, which is the smallest number of bags and the largest number of features out of all the datasets; our model thus has a very large number of parameters to learn and only 420 bags to train on. In addition, we show the results achieved by re-implementing the aggregated-MIR and instance-MIR baselines. Even though the competitiveness of the instance-MIR algorithm has been noted before (Ray and Craven, 2005), we were still surprised to see that it is the best performing method on the CORN dataset and a very close second on the WHEAT dataset. The big discrepancy between our results and those found in Wang et al. (2012) for the instance-MIR is probably due to the fact that those authors used a neural network with low capacity (one layer of 10 neurons), whereas mapping several very noisy observations to the same target value requires a complex model with high capacity. For this reason, we used a single-layered neural network with 256 neurons and an appropriate amount of weight decay regularization.

Figure 5. Results obtained when augmenting the input features by using the raw sample moments, for each of the 5 datasets.

Finally, we augment our input features by adding the first few raw sample moments and show the results in Figure 5 above. Comparing to the results in Table 3, we see that using moments can increase performance. Our attention-MIR algorithm benefits from adding the first moments, except on the MISR1 and MISR2 datasets. We can also see that as more moments are added, the error of attention-MIR starts increasing, which is due to quickly overfitting the data. Furthermore, this algorithm should already capture complex non-linear transformations of the instances, and thus does not benefit much from such feature engineering. On the other hand, aggregated-MIR and instance-MIR respectively take the mean of the instances (i.e., the first moment) as a meta-instance and treat each instance separately. This means that, for these two algorithms, there is no way to extract more complex relationships between the instances in each bag. This is why adding moments to the inputs leads to a big increase in their performance, except on the 2 CROP datasets, due to the small number of bags. It is interesting to note that the loss curves for the aggregated-MIR and instance-MIR follow a similar pattern (except on MISR1), despite the two algorithms being very different.

6. Conclusion

We developed a flexible novel MIR algorithm based on the attention mechanism in order to treat each bag as a set by rendering the output invariant to permutations in the input. In doing so, we are able to assign a weight to all the instances within each bag and to capture complex relationships between the instances.
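The permutation-invariant pooling at the heart of this design can be sketched in a few lines, in the spirit of the gated attention of Ilse et al. (2018). This is a simplified numpy sketch with untrained, illustrative parameters (`V`, `w`), not the trained model: each instance is scored by a small tanh network, the scores are softmax-normalized into attention weights, and the bag embedding is the weighted sum of instances.

```python
import numpy as np

def attention_pool(bag, V, w):
    """Permutation-invariant pooling over a bag of instances.
    bag: (n_instances, d); V: (d, h); w: (h,).
    Returns a (d,) bag embedding that ignores instance order."""
    scores = np.tanh(bag @ V) @ w          # one score per instance
    a = np.exp(scores - scores.max())      # numerically stable softmax
    a /= a.sum()                           # attention weights sum to 1
    return a @ bag                         # weighted sum of instances
```

Because both the softmax and the weighted sum are computed symmetrically over instances, shuffling the rows of `bag` leaves the output unchanged, which is the set-invariance property exploited here.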

We have shown that our algorithm achieved state-of-the-art results on the 3 AOD datasets, while being very competitive on the 2 CROP datasets. One weakness of our model, exposed on the CROP datasets, is that it requires a large number of bags to perform well. This is due to our model having more parameters than the other MIR algorithms. On the other hand, despite its simplicity, instance-MIR was the best performing algorithm on the CROP datasets. We also showed that augmenting the feature space using raw sample moments often led to a significant increase in performance across all the algorithms and datasets.

For future work, it would be interesting to use arbitrary transformations instead of the raw sample moments, such as kernel functions. In addition, we could investigate combining information across bags as in Szabó et al. (2015), as well as across features within each bag (e.g., using moments). Finally, we could look at improving instance-MIR by addressing the performance bottleneck that arises when taking the mean or the median of all the predictions in each bag.


  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
  • Balis et al. (2007) D Balis, M Kroon, ME Koukouli, EJ Brinksma, G Labow, JP Veefkind, and RD McPeters. 2007. Validation of Ozone Monitoring Instrument total ozone column measurements using Brewer and Dobson spectrophotometer ground-based observations. Journal of Geophysical Research: Atmospheres 112, D24 (2007).
  • Chen et al. (2006) Yixin Chen, Jinbo Bi, and James Ze Wang. 2006. MILES: Multiple-instance learning via embedded instance selection. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 12 (2006), 1931–1947.
  • Chen and Wang (2004) Yixin Chen and James Z Wang. 2004. Image categorization by learning and reasoning with regions. Journal of Machine Learning Research 5, Aug (2004), 913–939.
  • Coll et al. (2005) César Coll, Vicente Caselles, Joan M Galve, Enric Valor, Raquel Niclos, Juan M Sánchez, and Raúl Rivas. 2005. Ground measurements for the validation of land surface temperatures derived from AATSR and MODIS data. Remote sensing of Environment 97, 3 (2005), 288–300.
  • Deeter et al. (2003) MN Deeter, LK Emmons, GL Francis, DP Edwards, JC Gille, JX Warner, B Khattatov, D Ziskin, J-F Lamarque, S-P Ho, et al. 2003. Operational carbon monoxide retrieval algorithm and selected results for the MOPITT instrument. Journal of Geophysical Research: Atmospheres 108, D14 (2003).
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
  • Holben et al. (1998) Brent N Holben, Slutsker Eck, D Tanre, JP Buis, A Setzer, E Vermote, John A Reagan, YJ Kaufman, T Nakajima, et al. 1998. AERONET- A federated instrument network and data archive for aerosol characterization. Remote sensing of environment 66, 1 (1998), 1–16.
  • Ichoku et al. (2002) Charles Ichoku, D Allen Chu, Shana Mattoo, Yoram J Kaufman, Lorraine A Remer, Didier Tanré, Ilya Slutsker, and Brent N Holben. 2002. A spatio-temporal approach for global validation and analysis of MODIS aerosol products. Geophysical Research Letters 29, 12 (2002), MOD1–1.
  • Ilse et al. (2018) Maximilian Ilse, Jakub M Tomczak, and Max Welling. 2018. Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712 (2018).
  • Jackson et al. (2009) Thomas J Jackson, Rajat Bindlish, and Michael Cosh. 2009. Validation of AMSR-E soil moisture products using in situ observations. Journal of the Remote Sensing Society of Japan 29, 1 (2009), 263–270.
  • Jobard et al. (2011) I Jobard, F Chopin, Jean Claude Bergès, and R Roca. 2011. An intercomparison of 10-day satellite precipitation products during West African monsoon. International Journal of Remote Sensing 32, 9 (2011), 2353–2376.
  • Knyazikhin et al. (1998) Yuri Knyazikhin, JV Martonchik, Ranga B Myneni, DJ Diner, and Steven W Running. 1998. Synergistic algorithm for estimating vegetation canopy leaf area index and fraction of absorbed photosynthetically active radiation from MODIS and MISR data. Journal of Geophysical Research: Atmospheres 103, D24 (1998), 32257–32275.
  • Lagerloef et al. (2006) G Lagerloef, Y Chao, and F Colomb. 2006. Aquarius/SAC-D ocean salinity mission science overview. In Geoscience and Remote Sensing Symposium, 2006. IGARSS 2006. IEEE International Conference on. IEEE, 1675–1677.
  • Law et al. (2017) Ho Chung Leon Law, Dougal J Sutherland, Dino Sejdinovic, and Seth Flaxman. 2017. Bayesian approaches to distribution regression. arXiv preprint arXiv:1705.04293 (2017).
  • Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
  • Muandet et al. (2017) Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. 2017. Kernel Mean Embedding of Distributions: A Review and Beyond. Foundations and Trends® in Machine Learning 10, 1-2 (2017), 1–141.
  • Qi et al. (2017) Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. 2017. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1, 2 (2017), 4.
  • Ravanbakhsh et al. (2016) Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. 2016. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500 (2016).
  • Ray and Craven (2005) Soumya Ray and Mark Craven. 2005. Supervised versus multiple instance learning: An empirical comparison. In Proceedings of the 22nd international conference on Machine learning. ACM, 697–704.
  • Ray and Page (2001) Soumya Ray and David Page. 2001. Multiple instance regression. In ICML, Vol. 1. 425–432.
  • Ruiz-Arias et al. (2013) JA Ruiz-Arias, Jimy Dudhia, CA Gueymard, and D Pozo-Vázquez. 2013. Assessment of the Level-3 MODIS daily aerosol optical depth in the context of surface solar radiation and numerical weather modeling. Atmospheric Chemistry and Physics 13, 2 (2013), 675–692.
  • Sikka et al. (2013) Karan Sikka, Abhinav Dhall, and Marian Bartlett. 2013. Weakly supervised pain localization using multiple instance learning. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on. IEEE, 1–8.
  • Szabó et al. (2015) Zoltán Szabó, Arthur Gretton, Barnabás Póczos, and Bharath Sriperumbudur. 2015. Two-stage sampled learning theory on distributions. In Artificial Intelligence and Statistics. 948–957.
  • Tilling et al. (2018) Rachel L Tilling, Andy Ridout, and Andrew Shepherd. 2018. Estimating Arctic sea ice thickness and volume using CryoSat-2 radar altimeter data. Advances in Space Research 62, 6 (2018), 1203–1225.
  • Tobin et al. (2006) David C Tobin, Henry E Revercomb, Robert O Knuteson, Barry M Lesht, L Larrabee Strow, Scott E Hannon, Wayne F Feltz, Leslie A Moy, Eric J Fetzer, and Ted S Cress. 2006. Atmospheric Radiation Measurement site atmospheric state best estimates for Atmospheric Infrared Sounder temperature and water vapor retrieval validation. Journal of Geophysical Research: Atmospheres 111, D9 (2006).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
  • Vinyals et al. (2015) Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2015. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391 (2015).
  • Wagstaff and Lane (2007) Kiri L Wagstaff and Terran Lane. 2007. Salience assignment for multiple-instance regression. (2007).
  • Wagstaff et al. (2008) Kiri L Wagstaff, Terran Lane, and Alex Roper. 2008. Multiple-instance regression with structured data. In Data Mining Workshops, 2008. ICDMW’08. IEEE International Conference on. IEEE, 291–300.
  • Wang et al. (2012) Zhuang Wang, Liang Lan, and Slobodan Vucetic. 2012. Mixture model for multiple instance regression and applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing 50, 6 (2012), 2226–2237.
  • Wang et al. (2008) Zhuang Wang, Vladan Radosavljevic, Bo Han, Zoran Obradovic, and Slobodan Vucetic. 2008. Aerosol optical depth prediction from satellite observations by multiple instance regression. In Proceedings of the 2008 SIAM International Conference on Data Mining. SIAM, 165–176.
  • Wu et al. (2015) Jianjun Wu, Yinan Yu, Chang Huang, and Kai Yu. 2015. Deep multiple instance learning for image classification and auto-annotation. In CVPR.
  • Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning. 2048–2057.
  • Xu et al. ([n. d.]) Yan Xu, Tao Mo, Qiwei Feng, Peilin Zhong, Maode Lai, and Eric I-Chao Chang. [n. d.]. Deep learning of feature representation with multiple instance learning for medical image analysis. ([n. d.]).
  • Yang and Lozano-Perez (2000) Cheng Yang and Tomas Lozano-Perez. 2000. Image database retrieval with multiple-instance learning techniques. In Data Engineering, 2000. Proceedings. 16th International Conference on. IEEE, 233–243.
  • Yang et al. (2006) Wenze Yang, Bin Tan, Dong Huang, Miina Rautiainen, Nikolay V Shabanov, Yujie Wang, Jeffrey L Privette, Karl Fred Huemmrich, Rasmus Fensholt, Inge Sandholt, et al. 2006. MODIS leaf area index products: From validation to algorithm improvement. IEEE Transactions on Geoscience and Remote Sensing 44, 7 (2006), 1885–1898.
  • Zhang et al. (2002) Qi Zhang, Sally A Goldman, Wei Yu, and Jason E Fritts. 2002. Content-based image retrieval using multiple-instance learning. In ICML, Vol. 2. Citeseer, 682–689.
  • Zhang et al. (2013) Rong Zhang, Thomas L Delworth, Rowan Sutton, Daniel LR Hodson, Keith W Dixon, Isaac M Held, Yochanan Kushnir, John Marshall, Yi Ming, Rym Msadek, et al. 2013. Have aerosols caused the observed Atlantic multidecadal variability? Journal of the Atmospheric Sciences 70, 4 (2013), 1135–1144.