A bag-to-class divergence approach to multiple-instance learning

03/07/2018 ∙ by Kajsa Møllersen, et al. ∙ University of Tromsø the Arctic University of Norway

In multi-instance (MI) learning, each object (bag) consists of multiple feature vectors (instances), and is most commonly regarded as a set of points in a multidimensional space. A different viewpoint is that the instances are realisations of random vectors with corresponding probability distribution, and that a bag is the distribution, not the realisations. In MI classification, each bag in the training set has a class label, but the instances are unlabelled. By introducing the probability distribution space to bag-level classification problems, dissimilarities between probability distributions (divergences) can be applied. The bag-to-bag Kullback-Leibler information is asymptotically the best classifier, but the typical sparseness of MI training sets is an obstacle. We introduce bag-to-class divergence to MI learning, emphasising the hierarchical nature of the random vectors that makes bags from the same class different. We propose two properties for bag-to-class divergences, and an additional property for sparse training sets.




1 Introduction

1.1 Multi-instance learning

In supervised learning, the training data consists of $K$ objects, $\mathbf{x}$, with corresponding class labels, $y$; $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_K, y_K)\}$. An object is typically a vector of $d$ feature values, $\mathbf{x} = (x_1, \ldots, x_d)$, named instance. In multi-instance (MI) learning, each object consists of several instances. The set $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, where the $n$ elements are vectors of length $d$, is referred to as bag. The number of instances, $n$, varies from bag to bag, whereas the vector length, $d$, is constant. In supervised MI learning, the training data consists of $K$ sets and their corresponding class labels, $\{(\mathcal{X}_1, y_1), \ldots, (\mathcal{X}_K, y_K)\}$.

Figure 1(a) shows an image (bag) of benign breast tissue Gelasca2008Evaluation, divided into segments with corresponding feature vectors (instances) Kandemir2014Empowering. Correspondingly, Figure 1(b) shows malignant breast tissue.

(a) Benign
(b) Malignant
Figure 1: Breast tissue images. The image segments are not labelled.

The images in the data set have class labels, the individual segments do not. This is a key characteristic of MI learning: the instances are not labelled. MI learning includes instance classification Doran2016MultipleInstance, clustering Zhang2009Multiinstance, regression Zhang2009Multiinstance, and multi-label learning Zhou2012Multiinstance; Tang2017Deep, but this article will focus on bag classification. MI learning can also be found as integrated parts of end-to-end methods for image analysis that generate patches, extract features and do feature selection Tang2017Deep. See also Wang2018Revisiting for an overview and discussion on end-to-end neural network MI learning methods.

The term MI learning was introduced in an application of molecules (bags) with different shapes (instances), and their ability to bind to other molecules Dietterich1997Solving. A molecule binds if at least one of its shapes can bind. In MI terminology, the classes, $y$, in binary classification are referred to as positive, $pos$, and negative, $neg$. The assumption that a positive bag contains at least one positive instance, and a negative bag contains only negative instances, is referred to as the standard MI assumption.

Many new applications violate the standard MI assumption, such as image classification Xu2016Multipleinstance and text categorisation Qiao2017Diversified . Consequently, successful algorithms meet more general assumptions, see e.g. the hierarchy of Weidmann et al. Weidmann2003Twolevel or Foulds and Frank’s taxonomy Foulds2010Review . For a more recent review on MI classification algorithms, see e.g. Cheplygina2015Multiple . Carbonneau et al. Carbonneau2018Multiple discussed sample independence and data sparsity, which we address in Section 3.2. Amores Amores2013Multiple presented the three paradigms of instance space (IS), embedded space (ES), and bag space (BS). IS methods aggregate the outcome of single-instance classifiers applied to the instances of a bag, whereas ES methods map the instances to a vector, and then use a single-instance classifier. In the BS paradigm, the instances are transformed to a non-vectorial space where the classification is performed, avoiding the detour via single-instance classifiers. The non-vectorial space of probability functions has not yet been introduced to the BS paradigm, despite its analytical benefits.

Whereas both Carbonneau et al. Carbonneau2018Multiple and Amores Amores2013Multiple defined a bag as a set of feature vectors, Foulds and Frank Foulds2010Review stated that a bag can also be modelled as a probability distribution. The distinction is necessary in analysis of classification approaches, and both viewpoints offer benefits, see Section 6.1 for a discussion.

1.2 The non-vectorial space of probability functions

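From the probabilistic viewpoint, an instance is a realisation of a random vector, $X$, with probability distribution $P(X)$ and sample space $\mathcal{X}$. The posterior probability of the positive class, $P(pos \mid \mathbf{x})$, is an effective classifier if the standard MI assumption holds, since it is known beforehand to be

$$P(pos \mid \mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{x} \in \mathcal{X}^{+}, \\ 0 & \text{otherwise,} \end{cases}$$

where $\mathcal{X}^{+}$ is the positive instance space, and the positive and negative instance spaces are disjoint.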

Bayes’ rule, $P(pos \mid \mathbf{x}) = P(\mathbf{x} \mid pos) P(pos) / P(\mathbf{x})$, can be used when the posterior probability is unknown. An assumption used to estimate the probability distribution of instance given the class, $P(\mathbf{x} \mid pos)$, is that instances from bags of the same class are independent and identically distributed (i.i.d.) random samples, but this is a poor description for MI learning. As an illustrative example, let the instances be the colour of image segments from the class sea. Image $k$ depicts a clear blue sea, whereas image $\ell$ depicts a deep green sea, and instance distributions are clearly dependent not only on class, but also on bag. The random vectors in bag $k$ are i.i.d., but have a different distribution than those in bag $\ell$. An important distinction between uncertain objects, whose distribution depends solely on the class label Jiang2013Clustering; Kriegel2005Densitybased, and MI learning is that the instances of two bags from the same class are not from the same distribution.

The dependency structure in MI learning can be described as a hierarchical distribution (Eq. 3), where a bag, $B$, is defined as the probability distribution of its instances, $P(X \mid B)$, and the bag space, $\mathcal{B}$, is a set of distributions.

1.3 Dissimilarities in MI learning

Dissimilarities in MI learning can be categorised as instance-to-instance, bag-to-bag or bag-to-class. Amores Amores2013Multiple implicitly assumed metricity for dissimilarity functions Scholkopf2000Kernel in the BS paradigm, but there is nothing inherent to MI learning that imposes these restrictions. The non-metric Kullback-Leibler (KL) information Kullback1951Information is an example of a divergence: a dissimilarity measure between two probability distributions.

Divergences have not been used in MI learning, due to the lack of a probability function space defined for the BS paradigm, despite the benefit of analysis independent of specific data sets Gibbs2002Choosing. The $f$-divergences Ali1966General; Csiszar1967Informationtype have desirable properties for dissimilarity measures, including minimum value for equal distributions, but there is no complete categorisation of divergences. The KL information is a non-symmetric $f$-divergence, often used in both statistics and computer science, and is defined as follows for two probability density functions (pdfs) $f_a(\mathbf{x})$ and $f_b(\mathbf{x})$:

$$D_{KL}(f_a, f_b) = \int f_a(\mathbf{x}) \log \frac{f_a(\mathbf{x})}{f_b(\mathbf{x})} \, d\mathbf{x}. \tag{1}$$

An example of a symmetric $f$-divergence is the Bhattacharyya (BH) distance, defined as

$$D_{BH}(f_a, f_b) = -\log \int \sqrt{f_a(\mathbf{x}) f_b(\mathbf{x})} \, d\mathbf{x}, \tag{2}$$

and can be a better choice if the absolute difference, and not the ratio, differentiates the two pdfs. The appropriate divergence for a specific task can be chosen based on identified properties, e.g. for clustering Mollersen2016DataIndependent, or a new dissimilarity function can be proposed Mollersen2015Divergencebased.
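As a numerical illustration, the following minimal sketch approximates the KL information (the integral of one pdf times the log ratio of the two pdfs) and the BH distance (minus the log of the integral of the square root of their product) for two one-dimensional Gaussian pdfs with a Riemann sum. The function names, the integration limits and the example parameters are assumptions made for illustration; they are not taken from the paper's code.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def riemann(func, lo, hi, n=20000):
    """Midpoint Riemann sum of func over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(func(lo + (i + 0.5) * h) for i in range(n))

def kl_information(f, g, lo, hi):
    """KL information: integral of f * log(f / g); non-symmetric."""
    def integrand(x):
        fx, gx = f(x), g(x)
        if fx == 0.0:
            return 0.0          # convention: 0 * log(0 / g) = 0
        if gx == 0.0:
            return math.inf     # f > 0 where g = 0 gives infinite divergence
        return fx * math.log(fx / gx)
    return riemann(integrand, lo, hi)

def bh_distance(f, g, lo, hi):
    """Bhattacharyya distance: -log of the integral of sqrt(f * g); symmetric."""
    return -math.log(riemann(lambda x: math.sqrt(f(x) * g(x)), lo, hi))

# Hypothetical example: a narrow and a wide Gaussian.
f_a = lambda x: gauss_pdf(x, 0.0, 1.0)
f_b = lambda x: gauss_pdf(x, 1.0, 2.0)

kl_ab = kl_information(f_a, f_b, -20.0, 20.0)
kl_ba = kl_information(f_b, f_a, -20.0, 20.0)
bh_ab = bh_distance(f_a, f_b, -20.0, 20.0)
bh_ba = bh_distance(f_b, f_a, -20.0, 20.0)
```

Swapping the arguments changes the KL information but leaves the BH distance unchanged, which is the asymmetry that the later properties rely on.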

This article aims to identify properties for bag classification, and we make the following contributions:

  • Presenting the hierarchical model for general, non-standard MI assumptions (Section 3.1).

  • Introduction of bag-to-class dissimilarity measure (Section 3.2).

  • Identification of two properties for bag-to-class divergence (Section 4.1).

  • A new bag-to-class dissimilarity measure for sparse training data (Section 4.2).

In Section 5, the KL information and the new dissimilarity measure are applied to data sets and the results are reported. Bags defined in the probability distribution space, in combination with bag-to-class divergence, constitute a new framework for MI learning, which is compared to other frameworks in Section 6.

2 Related work

The feature vector set viewpoint seems to be the most common, but the probabilistic viewpoint was introduced already in 1998, then under the i.i.d. given class assumption Maron1998Framework. This assumption has been used in approaches such as estimating the expectation by the mean Xu2004Logistic, or estimation of class distribution parameters Tax2011Bag, but has also been criticised Zhou2009Multiinstance. The hierarchical distribution was introduced for learnability theory under the standard MI assumption for instance classification Doran2016MultipleInstance, and we expand the use for more general assumptions.

Dissimilarities in MI learning have been categorised as instance-to-instance or bag-to-bag Amores2013Multiple; Cheplygina2016DissimilarityBased. The bag-to-prototype approach in Cheplygina2016DissimilarityBased offers an in-between category, but the theoretical framework is missing. Bag-to-class dissimilarity has not been studied within the MI framework, but was used under the i.i.d. given class assumption for image classification in Boiman2008In, where also the sparseness of training sets was addressed: if the instances are aggregated on class level, a denser representation is achieved. Many MI algorithms use dissimilarities, e.g. graph distances Lee2012Bridging, Hausdorff metrics Scott2005Generalized, functions of the Euclidean distance Cheplygina2015Multiple; RuizMunoz2016Enhancing, and distribution parameter based distances Cheplygina2015Multiple. The performances of dissimilarities on specific data sets have been investigated Cheplygina2015Multiple; Tax2011Bag; Cheplygina2016DissimilarityBased; RuizMunoz2016Enhancing; Sorensen2010DissimilarityBased, but more analytical comparisons are missing. A large class of commonly used kernels are also distances Scholkopf2000Kernel, and hence, many kernel-based approaches in MI learning can be viewed as dissimilarity-based approaches. In Wei2017Scalable, the Fisher kernel is used as input to a support vector machine (SVM), whereas in Zhou2009Multiinstance and Qiao2017Diversified the kernels are an integrated part of the methods.

The non-vectorial graph space was used in Zhou2009Multiinstance ; Lee2012Bridging . We introduce the non-vectorial space of probability functions as an extension within the BS paradigm for bag classification through dissimilarity measures between distributions.

The KL information was applied in Boiman2008In, and is a much-used divergence function. It is closely connected to the Fisher information Kullback1951Information used in Wei2017Scalable and to the cross entropy used as loss function in Wang2018Revisiting. We propose a conditional KL information in Section 4.2, which differs from the earlier proposed weighted KL information Sahu2003Fast whose weight is a constant function of $\mathbf{x}$.

3 Theoretical background

3.1 Hierarchical distributions

A bag is the probability distribution from which the instances are sampled. The generative model of instances from a positive or negative bag follows a hierarchical distribution


respectively. The common view in MI learning is that a bag consists of positive and negative instances, which corresponds to a bag being a mixture of a positive and a negative distribution.

Consider tumour images labelled or , with instances extracted from segments. Let and denote the pdfs of positive and negative segments, respectively, of image . The pdf of bag is a mixture distribution

where , where if instance is positive. The probability of positive segments, , depends on the image’s class label, and hence is sampled from or . The characteristics of positive and negative segments vary from image to image. Hence, and

are realisations of random variables, with corresponding probability distributions

and . The generative model of instances from a positive (negative) bag is


The corresponding sampling procedure from positive (negative) bag, , is
Step 1: Draw from , from , and from . These three parameters define the bag.
Step 2: For , draw from , draw from if , and from otherwise.
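The two-step sampling procedure can be sketched in a few lines. All distributions and parameter values below (a Beta distribution for the proportion of positive instances, Gaussian distributions for the bag-specific means, unit instance variance, no positive instances in negative bags) are hypothetical choices made for illustration; the text does not prescribe them.

```python
import random

def sample_bag(n_instances, positive_bag, rng):
    """Draw one bag from a two-level (hierarchical) model.

    Step 1 draws the bag-specific parameters; step 2 draws the instances.
    All distribution choices are assumptions made for this sketch.
    """
    # Step 1: bag-specific parameters (together they define the bag).
    if positive_bag:
        pi = rng.betavariate(2.0, 5.0)   # proportion of positive instances
    else:
        pi = 0.0                         # here: no positive instances at all
    mu_pos = rng.gauss(4.0, 1.0)         # mean of this bag's positive instances
    mu_neg = rng.gauss(0.0, 1.0)         # mean of this bag's negative instances

    # Step 2: draw a hidden instance label, then the instance itself.
    instances, labels = [], []
    for _ in range(n_instances):
        z = 1 if rng.random() < pi else 0
        x = rng.gauss(mu_pos if z == 1 else mu_neg, 1.0)
        instances.append(x)
        labels.append(z)                 # never observed in MI learning
    return instances, labels

rng = random.Random(0)
positive_bags = [sample_bag(50, True, rng) for _ in range(3)]
negative_bags = [sample_bag(50, False, rng) for _ in range(3)]
```

Because the means are redrawn for every bag, two bags from the same class follow different instance distributions, which is the point of the hierarchical model.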

By imposing restrictions, assumptions can be accurately described, e.g. the standard MI assumption: at least one positive instance in a positive bag: ; no positive instances in a negative bag: ; the positive and negative instance spaces are disjoint.

Eq. 4 is the generative model of MI problems, assuming that the instances have unknown class labels and that the distributions are parametric. The parameters , and are i.i.d. samples from their respective distributions, but are not observed and are hard to estimate, due to the very nature of MI learning: The instances are not labelled. Instead, can be estimated from the observed instances, and a divergence function can serve as classifier.

3.2 Bag-to-class dissimilarity

The training set in MI learning consists of the instances, since the bag distributions are unknown. Under the assumption that the instances from each bag are i.i.d. samples, the KL information has a special role in model selection, both from the frequentist and the Bayesian perspective. Let $f_{bag}(\mathbf{x})$ be the sample distribution (unlabelled bag), and let $f_k(\mathbf{x})$ and $f_\ell(\mathbf{x})$ be two models (labelled bags). Then the expectation over $f_{bag}$ of the log ratio of the two models, $E\{\log(f_k(\mathbf{x})/f_\ell(\mathbf{x}))\}$, is equal to $D_{KL}(f_{bag}, f_\ell) - D_{KL}(f_{bag}, f_k)$. In other words, the log ratio test reveals the model closest to the sampling distribution in terms of KL information Eguchi2006Interpreting. From the Bayesian viewpoint, the Akaike Information Criterion (AIC) reveals the model closest to the data in terms of KL information, and is asymptotically equivalent to Bayes factor under certain assumptions Kass1995Bayes.
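The log ratio statement follows in one step from the definition of the KL information. The symbols $f_{bag}$ (sampling distribution), $f_k$ and $f_\ell$ (the two models) and $D_{KL}$ (KL information) are notation assumed for this sketch:

```latex
\begin{align*}
E_{f_{bag}}\!\left[\log\frac{f_k(\mathbf{x})}{f_\ell(\mathbf{x})}\right]
  &= \int f_{bag}(\mathbf{x}) \log\frac{f_k(\mathbf{x})}{f_\ell(\mathbf{x})}\,d\mathbf{x} \\
  &= \int f_{bag}(\mathbf{x}) \log\frac{f_{bag}(\mathbf{x})}{f_\ell(\mathbf{x})}\,d\mathbf{x}
   - \int f_{bag}(\mathbf{x}) \log\frac{f_{bag}(\mathbf{x})}{f_k(\mathbf{x})}\,d\mathbf{x} \\
  &= D_{KL}(f_{bag}, f_\ell) - D_{KL}(f_{bag}, f_k).
\end{align*}
```

The expectation is positive exactly when $f_k$ is closer to $f_{bag}$ than $f_\ell$ is, measured by KL information.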

The i.i.d. assumption is not inherent to the probability distribution viewpoint, but the asymptotic results for the KL information rely on it. In many applications, such as image analysis with sliding windows, the instances are best represented as dependent samples, but the dependencies are hard to estimate, and the independence assumption is often the best approximation. Doran and Ray Doran2016MultipleInstance showed that the independence assumption is an approximation of dependent instances, but comes with the cost of slower convergence.

If the bag sampling is sparse, the dissimilarity between the unlabelled bag, $f_{bag}$, and the labelled bags, $f_k$, becomes somewhat arbitrary w.r.t. the true label of $f_{bag}$. The risk is high for ratio-based divergences such as the KL information, since the ratio $f_{bag}(\mathbf{x})/f_k(\mathbf{x})$ is infinite wherever $f_k(\mathbf{x}) = 0$ and $f_{bag}(\mathbf{x}) > 0$. The bag-to-bag KL information is asymptotically the best choice of divergence function, but this is not the case for sparse training sets. Bag-to-class dissimilarity makes up for some of the sparseness by aggregation of instances. Consider an image segment of colour deep green, which appears in sea images, but not in sky images, and a segment of colour white, which appears in both classes (waves and clouds). If the combination deep green and white does not appear in the training set, then a bag-to-bag KL information will result in infinite dissimilarity for all bags, regardless of class, but the bag-to-class KL information will be finite for the sea class.
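The sea and sky example can be made concrete with discrete colour histograms. The bags, colours and counts below are invented for illustration; the sketch only shows how pooling the instances of a class removes the zero-probability cells that make every bag-to-bag KL information infinite.

```python
import math
from collections import Counter

def histogram(colours):
    """Relative frequencies of the colours observed in a list of instances."""
    counts = Counter(colours)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def kl_information(p, q):
    """Discrete KL information, sum of p * log(p / q); infinite if q = 0 where p > 0."""
    total = 0.0
    for colour, p_c in p.items():
        q_c = q.get(colour, 0.0)
        if q_c == 0.0:
            return math.inf
        total += p_c * math.log(p_c / q_c)
    return total

# Hypothetical training bags (lists of segment colours).
sea_bags = [["blue", "blue", "white"], ["deep green", "deep green", "blue"]]
sky_bags = [["light blue", "white"], ["light blue", "light blue", "white"]]

# Unlabelled bag: the combination deep green + white is in no single training bag.
unlabelled = histogram(["deep green", "white", "white"])

# Bag-to-bag: compare with each training bag separately.
bag_to_bag = [kl_information(unlabelled, histogram(b)) for b in sea_bags + sky_bags]

# Bag-to-class: pool all instances of a class before estimating the distribution.
sea_class = histogram([c for bag in sea_bags for c in bag])
sky_class = histogram([c for bag in sky_bags for c in bag])
to_sea = kl_information(unlabelled, sea_class)
to_sky = kl_information(unlabelled, sky_class)
```

In this toy data every bag-to-bag value is infinite, whereas the bag-to-class value is finite for the sea class and infinite for the sky class.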

Let be the probability distribution of a random vector from the bags of class . Let and be the divergences between the unlabelled bag and each of the classes. Choice of divergence is not obvious, since is different from both and , but can be done by identification of properties.

4 Properties for bag-level classification

4.1 Properties for bag-to-class divergences

We here propose two properties for bag-to-class divergences regarding infinite bag-to-class ratio and zero instance probability. Let , and . Denote the divergence between an unlabelled bag and the reference distribution, , by .

As a motivating example, consider the following: A positive bag,

, is a continuous uniform distribution

, sampled according to . A negative bag, , is sampled according to , and let so that there is an overlap between the two classes. For both positive and negative bags, we have that for a subspace of and for a different subspace of , merely reflecting that the variability in instances within a class is larger than within a bag, as illustrated in Fig. 2.

Figure 2: The pdf of a bag with uniform distribution and the pdfs of the two classes.

If is a sample from the negative class, and for some subspace of it can easily be classified. From the above analysis, large bag-to-class ratio should be reflected in large divergence, whereas large class-to-bag ratio should not.

Property 1: For the subspace of where the bag-to-class ratio is larger than some , the contribution to the total divergence, , approaches the maximum contribution as . For the subspace of where the class-to-bag ratio is larger than , the contribution to the total divergence, , does not approach the maximum contribution as :

Property 1 cannot be fulfilled by a symmetric divergence.

As a second motivating example, consider the same positive class as before, and the two alternative negative classes defined by;

For bag classification, the question becomes: from which class is a specific bag sampled? It is equally probable that a bag comes from each of the two negative classes, since and only differ where , and we argue that should be equal to .

Property 2: For the subspace of where is smaller than some , the contribution to the total divergence, , approaches zero as :

KL information is the only divergence that fulfils these two properties among the non-symmetric divergences listed in Taneja2006Generalized. As there is no complete list of divergences, it is possible that other divergences, of which the authors are not aware, fulfil these properties.

4.2 A class-conditional dissimilarity for MI classification

In the sea and sky images example, consider an unlabelled image with a pink segment, e.g. a boat. If pink is absent in the training set, then the bag-to-class KL information will be infinite for both classes. We therefore propose the following property:

Property 3: For the subspace of where both class probabilities are smaller than some , the contribution to the total divergence, , approaches zero as :

We present a class-conditional dissimilarity that accounts for this:


which also fulfils Properties 1 and 2.

4.3 Bag-level divergence classification

We propose two similar methods based on either the ratio of bag-to-class divergences, , or the class-conditional dissimilarity in Eq. 5. We propose using the KL information (Eq. 1) or the Bhattacharyya distance (Eq. 2), but any divergence function can be applied.

Given a training set and a set, , of instances drawn from an unknown distribution, , with unknown class label , and let denote the set of all . The bag-level divergence classification follows the steps:

Classify according to: (6)

Common methods for pdf estimation are Gaussian mixture models (GMMs) and kernel density estimation (KDE). The integrals in step 2 are commonly approximated by importance sampling and Riemann sums. In rare cases, e.g. when the distributions are Gaussian, the divergences can be calculated directly. The threshold can be pre-defined based on, e.g., misclassification penalty and prior class probabilities, or estimated from the training set by leave-one-out cross-validation. When the feature dimension is high and the number of instances in each bag is low, pdf estimation becomes arbitrary. A solution is to estimate separate pdfs for each dimension, calculate the corresponding divergences, and treat them as inputs into a classifier replacing step 3. Code available at https://github.com/kajsam/Bag-to-class-divergence.
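A minimal one-dimensional sketch of a ratio-based classifier of this kind follows. It pools the training instances per class, estimates the pdfs with a Gaussian KDE, approximates the KL information with a Riemann sum, and thresholds the ratio of the two bag-to-class divergences. The fixed bandwidth, the grid, the orientation of the ratio (positive class in the numerator), the threshold value and the toy data are all assumptions of this sketch, not details confirmed by the text.

```python
import math
import random

def kde(sample, bandwidth):
    """Return a Gaussian kernel density estimate of a 1-D sample."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    def pdf(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
    return pdf

def kl_information(f, g, grid):
    """Riemann-sum approximation of the KL information on an equidistant grid."""
    h = grid[1] - grid[0]
    total = 0.0
    for x in grid:
        fx, gx = f(x), g(x)
        if fx > 0.0:
            if gx == 0.0:
                return math.inf
            total += fx * math.log(fx / gx) * h
    return total

def divergence_ratio(bag, pos_instances, neg_instances, grid, bandwidth=0.5):
    """Ratio D(bag, pos class) / D(bag, neg class); small values suggest 'positive'."""
    f_bag = kde(bag, bandwidth)
    f_pos = kde(pos_instances, bandwidth)   # instances pooled over positive bags
    f_neg = kde(neg_instances, bandwidth)   # instances pooled over negative bags
    return kl_information(f_bag, f_pos, grid) / kl_information(f_bag, f_neg, grid)

rng = random.Random(1)
# Toy training set: positive bags contain some instances centred at 3.
pos_train = [[rng.gauss(3.0 if rng.random() < 0.3 else 0.0, 1.0) for _ in range(30)]
             for _ in range(5)]
neg_train = [[rng.gauss(0.0, 1.0) for _ in range(30)] for _ in range(5)]
pos_pool = [x for bag in pos_train for x in bag]
neg_pool = [x for bag in neg_train for x in bag]

grid = [-8.0 + 0.05 * i for i in range(381)]    # equidistant grid on [-8, 11]
test_bag = [rng.gauss(3.0 if rng.random() < 0.3 else 0.0, 1.0) for _ in range(30)]

threshold = 1.0                                  # hypothetical threshold
ratio = divergence_ratio(test_bag, pos_pool, neg_pool, grid)
label = "positive" if ratio < threshold else "negative"
```

Pooling the instances per class before the density estimation is what makes this a bag-to-class rather than a bag-to-bag comparison.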

5 Experiments

5.1 Simulated data

The following study exemplifies the difference between BH distance ratio, , KL information ratio, , and as classifiers for sparse training data. The minimum dissimilarity bag-to-bag classifiers are also implemented, based on KL information and BH distance. The number of instances from each bag is , the number of bags in the training set is varied from to from each class, and the number of bags in the test set is . Each bag and its instances are sampled as described in Eq. 4

, and the area under the receiver operating characteristic (ROC) curve (AUC) serves as performance measure. For simplicity, we use Gaussian distributions in one dimension for

Sim 1-Sim 4:

Sim 1: : No positive instances in negative bags.
Sim 2: : Positive instances in negative bags.
Sim 3:

: Positive and negative instances have the same expectation of the mean, but unequal variance.

Sim 4: : Positive instances are sampled from two distributions with unequal mean expectation.

We add Sim 5 and Sim 6 for the discussion on instance labels in Section 6, as follows: Sim 5 is an uncertain object classification, where the positive bags are lognormal densities with and , and negative bags are Gaussian mixtures densities with , , , and . These two densities are nearly identical, see (McLachlan2000Finite, , p. 15). In Sim 6, the parameters of Sim 5 are i.i.d. observations from Gaussian distributions, each with for the Gaussian mixture, and for the lognormal distribution. Figure 3 shows the estimated class densities and two estimated bag densities for Sim 2 with negative bags in the training set.

Figure 3: (a) One positive bag in the training set gives small variance for the class pdf. (b) Ten positive bags in the training set, and the variance has increased.

We use the following details for the algorithm in Section 4.3: KDE fitting: Epanechnikov kernel with estimated bandwidth varying with the number of observations. Integrals: Importance sampling. Classifier: The threshold is varied to give the full range of sensitivities and specificities necessary to calculate AUC.
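Varying the threshold over all observed scores traces out the ROC curve, and the area under it can be computed with the trapezoidal rule. The sketch below assumes that a low score indicates the positive class, as for a divergence ratio with the positive class in the numerator; the function name and the scores are invented for illustration.

```python
def roc_auc_low_is_positive(scores, labels):
    """AUC for a classifier that predicts 'positive' when score <= t.

    scores: list of floats (e.g. divergence ratios); labels: 1 = positive, 0 = negative.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Candidate thresholds: every observed score, plus one above all scores.
    thresholds = sorted(set(scores)) + [float("inf")]
    points = [(0.0, 0.0)]                      # (1 - specificity, sensitivity)
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s <= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s <= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    # Trapezoidal rule over consecutive ROC points.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0
    return auc

scores = [0.2, 0.5, 0.9, 1.4, 0.7, 1.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc_low_is_positive(scores, labels)
```

In the toy data, eight of the nine positive-negative pairs are ranked correctly, so the area equals 8/9.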

Table 1 shows the mean AUCs for repetitions.

                 neg: 5            neg: 10           neg: 25
Sim   pos    rBH  rKL  cKL     rBH  rKL  cKL     rBH  rKL  cKL
 1      1     61   69   85      62   72   89      61   73   92
 1      5     63   75   86      64   82   94      68   84   97
 1     10     69   86   87      73   91   95      75   91   98
 2      1     57   61   75      59   61   78      58   55   75
 2      5     59   67   79      60   68   84      62   63   85
 2     10     64   77   80      66   78   86      68   72   86
 3      1     51   55   71      52   58   73      50   57   74
 3      5     53   61   76      53   66   81      52   65   83
 3     10     58   73   78      58   76   84      57   76   87
 4      1     55   61   70      56   62   73      56   58   69
 4      5     56   63   75      57   64   81      59   59   80
 4     10     60   74   77      62   76   85      63   69   84
 5      1     64   61   62      67   63   66      64   62   67
 5      5     73   69   63      74   70   67      75   71   72
 5     10     74   70   62      75   73   69      76   74   72
 6      1     68   68   67      66   68   68      68   71   68
 6      5     65   64   67      68   68   69      70   71   74
 6     10     66   64   66      70   69   72      72   73   74
Table 1: AUC for simulated data. Rows: number of positive bags (pos) in the training set; column groups: number of negative bags (neg).

5.2 Breast tissue images

Breast tissue images (see Fig. 1) with corresponding feature vectors are used as example. Following the procedure in Kandemir2014Empowering , the principal components are used for dimension reduction, and -fold cross-validation is used so that and are fitted only to the instances in the training folds. For pdf estimation, GMMs are fitted to the first principal component, using an EM-algorithm, with number of components chosen by minimum AIC. In addition, KDE as in Section 5.1, and KDE with Gaussian kernel and optimal bandwidth Sheather1991Reliable is used.

KDE (Epan.) KDE (Gauss.) GMMs
90 92 94
82 92 96
Table 2: AUC for breast tissue images.

5.3 Benchmark data

We here present the results for 7 benchmark datasets (https://figshare.com/articles/MIProblems_A_repository_of_multiple_instance_learning_datasets/6633983) together with the results of five other methods as reported in the cited publications. The datasets have relatively few instances per bag compared to the dimensionality. For detailed descriptions and references, see Cheplygina2015Multiple. We use the following details for the algorithm in Section 4.3: KDE fitting: Gaussian kernel with optimal bandwidth. Integrals: Importance sampling. Classifier: Support vector machine (SVM) with linear kernel.
for d = 1: Dim
Fit to and sample from .
1. Fit , , using KDE.
2. Approximate or using and .
3. Input or to SVM.
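The loop over dimensions can be sketched as follows: one divergence ratio is computed per feature dimension, and the resulting vector is what would be passed on to a standard classifier. The sketch uses a Gaussian KDE with a fixed bandwidth and a Riemann sum in place of importance sampling, and it stops at the feature vector; these simplifications, the function names and the toy data are assumptions of the sketch.

```python
import math

def kde(sample, bandwidth=0.5):
    """Gaussian kernel density estimate of a 1-D sample (fixed bandwidth)."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    return lambda x: norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

def kl_information(f, g, grid):
    """Riemann-sum approximation of the KL information on an equidistant grid."""
    h = grid[1] - grid[0]
    total = 0.0
    for x in grid:
        fx, gx = f(x), g(x)
        if fx > 0.0:
            if gx == 0.0:
                return math.inf
            total += fx * math.log(fx / gx) * h
    return total

def divergence_features(bag, pos_pool, neg_pool, grid):
    """One bag-to-class divergence ratio per feature dimension.

    bag, pos_pool, neg_pool: lists of equal-length tuples (instances).
    Returns a list with one entry per dimension.
    """
    features = []
    for d in range(len(bag[0])):
        f_bag = kde([x[d] for x in bag])
        f_pos = kde([x[d] for x in pos_pool])   # pooled positive training instances
        f_neg = kde([x[d] for x in neg_pool])   # pooled negative training instances
        features.append(kl_information(f_bag, f_pos, grid)
                        / kl_information(f_bag, f_neg, grid))
    return features

# Hypothetical two-dimensional data: the classes differ only in the first dimension.
pos_pool = [(2.0, 0.0), (2.5, 0.4), (3.0, -0.3), (3.5, 0.1)]
neg_pool = [(-0.5, 0.1), (0.0, -0.4), (0.5, 0.3), (1.0, 0.0)]
bag = [(2.2, 0.2), (2.8, -0.1), (3.1, 0.0)]
grid = [-6.0 + 0.05 * i for i in range(321)]    # equidistant grid on [-6, 10]

features = divergence_features(bag, pos_pool, neg_pool, grid)
```

Estimating a separate one-dimensional pdf per dimension avoids a high-dimensional density estimate from very few instances, at the cost of ignoring dependencies between dimensions.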

10 times 10-fold cross-validation is used, except for the 2000-Image dataset where 5 times 2-fold cross-validation is used as in Wei2017Scalable and Zhou2009Multiinstance. In Cheplygina2016DissimilarityBased, one 10-fold cross-validation was performed, and the standard error was reported. In Wang2018Revisiting, 5 times 10-fold cross-validation was performed. In Qiao2017Diversified, several parameters are optimised for each data set, which prevents a fair comparison, and there was no reported deviation/error. The accuracies and the standard deviations are presented in Table 3 and Table 4, where the highest accuracies for each data set and those within one standard deviation are marked in bold.

Method                              Musk1         Musk2         Fox          Tiger        Elephant
MI-Net(DS) Wang2018Revisiting       89.4 (9.3)    87.4 (9.7)    63.0 (8.0)   84.5 (8.7)   87.2 (7.2)
miFV Wei2017Scalable                87.5 (10.6)   86.1 (10.6)   56.0 (9.9)   78.9 (9.1)   78.9 (9.1)
miGraph Zhou2009Multiinstance       88.9 (3.3)    90.3 (2.6)    61.6 (2.8)   86.0 (1.6)   86.8 (0.7)
Cheplygina2016DissimilarityBased    89.3 (3.4)    85.5 (4.7)    64.4 (2.2)   81.0 (4.6)   80.4 (3.5)
Qiao2017Diversified                 87.7          89.1          65.0         80.0         90.67
rBH                                 64.4 (3.1)    69.2 (3.2)    71.5 (1.2)   70.1 (1.3)   81.7 (1.7)
cKL                                 74.0 (1.9)    69.9 (2.0)    65.8 (2.1)   85.0 (1.4)   71.1 (3.3)
Table 3: Accuracy and standard deviation/error for benchmark data sets.
Method        2000-Image     Alt.atheism
MI-Net(DS)    -              86.0 (13.4)
miFV          87.5 (7.2)     -
miGraph       72.1           65.5 (4.0)
              -              44.0 (4.5)
rBH           90.0 (6.4)     62.0 (2.6)
cKL           80.1 (10.5)    85.5 (1.4)
Table 4: Accuracy and standard deviation/error for benchmark data sets.

5.4 Results

The general trend in Table 1 is that cKL gives higher AUC than rKL, which in turn gives higher AUC than rBH, in line with the divergences’ properties for sparse training sets. The same trend can be seen with a Gaussian kernel and optimal bandwidth (numbers not reported). The gap between cKL and rKL narrows with larger training sets. In other words, the benefit of cKL increases with sparsity. This can be explained by the risk of , as seen in Figure 3(a).

Increasing also narrows the gap between and , and eventually (at approximately ), outperforms (numbers not reported). Sim 1 and Sim 3 are less affected because the ratio is already .

The minimum bag-to-bag classifier gives a single sensitivity-specificity outcome, and the KL information outperforms the BH distance. Compared to the ROC curve, as illustrated in Fig. 4, the minimum bag-to-bag KL information classifier exceeds the bag-to-class dissimilarities only for very large training sets, typically for 500 or more, then at the expense of extensive computation time.

Figure 4: An example of ROC curves for , and classifiers. The performance increases when the number of positive bags in the training set increases from (dashed line) to (solid line). The sensitivity-specificity pairs for the bag-to-bag KL and BH classifier is displayed for positive and negative bags in the training set for comparison.

Sim 5 is an example in which the absolute difference, and not the ratio, differentiates the two classes, and rBH has the superior performance. When the extra hierarchy level is added in Sim 6, the performances return to normal.

The breast tissue study shows that the simple divergence-based approach can outperform more sophisticated algorithms. is more sensitive than to choice of density estimation method. performs better than with GMM, and both exceed the AUC of of the original algorithm. Table 2 shows how the performance can vary between two common pdf estimation methods that do not assume a particular underlying distribution. Both KDE and GMM are sensitive to chosen parameters or parameter estimation method, bandwidth and number of components, respectively, and no method will fit all data sets. In general, KDE is faster, but more sensitive to bandwidth, whereas GMM is more stable. For bags with very few instances the benefits of GMM cannot be exploited, and KDE is preferred.

The benchmark data study shows that the proposed method combined with a standard classifier obtains results comparable with state-of-the-art algorithms, with the exception of the Musk data sets where the number of instances per bag is low. In Musk1, more than half of the bags contain fewer than 5 instances, and in Musk2, one fourth of the bags contain fewer than 5 instances. Few instances per bag prevent good distribution estimates, and since the proposed method is based on bag distributions, the result is not surprising. The algorithms perform in the same range, although they are conceptually very different: MI-Net is a neural network approach, miFV is a kernel approach, miGraph is a graph approach, D is a dissimilarity approach, and DivDict is a diverse dictionary approach.

6 Discussion

6.1 Point-of-view

The theoretical basis of the bag-to-class divergence approach relies on viewing a bag as a probability distribution, and hence fits into the branch of collective assumptions of the Foulds and Frank taxonomy Foulds2010Review. The probability distribution estimation can be seen as extracting bag-level information from a set of instances, and hence falls into the BS paradigm of Amores Amores2013Multiple. The probability distribution space is non-vectorial, different from the distance-kernel spaces in Amores2013Multiple, and divergences are used for classification.

In practice, the evaluation points of the importance sampling give a mapping from the set of instances to a single vector. The mapping concurs with the ES paradigm, and the same applies for the graph-based methods. From that viewpoint, the bag-to-class divergence approach expands the distance branch of Foulds and Frank to include a bag-to-class category in addition to instance-level and bag-level distances. However, the importance sampling is a technicality of the algorithm, and we argue that the method belongs to the BS paradigm. When the divergences are used as input to a classifier as in Section 5.3, the ES paradigm is a better description.

Carbonneau et al. Carbonneau2018Multiple assume underlying instance labels, and from a probability distribution viewpoint this corresponds to posterior probabilities, which are in practice inaccessible. In Sim 1 - Sim 4, the instance labels are inaccessible through observations without previous knowledge about the distributions. In Sim 6, the instance label approach is not useful, due to the similarity between the two distributions:


where and are the lognormal and the Gaussian mixture, respectively. Eq. 4 is just a special case of Eq. 7, where is the random vector . Without knowledge about the distributions, discriminating between training sets following the generative model of Eq. 4 and Eq. 7 is only possible for a limited number of problems. Even the uncertain objects of Sim 5 is difficult to discriminate from MI objects based solely on the observations in the training set.

6.2 Conclusions and future work

Although the bag-to-bag KL information has the minimum misclassification rate, the typical bag sparseness of MI training sets is an obstacle, which is partly solved by bag-to-class dissimilarities, and the proposed class-conditional KL information accounts for additional sparsity of bags.

The bag-to-class divergence approach addresses three main challenges in MI learning. (1) Aggregation of instances according to bag label and the additional class-conditioning provide a solution for the bag sparsity problem. (2) The bag-to-bag approach suffers from extensive computation time, which is solved by the bag-to-class approach. (3) Viewing bags as probability distributions gives access to analytical tools from statistics and probability theory, and comparisons of methods can be done on a data-independent level through identification of properties. The properties presented here are not an exhaustive list, and any extra knowledge should be taken into account whenever available.

A more thorough analysis of the proposed function, , will identify its weaknesses and strengths, and can lead to improved versions as well as alternative class-conditional dissimilarity measures and a more comprehensive tool.

The diversity of data types, assumptions, problem characteristics, sampling sparsity, etc. is far too large for any one approach to be sufficient. The introduction of divergences as an alternative class of dissimilarity functions, and of the bag-to-class dissimilarity as an alternative to the bag-to-bag dissimilarity, adds new tools to the MI toolbox.


This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.



  • (1) E. Drelie Gelasca, J. Byun, B. Obara, B. S. Manjunath, Evaluation and benchmark for biological image segmentation, in: 15th IEEE International Conference on Image Processing, 2008, pp. 1816–1819. doi:10.1109/ICIP.2008.4712130.
  • (2) M. Kandemir, C. Zhang, F. A. Hamprecht, Empowering Multiple Instance Histopathology Cancer Diagnosis by Cell Graphs, in: P. Golland, N. Hata, C. Barillot, J. Hornegger, R. Howe (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2014, Springer International Publishing, 2014, pp. 228–235. doi:10.1007/978-3-319-10470-6_29.
  • (3) G. Doran, S. Ray, Multiple-Instance Learning from Distributions, Journal of Machine Learning Research 17 (128) (2016) 1–50.
    URL http://jmlr.org/papers/v17/15-171.html
  • (4) M.-L. Zhang, Z.-H. Zhou, Multi-instance clustering with applications to multi-instance prediction, Applied Intelligence 31 (1) (2009) 47–68. doi:10.1007/s10489-007-0111-x.
  • (5) Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, Y.-F. Li, Multi-instance multi-label learning, Artificial Intelligence 176 (1) (2012) 2291–2320.
  • (6) P. Tang, X. Wang, Z. Huang, X. Bai, W. Liu, Deep patch learning for weakly supervised object classification and discovery, Pattern Recognition 71 (2017) 446–459. doi:10.1016/J.PATCOG.2017.05.001.
  • (7) X. Wang, Y. Yan, P. Tang, X. Bai, W. Liu, Revisiting multiple instance neural networks, Pattern Recognition 74 (2018) 15–24. doi:10.1016/J.PATCOG.2017.08.026.
  • (8) T. G. Dietterich, R. H. Lathrop, T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence 89 (1-2) (1997) 31–71. doi:10.1016/s0004-3702(96)00034-3.
  • (9) Y.-Y. Xu, Multiple-instance learning based decision neural networks for image retrieval and classification, Neurocomputing 171 (2016) 826–836. doi:10.1016/j.neucom.2015.07.024.
  • (10) M. Qiao, L. Liu, J. Yu, C. Xu, D. Tao, Diversified dictionaries for multi-instance learning, Pattern Recognition 64 (2017) 407–416. doi:10.1016/j.patcog.2016.08.026.
  • (11) N. Weidmann, E. Frank, B. Pfahringer, A Two-Level Learning Method for Generalized Multi-instance Problems, in: N. Lavrač, D. Gamberger, H. Blockeel, L. Todorovski (Eds.), Machine Learning: ECML 2003, Lecture Notes in Computer Science, vol 2837. Springer Berlin Heidelberg, 2003, pp. 468–479. doi:10.1007/978-3-540-39857-8_42.
  • (12) J. Foulds, E. Frank, A review of multi-instance learning assumptions, The Knowledge Engineering Review 25 (1) (2010) 1–25.
  • (13) V. Cheplygina, D. M. J. Tax, M. Loog, Multiple Instance Learning with Bag Dissimilarities, Pattern Recogn. 48 (1) (2015) 264–275. doi:10.1016/j.patcog.2014.07.022.
  • (14) M. A. Carbonneau, V. Cheplygina, E. Granger, G. Gagnon, Multiple Instance Learning: A survey of Problem Characteristics and Applications, Pattern Recognition 77 (2018) 329–353. doi:10.1016/j.patcog.2017.10.009.
  • (15) J. Amores, Multiple Instance Classification: Review, Taxonomy and Comparative Study, Artif. Intell. 201 (2013) 81–105. doi:10.1016/j.artint.2013.06.003.
  • (16) B. Jiang, J. Pei, Y. Tao, X. Lin, Clustering Uncertain Data Based on Probability Distribution Similarity, Knowledge and Data Engineering, IEEE Transactions on 25 (4) (2013) 751–763. doi:10.1109/tkde.2011.221.
  • (17) H. P. Kriegel, M. Pfeifle, Density-based Clustering of Uncertain Data, in: Proceedings of the eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, 2005, pp. 672–677. doi:10.1145/1081870.1081955.
  • (18) B. Schölkopf, The Kernel Trick for Distances, in: Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS’00, 2000, pp. 283–289.
    URL http://portal.acm.org/citation.cfm?id=3008793
  • (19) S. Kullback, R. A. Leibler, On Information and Sufficiency, The Annals of Mathematical Statistics 22 (1) (1951) 79–86. doi:10.1214/aoms/1177729694.
  • (20) A. L. Gibbs, F. E. Su, On Choosing and Bounding Probability Metrics, International Statistical Review 70 (3) (2002) 419–435. doi:10.1111/j.1751-5823.2002.tb00178.x.
  • (21) S. M. Ali, S. D. Silvey, A General Class of Coefficients of Divergence of One Distribution from Another, Journal of the Royal Statistical Society. Series B (Methodological) 28 (1) (1966) 131–142.
    URL http://www.jstor.org/stable/2984279
  • (22) I. Csiszár, Information-type measures of difference of probability distributions and indirect observations, Studia Scientiarum Mathematicarum Hungarica 2 (1967) 299–318.
  • (23) K. Møllersen, S. S. Dhar, F. Godtliebsen, On Data-Independent Properties for Density-Based Dissimilarity Measures in Hybrid Clustering, Applied Mathematics 07 (15) (2016) 1674–1706. doi:10.4236/am.2016.715143.
  • (24) K. Møllersen, J. Y. Hardeberg, F. Godtliebsen, Divergence-based colour features for melanoma detection, in: Colour and Visual Computing Symposium (CVCS), 2015, pp. 1–6. doi:10.1109/CVCS.2015.7274885.
  • (25) O. Maron, T. Lozano-Pérez, A framework for multiple-instance learning, in: Advances in Neural Information Processing Systems 10, 1998, pp. 570–576.
  • (26) X. Xu, E. Frank, Logistic Regression and Boosting for Labeled Bags of Instances, in: H. Dai, R. Srikant, C. Zhang (Eds.), Advances in Knowledge Discovery and Data Mining, PAKDD 2004, Lecture Notes in Computer Science, vol 3056, Springer Berlin Heidelberg, 2004, pp. 272–281. doi:10.1007/978-3-540-24775-3_35.
  • (27) D. M. J. Tax, M. Loog, R. P. W. Duin, V. Cheplygina, W. J. Lee, Bag Dissimilarities for Multiple Instance Learning, in: M. Pelillo, E. R. Hancock (Eds.), Similarity-Based Pattern Recognition, SIMBAD 2011, Lecture Notes in Computer Science, vol 7005, Springer Berlin Heidelberg, 2011, pp. 222–234. doi:10.1007/978-3-642-24471-1_16.
  • (28) Z. H. Zhou, Y. Y. Sun, Y. F. Li, Multi-instance Learning by Treating Instances As non-I.I.D. Samples, in: Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, ACM, New York, NY, USA, 2009, pp. 1249–1256. doi:10.1145/1553374.1553534.
  • (29) V. Cheplygina, D. M. J. Tax, M. Loog, Dissimilarity-Based Ensembles for Multiple Instance Learning, Neural Networks and Learning Systems, IEEE Transactions on, 27 (6) (2016) 1379–1391. doi:10.1109/TNNLS.2015.2424254.
  • (30) O. Boiman, E. Shechtman, M. Irani, In defense of Nearest-Neighbor based image classification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, CVPR 2008, pp. 1–8.
  • (31) W.-J. Lee, V. Cheplygina, D. M. J. Tax, M. Loog, R. P. W. Duin, Bridging structure and feature representations in graph matching, Int. J. Patt. Recogn. Artif. Intell. 26 (05) (2012) 1260005+. doi:10.1142/s0218001412600051.
  • (32) S. Scott, J. Zhang, J. Brown, On generalized multiple-instance learning, Int. J. Comp. Intel. Appl. 05 (01) (2005) 21–35. doi:10.1142/s1469026805001453.
  • (33) J. F. Ruiz-Muñoz, G. Castellanos-Dominguez, M. Orozco-Alzate, Enhancing the dissimilarity-based classification of birdsong recordings, Ecological informatics 33 (2016) 75–84. doi:10.1016/j.ecoinf.2016.04.001.
  • (34) L. Sørensen, M. Loog, D. M. J. Tax, W.-J. Lee, M. de Bruijne, R. P. W. Duin, Dissimilarity-Based Multiple Instance Learning, in: E. R. Hancock, R. C. Wilson, T. Windeatt, I. Ulusoy, F. Escolano (Eds.), Structural, Syntactic, and Statistical Pattern Recognition, SSPR /SPR 2010. Lecture Notes in Computer Science, vol 6218, Springer Berlin Heidelberg, 2010, pp. 129–138. doi:10.1007/978-3-642-14980-1_12.
  • (35) X.-S. Wei, J. Wu, Z.-H. Zhou, Scalable Algorithms for Multi-Instance Learning, IEEE Transactions on Neural Networks and Learning Systems 28 (4) (2017) 975–987. doi:10.1109/TNNLS.2016.2519102.
  • (36) S. K. Sahu, R. C. H. Cheng, A fast distance-based approach for determining the number of components in mixtures, Can J Statistics 31 (1) (2003) 3–22. doi:10.2307/3315900.
  • (37) S. Eguchi, J. Copas, Interpreting Kullback-Leibler Divergence with the Neyman-Pearson Lemma, J. Multivar. Anal. 97 (9) (2006) 2034–2040. doi:10.1016/j.jmva.2006.03.007.
  • (38) R. E. Kass, A. E. Raftery, Bayes Factors, Journal of the American Statistical Association 90 (430) (1995) 773–795. doi:10.2307/2291091.
  • (39) I. J. Taneja, P. Kumar, Generalized non-symmetric divergence measures and inequalities, Journal of Interdisciplinary Mathematics 9 (3) (2006) 581–599. doi:10.1080/09720502.2006.10700466.
  • (40) G. McLachlan, D. Peel, Finite Mixture Models, Wiley Series in Probability and Statistics, John Wiley & Sons, Inc., 2000. doi:10.1002/0471721182.
  • (41) S. J. Sheather, M. C. Jones, A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation, Journal of the Royal Statistical Society. Series B (Methodological) 53 (3) (1991) 683–690.