1 Introduction
1.1 Multi-instance learning
In supervised learning, the training data consists of objects, , with corresponding class labels, ; . An object is typically a vector of feature values, , called an instance. In multi-instance (MI) learning, each object consists of several instances. The set , where the elements are vectors of length , is referred to as a bag. The number of instances, , varies from bag to bag, whereas the vector length is constant. In supervised MI learning, the training data consists of sets and their corresponding class labels, . Figure 1(a) shows an image (bag), , of benign breast tissue Gelasca2008Evaluation , divided into segments with corresponding feature vectors (instances) Kandemir2014Empowering . Correspondingly, Figure 1(b) shows malignant breast tissue.
The images in the data set have class labels, but the individual segments do not. This is a key characteristic of MI learning: the instances are not labelled. MI learning includes instance classification Doran2016MultipleInstance , clustering Zhang2009Multiinstance , regression Zhang2009Multiinstance , and multi-label learning Zhou2012Multiinstance ; Tang2017Deep , but this article focuses on bag classification. MI learning can also be found as an integrated part of end-to-end methods for image analysis that generate patches, extract features and perform feature selection Tang2017Deep . See also Wang2018Revisiting for an overview and discussion of end-to-end neural network MI learning methods.
The term MI learning was introduced in an application of molecules (bags) with different shapes (instances), and their ability to bind to other molecules Dietterich1997Solving . A molecule binds if at least one of its shapes can bind. In MI terminology, the classes, , in binary classification are referred to as positive, , and negative, . The assumption that a positive bag contains at least one positive instance, and a negative bag contains only negative instances is referred to as the standard MI assumption.
Many new applications violate the standard MI assumption, such as image classification Xu2016Multipleinstance and text categorisation Qiao2017Diversified . Consequently, successful algorithms meet more general assumptions, see e.g. the hierarchy of Weidmann et al. Weidmann2003Twolevel or Foulds and Frank’s taxonomy Foulds2010Review . For a more recent review of MI classification algorithms, see e.g. Cheplygina2015Multiple . Carbonneau et al. Carbonneau2018Multiple discussed sample independence and data sparsity, which we address in Section 3.2. Amores Amores2013Multiple presented the three paradigms of instance space (IS), embedded space (ES), and bag space (BS). IS methods aggregate the outcome of single-instance classifiers applied to the instances of a bag, whereas ES methods map the instances to a vector, and then use a single-instance classifier. In the BS paradigm, the instances are transformed to a non-vectorial space where the classification is performed, avoiding the detour via single-instance classifiers. The non-vectorial space of probability functions has not yet been introduced to the BS paradigm, despite its analytical benefits.
Whereas both Carbonneau et al. Carbonneau2018Multiple and Amores Amores2013Multiple defined a bag as a set of feature vectors, Foulds and Frank Foulds2010Review stated that a bag can also be modelled as a probability distribution. The distinction is necessary in analysis of classification approaches, and both viewpoints offer benefits, see Section 6.1 for a discussion.
1.2 The non-vectorial space of probability functions
From the probabilistic viewpoint, an instance is a realisation of a random vector, , with probability distribution and sample space . The posterior probability, , is an effective classifier if the standard MI assumption holds, since it is known beforehand to be , where is the positive instance space, and the positive and negative instance spaces are disjoint.
Bayes’ rule, , can be used when the posterior probability is unknown. An assumption used to estimate the probability distribution of an instance given the class, , is that instances from bags of the same class are independent and identically distributed (i.i.d.) random samples, but this is a poor description of MI learning. As an illustrative example, let the instances be the colours of image segments from the class sea. Image depicts a clear blue sea, whereas image depicts a deep green sea, so the instance distributions clearly depend not only on class, but also on bag. The random vectors in are i.i.d., but have a different distribution than those in . An important distinction between uncertain objects, whose distribution depends solely on the class label Jiang2013Clustering ; Kriegel2005Densitybased , and MI learning is that the instances of two bags from the same class are not from the same distribution.
The dependency structure of MI learning can be described as a hierarchical distribution (Eq. 3), where a bag, , is defined as the probability distribution of its instances, , and the bag space, , is a set of distributions.
1.3 Dissimilarities in MI learning
Dissimilarities in MI learning can be categorised as instance-to-instance, bag-to-bag or bag-to-class. Amores Amores2013Multiple implicitly assumed metricity for dissimilarity functions Scholkopf2000Kernel in the BS paradigm, but there is nothing inherent to MI learning that imposes these restrictions. The non-metric Kullback-Leibler (KL) information Kullback1951Information is an example of a divergence: a dissimilarity measure between two probability distributions.
Divergences have not been used in MI learning, due to the lack of a probability function space defined for the BS paradigm, despite the benefit of analysis independent of specific data sets Gibbs2002Choosing . The divergences Ali1966General ; Csiszar1967Informationtype have desirable properties for dissimilarity measures, including minimum value for equal distributions, but there is no complete categorisation of divergences. The KL information is a non-symmetric divergence, often used in both statistics and computer science, and is defined as follows for two probability density functions (pdfs) $f(x)$ and $g(x)$:
$$D_{KL}(f, g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx. \qquad (1)$$
An example of a symmetric divergence is the Bhattacharyya (BH) distance, defined as
$$D_{BH}(f, g) = -\log \int \sqrt{f(x)\, g(x)} \, dx, \qquad (2)$$
which can be a better choice if the absolute difference, and not the ratio, differentiates the two pdfs. The appropriate divergence for a specific task can be chosen based on identified properties, e.g. for clustering Mollersen2016DataIndependent , or a new dissimilarity function can be proposed Mollersen2015Divergencebased .
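To make the contrast between the two divergences concrete, the following is a minimal numerical sketch. It assumes the standard definitions, the KL information as the integral of f log(f/g) and the BH distance as minus the log of the integral of sqrt(f g), and approximates both integrals by Riemann sums for two one-dimensional Gaussians. The function names, the grid and the Gaussian parameters are illustrative choices, not taken from the article.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def kl_information(f, g, grid, dx):
    """Riemann-sum approximation of the integral of f * log(f / g)."""
    total = 0.0
    for x in grid:
        fx, gx = f(x), g(x)
        if fx == 0.0:
            continue            # 0 * log(0 / g) is taken as 0
        if gx == 0.0:
            return math.inf     # f puts mass where g has none
        total += fx * math.log(fx / gx) * dx
    return total

def bh_distance(f, g, grid, dx):
    """Riemann-sum approximation of -log of the integral of sqrt(f * g)."""
    return -math.log(sum(math.sqrt(f(x) * g(x)) * dx for x in grid))

# Illustrative pdfs: N(0, 1) and N(1, 2^2), evaluated on [-20, 20].
dx = 0.001
grid = [-20.0 + i * dx for i in range(40001)]
f = lambda x: gaussian_pdf(x, 0.0, 1.0)
g = lambda x: gaussian_pdf(x, 1.0, 2.0)

kl_fg = kl_information(f, g, grid, dx)   # differs from kl_gf: non-symmetric
kl_gf = kl_information(g, f, grid, dx)
bh_fg = bh_distance(f, g, grid, dx)      # equals bh_gf: symmetric
bh_gf = bh_distance(g, f, grid, dx)
```

For these two Gaussians the closed-form KL information is log 2 - 1/4 in one direction and 2 - log 2 in the other, so swapping the arguments changes the value, whereas the BH distance is unchanged.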
This article aims to identify properties for bag classification, and we make the following contributions:

Presentation of the hierarchical model for general, non-standard MI assumptions (Section 3.1).

Introduction of the bag-to-class dissimilarity measure (Section 3.2).

Identification of two properties for bag-to-class divergences (Section 4.1).

A new bag-to-class dissimilarity measure for sparse training data (Section 4.2).
In Section 5, the KL information and the new dissimilarity measure are applied to data sets and the results are reported. Bags defined in the probability distribution space, in combination with bag-to-class divergence, constitute a new framework for MI learning, which is compared to other frameworks in Section 6.
2 Related work
The feature vector set viewpoint seems to be the most common, but the probabilistic viewpoint was introduced as early as 1998, then under the i.i.d. given class assumption Maron1998Framework . This assumption has been used in approaches such as estimating the expectation by the mean Xu2004Logistic , or estimation of class distribution parameters Tax2011Bag , but has also been criticised Zhou2009Multiinstance . The hierarchical distribution was introduced for learnability theory under the standard MI assumption for instance classification Doran2016MultipleInstance , and we expand the use for more general assumptions.
Dissimilarities in MI learning have been categorised as instance-to-instance or bag-to-bag Amores2013Multiple ; Cheplygina2016DissimilarityBased . The bag-to-prototype approach in Cheplygina2016DissimilarityBased offers an in-between category, but the theoretical framework is missing. Bag-to-class dissimilarity has not been studied within the MI framework, but was used under the i.i.d. given class assumption for image classification in Boiman2008In , where the sparseness of training sets was also addressed: if the instances are aggregated on class level, a denser representation is achieved. Many MI algorithms use dissimilarities, e.g. graph distances Lee2012Bridging , Hausdorff metrics Scott2005Generalized , functions of the Euclidean distance Cheplygina2015Multiple ; RuizMunoz2016Enhancing , and distribution parameter based distances Cheplygina2015Multiple . The performances of dissimilarities on specific data sets have been investigated Cheplygina2015Multiple ; Tax2011Bag ; Cheplygina2016DissimilarityBased ; RuizMunoz2016Enhancing ; Sorensen2010DissimilarityBased , but more analytical comparisons are missing. A large class of commonly used kernels are also distances Scholkopf2000Kernel , and hence, many kernel-based approaches in MI learning can be viewed as dissimilarity-based approaches. In Wei2017Scalable , the Fisher kernel is used as input to a support vector machine (SVM), whereas in Zhou2009Multiinstance and Qiao2017Diversified the kernels are an integrated part of the methods. The non-vectorial graph space was used in Zhou2009Multiinstance ; Lee2012Bridging . We introduce the non-vectorial space of probability functions as an extension within the BS paradigm for bag classification through dissimilarity measures between distributions.
The KL information was applied in Boiman2008In , and is a much-used divergence function. It is closely connected to the Fisher information Kullback1951Information used in Wei2017Scalable and to the cross entropy used as loss function in Wang2018Revisiting . We propose a conditional KL information in Section 4.2, which differs from the earlier proposed weighted KL information Sahu2003Fast whose weight is a constant function of .
3 Theoretical background
3.1 Hierarchical distributions
A bag is the probability distribution from which the instances are sampled. The generative model of instances from a positive or negative bag follows a hierarchical distribution
(3) 
respectively. The common view in MI learning is that a bag consists of positive and negative instances, which corresponds to a bag being a mixture of a positive and a negative distribution.
Consider tumour images labelled or , with instances extracted from segments. Let and denote the pdfs of positive and negative segments, respectively, of image . The pdf of bag is a mixture distribution
where , where if instance is positive. The probability of positive segments, , depends on the image’s class label, and hence is sampled from or . The characteristics of positive and negative segments vary from image to image. Hence, and are realisations of random variables, with corresponding probability distributions and . The generative model of instances from a positive (negative) bag is
(4) 
The corresponding sampling procedure from a positive (negative) bag, , is
Step 1: Draw from , from , and from . These three parameters define the bag.
Step 2: For , draw from , draw from if , and from otherwise.
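The two-step procedure above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the name `sample_bag`, the Beta and Gaussian hyper-distributions, their parameter values, and the choice of zero positive-instance probability for negative bags are hypothetical and are not taken from the article.

```python
import random

def sample_bag(label, n_instances, rng):
    """Draw one bag from a two-level (hierarchical) model; all distributions are assumed."""
    # Step 1: draw the three bag-level parameters that define the bag.
    if label == "positive":
        pi_bag = rng.betavariate(2.0, 5.0)  # assumed: probability of a positive instance
    else:
        pi_bag = 0.0                        # assumed restriction: no positive instances
    mu_pos = rng.gauss(5.0, 1.0)            # assumed: bag-specific mean of positive instances
    mu_neg = rng.gauss(0.0, 1.0)            # assumed: bag-specific mean of negative instances

    # Step 2: for each instance, draw its hidden label, then the instance itself.
    instances, hidden_labels = [], []
    for _ in range(n_instances):
        z = 1 if rng.random() < pi_bag else 0
        mu = mu_pos if z == 1 else mu_neg
        instances.append(rng.gauss(mu, 1.0))
        hidden_labels.append(z)             # not observed in MI learning
    return instances, hidden_labels

rng = random.Random(0)
pos_instances, pos_hidden = sample_bag("positive", 10, rng)
neg_instances, neg_hidden = sample_bag("negative", 10, rng)
```

Two bags with the same label receive different values of `mu_pos` and `mu_neg`, so their instances are i.i.d. within each bag but not across bags, which is the dependency structure the hierarchical model is meant to capture.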
By imposing restrictions, assumptions can be accurately described, e.g. the standard MI assumption: at least one positive instance in a positive bag: ; no positive instances in a negative bag: ; the positive and negative instance spaces are disjoint.
Eq. 4 is the generative model of MI problems, assuming that the instances have unknown class labels and that the distributions are parametric. The parameters , and are i.i.d. samples from their respective distributions, but are not observed and are hard to estimate, due to the very nature of MI learning: The instances are not labelled. Instead, can be estimated from the observed instances, and a divergence function can serve as classifier.
3.2 Bag-to-class dissimilarity
The training set in MI learning consists of the instances, since the bag distributions are unknown. Under the assumption that the instances from each bag are i.i.d. samples, the KL information has a special role in model selection, both from the frequentist and the Bayesian perspective. Let be the sample distribution (unlabelled bag), and let and be two models (labelled bags). Then the expectation over of the log ratio of the two models, , is equal to . In other words, the log ratio test reveals the model closest to the sampling distribution in terms of KL information Eguchi2006Interpreting . From the Bayesian viewpoint, the Akaike Information Criterion (AIC) reveals the model closest to the data in terms of KL information, and is asymptotically equivalent to the Bayes factor under certain assumptions Kass1995Bayes .
The i.i.d. assumption is not inherent to the probability distribution viewpoint, but the asymptotic results for the KL information rely on it. In many applications, such as image analysis with sliding windows, the instances are best represented as dependent samples, but the dependencies are hard to estimate, and the independence assumption is often the best approximation. Doran and Ray Doran2016MultipleInstance showed that the independence assumption is an approximation of dependent instances, but comes at the cost of slower convergence.
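The log-ratio statement earlier in this section can be written out in one line. The following is a sketch in assumed notation: the symbols $f$ (sampling distribution), $g_1$, $g_2$ (the two models) and $D_{KL}$ are chosen here for illustration, and the integrals are assumed to be finite.

```latex
\begin{align*}
E_f\!\left[\log\frac{g_1(X)}{g_2(X)}\right]
  &= \int f(x)\,\log\frac{g_1(x)}{g_2(x)}\,dx \\
  &= \int f(x)\,\log\frac{f(x)}{g_2(x)}\,dx
     - \int f(x)\,\log\frac{f(x)}{g_1(x)}\,dx \\
  &= D_{KL}(f, g_2) - D_{KL}(f, g_1).
\end{align*}
```

The expectation is therefore positive exactly when $g_1$ is closer to $f$ than $g_2$ is in terms of KL information, which is why the log ratio points to the closest model.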
If the bag sampling is sparse, the dissimilarity between and the labelled bags becomes somewhat arbitrary w.r.t. the true label of . The risk is high for ratio-based divergences such as the KL information, since for . The bag-to-bag KL information is asymptotically the best choice of divergence function, but this is not the case for sparse training sets. Bag-to-class dissimilarity makes up for some of the sparseness by aggregation of instances. Consider an image segment of colour deep green, which appears in sea images, but not in sky images, and a segment of colour white, which appears in both classes (waves and clouds). If the combination deep green and white does not appear in any single bag of the training set, then the bag-to-bag KL information will be infinite for all bags, regardless of class, but the bag-to-class KL information will be finite for the sea class.
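The sea and sky example can be checked on a toy data set with discrete colours. The sketch below is illustrative only: the colour lists are invented, bags are represented by empirical probability mass functions, and a class is formed by simply pooling the instances of its bags, which is one possible way to aggregate.

```python
import math
from collections import Counter

def pmf(colours):
    """Empirical probability mass function of a list of colour labels."""
    counts = Counter(colours)
    return {c: k / len(colours) for c, k in counts.items()}

def kl(p, q):
    """Discrete KL information; infinite if p has mass where q has none."""
    total = 0.0
    for c, pc in p.items():
        qc = q.get(c, 0.0)
        if qc == 0.0:
            return math.inf
        total += pc * math.log(pc / qc)
    return total

# Invented training bags: no single bag contains both deep green and white.
sea_bags = [["deep green", "deep green", "blue"], ["white", "blue", "blue"]]
sky_bags = [["white", "light blue"], ["light blue", "grey", "white"]]
new_bag = ["deep green", "white"]          # unlabelled bag

# Bag-to-bag: compare the new bag with every training bag separately.
bag_to_bag = [kl(pmf(new_bag), pmf(b)) for b in sea_bags + sky_bags]

# Bag-to-class: pool the instances of each class before comparing.
sea_class = pmf([c for b in sea_bags for c in b])
sky_class = pmf([c for b in sky_bags for c in b])
kl_sea = kl(pmf(new_bag), sea_class)
kl_sky = kl(pmf(new_bag), sky_class)
```

Every bag-to-bag value is infinite because each training bag lacks at least one of the two colours, whereas `kl_sea` is finite since the pooled sea class contains both. `kl_sky` remains infinite because deep green never occurs in the sky class.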
Let be the probability distribution of a random vector from the bags of class . Let and be the divergences between the unlabelled bag and each of the classes. The choice of divergence is not obvious, since is different from both and , but can be made by identification of properties.
4 Properties for bag-level classification
4.1 Properties for bag-to-class divergences
We here propose two properties for bag-to-class divergences regarding infinite bag-to-class ratio and zero instance probability. Let , and . Denote the divergence between an unlabelled bag and the reference distribution, , by .
As a motivating example, consider the following: A positive bag, , is a continuous uniform distribution , sampled according to . A negative bag, , is sampled according to , and let so that there is an overlap between the two classes. For both positive and negative bags, we have that for a subspace of and for a different subspace of , merely reflecting that the variability in instances within a class is larger than within a bag, as illustrated in Fig. 2. If is a sample from the negative class, and for some subspace of , it can easily be classified. From the above analysis, large bag-to-class ratio should be reflected in large divergence, whereas large class-to-bag ratio should not.
Property 1: For the subspace of where the bag-to-class ratio is larger than some , the contribution to the total divergence, , approaches the maximum contribution as . For the subspace of where the class-to-bag ratio is larger than , the contribution to the total divergence, , does not approach the maximum contribution as :
Property 1 cannot be fulfilled by a symmetric divergence.
As a second motivating example, consider the same positive class as before, and the two alternative negative classes defined by:
For bag classification, the question becomes: from which class is a specific bag sampled? It is equally probable that a bag comes from each of the two negative classes, since and only differ where , and we argue that should be equal to .
Property 2: For the subspace of where is smaller than some , the contribution to the total divergence, , approaches zero as :
KL information is the only divergence that fulfils these two properties among the non-symmetric divergences listed in Taneja2006Generalized . As there is no complete list of divergences, it is possible that other divergences, of which the authors are not aware, fulfil these properties.
4.2 A class-conditional dissimilarity for MI classification
In the sea and sky images example, consider an unlabelled image with a pink segment, e.g. a boat. If pink is absent in the training set, then the bag-to-class KL information will be infinite for both classes. We therefore propose the following property:
Property 3: For the subspace of where both class probabilities are smaller than some , the contribution to the total divergence, , approaches zero as :
We present a class-conditional dissimilarity that accounts for this:
(5) 
which also fulfils Properties 1 and 2.
4.3 Bag-level divergence classification
We propose two similar methods based on either the ratio of bag-to-class divergences, , or the class-conditional dissimilarity in Eq. 5. We propose using the KL information (Eq. 1) or the Bhattacharyya distance (Eq. 2), but any divergence function can be applied.
Given a training set and a set, , of instances drawn from an unknown distribution, , with unknown class label , let denote the set of all . The bag-level divergence classification follows the steps:
Classify according to:  (6)  
or  
Common methods for pdf estimation are Gaussian mixture models (GMMs) and kernel density estimation (KDE). The integrals in step 2 are commonly approximated by importance sampling and Riemann sums. In rare cases, e.g. when the distributions are Gaussian, the divergences can be calculated directly. The threshold can be predefined based on, e.g. misclassification penalty and prior class probabilities, or estimated from the training set by leave-one-out cross-validation. When the feature dimension is high and the number of instances in each bag is low, pdf estimation becomes arbitrary. A solution is to estimate separate pdfs for each dimension, calculate the corresponding divergences , and treat them as inputs into a classifier replacing step 3. Code is available at https://github.com/kajsam/Bagtoclassdivergence.
5 Experiments
5.1 Simulated data
The following study exemplifies the difference between BH distance ratio, , KL information ratio, , and as classifiers for sparse training data. The minimum dissimilarity bag-to-bag classifiers are also implemented, based on KL information and BH distance. The number of instances from each bag is , the number of bags in the training set is varied from to from each class, and the number of bags in the test set is . Each bag and its instances are sampled as described in Eq. 4, and the area under the receiver operating characteristic (ROC) curve (AUC) serves as performance measure. For simplicity, we use Gaussian distributions in one dimension for Sim 1-Sim 4:
Sim 1: : No positive instances in negative bags.
Sim 2: : Positive instances in negative bags.
Sim 3: : Positive and negative instances have the same expectation of the mean, but unequal variance.
Sim 4: : Positive instances are sampled from two distributions with unequal mean expectation.
We add Sim 5 and Sim 6 for the discussion on instance labels in Section 6, as follows: Sim 5 is an uncertain object classification, where the positive bags are lognormal densities with and , and negative bags are Gaussian mixture densities with , , , and . These two densities are nearly identical, see (McLachlan2000Finite, p. 15). In Sim 6, the parameters of Sim 5 are i.i.d. observations from Gaussian distributions, each with for the Gaussian mixture, and for the lognormal distribution. Figure 3 shows the estimated class densities and two estimated bag densities for Sim 2 with negative bags in the training set.
We use the following details for the algorithm in Section 4.3: KDE fitting: Epanechnikov kernel with estimated bandwidth varying with the number of observations. Integrals: Importance sampling. Classifier: is varied to give the full range of sensitivities and specificities necessary to calculate AUC.
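The combination of KDE fitting and importance sampling can be sketched as follows. This is a simplified illustration rather than the implementation used in the experiments: it swaps in a Gaussian kernel with Silverman's rule-of-thumb bandwidth instead of the Epanechnikov kernel with estimated bandwidth, uses a KDE of all pooled instances as the proposal distribution, and all names and data are hypothetical.

```python
import math
import random
import statistics

def make_kde(data):
    """Gaussian KDE with Silverman's rule-of-thumb bandwidth (assumed choice)."""
    h = 1.06 * statistics.stdev(data) * len(data) ** (-0.2)
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))

    def pdf(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

    def draw(rng):
        # Sampling from a KDE: pick a data point, then add kernel noise.
        return rng.gauss(rng.choice(data), h)

    return pdf, draw

def kl_importance(f, g, q_pdf, q_draw, rng, m=1000):
    """Importance-sampling estimate of the integral of f * log(f / g), proposal q."""
    total = 0.0
    for _ in range(m):
        x = q_draw(rng)
        fx, gx = f(x), g(x)
        if fx == 0.0:
            continue
        if gx == 0.0:
            return math.inf
        total += fx * math.log(fx / gx) / q_pdf(x)
    return total / m

rng = random.Random(1)
pos_class = [rng.gauss(2.0, 1.5) for _ in range(200)]  # pooled positive training instances
neg_class = [rng.gauss(0.0, 1.0) for _ in range(200)]  # pooled negative training instances
bag = [rng.gauss(2.2, 0.7) for _ in range(50)]         # unlabelled bag

f_bag, _ = make_kde(bag)
f_pos, _ = make_kde(pos_class)
f_neg, _ = make_kde(neg_class)
q_pdf, q_draw = make_kde(bag + pos_class + neg_class)  # proposal covers all supports

kl_pos = kl_importance(f_bag, f_pos, q_pdf, q_draw, rng)
kl_neg = kl_importance(f_bag, f_neg, q_pdf, q_draw, rng)
```

Here the bag is drawn close to the positive class, so `kl_pos` comes out smaller than `kl_neg`. Comparing the two values, for example through their ratio against a varying threshold, gives the kind of score from which an ROC curve can be traced.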
Table 1 shows the mean AUCs for repetitions.
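| Bags |    | : 5 |     |     | : 10 |     |     | : 25 |     |     |
| Sim  | :  | rBH | rKL | cKL | rBH  | rKL | cKL | rBH  | rKL | cKL |
|------|----|-----|-----|-----|------|-----|-----|------|-----|-----|
| 1    | 1  | 61  | 69  | 85  | 62   | 72  | 89  | 61   | 73  | 92  |
| 1    | 5  | 63  | 75  | 86  | 64   | 82  | 94  | 68   | 84  | 97  |
| 1    | 10 | 69  | 86  | 87  | 73   | 91  | 95  | 75   | 91  | 98  |
| 2    | 1  | 57  | 61  | 75  | 59   | 61  | 78  | 58   | 55  | 75  |
| 2    | 5  | 59  | 67  | 79  | 60   | 68  | 84  | 62   | 63  | 85  |
| 2    | 10 | 64  | 77  | 80  | 66   | 78  | 86  | 68   | 72  | 86  |
| 3    | 1  | 51  | 55  | 71  | 52   | 58  | 73  | 50   | 57  | 74  |
| 3    | 5  | 53  | 61  | 76  | 53   | 66  | 81  | 52   | 65  | 83  |
| 3    | 10 | 58  | 73  | 78  | 58   | 76  | 84  | 57   | 76  | 87  |
| 4    | 1  | 55  | 61  | 70  | 56   | 62  | 73  | 56   | 58  | 69  |
| 4    | 5  | 56  | 63  | 75  | 57   | 64  | 81  | 59   | 59  | 80  |
| 4    | 10 | 60  | 74  | 77  | 62   | 76  | 85  | 63   | 69  | 84  |
| 5    | 1  | 64  | 61  | 62  | 67   | 63  | 66  | 64   | 62  | 67  |
| 5    | 5  | 73  | 69  | 63  | 74   | 70  | 67  | 75   | 71  | 72  |
| 5    | 10 | 74  | 70  | 62  | 75   | 73  | 69  | 76   | 74  | 72  |
| 6    | 1  | 68  | 68  | 67  | 66   | 68  | 68  | 68   | 71  | 68  |
| 6    | 5  | 65  | 64  | 67  | 68   | 68  | 69  | 70   | 71  | 74  |
| 6    | 10 | 66  | 64  | 66  | 70   | 69  | 72  | 72   | 73  | 74  |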
5.2 Breast tissue images
Breast tissue images (see Fig. 1) with corresponding feature vectors are used as an example. Following the procedure in Kandemir2014Empowering , the principal components are used for dimension reduction, and fold cross-validation is used so that and are fitted only to the instances in the training folds. For pdf estimation, GMMs are fitted to the first principal component, using an EM algorithm, with the number of components chosen by minimum AIC. In addition, KDE as in Section 5.1, and KDE with Gaussian kernel and optimal bandwidth Sheather1991Reliable are used.
|   | KDE (Epan.) | KDE (Gauss.) | GMMs |
|---|-------------|--------------|------|
|   | 90          | 92           | 94   |
|   | 82          | 92           | 96   |
5.3 Benchmark data
We here present the results for 7 benchmark datasets (available at https://figshare.com/articles/MIProblems_A_repository_of_multiple_instance_learning_datasets/6633983) together with the results of five other methods as reported in the cited publications.
The datasets have relatively few instances per bag compared to the dimensionality.
For detailed descriptions and references, see Cheplygina2015Multiple .
We use the following details for the algorithm in Section 4.3:
KDE fitting: Gaussian kernel with optimal bandwidth. Integrals: Importance sampling. Classifier: Support vector machine (SVM) with linear kernel.
for d = 1: Dim
Fit to and sample from .
1. Fit , , using KDE.
2. Approximate or using and .
end
3. Input or to SVM.
10 times 10-fold cross-validation is used, except for the 2000-Image dataset, where 5 times 2-fold cross-validation is used as in Wei2017Scalable and Zhou2009Multiinstance . In Cheplygina2016DissimilarityBased , one 10-fold cross-validation was performed, and the standard error was reported. In Wang2018Revisiting , 5 times 10-fold cross-validation was performed. In Qiao2017Diversified , several parameters are optimised for each data set, which prevents a fair comparison, and no deviation or error was reported. The accuracies and the standard deviations are presented in Table 3 and Table 4, where the highest accuracies for each data set and those within one standard deviation are marked in bold.
| Method | Musk1 | Musk2 | Fox | Tiger | Elephant |
|--------|-------|-------|-----|-------|----------|
| MINet(DS) Wang2018Revisiting | 89.4 (9.3) | 87.4 (9.7) | 63.0 (8.0) | 84.5 (8.7) | 87.2 (7.2) |
| miFV Wei2017Scalable | 87.5 (10.6) | 86.1 (10.6) | 56.0 (9.9) | 78.9 (9.1) | 78.9 (9.1) |
| miGraph Zhou2009Multiinstance | 88.9 (3.3) | 90.3 (2.6) | 61.6 (2.8) | 86.0 (1.6) | 86.8 (0.7) |
| Cheplygina2016DissimilarityBased | 89.3 (3.4) | 85.5 (4.7) | 64.4 (2.2) | 81.0 (4.6) | 80.4 (3.5) |
| Qiao2017Diversified | 87.7 | 89.1 | 65.0 | 80.0 | 90.67 |
| rBH | 64.4 (3.1) | 69.2 (3.2) | 71.5 (1.2) | 70.1 (1.3) | 81.7 (1.7) |
| cKL | 74.0 (1.9) | 69.9 (2.0) | 65.8 (2.1) | 85.0 (1.4) | 71.1 (3.3) |
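| Method | 2000-Image | Alt.atheism |
|--------|------------|-------------|
| MINet(DS) |  | 86.0 (13.4) |
| miFV | 87.5 (7.2) |  |
| miGraph | 72.1 | 65.5 (4.0) |
|  |  | 44.0 (4.5) |
| rBH | 90.0 (6.4) | 62.0 (2.6) |
| cKL | 80.1 (10.5) | 85.5 (1.4) |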
5.4 Results
The general trend in Table 1 is that gives higher AUC than , which in turn gives higher AUC than , in line with the divergences’ properties for sparse training sets. The same trend can be seen with a Gaussian kernel and optimal bandwidth (numbers not reported). The gap between and narrows with larger training sets. In other words, the benefit of increases with sparsity. This can be explained by the risk of , as seen in Figure 3(a).
Increasing also narrows the gap between and , and eventually (at approximately ), outperforms (numbers not reported). Sim 1 and Sim 3 are less affected because the ratio is already .
The minimum bag-to-bag classifier gives a single sensitivity-specificity outcome, and the KL information outperforms the BH distance. Compared to the ROC curve, as illustrated in Fig. 4, the minimum bag-to-bag KL information classifier exceeds the bag-to-class dissimilarities only for very large training sets, typically for 500 or more, and then at the expense of extensive computation time.
Sim 5 is an example in which the absolute difference, and not the ratio, differentiates the two classes, and has the superior performance. When the extra hierarchy level is added in Sim 6, the performances return to normal.
The breast tissue study shows that the simple divergencebased approach can outperform more sophisticated algorithms. is more sensitive than to choice of density estimation method. performs better than with GMM, and both exceed the AUC of of the original algorithm. Table 2 shows how the performance can vary between two common pdf estimation methods that do not assume a particular underlying distribution. Both KDE and GMM are sensitive to chosen parameters or parameter estimation method, bandwidth and number of components, respectively, and no method will fit all data sets. In general, KDE is faster, but more sensitive to bandwidth, whereas GMM is more stable. For bags with very few instances the benefits of GMM cannot be exploited, and KDE is preferred.
The benchmark data study shows that the proposed method combined with a standard classifier obtains results comparable with state-of-the-art algorithms, with the exception of the Musk data sets, where the number of instances per bag is low. In , more than half of the bags contain fewer than 5 instances, and in , one fourth of the bags contain fewer than 5 instances. Few instances per bag prevent good distribution estimates, and since the proposed method is based on bag distributions, the result is not surprising. The algorithms perform in the same range, although they are conceptually very different: MINet is a neural network approach, miFV is a kernel approach, miGraph is a graph approach, D is a dissimilarity approach, and DivDict is a diverse dictionary approach.
6 Discussion
6.1 Point-of-view
The theoretical basis of the bag-to-class divergence approach relies on viewing a bag as a probability distribution, and hence fits into the branch of collective assumptions of the Foulds and Frank taxonomy Foulds2010Review . The probability distribution estimation can be seen as extracting bag-level information from a set , and hence falls into the BS paradigm of Amores Amores2013Multiple . The probability distribution space is non-vectorial, different from the distance-kernel spaces in Amores2013Multiple , and divergences are used for classification.
In practice, the evaluation points of the importance sampling give a mapping from the set to a single vector, . The mapping concurs with the ES paradigm, and the same applies to the graph-based methods. From that viewpoint, the bag-to-class divergence approach expands the distance branch of Foulds and Frank to include a bag-to-class category in addition to instance-level and bag-level distances. However, the importance sampling is a technicality of the algorithm, and we argue that the method belongs to the BS paradigm. When the divergences are used as input to a classifier as in Section 5.3, the ES paradigm is a better description.
Carbonneau et al. Carbonneau2018Multiple assume underlying instance labels, and from a probability distribution viewpoint this corresponds to posterior probabilities, which are in practice inaccessible. In Sim 1-Sim 4, the instance labels are inaccessible through observations without previous knowledge about the distributions. In Sim 6, the instance label approach is not useful, due to the similarity between the two distributions:
(7) 
where and are the lognormal and the Gaussian mixture, respectively. Eq. 4 is just a special case of Eq. 7, where is the random vector . Without knowledge about the distributions, discriminating between training sets following the generative model of Eq. 4 and Eq. 7 is only possible for a limited number of problems. Even the uncertain objects of Sim 5 are difficult to discriminate from MI objects based solely on the observations in the training set.
6.2 Conclusions and future work
Although the bag-to-bag KL information asymptotically has the minimum misclassification rate, the typical bag sparseness of MI training sets is an obstacle, which is partly solved by bag-to-class dissimilarities, and the proposed class-conditional KL information accounts for additional sparsity of bags.
The bag-to-class divergence approach addresses three main challenges in MI learning. (1) Aggregation of instances according to bag label and the additional class-conditioning provide a solution to the bag sparsity problem. (2) The bag-to-bag approach suffers from extensive computation time, which the bag-to-class approach avoids. (3) Viewing bags as probability distributions gives access to analytical tools from statistics and probability theory, and comparisons of methods can be done on a data-independent level through identification of properties. The properties presented here are not an exhaustive list, and any extra knowledge should be taken into account whenever available.
A more thorough analysis of the proposed function, , will identify its weaknesses and strengths, and can lead to improved versions as well as alternative class-conditional dissimilarity measures and a more comprehensive tool.
The diversity of data types, assumptions, problem characteristics, sampling sparsity, etc. is far too large for any one approach to be sufficient. The introduction of divergences as an alternative class of dissimilarity functions, and of the bag-to-class dissimilarity as an alternative to the bag-to-bag dissimilarity, has added tools to the MI toolbox.
Acknowledgements
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
References
 (1) E. Drelie Gelasca, J. Byun, B. Obara, B. S. Manjunath, Evaluation and benchmark for biological image segmentation, in: 15th IEEE International Conference on Image Processing, 2008, pp. 1816–1819. doi:10.1109/ICIP.2008.4712130.
 (2) M. Kandemir, C. Zhang, F. A. Hamprecht, Empowering Multiple Instance Histopathology Cancer Diagnosis by Cell Graphs, in: P. Golland, N. Hata, C. Barillot, J. Hornegger, R. Howe (Eds.), Medical Image Computing and ComputerAssisted Intervention – MICCAI 2014, Springer International Publishing, 2014, pp. 228–235. doi:10.1007/9783319104706_29.

 (3) G. Doran, S. Ray, MultipleInstance Learning from Distributions, Journal of Machine Learning Research 17 (128) (2016) 1–50. URL http://jmlr.org/papers/v17/15171.html
 (4) M.L. Zhang, Z.H. Zhou, Multiinstance clustering with applications to multiinstance prediction, Applied Intelligence 31 (1) (2009) 47–68. doi:10.1007/s104890070111x.

 (5) Z.H. Zhou, M.L. Zhang, S.J. Huang, Y.F. Li, Multiinstance multilabel learning, Artificial Intelligence 176 (1) (2012) 2291–2320. doi:10.1016/j.artint.2011.10.002.
 (6) P. Tang, X. Wang, Z. Huang, X. Bai, W. Liu, Deep patch learning for weakly supervised object classification and discovery, Pattern Recognition 71 (2017) 446–459. doi:10.1016/J.PATCOG.2017.05.001.
 (7) X. Wang, Y. Yan, P. Tang, X. Bai, W. Liu, Revisiting multiple instance neural networks, Pattern Recognition 74 (2018) 15–24. doi:10.1016/J.PATCOG.2017.08.026.
 (8) T. G. Dietterich, R. H. Lathrop, T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence 89 (1–2) (1997) 31–71. doi:10.1016/s00043702(96)000343.
 (9) Y.-Y. Xu, Multiple-instance learning based decision neural networks for image retrieval and classification, Neurocomputing 171 (2016) 826–836. doi:10.1016/j.neucom.2015.07.024.
 (10) M. Qiao, L. Liu, J. Yu, C. Xu, D. Tao, Diversified dictionaries for multi-instance learning, Pattern Recognition 64 (2017) 407–416. doi:10.1016/j.patcog.2016.08.026.
 (11) N. Weidmann, E. Frank, B. Pfahringer, A Two-Level Learning Method for Generalized Multi-instance Problems, in: N. Lavrač, D. Gamberger, H. Blockeel, L. Todorovski (Eds.), Machine Learning: ECML 2003, Lecture Notes in Computer Science, vol 2837, Springer Berlin Heidelberg, 2003, pp. 468–479. doi:10.1007/9783540398578_42.

 (12) J. Foulds, E. Frank, A review of multi-instance learning assumptions, The Knowledge Engineering Review 25 (1) (2010) 1–25. doi:10.1017/s026988890999035x.
 (13) V. Cheplygina, D. M. J. Tax, M. Loog, Multiple Instance Learning with Bag Dissimilarities, Pattern Recogn. 48 (1) (2015) 264–275. doi:10.1016/j.patcog.2014.07.022.
 (14) M. A. Carbonneau, V. Cheplygina, E. Granger, G. Gagnon, Multiple Instance Learning: A survey of Problem Characteristics and Applications, Pattern Recognition 77 (2018) 329–353. doi:10.1016/j.patcog.2017.10.009.
 (15) J. Amores, Multiple Instance Classification: Review, Taxonomy and Comparative Study, Artif. Intell. 201 (2013) 81–105. doi:10.1016/j.artint.2013.06.003.
 (16) B. Jiang, J. Pei, Y. Tao, X. Lin, Clustering Uncertain Data Based on Probability Distribution Similarity, Knowledge and Data Engineering, IEEE Transactions on 25 (4) (2013) 751–763. doi:10.1109/tkde.2011.221.
 (17) H. P. Kriegel, M. Pfeifle, Density-based Clustering of Uncertain Data, in: Proceedings of the eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, 2005, pp. 672–677. doi:10.1145/1081870.1081955.

 (18) B. Schölkopf, The Kernel Trick for Distances, in: Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS’00, 2000, pp. 283–289. URL http://portal.acm.org/citation.cfm?id=3008793
 (19) S. Kullback, R. A. Leibler, On Information and Sufficiency, The Annals of Mathematical Statistics 22 (1) (1951) 79–86. doi:10.1214/aoms/1177729694.
 (20) A. L. Gibbs, F. E. Su, On Choosing and Bounding Probability Metrics, International Statistical Review 70 (3) (2002) 419–435. doi:10.1111/j.17515823.2002.tb00178.x.

 (21) S. M. Ali, S. D. Silvey, A General Class of Coefficients of Divergence of One Distribution from Another, Journal of the Royal Statistical Society. Series B (Methodological) 28 (1) (1966) 131–142. URL http://www.jstor.org/stable/2984279
 (22) I. Csiszár, Information-type measures of difference of probability distributions and indirect observations, Studia Scientiarum Mathematicarum Hungarica 2 (1967) 299–318.
 (23) K. Møllersen, S. S. Dhar, F. Godtliebsen, On Data-Independent Properties for Density-Based Dissimilarity Measures in Hybrid Clustering, Applied Mathematics 07 (15) (2016) 1674–1706. doi:10.4236/am.2016.715143.
 (24) K. Møllersen, J. Y. Hardeberg, F. Godtliebsen, Divergence-based colour features for melanoma detection, in: Colour and Visual Computing Symposium (CVCS), 2015, pp. 1–6. doi:10.1109/CVCS.2015.7274885.
 (25) O. Maron, T. Lozano-Pérez, A framework for multiple-instance learning, in: Advances in Neural Information Processing Systems 10, 1998, pp. 570–576.
 (26) X. Xu, E. Frank, Logistic Regression and Boosting for Labeled Bags of Instances, in: H. Dai, R. Srikant, C. Zhang (Eds.), Advances in Knowledge Discovery and Data Mining, PAKDD 2004, Lecture Notes in Computer Science, vol 3056, Springer Berlin Heidelberg, 2004, pp. 272–281. doi:10.1007/9783540247753_35.
 (27) D. M. J. Tax, M. Loog, R. P. W. Duin, V. Cheplygina, W. J. Lee, Bag Dissimilarities for Multiple Instance Learning, in: M. Pelillo, E. R. Hancock (Eds.), Similarity-Based Pattern Recognition, SIMBAD 2011, Lecture Notes in Computer Science, vol 7005, Springer Berlin Heidelberg, 2011, pp. 222–234. doi:10.1007/9783642244711_16.
 (28) Z. H. Zhou, Y. Y. Sun, Y. F. Li, Multi-instance Learning by Treating Instances As Non-I.I.D. Samples, in: Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, ACM, New York, NY, USA, 2009, pp. 1249–1256. doi:10.1145/1553374.1553534.
 (29) V. Cheplygina, D. M. J. Tax, M. Loog, Dissimilarity-Based Ensembles for Multiple Instance Learning, Neural Networks and Learning Systems, IEEE Transactions on, 27 (6) (2016) 1379–1391. doi:10.1109/TNNLS.2015.2424254.

 (30) O. Boiman, E. Shechtman, M. Irani, In defense of Nearest-Neighbor based image classification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, CVPR 2008, pp. 1–8. doi:10.1109/cvpr.2008.4587598.
 (31) W.-J. Lee, V. Cheplygina, D. M. J. Tax, M. Loog, R. P. W. Duin, Bridging structure and feature representations in graph matching, Int. J. Patt. Recogn. Artif. Intell. 26 (05) (2012) 1260005+. doi:10.1142/s0218001412600051.
 (32) S. Scott, J. Zhang, J. Brown, On generalized multiple-instance learning, Int. J. Comp. Intel. Appl. 05 (01) (2005) 21–35. doi:10.1142/s1469026805001453.
 (33) J. F. Ruiz-Muñoz, G. Castellanos-Dominguez, M. Orozco-Alzate, Enhancing the dissimilarity-based classification of birdsong recordings, Ecological Informatics 33 (2016) 75–84. doi:10.1016/j.ecoinf.2016.04.001.
 (34) L. Sørensen, M. Loog, D. M. J. Tax, W.-J. Lee, M. de Bruijne, R. P. W. Duin, Dissimilarity-Based Multiple Instance Learning, in: E. R. Hancock, R. C. Wilson, T. Windeatt, I. Ulusoy, F. Escolano (Eds.), Structural, Syntactic, and Statistical Pattern Recognition, SSPR/SPR 2010, Lecture Notes in Computer Science, vol 6218, Springer Berlin Heidelberg, 2010, pp. 129–138. doi:10.1007/9783642149801_12.
 (35) X.-S. Wei, J. Wu, Z.-H. Zhou, Scalable Algorithms for Multi-Instance Learning, IEEE Transactions on Neural Networks and Learning Systems 28 (4) (2017) 975–987. doi:10.1109/TNNLS.2016.2519102.
 (36) S. K. Sahu, R. C. H. Cheng, A fast distance-based approach for determining the number of components in mixtures, Can J Statistics 31 (1) (2003) 3–22. doi:10.2307/3315900.
 (37) S. Eguchi, J. Copas, Interpreting Kullback-Leibler Divergence with the Neyman-Pearson Lemma, J. Multivar. Anal. 97 (9) (2006) 2034–2040. doi:10.1016/j.jmva.2006.03.007.
 (38) R. E. Kass, A. E. Raftery, Bayes Factors, Journal of the American Statistical Association 90 (430) (1995) 773–795. doi:10.2307/2291091.
 (39) I. J. Taneja, P. Kumar, Generalized non-symmetric divergence measures and inequalities, Journal of Interdisciplinary Mathematics 9 (3) (2006) 581–599. doi:10.1080/09720502.2006.10700466.
 (40) G. McLachlan, D. Peel, Finite Mixture Models, Wiley Series in Probability and Statistics, John Wiley & Sons, Inc., 2000. doi:10.1002/0471721182.
 (41) S. J. Sheather, M. C. Jones, A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation, Journal of the Royal Statistical Society. Series B (Methodological) 53 (3) (1991) 683–690.