In a multiple instance learning (MIL) task, we learn a classifier from a set of bags, where each bag contains multiple instances. In the MIL setting, we know the label of each bag but not the label of each instance. This applies to many real-world tasks, such as medical imaging [Quellec et al.2017] (e.g., computational pathology, mammography, or CT lung screening), drug discovery (pharmacy) [Dietterich et al.1997], classification of text documents [Ruiz-Munoz et al.2015], and speaker identification (signal processing) [Mandel and Ellis2008]. MIL is therefore an important topic in the machine learning community, and many methods have been published in recent years.
Existing MIL approaches can be roughly categorized into the instance-level paradigm and the bag-level paradigm. The instance-level methods [Zhang and Goldman2002, Ray and Craven2005, Angelidis and Lapata2018] treat the instances of a bag individually and first predict instance labels as interim results. Then, they infer the bag label from the predicted labels of the instances constituting the bag. The bag-level approaches treat the bag as a whole and directly obtain the bag representation, either implicitly by defining distances between bags [Wang and Zucker2000], bag kernels [Gärtner et al.2002], and bag dissimilarities [Cheplygina et al.2015], or explicitly by pooling (possibly with attention) instance representations [Ilse et al.2018]. Then, they infer the bag label from the resulting bag representation.
Because only bag labels are available, most of these methods were designed to accurately predict bag labels, and their loss functions were defined only at the bag level. Even for methods of the instance-level paradigm, the instance-level label prediction is only an interim step toward the final bag-level label prediction, and no loss is defined on it. Therefore, these methods generally perform poorly for label prediction at the instance level [Kandemir and Hamprecht2015, Cheplygina et al.2015]. This restricts their application in many tasks where the instance-level label prediction is of greater interest, such as image segmentation and fine-grained sentiment classification.
In this work, we propose a novel MIL algorithm, whose loss function is specifically defined on the instance-level label prediction, to address this problem. The core idea of this algorithm is to obtain an unbiased estimate of the instance-level label prediction loss without using instance labels. We show in Theorem 1 that this can be achieved under the i.i.d. assumption if we know the ratio of negative instances. Further theoretical analysis of the consistency of this algorithm in Theorem 2 shows that, when using a bounded Bayes consistent loss function, it can achieve results similar to those of the fully supervised method (trained with instance labels) for label prediction at the instance level. Our experimental study validates the effectiveness of this algorithm at both the bag level and the instance level. The contributions of this work can be summarized as follows:
- We propose a novel MIL algorithm, whose loss function is specifically defined on the instance-level label prediction, to address instance-level label prediction in MIL.
- We provide a theoretical analysis of the proposed algorithm and prove that, by using a bounded Bayes consistent loss function, the instance-level label prediction loss can be unbiasedly and consistently estimated without knowing instance labels.
- Experimental results on both image and text datasets verify the effectiveness of this algorithm for label prediction at both the bag level and the instance level.
2 Related Work
Since Dietterich et al. [Dietterich et al.1997] introduced multiple instance learning for drug activity prediction, researchers have proposed a large number of algorithms for MIL tasks. According to how the information present in the multiple instance (MI) data is exploited, these algorithms can be categorized into the instance-level paradigm and the bag-level paradigm.
In the instance-level paradigm, Diverse Density [Maron and Lozano-Pérez1998] is perhaps the best known framework for MIL. Formally, for a subject , denote , where and represent the positive and negative bag respectively, and and are defined by:
where and is the instance in bag and . The objective of this algorithm is to seek the that maximizes . Following this framework, many methods have been proposed [Ray and Craven2005, Jia and Zhang2008]. These methods differ mainly in the definition of and . For example, the Multiple Instance Logistic Regression (MILR) [Ray and Craven2005] method defines:
More recent neural-network-based methods replace this function with the log-sum-exp pooling function [Pinheiro and Collobert2015, Wang et al.2018], the max pooling function [Wu et al.2015], a weighted function [Pappas and Popescu-Belis2017], or a gated attention function [Angelidis and Lapata2018].
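The pooling choices named above can be illustrated with a minimal sketch. It assumes that each instance already has a score in [0, 1]; the function names, the sharpness parameter `r`, and the explicit weight vector are hypothetical and chosen only for illustration, not taken from the cited methods.

```python
import numpy as np

def max_pool(p):
    """Bag score is the largest instance score."""
    return float(np.max(np.asarray(p, dtype=float)))

def log_sum_exp_pool(p, r=5.0):
    """Smooth stand-in for max: (1/r) * log(mean(exp(r * p)))."""
    p = np.asarray(p, dtype=float)
    m = np.max(r * p)  # subtract the maximum for numerical stability
    return float((m + np.log(np.mean(np.exp(r * p - m)))) / r)

def weighted_pool(p, w):
    """Weighted average of instance scores; weights are normalized to sum to 1."""
    p = np.asarray(p, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.dot(w / w.sum(), p))

instance_scores = [0.1, 0.2, 0.9]
print(max_pool(instance_scores))
print(log_sum_exp_pool(instance_scores))
print(weighted_pool(instance_scores, [1, 1, 2]))
```

The log-sum-exp value always lies between the mean and the maximum of the instance scores, and it approaches the maximum as `r` grows.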
Kotzias et al. [Kotzias et al.2015] additionally proposed a constraint on the instance-level label prediction based on the similarity between instances: label predictions are encouraged to be close for similar instances. Along the line of utilizing instance similarity, some methods directly identify key (positive) instances [Zhang and Goldman2002, Liu et al.2012]. The biggest difficulty of these similarity-based methods is that it is hard to design an appropriate distance measure between instances, especially for non-vectorized data such as images and text.
In the bag-level paradigm, each bag is treated as a whole. Methods of this paradigm commonly define a distance function that provides a way of comparing any two non-vectorial bags and . Once this distance function has been defined, it can be used in any standard distance-based classifier, such as Citation-KNN [Wang and Zucker2000] and SVM [Andrews et al.2003, Kwok and Cheung2007, Doran and Ray2014]. A representative framework in this paradigm is the MI-kernel [Gärtner et al.2002], which defines a kernel as a sum of the instance kernels and represents each bag by the minimum and maximum feature values of its instances. Along the same line, Zhou et al. [Zhou et al.2009] assumed that instances within a bag are not independent and identically distributed. Based on this assumption, they proposed another kernel function, which uses not only the similarity between pairwise where and , but also the similarity between the neighborhood of in and the neighborhood of in . Some other methods directly obtain the bag representation. For example, Wei et al. [Wei et al.2017] used a Fisher transform, and Ilse et al. [Ilse et al.2018] applied an attention-based pooling algorithm to obtain the bag representation. The bag representation was then passed to a classifier to perform the bag-level label prediction.
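As a rough illustration of the two bag-level ideas attributed to the MI-kernel above, the following sketch computes a bag kernel as a sum of instance kernels and, separately, a min/max bag representation. The RBF instance kernel, the `gamma` value, and all function names are assumptions made for this example.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Instance-level RBF kernel (one possible choice of instance kernel)."""
    d = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(d, d)))

def set_kernel(bag_a, bag_b, gamma=0.5):
    """Bag kernel: sum of instance kernels over all cross-bag instance pairs."""
    return sum(rbf_kernel(x, z, gamma) for x in bag_a for z in bag_b)

def minimax_features(bag):
    """Bag representation: per-feature minima followed by per-feature maxima."""
    bag = np.asarray(bag, dtype=float)
    return np.concatenate([bag.min(axis=0), bag.max(axis=0)])

bag_a = [[0.0, 0.0], [1.0, 0.0]]
bag_b = [[0.0, 1.0]]
print(set_kernel(bag_a, bag_b))
print(minimax_features(bag_a))
```

Either quantity can then be handed to a standard kernel or vector classifier, which is what makes the bag-level paradigm convenient when only bag labels matter.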
3 Preliminaries and Problem Statement
3.1 Multiple Instance Learning
Let and be the instance-level input and output random variables, where and denote the space of and , respectively. Let and be the bag-level input and output random variables, where and denote the space of and , respectively. In a typical multiple instance learning problem, we are given a set of bags and the bag labels, , drawn i.i.d. according to an unknown distribution, , over . The label of each instance is unknown. Let and denote the positive () and negative () bag sets, respectively. The standard MIL setting states that:
If bag is negative, then all instances in are negative, i.e., .
If bag is positive, then at least one instance in is positive, i.e., .
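The two conditions above amount to saying that the bag label is the logical OR of its instance labels. A minimal sketch, assuming labels are coded as 0 (negative) and 1 (positive), which is an encoding chosen here for illustration:

```python
def bag_label(instance_labels):
    """Return 1 if at least one instance is positive (1), otherwise 0."""
    return int(any(int(y) == 1 for y in instance_labels))

print(bag_label([0, 0, 0]))  # a bag with only negative instances
print(bag_label([0, 1, 0]))  # a bag with one positive instance
```

In the MIL setting only the output of such a function is observed during training; the instance labels passed to it are hidden.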
3.2 Instance-level Risk Minimization in MIL
In this work, we focus on the instance-level label prediction and approach it from the viewpoint of risk minimization. Let denote a classifier, and let denote the predicted label of . A loss function is a map . Given any loss function, , and a classifier, , we define the instance-level -risk of by:
where, throughout this paper, denotes expectation and its subscript indicates the random variables with respect to which the expectation is taken. When happens to be the 0-1 loss, would be the typical Bayes risk:
where is the indicator function. In the risk minimization framework, the objective of this work is to learn a classifier, , that minimizes under the bag constraints of MIL:
The biggest challenge of the above optimization problem is that we do not know the labels of instances. Therefore, it is impossible to estimate with the empirical risk:
where the superscript of means the risk is estimated in the supervised setting, where instance labels are available. In addition, because the 0-1 loss is non-differentiable, it is necessary to replace it with an appropriate loss function . A common requirement for is that it should be differentiable. Another critical requirement is that it should be Bayes consistent, which we illustrate in the following section.
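For concreteness, the supervised empirical risk under the 0-1 loss can be sketched as below. The sketch assumes a real-valued classifier score thresholded at zero and labels coded as 0/1; both conventions and the function name are assumptions made for illustration.

```python
import numpy as np

def empirical_zero_one_risk(scores, labels):
    """Fraction of instances whose thresholded score disagrees with its label."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    predictions = (scores > 0).astype(int)  # ASSUMPTION: decision threshold at 0
    return float(np.mean(predictions != labels))

print(empirical_zero_one_risk([2.0, -1.0, 0.5, -3.0], [1, 1, 0, 0]))
```

This quantity needs the instance labels, which MIL does not provide, and it is piecewise constant in the scores, so it gives no useful gradient. Both points motivate a differentiable surrogate that can be estimated without instance labels.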
3.3 Bayes Consistent Loss Function
A loss function is said to be Bayes consistent [Tewari and Bartlett2007] if the -risk of gets close to the optimal when the risk of approaches the lower bound of the Bayes risk. That is, implies . According to [Bartlett et al.2006], a convex loss function is Bayes consistent if it is differentiable at 0 and , and any minimizer of
yields a Bayes consistent classifier, that is, and .
In this work, we follow the assumption that instances are independent and identically distributed (i.i.d.) and that the instance distribution is independent of the bag label given the instance label. According to this assumption, we have:
In the following, building on this assumption, we show how to estimate the instance-level label prediction loss without using instance labels.
4.1 Estimate without Instance Labels
In this section, we prove that can be unbiasedly estimated without using instance labels, as detailed in the following theorem.
can be unbiasedly estimated by:
Here the superscript of means that the risk is estimated in the unsupervised setting, where instance labels are not available.
can be reformulated as
and divided into two parts by bag label:
According to Eq. (5), we have
The second term on the right-hand side of Eq. (10) can be further formulated as:
Note that is an unbiased estimate of the right-hand side of Eq. (15). Therefore, is an unbiased estimate of , completing the proof. ∎
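One way to see why instance labels are not needed is the following minimal sketch. It assumes labels coded as 0/1, a known negative-instance ratio `neg_ratio`, and a sample of instances known to be negative (for example, instances taken from negative bags). The function names are hypothetical, and the decomposition used is the standard one that writes the positive-class part of the risk as the overall expectation minus the negative-class part; it is not claimed to reproduce the exact expression of the theorem.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def squared_loss(scores, label):
    """Bounded loss between sigmoid(score) and a constant label in {0, 1}."""
    return (sigmoid(np.asarray(scores, dtype=float)) - label) ** 2

def supervised_risk(scores, labels):
    """Average loss when every instance label is known."""
    labels = np.asarray(labels)
    per_instance = np.where(labels == 1, squared_loss(scores, 1), squared_loss(scores, 0))
    return float(np.mean(per_instance))

def label_free_risk(all_scores, negative_scores, neg_ratio):
    """Same risk from unlabeled instances plus instances known to be negative."""
    # Positive-class part: overall expectation minus the negative-class share.
    pos_part = np.mean(squared_loss(all_scores, 1)) - neg_ratio * np.mean(squared_loss(negative_scores, 1))
    neg_part = neg_ratio * np.mean(squared_loss(negative_scores, 0))
    return float(pos_part + neg_part)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)              # hidden instance labels
scores = rng.normal(loc=2.0 * labels - 1.0, scale=1.0)
neg_ratio = float(np.mean(labels == 0))
print(supervised_risk(scores, labels))
print(label_free_risk(scores, scores[labels == 0], neg_ratio))
```

In this demonstration the known negatives are exactly the negatives in the pool, so the two printed values agree up to floating-point error. With an independent sample of negatives they would agree only in expectation. The subtracted term is also what allows the estimate to become negative when the loss is unbounded.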
4.2 Consistency with Bounded Bayes Consistent Loss Function
In this section, we analyze the consistency between and (all proofs appear in Appendix A, available at https://drive.google.com/file/d/1MhvgHYtQo_9F2QcN0sGGNsK0lm6vBsEc/view?usp=sharing).
Let denote the Lipschitz constant that , denote , and denote a Reproducing Kernel Hilbert Space (RKHS). For each given , we consider as hypothesis space , the ball of radius in . Let be the covering number of following Theorem C in [Cucker and Smale2002], and denote the instance number in . Then we have the following theorem.
For a Bayes consistent loss function , if it is bounded by , then for any ,
where and .
Consider what happens if is unbounded or, more specifically, not upper bounded. For a given example and , its corresponding risk within is . Because is not upper bounded, to achieve a smaller risk on , can heavily overfit, making and accordingly . Thus, . From this analysis, we can expect that, when using an unbounded loss function and a flexible classifier, will decrease dramatically to a value far below zero. This is indeed observed in our experiments.
4.3 Formulate Bag Constraints of MIL
Note that the above estimation of does not take into account the instantiation of each bag, and it may not satisfy the bag constraints of MIL specified in Eq. (3). In this section, we show how to combine the estimation with the constraints of MIL from the risk view.
From the risk view, the constraints can be formulated as:
Therefore, the optimization problem of Eq. (3) can be approximated by:
The above formulation assumes that the data is exactly separable by . This is not guaranteed in many cases. Even if is pretty complicated and exactly separates the training bags, it usually suffers from severe overfitting and is susceptible to outliers. To regularize the above algorithm, we reformulate the optimization problem as:
where denotes the number of bags in , and is a hyper-parameter that controls the relative weighting between the twin goals of making small and of ensuring that satisfies the bag constraints of MIL. In this work, we set .
5.1 Choose Loss Function
In this work, we consider the mean square error (MSE) loss function and the cross entropy (CE) loss function to perform the task. Denote where is the sigmoid function. We can check that and its corresponding when . Therefore, is Bayes consistent. We can also check that and its corresponding when . Therefore, is also a Bayes consistent loss function. In addition, the value range of is and that of is . Therefore, is bounded while is not. In this work, we apply as the loss function unless otherwise stated.
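The difference between a bounded and an unbounded loss can be seen numerically. The sketch below assumes that both losses are applied to a sigmoid of a real-valued score with labels coded as 0/1; these conventions and the function names are assumptions for illustration.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def mse_loss(score, label):
    """Squared error between sigmoid(score) and the label; never exceeds 1."""
    return (sigmoid(score) - label) ** 2

def ce_loss(score, label):
    """Cross entropy on sigmoid(score), written with logaddexp for stability."""
    # -log(sigmoid(s)) = log(1 + exp(-s)) and -log(1 - sigmoid(s)) = log(1 + exp(s))
    return label * np.logaddexp(0.0, -score) + (1 - label) * np.logaddexp(0.0, score)

for s in [1.0, 10.0, 100.0]:
    # A confidently wrong prediction: large negative score for a positive label.
    print(s, mse_loss(-s, 1), ce_loss(-s, 1))
```

As the score becomes more extreme, the MSE value saturates just below 1 while the CE value keeps growing roughly linearly in the score.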
To perform the empirical study, we designed experiments on four commonly used datasets, namely MNIST [LeCun et al.1998], SVHN [Netzer et al.2011], CIFAR10 [Krizhevsky and Hinton2009], and 20Newsgroup [Lang1995]. Half of the classes of each dataset were mapped to the positive class and the other half were mapped to the negative class. For example, digits 0-4 of MNIST were mapped to the positive class and digits 5-9 were mapped to the negative class. See the Appendix for the label mapping details of each dataset. For each dataset, we generated a training bag set and a testing bag set. Each bag set contains 3,000 positive bags and 3,000 negative bags. Let and denote the positive and negative example sets, respectively. Let denote a uniform distribution over integers ranging from 1 to 9. Each bag was generated by the following procedure: for each negative bag, we first sample the bag size from , and then randomly sample with replacement negative examples from ; for each positive bag, we first sample the bag size from and the positive example number from , and then randomly sample with replacement positive examples from and negative examples from . Note that, to avoid overlap between the training and testing bag sets, we sampled examples solely from the training data when constructing the training bag set and solely from the testing data when constructing the testing bag set.
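The bag generation procedure can be sketched as follows. The bag size is drawn uniformly from 1 to 9 as described; the distribution of the number of positive examples in a positive bag is assumed here to be uniform from 1 to the bag size, and the index pools and function name are hypothetical stand-ins for the real datasets.

```python
import numpy as np

def make_bag(rng, pos_pool, neg_pool, positive):
    """Return (instance_indices, instance_labels, bag_label) for one bag."""
    size = int(rng.integers(1, 10))  # uniform over the integers 1..9
    # ASSUMPTION: positive count is uniform over 1..size for positive bags.
    n_pos = int(rng.integers(1, size + 1)) if positive else 0
    n_neg = size - n_pos
    # Sampling with replacement: draw random positions into each pool.
    pos_idx = pos_pool[rng.integers(0, len(pos_pool), size=n_pos)]
    neg_idx = neg_pool[rng.integers(0, len(neg_pool), size=n_neg)]
    instances = np.concatenate([pos_idx, neg_idx])
    labels = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
    return instances, labels, int(positive)

rng = np.random.default_rng(0)
pos_pool = np.arange(0, 100)    # hypothetical indices of positive examples
neg_pool = np.arange(100, 200)  # hypothetical indices of negative examples
bags = [make_bag(rng, pos_pool, neg_pool, positive=(i % 2 == 0)) for i in range(6)]
for instances, labels, y in bags:
    print(y, instances.tolist(), labels.tolist())
```

Building the training and testing bag sets from separate pools, as the text describes, only requires calling such a function with different `pos_pool` and `neg_pool` arrays.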
5.3 Compared Methods
To evaluate the performance of our proposed algorithm, called both bag- and instance-level multiple instance learning (BIMIL), we introduced one of its variants, referred to as IMIL, for comparison. IMIL only performs instance-level risk minimization (minimize ) and does not consider the bag constraints of MIL. In addition, we included several typical MIL methods from previously published work: three instance-level approaches, i.e., diverse density (DD) [Maron and Lozano-Pérez1998] with one subject, multiple instance logistic regression (MILR) [Ray and Craven2005], and miNet [Wang et al.2018]; and two bag-level approaches, i.e., miFV [Wei et al.2017] and the neural-network-based multiple instance learning with gated attention (MIGA) [Ilse et al.2018]. For miFV, we flattened each image into a pixel vector as the instance representation, and we used the TF-IDF vector as the instance representation for the 20Newsgroup data. Finally, we compared BIMIL with the fully supervised model (Sup) trained on the instance-level labels.
5.4 Implementation Detail
For the image datasets, i.e., MNIST, SVHN, and CIFAR10, we used two convolutional layers followed by two fully-connected layers to implement the classifier . All of these layers except the last one applied the ReLU activation function. For 20Newsgroup, each instance was represented by the top 4,000 TF-IDF features and the classifier was implemented using two fully-connected layers with Softplus activation functions. Model parameters were all randomly initialized. Parameters were updated using the RMSprop [Graves2013] optimizer with the learning rate set to 1e-4. In the experiments, we assumed that the value of is known unless otherwise stated. See Appendix B for detailed information on the model implementation.
5.5 Instance-level Results
We first studied the effectiveness of our proposed risk estimation method, i.e., IMIL. To this end, we compared its Bayes risks on test data with those of Sup. The experimental results are depicted in Figure 1. We can see that the minimum instance-level Bayes risks achieved by IMIL were close to those of Sup on all four tested datasets. We further compared the instance-level performance of our proposed algorithm, BIMIL, with two instance-level baselines, i.e., DD and MILR. Table 1 shows the results of these models on the four tested datasets. We can observe that BIMIL outperformed both DD and MILR for label prediction at the instance level and achieved results comparable to those of the fully supervised model Sup. These observations verify the effectiveness of our proposed instance-level risk estimation method under the assumption of this work.
We then studied the influence of the loss function, mainly what happens if the loss function is unbounded. To this end, we compared the performance of IMIL when using the mean square loss and the cross entropy loss , respectively, to instantiate . Figure 3 depicts the trend of and the Bayes risk over epochs when using these two loss functions on the SVHN and 20Newsgroup datasets. From this figure, we can see that when using the cross entropy loss function to instantiate , the -risk on training data dropped quickly after a few epochs while that on testing data steadily increased. This indicates that the model severely overfits the training data. In contrast, when we used the mean square loss function to instantiate , the overfitting was significantly reduced and the model achieved better instance-level classification performance on testing data. This is consistent with our analysis of the necessity of a bounded loss function in the "Consistency with Bounded Bayes Consistent Loss Function" section.
Finally, we studied the influence of using a biased value of on the proposed algorithm. According to our data generation process, we have . The minimum value 0.5 occurs when every positive bag consists of positive instances only, and the maximum value 0.9 occurs when every positive bag contains only one positive instance. Therefore, we tried values ranging from 0.5 to 0.9 in steps of 0.05. Figure 2 depicts the performance of IMIL using different values of . We can first observe from the figure that IMIL is quite robust around the true value of . In addition, for some over-estimated values, it even performs better than with the true value. Finally, we can see that IMIL usually performs better with over-estimated values than with under-estimated values.
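The endpoints 0.5 and 0.9 can be checked with a short calculation in expectation. The symbols below are introduced only for this illustration: $\pi$ for the expected ratio of negative instances, $n$ for the bag size, and $k$ for the number of positive instances in a positive bag. It uses the facts stated in the data description, namely equally many positive and negative bags and bag sizes uniform on the integers 1 to 9.

```latex
\mathbb{E}[n] = \frac{1 + 2 + \dots + 9}{9} = 5,
\qquad
\pi = \frac{\mathbb{E}[n] + \bigl(\mathbb{E}[n] - \mathbb{E}[k]\bigr)}{2\,\mathbb{E}[n]}
    = 1 - \frac{\mathbb{E}[k]}{10}.
% every positive bag all positive: k = n, so E[k] = 5 and \pi = 1 - 5/10 = 0.5
% every positive bag has one positive: k = 1, so E[k] = 1 and \pi = 1 - 1/10 = 0.9
```

The numerator counts the expected negative instances per pair of bags: all $n$ instances of the negative bag plus the $n - k$ negative instances of the positive bag.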
5.6 Bag-level Results
To evaluate the bag-level performance of the proposed algorithm, we compared it with the five published baselines and its variant IMIL. Table 2 shows the performance of these models. We can first observe that the proposed algorithm, BIMIL, achieved results comparable to those of state-of-the-art MIL approaches for the bag-level label prediction. We can also observe that IMIL, which does not incorporate the constraints of MIL, also achieved acceptable results compared with existing models. This shows the applicability of our proposed algorithm to the bag-level label prediction. By comparing the results of IMIL and BIMIL, we can see that incorporating the constraints of MIL brings further improvement to the bag-level label prediction. Overall, the above experiments demonstrate the effectiveness of our proposed algorithm for both bag-level and instance-level label prediction in MIL.
This work proposes a novel multiple instance learning algorithm to address the instance-level label prediction problem in MIL. Different from existing MIL approaches, the loss function of our proposed algorithm is specifically defined on the instance-level label prediction without using instance labels. We theoretically prove the effectiveness of this algorithm when using a bounded Bayes consistent loss function, under the i.i.d. assumption. Experimental studies on both image and text datasets show that the proposed algorithm can achieve competitive performance for both bag-level and instance-level label prediction.
- [Andrews et al.2003] Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines for multiple-instance learning. In NIPS, pages 577–584, 2003.
- [Angelidis and Lapata2018] Stefanos Angelidis and Mirella Lapata. Multiple instance learning networks for fine-grained sentiment analysis. TACL, 6:17–31, 2018.
- [Bartlett et al.2006] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
- [Cheplygina et al.2015] Veronika Cheplygina, David MJ Tax, and Marco Loog. Multiple instance learning with bag dissimilarities. Pattern Recognition, 48(1):264–275, 2015.
- [Cucker and Smale2002] Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American mathematical society, 39(1):1–49, 2002.
- [Dietterich et al.1997] Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence, 89(1-2):31–71, 1997.
- [Doran and Ray2014] Gary Doran and Soumya Ray. A theoretical and empirical analysis of support vector machine methods for multiple-instance classification. Machine Learning, 97(1-2):79–102, 2014.
- [Gärtner et al.2002] Thomas Gärtner, Peter A Flach, Adam Kowalczyk, and Alexander J Smola. Multi-instance kernels. In ICML, volume 2, pages 179–186, 2002.
- [Graves2013] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
- [Ilse et al.2018] Maximilian Ilse, Jakub M Tomczak, and Max Welling. Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712, 2018.
- [Jia and Zhang2008] Yangqing Jia and Changshui Zhang. Instance-level semisupervised multiple instance learning. In AAAI, pages 640–645, 2008.
- [Kandemir and Hamprecht2015] Melih Kandemir and Fred A Hamprecht. Computer-aided diagnosis from weak supervision: a benchmarking study. Computerized medical imaging and graphics, 42:44–50, 2015.
- [Kotzias et al.2015] Dimitrios Kotzias, Misha Denil, Nando De Freitas, and Padhraic Smyth. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 597–606. ACM, 2015.
- [Krizhevsky and Hinton2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- [Kwok and Cheung2007] James T Kwok and Pak-Ming Cheung. Marginalized multi-instance kernels. In IJCAI, volume 7, pages 901–906, 2007.
- [Lang1995] Ken Lang. Newsweeder: Learning to filter netnews. In Machine Learning Proceedings 1995, pages 331–339. Elsevier, 1995.
- [LeCun et al.1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [Liu et al.2012] Guoqing Liu, Jianxin Wu, and Z-H Zhou. Key instance detection in multi-instance learning. In Asian Conference on Machine Learning, pages 253–268, 2012.
- [Mandel and Ellis2008] Michael I Mandel and Daniel PW Ellis. Multiple-instance learning for music information retrieval. In ISMIR, pages 577–582, 2008.
- [Maron and Lozano-Pérez1998] Oded Maron and Tomás Lozano-Pérez. A framework for multiple-instance learning. In NIPS, pages 570–576, 1998.
- [Netzer et al.2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
- [Pappas and Popescu-Belis2017] Nikolaos Pappas and Andrei Popescu-Belis. Explicit document modeling through weighted multiple-instance learning. Journal of Artificial Intelligence Research, 58:591–626, 2017.
- [Pinheiro and Collobert2015] Pedro O Pinheiro and Ronan Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, pages 1713–1721, 2015.
- [Quellec et al.2017] Gwenolé Quellec, Guy Cazuguel, Béatrice Cochener, and Mathieu Lamard. Multiple-instance learning for medical image and video analysis. IEEE reviews in biomedical engineering, 10:213–234, 2017.
- [Ray and Craven2005] Soumya Ray and Mark Craven. Supervised versus multiple instance learning: An empirical comparison. In ICML, pages 697–704. ACM, 2005.
- [Rosasco et al.2004] Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana, and Alessandro Verri. Are loss functions all the same? Neural Computation, 16(5):1063–1076, 2004.
- [Ruiz-Munoz et al.2015] José Francisco Ruiz-Munoz, Mauricio Orozco-Alzate, and Germán Castellanos-Domínguez. Multiple instance learning-based birdsong classification using unsupervised recording segmentation. In IJCAI, pages 2632–2638, 2015.
- [Tewari and Bartlett2007] Ambuj Tewari and Peter L Bartlett. On the consistency of multiclass classification methods. JMLR, 8(May):1007–1025, 2007.
- [Wang and Zucker2000] Jun Wang and Jean-Daniel Zucker. Solving multiple-instance problem: A lazy learning approach. 2000.
- [Wang et al.2018] Xinggang Wang, Yongluan Yan, Peng Tang, Xiang Bai, and Wenyu Liu. Revisiting multiple instance neural networks. Pattern Recognition, 74:15–24, 2018.
- [Wei et al.2017] Xiu-Shen Wei, Jianxin Wu, and Zhi-Hua Zhou. Scalable algorithms for multi-instance learning. IEEE transactions on neural networks and learning systems, 28(4):975–987, 2017.
- [Wu et al.2015] Jiajun Wu, Yinan Yu, Chang Huang, and Kai Yu. Deep multiple instance learning for image classification and auto-annotation. In CVPR, pages 3460–3469, 2015.
- [Zhang and Goldman2002] Qi Zhang and Sally A Goldman. Em-dd: An improved multiple-instance learning technique. In NIPS, pages 1073–1080, 2002.
- [Zhou et al.2009] Zhi-Hua Zhou, Yu-Yin Sun, and Yu-Feng Li. Multi-instance learning by treating instances as non-iid samples. In ICML, pages 1249–1256. ACM, 2009.
Appendix A Proof of Theorem 1
Let denote the Lipschitz constant that , denote , and denote a Reproducing Kernel Hilbert Space. For each given , we consider as hypothesis space , the ball of radius in the RKHS . Let be the covering number of following Theorem C in [Cucker and Smale2002]. Let denote the instance number in and that in . Then we have the following theorem.
For a Bayes consistent loss function , if it is bounded by , then for any ,
where and .
Let denote the empirical estimation of with randomly labeled examples. Since is bounded, , , and are finite. According to the Lemma in [Rosasco et al.2004] we have:
The empirical estimation error of can be written as:
According to Eq. (20), we have:
Similarly, let denote
The theorem follows by replacing with . ∎
Appendix B Implementation Detail
Table 3 shows the classes mapped to the positive class and the negative class, respectively, in our experiments for multiple instance learning.