1 Introduction
In a multiple instance learning (MIL) task, we learn a classifier from a set of bags, each of which contains multiple instances. In this setting, the label of each bag is known, but the labels of the individual instances are not. This applies to many real-world tasks, such as medical imaging
[Quellec et al.2017] (e.g., computational pathology, mammography, or CT lung screening), drug discovery (pharmacy) [Dietterich et al.1997], classification of text documents [RuizMunoz et al.2015], speaker identification (signal processing) [Mandel and Ellis2008], and so on. MIL is therefore an important topic in the machine learning community, and many methods have been published in recent years.
Existing MIL approaches can be roughly categorized into the instance-level paradigm and the bag-level paradigm. Instance-level methods [Zhang and Goldman2002, Ray and Craven2005, Angelidis and Lapata2018] treat the instances of a bag individually and first predict instance labels as interim results; they then infer the bag label from the predicted labels of the instances constituting the bag. Bag-level approaches treat the bag as a whole and directly obtain a bag representation, either implicitly, by defining distances between bags [Wang and Zucker2000], bag kernels [Gärtner et al.2002], or bag dissimilarities [Cheplygina et al.2015], or explicitly, by pooling (possibly with attention) instance representations [Ilse et al.2018]; they then infer the bag label from the resulting bag representation.
Because only bag labels are available, most of these methods were designed to accurately predict bag labels, and their loss functions were defined only at the bag level. Even for methods of the instance-level paradigm, instance-level label prediction is only an interim step toward the final bag-level prediction, and no loss is defined on it. As a result, these methods generally perform poorly for label prediction at the instance level [Kandemir and Hamprecht2015, Cheplygina et al.2015]. This restricts their application in tasks where instance-level label prediction is of primary interest, such as image segmentation and fine-grained sentiment classification.
In this work, we propose a novel MIL algorithm, whose loss function is defined specifically on instance-level label prediction, to address this problem. The core idea of the algorithm is to unbiasedly estimate the instance-level label prediction loss without using instance labels. We show in Theorem 1 that this can be achieved if the ratio of negative instances is known, under the i.i.d. assumption. A further consistency analysis in Theorem 2 shows that, when a bounded Bayes consistent loss function is used, the algorithm can achieve results similar to those of the fully supervised method (trained with instance labels) for instance-level label prediction. Our experimental study validates the effectiveness of the algorithm at both the bag level and the instance level. The contributions of this work can be summarized as follows:
We propose a novel MIL algorithm whose loss function is defined specifically on instance-level label prediction, addressing the instance-level label prediction problem in MIL.

We provide a theoretical analysis of the proposed algorithm and prove that, when a bounded Bayes consistent loss function is used, the instance-level label prediction loss can be unbiasedly and consistently estimated without knowing instance labels.

Experimental results on both image and text datasets verify the effectiveness of the algorithm for label prediction at both the bag level and the instance level.
2 Related Work
Since Dietterich et al. [Dietterich et al.1997] introduced multiple instance learning for drug activity prediction, researchers have proposed a large number of algorithms for MIL tasks. According to how the information in the multiple instance (MI) data is exploited, these algorithms can be categorized into the instance-level paradigm and the bag-level paradigm.
In the instance-level paradigm, Diverse Density (DD) [Maron and LozanoPérez1998] is perhaps the best known framework for MIL. Formally, for a candidate concept $t$, it defines $DD(t) = \prod_i \Pr(t \mid B_i^{+}) \prod_i \Pr(t \mid B_i^{-})$, where $B_i^{+}$ and $B_i^{-}$ denote the $i$-th positive and negative bag, respectively. The objective of the algorithm is to seek the concept $t$ that maximizes $DD(t)$. Following this framework, many methods have been proposed [Ray and Craven2005, Jia and Zhang2008]; their main difference lies in how $\Pr(t \mid B_i^{+})$ and $\Pr(t \mid B_i^{-})$ are defined.
For example, the Multiple Instance Logistic Regression (MILR) [Ray and Craven2005] method models the probability of each instance with a logistic function and combines these instance probabilities with a softmax-style function to obtain the bag probability.
More recent neural-network-based methods replace this combining function with the log-sum-exp pooling function [Pinheiro and Collobert2015, Wang et al.2018], the max pooling function [Wu et al.2015], a weighted-average function [Pappas and PopescuBelis2017], or a gated attention function [Angelidis and Lapata2018]. In addition, Kotzias et al. [Kotzias et al.2015] proposed a constraint on instance-level label prediction based on the similarity between instances, encouraging similar instances to receive similar label predictions. In the same line of exploiting instance similarity, some methods directly recognize key (positive) instances [Zhang and Goldman2002, Liu et al.2012]. The biggest difficulty of these similarity-based methods is that it is hard to design an appropriate distance measure between instances, especially for non-vectorized data such as images and text.
In the bag-level paradigm, each bag is treated as a whole. Methods of this paradigm commonly define a distance function that provides a way of comparing any two non-vectorial bags. Once this distance function has been defined, it can be used in any standard distance-based classifier, such as Citation-KNN [Wang and Zucker2000] and SVMs [Andrews et al.2003, Kwok and Cheung2007, Doran and Ray2014]. A representative framework in this paradigm is the MI-kernel [Gärtner et al.2002], which defines a bag kernel as a sum of instance kernels and represents each bag by the minimum and maximum feature values of its instances. Along the same line, Zhou et al. [Zhou et al.2009] assumed that instances within a bag are not identically and independently distributed. Based on this assumption, they proposed another kernel function, which uses not only the similarity between pairwise instances but also the similarity between the neighborhoods of the corresponding instances in the two bags. Some other methods obtain the bag representation directly. For example, Wei et al. [Wei et al.2017] used a Fisher-vector transform, and Ilse et al. [Ilse et al.2018] applied attention-based pooling to instance representations to obtain the bag representation, which was then processed by a classifier to perform bag-level label prediction.
3 Preliminaries and Problem Statement
3.1 Multiple Instance Learning
Let $x$ and $y$ be the instance-level input and output random variables, where $\mathcal{X}$ and $\mathcal{Y} = \{-1, +1\}$ denote the spaces of $x$ and $y$, respectively. Let $X$ and $Y$ be the bag-level input and output random variables. In a typical multiple instance learning problem, we are given a set of bags together with their bag labels, drawn i.i.d. according to an unknown distribution, while the label of each individual instance is unknown. Let $\mathcal{B}^{+}$ and $\mathcal{B}^{-}$ denote the positive ($Y = +1$) and negative ($Y = -1$) bag sets, respectively. The standard MIL setting states that:
If a bag is negative, then all of its instances are negative.

If a bag is positive, then at least one of its instances is positive.
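Under the standard MIL setting, the mapping from instance labels to the bag label is deterministic; a minimal sketch in Python (used here purely for illustration, not part of the paper's method):

```python
def bag_label(instance_labels):
    """Standard MIL assumption: a bag is positive (+1) iff it contains
    at least one positive instance; instance labels are in {-1, +1}."""
    return 1 if any(y == 1 for y in instance_labels) else -1

print(bag_label([-1, -1, -1]))  # an all-negative bag is negative: -1
print(bag_label([-1, 1, -1]))   # one positive instance suffices: 1
```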
3.2 Instance-level Risk Minimization in MIL
In this work, we focus on instance-level label prediction and approach it from the view of risk minimization. Let $f$ denote a classifier and $\mathrm{sign}(f(x))$ the predicted label of an instance $x$. A loss function is a map $\ell : \mathbb{R} \times \mathcal{Y} \to [0, \infty)$. Given a loss function $\ell$ and a classifier $f$, we define the instance-level risk of $f$ by:
$R(f) = \mathbb{E}_{x,y}[\ell(f(x), y)]$   (1)
where, as notation used throughout this paper, $\mathbb{E}$ denotes expectation and its subscript indicates the random variables with respect to which the expectation is taken. When $\ell$ is the 0-1 loss, $R(f)$ is the usual Bayes risk:
$R_{0\text{-}1}(f) = \mathbb{E}_{x,y}[\mathbb{1}(\mathrm{sign}(f(x)) \neq y)]$   (2)
where $\mathbb{1}(\cdot)$ is the indicator function. In the risk minimization framework, the objective of this work is to learn a classifier $f$ that minimizes $R(f)$ under the bag constraints of MIL:
(3) 
The biggest challenge of the above optimization problem is that the labels of the instances are unknown. It is therefore impossible to estimate $R(f)$ with the supervised empirical risk over $n$ labeled instances:
$\hat{R}^{sup}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$   (4)
where the superscript $sup$ indicates that the risk is estimated in the supervised setting, in which instance labels are available. In addition, because the 0-1 loss is non-differentiable, it is necessary to replace it with an appropriate surrogate loss function. A common requirement for the surrogate is that it be differentiable; another critical requirement is that it be Bayes consistent, which we describe in the following section.
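For concreteness, the supervised empirical risk of Eq. (4) can be sketched as follows (a hypothetical illustration contrasting the 0-1 loss with a squared surrogate; not the paper's code):

```python
import numpy as np

def empirical_risk(scores, labels, loss):
    """Supervised empirical risk (Eq. (4)): the average loss over labeled
    instances; scores are real-valued outputs f(x), labels are in {-1, +1}."""
    return float(np.mean([loss(s, y) for s, y in zip(scores, labels)]))

zero_one = lambda s, y: float(np.sign(s) != y)   # non-differentiable 0-1 loss
squared = lambda s, y: (1.0 - y * s) ** 2        # differentiable surrogate

scores = np.array([0.8, -0.3, 0.2, -0.9])
labels = np.array([1, -1, -1, 1])
print(empirical_risk(scores, labels, zero_one))  # 0.5 (two of four misclassified)
```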
3.3 Bayes Consistent Loss Function
A surrogate loss function is said to be Bayes consistent [Tewari and Bartlett2007] if driving its risk toward the optimum also drives the Bayes risk toward its lower bound; that is, convergence of the surrogate risk to its infimum implies convergence of the 0-1 risk to the Bayes optimum. According to the work of Bartlett et al. [Bartlett et al.2006], a convex margin loss is Bayes consistent if it is differentiable at 0 with a negative derivative there, and any minimizer of the corresponding risk
yields a Bayes consistent classifier.
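As a concrete instance of this criterion, consider the margin form of the squared loss (an illustrative check, with $m$ denoting the margin):

```latex
\phi(m) = (1 - m)^2, \qquad m = y f(x),
\qquad \phi'(m) = -2(1 - m), \qquad \phi'(0) = -2 < 0 .
```

Since $\phi$ is convex and differentiable at $0$ with $\phi'(0) < 0$, the criterion of [Bartlett et al.2006] implies that the squared loss is Bayes consistent.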
4 Methodology
In this work, we assume that instances are independent and identically distributed (i.i.d.) and that the instance distribution is conditionally independent of the bag label given the instance label. Under this assumption, we have:
(5)  
(6)  
(7)  
(8) 
In the following, we show how, building on this assumption, the instance-level label prediction loss can be estimated without using instance labels.
4.1 Estimating the Risk without Instance Labels
In this section, we prove that the instance-level risk $R(f)$ can be unbiasedly estimated without using instance labels, as detailed in the following theorem.

Theorem 1. $R(f)$ can be unbiasedly estimated by:
(9) Here the superscript $un$ indicates that the risk is estimated in the unsupervised setting, in which instance labels are not available.
Proof.
The risk $R(f)$ can be reformulated as follows
and divided into two parts according to the bag label:
(10) 
According to Eq. (5), we have
(11) 
The second term on the right-hand side of Eq. (10) can be further formulated as:
(12) 
According to Eqs. (6) and (7), we have
thus:
(13) 
Because
we have
(14) 
Combining Eqs. (8), (10), (11), (13), and (14), we obtain
(15) 
Note that $\hat{R}^{un}(f)$ is an unbiased estimate of the right-hand side of Eq. (15). Therefore, $\hat{R}^{un}(f)$ is an unbiased estimate of $R(f)$, completing the proof. ∎
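The exact estimator appears in Eq. (9); the sketch below only illustrates, on synthetic data, the general prior-correction idea behind such estimators: when the negative-instance ratio is known, the risk on (unobserved) positive instances can be recovered from the pooled mixture of all instances and the instances of negative bags. The names and the simplified setting here are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population with hidden instance labels y in {-1, +1}.
n = 20000
y = rng.choice([-1, 1], size=n, p=[0.7, 0.3])   # 70% negative instances
f = np.tanh(y + rng.normal(0.0, 1.0, size=n))   # some fixed classifier score

def loss(margin):
    # a bounded squared loss on the margin y * f(x)
    return (1.0 - margin) ** 2 / 4.0

# Supervised risk: requires the hidden labels.
R_true = loss(f * y).mean()

# Label-free quantities: instances known to be negative (as in negative
# bags), the pooled mixture of all instances, and the negative ratio.
neg, mix, alpha = f[y == -1], f, float((y == -1).mean())

# Mixture correction:
#   E[l(f,+1) | y=+1] = (E_mix[l(f,+1)] - alpha * E_neg[l(f,+1)]) / (1 - alpha)
R_pos = (loss(mix).mean() - alpha * loss(neg).mean()) / (1.0 - alpha)
R_neg = loss(-neg).mean()
R_hat = alpha * R_neg + (1.0 - alpha) * R_pos

print(abs(R_hat - R_true))  # agrees up to floating-point error
```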
4.2 Consistency with Bounded Bayes Consistent Loss Function
In this section, we analyze the consistency between the estimated risk $\hat{R}^{un}(f)$ and $R(f)$ (all proofs appear in Appendix A, also available at https://drive.google.com/file/d/1MhvgHYtQo_9F2QcN0sGGNsK0lm6vBsEc/view?usp=sharing).
Let $L$ denote the Lipschitz constant of the loss function, and let $\mathcal{H}$ denote a Reproducing Kernel Hilbert Space (RKHS). For each given radius, we consider as the hypothesis space the corresponding ball in $\mathcal{H}$. Let the covering number of this ball be defined following Theorem C in [Cucker and Smale2002], and let $n$ denote the number of training instances. Then we have the following theorem.

Theorem 2. For a Bayes consistent loss function, if it is bounded, then for any $\epsilon > 0$,
(16) where and .
Remark 2.
Let us consider what happens if the loss function is unbounded or, more specifically, not upper bounded. Some loss terms enter the estimator $\hat{R}^{un}(f)$ with a negative sign, so with a loss that is not upper bounded, the classifier can heavily overfit individual examples to drive the estimated risk arbitrarily low. From this analysis, we can expect that, when an unbounded loss function and a flexible classifier are used, $\hat{R}^{un}(f)$ will decrease dramatically to a value far below zero. This is indeed observed in our experiments.
4.3 Formulating the Bag Constraints of MIL
Note that the above estimation of the instance-level risk does not take into account the composition of each bag, and the resulting classifier may not satisfy the bag constraints of MIL specified in Eq. (3). In this section, we show how to combine the risk estimation with the constraints of MIL from the risk view.
From the risk view, the constraints can be formulated as:
Therefore, the optimization problem of Eq. (3) can be approximated by:
(17) 
The above formulation assumes that the data are exactly separable by the classifier, which is not guaranteed in many cases. Even if a
sufficiently complicated classifier exactly separates the training bags, it usually suffers from severe overfitting and is susceptible to outliers. To regularize the above algorithm, we reformulate the optimization problem as:
(18) 
where $|\mathcal{B}|$ denotes the number of bags, and $\lambda$ is a hyperparameter that controls the relative weighting between the twin goals of making the estimated risk small and ensuring that the classifier satisfies the bag constraints of MIL. In this work, we set $\lambda$ to a fixed value.
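As one hypothetical way to instantiate the regularized objective (the hinge-style penalty below is our illustrative choice, not necessarily the paper's exact formulation in Eq. (18)):

```python
import numpy as np

def bag_penalty(scores, bag_label):
    """Hinge-style penalty for the MIL bag constraints:
    a negative bag should contain no positively scored instance;
    a positive bag should contain at least one."""
    s = np.asarray(scores, dtype=float)
    if bag_label == -1:
        return float(np.maximum(s, 0.0).sum())  # any positive score is penalized
    return float(max(0.0, -s.max()))            # penalized if no score is positive

def objective(risk_estimate, bags, bag_labels, lam=1.0):
    """Estimated instance-level risk plus lambda-weighted constraint penalties."""
    pen = sum(bag_penalty(b, y) for b, y in zip(bags, bag_labels))
    return risk_estimate + lam * pen / len(bags)

print(bag_penalty([-0.5, -0.2], -1))  # 0.0: a clean negative bag
print(bag_penalty([-0.5, -0.2], 1))   # 0.2: positive bag lacks a positive score
```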
5 Experiment
5.1 Choice of Loss Function
In this work, we consider the mean squared error (MSE) loss function and the cross entropy (CE) loss function, both applied to the sigmoid of the classifier output. One can check that the MSE loss is convex and differentiable at 0 with a negative derivative there; therefore, it is Bayes consistent. The same check shows that the CE loss is also Bayes consistent. In addition, the value range of the MSE loss is $[0, 1]$, while that of the CE loss is $[0, +\infty)$; therefore, the MSE loss is bounded while the CE loss is not. We apply the MSE loss unless otherwise stated.
5.2 Datasets
To perform the empirical study, we designed experiments on four commonly used datasets: MNIST [LeCun et al.1998], SVHN [Netzer et al.2011], CIFAR10 [Krizhevsky and Hinton2009], and 20Newsgroup [Lang1995]. Half of the classes of each dataset were mapped to the positive class and the other half to the negative class; for example, digits 0–4 of MNIST were mapped to the positive class and digits 5–9 to the negative class. See the Appendix for the label mapping details of each dataset. For each dataset, we generated a training bag set and a testing bag set, each containing 3,000 positive bags and 3,000 negative bags. Bag sizes were drawn from a uniform distribution over the integers 1 to 9. Each bag was generated by the following procedure: for each negative bag, we first sampled the bag size and then randomly sampled that many negative examples with replacement; for each positive bag, we first sampled the bag size and the number of positive examples, and then randomly sampled the corresponding numbers of positive and negative examples with replacement. Note that, to avoid overlap between the training and testing bag sets, we sampled examples only from the training data when constructing the training bag set and only from the testing data when constructing the testing bag set.
5.3 Compared Methods
To evaluate the performance of our proposed algorithm, called both bag- and instance-level multiple instance learning (BIMIL), we introduced one of its variants, referred to as IMIL, for comparison. IMIL only performs instance-level risk minimization and does not consider the bag constraints of MIL. In addition, we introduced several typical MIL methods from previously published work: three instance-level approaches, i.e., Diverse Density (DD) [Maron and LozanoPérez1998] with one subject, Multiple Instance Logistic Regression (MILR) [Ray and Craven2005], and miNet [Wang et al.2018]; and two bag-level approaches, i.e., miFV [Wei et al.2017] and neural-network-based multiple instance learning with gated attention (MIGA) [Ilse et al.2018]. For miFV, we flattened each image into a pixel vector as the instance representation, and used the TF-IDF vector as the instance representation for the 20Newsgroup data. Finally, we compared with the fully supervised model (Sup) trained on instance-level labels.
5.4 Implementation Detail
For the image datasets, i.e., MNIST, SVHN, and Cifar10, we implemented the classifier with two convolutional layers followed by two fully-connected layers. All of these layers except the last applied the ReLU activation function. For 20Newsgroup, each instance was represented by its top 4,000 TF-IDF features, and the classifier was implemented with two fully-connected layers with Softplus activation functions. Model parameters were all randomly initialized and updated using the RMSprop [Graves2013] optimizer with a learning rate of 1e-4. In the experiments, we assumed that the negative instance ratio is known unless otherwise stated. See Appendix B for detailed information on the model implementation.
5.5 Instance-level Results
Model  MNIST  SVHN  Cifar10  20Newsgroup 
Sup  1.2  8.4  20.4  7.9 
DD [1]  1.9  12.8  22.2  8.9 
MILR [2]  2.0  12.5  22.9  9.2 
miNet [3]  2.1  12.3  22.7  9.5 
BIMIL  1.7  10.2  21.3  8.0 
We first studied the effectiveness of our proposed risk estimation method, i.e., IMIL, by comparing its Bayes risk on test data with that of Sup. The results are depicted in Figure 1: the minimum instance-level Bayes risks achieved by IMIL were close to those of Sup on all four tested datasets. We further compared the instance-level performance of our proposed algorithm, BIMIL, with two instance-level baselines, DD and MILR. Table 1 shows the results of these models on the four tested datasets. BIMIL outperformed both DD and MILR for label prediction at the instance level and achieved results comparable to those of the fully supervised model Sup. These observations verify the effectiveness of our proposed instance-level risk estimation method under the assumptions of this work.
We then studied the influence of the loss function, mainly what happens when it is unbounded. To this end, we compared the performance of IMIL when using the mean squared error loss and the cross entropy loss, respectively. Figure 3 depicts the estimated risk and the Bayes risk over training epochs for these two loss functions on the SVHN and 20Newsgroup datasets. When the cross entropy loss was used, the risk on training data dropped quickly after a few epochs while that on testing data steadily increased, indicating that the model severely overfit the training data. In contrast, when the mean squared error loss was used, overfitting was significantly reduced and better instance-level classification performance was achieved on testing data. This is consistent with our analysis of the necessity of boundedness of the loss function in the "Consistency with Bounded Bayes Consistent Loss Function" section.
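The boundedness contrast between the two losses can be checked numerically (a small illustration following the description in Section 5.1; the exact parameterization is our assumption):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_loss(z, y01):
    """Squared error between the sigmoid output and the 0/1 label; in [0, 1]."""
    return (sigmoid(z) - y01) ** 2

def ce_loss(z, y01):
    """Cross entropy on the sigmoid output; unbounded for confident mistakes."""
    p = sigmoid(z)
    return -(y01 * math.log(p) + (1 - y01) * math.log(1 - p))

# A confidently wrong score (z = 10 for a negative example):
print(mse_loss(10.0, 0))  # close to 1, but never exceeds 1
print(ce_loss(10.0, 0))   # about 10, and grows roughly linearly with z
```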
Finally, we studied the influence of a biased value of the negative instance ratio on the proposed algorithm. According to our data generation process, the ratio lies between 0.5 and 0.9: the minimum value 0.5 occurs when every positive bag comprises positive instances only, and the maximum value 0.9 occurs when every positive bag contains only one positive instance. We therefore tried values ranging from 0.5 to 0.9 in steps of 0.05. Figure 2 depicts the performance of IMIL with different values of the ratio. We first observe that IMIL is quite robust around the true value. In addition, for some overestimated values, it even performs better than with the true value. Finally, overestimated values generally perform better than underestimated ones.
5.6 Bag-level Results
To evaluate the bag-level performance of the proposed algorithm, we compared it with the five published baselines and its variant IMIL. Table 2 shows the performance of these models. We first observe that BIMIL achieved results comparable to state-of-the-art MIL approaches for bag-level label prediction. We also observe that IMIL, which does not incorporate the bag constraints of MIL, achieved acceptable results compared with existing models, showing the applicability of our algorithm to bag-level label prediction. Moreover, comparing IMIL and BIMIL shows that incorporating the bag constraints brings further improvement at the bag level. Overall, the above experiments demonstrate the effectiveness of our proposed algorithm for label prediction at both the bag and instance levels.
Model  MNIST  SVHN  Cifar10  20Newsgroup 
Sup  2.8  14.7  28.7  10.8 
DD [1]  2.7  13.6  31.7  12.6 
MILR [2]  3.6  15.7  33.1  14.3 
miNet [3]  3.7  15.3  33.7  13.5 
miFV [4]  3.8  15.8  34.2  15.0 
MIGA [5]  3.1  15.1  32.4  11.4 
IMIL  2.9  15.7  31.8  13.3 
BIMIL  2.7  13.6  31.0  13.0 
6 Conclusion
This work proposes a novel multiple instance learning algorithm to address the instance-level label prediction problem in MIL. Unlike existing MIL approaches, the loss function of the proposed algorithm is defined specifically on instance-level label prediction without using instance labels. We theoretically prove the effectiveness of the algorithm when a bounded Bayes consistent loss function is used, under the i.i.d. assumption. Experimental studies on both image and text datasets show that the proposed algorithm achieves comparable performance for label prediction at both the bag and instance levels.
References
 [Andrews et al.2003] Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines for multipleinstance learning. In NIPS, pages 577–584, 2003.

 [Angelidis and Lapata2018] Stefanos Angelidis and Mirella Lapata. Multiple instance learning networks for fine-grained sentiment analysis. TACL, 6:17–31, 2018.
 [Bartlett et al.2006] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
 [Cheplygina et al.2015] Veronika Cheplygina, David MJ Tax, and Marco Loog. Multiple instance learning with bag dissimilarities. Pattern Recognition, 48(1):264–275, 2015.
 [Cucker and Smale2002] Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American mathematical society, 39(1):1–49, 2002.
 [Dietterich et al.1997] Thomas G Dietterich, Richard H Lathrop, and Tomás LozanoPérez. Solving the multiple instance problem with axisparallel rectangles. Artificial intelligence, 89(12):31–71, 1997.
 [Doran and Ray2014] Gary Doran and Soumya Ray. A theoretical and empirical analysis of support vector machine methods for multipleinstance classification. Machine Learning, 97(12):79–102, 2014.
 [Gärtner et al.2002] Thomas Gärtner, Peter A Flach, Adam Kowalczyk, and Alexander J Smola. Multiinstance kernels. In ICML, volume 2, pages 179–186, 2002.
 [Graves2013] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 [Ilse et al.2018] Maximilian Ilse, Jakub M Tomczak, and Max Welling. Attentionbased deep multiple instance learning. arXiv preprint arXiv:1802.04712, 2018.
 [Jia and Zhang2008] Yangqing Jia and Changshui Zhang. Instancelevel semisupervised multiple instance learning. In AAAI, pages 640–645, 2008.
 [Kandemir and Hamprecht2015] Melih Kandemir and Fred A Hamprecht. Computeraided diagnosis from weak supervision: a benchmarking study. Computerized medical imaging and graphics, 42:44–50, 2015.

 [Kotzias et al.2015] Dimitrios Kotzias, Misha Denil, Nando De Freitas, and Padhraic Smyth. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 597–606. ACM, 2015.
 [Krizhevsky and Hinton2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [Kwok and Cheung2007] James T Kwok and PakMing Cheung. Marginalized multiinstance kernels. In IJCAI, volume 7, pages 901–906, 2007.
 [Lang1995] Ken Lang. Newsweeder: Learning to filter netnews. In Machine Learning Proceedings 1995, pages 331–339. Elsevier, 1995.
 [LeCun et al.1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [Liu et al.2012] Guoqing Liu, Jianxin Wu, and ZH Zhou. Key instance detection in multiinstance learning. In Asian Conference on Machine Learning, pages 253–268, 2012.
 [Mandel and Ellis2008] Michael I Mandel and Daniel PW Ellis. Multipleinstance learning for music information retrieval. In ISMIR, pages 577–582, 2008.
 [Maron and LozanoPérez1998] Oded Maron and Tomás LozanoPérez. A framework for multipleinstance learning. In NIPS, pages 570–576, 1998.

 [Netzer et al.2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
 [Pappas and PopescuBelis2017] Nikolaos Pappas and Andrei PopescuBelis. Explicit document modeling through weighted multiple-instance learning. Journal of Artificial Intelligence Research, 58:591–626, 2017.
 [Pinheiro and Collobert2015] Pedro O Pinheiro and Ronan Collobert. From imagelevel to pixellevel labeling with convolutional networks. In CVPR, pages 1713–1721, 2015.
 [Quellec et al.2017] Gwenolé Quellec, Guy Cazuguel, Béatrice Cochener, and Mathieu Lamard. Multipleinstance learning for medical image and video analysis. IEEE reviews in biomedical engineering, 10:213–234, 2017.
 [Ray and Craven2005] Soumya Ray and Mark Craven. Supervised versus multiple instance learning: An empirical comparison. In ICML, pages 697–704. ACM, 2005.
 [Rosasco et al.2004] Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana, and Alessandro Verri. Are loss functions all the same? Neural Computation, 16(5):1063–1076, 2004.
 [RuizMunoz et al.2015] José Francisco RuizMunoz, Mauricio OrozcoAlzate, and Germán CastellanosDomínguez. Multiple instance learningbased birdsong classification using unsupervised recording segmentation. In IJCAI, pages 2632–2638, 2015.
 [Tewari and Bartlett2007] Ambuj Tewari and Peter L Bartlett. On the consistency of multiclass classification methods. JMLR, 8(May):1007–1025, 2007.
 [Wang and Zucker2000] Jun Wang and JeanDaniel Zucker. Solving multipleinstance problem: A lazy learning approach. 2000.
 [Wang et al.2018] Xinggang Wang, Yongluan Yan, Peng Tang, Xiang Bai, and Wenyu Liu. Revisiting multiple instance neural networks. Pattern Recognition, 74:15–24, 2018.
 [Wei et al.2017] XiuShen Wei, Jianxin Wu, and ZhiHua Zhou. Scalable algorithms for multiinstance learning. IEEE transactions on neural networks and learning systems, 28(4):975–987, 2017.
 [Wu et al.2015] Jiajun Wu, Yinan Yu, Chang Huang, and Kai Yu. Deep multiple instance learning for image classification and autoannotation. In CVPR, pages 3460–3469, 2015.
 [Zhang and Goldman2002] Qi Zhang and Sally A Goldman. Emdd: An improved multipleinstance learning technique. In NIPS, pages 1073–1080, 2002.
 [Zhou et al.2009] ZhiHua Zhou, YuYin Sun, and YuFeng Li. Multiinstance learning by treating instances as noniid samples. In ICML, pages 1249–1256. ACM, 2009.
Appendix A Proof of Theorem 1
Let $L$ denote the Lipschitz constant of the loss function, and let $\mathcal{H}$ denote a Reproducing Kernel Hilbert Space. For each given radius, we consider as the hypothesis space the corresponding ball in the RKHS $\mathcal{H}$. Let the covering number of this ball be defined following Theorem C in [Cucker and Smale2002], and let $n^{+}$ and $n^{-}$ denote the numbers of instances in the positive and negative bag sets, respectively. Then we have the following theorem.

For a Bayes consistent loss function , if it is bounded by , then for any ,
(19) where and .
Proof.
Let $\hat{R}(f)$ denote the empirical estimate of $R(f)$ from randomly labeled examples. Since the loss function is bounded, the relevant risk quantities are finite. According to the Lemma in [Rosasco et al.2004], we have:
(20) 
The empirical estimation error of can be written as:
(21) 
Thus,
(22) 
Let denote
According to Eq. 20, we have:
(23) 
Similarly, let denote
and denote
we have:
(24) 
and
(25) 
Therefore,
(26) 
The theorem follows replacing with . ∎
Appendix B Implementation Detail
Dataset  Positive  Negative
MNIST  0–4  5–9
SVHN  0–4  5–9
Cifar10
20Newsgroup
Table 3 shows classes mapped to the positive class and the negative class respectively in our experiments for multiple instance learning.
Table 4 and 5 show the model architectures used in our experiments on the three image datasets and the 20Newsgroup, respectively.
Layer  Operation  channels  width  height 
0  Input  1  28  28 
1  Conv (relu)  32  28  28 
2  Max pool  32  14  14 
3  Conv (relu)  32  14  14 
4  Max pool  32  7  7 
6  Dense (relu)  512  
7  Dropout  512  
8  Dense (sigmoid)  1 
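The spatial dimensions in Table 4 can be verified with a short shape walk (the kernel size and padding are hypothetical, since the table does not list them; any "same"-padded 3x3 convolution reproduces these shapes):

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Spatial size after a convolution (assumed 3x3, 'same'-padded)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28               # Layer 0: 1 x 28 x 28 input
size = conv_out(size)   # Layer 1: Conv (relu) -> 32 x 28 x 28
size //= 2              # Layer 2: Max pool -> 32 x 14 x 14
size = conv_out(size)   # Layer 3: Conv (relu) -> 32 x 14 x 14
size //= 2              # Layer 4: Max pool -> 32 x 7 x 7

flat = 32 * size * size  # flattened input to the 512-unit dense layer
print(size, flat)        # 7 1568
```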
Layer  Operation  dimension 
0  Input  4,096 
1  Dense (softplus)  256 
2  Dense (softplus)  64 
3  Dense (sigmoid)  1 