1 Motivation
The constant growth of data size and complexity in real-world problems has increasingly put strain on traditional modeling and classification techniques. Many assumptions cease to hold: it can no longer be expected that a complete set of training data is available for training at once; models fail to reflect information in complex data unless a prohibitively high number of parameters is employed; class labels cannot realistically be expected to be available for all samples; and, in particular, the common assumption that each sample is represented by a fixed-size vector no longer holds in many real-world problems.
Multiple instance learning (MIL) techniques address some of these concerns by allowing samples to be represented by an arbitrarily large set of fixed-size vectors instead of a single fixed-size vector. Any explicit ground truth information (e.g., class label) is assumed to be available on the (higher) level of samples but not on the (lower) level of instances. The aim is to utilize unknown patterns on the instance level to enable sample-level modeling and decision making. Note that MIL does not address the Representation Learning problem [3]. Instead, it aims at better utilization of information in cases when ground truth knowledge about a dataset may be granular and available only on various levels of abstraction.
From a practical point of view, MIL promises to i) save ground truth acquisition cost, since labels are needed on the sample level only, i.e., on higher level(s) of abstraction, ii) reveal patterns on the instance level based on the available sample-level ground truth information, and eventually iii) achieve high accuracy of models through better use of information present in data.
Despite significant progress in recent years, the current battery of MIL tools is still burdened with compromises. The existing models (see Section 2 for a brief discussion) leave room for more efficient utilization of information in samples and for a clearer formalism that provides easily interpretable models with higher accuracy. The goal of this paper is to provide a clean formalism bridging the gap between the MIL problem formulation and classification techniques of neural networks (NNs). This opens the door to applying the latest results in NNs to MIL problems.
2 Prior art on the multi-instance problem
The pioneering work [11] coined multiple-instance or multi-instance learning as a problem where each sample $b$ (called bag in the following) consists of a set of instances $x$, i.e., $b = \{x_i \in \mathcal{X} \mid i \in \{1, \ldots, |b|\}\}$, or equivalently $b = \{x_i\}_{i=1}^{|b|}$, and each instance $x$ can be attributed a label $y_x \in \{-1, +1\}$, but these instance-level labels are not known even in the training set. The sample $b$ is deemed positive if at least one of its instances has a positive label, i.e., the label of a sample $b$ is $y = \max_{x \in b} y_x$. Most approaches solving this definition of the MIL problem belong to the instance-space paradigm, in which the classifier $f$ is trained on the level of individual instances and the label of the bag $b$ is inferred as $\max_{x \in b} f(x)$. Examples of such methods include Diverse Density [17], EM-DD [23], MILBoost [22], and MISVM [2].
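The standard assumption and the instance-space inference rule described above can be illustrated with a minimal Python sketch. The function names and the toy scorer are hypothetical and serve only as an illustration; labels are assumed to take values in {-1, +1}.

```python
from typing import Callable, Sequence

Instance = Sequence[float]


def bag_label_from_instances(instance_labels: Sequence[int]) -> int:
    """Standard MI assumption: a bag is positive iff at least one instance is positive."""
    return max(instance_labels)


def instance_space_score(bag: Sequence[Instance],
                         f: Callable[[Instance], float]) -> float:
    """Instance-space inference: the bag score is the largest instance score."""
    return max(f(x) for x in bag)


def toy_scorer(x: Instance) -> float:
    """Hypothetical instance-level classifier, used only for illustration."""
    return x[0] - x[1]


toy_bag = [(0.1, 0.9), (0.8, 0.2), (0.4, 0.5)]
toy_score = instance_space_score(toy_bag, toy_scorer)  # driven by the second instance
```

Note that the bag score is determined by a single instance, which is why this paradigm needs an instance-level classifier even though only bag-level labels are observed.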
Later works (see reviews [1, 12]) have introduced different assumptions on the relationship between labels on the instance level and labels of bags, or even dropped the notion of instance-level labels and considered only labels on the level of bags, i.e., it is assumed that each bag $b$ has a corresponding label $y$, which is for simplicity assumed to be binary, i.e., $y \in \{-1, +1\}$ in the following. Most approaches solving this general definition of the problem follow either the bag-space paradigm and define a measure of distance (or kernel) between bags [14, 18, 13], or the embedded-space paradigm and define a transformation of the bag to a fixed-size vector [21, 6, 5].
Prior art on neural networks for MIL problems is scarce and aimed at the instance-space paradigm. Ref. [19] proposes a smooth approximation of the maximum pooling in the last neuron, applied to the output of the network before the pooling. Ref. [24] drops the requirement of smooth pooling and uses the maximum pooling function in the last neuron. Both approaches optimize the error function.
3 Neural network formalism
The proposed neural network formalism is intended for a general formulation of MIL problems introduced in [18]. It assumes a nonempty space $\mathcal{X}$ where instances live, with a set of all probability distributions $\mathcal{P}^{\mathcal{X}}$ on $\mathcal{X}$. Each bag corresponds to some probability distribution $p_b \in \mathcal{P}^{\mathcal{X}}$, with its instances being realizations of a random variable with distribution $p_b$. Each bag $b$ is therefore assumed to be a realization of a random variable distributed according to $P(p_b, y)$, where $y$ is the bag label. During the learning process each concrete bag $b$ is thus viewed as a realization of a random variable with probability distribution $p_b$ that can only be inferred from a set of instances $\{x \in b \mid x \sim p_b\}$ observed in data. The goal is to learn a discrimination function $f: \mathcal{B} \to \mathbb{R}$, where $\mathcal{B}$ is the set of all possible realizations of distributions $p \in \mathcal{P}^{\mathcal{X}}$, i.e., $\mathcal{B} = \{\{x_i\}_{i=1}^{l} \mid x_i \sim p,\ p \in \mathcal{P}^{\mathcal{X}},\ l \in \mathbb{N}\}$. This definition includes the original definition used in [11], but it also includes the general case where every instance can occur in positive and negative bags, but some instances are more frequent in one class.

The proposed formalism is based on the embedded-space paradigm, representing a bag $b$ in an $m$-dimensional Euclidean space $\mathbb{R}^m$ through a set of mappings

$$(\phi_1(b), \phi_2(b), \ldots, \phi_m(b)) \in \mathbb{R}^m \tag{1}$$

with $\phi_i: \mathcal{B} \to \mathbb{R}$. Many existing methods implement the embedding function as

$$\phi_i(b) = g\left(\{k(x, \theta_i)\}_{x \in b}\right) \tag{2}$$

where $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_0^{+}$ is a suitably chosen distance function, $g: \bigcup_{l=1}^{\infty} \mathbb{R}^l \to \mathbb{R}$ is the pooling function (e.g. minimum, mean or maximum), and finally $\Theta = \{\theta_i \in \mathcal{X} \mid i \in \{1, \ldots, m\}\}$ is the dictionary with instances as items. Prior art methods differ in the choice of aggregation function $g$, distance function $k$, and finally in the selection of dictionary items $\Theta$. A generalization was recently proposed in [6], defining $\phi$ using a distance function (or kernel) over the bags and a dictionary containing bags rather than instances. This generalization can be seen as a crude approximation of kernels over probability measures used in [18].
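A dictionary-based embedding of the form (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Euclidean distance, the two-item dictionary, and all function names are choices made here, not part of any of the cited methods.

```python
import math
from typing import Callable, List, Sequence

Instance = Sequence[float]


def euclidean(x: Instance, theta: Instance) -> float:
    """Distance between an instance and a dictionary item (assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, theta)))


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def embed_bag(bag: Sequence[Instance],
              dictionary: Sequence[Instance],
              distance: Callable[[Instance, Instance], float] = euclidean,
              pool: Callable[[Sequence[float]], float] = min) -> List[float]:
    """One coordinate per dictionary item: pool the distances of all instances to it."""
    return [pool([distance(x, theta) for x in bag]) for theta in dictionary]


bag = [(0.0, 0.0), (3.0, 4.0)]          # a bag with two instances
dictionary = [(0.0, 0.0), (3.0, 0.0)]   # m = 2 dictionary items
emb_min = embed_bag(bag, dictionary, pool=min)    # [0.0, 3.0]
emb_mean = embed_bag(bag, dictionary, pool=mean)  # [2.5, 3.5]
```

The bag is mapped to a vector whose length equals the dictionary size regardless of how many instances the bag contains, which is what allows a standard classifier to be applied afterwards.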
The computational model defined by (1) and (2) can be viewed as a neural network sketched in Figure 1. One (or more) lower layers implement a set of distance functions $\{k(x, \theta_i)\}_{i=1}^{m}$ (shown in Fig. 1 in vector form), projecting each instance $x$ of the bag $b$ from the input space $\mathcal{X}$ into $\mathbb{R}^m$. The pooling layer implementing the pooling function $g$ produces a single vector of the same dimension $m$. Finally, subsequent layers implement the classifier, which already uses a representation of the bag as a feature vector of fixed length $m$. The biggest advantage of this formalism is that with the right choice of pooling function (e.g. mean or maximum) all parameters of the embedding functions can be optimized by the standard back-propagation algorithm. Therefore the embedding at the instance level (layers before pooling) is effectively optimized while requiring labels only on the bag level. This mechanism identifies parts of the instance space with the largest differences between probability distributions generating instances in positive and negative bags with respect to the chosen pooling function. This is also the feature that most distinguishes the proposed formalism from most prior art, which typically optimizes embedding parameters regardless of the labels.
The choice of a pooling function depends on the type of the MIL problem. If the bag's label depends on a single instance, as is the case for the instance-space paradigm, then the maximum pooling function is appropriate, since its output also depends on a single instance. On the other hand, if a bag's label depends on properties of all instances, then the mean pooling function is appropriate, since its output depends on all instances and therefore characterizes the overall distribution.
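The difference between the two pooling functions can be seen on a toy example. The sketch below pools already-projected instances coordinate by coordinate; the numbers are made up purely for illustration.

```python
from typing import List, Sequence


def max_pool(rows: Sequence[Sequence[float]]) -> List[float]:
    """Coordinate-wise maximum over all instances of a bag."""
    return [max(column) for column in zip(*rows)]


def mean_pool(rows: Sequence[Sequence[float]]) -> List[float]:
    """Coordinate-wise mean over all instances of a bag."""
    return [sum(column) / len(column) for column in zip(*rows)]


# Three "background" instances, and the same bag with one additional
# instance that is large in the first coordinate.
background = [[0.1, 0.2], [0.2, 0.1], [0.1, 0.1]]
with_outlier = background + [[0.9, 0.1]]

max_before, max_after = max_pool(background), max_pool(with_outlier)
mean_before, mean_after = mean_pool(background), mean_pool(with_outlier)
```

Here the first coordinate of the max-pooled vector jumps to the value of the single added instance, while the mean-pooled vector shifts only in proportion to that instance's share of the bag.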
Remark: the key difference between the above approach and the prior art [24] is in performing pooling inside the network as opposed to after the last neuron or layer as in the cited reference. This difference is key to the shift from the instance-centric modeling of prior art to the bag-centric modeling advocated here. However, the proposed formalism is general and includes [24] as a special case, where instances are projected into the space of dimension one ($m = 1$), the pooling function is set to maximum, and layers after the pooling function are not present (i.e., they are equal to identity).
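A forward pass through a network with pooling placed inside it might look as follows. This is a minimal sketch, assuming one ReLU layer before pooling and one linear unit after it; all parameter names and values are hypothetical, and training by back-propagation is not shown.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def relu(value: float) -> float:
    return max(0.0, value)


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def project_instance(x: Vector, weights: Sequence[Vector], biases: Vector) -> List[float]:
    """Instance-level layer: one ReLU unit per embedding dimension."""
    return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def mil_forward(bag: Sequence[Vector], weights: Sequence[Vector], biases: Vector,
                out_weights: Vector, out_bias: float,
                pool: Callable[[Sequence[float]], float]) -> float:
    """Project every instance, pool per dimension, then apply a linear output unit."""
    projected = [project_instance(x, weights, biases) for x in bag]
    pooled = [pool(list(column)) for column in zip(*projected)]
    return sum(v * p for v, p in zip(out_weights, pooled)) + out_bias


bag = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
W, c = [(1.0, -1.0), (0.5, 0.5)], (0.0, -0.5)   # m = 2 units before pooling
v, v0 = (1.0, -1.0), 0.1                        # linear unit after pooling

mean_score = mil_forward(bag, W, c, v, v0, pool=mean)
max_score = mil_forward(bag, W, c, v, v0, pool=max)

# Special case in the spirit of the remark: m = 1, maximum pooling,
# and an identity mapping after the pooling.
special_score = mil_forward(bag, [(1.0, -1.0)], (0.0,), (1.0,), 0.0, pool=max)
```

In the special case the bag score is simply the largest single-unit output over the instances, so the layers after pooling contribute nothing.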
4 Experimental evaluation
The evaluation of the proposed formalism uses publicly available datasets from a recent study of properties of MIL problems [8], namely BrownCreeper, CorelAfrican, CorelBeach, Elephant, Fox, Musk1, Musk2, Mutagenesis1, Mutagenesis2, Newsgroups1, Newsgroups2, Newsgroups3, Protein, Tiger, UCSBBreastCancer, Web1, Web2, Web3, Web4, and WinterWren. The supplemental material [9] contains the equal error rate (EER) of 28 MIL classifiers (and their variants) from prior art implemented in the MIL MATLAB toolbox [20], together with the exact experimental protocol and indexes of all splits in 5-times repeated 10-fold cross-validation. Therefore the experimental protocol has been exactly reproduced and results from [9] are used in the comparison to prior art.
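For readers unfamiliar with the metric, the equal error rate is the error at the decision threshold where the false positive rate equals the false negative rate. The sketch below is one simple way to estimate it from bag scores; it is an assumption of this illustration and not necessarily the exact procedure used in [9]. When the two rates never coincide exactly, it reports their average at the threshold where they are closest.

```python
from typing import Sequence


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Estimate the EER by sweeping the threshold over all observed scores.

    Bags with score >= threshold are predicted positive; labels are +1 / -1.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    best_gap, eer = None, None
    for threshold in sorted(set(scores)) + [float("inf")]:
        fnr = sum(s < threshold for s in pos) / len(pos)   # positives rejected
        fpr = sum(s >= threshold for s in neg) / len(neg)  # negatives accepted
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


example_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
example_labels = [1, 1, 1, -1, -1, -1]
example_eer = equal_error_rate(example_scores, example_labels)
```

In the example one positive bag scores below one negative bag, so at the balancing threshold one of three positives and one of three negatives are misclassified.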
The proposed formalism has been compared to those algorithms from prior art that have achieved the lowest error on at least one dataset. This selection yielded 14 classifiers for 20 test problems, which demonstrates the diversity of MIL problems and the difficulty of choosing a suitable method. Selected algorithms include representatives of the instance-space paradigm: MILBoost [22], SimpleMIL, MISVM [2] with Gaussian and polynomial kernels, and prior art in neural networks (denoted prior NN) [24]; the bag-space paradigm: nearest neighbor with citation distance [21] using 5 nearest neighbors; and finally the embedded-space paradigm: MILES [5] with Gaussian kernel, Bag dissimilarity [6] with minmin, meanmin, meanmean, Hausdorff, and Earth mover's distance (EMD), covcoef [9], embedding bags by calculating covariances of all pairs of features over the bag, and finally extremes and mean, embedding bags by using extreme and mean values of each feature over instances of the bag. All embedded-space paradigm methods except MILES used a logistic regression classifier.
The proposed MIL neural network consists of a single layer of rectified linear units (ReLU) [15] with transfer function $\max\{0, x\}$, followed by a mean-pooling layer and a single linear output unit. The training minimized a hinge loss function using the Adam [16] variant of the stochastic gradient descent algorithm with a minibatch of size 100, a maximum of 10 000 iterations, and default settings. L1 regularization on the weights of the network was used to decrease overfitting. The topology had two parameters: the number of neurons in the first layer, defining the dimension of the bag representation, and the strength of the L1 regularization. Suitable parameters were found by estimating equal error rates by five-fold cross-validation (on training samples) on all combinations of candidate values of the two parameters and using the combination achieving the lowest error. The prior art of [24] was implemented and optimized exactly as the proposed approach, with the difference that the max pooling layer was placed
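after the last linear output unit.

Table 1: Equal error rate (EER, in %) of the proposed NN on the training and testing sets, and of the best prior-art algorithm on each dataset.

| Dataset | NN error, training set | NN error, testing set | Prior art error | Prior art algorithm |
|---|---|---|---|---|
| BrownCreeper | 0 | 5.0 | 11.2 | MILBoost |
| CorelAfrican | 2.6 | 5.5 | 11.2 | minmin |
| CorelBeach | 0.2 | 1.2 | 17 | extremes |
| Elephant | 0 | 13.8 | 16.2 | minmin |
| Fox | 0.4 | 33.7 | 36.1 | meanmin |
| Musk1 | 0 | 17.5 | 12.8 | Citation |
| Musk2 | 0 | 11.4 | 11.8 | Hausdorff |
| Mutagenesis1 | 7.5 | 11.8 | 16.9 | covcoef |
| Mutagenesis2 | 14.9 | 10.0 | 17.2 | emd |
| Newsgroups1 | 0 | 42.5 | 18.4 | meanmean |
| Newsgroups2 | 0 | 35 | 27.5 | prior NN |
| Newsgroups3 | 0 | 37.5 | 31.2 | meanmean |
| Protein | 2.5 | 7.5 | 15.5 | minmin |
| Tiger | 0 | 20.0 | 19 | MILES |
| UCSBBreastCancer | 0 | 25 | 13.6 | MISVM g |
| Web1 | 0 | 40.6 | 20.9 | MILES |
| Web2 | 0 | 28.1 | 7.1 | MISVM p |
| Web3 | 0 | 25 | 13.6 | MISVM g |
| Web4 | 0 | 18.8 | 1.5 | meaninst |
| WinterWren | 0 | 5.9 | 2.1 | emd |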
Figure 2 summarizes the results in a critical difference diagram [10] showing the average rank of each classifier over the problems together with the confidence interval of the corrected Bonferroni-Dunn test with significance 0.05, testing whether two classifiers have equal performance. The critical diagram reveals that the classifier implemented using the proposed neural net formalism (caption proposed NN) achieved overall the best performance, having the average rank 4.3. In fact, Table 1 shows that it provides the lowest error on nine out of 20 problems. Note that the second best, Bag dissimilarity [6] with minmin distance and prior art in NN [24], achieved the average rank 6.4 and were the best only on three and one problems, respectively.
The exact values of EER of the best algorithm from the prior art and of the proposed NN formalism are summarized in Table 1. The results show that the proposed neural network formalism scored poorly on problems with a large dimension and a small number of samples, namely Newsgroups and Web (see Table 1 of [7] for details on the data). On these problems the neural network easily overfit to the training data, which is supported by zero errors on the training sets.
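The count of nine problems can be checked directly against the testing-set errors reported in Table 1. The short Python sketch below copies the values from the table (proposed NN testing error first, best prior-art error second) and counts the datasets where the proposed NN is strictly lower.

```python
# (NN testing-set error, best prior-art error), copied from Table 1.
table1 = {
    "BrownCreeper": (5.0, 11.2), "CorelAfrican": (5.5, 11.2),
    "CorelBeach": (1.2, 17.0), "Elephant": (13.8, 16.2),
    "Fox": (33.7, 36.1), "Musk1": (17.5, 12.8),
    "Musk2": (11.4, 11.8), "Mutagenesis1": (11.8, 16.9),
    "Mutagenesis2": (10.0, 17.2), "Newsgroups1": (42.5, 18.4),
    "Newsgroups2": (35.0, 27.5), "Newsgroups3": (37.5, 31.2),
    "Protein": (7.5, 15.5), "Tiger": (20.0, 19.0),
    "UCSBBreastCancer": (25.0, 13.6), "Web1": (40.6, 20.9),
    "Web2": (28.1, 7.1), "Web3": (25.0, 13.6),
    "Web4": (18.8, 1.5), "WinterWren": (5.9, 2.1),
}

# Datasets where the proposed NN has a strictly lower error than the best prior art.
nn_wins = [name for name, (nn_err, prior_err) in table1.items() if nn_err < prior_err]
print(len(nn_wins), "of", len(table1))  # prints "9 of 20"
```

All Newsgroups and Web datasets fall outside this list, in line with the overfitting on high-dimensional, small-sample problems noted in the discussion of Table 1.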
5 Conclusion
This work has presented a generalization of neural networks to multi-instance problems. Unlike the prior art, the proposed formalism embeds samples consisting of multiple instances into a vector space, enabling subsequent use with standard decision-making techniques. The key advantage of the proposed solution is that it simultaneously optimizes the classifier and the embedding. This advantage was illustrated on a set of real-world examples, comparing results to a large number of algorithms from the prior art. The proposed formalism seems to outperform the majority of standard MIL methods in terms of accuracy. It should be stressed, though, that results were compared to those published by the authors of survey benchmarks; not all methods in the referred tests may have been set up in the best possible way. However, as tuning in many such cases would be very computationally expensive, the proposed formalism becomes competitive also due to its relatively modest computational complexity, which does not exceed that of a standard 3-layer neural network. The proposed formalism opens up a variety of options for further development. A better and possibly more automated choice of pooling functions is one of the promising ways to improve performance on some types of data.
References
[1] J. Amores. Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence, 201:81–105, 2013.
[2] S. Andrews, I. Tsochantaridis, T. Hofmann. Support vector machines for multiple-instance learning. In S. Becker, S. Thrun, K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 577–584. MIT Press, 2003.
[3] Y. Bengio, A. Courville, P. Vincent. Representation learning: A review and new perspectives. arXiv preprint arXiv:1206.5538v2, 2012.
[4] M.-A. Carbonneau, V. Cheplygina, E. Granger, G. Gagnon. Multiple instance learning: A survey of problem characteristics and applications. arXiv preprint arXiv:1612.03365, 2016.
[5] Y. Chen, J. Bi, J. Z. Wang. MILES: Multiple-instance learning via embedded instance selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):1931–1947, Dec 2006.
[6] V. Cheplygina, D. M. Tax, M. Loog. Multiple instance learning with bag dissimilarities. Pattern Recognition, 48(1):264–275, 2015.
[7] V. Cheplygina, D. M. J. Tax. Characterizing Multiple Instance Datasets, pages 15–27. Springer International Publishing, Cham, 2015.
[8] V. Cheplygina, D. M. J. Tax. Similarity-Based Pattern Recognition: Third International Workshop, SIMBAD 2015, Copenhagen, Denmark, October 12-14, 2015. Proceedings, chapter Characterizing Multiple Instance Datasets, pages 15–27. Springer International Publishing, Cham, 2015.
[9] V. Cheplygina, D. M. J. Tax, M. Loog. Supplemental documents to characterizing multiple instance datasets.
[10] J. Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30, 2006.
[11] T. G. Dietterich, R. H. Lathrop, T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1):31–71, 1997.
[12] J. Foulds, E. Frank. A review of multi-instance learning assumptions. The Knowledge Engineering Review, 25(01):1–25, 2010.
[13] T. Gärtner, P. A. Flach, A. Kowalczyk, A. J. Smola. Multi-instance kernels. In ICML, volume 2, pages 179–186, 2002.
[14] D. Haussler. Convolution kernels on discrete structures. Technical report, University of California at Santa Cruz, 1999.
[15] K. Jarrett, K. Kavukcuoglu, M. Ranzato, Y. LeCun. What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on, pages 2146–2153, Sept 2009.
[16] D. Kingma, J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[17] O. Maron, T. Lozano-Pérez. A framework for multiple-instance learning. In M. I. Jordan, M. J. Kearns, S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 570–576. MIT Press, 1998.
[18] K. Muandet, K. Fukumizu, F. Dinuzzo, B. Schölkopf. Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems, pages 10–18, 2012.
[19] J. Ramon, L. De Raedt. Multi instance neural networks. 2000.
[20] D. M. J. Tax, V. Cheplygina. MIL, a Matlab toolbox for multiple instance learning, Jun 2016. Version 1.2.1.
[21] J. Wang, J.-D. Zucker. Solving the multiple-instance problem: A lazy learning approach. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pages 1119–1126, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
[22] C. Zhang, J. C. Platt, P. A. Viola. Multiple instance boosting for object detection. In Y. Weiss, B. Schölkopf, J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1417–1424. MIT Press, 2006.
[23] Q. Zhang, S. A. Goldman. EM-DD: An improved multiple-instance learning technique. In T. G. Dietterich, S. Becker, Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1073–1080. MIT Press, 2002.
[24] Z.-H. Zhou, M.-L. Zhang. Neural networks for multi-instance learning. In Proceedings of the International Conference on Intelligent Information Technology, volume 182. Citeseer, 2002.