The constant growth of data sizes and data complexity in real-world problems has increasingly put strain on traditional modeling and classification techniques. Many assumptions cease to hold: it can no longer be expected that a complete set of training data is available for training at once; models fail to reflect information in complex data unless a prohibitively high number of parameters is employed; the availability of class labels for all samples cannot be realistically expected; and, in particular, the common assumption that each sample is represented by a fixed-size vector no longer seems to hold in many real-world problems.
Multiple instance learning (MIL) techniques address some of these concerns by allowing samples to be represented by an arbitrarily large set of fixed-size vectors instead of a single fixed-size vector. Any explicit ground truth information (e.g., class label) is assumed to be available on the (higher) level of samples but not on the (lower) level of instances. The aim is to utilize unknown patterns on the instance level to enable sample-level modeling and decision making. Note that MIL does not address the representation learning problem. Instead, it aims at better utilization of information in cases when ground truth knowledge about a dataset may be granular and available on various levels of abstraction only.
From a practical point of view MIL promises to i) save ground truth acquisition cost – labels are needed on sample-level, i.e., on higher-level(s) of abstraction only, ii) reveal patterns on instance level based on the available sample-level ground truth information, and eventually iii) achieve high accuracy of models through better use of information present in data.
Despite significant progress in recent years, the current battery of MIL tools is still burdened with compromises. The existing models (see Section 2 for a brief discussion) clearly leave room for more efficient utilization of information in samples and for a clearer formalism that provides easily interpretable models with higher accuracy. The goal of this paper is to provide a clean formalism bridging the gap between the MIL problem formulation and the classification techniques of neural networks (NNs). This opens the door to applying the latest results in NNs to MIL problems.
2 Prior art on multi-instance problem
The pioneering work coined multiple-instance or multi-instance learning as a problem where each sample $b$ (called a bag in the following) consists of a set of instances $x$, i.e., $b = \{x_i \mid i \in \{1, \ldots, |b|\}\}$, equivalently $b = \{x_i\}_{i=1}^{|b|}$, and each instance $x$ can be attributed a label $y_x$, but these instance-level labels are not known even in the training set. The sample $b$ is deemed positive if at least one of its instances has a positive label, i.e., the label of a sample $b$ is $y = \max_{x \in b} y_x$. Most approaches solving this definition of the MIL problem belong to the instance-space paradigm, in which the classifier is trained on the level of individual instances and the label of the bag $b$ is inferred as $\max_{x \in b} \hat{y}_x$, where $\hat{y}_x$ is the output of the classifier on instance $x$. Examples of such methods include Diverse-density, EM-DD, MILBoost, and MI-SVM.
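The instance-space inference described above can be illustrated with a minimal sketch. The type aliases, the function names, and the toy classifier are hypothetical and serve only the illustration; labels are assumed to be encoded so that a positive label is numerically larger than a negative one.

```python
from typing import Callable, Sequence

Instance = Sequence[float]
Bag = Sequence[Instance]


def bag_label_from_instances(instance_labels: Sequence[int]) -> int:
    """Standard MIL assumption: the bag is positive iff any instance is positive."""
    return max(instance_labels)


def predict_bag(bag: Bag, instance_classifier: Callable[[Instance], float]) -> float:
    """Instance-space inference: score every instance and keep the largest score."""
    return max(instance_classifier(x) for x in bag)


def toy_classifier(x: Instance) -> float:
    """Hypothetical instance classifier: positive when the first feature exceeds 1."""
    return 1.0 if x[0] > 1.0 else -1.0


positive_bag = [[0.2, 0.1], [1.5, 0.3], [0.4, 0.9]]  # one instance triggers the label
negative_bag = [[0.2, 0.1], [0.4, 0.9]]              # no instance triggers the label
```

Note that a single instance decides the outcome, which is why methods of this paradigm are sensitive to the quality of the instance-level classifier.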
Later works (see reviews [1, 12]) have introduced different assumptions on relationships between labels on the instance level and labels of bags, or even dropped the notion of instance-level labels and considered only labels on the level of bags, i.e., it is assumed that each bag has a corresponding label, which is for simplicity assumed to be binary in the following. Most approaches solving this general definition of the problem follow either the bag-space paradigm and define a measure of distance (or kernel) between bags [14, 18, 13] or the embedded-space paradigm and define a transformation of the bag to a fixed-size vector [21, 6, 5].
Prior art on neural networks for MIL problems is scarce and targets the instance-space paradigm. One approach proposes a smooth approximation of the maximum pooling in the last neuron, applied to the output of the network before the pooling. Another drops the requirement of smooth pooling and uses the maximum pooling function in the last neuron. Both approaches optimize the error function.
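To make the difference between hard and smooth maximum pooling concrete, the sketch below compares the plain maximum with a log-sum-exp approximation. The log-sum-exp form and the `sharpness` parameter are assumptions chosen for illustration; they are a common smooth surrogate of the maximum and not necessarily the exact expression used in the prior art.

```python
import math
from typing import Sequence


def max_pool(scores: Sequence[float]) -> float:
    """Hard maximum over the per-instance outputs of the network."""
    return max(scores)


def smooth_max_pool(scores: Sequence[float], sharpness: float = 1.0) -> float:
    """Log-sum-exp surrogate: log(sum(exp(sharpness * s))) / sharpness.

    The largest score is subtracted before exponentiation for numerical
    stability; the result is always >= max(scores) and approaches it as
    sharpness grows.
    """
    largest = max(scores)
    total = sum(math.exp(sharpness * (s - largest)) for s in scores)
    return largest + math.log(total) / sharpness


instance_outputs = [-0.5, 0.1, 2.0]
loose = smooth_max_pool(instance_outputs, sharpness=1.0)
tight = smooth_max_pool(instance_outputs, sharpness=50.0)
```

Unlike the hard maximum, the smooth version passes a non-zero gradient to every instance, which is the usual motivation for such approximations.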
3 Neural network formalism
The proposed neural network formalism is intended for a general formulation of MIL problems. It assumes a non-empty space $\mathcal{X}$ where instances live, together with the set $\mathcal{P}^{\mathcal{X}}$ of all probability distributions on $\mathcal{X}$. Each bag corresponds to some probability distribution $p_b \in \mathcal{P}^{\mathcal{X}}$, with its instances being realizations of a random variable with distribution $p_b$. Each bag $b$ is therefore assumed to be a realization of a random variable distributed according to $P(p_b, y)$, where $y$ is the bag label. During the learning process each concrete bag $b$ is thus viewed as a realization of a random variable with probability distribution $p_b$ that can only be inferred from the set of instances observed in data. The goal is to learn a discrimination function $f$ defined on $\mathcal{B}$, where $\mathcal{B}$ is the set of all possible realizations of distributions $p \in \mathcal{P}^{\mathcal{X}}$, i.e., $\mathcal{B} = \{\{x_i\}_{i=1}^{l} \mid p \in \mathcal{P}^{\mathcal{X}},\ x_i \sim p,\ l \in \mathbb{N}\}$. This definition includes the original formulation of the MIL problem, but it also includes the general case where every instance can occur in both positive and negative bags, with some instances being more frequent in one class.
The proposed formalism is based on the embedded-space paradigm, representing a bag $b$ in an $m$-dimensional Euclidean space $\mathbb{R}^m$ through a set of mappings
$$\phi(b) = \left(\phi_1(b), \phi_2(b), \ldots, \phi_m(b)\right) \tag{1}$$
with each $\phi_i$ mapping a bag to a real number. Many existing methods implement the embedding function as
$$\phi_i(b) = g\left(\{k(x, \theta_i)\}_{x \in b}\right) \tag{2}$$
where $k$ is a suitably chosen distance function on the instance space $\mathcal{X}$, $g$ is the pooling function (e.g., minimum, mean, or maximum), and finally $\Theta = \{\theta_1, \ldots, \theta_m\}$ is the dictionary with instances as items. Prior-art methods differ in the choice of the aggregation function $g$, the distance function $k$, and finally in the selection of the dictionary items $\theta_i \in \Theta$. A recently proposed generalization defines $\phi$ using a distance function (or kernel) over the bags and a dictionary $\Theta$ containing bags rather than instances. This generalization can be seen as a crude approximation of kernels over probability measures.
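A dictionary-based embedding of this kind can be sketched in a few lines. The Euclidean distance, the function names, and the toy bag and dictionary below are assumptions made for illustration; any distance and any of the pooling functions named above can be substituted.

```python
import math
from typing import Callable, List, Sequence

Instance = Sequence[float]


def euclidean(x: Instance, theta: Instance) -> float:
    """One possible distance between an instance and a dictionary item."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, theta)))


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def embed_bag(
    bag: Sequence[Instance],
    dictionary: Sequence[Instance],
    distance: Callable[[Instance, Instance], float] = euclidean,
    pool: Callable[[Sequence[float]], float] = min,
) -> List[float]:
    """Map a bag of any size to one pooled distance per dictionary item."""
    return [pool([distance(x, theta) for x in bag]) for theta in dictionary]


toy_bag = [[0.0, 0.0], [3.0, 4.0]]
toy_dictionary = [[0.0, 0.0], [3.0, 0.0]]

min_embedding = embed_bag(toy_bag, toy_dictionary, pool=min)    # [0.0, 3.0]
max_embedding = embed_bag(toy_bag, toy_dictionary, pool=max)    # [5.0, 4.0]
mean_embedding = embed_bag(toy_bag, toy_dictionary, pool=mean)  # [2.5, 3.5]
```

The length of the output equals the size of the dictionary regardless of how many instances the bag holds, which is what makes a standard fixed-input classifier applicable afterwards.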
The computational model defined by (1) and (2) can be viewed as a neural network sketched in Figure 1. One (or more) lower layers implement a set of distance functions (shown in Fig. 1 in vector form) projecting each instance of the bag from the input space into the space of the bag representation. The pooling layer implementing the pooling function produces a single vector of the same dimension. Finally, subsequent layers implement the classifier, which already uses a representation of the bag as a feature vector of fixed length. The biggest advantage of this formalism is that, with the right choice of pooling function (e.g., mean or maximum), all parameters of the embedding functions can be optimized by the standard back-propagation algorithm. Therefore, the embedding at the instance level (layers before pooling) is effectively optimized while requiring labels only on the bag level. This mechanism identifies parts of the instance space with the largest differences between the probability distributions generating instances in positive and negative bags with respect to the chosen pooling function. This is also the feature that most differentiates the proposed formalism from most prior art, which typically optimizes embedding parameters regardless of the labels.
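The forward pass of such a network can be sketched as follows. This is only an illustration under stated assumptions: the rectified linear activation in the instance layer, the function names, and all weight values are hypothetical, and training by back-propagation is omitted.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def instance_layer(x: Vector, weights: Sequence[Vector], biases: Vector) -> List[float]:
    """Project one instance into the bag-representation space (ReLU is an assumption)."""
    return [
        max(0.0, sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
        for w, b in zip(weights, biases)
    ]


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean over all projected instances of a bag."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def max_pool(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise maximum over all projected instances of a bag."""
    return [max(column) for column in zip(*vectors)]


def forward(
    bag: Sequence[Vector],
    weights: Sequence[Vector],
    biases: Vector,
    out_weights: Vector,
    out_bias: float,
    pool: Callable[[Sequence[Vector]], List[float]] = mean_pool,
) -> float:
    """Instance layer, then pooling inside the network, then a linear classifier."""
    projected = [instance_layer(x, weights, biases) for x in bag]
    bag_vector = pool(projected)  # fixed length, independent of the bag size
    return sum(w * v for w, v in zip(out_weights, bag_vector)) + out_bias


# Hypothetical parameters: two input features, three neurons in the instance layer.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [0.0, 0.0, -1.0]
OUT_W = [1.0, -1.0, 0.5]

small_bag = [[1.0, 2.0], [3.0, 0.0]]
score = forward(small_bag, W, B, OUT_W, 0.0)
```

Because both pooling functions are (sub)differentiable, the bag-level error can be propagated through the pooling step to the instance-layer weights, which is the property the paragraph above relies on.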
The choice of a pooling function depends on the type of the MIL problem. If the bag’s label depends on a single instance, as it is the case for the instance-level paradigm, then the maximum pooling function is appropriate, since its output also depends on a single instance. On the other hand if a bag’s label depends on properties of all instances, then the mean pooling function is appropriate, since its output depends on all instances and therefore it characterizes the overall distribution.
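A small numerical illustration of this choice, using made-up per-instance values: maximum pooling ranks a bag with one strong instance highest, whereas mean pooling prefers a bag whose instances are all moderately strong.

```python
from statistics import fmean

# Hypothetical outputs of one neuron for the instances of two bags.
scores_one_strong = [0.1, 0.1, 0.9]    # a single instance stands out
scores_all_moderate = [0.4, 0.4, 0.4]  # every instance contributes equally

# Maximum pooling depends on a single instance of each bag.
max_prefers_one_strong = max(scores_one_strong) > max(scores_all_moderate)

# Mean pooling depends on all instances, so the ordering of the bags flips.
mean_prefers_all_moderate = fmean(scores_all_moderate) > fmean(scores_one_strong)
```

The two pooling functions therefore order the same pair of bags differently, which is why the pooling function should match the assumed relation between instances and the bag label.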
Remark: the key difference of the above approach from the prior art is that pooling is performed inside the network, as opposed to after the last neuron or layer as in the prior art. This difference is key to the shift from the instance-centric modeling of prior art to the bag-centric modeling advocated here. However, the proposed formalism is general and includes the prior art as a special case, where instances are projected into a space of dimension one, the pooling function is set to maximum, and layers after the pooling function are not present (the classifier after pooling is equal to identity).
4 Experimental evaluation
The evaluation of the proposed formalism uses publicly available datasets from a recent study of properties of MIL problems, namely BrownCreeper, CorelAfrican, CorelBeach, Elephant, Fox, Musk1, Musk2, Mutagenesis1, Mutagenesis2, Newsgroups1, Newsgroups2, Newsgroups3, Protein, Tiger, UCSBBreastCancer, Web1, Web2, Web3, Web4, and WinterWren. The supplemental material of that study contains the equal error rate (EER) of 28 MIL classifiers (and their variants) from prior art implemented in the MIL Matlab toolbox, together with the exact experimental protocol and the indexes of all splits in 5-times repeated 10-fold cross-validation. Therefore, the experimental protocol has been exactly reproduced, and the results from the supplemental material are used in the comparison to prior art.
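For readers unfamiliar with the metric, the sketch below estimates the equal error rate from bag scores by sweeping a decision threshold. It is a simple approximation under stated assumptions: positive bags carry labels greater than zero, and the EER is taken at the threshold where the false negative and false positive rates are closest. The toolbox used in the cited study may interpolate differently.

```python
from typing import Sequence


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate EER: error rate at the threshold where FNR and FPR are closest."""
    positives = [s for s, y in zip(scores, labels) if y > 0]
    negatives = [s for s, y in zip(scores, labels) if y <= 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    # Candidate thresholds: every observed score plus one above the maximum.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        fnr = sum(s < t for s in positives) / len(positives)   # missed positives
        fpr = sum(s >= t for s in negatives) / len(negatives)  # false alarms
        gap = abs(fnr - fpr)
        if gap < best_gap:
            best_gap, best_eer = gap, (fnr + fpr) / 2.0
    return best_eer


separable = equal_error_rate([0.9, 0.8, 0.1, 0.2], [1, 1, -1, -1])
overlapping = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, -1, -1])
```

In the first toy case the classes are perfectly separated, so the estimate is zero; in the second, one positive and one negative bag are swapped, giving an estimate of one half.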
The proposed formalism has been compared to those algorithms from prior art that have achieved the lowest error on at least one dataset. This selection yielded 14 classifiers for 20 test problems, which demonstrates the diversity of MIL problems and the difficulty of choosing a suitable method. Selected algorithms include representatives of the instance-space paradigm: MILBoost, SimpleMIL, MI-SVM with Gaussian and polynomial kernels, and prior art in neural networks (denoted prior NN); of the bag-space paradigm: $k$-nearest neighbor with citation distance using 5 nearest neighbors; and finally of the embedded-space paradigm: Miles with Gaussian kernel, Bag dissimilarity with minmin, meanmin, meanmean, Hausdorff, and Earth-moving distance (EMD), cov-coef, embedding bags by calculating covariances of all pairs of features over the bag, and finally extremes and mean, embedding bags by using extreme and mean values of each feature over instances of the bag. All embedded-space paradigm methods except Miles used a logistic regression classifier.
The evaluated network consisted of a first layer of neurons followed by a mean-pooling layer and a single linear output unit. The training minimized a hinge loss function using the Adam variant of the stochastic gradient descent algorithm with mini-batches of size 100, a maximum of 10 000 iterations, and default settings. L1 regularization on the weights of the network was used to decrease overfitting. The topology had two parameters: the number of neurons in the first layer, defining the dimension of the bag representation, and the strength of the L1 regularization. Suitable parameters were found by estimating equal error rates by five-fold cross-validation (on training samples) on all combinations of these two parameters and using the combination achieving the lowest error. The prior-art neural network was implemented and optimized exactly as the proposed approach, with the difference that the max-pooling layer was placed after the last linear output unit.
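The training objective described above, a hinge loss with an L1 penalty on the weights, can be written down as a short sketch. It assumes bag labels encoded as -1 and +1 and a loss averaged over the mini-batch; the function names and the toy numbers are hypothetical, and the optimizer itself is omitted.

```python
from typing import Sequence


def hinge_loss(score: float, label: int) -> float:
    """Hinge loss for one bag; label is assumed to be -1 or +1."""
    return max(0.0, 1.0 - label * score)


def objective(
    scores: Sequence[float],
    labels: Sequence[int],
    weights: Sequence[float],
    l1_strength: float,
) -> float:
    """Mean hinge loss over a mini-batch plus an L1 penalty on all weights."""
    data_term = sum(hinge_loss(s, y) for s, y in zip(scores, labels)) / len(scores)
    penalty = l1_strength * sum(abs(w) for w in weights)
    return data_term + penalty


# Toy mini-batch: the first bag is classified with a margin, the second is not.
value = objective(scores=[2.0, -0.5], labels=[1, -1], weights=[0.5, -1.5], l1_strength=0.1)
```

The L1 term pushes individual weights to exactly zero, which is one reason it is a common choice when the number of input features is large relative to the number of bags.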
Table 1: Error of the proposed NN on the training set and on the testing set, together with the error and the name of the best prior-art algorithm, for each problem.
Figure 2 summarizes the results in a critical difference diagram showing the average rank of each classifier over the problems together with the confidence interval of the corrected Bonferroni-Dunn test with significance level 0.05, testing whether two classifiers have equal performance. The critical difference diagram reveals that the classifier implemented using the proposed neural network formalism (captioned proposed NN) achieved overall the best performance, with an average rank of 4.3. In fact, Table 1 shows that it provides the lowest error on nine out of 20 problems. Note that the second best, Bag dissimilarity with minmin distance and the prior art in NN, achieved an average rank of 6.4 and were the best on only three problems and one problem, respectively.
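The quantities behind a critical difference diagram can be computed as sketched below. The sketch assumes the usual form of the Bonferroni-Dunn critical difference, a normal quantile at level alpha / (2 (k - 1)) multiplied by sqrt(k (k + 1) / (6 N)) for k classifiers and N datasets; the function names and the toy error table are hypothetical.

```python
import math
from statistics import NormalDist
from typing import List, Sequence


def ranks_with_ties(errors: Sequence[float]) -> List[float]:
    """Rank classifiers on one dataset (1 = lowest error); ties share the average rank."""
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    ranks = [0.0] * len(errors)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and errors[order[end + 1]] == errors[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def average_ranks(error_table: Sequence[Sequence[float]]) -> List[float]:
    """error_table[d][c] is the error of classifier c on dataset d."""
    per_dataset = [ranks_with_ties(row) for row in error_table]
    return [sum(column) / len(per_dataset) for column in zip(*per_dataset)]


def bonferroni_dunn_cd(num_classifiers: int, num_datasets: int, alpha: float = 0.05) -> float:
    """Critical difference of average ranks for comparisons against one control."""
    q = NormalDist().inv_cdf(1.0 - alpha / (2.0 * (num_classifiers - 1)))
    return q * math.sqrt(num_classifiers * (num_classifiers + 1) / (6.0 * num_datasets))


toy_errors = [[0.10, 0.20, 0.20], [0.30, 0.10, 0.20]]
toy_ranks = average_ranks(toy_errors)
```

Two classifiers whose average ranks differ by less than the critical difference cannot be declared different at the chosen significance level.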
The exact values of EER of the best algorithm from the prior art and of the proposed NN formalism are summarized in Table 1. The results show that the proposed neural network formalism scored poorly on problems with a large dimension and a small number of samples, namely Newsgroups and Web (see Table 1 of the dataset study for details on the data). The neural network easily overfit to the training data, which is supported by zero errors on the training sets.
5 Conclusion
This work has presented a generalization of neural networks to multi-instance problems. Unlike the prior art, the proposed formalism embeds samples consisting of multiple instances into a vector space, enabling subsequent use with standard decision-making techniques. The key advantage of the proposed solution is that it simultaneously optimizes the classifier and the embedding. This advantage was illustrated on a set of real-world examples, comparing results to a large number of algorithms from the prior art. The proposed formalism seems to outperform the majority of standard MIL methods in terms of accuracy. It should be stressed, though, that results were compared to those published by the authors of survey benchmarks; not all methods in the referred tests may have been set in the best possible way. However, as tuning many such methods would be computationally very expensive, the proposed formalism is competitive also due to its relatively modest computational complexity, which does not exceed that of a standard 3-layer neural network. The proposed formalism opens up a variety of options for further development. A better and possibly more automated choice of pooling functions is one of the promising ways to improve performance on some types of data.
-  J. Amores. Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence, 201:81–105, 2013.
-  S. Andrews, I. Tsochantaridis, T. Hofmann. Support vector machines for multiple-instance learning. S. Becker, S. Thrun, K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 577–584. MIT Press, 2003.
-  Y. Bengio, A. Courville, P. Vincent. Representation learning: A review and new perspectives. arXiv preprint arXiv:1206.5538v2, 2012.
-  M.-A. Carbonneau, V. Cheplygina, E. Granger, G. Gagnon. Multiple instance learning: A survey of problem characteristics and applications. arXiv preprint arXiv:1612.03365, 2016.
-  Y. Chen, J. Bi, J. Z. Wang. Miles: Multiple-instance learning via embedded instance selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):1931–1947, Dec 2006.
-  V. Cheplygina, D. M. Tax, M. Loog. Multiple instance learning with bag dissimilarities. Pattern Recognition, 48(1):264 – 275, 2015.
-  V. Cheplygina, D. M. J. Tax. Characterizing Multiple Instance Datasets, pages 15–27. Springer International Publishing, Cham, 2015.
-  V. Cheplygina, D. M. J. Tax. Similarity-Based Pattern Recognition: Third International Workshop, SIMBAD 2015, Copenhagen, Denmark, October 12-14, 2015. Proceedings, chapter Characterizing Multiple Instance Datasets, pages 15–27. Springer International Publishing, Cham, 2015.
-  V. Cheplygina, D. M. J. Tax, M. Loog. Supplemental documents to characterizing multiple instance datasets.
-  J. Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30, 2006.
-  T. G. Dietterich, R. H. Lathrop, T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence, 89(1):31–71, 1997.
-  J. Foulds, E. Frank. A review of multi-instance learning assumptions. The Knowledge Engineering Review, 25(01):1–25, 2010.
-  T. Gärtner, P. A. Flach, A. Kowalczyk, A. J. Smola. Multi-instance kernels. ICML, volume 2, pages 179–186, 2002.
-  D. Haussler. Convolution kernels on discrete structures. Technical report, University of California at Santa Cruz, 1999.
-  K. Jarrett, K. Kavukcuoglu, M. Ranzato, Y. LeCun. What is the best multi-stage architecture for object recognition? Computer Vision, 2009 IEEE 12th International Conference on, pages 2146–2153, Sept 2009.
-  D. Kingma, J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  O. Maron, T. Lozano-Pérez. A framework for multiple-instance learning. M. I. Jordan, M. J. Kearns, S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 570–576. MIT Press, 1998.
-  K. Muandet, K. Fukumizu, F. Dinuzzo, B. Schölkopf. Learning from distributions via support measure machines. Advances in Neural Information Processing Systems, pages 10–18, 2012.
-  J. Ramon, L. De Raedt. Multi instance neural networks. 2000.
-  C. V. Tax, D.M.J. MIL, a Matlab toolbox for multiple instance learning, Jun 2016. version 1.2.1.
-  J. Wang, J.-D. Zucker. Solving the multiple-instance problem: A lazy learning approach. Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, pages 1119–1126, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
-  C. Zhang, J. C. Platt, P. A. Viola. Multiple instance boosting for object detection. Y. Weiss, B. Schölkopf, J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1417–1424. MIT Press, 2006.
-  Q. Zhang, S. A. Goldman. EM-DD: An improved multiple-instance learning technique. T. G. Dietterich, S. Becker, Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1073–1080. MIT Press, 2002.
-  Z.-H. Zhou, M.-L. Zhang. Neural networks for multi-instance learning. Proceedings of the International Conference on Intelligent Information Technology, volume 182. Citeseer, 2002.