1 Introduction
Relational learning takes several different forms, ranging from purely symbolic (logical) representations to a wide collection of statistical approaches (De Raedt et al., 2008) based on tools such as probabilistic graphical models (Jaeger, 1997; De Raedt et al., 2008; Richardson and Domingos, 2006; Getoor and Taskar, 2007), kernel machines (Landwehr et al., 2010), and neural networks (Frasconi et al., 1998; Scarselli et al., 2009; Niepert et al., 2016a).
Multi-instance learning (MIL) is perhaps the simplest form of relational learning, where data consists of labeled bags of instances. Introduced by Dietterich et al. (1997), MIL has attracted the attention of several researchers during the last two decades and has been successfully applied to problems such as image and scene classification (Maron and Ratan, 1998; Zha et al., 2008; Zhou et al., 2012), image annotation (Yang et al., 2006), image retrieval (Yang and Lozano-Pérez, 2000; Rahmani et al., 2005), Web mining (Zhou et al., 2005), text categorization (Zhou et al., 2012), and diagnostic medical imaging (Hou et al., 2015; Yan et al., 2016). In classic MIL, labels are binary and bags are positive iff they contain at least one positive instance (existential semantics). For example, a visual scene with animals could be labeled as positive iff it contains at least one tiger. Various families of algorithms have been proposed for MIL, including axis-parallel rectangles (Dietterich et al., 1997), diverse density (Maron and Lozano-Pérez, 1998), nearest neighbors (Wang and Zucker, 2000), neural networks (Ramon and De Raedt, 2000), and variants of support vector machines (Andrews et al., 2002).

In this paper, we extend the MIL setting by considering examples consisting of labeled nested bags of instances. Labels are observed for top-level bags, while instances and lower-level bags have associated latent labels. For example, a potential offside situation in a soccer match can be represented by a bag of images showing the scene from different camera perspectives. Each image, in turn, can be interpreted as a bag of players with latent labels for their team membership and/or position on the field. We call this setting multi-multi-instance learning (MMIL), referring specifically to the case of bags-of-bags (the generalization to deeper levels of nesting is straightforward, but is not explicitly formalized in the paper for the sake of simplicity). In our framework, we also relax the classic MIL assumption of binary instance labels, allowing categorical labels lying in a generic alphabet. This is important since MMIL with binary labels under the existential semantics would reduce to classic MIL after flattening the bag-of-bags.
We propose a solution to the MMIL problem based on neural networks with a special layer called a bag-layer. Unlike previous neural network approaches to MIL (Ramon and De Raedt, 2000), where predicted instance labels are aggregated by (a soft version of) the maximum operator, bag-layers aggregate internal representations of instances (or bags of instances) and can be naturally intermixed with other layers commonly used in deep learning. Bag-layers can in fact be interpreted as a generalization of convolutional layers followed by pooling, as commonly used in deep learning.
The MMIL framework can be immediately applied to solve problems where examples are naturally described as bags-of-bags. For example, a text document can be described as a bag of sentences, where in turn each sentence is a bag of words. The range of possible applications of the framework is, however, larger. In fact, every structured data object can be recursively decomposed into parts, a strategy that has been widely applied in the context of graph kernels (see, e.g., Haussler, 1999; Gärtner et al., 2004; Passerini et al., 2006; Shervashidze et al., 2009; Costa and De Grave, 2010; Orsini et al., 2015). Hence, MMIL is also applicable to supervised graph classification. Experiments on bibliographical and social network datasets confirm the practical viability of MMIL for these forms of relational learning.
As a further advantage, multi-multi-instance learning enables a particular way of interpreting the models by reconstructing instance and sub-bag latent variables. This makes it possible to explain the prediction for a particular data point, and to describe the structure of the decision function in terms of symbolic rules. Suppose we could recover the latent labels associated with instances or inner bags. These labels would provide useful additional information about the data, since we could group instances (or inner bags) that share the same latent label and attach some semantics to these groups by inspection. For example, in the case of textual data, grouping words or sentences with the same latent label effectively discovers topics, and the decision of an MMIL text document classifier can be interpreted in terms of the discovered topics. In practice, even if we cannot recover the true latent labels, we may still derive pseudo-labels from patterns of hidden unit activations in the bag-layers.

The paper is organized as follows. In Section 2 we formally introduce the MMIL setting. In Section 3 we formalize bag-layers and the resulting neural network architecture for MMIL. In Section 4 we show how the MMIL perspective applies to supervised learning over graphs. In Section 5 we describe a technique for extracting rules from trained networks of bag-layers. In Section 6 we discuss related work. In Section 7 we report experimental results on both semi-synthetic and real-world datasets, while in Section 8 we report experimental results on two different graph tasks. Finally, we draw some conclusions in Section 9.
2 Framework
2.1 Traditional multi-instance learning
In the standard multi-instance learning (MIL) setting, data consists of labeled bags of instances. In the following, $\mathcal{X}$ denotes the instance space (it can be any set), $\mathcal{Y}$ the bag label space for the observed labels of example bags, and $\mathcal{C}$ the instance label space for the unobserved (latent) instance labels. For any set $\mathcal{A}$, $\mathcal{M}(\mathcal{A})$ denotes the set of all multisets of $\mathcal{A}$. An example in MIL is a pair $(x, y) \in \mathcal{M}(\mathcal{X}) \times \mathcal{Y}$, which we interpret as the observed part of an instance-labeled example $(\bar{x}, y)$ with $\bar{x} \in \mathcal{M}(\mathcal{X} \times \mathcal{C})$. $x$ is thus a multiset of instances, and $\bar{x}$ a multiset of labeled instances.
Examples are drawn from a fixed and unknown distribution $p(x, y)$. Furthermore, it is typically assumed that the label of an example is conditionally independent of the individual instances given their labels, i.e., $p(y \mid \{(x_1, c_1), \dots, (x_m, c_m)\}) = p(y \mid \{c_1, \dots, c_m\})$. In the classic setting, introduced in (Dietterich, 2000) and used in several subsequent works (Maron and Lozano-Pérez, 1998; Wang and Zucker, 2000; Andrews et al., 2002), the focus is on binary classification ($\mathcal{Y} = \mathcal{C} = \{0, 1\}$) and it is postulated that $y = \max\{c_1, \dots, c_m\}$ (i.e., an example is positive iff at least one of its instances is positive). More complex assumptions are possible and are thoroughly reviewed in (Foulds and Frank, 2010). Supervised learning in this setting can be formulated in two ways: (1) learn a function that classifies whole examples, or (2) learn a function that classifies instances and then use some aggregation function defined on the multiset of predicted instance labels to obtain the example label.
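The second formulation can be sketched in a few lines of Python (a minimal illustration of our own; the threshold-based instance classifier below is a toy stand-in, not part of any MIL method):

```python
def classify_bag(instances, instance_clf, aggregate=any):
    """Formulation (2): classify each instance, then aggregate the
    multiset of predicted instance labels into a bag label. With
    aggregate=any this is exactly the classic existential semantics."""
    return aggregate(instance_clf(x) for x in instances)

# toy instance classifier: an instance is "positive" iff it exceeds 0.5
def is_positive(x):
    return x > 0.5

pos_bag = classify_bag([0.1, 0.7, 0.2], is_positive)  # True
neg_bag = classify_bag([0.1, 0.2], is_positive)       # False
```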
2.2 Multi-multi-instance learning
In multi-multi-instance learning (MMIL), data consists of labeled nested bags of instances. When the level of nesting is two, an example is a labeled bag-of-bags $(x, y)$ with $x \in \mathcal{M}(\mathcal{M}(\mathcal{X}))$, drawn from a fixed and unknown distribution $p(x, y)$. Deeper levels of nesting, leading to deeper forms of multi-instance learning, are conceptually easy to introduce, but we avoid them in the paper to keep our notation simple. We will also informally use the expression "bag-of-bags" to describe structures with two or more levels of nesting. In the MMIL setting, we call the elements of $\mathcal{M}(\mathcal{M}(\mathcal{X}))$ and $\mathcal{M}(\mathcal{X})$ top-bags and sub-bags, respectively.
Now postulating unobserved labels for both the instances and the sub-bags, we interpret examples $(x, y)$ as the observed part of fully labeled data points $(\bar{x}, y)$, where sub-bag labels take values in a space $\mathcal{D}$ and instance labels in $\mathcal{C}$, as before. Fully labeled data points are drawn from a fixed and unknown distribution $p(\bar{x}, y)$.
As in MIL, we make some conditional independence assumptions. Specifically, we assume that instance and sub-bag labels only depend on properties of the respective instances or sub-bags, and not on other elements in the nested multiset structure (thus excluding models for contagion or homophily, where, e.g., a specific label for an instance could become more likely if many other instances contained in the same sub-bag also have that label). Furthermore, we assume that labels of sub-bags and top-bags only depend on the labels of their constituent elements. Thus, for a sub-bag label $d \in \mathcal{D}$ and a bag of labeled instances $\{(x_1, c_1), \dots, (x_m, c_m)\}$ we have:

$$p(d \mid \{(x_1, c_1), \dots, (x_m, c_m)\}) = p(d \mid \{c_1, \dots, c_m\}) \quad (1)$$

Similarly for the probability distribution of top-bag labels given the constituent labeled sub-bags.
Example 2.1.
In this example we consider bags-of-bags of handwritten digits (as in the MNIST dataset). Each instance (a digit) has attached its own latent class label in $\{0, \dots, 9\}$, whereas sub-bag labels (latent) and top-bag labels (observed) are binary. In particular, a sub-bag is positive iff it contains an instance of class 7 and does not contain an instance of class 3. A top-bag is positive iff it contains at least one positive sub-bag. Figure 1 shows a positive and a negative example.
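The labeling rules of this example can be written down directly. The following toy sketch uses the digit classes themselves in place of MNIST images (recovering those classes from pixels is, of course, what makes the learning problem nontrivial):

```python
def sub_bag_label(digits):
    # a sub-bag is positive iff it contains a 7 and no 3
    return 7 in digits and 3 not in digits

def top_bag_label(sub_bags):
    # a top-bag is positive iff at least one of its sub-bags is positive
    return any(sub_bag_label(s) for s in sub_bags)

positive_example = [[7, 1, 2], [3, 4]]  # first sub-bag is positive
negative_example = [[3, 7], [5, 9]]     # no positive sub-bag
```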
Example 2.2.
A top-bag can consist of a set of images showing a potential offside situation in soccer from different camera perspectives. The label of the bag corresponds to the referee decision, i.e., 'offside' or 'not offside'. Each individual image can either settle the offside question one way or another, or be inconclusive. Thus, there are (latent) image labels 'offside', 'not offside', and 'inconclusive'. Since no offside should be called when in doubt, the top-bag is labeled 'not offside' if and only if it either contains at least one image labeled 'not offside', or all the images are labeled 'inconclusive'. Images, in turn, can be seen as bags of player instances that have a label according to their relative position with respect to the potentially offside player of the other team. An image then is labeled 'offside' if all the players in the image are labeled 'behind'; it is labeled 'not offside' if it contains at least one player labeled 'in front'; and it is labeled 'inconclusive' if it only contains players labeled 'inconclusive' or 'behind'.
Example 2.3.
In text categorization, the bag-of-words representation is often used to feed documents to classifiers. Each instance in this case consists of the indicator vector of words in the document (or a weighted variant such as TF-IDF). The MIL approach has been applied in some cases (Andrews et al., 2002), where instances consist of chunks of consecutive words and each instance is an indicator vector. A bag-of-bags representation could instead describe a document as a bag of sentences, and each sentence as a bag of word vectors (constructed, for example, using word2vec or GloVe).
3 A network architecture for MMIL
3.1 Bag layers
We model the conditional distribution $p(y \mid x)$ with a neural network architecture that handles bags-of-bags of variable sizes by aggregating intermediate internal representations. For this purpose, we introduce a new layer called a bag-layer. A bag-layer takes as input a bag of $n$-dimensional vectors $\{x_1, \dots, x_m\}$, and first computes $k$-dimensional representations

$$\phi_i = \alpha(W x_i + b), \quad i = 1, \dots, m \quad (2)$$

using a weight matrix $W \in \mathbb{R}^{k \times n}$, a bias vector $b \in \mathbb{R}^k$, and an activation function $\alpha$ (such as ReLU, tanh, or linear). The bag-layer then computes its output as:

$$\phi = \Xi_{i=1}^{m}\, \phi_i \quad (3)$$

where $\Xi$ is an elementwise aggregation operator (such as max or average). Both $W$ and $b$ are tunable parameters. Note that Equation (3) works with bags of arbitrary cardinality. A bag-layer is illustrated in Figure 2.
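As a concrete illustration, the computation of Equations (2)-(3) with a ReLU activation and max aggregation can be sketched in a few lines of NumPy (a minimal sketch of our own; the name `bag_layer` and the shapes below are illustrative, not part of any library):

```python
import numpy as np

def bag_layer(bag, W, b, aggregate=np.max):
    """Map a bag of m n-dimensional vectors to one k-dimensional vector.

    bag: array of shape (m, n); W: weights of shape (k, n); b: bias (k,).
    Each instance is transformed independently (Eq. 2, ReLU activation),
    then the m transformed vectors are aggregated elementwise (Eq. 3).
    """
    phi = np.maximum(bag @ W.T + b, 0.0)  # shape (m, k)
    return aggregate(phi, axis=0)         # shape (k,)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), np.zeros(4)
bag = rng.normal(size=(5, 3))             # a bag of 5 instances

out = bag_layer(bag, W, b)
same = bag_layer(bag[::-1], W, b)         # permuted bag: same output
```

The output has a fixed dimension regardless of the bag's cardinality, and is invariant to the order of the instances, which is what allows bags of variable size to be processed.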
Networks with a single bag-layer can process bags of instances (as in the standard MIL setting). To solve the MMIL problem, two bag-layers are required. The bottom bag-layer aggregates over internal representations of instances; the top bag-layer aggregates over internal representations of sub-bags, yielding a representation for the entire top-bag. In this case, the representation of each sub-bag $j$ would be obtained as

$$\phi_j = \Xi_{i=1}^{m_j}\, \alpha\big(W^{(1)} x_{j,i} + b^{(1)}\big) \quad (4)$$

and the representation of a top-bag would be obtained as

$$\phi = \Xi_{j=1}^{n}\, \alpha\big(W^{(2)} \phi_j + b^{(2)}\big) \quad (5)$$

where $(W^{(1)}, b^{(1)})$ and $(W^{(2)}, b^{(2)})$ denote the parameters used to construct sub-bag and top-bag representations, respectively. Note that different aggregation functions can also be evaluated in parallel. Note also that nothing prevents us from intermixing bag-layers with standard neural network layers, thereby forming networks of arbitrary depth. In this case, each $x_{j,i}$ in Eq. (4) would simply be replaced by the last-layer activation of a deep network taking $x_{j,i}$ as input. Of course, the top-bag representation can itself be further processed by other layers. An example of the overall architecture is shown in Figure 3.
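Stacking two such aggregations gives the MMIL forward pass of Equations (4)-(5). The following self-contained sketch (our own illustrative code, not a library API) shows how sub-bags of different cardinality are handled naturally:

```python
import numpy as np

def bag_layer(bag, W, b, aggregate=np.max):
    # instance-wise affine map + ReLU, then elementwise aggregation
    phi = np.maximum(bag @ W.T + b, 0.0)
    return aggregate(phi, axis=0)

def mmil_forward(top_bag, W1, b1, W2, b2):
    """Two stacked bag-layers: the bottom one builds one representation
    per sub-bag (Eq. 4), the top one aggregates those into a single
    representation of the whole top-bag (Eq. 5)."""
    sub_reprs = np.stack([bag_layer(s, W1, b1) for s in top_bag])
    return bag_layer(sub_reprs, W2, b2)

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

example = [rng.normal(size=(m, 3)) for m in (2, 5, 3)]  # three sub-bags
top_repr = mmil_forward(example, W1, b1, W2, b2)        # shape (2,)
```

A classification layer can then be applied to `top_repr`; with max aggregation the result is invariant both to the order of the sub-bags and to the order of the instances inside them.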
3.2 Expressiveness of networks of bag-layers
We focus here on a deterministic (noiseless) version of the MMIL setting described in Section 2.2, where labels are deterministically assigned and no form of counting is involved. We show that under these assumptions, the architecture of Section 3.1 has enough expressivity to represent the solution to the MMIL problem. Our approach relies on classic universal interpolation results for neural networks (Hornik et al., 1989). Note that existing results hold for vector data, and this section shows that they can be leveraged to bag-of-bags data when using the architecture of Section 3.1.

Definition 3.1.
We say that data is generated under the deterministic MMIL setting if the following conditions hold true:

- instance labels are generated by an unknown function $f: \mathcal{X} \to \mathcal{C}$, i.e., $c_{j,i} = f(x_{j,i})$, for $j = 1, \dots, n$, $i = 1, \dots, m_j$;

- sub-bag labels are generated by an unknown function $g: \mathcal{M}(\mathcal{C}) \to \mathcal{D}$, i.e., $d_j = g(\{c_{j,1}, \dots, c_{j,m_j}\})$;

- the top-bag label is generated by an unknown function $h: \mathcal{M}(\mathcal{D}) \to \mathcal{Y}$, i.e., $y = h(\{d_1, \dots, d_n\})$.
Note that the classic MIL formulation (Maron and Lozano-Pérez, 1998) is recovered when each example consists of a single sub-bag, $\mathcal{C} = \mathcal{D} = \{0, 1\}$, and $g(\{c_1, \dots, c_m\}) = \max\{c_1, \dots, c_m\}$. Other generalized MIL formulations (Foulds and Frank, 2010; Scott et al., 2005; Weidmann et al., 2003) can be similarly captured in this deterministic setting.
For a multiset $s$, let $\mathrm{set}(s)$ denote the set of elements occurring in $s$; e.g., $\mathrm{set}(\{a, a, b\}) = \{a, b\}$.
Definition 3.2.
We say that data is generated under the non-counting deterministic MMIL setting if, in addition to the conditions of Definition 3.1, both $g$ and $h$ only depend on $\mathrm{set}(\cdot)$ of their multiset arguments.
The following result indicates that a network containing a bag-layer with max aggregation is sufficient to compute the functions that label both sub-bags and top-bags.
Lemma 3.1.
Let $\mathcal{C}, \mathcal{D}$ be finite sets of labels, and let $g: \mathcal{M}(\mathcal{C}) \to \mathcal{D}$ be a labelling function for which $g(s) = g(s')$ whenever $\mathrm{set}(s) = \mathrm{set}(s')$. Then there exists a network with one bag-layer that computes $g$.
Proof.
We construct a network $N$ where first a bag-layer maps the multiset input $s$ to a bit-vector representation of $\mathrm{set}(s)$, on top of which we can then compute $g$ using a standard architecture for Boolean functions.

In detail, $N$ is constructed as follows: the input is encoded by $|\mathcal{C}|$-dimensional vectors containing the one-hot representations of the elements of $s$. We construct a bag-layer where $k = n = |\mathcal{C}|$, $W$ is the identity matrix, the bias vector is zero, $\alpha$ is the identity function, and $\Xi$ is max. The output of the bag-layer then is an $|\mathcal{C}|$-dimensional vector whose $i$-th component is the indicator function $\mathbb{1}[c_i \in \mathrm{set}(s)]$, where $c_i$ denotes the $i$-th element of $\mathcal{C}$.

For each $d \in \mathcal{D}$ we can write the indicator function $\mathbb{1}[g(s) = d]$ as a Boolean function of the indicator functions $\mathbb{1}[c_i \in \mathrm{set}(s)]$. Using standard universal approximation results (see, e.g., (Hornik et al., 1989), Theorem 2.5), we can construct a network that on input $\mathrm{set}(s)$ computes $\mathbb{1}[g(s) = d]$. $|\mathcal{D}|$ such networks in parallel then produce a $|\mathcal{D}|$-dimensional output vector containing the one-hot representation of $g(s)$. ∎
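The first step of the proof (one-hot inputs, identity weights, zero bias, identity activation, max aggregation) can be checked numerically with a small sketch of our own:

```python
import numpy as np

def set_indicator(label_bag, n_labels):
    """Bag-layer from the proof: one-hot encode each label, use the
    identity weight matrix, zero bias, identity activation, and max
    aggregation. The output is the indicator vector of set(s)."""
    onehot = np.eye(n_labels)[label_bag]  # shape (m, n_labels)
    return onehot.max(axis=0)

ind = set_indicator([0, 2, 2, 0], n_labels=4)
# multiplicities are discarded: only which labels occur survives
```

Any Boolean function of this indicator vector can then be computed by standard layers, which is exactly where the non-counting assumption enters the argument.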
Theorem 3.2.
Given a dataset of examples generated under the non-counting deterministic MMIL setting, there exists a network with two bag-layers that can correctly label all examples in the dataset.
Proof.
We first note that the universal interpolation result of (Hornik et al., 1989) can be applied to a network taking as input any instance appearing in a data example and generating the desired label $c = f(x)$. We then use Lemma 3.1 twice, first to form a network that computes the sub-bag labeling function $g$, and then to form a network that computes the top-bag labeling function $h$. ∎
4 MMIL for graph learning
The MMIL perspective can also be used to derive algorithms suitable for supervised learning over graphs, i.e., tasks such as graph classification, node classification, and edge prediction. In all these cases, one first needs to construct a representation for the object of interest (a whole graph, a node, a pair of nodes) and then apply a classifier. A suitable representation can be obtained in our framework by first forming a bag-of-bags associated with the object of interest (a graph, a node, or an edge) and then feeding it to a network with bag-layers. In order to construct bags-of-bags, we follow the classic decomposition strategy introduced by Haussler (1999). In the present context, it simply requires us to introduce a relation that holds true if one object is a "part" of another, and to form the bag of all parts of a given object. Parts can in turn be decomposed in a similar fashion, yielding bags-of-bags. In the following, we focus on undirected graphs $G = (V, E)$, where $V$ is the set of nodes and $E$ the set of edges. We also assume that a labelling function attaches attributes to vertices. Variants with directed graphs or labeled edges are straightforward and omitted here in the interest of brevity.
Graph classification
A simple solution is to define a part-of relation between graphs that holds true iff the part is a subgraph of the whole, and to introduce a second part-of relation that holds true iff a node belongs to a given subgraph. The bag-of-bags associated with a graph $g$ is then constructed by collecting, for each subgraph in a suitable set of subgraphs of $g$, the bag of attributes of its nodes. In general, considering all subgraphs is not practical, but suitable feasible choices can be derived borrowing approaches already introduced in the graph kernel literature, for example decomposing $g$ into cycles and trees (Horváth et al., 2004), or into neighbors or neighbor pairs (Costa and De Grave, 2010) (some of these choices may require three levels of bag nesting, e.g., for grouping cycles and trees separately).
Node classification
In some domains, the node labelling function itself is bag-valued. For example, in a citation network, the label of a node could be the bag of words in the abstract of the associated paper. A bag-of-bags in this case may be formed by considering a paper together with all papers in its neighborhood (i.e., the papers it cites and the papers that cite it). A slightly richer description with three levels of nesting could be used to set apart a node from its neighborhood.
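As a small illustration of this construction for node classification (the dictionaries `adjacency` and `words` below are toy stand-ins of our own for a citation graph and its abstracts):

```python
def node_bag_of_bags(v, adjacency, words):
    """Describe node v by its own bag of words, set apart from the bags
    of words of its neighbours (cited and citing papers)."""
    return (words[v], [words[u] for u in adjacency[v]])

adjacency = {0: [1, 2], 1: [0], 2: [0]}  # undirected citation links
words = {0: ["deep", "learning"], 1: ["graph", "kernel"], 2: ["mil"]}

own_words, neighbour_bags = node_bag_of_bags(0, adjacency, words)
```

The nested structure returned here is exactly the kind of input that the bag-layer network of Section 3 consumes.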
5 Interpreting networks of bag-layers
Interpreting the predictions in the supervised learning setting amounts to providing a human-understandable explanation of the prediction. Transparent techniques such as rules or trees retain much of the symbolic structure of the data and are well suited in this respect. By contrast, predictions produced by methods based on numerical representations are often opaque, i.e., difficult to explain to humans. In particular, representations in neural networks are highly distributed, making it hard to disentangle a clear semantic interpretation of any specific hidden unit. Although many works exist that attempt to interpret neural networks, they mostly focus on specific application domains such as vision (Lapuschkin et al., 2016a; Samek et al., 2016).
The MMIL setting offers some advantages in this respect. Indeed, if instance or sub-bag labels were observed, they would provide more information about a bag-of-bags than the mere prediction: latent variables are associated with each individual "part" of the top-bag, as opposed to the prediction, which is associated with the whole. To illustrate this point, the MIL approaches miSVM and MISVM in (Andrews et al., 2002) are not equally interpretable: the former is more interpretable than the latter, since it also provides individual instance labels rather than simply providing a prediction about the whole bag. These standard MIL approaches make two assumptions: first, all labels are binary; second, the relationship between the instance labels and the bag label is predefined to be the existential quantifier. In our case we relax these assumptions by allowing labels in a categorical alphabet and by allowing more complex mappings between bags of instance labels and sub-bag labels. Our approach may also provide a richer explanation due to the nested structure of the data as bags-of-bags. We follow the standard MIL approaches in that we also assume a deterministic mapping from component to bag labels, i.e., we assume the data can be modelled in the deterministic MMIL setting according to Definition 3.1.
The idea we propose in the following is based on four steps. First, we employ clustering at the level of instance and sub-bag representations to construct pseudo-labels as surrogates for hypothesized actual latent labels. Pseudo-labels obtained in this way are abstract symbols without any specific semantics. Hence, in the second step we provide semantic interpretations of the pseudo-labels for human inspection. Third, we apply a transparent learner to extract a human-readable representation of the mappings between pseudo-labels at the different levels of the bag-of-bags structure. Finally, we explain predictions for individual top-bag examples by exhibiting the relevant components and their pseudo-labels which determine the predicted top-bag label.

As before, for ease of exposition we assume in the following a two-level bag-of-bags structure. The method also applies directly to other nesting depths.
Clustering and pseudo-label construction
Given labeled top-bag data and a trained MMIL network, we consider the multisets of sub-bag and instance representations computed by the bag-layers, i.e., the multisets of all vectors $\phi_j$ and $\phi_{j,i}$ obtained according to (4) and (2), respectively. Given numbers of clusters $k^{\mathrm{sub}}$ and $k^{\mathrm{inst}}$, we run a clustering procedure on the sub-bag representations and on the instance representations (separately), obtaining $k^{\mathrm{sub}}$ and $k^{\mathrm{inst}}$ clusters. We finally associate each sub-bag and each instance with the cluster indices of their representations, and use these indices as pseudo-labels.
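This clustering step can be sketched as follows (a minimal k-means of our own, kept self-contained; a library implementation such as scikit-learn's `KMeans` would normally be used instead):

```python
import numpy as np

def pseudo_labels(reprs, k, iters=50):
    """Cluster bag-layer representations; the index of the cluster a
    representation falls into becomes its pseudo-label."""
    # deterministic initialization: k representations spread over the data
    centroids = reprs[np.linspace(0, len(reprs) - 1, k).astype(int)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(reprs[:, None, :] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = reprs[labels == j].mean(axis=0)
    return labels, centroids

# two well-separated groups of representations -> two pseudo-labels
rng = np.random.default_rng(0)
reprs = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
labels, centroids = pseudo_labels(reprs, k=2)
```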
Interpreting pseudo-labels
Clusters can be directly inspected in an attempt to attach some meaning to pseudo-labels. For example, in the case of textual data, a human could inspect word clusters, similarly to what has been suggested in the area of topic modelling (Blei et al., 2003; Griffiths and Steyvers, 2004).
To facilitate inspection, we propose an approach that characterizes clusters in terms of their most characteristic elements. To this end, we define a ranking of the elements in each cluster according to a score function based on intra-cluster distances. Consider a sub-bag whose bag-layer representation belongs to the $j$-th cluster: its score decreases with the distance between the representation and $c_j$, the centroid of the $j$-th cluster. The procedure for ranking instances is analogous. We use the cluster elements with maximal score to illustrate and interpret the semantic nature of a cluster. Note that this is different from the more common approach of interpreting clusters by way of their centroids.
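A possible instantiation of this ranking (assuming, as one plausible choice, a score inversely related to the distance from the cluster centroid) is:

```python
import numpy as np

def most_characteristic(cluster_reprs, centroid, top=2):
    """Rank the members of one cluster by a score that decreases with
    their distance from the cluster centroid, and return the indices of
    the `top` highest-scoring (most characteristic) elements."""
    scores = -np.linalg.norm(cluster_reprs - centroid, axis=1)
    return np.argsort(scores)[::-1][:top]

cluster = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.1], [5.0, 5.0]])
best = most_characteristic(cluster, centroid=cluster.mean(axis=0))
# indices of the two members closest to the centroid
```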
In some cases the cluster elements may be equipped with some true, latent label. In such cases we can alternatively characterize pseudo-labels in terms of their correspondence with these actual labels. An example of this will be seen in Section 7.1 below.
Learning interpretable rules
We next describe how we construct interpretable functions that approximate the actual (potentially noisy) relationships between pseudo-labels in the MMIL network.
Let us denote by $\mu_\ell$ the multiplicity of pseudo-label $\ell$ in a bag of pseudo-labels. An attribute-value representation of the bag can be immediately obtained by normalizing counts: the attribute associated with label $\ell$ takes value $\mu_\ell / \sum_{\ell'} \mu_{\ell'}$, the frequency of the label in the bag. Another attribute-value representation of the bag can be obtained by using a bit which indicates the presence or absence of each label. Jointly with an output label, this attribute-value representation provides one supervised example for a propositional learner such as a decision tree. In the two-level MMIL case, we learn in this way two functions, mapping multisets of instance pseudo-labels to sub-bag pseudo-labels, and multisets of sub-bag pseudo-labels to top-bag labels, respectively (cf. Definition 3.1). In the second case, our target labels are the predicted labels of the original MMIL network, not the actual labels of the training examples. Thus, the objective is to construct rules that best explain the MMIL model, not rules that themselves provide the highest accuracy.

The instance-level clustering defines a labeling function by associating any (test) instance with the index of its nearest centroid. Taken together, the three functions provide a complete classification model for a top-bag based on the input features of its instances. We refer to the accuracy of this model with respect to the predictions of the original MMIL model as its fidelity.
We use fidelity on a validation set as the criterion to select the numbers of clusters $k^{\mathrm{inst}}$ and $k^{\mathrm{sub}}$, by performing a grid search over value combinations.
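The attribute-value encoding described above can be sketched as follows (the function name is ours; the resulting vectors, paired with output labels, are what the propositional learner is trained on):

```python
from collections import Counter

def bag_to_features(pseudo_labels, n_labels, binary=False):
    """Attribute-value encoding of a bag of pseudo-labels: either
    normalized counts (label frequencies) or presence/absence bits."""
    counts = Counter(pseudo_labels)
    if binary:
        return [int(counts[j] > 0) for j in range(n_labels)]
    return [counts[j] / len(pseudo_labels) for j in range(n_labels)]

freq = bag_to_features([0, 0, 2], n_labels=3)               # frequencies
bits = bag_to_features([0, 0, 2], n_labels=3, binary=True)  # presence bits
```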
Explaining individual classifications
The classification provided by this model for an input top-bag will often rely only on small subsets of the sub-bags and instances it contains (cf. the classic multi-instance setting, where a positive classification can rely on a single positive instance). We can therefore explain classifications for individual examples by exhibiting the critical substructures that support the prediction. The details of this step are typically quite domain-specific, and we will illustrate different versions of it in the experimental section.
6 Related Work
6.1 Multi-instance neural networks
Ramon and De Raedt (2000) proposed a neural network solution to MIL where each instance in a bag is first processed by a replica of a neural network with shared weights. In this way, a bag of output values $v_1, \dots, v_m$ is computed for each bag of instances. These values are then aggregated by a smooth version of the max function:

$$s(v_1, \dots, v_m) = \frac{1}{M} \log \sum_{i=1}^{m} e^{M v_i}$$

where $M$ is a constant controlling the sharpness of the aggregation (the exact maximum is recovered as $M \to \infty$). Recall that a single bag-layer (as defined in Section 3) can be used to solve the MIL problem. Still, a major difference compared to the work of (Ramon and De Raedt, 2000) is that bag-layers perform aggregation at the representation level rather than at the output level. In this way, more layers can be added on top of the aggregated representation, allowing for more expressiveness. In the classic MIL setting (where a bag is positive iff at least one instance is positive) this additional expressiveness is not required. However, it allows us to solve slightly more complicated MIL problems. For example, suppose each instance has a latent label, and suppose that a bag is positive iff it contains at least one instance with a certain label $c$ and no instance with another label $c'$. In this case, a bag-layer with two units can distinguish positive and negative bags, provided that instance representations can separate instances belonging to the classes $c$ and $c'$. The network proposed in (Ramon and De Raedt, 2000) would not be able to separate positive from negative bags. Indeed, as proved in Section 3.2, networks with bag-layers can represent any Boolean function over sets of instances.
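For concreteness, the smooth max aggregation of Ramon and De Raedt (2000) can be sketched as follows (numerically stabilized via the usual log-sum-exp shift; the name `M` for the sharpness constant is ours):

```python
import numpy as np

def soft_max(values, M=10.0):
    """Smooth approximation of max over instance-level outputs:
    (1/M) * log(sum_i exp(M * v_i)); approaches max(values) as M grows."""
    v = np.asarray(values, dtype=float)
    vmax = v.max()  # shift by the max for numerical stability
    return vmax + np.log(np.exp(M * (v - vmax)).sum()) / M

scores = [0.1, 0.9, 0.3]
# soft_max(scores) lies slightly above 0.9 and tends to 0.9 as M grows
```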
6.2 Convolutional neural networks
Convolutional neural networks (CNNs) (Fukushima, 1980; LeCun et al., 1989) are the state-of-the-art method for image classification (see, e.g., (Szegedy et al., 2016)). It is easy to see that the representation computed by one convolutional layer followed by max-pooling can be emulated with one bag-layer by just creating bags of adjacent image patches. The representation size $k$ corresponds to the number of convolutional filters. The major difference is that a convolutional layer outputs spatially ordered vectors of size $k$, whereas a bag-layer outputs a set of vectors (without any ordering). This difference may become significant when two or more layers are sequentially stacked.

Figure 4 illustrates the relationship between a convolutional layer and a bag-layer, for simplicity assuming a one-dimensional signal (i.e., a sequence). When applied to signals, a bag-layer essentially corresponds to a disordered convolutional layer, and its output needs further aggregation before it can be fed into a classifier. The simplest option would be to stack one additional bag-layer before the classification layer. Interestingly, a network of this kind would be able to detect the presence of a short subsequence regardless of its position within the whole sequence, achieving invariance to arbitrarily large translations.
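The equivalence between a convolutional layer with global max-pooling and a bag-layer over adjacent windows can be verified directly on a one-dimensional signal (a small sketch of our own with a linear activation and zero bias):

```python
import numpy as np

def conv1d_global_maxpool(signal, filters):
    """Valid 1-D convolution (stride 1) followed by global max-pooling."""
    k = filters.shape[1]
    windows = np.stack([signal[i:i + k] for i in range(len(signal) - k + 1)])
    return (windows @ filters.T).max(axis=0)

def bag_layer_over_windows(signal, filters):
    """The same computation as a bag-layer: each length-k window becomes
    an element of a bag, is transformed by the filter matrix, and the
    results are aggregated by an elementwise max (order is irrelevant)."""
    k = filters.shape[1]
    bag = [signal[i:i + k] for i in range(len(signal) - k + 1)]
    return np.stack([filters @ w for w in bag]).max(axis=0)

rng = np.random.default_rng(2)
sig = rng.normal(size=10)
F = rng.normal(size=(3, 4))  # 3 filters of width 4
```

Both functions return the same 3-dimensional vector; the bag-layer simply forgets the spatial order of the windows, which is irrelevant after global pooling.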
We finally note that it is possible to emulate a CNN with two layers by properly defining the structure of the bags-of-bags. For example, a second layer with filter size 3 on top of the CNN shown in Figure 4 could be emulated with two bag-layers fed by a bag-of-bags of suitably overlapping windows.
A bag-layer, however, is not limited to pooling adjacent elements in a feature map. One could, for example, segment the image first (e.g., using a hierarchical strategy (Arbeláez et al., 2011)) and then create bags-of-bags by following the segmented regions.
The convolutional approach has also recently been employed for learning with graph data. The idea is to reinterpret the convolution operator as a message-passing algorithm on a graph where each node is a signal sample (e.g., a pixel) and edges connect a sample to all samples covered by the filter when centered around its position (including a self-loop). The major difference between graphs and signals is that no obvious ordering can be defined on neighbors. This message-passing strategy over graphs was originally proposed in (Gori et al., 2005; Scarselli et al., 2009) and reused with variants in several later works. Kipf and Welling (2016), for example, propose to address the ordering issue by sharing the same weights for each neighbor (keeping them distinct from the self-loop weight). They show that message passing is closely related to the 1-dimensional Weisfeiler-Lehman (WL) method for isomorphism testing (one convolutional layer corresponding to one iteration of the WL test) and can also be motivated in terms of spectral convolutions on graphs. On a side note, similar message-passing strategies were used before in the context of graph kernels (Shervashidze et al., 2011; Neumann et al., 2012). Niepert et al. (2016b) proposed ordering via a "normalization" procedure that extends the classic canonicalization problem in graph isomorphism. Hamilton et al. (2017b) propose an extension of the approach in (Kipf and Welling, 2016) where representations of the neighbors are aggregated by a general differentiable function that can be as simple as an average or as complex as a recurrent neural network. Additional related works include (Duvenaud et al., 2015), where CNNs are applied to molecular fingerprint vectors, and (Atwood and Towsley, 2016), where a diffusion process across general graph structures generalizes the CNN strategy of scanning a regular grid of pixels.

6.3 Nested SRL Models
In Statistical Relational Learning (SRL), a great number of approaches have been proposed for constructing probabilistic models for relational data. Relational data has an inherent bag-of-bags structure: each object in a relational domain can be interpreted as a bag whose elements are all the other objects linked to it via a specific relation. These linked objects, in turn, are themselves bags containing the objects linked to them via some relation. A key component of SRL models are the tools employed for aggregating (or combining) information from the bag of linked objects. In many types of SRL models, such an aggregation is only defined for a single level. However, a few proposals have included models for nested combination (Jaeger, 1997; Natarajan et al., 2008). Like most SRL approaches, these models employ concepts from first-order predicate logic for syntax and semantics, and (Jaeger, 1997) contains an expressivity result similar in spirit to the one we presented in Section 3.2.
A key difference between SRL models with nested combination constructs and our MMIL network models is that the former build models based on rules for conditional dependencies which are expressed in first-order logic and typically contain only a very small number of numerical parameters (such as a single parameter quantifying a noisy-or combination function for modelling multiple causal influences). MMIL network models, in contrast, make use of the high-dimensional parameter spaces of (deep) neural network architectures. Roughly speaking, MMIL network models combine the flexibility of SRL models to recursively aggregate over sets of arbitrary cardinalities with the power derived from high-dimensional parameterisations of neural networks.
6.4 Interpretable models
Recently, the question of interpretability has become particularly prominent in the neural network context. Lapuschkin et al. (2016b); Samek et al. (2016) explain the predictions of a classifier instance by instance, attributing a score to each input entry; a positive or negative score indicates whether that entry contributes to predicting the target or not.
Ribeiro et al. (2016) also provide explanations for individual predictions, as a solution to the “trusting a prediction” problem, by approximating a machine learning model with an interpretable model. The authors assume that instances are given in a representation which is understandable to humans, regardless of the actual features used by the model. For example, in text classification an interpretable representation may be the binary vector indicating the presence or absence of a word. An “interpretable” model is defined as a model that can be readily presented to the user with visual or textual artefacts (linear models, decision trees, or falling rule lists) and which locally approximates the original machine learning model. Given a machine learning model, an interpretable model is trained for each instance: a set of instances is generated around it by randomly dropping out some of its non-zero entries, and, given a similarity measure (e.g. scalar product, Gaussian kernel, cosine distance), the interpretable model is trained by minimizing a similarity-weighted loss.
The major difference between all those methods and our interpretation framework, described in Section 5, is that with the latter we are able to provide a global interpretation for the whole MMIL network, as well as to explain individual examples.
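The perturbation-and-reweighting procedure described above can be sketched as follows. This is a minimal illustration, not Ribeiro et al.'s implementation: the "fraction of entries kept" similarity proxy and the independent per-feature weighted fit are simplifying assumptions.

```python
import random

def local_surrogate(black_box, x, n_samples=500, seed=0):
    """LIME-style sketch (hypothetical helper, not the authors' code):
    approximate black_box around x with a weighted linear model.
    Perturb x by randomly zeroing non-zero entries, weight each sample
    by its similarity to x (here: fraction of non-zero entries kept),
    then fit weighted least squares per feature (a crude stand-in for
    the full weighted linear fit)."""
    rng = random.Random(seed)
    nz = [i for i, v in enumerate(x) if v != 0]
    samples, weights, targets = [], [], []
    for _ in range(n_samples):
        keep = [i for i in nz if rng.random() < 0.5]  # drop entries at random
        z = [v if i in keep else 0.0 for i, v in enumerate(x)]
        samples.append(z)
        weights.append(len(keep) / max(len(nz), 1))   # similarity proxy
        targets.append(black_box(z))
    coefs = []
    for j in range(len(x)):
        num = sum(w * z[j] * y for w, z, y in zip(weights, samples, targets))
        den = sum(w * z[j] * z[j] for w, z in zip(weights, samples)) or 1.0
        coefs.append(num / den)
    return coefs
```

The resulting coefficients play the role of the interpretable linear model: larger values indicate entries whose presence pushes the black-box output up in the neighborhood of x.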
7 Experiments on MMIL data
We evaluated our model on two experimental setups:

we constructed a multi-multi-instance semi-synthetic dataset from MNIST, in which digits were organized in bags-of-bags of arbitrary cardinality. This setup follows the example shown in Section 2.2. The aim of this experiment is to show the ability of the network to learn the functions that generated the data, in accordance with Theorem 3.2 in Section 3.2. Furthermore, we interpret the network using the approach described in Section 5;

we decomposed a sentiment-analysis text dataset into MMIL data and MIL data. The goal is to show the differences between the interpretations of the two models.
7.1 Semisynthetic dataset
Results of Section 3.2 show that networks with bag layers can represent any labelling function in the non-counting deterministic MMIL setting. We show here that such networks, trained by gradient descent, can actually learn these functions from MMIL data.
The task is defined exactly as in Example 2.1 and we formed a balanced training set of 5,000 top-bags using MNIST digits. Both sub-bag and top-bag cardinalities were uniformly sampled in . Instances were sampled with replacement from the MNIST training set (60,000 digits). A test set of 5,000 top-bags was constructed similarly, but with instances sampled from the MNIST test set (10,000 digits). Details on the network architecture and the training procedure are reported in Appendix, Table 7. We stress that instance and sub-bag labels were not used to form the training objective. The accuracy on the test set was , confirming that the network is able to recover, with reasonably high accuracy, the latent logic function used in the data generation process.
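A sketch of the bag-of-bags construction follows, with digit labels standing in for MNIST images. The cardinality range and the exact labelling rule (a sub-bag is positive iff it contains a 7 and no 3; a top-bag is positive iff it contains a positive sub-bag) are assumptions consistent with the rules recovered later in this section, not a verbatim restatement of Example 2.1.

```python
import random

def sample_top_bag(rng, max_card=5):
    """Assumed cardinality range: 1..max_card at both levels."""
    n_sub = rng.randint(1, max_card)
    return [[rng.randint(0, 9) for _ in range(rng.randint(1, max_card))]
            for _ in range(n_sub)]

def sub_bag_label(sub):   # latent label: contains a 7 and no 3 (assumption)
    return 7 in sub and 3 not in sub

def top_bag_label(top):   # observed label: some sub-bag is positive
    return any(sub_bag_label(s) for s in top)

rng = random.Random(0)
dataset = [(t, top_bag_label(t)) for t in (sample_top_bag(rng) for _ in range(5000))]
```

Only the top-bag labels in `dataset` would feed the training objective; the sub-bag and instance labels remain latent, as stated above.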
We show next how the general approach of Section 5 for constructing interpretable rules recovers, for this example, the latent labels and logical rules used in the data generating process. Pseudo-labels and rules are learnt with the procedure described in Section 5. Clustering was performed with k-means, while as propositional learner we used decision trees. By optimizing the interpretable model with respect to the fidelity on the validation set, the best numbers of instance pseudo-labels and sub-bag pseudo-labels turned out to be 6 and 2, respectively. A heatmap showing the fidelity on the validation set is reported in Appendix, Figure 8.
By visually inspecting the clusters that correspond to pseudo-labels (see Figure 7 in Appendix) it is immediate to recognize that pseudo-label corresponds to 7s, , , and correspond to 3s, and and correspond to digits which are neither 7s nor 3s. We extracted the following rule that maps a bag of instance pseudo-labels into the corresponding sub-bag pseudo-label:
(6) 
From these rules it is also immediate to recognize that pseudo-label gets attached to the sub-bags that contain a seven and no three.
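The clustering step that yields the pseudo-labels can be sketched as follows; a toy 1-D k-means stands in for scikit-learn's KMeans over the learned bag-layer representations, and the decision-tree rule extraction on top is omitted.

```python
def kmeans_1d(values, k, iters=20):
    """Toy 1-D k-means: cluster scalar bag-layer activations into k groups."""
    centers = sorted(values)[:: max(len(values) // k, 1)][:k]  # spread initial centers
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            j = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[j].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def pseudo_label(v, centers):
    """A pseudo-label is just the index of the nearest cluster center."""
    return min(range(len(centers)), key=lambda i: abs(v - centers[i]))
```

In the actual procedure the clustered values are the (high-dimensional) instance and sub-bag representations, and a propositional learner then maps multisets of these pseudo-labels to the label one level up.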
Similarly, we extracted the following rule that maps a bag of subbag pseudolabels into the corresponding topbag label:
(7) 
Hence, in this example, the true rules behind the data generation process were perfectly recovered. Note that perfect recovery does not necessarily imply perfect fidelity, since the quantization due to clustering may lose some of the information that the neural network is allowed to encode into the distributed representations of instances and sub-bags. Nonetheless, in this experiment the classification accuracy of the interpretable model on the test set was , only less than the accuracy of the original model, and the fidelity was .
7.2 IMDB
In this section we show further interpretability results on IMDB (Maas et al., 2011), a standard benchmark movie review dataset for binary sentiment classification. We remark that this IMDB dataset differs from the IMDB graph datasets described in Section 8.2. IMDB consists of 25,000 training reviews, 25,000 test reviews and 50,000 unlabelled reviews. Positive and negative labels are balanced within the training and test sets. A review can be seen as a bag of sentences and each sentence as a bag of words. For this particular task it is reasonable to assume that, for a review to be positive, it is often sufficient to contain a positive sentence, and for a sentence to be positive it is often sufficient to contain a set of positive words.
A MMIL dataset was constructed from the reviews, in which a top-bag represents a bag of sentences and a sub-bag represents the bag of trigrams within a sentence. An instance represents a trigram, i.e. a triple obtained by concatenating a word with the previous and the next word. We used trigrams rather than single words in order to take into account possible negations, e.g. “not very good”, “not so bad”. Figure 5 depicts an example of the decomposition of a review into MMIL data.
Each word is represented with GloVe word vectors (Pennington et al., 2014) of size 100, trained on the dataset. Note that we used GloVe word vectors in order to compare our model with the state of the art (Miyato et al., 2016), but nothing prevents us from using the one-hot representation in this scenario as well. In order to compare MMIL against multi-instance (MI) learning, we also constructed a multi-instance dataset in which a review is simply represented as a bag of trigrams.
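The decomposition of a review into MMIL data can be sketched as follows; the function names are hypothetical, the period-based sentence splitting is a simplification, and the "PAD" boundary tokens mirror the PAD entries visible in Table 2.

```python
def to_trigrams(sentence):
    """Sub-bag of a sentence: one trigram per word, padded at the boundaries."""
    words = ["PAD"] + sentence.split() + ["PAD"]
    return [tuple(words[i - 1:i + 2]) for i in range(1, len(words) - 1)]

def review_to_mmil(review):
    """Top-bag of a review: one sub-bag of trigrams per sentence."""
    sentences = [s.strip() for s in review.split(".") if s.strip()]
    return [to_trigrams(s) for s in sentences]
```

Flattening the output of `review_to_mmil` into a single bag of trigrams yields the corresponding MI representation.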
We trained two neural networks for MMIL and MIL data respectively, which have the following structure:

MMIL network: a Conv1D layer with 300 filters, ReLU activations and kernel size of 100, two stacked bag-layers (with ReLU activations) with 500 units (250 max-aggregation, 250 mean-aggregation), and an output layer with sigmoid activation;

MIL network: a Conv1D layer with 300 filters, ReLU activations and kernel size of 100, one bag-layer (with ReLU activations) with 500 units (250 max-aggregation, 250 mean-aggregation), and an output layer with sigmoid activation.
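The bag-layer used in both architectures can be sketched as follows: a shared dense map is applied to every member of a bag, then the bag is aggregated element-wise, with half the units using max and half using mean as in the 250/250 split above. This is a numpy-free illustration with toy weights, not the trained model.

```python
def dense(x, W, b):
    """Shared per-member dense map with ReLU activation."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(w, x)) + bj)
            for w, bj in zip(W, b)]

def bag_layer(bag, W, b):
    """Aggregate a variable-size bag: first half of the units by max,
    second half by mean (mirroring the 250 max / 250 mean split)."""
    h = [dense(x, W, b) for x in bag]                        # map each member
    k = len(b) // 2
    agg_max = [max(col) for col in zip(*h)][:k]              # max units
    agg_mean = [sum(col) / len(col) for col in zip(*h)][k:]  # mean units
    return agg_max + agg_mean
```

Because the aggregation is over the bag dimension, the output size is independent of the bag's cardinality, which is what lets bag-layers be stacked across nesting levels.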
The models were trained by minimizing the binary cross-entropy loss. We ran 20 epochs of the Adam optimizer with learning rate 0.001, on mini-batches of size 128. We also used virtual adversarial training (Miyato et al., 2016) to regularize the networks and exploit the unlabelled reviews during the training phase. Although our model does not outperform the state of the art (Miyato et al. (2016)), we obtained a final accuracy of for the MMIL network and for the MIL network. These results show that the MMIL representation yields slightly better results than the MIL representation. Moreover, outperforming the state of the art is outside the scope of this experiment, which aims to show interpretable results.
We now show interpretability results on this dataset. Similarly to Section 7.1, we learnt pseudo-labels and rules for both the MMIL model and the MIL model. Using 2,500 reviews as validation set, we obtained 4 and 5 pseudo-labels for sub-bags and instances respectively in the MMIL case, and 6 pseudo-labels in the MIL case. A heatmap showing the fidelity on the validation set is reported in Appendix, Figure 12. For the MMIL case we report sentences and words in Tables 1 and 2, while for the MIL case we report the words in Table 3.
MMIL Case
We extracted the following rule which maps a bag of instance pseudolabels into the corresponding subbag pseudolabel:
(8) 
Similarly, we extracted the following rule that maps a bag of subbag pseudolabels into the corresponding topbag label:
(9) 
Note that and are not used for constructing the rules which map instance pseudolabels to subbag pseudolabels, and is not used for constructing the rules which map subbag pseudolabels to topbag labels.
MIL Case
We extracted the following rule which maps a bag of instance pseudolabels into the corresponding topbag label:
(10) 
Note that , , , and are not used for constructing the rules which map instance pseudo-labels to top-bag labels. Finally, by classifying IMDB using the rules and pseudo-labels, we achieved a test-set accuracy of for the MMIL case and for the MIL case. Fidelities for the MMIL and MIL cases were and , respectively.
Although the rules and Tables 1 and 2 for MMIL, and the rules and Table 3 for MIL, explain the networks, they might be hard to read. We therefore also show the alternative approach described in Section 5, in which we explain single predictions; this may help the reader better understand the benefits of the proposed interpretable models. We start by assigning a different color to each pseudo-label. We report the results on two reviews: one classified correctly by the MMIL interpretable model and misclassified by the MIL interpretable model (Example 7.1), and one misclassified by the MMIL interpretable model and classified correctly by the MIL interpretable model (Example 7.2).
For the MMIL interpretable model we color in turn the sentences and the trigrams which activate a particular rule, while for the MIL interpretable model we color only the trigrams which activate a particular rule. Sentences which do not activate any rule are not reported (for the sake of readability). We also report the fired rules. Having access to the sentences and not just the trigrams helps the interpretation, even for misclassified examples. Indeed, in Example 7.2, by reading the sentences (and then the trigrams) we can easily understand why the example is misclassified; by reading only the trigrams this is much harder.
Example 7.1.
We report a positive review which MMIL classified correctly and MIL misclassified using the respective rules.
For MMI the fired rule for sentences is . The only sentence belonging to is “Bloody Birthday a pretty mediocre title for the film was a nice lil surprise”. The sentences belonging to are: “And I may say it’s also one of the best flicks I’ve seen with kids as the villains”, “It’s a really solid 80s horror flick but how these kids are getting away with all this mayhem and murder is just something that you can’t not think about”, “It’s a very recommendable and underrated 80s horror flick”.
For MMI the fired rules for trigrams are:

. The trigrams belonging to are “film was a”, “a nice lil”. The trigrams belonging to are “Birthday a pretty”, “a pretty mediocre”, “pretty mediocre title”;

. The trigrams belonging to are “one of the”, “of the best”, “the best flicks”;

. The trigrams belonging to are “It’s a really”, “a really solid”, “really solid 80s”. The only trigram belonging to is “you can’t not”;

The trigrams belonging to are “a very recommendable”, “very recommendable and”, “recommendable and underrated”, “and underrated 80s”.
For MI the fired rule for trigrams is . The trigrams belonging to are “Birthday a pretty”, “a pretty mediocre”, “pretty mediocre title”, “to die in”, “die in horrible”, “in horrible fashion”. The trigrams belonging to are “one of the”, “of the best”,“the best flicks”, “It’s a really”, “a really solid”, “the less than”,“a very recommendable”, “very recommendable and”, “recommendable and underrated”, “and underrated 80s”.
Example 7.2.
We report a positive review which MMIL misclassified and MIL classified correctly using the respective rules.
For MMI the fired rule for sentences is . The only sentence belonging to is “The mental patients are all a little eye rolling by the Judge but my favorite was the old crazy biddy Rhea”. The only sentence belonging to is “The storyline is okay at best and the acting is surprisingly alright but after awhile it’s gets to be a little much”.
For MMI the fired rules for trigrams are:

. The trigrams belonging to are “Judge but my”, “but my favorite”, “my favorite was”;

. The only trigram belonging to is “at best and”. The trigrams belonging to are “is okay at”, “okay at best”, “surprisingly alright but”, “a little much”, “little much PAD”. The only trigram belonging to is “The storyline is”.
For MI the fired rule for trigrams is . The trigrams belonging to are “Judge but my”, “okay at best”, “at best and”, “acting is surprisingly”, “But still it’s”, “still it’s fun”, “it’s fun quirky”, “fun quirky strange”. The trigrams belonging to are “The storyline is”, “storyline is okay”, “is okay at”.
Table 1: example sentences for the sub-bag pseudo-labels of the MMIL model (one column per pseudo-label).
overrated poorly written badly acted  I highly recommend you to NOT waste your time on this movie as I have  I loved this movie and I give it an 8/ 10  It’s not a total waste 
It is badly written badly directed badly scored badly filmed  This movie is poorly done but that is what makes it great  Overall I give this movie an 8/ 10  horrible god awful 
This movie was poorly acted poorly filmed poorly written and overall horribly executed  Although most reviews say that it isn’t that bad i think that if you are a true disney fan you shouldn’t waste your time with…  final rating for These Girls is an 8/ 10  Awful awful awful 
Poorly acted poorly written and poorly directed  I’ve always liked Madsen and his character was a bit predictable but this movie was definitely a waste of time both to watch and make…  overall because of all these factors this film deserves an 8/ 10 and stands as my favourite of all the batman films  junk forget it don’t waste your time etc etc 
This was poorly written poorly acted and just overall boring  If you want me to be sincere The Slumber Party Massacre Part 1 is the best one and all the others are a waste of…  for me Cold Mountain is an 8/ 10  Just plain god awful 
Table 2: trigrams for the instance pseudo-labels of the MMIL model (one column per pseudo-label).
PAD 8/ 10  trash 2 out  had read online  it’s pretty poorly  give this a 
an 8/ 10  to 2 out  had read user  save this poorly  like this a 
for 8/ 10  PAD 2 out  on IMDb reading  for this poorly  film is 7 
HBK 8/ 10  a 2 out  I’ve read innumerable  just so poorly  it an 11 
Score 8/ 10  3/5 2 out  who read IMDb  is so poorly  the movie an 
to 8/ 10  2002 2 out  to read IMDb  were so poorly  this movie an 
verdict 8/ 10  garbage 2 out  had read the  was so poorly  40 somethings an 
Obscura 8/ 10  Cavern 2 out  I’ve read the  movie amazingly poorly  of 5 8 
Rating 8/ 10  Overall 2 out  movie read the  written poorly directed  gave it a 
it 8/ 10  rating 2 out  Having read the  was poorly directed  give it a 
fans 8/ 10  film 2 out  to read the  is very poorly  rating it a 
Hero 8/ 10  it 2 out  I read the  It’s very poorly  rated it a 
except 8/ 10  score 2 out  film reviews and  was very poorly  scored it a 
Tracks 8/ 10  Grade 2 out  will read scathing  a very poorly  giving it a 
vote 8/ 10  Just 2 out  PAD After reading  very very poorly  voting it a 
as 8/ 10  as 2 out  about 3 months  Poorly acted poorly  are reasons 1 
strong 8/ 10  and 2 out  didn’t read the  are just poorly  it a 8 
rating 8/ 10  rated 2 out  even read the  shown how poorly  vote a 8 
example 8/ 10  Rating 2 out  have read the  of how poorly  a Vol 1 
… 8/ 10  conclusion 2 out  the other posted  watching this awful  this story an 
Table 3: trigrams for the instance pseudo-labels of the MIL model (one column per pseudo-label).
production costs PAD  give it a  only 4/10 PAD  is time wellspent  … 4/10 …  PAD Recommended PAD 
all costs PAD  gave it a  score 4/10 PAD  two weeks hairdressing  .. 1/10 for  Highly Recommended PAD 
its costs PAD  rated it a  a 4/10 PAD  2 hours PAD  rate this a  Well Recommended PAD 
ALL costs PAD  rating it a  PAD 4/10 PAD  two hours PAD  gave this a  PAD 7/10 PAD 
possible costs PAD  scored it a  average 4/10 PAD  finest hours PAD  give this a  13 7/10 PAD 
some costs PAD  giving it a  vote 4/10 PAD  off hours PAD  rated this a  rate 7/10 PAD 
cut costs PAD  voting it a  Rating 4/10 PAD  few hours PAD  PAD Not really  .. 7/10 PAD 
rate this a  gave this a  .. 4/10 PAD  slow hours PAD  4/10 Not really  this 7/10 PAD 
gave this a  give this a  is 4/10 PAD  three hours PAD  a 4/10 or  Score 7/10 PAD 
rating this a  rate this a  this 4/10 PAD  final hours PAD  of 4/10 saying  solid 7/10 PAD 
give this a  giving this a  of 4/10 PAD  early hours PAD  rate it a  a 7/10 PAD 
and this an  gives this a  movie 4/10 PAD  six hours PAD  give it a  rating 7/10 PAD 
give this an  like this a  verdict 4/10 PAD  48 hours PAD  gave it a  to 7/10 PAD 
given this an  film merits a  gave 4/10 PAD  4 hours PAD  given it a  viewing 7/10 PAD 
gave this an  Stupid Stupid Stupid  13 4/10 PAD  6 hours PAD  giving it a  it 7/10 PAD 
rating this an  PAD Stupid Stupid  disappointment 4/10 PAD  five hours PAD  scored it a  score 7/10 PAD 
rate this an  award it a  at 4/10 PAD  nocturnal hours PAD  award it a  movie 7/10 PAD 
all costs …  given it a  rating 4/10 PAD  17 hours PAD  Cheesiness 0/10 Crappiness  is 7/10 PAD 
all costs ..  makes it a  … 4/10 PAD  for hours PAD  without it a  drama 7/10 PAD 
PAD Avoid PAD  Give it a  rate 4/10 PAD  wasted hours PAD  deserves 4/10 from  Recommended 7/10 PAD 
8 Experiments on Graphs
We tested our model on two different graph tasks:

a node classification task, focused on real citation network datasets where data can be naturally decomposed into bags-of-bags (MMIL data) or bags (MIL data). The goal is to understand whether MMIL and MIL decompositions are reasonable representations for citation networks and whether the MMIL representation is more suitable than the MIL one. Finally, we compared our approach with state-of-the-art architectures and interpreted our results;

a graph classification task in which we tested our model on real social network graphs, where the data can easily be decomposed into bags-of-bags. We compared our approach against state-of-the-art architectures.
8.1 Citation Datasets
In Section 4 we described how graph data can be mapped into an MMIL structure. We show here a real application in which we decompose graphs into MMIL data. We also show a way of decomposing graphs into multi-instance data (abbreviated MIL data). We stress that our MIL setting does not assume any constraint on the latent labels, whereas standard multi-instance learning does. We considered three citation datasets from (Sen et al., 2008): Citeseer, Cora, and PubMed. Furthermore, we show interpretability results for PubMed (being the dataset with the fewest classes, it produces the fewest rules), as described in Section 5.
We view the datasets as graphs where nodes represent papers described by titles and abstracts, and edges are citation links. We treat the citation links as undirected edges, in order to have a setup as close as possible to earlier works (Kipf and Welling, 2016; Hamilton et al., 2017a). The goal is to classify the nodes of the graph.
We collected the years of publication of all the papers in each dataset via the provided unique ids. According to the years of publication, we split each dataset into training, validation, and test sets so as to have approximately of the nodes in the training set, in the validation set and in the test set. Hence for each dataset we chose two thresholds and , where . Training sets contain all the papers whose year of publication is at most , validation sets contain all the papers whose year of publication is greater than and at most , and test sets contain all the papers whose year of publication is greater than . Table 4 reports the statistics for each dataset, while Figure 10, in Appendix, depicts the distribution of the papers over years of publication for the three datasets.
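The temporal split can be sketched as follows; the actual threshold values per dataset are not reproduced here, so the 2000/2005 values in the usage below are placeholders, not the paper's thresholds.

```python
def temporal_split(papers, t1, t2):
    """Split papers by year of publication: train <= t1 < validation <= t2 < test."""
    train = [p for p in papers if p["year"] <= t1]
    val = [p for p in papers if t1 < p["year"] <= t2]
    test = [p for p in papers if p["year"] > t2]
    return train, val, test
```

For example, `temporal_split(papers, 2000, 2005)` (placeholder thresholds) puts everything published up to 2000 in training, 2001 to 2005 in validation, and later papers in test.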
MMIL data was constructed from the citation networks so that a top-bag corresponds to a paper, represented as a bag of nodes: the neighborhood of a node (including the node itself). A sub-bag represents the bag of words corresponding to the text (i.e. title and abstract) attached to a node. An instance is a word. Conversely, MIL data was constructed so that a bag collects the bags of words of the neighborhood of a node (including the node itself). Note that in general the cardinalities of bags in MMIL data and MIL data differ. Words are encoded as one-hot vectors, as we want to show the capability of our model to learn intermediate representations of bags from scratch. Figure 6 shows an example of the MMIL and MIL decompositions starting from a node of a citation graph and its neighborhood.
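The two decompositions of a single node can be sketched as follows; the function names, the adjacency-list representation and the word dictionaries are hypothetical illustrations of the construction just described.

```python
def node_to_mmil(node, adj, words):
    """MMIL view: top-bag = one sub-bag of words per paper in the
    neighborhood of `node` (including the node itself)."""
    return [words[v] for v in [node] + adj[node]]

def node_to_mil(node, adj, words):
    """MIL view: a single flat bag collecting all words of the node
    and its neighbors."""
    return [w for v in [node] + adj[node] for w in words[v]]
```

The MMIL view preserves the paper boundaries inside the neighborhood, while the MIL view discards them; this is exactly the structural difference the experiments probe.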
The MMIL model has two stacked bag-layers with ReLU activations and 250 units, while the MIL model has one bag-layer with ReLU activations and 250 units. For both MMIL and MIL we propose two versions, which differ only in the aggregation function of the bag-layers: one using max and the other using mean. All models were trained by minimizing the softmax cross-entropy loss. We ran 100 epochs of the Adam optimizer with learning rate 0.001 and early stopped the training according to the loss on the validation set.
As baselines, we considered naïve Bayes and logistic regression. For this scenario we reduced the task to a standard classification problem in which examples are papers and labels are the categories associated with papers. As feature vectors we simply used bags of words for both naïve Bayes (Bernoulli) and logistic regression. We also compared our models against GCN (Kipf and Welling, 2016) and GraphSAGE (Hamilton et al., 2017a), which are briefly described in Section 6. While GCN represents nodes with bags of words, GraphSAGE exploits the sentence embedding approach described by (Arora et al., 2016). For comparability, and given that bags of words are the most challenging, standalone representation that does not rely on any embedding of words, we encoded the nodes as bags of words for both GCN and GraphSAGE. As GraphSAGE allows both max and mean as aggregation functions, we compared our models against both versions.
Results in Table 5 report the accuracy of the MMIL networks, MIL networks, GCN, GraphSAGE, naïve Bayes and logistic regression. The MMIL network outperforms all the other methods. Note that MIL networks also provide reasonable results compared to the other methods, although their results on Cora and Citeseer are worse than MMIL's and their results on PubMed are slightly worse than MMIL's. We remark that our framework is not specifically designed for graphs, contrary to GraphSAGE and GCN.
Dataset  # Classes  # Nodes  # Edges  # Training  # Validation  # Test 

Citeseer  6  3,327  4,732  1,560  779  988 
Cora  7  2,708  5,429  1,040  447  1,221 
PubMed  3  19,717  44,338  8,289  3,087  8,341 
We denote the year of publication. The 6 Citeseer classes are Agents, Artificial Intelligence (AI), Database (DB), Human-Computer Interaction (HCI), Information Retrieval (IR), and Machine Learning (ML). The 7 Cora classes are Case Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, and Theory. The 3 PubMed classes are Diabetes Mellitus Experimental (DME), Diabetes Mellitus Type 1 (DMT1), and Diabetes Mellitus Type 2 (DMT2).
Model  Cora  Citeseer  PubMed 

Naive Bayes (Bernoulli)  71.34%  63.77%  75.47% 
Logistic Regression  74.94%  64.37%  73.67% 
GCN (Kipf and Welling, 2016)  82.23%  66.50%  78.66% 
GraphSage (Hamilton et al., 2017a) MeanPool  80.18%  66.19%  75.59% 
GraphSage (Hamilton et al., 2017a) MaxPool  80.43%  67.61%  76.60% 
MIMean  79.93%  62.96%  81.15% 
MIMax  81.08%  67.41%  80.22% 
MMIMean  82.80%  70.75%  81.27% 
MMIMax  84.03%  69.64%  80.65% 
8.2 Social Datasets
We tested our model on a slightly different graph scenario, using six publicly available datasets first proposed by Yanardag and Vishwanathan (2015). Although this problem might seem similar to the citation dataset classification described in Section 8.1, the task here is to classify whole graphs rather than nodes.

COLLAB is a dataset where each graph represents the ego-network of a researcher, and the task is to determine the field of study of the researcher among High Energy Physics, Condensed Matter Physics, and Astro Physics.

IMDB-BINARY and IMDB-MULTI are datasets derived from IMDB where, in each graph, the vertices represent actors/actresses and the edges connect people who have performed in the same movie. Collaboration graphs are generated from movies belonging to the genres Action and Romance for IMDB-BINARY and Comedy, Romance, and Sci-Fi for IMDB-MULTI, and for each actor/actress in those genres an ego-graph is extracted. The task is to identify the genre from which the ego-graph was generated.

REDDIT-BINARY, REDDIT-MULTI-5K, and REDDIT-MULTI-12K are datasets where each graph is derived from a discussion thread on Reddit. In these datasets each vertex represents a distinct user, and two users are connected by an edge if one of them has responded to a post of the other in that discussion. The task in REDDIT-BINARY is to discriminate between threads originating from a discussion-based subreddit (TrollXChromosomes, atheism) and from a question/answer-based subreddit (IAmA, AskReddit).
The task in REDDIT-MULTI-5K and REDDIT-MULTI-12K is a multi-class classification problem where each graph is labeled with the subreddit in which it originated (worldnews, videos, AdviceAnimals, aww, mildlyinteresting for REDDIT-MULTI-5K and AskReddit, AdviceAnimals, atheism, aww, IAmA, mildlyinteresting, Showerthoughts, videos, todayilearned, worldnews, TrollXChromosomes for REDDIT-MULTI-12K).
We built MMIL data from each dataset by treating each graph as a top-bag. Each node of the graph together with its neighborhood is a sub-bag, while an instance is a node.
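This decomposition can be sketched as follows; the adjacency-dictionary input and the use of node ids as instances are illustrative assumptions (in the experiments each instance carries the degree-based feature vector described next).

```python
def graph_to_mmil(adj):
    """Top-bag for a whole graph: one sub-bag per node, containing the
    node itself followed by its neighbors (instances are node ids here)."""
    return [[v] + sorted(adj[v]) for v in sorted(adj)]
```

Each graph thus becomes a single labeled top-bag, matching the whole-graph classification task of this section.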
None of these six datasets has features attached to the nodes, as was instead the case for the citation datasets in Section 8.1. As features we therefore used a representation of the degree of the nodes. Let be the degree of a node and let be the maximum degree of the graph. The representation associated to is defined as follows:
(11) 
where . With this representation, the scalar product of two vectors and will be high if the nodes and (associated to and respectively) have similar degrees, and low if their degrees are far apart.
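A hypothetical realization of a degree encoding with the stated dot-product property is sketched below; the Gaussian form and the width sigma are assumptions for illustration, not the paper's Eq. (11).

```python
import math

def degree_features(d, d_max, sigma=1.0):
    """One assumed encoding of degree d: a vector indexed by 0..d_max whose
    entries decay with the distance from d, so that dot products are large
    for similar degrees (sigma is an assumed width, not from the paper)."""
    return [math.exp(-((i - d) ** 2) / (2 * sigma ** 2))
            for i in range(d_max + 1)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
```

Under this sketch, nodes of degree 5 and 6 get strongly overlapping vectors, while degrees 5 and 0 barely overlap, which is the similarity behavior described above.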
The MMIL networks have the same structure for all the datasets: a dense layer with 500 nodes and ReLU activation, two stacked bag-layers with 500 units (250 max units and 250 mean units), and a dense layer with nodes and linear activation. is for COLLAB, for IMDB-BINARY, for IMDB-MULTI, for REDDIT-BINARY, for REDDIT-MULTI-5K, and for REDDIT-MULTI-12K. We performed 10 times 10-fold cross-validation, training the MMIL networks by minimizing the binary cross-entropy loss (for REDDIT-BINARY and IMDB-BINARY) and the softmax cross-entropy loss (for COLLAB, IMDB-MULTI, REDDIT-MULTI-5K, and REDDIT-MULTI-12K). We ran 100 epochs of the Adam optimizer with learning rate 0.001 on mini-batches of size 20.
We compared our method against DGK (Yanardag and Vishwanathan, 2015), PatchySAN (Niepert et al., 2016a), and SAEN (Orsini et al., 2018).
Results in Table 6 show that MMIL networks outperform the other methods on COLLAB, IMDB-BINARY, and IMDB-MULTI, and obtain competitive results on REDDIT-BINARY, REDDIT-MULTI-5K, and REDDIT-MULTI-12K.
Dataset  DGK  PatchySAN  SAEN  Our Method 

COLLAB  73.09 ± 0.25  72.60 ± 2.15  78.50 ± 0.69  79.46 ± 0.31 
IMDB-BINARY  66.96 ± 0.56  71.00 ± 2.29  71.59 ± 1.20  72.62 ± 1.04 
IMDB-MULTI  44.55 ± 0.52  45.23 ± 2.84  48.53 ± 0.76  49.42 ± 0.68 
REDDIT-BINARY  78.04 ± 0.39  86.30 ± 1.58  87.22 ± 0.80  86.54 ± 0.64 
REDDIT-MULTI-5K  41.27 ± 0.18  49.10 ± 0.70  53.63 ± 0.51  53.42 ± 0.67 
REDDIT-MULTI-12K  32.22 ± 0.10  41.32 ± 0.42  47.27 ± 0.42  45.25 ± 0.48 
9 Conclusions
We have introduced the MMIL framework for handling data organized in nested bags. The MMIL setting allows for a natural hierarchical organization of data, where components at different levels of the hierarchy are unconstrained in their cardinality. We have identified several learning problems that can be naturally expressed as MMIL problems. For instance, image, text and graph classification are promising application areas, because here the examples can be objects of varying structure and size, for which a bag-of-bag data representation is quite suitable and can provide a natural alternative to graph kernels or convolutional networks for graphs. Furthermore, we proposed a new way of thinking in terms of interpretability. Although some MIL models can be easily interpreted by exploiting the learnt instance labels and the assumed rule, MMIL networks can be interpreted at a finer level: by removing the common assumptions of standard MIL, we are more flexible and can first associate labels to instances and sub-bags and then combine them in order to extract new rules. Finally, we proposed a different perspective on convolutions over graphs. In most neural-network-for-graphs approaches, convolutions can be interpreted as a message-passing scheme, while our approach provides a decomposition scheme.
We proposed a neural network architecture involving the new construct of bag-layers for learning in the MMIL setting. Theoretical results show the expressivity of this type of model. In the empirical results we have shown that learning MMIL models from data is feasible, and that the theoretical capabilities of MMIL networks can be exploited in practice, e.g., to learn accurate models for noiseless data. Furthermore, MMIL networks can be applied in a wide spectrum of scenarios, such as text, images, and graphs. For the latter we showed that MMIL is competitive with state-of-the-art models on node and graph classification tasks, and, in many cases, MMIL models outperform the others.
In this paper, we have focused on the setting where whole bagsofbags are to be classified. In conventional MIL learning, it is also possible to define a task where individual instances are to be classified. Such a task is however less clearly defined in our setup since we do not assume to know the label spaces at the instance and subbag level, nor the functional relationship between the labels at the different levels.
Appendix A: Experiments
Semi-synthetic dataset
Layer  Parameters 

Convolutional Layer  kernel size with 32 channels 
Batch Normalization  
ReLU  
Max Pooling  kernel size 
Dropout  probability 0.5 
Convolutional Layer  kernel size with 64 channels 
Batch Normalization  
ReLU  
Max Pooling  kernel size 
Dropout  probability 0.5 
Dense  units 
ReLU  
Dropout  probability 0.5 
BagLayer (ReLU activation)  units 
ReLU  
BagLayer (ReLU activation)  units 
ReLU  
Dense  unit 
Citation Datasets
We show here interpretability results for the citation datasets presented in Section 8.1. Similarly to Sections 7.1 and 7.2, we learnt pseudo-labels and rules for both the MMIL and MIL models. Since PubMed has the fewest classes (and hence the fewest rules to read), we show interpretability results for this dataset, for the model in which bag-layers aggregate with the mean.
The optimal number of pseudo-labels for the MMIL model turned out to be 3 () and 5 () for sub-bags and instances, respectively. On the other hand, the optimal number of pseudo-labels for the MIL model turned out to be 3 () for the instances. Figure 11 depicts a heatmap showing the fidelities on the validation set as a function of the number of pseudo-labels for both instances and sub-bags in the MMIL model.
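A hedged sketch of how pseudo-labels of this kind can be obtained: cluster the learned bag-layer representations and use the cluster index as the pseudo-label. The snippet below shows only the assignment step against given centroids; the centroids themselves would come from, e.g., k-means on the learned representations, and all names here are hypothetical rather than taken from the paper's code.

```python
def nearest_centroid(representation, centroids):
    """Assign a learned representation to the index of its nearest centroid;
    that index serves as the pseudo-label of the instance or sub-bag."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sqdist(representation, centroids[k]))


# Three hypothetical centroids in a 2-d representation space.
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
print(nearest_centroid([1.0, 1.0], centroids))  # 0
print(nearest_centroid([9.0, 1.0], centroids))  # 1
```

Varying the number of centroids and measuring how faithfully the resulting rules reproduce the network's predictions is what produces fidelity curves like the heatmap in Figure 11.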
The MMIL decomposition of the citation datasets leads to a special scenario in which we actually know the true labels of the sub-bags: sub-bags are papers which, in the case of PubMed, are associated with 1 out of 3 classes. Note that the number of sub-bag pseudo-labels in the MMIL case and the number of instance pseudo-labels in the MIL case exactly match the true number of labels.
In the MMIL case, by visually inspecting the sub-bags corresponding to the pseudo-labels (see Figure 9), it is immediate to recognize which pseudo-label corresponds to DMT1 papers, which to DME papers, and which to DMT2 papers. Furthermore, by visually inspecting the instances corresponding to the pseudo-labels in Table 9, it is immediate to recognize the “topics”. Words are sorted in descending order according to the score function, based on intra-cluster distance, described in Section 5. Similarly, for the MIL case the words are listed in Table 8. Below we report the rules for both the MMIL and the MIL cases.
MMIL Case
We extracted the following rule, which maps a bag of instance pseudo-labels into the corresponding sub-bag pseudo-label:
(12) 
Note that is not used for constructing the rules mapping instance pseudo-labels to sub-bag pseudo-labels. Similarly, we extracted the following rule, which maps a bag of sub-bag pseudo-labels into the corresponding top-bag label:
(13) 
MIL Case
We extracted the following rule, which maps a bag of instance pseudo-labels into the corresponding top-bag label:
(14) 
By classifying PubMed using the rules and pseudo-labels, we achieved test-set accuracies of for the MMIL case and for the MIL case. Fidelities for the MMIL and MIL cases were and , respectively. Both results are comparable to and competitive with the methods described in Table 5.
Although the MIL case has higher accuracy than the MMIL case, the MMIL interpretation provides more information when we explain individual examples (see Section 5). Indeed, if we look at the rule defining Diabetes Mellitus Experimental in the MIL case, we have , i.e., a paper belongs to the Diabetes Mellitus Experimental class if at least of its words are associated with pseudo-label . However, those words can be spread over the paper itself and the citing/cited papers, and with the MIL rules we are unable to distinguish between them. On the other hand, with the rules coming from the MMIL model we can explain the labels in more detail. If we look at Diabetes Mellitus Experimental in the MMIL case, we have , i.e., a paper belongs to the Diabetes Mellitus Experimental class if it cites or is cited by at most of papers of class and it cites or is cited by at least of papers of class . By applying the rules for and , it is easy to see that a paper belongs to class if it mostly contains words related to Diabetes Mellitus Type 2, and to class if it mostly contains words related to Diabetes Mellitus Experimental.
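The rules discussed above are threshold conditions on the fraction of a bag's elements that carry a given pseudo-label. Since the extracted rule bodies are not reproduced here, the pseudo-label names and the threshold in the sketch below are hypothetical; it only illustrates how such a condition can be evaluated.

```python
from fractions import Fraction


def fraction_with_label(bag_labels, label):
    # Fraction of elements of the bag that carry the given pseudo-label.
    return Fraction(sum(1 for l in bag_labels if l == label), len(bag_labels))


def satisfies(bag_labels, label, threshold, at_least=True):
    """Evaluate an 'at least' (or 'at most') threshold rule on a bag
    of pseudo-labels."""
    f = fraction_with_label(bag_labels, label)
    return f >= threshold if at_least else f <= threshold


# Hypothetical MIL-style rule: positive iff at least 3/10 of the
# instances carry pseudo-label 'c1'.
bag = ['c1', 'c0', 'c1', 'c2']
print(satisfies(bag, 'c1', Fraction(3, 10)))  # True, since 2/4 >= 3/10
```

An MMIL rule of the kind shown above simply conjoins several such conditions, evaluated on the bag of sub-bag pseudo-labels rather than directly on the words.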
     

Table 8: words associated with the three instance pseudo-labels (MIL case).

animals | children | non
induction | juvenile | subjects
induced | multiplex | patients
experimental | hla | indians
rats | childhood | fasting
rat | adolescents | obesity
dogs | conventional | pima
caused | girls | american
days | ascertainment | mexican
strains | autoimmune | indian
bl | dr | mody
experiment | infusion | oral
untreated | child | bmi
wk | siblings | obese
sz | intensive | men
restored | healthy | prevalence
sciatic | paediatric | resistance
experimentally | spk | tolerance
sprague | boys | mutations
partially | sharing | igt
         

Table 9: words associated with the five instance pseudo-labels (MMIL case).

normalization | animals | non | subjects | children
greatly | experimental | indians | patients | multiplex
susceptibility | induced | pima | patient | ascertainment
lymphocytes | induction | obesity | individuals | conventional
pregnant | rats | oral | type | juvenile
always | dogs | fasting | analysis | girls
organ | made | mexican | sample | night
destruction | rat | obese | cascade | childhood
tx | strains | medication | otsuka | pittsburgh
contraction | bl | bmi | forearm | adolescents
antibodies | caused | mody | gdr | infusion
sequential | wk | indian | reported | denmark
tract | counteracted | tolerance | mmol | intensified
decarboxylase | partially | look | age | child
recipients | rabbits | index | gox | beef
livers | days | agents | dependent | sharing
mt | conscious | resistance | isoforms | knowing
cyclosporin | sciatic | maturity | meals | paediatric
lv | tubules | gk | score | unawareness
laboratories | myo | ii | affinities | pubert
IMDB
References
 Andrews et al. (2002) Andrews S, Tsochantaridis I, Hofmann T (2002) Support vector machines for multiple-instance learning. In: Advances in Neural Information Processing Systems, pp 561–568
 Arbeláez et al. (2011) Arbeláez P, Maire M, Fowlkes C, Malik J (2011) Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(5):898–916, DOI 10.1109/TPAMI.2010.161
 Arora et al. (2016) Arora S, Liang Y, Ma T (2016) A simple but toughtobeat baseline for sentence embeddings
 Atwood and Towsley (2016) Atwood J, Towsley D (2016) Diffusionconvolutional neural networks. In: Advances in Neural Information Processing Systems, pp 1993–2001
 Blei et al. (2003) Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. Journal of Machine Learning Research 3:993–1022, URL http://dl.acm.org/citation.cfm?id=944937
 Costa and De Grave (2010) Costa F, De Grave K (2010) Fast neighborhood subgraph pairwise distance kernel. In: Proceedings of the 26th International Conference on Machine Learning, Omnipress, pp 255–262
 De Raedt et al. (2008) De Raedt L, Demoen B, Fierens D, Gutmann B, Janssens G, Kimmig A, Landwehr N, Mantadelis T, Meert W, Rocha R, et al. (2008) Towards digesting the alphabetsoup of statistical relational learning

 De Raedt et al. (2008) De Raedt L, Frasconi P, Kersting K, Muggleton S (eds) (2008) Probabilistic inductive logic programming: theory and applications, Lecture Notes in Computer Science, vol 4911. Springer, Berlin
 Dietterich (2000) Dietterich TG (2000) Ensemble methods in machine learning. In: Multiple Classifier Systems, no. 1857 in Lecture Notes in Computer Science, Springer Berlin Heidelberg, pp 1–15
 Dietterich et al. (1997) Dietterich TG, Lathrop RH, Lozano-Pérez T (1997) Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence 89(1–2):31–71, DOI 10.1016/S0004-3702(96)00034-3
 Duvenaud et al. (2015) Duvenaud D, Maclaurin D, Aguilera-Iparraguirre J, Gómez-Bombarelli R, Hirzel T, Aspuru-Guzik A, Adams RP (2015) Convolutional networks on graphs for learning molecular fingerprints. arXiv:1509.09292 [cs, stat]

 Foulds and Frank (2010) Foulds J, Frank E (2010) A review of multi-instance learning assumptions. The Knowledge Engineering Review 25(01):1–25, DOI 10.1017/S026988890999035X
 Frasconi et al. (1998) Frasconi P, Gori M, Sperduti A (1998) A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks 9:768–786

 Fukushima (1980) Fukushima K (1980) Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36(4):193–202
 Gärtner et al. (2004) Gärtner T, Lloyd JW, Flach PA (2004) Kernels and distances for structured data. Machine Learning 57(3):205–232
 Getoor and Taskar (2007) Getoor L, Taskar B (eds) (2007) Introduction to statistical relational learning. MIT Press, Cambridge, Mass., URL http://www.loc.gov/catdir/toc/ecip079/2007000951.html
 Gori et al. (2005) Gori M, Monfardini G, Scarselli F (2005) A new model for learning in graph domains. In: Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, IEEE, vol 2, pp 729–734
 Griffiths and Steyvers (2004) Griffiths TL, Steyvers M (2004) Finding scientific topics. Proceedings of the National Academy of Sciences (101(suppl 1)):5228–5235
 Hamilton et al. (2017a) Hamilton W, Ying Z, Leskovec J (2017a) Inductive representation learning on large graphs. In: Advances in Neural Information Processing Systems, pp 1024–1034
 Hamilton et al. (2017b) Hamilton WL, Ying R, Leskovec J (2017b) Inductive representation learning on large graphs. In: Proc. of Neural Information Processing Systems, URL http://arxiv.org/abs/1706.02216
 Haussler (1999) Haussler D (1999) Convolution kernels on discrete structures. Tech. Rep. 646, Department of Computer Science, University of California at Santa Cruz
 Hornik et al. (1989) Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2(5):359–366
 Horváth et al. (2004) Horváth T, Gärtner T, Wrobel S (2004) Cyclic pattern kernels for predictive graph mining. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 158–167
 Hou et al. (2015) Hou L, Samaras D, Kurc TM, Gao Y, Davis JE, Saltz JH (2015) Efficient multiple instance convolutional neural networks for gigapixel resolution image classification. arXiv preprint

 Jaeger (1997) Jaeger M (1997) Relational Bayesian networks. In: Geiger D, Shenoy PP (eds) Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence (UAI-13), Morgan Kaufmann, Providence, USA, pp 266–273
 Kingma and Ba (2014) Kingma D, Ba J (2014) Adam: A method for stochastic optimization. arXiv:1412.6980 [cs]
 Kipf and Welling (2016) Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907
 Landwehr et al. (2010) Landwehr N, Passerini A, De Raedt L, Frasconi P (2010) Fast learning of relational kernels. Machine learning 78(3):305–342
 Lapuschkin et al. (2016a) Lapuschkin S, Binder A, Montavon G, Müller KR, Samek W (2016a) The lrp toolbox for artificial neural networks. Journal of Machine Learning Research 17(114):1–5, URL http://jmlr.org/papers/v17/15618.html
 Lapuschkin et al. (2016b) Lapuschkin S, Binder A, Montavon G, Müller KR, Samek W (2016b) The lrp toolbox for artificial neural networks. The Journal of Machine Learning Research 17(1):3938–3942

 LeCun et al. (1989) LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4):541–551
 Maas et al. (2011) Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Portland, Oregon, USA, pp 142–150
 Maron and Lozano-Pérez (1998) Maron O, Lozano-Pérez T (1998) A framework for multiple-instance learning. In: Advances in Neural Information Processing Systems, pp 570–576
 Maron and Ratan (1998) Maron O, Ratan AL (1998) Multiple-instance learning for natural scene classification. In: ICML, vol 98, pp 341–349
 Miyato et al. (2016) Miyato T, Dai AM, Goodfellow I (2016) Virtual adversarial training for semisupervised text classification
 Natarajan et al. (2008) Natarajan S, Tadepalli P, Dietterich TG, Fern A (2008) Learning first-order probabilistic models with combining rules. Annals of Mathematics and Artificial Intelligence 54(1–3):223–256, URL http://link.springer.com/article/10.1007/s10472-009-9138-5
 Neumann et al. (2012) Neumann M, Patricia N, Garnett R, Kersting K (2012) Efficient graph kernels by randomization. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp 378–393, URL http://link.springer.com/chapter/10.1007/9783642334603_30
 Niepert et al. (2016a) Niepert M, Ahmed M, Kutzkov K (2016a) Learning convolutional neural networks for graphs. In: International Conference on Machine Learning
 Niepert et al. (2016b) Niepert M, Ahmed M, Kutzkov K (2016b) Learning convolutional neural networks for graphs. arXiv:1605.05273
 Orsini et al. (2015) Orsini F, Frasconi P, De Raedt L (2015) Graph invariant kernels. In: Proceedings of the Twentyfourth International Joint Conference on Artificial Intelligence, pp 3756–3762
 Orsini et al. (2018) Orsini F, Baracchi D, Frasconi P (2018) Shift aggregate extract networks. Frontiers in Robotics and AI 5:42
 Passerini et al. (2006) Passerini A, Frasconi P, Raedt LD (2006) Kernels on prolog proof trees: Statistical learning in the ilp setting. Journal of Machine Learning Research 7(Feb):307–342

 Pennington et al. (2014) Pennington J, Socher R, Manning CD (2014) GloVe: Global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP), pp 1532–1543
 Rahmani et al. (2005) Rahmani R, Goldman SA, Zhang H, Krettek J, Fritts JE (2005) Localized content based image retrieval. In: Proceedings of the 7th ACM SIGMM international workshop on Multimedia information retrieval, ACM, pp 227–236
 Ramon and De Raedt (2000) Ramon J, De Raedt L (2000) Multi-instance neural networks
 Ribeiro et al. (2016) Ribeiro MT, Singh S, Guestrin C (2016) Why should i trust you?: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 1135–1144
 Richardson and Domingos (2006) Richardson M, Domingos P (2006) Markov logic networks. Machine Learning 62:107–136
 Samek et al. (2016) Samek W, Montavon G, Binder A, Lapuschkin S, Müller KR (2016) Interpreting the predictions of complex ML models by layer-wise relevance propagation. arXiv preprint arXiv:1611.08191
 Scarselli et al. (2009) Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2009) The graph neural network model. IEEE Transactions on Neural Networks 20(1):61–80
 Scott et al. (2005) Scott S, Zhang J, Brown J (2005) On generalized multiple-instance learning. International Journal of Computational Intelligence and Applications 5(01):21–35
 Sen et al. (2008) Sen P, Namata G, Bilgic M, Getoor L, Galligher B, Eliassi-Rad T (2008) Collective classification in network data. AI Magazine 29(3):93
 Shervashidze et al. (2009) Shervashidze N, Vishwanathan SVN, Petri T, Mehlhorn K, Borgwardt KM (2009) Efficient graphlet kernels for large graph comparison. In: AISTATS, vol 5, pp 488–495
 Shervashidze et al. (2011) Shervashidze N, Schweitzer P, Van Leeuwen EJ, Mehlhorn K, Borgwardt KM (2011) Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12:2539–2561
 Szegedy et al. (2016) Szegedy C, Ioffe S, Vanhoucke V, Alemi A (2016) Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv:1602.07261 [cs]
 Wang and Zucker (2000) Wang J, Zucker JD (2000) Solving multiple-instance problem: A lazy learning approach
 Weidmann et al. (2003) Weidmann N, Frank E, Pfahringer B (2003) A two-level learning method for generalized multi-instance problems. In: European Conference on Machine Learning, Springer, pp 468–479
 Yan et al. (2016) Yan Z, Zhan Y, Peng Z, Liao S, Shinagawa Y, Zhang S, Metaxas DN, Zhou XS (2016) Multi-instance deep learning: Discover discriminative local anatomies for bodypart recognition. IEEE Transactions on Medical Imaging 35(5):1332–1343
 Yanardag and Vishwanathan (2015) Yanardag P, Vishwanathan S (2015) Deep graph kernels. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 1365–1374
 Yang and Lozano-Perez (2000) Yang C, Lozano-Perez T (2000) Image database retrieval with multiple-instance learning techniques. In: Data Engineering, 2000. Proceedings. 16th International Conference on, IEEE, pp 233–243

 Yang et al. (2006) Yang C, Dong M, Hua J (2006) Region-based image annotation using asymmetrical support vector machine-based multiple-instance learning. In: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, IEEE, vol 2, pp 2057–2063
 Zha et al. (2008) Zha ZJ, Hua XS, Mei T, Wang J, Qi GJ, Wang Z (2008) Joint multi-label multi-instance learning for image classification. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, pp 1–8
 Zhou et al. (2005) Zhou ZH, Jiang K, Li M (2005) Multi-instance learning based Web mining. Applied Intelligence 22(2):135–147
 Zhou et al. (2012) Zhou ZH, Zhang ML, Huang SJ, Li YF (2012) Multi-instance multi-label learning. Artificial Intelligence 176(1):2291–2320, DOI 10.1016/j.artint.2011.10.002