Relational learning takes several different forms ranging from purely symbolic (logical) representations, to a wide collection of statistical approaches (De Raedt et al., 2008) based on tools such as probabilistic graphical models (Jaeger, 1997; De Raedt et al., 2008; Richardson and Domingos, 2006; Getoor and Taskar, 2007), kernel machines (Landwehr et al., 2010), and neural networks (Frasconi et al., 1998; Scarselli et al., 2009; Niepert et al., 2016a).
Multi-instance learning (MIL) is perhaps the simplest form of relational learning, where data consists of labeled bags of instances. Introduced in (Dietterich et al., 1997), MIL has attracted the attention of several researchers during the last two decades and has been successfully applied to problems such as image and scene classification (Maron and Ratan, 1998; Zha et al., 2008; Zhou et al., 2012), image annotation (Yang et al., 2006)2000; Rahmani et al., 2005), Web mining (Zhou et al., 2005), text categorization (Zhou et al., 2012) and diagnostic medical imaging (Hou et al., 2015; Yan et al., 2016). In classic MIL, labels are binary and bags are positive iff they contain at least one positive instance (existential semantics). For example, a visual scene with animals could be labeled as positive iff it contains at least one tiger. Various families of algorithms have been proposed for MIL, including axis parallel rectangles (Dietterich et al., 1997), diverse density (Maron and Lozano-Pérez, 1998), nearest neighbors (Wang and Zucker, 2000), neural networks (Ramon and De Raedt, 2000), and variants of support vector machines (Andrews et al., 2002).
In this paper, we extend the MIL setting by considering examples consisting of labeled nested bags of instances. Labels are observed for top-level bags, while instances and lower level bags have associated latent labels. For example, a potential offside situation in a soccer match can be represented by a bag of images showing the scene from different camera perspectives. Each image, in turn, can be interpreted as a bag of players with latent labels for their team membership and/or position on the field. We call this setting multi-multi-instance learning (MMIL), referring specifically to the case of bags-of-bags (the generalization to deeper levels of nesting is straightforward but, for the sake of simplicity, is not explicitly formalized in the paper). In our framework, we also relax the classic MIL assumption of binary instance labels, allowing categorical labels lying in a generic alphabet. This is important since MMIL with binary labels under the existential semantics would reduce to classic MIL after flattening the bag-of-bags.
We propose a solution to the MMIL problem based on neural networks with a special layer called a bag-layer. Unlike previous neural network approaches to MIL (Ramon and De Raedt, 2000), where predicted instance labels are aggregated by (a soft version of) the maximum operator, bag-layers aggregate internal representations of instances (or bags of instances) and can be naturally intermixed with other layers commonly used in deep learning. Bag-layers can in fact be interpreted as a generalization of convolutional layers followed by pooling.
The MMIL framework can be immediately applied to solve problems where examples are naturally described as bags-of-bags. For example, a text document can be described as a bag of sentences, where in turn each sentence is a bag of words. The range of possible applications of the framework is however larger. In fact, every structured data object can be recursively decomposed into parts, a strategy that has been widely applied in the context of graph kernels (see e.g., (Haussler, 1999; Gärtner et al., 2004; Passerini et al., 2006; Shervashidze et al., 2009; Costa and De Grave, 2010; Orsini et al., 2015)). Hence, MMIL is also applicable to supervised graph classification. Experiments on bibliographical and social network datasets confirm the practical viability of MMIL for these forms of relational learning.
As a further advantage, multi-multi-instance learning enables a particular way of interpreting the models by reconstructing instance and sub-bag latent variables. This makes it possible to explain the prediction for a particular data point, and to describe the structure of the decision function in terms of symbolic rules. Suppose we could recover the latent labels associated with instances or inner bags. These labels would provide useful additional information about the data since we could group instances (or inner bags) that share the same latent label and attach some semantics to these groups by inspection. For example, in the case of textual data, grouping words or sentences with the same latent label effectively discovers topics, and the decision of an MMIL text document classifier can be interpreted in terms of the discovered topics. In practice, even if we cannot recover the true latent labels, we may still derive pseudo-labels from patterns of hidden unit activations in the bag-layers.
The paper is organized as follows. In Section 2 we formally introduce the MMIL setting. In Section 3 we formalize bag-layers and the resulting neural network architecture for MMIL. In Section 4 we show how the MMIL perspective applies to supervised learning over graphs. In Section 5 we describe a technique for extracting rules from trained networks of bag-layers. In Section 6 we discuss some related works. In Section 7 we report experimental results on both a semi-synthetic and a real-world dataset, while in Section 8 we report experimental results on two different graph tasks. Finally, we draw some conclusions in Section 9.
2.1 Traditional multi-instance learning
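In the standard multi-instance learning (MIL) setting, data consists of labeled bags of instances. In the following, $\mathcal{X}$ denotes the instance space (it can be any set), $\mathcal{Y}$ the bag label space for the observed labels of example bags, and $\mathcal{Y}^{inst}$ the instance label space for the unobserved (latent) instance labels. For any set $A$, $\mathcal{M}(A)$ denotes the set of all multisets of $A$. An example in MIL is a pair $(x, y) \in \mathcal{M}(\mathcal{X}) \times \mathcal{Y}$, which we interpret as the observed part of an instance-labeled example $(x^{labeled}, y) \in \mathcal{M}(\mathcal{X} \times \mathcal{Y}^{inst}) \times \mathcal{Y}$. Here $x = \{x_1, \dots, x_n\}$ is thus a multiset of instances, and $x^{labeled} = \{(x_1, y_1), \dots, (x_n, y_n)\}$ a multiset of labeled instances.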
Examples are drawn from a fixed and unknown distribution $p(x^{labeled}, y)$. Furthermore, it is typically assumed that the label $y$ of an example is conditionally independent of the individual instances given their labels $y_1, \dots, y_n$, i.e. $p(y \mid (x_1, y_1), \dots, (x_n, y_n)) = p(y \mid y_1, \dots, y_n)$. In the classic setting, introduced in (Dietterich, 2000) and used in several subsequent works (Maron and Lozano-Pérez, 1998; Wang and Zucker, 2000; Andrews et al., 2002), the focus is on binary classification ($\mathcal{Y}^{inst} = \mathcal{Y} = \{0, 1\}$) and it is postulated that $y = \mathbb{1}\{0 < \sum_j y_j\}$ (i.e., an example is positive iff at least one of its instances is positive). More complex assumptions are possible and thoroughly reviewed in (Foulds and Frank, 2010). Supervised learning in this setting can be formulated in two ways: (1) learn a function $F: \mathcal{M}(\mathcal{X}) \to \mathcal{Y}$ that classifies whole examples, or (2) learn a function $f: \mathcal{X} \to \mathcal{Y}^{inst}$ that classifies instances and then use some aggregation function defined on the multiset of predicted instance labels to obtain the example label.
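The second formulation (classify instances, then aggregate) can be sketched in a few lines of Python. This is only an illustration: the instance classifier `classify_instance` and its threshold are hypothetical placeholders, and scalar features are assumed for brevity. The aggregation is the existential rule of the classic setting.

```python
from typing import Callable, Sequence


def classify_instance(x: float) -> int:
    """Hypothetical instance classifier: positive iff the feature exceeds 0.5."""
    return int(x > 0.5)


def classify_bag(
    bag: Sequence[float],
    instance_classifier: Callable[[float], int] = classify_instance,
) -> int:
    """Existential aggregation: a bag is positive iff some instance is positive."""
    return int(any(instance_classifier(x) == 1 for x in bag))


positive_bag = [0.1, 0.9, 0.3]   # contains one positive instance
negative_bag = [0.1, 0.2, 0.3]   # contains no positive instance
print(classify_bag(positive_bag), classify_bag(negative_bag))
```

Replacing `any` by another function of the multiset of predicted labels would give the more general assumptions mentioned above.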
2.2 Multi-multi-instance learning
In multi-multi-instance learning (MMIL), data consists of labeled nested bags of instances. When the level of nesting is two, an example is a labeled bag-of-bags $(x, y) \in \mathcal{M}(\mathcal{M}(\mathcal{X})) \times \mathcal{Y}$ drawn from a distribution $p(x, y)$. Deeper levels of nesting, leading to multi$^K$-instance learning, are conceptually easy to introduce, but we avoid them in the paper to keep our notation simple. We will also informally use the expression “bag-of-bags” to describe structures with two or more levels of nesting. In the MMIL setting, we call the elements of $\mathcal{M}(\mathcal{M}(\mathcal{X}))$ and $\mathcal{M}(\mathcal{X})$ top-bags and sub-bags, respectively.
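Now postulating unobserved labels for both the instances and the sub-bags, we interpret examples $(x, y)$ as the observed part of fully labeled data points $(x^{labeled}, y) \in \mathcal{M}(\mathcal{M}(\mathcal{X} \times \mathcal{Y}^{inst}) \times \mathcal{Y}^{sub}) \times \mathcal{Y}$, where $\mathcal{Y}^{sub}$ is the space of sub-bag labels. Fully labeled data points are drawn from a distribution $p(x^{labeled}, y)$.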
As in MIL, we make some conditional independence assumptions. Specifically, we assume that instance and sub-bag labels only depend on properties of the respective instances or sub-bags, and not on other elements in the nested multiset structure (thus excluding models for contagion or homophily, where, e.g., a specific label for an instance could become more likely if many other instances contained in the same sub-bag also have that label). Furthermore, we assume that labels of sub-bags and top-bags only depend on the labels of their constituent elements. Thus, for $y^{sub} \in \mathcal{Y}^{sub}$, and a bag of labeled instances $S^{labeled} = \{(x_1, y_1), \dots, (x_n, y_n)\}$ we have:
$$p(y^{sub} \mid S^{labeled}) = p(y^{sub} \mid y_1, \dots, y_n). \tag{1}$$
Similarly for the probability distribution of top-bag labels given the constituent labeled sub-bags.
In this example we consider bags-of-bags of handwritten digits (as in the MNIST dataset). Each instance (a digit) has its own latent class label in $\{0, \dots, 9\}$, whereas sub-bag (latent) and top-bag (observed) labels are binary. In particular, a sub-bag is positive iff it contains an instance of class 7 and does not contain an instance of class 3. A top-bag is positive iff it contains at least one positive sub-bag. Figure 1 shows a positive and a negative example.
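The labeling rule of this example can be written down directly. The sketch below assumes, for illustration only, that the latent digit classes are given as integers (in the actual setting they are unobserved and the instances are images); the two toy top-bags are made up.

```python
from typing import List


def sub_bag_label(digits: List[int]) -> int:
    """A sub-bag is positive iff it contains a 7 and does not contain a 3."""
    return int(7 in digits and 3 not in digits)


def top_bag_label(top_bag: List[List[int]]) -> int:
    """A top-bag is positive iff at least one of its sub-bags is positive."""
    return int(any(sub_bag_label(sub_bag) == 1 for sub_bag in top_bag))


# Toy bags-of-bags of digit classes (hypothetical data).
positive_example = [[1, 3, 7], [7, 7, 2], [0, 5]]   # second sub-bag is positive
negative_example = [[3, 7], [1, 2], [3, 3, 9]]      # no sub-bag has a 7 without a 3
print(top_bag_label(positive_example), top_bag_label(negative_example))
```

Note that flattening the top-bag would lose information here: `positive_example` contains a 3 somewhere, yet it is positive because the 3 and one of the 7s lie in different sub-bags.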
A top-bag can consist of a set of images showing a potential offside situation in soccer from different camera perspectives. The label of the bag corresponds to the referee decision $\mathcal{Y} = \{\text{offside}, \text{not offside}\}$. Each individual image can either settle the offside question one way or another, or be inconclusive. Thus, there are (latent) image labels $\mathcal{Y}^{sub} = \{\text{offside}, \text{not offside}, \text{inconclusive}\}$. Since no offside should be called when in doubt, the top-bag is labeled as ‘not offside’ if and only if it either contains at least one image labeled ‘not offside’, or all the images are labeled ‘inconclusive’. Images, in turn, can be seen as bags of player instances that have a label according to their relative position with respect to the potentially offside player of the other team. An image then is labeled ‘offside’ if all the players in the image are labeled ‘behind’; it is labeled ‘not offside’ if it contains at least one player labeled ‘in front’, and is labeled ‘inconclusive’ if it only contains players labeled ‘inconclusive’ or ‘behind’ (and at least one is ‘inconclusive’).
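The two-level labeling rule of the offside example can be sketched as follows. The string labels mirror the ones used in the text; the function names are hypothetical, and non-empty bags are assumed.

```python
from typing import List


def image_label(players: List[str]) -> str:
    """Label one image from its player labels: 'behind', 'in front', 'inconclusive'."""
    if any(p == "in front" for p in players):
        return "not offside"
    if all(p == "behind" for p in players):
        return "offside"
    return "inconclusive"  # only 'behind'/'inconclusive', at least one 'inconclusive'


def referee_decision(images: List[List[str]]) -> str:
    """Label the top-bag: no offside is called when in doubt."""
    labels = [image_label(players) for players in images]
    if "not offside" in labels or all(l == "inconclusive" for l in labels):
        return "not offside"
    return "offside"


clear_offside = [["behind", "behind"], ["behind", "inconclusive"]]
doubtful = [["behind", "inconclusive"], ["inconclusive"]]
print(referee_decision(clear_offside), "/", referee_decision(doubtful))
```

Unlike the classic existential semantics, both levels here use three-valued latent labels, which is why the problem does not reduce to ordinary MIL.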
In text categorization, the bag-of-words representation is often used to feed documents to classifiers. Each instance in this case consists of the indicator vector of words in the document (or a weighted variant such as TF-IDF). The MIL approach has been applied in some cases (Andrews et al., 2002) where instances consist of chunks of consecutive words and each instance is an indicator vector. A bag-of-bags representation could instead describe a document as a bag of sentences, and each sentence as a bag of word vectors (constructed for example using Word2vec or GloVe).
3 A network architecture for MMIL
3.1 Bag layers
We model the conditional distribution $p(y \mid x)$ with a neural network architecture that handles bags-of-bags of variable sizes by aggregating intermediate internal representations. For this purpose, we introduce a new layer called bag-layer. A bag-layer takes as input a bag of $m$-dimensional vectors $\{\phi_1, \dots, \phi_n\}$, and first computes $k$-dimensional representations
$$\rho_i = \alpha(w \phi_i + b) \tag{2}$$
using a weight matrix $w \in \mathbb{R}^{k \times m}$, a bias vector $b \in \mathbb{R}^{k}$, and an activation function $\alpha$ (such as ReLU, tanh, or linear). The bag-layer then computes its output as:
$$\mathrm{BL}(\{\phi_1, \dots, \phi_n\}; w, b) = \Xi_{i=1}^{n} \rho_i \tag{3}$$
where $\Xi$ is an element-wise aggregation operator (such as max or average). Both $w$ and $b$ are tunable parameters. Note that Equation 3 works with bags of arbitrary cardinality. A bag-layer is illustrated in Figure 2.
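A minimal forward pass of a bag-layer, written in plain Python so that it runs without dependencies, might look as follows. The weights, the toy bag and the function names are illustrative assumptions; a practical implementation would use a tensor library and learn the parameters.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def relu(z: float) -> float:
    return max(0.0, z)


def bag_layer(
    bag: Sequence[Vector],
    w: Sequence[Vector],
    b: Vector,
    activation: Callable[[float], float] = relu,
    aggregate: Callable[[List[float]], float] = max,
) -> Vector:
    """Map a non-empty bag of m-dimensional vectors to one k-dimensional vector."""
    # Per-element representations: activation(w @ phi + b) for every phi in the bag.
    rhos = [
        [activation(sum(w_rc * x_c for w_rc, x_c in zip(row, phi)) + b_r)
         for row, b_r in zip(w, b)]
        for phi in bag
    ]
    # Element-wise aggregation over the bag, one output value per unit.
    return [aggregate([rho[r] for rho in rhos]) for r in range(len(b))]


w = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]   # k = 3 units, m = 2 inputs
b = [0.0, 0.0, 0.0]
bag = [[1.0, 2.0], [3.0, 0.5]]
print(bag_layer(bag, w, b))                                          # max aggregation
print(bag_layer(bag, w, b, aggregate=lambda v: sum(v) / len(v)))     # average
```

Because the aggregation is symmetric, the output does not depend on the order in which the bag elements are listed, and bags of any positive size are handled by the same parameters.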
Networks with a single bag-layer can process bags of instances (as in the standard MIL setting). To solve the MMIL problem, two bag-layers are required. The bottom bag-layer aggregates over internal representations of instances; the top bag-layer aggregates over internal representations of sub-bags, yielding a representation for the entire top-bag. Writing $\mathrm{BL}(\cdot\,; w, b)$ for the output of a bag-layer with parameters $(w, b)$, the representation of each sub-bag $S_j = \{x_{j,1}, \dots, x_{j,|S_j|}\}$ would in this case be obtained as
$$\phi_j = \mathrm{BL}(\{x_{j,1}, \dots, x_{j,|S_j|}\}; w^{s}, b^{s}), \quad j = 1, \dots, n \tag{4}$$
and the representation of a top-bag would be obtained as
$$\phi = \mathrm{BL}(\{\phi_1, \dots, \phi_n\}; w^{t}, b^{t}) \tag{5}$$
where $(w^{s}, b^{s})$ and $(w^{t}, b^{t})$ denote the parameters used to construct sub-bag and top-bag representations. Note that different aggregation functions can also be evaluated in parallel. Moreover, nothing prevents us from intermixing bag-layers with standard neural network layers, thereby forming networks of arbitrary depth. In this case, each $x_{j,\ell}$ in Eq. (4) would simply be replaced by the last layer activation of a deep network taking $x_{j,\ell}$ as input. Of course the top-bag representation can itself be further processed by other layers. An example of the overall architecture is shown in Figure 3.
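Stacking two bag-layers to obtain a top-bag representation can be sketched as below. This is a simplified illustration with fixed ReLU activation and max aggregation; the parameter values and names are hypothetical and chosen only to make the computation easy to follow by hand.

```python
from typing import List, Sequence

Vector = List[float]


def bag_layer(bag: Sequence[Vector], w: Sequence[Vector], b: Vector) -> Vector:
    """ReLU bag-layer with max aggregation over a non-empty bag."""
    rhos = [
        [max(0.0, sum(wi * xi for wi, xi in zip(row, phi)) + bi)
         for row, bi in zip(w, b)]
        for phi in bag
    ]
    return [max(column) for column in zip(*rhos)]


def mmil_representation(top_bag, w_s, b_s, w_t, b_t) -> Vector:
    """Bottom layer aggregates instances per sub-bag; top layer aggregates sub-bags."""
    sub_bag_reprs = [bag_layer(sub_bag, w_s, b_s) for sub_bag in top_bag]
    return bag_layer(sub_bag_reprs, w_t, b_t)


# One-dimensional instances; the bottom layer has two units, the top layer one.
w_s, b_s = [[1.0], [-1.0]], [0.0, 0.0]
w_t, b_t = [[1.0, 1.0]], [0.0]
top_bag = [[[1.0], [-2.0]], [[0.5]]]   # two sub-bags of sizes 2 and 1
print(mmil_representation(top_bag, w_s, b_s, w_t, b_t))
```

The same two functions work for any number of sub-bags and any sub-bag sizes, since each level only requires a symmetric aggregation over its elements.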
3.2 Expressiveness of networks of bag-layers
We focus here on a deterministic (noiseless) version of the MMIL setting described in Section 2.2 where labels are deterministically assigned and no form of counting is involved. We show that under these assumptions, the architecture of Section 3.1 has enough expressivity to represent the solution to the MMIL problem. Our approach relies on classic universal interpolation results for neural networks (Hornik et al., 1989). Note that existing results hold for vector data, and this section shows that they can be extended to bag-of-bag data when using the architecture of Section 3.1.
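Definition 3.1. We say that data is generated under the deterministic MMIL setting if the following conditions hold true:
instance labels are generated by an unknown function $f: \mathcal{X} \to \mathcal{Y}^{inst}$, i.e., $y_{j,\ell} = f(x_{j,\ell})$, for $j = 1, \dots, n$, $\ell = 1, \dots, |S_j|$;
sub-bag labels are generated by an unknown function $g: \mathcal{M}(\mathcal{Y}^{inst}) \to \mathcal{Y}^{sub}$, i.e., $y_j = g(\{y_{j,1}, \dots, y_{j,|S_j|}\})$;
the top-bag label is generated by an unknown function $h: \mathcal{M}(\mathcal{Y}^{sub}) \to \mathcal{Y}$, i.e., $y = h(\{y_1, \dots, y_n\})$.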
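Note that the classic MIL formulation (Maron and Lozano-Pérez, 1998) is recovered when examples are sub-bags, $\mathcal{Y}^{inst} = \mathcal{Y}^{sub} = \{0, 1\}$, and the sub-bag labeling function is $g(\{y_1, \dots, y_n\}) = \mathbb{1}\{0 < \sum_j y_j\}$. Other generalized MIL formulations (Foulds and Frank, 2010; Scott et al., 2005; Weidmann et al., 2003) can be similarly captured in this deterministic setting.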
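For a multiset $s$ let $\mathrm{set}(s)$ denote the set of elements occurring in $s$. E.g. $\mathrm{set}(\{a, a, b\}) = \{a, b\}$.
We say that data is generated under the non-counting deterministic MMIL setting if, in addition to the conditions of Definition 3.1, both the sub-bag labeling function $g(s)$ and the top-bag labeling function $h(s)$ depend on their multiset argument $s$ only through $\mathrm{set}(s)$.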
The following result indicates that a network containing a bag-layer with max aggregation is sufficient to compute the functions that label both sub-bags and top-bags.
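Lemma 1. Let $\mathcal{C}$, $\mathcal{D}$ be sets of labels, and let $g: \mathcal{M}(\mathcal{C}) \to \mathcal{D}$ be a labelling function for which $g(s) = g(s')$ whenever $\mathrm{set}(s) = \mathrm{set}(s')$. Then there exists a network with one bag-layer that computes $g$.
Proof. We construct a network where first a bag-layer maps the multiset input $s$ to a bit-vector representation of $\mathrm{set}(s)$, on top of which we can then compute $g(s)$ using a standard architecture for Boolean functions.
In detail, the network is constructed as follows: writing $\mathcal{C} = \{c_1, \dots, c_m\}$, the input $s = \{y_1, \dots, y_n\}$ is encoded by $m$-dimensional vectors $\phi_1, \dots, \phi_n$ containing the one-hot representations of the $y_i$. We construct a bag-layer with $k = m$ units, where the weight matrix $w$ is the $m \times m$ identity matrix, the bias $b$ is zero, the activation $\alpha$ is the identity function, and the aggregation $\Xi$ is max. The output of the bag-layer then is an $m$-dimensional vector whose $i$'th component is the indicator function $\mathbb{1}\{c_i \in \mathrm{set}(s)\}$.
For each $d \in \mathcal{D}$ we can write the indicator function $\mathbb{1}\{g(s) = d\}$ as a Boolean function of the indicator functions $\mathbb{1}\{c_i \in \mathrm{set}(s)\}$. Using standard universal approximation results (see, e.g., (Hornik et al., 1989), Theorem 2.5) we can construct a network that on input $(\mathbb{1}\{c_i \in \mathrm{set}(s)\})_{i = 1, \dots, m}$ computes $\mathbb{1}\{g(s) = d\}$. $|\mathcal{D}|$ such networks in parallel then produce a $|\mathcal{D}|$-dimensional output vector containing the one-hot representation of $g(s)$. ∎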
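The construction used in the proof can be traced on a small example. In the sketch below, the bag-layer with identity weights, zero bias, identity activation and max aggregation is written out explicitly, while the second network is replaced, purely for illustration, by an ordinary Python function of the indicator bits (the rule "contains a 7 and no 3" is borrowed from the digit example). Non-empty bags of integer labels are assumed.

```python
from typing import List, Sequence, Tuple


def one_hot(label: int, num_labels: int) -> List[int]:
    return [int(i == label) for i in range(num_labels)]


def set_indicator(labels: Sequence[int], num_labels: int) -> Tuple[int, ...]:
    """Identity-weight bag-layer with max aggregation applied to one-hot inputs.

    Component i of the result is 1 iff label i occurs in the bag.
    """
    encoded = [one_hot(y, num_labels) for y in labels]
    return tuple(max(column) for column in zip(*encoded))


def seven_without_three(bits: Tuple[int, ...]) -> int:
    """Boolean function of the indicator bits (stand-in for the second network)."""
    return int(bits[7] == 1 and bits[3] == 0)


bits = set_indicator([7, 7, 1], 10)
print(bits, seven_without_three(bits))
```

Two bags with the same set of labels yield the same bit-vector, which is exactly why only non-counting labeling functions are covered by this argument.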
Theorem 3.2. Given a dataset of examples generated under the non-counting deterministic MMIL setting, there exists a network with two bag-layers that can correctly label all examples in the dataset.
Proof. We first note that the universal interpolation result of (Hornik et al., 1989) can be applied to a network taking as input an instance $x_{j,\ell}$ which appears in any data example, and generating the desired label $f(x_{j,\ell})$. We then use Lemma 1 twice, first to form a network that computes the sub-bag labeling function $g$, and then to form a network that computes the top-bag labeling function $h$. ∎
4 MMIL for graph learning
The MMIL perspective can also be used to derive algorithms suitable for supervised learning over graphs, i.e., tasks such as graph classification, node classification, and edge prediction. In all these cases, one first needs to construct a representation for the object of interest (a whole graph, a node, a pair of nodes) and then apply a classifier. A suitable representation can be obtained in our framework by first forming a bag-of-bags associated with the object of interest (a graph, a node, or an edge) and then feeding it to a network with bag-layers. In order to construct bags-of-bags, we follow the classic $R$-decomposition strategy introduced by Haussler (1999). In the present context, it simply requires us to introduce a relation $R(A, a)$ which holds true if $a$ is a “part” of $A$ and to form $R^{-1}(A) = \{a : R(A, a)\}$, the bag of all parts of $A$. Parts can in turn be decomposed in a similar fashion, yielding bags-of-bags. In the following, we focus on undirected graphs $G = (V, E)$ where $V$ is the set of nodes and $E$ is the set of edges. We also assume that a labelling function $\xi: V \to \mathcal{X}$ attaches attributes to vertices. Variants with directed graphs or labeled edges are straightforward and omitted here in the interest of brevity.
A simple solution is to define the part-of relation $R(G, g)$ between graphs to hold true iff $g$ is a subgraph of $G$ and to introduce a second part-of relation $S(g, v)$ that holds true iff $v$ is a node in $g$. The bag-of-bags associated with $G$ is then constructed as $x = \{\xi(S^{-1}(g)) : g \in R^{-1}(G)\}$ where $\xi(A)$ maps all elements of $A$ through function $\xi$. In general, considering all subgraphs is not practical but suitable feasible choices for $R$ can be derived borrowing approaches already introduced in the graph kernel literature, for example decomposing the graph into cycles and trees (Horváth et al., 2004), or into neighbors or neighbor pairs (Costa and De Grave, 2010) (some of these choices may require three levels of bag nesting, e.g., for grouping cycles and trees separately).
In some domains, the node labelling function itself is bag-valued. For example in a citation network, $\xi(v)$ could be the bag of words in the abstract of the paper associated with node $v$. A bag-of-bags in this case may be formed by considering a paper together with all papers in its neighborhood $N(v)$ (i.e., its cites and citations): $x(v) = \{\xi(u) : u \in \{v\} \cup N(v)\}$. A slightly richer description with three layers of nesting could be used to set apart a node and its neighborhood: $x(v) = \{\{\xi(v)\}, \{\xi(u) : u \in N(v)\}\}$.
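For the citation-network case, the two constructions can be sketched as follows. The tiny graph, the word bags and the function names are made up for illustration; bags are represented as Python lists.

```python
from typing import Dict, List

Words = List[str]


def two_level_bag(node: str, neighbors: Dict[str, List[str]],
                  words: Dict[str, Words]) -> List[Words]:
    """Bag-of-bags: the word bag of the node and of each paper in its neighborhood."""
    return [words[u] for u in [node] + neighbors[node]]


def three_level_bag(node: str, neighbors: Dict[str, List[str]],
                    words: Dict[str, Words]) -> List[List[Words]]:
    """Three levels of nesting: the node is set apart from its neighborhood."""
    return [[words[node]], [words[u] for u in neighbors[node]]]


# Hypothetical citation graph: paper "a" is linked to papers "b" and "c".
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
words = {
    "a": ["bag", "layer"],
    "b": ["graph", "kernel"],
    "c": ["neural", "network"],
}
print(two_level_bag("a", neighbors, words))
print(three_level_bag("a", neighbors, words))
```

In a real pipeline each word would be replaced by a feature vector before the structure is fed to a network of bag-layers.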
5 Interpreting networks of bag-layers
Interpreting the predictions in the supervised learning setting amounts to providing a human-understandable explanation of the prediction. Transparent techniques such as rules or trees retain much of the symbolic structure of the data and are well suited in this respect. On the contrary, predictions produced by methods based on numerical representations are often opaque, i.e., difficult to explain to humans. In particular, representations in neural networks are highly distributed, making it hard to disentangle a clear semantic interpretation of any specific hidden unit. Although many works exist that attempt to interpret neural networks, they mostly focus on specific application domains such as vision (Lapuschkin et al., 2016a; Samek et al., 2016).
The MMIL setting offers some advantages in this respect. Indeed, if instance or sub-bag labels were observed, they would provide more information about bags-of-bags than mere predictions. Latent variables are indeed associated with each individual “part” of the top-bag, as opposed to the prediction, which is associated with the whole. To clarify our vision, MIL approaches like mi-SVM and MI-SVM in (Andrews et al., 2002) are not equally interpretable: the former is more interpretable than the latter since it also provides individual instance labels rather than simply providing a prediction about the whole bag. These standard MIL approaches make two assumptions: first, all labels are binary; second, the relationship between the instance labels and the bag label is predefined to be the existential quantifier. In our case we relax these assumptions by allowing labels in a categorical alphabet and by allowing more complex mappings between bags of instance labels and sub-bag labels. Our approach may also provide a richer explanation due to the nested structure of the data as bags-of-bags. We follow the standard MIL approaches in that we also assume a deterministic mapping from component to bag labels, i.e., we assume the data can be modelled in the deterministic MMIL setting according to Definition 3.1.
The idea we propose in the following is based on four steps. First, we employ clustering at the level of instance and sub-bag representations to construct pseudo-labels as surrogates for hypothesized actual latent labels. Pseudo-labels obtained in this way are abstract symbols without any specific semantics. Hence, in the second step we provide semantic interpretations of the pseudo-labels for human inspection. Third, we apply a transparent learner to extract a human-readable representation of the mappings between pseudo-labels at the different levels of a bag-of-bags structure. Finally, we explain predictions for individual top-bag examples by exhibiting the relevant components and their pseudo-labels which determine the predicted top-bag label.
As before, for ease of exposition we assume in the following a two-level bag-of-bags structure. The method directly applies also to other nesting depths.
Clustering and pseudo-label construction
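Given labeled top-bag data $\{(x^{(i)}, y^{(i)}) : i = 1, \dots, N\}$ and a trained MMIL network, we consider the multisets of sub-bag and instance representations computed by the bag-layers:
$$\mathcal{S} = \{\rho^{(i)}_{j} : i = 1, \dots, N,\ j = 1, \dots, n^{(i)}\}, \qquad \mathcal{I} = \{\rho^{(i)}_{j,\ell} : i = 1, \dots, N,\ j = 1, \dots, n^{(i)},\ \ell = 1, \dots, |S^{(i)}_{j}|\}$$
where the $\rho^{(i)}_{j}$ and $\rho^{(i)}_{j,\ell}$ are the representations according to (2).
Given the numbers of clusters $k^{sub}$ and $k^{inst}$, we run a clustering procedure on $\mathcal{S}$ and on $\mathcal{I}$ (separately), obtaining clusters $C^{sub}_1, \dots, C^{sub}_{k^{sub}}$ and $C^{inst}_1, \dots, C^{inst}_{k^{inst}}$. We finally associate each sub-bag and each instance with the cluster indices of their representations, and use them as sub-bag and instance pseudo-labels.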
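The pseudo-label construction can be illustrated with a small, dependency-free clustering routine. The text only speaks of "a clustering procedure"; the use of k-means, the deterministic initialization from the first points, and the toy representations below are all assumptions made for this sketch.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Vector, centroids: Sequence[Vector]) -> int:
    """Index of the closest centroid; this index serves as the pseudo-label."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(points: List[Vector], k: int,
           iterations: int = 20) -> Tuple[List[List[float]], List[int]]:
    """Plain k-means, initialised with the first k points (an arbitrary choice)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        assignment = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(p, centroids) for p in points]


# Hypothetical bag-layer representations of four instances.
representations = [[0.0, 0.1], [5.0, 5.1], [0.1, 0.0], [5.1, 5.0]]
centroids, pseudo_labels = kmeans(representations, k=2)
print(pseudo_labels)
```

The same routine would be run separately on sub-bag representations; `nearest` can later assign a pseudo-label to the representation of an unseen instance.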
Clusters can be directly inspected in the attempt to attach some meaning to pseudo-labels. For example in the case of textual data, a human could inspect word clusters, similarly to what has been suggested in the area of topic modelling (Blei et al., 2003; Griffiths and Steyvers, 2004).
To facilitate inspection, we propose an approach to characterize clusters in terms of their most characteristic elements. To this end, we define a ranking of the elements in each cluster according to a score function based on intra-cluster distances. Consider a sub-bag whose bag-layer representation belongs to cluster . We define the score
where is the centroid of the th cluster. The procedure for ranking instances is analogous. We use the cluster elements with maximal score to illustrate and interpret the semantic nature of a cluster. Note that this is different from the more common approach of interpreting clusters by way of their centroids.
In some cases the cluster elements may be equipped with some true, latent label. In such cases we can alternatively characterize pseudo-labels in terms of their correspondence with these actual labels. An example of this will be seen in Section 7.1 below.
Learning interpretable rules
We next describe how we construct interpretable functions that approximate the actual (potentially noisy) relationships between pseudo-labels in the MMIL network.
Let us denote a bag of pseudo-labels as $\{u_1 : c_1, \dots, u_K : c_K\}$, where $c_i$ is the multiplicity of label $u_i$. An attribute-value representation of the bag can be immediately obtained by normalizing counts: $(f_1, \dots, f_K)$, where $f_i = c_i / \sum_j c_j$ is the frequency of the label $u_i$ in the bag. Another attribute-value representation of the bag can be obtained by using, for each label, a bit which indicates its presence or absence in the bag. Jointly with an output label $y$, this attribute-value representation provides one supervised example for a propositional learner such as a decision tree. In the two-level MMIL case, we learn in this way functions $\hat{g}$ and $\hat{h}$ mapping multisets of instance pseudo-labels to sub-bag pseudo-labels, and multisets of sub-bag pseudo-labels to top-bag labels, respectively (cf. Definition 3.1). In the second case, our target labels are the predicted labels of the original MMIL network, not the actual labels of the training examples. Thus, the objective is to construct rules that best explain the MMIL model, not the rules that provide the highest accuracy themselves.
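Both attribute-value encodings of a bag of pseudo-labels are easy to compute. In the sketch below pseudo-labels are assumed to be integers in `range(num_labels)`, and the function names are hypothetical; the resulting rows, paired with a target label, could then be handed to any decision-tree learner (not shown).

```python
from collections import Counter
from typing import List, Sequence


def frequency_features(pseudo_labels: Sequence[int], num_labels: int) -> List[float]:
    """Normalized counts: the relative frequency of each pseudo-label in the bag."""
    counts = Counter(pseudo_labels)
    total = len(pseudo_labels)
    return [counts[u] / total for u in range(num_labels)]


def presence_features(pseudo_labels: Sequence[int], num_labels: int) -> List[int]:
    """One bit per pseudo-label, indicating whether it occurs in the bag."""
    present = set(pseudo_labels)
    return [int(u in present) for u in range(num_labels)]


bag = [0, 2, 2, 2]   # a non-empty bag of pseudo-labels
print(frequency_features(bag, 3), presence_features(bag, 3))
```

The presence encoding discards multiplicities and therefore matches the non-counting setting, whereas the frequency encoding lets the extracted rules express simple forms of counting.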
The instance-level clustering defines a labeling function $\hat{f}$ by associating any (test) instance with the index of its nearest centroid. Taken together, the three functions (the instance-level function $\hat{f}$ and the two learned functions that map pseudo-labels to sub-bag pseudo-labels and to top-bag labels) provide a complete classification model for a top-bag based on the input features of its instances. We refer to the accuracy of this model with regard to the predictions of the original MMIL model as its fidelity.
We use fidelity on a validation set as the criterion to select the cardinalities of the instance and sub-bag pseudo-label sets (i.e., the numbers of clusters at the two levels) by performing a grid search over value combinations.
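Fidelity and the grid search over cluster numbers can be sketched as follows. The callable `validation_fidelity` is a hypothetical stand-in for the full pipeline (clustering, rule learning, evaluation on the validation set), and the numbers in the toy table are invented.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple


def fidelity(surrogate_predictions: Sequence[int],
             network_predictions: Sequence[int]) -> float:
    """Fraction of examples on which the extracted model agrees with the network."""
    agree = sum(s == n for s, n in zip(surrogate_predictions, network_predictions))
    return agree / len(network_predictions)


def select_cluster_numbers(
    instance_grid: Sequence[int],
    sub_bag_grid: Sequence[int],
    validation_fidelity: Callable[[int, int], float],
) -> Tuple[int, int]:
    """Grid search: return the (instance, sub-bag) pair with the highest fidelity."""
    return max(product(instance_grid, sub_bag_grid),
               key=lambda ks: validation_fidelity(*ks))


# Invented validation fidelities for a 2 x 2 grid of cluster numbers.
table: Dict[Tuple[int, int], float] = {
    (2, 2): 0.80, (2, 3): 0.90, (3, 2): 0.85, (3, 3): 0.90,
}
best = select_cluster_numbers([2, 3], [2, 3], lambda ki, ks: table[(ki, ks)])
print(fidelity([1, 0, 1, 1], [1, 0, 0, 1]), best)
```

Note that the reference in `fidelity` is the prediction of the network, not the ground-truth label, in line with the goal of explaining the model rather than maximizing accuracy.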
Explaining individual classifications
The classification provided by the extracted model (the composition of the instance-level, sub-bag-level and top-bag-level functions) for an input top-bag $x$ will often rely only on small subsets of sub-bags and instances contained in $x$ (cf. the classic multi-instance setting, where a positive classification can rely only on a single positive instance). We can therefore explain classifications for individual examples by exhibiting the critical substructures of $x$ that support the prediction. The details of this step are typically quite domain specific, and we will illustrate different versions of it in the experimental section.
6 Related Works
6.1 Multi-instance neural networks
Ramon and De Raedt (2000) proposed a neural network solution to MIL where each instance $x_j$ in a bag $x = \{x_1, \dots, x_n\}$ is first processed by a replica of a neural network $f$ with weights $w$. In this way, a bag of output values $\{f(x_1; w), \dots, f(x_n; w)\}$ is computed for each bag of instances. These values are then aggregated by a smooth version of the max function:
$$F(x) = \frac{1}{M} \log \left( \sum_{j} e^{M f(x_j; w)} \right)$$
where $M$ is a constant controlling the sharpness of the aggregation (the exact maximum is computed when $M \to \infty$). Recall that a single bag-layer (as defined in Section 3) can be used to solve the MIL problem. Still, a major difference compared to the work of (Ramon and De Raedt, 2000) is that bag-layers perform aggregation at the representation level rather than at the output level. In this way, more layers can be added on top of the aggregated representation, allowing for more expressiveness. In the classic MIL setting (where a bag is positive iff at least one instance is positive) this additional expressiveness is not required. However, it allows us to solve slightly more complicated MIL problems. For example, suppose each instance $x_j$ has a latent variable $y_j$, and suppose that a bag is positive iff it contains at least one instance with label $a$ and no instance with label $b$. In this case, a bag-layer with two units can distinguish positive and negative bags, provided that instance representations can separate instances belonging to the classes $a$ and $b$. The network proposed in (Ramon and De Raedt, 2000) would not be able to separate positive from negative bags. Indeed, as proved in Section 3.2, networks with bag-layers can represent any Boolean function over sets of instances.
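A smooth maximum with a sharpness constant can be sketched with the log-sum-exp function. That this is exactly the form used in the cited work is an assumption of this sketch; the name `smooth_max` and the toy values are illustrative.

```python
import math
from typing import Sequence


def smooth_max(values: Sequence[float], sharpness: float) -> float:
    """(1 / M) * log(sum_j exp(M * v_j)), with M = sharpness.

    The largest value is subtracted before exponentiating to avoid overflow.
    """
    m = max(values)
    total = sum(math.exp(sharpness * (v - m)) for v in values)
    return m + math.log(total) / sharpness


outputs = [0.2, 0.9, 0.5]   # hypothetical per-instance network outputs
for sharpness in (1.0, 10.0, 1000.0):
    print(sharpness, smooth_max(outputs, sharpness))
```

The result is always at least the true maximum and approaches it as the sharpness grows, which is the behaviour described in the text for the limiting case.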
6.2 Convolutional neural networks
It is easy to see that the representation computed by one convolutional layer followed by max-pooling can be emulated with one bag-layer by just creating bags of adjacent image patches. The representation size $k$ corresponds to the number of convolutional filters. The major difference is that a convolutional layer outputs spatially ordered vectors of size $k$, whereas a bag-layer outputs a set of vectors (without any ordering). This difference may become significant when two or more layers are sequentially stacked.
Figure 4 illustrates the relationship between a convolutional layer and a bag-layer, for simplicity assuming a one-dimensional signal (i.e., a sequence). When applied to signals, a bag-layer essentially corresponds to a disordered convolutional layer and its output needs further aggregation before it can be fed into a classifier. The simplest option would be to stack one additional bag-layer before the classification layer. Interestingly, a network of this kind would be able to detect the presence of a short subsequence regardless of its position within the whole sequence, achieving invariance to arbitrarily large translations.
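The correspondence can be checked on a toy one-dimensional signal. The sketch below compares a convolution followed by global max-pooling with a max-aggregating bag-layer applied to the bag of adjacent patches; the filters, signals and function names are illustrative assumptions, and biases are omitted.

```python
from typing import List, Sequence


def patches(signal: Sequence[float], width: int) -> List[List[float]]:
    """All windows of `width` adjacent samples, in left-to-right order."""
    return [list(signal[i:i + width]) for i in range(len(signal) - width + 1)]


def response(filt: Sequence[float], patch: Sequence[float]) -> float:
    """ReLU of the dot product between one filter and one patch."""
    return max(0.0, sum(f * x for f, x in zip(filt, patch)))


def conv_then_global_max_pool(signal, filters, width) -> List[float]:
    """Ordered feature map (one vector per position), then max over positions."""
    feature_map = [[response(f, p) for f in filters] for p in patches(signal, width)]
    return [max(pos[c] for pos in feature_map) for c in range(len(filters))]


def bag_layer_max(bag, filters) -> List[float]:
    """Bag-layer with max aggregation; the bag carries no ordering."""
    return [max(response(f, p) for p in bag) for f in filters]


filters = [[1.0, -1.0, 1.0], [0.0, 1.0, 0.0]]
signal = [0.0, 1.0, 0.0, 2.0, 0.0, 0.0]
shifted = [0.0, 0.0, 1.0, 0.0, 2.0, 0.0]   # same pattern, one step to the right
print(conv_then_global_max_pool(signal, filters, 3))
print(bag_layer_max(patches(signal, 3)[::-1], filters))   # patch order is irrelevant
print(bag_layer_max(patches(shifted, 3), filters))
```

With local rather than global pooling the convolutional output would keep its spatial order, which is precisely the information a bag-layer discards.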
We finally note that it is possible to emulate a CNN with two layers by properly defining the structure of bags-of-bags. For example, a second layer with filter size 3 on the top of the CNN shown in Figure 4 could be emulated with two bag-layers fed by the bag-of-bags
A bag-layer, however, is not limited to pooling adjacent elements in a feature map. One could for example segment the image first (e.g., using a hierarchical strategy (Arbeláez et al., 2011)) and then create bags-of-bags by following the segmented regions.
The convolutional approach has also been recently employed for learning with graph data. The idea is to reinterpret the convolution operator as a message passing algorithm on a graph where each node is a signal sample (e.g., a pixel) and edges connect a sample to all samples covered by the filter when centered around its position (including a self-loop). The major difference between graphs and signals is that no obvious ordering can be defined on neighbors. This message passing strategy over graphs was originally proposed in (Gori et al., 2005; Scarselli et al., 2009) and reused with variants in several later works. Kipf and Welling (2016), for example, propose to address the ordering issue by sharing the same weights for each neighbor (keeping them distinct from the self-loop weight). They show that message-passing is closely related to the 1-dimensional Weisfeiler-Lehman (WL) method for isomorphism testing (one convolutional layer corresponding to one iteration of the WL-test) and can also be motivated in terms of spectral convolutions on graphs. On a side note, similar message-passing strategies were used before in the context of graph kernels (Shervashidze et al., 2011; Neumann et al., 2012). Niepert et al. (2016b) proposed ordering via a “normalization” procedure that extends the classic canonicalization problem in graph isomorphism. Hamilton et al. (2017b) propose an extension of the approach in (Kipf and Welling, 2016) where representations of the neighbors are aggregated by a general differentiable function that can be as simple as an average or as complex as a recurrent neural network. Additional related works include (Duvenaud et al., 2015), where CNNs are applied to molecular fingerprint vectors, and (Atwood and Towsley, 2016), where a diffusion process across general graph structures generalizes the CNN strategy of scanning a regular grid of pixels.
6.3 Nested SRL Models
In Statistical Relational Learning (SRL) a great number of approaches have been proposed for constructing probabilistic models for relational data. Relational data has an inherent bag-of-bag structure: each object in a relational domain can be interpreted as a bag whose elements are all the other objects linked to it via a specific relation. These linked objects, in turn, also are bags containing the objects linked via some relation. A key component of SRL models are the tools employed for aggregating (or combining) information from the bag of linked objects. In many types of SRL models, such an aggregation is only defined for a single level. However, a few proposals have included models for nested combination (Jaeger, 1997; Natarajan et al., 2008). Like most SRL approaches, these models employ concepts from first-order predicate logic for syntax and semantics, and (Jaeger, 1997) contains an expressivity result similar in spirit to the one we present in Section 3.2.
A key difference between SRL models with nested combination constructs and our MMIL network models is that the former build models based on rules for conditional dependencies which are expressed in first-order logic and typically only contain a very small number of numerical parameters (such as a single parameter quantifying a noisy-or combination function for modelling multiple causal influences). MMIL network models, in contrast, make use of the high-dimensional parameter spaces of (deep) neural network architectures. Roughly speaking, MMIL network models combine the flexibility of SRL models to recursively aggregate over sets of arbitrary cardinalities with the power derived from high-dimensional parameterisations of neural networks.
6.4 Interpretable models
Recently, the question of interpretability has become particularly prominent in the neural network context. Lapuschkin et al. (2016b); Samek et al. (2016) explain the predictions of a classifier for each instance by attributing a score to each entry of the instance. A positive or negative score is assigned to an entry depending on whether or not it contributes to predicting the target.
Ribeiro et al. (2016) also provided explanations for individual predictions as a solution to the “trusting a prediction” problem, by approximating a machine learning model with an interpretable model. The authors assumed that instances are given in a representation which is understandable to humans, regardless of the actual features used by the model. For example, for text classification an interpretable representation may be the binary vector indicating the presence or absence of a word. An “interpretable” model is defined as a model that can be readily presented to the user with visual or textual artefacts (linear models, decision trees, or falling rule lists), and which locally approximates the original machine learning model. Given a machine learning model, an interpretable model is trained for each instance. For each instance, a set of neighboring instances is generated by randomly dropping out some of its nonzero entries. Given a similarity measure (e.g. scalar product, Gaussian kernel, cosine distance), the interpretable model is trained by minimizing a similarity-weighted loss between its predictions and those of the original model on the generated instances.
The major difference between all those methods and our interpretation framework, described in Section 5, is that with the latter we are able to provide a global interpretation for the whole MMIL network, as well as to explain individual examples.
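For contrast with our global approach, the local-surrogate idea of Ribeiro et al. (2016) can be sketched as follows. This is only an illustration in the spirit of that method: the squared-error form of the loss, the cosine similarity weighting, the drop probability, and both toy models are assumptions made for the example.

```python
import math
import random


def perturb(x, rng, drop_prob=0.5):
    """Return a copy of x with some nonzero entries randomly set to zero."""
    return [0.0 if (v != 0 and rng.random() < drop_prob) else v for v in x]


def cosine_similarity(a, b):
    dot = sum(u * v for u, v in zip(a, b))
    na = math.sqrt(sum(u * u for u in a))
    nb = math.sqrt(sum(v * v for v in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def weighted_local_loss(black_box, surrogate, x, neighbors):
    """Similarity-weighted squared disagreement between the two models."""
    return sum(
        cosine_similarity(x, z) * (black_box(z) - surrogate(z)) ** 2
        for z in neighbors
    )


rng = random.Random(0)
x = [1.0, 0.0, 1.0, 1.0]                                  # toy binary word-presence vector
neighbors = [perturb(x, rng) for _ in range(20)]          # instances generated around x
black_box = lambda z: 1.0 if z[0] + z[2] >= 2 else 0.0    # stand-in for the original model
surrogate = lambda z: 0.5 * z[0] + 0.5 * z[2]             # candidate interpretable model
print(weighted_local_loss(black_box, surrogate, x, neighbors))
```

A surrogate of this kind is fitted separately for every instance, which is why it yields only local explanations.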
7 Experiments on MMIL data
We evaluated our model on two experimental setups:
we constructed a multi-multi-instance semi-synthetic dataset from MNIST, in which digits were organized in bags-of-bags of arbitrary cardinality. This setup follows the example shown in Section 2.2. The aim of this experiment is to show the ability of the network to learn the functions that generated the data, in accordance with Theorem 3.2 in Section 3.2. Furthermore, we interpreted the network by using the approach described in Section 5;
we decomposed a sentiment-analysis text dataset into MMIL data and MIL data. The goal is to show the differences between the interpretation of the two models.
7.1 Semi-synthetic dataset
Results of Section 3.2 show that networks with bag layers can represent any labelling function in the non-counting deterministic MMIL setting. We show here that these networks trained by gradient descent can actually learn such functions from MMIL data.
The task is defined exactly as in Example 2.1 and we formed a balanced training set of 5,000 top-bags using MNIST digits. Both sub-bag and top-bag cardinalities were uniformly sampled in . Instances were sampled with replacement from the MNIST training set (60,000 digits). A test set of 5,000 top-bags was similarly constructed but instances were sampled from the MNIST test set (10,000 digits). Details on the network architecture and the training procedure are reported in Appendix, Table 7. We stress the fact that instance and sub-bag labels were not used to form the training objective. The accuracy on the test set was , confirming that the network is able to recover the latent logic function that was used in the data generation process with a reasonably high accuracy.
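The construction just described can be sketched as follows. This is a minimal illustration and not the code used for the experiments: digit labels stand in for MNIST images, the cardinality range (`min_card`, `max_card`) is a placeholder, and the latent rules (a sub-bag is positive if it contains a 7 and no 3, a top-bag is positive if it contains a positive sub-bag) are assumed here for the purpose of the example.

```python
import random


def sample_top_bag(rng, min_card=2, max_card=6):
    """Sample a bag of bags of digit labels with random cardinalities."""
    n_sub = rng.randint(min_card, max_card)
    return [
        [rng.randrange(10) for _ in range(rng.randint(min_card, max_card))]
        for _ in range(n_sub)
    ]


def sub_bag_label(sub_bag):
    # Assumed latent rule: the sub-bag contains a 7 and does not contain a 3.
    return 7 in sub_bag and 3 not in sub_bag


def top_bag_label(top_bag):
    # Assumed latent rule: at least one sub-bag is positive.
    return any(sub_bag_label(s) for s in top_bag)


def balanced_dataset(n, seed=0):
    """Rejection-sample n top-bags with equally many positives and negatives."""
    rng = random.Random(seed)
    quota = {True: n // 2, False: n - n // 2}
    data = []
    while len(data) < n:
        bag = sample_top_bag(rng)
        y = top_bag_label(bag)
        if quota[y] > 0:
            quota[y] -= 1
            data.append((bag, y))
    return data


dataset = balanced_dataset(1000)
print(sum(y for _, y in dataset), "positive top-bags out of", len(dataset))
```

Only the top-bag label `y` would be exposed to the learner; the digit identities and the sub-bag labels remain latent.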
We show next how the general approach of Section 5 for constructing interpretable rules recovers, for this example, the latent labels and logical rules used in the data generating process. Pseudo-labels and rules are learnt with the procedure described in Section 5. Clustering was performed with K-Means, and Decision Trees were used as the propositional learner. By optimizing the interpretable model with respect to the fidelity on the validation set, the best numbers of instance pseudo-labels and sub-bag pseudo-labels turned out to be 6 and 2, respectively. A heat map showing the fidelity on the validation set is reported in Appendix, Figure 8.
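Once every instance has been assigned to a cluster, a bag has to be turned into a fixed-size input for the propositional learner. The sketch below uses relative frequencies of pseudo-labels. This encoding is an assumption suggested by the percentage-style rules reported in the appendix, and the function name is hypothetical.

```python
from collections import Counter


def pseudo_label_frequencies(bag_of_pseudo_labels, num_pseudo_labels):
    """Fraction of the (non-empty) bag's elements assigned to each pseudo-label."""
    counts = Counter(bag_of_pseudo_labels)
    size = len(bag_of_pseudo_labels)
    return [counts[k] / size for k in range(num_pseudo_labels)]


# A sub-bag whose five instances fell into clusters 0, 0, 3, 5, 5 (out of 6).
features = pseudo_label_frequencies([0, 0, 3, 5, 5], num_pseudo_labels=6)
print(features)  # [0.4, 0.0, 0.0, 0.2, 0.0, 0.4]
```

The resulting vector has the same length for every bag, whatever its cardinality, so a decision tree can be trained on it directly.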
By visually inspecting the clusters that correspond to pseudo-labels (see Figure 7 in Appendix), it is immediate to recognize that one instance pseudo-label corresponds to 7s, three correspond to 3s, and the remaining two correspond to digits other than 7 and 3. We extracted the following rule that maps a bag of instance pseudo-labels into the corresponding sub-bag pseudo-label:
By considering this rule, it is also immediate to recognize that one of the two sub-bag pseudo-labels gets attached to the sub-bags that contain a seven and not a three.
Similarly, we extracted the following rule that maps a bag of sub-bag pseudo-labels into the corresponding top-bag label:
Hence, in this example, the true rules behind the data generation process were perfectly recovered. Note that perfect recovery does not necessarily imply perfect fidelity, since the quantization due to clustering may lose some information that the neural network is allowed to encode into the distributed representation of instances and sub-bags. Nonetheless, in this experiment the classification accuracy of the interpretable model on the test set was, only less than the accuracy of the original model, and the fidelity was .
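The distinction between accuracy and fidelity can be made concrete with a small sketch. It assumes that fidelity is measured as the fraction of examples on which the interpretable model and the network agree; the prediction vectors are made-up toy values.

```python
def accuracy(predictions, targets):
    """Fraction of positions where the two label sequences coincide."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


def fidelity(interpretable_predictions, network_predictions):
    """Agreement between the interpretable model and the network it explains."""
    return accuracy(interpretable_predictions, network_predictions)


y_true = [1, 0, 1, 1, 0, 0]
y_net = [1, 0, 1, 0, 0, 0]    # network output (one mistake)
y_rules = [1, 0, 1, 0, 1, 0]  # rule-based output (two mistakes)
print(accuracy(y_net, y_true), accuracy(y_rules, y_true), fidelity(y_rules, y_net))
```

Fidelity is computed against the network's outputs rather than the ground truth, so the two quantities can move independently.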
In this section we show further interpretability results on IMDB (Maas et al., 2011), a standard benchmark movie review dataset for binary sentiment classification. We remark that this IMDB dataset differs from the IMDB graph datasets described in Section 8.2. IMDB consists of 25,000 training reviews, 25,000 test reviews and 50,000 unlabelled reviews. Positive and negative labels are balanced within the training and test sets. A review can be seen as a bag of sentences and each sentence as a bag of words. For this particular task it is reasonable to assume that, for a review to be positive, it is often sufficient that it contains a positive sentence, and for a sentence to be positive, it is often sufficient that it contains a set of positive words.
An MMIL dataset was constructed from the reviews, in which a top-bag represents a bag of sentences. A sub-bag represents the bag of trigrams within a sentence. An instance represents a trigram, i.e. a triplet of words obtained by concatenating a word with the previous word and the next word. We used trigrams rather than single words in order to take into account possible negations, e.g. “not very good”, “not so bad”. Figure 5 depicts an example of the decomposition of a review into MMIL data.
Each word is represented with GloVe word vectors (Pennington et al., 2014) of size 100, trained on the dataset. Note that we used GloVe word vectors in order to compare our model with the state of the art (Miyato et al., 2016); nothing prevents us from using a one-hot representation in this scenario as well. In order to compare MMIL against multi-instance (MI) learning, we also constructed a multi-instance dataset in which a review is simply represented as a bag of trigrams.
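A minimal sketch of this decomposition is given below. The sentence splitter, the whitespace tokenizer, and the function names are simplifying assumptions; the `PAD` token at sentence boundaries mirrors the padded trigrams that appear in the tables of this section.

```python
import re

PAD = "PAD"


def sentence_to_trigrams(words):
    """One trigram (previous, current, next) per word, padded at both ends."""
    padded = [PAD] + words + [PAD]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]


def review_to_mmil(review):
    """Top-bag = list of sentences; sub-bag = list of trigrams of one sentence."""
    sentences = [s for s in re.split(r"[.!?]+", review) if s.strip()]
    return [sentence_to_trigrams(s.split()) for s in sentences]


def review_to_mil(review):
    """Flatten the bag-of-bags into a single bag of trigrams."""
    return [t for sub_bag in review_to_mmil(review) for t in sub_bag]


top_bag = review_to_mmil("Not very good. It is a waste of time!")
print(top_bag[0])
print(len(review_to_mil("Not very good. It is a waste of time!")))
```

The flattened variant discards sentence boundaries, which is exactly the information that the MMIL representation retains.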
We trained two neural networks for MMIL and MIL data respectively, which have the following structure:
MMIL network: a Conv1D layer with 300 filters, ReLU activations and kernel size of 100, two stacked bag-layers (with ReLU activations) with 500 units (250 max-aggregation, 250 mean-aggregation) and an output layer with sigmoid activation;
MIL network: a Conv1D layer with 300 filters, ReLU activations and kernel size of 100, one bag-layer (with ReLU activations) with 500 units (250 max-aggregation, 250 mean-aggregation) and an output layer with sigmoid activation.
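The mixed max/mean aggregation used in both architectures can be illustrated with a small forward-pass sketch. It assumes that a bag-layer applies the same affine map and ReLU to every element of the bag before aggregating unit-wise; the toy sizes and random weights are placeholders, and no training is shown.

```python
import random


def relu(v):
    return v if v > 0 else 0.0


def bag_layer(bag, weights, biases, num_max_units):
    """Map a bag of vectors to a single vector.

    Each instance goes through the same affine map followed by ReLU; the first
    `num_max_units` outputs are aggregated with max, the others with mean.
    """
    hidden = [
        [relu(sum(w * x for w, x in zip(row, instance)) + b)
         for row, b in zip(weights, biases)]
        for instance in bag
    ]
    columns = list(zip(*hidden))  # one tuple per unit, across instances
    maxed = [max(col) for col in columns[:num_max_units]]
    averaged = [sum(col) / len(col) for col in columns[num_max_units:]]
    return maxed + averaged


rng = random.Random(0)
in_dim, units = 3, 4  # toy sizes; the networks above use 500 units
W = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(units)]
b = [0.0] * units
bag = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
out = bag_layer(bag, W, b, num_max_units=2)
print(len(out))  # 4
```

Because both aggregators are symmetric, the output does not depend on the order of the elements in the bag, and bags of any cardinality yield a vector of the same size.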
The models were trained by minimizing the binary cross-entropy loss. We ran 20 epochs of the Adam optimizer with learning rate 0.001, on mini-batches of size 128. We also used virtual adversarial training (Miyato et al., 2016) for regularizing the network and exploiting the unlabelled reviews during the training phase. Although our model does not outperform the state of the art (, Miyato et al. (2016)), we obtained a final accuracy of for the MMIL network and for the MIL network. These results show that the MMIL representation yields slightly better results than the MIL representation. Moreover, outperforming the state of the art is outside the scope of this experiment, which aims to show interpretable results.
We now show interpretability results on the IMDB dataset. Similarly to Section 7.1, we learnt pseudo-labels and rules for both the MMIL model and the MIL model. Using 2,500 reviews as a validation set, we obtained 4 and 5 pseudo-labels for sub-bags and instances, respectively, for the MMIL case, and 6 pseudo-labels for the MIL case. A heat map showing the fidelity on the validation set is reported in Appendix, Figure 12. For the MMIL case we report sentences and words in Tables 1 and 2, while for the MIL case we report the words in Table 3.
We extracted the following rule which maps a bag of instance pseudo-labels into the corresponding sub-bag pseudo-label:
Similarly, we extracted the following rule that maps a bag of sub-bag pseudo-labels into the corresponding top-bag label:
Note that and are not used for constructing the rules which map instance pseudo-labels to sub-bag pseudo-labels, and is not used for constructing the rules which map sub-bag pseudo-labels to top-bag labels.
We extracted the following rule which maps a bag of instance pseudo-labels into the corresponding top-bag label:
Note that , , , and are not used for constructing the rules which map instance pseudo-labels to top-bag labels. Finally, by classifying IMDB using the rules and pseudo-labels, we achieved an accuracy on the test set equal to for the MMIL case and for the MIL case. Fidelities for MMIL and MIL cases were and , respectively.
Although the rules and Tables 1 and 2 for MMIL, and the rules and Table 3 for MIL, explain the networks, they might be tricky to read. We therefore also show the alternative approach described in Section 5, in which we explain single predictions, as it may help the reader to better understand the benefits of the proposed interpretable models. We start by assigning different colors to the pseudo-labels. We report the results on two reviews: one classified correctly by the MMIL interpretable model and misclassified by the MIL interpretable model (Example 4), and one misclassified by the MMIL interpretable model and classified correctly by the MIL interpretable model (Example 5).
For the MMIL interpretable model we color in turn the sentences and the trigrams which activate a particular rule, while for the MIL interpretable model we color only the trigrams which activate a particular rule. The sentences which do not activate any rule are not reported (for the sake of readability). We also report the fired rules. Having access to both sentences and trigrams, rather than only the latter, helps the interpretation even for misclassified examples. Indeed, in Example 5, by reading the sentences (and then the trigrams) we can easily understand why the example is misclassified. By reading only the trigrams, this is harder.
We report a positive review which MMIL classified correctly and MIL misclassified, using their respective rules.
For MMIL the fired rule for sentences is . The only sentence belonging to is “Bloody Birthday a pretty mediocre title for the film was a nice lil surprise”. The sentences belonging to are: “And I may say it’s also one of the best flicks I’ve seen with kids as the villains”, “It’s a really solid 80s horror flick but how these kids are getting away with all this mayhem and murder is just something that you can’t not think about”, “It’s a very recommendable and underrated 80s horror flick”.
For MMIL the fired rules for trigrams are:
. The trigrams belonging to are “film was a”, “a nice lil”. The trigrams belonging to are “Birthday a pretty”, “a pretty mediocre”, “pretty mediocre title”;
. The trigrams belonging to are “one of the”, “of the best”, “the best flicks”;
. The trigrams belonging to are “It’s a really”, “a really solid”, “really solid 80s”. The only trigram belonging to is “you can’t not”;
The trigrams belonging to are “a very recommendable”, “very recommendable and”, “recommendable and underrated”, “and underrated 80s”.
For MIL the fired rule for trigrams is . The trigrams belonging to are “Birthday a pretty”, “a pretty mediocre”, “pretty mediocre title”, “to die in”, “die in horrible”, “in horrible fashion”. The trigrams belonging to are “one of the”, “of the best”, “the best flicks”, “It’s a really”, “a really solid”, “the less than”, “a very recommendable”, “very recommendable and”, “recommendable and underrated”, “and underrated 80s”.
We report a positive review which MMIL misclassified and MIL classified correctly, using their respective rules.
For MMIL the fired rule for sentences is . The only sentence belonging to is “The mental patients are all a little eye rolling by the Judge but my favorite was the old crazy biddy Rhea”. The only sentence belonging to is “The storyline is okay at best and the acting is surprisingly alright but after awhile it’s gets to be a little much”.
For MMIL the fired rules for trigrams are:
. The trigrams belonging to are “Judge but my”, “but my favorite”, “my favorite was”;
. The only trigram belonging to is “at best and”. The trigrams belonging to are “is okay at”, “okay at best”, “surprisingly alright but”, “a little much”, “little much PAD”. The only trigram belonging to is “The storyline is”.
For MIL the fired rule for trigrams is . The trigrams belonging to are “Judge but my”, “okay at best”, “at best and”, “acting is surprisingly”, “But still it’s”, “still it’s fun”, “it’s fun quirky”, “fun quirky strange”. The trigrams belonging to are “The storyline is”, “storyline is okay”, “is okay at”.
Table 1: Sentences associated with the four sub-bag pseudo-labels of the MMIL model (one column per pseudo-label).
|overrated poorly written badly acted||I highly recommend you to NOT waste your time on this movie as I have||I loved this movie and I give it an 8/ 10||It’s not a total waste|
|It is badly written badly directed badly scored badly filmed||This movie is poorly done but that is what makes it great||Overall I give this movie an 8/ 10||horrible god awful|
|This movie was poorly acted poorly filmed poorly written and overall horribly executed||Although most reviews say that it isn’t that bad i think that if you are a true disney fan you shouldn’t waste your time with…||final rating for These Girls is an 8/ 10||Awful awful awful|
|Poorly acted poorly written and poorly directed||I’ve always liked Madsen and his character was a bit predictable but this movie was definitely a waste of time both to watch and make…||overall because of all these factors this film deserves an 8/ 10 and stands as my favourite of all the batman films||junk forget it don’t waste your time etc etc|
|This was poorly written poorly acted and just overall boring||If you want me to be sincere The Slumber Party Massacre Part 1 is the best one and all the others are a waste of…||for me Cold Mountain is an 8/ 10||Just plain god awful|
Table 2: Trigrams associated with the five instance pseudo-labels of the MMIL model (one column per pseudo-label).
|PAD 8/ 10||trash 2 out||had read online||it’s pretty poorly||give this a|
|an 8/ 10||to 2 out||had read user||save this poorly||like this a|
|for 8/ 10||PAD 2 out||on IMDb reading||for this poorly||film is 7|
|HBK 8/ 10||a 2 out||I’ve read innumerable||just so poorly||it an 11|
|Score 8/ 10||3/5 2 out||who read IMDb||is so poorly||the movie an|
|to 8/ 10||2002 2 out||to read IMDb||were so poorly||this movie an|
|verdict 8/ 10||garbage 2 out||had read the||was so poorly||40 somethings an|
|Obscura 8/ 10||Cavern 2 out||I’ve read the||movie amazingly poorly||of 5 8|
|Rating 8/ 10||Overall 2 out||movie read the||written poorly directed||gave it a|
|it 8/ 10||rating 2 out||Having read the||was poorly directed||give it a|
|fans 8/ 10||film 2 out||to read the||is very poorly||rating it a|
|Hero 8/ 10||it 2 out||I read the||It’s very poorly||rated it a|
|except 8/ 10||score 2 out||film reviews and||was very poorly||scored it a|
|Tracks 8/ 10||Grade 2 out||will read scathing||a very poorly||giving it a|
|vote 8/ 10||Just 2 out||PAD After reading||very very poorly||voting it a|
|as 8/ 10||as 2 out||about 3 months||Poorly acted poorly||are reasons 1|
|strong 8/ 10||and 2 out||didn’t read the||are just poorly||it a 8|
|rating 8/ 10||rated 2 out||even read the||shown how poorly||vote a 8|
|example 8/ 10||Rating 2 out||have read the||of how poorly||a Vol 1|
|… 8/ 10||conclusion 2 out||the other posted||watching this awful||this story an|
Table 3: Trigrams associated with the six instance pseudo-labels of the MIL model (one column per pseudo-label).
|production costs PAD||give it a||only 4/10 PAD||is time well-spent||… 4/10 …||PAD Recommended PAD|
|all costs PAD||gave it a||score 4/10 PAD||two weeks hairdressing||.. 1/10 for||Highly Recommended PAD|
|its costs PAD||rated it a||a 4/10 PAD||2 hours PAD||rate this a||Well Recommended PAD|
|ALL costs PAD||rating it a||PAD 4/10 PAD||two hours PAD||gave this a||PAD 7/10 PAD|
|possible costs PAD||scored it a||average 4/10 PAD||finest hours PAD||give this a||13 7/10 PAD|
|some costs PAD||giving it a||vote 4/10 PAD||off hours PAD||rated this a||rate 7/10 PAD|
|cut costs PAD||voting it a||Rating 4/10 PAD||few hours PAD||PAD Not really||.. 7/10 PAD|
|rate this a||gave this a||.. 4/10 PAD||slow hours PAD||4/10 Not really||this 7/10 PAD|
|gave this a||give this a||is 4/10 PAD||three hours PAD||a 4/10 or||Score 7/10 PAD|
|rating this a||rate this a||this 4/10 PAD||final hours PAD||of 4/10 saying||solid 7/10 PAD|
|give this a||giving this a||of 4/10 PAD||early hours PAD||rate it a||a 7/10 PAD|
|and this an||gives this a||movie 4/10 PAD||six hours PAD||give it a||rating 7/10 PAD|
|give this an||like this a||verdict 4/10 PAD||48 hours PAD||gave it a||to 7/10 PAD|
|given this an||film merits a||gave 4/10 PAD||4 hours PAD||given it a||viewing 7/10 PAD|
|gave this an||Stupid Stupid Stupid||13 4/10 PAD||6 hours PAD||giving it a||it 7/10 PAD|
|rating this an||PAD Stupid Stupid||disappointment 4/10 PAD||five hours PAD||scored it a||score 7/10 PAD|
|rate this an||award it a||at 4/10 PAD||nocturnal hours PAD||award it a||movie 7/10 PAD|
|all costs …||given it a||rating 4/10 PAD||17 hours PAD||Cheesiness 0/10 Crappiness||is 7/10 PAD|
|all costs ..||makes it a||… 4/10 PAD||for hours PAD||without it a||drama 7/10 PAD|
|PAD Avoid PAD||Give it a||rate 4/10 PAD||wasted hours PAD||deserves 4/10 from||Recommended 7/10 PAD|
8 Experiments on Graphs
We tested our model on two different graph tasks:
a node classification task in which we focused on real citation network datasets where data can be naturally decomposed into bags-of-bags (MMIL data) or bags (MIL data). The goal is to understand whether MMIL and MIL decompositions are reasonable representations for citation networks and whether the MMIL representation is more suitable than the MIL representation. Finally, we compared our approach with state-of-the-art architectures, and we interpreted our results;
a graph classification task in which we tested our model on real social network graphs, where the data can be easily decomposed into bags-of-bags. We compared our approach against state-of-the-art architectures.
8.1 Citation Datasets
In Section 4 we described how graph data can be mapped into an MMIL structure. We show here a real application in which we decompose graphs into MMIL data. We also show a way of decomposing graphs into multi-instance data (which we abbreviate as MIL data). We stress the fact that our MIL setting does not assume any constraint on the latent labels, whereas standard multi-instance learning does. We considered three citation datasets from (Sen et al., 2008): Citeseer, Cora, and PubMed. Furthermore, we will show interpretability results for PubMed, as described in Section 5 (being the dataset with the fewest classes, it produces the fewest rules).
We view the datasets as graphs where nodes represent papers described by titles and abstracts, and edges are citation links. We treat the citation links as undirected edges, in order to have a setup as close as possible to earlier works (Kipf and Welling, 2016; Hamilton et al., 2017a). The goal is to classify nodes of the graph.
We collected the years of publication for all the papers of each dataset using the provided unique ids. According to the years of publication, we split the datasets into training, validation, and test sets in order to have approximately nodes for the training set, nodes for the validation set and nodes for the test set. Hence for each dataset we chose two thresholds and , where . Training sets contain all the papers whose years of publication are less than or equal to , validation sets contain all the papers whose years of publication are greater than and less than or equal to , and test sets contain all the papers whose years of publication are greater than . Table 4 reports the statistics for each dataset, while Figure 10, in Appendix, depicts the distributions of the papers over years of publication for the three datasets.
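The threshold-based split can be sketched as follows. The names `t_train` and `t_valid` for the two thresholds, the requirement that the first is smaller than the second, and the toy years are assumptions made for the illustration.

```python
def temporal_split(years, t_train, t_valid):
    """Split paper ids by year: <= t_train, (t_train, t_valid], > t_valid."""
    assert t_train < t_valid
    split = {"train": [], "valid": [], "test": []}
    for paper_id, year in years.items():
        if year <= t_train:
            split["train"].append(paper_id)
        elif year <= t_valid:
            split["valid"].append(paper_id)
        else:
            split["test"].append(paper_id)
    return split


years = {"p1": 1994, "p2": 1997, "p3": 1998, "p4": 1999, "p5": 2001}
print(temporal_split(years, t_train=1997, t_valid=1998))
```

Splitting by year means that the test papers are always more recent than the training papers.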
MMIL data was constructed from the citation networks as follows: a top-bag corresponds to a paper and is represented as a bag of nodes, namely the neighborhood of the corresponding node (including the node itself). A sub-bag is the bag of words of the text (i.e. title and abstract) attached to a node. An instance is a word. Conversely, in the MIL data a bag contains the words of all the papers in the neighborhood of a node (including the node itself). Note that, in general, the cardinality of the bags can differ for both MMIL data and MIL data. Words are encoded as one-hot vectors, as we want to show the capability of our model to learn intermediate representations of bags from scratch. Figure 6 shows an example of MMIL and MIL decompositions starting from a node and its neighborhood in a citation graph.
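The two decompositions can be sketched on a toy citation graph as follows. The function names, the edge-list format, and the three example papers are illustrative assumptions; words are kept as strings rather than one-hot vectors for readability.

```python
def neighborhood(node, edges):
    """The node itself followed by its neighbors under undirected edges."""
    neighbors = {v for u, v in edges if u == node} | {u for u, v in edges if v == node}
    return [node] + sorted(neighbors - {node})


def to_mmil(node, edges, words):
    """Top-bag: one sub-bag (bag of words) per paper in the neighborhood."""
    return [list(words[n]) for n in neighborhood(node, edges)]


def to_mil(node, edges, words):
    """Single bag: all the words of the papers in the neighborhood."""
    return [w for sub_bag in to_mmil(node, edges, words) for w in sub_bag]


edges = [("a", "b"), ("c", "a")]
words = {"a": ["insulin", "rat"], "b": ["glucose"], "c": ["insulin", "type"]}
print(to_mmil("a", edges, words))
print(to_mil("a", edges, words))
```

In the MIL bag it is no longer possible to tell which paper a word came from, whereas the MMIL top-bag keeps one sub-bag per paper.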
The MMIL model has two stacked bag-layers with ReLU activations with 250 units, while the MIL model has one bag-layer with ReLU activations with 250 units. For both MMIL and MIL we proposed two versions which differ only in the aggregation function of the bag-layers: one setup in which we used the max, and another in which we used the mean. All models were trained by minimizing the softmax cross-entropy loss. We ran 100 epochs of the Adam optimizer with learning rate 0.001 and we early stopped the training according to the loss on the validation set.
As baselines, we considered naive Bayes and logistic regression. For these baselines we reduced the task to a standard classification problem in which examples are papers and labels are the categories associated with papers. As feature vectors we simply considered bags of words for both naive Bayes (Bernoulli) and logistic regression. We also compared our models against GCN (Kipf and Welling, 2016) and GraphSAGE (Hamilton et al., 2017a), which are briefly described in Section 6. While GCN represents nodes with bags of words, GraphSAGE exploits the sentence embedding approach described by (Arora et al., 2016). For the sake of comparison, and given that bag of words is the most challenging and standalone representation, which does not rely on any embedding of words, we encoded the nodes as bags of words for both GCN and GraphSAGE. As GraphSAGE allows both max and mean as aggregation functions, we compared our models against both versions.
Table 5 reports the accuracy of MMIL networks, MIL networks, GCN, GraphSAGE, naive Bayes and logistic regression. The MMIL network outperforms all the other methods. Note that MIL networks also provide reasonable results compared to the other methods, although their results are worse than those of MMIL on Cora and Citeseer and slightly worse than those of MMIL on PubMed. We remark that, contrary to GraphSAGE and GCN, our framework is not specifically designed for graphs.
|Dataset||# Classes||# Nodes||# Edges||# Training||# Validation||# Test|
we denote the year of publication. Citeseer has 6 classes: Agents, Artificial Intelligence (AI), Database (DB), Human-computer Interaction (HCI), Information Retrieval (IR), and Machine Learning (ML). Cora has 7 classes: Case Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, and Theory. PubMed has 3 classes: Diabetes Mellitus Experimental (DME), Diabetes Mellitus Type 1 (DMT1), and Diabetes Mellitus Type 2 (DMT2).
|Naive Bayes (Bernoulli)||71.34%||63.77%||75.47%|
|GCN (Kipf and Welling, 2016)||82.23%||66.50%||78.66%|
|GraphSage (Hamilton et al., 2017a) MeanPool||80.18%||66.19%||75.59%|
|GraphSage (Hamilton et al., 2017a) MaxPool||80.43%||67.61%||76.60%|
8.2 Social Datasets
We tested our model on a slightly different graph scenario in which we used six publicly available datasets first proposed by Yanardag and Vishwanathan (2015). Although this problem might seem similar to the citation dataset classification described in Section 8.1, the task here is to classify whole graphs rather than nodes.
COLLAB is a dataset where each graph represents the ego-network of a researcher, and the task is to determine the field of study of the researcher among High Energy Physics, Condensed Matter Physics, and Astro Physics.
IMDB-BINARY, IMDB-MULTI are datasets derived from IMDB where in each graph the vertices represent actors/actresses and the edges connect people who have performed in the same movie. Collaboration graphs are generated from movies belonging to genres Action and Romance for IMDB-BINARY and Comedy, Romance, and Sci-Fi for IMDB-MULTI, and for each actor/actress in those genres an ego-graph is extracted. The task is to identify the genre from which the ego-graph has been generated.
REDDIT-BINARY, REDDIT-MULTI5K, REDDIT-MULTI12K are datasets where each graph is derived from a discussion thread from Reddit. In those datasets each vertex represents a distinct user and two users are connected by an edge if one of them has responded to a post of the other in that discussion. The task in REDDIT-BINARY is to discriminate between threads originating from a discussion-based subreddit (TrollXChromosomes, atheism) or from a question/answers-based subreddit (IAmA, AskReddit).
The task in REDDIT-MULTI5K and REDDIT-MULTI12K is a multiclass classification problem where each graph is labeled with the subreddit where it has originated (worldnews, videos, AdviceAnimals, aww, mildlyinteresting for REDDIT-MULTI5K and AskReddit, AdviceAnimals, atheism, aww, IAmA, mildlyinteresting, Showerthoughts, videos, todayilearned, worldnews, TrollXChromosomes for REDDIT-MULTI12K).
We built an MMIL dataset from each dataset by treating each graph as a top-bag. Each node of the graph, together with its neighborhood, is a sub-bag, while an instance is a node.
None of these six datasets has features attached to the nodes, unlike the citation datasets in Section 8.1. We therefore used a representation of the node degree as features. Let be the degree of a node and let the maximum degree of the graph. The representation associated to is defined as follows:
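A minimal sketch of this graph-to-MMIL decomposition is shown below. The adjacency-dictionary input format and the function name are assumptions, and each instance is represented by its raw degree purely as a stand-in for a degree-based node representation.

```python
def graph_to_mmil(adjacency):
    """Top-bag of sub-bags; each sub-bag is a node followed by its neighbors.

    Every instance (node) is represented here by its raw degree.
    """
    degree = {n: len(neigh) for n, neigh in adjacency.items()}
    return [
        [degree[m] for m in [node] + sorted(neigh)]
        for node, neigh in sorted(adjacency.items())
    ]


# Undirected path-like graph: 1 - 0 - 2 - 3
adjacency = {0: {1, 2}, 1: {0}, 2: {0, 3}, 3: {2}}
top_bag = graph_to_mmil(adjacency)
print(top_bag)  # [[2, 1, 2], [1, 2], [2, 2, 1], [1, 2]]
```

The top-bag has one sub-bag per node, and the cardinality of each sub-bag is the node degree plus one, so graphs of any size and shape fit the same representation.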
where . By using this representation the scalar product of two vectors and will be high if the nodes and (associated to and respectively) have similar degrees and it will be low if and have far different degrees.
The MMIL networks have the same structure for all the datasets: a dense layer with 500 nodes and ReLU activation, two stacked bag-layers with 500 units (250 max units and 250 mean units), and a dense layer with nodes and linear activation. is for COLLAB, for IMDB-BINARY, and for IMDB-MULTI, for REDDIT-BINARY, for REDDIT-MULTI5K, and for REDDIT-MULTI12K. We performed 10 repetitions of 10-fold cross-validation, training the MMIL networks by minimizing the binary cross-entropy loss (for REDDIT-BINARY and IMDB-BINARY) and the softmax cross-entropy loss (for COLLAB, IMDB-MULTI, REDDIT-MULTI5K, REDDIT-MULTI12K). We ran 100 epochs of the Adam optimizer with learning rate 0.001 on mini-batches of size 20.
Results in Table 6 show that MMIL networks outperform the other methods on COLLAB, IMDB-BINARY, and IMDB-MULTI and have competitive results on REDDIT-BINARY, REDDIT-MULTI5K, and REDDIT-MULTI12K.
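The evaluation protocol of ten repetitions of 10-fold cross-validation can be sketched as an index generator. This is an illustration only: the plain (non-stratified) shuffling, the seed, and the function name are assumptions, and the model training inside each split is omitted.

```python
import random


def repeated_kfold(num_examples, folds=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        order = list(range(num_examples))
        rng.shuffle(order)  # fresh shuffle for every repetition
        for f in range(folds):
            test = order[f::folds]  # every folds-th shuffled index
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield r, f, train, test


splits = list(repeated_kfold(num_examples=1000))
print(len(splits))  # 100 train/test splits
```

Accuracies would then be averaged over the 100 resulting train/test splits.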
|COLLAB||73.09 ± 0.25||72.60 ± 2.15||78.50 ± 0.69||79.46 ± 0.31|
|IMDB-BINARY||66.96 ± 0.56||71.00 ± 2.29||71.59 ± 1.20||72.62 ± 1.04|
|IMDB-MULTI||44.55 ± 0.52||45.23 ± 2.84||48.53 ± 0.76||49.42 ± 0.68|
|REDDIT-BINARY||78.04 ± 0.39||86.30 ± 1.58||87.22 ± 0.80||86.54 ± 0.64|
|REDDIT-MULTI5K||41.27 ± 0.18||49.10 ± 0.70||53.63 ± 0.51||53.42 ± 0.67|
|REDDIT-MULTI12K||32.22 ± 0.10||41.32 ± 0.42||47.27 ± 0.42||45.25 ± 0.48|
We have introduced the MMIL framework for handling data organized in nested bags. The MMIL setting allows for a natural hierarchical organization of data, where components at different levels of the hierarchy are unconstrained in their cardinality. We have identified several learning problems that can be naturally expressed as MMIL problems. For instance, image, text or graph classification are promising application areas, because here the examples can be objects of varying structure and size, for which a bag-of-bags data representation is quite suitable and can provide a natural alternative to graph kernels or convolutional networks for graphs. Furthermore, we proposed a new way of thinking in terms of interpretability. Although some MIL models can be easily interpreted by exploiting the learnt instance labels and the assumed rule, MMIL networks can be interpreted at a finer level: by removing the common assumptions of standard MIL, we are more flexible and we can first associate labels to instances and sub-bags and then combine them in order to extract new rules. Finally, we proposed a different perspective on convolutions on graphs. In most neural network approaches for graphs, convolutions can be interpreted as message passing schemes, while in our approach we provided a decomposition scheme.
We proposed a neural network architecture involving the new construct of bag-layers for learning in the MMIL setting. Theoretical results show the expressivity of this type of model. In the empirical results we have shown that learning MMIL models from data is feasible, and the theoretical capabilities of MMIL networks can be exploited in practice, e.g., to learn accurate models for noiseless data. Furthermore, MMIL networks can be applied in a wide spectrum of scenarios, such as text, images, and graphs. For the latter we showed that MMIL is competitive with state-of-the-art models on node and graph classification tasks and, in many cases, MMIL models outperform the others.
In this paper, we have focused on the setting where whole bags-of-bags are to be classified. In conventional MIL, it is also possible to define a task where individual instances are to be classified. Such a task is, however, less clearly defined in our setup, since we do not assume to know the label spaces at the instance and sub-bag level, nor the functional relationship between the labels at the different levels.
Appendix A: Experiments
|Convolutional Layer||kernel size with 32 channels|
|Max Pooling||kernel size|
|Convolutional Layer||kernel size with 64 channels|
|Max Pooling||kernel size|
|BagLayer (ReLU activation)||units|
|BagLayer (ReLU activation)||units|
We show here interpretability results for the citation datasets presented in Section 8.1. Similarly to Section 7.1 and Section 7.2, we learnt pseudo-labels and rules for both the MMIL and MIL models. Since PubMed has the fewest classes (and hence the fewest rules to read), we show interpretability results for this dataset, for the model in which bag-layers aggregate with the mean.
The optimal number of pseudo-labels for the MMIL model turned out to be 3 for sub-bags and 5 for instances. On the other hand, the optimal number of pseudo-labels for the MIL model turned out to be 3 for the instances. Figure 11 depicts a heat map showing the fidelities on the validation set as a function of the number of pseudo-labels for both instances and sub-bags for the MMIL model.
The MMIL decomposition of citation datasets leads to a special scenario in which we actually know the real labels of the sub-bags. Indeed sub-bags are papers that, in the case of PubMed, are associated with 1 out of 3 classes. Note that the number of sub-bag pseudo-labels for the MMIL case and the number of instance pseudo-labels for the MIL case exactly match the real number of labels.
For the MMIL case, by visually inspecting the sub-bags that correspond to each pseudo-label (see Figure 9), it is immediate to recognize that the pseudo-labels correspond to DMT1, DME, and DMT2 papers. Furthermore, by visually inspecting the instances that correspond to each pseudo-label in Table 9, it is immediate to recognize the “topics”. Words are sorted in descending order according to the score function, based on intra-cluster distance, described in Section 5. Similarly, for the MIL case the words are listed in Table 8. Below we report the rules for both the MMIL and MIL cases.
We extracted the following rule which maps a bag of instance pseudo-labels into the corresponding sub-bag pseudo-label:
Note that is not used for constructing the rules which map instance pseudo-labels to sub-bag pseudo-labels. Similarly, we extracted the following rule that maps a bag of sub-bag pseudo-labels into the corresponding top-bag label:
For the MIL case, we extracted the following rule, which maps a bag of instance pseudo-labels into the corresponding top-bag label:
By classifying PubMed using the rules and pseudo-labels, we achieved an accuracy on the test set equal to for the MMIL case and for the MIL case. Fidelities for the MMIL and MIL cases were and , respectively. Both results remain competitive with the methods described in Table 5.
Although the MIL case has higher accuracy than the MMIL case, the interpretation for the MMIL case provides more information when we explain individual examples (see Section 5). Indeed, if we look at the rule which defines Diabetes Mellitus Experimental for the MIL case we have , i.e. a paper belongs to the Diabetes Mellitus Experimental class if at least of words are associated with pseudo-label . However, those words can be spread over the paper itself and the citing/cited papers, and with the MIL rules we are unable to distinguish between the two. On the other hand, the rules coming from the MMIL model explain the labels in more detail. If we look at Diabetes Mellitus Experimental for the MMIL case we have , i.e. a paper belongs to the Diabetes Mellitus Experimental class if it cites or is cited by at most of papers of class and it cites or is cited by at least of papers of class . By applying the rules for and it is easy to see that a paper belongs to class if it contains mostly words related to Diabetes Mellitus Type 2 and a paper belongs to class if it contains mostly words related to Diabetes Mellitus Experimental.
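The two-level explanation discussed above can be sketched in a few lines, assuming that rules test the relative frequency of pseudo-labels within a bag against "at most" and "at least" thresholds, and that a paper receives the pseudo-label of its dominant word topic. The pseudo-label names (`u1`, `u2`, `v1`, `v2`), the mapping `topic_to_class`, and both thresholds are hypothetical placeholders, not the extracted rules.

```python
from collections import Counter

def frequencies(bag):
    """Relative frequency of each pseudo-label in a non-empty bag."""
    counts = Counter(bag)
    return {label: n / len(bag) for label, n in counts.items()}

def sub_bag_pseudo_label(word_labels, topic_to_class):
    """Give a paper the pseudo-label tied to its most frequent word topic."""
    freq = frequencies(word_labels)
    dominant = max(freq, key=freq.get)
    return topic_to_class[dominant]

def top_bag_rule(paper_labels, low_class, high_class, at_most, at_least):
    """Fire if low_class is rare and high_class is common in the neighbourhood."""
    freq = frequencies(paper_labels)
    return (freq.get(low_class, 0.0) <= at_most
            and freq.get(high_class, 0.0) >= at_least)

# Hypothetical neighbourhood: each inner list holds the word pseudo-labels
# of one paper (the paper itself or one of its citing/cited papers).
topic_to_class = {"u1": "v1", "u2": "v2"}
neighbourhood = [["u1", "u1", "u2"],
                 ["u2", "u2", "u2"],
                 ["u2", "u1", "u2"]]
paper_labels = [sub_bag_pseudo_label(p, topic_to_class) for p in neighbourhood]
fires = top_bag_rule(paper_labels, "v1", "v2", at_most=0.4, at_least=0.5)
```

The intermediate list `paper_labels` is what the flat MIL rule cannot provide: it records which individual papers in the neighbourhood contributed to the decision, not only how words are distributed over the pooled text.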