I Introduction
Fuzzy inference is a powerful modeling framework that can handle computing with knowledge uncertainty and measurement imprecision effectively [1]. It has been successfully applied to a wide range of problems, mainly in system modeling and control [2, 3, 4]. Most of the proposed fuzzy inference methods succeeded because of their ability to leverage expert knowledge to identify the model parameters [5]. This practice simplifies system design and ensures that the knowledge base (if-then rules) used by the system is easy to interpret [6].
More recently, fuzzy inference has increasingly been applied to more advanced applications, such as content-based information retrieval [7], image segmentation [8], image annotation [9, 10], recommender systems [11], and multiple classifier fusion [12]. These applications are more challenging because they require an extensive knowledge base to accommodate various scenarios. Since such a diverse knowledge base cannot be fully captured by domain experts, data-driven techniques are typically used to identify and learn the inference system’s parameters [13, 14]. One such technique is the Adaptive Neuro-Fuzzy Inference System (ANFIS) [15]. ANFIS is a universal approximator that combines the learning and modeling power of neural networks and fuzzy logic into an adaptive inference system. It is a hybrid intelligent system that provides a systematic approach to jointly learn the optimal input space partition (rules) and the optimal output parameters using supervised learning.
Typically, in supervised learning, access to large labeled training datasets improves the performance of the devised algorithms by increasing their robustness and generalization capabilities. Nowadays, access to such large datasets is becoming more convenient. However, for a supervised learning method to benefit from this data, it needs to be carefully preprocessed, filtered, and labeled. Unfortunately, this process can be too tedious, as the vast portion of the collected data is unstructured and labeled ambiguously and at a coarse level. An alternative and relatively new learning framework that tackles this inherent ambiguity better than supervised learning is the Multiple Instance Learning (MIL) paradigm [16].
I-A Multiple Instance Learning
Unlike standard supervised learning, in MIL an object is not represented by a single data point, but rather by a collection of instances, called a bag. Each bag can contain a different number of instances. A bag is labeled negative if all of its instances are negative, and positive if at least one of its instances is positive (note that positive bags may also contain negative instances). Positive bags can encode ambiguity since the instances themselves are not labeled. Given a training set of labeled bags, the goal of MIL is to learn a concept that predicts the labels of training data at the instance level and generalizes to predict the labels of testing bags and their instances [17]. We refer to this definition as the standard MIL assumption. Multiple MIL paradigms have been proposed [18], but for simplicity we focus our formulation on the standard MIL assumption.
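The standard MIL assumption can be illustrated with a minimal Python sketch. The function name `bag_label` and the example bags are hypothetical and only serve to show how a bag label follows from (hidden) instance labels.

```python
from typing import Sequence


def bag_label(instance_labels: Sequence[int]) -> int:
    """Standard MIL assumption: a bag is positive (1) if at least one
    instance is positive, and negative (0) if all instances are negative."""
    return int(any(label == 1 for label in instance_labels))


# Hypothetical bags, each given as a list of (normally unobserved) instance labels.
bags = {
    "all_negative": [0, 0, 0],
    "one_positive": [0, 1, 0, 0],
    "mostly_positive": [1, 1, 0],
}
labels = {name: bag_label(instances) for name, instances in bags.items()}
```

In practice only the bag labels are observed during training; the instance labels used here are shown purely for illustration.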
MIL is a well-known problem that has been studied for the last 20 years. It was first formalized by Dietterich et al. [19], who provided a solution to drug activity prediction. Since then, it has been applied to a wide variety of tasks, including content-based information retrieval [20], drug discovery [21], pattern recognition [22], image classification [23], region-based image categorization [24], image annotation [25], object tracking [26], and time series prediction [16]. In general, MIL can be applied in two contexts of ambiguity: “polymorphism ambiguity” and “part-whole ambiguity” [27]. In polymorphism ambiguity, an object can have multiple forms of expression in the input space, and it is not known which form is responsible for the object label. In part-whole ambiguity, an object can be broken into several parts represented by different feature vectors in the input space, but only a few parts are responsible for the object label [28]. Polymorphism ambiguity arises more often in applications related to chemistry and bioscience. The original MIL application of drug discovery [17, 16] is a case of polymorphism ambiguity. Part-whole ambiguity is more common in pattern recognition problems. For example, in image annotation, features are usually extracted locally (from patches) while the labels, or tags, are only available globally at the image level. Another closely related application is object detection. In this application, objects of interest may cover only a limited region of the image; the rest could be other objects or background. Traditional supervised learning requires identifying and labeling image patches that contain only the object of interest. As indicated by Viola et al. [29], placing bounding boxes around objects is an inherently ambiguous task. Thus, to avoid the tedious task of object segmentation and annotation, the problem of object detection can be addressed using an MIL paradigm. To illustrate the need for MIL further, in the following we analyze how a multiple instance (MI) representation can be applied to image classification. More details about MIL taxonomy have been reported by Amores [30].
Consider the simple example of classifying images that contain “sky”. Using an MIL approach, each training image is represented by a bag of instances where each instance corresponds to features extracted from a region of interest. These regions could be obtained by segmenting the image or simply by dividing it into patches. A multiple instance representation is well suited for this purpose because only a few regions may contain the object of interest (sky), that is, the positive class. Other patches will be from the background or other classes. This representation is illustrated in Figure 1. Traditional single instance learning methods are based on instance-level (patch-level) labels and would require each image region to be correctly segmented and labeled prior to learning.
I-B Fuzzy Inference Systems
A Fuzzy Inference System (FIS) is a paradigm in soft computing which provides a means of approximate reasoning [31]. A FIS is capable of handling computing with knowledge uncertainty and measurement imprecision effectively [1]. It performs a nonlinear mapping from an input space to an output space by deriving conclusions from a set of fuzzy if-then rules and known facts [32]. Fuzzy rules are condition/action (if-then) rules composed of a set of linguistic variables (e.g., image patch). Each variable is assigned a linguistic term (e.g., red, green, blue). For instance, the following rules could be used to identify patches from the image in Figure 1:

If patch is blue and texture is smooth then region is sky.

If patch is blue and patch position is upper half then region is sky.
Typically, a FIS is composed of 5 components: (1) a Fuzzification unit that assigns a membership degree to each crisp input dimension in the input fuzzy sets; (2) a Knowledge Base characterized by fuzzy sets of linguistic terms; (3) a Rule Base containing a set of fuzzy if-then rules; (4) an Inference unit that performs fuzzy reasoning; and (5) a Defuzzification unit that generates crisp output values. FIS has proven to be very effective in various applications [2, 33, 34, 3, 35, 36, 37, 38, 39, 40, 4]. However, it is not applicable to cases where objects are represented by multiple instances.
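To make the five components concrete, the following is a minimal sketch of a single-instance, zero-order Sugeno-style system for the “sky” example. It is an illustration under stated assumptions rather than the system used in this paper: the Gaussian membership functions, the min T-norm for “and”, the two numeric inputs (blueness and smoothness in [0, 1]), and all parameter values are hypothetical.

```python
import math


def gaussian_mf(x: float, center: float, sigma: float) -> float:
    """Fuzzification: membership degree of a crisp input in a Gaussian fuzzy set."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


# Knowledge base and rule base (hypothetical values). Each rule holds one
# (center, sigma) pair per input and a constant output (zero-order Sugeno).
# Inputs: blueness of the patch and smoothness of its texture, both in [0, 1].
RULES = [
    {"premise": [(0.9, 0.2), (0.9, 0.2)], "output": 1.0},  # blue and smooth -> sky
    {"premise": [(0.1, 0.2), (0.1, 0.2)], "output": 0.0},  # not blue and rough -> not sky
]


def sugeno_infer(inputs):
    """Inference and defuzzification: weighted average of the rule outputs."""
    strengths = []
    for rule in RULES:
        degrees = [gaussian_mf(x, c, s) for x, (c, s) in zip(inputs, rule["premise"])]
        strengths.append(min(degrees))  # 'and' realized with the min T-norm
    total = sum(strengths)
    return sum(w * r["output"] for w, r in zip(strengths, RULES)) / total


sky_score = sugeno_infer([0.85, 0.9])    # a blue, smooth patch
ground_score = sugeno_infer([0.15, 0.1])  # a non-blue, rough patch
```

Note that `sugeno_infer` takes exactly one instance as input, which is the limitation that the multiple instance formulation addresses.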
I-C Motivations For Multiple Instance Fuzzy Inference
There are two major limitations that prevent using standard FIS methods with multiple instance data. First, due to the absence of labels at the instance level, we cannot use standard FIS learning methods to construct the knowledge base. Second, we need an effective mechanism to aggregate instances’ confidences and infer at the bag level. The above limitations are due mainly to the inherent architecture of fuzzy inference systems. The standard inference systems reason with individual instances. First, the system’s input is an individual instance. Second, the rules describe fuzzy regions within the instances space. Third, the output of the system corresponds to the fuzzy inference using a single instance. Fourth, labels of the individual instances are required when using learning techniques to identify the parameters of the system. In summary, traditional fuzzy inference systems cannot be used effectively within the MIL framework.
To address the above limitations, we introduce two FIS designed to handle reasoning with bags of instances and capable of learning from ambiguously labeled data. The first one, called Multiple Instance-Sugeno (MISugeno), extends the standard Sugeno system [41]. The second one, called Multiple Instance-ANFIS (MIANFIS), extends the standard ANFIS [15] system and uses MISugeno rules. We report results on various experiments and discuss the advantages of using our proposed methods over closely related MIL algorithms such as Multiple Instance Neural Networks [42] (MINN) and Multiple Instance RBF Neural Networks [43] (RBFMIP).
II Multiple Instance Fuzzy Inference
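In the following, let $B_p$ be a bag of $M_p$ instances with the $j$th instance denoted as $x_{pj}$, with elements corresponding to $D$ features, i.e.,
(1) $B_p = \begin{bmatrix} x_{p1} \\ \vdots \\ x_{pM_p} \end{bmatrix} = \begin{bmatrix} x_{p11} & \cdots & x_{p1D} \\ \vdots & \ddots & \vdots \\ x_{pM_p1} & \cdots & x_{pM_pD} \end{bmatrix}$
Note that the number of instances can vary between bags ($M_p$ depends on $p$). A bag is labeled positive if at least one of its instances is positive, and negative if all of its instances are negative.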
II-A Multiple Instance Sugeno Style Fuzzy Inference
To adapt Sugeno inference to problems where objects are described by multiple instances, we propose a multiple instance Sugeno inference (MISugeno) system that uses multiple instance fuzzy if-then rules. Recall that a fuzzy if-then rule is expressed as
(2) $\text{If } x \text{ is } A, \text{ then } y \text{ is } C$
where $A$ and $C$ are fuzzy sets on universes of discourse $U$ and $V$, respectively.
The rule in (2) combines the fuzzy propositions ($x$ is $A$, $y$ is $C$) into a logical implication abbreviated as $A \rightarrow C$ with membership function $\mu_{A \rightarrow C}(x, y)$.
The rule is defined using a premise part that is a single instance fuzzy proposition.
To generalize the rule in (2) to MI data, we define a multiple instance fuzzy rule as:
(3) $\text{If } x_{p1} \text{ is } A \;\dot{\vee}\; x_{p2} \text{ is } A \;\dot{\vee}\; \cdots \;\dot{\vee}\; x_{pM_p} \text{ is } A, \text{ then } y \text{ is } C$
where, as in (2), $A$ and $C$ are fuzzy sets on the universes of discourse $U$ and $V$, respectively. In (3), $B_p = \{x_{p1}, \ldots, x_{pM_p}\}$ is a bag of instances as defined in (1), and $M_p$ is the number of instances in $B_p$. The premise part of a multiple instance fuzzy rule (i.e., $x_{p1}$ is $A$ $\dot{\vee} \cdots \dot{\vee}$ $x_{pM_p}$ is $A$) is a multiple instance proposition, whereas the consequent part is a traditional proposition. In (3), $\dot{\vee}$ is a joint operator that can be any T-conorm (maximum, algebraic sum, bounded sum, etc.). The reason for using a T-conorm to combine individual instances’ responses goes back to the standard MIL assumption [16, 17], which states that a bag is positive if and only if one or more of its instances are positive. Thus, the bag-level class label is determined by the disjunction of the instance-level class labels. We note that the T-conorm can be designed to handle a broader set of non-standard MIL problems, for example to allow the inference process to assign a higher degree of belief to bags with more than one positive instance.
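The effect of the T-conorm choice on the premise of a multiple instance rule can be sketched as follows. The function names and the per-instance truth degrees are hypothetical; the three operators are the standard maximum, algebraic sum, and bounded sum T-conorms named above.

```python
from functools import reduce


def t_conorm_max(a: float, b: float) -> float:
    return max(a, b)


def t_conorm_algebraic_sum(a: float, b: float) -> float:
    return a + b - a * b


def t_conorm_bounded_sum(a: float, b: float) -> float:
    return min(1.0, a + b)


def bag_truth(instance_truths, t_conorm) -> float:
    """Combine per-instance truth degrees into one bag-level degree."""
    return reduce(t_conorm, instance_truths)


# Hypothetical truth degrees of the instances of two bags for one rule.
one_positive = [0.1, 0.8, 0.2]
two_positive = [0.1, 0.8, 0.8]

max_one = bag_truth(one_positive, t_conorm_max)
max_two = bag_truth(two_positive, t_conorm_max)
alg_one = bag_truth(one_positive, t_conorm_algebraic_sum)
alg_two = bag_truth(two_positive, t_conorm_algebraic_sum)
```

With the maximum, both bags receive the same degree because only the strongest instance matters, whereas the algebraic sum assigns a higher degree to the bag with two strongly activated instances.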
The proposed MISugeno uses multiple instance fuzzy rules with a consequent part that is described by means of a function that maps a bag of instances to a crisp numerical value. Specifically, we define a multiple instance Sugeno rule as:
(4) 
In (4), is a set of polynomial coefficients. When the polynomial coefficients are first order, the MISugeno fuzzy model is called first order, and zero order when the polynomial coefficients are zero order.
Figure 2 illustrates the proposed MISugeno system and its fuzzy inference mechanism to derive the output, o, in response to a bag of instances for the simple case of two rules. The premise part of the rules evaluates all the bag’s instances simultaneously. The inference starts by the fuzzification of instances of input bag . Fuzzification assigns a membership degree to each input instance dimension in the rules’ input fuzzy sets. In Figure 2, instance activates the th input fuzzy set of the th rule by a degree of truth . Next, an implication process is executed to combine the activations of the instances within the bag, resulting in the activation of the rules’ outputs with different degrees. In this example, we use a simple min operator, and the output of rule will be partially activated by a degree = . The (truth instances) are combined in the premise part using the max T-conorm, resulting in the activation of rule by a degree = . To evaluate the consequent part, first the linear response of each instance is computed, i.e., . Then, a function is used to compute the final output by combining the instances’ responses. Many functions could be used, and the choice should be domain-specific. The output of each rule, and , are crisp values. As in the traditional Sugeno fuzzy inference system, the overall output of the system is obtained by taking the weighted average of the rules’ outputs.
The consequent part of the proposed MISugeno style inference system is inspired by the work of Ray and Page on multiple instance regression [44]. In their work, the authors proposed a regression framework for predicting bags’ labels. This formulation allows the linear coefficients and the parameters of the combining function to be learned using optimization techniques, as we will show in Section II-B.
Similar to traditional fuzzy inference, the premise part of a multiple instance rule defines a local fuzzy region within the instance space, and the consequent part describes the characteristics of the system’s output within each region. More specifically, in problems, a local region describes a positive concept (also called target concept), and the output of a rule represents the degree of “positivity” of the instances in that target concept. A target concept is a region in the instances’ feature space that includes as many instances from positive bags as possible and as few instances from negative bags as possible.
The Sugeno fuzzy model [41] was the first attempt at learning fuzzy rules from training data. It has been used to develop the standard ANFIS, which combines the representation power of fuzzy inference and the learning capability of neural networks to learn the rules. In the next section, we will use our MISugeno to develop a multiple instance extension of ANFIS (MIANFIS).
II-B MIANFIS: A Multiple Instance Adaptive Neuro-Fuzzy Inference System
Let $B_p$ be a bag of instances as defined in (1). For simplicity, we introduce our MIANFIS for the case of two rules. The generalization to an arbitrary number of rules is straightforward. The MIANFIS with two Sugeno rules can be described as:
(5) 
Figure 3 illustrates the proposed MIANFIS architecture. As in the traditional ANFIS, nodes in the same layer have similar functions. We denote the output of the $i$th node in layer $l$ as $O_{l,i}$.
 Layer 1

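is an adaptive layer. It calculates the degree to which an input instance satisfies a quantifier $A$. Every node evaluates the membership degree of one feature of an input instance, $x_{pjk}$, in the corresponding fuzzy set $A_{ik}$ of the $i$th rule. Generally, $\mu_{A_{ik}}$ is a parameterized membership function (MF), for example a Gaussian MF with center $c_{ik}$ and width $\sigma_{ik}$:
(6) $\mu_{A_{ik}}(x_{pjk}) = \exp\left(-\frac{(x_{pjk} - c_{ik})^2}{2\sigma_{ik}^2}\right)$
 Layer 2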

is a fixed layer where every node computes the product of all incoming inputs. It evaluates the degree of truth of proposition instances, or simply, “truth instances”. The output of this layer is computed using:
(7) where is a ceiling operator, and is . As in the traditional ANFIS, any T-norm can replace the product as the node function in this layer.
 Layer 3

is a new addition when compared to the traditional ANFIS. Every node in this layer aggregates the truth instances (within each bag) of the previous layer by means of a smooth T-conorm. In this paper, we use a “softmax” function ($S_\alpha$):
(8) $S_\alpha(z_1, \ldots, z_n) = \frac{\sum_{j=1}^{n} z_j e^{\alpha z_j}}{\sum_{j=1}^{n} e^{\alpha z_j}}$
In (8), $\alpha$ determines the behavior of softmax. As $\alpha$ approaches $+\infty$, softmax approaches the max operator. When $\alpha = 0$, it calculates the mean. As $\alpha$ approaches $-\infty$, softmax approaches the min operator. The outputs of this layer are the firing strengths of each input bag in each multiple instance fuzzy rule, i.e.,
(9)
Layer 3 is also a fixed layer.
 Layer 4

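is a fixed layer. Every node in this layer calculates the normalized firing strength of each rule, i.e.,
(10) $\bar{w}_i = \frac{w_i}{\sum_{k=1}^{R} w_k}$
where $w_i$ is the firing strength of the $i$th rule computed in Layer 3 and $R$ is the number of rules.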
 Layer 5

is an adaptive layer. Every node in this layer computes the output of the multiple instance rule using
(11) The parameters are referred to as the consequent parameters. The only constraint on is that it has to be a smooth function to allow for optimization techniques to be applied. We use the “softmax” as the combining function for this layer. In this case, (11) is equivalent to:
(12) Note that the softmax constant here is not necessarily the same as in Layer 3.
 Layer 6

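is a fixed layer with a single node labeled $\Sigma$. It computes the overall output of the system as the sum of the rules’ outputs, where $O_{5,i}$ is the output of the $i$th node of Layer 5 and $R$ is the number of rules:
(13) $o = \sum_{i=1}^{R} O_{5,i}$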
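The six layers above can be sketched as a forward pass for a single bag. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes Gaussian MFs, the product T-norm in Layer 2, the softmax $S_\alpha(z) = \sum_j z_j e^{\alpha z_j} / \sum_j e^{\alpha z_j}$ in Layers 3 and 5 with the same constant in both (the text allows them to differ), and a first-order linear response per instance. The names `mi_anfis_forward`, `centers`, `sigmas`, and `coeffs`, the zero-output guard, and all parameter values are hypothetical.

```python
import math


def softmax_agg(values, alpha):
    """Smooth aggregation: approaches max as alpha grows, equals the mean at alpha = 0."""
    shift = max(alpha * v for v in values)  # shift exponents for numerical stability
    weights = [math.exp(alpha * v - shift) for v in values]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def mi_anfis_forward(bag, rules, alpha=1.0):
    """Forward pass for one bag (a list of instances, each a list of D features).

    Each rule is a dict with 'centers' and 'sigmas' (premise parameters,
    length D) and 'coeffs' (consequent parameters, length D + 1, bias last).
    """
    firing, rule_values = [], []
    for rule in rules:
        *slopes, bias = rule["coeffs"]
        truths, responses = [], []
        for x in bag:
            # Layer 1: Gaussian membership degree of every feature.
            mus = [math.exp(-((xk - c) ** 2) / (2.0 * s ** 2))
                   for xk, c, s in zip(x, rule["centers"], rule["sigmas"])]
            # Layer 2: product T-norm gives the truth degree of the instance.
            truths.append(math.prod(mus))
            # Linear response of the instance for the consequent part.
            responses.append(sum(a * xk for a, xk in zip(slopes, x)) + bias)
        # Layer 3: aggregate the truth instances into the rule's firing strength.
        firing.append(softmax_agg(truths, alpha))
        # Combining function of Layer 5 applied to the instances' responses.
        rule_values.append(softmax_agg(responses, alpha))
    # Layer 4: normalized firing strengths (guard against an all-zero activation).
    total = sum(firing)
    if total == 0.0:
        return 0.0
    normalized = [w / total for w in firing]
    # Layers 5 and 6: weight each rule's value and sum over the rules.
    return sum(w * f for w, f in zip(normalized, rule_values))


# Two hypothetical zero-order rules: one positive concept, one negative region.
rules = [
    {"centers": [0.5, 0.5], "sigmas": [0.3, 0.3], "coeffs": [0.0, 0.0, 1.0]},
    {"centers": [3.0, 3.0], "sigmas": [0.3, 0.3], "coeffs": [0.0, 0.0, 0.0]},
]
positive_bag = [[0.5, 0.5], [2.0, 0.1]]
negative_bag = [[3.0, 3.0], [2.9, 3.1]]
positive_output = mi_anfis_forward(positive_bag, rules)
negative_output = mi_anfis_forward(negative_bag, rules)
```

The positive bag contains one instance at the center of the first rule, so that rule dominates after normalization and the output is close to its consequent value; the negative bag activates only the second rule.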
To learn the parameters of the proposed MIANFIS network, we propose a generalization to the basic learning algorithm presented by Jang [45]. Our variation is different from the ANFIS standard backpropagation learning rule due to the additional layers (Layers 3 and 5) and the use of “softmax” function (in (9) and (11)). Thus, all update equations need to be rederived.
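One ingredient that any rederivation needs is the partial derivative of the softmax aggregation. Assuming the form $S_\alpha(z) = \sum_j z_j e^{\alpha z_j} / \sum_j e^{\alpha z_j}$, the quotient rule gives $\partial S_\alpha / \partial z_i = \frac{e^{\alpha z_i}}{\sum_j e^{\alpha z_j}} \left[ 1 + \alpha (z_i - S_\alpha) \right]$. The sketch below checks this expression against a central finite difference; the function names and test values are hypothetical.

```python
import math


def softmax_agg(values, alpha):
    """S_alpha(z) = sum_j z_j exp(alpha z_j) / sum_j exp(alpha z_j)."""
    weights = [math.exp(alpha * v) for v in values]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def softmax_agg_grad(values, alpha):
    """Analytic partial derivatives dS/dz_i = p_i * (1 + alpha * (z_i - S))."""
    weights = [math.exp(alpha * v) for v in values]
    total = sum(weights)
    s = sum(v * w for v, w in zip(values, weights)) / total
    return [(w / total) * (1.0 + alpha * (v - s)) for v, w in zip(values, weights)]


def numeric_grad(values, alpha, eps=1e-6):
    """Central finite-difference approximation of the same derivatives."""
    grads = []
    for i in range(len(values)):
        up = list(values)
        up[i] += eps
        down = list(values)
        down[i] -= eps
        grads.append((softmax_agg(up, alpha) - softmax_agg(down, alpha)) / (2 * eps))
    return grads


z = [0.2, 0.9, 0.5]
analytic = softmax_agg_grad(z, alpha=2.0)
numeric = numeric_grad(z, alpha=2.0)
```

At $\alpha = 0$ the derivative reduces to $1/n$ for every input, which matches the mean; for large $|\alpha|$ the gradient concentrates on the largest (or smallest) input.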
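Back-Propagation Learning Rule:
We assume that we have $N$ training bags, $\{B_1, \ldots, B_N\}$, and their corresponding labels $\{t_1, \ldots, t_N\}$. After presenting the $p$th training bag, we compute its squared error measure:
(14) $E_p = (t_p - o_p)^2$
In (14), $t_p$ is the desired bag output, and $o_p$ is the computed output of the network when presented with training bag $B_p$. Recall that labels at the instance level are not available and errors can be computed only at the bag level.
The overall error measure of the network after presenting all $N$ bags is
(15) $E = \sum_{p=1}^{N} E_p$
To develop the gradient descent optimization on $E$, we compute the error rate for the $p$th training bag at each node output $O_{l,i}$. This error rate (where $l$ indicates the MIANFIS layer and $i$ the node within that layer) is defined as
(16) $\epsilon_{l,i} = \frac{\partial E_p}{\partial O_{l,i}}$
At the output node, we have
(17) $\epsilon_{6,1} = \frac{\partial E_p}{\partial o_p} = -2 (t_p - o_p)$
For non-output nodes (i.e., internal nodes, $1 \le l \le 5$), we derive the error rate using the chain rule
(18) $\epsilon_{l,i} = \sum_{m=1}^{n_{l+1}} \epsilon_{l+1,m} \frac{\partial O_{l+1,m}}{\partial O_{l,i}}$
where $n_{l+1}$ refers to the number of nodes at layer $l+1$.
Next, we seek to minimize the network error with respect to the premise parameters (the MF centers and widths) and with respect to the consequent parameters.
The error rate with respect to a generic parameter $\theta$ can be computed using
(19) $\frac{\partial E_p}{\partial \theta} = \sum_{O^* \in S} \frac{\partial E_p}{\partial O^*} \frac{\partial O^*}{\partial \theta}$
where $S$ is the set of nodes whose outputs depend on $\theta$.
Using (15), the total error rate is given by
(20) $\frac{\partial E}{\partial \theta} = \sum_{p=1}^{N} \frac{\partial E_p}{\partial \theta}$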
Update Rule For Premise Parameters: First we compute the error rate for the premise parameters and using
(21) 
and,
(22) 
Using the chain rule defined in (18), it can be shown that (see derivation in Appendix A)
(23) 
The center parameters are then updated using
(24) 
where is the learning rate.
The update formula for the MFs’ widths can be derived in a similar manner. It can be shown that
(25) 
The MFs’ widths are then updated using
(26) 
Update Rule For Consequent Parameters: The error rate for the consequent parameters is defined as
(27) 
where,
(28) 
Using (18), it can be shown that (see Appendix B)
(29) 
The consequent parameters are then updated using
(30) 
Equations (24), (26) and (30) can be used to update the centers, widths, and consequent parameters either online, bag by bag (we emphasize that online learning is performed bag by bag, not instance by instance), or offline in batch mode after presentation of the entire data set.
The proposed MIANFIS learning algorithm is summarized in Algorithm 1.
Algorithm 1: MIANFIS learning algorithm
Inputs: the set of training bags; the labels of the training bags; the number of instances in each bag; the constant used in the “softmax” function; the learning rate; the number of epochs; the minimum parameter change value.
Outputs: the sets of consequent parameters; the set of membership functions’ centers; the set of membership functions’ widths.
III Preventing Overfitting: Rule Dropout
Neural networks with a large number of parameters are susceptible to overfitting. MIANFIS is no exception, particularly when using a large number of multiple instance fuzzy rules and relatively small training datasets. In such a scenario, some rules could co-adapt to the training data and degrade the network’s ability to generalize to unseen examples. In this section, we present a technique, known as Dropout, used to prevent overfitting and rules’ co-adaptation.
Dropout is a regularization method that was introduced by Hinton et al. [46] to alleviate the serious problem of overfitting in deep neural networks. Over the years, many methods have been developed to reduce overfitting, including using a validation dataset to stop the training as soon as the performance gets worse, adding weight penalties using L1 and L2 regularization, or artificially augmenting the training dataset using label-preserving transformations. However, as noted by Hinton [46], the best way to regularize a fixed-size model is to average the predictions of all possible settings of the parameters, each weighted by its posterior probability given the training data. This can be achieved by combining the predictions of an exponential number of models. Combining several models with different architectures has the advantage of better generalization and, consequently, better testing performance. While generating an ensemble of models is trivial, training them all is prohibitively expensive.
Generally, Dropout works by setting to zero the output of each node in a given layer with a fixed probability (typically 0.5) during training. Nodes that are dropped out do not contribute to the parameter updates. During testing, all nodes are used, but their outputs are weighted by the probability of a node being retained. Following this strategy, every time a new training example is presented, the network samples and trains a different architecture. In other words, Dropout trains an ensemble of networks ($2^n$ networks, $n$ being the number of nodes) simultaneously, leading to an important speedup in training time compared to traditional ensemble methods. Figure 4 and Figure 5 illustrate the Dropout model.
In this paper, we propose to adopt the Dropout strategy to regularize MIANFIS networks. Typically, overfitting occurs in MIANFIS networks when a set of multiple instance rules co-adapt to the provided data early during the training process and prevent the remaining rules from learning, thus degrading the network’s generalization capability. While the Dropout technique could be applied to MIANFIS as is (given the inherited neural network nature of the architecture), care should be exercised when selecting nodes to include in the list of randomly dropped out nodes. MIANFIS nodes are different from those of standard neural networks, as they are grouped into rules to model and express linguistic terms. Simply dropping a few nodes from a given rule can change its role and could severely handicap the fuzzy inference process. Hence, Dropout should be executed differently. In deep neural nets, Dropout is applied to selected layers (vertically); for MIANFIS, we propose to apply Dropout on a rule-by-rule basis (i.e., horizontally). Either the whole rule is included, or the whole rule is dropped. This can be achieved by applying Dropout to Layer (see Figure 6), i.e., setting to zero the output of the “to be dropped out” rules. We will refer to this derived technique as “Rule Dropout”. Using a Rule Dropout strategy to train MIANFIS networks is approximately equivalent to sampling and training an ensemble of $2^R$ networks ($R$ is the number of rules).
Let be the probability with which a rule is present, formally, Rule Dropout is applied to Layer during training as follows
(31) 
where
(32) 
is a Bernoulli random variable with probability of being . During testing, Layer output is scaled by , i.e., . Figure 6 illustrates our MIANFIS network with 3 multiple instance fuzzy rules where, at a given iteration, rule 2 has been dropped out.
Deriving the new update equations for MIANFIS parameters requires taking into consideration the added Bernoulli random variable, . It is straightforward to show that the new gradients with respect to premise and consequent parameters are given by
(33) 
and,
(34) 
In a similar manner,
(35) 
As it can be seen, equations (33), (34), and (35) will get zeroed when the rule is dropped out (i.e., and ). Thus, its premise and consequent parameters are not updated.
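The Rule Dropout mechanism can be sketched as a mask applied to whole rules. This is a minimal illustration that assumes the mask multiplies the rules’ firing strengths before normalization; the function name `rule_dropout`, the argument `keep_prob` (the probability with which a rule is present), and the example values are hypothetical.

```python
import random


def rule_dropout(firing_strengths, keep_prob, training, rng=random):
    """Apply Rule Dropout to a list holding one firing strength per rule.

    Training: every rule is kept with probability keep_prob (Bernoulli mask),
    and the output of a dropped rule is set to zero.
    Testing: all rules are used and their outputs are scaled by keep_prob.
    Returns the masked (or scaled) strengths together with the mask.
    """
    if training:
        mask = [1 if rng.random() < keep_prob else 0 for _ in firing_strengths]
        return [m * w for m, w in zip(mask, firing_strengths)], mask
    return [keep_prob * w for w in firing_strengths], [1] * len(firing_strengths)


strengths = [0.7, 0.2, 0.5]
rng = random.Random(0)  # seeded generator so that the sampled mask is reproducible
train_out, train_mask = rule_dropout(strengths, keep_prob=0.5, training=True, rng=rng)
test_out, test_mask = rule_dropout(strengths, keep_prob=0.5, training=False)
```

Because a dropped rule contributes a zero output, every gradient that flows through that rule is also zero for the current bag, so its premise and consequent parameters are left unchanged, as stated above.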
IV Experimental Results
IV-A Synthetic Data
To illustrate the proposed multiple instance fuzzy inference and its ability to learn from data without instance-level labels, we first use a simple 2-D synthetic dataset. The data were generated from a distribution of two positive concepts with centers at (0.5,0.5) and (1.5,1.5), and with a fixed standard deviation. These centers are marked with squares in Figure 7, and the circles around the centers indicate regions within 1 standard deviation. These regions are considered the two target concepts. From each positive concept we generated 50 bags. Each bag has a random number of instances, between 2 and 10. Each bag from concept 1 (or 2) has at least one instance close to target concept 1 (or 2). We also generated 50 negative bags randomly from non-concept regions. Negative bags have all of their instances outside both target concepts. In Figure 7, instances from negative bags are shown as “.”, and instances from positive bags are shown as “+” or “” depending on the underlying concept. In Figure 7, we highlight one bag from Concept 1 by circling all of its instances. As can be seen, one of its instances is within the one standard deviation region of target concept 1 while the other instances are scattered around. We should emphasize here that the centers of the target concepts in Figure 7 are unknown and not used by the learning algorithm. They are shown here for illustration and validation purposes only.
IV-A1 MIANFIS Rules Learning
In the following, we show that the MIANFIS Learning Algorithm (Algorithm 1) is capable of identifying positive concepts as well as their corresponding multiple instance fuzzy rules. To initialize the premise parameters, we partition the instances’ space into 6 partitions generated randomly (a grid or manual partitioning could also be used). We use the partitions’ centers as initial centers for the Gaussian MFs, and we initialize all standard deviation parameters to a default value of .
The initial fuzzy sets (MFs) of the rules, before training, are displayed in Figure 8 in dashed lines. As can be seen, the initial 6 partitions simply cover random quadrants of the 2D instance space (if no label information is used, as in this case, the data would appear to have a uniform distribution; refer to Figure 7). The learned fuzzy sets after convergence are shown in Figure 8 in bold lines. As can be seen, the system has correctly identified the positive concepts, and at the same time identified irrelevant rules (MIRule 1, MIRule 3 and MIRule 5) and assigned low output values to each, , and respectively.
IV-B Benchmark Datasets
To provide a quantitative evaluation of the proposed MIANFIS, we apply it to five benchmark data sets commonly used to evaluate MIL methods: MUSK1 and MUSK2 [19], and Fox, Tiger, and Elephant from the COREL data set [47]. The MUSK1 and MUSK2 data sets consist of descriptions of molecules, and the objective is to classify whether a molecule smells musky [48]. In these data sets, each bag represents a molecule and instances within each bag represent the different low-energy conformations of the molecule. Each instance is characterized by 166 features. MUSK1 has 92 bags, of which 47 are positive, and MUSK2 has 102 bags, of which 39 are positive. The other data sets (Fox, Tiger, and Elephant) classify whether an image contains the corresponding animal. Each data set consists of 200 images (bags): 100 positive images containing the target animal and 100 negative images containing other animals. Each image is represented as a set of patches (instances), and each patch is in turn represented by a 230-dimensional feature vector describing its color, texture and shape information. We note that the three data sets are independent and used as binary classification problems (positive vs. negative). Table I summarizes the characteristics of the five data sets. Note that for each benchmark data set, PCA was applied to reduce the dimensionality of the features in order to speed up MIANFIS training and increase the interpretability of the generated multiple instance fuzzy rules.
Table I: Characteristics of the benchmark data sets
Data set   Dim. (PCA)  No. Bags  Positive  Negative  No. Instances
MUSK1      166 (25)    92        47        45
MUSK2      166 (25)    102       39        63
Fox        230 (10)    200       100       100
Tiger      230 (10)    200       100       100
Elephant   230 (10)    200       100       100
For each experiment, we construct a zero-order MIANFIS with a given number of multiple instance rules. For MIANFIS, the number of rules is not critical. It should be large enough to cover the diverse regions of the input space and the multiple concepts. If the specified number of rules is too large, some will vanish, as illustrated in Figure 8 for the example with the synthetic data. Also, a larger number of rules leads to slower training. We use Gaussian MFs to describe the input fuzzy sets. For initialization, we use the FCM algorithm to cluster the instances of the positive bags into a number of clusters equal to the number of fuzzy rules, and we initialize the MFs’ centers as the cluster centers. MIANFIS was trained and tested using ten-fold cross validation. Table II summarizes all parameters used in training the MIANFIS (parameters were manually selected using trial and error). We note that the reason for using larger standard deviations for the MUSK1 and MUSK2 data sets is their higher dimensionality. We expect the sparsity to increase with the dimensions of the feature space, so we set the standard deviations to larger values to allow the initial rules to cover the entirety of the input space.
Table II: Parameters used in training MIANFIS
Parameter                MUSK1  MUSK2  Fox  Tiger  Elephant
No. of MI rules          6      3      2    4      3
No. of inputs            25     25     10   10     10
MFs’ standard deviation  100    100    10   10     10
Output parameters        1s     1s     1s   1s     1s
Softmax constant         1      1      1    1      1
Learning rate            0.1    0.1    0.1  0.1    0.1
First, to illustrate the advantage of using MIANFIS over the traditional ANFIS, we compare these two algorithms on the two MUSK data sets. Since ANFIS cannot learn from ambiguously labeled data, for the sake of comparison, we consider the naive MIL assumption where all instances from positive bags are considered positive and all instances from negative bags are considered negative. We refer to this implementation as NaiveANFIS. The results are summarized in Table III, where the performance is reported in terms of prediction accuracy averaged over all 10 cross validation sets (% correct ± standard deviation). As can be seen, MIANFIS outperforms NaiveANFIS significantly. This is because inaccurately labeled instances within the positive bags were used for training the NaiveANFIS. The difference in performance between MIANFIS and NaiveANFIS is greater for MUSK1 and MUSK2 because of the greater number of instances per bag (more ambiguity).
Table III: Prediction accuracy (%) of MIANFIS and NaiveANFIS
Algorithms   MUSK1  MUSK2  Fox    Tiger  Elephant
MIANFIS      93.49  90.58  66.4   84.5   86.97
NaiveANFIS   67.82  79.43  58.70  77.70  82.2
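The naive MIL assumption behind the NaiveANFIS baseline amounts to copying each bag label to all of its instances. A minimal sketch follows; the function name `naive_instance_labels` and the toy bags are hypothetical.

```python
def naive_instance_labels(bags, bag_labels):
    """Flatten bags into single instances, giving each instance its bag's label."""
    instances, labels = [], []
    for bag, label in zip(bags, bag_labels):
        for instance in bag:
            instances.append(instance)
            labels.append(label)
    return instances, labels


# Two toy bags with 2-D instances: one positive bag and one negative bag.
bags = [[[0.5, 0.5], [1.9, 0.1]], [[2.5, 2.5]]]
bag_labels = [1, 0]
flat_instances, flat_labels = naive_instance_labels(bags, bag_labels)
```

Every negative instance that happens to sit in a positive bag, such as the second instance of the first toy bag if it lies outside the target concept, is labeled positive under this assumption, which is the source of the label noise discussed above.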
Table IV: Prediction accuracy (%) of MIANFIS and other MIL algorithms on the benchmark data sets
Algorithms         MUSK1  MUSK2  Fox   Tiger  Elephant
MIANFIS            93.49  90.58  66.4  84.5   86.97
MILES [49]         86.3   87.7   N/A   N/A    N/A
APR [19]           92.4   89.2   N/A   N/A    N/A
DD [21]            88.9   82.5   N/A   N/A    N/A
DDSVM [50]         85.8   91.3   N/A   N/A    N/A
EMDD [51]          84.8   84.9   56.1  72.1   78.3
CitationKNN [52]   92.4   86.3   N/A   N/A    N/A
MISVM [47]         77.9   84.3   57.8  84.0   81.4
miSVM [47]         87.4   83.6   58.2  78.4   82.2
MINN [53]          88.0   82.0   N/A   N/A    N/A
BaggingAPR [54]    92.8   93.1   N/A   N/A    N/A
RBFMIP [43]        91.3   90.1   N/A   N/A    N/A
BPMIP [42]         83.7   80.4   N/A   N/A    N/A
RBFBagUnit [55]    90.3   86.6   N/A   N/A    N/A
MIkernel [56]      88.0   89.3   60.3  84.2   84.3
PPPMkernel [57]    95.6   81.2   60.3  80.2   82.4
MIGraph [56]       90.0   90.0   61.2  81.9   85.1
miGraph [56]       88.9   90.3   61.6  86.0   86.8
ALPSVM [58]        86.3   86.2   66.0  86.0   83.5
MIForest [59]      85.0   82.0   64.0  82.0   84.0
Table IV compares the performance of the proposed algorithm to state-of-the-art MIL algorithms on the benchmark data sets.
Overall, MIANFIS is comparable to other MIL algorithms. In fact, on all tested data sets, MIANFIS consistently ranked among the top three. For MUSK1, PPPMkernel [57] performed best (95.6%), but this algorithm did not perform as well on the other sets. For MUSK2, BaggingAPR [54] achieved the best accuracy, as reported by [49]. MIANFIS achieved the best average performance on the Fox and Elephant data sets, and the second best performance on the Tiger data set, after the miGraph [56] and ALPSVM [58] methods.
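The ranking claim can be checked directly from the accuracies listed in Table IV. The short script below is a sketch of that check. The helper `rank_of` is introduced here for illustration and uses competition ranking (one plus the number of methods with strictly higher accuracy), so the two methods tied at 86.0 on Tiger both count ahead of MIANFIS.

```python
# Accuracies (%) transcribed from Table IV; None marks an N/A entry.
DATASETS = ["MUSK1", "MUSK2", "Fox", "Tiger", "Elephant"]
TABLE_IV = {
    "MIANFIS": [93.49, 90.58, 66.4, 84.5, 86.97],
    "MILES": [86.3, 87.7, None, None, None],
    "APR": [92.4, 89.2, None, None, None],
    "DD": [88.9, 82.5, None, None, None],
    "DDSVM": [85.8, 91.3, None, None, None],
    "EMDD": [84.8, 84.9, 56.1, 72.1, 78.3],
    "CitationKNN": [92.4, 86.3, None, None, None],
    "MISVM": [77.9, 84.3, 57.8, 84.0, 81.4],
    "miSVM": [87.4, 83.6, 58.2, 78.4, 82.2],
    "MINN": [88.0, 82.0, None, None, None],
    "BaggingAPR": [92.8, 93.1, None, None, None],
    "RBFMIP": [91.3, 90.1, None, None, None],
    "BPMIP": [83.7, 80.4, None, None, None],
    "RBFBagUnit": [90.3, 86.6, None, None, None],
    "MIkernel": [88.0, 89.3, 60.3, 84.2, 84.3],
    "PPPMkernel": [95.6, 81.2, 60.3, 80.2, 82.4],
    "MIGraph": [90.0, 90.0, 61.2, 81.9, 85.1],
    "miGraph": [88.9, 90.3, 61.6, 86.0, 86.8],
    "ALPSVM": [86.3, 86.2, 66.0, 86.0, 83.5],
    "MIForest": [85.0, 82.0, 64.0, 82.0, 84.0],
}


def rank_of(name: str, column: int) -> int:
    """Competition rank (1 = best) of `name` in one data set column."""
    target = TABLE_IV[name][column]
    better = sum(
        1
        for row in TABLE_IV.values()
        if row[column] is not None and row[column] > target
    )
    return 1 + better


ranks = {d: rank_of("MIANFIS", i) for i, d in enumerate(DATASETS)}
print(ranks)
```

Under this ranking convention MIANFIS places second on MUSK1, third on MUSK2 and Tiger, and first on Fox and Elephant.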
To demonstrate the gain in generalization that MIANFIS obtains from Rule Dropout, we train an MIANFIS architecture for binary classification with and without Rule Dropout on a multiple instance dataset sampled from COREL [47]. The task is to classify whether an image contains an elephant or not. The dataset consists of 200 images (bags): 100 positive images containing the target animal and 100 negative images containing other animals. Each image is represented as a set of patches (instances), and each patch is in turn represented by 230 features describing color, texture, and shape information. Before training, we applied PCA to reduce the features to 10 dimensions in order to speed up MIANFIS. Table V summarizes the dataset characteristics. Two MIANFIS networks composed of 15 rules each, one of them employing Rule Dropout (with a dropout hyperparameter selected by trial and error), were trained on 90% of the data, and the remaining 10% was used for testing (the split was done randomly). Figure 9 shows the training and testing errors for both networks over 100 epochs. As can be seen, without Rule Dropout, testing performance begins to degrade at epoch 20 while the training error continues to decrease. In other words, overfitting begins to occur. Typically, this point can be detected using a cross validation data set, and training would be stopped there. However, this assumes that cross validation data is available (or that the training data is large enough to be split into training and testing) and, more importantly, that the cross validation data is representative of the testing data. On the other hand, when using Rule Dropout, overfitting is significantly reduced and MIANFIS achieves better testing performance at the end of the training phase.
Even though the training and testing error rates oscillate when using Rule Dropout (due to the randomness of the dropout process), overall MIANFIS achieved a testing SSE of 0.1123 with Rule Dropout, compared to 0.1451 without it.
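The oscillation mentioned above follows from the random masking of rules at each training step. The sketch below illustrates the general idea under stated assumptions: rules are dropped independently with probability `P_DROP`, at least one rule is always kept, and the output is a zero-order Sugeno weighted average. The exact Rule Dropout formulation is the one defined earlier in the paper and may differ; the value 0.5, the function names, and the toy firing strengths are hypothetical.

```python
import random
from typing import List


def drop_rules(firing: List[float], p_drop: float, rng: random.Random) -> List[float]:
    """Zero each rule's firing strength with probability p_drop.
    At least one rule is always kept so the output stays defined."""
    keep = [rng.random() >= p_drop for _ in firing]
    if not any(keep):
        keep[rng.randrange(len(firing))] = True
    return [w if k else 0.0 for w, k in zip(firing, keep)]


def zero_order_output(firing: List[float], consequents: List[float]) -> float:
    """Zero-order Sugeno output: weighted average of constant consequents."""
    total = sum(firing)
    if total == 0.0:
        return 0.0
    return sum(w * c for w, c in zip(firing, consequents)) / total


rng = random.Random(0)
firing = [0.70, 0.05, 0.40, 0.10, 0.25]   # toy firing strengths (hypothetical)
consequents = [1.0, 0.2, 0.8, 0.1, 0.5]   # toy rule outputs (hypothetical)
P_DROP = 0.5                              # hypothetical; not the value used in the paper

# Training-time forward pass: a random subset of rules contributes.
train_output = zero_order_output(drop_rules(firing, P_DROP, rng), consequents)
# Test-time forward pass: all rules contribute.
test_output = zero_order_output(firing, consequents)
```

Because a different subset of rules is active at each step, no single rule can be relied on to fit the training bags, which is the usual intuition behind dropout-style regularization.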
Data set  dim.(PCA)  No. Bags  Positive  Negative  No.Instances 

Elephant  230(10)  200  100  100 
IV-C Application to Landmine Detection
In this section, we report the results of applying the proposed Multiple Instance Inference to fuse the output of multiple discrimination algorithms for the purpose of landmine detection using Ground Penetrating Radar (GPR). GPR data collected at different locations and on different dates were used to train and test the proposed MIANFIS. The alarm collection covers 319 encounters of various anti-tank mines with high metal content (ATHM) and 422 encounters of various anti-tank mines with low metal content (ATLM). The vehicle-mounted GPR sensor collects 3-dimensional data as the vehicle moves (Figure 10). The three dimensions correspond to the spatial location on the ground (downtrack, crosstrack, and depth) and are shown in Figure 11.
Figure 11(b) shows 2D views of (depth, downtrack) and (depth, crosstrack) slices of GPR data. As can be seen, the target signature does not extend over all depth values. Thus, one global feature vector may not discriminate between mines and clutter effectively. To overcome this limitation, most classifiers developed for this application extract multiple features from small overlapping windows at multiple depths. In the following, we assume that each training alarm (3D data cube) has been divided into 15 overlapping (depth-wise) patches. Each patch is processed by 2 discrimination algorithms. These algorithms are based on the Edge Histogram Descriptor (EHD) [60]. The first one, called EHDDT, extracts features from each 2D (downtrack, depth) patch. The second one, called EHDCT, extracts information from the 2D (crosstrack, depth) patch. In addition, auxiliary features are synthesized from each patch. In particular, “SignatureWidth” in the downtrack direction and “SignatureWidth” in the crosstrack direction are used to capture the effective width of the hyperbolic shape within each patch. These auxiliary features are intended to provide contextual information that can support the relevance of EHDDT and/or EHDCT. As a result, each alarm is represented by a bag of 15 instances, and each instance is a 4-dimensional feature vector. Each bag is labeled as positive (has a target) or negative (non-target), but labels at the instance level are not available. The XY ground truth information about the target is available (using GPS and the known target positions on calibration lanes). However, the depth position cannot be easily identified, as it depends on target size, burial depth, soil type, and other environmental conditions. Manually extracting the depth location can be very tedious. Similarly, during testing, it is not obvious how to combine partial confidence values from the multiple windows.
Therefore, the MIL paradigm is suitable to solve this problem.
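The bag representation described above can be sketched as a small data structure. The class and field names (`PatchInstance`, `AlarmBag`, `ehd_dt`, and so on) are hypothetical, as are the placeholder feature values; only the bag size of 15, the four features per instance, and the bag-level label come from the text.

```python
from dataclasses import dataclass
from typing import List

N_PATCHES = 15  # depth-wise patches per alarm, as stated in the text


@dataclass
class PatchInstance:
    ehd_dt: float    # EHDDT output for the (downtrack, depth) patch
    ehd_ct: float    # EHDCT output for the (crosstrack, depth) patch
    width_dt: float  # SignatureWidth in the downtrack direction
    width_ct: float  # SignatureWidth in the crosstrack direction

    def as_vector(self) -> List[float]:
        """Return the 4-dimensional instance feature vector."""
        return [self.ehd_dt, self.ehd_ct, self.width_dt, self.width_ct]


@dataclass
class AlarmBag:
    instances: List[PatchInstance]
    label: int  # 1 = target, 0 = non-target; no instance-level labels exist

    def __post_init__(self) -> None:
        if len(self.instances) != N_PATCHES:
            raise ValueError(f"expected {N_PATCHES} patches, got {len(self.instances)}")
        if self.label not in (0, 1):
            raise ValueError("bag label must be 0 or 1")


# Placeholder values (hypothetical), one instance per depth patch.
patches = [PatchInstance(0.1 * (i % 3), 0.2, 5.0, 4.0) for i in range(N_PATCHES)]
bag = AlarmBag(patches, label=1)
```

Keeping the label on the bag rather than on each patch mirrors the situation described above, where the depth of the target within the alarm is unknown.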
We construct a zero-order MIANFIS (constant consequent parameters) having 5 multiple instance rules and employing Gaussian MFs to describe the input fuzzy sets. To initialize the system’s parameters, we first use the FCM algorithm to cluster the instances that belong to positive bags into 5 clusters, and we initialize the MFs’ centers as the clusters’ centers. Then, we initialize the standard deviations of the input MFs and the output parameters to 1.
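The initialization just described can be sketched in a few lines. This is a minimal, unoptimized fuzzy c-means written for illustration only: the fuzzifier `m = 2`, the fixed iteration count, the random seeding, and the function names are assumptions, and the toy instances are random numbers rather than GPR features.

```python
import random
from typing import List, Tuple

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fcm_centers(points: List[Vector], n_clusters: int, m: float = 2.0,
                n_iter: int = 50, seed: int = 0) -> List[Vector]:
    """Basic fuzzy c-means; returns the cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    dim = len(points[0])
    for _ in range(n_iter):
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)),
        # normalized over the clusters (small constant avoids division by zero).
        memberships = []
        for p in points:
            inv = [(sq_dist(p, c) + 1e-12) ** (-1.0 / (m - 1.0)) for c in centers]
            total = sum(inv)
            memberships.append([v / total for v in inv])
        # Center update: mean of the points weighted by u_ik^m.
        for k in range(n_clusters):
            weights = [u[k] ** m for u in memberships]
            wsum = sum(weights)
            centers[k] = [
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(dim)
            ]
    return centers


def init_parameters(positive_instances: List[Vector],
                    n_rules: int) -> Tuple[List[Vector], List[Vector], List[float]]:
    """MF centers from FCM; standard deviations and rule outputs set to 1."""
    centers = fcm_centers(positive_instances, n_rules)
    dim = len(positive_instances[0])
    sigmas = [[1.0] * dim for _ in range(n_rules)]
    outputs = [1.0] * n_rules
    return centers, sigmas, outputs


# Toy 4-D instances standing in for the instances of positive bags.
rng = random.Random(1)
positive_instances = [[rng.random() for _ in range(4)] for _ in range(60)]
centers, sigmas, outputs = init_parameters(positive_instances, n_rules=5)
```

Clustering only the instances of positive bags places the initial membership functions where positive concepts can occur, and the subsequent training then has to move them away from background regions shared with negative bags.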
After initialization, we run the basic MIANFIS learning algorithm (Algorithm 1) to jointly learn a fuzzy description of the positive concepts as well as the optimal rule outputs. Figure 12 is a graphical representation of the 5 multiple instance rules prior to running the optimization process (dotted curves) and of the learned rules after training (continuous curves). The fuzzy sets of the rules’ antecedents describe the location and the extent of the positive concepts in the 4D instance feature space. The rules’ consequent values can be interpreted as an assessment of the “positivity” of each learned concept. For instance, MIANFIS learned the following two positive concepts to describe targets:
IV-C1 Results
The proposed fusion method was trained and tested using 10-fold cross validation. Figure 14 displays the ROCs (averaged over the 10 folds).
To provide a quantitative evaluation of the proposed multiple instance fuzzy inference fusion method, we compare its performance to fusion methods based on the standard Mamdani [12] and standard ANFIS [61] systems. Since the standard Mamdani and ANFIS systems cannot learn from partially labeled data, an expert is used to label all instances of all bags within the training data. We also compare the performance of MIANFIS to naive MIL implementations of Mamdani (NaiveMamdani) and ANFIS (NaiveANFIS), where all instances from positive bags are considered positive and all instances from negative bags are considered negative.
As can be seen in Figure 14, MIANFIS performed better than the standard ANFIS on the large dataset, and, as expected, NaiveANFIS performed worse. The standard ANFIS performed better at low FAR (False Alarm Rate). The reason is that strong mines are easy to identify manually, and in this case the ground truth helps. However, weaker mine signatures are not as easy to localize, so the truth may not be as accurate and can degrade the performance. Overall, MIANFIS outperformed all presented fusion approaches and the individual discriminators (EHDDT and EHDCT). This is due to the ability of MIANFIS to overcome labeling ambiguity by learning meaningful concepts.
As in standard ANFIS, we cannot prove convergence of the algorithms. However, in all conducted experiments MIANFIS converged in fewer than 150 epochs. Figure 13 plots the root mean squared error (RMSE) vs. the training epoch number.
V Related Work
MIANFIS deals with ambiguity by introducing the novel concept of truth instances: when reasoning with a bag of instances at Layer 2 (Figure 3), a proposition has not just one degree of truth but multiple degrees of truth, which we call truth instances. This effectively encodes the third vagueness component of ambiguity and increases the expressive power of traditional fuzzy logic. In addition to modeling ambiguity effectively, MIANFIS inherits the capability of assessing prediction quality by outputting soft values. For example, depending on the Softmax parameter in Layer 3, MIANFIS can assign higher outputs to bags with more than one positive instance, giving the end user a way to assess the positiveness of a given bag.
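The effect of the Softmax parameter on bags with several positive instances can be illustrated numerically. The sketch below assumes the smooth-maximum form commonly used in the MIL literature, a weighted average of the values with weights exp(alpha * value); the definition actually used in Layer 3 is given earlier in the paper and may differ. The function name `softmax_agg` and the truth-instance values are hypothetical.

```python
import math
from typing import List


def softmax_agg(values: List[float], alpha: float) -> float:
    """Smooth maximum: sum_i v_i * exp(alpha * v_i) / sum_i exp(alpha * v_i).
    alpha = 0 gives the mean; large alpha approaches the maximum."""
    weights = [math.exp(alpha * v) for v in values]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Truth instances of one rule for two toy bags (hypothetical values).
one_positive = [0.9, 0.1, 0.1]
two_positive = [0.9, 0.9, 0.1]

soft_one = softmax_agg(one_positive, alpha=1.0)
soft_two = softmax_agg(two_positive, alpha=1.0)
hard_one = softmax_agg(one_positive, alpha=50.0)
hard_two = softmax_agg(two_positive, alpha=50.0)
```

With a small parameter the bag with two high truth instances receives a clearly higher aggregate than the bag with one, whereas with a large parameter both aggregates approach the maximum value of 0.9 and the distinction disappears.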
Learning positive target concepts from ambiguously labeled data has been the core task of various MIL algorithms (e.g. Diverse Density [16]). MIANFIS has proven that it can learn positive concepts effectively while jointly providing a fuzzy representation of such regions. The fuzzy representation is combined into meaningful and simple multiple instance rules that can be easily visualized and interpreted.
Compared to previously proposed multiple instance neural networks, such as Multiple Instance Neural Networks [42] (MINN) and Multiple Instance RBF Neural Networks [43] (RBFMIP), the advantage of MIANFIS is its use of multiple instance fuzzy logic to learn a fuzzy representation of true positive concepts. MINN only learns standard neural network weights that do not carry any information about target concepts. On the other hand, while standard RBF neural networks have been shown to be equivalent to zero-order traditional Sugeno systems under certain constraints [62], and are thus capable of learning a fuzzy representation of the inputs, RBFMIP networks have a different architecture and do not employ adaptive radial basis functions in the first layer. Instead, they represent the inputs by computing their distances to clusters of training bags. This latter method is computationally expensive, and its success depends greatly on the quality of the training data, as it takes into consideration all the training examples, which may include wrongly (noisily) labeled bags. RBFMIP learns only discriminative regions of the bag space and does not learn true positive concepts. Moreover, MIANFIS learning algorithms can be updated to support a wide range of loss functions (criteria), such as cross entropy [63], maximum margin [64], etc. MINN is designed to use a handcrafted loss function, which is largely responsible for the multiple instance behavior of the system and cannot be changed without substantially changing the architecture of MINN. This could be a disadvantage if MINN is to be used to solve multiple instance, multiple class classification problems.
VI Conclusions
In this paper, we have introduced a new framework to accomplish fuzzy inference with multiple instance data. Our work generalizes the Sugeno fuzzy inference style to reason with multiple instances; we call the new inference style MISugeno. We then used MISugeno to develop MIANFIS, a novel neurofuzzy architecture that extends the standard Adaptive NeuroFuzzy Inference System (ANFIS) to reason with bags of instances in order to solve multiple instance learning problems. We developed a BackPropagation learning algorithm and showed that the proposed system is capable of learning meaningful concepts from ambiguously labeled data.
MIANFIS deals with ambiguity by introducing the novel concept of truth instances: when reasoning with a bag of instances at Layer 2 (Figure 3), a proposition has not just one degree of truth but multiple degrees of truth, which we call truth instances. This effectively encodes the third vagueness component of ambiguity and increases the expressive power of traditional fuzzy logic.
Learning positive concepts from ambiguously labeled data has been the core task of various MIL algorithms. MIANFIS has proven that it can learn positive concepts effectively while jointly providing a fuzzy representation of such regions. The fuzzy representation is combined into meaningful and simple multiple instance rules that can be easily visualized and interpreted.
Using synthetic and benchmark data sets, we showed that the proposed Multiple Instance Fuzzy Inference is comparable to state-of-the-art MI machine learning algorithms. We also used our framework in a real application, applying it to fuse the output of multiple discrimination algorithms for the purpose of landmine detection using Ground Penetrating Radar.
For situations where overfitting is imminent, for example when using relatively small datasets to learn very large MIANFIS networks, we proposed a regularization technique, which we call Rule Dropout, and showed that it can be used to train MIANFIS systems with better generalization.
In future work, we intend to develop a multiple class version of MIANFIS to be used to solve multiple class classification problems. In addition, we will conduct a detailed analysis of MIANFIS convergence.
Appendix A Derivation of Premise Parameters Update Rules
From equations (21) and (22), the error rates for the premise parameters are defined as follows
and,
Using the chain rule defined in (18), we have
(36) 
Hence, (21) is equivalent to
(37) 
From (17), we have
(38) 
It is also straightforward to show that,
(39) 
and,
(40) 
Continuing with the derivation, we have
(41) 
and,
(42) 
The details of the derivation of the “softmax” function can be found in [21].
Next, we need to compute