1 Introduction
We address the problem of finding a set of images containing a common, but unknown, object category from a collection of image proposals. The input is a collection of bags, each containing several images from multiple classes. A bag is labelled as positive if it contains at least one image from the common object class and negative if none of the images in the bag is from the common object class. The goal is to find an instance of the common object in each positive bag.
Several computer vision problems, including cosegmentation, colocalization, and unsupervised video object tracking and segmentation, have been formulated in this way [1, 2, 3, 4, 5]. In the colocalization problem (Figure 1), each bag contains many cropped image regions from one image, and these bags are labelled as positive or negative based on the presence of the common object. The goal is to identify proposals (regions), one per positive bag (image), that contain the common object. We designed our approach to address the general problem of finding common objects from positive bags and have evaluated it on two problems: few-shot common object recognition and object colocalization.
Weakly supervised classification methods like multiple-instance learning (MIL) [6] can be used to solve this type of problem, but they require many training bags to learn new concepts. Meta-learning techniques [7, 8, 9] have been shown to reduce the need for training instances in few-shot learning. They do this by transferring knowledge from datasets of similar tasks so that classifiers can be trained with only a small set of examples from new unseen classes [10]. Unfortunately, these methods require full supervision for the new classes. Motivated by their success, we leverage meta-learning to reduce the need for large training sets in the weakly supervised domain.
We model the problem of finding common objects as minimizing the energy of a graphical model. Each node of the graphical model represents a positive bag, and minimizing the energy function corresponds to finding one image in each positive bag that contains the common object. The energy minimization problem uses unary and pairwise potential functions, where unary potentials represent the relation of the image to the images in the negative bag and the pairwise potentials represent the relation of an image pair from two positive bags. We adopt the relation network [11], which has been used successfully in few-shot recognition, to predict the relation of image pairs as our pairwise potentials. We propose a new algorithm that uses the relation of an image to all of the images in the negative bag to provide our unary potentials. Although graphical models have been used for MIL problems [12, 13], our method is different in that it uses a learning-based approach, inspired by meta-learning, to increase the generalization power of potential functions to novel classes.
Any general purpose structured inference method [14, 15, 16] could be used to minimize the energy of the proposed graphical model. While significantly outperforming several MIL baselines, these algorithms suffer from a slow inference step when the number of instances in each bag is large. Further exacerbating the problem, the common object recognition task can be part of a larger pipeline, e.g., training a weakly supervised few-shot object detector, which might require millions of iterations of the inference to achieve good performance. To address this issue, we propose a greedy search algorithm for finding the common object based on the following observation: the common object for the complete problem is a common object in any subset of the bags as well. For example, in Figure 1, a cake is the common object in the four bags, but it is also one of the common objects in any subset of positive bags. Our greedy method uses this property to restrict the search space for the optimal solution to the set of candidates that are a solution to a subset of bags. We introduce an efficient bottom-up algorithm that can be implemented as a simple neural network, which finds common objects in a set of positive bags with a single forward pass.
We make the following contributions. (1) We introduce a method to transfer knowledge from large-scale strongly supervised datasets by learning pairwise and unary potentials, and demonstrate the superiority of this learned relation metric to earlier MIL approaches on two problems. (2) We propose a specialized fast greedy algorithm for structured inference and show that this method achieves performance comparable to state-of-the-art inference methods while requiring less computational time.
2 Related Work
Multiple instance learning (MIL) [17, 18] methods have been used for learning weakly supervised tasks such as object localization (WSOL) [19, 20, 21, 22]. In a standard MIL framework, instance labels in each positive bag are treated as hidden variables with the constraint that at least one of them should be positive. MI-SVM and mi-SVM [23] are two popular methods for MIL, and have been widely adapted for many weakly supervised computer vision problems, achieving state-of-the-art results in many different applications [17, 24]. In these methods, images in each bag inherit the label of the bag and an SVM is trained to classify images. The trained SVM is used to relabel the instances and this process is repeated until the labels remain stable. While mi-SVM uses all the instances in positive bags, in MI-SVM only the image with the highest score in each positive bag is used for training.
Cosaliency [25, 3], cosegmentation [1, 2, 26], and colocalization [27] methods have the same kind of output as WSOL methods. Similar to standard MIL algorithms, some of these methods rely on a relatively large training set for learning novel classes [27, 28]. The main difference between these methods and WSOL methods is that they usually do not utilize negative examples [1, 27, 28]. Negative examples in our method are optional and could be used to improve the results of the colocalization task.
Our approach is related to weakly supervised methods that make use of auxiliary fully-labelled data to accelerate the learning of new categories [29, 30, 31, 32, 33]. Since visual classes share many visual characteristics, knowledge from fully-labelled source classes is used to learn from the weakly-labelled target classes. The general approach is to use the labelled dataset to learn an embedding function for image proposals and use MI-SVM to classify instances of the weakly labelled dataset in this space [29, 30, 31]. We show that learning a scoring function to compare images in the embedded space significantly improves the performance of this approach, especially when few positive images are available. Rochan et al. [32] propose a method to transfer knowledge from a set of familiar objects to localize new objects in a collection of weakly supervised images. Their method uses semantic information encoded in word vectors for knowledge transfer. In contrast, our method uses the similarity between tasks in training and testing and does not rely solely on a given semantic relationship between the familiar and new classes. Deselaers et al. [33] transfer objectness scores from source classes and incorporate them into unary terms of a conditional random field formulation.
Our approach is inspired by methods that use the meta-learning paradigm for few-shot classification. These methods simulate the few-shot learning task during the training phase, in which the model learns to optimize over a batch of sampled tasks. The meta-learned method is later used to optimize over similar tasks during testing. Optimization-based methods [34, 7], feature and metric learning methods [10, 35, 11], and memory-augmented methods [8] are just a few examples of modern few-shot learning. While our work is inspired by these methods, it is different in the sense that we do not assume strong supervision for the tasks. In relation networks [11], a similarity function is learned between image pairs and used to classify images from unseen classes. We adopt this method to learn the unary and pairwise potential functions in our graphical model.
3 Problem Setup and Notation
We consider a set with a binary relation . The elements of the set are called images in our work for simplicity of exposition. A relation is simply a subset of :
(1) 
A bag is a set of images, thus, a subset of . We will be concerned with collections of bags, . We say that a collection is positive if it is possible to select images, one from each bag, so that they are all related in pairs.
Given a collection, and an optional additional bag that we designate as negative, the task is to output a selection of images , one for each positive bag , where is from positive bag , such that the images are pairwise related, i.e., and that not all images are pairwise related to any image in the negative bag, i.e., such that . (Footnote 1: There is no point in having more than one negative bag in a collection since its purpose is simply to provide a set of images that are not compatible with the positive bags, in the sense described.)
3.1 Constructing From Image Labels
We are interested in cases where each of the images has a single latent (unknown) label where is the background class and is the set of foreground classes. Two images and are related if their labels are the same and belong to a foreground class i.e. . For example, (cropped) images may be labelled according to the foreground object they contain. In this case, two images , both containing a “cake” are related, . Whereas two images that are not of the same foreground object category are unrelated, .
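As an illustration, the label-based relation and the constraints on a selection can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: labels are plain strings, `BACKGROUND` is a hypothetical marker for the background class, and the function names `related` and `is_valid_selection` are chosen for this example. The negative-bag constraint is read here as requiring that no selected image is related to any negative image.

```python
from itertools import combinations

BACKGROUND = "background"  # hypothetical marker for the background class


def related(label_a, label_b):
    """Two images are related iff they share the same foreground label."""
    return label_a == label_b and label_a != BACKGROUND


def is_valid_selection(selection, negative_bag=()):
    """Check a selection (one label per positive bag) against the task constraints.

    All selected images must be pairwise related, and (in this reading of the
    constraint) none of them may be related to an image in the negative bag.
    """
    pairwise_ok = all(related(a, b) for a, b in combinations(selection, 2))
    negative_ok = not any(related(s, n) for s in selection for n in negative_bag)
    return pairwise_ok and negative_ok
```

For example, a selection of three "cake" images with a negative bag containing only "dog" images is valid, while the same selection becomes invalid if the negative bag also contains a "cake" image.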
3.2 Training and Test Splits
For a dataset , we use the notation to indicate that a random collection is drawn from the dataset. We define the sampling strategy in the implementation details for each dataset. During training, algorithms have access to a dataset and corresponding groundtruth relation. We construct the relation for the training dataset based on a set of foreground classes as described above.
Methods are evaluated on samples from a test dataset . There are no images in common between the training and test datasets. Moreover, the set of foreground classes used to construct the relation for the test dataset is different from the set of foreground classes used during training, i.e., . At test time we only know whether a bag is positive or negative with respect to a collection. The ground-truth relation (i.e., foreground class) is unknown to the model and only used for evaluating the performance of the model.
4 Pairwise Relation Module
The proposed method relies on an algorithm to estimate the relation of an input image pair . One common approach is to learn an embedding function and use a fixed distance metric to compare the input pairs in the embedded space. In this approach, the learning happens only in the embedding function. Relation Network [11] extends this by jointly learning the embedding function and a comparator. The network consists of embedding and relation modules. The embedding module learns a joint feature embedding (into ) for the input pair of images and the relation module learns a mapping , mapping the embedded feature to the relation score . (Footnote 2: We adopt the notation used in the Relation Network paper [11], where denotes the parameters of the embedding and scoring functions combined.) We adopt the relation module from the Relation Network due to its simplicity and success in few-shot learning. However, any other method which computes the relation between image pairs could be used in our method.
As we need to evaluate the relation of many image pairs, our goal is to keep the model for the embedding and scoring functions as simple as possible. The feature embedding function consists of feature concatenation and a single linear layer with gated activation [36] and skip connections. Let and be features in extracted from images and by a CNN feature extraction module. Let be the concatenation of feature pairs. The embedding function is defined as:
(2)  
where and vectors are the parameters of the feature embedding module and and are hyperbolic tangent and sigmoid activation functions respectively, applied component-wise to vectors in . Then, we use a linear layer to map these features into the relation score, i.e., , where and. We found in practice that using gated activation in the embedding module improves the performance over a simple ReLU, whereas adding more layers does not affect the performance. We note that the effectiveness of gated activation has also been shown in other work [37].
5 Proposed Method
We pose the problem of finding the common object as finding a selection that minimizes an energy function. Our energy function is defined as a sum of potential functions as follows:
(3) 
in which and are pairwise and unary potential functions with parameters and and hyperparameter controls the importance of unary potentials. The pairwise potential function is learned so that it encourages choosing pairs that are related to each other. The unary potential is chosen so that it is minimized when its input is not related to the images in the negative bag. In this way, the overall energy is minimized when images in are related to each other and unrelated to images in the negative bag.
We first present the method for learning the pairwise and unary potential functions. Next, we propose a greedy search algorithm that exploits the following structure of the problem to optimize the energy function: the common object of the complete problem is a common object in any subset of the positive bags as well. In this model, we search for the optimal solution by evaluating the objective for selections that were optimal for smaller subproblems.
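The structure of the energy function can be made concrete with a short sketch. This is an illustration under stated assumptions rather than the authors' implementation: `pairwise` and `unary` stand in for the learned potential functions, `eta` is a hypothetical name for the hyperparameter weighting the unary terms, and the exhaustive search is included only to show why exact minimization becomes impractical as the number and size of bags grow.

```python
from itertools import combinations, product


def energy(selection, pairwise, unary, eta):
    """Energy of a selection: all pairwise terms plus eta times the unary terms."""
    pair_term = sum(pairwise(a, b) for a, b in combinations(selection, 2))
    unary_term = sum(unary(a) for a in selection)
    return pair_term + eta * unary_term


def brute_force_minimum(bags, pairwise, unary, eta):
    """Exhaustively score every selection (one image per bag).

    The number of candidates is the product of the bag sizes, so this is
    only usable on toy problems.
    """
    return min(product(*bags), key=lambda sel: energy(sel, pairwise, unary, eta))
```

With toy potentials that reward equal labels, `brute_force_minimum` picks the label shared by every bag, which is the behaviour the learned potentials are meant to approximate on real images.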
5.1 Learning Potential Functions
The pairwise potential function is defined as the negative of the output of the relation module: so it has a lower energy for related pairs. For a sampled collection
the episode loss is written as a binary logistic regression loss
where the sum is over all the pairs in the collection, is the total number of such pairs, and relation defined in Eq (1) provides the groundtruth labels.
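A binary logistic regression loss over the pairs of a collection can be sketched as below. This is a minimal sketch, assuming that the relation module outputs an unbounded score that is treated as a logit and that the ground-truth relation is encoded as 1 (related) or 0 (unrelated); the function names are hypothetical.

```python
import math


def logistic_loss(score, label):
    """Binary logistic loss for a raw relation score (logit) and a 0/1 label."""
    # The loss is softplus(-score) for label 1 and softplus(score) for label 0.
    z = score if label == 1 else -score
    # Numerically stable softplus(-z) = max(-z, 0) + log(1 + exp(-|z|)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def episode_loss(scores, labels):
    """Average logistic loss over all scored pairs in a sampled collection."""
    return sum(logistic_loss(s, y) for s, y in zip(scores, labels)) / len(scores)
```

A confident and correct score gives a loss near zero, while a score of zero gives log 2 regardless of the label.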
The unary potential is constructed by comparing image with images in the negative bag . Let the vector be the estimated relation between image and all the images in i.e. where is the th image in the negative bag. By definition, the unary energy for an image should be high if at least one of the values in is high. In other words, is related to if it is related to at least one image in . This suggests the use of as the unary energy potential. However, depending on the class distribution of images in the negative bag, an image which is not from the common object class could be related to more than just one image from the negative bag. In this case, using the average relation to the few most related elements in might help to reduce the noise in the estimation and work better than a simple max operator. This motivates us to use a soft notion of the max operator, in which we use a weighted average of the relations where higher relation values get a higher weight
(4) 
where controls how smoothly the weights change with respect to the input relation values and
(5) 
where is the total number of images in the negative bag. Observe that for we have the mean value of and it converges to the max operator as . We let the algorithm learn a balanced value for parameter in a datadriven way.
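The behaviour described above, a weighted average that equals the mean at one extreme and approaches the max at the other, can be sketched with softmax weights. This is an assumption-laden illustration: the softmax form of the weights is inferred from the description and from the SOFTMAX label used in the ablation, and `nu` is a hypothetical name for the smoothness parameter.

```python
import math


def soft_max_pool(relations, nu):
    """Weighted average of relation values with weights softmax(nu * r)."""
    logits = [nu * r for r in relations]
    shift = max(logits)  # subtract the largest logit for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return sum(e * r for e, r in zip(exps, relations)) / total
```

With `nu = 0` every weight is equal and the result is the mean of the relations; as `nu` grows, the weight concentrates on the largest relation value and the result approaches the max.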
For a sampled collection , the episode loss for the unary potential is defined as a binary logistic regression loss
(6) 
where we use an extended definition of the relation function where and is the total number of unary potentials in the collection. By optimizing this loss, we learn a potential function that has a higher value if is related to one example in the negative bag. Note that in Eq (3) the selection of unary potentials with high values is discouraged.
Parameters of the unary and pairwise potential functions are learned by optimizing the sum of the presented loss functions over randomly sampled problems from the training set
(7) 
Although both unary and pairwise potential functions are using the relation network with an identical architecture, their input class distribution is different since one is comparing images in positive bags and one is comparing between images in positive and negative bags. Thus, sharing their parameters decreases overall performance.
5.2 Inference
Finding an optimal selection that minimizes the energy function defined in Eq (3) is NP-hard and thus not feasible to compute exactly. Loopy belief propagation [14], TRWS [15], and AStar [16] are among the many algorithms used for approximate energy minimization. We propose an alternative approach specifically designed for solving our optimization problem.
Our approach is designed to decompose the overall problem into smaller subproblems, solve them, and combine their solutions to find a solution to the overall problem. This is based on the observation that a solution to the overall problem will also be a valid solution to any of the subproblems. Let be a subset of . Then, a subproblem refers to finding a set of common object proposals for with low energy values. The energy value for a selection is defined as the sum of all pairwise and unary potentials in the subproblem, similar to how the energy function is defined for the overall problem in Eq (3).
The problem decomposition starts at the root and divides it into two disjoint subproblems and recursively continues dividing each into two subproblems until each subproblem only contains a single bag . If , then this can be represented as a full binary tree where each node represents a subproblem. (Footnote 3: This is without loss of generality since zero padding could be used if the number of positive bags is not a power of two.) Let be the th node at level . Then the root node represents the overall problem, nodes at any given level represent disjoint subproblems of the same size, and the leaf nodes, , at level of the tree each represent a subproblem with only one positive bag .
The computation starts at the lowest level (leaf nodes) where the set of solution proposals is simply all the images in the bag. At the next level each node combines the solution proposals from its child nodes and prunes them to compute solution proposals for its own subproblem, which in turn are used as input to nodes at the next level and so on until we reach the root node, which is the output for the optimization. The joining procedure used to combine the solution of two nodes is described next.
Joining: Node at level receives as input solution proposals and from its child nodes and . The joining operation simply concatenates every possible selection from the first set with every possible selection in the second set and forms a set of selection proposals for the subproblem
where concatenates two selection sequences. We denote the joining operation by the Cartesian product notation i.e. .
Pruning: Due to the Cartesian product used to generate , the number of proposal selections grows exponentially as we ascend the tree. Also, not all the generated proposals contain a common object. Therefore, we use a pruning algorithm that picks the selections with the lowest energy values. The energy values for each subproblem can also be efficiently computed from bottom to top. At the lowest level, the energy for each selection is the unary potential from Eq (4),
(8) 
Note that selection consists of only one image. After the leaves, energy in other nodes can be computed recursively. Let be formed by joining two selection proposals and . The energy function can be factorized as
(9) 
where , according to the graph structure, is the sum of all pairwise potentials between two joining subproblems and computed on the fly.
Algorithm 1 summarizes the method. A good value of in the pruning method depends on the ambiguity of the task. It is possible to construct an adversarial example that needs all possible proposals at the root node to find the optimal solution. However, in practice, we found that does not need to be large to achieve good performance. Importantly, unlike other methods, this algorithm does not necessarily compute all of the pairwise potentials. For example, if an object class only appears in a small subproblem, the images of that class will get removed by nodes whose subproblem size is large enough. Thus, in the next level of the tree, the pairwise potentials between those images and other images would no longer be required. In general, the number of pairwise potentials computed depends on both the value of and the dataset. We observed that only a small fraction of the total pairwise potentials were required in our experiments.
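The bottom-up join-and-prune procedure can be sketched in a few lines. This is a simplified sketch and not Algorithm 1 itself: `k` is a hypothetical name for the number of proposals kept after pruning, `eta` for the unary weight, the leaf energy is assumed to be the weighted unary potential, and an unpaired node is passed up unchanged instead of using zero padding.

```python
def join(left, right, pairwise):
    """Concatenate every left selection with every right one and add cross terms."""
    joined = []
    for sel_l, e_l in left:
        for sel_r, e_r in right:
            # Only the potentials between the two subproblems are new.
            cross = sum(pairwise(a, b) for a in sel_l for b in sel_r)
            joined.append((sel_l + sel_r, e_l + e_r + cross))
    return joined


def prune(proposals, k):
    """Keep the k selections with the lowest energy."""
    return sorted(proposals, key=lambda p: p[1])[:k]


def greedy_common_object(bags, pairwise, unary, eta, k):
    """Return (selection, energy) found by joining and pruning up a binary tree."""
    # Leaves: every image is a one-element selection scored by its unary term.
    level = [[((img,), eta * unary(img)) for img in bag] for bag in bags]
    while len(level) > 1:
        merged = [prune(join(level[i], level[i + 1], pairwise), k)
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            merged.append(level[-1])  # simplification: odd node passes through
        level = merged
    return min(level[0], key=lambda p: p[1])
```

Because pruned selections never reach higher levels, the cross terms for their images are never evaluated, which is where the saving in pairwise potential computations comes from.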
6 Experiments
In this section, we evaluate the proposed algorithm on few-shot common object recognition and colocalization tasks. For each task, we first pretrain a CNN feature extractor module on the training set and use it with fixed weights to extract features from images for all the methods.
For learning pairwise and unary potentials, stochastic gradient descent with a gradual learning rate decay schedule is used to minimize the loss function in Eq (7). The complete framework (“Ours” in the tables) uses the greedy optimization method described in Algorithm 1. The optimal value of in Eq (3) is found using grid search. In all experiments, a maximum of top selection proposals are kept in the greedy algorithm.
All experiments are done on a single Nvidia GTX GPU and an AMD Ryzen Threadripper X CPU with Cores ( Threads) and GHz frequency. In the next sections, we first review the baseline methods and then present the results for each task.
6.1 Baseline Methods
We compare our greedy optimization algorithm to AStar [16] which is used for object cosegmentation [1] and the faster TRWS [15] which is used for inference on MIL problems [12, 33]. We use a highly efficient parallel implementation of these algorithms [38].
SVM based MIL. We report the results of three well-known methods, MI-SVM, mi-SVM [23], and sbMIL [39], using publicly available source code [24]. The sbMIL method is specially designed to deal with sparse positive bags. The RBF and linear kernels are chosen as they work better on few-shot common object recognition and colocalization, respectively. Grid search is performed in order to select the hyperparameters.
Attention based deep MIL. Along with the SVM based methods, the results of the more recent attention based deep learning MIL method [40] (ATNMIL) are presented on our benchmarks. After training the model, we select the image proposal with the maximum attention weight from each positive bag.
6.2 Few-shot Common Object Recognition
In this task, bags are constructed by sampling images from the miniImageNet dataset [10]. To construct bags, we first randomly select classes out of all the possible classes . One of these is selected to be the target and the rest are considered non-target classes. Then, each positive bag is constructed by randomly sampling one image from the target class and images from the target and non-target classes. The negative bag is built by sampling examples from non-target classes. For output selection , we measure the success rate, which is equal to the percentage of that belong to the target class. We compute the expected value of the success rate for randomly sampled problems and report the mean and % confidence interval of the evaluation metric.
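The bag construction described above can be sketched as follows. This is a hypothetical sketch: the function and argument names are illustrative, `images_by_class` is assumed to map each class name to a list of image identifiers, and images are drawn with replacement for brevity, which the original sampling may not do.

```python
import random


def sample_episode(images_by_class, num_classes, num_bags, bag_size, neg_size, rng):
    """Sample one problem: a target class, positive bags, and a negative bag."""
    classes = rng.sample(sorted(images_by_class), num_classes)
    target, non_targets = classes[0], classes[1:]

    def draw(class_pool):
        # Pick a class from the pool, then an image of that class.
        cls = rng.choice(class_pool)
        return (rng.choice(images_by_class[cls]), cls)

    positive_bags = []
    for _ in range(num_bags):
        # One guaranteed target image, the rest from target and non-target classes.
        bag = [draw([target])] + [draw(classes) for _ in range(bag_size - 1)]
        rng.shuffle(bag)
        positive_bags.append(bag)
    # The negative bag never contains the target class.
    negative_bag = [draw(non_targets) for _ in range(neg_size)]
    return target, positive_bags, negative_bag
```

Passing a seeded `random.Random` instance as `rng` makes the sampled problems reproducible across runs.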
miniImageNet is a benchmark in few-shot learning. We use it as a proof of concept for comparison of different design choices without requiring large scale training and performance evaluations. The dataset contains images of size from classes. We experiment on the standard split of , and classes for training, validation and testing, respectively [34].
For the CNN feature extractor module, a Wide Residual Network (WRN) [41] with depth and width factor is pretrained on the training split. The dimensional output of global average pooling layer of the pretrained network is provided as input to all the methods.
We vary the number of bags as well as their sizes. We select the number of positive bags , the size of each positive bag , and the size of negative bag . The number of classes to sample from in each episode changes the difficulty of the task. We randomly choose between and when , and between and when for each problem.
The results in Table 2 show that our method outperforms ATNMIL and SVM based approaches for all versions of the problem. To test the importance of learning the unary and pairwise potentials, we construct a baseline that uses cosine similarity to compute the relation between pairs while keeping the rest of the algorithm identical. (Footnote 4: We also tried the negative of the Euclidean distance as the relation measure, but it shows inferior performance.) The performance gap between our method and the baseline shows that the relation learning method, apart from the structured inference formulation, plays an important role in boosting the performance.
Figure 2 plots the average total runtime (potentials computation + inference) versus accuracy for different energy minimization methods under different settings. Even on this small scale problem, the greedy optimization is faster on average while its accuracy is on par with other inference methods. See the supplementary material for complete numerical results.
6.3 CoLocalization
We evaluate on the colocalization problem to illustrate the benefits of the methods discussed in the paper on a real-world, large-scale dataset. In this task, each bag is constructed by extracting region proposals from one image and a selection represents one bounding box from each image. To select images of each problem, we first randomly select one class as the target. Then, images which have at least one object from the target class are sampled as positive bags. The negative bag is composed of images which do not contain the target class. The success rate metric used in few-shot common object detection is used to evaluate the performance of different algorithms. A region proposal is considered successful if it has IoU overlap greater than with the ground-truth target bounding box. Note that for the colocalization task, this metric is equivalent to the CorLoc [42] measure, which is widely used for localization problem evaluation [29, 31, 43, 44].
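The IoU-based success criterion can be sketched as below. This is an illustrative sketch under assumptions: boxes are `(x_min, y_min, x_max, y_max)` tuples, each image has a single ground-truth target box, and `threshold` is left as a required argument because its value is not fixed here.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def success_rate(selected, ground_truth, threshold):
    """Percentage of selected boxes whose IoU with the target exceeds threshold."""
    hits = sum(1 for s, g in zip(selected, ground_truth) if iou(s, g) > threshold)
    return 100.0 * hits / len(selected)
```

When an image contains several instances of the target class, a natural extension is to take the maximum IoU over all of its ground-truth boxes before thresholding.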
We train the algorithm on a split of the COCO 2017 [45] dataset with seen classes and evaluate on the remaining unseen classes. The resulting dataset contains and images in the training and test set respectively. To evaluate the performance of the trained algorithm on a larger set of unseen classes, we also test on the validation set of ILSVRC2013 detection [46]. This dataset originally has classes, but only classes do not overlap with the classes that were used for training. The final dataset, after removing COCO seen classes, contains images from unseen classes. The dataset creation method is explained in the supplementary material in more detail.
For the CNN feature extractor module, we pretrain a Faster R-CNN detector [47] on the training dataset, which has only seen classes. For each bag, top region proposals with the highest objectness scores are kept. The output of the second stage feature extractor is used in all methods.
Table 2 illustrates the quantitative results on COCO and ImageNet datasets with positive and negative images. (Footnote 5: We skip the results for sbMIL and mi-SVM as they showed similar or inferior results to MI-SVM.) Our method works considerably better than other strong MIL baselines. Qualitative results of our method compared with other MIL based approaches are illustrated in Figure 3. Our method selects the correct object even when the target object is not salient. More qualitative results are presented in the supplementary material.
To see the effect of unary and pairwise potentials separately, we provide results for two new variants of the structured inference based methods: (i) Unary Only, where the common object proposal in each bag is selected using only the information in negative bags without seeing the elements in other bags, and (ii) Pairwise Only, where the negative bag information is ignored in each problem. The results show that the pairwise potentials contribute more to the final results. This is not surprising since negative images only help when they contain an object that also appears in the positive images, which, given the number of classes we are sampling from, has a low chance. Interestingly, by using the learned unary potentials alone we could get comparable results to the MIL baselines.
The results in Table 2 show that different inference algorithms have very similar performance. However, as shown in Figure 4, the greedy optimization algorithm is much faster. Note that our method needs to compute only 15% of all pairwise potentials on average. One may argue that the pairs can be forwarded on multiple GPUs in parallel and that this reduces the forward time. However, our greedy inference method can also take advantage of multiple GPUs since the nodes at each level are data independent.
Table 3: Accuracy of the unary potential variants (Method: No Unary, MEAN, MAX, SOFTMAX).
6.4 Ablation Study
In order to evaluate the effectiveness of our proposed unary potential function, we devise the following experiment. In the few-shot common object recognition task with positive bags, negative images, and , we train the unary potentials with four different settings: (1) SOFTMAX: Unary potential function with the learned described in Section 5.1, (2) MAX: , (3) MEAN: , and (4) No Unary: the model without using negative bag information. The pairwise potential function is kept identical in all the methods. The performance of our method on the described settings is presented in Table 3. The results show the superiority of the learned weighted similarity to other strategies.
Next, we evaluate the quality of the learned pairwise relations by using them for the task of one-shot image recognition [10] on miniImageNet and compare it to the other state-of-the-art methods. In each episode of a one-shot way problem, classes are randomly chosen from the set of possible classes and one image is sampled from each class. This mini-training set is used to predict the label of a new query image which is sampled from one of the classes. The performance is the accuracy of the method in predicting the correct label, averaged over many sampled episodes. All of these models are trained with a variant of deep residual networks [41, 53]. Note that unlike other methods, the model in [52] is trained on validation+training meta-sets.
At each episode, we use the learned relation function to score the similarity between the query image and all the images in the mini-training set. The predicted label for the query image is simply the label of the image in the mini-training set which has the highest relation value to the query image. We compute the accuracy of the predictions of our pairwise potentials on test classes of miniImageNet and compare it with current state-of-the-art few-shot methods in Table 4. We also provide a comparison of the gated activation function with a simplified ReLU activation in our architecture. Although our method is not trained directly for the task of one-shot learning, it achieves results competitive with previous methods which are specifically trained for the task. Also, gated activation outperforms ReLU for this task.
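The prediction rule used in this evaluation amounts to a nearest-neighbour lookup under the learned relation. A minimal sketch, assuming `relation` is any callable that returns a higher value for more related pairs and that the mini-training set is given as `(image, label)` pairs; the function names are hypothetical.

```python
def predict_label(query, support, relation):
    """Return the label of the support image most related to the query."""
    _, best_label = max(support, key=lambda pair: relation(query, pair[0]))
    return best_label


def one_shot_accuracy(episodes, relation):
    """Fraction of episodes (query, true_label, support) predicted correctly."""
    correct = sum(1 for query, true_label, support in episodes
                  if predict_label(query, support, relation) == true_label)
    return correct / len(episodes)
```

Any relation function can be plugged in here, which makes it straightforward to compare a learned relation against a fixed metric such as cosine similarity on the same episodes.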
7 Conclusion
In this paper, we introduce a method for learning to find a common object among small image collections, which is constructed by learning unary and pairwise terms in a structured output prediction framework. Moreover, we propose a simple greedy inference algorithm that uses the structure of the problem to solve the task at hand without requiring computation of all pairwise terms. Our experiments on two challenging tasks illustrate that transferring knowledge from seen classes is favourable to other fine-tuning based weakly supervised methods in low data regimes. In addition, our greedy inference algorithm is comparable to two well-known structured inference algorithms for this task while requiring significantly less computation.
References
 [1] Sara Vicente, Carsten Rother, and Vladimir Kolmogorov. Object cosegmentation. In CVPR, pages 2217–2224, 2011.
 [2] Alon Faktor and Michal Irani. Co-segmentation by composition. In ICCV, pages 1297–1304, 2013.
 [3] Kuang-Jui Hsu, Chung-Chi Tsai, Yen-Yu Lin, Xiaoning Qian, and Yung-Yu Chuang. Unsupervised CNN-based co-saliency detection with graphical optimization. In ECCV, 2018.
 [4] Huazhu Fu, Dong Xu, Bao Zhang, and Stephen Lin. Object-based multiple foreground video co-segmentation. In CVPR, pages 3166–3173, 2014.
 [5] Boris Babenko, Ming-Hsuan Yang, and Serge Belongie. Visual tracking with online multiple instance learning. In CVPR, pages 983–990. IEEE, 2009.
 [6] Oded Maron and Tomás Lozano-Pérez. A framework for multiple-instance learning. In NIPS, pages 570–576, 1998.
 [7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
 [8] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, pages 1842–1850, 2016.
 [9] Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization. In AISTATS, 2019.
 [10] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NIPS, pages 3630–3638, 2016.
 [11] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
 [12] Thomas Deselaers and Vittorio Ferrari. A conditional random field for multiple-instance learning. In ICML, pages 287–294, 2010.
 [13] Hossein Hajimirsadeghi, Jinling Li, Greg Mori, Mohamed Zaki, and Tarek Sayed. Multiple instance learning by discriminative training of Markov networks. In UAI, pages 262–271, 2013.
 [14] Yair Weiss and William T Freeman. On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47(2):736–744, 2001.
 [15] Vladimir Kolmogorov. Convergent tree-reweighted message passing for energy minimization. TPAMI, 28(10):1568–1583, 2006.
 [16] Martin Bergtholdt, Jörg Kappes, Stefan Schmidt, and Christoph Schnörr. A study of parts-based object class detection using complete graphs. IJCV, 87(1–2):93, 2010.
 [17] Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77:329–353, 2018.
 [18] Deepak Pathak, Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional multi-class multiple instance learning. In ICLR Workshop, 2015.
 [19] Zequn Jie, Yunchao Wei, Xiaojie Jin, Jiashi Feng, and Wei Liu. Deep self-taught learning for weakly supervised object localization. In CVPR, pages 4294–4302, 2017.
 [20] Kai Chen, Hang Song, Chen Change Loy, and Dahua Lin. Discover and learn new objects from documentaries. In CVPR, 2017.
 [21] Bohan Zhuang, Lingqiao Liu, Yao Li, Chunhua Shen, and Ian D Reid. Attend in groups: a weakly-supervised deep learning framework for learning from web data. In CVPR, 2017.
 [22] Yunhan Shen, Rongrong Ji, Shengchuan Zhang, Wangmeng Zuo, Yan Wang, and F Huang. Generative adversarial learning towards fast weakly supervised detection. In CVPR, pages 5764–5773, 2018.
 [23] Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines for multiple-instance learning. In NIPS, pages 577–584, 2003.
 [24] Gary Doran and Soumya Ray. A theoretical and empirical analysis of support vector machine methods for multiple-instance classification. Machine Learning, pages 79–102, 2014.
 [25] Dingwen Zhang, Junwei Han, Chao Li, and Jingdong Wang. Co-saliency detection via looking deep and wide. In CVPR, pages 2994–3002, 2015.
 [26] Dorit S Hochbaum and Vikas Singh. An efficient algorithm for co-segmentation. In ICCV, pages 269–276, 2009.
 [27] Yao Li, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. Image co-localization by mimicking a good detector’s confidence score distribution. In ECCV, pages 19–34, 2016.
 [28] Kevin Tang, Armand Joulin, Li-Jia Li, and Li Fei-Fei. Co-localization in real-world images. In CVPR, pages 1464–1471, 2014.
 [29] Jasper Uijlings, Stefan Popov, and Vittorio Ferrari. Revisiting knowledge transfer for training object class detectors. In CVPR, 2018.
 [30] Judy Hoffman, Deepak Pathak, Trevor Darrell, and Kate Saenko. Detector discovery in the wild: Joint multiple instance and representation learning. In CVPR, pages 2883–2891, 2015.
 [31] Miaojing Shi, Holger Caesar, and Vittorio Ferrari. Weakly supervised object localization using things and stuff transfer. In ICCV, pages 3401–3410, 2017.
 [32] Mrigank Rochan and Yang Wang. Weakly supervised localization of novel objects using appearance transfer. In CVPR, pages 4315–4324, 2015.
 [33] Thomas Deselaers, Bogdan Alexe, and Vittorio Ferrari. Weakly supervised localization and learning with generic knowledge. IJCV, 100(3):275–293, 2012.
 [34] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
 [35] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4077–4087, 2017.
 [36] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In NIPS, pages 4790–4798, 2016.
 [37] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. Technical report, Google Brain, 2017.
 [38] B. Andres, T. Beier, and J.H. Kappes. OpenGM: A C++ library for discrete graphical models. CoRR, abs/1206.0111, 2012.
 [39] Razvan C. Bunescu and Raymond J. Mooney. Multiple instance learning for sparse positive bags. In ICML, pages 105–112, 2007.
 [40] Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based deep multiple instance learning. In ICML, pages 2127–2136, 2018.
 [41] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, pages 87.1–87.12, 2016.
 [42] Thomas Deselaers, Bogdan Alexe, and Vittorio Ferrari. Localizing objects while learning their appearance. In ECCV, pages 452–466. Springer, 2010.
 [43] Hakan Bilen, Marco Pedersoli, and Tinne Tuytelaars. Weakly supervised object detection with convex clustering. In CVPR, pages 1081–1089, 2015.
 [44] Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Weakly supervised object localization with multi-fold multiple instance learning. TPAMI, 39(1):189–203, 2017.
 [45] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
 [46] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
 [47] Ross Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
 [48] Tsendsuren Munkhdalai, Xingdi Yuan, Soroush Mehri, and Adam Trischler. Rapid adaptation with conditionally shifted neurons. In ICML, pages 3661–3670, 2018.
 [49] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In ICLR, 2018.
 [50] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, pages 4367–4375, 2018.
 [51] Boris N Oreshkin, Alexandre Lacoste, and Pau Rodriguez. TADAM: Task dependent adaptive metric for improved few-shot learning. arXiv preprint arXiv:1805.10123, 2018.
 [52] Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L. Yuille. Few-shot image recognition by predicting parameters from activations. In CVPR, 2018.
 [53] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645, 2016.
Appendix A Co-Localization: COCO Dataset Creation and Faster R-CNN Training
COCO dataset has classes in total. We take the same unseen classes that are used in the zero-shot object detection paper [bansal2018zero] and keep the remaining classes for training. The training set is constructed from the images in the COCO 2017 train set that contain at least one object from the seen classes. The COCO test set is built by combining the unused images of the train set with the images in the COCO validation set that contain at least one object from the unseen classes. Similar to [bansal2018zero], to avoid training the network to classify unseen objects as background, we remove objects of unseen classes from the training images using their ground-truth segmentation masks.
We use the TensorFlow Object Detection API [huang2017speed] to pretrain the Faster R-CNN feature extraction module. To speed up pretraining, training images are resized down to pixels and ResNet [he2016identity] is used as the backbone feature extractor. All layer weights are initialized with variance scaling initialization [glorot2010understanding] and biases are initially set to zero. An additional linear layer, which maps the dimensional output of the second-stage feature extractor to a dimensional feature vector, is added to the network. We did this so that the feature space has the same dimension as in the few-shot common object recognition experiment. We pretrain the feature extractor on four GPUs with batch size of for iterations. The dimensional features are used as input to all of the methods in our experiments.
Appendix B Hyperparameter Tuning
In the few-shot common object recognition task, we use grid search on the validation set to tune the hyperparameters of all the methods. To ensure that the structured inference methods optimize the same objective function, we find for the TRWS method and use the same value in the AStar and greedy energy functions. For the few-shot common object recognition task, the value of is shown in Table 5 for each setting.
In the co-localization experiments, the results of the best-performing hyperparameters are reported for all the methods. and is used in COCO and ImageNet experiments respectively.
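The tuning protocol above, searching a grid on the validation set for one method and then reusing the selected value in the other energy functions, can be sketched as follows. The candidate grid, the name `weight`, and the toy validation score are hypothetical placeholders chosen for illustration; they are not the values or the objective used in our experiments.

```python
from typing import Callable, Iterable, Tuple


def grid_search(candidates: Iterable[float],
                validation_score: Callable[[float], float]) -> Tuple[float, float]:
    """Return (best_value, best_score) over the candidates on the validation set."""
    scored = [(validation_score(c), c) for c in candidates]
    best_score, best_value = max(scored)
    return best_value, best_score


def toy_validation_score(weight: float) -> float:
    """Hypothetical validation success rate as a function of one hyperparameter."""
    return 1.0 - (weight - 0.5) ** 2


# Tune once (e.g. for one inference method), then share the value across methods.
grid = [0.1, 0.25, 0.5, 1.0, 2.0]
shared_weight, shared_score = grid_search(grid, toy_validation_score)
settings = {method: shared_weight for method in ("TRWS", "AStar", "Greedy")}
print(settings)
```

Sharing a single tuned value in this way keeps the objective identical across inference methods, so differences in success rate reflect the optimizer rather than the energy function.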
Appendix C Structured Inference Methods Comparison
The numerical results used to generate Figure 2 of the paper are shown in Table 5. The success rate of the greedy method is on par with the other inference algorithms. From the optimization point of view, it is also important to examine the mean energy value of the top selection of each method. These results are shown in Table 6 and Table 7 for the few-shot common object recognition and co-localization experiments, respectively. While AStar and TRWS achieve lower energy values on these problems, the success rates of the methods are comparable. This suggests that finding an approximate solution to the minimization problem is sufficient for achieving a high success rate.
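[Table 5: results of TRWS, AStar, and Greedy (Ours) for each setting of B, including B = 10; the numerical entries are missing from this copy.]
[Table 6: mean energy of the top selection of TRWS, AStar, and Greedy for each setting of B, including B = 10, in the few-shot common object recognition experiments; the numerical entries are missing from this copy.]
[Table 7: mean energy of the top selection of TRWS, AStar, and Greedy on COCO and ImageNet in the co-localization experiments; the numerical entries are missing from this copy.]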
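The gap between exact and approximate energy minimization discussed in this appendix can be illustrated on a small random instance. The sketch below is not our greedy inference algorithm, TRWS, or AStar: it compares a brute-force minimizer with a generic sequential greedy heuristic, and the random potentials, function names, and problem sizes are assumptions made purely for illustration.

```python
import itertools
import random


def random_instance(num_bags, bag_size, seed=0):
    """Random unary potentials per bag and pairwise potentials per bag pair (i < j)."""
    rng = random.Random(seed)
    unary = [[rng.random() for _ in range(bag_size)] for _ in range(num_bags)]
    pairwise = {
        (i, j): [[rng.random() for _ in range(bag_size)] for _ in range(bag_size)]
        for i in range(num_bags) for j in range(i + 1, num_bags)
    }
    return unary, pairwise


def energy(selection, unary, pairwise):
    """Sum of unary terms plus pairwise terms over all pairs of bags."""
    total = sum(unary[i][k] for i, k in enumerate(selection))
    for (i, j), table in pairwise.items():
        total += table[selection[i]][selection[j]]
    return total


def exact_minimum(unary, pairwise):
    """Exhaustive search over one image per bag; feasible only for tiny problems."""
    choices = [range(len(u)) for u in unary]
    return min(itertools.product(*choices), key=lambda s: energy(s, unary, pairwise))


def greedy_minimum(unary, pairwise):
    """Pick one image per bag in order, minimizing the energy added by each choice."""
    selection = []
    for i, u in enumerate(unary):
        def added_cost(k):
            return u[k] + sum(pairwise[(j, i)][selection[j]][k] for j in range(i))
        selection.append(min(range(len(u)), key=added_cost))
    return tuple(selection)


unary, pairwise = random_instance(num_bags=4, bag_size=5)
exact = exact_minimum(unary, pairwise)
greedy = greedy_minimum(unary, pairwise)
print(energy(exact, unary, pairwise), energy(greedy, unary, pairwise))
```

The heuristic's energy can never fall below the exact minimum, yet both procedures may still select the same or equally useful images, which is the sense in which an approximate minimizer can suffice.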
Appendix D Sharing Parameters of Unary and Pairwise Relation Modules
As discussed in Section 5.1, both the unary and the pairwise potential functions use a relation module with an identical architecture. However, since the input class distribution differs between these functions, we choose not to share their parameters. We conduct an experiment to measure the effect of parameter sharing in the few-shot common object recognition task with , , and . As Table 2 shows, the success rate for this setting is without parameter sharing. However, when the unary and pairwise potentials are trained with shared relation module parameters, the performance degrades to .
Appendix E More Qualitative Results
Qualitative results on the ImageNet dataset are shown in Figure 5. Figure 6 shows the complete qualitative results presented in the paper, together with the negative images, on the COCO dataset.