1 Introduction
Remarkable progress has been achieved in visual tasks such as image categorization, object detection, or semantic segmentation, typically using fully supervised algorithms and vast amounts of manually annotated data (e.g., [17, 20, 21, 27, 29, 38, 40]). With the advent of crowdsourcing, large corporations and, to a lesser extent, academic units can launch the corresponding massive annotation efforts for specific projects that may involve millions of images [40].
But handling Internet-scale image (or video) repositories, or the continuous learning scenarios associated with personal assistants or autonomous cars, demands approaches less hungry for manual annotation. Several alternatives are possible, including weakly supervised approaches that rely on readily available metadata [2, 9] or image-level labels [14, 23, 24, 25, 39, 45] instead of more complex annotations such as bounding boxes [17, 38] or object masks [20] as supervisory signal; semi-supervised methods [6, 26] that exploit a relatively small number of fully annotated pictures together with a larger set of unlabelled images; and self-supervised algorithms that take advantage of the internal regularities of image parts [15, 37] or video subsequences [1, 34, 48] to construct image models that can be further fine-tuned in fully supervised settings.
We address here the even more challenging problem of discovering both the structure of image collections – that is, which images depict similar objects (or textures, scenes, actions, etc.) – and the objects in question, in a fully unsupervised setting [8, 11, 16, 30, 39, 41, 43]. Although weakly, semi-, and self-supervised methods may provide a more practical foundation for large-scale visual recognition, the fully unsupervised construction of image models is a fundamental scientific problem in computer vision, and it should be studied. In addition, any reasonable solution to this problem will facilitate subsequent human labelling (by presenting discovered groups to the operator) and scaling through automatic label propagation, help interactive query-based visual search by linking ahead of time fragments of potential interest, and provide a way to learn visual models for subsequent recognition.
1.1 The implicit structure of image collections
Any collection of images, say, those found on the Internet, or more modestly, in a dataset such as Pascal VOC'07, admits a natural graph representation, where nodes are the pictures themselves, and edges link pairs of images with similar visual content. In supervised image categorization (e.g., [27, 29]) or object detection (e.g., [17, 20, 38]) tasks, both the graph structure and the visual content are clearly defined: Annotators typically sort the images into bags, each one intended to represent some "object", "scene" or, say, "action" class ("horse", "forest", "playing tennis", etc.). Two nodes are linked by an edge when they are associated with the same bag, and each class is empirically defined by the images (or some manually defined rectangular regions within) in the corresponding connected component of the graph. In weakly supervised co-segmentation [23, 25, 39] or co-localization [14, 24, 45] tasks, on the other hand, the graph is fully connected, and all images are supposed to contain instances of the (few) same object categories, say, "horse", "grass", "sky", "background". Manual intervention is reduced to selecting which images to put into a single bag, and the visual content, in the form of regions defined by pixel-level symbolic labels or bounding boxes associated with one of the predefined categories, is discovered using a clustering algorithm. (Footnote 1: In both the cases of supervised image categorization/object detection and weakly supervised co-segmentation/co-localization, once the graph structure and the visual content have been identified at training time, these can be used to learn a model of the different object classes and add nodes, edges, and possibly additional bounding boxes at test time.)
We address in this paper the much more difficult problem of fully unsupervised image matching and object discovery, where both the graph structure and a model of visual content in the form of object bounding boxes must be extracted from the native data without any manual intervention. This problem has been addressed in various forms, e.g., clustering [16] (Footnote 2: Note that plain unsupervised clustering, whether classic, spectral, discriminative or deep [4, 22, 32, 36], focuses on data partitioning and not on the discovery of subsets of matching items within a cluttered collection.), image matching [39] or topic discovery [41, 43] (see also [8, 11], where "pseudo-object" labels are learned in an unsupervised manner). In this presentation, we build directly on the work of Cho et al. [12] (see [28] for related work): Given an image and its neighbors, assumed to contain the same object, a robust matching technique exploits both appearance and geometric consistency constraints to assign confidence and saliency ("standout") scores to region proposals in this image. The overall discovery algorithm alternates between localization steps, where the neighbors are fixed and the regions with top saliency scores are selected as potential objects, and retrieval steps, where the confidence of the regions within potential objects is used to find the nearest neighbors of each image. After a fixed number of steps, the region with top saliency in each image is declared to be the object it contains. Empirically, this method has been shown in [12] to give good results. However, it does not formulate image matching and object discovery as a proper optimization problem, and there is no guarantee that successive iterations will improve some objective measure of performance. The aim of this paper is to remedy this situation.
2 Proposed approach
2.1 Problem statement
Let us consider a set of n images, each containing p rectangular region proposals r_ik, with k in {1, ..., p}. We assume that the images are equipped with some implicit graph structure, where there is a link between two images when the second image contains at least one object from a category depicted in the first one, and our aim is to discover this structure, that is, find the links and the corresponding objects. To model this problem, let us define an indicator variable x_ik, whose value is 1 when region number k of image i corresponds to a "foreground object" (visible in large part and from a category that occurs multiple times in the image collection), and 0 otherwise. We collect all the variables x_ik associated with image i into an element x_i of {0,1}^p, and concatenate all the x_i variables into an element x of {0,1}^np. Likewise, let us define an indicator variable e_ij, whose value is 1 if image j contains an object also occurring in image i, with 1 <= i, j <= n and j != i, and 0 otherwise, collect all the variables e_ij associated with image i into an element e_i of {0,1}^n, and concatenate all the e_i variables into an n x n matrix e with rows e_i^T. Note that we can use e to define a neighborhood for each image in the set: Image j is a neighbor of image i if e_ij = 1. By definition, e defines an undirected graph if it is symmetric and a directed one otherwise. Let us also denote by S^ij_kl the similarity between regions k and l of images i and j, and by S^ij the p x p matrix with entries S^ij_kl.
We propose to maximize with respect to x and e the objective function

S(x, e) = Σ_{i=1}^n Σ_{j != i} e_ij x_i^T S^ij x_j.    (1)
Intuitively, maximizing S encourages building edges between images i and j that contain regions k and l with a strong similarity S^ij_kl. Of course, we would like to impose certain constraints on the x and e variables. The following cardinality constraints are rather natural: an image should not contain more than a predefined number of objects, say ν:
x_i^T 1_p <= ν  for all 1 <= i <= n,    (2)

where 1_p is the element of R^p with all entries equal to one.
An image should not match more than a predefined number of other images, say τ:

e_i^T 1_n <= τ  for all 1 <= i <= n.    (3)
Assumptions. We will suppose from now on that S^ij is element-wise nonnegative, but not necessarily symmetric (the similarity model we explore in Section 3 is asymmetrical). Likewise, we will assume that the matrix e has a zero diagonal but is not necessarily symmetric.
Under these assumptions, the cubic pseudo-Boolean function S is supermodular [10]. Without constraints, this type of function can be maximized in polynomial time using a max-flow algorithm [7] (in the case of S, which does not involve linear and quadratic terms, the solution is of course trivial without constraints, and amounts to setting all x_ik and e_ij with j != i to 1). When the cardinality constraints (2)-(3) are added, this is not the case anymore, and we have to resort to a gradient ascent algorithm, as explained next.
2.2 Relaxing the problem
Let us first note that, for binary variables e, x and y, we have

e x y = min(e, x, y),    (4)

with min concave. Relaxing our problem so all variables are allowed to take values in [0, 1], our objective becomes a sum of concave functions, and thus is itself a concave function, defined over the convex set (hyperrectangle) [0, 1]^d, where d is the total number of variables. This is the standard tight concave continuous relaxation of supermodular functions.
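The identity behind this relaxation is easy to check numerically. Below is a minimal sketch (ours, not part of the paper's implementation) verifying that the cubic monomial of binary variables coincides with the concave min function on {0,1}^3:

```python
import itertools

# For binary e, x, y the product e*x*y equals min(e, x, y);
# min extends it concavely to the hyperrectangle [0, 1]^3.
for e, x, y in itertools.product([0, 1], repeat=3):
    assert e * x * y == min(e, x, y)

# On fractional values the two differ: min is the concave envelope.
print(min(0.5, 0.8, 0.4))  # prints 0.4 (the product would be 0.16)
```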
The Lagrangian associated with our relaxed problem is

L(x, e, λ, μ) = S(x, e) + Σ_{i=1}^n [λ_i (ν − x_i^T 1_p) + μ_i (τ − e_i^T 1_n)],    (5)
where λ and μ are positive Lagrange multipliers. The function S is concave and the primal problem is strictly feasible; hence Slater's conditions [44] hold, and we have the following equivalent primal and dual versions of our problem:

max_{(x,e) in C} min_{λ,μ >= 0} L(x, e, λ, μ) = min_{λ,μ >= 0} max_{(x,e) in C} L(x, e, λ, μ),    (6)

where the domain C is the Cartesian product of [0, 1]^np and the space of n x n matrices with entries in [0, 1] and a zero diagonal. With slight abuse we denote it [0, 1]^d, with d = np + n^2.
2.3 Solving the dual problem
We propose to solve the dual problem with a subgradient descent approach. Starting from some initial values λ^0 and μ^0, we use the update rule

λ_i^{t+1} = [λ_i^t − α (ν − x_i^{tT} 1_p)]_+,   μ_i^{t+1} = [μ_i^t − β (τ − e_i^{tT} 1_n)]_+,    (7)

where [·]_+ denotes the positive part, α and β are fixed step sizes, ν − x_i^{tT} 1_p and τ − e_i^{tT} 1_n are the subgradients of the Lagrangian with respect to λ_i and μ_i evaluated at x^t and e^t, and

(x^t, e^t) in argmax_{(x,e) in C} L(x, e, λ^t, μ^t).    (8)
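For concreteness, one projected-subgradient step on the multipliers can be sketched as below; the array shapes and names (`lam`, `mu`, `alpha`, `beta`) are our own illustration of the notation above, not the authors' code:

```python
import numpy as np

def dual_update(lam, mu, x, e, nu, tau, alpha, beta):
    """One projected subgradient step on the Lagrange multipliers (sketch).

    lam, mu: (n,) multiplier vectors; x: (n, p) region indicators;
    e: (n, n) edge indicators; nu, tau: cardinality bounds of Eqs. (2)-(3).
    The subgradient of the dual in lam_i is nu - x_i^T 1_p (and similarly
    for mu), so we step against it and project onto the nonnegative orthant.
    """
    lam_new = np.maximum(lam - alpha * (nu - x.sum(axis=1)), 0.0)
    mu_new = np.maximum(mu - beta * (tau - e.sum(axis=1)), 0.0)
    return lam_new, mu_new
```

Note that a violated constraint (e.g., an image with more than ν selected regions) makes the corresponding multiplier grow, which penalizes that image in the next inner maximization.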
As shown in the Appendix, for fixed values of λ and μ, our Lagrangian is a supermodular pseudo-Boolean function of the binary variable sets x and e. This allows us to take advantage of the following direct corollary of [3, Prop. 3.7].
Proposition 2.1.
Let f denote some supermodular pseudo-Boolean function of d variables. We have

max_{z in {0,1}^d} f(z) = max_{z in [0,1]^d} f(z),    (9)

and the set of maximizers of f in [0,1]^d is the convex hull of the set of maximizers of f on {0,1}^d.
In particular, we can take

(x^t, e^t) in argmax_{(x,e) in C ∩ {0,1}^d} L(x, e, λ^t, μ^t).    (10)
As shown in [7, 10], the corresponding supermodular cubic pseudo-Boolean function optimization problem is equivalent to a maximum stable set problem in a bipartite conflict graph, which can itself be reduced to a maximum-flow problem. See the Appendix for details.
Note that the size of the min-cut/max-flow problems that have to be solved is conditioned by the number of nonzero entries in the matrices S^ij, which is upper-bounded by n^2 p^2 when these matrices are dense. This is prohibitively high for the values of n and p encountered in practice. To make the computations manageable, we set all but a fixed number (depending on the dataset's size) of the largest entries in each S^ij to zero in our implementation.
2.4 Solving the primal problem
Once the dual problem is solved, as argued by Nedić and Ozdaglar [35] and Bach [3], an approximate solution of the primal problem can be found as a running average of the primal sequence generated as a byproduct of the subgradient method:

x̄^t = (1/t) Σ_{s=1}^t x^s,   ē^t = (1/t) Σ_{s=1}^t e^s,    (11)

after some number t of iterations. Note that the scalars x̄^t_ik and ē^t_ij lie in [0, 1] but do not necessarily verify the constraints (2) and (3). Theoretical guarantees on these values can be found under additional assumptions in [3, 35].
2.5 Rounding the solution and greedy ascent
Note that two problems remain to be solved: The solution found now belongs to [0, 1]^d instead of {0, 1}^d, and it may not satisfy the original constraints. Note, however, that because of the form of the function S, given some e and fixed values for all the x_j with j != i, the maximum value of S given the constraints is obtained by setting to 1 exactly the entries of x_i corresponding to the ν largest entries of the vector Σ_{j != i} [e_ij S^ij x_j + e_ji S^jiT x_j]. Likewise, for some fixed value of x, the maximum value of S is reached by setting to 1, for all i in {1, ..., n}, exactly the entries of e_i corresponding to the τ largest scalars x_i^T S^ij x_j for j != i. This suggests the following approach to rounding the solution, where the x_i variables are updated sequentially in an order specified by some random permutation σ of {1, ..., n}, before the e_i variables are updated in parallel. Given the permutation σ, the algorithm below turns the running average (x̄, ē) of the primal sequence into a discrete solution that satisfies the conditions (2) and (3):

  Initialize x = x̄, e = ē.
  For i = 1 to n do
    Compute the indices k_1 to k_ν of the ν largest elements of the vector Σ_{j != σ(i)} [e_{σ(i)j} S^{σ(i)j} x_j + e_{jσ(i)} S^{jσ(i)T} x_j].
    Set x_{σ(i)} = 0; for m = 1 to ν, set x_{σ(i)k_m} = 1.
  For i = 1 to n do
    Compute the indices j_1 to j_τ of the τ largest scalars x_i^T S^ij x_j (j != i).
    Set e_i = 0; for m = 1 to τ, set e_{ij_m} = 1.
  Return (x, e).

Note that there is no preferred order for the image indices. This actually suggests repeating this procedure with different random permutations until the variables x and e do not change anymore, or some limit on the number of iterations is reached. This iterative procedure can be seen as a greedy ascent over the discrete variables of interest. Note that by construction the terms in the left- and right-hand sides of (2) and (3) are equal at the optimum.
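The rounding procedure above can be sketched in code as follows, assuming the similarity matrices are stored as a dictionary of dense p x p arrays; all names are illustrative, and this is a sketch rather than the authors' implementation:

```python
import numpy as np

def greedy_round(x_bar, e_bar, S, nu, tau, rng):
    """Round a fractional primal solution to a binary one (a sketch).

    Assumed shapes: x_bar (n, p), e_bar (n, n), S a dict mapping (i, j),
    i != j, to the p x p similarity matrix S^{ij}.
    """
    n, p = x_bar.shape
    x, e = x_bar.copy(), e_bar.copy()
    for i in rng.permutation(n):            # sequential update of the x_i
        v = np.zeros(p)
        for j in range(n):
            if j == i:
                continue
            v += e[i, j] * (S[(i, j)] @ x[j]) + e[j, i] * (S[(j, i)].T @ x[j])
        top = np.argsort(v)[-nu:]           # indices of the nu largest entries
        x[i] = 0.0
        x[i, top] = 1.0
    for i in range(n):                      # parallel update of the e_i
        scores = np.array([x[i] @ S[(i, j)] @ x[j] if j != i else -np.inf
                           for j in range(n)])
        top = np.argsort(scores)[-tau:]     # tau best neighbors of image i
        e[i] = 0.0
        e[i, top] = 1.0
    return x, e
```

By construction, the returned solution saturates the constraints: each x_i has exactly ν nonzero entries and each e_i exactly τ.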
2.6 Ensemble post-processing
The parameter ν can be seen from two different viewpoints: (1) as the maximum number of objects that may be depicted in an image, or (2) as an upper bound on the total number of object region candidates that are under consideration in a picture. Both viewpoints are equally valid but, following Cho et al. [12], we focus in the rest of this presentation on the second one, and present in this section a simple heuristic for selecting one final object region among these candidates. Concretely, since using random permutations during greedy ascent provides a different solution for each run of our method, we propose to apply an ensemble method to stabilize the results and boost performance in this selection process, itself viewed as a post-processing stage separate from the optimization part.

Let us suppose that after M independent executions of the greedy ascent step, we obtain solutions (x^1, e^1), ..., (x^M, e^M). We start by combining these solutions into a single discrete pair (x̂, ê), where x̂ and ê satisfy

x̂_ik = 1 if there exists m such that x^m_ik = 1, and 0 otherwise;

ê_ij = 1 if there exists m such that e^m_ij = 1, and 0 otherwise.

This way of combining the individual solutions can be seen as a max-pooling procedure. We have also tried average pooling but found it less effective. Note that after this intermediate step, an image might violate either of the two constraints (2)-(3). This is not a problem in this post-processing stage of our method. Indeed, we next show how to use x̂ and ê to select a single object proposal for each image.
We choose a single proposal for each image i out of those retained in x̂ (proposals k such that x̂_ik = 1). To this end, we rank the proposals in image i according to a score defined for each retained proposal k as

score(k) = Σ_{j in N(i)} max_l x̂_jl S^ij_kl,    (12)

where N(i) is composed of the τ images, represented by the entries e_ij = 1 in ê, which have the largest similarity to i as measured by x̂_i^T S^ij x̂_j. Finally, we choose the proposal in image i with maximum score as the final object region. Note that the graph of images corresponding to these final object regions can be retrieved by computing the e that maximizes the objective function given the value of x defined by these regions, as in the greedy ascent. Also, the method above can be generalized to more than one proposal per image using the defined ranking.
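As an illustration, selecting a final region from the max-pooled pair can be sketched as below; the neighborhood construction and the score follow our reading of Eq. (12), and all names (`x_comb`, `e_comb`, `select_final_region`) are our own:

```python
import numpy as np

def select_final_region(i, x_comb, e_comb, S, tau):
    """Pick one proposal for image i from the max-pooled solution (sketch).

    x_comb (n, p) and e_comb (n, n) are the max-pooled indicators; S maps
    (i, j) to the p x p similarity matrix. Each retained proposal k is
    scored by its total similarity to the retained proposals of the tau
    strongest neighbors of i.
    """
    n, p = x_comb.shape
    # strength of each candidate neighbor j of image i
    strength = np.array([x_comb[i] @ S[(i, j)] @ x_comb[j]
                         if e_comb[i, j] and j != i else -np.inf
                         for j in range(n)])
    neighbors = np.argsort(strength)[-tau:]
    scores = np.full(p, -np.inf)
    for k in np.flatnonzero(x_comb[i]):     # only proposals retained in x_comb
        scores[k] = sum((S[(i, j)][k] * x_comb[j]).max() for j in neighbors
                        if strength[j] > -np.inf)
    return int(np.argmax(scores))
```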
3 Similarity model
Let us now get back to the definition of the similarity function S^ij_kl. As advocated by Cho et al. [12], a rectangular region which is a tight fit for a compact object (the foreground) should better model this object than a larger region, since it contains less background, or than a smaller region (a part), since it contains more foreground. Cho et al. [12] only implement the first constraint, in the form of a standout score. We discuss in this section how to implement these ideas in the optimization context of this work.
3.1 Similarity score
Following [12], the similarity score between proposal k of image i and proposal l of image j can be defined as

s^ij_kl = a(r_ik, r_jl) Σ_{t in T} g(t, r_ik, r_jl) Σ_{k', l'} a(r_ik', r_jl') g(t, r_ik', r_jl'),    (13)

where a is a similarity term based on appearance alone, using the WHO descriptor (whitened HOG) [13, 19] in our case, r_ik and r_jl denote the image rectangles associated with the two proposals, t is a discretized offset (translation plus two scale factors) taking values in T, and g measures the geometric compatibility between t and the rectangles r_ik and r_jl. Intuitively, Eq. (13) scales the appearance-only score by a geometric-consistency term akin to a generalized Hough transform [5]; see [12] for details.
Note that we can rewrite Eq. (13) as

s^ij_kl = a(r_ik, r_jl) u_kl^T v,  with  v = Σ_{k', l'} a(r_ik', r_jl') u_k'l',    (14)

where u_kl is the vector of dimension |T| with entries g(t, r_ik, r_jl) for t in T. The vectors u_kl and the vector v can be precomputed with a time and storage cost of O(p^2 |T|). Each term s^ij_kl can then be computed in O(|T|) time, and the matrix S^ij can thus be computed with a total time and space complexity of O(p^2 |T|).
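This factorization is what makes the quadratic (rather than quartic) complexity in p possible: the Hough-style vector v aggregates all matches once and is then shared by every pair. A sketch with dense arrays, using our own shapes and names:

```python
import numpy as np

def hough_similarity(a, g):
    """Compute a p x p similarity matrix by the factorization of Eq. (14).

    a: (p, p) appearance similarities a(r_ik, r_jl); g: (p, p, T) geometric
    compatibilities g(t, r_ik, r_jl) over the discretized offsets t.
    Cost is O(p^2 T) instead of the O(p^4 T) of a naive double sum.
    Shapes and names are our reading of the equations, not the authors' code.
    """
    # v_t = sum_{k', l'} a_k'l' g_k'l't, computed once for the image pair
    v = np.tensordot(a, g, axes=([0, 1], [0, 1]))
    # s_kl = a_kl * <u_kl, v>, where u_kl = g[k, l, :]
    return a * (g @ v)
```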
Note that the score defined by Eq. (13) depends on the number of region proposals per image, which may introduce a bias for edges between images that contain many region proposals. It may thus be desirable to normalize this score by defining it instead as

s̄^ij_kl = s^ij_kl / (Σ_{k', l'} s^ij_k'l').    (15)
3.2 Standout score
Let us identify the region proposals contained in some image i with their index k in {1, ..., p}, and define P(k) as the set of regions k' that are parts of region k (that is, they are included, with some tolerance, within r_ik). Let us also define B(k) as the set of regions k'' that form the background for k (that is, r_ik is included, with some tolerance, within these regions). Let |r| denote the area of some rectangle r. A plausible definition for P(k) is

P(k) = {k' : |r_ik ∩ r_ik'| >= ρ |r_ik'|},    (16)

for some reasonable value of ρ, e.g., 0.5. Likewise, a plausible definition for B(k) is

B(k) = {k'' : |r_ik ∩ r_ik''| >= ρ' |r_ik| and |r_ik''| <= γ |r_ik|},    (17)

for reasonable values of ρ' and γ, e.g., 0.8 and 2. Following [12], we define the standout score of a match (k, l) between images i and j as

z^ij_kl = s^ij_kl − max_{k'' in B(k), l'' in B(l)} s^ij_k''l''.    (18)

With this definition, z^ij_kl may be negative. In our implementation, we threshold these scores so they are nonnegative.
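A brute-force version of this thresholded standout computation can be sketched as below; the background sets are supplied precomputed here, since the set definitions depend on the rectangles themselves, and all names are illustrative:

```python
import numpy as np

def standout(s, background_pairs):
    """Brute-force standout scores, thresholded at zero (a sketch).

    s: (p, p) similarity matrix between proposals of two images;
    background_pairs: dict mapping (k, l) to the list of background pairs
    (k'', l'') with k'' in B(k) and l'' in B(l), supplied precomputed.
    The standout of (k, l) is s(k, l) minus the best background match.
    """
    p = s.shape[0]
    z = np.zeros((p, p))
    for (k, l), bg in background_pairs.items():
        best_bg = max((s[kb, lb] for kb, lb in bg), default=0.0)
        z[k, l] = max(s[k, l] - best_bg, 0.0)  # threshold so scores stay >= 0
    return z
```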
When B(k) and B(l) are large, which is generally the case when the regions r_ik and r_jl are small, a brute-force computation of the standout scores may be very slow. We propose instead a simple heuristic that greatly speeds up calculations.
Let M denote the set formed by the N matches (k, l) with highest scores s^ij_kl, sorted in increasing order of score. The standout scores can then be computed efficiently by the following procedure:

  Initialize all z^ij_kl to 0.
  For each match (k, l) in M, in increasing order of score, do
    For k' = 1 to p and l' = 1 to p do
      If k is in B(k') and l is in B(l') then
        z^ij_k'l' = s^ij_k'l' − s^ij_kl.
The idea is that relatively few high-confidence matches in M can be used to efficiently compute many standout scores. There is a trade-off between the cost of this step, which grows with N, and the number of variables it assigns a value to. In practice, we have found that a small value of N is a good compromise, with only about 5% of the standout scores being computed in a brute-force manner, and a significant speedup factor of over 10.
4 Experiments and results
Datasets, proposals and metric.
For our experiments we use the same datasets (ObjectDiscovery [OD], VOC_6x2 and VOC_all) and region proposals (obtained by the randomized Prim's algorithm [RP] [33]) as Cho et al. [12]. OD consists of pictures of three object classes (airplane, horse and car), with 100 images per category and respectively 18, 7 and 11 outliers containing no object instance. VOC_all is a subset of the PASCAL VOC 2007 trainval dataset obtained by eliminating all images containing only objects marked as difficult or truncated. Finally, VOC_6x2 is a subset of VOC_all containing only images of 6 classes – aeroplane, bicycle, boat, bus, horse and motorbike – from two different views, left and right.

For evaluation, we use the standard CorLoc measure, the percentage of images correctly localized; it is a proxy metric in the case of unsupervised discovery. An image is "correctly localized" when the intersection over union (IoU) between one of the ground-truth regions and the predicted one is greater than 0.5. Following [12], we evaluate our algorithm in "separate" and "mixed" settings. In the former case, the class-wise performance is averaged over classes. In the latter, a single performance is computed over all classes jointly. In our experiments, we use fixed values of ν and τ, and standout matrices with 1000 nonzero entries, unless mentioned otherwise.
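The CorLoc metric itself is straightforward to compute; a sketch with boxes in (x1, y1, x2, y2) form, under our own naming:

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def corloc(predictions, ground_truths):
    """CorLoc: fraction of images whose predicted box overlaps some
    ground-truth box with IoU > 0.5 (a sketch of the standard metric)."""
    hits = sum(any(iou(pred, gt) > 0.5 for gt in gts)
               for pred, gts in zip(predictions, ground_truths))
    return hits / len(predictions)
```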
Separate setting. We first evaluate different settings of our algorithm on the two smaller datasets, OD and VOC_6x2. The performance is governed by three design choices: (1) using the normalized standout score (NS) or its unnormalized version, (2) using continuous optimization (CO) or variables x and e with all entries equal to one to initialize the greedy ascent procedure, and (3) using the ensemble method (EM) or not. In total, we thus have eight configurations to test.
Table 1: CorLoc in the separate setting on OD and VOC_6x2.

Method                       OD          VOC_6x2
Cho et al. [12]              84.2        67.7
Cho et al. [12], corrected   84.2        67.6
w/o EM, w/o CO, w/o NS       81.9 ± 0.9  65.9 ± 1.0
w/o EM, w/o CO, w NS         83.1 ± 0.8  67.2 ± 1.0
w/o EM, w CO, w/o NS         82.9 ± 0.8  66.6 ± 0.7
w/o EM, w CO, w NS           84.4 ± 0.8  68.1 ± 0.9
w EM, w/o CO, w/o NS         84.4 ± 0.0  68.8 ± 0.4
w EM, w/o CO, w NS           85.6 ± 0.3  68.7 ± 0.5
w EM, w CO, w/o NS           83.8 ± 0.2  67.4 ± 0.4
w EM, w CO, w NS             85.8 ± 0.6  69.4 ± 0.3
The results are shown in Table 1. We have found a small bug in the publicly available code of Cho et al. [12], and report both the results from [12] and those we obtained after correction. We observe that the normalized standout score always gives comparable or better results than its unnormalized counterpart, while the ensemble method improves both the score and the stability (lower variance) of our solution. Combining the normalized standout score, the ensemble method, and the continuous optimization initialization to greedy ascent yields the best performance. Our best results outperform [12] by small but statistically significant margins: 1.6% for OD and 1.8% for VOC_6x2. Finally, to assess the merit of the continuous optimization, we have measured its duality gap on OD and VOC_6x2: it ranges from 1.5% to 8.7% of the energy, with averages of 5.2% and 3.9% on the two datasets respectively.

Table 2: CorLoc in the separate setting on VOC_all.

Method                     VOC_all
Cho et al.                 36.6
Cho et al., our execution  37.6
w/o CO, w/o EM             36.4 ± 0.3
w/o CO, w EM               39.0 ± 0.2
w CO, w/o EM               37.8 ± 0.3
w CO, w EM                 39.2 ± 0.2
Li et al. [31]             40.0
Wei et al. [49]            46.9
We now evaluate our algorithm on VOC_all. As the complexity of solving the max-flow problem grows very fast with the number of images, for configurations with continuous optimization we reduce the number of nonzero entries in each standout matrix so that the total number of nodes in the graph stays within a fixed budget. These standout matrices are then used in rounding the continuous solution, but in the greedy ascent procedure we switch to standout matrices with 1000 nonzero entries. For configurations without the continuous optimization, we always use the standout matrices with 1000 nonzero entries. Also, to reduce the memory footprint of our method in the separate setting, we prefilter the set of potential neighbors of each image for the class person, which contains 1023 pictures: the 100 nearest neighbors of each image in terms of Euclidean distance between GIST descriptors [46] are marked as potential neighbors. The other classes are small enough not to require this prefiltering procedure.
Table 2 shows the CorLoc values obtained by our method with different configurations compared to Cho et al. It can be seen that the ensemble post-processing and the continuous optimization are also helpful on this dataset. We obtain the best result with the configuration that includes both of them, which is 1.6% better than Cho et al. However, our performance is still inferior to the state of the art in image colocalization [31, 49], which employs deep features from convolutional neural networks trained for image classification and explicitly exploits the single-class assumption.
Mixed setting. We now compare in Table 3 the performance of our algorithm to Cho et al. in the mixed setting (none of the other methods is applicable to this case). It can be seen that our algorithm without the continuous optimization has the best performance among those in consideration. Compared to Cho et al., it gives a CorLoc 0.8% better on the OD dataset, 4.3% better on VOC_6x2 and 2.3% better on VOC_all. The decrease in performance of our method when using the continuous optimization is likely due to the fact that, because of the limit on the number of nodes of the bipartite graphs, the configuration with the continuous optimization uses standout matrices with only 200 nonzero entries on OD and 100 nonzero entries on VOC_6x2 and VOC_all, while the configuration without it uses standout matrices with 1000 nonzero entries.
Table 3: CorLoc in the mixed setting.

Method                     OD          VOC_6x2     VOC_all
Cho et al.                 –           –           37.6
Cho et al., our execution  82.2        55.9        37.5
w/o CO                     83.0 ± 0.4  60.2 ± 0.4  39.8 ± 0.2
w CO                       80.8 ± 0.5  59.3 ± 0.4  38.5 ± 0.2
Sensitivity to ν. We compare the performance of our method when using different values of ν on the VOC_6x2 dataset. (Footnote 3: Note that we have also tried the interpretation of ν as the maximum number of objects per image, without satisfying results so far.) Table 4 shows the CorLoc obtained by different configurations of our algorithm, all with the normalized standout. The performance consistently increases with the value of ν on this dataset. In all other experiments, however, we set ν to the value used in [12] to ease comparisons.
Table 4: CorLoc on VOC_6x2 for three increasing values of ν.

Method            VOC_6x2
(first value of ν)
  w/o CO, w/o EM  63.5 ± 1.2
  w/o CO, w EM    67.7 ± 0.8
  w CO, w/o EM    65.8 ± 0.8
  w CO, w EM      68.1 ± 0.7
(second value of ν)
  w/o CO, w/o EM  67.2 ± 1.0
  w/o CO, w EM    68.7 ± 0.5
  w CO, w/o EM    68.1 ± 0.9
  w CO, w EM      69.4 ± 0.3
(third value of ν)
  w/o CO, w/o EM  68.6 ± 1.0
  w/o CO, w EM    69.1 ± 0.3
  w CO, w/o EM    68.9 ± 0.7
  w CO, w EM      70.0 ± 0.3
Using deep features. Since activations from deep neural networks trained for image classification (deep features) are known to provide better image representations than handcrafted features in various tasks, we have also experimented with such descriptors. We have replaced WHO [19] by activations from different layers of VGG16 [42] when computing the appearance similarity between regions. In this case, the similarity between two regions is simply the scalar product of the corresponding deep features (normalized or not). As a preliminary experiment to evaluate the effectiveness of deep features, we have run our algorithm without the continuous optimization, with the standout score computed using layers conv4_3, conv5_3 and fc6 of VGG16. Table 5 shows the results of these experiments. Surprisingly, most of the deep features tested give worse results than WHO. This may be due to the fact that our matching task is more akin to image retrieval than to classification, for which deep features are typically trained. Among those tested, only a variant of the features extracted from the layer conv5_3 of VGG16 gives an improvement (about 2%) compared to the result obtained by using WHO.

Table 5: Average CorLoc with different features.

Features                                 Average
WHO                                      68.8 ± 0.5
conv4_3, unnormalized                    64.2 ± 0.2
conv4_3, normalized                      57.1 ± 0.6
conv4_3, ROI pooling [18], unnormalized  63.1 ± 0.2
conv4_3, ROI pooling [18], normalized    63.4 ± 0.4
conv5_3, unnormalized                    64.9 ± 0.2
conv5_3, normalized                      64.1 ± 0.4
conv5_3, ROI pooling [18], unnormalized  70.7 ± 0.2
conv5_3, ROI pooling [18], normalized    68.2 ± 0.3
fc6, unnormalized                        61.3 ± 0.2
fc6, normalized                          61.0 ± 0.4
Unsupervised initial proposals. It should be noted that, although our algorithm, like that of Cho et al. [12], is totally unsupervised once given the region proposals, the randomized Prim's algorithm itself is supervised [33]. To study the effect of this built-in supervision, we have also tested the unsupervised selective search algorithm [47] for choosing region proposals. We have conducted experiments on the VOC_6x2 dataset with the three different settings of selective search (fast, medium and quality). As one might expect, the fast mode gives the smallest number of proposals and of positive ones (proposals whose IoU with one ground-truth box is greater than 0.5), the quality mode outputs the largest set of proposals and of positive ones, and the medium mode lies in between. To compare with [12], we also run their public software with each mode of selective search.
Table 6: CorLoc on VOC_6x2 with different proposal algorithms.

Proposal algorithm         Cho et al.  Ours
selective search, fast     23.3        41.4 ± 0.5
selective search, medium   20.6        48.4 ± 0.5
selective search, quality  32.6        62.8 ± 0.6
randomized Prim's          67.6        69.4 ± 0.4
The results are shown in Table 6. It can be seen that the performance of both Cho et al.'s method and ours drops significantly when using selective search. This may be due to the fact that the percentage of positive proposals found by selective search is much smaller than that of RP. However, we see that with the quality mode of selective search, our method gives results quite close to those obtained with RP, whereas the method of [12] fails badly. This suggests that our method is more robust.
Visualization. In order to gain insight into the structures discovered by our approach, we derive from its output a graph of image regions and visualize its main connected components. The nodes of this graph are the image regions that have finally been retained. Two regions are connected if the images i and j containing them are neighbors in the discovered undirected image graph (e_ij = 1 or e_ji = 1) and the standout score between the two regions is greater than a certain threshold.
Choosing a threshold that yields a sufficient number of large enough components for visualization purposes has proven difficult. We used instead an iterative procedure: the graph is first constructed with a high threshold to produce a small number of connected components of reasonable size, which are then removed from the graph. On the remaining graph, a new, suitable threshold is found to get new components of sufficient size. This is repeated until a target number of components is reached.
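The component extraction underlying this visualization is a standard connected-components computation; a union-find sketch (the thresholded graph construction is assumed done upstream, and all names are ours):

```python
def connected_components(nodes, edges):
    """Connected components of an undirected graph via union-find (sketch)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)           # union the two components
    comps = {}
    for v in nodes:
        comps.setdefault(find(v), []).append(v)
    # largest components first, as used for visualization
    return sorted(comps.values(), key=len, reverse=True)
```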
When applied to our results in the mixed setting on the VOC_6x2 dataset, this visualization procedure yields clusters that roughly match object categories. In Figure 1, we show subsampled graphs (for visualization purposes) of the first two components, which roughly correspond to the classes bicycle and aeroplane. The third component is shown in Figure 2. Although it also contains images of other classes, it is by far dominated by motorbike images. The visualization suggests that our model does extract meaningful semantic structures from the image collections and the regions they contain.
5 Conclusion
We have presented an optimization-based approach to fully unsupervised image matching and object discovery and demonstrated its promise on several standard benchmarks. In its current form, our algorithm is limited to relatively small datasets. We are exploring several paths for scaling up its performance, including better mechanisms based on deep features and the PHM algorithm for prefiltering image neighbors and selecting region proposals. Future work will also be dedicated to developing effective ensemble methods for discovering multiple objects in images, further investigating a symmetric version of the proposed approach using an undirected graph, understanding why deep features do not give better results in our context, and improving our continuous optimization approach so as to handle large datasets in a mixed setting, perhaps through some form of variable clustering.
Appendix: Maximization of supermodular cubic pseudo-Boolean functions
An immediate corollary of [7, Lemma 1] is that a cubic pseudo-Boolean function with nonnegative trinary coefficients and no binary terms is supermodular. For fixed λ and μ, this is obviously the case for the Lagrangian in (5).
In addition, the unary terms in L are nonpositive, and the Lagrangian can thus be rewritten, up to some constant additive term, in the form

L(z) = Σ_{i in V_1} a_i z̄_i + Σ_{j in V_2} b_j z_{j1} z_{j2} z_{j3},    (19)

where z̄ = 1 − z (the complement of z), the variables z run over the entries of x and e, and all coefficients a_i and b_j are positive. We specialize in the rest of this section the general maximization method of [7] to functions of this form.
The conflict graph G [7, 10] associated with such a function has as set of nodes V_1 ∪ V_2, where the elements of V_1 correspond to linear terms, those of V_2 correspond to cubic terms, and an edge links two nodes when one of the corresponding terms contains a variable, and the other one its complement. By construction, G is a bipartite graph, with edges joining only elements of V_1 to elements of V_2.
As shown in [7], maximizing L amounts to finding a maximum-weight stable set in G, where the nodes of V_1 are assigned weights a_i and the nodes of V_2 are assigned weights b_j, which in turn reduces to computing a maximum flow between nodes s and t in the network derived from G by (1) adding a source node s and edges with upper capacity bound a_i between s and the corresponding elements of V_1; (2) adding a sink node t and edges with upper capacity bound b_j between the corresponding elements of V_2 and t; (3) assigning to all edges (from V_1 to V_2) in G an upper capacity bound of +∞.
Let denote the minimum cut obtained by computing the maximum flow in this graph, where is an element of and is an element of . The maximum weight stable set is then . The monomials and associated with elements of are set to 1, from which the values of all variables are easily deduced.
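As an illustration, the whole reduction can be sketched in a few dozen lines of Python. The function name `max_supermodular_cubic`, the dictionary-based encoding of the coefficients, and the use of an Edmonds-Karp max-flow routine are our own choices for this sketch, not part of the original method or implementation:

```python
from collections import defaultdict, deque

INF = float("inf")

def max_supermodular_cubic(a, b):
    """Maximize f(x) = sum_i a[i]*(1-x[i]) + sum_{(i,j,k)} b[i,j,k]*x[i]*x[j]*x[k]
    over binary variables x, with positive coefficients a (dict: var -> weight)
    and b (dict: triple of vars -> weight), via the max-flow reduction of [7]."""
    # Residual capacities. Node ids: 's', 't', ('L', i) for linear terms,
    # ('C', triple) for cubic terms.
    cap = defaultdict(lambda: defaultdict(float))
    for i, w in a.items():
        cap['s'][('L', i)] += w                  # source -> linear term, capacity a_i
    for triple, w in b.items():
        cap[('C', triple)]['t'] += w             # cubic term -> sink, capacity b_ijk
        for i in triple:
            if i in a:                           # conflict: (1 - x_i) vs x_i
                cap[('L', i)][('C', triple)] = INF
    # Edmonds-Karp: repeatedly augment along shortest residual s-t paths.
    while True:
        parent = {'s': None}
        queue = deque(['s'])
        while queue and 't' not in parent:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if 't' not in parent:
            break
        flow, v = INF, 't'
        while v != 's':                          # bottleneck capacity along the path
            flow = min(flow, cap[parent[v]][v]); v = parent[v]
        v = 't'
        while v != 's':                          # push the flow, update residuals
            u = parent[v]; cap[u][v] -= flow; cap[v][u] += flow; v = u
    # S = nodes reachable from s in the final residual graph (source side of the cut).
    S, queue = {'s'}, deque(['s'])
    while queue:
        u = queue.popleft()
        for v, c in cap[u].items():
            if c > 0 and v not in S:
                S.add(v); queue.append(v)
    # x_i = 0 exactly when the linear node of i lies on the source side of the cut,
    # i.e., when the monomial (1 - x_i) belongs to the maximum-weight stable set.
    variables = set(a) | {i for t in b for i in t}
    return {i: 0 if ('L', i) in S else 1 for i in variables}
```

On a toy instance where a single cubic term outweighs the linear ones, all three variables are switched on; when a linear term dominates instead, its variable is switched off and the cubic monomial is sacrificed.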
Acknowledgments.
This work was supported in part by the Inria/NYU collaboration agreement, the Louis Vuitton/ENS chair on artificial intelligence, and the EPSRC Programme Grant Seebibyte EP/M013774/1. We also thank Simon Lacoste-Julien for his valuable comments and suggestions.
References
 [1] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In ICCV, 2015.
 [2] J.-B. Alayrac, P. Bojanowski, N. Agrawal, I. Laptev, J. Sivic, and S. Lacoste-Julien. Learning from narrated instruction videos. IEEE Trans. Pattern Anal. and Machine Intell., 40(9):2194–2208, 2018.
 [3] F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2–3):145–373, 2013.
 [4] F. Bach and Z. Harchaoui. DIFFRAC: a discriminative and flexible framework for clustering. In Proc. Neural Info. Proc. Systems, 2007.
 [5] D. Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 1981.
 [6] M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on large graphs. In COLT, 2004.
 [7] A. Billionnet and M. Minoux. Maximizing a supermodular pseudo-Boolean function: A polynomial algorithm for supermodular cubic functions. Discrete Applied Mathematics, 12:1–11, 1985.
 [8] P. Bojanowski and A. Joulin. Unsupervised learning by predicting noise. In ICML, 2017.
 [9] P. Bojanowski, R. Lajugie, E. Grave, F. Bach, I. Laptev, J. Ponce, and C. Schmid. Weakly-supervised alignment of video with text. In ICCV, 2015.
 [10] E. Boros and P. Hammer. Pseudo-Boolean optimization. Discrete Applied Mathematics, 123(1–3):155–225, 2002.
 [11] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
 [12] M. Cho, S. Kwak, C. Schmid, and J. Ponce. Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. In CVPR, 2015.
 [13] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
 [14] T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In ECCV, 2010.
 [15] C. Doersch, A. Gupta, and A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
 [16] A. Faktor and M. Irani. Clustering by composition–unsupervised discovery of image categories. In ECCV, 2012.
 [17] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. and Machine Intell., 32(9):1627–1645, 2010.
 [18] R. Girshick. Fast R-CNN. In ICCV, 2015.
 [19] B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In ECCV, 2012.
 [20] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
 [21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [22] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe. Deep clustering: Discriminative embeddings for segmentation and separation. In ICASSP, 2016.
 [23] A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In CVPR, 2010.
 [24] A. Joulin, K. Tang, and L. Fei-Fei. Efficient image and video co-localization with Frank-Wolfe algorithm. In ECCV, 2014.
 [25] G. Kim and E. Xing. Distributed cosegmentation via submodular optimization on anisotropic diffusion. In ICCV, 2011.
 [26] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Proc. Neural Info. Proc. Systems, 2014.
 [27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 [28] S. Kwak, M. Cho, I. Laptev, J. Ponce, and C. Schmid. Unsupervised object discovery and tracking in video collections. In ICCV, 2015.
 [29] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
 [30] Y. J. Lee and K. Grauman. Objectgraphs for contextaware category discovery. In CVPR, 2010.
 [31] Y. Li, L. Liu, C. Shen, and A. van den Hengel. Image co-localization by mimicking a good detector’s confidence score distribution. In ECCV, 2016.
 [32] S. Lloyd. Least squares quantization in PCM. IEEE Trans. on information theory, 28(2):129–137, 1982.
 [33] S. Manen, M. Guillaumin, and L. Van Gool. Prime object proposals with randomized Prim’s algorithm. In ICCV, 2013.
 [34] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
 [35] A. Nedić and A. Ozdaglar. Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Optimization, 19(4), 2009.
 [36] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, 2002.
 [37] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
 [38] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
 [39] M. Rubinstein and A. Joulin. Unsupervised Joint Object Discovery and Segmentation in Internet Images. In CVPR, 2013.
 [40] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. Int. J. Computer Vision, 115(3):211–252, 2015.
 [41] B. Russell, W. Freeman, A. Efros, J. Sivic, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, 2006.
 [42] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. CoRR, abs/1409.1556, 2014.
 [43] J. Sivic, B. C. Russell, A. Zisserman, W. T. Freeman, and A. A. Efros. Unsupervised discovery of visual object class hierarchies. In CVPR, 2008.
 [44] M. Slater. Lagrange multipliers revisited. Cowles Commission Discussion Paper No. 403, 1950.
 [45] K. Tang, A. Joulin, and L.-J. Li. Co-localization in real-world images. In CVPR, 2014.
 [46] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.
 [47] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. IJCV, 2013.
 [48] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
 [49] X. Wei, C. Zhang, Y. Li, C. Xie, J. Wu, C. Shen, and Z. Zhou. Deep descriptor transforming for image co-localization. In IJCAI, 2017.