Particular object retrieval becomes very challenging when the object of interest covers only a small part of the image. In this case, the amount of relevant information is significantly reduced. Large objects might be partially occluded, while small objects appear on a background that covers most of the image. A combination of both occlusion and cluttered background is not rare either. These conditions naturally arise from image acquisition and make naive approaches fail, including global template matching or semi-robust template matching [OT06].
Ideally, image descriptors should be extracted only from the relevant part of the image, suppressing the irrelevant clutter and occlusions. In this paper, we attempt to determine the regions containing the relevant information, as shown in Figure 1, in a fully unsupervised manner.
Methods based on robust matching of hand-crafted local features are naturally insensitive to occlusion and background clutter. The locality of the features allows matching small parts of images in regions containing the object of interest, while incorrect matches are typically removed by a robust geometric consistency check [PCISZ07]. Methods based on efficient matching of vector-quantized local-feature descriptors were introduced in the context of image retrieval by Sivic and Zisserman [SZ03].
Retrieval methods based on descriptors extracted by convolutional neural networks (CNNs) have become popular because they combine good precision and recall, efficient search, and a reasonable memory footprint [BSCL14, RSAC14]. Deep neural networks are capable of learning, to some extent, what information in the image is relevant, which results in good performance even with global descriptors [TSJ15, BL15, KMO15]. However, if the signal-to-noise ratio is low, e.g. the object is relatively small or multiple objects are present, global CNN descriptors fail [ITA+16, IAT+17].
A class of methods inspired by object detection has recently emerged. Instead of attempting to match the whole image to the query, the problem is changed to finding a rectangular region in the image that best matches the query [TSJ15, SGMS16]. An exhaustive sliding-window search is intractable for large collections of images, so the exhaustive enumeration is approximated by similarity evaluation on a number of pre-selected regions. The regions are selected either geometrically, to cover the whole image at different scales, as in R-MAC [TSJ15], or by considering the content through object or region proposal methods [SGMS16, SHGXS17, GARL16].
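The geometric selection of regions can be illustrated with a minimal sketch. It assumes square regions whose side at scale l is 2 min(W, H) / (l + 1), placed evenly with the smallest number of regions that covers each axis. The function name `grid_regions` and the placement rule are simplifications made for illustration and do not reproduce the exact sampling of [TSJ15].

```python
def grid_regions(width, height, scales=(1, 2, 3)):
    """Return square regions (x, y, side) on a width x height activation map.

    Assumption for this sketch: at scale l the side is
    2 * min(width, height) // (l + 1), and each axis gets the smallest
    number of evenly spaced regions that still covers it completely.
    """
    def offsets(extent, side):
        count = -(-extent // side)  # integer ceiling of extent / side
        if count == 1:
            return [(extent - side) // 2]
        # Evenly spaced integer offsets from 0 to extent - side inclusive.
        return [i * (extent - side) // (count - 1) for i in range(count)]

    regions = []
    short = min(width, height)
    for l in scales:
        side = max(1, 2 * short // (l + 1))
        for y in offsets(height, side):
            for x in offsets(width, side):
                regions.append((x, y, side))
    return regions


if __name__ == "__main__":
    for region in grid_regions(8, 6):
        print(region)
```

Each returned region would then be pooled and compared to the query, which replaces the sliding-window enumeration by a fixed, small set of candidates per image.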
Another direction for suppressing irrelevant content is saliency detection [KMO15, NASH16]. For each image, a saliency map is first estimated; it captures more general region shapes than (a small set of) rectangles. The contribution of each pixel (or region) is then proportional to the saliency of that location.
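As a minimal sketch of saliency-proportional pooling, the following assumes that the activation map is given as a nested list of D-dimensional vectors and that the saliency map holds one non-negative weight per location. The function name `weighted_sum_pool` and the final L2 normalization are assumptions made for illustration; the cited methods differ in how the saliency itself is computed.

```python
import math


def weighted_sum_pool(features, saliency):
    """Pool an H x W x D activation map into one L2-normalized D-vector.

    features[y][x] is a list of D activations (hypothetical layout).
    saliency[y][x] is a non-negative weight for that location.
    """
    dim = len(features[0][0])
    pooled = [0.0] * dim
    for feat_row, sal_row in zip(features, saliency):
        for vec, weight in zip(feat_row, sal_row):
            # Each location contributes in proportion to its saliency.
            for k in range(dim):
                pooled[k] += weight * vec[k]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled


if __name__ == "__main__":
    feats = [[[1.0, 0.0], [0.0, 1.0]]]  # 1 x 2 map with D = 2
    print(weighted_sum_pool(feats, [[3.0, 4.0]]))
```

Setting the weight of a location to zero removes it from the descriptor entirely, which is how a saliency map can suppress background clutter without restricting the selection to rectangles.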
In this work we introduce a very simple pooling scheme that inherits the properties of both saliency detection and region-based pooling and that, like all previous approaches, is applied to each image in the database independently. In addition, we investigate the use of the resulting regional representation for automatic, offline object discovery and suppression of background clutter, which considers the image collection as a whole. Unlike previous approaches, we do this in an unsupervised way. As a consequence, our representation takes two saliency detection steps into account: one that acts per image and depends solely on its content, and another that considers the image collection as a whole and captures frequently appearing objects.
In both cases, we derive a global representation that outperforms comparable state-of-the-art methods in retrieving small objects on standard benchmarks, while its memory footprint and online cost are only a fraction of those of more powerful regional representations [RSAC14, ITA+16]. Moreover, we show that our representation benefits significantly from query expansion methods.
Section 2 discusses our contributions against related work. Section 3 describes our methodology including our pooling scheme in Section LABEL:sec:crow and our object discovery approach in Section LABEL:sec:saliency. We present experimental results in Section LABEL:sec:exp and draw conclusions in Section LABEL:sec:discussion.
2 Related work
Local features and geometric matching offer an attractive way for retrieval systems to handle occlusions, clutter, and small objects [SZ03, PCISZ07, JDS10a]. Their drawbacks are high query complexity and large storage cost; an image is typically represented by several thousand features. Many methods attempt to decrease the number of indexed features by removing background clutter while maintaining the relevant information. The selection procedure is either applied independently per image or considers an image collection as a whole. Common examples of the former case are bursty feature detection [SAJ15], symmetry detection [TKA12], and the use of semantic segmentation [AZ14b, OPTA+08]. The methods of the second category are scalable enough to jointly process the whole collection and perform feature selection under the following assumption: a feature that repeats over multiple instances of the same object in the dataset is likely to appear in novel views of the object too. Representative cases are common object discovery [TL09, TAJ15], co-occurrence detection [CM10], and methods using GPS information [GBQG09, KSP10].
The work by Turcot and Lowe [TL09] performs pairwise spatial verification on hand-crafted local features across all images and only indexes verified features. At an additional off-line cost, the on-line stage is sped up and the memory footprint is reduced. However, unique views of objects are not verified and are thus discarded. In this work, we address a similar selection problem based on a more powerful CNN-based representation rather than local features.
Recent advances in deep learning [ARSM+14, TSJ15, KMO15, GARL16b, RTC16] dispense with the large memory footprint by using global descriptors and cast the problem of instance search as Euclidean nearest neighbor search. Nevertheless, background clutter and occlusion are better handled by regional representations. Regional descriptors significantly increase the performance when they are indexed independently [RSAC14, ITA+16], but this comes at a prohibitive memory and computational cost for large-scale scenarios. Region Proposal Networks (RPNs) are applied either off-the-shelf [SGMS16] or after fine-tuning [SHGXS17] for instance search. RPNs reduce the number of regions per image to the order of tens. Our work focuses on aggregating regional representations in a way that keeps the complexity low, but we instead detect regions around salient objects and objects that frequently appear in the dataset. Jimenez et al. [JAG17] construct saliency maps and perform region detection to construct global image vectors, as we also do. However, they employ generic object detectors trained on ImageNet, which makes the method inapplicable to fine-tuned networks, which provide the best performance. The Hessian-affine detector is used on CNN activations to detect repeatable regions [JCSC17]. The major benefit in that work, though, comes from second-order pooling and higher-dimensional descriptors.
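The formulation of instance search as Euclidean nearest neighbor search over global descriptors, mentioned at the start of this paragraph, can be sketched as follows. The sketch assumes one L2-normalized vector per image and a brute-force scan; the function names `l2_normalize` and `rank_by_euclidean` are hypothetical, and large collections would typically use an approximate index instead.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def rank_by_euclidean(query, database):
    """Return database indices sorted by increasing Euclidean distance.

    For unit vectors, ||q - x||^2 = 2 - 2 <q, x>, so this ordering is
    the same as ranking by decreasing cosine similarity.
    """
    q = l2_normalize(query)
    dists = [math.dist(q, l2_normalize(x)) for x in database]
    return sorted(range(len(database)), key=lambda i: dists[i])


if __name__ == "__main__":
    db = [[0.0, 1.0], [2.0, 0.1], [1.0, 1.0]]
    print(rank_by_euclidean([1.0, 0.0], db))
```

With a single vector per image, memory and query time grow linearly with the number of images, whereas indexing every regional descriptor independently multiplies both by the number of regions per image.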
Saliency maps are another way to handle clutter and occlusions. Once more, there are examples both of maps computed in an unsupervised manner [KMO15, LK17] and of maps that are learned [NASH16, JDF17] and applied per image afterwards. Our approach generates saliency maps in a fully unsupervised way; they capture both salient objects in single images and repeating objects appearing in a particular image collection.
Like [TL09], our objective is to remove transient and non-distinctive objects, as in Figure 1, and to focus instead on objects appearing frequently in a dataset. Beginning with the activation map of a convolutional layer in a CNN, one would need access to a local representation to automatically discover such objects. On the other hand, knowing what these objects are would help form a local representation by selecting regions depicting them, which appears to be a chicken-and-egg problem. Without an initial region selection, we risk “discovering” uninformative but frequently appearing “stuff”-like patches, for instance sky.