HashBox: Hash Hierarchical Segmentation exploiting Bounding Box Object Detection

02/27/2017 ∙ by Joachim Curto, et al. ∙ ETH Zurich, Carnegie Mellon University

We propose a novel approach to the Simultaneous Detection and Segmentation problem. Building on hierarchical structures, we introduce an efficient and accurate procedure that exploits the hierarchy's feature information using Locality Sensitive Hashing. We build on recent work that uses convolutional neural networks to detect bounding boxes in an image (Faster R-CNN); after hashing, we select the most similar hierarchical region, the one that best fits each bounding box. We call this approach HashBox. We then refine our final segmentation results by automatic hierarchy pruning. HashBox introduces a train-free alternative to Hypercolumns. We conduct extensive experiments on the PASCAL VOC 2012 segmentation dataset, showing that HashBox yields competitive state-of-the-art object segmentations.




1 Introduction

Detection and Segmentation are key components in any computer vision toolbox. In this paper we present a hashing technique to segment an object given its bounding box, and thereby attain both Detection and Segmentation simultaneously. At its heart lies a novel way to retrieve and generate a high-quality segmentation, which is crucial for a wide variety of computer vision applications. Simply put, we use a state-of-the-art convolutional network to detect the objects, but a hashing technique built on top of a high-quality hierarchy of regions to generate the segmentations.

Object Detection and Segmentation are two popular problems in Computer Vision, historically treated as separate tasks. We consider these strongly related vision tasks as a single one: detecting each object in an image and assigning a binary label to each pixel inside the corresponding bounding box.

CZ Segmentation addresses the problem with a surprisingly different technique that deviates from the current norm of using object proposal candidates [5]. In semantic segmentation, the need for rich information models that entangle some notion of the different parts that constitute an object is exacerbated. To alleviate this issue, we build on the hierarchical model in [2] and explore the rich space of information of the Ultrametric Contour Map in order to find the best possible semantic segmentation of the given object. For this task, we exploit bounding boxes to facilitate the search. Namely, we hash the bounding box patches and retrieve the closest nearest neighbors of the given objects, obtaining superior instance segmentations. Using this simple but effective technique we obtain the segmentation mask, which is then refined using Hierarchical Section Pruning.

We start from a bounding box detector and refine the object support, as Hariharan et al. do with Hypercolumns [4]. We propose here a train-free similarity-hashing alternative to their approach.

We present a simple yet effective module that obviates the need for a training step and can provide segmentations after any given detector. Our approach is to use a state-of-the-art region-based CNN detector [3] as a prior step to guide the segmentation process.

Outline: We begin next with a high-level description of the proposed method in Section 1, and develop the idea further to propose Hierarchical Section Hashing and Hierarchical Section Pruning in Section 3. Prior work follows in Section 2. We conclude with the evaluation metrics in Section 4 and a brief discussion in Section 5.

We start with a primer. CZ Segmentation consists of the following main blocks:

  • Bounding Box Object Detection We use a convolutional neural network system [3] to detect all the objects in an image and generate the corresponding bounding boxes. We consider each output candidate, thresholded by the class-level score, as a detected object in an image (benchmark specifications in Section 4).

  • Hierarchical Image Representation We represent the image as a hierarchical region tree based on the Ultrametric Contour Map of Arbeláez et al. [2].

  • Similarity Hashing We develop Hierarchical Section Hashing based on the LSH technique of [6].

  • Region Refinement We refine the segmentations by the use of Hierarchical Section Pruning.

  • Evaluation We evaluate the results on the PASCAL VOC 2012 Segmentation dataset [7] using the Jaccard index metric, which measures the average best overlap achieved by a segmentation mask for a ground truth object.

This work is inspired by how humans segment images: they first localize the objects they want to segment, then carefully inspect each object using their visual system, and finally choose the region that belongs to the body of that particular object. We believe that although the problem of detection has to be solved by deep learning techniques based on convolutional neural networks, the problem of segmenting those objects is of a different nature and can be best understood through hashing.

Our main contributions are presented as follows:

  • Novel approach to solve the segmentation task exploiting bounding box object detection using similarity hashing.

  • Use of hierarchical structures, which are rich in semantic meaning, instead of other current state-of-the-art techniques such as object proposal candidate generation.

  • No need for training data for the segmentation task under a bounding-box detection framework, i.e. train-free accurate segmentations.

  • State-of-the-art results.

To our knowledge, we are the first to provide a segmentation solution based on a hashing technique. This approach obviates the need to optimize over a high-dimensional space.

Despite the success of region proposal methods in detection, they have in turn become the main computational bottleneck of these approaches. Unlike region proposals, hierarchical structures derived from the UCM are comparatively inexpensive to compute and store. While we continue to use a very fast region-based convolutional neural network (R-CNN) to solve the detection task, we propose to solve the segmentation problem by efficiently exploring the space generated by a hierarchical image representation.

2 Prior Work

Recent works [1], [8] present Object Detection and Segmentation as a single problem. The SDS task requires detecting and segmenting every instance of a category in the image. Our work is, however, more closely related to the Hypercolumns approach in [4], which goes from bounding boxes to segmented masks. Our approach is related in the sense that we propose an alternative that does not require a training step and can be used as an off-the-shelf high-quality segmenter.

For semantic segmentation [2], [5], [9], there have been several approaches that guide the segmentation process by the use of a prior detector [10], [3], [11], [12], [13]. Recently, this strategy has also produced state-of-the-art results in person detection and pose estimation [14]. Our segmenter starts from a set of hierarchical regions given by the UCM structure, rather than from raw pixels as in Long et al. [15], or bounding box proposals as in Girshick et al. [16] and Hariharan et al. [1]. Other techniques rely on a superpixel representation, e.g. Mostajabi et al. [17]; this is a distinct tactic that works directly on a different representation.

3 CZ Segmentation

We delve into the details of the CZ Segmentation construction (Figure 1).

3.1 Bounding Box Object Detection

We begin by using the Faster R-CNN object detector proposed by Ren et al. [3], which is in turn based on [10]. It introduces a Region Proposal Network (RPN) for the task of generating detection proposals and then solves the detection task with a Fast R-CNN detector. They train a CNN on ImageNet classification and fine-tune the network on the VOC detection set. For our experiments, we use the network trained on VOC 2007 and 2012, and evaluate the results on the VOC 2012 evaluation set. We use the very deep VGG-16 model [18].


3.2 Hierarchical Image Representation

We consider the hierarchical image representation described in [5]. Consider a segmentation of an image into regions that partition its domain. A segmentation hierarchy is a family of partitions {P_0, P_1, …, P_L} such that: (1) P_0 is the finest set of superpixels, (2) P_L is the complete domain, and (3) regions from coarse levels are unions of regions from fine levels.
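Property (3), that coarse regions are unions of fine regions, is the nesting constraint the hierarchy must satisfy. A minimal sketch of checking it on toy label maps follows; the label values and grid are illustrative, not part of the method:

```python
def is_nested(fine, coarse):
    """Check that every region of `fine` lies inside exactly one region of
    `coarse`, i.e. coarse levels are unions of fine-level regions."""
    parent = {}
    for f_row, c_row in zip(fine, coarse):
        for f, c in zip(f_row, c_row):
            if f in parent and parent[f] != c:
                return False  # a fine region straddles two coarse regions
            parent[f] = c
    return True

# Toy 2-level hierarchy on a 2x4 grid: four superpixels merged into two regions.
fine = [[0, 0, 1, 1],
        [2, 2, 3, 3]]
coarse = [[0, 0, 0, 0],
          [1, 1, 1, 1]]
print(is_nested(fine, coarse))  # True: each superpixel sits in one coarse region
```

Any pair of consecutive levels in a valid hierarchy must pass this check.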

Figure 1: CZ Segmentation. We construct a hierarchical image representation based on the UCM and ’train’ the HSH map by hashing each of the parent partition node regions. To retrieve a segmentation mask, we ’test’ the HSH map by doing a lookup of the detected bounding box region, i.e. a fast approximate nearest neighbor search on the hierarchical structure, and finally refine the result through HSP.

3.3 Hierarchical Section Hashing

In this paper we introduce a novel segmentation algorithm that exploits current bounding box information to automatically select the best hierarchical region that segments the image. We introduce Hierarchical Section Hashing (HSH), which is in turn based on Locality-Sensitive Hashing (LSH). This algorithm helps us overcome the computational complexity of the k-nearest-neighbor rule and allows us to do a fast approximate neighbor search in the hierarchical structure of [2].

HSH can be summarized as follows:

  • Detect bounding boxes on an image using a state-of-the-art convolutional neural network detector [3].

  • Construct a hierarchical image map by using the Ultrametric Contour Map (UCM) and convey the result as a hierarchical region tree.

  • Each hierarchical region is indexed by a number of hash tables using LSH, constructing an HSH map.

  • Each bounding box is hashed into the HSH map to retrieve the approximate nearest neighbor in sublinear time.

CZ Segmentation has two main steps: first, ’train’ the HSH map with all the hierarchical regions of the image; then, ’test’ the HSH map with all the detected bounding boxes to retrieve the approximate nearest neighbors that segment each of the objects in the image. The novelty of this approach is that it selects the best hierarchical region offered by the UCM structure to segment the object. CZ exploits bounding box object detection, and therefore relies on the object detector producing correct detections.
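The ’train’/’test’ steps can be sketched with a toy hash table over flattened binary masks. The region vectors, fixed stump thresholds, and names below are illustrative stand-ins for the actual image codes; a real HSH map would use many randomized tables, as in the LSH algorithm of Section 3.5:

```python
def stump_key(vec, stumps):
    # k-bit key: each bit thresholds one coordinate (an axis-parallel stump)
    return tuple(int(vec[d] > t) for d, t in stumps)

# Fixed stumps for reproducibility; a real LSH index would draw them at random.
stumps = [(0, 0.5), (4, 0.5), (8, 0.5), (12, 0.5)]

# 'Train': index every hierarchical region (flattened binary mask) by its key.
regions = {
    "leaf_a": [1] * 4 + [0] * 12,
    "leaf_b": [0] * 4 + [1] * 4 + [0] * 8,
    "parent": [1] * 8 + [0] * 8,
}
table = {}
for name, vec in regions.items():
    table.setdefault(stump_key(vec, stumps), []).append(name)

# 'Test': hash a detected bounding box region, then rank bucket candidates by L1.
box = [1] * 7 + [0] * 9
candidates = table.get(stump_key(box, stumps), [])
best = min(candidates, default=None,
           key=lambda n: sum(abs(a - b) for a, b in zip(regions[n], box)))
print(best)  # parent: the closest indexed region in the box's bucket
```

The lookup is a constant-time bucket access followed by an exact re-ranking of the few candidates that share the bucket.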

3.4 Hierarchical Section Pruning

The final piece is to refine the segmentations given by the HSH map by using what we call Hierarchical Section Pruning (HSP).

HSP procedure can be summarized as follows:

  • Once a segmentation mask has been selected for all the objects in the given image, and their bounding boxes recomputed, we compute the bounding box overlap ratio for all box pair combinations according to the intersection-over-union criterion.

  • Masks that overlap with other object masks in the same image are hierarchically unselected. We always proceed to unselect the low-level hierarchy regions, which by construction enclose a smaller area and thus a single segmented object, from the high-level hierarchy region, which encloses more than one object and a bigger image area.

  • Finally, isolated pixels on the mask are erased to preserve a single connected segmentation.

HSP is based on the fact that each segmentation mask represents a node in the hierarchical region tree constructed from the UCM. Therefore, hierarchical sections containing more than one object represent higher-level nodes in the hierarchy. When HSP is applied, low-level hierarchical regions are unselected from the high-level hierarchical sections and therefore replaced by mid-to-low-level sections of the same region tree structure that represent a single object or a smaller area of the image.
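A minimal sketch of the carving step on pixel sets follows; the toy masks are illustrative, and the final isolated-pixel cleanup is omitted:

```python
def iou(a, b):
    """Intersection over union of two pixel sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def prune(masks):
    """HSP sketch: when two selected masks overlap, carve the smaller
    (lower-level) region out of the larger (higher-level) one."""
    masks = [set(m) for m in masks]
    for i in range(len(masks)):
        for j in range(len(masks)):
            if i != j and iou(masks[i], masks[j]) > 0 and len(masks[i]) > len(masks[j]):
                masks[i] -= masks[j]
    return masks

# Toy example: a parent-level mask that swallowed a second, smaller object.
parent = {(r, c) for r in range(4) for c in range(4)}  # 4x4 block, 16 pixels
child = {(r, c) for r in range(2) for c in range(2)}   # its top-left 2x2 corner
pruned = prune([parent, child])
print(len(pruned[0]), len(pruned[1]))  # 12 4: the parent loses the child's pixels
```

After pruning, the two masks are disjoint, so each pixel belongs to at most one object.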

HSH and HSP Visual Examples can be seen in Figure 2.

Figure 2: Left: HSH Visual Example. Right: HSP Visual Example.

CZ Segmentation relies on the prior detection, and therefore on the availability of bounding boxes, for all the objects in a given image. This can be very useful, as CZ can be understood as a simple and effective technique to provide high-quality segmentations of still images after any available bounding box detector. In this way, one obtains train-free, off-the-shelf accurate segmentations for any given bounding box detection method.

3.5 Locality Sensitive Hashing

Our goal is to retrieve the k-nearest neighbors of a given hierarchy vector, which we call the image code. In this setup we are limited by the curse of dimensionality, and an exact search is therefore inefficient. Our approach uses an approximate nearest neighbor technique: Locality Sensitive Hashing (LSH).

A locality-sensitive hash function maps each point x to a key h(x) such that the similarity between two points x and y is preserved as

Pr[h(x) = h(y)] = sim(x, y),

which is not possible for all similarity functions but is available, for instance, for Euclidean metrics.

We build on the LSH work of [6]. LSH is a randomized hashing scheme, investigated with the primary goal of sublinear-time neighbor search. Its main building block is a family of locality-sensitive functions. A family H of functions is (r, r(1+ε), p1, p2)-sensitive if, for any points x, y,

Pr[h(x) = h(y)] ≥ p1 whenever ‖x − y‖ ≤ r,
Pr[h(x) = h(y)] ≤ p2 whenever ‖x − y‖ ≥ r(1+ε),

where these probabilities are taken over a random choice of h ∈ H.


Algorithm 1 gives a simple description of the LSH algorithm for the case when the distance of interest is L1, which is the one in use in CZ Segmentation. The family H in this case contains axis-parallel stumps, which means the value of an h ∈ H is generated by taking a single dimension d and thresholding it with some value T:

h(x) = 1 if x_d > T, and h(x) = 0 otherwise.

An LSH function g is formed by k independently drawn functions h_1, …, h_k. That is, an example x in our hierarchical partition provides a k-bit hash key

g(x) = [h_1(x), …, h_k(x)].

This process is repeated l times and produces l independently constructed hash functions g_1, …, g_l. The available reference (’training’) data, i.e. each of the hierarchical partitions generated by all the corresponding parents of the hierarchical tree structure, are indexed by each one of the l hash functions, producing l hash tables.

Once the LSH data structure has been built, it can be used to perform a very efficient search for approximate neighbors in the following way. When a query q arrives, we compute its key g_j(q) for each hash table j and record the examples resulting from the lookup with that key. In other words, we find the ’training’ examples that fell in the same bucket of the j-th hash table into which q falls. These lookup operations produce a set of candidate matches C. If this set is empty, the algorithm reports it and stops. Otherwise, the distances between the candidate matches and q are explicitly evaluated, and the examples that match the search criterion, i.e. those closer to q than a radius r, are returned.

Given: dataset X = {x_1, …, x_n} ⊂ R^D.
Given: number of bits k, number of tables l.
Output: a set of l hash tables.

1: for j = 1, …, l do
2:     for i = 1, …, k do
3:         Randomly (uniformly) draw a dimension d(i, j) ∈ {1, …, D}
4:         Randomly (uniformly) draw a threshold T(i, j)
5:         Let h_{i,j} be the stump defined by h_{i,j}(x) = 1 if x_{d(i,j)} > T(i, j), else 0
6:     The j-th LSH function is g_j(x) = [h_{1,j}(x), …, h_{k,j}(x)].
Algorithm 1 LSH Algorithm
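Algorithm 1 and the query procedure above can be sketched end to end; the data, parameter values, and helper names are illustrative, not the paper's implementation:

```python
import random

def key(x, stumps):
    # k-bit hash key g(x) = [h_1(x), ..., h_k(x)] from axis-parallel stumps
    return tuple(int(x[d] > t) for d, t in stumps)

def build_lsh(data, k, l, seed=0):
    """Algorithm 1 sketch: l hash tables, each keyed by k random stumps."""
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(x[d] for x in data) for d in range(dim)]
    hi = [max(x[d] for x in data) for d in range(dim)]
    tables = []
    for _ in range(l):
        stumps = []
        for _ in range(k):
            d = rng.randrange(dim)                         # draw a dimension
            stumps.append((d, rng.uniform(lo[d], hi[d])))  # draw a threshold
        table = {}
        for i, x in enumerate(data):
            table.setdefault(key(x, stumps), []).append(i)
        tables.append((stumps, table))
    return tables

def query(tables, data, q):
    """Collect bucket candidates from every table, then rank them exactly by L1."""
    cand = set()
    for stumps, table in tables:
        cand.update(table.get(key(q, stumps), []))
    return sorted(cand, key=lambda i: sum(abs(a - b) for a, b in zip(data[i], q)))

data = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]
tables = build_lsh(data, k=2, l=4)
print(query(tables, data, [5.0, 5.0])[0])  # 2: an exact duplicate always shares its bucket
```

Only the candidates found in the buckets are re-ranked, which is what makes the search sublinear in the number of indexed regions.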

4 Evaluation and Results

We extensively evaluate CZ Segmentation on the VOC 2012 validation set. Top detections of our algorithm can be seen in Figure 3.

Figure 3: Top Detections. Top: VOC 2012 Ground Truth. Bottom: CZ Segmentation.

4.1 Jaccard Index Metric

In Table 1 we show the results of the Jaccard index metric. This measure represents the average best overlap achieved by a candidate for a ground truth object.

CZ Segmentation achieves a Jaccard index of 45.24% at instance level and 43.05% at class level. Recall at overlap 0.5 is 43.36%.

Aeroplane Bicycle Bird Boat Bottle Bus Car Cat Chair Cow Table Dog Horse MBike Person Plant Sheep Sofa Train TV Global
CZ Segmentation (Instance Level) 45.4 27.5 55.9 44.2 42.0 43.2 41.3 66.3 31.4 57.2 42.3 63.3 43.8 43.6 40.9 40.6 57.2 51.2 48.0 54.1 45.2
CZ Segmentation (Class Level) 33.3 18.5 48.1 37.5 40.7 45.1 39.4 59.9 23.3 51.0 43.3 60.4 39.8 43.1 34.6 37.2 51.0 47.0 53.6 54.2 43.1
Table 1: VOC 2012 validation set. Per-class and global Jaccard index at instance and class level.
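The metric itself, average best overlap, can be sketched on toy pixel-set masks; the masks below are illustrative:

```python
def jaccard(a, b):
    """Jaccard index (intersection over union) of two masks given as pixel sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def average_best_overlap(ground_truths, candidates):
    """Each ground-truth object is matched to its best-overlapping candidate mask."""
    return sum(max(jaccard(gt, c) for c in candidates)
               for gt in ground_truths) / len(ground_truths)

gt = [{(0, 0), (0, 1)}, {(5, 5)}]             # two ground-truth objects
cand = [{(0, 0), (0, 1), (0, 2)}, {(5, 5)}]   # two predicted masks
print(round(average_best_overlap(gt, cand), 3))  # 0.833 = ((2/3) + 1.0) / 2
```

The instance-level numbers in Table 1 are this quantity computed per class over the validation set.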

5 Discussion

In this paper we introduce CZ Segmentation, an instance segmentation algorithm based on a hashing technique that exploits bounding box object detection. We show CZ achieves compelling results and generates off-the-shelf accurate segmentations.


  • [1] Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Simultaneous detection and segmentation. ECCV (2014)
  • [2] Arbeláez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. PAMI (2011)
  • [3] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS (2015)
  • [4] Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. CVPR (2015)
  • [5] Arbeláez, P., Pont-Tuset, J., Barron, J.T., Marques, F., Malik, J.: Multiscale combinatorial grouping. CVPR (2014)
  • [6] Charikar, M.: Similarity estimation techniques from rounding algorithms. STOC (2002)
  • [7] Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. IJCV (2010)
  • [8] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. ICCV (2017)
  • [9] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI (2017)
  • [10] Girshick, R.: Fast R-CNN. ICCV (2015)
  • [11] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. CVPR (2016)
  • [12] Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. CVPR (2017)
  • [13] Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., Murphy, K.: Speed/accuracy trade-offs for modern convolutional object detectors. CVPR (2017)
  • [14] Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., Murphy, K.: Towards accurate multi-person pose estimation in the wild. CVPR (2017)
  • [15] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. CVPR (2015)
  • [16] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR (2014)
  • [17] Mostajabi, M., Yadollahpour, P., Shakhnarovich, G.: Feedforward semantic segmentation with zoom-out features. CVPR (2015)
  • [18] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. ICLR (2015)