Learning a Layout Transfer Network for Context Aware Object Detection

12/09/2019, by Tao Wang, et al.

We present a context aware object detection method based on a retrieve-and-transform scene layout model. Given an input image, our approach first retrieves a coarse scene layout from a codebook of typical layout templates. In order to handle large layout variations, we use a variant of the spatial transformer network to transform and refine the retrieved layout, resulting in a set of interpretable and semantically meaningful feature maps of object locations and scales. The above steps are implemented as a Layout Transfer Network which we integrate into Faster RCNN to allow for joint reasoning of object detection and scene layout estimation. Extensive experiments on three public datasets verify that our approach provides consistent performance improvements over state-of-the-art object detection baselines on a variety of challenging tasks in the traffic surveillance and autonomous driving domains.


I Introduction

Human perception almost always reflects an integration of information from bottom-up and top-down processes (e.g., [mcclelland1981interactive, palmer1999vision, lee2003hierarchical]). In particular, context effects are an integral part of visual perception in humans. Consider the object detection problem shown in Figure 1. As humans, we take for granted the ability to understand the layout of a scene at first glance, and then know where to look for specific objects. For example, cars are most likely to appear on paved areas, and those closer to the camera are typically larger than distant ones. It is therefore appealing to design models for object detection that reason about spatial context in a similar fashion. In contrast, recent state-of-the-art object detection algorithms produce scores for densely sampled object locations and scales, or for a few hundred to a few thousand "blobby" object proposals. While these methods encode spatial context to some extent owing to the large receptive fields of the underlying convolutional neural networks (CNNs), they usually lack a holistic understanding of scene layouts. More importantly, unlike human perception, their performance deteriorates quickly when visual cues of objects become ambiguous or weak. Motivated by how humans understand scene layouts for object detection, we propose a simple, interpretable and flexible framework for learning to transfer scene layouts for context-aware object detection.

Fig. 1: Transferring scene layouts for object detection. (a) Input image. (b) Coarse scene layout for the car category at different scales (XS, S, M, L) retrieved from similar images. (c) Object attention map derived from the coarse scene layout. (d) Scene layout after transformation and refinement. (e) Object attention map derived from the transformed and refined scene layout. (f) Detector output.
Fig. 2: The overall architecture of our method. Our layout transfer network adopts a retrieve-and-transform strategy for scene layout estimation. Given an input image, we first retrieve the most similar scene layout from a pre-built codebook with a classification sub-network (blue). The retrieved layout is then refined with a transformation sub-network, producing the final scene layout features (green). These features are then concatenated with backbone convolutional features (orange) to obtain the feature map for downstream object detection (gray).

The general idea of context modeling for object detection has long been proven effective in the computer vision community, with seminal works from Torralba et al. [torralba2003contextual, murphy2003using, torralba2003context], and later Hoiem, Efros and Hebert [hoiem2008putting], plus a few more [wolf2006critical, rabinovich2007objects, blaschko2009object] as prominent examples. More recently, the modeling of spatial context has been extended to 3D scenarios [sudderth2006depth, Bao:CVPR10, sun2010object, choi2013understanding, lin2013holistic, gupta2014learning, liu2015towards] as high quality co-registered depth and color images have become more easily accessible. Most existing approaches assume a parametric model of the scene layout, such as the piecewise planar assumption [faugeras1988motion], the blocks world assumption [gupta2010blocks], or the Manhattan world assumption [hedau2009recovering, lee2009geometric, lee2010estimating, chao2013layout], just to name a few. Despite the great progress, modeling the scene layout in a parametric fashion becomes challenging when the layout variation is large. In addition to an increased model complexity, performance can deteriorate quickly under atypical scene configurations. To address the above issues, we explore a semi-parametric approach to context modeling for object detection. In particular, we improve object detection through a coarse-to-fine scene layout model that predicts potential object locations and scales for every object category, as illustrated in Figure 1. Such a high dimensional mapping is difficult to learn in a purely parametric way due to the large output space, so we adopt a retrieve-and-transform strategy for scene layout prediction. Specifically, the input image (Figure 1 (a)) is first matched against a codebook of typical scene layout templates, and the most similar scene layout (Figure 1 (b) and (c)) is retrieved. This scene layout is further transformed and refined (Figure 1 (d) and (e)) to adapt to the specific appearance of the input image. In addition, we implement the above procedure as an integral component of Faster RCNN [ren2015faster], so that the object detection and scene layout estimation tasks can be learned in an alternating fashion.

We note that the proposed method resembles human perception in the sense that it seamlessly integrates information from bottom-up and top-down visual processing. The bottom-up pathway starts with deep image features; the top-down pathway, which starts by retrieving scene layouts from our external scene layout memory, injects prior knowledge and expectations about the scene. By virtue of a simple and differentiable transformation, the proposed method is able to adapt the most relevant scene layout priors to a specific input image.

The benefits of our proposed method are threefold. Firstly, our scene layout representation is interpretable, which makes it more readily applicable to other scene understanding tasks. In fact, we show that we are able to build an external memory of typical scene layouts from a large database and then accurately retrieve the most relevant scene templates at test time. Secondly, a common problem with retrieval-based methods is that they struggle to handle intra-class variations. In our work, we use a variant of the spatial transformer network [jaderberg2015spatial] to adapt scene layout templates to specific images, making our method capable of handling diverse scene layouts. Lastly, the proposed module can be integrated into a deep network for object detection, resulting in a single CNN for joint object detection and scene layout estimation. Extensive experiments on three public datasets verify that images in the traffic surveillance and automated driving domains are well-suited for our approach, because their scene layouts provide strong priors for localizing objects.

This paper extends our previous work [wang2017efficient] in several ways. In particular, (i) scene layout estimation now uses a substantially different retrieve-and-transform method, which is both more efficient and more effective; (ii) scene layout estimation is now an integral part of the object detection CNN, which improves detection performance and removes unnecessary feature computation costs; (iii) we provide more detailed experimental evaluation and ablation studies of our proposed method in order to quantitatively motivate our model design. Furthermore, we compare our method against the baselines on two additional public datasets for object detection.

The rest of the paper is organized as follows. In Section II, we briefly discuss related work in object detection, context modeling, and scene layout estimation. The details of our model, including its structure, inference and learning, are introduced in Section III. This is followed by experimental evaluation in Section IV and closing remarks in Section V.

II Related Work

Object detection.

Recent years have witnessed the huge success of Convolutional Neural Network (CNN) based object detection algorithms over conventional methods built on hand-crafted features and shallow, grammar-based object architectures. Some of the most prominent examples include the sliding-window based OverFeat [sermanet2014overfeat] and the object proposal based R-CNN [girshick2014rich] with its faster variants [he2014spatial, girshick2015fast, ren2015faster, dai2016r]. These methods are directly inspired by the success of CNNs for image classification. The latter, proposal-based methods exploit the strong representational power of deep networks to classify and refine a relatively small set (typically hundreds to a few thousands) of potential object regions. Another line of work makes direct predictions with a deep network without an object proposal step; examples include YOLO [redmon2016you], SSD [liu2016ssd] and DSOD [shen2017dsod], and these methods are generally more computationally efficient. In this work, we choose Faster RCNN [ren2015faster] as our baseline object detector and explore how to improve its results by incorporating scene-level context cues.

Context modeling. Context aware object detection has been well studied, and many context-aware object detection methods have been proposed (e.g., [torralba2003contextual, torralba2004contextual, wolf2006critical, rabinovich2007objects, hoiem2008putting, kluckner2009semantic, blaschko2009object, maire2011object, pan2013coherent]). See [wolf2006critical] for a review and [divvala2009empirical] for an empirical study of earlier work in the literature. More recently, Yang et al. [yang2010layered] have shown that reasoning about a 2.1D layered object representation in a scene can positively impact object detection. Yao et al. [yao2012describing] propose a holistic scene understanding model which jointly solves object detection, segmentation and scene classification. Mottaghi et al. [mottaghi2014role] exploit both the local and global contexts by reasoning about the presence of contextual classes, and propose a context-aware improvement to the DPM [felzenszwalb2010object]. Zhang et al. [zhang2014data] propose a nonparametric column-based tiered model for scene layout estimation of road scenes. Zhu et al. [zhu2015segdeepm] use CNNs to obtain contextual scores for object hypotheses, in addition to scores obtained with object appearance. Cai et al. [cai2016unified] propose to ease the object scale variation issue by performing object detection with multiple CNN layers, each focusing on objects within certain scale ranges. Batzer et al. [batzer2016generic] propose a context-aware voting scheme for small and distant object detection. In addition, Sun and Jacobs [sun2017seeing] propose to learn a context model that predicts where objects may be missing. Other works have extended context modeling to 3D scenarios. For example, Bao, Sun and Savarese propose a parameterized 3D surface layout model and combine it with object detectors [Bao:CVPR10, sun2010object]. Geiger, Wojek and Urtasun [geiger2011joint] propose a generative model for joint inference of scene topology, geometry and 3D object locations. Wojek et al. [wojek2013monocular] also propose a 3D scene model with explicit occlusion reasoning for object detection and tracking. Choi et al. [choi2013understanding] learn latent 3D geometric phrases to jointly solve object detection and scene layout estimation. Similarly, Lin et al. [lin2013holistic] use a CRF model to integrate various contextual relations for holistic scene understanding. Later works include [gupta2014learning], [wang2015holistic] and [liu2015towards]. Our work differs from the methods above in the sense that we propose a semi-parametric, retrieve-and-transform based approach to model the spatial context for object detection, which allows us to efficiently search within a high dimensional output space of the scene context. Our method is simple, interpretable, and it can be integrated into a deep network for object detection.

Scene layout estimation. Our work is also related to scene layout estimation methods that attempt to predict either parametric or nonparametric scene layout representations. For indoor scenes, recent works (e.g., [mallya2015learning, dasgupta2016delay, liu2015rent3d, armeni20163d, song2017semantic]) have made great progress by leveraging strong scene priors such as floor plans, geometric priors, or the more classical Manhattan world assumption, in conjunction with deep models trained on large-scale datasets. For outdoor scenes, Seff and Xiao [seff2016learning] propose to use deep models to predict a set of driving-related road layout attributes. In addition, Mattyus et al. [mattyus2016hd] propose a CRF-based scene layout model that combines perspective and top-view images. Zhai et al. [zhai2017predicting] learn a transform to transfer semantics from the ground to the aerial image domain. Li et al. [li2017foveanet] propose a perspective-aware scene parsing method that estimates the perspective geometry of a scene image through a CNN to allow for finer parsing of small distant objects, and fuses prediction results at multiple scales with a perspective-aware CRF. Schulter et al. [schulter2018learning] propose a CNN that can hallucinate depth and semantics occluded by foreground objects and estimate a scene layout in the top view from a single perspective view. Wang et al. [wang2019parametric] propose a rich parametric top-view representation of complex road scenes that uses CNNs for predicting scene model parameters and a CRF for consistency reasoning. In contrast to the above methods, we propose a simple and interpretable scene layout representation that can be directly used to improve object detection performance, and does not require additional data or annotation during training. Furthermore, we implement our retrieve-and-transform scene layout estimation model as part of an object detector that allows joint learning of object detection and scene layout estimation.

Object detection: input image; object hypothesis; object detection score.
Scene layout: training dataset; ground-truth annotation; scene layout type; image neighborhood; scene layout codebook; coarse scene layout score; refined scene layout score.
TABLE I: Summary of the main notations. See Section III for details.

III Our Approach

Let us begin with the definition of the object detection problem and our notation. Given an input image, an object detection algorithm outputs a score for each valid object hypothesis. To simplify the notation, we assume each object hypothesis specifies the image coordinates of the object center, the object scale, the aspect ratio, and the object class; each hypothesis therefore implies a bounding box as well. Object detection algorithms define a scoring function over all valid object hypotheses for a given input image. In Faster RCNN, for example, this score is implemented as a two-class softmax score.

The core idea of this work is that we can transfer scene layouts from training images that are similar to the input image in order to predict potential object locations and scales. To this end, we propose an additional scene layout score for any given object hypothesis, computed by investigating a local neighborhood of the input image. Directly learning such a high-dimensional mapping is difficult, so we adopt a two-step retrieve-and-transform strategy:

  • We first retrieve a coarse scene layout score by matching the input image to a codebook of scene layout templates.

  • Afterwards, we apply a transformation to the retrieved coarse score to obtain the refined scene layout score (see the notation sketch below).
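To make the two-step strategy concrete, the decomposition can be sketched as follows, using symbols of our own choosing for illustration (the original notation is not reproduced here): for an object hypothesis $h$ and input image $x$,

\[
\bar{S}_{\mathrm{layout}}(\cdot \mid x) = \mathrm{retrieve}\big(x;\, \mathcal{C}\big), \qquad
S_{\mathrm{layout}}(h \mid x) = \mathcal{T}\big(\bar{S}_{\mathrm{layout}}\big)(h),
\]

where $\mathcal{C}$ denotes the scene layout codebook and $\mathcal{T}$ the learned transformation and refinement described in Section III-C.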

Fig. 3: Left: The t-SNE visualization [maaten2008visualizing] of the features from the ResNet-50 [he2016deep] network, color-coded according to their cluster membership. Right: Given a query image (in the leftmost column), we present example images from the matching scene layout cluster on the MIO-TCD dataset (the yellow panel). In particular, we show images near the 0th, 25th, 50th, 75th and 100th percentiles according to their distances from the cluster center. In addition, we show coarse scene layout scores for the car category obtained on the MIO-TCD dataset (the blue panel). By matching a query image to our scene layout codebook, the spatial distribution of object locations at different scales (XS, S, M, L) is obtained. Note that the scene layout scores are still coarse, but they roughly suggest a scene layout in terms of potential object locations.

More specifically, the training dataset consists of training images, each paired with the set of all of its ground-truth object annotations; analogous to an object hypothesis, each ground-truth annotation specifies an object center, scale, aspect ratio and class. Instead of codewords, our scene layout codebook stores the scene layout type label and the ground-truth annotations of each training image, where the type label indexes one of the scene layout templates. In this way, each training image is additionally associated with its scene layout type label. The main notations are summarized in Table I.

There are three key issues here: (1) how to define the image neighborhood, (2) how to obtain a representation of the neighborhood that is helpful for object detection, and (3) how to integrate the resulting scene layout score into an existing object detection framework. To address the first issue, we build a codebook of typical scene layouts by clustering image-level appearance features. By matching an input image against the codebook, we can transfer the neighborhood information encoded in the codebook entries to the target image; see Figure 3 for examples. For the second issue, we choose object location heatmaps, since our ultimate goal is to predict object locations. In this work, we follow a retrieve-and-transform strategy to obtain a feature map that not only encodes the spatial distribution of objects in the image neighborhood, but is also further adapted to the specific input image based on its appearance. The overall process is illustrated in Figure 2, with examples of the output scene layout scores shown in Figure 7. For the last issue, we propose two strategies for information fusion, either at the feature level or at the level of the final detection scores.

It should be noted that our method provides interpretable intermediate features (the coarse layout and the refined layout) that resemble the psychological process of humans recalling typical scene layouts from memory and then adapting them to the specific scene appearance. This also allows us to inject additional supervision signals to help convergence during training. In addition, such a high-dimensional mapping is difficult to learn in a purely parametric way due to the large output space. We show quantitatively in the experiments that our strategy is much more effective than a naive baseline that attempts to directly predict the spatial locations of objects.

III-A Building a scene layout codebook

As the first foundation stone, we obtain scene layout templates from the training dataset by clustering image-level appearance features. We build a codebook that encodes typical scene layouts, and train a classifier for scene layout classification. At test time, we classify the input image into one of these scene layout clusters and obtain a rough estimate of possible object locations. To this end, we introduce a feature manifold that is descriptive of the scene layouts. Following [wang2017efficient], we use the 2048-dimensional features extracted from the final global average pooling layer of a ResNet-50 [he2016deep] network applied to the input image, as such features are widely used in image classification to describe the appearance of an image. The network is pretrained on the ImageNet dataset [deng2009imagenet, he2016deep]. One advantage of these features is that they are pretrained on a large database and are therefore potentially a more robust representation of image appearance than the alternatives.

More specifically, we stack the image-level appearance features of all training images into a training feature matrix, with one row per image (2048 dimensions for our ResNet features), and perform K-means clustering on this matrix. Each training image is thereby assigned a cluster membership indicating the scene layout cluster it belongs to. Our scene layout codebook stores the cluster membership labels and the ground-truth object annotations in a nonparametric fashion. The coarse scene layout score for a cluster is obtained by accumulating all ground-truth object annotations in that cluster:

(1)

where the sum runs over the ground-truth object annotations of all training images assigned to that cluster. We build a mixture model by sorting the ground-truth annotations by their scale, aspect ratio and object class and splitting them into groups; each group contains annotations within certain ranges of scale and aspect ratio and of a specific object class. An object hypothesis can be sorted into one of these groups in the same way, and support for a hypothesis is accumulated only from those ground-truth annotations that fall into the same group. The accumulation compares object center coordinates and is normalized by a constant factor. In other words, Equation 1 additively accumulates support for an object hypothesis by looking at nearby spatial locations in the ground-truth object annotations while applying a Gaussian smoothing kernel.
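As an illustration of the codebook construction described above, the sketch below clusters image-level features with K-means and accumulates per-cluster, per-group object location heatmaps with Gaussian smoothing. It is a minimal sketch under our own assumptions about the data structures (feature matrix, annotation tuples, heatmap resolution, normalization by the per-cluster maximum) and is not the released implementation.

import numpy as np
from sklearn.cluster import KMeans
from scipy.ndimage import gaussian_filter

def build_layout_codebook(features, annotations, n_clusters=16,
                          heatmap_hw=(32, 32), n_groups=4, sigma=1.5):
    """Cluster image-level features and accumulate coarse layout heatmaps.

    features:    (N, 2048) array of image-level appearance features.
    annotations: list of N lists; each ground-truth box is a tuple
                 (cx, cy, group_id) with center coordinates normalized
                 to [0, 1) and a group id encoding class/scale/aspect.
    """
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    labels = kmeans.labels_

    H, W = heatmap_hw
    # One stack of heatmaps per cluster: (n_groups, H, W).
    heatmaps = np.zeros((n_clusters, n_groups, H, W), dtype=np.float32)

    for img_idx, boxes in enumerate(annotations):
        k = labels[img_idx]
        for cx, cy, group_id in boxes:
            u = min(int(cx * W), W - 1)
            v = min(int(cy * H), H - 1)
            heatmaps[k, group_id, v, u] += 1.0

    # Gaussian smoothing and per-cluster normalization.
    for k in range(n_clusters):
        for g in range(n_groups):
            heatmaps[k, g] = gaussian_filter(heatmaps[k, g], sigma=sigma)
        z = heatmaps[k].max()
        if z > 0:
            heatmaps[k] /= z

    return kmeans, heatmaps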

The left part of Figure 3 presents the t-SNE visualization [maaten2008visualizing] of the training feature matrix, with each feature color-coded according to its cluster membership. The right part of Figure 3 shows a query image and example images (in the yellow panel) from the matching scene layout cluster on the MIO-TCD dataset. In particular, we show images near the 0th, 25th, 50th, 75th and 100th percentiles according to their distances from the cluster center. In this way, we are able to show the variations within individual clusters. Specifically, a scene layout cluster contains images taken with different cameras from similar views, not merely different images taken from the same camera. In addition, we show the spatial distribution of objects in the car category at different scales (in the blue panel). We note that the heatmaps in Figure 3 can be viewed as a sampling distribution of object locations obtained from a specific image cluster, and they do not generally provide exact object locations for a given image. The intra-cluster layout variations are still large, and we therefore further refine the scene layout scores to adapt to the appearance of an input image before using them for object detection. We describe the relevant details in Section III-C. Before that, let us discuss how to retrieve the scores from the most similar scene layout cluster given a query image.

Fig. 4: The scene layout classification head. We use the feature map dimensions on the MIO-TCD dataset as an example. A max pooling layer and a convolution layer are first applied to the feature backbone. Afterwards, a fully connected layer with K-way outputs predicts the scene layout cluster probabilities, where K is the number of clusters in the scene layout codebook.

III-B Scene layout classification

Now we move on to describe the first essential step towards scene layout estimation: scene layout classification. After building the scene layout codebook with K clusters, we can use the cluster membership as a class label for each training image. This yields a classification dataset in which every training image is paired with its scene layout class label, and our scene layout classifier maps the input image space to this label space. We do so by reusing the backbone network of Faster RCNN and adding an additional sub-network with a softmax output layer. Sharing the feature computation allows us to avoid unnecessary computational overhead. Importantly, the scene layout classification allows us to retrieve the most relevant scene layouts in our codebook, thus making the downstream scene layout transformation and refinement easier to learn.

In terms of the network architecture, we attach a scene layout classification head to the backbone network, as shown in Figure 4. For simplicity of illustration, we use the input image dimensions of the MIO-TCD [luo2018mio] dataset in our examples. Denote the outputs of the last residual blocks of the backbone stages conv2, conv3, conv4 and conv5 as C2, C3, C4 and C5, respectively. Note that, due to the fully convolutional nature of the backbone, these feature maps have strides of 4, 8, 16 and 32 pixels with respect to the input image, so their spatial dimensions are determined by the input resolution. One of these feature maps serves as the input to our scene layout classification head. In order to obtain a more concise representation, our classification head starts with a max pooling layer and a strided convolution layer, followed by a fully connected layer that integrates spatial information for classification.
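A minimal PyTorch-style sketch of such a classification head is given below. The channel widths, kernel sizes and the assumed input resolution are illustrative choices of ours; only the overall structure (max pooling, a strided convolution, and a K-way fully connected layer) follows Figure 4.

import torch
import torch.nn as nn

class LayoutClassificationHead(nn.Module):
    """Predicts scene layout cluster probabilities from a backbone feature map."""

    def __init__(self, in_channels=2048, num_clusters=16, spatial_size=(8, 8)):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)         # shrink the feature map
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3,
                              stride=2, padding=1)                 # strided conv for a compact map
        self.relu = nn.ReLU(inplace=True)
        h, w = spatial_size                                        # spatial size after pool+conv (assumed)
        self.fc = nn.Linear(256 * h * w, num_clusters)             # K-way output

    def forward(self, feat):
        # e.g., feat of shape (B, 2048, 32, 32) yields logits of shape (B, num_clusters)
        x = self.relu(self.conv(self.pool(feat)))
        x = torch.flatten(x, start_dim=1)
        return self.fc(x)                                          # logits over layout clusters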

We empirically found that setting the number of clusters K properly is important for obtaining the desired scene layout classification results, as well as the object detection performance improvements. Figure 5 presents its impact on scene layout classification accuracy and mean AP for object detection on the MIO-TCD and KITTI [geiger2013vision] validation sets, respectively. As we can see, if K is too small, the mean AP improvements may not be maximized and classification becomes difficult, because very different scene layouts may be clustered into the same scene layout class. On the other hand, setting K to a large value can negatively impact both the classification accuracy and the mean AP. In general, we found that a suitable K is closely related to the number of typical scene layouts in a dataset, and we set K according to these validation results for the experiments in this paper.

Fig. 5: The impact of the number of clusters on scene layout classification accuracy and mean AP. The left panel presents results on the MIO-TCD validation set, and the right panel presents results on the KITTI validation set.

III-C Scene layout transformation


Fig. 6: The scene layout transformation sub-network. We use the feature map dimensions on the MIO-TCD dataset as an example. The sub-network consists of two modules: spatial transformation (the blue panel) and refinement (the green panel). Layer parameters are summarized in Table II.

As discussed in the previous section, we are able to retrieve a scene layout template in terms of object location heatmaps by matching an input image against the scene layout codebook; see Figure 3 for a few examples. Due to large intra-cluster scene layout variations, however, the retrieved templates are usually too coarse if directly used as a feature map for object detection. See Section IV-A for a detailed quantitative analysis. In order to handle the large layout variations, we propose a transformation and a refinement sub-network to adapt the scene layout template to the appearance of a specific input image. In particular, the spatial transformer network (STN) [jaderberg2015spatial] provides a generic CNN module that allows us to learn any parameterized feature transformation, provided it is differentiable with respect to its parameters.

Specifically, for an input image, suppose that the k-th cluster is the scene layout classification result (i.e., the k-th element in the output of the scene layout classification head has the largest logit), so that we retrieve the corresponding coarse scene layout score from the codebook. We then use a variant of the STN to apply a learned transformation to this coarse score to obtain the final scene layout score. This transformation consists of two consecutive CNN modules, namely spatial transformation and refinement, as shown in Figure 6:

III-C1 Spatial transformation

Given the coarse scene layout, we first apply a 2D affine transformation to allow for rotation, translation, scale changes, etc. Because we use a mixture of discretized groups of scales and aspect ratios in this work, we can reduce the pose of an object hypothesis to its object center coordinates. Let (x, y) and (x', y') be the object center coordinates before and after the spatial transformation, respectively. The pointwise transformation in homogeneous coordinates can then be written as:

\[
\begin{pmatrix} x' \\ y' \end{pmatrix}
=
\begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
\qquad (2)
\]

As we can see from Equation 2, the 2D affine transformation has six parameters. The STN estimates these parameters with a localization network. In addition, we note that the transformed coordinates are defined on a regular grid over the output feature map. Therefore, we can use a sampler to obtain their corresponding coordinates in the coarse scene layout by applying Equation 2.

Figure 6 presents the detailed network architecture of our spatial transformation module. For simplicity, the feature dimensions on the MIO-TCD dataset are shown as an example. The dataset has 11 object categories, and in this paper we use four scale groups and a single aspect ratio group for each object category, so the coarse and refined scene layout scores have 44 channels. At the top of Figure 6 is the feature backbone, which is shared with the downstream object detection modules. As before, we denote the outputs of the last residual blocks of conv2 through conv5 as C2, C3, C4 and C5, and we additionally denote by C6 the feature map obtained by downsampling C5 by half with a max pooling layer. Two backbone feature maps play an important role in our scene layout transformation sub-network. The first is a high-level map with a large stride with respect to the input image; it is indicative of the overall appearance of the image and introduces relatively few parameters. The other is a lower-level (i.e., finer) map with a smaller stride, which we use as an appearance feature that helps our model attend to specific object locations in the refinement module. In our implementation, the spatial resolution of this finer map is identical to that of the object location heatmaps stored in the codebook. We note, however, that this is only a particular implementation choice; other choices may be equally appropriate.

The localization network is shown in the blue dashed box in Figure 6. In order to regress the affine parameters, we use both the coarse scene layout and the high-level backbone features; the idea is that the transformation parameters should be learned from both the retrieved (coarse) scene layout and the appearance of the input image. The two inputs are first brought to a common spatial resolution by downsampling. The resulting feature maps are then bottlenecked to a smaller number of channels with separate convolutional layers and concatenated, before going through a further convolutional layer and two fully connected layers that regress the affine parameters. The estimated parameters are then supplied to the grid generator and the sampler, which are standard components of an STN [jaderberg2015spatial].
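The following is a minimal sketch of such an STN-style spatial transformation in PyTorch. The channel widths, layer sizes and the identity initialization are assumptions of ours; it only illustrates the mechanics of regressing affine parameters from the concatenated inputs and warping the coarse layout with a grid generator and sampler, not the exact architecture of Figure 6.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutSpatialTransform(nn.Module):
    """Regress 2D affine parameters from the coarse layout and backbone
    features, then warp the coarse layout with grid_sample."""

    def __init__(self, layout_channels=44, backbone_channels=2048,
                 bottleneck=64, spatial_size=(16, 16)):
        super().__init__()
        self.reduce_layout = nn.Conv2d(layout_channels, bottleneck, 1)
        self.reduce_feat = nn.Conv2d(backbone_channels, bottleneck, 1)
        self.conv = nn.Conv2d(2 * bottleneck, bottleneck, 3, padding=1)
        h, w = spatial_size                               # must match the coarse layout's (H, W); assumed 16x16
        self.fc1 = nn.Linear(bottleneck * h * w, 256)
        self.fc2 = nn.Linear(256, 6)                      # six affine parameters
        # Initialize the regressor to the identity transform.
        nn.init.zeros_(self.fc2.weight)
        self.fc2.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, coarse_layout, feat):
        # Bring the backbone features to the layout's spatial resolution.
        feat = F.adaptive_avg_pool2d(feat, coarse_layout.shape[-2:])
        x = torch.cat([self.reduce_layout(coarse_layout),
                       self.reduce_feat(feat)], dim=1)
        x = F.relu(self.conv(x))
        theta = self.fc2(F.relu(self.fc1(torch.flatten(x, 1))))
        theta = theta.view(-1, 2, 3)
        # Grid generator + sampler (standard STN components).
        grid = F.affine_grid(theta, coarse_layout.size(), align_corners=False)
        warped = F.grid_sample(coarse_layout, grid, align_corners=False)
        return warped, theta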

III-C2 Refinement

In addition to the affine transformation, we would like the refined scene layout to act as an attention mechanism that focuses on regions of a particular input image that are likely to contain objects (see Figure 1). As this requires lower level image features for more accurate localization, we make use of the finer backbone feature map described above, and concatenate it with the affine-transformed scene layout before passing the result through two additional convolutional layers to obtain the final refined scene layout. We note that these two convolutional layers are distinct from the shared backbone features in Figure 6, as they learn dedicated feature transformations for producing the refined scene layout. The module is shown in the green dashed box in Figure 6, and the specific layer parameters of our scene layout transformation sub-network are summarized in Table II.
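A corresponding sketch of the refinement step is shown below; the channel widths and kernel sizes are again assumptions, and the module simply concatenates a finer backbone feature map with the warped layout and applies two convolutions.

import torch
import torch.nn as nn

class LayoutRefinement(nn.Module):
    """Refine the affine-transformed layout using finer backbone features."""

    def __init__(self, layout_channels=44, feat_channels=256, hidden=128):
        super().__init__()
        self.conv1 = nn.Conv2d(layout_channels + feat_channels, hidden, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden, layout_channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, warped_layout, fine_feat):
        # fine_feat is assumed to share the layout's spatial resolution.
        x = torch.cat([warped_layout, fine_feat], dim=1)
        return self.conv2(self.relu(self.conv1(x)))       # refined layout scores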

TABLE II: Layer parameters (kernel sizes, strides, and output sizes) of the scene layout transformation sub-network used on the MIO-TCD dataset, grouped into spatial transformation layers and refinement layers. See Figure 6 for the network architecture.

Table III presents performance comparisons among different choices of backbone features for our scene layout transformation sub-network. For the spatial transformation module, we experimented with backbone feature maps from three different levels; lower-level maps are downsampled so that they share the spatial resolution used by the localization network before going through the downstream processing shown in Figure 6. We left out the map that is directly obtained from another through max pooling, as it is redundant for this comparison. Including lower-level maps allows us to assess the efficacy of finer features for estimating the spatial transformation parameters; as Table III shows, our default choice performs better than the alternatives. For the refinement module, we likewise tested feature maps from three levels, upsampling them where necessary, to verify whether a cascade of feature maps from multiple feature pyramid levels would additionally benefit detection performance. We empirically found that a single feature level is sufficient for strong performance; the remaining levels are, of course, still used for object detection as backbone features. In general, although the exact connectivity pattern may vary beyond the cases discussed above, we observe consistent performance improvements whenever our layout transformation sub-network is used. See Section IV for details.

We present some examples from the outputs of our scene layout transformation in Figure 7, which shows the potential locations for the car category at different scales.

backbone-features AP AP50 AP75
Spatial transformation
57.5 80.9 64.8
52.2 78.2 57.9
55.9 80.0 62.8
Refinement
57.5 80.9 64.8
57.2 80.8 64.2
56.8 80.5 63.9
TABLE III: Average precision values we obtained on the MIO-TCD validation set while using various combinations of backbone features in our layout transformation sub-network.

III-D Piecing things together

The scene layout predictions, as shown in Figure 7, provide useful context cues for object detection. In this paper, we propose two strategies to integrate these cues into an object detection framework:

III-D1 Late fusion

Following [wang2017efficient], we can directly use the scene layout scores in conjunction with the output of an object detector. The final object detection score is a weighted sum of the two scores:

(3)

where the first term is the scoring function of an object detector, such as Faster RCNN, and the weight on the scene layout score is a hyperparameter controlling the relative importance of the two terms.
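In symbols, and with notation of our own choosing rather than the paper's original symbols, the late fusion score takes the form

\[
S(h \mid x) = S_{\mathrm{det}}(h \mid x) + \alpha\, S_{\mathrm{layout}}(h \mid x),
\]

where $\alpha \geq 0$ weights the scene layout term.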

III-D2 Early fusion

We use the scene layout scores as an intermediate representation that enhances our image features. In particular, we experimented with various feature fusion methods, as shown in Table VI(b). Our best-performing model uses a 1×1 convolution layer, as illustrated in Figure 2. Specifically, for each level in the feature pyramid [lin2017feature], we resize the scene layout scores to the corresponding feature map resolution with bilinear interpolation, and then concatenate the scores to the feature map. The 1×1 convolution then maps the concatenated features back to the original feature dimensions. The main advantage of the early fusion strategy is that it results in a model that allows for alternating optimization of object detection and scene layout estimation. In Section IV, we compare early fusion with late fusion on the three datasets evaluated in this paper. The details of the training schedule for early fusion are presented in Section III-F.
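The sketch below illustrates this early fusion at a single feature pyramid level, with assumed channel counts: the layout scores are resized with bilinear interpolation, concatenated with the backbone feature map, and mapped back to the original channel dimension with a 1×1 convolution.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyFusion(nn.Module):
    """Fuse scene layout scores into one FPN level's feature map."""

    def __init__(self, feat_channels=256, layout_channels=44):
        super().__init__()
        # 1x1 conv maps the concatenated features back to the original width.
        self.fuse = nn.Conv2d(feat_channels + layout_channels, feat_channels, 1)

    def forward(self, feat, layout_scores):
        # Resize the layout scores to this level's spatial resolution.
        layout = F.interpolate(layout_scores, size=feat.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([feat, layout], dim=1))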

III-E Parameter learning

In this section, we discuss details pertaining to the learning of the scene layout transfer for object detection. Particularly, the interpretable nature of our scene layout classification and transformation sub-networks allows us to inject supervision signals during training. We do so by adding additional terms to the learning objective of the object detection algorithm. We begin by discussing our overall learning objective.

III-E1 Learning objective

Denote the loss function of our baseline object detection algorithm as the detection loss. In the case of Faster RCNN [ren2015faster], this is a multi-task learning objective that involves object classification and bounding box regression. The overall learning objective of our proposed method can be written as follows:

(4)

where the second and third terms are the loss functions for scene layout classification and scene layout transformation, respectively.
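Written out with illustrative symbols of our own (the weighting scheme is an assumption, not necessarily the paper's exact form), the objective is

\[
\mathcal{L} = \mathcal{L}_{\mathrm{det}} + \lambda_{\mathrm{cls}}\, \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{trans}}\, \mathcal{L}_{\mathrm{trans}},
\]

where $\mathcal{L}_{\mathrm{det}}$ is the Faster RCNN detection loss and $\mathcal{L}_{\mathrm{cls}}$, $\mathcal{L}_{\mathrm{trans}}$ are the scene layout classification and transformation losses defined below.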

III-E2 Learning scene layout classification

For scene layout classification, we use the multi-class cross entropy loss, which is the most widely used loss function for neural network based classification. Specifically, the scene layout classification loss is defined as:

(5)

where the predicted softmax probabilities are compared against the ground-truth scene layout class label of each training image.
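For reference, the standard multi-class cross entropy over the K layout clusters, written with symbols of our own choosing, is

\[
\mathcal{L}_{\mathrm{cls}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} \mathbb{1}[c_i = k]\, \log p_k(x_i),
\]

where $p_k(x_i)$ is the predicted softmax probability that image $x_i$ belongs to layout cluster $k$ and $c_i$ is its ground-truth cluster label.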

III-E3 Learning scene layout transformation

The output of our scene layout transformation sub-network is a set of heatmaps of object locations for different object categories at various scales and aspect ratios. Its spatial dimensions are the width and height of the heatmaps, and its channel dimension equals the number of components in the mixture model discussed in Section III-A, i.e., the product of the number of object categories, scale groups, and aspect ratio groups (44 in the case of the MIO-TCD dataset). Although it is possible to train the scene layout transformation sub-network without additional supervision, we can learn the network parameters with target scene layouts derived from the ground-truth annotations. Overall, we write our scene layout transformation loss as follows:

(6)

where the layout loss quantifies the mismatch between the predicted and target scene layouts, and a regularization term accounts for the fact that the affine transformation should not drift too far from the identity mapping. A hyperparameter controls the tradeoff between the two terms.

The layout loss is an element-wise mean squared error (MSE) loss given by:

(7)

Here the target scene layout of an image is obtained by accumulating the ground-truth object annotations in that image:

(8)

where the sum runs over all ground-truth object annotations of the image, support is accumulated only from annotations that fall into the same mixture model component as the hypothesis, the comparison is made between object center coordinates, and a constant factor normalizes the result. Finally, the regularization term is given by:

(9)

where the regularizer penalizes the squared deviation of the affine transformation parameters from the identity transformation, averaged over the number of parameters.
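With illustrative symbols of our own choosing, the two terms described above can be written as

\[
\mathcal{L}_{\mathrm{layout}} = \frac{1}{N}\sum_{i=1}^{N} \big\| S_{\mathrm{layout}}(x_i) - S^{*}_{\mathrm{layout}}(x_i) \big\|_2^2, \qquad
\mathcal{L}_{\mathrm{reg}} = \frac{1}{|\theta|}\, \big\| \theta - \theta_{\mathrm{id}} \big\|_2^2,
\]

where $S^{*}_{\mathrm{layout}}(x_i)$ is the target layout accumulated from the ground-truth boxes of image $x_i$, $\theta$ are the predicted affine parameters, and $\theta_{\mathrm{id}}$ is the identity transform.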

dataset / input resolution / feature map resolutions / kernel dimensions
MIO-TCD
Bosch Small Traffic Lights
KITTI
TABLE IV: Input resolutions, feature map resolutions, and the kernel dimensions of the convolutional layer in the transformation sub-network for each dataset. See Section III-F for details.

III-F Training details

In this paper, we adopt a three-step training schedule to learn object detection and scene layout estimation in an alternating fashion. In the first step, we train the Faster RCNN detector following the standard practice described in the original paper [ren2015faster]; the scene layout codebook is also built offline. In the second step, we freeze the backbone features and learn the scene layout classification sub-network and the scene layout transformation sub-network sequentially. This order is important, because without an accurate scene layout being retrieved, training of the transformation sub-network becomes unstable. In this step, we use the SGD solver. In the third step, we fine-tune the object detector again with the scene layout classification and transformation sub-networks fixed, using a reduced initial learning rate. The above alternating training can be run for more iterations, but we observed negligible improvements.

Due to the fully convolutional nature of the backbone network, the dimensions of the feature maps depend on the resolution of the input image. The settings used in this paper are summarized in Table IV. For the Bosch Small Traffic Lights [behrendt2017deep] and KITTI datasets, the feature map dimensions are higher than for the MIO-TCD dataset, so we halve the number of convolutional kernels in the corresponding layer of the transformation sub-network in order to reduce its number of parameters. We note, however, that our model is fairly robust to such changes in the network architecture. It should also be noted that the higher input image resolution is essential for the Bosch Small Traffic Lights dataset, as most of its objects are only a handful of pixels wide. In addition, the starting size of the RPN anchors is also reduced to better cope with small objects.

IV Experimental Results

In this section, we thoroughly compare the proposed method with state-of-the-art object detection algorithms on public benchmark datasets, and present ablation studies to verify the efficacy of our model design choices. We focus on two important sub-areas in vision processing for intelligent transportation systems: traffic surveillance and autonomous driving. We evaluate the proposed method on three challenging object detection datasets:

  • MIOvision Traffic Camera Dataset (MIO-TCD) [luo2018mio]: To the best of our knowledge, MIO-TCD is the largest public benchmark for object detection in traffic surveillance images, with separate training and testing splits for its localization task. For the test set, the dataset offers a public object detection challenge, the TSWC-2017 localization challenge, which allows participating teams to upload their test results to an evaluation server to obtain their detection performance. Entries are ranked by mean average precision, following the PASCAL VOC challenge [pascal-voc-2012].

  • Bosch Small Traffic Lights Dataset [behrendt2017deep]: The Bosch Small Traffic Lights dataset presents a unique challenge of detecting small objects with partial occlusions, and provides separate training and testing sets. Context information can be particularly useful for localizing small objects whose own appearance cues are weak.

  • KITTI Object Detection [geiger2013vision]: The KITTI object detection dataset has 7,481 training images and 7,518 testing images. As our method crucially relies on the availability of large-scale datasets that cover objects at various locations and scales, we evaluate detection performance on the most prevalent car category, which has by far the most training instances. The KITTI dataset also offers a public online benchmark based on average precision.

We refer to our method as the Layout Transfer Network (LTN) in the following sections.

IV-A Results on the MIO-TCD dataset

Fig. 7: Examples of the refined scene layout scores for the car category obtained with the MIO-TCD dataset. Input images are shown in the leftmost column, with locations for extra small (XS, farthest), small (S, far), medium (M, close) and large (L, closest) objects in the four columns to the right.
MIO-TCD-test backbone a.truck bicycle bus car motorcycle m.vehicle n.m.vehicle pedestrian p.truck s.u.truck workvan mean

Challenge submissions
LTN (Ours) R101-FPN 92.38 88.42 97.64 95.14 92.65 70.65 58.41 67.31 92.74 74.97 79.93 82.75
Jung et al. [jung2017resnet] R101-ensemble 92.48 87.34 97.46 89.70 88.21 62.32 59.09 48.57 92.25 74.42 79.86 79.24
Wang et al. [wang2017efficient] VGG-16+context 91.62 79.90 96.77 93.80 83.63 56.40 58.23 42.61 92.75 73.80 79.56 77.19
Baseline approaches
SSD-512 [liu2016ssd] VGG-16 91.28 77.36 96.56 93.59 79.53 55.39 56.60 41.58 92.66 72.74 79.40 76.06
YOLO v2 [redmon2017yolo9000] DarkNet-19 88.31 78.64 95.13 81.36 81.36 51.70 56.57 24.96 86.48 69.23 76.43 71.83
Faster RCNN [ren2015faster] VGG-16 80.70 70.63 93.45 79.85 74.58 46.48 21.22 19.49 86.71 53.29 67.40 63.07
TABLE V: Per-class and mean average precision values (in %) we obtained in the TSWC-2017 localization challenge (MIO-TCD test set). Our entry obtains a mean AP of 82.75, which is the state of the art. We have the highest AP in 8 out of 11 object categories among all entries.
early-vs-late-fusion AP AP50 AP75
LTN – late fusion 55.2 79.5 61.9
LTN – early fusion 57.5 80.9 64.8
+2.3 +1.4 +2.9
(a) Early vs. late fusion: We compare two fusion methods for the learned scene layout scores. Early fusion allows for joint training for object detection and scene layout estimation, and compares favorably to late fusion.
fusion-method AP AP50 AP75
eltwise-mul 56.2 80.0 63.3
eltwise-sum 54.7 79.5 61.2
1×1 conv 57.5 80.9 64.8
(b) Fusion method: For early feature-level fusion, we experimented with different network structures. The 1×1 convolution outperforms the other alternatives.
layout-resolution AP AP50 AP75
55.4 79.8 61.9
57.5 80.9 64.8
57.7 80.9 65.1
(c) Layout resolution: AP results with various layout resolutions. Resolutions above our default (the resolution of the finer backbone feature map) give diminishing returns.
classify? transform? refine? AP AP50 AP75
Baseline 48.2 76.1 52.7
+ components 52.0 77.6 58.0
53.3 78.8 59.6
55.5 79.8 62.2
Full model 57.5 80.9 64.8
(d) Component test: To isolate the performance gains, we switch off the following three components from our full model: (1) scene layout classification, (2) spatial transformation, and (3) refinement. Each component provides its own AP improvements, while the full model performs the best.
AP AP50 AP75
FCN 51.1 76.2 58.1
LTN 57.5 80.9 64.8
+6.4 +4.7 +6.7
(e) Fully convolutional network (FCN) vs. Layout transfer network (LTN): Directly predicting scene layouts with an FCN does not produce superior results.
AP AP50 AP75
w/o 53.0 78.6 58.6
w/ 57.5 80.9 64.8
+4.5 +2.3 +6.2
(f) Layout regularization: The layout regularization term is essential for learning the spatial transformation. We observe large performance gaps without the regularizer being used.
TABLE VI: Ablations. We report average precision values from our ablation studies on the MIO-TCD validation set. The backbone network is a ResNet-101-FPN [lin2017feature]. See Section IV-A for discussions.

In order to perform detailed ablation studies, we randomly split the original training set of the MIO-TCD dataset into a smaller training set and a held-out validation set. All results reported in this section are obtained by training models on the smaller training set, with ablation studies carried out on the held-out validation set.

The quantitative results we obtained in the TSWC-2017 localization challenge (http://podoce.dinf.usherbrooke.ca/results/localization), i.e., on the MIO-TCD test set, are reported in Table V. Compared to the other entries, our method shows a clear advantage. In particular, our results are 3.5 points better in terms of mean AP than the winning entry from Jung et al. [jung2017resnet]. It should be noted, however, that their method uses a four-model ensemble (two ResNet-50s and two ResNet-101s) based on R-FCN [dai2016r], whereas ours uses only a single model. In addition, our results are 5.6 points better than our previous work [wang2017efficient], which uses a nonparametric label transfer method to predict the scene layout for object detection. Overall, we obtain the highest average precision on 8 out of 11 object categories. To the best of our knowledge, these results are the state of the art in the MIO-TCD localization challenge.

In addition to the results on the test set, we also report some ablations we obtained on the held-out validation set, which are presented in Table VI. Specifically, we perform the following ablation studies:

  • Early vs. late fusion (Table VI(a)): We compare the performance of early and late fusion as discussed in Section III-D. As expected, the early fusion strategy allows us to jointly optimize for object detection and scene layout estimation, resulting in better detection performance.

  • Fusion method (Table VI(b)): In the case of early fusion, there are various ways to integrate the inferred scene layout features with the backbone image features. In this set of experiments, we explore different approaches to feature fusion, including elementwise multiplication (eltwise-mul), elementwise summation (eltwise-sum), and 1×1 convolution. The 1×1 convolution variant performs best and is used in all other experiments by default. We note that, for the two elementwise fusion methods, we need to apply a 1×1 convolution layer before the elementwise operations to make sure that the feature dimensions of the backbone and the scene layout are compatible.

  • Layout resolution (Table VI(c)): Because we use a pyramid of features in the FPN backbone, a higher scene layout resolution could be helpful for detecting and localizing smaller objects. In general, a layout resolution matching that of the finer backbone feature map used in the refinement module is sufficient for the MIO-TCD dataset, and higher resolutions give only marginal performance gains. We therefore stick to this resolution as the scene layout resolution on all three datasets, as it provides a good tradeoff between performance and model complexity (i.e., the number of parameters).

  • Component test (Table VI(d)): The three main modules in the scene layout estimation are: (1) scene layout classification, (2) spatial transformation, and (3) refinement. Compared to a baseline Faster RCNN that uses none of these components, adding each module or any combination of them contributes to the AP performance.

  • FCN vs. LTN (Table VI(e)): One of the main advantages of our proposed method is that we are able to retrieve a rough estimate of the scene layout in the first step, making the subsequent transformation and refinement easier to learn. Here we also report results obtained by learning a fully convolutional network that directly predicts the scene layout features from the backbone features. As we can see, our retrieve-and-transform strategy boosts the detection performance by a reasonable margin.

  • Layout regularization (Table VI(f)): Finally, we verify the effectiveness of our layout regularization term in Equation 9. We observe a large performance improvement with the regularizer being used. In addition, we found in our experiments that the regularization term helps stabilize training and avoid over-fitting.

Traffic-lights-test AP AP50 AP75
Faster RCNN 30.25 73.44 15.91
Faster RCNN + LTN (late fusion) 31.20 75.83 18.52
Faster RCNN + LTN (early fusion) 31.64 75.97 19.06
 + LTN (late fusion) +1.0 +2.4 +2.6
 + LTN (early fusion) +1.4 +2.5 +3.2
TABLE VII: Average precision values we obtained on the Bosch Small Traffic Lights test set.

As mentioned above, our method provides an interpretable representation of scene layouts. In Figure 7, we present some examples of the scene layouts for the car category at different scales as predicted by our method. It is clear from these examples that our scene layout scores can reliably predict object locations and scales through the object location heatmaps. In addition, we present some qualitative detection results in Figure 8. As we can see, our method is able to reliably detect overlapping and distant objects by incorporating the context cues. Typical failure modes are also presented in the last two rows; among the most common are class confusions, false alarms, and missed out-of-context objects.

Fig. 8: Example detection results on the MIO-TCD test set. The last two rows present some typical failure modes, namely class confusion, false alarm, and missing out-of-context objects. See Section IV-A for details. Best viewed electronically, zoomed in.
Fig. 9: Example detection results on the Bosch Small Traffic Lights test set. See Section IV-B for details. Best viewed electronically, zoomed in.
Fig. 10: Example detection results on the KITTI test set. See Section IV-C for details. Best viewed electronically, zoomed in.

IV-B Results on the Bosch Small Traffic Lights dataset

In addition, we report the results we obtained on the Bosch Small Traffic Lights dataset. This dataset presents a unique challenge of detecting small, partially occluded objects in autonomous driving. The detection results on the test set are reported in Table VII. In particular, accurate localization of objects that are only a few pixels wide is very challenging for state-of-the-art object detectors, as reflected by the low AP75 values in sharp contrast to the AP50 values. Although we found that temporal information could be used to improve accuracy on this dataset (as some traffic lights missed in one frame may be detected in preceding or succeeding frames), it is not the focus of this paper and is therefore not used. Once again, our layout transfer network is able to improve detection results in these very challenging scenarios. We also emphasize again that using high resolution input images, as specified in Table IV, and smaller RPN anchors is vital for obtaining satisfactory results on this particular dataset. We present some example detection results in Figure 9.

IV-C Results on the KITTI dataset

The KITTI dataset provides a comprehensive set of real-world and challenging computer vision tasks, including stereo, optical flow, visual odometry, object detection and tracking for scene understanding in autonomous driving. A 2D object detection dataset with 7,481 annotated training images is provided. As it is vital for our method to transfer scene layouts from a large dataset with objects at various locations and scales, we only evaluate the detection performance on cars, which is the largest object category; the other categories (i.e., pedestrian and cyclist) have far fewer instances, so we do not include them in our evaluation. In addition, we split the original training set into a separate training set and a validation set. This allows us to directly demonstrate the effectiveness of the LTN against a baseline with otherwise identical settings.

KITTI-val Moderate Easy Hard
Faster RCNN 88.72 90.76 80.31
Faster RCNN + LTN (late fusion) 93.12 94.04 86.25
Faster RCNN + LTN (early fusion) 93.09 94.42 86.65
 + LTN (late fusion) +4.4 +3.3 +5.9
 + LTN (early fusion) +4.4 +3.7 +6.3
TABLE VIII: Average precision values (in %) we obtained on the KITTI validation set.

Following the evaluation protocol of the dataset, average precision values are reported for the Moderate, Easy, and Hard settings, respectively. Table VIII summarizes the results on the KITTI validation set. Compared to the baseline, our LTN (with early fusion) improves the average precision by 4.4, 3.7 and 6.3 points on the three difficulty levels, respectively. In particular, the performance at the Hard difficulty level is boosted by a large margin, as contextual cues are important for detecting objects that are heavily occluded or truncated by image boundaries. In addition, Table IX reports results on the KITTI benchmark (i.e., the test set). In general, our method achieves a good tradeoff between accuracy and speed, especially at the Hard difficulty level. In particular, the top-performing method RRC [ren2017accurate] is around nine times slower than our method due to the recurrent nature of its architecture. Another state-of-the-art method, SINet [hu2018sinet], is faster than our approach, but we perform better at the Hard difficulty level. We also note that these contributions may be orthogonal to ours. Furthermore, some state-of-the-art methods on the online benchmark (http://www.cvlibs.net/datasets/kitti) are not listed here because they either require additional input modalities (e.g., PC-CNN-V2 [du2018general] and F-PointNet [qi2018frustum]) or CAD models during training (e.g., Deep MANTA [chabot2017deep]).

We present some qualitative results in Figure 10. As we can see, scene layouts can be well indicative of potential vehicle locations and scales, and that our method is able to detect overlapping and distant objects with high accuracy.

KITTI-test time/image AP (Moderate) AP (Easy) AP (Hard)
RRC [ren2017accurate] 3.6s 90.23 90.61 87.44
SJTU-HW [zhang2018led, fang2018small] 0.85s 90.08 90.81 79.98
SINet [hu2018sinet] 0.2s 89.60 90.60 77.75
Deep3DBox [mousavian20173d] 1.5s 89.04 92.98 77.17
MS-CNN [cai2016unified] 0.4s 89.02 90.03 76.11
Mono3D [chen2016monocular] 4.2s 88.66 92.33 78.96
SDP+CRC (ft) [yang2016exploit] 0.6s 83.53 90.33 71.13
spLBP [hu2016fast] 1.5s 77.40 87.19 60.60
Reinspect [stewart2016end] 2s 76.65 88.13 66.23
Regionlets [long2014accurate, wang2015regionlets, zou2014generic] 1s 76.45 84.75 59.70
SubCat [ohn2015learning] 0.7s 75.46 84.14 59.71
LTN (Ours) 0.4s 88.85 90.12 79.62
TABLE IX: Average precision values (in %) on the KITTI benchmark. Methods are ranked by their performance at the Moderate difficulty level. In general, the LTN achieves a good tradeoff between accuracy and speed, especially at the Hard difficulty level. See text for details.

V Conclusion

In this paper, we proposed a layout transfer network for context aware object detection. An important aspect of our method is that we obtain an interpretable scene layout representation which can be directly used to improve object detection performance. The scene layout transfer in our method provides a general approach to context modeling for object detection that can be used in conjunction with many detection algorithms beyond those discussed in this paper. In the future, we wish to perform scene layout classification with a fully integrated deep learning approach. In particular, this may allow for end-to-end training of all sub-systems and parameters, which may provide better overall performance; for example, backpropagation could be used to fine-tune the scene layout codebook after its initialization by clustering. We hope that our work will serve as a modest spur for further exploration into simple and robust scene layout representations that may be useful for a wider variety of scene understanding problems.

Acknowledgment

We thank the anonymous reviewers for their insightful comments. We also thank Zhiming Luo and Pierre-Marc Jodoin for their help in our participation in the TSWC-2017 challenge. Project sponsored by NSFC (61703195, 61702431), Fujian NSF (2019J01756), Shanghai NSF (18ZR1425100), The Education Department of Fujian Province (JAT170459, JK2017039, and the Distinguished Young Scholars Program), Fuzhou Technology Planning Program (2018-G-96, 2018-G-98) and Minjiang University (MJUKF201716, MJY19021, MJY19022).

References