The region-based object detectors [rcnn, fastrcnn, fasterrcnn, fpn, cascade, ecr] popularized by the R-CNN framework [rcnn] are conceptually intuitive and flexible, and have achieved top accuracies on challenging benchmarks such as MS-COCO [coco]. Region-based detectors first generate a sparse set of object proposals, and then use a detection head to refine the proposal locations and classify each proposal as one of the foreground classes or as background. One crucial module in such a proposal-driven pipeline is the RoIPool [fastrcnn] or RoIAlign [maskrcnn] operator, which extracts RoI (Region of Interest) features aligned with the proposal locations for the detection head.
In this paper, we revisit the RoI features in region-based detectors from the perspective of context information embedding. Our key motivation is that, while each RoI in very deep CNNs may have a very large theoretical receptive field that often spans the entire input image [fastrcnn], the effective receptive field [Luo2016] may only occupy a fraction of the full theoretical receptive field. This makes the RoI features insufficient for characterizing objects that are highly dependent on context information, such as “bowl” and “skateboard”. Here, contextual information means any auxiliary information that can assist in suppressing false positive detections in noisy backgrounds, or in recognizing objects that have no distinctive appearance themselves. For example, as shown in Fig. 1 (a), the semantic features of “motorcycle” are strong evidence for filtering out the activations of irrelevant object categories like “spoon”, “bowl”, and “sink”. On the other hand, as shown in Fig. 1 (b), the scene and even the human pose are useful clues for correctly classifying a proposal as “skateboard” rather than “tennis racket”.
Recently, several works exploited the region-level context information to improve the localization ability of two-stage detectors. Chen et al. [context] demonstrated that rich contextual information from neighboring regions can better refine the proposal locations for two-stage detectors. Kantorov et al. [kantorov2016contextlocnet] leveraged the surrounding context regions to improve weakly supervised object localization. However, to the best of our knowledge, currently there is no enabling framework which is systematically designed for embedding context information to improve the classification ability of region-based detectors.
In this paper, we present a novel Hierarchical Context Embedding (HCE) framework for region-based object detectors. Our framework consists of three modules. Firstly, we consider that the simplest way to break the contextual limit in object detection is to partially cast the object-level feature learning as an image-level multi-label classification task. Building upon this observation, we design an image-level categorical embedding module, which in essence is a multi-label classifier on top of the detection backbone, in parallel with the existing region-based detection head. It enables the backbone to exploit the whole image context to learn discriminative features for context-dependent object categories. Even as a standalone enhancement, our image-level categorical embedding module leads to improvements over existing region-based detectors.
Upon the image-level categorical embedding module, at the instance level, we design a simple but effective process to generate hierarchical contextual RoI features which can be directly utilized by the region-wise detection head. Because our contextual RoI features are enhanced by image-level categorical supervision and exploit larger contexts, they are by nature complementary to conventional RoI features, which are trained by region-based detectors and only exploit limited context. Then, early-and-late fusion strategies, i.e., feature fusion and confidence fusion, are designed to make full use of our contextual RoI features. Through quantitative experiments, we demonstrate that they can be combined to further boost the classification accuracy of the detection head.
In general, our proposed HCE framework is easy to implement and is end-to-end trainable. We conduct extensive experiments on MS-COCO 2017 [coco] to validate the effectiveness of our HCE framework. Without bells and whistles, our HCE framework delivers consistent accuracy improvements for almost all existing mainstream region-based detectors, including FPN [fpn], Mask R-CNN [maskrcnn] and Cascade R-CNN [cascade]. We also conduct ablation studies to verify the effectiveness of each module involved in our HCE framework. Fig. 1 gives the example images of the baseline method and our method, which demonstrates that our framework can effectively filter out the noisy background detections and correctly classify indistinctive objects by leveraging the context information it exploited.
2 Related Work
2.1 Region-based Object Detection
Convolutional neural networks have led to a paradigm shift in object detection over the past decade [liu2019deep]. Among a large number of approaches, the two-stage R-CNN series [rcnn, fastrcnn, fasterrcnn, fpn, cascade] has become the leading detection framework. The pioneering work, i.e., R-CNN [rcnn], extracts region proposals from an image with selective search [uijlings2013selective], and applies a convolutional network to classify each region of interest independently. Fast R-CNN [fastrcnn] improves R-CNN by sharing convolutional features among RoIs, which enables fast training and inference. Then, Faster R-CNN [fasterrcnn] advances the region proposal generation with a Region Proposal Network (RPN). RPN shares the feature extraction backbone with the detection head, which in essence is a Fast R-CNN [fasterrcnn]. Faster R-CNN is a widely used two-stage detection framework, and is the foundation for many follow-up works [dai2016r, fpn].
In recent years, several algorithms have been proposed to further improve the two-stage Faster R-CNN framework. For example, Feature Pyramid Networks (FPN) [fpn] constructed inherent CNN feature pyramids, which can largely improve the detection performance on small objects. Mask R-CNN [maskrcnn] extended Faster R-CNN by adding a mask branch, and boosted the performance of both object detection and instance segmentation. Cascade R-CNN [cascade] utilized a multi-stage training strategy to progressively improve the quality of region proposals, and demonstrated significant gains for high-quality (measured by higher IoUs) object detection. Complementary to these works, in this paper, we focus on developing a Hierarchical Context Embedding (HCE) framework to improve the classification ability of region-based detectors. Thanks to the simplicity and generalization ability of our HCE framework, it brings consistent and significant improvements over the aforementioned leading region-based detectors, e.g., FPN, Mask R-CNN and Cascade R-CNN.
2.2 Context Information for Object Detection
In object detection, both global context [galleguillos2010context] and local context [rabinovich2007objects] are widely exploited for improving performance, especially when object appearances are insufficient due to small object size, occlusion, or poor image quality. Our work is inspired by some previous works, but its key motivation and implementation differ significantly from them. Next, we review several topics in object detection that are closely related to our work.
2.2.1 Combined Localization and Classification.
Before the era of deep learning, Harzallah et al. [harzallah2009combining] proposed to combine two closely related tasks, i.e., object localization and image classification. They demonstrated that classification can improve detection by a contextual combination and vice versa. Similar in spirit, we utilize the full image-level context to learn object-level concepts. Differently, however, we utilize global context to learn CNN features rather than the hand-crafted features adopted in [harzallah2009combining]. Furthermore, we integrate hierarchical contextual clues from both whole images and regions of interest into modern region-based CNN detectors, rather than the traditional sliding window detector used by [harzallah2009combining].
2.2.2 Region Proposal Refinement.
Recently, Chen et al. [context] explored rich contextual information to refine the region proposals for object detection. Neighboring regions with useful context can benefit the localization quality of region proposals, which further leads to better detection performance. Instead of refining proposals, we focus on improving the classification ability of region-based detectors by embedding hierarchical contextual clues.
2.2.3 Weakly-Supervised Object Detection.
In weakly supervised object detection, the bounding box annotations are not provided, and only image-level categorical labels are available. The common practice [cinbis2016weakly, weakly, bilen2015weakly, kantorov2016contextlocnet] in this area is to first generate a set of noisy object proposals, and then learn from these noisy proposals with specially designed robust algorithms. Among them, Kantorov et al. [kantorov2016contextlocnet] proposed a context-aware deep network which leverages the surrounding context regions to improve localization. Unlike the usage of region-level context information [kantorov2016contextlocnet] for weakly supervised detection, we focus on the task of fully-supervised object detection, and particularly exploit global image-level context to advance the recognition of context-dependent object categories.
2.3 Context Information for Other Vision Tasks
Beyond object detection, context information has also been utilized to improve other vision tasks. For example, Wang et al. [rnn_attention] leveraged attention mechanisms and LSTMs to discover semantic-aware regions and capture the long-range contextual dependencies for multi-label image recognition. He et al. [adaptive] proposed an adaptive context module to generate multi-scale context representations for semantic segmentation. Qu et al. [deshadownet] embedded multi-context information (the appearance of the input image and semantic understanding) to obtain the shadow matte. Byeon et al. [contextvp] leveraged the LSTM units to capture the entire available past context on video prediction. Li et al. [derain] adopted the dilated convolution to acquire more contextual information for single image deraining.
3.1 Framework Overview
We begin by briefly describing our Hierarchical Context Embedding (HCE) framework (see Fig. 2) for region-based object detection. Firstly, an image-level categorical embedding module is employed to advance the feature learning of the objects that are highly dependent on larger context clues. Then, hierarchical contextual RoI features are generated by fusing both instance-level and global-level information derived from the image-level categorical embedding module. Finally, early-and-late fusion modules are designed to make full use of the contextual RoI features to improve the classification performance. Our HCE framework is flexible and generalizable, as it can be applied as a plug-and-play component for almost all mainstream region-based object detectors.
3.2 Image-Level Categorical Embedding
As aforementioned, conventional RoI-based training for region-based detectors may lack the context information, which is crucial for learning discriminative filters for context-dependent objects. To break this limitation, in parallel with the RoI-based branch, we exploit image-level categorical embedding upon the detection backbone, enabling the backbone to learn object-level concepts adaptively from global-level context. Our image-level categorical embedding module does not require additional annotations, as the image-level labels can be conveniently obtained by collecting all instance-level categories in an image.
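As a minimal sketch of how the image-level labels can be derived from existing annotations, the snippet below collects the instance-level category ids of one image into a multi-hot vector. The function name `image_level_labels` and the use of zero-based integer category ids are assumptions made for illustration only.

```python
def image_level_labels(instance_categories, num_classes):
    """Build a multi-hot image-level label vector from instance-level category ids.

    instance_categories: iterable of zero-based category ids, one per annotated box
    (the id convention is an assumption of this sketch).
    """
    labels = [0] * num_classes
    for c in instance_categories:
        if not 0 <= c < num_classes:
            raise ValueError(f"category id {c} is out of range")
        labels[c] = 1  # repeated instances of a category still yield a single 1
    return labels


# An image with two instances of category 3 and one instance of category 0.
print(image_level_labels([3, 0, 3], num_classes=5))  # [1, 0, 0, 1, 0]
```

No extra annotation is involved: the vector only records which categories occur, not how many instances there are or where they are.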
We first apply an additional convolution layer on the output of the detection backbone to obtain the input feature map, and then employ both global max-pooling (GMP) and global average-pooling (GAP) for feature aggregation (as in [woo2018cbam]). Here, the additional convolution layer aims to alleviate possible side effects on the original detection backbone.
We refer to the input feature map of our image-level embedding module as the context-embedded image feature. This is because the input feature map conveys whole-image context for learning all object categories that appear in the image, and in turn, each location of the feature map is supervised by all object categories. By contrast, conventional RoI features trained by region-based detectors only exploit limited context for learning each object category.
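Formally, let $\mathbf{X} \in \mathbb{R}^{D \times H \times W}$ denote the input feature map, where $D$ is the channel dimensionality, and $H$ and $W$ are the height and width, respectively. Then, the multi-label classifier is constructed by binary classifiers for all categories:
$$\mathbf{y} = f_{cls}(f_{GMP}(\mathbf{X})) + f_{cls}(f_{GAP}(\mathbf{X})),$$
where $\mathbf{y} \in \mathbb{R}^{C}$, $C$ denotes the number of categories, each element of $\mathbf{y}$ is a confidence score (logits), and $f_{cls}$ is the binary classifier modeled as one fully-connected layer. We assume that the ground truth label of an image is $\mathbf{y}^{*} \in \{0, 1\}^{C}$, where $y^{*}_{i}$ denotes whether an object of category $i$ appears in the image or not. The multi-label loss can be formulated as follows:
$$L_{mll} = -\sum_{i=1}^{C} \left[ y^{*}_{i} \log \sigma(y_{i}) + (1 - y^{*}_{i}) \log\left(1 - \sigma(y_{i})\right) \right],$$
where $\sigma(\cdot)$ is the sigmoid function.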
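The image-level classifier and its loss can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: it assumes that one shared fully-connected layer is applied to the GMP and GAP vectors and that the two logit vectors are summed (in the style of CBAM), and that the loss is a sigmoid cross-entropy summed over categories. All function names are hypothetical.

```python
import math


def global_max_pool(x):
    """x: feature map as nested lists [D][H][W]; returns a D-vector of channel maxima."""
    return [max(max(row) for row in ch) for ch in x]


def global_avg_pool(x):
    """Returns a D-vector of channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def fully_connected(v, weight, bias):
    """weight: [C][D], bias: [C]; returns C logits."""
    return [sum(w * a for w, a in zip(w_row, v)) + b for w_row, b in zip(weight, bias)]


def multilabel_logits(x, weight, bias):
    # Assumption: the same classifier is shared by both pooled vectors and the logits are summed.
    from_gmp = fully_connected(global_max_pool(x), weight, bias)
    from_gap = fully_connected(global_avg_pool(x), weight, bias)
    return [a + b for a, b in zip(from_gmp, from_gap)]


def sigmoid(z):
    # Numerically stable for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def multilabel_loss(logits, targets, eps=1e-12):
    """Sigmoid cross-entropy summed over categories; targets are 0/1 image-level labels."""
    loss = 0.0
    for y, t in zip(logits, targets):
        p = sigmoid(y)
        loss -= t * math.log(p + eps) + (1 - t) * math.log(1.0 - p + eps)
    return loss


# Toy input: D = 2 channels on a 2x2 map, C = 2 categories, identity classifier.
feature_map = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, -4.0]]]
logits = multilabel_logits(feature_map, weight=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
loss = multilabel_loss(logits, targets=[1, 0])
```

Because every category is scored by an independent sigmoid, several categories can be active in the same image, which is what distinguishes this objective from the softmax classification used per RoI.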
Because the global feature learning strategy is complementary to RoI-based training, our image-level categorical embedding module can, on its own, boost the performance of existing region-based detectors (demonstrated later by experiments, cf. Table 2). However, one limitation of image-level categorical embedding is that the derived context-embedded image feature cannot be directly used by the detection head.
3.3 Hierarchical Contextual RoI Feature Generation
To further benefit region-wise classification, we generate hierarchical contextual RoI features by combining the instance-level and global-level information from the context-embedded image features. The hierarchical contextual RoI feature generation process is shown in Fig. 3 (b).
3.3.1 Context-Embedded Instance-Level Feature
We apply RoIAlign [maskrcnn] with proposals generated by RPN on the context-embedded image feature $\mathbf{X}$ to obtain RoI features $\mathbf{X}_{ins}$:
$$\mathbf{X}_{ins} = f_{RoIAlign}(\mathbf{X}; (w, h)),$$
where $f_{RoIAlign}(\cdot)$ is the RoIAlign operation, and $h$ and $w$ are the height and width of the RoI, respectively. As $\mathbf{X}_{ins}$ is extracted from the context-embedded image feature $\mathbf{X}$, we term it the context-embedded instance-level feature.
3.3.2 Context-Aggregated Global-Level Feature
To leverage larger context, we apply RoIAlign on the context-embedded image feature $\mathbf{X}$ to aggregate the global-level context. We refer to the derived RoI feature as the context-aggregated global-level feature $\mathbf{X}_{glo}$:
$$\mathbf{X}_{glo} = f_{RoIAlign}(\mathbf{X}; (W_{I}, H_{I})),$$
where $H_{I}$ and $W_{I}$ are the height and width of the input image, respectively.
Once the context-embedded instance-level feature $\mathbf{X}_{ins}$ and the context-aggregated global-level feature $\mathbf{X}_{glo}$ are obtained, we concatenate these two RoI features and apply a convolution layer to obtain our hierarchical contextual RoI feature $\mathbf{X}_{hc}$:
$$\mathbf{X}_{hc} = f_{conv}([\mathbf{X}_{ins}, \mathbf{X}_{glo}]),$$
where $f_{conv}(\cdot)$ denotes the convolution operation, $[\cdot, \cdot]$ refers to concatenation, and a ReLU nonlinearity is applied following the convolution layer. As the resulting hierarchical contextual RoI feature $\mathbf{X}_{hc}$ absorbs rich context information from the context-embedded image feature $\mathbf{X}$, it is by nature complementary to the conventional RoI feature extracted from the feature pyramid network (FPN) [fpn].
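The generation process described in this section can be sketched as below. This is a simplified illustration under several assumptions, not the operator used in the experiments: the RoIAlign here takes a single bilinear sample at the center of each output bin (practical implementations average several samples per bin and handle the feature stride), boxes are given directly in feature-map coordinates, and the convolution is taken to be 1x1 because the kernel size is not stated in the text. All function names are hypothetical.

```python
def bilinear(channel, y, x):
    """Bilinearly sample a 2-D channel (list of rows) at continuous index (y, x)."""
    h, w = len(channel), len(channel[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = channel[y0][x0] * (1 - dx) + channel[y0][x1] * dx
    bottom = channel[y1][x0] * (1 - dx) + channel[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feat, box, out_size):
    """Simplified RoIAlign. feat: [D][H][W]; box: (x1, y1, x2, y2) in feature-map coordinates."""
    x1, y1, x2, y2 = box
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    pooled = []
    for channel in feat:
        rows = []
        for i in range(out_size):
            cy = y1 + (i + 0.5) * bin_h - 0.5  # bin center, shifted to pixel-index space
            rows.append([bilinear(channel, cy, x1 + (j + 0.5) * bin_w - 0.5)
                         for j in range(out_size)])
        pooled.append(rows)
    return pooled


def conv1x1_relu(x, weight, bias):
    """1x1 convolution followed by ReLU. x: [D][h][w]; weight: [D_out][D]; bias: [D_out]."""
    h, w = len(x[0]), len(x[0][0])
    return [[[max(0.0, sum(wc * x[c][i][j] for c, wc in enumerate(w_row)) + b)
              for j in range(w)] for i in range(h)]
            for w_row, b in zip(weight, bias)]


def hierarchical_context_feature(feat, box, out_size, weight, bias):
    instance_level = roi_align(feat, box, out_size)
    height, width = len(feat[0]), len(feat[0][0])
    # Global-level feature: the region of interest is the whole feature map.
    global_level = roi_align(feat, (0.0, 0.0, float(width), float(height)), out_size)
    # List concatenation stacks the two features along the channel axis.
    return conv1x1_relu(instance_level + global_level, weight, bias)


# Toy input: one channel whose value equals its column index, on a 4x4 map.
ramp = [[[float(col) for col in range(4)] for _ in range(4)]]
feature = hierarchical_context_feature(ramp, box=(0.0, 0.0, 2.0, 2.0), out_size=2,
                                       weight=[[1.0, 1.0]], bias=[0.0])
```

Since both RoI features are pooled to the same spatial size, they can be concatenated channel-wise regardless of the proposal size.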
3.4 Early-and-Late Fusion and Inference
To make full use of our hierarchical contextual RoI feature $\mathbf{X}_{hc}$, we design early-and-late fusion strategies, i.e., feature fusion and confidence fusion, which have been proven effective in many applications [gunes2005affect, ebersbach2017fusion]. We show that early-and-late fusion is also well suited to improving region-wise detectors, as it can fully absorb hierarchically embedded information from different levels.
3.4.1 Feature Fusion
To incorporate our contextual RoI features $\mathbf{X}_{hc}$ into the region-based detection pipeline, the simplest way is to fuse them with the original RoI features extracted from the feature pyramid network (FPN) [fpn] by element-wise addition. Formally, let $\mathbf{X}_{fpn}$ denote the original RoI feature extracted from FPN, and $\mathbf{X}_{ff}$ denote the fused RoI feature. Then we have:
$$\mathbf{X}_{ff} = \mathbf{X}_{fpn} + \mathbf{X}_{hc}.$$
As shown in Fig. 2, the fused feature $\mathbf{X}_{ff}$ is then fed into the 2-fc detection head to produce refined bounding boxes and classification scores.
3.4.2 Confidence Fusion
We also consider a simple confidence fusion strategy which is complementary to feature fusion. We apply the 2-fc head on our hierarchical contextual RoI feature $\mathbf{X}_{hc}$ to produce a classification confidence $\mathbf{s}_{hc}$ (logits), and then fuse it with the confidence $\mathbf{s}_{fpn}$ from the corresponding FPN RoI feature by addition. Formally, let $\mathbf{s}_{cf}$ denote the fused confidence:
$$\mathbf{s}_{cf} = \mathbf{s}_{fpn} + \mathbf{s}_{hc}.$$
The fused confidence $\mathbf{s}_{cf}$ is transformed by a softmax layer to produce a new classification score $\mathbf{p}_{cf}$.
For each proposal, the classification score $\mathbf{p}_{cf}$, paired with the refined bounding box predicted from the FPN RoI feature, forms another prediction in parallel with the prediction from the feature fusion branch. It is worth mentioning that the weights of the 2-fc head applied on different RoI features are shared.
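The two fusion strategies of Sections 3.4.1 and 3.4.2 both reduce to an addition, applied at different depths of the head. The following sketch illustrates them on flattened features and per-class logits; the function names are hypothetical and the detection head itself is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def feature_fusion(x_fpn, x_ctx):
    """Early fusion: element-wise addition of two flattened RoI features of equal length."""
    if len(x_fpn) != len(x_ctx):
        raise ValueError("RoI features must have the same shape")
    return [a + b for a, b in zip(x_fpn, x_ctx)]


def confidence_fusion(logits_fpn, logits_ctx):
    """Late fusion: add the two logit vectors, then normalize with softmax."""
    if len(logits_fpn) != len(logits_ctx):
        raise ValueError("logit vectors must have the same length")
    return softmax([a + b for a, b in zip(logits_fpn, logits_ctx)])


# The contextual branch can overturn a weak decision of the FPN branch.
scores = confidence_fusion([1.0, 0.8, 0.0], [-1.0, 2.0, 0.0])
```

Adding logits before the softmax, rather than averaging probabilities afterwards, lets a confident branch dominate an uncertain one.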
Our early-and-late fusion strategy produces two different predictions for a single object proposal. To obtain the final result, as shown by the pipeline in Fig. 2, we first collect all the boxes and confidences from the two prediction branches (i.e., feature fusion and confidence fusion), and then perform NMS over all these boxes. Furthermore, as demonstrated later in the experiments, while our two fusion strategies are complementary during training, using only one prediction branch during inference does not cause an obvious performance drop and reduces the computational cost. However, training with only one fusion strategy is inferior to training with both fusion strategies together.
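The inference step of pooling both branches and suppressing duplicates can be sketched with a standard greedy NMS. The IoU threshold of 0.5 and the per-class application are assumptions of this sketch (the text does not state them), and the function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_thr=0.5):
    """Greedy NMS for one class. detections: list of (box, score); returns kept items, best first."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(box_iou(box, kept_box) <= iou_thr for kept_box, _ in kept):
            kept.append((box, score))
    return kept


def merge_branches(ff_detections, cf_detections, iou_thr=0.5):
    """Pool the detections of the feature fusion and confidence fusion branches, then run NMS."""
    return nms(list(ff_detections) + list(cf_detections), iou_thr)


ff = [((0.0, 0.0, 10.0, 10.0), 0.9)]
cf = [((0.0, 0.0, 10.0, 10.0), 0.8), ((20.0, 20.0, 30.0, 30.0), 0.7)]
final = merge_branches(ff, cf)  # the duplicate box from the CF branch is suppressed
```

Because both branches start from the same proposals, many of their boxes nearly coincide, and NMS keeps whichever branch scored the box higher.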
3.4.4 Loss Function
The whole network is trained end-to-end, and the overall loss is computed as follows:
$$L = L_{rpn} + L_{mll} + L_{ff} + L_{cf},$$
where $L_{rpn}$ is the RPN loss, $L_{mll}$ is the multi-label loss of the image-level categorical embedding module, and $L_{ff}$ and $L_{cf}$ are the losses for the feature fusion and confidence fusion branches, respectively. All loss terms are weighted equally, with no extra hyper-parameters to trade them off, which suggests that HCE generalizes well without delicate tuning.
We conduct extensive experiments on the MS-COCO 2017 dataset [coco] to demonstrate the effectiveness and generalization ability of our hierarchical context embedding framework. MS-COCO 2017 is the most popular benchmark for general object detection, containing 80 object categories, 118K images for training, 5K images for validation (val) and 20K images for testing (test-dev). We report the standard COCO-style Average Precision (AP), averaged over IoU thresholds from 0.5 to 0.95 with an interval of 0.05, as the metric. All models are trained on the COCO training set and evaluated on the val set. For fair comparison with the state of the art, we also report results on the test-dev set.
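To make the metric concrete, the sketch below enumerates the ten IoU thresholds and averages per-threshold AP values. It assumes the per-threshold APs (already averaged over categories) are computed elsewhere, for example by the official COCO evaluation tools; the function names are hypothetical.

```python
def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 used by COCO-style AP."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def coco_ap(ap_per_threshold):
    """Average AP over the ten thresholds. ap_per_threshold maps threshold -> AP."""
    thresholds = coco_iou_thresholds()
    return sum(ap_per_threshold[t] for t in thresholds) / len(thresholds)


# Made-up per-threshold values that decrease as the IoU requirement tightens.
example = {t: 0.6 - 0.5 * (t - 0.5) for t in coco_iou_thresholds()}
mean_ap = coco_ap(example)
```

Averaging over increasingly strict thresholds rewards detectors whose boxes are both correctly classified and tightly localized.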
4.1 Implementation Details
We implement our method and re-implement all baseline methods based on the MMDetection codebase [mmdetection]. The re-implementations of the baselines strictly follow the default settings of MMDetection. Images are resized such that the short edge has 800 pixels while the long edge has less than 1333 pixels. We use no data augmentation except horizontal flipping for training. ResNet, pre-trained on ImageNet [imagenet], is used as the backbone. Models are trained with a batch size of 16 on 8 GPUs. We train all models with the SGD optimizer for 12 epochs in total, with an initial learning rate of 0.02 that is decayed by a factor of 0.1 at the 8th and 11th epochs. Weight decay and momentum are set to 0.0001 and 0.9, respectively. We also adopt a linear warm-up strategy at the beginning of training.
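The resizing rule and the learning-rate schedule described above can be sketched as follows. Several details are assumptions for illustration only: the warm-up length (500 iterations) and starting ratio (1/3) are hypothetical values not given in the text, the decay is assumed to take effect once 8 and 11 epochs have been completed, and the long-edge bound is treated as inclusive. Function names are hypothetical.

```python
def resize_scale(height, width, short_edge=800, long_edge=1333):
    """Scale factor that brings the short edge to `short_edge` unless the
    long edge would then exceed `long_edge`, in which case the long edge binds."""
    return min(short_edge / min(height, width), long_edge / max(height, width))


def learning_rate(epoch, iteration, base_lr=0.02, steps=(8, 11), gamma=0.1,
                  warmup_iters=500, warmup_ratio=1.0 / 3):
    """Step schedule with linear warm-up.

    epoch: number of completed epochs (0-based); iteration: global iteration index.
    warmup_iters and warmup_ratio are assumed values, not taken from the text.
    """
    lr = base_lr * gamma ** sum(epoch >= s for s in steps)
    if iteration < warmup_iters:
        progress = iteration / warmup_iters
        lr *= warmup_ratio + (1.0 - warmup_ratio) * progress
    return lr


scale = resize_scale(480, 640)        # short edge binds: 480 -> 800
lr_start = learning_rate(0, 0)        # warm-up starts at a fraction of 0.02
lr_late = learning_rate(11, 100000)   # after both decay steps
```

For a 480x640 image the short edge is the binding constraint, whereas for a 500x1000 image the long-edge limit determines the scale.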
4.2 Comparisons with Baselines
Table 1: Comparisons with baselines on MS-COCO 2017 val. “HCE” denotes that the models are trained and evaluated with both feature fusion and confidence fusion. Our HCE framework achieves consistent accuracy gains over all baseline detectors on all evaluation metrics.
To demonstrate the generality of our HCE framework, we consider three well-known region-based object detectors as our baseline systems, including Feature Pyramid Network (FPN) [fpn], Mask R-CNN [maskrcnn] and Cascade R-CNN [cascade]. All detectors are instantiated with two different backbones, i.e., ResNet-50 and ResNet-101 with FPN. Integrating our framework with Mask R-CNN and Cascade R-CNN is as straightforward as with FPN. For example, we apply our framework within each training stage of Cascade R-CNN.
Comparison results on MS-COCO 2017 val are shown in Table 1. Our HCE framework achieves consistent accuracy gains over all baseline detectors on all evaluation metrics. Specifically, without bells and whistles, our method improves AP by 2.1% and 1.7% for FPN with ResNet-50 and ResNet-101 backbones, respectively. For the more advanced Mask R-CNN and Cascade R-CNN, our method also brings more than 1% AP improvement with both ResNet-50 and ResNet-101 backbones, e.g., improving the AP of Mask R-CNN with ResNet-50-FPN from 37.3% to 38.8%.
Additionally, it can be observed that our improvements for the Mask R-CNN and Cascade R-CNN baselines are not as large as for FPN. We conjecture that this is because Mask R-CNN and Cascade R-CNN themselves integrate mechanisms for better feature learning, whose benefits may overlap with the gains of our method. Specifically, Mask R-CNN benefits from extra accurate instance-level mask supervision, while Cascade R-CNN enjoys IoU-specific multi-stage training to progressively refine object proposals and learn discriminative features for IoU-specific proposals. However, even in these cases, our method still obtains AP improvements over these competing baselines.
4.3 Error Analyses
In the following, we perform error analyses to further understand in what aspects our HCE framework improves the region-based object detectors. Following the settings of [yolo], we choose the top N predictions for each category during inference time. Each prediction is classified based on the type of error:
Correct: correct class and IoU > 0.5
Location Error: correct class and 0.1 < IoU < 0.5
Background Error: IoU < 0.1 for any object
Classification Error: class is wrong and IoU > 0.5
Other: class is wrong and 0.1 < IoU < 0.5
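The categorization above can be sketched as a small function. The thresholds 0.1 and 0.5 come from the list; the comparison directions at the boundaries and the precedence between categories (overlap with a ground-truth box of the predicted class is checked before overlap with boxes of other classes) are assumptions of this sketch, following the protocol of [yolo] as we understand it. The function name is hypothetical.

```python
def error_type(pred_class, overlaps, hi=0.5, lo=0.1):
    """Assign a prediction to one of the five error-analysis categories.

    overlaps: list of (gt_class, iou) pairs between the prediction and
    every ground-truth box in the image.
    """
    same = max((iou for c, iou in overlaps if c == pred_class), default=0.0)
    other = max((iou for c, iou in overlaps if c != pred_class), default=0.0)
    if same > hi:
        return "Correct"
    if max(same, other) < lo:
        return "Background Error"
    if same >= lo:
        # Correct class but loose box (boundary values are placed here by assumption).
        return "Location Error"
    if other > hi:
        return "Classification Error"
    return "Other"


print(error_type("skateboard", [("skateboard", 0.8)]))     # Correct
print(error_type("skateboard", [("tennis racket", 0.7)]))  # Classification Error
print(error_type("skateboard", []))                        # Background Error
```

Counting the categories over the top N predictions per class gives the breakdown of where a detector's mistakes come from.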
We compare the error types of the FPN baseline and our method, with ResNet-50 as the backbone, on MS-COCO 2017 val. Fig. 4 shows the results for each error type averaged across all 80 categories, as well as for “hot dog”, “snowboard” and “baseball glove”, which are highly dependent on context information. Our method effectively improves the classification ability of region-based detectors and reduces background errors to a large extent, without compromising the localization performance or increasing other types of errors. Our improvements are particularly noticeable for context-dependent object categories. For example, the (normalized) proportion of correctly recognized instances of “hot dog” increases from 44.8% to 51.2%, while the background false positive detections drop from 17.6% to 14.4%. These observations validate that our HCE framework can indeed improve the classification ability.
4.4 Ablation Studies
In this section, we conduct three series of ablation experiments to analyze the proposed method, using ResNet-50 as backbone on MS-COCO 2017 val.
4.4.1 Context Embedding Operations
We first investigate the impacts of different context embedding operations in our HCE framework. Specifically, there are three context embedding operations involved in our framework. Firstly, the image-level categorical embedding module employs multi-label learning (denoted as “MLL”) to embed global-level context to advance the learning of context-dependent categories. Then, for further improving region-based classification, both the context-embedded instance-level feature (denoted as “Instance”) and the context-aggregated global-level feature (denoted as “Global”) are combined to generate hierarchical contextual RoI feature.
Table 2 shows the performance improvements obtained by progressively integrating more context embedding operations. Solely applying MLL on the detection backbone already yields an AP improvement. This verifies that image-level categorical embedding advances the feature learning for context-dependent object categories. Then, the context-embedded instance-level feature, which can be directly utilized by the detection head, brings a further AP improvement. Finally, adding global-level context embedding to the contextual RoI feature improves AP again. These results suggest that the context embedding operations in our framework are complementary to each other.
|Method||FF Train||CF Train||AP||AP$_{50}$||AP$_{75}$||AP$_{S}$||AP$_{M}$||AP$_{L}$|
|Method||FF Test||CF Test||Speed||AP||AP$_{50}$||AP$_{75}$||AP$_{S}$||AP$_{M}$||AP$_{L}$|
4.4.2 Fusion Strategies in Training
We hypothesize that the two proposed fusion strategies, feature fusion and confidence fusion, are complementary to each other. To verify this, we evaluate the performance of models trained with feature fusion and confidence fusion individually, as well as with both of them. Table 3 shows the results of the different fusion strategies. Specifically, “FF Train” means that feature fusion (FF) is applied for training, while “CF Train” means that confidence fusion (CF) is applied for training. Training with feature fusion or confidence fusion individually already outperforms the baseline (FPN with MLL) in AP. Training with both fusion strategies achieves the best result, and is clearly better than using either fusion strategy alone.
4.4.3 Fusion Strategies in Testing
We also evaluate each fusion strategy independently during inference, with all HCE models trained with both fusion strategies. Table 4 shows the results of each fusion strategy and of the combined fusion strategies. “FF Test” denotes that the feature fusion (FF) branch is used during inference, while “CF Test” denotes that the confidence fusion (CF) branch is used. We can see that once the model is trained with both fusion strategies, using only one fusion branch for inference does not cause an obvious accuracy drop, while saving computation. For example, using only the feature fusion branch for inference adds a minimal time cost (0.003s) to the baseline, but increases the AP from 36.3% to 38.2%. These results further support the complementarity of the two proposed fusion strategies.
|Method||Backbone||AP||AP$_{50}$||AP$_{75}$||AP$_{S}$||AP$_{M}$||AP$_{L}$|
|Mask R-CNN [maskrcnn]||Res101-FPN||38.2||60.3||41.7||20.1||41.1||50.2|
|Cascade R-CNN [cascade]||Res101-FPN||42.8||62.1||46.3||23.7||45.5||55.2|
|Deformable R-FCN* [deformable]||Aligned-Inception-ResNet||37.5||58.0||40.8||19.4||40.1||52.5|
|Cascade +Rank-NMS [ranking]||Res101-FPN||43.2||61.8||47.0||24.6||46.2||55.4|
|HCE Mask R-CNN||Res101-FPN||41.6||63.9||45.4||23.7||44.7||53.1|
|HCE Cascade R-CNN||Res101-FPN||44.1||63.2||47.9||25.2||46.9||57.0|
|HCE Cascade R-CNN*||Res101-FPN||46.5||65.6||50.6||27.4||49.9||59.4|
4.5 Comparisons with State-of-the-art
We compare our proposed method with the state of the art on MS-COCO 2017 test-dev. For fair comparison, we report the performance of all methods with single-model inference. Specifically, we apply our method to FPN, Mask R-CNN and Cascade R-CNN with the 2× training schedule, without bells and whistles. Table 5 shows all comparison results.
Our hierarchical context embedding framework, when integrated with the FPN, Mask R-CNN and Cascade R-CNN object detectors, consistently outperforms state-of-the-art object detectors using the same backbone network. For a fair comparison with Deformable R-FCN* and DCNv2*, which adopt a multi-scale 3× training schedule and multi-scale testing, we follow the same experimental setting to train our HCE Cascade R-CNN*. It achieves an AP of 46.5%, which surpasses Deformable R-FCN* and DCNv2*. These results demonstrate the superior performance of the proposed context embedding framework.
In this paper, we investigated the limited use of context information in conventional region-based detectors, and proposed a novel and effective Hierarchical Context Embedding (HCE) framework to improve the classification ability of current region-based detectors. Comprehensive experiments demonstrated consistent accuracy improvements on mainstream region-based detectors, including FPN, Mask R-CNN and Cascade R-CNN. In the future, we will concentrate on extending the scope of our HCE framework and adapting it to the one-stage detection paradigm.
Acknowledgements: Z.-M. Chen’s contribution was made when he was an intern at Megvii Research Nanjing. This research was supported by the National Key Research and Development Program of China under Grant 2017YFA0700800, the National Natural Science Foundation of China under Grant 61772257, and the Fundamental Research Funds for the Central Universities under Grant 020914380080.