Low-shot Object Detection via Classification Refinement

This work addresses the problem of low-shot object detection, where only a few training samples are available for each category. Conventional fully supervised approaches usually suffer a large performance drop on rare classes where data is insufficient. Our study reveals a serious misalignment between classification confidence and localization accuracy on rarely labeled categories, and identifies the proneness of class-specific parameters to overfitting as the crucial cause of this issue. In this paper, we propose a novel low-shot classification correction network (LSCN) which can be adopted into any anchor-based detector to directly enhance the detection accuracy on data-rare categories, without sacrificing the performance on base categories. Specifically, we sample false-positive proposals from a base detector to train a separate classification correction network. During inference, the well-trained correction network removes false positives produced by the base detector. The proposed correction network is data-efficient yet highly effective, with four carefully designed components: unified recognition, a global receptive field, inter-class separation, and confidence calibration. Experiments show that our proposed method brings significant performance gains to rarely labeled categories and outperforms previous work on COCO and PASCAL VOC by a large margin.

1 Introduction

Recent deep-learning-based object detection approaches Ren et al. (2015); Cheng et al. (2018); Redmon and Farhadi (2018) have achieved remarkable performance, outperforming traditional approaches. However, they all rely on the availability of large-scale training samples, which are often time-consuming and labor-intensive to obtain. In contrast, humans can master novel visual concepts after browsing only a few examples; such data-efficient learning ability is highly desirable for many real-world applications, e.g., robotic manipulation, where unseen classes of objects are frequently encountered and acquiring enough annotated training samples is very tedious or even infeasible. One popular way to reduce the labeling burden is to use easily annotated labels, including weakly supervised approaches Bilen et al. (2015); Bilen and Vedaldi (2015); Diba et al. (2017); Li et al. (2016) that use image-level labels, and semi-supervised approaches Yang et al. (2013); Tang et al. (2016) that use a fraction of the fully annotated data. However, these settings still restrict the detector to the categories seen during training (i.e., the same task) and can hardly generalize to unseen classes (i.e., new tasks) with severely limited training samples.

In this paper, we consider a more realistic but rarely explored task termed low-shot object detection. With the aid of sufficient data from base classes, an object detector should be able to generalize to novel classes with limited training samples. Previous methods address this with regularized transfer learning Chen et al. (2018) or a feature reweighting model Yan et al. (2019). For example, Meta-RCNN Yan et al. (2019) handles low-shot detection with a unified approach combining low-shot classification and low-shot localization, which enhances the second-stage feature representation by reweighting each ROI feature with a class-attentive vector. However, this may bring two drawbacks: (1) due to the inherent objective conflict between the box regression and classification tasks, amplifying the category-correlated feature channels may lead to a more translation-invariant representation, which is preferred by classification but not by localization Dai et al. (2016); (2) fine-tuning by oversampling data-rare classes often leads to performance degradation on base classes Yan et al. (2019). Therefore, a simple yet effective solution is desirable, one that can be adopted into any well-designed, fully supervised object detector without modifying the original detector.

Figure 1: (a) IoU histogram of the VOC 2007 test set. (b) Performance gain by eliminating false positives.

Faster-RCNN Ren et al. (2015) is a commonly used baseline in many prior works. For a fair comparison, we also use it as the base detector of our framework. Faster-RCNN uses a class-agnostic region proposal network (RPN) to generate coarse Regions of Interest (ROIs). For the top-ranked ROIs, two sibling heads, the classification head and the bounding-box head, work in parallel to classify each ROI into a specific class or background and to refine it to fit the nearest ground-truth box, respectively. It is well known that the localization task in Faster-RCNN can be optimized without knowing the category of the ground-truth box Cai and Vasconcelos (2018). For example, the first-stage components, the backbone and the RPN, are both class-agnostic, since their parameters are shared over all classes. For the second-stage Fast-RCNN, a class-agnostic bounding-box regression head is widely adopted in many recent advances. In contrast, the classification head may suffer more from overfitting, as learning robust representations from limited training data is always challenging. To reveal the key factors affecting low-shot performance, we evaluate a fine-tuned Faster-RCNN (FRCN-ft) on the Pascal VOC 2007 test set and visualize the IoU distribution (Figure 1a) for both base and novel categories. Although only limited training samples are provided for the novel classes, the localization accuracy is almost unaffected, which indicates that the class-agnostic localization head is not the main reason for the performance drop. To investigate the role of the classification head, we study the potential gain of correcting the wrong classification confidence of false-positive proposals. Under the AP50 setting, given the top 300 proposals of Faster-RCNN, we gradually correct the classification scores of false-positive proposals according to a specific threshold. The result is shown in Figure 1b: in the ideal case, by eliminating all false positives, Faster-RCNN could dramatically improve its low-shot performance from 36.4% to 90% (a minimal sketch of this oracle analysis is given after this paragraph).

Given that most of the accurately localized bounding boxes cannot be classified correctly, we should directly improve the initial classification results by introducing useful refinements. The mechanism of classification refinement for object detection has been previously explored in a fully supervised manner Tan et al. (2019); Cheng et al. (2018), where a separate network is introduced to assist the base detector for better classification performance. In this paper, we further define it in a low-shot scenario: a practical classification refinement should treat each class as equally important and deliver a uniform refinement effect, regardless of the training size of each class. To achieve this goal, we propose a low-shot correction network (LSCN), which can be effectively optimized in the low-data regime to provide the necessary refinement to a base detector. We implement LSCN as a convolutional neural network that is independent of the base detector. LSCN takes false-positive proposals as training samples; eventually, it becomes proficient at correcting the weaknesses of the base detector. During inference, LSCN is used as a plugin to refine all proposals. The novelty of our framework lies in the following four aspects, which enable the proposed LSCN to learn effectively from very few data: (1) a unified representation promotes the feature representation of rare classes Gidaris and Komodakis (2018); (2) a global receptive field improves the model's expressive power by capturing long-range dependencies Wang et al. (2018); (3) inter-class separation incorporates margins into the embedding space to enhance feature discriminability; (4) confidence calibration enhances the correlation between classification confidence and localization accuracy.
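To make the oracle analysis of Figure 1b concrete, below is a minimal sketch of the procedure, written by us with assumed box formats and function names: it zeroes the confidence of every detection whose IoU with all same-class ground-truth boxes falls below the threshold, i.e., it simulates ideal false-positive elimination.

```python
import torch

def box_iou(a, b):
    # a: (N, 4), b: (M, 4), boxes as (x1, y1, x2, y2); returns (N, M) IoU matrix
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])   # pairwise intersection corners
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def suppress_false_positives(boxes, scores, labels, gt_boxes, gt_labels, iou_thr=0.5):
    """Oracle: zero the score of any detection that does not overlap a
    same-class ground-truth box by at least iou_thr (AP50 uses 0.5)."""
    iou = box_iou(boxes, gt_boxes)                    # (N, M)
    same_cls = labels[:, None] == gt_labels[None, :]  # (N, M)
    is_tp = ((iou >= iou_thr) & same_cls).any(dim=1)  # (N,)
    return torch.where(is_tp, scores, torch.zeros_like(scores))
```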

In summary, the main contributions of this paper are threefold: (1) We present a detailed analysis of the error modes in low-shot detection; our experiments reveal that the ordinary classification head in Faster-RCNN is the crucial cause of performance degradation on rarely labeled classes. (2) We propose a low-shot correction framework (LSCN) that is capable of refining proposals with only a few training samples; our approach is also compatible with other anchor-based detection frameworks. (3) We evaluate the proposed low-shot correction network on Pascal VOC and MS-COCO, achieving state-of-the-art performance on both base and novel classes compared with prior works.

2 Related works

Object Detection Modern deep-learning-based object detection methods can generally be divided into two categories: one-stage proposal-free methods and two-stage proposal-based methods. One-stage detectors usually follow a single-shot training strategy without generating region proposals, e.g., SSD Liu et al. (2016) and YOLO Redmon and Farhadi (2018); they usually excel in speed and efficiency and are thus more suitable for real-time applications. In contrast, two-stage detectors such as Faster-RCNN Ren et al. (2015) first generate Regions of Interest through a class-agnostic Region Proposal Network (RPN). The obtained ROIs are fed into the second stage for further classification and localization, which is more computationally expensive but more accurate.

Classification Refinement Due to the lack of correlation between the classification and localization tasks in Faster-RCNN, severe misalignment may exist between the final classification score and the corresponding localization accuracy. Recent works address this issue by introducing a classification refinement mechanism. For example, "Learning to rank" Tan et al. (2019) deploys a lightweight ranking head to estimate the localization accuracy directly; the obtained localization confidence is then combined with the raw classification confidence for better proposal ranking. Decoupled Classification Refinement (DCR) Cheng et al. (2018) shares similar insights with us, aiming to improve the classification power of Faster-RCNN through a separate correction network. However, there are several differences compared with our approach. First, their correction model works on ROIs instead of the final-stage boxes, which still misaligns with the final-stage localization accuracy. Second, DCR implements the correction model as a standard CNN without a global receptive field, preventing the model from capturing long-range dependencies. Third, DCR employs the standard softmax cross-entropy loss as the training objective, which cannot introduce any extra margin into the feature space for separating hard negatives. Furthermore, DCR deploys the inner-product similarity rule for computing output distributions, which is also harmful to the representation of data-rare classes. Since all existing refinement-based approaches are designed in a fully supervised manner, to the best of our knowledge, we are the first to adopt the classification refinement mechanism in a low-shot scenario.

Few-shot learning Few-shot learning aims to recognize novel objects from very few training images. Modern few-shot learning approaches can be divided into two groups: gradient-based approaches Finn et al. (2017); Li et al. (2017); Lee and Choi (2018) and metric-based approaches Xue and Wang (2020); Snell et al. (2017); Sung et al. (2018). Gradient-based approaches aim to effectively adapt a model to unseen categories through a limited number of parameter updates. For example, Model-Agnostic Meta-Learning Finn et al. (2017) encodes transferable meta-knowledge into a good initialization point, from which a small number of gradient updates suffices for good performance. Metric-based approaches usually learn a nearest-neighbor embedding function, which encodes objects of the same category into a compact cluster. For example, Prototypical Networks Snell et al. (2017) employ the mean of the support features as the class-specific prototype, and the prediction for a query image is made from its nearest prototype in Euclidean space.

Few-Shot Object Detection Most recent low-shot detection approaches are adapted from the few-shot learning paradigm. Chen et al. (2018) propose a distillation-based method with a less-forgetting constraint and background depression regularization. Kang et al. (2019) propose to emphasize class-specific feature channels by reweighting feature maps with channel-wise attention. In contrast, Meta-RCNN Yan et al. (2019) applies channel-wise attention to each ROI feature instead of the whole feature map. RepMet Karlinsky et al. (2019) replaces the original sibling head of Faster-RCNN with a nearest-neighbor embedding function, thus retaining only the first-stage bounding-box regression through the RPN. However, possibly due to the lack of accurately localized bounding boxes, the overall performance of RepMet is still not satisfactory. Instead, we address the problem from the novel perspective of classification refinement; our proposed low-shot correction network can be used as a plugin without modifying the base detector.

3 Methods

3.1 Problem Definition

We are given a dataset with bounding-box annotations that contains three parts: a base set $D_{base}$, a novel set $D_{novel}$, and a test set $D_{test}$. Following the common practice of low-shot learning, no category is shared between $D_{base}$ and $D_{novel}$; abundant training data with bounding-box annotations is available for the categories in $D_{base}$, while only $k$ bounding-box annotations are available for each category in $D_{novel}$. In this work, we define a more realistic but challenging setting for low-shot object detection: how to utilize the data-abundant base set to promote learning on the data-rare novel set without suffering catastrophic forgetting Parisi et al. (2019) on the base classes. Most existing approaches Kang et al. (2019); Yan et al. (2019) fail to achieve this, as they commonly follow a fine-tuning-after-pretraining strategy, in which the gradients from the base classes are easily overwhelmed by the oversampled novel classes. To address this issue, we propose a classification-refinement-based approach employing a low-shot correction network named LSCN, which improves the low-shot classification power of the base detector. The carefully designed LSCN not only compensates for the accuracy lost on the base classes but also significantly improves detection performance on the novel classes.

3.2 Low shot classification refinement framework

Figure 2: The low-shot classification refinement framework

We propose a simple yet effective framework to improve the low-shot classification performance of a base detector. It consists of (1) a base detector, in this work Faster-RCNN, and (2) a low-shot correction network (Figure 3). The base detector is first pre-trained on the large-scale base set and then fine-tuned with the oversampled novel set. Given an input image, Faster-RCNN produces object proposals through its sibling heads; false-positive boxes are then selected and cropped from the original input space to train the correction network. During inference, the original classification confidence from the base detector is fused with the correction network's prediction through element-wise multiplication. The classification refinement framework is illustrated in Figure 2.
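As an illustration only, the following is a minimal sketch of this inference-time fusion, assuming both networks output per-class probability vectors for the same set of final-stage boxes (the function name is ours):

```python
import torch

def fuse_scores(detector_probs: torch.Tensor, correction_probs: torch.Tensor) -> torch.Tensor:
    """Fuse the base detector's classification confidence with the correction
    network's prediction by element-wise multiplication.

    Both inputs have shape (num_boxes, num_classes); a false positive that the
    correction network scores near zero is suppressed in the fused confidence.
    """
    return detector_probs * correction_probs
```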

Given the poor performance of the base detector on novel classes, the main task of this work is to improve its low-shot classification power: given very few training samples, how do we design a low-shot correction network that performs effective classification refinement for novel classes? To address this, we build the correction network with four essential components that make it suitable for a low-shot scenario: (1) a unified representation, which promotes the feature representation of rare classes; (2) a global receptive field, which improves the model's expressive power by capturing long-range dependencies; (3) inter-class separation, which incorporates extra margins into the model's feature space to further enhance inter-class separability; (4) confidence calibration, which enhances the correlation between the fused score and the localization accuracy. We introduce these components in turn below.

Figure 3: The LSCN network

Unified representation The inner-product similarity rule is widely adopted for computing classification distributions in conventional image recognition tasks. However, it is not suitable in a low-shot scenario, as it decays the magnitudes of both the weights and the features of rare classes Hou et al. (2019); Guo and Zhang (2017), leading to predictions biased towards data-redundant classes. To encourage unified recognition over all classes, we introduce the cosine similarity metric into our correction network. Specifically, given a standard CNN feature extractor $f(\cdot)$, a zero-bias fully-connected layer is built on top of $f$ as the classifier. Given a training sample $x$ sampled from the base detector on $D_{base}$, the correction model's classification probability for class $i$ is computed as

$$p(y = i \mid x) = \frac{\exp\big(\tau\, \hat{w}_i^{\top} \hat{f}(x)\big)}{\sum_{j} \exp\big(\tau\, \hat{w}_j^{\top} \hat{f}(x)\big)}, \quad (1)$$

where $\hat{w}_i$ and $\hat{f}(x)$ are the $\ell_2$-normalized classifier weights and features, and $\tau$ is a scaling factor used to ensure training convergence. The model is optimized with the standard softmax cross-entropy loss through the following objective over a batch of $N$ samples:

$$\mathcal{L}_{cls} = -\frac{1}{N}\sum_{n=1}^{N} \log p(y = y_n \mid x_n). \quad (2)$$

Apart from enabling unbiased recognition, cosine similarity also encourages the correction model to learn a more discriminative feature space, with better intra-class similarity and inter-class separability Gidaris and Komodakis (2018). Moreover, the feature center of each class becomes directly comparable with its corresponding classification weights Qi et al. (2018). Hence, after training on $D_{base}$, we directly infer the weights for each novel category $c$ by averaging the normalized features extracted from its training samples $D_c$:

$$w_c = \frac{1}{|D_c|}\sum_{x \in D_c} \hat{f}(x). \quad (3)$$

Note that there is one additional background class in our task, which plays a crucial role in eliminating hard negatives. Since the background class is shared between the base and novel sets, we infer its weights by sampling background proposals from both sets.
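Below is a minimal PyTorch sketch of the cosine classifier of Eq. (1) and the weight imprinting of Eq. (3); the module name, the initialization, and the value of the scaling factor are our assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Zero-bias classifier scored by scaled cosine similarity (Eq. 1)."""
    def __init__(self, feat_dim, num_classes, tau=20.0):  # tau: scaling factor (assumed value)
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)
        self.tau = tau

    def forward(self, feats):
        w = F.normalize(self.weight, dim=1)  # l2-normalized class weights
        f = F.normalize(feats, dim=1)        # l2-normalized features
        return self.tau * f @ w.t()          # logits; softmax of these gives Eq. (1)

    @torch.no_grad()
    def imprint(self, class_idx, support_feats):
        """Eq. (3): set a novel class's weights to the mean of its
        l2-normalized support features (renormalized for consistency)."""
        mean_feat = F.normalize(support_feats, dim=1).mean(dim=0)
        self.weight[class_idx] = F.normalize(mean_feat, dim=0)
```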


Global receptive field Inside each region proposal, objects may appear over a wide range of scales or at arbitrary positions. However, the effective receptive field of traditional CNNs is usually small and spatially biased towards the central region; as a result, objects located at the outer area of the receptive field are more likely to be ignored Luo et al. (2016). Hence, a good correction network is required to have a sufficiently large receptive field that can capture information from the whole image. An intuitive way to enlarge the receptive field is to stack more convolutional layers Luo et al. (2016); however, deep networks usually suffer from high computation cost and the risk of overfitting. Recently, the Non-Local (NL) block was proposed to achieve a global receptive field by capturing the pairwise relationship between any two positions Wang et al. (2018). The Compact Generalized Non-local (CGNL) operation Yue et al. (2018) extends NL by taking cross-channel correlations into consideration. In our work, we insert a CGNL module into our correction network to handle the complex object appearance in region proposals.

We briefly review the Compact Generalized Non-local operation as follows. CGNL is a self-attention module which aims at capturing long-range dependencies along both the spatial and channel dimensions. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the channel number, we first flatten it into a 1-D tensor $x$ with $N = CHW$ total dimensions, then compute the response $y$ by a weighted sum over all elements at all positions:

$$y = f\big(\theta(x), \phi(x)\big)\, g(x), \quad (4)$$
$$\theta(x) = W_\theta x, \quad \phi(x) = W_\phi x, \quad g(x) = W_g x, \quad (5)$$

where $\theta$, $\phi$, and $g$ are trainable linear transformations which can be implemented as $1 \times 1 \times 1$ convolutions. The pairwise function $f$ computes the inner-product similarity for all positions, i.e., $f\big(\theta(x), \phi(x)\big) = \theta(x)\,\phi(x)^{\top}$. After obtaining the response $y$, we reshape it into $\mathbb{R}^{C \times H \times W}$ to fit the size of the input $X$. The final output is computed as $Z = W_z y + X$, where $W_z$ is another trainable $1 \times 1 \times 1$ convolution.
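To make the operation concrete, below is a simplified sketch of the generalized non-local computation of Eqs. (4)-(5). It deliberately omits the Taylor-expansion trick that makes the official CGNL compact, so it materializes the full (CHW)x(CHW) attention matrix and is only practical for small feature maps.

```python
import torch
import torch.nn as nn

class GeneralizedNonLocal(nn.Module):
    """Self-attention over all C*H*W positions jointly (space and channels),
    following Eqs. (4)-(5); a sketch, not the compact official CGNL."""
    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels, kernel_size=1)
        self.phi = nn.Conv2d(channels, channels, kernel_size=1)
        self.g = nn.Conv2d(channels, channels, kernel_size=1)
        self.z = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        t = self.theta(x).reshape(b, -1, 1)   # theta(x): (B, CHW, 1)
        p = self.phi(x).reshape(b, 1, -1)     # phi(x):   (B, 1, CHW)
        gx = self.g(x).reshape(b, -1, 1)      # g(x):     (B, CHW, 1)
        attn = torch.bmm(t, p) / t.shape[1]   # f = theta(x) phi(x)^T, normalized by N
        y = torch.bmm(attn, gx).reshape(b, c, h, w)  # response, reshaped to input size
        return self.z(y) + x                  # Z = W_z y + X (residual connection)
```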

Inter-class separation Anchor-based object detectors are often confused by hard negatives that share similar image appearance with foregrounds Cheng et al. (2018). To overcome this, the correction network should focus more on learning to distinguish hard negatives. However, merely oversampling hard negatives during training is not sufficient: the efficacy of oversampling relies heavily on the availability of redundant training samples and is thus not feasible in the low-data regime, and training with the standard softmax cross-entropy loss cannot introduce any extra margin to separate the classes Khan et al. (2019). To further enhance the inter-class discrepancy, we combine a margin-based ranking loss with the previously mentioned cosine similarity metric.

Due to the discriminative parts shared between some hard negative backgrounds and foregrounds, the features of hard negatives often lie close to the foreground embeddings in the feature space. To overcome this, we propose a background-suppression regularization for training LSCN, which pushes the decision boundary away from hard negatives by an extra margin $m$. Specifically, given a batch of region proposals cropped from the original image, we use the normalized feature $\hat{f}_n$ of each proposal as an anchor. If the proposal belongs to a specific foreground class, we employ the normalized classification weights of its corresponding class as the positive template $\hat{w}_{+}$ and the normalized weights of the background class as the negative template $\hat{w}_{-}$. Conversely, if the proposal belongs to the background, we take the background class as positive and choose as the negative template the foreground class that yields the highest response to $\hat{f}_n$:

$$\mathcal{L}_{bs} = \frac{1}{N}\sum_{n=1}^{N} \max\big(0,\; m - \hat{w}_{+}^{\top}\hat{f}_n + \hat{w}_{-}^{\top}\hat{f}_n\big). \quad (6)$$

Compared with the fully exploited base classes, the feature representations of the novel classes are often less discriminative and thus suffer from high intra-class variance Khan et al. (2019), which leads to more ambiguity between base and novel classes. To overcome this, we expand the decision regions of the novel classes so that they can be well separated from the base classes. Specifically, we optimize the margins for the underrepresented novel classes through the following inter-class separation regularization:

$$\mathcal{L}_{is} = \frac{1}{N_{fg}}\sum_{n=1}^{N_{fg}} \max\big(0,\; m - \hat{w}_{+}^{\top}\hat{f}_n + \hat{w}_{-}^{\top}\hat{f}_n\big), \quad (7)$$

where only the foreground proposals are used as training samples. For proposals belonging to a novel class, we use the normalized weights of the ground-truth class as the positive template, and the negative template is selected as the base class that yields the highest response to $\hat{f}_n$. Combining all the losses mentioned above, our integrated objective contains three parts:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{bs} + \lambda_2 \mathcal{L}_{is}. \quad (8)$$
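A sketch of the background-suppression term of Eq. (6) under the cosine classifier above is given below; the margin value, the background-class index, and the function name are our assumptions. The inter-class separation term of Eq. (7) follows the same hinge pattern, using the hardest base class as the negative template for novel-class anchors.

```python
import torch
import torch.nn.functional as F

def background_suppression_loss(feats, weights, labels, bg_idx=0, margin=0.2):
    """Eq. (6): hinge on the cosine-similarity gap between the positive and
    negative templates. Foreground anchors take the background class as the
    negative; background anchors take the highest-responding foreground class."""
    f = F.normalize(feats, dim=1)
    w = F.normalize(weights, dim=1)
    sims = f @ w.t()                                  # (N, num_classes) cosine responses
    pos = sims.gather(1, labels[:, None]).squeeze(1)  # response to the positive template
    is_fg = labels != bg_idx
    fg_sims = sims.clone()
    fg_sims[:, bg_idx] = float('-inf')                # exclude background when searching negatives
    neg = torch.where(is_fg, sims[:, bg_idx], fg_sims.max(dim=1).values)
    return F.relu(margin - pos + neg).mean()
```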

Confidence calibration As observed, many accurately localized foreground boxes of novel categories are misclassified as background with high confidence. One possible reason is that classification is performed on the first-stage ROIs instead of the final-stage detected boxes, which have already been adjusted to a new size and position by the regression branch Zhu et al. (2019). Hence, judging false positives from ROIs is inaccurate: it may cause foreground proposals regressed from negative ROIs to be falsely suppressed, resulting in a lower recall rate. To overcome this, LSCN directly takes the final-stage bounding boxes as inputs during both training and inference.

3.3 Training strategy

As no parameters are shared between LSCN and Faster-RCNN, we split the whole training process into two separate phases. In the first phase, Faster-RCNN is used as the base detector and pre-trained on the base set for good feature representation. The same detector is then fine-tuned on the union of the base and novel sets, where images from the novel set are over-sampled. We denote the obtained detector as FRCN-ft. In the second stage, we implement the LSCN model with an ImageNet pre-trained ResNet50 and a CGNL block. Given a mini-batch of images, for each image fed into FRCN-ft we reserve only its top 300 candidate boxes and divide them into three groups according to their IoU with the ground truth: foregrounds, false positives, and backgrounds. We then sample a fixed number of boxes uniformly from these three groups. Finally, an ROI-Align layer is used to crop the selected boxes from the original image and reshape them to a fixed resolution. Training of LSCN follows the same two-phase strategy as FRCN-ft. Besides, at the end of the first phase, the classifier weights of the novel classes are inferred as the average of the normalized features extracted on the novel set.
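The grouping and uniform sampling described above might be sketched as follows, reusing the box_iou helper from the earlier sketch; the IoU and score thresholds are illustrative assumptions, as the text does not preserve the exact values.

```python
import torch

def group_proposals(boxes, scores, gt_boxes, fg_thr=0.5, bg_thr=0.1, score_thr=0.3):
    """Split the detector's top boxes into foregrounds, false positives and
    backgrounds by their best IoU with ground truth (thresholds assumed)."""
    best_iou = box_iou(boxes, gt_boxes).max(dim=1).values
    fg = best_iou >= fg_thr                 # well-localized boxes
    fp = (~fg) & (scores >= score_thr)      # confidently scored but poorly localized
    bg = (~fg) & (~fp) & (best_iou < bg_thr)
    return fg, fp, bg

def sample_uniform(groups, total):
    """Draw roughly total/len(groups) random indices from each group mask."""
    per = total // len(groups)
    picks = []
    for g in groups:
        idx = g.nonzero().flatten()
        picks.append(idx[torch.randperm(idx.numel())[:per]])
    return torch.cat(picks)
```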

Meanwhile, in order to expand the training set, instead of taking only the fixed proposals from the base detector, we inject controllable Gaussian noise into each final-stage proposal. Given an output bounding box $b = (x, y, w, h)$ with width $w$ and height $h$, a generated new box $b' = (x', y', w, h)$ is determined by

$$x' = x + \Delta x, \quad y' = y + \Delta y, \quad \Delta x \sim \mathcal{N}(0, \sigma_x^2), \quad \Delta y \sim \mathcal{N}(0, \sigma_y^2), \quad (9)$$
$$\sigma_x = \alpha w, \quad \sigma_y = \alpha h, \quad (10)$$

where the standard deviations of the zero-mean normal distributions are determined by the size of the input box and a scaling factor $\alpha$.
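A sketch of this jitter under Eqs. (9)-(10), with an assumed value for the scaling factor alpha:

```python
import torch

def jitter_box(box, alpha=0.1):
    """Eqs. (9)-(10): shift an (x, y, w, h) box by zero-mean Gaussian noise
    whose standard deviation is proportional to the box size (alpha assumed)."""
    x, y, w, h = box.unbind(-1)
    x = x + torch.randn_like(x) * alpha * w  # sigma_x = alpha * w
    y = y + torch.randn_like(y) * alpha * h  # sigma_y = alpha * h
    return torch.stack([x, y, w, h], dim=-1)
```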

4 Experiments

In this section, we train and evaluate our model on the Pascal VOC and MS COCO datasets. Compared with the other baseline methods, our model brings significant gains to both base and novel categories, regardless of their training size. Our implementation is based on PyTorch with 2 Titan RTX 24GB GPUs; code to reproduce the results will be made publicly available.

4.1 Dataset settings

For the Pascal VOC dataset Everingham et al. (2009), following common practice, our model is trained on the union of the 07 and 12 train/validation sets and is evaluated on the 07 test set. Following the splitting rule of Meta-RCNN Yan et al. (2019), we consider three different splits of base and novel classes: (bird, bus, cow, bike, sofa / the others), (aero, bottle, cow, horse, sofa / the others), and (boat, cat, bike, sheep, sofa / the others). During training, the model can only access k object instances from each novel class, where we set k = 1, 2, 3, 5, 10 for Pascal VOC. For the MS COCO dataset Lin et al. (2014), we train our model on the union of the 80k train set and the 35k trainval subset, and evaluate on the 5k minival set, with k = 10, 30. Of the 80 categories in the COCO dataset, the 20 categories included in Pascal VOC are used as novel classes, and the rest are used as base classes.

4.2 Baselines

In this work, our approach is compared with five other baselines: FRCN-joint, FRCN-ft, FRCN-ft+DCR Cheng et al. (2018), YOLO Low-Shot Kang et al. (2019), and Meta-RCNN Yan et al. (2019). We build the first three baselines according to the original implementation of Faster-RCNN Ren et al. (2015), with an ImageNet pre-trained ResNet50 He et al. (2016) as the backbone of the base detector. Specifically, FRCN-joint is jointly trained on both base and novel classes. FRCN-ft is first trained on the abundantly labeled base classes and then fine-tuned on the rarely labeled novel classes until convergence. For FRCN-ft+DCR, following the original implementation of DCR, the same ResNet50 is used as the decoupled classification network, which is first trained with false positives sampled from FRCN-ft on the base set and then fine-tuned with false positives sampled from the novel set, using the softmax cross-entropy loss. For a fair comparison, the identical k novel-class objects are used in training all these baselines. For YOLO Low-Shot and Meta-RCNN, results are taken from their original papers.

4.3 Results

Methods | base split 1 | base split 2 | base split 3
FRCN-base | 70.3 | 71.8 | 70.4
FRCN-ft | 67.2 | 67.6 | 66.8
Meta-RCNN Yan et al. (2019) | 67.9 | - | -
FRCN-ft+LSCN (ours) | 73.1 | 73.6 | 72.7
Table 1: Evaluation on Pascal VOC base set
Method | Novel split 1 (1 / 2 / 3 / 5 / 10 shots) | Novel split 2 (1 / 2 / 3 / 5 / 10 shots) | Novel split 3 (1 / 2 / 3 / 5 / 10 shots)
YOLO Low-Shot Kang et al. (2019) | 14.8 / 15.5 / 26.7 / 33.9 / 47.2 | 15.7 / 15.3 / 22.7 / 30.1 / 39.2 | 19.2 / 21.7 / 25.7 / 40.6 / 41.3
Meta-RCNN Yan et al. (2019) | 19.9 / 25.5 / 35.0 / 45.7 / 51.5 | 10.4 / 19.4 / 29.6 / 34.8 / 45.4 | 14.3 / 18.2 / 27.5 / 41.2 / 48.1
FRCN-joint | 3.70 / 5.60 / 9.50 / 14.3 / 26.1 | 1.80 / 3.10 / 7.50 / 10.9 / 13.6 | 3.40 / 6.70 / 7.30 / 8.90 / 9.10
FRCN-ft | 17.1 / 29.9 / 31.6 / 42.1 / 49.3 | 9.00 / 17.6 / 26.3 / 31.5 / 39.6 | 12.6 / 15.7 / 20.6 / 33.7 / 46.9
FRCN-ft+DCR | 16.3 / 28.7 / 32.0 / 41.9 / 47.3 | 7.90 / 16.8 / 26.5 / 32.1 / 39.1 | 12.4 / 16.0 / 18.5 / 31.9 / 46.5
FRCN-ft+LSCN (ours) | 30.7 / 43.1 / 43.7 / 53.4 / 59.1 | 22.3 / 25.7 / 34.8 / 41.6 / 50.3 | 21.9 / 23.4 / 30.7 / 43.1 / 55.6
Table 2: Evaluation on Pascal VOC novel set

Pascal VOC A desirable low-shot detector should perform well on both base and novel classes. In our experiments, we first compare our method with two baselines on the base classes. Specifically, FRCN-base denotes the Faster-RCNN model trained only on base categories. The other two baselines, FRCN-ft and Meta-RCNN, both follow a two-phase learning strategy in which the model is first trained on base categories and then fine-tuned on novel categories until convergence. For a fair comparison, all models are trained for the same number of iterations. To evaluate our approach, we train the LSCN model using the second baseline, FRCN-ft, as the base detector. During inference, the prediction of the well-trained LSCN is fused with that of FRCN-ft for evaluation. The results are shown in Table 1. Compared with FRCN-base, both FRCN-ft and Meta-RCNN suffer a serious performance drop on the base classes. In contrast, by simply employing our proposed correction network LSCN, the refined FRCN-ft easily outperforms FRCN-base by a large margin, which indicates that our approach can effectively overcome catastrophic forgetting on the base classes.

We then present the evaluation results on the novel categories. Experiments are conducted under the k-shot setting with three different splits, where k = 1, 2, 3, 5, 10. First, by combining LSCN with the baseline FRCN-ft, our approach outperforms all previous approaches (Meta-RCNN, YOLO Low-Shot) by a significant margin. Second, compared with Meta-RCNN, which suffers a significant performance drop in the extremely low-shot cases, our model brings consistent improvements (8-13 points) across different splits and shots, regardless of the training size, which demonstrates the necessity of deploying the classification-refinement mechanism in low-shot detection. It is worth noting that the prior work DCR also employs a similar classification refinement strategy but fails to generalize to low-shot categories; this is because a model optimized with the typical inner-product similarity is highly biased towards data-redundant categories. In contrast, LSCN utilizes cosine similarity and thus achieves a more unified prediction over all classes.


Shots | Method | AP | AP50 | AP75
10 | YOLO Low-Shot Kang et al. (2019) | 5.60 | 12.3 | 4.60
10 | Meta-RCNN Yan et al. (2019) | 8.70 | 19.1 | 6.60
10 | FRCN-ft | 9.16 | 20.7 | 7.39
10 | FRCN-ft+LSCN (ours) | 12.4 | 26.3 | 7.57
30 | YOLO Low-Shot Kang et al. (2019) | 9.10 | 19.0 | 7.60
30 | Meta-RCNN Yan et al. (2019) | 12.4 | 25.3 | 10.8
30 | FRCN-ft | 12.0 | 27.5 | 9.31
30 | FRCN-ft+LSCN (ours) | 13.9 | 30.9 | 9.96
Table 3: Evaluation on MS COCO novel set

MS COCO We further conduct experiments on the MS-COCO dataset under both the 10-shot and 30-shot setups. To evaluate our proposed LSCN, we combine it with the weak baseline FRCN-ft. The evaluation results on the novel categories are presented in Table 3. Our approach not only brings significant performance gains to the original FRCN-ft (7 points under AP50) but also outperforms the Meta-RCNN and YOLO Low-Shot baselines by a large margin (8-10 points under AP50), which indicates that our approach is more suitable for complex scenarios than previous works.

5 Conclusion

This work explores the challenging low-shot detection task from the novel perspective of classification refinement. A low-shot correction network (LSCN) is proposed to assist a base detector in better detecting data-rare categories. Furthermore, LSCN overcomes a common issue of previous approaches, which suffer serious catastrophic forgetting on the base classes. Evaluation on benchmark datasets clearly shows the effectiveness of the proposed method. In the future, we will explore how to improve its running speed for better real-time performance.

References

  • H. Bilen, M. Pedersoli, and T. Tuytelaars (2015) Weakly supervised object detection with convex clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • H. Bilen and A. Vedaldi (2015) Weakly supervised deep detection networks.. CoRR. Cited by: §1.
  • Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: §1.
  • H. Chen, Y. Wang, G. Wang, and Y. Qiao (2018) LSTD: a low-shot transfer detector for object detection. In AAAI, Cited by: §1, §2.
  • B. Cheng, Y. Wei, H. Shi, R. Feris, J. Xiong, and T. Huang (2018) Revisiting rcnn: on awakening the classification power of faster rcnn. In The European Conference on Computer Vision (ECCV), Cited by: §1, §1, §2, §3.2, §4.2.
  • J. Dai, Y. Li, K. He, and J. Sun (2016) R-fcn: object detection via region-based fully convolutional networks.. In NIPS, Cited by: §1.
  • A. Diba, V. Sharma, A. M. Pazandeh, H. Pirsiavash, and L. V. Gool (2017) Weakly supervised cascaded convolutional networks.. In CVPR, pp. 5131–5139. Cited by: §1.
  • M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman (2009) The pascal visual object classes (voc) challenge. International Journal of Computer Vision. Cited by: §4.1.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Cited by: §2.
  • S. Gidaris and N. Komodakis (2018) Dynamic few-shot visual learning without forgetting. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: §1, §3.2.
  • Y. Guo and L. Zhang (2017) One-shot face recognition by promoting underrepresented classes. ArXiv abs/1707.05574. Cited by: §3.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.2.
  • S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin (2019) Learning a unified classifier incrementally via rebalancing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.2.
  • B. Kang, Z. Liu, X. Wang, F. Yu, J. Feng, and T. Darrell (2019) Few-shot object detection via feature reweighting.. In ICCV, Cited by: §2, §3.1, §4.2, Table 2, Table 3.
  • L. Karlinsky, J. Shtok, S. Harary, E. Schwartz, A. Aides, R. S. Feris, R. Giryes, and A. M. Bronstein (2019) RepMet: representative-based metric learning for classification and few-shot object detection.. In CVPR, Cited by: §2.
  • S. H. Khan, M. Hayat, W. Zamir, J. Shen, and L. Shao (2019) Striking the right balance with uncertainty. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §3.2, §3.2.
  • Y. Lee and S. Choi (2018) Meta-learning with adaptive layerwise metric and subspace.. CoRR. Cited by: §2.
  • D. Li, J. Huang, Y. Li, S. Wang, and M. Yang (2016) Weakly supervised object localization with progressive domain adaptation.. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • Z. Li, F. Zhou, F. Chen, and H. Li (2017) Meta-sgd: learning to learn quickly for few shot learning. Cited by: §2.
  • T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. ArXiv. Cited by: §4.1.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector.. In ECCV, Cited by: §2.
  • W. Luo, Y. Li, R. Urtasun, and R. S. Zemel (2016) Understanding the effective receptive field in deep convolutional neural networks. In NIPS, Cited by: §3.2.
  • G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks. Cited by: §3.1.
  • H. Qi, M. Brown, and D. G. Lowe (2018) Low-shot learning with imprinted weights. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5822–5830. Cited by: §3.2.
  • J. Redmon and A. Farhadi (2018) YOLOv3: An Incremental Improvement. arXiv.org. Cited by: §1, §2.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, Cited by: §1, §1, §2, §4.2.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30, Cited by: §2.
  • F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vol. , pp. 1199–1208. Cited by: §2.
  • Z. Tan, X. Nie, Q. Qian, N. Li, and H. Li (2019) Learning to rank proposals for object detection. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Cited by: §1, §2.
  • Y. Tang, J. Wang, B. Gao, E. Dellandréa, R. J. Gaizauskas, and L. Chen (2016) Large scale semi-supervised object detection using visual and semantic knowledge transfer.. In CVPR, Cited by: §1.
  • X. Wang, R. B. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: §1, §3.2.
  • W. Xue and W. Wang (2020) One-shot image classification by learning to restore prototypes. National Conference on Artificial Intelligence. Cited by: §2.
  • X. Yan, Z. Chen, A. Xu, X. Wang, X. Liang, and L. Lin (2019) Meta r-cnn : towards general solver for instance-level low-shot learning.. CoRR. Cited by: §1, §2, §3.1, §4.1, §4.2, Table 1, Table 2, Table 3.
  • Y. Yang, G. Shu, and M. Shah (2013) Semi-supervised learning of feature hierarchies for object detection in a video.. In CVPR, Cited by: §1.
  • K. Yue, M. Sun, Y. Yuan, F. Zhou, E. Ding, and F. Xu (2018) Compact generalized non-local network. In NeurIPS, Cited by: §3.2.
  • L. Zhu, Z. Xie, L. Liu, B. Tao, and W. Tao (2019) IoU-uniform r-cnn: breaking through the limitations of rpn. ArXiv. Cited by: §3.2.