FreeAnchor: Learning to Match Anchors for Visual Object Detection

09/05/2019 ∙ by Xiaosong Zhang, et al. ∙ 0

Modern CNN-based object detectors assign anchors to ground-truth objects under the restriction of object-anchor Intersection-over-Union (IoU). In this study, we propose a learning-to-match approach that breaks the IoU restriction, allowing objects to match anchors in a flexible manner. Our approach, referred to as FreeAnchor, updates hand-crafted anchor assignment to "free" anchor matching by formulating detector training as a maximum likelihood estimation (MLE) procedure. FreeAnchor aims to learn features that best explain a class of objects in terms of both classification and localization. FreeAnchor is implemented by optimizing a detection customized likelihood and can be fused with CNN-based detectors in a plug-and-play manner. Experiments on MS-COCO demonstrate that FreeAnchor consistently outperforms its counterparts by significant margins.

Code Repositories

FreeAnchor

FreeAnchor: Learning to Match Anchors for Visual Object Detection (NeurIPS 2019)



1 Introduction

Over the past few years we have witnessed the success of convolutional neural networks (CNNs) for visual object detection RCNN14 ; FastRCNN15 ; FasterRCNN15 ; YOLO16 ; FocalLoss17 ; FPN17 ; SSD16 ; CenterNet2019 . To represent objects with various appearances, aspect ratios, and spatial layouts using limited convolutional features, most CNN-based detectors leverage anchor boxes at multiple scales and aspect ratios as reference points for object localization RCNN14 ; FastRCNN15 ; FasterRCNN15 ; YOLO16 ; FocalLoss17 ; FPN17 ; SSD16 . By assigning each object to a single or multiple anchors, features can be determined and two fundamental procedures, classification and localization (i.e., bounding box regression), are carried out.

Anchor-based detectors leverage spatial alignment, i.e., Intersection over Union (IoU) between objects and anchors, as the sole criterion for anchor assignment. Each assigned anchor independently supervises network learning for object prediction, based upon the intuition that the anchors aligned with object bounding boxes are most appropriate for object classification and localization. In what follows, however, we argue that such intuition is implausible and the hand-crafted IoU criterion is not the best choice.

On the one hand, for objects with acentric features, e.g., slender objects, the most representative features are not close to object centers. A spatially aligned anchor might correspond to fewer representative features, which degrades classification and localization capabilities. On the other hand, it is infeasible to match objects with proper anchors/features using the IoU criterion when multiple objects come together. These issues arise from pre-defining single anchors for specific objects, which then independently supervise network learning for object predictions. The remaining problem is how to flexibly match anchors/features with objects, which is the focus of this study.

We propose a learning-to-match approach for object detection, which aims to discard hand-crafted anchor assignment while optimizing the learning procedure of visual object detection from three specific aspects. First, to achieve a high recall rate, the detector is required to guarantee that for each object at least one anchor's prediction is close to the ground-truth. Second, in order to achieve high detection precision, the detector needs to classify anchors with poor localization (large bounding box regression error) into background. Third, the predictions of anchors should be compatible with the non-maximum suppression (NMS) procedure, i.e., the higher the classification score is, the more accurate the localization is. Otherwise, an anchor with accurate localization but low classification score could be suppressed when using the NMS process.

To fulfill these objectives, we formulate object-anchor matching as a maximum likelihood estimation (MLE) procedure MIL97 ; MLE1922 , which selects the most representative anchor from a "bag" of anchors for each object. We define the likelihood probability of each anchor bag as the largest anchor confidence within it. Maximizing the likelihood probability guarantees that there exists at least one anchor which has high confidence for both object classification and localization. Meanwhile, most anchors, which have large classification or localization error, are classified as background. During training, the likelihood probability is converted into a loss function, which then drives CNN-based detector training and object-anchor matching.

The contributions of this work are summarized as follows:

  • We formulate detector training as an MLE procedure and update hand-crafted anchor assignment to free anchor matching. The proposed approach breaks the IoU restriction, allowing objects to flexibly select anchors under the principle of maximum likelihood.

  • We define a detection customized likelihood, and implement joint optimization of object classification and localization in an end-to-end mechanism. Maximizing the likelihood drives network learning to match optimal anchors and guarantees compatibility with the NMS procedure.

2 Related Work

Object detection requires generating a set of bounding boxes along with their classification labels associated with objects in an image. However, it is not trivial for a CNN-based detector to directly predict an orderless set of arbitrary cardinality. One widely-used workaround is to introduce anchors, which employ a divide-and-conquer process to match objects with features. This approach has been successfully demonstrated in SSD SSD16 , DSSD DSSD2017 , YOLO YOLO9000 , RetinaNet FocalLoss17 , Faster R-CNN FasterRCNN2016 and FPN FPN17 . In these detectors, dense anchors need to be configured over convolutional feature maps so that features extracted from anchors can match object windows and the bounding box regression can be well initialized. Anchors are then assigned to objects or backgrounds by thresholding their IoUs with ground-truth bounding boxes RCNN14 .

Although effective, these approaches are restricted by the heuristic that spatially aligned anchors are suitable for both object classification and localization. For objects with acentric features, however, the detector could miss the best anchors and features.

To break this limitation imposed by pre-assigned anchors, recent anchor-free approaches employ center-ness bounding box regression tian2019fcos , pixel-level supervision East2017 , and anchor-free IoU loss UnitBox2016 . CornerNet CornerNet2018 and CenterNet CenterNet2019 replace bounding box supervision with key-point supervision. The MetaAnchor MetaAnchor2018 approach learns to produce anchors from the arbitrary customized prior boxes with a sub-network. GuidedAnchoring GuidedAnchoring leverages semantic features to guide the prediction of anchors while replacing dense anchors with predicted anchors.

Existing approaches have taken a step towards learnable anchor customization. The recent IoU-Net IoU-Net18 incorporates IoU-guided NMS, which helps eliminate the suppression failure caused by misleading classification confidences. Nevertheless, to the best of our knowledge, a systematic approach to model the correspondence between anchors and objects during detector training is still lacking, which inhibits the optimization of feature selection and feature learning.

Figure 1: Comparison of hand-crafted anchor assignment (top) and FreeAnchor (bottom). FreeAnchor allows each object to flexibly match the best anchor from a "bag" of anchors during detector training.

3 The Proposed Approach

To model the correspondence between objects and anchors, we propose to formulate detector training as an MLE procedure. We then define the detection customized likelihood, which simultaneously facilitates object classification and localization. During detector training, we convert the detection customized likelihood into a detection customized loss and jointly optimize object classification, object localization, and object-anchor matching in an end-to-end mechanism.

3.1 Detector Training as Maximum Likelihood Estimation

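Let's begin with a CNN-based one-stage detector FocalLoss17 . Given an input image $I$, the ground-truth annotations are denoted as $B$, where a ground-truth box $b_i \in B$ is made up of a class label $b_i^{cls}$ and a location $b_i^{loc}$. During the forward propagation procedure of the network, each anchor $a_j \in A$ obtains a class prediction $a_j^{cls} \in \mathbb{R}^k$ after the Sigmoid activation, and a location prediction $a_j^{loc} = \{x, y, w, h\}$ after the bounding box regression. $k$ denotes the number of object classes.

During training, a hand-crafted criterion based on IoU is used to assign anchors to objects, Fig. 1, and a matrix $C_{ij} \in \{0, 1\}$ is defined to indicate whether object $b_i$ matches anchor $a_j$ or not. When the IoU of $b_i$ and $a_j$ is greater than a threshold, $b_i$ matches $a_j$ and $C_{ij} = 1$. Otherwise, $C_{ij} = 0$. In particular, when the IoUs of multiple objects with an anchor are greater than this threshold, the object with the largest IoU matches this anchor, which guarantees that each anchor is matched by at most a single object, i.e., $\sum_i C_{ij} \in \{0, 1\}, \forall a_j \in A$. By defining $A_+ \subseteq A$ as $\{a_j \mid \sum_i C_{ij} = 1\}$ and $A_- \subseteq A$ as $\{a_j \mid \sum_i C_{ij} = 0\}$, the loss function $\mathcal{L}(\theta)$ of the detector is written as follows:

$$\mathcal{L}(\theta) = \sum_{a_j \in A_+} \sum_{b_i \in B} C_{ij} \mathcal{L}_{ij}^{cls}(\theta) + \beta \sum_{a_j \in A_+} \sum_{b_i \in B} C_{ij} \mathcal{L}_{ij}^{loc}(\theta) + \sum_{a_j \in A_-} \mathcal{L}_{j}^{bg}(\theta) \tag{1}$$

where $\theta$ denotes the network parameters to be learned. $\mathcal{L}_{ij}^{cls}(\theta) = BCE(a_j^{cls}, b_i^{cls}, \theta)$, $\mathcal{L}_{ij}^{loc}(\theta) = SmoothL1(a_j^{loc}, b_i^{loc}, \theta)$ and $\mathcal{L}_{j}^{bg}(\theta) = BCE(a_j^{cls}, \vec{0}, \theta)$ respectively denote the Binary Cross Entropy loss ($BCE$) for classification and the $SmoothL1$ loss defined for localization FastRCNN15 . $\beta$ is a regularization factor.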
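The hand-crafted assignment described above can be illustrated with a minimal sketch in plain Python. The box format `(x1, y1, x2, y2)`, the helper names `iou` and `assign_anchors`, the example boxes, and the threshold value 0.5 are assumptions made for illustration; the text only states that "a threshold" is used.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_anchors(objects, anchors, threshold=0.5):
    """Return the matching matrix C, with C[i][j] = 1 if object i matches anchor j.

    Each anchor is matched to at most one object: the one with the largest
    IoU, provided that IoU exceeds the (assumed) threshold.
    """
    if not objects:
        return []
    C = [[0] * len(anchors) for _ in objects]
    for j, anchor in enumerate(anchors):
        ious = [iou(obj, anchor) for obj in objects]
        best_i = max(range(len(objects)), key=lambda i: ious[i])
        if ious[best_i] > threshold:
            C[best_i][j] = 1
    return C


# Hypothetical ground-truth boxes and anchors.
objects = [(0, 0, 10, 10), (20, 20, 30, 30)]
anchors = [(0, 0, 10, 9), (1, 1, 11, 11), (19, 20, 30, 30), (50, 50, 60, 60)]
C = assign_anchors(objects, anchors)
print(C)
```

Anchors whose column in `C` sums to 1 form the positive set, and the remaining anchors form the background set. Note that the assignment is fixed by geometry alone and never consults the network's predictions, which is the limitation the learning-to-match approach addresses.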

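From the MLE perspective, the training loss $\mathcal{L}(\theta)$ is converted into a likelihood probability, as follows:

$$\begin{aligned}
\mathcal{P}(\theta) = e^{-\mathcal{L}(\theta)} &= \prod_{a_j \in A_+} \Big( \sum_{b_i \in B} C_{ij} e^{-\mathcal{L}_{ij}^{cls}(\theta)} \Big) \prod_{a_j \in A_+} \Big( \sum_{b_i \in B} C_{ij} e^{-\beta \mathcal{L}_{ij}^{loc}(\theta)} \Big) \prod_{a_j \in A_-} e^{-\mathcal{L}_{j}^{bg}(\theta)} \\
&= \prod_{a_j \in A_+} \Big( \sum_{b_i \in B} C_{ij} \mathcal{P}_{ij}^{cls}(\theta) \Big) \prod_{a_j \in A_+} \Big( \sum_{b_i \in B} C_{ij} \mathcal{P}_{ij}^{loc}(\theta) \Big) \prod_{a_j \in A_-} \mathcal{P}_{j}^{bg}(\theta)
\end{aligned} \tag{2}$$

where $\mathcal{P}_{ij}^{cls}(\theta)$ and $\mathcal{P}_{j}^{bg}(\theta)$ denote classification confidence and $\mathcal{P}_{ij}^{loc}(\theta)$ denotes localization confidence. Minimizing the loss function defined in Eq. 1 is equivalent to maximizing the likelihood probability defined in Eq. 2.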
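The equivalence between minimizing a loss and maximizing a likelihood can be checked numerically for the classification term. The sketch below assumes per-class Sigmoid outputs and a one-hot label; the function name `bce` and the example numbers are hypothetical.

```python
import math


def bce(pred, target):
    """Binary cross entropy summed over classes (pred values strictly in (0, 1))."""
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for p, t in zip(pred, target)
    )


# Hypothetical Sigmoid outputs for k = 3 classes and a one-hot label.
pred = [0.9, 0.2, 0.1]
target = [1, 0, 0]

loss = bce(pred, target)
likelihood = math.exp(-loss)

# exp(-BCE) equals the probability the prediction assigns to the label:
# 0.9 * (1 - 0.2) * (1 - 0.1)
direct = 0.9 * 0.8 * 0.9
print(loss, likelihood, direct)
```

Because the exponential of a negated sum is a product of exponentials, the sum of per-anchor losses turns into a product of per-anchor confidences, which is the structure the likelihood formulation relies on.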

Eq. 2 strictly considers the optimization of classification and localization of anchors from the MLE perspective. However, it unfortunately ignores how to learn the matching matrix $C_{ij}$. Existing CNN-based detectors RCNN14 ; FastRCNN15 ; FasterRCNN15 ; YOLO16 ; FocalLoss17 ; FPN17 ; SSD16 solve this problem by empirically assigning anchors using the IoU criterion, Fig. 1, but ignore the optimization of object-anchor matching.

3.2 Detection Customized Likelihood

To achieve the optimization of object-anchor matching, we extend the CNN-based detection framework by introducing detection customized likelihood. Such likelihood intends to incorporate the objectives of recall and precision while guaranteeing the compatibility with NMS.

To implement the likelihood, we first construct a bag of candidate anchors $A_i \subset A$ for each object $b_i$ by selecting the $n$ top-ranked anchors in terms of their IoU with the object. We then learn to match the best anchor while maximizing the detection customized likelihood.
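Bag construction amounts to a per-object top-n selection over IoU values. The following sketch assumes the object-anchor IoUs are already available as a nested list; the function name `build_anchor_bags` and the numbers are hypothetical.

```python
def build_anchor_bags(iou_matrix, n):
    """For each object (row), return the indices of its n highest-IoU anchors."""
    bags = []
    for row in iou_matrix:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        bags.append(ranked[:n])
    return bags


# Hypothetical IoUs: 2 objects (rows) against 5 anchors (columns).
iou_matrix = [
    [0.10, 0.55, 0.40, 0.00, 0.30],
    [0.00, 0.05, 0.20, 0.70, 0.45],
]
bags = build_anchor_bags(iou_matrix, n=3)
print(bags)
```

In this toy example anchors 2 and 4 appear in both bags. Unlike thresholded assignment, bag membership does not by itself commit an anchor to an object; it only defines the candidates among which matching is learned.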

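To optimize the recall rate, for each object $b_i \in B$ we need to guarantee that there exists at least one anchor $a_j \in A_i$ whose prediction ($a_j^{cls}$ and $a_j^{loc}$) is close to the ground-truth. The objective function can be derived from the first two terms of Eq. 2, as follows:

$$\mathcal{P}_{recall}(\theta) = \prod_i \max_{a_j \in A_i} \big( \mathcal{P}_{ij}^{cls}(\theta) \, \mathcal{P}_{ij}^{loc}(\theta) \big) \tag{3}$$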

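To achieve increased detection precision, detectors need to classify the anchors with poor localization into the background class. This is fulfilled by optimizing the following objective function:

$$\mathcal{P}_{precision}(\theta) = \prod_j \Big( 1 - P\{a_j \in A_-\} \big( 1 - \mathcal{P}_{j}^{bg}(\theta) \big) \Big) \tag{4}$$

where $P\{a_j \in A_-\} = 1 - \max_i P\{a_j \to b_i\}$ is the probability that $a_j$ misses all objects and $P\{a_j \to b_i\}$ denotes the probability that anchor $a_j$ correctly predicts object $b_i$.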

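To be compatible with the NMS procedure, $P\{a_j \to b_i\}$ should have the following three properties: (1) $P\{a_j \to b_i\}$ is a monotonically increasing function of the IoU between $a_j^{loc}$ and $b_i$, denoted $IoU_{ij}^{loc}$. (2) When $IoU_{ij}^{loc}$ is smaller than a threshold $t$, $P\{a_j \to b_i\}$ is close to 0. (3) For an object $b_i$, there exists one and only one $a_j$ satisfying $P\{a_j \to b_i\} = 1$. These properties can be satisfied with a saturated linear function, as

$$\text{Saturated linear}(x, t_1, t_2) = \begin{cases} 0, & x \le t_1 \\ \dfrac{x - t_1}{t_2 - t_1}, & t_1 < x < t_2 \\ 1, & x \ge t_2 \end{cases}$$

which is shown in Fig. 2, and we have $P\{a_j \to b_i\} = \text{Saturated linear}\big( IoU_{ij}^{loc}, t, \max_j ( IoU_{ij}^{loc} ) \big)$.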
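A minimal sketch of the saturated linear function and its use for the matching probability is given below. The function names and example IoUs are hypothetical; the default threshold `t=0.6` follows the value reported in Section 4.3. The order of the checks is a choice made here so that an object whose best IoU does not exceed the threshold yields zero probability for every anchor.

```python
def saturated_linear(x, t1, t2):
    """0 at or below t1, 1 at or above t2, linear in between."""
    if x <= t1:
        return 0.0
    if x >= t2:
        return 1.0
    return (x - t1) / (t2 - t1)


def match_probabilities(loc_ious, t=0.6):
    """Matching probability of each anchor for one object.

    loc_ious holds the IoU between each anchor's predicted box and the
    object's ground-truth box. The upper saturation point is the largest
    of these IoUs, so the best-localized anchor gets probability 1.
    """
    t2 = max(loc_ious)
    return [saturated_linear(x, t, t2) for x in loc_ious]


# Hypothetical IoUs of four predicted boxes with one ground-truth box.
probs = match_probabilities([0.5, 0.7, 0.8, 0.9])
print(probs)
```

In this example the anchor below the threshold gets probability 0, the best anchor gets probability 1, and the others fall in between in proportion to their IoU, which matches the three properties listed above.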

Implementing the definitions provided above, the detection customized likelihood is defined as follows:

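$$\mathcal{P}'(\theta) = \mathcal{P}_{recall}(\theta) \times \mathcal{P}_{precision}(\theta) = \prod_i \max_{a_j \in A_i} \big( \mathcal{P}_{ij}^{cls}(\theta) \, \mathcal{P}_{ij}^{loc}(\theta) \big) \times \prod_j \Big( 1 - P\{a_j \in A_-\} \big( 1 - \mathcal{P}_{j}^{bg}(\theta) \big) \Big) \tag{5}$$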

which incorporates the objectives of recall, precision and compatibility with NMS. By optimizing this likelihood, we simultaneously maximize the probability of recall and precision and then achieve free object-anchor matching during detector training.

Figure 2: Saturated linear function.
Figure 3: Mean-max function.

3.3 Anchor Matching Mechanism

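To implement this learning-to-match approach in a CNN-based detector, the detection customized likelihood defined by Eq. 5 is converted to a detection customized loss function, as follows:

$$\mathcal{L}'(\theta) = -\log \mathcal{P}'(\theta) = -\sum_i \log \Big( \max_{a_j \in A_i} \big( \mathcal{P}_{ij}^{cls}(\theta) \, \mathcal{P}_{ij}^{loc}(\theta) \big) \Big) - \sum_j \log \Big( 1 - P\{a_j \in A_-\} \big( 1 - \mathcal{P}_{j}^{bg}(\theta) \big) \Big) \tag{6}$$

where the $\max$ function is used to select the best anchor for each object. During training, a single anchor is selected from a bag of anchors $A_i$, which is then used to update the network parameters $\theta$.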

At early training epochs, the confidence of all anchors is small for randomly initialized network parameters. The anchor with the highest confidence is therefore not necessarily suitable for detector training. We therefore propose using the Mean-max function, defined as:

$$\text{Mean-max}(X) = \frac{\sum_{x_j \in X} \frac{x_j}{1 - x_j}}{\sum_{x_j \in X} \frac{1}{1 - x_j}}$$

which is used to select anchors. When training is insufficient, the Mean-max function, as shown in Fig. 3, will be close to the mean function, which means almost all anchors in the bag are used for training. As training proceeds, the confidence of some anchors increases and the Mean-max function moves closer to the max function. When sufficient training has taken place, a single best anchor can be selected from a bag of anchors to match each object.
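The transition from mean-like to max-like behavior can be seen numerically with a small sketch. The function name `mean_max` and the confidence values are hypothetical; inputs are assumed to lie strictly below 1.

```python
def mean_max(xs):
    """Weighted mean of xs with weights 1 / (1 - x); requires every x < 1."""
    numerator = sum(x / (1.0 - x) for x in xs)
    denominator = sum(1.0 / (1.0 - x) for x in xs)
    return numerator / denominator


# Early training: all confidences are small, so the result is near the mean.
early = mean_max([0.01, 0.02, 0.03])

# Late training: one confidence is near 1, so the result is near the max.
late = mean_max([0.1, 0.2, 0.99])

print(early, late)
```

Since each weight 1 / (1 - x) grows without bound as x approaches 1, a single confident anchor dominates the weighted mean, while uniformly low confidences receive nearly equal weights.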

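Replacing the max function in Eq. 6 with Mean-max, adding balance factors $w_1$ and $w_2$, and applying focal loss FocalLoss17 to the second term of Eq. 6, the detection customized loss function of a FreeAnchor detector is obtained, as follows:

$$\mathcal{L}''(\theta) = -w_1 \sum_i \log \big( \text{Mean-max}(X_i) \big) + w_2 \sum_j FL\_\Big( P\{a_j \in A_-\} \big( 1 - \mathcal{P}_{j}^{bg}(\theta) \big) \Big) \tag{7}$$

where $X_i = \{ \mathcal{P}_{ij}^{cls}(\theta) \, \mathcal{P}_{ij}^{loc}(\theta) \mid a_j \in A_i \}$ is a likelihood set corresponding to the anchor bag $A_i$. By inheriting the parameters $\alpha$ and $\gamma$ from focal loss FocalLoss17 , we set $w_1 = \frac{\alpha}{\|B\|}$, $w_2 = \frac{1 - \alpha}{n \|B\|}$, and $FL\_(p) = -p^{\gamma} \log(1 - p)$.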
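To make the structure of the final loss concrete, the sketch below evaluates it on hand-picked numbers in plain Python. It assumes the per-anchor quantities (bag likelihoods, miss probabilities, background confidences) have already been computed; the function and argument names and all numeric inputs are hypothetical, while the defaults for alpha and gamma follow the values reported in Section 4.3. It is not the authors' implementation.

```python
import math


def mean_max(xs):
    """Weighted mean of xs with weights 1 / (1 - x); requires every x < 1."""
    return sum(x / (1.0 - x) for x in xs) / sum(1.0 / (1.0 - x) for x in xs)


def negative_focal(p, gamma=2.0):
    """Focal-style penalty -p^gamma * log(1 - p); requires p < 1."""
    return -(p ** gamma) * math.log(1.0 - p)


def detection_customized_loss(bag_likelihoods, miss_probs, bg_probs, n,
                              alpha=0.5, gamma=2.0):
    """Positive (bag) term plus negative (background) term.

    bag_likelihoods: per object, classification * localization confidence
                     of each anchor in its bag (values in (0, 1)).
    miss_probs:      per anchor, probability that it misses all objects.
    bg_probs:        per anchor, background confidence (values in (0, 1]).
    n:               anchor bag size.
    """
    num_objects = len(bag_likelihoods)
    w1 = alpha / num_objects
    w2 = (1.0 - alpha) / (n * num_objects)
    positive = -w1 * sum(math.log(mean_max(x)) for x in bag_likelihoods)
    negative = w2 * sum(
        negative_focal(miss * (1.0 - bg), gamma)
        for miss, bg in zip(miss_probs, bg_probs)
    )
    return positive + negative


# One object, four anchors, bag size 3 (all numbers are made up).
miss_probs = [1.0, 1.0, 0.0, 1.0]
bg_probs = [0.9, 0.8, 0.1, 0.95]
loss_good = detection_customized_loss([[0.1, 0.2, 0.9]], miss_probs, bg_probs, n=3)
loss_poor = detection_customized_loss([[0.1, 0.2, 0.3]], miss_probs, bg_probs, n=3)
print(loss_good, loss_poor)
```

The bag containing one confident anchor yields a lower loss than the bag in which every anchor is mediocre, which reflects the recall objective: one good anchor per object is sufficient.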

With the detection customized loss defined above, we implement the detector training procedure as Algorithm 1.

Figure 4: Comparison of learning-to-match anchors (left) with hand-crafted anchor assignment (right) for the “laptop” object. Red dots denote anchor centers. Darker (redder) dots denote higher confidence to be matched. For clarity, we select 16 anchors of aspect-ratio 1:1 from all 40 anchors for illustration. (Best viewed in color)
Input: $I$: input image. $B$: a set of ground-truth bounding boxes $b_i$. $A$: a set of anchors $a_j$ in the image. $n$: hyper-parameter for the anchor bag size.
Output: $\theta$: detection network parameters.
  $\theta \leftarrow$ initialize network parameters.
  for i = 1 : MaxIter do
    Forward propagation: predict class $a_j^{cls}$ and location $a_j^{loc}$ for each anchor $a_j \in A$.
    Anchor bag construction: $A_i \leftarrow$ select the $n$ top-ranked anchors $a_j$ in terms of their IoU with $b_i$.
    Loss calculation: calculate $\mathcal{L}''(\theta)$ with Eq. 7.
    Backward propagation: $\theta^{t+1} = \theta^{t} - \lambda \nabla_{\theta^{t}} \mathcal{L}''(\theta^{t})$ using a stochastic gradient descent algorithm.
  end for
  return $\theta$
Algorithm 1 Detector training with FreeAnchor.

4 Experiments

In this section, we present the implementation of a FreeAnchor detector to appraise the effect of the proposed learning-to-match approach. We also compare the FreeAnchor detector with its counterpart and the state-of-the-art approaches. Experiments were carried out on MS-COCO 2017 Lin2014MicrosoftCC , which contains 118k images for training, 5k for validation (val) and 20k for testing without provided annotations (test-dev). Detectors were trained on the COCO training set and evaluated on the val set. Final results were reported on the test-dev set.

4.1 Implementation Details

FreeAnchor is implemented upon a state-of-the-art one-stage detector, RetinaNet FocalLoss17 , using ResNet ResNet16 and ResNeXt ResNeXt17 as the backbone networks. By simply replacing the loss defined in RetinaNet with the proposed detection customized loss, Eq. 7, we updated the RetinaNet detector to a FreeAnchor detector. For the last convolutional layer of the classification subnet, we set the bias initialization to $b = -\log((1 - \rho)/\rho)$ with $\rho = 0.02$. Training used synchronized SGD over 4 Tesla V100 GPUs with a total of 16 images per mini-batch (4 images per GPU). Unless otherwise specified, all models were trained for 90k iterations with an initial learning rate of 0.01, which was divided by 10 at 60k and again at 80k iterations.
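The two training details above, the prior-based bias initialization and the step learning-rate schedule, can be sketched as follows. The bias formula is the one introduced with RetinaNet; the function names are hypothetical and the prior value passed in the example is illustrative.

```python
import math


def classification_bias(rho):
    """Bias b such that sigmoid(b) equals the prior probability rho."""
    return -math.log((1.0 - rho) / rho)


def learning_rate(iteration, base_lr=0.01, steps=(60000, 80000), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each step boundary."""
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= factor
    return lr


bias = classification_bias(0.02)       # illustrative prior value
prior = 1.0 / (1.0 + math.exp(-bias))  # sigmoid(bias) recovers the prior

print(bias, prior)
print(learning_rate(0), learning_rate(60000), learning_rate(85000))
```

Initializing the bias this way makes every anchor start with a low foreground score, which keeps the large number of background anchors from dominating the loss in the first iterations.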

4.2 Model Effect

Learning-to-match: The proposed learning-to-match approach can select proper anchors to represent the object of interest, Fig. 4. As analyzed in the introduction section, hand-crafted anchor assignment often fails in two situations: first, for slender objects with acentric features; and second, when multiple objects appear together in crowded scenes. FreeAnchor effectively alleviated these two problems. For slender object categories, such as toothbrush, skis, couch, and tie, FreeAnchor significantly outperformed the RetinaNet baseline, Fig. 5. For other object categories including clock, traffic light, and sports ball, FreeAnchor reported comparable performance with RetinaNet. The reason for this is that the learning-to-match procedure drives the network to activate at least one anchor within each object's anchor bag in order to predict the correct category and location. The anchor is not necessarily spatially aligned with the object, but has the most representative features for object classification and localization.

We further compared the performance of RetinaNet and FreeAnchor in scenarios of various crowdedness, Fig. 6. As the number of objects in each image increased, FreeAnchor's advantage over RetinaNet became increasingly pronounced. This demonstrated that our approach, with the learning-to-match mechanism, can select more suitable anchors for objects in crowded scenes.

Figure 5: Performance comparison on square and slender objects.
Figure 6: Performance comparison on object crowdedness.

Compatibility with NMS: To assess the compatibility of anchors' predictions with NMS, we defined the NMS Recall ($NR_\tau$) as the ratio of the recall rates after and before NMS for a given IoU threshold $\tau$. Following the COCO-style AP metric Lin2014MicrosoftCC , NR was defined as the average of $NR_\tau$ as $\tau$ varies from 0.50 to 0.90 with an interval of 0.05, Table 1. We compared RetinaNet and FreeAnchor in terms of their $NR_\tau$. It can be seen that FreeAnchor reported higher $NR_\tau$, which means higher compatibility with NMS. This validated that the detection customized likelihood, defined in Section 3.2, can drive joint optimization of classification and localization.

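| Backbone | Detector | NR | NR_50 | NR_60 | NR_70 | NR_80 | NR_90 |
|---|---|---|---|---|---|---|---|
| ResNet-50 | RetinaNet FocalLoss17 | 81.8 | 98.3 | 95.7 | 87.0 | 71.8 | 51.3 |
| ResNet-50 | FreeAnchor (ours) | 83.8 | 99.2 | 97.5 | 89.5 | 74.3 | 53.1 |

Table 1: Comparison of NMS recall (%) on MS-COCO val set.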
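The NMS recall metric reduces to a ratio per IoU threshold followed by an average over nine thresholds. The sketch below assumes the recall rates before and after NMS have already been measured at each threshold; the function name and the recall values are hypothetical and are not taken from the experiments.

```python
# IoU thresholds 0.50, 0.55, ..., 0.90
TAUS = [0.50 + 0.05 * k for k in range(9)]


def nms_recall(recall_before, recall_after):
    """Return the per-threshold ratios and their average.

    recall_before / recall_after: recall rates measured before and after NMS,
    one value per threshold in TAUS (recall_before values must be non-zero).
    """
    per_tau = [after / before for before, after in zip(recall_before, recall_after)]
    return per_tau, sum(per_tau) / len(per_tau)


# Made-up recall rates, chosen only to make the arithmetic easy to follow.
recall_before = [0.5] * 9
recall_after = [0.5, 0.45, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1]
per_tau, nr = nms_recall(recall_before, recall_after)
print(per_tau, nr)
```

A ratio close to 1 at a given threshold means NMS rarely discards a well-localized box in favor of a worse one with a higher score.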

4.3 Parameter Setting

Anchor bag size $n$: We evaluated anchor bag sizes in {40, 50, 60, 100} and observed that the bag size 50 reported the best performance. A smaller bag might miss the best anchor while a larger bag could aggravate the difficulty of anchor estimation.

Background IoU threshold $t$: A threshold was used in $P\{a_j \to b_i\}$ during training. We tried background IoU thresholds in {0.5, 0.6, 0.7} and validated that 0.6 worked best.

Focal loss parameter: FreeAnchor introduced a bag of anchors to replace independent anchors and therefore faced more serious sample imbalance. To handle the imbalance, we experimented with the Focal Loss FocalLoss17 parameters $\alpha$ in {0.25, 0.5, 0.75} and $\gamma$ in {1.5, 2.0, 2.5}, and set $\alpha = 0.5$ and $\gamma = 2.0$.

Loss regularization factor $\beta$: The regularization factor $\beta$ in Eq. 1, which balances the loss of classification and localization, was experimentally validated to be 0.75.

4.4 Detection Performance

In Table 2, FreeAnchor was compared with the RetinaNet baseline. FreeAnchor consistently improved AP by up to 3.5%, which is a significant margin for the challenging object detection task. Note that the performance gain was achieved at negligible cost in training and test time.

| Backbone | Detector | Train time | Test time / image | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | RetinaNet FocalLoss17 | 9.33h | 0.198s | 35.7 | 55.0 | 38.5 | 18.9 | 38.9 | 46.3 |
| ResNet-50 | FreeAnchor (ours) | 9.95h | 0.195s | 39.1 | 58.2 | 42.1 | 21.1 | 41.9 | 49.9 |
| ResNet-101 | RetinaNet FocalLoss17 | 12.9h | 0.257s | 37.8 | 57.5 | 40.8 | 20.2 | 41.1 | 49.2 |
| ResNet-101 | FreeAnchor (ours) | 13.5h | 0.259s | 41.3 | 60.6 | 44.7 | 22.5 | 44.3 | 53.0 |

Table 2: Detection performance comparison of FreeAnchor and RetinaNet (baseline).

FreeAnchor was compared with other state-of-the-art detectors in Table 3 under a standard setting (std.) and an advanced setting (adv.). The standard setting followed the description in Sec. 4.1 with a ResNet backbone, while the advanced setting used jitter over scales {640, 672, 704, 736, 768, 800} during training with a ResNeXt-32x8d-101 backbone. Experiments show that FreeAnchor outperformed the counterparts including MetaAnchor MetaAnchor2018 and IoU-Net IoU-Net18 , which used hand-crafted anchor assignment. It also outperformed the anchor-free approaches including GuidedAnchoring GuidedAnchoring , FSAF zhu2019feature , and CornerNet CornerNet2018 . Using far fewer training iterations (135k), FreeAnchor reported comparable performance with the state-of-the-art CenterNet CenterNet2019 . As a one-stage detector, FreeAnchor surprisingly reported comparable performance with the state-of-the-art multi-stage detector Cascade RCNN cascade18 .

| Setting | Detector | Backbone | Iter. | Batch size | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|---|---|---|---|
| std. | MetaAnchor MetaAnchor2018 | ResNet-50 | 90k | 16 | 37.9 | - | - | - | - | - |
| std. | GA-RetinaNet GuidedAnchoring | ResNet-50 | 90k | 16 | 37.1 | 56.9 | 40.0 | 20.1 | 40.1 | 48.0 |
| std. | FreeAnchor (ours) | ResNet-50 | 90k | 16 | 39.1 | 58.2 | 42.1 | 21.1 | 41.9 | 49.9 |
| std. | FPN FPN17 | ResNet-101 | 180k | 16 | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 |
| std. | IoU-Net IoU-Net18 | ResNet-101 | 160k | 16 | 40.6 | 59.0 | - | - | - | - |
| std. | FCOS tian2019fcos | ResNet-101 | 180k | 16 | 41.5 | 60.7 | 45.0 | 24.4 | 44.8 | 51.6 |
| std. | Cascade RCNN cascade18 | ResNet-101 | 280k | 8 | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2 |
| std. | FreeAnchor (ours) | ResNet-101 | 135k | 16 | 41.8 | 61.1 | 44.9 | 22.6 | 44.7 | 53.9 |
| adv. | RetinaNet FocalLoss17 | ResNeXt-101 | 135k | 16 | 40.8 | 61.1 | 44.1 | 24.1 | 44.2 | 51.2 |
| adv. | FoveaBox kong2019foveabox | ResNeXt-101 | 135k | 16 | 42.1 | 61.9 | 45.2 | 24.9 | 46.8 | 55.6 |
| adv. | AB+FSAF zhu2019feature | ResNeXt-101 | 135k | 16 | 42.9 | 63.8 | 46.3 | 26.6 | 46.2 | 52.7 |
| adv. | CornerNet CornerNet2018 | Hourglass-104 | 500k | 49 | 40.6 | 56.4 | 43.2 | 19.1 | 42.8 | 54.3 |
| adv. | CenterNet CenterNet2019 | Hourglass-104 | 480k | 48 | 44.9 | 62.4 | 48.1 | 25.6 | 47.4 | 57.4 |
| adv. | FreeAnchor (ours) | ResNeXt-101 | 135k | 16 | 44.8 | 64.3 | 48.4 | 27.0 | 47.9 | 56.0 |

Table 3: Detection performance comparison with state-of-the-art detectors. For all detectors, jitter over scales was not used during the test phase for a fair comparison.

5 Conclusion

We proposed an elegant and effective approach, referred to as FreeAnchor, for visual object detection. FreeAnchor updated the hand-crafted anchor assignment to "free" object-anchor correspondence by formulating detector training as a maximum likelihood estimation (MLE) procedure. With FreeAnchor implemented, we significantly improved the performance of object detection over the baseline detector. The underlying reason is that the MLE procedure with the detection customized likelihood facilitates learning convolutional features that best explain a class of objects. This provides a fresh insight into the visual object detection problem.

References