Our ICCV Submission materials
Rotation augmentations generally improve a model's invariance/equivariance to rotation, except in object detection. In object detection the shape is not known, so rotation creates a label ambiguity. We show that the de facto method for bounding box label rotation, the Largest Box Method, creates very large labels, leading to poor performance and in many cases worse performance than using no rotation at all. We propose a new method of rotation augmentation that can be implemented in a few lines of code. First, we create a differentiable approximation of label accuracy and show that axis-aligning the bounding box around an ellipse is optimal. We then introduce Rotation Uncertainty (RU) Loss, allowing the model to adapt to the uncertainty of the labels. On five different datasets (including COCO, PascalVOC, and Transparent Object Bin Picking), this approach improves the rotational invariance of both one-stage and two-stage architectures when measured with AP, AP50, and AP75. The code is available at https://github.com/akasha-imaging/ICCV2021.
It is desirable for object detectors to work when scenes are rotated. But there is a problem: methods like Convolutional Neural Networks (CNNs) may be scale and translation invariant, but CNNs are not rotation invariant [Goodfellow-et-al-2016]. To overcome this problem, the training dataset can be expanded to include data at new rotation angles, a technique known as rotation augmentation. In object detection, rotation augmentation can be abstracted as follows: given an original bounding box and any desired angle of rotation, how should we determine the axis-aligned rotated bounding box label? If the shape of the object is known, this is quite simple: we rotate the object shape and re-compute the bounding box. However, in object detection the shape is unknown.
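When the shape is known, the "rotate and re-compute" step above is a few lines of NumPy. The following sketch is illustrative (the helper names are ours, not from the released code):

```python
import numpy as np

def rotate_points(points, angle_deg, center):
    """Rotate an (N, 2) array of points counter-clockwise by angle_deg around center."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return (points - center) @ rot.T + center

def bbox_of_shape(points):
    """Tightest axis-aligned box (x_min, y_min, x_max, y_max) around a point set."""
    x_min, y_min = points.min(axis=0)
    x_max, y_max = points.max(axis=0)
    return x_min, y_min, x_max, y_max

# A known shape (here a triangle) rotated by 45 degrees around the origin,
# followed by re-computing its axis-aligned bounding box:
shape = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 2.0]])
rotated_box = bbox_of_shape(rotate_points(shape, 45.0, center=np.array([0.0, 0.0])))
```

The ambiguity discussed in this paper arises precisely because, without segmentation labels, the point set passed to `rotate_points` is unknown.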
The prevailing folk wisdom in the community is to select a label large enough to completely overlap the rotated box [zoph2019learning]. In studying this problem, we find that this method may hurt performance, and on COCO [lin2014microsoft], every other prior we tried is better. Yet this "Largest Box" method is prevalent both in academia and in large-scale object detection codebases [xi2018sr, bochkovskiy2020yolov4, zoph2019learning, tan2019efficientnet, montserrat2017training, liu2016novel, albumentations-team, chen2019mmdetection, abadi2016tensorflow, jung2018imgaug, casado2019clodsa, solt2019]. Indeed, recent analysis has found that the largest box is only robust to a few degrees of rotation [fastai].
In this paper, we propose an advance on the largest box solution that achieves significantly better performance on five object detection datasets while retaining the simplicity of a few-line implementation.
In a nutshell, our solution has two aspects. First, we derive an elliptical shape prior from first principles to determine the rotated box label. We compare it to many other novel priors and show it is optimal. Second, we introduce a novel Rotation Uncertainty (RU) Loss function, which allows the network to adapt the labels at higher rotations using priors from lower rotations based on label certainty. We demonstrate the effectiveness of this solution both by improving performance on datasets where rotation is important, such as Pascal VOC [everingham2010pascal] and Transparent Object Bin Picking [kalra_2020_cvpr], and by generalizing to novel test-time rotations on MS COCO [lin2014microsoft] (Figure 1).
Rotation data augmentation in object detection is not new. This paper is not about finding the best overall way to use rotation data augmentation; for that, a brute-force search or papers like AutoAugment [zoph2019learning] might be better examples. This paper focuses solely on methods to perform a rotation augmentation on axis-aligned bounding boxes. When implemented, the proposed modifications boil down to a few lines of code and leave little reason to use the current Largest Box method.
Data augmentation is an effective technique to boost the performance of object detection: it increases the quantity and improves the diversity of data. Data augmentations come in two types. Photometric transforms modify the color channels so that the detector becomes invariant to changes in lighting and color. Classical photometric techniques include adding Gaussian blur or color jitter. Modern photometric augmentations like Cutout [devries2017improved] and CutMix [yun2019cutmix] randomly remove patches of the image. Geometric transforms, on the other hand, modify the image's geometry, making the detector invariant to position and orientation. Geometric modifications require corresponding changes to the labels as well. Geometric transforms are difficult to implement [albumentations-team] and contribute more towards accuracy improvements [taylor2018improving]. We focus specifically on rotation augmentations for object detection.
| Rotation Method | AP50 (Coarse) | AP75 (Fine) | Shape Label |
| Largest Box (e.g. [abadi2016tensorflow, jung2018imgaug, albumentations-team]) | Med | Low | No |
| Ellipse + RU Loss (Ours) | Very High | High | No |
| Perfect Box (Gold Std) | Very High | Very High | Yes |
Rotation augmentation in object detection is currently done via the largest box method in the major repositories (e.g. [albumentations-team, chen2019mmdetection, abadi2016tensorflow, jung2018imgaug, casado2019clodsa, solt2019]) and publications (e.g. [xi2018sr, bochkovskiy2020yolov4, zoph2019learning, tan2019efficientnet, montserrat2017training, liu2016novel]) that rotate bounding boxes for deep learning object detection. The largest box method does a great job of guaranteeing containment, but at large angles it severely overestimates the bounding box's size. Figure 2 shows an example of these over-sized bounding boxes. For that reason, FastAI [fastai] recommends rotation of no more than 3 degrees. Some recent work, such as AutoAugment [zoph2019learning, tan2019efficientnet], uses rotation as part of a complex learned data augmentation scheme. While learning rotation augmentation directly is interesting, it requires extensive computing resources. We seek to achieve the simplicity of the largest box, with performance improvements for larger angles.
Oriented bounding box prediction, a sister task of object detection, is the task of predicting non-axis-aligned (oriented) bounding boxes. Several methods [Ding_2019_CVPR, cheng2018learning, liu2017learning] aim to achieve rotation invariance when predicting rotated bounding boxes. However, these methods already have labelled rotated boxes as input and do not end up with loose boxes when the input image is rotated. As this is a different task, it is out of the scope of this paper; our paper focuses on axis-aligned bounding boxes only.
Rotation invariance is an important problem to solve in object detection. Classical computer vision methods achieved rotational invariance by extracting rotation-invariant features from images [wang2015ordinal, liu2018rotation, liu2014rotation, schmidt2012learning]. With the rise of neural networks, newer methods attempt to modify the architecture to achieve rotation invariance [cheng2016rifd, cheng2016learning, xi2018sr]. These methods rotate the input images and add special layers that learn the object's orientation in the image. Our general-purpose method can aid these methods as well.
An image is parameterized by $x$ and $y$ coordinates. Suppose the image contains an object with shape $S$. Let $S$ denote the shape set that describes all points in the object: $S = \{(x, y) \mid (x, y) \text{ lies on the object}\}$.
In object detection, a bounding box is defined to be the tightest-fitting axis-aligned box around a shape. Therefore a shape $S$ determines the coordinates of a bounding box, $b$. Each of the four edges of the bounding box intersects at least one element of the shape set. Let the operator $B(\cdot)$ represent the perfect conversion of a shape to a bounding box: $b = B(S)$.
The operator $B$ extracts the bounding box (tightest-fitting axis-aligned box) for $S$ by taking the minimum/maximum coordinates of the shape. As $S$ is not unique, the same bounding box can be generated by many shapes. For example, a square of side length $\ell$ and a circle of diameter $\ell$ are just two of many unique shapes that generate the same bounding box. More formally, let $\mathcal{S}(b)$ denote the set of shapes that could possibly generate a bounding box $b$, such that $\mathcal{S}(b) = \{S \sim D_{\text{shape}} \mid B(S) = b\}$,
where $D_{\text{shape}}$ is the dataset-specific distribution of shapes. Let us consider the problem of rotation augmentation where an image and corresponding box label are rotated by angle $\theta$. If the shape of the object is known, then we can rotate the original shape by angle $\theta$ using a rotation operator: $R_\theta(S)$. In analogy to Equation 2, we can then use the perfect method to obtain an axis-aligned bounding box for the rotated image as $b_\theta^{\text{perfect}} = B(R_\theta(S))$.
We call this method perfect labels, where $S$ is the actual shape of the object for a given bounding box $b$. However, this requires shape labels, which are not available for object detection. In object detection, humans label boxes by implicitly segmenting the shape. Without knowing the shape labels, any shape $S \in \mathcal{S}(b)$ could be the true shape, leading to many possible boxes $b_\theta$. This paper seeks a method to estimate the rotated bounding box when we do not know the shape. We are only provided with the original bounding box, which we will hereafter write as $b_0$, making explicit that $\theta = 0$. The problem statement follows.
Given only an input bounding box $b_0$ and an angle $\theta$ by which the image should be rotated, find the axis-aligned bounding box $\hat{b}_\theta$ that: (1) has high IoU with $b_\theta^{\text{perfect}}$; and (2) improves model performance on rotated versions of vision datasets.
Rotation augmentation without shape knowledge is not a new problem statement. The de facto method in the object detection community for determining the bounding box post-rotation with no shape priors is the largest box method. The largest box method is extremely prevalent (e.g. [montserrat2017training, liu2016novel, albumentations-team, chen2019mmdetection, abadi2016tensorflow, jung2018imgaug, casado2019clodsa, solt2019, kathuria_2018, kdnuggets, lozuwa_2019, saxena_2020, solawetz_2020, matlab]). Just like our proposed method, the largest box takes only the original bounding box $b_0$ and $\theta$ as input. From Equation 3 it is clear that several shapes could define $b_0$. This creates an ambiguity problem. The largest box method chooses the single largest of these possible shapes in area. This shape is simply the box itself (Table 1). Treating this as the object shape, Equation 4 can be adapted to obtain $b_\theta^{\text{largest}} = B(R_\theta(b_0))$.
The benefit of this method is that it produces a box that is guaranteed to contain the original object [zoph2019learning], and it is easy to implement. The downside is that the method produces oversized boxes [fastai, zoph2019learning, exchange_1968, albumentations-team_q, open-mmlab_q, aleju], and if used aggressively, hurts performance more than it helps (Table 11). Surprisingly, to the best of our knowledge, including personal communication with practitioners and posts on internet forums, no alternatives have been adopted. We hope our method will change that.
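Concretely, the largest box method treats the original box as the object shape: it rotates the four corners of $b_0$ and takes the axis-aligned box around them. A minimal sketch (function name ours):

```python
import numpy as np

def largest_box(box, angle_deg, center):
    """The de facto 'largest box' label: rotate the four corners of the
    original axis-aligned box (x1, y1, x2, y2) around `center` and take the
    axis-aligned box around them, guaranteeing the rotated object is contained."""
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]], dtype=float)
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    rc = (corners - center) @ rot.T + center
    return rc[:, 0].min(), rc[:, 1].min(), rc[:, 0].max(), rc[:, 1].max()
```

At 45 degrees a square box grows by a factor of $\sqrt{2}$ in each dimension, which is exactly the over-sizing this paper addresses.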
We now describe our solution to the problem: given $b_0$ and desired rotation angle $\theta$, find $\hat{b}_\theta$. In a nutshell, our solution estimates a rotated bounding box by assuming the original shape is an ellipse (Table 1, Figure 3) and rotating accordingly (Section 4.1). We then adapt the loss function to account for error in the labels (Section 4.3).
In this section, we first derive the ellipse assumption from first principles by trying to find the shape that is most likely to have high overlap with potential ground truth boxes. Then we discuss the implementation and intuition of the ellipse method. Finally, we mention other novel methods we developed.
We start with a simple assumption: the optimal method for determining a bounding box post-rotation augmentation should maximize label accuracy, which in the case of object detection is measured in IoU.
We define $b_\theta^{\text{opt}}$ as the optimal rotated bounding box. We are provided the input angle $\theta$ and box $b$. From Equation 3, this box could have been generated from any number of shapes: $S \in \mathcal{S}(b)$. For each shape we can use the "perfect method" from Equation 4 to obtain a potential rotated bounding box. Since multiple shapes can lead to the same rotated box, we obtain the set of possible bounding boxes: $\mathcal{B}_\theta(b) = \{B(R_\theta(S)) \mid S \in \mathcal{S}(b)\}$.
Hereafter, and without loss of generality, the paper will assume that $b_0$ is the input, allowing notation to be simplified: $\mathcal{B}_\theta = \mathcal{B}_\theta(b_0)$.
Now the task becomes to pick the "best" of the possible bounding boxes in $\mathcal{B}_\theta$. Recall that the de facto solution is to choose the largest box in $\mathcal{B}_\theta$. This largest box is guaranteed to contain the object. However, optimizing for containment does not directly address the metric of AP, because AP uses IoU to determine true positives, not containment. A more relevant goal for object detection is to select a box that maximizes $\mathrm{IoU}(\hat{b}_\theta, b_\theta^{\text{perfect}})$,
in which case $\hat{b}_\theta = b_\theta^{\text{perfect}}$. Of course, we are not given $b_\theta^{\text{perfect}}$. So for the moment, let us assume any of the bounding boxes in $\mathcal{B}_\theta$ has an equal chance of being the perfect box. Then it would make sense to optimize $\hat{b}_\theta = \arg\max_{b \in \mathcal{B}_\theta} \frac{1}{|\mathcal{B}_\theta|} \sum_{b' \in \mathcal{B}_\theta} \mathrm{IoU}(b, b')$.
Now, let us break the assumption that each candidate box in $\mathcal{B}_\theta$ is equally likely to be the perfect box. Indeed, we know that many shapes can produce the same box, so certain boxes are more likely than others. For example, the only shape that can produce the largest box is the original box itself, whereas other rotated boxes can be generated by multiple object shapes in the dataset. Denote $p(b')$ as the probability that box $b'$ is the perfect box. Then Equation 10 can be reformulated as $\hat{b}_\theta = \arg\max_{b \in \mathcal{B}_\theta} \sum_{b' \in \mathcal{B}_\theta} p(b')\,\mathrm{IoU}(b, b')$.
Readers may recognize this equation as being analogous to an expectation; we refer to it in this paper as the Expected IoU. The Expected IoU is not directly tractable: we do not know $p(b')$ a priori. However, if we can sample random shapes $S_i$ from a dataset distribution over shapes $D_{\text{shape}}$, we get the following optimization objective: $\hat{b}_\theta = \arg\max_{b} \frac{1}{N} \sum_{i=1}^{N} \mathrm{IoU}\big(b, B(R_\theta(S_i))\big)$.
Since object detection datasets do not have shape labels, we sample by generating random shapes that touch each side of $b_0$ once. This way we are not dataset-specific. We analyze using COCO shapes in the supplement and show the performance is extremely similar. The above equation is fully differentiable, so we can solve it with gradient ascent. The problem is that we would then have to solve this equation for every $\theta$ and every box $b_0$, which is not practical. Therefore we generalize this further to a canonical shape.
Instead of solving Equation 12 for every possible combination of $\theta$ and $b_0$, we attempt to find a shape that is optimal across different input bounding boxes. This way, we can solve for some best shape $S^{\text{opt}}$ once, and then solve for $\hat{b}_\theta$ as $\hat{b}_\theta = B(R_\theta(S^{\text{opt}}))$.
Note that we now optimize over all rotation angles and aspect ratios simultaneously. This adds enough constraints to find a unique shape. The goal is to find the shape that produces an augmented bounding box with high IoU against likely ground truth boxes. Since $B$ and $R_\theta$ are differentiable operators, Equation 14 can be optimized through gradient ascent to solve for $S^{\text{opt}}$. We provide details, pseudo-code, and some analysis in the supplement. The stable solution found by gradient ascent is an elliptical shape. We show the progression of gradient ascent in Figure 4, from the largest box shape to a circle. If we change the aspect ratio, it simply converges to the largest inscribed ellipse. Also in Figure 4, we show that the Expected IoU for the largest box shape is much lower than for the ellipse, and in Figure 5 we show that the resulting AP of the ellipse labels is much better. The elliptical solution is similar to the optimized shape for various tested distributions $D_{\text{shape}}$, including the random model described in the previous paragraph.
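The Expected IoU objective can also be estimated by plain Monte Carlo, which is useful for scoring any candidate label. The sketch below is ours: the `random_shape` sampler (one point per side plus a few interior points) is an illustrative stand-in for the paper's shape sampler, not its exact implementation:

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def random_shape(box, n_extra=8, rng=None):
    """A random point set that generates `box`: one point on each of the four
    sides plus a few interior points (illustrative shape sampler)."""
    if rng is None:
        rng = np.random.default_rng()
    x1, y1, x2, y2 = box
    sides = np.array([[rng.uniform(x1, x2), y1],
                      [rng.uniform(x1, x2), y2],
                      [x1, rng.uniform(y1, y2)],
                      [x2, rng.uniform(y1, y2)]])
    interior = np.column_stack([rng.uniform(x1, x2, n_extra),
                                rng.uniform(y1, y2, n_extra)])
    return np.vstack([sides, interior])

def expected_iou(candidate, box, angle_deg, n_samples=200, rng=None):
    """Monte Carlo estimate of Expected IoU: average IoU between a candidate
    rotated label and the perfect box of each sampled shape."""
    if rng is None:
        rng = np.random.default_rng(0)
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    center = np.array([(box[0] + box[2]) / 2, (box[1] + box[3]) / 2])
    total = 0.0
    for _ in range(n_samples):
        pts = (random_shape(box, rng=rng) - center) @ rot.T + center
        perfect = (pts[:, 0].min(), pts[:, 1].min(),
                   pts[:, 0].max(), pts[:, 1].max())
        total += iou(candidate, perfect)
    return total / n_samples
```

Scoring the largest box label and the ellipse label with `expected_iou` at various angles reproduces the kind of comparison shown in Figure 4.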
When we model the shape as an ellipse, we can find the estimated bounding box as $\hat{b}_\theta = B(R_\theta(E(b_0)))$, where $E(b_0)$ is the largest inscribed ellipse inside $b_0$, expressed as $E(b_0) = \{(x, y) \mid \frac{(x - c_x)^2}{(w/2)^2} + \frac{(y - c_y)^2}{(h/2)^2} \le 1\}$, where $(c_x, c_y)$ is the location of the center of $b_0$ and $w, h$ are the width and height of $b_0$ respectively.
This equation is fast, simple to implement, and high-performing on modern vision datasets. The elliptical approximation can be implemented in as few lines of code as the largest box method (cf. Appendix A), yet it greatly improves performance. We see in Figure 5 that the ellipse labels are far more accurate than the largest box labels. One disadvantage of the elliptical box method, however, is that it can underestimate the object's size or aspect ratio. This still causes some noise in the labels, especially at large rotations. We mitigate this by allowing the model to adapt labels at higher rotations based on priors from lower rotations in Section 4.3.
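The ellipse label has a convenient closed form: the axis-aligned box around an ellipse with semi-axes $a, b$ rotated by $\theta$ has half-extents $\sqrt{a^2\cos^2\theta + b^2\sin^2\theta}$ and $\sqrt{a^2\sin^2\theta + b^2\cos^2\theta}$. A sketch of the method under that identity (function name ours, not the released code):

```python
import numpy as np

def ellipse_box(box, angle_deg, center):
    """Ellipse label: inscribe the largest ellipse in the original box
    (x1, y1, x2, y2), rotate it by angle_deg around `center`, and return the
    tight axis-aligned box, using the closed form for a rotated ellipse."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    a, b = (x2 - x1) / 2, (y2 - y1) / 2          # ellipse semi-axes
    theta = np.deg2rad(angle_deg)
    # Half-extents of the axis-aligned box around the rotated ellipse.
    hw = np.hypot(a * np.cos(theta), b * np.sin(theta))
    hh = np.hypot(a * np.sin(theta), b * np.cos(theta))
    # Rotate the ellipse's center around the image rotation center.
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    ncx, ncy = rot @ (np.array([cx, cy]) - center) + center
    return ncx - hw, ncy - hh, ncx + hw, ncy + hh
```

Unlike the largest box, a square box rotated by 45 degrees keeps its original size under this label (the inscribed circle is rotation-invariant), which is the intuition behind the tighter labels.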
We do not limit our analysis to the ellipse method. To perform a complete study, we developed four additional methods. To conserve space, full details and results for these methods are available in Appendix B of the supplement; we provide a quick summary here.
Scaled Octagon: We use an octagon with a scaling factor to interpolate between the largest box shape and a diamond shape.
Random Boxes: We sample random valid boxes and use those as ground truth labels.
RotIoU: We select the label that has the maximum IoU with the rotated ground truth box rather than the expected axis-aligned ground truth box.
COCO Shape: Rather than using random shapes for the optimization, we use the shapes from the COCO dataset. We leave results for this to the supplement, since the performance difference between this and the ellipse method is negligible, and we want this paper to be dataset-independent and easy to implement.
| COCO val2017 Ablations |
| (a) Previous method |
| Largest Box (e.g. [bochkovskiy2020yolov4, zoph2019learning, tan2019efficientnet]) | 35.20 | 28.37 | 22.34 | 18.47 |
| (b) Our Rotation Label Methods (Section 4.2) |
| Ellipse (Section 4.1.2) | 38.21 | 36.83 | 33.59 | 29.95 |
| (c) With Our Loss (Section 4.3) |
| Ellipse + RU Loss | 38.54 | 37.45 | 34.56 | 31.26 |
| Ellipse + RU Loss | 39.09 | 37.99 | 35.45 | 32.25 |
| Ellipse + RU Loss | 39.14 | 38.19 | 35.78 | 32.50 |
| Ellipse + RU Loss (Final) | 39.33 | 38.31 | 36.00 | 32.72 |
| Datasets (At Test Rotation) | Pascal VOC [everingham2010pascal] | Transparent Bin Picking [kalra_2020_cvpr] | Synthetic Fruit Dataset [synthfruit] | Oxford Pets [parkhi12a] |
| Largest Box (e.g. [bochkovskiy2020yolov4, zoph2019learning, tan2019efficientnet]) | 50.23 / 81.31 / 54.3 | 37.49 / 79.09 / 28.45 | 83.47 / 95.05 / 92.24 | 79.54 / 94.20 / 90.03 |
| Ellipse + RU Loss (Ours) | 52.89 / 81.57 / 57.97 | 50.36 / 81.78 / 56.76 | 84.83 / 95.83 / 93.17 | 81.28 / 94.37 / 91.09 |
As shown in Figure 4, the Expected IoU with random shapes is 72.9. This means attaining good performance at the stricter thresholds, like AP75, will be very difficult using just these labels. To tackle this problem, we create a custom loss function that adapts the regression loss to account for the uncertainty of the rotation. The idea is simple: if we are uncertain of the label, we turn off the regression loss when the model is close enough. The labels are perfectly certain at 0 degrees and grow more uncertain as the rotation approaches 45 degrees. We formalize the concept of certainty (Figure 6) as a function of the rotation angle $\theta$:
This function maps a rotation $\theta$ to an IoU threshold $\tau(\theta)$, which serves as an indicator for applying the regression loss. If the predicted box's IoU with the label is below $\tau(\theta)$, the regression loss is applied; otherwise the loss is turned off and the model's prediction is assumed correct. We parameterize $\tau$ with $\theta_c$, the angle at which $\tau(\theta_c) = 0.5$. We visualize $\tau$ in Figure 6. We bound it below by 0.5 since that is the threshold for anchor-matching in standard object detection architectures [lin2017focal].
This function allows the model to take the priors it learns at the confident rotations and apply them to the higher rotations, preventing it from overfitting to poor labels. We show the resulting gains in Table 8.
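The gating logic can be sketched as follows. The linear decay and the default $\theta_c = 30$ here are hypothetical placeholders (the paper's exact curve is in its Figure 6); only the endpoints ($\tau(0) = 1$, $\tau(\theta_c) = 0.5$, and the 0.5 floor) come from the text:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def certainty_threshold(angle_deg, theta_c=30.0):
    """tau(theta): 1.0 at zero rotation, decaying to the 0.5 floor at theta_c.
    The linear decay is a hypothetical stand-in for the paper's exact curve."""
    frac = min(abs(angle_deg), theta_c) / theta_c
    return max(1.0 - 0.5 * frac, 0.5)

def ru_regression_loss(pred_box, label_box, angle_deg, base_loss, theta_c=30.0):
    """Apply the regression loss only when the prediction disagrees with the
    (uncertain) rotated label by more than tau allows; otherwise trust the
    prediction and return zero loss."""
    if iou(pred_box, label_box) >= certainty_threshold(angle_deg, theta_c):
        return 0.0
    return base_loss(pred_box, label_box)
```

At low rotations $\tau$ is near 1, so the regression loss is almost always applied; at high rotations any prediction within 0.5 IoU of the noisy label is left alone.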
Our hardware setup contains only a single P100 GPU for training, and all our code is implemented in Detectron2 [wu2019detectron2] with PyTorch [paszke2019pytorch]. We use the default training pipeline for both Faster-RCNN [ren2015faster] and RetinaNet [lin2017focal]. We conduct most of our experiments on the standard COCO benchmark, since it contains a variety of objects with many different shapes, making it a challenging test set.
Training: Since we have only a single GPU, we can fit a batch size of only 3. To account for this, we increase the training time by around 5x over the default configurations. This allows us to match the pre-trained baselines available online for RetinaNet [lin2017focal] and Faster-RCNN [ren2015faster]. Since most datasets consist of right-side-up images, we sample rotation angles from a normal distribution with a mean of 0 and a standard deviation of 15 degrees for all experiments. Since this paper aims to find the optimal rotation augmentation method, not the strategy for applying rotation augmentation, we do not try other combinations; this is left for future work.
| COCO val2017 Results | AP | AP50 | AP75 |
| Largest Box (e.g. [bochkovskiy2020yolov4, zoph2019learning, tan2019efficientnet]) | 35.20 / 28.37 / 22.34 / 18.47 | 58.79 / 56.31 / 51.49 / 46.30 | 36.00 / 25.37 / 14.95 / 10.91 |
| Ellipse + RU Loss (Ours) | 39.33 / 38.31 / 36.00 / 32.72 | 60.08 / 58.66 / 55.73 / 51.60 | 41.74 / 40.71 / 38.05 / 33.97 |
Testing: For all datasets except COCO we do not have complete segmentation labels, so we only test on the standard test set. For COCO we generate our test set by taking the COCO val2017 set and rotating it to simulate out-of-distribution rotations. We then bucket these rotations in intervals of 10 degrees and evaluate using segmentation labels to generate ground truth. We leave COCO results for Faster-RCNN to the supplement because they are similar to the results for RetinaNet shown below.
In Figure 8 and the accompanying table, we conduct a thorough ablation study on both the method for choosing the label and the impact of the RU Loss function.
In Section 4.1.1 we assumed that the optimal method for label rotation should maximize label accuracy, which we approximated as Expected IoU (Eq. 14). In Figure 8 we demonstrate a strong correlation between Expected IoU and performance on COCO across all methods, supporting our first-principles derivation. We see similar correlations at other angles as well.
In Section 4.1 we introduced several potential methods for rotating a box label. In the ablation Table 8b we show that the ellipse leads to the best performance across all rotations, up to a small noise tolerance. It is also important to note that every method we tried performs significantly better than the largest box, showing the importance of fixing this issue.
Our best performing method consists of using both Ellipse-based label rotation and RU Loss. In this section, we show it leads to much better performance across multiple datasets and approximates segmentation-based rotation augmentations on COCO.
In Table 10, we provide four datasets where our method of rotation augmentation improves performance while the previous one (largest box) hurts performance. The effect is especially bad at stricter thresholds such as AP75. The gap is also larger on complex datasets such as transparent object bin picking, where the largest box reduces performance by almost 50% and ours increases it by 4.5%.
Our method significantly outperforms the original largest box method, and also outperforms not using rotation at all for AP, AP50, and AP75 across all held-out test angles on COCO (Figure 1 and Figure 11). In the case of AP50, we show improvements very similar to using segmentation-based labels. This is a substantial improvement, given that the largest box method hurts rotation performance.
The widespread Largest Box method (e.g. [xi2018sr, bochkovskiy2020yolov4, zoph2019learning, tan2019efficientnet, montserrat2017training, liu2016novel]) is based on the folk wisdom of maximizing overlap. Instead, we show that by maximizing Expected IoU and accounting for label certainty in the loss, we can completely match the performance of perfect “segmentation-based” labels at AP50 while also achieving significant gains for AP and AP75. These results represent a step toward achieving rotation invariance for object detection models, while adding only a few lines of complexity to object detection codebases.
We thank Yuri Boykov, Tomas Gerlich, Abhijit Ghosh, Olga Veksler and Kartik Venkataraman for their helpful discussions and edits to improve the paper.