Towards Rotation Invariance in Object Detection

by Agastya Kalra, et al.

Rotation augmentations generally improve a model's invariance/equivariance to rotation - except in object detection. In object detection the shape is not known, therefore rotation creates a label ambiguity. We show that the de-facto method for bounding box label rotation, the Largest Box Method, creates very large labels, leading to poor performance and in many cases worse performance than using no rotation at all. We propose a new method of rotation augmentation that can be implemented in a few lines of code. First, we create a differentiable approximation of label accuracy and show that axis-aligning the bounding box around an ellipse is optimal. We then introduce Rotation Uncertainty (RU) Loss, allowing the model to adapt to the uncertainty of the labels. On five different datasets (including COCO, PascalVOC, and Transparent Object Bin Picking), this approach improves the rotational invariance of both one-stage and two-stage architectures when measured with AP, AP50, and AP75. The code is available at




Code Repositories


Our ICCV Submission materials


1 Introduction

It is desirable for object detectors to work when scenes are rotated. But there is a problem: methods like Convolutional Neural Networks (CNNs) may be scale and translation invariant, but they are not rotation invariant [Goodfellow-et-al-2016]. To overcome this problem, the training dataset can be expanded to include data at new rotation angles. This is known as rotation augmentation. In object detection, rotation augmentation can be abstracted as follows: given an original bounding box and any desired angle of rotation, how should we determine the axis-aligned rotated bounding box label? If the shape of the object is known, this is quite simple: we rotate the object shape and re-compute the bounding box. However, in the case of object detection, the shape is unknown.

The prevailing folk wisdom in the community is to select a label large enough to completely overlap the rotated box [zoph2019learning]. In studying this problem, we find that this method may hurt performance, and on COCO [lin2014microsoft], we find that every other prior we tried is better. Yet somehow this “Largest Box” method is very prevalent both in academia and in large scale object detection codebases [xi2018sr, bochkovskiy2020yolov4, zoph2019learning, tan2019efficientnet, montserrat2017training, liu2016novel, albumentations-team, chen2019mmdetection, abadi2016tensorflow, jung2018imgaug, casado2019clodsa, solt2019]. Indeed, recent analysis has found that the largest box is robust only to very small rotations [fastai].

Figure 1: A method is proposed to properly rotate a bounding box for rotation augmentation. The previous solution of largest box is an overestimate of the perfect bounding box for a rotated scene. See table for how choice of rotation augmentation affects object detection performance.

In this paper, we propose an advance on the largest box solution that achieves significantly better performance on five object detection datasets while retaining the simplicity of an implementation that takes only a few lines of code.

1.1 Contributions

In a nutshell, our solution has two aspects. First, we derive an elliptical shape prior from first principles to determine the rotated box label. We compare it to many other novel priors and show this is optimal. Second, we introduce a novel Rotation Uncertainty (RU) Loss function, which allows the network to adapt the labels at higher rotations using priors from lower rotations based on label certainty. We demonstrate the effectiveness of this solution both by improving performance on datasets where rotation is important, such as Pascal VOC [everingham2010pascal] and Transparent Object Bin Picking [kalra_2020_cvpr], and by generalizing to novel test time rotations on MS COCO [lin2014microsoft] (Figure 1).

1.2 Scope

Rotation data augmentation in object detection is not new. This paper is not about finding the best overall way to use a rotation data augmentation. For that, a brute-force search or methods like AutoAugment [zoph2019learning] might be better suited. This paper focuses solely on methods to perform a rotation augmentation on axis-aligned bounding boxes. When implemented, the proposed modifications boil down to a few lines of code and leave little reason to use the current Largest Box method.

2 Related Work

Data Augmentations are an effective technique to boost the performance of object detection. Data augmentation increases the quantity and improves the diversity of data. Data augmentations are of two types. Photometric transforms modify the color channels such that the detector becomes invariant to changes in lighting and color. Classical photometric techniques include adding Gaussian blur or color jitter. Modern photometric augmentations like Cutout [devries2017improved] and CutMix [yun2019cutmix] randomly remove patches of the image. On the other hand, geometric transforms modify the image’s geometry, making the detector invariant to position and orientation. Geometric modifications require corresponding changes to the labels as well. Geometric transforms are difficult to implement [albumentations-team] and contribute more towards accuracy improvements [taylor2018improving]. We focus specifically on rotation augmentations and object detection.

Figure 2: Comparison between our method and other methods. We show example predictions from models trained with each rotation augmentation above. Our method has comparable performance to using perfect segmentation labels without requiring the extra shape information.
| Rotation Method | AP50 (Coarse) | AP75 (Fine) | Shape Label |
|---|---|---|---|
| No Rotation | Med | Med | No |
| Largest Box (e.g. [abadi2016tensorflow, jung2018imgaug, albumentations-team]) | Med | Low | No |
| Ellipse + RU Loss (Ours) | Very High | High | No |
| Perfect Box (Gold Std) | Very High | Very High | Yes |

Rotation Augmentation in Object Detection is currently done by the largest box method in major repositories (e.g. [albumentations-team, chen2019mmdetection, abadi2016tensorflow, jung2018imgaug, casado2019clodsa, solt2019]) and publications (e.g. [xi2018sr, bochkovskiy2020yolov4, zoph2019learning, tan2019efficientnet, montserrat2017training, liu2016novel]) that do bounding box rotation for deep learning object detection. The largest box method does a great job guaranteeing containment, but at large angles, it severely overestimates the bounding box’s size. Figure 2 shows an example of these over-sized bounding boxes. For that reason, FastAI [fastai] recommends rotation of no more than 3 degrees. Some recent work, such as AutoAugment [zoph2019learning, tan2019efficientnet], uses rotation as part of a complex learned data augmentation scheme. While learning rotation augmentation directly is interesting, it requires extensive computing resources. We seek to achieve the simplicity of the largest box, with performance improvements for larger angles.

Oriented Bounding Box detection, a sister task of object detection, is the task of predicting non-axis-aligned bounding boxes, also known as oriented bounding boxes. Several methods, such as [Ding_2019_CVPR, cheng2018learning, liu2017learning], aim to achieve rotation invariance when predicting rotated bounding boxes. However, these methods already have labelled rotated boxes as input and do not end up with loose boxes when the input image is rotated. As this is a different task, it is out of the scope of this paper. Our paper focuses on axis-aligned bounding boxes only.

Rotational Invariance is an important problem to solve in object detection. Classical computer vision methods achieved rotational invariance by extracting features from images [wang2015ordinal, liu2018rotation, liu2014rotation, schmidt2012learning]. With the rise of neural networks, newer methods attempt to modify the architecture to achieve rotation invariance [cheng2016rifd, cheng2016learning, xi2018sr]. These methods rotate the input images and add special layers that learn the object’s orientation in the image. Our general-purpose method can aid these methods as well.

3 Background

3.1 Rotation Augmentation for Object Detection

An image is parameterized by $x$ and $y$ coordinates. Suppose the image contains an object with shape $S$. Let $S$ denote the shape set that describes all points in an object:

$$S = \{(x, y) : (x, y) \text{ is a point of the object}\}$$

In object detection, a bounding box is defined to be the tightest fitting axis-aligned box around a shape. Therefore a shape $S$ determines the coordinates of a bounding box, $b$. Each of the four edges of the bounding box intersects at least one element of the shape set. Let the operator $B(\cdot)$ represent the perfect conversion of a shape to a bounding box:

$$b = B(S) \tag{2}$$

Figure 3: Our Ellipse method leads to good initial training labels while the Largest Box overestimates the labels. From left to right: (1) The original bounding box prior to rotation. (2) The oversized Largest Box estimate of the ground truth label post rotation. (3) The tighter Ellipse estimate (Section 4.1). (4) The actual ground truth which we get from segmentation shape labels. (5) The set of all possible ground truth boxes given a rotation and an initial box.

The operator $B$ extracts the bounding box (tightest fitting axis-aligned box) for a shape $S$ by taking the minimum/maximum coordinates of the shape. As $B$ is not one-to-one, the same bounding box can be generated by many shapes. For example, a square of side length $d$ and a circle of diameter $d$ are just two of many unique shapes that generate the same bounding box. More formally, let $\mathcal{S}_b$ denote the set of shapes that could possibly generate a bounding box $b$, such that

$$\mathcal{S}_b = \{\, S \sim P_S \mid B(S) = b \,\} \tag{3}$$

where $P_S$ is the dataset-specific distribution of shapes. Let us consider the problem of rotation augmentation where an image and corresponding box label $b$ is rotated by angle $\theta$. If the shape of the object is known, then we can rotate the original shape by angle $\theta$ using a rotation operator: $R_\theta(S)$. In analogy to Equation 2, we can then use the perfect method to obtain an axis-aligned bounding box for the rotated image as:

$$b^*_\theta = B(R_\theta(S^*)) \tag{4}$$

We call this method perfect labels, where $S^*$ is the actual shape of the object for a given bounding box $b$. However, this requires shape labels, which are not available for object detection. In object detection, humans label boxes by implicitly segmenting the shape. Without knowing the shape labels, any shape in $\mathcal{S}_b$ could be $S^*$, leading to many possible boxes $b^*_\theta$. This paper seeks a method to estimate the rotated bounding box $\hat{b}_\theta$ when we do not know the shape. We are only provided with the original bounding box $b$, which we will hereafter write as $b_0$ by making explicit that $\theta = 0$. The problem statement follows.
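As a minimal sketch of the perfect-label computation described above, assuming the object shape is available as an array of boundary points (for instance from a segmentation mask), one can rotate the points and take their minimum and maximum coordinates. The function names and the diamond example are illustrative and do not come from the paper's code.

```python
import numpy as np

def rotate_points(points, theta_deg, center):
    """Rotate an (N, 2) array of (x, y) points by theta_deg about `center`."""
    theta = np.deg2rad(theta_deg)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return (points - center) @ rot.T + center

def bounding_box(points):
    """Tightest axis-aligned box (x_min, y_min, x_max, y_max) around the points."""
    return np.concatenate([points.min(axis=0), points.max(axis=0)])

def perfect_label(shape_points, theta_deg, center):
    """Rotate the known shape, then re-compute its axis-aligned box."""
    return bounding_box(rotate_points(shape_points, theta_deg, center))

# Illustrative shape: a diamond whose bounding box is the unit square.
diamond = np.array([[0.5, 0.0], [1.0, 0.5], [0.5, 1.0], [0.0, 0.5]])
box_before = bounding_box(diamond)
box_after = perfect_label(diamond, 45.0, center=np.array([0.5, 0.5]))
side_after = box_after[2] - box_after[0]
```

In this example the box shrinks after a 45 degree rotation, because the diamond becomes an axis-aligned square with a shorter side. This illustrates why the rotated label depends on the shape and not only on the original box.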

Problem Statement:

Given only an input bounding box $b_0$ and an angle $\theta$ by which the image should be rotated, find the axis-aligned bounding box $\hat{b}_\theta$ that: (1) has high IoU with the perfect label $b^*_\theta$; and (2) improves model performance on rotated versions of vision datasets.

3.2 Largest Valid Box Method

Rotation augmentation without shape knowledge is not a new problem statement. The de facto method in the object detection community for determining the bounding box post-rotation with no shape priors is the largest box method. The largest box method is extremely prevalent (e.g. [montserrat2017training, liu2016novel, albumentations-team, chen2019mmdetection, abadi2016tensorflow, jung2018imgaug, casado2019clodsa, solt2019, kathuria_2018, kdnuggets, lozuwa_2019, saxena_2020, solawetz_2020, matlab]). Just like our proposed method, the largest box takes only the original bounding box $b_0$ and $\theta$ as input. From Equation 3 it is clear that several shapes could define $b_0$. This creates an ambiguity problem. The largest box method chooses the single largest of these possible shapes in area, $S_{\text{largest}}$. This shape is simply the box itself (Table 1). Treating this as the object shape, Equation 4 can be adapted to obtain

$$\hat{b}^{\text{largest}}_\theta = B(R_\theta(S_{\text{largest}}))$$

where $B$ is the shape-to-box operator and $R_\theta$ is the rotation by angle $\theta$.


The benefit of this method is that it produces a box that is guaranteed to contain the original object [zoph2019learning], and it is easy to implement. The downside is that the method produces oversized boxes [fastai, zoph2019learning, exchange_1968, albumentations-team_q, open-mmlab_q, aleju], and if used generously, hurts performance more than it helps (Table 11). Surprisingly, to the best of our knowledge, including personal communication with practitioners and posts on internet forums, no alternatives have been adopted. We hope our method will change that.
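The largest box method described above can be sketched in a few lines: rotate the four corners of the original box and take the axis-aligned box around them. This is an illustrative NumPy version, not the code of any particular library; the function name and the `(x_min, y_min, x_max, y_max)` box layout are assumptions.

```python
import numpy as np

def largest_box(box, theta_deg, center):
    """Axis-aligned box around the rotated corners of `box`.

    box: (x_min, y_min, x_max, y_max); center: (x, y) pivot of the rotation.
    """
    x_min, y_min, x_max, y_max = box
    corners = np.array([[x_min, y_min], [x_max, y_min],
                        [x_max, y_max], [x_min, y_max]], dtype=float)
    theta = np.deg2rad(theta_deg)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    rotated = (corners - center) @ rot.T + center
    return np.concatenate([rotated.min(axis=0), rotated.max(axis=0)])

# A 2 x 1 box rotated by 30 degrees about its own center.
box = np.array([0.0, 0.0, 2.0, 1.0])
rotated_box = largest_box(box, 30.0, center=np.array([1.0, 0.5]))
width = rotated_box[2] - rotated_box[0]
height = rotated_box[3] - rotated_box[1]
area_ratio = (width * height) / 2.0  # label area relative to the original box
```

For this example the label area roughly doubles at 30 degrees, which gives a sense of how quickly the largest box grows with the rotation angle.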

4 Proposed Solution

We now describe our solution to the problem: given $b_0$ and desired rotation angle $\theta$, find $\hat{b}_\theta$. In a nutshell, our solution estimates a rotated bounding box by assuming the original shape is an ellipse (Table 1, Figure 3) and rotating accordingly (Section 4.1). We then adapt the loss function to account for error in the labels (Section 4.3).

4.1 Ellipse Method

In this section, we first derive the ellipse assumption from first principles by trying to find the shape that is most likely to have high overlap with potential ground truth boxes. Then we discuss the implementation and intuition of the ellipse method. Finally, we mention other novel methods we developed.

4.1.1 Ellipse from Maximizing Expected IoU

We start with a simple assumption: the optimal method for determining a bounding box post-rotation augmentation should maximize label accuracy, which in the case of object detection is measured in IoU.
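Since label accuracy is measured in IoU here, a short sketch of IoU for two axis-aligned boxes may help. The function name and the `(x_min, y_min, x_max, y_max)` layout are illustrative choices, and the two boxes are made-up numbers.

```python
def box_iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

tight_box = (0.0, 0.0, 2.0, 2.0)    # stands in for the true label
loose_box = (-1.0, -1.0, 3.0, 3.0)  # fully contains the true label
iou = box_iou(tight_box, loose_box)
```

Here the loose box fully contains the tight one, yet the IoU is only 0.25. A label can therefore guarantee containment and still score poorly under the metric.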

We define $\hat{b}_\theta$ as the optimal rotated bounding box. We are provided the input angle $\theta$ and box $b_0$. From Equation 3, this box could have been generated from any number of shapes $S \in \mathcal{S}_{b_0}$. For each shape we can use the “perfect method” from Equation 4 to obtain a potential rotated bounding box. Since multiple shapes can lead to the same rotated box, we obtain a collection of possible bounding boxes, which we write as the set:

$$\mathcal{B}_{\theta, b_0} = \{\, B(R_\theta(S)) \mid S \in \mathcal{S}_{b_0} \,\}$$

Hereafter, and without loss of generality, the paper will assume that $b_0$ is the input, allowing notation to be simplified:

$$\mathcal{B}_\theta = \mathcal{B}_{\theta, b_0}$$

Now the task becomes to pick the “best” of the possible bounding boxes in $\mathcal{B}_\theta$. Recall that the de facto solution is to choose the largest box in $\mathcal{B}_\theta$. This largest box is guaranteed to contain the object. However, optimizing for containment does not seem like a good choice to directly address the metric of AP because AP uses IoU to determine true positives, not containment. A more relevant goal for object detection is to select a box that maximizes:

$$\hat{b}_\theta = \arg\max_{b} \; \operatorname{IoU}(b, b^*_\theta)$$

In which case $\hat{b}_\theta = b^*_\theta$. Of course, we are not given $b^*_\theta$. So for the moment, let us assume any of the bounding boxes in $\mathcal{B}_\theta$ has an equal chance of being the perfect box. Then, it would make sense to optimize over:

$$\hat{b}_\theta = \arg\max_{b} \sum_{b' \in \mathcal{B}_\theta} \operatorname{IoU}(b, b') \tag{10}$$

Now, let us break the assumption that each candidate box in $\mathcal{B}_\theta$ is equally likely to be the perfect box. Indeed, we know that many shapes can produce the same box, so certain boxes are more likely than others. For example, the only shape that can produce the largest box is the original box itself, whereas other rotated boxes can be generated by multiple object shapes in the dataset. Denote $p(b')$ as the probability that box $b' = b^*_\theta$. Then Equation 10 can be reformulated as:

$$\hat{b}_\theta = \arg\max_{b} \sum_{b' \in \mathcal{B}_\theta} p(b') \, \operatorname{IoU}(b, b')$$

Readers may recognize this equation as being analogous to an expectation. We refer to this in the paper as the Expected IoU. The expected IoU is not directly tractable: we do not know $p(b')$ a priori. However, if we can sample $K$ random shapes $S_1, \dots, S_K$ from a dataset distribution over shapes $P_S$, where $B(S_i) = b_0$, we get the following optimization objective:

$$\hat{b}_\theta = \arg\max_{b} \; \frac{1}{K} \sum_{i=1}^{K} \operatorname{IoU}\big(b, B(R_\theta(S_i))\big) \tag{12}$$

Method Shape Definition
Table 1: Our method can be compared with the Largest Box in the shape domain. The implementation difference is one line of code.

Since not all object detection datasets have shape labels, we sample $S_i$ by generating random shapes that touch each side of $b_0$ once. This way we are not dataset-specific. We analyze using COCO shapes in the supplement and show the performance is extremely similar. The above equation is fully differentiable, and so we can solve it with gradient ascent. The problem here is that we would then have to solve this equation for every $\theta$ and every box $b_0$, and this is not practical. Therefore we generalize this further to a canonical shape.
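The sampled objective can be illustrated with a small Monte Carlo estimate. The sketch below is not the paper's procedure: the shape sampler (four points, one placed uniformly on each side of the box) is an assumed stand-in for "random shapes that touch each side once", and all function names are hypothetical. It compares the expected IoU of the label produced by the box shape with that of the label produced by the inscribed ellipse.

```python
import numpy as np

def rotated_bbox(points, theta_deg):
    """Axis-aligned box of (N, 2) points after rotation about the origin."""
    t = np.deg2rad(theta_deg)
    c, s = np.cos(t), np.sin(t)
    p = points @ np.array([[c, -s], [s, c]]).T
    return np.concatenate([p.min(axis=0), p.max(axis=0)])

def box_iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def random_shape(w, h, rng):
    """Assumed sampler: one point on each side of a w x h box centered at 0."""
    u = rng.uniform(-0.5, 0.5, size=4)
    return np.array([[u[0] * w, h / 2], [u[1] * w, -h / 2],
                     [-w / 2, u[2] * h], [w / 2, u[3] * h]])

def box_corners(w, h):
    return np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                     [w / 2, h / 2], [-w / 2, h / 2]])

def ellipse_points(w, h, n=360):
    a = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.stack([w / 2 * np.cos(a), h / 2 * np.sin(a)], axis=1)

def expected_iou(candidate_shape, w, h, theta_deg, n_samples=2000, seed=0):
    """Mean IoU between the candidate's rotated label and sampled true labels."""
    rng = np.random.default_rng(seed)
    label = rotated_bbox(candidate_shape, theta_deg)
    ious = [box_iou(label, rotated_bbox(random_shape(w, h, rng), theta_deg))
            for _ in range(n_samples)]
    return float(np.mean(ious))

eiou_largest = expected_iou(box_corners(1.0, 1.0), 1.0, 1.0, theta_deg=30.0)
eiou_ellipse = expected_iou(ellipse_points(1.0, 1.0), 1.0, 1.0, theta_deg=30.0)
```

Under this assumed sampler the ellipse label scores a higher expected IoU than the largest box label at 30 degrees. The exact values depend on the sampler and should not be compared with the figures reported in the paper.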

| Label Method | Largest Box | Optimized Shape (Ours) | Ellipse (Ours) |
|---|---|---|---|
| Expected IoU | 60.8 | 72.9 | 72.9 |

Figure 4: The optimal shape to maximize expected IoU with potential ground truth boxes converges to an ellipse. Curve showing the progression of gradient ascent starting from the largest box and converging to an elliptical shape. The final converged expected IoU for the optimized shape matches that of the largest inscribed ellipse.
Optimizing Equation 12 via a canonical shape:

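Instead of solving Equation 12 for every possible combination of $\theta$ and $b_0$, we attempt to find a shape that is optimal across different input bounding boxes. This way, we could solve for some best shape $\hat{S}$, and solve for $\hat{b}_\theta$ as follows:

$$\hat{b}_\theta = B(R_\theta(\hat{S})) \tag{13}$$

To obtain the likely shape, we combine Equations 12 and 13 to optimize the quantity:

$$\hat{S} = \arg\max_{S} \; \mathbb{E}_{\theta,\, b_0} \left[ \frac{1}{K} \sum_{i=1}^{K} \operatorname{IoU}\big(B(R_\theta(S)),\, B(R_\theta(S_i))\big) \right] \tag{14}$$

where the candidate shape $S$ is scaled to fit each input box $b_0$ and the $S_i$ are sampled shapes with $B(S_i) = b_0$.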

AP50 AP75
Largest Box 98.2 93.79 86.31 82.9 59.2 25.6 19.9 17.3
Ellipse (Ours) 99.6 98.6 97.0 96.5 86.8 56.2 47.2 46.5
Figure 5: Comparing Label AP (assuming uniform confidence) at different IoU thresholds for both methods under rotation augmentation. Ours is significantly better at AP50 and AP75.

Note that we now optimize over all rotation angles and aspect ratios simultaneously. This adds enough constraints to find a unique shape. The goal is to find the shape that produces an augmented bounding box that has high IoU with likely ground truth boxes. Since the bounding-box operator $B$ and the rotation operator $R_\theta$ are differentiable, Equation 14 can be optimized through gradient ascent to solve for the best shape. We provide details, pseudo-code, and some analysis in the supplement. The stable solution found by gradient ascent is that of an elliptical shape. We show the progression of gradient ascent in Figure 4 from the largest box shape to a circle. If we change the aspect ratio, it simply converges to the largest inscribed ellipse. Also in Figure 4, we show the Expected IoU for the Largest Box shape is much lower than the Ellipse, and in Figure 5 we show that the resulting AP of the Ellipse labels is much better. The elliptical solution is similar to the optimized shape for various tested shape distributions, including the random model described in the previous paragraph.

4.1.2 The Ellipse Method

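When we model the shape as an ellipse, we can find the estimated bounding box as:

$$\hat{b}_\theta = B(R_\theta(S_E))$$

where $S_E$ is the largest inscribed ellipse inside $b_0$, expressed as:

$$S_E = \left\{ (x, y) : \frac{(x - c_x)^2}{(w/2)^2} + \frac{(y - c_y)^2}{(h/2)^2} \le 1 \right\}$$

where $(c_x, c_y)$ is the location of the center of $b_0$ and $w, h$ are the width and height of $b_0$ respectively.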

This equation is fast, simple to implement, and high-performing on modern vision datasets. The elliptical approximation can be implemented in the same line of code as the largest box method (cf. Appendix A), yet it greatly improves performance. We see in Figure 5 the ellipse labels are far more accurate than the largest box labels. However, one disadvantage with the proposed elliptical box method is that the elliptical box can underestimate the object size or aspect ratio. This still causes some noise in the labels, especially at large rotations. We mitigate this by allowing the model to adapt labels at higher rotations based on priors from lower rotations in Section 4.3.
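One way to sketch the ellipse method uses the standard closed form for the axis-aligned extent of a rotated ellipse: for semi-axes $a$ and $b$ rotated by $\theta$, the half-width is $\sqrt{a^2\cos^2\theta + b^2\sin^2\theta}$ and the half-height is $\sqrt{a^2\sin^2\theta + b^2\cos^2\theta}$. The code below is an illustrative implementation under that formula, not the paper's Appendix A code; the function name and box layout are assumptions.

```python
import numpy as np

def ellipse_box(box, theta_deg, center):
    """Rotated label from the ellipse inscribed in `box`.

    box: (x_min, y_min, x_max, y_max); center: (x, y) pivot of the rotation.
    """
    x_min, y_min, x_max, y_max = box
    w, h = x_max - x_min, y_max - y_min
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    t = np.deg2rad(theta_deg)
    c, s = np.cos(t), np.sin(t)
    # The box center rotates about the pivot like any other point.
    px, py = center
    new_cx = c * (cx - px) - s * (cy - py) + px
    new_cy = s * (cx - px) + c * (cy - py) + py
    # Axis-aligned half-extents of the rotated inscribed ellipse.
    half_w = 0.5 * np.sqrt((w * c) ** 2 + (h * s) ** 2)
    half_h = 0.5 * np.sqrt((w * s) ** 2 + (h * c) ** 2)
    return np.array([new_cx - half_w, new_cy - half_h,
                     new_cx + half_w, new_cy + half_h])

# Same 2 x 1 box and 30 degree rotation as a largest-box example would use.
box = np.array([0.0, 0.0, 2.0, 1.0])
label = ellipse_box(box, 30.0, center=(1.0, 0.5))
label_w, label_h = label[2] - label[0], label[3] - label[1]
quarter_turn = ellipse_box(box, 90.0, center=(1.0, 0.5))
```

Compared with rotating the four box corners, only the half-extent computation changes. At a 90 degree rotation the label is the original box with width and height swapped, as expected.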

4.2 Other Methods

We do not limit our analysis to the ellipse method. To perform a complete study, we came up with four additional methods. To conserve space, full details and results of these novel methods are available in Appendix B of the supplement; we provide a quick summary of these methods here.

  • Scaled Octagon: We use an octagon with a scaling factor to interpolate between the largest box shape and a diamond shape.

  • Random Boxes: We sample random valid boxes and use those as ground truth labels.

  • RotIoU: We select the label that has the maximum IoU with the rotated ground truth box rather than the expected axis-aligned ground truth box.

  • COCO Shape: Rather than using random shapes for the optimization, we use the shapes from the COCO dataset. We keep results from this in the supplement since the performance difference between this and the ellipse method is negligible and we want this paper to be dataset independent and easy to implement.

4.3 Rotation Uncertainty Loss

Figure 6: Rotational certainty used by RU Loss as a function of $\theta$, plotted for different values of its hyper-parameter, the angle at which the certainty reaches 0.5.
Figure 7: The Expected IoU of a rotation method is heavily correlated with performance. Our final Ellipse is optimal for both.
COCO val2017 Ablations
(a) Previous method
Largest Box(e.g. [bochkovskiy2020yolov4, zoph2019learning, tan2019efficientnet]) 35.20 28.37 22.34 18.47
(b) Our Rotation Label Methods (Section 4.2)
Random 37.39 35.59 32.22 28.33
Octagon 35.82 31.64 27.16 23.54
Octagon 36.52 34.57 31.65 28.15
Octagon (Diamond) 38.36 35.39 28.76 22.92
RotIoU 38.32 36.48 32.68 28.94
Ellipse (Section 4.1.2) 38.21 36.83 33.59 29.95
(c) With Our Loss (Section 4.3)
Ellipse + RU Loss 38.54 37.45 34.56 31.26
Ellipse + RU Loss 39.09 37.99 35.45 32.25
Ellipse + RU Loss 39.14 38.19 35.78 32.50
Ellipse + RU Loss (Final) 39.33 38.31 36.00 32.72
Figure 8: The AP at different test rotations on the COCO val2017 set for different methods. (a) The previous method of largest box leads to the worst performance; every other idea we had was better. (b) The Ellipse is the best of all the label generation methods. (c) Our RU Loss with the final hyper-parameter setting leads to the best AP across all rotations and therefore we use this as our final method. Note: We bold within 0.2 of best result.
Figure 9: Example bounding box predictions for Largest Box model (top row) and Ellipse model (bottom row) for all 5 datasets. We can see that ours produces tighter bounding boxes overall.
Datasets (At Test Rotation): Pascal VOC [everingham2010pascal] (VOC), Transparent Bin Picking [kalra_2020_cvpr] (TBP), Synthetic Fruit Dataset [synthfruit] (Fruit), Oxford Pets [parkhi12a] (Pets).

| Methods | VOC AP | VOC AP50 | VOC AP75 | TBP AP | TBP AP50 | TBP AP75 | Fruit AP | Fruit AP50 | Fruit AP75 | Pets AP | Pets AP50 | Pets AP75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No Rotation | 51.94 | 80.91 | 56.54 | 48.53 | 79.14 | 54.3 | 84.3 | 95.07 | 92.6 | 80.70 | 92.80 | 88.76 |
| Largest Box (e.g. [bochkovskiy2020yolov4, zoph2019learning, tan2019efficientnet]) | 50.23 | 81.31 | 54.3 | 37.49 | 79.09 | 28.45 | 83.47 | 95.05 | 92.24 | 79.54 | 94.20 | 90.03 |
| (relative improvement) | -3.29% | 0.49% | -3.96% | -22.7% | -0.06% | -47.6% | -0.98% | -0.02% | -0.39% | -1.43% | 1.56% | 1.43% |
| Ellipse + RU Loss (Ours) | 52.89 | 81.57 | 57.97 | 50.36 | 81.78 | 56.76 | 84.83 | 95.83 | 93.17 | 81.28 | 94.37 | 91.09 |
| (relative improvement) | 1.84% | 0.82% | 2.53% | 3.78% | 3.35% | 4.53% | 1.05% | 0.80% | 0.62% | 0.72% | 1.69% | 2.63% |

Figure 10: Across four separate datasets we show that our method of rotation augmentation leads to improved performance where the previous method hurts performance. This is especially clear in the case of transparent object bin picking, where the largest box is almost 50% worse at AP75 and ours is 4.5% better.

As shown in Figure 4, the expected IoU with random shapes is 72.9. This means attaining good performance at the higher APs, like AP75, will be very difficult using just these labels. To tackle this problem, we create a custom loss function that adapts the regression loss to account for the uncertainty of the rotation. The idea is simple: if we are uncertain of the label, we turn off the regression loss when the model's prediction is close enough. The labels become more uncertain as the rotation angle grows and are perfectly certain when no rotation is applied. We formalize the concept of certainty (Figure 6) as a function of $\theta$:


This function maps a rotation $\theta$ to an IoU threshold. We use this IoU threshold as an indicator for applying the regression loss. If the IoU between the predicted box and the label is below the threshold, the regression loss is applied; otherwise it is not, and the model’s prediction is assumed to be correct. We parameterize the function with a single hyper-parameter: the angle at which the threshold reaches 0.5. We visualize the function in Figure 6. We bound it by 0.5 since that is the threshold for anchor-matching in standard object detection architectures [lin2017focal].

This function allows the model to take the priors it learns at the confident rotations and apply them to the higher rotations, preventing it from overfitting to poor labels. We show its effect in Table 8.
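The gating idea can be sketched as follows. The certainty schedule used here is a hypothetical placeholder, not the paper's function: it is a linear ramp that equals 1.0 with no rotation, falls to 0.5 at the angle `alpha_deg`, and is floored at 0.5. The function names, the averaging over all boxes, and the example numbers are likewise assumptions made for illustration.

```python
import numpy as np

def rotation_certainty(theta_deg, alpha_deg):
    """Hypothetical certainty schedule (placeholder, NOT the paper's formula).

    Returns 1.0 at 0 degrees, 0.5 at alpha_deg, and never less than 0.5.
    """
    frac = np.abs(theta_deg) / alpha_deg
    return np.maximum(0.5, 1.0 - 0.5 * frac)

def ru_regression_loss(per_box_loss, ious, theta_deg, alpha_deg):
    """Average regression loss, counting only boxes below the IoU threshold.

    per_box_loss: regression loss of each matched prediction.
    ious: IoU of each matched prediction with its (possibly noisy) label.
    """
    threshold = rotation_certainty(theta_deg, alpha_deg)
    active = ious < threshold  # predictions that are not yet "close enough"
    return float(np.sum(per_box_loss * active) / len(per_box_loss))

losses = np.array([0.2, 0.4, 0.1])
ious = np.array([0.55, 0.70, 0.90])
loss_no_rotation = ru_regression_loss(losses, ious, theta_deg=0.0, alpha_deg=20.0)
loss_mid_rotation = ru_regression_loss(losses, ious, theta_deg=10.0, alpha_deg=20.0)
loss_high_rotation = ru_regression_loss(losses, ious, theta_deg=20.0, alpha_deg=20.0)
```

With no rotation every box contributes to the loss. As the rotation grows, predictions that already overlap the uncertain label well enough stop receiving a regression gradient.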

5 Results

5.1 Setup

Our hardware setup contains only a single P100 GPU for training, and all our code is implemented in Detectron2 [wu2019detectron2] with PyTorch [paszke2019pytorch]. We use the default training pipeline for both Faster-RCNN [ren2015faster] and RetinaNet [lin2017focal]. We conduct most of our experiments on the standard COCO benchmark since it contains a variety of objects with many different shapes, making it a challenging test set.

Training: Since we have only a single GPU, we can only fit a batch size of 3. To account for this, we increase the training time by around 5x from the default configurations. This allows us to match the pre-trained baselines available online for RetinaNet [lin2017focal] and Faster-RCNN [ren2015faster]. Since most datasets are right-side-up images, we sample the training rotation angle from a normal distribution with a mean of 0 and a standard deviation of 15 degrees for all experiments. Since this paper aims to find the optimal rotation augmentation method, not the strategy for applying rotation augmentation, we do not try other combinations. This may be left for future work.
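A minimal sketch of this sampling scheme, assuming one angle is drawn independently per training image (the seed and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# One rotation angle (in degrees) per training image: mean 0, std 15.
angles = rng.normal(loc=0.0, scale=15.0, size=10_000)

# Share of sampled angles that stay within two standard deviations.
frac_within_30 = float(np.mean(np.abs(angles) <= 30.0))
```

Most sampled angles are small, so training images stay close to upright, while larger rotations still appear occasionally.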

COCO val2017 Results AP AP50 AP75
No Rotation 39.26 37.54 33.68 29.19 59.68 56.88 51.35 45.37 41.69 40.14 35.63 30.39
Largest Box (e.g. [bochkovskiy2020yolov4, zoph2019learning, tan2019efficientnet]) 35.20 28.37 22.34 18.47 58.79 56.31 51.49 46.30 36.00 25.37 14.95 10.91
(relative improvement) -10.3% -24.4% -33.7% -36.7% -1.48% -1.01% 0.27% 2.04% -13.6% -36.8% -58.1% -64.1%
Ellipse + RU Loss (Ours) 39.33 38.31 36.00 32.72 60.08 58.66 55.73 51.60 41.74 40.71 38.05 33.97
(relative improvement) 0.17% 2.05% 6.88% 12.1% 0.67% 3.12% 8.54% 13.7% 0.13% 1.42% 6.79% 11.8%
Perfect Labels 39.66 39.17 37.24 34.08 60.28 58.89 55.91 51.83 42.05 41.73 39.61 36.02
(relative improvement) 1.02% 4.36% 10.6% 16.7% 1.00% 3.53% 8.88% 14.2% 0.88% 3.94% 11.2% 18.5%
Figure 11: Our method of Ellipse + RU Loss performs close to perfect labels for AP50 and performs better than No Rotation and Largest Box for AP and AP75 - demonstrating the first reliable rotation augmentation without shape labels.
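The "(relative improvement)" rows appear to be computed relative to the No Rotation baseline in the same column. This reading is an inference from the numbers, and it reproduces the reported entries. For example, for the first AP column of the Largest Box row:

```latex
\text{relative improvement}
  = \frac{\mathrm{AP}_{\text{method}} - \mathrm{AP}_{\text{no rotation}}}{\mathrm{AP}_{\text{no rotation}}}
  = \frac{35.20 - 39.26}{39.26}
  = \frac{-4.06}{39.26}
  \approx -10.3\%
```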

Testing: For all datasets except COCO we do not have complete segmentation labels, so we only test on the standard, unrotated test set. For COCO we generate our test set by taking the COCO val2017 set and rotating it over a range of angles to simulate out-of-distribution rotations. We then bucket these rotations in intervals of 10 degrees and evaluate using segmentation labels to generate ground truth. We leave COCO results for Faster-RCNN to the supplement because they are similar to the results for RetinaNet shown below.

5.2 Ablation Studies

In Figure 8 and the accompanying table, we conduct a thorough ablation study on both the method for choosing the label and the impact of the RU Loss function.

Justifying EIoU Optimization:

In Section 4.1.1 we assumed that the optimal method for label rotation should maximize label accuracy, which we approximated as Expected IoU (Eq. 14). In Figure 7 we demonstrate a strong correlation between Expected IoU and performance on COCO across all methods, supporting our first-principles derivation. We see similar correlations at other angles as well.

Justifying the Ellipse:

In Sections 4.1 and 4.2 we introduced many potential methods for rotating a box label. In the ablation Table 8b we show that the Ellipse leads to the best performance across all rotations except one, where it is within a small noise tolerance of the best result. It is also important to note that all methods we tried perform significantly better than the Largest Box, showing the importance of fixing this issue.

RU Loss Ablation:

In Section 4.3 we introduce a hyper-parameter in our final method. We ablate over it in Table 8 and identify the best-performing value. We found this value to be optimal on COCO; however, on simpler datasets we use larger values of the hyper-parameter for optimal performance.

5.3 Overall Performance

Our best performing method consists of using both Ellipse-based label rotation and RU Loss. In this section, we show it leads to much better performance across multiple datasets and approximates segmentation-based rotation augmentations on COCO.

5.3.1 Object Detection Datasets

In Table 10, we provide four datasets where our method of rotation augmentation improves performance while the previous one (Largest Box) hurts performance. We notice this to be especially bad at higher APs, such as AP75. The gap is also larger in complex datasets such as transparent object bin picking where the largest box reduces performance by almost 50% and ours increases it by 4.5%.

5.3.2 Generalizing to new Rotation Angles

Our method significantly outperforms the original largest box method and also outperforms not using rotation for AP, AP50, and AP75 across all new test angles on COCO in Figure 1 and Figure 11. In the case of AP50, we show very similar improvements compared to using segmentation-based labels. This is a huge improvement since the largest box method hurts rotation performance.

6 Conclusion

The widespread Largest Box method (e.g. [xi2018sr, bochkovskiy2020yolov4, zoph2019learning, tan2019efficientnet, montserrat2017training, liu2016novel]) is based on the folk wisdom of maximizing overlap. Instead, we show that by maximizing Expected IoU and accounting for label certainty in the loss, we can nearly match the performance of perfect “segmentation-based” labels at AP50 while also achieving significant gains for AP and AP75. These results represent a step toward achieving rotation invariance for object detection models, while adding only a few lines of complexity to object detection codebases.


We thank Yuri Boykov, Tomas Gerlich, Abhijit Ghosh, Olga Veksler and Kartik Venkataraman for their helpful discussions and edits to improve the paper.