1 Introduction
Typically, automotive systems are equipped with a multi-camera network to cover the full field of view and range [horgan2015vision]. Four surround-view fisheye cameras are typically part of this sensor suite, as illustrated in Figure 1. These four cameras enable dense near-field perception, making them suitable for automated parking, low-speed maneuvering, and emergency braking [heimberger2017computer]. The wide field of view of the fisheye image comes with the side effect of strong radial distortion. Objects at different angles from the optical axis look quite different, making the object detection task a challenge. A common practice is to rectify the distortion using a fourth-order polynomial model [woodscape] or the unified camera model [khomutenko2015enhanced]. However, undistortion introduces resampling artifacts, especially at the periphery. In particular, the negative impact on computer vision of the spurious frequency components introduced by resampling is well understood [LourencoSIFT]. In addition, there are other minor impacts, such as a reduced field of view and a non-rectangular image with invalid pixels. Although semantic segmentation is an easier solution on fisheye images, its annotation cost is much higher than that of object detection [siam2017deep]. In general, there is limited work on fisheye perception [kumar2018monocular, yahiaoui2019fisheyemodnet, uvrivcavr2019soilingnet, kumar2020fisheyedistancenet, kumar2020unrectdepthnet, kumar2020syndistnet]. We can broadly classify the state-of-the-art deep-learning-based object detection methods into two types: two-stage detectors and single-stage detectors. Agarwal et al. [agarwal2018recent] provide a detailed survey of current object detection methods and their challenges. Fisheye camera object detection is a comparatively harder problem. The rectangular bounding box fails to be a good representation due to the massive distortion in the scene. As demonstrated in Figure LABEL:fig:abstract (a), the size of the standard bounding box is almost double the size of the object of interest inside it. Instance segmentation can help obtain accurate object contours. However, it is a different task that is computationally complex and more expensive to annotate, and it typically also needs a bounding box estimation step. There is relatively little work on object detection for fisheye or closely related omnidirectional cameras. One of the main issues is the lack of a useful dataset, particularly for autonomous driving scenarios. The recent fisheye object detection paper FisheyeDet [li2020fisheyedet] emphasizes the lack of a useful dataset, and the authors create a simulated fisheye dataset by applying distortions to the Pascal VOC dataset [everingham2010pascal]. FisheyeDet makes use of a four-sided polygon representation aided by distortion shape matching. SphereNet [coors2018spherenet] and its variants [perraudin2019deepsphere, su2019kernel, jiang2019spherical] formulate CNNs on spherical surfaces. However, fisheye images do not follow the spherical projection model, as seen by the non-uniform distortion in the horizontal and vertical directions. Our objective is to present a more detailed study of various techniques for fisheye object detection in autonomous driving scenes. Our main contributions include:


Exploration of seven different object representations for fisheye object detection.

Design of novel representations for fisheye images, including the curved box and adaptive step polygon.

Public release of a dataset of 10,000 images with annotations for all the object representations.

Implementation and empirical study of FisheyeYOLO baseline, which can output different representations.
2 Object Representations
2.1 Adaptation of Box representations
Standard Box Representation
The rectangular bounding box is the most common representation for object detection. It is aligned to the pixel grid axes, which makes it efficient to regress using a machine learning model. It is represented by four parameters ($x$, $y$, $w$, $h$), namely the box center, width, and height. It has the advantage of simplified, low-cost annotation. It works in most cases, but it may capture a large non-object area within the box for complex shapes. This is particularly the case for fisheye distorted images, as shown in Figure LABEL:fig:abstract (a).
Oriented Box Representation
The oriented box is a simple extension of the standard box with an additional parameter $\theta$ to capture the rotation angle of the box. It is also referred to as a tilted or rotated box. Lienhart et al. [lienhart2002extended] adapted the Viola-Jones object detection framework to output rotated boxes. It is also commonly used in lidar top-view object detection methods [Geiger2012CVPR]. The orientation ground truth spans the range of (-90° to +90°), where the rotation angle is defined with respect to the x-axis. For this study, we used instance segmentation contours to estimate the optimal oriented box as the minimum enclosing rectangle.
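The minimum enclosing rectangle of a contour can be approximated with a simple brute-force sweep over candidate orientations. The sketch below is illustrative only, not the annotation pipeline used for the dataset; the function name and the angle-sweep resolution are assumptions:

```python
import numpy as np

def oriented_box(contour, angle_steps=180):
    """Approximate the minimum-area oriented box of a 2D contour.

    Sweeps candidate rotation angles in [0, 90) degrees; for each angle the
    contour is rotated into an axis-aligned frame and the bounding-box area
    is measured. Returns (center_x, center_y, width, height, angle_deg).
    """
    pts = np.asarray(contour, dtype=float)
    best = None
    for angle in np.linspace(0.0, 90.0, angle_steps, endpoint=False):
        theta = np.deg2rad(angle)
        rot = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
        rotated = pts @ rot.T                 # rotate into candidate frame
        lo, hi = rotated.min(axis=0), rotated.max(axis=0)
        w, h = hi - lo
        area = w * h
        if best is None or area < best[0]:
            center = rot.T @ ((lo + hi) / 2.0)  # map center back to image frame
            best = (area, center[0], center[1], w, h, angle)
    return best[1:]
```

Sweeping only [0°, 90°) suffices because a rectangle's orientation is periodic with 90°.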
Ellipse Representation
An ellipse is closely related to an oriented box and can be represented using the same parameter set. The width and height parameters represent the major and minor axes of the ellipse. In contrast to an oriented box, the ellipse has a smaller area near the corners, and thus it is better for representing overlapping objects, as shown for the objects at the left end of Figure LABEL:fig:abstract. It may also fit some objects, such as vehicles, better than a box. We created our ground truth by fitting a minimum enclosing ellipse to the ground-truth instance segmentation contours. In parallel work, Ellipse R-CNN [dong2020ellipse] used an ellipse representation for objects instead of boxes.
2.2 Distortion aware representation
This subsection aims to derive an optimal representation of objects undergoing radial distortion in fisheye images, assuming a rectangular box is optimal for pinhole cameras. In a pinhole camera with no distortion, a straight line in the scene is imaged as a straight line in the image. In a fisheye image, a straight line in the scene is imaged as a curved segment, whose nature is determined by the specific type of fisheye distortion. The fisheye cameras in the dataset we used are well represented and calibrated using a fourth-order polynomial model of the fisheye distortion [woodscape]. The authors are aware that there have been many developments in fisheye camera models over the past few decades, e.g. [kannala2006fisheye, brauerDivisionModel, Khomutenko2016eucm]. In this section, we consider the fourth-order polynomial model and the division model only: the fourth-order polynomial model is provided with the dataset that we use, and we examine the division model to understand whether the use of circular arcs is valid under such fisheye projections.
In this case, the projection of a line onto the image can be described parametrically by complicated polynomial curves. Let us consider a much simpler model for the moment: a first-order polynomial (or equidistant) model of a fisheye camera, $r = a\theta$, where $r$ is the radius on the image plane and $\theta$ is the angle of the incident ray against the optical axis. If we consider the parametric equation of a line in 3D Euclidean space:
$P(t) = \mathbf{D}t + \mathbf{Q}$ (1)
where $\mathbf{D} = (D_x, D_y, D_z)$ is the direction vector of the line and $\mathbf{Q} = (Q_x, Q_y, Q_z)$ is a point through which the line passes, Hughes et al. [hughesFisheye] have shown that the projection onto a fisheye camera that adheres to equidistant distortion is described by:
$p_d(t) = \dfrac{a\,\theta(t)}{\lVert p(t) \rVert}\, p(t)$ (2)
where
$p(t) = \dfrac{f}{D_z t + Q_z} \begin{bmatrix} D_x t + Q_x \\ D_y t + Q_y \end{bmatrix}$ (3)
$\theta(t) = \arctan\left( \dfrac{\lVert p(t) \rVert}{f} \right)$ (4)
Here $p(t)$ is the projected line in a pinhole camera, and $p_d(t)$ is the distorted image of the line in a fisheye camera.
This is a complex description of a straight line's projection, especially considering we have ignored all but the first-order polynomial term. Therefore, it is highly desirable to describe the projection of straight lines using a more straightforward geometric shape.
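The equidistant line projection can be checked numerically: a line whose pinhole image passes through the principal point stays straight, while other lines curve. The sketch below follows the equidistant model $r = a\theta$ via the intermediate pinhole projection; the function name and parameter defaults are illustrative assumptions:

```python
import numpy as np

def project_line_equidistant(D, Q, a=1.0, f=1.0, ts=np.linspace(-2, 2, 9)):
    """Project points of the 3D line P(t) = D*t + Q through an equidistant
    (r = a*theta) fisheye model, via the intermediate pinhole projection."""
    D, Q = np.asarray(D, float), np.asarray(Q, float)
    out = []
    for t in ts:
        X, Y, Z = D * t + Q
        p = f * np.array([X, Y]) / Z          # pinhole projection
        r_u = np.linalg.norm(p)
        theta = np.arctan2(r_u, f)            # angle of the incident ray
        # push the pinhole point radially to the distorted radius a*theta
        out.append(a * theta * p / r_u if r_u > 0 else p)
    return np.array(out)
```

A radial line through the image center remains radial, whereas an offset vertical line (constant pinhole x) maps to a curved segment with varying x.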
Bräuer-Burchardt and Voss [brauerDivisionModel] show that if the first-order division model can accurately describe the fisheye distortion, then we may use circles in the image to model the projected straight lines. As a note, the division model is generalised in [scaramuzzaFisheye], though it loses the property of straight lines projecting to circular arcs. We should then consider how well the division model fits the fourth-order polynomial model. In [hughesFisheye], the authors adapt the division model slightly to include an additional scaling factor, prove that this does not impact the projection of a line to a circle, and show that the division model is a valid replacement for the equidistant fisheye model. Here we repeat this test but compare the division model to the fourth-order polynomial. The results are shown in Figure 2. As can be seen, the division model can map to the fourth-order polynomial with a maximum error of less than one pixel. While this may not be accurate enough for applications in which sub-pixel error is desirable, it is sufficient for bounding box accuracy.
Therefore, we propose a novel curved bounding box representation using circular arcs. Figure 3 (top) provides a visual justification of circular arcs: we illustrate the projection of an open cube with grid lines, where the straight lines become circular arcs after projection. Figure 3 (bottom) illustrates the details of the curved bounding box. The blue line represents the axis, and the white lines intersect with the circles, creating the starting and ending points of the polygon. This representation allows two sides of the box to be curved, giving the flexibility to adapt to the image distortion in fisheye cameras. It also specializes to an oriented bounding box when there is no distortion, for objects near the principal point.
We create an automatic process to generate this representation, taking an object contour as input. First, we generate an oriented box from the contour. We choose a point that lies on the oriented box's axis line as a candidate circle center. From the center, we create two circles intersecting the corner points of the bounding box. We construct the polygon based on the two circles and the intersection points. To find the best circle center, we iterate over the axis line and choose the center whose polygon has the maximum IoU with the instance mask. The output polygon can be represented by 6 parameters, namely ($x_c$, $y_c$, $r_1$, $r_2$, $\theta_1$, $\theta_2$), representing the circle center, the two radii, and the angles of the start and end points of the polygon relative to the horizontal x-axis. By simple algebraic manipulation, we can reparameterize the curved box using the object center ($x$, $y$), following a typical box representation, instead of the center of the circle.
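Given the six parameters, the curved box can be rasterized into a polygon by sampling the two circular arcs. A minimal sketch under the parameterization above (the arc-sampling density and function name are assumptions):

```python
import numpy as np

def curved_box_polygon(xc, yc, r1, r2, theta1, theta2, arc_points=20):
    """Convert curved-box parameters into a closed polygon.

    The two curved sides are arcs of concentric circles (radii r1, r2)
    centred at (xc, yc); both arcs span the angular range [theta1, theta2]
    in radians. The straight sides connect the arc endpoints.
    """
    angles = np.linspace(theta1, theta2, arc_points)
    inner = np.stack([xc + r1 * np.cos(angles), yc + r1 * np.sin(angles)], axis=1)
    outer = np.stack([xc + r2 * np.cos(angles), yc + r2 * np.sin(angles)], axis=1)
    # Traverse the inner arc forward, then the outer arc backward to close the loop.
    return np.concatenate([inner, outer[::-1]], axis=0)
```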
2.3 Generic Polygon Representations
The polygon is a generic representation for any arbitrary shape and is typically used even for instance segmentation annotation. Thus a polygon output can be seen as a coarse segmentation. We discuss two standard representations of a polygon and propose a novel extension that improves accuracy.
Uniform Angular Sampling Our polar representation is quite similar to the PolarMask [polarmask] and PolyYOLO [polyyolo] approaches. As illustrated in Figure 4 (left), the full angle range of 360° is split into $N$ equal parts, where $N$ is the number of polygon vertices. Each polygon vertex is represented by its radial distance $r_i$ from the centroid of the object. Uniform angular sampling removes the need for encoding the angle parameter $\theta_i$. The polygon is finally represented by the object center ($x$, $y$) and {$r_1$, $r_2$, ..., $r_N$}.
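Uniform angular sampling can be sketched as follows: for each of the $N$ equal sectors, keep the radius of the contour point whose polar angle about the centroid is closest to the sector centre. This is an illustrative approximation, not the annotation pipeline used for the dataset:

```python
import numpy as np

def angular_sampling(contour, n=24):
    """Represent a contour by n radial distances, one per equal angular sector.

    For each sector the vertex is the contour point whose polar angle
    (about the centroid of the contour points) is closest to the sector centre.
    Returns (centroid, radii).
    """
    pts = np.asarray(contour, dtype=float)
    c = pts.mean(axis=0)
    rel = pts - c
    ang = np.mod(np.arctan2(rel[:, 1], rel[:, 0]), 2 * np.pi)
    rad = np.linalg.norm(rel, axis=1)
    centres = (np.arange(n) + 0.5) * 2 * np.pi / n
    radii = np.empty(n)
    for k, a in enumerate(centres):
        # wrap-aware angular difference in [0, pi]
        diff = np.abs(np.mod(ang - a + np.pi, 2 * np.pi) - np.pi)
        radii[k] = rad[np.argmin(diff)]
    return c, radii
```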
Uniform Perimeter Sampling In this representation, we divide the perimeter of the object contour equally to create $N$ vertices. Thus the polygon is represented by a set of vertices {($x_i$, $y_i$)}, using the centroid of the object as the origin. PolyYOLO [polyyolo] showed that it is better to learn the polar representation of the vertices {($r_i$, $\theta_i$)} instead. They define a parameter $\alpha$ to denote the presence or absence of a vertex in a sector, as shown in Figure 4 (middle). We extend this parameter to be the count of vertices in the sector.
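Uniform perimeter sampling reduces to resampling the contour at equal arc-length steps, e.g. via linear interpolation over the cumulative perimeter. A minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def resample_contour(contour, n):
    """Resample a closed contour to n vertices equally spaced along its perimeter."""
    pts = np.asarray(contour, dtype=float)
    closed = np.vstack([pts, pts[:1]])                 # close the loop
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])      # arc length at each vertex
    targets = np.linspace(0.0, cum[-1], n, endpoint=False)
    x = np.interp(targets, cum, closed[:, 0])
    y = np.interp(targets, cum, closed[:, 1])
    return np.stack([x, y], axis=1)
```

For example, resampling a unit square to 8 vertices places a vertex at every corner and edge midpoint.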
Curvature-adaptive Perimeter Sampling The original curve of the object contour between two vertices gets approximated by a straight line in the polygon. For regions of high curvature, this is not a good approximation. Thus, we propose an adaptive sampling based on the curvature of the local contour. We distribute the vertices non-uniformly in order to best represent the object contour. Figure 4 (right) shows the effectiveness of this approach, where more vertices are used for high-curvature regions than for straight segments, which can be represented by fewer vertices. We adopt the algorithm in [teh1989detection] to detect the dominant points of a given curved shape, which best represent the object. Then we reduce the set of points using the algorithm in [douglas1973algorithms] to obtain the most representative simplified curve. This way, our polygon has dense points on the curved parts and sparse points on the straight parts, which maximizes the utilization of the predefined number of points per contour.
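The point-reduction step can be illustrated with the classic Douglas-Peucker simplification of [douglas1973algorithms], which keeps a vertex only where its deviation from the current chord exceeds a tolerance; a compact recursive sketch:

```python
import numpy as np

def douglas_peucker(points, eps):
    """Simplify a polyline: keep the endpoints and, recursively, any point
    whose perpendicular distance from the current chord exceeds eps."""
    pts = np.asarray(points, dtype=float)
    if len(pts) < 3:
        return pts
    start, end = pts[0], pts[-1]
    chord = end - start
    norm = np.linalg.norm(chord)
    if norm == 0.0:
        dists = np.linalg.norm(pts - start, axis=1)
    else:
        # perpendicular distance via the 2D cross product with the chord
        rel = pts - start
        dists = np.abs(chord[0] * rel[:, 1] - chord[1] * rel[:, 0]) / norm
    idx = int(np.argmax(dists))
    if dists[idx] > eps:
        left = douglas_peucker(pts[:idx + 1], eps)
        right = douglas_peucker(pts[idx:], eps)
        return np.vstack([left[:-1], right])   # drop the duplicated split point
    return np.vstack([start, end])
```

Nearly collinear runs collapse to their endpoints, while sharp deviations are preserved, which yields exactly the dense-on-curves, sparse-on-straights behaviour described above.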
3 FisheyeYOLO network
We adapt the YOLOv3 [YOLOV3] model to output the different representations discussed in Section 2. We call it FisheyeYOLO, as illustrated in Figure 5. Our baseline bounding box model is the same as YOLOv3 [YOLOV3], except that the Darknet53 encoder is replaced with a ResNet18 encoder. Similar to YOLOv3, object detection is performed at multiple scales. For each grid cell in each scale, the object width ($w$), height ($h$), object center coordinates ($x$, $y$), and object class are inferred. Finally, non-maximum suppression is used to filter out low-confidence detections. Instead of using an $L_2$ loss for categorical and objectness classification, we used standard categorical cross-entropy and binary cross-entropy losses, respectively. The final loss is a combination of sub-losses, as illustrated below:
$L_{total} = L_{x,y} + L_{w,h} + L_{obj} + L_{class}$ (5)
$L_{x,y} = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]$ (6)
$L_{w,h} = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 \right]$ (7)
$L_{obj} = - \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \hat{o}_i \log(o_i) + (1 - \hat{o}_i) \log(1 - o_i) \right]$ (8)
$L_{class} = - \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \hat{p}_i(c) \log(p_i(c))$ (9)
where $S^2$ is the number of grid cells, $B$ is the number of anchors per cell, $\mathbb{1}_{ij}^{obj}$ indicates that anchor $j$ in cell $i$ is responsible for an object, and height and width are predicted as offsets from precomputed anchor boxes.
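For concreteness, the standard YOLOv3-style decoding of a raw network output into an image-space box is sketched below: grid offsets pass through a sigmoid, and width/height scale the anchor exponentially (variable names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, anchor_w, anchor_h, stride):
    """Decode raw YOLOv3-style outputs (tx, ty, tw, th) predicted at grid
    cell (cx, cy) into an image-space box center, width and height."""
    x = (sigmoid(tx) + cx) * stride        # cell-relative offset -> image coords
    y = (sigmoid(ty) + cy) * stride
    w = anchor_w * np.exp(tw)              # offsets from precomputed anchors
    h = anchor_h * np.exp(th)
    return x, y, w, h
```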
In the case of oriented box or ellipse prediction, we define an additional loss function based on ellipse angle or orientation of the box. The loss function for oriented box and ellipse is:
$L_{\theta} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (\theta_i - \hat{\theta}_i)^2$ (10)
$L_{oriented} = L_{total} + L_{\theta}$ (11)
where $L_{oriented}$ is the total loss minimized for oriented box regression. In the case of the curved box, the single angle $\theta$ is replaced by the pair ($\theta_1$, $\theta_2$) in Equation (10).
We also explored learning the orientation as a classification problem instead of a regression problem. One motivation is the discontinuity at ±90° due to the wrapping around of angles. In this scenario, we discretized the orientation into 18 bins, where each bin represents a range of 10° with a tolerance of ±5°. To further improve our prediction, we design an IoU loss function that guides the model to minimize the difference in area between the predicted box and the ground-truth box. We compute the areas of the predicted and ground-truth rectangles and apply a regression loss on those values. This loss maximizes the overlapping area between the prediction and the ground truth, improving the overall results. The IoU loss is:
$L_{IoU} = \lVert \mathrm{Area}(b) - \mathrm{Area}(\hat{b}) \rVert$ (12)
where $\mathrm{Area}(\cdot)$ represents the area of the representation at hand, $b$ the ground-truth shape, and $\hat{b}$ the predicted shape. We report all the results related to these experiments in Table 3.
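The 18-bin orientation discretization described above can be sketched as a pair of encode/decode helpers; with 10° bins, decoding to the bin centre bounds the quantization error by ±5° (function names are illustrative):

```python
import numpy as np

NUM_BINS = 18               # (-90°, +90°] split into 10° bins
BIN_WIDTH = 180.0 / NUM_BINS

def angle_to_bin(angle_deg):
    """Map an orientation in (-90, 90] degrees to one of 18 class bins."""
    return int(np.clip((angle_deg + 90.0) // BIN_WIDTH, 0, NUM_BINS - 1))

def bin_to_angle(bin_idx):
    """Decode a bin index back to the bin-centre angle (max error ±5°)."""
    return -90.0 + (bin_idx + 0.5) * BIN_WIDTH
```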
The polar polygon regression loss is:
$L_{r} = \sum_{i=1}^{N} (r_i - \hat{r}_i)^2$ (13)
$L_{\alpha} = \sum_{i=1}^{N} (\alpha_i - \hat{\alpha}_i)^2$ (14)
$L_{polygon} = L_{r} + L_{\alpha}$ (15)
where $N$ corresponds to the number of sampling points, and each point is sampled with an angular step size of $360°/N$ in polar coordinates, as shown in Figure 4. Our polar loss is similar to PolyYOLO [polyyolo], where each polygon point (shown in red) is represented using three parameters $r_i$, $\alpha_i$, and $\theta_i$. Hence the total number of required parameters for $N$ sampling points is $3N$. The same is presented in Figure 4 (middle).
In the Cartesian representation, we regress over two parameters ($x_i$, $y_i$) for each polygon point. We further improve our predictions by adding our IoU loss function, which minimizes the area difference between the prediction and the ground truth. We refer to both loss functions jointly as the localization loss $L_{loc}$. Our combined loss for Cartesian polygon predictions is:
$L_{total} = L_{loc} + L_{obj} + L_{class}$ (16)
where $L_{obj}$ and $L_{class}$ are inherited from the YOLOv3 loss functions. We perform non-maximum suppression according to the representation at hand: we generate predictions for all objects, filter out the low-confidence ones, and then compute the IoU of each output polygon against the remaining outputs, suppressing the high-IoU duplicates.
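The suppression step can be sketched with standard greedy NMS; for brevity the example scores axis-aligned boxes, whereas the actual filtering computes IoU on the representation at hand (e.g. the output polygons):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression on axis-aligned boxes [x1, y1, x2, y2].

    The same scheme applies to other representations by swapping in the
    appropriate IoU computation (e.g. polygon IoU for polygon outputs).
    """
    boxes = np.asarray(boxes, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of the top-scoring box with the remaining ones
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]       # suppress high-IoU duplicates
    return keep
```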
4 Experimental Results
4.1 Dataset and Evaluation Metrics
Our dataset comprises 10,000 images sampled roughly equally from the four views. The dataset has 4 classes, namely vehicles, pedestrians, bicyclists, and motorcyclists. Vehicles are further divided into subclasses, namely cars and large vehicles (trucks/buses). The images are in RGB format at 1 MPx resolution with a wide horizontal FOV. The dataset is captured in several European countries and the USA. For our experiments, we used only the vehicles class. We divide our dataset into a 60-10-30 train-validation-test split and train all the models using the same settings. More details are discussed in our WoodScape dataset paper [woodscape].
The objective of this work is to study various representations for fisheye object detection. Conventional object detection algorithms evaluate their predictions against the ground truth, which is usually a bounding box. Unlike conventional evaluation, our first objective is to provide a better representation than a conventional bounding box. Therefore, we first evaluate our representations against the most accurate representation of the object, the ground-truth instance segmentation mask. We report the mIoU between a representation and the ground-truth instance mask.
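Computing the mIoU against the instance masks reduces to rasterizing each representation and scoring boolean masks; a minimal sketch (the rasterization of boxes, ellipses, or polygons into masks is assumed to happen upstream):

```python
import numpy as np

def representation_miou(pred_masks, gt_masks):
    """mIoU between rasterized representation masks and instance masks.

    Each mask is a boolean HxW array; a representation (box, ellipse,
    polygon, ...) is first rasterized to a mask, then scored against the
    ground-truth instance segmentation mask.
    """
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union else 1.0)
    return float(np.mean(ious))
```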
Additionally, we qualitatively evaluate the representations in obtaining the object's intersection with the ground (footpoint). This is critical as it helps localize the object in the map and provides more accurate vehicle trajectory planning. Finally, we report model speed in terms of frames per second (fps), as we focus on real-time performance. The distortion is higher in the side cameras compared to the front and rear cameras; thus, we provide our evaluation for each camera separately. To simplify our baseline, we only evaluate on the vehicles class, although four classes are available in the dataset.
4.2 Results Analysis
4.2.1 Number of Polygon Points
The polygon is a more generic representation of the complex object shapes that arise in fisheye images. We perform a study to understand the effect of the number of vertices in a polygon. We use the uniform perimeter sampling method to vary the number of vertices and compare the IoU using instance segmentation as ground truth. The results are tabulated in Table 1. A 24-sided polygon seems to provide a reasonable trade-off between the number of parameters and accuracy. Although a 120-sided polygon provides 8% higher IoU with the ground truth, such a representation would be difficult to learn and prone to noisy, overfitted predictions. For the quantitative experiments, we fix the number of vertices at 24 per object. We observe no significant difference in fps when increasing the number of vertices; our models run at 56 fps on a standard NVIDIA TitanX GPU. This is due to the YOLOv3 [YOLOV3] architecture, which performs the prediction at each grid cell in a parallel manner.
4.2.2 Evaluation of Representation Capacity
Table 2 compares the performance of different representations using their ground-truth fit relative to the instance segmentation ground truth. This empirical metric demonstrates the maximum performance a representation can achieve regardless of model complexity. As expected, the 24-sided polygon achieves the highest mIoU, showing that it has the best representation capacity. Our proposed curvature-adaptive polygon achieves a 2.2% improvement over the uniform sampling polygon with the same number of vertices. Polygon annotation is relatively more expensive to collect, and it increases model complexity. Thus it is still interesting to consider simpler bounding box representations.
Compared to the standard box representation, the oriented box representation is approximately 2.5-4% more efficient for the side cameras and 1.3-2.3% for the front cameras. The ellipse improves the efficiency further by an additional 2% for the side cameras and 1-2% for the front cameras. Our curved box achieves a 1.15% improvement over the standard box. However, it is slightly less efficient than an oriented box due to the constraint that the two circular sides of the box share the same circle center, which adds some area inside the polygon and decreases the IoU. In addition, curvature is not modelled for the horizontal edges of the box. In future work, we plan to explore these extensions to obtain a more optimal curved bounding box and to leverage the convergence of circular arcs at vanishing points.
The current simple version of curved box has the advantage of getting a tight bottom edge, capturing the footpoint for estimating the object’s 3D location. The object’s footpoint is captured almost entirely, as observed in qualitative results, especially for the side cameras where distortion is maximized. Compared to polygon representation, curvedbox representation has low annotation cost due to fewer representation points, which saves annotation effort.
4.2.3 Quantitative Results
Table 3 shows our studies on how to predict the orientation of the box or the ellipse efficiently. First, we train a model to regress over the box and its orientation, as specified in Equation (11). In the second experiment, orientation prediction is addressed as a classification problem instead of regression, as a possible solution to the discontinuity problem. We divide the orientation range of (-90° to +90°) into 18 bins, where each bin represents 10°, making this an 18-class classification problem. During inference, an acceptable error of ±5° for each box is considered. Using this classification strategy, we improve performance by 1.6%. Formulating the orientation prediction of the box or ellipse as a classification problem combined with the IoU loss was found to be superior to direct regression, with a 2.9% improvement in accuracy. Hence we use this model as the standard representation for oriented box and ellipse prediction when comparing with other representations.
Table 4 demonstrates the prediction results for our proposed representations. Compared to the standard bounding box approach, the proposed oriented box and ellipse models improved the mIoU score on the test set by 2% and 1.8%, respectively. Ellipse prediction provides slightly better accuracy than the oriented box as it has higher immunity to occlusions with other objects in the scene due to the absence of corners, as demonstrated in Figure 6.
4.2.4 Qualitative Results
Figure 6 shows a visual evaluation of our proposed representations. The results show that the ellipse provides a decent, easy-to-learn representation with a minimal number of parameters and minimal occlusion with background objects compared to the oriented box representation. Unlike boxes, it allows a minimal representation of the object due to the absence of corners, which avoids incorrect occlusion with free parking slots, for instance, as shown in Figure 6 (bottom). Polygon representation provides higher accuracy in terms of IoU with the instance mask. A four-point model provides high-accuracy predictions for small objects, as four points are sufficient to represent them. Since the dataset contains a significant number of small objects, this representation demonstrates good accuracy, as shown in Tables 2 and 3. Visually, however, large objects cannot be represented by a quadrilateral, as illustrated in Figure 6. A higher number of sampling points on the polygon results in higher performance. However, the predicted masks are still prone to deformation due to minor errors in the localization of each point.
5 Conclusion
In this paper, we studied various representations for fisheye object detection. At a high level, we can split them into bounding box extensions and generic polygon representations. We explored the oriented bounding box and ellipse, and designed a curved bounding box with optimal fisheye distortion properties. We proposed a curvature-adaptive sampling method for polygon representations, which improves significantly over uniform sampling methods. Overall, the proposed models improve the relative mIoU accuracy significantly by 40% compared to a YOLOv3 baseline. We consider our method to be a baseline for further research into this area. We will make the dataset with ground-truth annotations for the various representations publicly available. We hope this encourages further research leading to mature object detection on raw fisheye imagery.
Acknowledgement
The authors would like to thank their employer for the opportunity to release a public dataset to encourage more research on fisheye cameras. We also want to thank Lucie Yahiaoui (Valeo) and Ravi Kiran (Navya) for the detailed review.