
SynWoodScape: Synthetic Surround-view Fisheye Camera Dataset for Autonomous Driving

by Ahmed Rida Sekkat, et al.

Surround-view cameras are a primary sensor for automated driving, used for near-field perception, and are among the most commonly deployed sensors in commercial vehicles. Four fisheye cameras with a 190° field of view cover 360° around the vehicle. Due to their high radial distortion, standard algorithms do not extend easily. Previously, we released the first public fisheye surround-view dataset, named WoodScape. In this work, we release a synthetic version of the surround-view dataset that addresses many of its weaknesses and extends it. Firstly, it is not possible to obtain ground truth for pixel-wise optical flow and depth from real data. Secondly, in order to sample diverse frames, WoodScape did not provide all four cameras simultaneously, which prevents the design of multi-camera algorithms; the new dataset enables it. We implemented surround-view fisheye geometric projections in CARLA Simulator matching WoodScape's configuration and created SynWoodScape. We release 80k images from the synthetic dataset with annotations for 10+ tasks. We also release the baseline code and supporting scripts.





I Introduction

Since the early 1960s, we have pursued the fantasy of commuting between places while sitting in a driverless car with no manual intervention. Over the last decade, autonomous driving (AD) has piqued the interest of vehicle manufacturers more than ever before. The vast and ground-breaking advances in artificial intelligence (AI) and computer vision made possible by machine learning are the primary drivers of this developing trend.

Let us consider the case of the automobile. According to global statistics, approximately 3,700 lives are lost to road accidents every day (approximately 1.35 million people per year), and between 20 and 50 million people are left with non-fatal injuries [who_accident_2021]. More than 70% of those accidents are caused by human error. Despite robust safety standards developed by manufacturers and massive technological evolution, we have not reached an acceptable number of traffic accidents. What could be the possible reason, and is there a long-term solution? AI and autonomous systems could act as a remedy here: machines take over control, eliminating the human intervention that is the root cause of many of these accidents.

An autonomous vehicle drives itself without the assistance of a human operator, using a collection of sensors, cameras, radar, and AI algorithms. Experts have identified five stages in the growth of self-driving vehicles. Each level defines how much a car may take over activities and obligations from its driver, as well as how the car and driver interact: 1) driver assistance, 2) semi-automated driving, 3) highly automated driving, 4) fully automated driving, and 5) complete automation [sae2014taxonomy].

In autonomous driving, the near field is a region from 0 to 15 meters with 360° coverage around the vehicle. Near-field perception is primarily needed for use cases such as automated parking, traffic jam assist, and urban driving, where the predominant sensor suite includes ultrasonics [popperli2019capsule] and surround-view fisheye cameras. Despite the importance of such use cases, most research to date has focused on far-field perception; consequently, there are limited datasets and little research on near-field perception tasks. In contrast to far-field, near-field perception is more challenging due to high-precision object detection requirements of 10 cm [eising2021near]. For example, an autonomous car needs to be parked in a tight space where high-precision detection is required with no room for error.

Fig. 2: Sample images from the surround-view camera network showing wide field of view and coverage.

Near-field perception for AD covers a region from 0 to 10 meters with 360° coverage around the vehicle, as shown in Fig. 2. Its use cases include automated parking, traffic jam assist, and urban driving, and the sensor suite includes ultrasonics, fisheye cameras, and radar. There are limited datasets and very little work on near-field perception tasks, as the main focus is on far-field perception. In contrast to far-field, it is more challenging due to high-precision object detection requirements of 10 cm. For example, consider the parking scenario in Fig. 3: the car needs to be parked in a tight space with partial object visibility and no room for error, requiring high precision. Four fisheye cameras are sufficient to cover near-field perception, as shown in Fig. 2.

Fig. 3: Illustration of a tight parking scenario.

Due to their large radial distortion, standard algorithms cannot easily be extended to fisheye cameras, and there is very little work on perception algorithms for them. Also, most current AD systems are Level 2. In this paper, we focus on providing a synthetic dataset for building holistic scene understanding for a near-field perception system that constitutes the necessary modules for a Level 3 AD stack using four fisheye cameras. The naive approach is to rectify the fisheye images and apply standard algorithms. The question that naturally arises when discussing distortion in fisheye cameras is: why do we not rectify the images?

  • Most algorithms are usually designed to work on rectified pinhole camera images.

  • Removing distortion leads to a significant loss in the Field-of-View.

  • For a horizontal Field-of-View (HFoV) greater than 180°, rays incident from behind the camera make it theoretically impossible to establish a complete mapping to a rectilinear viewport. Rectification thus defeats the purpose of using a wide-angle fisheye lens.

  • Resampling distortion artifacts are particularly strong in the periphery as a small region in the fisheye image is expanded to a larger region in the rectified image. The texture is lost, and noise is introduced.
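The field-of-view and resampling arguments above can be made concrete with a short illustrative sketch (the focal length below is hypothetical, not a calibration value): in a rectilinear image, a ray at incident angle θ lands at radius f·tan(θ), which diverges as θ approaches 90°, so wide-angle content costs ever more pixels and rays beyond 90° (an HFoV above 180°) have no rectilinear image at all.

```python
import numpy as np

# Sketch: why rectification discards field of view. In a rectilinear
# (pinhole) image, a ray at incident angle theta lands at radius
# r = f * tan(theta), which diverges as theta -> 90 degrees. Rays
# beyond 90 degrees (HFoV > 180 degrees) cannot be mapped at all.
f = 500.0  # hypothetical focal length in pixels

for theta_deg in [30, 60, 80, 85, 89]:
    theta = np.radians(theta_deg)
    r = f * np.tan(theta)
    print(f"theta={theta_deg:2d} deg -> rectilinear radius {r:10.1f} px")
```

The blow-up near the periphery is also why resampling artifacts concentrate there: a few fisheye pixels must be stretched over a huge rectified region.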

II Background

II-A Surround-view Camera System

Surround-view fisheye cameras have been deployed in premium cars for over ten years, initially for visualization on dashboard display units and later to provide near-field perception for automated parking. Fisheye cameras have a strong radial distortion that cannot be corrected without drawbacks, including reduced FoV and resampling distortion artifacts at the periphery [kumar2020unrectdepthnet]. Appearance variations of objects are larger due to the spatially variant distortion, particularly for close-by objects. Fisheye perception is thus a challenging task and, despite its prevalence, is comparably less explored than pinhole cameras.

Surround-view cameras consisting of four fisheye cameras are sufficient to cover the near-field perception as shown in Fig. 2. Most algorithms are usually designed to work on rectified pinhole camera images. The naive approach to operating on fisheye images is to first rectify the images and then directly apply these standard algorithms. However, such an approach carries significant drawbacks due to the reduced field-of-view and resampling distortion artifacts in the periphery of the rectified images.

Fisheye cameras are a primary sensor available in most commercial vehicles for automated parking, and rear-view fisheye cameras have become a standard feature for dashboard viewing and reverse parking even in lower-cost vehicles. Fisheye cameras are used for autonomous driving perception tasks such as object detection [dahal2021online, rashedfisheyeyolo], soiling detection [uricar2021let, das2020tiledsoilingnet], semantic segmentation [sobh2021adversarial, dahal2021roadedgenet], weather classification [dhananjaya2021weather], depth prediction [kumar2020fisheyedistancenet, kumar2021fisheyedistancenet++, kumar2020syndistnet, kumar2021svdistnet, kumar2020unrectdepthnet], moving object detection [yahiaoui2019fisheyemodnet] and SLAM [gallagher2021hybrid, kumar2018near, kumar2018monocular], which are challenging due to the highly dynamic and interactive nature of surrounding objects in automotive scenarios [kia_2021]. Fisheye cameras are also commonly used in other domains such as video surveillance [kim2016fisheye] and augmented reality [schmalstieg2016augmented]. Despite their prevalence, only a few fisheye datasets are publicly available, and thus relatively little research has been performed. The Oxford RobotCar dataset [maddern20171] is one such dataset providing fisheye camera images for autonomous driving; it contains over 100 repetitions of a consistent route through Oxford, UK, captured over a year, and is widely used for long-term localization and mapping. OmniScape [sekkat2020omniscape] is a synthetic dataset providing semantic segmentation annotations for cameras mounted on a motorcycle. In TABLE I we compare the properties of these two datasets, in addition to the WoodScape dataset, against the SynWoodScape dataset.

                           Oxford RobotCar   OmniScape               WoodScape                  SynWoodScape
                           [maddern20171]    [sekkat2020omniscape]   [yogamani2019woodscape]
Real/Synthetic             Real              Synthetic               Real                       Synthetic
Ego vehicle                Car               Motorcycle              Car                        Car
Fisheye image resolution   1024×1024         1024×1024               1280×966                   1280×966
Bird’s eye view            No                No                      No                         Yes
Semantic segmentation      No                Yes                     Yes                        Yes
Instance segmentation      No                Yes                     Yes                        Yes
Motion segmentation        No                No                      Yes                        Yes
2D bounding boxes          No                Yes                     Yes                        Yes
3D bounding boxes          No                Yes                     Yes                        Yes
Depth map                  No                Yes                     No                         Yes
Event camera signals       No                No                      No                         Yes
Optical flow               No                No                      No                         Yes
Lidar                      Yes               Yes                     Yes                        Yes
IMU                        Yes               Yes                     Yes                        Yes
GNSS                       Yes               Yes                     Yes                        Yes

TABLE I: Summary of various autonomous driving datasets containing fisheye images.

II-B WoodScape Dataset

The dataset consists of 46,000 images sampled roughly equally from the four views and split into training, validation, and test sets in a 6:1:3 ratio. This dataset is used for OmniDet [kumar2021omnidet]. A subset of 10,000 images from the dataset will be made public on GitHub, together with baseline code released to encourage the community to develop unified perception models for autonomous driving. It contains the perception tasks listed in Fig. 6. 2D box detection covers the five most essential object categories: pedestrians, vehicles, riders, traffic signs, and traffic lights. Vehicles further have sub-classes, namely cars and large vehicles (trucks/buses). The polygon prediction task on raw fisheye images is limited to two classes, pedestrians and vehicles; unlike traffic lights and traffic signs, these categories are non-rigid and quite diverse in appearance, making them suitable for polygon regression. We sample 24 points with high curvature values from each object instance contour for the polygon regression task. Learning these points helps regress better polygon shapes, as points of high curvature define the shape of the object contours. Semantic segmentation comprises six classes: road, lanes, curbs, two-wheeled vehicles, vehicles, and persons. The images are in RGB format with 1 MPx resolution and 190° horizontal FoV. The dataset is captured in several European countries and the USA. For the experiments, we used only the vehicles class. Further details about dataset usage and demo code can be found on the WoodScape website.
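The 24-point sampling described above can be sketched as follows. This is a minimal illustration, not the released annotation tooling: the function name, the discrete curvature estimate via `np.gradient`, and the toy square-like contour are all assumptions.

```python
import numpy as np

def sample_high_curvature_points(contour, n_points=24):
    """Pick the n_points contour vertices with highest curvature.

    `contour` is an (N, 2) array of (x, y) points along a closed
    object boundary. Curvature is approximated from discrete first
    and second derivatives along the contour.
    """
    dx, dy = np.gradient(contour[:, 0]), np.gradient(contour[:, 1])
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    # Unsigned curvature of a planar curve; a small epsilon avoids
    # division by zero on perfectly straight segments.
    curvature = np.abs(dx * ddy - dy * ddx) / (dx**2 + dy**2 + 1e-9) ** 1.5
    idx = np.argsort(curvature)[-n_points:]
    return contour[np.sort(idx)]  # keep original contour ordering

# A square-ish closed contour: the corners should dominate the selection.
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
contour = np.stack([np.sign(np.cos(t)) * np.abs(np.cos(t)) ** 0.3,
                    np.sign(np.sin(t)) * np.abs(np.sin(t)) ** 0.3], axis=1)
poly = sample_high_curvature_points(contour, n_points=24)
print(poly.shape)  # (24, 2)
```

Keeping the selected points in contour order (the final `np.sort`) matters for polygon regression, since the target is an ordered polygon rather than an unordered point set.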

WoodScape [yogamani2019woodscape] is the world’s first public surround-view fisheye automotive dataset, released to accelerate research in multi-task, multi-camera computer vision for automated driving. The sensor setup comprises four surround-view fisheye cameras covering 360° around the vehicle. The dataset provides annotations for nine tasks, including segmentation, depth estimation, bounding boxes, pixel-level motion masks, and a novel lens soiling detection task. Semantic instances for 40+ classes are provided for over 10,000 images. The WoodScape dataset encourages researchers to design algorithms that operate directly on fisheye images, modelling the inherent distortion rather than using naive rectification. The dataset has enabled such research for depth estimation [kumar2020fisheyedistancenet], object detection [rashed2021generalized], soiling detection [uricar2021let], trailer detection [dahal2019deeptrailerassist] and multi-task learning [kumar2021omnidet].

III SynWoodScape

The SynWoodScape dataset is a synthetic version of the WoodScape dataset. The same configuration used to acquire the real data at different locations in Europe and the USA is reproduced in CARLA Simulator, and the same intrinsic and extrinsic calibration parameters are used to simulate the different sensors. The simulator allows us to extract, in addition to all the ground truths proposed in the WoodScape dataset, very precise ground truth for pixel-wise tasks such as depth maps, optical flow, and event camera signals. It also allows us to extract synchronized images from the four fisheye surround-view cameras in addition to a bird’s eye view image, and to capture images under different weather and lighting conditions. In the following subsections, we explain the construction of the fisheye images using the calibration parameters of the WoodScape dataset and the computation of the different ground truths.

III-A Fisheye Image Generation

To generate the fisheye images, we used a framework based on the cubemap representation of a 360° image and the calibration model proposed in the WoodScape dataset [yogamani2019woodscape]. The model uses a fourth-order polynomial ρ(θ) = a₁θ + a₂θ² + a₃θ³ + a₄θ⁴ to map the incident angle θ to the image radius ρ in pixels. Using this model, drawn in Fig. 4, each pixel in the fisheye image can be associated with a 3D direction on the unit sphere. We also construct a unit cube that corresponds to the cubemap image. Using ray tracing from the common center of the sphere and the cube, we compute the pixel mapping between the fisheye and cubemap images, as sketched in Fig. 7: the mapping is obtained by intersecting each 3D direction with both the sphere and the cube. Fig. 5 shows an example of mapping the cubemap image into the fisheye image. A lookup table is then built for each fisheye camera to store the correspondences between the two representations. To extract the fisheye images from CARLA, we acquired the five images that form the five views of the cubemap needed to build each fisheye image, and we used the exact calibration parameters of the cameras used to acquire the WoodScape dataset [yogamani2019woodscape] to build the sphere and to place the cameras with the same positions and rotations relative to the car. In this way, we preserve the configuration of the WoodScape dataset as if we had used the same acquisition platform inside CARLA Simulator.
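The pixel-to-ray step of the lookup-table construction can be sketched as below. The polynomial coefficients, the numerical inversion, and the principal-point values are all illustrative assumptions; the real coefficients come from WoodScape's released calibration files.

```python
import numpy as np

# Hypothetical 4th-order coefficients a1..a4 (NOT the released
# WoodScape calibration): rho(theta) = a1*t + a2*t^2 + a3*t^3 + a4*t^4
A = [340.0, 12.0, -1.5, 0.1]

def theta_to_rho(theta):
    """Map incident angle (radians) to fisheye image radius (pixels)."""
    return A[0]*theta + A[1]*theta**2 + A[2]*theta**3 + A[3]*theta**4

def fisheye_pixel_to_ray(u, v, cx, cy):
    """Back-project a fisheye pixel to a unit 3D viewing direction.

    Inverts the monotonic polynomial rho(theta) numerically; the ray
    can then be intersected with the unit cube to find the cubemap
    face and texel, which is what the lookup table stores.
    """
    du, dv = u - cx, v - cy
    rho = np.hypot(du, dv)
    thetas = np.linspace(0.0, np.radians(100), 10000)
    theta = thetas[np.argmin(np.abs(theta_to_rho(thetas) - rho))]
    phi = np.arctan2(dv, du)
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

ray = fisheye_pixel_to_ray(800, 483, cx=640, cy=483)
print(ray)  # unit-length direction for this pixel
```

Doing this once per fisheye pixel yields the per-camera lookup table; at runtime, building a fisheye frame is then a single gather from the five cubemap views.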

Fig. 4: Lens projection of the calibration model used.
(a) Fisheye image
(b) Cubemap image
Fig. 5: Mapping of the cubemap’s five views into the fisheye image.

III-B Dataset Details

The SynWoodScape dataset contains synthetic data generated from CARLA Simulator [carlasimulator]. Each of the 10k samples provided contains four surround-view fisheye images in addition to a bird’s eye view image. Each image comes with its previous frame and ground truth for multiple tasks, namely semantic segmentation into 24 classes, instance segmentation, motion segmentation, depth map, optical flow, event camera signals, 3D and 2D bounding boxes, lidar data, and IMU and GNSS data. The acquisition was made at a frame rate of 10 FPS. The intrinsic and extrinsic parameters of the cameras match the parameters of the real cameras used in the acquisition of the WoodScape dataset [yogamani2019woodscape]; in particular, the fisheye images have a resolution of 1280×966 pixels. Nine different weather and lighting conditions were used to extract the images. Fig. 1 shows an example of images extracted with the generated ground truth data. Fig. 6 shows a simplified diagram of the procedure for extracting all ground truth data from CARLA Simulator. In the following, we explain how we compute the ground truths that are not directly extracted from CARLA Simulator.

Fig. 6: A diagram of data extraction procedure from CARLA Simulator.

III-C Instance Segmentation

In order to extract the instance segmentation, we used the depth maps, the 3D bounding boxes, and the semantic segmentation ground truth. With these three modalities, we developed a tool, based on ray tracing, to compute the instance segmentation on the perspective images used to generate the omnidirectional images. For each pixel, we compute the 3D position in the world reference of CARLA Simulator. This is achieved using the depth map and the camera transform matrix from the sensor to the world reference, which can be obtained after computing the focal length of the camera. The camera intrinsic matrix is

    K = ( f   0   W/2
          0   f   H/2
          0   0   1  )                                  (1)

where W and H are the width and the height of the image, respectively, and f is the focal length of the camera, computed using the formula

    f = W / (2 tan(FoV/2))                              (2)

where FoV is the field of view of the camera. The 3D world position P of the pixel with coordinates (u, v) is then obtained using the following formula, where d(u, v) is the corresponding depth map value and T is the camera-to-world transform:

    P = T · d(u, v) · K⁻¹ · (u, v, 1)ᵀ                  (3)

After computing the 3D points of all pixels using (3), and since we have the 3D bounding boxes of each object in the scene, each identifiable by a unique ID, we check which of these 3D bounding boxes each computed 3D point lies inside. We do this by calculating the six planes formed by the bounding box; if a 3D point lies between each pair of parallel planes, it is considered inside the bounding box.
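The back-projection and box-membership steps above can be sketched as follows. This is a minimal illustration with assumed shapes and names (a toy depth map, an axis-aligned box test); the released tooling additionally handles oriented boxes by transforming points into each box's frame.

```python
import numpy as np

def backproject(depth, f, cx, cy):
    """Back-project an (H, W) depth map to (H, W, 3) camera-frame points,
    i.e. P = d(u, v) * K^-1 * (u, v, 1)^T for every pixel."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / f
    y = (v - cy) * depth / f
    return np.stack([x, y, depth], axis=-1)

def points_in_box(points, center, half_extents):
    """Axis-aligned membership test: a point is inside when it lies
    between each pair of parallel box faces. Oriented boxes are handled
    by first mapping points into the box frame."""
    d = np.abs(points - center)
    return np.all(d <= half_extents, axis=-1)

depth = np.full((4, 4), 5.0)  # toy 4x4 depth map, 5 m everywhere
pts = backproject(depth, f=2.0, cx=2.0, cy=2.0)
mask = points_in_box(pts, center=np.array([0.0, 0.0, 5.0]),
                     half_extents=np.array([3.0, 3.0, 1.0]))
print(mask.sum())  # 9 of the 16 pixels fall inside this box
```

Repeating the test against every object's box (each carrying a unique ID) directly yields the per-pixel instance labels.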

We attribute a random color to each bounding box to obtain the instance segmentation in Fig. 1. This color will represent the object inside this bounding box in all images captured during the current recording session.

Fig. 7: Mapping of the cubemap image’s pixels to the fisheye image.
Fig. 8: Overview of our surround-view camera based multi-task visual perception framework. The distance estimation task (blue block) makes use of semantic guidance and dynamic object masking from semantic/motion estimation (green and blue haze blocks) and camera-geometry adaptive convolutions (orange block). Additionally, we guide the detection decoder features (gray block) with the semantic features. The encoder block (shown in the same color) is common to all tasks. Our framework consists of processing blocks to train the self-supervised distance estimation (blue blocks), semantic segmentation (green blocks), motion segmentation (blue haze blocks), and polygon-based fisheye object detection (gray blocks). We obtain surround-view geometric information by post-processing the predicted distance maps in 3D space (perano block). The camera tensor (orange block) helps our OmniDet yield distance maps on multiple camera viewpoints and makes the network camera independent.

III-D Motion Segmentation

Motion segmentation is the segmentation of all dynamic objects in the scene that have moved in the world reference. We consider that an object has moved when the distance between its positions in frames t−1 and t exceeds a threshold, in our case 0.5 meters. Since the positions in the simulator are very precise, omitting the threshold would give a noisy motion segmentation: even tiny motions would be counted, and almost every object that is likely to move would end up labelled as moving. To construct the ground truth of this segmentation, we use the instance segmentation and the transformation matrices of each dynamic object in the scene, and compute the distance traveled by each object between the two frames. Since each object has a unique ID corresponding to a unique label in the instance segmentation, we can build the motion segmentation by including or excluding each instance depending on its traveled distance.
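The thresholding rule can be sketched in a few lines. The function name and the dict-based data layout are illustrative assumptions; only the 0.5 m threshold comes from the text.

```python
import numpy as np

MOTION_THRESHOLD_M = 0.5  # from the text: minimum displacement to count as "moving"

def moving_ids(positions_prev, positions_curr):
    """Return the IDs of objects that moved more than the threshold.

    positions_*: dict mapping object ID -> (x, y, z) world position
    at frames t-1 and t respectively.
    """
    moving = set()
    for obj_id, p_prev in positions_prev.items():
        p_curr = positions_curr.get(obj_id)
        if p_curr is None:
            continue  # object left the scene between frames
        if np.linalg.norm(np.subtract(p_curr, p_prev)) > MOTION_THRESHOLD_M:
            moving.add(obj_id)
    return moving

prev = {1: (0.0, 0.0, 0.0), 2: (10.0, 5.0, 0.0), 3: (3.0, 3.0, 0.0)}
curr = {1: (0.9, 0.0, 0.0), 2: (10.1, 5.0, 0.0), 3: (3.0, 3.2, 0.0)}
print(sorted(moving_ids(prev, curr)))  # only object 1 moved 0.9 m
```

The motion mask is then the union of the instance-segmentation masks whose IDs appear in the returned set.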

III-E Optical Flow

Next, we explain how we compute the optical flow analytically using the data extracted from the simulator. First, we calculate the scene flow, and then we project it onto the image plane of all representations (perspective and fisheye) to obtain the optical flow, as described in Fig. 6. As for instance segmentation, we compute the 3D point cloud of all objects in the scene, separating dynamic objects from static ones. Since we can extract the positions and rotations of all objects from the simulator, we can compute their transformation matrices in the 3D reference between two frames. We then calculate the scene flow by applying the motion transformation matrices to the point clouds of dynamic objects and the inverse of the camera motion transformation matrix to all 3D point clouds (dynamic and static objects). Next, we project each 3D point cloud, before and after being moved, into the images, which gives the 2D coordinates of each pixel in both frames. The motion vectors are then constructed from each pair of 2D coordinates, representing the optical flow. These vectors can be displayed using color coding, as shown in Fig. 1. Since we have all the calibration parameters, we provide optical flow for all modalities using this process.
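For a single pinhole view, the project-before-and-after computation above can be sketched as follows (the fisheye flow then follows through the cubemap lookup tables). Shapes, names, and the toy transforms are illustrative assumptions, not the released tooling.

```python
import numpy as np

def project(points, f, cx, cy):
    """Project (N, 3) camera-frame points to (N, 2) pixel coordinates."""
    z = points[:, 2]
    return np.stack([f * points[:, 0] / z + cx,
                     f * points[:, 1] / z + cy], axis=-1)

def optical_flow(points_t0, T_obj, T_cam_inv, f, cx, cy):
    """Flow = pixel(t1) - pixel(t0), after applying the object motion
    T_obj and the inverse camera motion T_cam_inv (both 4x4 homogeneous
    matrices) to the frame-t0 point cloud."""
    homog = np.concatenate([points_t0, np.ones((len(points_t0), 1))], axis=1)
    points_t1 = (T_cam_inv @ T_obj @ homog.T).T[:, :3]
    return project(points_t1, f, cx, cy) - project(points_t0, f, cx, cy)

pts = np.array([[1.0, 0.0, 5.0], [0.0, 1.0, 10.0]])
T_obj = np.eye(4)
T_obj[0, 3] = 0.5  # object translates 0.5 m along camera x between frames
flow = optical_flow(pts, T_obj, np.eye(4), f=500.0, cx=640.0, cy=480.0)
print(flow)  # x-flow of f*0.5/z: 50 px at 5 m depth, 25 px at 10 m
```

Static points use the identity for `T_obj`, so their flow is induced purely by the camera motion term, exactly as described above.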

III-F Event Camera

CARLA Simulator provides event camera signals for perspective images in the form e = (x, y, t, pol), where e is the event triggered at pixel (x, y) at timestamp t with polarity pol. The polarity of an event is positive when the brightness increases and negative otherwise. We compute the fisheye event camera signals using the lookup tables that map between cubemap images and fisheye images. For each event that occurred at (x, y) in the cubemap representation, if that pixel is used to create the fisheye image, we compute the corresponding pixel coordinates in the fisheye image using the lookup tables and assign the event its timestamp t and polarity pol. Data structures analogous to those CARLA Simulator generates for the perspective representation are then created for the fisheye representation and stored as NumPy array files. Fig. 1 shows an example of the fisheye event signal as an RGB image, where blue represents positive polarity and red negative polarity.
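The per-event remapping can be sketched as below. Everything here is illustrative: a toy lookup table standing in for the real fisheye-to-cubemap correspondences, a trivial shift as the mapping, and assumed tuple layouts.

```python
import numpy as np  # events are ultimately stored as NumPy array files

# Toy LUT: fisheye pixel (yf, xf) -> source cubemap pixel. The real
# tables come from the ray-traced sphere/cube correspondence; here a
# one-pixel shift stands in for that mapping.
H_F, W_F = 4, 4
lut = {(yf, xf): (yf, xf + 1) for yf in range(H_F) for xf in range(W_F)}
# Inverted: cubemap pixel -> fisheye pixel (only pixels the fisheye uses).
inv_lut = {src: dst for dst, src in lut.items()}

def remap_events(events):
    """events: list of (x, y, t, pol) in cubemap coordinates.
    Events whose pixel is not used by the fisheye image are dropped;
    timestamp and polarity carry over unchanged."""
    out = []
    for x, y, t, pol in events:
        if (y, x) in inv_lut:
            yf, xf = inv_lut[(y, x)]
            out.append((xf, yf, t, pol))
    return out

events = [(2, 1, 0.01, +1), (9, 9, 0.02, -1)]  # second falls outside the LUT
print(remap_events(events))  # [(1, 1, 0.01, 1)]
```

Because events are sparse, this per-event gather is cheap compared with remapping full frames.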

III-G Bird’s Eye View

Behavior prediction and planning are generally performed in the top view (or bird’s-eye view) in a typical autonomous driving stack, as height information is less important and most of the information an autonomous vehicle needs can be conveniently represented in the top view. In contrast to SLAM, which requires a sequence of images from the same moving camera over time, the top view is based on images acquired at the same time by multiple cameras looking in different directions around the vehicle. As a result, it can extract more useful information from a single synchronized set of images than SLAM. Furthermore, even if the ego vehicle comes to a halt or moves slowly, top-view semantic segmentation continues to function, whereas SLAM performs poorly or fails. For the dynamic participants, we introduce the concept of instances, which makes it simple to use prior knowledge of dynamic objects to forecast behavior. Cars, for example, follow a specific motion model (such as a bicycle model) and have constrained patterns of future trajectory, whereas pedestrians move more randomly.

Datasets               WoodScape                SynWoodScape
Train/Test             R      S      R+S        R      S      S+R
Depth Est. (RMSE)      1.332  2.401  1.479      2.393  1.448  1.396
Semantic Seg. (mIoU)   76.6   71.7   76.2       72.1   78.2   77.8
Motion Seg. (mIoU)     75.3   69.5   74.5       70.7   76.8   75.1
Object Det. (mIoU)     68.4   61.2   67.7       61.9   69.2   68.5

TABLE II: Ablation study of OmniDet [kumar2021omnidet] on the WoodScape and SynWoodScape datasets. S indicates testing on the synthetic dataset with pre-training on the R (real) WoodScape dataset. R indicates testing on the real WoodScape dataset with pre-training on the S (synthetic) SynWoodScape. R+S indicates mixed training on real and synthetic data.
Model                    Accuracy (mIoU)
Image Model + IPM        61.2
Top View Semantic Seg.   76.5
Top View Instance Seg.   75.3
Top View Motion Seg.     82.1
TABLE III: Quantitative comparison of segmentation tasks: top-view models vs. the transformed (IPM) model.

IV Experiments

IV-A Real vs. Synthetic Baseline Performance

In TABLE II, we ablate the performance of OmniDet [kumar2021omnidet] on the WoodScape and SynWoodScape datasets. An important aspect of our ablation study is evaluating the need for domain transfer within our framework, establishing a baseline for the community, and testing the model’s generalization capabilities. Because of the gap between the synthetic and real data domains, the listed perception tasks do not yield quantitatively desirable results when applied directly to real-world data, necessitating a domain adaptation phase. Nevertheless, performance remains reasonably high under both domain transfer and hybrid training. Initially, we train on WoodScape and test on SynWoodScape to establish a baseline for domain transfer. Later, we mix both datasets and set up a quantitative baseline. Finally, we train on SynWoodScape and evaluate on WoodScape to measure the extent of the domain gap.

IV-B Top View Segmentation

We ablate our OmniDet [kumar2021omnidet] on the top-view dataset and establish a baseline performance in TABLE III. Initially, we train the model for the standard semantic segmentation task and transform its output via inverse perspective mapping (IPM) for the behavior and planning stage, as explained in Section III-G. In the semantic segmentation results, many existing approaches tend to merge multiple cars into one contiguous region. Hence, we release motion masks and instance segmentation annotations to identify dynamic objects and localize individual vehicles/instances in the top view, and report the baseline results in TABLE III.

V Conclusion

In this paper, we provide a synthetic surround-view fisheye camera dataset dedicated to autonomous driving, with ground truth annotations for 10+ tasks. In addition to synchronized fisheye data, we provide bird’s eye view data with annotations. We demonstrated the relevance of the generated synthetic data through baseline experiments for depth estimation, semantic segmentation, motion segmentation, and object detection, as well as experiments on the same tasks using the top view. Because our dataset uses the same configuration and calibration parameters as the WoodScape dataset, the SynWoodScape/WoodScape pair is of great interest for developing models dedicated to fisheye images, as well as for transfer learning between real and synthetic data and image-to-image translation algorithms.