Adaptable Deformable Convolutions for Semantic Segmentation of Fisheye Images in Autonomous Driving Systems

02/19/2021 · Clément Playout, et al. · Corporation de l'École Polytechnique de Montréal, Inria

Advanced Driver-Assistance Systems rely heavily on perception tasks such as semantic segmentation where images are captured from large field of view (FoV) cameras. State-of-the-art works have made considerable progress toward applying Convolutional Neural Network (CNN) to standard (rectilinear) images. However, the large FoV cameras used in autonomous vehicles produce fisheye images characterized by strong geometric distortion. This work demonstrates that a CNN trained on standard images can be readily adapted to fisheye images, which is crucial in real-world applications where time-consuming real-time data transformation must be avoided. Our adaptation protocol mainly relies on modifying the support of the convolutions by using their deformable equivalents on top of pre-existing layers. We prove that tuning an optimal support only requires a limited amount of labeled fisheye images, as a small number of training samples is sufficient to significantly improve an existing model's performance on wide-angle images. Furthermore, we show that finetuning the weights of the network is not necessary to achieve high performance once the deformable components are learned. Finally, we provide an in-depth analysis of the effect of the deformable convolutions, bringing elements of discussion on the behavior of CNN models.


1 Introduction

Convolutional Neural Networks (CNNs) have established state-of-the-art performance on numerous vision-based tasks (object detection, instance and semantic segmentation) and datasets, and have therefore become a standard for many applications. In particular, these models are major assets for Advanced Driver-Assistance Systems (ADAS) and autonomous vehicles, which require models that are highly efficient while allowing a good understanding of their operation. ADAS primarily rely on a precise identification of the environment surrounding the vehicle. To capture this environment, different types of sensors are used, among which fisheye cameras. These are built with specific lenses to achieve an extremely wide field of view (FoV), reaching angles far beyond those of regular lenses. As a comparison, the latter are considered wide-angle when covering angles from 64° to 84°, whereas the former can reach values up to 270° (but usually between 100° and 180°). However, such high FoVs come at the expense of the rectilinear property provided by regular lenses (straight features in the scene remain straight in the image). Fisheye lenses capture curvilinear images, often roughly described as deformed by barrel distortion (the magnification decreases with the distance from the optical axis, usually the center of the image). Due to their nature, fisheye images are usually harder to analyse with conventional recognition models, which must be tuned and/or retrained. In the context of deep learning, this usually comes down to changing the training data, but this raises the question of how many labelled fisheye images are needed and whether the model can still benefit from existing rectilinear datasets. In this work, we demonstrate the impact of using these datasets to improve fisheye segmentation by analysing different ways to adapt an existing "rectilinear" CNN model to fisheye images. We consider different possibilities, from training a model from scratch with distorted fisheye-like images to simply slightly modifying the support of the convolutions using trainable convolution offsets (known as deformable convolutions). As fisheye data is not a widely available resource in the Machine Learning community, we investigate the minimal number of samples needed for model adaptation. For the sake of simplicity, the focus of this paper is on semantic segmentation, but we hypothesize that most of the aforementioned techniques would generalize to object detection.

Contribution 1: We propose a novel and simple learning mechanism to adapt standard CNN models to fisheye images using the concept of deformable convolutions. To capture non-linear transformations, we demonstrate through this mechanism the possibility of learning the spatial support of convolution filters independently from their weights while maintaining almost similar performance. This insight could potentially motivate a rethinking of the data augmentation process in deep learning applications.

Contribution 2: We demonstrate that adapting a CNN model to the non-linear spatial distortion induced by ultra wide-angle geometry can be achieved with only a few training images. In this way, we can alleviate the lack of available training datasets of real fisheye images for different perception tasks.

Contribution 3: As advanced driver-assistance systems commonly make use of narrow as well as large FoV cameras to perform different autonomous tasks, we propose a flexible semantic segmentation model that can be deployed with both narrow and large FoVs.

2 Related work

From a computer vision researcher’s perspective, the field of fisheye imaging is relatively recent, in particular when considering tasks involving object recognition (detection or semantic segmentation). The initial studies on the subject were mainly focused on calibration (finding the intrinsic parameters of the camera) by building a geometrical or analytical model. [scaramuzzaFlexibleTechniqueAccurate2006a] proposed a parametric model following a polynomial form to describe the projection function; a fourth-order polynomial was shown to be an accurate model. This model is still regularly used, including to synthesize fisheye distortion in rectilinear images. Calibration is a potential first step toward image rectification, whose goal is to remove the distortion such that straight-line features appear as straight lines in the image. Recently, [yinFishEyeRecNetMultiContextCollaborative2018b] proposed to use a CNN to directly rectify fisheye images, by training the network to predict the distortion parameters. The authors demonstrated that using a semantic context as an input to the distortion parameter predictor significantly improves the system’s performance. This suggests that rectification could be used as a preprocessing step before object recognition with an existing conventional model. Nonetheless, there is not much work in the literature proposing to combine both processes. As pointed out by [yogamaniWoodScapeMultiTaskMultiCamera2019], this can be explained by the fact that undistortion causes major issues: a typical fisheye FoV cannot be fully mapped onto a rectilinear image, leading to a reduction of the FoV in the rectified image. Moreover, the re-warping operation leads to a non-uniform sampling across the resulting image, creating blurry areas. Therefore, current research is shifting toward model adaptation rather than undistorting images.

To our knowledge, [fuchengdengObjectDetectionPanoramic2017] are the first to use a standard CNN architecture for object detection in fisheye-like images. In their case, the images are treated as normal ones and used to finetune a network trained on rectilinear images. The adaptation therefore consists in changing the training data rather than the model itself. This approach is comparable to what [saezRealTimeSemanticSegmentation2019, yeUniversalSemanticSegmentation2020] proposed, which is to rely on synthetic fisheye generation to extend the training distribution. By sampling different distortion parameters, this type of approach replicates the effect of data augmentation. We argue that adapting a model pretrained on rectilinear images should not require a full retraining, and propose the idea that only a limited amount of new data is needed to implicitly learn the parameters of the image distortion and thereby adapt the model’s semantic prediction. This intuition is motivated by [lopezDeepSingleImage2019], who showed that it is possible to predict extrinsic and intrinsic camera distortion parameters from a single image. By extension, our work aims to determine the number of training samples needed to adapt a semantic segmentation model to interpret distorted images. For this task, using a single distorted image for finetuning would likely bias the internal statistics of the model and eventually lead to strong overfitting. Therefore, instead of modifying the weights and biases of the model itself, we propose to change the way convolutions are performed by using deformable convolutions, as originally introduced by [daiDeformableConvolutionalNetworks2017]. The deformable components should theoretically be able to capture the distortion parameters of the image, while the regular convolutions extract meaningful semantic features. This idea has also been explored by [dengRestrictedDeformableConvolutionBased2019], who transformed deformable convolutions into their restricted equivalent and tested them on fisheye images. Their work also analyzes the placement of the deformable layers within the network to achieve optimal performance. Our approach differs from theirs as we only adapt the deformable part of the convolutions, rather than training a complete model. We also study in depth the effect of the positioning of deformable convolutions within the existing network and the effect of batch normalization, and explore few-shot training approaches.

3 Methodology

Problem Statement:

Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $\{x_i\}$ is a set of rectilinear images, $\{y_i\}$ the set of their associated 2D semantic segmentation ground truths and $N$ the number of samples in the dataset, given a semantic segmentation model $M$ and its prediction $\hat{y}_i = M(x_i)$, and given a parametrized conversion function $F_\phi$ that associates a rectilinear image with its fisheye equivalent, this work aims to find the adaptation function $A$ that optimizes:

$$ A^{*} = \operatorname*{arg\,min}_{A} \sum_{i=1}^{N} \mathcal{L}\Big( A(M)\big(F_\phi(x_i)\big),\; F_\phi\big(M(x_i)\big) \Big) \qquad (1) $$

Note that the definition of $F_\phi$ is voluntarily vague, since it can represent a synthetic projection function just as well as a change in the actual camera lens. In the real world, $F_\phi$ is most likely unknown (and unused). However, we will assume that its form is known in the rest of the paper, as $F_\phi$ is needed in the absence of an existing labelled fisheye dataset. To summarize, we want the adapted model $A(M)$ to predict, from a wide-angle image $F_\phi(x_i)$, an output as close as possible to $F_\phi(M(x_i))$, which is the distortion of the prediction obtained from the rectilinear image with the original model.
The rest of this section describes each of the functions in Equation 1, i.e. the segmentation model $M$, the conversion function $F_\phi$ and finally the components of the adaptation function $A$.

3.1 Baseline for semantic segmentation

Semantic segmentation of rectilinear images has been thoroughly studied and refined to a point where many efficient CNN architectures are now easily deployable. The choice of a particular model is mainly based on a trade-off between performance requirements and the computational resources available. We favour the former criterion and choose the DeepLabV3+ model as our baseline network. DeepLabV3+ is an architecture introduced by [chenEncoderDecoderAtrousSeparable2018a], extending their previous work on large-scale networks using atrous convolutions. The architecture is composed of three main components. It starts with an encoder network that reduces the spatial resolution of the input while increasing its depth; the encoder is used to extract features from the input image. The lowest-level features are fed to the second component, the Atrous Spatial Pyramid Pooling (ASPP). It consists of several parallel convolution layers using different dilation rates, working overall as a multi-scale convolutional layer. The last component of the architecture is a decoder module that expands the features from the encoder and the ASPP back up to the input dimensions. Inspired by the skip connections proposed by [ronnebergerUNetConvolutionalNetworks2015], the decoder concatenates low- and mid-level features from the encoder and outputs a segmentation map. We experimented with two different models as a backbone for the encoder, the resnet101 proposed by [xieAggregatedResidualTransformations2017] and the Aligned-Xception introduced by [cholletXceptionDeepLearning2017]. We did not observe any improvements with the latter and thus kept the former. The resnet101 was pretrained on a subset of the COCO train2017 dataset provided with the Torchvision library.

3.2 Synthesizing fisheye images

The lack of existing ultra wide-angle datasets has motivated many researchers to generate approximations from rectilinear images. One of the two following approaches is usually chosen:

  • Simulating a fisheye-like distortion on rectilinear images.

  • Rendering images from a 3D scene using a virtual fisheye camera.

The first option, while easy to set up, suffers from the drawbacks related to grid sampling mentioned in the previous section. The second option is significantly more time-consuming and requires knowledge of 3D graphics tools. On the other hand, it has the advantage of generating more realistic images in terms of their distortion and their FoV. Nonetheless, 3D-rendered images are far from being as detailed as real-world images. For the sake of completeness and reproducibility, we have experimented with both approaches.

Simulation

To simulate fisheye distortion on rectilinear images, we rely on the same model as the one used in the open-source library OpenCV (https://docs.opencv.org/master/db/d58/group__calib3d__fisheye.html). Noting $(x, y)$ a pair of normalized coordinates in the rectilinear image, the distortion function maps them to fisheye image coordinates $(u, v)$ using the following equations:

$$
\begin{aligned}
r &= \sqrt{x^2 + y^2}, \qquad \theta = \arctan(r) \\
\theta_d &= \theta \, (1 + k_1 \theta^2 + k_2 \theta^4 + k_3 \theta^6 + k_4 \theta^8) \\
x' &= \tfrac{\theta_d}{r}\, x, \qquad y' = \tfrac{\theta_d}{r}\, y \\
u &= f \, x' + c_x, \qquad v = f \, y' + c_y
\end{aligned}
\qquad (2)
$$

The parameters $(k_1, \dots, k_4, f, c_x, c_y)$ are tunable; in particular, $c_x$ and $c_y$ can be adjusted to change the distortion center. We limit ourselves experimentally to variations of $f$, which corresponds to a scale factor (as an approximation of a varying focal length), as depicted in Figure 1. Using this set of equations, we are able to apply the distortion to real images from the Cityscape dataset freely provided by [cordtsCityscapesDatasetSemantic2016].

(a) Original
(b) f=125
(c) f=75
Figure 1: Example of distortion using a parametric polynomial to synthesize fisheye-like images.
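As an illustration of Equation 2, the following sketch warps a rectilinear image into a fisheye-like view. For simplicity it keeps $k_1 \dots k_4 = 0$ (pure equidistant projection) so that the inverse mapping required by cv2.remap stays analytic ($r = \tan\theta_d$), and it reuses $f$ as the focal length of the source rectilinear image; both are simplifying assumptions rather than the paper's exact settings.

```python
import cv2
import numpy as np

def synth_fisheye(img: np.ndarray, f: float,
                  interp: int = cv2.INTER_LINEAR) -> np.ndarray:
    """Fisheye-like warp of a rectilinear image (Eq. 2 with k1..k4 = 0).
    For every destination (fisheye) pixel we compute the rectilinear pixel
    it samples from, then let cv2.remap do the interpolation."""
    h, w = img.shape[:2]
    cx, cy = w / 2.0, h / 2.0
    u, v = np.meshgrid(np.arange(w, dtype=np.float32),
                       np.arange(h, dtype=np.float32))
    xd, yd = (u - cx) / f, (v - cy) / f               # normalized fisheye coords
    theta_d = np.sqrt(xd ** 2 + yd ** 2)              # with k=0, theta_d = theta
    r = np.tan(np.minimum(theta_d, np.pi / 2 - 1e-3))  # rectilinear radius
    scale = np.where(theta_d > 1e-8, r / np.maximum(theta_d, 1e-8), 1.0)
    map_x = (xd * scale * f + cx).astype(np.float32)  # source pixel coordinates
    map_y = (yd * scale * f + cy).astype(np.float32)
    return cv2.remap(img, map_x, map_y, interpolation=interp,
                     borderMode=cv2.BORDER_CONSTANT, borderValue=0)

# In this sketch a smaller f produces a stronger distortion; label maps must be
# warped with nearest-neighbour sampling to keep class ids intact:
# fisheye_img = synth_fisheye(rect_img, f=125)
# fisheye_gt  = synth_fisheye(rect_gt, f=125, interp=cv2.INTER_NEAREST)
```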

Generation Our generated dataset is based on an extension of the 3D urban scene provided by [zichaozhangBenefitLargeFieldofview2016]. Instead of the original depth maps provided, we configure the render engine to output semantic maps and rendered images. Moreover, we enrich the scene by adding different 3D assets obtained from a free asset provider (https://www.blendswap.com/; all downloaded assets are free to use for non-commercial purposes). We thereby added the following objects to the existing scene: cars (6 different models), pedestrians (3 models), bikes (2 models), cyclists, a bus and a bus station. The limited variability of these scene elements might limit the usefulness of this dataset for real-world applications, but it can still provide important insights into fisheye segmentation. In total, the generated semantic maps are composed of 11 different classes. Images are rendered at a fixed resolution. Two renders are done, the first one with a rectilinear camera (FoV=80°) and the second one with a fisheye camera (FoV=180°). As the same scene, with the same contents, is represented in both, our comparative study focuses only on the effect of the fisheye distortion, all else being equal. Figure 2 shows the type of results we obtain. We refer to this dataset as BlenDataset.

(a) Rectilinear, FoV=80°
(b) Fisheye, FoV=180°
Figure 2: 3D renders from the same spatial camera position but two different lenses.

3.3 Deformable convolutions

Deformable convolutions (DCN) were introduced by [daiDeformableConvolutionalNetworks2017] to extend the regular grid sampling locations used in convolutions with 2D offsets. A DCN learns the offsets at every location from the feature maps of a previous layer, leading to a dense representation that captures the free-form deformation of the spatial kernel applied in the current convolution layer. An offset simply refers to the displacement vector $\Delta p_n$ of a point taken on the spatial grid of the kernel. Given a location $p_0$ where a standard convolution kernel applies, the output feature value of the deformable convolution layer at $p_0$ becomes:

$$ y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \qquad (3) $$

where $\mathcal{R}$ is the grid support of the convolution, $w$ the kernel weights and $x$ the input feature map. Because the offset values are fractional, [daiDeformableConvolutionalNetworks2017] used bilinear interpolation to apply Equation 3. Originally, the weights of the kernel were learned simultaneously with the offsets, and the DCN was applied to rectilinear images to enhance convolutions in CNNs. In this work, we leverage the DCN’s capability to take into account the non-linear transformations brought by fisheye geometric distortion (as shown in Figure 1) and suggest an adaptable approach for fisheye image recognition tasks. To do so, we propose to transfer the weights of a base model (trained on rectilinear images) to the task of fisheye segmentation by converting regular convolutions to DCNs and training only the offset layers that precede the main convolution layer (as shown in Figure 3). We call the resulting operation an adaptable deformable convolution, as it adapts an existing convolution to the extrinsic deformations of the grid while preserving the intrinsic properties of the objects. Following Equation 1, the objective is to find the optimal adaptation parameters of $A$ that minimize the error between the predicted output and the ground truth. The offsets are implicitly learned by backpropagating this error while the parameters of the base convolution layers remain fixed.

Figure 3: Deformable convolutions (DCN) insert themselves within a regular convolution layer. Our adaptation process only requires training the prediction of the offsets (in light orange).
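The adaptation step can be sketched in code. The following PyTorch module is a hypothetical illustration (not the authors' implementation, which relies on the DCN layers of the MMDetection toolbox): it wraps a pre-trained square-kernel convolution in a torchvision DeformConv2d, copies and freezes its weights, and leaves only the offset-prediction layer trainable, initialized to zero so that the adapted layer starts out identical to the original one.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AdaptableDeformConv(nn.Module):
    """Hypothetical wrapper: replaces a pre-trained (square-kernel) Conv2d by a
    deformable convolution whose weights are copied and frozen; only the
    offset-prediction layer remains trainable."""
    def __init__(self, pretrained_conv: nn.Conv2d):
        super().__init__()
        k = pretrained_conv.kernel_size[0]
        # Offset branch: predicts 2 offsets (dx, dy) per kernel location.
        self.offset_conv = nn.Conv2d(
            pretrained_conv.in_channels, 2 * k * k, kernel_size=k,
            stride=pretrained_conv.stride, padding=pretrained_conv.padding,
            dilation=pretrained_conv.dilation)
        # Zero-init so the adapted layer initially behaves like the original.
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(
            pretrained_conv.in_channels, pretrained_conv.out_channels,
            kernel_size=k, stride=pretrained_conv.stride,
            padding=pretrained_conv.padding, dilation=pretrained_conv.dilation,
            groups=pretrained_conv.groups, bias=pretrained_conv.bias is not None)
        # Transfer the pre-trained weights and freeze them.
        with torch.no_grad():
            self.deform_conv.weight.copy_(pretrained_conv.weight)
            if pretrained_conv.bias is not None:
                self.deform_conv.bias.copy_(pretrained_conv.bias)
        for p in self.deform_conv.parameters():
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.deform_conv(x, self.offset_conv(x))
```

Replacing the convolutions of a pre-trained model with such wrappers and training only the offset branches (plus, optionally, the batch-normalization statistics) corresponds to the adaptation procedure evaluated in the next section.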

4 Experiments

(a) rect-DL3
(b) fish-DL3
(c) adpt-DL3
Figure 4: Comparison of the predictions obtained from the same image with three different models. For the sake of readability, only 5 classes are shown. The second and third images are distorted. rect-DL3 has been trained on rectilinear images, fish-DL3 has been trained from scratch on distorted images and adpt-DL3 is the adapted version of rect-DL3 (where convolutions are replaced by DCN but the weights are kept equal).

Two datasets were used independently for our experiments, meaning that we trained different models using the same protocol for each dataset.

BlenDataset: This dataset is composed of 4000 pairs of rectilinear and fisheye synthetic images, as well as their corresponding groundtruth semantic segmentations. The dataset corresponds to a single video sequence taken in the 3D virtual scene. The first 3000 images were used for training purposes and the remaining 1000 images for testing. This was done in order to avoid having consecutive (and therefore highly correlated) frames in the two distinct sets. The training set was furthermore randomly split into two subsets (training and validation), following a 0.8/0.2 ratio.

Cityscape: The Cityscape dataset comprises 5000 images divided into training, validation and test sets (2975, 500 and 1525 images respectively). Groundtruth maps are not publicly available for the test set, therefore we only used the first two sets. The validation set was used as the test set and the original training set was split in two (0.9/0.1 ratio) for training and validation purposes.
For all experiments, the images were kept at their original resolution, and random patches of fixed size were extracted from them.

Rectilinear segmentation training: We trained the DeepLabV3+ architecture on 4 GPUs, using synchronized batch normalization as proposed by [pengMegDetLargeMiniBatch2018], mixed precision, and learning rates of 0.005 for the encoder and 0.05 for the decoder. Both learning rates were updated during training using the “poly” schedule policy introduced by [liuParseNetLookingWider2015]. Weights were updated using the Adam solver. As a loss function, the weighted cross-entropy was used to alleviate the issue of class imbalance. We also observed that data augmentation (random scaling, rotation and horizontal flipping) helped to improve the model’s performance. Training ran for 100 epochs with a batch size of 8. We refer to this model as rect-DL3 (for “rectilinear DeepLabV3+”).
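For reference, a minimal sketch of the “poly” policy follows; the decay power of 0.9 is a commonly used value assumed here rather than one stated in this paper, and the iteration counts are illustrative.

```python
def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 0.9) -> float:
    """'Poly' schedule: the learning rate decays polynomially from base_lr to 0."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# e.g. encoder/decoder rates after 10k of 100k iterations (illustrative numbers)
encoder_lr = poly_lr(0.005, 10_000, 100_000)
decoder_lr = poly_lr(0.05, 10_000, 100_000)
```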

Adaptive training: In order to adapt the model, we added deformable convolutions on top of the trained rectilinear model, using the efficient implementation from the MMDetection toolbox provided by [mmdetection]. Following the principle of an ablation study, we tested different configurations in order to understand the mechanisms underlying the different network components and demonstrate their respective effects. For each configuration, the adaptation was trained for 25 epochs on fisheye images (generated or simulated). To limit the scope of this paper, we used a fixed distortion level, parameterized by $f$. We found that the prediction of DCN offsets was unstable when using high learning rates; consequently, we reduced the rates to 0.01 for the decoder and 0.001 for the encoder. All other training parameters were kept identical to those of the rectilinear training procedure described in the previous paragraph.

Evaluation procedure As an evaluation metric, we adopted the mean Intersection-over-Union (mIoU) proposed by [cordtsCityscapesDatasetSemantic2016]. The IoU is computed per class over the whole test set and then averaged across classes. For both datasets, a “void” class, corresponding to unsegmented objects and/or borders of the image, was included, and the corresponding pixels were discarded from the metric computations.
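A minimal sketch of this metric (not the authors' evaluation code): the per-class IoU is computed from a confusion matrix accumulated over the whole test set, with void pixels removed beforehand and the void class excluded from the average.

```python
import numpy as np

def confusion(gt: np.ndarray, pred: np.ndarray, num_classes: int,
              void_index: int) -> np.ndarray:
    """Accumulate a (num_classes x num_classes) confusion matrix,
    discarding pixels labelled as void in the ground truth."""
    valid = gt != void_index
    idx = num_classes * gt[valid].astype(np.int64) + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf: np.ndarray, void_index: int) -> float:
    """Average the per-class IoU over all classes except the void class."""
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1.0)
    keep = np.arange(conf.shape[0]) != void_index
    return float(iou[keep].mean())
```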

4.1 Adaptation experiments

Why adapt a model? The need for model adaptation stems from the difficulty regular rectilinear models have in segmenting wide-angle images. This can be illustrated quite simply by directly testing rect-DL3 (without adaptation) on different fisheye distortions. The performances are reported in Table 1.

BlenDataset
            Rectilinear   Fisheye
rect-DL3    0.619         0.517

Cityscape
            Rectilinear   Fisheye (f decreasing)
rect-DL3    0.747         0.448    0.420    0.232

Table 1: Performances (mIoU) obtained with the two rectilinear models rect-DL3 (one per dataset) tested under different configurations. On Cityscape, the fisheye effect is simulated and a lower value of f indicates a stronger distortion.

As expected, they quickly deteriorate in comparison to the model’s performance on rectilinear images. The performance degradation is correlated with the strength of the distortion.

An Upper Limit to the Adaptation’s Efficiency As the adaptation phase is only aimed at tackling the distortion problem, we can hypothesize an upper limit on the adapted model’s performance. Given a metric function $\mu$ (for example the mIoU), a rectilinear ground-truth map $y$, a prediction $M(x)$ from a rectilinear input $x$ with rect-DL3 and a distortion function $F_\phi$, the best performance we can expect from the adapted model on the corresponding distorted input is:

$$ m_{max} = \mu\big(F_\phi(M(x)),\; F_\phi(y)\big) \qquad (4) $$

In other words, for a distorted input, the best possible prediction corresponds to the distortion of the prediction obtained from the equivalent non-distorted input. This upper limit is denoted $m_{max}$ and was computed with rect-DL3 at the distortion level used in the following experiments. As expected, none of our adapted models exceeded this limit, but the adaptation process proved its efficiency by approaching the upper limit very closely (as shown in Table 4 below).
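A sketch of how this bound can be estimated, reusing the hypothetical synth_fisheye, confusion and mean_iou helpers from the earlier sketches:

```python
import cv2

def upper_limit(pred_rect, gt_rect, f: float, num_classes: int, void_index: int) -> float:
    """Eq. (4): warp both the rectilinear prediction and its ground truth with
    the same fisheye mapping, then score them. Label maps (uint8 class ids)
    must be warped with nearest-neighbour interpolation."""
    pred_fish = synth_fisheye(pred_rect, f, interp=cv2.INTER_NEAREST)
    gt_fish = synth_fisheye(gt_rect, f, interp=cv2.INTER_NEAREST)
    return mean_iou(confusion(gt_fish, pred_fish, num_classes, void_index), void_index)
```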

Adapting with Batch Normalization Following the observations of [liRevisitingBatchNormalization2016], we noticed that finetuning the Batch Normalization (BN) layers during the adaptation phase helps reach better performance than learning the offset predictions alone. Hence, we investigated the effect of each component of the adaptation process; the results of this comparison are reported in Table 2.

Cityscape
rect-DL3    +BN      +DCN     +DCN+BN    upper limit (m_max)
0.420       0.527    0.531    0.643      0.676

Table 2: Comparison of the adaptation performances (mIoU) obtained by finetuning batch normalization (BN), training deformable offsets (DCN), and both.

This experiment demonstrates that DCNs and BN are complementary options for adapting a given model. Nonetheless, it must be noted that tuning the BN on fisheye images might degrade the performance of the rect-DL3 model on rectilinear images, whereas the DCNs’ offsets can very easily be turned off, restoring the adapted model to its original state. In autonomous vehicles that rely on both wide-angle and regular lenses, this ensures a very simple way to use the same model for both imaging modalities.
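Since the base weights are untouched, switching back to rectilinear behaviour at run time only requires bypassing the offset branch. With the hypothetical AdaptableDeformConv sketched in Section 3.3, this amounts to feeding all-zero offsets:

```python
import torch

def run_layer(layer, x: torch.Tensor, fisheye: bool = True) -> torch.Tensor:
    """Apply an AdaptableDeformConv with or without its learned offsets.
    All-zero offsets make the deformable convolution sample the regular grid,
    i.e. it reproduces the original pre-trained convolution exactly."""
    offset = layer.offset_conv(x)
    if not fisheye:
        offset = torch.zeros_like(offset)
    return layer.deform_conv(x, offset)
```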

What to Adapt? Similarly to [daiDeformableConvolutionalNetworks2017] and [dengRestrictedDeformableConvolution2019], we studied which parts of the original model needed to be adapted (i.e. adding DCNs and finetuning the batch normalization) to obtain optimal performance. Rather than a layer-by-layer approach, we restricted the experiments to the two main blocks of the rect-DL3 model, the encoder and the decoder. Results are reported in Table 3.

Cityscape
decoder only    encoder only    encoder+decoder
0.414           0.639           0.643

Table 3: Comparison of the effects of adapting different components of the model. For each case, the adaptation consists in adding deformable convolutions and tuning the batch normalization.

As we can see, adapting the encoder and the decoder jointly offers very little improvement over adapting only the encoder. This opens up interesting leads for future experiments, as a given encoder can be finetuned for different tasks (for example, ours could be tuned for object detection). In light of this, we believe that our adaptation method is likely to be suited for tasks other than semantic segmentation, meaning that the weights of the offset predictions could be reused with no further tuning in an encoder trained for a different task such as object detection. The verification of this hypothesis is left for future work.

Few-shot adaptive training One of the central ideas of this work is that learning to adapt a model from distorted samples is significantly easier than learning semantic segmentation on them from scratch. We explored this notion by turning our adaptation protocol into a few-shot learning problem. Subsets of different sizes (1, 50, 100, 1000 images) were sampled from the initial training set, and we evaluated the performance of the adapted model on the same test set. Training and testing were done on Cityscape. Results of this experiment are shown in Figure 5. The graph shows that training on a single image brings no improvement over the non-adapted model (the mIoU drops from 0.420 to 0.390). Nonetheless, even a relatively small number of samples (n=50 in Figure 5) brings the model’s performance relatively close to the level reached when using the full training set (composed of 2675 images).

Figure 5: Performances (mIoU) with respect to the number of training samples.

Adaptation versus retraining Given the variability of existing fisheye simulation procedures as well as the heavy computational requirements of retraining different semantic models, we do not compare our adapted models directly with the state of the art. Nonetheless, we evaluate how they compare with a DeepLabV3+ trained on fisheye images with variable $f$ as a form of data augmentation, as suggested by [saezRealTimeSemanticSegmentation2019, yeUniversalSemanticSegmentation2020]. This model is referred to as fish-DL3, to be distinguished from the adapted model, referred to as adpt-DL3. Results are shown in Table 4. In addition, a qualitative comparison between the different models is shown in Figure 4.

BlenDataset, fisheye
rect-DL3    fish-DL3    adpt-DL3
0.517       0.588       0.615

Cityscape
rect-DL3    fish-DL3    adpt-DL3    upper limit (m_max)
0.420       0.514       0.643       0.676

Table 4: Comparison of performance (mIoU) between the adapted model and the retrained model.

Discussion The main experiments in this study demonstrate the effectiveness of the proposed adaptable deformable convolutions for semantic segmentation of fisheye images. By learning only the weights of the offset layers, the DCN-based model was able to adapt faster to non-linear spatial distortion and capture more accurate feature representations than finetuning or retraining a standard CNN. The proposed approach should remain valid for any choice of $f$, but experiments on dynamic values of $f$ should be conducted to explore the model’s capacity to generalize to the different types of fisheye cameras used in ADAS and autonomous vehicles. The offsets are learned by backpropagating the cross-entropy loss. An unanswered question emerges: can we learn the offsets in a self-supervised way, thereby removing the need for explicit data annotation? Given that real-world fisheye images are often difficult to annotate because of their distortion, self-supervised learning could save significant time and cost in adapting CNNs to vision-based tasks on ultra wide-angle images. This work contributes to that goal by providing the concept of an upper performance limit on model adaptation to fisheye images; we leave this research direction to future work. We see two limitations of our current work: (1) the lack of direct comparisons with related work and (2) the lack of validation on real fisheye images. The rationale behind these limitations is data availability. To the best of our knowledge, the few existing methods use fisheye simulations with different experimental setups and train on larger datasets to reach high performance. In the absence of a unified dataset, or at least a standardized distortion approach, fair comparisons are hard to make. To facilitate such comparisons in the future, we plan to release our code and trained models soon.

5 Conclusion and Future Work

Deformable convolutions have been shown to be a significant improvement over regular convolutions in many tasks. This work focuses on proving that they can be used effectively on top of an existing CNN without modifying its pre-trained weights. This opens up interesting applications for systems relying on multiple imaging modalities, as a single model can be reliably adapted to the different tasks by means of marginal modifications rather than full retraining. Moreover, we demonstrate that training the deformable components can be done independently from the rest of the model (even if finetuning the batch normalization is advised) and that it does not require a large number of samples, alleviating the need to build large datasets of labeled fisheye images. These observations open different avenues worth exploring in future work. In particular, since autonomous vehicles require real-time object detection, we plan to investigate whether our adaptation protocol can be applied to existing object detection models.

Acknowledgment

This work was supported by NSERC (Natural Sciences and Engineering Research Council of Canada). The authors gratefully acknowledge Philippe Debanné for revising this manuscript.

References